[bksvol-discuss] Re: Removing extra paragraph marks

  • From: "Judy s." <cherryjam@xxxxxxxxxxxxxxxx>
  • To: bksvol-discuss@xxxxxxxxxxxxx
  • Date: Fri, 12 Oct 2012 22:12:04 -0500

Hi Netta,

Yes, do a show all first. That will show you all the paragraph marks. The extra paragraph marks come from an error that sometimes happens when the OCR software converts a page of printed text: it decides that each printed line on the page is a paragraph and puts a paragraph mark at the end of each line.

I'm not sure I can explain what it's done, but let me try. Then I'll give you some ideas on how to figure out which paragraph marks are wrong ones.

Imagine it this way: suppose you had a super long cooked spaghetti noodle you had to put on a plate. If you took that noodle and laid it down on the plate from left to right, it would fall off the right hand side of the plate. So you take the noodle when you reach the right hand side of the plate, and double it back across the plate, going now from right to left with your noodle until you get back to the left hand side of the plate. It's still one noodle. You might have to do this sequence of back and forth several times if it's a super long noodle to make sure it's all on the plate.

Now, if you took a knife and cut through all the back-and-forth strands of that noodle at the right hand side of the plate, you suddenly have several noodles, instead of one.

It's a little bit like that with a printed page and how text is formatted onto it by a publisher. If we had really wide paper, each sentence would take up only one line. Since reading paper that wide isn't easy, a publisher has to have a sentence take up many lines on a narrower page. Several sentences can then make up one paragraph. The publisher wraps the sentence the same way on the page to make it fit, but instead of going from left to right then right to left like we could with my noodle, the publisher prints as much of the sentence as will fit on the first line going from left to right, then drops down to the next line and prints the next part of the sentence from left to right, and keeps doing that until the entire paragraph is printed on the page. Then the publisher drops down to the next line and uses a visual cue: an indentation that puts some blank space at the start of the first line of the next paragraph. The sighted reader sees the indentation and says "Oh, ok, I'm on a new paragraph."

Sometimes, though, the OCR software doesn't recognize that visual cue for what it is. Instead, like cutting the noodle at the right hand side of the plate and chopping it into several noodles, the OCR software decides that each printed line in the book must be a separate paragraph. It cuts the sentences that are supposed to flow over several lines into one-line pieces, and incorrectly puts a paragraph mark at the end of each printed line.

What you are looking for is paragraph marks that are incorrectly chopping sentences apart in this fashion. Two of the best ways I've found to find those kinds of incorrect paragraph marks on a page are:

1. Check whether a paragraph mark is followed by a word that begins with a lower case letter. In an ordinary printed book, sentences never begin with a lower case letter, so that's a good clue that the paragraph mark is incorrect.

2. Check each paragraph mark, and see if the character right before it is a period, an exclamation mark, a question mark or a double quote. If it isn't one of those, then it's highly likely that the paragraph mark shouldn't be there.
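
For anyone who would rather have a script do a first pass, here is a rough sketch of those two checks in Python. It assumes the page has already been split into a list of strings, one per paragraph mark. The function names are made up for illustration, and counting a curly closing quote as a sentence ending is my own assumption. It only flags likely problems, so headings and lines of poetry would still need a human look.

```python
# Characters that normally come right before a correct paragraph mark (check 2).
# The curly closing quote is an added assumption for typeset books.
SENTENCE_ENDINGS = (".", "!", "?", '"', "\u201d")


def suspect_paragraph_marks(paragraphs):
    """Return the indexes of paragraphs whose trailing mark looks incorrect."""
    suspects = []
    for i in range(len(paragraphs) - 1):
        current = paragraphs[i].rstrip()
        following = paragraphs[i + 1].lstrip()
        if not current or not following:
            continue  # empty paragraphs are a different problem; skip them
        # Check 1: the next paragraph starts with a lower case letter.
        starts_lower = following[0].islower()
        # Check 2: this paragraph does not end with sentence-ending punctuation.
        ends_open = not current.endswith(SENTENCE_ENDINGS)
        if starts_lower or ends_open:
            suspects.append(i)
    return suspects


def merge_suspects(paragraphs):
    """Join each flagged paragraph to the one after it, with a single space."""
    suspects = set(suspect_paragraph_marks(paragraphs))
    merged = []
    carry = None  # text waiting to be joined to the next paragraph
    for i, paragraph in enumerate(paragraphs):
        text = paragraph.strip()
        if carry is not None:
            text = carry + " " + text
            carry = None
        if i in suspects:
            carry = text
        else:
            merged.append(text)
    return merged


page = [
    "The dog ran across the",
    "yard and barked.",
    "She handed the book to",
    "Madeleine and smiled.",
]
print(suspect_paragraph_marks(page))
print(merge_suspects(page))
```

In the sample page, the first break is caught by check 1 (the next line starts with a lower case "yard") and the third is caught by check 2 (the line ends with "to" rather than punctuation), so the four lines come back as two paragraphs.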

I hope this isn't more confusing but is helpful! Smile.

Judy s.

On 10/12/2012 9:05 PM, Dornetta wrote:
So, doing a show all will show me the paragraph marks, right?
Now that I have done the show all, just go through the page one line at a
time, going to the end of the sentence and seeing if extra paragraph marks
are there, if they are just delete one of them? Is that correct?
If so, that is what I am doing now; my problem is this, with the show all
selected, I now have "new lines" and some paragraph marks. Madeleine told me
that this problem only exists on one page and not through the entire
document, so I'm thanking my lucky stars for that but I just want this to be
correct is all.
Netta
"Just because you are blind does not mean you lack vision"-Stevie Wonder

  To unsubscribe from this list send a blank Email to
bksvol-discuss-request@xxxxxxxxxxxxx
put the word 'unsubscribe' by itself in the subject line.  To get a list of 
available commands, put the word 'help' by itself in the subject line.


