Gerald,

I know this must have taken a long time to put together. My thanks to you, Grace Jake, et al, for the work you do on behalf of the volunteer community.

Paula

----- Original Message -----
From: Gerald Hovas
To: bksvol-discuss@xxxxxxxxxxxxx
Sent: Thursday, April 13, 2006 11:51 AM
Subject: [bksvol-discuss] Fixing Occasional Hard Returns In The Middle Of Paragraphs

Here are the tips I sent to Grace for her to review. Since she went ahead and posted them, and since they're related to a couple of posts in the last day, I thought I'd send them to the list.

Gerald

Removal Of Hard Returns At The End Of Every Line

Some scans contain Hard Returns (ASCII 13) at the end of every line. While this causes the text in the scan to appear exactly as it appears in the book, it results in poorer quality books because this character specifies a new paragraph as well as a new line. Each line in the file becomes its own paragraph which prevents members from being able to skim the book by paragraph since skipping ahead to the beginning of the next paragraph takes you to the beginning of the next line, even if it isn't the beginning of the next paragraph in the text.

The fix isn't simple and takes a little time, but it's quicker than rescanning the book and makes a difference to members who like to skim through their favorite books from time to time. Enough of a difference that someone may decide to rescan the book later to fix the problem.

The following is a procedure to fix the problem using Word. Keep in mind that each scan can have its own set of problems and that the steps may need to be adapted before the procedure can be used on some scans. Note that the procedure expects a blank line after headers and before footers.

First you need to verify if this procedure is needed. This can be done using the Ctrl-Up and Ctrl-Down keys. The cursor will stop at the beginning of paragraphs. If the cursor stops at the beginning of each line, then the scan has this problem.
An alternative method for verifying the problem is to toggle invisible characters on with Ctrl-* (Ctrl-Shift-8) and look for Paragraph Markers. If a Paragraph Marker appears at the end of each line, then the scan has this problem. Pressing Ctrl-* again will toggle invisible characters off.

If the problem is intermittent instead of at the end of every line, then see the tip for fixing occasional Hard Returns in the middle of paragraphs.

If the problem is indeed at the end of every line and the scan has blank lines between paragraphs, then the following procedure should work. Notes on adapting the procedure to scans which do not contain blank lines between paragraphs and notes for adapting the procedure to K-1000 will be included afterwards.

1. First make a backup copy of the file before starting. That way it's easy to fall back to a known position and start over if something goes wrong.

2. Move to the first page of the prologue, or the first page of chapter one if you do not wish to make changes to the book's frontmatter, and select the remaining text with Ctrl-Shift-End. Be careful not to move around in the file between steps since this will cause the text to no longer be selected.

3. Replace ^w^p with ^p. This will remove any whitespace at the end of lines, making them consistent. Making the ends of lines consistent is necessary for the following global find and replaces to fix every line.

4. You may also wish to remove whitespace at the beginning of lines if the scan contains blank lines between paragraphs, but this isn't necessary for the remaining steps to work and will cause problems when adapting the procedure to scans which do not contain blank lines between paragraphs. If you wish to do so, though, replace ^p^w with ^p.

5. Replace ^p^p^p with ^p^p. This will remove multiple blank lines from the document and will simplify the procedure. Note that this find and replace will need to be repeated until Word makes no replacements in order to remove all of the multiple blank lines in the book.

6. Replace ^p with ^l. This will convert all Paragraph Markers to Manual Line Breaks. Manual Line Breaks are Soft Returns and only specify a new line, not a new line and paragraph.

7. Replace ^l^l with ^p^p. This will change the two consecutive Manual Line Breaks at the end of paragraphs back to Paragraph Markers. This step is the one which requires blank lines between paragraphs and is why blank lines must follow headers and precede footers.

8. Replace -^l with -. This will prevent inserting a space after hyphens in the next step.

9. Replace ^l with a space. This will remove the Soft Return at the end of every line without running two words together. Now the problem should be fixed.

To adapt the procedure to scans without blank lines between paragraphs, replace step 7 with the following step, and remember to leave out step 4.

7. Replace ^l^w with ^p^p or ^p followed by your preferred number of spaces for indenting a paragraph.

To adapt the procedure to K-1000:

Use \n in place of ^p.
Use a Space in place of ^w.
Use a special symbol like ~ which doesn't appear anywhere in the book or a string like [Newline] in place of ^l.

Note that you will need to repeat the replacement of space\n until no replacements are made in order for lines to be consistent. It's also possible, though not probable, that you will need to remove tabs at the end of lines as well as spaces. The ^w in Word matches strings containing any combination of spaces and tabs, so it isn't necessary to take this into consideration when using Word. Replacing each Tab with a Space before removing spaces at the end of lines would avoid this issue in K-1000 and simplify the alternate step 7.
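For anyone working with a plain-text copy of a scan rather than in Word or K-1000, the same idea can be sketched in a few lines of Python. This is only an illustration of the steps above, not part of the original tip: it assumes the scan has blank lines between paragraphs, and the function name unwrap_hard_returns is made up for the example. Step 4 is skipped, and as with the Word procedure, work on a backup copy.

```python
import re


def unwrap_hard_returns(text: str) -> str:
    """Join the lines of each paragraph; blank lines mark paragraph breaks.

    Hypothetical helper mirroring the Word steps for plain text.
    """
    # Treat every kind of line ending as a single "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Step 3: drop spaces and tabs at the end of every line (^w^p -> ^p).
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Step 5: collapse runs of blank lines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Steps 6-7: a blank line is the only real paragraph boundary.
    fixed = []
    for para in text.split("\n\n"):
        para = para.strip("\n")
        # Step 8: a line ending in a hyphen joins the next line with no space.
        para = para.replace("-\n", "-")
        # Step 9: every other line break inside the paragraph becomes a space.
        para = para.replace("\n", " ")
        fixed.append(para)
    return "\n\n".join(fixed)
```

Leaving out the two replacements marked step 8 and step 9 would keep the original line layout, but in a plain-text file there is no separate Soft Return character, so that variant only makes sense in Word.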
Note that leaving out steps 8 and 9 would leave the text as it appears in the book without preventing skimming by paragraph since ^l (ASCII 11) doesn't specify a new paragraph, only a new line.

Fixing Occasional Hard Returns In The Middle Of Paragraphs

OCR software will occasionally add a Hard Return or Paragraph Marker (ASCII 13) at the end of a line even though the line is not the last line of the paragraph. This causes the paragraph to be broken into two separate paragraphs in the scan.

To search for this scanning error using Word, use the following search strings:

^$^p
This will find paragraphs which end in a letter. Be aware that replacing this string with nothing will not only remove the Paragraph Marker, it will also remove the letter which the string finds, so you don't want to use this in a find and replace. Another reason you don't want to use this in a find and replace is that there are legitimate reasons for ending a paragraph with a letter, and it's best to make sure what the string finds is a scanning error.

,^p
This will find paragraphs which end in a comma. Again, it's best not to use this in a find and replace because there are also legitimate reasons for ending a paragraph in a comma.

These strings are not guaranteed to find every occurrence of the problem, but they should find nearly all of them.

Be aware that exiting the Find dialog box and using Ctrl-Page-Down and Ctrl-Page-Up to find the next or previous occurrence will make it easy to fix a scanning error when it's found since it eliminates the steps of opening and closing the dialog box.

One legitimate reason for finding both strings in the book is that few pages end in a complete sentence. Another is that letters or notes often appear in books, and these strings will find the opening or closing of the letter or note.

Searching for this scanning error takes a little while, but it's one that you will want to check if you're striving for the perfect scan.
Note that if you are finding Hard Returns at the end of every line or at the end of most lines, then refer to the tip for removing Hard Returns at the end of every line.
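The two searches can also be sketched for a plain-text export of the scan. This is a rough illustration, not something from the original tip: it assumes each paragraph sits on its own line of the file, and the function name find_suspect_breaks is made up for the example. Like the Word searches, it only reports candidates to check by hand and changes nothing.

```python
from typing import List, Tuple


def find_suspect_breaks(text: str) -> List[Tuple[int, str]]:
    """List paragraphs that end in a letter or a comma.

    Hypothetical helper; assumes one paragraph per line of the file.
    Returns (line_number, paragraph_text) pairs, numbered from 1.
    """
    suspects = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        if not stripped:
            continue  # blank lines are not paragraphs
        last = stripped[-1]
        # Same idea as searching Word for ^$^p and for ,^p
        if last.isalpha() or last == ",":
            suspects.append((number, stripped))
    return suspects
```

Each reported line still needs a look, since page ends and the openings or closings of letters will show up alongside the real scanning errors.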