[bksvol-discuss] Re: clearing out line breaks

  • From: "Kellie Hartmann" <kellhart@xxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Sat, 14 Aug 2004 13:30:08 -0500

Hi E,
I have a system for dealing with this problem, which can occur as a result
of formatting for a display but also is common among books scanned by older
scanning software, such as the Reading Edge.

First, I recommend that you use Kurzweil for this kind of work, and not the
BN. The BN's search and replace sometimes misses things for some
unfathomable reason. The following is best done before you've gone through
and read any or all of the book for correcting scanner errors; if you do
these search/replace operations first and then go through the book you'll be
able to fix anything that slips through the cracks or gets altered
unnecessarily. In Kurzweil you search for a hard linebreak by putting
backslash n in the search/replace.  So what I usually do is open the file in
k1k, and then examine it a little. First of all, check out the pattern of
the words split by hyphens and linebreaks. Is there a space between the
hyphen and the linebreak, or between the linebreak and the other half of the
word? Establish the pattern, and then replace that string. For example, you
may do your search for something like hyphen space linebreak or hyphen
linebreak space depending on the pattern, then replace that string with
nothing. That should clear up most of those without messing anything else
up. Next, is there anything that distinguishes the pattern of arbitrary line
breaks from those that denote paragraph boundaries? For example, I've seen
cases where there were two spaces preceding line breaks that represented new
paragraphs and no spaces before the arbitrary ones. When this happens your
clean-up is simple. In the above example I would do the following:
1. go into find and replace.
2. search for all line breaks preceded by two spaces.
3. replace them with a placeholder string that you know for sure does not
appear in the book, such as an odd punctuation mark or combination of marks,
for example ^3. Now the file will look like a bit of a mess, but that will
soon be fixed.
4. Replace all remaining line breaks with one space.
5. Now replace the placeholder you substituted for the desirable line
breaks, ^3 in my example, with line breaks again. Before you save these
changes do some checking to be sure things have turned out as you expected
with no unanticipated consequences. I would recommend doing a save-as and
saving under a different filename; that way if something weird has happened
you can start over without too much difficulty.
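
For anyone who would rather script this than do it in an editor, the steps
above can be sketched in Python roughly as follows. This is only an
illustration and not what Kurzweil does internally. The function name, the
default hyphen pattern (hyphen followed directly by the line break), and the
two-space paragraph marker are my assumptions; adjust them to whatever
pattern your own file shows.

```python
PLACEHOLDER = "^3"  # assumed: a string that appears nowhere in the book


def unwrap_marked_paragraphs(text, hyphen_pattern="-\n", paragraph_marker="  \n"):
    """Join arbitrary line breaks, keeping breaks preceded by two spaces.

    hyphen_pattern and paragraph_marker are illustrative defaults only.
    """
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; pick another")

    # Rejoin words split by a hyphen and a line break (replace with nothing).
    text = text.replace(hyphen_pattern, "")
    # Steps 2-3: protect the paragraph breaks with the placeholder.
    text = text.replace(paragraph_marker, PLACEHOLDER)
    # Step 4: every line break still left is arbitrary, so make it a space.
    text = text.replace("\n", " ")
    # Step 5: turn the placeholder back into real line breaks.
    return text.replace(PLACEHOLDER, "\n")


sample = "The scan-\nner split this\nline.  \nNew paragraph\nhere.  \n"
print(unwrap_marked_paragraphs(sample))
```

Note that the hyphen step also joins genuinely hyphenated words that happen
to fall at the end of a line, which is one more reason to read through the
book afterwards.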

Now, if there is nothing to differentiate between the linebreaks you want
and those you don't, things get a little more complicated. What I do is a
series of search-and-replace operations replacing combinations like period
newline, quote newline, question mark newline, and exclamation point newline
with placeholder strings that don't appear in your file. For example, I
would replace period newline with period ^3. Replace quote newline with
quote ^3. Do the same with question mark and exclamation point. Once you've
done all that find all remaining linebreaks and replace them with one space.
Then go back and reverse all the earlier search/replaces and replace period
^3 with period newline, etc.
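
The same idea can be sketched as a short Python script, again purely as an
illustration. The function name and the list of punctuation marks are my
assumptions (only a straight double quote is handled here), and ^3 is
assumed not to occur in the file.

```python
PLACEHOLDER = "^3"  # assumed: a string that appears nowhere in the book
SENTENCE_ENDINGS = [".", '"', "?", "!"]  # marks treated as ending a paragraph


def unwrap_by_punctuation(text):
    """Keep line breaks that follow sentence-ending marks; join the rest."""
    if PLACEHOLDER in text:
        raise ValueError("placeholder already occurs in the text; pick another")

    # Replace period newline with period ^3, quote newline with quote ^3, etc.
    for mark in SENTENCE_ENDINGS:
        text = text.replace(mark + "\n", mark + PLACEHOLDER)
    # All remaining line breaks become a single space.
    text = text.replace("\n", " ")
    # Reverse the earlier replacements: period ^3 back to period newline, etc.
    for mark in SENTENCE_ENDINGS:
        text = text.replace(mark + PLACEHOLDER, mark + "\n")
    return text


sample = 'He stopped\nat the door.\n"Who is\nthere?"\nNo answer\ncame!\n'
print(unwrap_by_punctuation(sample))
```

An arbitrary line break that happens to fall right after a period or quote
in the middle of a paragraph will be kept as well, so those still need to be
caught when you read through the book.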

I know this sounds like an absolutely awful mess, and it is, but I've found
it does work and gives very good results. It took me a while to get the hang
of this and I had to start over quite a few times, but the results have been
worthwhile in my opinion and it's way better than doing even a little of
this by hand. You have to be careful and pay attention, and I don't blame
anyone for not wanting to mess with this, but if you're interested in the
book and the text quality is high, your end product will be highly
satisfactory.
Hth, and if you have any questions feel free to ask me,
Kellie

