[bksvol-discuss] Re: OCR Corrections

  • From: "Gerald Hovas" <GeraldHovas@xxxxxxxxxxx>
  • To: <bksvol-discuss@xxxxxxxxxxxxx>
  • Date: Sat, 1 Apr 2006 10:21:57 -0600

Monica,

Actually, I prefer to keep OCR Correction turned off because it's another
form of a global search and replace, and global search and replaces can get
you in trouble.

For example, which of the following replacements are safe and which are not?

1. com to corn
2. Tom to Torn
3. tom to torn
4. Thom to Thorn
5. morn to mom
6. Morn to Mom
7. bum to burn
8. Glock to Clock

The answer is that none of them are safe to do in every book.

1, 2 and 3 come loaded in OpenBook's OCR dictionary and need to be removed.
Com is a shortened form of communication, which is used in many SF and
military books.  Tom, of course, is someone's name, but it got loaded into
the dictionary anyway as tom to torn without being restricted to a
case-sensitive match.  tom to torn isn't totally safe either, since tom
sometimes refers to a tomcat, so you might be better off removing it than
restricting it to being case sensitive.
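To make the case-sensitivity point concrete, here is a minimal Python
sketch of a whole-word search and replace.  It is not how OpenBook works
internally (I don't know that); the function name apply_corrections and the
sample sentence are made up purely for illustration.

```python
import re

def apply_corrections(text, corrections, case_sensitive=True):
    """Replace whole words in text using a {wrong: right} mapping.

    Hypothetical helper for illustration only, not OpenBook's code.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    for wrong, right in corrections.items():
        # \b keeps the match to whole words, so "tomcat" is left alone.
        pattern = r"\b" + re.escape(wrong) + r"\b"
        text = re.sub(pattern, right, text, flags=flags)
    return text

sample = "Tom opened the com channel while the old tom slept."

# Entries applied without regard to case: the name Tom is damaged too.
insensitive = apply_corrections(
    sample, {"tom": "torn", "com": "corn"}, case_sensitive=False
)

# Case-sensitive tom to torn: the name survives, the tomcat does not.
sensitive = apply_corrections(sample, {"tom": "torn"}, case_sensitive=True)

print(insensitive)
print(sensitive)
```

Even the case-sensitive run turns the old tom into "the old torn", which is
why removing the entry is the safer choice.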

Thom is sometimes used for someone's name.  Probably a shortened form of
Thomas like Tom.

Morn is a shortened form of morning which can pop up even though it's
probably a bit archaic now.  Of course it wouldn't be archaic to the author
if you're scanning an old book.

Morn is actually the name of a minor character in Star Trek Deep Space Nine.
That's the biggest problem with scanning SF or Fantasy books.  You can never
be sure what's safe since those authors like to think up unique names and
unique ways of indicating alien speech.

I think 7 is in OpenBook's OCR dictionary as well, but I can't remember.
Seems like I had problems with it changing phrases like "a bum on a park
bench" to "a burn on a park bench" and removed it from the dictionary.

Glock is probably the most popular name of a handgun, but I keep seeing
where Glock has been changed to Clock in books.  It's rather funny when you
see someone shoot someone with a Clock the first time, but it gets annoying
after the umpteenth time.

My point is that OCR Correction isn't as nice a feature as it's made out to
be because it's possible for it to do more harm than good, especially if you
scan many different categories of books.  Something that might be a valid
entry for one category might get you in trouble in another.  Com to corn
might be perfectly fine for a book on gardening, but will get you in trouble
in those SF and military books.
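One way to picture the category problem is a separate correction list per
category.  This is only a sketch under my own assumptions: as far as I know
OpenBook has a single dictionary, and the names CORRECTIONS_BY_CATEGORY and
correct_for_category are hypothetical.

```python
import re

# Hypothetical per-category correction lists, for illustration only.
CORRECTIONS_BY_CATEGORY = {
    "gardening": {"com": "corn"},
    "science fiction": {},  # "com" is a real word here, so no entry
}

def correct_for_category(text, category):
    """Apply only the corrections registered for the given category."""
    rules = CORRECTIONS_BY_CATEGORY.get(category, {})
    for wrong, right in rules.items():
        # Case-sensitive, whole-word match.
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

print(correct_for_category("Plant the com in rows.", "gardening"))
print(correct_for_category("Open a com channel.", "science fiction"))
```

The same entry fixes the gardening sentence and would have wrecked the SF
one, so a single global list can't be right for both.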

OCR has improved quite a bit from its early days, and it may be time for
this feature to be phased out since scans are much more accurate than they
used to be.  If not now, then sometime soon.

For now, just be careful with what you enter into the dictionary, and don't
get carried away with the feature.

HTH

Gerald


-----Original Message-----
From: bksvol-discuss-bounce@xxxxxxxxxxxxx
[mailto:bksvol-discuss-bounce@xxxxxxxxxxxxx] On Behalf Of E.
Sent: Saturday, April 01, 2006 5:11 AM
To: bksvol-discuss@xxxxxxxxxxxxx
Subject: [bksvol-discuss] Re: OCR Corrections

common words where rn is used instead of m such as rnany for many, rnan for 
man, morn for mom, wornan for woman, wornen for women, rnen for men, 
rnountain for mountain, rnore for more, and so on.

sometimes thie replaces the
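
The rn-for-m confusions listed above could also be flagged for review
instead of being replaced automatically.  The Python sketch below is just
one possible approach: the tiny KNOWN_WORDS set and the function name
suspect_rn_words are made up for illustration, and a real check would need
a full word list.

```python
import re

# Tiny stand-in for a real word list (assumption for illustration).
KNOWN_WORDS = {
    "many", "man", "woman", "women", "men",
    "mountain", "more", "morning", "corn",
}

def suspect_rn_words(text, known_words=KNOWN_WORDS):
    """Return (word, suggestion) pairs for likely rn-for-m scan errors.

    A word is flagged only if it contains "rn", is not itself a known
    word, and becomes a known word when "rn" is read as "m".
    """
    suspects = []
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if "rn" in lower and lower not in known_words:
            candidate = lower.replace("rn", "m")
            if candidate in known_words:
                suspects.append((word, candidate))
    return suspects

print(suspect_rn_words("The wornan saw rnany corn fields one morning."))
```

Real words such as corn and morning are left alone, and anything flagged
still needs a person to look at it before it is changed.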

If you are getting a lot of those issues, remember it points to a need to 
adjust your brightness settings before beginning a scan.  Scan a few 
practice pages and adjust brightness accordingly.  Similarly, 1 instead of 
I is a brightness issue resulting in "1 am" "1 have" and such.

Hope this helps.

E.

 To unsubscribe from this list send a blank Email to
bksvol-discuss-request@xxxxxxxxxxxxx
put the word 'unsubscribe' by itself in the subject line.  To get a list of
available commands, put the word 'help' by itself in the subject line.

