Please be aware that OCR nowadays does far more than just scan and
convert what is on the printed page. We could break the process into the
following steps, which might give you a bit of an idea.
1 - Retrieve a digital photographic picture of the page, consisting
of millions of 0's and 1's. (I won't deal with the technical ways a
scanner works to obtain this.)
2 - Arrange the obtained 0's and 1's into an array of rows and columns.
3 - Attempt to make sense of this array, clustering the millions of
bits into groups, giving the OCR a chance of discerning what
might be considered characters, lines, and graphics.
4 - Now start the actual interpretation of the clustered bits. This
will usually be performed by looking at the properties of each cluster.
5 - As the interpretation progresses, look up each derived word in an
interpretation dictionary. This dictionary will be language-specific,
and might even depend on the context or characteristics of the scanned page.
6 - Format the interpreted result, attempting to attain a layout as
close as possible to the original page.
7 - Save the result.
8 - Present the saved document to the user. Often this will include
loading it into some kind of text editor, like MSWord, Notepad or
whatever. Such text editors, in turn, might process the document even
further before you get your hands or eyes on it.
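For those who like to see things spelled out, steps 2 and 3 above could
be sketched roughly like this in Python. This is only a toy
illustration, assuming a tiny black-and-white picture; the function
names to_bitmap and find_clusters are my own inventions, and a real OCR
does this far more cleverly.

```python
from collections import deque

def to_bitmap(flat_bits, width):
    """Step 2: arrange a flat stream of 0's and 1's into rows of equal width."""
    return [flat_bits[i:i + width] for i in range(0, len(flat_bits), width)]

def find_clusters(bitmap):
    """Step 3: group touching ink dots (1's) into clusters.

    Returns one bounding box (top, left, bottom, right) per cluster,
    in the order the clusters are met when reading row by row.
    """
    rows, cols = len(bitmap), len(bitmap[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if bitmap[r][c] != 1 or seen[r][c]:
                continue
            # Flood outwards from this dot to every dot touching it.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and bitmap[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# A made-up 7-dot-wide "page": two small blobs on top, one bar below.
stream = [1, 1, 0, 0, 0, 1, 1,
          1, 1, 0, 0, 0, 1, 1,
          0, 0, 0, 0, 0, 0, 0,
          0, 0, 1, 1, 1, 0, 0]
print(find_clusters(to_bitmap(stream, 7)))
```

Each box returned is a candidate for "something that might be a
character", which is what step 4 then tries to interpret.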
- - -
Let me elaborate on a couple of the steps above.
Step 4, attempting to interpret the characters. Since there exists
such a multitude of fonts, sizes and attributes, you would not get very
far should you want to do a pure picture-by-picture recognition of a
character.
Imagine an uppercase A. It looks basically like a house-roof, with a
horizontal line crossing it somewhere along the vertical axis. But
certain fonts have that crossing quite in the upper segment, others tend
to do it more in the lower end; and yet the good standard, you might say,
is to have it just about in the middle. Since the whole character in its
scanned version might consist of something like 75 by 150 dots (or
bits), what your eyes would discern as "just about the middle" might
differ greatly for the OCR. Was that at dot row number 50, or was it at 85?
Fonts might also vary in the thickness of their strokes. And still
others might have certain embellishment lines (serifs), tiny lines
sticking out from the bottom, top and sides of the main lines of the
character. So if you were to interpret based only on a one-to-one
comparison with pictures the OCR has seen before, you would soon
enough run into things that would not be recognized.
Remember how I initially described the capital A as a house-roof
with a horizontal bar crossing near the middle? That is what we could
call the boiled-down properties of the character. Hence, the OCR would
look out for anything that meets such a description, interpreting
it as the uppercase letter A.
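To give a feeling for what "looking at properties" might mean, here is
a small Python sketch. It is my own simplified illustration, not how
any particular OCR does it: it counts the separate stretches of ink in
each row of dots, and calls the shape an A if it has a peak on top, two
legs at the bottom, and a crossbar somewhere in a tolerated band. The
band limits (0.3 and 0.8) are assumptions picked for the example.

```python
def parse(rows):
    """Turn rows like "0111110" into lists of 0/1 dots."""
    return [[int(ch) for ch in row] for row in rows]

def ink_runs(row):
    """Count separate horizontal stretches of ink (1's) in one row of dots."""
    runs, previous = 0, 0
    for dot in row:
        if dot == 1 and previous == 0:
            runs += 1
        previous = dot
    return runs

def crossbar_position(glyph):
    """Height of the crossbar as a fraction (0.0 = top row, 1.0 = bottom row).

    The crossbar is taken to be the first single-stretch row found
    below a row showing two separate legs. Returns None if there is none.
    """
    seen_two_legs = False
    for index, row in enumerate(glyph):
        runs = ink_runs(row)
        if runs == 2:
            seen_two_legs = True
        elif runs == 1 and seen_two_legs:
            return index / (len(glyph) - 1)
    return None

def looks_like_capital_a(glyph, low=0.3, high=0.8):
    """Roof on top, two legs at the bottom, crossbar within the tolerated band."""
    position = crossbar_position(glyph)
    has_roof = ink_runs(glyph[0]) == 1       # the legs meet at the top
    open_bottom = ink_runs(glyph[-1]) == 2   # and stand apart at the bottom
    return has_roof and open_bottom and position is not None and low <= position <= high

LETTER_A = parse(["0001000",
                  "0010100",
                  "0010100",
                  "0111110",
                  "0100010",
                  "0100010",
                  "0100010"])
print(crossbar_position(LETTER_A), looks_like_capital_a(LETTER_A))
```

Notice that the crossbar may sit higher or lower and the shape still
passes, which is the whole point of describing properties instead of
comparing pictures dot by dot.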
An uppercase B and the digit 8 are easily confused. The main
difference here is that the number 8 has rounded corners on both the
left and right side of the character, whereas the letter B is basically
all straight on the left side. Here the actual recognition rate
would greatly depend on the quality of the scanned picture, and on how
forgiving the OCR is when it comes to variation in whether the left
corners are rounded or not. Still, the properties of the two characters
in question could be summed up like this:
- If there exists a cluster of dots that can be considered a character,
and it seems there is one ring sitting on top of another (perhaps with
the lower one a bit bigger), check the corners on the left
side. Rounded corners mean this is the number 8; straight corners,
consider it to be the uppercase B.
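The rule above could be sketched like this in Python. Again, this is
just an illustration under my own assumptions: I measure "straight on
the left side" as the share of rows whose first ink dot sits on the
leftmost column of the character, and the threshold of 0.9 is a made-up
number standing in for how forgiving the OCR is. The glyph must contain
at least one ink dot.

```python
def parse(rows):
    """Turn rows like "11110" into lists of 0/1 dots."""
    return [[int(ch) for ch in row] for row in rows]

def left_edge_straightness(glyph):
    """Share of inked rows whose first ink dot sits on the leftmost column used."""
    left_columns = [row.index(1) for row in glyph if 1 in row]
    edge = min(left_columns)
    return sum(1 for col in left_columns if col == edge) / len(left_columns)

def b_or_8(glyph, threshold=0.9):
    """Straight left side means B; rounded left corners mean 8."""
    return "B" if left_edge_straightness(glyph) >= threshold else "8"

LETTER_B = parse(["11110", "10001", "10001", "11110", "10001", "10001", "11110"])
DIGIT_8 = parse(["01110", "10001", "10001", "01110", "10001", "10001", "01110"])
# The same B, but with one dot lost on the left side of the middle bar.
DAMAGED_B = parse(["11110", "10001", "10001", "01110", "10001", "10001", "11110"])

print(b_or_8(LETTER_B), b_or_8(DIGIT_8))
print(b_or_8(DAMAGED_B), b_or_8(DAMAGED_B, threshold=0.8))
```

The damaged B shows the dilemma: with the strict threshold a single
missing dot tips it over to an 8, while a more forgiving threshold
still reads it as a B.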
All right, so looking at the properties, the OCR might do a pretty
good job. Yet there still exists the chance that, due to low quality of
the scanned picture, or electronic disturbances in the transfer from the
scanner to your PC, or even laziness in the interpretation
rules, the results might not be totally clear. Even to the human eye,
the letter O and the number 0 can many times be a challenge to
tell apart. As a human, you would now go by context.
It is far more likely that your mom will LOOK after your kids than that
she would LO0K after them. This is where step 5 listed above jumps
into action.
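A dictionary lookup of this kind might be sketched as follows in
Python. The table of look-alike characters and the four-word dictionary
are made up for the example; a real OCR dictionary is of course far
bigger and also weighs how likely each word is.

```python
from itertools import product

# Characters the recognizer might mix up (an assumed, far from complete table).
CONFUSABLE = {"0": "0O", "O": "O0", "1": "1I", "I": "I1", "8": "8B", "B": "B8"}

# Toy dictionary standing in for the language-specific one.
WORDS = {"LOOK", "BOOK", "KIDS", "MOM"}

def dictionary_candidates(raw, words=WORDS):
    """Try every swap of look-alike characters; keep the spellings the dictionary knows."""
    options = [CONFUSABLE.get(ch, ch) for ch in raw]
    variants = {"".join(combo) for combo in product(*options)}
    return sorted(variant for variant in variants if variant in words)

print(dictionary_candidates("LO0K"))
```

Here LO0K is not a word, but swapping the 0 for its look-alike O gives
LOOK, which the dictionary does know.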
The OCRs of today are equipped with an interpretation dictionary. Some
are locked to whatever the developer has trained them to consider likely
phrasing and grammar. Others are more user-based, where you can train
the program by correcting it when it makes a wrong interpretation.
And the more sophisticated ones, like Omnipage, will be somehow
self-training. The techniques for letting the software do its
self-training are so complex and varied that we won't deal with them
here. The whole idea is for the OCR to determine what is most
likely in the individual case.
It might base such a determination on the context. In the provided
example, it would see the words MOM and KIDS, and determine that she
most likely would LOOK after the kids.
On the other hand, should the word LOOK happen to be in the middle of
several mathematical equations, it might consider 100K the most
likely interpretation. Then again, 100K does not make much
sense in a plain mathematical equation like multiplication or subtraction,
so it might determine it to be 100%. With a very big load of goodwill,
you might see the similarity between the letter K and the percent sign.
But even if you don't, the OCR just determines this to be the most
likely interpretation, based on the context. And just as a side-note, if
the word LOOK sits in a line of specifications for your new computer, the
interpretation 100K might make total sense, since it means 100
kilobytes (or 100 thousand characters) of storage space. - See how
contextual interpretation might affect the results you get?
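A very crude way of picturing this contextual choice is the Python
sketch below. The cue words for each reading are entirely my own
assumptions for the sake of the example; real software would rely on
statistics gathered from large amounts of text.

```python
# For each possible reading, neighbouring words that would support it
# (hypothetical cue lists, chosen only to mirror the examples above).
CONTEXT_CUES = {
    "LOOK": {"mom", "kids", "after", "at"},
    "100%": {"+", "-", "*", "/", "="},
    "100K": {"memory", "storage", "disk", "ram"},
}

def pick_by_context(candidates, neighbours):
    """Return the candidate supported by most neighbouring words.

    On a tie, the candidate listed first wins.
    """
    nearby = {word.lower() for word in neighbours}

    def score(candidate):
        return len(CONTEXT_CUES.get(candidate, set()) & nearby)

    return max(candidates, key=score)

readings = ["LOOK", "100K", "100%"]
print(pick_by_context(readings, ["Mom", "will", "after", "the", "kids"]))
print(pick_by_context(readings, ["25", "*", "4", "="]))
print(pick_by_context(readings, ["disk", "storage", "of"]))
```

The very same cluster of dots comes out as three different things,
depending only on what surrounds it.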
I did say these interpretation dictionaries are language-specific. As
a matter of fact, a good OCR might give you the chance of setting which
languages it should attempt to recognize. May I give you an
example of how this could affect the processed results?
You are aware of the challenge of the letter O as compared to the number
0. To work around this, leaving no doubt, many fonts have a diagonal bar
running through their 0's. When the OCR meets something that looks
like a ring with a diagonal bar crossing it, it consequently considers
this a 0. Well, so far so good. - But how long do the roses blossom? -
Both Norwegian and Danish have a national character that is an
O with a slash running through it (Ø). This one occurs rather frequently
in words and names. So when processing a Danish document, you cannot just
look out for any "ring with a diagonal bar" and consider all of them 0's.
What the OCR will have to do in such cases is to take a second look at
the "ring" itself. If it is close to circular, it might be considered
the "slashed-O", whereas if its height is slightly bigger than its
width, it would be considered a 0.
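In Python, that second look might be as simple as the sketch below.
The language codes "da" and "no" and the tolerance of ten percent are
assumptions of mine; the point is only that the language setting
changes which rule gets applied.

```python
def classify_slashed_ring(width, height, language="en", tolerance=0.1):
    """Decide what a "ring with a diagonal bar" is, given its size in dots.

    Only when Danish or Norwegian is selected is the slashed-O an option;
    a nearly circular ring is then read as that letter, a taller one as zero.
    """
    ratio = height / width
    if language in ("da", "no") and abs(ratio - 1.0) <= tolerance:
        return "Ø"
    return "0"

print(classify_slashed_ring(60, 62, language="da"))  # nearly circular
print(classify_slashed_ring(50, 70, language="da"))  # clearly taller than wide
print(classify_slashed_ring(60, 62, language="en"))  # no slashed-O in English
```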
Similarly, there could be dictionaries inside the OCR that base their
interpretation on other factors. Would some interpretation be more
likely given the graphic symbols or pictures that surround the
character or whole word? Like on a music sheet, or when
scanning electronic diagrams. It all depends on how much the developer
has put into the software, and what setup you have given the program.
Bob, in your particular case, I am not totally able to follow the logic
of the OCR. Of course, computers and software are well known to
have bugs. And your OCR might consider a MOTH a bug, and think that the
computer bugs might as well have a celebration, so look out for the big
party... hahaha.
What it seems, though, is that the 1 and the first 0 have been
considered the letter m. To me, that would indicate the bottom left part
of the first 0 is somehow missing. Furthermore, in case of tight
character spacing, this in combination could lead the
OCR to treat the first two characters as one. It then
would check its dictionary for anything that would match the following:
- Got a word that has to end with a 0, or could it be an O, followed by
the letters T H. It all starts out with what looks like one character.
Which word do I have in stock that matches?
- Oh, let me see... -
- Well, how about MOTH? That does match pretty well, doesn't it? -
And now, if there is no further contextual interpretation, the OCR
throws MOTH at you, hoping that you are pleased and enjoying it. (Smiles.)
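My guess at what happened could be played out in Python like this. It
is pure speculation put into code: each position holds the characters
the OCR thinks the shape could be (None meaning "no idea"), and the toy
word list is made up for the example.

```python
import re

# Toy dictionary (hypothetical); a real one would be far larger.
WORDS = ["BOTH", "MOTH", "MATH", "100TH", "CLOTH"]

def match_glyph_pattern(slots, words=WORDS):
    """Find dictionary words fitting the recognized shapes, position by position.

    Each slot is a string of characters the shape could be, or None for
    a shape that could be anything.
    """
    pattern = "".join(
        "." if slot is None else "[" + re.escape(slot) + "]" for slot in slots
    )
    return [word for word in words if re.fullmatch(pattern, word)]

# The 1 and the damaged first 0 merged into one m-like shape:
print(match_glyph_pattern(["M", "0O", "T", "H"]))
# The same word with all five characters properly separated:
print(match_glyph_pattern(["1I", "0O", "0O", "T", "H"]))
```

Once the first two characters have melted into one shape, the correct
reading is no longer even among the candidates, so no amount of
re-scanning the same page will bring it back.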
Why the bottom part of the first 0 is not taken into consideration, even
if the character feels perfectly shaped on your Optacon, would be mere
guesswork. Could it be due to a bad spot on the glass plate of your
flatbed? Or could it simply be a grain of dust, or a crease in the paper?
The paper might even have a crack in it. Or the ink of the print might
not have covered completely. Your eye, the Optacon, does not have high
enough resolution to pick up such tiny differences. But out of the
thousands of dots the scanner sees in that one character, enough might
be missing.
Sometimes, if this kind of misinterpretation occurs frequently,
adjusting the resolution of your scanner might be advisable. Scanners
nowadays are manufactured to process photographs, demanding
resolutions like 1200 by 1200 dots per inch. This is way
overkill for plain text. 300 to 400 dots per inch might suffice, 300
often giving the better results on normal-sized text. 400 would do for
small print. Your scanner might also need some fine-tuning of its contrast.
Or try scanning in gray-scale, or plain black-and-white, instead of
color mode. Yet, if this is more of an amusing one-time scenario, just
blame it on the very fact that no man-made electronics will ever beat
the human brain.
Sorry for the long message, but my hope is that you got a bit more
educated on how modern OCR works. I promise you, in the late 80's and
early 90's, when OCR technology first reached us, things did not go
that well. If our OCR had an accuracy rating of 70 percent, we were
satisfied. Better scanners and OCR software, and the fact that modern
computers have a thousand times more memory and hard disk capacity, make
it possible to achieve a much higher recognition rate today. Now, if our
OCR did anything less than 95 percent, we would complain loudly.
End-Note:
Just to make you ponder, let me give you a quick handful of numbers to
consider. Should you scan a normal page, 12 by 8 inches, at 1200 DPI
(dots per inch) in plain black-and-white (one bit per dot), it would
leave you with 16.5 megabytes, or sixteen and a half million characters
if you want. That is the amount of data your OCR
would have to consider. Now, on old machines thirty years back, the
memory the computer usually possessed would amount to 0.64, or at
best 1.0 megabytes. As such, the computer did not even have enough
RAM to hold such a scanned page, resulting in the scanner often working
in small steps, then waiting for the OCR to process whatever the computer
could hold, before the scanner would again make it gulp down a bit more
data. Scanning one page would take a minute or two.
OK, scanners back then did not do 1200 DPI. Even the best ones available
would only do 400 DPI, but they were so tremendously expensive that no
normal budget would suffice. So 300 DPI was considered the best buy. Even
these would need 1.03 megabytes to load one full 12 by 8 inch page
(which is roughly the A4 size). And as already stated, even this would
exceed the RAM capacity of the computer. Hard disks were quite
expensive, and some of them would only hold 5, 10 or 20 megabytes,
which would hardly suffice for much OCR work. That is one of the
reasons the OCRs of those days had little chance to do much
proofing, leaving us with the low recognition rates.
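For anyone wanting to check the arithmetic, here it is as a few lines
of Python. I assume plain black-and-white at one bit per dot, and count
a megabyte as 1024 times 1024 bytes; the function name scan_size_mb is
just my own label.

```python
def scan_size_mb(width_in, height_in, dpi, bits_per_dot=1):
    """Size of one scanned page in megabytes (1 MB = 1024 * 1024 bytes)."""
    dots = (width_in * dpi) * (height_in * dpi)
    return dots * bits_per_dot / 8 / (1024 * 1024)

print(round(scan_size_mb(8, 12, 1200), 1))  # 16.5
print(round(scan_size_mb(8, 12, 300), 2))   # 1.03
```

Setting bits_per_dot to 8 for gray-scale, or 24 for color, multiplies
these figures accordingly, which is one more reason plain
black-and-white was the norm back then.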
On 1/10/2021 1:55 PM, Milton wrote:
Hi,
What scanning program are you using? If it is Openbook or K1000, the OCR
engines are old.
If you have a flatbed scanner and are running JAWS 2021, you now have a way
to use your flatbed scanner with JAWS 2021, and it will do a much better job,
as Freedom has updated the OCR.
Freedom also has information about how to use the scanning feature on their
website.
-----Original Message-----
From: optacon-l-bounce@xxxxxxxxxxxxx <optacon-l-bounce@xxxxxxxxxxxxx> On
Behalf Of Robert Feinstein
Sent: Saturday, January 9, 2021 11:37 PM
To: optacon-l@xxxxxxxxxxxxx
Subject: [optacon-l] A question
Listers, this is Bob from New York. I was reading something with my scanner
and it read "we are celebrating our moth birthday." I found that strange,
and re-scanned. Got the same results. Looked at the page with my Optacon
and it said "celebrating our 100th birthday." Totally clear. I am just
curious: why would a scanner read moth instead of 100th? I usually can
figure out why an error occurs, but this time I'm stymied.
Just curious.
Hope all of you are well and using your optacons. Bob
to view the list archives, go to:
www.freelists.org/archives/optacon-l
To unsubscribe at any time, just send a message to:
optacon-l-request@xxxxxxxxxxxxx with the word "unsubscribe" (without the
quotes) in the message subject.
Tell your friends about the list. They can subscribe by sending a message
to:
optacon-l-request@xxxxxxxxxxxxx with the word "subscribe" (without the
quotes) in the message subject.