[apt4ssx8] an annoying habit of tesseract

From: "Kent A. Reed" <kentallanreed@xxxxxxxxx>
To: apt4ssx8@xxxxxxxxxxxxx
Date: Wed, 20 Feb 2013 17:50:20 -0500

Gentle persons:

Something I forgot to mention about tesseract is its annoying habit of intermingling UTF-8 multi-byte characters with single-byte ASCII characters in its output. Depending on particular circumstances, these characters can cause asciidoc to do odd things as it formats its output.

It's easy enough to replace these multi-byte characters using, say, vi/vim, which displays and edits/replaces them as if they are single-byte ASCII characters, but it's not always so easy to notice their presence because they display so similarly.
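Since the hard part is spotting these characters, here is a minimal Python sketch that lists every non-ASCII character in a piece of text along with its line, column, and Unicode name. The function name find_non_ascii and the sample text are made up for illustration, and the sketch assumes the tesseract output decodes cleanly as UTF-8.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line_no, column, char, unicode_name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything outside 7-bit ASCII
                hits.append((line_no, col, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

# Hypothetical sample containing the two byte sequences mentioned in this post:
# E2 80 99 (octal 342 200 231) and C2 BB (octal 302 273).
sample = b"It\xe2\x80\x99s here\nplain line\n\xc2\xbbquoted".decode("utf-8")

for hit in find_non_ascii(sample):
    print(hit)
```

For a real file, the text could be read with open(path, encoding="utf-8").read() and passed to the same function.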

This isn't really tesseract's fault. It's just trying to be true to what it "sees" in the input image, but emitting 342/200/231 for a reverse single-quotation mark or 302/273 for a reverse double-quotation mark (because it thinks it sees a tilt or curl in what began life as a vertical mark) is annoying. These are only two examples. If there's an option to force strict ASCII output, I haven't found it.

I thought I could use the handy Linux iconv utility to fix the files automagically, but so far it barfs no matter what input encoding I tell it to expect. Some days I think I'm getting too old for this :-)
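One possible workaround that doesn't depend on iconv is a small Python filter. This is only a sketch under stated assumptions: the input is taken to be valid UTF-8, the REPLACEMENTS table and the to_ascii name are invented for illustration, and the table covers only a handful of quotation marks and dashes rather than everything tesseract might emit. Anything not in the table comes out as "?" so it stays visible for hand editing.

```python
# Hypothetical mapping from typographic characters to ASCII stand-ins;
# extend it as new offenders turn up in the output.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (bytes E2 80 99)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00ab": '"',   # left-pointing double angle quotation mark
    "\u00bb": '"',   # right-pointing double angle quotation mark (bytes C2 BB)
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
}

def to_ascii(raw: bytes) -> str:
    """Decode UTF-8 bytes and return text containing only ASCII characters."""
    text = raw.decode("utf-8")
    text = text.translate(str.maketrans(REPLACEMENTS))
    # Unmapped non-ASCII characters become '?' instead of raising an error.
    return text.encode("ascii", errors="replace").decode("ascii")

print(to_ascii(b"It\xe2\x80\x99s \xc2\xbbquoted\xc2\xab text"))
```

Replacing unknown characters with "?" rather than dropping them is a deliberate choice here: it keeps each one easy to search for afterwards.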

I suppose this is just another speed bump on the road to enlightenment :-) The capability of tesseract is quite good no matter what.


Regards,
Kent

