[apt4ssx8] an annoying habit of tesseract

  • From: "Kent A. Reed" <kentallanreed@xxxxxxxxx>
  • To: apt4ssx8@xxxxxxxxxxxxx
  • Date: Wed, 20 Feb 2013 17:50:20 -0500

Gentle persons:

Something I forgot to mention about tesseract is its annoying habit of intermingling UTF-8 multi-byte characters with single-byte ASCII characters in its output. Depending on particular circumstances, these characters can cause asciidoc to do odd things as it formats its output.

It's easy enough to replace these multi-byte characters using, say, vi/vim, which displays and edits/replaces them as if they were single-byte ASCII characters, but it's not always easy to notice them in the first place because they look so much like their ASCII counterparts.
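Since these characters are hard to spot by eye, one way to locate them is to scan the text for anything outside the ASCII range. The following is only a minimal sketch, assuming Python 3 is available and that the tesseract output decodes cleanly as UTF-8; the function name find_non_ascii is made up for illustration.

```python
import unicodedata

def find_non_ascii(text):
    """Return (line, column, character, Unicode name) for each non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything above 127 is not plain ASCII
                hits.append((line_no, col_no, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits

if __name__ == "__main__":
    # Hypothetical sample standing in for a line of OCR output.
    sample = "It\u2019s a test\nplain line\n"
    for hit in find_non_ascii(sample):
        print(hit)
```

To check a real file, read it with open(path, encoding="utf-8").read() and pass the result to the function.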

This isn't really tesseract's fault. It's just trying to be true to what it "sees" in the input image, but emitting the octal bytes 342/200/231 (U+2019, a right single quotation mark) or 302/273 (U+00BB, a right-pointing double angle quotation mark) because it thinks it sees a tilt or curl in what began life as a vertical mark is annoying. These are only two examples. If there's an option to force strict ASCII output, I haven't found it.
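To see exactly which characters such byte sequences stand for, they can be decoded as UTF-8 and looked up by name. This is just a small sketch, assuming Python 3; the helper name describe is hypothetical, and the two byte strings are the octal sequences quoted above.

```python
import unicodedata

def describe(raw):
    """Decode a one-character UTF-8 byte sequence and return its code point and name."""
    ch = raw.decode("utf-8")
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

if __name__ == "__main__":
    # Octal escapes in a bytes literal match the 342/200/231 notation.
    for raw in (b"\342\200\231", b"\302\273"):
        print(describe(raw))
```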

I thought I could use the handy Linux iconv utility to fix the files automagically but so far it barfs no matter what input encoding I tell it to expect. Some days I think I'm getting too old for this:-)
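As a possible alternative to iconv, a short script can swap the offending characters for ASCII stand-ins. What follows is a rough sketch, not a tested recipe: it assumes Python 3 and UTF-8 input, the replacement table is my own guess at the usual culprits (it is certainly incomplete), and the name to_ascii is made up. Anything not in the table is left alone and reported so it can be handled by hand.

```python
# Assumed mapping of typographic characters to ASCII; extend as needed.
ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00ab": '"', "\u00bb": '"',   # double angle quotation marks
    "\u2013": "-", "\u2014": "-",   # en and em dashes
})

def to_ascii(text):
    """Apply the mapping; return the cleaned text and any non-ASCII leftovers."""
    cleaned = text.translate(ASCII_MAP)
    leftovers = sorted({ch for ch in cleaned if ord(ch) > 127})
    return cleaned, leftovers

if __name__ == "__main__":
    # Hypothetical sample standing in for tesseract output.
    cleaned, leftovers = to_ascii("It\u2019s \u00bbquoted\u00bb text")
    print(cleaned)
    print("unmapped characters:", leftovers)
```

Reading the file with open(path, encoding="utf-8") and writing the cleaned text back out would complete the job; the leftovers list shows which characters still need an entry in the table.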

I suppose this is just another speed bump on the road to enlightenment:-) The capability of tesseract is quite good no matter what.

Regards,
Kent

