[apt4ssx8] Re: an annoying habit of tesseract

  • From: "Kent A. Reed" <kentallanreed@xxxxxxxxx>
  • To: apt4ssx8@xxxxxxxxxxxxx
  • Date: Wed, 20 Feb 2013 19:49:47 -0500

On 2/20/2013 5:50 PM, Kent A. Reed wrote:
Gentle persons:

Something I forgot to mention about tesseract is its annoying habit of intermingling UTF-8 multi-byte characters with single-byte ASCII characters in its output. Depending on particular circumstances, these characters can cause asciidoc to do odd things as it formats its output.

Oof. I managed to turn the argument upside down. It's not the UTF-8 encoding that's the problem per se, it's the odd characters that tesseract decides to emit. The single-byte ASCII characters are just the first 128 code points of Unicode, which UTF-8 encodes as one byte each, and they also happen to be the characters asciidoc knows best (it's not called "ASCII"doc for nothing).
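To see which odd characters are actually turning up, something along these lines might help. It is only a rough sketch in Python 3; the function names `tally_non_ascii` and `describe` are made up for illustration, and it assumes the tesseract output has already been read in and decoded as UTF-8.

```python
import unicodedata
from collections import Counter

def tally_non_ascii(text):
    """Count every character that falls outside 7-bit ASCII."""
    return Counter(ch for ch in text if ord(ch) > 127)

def describe(counts):
    """Return one report line per character: code point, Unicode name, count."""
    lines = []
    for ch, n in counts.most_common():
        name = unicodedata.name(ch, "UNKNOWN")
        lines.append("U+%04X %s x%d" % (ord(ch), name, n))
    return lines
```

Printing the lines returned by `describe(tally_non_ascii(text))` gives a quick inventory of what needs handling before asciidoc sees the file.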

And iconv isn't barfing on my selection of input encoding; it's trying, in a way only its developer could love, to tell me it finds characters, like the back curly single-quotation mark, that it can't map into ASCII. The "-c" option tells it to drop whatever it can't convert, but then it deletes every non-ASCII character, whereas I want to convert certain ones for which I know the originating ASCII character.

What I need to do is write a little utility of my own.
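A minimal sketch of what such a utility might look like, in Python 3. The `REPLACEMENTS` table and the `to_ascii` name are assumptions for illustration, not a finished tool; the table would need to be filled in with whatever characters tesseract actually produces for a given document. Unlike iconv's "-c", anything not in the table is left in place and reported rather than silently deleted.

```python
# Hypothetical mapping from characters tesseract may emit to the ASCII
# text they most likely came from. Extend as needed.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
}

def to_ascii(text, replacements=REPLACEMENTS):
    """Replace known non-ASCII characters; keep and collect unknown ones.

    Returns the converted text and the set of characters that had no
    entry in the table, so they can be reviewed by hand.
    """
    out = []
    unmapped = set()
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                 # plain ASCII passes through
        elif ch in replacements:
            out.append(replacements[ch])   # known character: substitute
        else:
            unmapped.add(ch)               # unknown: keep it and report it
            out.append(ch)
    return "".join(out), unmapped
```

Called as `clean, leftovers = to_ascii(text)`, this returns the converted text plus the set of characters that still need a decision, which can then be added to the table or removed by hand.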

