On 2/20/2013 5:50 PM, Kent A. Reed wrote:
Gentle persons:
Something I forgot to mention about tesseract is its annoying habit of intermingling UTF-8 multi-byte characters with single-byte ASCII characters in its output. Depending on particular circumstances, these characters can cause asciidoc to do odd things as it formats its output.
Oof. I managed to turn the argument upside down. It's not the UTF-8 encoding that's the problem per se, it's the odd characters that tesseract decides to encode. Those single-byte ASCII characters are just the first 128 code points of Unicode, which UTF-8 encodes as single bytes, and they also happen to be the characters asciidoc knows best (it's not called "ASCII"doc for nothing).
And iconv isn't barfing on my selection of input encoding; it's trying, in a way only its developer could love, to tell me it finds characters like the back curly single-quotation mark that it can't map into ASCII. The "-c" option can be invoked to tell it to delete those, but then it deletes every character it can't map into ASCII, whereas I want to convert certain ones for which I know the originating ASCII character.
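A minimal sketch of that kind of selective conversion, in Python. Everything here is an assumption for illustration: the names ASCII_MAP and to_ascii are hypothetical, and the table entries are only guesses at the sort of characters tesseract might emit, so the real list would have to come from inspecting actual output.

```python
# Hypothetical mapping from non-ASCII characters to ASCII stand-ins.
# The entries are illustrative; extend the table after inspecting real output.
ASCII_MAP = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\ufb01": "fi",   # "fi" ligature
    "\ufb02": "fl",   # "fl" ligature
}

# str.maketrans accepts a dict of single characters to replacement strings.
TABLE = str.maketrans(ASCII_MAP)


def to_ascii(text, placeholder=None):
    """Convert the characters listed in ASCII_MAP to their ASCII stand-ins.

    Unknown non-ASCII characters are left alone by default, so nothing is
    silently deleted. If a placeholder is given, each remaining non-ASCII
    character is replaced by it instead.
    """
    text = text.translate(TABLE)
    if placeholder is not None:
        text = "".join(ch if ord(ch) < 128 else placeholder for ch in text)
    return text
```

Unlike "-c", this leaves unmapped characters visible unless a placeholder is requested, which makes it easier to spot what still needs an entry in the table. It could be applied line by line to text decoded as UTF-8 before the result is handed to asciidoc.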
What I need to do is write a little utility of my own.

Regards,
Kent