[apt4ssx8] Re: an annoying habit of tesseract

From: "Kent A. Reed" <kentallanreed@xxxxxxxxx>
To: apt4ssx8@xxxxxxxxxxxxx
Date: Wed, 20 Feb 2013 19:49:47 -0500

On 2/20/2013 5:50 PM, Kent A. Reed wrote:

Gentle persons:
Something I forgot to mention about tesseract is its annoying habit of intermingling UTF-8 multi-byte characters with single-byte ASCII characters in its output. Depending on particular circumstances, these characters can cause asciidoc to do odd things as it formats its output.

Oof. I managed to turn the argument upside down. It's not the UTF-8 encoding that's the problem per se, it's the odd characters that tesseract decides to encode. Those single-byte ASCII characters are just the so-called first code page in UTF-8 and they also happen to be the characters asciidoc knows best (it's not called "ASCII"doc for nothing).
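To make the distinction concrete, here is a small Python sketch (not part of the original workflow; the helper name utf8_length and the sample characters are chosen only for illustration). It shows that plain ASCII characters take one byte each in UTF-8, while typographic punctuation such as a curly quotation mark takes several.

```python
def utf8_length(ch):
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# A plain apostrophe, a right curly single quote, a left curly double quote.
for ch in ["'", "\u2019", "\u201c"]:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count, and the bytes in hex.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s), {encoded.hex()}")
```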

And, iconv isn't barfing on my selection of input encoding; it's trying, in a way only its developer could love, to tell me it finds characters like the back curly single-quotation mark that it can't map into ASCII. The "-c" option can be invoked to tell it to delete those but then it deletes all upper code-page characters, whereas I want to convert certain ones for which I know the originating ASCII character.


What I need to do is write a little utility of my own.
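One possible shape for such a utility, sketched in Python under stated assumptions: the REPLACEMENTS table, the function names to_ascii and main, and the "?" placeholder are all hypothetical choices, not anything tesseract, iconv, or asciidoc prescribes. The idea is to convert the characters whose ASCII origin is known and to flag the rest visibly instead of silently deleting them.

```python
import sys

# Hypothetical mapping: typographic characters that OCR output may contain,
# paired with the ASCII text they most plausibly stand for. Extend as needed.
REPLACEMENTS = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "--",  # em dash
    "\ufb01": "fi",  # fi ligature
    "\ufb02": "fl",  # fl ligature
}
TABLE = str.maketrans(REPLACEMENTS)

def to_ascii(text, placeholder="?"):
    """Convert known characters to ASCII; mark any other non-ASCII character."""
    translated = text.translate(TABLE)
    return "".join(ch if ord(ch) < 128 else placeholder for ch in translated)

def main():
    # Assumes the input on stdin is UTF-8 encoded.
    data = sys.stdin.buffer.read().decode("utf-8")
    sys.stdout.write(to_ascii(data))
```

Calling main() would turn this into a filter that sits between tesseract's output and asciidoc's input. Passing placeholder="" to to_ascii gives roughly the delete-everything-else behaviour described for iconv's "-c" option, but only after the known characters have been converted.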

Regards,
Kent

Follow-Ups:
- [apt4ssx8] Re: an annoying habit of tesseract
  - From: dave

References:
- [apt4ssx8] an annoying habit of tesseract
  - From: Kent A. Reed
