[apt4ssx8] Re: an annoying habit of tesseract

  • From: dave <dengvall@xxxxxxxxxxx>
  • To: apt4ssx8@xxxxxxxxxxxxx
  • Date: Wed, 20 Feb 2013 21:19:16 -0800

On Wed, 2013-02-20 at 19:49 -0500, Kent A. Reed wrote:
> On 2/20/2013 5:50 PM, Kent A. Reed wrote:
> > Gentle persons:
> >
> > Something I forgot to mention about tesseract is its annoying habit of 
> > intermingling UTF-8 multi-byte characters with single-byte ASCII 
> > characters in its output. Depending on particular circumstances, these 
> > characters can cause asciidoc to do odd things as it formats its output.
> >
> 
> Oof. I managed to turn the argument upside down. It's not the UTF-8 
> encoding that's the problem per se, it's the odd characters that 
> tesseract decides to encode. Those single-byte ASCII characters are 
> just the so-called first code page in UTF-8 and they also happen to be 
> the characters asciidoc knows best (it's not called "ASCII"doc for nothing).
> 
> And, iconv isn't barfing on my selection of input encoding; it's trying, 
> in a way only its developer could love, to tell me it finds characters 
> like the back curly single-quotation mark that it can't map into ASCII. 
> The "-c" option can be invoked to tell it to delete those but then it 
> deletes all upper code-page characters, whereas I want to convert 
> certain ones for which I know the originating ASCII character.
> 
> What I need to do is write a little utility of my own.
> 
> Regards,
> Kent
> 
 I find sed a bit arcane but sometimes it is handy. :-)
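
For comparison, here is a minimal sketch of the kind of little utility
described above, in Python rather than sed. It converts only the characters
listed in a table and leaves everything else alone, so nothing is silently
deleted the way "iconv -c" would. The table contents and the names
REPLACEMENTS, asciify and leftovers are my own assumptions for illustration;
the characters tesseract actually emits for a given scan may differ.

```python
# Hypothetical replacement table: extend it with whatever non-ASCII
# characters actually show up in your OCR output.
REPLACEMENTS = {
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\ufb01": "fi",   # "fi" ligature
    "\ufb02": "fl",   # "fl" ligature
}

# str.maketrans accepts a dict whose keys are single characters and whose
# values are replacement strings of any length.
TABLE = str.maketrans(REPLACEMENTS)


def asciify(text):
    """Replace the characters listed in REPLACEMENTS; leave all others as-is."""
    return text.translate(TABLE)


def leftovers(text):
    """Return the sorted, distinct non-ASCII characters still present in text."""
    return sorted({ch for ch in text if ord(ch) > 127})
```

To use it as a filter, read the tesseract output as UTF-8, pass it through
asciify(), write the result, and report whatever leftovers() returns so that
unmapped characters can be added to the table instead of being dropped.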

Dave


