On Wed, 2013-02-20 at 19:49 -0500, Kent A. Reed wrote:
> On 2/20/2013 5:50 PM, Kent A. Reed wrote:
> > Gentle persons:
> >
> > Something I forgot to mention about tesseract is its annoying habit of
> > intermingling UTF-8 multi-byte characters with single-byte ASCII
> > characters in its output. Depending on particular circumstances, these
> > characters can cause asciidoc to do odd things as it formats its output.
> >
>
> Oof. I managed to turn the argument upside down. It's not the UTF-8
> encoding that's the problem per se, it's the odd characters that
> tesseract decides to encode. Those single-byte ASCII characters are
> just the so-called first code page in UTF-8 and they also happen to be
> the characters asciidoc knows best (it's not called "ASCII"doc for nothing).
>
> And, iconv isn't barfing on my selection of input encoding; it's trying,
> in a way only its developer could love, to tell me it finds characters
> like the back curly single-quotation mark that it can't map into ASCII.
> The "-c" option can be invoked to tell it to delete those, but then it
> deletes all upper code-page characters, whereas I want to convert
> certain ones for which I know the originating ASCII character.
>
> What I need to do is write a little utility of my own.
>
> Regards,
> Kent
>

I find sed a bit arcane but sometimes it is handy. :-)

Dave
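
For what it's worth, the "little utility" Kent describes could be sketched in Python along these lines. This is only an illustration under stated assumptions, not anything posted in the thread: the function name, the script name, and the contents of the replacement table are hypothetical, and the table would need to be extended with whatever characters tesseract actually emits for a given scan. Characters that are not in the table are passed through unchanged rather than deleted, which is the behaviour "iconv -c" does not give.

```python
#!/usr/bin/env python3
"""Map selected non-ASCII characters to ASCII stand-ins (illustrative sketch)."""
import sys

# Hypothetical replacement table: only characters whose ASCII origin is known.
# Extend it to suit the OCR output at hand.
REPLACEMENTS = {
    "\u2018": "'",    # left curly single quotation mark
    "\u2019": "'",    # right curly single quotation mark
    "\u201c": '"',    # left curly double quotation mark
    "\u201d": '"',    # right curly double quotation mark
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\ufb01": "fi",   # fi ligature
    "\ufb02": "fl",   # fl ligature
}

# str.maketrans accepts a dict of single characters mapped to strings.
TABLE = str.maketrans(REPLACEMENTS)


def to_ascii_where_known(text):
    """Replace characters listed in REPLACEMENTS; leave everything else alone."""
    return text.translate(TABLE)


def main(paths):
    """Convert each named UTF-8 file and write the result to standard output."""
    if not paths:
        print("usage: fixchars.py FILE [FILE ...]", file=sys.stderr)
        return
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            converted = to_ascii_where_known(handle.read())
        sys.stdout.buffer.write(converted.encode("utf-8"))


if __name__ == "__main__":
    main(sys.argv[1:])
```

Assuming the script is saved as fixchars.py (a made-up name), something like "python3 fixchars.py scan.txt > scan-clean.txt" would convert the known characters and leave any others in place, where they can be found and added to the table later.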