[liblouis-liblouisxml] Re: Supporting UTF-8 in opcode character argument

From: Mesar Hameed <mesar.hameed@xxxxxxxxx>
To: liblouis-liblouisxml@xxxxxxxxxxxxx
Date: Thu, 5 Jul 2012 16:16:05 +0100

Hi Vic,

On Thu 05/07/12,10:24, Vic Beckley wrote:
> This does seem to be a good combination. On my test file, with eSpeak as the
> synth, the two characters in question are silent but the triple press does
> yield their Unicode value.

Great, it should work for you 100% of the time, as requested. :)

> With the SAPI 5 synth that I usually use, the
> character are spoken as "question" and, again, the triple press yields the
> Unicode value. I wonder why it is different with different synths?

Implementers usually define names only for characters that are appropriate for 
the synth's supported languages.
The remaining question then is what to do with everything else.
Some synths just say nothing, others may beep, and yours says 
"question".
Another factor is whether the character is passed to the synth as a word or as 
a character.
For example, the letter m passed as a word probably yields an "mmm" sound, while 
passed as a character the synth pronounces it "em" (e followed by m).

> This is in spite of the fact that they are not showing on the screen.

Yes, this points to suitable fonts not being installed or selected.
I am afraid I don't have much information on how to go about getting fonts for 
Windows.

> Is there somewhere on the web that you can find a definitive description of
> what all characters represent.

There are a lot, but they are often not very accessible with screen readers.
Just now I managed to find:
http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=0x
which seems to work quite nicely here with Orca.
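
If a web table is awkward to read, a character can also be looked up locally.
The following is only a minimal sketch using Python's standard unicodedata
module; the helper name describe() is made up for illustration. Control
characters such as U+0099 have no name in that database, so a placeholder is
printed for them.

```python
import unicodedata


def describe(codepoint):
    """Return a one-line description of a Unicode code point."""
    ch = chr(codepoint)
    # Control characters have no name, so fall back to a placeholder
    # instead of letting unicodedata.name() raise ValueError.
    name = unicodedata.name(ch, "<no name>")
    category = unicodedata.category(ch)  # e.g. "Cc" = control, "So" = symbol
    return f"U+{codepoint:04X} {category} {name}"


for cp in (0x0080, 0x0099, 0x2122):
    print(describe(cp))
```

A category of "Cc" in the output means the code point is a control character
rather than a printable symbol.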

> For example, the u+0099 symbol is defined in
> the cy-cy-g1.utb table as a specific dot pattern. Window-Eyes calls this
> character a trademark symbol. NVDA doesn't know what it is. On the reference
> I found on the web it is just called a control character.

Yes, I don't think it's a trademark symbol, just something lost in conversion.
These people seem to have identified it as a bug:
http://stackoverflow.com/questions/7341274/how-can-i-check-that-the-trademark-character-is-set-correctly-in-my-oracle-da
The best thing to do is to delete that symbol and replace it with \x2122
or the actual trademark symbol.
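
One plausible way the confusion arises, sketched in Python: the single byte
0x99 is the trademark sign in the Windows-1252 code page, but if the same byte
is read as Latin-1 it lands on the control character U+0099. That
Windows-1252 was the code page involved here is an assumption, not something
confirmed for this table.

```python
def reinterpret(byte_value):
    """Decode one byte under two code pages and return both code points."""
    raw = bytes([byte_value])
    as_latin1 = ord(raw.decode("latin-1"))  # maps bytes 1:1 onto U+0000..U+00FF
    as_cp1252 = ord(raw.decode("cp1252"))   # Windows-1252 mapping
    return as_latin1, as_cp1252


for b in (0x99, 0x80):
    latin1, cp1252 = reinterpret(b)
    print(f"byte 0x{b:02X}: latin-1 -> U+{latin1:04X}, cp1252 -> U+{cp1252:04X}")
```

Under that assumption, 0x99 corresponds to U+2122 (trademark) and 0x80 to
U+20AC (euro sign).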

> The character
> u+0080 is similar.

If it also had a comment after it, you should be able to look up its codepoint 
in unicodedefs.cti and correct it in the same way.

> The unicodedefs.cti table just has
> these characters as blank, probably because they are supposed to be control
> characters.
> Please explain this confusion?

This probably happened because the author of the table had one code page, while 
the target language has a different one.
Hopefully now that we are working in Unicode it should be much faster and less 
error-prone to correct these and many similar mistakes.

Please do keep a record of what looks suspicious so that we can fix them in one 
go for a given table.
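
To help build such a record, here is a rough sketch of a scanner that lists
C1 control code points (U+0080 to U+009F) in a table, whether written as
\xhhhh escapes or as literal characters. The function names and the sample
lines are hypothetical (the sample is not taken from any real table), and the
Windows-1252 suggestion is only a guess to be checked by hand.

```python
import re

# Matches table escapes \x0080 .. \x009f (four hex digits).
ESCAPE = re.compile(r"\\x(00[89][0-9a-fA-F])")


def cp1252_guess(codepoint):
    """Return the code point this byte has in Windows-1252, or None."""
    try:
        return ord(bytes([codepoint]).decode("cp1252"))
    except UnicodeDecodeError:
        return None  # a few bytes (e.g. 0x81) are undefined in cp1252


def find_suspicious(table_text):
    """Return (line number, code point, cp1252 guess) for each C1 character."""
    findings = []
    # Split on "\n" only: str.splitlines() would also split on U+0085,
    # which is itself one of the characters we are looking for.
    for lineno, line in enumerate(table_text.split("\n"), start=1):
        for match in ESCAPE.finditer(line):
            cp = int(match.group(1), 16)
            findings.append((lineno, cp, cp1252_guess(cp)))
        for ch in line:
            if 0x80 <= ord(ch) <= 0x9F:
                findings.append((lineno, ord(ch), cp1252_guess(ord(ch))))
    return findings


# Made-up sample lines, for illustration only.
sample = "sign \\x0099 2345 trademark\nsign \\x0041 1\n"
for lineno, cp, guess in find_suspicious(sample):
    hint = f"U+{guess:04X}" if guess is not None else "no cp1252 mapping"
    print(f"line {lineno}: U+{cp:04X} (perhaps meant {hint})")
```

The output is meant as a list of candidates to review, not as automatic
corrections.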

Thanks much,
Mesar
For a description of the software, to download it and links to
project pages go to http://www.abilitiessoft.com
