[audacity4blind] Re: Nyquist Unicode support (was: Audacity Plugins)

  • From: Robert Hänggi <aarjay.robert@xxxxxxxxx>
  • To: audacity4blind@xxxxxxxxxxxxx
  • Date: Sun, 1 Feb 2015 08:50:28 +0100

Hi Steve

I think the UTF-8 translation works for most cases. However, it's
more of an ASCII/ANSI thing than a Unicode issue.

Let's get practical:
- open the sample data export plug-in
- enter as file name "hörbücher" (= German for "Audio Books") without quotes.
- click debug

On my system (Windows 7, 64-bit), the following happens:
- The message box reports
---------------------------
Nyquist
---------------------------
[Warning: Nyquist returned invalid UTF-8 string, converted here as
Latin-1]hörbücher1.txt   1 channel (mono)
Sample Rate: 44100 Hz. Sample values on dB scale.
Length processed: 100 samples 0.00227 seconds.


Data written to:
C:\Users\Robert Hänggi\hörbücher1.txt
---------------------------
OK
---------------------------

- The debug output shows correctly:
hörbücher.txt   1 channel (mono)
Sample Rate: 44100 Hz. Sample values on dB scale.
Length processed: 100 samples 0.00227 seconds.


Therefore, two problems are encountered:
- An unnecessary warning in the main output window (instead of the debug window)
- The file name has erroneous characters instead of Latin-1 (characters 128-255)
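
The warning can be reproduced outside Audacity. The sketch below is
Python, not Nyquist, and decode_with_fallback is a hypothetical helper
that only imitates what the message text suggests (try UTF-8 first,
fall back to Latin-1). It is an assumption, not Audacity's actual code.

```python
def decode_with_fallback(raw: bytes):
    """Return (text, warned): try UTF-8 first, fall back to Latin-1.

    Hypothetical helper; it only mimics the behaviour the warning describes.
    """
    try:
        return raw.decode("utf-8"), False
    except UnicodeDecodeError:
        # Every byte value is a valid Latin-1 character, so this cannot fail.
        return raw.decode("latin-1"), True


utf8_name = "hörbücher".encode("utf-8")      # ö -> C3 B6, ü -> C3 BC
latin1_name = "hörbücher".encode("latin-1")  # ö -> F6, ü -> FC

print(decode_with_fallback(utf8_name))    # valid UTF-8, no warning flag
print(decode_with_fallback(latin1_name))  # lone F6/FC bytes are not valid UTF-8
```

Both calls yield the same text, but only the Latin-1 bytes set the
warning flag, which matches what I see in the message box.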

The Latin-1 character set covers 23 additional languages by
correctly printing their special characters.
IMO, it would be worthwhile to support this extended character set.

The string "hörbücher" should be translated to "h\366rb\374cher",
e.g. with this function:

;; latin-strings
(defun latin-1 (str &aux (latin-1-map '(
 ("\302\240" . 160); no-break space
 ("\302\241" . 161); inverted exclamation mark
 ("\302\242" . 162); cent sign
 ("\302\243" . 163); pound sign
 ("\302\244" . 164); currency sign
 ("\302\245" . 165); yen sign
 ("\302\246" . 166); broken bar
 ("\302\247" . 167); section sign
 ("\302\250" . 168); diaeresis
 ("\302\251" . 169); copyright sign
 ("\302\252" . 170); feminine ordinal indicator
 ("\302\253" . 171); left-pointing double angle quotation mark
 ("\302\254" . 172); not sign
 ("\302\255" . 173); soft hyphen
 ("\302\256" . 174); registered sign
 ("\302\257" . 175); macron
 ("\302\260" . 176); degree sign
 ("\302\261" . 177); plus-minus sign
 ("\302\262" . 178); superscript two
 ("\302\263" . 179); superscript three
 ("\302\264" . 180); acute accent
 ("\302\265" . 181); micro sign
 ("\302\266" . 182); pilcrow sign
 ("\302\267" . 183); middle dot
 ("\302\270" . 184); cedilla
 ("\302\271" . 185); superscript one
 ("\302\272" . 186); masculine ordinal indicator
 ("\302\273" . 187); right-pointing double angle quotation mark
 ("\302\274" . 188); vulgar fraction one quarter
 ("\302\275" . 189); vulgar fraction one half
 ("\302\276" . 190); vulgar fraction three quarters
 ("\302\277" . 191); inverted question mark
 ("\303\200" . 192); latin capital letter a with grave
 ("\303\201" . 193); latin capital letter a with acute
 ("\303\202" . 194); latin capital letter a with circumflex
 ("\303\203" . 195); latin capital letter a with tilde
 ("\303\204" . 196); latin capital letter a with diaeresis
 ("\303\205" . 197); latin capital letter a with ring above
 ("\303\206" . 198); latin capital letter ae
 ("\303\207" . 199); latin capital letter c with cedilla
 ("\303\210" . 200); latin capital letter e with grave
 ("\303\211" . 201); latin capital letter e with acute
 ("\303\212" . 202); latin capital letter e with circumflex
 ("\303\213" . 203); latin capital letter e with diaeresis
 ("\303\214" . 204); latin capital letter i with grave
 ("\303\215" . 205); latin capital letter i with acute
 ("\303\216" . 206); latin capital letter i with circumflex
 ("\303\217" . 207); latin capital letter i with diaeresis
 ("\303\220" . 208); latin capital letter eth (icelandic)
 ("\303\221" . 209); latin capital letter n with tilde
 ("\303\222" . 210); latin capital letter o with grave
 ("\303\223" . 211); latin capital letter o with acute
 ("\303\224" . 212); latin capital letter o with circumflex
 ("\303\225" . 213); latin capital letter o with tilde
 ("\303\226" . 214); latin capital letter o with diaeresis
 ("\303\227" . 215); multiplication sign
 ("\303\230" . 216); latin capital letter o with stroke
 ("\303\231" . 217); latin capital letter u with grave
 ("\303\232" . 218); latin capital letter u with acute
 ("\303\233" . 219); latin capital letter u with circumflex
 ("\303\234" . 220); latin capital letter u with diaeresis
 ("\303\235" . 221); latin capital letter y with acute
 ("\303\236" . 222); latin capital letter thorn (icelandic)
 ("\303\237" . 223); latin small letter sharp s (german)
 ("\303\240" . 224); latin small letter a with grave
 ("\303\241" . 225); latin small letter a with acute
 ("\303\242" . 226); latin small letter a with circumflex
 ("\303\243" . 227); latin small letter a with tilde
 ("\303\244" . 228); latin small letter a with diaeresis
 ("\303\245" . 229); latin small letter a with ring above
 ("\303\246" . 230); latin small letter ae
 ("\303\247" . 231); latin small letter c with cedilla
 ("\303\250" . 232); latin small letter e with grave
 ("\303\251" . 233); latin small letter e with acute
 ("\303\252" . 234); latin small letter e with circumflex
 ("\303\253" . 235); latin small letter e with diaeresis
 ("\303\254" . 236); latin small letter i with grave
 ("\303\255" . 237); latin small letter i with acute
 ("\303\256" . 238); latin small letter i with circumflex
 ("\303\257" . 239); latin small letter i with diaeresis
 ("\303\260" . 240); latin small letter eth (icelandic)
 ("\303\261" . 241); latin small letter n with tilde
 ("\303\262" . 242); latin small letter o with grave
 ("\303\263" . 243); latin small letter o with acute
 ("\303\264" . 244); latin small letter o with circumflex
 ("\303\265" . 245); latin small letter o with tilde
 ("\303\266" . 246); latin small letter o with diaeresis
 ("\303\267" . 247); division sign
 ("\303\270" . 248); latin small letter o with stroke
 ("\303\271" . 249); latin small letter u with grave
 ("\303\272" . 250); latin small letter u with acute
 ("\303\273" . 251); latin small letter u with circumflex
 ("\303\274" . 252); latin small letter u with diaeresis
 ("\303\275" . 253); latin small letter y with acute
 ("\303\276" . 254); latin small letter thorn (icelandic)
 ("\303\277" . 255); latin small letter y with diaeresis
 (t nil)))); ASCII
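   ;; Walk the string one byte at a time. If the two bytes at position i
   ;; form a known UTF-8 sequence, emit the single Latin-1 character and
   ;; skip both bytes; otherwise copy one byte unchanged.
   ;; The string is padded so that SUBSEQ never runs past the end,
   ;; even for strings shorter than two characters.
   (do* ((padded (strcat str "  "))
         (i 0 (if character (+ 2 i) (1+ i)))
         (pair (subseq padded i (+ i 2)) (subseq padded i (+ i 2)))
         (character (cdr (assoc pair latin-1-map :test 'equal))
                    (cdr (assoc pair latin-1-map :test 'equal)))
         (latin-string ""))
        ((> i (- (length str) 1)) latin-string)
    (setf latin-string (format nil "~a~a" latin-string
    (if character
        (int-char character)
        (subseq str i (1+ i)))))))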
;; we could as well save under this name with e.g.
;; (s-save *track* 44100 (latin-1 "c:\\ältere Hörbücher.wav"))
(princ (latin-1 "c:\\ältere Hörbücher.wav"))
(terpri)
;; For the standard output, the original string should be written
;; (with princ or format)
(princ "c:\\ältere Hörbücher.wav" nil)
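
To double-check the escapes outside Nyquist, here is a small sketch in
Python (not XLISP). The helper name octal_escapes is made up for this
illustration. It prints every non-ASCII byte of a string as a
three-digit octal escape, which is the notation used in the table above.
Note that it does not escape backslashes or quotes, so it is only meant
for checking byte values.

```python
def octal_escapes(text: str, encoding: str) -> str:
    """Render text with every non-ASCII byte written as a \\ooo octal escape.

    Hypothetical helper for checking byte values only; backslashes and
    quotes already present in the text are passed through unchanged.
    """
    parts = []
    for byte in text.encode(encoding):
        if byte < 128:
            parts.append(chr(byte))        # plain ASCII stays readable
        else:
            parts.append("\\%03o" % byte)  # e.g. 246 -> \366
    return "".join(parts)


print(octal_escapes("hörbücher", "utf-8"))    # two escapes per umlaut
print(octal_escapes("hörbücher", "latin-1"))  # one escape per umlaut
```

The first line is the form that comes from the text box and the second
is the form the latin-1 function is meant to produce.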





2015-01-30 13:53 GMT+01:00, Steve the Fiddle <stevethefiddle@xxxxxxxxx>:
> Starting a new thread as this has drifted away from the original topic.
> Reply to Robert's comments in-line below:
>
> On 30 January 2015 at 06:31, Robert Hänggi <aarjay.robert@xxxxxxxxx> wrote:
>>> Nyquist itself is case insensitive ASCII - no Unicode support at all,
>>> and I think that is unlikely to change in the foreseeable future,
>>> It "may" be possible to allow Unicode characters in the header
>>> statements as those are ignored (commented out) for Nyquist. I'll
>>> mention when work starts on version 5 headers.
>>>
>>> Steve
>>>
>>
>> Well, it does actually translate my surname correctly after converting
>> to UTF-8 (which I did previously actually).
>> However, I'm not sure if the encoding would function with all platforms.
>>
>> I think that strings are already translated from/to Unicode.
>>
>> For example:
>> Within the Nyquist prompt, I have 3 possibilities to let my name
>> appear in the message box:
>>
>> (print "hänggi")
>> (print "h\303\244nggi")
>> (print "h\344nggi")
>>
>> Only the last one throws a warning out.
>> However, it's needed to e.g. access my document folder within XLISP.
>> In other words, Nyquist translates the first string into the second
>> one which has afterwards to be translated into the third one (file
>> operations only).
>
> When entering a string, Audacity reads the string and uses a wxWidgets
> function to convert it to 8 bit ISO-8859-1 characters.
> If wxWidgets encounters a string that looks like an invalid UTF-8
> character, Audacity generates a warning (not an error).
>
> In your example, Nyquist sees the second character of your name "hänggi"
> as an escape character \303, which is the first half of the two byte
> character code.
>
> Thus you can do something like this:
> (setq a (subseq "hänggi" 1 2)) ; escape code \303
> (setq b (subseq "hänggi" 2 3)) ; escape code \244
> (format nil "~a~a" a b)
>
> which should print the string "ä",
> which is two 8 bit ANSI characters converted (by Audacity) to a single
> 2 byte UTF-8 character.
>
> In the Nyquist Prompt you can do this using the Debug button:
>
> (print (setq a (subseq "hänggi" 1 2))) ; prints to debug escape code \303
> (print (setq b (subseq "hänggi" 2 3))) ; prints to debug escape code \244
> (format nil "~a~a" a b) ; returns string ä
>
> Steve
>
>
>>
>> Something similar happens after a crash. Audacity warns at recovery
>> time, although it works properly.
>> The reason is just that darn "ä" in my user path.
>>
>> I think I'll stick to the "ae" for the time being.
>>
>> Thanks Steve.
>>
>> Robert
>

