Hi Sebastian,

On 30/12/2020 14:51, Sebastian Humenda wrote:

Hi Einhard

Einhard Leichtfuß schrieb am 28.12.2020, 20:42 +0100:
Specific notes:
* eng-pol: only contains references to other TEI files, no entries.
   (cannot check)

Yup, that one is nasty. TBH, I'm not sure how much sense it makes to port our
tooling to this dictionary. On the one hand, this allows splitting huge files.
On the other, it always creates this special case. Just thinking aloud, thoughts

Sorry (and not sorry ;-)) about that special case -- it seemed far more practical to retain the split into individual letters given how unwieldy the entire file was (recall that XML editors were less powerful back then), and just include the letters by XML mechanisms.

I'm tempted to say that this special case might be just the first example of how we could deal with huge dictionaries. I recall some beasts; I think at least one Arabic dictionary was enormous right from the beginning.

Maybe -- and I'm just musing aloud -- the eng-pol could exemplify our strategy for huge dictionaries: split them up into manageable documents and reassemble them transparently for processing / display.

Let me stress: I'm just voicing an option, not trying to push for it.
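
To make the "reassemble transparently" idea a bit more concrete, here is a minimal sketch in Python. It assumes the per-letter files are pulled in via XInclude (I only said "XML mechanisms" above, so treat that as an assumption), and the function name and file layout are hypothetical, not part of our tooling.

```python
from pathlib import Path
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude


def load_split_dictionary(main_path):
    """Parse a main TEI file and expand its xi:include references in place.

    Assumption: the included files sit next to the main file and are
    referenced by relative href, e.g. <xi:include href="a.tei"/>.
    """
    main_path = Path(main_path)

    def loader(href, parse, encoding=None):
        # Resolve each href relative to the directory of the main file.
        target = main_path.parent / href
        if parse == "xml":
            return ET.parse(target).getroot()
        return target.read_text(encoding=encoding or "utf-8")

    root = ET.parse(main_path).getroot()
    # Replaces every xi:include element with the root of the included file.
    ElementInclude.include(root, loader=loader)
    return root
```

A checker could then iterate over the entries of the returned tree exactly as it does for a single-file dictionary, so the split would stay invisible to the rest of the pipeline.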

Best wishes,


* fra-bre: has a lot of empty <pron> tags (<pron></pron>) at the end.

Empty orth - empty pron. Easy to fix, but I don't have time. See

* jpn-*: I can't really help.  Length checking does not seem to work.
   Unsure, whether ok.
   * Most unexpected: first entry from jpn-eng:

That's an espeak-ng oddity. It seems to speak Japanese just fine, but speaks
letters as "japanese letter" in English. It is up to them to fix this. Given
that this affects only individual letters, I suppose this is fine.

* jpn-rus: Some <pron> tags are empty (incl. 1st, 4th; i.e.,

eSpeakNG produces an empty transcription. In European languages, this is
usually caused by punctuation, therefore I believe this is alright. I
added a check that skips empty pron elements.
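
For illustration, such a check could look roughly like the sketch below. This is not the actual fd-tools code; the helper name add_pron and the use of ElementTree are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def add_pron(form, transcription):
    """Append a <pron> child to `form` unless the transcription is empty.

    Returns the new element, or None if nothing was added.
    """
    text = (transcription or "").strip()
    if not text:
        # Empty or whitespace-only output (e.g. for punctuation): skip it.
        return None
    pron = ET.SubElement(form, "pron")
    pron.text = text
    return pron
```

With this shape, a headword for which the synthesizer returns nothing simply ends up without a pron element instead of with an empty one.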

Specific notes (most likely problems with espeak-ng):
* nld-<some>: "à" is provided with pronunciation "ˌaːɣrˈaːvə", also

Yes, this is an eSpeakNG issue: it says "a grave". We have to leave it that
way or report it to eSpeakNG.

Less specific notes:
* More dictionaries containing embedded slashes:
   ita-bul, ita-ell, ita-fin, ita-jpn, ita-pol, ita-rus, ita-swe,
   ita-tur, nld-fin, nld-itam, nld-lat, nld-lit, nld-por, nld-rus,
   nld-spa, nld-swe

Thanks for spotting, that's a WikDict bug :).

That means that we can release the fd-tools, since the generator seems stable.

Thanks again for the help, I would never have found all these issues.


