[freedict] Re: Poll: replace deu-eng / eng-deu

  • From: Piotr Bański <bansp@xxxxx>
  • To: freedict@xxxxxxxxxxxxx
  • Date: Sat, 9 May 2020 15:36:11 +0200

PS. Concerning +Gen, please also have a look at this discussion:

https://github.com/DARIAH-ERIC/lexicalresources/issues/40

On 09/05/2020 14:43, Piotr Bański wrote:

Hi all,

Apologies for the brevity:

* Combining certain features of one (inflected forms) with the usage examples from the other sounds like a nice project. But it would be a full-fledged project, especially if one wanted to keep the info about the origin of individual bits in order to be able to update them. Not a weekend task, especially given Sebastian's research mentioned below -- the DING appears to require human (rather than machine) parsing.

I have no students currently whom I could task with this. How about others?

 >> Furthermore, there are sometimes further annotations, in parentheses
 >> (<(>, <)>).  Like the []-annotations they should not be part of the
 >> keyword, but rather part of an associated value, similar to gramGrp.
 >> Is there a way to specify such in TEI?

Sure, they seem to be, roughly speaking, <usg> (usage) labels, probably of various types. I would currently defer to the emerging TEI Lex0 standard for a well worked-out set of recommendations. Please have a look at
https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html

More specifically, usage is handled at
https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#index.xml-body.1_div.6

 >> By the way, can the difference between transitive and intransitive verbs
 >> ({vt}/{vi}) be encoded in TEI?

Sure, with the <subc> element:
https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#TEI.subc
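
To make the <subc> suggestion a bit more concrete, here is a minimal Python sketch that builds a <gramGrp> carrying the transitivity of a verb. The helper gram_grp, the mapping DING_VERB_TAGS and the values "verb", "transitive" and "intransitive" are assumptions made for illustration only; the actual value set should be checked against TEI Lex0, and the TEI namespace is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Ding verb tags to <subc> values; the values
# are assumptions for this sketch, not taken from TEI Lex0.
DING_VERB_TAGS = {"vt": "transitive", "vi": "intransitive"}


def gram_grp(pos, subc=None):
    """Build a <gramGrp> element with <pos> and an optional <subc> child."""
    gram = ET.Element("gramGrp")
    ET.SubElement(gram, "pos").text = pos
    if subc is not None:
        ET.SubElement(gram, "subc").text = subc
    return gram


print(ET.tostring(gram_grp("verb", DING_VERB_TAGS["vt"]), encoding="unicode"))
# prints <gramGrp><pos>verb</pos><subc>transitive</subc></gramGrp>
```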

 >> Also, can {+Gen.} be encoded in TEI?

In the sense that it combines with a noun in the Genitive? <colloc> would be the way, except that there should be a way to distinguish between literal and 'featural' collocates and I can't remember if there is an issue open on that, currently. It would probably deserve an attribute on <colloc>, although the '+' sign alone helps in the parsing.

https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#TEI.colloc

HTH, best wishes,

    Piotr

On 09/05/2020 12:34, Sebastian Humenda wrote:
Hi

Einhard Leichtfuß schrieb am 08.05.2020, 23:57 +0200:
can somebody say anything on how the quality of the two sources compare,
disregarding how easily they can be parsed?

Comparing the **current** deu-eng with the deu-eng of WikDict, the sizes are
roughly 82,000 for DING and 52,000 for WikDict.

To me, the current dictionary always seemed quite good, however I see
the problems inherent with the format of the Ding dictionary source.

As long as you dump everything as plain text into the TEI, the quality is
good. However, there are also a lot of incorrect entries.

The WikDict dictionary is nice since it gives examples and usage hints for
every word and translation. However, it does not seem to list inflected forms
of verbs, in contrast to the DING dictionary.

Is it worth considering having two different dictionaries for the
same (ordered) pair of languages?

Not really, I would say. What is the justification? We cannot decide? :)
Apart from that, most of our tooling would need to be adjusted because we simply
do not consider this case.

I have just inspected the Ding dictionary sources a little.  I could not
find any description of the format, so I had to guess from context.  The
format should roughly follow the following EBNF:
[…]

You can find an outdated specification here:
http://dict.tu-chemnitz.de/doc/syntax.html

My failed attempt at an implementation is on the branch extend_ding2tei_grammar_parsing_capabilities in importers/ding2tei. It doesn't look nice, because I wanted to get it working before cleaning up and never succeeded.

A few things that I stumbled over and can still remember:

-   Curly braces:
     -   Can contain the part of speech (POS), {n} or {v}, but also transitivity
         information.
     -   In the case of nouns, contain the gender or number (plural) instead.
     -   Can contain inflected forms. Inflected forms may be separated by a comma
         or a semicolon. Even a mix of both occurs when there are different
         spellings. Some "alternate" spellings are even identical.
-   Parentheses:
     -   Used for omissions: (to) do (smth.)
     -   Can contain a whole sentence as a usage hint.
-   | as separator:
     -   Can distinguish between inflected forms.
     -   Can give usage examples.
     -   Rarely used for synonyms.
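
The bracket kinds listed above can be pulled apart mechanically. The following Python sketch is only an illustration under stated assumptions, not the ding2tei code: the function name split_unit and the dictionary keys are made up, nested brackets are not handled, and the sketch does not decide what an annotation means (for instance, whether a parenthesis is an omission or a usage hint).

```python
import re

# One alternative per bracket kind: {...}, [...] and (...).
# Nested brackets are not supported in this sketch.
ANNOTATION = re.compile(r"\{([^{}]*)\}|\[([^\[\]]*)\]|\(([^()]*)\)")


def split_unit(unit):
    """Separate the plain text of one unit from its bracketed annotations."""
    braces, brackets, parens = [], [], []
    for match in ANNOTATION.finditer(unit):
        if match.group(1) is not None:
            braces.append(match.group(1))
        elif match.group(2) is not None:
            brackets.append(match.group(2))
        else:
            parens.append(match.group(3))
    # Remove all annotations and normalise the remaining whitespace.
    text = " ".join(ANNOTATION.sub(" ", unit).split())
    return {"text": text, "braces": braces, "brackets": brackets, "parens": parens}


print(split_unit("Weberei {f} (Tätigkeit) [textil.]"))
```

Note that for "(to) do (smth.)" this yields the text "do" with both parentheses reported as annotations, which shows why the omission case needs separate treatment.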

The problem really isn't the syntax, but the semantic ambiguity. Consider
this:

     Weben {n}; Weberei {f} (Tätigkeit) [textil.] :: weaving

Weben and Weberei are, at best, synonyms.

     Weber {m}; Weberin {f} | Weber {pl}; Weberinnen {pl} :: weaver | weavers

Here, the semicolon separates two variants of the same word but with different
genders. If I dug, I would certainly find an example where the | is used
differently.

     weben; wirken {vt} [textil.] | webend; wirkend | gewebt; gewoben; gewirkt | er/sie webt | ich/er/sie webte; ich/er/sie wob | er/sie hat/hatte gewebt; er/sie hat/hatte gewoben | ich/er/sie wöbe :: to weave {wove; woven} | weaving | woven | he/she weaves | I/he/she wove | he/she has/had woven | I/he/she would weave

Here, two entries are mingled. Wirken means the same only in specific contexts
and should really be a separate entry. How would this be handled?
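
For what it is worth, the purely syntactic split of such lines is easy, which is why the syntax is not the real problem. A minimal Python sketch, assuming that "::" separates the two languages, "|" separates groups and ";" separates units, and that separators inside {}, [] or () must be ignored (as in {wove; woven}). The names split_outside and split_line are hypothetical helpers, not part of ding2tei:

```python
def split_outside(text, sep):
    """Split text at sep, ignoring separators nested in {}, [] or ()."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "{[(":
            depth += 1
        elif ch in "}])":
            depth = max(depth - 1, 0)
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return parts


def split_line(line):
    """Return (source_groups, target_groups); each group is a list of units."""
    source, target = line.split("::", 1)

    def groups(side):
        return [split_outside(group, ";") for group in split_outside(side, "|")]

    return groups(source), groups(target)


src, tgt = split_line(
    "Weber {m}; Weberin {f} | Weber {pl}; Weberinnen {pl} :: weaver | weavers"
)
print(src)
print(tgt)
```

The result still says nothing about whether the units in a group are synonyms, gender variants or two mingled entries; that decision remains semantic.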

Unfortunately, the above EBNF is ambiguous; in particular, it is unclear
when [] and {} annotations belong to a single unit or a group.  It seems
that {} in most cases applies to a whole group and that
[], if placed after {}, also applies to the whole group, and otherwise
to a single unit.  (There is at least one exception to this rule: 'Acht
{f}; Achter {m} [Ös.] [Schw.]')

Yes indeed. Sometimes the | binds more strongly, sometimes the ;.

Furthermore, there are sometimes further annotations, in parentheses
(<(>, <)>).  Like the []-annotations they should not be part of the
keyword, but rather part of an associated value, similar to gramGrp.  Is
there a way to specify such in TEI?

I am not sure, I would hope that there is a free text field for lazy encoding,
but would need to look it up myself.

By the way, can the difference between transitive and intransitive verbs
({vt}/{vi}) be encoded in TEI?

Yes, using an ontology, I think. Karl uses this in the WikDict dictionaries
and can explain it better.

Also, can {+Gen.} be encoded in TEI?

@Piotr?

In any case, Einhard, is there any chance that you would be willing to have a
look? I think if you are smart enough to ignore the corner cases for the sake
of having a stable parsing experience, this would be a great plus.



--
FreeDict - Free And Open Dictionaries
Manage your subscription at https://www.freelists.org/list/freedict
Wiki: https://github.com/freedict/fd-dictionaries/wiki
Web: http://freedict.org
