[freedict] Re: Poll: replace deu-eng / eng-deu

From: Piotr Bański <bansp@xxxxx>
To: freedict@xxxxxxxxxxxxx
Date: Thu, 10 Sep 2020 02:19:07 +0200

Hi Einhard,

Please excuse the brevity below:

On 09/09/2020 22:40, Einhard Leichtfuß wrote:
[...]

I'd like to ask one new question (complex) which is important right now:

C.12) Grouping of homographs

* In brief: Is superEntry ok?

It doesn't seem necessary at all and is on its way out, in general.

       * In long:
         * I'd like to group homographs somehow.
           * Allows ~word references to be potentially resolved to
             something (otherwise, there may be several target entries).

Why is that wrong? Such is the nature of homographs: they are individual lexemes that, mostly by historical coincidence, but often also because of generally arbitrary rules of lemmatization, are represented by the same string of characters.

They don't form a single natural unit (again, barring cases where historically they stem from a single source, but then, the grouping is historical and not synchronic, so it shouldn't play a role in a synchronic dictionary), so why squeeze them into one?

Lex0 uses stacked entry for that just because of its origin as a baseline format for digitized dictionaries -- such practices are encountered in various dictionaries, and often the priority of the encoder in such cases is not to make a good dictionary, but rather to encode the original dictionary as faithfully as possible. We don't need to follow such practices in creating new dictionaries.

         * TEI Lex-0 [12] suggests to use entry/entry for pure homographs
           / homonyms and entry/sense for more strongly connected
           elements, such as when the POS matches.
           * In normal TEI, entry/entry would become superEntry/entry.

By "normal TEI", you probably mean the Guidelines, where the Dictionaries chapter has long been identified as needing an update very badly.

         * I have tested, and superEntry/entry is validated by the
           Freedict schema; however, I see its use documented nowhere.

When I created the current ODD, I didn't pay attention to some details. I admit that the ODD sorely needs an update. (I also admit that I probably won't be able to allocate the time to do that any time soon).

         * Options:
           a) Do not group.
              * pro: Represents the structure of the Ding, which has no
                     such grouping
                * contra: The Ding program (!) does such grouping.

I don't see why the fact that some piece of software does something should be an argument for bending database design towards that choice. Software comes and goes, and with it go both the good and the arbitrary decisions of its developer(s).

Best wishes and thanks for your energy and work :-)

Piotr

           b) When only grouping homographs:
              b.1) superEntry/entry
              b.2) entry/sense
                   * Causes problems, with abbreviations and inflected
                     forms, which probably cannot be annotated on the
                     sense level.
           c) When only grouping "strongly connected" homographs:
              c.1) entry/sense
           d) When grouping by both:
              d.1) superEntry/entry/sense
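
For illustration, options b.1) and c.1) might look roughly as follows. This is only a sketch: the headwords, the xml:id values and the pos values are made up for the example and are not taken from the Ding or from an existing FreeDict dictionary.

```xml
<body>
  <!-- b.1) superEntry/entry: unrelated homographs stay separate entries -->
  <superEntry>
    <entry xml:id="sein.1"> <!-- hypothetical id -->
      <form><orth>sein</orth></form>
      <gramGrp><pos>v</pos></gramGrp>
      <sense><cit type="trans"><quote>to be</quote></cit></sense>
    </entry>
    <entry xml:id="sein.2"> <!-- hypothetical id -->
      <form><orth>sein</orth></form>
      <gramGrp><pos>pron</pos></gramGrp>
      <sense><cit type="trans"><quote>his</quote></cit></sense>
    </entry>
  </superEntry>

  <!-- c.1) entry/sense: same POS, so the readings become senses of one entry -->
  <entry xml:id="Schloss"> <!-- hypothetical id -->
    <form><orth>Schloss</orth></form>
    <gramGrp><pos>n</pos><gen>n</gen></gramGrp>
    <sense n="1"><cit type="trans"><quote>castle</quote></cit></sense>
    <sense n="2"><cit type="trans"><quote>lock</quote></cit></sense>
  </entry>
</body>
```

Option a) would simply place the two "sein" entries directly under body, without the superEntry wrapper.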

[12]
https://dariah-eric.github.io/lexicalresources/pages/TEILex0/TEILex0.html#nested-entries-vs-multiple-senses

Note that I currently target version 1.8.1 exclusively.

1.8.1. of what?

Of the Ding.

Ok, I haven't checked this. Though when I visited the page two years ago,
there was a stable edition with 80,000 headwords and a "development" version
with 2xx,xxx headwords. You are not targeting the tremendously smaller
version, are you?

v1.8.1: 197,766 lines
devel:  205,287 lines

(Lines may contain more than one headword.)

A) TEI

A.1) TEI Lex-0.  Have I understood correctly that it is a good idea to
     follow this standard [0]?  E.g.
     * a) <gram type="gender"/> instead of <gen/>.;
     * b) <usg> with @type (and possibly @norm)
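
To make the difference in a) concrete, here are two fragments for the same hypothetical feminine noun. The element content ("n", "f", "noun", "feminine") is chosen only for the example; which values a given schema accepts would need to be checked.

```xml
<!-- Dedicated elements, as in the <gen/> variant -->
<gramGrp>
  <pos>n</pos>
  <gen>f</gen>
</gramGrp>

<!-- TEI Lex-0 style: generic <gram> distinguished by @type -->
<gramGrp>
  <gram type="pos">noun</gram>
  <gram type="gender">feminine</gram>
</gramGrp>
```

Both carry the same information; the second form trades dedicated element names for a single element whose meaning depends on @type.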

I'm not sure about this. Michael, Piotr, do you have comments? If nothing
comes during the next 1-2 weeks, I would say rather stick to the current
version that is in our schemas. It is easy to transform and better if
consistent with other dictionaries.

By "our schemas", you mean the files

  fd-dictionaries : shared/freedict-P5.* ?

I have to admit that these files are hard to grasp for me (no prior
experience with XML).  Are these meant to serve as human-readable
documentation?  Is it worth the effort?

No, they are not documentation. They are symlinked into each dictionary and
"make validation" will use them.
We have not used Lex-0 in our projects yet and I think using a consistent, but
battle-proven encoding is better for your thesis. Our conversion style sheets
and tools are not prepared for Lex-0.

Otherwise, I will continue to rely on the Wiki, the (example) TEI files
and the TEI docs (and your answers).

Yes, I think this is better.

I actually like the TEI Lex-0 standard, in particular:

  i)   b) from above:  a fixed list of good @type's (see the
       comparison table at [10]).  How would I represent
       @type="textType" (e.g. bibl., poet., admin., journalese) or
       @type="attitude" (e.g. derog., euph.), which do not have an
       equivalent in the TEI suggested @type's?
       ? Should I just use these as suggested in TEI Lex-0, thereby
         creating a mixture between TEI and TEI Lex-0?

[…]

It all boils down to somebody reading the document, defining our specific
requirements and potential modifications **and** implementing it.

I intend to use the TEI Lex-0 guidelines as a supplement to TEI
Freedict, that is, wherever they do not conflict.

Maybe such guidelines could also be used to extend the Freedict
recommendations.

A.5) Quantified (or similar) usage annotations
     * Ex.: "mainly Am."
     * Ex.: "bes. Süddt.", "especially Am."
     ? How to represent the determiner?

What is the determiner here? I thought determiners are for compound phrases
such as lemon tree.

"mainly", "bes.", "especially".  I thought these were determiners.

Sorry, I missed the point. I was unsure about determiners and read the
Wikipedia article, but apparently the wrong one. There is no encoding for this
ATM, I think. What is the Lex-0 suggestion? :) Isn't this part of the usage
anyway? I probably would have picked `<usg type="hint">mainly Am.</usg>`, but
maybe that's too vague.

TEI Lex-0 suggests to use an attribute, but not which (there is a TODO
in the docs).  None of the <usg> annotations really fit IMO, maybe @subtype?
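
The two ideas floated so far could be written down as below. This is a sketch only: the subtype value "mainly" is invented for the example, and whether @subtype on <usg> validates against the Freedict schema has not been checked here.

```xml
<!-- Variant 1: keep the qualifier in the text and type the whole thing as a hint -->
<usg type="hint">mainly Am.</usg>

<!-- Variant 2 (assumption): move the qualifier into @subtype, keep the region as content -->
<usg type="geo" subtype="mainly">Am.</usg>
```

Variant 1 loses the information that this is a regional marker; variant 2 keeps it but needs a documented list of subtype values to be useful for parsing.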

A.6) Dialect / language annotations.
     a) Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
     b) Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
     d) Ex.: "[French]", "[Lat.]"
     ? Represent as <usg type="geographic">?
       * According to TEI Lex-0: "marker which identifies the place or
         region where a lexical unit is mainly used"
         * Matches c) only.
     ? Separate d)?  And represent how?

[…]

In any case, I see subtle differences and would suggest either to
be sloppy and group all these as a sort of geographic identifier (only
French/Lat. don't fit)

What to do with French/Lat. then?

What about picking one of
https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-usg.html ?

By <https://tei-c.org/release/doc/tei-p5-doc/en/html/DI.html#DITPUS>, it
should be @type=lang.

If I understand this slightly confusing page, it would in principle be fine to
choose any type. If that were the case, I would at least document the choice
in the TEI header. I just checked the dict style sheets: they ignore the type
completely ;). It is really a parsing help, which strengthens the argument to
document your choice in the header.

Regarding where to document: in the fileDesc tag, you can have a notesStmt:

```xml
<notesStmt>
   <note type="status">small</note>
   <note xml:lang="de">
     <list><item>blah</item></list>
   </note>
</notesStmt>
```

You can use either paragraphs (p) or lists as above and have multiple notes. I
think you can add this straight away.

So I would just add plain text, such as
   <item>@type="lang" indicates a language</item> ?

A.9.2) Date
       * The Ding is annotated with both a version and a date.
       ? How/whether to represent the date?

In publicationStmt, there can be:

     <date when="2017-11-18">Nov 18, 2017</date>

Shouldn't this be the date of generation of the TEI file, which is
distinct from the Ding's publication?

Regards,
Einhard

--
FreeDict - Free And Open Dictionaries
Manage your subscription at https://www.freelists.org/list/freedict
Wiki: https://github.com/freedict/fd-dictionaries/wiki
Web: http://freedict.org
