Hi Einhard
Einhard Leichtfuß wrote on 29.08.2020, 3:13 +0200:
I have started working on the importer a while ago. In fact, it has
become the subject of my bachelor's thesis.
The latter is also the reason why I have neither published any code yet
nor contacted you again earlier, since I was unsure what I was allowed
to do in the context of my bachelor's thesis.
Once I have write access to the tools repository, I will publish the
current state of my work. (Unless you were to agree with me that a
separate git repository is more suitable.)
I do have some questions. A lot, in fact. I hope not to overwhelm you
with them. Just ignore some of them if there are too many.
Since this is now my bachelor's thesis, I need to ask you to refrain
from giving me any coding-specific advice (or else I'd have to cite you
in my bachelor's thesis). Please also do not publish any changes to my
code before I submit my bachelor's thesis (September or October 2020).
Note that I currently target version 1.8.1 exclusively.
A) TEI
A.1) TEI Lex-0. Have I understood correctly that it is a good idea to
follow this standard [0]? E.g.
* a) <gram type="gender"/> instead of <gen/>;
* b) <usg> with @type (and possibly @norm)
A.2) Verb & Transitivity annotation.
* In a HowTo [1], it is suggested to use v, vt, vi, vti, i.e., to merge
all such information into a single token.
* In an example [2], I see "<pos>v</pos><subc>tr</subc>", which
would also adhere to TEI Lex-0, in contrast to the former.
? So, which to use? (I prefer the latter, if that matters.)
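For illustration, the two encodings in A.2 can be related mechanically.
The following is a minimal sketch, not part of the importer: the names
MERGED_TO_SPLIT, split_verb_token and render are hypothetical, and the
value "intr" as well as the choice to emit two <subc> elements for "vti"
are assumptions, not taken from the HowTo [1] or the example [2].

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from merged tokens to (pos, subcategorisation values).
# Only "v" and "tr" appear in the example [2]; "intr" is an assumed counterpart.
MERGED_TO_SPLIT = {
    "v": ("v", []),
    "vt": ("v", ["tr"]),
    "vi": ("v", ["intr"]),
    "vti": ("v", ["tr", "intr"]),
}


def split_verb_token(token):
    """Return a <gramGrp> element holding <pos> and zero or more <subc>."""
    pos, subcs = MERGED_TO_SPLIT[token]
    gram_grp = ET.Element("gramGrp")
    ET.SubElement(gram_grp, "pos").text = pos
    for value in subcs:
        ET.SubElement(gram_grp, "subc").text = value
    return gram_grp


def render(token):
    """Serialise the split representation as a string without XML declaration."""
    return ET.tostring(split_verb_token(token), encoding="unicode")


print(render("vt"))
```

Since the split form carries at least as much information as the merged
token, a conversion in this direction loses nothing.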
A.3) IPA Pronunciation. The current deu-eng.tei in the Freedict
repository contains <pron> tags. I assume that these were
autogenerated, since the Ding does not contain such information.
If I am right, how can I have that information autogenerated?
A.4) Normalization of usage annotations
* Recommended by TEI Lex-0.
* different languages (e.g. "[Sprw.]" ~ "[prov.]")
* same language (e.g. "[coll.]" ~ "[slang]")
? Should they be normalised to a single label?
? Should they be normalised to some standard labels?
* ISO 12620 [4,5,6] (full standard only commercially available)
* The usage of @norm in <usg> might make that less of an issue.
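The @norm idea from A.4 could look roughly as follows. This is only a
sketch under stated assumptions: the table NORM_LABELS, the functions
make_usg and render_usg, and the normalised value "proverb" are all
hypothetical and not taken from TEI Lex-0 or ISO 12620. The source label
stays as element text; only the attribute carries the normalised form.
Whether "[coll.]" and "[slang]" should share a value is left open here,
so unknown labels simply get no @norm.

```python
import xml.etree.ElementTree as ET

# Hypothetical normalisation table: source label -> normalised value.
# Only the cross-language pair from A.4 is listed.
NORM_LABELS = {
    "Sprw.": "proverb",
    "prov.": "proverb",
}


def make_usg(label, usg_type=None):
    """Build a <usg> element; add @norm only when the label is in the table."""
    usg = ET.Element("usg")
    if usg_type is not None:
        usg.set("type", usg_type)
    norm = NORM_LABELS.get(label)
    if norm is not None:
        usg.set("norm", norm)
    usg.text = label  # keep the original label verbatim
    return usg


def render_usg(label, usg_type=None):
    """Serialise the <usg> element as a string without XML declaration."""
    return ET.tostring(make_usg(label, usg_type), encoding="unicode")


print(render_usg("Sprw."))
print(render_usg("coll."))
```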
A.5) Quantified (or similar) usage annotations
* Ex.: "mainly Am."
* Ex.: "bes. Süddt.", "especially Am."
? How to represent the determiner?
A.6) Dialect / language annotations.
a) Ex.: "[Br.]", "[Am.]", "[Ös.]", "[Sächs.]"
b) Ex.: "[South Africa]", "[Hessen]", "[Berlin]", "[Wien]"
d) Ex.: "[French]", "[Lat.]"
? Represent as <usg type="geographic">?
* According to TEI Lex-0: "marker which identifies the place or
region where a lexical unit is mainly used"
* Matches c) only.
? Separate d)? And represent how?
A.7) Abbreviations.
a) Headwords, which are annotations.
* rare
b) Annotated on headwords.
? How to represent in TEI?
* The TEI documentation contains an example [7] with both
<form type="abbrev"> and <form type="full">, in the same
<entry>.
* I remember though that within the Freedict project multiple
<form> tags inside <entry> are frowned upon.
B) Linguistics