If we want to keep all presentational aspects out, the solution seems to be:
* Remove exclamation marks
* Remove pronouns
* If those forms are to be presented in a dictionary, some smart processing
has to be done
* Since smart processing is needed anyway, there is no harm in dumping in
all forms, even obscure ones (or are we concerned with size?)
I assume auxiliary verbs ("werde" in the example below) are counted as part
of the form itself and should stay. They are different depending on the
verb and can't be re-added without further input, so this is also more
Does anyone disagree? Would such data be actually helpful to anyone or will
it just bloat the dictionary, since we won't be able to include it in our
exported dictionaries easily?
Sorry for drawing this discussion out over such a long time,
On Thu, Jun 18, 2020 at 10:15 PM Piotr Bański <bansp@xxxxx> wrote:
If they are presentational, and they are, they don't belong in the
lexical database, as Sebastian has indicated.
On 18/06/2020 21:19, Karl Bartel wrote:
From a data perspective, an exclamation mark should not be part of the
headword but be inserted by the converter. Adding this should be
straightforward. Another possibility is to explain the encoding in the TEI
so that people who use the data in an automated fashion know that the
exclamation mark needs to be stripped.
We could add tags to mark certain parts of the entries as purely
presentational. Then users of the TEI dictionaries could choose to keep
or strip those parts however they want.
<form type="infl"><fd:present>er/sie/es </fd:present>werde stehlen</form>
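To illustrate how a consumer of the data might choose between keeping and stripping such tagged parts, here is a minimal Python sketch. It is not an existing FreeDict tool: the namespace URI bound to the fd prefix and the render helper are hypothetical, since the thread defines neither, and it assumes that everything outside fd:present is lexical data.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "fd" prefix; the thread does not define one.
FD_NS = "https://example.org/freedict/presentational"

# The form from the example above, with the assumed namespace declared.
SAMPLE = (
    f'<form xmlns:fd="{FD_NS}" type="infl">'
    "<fd:present>er/sie/es </fd:present>werde stehlen</form>"
)


def render(form, keep_presentational):
    """Return the text of a <form>, with or without its fd:present parts."""
    parts = [form.text or ""]
    for child in form:
        is_presentational = child.tag == f"{{{FD_NS}}}present"
        if keep_presentational or not is_presentational:
            parts.append("".join(child.itertext()))
        # Text following a child element is treated as lexical data.
        parts.append(child.tail or "")
    return "".join(parts).strip()


form = ET.fromstring(SAMPLE)
display_form = render(form, keep_presentational=True)   # for a printed dictionary
lexical_form = render(form, keep_presentational=False)  # for automated processing
```

With this input, display_form keeps the pronouns ("er/sie/es werde stehlen"), while lexical_form contains only the inflected form with its auxiliary verb ("werde stehlen").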