Hey Sebastian,
Thanks for the warm welcome! I definitely think I could help take on the
Chinese dictionaries (at the very least, CEDICT), seeing I’ve written both a
tool that converts from FreeDict to ODICT and CEDICT to ODICT. I can’t imagine
adjusting the latter to port to TEI instead would be too hard. In fact, if I
could be added to the FreeDict org I could fork the repo and open source the
conversion script. As for the work surrounding schemas, I’m actually pretty
unfamiliar with the TEI standards, so I’d need more context on what the schema
refresh is meant to achieve.
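
As a rough illustration of what porting a CEDICT converter to TEI could involve, here is a minimal Python sketch. It assumes the usual CC-CEDICT line layout (traditional, simplified, pinyin in square brackets, slash-separated glosses). The function names are hypothetical, and the element layout (entry, form/orth/pron, sense/cit/quote) is only an assumed approximation of a TEI dictionary entry that has not been checked against the FreeDict schema.

```python
import re
import xml.etree.ElementTree as ET

# Assumed CC-CEDICT line layout: TRAD SIMP [pin1 yin1] /gloss 1/gloss 2/
CEDICT_LINE = re.compile(r"^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+/(.+)/\s*$")


def parse_cedict_line(line):
    """Parse one CC-CEDICT line into a dict; return None for comments or junk."""
    line = line.strip()
    if line.startswith("#"):
        return None
    match = CEDICT_LINE.match(line)
    if match is None:
        return None
    traditional, simplified, pinyin, glosses = match.groups()
    return {
        "traditional": traditional,
        "simplified": simplified,
        "pinyin": pinyin,
        "glosses": glosses.split("/"),
    }


def to_tei_entry(record):
    """Build a TEI-style <entry> element (assumed layout, not schema-checked)."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    # Using the simplified form as the headword is an arbitrary choice here.
    ET.SubElement(form, "orth").text = record["simplified"]
    ET.SubElement(form, "pron").text = record["pinyin"]
    for gloss in record["glosses"]:
        sense = ET.SubElement(entry, "sense")
        cit = ET.SubElement(sense, "cit", attrib={"type": "trans"})
        ET.SubElement(cit, "quote").text = gloss
    return entry


if __name__ == "__main__":
    sample = "中國 中国 [Zhong1 guo2] /China/Middle Kingdom/"
    record = parse_cedict_line(sample)
    print(ET.tostring(to_tei_entry(record), encoding="unicode"))
```

A real converter would also need the TEI header, language attributes, and whatever conventions the FreeDict schema requires for traditional versus simplified forms.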
As for point #3, I think there is an opportunity for ODict to be advantageous
here. Similar to TEI, it’s layout-agnostic and semantic, but can be compiled
down to a binary for easier storage and access. Plus, the CLI that it comes
with can perform instant ad-hoc entry lookups right in the terminal. That said,
I realize how well adopted and robust the TEI format is, and I hope to
eventually extend ODict to offer the same level of granularity TEI has. Right
now ODict is very dictionary-centric, whereas I feel TEI is designed for
multiple kinds of lexical data (dictionaries, glossaries, books, etc.).
Unrelated, but something I think would make for a cool addition to the FreeDict
site, and a project I was hoping to eventually put some time into as an offshoot
of ODict, is a way to visually look up and add terms to FreeDict dictionaries.
Something similar to Wiktionary, with all data being stored as structured,
semantic markup (unlike Wiktionary’s actual dumps) and fully downloadable. It
could help to automatically increase the robustness of FreeDict’s dictionaries
and could become a definitive lexical resource for language learners and
educators. It’s part of the reason that ODict has a “merge” utility in its CLI
(I figured people could enter new data, and have it automatically be merged
with the existing dictionary binary and become available for download).
Anyway, it’s just an idea I had. Something sorely missing from the internet
IMO, and the reason projects like Dbnary exist.
Thanks so much again for the response, and I’ll start checking out the process
for some of those Chinese dictionary conversions!
Tyler Nickerson
Founder, Linguistic
https://www.golinguistic.com
On Jun 20, 2021, 7:38 AM -0700, Sebastian Humenda <shumenda@xxxxxx>, wrote:
Hi Tyler
Tyler Nickerson wrote on 25.05.2021, 15:44 -0700:
I’ve been a lurker on this mailing list for a while now so I thought I
might go ahead and introduce myself (as well as offer to help with the
project). I’m Tyler, a designer, developer, and language enthusiast
currently based in the Bay Area!
Great combination!
My software has actually started relying on FreeDict pretty heavily
recently, as it uses your dictionaries to help language learners better
comprehend vocabulary words during live conversation.
That's pretty cool. That's the kind of application beyond just dictionaries
that should be doable with our dictionaries. I just had a look, French is not
yet supported :).
As a result, I felt compelled to reach out and see if I could help out in
any way. I have an extensive academic background in computer science and a
good deal of UI/web design experience. I’m also fairly proficient in
Mandarin.
Oh certainly. We need help in many areas, and I hope my late response
did not scare you off.
I compiled a quick list of things that I'd like to do if time permitted:
1. On the dictionary side, we have a long list of dictionaries to look at. In
particular, the Chinese dictionaries seemed like low-hanging fruit to me:
- add chinese-hungarian dictionary #27
https://github.com/freedict/fd-dictionaries/issues/27
- add chinese-english dictionary #26
https://github.com/freedict/fd-dictionaries/issues/26
- add chinese-german dictionary #25
https://github.com/freedict/fd-dictionaries/issues/25
2. Another important issue is the refreshing of the schemas. We are still
relying on a somewhat old dialect of TEI P5. Piotr opened an issue but I
would imagine a helping hand would be appreciated:
refreshing the schemas: freeze the p5subset, add it to our vc, update the
syntax in the ODD schema #62
https://github.com/freedict/fd-dictionaries/issues/62
3. Discuss and implement a new conversion strategy
In short, we have XSL style sheets that support conversion into
the plain text format for the Dict server. The target format is outdated
and the style sheets are so slow that they are becoming unusable.
Our Slob export is done by tei2slob, a tool that understands a different
subset of what is defined in the schemas. A rewrite should work on a more
uniform level. Therefore, one could bring PyGlossary up to date with our
version of the FreeDict TEI P5. As Karl pointed out, this could be a
difficult task because PyGlossary is not built for semantic markup and
hence it seems it doesn't have a powerful intermediate representation that
would suit our needs. So before starting this task, a good deal of
research on requirements and existing code would be required.
I'm not sure which of these tasks could potentially be of interest. In terms
of priority, you can read this list backwards.
I was actually curious - I know CEDICT and ECDICT are two very popular
Chinese <> English dictionaries, and was wondering if their licensing would
allow FreeDict to offer Chinese dictionaries based on them.
What is the actual licence? There is CC-CEDICT with CC-BY-SA-4.0 that would be
a fit for FreeDict. How does the licence compare to the dictionary mentioned
in #25?
And one final note – I’ve also developed a fully open-source dictionary
file format that, unlike a lot of others, isn’t based on underlying HTML,
has an open spec, compiles to binary from an XML markup, and features a
case-insensitive entry lookup baked into its API. I’d love to help FreeDict
officially offer dictionaries in this format, as I’ve already written a
repo that converts the TEI source files into .odict binaries.
That would be a great fit for point 3. of my list. We particularly like TEI
due to its recognition in the linguist community and due to its semantic
markup. It is a good pivot format. It is much easier to convert from semantic
markup to non-semantic markup than vice versa. The dictionary format of yours
is hence interesting indeed.
Anyway, just wanted to drop in and say hi to you all! Let me know if I can
be of assistance :)
Thanks, please let us resume the discussion!
Cheers
Sebastian