=========================================
WebNLG+: The Second WebNLG Challenge
First call for participation: training data now available
=========================================
https://webnlg-challenge.loria.fr/challenge_2020/
WebNLG goes bi-lingual (English, Russian) and bi-directional (generation
and parsing)!
It is our pleasure to announce that three years after the first edition,
the second WebNLG challenge (WebNLG+) is now open!
TASKS
The challenge comprises two main tasks:
1. RDF-to-text generation, as in WebNLG 2017 but with new data and in
   two languages;
2. Text-to-RDF semantic parsing: converting a text into the
   corresponding set of RDF triples.
For Task 1, given the four RDF triples shown in (a), the aim is to
generate a text such as (b) or (c). For Task 2, the opposite should be
achieved, i.e. to generate the triples in (a) starting from text as in
(b) or (c).
Example
(a) Set of RDF triples
<entry category="Company" eid="Id21" size="4">
<modifiedtripleset>
<mtriple>Trane | foundingDate | 1913-01-01</mtriple>
<mtriple>Trane | location | Ireland</mtriple>
<mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
<mtriple>Trane | numberOfEmployees | 29000</mtriple>
</modifiedtripleset>
</entry>
(b) English text
Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is
based in Ireland. It has 29,000 employees.
(c) Russian text
Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате
Висконсин, находится в Ирландии. В компании работают 29 тысяч человек.
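As an illustration only (this is not part of the official challenge
tooling), the entry in (a) could be read into (subject, property, object)
tuples with a short Python sketch. It assumes that every <mtriple> element
separates its three fields with " | ", as in the example; the function
name parse_entry is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example entry from (a), copied verbatim.
ENTRY_XML = """<entry category="Company" eid="Id21" size="4">
<modifiedtripleset>
<mtriple>Trane | foundingDate | 1913-01-01</mtriple>
<mtriple>Trane | location | Ireland</mtriple>
<mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
<mtriple>Trane | numberOfEmployees | 29000</mtriple>
</modifiedtripleset>
</entry>"""


def parse_entry(xml_text):
    """Return (category, triples) for a single <entry> element.

    Assumption: fields inside <mtriple> are separated by " | ".
    """
    entry = ET.fromstring(xml_text)
    triples = []
    for mtriple in entry.iter("mtriple"):
        # Split into at most three parts so that a "|" inside the
        # object value would not produce a fourth field.
        subject, prop, obj = (
            part.strip() for part in mtriple.text.split(" | ", 2)
        )
        triples.append((subject, prop, obj))
    return entry.get("category"), triples


category, triples = parse_entry(ENTRY_XML)
for triple in triples:
    print(category, triple)
```

For Task 1 these tuples would be the system input; for Task 2 they would
be the expected output for the texts in (b) and (c).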
INDICATIVE DATES
- 15 April 2020: Release of Training and Development Data
- 30 April 2020: Release of some simple preliminary evaluation scripts
to support development
- 30 May 2020: Release of the final evaluation scripts
- 13 September 2020: Release of Test Data
- 27 September 2020: Entry submission deadline
- 15-18 December 2020: Results of automatic and human evaluations and
system presentations at INLG 2020
DATA & REGISTRATION
For every input triple set, at least two reference texts are provided
for each target language. The data specifications are the same as for
WebNLG 2017.
The English WebNLG+ dataset for training comprises around 14,900 data
inputs and 40,000 data-text pairs for 16 distinct DBpedia categories:
* The 10 seen categories used in 2017: Airport, Astronaut, Building,
  City, ComicsCharacter, Food, Monument, SportsTeam, University, and
  WrittenWork.
  o ~5,600 texts were cleaned of misspellings, and missing triple
    verbalisations were added to some texts.
* The 5 unseen categories of 2017, which will now be part of the seen
  data: Athlete, Artist, CelestialBody, MeanOfTransportation, Politician.
* 1 new category: Company.
The new Russian dataset comprises around 8,000 data inputs and 20,800
data-text pairs for 9 distinct categories:
* Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food,
  Monument, SportsTeam, and University.
To register for the WebNLG+ task and download the WebNLG+ training and
development data, please fill in the form below:
https://framaforms.org/webnlg-challenge-2020-1586343023
The data, evaluation scripts and system outputs of WebNLG 2017 can also
be downloaded here:
https://webnlg-challenge.loria.fr/challenge_2017/
EVALUATION
For the evaluation phase, starting on July 17th, new test sets will be
released for all categories seen in the training data (see above), and
for several new unseen categories (categories not included in the
training data). For each task, a team can submit more than one system,
but only one output per system; in other words, multiple
submissions of the same non-deterministic system should be avoided.
Participants are free to choose which task and language they want to
provide results for (generation and/or semantic parsing, English and/or
Russian).
System outputs as well as baseline and human-produced outputs will be
evaluated.
For RDF-to-text generation, two evaluations will be carried out:
* Automatic evaluation, with standard n-gram-based and embedding-based
  metrics such as BLEU, METEOR, TER, ChrF++, BERTScore, etc.; global
  and detailed results will be provided (per DBpedia category, per
  input size, per category and input size, etc.).
* Human evaluation: system outputs will be assessed according to
  criteria such as grammaticality/correctness,
  appropriateness/adequacy and fluency/naturalness, by native speakers
  recruited on crowdsourcing platforms.
For Text-to-RDF semantic parsing, the automatic evaluation of three
aspects is foreseen, in terms of recall, precision and F1-score:
* Property identification.
* Subject and object identification.
* Full triple identification.
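To make the three aspects concrete, here is a minimal Python sketch of
how precision, recall and F1 could be computed for one text. It is not
the official evaluation script: it assumes exact string matching over
sets, and it treats "subject and object identification" as matching
(subject, object) pairs, which is only one possible reading. The names
prf and score_triples are hypothetical.

```python
def prf(predicted, reference):
    """Precision, recall and F1 between two collections, by exact match."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def score_triples(predicted, reference):
    """Score (subject, property, object) triples on the three aspects."""
    return {
        # Assumption: properties are compared on their own.
        "property": prf(
            {p for _, p, _ in predicted}, {p for _, p, _ in reference}
        ),
        # Assumption: subject and object are compared as a pair.
        "subject_object": prf(
            {(s, o) for s, _, o in predicted},
            {(s, o) for s, _, o in reference},
        ),
        "full_triple": prf(predicted, reference),
    }


reference = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
    ("Trane", "foundationPlace", "La_Crosse,_Wisconsin"),
    ("Trane", "numberOfEmployees", "29000"),
]
# Hypothetical system output: one triple missing, one wrong object.
predicted = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "United_States"),
    ("Trane", "numberOfEmployees", "29000"),
]

scores = score_triples(predicted, reference)
for aspect, (precision, recall, f1) in scores.items():
    print(f"{aspect}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this made-up case all three predicted properties are correct
(precision 1.0, recall 0.75), while only two of the three predicted
triples match the reference in full.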
Initially, preliminary evaluation scripts are released and can be used
to test the models. The final evaluation scripts and metrics used for
WebNLG+ will be provided at a later stage (see Indicative Dates).
MOTIVATION
The WebNLG data was originally created to promote the development of RDF
verbalisers able to generate short text and to handle micro-planning
(i.e., sentence segmentation and ordering, referring expression
generation, aggregation); the data for the first challenge included a
total of 15 DBpedia categories. The 2020 challenge aims first of all at
increasing the datasets (hence, the coverage of the verbalisers), by
covering more categories and an additional language. The other main
objective of the 2020 edition is to promote the development of knowledge
extraction tools, with a task that mirrors the verbalisation task.
[RDF Verbalisers] The RDF language—in which DBpedia is encoded—is widely
used within the Linked Data framework. Many large scale datasets are
encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData) and
official institutions increasingly publish their data in this format.
Being able to generate good quality text from RDF data would open the
way to many new applications such as making linked data more accessible
to lay users, enriching existing text with information drawn from
knowledge bases or describing, comparing and relating entities present
in these knowledge bases.
[Multilinguality] By providing a bilingual corpus (English and Russian),
we aim to promote the development of tools for languages other than
English and to allow for experimentation with pre-training and transfer
approaches (do the English verbalisations of RDF triples help in better
verbalising the triples in Russian?).
[Knowledge extraction] The new semantic parsing task opens up new lines
of research in several directions. Can it be used to bootstrap entity
linkers? How does RDF-based semantic parsing relate to other semantic
parsing tasks where the output semantic representations are lambda terms
or KB queries? Can semantic parsing be used to improve generation in
ways similar to the back translation approaches proposed in machine
translation?
ORGANISING COMMITTEE
* Thiago Castro Ferreira, Federal University of Minas Gerais, Brazil
* Claire Gardent, CNRS/LORIA, Nancy, France
* Nikolai Ilinykh, University of Gothenburg, Sweden
* Chris van der Lee, Tilburg University, The Netherlands
* Simon Mille, Universitat Pompeu Fabra, Barcelona, Spain
* Diego Moussallem, Paderborn University, Germany
* Anastasia Shimorina, Université de Lorraine/LORIA, Nancy, France
CONTACT
mail: webnlg-challenge@xxxxxxxx
website: https://webnlg-challenge.loria.fr/challenge_2020/
twitter: https://twitter.com/webnlg
REFERENCES
Creating Training Corpora for NLG Micro-Planners. C. Gardent, A.
Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of ACL 2017.
Vancouver (Canada).
The WebNLG challenge: Generating text from RDF data. C. Gardent, A.
Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of INLG
2017. Santiago de Compostela (Spain).
Building RDF Content for Data-to-Text Generation. L. Perez-Beltrachini,
R. Sayed and C. Gardent. Proceedings of COLING 2016. Osaka (Japan).
Enriching the WebNLG corpus. T. Castro Ferreira, D. Moussallem, E.
Krahmer and S. Wubben. Proceedings of INLG 2018. Tilburg (The Netherlands).
Creating a corpus for Russian data-to-text generation using neural
machine translation and post-editing. A. Shimorina, E. Khasanova and C.
Gardent. Proceedings of BSNLP Workshop 2019. Florence (Italy).