[euralex] WebNLG+: The Second WebNLG Challenge, First call for participation: Training data now available

From: Diego Moussallem <diego.moussallem@xxxxxxxxxxxxxxxx>
To: lod2@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx, public-ontolex@xxxxxx, public-bpmlod@xxxxxx, public-ld4lt@xxxxxx, public-lod@xxxxxx, open-linguistics@xxxxxxxxxxxxxx, dbworld@xxxxxxxxxxx, meta-net-all@xxxxxxxxxxx, members@xxxxxxxxx, mt-list@xxxxxxxx, euralex@xxxxxxxxxxxxx, public-mlw-announce@xxxxxx
Date: Fri, 17 Apr 2020 18:09:38 +0200

=========================================

WebNLG+: The Second WebNLG Challenge

First call for participation: Training data now available

=========================================

https://webnlg-challenge.loria.fr/challenge_2020/

WebNLG goes bi-lingual (English, Russian) and bi-directional (generation and parsing)!

It is our pleasure to announce that three years after the first edition, the second WebNLG challenge (WebNLG+) is now open!

     TASKS

The challenge comprises two main tasks:

1. RDF-to-text generation, as in WebNLG 2017 but with new data and
   two target languages;

2. Text-to-RDF semantic parsing: converting a text into the
   corresponding set of RDF triples.

For Task 1, given the four RDF triples shown in (a), the aim is to generate a text such as (b) or (c). For Task 2, the opposite should be achieved, i.e. to generate the triples in (a) starting from text as in (b) or (c).

     Example

(a) Set of RDF triples

<entry category="Company" eid="Id21" size="4">

    <modifiedtripleset>

        <mtriple>Trane | foundingDate | 1913-01-01</mtriple>

        <mtriple>Trane | location | Ireland</mtriple>

        <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>

        <mtriple>Trane | numberOfEmployees | 29000</mtriple>

    </modifiedtripleset>

</entry>

(b) English text

Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees.

(c) Russian text

Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате Висконсин, находится в Ирландии. В компании работают 29 тысяч человек.
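For readers who want to inspect the data programmatically, the entry in (a) can be read with the Python standard library. This is only a minimal sketch: the function name parse_entry is hypothetical, and it assumes that every mtriple element holds exactly one "subject | property | object" string, as in the example above. The released files may contain further elements that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# The entry from example (a), copied verbatim.
ENTRY_XML = """
<entry category="Company" eid="Id21" size="4">
    <modifiedtripleset>
        <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
        <mtriple>Trane | location | Ireland</mtriple>
        <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
        <mtriple>Trane | numberOfEmployees | 29000</mtriple>
    </modifiedtripleset>
</entry>
"""


def parse_entry(xml_text):
    """Return (category, [(subject, property, object), ...]) for one entry.

    Hypothetical helper: assumes the "subject | property | object" layout
    shown in example (a).
    """
    entry = ET.fromstring(xml_text.strip())
    triples = []
    for mtriple in entry.iter("mtriple"):
        # Split on the first two separators only, so an object that itself
        # contained " | " would stay in one piece.
        subject, prop, obj = (part.strip() for part in mtriple.text.split(" | ", 2))
        triples.append((subject, prop, obj))
    return entry.get("category"), triples


category, triples = parse_entry(ENTRY_XML)
for triple in triples:
    print(category, triple)
```

For Task 1 the list of triples is the input and a text such as (b) or (c) is the output; for Task 2 the roles are reversed.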

     INDICATIVE DATES

- 15 April 2020: Release of Training and Development Data

- 30 April 2020: Release of some simple preliminary evaluation scripts to support development

- 30 May 2020: Release of the final evaluation scripts

- 13 September 2020: Release of Test Data

- 27 September 2020: Entry submission deadline

- 15-18 December 2020: Results of automatic and human evaluations and system presentations at INLG 2020

     DATA & REGISTRATION

For every input triple set, at least two reference texts are provided for each target language. The data specifications are the same as for WebNLG 2017.

The English WebNLG+ dataset for training comprises around 14,900 data inputs and 40,000 data-text pairs for 16 distinct DBpedia categories:

* The 10 seen categories used in 2017: Airport, Astronaut, Building,
  City, ComicsCharacter, Food, Monument, SportsTeam, University, and
  WrittenWork.

    o About 5,600 texts were cleaned of misspellings, and missing triple
      verbalisations were added to some texts.

* The 5 unseen categories of 2017, which will now be part of the seen
  data: Athlete, Artist, CelestialBody, MeanOfTransportation, Politician.

* 1 new category: Company.

The new Russian dataset comprises around 8,000 data inputs and 20,800 data-text pairs for 9 distinct categories:

* Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food,
  Monument, SportsTeam, and University.

To register for the WebNLG+ task and download the WebNLG+ training and development data, please fill in the form below:

https://framaforms.org/webnlg-challenge-2020-1586343023

The data, evaluation scripts and system outputs of WebNLG 2017 can also be downloaded here:

https://webnlg-challenge.loria.fr/challenge_2017/

     EVALUATION

For the evaluation phase, starting with the release of the test data on 13 September 2020 (see Indicative Dates), new test sets will be released for all categories seen in the training data (see above), and for several new unseen categories (categories not included in the training data). For each task, a team may submit more than one system, but only one output per system; in other words, multiple submissions of the same non-deterministic system should be avoided. Participants are free to choose which tasks and languages they want to provide results for (generation and/or semantic parsing, English and/or Russian).

System outputs as well as baseline and human-produced outputs will be evaluated.

For RDF-to-text generation, two evaluations will be carried out:

* Automatic evaluation, with standard n-gram-based and embedding-based
  metrics such as BLEU, METEOR, TER, ChrF++, BERTScore, etc.; global
  and detailed results will be provided (per DBpedia category, per
  input size, per category and input size, etc.).

* Human evaluation: system outputs will be assessed according to
  criteria such as grammaticality/correctness,
  appropriateness/adequacy and fluency/naturalness, by native speakers
  recruited on crowdsourcing platforms.

For Text-to-RDF semantic parsing, the automatic evaluation of three aspects is foreseen, in terms of recall, precision and F1-score:

* Property identification.

* Subject and object identification.

* Full triple identification.

Initially, preliminary evaluation scripts are released and can be used to test the models. The final evaluation scripts and metrics used for WebNLG+ will be provided at a later stage (see Indicative Dates).
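To make the three aspects concrete, the sketch below computes precision, recall and F1 for one text by comparing a predicted triple set with a reference set. It is an illustration only, not the official evaluation script: the function names prf and score_triples are hypothetical, exact string matching over sets is an assumption, and the predicted triples are invented for the example. The released scripts may match and aggregate differently.

```python
def prf(reference, predicted):
    """Precision, recall and F1 between two collections, compared as sets."""
    ref, pred = set(reference), set(predicted)
    true_positives = len(ref & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def score_triples(reference, predicted):
    """Score the three aspects, assuming exact string match (an assumption)."""
    return {
        # Only the property names are compared.
        "property": prf({p for _, p, _ in reference},
                        {p for _, p, _ in predicted}),
        # Subject/object pairs are compared, ignoring the property.
        "subject_object": prf({(s, o) for s, _, o in reference},
                              {(s, o) for s, _, o in predicted}),
        # The whole triple must match.
        "full_triple": prf(reference, predicted),
    }


# Reference triples from example (a).
reference = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
    ("Trane", "foundationPlace", "La_Crosse,_Wisconsin"),
    ("Trane", "numberOfEmployees", "29000"),
]

# Invented system output: one triple is missing and one object is wrong.
predicted = [
    ("Trane", "foundingDate", "1913-01-01"),
    ("Trane", "location", "Ireland"),
    ("Trane", "foundationPlace", "Wisconsin"),
]

scores = score_triples(reference, predicted)
for aspect, (precision, recall, f1) in scores.items():
    print(f"{aspect}: P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

With this invented output, property identification reaches precision 1.0 and recall 0.75, while subject/object pairs and full triples both score 2/3 precision and 0.5 recall, because the wrong object breaks the match.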

     MOTIVATION

The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation); the data for the first challenge included a total of 15 DBpedia categories. The 2020 challenge aims first of all at increasing the datasets (hence, the coverage of the verbalisers), by covering more categories and an additional language. The other main objective of the 2020 edition is to promote the development of knowledge extraction tools, with a task that mirrors the verbalisation task.

[RDF Verbalisers] The RDF language, in which DBpedia is encoded, is widely used within the Linked Data framework. Many large-scale datasets are encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData) and official institutions increasingly publish their data in this format. Being able to generate good-quality text from RDF data would open the way to many new applications, such as making linked data more accessible to lay users, enriching existing text with information drawn from knowledge bases, or describing, comparing and relating entities present in these knowledge bases.

[Multilinguality] By providing a bilingual corpus (English and Russian), we aim to promote the development of tools for languages other than English and to allow for experimentation with pre-training and transfer approaches (do the English verbalisations of RDF triples help to better verbalise the triples in Russian?).

[Knowledge extraction] The new semantic parsing task opens up new lines of research in several directions. Can it be used to bootstrap entity linkers? How does RDF-based semantic parsing relate to other semantic parsing tasks where the output semantic representations are lambda terms or KB queries? Can semantic parsing be used to improve generation in ways similar to the back translation approaches proposed in machine translation?

     ORGANISING COMMITTEE

* Thiago Castro Ferreira, Federal University of Minas Gerais, Brazil

* Claire Gardent, CNRS/LORIA, Nancy, France

* Nikolai Ilinykh, University of Gothenburg, Sweden

* Chris van der Lee, Tilburg University, The Netherlands

* Simon Mille, Universitat Pompeu Fabra, Barcelona, Spain

* Diego Moussallem, Paderborn University, Germany

* Anastasia Shimorina, Université de Lorraine/LORIA, Nancy, France

     CONTACT

mail: webnlg-challenge@xxxxxxxx

website: https://webnlg-challenge.loria.fr/challenge_2020/

twitter: https://twitter.com/webnlg

     REFERENCES

Creating Training Corpora for NLG Micro-Planners. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of ACL 2017. Vancouver (Canada).

The WebNLG challenge: Generating text from RDF data. C. Gardent, A. Shimorina, S. Narayan and L. Perez-Beltrachini. Proceedings of INLG 2017. Santiago de Compostela (Spain).

Building RDF Content for Data-to-Text Generation. L. Perez-Beltrachini, R. Sayed and C. Gardent. Proceedings of COLING 2016. Osaka (Japan).

Enriching the WebNLG corpus. T. Castro Ferreira, D. Moussallem, E. Krahmer and S. Wubben. Proceedings of INLG 2018. Tilburg (The Netherlands).

Creating a corpus for Russian data-to-text generation using neural machine translation and post-editing. A. Shimorina, E. Khasanova and C. Gardent. Proceedings of BSNLP Workshop 2019. Florence (Italy).
