
The Language of Data

Research, resources, and tools for parsing short text in data

We use the term Language of Data for the distinctive grammar of the short textual labels that typically appear in structured data. For example:

| Name | Full_addr | Type | Notes |
| --- | --- | --- | --- |
| Pizza Rio | Via della Resistenza, 9/A, 38123 Trento, Italy | pizzeria | take-away possible |
| La Bigoudène | 18 rue Vauban, 29200 Brest, France | pancake restaurant | Closed Permanently |

Both headers and data values in the dataset above have the following characteristics:

Why is this special grammar relevant?

High-accuracy automated analysis of the textual content of vast datasets is crucial in many applications, such as information retrieval (e.g. meaning-based indexing of content by search engines), data integration, or AI-based data analytics.

Why is it so hard to parse the Language of Data?

State-of-the-art natural language processing tools are trained on regular text (e.g. Wikipedia) or on social media content (e.g. tweets). They analyse text in context, looking at a window of preceding and following words and phrases. In the Language of Data, context is short or non-existent, and orthography and syntax are used in specific, non-standard ways. As a result, conventional NLP tools vastly underperform on such text (e.g. an F-measure of 10-40% for named entity recognition and 70% accuracy for part-of-speech classification). The specific grammar of the Language of Data therefore calls for specifically designed NLP tools.
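To make the missing-context problem concrete, here is a minimal Python sketch. It is not taken from our tools: the `context_window` helper, the whitespace tokenization, and the example sentence are all assumptions made for illustration. It compares the neighbours available around the word "restaurant" in a made-up running sentence with those available in the data value "pancake restaurant" from the table above.

```python
def context_window(tokens, index, size=2):
    """Return up to `size` tokens on each side of tokens[index]."""
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left, right


# A made-up sentence of regular text (illustration only).
sentence = "We had dinner at a pancake restaurant in Brest last night".split()
# A data value from the example table, split on whitespace (an assumption).
label = "pancake restaurant".split()

sent_left, sent_right = context_window(sentence, sentence.index("restaurant"))
label_left, label_right = context_window(label, label.index("restaurant"))

print("sentence context:", sent_left, sent_right)
print("label context:   ", label_left, label_right)
```

In the sentence, the window holds four neighbouring words. In the label it holds a single one, so a tagger that relies on surrounding words has very little evidence to work with.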

Resources and tools

Our corpora and tools are in their early stages and are under constant development. More resources will follow in the near future. All tools and corpora are licensed under CC BY-NC 4.0, meaning that you are free to share and adapt the material for non-commercial purposes, provided that you give appropriate credit to the authors. Do not hesitate to contact us for individual licensing arrangements.

Trained NLP models

| Name | Version | Task | Language | Accuracy/F1 | Link |
| --- | --- | --- | --- | --- | --- |
| LoD OpenNLP Tokenizer | 1.0 | tokenization | English | 96.7% | Download |
| LoD OpenNLP POS Tagger | 1.0 | POS tagging | English | 85.9% | Download |
| LoD OpenNLP Name Finder | 1.0 | NER | English | 50.8% | Download |
| LoD BERT-NER | 1.0 | NER | English | 67.4% | coming soon |

Note that sequence labelling tasks such as POS or NER tagging are much harder over the Language of Data, which explains the gap with respect to state-of-the-art scores over regular text.
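For readers unfamiliar with the F1 measure, the following Python sketch shows one common way of computing it for NER, by exact matching of predicted entity spans against gold ones. It is an illustration only: the `precision_recall_f1` function and the span tuples are hypothetical, and this is not necessarily the exact evaluation protocol behind the scores in the table.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two collections of spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans: (label id, first token, last token, entity type).
gold = [(0, 0, 2, "ORG"), (1, 3, 4, "LOC"), (2, 0, 1, "LOC"), (3, 1, 2, "PER")]
pred = [(0, 0, 2, "ORG"), (1, 3, 4, "ORG"), (2, 0, 1, "LOC")]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this scheme a span with the right boundaries but the wrong type counts as an error, as does a missed entity, which is one reason NER scores drop quickly when there is little context to disambiguate entity types.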

Corpora

| Name | Description | Language | No. of labels | No. of tokens | Link |
| --- | --- | --- | --- | --- | --- |
| LoD Headers English | Hand-annotated table head labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags. | English | 8,558 | 31,127 | Download |
| LoD Data English | Hand-annotated data value labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags. | English | 8,731 | 39,698 | Download |
| LoD Headers Italian | Hand-annotated table head labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags. | Italian | 3,536 | 9,723 | Download |
| LoD Data Italian | Hand-annotated data value labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags. | Italian | 6,528 | 39,517 | Download |
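The label and token counts above give a rough idea of how short these labels are. The following Python sketch simply divides tokens by labels for each corpus. The figures are copied from the table, while the dictionary layout and variable names are our own choices for illustration.

```python
# (number of labels, number of tokens), copied from the corpora table.
corpora = {
    "LoD Headers English": (8558, 31127),
    "LoD Data English": (8731, 39698),
    "LoD Headers Italian": (3536, 9723),
    "LoD Data Italian": (6528, 39517),
}

# Average number of tokens per label for each corpus.
tokens_per_label = {
    name: tokens / labels for name, (labels, tokens) in corpora.items()
}

for name, average in tokens_per_label.items():
    print(f"{name}: {average:.2f} tokens per label")
```

The averages range from about 2.7 tokens per label (Italian headers) to about 6.1 (Italian data values), far shorter than a typical sentence of running text.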

Publications

The main publication supporting our principal hypotheses (please cite it if you use our resources or tools):

Gábor Bella, Linda Gremes, and Fausto Giunchiglia. Exploring the Language of Data. Proceedings of COLING 2020.

An early publication on using NLP mechanisms tailored to the Language of Data in order to perform multilingual and multi-domain word sense disambiguation:

Gábor Bella, Alessio Zamboni, and Fausto Giunchiglia. Domain-Based Sense Disambiguation on Multilingual Structured Data. Proceedings of the ECAI 2016 workshop on Diversity Aware Artificial Intelligence, The Hague, Netherlands.

Credits

Research on the Language of Data is being carried out at the Language Diversity Lab of the KnowDive Research Group at the University of Trento, Italy. For any inquiry, do not hesitate to drop us an email.

Contributors: