
The Language of Data

Research, resources, and tools for parsing short text in data

The Language of Data is our shorthand for the distinctive grammar of the short textual labels that typically appear in structured data. For example:

_Name	_{Full_addr}	_Type	_Notes
_{Pizza Rio}	_{Via della Resistenza, 9/A, 38123 Trento, Italy}	_pizzeria	_{take-away possible}
_{La Bigoudène}	_{18 rue Vauban, 29200 Brest, France}	_{pancake restaurant}	_{Closed Permanently}

Both headers and data values in the dataset above have the following characteristics:

short labels, consisting of just a few words;
frequent named entities (names, postal addresses, dates, URLs);
non-standard orthography (use of “_” for token separation, Inconsistent Use of Capitals, frequent abbreviations, etc.);
absence or rarity of certain parts of speech (e.g. verbs, pronouns);
non-standard syntax (omission of verbs and prepositions, inverted word order, etc.), for example: take-away [is] possible, operate [an] uninsured vehicle, death country.
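
The orthographic conventions listed above can be illustrated with a minimal rule-based splitter. This is only a sketch under stated assumptions: `split_label` is a hypothetical helper, not the LoD OpenNLP Tokenizer, and it covers just two conventions, the underscore separator mentioned above and camelCase boundaries (the latter is our own assumption about typical headers, not something the corpora description states).

```python
import re
from typing import List

# Zero-width position between a lowercase letter and an uppercase letter,
# e.g. between "h" and "C" in "deathCountry" (assumed convention).
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")


def split_label(label: str) -> List[str]:
    """Split a short data label into word-like tokens (hypothetical baseline)."""
    # Underscores stand in for spaces in many headers.
    spaced = label.replace("_", " ")
    # Insert a space at every lowercase-to-uppercase boundary.
    spaced = _CAMEL_BOUNDARY.sub(" ", spaced)
    # Split on any run of whitespace; hyphens are deliberately kept.
    return spaced.split()


if __name__ == "__main__":
    for raw in ["Full_addr", "deathCountry", "take-away possible"]:
        print(raw, "->", split_label(raw))
```

A rule list like this breaks down quickly on abbreviations, postal addresses, and dates, which is one reason a tokenizer trained on annotated labels is preferable to hand-written rules.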

Why is this special grammar relevant?

A high-accuracy automated analysis of the textual content of vast datasets is crucial in many applications, such as information retrieval (e.g. for meaning-based indexing of content by search engines), data integration, or AI-based data analytics.

Why is it so hard to parse the Language of Data?

State-of-the-art natural language processing tools are trained on regular text (e.g. Wikipedia) or on social media content (e.g. tweets). They analyse text in context, looking at a window of preceding and following words and phrases. In the Language of Data, context is short or non-existent, and orthography and syntax are used in specific, non-standard ways. Conventional NLP tools vastly underperform on such text (e.g. an F-measure of 10-40% for named entity recognition and an accuracy of 70% for part-of-speech classification). The Language of Data therefore needs NLP tools designed specifically for its grammar.

Resources and tools

Our corpora and tools are in their early stages and are under constant development. More resources will follow in the near future. All tools and corpora are licensed under CC BY-NC 4.0, meaning that you are free to share and adapt the material for non-commercial purposes, provided that you give appropriate credit to the authors. Do not hesitate to contact us for individual licensing arrangements.

Trained NLP models

Name	Version	Task	Language	Accuracy/F1	Link
LoD OpenNLP Tokenizer	1.0	tokenization	English	96.7%	Download
LoD OpenNLP POS Tagger	1.0	POS tagging	English	85.9%	Download
LoD OpenNLP Name Finder	1.0	NER	English	50.8%	Download
LoD BERT-NER	1.0	NER	English	67.4%	coming soon

Note that sequence labelling tasks such as POS or NER tagging are much harder over the Language of Data, which explains the gap with respect to state-of-the-art scores over regular text.

Corpora

Name	Description	Language	Nb. labels	Nb. tokens	Link
LoD Headers English	Hand-annotated table head labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags.	English	8,558	31,127	Download
LoD Data English	Hand-annotated data value labels extracted from English-language Open Data catalogues. Token boundaries, POS and NER tags.	English	8,731	39,698	Download
LoD Headers Italian	Hand-annotated table head labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags.	Italian	3,536	9,723	Download
LoD Data Italian	Hand-annotated data value labels extracted from Italian-language Open Data catalogues. Token boundaries, POS and NER tags.	Italian	6,528	39,517	Download
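
As a quick sanity check on the claim that labels consist of just a few words, the average number of tokens per label can be derived from the counts in the table above. The short Python sketch below only divides the published figures; the dictionary and function names are our own and purely illustrative.

```python
# (number of labels, number of tokens), copied from the Corpora table.
CORPORA = {
    "LoD Headers English": (8558, 31127),
    "LoD Data English": (8731, 39698),
    "LoD Headers Italian": (3536, 9723),
    "LoD Data Italian": (6528, 39517),
}


def tokens_per_label(labels: int, tokens: int) -> float:
    """Average label length in tokens."""
    return tokens / labels


if __name__ == "__main__":
    for name, (labels, tokens) in CORPORA.items():
        print(f"{name}: {tokens_per_label(labels, tokens):.2f} tokens per label")
    # LoD Headers English: 3.64 tokens per label
    # LoD Data English: 4.55 tokens per label
    # LoD Headers Italian: 2.75 tokens per label
    # LoD Data Italian: 6.05 tokens per label
```

Every corpus averages between roughly three and six tokens per label, far shorter than a typical sentence in running text.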

Publications

This is the main publication supporting our principal hypotheses; please cite it if you use our resources or tools:

Gábor Bella, Linda Gremes, and Fausto Giunchiglia. Exploring the Language of Data. Proceedings of COLING 2020.

An early publication on using NLP mechanisms tailored to the Language of Data in order to perform multilingual and multi-domain word sense disambiguation:

Gábor Bella, Alessio Zamboni, and Fausto Giunchiglia. Domain-Based Sense Disambiguation on Multilingual Structured Data. Proceedings of the ECAI 2016 workshop on Diversity Aware Artificial Intelligence, The Hague, Netherlands.

Credits

Research on the Language of Data is being carried out at the Language Diversity Lab of the KnowDive Research Group at the University of Trento, Italy. For any inquiry, do not hesitate to drop us an email.

Contributors:

Gábor Bella;
Prof. Fausto Giunchiglia;
Linda Gremes.