Large-scale multilingual knowledge bases (KBs) are key to cross-lingual and multilingual applications such as Question Answering, Machine Translation, and Search. They encode the same information in different languages and can be used to deliver content in the format most appropriate to the user. For example, the figure below shows two different language versions of Google search for the query “rdf”.
Another interesting use of multilingual KBs is as a training corpus for building better multilingual machine learning models. However, finding high-quality multilingual content can be challenging. An analysis of over 100 thousand KBs in LOD Laundromat shows that only ∼14% of them have language tags on their rdfs:labels, while ∼20% of all rdfs:labels carry no language tag at all. One way to overcome this challenge is to use language identification methods to tag RDF content automatically.
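To illustrate what “missing language tags” means in practice, the sketch below counts rdfs:label objects with and without a language tag in a small, made-up N-Triples snippet (the data and the regex-based parsing are illustrative only, not how LOD Laundromat computed its statistics):

```python
import re

# Hypothetical N-Triples data: two labels carry a language tag, one does not.
NTRIPLES = """\
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@de .
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin"@en .
<http://dbpedia.org/resource/Berlin> <http://www.w3.org/2000/01/rdf-schema#label> "Berlin" .
"""

# Match an rdfs:label predicate followed by a literal, capturing an optional @lang tag.
LABEL_RE = re.compile(
    r'<http://www\.w3\.org/2000/01/rdf-schema#label>\s+"(?:[^"\\]|\\.)*"(@[a-zA-Z-]+)?'
)

def tag_coverage(ntriples: str):
    """Count rdfs:label literals with and without a language tag."""
    tagged = untagged = 0
    for line in ntriples.splitlines():
        m = LABEL_RE.search(line)
        if m:
            if m.group(1):
                tagged += 1
            else:
                untagged += 1
    return tagged, untagged

print(tag_coverage(NTRIPLES))  # (tagged, untagged)
```

The untagged literal in the third triple is exactly the kind of label a language identification method would have to tag automatically.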
In this work, we exploit DBpedia’s multilingual content to train and evaluate different language identification methods and frameworks, and we show that these approaches perform poorly on rdfs:labels. In our experiments, we evaluate six language identification methods: two baseline models (LangTagger) as well as Apache Tika, langdetect, and the Apache OpenNLP language detector in two configurations.
LangTagger and langdetect use Naive Bayes classifiers for language detection. langdetect builds its features from character-based n-grams, whereas Apache Tika uses word-based n-grams. Both LangTagger models were trained on questions from the QALD training dataset, since it is multilingual and based on DBpedia resources. Table II gives an overview of the different language identification methods evaluated.
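To make the character-n-gram Naive Bayes idea concrete, here is a minimal, self-contained sketch of such a classifier. It is not the implementation used by LangTagger or langdetect; the class name, the bigram features, and the toy English/German training sentences are all illustrative:

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Split text into overlapping character n-grams, padded with spaces."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NGramNaiveBayes:
    """Toy Naive Bayes language identifier over character n-gram features."""

    def __init__(self, n=2):
        self.n = n
        self.counts = defaultdict(Counter)  # per-language n-gram counts
        self.totals = Counter()             # per-language total n-grams
        self.docs = Counter()               # per-language training samples

    def fit(self, samples):
        for text, lang in samples:
            self.docs[lang] += 1
            for g in char_ngrams(text, self.n):
                self.counts[lang][g] += 1
                self.totals[lang] += 1

    def predict(self, text):
        vocab = {g for c in self.counts.values() for g in c}
        n_docs = sum(self.docs.values())
        best, best_lp = None, float("-inf")
        for lang in self.counts:
            # Log prior plus Laplace-smoothed log likelihood of each n-gram.
            lp = math.log(self.docs[lang] / n_docs)
            for g in char_ngrams(text, self.n):
                lp += math.log(
                    (self.counts[lang][g] + 1) / (self.totals[lang] + len(vocab))
                )
            if lp > best_lp:
                best, best_lp = lang, lp
        return best

# Tiny illustrative training set; real models are trained on far more data.
clf = NGramNaiveBayes(n=2)
clf.fit([
    ("what is the capital of germany", "en"),
    ("who wrote the jungle book", "en"),
    ("was ist die hauptstadt von deutschland", "de"),
    ("wer schrieb das dschungelbuch", "de"),
])
print(clf.predict("the capital of france"))
```

Very short strings such as single-word labels give the classifier only a handful of n-grams to score, which is one intuition for why rdfs:labels are hard for these methods.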
Evaluation: Since OpenNLP and langdetect originally support language identification for 103 and 55 languages, respectively, we used a second configuration that limits the set to 12 languages: English, German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian, the same languages used to train the baseline models. All methods are evaluated on DBpedia rdfs:labels.
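Each framework configures its language profiles in its own way; one generic way to emulate such a restriction is to post-filter a detector’s ranked output to the supported subset and renormalize the confidences. The scores and ISO codes below are made up for illustration:

```python
# Hypothetical ranked scores from a detector run over its full profile set;
# "sco" (Scots) and "af" (Afrikaans) fall outside our 12-language subset.
raw_scores = {"en": 0.40, "sco": 0.25, "de": 0.20, "af": 0.15}

# The 12 languages of the restricted configuration (codes assumed).
ALLOWED = {"en", "de", "es", "fr", "pt-BR", "pt", "it", "nl", "hi", "ro", "fa", "ru"}

def restrict(scores, allowed):
    """Keep only supported languages and renormalize the remaining confidences."""
    kept = {lang: p for lang, p in scores.items() if lang in allowed}
    total = sum(kept.values())
    return {lang: p / total for lang, p in kept.items()}

restricted = restrict(raw_scores, ALLOWED)
best = max(restricted, key=restricted.get)
print(best, restricted)
```

Removing implausible candidate languages concentrates probability mass on the remaining ones, which matches the accuracy gain observed after limiting the profiles.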
Conclusion: The results in Table I show that OpenNLP outperforms the other frameworks w.r.t. accuracy once the number of inferred languages is limited, while langdetect outperforms the others w.r.t. runtime. We show that it is possible to reach state-of-the-art results with a small training corpus (see the baseline models for German). Overall, however, the methods perform poorly on rdfs:labels. Further, we show that accuracy can be improved by reducing the number of language profiles and by using context-based training corpora.
Acknowledgement: This work was partially supported by DBpedia under Google Summer of Code (GSoC) 2020.
More information can be found in the following links:
Authors: Lahiru Hinguruduwa, Edgard Marx (@eccenca GmbH), Tommaso Soru, Thomas Riechert