Assessing Language Identification Over DBpedia

Large-scale multilingual knowledge bases (KBs) are key to cross-lingual and multilingual applications such as Question Answering, Machine Translation, and Search. They can encode the same information in different languages and be used to deliver content in the most appropriate format to the user. For example, the figure below shows two different language versions of Google search for the query “rdf”.

Another interesting use of multilingual KBs is as a training corpus for building better multilingual machine learning models. However, finding high-quality multilingual content can be challenging. An analysis of over 100 thousand KBs in LOD Laundromat shows that only ∼14% of them have language tags on their rdfs:labels, while ∼20% of all rdfs:labels have no language tag. One way to overcome this challenge is to use language identification methods to tag RDF content automatically.
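To make the notion of "language tag coverage" concrete, here is a minimal sketch of how one might count tagged and untagged rdfs:label literals in N-Triples data. It is not the analysis pipeline used for LOD Laundromat. The function name `label_tag_coverage` and the regular expression are assumptions made for illustration, and the sketch deliberately skips datatyped literals.

```python
import re

RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

# An N-Triples literal object with an optional language tag, e.g. "Berlin"@de .
# Datatyped literals ("42"^^<...>) intentionally do not match.
LITERAL = re.compile(
    r'"(?P<text>(?:[^"\\]|\\.)*)"'
    r'(?:@(?P<lang>[A-Za-z]+(?:-[A-Za-z0-9]+)*))?'
    r'\s*\.\s*$'
)


def label_tag_coverage(ntriples_lines):
    """Return (tagged, untagged) counts for plain rdfs:label literals."""
    tagged = untagged = 0
    for line in ntriples_lines:
        # Subject and predicate are IRIs without spaces; the rest is the object.
        parts = line.strip().split(None, 2)
        if len(parts) < 3 or parts[1] != RDFS_LABEL:
            continue
        match = LITERAL.match(parts[2])
        if match is None:
            continue  # IRI, blank node, or datatyped literal
        if match.group("lang"):
            tagged += 1
        else:
            untagged += 1
    return tagged, untagged


sample = [
    f'<http://dbpedia.org/resource/Berlin> {RDFS_LABEL} "Berlin"@de .',
    f'<http://dbpedia.org/resource/Berlin> {RDFS_LABEL} "Berlin" .',
    '<http://dbpedia.org/resource/Berlin> <http://xmlns.com/foaf/0.1/name> "Berlin"@en .',
]
tagged, untagged = label_tag_coverage(sample)
print(f"tagged={tagged} untagged={untagged}")
```

The untagged labels are the ones a language identification method would have to tag automatically.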

In this work, we exploit DBpedia’s multilingual content for training and evaluating different language identification methods and frameworks. We show that these approaches perform poorly on rdfs:labels. In our experiments, we evaluate the performance of six language identification methods, which consist of two baselines (LangTagger) as well as Apache Tika, langdetect, and the Apache openNLP language detector in two configurations.

LangTagger and langdetect use Naive Bayes classifiers for language detection. langdetect creates its features with a character-based n-gram method, while Apache Tika uses a word-based one. Both LangTagger models were trained on the questions of the QALD training dataset because it is multilingual and based on DBpedia resources. Table II gives an overview of the language identification methods evaluated.
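For readers unfamiliar with the approach, the following is a minimal sketch of a Naive Bayes language identifier over character n-grams. It is not the LangTagger or langdetect implementation. The class `NgramNaiveBayes`, the trigram size, the add-one smoothing and the four toy training sentences are all assumptions chosen for illustration. The optional `candidates` argument mimics restricting the set of languages a detector may return.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split lower-cased, space-padded text into overlapping character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()         # language -> number of training texts
        self.vocab = set()

    def fit(self, samples):
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.doc_counts[lang] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text, candidates=None):
        # Only languages seen during training can be scored.
        langs = [l for l in (candidates or self.doc_counts) if l in self.doc_counts]
        total_docs = sum(self.doc_counts[l] for l in langs)
        grams = char_ngrams(text, self.n)
        best_lang, best_score = None, -math.inf
        for lang in langs:
            total = sum(self.counts[lang].values())
            score = math.log(self.doc_counts[lang] / total_docs)  # log prior
            for gram in grams:
                # Add-one smoothing keeps unseen n-grams from zeroing the score.
                score += math.log(
                    (self.counts[lang][gram] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Toy training data in the spirit of multilingual questions (hypothetical).
model = NgramNaiveBayes().fit([
    ("Who is the mayor of Berlin?", "en"),
    ("Which river flows through the city?", "en"),
    ("Wer ist der Bürgermeister von Berlin?", "de"),
    ("Welcher Fluss fließt durch die Stadt?", "de"),
])
print(model.predict("the river"))
print(model.predict("der Fluss"))
```

Short strings such as labels yield only a handful of n-grams, which is one intuition for why such models have less evidence to work with on rdfs:labels than on full sentences.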

Evaluation: Since openNLP and langdetect originally support language identification for 103 and 55 languages, respectively, we used a second configuration that limits them to 12 languages. These 12 languages, which are also used to train the baseline models, are English, German, Spanish, French, Brazilian Portuguese, Portuguese, Italian, Dutch, Hindi, Romanian, Persian, and Russian. We evaluate the performance of the different language identification methods and frameworks over DBpedia rdfs:labels.

Conclusion: The results in Table I show that openNLP outperforms the other frameworks w.r.t. accuracy after limiting the number of inferred languages, while langdetect outperforms the other frameworks w.r.t. runtime. We show that it is possible to reach state-of-the-art results with a small training corpus (see the baseline models for German). Overall, the methods perform poorly on rdfs:labels. Further, we show that accuracy can be improved by reducing the number of language profiles and by using context-based training corpora.

Acknowledgement: This work was partially supported by DBpedia under Google Summer of Code (GSoC) 2020.

More information can be found in the following links:

Paper: https://ieeexplore.ieee.org/abstract/document/9364504

GitHub: https://github.com/AKSW/LangTagger

Authors: Lahiru Hinguruduwa, Edgard Marx (@eccenca GmbH), Tommaso Soru, Thomas Riechert


DBpedia Tutorial @ Knowledge Graph Conference 2021

On May 4, 2021, we will organize a tutorial at the Knowledge Graph Conference (KGC) 2021. The tutorial targets existing and potential new users of DBpedia, developers who wish to learn how to replicate the DBpedia infrastructure, service providers interested in exploiting the DBpedia Knowledge Graph (KG), data providers interested in integrating data assets with the DBpedia KG, and data scientists (e.g. linguists) focused on extracting relevant (e.g. linguistic) information from or based on the DBpedia KG.

During the tutorial, participants will learn how to find information in the DBpedia KG, how to access, query and work with it, how to replicate it, and how to contribute to and improve it.

Quick Facts

Registration

Please register at the Knowledge Graph Conference website to be part of the meeting. You NEED to buy a conference ticket to join the tutorial.

Organisation

  • Milan Dojchinovski, AKSW, DBpedia Association / CTU in Prague
  • Sebastian Hellmann, AKSW, DBpedia Association
  • Jan Forberg, AKSW, DBpedia Association
  • Johannes Frey, AKSW, DBpedia Association
  • Julia Holze, AKSW, DBpedia Association
  • Marvin Hofer, AKSW
  • Denis Streitmatter, AKSW

We are looking forward to meeting you in May!

Emma & Julia

on behalf of the DBpedia Association


DBpedia @ Google Summer of Code program 2021

DBpedia, one of InfAI’s community projects, will participate in the Google Summer of Code (GSoC) program for the 10th time.

The GSoC program aims to bring students from all over the globe into open source software development. In this regard, we are calling for students to take part in the Summer of Code. During two funded months, you will be able to work on a specific task, the results of which are presented in the summer. Have a look at some of the project ideas here.

Have we sparked your interest in participating? Great, then check out the DBpedia website for further information.

We are looking forward to your contribution!

Kind regards,

Julia

on behalf of the DBpedia Association


DBpedia’s New Website

We are proud to announce the completion of the new DBpedia website. We used the New Year’s break as an opportunity to alter layout, design and content of the DBpedia website, according to the requirements of the DBpedia community and DBpedia members. We’ve created a new site to better present the DBpedia movement in its many facets.

New Website and Blog

The DBpedia team has diligently cleaned up the website, removed outdated content and created a platform for new tools, applications, services and data sets. We also integrated the DBpedia blog into the website, a long overdue step. So now you have access to everything in one spot.

Feedback Button

Feedback from the community and members is very important to us, so we offer a tool for you to make your voice heard. Just click the feedback button on the new DBpedia website. If you find the content helpful, please click on Yep. If you think the content is not sufficient, please report it to us either directly on the website or via dbpedia@infai.org.

Acknowledgment

The DBpedia Association would like to thank Bettina Klimek, Henri Selbmann (Seefeuer GbR) and the KILT Competence Center at InfAI for their constant support to create the new DBpedia website.

Have fun browsing the new DBpedia website.

Kind regards,

Julia

on behalf of the DBpedia Association


SANSA 0.7.1 (Semantic Analytics Stack) Released

We are happy to announce SANSA 0.7.1 – the seventh release of the Scalable Semantic Analytics Stack. SANSA employs distributed computing via Apache Spark and Flink in order to allow scalable machine learning, inference and querying capabilities for large knowledge graphs.

You can find usage guidelines and examples at http://sansa-stack.net/user-guide.

The following features are currently supported by SANSA:

  • Reading and writing RDF files in N-Triples, Turtle, RDF/XML, N-Quads and TRIX formats
  • Reading OWL files in various standard formats
  • Query heterogeneous sources (Data Lake) using SPARQL – CSV, Parquet, MongoDB, Cassandra, JDBC (MySQL, SQL Server, etc.) are supported
  • Support for multiple data partitioning techniques
  • SPARQL querying via Sparqlify, Ontop and tensors
  • Graph-parallel querying of RDF using SPARQL (1.0) via GraphX traversals (experimental)
  • RDFS, RDFS Simple and OWL-Horst forward chaining inference
  • RDF graph clustering with different algorithms
  • Terminological decision trees (experimental)
  • Knowledge graph embedding approaches: TransE (beta), DistMult (beta)
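To illustrate what forward chaining inference means in the list above, here is a minimal single-machine sketch of two RDFS rules (rdfs9 and rdfs11) applied until a fixpoint is reached. It does not use the SANSA API, which runs inference in a distributed fashion on Spark and Flink. The function `forward_chain`, the prefixed-name strings and the example triples are assumptions made for illustration.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def forward_chain(triples):
    """Apply rdfs9 and rdfs11 repeatedly until no new triple is derived."""
    inferred = set(triples)
    while True:
        subclass_pairs = [(s, o) for s, p, o in inferred if p == SUBCLASS_OF]
        new = set()
        for sub, sup in subclass_pairs:
            for s, p, o in inferred:
                # rdfs11: subClassOf is transitive.
                if p == SUBCLASS_OF and s == sup:
                    new.add((sub, SUBCLASS_OF, o))
                # rdfs9: instances of a subclass are instances of the superclass.
                if p == RDF_TYPE and o == sub:
                    new.add((s, RDF_TYPE, sup))
        if new <= inferred:
            return inferred  # fixpoint reached
        inferred |= new


facts = {
    ("ex:City", SUBCLASS_OF, "ex:Place"),
    ("ex:Place", SUBCLASS_OF, "ex:Thing"),
    ("ex:Berlin", RDF_TYPE, "ex:City"),
}
closure = forward_chain(facts)
for triple in sorted(closure - facts):
    print(triple)
```

The loop materialises every derivable triple up front, which is what distinguishes forward chaining from rewriting queries at request time.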

Noteworthy changes or updates since the previous release are:

  • TRIX support
  • A new query engine over compressed RDF data
  • OWL/XML Support

Deployment and getting started:

  • There are template projects for SBT and Maven for Apache Spark as well as for Apache Flink available to get started.
  • The SANSA jar files are in Maven Central, i.e., in most IDEs you can just search for “sansa” to include the dependencies in Maven projects.
  • Example code is available for various tasks.
  • We provide interactive notebooks for running and testing code via Docker.

We want to thank everyone who helped to create this release, in particular the projects Big Data Ocean, SLIPO, QROWD, BETTER, BOOST, MLwin, PLATOON and Simple-ML. Also check out our recent articles in which we describe how to use SANSA for tensor based querying, scalable RDB2RDF query execution, quality assessment and semantic partitioning.

Spread the word by retweeting our release announcement on Twitter. For more updates, please view our Twitter feed and consider following us.

Greetings from the SANSA Development Team

 


More Complete Resultset Retrieval from Large Heterogeneous RDF Sources

Over recent years, the Web of Data has grown significantly. Various interfaces such as LOD Stats, LOD Laundromat and SPARQL endpoints provide access to hundreds of thousands of RDF datasets, representing billions of facts. These datasets are available in different formats such as raw data dumps and HDT files, or directly accessible via SPARQL endpoints. Querying such a large amount of distributed data is particularly challenging and many of these datasets cannot be directly queried using the SPARQL query language.

To tackle these problems, we present WimuQ, an integrated query engine that executes SPARQL queries and retrieves results from a large number of heterogeneous RDF data sources. Presently, WimuQ is able to execute both federated and non-federated SPARQL queries over a total of 668,166 datasets from LOD Stats and LOD Laundromat, as well as 559 active SPARQL endpoints. These data sources represent a total of 221.7 billion triples from more than 5 terabytes of information from datasets retrieved using the service “Where is My URI” (WIMU). Our evaluation on state-of-the-art real-data benchmarks shows that WimuQ retrieves more complete results for the benchmark queries.

The contributions of this work are:

  • A hybrid SPARQL query-processing engine to execute SPARQL queries over a large amount of heterogeneous RDF data.
  • An evaluation on real-world datasets using state-of-the-art federated and non-federated query benchmarks (FedBench, LargeRDFBench and FEASIBLE).
  • The first federated SPARQL query-processing engine that executes SPARQL queries over a total of 221.7 billion triples.
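The intuition behind "more complete results" can be sketched as follows: the same query is sent to several kinds of sources and the distinct solutions are merged. This is only an illustrative sketch, not WimuQ's actual architecture or code. The function `merge_results`, the source names and the stub query functions are hypothetical, and each source is modelled as a plain callable that returns a list of variable bindings.

```python
def merge_results(sources, query):
    """Run `query` on every source and return the distinct solutions in order."""
    seen = set()
    merged = []
    for name, run_query in sources.items():
        for binding in run_query(query):
            # A solution is a mapping variable -> value; freeze it for hashing.
            key = frozenset(binding.items())
            if key not in seen:
                seen.add(key)
                merged.append(binding)
    return merged


# Hypothetical stand-ins for a SPARQL endpoint, an HDT file and a data dump.
sources = {
    "endpoint": lambda q: [{"city": "dbr:Leipzig"}, {"city": "dbr:Berlin"}],
    "hdt_file": lambda q: [{"city": "dbr:Berlin"}, {"city": "dbr:Dresden"}],
    "dump": lambda q: [],
}
results = merge_results(sources, "SELECT ?city WHERE { ?city a dbo:City }")
print(len(results), "distinct solutions")
```

No single source in the example returns all three cities, but their union does.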

This is ongoing work; the next step is a large-scale approach to studying the relations and similarity among the datasets. This work was supported by the Semantic Web group of HTWK Leipzig (https://www.htwk-leipzig.de/) under the supervision of Prof. Dr. rer. nat. Thomas Riechert.

Github repository: https://github.com/firmao/wimuT

Prototype/proof of concept: https://w3id.org/wimuq/

Slides: https://tinyurl.com/slidesKcap2019

Paper: https://dl.acm.org/citation.cfm?id=3364436

Conference: http://www.k-cap.org/2019/

Authors/Contact: valdestilhas@informatik.uni-leipzig.de, tsoru@informatik.uni-leipzig.de, saleem@informatik.uni-leipzig.de


DL-Learner 1.4 (Supervised Structured Machine Learning Framework) Released

Dear all,

The Smart Data Analytics group [1] and the E.T.-db-MOLE sub-group located at InfAI Leipzig [2] are happy to announce

DL-Learner 1.4.

DL-Learner is a framework containing algorithms for supervised machine learning in RDF and OWL. DL-Learner can use various RDF and OWL serialization formats as well as SPARQL endpoints as input, can connect to most popular OWL reasoners and is easily and flexibly configurable. It extends concepts of Inductive Logic Programming and Relational Learning to the Semantic Web in order to allow powerful data analysis.

Website: http://dl-learner.org
GitHub page: https://github.com/SmartDataAnalytics/DL-Learner
Download: https://github.com/SmartDataAnalytics/DL-Learner/releases/tag/1.4.0

In the current release, we continued to improve the code and work on our query tree and class expression learning algorithms. The config file can now optionally be written in JSON syntax. We updated the packaging to be ready for Java 11 and also tested DL-Learner on Windows. Some logical fixes to the Horizontal Expansion in CELOE were reported and analysed by Yingbing Hua, thanks!

The DL-Learner system was also presented at The Web Conference 2018 in Lyon [3]. We want to thank everyone who helped to create this release. We also acknowledge support by the following projects: LIMBO [4], QROWD [5], SAKE [6], Big Data Europe [7], HOBBIT [8], GeoKnow [9], GOLD [10], and SLIPO [11].

Kind regards,

Jens Lehmann, Lorenz Bühmann, Patrick Westphal and Simon Bin

[1] http://sda.tech
[2] https://infai.org/efficient-technology-integration/
[3] http://jens-lehmann.org/files/2018/www_dllearner.pdf
[4] https://www.limbo-project.org/
[5] http://qrowd-project.eu/
[6] https://www.sake-projekt.de/
[7] https://www.big-data-europe.eu/
[8] http://project-hobbit.eu/
[9] http://geoknow.eu/
[10] http://aksw.org/Projects/GOLD.html
[11] http://www.slipo.eu/


DBpedia Day @ SEMANTiCS 2019

We are happy to announce that SEMANTiCS 2019 will host the 14th DBpedia Community Meeting on the last day of the conference, September 12, 2019.

Highlights/Sessions

  • Keynote #1: Katja Hose, Aalborg University, Denmark
  • Keynote #2: Dan Weitzner from WPSemantix
  • DBpedia Databus presentation and training session
  • DBpedia Association hour
  • DBpedia Showcase session
  • DBpedia Chapter session

Call for Contribution

Tell us what cool things you do with DBpedia:  Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.

Quick Facts

  • Web URL: https://wiki.dbpedia.org/events/14th-dbpedia-community-meeting-karlsruhe
  • When: September 12th, 2019
  • Where: FIZ Karlsruhe – Leibniz-Institut für Informationsinfrastruktur, Hermann-von-Helmholtz-Platz 1, 76344 Eggenstein-Leopoldshafen, Germany
  • Call for Contribution: Submit your proposal in our form
  • Registration: Attending the DBpedia Community meeting costs 90 €. You can buy your ticket on the SEMANTiCS website. DBpedia members get free admission. Please contact your nearest DBpedia chapter for a promotion code, or please contact the DBpedia Association.

Sponsors and Acknowledgments

In case you want to sponsor the 14th DBpedia Community Meeting, please contact the DBpedia Association via dbpedia@infai.org.

Organisation

  • Tina Schmeissner, DBpedia Association
  • Sandra Prätor, AKSW/KILT, DBpedia Association
  • Sebastian Hellmann, AKSW/KILT, DBpedia Association

We are looking forward to meeting you in Karlsruhe!

Your DBpedia Association


LDK conference @ University of Leipzig

With the advent of digital technologies, an ever-increasing amount of language data is now available across various application areas and industry sectors, making language data more and more valuable. In that context, we are happy to invite you to join the 2nd Language, Data and Knowledge (LDK) conference, which will be held in Leipzig from May 20 to 22, 2019.

This new biennial conference series aims at bringing together researchers from across disciplines concerned with language data in data science and knowledge-based applications.

The acquisition, provenance, representation, maintenance, usability and quality of language data, as well as its legal, organizational and infrastructure aspects, are at the centre of research on language data and thus constitute the focus of the conference.

To register and be part of the LDK conference and its associated events, please go to http://2019.ldk-conf.org/registration/.

Keynote Speakers

  • Keynote #1: Christian Bizer, Mannheim University
  • Keynote #2: Christiane Fellbaum, Princeton University
  • Keynote #3: Eduard Werner, Leipzig University

Associated Events

The following events are co-located with LDK 2019:

Workshops on the 20th May 2019

DBpedia Community Meeting on the 23rd May 2019

Looking forward to meeting you at the conference!


13th DBpedia community meeting in Leipzig

We are happy to invite you to join the 13th edition of the DBpedia Community Meeting, which will be held in Leipzig. Following the LDK conference, May 20-22, the DBpedia Community will get together on May 23rd, 2019 at Mediencampus Villa Ida. Once again the meeting will be accompanied by a varied program of exciting lectures and showcases.

Highlights/Sessions

  • Keynote #1: Making Linked Data Fun with DBpedia by Peter Haase, metaphacts
  • Keynote #2: From Wikipedia to Thousands of Wikis – The DBkWik Knowledge Graph by Heiko Paulheim, Universität Mannheim
  • NLP and DBpedia Session
  • DBpedia Association Hour
  • DBpedia Showcase Session

Call for Contribution

What cool things do you do with DBpedia? Present your tools and datasets at the DBpedia Community Meeting! Please submit your presentations, posters, demos or other forms of contributions through our web form.

Tickets

Attending the DBpedia Community meeting costs 40 €. You need to buy a ticket via eshop.sachsen.de. DBpedia members get free admission. Please contact your nearest DBpedia chapter for a promotion code, or please contact the DBpedia Association.

If you would like to attend the LDK conference, please register here.

We are looking forward to meeting you in Leipzig!
