DBpedia Live Extraction Test Server

We currently run a live extraction of DBpedia on one of our servers. The live extraction aims to bring Wikipedia and DBpedia closer together: errors in DBpedia can soon be corrected directly by editing the corresponding Wikipedia article.
We created a test page on Wikipedia at http://en.wikipedia.org/wiki/User:DBpedia for observing the effects of the live extraction on the endpoint.

Our further plans regarding the live extraction are:

  • Upgrading the current Virtuoso installation to version 6.0, which will come out soon (counting the days…)
    Note: since Virtuoso has a Wikipedia page with an infobox, you can tell your personal semantic agent to notify you as soon as the 6.0 release is out, because this information can now be queried with SPARQL on DBpedia-Live (a sketch of such a query follows this list). See the Wikipedia page, DBpedia and DBpedia-Live. (Addition: Kingsley Idehen just contacted us and installed the DBpedia VAD on our DBpedia mirror, so the data is now also accessible as Linked Data. See http://db0.aksw.org:8890/resource/Virtuoso_Universal_Server and note the owl:sameAs link to the original DBpedia.)
  • Getting all extractors online (some are deactivated right now)
  • Gathering feedback from the community, as DBpedia is a vital resource of the Semantic Web.
  • Deploying the live extraction on the public DBpedia endpoint.
  • Deploying an OntoWiki on our DBpedia-Live mirror.
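
As a small illustration of the "semantic agent" note above, here is a minimal SPARQL sketch such an agent could poll against the DBpedia-Live endpoint. The property name dbpprop:latestReleaseVersion is an assumption; which property the infobox extraction actually produces should be checked on the resource page first.

# Minimal sketch: notify once this ASK query returns true on DBpedia-Live.
# dbpprop:latestReleaseVersion is an assumed property name.
PREFIX dbpprop: <http://dbpedia.org/property/>
ASK {
  <http://dbpedia.org/resource/Virtuoso_Universal_Server>
      dbpprop:latestReleaseVersion ?version .
  FILTER(regex(str(?version), '^6\\.0'))
}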

As we are in a phase of active development, we look forward to receiving your feedback and comments on the live extraction and on the ideas described below.

How the algorithm works:
Each extractor now has three different states: Active, Keep, and Purge, which can be used (in that order) to update triples, leave triples unchanged, or remove triples.
As each extractor is responsible for certain properties in DBpedia and the overlap between extractors is low, we use this to determine which triples to keep and which to delete. Say an article like dbpedia:Berlin is updated. First, the current triples that have dbpedia:Berlin as subject are retrieved from DBpedia-Live via a SPARQL query. A diff against the freshly extracted triples is then created; triples that disappeared are deleted and new ones are inserted. When an extractor such as the ExternalLinksExtractor is in state ‘Keep’, a filter is imposed on the SPARQL query, which removes all triples with properties belonging to that extractor (in this case http://dbpedia.org/property/reference). Triples filtered out during retrieval do not show up in the diff and will therefore not be deleted. Certain properties such as owl:sameAs are always included in the filter, as they are not generated by any DBpedia extractor.
More sophisticated filters are also possible, for example on triples with a certain namespace as predicate or object. This gives fine-grained control over which data to keep, remove, or update in DBpedia.
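
As a minimal sketch of the retrieval step, assume dbpedia:Berlin is updated while the ExternalLinksExtractor is in state ‘Keep’; the query then looks roughly like this (the complete filter set from our test server follows below):

# Retrieve the current triples of the article; the filter removes the
# ‘Keep’ extractor’s property, so its triples never enter the diff and
# are therefore preserved.
SELECT ?p ?o WHERE {
  <http://dbpedia.org/resource/Berlin> ?p ?o .
  FILTER(!regex(str(?p), 'http://dbpedia.org/property/reference'))
}

The resulting diff could then be applied with two SPARUL statements, sketched here with an invented population triple (the target graph URI is also an assumption):

# Delete triples that vanished from the new extraction …
DELETE FROM GRAPH <http://dbpedia.org> {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/property/populationEstimate> "3416255" .
}
# … and insert those that are new.
INSERT INTO GRAPH <http://dbpedia.org> {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/property/populationEstimate> "3431675" .
}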

This is an example of the filters currently imposed on the queries on our test server (!regex is very fast, while regex isn’t). From these filters, it is also possible to guess which extractors are currently still deactivated:
SELECT * WHERE { ?s ?p ?o . FILTER(
((!regex(str(?p), 'http://dbpedia.org/property/wikilink'))
&&(!regex(str(?p), 'http://dbpedia.org/property/wordnet_type'))
&&(!regex(str(?p), 'http://www.w3.org/2002/07/owl#sameAs'))
&&(!regex(str(?p), 'http://dbpedia.org/property/abstract'))
&&(!regex(str(?p), 'http://www.w3.org/2000/01/rdf-schema#comment'))
&&(!regex(str(?p), 'http://xmlns.com/foaf/0.1/depiction'))
&&(!regex(str(?p), 'http://xmlns.com/foaf/0.1/img'))
&&(!regex(str(?p), 'http://dbpedia.org/property/hasPhotoCollection'))
&&(!regex(str(?o), 'http://dbpedia.org/class/yago/'))
&&((!regex(str(?p), 'http://www.w3.org/1999/02/22-rdf-syntax-ns#type')
||!regex(str(?o), 'http://dbpedia.org/ontology/') )))). }

Open issues:

  • Synchronization with the LOD cloud, e.g. automatic creation of links when new DBpedia instances appear.
  • As there is no static state in DBpedia-Live any more, the Yago and Umbel data also have to be loaded on the fly.

A collection of ideas on an update log:

  1. OWL 2 Axiom Annotations: Although this would result in at least five times more triples, it would represent a clean solution tackling many problems. As the extractors already have URIs, the provenance of a triple would be clear, and additional information such as confidence could be kept as well. DBpedia is quite a small dataset compared to Bio2RDF (2.5 billion triples) and DBtune (14 billion), so performance might not be a problem. (A sketch follows this list.)
  2. Making a service that could parse older revisions of Wikipedia pages on the fly and recreate older versions.
  3. Using named graphs for the update log; any ideas?
  4. Additionally, the update log idea from the WWW paper about Triplify could be applied.
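
To make idea 1 concrete, here is a sketch of a single extracted triple with an OWL 2 axiom annotation attached, written as a SPARUL insert. The owl:annotated* vocabulary is standard OWL 2; the extractor URI and the confidence property are hypothetical. The four structural triples plus the annotations themselves are where the factor of five or more comes from.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dc:  <http://purl.org/dc/elements/1.1/>
# One extracted triple plus its axiom annotation (hypothetical
# extractor URI and confidence property, for illustration only).
INSERT INTO GRAPH <http://dbpedia.org> {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/property/populationEstimate> "3431675" .
  _:ax a owl:Axiom ;
      owl:annotatedSource   <http://dbpedia.org/resource/Berlin> ;
      owl:annotatedProperty <http://dbpedia.org/property/populationEstimate> ;
      owl:annotatedTarget   "3431675" ;
      dc:source             <http://dbpedia.org/extraction/InfoboxExtractor> ;
      <http://dbpedia.org/meta/confidence> "0.9" .
}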

Some ideas on a Live update stream to create synchronized mirrors of DBpedia (far future):
1. Release DBpedia dumps more frequently, e.g. once a week.
2. Provide a method to receive updates via a push or pull mechanism in SPARUL.
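
One message in such an update stream could simply be a self-contained SPARUL command that a mirror pulls (or receives via push) and executes in order. A sketch, where the graph URI and the concrete triple are invented for illustration:

# Hypothetical changeset, executed verbatim by a mirror; shipping plain
# SPARUL keeps the mirror a standard SPARQL/SPARUL endpoint.
MODIFY <http://dbpedia.org>
DELETE { <http://dbpedia.org/resource/Berlin>
             <http://dbpedia.org/property/leaderName> ?old . }
INSERT { <http://dbpedia.org/resource/Berlin>
             <http://dbpedia.org/property/leaderName> "Klaus Wowereit" . }
WHERE  { <http://dbpedia.org/resource/Berlin>
             <http://dbpedia.org/property/leaderName> ?old . }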

DBpedia is a joint effort of AKSW (Sören Auer, Jens Lehmann, Sebastian Hellmann), Freie Universität Berlin (Chris Bizer, Georgi Kobilarov, Anja Jentsch) and OpenLink Software, who provide the public endpoint and the powerful Virtuoso Universal Server.
