AKSW Colloquium, 22-06-2015, Concept Expansion Using Web Tables, Mining entities from the Web, Linked Data Stack

Concept Expansion Using Web Tables by Chi Wang, Kaushik Chakrabarti, Yeye He, Kris Ganjam, Zhimin Chen, Philip A. Bernstein (WWW’2015), presented by Ivan Ermilov:

Abstract. We study the following problem: given the name of an ad-hoc concept as well as a few seed entities belonging to the concept, output all entities belonging to it. Since producing the exact set of entities is hard, we focus on returning a ranked list of entities. Previous approaches either use seed entities as the only input, or inherently require negative examples. They suffer from input ambiguity and semantic drift, or are not viable options for ad-hoc tail concepts. In this paper, we propose to leverage the millions of tables on the web for this problem. The core technical challenge is to identify the “exclusive” tables for a concept to prevent semantic drift; existing holistic ranking techniques like personalized PageRank are inadequate for this purpose. We develop novel probabilistic ranking methods that can model a new type of table-entity relationship. Experiments with real-life concepts show that our proposed solution is significantly more effective than applying state-of-the-art set expansion or holistic ranking techniques.
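For context, a minimal sketch of personalized PageRank (the holistic ranking baseline the abstract mentions) on a bipartite entity-table graph might look as follows; the graph and seed set are invented for illustration:

```python
# Minimal sketch of personalized PageRank on a bipartite entity-table
# graph, i.e. the baseline the paper argues is inadequate. The edge
# data and seeds below are illustrative, not from the paper.
from collections import defaultdict

def personalized_pagerank(edges, seeds, alpha=0.85, iters=50):
    """edges: (entity, table) co-occurrence pairs; seeds: teleport set."""
    neighbors = defaultdict(set)
    for e, t in edges:
        neighbors[e].add(t)
        neighbors[t].add(e)
    nodes = list(neighbors)
    rank = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        # teleport mass goes only to the seed entities
        nxt = {n: ((1 - alpha) / len(seeds) if n in seeds else 0.0)
               for n in nodes}
        for n in nodes:
            share = alpha * rank[n] / len(neighbors[n])
            for m in neighbors[n]:
                nxt[m] += share
        rank = nxt
    return rank

edges = [("Paris", "t1"), ("Lyon", "t1"), ("Lyon", "t2"), ("Nice", "t2"),
         ("Berlin", "t3"), ("Paris", "t3")]
scores = personalized_pagerank(edges, seeds={"Paris", "Lyon"})
print(sorted(((n, round(s, 3)) for n, s in scores.items()
              if not n.startswith("t")), key=lambda x: -x[1]))
```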

Mining entities from the Web by Anna Lisa Gentile

This talk explores the task of mining entities and their describing attributes from the Web. The focus is on entity-centric websites, i.e. domain-specific websites containing a description page for each entity. The task of extracting information from this kind of website is usually referred to as wrapper induction. We propose a simple knowledge-based method which is (i) highly flexible with respect to different domains and (ii) does not require any training material, but exploits Linked Data as a background knowledge source to build essential learning resources. Linked Data – an imprecise, redundant and large-scale knowledge resource – has proved useful in supporting this Information Extraction task: for domains that are covered, Linked Data serve as a powerful knowledge resource for gathering learning seeds. Experiments on a publicly available dataset demonstrate that, under certain conditions, this simple approach based on distant supervision can achieve competitive results against complex state-of-the-art methods that always depend on training data.

Linked Data Stack by Martin Röbert

Martin will present the packaging infrastructure developed for the Linked Data Stack project, followed by a discussion about the future of the project.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.


AKSW Colloquium, 15-06-2015, Caching for Link Discovery

Using Caching for Local Link Discovery on Large Data Sets [PDF]
by Mofeed Hassan

Engineering the Data Web in the Big Data era demands the development of time- and space-efficient solutions covering the lifecycle of Linked Data. As shown in previous works, pure in-memory solutions are doomed to failure as the size of datasets grows continuously over time. In this work, presented by Mofeed Hassan, a study of caching solutions is performed for one of the central tasks on the Data Web, i.e., the discovery of links between resources. To this end, six different caching approaches were evaluated on real data using different settings. Our results show that while existing caching approaches already allow performing Link Discovery on large datasets from local resources, the achieved cache hit rates are still poor. Hence, we suggest the need for solutions dedicated to this problem, in order to tackle the upcoming challenges pertaining to the construction of the Semantic Web.
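The six caching approaches from the paper are not reproduced here, but a minimal sketch of the general pattern, a cache in front of expensive resource lookups during link discovery, might look like this (the fetch function and cache size are illustrative assumptions):

```python
# Hedged sketch of caching during link discovery: an LRU cache in front
# of (potentially expensive) resource lookups. The fetch function and
# cache size are illustrative, not the paper's actual setup.
from functools import lru_cache

@lru_cache(maxsize=100_000)
def fetch_resource(uri: str) -> frozenset:
    # Placeholder: a real implementation would dereference the URI or
    # query a SPARQL endpoint and return the resource's property values.
    return frozenset({("rdfs:label", uri.rsplit("/", 1)[-1])})

def candidate_pairs(source_uris, target_uris):
    """Compare every source/target pair, reusing cached lookups."""
    for s in source_uris:
        s_props = fetch_resource(s)
        for t in target_uris:
            if s_props & fetch_resource(t):  # crude similarity test
                yield s, t

sources = ["http://ex.org/s/Leipzig", "http://ex.org/s/Berlin"]
targets = ["http://ex.org/t/Leipzig", "http://ex.org/t/Dresden"]
print(list(candidate_pairs(sources, targets)))
print(fetch_resource.cache_info())  # inspect cache hits vs. misses
```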


AKSW Colloquium, 08-06-2015, DBpediaSameAs, Dynamic-LOD

DBpediaSameAs: An approach to tackling heterogeneity in DBpedia identifiers by Andre Valdestilhas

This work provides an approach to tackling heterogeneity in DBpedia identifiers: while searching for co-references between different data sets, numerous redundant, transient owl:sameAs occurrences pointing to DBpedia identifiers were observed.

Thus, the work makes three contributions towards solving this problem: (1) a unique DBpedia identifier, which normalizes the several transient, redundant owl:sameAs occurrences into a single canonical identifier (see the sketch below); (2) a facility to rate and suggest links, which improves link quality and also yields statistics about the links; and (3) a performance gain: the physical size of the data decreased from 16.2 GB to 6 GB of triples, while normalization and the creation of an index become possible.
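A hedged sketch of the normalization idea behind contribution (1), collapsing clusters of mutually owl:sameAs-linked identifiers onto one canonical DBpedia identifier, could look like this; the data and the rule for choosing the canonical URI are illustrative:

```python
# Hedged sketch of owl:sameAs normalization: group identifiers connected
# by sameAs links (union-find) and map each group onto one canonical
# DBpedia URI. Data and canonicalization rule are illustrative only.
def normalize_sameas(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:               # union every sameAs pair
        parent[find(a)] = find(b)

    clusters = {}
    for uri in parent:
        clusters.setdefault(find(uri), set()).add(uri)

    canonical = {}
    for members in clusters.values():
        # pick a DBpedia URI as canonical if present, else any member
        rep = next((u for u in sorted(members)
                    if u.startswith("http://dbpedia.org/")), min(members))
        for u in members:
            canonical[u] = rep
    return canonical

pairs = [("http://dbpedia.org/resource/Leipzig", "http://example.org/LE1"),
         ("http://example.org/LE1", "http://example.org/LE2")]
print(normalize_sameas(pairs))
```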

The usability of the interface was evaluated with a standard usability questionnaire. The positive results from all interviewed participants showed that DBpediaSameAs is easy to use and can thus lead to novel insights.

As a proof of concept, an implementation is provided as a web system, including a web service and a graphical user interface.

Dynamic-LOD: An approach to counting links using Bloom filters by Ciro Baron

The Web of Linked Data is growing, and it becomes increasingly necessary to discover the relationships between different datasets.

Ciro Baron will present an approach to accurate link counting which uses Bloom filters (BF) to compare and approximately count links between datasets, solving the problem of missing up-to-date metadata about linksets. The paper, which compares the performance of BF to classical approaches such as binary search trees (BST) and hash tables (HT), can be found at http://svn.aksw.org/papers/2015/ISWC_DynLOD/public.pdf; the results show that Bloom filters are 12x more efficient regarding memory usage, with adequate query speed.
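To make the core data structure concrete, here is a minimal Bloom filter sketch; the bit-array size, hash count and double-hashing scheme are illustrative choices, not necessarily those of Dynamic-LOD:

```python
# Minimal Bloom filter sketch for approximate link membership tests.
# Bit-array size, hash count and the double-hashing scheme are
# illustrative choices, not necessarily those used in Dynamic-LOD.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1 << 20, k_hashes=7):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item: str):
        digest = hashlib.sha1(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return ((h1 + i * h2) % self.m for i in range(self.k))

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item: str):
        return all(self.bits[p >> 3] & (1 << (p & 7))
                   for p in self._positions(item))

links = BloomFilter()
links.add("http://dbpedia.org/resource/Leipzig")
# Counting links from another dataset becomes a membership scan:
candidates = ["http://dbpedia.org/resource/Leipzig",
              "http://dbpedia.org/resource/Berlin"]
print(sum(uri in links for uri in candidates))  # approximate link count
```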

In addition, Ciro will show a small cloud generated for all English DBpedia datasets and vocabularies available in Linked Open Vocabularies (LOV).

We evaluated Dynamic-LOD in three different respects: firstly, by analyzing data structure performance, comparing BF with HT and BST; secondly, a quantitative evaluation regarding false positives and the speed of counting links in a dense scenario like DBpedia; and thirdly, on a large scale based on lod-cloud distributions. All three evaluations indicate that BF is a good choice for what our work proposes.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.


Smart Data Web project kick-off

Smart Data Web, a new BMWi-funded project, kicked off in Berlin. The central goal of Smart Data Web is to leverage state-of-the-art data extraction and enrichment technologies as well as Linked Data to create value-added systems for German industry. Knowledge relevant to decision-making processes will be extracted from government and industry data, official web pages and social media, analyzed using NLP and integrated into knowledge graphs. These graphs will be accessible to focus industries via dashboards and APIs, as well as to the public via Linked Data. Special attention will be given to legal questions, such as data licensing as well as data security and privacy.

AKSW, representing the University of Leipzig in this project, will develop the German Knowledge Graph, the central aggregation and integration interface of Smart Data Web. Unlike most current Linked Data knowledge bases, the German Knowledge Graph will focus on industry-relevant data. The graph will be developed in an iterative extraction, integration and interlinking process, building on proven technologies of the LOD2 stack. Data quality and persistence are a special priority of the German Knowledge Graph, since consistency has to be guaranteed at all times. RDFUnit is our tool of choice to accomplish this task.
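RDFUnit expresses data-quality tests as SPARQL patterns whose results are constraint violations. A hedged sketch of that style of check, run with rdflib against a toy graph, is shown below; the constraint and the data are illustrative, not project artifacts:

```python
# Hedged sketch of an RDFUnit-style data-quality test: a SPARQL query
# that selects constraint violations (here: organizations missing a
# label). Constraint and data are illustrative, not project artifacts.
from rdflib import Graph

g = Graph()
g.parse(data="""
    @prefix ex:   <http://example.org/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:siemens a ex:Organization ; rdfs:label "Siemens" .
    ex:acme    a ex:Organization .
""", format="turtle")

violations = g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?org WHERE {
        ?org a ex:Organization .
        FILTER NOT EXISTS { ?org rdfs:label ?label }
    }
""")
for row in violations:
    print("violation: missing label on", row.org)
```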

Smart Data Web will contribute significantly to overcoming the barriers that hinder the integration of Semantic Web technologies, Web 2.0 data and data analysis in commercial applications. Our partners in this project are Beuth University of Applied Sciences, DFKI, Siemens, uberMetrics and VICO Research.

Find out more at smartdataweb.de.

Martin Brümmer on behalf of the NLP2RDF group


AKSW Colloquium, 01-06-2015, MEX – Publishing ML Experiment Results, Scaling DL-Learner – Status and Plans

MEX – Publishing ML Experiment Results by Diego Esteves

Over the decades, many machine learning experiments have been published, contributing to the progress of the scientific community. A key factor in comparing machine learning experiment results with each other and collaborating productively is to perform them thoroughly on the same computing environment, using the same sample data sets and algorithm configurations. Moreover, practical experience shows that scientists and engineers tend to produce so much output data for their sets of experiments that, without provenance metadata, it is difficult to analyze and archive properly. Despite the efforts to publish and manage these variables accordingly, a knowledge gap remains: a public ontology for machine learning experiments, needed to achieve interoperability of published results, is still missing. To minimize this gap, we introduce the novel MEX ontology, built on the W3C PROV Ontology (PROV-O) and following the nanopublication principles.
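As a hedged illustration of what a PROV-O-grounded experiment description could look like, the following rdflib sketch models one run as a prov:Activity; the mex: namespace and terms are placeholders, not the actual MEX vocabulary:

```python
# Hedged sketch: describing one ML experiment run in RDF on top of
# PROV-O. The mex: namespace and terms below are placeholders for
# illustration; consult the published MEX ontology for the real ones.
from rdflib import Graph, Namespace, Literal, RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
MEX = Namespace("http://example.org/mex#")   # placeholder namespace
EX = Namespace("http://example.org/run/")

g = Graph()
g.bind("prov", PROV)
g.bind("mex", MEX)

run = EX["42"]
g.add((run, RDF.type, PROV.Activity))
g.add((run, PROV.used, EX["iris-dataset"]))           # input dataset
g.add((run, MEX.algorithm, Literal("RandomForest")))  # configuration
g.add((run, MEX.accuracy, Literal(0.947)))            # measured result
g.add((EX["model-42"], PROV.wasGeneratedBy, run))     # output artifact

print(g.serialize(format="turtle"))
```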

Scaling DL-Learner – Status and Plans by Simon Bin

With the advent of Big Data and large-scale data processing, new challenges are also approaching DL-Learner. The framework for supervised machine learning on OWL and RDF may benefit from various approaches to make it work better with huge data. In this talk, first experimental results of using SPARQL instead of traditional OWL reasoner approaches will be shown, and possible future directions for scaling DL-Learner will be sketched.
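As a hedged sketch of the SPARQL-based direction: instead of loading the whole knowledge base into an in-memory OWL reasoner, instance retrieval for a candidate class expression can be delegated to a SPARQL endpoint. The endpoint and the rewriting below are illustrative assumptions, not DL-Learner's actual implementation:

```python
# Hedged sketch: retrieving the instances of a candidate class
# expression via SPARQL instead of an in-memory OWL reasoner. The
# endpoint and the class expression rewriting are illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
# Candidate expression "City AND (country SOME Country)" rendered as a
# SPARQL graph pattern -- one possible rewriting, not DL-Learner's own.
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT DISTINCT ?x WHERE {
        ?x a dbo:City ;
           dbo:country ?c .
        ?c a dbo:Country .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["x"]["value"])
```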

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.


AKSW Colloquium, 18-05-2015, Multilingual Morpheme Ontology, Personalised Access and Enrichment of Linked Data Resources

MMoOn – A Multilingual Morpheme Ontology by Bettina Klimek

In recent years, lexical resources have emerged rapidly in the Semantic Web. Whereas most of this linguistic information is already machine-readable, we found that morphological information is either absent or only contained in semi-structured strings. While a plethora of linguistic resources for the lexical domain already exist and are highly reused, there is still a great gap for equivalent morphological datasets and ontologies. In order to enable capturing the semantics of expressions beneath the word level, I will present a Multilingual Morpheme Ontology called MMoOn. It is designed for the creation of machine-processable and interoperable morpheme inventories of a given natural language. As such, any MMoOn dataset will contain not only semantic information on whole words and word forms but also information on the meaningful parts of which they consist, including inflectional and derivational affixes, stems and bases, as well as a wide range of their underlying meanings.
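To make the notion of a machine-processable morpheme inventory concrete, here is a hedged rdflib sketch decomposing one word form into its meaningful parts; the mmoon: namespace and class names are placeholders rather than the published vocabulary:

```python
# Hedged sketch of a morpheme inventory entry: the word "unbreakable"
# decomposed into prefix, root and suffix. The mmoon: terms below are
# placeholders for illustration, not the actual MMoOn vocabulary.
from rdflib import Graph, Namespace, Literal, RDF

MMOON = Namespace("http://example.org/mmoon#")  # placeholder namespace
EX = Namespace("http://example.org/eng/")

g = Graph()
g.bind("mmoon", MMOON)

word = EX["unbreakable"]
g.add((word, RDF.type, MMOON.Word))
g.add((word, MMOON.representation, Literal("unbreakable", lang="en")))
for part, kind in [("un", MMOON.DerivationalPrefix),
                   ("break", MMOON.Root),
                   ("able", MMOON.DerivationalSuffix)]:
    m = EX[f"morph-{part}"]
    g.add((m, RDF.type, kind))
    g.add((m, MMOON.representation, Literal(part, lang="en")))
    g.add((word, MMOON.consistsOf, m))

print(g.serialize(format="turtle"))
```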

Personalised Access and Enrichment of Linked Data Resources by Milan Dojchinovski

Recent efforts in the Semantic Web community have primarily focused on developing technical infrastructure and methods for efficient Linked Data acquisition, interlinking and publishing. Nevertheless, actually accessing a piece of information in the LOD cloud still demands a significant amount of effort. In recent years, we have conducted two lines of research to address this problem. The first line of research aims at developing graph-based methods for “personalised access to Linked Data”. A key contribution of this research is the “Linked Web APIs” dataset, the largest Web services dataset with over 11K service descriptions, which has been used as a validation dataset. The second line of research aims at the enrichment of Linked Data text resources and the development of “entity recognition and linking” methods. In the talk, I will present the developed methods, the results of evaluations on different datasets and evaluation challenges, and the lessons learned in these activities. I will discuss the adaptability and performance of the developed methods and present future directions.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.


AKSW Colloquium, 11-05-2015, DBpedia distributed extraction framework

Scaling up the DBpedia extraction framework by Nilesh Chakraborty

The DBpedia extraction framework extracts different kinds of structured information from Wikipedia to generate various datasets. Performing a full extraction of the Wikipedia dumps of all languages (or even just the mapping-based languages) takes a significant amount of time. The distributed extraction framework runs the extraction on top of Apache Spark, so that users can leverage multi-core machines or a distributed cluster of commodity machines to perform faster extraction. For example, performing extraction of the 30-40 mapping-based languages on a machine with a quad-core CPU and 16 GB RAM takes about 36 hours. Running the distributed framework in the same setting using three such worker nodes takes around 10 hours, and it is easy to achieve faster running times by adding more cores or more machines. Apart from the Spark-based extraction framework, we have also implemented a distributed wiki-dump downloader to download Wikipedia dumps for multiple languages, from multiple mirrors, on a cluster in parallel. This is still work in progress; in this talk I will discuss the methods and challenges involved in this project, and our immediate goals and timeline.
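A hedged PySpark sketch of the underlying pattern, parallelizing per-page extraction over a cluster, is shown below; the paths and the extractor are placeholders, and the real framework is written in Scala:

```python
# Hedged PySpark sketch of distributed extraction: parse wiki pages in
# parallel and emit structured records. Paths and the extractor are
# placeholders; the actual framework is Scala-based and far richer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dump-extraction-sketch").getOrCreate()

def extract(page_xml: str):
    # Placeholder extractor: a real one would parse the wikitext, apply
    # the mappings and emit structured records (e.g. RDF) per page.
    if "<title>" not in page_xml:
        return []
    title = page_xml.split("<title>")[1].split("</title>")[0]
    return [("title", title)]

pages = spark.sparkContext.textFile("hdfs:///dumps/enwiki-pages.xml")
records = pages.flatMap(extract)          # runs on all workers
records.saveAsTextFile("hdfs:///out/extracted")
```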


Invited talk @AIMS webinar series

On the 5th of May, Ivan Ermilov, on behalf of AKSW, presented the CKAN data catalog as part of the AIMS (Agricultural Information Management Standards) webinar series. The recording and the slides of the webinar “CKAN as an open-source data management solution for open data” are available on the AIMS web portal: http://aims.fao.org/capacity-development/webinars/ckan-open-source-data-management-solution-open-data

AIMS organizes free webinars, open to everyone, on various topics. You can find more recordings and material on the AIMS webpage, YouTube channel and Slideshare:

Main page of Webinars@AIMS: http://aims.fao.org/capacity-development/webinars

YouTube: http://www.youtube.com/user/FAOAIMSVideos

Slideshare: http://www.slideshare.net/faoaims/ckan-as-an-opensource-data-management-solution-for-open-data


AKSW Colloquium, 04-05-2015, Automating RDF Dataset Transformation and Enrichment, Structured Machine Learning in Life Science

Automating RDF Dataset Transformation and Enrichment by Mohamed Sherif

With the adoption of RDF across several domains come growing requirements pertaining to the completeness and quality of RDF datasets. Currently, this problem is most commonly addressed by manually devising means of enriching an input dataset. The few tools that aim at supporting this endeavour usually focus on supporting the manual definition of enrichment pipelines. In this talk, we present a supervised learning approach based on a refinement operator for enriching RDF datasets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against eight manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples.
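As a hedged sketch of the general refinement-operator idea (greedily extending an enrichment pipeline as long as the score on the training examples improves), consider the following; the steps and scoring function are invented placeholders, not the method presented in the talk:

```python
# Hedged sketch of refinement-operator-style pipeline learning: greedily
# append the enrichment step that most improves a score on the training
# examples. Steps and the scoring function are invented placeholders.
def learn_pipeline(steps, score, max_len=5):
    """steps: candidate enrichment steps; score: pipeline -> float."""
    pipeline, best = [], score([])
    while len(pipeline) < max_len:
        top_score, top_step = max(
            (score(pipeline + [s]), s) for s in steps)
        if top_score <= best:        # no refinement improves: stop
            break
        pipeline.append(top_step)
        best = top_score
    return pipeline, best

# Toy usage: the score rewards matching a fixed target sequence, with a
# small length penalty, just to exercise the control flow.
steps = ["dereference", "conformation", "linking", "filtering"]
target = ["dereference", "linking"]
score = lambda p: sum(a == b for a, b in zip(p, target)) - 0.1 * len(p)
print(learn_pipeline(steps, score))
```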

Structured Machine Learning in Life Science (PhD progress report) by Patrick Westphal

The utilization of machine learning techniques to solve life science tasks has become widespread within the last years. Since these techniques mainly work on unstructured data, one question is whether they could benefit from the provision of structured background knowledge. One prevalent way to express background knowledge in the life sciences is the Web Ontology Language (OWL). Accordingly, there is a great variety of domain ontologies covering anatomy, genetics, biological processes or chemistry that can be used to form structured machine learning approaches in the life science domain. The talk will give a brief overview of tasks and problems of structured machine learning in life science. Besides the special characteristics observed when applying state-of-the-art concept learning approaches to life science tasks, a short description of the actual differences to concept learning setups in other domains is given. Further, some directions for machine-learning-based techniques are shown that could support concept learning in life science tasks.

About the AKSW Colloquium

This event is part of a series of events about Semantic Web technology. Please see http://wiki.aksw.org/Colloquium for further information about previous and future events. As always, Bachelor and Master students are able to get points for attendance and there is complimentary coffee and cake after the session.


AKSW Colloquium, 27-04-2015, Ontotext’s RDF database-as-a-service (DBaaS) via Self-Service Semantic Suite (S4) platform & Knowledge-Based Trust

This colloquium features two talks. First, the Self-Service Semantic Suite (S4) platform is presented by Marin Dimitrov (Ontotext), followed by Jörg Unbehauen’s report on Google’s effort to use factual correctness as a ranking factor.

RDF database-as-a-service (DBaaS) via Self-Service Semantic Suite (S4) platform

In this talk Marin Dimitrov (Ontotext) will introduce the RDF database-as-a-service (DBaaS) options for managing RDF data in the Cloud via the Self-Service Semantic Suite (S4) platform. With S4, developers and researchers can instantly get access to a fully managed RDF DBaaS, without the need for hardware provisioning, maintenance and operations. Additionally, the S4 platform provides on-demand access to text analytics services for news, social media and life sciences, as well as access to knowledge graphs (DBpedia, Freebase and GeoNames).

The goal of the S4 platform is to make it easy for developers and researchers to develop smart/semantic applications without the need to spend time and effort on infrastructure provisioning and maintenance. Marin will also provide examples of EC-funded research projects (DaPaaS, ProDataMarket and KConnect) that plan to utilise the S4 platform for semantic data management.

More information on S4 will be available in [1], [2] and [3]:

[1] Marin Dimitrov, Alex Simov and Yavor Petkov. On-demand Text Analytics and Metadata Management with S4. In: Proceedings of the Workshop on Emerging Software as a Service and Analytics (ESaaSA 2015) at the 5th International Conference on Cloud Computing and Services Science (CLOSER 2015), Lisbon, Portugal.

[2] Marin Dimitrov, Alex Simov and Yavor Petkov. Text Analytics and Linked Data Management As-a-Service with S4. In: Proceedings of the 3rd International Workshop on Semantic Web Enterprise Adoption and Best Practice (WaSABi 2015) at the Extended Semantic Web Conference (ESWC 2015), May 31st 2015, Portoroz, Slovenia.

[3] Marin Dimitrov, Alex Simov and Yavor Petkov. Low-cost Open Data As-a-Service in the Cloud. In: Proceedings of the 2nd Semantic Web Enterprise Developers Workshop (SemDev 2015) at the Extended Semantic Web Conference (ESWC 2015), May 31st 2015, Portoroz, Slovenia.

Report on: “Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources”

by Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, Wei Zhang

Link to the paper

Presentation by Jörg Unbehauen

Abstract:

“The quality of web sources has been traditionally evaluated using exogenous signals such as the hyperlink structure of the graph. We propose a new approach that relies on endogenous signals, namely, the correctness of factual information provided by the source. A source that has few false facts is considered to be trustworthy. The facts are automatically extracted from each source by information extraction methods commonly used to construct knowledge bases. We propose a way to distinguish errors made in the extraction process from factual errors in the web source per se, by using joint inference in a novel multi-layer probabilistic model. We call the trustworthiness score we computed Knowledge-Based Trust (KBT). On synthetic data, we show that our method can reliably compute the true trustworthiness levels of the sources. We then apply it to a database of 2.8B facts extracted from the web, and thereby estimate the trustworthiness of 119M webpages. Manual evaluation of a subset of the results confirms the effectiveness of the method.”
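The core intuition of the quoted abstract, that a source's trustworthiness is essentially the (smoothed) fraction of its extracted facts that are correct, can be sketched as follows. This deliberately ignores the paper's multi-layer model separating extraction errors from source errors:

```python
# Hedged sketch of the core KBT intuition: trustworthiness as a smoothed
# fraction of correct facts per source. The actual paper uses joint
# inference in a multi-layer probabilistic model; this is far simpler.
def knowledge_based_trust(facts, truth, alpha=1.0, beta=1.0):
    """facts: {source: [fact, ...]}; truth: fact -> bool.
    Returns a Beta-smoothed correctness estimate per source."""
    scores = {}
    for source, extracted in facts.items():
        correct = sum(truth(f) for f in extracted)
        scores[source] = (correct + alpha) / (len(extracted) + alpha + beta)
    return scores

kb = {("Obama", "bornIn", "Hawaii"): True,
      ("Obama", "bornIn", "Kenya"): False}
facts = {"siteA": [("Obama", "bornIn", "Hawaii")],
         "siteB": [("Obama", "bornIn", "Kenya")] * 3}
print(knowledge_based_trust(facts, lambda f: kb.get(f, False)))
```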
