Mohamed Ahmed Mohamed Sherif

Automating Geospatial RDF Dataset Integration and Enrichment

Dokumente und Dateien


Bitte nutzen Sie beim Zitieren immer folgende Url:

Kurzfassung in Englisch

Over the last years, the Linked Open Data (LOD) has evolved from a mere 12 to more than 10,000 knowledge bases. These knowledge bases come from diverse domains including (but not limited to) publications, life sciences, social networking, government, media, linguistics. Moreover, the LOD cloud also contains a large number of crossdomain knowledge bases such as DBpedia and Yago2. These knowledge bases are commonly managed in a decentralized fashion and contain partly verlapping information. This architectural choice has led to knowledge pertaining to the same domain being published by independent entities in the LOD cloud. For example, information on drugs can be found in Diseasome as well as DBpedia and Drugbank. Furthermore, certain knowledge bases such as DBLP have been published by several bodies, which in turn has lead to duplicated content in the LOD . In addition, large amounts of geo-spatial information have been made available with the growth of heterogeneous Web of Data.

The concurrent publication of knowledge bases containing related information promises to become a phenomenon of increasing importance with the growth of the number of independent data providers. Enabling the joint use of the knowledge bases published by these providers for tasks such as federated queries, cross-ontology question answering and data integration is most commonly tackled by creating links between the resources described within these knowledge bases. Within this thesis, we spur the transition from isolated knowledge bases to enriched Linked Data sets where information can be easily integrated and processed. To achieve this goal, we provide concepts, approaches and use cases that facilitate the integration and enrichment of information with other data types that are already present on the Linked Data Web with a focus on geo-spatial data.

The first challenge that motivates our work is the lack of measures that use the geographic data for linking geo-spatial knowledge bases. This is partly due to the geo-spatial resources being described by the means of vector geometry. In particular, discrepancies in granularity and error measurements across knowledge bases render the selection of appropriate distance measures for geo-spatial resources difficult. We address this challenge by evaluating existing literature for point set measures that can be used to measure the similarity of vector geometries. Then, we present and evaluate the ten measures that we derived from the literature on samples of three real knowledge bases.

The second challenge we address in this thesis is the lack of automatic Link Discovery (LD) approaches capable of dealing with geospatial knowledge bases with missing and erroneous data. To this end, we present Colibri, an unsupervised approach that allows discovering links between knowledge bases while improving the quality of the instance data in these knowledge bases. A Colibri iteration begins by generating links between knowledge bases. Then, the approach makes use of these links to detect resources with probably erroneous or missing information. This erroneous or missing information detected by the approach is finally corrected or added.

The third challenge we address is the lack of scalable LD approaches for tackling big geo-spatial knowledge bases. Thus, we present Deterministic Particle-Swarm Optimization (DPSO), a novel load balancing technique for LD on parallel hardware based on particle-swarm optimization. We combine this approach with the Orchid algorithm for geo-spatial linking and evaluate it on real and artificial data sets. The lack of approaches for automatic updating of links of an evolving knowledge base is our fourth challenge. This challenge is addressed in this thesis by the Wombat algorithm. Wombat is a novel approach for the discovery of links between knowledge bases that relies exclusively on positive examples. Wombat is based on generalisation via an upward refinement operator to traverse the space of Link Specifications (LS). We study the theoretical characteristics of Wombat and evaluate it on different benchmark data sets.

The last challenge addressed herein is the lack of automatic approaches for geo-spatial knowledge base enrichment. Thus, we propose Deer, a supervised learning approach based on a refinement operator for enriching Resource Description Framework (RDF) data sets. We show how we can use exemplary descriptions of enriched resources to generate accurate enrichment pipelines. We evaluate our approach against manually defined enrichment pipelines and show that our approach can learn accurate pipelines even when provided with a small number of training examples.

Each of the proposed approaches is implemented and evaluated against state-of-the-art approaches on real and/or artificial data sets. Moreover, all approaches are peer-reviewed and published in a conference or a journal paper. Throughout this thesis, we detail the ideas, implementation and the evaluation of each of the approaches. Moreover, we discuss each approach and present lessons learned. Finally, we conclude this thesis by presenting a set of possible future extensions and use cases for each of the proposed approaches.

weitere Metadaten

übersetzter Titel
Automatische geografische RDF Datensatzintegration und Anreicherung
Linkentdeckung, maschinelles Lernen, geographische Daten, Lastverteilung, Verfeinerungsoperator, Linked Data, offene Daten, RDF
Link Discovery, Machine Learning, Geographic data, Load balancing, refinement operator, Linked Data, Open data, RDF, Big Data
SWD Schlagworte Linked Data , Open Data , Maschinelles Lernen
DDC Klassifikation000
HochschuleUniversität Leipzig
FakultätFakultät für Mathematik und Informatik, Betriebliche Informationssysteme
ProfessurProf. Dr. Ing. habil. Klaus-Peter Fähnrich
BetreuerProf. Dr. habil. Klaus-Peter Fähnrich
Prof. Dr. Jens Lehmann
Dr. Axel-Cyrille Ngonga Ngomo
Prof. Dr. Sören Auer
GutachterProf. Dr. Ing. habil. Klaus-Peter Fähnrich
Prof. Dr. Daniel Mirankar
Tag d. Einreichung (bei der Fakultät)25.02.2016
Tag d. Verteidigung / Kolloquiums / Prüfung12.05.2016
Veröffentlichungsdatum (online)12.12.2016
persistente URNurn:nbn:de:bsz:15-qucosa-215708

Hinweis zum Urheberrecht

Diese Website ist eine Installation von Qucosa - Quality Content of Saxony!
Sächsische Landesbibliothek Staats- und Universitätsbibliothek Dresden