Overview
KnoFuss is a system for semantic data fusion. It takes as input two semantic datasets represented in RDF and resolves the data linking problem:
- Different datasets often contain information about the same entities but refer to them using different URIs. Such individuals have to be identified and either merged (by replacing URIs) or linked (e.g., using owl:sameAs relations). This problem is similar to the record linkage task studied in the database community.
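As a small illustration of the two options, the sketch below uses the rdflib Python library to link two coreferent individuals with owl:sameAs. The URIs and the choice of library are purely illustrative and are not part of KnoFuss.

```python
# Illustration only (not part of KnoFuss): linking two coreferent individuals
# with owl:sameAs using rdflib. The URIs are made up.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
person_in_dataset_a = URIRef("http://example.org/a/Tim_Berners-Lee")
person_in_dataset_b = URIRef("http://example.org/b/tbl")

# Linking: both URIs are kept and declared equivalent.
g.add((person_in_dataset_a, OWL.sameAs, person_in_dataset_b))
# Merging would instead rewrite every occurrence of one URI into the other.

print(g.serialize(format="turtle"))
```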
Main features
KnoFuss is implemented as a modular and extensible architecture based on problem-solving methods. The generic task of data linking is decomposed into several subtasks (see [1]). Two main subtasks are:
- Individual matching. This subtask compares properties of two individuals to decide whether they are likely to represent the same entity.
- Dataset matching. At this stage, the whole set of candidate mappings produced by individual matching is analysed and refined. In this way, the system can capture the mutual impact of different mappings as well as the influence of ontological constraints: for example, a mapping between two individuals is less likely to be correct if accepting it makes the data inconsistent.
Each subtask of the fusion process can be performed by different methods, both generic and domain-dependent (e.g., using key attributes or machine-learning models for coreferencing, and hand-tailored rules or formal ontology diagnosis for conflict detection). For both subtasks several techniques can be applied: for example, string-based and set-based similarity metrics for individual matching, and ontological constraints or belief networks for dataset matching. Such methods can be plugged into the system and combined into a library, so that appropriate ones are selected depending on the task at hand.
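The plug-in structure can be pictured roughly as follows. This is a hypothetical sketch, not the actual KnoFuss interfaces; all class and method names are invented for illustration.

```python
# Hypothetical sketch of a pluggable method library; the real KnoFuss
# interfaces differ. IndividualMatcher, DatasetMatcher and MethodLibrary
# are invented names.
from abc import ABC, abstractmethod


class IndividualMatcher(ABC):
    """Produces candidate mappings (pairs of URIs with a confidence value)."""

    @abstractmethod
    def match(self, source_individuals, target_individuals):
        ...


class DatasetMatcher(ABC):
    """Refines a whole set of candidate mappings at once."""

    @abstractmethod
    def refine(self, candidate_mappings, ontology):
        ...


class MethodLibrary:
    """Registry from which appropriate methods are selected for a given task."""

    def __init__(self):
        self.individual_matchers = {}
        self.dataset_matchers = {}

    def register_individual_matcher(self, name, matcher):
        self.individual_matchers[name] = matcher

    def register_dataset_matcher(self, name, matcher):
        self.dataset_matchers[name] = matcher
```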
In particular, the architecture contains the following methods:
- Individual matching:
- Aggregated attribute-based similarity. This method uses the classical approach to individual matching, where the similarity between individuals is calculated as an aggregation of similarities between their relevant attributes. The user can select the properties to be compared, the similarity functions, their weights, and the cut-off threshold. A sketch of this weighted aggregation appears after this list.
- Unsupervised attribute-based similarity. This method also implements aggregated attribute-based similarity. However, instead of relying on the user to choose the parameters of the combined similarity function, it tries to pick them automatically using a genetic algorithm. In the absence of reliable training data, it evaluates the fitness of candidate solutions using the desired distribution of resulting links: e.g., the expected number of mappings. A sketch of such a parameter search appears after this list.
- Dataset matching:
- Filtering based on ontological constraints. This method uses explicitly defined ontological constraints (class disjointness, functionality and cardinality restrictions) to update the original set of mappings provided by individual matching and to filter out those which violate these constraints. A sketch of this kind of filtering appears after this list.
- Belief propagation network. This method combines uncertainty reasoning with ontological reasoning to refine the original set of mappings produced by individual matching methods. Confidence degrees of data statements in both repositories and of initial mappings are interpreted as Dempster-Shafer belief functions. Ontological reasoning is used to construct belief propagation networks capturing the mutual impact of individual matching decisions, and these networks are used to refine the original set of mappings. The algorithm is described in more detail in [2] and [3]. A sketch of the underlying Dempster-Shafer combination step appears after this list.
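To make the aggregated attribute-based similarity method concrete, here is a minimal sketch of the weighted-aggregation idea. The compared properties, similarity functions, weights, and threshold correspond to the user-chosen parameters mentioned above; the particular values and the SequenceMatcher-based metric are illustrative choices, not the ones used by KnoFuss.

```python
# Illustrative weighted attribute-based similarity (not the KnoFuss code).
# Properties, weights, similarity functions, and the threshold are user-chosen
# parameters; the values here are made up.
from difflib import SequenceMatcher


def string_sim(a, b):
    """A stand-in string similarity metric in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# (property, similarity function, weight): an example configuration.
CONFIG = [
    ("name", string_sim, 0.6),
    ("birthYear", lambda a, b: 1.0 if a == b else 0.0, 0.4),
]
THRESHOLD = 0.8  # cut-off threshold, also user-chosen


def individual_similarity(ind1, ind2, config=CONFIG):
    """Weighted aggregation of per-attribute similarities."""
    total_weight = sum(w for _, _, w in config)
    score = sum(w * sim(ind1[p], ind2[p]) for p, sim, w in config)
    return score / total_weight


a = {"name": "Tim Berners-Lee", "birthYear": "1955"}
b = {"name": "Timothy Berners-Lee", "birthYear": "1955"}
print(individual_similarity(a, b) >= THRESHOLD)  # candidate mapping or not
```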
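The unsupervised variant can be illustrated by a toy genetic search over the same kind of parameters. The fitness function below rewards parameter vectors whose resulting number of accepted mappings is close to an expected count, as a stand-in for the desired link distribution; the operators and constants are arbitrary, and the sketch is not the KnoFuss implementation.

```python
# Toy genetic search for similarity weights and a threshold (illustration only).
import random


def count_mappings(weights, threshold, pair_scores):
    """pair_scores: per-attribute similarity tuples, one tuple per candidate pair."""
    total = sum(weights)
    return sum(
        1
        for scores in pair_scores
        if sum(w * s for w, s in zip(weights, scores)) / total >= threshold
    )


def fitness(params, pair_scores, expected_mappings):
    *weights, threshold = params
    produced = count_mappings(weights, threshold, pair_scores)
    return -abs(produced - expected_mappings)


def evolve(pair_scores, expected_mappings, n_attrs=2, pop_size=30, gens=50):
    # Each candidate solution: n_attrs weights plus a threshold, all in [0, 1].
    pop = [[random.random() for _ in range(n_attrs + 1)] for _ in range(pop_size)]
    for _ in range(gens):
        pop.sort(key=lambda p: fitness(p, pair_scores, expected_mappings), reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]           # one-point crossover
            i = random.randrange(len(child))    # mutate one gene
            child[i] = min(1.0, max(0.0, child[i] + random.gauss(0, 0.1)))
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda p: fitness(p, pair_scores, expected_mappings))


# Toy usage: five candidate pairs with two attribute similarities each,
# and an expectation of roughly two true mappings.
pairs = [(0.9, 1.0), (0.85, 0.0), (0.3, 1.0), (0.2, 0.0), (0.95, 1.0)]
print(evolve(pairs, expected_mappings=2))
```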
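The constraint-based filtering method can be illustrated with a class-disjointness check: a candidate mapping is rejected if identifying the two individuals would place an instance into disjoint classes. The data and the axiom below are toy examples; functionality and cardinality checks would follow the same pattern.

```python
# Illustrative constraint-based filtering (not the KnoFuss code): drop candidate
# mappings whose merged individual would violate a declared disjointness axiom.
disjoint_classes = {frozenset({"foaf:Person", "foaf:Organization"})}

types = {
    "ex:alice": {"foaf:Person"},
    "db:ACME": {"foaf:Organization"},
    "db:alice_smith": {"foaf:Person"},
}

candidate_mappings = [("ex:alice", "db:alice_smith", 0.9),
                      ("ex:alice", "db:ACME", 0.55)]


def violates_disjointness(uri1, uri2):
    """True if merging uri1 and uri2 puts an individual into disjoint classes."""
    for c1 in types.get(uri1, ()):
        for c2 in types.get(uri2, ()):
            if frozenset({c1, c2}) in disjoint_classes:
                return True
    return False


filtered = [(u1, u2, conf) for u1, u2, conf in candidate_mappings
            if not violates_disjointness(u1, u2)]
print(filtered)  # the ex:alice / db:ACME mapping is rejected
```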
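Finally, the belief propagation method builds on Dempster-Shafer theory. The sketch below shows only the basic combination step: Dempster's rule applied to two mass functions over the frame {same, different}, one from an individual matcher and one from a conflicting ontological constraint. Constructing the actual propagation networks from ontological reasoning is described in [2] and [3] and is not attempted here; the mass values are made up.

```python
# Tiny illustration of Dempster's rule of combination, the uncertainty
# machinery underlying the belief propagation method (not the KnoFuss code).
from itertools import product

FRAME = frozenset({"same", "different"})

# m1: belief from an individual matcher's confidence in a mapping.
m1 = {frozenset({"same"}): 0.8, FRAME: 0.2}
# m2: belief contributed by an ontological constraint speaking against it.
m2 = {frozenset({"different"}): 0.6, FRAME: 0.4}


def combine(m1, m2):
    """Dempster's rule: normalised conjunctive combination of two mass functions."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


print(combine(m1, m2))  # refined belief in "same" vs. "different"
```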