The unit of Modena will work on all three project themes. Within THEME 1 it will deal with issues concerning the creation and extension of a domain ontology; within THEME 2 it will cooperate in the definition of a reference architecture for discovering and managing semantic mappings among ontologies; within THEME 3 it will participate in the automatic translation (rewriting) of queries expressed on a given ontology into an appropriate form for other ontologies, and in the study of techniques for computing a unique result for the same object instantiated in multiple sources.
We will work, together with all the other Units, on the definition of common products. In the first phase, this activity aims at defining a methodological and functional reference architecture for the whole project (product D0.R1).
D0.R1 Report on the methodological and functional reference architecture (BO, MO, RM, TN)
In the second phase, we will work, together with all the other Units, on the definition of the component interfaces of the integrated prototype (product D0.R2).
D0.R2 Specifications of the component interfaces of the integrated prototype (BO, MO, RM, TN)
Finally, in the third phase of the project, we will work together with all the other Units on the integration of the prototypes developed during the project.
1.1 – Definition of an ontology language covering extensional aspects/concepts
The ontology language that the unit of Modena will contribute to defining will be based on the ODLI3 language, made compatible with the W3C standards. In addition, the unit will focus on making the language expressive enough to represent mappings between heterogeneous, independently developed ontologies and to ease query rewriting.
1.2 – Adding a new information source to the domain ontology
Starting from the MOMIS system, the unit of Modena will study the evolution of the GVV and of the reference ontology caused by the integration of a new information source. In fact, a change in one or more concepts within the ontology can cause several inconsistencies, both in related concepts of the same ontology and in the ontologies connected to it by means of mappings.
The approach aims at integrating the description of a new information source into a pre-existing ontology by exploiting a semi-automatic, lexicon-based process that calculates the affinity between the description elements to be inserted and those of the ontology. An element will be added to the ontology only if there are no similar elements within it. The ontology should thus grow monotonically, minimizing the changes to the existing ontology and avoiding internal inconsistencies. In addition, the risk of propagating inconsistencies to mappings is reduced.
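The insertion policy just described can be sketched as follows. This is an illustrative simplification, not the actual MOMIS process: the affinity function, the synonym table, and the threshold are all assumptions standing in for the lexicon-based computation.

```python
# Hypothetical sketch of affinity-based insertion: a new element joins the
# ontology only if no existing element is sufficiently similar, so the
# ontology grows monotonically. Names and weights are illustrative.

def name_affinity(a: str, b: str, synonyms: dict[str, set[str]]) -> float:
    """Crude lexical affinity: 1.0 for identical names,
    0.8 for catalogued synonyms, 0.0 otherwise."""
    if a == b:
        return 1.0
    if b in synonyms.get(a, set()) or a in synonyms.get(b, set()):
        return 0.8
    return 0.0

def add_element(ontology: set[str], element: str,
                synonyms: dict[str, set[str]], threshold: float = 0.7) -> bool:
    """Insert `element` only if it has no similar counterpart."""
    if any(name_affinity(element, e, synonyms) >= threshold for e in ontology):
        return False                     # a similar element already exists
    ontology.add(element)
    return True

onto = {"book", "author"}
syn = {"volume": {"book"}}
add_element(onto, "volume", syn)         # rejected: catalogued synonym of "book"
add_element(onto, "publisher", syn)      # accepted: no similar element
```

Keeping rejected elements out, rather than merging them in, is what limits change propagation to the mappings that reference the ontology.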
1.3 – Identification of a new information source relevant to the domain ontology
The unit of Modena will collaborate in the study and development of semantic tools able to improve the effectiveness of the current keyword-based techniques of search engines, e.g. Google. The search for new Web sources will be assisted by natural language comprehension techniques, some of which are already implemented in TUCUXI (Benassi, 2004). The purpose is to obtain a synthetic representation of the meanings contained in a text while maintaining the semantic relations between terms. The relevance of a source will be evaluated by means of a semantic similarity measure (developed within THEME 2) between the ontology and the lexical chains.
Concerning activities 1.1 and 1.2, and together with the other units, the standards and emerging languages proposed for ontology definition and treatment will be analyzed, with particular attention to the description of ontology evolution.
D1.R1: Critical analysis of the emerging ontology languages and standards (BO, MO, RM, TN)
A language for ontology definition and treatment, with particular attention to ontology evolution, will be defined; this activity will be carried out together with the units of Bologna and Trento. A prototype for adding a new information source to the domain ontology will be developed. Concerning activity 1.3, a critical analysis of existing techniques for lexical chain extraction will be produced.
D1.R2: Definition of the language for domain ontology specification (BO, MO, TN)
D1.R4: Critical analysis of existing techniques for lexical chain extraction (MO)
D1.P1: Prototype for adding a new information source to the domain ontology (MO)
During this phase, and with respect to activity 1.3, the unit of Modena will focus on the study and implementation of new lexical chain extraction algorithms. Starting from the critical analysis of existing techniques (D1.R4), the unit of Modena will identify the features that the new algorithms have to implement, with respect to three main aspects of interest: the first concerns the kind of documents (Web pages); the second concerns computational complexity, in particular algorithms of linear complexity will be defined; the third deals with the accuracy that the synthetic, semantic representation has to ensure. In particular, attention will be paid to word sense disambiguation as a preliminary phase for building lexical chains representative of the analysed sources. The prototype will substantially extend the TUCUXI functions (Benassi, 2004) and complement RoadRunner (developed by the unit of Roma), improving its ability to assign semantics to information extracted from data-intensive web sites.
D1.R6: Definition of lexical-chain techniques to associate semantics to a data-intensive site schema (RM, MO)
D1.P2: Prototype to extract lexical chains from web sites (MO)
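The core of lexical chain construction can be illustrated with a minimal greedy sketch. The relatedness table below is a toy stand-in for a lexical ontology such as WordNet, and the algorithm omits word sense disambiguation; it is not the TUCUXI implementation, only the general idea.

```python
# Greedy lexical chaining sketch: each word joins the first chain that
# already contains a related word, otherwise it starts a new chain.
# RELATED is a hypothetical relatedness table, not a real lexical resource.

RELATED = {
    ("book", "volume"), ("book", "author"),
    ("engine", "motor"),
}

def related(a: str, b: str) -> bool:
    return a == b or (a, b) in RELATED or (b, a) in RELATED

def build_chains(words: list[str]) -> list[list[str]]:
    chains: list[list[str]] = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])           # no related chain: open a new one
    return chains

print(build_chains(["book", "engine", "volume", "motor", "author"]))
# [['book', 'volume', 'author'], ['engine', 'motor']]
```

Each resulting chain groups terms around one topic of the source, giving the synthetic representation against which the ontology can later be compared.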
Concerning THEME 2, the activity will focus on the definition of languages, techniques and algorithms to obtain mappings among different ontologies. In general, matching relations can be identified from the instances or from the structures, derived from the analysis of the involved sources, or obtained from external tools, e.g. lexical analysis or logical inference.
Within the project, the Unit of Modena will cooperate in designing matching algorithms that take into account topics detailed hereafter.
Systems that use constraints coming from the lexicon try to exploit the names of schema elements to find similar elements. The similarity between schema element names can be identified in different ways, such as name identity, identity of the canonical names obtained after a preprocessing step, identity of hypernyms, or name identity based on the user's suggestions.
It is worth noting that the identity between two names (or hypernyms) is not always a mere string comparison. A name can be associated with one or more meanings (polysemy) and, vice versa, different names can have the same meaning within the source context (synonymy). To generate such kinds of relations, it is necessary to refer to lexical ontologies that catalogue terms on the basis of their meaning, so that correct comparisons can be performed.
Analogously, it is worth noting that, sometimes, the names associated with some database schema elements are not semantically relevant. In these cases, it is advisable to adopt auxiliary techniques that extract semantics from data analysis and/or techniques that analyse and exploit the natural language comments written by the source designer.
Other constraints can be derived from the analysis of the two schemas: for example, identity can be derived on the basis of data type equivalence, key domains, the cardinality of relations, and IS-A relations.
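The lexical criteria above can be sketched in a few lines. The synonym table is a hypothetical stand-in for a lexical ontology, and the canonicalisation step is a deliberately simple example of the preprocessing mentioned earlier.

```python
# Sketch of name-based matching: two schema element names match if they
# are identical, identical after canonicalisation, or catalogued as
# synonyms. SYNONYMS is an assumed toy catalogue, not a real resource.

SYNONYMS = {"price": {"cost"}, "cost": {"price"}}

def canonical(name: str) -> str:
    """Preprocessing step: lower-case and strip common separators."""
    return name.lower().replace("_", "").replace("-", "")

def names_match(a: str, b: str) -> bool:
    ca, cb = canonical(a), canonical(b)
    if ca == cb:                               # identity of canonical names
        return True
    return cb in SYNONYMS.get(ca, set())       # synonymy via the catalogue

print(names_match("Unit_Price", "unitprice"))  # True
print(names_match("price", "cost"))            # True
print(names_match("price", "title"))           # False
```

A real matcher would combine such a predicate with the structural constraints (data types, keys, cardinalities, IS-A links) rather than rely on names alone.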
Finally, on the basis of the synthetic source representation produced by the lexical chain techniques, the unit of Modena will study and develop semantic similarity measures (evolutions of the ones presented in (Budanitsky, 2001)) to quantify the relevance of a data source with respect to the reference domain ontology. Such measures will take into account two different aspects: the first concerns the “exact” matching between meanings/concepts, the second the semantic similarity between concepts.
During the first phase, the most important matching algorithms in the literature will be critically analysed, with particular attention to the ones that adopt techniques for conflict resolution among different representations, together with the standard proposals for ontology mapping.
D2.R1: Critical analysis of languages and mapping techniques (MO, TN)
During the second phase, we will collaborate in defining a language for mappings and a matching algorithm that will automatically suggest mapping relations on the basis of semantic similarity.
D2.R2: Definition of the language for semantic mappings specification (MO, TN)
D2.R3: Empirical evaluation of semantic similarity measures (MO)
The activity carried out during phase 3 will be devoted to extracting a synthetic source representation by means of the lexical chain technique. In particular, the algorithm in D1.R6 will be evaluated under several parameters, both technological (robustness of the extraction process, computational complexity, ...) and qualitative (expressiveness of the lexical chains as a methodology to describe sources, possibility to extend them to a multilingual environment, effectiveness of the proposed techniques in assigning semantics to the data extracted by RoadRunner).
In addition, concerning the problem of quantifying the relevance of the sources with respect to the reference domain ontology, the unit of Modena will study and develop semantic similarity measures between the lexical chains and the reference ontology. Such measures will take into account two different aspects: the first concerns the “exact” matching between meanings/concepts, the second the semantic relatedness between concepts (Budanitsky, 2001). For example, the semantic similarity measure will treat differently the case where the reference ontology and the source (through its lexical chains) share the concept of book, and the case where the source does not contain the concept of book but the concept of volume.
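The book/volume distinction can be made concrete with a small sketch. The weights and the relatedness table are illustrative assumptions, not the measures of (Budanitsky, 2001); they only show how exact matches and merely related concepts can score differently.

```python
# Sketch of a two-aspect similarity measure: exact concept overlap scores
# 1.0, overlap through a catalogued related concept scores less. The
# relatedness table and weights are hypothetical.

RELATEDNESS = {("volume", "book"): 0.8, ("book", "volume"): 0.8}

def concept_sim(a: str, b: str) -> float:
    if a == b:
        return 1.0                          # "exact" matching of concepts
    return RELATEDNESS.get((a, b), 0.0)     # semantic relatedness otherwise

def source_relevance(chain: set[str], ontology: set[str]) -> float:
    """Average, over the lexical chain, of each concept's best match
    against the reference ontology."""
    if not chain:
        return 0.0
    return sum(max(concept_sim(c, o) for o in ontology) for c in chain) / len(chain)

onto = {"book", "author"}
print(source_relevance({"book"}, onto))     # 1.0 : exact match
print(source_relevance({"volume"}, onto))   # 0.8 : related concept only
```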
D2.P1 Prototype of the platform for the automatic generation/management of mappings between heterogeneous domain ontologies
Concerning THEME 3, the first objective is to study techniques for the automatic translation (rewriting) of a given query, formulated w.r.t. a local domain ontology, so that it becomes compatible with the other ontologies available in the distributed environment. Such a process is necessary if we want to be able to answer queries in the most effective and complete way, thus taking advantage of the full potential of the information available in the data sources. In fact, it is not reasonable to expect all the information needed to satisfy the users' information needs to come from the source on which the query was formulated; rather, we must exploit all the useful sources, also querying the ones integrated in ontologies different from the one on which the original query is formulated. Therefore, the goal is to deliver techniques which, taking advantage of the semantic information (concepts and mappings) of the involved ontologies, are able to rewrite a given query towards the other ontologies, as faithfully as possible w.r.t. the original query.
A second objective is to study techniques for computing the GVV Global Instance. The Global Instance is computed within the query resolution phase on the basis of the following elements: the mappings between the GVV and the local sources, the identification of the local sources' elements representing the same real world object (join map), and the full-disjunction operation, which yields a unique result for the same object instantiated in different sources. The specific activity of this THEME is to study and develop techniques for the semi-automatic definition of join maps and for the generalization of the full-disjunction operation. In particular, we will develop solutions valid both under the hypothesis of "semantic homogeneity", i.e. equal values for common local attributes belonging to different sources and referring to the same real world object, and in the general (inconsistent) case.
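The fusion step can be sketched under the semantic homogeneity hypothesis, where attribute values never conflict. The join key below stands in for the join map, and no conflict resolution is shown, so this is only an illustration of the intended result, not the generalized operation to be developed.

```python
# Simplified full-disjunction sketch: records that the join map identifies
# as the same real-world object (here, equal "isbn") are fused into one
# record combining all their attributes; records without a key survive,
# as in an outer join. The key and sample data are hypothetical.

def full_disjunction(sources: list[list[dict]], key: str = "isbn") -> list[dict]:
    merged: dict[str, dict] = {}
    unmatched: list[dict] = []
    for source in sources:
        for record in source:
            k = record.get(key)
            if k is None:
                unmatched.append(record)          # no join value: kept as-is
            else:
                merged.setdefault(k, {}).update(record)
    return list(merged.values()) + unmatched

s1 = [{"isbn": "88-01", "title": "I Promessi Sposi"}]
s2 = [{"isbn": "88-01", "publisher": "Mondadori"},
      {"title": "Unknown pamphlet"}]
print(full_disjunction([s1, s2]))
```

Under homogeneity the blind `update` is safe; in the general (inconsistent) case the same slot may receive different values, which is exactly where the generalized operation and its resolution policies come in.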
The first phase will produce a critical analysis of the ontology-based query rewriting techniques (deliverable D3.R1).
D3.R1: Critical analysis of query languages and ontology-based query rewriting techniques (BO, MO, TN)
The query rewriting operation has to take into account the global classes involved in the query. In order to obtain a complete and minimal answer, we have to compute the global instance.
The activities of this phase involve the choice of the query rewriting approach, together with the definition of a series of techniques preliminary to an effective rewriting (deliverable D3.R3). In particular, the semantic mappings between the ontologies (THEME 2) will constitute a good starting point for rewriting a given query originally expressed w.r.t. a local ontology. The idea is to exploit such mappings in order to quantify the similarities between the various concepts described in the involved ontologies. The techniques used to estimate the similarities between the involved concepts are not strictly part of the rewriting phase, but they constitute a fundamental preliminary step for it. Such techniques should not be limited to exploiting the semantic information about the concepts available in the ontologies; they should also make use of the context (structure) in which such concepts appear, following other recently presented approaches (Garcia-Molina et al., 2002). The similarities extracted by means of such approaches will provide a good basis for the rewriting phase; the rewriting will not only adapt the structure of the query but will also consistently rewrite its values. Moreover, the extracted similarities will be used to estimate and quantify the distance between the rewritten queries and the original one. In order to obtain a complete and minimal answer, we will exploit and extend the full-disjunction method.
D3.R3: Definition of the query language and of the ontology-based query rewriting techniques (BO, MO, TN)
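The rewriting idea can be sketched at the concept level. The mapping table and scores are illustrative assumptions; a real rewriter would also handle the query structure and values, as discussed above.

```python
# Hypothetical sketch of mapping-based rewriting: each concept of a query
# over the local ontology is replaced by its best-mapped concept in the
# target ontology, and the product of mapping similarities estimates how
# faithful the rewriting is to the original query.

MAPPINGS = {                 # local concept -> {target concept: similarity}
    "book":   {"volume": 0.8},
    "author": {"writer": 0.9},
}

def rewrite(query: list[str]) -> tuple[list[str], float]:
    rewritten: list[str] = []
    fidelity = 1.0
    for concept in query:
        targets = MAPPINGS.get(concept)
        if not targets:
            return [], 0.0               # concept cannot be rewritten at all
        best = max(targets, key=targets.get)
        rewritten.append(best)
        fidelity *= targets[best]        # accumulate faithfulness estimate
    return rewritten, fidelity

query, score = rewrite(["book", "author"])
print(query)                             # ['volume', 'writer']
```

The fidelity score plays the role of the distance between the rewritten query and the original one: rewritings through weaker mappings are ranked lower when answers are combined.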
In PHASE 3, the proposed rewriting techniques and the full-disjunction operation will be implemented in the prototype for query formulation (deliverable D3.P1).
D3.P1 Prototype for query formulation (BO, MO)