THEME 3

QUERY PROCESSING

Units involved in this Theme

University of Modena and Reggio Emilia

University of Trento

University of Bologna

University of Roma Tre


Goals of this Theme

The first objective of this Theme is to exploit the characterization of sources so as to direct query execution only towards the most relevant ones. To this end, a key role is played by the semantic mappings between domain ontologies and by the definition of a "semantic distance" between the concepts involved in those mappings. As to execution, we will define techniques for the automatic rewriting of queries: by exploiting information on the semantics of the individual concepts described in the reference ontologies, and on the context in which they appear, these techniques rewrite a query against the other ontologies in a form that matches the original as closely as possible. Determining the result of a query requires that each object involved be rebuilt from the pieces of information that characterize it, which are distributed over several sources ("object fusion"). Here the goal is to extend the known full-disjunction methods (based on exact matching between consistent components of the objects) to the case of approximate matches and semantic heterogeneity (different values for the same attribute managed by different sources). In addition, we will study techniques for automatically defining "join maps" (the identification, within local sources, of the objects corresponding to the same real-world object). A further objective is to develop techniques for determining the "best" N objects for a given query; such techniques should be correct and efficient regardless of the criterion chosen for combining the different factors that affect object relevance (e.g., a weighted sum). The last goal is to develop methods that allow the result to be navigated interactively, according to the abstraction levels offered by the ontologies. To this end, we will investigate suitable operators for viewing the results at different levels, in order to help the user recognize significant patterns in the data.

 
Working Phases

Phase 1 (6 months: December 1, 2004 - May 31, 2005)

The first phase of the project will be devoted to the critical analysis of state-of-the-art approaches, in order to fully identify the limitations of existing solutions for the problems at hand. Then, we will articulate the specific requirements for each research issue relevant to Theme 3. In particular:
- We will perform a critical analysis of query languages and of query rewriting techniques based on ontologies (product D3.R1), with the aim of highlighting their limitations and fully defining the requirements for the tools and techniques that will be developed in the following phases.
- Starting from an analysis of query processing techniques for distributed and heterogeneous environments, we will identify the limits of such techniques with respect to the WISDOM architecture (where, we recall, a data source can be viewed from the outside only through the domain ontology (GVV) that includes it). In particular, considering the different aspects that can contribute to determining the relevance of a result, we will analyze whether, and how, such aspects are affected by the WISDOM architecture. Moreover, we will investigate the possibility of processing results so as to present them to the user in a compact and easy-to-use form, assessing whether navigation and summarization techniques borrowed from business intelligence can be coupled with pattern models typical of the data mining domain. Finally, we will consider some visual querying paradigms for databases, in order to characterize their main limitations when they are applied to ontology-based systems such as WISDOM.

Phase 2 (6 months: June 1, 2005 - November 30, 2005)

In the second project phase, we will provide solutions for research issues related to Theme 3.
We will define the query language based on domain ontologies and select the best approach for query rewriting. The main idea is to start from the matches obtained between concepts of the different ontologies and to write the query in a new form that is as close as possible to being equivalent to the original one. For this purpose, we will define a "semantic distance" between concepts of different ontologies that are related by semantic mappings (Theme 2). This distance will be one of the criteria used to define the relevance of a data source for a given query (intuitively, mappings with a low semantic distance indicate a very relevant data source) and the "goodness" of results (in particular, when a concept is mapped to several concepts within an ontology, the relevance of each of these mappings has to be considered).
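As an illustration of the rewriting idea described above, the following minimal sketch assumes that mappings are already available as lists of (target concept, distance) pairs with distances in [0, 1]. The names `Mappings` and `rewrite`, the choice of the closest target for each concept, and the use of the average distance as an overall measure are all hypothetical simplifications; the actual query language and distance definition are the subject of this phase.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical representation: source concept -> [(target concept, semantic distance)].
Mappings = Dict[str, List[Tuple[str, float]]]


def rewrite(concepts: List[str], mappings: Mappings) -> Optional[Tuple[List[str], float]]:
    """Rewrite a query, given as a list of concepts, against a target ontology.

    Each concept is replaced by its closest mapped concept. Returns the
    rewritten concepts and the average distance, or None when some concept
    has no mapping at all in the target ontology.
    """
    if not concepts:
        return [], 0.0
    rewritten: List[str] = []
    total = 0.0
    for concept in concepts:
        candidates = mappings.get(concept)
        if not candidates:
            return None  # the query cannot be expressed in the target ontology
        target, distance = min(candidates, key=lambda pair: pair[1])
        rewritten.append(target)
        total += distance
    return rewritten, total / len(concepts)


example_mappings: Mappings = {
    "Hotel": [("Lodging", 0.2), ("Building", 0.6)],
    "City": [("Town", 0.1)],
}
print(rewrite(["Hotel", "City"], example_mappings))
```

In this toy setting a low average distance suggests that the target ontology (and the sources behind it) is a good candidate for the query; a concept with several mappings would, in a fuller treatment, contribute one alternative rewriting per mapping rather than only the closest one.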
In order to establish which data sources are relevant for a query, semantic information will be combined with structural and statistical information. Structural information will be used to fully define the context to which a given concept belongs; statistical information will introduce quantitative data about the data sources (in particular, the quality/reliability of the data within a source and the frequency of data updates) and about the data instances they contain. In this case, the basic rationale is to exploit the enrichment of domain ontologies with "content summaries" (Theme 1) so as to provide a relevance "score" for each data source, based on the values supplied by the query (intuitively, a data source can be relevant from a semantic point of view but not when its instances are considered, because it cannot return results satisfying the query predicates).
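The intuition in the previous paragraph can be sketched as follows. This is only an illustrative model under stated assumptions: the class `SourceProfile`, the function `relevance`, the representation of a content summary as the set of distinct values per attribute, and the product used to combine the factors are all hypothetical and are not prescribed by the project.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class SourceProfile:
    """Hypothetical description of a data source as seen through its ontology."""
    name: str
    semantic_distance: float  # assumed in [0, 1]; 0 means the concepts coincide
    quality: float            # assumed in [0, 1]; reliability of the source
    # Assumed content summary: attribute -> set of distinct values in the source.
    summary: Dict[str, Set[str]] = field(default_factory=dict)


def relevance(source: SourceProfile, predicates: Dict[str, str]) -> float:
    """Score a source for a query made of equality predicates (attribute = value).

    A source whose summary shows that some predicate cannot be satisfied
    scores 0, however close it is semantically. Otherwise the score combines
    semantic closeness and quality (here, simply their product).
    """
    for attribute, value in predicates.items():
        if value not in source.summary.get(attribute, set()):
            return 0.0
    return (1.0 - source.semantic_distance) * source.quality


hotels = SourceProfile("hotels", 0.1, 0.8, {"city": {"Modena", "Bologna"}})
print(relevance(hotels, {"city": "Modena"}))  # semantically close and has matching instances
print(relevance(hotels, {"city": "Trento"}))  # semantically close but no matching instances
```

The second call shows the case mentioned in the text: the source is semantically relevant, yet its instances cannot satisfy the predicate, so it need not be contacted.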
With respect to query processing and the retrieval of the result, in this phase we will define techniques for distributed query processing that, taking into account the limitations imposed by the WISDOM architecture, are able to return the most relevant results using a minimal amount of resources. Since the actual relevance of a given object depends on different factors, and also on how such factors are combined, we will develop general techniques that operate correctly and efficiently even when the combination criteria change. For such criteria, which in the basic case reduce to a weighted sum of the different factors, we will also examine the more general case of qualitative definitions, which are not based on a numerical characterization. Solutions for the object fusion problem will also be provided, by extending the method based on full disjunction, with the main goal of obtaining complete and minimal results. Moreover, we will define techniques for the semi-automatic assessment of join maps (the identification, within local sources, of the objects corresponding to the same real-world object).
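To make the "best N objects" requirement concrete, the sketch below uses a well-known technique from the literature, Fagin's Threshold Algorithm, which is not necessarily the method the project will adopt. It works for any monotone combination function, so the same code serves a weighted sum or, for example, a minimum. The function name `top_n`, the in-memory dictionaries standing in for distributed sources, and the assumption that scores lie in [0, 1] with missing objects scoring 0 are all illustrative.

```python
import heapq
from typing import Callable, Dict, List, Sequence, Tuple


def top_n(
    factors: Sequence[Dict[str, float]],
    combine: Callable[[Sequence[float]], float],
    n: int,
) -> List[Tuple[str, float]]:
    """Return the n objects with the highest combined score.

    factors: one dict per relevance factor, mapping object id -> score in [0, 1].
    combine: a monotone function of the per-factor scores.
    """
    # Sorted access: each factor exposes its objects by decreasing score.
    ranked = [sorted(f.items(), key=lambda kv: kv[1], reverse=True) for f in factors]
    seen: Dict[str, float] = {}
    max_depth = max((len(r) for r in ranked), default=0)
    for depth in range(max_depth):
        last_scores = []
        for r in ranked:
            if depth < len(r):
                obj, score = r[depth]
                if obj not in seen:
                    # Random access: fetch the object's score for every factor.
                    seen[obj] = combine([f.get(obj, 0.0) for f in factors])
                last_scores.append(score)
            else:
                last_scores.append(0.0)  # exhausted list: unseen objects score 0 here
        # No unseen object can score higher than this threshold.
        threshold = combine(last_scores)
        best = heapq.nlargest(n, seen.items(), key=lambda kv: kv[1])
        if len(best) >= n and best[-1][1] >= threshold:
            return best
    return heapq.nlargest(n, seen.items(), key=lambda kv: kv[1])


semantic = {"a": 0.9, "b": 0.8, "c": 0.1}
quality = {"a": 0.2, "b": 0.7, "c": 0.9}
weighted = lambda s: 0.5 * s[0] + 0.5 * s[1]
print([obj for obj, _ in top_n([semantic, quality], weighted, 2)])
print([obj for obj, _ in top_n([semantic, quality], min, 1)])
```

With the weighted sum the two best objects are b and then a; with `min` as the combination function the best object is b. The algorithm itself is unchanged when the criterion changes, which is the property the text asks for; qualitative (non-numerical) criteria would need a different treatment.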
To make results easier to use, we will develop methods that allow the user to specify precisely the required level of detail. In particular, we will define techniques for representing the information in a compact and semantically rich way at different abstraction levels, and we will supply operators for the interactive navigation of information across these levels, in agreement with the domain ontology.
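One possible navigation operator, borrowed from the roll-up of business intelligence, is sketched below. It assumes that each result object is labelled with an ontology concept and that the ontology provides a simple is-a hierarchy; the function `roll_up` and the example concepts are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def roll_up(result_concepts: Iterable[str], parent: Dict[str, str], steps: int = 1) -> Counter:
    """Summarize results by counting them per ancestor concept.

    result_concepts: the concept of each object in the query result.
    parent: assumed is-a hierarchy, child concept -> parent concept.
    steps: how many levels to climb; concepts at the root stay where they are.
    """
    def ancestor(concept: str) -> str:
        for _ in range(steps):
            concept = parent.get(concept, concept)
        return concept

    return Counter(ancestor(c) for c in result_concepts)


is_a = {"Hotel": "Accommodation", "Hostel": "Accommodation", "Accommodation": "Service"}
results = ["Hotel", "Hotel", "Hostel"]
print(roll_up(results, is_a, steps=0))  # original level of detail
print(roll_up(results, is_a, steps=1))  # one abstraction level up
```

Calling the operator with increasing values of `steps` gives progressively more compact views of the same result, while `steps=0` returns to the original level of detail.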

Phase 3 (12 months: December 1, 2005 - November 30, 2006)

The third phase of the project will be mainly devoted to the development and integration of prototypes, and to their experimental evaluation. Theme 3 will provide two prototypes:
- The first prototype (product D3.P1) will be responsible for acquisition and analysis of queries, for determining which data sources are relevant for each query, and for query rewriting.
- The second prototype (product D3.P2) will implement the query processing techniques devised during phase 2, and will include an interface for interactive navigation of information at different abstraction levels, in agreement with the domain ontology.