Schema mappings, declarative constraints that model relationships between schemas, are the main enabler of data integration and data exchange. They are used to translate queries over a target schema into queries over a source schema (data integration) or to generate executable transformations that produce a target instance from a source instance (data exchange). These transformations are generated from the logical specification of the mappings. Schema mappings are often generated semi-automatically using tools like Clio, BEA AquaLogic, and others. The generation of executable transformations from mappings is also largely automated. The complexity of large schemas, lack of schema documentation, and the iterative, semi-automatic process of mapping and transformation generation are common sources of errors. These issues are compounded by limitations and idiosyncrasies of mapping tools (which can produce wildly different transformations for the same input schemas). Understanding and debugging an integration, its mappings, and transformations is far from trivial and is often a time consuming, expensive procedure. In addition, schema mapping is often done over data sources that are themselves dirty or inconsistent. Errors caused by faulty data cannot be neatly and cleanly separated from errors caused by an incorrect mapping or transformation.
Though partially automated, developing schema mappings remains a complex and potentially error-prone task. To address this problem, we have developed TRAMP (TRAnsformation Mapping Provenance), an extensive suite of tools supporting the debugging and tracing of schema mappings and transformation queries. TRAMP combines and extends data provenance with two novel notions, transformation provenance and mapping provenance, to explain the relationship between transformed data and those transformations and mappings that produced that data. In addition we provide query support for transformations, data, and all forms of provenance. We have formalized transformation and mapping provenance based on data provenance and implemented these new types of provenance as an extension of the Perm relational provenance management system.
In contrast to other approaches for debugging schema mappings, TRAMP is a more holistic approach that aims at providing a robust data integration debugging tool for tracing errors, no matter what their source (the data, inconsistencies between data sources, the schemas, schema constraints, the mappings, or the transformations). We argue that a robust tool for understanding the behaviour of complex schema mappings has to make all elements of a mapping scenario (schemas, schema constraints, mappings, transformations, and data) and their inter-relationships query-able. Building such a system involves non-trivial challenges from a conceptual and a system design point of view: (1) integrate this functionality in a single system that exploits a common approach to tracing errors - rather than providing a loose collection of tools, and (2) provide a uniform query interface for all elements of a mapping scenario and their inter-relationships.
TRAMP FeaturesThe main features of TRAMP are:
- Supports two new types of provenance information
- Transformation Provenance: Which parts of a query generated a query result tuple
- Mapping provenance: Which mappings (s-t tgds) generated a query result tuple
- Provides query facilities for mapping scenario information and provenance
- Querying data provenance: Using SQL (like in Perm)
- Querying transformation provenance using Meta-Querying: Support queries as an XML-based data type and use XPath/XSLT for querying
- Querying mapping scenario information: Store mapping scenario in DB and use SQL and XPath/XSLT for querying
- Fully integrated in SQL: all types of data, provenance, and the transformations themselves can be accessed in a single query
Types of Schema Mapping ErrorsWe now present common types of errors and discuss what information (especially provenance) can be used to trace them. We distinguish between three error categories based on the origin of an error: Mapping errors, Transformation errors, and Instance Errors.
Missing Mappings: A mapping missing from a mapping scenario may lead to empty target relations or incomplete target relations. Missing mappings may be caused by missing correspondences (matches), missing schema constraints, and misinterpreting the semantics of mappings and schemas. Correspondences can be missed through errors by a matching tool.
Incomplete Mappings: Incomplete mappings (i.e., mappings that are missing relations or missing conditions) may also arise due to missing correspondences or missing schema constraints.
Oversized Mappings: Oversized mappings (that is, mappings with too many relations) may be caused by using an association in the source with different semantics than the corresponding association in the target. Thus, an oversized mapping associates relations that should not be associated according to the desired mapping semantics.
Incorrect Association Paths: Schema constraints, query logs, or even connections in the data may be used by mapping systems to figure out how to join source or target relations in a mapping. In a schema there might be several ways to reach one relation from another via connecting constraints, but not all of them represent a semantically correct way of associating relations.
The most commonly used mapping languages today (including the s-t tgds we consider) are constraints that specify properties that must be true of the transformation. However, they do not determine a unique target instance, and hence do not determine a unique transformation. New types of errors arise due to the fact that a mapping can be handled in different ways by the transformations that implement the mapping.
Incorrect Handling of Atomic Values: This type of error arises if either a non-atomic attribute is handled as an atomic one or if incorrect functions are applied to split a non-atomic attribute value. An example of this type of error is a transformation that literally copies the values of a source attribute name that stores the full names of persons to a firstname attribute in the target.
Redundant Data: A transformation that implements multiple, overlapping mappings may produce redundant data for a target relation. This was one of the motivations to introduce nested mappings, core mappings, and laconic mappings. However, such mappings are not considered by most mapping tools.
Instance Data Errors: Incorrect source instance data can cause errors in the target instance. Instance data errors may confuse a user and lead her to conclude a correct mapping is incorrect.
Example for Debugging an Error with TRAMP
As an example of schema mapping with TRAMP we now show how to use TRAMP to debug a specific Missing Mapping error.
As explained above, a missing mapping may be caused by a missing correspondence. For instance, if in the example schemas presented above the correspondences between the InProceedings relation and the Publications relation is missed, then only mapping \(M_1\) (and not \(M_2\)) is created. To investigate the cause of the missing conference publications in the target instance, a user may start by querying the mapping and transformation provenance of transformation \(T_1\) that generates the Publication relation (mapping \(M_1\)):
This query reveals that all tuples have been created by mapping \(M_1\) and that the InProceedings relation is not accessed by this mapping and its transformation. This means a mapping is missing that maps data from the InProceedings relation. By explicitly modeling the correspondences and making them query-able TRAMP enables the user to track the reasons for the missing mapping. E.g., the user can search for correspondences between InProceedings and another relation by using XPath to query the XML representation of the correspondences. This would reveal that no correspondences are defined for this relation and new correspondences need to be created to enable a mapping tool to generate mapping \(M_2\).
TRAMP: Understanding the Behavior of Schema Mappings through Provenance
Boris Glavic, Gustavo Alonso, Renée J. Miller and Laura M. Haas
Proceedings of the Very Large Data Bases Endowment. 3, 1 (2010) , 1314–1325.
Formal Foundation of Contribution Semantics and Provenance Computation through Query Rewrite in TRAMP
University of Zurich.