The Vagabond system uses a novel holistic approach to help users understand and debug data exchange scenarios. Developing such a scenario is a complex and labor-intensive process, and errors are often revealed only in the target instance that the process produces. This makes such scenarios very hard to debug, especially for non-power users. Vagabond aids the user in debugging by automatically generating possible explanations for the target instance errors that the user identifies.
For large schemata, the multi-step process of generating a schema mapping is error-prone. Often, errors become apparent only in the generated target instance. For example, a user may recognize that some attribute values in the target instance are incorrect. Tracing errors is time-consuming and complex because there are many possible sources of errors: data, correspondences, schema mappings, or transformations. Previous work focused on aiding the user in debugging by (1) providing additional information, such as provenance, and better query language support for schema mappings (TRAMP, MXQL, Spider) or (2) offering programming-language-style debugging features such as breakpoints (Spider). These approaches have in common that they are tailored to power users: they require the user to understand what the possible sources of errors are, and they rely on the user to guide the debugging process accordingly. In contrast, Vagabond automatically generates and ranks explanations for errors in a data exchange setting, based on user-provided input about which parts of a generated target instance are erroneous. The rationale behind this approach is that (1) even inexperienced users are able to recognize instance errors, and (2) for both inexperienced and power users, it is much harder to come up with an explanation than to verify whether a given explanation is correct.
The explanation generation of Vagabond builds on the facilities provided by TRAMP to generate and query data, various kinds of provenance, and mapping information. We consider data, correspondences, mappings, and transformations as potential causes of errors. For instance, a possible explanation for incorrect values in a target relation is that the source data from which these values were copied is erroneous. Data provenance is used to identify this part of the source data. For each generated explanation, we compute which mapping scenario elements and which parts of the instance would be affected by the explanation (called its side-effects). The user can mark an explanation as correct. The side-effects of this explanation are then treated as additional errors, so the user does not need to mark every target instance error to debug a data exchange scenario. To present more likely explanations first, we rank explanations by the number of side-effects they imply. The explanation generation is complemented by visualization of provenance and mapping information. Vagabond provides an easy-to-use GUI for navigating through this information.
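The ranking and confirmation steps described above can be illustrated with a minimal Python sketch. It is not Vagabond's implementation: the `Explanation` class, the `rank` and `confirm` functions, and the attribute identifiers such as `person.t1.name` are hypothetical names introduced for illustration only. The sketch also assumes that explanations with fewer side-effects are considered more likely and are therefore listed first; the text above says only that ranking is based on the number of side-effects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    """Hypothetical model of one candidate explanation for marked target errors."""
    kind: str                               # "source data", "correspondence", "mapping", ...
    description: str
    side_effects: frozenset = frozenset()   # other target elements the explanation would affect


def rank(explanations):
    """Order explanations by number of side-effects (assumption: fewer ranks first)."""
    return sorted(explanations, key=lambda e: len(e.side_effects))


def confirm(explanation, marked_errors):
    """Treat the side-effects of a confirmed explanation as additional errors."""
    return set(marked_errors) | explanation.side_effects


# The user marks a single erroneous attribute value in the target instance.
marked = {"person.t1.name"}

# Made-up candidate explanations with made-up side-effects.
candidates = [
    Explanation("mapping", "a mapping producing person tuples is incorrect",
                frozenset({"person.t2.name", "person.t3.name", "address.t1.city"})),
    Explanation("source data", "the source value copied into person.t1.name is wrong",
                frozenset({"person.t2.name"})),
    Explanation("correspondence", "the correspondence feeding person.name is wrong",
                frozenset({"person.t2.name", "person.t3.name"})),
]

ranked = rank(candidates)
errors = confirm(ranked[0], marked)

print([e.kind for e in ranked])
print(sorted(errors))
```

In this sketch, confirming the top-ranked explanation enlarges the error set without the user having to mark the additional value by hand, which mirrors how side-effects reduce the marking effort.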
- Gustavo Alonso - Professor at ETH Zurich Systems Group
- Renée J. Miller - Professor at the University of Toronto Database Group
- Laura M. Haas - IBM Fellow and Director, Institute for Massive Data, Analytics and Modeling
- Jiang Du - Ph.D. Student at the University of Toronto Database Group

- Automatic Generation and Ranking of Explanations for Mapping Errors. Technical report, Illinois Institute of Technology, IIT/CS-DB-2015-01, 2015.
- Computing Candidate Keys Of Relational Operators For Optimizing Rewrite-Based Provenance Computation. Master's thesis, IIT, 2015.
- Efficient Scoring and Ranking of Explanation for Data Exchange Errors in Vagabond. Master's thesis, IIT, 2014.
- Debugging Data Exchange with Vagabond. In Proceedings of the VLDB Endowment (PVLDB), Demonstration Track, volume 4, 2011.