Potter's Wheel A-B-C: An Interactive Tool for Data Analysis, Cleansing, and Transformation

Real-world data often has to be transformed in many ways before it can be used in different situations. The data frequently has discrepancies in structure and content that must be cleaned, and many analysis or visualization tools require the data to be in particular formats. Traditionally, such transformation has been done through ad hoc scripts or through cookie-cutter transformation tools that require laborious and error-prone programming. Moreover, the transformation process is typically decoupled from the analysis process. On large data sets, both transformation and analysis are time-consuming, so users who need many iterations of analysis and transformation have to endure long, frustrating delays.

Data Transformation and Cleaning using A-B-C

A-B-C is an interactive tool that tightly integrates data transformation, discrepancy detection, and data analysis. Users build transformations gradually, adding or undoing transforms through an intuitive spreadsheet-like interface; the effect of each transform is shown at once on the records visible on screen. The system automatically checks the current transformed version of the data for discrepancies and flags them to the user as they are found. Users can specify transforms through graphical operations or by giving examples of the desired effect, and need not write complex regular expressions, grammars, or custom scripts. Discrepancy detection is highly customizable: users can define custom domains with specific constraints that values must satisfy. The system automatically infers the structure of the data in terms of these user-defined domains and checks the data for the corresponding constraint violations. The resulting transformations can then be stored as C++ or Perl programs, or as A-B-C macros, for later application. More details about A-B-C's transformation and discrepancy detection features are available here.
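The workflow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not A-B-C's implementation or interface: the names `Domain`, `infer_domain`, `find_discrepancies`, `TransformStack`, and `us_date_to_iso` are hypothetical, the sample rows are made up, and the structure inference is a simple majority vote over each column, which may differ from what the system actually does.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Row = List[str]


@dataclass
class Domain:
    """Hypothetical user-defined domain: a name plus a constraint on values."""
    name: str
    accepts: Callable[[str], bool]


# Illustrative domains only; a real user would define their own.
DOMAINS = [
    Domain("integer", lambda v: re.fullmatch(r"-?\d+", v) is not None),
    Domain("iso_date", lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None),
    Domain("word", lambda v: re.fullmatch(r"[A-Za-z]+", v) is not None),
]


def infer_domain(values: List[str], threshold: float = 0.5) -> Optional[Domain]:
    """Pick the domain accepting the largest share of values (assumed heuristic)."""
    if not values:
        return None
    best, best_share = None, 0.0
    for domain in DOMAINS:
        share = sum(domain.accepts(v) for v in values) / len(values)
        if share > best_share:
            best, best_share = domain, share
    return best if best_share >= threshold else None


def find_discrepancies(rows: List[Row]) -> List[Tuple[int, int, str, str]]:
    """Return (row, column, value, domain name) for each constraint violation."""
    if not rows:
        return []
    flagged = []
    for col in range(len(rows[0])):
        column = [row[col] for row in rows]
        domain = infer_domain(column)
        if domain is None:
            continue  # no domain fits this column well enough to check it
        for i, value in enumerate(column):
            if not domain.accepts(value):
                flagged.append((i, col, value, domain.name))
    return flagged


@dataclass
class TransformStack:
    """Transforms are added or undone; the source rows are never modified."""
    rows: List[Row]
    transforms: List[Callable[[Row], Row]] = field(default_factory=list)

    def add(self, transform: Callable[[Row], Row]) -> None:
        self.transforms.append(transform)

    def undo(self) -> None:
        if self.transforms:
            self.transforms.pop()

    def current(self) -> List[Row]:
        """Apply every transform, in order, to a copy of the source rows."""
        out = [list(row) for row in self.rows]
        for transform in self.transforms:
            out = [transform(row) for row in out]
        return out


def us_date_to_iso(row: Row) -> Row:
    """Example transform: rewrite mm/dd/yyyy in column 1 as yyyy-mm-dd."""
    new = list(row)
    m = re.fullmatch(r"(\d{2})/(\d{2})/(\d{4})", new[1])
    if m:
        new[1] = f"{m.group(3)}-{m.group(1)}-{m.group(2)}"
    return new


session = TransformStack([["12", "2000-05-04"], ["7", "05/12/2000"], ["n/a", "2000-10-10"]])
print(find_discrepancies(session.current()))  # flags "n/a" and the US-style date
session.add(us_date_to_iso)
print(find_discrepancies(session.current()))  # only "n/a" is still flagged
session.undo()                                # the date discrepancy returns
```

Because `current()` always recomputes from the untouched source rows, undoing a transform is just a matter of dropping it from the list, and the discrepancy check always runs against the current transformed version of the data.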

Data Analysis using A-B-C

A-B-C also has facilities for analyzing the transformed data. Analysis and transformation go hand in hand: analysis can help identify patterns in the data that suggest how to transform it, and transformation in turn can convert the data into a format suitable for analysis. By explicitly coupling analysis and transformation in the same software, A-B-C allows users to interactively transform and analyze large data sets, or even infinite data streams, without waiting.

A-B-C allows users to analyze data by partitioning it and computing aggregates on the partitions. The partitioning can be performed recursively along any column of the data (including columns formed by transformation), and the aggregates can be either simple numerical aggregates or graphical charts, as illustrated in these screenshots. Besides aggregates, A-B-C also allows users to see example data values from any partition using a spreadsheet-like interface. The user can sort and scroll interactively along any dimension to explore the data in detail. The example values can in turn be transformed and analyzed as described earlier.
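As a rough illustration of recursive partitioning with aggregates, the sketch below groups rows along a list of columns and applies an aggregate to each innermost partition. It is a minimal sketch, not A-B-C's code: the function names `partition` and `examples`, the `SALES` rows, and the column names are all made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Made-up sample rows, for illustration only.
SALES = [
    {"region": "west", "year": 1999, "sales": 10.0},
    {"region": "west", "year": 1999, "sales": 30.0},
    {"region": "west", "year": 2000, "sales": 50.0},
    {"region": "east", "year": 2000, "sales": 20.0},
]


def partition(rows, columns, aggregate):
    """Recursively partition rows along `columns`, aggregating each leaf partition."""
    if not columns:
        return aggregate(rows)
    first, rest = columns[0], columns[1:]
    groups = defaultdict(list)
    for row in rows:
        groups[row[first]].append(row)
    return {key: partition(group, rest, aggregate) for key, group in groups.items()}


def examples(rows, selection, limit=3):
    """Return up to `limit` example rows from the partition picked by `selection`."""
    matching = [r for r in rows if all(r[c] == v for c, v in selection.items())]
    return matching[:limit]


def average_sales(group):
    """A simple numerical aggregate over one partition."""
    return mean([row["sales"] for row in group])


print(partition(SALES, ["region", "year"], average_sales))
print(examples(SALES, {"region": "west", "year": 1999}, limit=1))
```

Any column can appear in `columns`, including one produced by an earlier transformation, and swapping `average_sales` for another function changes the aggregate without touching the partitioning logic.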

Software

The latest release of Potter's Wheel A-B-C is version 1.3. It is much more robust than the alpha versions and adds many new features. A-B-C is available free, with source code, for any use whatsoever. Currently it runs on the Windows 98/NT platforms.
Version              Release Date            Comments
A-B-C version 1.4    To be released soon!    More features, better polished.
A-B-C version 1.3    Oct 10, 2000            Robust version.
A-B-C version 1.1    May 12, 2000            Alpha release, fixed bugs in 1.0.
A-B-C version 1.0    May 4, 2000             Alpha release, many glitches.

Papers and Talks

Data Cleaning Bibliography


The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holders. ACM and IEEE published documents have other restrictions given here and here.

 



 
 
Send mail to rshankar@cs.berkeley.edu with questions or comments about this web site.