Potter's Wheel A-B-C: An Interactive Tool for Data Analysis, Cleansing, and Transformation
Real-world data often has to be transformed in many ways before it can be used in different situations. The data often has discrepancies in structure and content that must be cleaned, and many analysis or visualization tools require the data to be in particular formats. Traditionally, such transformation has been done through ad hoc scripts or through cookie-cutter transformation tools that require much laborious and error-prone programming. Moreover, the transformation process is typically decoupled from the analysis process. On large data sets, such transformation and analysis are quite time-consuming. Users often need to perform many iterations of analysis and transformation, and have to endure many long, frustrating delays.
A-B-C is an interactive tool that tightly integrates data transformation, discrepancy detection, and data analysis. Users gradually build transformations by adding or undoing transforms in an intuitive manner through a spreadsheet-like interface; the effect of a transform is shown at once on the records visible on screen. The system automatically checks for discrepancies in the current transformed version of the data, and flags them to the user as they are found. Users can specify transforms through graphical operations, or by giving examples of the desired effect, and need not worry about writing complex regular expressions, grammars, or custom scripts. Discrepancy detection is highly customizable: users can define custom domains with specific constraints that values must satisfy. The system automatically infers the structure of the data in terms of these user-defined domains, and checks the data for the corresponding constraint violations. The transformations can subsequently be stored as C++ or Perl programs, or as A-B-C macros, for later application. More details about A-B-C's transformation and discrepancy detection features are available here.
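The idea of user-defined domains can be illustrated with a minimal Python sketch. It is not A-B-C's actual implementation or API: the `Domain` class, the `infer_domain` and `find_discrepancies` functions, and the majority-vote inference rule are all assumptions made for illustration. Each domain is a name plus a constraint; the sketch picks the domain that accepts the most values in a column, then flags the values that violate it.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Domain:
    """Hypothetical user-defined domain: a name plus a constraint on values."""
    name: str
    accepts: Callable[[str], bool]


def infer_domain(values: List[str], domains: List[Domain]) -> Optional[Domain]:
    """Pick the domain that accepts the most values (assumed majority heuristic)."""
    best, best_hits = None, 0
    for domain in domains:
        hits = sum(1 for v in values if domain.accepts(v))
        if hits > best_hits:
            best, best_hits = domain, hits
    return best


def find_discrepancies(values: List[str], domain: Domain) -> List[Tuple[int, str]]:
    """Return (row index, value) pairs that violate the domain's constraint."""
    return [(i, v) for i, v in enumerate(values) if not domain.accepts(v)]


# Two illustrative domains defined with regular expressions.
DOMAINS = [
    Domain("integer", lambda v: re.fullmatch(r"-?\d+", v) is not None),
    Domain("iso_date", lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None),
]

column = ["2000-05-04", "2000-05-12", "Oct 10, 2000", "2000-10-10"]
domain = infer_domain(column, DOMAINS)
flagged = find_discrepancies(column, domain) if domain else []
```

Here the column is inferred to be in the `iso_date` domain, and the third value is flagged because it does not satisfy that domain's constraint.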
A-B-C also has facilities for analyzing the transformed data. Analysis and transformation go hand in hand: analysis can help to identify patterns in the data that can be used to transform it, and transformation in turn can convert the data into a format suitable for analysis. By explicitly coupling analysis and transformation in the same software, A-B-C allows users to interactively transform and analyze large data sets or even infinite data streams, without any waits.
A-B-C allows users to analyze data by partitioning it and computing aggregates on the partitions. The partitioning can be performed recursively along any column of the data (including columns formed by transformation), and the aggregates can be either simple numerical aggregates or graphical charts, as illustrated in these screenshots. Besides aggregates, A-B-C also allows users to see example data values from any partition using a spreadsheet-like interface. The user can sort and scroll interactively along any dimension to explore the data in detail. The example values can in turn be transformed and analyzed as described earlier.
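Recursive partitioning with aggregates can be sketched in a few lines of Python. This is only an illustration under assumed names (`partition`, `aggregate`, and the sample `rows` are hypothetical), not a description of how A-B-C is implemented: rows are grouped by one column, each group is grouped again by the next column, and an aggregate function is applied to every leaf partition.

```python
from collections import defaultdict


def partition(rows, columns):
    """Recursively group rows by each column in turn, returning nested dicts."""
    if not columns:
        return rows  # leaf partition: a plain list of rows
    first, rest = columns[0], columns[1:]
    groups = defaultdict(list)
    for row in rows:
        groups[row[first]].append(row)
    return {key: partition(group, rest) for key, group in groups.items()}


def aggregate(tree, column, func):
    """Apply func to the values of `column` in every leaf partition."""
    if isinstance(tree, list):
        return func([row[column] for row in tree])
    return {key: aggregate(sub, column, func) for key, sub in tree.items()}


# Made-up sample data, for illustration only.
rows = [
    {"region": "west", "year": 1999, "sales": 10},
    {"region": "west", "year": 2000, "sales": 30},
    {"region": "east", "year": 2000, "sales": 20},
    {"region": "east", "year": 2000, "sales": 40},
]

tree = partition(rows, ["region", "year"])
totals = aggregate(tree, "sales", sum)
```

The leaves of `tree` still hold the original rows, so example values from any partition remain available for inspection, as in the spreadsheet-like view described above.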
The latest release of Potter's Wheel A-B-C is version 1.3. It is much more robust than the alpha versions and has many new features. A-B-C is available free, with source code, for any use whatsoever. Currently it runs on the Windows 98/NT platforms.
Version | Release Date | Comments |
A-B-C version 1.4 | To be released soon! | More Features, Better Polished |
A-B-C version 1.3 | Oct 10, 2000 | Robust version. |
A-B-C version 1.1 | May 12, 2000 | Alpha release, fixed bugs in 1.0. |
A-B-C version 1.0 | May 4, 2000 | Alpha release, many glitches. |