< a quantity that can be divided into another a whole number of time />

Exploratory data mining and data cleaning

February 17, 2013

I just got my copy of Exploratory Data Mining and Data Cleaning, by Dasu and Johnson (Wiley, 2003). This is quite an old book, but it offers a nice overview of common techniques for gauging and improving data quality through exploratory data analysis.

I learned about DataSphere partitioning,(1,2) for instance. This book is, however, not about tools for data cleaning or data analysis, unlike Janert’s Data Analysis with Open Source Tools (O’Reilly, 2010), which presents gnuplot, Sage, R, and Python and offers a short appendix on Working with Data. I think that having some working knowledge of common tools for data processing and data representation is very useful, as suggested by Paul Murrell’s online book, Introduction to Data Technologies.

I have been using GNU coreutils, Sed and Awk for almost 10 years now. For example, I sometimes use the following command to remove the chunk comments generated by knitr, so that students get a single R source file that they have to annotate themselves.

% sed '/^###/d' seance07.r > seance07f.r
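
To make the effect of this command concrete, here is a minimal sketch on a made-up input file. The file names and the R lines are hypothetical, and it assumes, as in the command above, that the chunk comments are exactly the lines starting with three hash signs.

```shell
tmpdir=$(mktemp -d)

# Hypothetical excerpt of an R script containing chunk comments.
printf '%s\n' \
  '### chunk: setup' \
  'x <- rnorm(10)' \
  '# an ordinary comment, which should survive' \
  '### chunk: plot' \
  'hist(x)' > "$tmpdir/in.r"

# Delete every line that starts with '###'; all other lines are copied as is.
sed '/^###/d' "$tmpdir/in.r" > "$tmpdir/out.r"

# Count the remaining lines and show them.
kept=$(wc -l < "$tmpdir/out.r" | tr -d ' ')
first=$(head -n 1 "$tmpdir/out.r")
cat "$tmpdir/out.r"

rm -r "$tmpdir"
```

Comments starting with a single hash sign are left alone, because the pattern is anchored at the beginning of the line and requires three hash signs.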

I find myself using Awk or Sed more and more with complex data or large statistical projects. In recent months, I have been involved in a multicenter clinical study with more than 3000 variables recorded in different tables for different types of subjects (children and parents). I received some CSV files extracted from a local database that I wasn’t able to read correctly in R or Stata. Using awk to count the number of fields on each line, I quickly realized that several of the data files had a structural problem:

% awk -F";" '{print NF}' XXX.csv | sort | uniq -c
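
Once the field counts disagree, the next step is usually to find which lines are at fault. The following is only a sketch on a small made-up file (the data and file name are hypothetical): it assumes the header line has the correct number of fields and that no field contains a quoted semicolon, which a plain `-F";"` split would miscount.

```shell
tmpdir=$(mktemp -d)

# Hypothetical semicolon-separated file; the third line is missing a field.
printf '%s\n' 'id;age;sex' '1;34;F' '2;41' '3;29;M' > "$tmpdir/sample.csv"

# Distribution of field counts, as in the command above.
awk -F';' '{print NF}' "$tmpdir/sample.csv" | sort -n | uniq -c

# Use the header's field count as the reference and report deviating lines.
bad=$(awk -F';' 'NR == 1 {n = NF} NF != n {print NR ": " NF " fields"}' "$tmpdir/sample.csv")
echo "$bad"

rm -r "$tmpdir"
```

On real exports, the reported line numbers can then be inspected by hand, for instance with `sed -n '3p' file.csv`.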

Original files were also re-encoded from ISO-8859-1 to UTF-8 with iconv.
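
For reference, a conversion of this kind can be sketched as follows. The file names are hypothetical, and the example only assumes that an `iconv` command supporting the `-f` and `-t` options is installed.

```shell
tmpdir=$(mktemp -d)

# Write "café" in ISO-8859-1: the é is the single byte 0xE9 (octal 351).
printf 'caf\351\n' > "$tmpdir/latin1.txt"

# -f gives the source encoding, -t the target encoding.
iconv -f ISO-8859-1 -t UTF-8 "$tmpdir/latin1.txt" > "$tmpdir/utf8.txt"

# In UTF-8 the é is encoded on two bytes, so the file grows by one byte.
before=$(wc -c < "$tmpdir/latin1.txt" | tr -d ' ')
after=$(wc -c < "$tmpdir/utf8.txt" | tr -d ' ')
echo "$before -> $after bytes"

rm -r "$tmpdir"
```

Writing to a new file rather than overwriting the original keeps the raw extraction intact, in case the source encoding was guessed wrong.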

Some folks have coined the term Data Scientist to designate people working with data and, to cite Wikipedia, their skills include

the ability to find and interpret rich data sources, manage large amounts of data despite hardware, software and bandwidth constraints, merge data sources together, ensure consistency of data-sets, create visualizations to aid in understanding data and building rich tools that enable others to work effectively.

Associated keywords are often: big data, predictive analytics, data hacking, web scraping, etc. Data Analysis with Open Source Tools is one of those attempts to reconcile programming with statistics, although I still prefer the books by John Chambers or by Cook and Swayne. It seems to be used as a reference textbook for a class on Statistical Computing for Biologists. Anyway, this book offers a modern perspective on data analysis using freely available tools; for a more complete review, see Noel M O’Boyle. Review of “Data Analysis with Open Source Tools” by Philipp K Janert. Journal of Cheminformatics 2011, 3:10.

More recently, O’Reilly published the Bad Data Handbook, which offers some applications with R and Python (and even Gremlin Groovy). On the other hand, Exploring Everyday Things with R and Ruby (O’Reilly, 2012) uses R and Ruby to mine several sources of data (Gmail messages, audio files) or to write some simulation experiments from scratch.¹ Other titles from O’Reilly speak for themselves, e.g. Mining the Social Web, Data Mashups in R, but what about this Data Science Starter Kit?

I believe John Chambers’s books on data analysis using the S or R software(3,4) are still really insightful because they are written from a statistical perspective, by someone who probably spent hours writing dedicated statistical packages. The more recent one even includes a section on Using and Writing Perl (pp. 309-318). We can do a lot from within R, but surely “nasty things” deserve to be done with other scripting languages.

Phil Spector also has nice tutorials on Perl, Python, SQL and various Unix stuff. He also wrote one of the books in Springer’s Use R! series, Data Manipulation with R, which has two chapters on interacting with data stored in flat files or databases. Working with data is a non-negligible part of a statistical project, although this certainly doesn’t mean that all data steps are about data hacking. In The Workflow of Data Analysis Using Stata (Stata Press, 2009), JS Long discusses data preparation and data cleaning over two chapters (pp. 125-285). This is about 40% of the whole book, and I find that it accurately reflects the work of the statistician, who is often also in charge of data preparation and management.

In the spirit of Karl Broman’s idea for a course in statistical programming, I would like to write some tutorials on dealing and thinking with data. This might include: ETL, data anonymization, data protection, storage, databases, OLAP, datasphere, md5sum, diff, quality check, data mining, codebook, language fixes, multiple imputed datasets, interim analysis, and many other aspects of statistical analysis.


  1. Johnson, T and Dasu, T. Scalable Data Space Partitioning In High Dimension. 1999.
  2. Dasu, T and Johnson, T. Hunting of the snark. Finding data glitches using Data Mining methods. IQ Conference, 1999.
  3. Chambers, JM. Programming with Data. A Guide to the S Language. Springer, 1998.
  4. Chambers, JM. Software for Data Analysis. Programming with R. Springer, 2008.

  1. Code and data available on GitHub.
