aliquote.org

Workflow for statistical data analysis

April 26, 2012

Few days ago, I came across Oliver Kirchkamp’s Workflow of statistical data analysis which provides essential tips and guidelines for managing not only data but the whole analysis flow (from getting raw data to publishing papers).

For R users

This is a very comprehensive textbook, with illustrations in R. I already dropped some words two years ago in How to efficiently manage a statistical analysis project. People with little knowledge of R can skip chapter 2 and go directly to chapter 3 which goes to the heart of the problem: How to create and manage a statistical project in R?

In few words:

Chapter 5 is about data manipulation: subsetting (subset()–I like this function a lot), merging (merge()) and reshaping (stats::reshape()–I prefer the reshape2 package). I’m currently writing a textbook on data manipulation and visualization with R and I have planned to include a dedicated chapter on that topic. However, it is likely it will rely on base functions as well as the plyr and Hmisc packages.

I noticed that the author seems to make extensive use of the memisc package, for codebook, variable labeling and description (e.g., p. 57), importing Stata dataset. Another interesting point is made p. 56 about data signature using tools::md5sum(). Chapter 7 and 8 are about Weaving/Tangling R code (with a brief indication of how to write a minimal Makefile) and version control (SVN). Again, discussion about VC engines is kept to the bare essential, but this is good to point to that possibility. RStudio now has full support for git and Subversion.

For Stata users

Another great book I happened to order recently is dedicated to statistical analysis with Stata: J. Scott Long, The workflow of data analysis using Stata. Stata Press, 2009.

However, after having skimmed over a few random pages I feel it would apply equally to R or other software package. There is a companion package that can be found by issuing

. findit workflow

at Stata prompt. It targets Stata 9 and 10, but it will work with Stata 12. It essentially consists in two separate packages wf*-part1 and wf*-part2, and a set of help files, wf-help. The first two packages are dta and do files that are used throughout the book (datasets and Stata commands). They all goes in your home directory by default. As a matter of fact, everything is labelled as wf* such that it is easy to move them elsewhere on your hard drive. That is about 160 files in total:

$ ls wf* | wc -l
    157

About wf-help, here is a snapshot of what’s going to be installed in your personal ado folder (left panel below). After that, you will be able to type query help for wfdata, and get the screen reproduced below (right):

I have to read the book now, and I will probably update this post with a more detailed summary.

See Also

» Statistical learning and regression » Regression Methods in Biostatistics » Latest reading list on medical statistics » Fitting genetic models for twin studies » Kiefer's Introduction to statistical inference