aliquote.org

Data science from scratch

July 26, 2015

Yet another quick review of a book I recently purchased from O’Reilly: Data Science from Scratch, by Joel Grus. I will take advantage of this post to briefly review other related books.

Data Science from Scratch covers a number of statistical models (chapters 12 to 18), including k-NN, naive Bayes, generalized linear models, decision trees, and neural networks; additional chapters cover unsupervised learning (k-means and hierarchical clustering), natural language processing (text processing/classification and topic modeling), network analysis, and recommender systems. Two extra chapters cover the basics of SQL and MapReduce. Overall, I find that the emphasis is put on applications and Python coding, rather than on using Python to solve data science problems.

A crash course in Python is offered in Chapter 2, and the author suggests installing Anaconda, as so many do; I am, however, happy with the Python distribution that shipped with my MacBook (that may be just me, but Anaconda makes a mess of various libraries when used alongside Homebrew and R). I don’t think this chapter would be enough for a beginner in Python, though. Likewise, the chapters on linear algebra and statistics are rather thin, even if chapter 7 (“Hypothesis and Inference”) discusses key elements of statistical inference. The graphics are rather terrible because the author uses Matplotlib with all its horrible default settings (fortunately there are now bokeh and seaborn!).

Overall, this book is for people who know Python and are interested in a quick introduction to data science, without all the details. The author is certainly very proficient in Python, and he surely knows what he is doing when analyzing data, but I was surprised to find so little discussion of statistical theory or of data science as a whole. As it stands, this book looks more like a collection of recipes for data scientists interested in using Python.

Other titles of interest that I should probably describe in more detail in future posts:
