My Data Science Tool Box

This post describes the tools I currently use for working with data. People often ask me to recommend specific tools, and I always hesitate, because so much boils down to personal preference. I recently added a workshop to the DSS lineup providing an overview of popular tools for working with data. The core idea is that researchers have many options when choosing tools to implement a reproducible workflow. For example, it doesn't really matter whether you choose to learn R or Python; the important thing is that you write and document code of some kind so that your analysis can be reproduced. Similarly, it doesn't matter much whether you choose to use RStudio or Jupyter notebooks; the important thing is that you have a development and authoring environment that encourages good research practices. Still, inquiring minds want to know: what do you use?

The short answer is as follows:

Operating system
Arch Linux
Programming language
R and Python
Editor / IDE
Markup language
Org Mode and LaTeX
Revision control system

Those curious to know why I prefer these tools and how I've customized them to suit my needs and preferences can read on.

Read more…

Resources for learning R

A list of resources for learning R in preparation for CS109 this Spring.

A wealth of R resources are available, and I'm sure I've missed some really good ones. If you have a favorite tutorial or resource that is not listed here, please email me or submit a bug report or pull request to


Many organizations (including Harvard) offer R workshops. If you would like to attend an R workshop, here are some good places to start.
List of R workshops, including those offered by Udemy, Coursera, and others.
R workshops offered by RStudio.
R workshops offered by the Institute for Quantitative Social Science (IQSS) at Harvard.



There are some great efforts to provide interactive self-paced R tutorials in your browser or in R itself.
Interactive R tutorials with feedback, right in your web browser!
Interactive R tutorials with feedback in R.
Interactive R tutorials in your web browser. Includes a ggplot tutorial.


Many R tutorials have been collected at The list of contributed documentation at is a great place to start.

There are several excellent tutorials not listed on Some of these are listed below.
"Quick-R" aims to get you up and running in R quickly.
Notes on "Using R for psychological research".
"R for Data Science" by R luminary Hadley Wickham. Includes a ggplot tutorial.
Advanced R programming by Hadley Wickham.
A comprehensive RMarkdown tutorial.

Reference cards

RStudio maintains a collection of high-quality cheat sheets at (these are also accessible from the Help -> cheat-sheets menu in the RStudio IDE). Additional resources are listed below.
A numpy cheat sheet for R users, but it works just as well the other way around.
An R cheat sheet for MATLAB users.
Another R cheat sheet for MATLAB or Python users.

R package discovery

The Comprehensive R Archive Network (CRAN) is the main R package repository. The web interface is not very sophisticated, so I recommend using the resources listed below instead.
R Task Views are curated lists of R packages and functions organized by topic.
METACRAN is a friendly, searchable web interface to CRAN.
A searchable interactive interface to R and R package documentation.

Blogs, forums and mailing lists

R related blogs are aggregated at is by far the most popular help forum for R. Use the [r] tag or navigate directly to

Although the R mailing lists have been losing traffic to Stack Overflow, there are still plenty of people responding to questions. You can subscribe to the main R-help mailing list at

Coming to terms with the pace of change in R

Is it you or have I become old and cranky?

I've been using R and mostly enjoying it since 2006. Lately I've been having some misgivings about the direction R as a community is headed. Some of these misgivings no doubt stem from reluctance to learn new ways of doing things after investing so much time mastering the old ways, but underneath my old-man crankiness I believe there are real and important challenges facing the R community. R has grown considerably since I started using it a decade ago, and this has mostly been a good thing as new packages implement new and better functionality. Recently, popular contributed packages have been replacing core R functionality with new approaches, leading to a fragmentation of the user base and increasing cognitive load by requiring analysts to choose a package (or set of packages) before they even write the first line of code.

Read more…

Extracting content from .pdf files

One common question I get as a data science consultant involves extracting content from .pdf files. In the best-case scenario the content can be extracted to consistently formatted text files and parsed from there into a usable form. In the worst case the file will need to be run through an optical character recognition (OCR) program to extract the text.

Overview of available tools

For years, pdftotext from the poppler project was my go-to answer for the easy case. This is still a good option, especially on Mac (using Homebrew) or Linux, where installation is easy. Windows users can install poppler binaries from (make sure to add the bin directory to your PATH). More recently I've been using the excellent pdftools package in R to more easily extract and manipulate text stored in .pdf files.
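As a rough illustration of the easy case, here is a minimal sketch using pdftools. It assumes the package is installed; the helper name pdf_lines is hypothetical and not part of pdftools. Only pdf_text, which returns one string per page, comes from the package.

```r
# Minimal sketch: pull the text out of a PDF, one row per line of text.
# Assumes install.packages("pdftools") has been run.
library(pdftools)

# Hypothetical helper (not part of pdftools).
pdf_lines <- function(path) {
  pages <- pdf_text(path)                        # character vector, one element per page
  lines <- strsplit(pages, "\n", fixed = TRUE)   # split each page into lines
  data.frame(
    page = rep(seq_along(lines), lengths(lines)),  # page number for each line
    text = trimws(unlist(lines)),                  # line text without padding
    stringsAsFactors = FALSE
  )
}
```

Calling pdf_lines("report.pdf") (a placeholder file name) gives a data frame that can then be filtered or parsed with ordinary string tools. How well this works depends on how consistently the text is laid out in the file.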

In the more difficult case, where the pdf contains images rather than text, it is necessary to use optical character recognition (OCR) to recover the text. This can be achieved using point-and-click applications like freeOCR, Adobe Acrobat or ABBYY. ABBYY even has a convenient cloud OCR service that can be easily accessed from R using the abbyyR package. If you don't have a license for one of these expensive OCR solutions, or if you prefer something you can easily script from the command line, tesseract is a very good option.
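For the image-only case, one possible scripted approach is sketched below. Note the assumptions: it uses the tesseract R package (bindings to the tesseract engine) rather than the command-line program, it assumes English trained data is available, and the helper name ocr_pdf is hypothetical. Each page is rendered to a PNG with pdftools and then passed to the OCR engine.

```r
# Minimal sketch: OCR a scanned PDF page by page.
# Assumes the pdftools and tesseract packages are installed.
library(pdftools)
library(tesseract)

# Hypothetical helper (not part of either package).
ocr_pdf <- function(path, dpi = 300, language = "eng") {
  n_pages <- pdf_info(path)$pages
  # Render each page to a PNG in the session's temporary directory.
  png_files <- file.path(tempdir(), sprintf("ocr_page_%03d.png", seq_len(n_pages)))
  pdf_convert(path, format = "png", dpi = dpi,
              filenames = png_files, verbose = FALSE)
  engine <- tesseract(language)
  # Recognize the text on each page image; returns one string per page.
  vapply(png_files, function(f) ocr(f, engine = engine),
         character(1), USE.NAMES = FALSE)
}
```

A higher dpi usually improves recognition at the cost of speed. The output will still need checking by hand, since OCR errors are common on low-quality scans.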

Read more…

Escaping from character encoding hell in R on Windows

Note: the title of this post was inspired by this question on Stack Overflow.

This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.

If you are on a deadline and just need to get the job done, this section should be all you need. Additional background and discussion are presented in later sections.

To read a text file with a non-ASCII encoding into R you should a) determine the encoding, b) read it in such a way that the information is re-encoded into UTF-8, and c) ignore the bug in the data.frame print method on Windows. Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess the encoding using the stri_read_raw and stri_enc_detect functions in the stringi package. You can ensure that the information is re-encoded to UTF-8 by using the readr package.
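The steps above might look something like the following sketch. It assumes the stringi and readr packages are installed and that the file is a CSV. The helper names detect_file_encoding and read_csv_as_utf8 are hypothetical. The detection step only returns a best guess, so check it against the documentation for your data whenever you can.

```r
# Minimal sketch: guess a file's encoding, then read it re-encoded as UTF-8.
library(stringi)
library(readr)

# Hypothetical helper: return the most likely encoding name for a file.
detect_file_encoding <- function(path) {
  raw_bytes <- stri_read_raw(path)              # read the file as raw bytes
  guesses <- stri_enc_detect(raw_bytes)[[1]]    # candidates, best guess first
  guesses$Encoding[1]
}

# Hypothetical helper: read a CSV, converting its text to UTF-8.
# Pass `encoding` explicitly when it is known; otherwise it is guessed.
read_csv_as_utf8 <- function(path, encoding = detect_file_encoding(path)) {
  read_csv(path, locale = locale(encoding = encoding))
}
```

For example, read_csv_as_utf8("survey.csv", encoding = "ISO-8859-1") reads a Latin-1 file (the file name is a placeholder), while read_csv_as_utf8("survey.csv") falls back on the guess. Guesses are least reliable for short files.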

Read more…