Resources for learning R

A list of resources for learning R in preparation for CS109 this Spring.

A wealth of R resources are available, and I'm sure I've missed some really good ones. If you have a favorite tutorial or resource that is not listed here, please email me or submit a bug report or pull request to http://github.com/izahn/blog.

Workshops

Many organizations (including Harvard) offer R workshops. If you would like to attend an R workshop, here are some good places to start.

http://r-exercises.com/r-courses/
List of R workshops, including those offered by Udemy, Coursera, and others.
https://www.rstudio.com/workshops/
R workshops offered by RStudio.
http://dss.iq.harvard.edu/workshop-registration
R workshops offered by the Institute for Quantitative Social Science (IQSS) at Harvard.

Tutorials

Interactive

There are some great efforts to provide interactive self-paced R tutorials in your browser or in R itself.

https://www.datacamp.com/
Interactive R tutorials with feedback, right in your web browser!
http://swirlstats.com/students.html
Interactive R tutorials with feedback in R.
http://dss.iq.harvard.edu/workshop-materials#widget-1
Interactive R tutorials in your web browser. Includes a ggplot tutorial.

Static

Many R tutorials have been collected at https://www.r-project.org/other-docs.html. The list of contributed documentation at https://cran.r-project.org/other-docs.html is a great place to start.

There are several excellent tutorials not listed on r-project.org. Some of these are listed below.

http://www.statmethods.net/
"Quick-R" aims to get you up and running in R quickly.
http://personality-project.org/r/r.guide.html
Notes on "Using R for psychological research".
http://r4ds.had.co.nz/
"R for Data Science" by R luminary Hadley Wickham. Includes a ggplot tutorial.
http://adv-r.had.co.nz/
Advanced R programming by Hadley Wickham.
http://rmarkdown.rstudio.com/lesson-1.html
A comprehensive RMarkdown tutorial.

Reference cards

RStudio maintains a collection of high-quality cheat sheets at https://www.rstudio.com/resources/cheatsheets/ (these are also accessible from the Help -> cheat-sheets menu in the RStudio IDE). Additional resources are listed below.

http://mathesaurus.sourceforge.net/r-numpy.html
A numpy cheat sheet for R users, but it works just as well the other way around.
http://www.math.umaine.edu/~hiebeler/comp/matlabR.pdf
An R cheat sheet for MATLAB users.
http://mathesaurus.sourceforge.net/matlab-python-xref.pdf
Another R cheat sheet for MATLAB or Python users.

R package discovery

The Comprehensive R Archive Network (CRAN) is the main R package repository. The web interface is not very sophisticated, so I recommend using the resources listed below instead.

https://cran.r-project.org/web/views/
R Task Views are curated lists of R packages and functions organized by topic.
http://r-pkg.org
METACRAN is a friendly, searchable web interface to CRAN.
http://rdocumentation.org
A searchable, interactive interface to R and R package documentation.

Blogs, forums and mailing lists

R-related blogs are aggregated at http://r-bloggers.com.

http://stackoverflow.com is by far the most popular help forum for R. Use the [r] tag or navigate directly to http://stackoverflow.com/questions/tagged/r.

Although the R mailing lists have been losing traffic to Stack Overflow, there are still plenty of people responding to questions. You can subscribe to the main R-help mailing list at https://stat.ethz.ch/mailman/listinfo/r-help.

Coming to terms with the pace of change in R

Is it you or have I become old and cranky?

I've been using R and mostly enjoying it since 2006. Lately I've been having some misgivings about the direction R as a community is headed. Some of these misgivings no doubt stem from reluctance to learn new ways of doing things after investing so much time mastering the old ways, but underneath my old-man crankiness I believe there are real and important challenges facing the R community. R has grown considerably since I started using it a decade ago, and this has mostly been a good thing as new packages implement new and better functionality. Recently, popular contributed packages have been replacing core R functionality with new approaches, leading to a fragmentation of the user base and increasing cognitive load by requiring analysts to choose a package (or set of packages) before they even write the first line of code.

Read more…

Extracting content from .pdf files

One of the common questions I get as a data science consultant involves extracting content from .pdf files. In the best-case scenario the content can be extracted to consistently formatted text files and parsed from there into a usable form. In the worst case the file will need to be run through an optical character recognition (OCR) program to extract the text.

Overview of available tools

For years pdftotext from the poppler project was my go-to answer for the easy case. This is still a good option, especially on Mac (using Homebrew) or Linux, where installation is easy. Windows users can install poppler binaries from http://blog.alivate.com.au/poppler-windows/ (make sure to add the bin directory to your PATH). More recently I've been using the excellent pdftools package in R to more easily extract and manipulate text stored in .pdf files.
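As a rough illustration of the easy case, here is a minimal sketch using pdftools. It assumes the pdftools package is installed, and it builds a throwaway one-page PDF with base R graphics so that it does not depend on any particular file; the variable names are made up for the example.

```r
# Assumes the pdftools package is installed: install.packages("pdftools")
library(pdftools)

# Build a small one-page PDF with base R graphics so the example is
# self-contained. In practice pdf_path would point at your own file.
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello from a PDF")
invisible(dev.off())

# pdf_text() returns a character vector with one element per page.
pages <- pdf_text(pdf_path)
length(pages)

# Split the first page into lines, trim whitespace, and drop empty lines.
page_lines <- trimws(strsplit(pages[1], "\n", fixed = TRUE)[[1]])
page_lines <- page_lines[nzchar(page_lines)]
page_lines
```

From here the lines can be parsed with the usual string tools (regular expressions, strsplit, and so on), which is where most of the real work tends to be.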

In the more difficult case, where the pdf contains images rather than text, it is necessary to use optical character recognition (OCR) to recover the text. This can be achieved using point-and-click applications like freeOCR, Adobe Acrobat or ABBYY. ABBYY even has a convenient cloud OCR service that can be easily accessed from R using the abbyyR package. If you don't have a license for one of these expensive OCR solutions, or if you prefer something you can easily script from the command line, tesseract is a very good option.
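Tesseract can also be driven from R. The sketch below is one possible approach, not necessarily the workflow used in the full post: it assumes the pdftools and tesseract R packages (and the English language data) are installed, and ocr_pdf is a hypothetical helper name invented for this example. It renders each page to a PNG image and then runs OCR on each image.

```r
# Assumes the pdftools and tesseract packages are installed.
# ocr_pdf() is a hypothetical helper written for this example.
ocr_pdf <- function(pdf_path, dpi = 300, language = "eng") {
  # Write one PNG per page into the session's temporary directory.
  n_pages <- pdftools::pdf_info(pdf_path)$pages
  image_files <- file.path(tempdir(), sprintf("page_%03d.png", seq_len(n_pages)))
  pdftools::pdf_convert(pdf_path, format = "png", filenames = image_files,
                        dpi = dpi, verbose = FALSE)
  # Remove the temporary images when the function exits.
  on.exit(unlink(image_files), add = TRUE)

  # Run OCR on each page image; the result is one string per page.
  engine <- tesseract::tesseract(language)
  vapply(image_files,
         function(f) tesseract::ocr(f, engine = engine),
         character(1), USE.NAMES = FALSE)
}

# Usage, with a placeholder file name:
# page_text <- ocr_pdf("scanned-report.pdf")
```

A higher dpi usually gives the OCR engine more to work with, at the cost of slower rendering, so 300 is only a starting point.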

Read more…

Escaping from character encoding hell in R on Windows

Note: the title of this post was inspired by this question on stackoverflow.

This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.

If you are on a deadline and just need to get the job done, this section should be all you need. Additional background and discussion are presented in later sections.

To read a text file with a non-ASCII encoding into R you should a) determine the encoding, b) read it in such a way that the information is re-encoded into UTF-8, and c) ignore the bug in the data.frame print method on Windows. Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess the encoding using the stri_read_raw and stri_enc_detect functions in the stringi package. You can ensure that the information is re-encoded to UTF-8 by using the readr package.
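The three steps can be sketched roughly as follows. This assumes the stringi and readr packages are installed; the file is a small Latin-1 CSV created on the fly purely for illustration, and names such as csv_path and guesses are made up for the example.

```r
# Assumes the stringi and readr packages are installed.
library(stringi)
library(readr)

# Write a small Latin-1 encoded CSV so the example is self-contained.
# The \u escapes keep this script itself pure ASCII.
csv_path <- tempfile(fileext = ".csv")
csv_text <- "name,city\nJos\u00e9,M\u00e1laga\nZo\u00eb,Z\u00fcrich\n"
writeBin(iconv(csv_text, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]],
         csv_path)

# (a) Guess the encoding from the raw bytes. The result is a ranked table of
# candidates, so treat it as a hint; short files are easy to misjudge.
raw_bytes <- stri_read_raw(csv_path)
guesses <- stri_enc_detect(raw_bytes)[[1]]
head(guesses, 3)

# (b) Tell readr the source encoding so that character columns are re-encoded
# to UTF-8. It is hard-coded here because we created the file ourselves.
dat <- read_csv(csv_path,
                locale = locale(encoding = "latin1"),
                col_types = cols(.default = col_character()))

# (c) Inspect the values directly instead of relying on how the data frame
# prints.
Encoding(dat$name)
dat$name
```

When the documentation that came with the data states the encoding, pass that to locale() and use the detection step only as a sanity check.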

Read more…