A list of resources for learning R in preparation for CS109 this Spring.
A wealth of R resources are available, and I'm sure I've missed some really good ones. If you have a favorite tutorial or resource that is not listed here, please email me or submit a bug report or pull request to http://github.com/izahn/blog.
Many organizations (including Harvard) offer R workshops. If you would like to attend an R workshop, here are some good places to start.
- List of R workshops, including those offered by Udemy, Coursera, and others.
- R workshops offered by RStudio.
- R workshops offered by the Institute for Quantitative Social Science (IQSS) at Harvard.
There are some great efforts to provide interactive self-paced R tutorials in your browser or in R itself.
- Interactive R tutorials with feedback, right in your web browser!
- Interactive R tutorials with feedback in R.
- Interactive R tutorials in your web browser. Includes a
There are several excellent tutorials not listed on r-project.org. Some of these are listed below.
- "Quick-R" aims to get you up and running in R quickly.
- Notes on "Using R for psychological research".
- "R for Data Science" by R luminary Hadley Wickham. Includes a
- Advanced R programming by Hadley Wickham.
- A comprehensive RMarkdown tutorial.
RStudio maintains a collection of high-quality cheat sheets at https://www.rstudio.com/resources/cheatsheets/ (these are also accessible from the Help -> cheat-sheets menu in the RStudio IDE). Additional resources are listed below.
- A numpy cheat sheet for R users, but it works just as well the other way around.
- An R cheat sheet for MATLAB users.
- Another R cheat sheet for MATLAB or Python users.
R package discovery
The Comprehensive R Archive Network (CRAN) is the main R package repository. The web interface is not very sophisticated, so I recommend using the resources listed below instead.
Blogs, forums and mailing lists
R related blogs are aggregated at http://r-bloggers.com.
Although the R mailing lists have been losing traffic to Stack Overflow, there are still plenty of people responding to questions. You can subscribe to the main R-help mailing list at https://stat.ethz.ch/mailman/listinfo/r-help.
Is it you or have I become old and cranky?
I've been using R and mostly enjoying it since 2006. Lately I've been having some misgivings about the direction R as a community is headed. Some of these misgivings no doubt stem from reluctance to learn new ways of doing things after investing so much time mastering the old ways, but underneath my old-man crankiness I believe there are real and important challenges facing the R community. R has grown considerably since I started using it a decade ago, and this has mostly been a good thing as new packages implement new and better functionality. Recently, popular contributed packages have been replacing core R functionality with new approaches, leading to a fragmentation of the user base and increasing cognitive load by requiring analysts to choose a package (or set of packages) before they even write the first line of code.
One common question I get as a data science consultant involves extracting content from PDF files.
Overview of available tools
pdftotext from the poppler project was my go-to answer for the easy case. This is still a good option, especially on Mac (using Homebrew) or Linux, where installation is easy. Windows users can install poppler binaries from http://blog.alivate.com.au/poppler-windows/ (make sure to add the bin directory to your PATH). More recently I've been using the excellent pdftools package in R to more easily extract and manipulate text stored in PDF files.
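As a rough illustration of the pdftools route, the sketch below builds a small one-page PDF with base R's pdf() device and reads its text back in. The file name and the sample text are made up for the example, and it assumes the pdftools package is already installed.

```r
# Sketch: extract text from a PDF that contains real text (not scanned images).
# Assumes the pdftools package is installed: install.packages("pdftools")
library(pdftools)

# Create a small one-page PDF to work with, using base R's pdf() device.
# The file name and its contents are hypothetical, for illustration only.
pdf_path <- file.path(tempdir(), "example.pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello from a PDF")
invisible(dev.off())

# pdf_text() returns a character vector with one element per page.
pages <- pdf_text(pdf_path)
length(pages)

# Split the first page into lines, trim whitespace, and drop empty lines
# so the text is easier to process further.
page_lines <- trimws(strsplit(pages[1], "\n", fixed = TRUE)[[1]])
page_lines <- page_lines[nzchar(page_lines)]
page_lines
```

With a real document you would skip the pdf() step and point pdf_path at your own file.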
In the more difficult case, where the PDF contains images rather than text, it is necessary to use optical character recognition (OCR) to recover the text. This can be achieved using point-and-click applications like freeOCR, Adobe Acrobat or ABBYY. ABBYY even has a convenient cloud OCR service that can be easily accessed from R using the abbyyR package. If you don't have a license for one of these expensive OCR solutions, or if you prefer something you can easily script from the command line, tesseract is a very good option.
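One way to script the OCR step without leaving R is sketched below. It assumes the tesseract R package (a binding to the tesseract engine, not the command-line program itself) and pdftools are both installed, along with the English language data. The sample PDF is generated on the fly and stands in for a scanned document; the file names are hypothetical.

```r
# Sketch: OCR a PDF by rendering each page to an image and running
# tesseract on the images.
# Assumes install.packages(c("pdftools", "tesseract")) has been run.
library(pdftools)
library(tesseract)

# Stand-in for a scanned PDF (hypothetical file, created for the example).
pdf_path <- file.path(tempdir(), "scan-example.pdf")
pdf(pdf_path)
plot.new()
text(0.5, 0.5, "Hello OCR", cex = 3)
invisible(dev.off())

# Render every page to a PNG. A higher dpi generally helps recognition.
n_pages <- pdf_info(pdf_path)$pages
png_files <- file.path(tempdir(), sprintf("scan-example-%d.png", seq_len(n_pages)))
pdf_convert(pdf_path, format = "png", filenames = png_files,
            dpi = 300, verbose = FALSE)

# Run OCR on each page image using the English engine.
eng <- tesseract("eng")
ocr_text <- vapply(png_files, function(f) ocr(f, engine = eng),
                   character(1), USE.NAMES = FALSE)
ocr_text
```

OCR output is rarely perfect, so expect to clean up the recognized text before using it.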
Note: the title of this post was inspired by this question on Stack Overflow.
This section gives the basic facts and recommendations for importing files with arbitrary encoding on Windows. The issues described here by and large do not apply on Mac or Linux; they are specific to running R on Windows.
If you are on a deadline and just need to get the job done, this section should be all you need. Additional background and discussion are presented in later sections.
To read a text file with non-ASCII encoding into R you should a) determine the encoding, b) read it in such a way that the information is re-encoded into UTF-8, and c) ignore the bug in the data.frame print method on Windows. Hopefully the encoding is specified in the documentation that accompanied your data. If not, you can guess the encoding using the stri_enc_detect function in the stringi package. You can ensure that the information is re-encoded to UTF-8 by using the readr package.
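The steps above can be sketched roughly as follows. The example writes a tiny Latin-1 encoded CSV file so that it is self-contained; the file name and its contents are made up, and the encoding passed to readr is the one we already know the file uses. Guesses from stri_enc_detect are only suggestions, especially for short files, so treat them as a starting point rather than an answer.

```r
# Sketch: guess a file's encoding, then read it so strings end up as UTF-8.
# Assumes install.packages(c("stringi", "readr")) has been run.
library(stringi)
library(readr)

# Write a small Latin-1 encoded CSV (hypothetical data for illustration).
csv_path <- file.path(tempdir(), "latin1-example.csv")
utf8_lines <- c("name,city", "Jos\u00e9,M\u00e1laga")
con <- file(csv_path, open = "wb")
writeLines(iconv(utf8_lines, from = "UTF-8", to = "latin1"), con, useBytes = TRUE)
close(con)

# a) Guess the encoding from the raw bytes. The result is a data frame of
#    candidate encodings with a confidence score for each.
raw_bytes <- readBin(csv_path, what = "raw", n = file.size(csv_path))
guesses <- stri_enc_detect(raw_bytes)[[1]]
head(guesses)

# b) Read the file, declaring its encoding so readr re-encodes to UTF-8.
dat <- read_csv(csv_path,
                locale = locale(encoding = "latin1"),
                col_types = cols(.default = col_character()))

# The strings are now marked as UTF-8 regardless of the platform.
Encoding(dat$name)
```

If the printed data.frame looks garbled on Windows, inspect a column directly (for example dat$name) before concluding that the import failed.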