Sparklyr: Using Spark with RMarkdown
October 27, 2016
R is well suited to data that fits in memory, but additional tools are needed once the data you want to analyze outgrows your machine's RAM. A variety of solutions to this problem have emerged for R over the years; one of the latest is Apache Spark™. Spark is a cluster computing tool that enables analysis of massive, distributed data across dozens or hundreds of servers.
RStudio recently announced a new open-source package called sparklyr that connects R to Spark through a full-fledged dplyr backend, with support for the entirety of Spark's MLlib. Because Spark can interact with distributed data with little latency, it is becoming an attractive tool for working with large datasets interactively. Beyond distributed data processing, Spark also incorporates a variety of other tools, including stream processing, graph computation, and a distributed machine learning framework. Some of these tools are available to R programmers via the sparklyr package.
In this talk, we'll discuss how to leverage Spark's capabilities in a modern R environment, in particular how to use Spark within an R Markdown document or even in an interactive Shiny application. We'll also briefly cover alternative approaches to working with large data in R and the pros and cons of using Spark.
Nathan has a background in analytic solutions and consulting. He has experience building data science teams, architecting analytic infrastructure, and delivering innovative data products. He is a longtime user of R.