by Joseph Rickert
In a recent post, I highlighted several new packages that arrived on CRAN in January that provided R users with access to data. In this post, I present additional selections for interesting January packages, organized into the categories Miscellaneous, Machine Learning, Statistics and Utilities.
rcss v1.2: Provides functions for solving control problems with linear state dynamics.
stormwindmodel v0.1.0: Provides functions to calculate wind speeds for hurricanes and tropical storms in the North American Atlantic basin. One vignette describes the package and another shows how to use it.
crisp v1.0.0: Implements the convex regression with interpretable partitions (CRISP) method of predicting an outcome variable on the basis of two covariates.
BayesS5 v1.22: Implements Bayesian Variable Selection Using Shotgun Stochastic Search with Screening (S5), useful in settings where p >> n. For details, see the paper.
classifierplots v1.3.2: Provides functions to generate a grid of binary classifier and diagnostic plots with a single function call. See the README for details.
eclust v0.1.0: Provides an algorithm for clustering high-dimensional data that can be affected by an environmental factor. See the paper for details.
EnsCat v1.1: Implements various clustering methods for categorical data. See the website for examples and the paper for the details.
MAVE v0.1.7: Implements the MAVE (Minimum Average Variance Estimation) method of dimension reduction. Look here for the math and here for examples.
mfe v0.1.0: Provides functions to extract meta-features from datasets to support the design of recommendation systems. The vignette provides examples.
rsparkling v0.1.0: extends sparklyr with an interface to the H2O Sparkling Water machine learning library. The README explains how to use the package.
confSAM v0.1: Contains a function that computes estimates and confidence bounds for the false discovery proportion in a multiple testing environment. The vignette describes the theory and provides examples.
pdSpecEst v1.0.0: Implements a non-parametric, geometric wavelet method to estimate the autocovariance matrix of a time series that preserves positive-definiteness of the estimator. This preserves the interpretability of the estimate as a covariance matrix and helps with computational issues. The paper describes the theory and the vignette provides an example.
tsdecomp v0.2: Implements ARIMA model-based decompositions for quarterly and monthly time series data. The vignette describes the math.
TSeriesMMA v0.1.1: Provides a function to calculate the Hurst surface for a time series. Multiscale multifractal analysis (MMA) is described in a paper by Gieraltowski et al.
awsjavasdk v0.2.0: Provides a boilerplate of classes used to access the Amazon Web Services Java Software Development Kit via package rJava. The vignette shows how to use the package.
colr v0.1.900: Provides functions that use Perl regular expressions to select and rename columns in dataframes, lists and numeric types. The vignette contains examples.
flifo v0.1.4: Provides functions to create and manipulate FIFO (First In First Out), LIFO (Last In First Out), and NINO (Not In or Never Out) stacks in R. See the vignette for examples.
fst v0.7.2: Provides functions to read and write data frames at high speed, and compress data with type-optimized algorithms that allow random access of stored data frames.
manipulateWidget v0.5.1: Provides helper functions to add controls like sliders, pickers, checkboxes, etc. to interactive charts created with package htmlwidgets. The animated vignette will get you started.
msgtools v0.2.4: Provides utilities for error, warning, and other messages in R packages, including consistency checks across messages, spell-checking, and message translations for localization. See the vignette for examples.
padr v0.2.0: Provides functions to transform datetime data into a format ready for analysis, including aggregating data to a higher level interval (thicken) and imputing records where observations were absent (pad). There is an Introduction.
pbdRPC v0.1-1: Implements light, yet secure remote procedure calls with a unified interface via ssh (OpenSSH) or plink/plink.exe (PuTTY). The vignette provides examples.
reprex v0.1.1: Provides a way to send code snippets with rendered output to sites like stackoverflow and github. The README shows examples.
restfulr v0.0.8: Models a RESTful service as if it were a list.
sys v1.1: A replacement for base system2 with consistent behavior across platforms. Supports interruption, background tasks, and full control over STDOUT / STDERR binary or text streams. README provides some details.
textclean v0.3.0: Provides tools to clean and process text, such as replacing or removing substrings that are not optimal for analysis. The README shows how to use them.
tidyxl v0.2.1: Imports non-tabular data from Excel into R. The vignette shows how.
unpivotr v0.1.0: Provides tools for converting data from complex or irregular layouts into a columnar structure. There is one vignette showing how to unpivot pivot tables from a spreadsheet, and another that shows how to work with multiple, similar tables.
WVPlots v0.2.2: Provides examples of ggplot plots that can be generated from a standard calling interface. Here is the explanation of the concept, and here are some nice examples.
by Garrett Grolemund
Would you like to teach people to use R? If so, I would like to jump-start your efforts.
I’m one half of RStudio’s education team, and I’ve taught thousands of people to use R, usually in face-to-face workshops. Over time, I’ve come to appreciate that teaching R in a short workshop is an unusual challenge that requires an unusual approach: you cannot teach a short workshop as if it were a college course, and you should not teach R as if it were Python, UNIX or C.
In the next few blog posts, I’ll share what I’ve learned about teaching R workshops. These ideas have made my life easier, and, more importantly, they have made my students happier (based on student feedback). I think they can do the same for you.
We’ll begin in this post by identifying common mistakes that ensnare new R teachers. Each of these mistakes seems like a good idea at first glance, but leads to an unsuccessful short workshop, and I’ll tell you why. To make things simple, I’ve recast each mistake as a principle to follow. Let’s examine them one by one:
DO NOT teach R as if it were a programming language. Why not? Because R is a programming language for doing data science. This provides a unique opportunity to motivate your students. You can be confident that your students want to use R to make graphs, fit models, and impress their colleagues. Show them how to do these empowering things and then teach programming later, as a way to do these things even better. To be honest, if your students only wanted to learn how to program, they would be studying another language.
DO NOT assume that methods that work well in a college classroom will work well in a one-, two-, or half-day workshop. Active learning, peer-led instruction, group projects, and flipped classrooms have improved college so much that I wish I could go back and do college over again. But these techniques take time to convey information, more time than you have in a workshop. They also work best with motivated students who are accustomed to learning. Do you have those? If your workshops are like mine, you have busy individuals who have set aside precious time and money to take your workshop. To be frank, they want to acquire more information than you can provide in a day of active learning or peer-led instruction. I’m not saying that you shouldn’t use these techniques (please try!), but expect to modify them heavily.
DO NOT avoid lectures. Some teachers will do somersaults to avoid lectures because lectures are too passive for students and too easy to do poorly. Some extremists even extend this notion to slides, claiming that good teachers should not use slides. If you adopt this mindset, you will fail at the one thing that you need to do well: convey large amounts of information in a short period of time. Not only should you embrace lecturing, i.e., presenting information, you should become an expert at it. Learn to present effectively, and learn to intermix presentations with activities that keep your workshop engaging.
DO NOT assume that you can teach someone else’s workshop out of the box, or even your own. A workshop is not like a video, which you make once and then replay when needed. A workshop is more like a play that must be cast, costumed, and rehearsed each time you present it in a new venue. If you think that you can reproduce a workshop quickly because it “already exists,” you are setting yourself up for failure. If you let your manager think this, you are setting yourself up for stress!
DO NOT let your workshop become a consulting clinic for installation bugs. Workshops make first impressions just like people do. You want to use the first minutes of your workshop to set an energetic tone, to engage your students, and to inspire them — not to hop from student to student debugging installation problems. Do what it takes to avoid this situation. My favorite solution is to provide a classroom RStudio Server for students to use.
But what if you feel that students deserve to leave your workshop with the software successfully installed on their computer? Then you are in good company! My mentor, Hadley Wickham, argues for this persuasively and enthusiastically. But make it happen in a way that does not torpedo your workshop. Hold a real clinic. Pass out instructions in advance and demand that any problems be reported ahead of time. Make successful installation a prerequisite for registering. Be sure that your students know that if they do not have permission to install the necessary software on their work laptop, they should bring a different laptop. Be creative and cover the bases.
Whatever you do, remember that the hour immediately before class is less than ideal for installing software. You have other tasks to attend to, and inevitably some students will come late and bring bugs.
I’ll have more to say about each of these topics in the posts that follow. In those posts, I’ll try to lay out a fun, inspiring vision for how an R workshop works; no more “thou shall nots.” See you there!
by Joseph Rickert
As forecast, the number of R packages hosted on CRAN exceeded 10,000 in January. Dirk Eddelbuettel, who tracks what’s happening on CRAN with his CRANberries site, called hurricaneexposure the 10,000th package in a tweet on January 27th.
hurricaneexposure was one of two hundred and six new packages that arrived on CRAN in January. Approximately 10% of these packages have to do with providing access to data by some means or another. Some packages contain the data sets, some provide wrappers to APIs, and at least one package provides code to scrape data from a site. The following 17 packages are picks for data-related packages for January 2017. I will select packages in other categories in a follow-up post.
elevatr v0.1.1: Provides access to several databases that provide elevation data, including Mapzen Elevation Service, Mapzen Terrain Service, Amazon Terrain Tiles, and the USGS Elevation Point Query Service. There is a vignette.
epidata v0.1.0: Provides tools to retrieve data from the Economic Policy Institute. The README shows how to use the package.
europop v0.3: Contains a data set giving the populations of all European cities with at least 10,000 inhabitants during the period 1500-1800.
fivethirtyeight v0.1.0: Provides the data, code, and interactive visualizations behind FiveThirtyEight Stories. There is a vignette that provides an example of a data analysis, and a list of data sets that are included.
getCRUCLdata v1.1: Provides functions that automate downloading and importing climatology data from the University of East Anglia Climate Research Unit (CRU). There is a vignette to get you started.
hurricaneexposure v0.0.1: Allows users to create time series of tropical storm exposure histories for chosen counties for a number of hazard metrics (wind, rain, distance from the storm, etc.). The vignette provides an overview.
metScanR v0.0.1: Provides functions for mapping and gathering meteorological data from various US surface networks: COOP, USCRN, USRCRN, AL-USRCRN, ASOS, AWOS, SNOTEL, SNOTELLITE, SCAN, SNOW, and NEON.
mglR v0.1.0: Provides tools to download and organize large-scale, publicly available genomic studies on a candidate gene scale. The vignette shows how to use the package.
nzpullover v0.0.2: Contains data sets of driving offences and fines in New Zealand between 2009 and 2016, originally published by the New Zealand Police.
owmr v0.7.2: Provides a wrapper for the OpenWeatherMap API.
PeriodicTable v0.1.1: Contains a data set of the properties of chemical elements.
pwt9 v9.0-0: Contains the Penn World Table 9, which provides information on relative levels of income, output, inputs, and productivity for 182 countries between 1950 and 2014.
rdwd v0.7.0: Provides functions to obtain climate data from the German Weather Service, Deutscher Wetterdienst, (DWD). There is a vignette on Weather Stations and another showing how to use the package.
rwars v1.0.0: Provides functions to retrieve and reformat data from the ‘Star Wars’ API SWAPI. The vignette shows how to use the package.
wikidataQueryServiceR v0.1.0: Provides an API Client Library for Wikidata Query Service, which provides a way for tools to query Wikidata via SPARQL. See the README for how to use it.
wikilake v0.1: Provides functions to scrape metadata about lakes from Wikipedia. The vignette fetches data from Michigan lakes.
worrms v0.1.0: Provides a client for the World Register of Marine Species. The vignette shows how to use the package.
by Joseph Rickert
I have always been attracted to the capricious. So, it was no surprise that I fell for the Cauchy distribution at first sight. I had never seen such unpredictability! You might say that every distribution has its moments of unpredictability, but the great charm of Cauchy is that it has no moments. (No finite moments, anyway.)
Before discussing why momentlessness (not being in the moment) leads to unpredictability, let’s derive the Cauchy distribution. A common conceit for doing this is to consider a blindfolded archer trying to hit a target directly in front of him. He randomly shoots towards the wall at an angle θ that can sometimes be so large he shoots parallel to the wall!
Where on the wall is any given arrow likely to land? The following diagram maps out the situation.
The archer is standing at the point (0, 0). The point on the wall directly in front of him is (x, 0), and the arrow will land at (x, y), (x, -y), or not at all. After changing to polar coordinates, a moment’s reflection will give you the equation y = x tan(θ).
Assuming that θ is uniformly distributed on the interval I = (-π/2, π/2), a direct substitution into the equation for the CDF of the uniform distribution yields the CDF for the Cauchy distribution: F(y) = 1/2 + (1/π) arctan(y/x).
Differentiating this gives the Cauchy density function: f(y) = x / (π (x² + y²)).
This looks tame, but a short argument showing that the necessary integrals do not converge demonstrates that neither the mean nor the variance exist. Hence, neither the Law of Large Numbers, nor the Central Limit Theorem apply. Taking lots of samples and computing averages doesn’t buy you anything. The averages just don’t settle down. This behavior is apparent in the following simulation that computes means of Cauchy samples for sample sizes of one to five thousand. The plots that show the same data at different scales dramatize the erratic behavior.
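A minimal sketch of such a simulation (the seed and the choice of 5,000 samples are illustrative, not taken from the original plots):

```r
set.seed(123)

# Draw 5,000 standard Cauchy variates and compute the running mean:
# the mean of the first n observations for n = 1, ..., 5000.
y <- rcauchy(5000)
running_mean <- cumsum(y) / seq_along(y)

# The running mean never settles down; occasional extreme draws
# keep yanking it away from any would-be limit.
plot(running_mean, type = "l",
     xlab = "n", ylab = "mean of first n samples")
```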
Not only do the sample means fail to converge, it is not that difficult to show that the sample mean Ȳ = (Y₁ + ⋯ + Yₙ)/n of n independent Cauchy random variables has the same distribution as a single Cauchy random variable! The proof is straightforward. Let φ(t) = e^(-|t|) be the characteristic function of a standard Cauchy random variable Y. Then φ(t) for the sum Y₁ + ⋯ + Yₙ is [φ(t)]ⁿ = e^(-n|t|) = φ(nt), which is the characteristic function of nY. So the sum is distributed as nY, and dividing by n shows that the sample mean is distributed exactly as Y itself.
The extreme values that dominate the Cauchy distribution make it the prototypical heavy-tailed distribution. Informally, a distribution is often described as having heavy or “fat” tails if the probability of events in the tails of the distribution is greater than what would be given by a Normal distribution. While there seems to be more than one formal definition of a heavy-tailed distribution (https://en.wikipedia.org/wiki/Heavy-tailed_distribution), the following diagram, which compares the right tails of the Normal, Exponential and Cauchy distributions, gets the general idea across.
As exotic as the Cauchy distribution may seem, it is not all that difficult to come face-to-face with it in everyday modeling work. A Student t distribution with one degree of freedom is Cauchy, as is the ratio of two independent standard normal random variables.
Additionally, the Cauchy distribution, also called the Breit-Wigner or Lorentz distribution, has applications in particle physics, spectroscopy, finance, and medicine. In his 2006 JSS paper, George Marsaglia elaborates on early work he did on transforming the ratio of two jointly Normal random variables into something tractable. The original problem arose from an attempt to estimate the intercept in a linear model giving the life span of red blood cells.
The real fun, and maybe the real world, seems to happen when things are not normal.
Introduction to Probability with R by Kenneth Baclawski, based on a course he developed with Gian-Carlo Rota, is a delightful introduction to probability theory. The idea of plotting the sample means above comes from Section 5.7 of this book.
For some good reading on heavy-tailed distributions, have a look at the extended presentation on the Fundamentals of Heavy Tails by Nair et al.; Chapter 2 of The Statistical Analysis of Financial Data in R; and Chapter 2 of An Introduction to Heavy-Tailed and Subexponential Distributions by Foss et al.
For a sophisticated but accessible look at general Stable Distributions, have a look at this recent paper on Stable Distributions, by John Nolan.
by Merav Yuravlivker, CEO of Data Society
“I’m not a coder” or “I was never good at math” is a frequent refrain I hear when I ask professionals about their data analysis skills. Through popular culture and stereotypes, most people who don’t have a background in programming automatically underestimate their ability to create amazing things with code. However, Data Society has proven that this is a false narrative through our training program – with students in over 20 countries and many government and enterprise clients, we’ve seen so-called “non-coders” proficiently put together automated data cleaning code scripts and analyses within a few weeks. So how do we do it? Well, we’ve singled out three key steps to get someone started on their journey to an amazing skill set and more powerful data analytics:
See? This doesn’t look so complicated.
This looks like another program I’ve used before…
And this is not just limited to viewing data. There is a lot of syntax from Excel that is easily transferable to R. For example, using if-else statements in Excel looks like this:
And here is what it looks like in R:
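For instance, where an Excel user might write =IF(A2 >= 50, "Pass", "Fail") and drag the formula down a column, R’s vectorized ifelse() applies the same rule to a whole column in one call (the scores below are made up for illustration):

```r
scores <- c(72, 45, 88, 50)

# One call evaluates the condition for every element of the vector,
# no dragging required.
ifelse(scores >= 50, "Pass", "Fail")
# "Pass" "Fail" "Pass" "Pass"
```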
The learning curve for R is a fast one, especially for Excel users. Highlighting those similarities puts new users at ease and gives them a way to connect R functionality to the functionality they already use in Excel.
Showing students how to eliminate duplicates from data sets with one or two lines of code, or how to quickly reshape data into a different format, is a wow moment because of the time it can save regular Excel users. The applications are immediately apparent for those who have struggled to go through thousands of rows manually or to upload a data set with millions of records.
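The deduplication one-liner can be sketched with base R’s duplicated() (the data frame here is a made-up example):

```r
df <- data.frame(id    = c(1, 1, 2, 3, 3),
                 value = c("a", "a", "b", "c", "c"))

# Keep only the first occurrence of each fully duplicated row.
deduped <- df[!duplicated(df), ]
nrow(deduped)  # 3
```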
R is gaining in popularity, with millions of users worldwide and growing. Not only that, but we’ve seen an increase in demand for data analysis skills across all job sectors. Adding R programming and data analysis to your resume can add $10,000 – $15,000 to your salary. With that type of incentive, both in pay and in time saved, there’s no better time to take the “not” out of “I’m not a coder”.
The Data Society is a data science training platform for professionals. Among other government and corporate clients, Data Society has trained staff at the Department of Commerce and the U.S. Army through their enterprise firm, Data Society Solutions, which provides customized corporate data science training and consulting services. If you’d like to learn more, please email solutions@datasociety.co.
by Jonathan Regenstein
Today, we are going to tackle a project that has long been on my wish list: a Shiny app to take a fund or portfolio, analyze its exposure to different countries, and display those exposures on a world map. Now you know how exciting my wishlists are.
Before describing our data importing/wrangling work here in the Notebook, it might be helpful to look at where we’re headed. The final Shiny app is here. This is similar to a previous project because we are building a leaflet map, shading it according to data added to the spatial dataframe, and including another HTML widget that is responsive to the map. However, our current project differs in important ways and has a completely different use.
The previous project allowed a user to click a country on the map and view the time series of returns. Our current project will allow the user to choose an ETF and see how that ETF is invested in different countries by how a world map is shaded.
From a substantive perspective, this app helps visualize country risks instead of returns – indeed, it’s the first in our series that does not import stock returns in any way. From an R perspective, in our current project the map is the responsive object according to user inputs, whereas before, the dygraph was the responsive object according to user clicks on a map. The two projects are related and both require spatial dataframes, but they are very different.
If you looked closely at the Shiny app, you noticed that we do have a data object that is responsive to a map click: we display a datatable of companies held by the ETF in whatever country is clicked. That is, if a user chooses an ETF and sees by the shading that the ETF is allocated X% to China, the user can click on the map to see which company the ETF owns in China. That functionality is similar to the dygraphs functionality, except of course, we have to wire up a datatable and do some filtering by country instead of passing an xts object to dygraphs. The fulcrum will still be the clicked map shape.
Alright, that app is what we’re ultimately building but, by way of what we’ll do in this Notebook, here’s the roadmap.
First, we are going to grab the data for one fund, the MSCI Emerging Markets ETF. Note that we are not going to get return data over time. Instead, we just want a snapshot of the ETF holdings: its constituents, their weights, and their home countries. Our eventual app will include several ETFs, but we are going to work with one ETF in this Notebook, with the foreknowledge that we want to reuse our steps when it’s time to build the Shiny app. In short, let’s get it right for this Emerging Markets ETF, and then we can iterate over other ETFs when we move to building our Shiny app.
After we download the snapshot of the emerging markets fund, we will do some data wrangling and some country weight aggregation, and then merge that data to our spatial dataframe. Adding that data will depend on the ETF using the same country naming convention as our spatial dataframe, so we’ll pay attention to that in the wrangling process.
Once we add the data to our spatial dataframe, we will recycle some old code, build a leaflet map, and shade it according to the ETF’s country exposure. This is just a test to see how things will look in the Shiny app, and we can even play around with different color palettes to get things just right.
Once we have the map aesthetics sorted, we’ll turn to Part Two: displaying the details of each country holding. Really, this is just filtering our dataframe by country name – whatever country the user clicks – but we’ll go ahead and make sure things look how we want in this Notebook, and then pass that object to our app eventually.
Let’s get to it!
First, let’s grab the fund data from MSCI’s homepage. We will use the read_csv() function from the readr package. We will title the object emergingmarkets_fund, since we’ll be pulling in other funds later.
Note that we have to skip the first 11 rows, which is why the ‘skip = 11’ argument is included. That’s because this csv file is loaded with oddly formatted data in the first 11 rows. If we don’t skip those 11 rows, this import will be totally unhelpful. The ‘import dataset’ button in the IDE saved me minutes/hours of frustration here!
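The import call looks something like the following (the file path is a placeholder, not the actual MSCI download URL):

```r
library(readr)

# The first 11 rows of the csv hold oddly formatted header material,
# so skip them and let read_csv find the real column names on row 12.
emergingmarkets_fund <- read_csv("eem_holdings.csv", skip = 11)
```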
Now, we have our fund data and the wrangling begins. We are actually going to use this initial object to create two other objects: one will be merged with the spatial dataframe, and one will be a standalone object to be loaded in our Shiny app.
Those country weights are striking! China + Korea + Taiwan comprise 51% of this fund. The fund is concentrated in economies that are probably closely linked. Perhaps that’s by design? Perhaps the inter-economy correlation isn’t as high as I believe? A cross-border investment or trade Shiny app would be helpful here.
It’s worth a second to consider the definition of ‘emerging market’, a term that has become ubiquitous and has a know-it-when-we-see-it feel (if you’re not into political economy, feel free to skip this paragraph). The phrase was coined in 1981 by the World Bank’s Antoine Van Agtmael to help encourage investment in developing nations, as he felt that the phrase ‘Third World’ country was both distasteful and stifling to investors. Learn more here. Today, the phrase connotes an economy that is growing and transitioning from developing to developed, though some commentators include a political transition as well. Since we are working with an MSCI fund, we should consider their definition. It wasn’t easy to track down, but according to the Financial Times, MSCI takes into account the number of listed companies of a certain size (an economic measure) and openness to foreign capital (a political measure).
Back to our task at hand: we have downloaded the fund data and gotten it into shape to be added to our shapefile. That process is the exact same as in our previous post, so before we do that, let’s use the original fund data to create one other object to store country-level detail on companies, weights and sectors. If that seems a bit confusing, head back to the Shiny app and click on a country. The datatable displays company names and details, and we need to create a dataframe to extract and hold that data.
This is what a user of our Shiny app will see upon clicking on Brazil; it is the country-level detail of how the fund is invested in Brazil. We will save that ‘EEM’ object in the .RDat file so it can be loaded into our Shiny app.
Okay, let’s go ahead and build that map of the world and add our fund country weights to it. This process is identical to how we did it here, but we’ll go through the steps again.
First, let’s download the spatial dataframe. We will also use the ms_simplify()
function from rmapshaper to reduce the size of the dataframe. This function will reduce the number of longitude and latitude coordinates used to build each country. It will make loading faster in our Shiny app, but won’t affect any of our logic.
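As a rough sketch, assuming the spatial dataframe has already been downloaded into an object called world_spdf (the keep ratio is an arbitrary choice):

```r
library(rmapshaper)

# Retain roughly 5% of the original coordinate points per polygon;
# country shapes stay recognizable but the object is far smaller,
# so the Shiny app loads faster.
world_simplified <- ms_simplify(world_spdf, keep = 0.05)
```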
Now, we will use the merge()
function from the sp package to add our country weight data. Remember above where we made sure to use a consistent country naming convention when wrangling the ETF data? This is where it will come in handy – we use the ‘name’ column to perform the merge. After the merging, ETF exposures will be added for each country that has a match in the ‘name’ column. For those with no match, the EEM column will be filled with NA.
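A sketch of that merge, assuming the wrangled country weights live in a dataframe called country_weights with a ‘name’ column matching the shapefile’s:

```r
library(sp)

# Left-join the ETF weight data onto the spatial polygons by country
# name. Countries with no match in country_weights get NA in the
# new EEM column.
world_spdf <- merge(world_spdf, country_weights, by = "name")
```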
We have our data added to the shapefile. Let’s go ahead and construct a map. First we’ll build a popup to show some detail, then we will create a green palette and a purple palette for no other reason than to see which is more visually appealing.
Let’s invoke leaflet! As before, we will use layerId = ~name
. This is, again, massively important because when we create a Shiny app, we want to pass country names to our datatable and filter accordingly. The layerId
is how we’ll do that: when a user clicks on a country, we capture the layerId
, which is a country name that can be used for filtering.
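Putting those pieces together, the map call is roughly as follows (the palette and popup text are illustrative choices):

```r
library(leaflet)

# Shade each country by its ETF weight. layerId = ~name is what lets
# the eventual Shiny app recover the clicked country's name.
greenPal <- colorNumeric("Greens", domain = world_spdf$EEM)

leaflet(world_spdf) %>%
  addPolygons(
    fillColor   = ~greenPal(EEM),
    fillOpacity = 0.7,
    weight      = 1,
    popup       = ~paste0(name, ": ", EEM, "%"),
    layerId     = ~name
  )
```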
Both those maps look good to me, but purple might be the way to go ultimately. That’s a decision for next time – see you then!
by Jonathan Regenstein
In a previous post, we built an R Notebook that pulled in data on sector ETFs and allowed us to calculate the rolling correlation between a sector ETF and the S&P 500 ETF, whose ticker is SPY. Today, we’ll wrap that into a Shiny app that allows the user to choose a sector, a returns time period such as ‘daily’ or ‘weekly’, and a rolling window. For example, if a user wants to explore the 60-day rolling correlation between the S&P 500 and an energy ETF, our app will show that. As is customary, we will use the flexdashboard format and reuse as much as possible from our Notebook.
The final app is here, with the code available in the upper right-hand corner. Let’s step through this script.
The first code chunk is where we do the heavy lifting in this app. We will build a function that takes as parameters an ETF ticker, a returns period, and a window of time, and then calculates the desired rolling correlation between that ETF ticker and SPY.
That function uses getSymbols()
to pull in prices and periodReturn()
to convert to log returns, either daily, weekly or monthly. Then we merge into one xts object and calculate rolling correlations, depending on the window parameter. It should look familiar from the Notebook, but honestly, the transition from the previous Notebook to this code chunk wasn’t as smooth as would be ideal. I broke this into two functions in the Notebook, but thought it flowed more smoothly as one function in the app since I don’t need the intermediate results stored in a persistent way. Combining the two functions wasn’t difficult, but it did break the reproducible chain in a way that I don’t love. In the real world, I would (and, in my IDE, I did) refactor the Notebook to line up with the app better. Enough self-shaming, back to it.
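In outline, the combined function looks something like this (argument names and defaults are sketched, not copied from the app):

```r
library(quantmod)

sector_correlations <- function(ticker, period = "weekly", window = 10) {
  # Pull prices for the chosen sector ETF and for SPY.
  etf <- getSymbols(ticker, src = "yahoo", auto.assign = FALSE)
  spy <- getSymbols("SPY",  src = "yahoo", auto.assign = FALSE)

  # Convert to log returns at the requested periodicity.
  etf_returns <- periodReturn(etf, period = period, type = "log")
  spy_returns <- periodReturn(spy, period = period, type = "log")

  # Merge into one xts object and compute the rolling correlation
  # over the chosen window.
  merged <- merge(etf_returns, spy_returns)
  rollapply(merged, window,
            function(x) cor(x[, 1], x[, 2]),
            by.column = FALSE)
}
```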
Next, we need to create a sidebar where our users can select a sector, a returns period and a rolling window. Nothing fancy here, but one thing to note is how we use selectInput
to translate from the sector to the ETF ticker symbol. This means our users don’t have to remember those three-letter codes; they just choose the name of the desired sector from a drop-down menu.
Have a close look at the last three lines of code in that chunk. These are a new addition that lets the user determine whether the mean, max, and/or min rolling correlation should be included in the dygraph. We haven’t built any way of calculating those values yet, but we will shortly. This is the UI component.
Those three lines of code create checkboxes and are set to default as FALSE, meaning they won’t be plotted unless the user chooses to do so. I wanted to force the user to actively click a control to include these, but that’s a purely stylistic choice. Perhaps you don’t want to give them a choice at all here?
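The sidebar inputs described above might be sketched like so (the sector list is abbreviated and the input IDs are illustrative):

```r
# The named vector maps the label the user sees to the ticker the
# app needs, so nobody has to remember three-letter codes.
selectInput("sector", "Sector",
            choices = c("Energy"     = "XLE",
                        "Financials" = "XLF",
                        "Utilities"  = "XLU"))

selectInput("period", "Returns period",
            choices = c("daily", "weekly", "monthly"))

numericInput("window", "Rolling window", value = 10, min = 2)

# Unchecked by default: the user opts in to each overlay line.
checkboxInput("mean", "Show mean rolling correlation", value = FALSE)
checkboxInput("max",  "Show max rolling correlation",  value = FALSE)
checkboxInput("min",  "Show min rolling correlation",  value = FALSE)
```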
Next, we create our reactive values that will form the substance of this app. First, we need to calculate and store an object of rolling correlations, and we’ll use a reactive that passes user inputs to our sector_correlations
function.
Then, we build reactive objects to store mean, minimum and maximum rolling correlations. These values will help contextualize our final dygraph.
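Those reactives might look like the following, reusing the sector_correlations function; the reactive names mini and maxi are hypothetical stand-ins:

```r
# Recompute the rolling correlations whenever any input changes.
sector_corr <- reactive({
  sector_correlations(input$sector, input$period, input$window)
})

# Summary statistics derived from the same reactive object; these
# will feed both the dygraph overlays and the value boxes.
avg  <- reactive({ round(mean(sector_corr(), na.rm = TRUE), 2) })
mini <- reactive({ round(min(sector_corr(),  na.rm = TRUE), 2) })
maxi <- reactive({ round(max(sector_corr(),  na.rm = TRUE), 2) })
```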
At this point, we have done some good work: built a function to calculate rolling correlations based on user input, built a sidebar to take that user input, and coded reactives to hold the values and some helpful statistics. The hard work is done, and really we did most of the hard work in the Notebook, where we toiled over the logic of arriving at this point. All that’s left now is to display this work in a compelling way. Dygraphs plus value boxes have worked in the past; let’s stick with them!
That dygraph code should look familiar from the Notebook and previous posts, except we have added a little interactive feature. By including if(input$mean == TRUE) {avg()}, we allow the user to change the graph by checking or unchecking the ‘mean’ input box in the sidebar. We are going to display this same information numerically in a value box, but the lines make this graph a bit more compelling.
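One way to wire that up, written with explicit if blocks rather than the inline form quoted above (object names and colors are assumptions):

```r
## A dygraph whose horizontal reference lines appear only when the
## corresponding checkbox is ticked. sector_corr(), avg(), mini(), and
## maxi() are the reactives sketched earlier; names are illustrative.
library(shiny)
library(dygraphs)

output$corr_dygraph <- renderDygraph({
  g <- dygraph(sector_corr(), main = "Rolling correlation with the S&P 500")
  ## each checkbox adds a horizontal line only when checked
  if (input$mean) g <- dyLimit(g, avg(),  color = "purple")
  if (input$max)  g <- dyLimit(g, maxi(), color = "blue")
  if (input$min)  g <- dyLimit(g, mini(), color = "red")
  g
})
```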
Speaking of those value boxes, they rely on the reactives we built above, but, unlike the graph lines, they are always going to be displayed. The user doesn’t have a choice here.
Again, this just adds a bit of context to the graph. Note that the lines and the value boxes take their value from the same reactives. If we were to change those reactives, both UI components would be affected.
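For example, one of the value boxes might look like this (flexdashboard syntax; the caption and color are assumptions):

```r
## An always-displayed value box that reuses the avg() reactive feeding
## the dygraph line; names are illustrative.
library(flexdashboard)

renderValueBox({
  valueBox(value = avg(),
           caption = "Mean rolling correlation",
           color = "primary")
})
```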
Our job is done! This is a simple but powerful app: the user can choose to see the 60-day rolling correlations between the S&P 500 and an energy ETF, or the 10-month rolling correlations between the S&P 500 and a utility ETF, etc. I played around with this a little bit and was surprised that the 10-week rolling correlation between the S&P 500 and health care stocks plunged in April of 2016. Someone smarter than I am can probably explain, or at least hypothesize about, why that happened.
A closing thought about how this app might have been different: we are severely limiting what the user can do here, and intentionally so. The user can choose only from the sector ETFs that we offer in the selectInput dropdown. This is a sector correlations app, so I included only a few sector ETFs. But we could just as easily have made this a textInput and allowed users to enter whatever ticker symbol struck their fancy. In that case, this would no longer be a sector correlations app; it would be a general stock correlations app. We could go even further and make this a general asset correlations app, in which case we would allow the user to select things like commodity, currency and housing returns and see how they correlate with stock market returns. Think about how that might change our data import logic and time series alignment.
Thanks for reading, enjoy the app, happy coding, and see you next time!
by Max Kuhn
The formula interface to symbolically specify blocks of data is ubiquitous in R. It is commonly used to generate design matrices for modeling functions (e.g. lm). In traditional linear model statistics, the design matrix is the two-dimensional representation of the predictor set, where instances of data are in rows and variable attributes are in columns (a.k.a. the X matrix).
A simple motivating example uses the inescapable iris data in a linear regression model:
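The code chunk referred to here is not reproduced in this copy; a plausible reconstruction, consistent with the subset and log transformation discussed later in the post, is:

```r
## Reconstructed example (the exact formula is an assumption, but the
## subset and log() term match the model frame described below).
mod1 <- lm(Sepal.Width ~ Petal.Width + log(Petal.Length) + Species,
           data = iris,
           subset = Petal.Length > 2)
```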
While the purpose of this code chunk is to fit a linear regression model, the formula is used both to specify the symbolic model and to generate the intended design matrix. Note that the formula method defines the columns to be included in the design matrix as well as which rows should be retained.
Formulas are used in R beyond specifying statistical models, and their use has been growing over time (see this or this).
In this post, I’ll walk through the mechanics of how some modeling functions use formulas to make a design matrix, using lm to illustrate the details. Note, however, that the syntactical minutiae are likely to differ from function to function, even within base R.
lm initially uses the formula and the appropriate environment to translate the relationships between variables into a data frame containing the data. R has a fairly standard set of operators that can be used to create a matrix of predictors for models.
We will start by looking at some of the internals of lm (circa December 2016).
The main tools used to get the design matrix are the model.frame and model.matrix functions. The definition and first few lines of lm are:
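Abridged from the base R source (circa December 2016), the relevant lines are roughly the following; only the argument handling discussed below is shown, and defining this in your session would shadow stats::lm, so treat it as a reading aid:

```r
## Abridged first lines of stats::lm; the fitting code that follows the
## model.frame call is omitted.
lm <- function(formula, data, subset, weights, na.action,
               method = "qr", model = TRUE, x = FALSE, y = FALSE,
               qr = TRUE, singular.ok = TRUE, contrasts = NULL,
               offset, ...) {
  cl <- match.call()
  mf <- match.call(expand.dots = FALSE)
  ## keep only the arguments that model.frame also understands
  m <- match(c("formula", "data", "subset", "weights",
               "na.action", "offset"), names(mf), 0L)
  mf <- mf[c(1L, m)]
  mf$drop.unused.levels <- TRUE
  ## swap the function being called: lm(...) becomes stats::model.frame(...)
  mf[[1L]] <- quote(stats::model.frame)
  mf <- eval(mf, parent.frame())
  ## ... (model.matrix and lm.fit code follows in the real source)
}
```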
The goal of this code is to manipulate the formula and the other arguments into an acceptable set of arguments for the model.frame function in the stats package. The call to lm is modified to serve as the substrate for a call to model.frame, which has many similar arguments (e.g. formula, data, subset, and na.action). However, there are arguments that are not common to both functions.
The object mf is initially created to mirror the original call. After executing match.call(expand.dots = FALSE), mf holds the matched version of our original call, and class(mf) has a value of "call". Note that the first element of the call, mf[[1L]], has a value of lm with the class of name.
The next few lines remove any arguments to lm that are not arguments to model.frame, and add another (drop.unused.levels). Finally, the call is modified by replacing its first element (lm) with stats::model.frame, so that mf now holds a call to that function instead.
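For illustration, with a call such as lm(Sepal.Width ~ Petal.Width + log(Petal.Length) + Species, data = iris, subset = Petal.Length > 2) (my assumed example, not necessarily the post’s exact model), the rewritten call that ends up being evaluated is:

```r
## The call lm() constructs and evaluates; the formula and subset here are
## illustrative assumptions.
mf <- stats::model.frame(formula = Sepal.Width ~ Petal.Width +
                           log(Petal.Length) + Species,
                         data = iris,
                         subset = Petal.Length > 2,
                         drop.unused.levels = TRUE)
```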
When this code is executed using eval(expr = mf, envir = parent.frame()), the model.frame function returns:
A data.frame containing the variables used in formula plus those specified in .... It will have additional attributes, including “terms” for an object of class “terms” derived from formula, and possibly “na.action” giving information on the handling of NAs (which will not be present if no special handling was done, e.g. by na.pass).
For our particular call, the first six values of mf are:
Note that:
- the subset command is executed here (note the row names above),
- Petal.Length has been log transformed and the resulting column name is not a valid R name, and
- if weights or an offset were used in the model, the resulting model frame would also include them.
As alluded to above, mf has several attributes, including one that would not normally be associated with a data frame ("terms"). The terms object contains the data that defines the relationships between variables in the formula, as well as any transformations of the individual predictors (e.g. log). For our original model:
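For instance, the attribute can be inspected like this (using an assumed formula in the same spirit as the post):

```r
## Extracting the terms attribute from a model frame; the formula is an
## illustrative assumption.
mf <- model.frame(Sepal.Width ~ Petal.Width + log(Petal.Length) + Species,
                  data = iris, subset = Petal.Length > 2)
trms <- attr(mf, "terms")
class(trms)                  # "terms" "formula"
attr(trms, "variables")      # the symbolic variables, including log(Petal.Length)
attr(trms, "dataClasses")    # per-column classes, used when checking new data
```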
The terms object will be used to generate design matrices on new data (e.g. samples being predicted).
The lm code has some additional steps to save the model terms and generate the design matrix:
The model.matrix function uses the data in the terms object to generate any interactions and/or dummy variables from factors. This work is mostly accomplished by a C routine.
terms
In the previous example, the log transformation is applied to one of the columns. When using an inline function inside a formula, this transformation will be applied to the current data, as well as to any future data points (say, via predict.lm). The same workflow is followed, where a model frame is used with the terms object and model.matrix.
However, there are some operations that can be specified in a formula that require statistical estimates. Two examples:
- A natural spline (splines::ns) takes a numeric variable, does some computations, and expands that variable into multiple features that can be used to model that predictor in a nonlinear fashion.
- An orthogonal polynomial (stats::poly) is a basis expansion that takes a single predictor and produces new columns that correspond to the polynomial degree.

As an example of natural splines:
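A small natural-spline example (the data here are illustrative):

```r
## Expand a numeric vector into a 3-column natural-spline basis.
library(splines)
x <- seq(1, 10, length.out = 20)
basis <- ns(x, df = 3)
dim(basis)                     # 20 rows, 3 basis columns
attr(basis, "knots")           # interior knots estimated from x ...
attr(basis, "Boundary.knots")  # ... and boundary knots, both stored for reuse
```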
ns returns multiple elements: the basis function spline results (shown just above) and the data required to generate them for new data (stored in the attributes).
It turns out that the only statistics required to produce new spline results are the knots, Boundary.knots, and intercept attributes. When new data are predicted, those statistical quantities are used:
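For example, predict.ns reproduces the basis for new points from those stored attributes (the data are illustrative):

```r
## Reproduce the spline basis for new data points; predict.ns uses the
## stored knots/Boundary.knots/intercept rather than re-estimating them.
library(splines)
x <- seq(1, 10, length.out = 20)
basis <- ns(x, df = 3)
new_basis <- predict(basis, newx = c(2.5, 7.5))
dim(new_basis)   # 2 rows, 3 columns: same basis, new points
```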
Now, getting back to formulas, we can include a function like this inline:
The resulting terms object saves the model specification, and also the values required to reproduce the spline basis function. The terms object contains an attribute that is misleadingly named predvars that holds this information:
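Here is what that looks like for an illustrative model (the formula is my assumption):

```r
## Where the spline statistics end up: the predvars attribute of the
## terms object.
library(splines)
mod <- lm(Sepal.Width ~ ns(Petal.Length, df = 3), data = iris)
attr(mod$terms, "predvars")
## A call of the form:
##   list(Sepal.Width, ns(Petal.Length, knots = ..., Boundary.knots = ...,
##        intercept = FALSE))
## i.e., the estimated knot values are baked into the expression.
```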
When predict.lm is invoked with a new data set, the terms are passed to model.frame and the predvars are evaluated on the new data, e.g.:
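A sketch of that evaluation step, mimicking what model.frame does during prediction (the model and the new values are illustrative):

```r
## Evaluate the stored predvars expression on new data.
library(splines)
mod <- lm(Sepal.Width ~ ns(Petal.Length, df = 3), data = iris)
pv <- attr(mod$terms, "predvars")        # list(Sepal.Width, ns(...))
new_data <- data.frame(Petal.Length = c(1.5, 4.2))
eval(pv[[3]], envir = new_data)          # spline columns for the new points
```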
In summary, R’s formula interface works by exploiting the original formula as a general expression, and the terms object is where most of the information about the design matrix is stored.
Finally, it is possible to include more complex operations in the formula. For example, two techniques for imputation in predictive models are tree ensembles and K-nearest neighbors. In each case, a model can be created to impute a predictor, and this model could also be embedded into predvars, as long as the prediction function can be captured as an expression.
There are some severe limitations to formulas which may not be obvious and various packages have had to work around these issues. The second post on formulas (“The Bad”) will illustrate these shortcomings.
by Sean Lopp
This month’s collection of Tips and Tricks comes from an excellent talk given at the 2017 RStudio::Conf in Orlando by RStudio Software Engineer Kevin Ushey. The slides from his talk are embedded below and cover features from autocompletion to R Markdown shortcuts. Use the left and right arrow keys to change slides.
Enjoy!
by Edgar Ruiz
In the corporate world, spreadsheets and PowerPoint presentations still dominate as the tools used for analyzing and sharing information. So, it is not at all surprising that even when business analysts use R for the analytical heavy lifting, they frequently revert to using spreadsheets and slide decks to share their results. This may seem like the easiest way to communicate with colleagues, but any modestly complicated project is likely to be error-prone and generate hours of unnecessary rework.
An R-savvy analyst can harness R Markdown to develop reproducible business reporting and information sharing workflows in any business organization; all it takes is a little effort to master some basic R document preparation tools.
In this post, I would like to examine a scenario that represents some experiences I had as an analytics professional.
A new R analysis is delivered in a PowerPoint presentation, and everyone thinks that the insights are very valuable. They all want more associates to see it, so almost immediately, the following three requests are made:
“…we need it broken out by” – The presentation needs to be split by a specific segment. The segment is normally geographical or managerial in nature.
“…they shouldn’t see each others data” – Since the results are not published in a central publishing platform, it is necessary to create multiple versions of the same report in order to secure the contents.
“…we need it every” – Satisfying requests 1 and 2 may not be too overwhelming if this were meant as a one-time analysis, but usually the analysis and its distribution need to be repeated on a regular interval.
Because we exported the findings into a presentation, sharing the results becomes more complex and time-consuming if we wish to satisfy the new requirements.
R Markdown combines the creation and sharing steps. The three requests can be satisfied using the following features of R Markdown:
Break out the reports – Using R Markdown’s Parameterized Reports feature, we can easily create documents for each required segment.
Automate the file creation – R Markdown can be run from code, so a separate R script can iteratively run the R Markdown and pass a different parameter for each iteration.
Create the slides inside R – Take advantage of R Markdown Presentation output to create a slide deck. Without having to learn a new scripting language, we can code the slide deck and use the same Parameter feature to automate its creation.
Keep the interactivity – In many cases, the end user needs a level of interactivity with the report. This interactivity can be achieved by using htmlwidgets inside the R Markdown document. For example, the Leaflet widget can be used for interactive maps, the Data Table widget for interactive tables, and the dygraphs widget for interactive time series charting.
Accessible and easy to open – Any alternative tool needs to be as accessible as the current spreadsheet and presentation tool. R Markdown can output results in HTML, PDF, and Word. Additionally, the Presentation output uses the highly accessible HTML5 format.
Reproducibility – Copying-and-pasting files, text, or images inevitably introduces human error. In R, data import, wrangling and modeling are already automated, so why not take it to its natural conclusion by using R Markdown to automate the presentation end of the process, as well?
Creating a dashboard is easy – In a spreadsheet, this is normally accomplished with a combination of pivot tables and graphs. R Markdown uses flexdashboard to create visually striking dashboards that are self-contained. By using this in combination with htmlwidgets, the audience gains access to a very powerful tool.
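The first two items above can be sketched as a small driver script (the segment names, file names, and params field are my assumptions; report.Rmd would declare a matching `params:` entry in its YAML header):

```r
## Render one output file per segment, passing the segment as a parameter.
segments  <- c("Northeast", "Southeast", "West")
out_files <- paste0("report-", segments, ".html")

if (file.exists("report.Rmd")) {   # guard so the sketch runs anywhere
  for (i in seq_along(segments)) {
    rmarkdown::render("report.Rmd",
                      params = list(segment = segments[i]),
                      output_file = out_files[i])
  }
}
```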
Here is an example of a live parameterized R Markdown flexdashboard based on stock data (see screenshot below).
R Markdown is a free package, so if you have R (and ideally RStudio), you can start using it today. Also, there are a lot of resources available for learning how to use R Markdown; the package’s official website is a good place to start.
Here is a sample script that uses Parameterized R Markdown to create a slide deck based on a selected stock. In this case we used Google:
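The script itself is not reproduced in this copy; a minimal sketch of such a parameterized slide deck might look like the following (the data pull via quantmod, the title, and all field values are my assumptions, not the original script):

````markdown
---
title: "Stock Performance"
output: ioslides_presentation
params:
  symbol: "GOOG"
---

## `r params$symbol` closing prices

```{r, echo = FALSE, message = FALSE}
# Hypothetical data source; the original script's source is not shown.
library(quantmod)
library(dygraphs)
prices <- getSymbols(params$symbol, auto.assign = FALSE)
dygraph(Cl(prices))   # interactive htmlwidget embedded in the slide
```
````

Changing params$symbol in the YAML header, or passing a different params list to rmarkdown::render, regenerates the whole deck for another stock.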
And here is the resulting deck. Press the left arrow key to see the next slide:
This simple script creates a nice-looking, interactive deck that needs no manual intervention if the data needs to be refreshed, and only one small parameter change if a different stock is to be selected.
We encourage you to try R Markdown yourself. The “start small and then build big” strategy rarely fails, so you could begin by automating a simple report first, and then start taking advantage of more advanced features as you grow comfortable with the tool.