This is a guest post from Gina Reynolds with contributions from 3rd- and 4th-year West Point Math majors Morgan Brown and Madison McGovern. Gina works in data analytics and teaches statistics and probability at West Point. Her work focuses on tools for proximate comparison and translation in data analysis and visualization.
The ggxmean package introduces new geom_*s for fluid visual description of some basic statistical concepts. The ‘titular character’, geom_x_mean, draws a vertical line at the mean of x.
A few years ago, I was sitting on the floor of a packed-out ballroom watching Thomas Lin Pedersen’s talk, ‘Extend your Ability to Extend ggplot2’.
‘I want to do that,’ I thought.
And I had a use case in mind: statistical summaries, especially those used to explain fundamental statistical concepts like covariance, standard deviation, and correlation. You can visually walk through these concepts at a chalkboard, dissecting the equations for their computation. With ggplot2, you can, of course, get this done as well. I put together that walkthrough here:
So, math notation and visual representation builds of basic statistics! They co-evolve speaking to different learning styles. Plus DRY principles for coders and a walkthrough of calc w num vals, for numerophiles! #ggplot2 #xaringan #flipbookr #rstats https://t.co/JgWLxo94Ms pic.twitter.com/ol08lMGdtD
— Gina Reynolds (@EvaMaeRey) June 25, 2020
But to choreograph this, there was a lot of prep that I needed to do before starting to visualize. I had to calculate the means, standard deviations, etc., all before beginning to plot, and then feed those calculations into existing geom_* functions.
This didn’t feel like the powerful declarative experience that you have a lot of the time using ggplot2. Compare that to the experience that you get with the boxplot. That goes something like this:
# my_data, my_category, and my_continuous_outcome are placeholders
ggplot(data = my_data) +
  aes(x = my_category) +
  aes(y = my_continuous_outcome) +
  geom_boxplot()
In this boxplot example, lots of computation happens in the background for us: min, max, 25%, 75%, median. And that is great. I understand the boxplot well; I don’t need to do those computations myself. I’m happy for ggplot2 to do that for me.
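To peek at that background computation, here is a minimal sketch. It assumes ggplot2 is installed and uses the mpg data that ships with it, which is a stand-in and not a dataset from this post. layer_data() pulls the computed summaries back out of the boxplot layer. Note that ymin and ymax are the whisker ends, which coincide with the min and max only when no points are flagged as outliers.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for my_data (an assumption for illustration)
p <- ggplot(data = mpg) +
  aes(x = class) +
  aes(y = hwy) +
  geom_boxplot()

# layer_data() returns the data frame that the stat computed for layer 1
box_stats <- layer_data(p, 1)

# One row per category: whisker ends, quartiles, and median
box_stats[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
```

None of these summaries had to be computed before plotting, which is the declarative feel the rest of this post is after.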
For the covariance/variance/correlation stats walkthroughs, I wanted to have the same declarative experience. I understand the mean well, and one standard deviation away from the mean, etc. I should be able to ask ggplot2 to do that computation for me: to compute the global mean (or a group-wise mean if I’m in the mood for that) and put a vertical line there.
My solution to choreographing the stats visualizations with ‘base ggplot2’ (without using the extension mechanisms) felt inelegant and fragile. It wasn’t very portable (not easy to move to other data – maybe data that my students or I might be more passionate about) or dynamic (I couldn’t easily do group-wise work instead of acting globally). It wasn’t much fun.
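For a sense of that precompute-then-plot routine, here is a rough sketch using only ggplot2 and built-in datasets (cars and mtcars are stand-ins chosen for illustration, not the data from the original walkthrough). The object names are hypothetical.

```r
library(ggplot2)

# Step 1: compute the summary by hand, before any plotting
mean_speed <- mean(cars$speed)

# Step 2: feed the precomputed value into an existing geom
p_manual <- ggplot(data = cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_vline(xintercept = mean_speed, linetype = "dashed")

# Going group-wise means building a separate summary table by hand as well
mean_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

p_grouped <- ggplot(data = mtcars) +
  aes(x = mpg, y = wt) +
  geom_point() +
  geom_vline(data = mean_by_cyl, mapping = aes(xintercept = mpg)) +
  facet_wrap(facets = vars(cyl))
```

Switching datasets or grouping variables means redoing the summary step each time, which is where the fragility comes from.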
Thomas’ talk and the extension system seemed like the answer to bringing ggplot2’s fluid feel to these particular statistical stories.
Fast forward a few years. I consulted great materials on extending ggplot2: the ‘Extending ggplot2’ vignette, the ‘Extension’ chapter in the newest edition of the ggplot2 book, Thomas Lin Pedersen’s talk (again), the ggplot2 code on GitHub, and code from other extension packages in the ggplot2 extension gallery.
Using those resources, I managed to write the geom_x_mean() function and friends. And now I’m happy to introduce the ggxmean package!
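To give a flavor of what the extension mechanism involves, here is a minimal sketch of a mean-line layer built with ggproto. It is not the ggxmean source: StatXMeanSketch and geom_x_mean_sketch() are hypothetical names, and the real package may be organized differently. The idea is that a Stat collapses each group to its mean, and the existing GeomVline draws the result.

```r
library(ggplot2)

# Hypothetical Stat: reduce each group to one row holding the mean of x
StatXMeanSketch <- ggproto(
  "StatXMeanSketch", Stat,
  required_aes = "x",
  # x (and y, if mapped) are intentionally not carried through to the output
  dropped_aes = c("x", "y"),
  compute_group = function(data, scales) {
    data.frame(xintercept = mean(data$x, na.rm = TRUE))
  }
)

# Hypothetical constructor: pair the stat with ggplot2's vertical-line geom
geom_x_mean_sketch <- function(mapping = NULL, data = NULL,
                               position = "identity", na.rm = FALSE,
                               show.legend = NA, inherit.aes = TRUE, ...) {
  layer(
    stat = StatXMeanSketch, geom = GeomVline,
    data = data, mapping = mapping, position = position,
    show.legend = show.legend, inherit.aes = inherit.aes,
    params = list(na.rm = na.rm, ...)
  )
}

# Global mean: no precomputation needed
p_global <- ggplot(data = cars) +
  aes(x = speed, y = dist) +
  geom_point() +
  geom_x_mean_sketch()

# Group-wise means come for free once the plot is faceted
p_facet <- ggplot(data = iris) +
  aes(x = Sepal.Length, y = Sepal.Width) +
  geom_point() +
  geom_x_mean_sketch() +
  facet_wrap(facets = vars(Species))
```

Because the computation lives in compute_group(), ggplot2 reruns it per group and per panel, so the same layer works globally or group-wise.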
I’m excited about these functions because I think the syntax mirrors the chalkboard experience: naming concepts one at a time and easily depicting them.
Moreover, ggxmean allows you to do this visual storytelling beyond what you might do on a chalkboard: port the work routine to other datasets that your students find gripping, work with larger data sets (chalkboard work tends to be super small worked examples), and do group-wise computations!
Regarding this last point, in the plot that follows on the palmerpenguins data, ggplot2 instantly recomputes everything for us by species when we add the faceting declaration! ggplot2 is hard at work in the background, being its awesome self.
library(tidyverse)
library(ggxmean)

palmerpenguins::penguins %>%
  ggplot() +
  aes(x = bill_length_mm) +
  aes(y = flipper_length_mm) +
  geom_point() +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_xdiff() +
  ggxmean:::geom_ydiff() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean:::geom_diffsmultiplied() +
  ggxmean:::geom_xydiffsmean(alpha = 1) +
  ggxmean:::geom_rsq1() +
  ggxmean:::geom_corrlabel() +
  facet_wrap(facets = vars(species))
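The quantities those layers name (differences from the means, their products, the standard deviations, and the correlation) can be cross-checked numerically with base R. The sketch below uses the built-in cars data as a stand-in and the sample (n - 1) convention; whether ggxmean's layers divide by n or n - 1 is not something this sketch checks.

```r
x <- cars$speed
y <- cars$dist
n <- length(x)

# Differences from the means (the idea behind the xdiff and ydiff layers)
x_diff <- x - mean(x)
y_diff <- y - mean(y)

# Products of the differences, averaged with the n - 1 convention: covariance
cov_xy <- sum(x_diff * y_diff) / (n - 1)

# Standard deviations from the same differences
sd_x <- sqrt(sum(x_diff^2) / (n - 1))
sd_y <- sqrt(sum(y_diff^2) / (n - 1))

# Correlation: covariance rescaled by both standard deviations
r_xy <- cov_xy / (sd_x * sd_y)

c(covariance = cov_xy, sd_x = sd_x, sd_y = sd_y, correlation = r_xy)
```

These agree with cov(), sd(), and cor(), which is the same chain of steps the chalkboard walkthrough dissects.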
Another set of geoms that ggxmean offers targets a different intro stats topic: visualizing ordinary least squares (OLS) regression. In stats classes across the world, teachers name various statistical concepts as they teach OLS. Again, instructors tend to visualize these with toy datasets on the classroom chalkboard; this is great! ggxmean attempts to isolate some of those concepts and package them into geom_* functions to mirror that chalkboard experience:
library(tidyverse)
library(ggxmean)
#library(transformr) #might help w/ animate

## basic example code
cars %>%
  ggplot() +
  aes(x = speed, y = dist) +
  geom_point() +
  ggxmean::geom_lm() +
  ggxmean::geom_lm_residuals(linetype = "dashed") +
  ggxmean::geom_lm_fitted(color = "goldenrod3", size = 3) +
  ggxmean::geom_lm_conf_int() +
  ggxmean::geom_lm_pred_int() +
  ggxmean::geom_lm_formula() +
  ggxmean::geom_lm_intercept(color = "red", size = 5) +
  ggxmean::geom_lm_intercept_label(size = 4, hjust = 0)
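For readers who want the numbers behind those concepts, the same pieces can be computed with base R's lm() on the cars data. This is a sketch of the underlying quantities, on the assumption that the geoms depict a simple linear fit of dist on speed; it does not show how ggxmean computes them internally.

```r
# Fit the simple linear model the plot depicts
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope (the formula and intercept layers name these)
coefs <- coef(fit)

# Fitted values and residuals, one per observation
fitted_vals <- fitted(fit)
resids <- residuals(fit)

# Interval bands evaluated at the observed speeds
conf_band <- predict(fit, newdata = cars, interval = "confidence")
pred_band <- predict(fit, newdata = cars, interval = "prediction")

coefs
head(cbind(cars, fitted = fitted_vals, resid = resids))
```

The prediction band is always wider than the confidence band, since it accounts for the scatter of individual observations as well as the uncertainty in the fitted line.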
The work on OLS was a jumping-off point for the most recent additions to the ggxmean package. Morgan Brown and Madison McGovern, students at West Point, contributed to the package for independent studies in the fall AY2022 term. I’m incredibly excited to show you their work.
Morgan and Madison took up the question of data outliers. Here, we apply their work to famous toy datasets: Anscombe’s quartet and the datasauRus Dozen. With the functions I’d worked on, we can visualize the summary statistics (mean, sds, correlation) that are typically the subject of discussions of Anscombe’s quartet and the datasauRus Dozen. This is shown here:
# first some data munging
datasets::anscombe %>%
  pivot_longer(cols = 1:8) %>%
  mutate(group = paste("Anscombe", str_extract(name, "\\d"))) %>%
  mutate(var = str_extract(name, "\\w")) %>%
  select(-name) %>%
  pivot_wider(names_from = var, values_from = value) %>%
  unnest(cols = c(x, y)) -> # name the list-columns explicitly
  tidy_anscombe

tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  aes(color = group) +
  facet_wrap(facets = vars(group)) +
  ggxmean::geom_x_mean() +
  ggxmean::geom_y_mean() +
  ggxmean:::geom_x1sd(linetype = "dashed") +
  ggxmean:::geom_y1sd(linetype = "dashed") +
  ggxmean::geom_lm() +
  ggxmean::geom_lm_formula() +
  ggxmean:::geom_corrlabel() +
  guides(color = "none")
But Anscombe and datasauRus constellations are pretty special. And looking at statistics describing outlyingness also makes sense. Using Morgan and Madison’s functions on leverage and influence, we can easily highlight outlying observations!
In the following plot, Morgan’s function geom_text_leverage() calculates leverage for each observation:
tidy_anscombe %>%
  ggplot() +
  aes(x = x, y = y) +
  aes(color = group) +
  geom_point() +
  facet_wrap(facets = vars(group)) +
  ## A function Morgan wrote for ggxmean!
  ggxmean::geom_text_leverage(vjust = 1,
                              check_overlap = TRUE) +
  guides(color = "none")
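As a point of reference, leverage for a simple linear fit can be computed in base R with hatvalues(). The sketch below applies it to the fourth Anscombe set, straight from datasets::anscombe. It assumes leverage here means the diagonal of the hat matrix, and it does not claim to reproduce how geom_text_leverage() computes its labels.

```r
# Fourth Anscombe set: ten points share x = 8 and one sits alone at x = 19
fit4 <- lm(y4 ~ x4, data = datasets::anscombe)

# Leverage: diagonal of the hat matrix, one value per observation
lev <- hatvalues(fit4)

# The lone point at x = 19 fully determines the slope, so its leverage is 1
round(lev, 3)
most_leveraged <- unname(which.max(lev))
datasets::anscombe[most_leveraged, c("x4", "y4")]
```

Leverage values always sum to the number of fitted coefficients (two here), so one point holding a leverage of 1 is about as extreme as it gets.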
And in the next plot, Madison’s function geom_point_high_cooks() highlights the 10% most influential observations:
datasauRus::datasaurus_dozen %>%
  ggplot() +
  aes(x = x, y = y) +
  geom_point() +
  ## A function Madison wrote for ggxmean!
  ggxmean::geom_point_high_cooks(color = "goldenrod",
                                 alpha = .5,
                                 size = 5) +
  facet_wrap(facets = "dataset")
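Influence is commonly measured with Cook's distance, available in base R as cooks.distance(). The sketch below flags observations at or above the 90th percentile of Cook's distance in the third Anscombe set. Both the measure and the percentile cutoff are assumptions made for illustration; the exact rule geom_point_high_cooks() uses to pick its top 10% may differ.

```r
# Third Anscombe set: a near-perfect line with one outlying y value
fit3 <- lm(y3 ~ x3, data = datasets::anscombe)

# Cook's distance: how much the fitted values shift when a point is dropped
cd <- cooks.distance(fit3)

# Assumed rule: flag points at or above the 90th percentile
threshold <- quantile(cd, probs = 0.9)
flagged <- unname(which(cd >= threshold))

# Rows flagged as most influential, with their Cook's distances
cbind(datasets::anscombe[flagged, c("x3", "y3")], cooks = cd[flagged])
most_influential <- unname(which.max(cd))
```

In this set the outlying point in row 3 dominates, which matches what the eye picks out in the scatterplot.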
In my day-to-day analytic work, I’m glad to have the ggxmean functions ready to go. The function I use most is, not surprisingly, geom_x_mean() for marking the global and group-wise means! In the classroom, of course, the ggxmean functions are fun to apply to a variety of datasets used in class after a good, old-fashioned chalkboard walkthrough.
The package is not yet on CRAN, so to give it a spin yourself, you will need to install the development version.
We’re open to your feedback and contributions on code, computation, and conventions (function names, arguments, etc.)!