ASA DataFest 2021

The American Statistical Association (ASA) DataFest is a celebration of data in which teams of undergraduates work around the clock to find and share meaning in a large, rich, and complex data set. The teams that impress the judges win prizes, and the event is a great opportunity for them to gain some data analysis experience. As part of our mission to support open source data science, RStudio was proud to help sponsor this event and to provide RStudio Cloud as a platform for collaboration within the teams.

The ASA DataFest 2021 season just wrapped up, and like all things this year, DataFest was virtual. Despite the challenges of organizing a virtual community event, 30 sites from six countries hosted an event this year, many with participation from multiple universities. Over 2,500 students participated in DataFest over an eight-week period between March and May 2021. You can find out about the participating institutions here.

The challenge

For this year’s challenge, ASA DataFest partnered with Rocky Mountain Poison & Drug Safety (RMPDS). RMPDS is a leader in public health protection that has served the public since 1956 with innovative research in toxicity, progressive solutions in case management, emergency services and regulatory compliance. RMPDS also runs a large survey on drug misuse, and data from this survey conducted in the United States, the United Kingdom, Canada, and Germany formed the basis of this year’s DataFest challenge. Teams were tasked with discovering and identifying patterns of drug use, with particular attention paid to identifying misuse of prescription drugs. These could include patterns that might describe demographic profiles within a given category of drug or combinations of drugs that frequently appear together.

The dataset was challenging to work with for a variety of reasons. First, many undergraduate curricula don’t address working with weighted surveys, so students had to choose between the following options:

  • Incorporating the weights into their analysis and being able to make generalizable conclusions.
  • Omitting that feature of the data due to lack of experience with survey data and being very careful about the scope of their inference.

The weighted samples also made country comparisons challenging without the use of proper survey analysis techniques, so many teams opted for analyzing data from a single country (often the country where they were located). On the upside, this allowed them to bring in what they might know about drug use in their countries as an outside data source and do a more localized analysis.
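To make the trade-off concrete, here is a minimal sketch of how ignoring survey weights can change an estimate. The toy data and the field names (`weight`, `misused`) are hypothetical and are not taken from the RMPDS survey. The example only computes point estimates; a proper survey analysis would also need design-aware standard errors, which this sketch does not attempt.

```python
# Hypothetical toy survey: each respondent carries a sampling weight
# (roughly, how many people in the population they represent) and a
# yes/no answer. None of these values come from the DataFest data.
respondents = [
    {"weight": 1.0, "misused": True},
    {"weight": 1.0, "misused": True},
    {"weight": 4.0, "misused": False},
    {"weight": 4.0, "misused": False},
]


def unweighted_proportion(rows):
    """Share of respondents answering yes, treating every row equally."""
    return sum(1 for r in rows if r["misused"]) / len(rows)


def weighted_proportion(rows):
    """Share answering yes, with each row counted in proportion to its weight."""
    total_weight = sum(r["weight"] for r in rows)
    yes_weight = sum(r["weight"] for r in rows if r["misused"])
    return yes_weight / total_weight


print("unweighted:", unweighted_proportion(respondents))
print("weighted:  ", weighted_proportion(respondents))
```

In this toy example, the respondents who answered yes are over-represented in the sample relative to their weights, so the unweighted proportion describes the sample while the weighted one is an estimate for the population. That gap is why teams that dropped the weights had to limit the scope of their conclusions.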

Running a virtual DataFest

In 2020, COVID-19 lockdown restrictions came just at the beginning of the DataFest season and many organizers either ran a modified virtual version of the event or had to abandon it altogether, since pivoting to virtual in such a short timespan with little preparation felt daunting. This year, after a full year of teaching virtually, faculty had a much better sense of what works and what doesn’t when it comes to running virtual events.

Just about all sites used some form of virtual communication tool like Slack, MS Teams, or Discord. Many sites made use of Zoom, especially for the kickoff event and the awards ceremony, though the idea of keeping folks on Zoom for the full 48-hour duration of the event wasn’t appetizing to anyone. At The University of Edinburgh (for our joint event with Heriot-Watt) we used GatherTown for co-working and communication throughout the event, which worked better than we could have hoped for, and the students loved it!

Of course, good tech alone doesn’t build a virtual community. The key players in the success of DataFest each year (whether it’s in person or online) are the volunteer mentors – postgraduate students, faculty, industry data professionals – who devote their weekend to checking in with the students, helping them get over hurdles, and acting as a sounding board for their ideas. Recreating effective mentoring interactions online is no small feat, but spatial tools like GatherTown coupled with the willingness of mentors to check in with students regularly helped us achieve that goal.

RStudio Cloud at ASA DataFest

DataFest was founded at UCLA in 2011, and almost every year since then RStudio has sponsored various DataFest events, including sending the much-coveted hex stickers to sites!

This year, since the event was virtual, hex sticker drops didn’t quite fit, but a new challenge was making sure all participants had access to the computing resources needed for the event, which is where RStudio Cloud came in!

We set up an ASA DataFest organization on RStudio Cloud, and within the organization, created a workspace for each host site that requested one. This meant the organizers for the host site got admin access to the workspace and could use it however they wanted. Many organizers also used this to distribute the data – they placed the data in a base project in the Cloud workspace, which meant any project created in the workspace came with the data as well.

Since ASA DataFest is tool/language agnostic, not all computing was done in RStudio Cloud, but students who chose to use R for their analysis appreciated the easy access!

ASA DataFest 2021 in action

Below are links to a few of the DataFest events from this year, where you can find out more about the individual events and watch recordings of student presentations.

Interested in DataFest?

If you’re a faculty member interested in hosting your own DataFest or participating in a nearby event in 2022, see here for instructions for signing up for our mailing list.

If you’re an undergraduate student interested in participating in DataFest, and your school has not held an event before, reach out to a faculty member with the above information. Student interest would be a great motivator for taking on the task of organizing an event or, at a minimum, reaching out to an organizer at a nearby institution to join forces.

Interested in teaching your own class or workshop using R?

RStudio Cloud is a lightweight, cloud-based solution that allows anyone to do, share, teach and learn data science online. There’s nothing to configure and no dedicated hardware, installation or annual purchase contract required.

We offer a free plan for casual, individual use, and we offer paid premium plans for professionals, instructors, researchers and organizations. Learn more at https://www.rstudio.com/products/cloud/.
