There’s an old saying (at least old in data scientist years) that goes, “90% of data science is data wrangling.” This rings particularly true for data science leaders, who watch their data scientists spend days painstakingly picking apart ossified corporate datasets or arcane Excel spreadsheets. Does data science really have to be this hard? And why can’t they just delegate the job to someone else?
The reason that data wrangling is so difficult is that data is more than text and numbers. As shown in Figure 1, data scientists routinely have to deal with:
The data challenges listed above are just the tip of the iceberg. Many datasets originate in Excel, and many Excel creators hide information in their column and row names as shown in Figure 2. In other data sets, no metadata is included within the data set at all. Instead data publishers provide a completely separate data dictionary that data scientists have to interpret to use the data.
With these challenges facing them, your data scientists are far from wasting time when they are data wrangling. In fact, transforming data is an essential part of the understanding process
However, data science leaders can speed up data wrangling within a team by encouraging some simple behaviors:
Write code to allow reproducibility. Too many data scientists perform data wrangling using drag-and-drop tools like Excel. That approach may seem faster the first time that data set is ingested, but that manual process will stand in the way of reproducing the analysis later. Instead write functions for ingesting data that can be re-run every time the data changes, and you’ll save time in the long run.
Embrace tidy data. The tidyverse collection of packages in R establishes a standardized way of storing and manipulating data called tidy data, as shown in Figure 3. The tidyverse ensures that all the context needed to understand a data set is made explicit by giving every variable its own column, every observation its own row, and storing only one value per cell.
Create a standard data ingestion library. If your entire team defaults to using tidy data and the tidyverse in all their analyses, then they’ll find it easier to read and reuse each other’s data wrangling code. You can encourage that behavior by establishing a team Github organization where they can share those code packages and speed up their data understanding in future projects.
These behaviors can yield big rewards for data science teams. At rstudio::conf 2020, Dr. Travis Gerke of Moffitt Cancer Center in Tampa, Florida noted that reproducible pipelines have proved a game-changer in wrangling and unlocking complex patient data for the Center’s researchers.
If you’d like to learn more about how to reduce data wrangling hassles, we recommend:
RStudio has many tools for both R and Python programmers. In this blog post, we’ll showcase various ways that you can program in Python with RStudio tools.
Announcing the 2021.10.0 release of the RStudio Professional Drivers, with single sign-on (SSO) for Hive and Impala.