Recently, we were joined by the smart folks at Roche & Novartis to present a webinar on effective data visualization. You can watch the recording of the full presentation here. It was the latest installment in a series of webinars highlighting industry leaders in the Pharmaceutical and Life Science spaces that are doing world-changing data science work.
They presented many great insights, most of which are relevant to data scientists in every industry, and we wanted to share our learnings.
“Graphics and visuals are such an important component of the work that we do as quantitative data scientists.”
Marc Vandemeulebroecke, biostatistician at Novartis, stated this, then went on to say that effective visualizations help data scientists achieve two important things:
Imagine you have only 3 minutes to present your hard work to stakeholders. You’re probably going to rely on some sort of visualization. The stakeholders are then going to use the conclusions they draw from that presentation to make important decisions, such as what product to invest in or what initiative to approve. The stakes are set. Marc is clearly right: being able to effectively visualize data is an important part of any data scientist’s job.
So what’s the issue? Why spend an hour presenting on this topic?
“Unfortunately we’re not always good at creating effective visualizations.”
And the consequences can be brutal. “When we get [data visualization] wrong, it can lead to misinformation, confusion, or harm, especially in clinical and medical research”, Mark Baillie, Director of Statistical Methodology at Novartis, continues.
He says this while displaying a graphic recently published by the New York Times that shows the importance of social distancing during COVID 19. To be clear: this was an example of a good data visualization, one that potentially contributed to lives being saved.
He transitioned to a lighter example, a bad data visualization: a confusing graphic that attempts to relay the results of a pizza topping survey by YouGov, a government entity in the UK:
Certainly no harm was inflicted by botching a pizza graphic, but the graphic is confusing. The results were posted on Twitter and the most popular reply questioned how YouGov managed to poll 695% of the population. YouGov had to make another post clarifying its analysis.
So why aren’t we great at data visualizations? Back to Marc’s introduction: The focus in advanced education is primarily on doing the analysis, not communicating and visualizing the results. Rightfully so or not, that in turn means data visualization is a skill frequently learned on the job.
“So what do we mean by effective? We don’t necessarily mean beautiful.”
We also make common missteps when creating data visualizations. These missteps include selecting the wrong graph type, misusing color, and not considering scale. The singular error that most of these missteps roll up into: we’re overly concerned with making visualizations that are beautiful, so much so that we’re willing to sacrifice the visualization’s effectiveness.
“Effective visualization = effective communication” and the 3 laws for improving visual communication.
The teams at Roche & Novartis distill the goal of effective visualization down to effective communication. The purpose of any data visualization is to communicate with stakeholders in a way that results in true understanding and better decision making.
So, how do we achieve that? This has been such a focus at Novartis that they actually penned the Three Laws for Improving Visual Communication org-wide:
Mark introduces this law by calling it “just advanced common sense”. Before jumping into your visualization, you should work through a why, what, who, and where mental framework:
Why do you need a graph? You should be able to identify a specific purpose for the graph, such as delivering a message or exploration.
What quantitative evidence do you have to support the purpose?
Who is the intended audience? Once you figure that out, you can adjust the design to support their needs.
Where is your visualization going to live? Once you know that, you should make design choices that fit formatting constraints.
Your decisions when creating a visualization should work in harmony to achieve its purpose. You may opt for a tool like RMarkdown when distributing a visualization to senior stakeholders internally, with the purpose of quickly showing clinical trial results without the need for deep exploration. The choices you make for this data visualization are likely different than the choices you’d make for a visualization that you were to submit to the FDA or teammates that you work with.
Mark expands on this law with a slide dedicated to a quote by John Tukey, who argues that the quality of an analysis is related to the quality of the question being asked.
“An approximate answer to the right question is worth a great deal more than a precise answer to the wrong question.”
The message here is to spend more time thinking about the questions that set your analysis and visualization building process in motion.
Mark’s recommendation for this rule: do not lie or misrepresent with data. Your reporting should be open and transparent. When you’re clear, open, and transparent, you build trust with your audience, a necessary ingredient that enables your analysis to drive decision-making.
Once you make the decision to be open and transparent, you can start thinking about the design and the graph type of your visualizations. Choosing the correct graph type aids in interpretation. “Often, we don’t need to reinvent the wheel here”, Mark adds. If you want to show a deviation, correlation, or ranking, there are common graph types you can usually select from to achieve your visualization’s purpose in a clear way.
Additionally, you don’t want to make your audience members have to work when looking at a particular graph type. Unnecessary components in data visualizations, or overly stylistic choices, may do more harm than good.
More specifically, you should carefully consider the scaling and spacing of your data visualizations. “Avoid plotting log-normally distributed variables on a linear scale”, Mark presents, then goes on to add, “measurements displayed close together are perceived to be closer in time.”
The key to this law is: do not assume your reader understands what message your data visualization is trying to portray. Instead, really work at your visualization to make the takeaway obvious.
What are some best-practices you can follow to achieve this? Mark highlights several:
Try not to set text at angle. Instead, think of alternatives such as transposing the graph.
Avoid unnecessary color. Don’t use color to differentiate between categories of the same variable. Doing so can contribute to confusion.
Only use color when it adds value. Use bold, saturated, or contrasting color to emphasize important details.
All of your design choices should be made with the purpose of making your message more obvious to the audience. There are lots of bells and whistles you can add to your data visualizations. What you don’t want to do is add unnecessary or confusing distractions to your visualization.
We’re grateful for the teams at Roche & Novartis for giving a fantastic presentation and the 1,400 live attendees that chose to spend an hour with us during truly strange times. If you were one of those live attendees, thank you. If you are interested in watching the full recording of the presentation, you can find it here.
Many of the lessons from this webinar are core to RStudio’s mission: we love helping data scientists maximize the impact of their work. We also believe that it shouldn’t be difficult for data scientists to reproduce that impact time and time again as projects change.
If you’re hungry for more resources that may enhance your data visualization skills and ability to drive value at your organization, you should consider the following resources:
Finally, if you want to watch the other episodes in this webinar series, you can find them here:
2021 INSPIRE U2 participants Kathleen Bostic and Michel Ruiz-Fuentes reflect on their experience in the program, where they developed their data science skills with the support of faculty and...
Three key attributes define Serious Data Science; open-source software, code-first development, and a centralized data science infrastructure. This approach has been successful at driving value and...