RStudio is dedicated to our mission to support open source data science, and we believe that a code-first approach is uniquely powerful, because code provides the flexibility to build and share insights, tailored to the analytic problem and the needs of your stakeholders. However, different audiences often need different tools for different applications, and so we asked Pritam Dalal from our Customer Success team to share his perspective on the pros and cons of Excel vs. code-first data science.
Prior to joining RStudio, I had a 15-year career in financial services. In particular, I held trading and research roles in areas ranging from exotic derivatives, to mortgage-backed securities, to option market making. And throughout all of these experiences, there was one data analysis application that was more ubiquitous than all the rest combined: Excel.
Excel is a critical tool in the financial services industry. It is no exaggeration to say that each day hundreds of billions (perhaps trillions) of dollars get transacted on the basis of spreadsheet workflows. In contrast, there is a fair amount of antipathy towards Excel in the data science community. And while many of the criticisms are valid (a lack of reproducibility, severe limitations on data size, clunky visualizations), the negativity overlooks much of the exceptional utility of spreadsheets.
At a personal level, I have a great deal of affection for Excel. Spreadsheets are how I got my start in data analysis. They offered a visual and tactile approach to data-centric computation. Excel has a simple built-in programming language that served as my first foray into coding. I even backtested and implemented a profitable trading strategy with spreadsheets. However, when I increased the complexity and scope of the strategy, my Excel analysis tools were not able to scale accordingly.
Data analysis programming languages such as Python and R are far more flexible and powerful tools that amply address many of Excel’s shortcomings. But much is lost as well. With code, you lose the visceral experience of traversing spreadsheet cells. And for tiny data problems (say a few hundred rows of data), the overhead of a programming language may not be worth it.
And what’s more, here is a dirty data science secret: programming is not for everyone. Many don’t have the interest, temperament, or time to learn. While there are no-code alternatives such as Tableau or Power BI, for many of these non-coders, spreadsheets may in fact be the best tool.
If you work in financial services and are a champion of programming-centric data analysis tools, then your evangelism will be better received if it is not accompanied by rancor for Excel. Many citizen data analysts have come to rely heavily on spreadsheets, and change can be scary and painful. Showing disrespect to such a pivotal tool is an easy way to erode goodwill, and to keep your message from being heard.
While Excel can sometimes be the right tool for the job, as mentioned above, Python and R provide more flexibility and power that can address many of Excel’s shortcomings, including a lack of reproducibility and clunky visualizations.