Society benefits when leaders make more evidence-based decisions, but growing privacy concerns hamper researchers’ ability to understand and improve the world. Fully synthetic data, pseudo data generated by models, can protect confidentiality while still supporting statistically valid analysis. This talk shares how the Urban Institute collaborates with the IRS to create fully synthetic tax data for tax policy research. We built an R package called tidysynthesis to create machine learning models for each variable in the data. tidysynthesis leverages the power of tidymodels and allows users to run a sequence of machine learning models with different recipes, engines, and samplers while adding noise and enforcing logical constraints.
04:00 PM to 04:20 PM
Potomac D
Aaron R. Williams is a senior data scientist at the Urban Institute, where he works on microsimulation models, data imputation methods, and expanding access to administrative data with formal privacy and synthetic data. Williams leads Urban’s R Users Group and teaches Intro to Data Science in the McCourt School of Public Policy at Georgetown University.