Urban Institute
To build synthetic tax datasets for use in social science research
While tax data is highly sought after by social scientists, it is costly, sensitive, and difficult to access. The IRS has historically released public-use files (privacy-protected databases of sampled individual income tax returns) but has stopped producing them due to high costs and high vulnerability to re-identification attacks. This grant provides ongoing support for Claire Bowen at the Urban Institute, who is working with the IRS to develop synthetic versions of individual income tax return data. Synthetic data has mathematical and statistical properties similar to those of the real data but contains almost no private information from the original dataset. Grant funds will allow Bowen to continue developing two synthetic datasets, making substantial methodological improvements and exploring the application of differential privacy methods to assess the privacy attributes of this methodology. In addition, Bowen will make open-source code available on GitHub, document the methodology for use by other agencies, and disseminate the work through a white paper, blog posts, presentations, and journal articles.