The basic process of moving large tabular data from one environment to another is fraught with issues. Ambiguous column headings and messy metadata can make it difficult and time-consuming to understand exactly what a data file contains. As researchers move data from repository to research tool (and often through a series of research tools), the opportunities for error proliferate.
Rufus Pollock of the Open Knowledge Foundation has developed a lightweight approach to structuring metadata about tabular datasets. With Pollock’s approach, tabular datasets are packaged and moved with files that describe the data (data types, formatting, source, etc.), allowing research tools like MATLAB, Excel, and Stata to parse the data inside appropriately. He describes this “data package” model as the equivalent of a shipping container for data, making it easier to standardize the entire logistics process. Funds from this grant support continued development of Pollock’s “data package” standard. Funded activities include the development of validators and extensions that would make it easy to export and import data packages from standard research tools (essentially adding new “Save As” and “Open” options); outreach to specific user communities to model use of the specification within individual disciplines; the launch of several pilot projects integrating the data package model into existing user workflows; and the building of a broader development community around the need for better tools for efficient and trusted storage, transport, and analysis of large tabular data.
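To make the "shipping container" idea concrete, the following is a minimal sketch of what packaging a table with its own description might look like. It uses only the Python standard library rather than any official tooling. The descriptor file name (datapackage.json) and the resources/schema/fields layout loosely follow the published Data Package conventions but are simplified here and should be read as assumptions; the dataset name, column names, source title, and both helper functions are hypothetical.

```python
import csv
import json
import tempfile
from datetime import date
from pathlib import Path


def write_data_package(directory, rows):
    """Write a CSV file plus a JSON descriptor that documents its columns."""
    directory = Path(directory)
    data_path = directory / "measurements.csv"  # hypothetical file name
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["site", "date", "temperature_c"])
        writer.writerows(rows)

    # Simplified descriptor: layout is an assumption modelled on the
    # Data Package conventions, not a complete or validated instance.
    descriptor = {
        "name": "example-measurements",  # hypothetical dataset name
        "sources": [{"title": "Hypothetical field survey"}],
        "resources": [
            {
                "path": data_path.name,
                "format": "csv",
                "schema": {
                    "fields": [
                        {"name": "site", "type": "string"},
                        {"name": "date", "type": "date"},
                        {"name": "temperature_c", "type": "number"},
                    ]
                },
            }
        ],
    }
    descriptor_path = directory / "datapackage.json"
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return descriptor


def read_data_package(directory):
    """Load the descriptor, then parse each column using its declared type."""
    directory = Path(directory)
    descriptor = json.loads(
        (directory / "datapackage.json").read_text(encoding="utf-8")
    )
    resource = descriptor["resources"][0]
    fields = resource["schema"]["fields"]
    # Only the handful of types used in this sketch are supported.
    casts = {
        "string": str,
        "integer": int,
        "number": float,
        "date": date.fromisoformat,
    }
    with (directory / resource["path"]).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return [
            {f["name"]: casts[f["type"]](row[f["name"]]) for f in fields}
            for row in reader
        ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_data_package(tmp, [["north", "2014-03-01", 3.5]])
        print(read_data_package(tmp))
```

Because the column types travel with the file, the reading side does not have to guess whether a value such as 2014-03-01 is a date or a string, which is the kind of ambiguity the grant aims to remove.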