For non-specialists: If you’ve ever wondered why it’s so hard to get all the data points you want into a simple, workable database and what you can do about it, this is a good place to start. Tidy data is a way of structuring data so that it is easier for machines to read and to reformat, and the paper provides a methodology for tidying messy data. Tidy structure also removes the ambiguity between zero values and missing values, and reduces the number of empty cells in tables. Recommended.
It is often said that 80% of data analysis is spent on cleaning and preparing data. And it’s not just a first step: it must be repeated many times over the course of analysis as new problems come to light or new data is collected. To get a handle on the problem, this paper focuses on a small, but important, aspect of data cleaning that I call data tidying: structuring datasets to facilitate analysis.
…
Tidy data
Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is
messy or tidy depending on how rows, columns and tables are matched up with observations,
variables and types. In tidy data:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
This is Codd’s 3rd normal form (Codd 1990), but with the constraints framed in statistical
language, and the focus put on a single dataset rather than the many connected datasets
common in relational databases. Messy data is any other arrangement of the data.
Hadley Wickham – Tidy Data, Journal of Statistical Software MMMMMM YYYY, Volume VV, Issue II
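
To make the three rules concrete, here is a minimal sketch in plain Python of turning a messy table into a tidy one. The data, the column names (`person`, `treatment_a`, `treatment_b`) and the helper `melt` are all hypothetical, made up for illustration; they are not taken from the paper, and real work would more likely use a data-frame library.

```python
# Hypothetical messy table: one row per person, one column per treatment.
# The headers "treatment_a" and "treatment_b" are really values of a
# variable (which treatment), not variable names.
messy = [
    {"person": "Ann", "treatment_a": 4, "treatment_b": 7},
    {"person": "Bob", "treatment_a": None, "treatment_b": 2},
]


def melt(rows, id_key, var_name, value_name):
    """Turn every non-id column of every row into its own observation row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_key:
                continue
            tidy_rows.append(
                {id_key: row[id_key], var_name: column, value_name: value}
            )
    return tidy_rows


# After melting, each variable (person, treatment, result) is a column
# and each observation (one person under one treatment) is a row.
tidy = melt(messy, id_key="person", var_name="treatment", value_name="result")
for record in tidy:
    print(record)
```

In the tidy version the missing measurement for Bob is an explicit row whose `result` is `None`, so it cannot be confused with a measured zero.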