Draft

13  Working with data sources

Do data play a role in solving the problem? Before a project can move forward, data must be both accessible and relevant to the problem. Consider what variables each data source contributes. While some data are publicly available, other data are privately owned and permission becomes a prerequisite. And to the extent data are unavailable, we may need to setup experiments to generate it ourselves. To be sure, obtaining the right data is usually a top challenge: sometimes the variable is unmeasured or not recorded.

13.1 Identifying accessible data

In cataloging the data, be specific. Identify where data are stored and in what form. Are data recorded on paper or electronically, such as in a database or on a website? Are the data structured — such as a CSV file — or unstructured, like comments on a twitter feed? Provenance is important (Moreau et al. 2008): how were the data recorded? By a human or by an instrument?

13.2 Public and private data sources

13.3 APIs and databases

13.4 Web scraping considerations

13.5 Experimental design for data collection