Where to Find More Data

Learn how to find more data for our analysis.

Data for replication and original research have become widely available in the last decade. This trend has been driven both by the movement toward more transparent and reproducible research and by the rapid development of technology and big data. An exhaustive discussion of available data is impossible, since a wide variety and an almost unlimited quantity of data are now available online. Here, we discuss only a small number of exemplary data sources.

Replication data sources

Many journals now make replication data and program files available to researchers. One good example is the Journal of Peace Research, which has made the datasets for its published articles available since 1998. Another is the American Economic Review, which makes the datasets for recently published articles available as well. Many individual scholars deposit the datasets for their published articles in the Harvard Dataverse project. Currently, over 65,000 datasets are available there, covering a wide range of disciplines. The Harvard Dataverse site also links to various other dataverses, such as World Agroforestry Centre, Population Services International (PSI), International Food Policy Research Institute (IFPRI), Murray Research Archive, CfA Dataverse, American Journal of Political Science (AJPS), Brain Genomics Superstruct Project (GSP), and Bill and Melinda Gates Foundation.

Original data sources

Many large data projects supported by national governments and international organizations have also made their data available. Here’s an illustrative list.

  • World Values Survey provides data from nationally representative surveys in almost 100 countries on individual beliefs, values, and motivations regarding issues such as development, religion, democratization, gender equality, social capital, and subjective well-being.
  • International Social Survey Programme provides data from surveys in some 53 countries regarding a wide range of topics, such as the role of government, social networks, social inequality, religion, environment, changing family and changing gender roles, and national identity.
  • U.S. government’s open data provides access to 185,397 datasets in agriculture, business, climate, consumer, ecosystems, education, energy, finance, health, local government, manufacturing, ocean, public safety, and science and research. Examples include consumer complaints, demographic statistics, weather, international trade in goods and services, and the College Scorecard.
  • UNCTAD provides a wide range of data from national and international data sources. Its data center covers 150 time series on a wide range of topics, including trade, investment, commodities, population, external resources, information economy, creative economy, iron ore, and maritime transport.
  • World Bank provides free and open access to global development data, including many datasets such as worldwide governance indicators, poverty and equity database, world development indicators, education statistics, gender statistics, and health nutrition and population statistics.
  • Miscellaneous datasets curated by various analysts. For example, Kaggle provides access to over 100 datasets, including tweets targeting ISIS, airplane crashes throughout the world since 1908, the Zika virus epidemic, 2016 U.S. presidential debates, and more.

As another example, Awesome Public Datasets provides data collected and tidied from blogs, answers, and user responses, covering topics in agriculture, biology, climate and weather, complex networks, computer networks, contextual data, data challenges, economics, education, energy, finance, geology, GIS and environment, government, healthcare, image processing, machine learning, museums, natural language, physics, psychology and cognition, public domains, search engines, social networks, social sciences, software, sports, time series, transportation, and complementary collections.

Data packages available in R

A large quantity of data is now available through R packages. For example, we have used two packages of economic data in this textbook:

  • pwt
  • wbstats
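
Loading data from an R data package can be sketched as follows. This is a minimal example, assuming the pwt package is already installed from CRAN and that the installed release bundles a data frame named pwt7.1; the version name is an assumption, and the first command lists what is actually available on your machine.

```r
# Assumes the package is installed: install.packages("pwt")

# List the names of the datasets bundled with the pwt package.
pwt_datasets <- data(package = "pwt")$results[, "Item"]
print(pwt_datasets)

# Load one version of the Penn World Table into the workspace.
# "pwt7.1" is an assumed dataset name; pick one from the list above.
data("pwt7.1", package = "pwt")

# Inspect the size of the data frame and its first few rows.
dim(pwt7.1)
head(pwt7.1)
```

The wbstats package works differently: rather than bundling data, it downloads indicators from the World Bank on request, so it needs an internet connection.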

Another great data source, Quandl, provides financial and economic data from hundreds of sources via API or directly through R and other software. The data cover stocks, futures, commodities, currencies, interest rates, options, asset management, industry, and the economy. All databases and datasets on Quandl are available from within R, using the Quandl R package.

A growing list of R data packages is also maintained online, covering a wide variety of disciplines and topics, such as ecology, genes, earth science, economics, finance, chemistry, agriculture, literature, marketing, web analytics, news, media, sports, maps, social media, government, data depots, Google web services, Amazon web services, and so forth.
