What kind of data have we seen?

  • Census
    • All data (except people who don’t respond, and nonresponse could bias the results)
    • Survey of individuals about themselves
  • American Community Survey
    • Sample of people - how representative is this?
    • Individual data
  • Crime
    • Police reports (how representative is this of all crime?)
    • Unclear how well reports indicate actual crime, since not every report ends in a conviction
  • Recidivism
    • Lots of data (though not all)
    • Lots of factors
    • Factors relate both to individuals and to proxies (e.g. neighborhood)
  • Flu
    • CDC data - sample of hospitals and doctors
    • Google data - only a proxy that correlates with flu, but a lot of data!
  • Unemployment
    • Government labor data - sample of households
    • Google searches are a potential proxy that correlates with unemployment
    • not sure the information provided would accurately answer questions of interest
      • e.g. young people may or may not work, but that’s a different situation than for adults

What do you see in the unemployment data?

  • unemployment is high in the winter (seasonal patterns)
  • state unemployment tends to track national levels (though the numbers are different)
    • with exceptions
  • unemployment was high in 2008-10, then decreased
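
One way to check for a seasonal pattern like the winter bump noted above is to average the rate by calendar month and compare each month to the overall average. The sketch below is only an illustration: the rows are synthetic, made-up numbers (a flat 5% baseline plus a 1-point winter bump), not real labor data, and names like `seasonal_index` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Synthetic (year, month, rate %) rows: flat 5% baseline plus a winter bump.
# These are made-up numbers for illustration, not real labor statistics.
WINTER = {12, 1, 2}
rows = [
    (year, month, 5.0 + (1.0 if month in WINTER else 0.0))
    for year in range(2011, 2015)
    for month in range(1, 13)
]

# Group the rates by calendar month, across all years.
by_month = defaultdict(list)
for _year, month, rate in rows:
    by_month[month].append(rate)

monthly_avg = {month: mean(rates) for month, rates in sorted(by_month.items())}
overall = mean([rate for _, _, rate in rows])

# Seasonal index: how far each calendar month sits above or below
# the overall average.
seasonal_index = {month: avg - overall for month, avg in monthly_avg.items()}

# Months that run above the overall average.
high_months = sorted(m for m, idx in seasonal_index.items() if idx > 0)
print("months above average:", high_months)
```

With real data, a long-run trend (such as the 2008-10 rise and later decline) would also shift the monthly averages, so this simple comparison is only a first look.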

What does it mean to sample?

  • sampling everyone (a complete count) is too expensive
  • larger samples are more accurate (the unemployment survey covers 110k people)
  • averaging multiple samples gives a more stable estimate
  • sample across multiple populations to get a sense of the entire population
    • the unemployment survey uses 800 of 2000 sampling units in the US, chosen to reflect rural/urban, industrial/farming, etc.
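
The points above can be tried out in a small simulation. This is a rough sketch under stated assumptions: the population, the "urban"/"rural" groups, and their rates are all made up, and the proportional stratified sampling shown here is a simpler stand-in for the sampling-unit design the actual survey uses. It compares how much estimates vary for small vs. large samples, averages several small samples, and samples across groups in proportion to their size.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population split into two groups (strata).
# 1 = unemployed, 0 = employed. Sizes and rates are made-up values.
strata = {
    "urban": [1 if random.random() < 0.05 else 0 for _ in range(80_000)],
    "rural": [1 if random.random() < 0.10 else 0 for _ in range(20_000)],
}
population = strata["urban"] + strata["rural"]
true_rate = sum(population) / len(population)


def sample_rate(n):
    """Estimate the rate from one simple random sample of n people."""
    return sum(random.sample(population, n)) / n


def stratified_rate(n):
    """Estimate the rate by sampling each group in proportion to its size."""
    estimate = 0.0
    for group in strata.values():
        weight = len(group) / len(population)
        group_n = round(n * weight)
        estimate += weight * sum(random.sample(group, group_n)) / group_n
    return estimate


def spread(estimator, n, repeats=100):
    """Standard deviation of an estimator over repeated samples of size n."""
    return statistics.pstdev([estimator(n) for _ in range(repeats)])


small_spread = spread(sample_rate, 100)    # small samples vary a lot
large_spread = spread(sample_rate, 5_000)  # large samples vary much less

# Averaging many small samples also lands close to the true rate.
averaged = statistics.mean([sample_rate(100) for _ in range(50)])

# One stratified sample that covers both groups proportionally.
one_stratified = stratified_rate(5_000)

print(f"true rate:              {true_rate:.4f}")
print(f"spread, n=100:          {small_spread:.4f}")
print(f"spread, n=5000:         {large_spread:.4f}")
print(f"average of 50 samples:  {averaged:.4f}")
print(f"stratified, n=5000:     {one_stratified:.4f}")
```

The spread shrinks roughly with the square root of the sample size, which is why a survey of about 110k people can estimate a national rate without counting everyone.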