Q: What is big data?

  • Find FOUR really big datasets. They should be sufficiently different. Cite your sources.
  • Discuss and determine a ranking in terms of "bigness".
  • Fill in the template below. Replace all (( )) with your answers.

Rank 1: Google Images

Description: Provides useful images to users

Justification: Contains almost every image on the web worldwide, stored as pixel values, with no restrictions or filters.

Citation: images.google.com

Rank 2: Tweets per Year, Twitter

Description: Collection of tweets by all Twitter users in a year

Justification: The number of global Twitter users is greater than the US population, and ~500 million tweets are sent per day, along with hashtag information.

Citation: http://www.internetlivestats.com/twitter-statistics/
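
As a rough back-of-the-envelope check on the size of the Rank 2 dataset, the ~500 million tweets per day figure from the justification can be scaled up to a full year. This is only a sketch: it assumes the daily rate stays constant over a 365-day year, and the variable names are illustrative, not taken from the cited source.

```python
# Rough yearly tweet count, assuming the cited ~500 million tweets/day
# rate is constant over a 365-day year (an assumption, not a measured total).
TWEETS_PER_DAY = 500_000_000
DAYS_PER_YEAR = 365

tweets_per_year = TWEETS_PER_DAY * DAYS_PER_YEAR
print(f"{tweets_per_year:,} tweets per year")  # prints "182,500,000,000 tweets per year"
```

Under that assumption the dataset holds roughly 182.5 billion tweets per year, before counting any hashtag information attached to each tweet.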

Rank 3: US Census

Description: Population count with details on each individual, such as age, race, etc.

Justification: Census data includes all personal information attached to each individual

Citation: http://www.census.gov/data/developers/data-sets/decennial-census-data.html

Rank 4: Amazon Reviews

Description: Compilation of all product reviews spanning 18 years, up to March 2013

Justification: Aside from the review data itself, reviewers' personal data is probably not collected.

Citation: https://snap.stanford.edu/data/web-Amazon.html