A couple of weeks ago the AOL research department released this dataset:
"500k User Queries Sampled Over 3 Months. This collection consists of ~20M web queries collected from ~500k users over three months. Where the data is sorted by anonymized user id... The goal of this collection is to provide a real query log based on users. It could be used for personalization, query reformulation or other type of search research."
It was made available as a free download for non-commercial use only. It was quickly withdrawn but remains widely available as a BitTorrent download.
The data are reasonably anonymous. I say 'reasonably' because there are no strict personal identifiers in these data, but personal details like social security numbers, phone numbers, addresses and so on do feature in search requests. And there are plenty of those in here. The data are also uncensored.
This is going to send huge ripples through the regulatory debate, not least because AOL's search technology is provided by Google, the globe's number one search engine. These data are therefore a very good guide to the kind of search queries that run through Google, and Google has kept this kind of thing under very tight wraps in the past.
As an academic I'm torn: these data would provide a wonderful snapshot of search activity. But is it ethical to use them if the users have not consented? They were designed for an academic audience, but are these 'public' data? They will undoubtedly be publicly available for many years to come. It's also highly likely that both law enforcement and market research companies will be working with them already. Eszter Hargittai, an expert on the sociology of search, points out some of the problems.