AND09 Dataset

Blog data from the 2008 US presidential election campaign, covering August to October 2008 (all data were verified by visiting the blog sites)
  • DataSet0a (Obama ground truth speeches)
  • DataSet0b (McCain ground truth speeches)
  • DataSet1a (Skewed in favor of Obama: 48 pro-Obama blogs and 9 pro-McCain blogs)
  • DataSet1b (48 pro-Obama blogs and 30 pro-McCain blogs)


Multidomain sentiment dataset

Amazon Reviews dataset collected by Mark Dredze et al.

Yahoo! Research Sandbox

Find more datasets in the Sandbox

A small Yahoo! Answers Dataset

The 10 folders in the dataset correspond to relevant documents collected for the following information needs:
  1. Based on your own family's experience, what do you think we should do to improve health care in America?
  2. Are sugar substitutes bad for you?
  3. What do you do to protect your computer from being infected by a virus, worm or spyware attack?
  4. What do you do to protect yourself from online fraud or identity theft?
  5. What should we do to free our planet from terrorism?
  6. What measures would you like to see to make the Indian Police Forces more accountable to citizens?
  7. How will corruption end in India?
  8. Has the gun culture arrived in India? If yes, how to stop it?
  9. What are the steps that India needs to take to become a superpower?
  10. How did Hitler come to adopt the Indian (sacred Hindu) symbol of swastika as his party symbol?
Some folders in the Yahoo! Answers dataset contain repeated documents. These have been deliberately kept so that the data can be used to test redundancy detection. This is a typical scenario with news documents gathered from multiple sources, e.g. Google News.
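
As a rough illustration of the redundancy check mentioned above, the following sketch groups the files in each folder by a hash of their content and reports the groups with more than one member. The directory layout (one sub-folder per information need under a common root), the function name `find_duplicates`, and the whitespace normalization are assumptions made for this example and are not part of the dataset distribution.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_duplicates(root):
    """Group files in each top-level folder of `root` by content hash.

    Returns {folder_name: [[relative_path, ...], ...]}, listing only the
    groups in which two or more files have identical content.
    """
    duplicates = {}
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        by_hash = defaultdict(list)
        for path in sorted(folder.rglob("*")):
            if not path.is_file():
                continue
            # Assumption: documents are text. Collapse runs of whitespace so
            # that trivially reformatted copies still hash to the same value.
            text = path.read_bytes().decode("utf-8", errors="replace")
            normalized = " ".join(text.split())
            digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
            by_hash[digest].append(str(path.relative_to(folder)))
        groups = [names for names in by_hash.values() if len(names) > 1]
        if groups:
            duplicates[folder.name] = groups
    return duplicates
```

This only catches copies that are identical after whitespace normalization. Near-duplicates, such as the same news story with small edits, would need a similarity measure rather than an exact hash.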

Links to open source software

Stanford NLP

[Stanford NLP] A good software package for many different NLP tasks

Search Engines

[Wumpus Search] A good C++ companion to the book "Information Retrieval: Implementing and Evaluating Search Engines"
-- Getting started with Wumpus

[Lucene Search] Another very popular open source search engine, written in Java.

