Data collection description

The data collection for task 3 consists of a set of medical documents provided by the Khresmoi project. The collection covers a broad range of medical topics and does not contain any patient information. The documents come from several online sources, including websites certified by the Health On the Net organization and well-known medical sites and databases (e.g. Genetics Home Reference, ClinicalTrials.gov, Diagnosia).


The creation of the data set is documented in:


Creation of a New Medical Information Retrieval Evaluation Benchmark Targeting Patients Needs.
Lorraine Goeuriot, Liadh Kelly, Gareth J. F. Jones, Guido Zuccon, Hanna Suominen, Allan Hanbury and Henning Müller. In Proceedings of the Fifth International Workshop on Evaluating Information Access (EVIA 2013), a Satellite Workshop of the NTCIR-10 Conference, Tokyo/Fukuoka, Japan, National Institute of Informatics/Kijima Printing, 2013.

(Reference in BibTeX format.)


Data download

The archive is distributed as a set of eight zip files (size: 7.9GB compressed, 52.7GB uncompressed), named partN.zip. The files contain the processed version of the crawl, which was obtained from the raw data by filtering out very short documents and correcting some mark-up errors (e.g. by applying Jsoup functions).
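
Once the files have been obtained, unpacking them could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the function name extract_parts, the directory arguments, and the numbering part1.zip to part8.zip (that is, N running from 1 to 8) are hypothetical and not confirmed by the data providers. Make sure the target disk has room for the uncompressed size quoted above before running it.

```python
import zipfile
from pathlib import Path


def extract_parts(archive_dir, output_dir, num_parts=8):
    """Extract part1.zip .. part{num_parts}.zip into output_dir.

    Assumption: the archives are numbered from 1 to num_parts.
    Returns the names of any expected archives that were not found.
    """
    archive_dir = Path(archive_dir)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    missing = []
    for n in range(1, num_parts + 1):
        part = archive_dir / f"part{n}.zip"
        if not part.is_file():
            # Record the gap instead of failing, so a partial
            # download can be detected and completed later.
            missing.append(part.name)
            continue
        with zipfile.ZipFile(part) as zf:
            zf.extractall(output_dir)
    return missing
```

A call such as extract_parts("downloads", "collection") would unpack every archive it finds into the collection directory and return the names of any parts that still need to be downloaded.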


DATA WILL BE PROVIDED AFTER SIGNING THE END-USER AGREEMENT.