
Free-To-Use Data Sets

In this guide we will provide you with a detailed overview of all of the freely accessible data sets we have analyzed thus far.

This table shows a quick overview of all the free data sets we’ve looked at:

| Corpus | Download | Documents | Queries | Notation |
| --- | --- | --- | --- | --- |
| ADI | Free | 82 | 35 | .I, .T, .W, .A |
| CACM | Free | 3,204 | 64 | .I, .T, .W, .B, .A, .N, .X |
| CISI | Free | 1,460 | 112 | .I, .T, .W, .B, .A, .X |
| Cranfield | Free | 1,400 | 225 | .I, .T, .W, .B, .A |
| LISA | Free | 6,004 | 35 | *, # |
| MS MARCO | Free | 3,213,835 | 5,193 | TSV format |
| Medline | Free | 1,033 | 30 | .I, .W |
| NPL | Free | 11,429 | 93 | / |
| Time | Free | 423 | 83 | *TEXT, *FIND |
| Reuters 21578 | Free | 21,578 | - | SGML format |
| OHSUMED | Free | 20,000 | - | None |
| Europarl Parallel Corpus | Free | N.A. | - | <CHAPTER ID=?>, <SPEAKER ID=?> |

Data Sets we used in our Experiments

ADI

Description:
An extremely small data set containing fewer than 100 examples. It does not offer many prospects beyond experimenting with machine learning on small data sets.

Documents:
The file ADI.ALL contains eighty-two (82) documents with the same notation as Cranfield.

Example:

.I 3
.T
an important need and opportunity for a. d. i. leadership in information science education .
.A
R. L. TAYLOR
.W
current trends in information science education appear inadequate for the important need of the nation's practicing professional personnel for training in becoming information specialists or more proficient users of information systems .  a particular educational program by a. d. i. is suggested to supplement others in meeting this presumptive need .
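A minimal sketch of splitting such dot-tagged (SMART-style) files into records. The helper name parse_smart_records is our own, and the sketch assumes each field marker sits on its own line, as in the example above:

```python
# Split a SMART-style file into records on ".I" lines and collect the text
# that follows each field marker (.T, .A, .W, ...).
def parse_smart_records(text):
    records = []
    fields = None
    key = None
    for line in text.splitlines():
        if line.startswith(".I"):
            fields = {"I": line[2:].strip()}  # new record starts here
            records.append(fields)
            key = None
        elif line.startswith(".") and len(line) >= 2 and line[1].isalpha():
            key = line[1]                      # a new field marker
            fields[key] = ""
        elif key is not None:
            fields[key] += line + " "          # continuation of the field

    return records

sample = """.I 3
.T
an important need and opportunity
.A
R. L. TAYLOR
.W
current trends in information science education
"""
docs = parse_smart_records(sample)
```

The same sketch should cover Cranfield, CACM, CISI, and Medline, since they all use this notation.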

Queries:
The query file ADI.QRY holds thirty-five (35) short queries, each marked with an ID (.I) and query text (.W); they read like test questions.

Example:

.I 3
.W
What is information science?  Give definitions where possible.

Relevance Assessments:
The relevance assessments in ADI.REL use the same format as CISI and CACM: the query-ID, followed by the relevant document-ID, followed by 0 and 0.000000. Each pair has its own row and can be used to build training vectors.

Example:

3   3   0   0.000000
3  43   0   0.000000
3  45   0   0.000000
3  60   0   0.000000
4  29   0   0.000000
4  63   0   0.000000
5   3   0   0.000000

Other files:
ADI.BLN - List of boolean queries

Parsing Problems:

  • Documents - Not every entry has information on the author. Tags will not be present if there is no information available.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/adi/

CACM

Description:
CACM is a collection of article abstracts published in the Communications of the ACM between 1958 and 1979. The collection is often considered too small to show any real impact in retrieval experiments.

Documents:
The file cacm.all contains 3,204 labeled entries. Each field is marked with a period followed by a letter. The fields appear in an entry in the following order:
(.I) ID
(.T) Title
(.W) Abstract
(.B) Publication date of the article
(.A) Author list
(.N) Information when entry was added
(.X) List of cross-references to other documents

Example:

.I 46
.T
Multiprogramming STRETCH: Feasibility Considerations
.W
The tendency towards increased parallelism in computers is noted.  Exploitation of this parallelism presents a number of new problems in machine design and in programming systems.  Minimum requirements for successful concurrent execution of several independent problem programs are discussed.  These requirements are met in the STRETCH system by a carefully balanced combination of built-in and programmed logic.  Techniques are described which place the burden of the programmed logic on system programs (supervisory program and compiler) rather than on problem programs.
.B
CACM November, 1959
.A
Codd, E. F.
Lowry, E. S.
McDonough, E.
Scalzi, C. A.
.N
CA591102 JB March 22, 1978  3:57 PM
.X
168  5  46
491  5  46
46   5  46
168  6  46

Each line of the cross-reference list shows the referenced ID, followed by 4, 5, or 6, with the document-ID at the end. The three numbers define the reference type more precisely:

4: "bibliographic coupling" - if document id Y appears in the bibliographic
   coupling subvector for document X with a weight of w, it means X
   and Y have w common references in their bibliographies; the weight
   of did X in the vector for X is the number of items in X's bibliography.
5: "links" - documents X and Y are linked if X cites Y, Y cites X, or
   X == Y.
6: "co-citations" - if document id Y appears in the co-citation subvector
   for document X with weight w, it means X and Y are cited together in
   w documents; the weight of did X in the vector for X is the number
   of documents that cite X.
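A sketch of decoding such .X lines into labeled cross-references. The helper name decode_crossrefs is our own, and the input is assumed to be the list of lines under an entry's .X tag:

```python
# Map the 4/5/6 codes from cite.info to readable link types.
LINK_TYPES = {4: "bibliographic coupling", 5: "link", 6: "co-citation"}

def decode_crossrefs(x_lines):
    """Turn '.X' lines ('ref_id code doc_id') into (ref_id, type) pairs."""
    refs = []
    for line in x_lines:
        ref_id, code, _doc_id = (int(t) for t in line.split())
        refs.append((ref_id, LINK_TYPES[code]))
    return refs

refs = decode_crossrefs(["168 5 46", "491 5 46", "168 6 46"])
```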

Queries:
The file query.text contains sixty-four (64) queries which use the same field markers as the documents. Not all queries contain every tag, and since some of the information can be misleading, it is best to rely only on the .I and .W tags. The tags appear in the following order:
(.I) ID
(.W) Query
(.A) Author list
(.N) Authors name and some keywords on what the query searches for

Example:

.I 33
.W
Articles about the sensitivity of the eigenvalue decomposition of real matrices, in particular, zero-one matrices.  I'm especially interested in the separation of eigenspaces corresponding to distinct eigenvalues. Articles on the subject: C. Davis and W.M. Kahn, "The rotation of eigenvectors by a permutation", SIAM J. Numerical Analysis, vol. 7, no. 1 (1970); G.W. Stewart, "Error bounds for approximate invariant subspaces of closed linear operators", SIAM J. Numerical Analysis, Vol. 8, no. 4 (1971).
.A
Davis, C.
Kahn, W.M.
Stewart, G.W.
.N
33. Bengt Aspvall (sens of eigenval decomp of real matrices)

Relevance Assessments:
In qrels.text each row holds the query-ID, the relevant document-ID, and two unused trailing columns (0 and 0). Every relevant document has its own row, which makes the file easy to turn into training data.

Example:

10 0046  0 0
10 0141  0 0
10 0392  0 0
10 0950  0 0
10 1158  0 0
10 1198  0 0
10 1262  0 0
10 1380  0 0
10 1471  0 0
10 1601  0 0
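A sketch of collecting such rows into a query-to-documents mapping. The helper name read_qrels is our own; only the first two columns are used:

```python
from collections import defaultdict

def read_qrels(lines):
    """Map each query-ID to the set of its relevant document-IDs."""
    qrels = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:
            qid, did = int(parts[0]), int(parts[1])
            qrels[qid].add(did)
    return qrels

sample = ["10 0046  0 0", "10 0141  0 0", "10 0392  0 0"]
rels = read_qrels(sample)  # rels[10] == {46, 141, 392}
```

Converting the document-ID through int() also strips the zero padding ("0046" becomes 46).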

Other Files:

  • cite.info - Key to citation info
  • common_words - Stop words used by smart
  • qrels.text - List of relevance judgments

Parsing Problems:

  • Documents - Not all entries contain information on author and text. Those tags will not be present if there is no information available.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/

CISI

Description:
The CISI collection is very similar to the CACM collection and uses the same notations.

Documents:
The file CISI.ALL contains 1,460 texts. For a detailed explanation on the notation see the CACM section.

Example:

.I 6
.T
Abstracting Concepts and Methods
.A
Borko, H.
.W
     Graduate library school study of abstracting should be more than a how-to-do-it course. It should include general material on the characteristics and types of abstracts, the historical development of abstracting publications, the abstract-publishing industry (especially in the United States), and the need for standards in the preparation and evaluation of the product. These topics we call concepts.
     The text includes a methods section containing instructions for writing various types of abstracts, and for editing and preparing abstracting publications. These detailed instructions are supplemented by examples and exercises in the appendix. There is a brief discussion of indexing of abstract publications.
     Research on automation has been treated extensively in this work, for we believe that the topic deserves greater emphasis than it has received in the past. Computer use is becoming increasingly important in all aspects of librarianship. Much research effort has been expended on the preparation and evaluation of computer-prepared abstracts and extracts. Students, librarians, and abstractors will benefit from knowing about this research and understanding how computer programs were researched to analyze text, select key sentences, and prepare extracts and abstracts. The benefits of this research are discussed.
     Abstracting is a key segment of the information industry. Opportunities are available for both full-time professionals and part-time or volunteer workers. Many librarians find such activities pleasant and rewarding, for they know they are contributing to the more effective use of stored information. One chapter is devoted to career opportunities for abstractors.
.X
6 6 6
363 1 6
403 1 6
461 1 6
551 1 6
551 1 6

Queries:
112 queries are stored in CISI.QRY. The notation is the same as in CACM.

Example:

.I 21
.W
The need to provide personnel for the information field.

Relevance Assessments:
In CISI.REL each row holds the query-ID, the relevant document-ID, and two trailing columns (0 and 0.000000). Every relevant document has its own row, which makes the file easy to turn into training data.

Example:

    21      6 0 0.000000
    21     14 0 0.000000
    21     22 0 0.000000
    21     85 0 0.000000
    21    171 0 0.000000
    21    185 0 0.000000
    21    186 0 0.000000
    21    303 0 0.000000
    21    339 0 0.000000
    21    392 0 0.000000
    21    400 0 0.000000

Other Files:
CISI.BLN - List of boolean queries

Parsing Problems:

  • Documents - Some entries contain more than one author, resulting in more than one author tag.
  • Documents - Not every entry has information on the publication date; the tag is omitted when no information is available.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/

Cranfield

Description:
Cranfield uses the same notation as CACM and CISI, but the relevance evaluations are more detailed and more specific to the query tasks. While using this collection, it is important to note that the ID of the queries, e.g. .I 002, is not the same query-ID used in the relevance assessments. It seems that the query-IDs come from the order in the file. It may be helpful to update the IDs before working with them to avoid confusion.

Documents:
The file cran.all holds 1,400 documents; for detailed notation see the CACM section. In contrast to CACM and CISI, there are no cross-references to other documents listed.

Example:

.I 5
.T
one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .
.A
wasserman,b.
.B
j. ae. scs. 24, 1957, 924.
.W
one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .  analytic solutions are presented for the transient heat conduction in composite slabs exposed at one surface to a triangular heat rate .  this type of heating rate may occur, for example, during aerodynamic heating .

Queries:
The file cran.qry contains 225 queries, mostly questions, with some queries being term searches. The ID at the beginning of every query is not the reference-ID used in cranqrel.

Example:

.I 004
.W
what problems of heat conduction in composite slabs have been solved so far .

Relevance Assessments:
Each row in cranqrel holds a query-ID, a relevant document-ID, and a relevancy code (1, 2, 3, 4, or 5; the file also contains a few rows with -1, as in the example below). Every document has its own row. The relevancy codes are defined as follows:

1: References which are a complete answer to the question.
2: References of a high degree of relevance, the lack of which either would have made the research impracticable or would have resulted in a considerable amount of extra work.
3: References which were useful, either as general background to the work or as suggesting methods of tackling certain aspects of the work.
4: References of minimum interest, for example, those that have been included from an historical viewpoint.
5: References of no interest.

Example:

3   5  3
3   6  3
3  90  3
3  91  3
3 119  3
3 144  3
3 181  3
3 399  3
3 485 -1
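A sketch of reading such graded rows and binarizing them, keeping only documents judged 1-3 (complete answer, high relevance, or useful) as relevant. The helper name graded_qrels is our own:

```python
def graded_qrels(lines, max_grade=3):
    """Keep (query, document) pairs whose relevancy code is 1..max_grade."""
    relevant = {}
    for line in lines:
        qid, did, grade = (int(t) for t in line.split())
        if 1 <= grade <= max_grade:          # drops grade 4/5 and the -1 rows
            relevant.setdefault(qid, []).append(did)
    return relevant

rels = graded_qrels(["3 5 3", "3 6 3", "3 485 -1"])  # rels == {3: [5, 6]}
```

Where the grade cutoff should sit depends on the experiment; max_grade is exposed so it can be changed.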

Other Files: -

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/

LISA

Description:
The LISA collection, contributed in 1982 by Peter Willett of Sheffield University, is provided to support research investigations. Entries are delimited by clear separator marks, which makes the data structure easy to understand. The queries are very specific and quite long.

Documents:
There are a total of 6,004 labeled abstracts stored in files LISA0.001 to LISA5.850. The beginning of every entry is marked by a unique ID followed by the title. Some title line entries also display information about the author, time, and place. A row of 44 *s separates the entries.

Example:

Document 5640

NORDINFO COURSE IN BIBLIOMETRICS, 1981-10-26-29, HANASSARI, HELSINKI, SWEDISH-FINNISH CULTURAL CENTRE (IN SWEDISH).

DESCRIBES THE DISCIPLINE OF BIBLIOMETRICS WITH REFERENCE TO A NORDINFO COURSE ATTENDED IN HELSINKI. DEFINES BIBLIOMETRICS AS QUANTITATIVE MEASUREMENT OF LIBRARY TECHNIQUES. BIBLIOMETRIC METHODS CAN BE USED ON A GLOBAL SCALE, BUT THEY CAN ALSO BE A TOOL TO CALCULATE THE BEST USE OF AN INDIVIDUAL LIBRARY'S RESOURCES. AMONG THE EMPIRICAL LAWS DEVELOPED, BRADFORD'S LAW IS THE MOST APPLICABLE. DESCRIBES THE USE OF THIS LAW TO OBTAIN A GRAPH WITH AN EXPONENTIALLY INCREASING PART, A SO-CALLED BRADFORD GRAPH. THIS CAN SHOW THE RELATIONSHIP BETWEEN AUTHORS AND NUMBER OF ARTICLES, ARTICLES AND QUOTATIONS AND CAN HELP TO DETERMINE THE PROCESS OF OBSOLESCENCE OF LITERATURE IN A LIBRARY. OTHER BIBLIOMETRIC LAWS ARE THOSE OF LOTKA AND ZIPF. THE METHODS ARE USEFUL IN ACQUISITION, PLANNING OF SPACE ALLOCATION, WITHDRAWALS, LIBRARY USE, AND OTHER AREAS OF LIBRARY ADMINISTRATION.
********************************************
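A sketch of splitting a LISA file into abstracts on the 44-asterisk separator row described above. The helper name split_lisa is our own:

```python
SEPARATOR = "*" * 44  # the row of 44 asterisks between entries

def split_lisa(text):
    """Return one stripped chunk (ID line, title, abstract) per entry."""
    return [chunk.strip() for chunk in text.split(SEPARATOR) if chunk.strip()]

sample = ("Document 1\nFIRST ABSTRACT.\n" + SEPARATOR +
          "\nDocument 2\nSECOND ABSTRACT.\n" + SEPARATOR)
docs = split_lisa(sample)  # two entries
```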

Queries:
Thirty-five (35) queries are stored in file LISA.QUE. The query-ID is followed by several long sentences, all of which are written from a first person perspective, expressing interest in certain topics. The entries are separated by a #.

Example:

24
I AM INTERESTED IN ALMOST ANYTHING TO DO WITH AUTOMATIC DOCUMENT CLASSIFICATION: SEARCH STRATEGIES FOR HIERARCHICAL AND NON-HIERARCHICAL CLUSTERS, CLUSTERING ALGORITHMS, THE CREATION OF CLUSTER REPRESENTATIVES, RETRIEVAL EXPERIMENTS USING CLUSTERED FILES AND MEASURES OF INTER-DOCUMENT SIMILARITY. RELATED TO THIS IS AN INTEREST IN TERM CLASSIFICATIONS, THEIR APPLICATION IN RETRIEVAL, INTER-TERM SIMILARITIES ETC. AUTOMATIC DOCUMENT CLASSIFICATION, CLUSTERS, CLUSTERING, TERM CLASSIFICATIONS.
#

Relevance Assessments:
Relevance assessments are stored in LISARJ.NUM. Each group starts with the query-ID (first column), followed by the number of relevant documents and then the relevant document-IDs themselves. Depending on the program used to open or parse the file, the document-IDs may continue onto the next line. Since there is no end marker, it is best to parse using the query-ID and the stated number of relevant documents.

Example:

1   2   3392  3396
2   2   2623  4291
3   5   1407  1431  3794  3795  3796
4   7    604  3527  4644  5087  5112  5113  5295
5   1   3401
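The count-driven parse recommended above can be sketched as follows: read the whole file as one token stream so that line wrapping no longer matters. The helper name parse_lisa_rj is our own:

```python
def parse_lisa_rj(text):
    """Each group is: query-ID, count n, then n relevant document-IDs."""
    tokens = [int(t) for t in text.split()]
    qrels, i = {}, 0
    while i < len(tokens):
        qid, n = tokens[i], tokens[i + 1]
        qrels[qid] = tokens[i + 2 : i + 2 + n]
        i += 2 + n                      # jump to the next group
    return qrels

sample = "1  2  3392 3396   2  2  2623 4291   3  5  1407 1431 3794 3795 3796"
rels = parse_lisa_rj(sample)
```

Because split() ignores newlines, this works no matter where a group is wrapped.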

Other Files:
The LISA.REL file is an old version of the relevance assessments which helps in understanding the new version, but provides no further meaningful information.

Parsing Problems:

  • Documents - The documents are spread across several files, and only newlines separate the ID, title, and text, so the entry boundaries must be inferred.
  • Relevance Assessments - These are not parseable line by line; use the stated number of relevant documents to parse.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/lisa/

Medline

Description:
A document collection of short medical articles with very specific queries.

Documents:
The file MED.ALL holds 1,033 articles with a notation similar to Cranfield but with less information. Only the document-ID (.I) and the text (.W) are marked.

Example:

.I 1
.W
correlation between maternal and fetal plasma levels of glucose and free fatty acids .
correlation coefficients have been determined between the levels of glucose and ffa in maternal and fetal plasma collected at delivery .  significant correlations were obtained between the maternal and fetal glucose levels and the maternal and fetal ffa levels . from the size of the correlation coefficients and the slopes of regression lines it appears that the fetal plasma glucose level at delivery is very strongly dependent upon the maternal level whereas the fetal ffa level at delivery is only slightly dependent upon the maternal level .

Queries:
Thirty (30) specific queries are stored in MED.QRY. The notation is (.I) for the query-ID and (.W) for the query text. A query may contain more than one sentence, separated by periods.

Example:

.I 5
.W
the crossing of fatty acids through the placental barrier.  normal fatty acid levels in placenta and fetus.

Relevance Assessments:
In each row of MED.REL the query-ID is followed by a 0, which separates it from the relevant document-ID, and a trailing 1. There is one row per (query-ID, relevant document-ID) pair.

Example:

5 0 1 1
5 0 2 1
5 0 4 1
5 0 5 1
5 0 6 1
5 0 7 1
5 0 8 1
5 0 9 1
5 0 10 1
5 0 11 1
5 0 12 1
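A sketch of reading such four-column rows, discarding the constant 0 and 1 columns. The helper name parse_med_rel is our own:

```python
def parse_med_rel(lines):
    """Rows are 'query-ID 0 document-ID 1'; keep only the two IDs."""
    rels = {}
    for line in lines:
        qid, _zero, did, _one = (int(t) for t in line.split())
        rels.setdefault(qid, []).append(did)
    return rels

rels = parse_med_rel(["5 0 1 1", "5 0 2 1", "5 0 4 1"])  # {5: [1, 2, 4]}
```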

Other Files:
MED.REL.OLD - An older version of the relevance assessments. Here the IDs are followed by 0 and 0.000000 and can be used to create training vectors.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/medl/

NPL

Description:
This collection was contributed by Vaswani and Cameron at the National Physical Laboratory in the UK in 1970. The end-of-entry marker is consistent across every file, although the collection offers little context for pragmatic search experiments.

Documents:
The files doc-text & doc-vecs contain 11,429 entries. doc-text provides them in text form with unique IDs which match the vector representations in the doc-vecs file. Every entry ends with /.

Example:

141
some aspects of the logical and circuit design of a digital field computer  a new type of digital computer for the solution of field problems is described  by making calculations at all the lattice points of the field simultaneously computation time is greatly reduced  an experimental design of a basic unit for potential and other problems is presented   /

Same text represented as a vector, in a different order, and without irrelevant terms:

141     3     5     7    23    27    33    34    42    54    71  101   109   155   161   162   224   272   304   315   345  534   582   597   626  1215 /

Queries:
Ninety-three (93) queries are stored in query-text & query-vecs as both text and vector representation. The ID is followed by the query, and the entry is finished with a /.

Example:

3
USE OF DIGITAL COMPUTERS IN THE DESIGN OF BAND PASS FILTERS HAVING GIVEN PHASE AND ATTENUATION CHARACTERISTICS/

Same query represented as a vector, in a different order, and without irrelevant terms:

  3  1  10  23  35  71  76  77  97  191  224  309  360 /

Relevance Assessments:
In rlv-ass the query IDs are followed by the relevant document IDs. The entries are separated by /.

Example:

3  141   148   813  1610  2429  3059  3272  3398  3614  3688
  3708  4437  4710  4725  4833  5476  5662  5856  5976  6351
  6885  6974  7086  7177  7304  7571  8007  8232  8957  9289
 10174 10484 10486   /
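A sketch of splitting an NPL file on the / end marker; tokenizing with split() absorbs the irregular spaces and newlines mentioned under Parsing Problems below. The helper name parse_npl_entries is our own:

```python
def parse_npl_entries(text):
    """Split on '/'; the first number of each chunk is the entry ID."""
    entries = []
    for chunk in text.split("/"):
        tokens = chunk.split()          # collapses all whitespace runs
        if tokens:
            entries.append((int(tokens[0]), [int(t) for t in tokens[1:]]))
    return entries

sample = "3  141   148\n 813  1610  /\n4  200 300 /"
entries = parse_npl_entries(sample)
```

For doc-text and query-text, the same split on / applies, but the chunk body is words rather than IDs.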

Other Files:

  • term-vocab contains vocabulary stems with representative IDs. The end marker is /.
  • term-vecs contains the occurrences of search terms in the documents. The first ID is always the vocab ID, followed by all of the document IDs in which the terms occurred. The end marker is /.
  • term-mst contains word IDs, followed by context-word IDs, followed by co-occurrence counts and mutual similarity values. This is only available for words which occurred in at least 2 documents.

Parsing Problems:

  • The number of spaces between the separator / and the entries differs between documents, queries, and relevance assessments.
  • Relevance Assessments - There are many spaces and newlines to remove before getting at the raw numbers.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/npl/

Time

Description:
This collection contains 423 articles from issues of TIME Magazine from the 1960s. With only 423 documents it is a rather small data set. Labeling problems are present here as well: the IDs used in the relevance assessments do not correspond to the unique text numbers that mark the start of a document.

Documents:
The file TIME.ALL stores 423 documents, the first of which is labeled as *TEXT 017 and the last one *TEXT 563. Every document starts with a line that specifies text, date, and page number. The lines are followed by sentences of unlabeled text. The example is shortened as the articles are quite long.

Example:

*TEXT 017 01/04/63 PAGE 020

THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE MADE NO ATTEMPT TO DEVISE A PLAN . LAST WEEK, AS THEY STUDIED THE NASSAU ACCORD BETWEEN PRESIDENT KENNEDY AND PRIME MINISTER MACMILLAN, EUROPEANS SAW EMERGING THE FIRST OUTLINES OF THE NUCLEAR NATO THAT THE U.S . WANTS AND WILL SUPPORT . IT ALL SPRANG FROM THE ANGLO-U.S . CRISIS OVER CANCELLATION OF THE BUG-RIDDEN SKYBOLT MISSILE, AND THE U.S . OFFER TO SUPPLY BRITAIN AND FRANCE WITH THE PROVED POLARIS (TIME, DEC . 28) . THE ONE ALLIED LEADER WHO UNRESERVEDLY WELCOMED THE POLARIS...

Queries:
The 83 queries, stored in TIME.QUE, are marked with *FIND followed by the query ID.

Example:

*FIND     46

PRESIDENT DE GAULLE'S POLICY ON BRITISH ENTRY INTO THE COMMON MARKET .

Relevance Assessments:
Every row of TIME.REL shows the query-ID followed by the relevant document-IDs. The document-IDs are not the text numbers; instead they refer to the order in which the documents appear in the TIME.ALL file.

Example:

46   1  20  23  32  39  47  53  54  80  93 151 157 174 202 272 291 294 348
47  23  47  48  53  54  56
48 306
49  47  56  81 103 150 183 205 291
50 157
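The varying spacing noted under Parsing Problems is harmless if each row is tokenized with split(), which collapses any run of whitespace. A sketch, with parse_time_rel as our own helper name:

```python
def parse_time_rel(lines):
    """First token of each row is the query-ID; the rest are document-IDs."""
    rels = {}
    for line in lines:
        qid, *docs = (int(t) for t in line.split())
        rels[qid] = docs
    return rels

rels = parse_time_rel(["46   1  20  23", "48 306"])
```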

Other Files:
TIME.STP - List of stop words

Parsing Problems:

  • Relevance Assessments - Spaces between the ID and the first document ID vary greatly.

Source: http://ir.dcs.gla.ac.uk/resources/test_collections/time/

Other Data Sets

Europarl Parallel Corpus

Description:
The corpus, collected since 2001, contains up to 60 million words per language. Texts were extracted from the proceedings of the European Parliament in these 21 European languages:
Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.

Documents:
Every file has the same structure: sections are separated by the tag <CHAPTER ID=?>, which is followed by the title of the chapter. The <SPEAKER ID=?> tag carries a unique ID and marks a spoken contribution.

Example from the English Corpus:

<CHAPTER ID=6>
Social and economic situation and development of the regions of the Union
<SPEAKER ID=80 NAME="President">
The next item is the debate on the report (A5-0107/1999) by Mr Berend, on behalf of the Committee on Regional Policy, Transport and Tourism, on the sixth periodic report on the social and economic situation and development of the regions of the European Union [SEC(99)0066 - C5-0120/99 - 1999/2123(COS)].
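A sketch of pulling speaker turns out of such a file with a regular expression on the SPEAKER tags; the attribute layout is assumed from the example above, and split_speakers is our own helper name:

```python
import re

# Capture the numeric ID; allow further attributes such as NAME="...".
SPEAKER = re.compile(r'<SPEAKER ID=(\d+)[^>]*>')

def split_speakers(chapter_text):
    """Return (speaker_id, spoken_text) pairs for one chapter."""
    parts = SPEAKER.split(chapter_text)
    # parts = [preamble, id1, text1, id2, text2, ...]
    return [(int(parts[i]), parts[i + 1].strip())
            for i in range(1, len(parts), 2)]

sample = '<SPEAKER ID=80 NAME="President">The next item is the debate.'
turns = split_speakers(sample)
```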

Queries: -

Relevance Assessments: -

Other Files: -

Source: http://www.statmt.org/europarl/

MS MARCO Document Ranking

Description:
MS MARCO stands for "Microsoft Machine Reading Comprehension". It is a large-scale dataset which can be used for machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. It holds over 3 million (3,213,835) documents and more than one million (1,010,916) unique queries extracted from anonymized Bing usage logs.

For our case we focused on the MS MARCO Document Ranking corpus, which is available through GitHub.

Documents:
The documents are downloadable as a tsv-file (msmarco-docs.tsv), which makes them easy to parse. There are over 3 million (3,213,835) documents; each row holds the document ID, the link to the website the document was extracted from, the title, and the document text, all separated by tabs, and every entry starts on a new line. If there is no title available, there is a "." in its place.

Example:

D1555982       https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR      The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.?      Science & Mathematics Physics The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.? It is a good approximation to assume that the emissivity e is equal to 1 for these surfaces. Find the radius of the star Rigel, the bright blue star in the constellation Orion that radiates energy at a rate of 2.7 x 10^32 W and has a surface temperature of 11,000 K. Assume that the star is spherical. Use σ =... show more Follow 3 answers Answers Relevance Rating Newest Oldest Best Answer: Stefan-Boltzmann law states that the energy flux by radiation is proportional to the forth power of the temperature: q = ε · σ · T^4 The total energy flux at a spherical surface of Radius R is Q = q·π·R² = ε·σ·T^4·π·R² Hence the radius is R = √ ( Q / (ε·σ·T^4·π) ) = √ ( 2.7x10+32 W / (1 · 5.67x10-8W/m²K^4 · (1100K)^4 · π) ) = 3.22x10+13 m Source (s):http://en.wikipedia.org/wiki/Stefan_bolt...schmiso · 1 decade ago0 18 Comment Schmiso, you forgot a 4 in your answer. Your link even says it: L = 4pi (R^2)sigma (T^4). Using L, luminosity, as the energy in this problem, you can find the radius R by doing sqrt (L/ (4pisigma (T^4)). Hope this helps everyone. Caroline · 4 years ago4 1 Comment (Stefan-Boltzmann law) L = 4pi*R^2*sigma*T^4 Solving for R we get: => R = (1/ (2T^2)) * sqrt (L/ (pi*sigma)) Plugging in your values you should get: => R = (1/ (2 (11,000K)^2)) *sqrt ( (2.7*10^32W)/ (pi * (5.67*10^-8 W/m^2K^4))) R = 1.609 * 10^11 m? · 3 years ago0 1 Comment Maybe you would like to learn more about one of these? Want to build a free website? Interested in dating sites? Need a Home Security Safe? How to order contacts online?

Queries:
There are over one million (1,010,916) unique, real queries, each assigned a unique ID. The queries can all be found in the Q&A corpus; for the Document Ranking corpus only about ⅛ of the queries were used. Those were split into train, development, and test sets. We used the development set for our experiments because the Ranking Evaluation API of Elasticsearch requires relevance assessments, which are only provided for the development set. The development queries (5,193) are stored in the file msmarco-docdev-queries.tsv. Each row of the file holds the query-ID and the query text, separated by one tab.

Example:

121352    define extreme
634306    what does chattel mean on credit history
920825    what was the great leap forward brainly
510633    tattoo fixers how much does it cost
737889    what is decentralization process.
278900    how many cars enter the la jolla concours d' elegance?
674172    what is a bank transit number
303205    how much can i contribute to nondeductible ira
570009    what are the four major groups of elements
492875    sanitizer temperature
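Since the file is tab-separated, the standard csv module handles it directly. A sketch, with read_tsv_queries as our own helper name:

```python
import csv
import io

def read_tsv_queries(fileobj):
    """Map query-ID to query text; rows are 'ID<TAB>text'."""
    return {qid: text for qid, text in csv.reader(fileobj, delimiter="\t")}

sample = io.StringIO(
    "121352\tdefine extreme\n"
    "634306\twhat does chattel mean on credit history\n"
)
queries = read_tsv_queries(sample)
```

For real use, pass an open file handle instead of the io.StringIO stand-in.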

Relevance Assessments:
The relevance assessments can be found in the msmarco-docdev-qrels.tsv file. Each row holds the query ID, a 0, the document-ID and a 1; all separated by a space.

Example:

2 0 D1650436 1
1215 0 D1202771 1
1288 0 D1547717 1
1576 0 D1313702 1
2235 0 D2113408 1
2798 0 D2830290 1
2962 0 D125453 1

Other Files:
msmarco-docs-lookup.tsv
msmarco-doctrain-queries.tsv
msmarco-docdev-top100
docleaderboard-queries.tsv
docleaderboard-top100

Parsing Problems: -

Source: https://github.com/microsoft/MSMARCO-Document-Ranking

OHSUMED

Description:
This collection was originally created by William Hersh as a large medical test collection for experiments with the SMART retrieval system. It was later divided into training and test sets.
The split version contains 20,000 abstracts, while the unsplit version contains 50,216; both are sorted into 23 medical categories. Since the official source is no longer available online, we refer to the download files provided by Alessandro Moschitti.

Documents:
The file Cardiovascular Diseases Abstract contains 20,000 documents divided into Training and Test set directories. Inside those directories are 23 folders representing the 23 categories the abstracts are assigned to. Each document is stored in its own file without special notation.

Example:

Haemophilus influenzae meningitis with prolonged hospital course. A retrospective evaluation of Haemophilus influenzae type b meningitis observed over a 2-year period documented 86 cases. Eight of these patients demonstrated an unusual clinical course characterized by persistent fever (duration: greater than 10 days), cerebrospinal fluid pleocytosis, profound meningeal enhancement on computed tomography, significant morbidity, and a prolonged hospital course. The mean age of these 8 patients was 6 months, in contrast to a mean age of 14 months for the entire group. Two patients had clinical evidence of relapse. Four of the 8 patients tested for latex particle agglutination in the cerebrospinal fluid remained positive after 10 days. All patients received antimicrobial therapy until they were afebrile for a minimum of 5 days. Subsequent neurologic examination revealed a persistent seizure disorder in 5 patients (62.5%), moderate-to-profound hearing loss in 2 (25%), mild ataxia in 1 (12.5%), and developmental delay with hydrocephalus which required shunting in 1 (12.5%). One patient had no sequelae.

The download link All Cardiovascular Diseases Abstracts contains 50,216 abstracts, which are only sorted into the 23 category folders.

Queries: -

Relevance Assessments: -

Other Files:
Category Description - Defines the 23 categories

Source: ftp://medir.ohsu.edu/pub/ohsumed
(If the original source is no longer available online, please consult the website of Alessandro Moschitti at the University of Trento where you will find download links for OHSUMED and Reuters http://disi.unitn.eu/moschitti/corpora.htm)

Reuters 21578

Description:
This collection was originally gathered and labeled by Carnegie Group, Inc. and Reuters, Ltd. Further information is available in the README.txt document.
The files are in SGML format, so it is important to understand that markup language first. The collection is already split into training and test sets by the LEWISSPLIT attribute. Topics and the relations between them are well documented, which makes this collection very useful for pragmatic search development. However, only about half of the documents were manually assigned topics; documents that were not used are marked with LEWISSPLIT="NOT-USED".
The collection is a multi-labeled one; as such, one document can be assigned to more than one topic.

Documents:
The files reut2-000.sgm to reut2-021.sgm contain 21,578 documents in SGML format. Each file begins with the declaration <!DOCTYPE lewis SYSTEM "lewis.dtd">, and every document is enclosed in a <REUTERS> ... </REUTERS> element, so the entries are clearly distinguishable by their notation.

Example:

<REUTERS TOPICS="BYPASS" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="2984" NEWID="14001">
<DATE> 7-APR-1987 11:02:35.07</DATE>
<TOPICS></TOPICS>
<PLACES></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>&#5;&#5;&#5;E&#22;&#22;&#1;f1137&#31;reuteb f BC-JOHANNESBURG-GOLD-SHA   04-07 0120</UNKNOWN>
<TEXT>&#2;
<TITLE>JOHANNESBURG GOLD SHARES CLOSE MIXED TO FIRMER</TITLE>
<DATELINE>    JOHANNESBURG, April 7 - </DATELINE>
<BODY>Gold share prices closed mixed to slightly firmer in quiet and cautious trading, showing little reaction to a retreat in the bullion price back to below 420 dlrs and a firmer financial rand, dealers said.
    Heavyweight Vaal Reefs ended eight rand higher at 398 rand but Grootvlei eased 40 cents at 16.60 rand, while mining financials had Gold Fields up a rand at 63 rand despite weaker quarterly results. Other minings were firm but platinums eased.
    Industrials also closed mixed to firmer, the index once again hitting a new high of 1757 from Friday's 1753 finish. The overall index also hit a new high of 2188 versus 2179 on Friday.
 REUTER
&#3;</BODY></TEXT>
</REUTERS>

The attributes specified after <REUTERS are meant to identify documents and groups of documents, and have the following meanings:

    1. TOPICS: The possible values are YES, NO, and BYPASS:
        a. YES indicates that (in the original data) there was at least one entry in the TOPICS field.
        b. NO indicates that (in the original data) the story had no entries in the TOPICS field.
        c. BYPASS indicates that (in the original data) the story was marked with the string "bypass" (or a typographical variant of that string).
        Note that this attribute should not be used for topic search: a story marked NO may still have topics, and likewise a story marked YES may have none.
    2. LEWISSPLIT: The possible values are TRAINING, TEST, and NOT-USED.
        a. TRAINING indicates that the story was used in the training sets in the experiments reported in LEWIS91d (Chapters 9 and 10), LEWIS92b, LEWIS92e, and LEWIS94b.
        b. TEST indicates it was used in the test set for those experiments.
        c. NOT-USED means it was not used in those experiments.
    3. CGISPLIT: The possible values are TRAINING-SET and PUBLISHED-TESTSET, indicating whether the document was in the training set or the test set for the experiments reported in HAYES89 and HAYES90b.
    4. OLDID: The identification number (ID) the story had in the Reuters-22173 collection.
    5. NEWID: The identification number (ID) the story has in the Reuters-21578, Distribution 1.0 collection. These IDs are assigned to the stories in chronological order.

For more detailed descriptions see the VI. Formatting section of the README.txt.
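Because every record carries a LEWISSPLIT attribute, the published train/test split can be reconstructed directly from the raw files. Here is a hedged sketch (the helper name and the synthetic sample are illustrative; with the real corpus you would read each reut2-*.sgm file and merge the resulting groups):

```python
import re

def group_by_split(sgml_text):
    """Group raw <REUTERS>...</REUTERS> records by their LEWISSPLIT attribute."""
    groups = {}
    for record in re.findall(r"<REUTERS.*?</REUTERS>", sgml_text, re.S):
        m = re.search(r'LEWISSPLIT="([^"]+)"', record)
        groups.setdefault(m.group(1) if m else "UNKNOWN", []).append(record)
    return groups

# Tiny synthetic demo with one record per split value:
sample = ('<REUTERS LEWISSPLIT="TRAIN" NEWID="1">one</REUTERS>'
          '<REUTERS LEWISSPLIT="TEST" NEWID="2">two</REUTERS>'
          '<REUTERS LEWISSPLIT="NOT-USED" NEWID="3">three</REUTERS>')
groups = group_by_split(sample)
```

The split values are taken from the data as-is via `setdefault` rather than hardcoded, which keeps the sketch robust to variants such as "TRAIN" in the files versus "TRAINING" in the README.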

Queries:
There are no queries, but there are several files which contain the topics, places, people, etc. as strings. See Other Files.

Relevance Assessments: -

Other Files:
all-exchanges-strings.lc.txt - Alphabetical list of exchange categories
all-orgs-strings.lc.txt - Alphabetical list of organization categories
all-people-strings.lc.txt - Alphabetical list of names
all-places-strings.lc.txt - Alphabetical list of places
all-topics-strings.lc.txt - Alphabetical list of topics

Example:

acq
alum
austdlr
austral
barley
bfr
bop
can
carcass
castor-meal
castor-oil
castorseed
citruspulp
cocoa
coconut
coconut-oil
coffee
copper
copra-cake
corn
corn-oil

cat-descriptions_120396.txt - List of categories, with the number of items labeled with each

Example:

Currency Codes (27)
U.S. Dollar (DLR)
Australian Dollar (AUSTDLR)
Hong Kong Dollar (HK)
Singapore Dollar (SINGDLR)
New Zealand Dollar (NZDLR)
Canadian Dollar (CAN)
Sterling (STG)
D-Mark (DMK)
Japanese Yen (YEN)
Swiss Franc (SFR)
French Franc (FFR)
Belgian Franc (BFR)
Netherlands Guilder/Florin (DFL)
Italian Lira (LIT)
Danish Krone/Crown (DKR)
Norwegian Krone/Crown (NKR)
Swedish Krona/Crown (SKR)
Mexican Peso (MEXPESO)
Brazilian Cruzado (CRUZADO)
Argentine Austral (AUSTRAL)
Saudi Arabian Riyal (SAUDRIYAL)
South African Rand (RAND)
Indonesian Rupiah (RUPIAH)
Malaysian Ringitt (RINGGIT)
Portuguese Escudo (ESCUDO)
Spanish Peseta (PESETA)
Greek Drachma (DRACHMA)
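Since these .lc.txt files hold one category string per line, loading them is a one-liner. A minimal sketch (the helper name and the in-memory demo string are illustrative; with the real file you would pass e.g. open("all-topics-strings.lc.txt")):

```python
import io

def load_category_strings(fileobj):
    """Return one category string per non-blank line."""
    return [line.strip() for line in fileobj if line.strip()]

# In-memory stand-in for a snippet of all-topics-strings.lc.txt:
topics = load_category_strings(io.StringIO("acq\nalum\naustdlr\naustral\n"))
```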

Source: http://www.daviddlewis.com/resources/testcollections/reuters21578/
(If the original source is not available, please consult the website of Alessandro Moschitti at the University of Trento, where you will find download links for OHSUMED and Reuters: http://disi.unitn.eu/moschitti/corpora.htm)

TREC#

Description:
TREC (the Text REtrieval Conference) is not just one collection, but a conference which has been held regularly since 1992. The TREC workshop series - among other goals - tries to encourage IR research based on large test collections and to improve communication between IR research and development.
TREC has produced many test collections, all of which contain a set of documents, a set of topics (questions), and a set of relevance judgments (answers). The collections can be downloaded from the TREC website but are usually copyrighted and must be licensed. The process to license a collection can be found on the data page entry for the collection of interest (https://trec.nist.gov/data.html).
The collections cover various areas of interest such as “Chemical IR”, “Conversational Assistance”, “Legal”, “Medical”, “News”, “Spoken Document Retrieval”, etc. Notable features of TREC are its constant growth and its high standards regarding a homogeneous notation, which can be very helpful for developing and testing NLP algorithms.
They also provide tools to process the data. Publications about TREC are published by NIST (National Institute of Standards and Technology) and are accessible here: http://trec.nist.gov/pubs.html. Since there are many different data sets, we provide no examples for this collection.

Documents: -

Queries: -

Relevance Assessments: -

Other Files: -

Source: https://trec.nist.gov/



Acknowledgements:
Thanks to Kenny Hall and Irina Temnikova for proofreading this article.

Written by Miriam Rupprecht, June 2020