Free-To-Use Data Sets
In this guide we provide a detailed overview of all of the freely accessible data sets we have analyzed so far.
The following table gives a quick summary of these free data sets:
Corpus | Download | Documents | Queries | Notation |
---|---|---|---|---|
ADI | Free | 82 | 35 | .I, .T, .W, .A |
CACM | Free | 3,204 | 64 | .I, .T, .W, .B, .A, .N, .X |
CISI | Free | 1,460 | 112 | .I, .T, .W, .B, .A, .X |
Cranfield | Free | 1,400 | 225 | .I, .T, .W, .B, .A |
LISA | Free | 6,004 | 35 | *, # |
MS MARCO | Free | 3,213,835 | 5,193 | TSV format |
Medline | Free | 1,033 | 30 | .I, .W |
NPL | Free | 11,429 | 93 | / |
Time | Free | 423 | 83 | *TEXT,*FIND |
Reuters 21578 | Free | 21,578 | - | SGML format |
OHSUMED | Free | 20,000 | - | None |
Europarl Parallel Corpus | Free | N.A. | - | <CHAPTER ID=?>, <SPEAKER ID=?> |
Data Sets We Used in Our Experiments
#
ADI
Description:
An extremely small data set containing fewer than 100 documents. It offers little beyond experimenting with machine learning on very small data sets.
Documents:
The file ADI.ALL contains eighty-three (83) documents, which use the same notation as Cranfield.
Example:
.I 3
.T
an important need and opportunity for a. d. i. leadership in information science education .
.A
R. L. TAYLOR
.W
current trends in information science education appear inadequate for the important need of the nation's practicing professional personnel for training in becoming information specialists or more proficient users of information systems . a particular educational program by a. d. i. is suggested to supplement others in meeting this presumptive need .
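The dot-letter markers make the file easy to split programmatically. Below is a minimal parsing sketch in Python (the function name is ours, not part of the collection) that groups each document's fields into a dictionary; because missing tags are simply skipped, it also copes with the absent author information noted under Parsing Problems. The same sketch should work for the other dot-tagged collections (CACM, CISI, Cranfield, Medline) if the tag list is extended.

```python
def parse_smart_file(path, tags=(".T", ".A", ".W")):
    """Parse a SMART-style file (e.g. ADI.ALL) into a list of per-document dicts."""
    docs, current, field = [], None, None
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            if line.startswith(".I"):
                if current is not None:
                    docs.append(current)
                current = {".I": line[2:].strip()}   # the ID is on the same line as the tag
                field = None
            elif any(line.startswith(tag) for tag in tags):
                field = line[:2]                     # a new field starts, e.g. ".T"
                current.setdefault(field, "")
            elif current is not None and field is not None:
                current[field] += line               # body lines belong to the last seen field
    if current is not None:
        docs.append(current)
    return docs

# Documents without author information simply lack the ".A" key (see Parsing Problems).
adi_docs = parse_smart_file("ADI.ALL")
print(len(adi_docs), adi_docs[0].get(".A", "<no author>"))
```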
Queries:
Thirty-five (35) short queries, which read like test questions, are provided; the query-ID is marked with (.I) and the query text with (.W).
Example:
.I 3
.W
What is information science?  Give definitions where possible.
Relevance Assessments:
The relevance assessments given in ADI.REL are in the same format as CISI and CACM: the query-ID, followed by the relevant document-ID, followed by a 0 and a 0.0. They can be used to create trainable vectors.
Example:
3  3 0 0.000000
3 43 0 0.000000
3 45 0 0.000000
3 60 0 0.000000
4 29 0 0.000000
4 63 0 0.000000
5  3 0 0.000000
Other Files:
ADI.BLN - List of Boolean queries
Parsing Problems:
- Documents - Not every entry has author information; the tag is simply absent when no information is available.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/adi/
#
CACM
Description:
CACM is a collection of abstracts of articles published in the Communications of the ACM journal between 1958 and 1979.
It is often claimed that the collection is too small to observe any real impact from retrieval improvements.
Documents:
The file cacm.all contains 3,204 labeled entries. Each field is marked with a . followed by a letter; the fields appear in an entry in the following order:
(.I) ID
(.T) Title
(.W) Abstract
(.B) Publication date of the article
(.A) Author list
(.N) Information on when the entry was added
(.X) List of cross-references to other documents
Example:
.I 46
.T
Multiprogramming STRETCH: Feasibility Considerations
.W
The tendency towards increased parallelism in computers is noted. Exploitation of this parallelism presents a number of new problems in machine design and in programming systems. Minimum requirements for successful concurrent execution of several independent problem programs are discussed. These requirements are met in the STRETCH system by a carefully balanced combination of built-in and programmed logic. Techniques are described which place the burden of the programmed logic on system programs (supervisory program and compiler) rather than on problem programs.
.B
CACM November, 1959
.A
Codd, E. F.
Lowry, E. S.
McDonough, E.
Scalzi, C. A.
.N
CA591102 JB March 22, 1978 3:57 PM
.X
168 5 46
491 5 46
46 5 46
168 6 46
The list of cross-references shows the reference ID, followed by 4, 5, or 6, with the document-ID at the end. The three numbers define the reference type more precisely:
4: "bibliographic coupling" - if document ID Y appears in the bibliographic-coupling subvector for document X with a weight of w, it means X and Y have w common references in their bibliographies; the weight of document X in its own vector is the number of items in X's bibliography.
5: "links" - documents X and Y are linked if X cites Y, Y cites X, or X == Y.
6: "co-citations" - if document ID Y appears in the co-citation subvector for document X with weight w, it means X and Y are cited together in w documents; the weight of document X in its own vector is the number of documents that cite X.
Queries:
The file query.text contains sixty-four (64) queries which use the same markers as the documents. Not all queries hold information for every tag, and since some of that information can be misleading, it is best to use only the .I and .W tags.
They appear in the following order:
(.I) ID
(.W) Query
(.A) Author list
(.N) The author's name and some keywords describing what the query searches for
Example:
.I 33
.W
Articles about the sensitivity of the eigenvalue decomposition of real matrices, in particular, zero-one matrices. I'm especially interested in the separation of eigenspaces corresponding to distinct eigenvalues. Articles on the subject: C. Davis and W.M. Kahn, "The rotation of eigenvectors by a permutation:, SIAM J. Numerical Analysis, vol. 7, no. 1 (1970); G.W. Stewart, "Error bounds for approximate invariant subspaces of closed linear operators", SIAM J. Numerical Analysis., Vol. 8, no. 4 (1971).
.A
Davis, C.
Kahn, W.M.
Stewart, G.W.
.N
33. Bengt Aspvall (sens of eigenval decomp of real matrices)
Relevance Assessments:
In qrels.text the query-ID is followed by the document-ID, a 0 (int), and a 0.0 (float). Every judged document has its own row, so the file can be used to build trainable vectors.
Example:
10 0046 0 0
10 0141 0 0
10 0392 0 0
10 0950 0 0
10 1158 0 0
10 1198 0 0
10 1262 0 0
10 1380 0 0
10 1471 0 0
10 1601 0 0
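Since only the first two columns carry information, a small loader like the following sketch (our own naming) is enough to collect the relevant document-IDs per query; the same format is used by ADI.REL and CISI.REL:

```python
from collections import defaultdict

def load_qrels(path="qrels.text"):
    """Collect the judged relevant document-IDs per query; only the first two columns matter."""
    relevant = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            columns = line.split()
            if len(columns) >= 2:                 # query-ID, document-ID, 0, 0.0
                relevant[int(columns[0])].add(int(columns[1]))
    return relevant

qrels = load_qrels()
print(sorted(qrels[10]))   # all documents judged relevant for query 10
```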
Other Files:
cite.info - Key to the citation info
common_words - Stop words used by SMART
qrels.text - List of relevance judgments
Parsing Problems:
- Documents - Not all entries contain author or abstract text information; those tags are simply absent when no information is available.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cacm/
#
CISI
Description:
The CISI collection is very similar to the CACM collection and uses the same notation.
Documents:
The file CISI.ALL contains 1,460 texts. For a detailed explanation of the notation, see the CACM section.
Example:
.I 6
.T
Abstracting Concepts and Methods
.A
Borko, H.
.W
Graduate library school study of abstracting should be more than a how-to-do-it course. It should include general material on the characteristics and types of abstracts, the historical development of abstracting publications, the abstract-publishing industry (especially in the United States), and the need for standards in the preparation and evaluation of the product. These topics we call concepts. The text includes a methods section containing instructions for writing various types of abstracts, and for editing and preparing abstracting publications. These detailed instructions are supplemented by examples and exercises in the appendix. There is a brief discussion of indexing of abstract publications. Research on automation has been treated extensively in this work, for we believe that the topic deserves greater emphasis than it has received in the past. Computer use is becoming increasingly important in all aspects of librarianship. Much research effort has been expended on the preparation and evaluation of computer-prepared abstracts and extracts. Students, librarians, and abstractors will benefit from knowing about this research and understanding how computer programs were researched to analyze text, select key sentences, and prepare extracts and abstracts. The benefits of this research are discussed. Abstracting is a key segment of the information industry. Opportunities are available for both full-time professionals and part-time or volunteer workers. Many librarians find such activities pleasant and rewarding, for they know they are contributing to the more effective use of stored information. One chapter is devoted to career opportunities for abstractors.
.X
6 6 6
363 1 6
403 1 6
461 1 6
551 1 6
551 1 6
Queries:
112 queries are stored in CISI.QRY. The notation is the same as in CACM.
Example:
.I 21
.W
The need to provide personnel for the information field.
Relevance Assessments:
In CISI.REL the query-ID is followed by the document-ID, a 0 (int), and a 0.0 (float). Every judged document has its own row, so the file can be used to build trainable vectors.
Example:
21   6 0 0.000000
21  14 0 0.000000
21  22 0 0.000000
21  85 0 0.000000
21 171 0 0.000000
21 185 0 0.000000
21 186 0 0.000000
21 303 0 0.000000
21 339 0 0.000000
21 392 0 0.000000
21 400 0 0.000000
Other Files:
CISI.BLN - List of Boolean queries
Parsing Problems:
- Documents - Some entries have more than one author, so the author tag can appear more than once.
- Documents - Not every entry has publication-date information; that tag is absent when no information is available.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cisi/
#
Cranfield
Description:
Cranfield uses the same notation as CACM and CISI, but the relevance judgments are more detailed and more specific to the query tasks. When using this collection it is important to note that the ID given in a query, e.g. .I 002, is not the query-ID used in the relevance assessments; the query-IDs follow from the order of the queries in the file. It can be helpful to renumber the queries before working with them to avoid confusion.
Documents:
The file cran.all holds 1,400 documents; for the detailed notation see the CACM section. In contrast to CACM and CISI, no cross-references to other documents are listed.
Example:
.I 5
.T
one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .
.A
wasserman, b.
.B
j. ae. scs. 24, 1957, 924.
.W
one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal . analytic solutions are presented for the transient heat conduction in composite slabs exposed at one surface to a triangular heat rate . this type of heating rate may occur, for example, during aerodynamic heating .
Queries:
The file cran.qry contains 225 queries, mostly questions, with some being term searches. The ID at the beginning of every query is not the reference-ID used in cranqrel; a renumbering sketch follows the example below.
Example:
.I 004
.W
what problems of heat conduction in composite slabs have been solved so far .
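Because cranqrel numbers the queries by their position in the file rather than by their .I value, it can help to renumber them while reading. A minimal sketch of this (function and variable names are ours):

```python
def load_cranfield_queries(path="cran.qry"):
    """Return the queries keyed by their position in the file, i.e. the ID cranqrel uses."""
    queries, body, position = {}, [], 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(".I"):          # the .I value itself is deliberately ignored
                if body:
                    queries[position] = " ".join(body)
                position += 1
                body = []
            elif not line.startswith(".W"):
                body.append(line.strip())
    if body:
        queries[position] = " ".join(body)
    return queries

queries = load_cranfield_queries()
print(queries[4])   # the fourth query in the file, i.e. query-ID 4 in cranqrel
```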
Relevance Assessments:
Each row in cranqrel holds a query-ID, a relevant document-ID, and a relevancy code (1, 2, 3, 4, or 5; a few rows also contain -1, as in the example below). Every judged document has its own row. The relevancy codes are defined as follows:
1: References which are a complete answer to the question.
2: References of a high degree of relevance, the lack of which either would have made the research impracticable or would have resulted in a considerable amount of extra work.
3: References which were useful, either as general background to the work or as suggesting methods of tackling certain aspects of the work.
4: References of minimum interest, for example, those that have been included from an historical viewpoint.
5: References of no interest.
Example:
3 5 3
3 6 3
3 90 3
3 91 3
3 119 3
3 144 3
3 181 3
3 399 3
3 485 -1
Other Files: -
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/cran/
#
LISA
Description:
The LISA collection, contributed in 1982 by Peter Willett of Sheffield University, is provided to support research investigations.
It is punctuated by clear stop marks, which makes the data structure easy to understand. The queries are very specific and quite long.
Documents:
There are a total of 6,004 labeled abstracts stored in the files LISA0.001 to LISA5.850. The beginning of every entry is marked by a unique ID followed by the title; some title lines also carry information about the author, time, and place. A row of 44 * characters separates the entries.
Example:
Document 5640
NORDINFO COURSE IN BIBLIOMETRICS, 1981-10-26-29, HANASSARI, HELSINKI, SWEDISH-FINNISH CULTURAL CENTRE (IN SWEDISH).
DESCRIBES THE DISCIPLINE OF BIBLIOMETRICS WITH REFERENCE TO A NORDINFO COURSE ATTENDED IN HELSINKI. DEFINES BIBLIOMETRICS AS QUANTITATIVE MEASUREMENT OF LIBRARY TECHNIQUES. BIBLIOMETRIC METHODS CAN BE USED ON A GLOBAL SCALE, BUT THEY CAN ALSO BE A TOOL TO CALCULATE THE BEST USE OF AN INDIVIDUAL LIBRARY'S RESOURCES. AMONG THE EMPIRICAL LAWS DEVELOPED, BRADFORD'S LAW IS THE MOST APPLICABLE. DESCRIBES THE USE OF THIS LAW TO OBTAIN A GRAPH WITH AN EXPONENTIALLY INCREASING PART, A SO-CALLED BRADFORD GRAPH. THIS CAN SHOW THE RELATIONSHIP BETWEEN AUTHORS AND NUMBER OF ARTICLES, ARTICLES AND QUOTATIONS AND CAN HELP TO DETERMINE THE PROCESS OF OBSOLESCENCE OF LITERATURE IN A LIBRARY. OTHER BIBLIOMETRIC LAWS ARE THOSE OF LOTKA AND ZIPF. THE METHODS ARE USEFUL IN ACQUISITION, PLANNING OF SPACE ALLOCATION, WITHDRAWALS, LIBRARY USE, AND OTHER AREAS OF LIBRARY ADMINISTRATION.
********************************************
Queries:
Thirty-five (35) queries are stored in the file LISA.QUE. The query-ID is followed by several long sentences, all written from a first-person perspective and expressing interest in certain topics. The entries are separated by a #.
Example:
24
I AM INTERESTED IN ALMOST ANYTHING TO DO WITH AUTOMATIC DOCUMENT CLASSIFICATION: SEARCH STRATEGIES FOR HIERARCHICAL AND NON-HIERARCHICAL CLUSTERS, CLUSTERING ALGORITHMS, THE CREATION OF CLUSTER REPRESENTATIVES, RETRIEVAL EXPERIMENTS USING CLUSTERED FILES AND MEASURES OF INTER-DOCUMENT SIMILARITY. RELATED TO THIS IS AN INTEREST IN TERM CLASSIFICATIONS, THEIR APPLICATION IN RETRIEVAL, INTER-TERM SIMILARITIES ETC. AUTOMATIC DOCUMENT CLASSIFICATION, CLUSTERS, CLUSTERING, TERM CLASSIFICATIONS.
#
Relevance Assessments:
The relevance assessments are stored in LISARJ.NUM. Every query-ID (first column) is followed by the number of relevant documents and then the relevant document-IDs. Depending on the program used to open or parse the file, the relevant document-IDs may continue onto the next line. Since there is no end marker, it is best to parse by query-ID and the stated number of relevant document-IDs, as in the sketch after the example below.
Example:
1 2 3392 3396
2 2 2623 4291
3 5 1407 1431 3794 3795 3796
4 7 604 3527 4644 5087 5112 5113 5295
5 1 3401
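Because the document-IDs can wrap onto the next line, it is safest to read the whole file as one token stream and consume exactly the stated number of IDs per query. A minimal sketch (names are ours):

```python
def load_lisa_judgements(path="LISARJ.NUM"):
    """Parse LISARJ.NUM: query-ID, number of relevant documents, then that many document-IDs."""
    with open(path, encoding="utf-8") as handle:
        tokens = handle.read().split()        # line breaks carry no meaning, so drop them
    judgements, index = {}, 0
    while index < len(tokens):
        query_id, count = int(tokens[index]), int(tokens[index + 1])
        judgements[query_id] = [int(t) for t in tokens[index + 2 : index + 2 + count]]
        index += 2 + count
    return judgements

print(load_lisa_judgements()[3])   # [1407, 1431, 3794, 3795, 3796]
```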
Other Files:
LISA.REL - an older version of the relevance assessments, which helps in understanding the new version but provides no further meaningful information.
Parsing Problems:
- Documents - The documents are spread over several files, with only newlines separating ID, title, and text, which makes the entries harder to split reliably.
- Relevance Assessments - The file is not parseable line by line; the stated number of relevant documents should be used while parsing.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/lisa/
#
Medline
Description:
A document collection of short medical articles with very specific queries.
Documents:
The file MED.ALL holds 1,033 articles with a notation similar to Cranfield but with less information: only the document-ID (.I) and the text (.W) are marked.
Example:
.I 1
.W
correlation between maternal and fetal plasma levels of glucose and free fatty acids . correlation coefficients have been determined between the levels of glucose and ffa in maternal and fetal plasma collected at delivery . significant correlations were obtained between the maternal and fetal glucose levels and the maternal and fetal ffa levels . from the size of the correlation coefficients and the slopes of regression lines it appears that the fetal plasma glucose level at delivery is very strongly dependent upon the maternal level whereas the fetal ffa level at delivery is only slightly dependent upon the maternal level .
Queries:
Thirty (30) specific queries are stored in MED.QRY. The notation is (.I) for the query-ID and (.W) for the query text. A query can consist of more than one sentence, separated by periods.
Example:
.I 5
.W
the crossing of fatty acids through the placental barrier. normal fatty acid levels in placenta and fetus.
Relevance Assessments:
In each row of MED.REL the query-ID is followed by a 0, which separates it from the relevant document-ID, and the row ends with a 1. There is one row per query-ID and relevant document-ID combination.
Example:
5 0 1 1
5 0 2 1
5 0 4 1
5 0 5 1
5 0 6 1
5 0 7 1
5 0 8 1
5 0 9 1
5 0 10 1
5 0 11 1
5 0 12 1
Other Files:
MED.REL.OLD - an older version of the relevance assessments. Here the IDs are followed by a 0 and a 0.000000 and can be used as trainable embeddings.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/medl/
#
NPL
Description:
This collection was contributed by Vaswani and Cameron at the National Physical Laboratory in the UK in 1970.
The end-marker structure is consistent across every file, although the collection offers little context to engage with for pragmatic search.
Documents:
The files doc-text and doc-vecs contain 11,429 entries. doc-text provides them in text form with unique IDs that match the vector representation of the terms in the doc-vecs file. Every entry ends with a /.
Example:
141
some aspects of the logical and circuit design of a digital field computer a new type of digital computer for the solution of field problems is described by making calculations at all the lattice points of the field simultaneously computation time is greatly reduced an experimental design of a basic unit for potential and other problems is presented
/
Same text represented as a vector, in a different order, and without irrelevant terms:
141 3 5 7 23 27 33 34 42 54 71 101 109 155 161 162 224 272 304 315 345 534 582 597 626 1215 /
Queries:
Ninety-three (93) queries are stored in query-text and query-vecs, as both text and vector representations. The ID is followed by the query, and the entry is finished with a /.
Example:
3
USE OF DIGITAL COMPUTERS IN THE DESIGN OF BAND PASS FILTERS HAVING GIVEN PHASE AND ATTENUATION CHARACTERISTICS
/
Same query represented as a vector, in a different order, and without irrelevant terms:
3 1 10 23 35 71 76 77 97 191 224 309 360 /
Relevance Assessments:
In rlv-ass the query-IDs are followed by the relevant document-IDs. The entries are separated by a /.
Example:
3 141 148 813 1610 2429 3059 3272 3398 3614 3688 3708 4437 4710 4725 4833 5476 5662 5856 5976 6351 6885 6974 7086 7177 7304 7571 8007 8232 8957 9289 10174 10484 10486 /
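Splitting the whole file on the / end marker sidesteps the spacing and newline issues noted under Parsing Problems below. A minimal sketch (names are ours):

```python
def load_npl_assessments(path="rlv-ass"):
    """Parse rlv-ass: each '/'-terminated block holds a query-ID followed by relevant document-IDs."""
    with open(path, encoding="utf-8") as handle:
        blocks = handle.read().split("/")
    assessments = {}
    for block in blocks:
        numbers = [int(token) for token in block.split()]
        if numbers:                            # skip the empty block after the final '/'
            assessments[numbers[0]] = numbers[1:]
    return assessments

print(len(load_npl_assessments()[3]))   # number of documents judged relevant for query 3
```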
Other Files:
term-vocab - contains vocabulary stems with representative IDs; the end marker is /.
term-vecs - contains the occurrences of search terms in the documents. The first ID is always the vocabulary ID, followed by all document IDs in which the term occurs; the end marker is /.
term-mst - contains word IDs, followed by context-word IDs, co-occurrence counts, and mutual similarity values. This is only available for words which occurred in at least 2 documents.
Parsing Problems:
- The number of spaces between the separator / and the entries differs between documents, queries, and relevance assessments.
- Relevance Assessments - there are many spaces and newlines to remove before getting to the raw numbers.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/npl/
#
Time
Description:
This collection contains 423 articles from issues of TIME Magazine from the 1960s. With only 423 documents it is a rather small data set. Problems with the labeling are present here as well: the IDs used in the relevance assessments do not correspond to the unique text numbers that mark the start of each document.
Documents:
The file TIME.ALL stores the 423 documents, the first labeled *TEXT 017 and the last *TEXT 563. Every document starts with a line specifying the text number, date, and page number, followed by sentences of unlabeled text. The example is shortened, as the articles are quite long.
Example:
*TEXT 017 01/04/63 PAGE 020 THE ALLIES AFTER NASSAU IN DECEMBER 1960, THE U.S . FIRST PROPOSED TO HELP NATO DEVELOP ITS OWN NUCLEAR STRIKE FORCE . BUT EUROPE MADE NO ATTEMPT TO DEVISE A PLAN . LAST WEEK, AS THEY STUDIED THE NASSAU ACCORD BETWEEN PRESIDENT KENNEDY AND PRIME MINISTER MACMILLAN, EUROPEANS SAW EMERGING THE FIRST OUTLINES OF THE NUCLEAR NATO THAT THE U.S . WANTS AND WILL SUPPORT . IT ALL SPRANG FROM THE ANGLO-U.S . CRISIS OVER CANCELLATION OF THE BUG-RIDDEN SKYBOLT MISSILE, AND THE U.S . OFFER TO SUPPLY BRITAIN AND FRANCE WITH THE PROVED POLARIS (TIME, DEC . 28) . THE ONE ALLIED LEADER WHO UNRESERVEDLY WELCOMED THE POLARIS...
Queries:
The 83 queries, stored in TIME.QUE, are marked with *FIND followed by the query-ID.
Example:
*FIND 46 PRESIDENT DE GAULLE'S POLICY ON BRITISH ENTRY INTO THE COMMON MARKET .
Relevance Assessments:
Every row of TIME.REL shows the query-ID followed by the relevant document-IDs. The document-IDs are not the *TEXT numbers; instead they reflect the order in which the documents appear in the TIME.ALL file, as illustrated in the sketch after the example below.
Example:
46 1 20 23 32 39 47 53 54 80 93 151 157 174 202 272 291 294 348
47 23 47 48 53 54 56
48 306
49 47 56 81 103 150 183 205 291
50 157
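Since the judgments use positional numbering, a mapping from position to *TEXT number can be built while scanning TIME.ALL. A minimal sketch (names are ours):

```python
def map_time_positions(path="TIME.ALL"):
    """Map the positional document-IDs used in TIME.REL to the *TEXT numbers in TIME.ALL."""
    position_to_text = {}
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            if line.startswith("*TEXT"):
                # the n-th *TEXT header corresponds to document-ID n in TIME.REL
                position_to_text[len(position_to_text) + 1] = line.split()[1]
    return position_to_text

mapping = map_time_positions()
print(mapping[1])   # '017', i.e. document-ID 1 in TIME.REL refers to *TEXT 017
```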
Other Files:
TIME.STP - List of stop words
Parsing Problems:
- Relevance Assessments - Spaces between the ID and the first document ID vary greatly.
Source: http://ir.dcs.gla.ac.uk/resources/test_collections/time/
#
Other Data Sets
#
Europarl Parallel Corpus
Description:
Started in 2001, the corpus has collected texts with up to 60 million words per language. The texts were extracted from the proceedings of the European Parliament for these 21 European languages:
Romance (French, Italian, Spanish, Portuguese, Romanian), Germanic (English, Dutch, German, Danish, Swedish), Slavic (Bulgarian, Czech, Polish, Slovak, Slovene), Finno-Ugric (Finnish, Hungarian, Estonian), Baltic (Latvian, Lithuanian), and Greek.
Documents:
Every file has the same structure: sections are separated by the tag <CHAPTER ID=?>, which is followed by the title of the chapter, and the <SPEAKER ID=?> tag carries a unique ID marking each spoken part.
Example from the English Corpus:
<CHAPTER ID=6>
Social and economic situation and development of the regions of the Union
<SPEAKER ID=80 NAME="President">
The next item is the debate on the report (A5-0107/1999) by Mr Berend, on behalf of the Committee on Regional Policy, Transport and Tourism, on the sixth periodic report on the social and economic situation and development of the regions of the European Union [SEC(99)0066 - C5-0120/99 - 1999/2123(COS)].
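A regular expression over the speaker tag is enough to slice a chapter file into speaker turns. A minimal sketch (names are ours, tag spelling as in the example above):

```python
import re

# Matches the speaker tag shown above and captures the numeric ID.
SPEAKER_TAG = re.compile(r'<SPEAKER ID\s*=\s*(\d+)[^>]*>')

def split_speaker_turns(chapter_text):
    """Return (speaker_id, spoken_text) pairs from the raw text of one chapter."""
    parts = SPEAKER_TAG.split(chapter_text)
    # parts looks like [text before the first speaker, id1, text1, id2, text2, ...]
    return list(zip(parts[1::2], (text.strip() for text in parts[2::2])))

sample = ('<CHAPTER ID=6>\nSocial and economic situation and development of the regions of the Union\n'
          '<SPEAKER ID=80 NAME="President">\nThe next item is the debate on the report ...')
print(split_speaker_turns(sample))
# [('80', 'The next item is the debate on the report ...')]
```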
Queries: -
Relevance Assessments: -
Other Files: -
Source: http://www.statmt.org/europarl/
#
MS MARCO Document Ranking
Description:
MS MARCO stands for "Microsoft Machine Reading Comprehension" and is a large-scale data set which can be used for machine reading comprehension, question answering, passage ranking, keyphrase extraction, and conversational search studies. It holds over 3 million (3,213,835) documents and more than one million (1,010,916) unique queries, which were extracted from anonymized Bing usage logs.
In our case we focused on the MS MARCO Document Ranking corpus, which is available through GitHub.
Documents:
The documents are downloadable as a TSV file (msmarco-docs.tsv), which makes them easy to parse. There are over 3 million (3,213,835) documents, each with an assigned ID, the link to the website the document was extracted from, a title, and the document text, all separated by tabs; every entry starts on a new line. If no title is available, there is a "." instead.
Example:
D1555982 https://answers.yahoo.com/question/index?qid=20071007114826AAwCFvR The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.? Science & Mathematics Physics The hot glowing surfaces of stars emit energy in the form of electromagnetic radiation.? It is a good approximation to assume that the emissivity e is equal to 1 for these surfaces. Find the radius of the star Rigel, the bright blue star in the constellation Orion that radiates energy at a rate of 2.7 x 10^32 W and has a surface temperature of 11,000 K. Assume that the star is spherical. Use σ =... show more Follow 3 answers Answers Relevance Rating Newest Oldest Best Answer: Stefan-Boltzmann law states that the energy flux by radiation is proportional to the forth power of the temperature: q = ε · σ · T^4 The total energy flux at a spherical surface of Radius R is Q = q·π·R² = ε·σ·T^4·π·R² Hence the radius is R = √ ( Q / (ε·σ·T^4·π) ) = √ ( 2.7x10+32 W / (1 · 5.67x10-8W/m²K^4 · (1100K)^4 · π) ) = 3.22x10+13 m Source (s):http://en.wikipedia.org/wiki/Stefan_bolt...schmiso · 1 decade ago0 18 Comment Schmiso, you forgot a 4 in your answer. Your link even says it: L = 4pi (R^2)sigma (T^4). Using L, luminosity, as the energy in this problem, you can find the radius R by doing sqrt (L/ (4pisigma (T^4)). Hope this helps everyone. Caroline · 4 years ago4 1 Comment (Stefan-Boltzmann law) L = 4pi*R^2*sigma*T^4 Solving for R we get: => R = (1/ (2T^2)) * sqrt (L/ (pi*sigma)) Plugging in your values you should get: => R = (1/ (2 (11,000K)^2)) *sqrt ( (2.7*10^32W)/ (pi * (5.67*10^-8 W/m^2K^4))) R = 1.609 * 10^11 m? · 3 years ago0 1 Comment Maybe you would like to learn more about one of these? Want to build a free website? Interested in dating sites? Need a Home Security Safe? How to order contacts online?
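Assuming the four tab-separated columns shown above (ID, URL, title, text), the file can be streamed line by line. A minimal sketch (names are ours):

```python
def iter_msmarco_docs(path="msmarco-docs.tsv"):
    """Yield (doc_id, url, title, body) tuples from the tab-separated document file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            columns = line.rstrip("\n").split("\t")
            if len(columns) == 4:                 # doc-ID, URL, title, body
                yield tuple(columns)

for doc_id, url, title, body in iter_msmarco_docs():
    print(doc_id, title[:60])
    break
```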
Queries:
There are over one million (1,010,916) unique, real queries, each assigned a unique ID. The queries can all be found in the Q&A corpus; for the Document Ranking corpus only ⅛ of the queries were used. Those were split into train, development, and test sets. We used the development set for our experiments because the Ranking Evaluation API of Elasticsearch requires relevance assessments, and those are only provided with the development set.
The development queries (5,193) are stored in the file msmarco-docdev-queries.tsv. Each row holds the query-ID and the query text, separated by one tab.
Example:
121352 define extreme
634306 what does chattel mean on credit history
920825 what was the great leap forward brainly
510633 tattoo fixers how much does it cost
737889 what is decentralization process.
278900 how many cars enter the la jolla concours d' elegance?
674172 what is a bank transit number
303205 how much can i contribute to nondeductible ira
570009 what are the four major groups of elements
492875 sanitizer temperature
Relevance Assessments:
The relevance assessments can be found in the file msmarco-docdev-qrels.tsv. Each row holds the query-ID, a 0, the document-ID, and a 1, all separated by a space.
Example:
2 0 D1650436 1
1215 0 D1202771 1
1288 0 D1547717 1
1576 0 D1313702 1
2235 0 D2113408 1
2798 0 D2830290 1
2962 0 D125453 1
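A minimal loader for this format, e.g. to feed the relevance assessments into a ranking evaluation, could look like the following sketch (names are ours):

```python
from collections import defaultdict

def load_msmarco_qrels(path="msmarco-docdev-qrels.tsv"):
    """Collect the relevant document-IDs per query; columns are query-ID, 0, doc-ID, 1."""
    qrels = defaultdict(set)
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            query_id, _, doc_id, label = line.split()
            if label == "1":
                qrels[query_id].add(doc_id)
    return qrels

qrels = load_msmarco_qrels()
print(qrels["2"])   # e.g. {'D1650436'}, matching the first row of the example above
```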
Other Files:
msmarco-docs-lookup.tsv
msmarco-doctrain-queries.tsv
msmarco-docdev-top100
docleaderboard-queries.tsv
docleaderboard-top100
Parsing Problems: -
Source: https://github.com/microsoft/MSMARCO-Document-Ranking
#
OHSUMED
Description:
This collection was originally created by William Hersh as a new large medical test collection for experiments on the SMART retrieval system. It was later divided into Training and Test sets.
The split version contains 20,000 abstracts, while the unsplit version contains 50,216; both are sorted into 23 medical categories. Since the official source is no longer available online, we refer to the download files provided by Alessandro Moschitti.
Documents:
The download Cardiovascular Diseases Abstracts contains 20,000 documents divided into Training and Test set directories. Inside those directories are 23 folders, one for each of the 23 categories the abstracts are assigned to. Each document is written to its own file without any special notation; see the parsing sketch after the examples below.
Example:
Haemophilus influenzae meningitis with prolonged hospital course. A retrospective evaluation of Haemophilus influenzae type b meningitis observed over a 2-year period documented 86 cases. Eight of these patients demonstrated an unusual clinical course characterized by persistent fever (duration: greater than 10 days), cerebrospinal fluid pleocytosis, profound meningeal enhancement on computed tomography, significant morbidity, and a prolonged hospital course. The mean age of these 8 patients was 6 months, in contrast to a mean age of 14 months for the entire group. Two patients had clinical evidence of relapse. Four of the 8 patients tested for latex particle agglutination in the cerebrospinal fluid remained positive after 10 days. All patients received antimicrobial therapy until they were afebrile for a minimum of 5 days. Subsequent neurologic examination revealed a persistent seizure disorder in 5 patients (62.5%), moderate-to-profound hearing loss in 2 (25%), mild ataxia in 1 (12.5%), and developmental delay with hydrocephalus which required shunting in 1 (12.5%). One patient had no sequelae.
The download link All Cardiovascular Diseases Abstracts contains 50,216 abstracts, which are sorted only into the 23 category folders.
Queries: -
Relevance Assessments: -
Other Files:
Category Description - Defines the 23 categories
Source: ftp://medir.ohsu.edu/pub/ohsumed
(If the original source is no longer available online, please consult the website of Alessandro Moschitti at the University of Trento where you will find download links for OHSUMED and Reuters http://disi.unitn.eu/moschitti/corpora.htm)
#
Reuters 21578
Description:
This collection was originally gathered and labeled by Carnegie Group, Inc. and Reuters, Ltd. Further information is available in the README.txt document.
The files are in SGML format, so it is important to understand that particular markup language first.
It is already split into Training and Test sets by the tag LEWISSPLIT. Topics and the relations between them are well documented, which makes this collection very useful for pragmatic search development. However, it must be noted that only about half of the documents were manually assigned to topics, so there are unlabeled documents, which are marked by LEWISSPLIT="NOT-USED".
The collection is a multi-labeled one; as such, one document can be assigned to more than one topic.
Documents:
The files reut2-000.sgm to reut2-021.sgm contain the 21,578 documents in SGML format. Each file begins with the declaration <!DOCTYPE lewis SYSTEM "lewis.dtd">, and every document starts with a <REUTERS ...> tag, so the entries are clearly distinguishable by their notation.
Example:
<REUTERS TOPICS="BYPASS" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="2984" NEWID="14001">
<DATE> 7-APR-1987 11:02:35.07</DATE>
<TOPICS></TOPICS>
<PLACES></PLACES>
<PEOPLE></PEOPLE>
<ORGS></ORGS>
<EXCHANGES></EXCHANGES>
<COMPANIES></COMPANIES>
<UNKNOWN>Ef1137reuteb f BC-JOHANNESBURG-GOLD-SHA 04-07 0120</UNKNOWN>
<TEXT>
<TITLE>JOHANNESBURG GOLD SHARES CLOSE MIXED TO FIRMER</TITLE>
<DATELINE> JOHANNESBURG, April 7 - </DATELINE>
<BODY>Gold share prices closed mixed to slightly firmer in quiet and cautious trading, showing little reaction to a retreat in the bullion price back to below 420 dlrs and a firmer financial rand, dealers said. Heavyweight Vaal Reefs ended eight rand higher at 398 rand but Grootvlei eased 40 cents at 16.60 rand, while mining financials had Gold Fields up a rand at 63 rand despite weaker quarterly results. Other minings were firm but platinums eased. Industrials also closed mixed to firmer, the index once again hitting a new high of 1757 from Friday's 1753 finish. The overall index also hit a new high of 2188 versus 2179 on Friday. REUTER</BODY>
</TEXT>
</REUTERS>
The attributes specified after <REUTERS are meant to identify documents and groups of documents, and have the following meanings:
1. TOPICS: The possible values are YES, NO, and BYPASS:
a. YES indicates that (in the original data) there was at least one entry in the TOPICS fields.
b. NO indicates that (in the original data) the story had no entries in the TOPICS field.
c. BYPASS indicates that (in the original data) the story was marked with the string "bypass" (or a typographical variant on that string).
Note that this attribute should not be used for topic search: a story marked NO can still have topics, and a story marked YES can have none.
2. LEWISSPLIT: The possible values are TRAINING, TEST, and NOT-USED.
a. TRAINING indicates that the story was used in the training sets of the experiments reported in LEWIS91d (Chapters 9 and 10), LEWIS92b, LEWIS92e, and LEWIS94b.
b. TEST indicates it was used in the test set for those experiments.
c. NOT-USED means it was not used in those experiments.
3. CGISPLIT: The possible values are TRAINING-SET and PUBLISHED-TESTSET, indicating whether the document was in the training set or the test set for the experiments reported in HAYES89 and HAYES90b.
4. OLDID: The identification number (ID) the story had in the Reuters-22173 collection.
5. NEWID: The identification number (ID) the story has in the Reuters-21578, Distribution 1.0 collection. These IDs are assigned to the stories in chronological order.
For more detailed descriptions see the VI. Formatting section of the README.txt.
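The .sgm files can be read with any SGML/HTML-tolerant parser; the sketch below uses BeautifulSoup (our choice, not part of the distribution) to pull out the LEWISSPLIT attribute, topics, title, and body of every document:

```python
from bs4 import BeautifulSoup   # third-party; any SGML/HTML-tolerant parser will do

def load_reuters_file(path):
    """Yield (lewis_split, topics, title, body) for every <REUTERS> element in one .sgm file."""
    with open(path, encoding="latin-1", errors="ignore") as handle:
        soup = BeautifulSoup(handle.read(), "html.parser")
    for doc in soup.find_all("reuters"):
        split = doc.get("lewissplit")          # e.g. TRAIN, TEST or NOT-USED
        topics = [d.get_text() for d in doc.topics.find_all("d")] if doc.topics else []
        title = doc.title.get_text() if doc.title else ""
        body = doc.body.get_text() if doc.body else ""
        yield split, topics, title, body

training_docs = [doc for doc in load_reuters_file("reut2-000.sgm") if doc[0] == "TRAIN"]
print(len(training_docs), training_docs[0][2])
```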
Queries:
There are no queries, but there are several files which contain the topics, places, people, etc. as strings; see Other Files.
Relevance Assessments: -
Other Files:
all-exchanges-strings.lc.txt - Alphabetical list of exchange categories
all-orgs-strings.lc.txt - Alphabetical list of organization categories
all-people-strings.lc.txt - Alphabetical list of names
all-places-strings.lc.txt - Alphabetical list of places
all-topics-strings.lc.txt - Alphabetical list of topics
Example:
acq
alum
austdlr
austral
barley
bfr
bop
can
carcass
castor-meal
castor-oil
castorseed
citruspulp
cocoa
coconut
coconut-oil
coffee
copper
copra-cake
corn
corn-oil
cat-descriptions_120396.txt - List of categories, with the number of items labeled with each
Example:
**Currency Codes (27)
U.S. Dollar (DLR)
Australian Dollar (AUSTDLR)
Hong Kong Dollar (HK)
Singapore Dollar (SINGDLR)
New Zealand Dollar (NZDLR)
Canadian Dollar (CAN)
Sterling (STG)
D-Mark (DMK)
Japanese Yen (YEN)
Swiss Franc (SFR)
French Franc (FFR)
Belgian Franc (BFR)
Netherlands Guilder/Florin (DFL)
Italian Lira (LIT)
Danish Krone/Crown (DKR)
Norwegian Krone/Crown (NKR)
Swedish Krona/Crown (SKR)
Mexican Peso (MEXPESO)
Brazilian Cruzado (CRUZADO)
Argentine Austral (AUSTRAL)
Saudi Arabian Riyal (SAUDRIYAL)
South African Rand (RAND)
Indonesian Rupiah (RUPIAH)
Malaysian Ringitt (RINGGIT)
Portuguese Escudo (ESCUDO)
Spanish Peseta (PESETA)
Greek Drachma (DRACHMA)
Source: http://www.daviddlewis.com/resources/testcollections/reuters21578/
(If the original source is not available, please consult the website of Alessandro Moschitti at University of Trento where you will find download links for OHSUMED and Reuters http://disi.unitn.eu/moschitti/corpora.htm)
#
TREC
Description:
TREC (the Text REtrieval Conference) is not just one collection, but a conference series which has been held regularly since 1992. The TREC workshop series - among other goals - tries to encourage IR research based on large test collections and to increase communication between IR research and development.
TREC has produced many test collections, all of which contain a set of documents, a set of topics (questions), and a set of relevance judgments (answers). The collections can be downloaded from the TREC website but are usually copyrighted and must be licensed. The process to license a collection can be found on the data page entry for the collection of interest (https://trec.nist.gov/data.html).
The collections cover various interests such as "Chemical IR", "Conversational Assistance", "Legal", "Medical", "News", "Spoken Document Retrieval", etc.
One very interesting aspect of TREC is its constant growth and its high standards for a homogeneous notation, which can be very helpful for developing and testing NLP algorithms.
They also provide tools to process the data.
Publications about TREC are published by NIST (National Institute of Standards and Technology) and are accessible here: http://trec.nist.gov/pubs.html
Since there are many different data sets, we provide no examples for this collection.
Documents: -
Queries: -
Relevance Assessments: -
Other Files: -
Source: https://trec.nist.gov/
Acknowledgements:
Thanks to Kenny Hall and Irina Temnikova for proofreading this article.