The Word Relatedness Mturk-771 Test Collection
Release date: February 1, 2012
Prepared by: Guy Halawi, Gideon Dror
The Mturk-771 Test Collection contains 771
English word pairs along with human-assigned relatedness judgements.
The collection can be used to train and/or test computer algorithms
implementing semantic relatedness measures (i.e., algorithms that numerically
estimate relatedness of natural language words).
The set contains 771 word pairs along with their mean relatedness
scores. The scores were collected on Amazon Mechanical Turk.
At least 20 ratings were collected for each word pair, where
each judgment task consisted of a batch of 50 word pairs. Ratings
were collected on a 1–5 scale, where 5 stands for “highly
related” and 1 stands for “not related”. In order to discard poor quality
work, each batch contained 10 trap word pairs with known
extreme relatedness values, serving as binary indicators.
A batch that failed on more than one of the binary indicators was
discarded. This yields a probability of over 98% of detecting
random workers (e.g., bots). The relatedness value of each word pair
was taken as the mean score given by the workers.
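The 98% figure can be checked with a short calculation. The sketch below is illustrative only and rests on an assumption not spelled out above: that a random worker fails each of the 10 binary indicators independently with probability 0.5. The function name and its parameters are hypothetical.

```python
from math import comb

def survival_probability(n_traps=10, max_failures=1, p_fail=0.5):
    """Probability that a random worker's batch is NOT discarded.

    Assumption: each trap pair is failed independently with
    probability p_fail (0.5 models a purely random binary answer).
    A batch survives if it fails at most max_failures traps.
    """
    return sum(
        comb(n_traps, k) * p_fail**k * (1 - p_fail) ** (n_traps - k)
        for k in range(max_failures + 1)
    )

p_survive = survival_probability()   # (1 + 10) / 2**10 = 11/1024
p_detect = 1 - p_survive
print(f"{p_detect:.4f}")             # 0.9893
```

Under this assumption a random worker slips through with probability 11/1024, or about 1.1%, which is consistent with the "over 98%" detection rate stated above.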
To verify the agreement between raters,
we randomly split the raters into two groups, each including at least 10 Mechanical
Turk workers. We then averaged the numeric judgements
for each word pair among the raters in each of the two sets, thus
yielding a 771-element vector of average judgements for each
set. Finally, we computed the correlation between the mean judgement vectors of
the two sets. We repeated this process 1000 times, and over these
1000 random splits the mean correlation between the two sets of
raters was 0.8957, with extremely small variance, attesting to the
quality of the collected data.
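The split-half procedure described above can be sketched as follows. This is a simplified illustration, not the code used to produce the reported figure. It assumes Pearson correlation (the text says only "correlation"), and it assumes a complete ratings table in which every rater scored every pair, which is not how the real batches were assigned. The data here is synthetic and all function names are hypothetical.

```python
import random

def avg(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = avg(x), avg(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def split_half_correlation(ratings, n_splits=1000, seed=0):
    """Mean correlation over random splits of the raters into two groups.

    ratings[i][j] is the rating given to word pair i by rater j
    (simplifying assumption: every rater rated every pair).
    """
    rng = random.Random(seed)
    raters = list(range(len(ratings[0])))
    half = len(raters) // 2
    correlations = []
    for _ in range(n_splits):
        rng.shuffle(raters)
        group_a, group_b = raters[:half], raters[half:]
        mean_a = [avg(row[j] for j in group_a) for row in ratings]
        mean_b = [avg(row[j] for j in group_b) for row in ratings]
        correlations.append(pearson(mean_a, mean_b))
    return avg(correlations)

# Synthetic stand-in data: 50 word pairs, 20 raters, ratings on a 1-5 scale.
rng = random.Random(42)
true_scores = [rng.uniform(1, 5) for _ in range(50)]
ratings = [
    [min(5, max(1, round(t + rng.gauss(0, 1)))) for _ in range(20)]
    for t in true_scores
]
r = split_half_correlation(ratings, n_splits=200)
print(f"mean split-half correlation: {r:.4f}")
```

With the real raw scores, each pair would have its own set of raters, so the split would need to be made per pair or per batch rather than over a single shared list of rater indices.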
Availability and usage
We provide both the final relatedness scores and the raw scores as collected from the
Amazon Mechanical Turk workers. Note that in the raw scores dataset, a zero entry represents an 'I do not know' response.
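When recomputing a mean score from the raw ratings, the zero entries presumably need to be skipped, since they are not points on the 1–5 scale. The sketch below shows one way to do that. It is an assumption that the released final scores were computed this way, and the function name and input format (a plain list of integer ratings for one word pair) are hypothetical.

```python
def mean_relatedness(raw_ratings):
    """Mean of the 1-5 ratings for one word pair, skipping zero entries.

    Assumption: a 0 marks an 'I do not know' response and carries
    no relatedness information, so it is excluded from the mean.
    Returns None if no valid rating remains.
    """
    valid = [r for r in raw_ratings if r != 0]
    if not valid:
        return None
    return sum(valid) / len(valid)

print(mean_relatedness([5, 4, 0, 3, 0, 4]))  # 4.0
```

Averaging the zeros in as if they were ratings would pull the score below the scale minimum for poorly known words, so filtering them first is the safer default.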