The Word Relatedness Mturk-771 Test Collection

Version: 1.0
Release date: February 1, 2012
Prepared by: Guy Halawi, Gideon Dror
Maintained by: Gideon Dror

Overview

The Mturk-771 Test Collection contains 771 English word pairs along with human-assigned relatedness judgements. The collection can be used to train and/or test computer algorithms implementing semantic relatedness measures (i.e., algorithms that numerically estimate relatedness of natural language words).

Description

The set contains 771 word pairs along with their mean relatedness scores. The scores were collected on Amazon Mechanical Turk. At least 20 ratings were collected for each word pair, where each judgement task consisted of a batch of 50 word pairs. Ratings were collected on a 1–5 scale, where 5 stands for “highly related” and 1 stands for “not related”. To discard poor-quality work, each batch contained 10 trap word pairs with known extreme relatedness values, serving as binary indicators. A batch that failed on more than one of the binary indicators was discarded. This gives a probability of over 98% of detecting random workers (e.g., bots). The relatedness value of each word pair was taken as the mean score given by the workers.
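The "over 98%" figure can be checked with a short calculation. The sketch below assumes (this is not stated above) that a random worker fails each trap pair independently with probability 0.5; the function name and its parameters are illustrative only.

```python
from math import comb


def detection_probability(n_traps: int = 10,
                          max_failures: int = 1,
                          p_fail: float = 0.5) -> float:
    """Probability that a random worker's batch is discarded.

    Assumption: each trap pair is failed independently with
    probability p_fail (0.5 models a coin-flipping worker).
    """
    # A batch survives only if it fails at most `max_failures` traps.
    p_pass = sum(
        comb(n_traps, k) * p_fail**k * (1 - p_fail) ** (n_traps - k)
        for k in range(max_failures + 1)
    )
    return 1 - p_pass


print(f"{detection_probability():.4f}")  # 0.9893
```

Under this assumption a random worker survives only by failing zero or one of the 10 traps, which happens with probability 11/1024, so the detection probability is about 98.9%.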

Quality assessment

To verify the agreement between raters, we randomly split the raters into two groups, each including at least 10 Mechanical Turk workers. We then averaged the numeric judgements for each word pair among the raters in each of the two groups, yielding a 771-element vector of average judgements for each group. Finally, we computed the correlation between the mean judgement vectors of the two groups. We repeated this process 1000 times, and over these 1000 random splits the mean correlation between the two groups of raters was 0.8957, with extremely small variance, attesting to the quality of the collected data.
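The split-half procedure can be sketched as follows. This is a minimal illustration, not the original evaluation script: it assumes the ratings for each word pair are available as a list of numbers, that the split is made separately for each word pair, and that Pearson correlation is used (the correlation type is not specified above). The data is synthetic, and all names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def split_half_correlation(ratings_per_pair, n_splits=1000, seed=0):
    """Mean correlation between two random halves of the raters.

    ratings_per_pair: one list of ratings per word pair
    (assumed layout; each list needs at least two ratings).
    """
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_splits):
        means_a, means_b = [], []
        for ratings in ratings_per_pair:
            shuffled = ratings[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            # Average each half separately for this word pair.
            means_a.append(sum(shuffled[:half]) / half)
            means_b.append(sum(shuffled[half:]) / (len(shuffled) - half))
        correlations.append(pearson(means_a, means_b))
    return sum(correlations) / len(correlations)


# Synthetic stand-in for the raw data: 100 word pairs, 20 ratings each.
_rng = random.Random(42)


def fake_ratings(true_score, n=20):
    """Noisy integer ratings on the 1-5 scale around a hidden score."""
    return [min(5, max(1, round(_rng.gauss(true_score, 0.8))))
            for _ in range(n)]


synthetic = [fake_ratings(_rng.uniform(1, 5)) for _ in range(100)]
mean_r = split_half_correlation(synthetic, n_splits=200)
print(f"mean split-half correlation: {mean_r:.3f}")
```

The value printed for the synthetic data has no bearing on the 0.8957 reported for the real collection; the sketch only shows the shape of the computation.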

Availability and usage

We provide both the final relatedness scores and the raw scores as collected from the Amazon Mechanical Turk workers. Note that in the raw scores dataset, a zero entry represents an 'I do not know' entry.
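When working with the raw scores, the zero entries need special handling, because 0 lies outside the 1–5 rating scale. The sketch below assumes that 'I do not know' entries should simply be excluded before averaging; whether the published mean scores were computed this way is not stated here. The function name and the example ratings are made up for illustration.

```python
def mean_relatedness(raw_ratings):
    """Mean rating for one word pair, skipping 'I do not know' (0) entries.

    Returns None if no valid rating remains.
    """
    valid = [r for r in raw_ratings if r != 0]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Hypothetical raw ratings for a single word pair.
example = [4, 5, 0, 3, 4, 0, 5]
print(mean_relatedness(example))  # 4.2
```

Averaging the same list with the zeros left in would give about 3.0 instead of 4.2, which is why the zeros should not be treated as ratings.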

Gideon Dror