In document classification and other natural language processing applications, having a good measure of the similarity of two texts can be a valuable building block. Ideally, such a measure would capture semantic information. Cosine similarity on bag-of-words vectors is known to do well in practice, but it inherently cannot capture when documents say the same thing in completely different words.
Take, for example, two headlines:
- Obama speaks to the media in Illinois
- The President greets the press in Chicago
These have no content words in common, so according to most bag-of-words-based metrics, their distance would be maximal. (For such applications, you probably don’t want to count stopwords such as “the” and “in”, which don’t truly signal semantic similarity.)
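A quick check makes this concrete. The following is a minimal sketch using only the Python standard library; the tiny stopword set and the helper names `bag_of_words` and `cosine_similarity` are assumptions made for illustration, not part of any particular library.

```python
import math
from collections import Counter

# Tiny illustrative stopword set (an assumption; real lists are much longer).
STOPWORDS = {"the", "in", "to"}


def bag_of_words(text):
    """Lowercase, split on whitespace, and count the non-stopword tokens."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("Obama speaks to the media in Illinois")
doc2 = bag_of_words("The President greets the press in Chicago")

similarity = cosine_similarity(doc1, doc2)
print(similarity)  # 0.0: the two headlines share no content words
```

With a similarity of zero, the two headlines look exactly as unrelated as any other pair of texts with disjoint vocabularies, even though they describe nearly the same event.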
One way out of this conundrum is the word mover’s distance (WMD), introduced in “From Word Embeddings To Document Distances” (Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, Kilian Q. Weinberger, ICML 2015). WMD adapts the earth mover’s distance to the space of documents: each text is treated as a pile of “mass” spread over its words, and the distance between two texts is the minimum total cost of moving the mass of one text onto the words of the other, where each move costs the amount of mass moved multiplied by the distance between the two words involved. So, starting from a measure of the distance between different words, we can get a principled document-level distance.
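To see the idea at work on the two headlines, here is a rough sketch under strong simplifying assumptions. The 2-D “embeddings” are invented by hand so that related words sit close together (real word vectors are learned and have hundreds of dimensions), and the function `wmd_equal_length` is a hypothetical helper, not the algorithm used in the paper. It handles only the special case where both documents have the same number of distinct words with equal mass; in that case an optimal transport plan is a one-to-one matching, so a brute-force search over permutations is enough. The general case needs a proper optimal transport solver.

```python
import math
from itertools import permutations

# Hand-made 2-D vectors, purely illustrative (an assumption, not real embeddings).
EMBEDDINGS = {
    "obama": (0.0, 0.0), "president": (0.1, 0.0),
    "speaks": (1.0, 0.0), "greets": (1.0, 0.2),
    "media": (0.0, 1.0), "press": (0.1, 1.0),
    "illinois": (1.0, 1.0), "chicago": (1.2, 1.0),
}


def word_distance(w1, w2):
    """Euclidean distance between the vectors of two words."""
    (x1, y1), (x2, y2) = EMBEDDINGS[w1], EMBEDDINGS[w2]
    return math.hypot(x1 - x2, y1 - y2)


def wmd_equal_length(words_a, words_b):
    """WMD for two documents with n distinct words each, every word of mass 1/n.

    Tries every one-to-one matching and keeps the cheapest one, so it is
    only practical for very short documents.
    """
    if len(words_a) != len(words_b):
        raise ValueError("this sketch only handles documents of equal length")
    n = len(words_a)
    best_total = min(
        sum(word_distance(a, b) for a, b in zip(words_a, perm))
        for perm in permutations(words_b)
    )
    return best_total / n  # each word carries mass 1/n


doc_a = ["obama", "speaks", "media", "illinois"]
doc_b = ["president", "greets", "press", "chicago"]

distance = wmd_equal_length(doc_a, doc_b)
print(distance)  # approximately 0.15 with the toy vectors above
```

The cheapest matching pairs each word with its counterpart (“obama” with “president”, “media” with “press”, and so on), so the resulting distance is small even though the headlines share no words.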