This project is built on the concept of synonym-like relationships between words, and explores the large-scale structure that results from a sufficiently large, interconnected word network. I started this project hoping to find that most words could be connected to most other words via a synonymic path, hopping from word to word across the network. I was inspired by the concept of six degrees of separation, as well as other derivative projects such as Six Degrees of Wikipedia.
Words, definitions, and pointers are sourced from WordNet, a lexical database of English from Princeton. WordNet's database is a collection of delimited text files that group words of similar meaning into synsets and record various relationship types between synsets, such as hypernyms (e.g. a hamburger is a type of sandwich) and entailments (e.g. dreaming is always done while sleeping). Note that the "words" in the database also include collocations (e.g. "fountain pen" or "take in").
I downloaded the WordNet database files and parsed them into data structures that I could manipulate with Python. Paths are found with a breadth-first search that alternates between expanding from each endpoint. WordNet makes some connection types explicit in both directions (such as holonyms and meronyms). Other connection types have an inverse that is not explicitly present in WordNet but can be inferred (such as cause/effect relationships), so I added these. I also added a new connection type called a "word-pivot". Word-pivots connect different meanings that share a common word (e.g. pound can refer to a weight or a currency).
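To make the search concrete, here is a minimal sketch of the idea, not the project's actual code. The names `build_graph` and `shortest_length`, the synset IDs, and the toy data are all hypothetical. As a simplifying assumption, every pointer is stored in both directions, which stands in for both WordNet's explicit inverses and the inferred ones, and word-pivots are added between any two synsets that share a word.

```python
from collections import defaultdict


def build_graph(synsets, pointers):
    """Build a two-way adjacency map between synset IDs.

    synsets:  {synset_id: [words]}   (hypothetical toy format)
    pointers: [(source_id, target_id)]
    """
    graph = defaultdict(set)
    # Assumption: every pointer is usable in both directions.
    for src, dst in pointers:
        graph[src].add(dst)
        graph[dst].add(src)
    # Word-pivots: link synsets that share at least one word.
    by_word = defaultdict(set)
    for synset_id, words in synsets.items():
        for word in words:
            by_word[word].add(synset_id)
    for ids in by_word.values():
        for a in ids:
            for b in ids:
                if a != b:
                    graph[a].add(b)
    return graph


def shortest_length(graph, start, target):
    """Number of steps on a shortest path, or None if no path exists."""
    if start == target:
        return 0
    dist = [{start: 0}, {target: 0}]   # distances known to each side
    frontier = [[start], [target]]     # current outermost level per side
    side = 0
    while frontier[0] and frontier[1]:
        this, other = dist[side], dist[1 - side]
        next_frontier = []
        for node in frontier[side]:
            for neighbor in graph.get(node, ()):
                if neighbor in other:
                    # The two searches have met.
                    return this[node] + 1 + other[neighbor]
                if neighbor not in this:
                    this[neighbor] = this[node] + 1
                    next_frontier.append(neighbor)
        frontier[side] = next_frontier
        side = 1 - side                # alternate endpoints
    return None


# Hypothetical toy data, not taken from WordNet.
synsets = {
    "pound.weight": ["pound", "lb"],
    "pound.currency": ["pound", "quid"],
    "unit": ["weight unit"],
    "money": ["money"],
}
pointers = [("pound.weight", "unit"), ("pound.currency", "money")]
graph = build_graph(synsets, pointers)
steps = shortest_length(graph, "unit", "money")
```

In this toy network the only route from "unit" to "money" crosses the word-pivot on "pound". Expanding one full level at a time from alternating ends keeps both frontiers small, which is the usual reason to prefer a two-ended search over a one-ended one on a large graph.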
WordNet contains over 147 thousand words, and of the over 21 billion possible start/target endpoint combinations, a path exists for over 98% of them. Any two words might have an infinite number of possible paths between them; however, the set of paths with the fewest steps is finite and well-defined. So when I refer to "paths", I mean all paths of the shortest length. A random pair of word endpoints has an average path length between 7 and 8, and paths between more commonly used words are often shorter. Path lengths greater than 10 occur for just 2% of word pairs, about as rare as having no path at all. The longest paths are 21 synsets long; one of the few is between the words flightless and self-sustained.
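The notion of "all paths of the shortest length" can be illustrated with a small sketch. This is a plain one-ended breadth-first search that remembers every parent lying on a shortest route, written for illustration only; the function name `all_shortest_paths` and the toy graph are hypothetical.

```python
from collections import deque


def all_shortest_paths(graph, start, target):
    """Return every shortest path from start to target as a list of nodes."""
    dist = {start: 0}
    parents = {start: []}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            break  # every parent of target is already recorded
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                parents[neighbor] = [node]
                queue.append(neighbor)
            elif dist[neighbor] == dist[node] + 1:
                # Another equally short way to reach this neighbor.
                parents[neighbor].append(node)
    if target not in dist:
        return []

    def unwind(node):
        # Walk the parent links back to start, branching at each tie.
        if node == start:
            return [[start]]
        return [path + [node] for p in parents[node] for path in unwind(p)]

    return unwind(target)


# Hypothetical diamond-shaped graph: two equally short routes from a to d.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
paths = all_shortest_paths(toy, "a", "d")
```

Here `paths` holds both two-step routes, through "b" and through "c". Longer detours, such as ones that loop back through "a", are never listed, which is why the set stays finite.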
Almost all of the synsets are interconnected and form the main group within the database. However, a small percentage of synsets are isolated from the main group and form island clusters. Synsets within these island clusters have very few connections, if any. Each island cluster is made up of 1 to 7 synsets and has no connections outside of itself. These exceptional cases often reveal missing connections in the corners of the database. For example, the word unverified is in one of these island clusters and is connected to just a few other words, such as unproved; it is not connected to any words in the main group of the database, including the word untested. In the future, I would like to continue working on the database and account for some of these more obvious missing connections.
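In graph terms, the main group and the island clusters are connected components. The following sketch shows one way they could be found, assuming a two-way adjacency map in which every node appears as a key. The function name `connected_components` and the toy data are hypothetical, and the toy nodes are labeled with words rather than synset IDs to keep it readable.

```python
def connected_components(graph):
    """Group nodes into components, largest first.

    Assumes graph is {node: [neighbors]} with symmetric connections
    and every node present as a key.
    """
    seen = set()
    components = []
    for root in graph:
        if root in seen:
            continue
        seen.add(root)
        stack = [root]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Hypothetical toy network: one main group and two islands.
toy = {
    "tested": ["tried", "proven"],
    "tried": ["tested"],
    "proven": ["tested", "verified"],
    "verified": ["proven"],
    "unverified": ["unproved"],
    "unproved": ["unverified"],
    "solo": [],
}
main_group, *islands = connected_components(toy)
island_sizes = [len(island) for island in islands]
```

The largest component plays the role of the main group, and everything else is an island. Listing the islands by size is a quick way to spot candidates for missing connections.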
With the ability to find paths between distant words, there are a lot of directions to go from here. The first expansion I've added is finding "quasi-opposites". Even when a word has no established antonym relationship, my code can look for the nearest words that do have direct antonyms. These next-of-kin antonym relationships sometimes make good sense, are often abstract, and are occasionally unconvincing, but on the whole they are amusing.
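One plausible way to look for quasi-opposites is to search outward level by level and stop at the first level containing any word with a direct antonym. The sketch below illustrates that idea under stated assumptions and is not the project's actual code: the function name `quasi_opposites`, the separate `antonyms` map, and the toy words are all hypothetical.

```python
def quasi_opposites(graph, antonyms, start):
    """Find the nearest nodes that have direct antonyms.

    graph:    {node: [neighbors]}
    antonyms: {node: [direct antonyms]}
    Returns {nearest_node: its antonyms}, or {} if none is reachable.
    """
    seen = {start}
    level = [start]
    while level:
        hits = [node for node in level if node in antonyms]
        if hits:
            # All hits on this level are equally near to start.
            return {node: antonyms[node] for node in hits}
        next_level = []
        for node in level:
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_level.append(neighbor)
        level = next_level
    return {}


# Hypothetical toy data, not taken from WordNet.
toy = {"lukewarm": ["warm"], "warm": ["lukewarm", "hot"], "hot": ["warm"]}
toy_antonyms = {"warm": ["cool"], "hot": ["cold"]}
found = quasi_opposites(toy, toy_antonyms, "lukewarm")
```

In this toy data "lukewarm" has no antonym of its own, so the search borrows "cool" from its nearest neighbor "warm" and never reaches the more distant "hot".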
I'm not a linguist, nor do I vouch for WordNet's data. The scope of this project encompasses connections within the WordNet database, not language in general. I understood the database as a 117-thousand vertex, 14-billion edge, higher-dimensional directed graph. If the words in the network were replaced with other abstract symbols, and if those symbols had no intrinsic meaning of their own, the network would remain an interesting mathematical object to study. If anything, I hope this project helps to demonstrate that, for similar systems of symbols and connections, the network can be complex enough to encode meaning, and that this meaning can emerge from the connections themselves rather than being required as a fundamental building block.
© 2022 Johnathan Pennington | All rights reserved.