Six Degrees of Francis Bacon - Clustering

As part of my Six Degrees of Francis Bacon class, my group wanted to see whether we could discern attitudes towards contextually sensitive words in particular time periods by using their n-gram frequencies. We used the K-medoids algorithm to cluster similar words from the Google N-gram database.

Introduction

Karl Marx’s “The Communist Manifesto”, published in 1848, is often cited as one of the most influential pieces of political literature written in the past two centuries. An analytical commentary on the problems inherent in social structures, with particular emphasis placed on capitalism, the manifesto was commissioned by the Communist League and serves as the basis for the communist movement that grew out of the nineteenth century. What followed was an ideological split between proponents of individualistic social structures (capitalism) and proponents of more community-oriented social structures, like those proposed in communism.

This ideological shift served as the basis for our project. We wanted to examine whether the changes in political and social thought were borne out in the language and lexicon used in the time period surrounding the publication of “The Communist Manifesto”. Several research questions were of particular interest to us: do words identified as ‘communal’ have similar patterns of frequency over our given time period? Do the ‘individualistic’ words? We also wanted to compare some data specific to “The Communist Manifesto” to our general dataset.

We expected to see several trends in the data that we analyzed. As the ideological shift we focused on involved an emphasis on group mentality and social organization, we expected to see an increase in the frequency of words that we identified as ‘communal’. As the population became more aware of the ideas presented in the manifesto and inherent in communism, there would logically be greater usage of terms related to those ideas. Conversely, we hypothesized that there might be a slight decrease in the frequency of more ‘individualistic’ terms. We treated ‘communal’ and ‘individualistic’ as antonyms and, as such, predicted that an increase in one would suggest a decrease in the other. Additionally, we expected to see a correlation between like terms with regard to plurality; we anticipated seeing clusters, based on the pattern of frequency over time, of words we deemed to be ‘communal’ and ‘individualistic’.

Our project was two-pronged in nature, with a digital component and a humanistic analysis. As outlined below, we needed to construct a digital tool capable of clustering the data that was of interest. The tool produced a computational analysis of our primary source of data, the Google Ngram corpus. It returned data on the frequency of words used over our time period, 1840-1860, and clustered words that had similar usage frequency. This data was then examined through a humanistic lens: we provided sociocultural explanations for the data and explored the limitations and biases of our project.

Related Work

In 2012, Jean M. Twenge, W. Keith Campbell, and Brittany Gentile published an article reporting two studies on the usage of individualistic and communal terms. Their studies focused on a perceived increase in the use of individual-centric words and phrases between 1960 and 2008. Like the analysis presented below, their work examines the influence culture has on language, using a large corpus of digitized books as its primary source. They conclude that the increase in individualistic words and phrases demonstrates a cultural shift toward American individualism in the second half of the twentieth century. This article, and Twenge et al.’s work more broadly, have been criticized for failing to account for specific events within the time period that might have driven language change. Our project, in comparison, centers on a particular event in history and looks for language change that may be related to that event.

In our research, we also turned to a fair amount of existing work that investigates the relationship of language to Marxism. In many of the studies we came across, social class was the primary concern, with a focus on the veiling effect of specific language. This is related to the expression of distinct ideas by a certain social class within a larger national language. An example is Valentin Voloshinov’s “Marxism and the Philosophy of Language”, which deals with how ideologies shape language. Another reference point is “Marxist Linguistic Theory and Communist Practice: A Sociolinguistic Study” by Max K. Adler, in which language is thought of as a product of society, and society as a product of language. If this is indeed the case, we would expect to see such forces at work within our Ngram data.

Our research is also distinct from existing work in that we seek to quantify the changes that occur within language and ideologies after the publication of “The Communist Manifesto”. By tracking the rise or decline of particular language, we expect to make visible the linguistic trends that may coincide with historical events.

Methodology

Due to computational limitations, the tool was designed to output data relating to the 25,000 most frequently used 1-grams in the general English corpus from 1840-1860. To do this, we filtered through the time-series information for all of the 1-grams in the corpus and sorted them by their maximum volume count, that is, the largest number of distinct volumes in which the word appeared in any one year. A word was therefore considered to be in the top 25,000 only if it was featured in a variety of writings. Unfortunately, we were unable to normalize words based on their relative frequency, due to the floating-point limitation of the double data type, i.e., the numbers were too small to compute.
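The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the in-memory layout (a dictionary of per-year tuples), the function name, and the toy numbers are all assumptions made for the example.

```python
import heapq

def top_n_by_peak_volume(series_by_word, n):
    """Return the n words whose yearly volume count peaks highest.

    series_by_word maps a word to a list of (year, match_count, volume_count)
    tuples. This layout is assumed for illustration only.
    """
    def peak_volume(word):
        # Largest number of distinct volumes containing the word in any year.
        return max(volume for _, _, volume in series_by_word[word])

    # nlargest returns the words ordered from highest to lowest peak.
    return heapq.nlargest(n, series_by_word, key=peak_volume)

# Toy values, invented for the example (not taken from the Ngram corpus).
sample = {
    "community": [(1840, 900, 310), (1850, 1200, 420)],
    "individual": [(1840, 2000, 650), (1850, 2600, 700)],
    "uniqueness": [(1840, 3, 2), (1850, 5, 4)],
}
top_two = top_n_by_peak_volume(sample, 2)
```

With a cutoff like this, a rare word such as 'uniqueness' in the toy data falls outside the retained set and never reaches the clustering stage.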

After the preprocessing, we decided to use the K-medoids clustering algorithm to cluster words that have similar time series. For our K-medoids distance function, rather than using standard Euclidean distance, we decided to use Pearson’s product-moment correlation coefficient. Pearson’s correlation controls for the different locations and scales of the two time series, which was useful to us because it let us account for the base usage of a word (the location) and the response a related word would have (the scale). Ideally, using this distance function allowed us to detect a larger selection of correlated words, rather than just those with a high Euclidean similarity. Initially we planned to use K-means as our clustering method, but it did not perform well with Pearson’s correlation as a distance function: the cluster assignment and centroid update steps did not converge, so the algorithm never truly ran correctly. K-medoids, which can handle arbitrary distance functions, ended up being a better fit for us.
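To make the location-and-scale point concrete, here is a small sketch of Pearson's correlation between two time series. Turning the correlation into a distance as 1 - r is an assumption made for this example; the write-up does not say how the tool converted similarity to distance, and the function names are hypothetical.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson's product-moment correlation of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length, at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant series has no defined correlation; treat it as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def pearson_distance(xs, ys):
    """Assumed conversion: 0 for perfectly correlated, 2 for perfectly opposed."""
    return 1.0 - pearson_correlation(xs, ys)

base = [10, 12, 15, 19]
scaled = [3 * v + 100 for v in base]   # same shape, different location and scale
mirrored = [-v for v in base]          # opposite trend
```

Here `pearson_distance(base, scaled)` is effectively zero even though the two series are far apart in Euclidean terms, which is the property the paragraph above relies on.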

We used the Java-ML implementation of the algorithm, due to its high regard in the community and its availability. We ran the algorithm 10 times, with a maximum of 1000 clusters for each run and at least 100 iterations of the algorithm. For each of the 10 runs we calculated the sum of centroid similarities, and the run with the highest sum was taken as the final set of clusters. The resulting set of clusters was exported and then used in a custom-made view. The view allows a user to search for a word, see the other words clustered with it, and view both the time series for each word and the average of those time series.
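The restart-and-pick-the-best procedure can be sketched as below. This is a simplified alternating K-medoids written for illustration in Python, not the Java-ML implementation the project used; the function names, the use of the iteration count as an upper bound, and the toy series are all assumptions.

```python
import math
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def assign(sim, medoids):
    """Label each series with its most similar medoid; medoids label themselves."""
    medoid_set = set(medoids)
    return [i if i in medoid_set else max(medoids, key=lambda m: sim[i][m])
            for i in range(len(sim))]

def k_medoids_once(sim, k, max_iterations, rng):
    """One run from a random start; returns (labels, total similarity to medoids)."""
    medoids = rng.sample(range(len(sim)), k)
    for _ in range(max_iterations):
        labels = assign(sim, medoids)
        new_medoids = []
        for m in medoids:
            members = [i for i, label in enumerate(labels) if label == m]
            # New medoid: the member most similar to the rest of its cluster.
            new_medoids.append(
                max(members, key=lambda c: sum(sim[c][j] for j in members)))
        if set(new_medoids) == set(medoids):
            break  # medoids stopped changing
        medoids = new_medoids
    labels = assign(sim, medoids)
    score = sum(sim[i][labels[i]] for i in range(len(sim)))
    return labels, score

def best_of_restarts(series, k, restarts=10, max_iterations=100, seed=0):
    """Run several random starts and keep the one with the highest score."""
    rng = random.Random(seed)
    sim = [[pearson(a, b) for b in series] for a in series]
    runs = [k_medoids_once(sim, k, max_iterations, rng) for _ in range(restarts)]
    return max(runs, key=lambda run: run[1])

# Toy series (invented): three rising and three falling.
series = [[1, 2, 3, 4], [2, 4, 6, 9], [10, 11, 13, 14],
          [9, 7, 4, 1], [5, 4, 2, 1], [8, 8, 3, 2]]
labels, score = best_of_restarts(series, k=2)
```

Each label is the index of the medoid a series was assigned to, so words sharing a label belong to the same cluster. Keeping the best of several restarts matters because a single random start can settle on a poor set of medoids.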

Following the production of the digital clustering tool, we input our lists of ‘individualistic’ and ‘communal’ words. As a note, these word lists are primarily taken from a previous study on plurality performed by Twenge, Campbell, and Gentile in 2012. Twenge’s study also focused on examining the distinction between plural and individualistic words and on language’s relation to culture. We used their word lists as the basis for our own, modifying them slightly to be more applicable to our specific project and time period of study. In inputting the words, we were interested in which words the tool identified as belonging to the same cluster, as well as the frequency curve associated with each cluster. The frequency curves can be produced by taking the numeric data that the tool outputs for each clustered word, inserting it into a basic Microsoft Excel file, and making a line chart. Examples of frequency charts can be found in Appendix C. The input words and the words they clustered with can be found in Appendix B.

We looked at Twenge’s study again when considering what methods might be effective. For instance, Twenge made use of Mechanical Turk to have participants rate words on a scale from most ‘individualistic’ to most ‘communal’. Crowdsourcing can certainly be an effective tool for dealing with subtle or subjective questions. However, our system of clustering avoids such arbitrary rating systems and provides us with much more data-driven results.

We also used Wheaton College’s online Lexos analysis tool to identify the top 50 words used in the Communist Manifesto. After we removed common English stopwords, the resulting word cloud served as a starting point for testing our clustering tool.

Results

We have represented our results visually in the appendices at the end of this paper. The first appendix outlines the thirty words, fifteen ‘individualistic’ and fifteen ‘communal’, that we wanted to examine closely. The second appendix consists of a table which displays each of the thirty words that we input into the clustering tool and the resulting cluster of words that the tool identified as having similar frequency patterns. The third, and final, appendix has a sample of the line graphs associated with the selected words.

Upon closer analysis, several interesting results were apparent. Of the fifteen ‘individualistic’ words that we identified in our word lists, fourteen returned clusters; “uniqueness” was not in the top 25,000 most frequent words in the Ngram data and did not return a cluster. Of the fifteen ‘communal’ words, only eleven returned clusters; the inputs “communal”, “communitarian”, “socialism”, and “collectivism” did not return clusters. The average size of an ‘individualistic’ cluster was 65.286 words, and the average size of a ‘communal’ cluster was 46.545 words. The averages were determined by summing the total words in each of the two categories and dividing by the number of terms that output clusters. (See Appendix B)

In terms of upward or downward tendencies, all of the graphs produced using the numerical data for each clustered word show an increase over time. That is, the average line graph of each cluster increases over the time period. We did not see any decreasing graphs. (See Appendix C)

Discussion

The quantitative data that was output by the digital tool served as the basis for further analysis. Below, we explore our research questions in relation to the data, discuss assumptions that we made in our project, outline biases of our project, and propose implications for future research.

Our hypothesis regarding the increase of plural term usage over the period from 1840-1860 was, in part, based on the assumption that words with similar meanings would have similar frequency graphs. In using the clustering tool, we hoped to see these frequency clusters of semantically-related terms. Specifically, we hoped to see ‘communal’ words clustered with other ‘communal’ words, indicating that they followed similar frequency patterns. We expected the frequency patterns of such clusters to increase over the course of our chosen time period. Such a result would provide some support for our hypothesis that there was a shift in ideology, a move toward a more group-oriented frame of thought, that was evident in the lexicon.

Our results, however, did not display clusters that were as semantically connected as we had predicted. Some words did tend to cluster with like words, but many of our sample words produced clusters that were wide-ranging with no apparent semantic link (see Appendix B).

Knowing that the graphs produced by inputting the numerical word-frequency data into Excel covered more semantic territory than just ‘communal’ or ‘individualistic’ words, we still thought there might be value in looking at more general upward and downward trends across the data. All of the graphs we produced from the data display an increase. This increase is the result of the tool outputting raw frequency counts for the words themselves, which naturally increase over time as more text is published. In order to look at the increase or decrease of specific words without this interference, we would need relative frequencies for the terms. Unfortunately, due to computational limitations, the relative frequencies were too small to analyze. We would need to recompute the relative frequencies and graph those, and we did not have time to do so. Therefore, we were unable to reach concrete conclusions about increase or decrease in our two categories.
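One possible way to obtain usable relative frequencies, offered here only as a sketch, is to express each count per million words in that year, so the values stay in a comfortable numeric range. The function name and the toy numbers are invented for illustration, and it is an assumption that yearly corpus totals are available alongside the word counts.

```python
def per_million(match_counts, total_counts):
    """Convert raw yearly counts to occurrences per million words."""
    if len(match_counts) != len(total_counts):
        raise ValueError("need one corpus total per yearly count")
    return [1_000_000 * count / total
            for count, total in zip(match_counts, total_counts)]

# Toy values (invented): the raw count rises while the corpus grows faster.
raw = [120, 180]
corpus_totals = [400_000_000, 900_000_000]
relative = per_million(raw, corpus_totals)
```

In this toy case the raw count rises from 120 to 180 while the per-million rate falls from about 0.3 to about 0.2, which is the kind of reversal that raw-count graphs cannot reveal.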

In our effort to examine the direct influence of a single text, we also pulled out the most frequent words in The Communist Manifesto, in hopes of comparing their frequencies over the time period in question. Unfortunately, many of the words that were used extensively in Marx’s text were not among the top 25,000 words identified in the Ngram data. Words like ‘proletariat’ and ‘bourgeoisie’, which are integral to the manifesto and to the ideology underlying the text, are not present. Thus, we were unable to track the use of these words and their frequencies. This suggests that these words were not widely in use in the time period we examined.

In fact, it may be the case that the Communist Manifesto simply was not as influential as we would have expected, given the emphasis modern society places on the text. Our data suggests, though not conclusively, that the manifesto’s impact on the overall lexicon from 1840-1860 was limited, if present at all. We were not able to attribute any specific lexical change to the manifesto itself.

While the scope of our project and results does not allow us to make any claims about whether the publication of The Communist Manifesto impacted the lexicon of the time period that we examined, we can explore other implications of the data. As mentioned above, the Communist Manifesto uses highly specific language, meaning that many of its most crucial terms do not show up in the most common language of the time we investigated. This being the case, it became an interesting question for us to consider what terms might be most revealing about the culture of this period. For instance, if we look at the word ‘industrial’, which does show up in the top 50 words of the Communist Manifesto, we see that its usage changes over time in a pattern similar to that of the word ‘atheism’. While this could be significant, even obvious, let us also consider that ‘industrial’ clusters with such words as ‘accidents’, ‘blinds’, ‘successes’, and ‘villages’. Some of these clusters seem predictable, others seem to reveal more subtle connections, and still others seem arbitrary or difficult to draw meaning from. In future work, additional methods could perhaps be implemented to aid in interpretation.

A possible bias of our data comes from the fact that the English publication date of the Communist Manifesto (1850) is situated squarely within the time period we investigated (1840-1860). We chose this time frame because it allowed us to examine not only the forces at work leading up to the manifesto, but also those immediately following its publication. However, this is perhaps too limited a time period, as we suspect that the linguistic impact of the manifesto might not be noticeable until much later than 1860. One possible implication of our study is that the impact of any single culturally significant literary work, be it the Communist Manifesto, The Wealth of Nations, or any other work, is far less immediate and apparent than might be expected. Thus, it might be necessary to investigate linguistic data at a much larger scope.

For example, looking at parts of speech might reveal even more. An interesting result that we noted during the examination of our data was that nouns tended to cluster with words that were more easily identifiable as being related by meaning. Other parts of speech, like adjectives and adverbs, produced more scattered results. Words that had both a noun form and an adjective form also did not produce very clear clusters. For example, the word ‘individual’, which can be either a noun or a descriptive adjective, clustered with words that ranged from adjectives (adequate, comprehensive) to verbs (deducing, eject) and other nouns. The nouns that clustered with ‘individual’ (art, auxiliary, compulsion, fete, handmaid, individual, nicety, serpents, species, superstructure, sweets, uncertainty, veracity) ranged from concepts to tangible objects and, arguably, included more words related to ‘communal’ ideas than ‘individual’ ones.

Additionally, a closer examination of the words in each cluster could add depth to the analysis. For example, it might be worthwhile to look at the positivity or negativity of the words that cluster together. Coding words as negative or positive could give more insight into ideological change over time. It might be the case that words with similar positive or negative connotations group together. We noted that a large number of words that are traditionally perceived as negative appeared in the clusters of the ‘individualistic’ terms. For example, the word “independent” clusters with “abandon”, “conflicting”, “confusion”, “danger”, “destroying”, and “difficulties”. It should be noted that positive words also appear in the cluster, but to a lesser degree. Exploring this tendency in future research might lead to interesting conclusions.

Overall, this project resulted in the production of a digital tool that has wide-ranging applications. While we discovered that using it to explore a single text, in this case “The Communist Manifesto”, as indicative of a larger social context may not always be possible, it is at our disposal for future research. Ideally, we could now apply this tool to a larger variety of questions with more concrete foundations and better-chosen parameters of interest.