Word Embedding Models for Community Newspapers: How the East Boston Community News Envisions the Future

As I have been working with the East Boston Community News (EBCN) and thinking about how I could computationally derive meaning from the newspaper’s 471 issues, the BRC team and I decided to investigate two main questions: how do communities share information, and how do they envision their future? As a way of beginning to explore these questions and the topic of futures, I started to research how to implement word embedding models on the EBCN.

There are a few terms to unpack when working with word embedding models. First is the term “model,” which generally refers to some representation of a feature of interest. In the case of a word embedding model, the model is a representation of a set of texts that makes their semantic relationships clear. Ultimately, this means that with a word embedding model, users can explore which words in a corpus (i.e., a collection of texts) are used in semantically similar ways; in other words, a user can determine which words in a set of texts appear in similar contexts. Programmers sometimes refer to this method as “judging a word by the company it keeps.” I determined that this quality of word embedding models would allow me to analyze which words in the EBCN are used in ways similar to words like “future” or “imagine.” We might state the research question as: how do the writers of the EBCN imagine and envision the future? To get at the answer, we ask the word embedding model what kinds of words the EBCN writers use in the same way they use the terms “future” or “imagine.”

To construct this model, I used Gensim’s implementation of Word2Vec in Python and structured my code after this tutorial authored by Gensim’s creator.

Word Embeddings with Word2Vec

Unlike simple word counts or measures of word frequency, word embeddings attempt to capture the meanings of words. Word embeddings are vector representations of particular words: each word is assigned a vector, a list of numerical values that can be pictured as an arrow pointing in a specific direction in a high-dimensional space. Because these values are numbers, they can be mathematically manipulated like any other numerical values. The vectors (which stand in for words) can then be measured for their nearness using cosine similarity, which measures the cosine of the angle between two vectors in order to determine how close the two vectors are to each other. When two words’ vectors are very near each other, it means the words are used in similar ways. A group of words with high pairwise cosine similarities also tends to be clustered together in this spatial representation of the vectors, which tells us that the group may share a topic or context in the larger text.
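To make this concrete, here is a minimal sketch of cosine similarity between two vectors. The three-dimensional vectors and their values are invented purely for illustration; in the actual model each word vector has 100 dimensions.

```python
# A minimal sketch of cosine similarity between two (made-up) word vectors.
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

future = np.array([0.8, 0.1, 0.3])       # hypothetical vector for "future"
development = np.array([0.7, 0.2, 0.4])  # hypothetical vector for "development"

# A value close to 1.0 suggests the two words appear in similar contexts.
print(cosine_similarity(future, development))
```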

Word2Vec is one of the most popular techniques for conducting this sort of transformation. Using a shallow neural network, Word2Vec transforms words into vectors, making natural language machine readable. Gensim is an open-source Python library with a particular focus on topic modeling. In addition to implementing Word2Vec on new text, Gensim also allows for pre-trained models to be loaded and queried relatively smoothly.
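For instance, loading and querying a pre-trained model through Gensim’s downloader API might look like the following sketch; “glove-wiki-gigaword-100” is one of the pre-trained vector sets the downloader distributes, chosen here only as an example.

```python
# A sketch of loading and querying a pre-trained model via Gensim's downloader API.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads the vectors on first use
print(wv.most_similar("future", topn=5))  # words used most like "future" in that corpus
```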

Word embeddings work by using an algorithm to train a model on a set of words. This training process involves splitting a text into sets of individual, tokenized sentences and then providing those sentences to an instance of Word2Vec. Essentially, the input for Gensim is a “list of lists” where each document is broken into sentences and those sentences are broken into words. The two algorithms that Gensim uses to train word embeddings are continuous bag of words (CBOW) and skip-gram. As Ria Kulshrestha writes, “in the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. While in the Skip-gram model, the distributed representation of the input word is used to predict the context.” In other words, CBOW predicts the word given its context while skip-gram predicts the context given the word.
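To make the “list of lists” input concrete, here is a minimal sketch; the example sentences are invented for illustration, and in practice each issue of the EBCN would contribute many such tokenized sentences.

```python
# A sketch of the "list of lists" input format Gensim expects:
# each inner list is one tokenized sentence.
sentences = [
    ["the", "airport", "expansion", "reaches", "the", "waterfront"],
    ["residents", "plan", "for", "the", "future", "of", "east", "boston"],
]
```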

Skip-gram works particularly well for sets of words that are unlikely to contain repeats, though it is slower to train. A future iteration of this experiment, perhaps one which includes a lot of slang, may work better with the skip-gram method. Alternatively, CBOW works well on sets where words are often repeated and is faster than skip-gram. For this particular implementation on the EBCN, I decided to use the CBOW method since I was working with a large corpus of words that all belong to the same newspaper. Because all of these issues come from the same publication and are constrained to the 70s and 80s as well as a particular neighborhood of Boston, I assumed there would be a lot of repetition in word usage and context. Additionally, the speed with which CBOW trains was appealing.

Additional settings that must be determined are the number of workers Gensim will use to train the model (a good rule of thumb is to match the number of workers to the number of cores in your device), the number of dimensions of the embedding (the default is 100), the minimum number of occurrences a word needs in order to be considered (the default is 5), and the maximum distance between a target word and the words surrounding it (the default is 5). You can find your number of cores in your device settings; most laptops produced in the last decade or so have multi-core processors. Before that, processors (the main computing chip) could only run one task at a time, but each core can handle one or more tasks, so a multi-core processor can run multiple tasks at the same time. For text analysis projects, more processing power means a faster run time with less wear and tear on the device. For this particular run of the code, I left the default settings as is, though I did increase the number of workers from three to five.
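A minimal sketch of the training call with these settings is below, assuming Gensim 4.x (older releases name the dimension parameter `size` rather than `vector_size`). The tiny repeated toy corpus exists only so the example runs despite `min_count=5`; the real input is the full tokenized EBCN corpus.

```python
# A sketch of training Word2Vec with the settings described above (Gensim 4.x).
from gensim.models import Word2Vec

toy_sentences = [["east", "boston", "plans", "its", "future"]] * 10  # placeholder corpus

model = Word2Vec(
    sentences=toy_sentences,  # in practice: the streamed EBCN sentences
    vector_size=100,          # dimensions of each word vector (default)
    window=5,                 # max distance between target and context words (default)
    min_count=5,              # ignore words appearing fewer than 5 times (default)
    workers=5,                # parallel worker threads (raised from the default of 3)
    sg=0,                     # 0 = CBOW, 1 = skip-gram
)
```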

In order to save time and memory I followed the memory-efficient iterator code recommended by Gensim’s creator Radim Řehůřek. Because I was going to be running the code on my personal device, it was especially important that processing this mass of text wouldn’t use up all of my available resources. Essentially, the iterator works by streaming the corpus: it steps through each document one line at a time and yields each line as it is read, rather than loading everything into one variable at once. By reading each document one line at a time, I didn’t need to worry about whether the entire corpus could fit in memory (which it likely couldn’t).
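The sketch below is modeled on that memory-friendly iterator from the Gensim tutorial; the directory name and the simple whitespace tokenization are placeholders for however the OCRed issues are actually stored and cleaned.

```python
# A sketch of a streaming corpus iterator: yields one tokenized line at a time
# instead of loading the whole corpus into memory.
import os

class EBCNSentences:
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                for line in f:
                    yield line.lower().split()  # naive whitespace tokenization

sentences = EBCNSentences("ebcn_issues/")  # hypothetical folder of plain-text issues
```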

The process of training the model takes quite a bit of time, but Gensim allows for the trained model to be saved to the disk and reloaded as needed. Additionally, once a model has been loaded, it can continue to be trained with new material. Training the model on all 471 issues of the EBCN took roughly three hours to complete.
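Saving, reloading, and continuing training might look like the following sketch, assuming Gensim 4.x and continuing from the `model` object trained above; the file name and `new_issue_sentences` are placeholders for whatever new material is added.

```python
# A sketch of persisting and resuming a trained model (Gensim 4.x).
from gensim.models import Word2Vec

model.save("ebcn_word2vec.model")            # write the trained model to disk
model = Word2Vec.load("ebcn_word2vec.model")  # reload it later without retraining

# Continue training on new material, updating the vocabulary first.
new_issue_sentences = [["neighborhood", "meeting", "about", "the", "airport"]] * 10
model.build_vocab(new_issue_sentences, update=True)
model.train(new_issue_sentences,
            total_examples=model.corpus_count,
            epochs=model.epochs)
```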

Once the model was finished with training, I was able to query the model with specific words in order to discover the contexts in which those words are used. Since the central theme of this work was to analyze how community newspapers such as the EBCN envision or plan for the future, I focused on that line of inquiry.

The East Boston Community News and Conceptions of the Future

The somewhat obvious first search term was the word “future.” Since word vectors are mathematical representations of words, a user can also use mathematical operations to shape the results. The classic example is “king minus man plus woman equals queen,” and so on. Future work with the model might involve similar configurations, but for an initial pass, I chose to query single search terms using the “most_similar” function that comes with Gensim.
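The queries themselves might look like the following sketch, assuming Gensim 4.x (where lookups go through `model.wv`) and continuing from the trained model above; the analogy query assumes all three words actually appear in the model’s vocabulary.

```python
# A sketch of querying the trained model (Gensim 4.x).
print(model.wv.most_similar("future", topn=10))   # words used most like "future"

# Vector arithmetic along the lines of the classic king - man + woman = queen example.
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```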

The list of words the model identified as being most semantically similar to “future” in the EBCN is:

[('development', 0.742843508720398),
 ('design', 0.7344774007797241),
 ('waterfront', 0.7251332998275757),
 ('plan', 0.7078417539596558),
 ('safety', 0.7070328593254089),
 ('sion', 0.7026692628860474),
 ('changes', 0.7024692893028259),
 ('specific', 0.7015269994735718),
 ('progress', 0.6998405456542969),
 ('piers', 0.6984201073646545)]

As can be seen, the most similar word is “development,” followed by “design” and “waterfront,” all words closely related to conceptions of property. Considering that the run of the EBCN covers the period in which Logan International Airport was being expanded into East Boston neighborhoods, it makes sense that the newspaper often envisions the “future” as belonging to the realm of property and land. Additionally, since East Boston is an immigrant-dense region of the city, perhaps the algorithm is pairing land ownership (as well as its fragility) with some sense of community future.

Another term of interest was “activism.” I was particularly interested in the ways in which the EBCN envisions activism and whether or not there would be any crossover with “future.” The results of that query are:

[('activists.', 0.7978725433349609),
 ('affairs.', 0.7375671863555908),
 ("Narconon's", 0.7281777858734131),
 ('praising', 0.7188684940338135),
 ('dense', 0.7172383069992065),
 ('strategies.', 0.7034701108932495),
 ('clippings', 0.7034184336662292),
 ('decaying', 0.6936590075492859),
 ('broad-based', 0.688977837562561),
 ('airport-impacted', 0.6874728202819824)]

In addition to the interesting position of “airport-impacted” on this list, there is the inclusion of “Narconon,” a rehabilitation and drug education center that partnered with the East Boston Interagency Council in 1977.

Finally, the last search term for this particular experiment was “planning.”

[('ning', 0.7343742847442627),
 ('design', 0.706474781036377),
 ('successful', 0.7028966546058655),
 ('planned', 0.6976842880249023),
 ('continuing', 0.6883267164230347),
 ('citywide', 0.686152458190918),
 ('developing', 0.6850674152374268),
 ('development', 0.6763161420822144),
 ('seeking', 0.6693742871284485),
 ('cial', 0.6680723428726196)]

I then remembered that the EBCN also often features long pieces written in Italian, sometimes running half the length of an issue. Prefacing these results with the fact that I have no prior experience with Italian, I decided to simply translate the English search terms into Italian and query the model for them. While it is currently unclear to me how Word2Vec treats bilingual corpora, I noticed that many of the words I searched for in Italian were not present in the vocabulary the model recognizes. Whether this is because Word2Vec is confused by the bilingual nature of the corpus or because the Italian content of the EBCN simply does not focus on these issues is uncertain to me. However, you’ll notice that querying for words in Italian returns only Italian results rather than a mix of English and Italian (and vice versa for the English queries). Of the words I used for my English queries, only the word “future” existed in the Italian vocabulary (which means words such as “activism” and “planning” were not present). Additionally, the word “future” in Italian can be translated as “futura” or “futuro,” the feminine and masculine forms of the word. Of the two, only the masculine form was present in the vocabulary.
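Checking whether an Italian term is in the model’s vocabulary before querying might look like the following sketch, assuming Gensim 4.x and continuing from the trained model above; the Italian terms shown are my approximate translations and serve only as placeholders.

```python
# A sketch of testing whether query terms exist in the model's vocabulary (Gensim 4.x).
for term in ["futuro", "futura", "attivismo", "pianificazione"]:
    if term in model.wv.key_to_index:
        print(term, model.wv.most_similar(term, topn=10))
    else:
        print(term, "is not in the model's vocabulary")
```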

Futuro

[('gran', 0.9783154726028442),
 ('nazionale', 0.9727517366409302),
 ('approvata', 0.9722154140472412),
 ('possesso', 0.9719233512878418),
 ('quattro', 0.9712430238723755),
 ('locali', 0.9710487127304077),
 ('ricevuto', 0.9706181287765503),
 ('voli', 0.9702284336090088),
 ('corso', 0.9701635241508484),
 ("un'altra", 0.9701262712478638)]

You’ll notice that the fourth most associated word is “possesso,” which means “possession.” This result echoes the connection between ownership and imagining the future that we saw in the English results. Similarly, the word “voli,” meaning “flights,” seems to draw out the connection to the airport expansion that we also saw in the English results.

Interestingly, the second most associated word (after “gran,” which means “great”) is “nazionale,” which means “national.” Knowing that East Boston has a large immigrant population, these results seem to suggest that nationhood, as well as ownership, resonates with the Italian readers of the EBCN when they write about their future. Querying the model with words such as “nazionale,” “terra,” and “aeroporto” produced results that were not significant, primarily drawing out words meaning “person,” “that,” “first,” “none,” and the like. The lack of nouns produced by querying the model with Italian terms might indicate a flaw in the model itself, or it may be the result of my insufficient understanding of the Italian language and its parts of speech. Further use of the EBCN word embedding model to explore Italian terms might be best carried out by someone with a deeper understanding of the language, though the model’s ability to produce Italian results at all indicates that this aspect of word embeddings in the EBCN is readily explorable, unlike some of the language-related issues I identified in the Named Entity Recognition experiments.

These results might help us begin to better understand the relationship between “future” and conceptions of property, as well as the role of “planning” in development. Additionally, the appearance of the phrase “airport-impacted” as well as “Narconon” alongside “activism” may provide a preliminary understanding of what hyper-local, community activism looks like in publications such as the EBCN, as illustrated by what a computer algorithm can extract. In many ways, this methodology is not providing answers, but rather helping me to specify the questions that I am asking of the EBCN. Through this word embedding model, I can transform a somewhat vague question such as “how does the EBCN envision its future?” into a more specific research question like “what is the role of property and ownership in influencing how the EBCN plans for the future?” or “how does activism in East Boston, as represented by the EBCN, engage with substance abuse and airport development? What groups are activists trying to reach?”

It is important to understand that this methodology is not definitive, and although I chose to primarily stick to Gensim’s default settings, some of those decisions might have impacted the results. One of the major benefits of word embedding models is that they don’t require a corpus to be extensively annotated, so I was able to run the model on the raw text of the EBCN. However, Word2Vec uses an unsupervised method of machine learning, which means the computer is left to ascertain meaning from the information fed to it by the programmer. Many researchers in recent years have identified issues with machine learning algorithms that classify data into categories, many of which come down to the simple fact that language is messy and doesn’t always fit into neat categories. By forcing a word or phrase into a box, sometimes meaning is lost, and sometimes a word which may be significant is buried under other words and phrases that fit these categories more neatly. Additionally, Word2Vec has trouble with words that have two or more meanings: “cell” in the sense of a prison cell and “cell” in the scientific sense are collapsed into a single vector. Given my elementary understanding of Italian, there might also have been better search terms. Alternatively, perhaps the topics covered by the Italian section of the newspaper focus less on the social issues that are readily visible by searching for words like “activism” and “planning.”

Conclusion

In many ways, this experiment with word embeddings has raised more questions than it has answered. I still have many questions regarding how the EBCN envisions its future and how that future is rendered in the newspaper. However, through the use of this word embedding model, many of my questions are now much more specific and more easily explorable by hand. Hopefully, this write-up has illustrated how computational methods and text analysis can help users see a body of text in ways they might not have otherwise. The goal of many computational experiments is to use the computer to reveal some larger pattern and to use that pattern to then return to the text.

My hope with this write-up is that readers will take these results and not only question them, but also return to the EBCN with more guided research interests. There is clearly some connection between “future” and questions of property and ownership. Someone who is interested in communities, particularly immigrant communities, and how they navigate space might return to the newspaper, locate the places where land and property are discussed, and analyze how they are talked about.

While there is certainly much more to explore with the EBCN, with our recent acquisition of the Boston Gay Community News, much of the focus of my research for the remainder of the summer will be on applying the code I developed for analyzing the EBCN to the Gay Community News. In doing so, I hope to further explore some of the limitations of these computational methods, as I understand and implement them, and to produce similar, specified lines of research inquiry to those I have begun to produce here with my experiments with the EBCN. These lines of questioning will help form a text analysis and community newspaper research agenda for the Boston Research Center moving forward, and perhaps lead to a better understanding of how communities express their activism in local publications.
