CS for Social Change x Peace Lab Methodology

Reddit
Dataset
We used the comprehensive database of Reddit comments maintained at pushshift.io as the basis for our social media analysis. We included all data the source estimates to originate between January 2008 and December 2021. We excluded data from 2022 to the present because the war in Ukraine may have caused a tectonic shift in both discourse and transfers, and adequate data on that shift may take longer to accumulate.
Pre-processing
Reddit data curation: We used two separate sets of keywords to identify content in the comprehensive Reddit corpus relevant to weapons and to their transfer between collective entities, respectively. The keywords can be found here. The relevance of comments in the resulting sample of >230,000 comments was determined based on the following definitions (a comment was deemed relevant only if both dimensions were rated 1; a matching sketch follows the rubric):
WEAPONS:

1: If the post explicitly refers to, implies, or implicitly alludes to human-made artifacts whose primary purpose is to seriously injure or kill humans, or to destroy strategic targets.

0: If the post does not contain content that can be reasonably and intuitively classified as the above.

-1: If the presence or absence of a reference to weaponry is unclear, e.g., due to the absence of thread context or the use of in-group language.

TRANSFER:

1: If the post explicitly refers to, implies, or implicitly alludes to the transfer of something, be it through selling/buying, donations/pledges, or any other transaction between two entities. We are not interested in transactions between individuals, but rather in ones that involve at least one collective entity, such as a government, a parastatal group like a militia, a corporation, or an identity group.

0: If the post does not contain content that can be reasonably and intuitively classified as the above.

-1: If the presence or absence of a reference to a transaction is unclear, e.g., due to the absence of thread context or the use of in-group language.
The keywords were refined over three iterations based on manual review of random samples of two dozen comments each. Because the rate of unrelated comments in the random samples remained high even after refinement, an additional topic-based relevance filter was applied (see below).
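For illustration, a minimal sketch of the keyword screening in Python. The term lists are hypothetical placeholders (the actual lists are linked above), and the requirement that a comment match both sets is our reading of how the two sets were combined.

    import re

    # Hypothetical stand-ins; the actual keyword lists are linked above.
    WEAPON_TERMS = ["missile", "rifle", "ammunition"]
    TRANSFER_TERMS = ["export", "arms sale", "shipment"]

    def _pattern(terms):
        # Case-insensitive, word-boundary alternation over a keyword list.
        return re.compile(r"\b(?:" + "|".join(map(re.escape, terms)) + r")\b",
                          re.IGNORECASE)

    weapon_re, transfer_re = _pattern(WEAPON_TERMS), _pattern(TRANSFER_TERMS)

    def is_candidate(comment: str) -> bool:
        # A comment enters the raw sample only if it matches both keyword sets
        # (an assumption; the text does not state how the sets were combined).
        return bool(weapon_re.search(comment)) and bool(transfer_re.search(comment))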

Topic Modeling Preprocessing
Stopwords were removed using NLTK's standard English list. Lemmatization was performed using TextBlob. Special characters were replaced with spaces. Preprocessing was performed in batches on a cluster computing platform for speed.
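A minimal sketch of this pipeline, assuming the required NLTK and TextBlob resources are available; the exact character filter and the batching logic are assumptions and are omitted here.

    import re
    import nltk
    from nltk.corpus import stopwords
    from textblob import TextBlob

    nltk.download("stopwords", quiet=True)  # first run only
    nltk.download("punkt", quiet=True)      # tokenizer used by TextBlob

    STOPWORDS = set(stopwords.words("english"))

    def preprocess(comment: str) -> list[str]:
        # Replace special characters with spaces, then lemmatize and drop stopwords.
        text = re.sub(r"[^A-Za-z\s]", " ", comment.lower())
        lemmas = TextBlob(text).words.lemmatize()
        return [w for w in lemmas if w not in STOPWORDS]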
Topic Model Training
The Gensim package was used to train Latent Dirichlet Allocation (LDA) models on the Reddit corpus of >230,000 comments. Alpha and eta were set to 0.1 to encourage the model to assign high probability to only a few words per topic and only a few topics per document, which made the resulting topics easier to identify intuitively. To determine the optimal number of topics, models were trained with 10 to 100 topics in increments of 10. The model offering the best combination of high coherence (the C_V measure) and low topic overlap (Jaccard similarity over the top 100 words of each topic) was chosen. Figure 1 shows that the 40-topic model strikes the best balance.
Figure 1. ***
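A sketch of the model sweep, assuming `docs` holds the preprocessed token lists from the previous step; averaging the pairwise Jaccard overlap across topic pairs is our reading of the selection criterion.

    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    results = {}
    for k in range(10, 101, 10):
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       alpha=0.1, eta=0.1, random_state=42)
        coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                                   coherence="c_v").get_coherence()
        tops = [[w for w, _ in lda.show_topic(t, topn=100)] for t in range(k)]
        pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
        overlap = sum(jaccard(tops[i], tops[j]) for i, j in pairs) / len(pairs)
        results[k] = (coherence, overlap)  # pick k with high coherence, low overlap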

Topics relevant to our focus, that is, weapons transfers between collective entities, were identified through manual examination of the 40 words with the highest conditional probability under each topic, as well as the five comments in the dataset with the highest proportion of words assigned to each of the forty topics. Five topics were judged clearly relevant. Any comment with at least twenty percent of its words assigned to these relevant topics was deemed to contain at least a mention of our topical focus and was included in further analysis. This yielded a corpus of 39,362 comments. Figure 2 shows the distribution of these comments across the years.
Figure 2. ***
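A sketch of the relevance filter under one reading of "words assigned to relevant topics": each unique word in a comment is attributed to its most likely topic via Gensim's per-word topic assignments (word counts are ignored for simplicity). The topic IDs are placeholders, and `raw_comments`, `corpus`, and `lda` are carried over from the sketches above.

    RELEVANT_TOPICS = {3, 11, 17, 24, 38}  # placeholder IDs for the five topics

    def relevant_word_share(bow, lda):
        # per_word_topics=True returns, for each word ID, its candidate topics
        # sorted by relevance; we take the most likely one.
        _, word_topics, _ = lda.get_document_topics(bow, per_word_topics=True)
        assigned = [topics[0] for _, topics in word_topics if topics]
        if not assigned:
            return 0.0
        return sum(t in RELEVANT_TOPICS for t in assigned) / len(assigned)

    kept = [doc for doc, bow in zip(raw_comments, corpus)
            if relevant_word_share(bow, lda) >= 0.20]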

To determine the relative prominence of the five topics over the years while controlling for differences in comment length, we counted the words in each comment that were most likely sampled from a given topic and plotted, for each month, the proportion of comment words thus assigned. 95% confidence intervals were calculated using the binomial formula, with each document treated as an independent sample for simplicity. The figures below show the confidence intervals for the five analyzed topics; the CIs were removed from the figure on the project's front page to avoid clutter. Note the greater uncertainty in earlier years due to the relative dearth of data.
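The interval in question, sketched under the assumption of a normal approximation to the binomial (the write-up does not specify the exact variant used):

    import numpy as np

    def binomial_ci95(p_hat: float, n: int, z: float = 1.96):
        # Normal-approximation 95% CI for a proportion, treating each of the
        # n documents in a month as an independent sample (as in the text).
        half = z * np.sqrt(p_hat * (1 - p_hat) / max(n, 1))
        return max(0.0, p_hat - half), min(1.0, p_hat + half)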
The relative probabilities of the top words given each analyzed topic were used to create word clouds, with larger words representing higher conditional probabilities.
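One way to render such a cloud; the `wordcloud` package is an assumption, as the source does not name the plotting tool.

    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    topic_id = 3  # placeholder for one of the five analyzed topics
    # P(word | topic) for the top 40 words of the topic.
    freqs = dict(lda.show_topic(topic_id, topn=40))
    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(freqs)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()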
Moralization Classification
Dataset
Moralization: We used the Moral Foundations Reddit Corpus, described here, to train a classifier to identify moralized content in our social media dataset. The first 52,269 entries were used as training and evaluation data with a randomized 80/20 split; the remaining 8,958 data points were used as a test set. To simplify the task to binary classification, we coded the "Non-Moral" label as 0 and any other label as 1. Majority rule among the three annotators was used to adjudicate labels across the training and evaluation data points.
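A sketch of the binarization and majority vote, assuming a long-format table with one row per (comment, annotator) pair; the file name and column names are hypothetical, and the actual MFRC release may be organized differently.

    import pandas as pd

    mfrc = pd.read_csv("mfrc.csv")  # hypothetical filename and schema

    # 0 for "Non-Moral", 1 for any moral-foundation label.
    mfrc["binary"] = (mfrc["annotation"] != "Non-Moral").astype(int)

    # Majority rule across the three annotators of each comment.
    labels = (mfrc.groupby("comment_id")["binary"].sum() >= 2).astype(int)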
Neural Network Training and Evaluation
We used HuggingFace’s PyTorch implementation of BERT as the pretrained basis for our neural classifier. Since the training dataset contained about five times as many positive labels as negative labels, we used scikit-learn’s compute_class_weight function to determine class weights that would yield balanced classification performance. Input text was truncated to the first 512 tokens, BERT’s built-in maximum sequence length. The model was trained for binary classification for 10 epochs with a batch size of 8 and a weight decay of .01. Early stopping was implemented after the first 500 steps, with a patience of 20 evaluations and an improvement threshold of .01. The final precision on the test set was .78 and the final recall .68, for an F1 score of .73. Given the highly subjective nature of the task, this performance was deemed adequate, and the model was applied to the full Reddit Weapons Transfer Discourse dataset to identify moralized comments.
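A condensed sketch of this setup using HuggingFace's Trainer. `train_labels`, `train_ds`, and `eval_ds` are assumed to be pre-tokenized inputs with a labels column, and the loss-weighting subclass is one common way to apply class weights, not necessarily the authors' exact code.

    import numpy as np
    import torch
    from sklearn.utils.class_weight import compute_class_weight
    from transformers import (BertForSequenceClassification, BertTokenizerFast,
                              EarlyStoppingCallback, Trainer, TrainingArguments)

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                          num_labels=2)

    # Class weights to offset the roughly 5:1 positive/negative imbalance.
    weights = compute_class_weight("balanced", classes=np.array([0, 1]),
                                   y=np.array(train_labels))
    weight_t = torch.tensor(weights, dtype=torch.float)

    class WeightedTrainer(Trainer):
        def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
            # Cross-entropy with class weights instead of the default loss.
            labels = inputs.pop("labels")
            outputs = model(**inputs)
            loss_fct = torch.nn.CrossEntropyLoss(
                weight=weight_t.to(outputs.logits.device))
            loss = loss_fct(outputs.logits, labels)
            return (loss, outputs) if return_outputs else loss

    args = TrainingArguments(output_dir="moral-bert", num_train_epochs=10,
                             per_device_train_batch_size=8, weight_decay=0.01,
                             evaluation_strategy="steps", eval_steps=500,
                             save_strategy="steps", save_steps=500,
                             load_best_model_at_end=True)

    trainer = WeightedTrainer(model=model, args=args, train_dataset=train_ds,
                              eval_dataset=eval_ds,
                              callbacks=[EarlyStoppingCallback(
                                  early_stopping_patience=20,
                                  early_stopping_threshold=0.01)])
    trainer.train()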
Arms Transfers Analysis
The international arms transfers data, including the estimated dollar value of the transactions, were provided by the Stockholm International Peace Research Institute (SIPRI) for research purposes. Given our focus on U.S. arms transfers, we only included transfers whose assessed place of origin was the United States. As with the Reddit data, we included all records the source estimates to fall between January 2008 and December 2021.
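A sketch of this filtering step in pandas; the file name and column names are hypothetical, as the SIPRI export format is not described here.

    import pandas as pd

    sipri = pd.read_csv("sipri_transfers.csv")  # hypothetical export file

    # Keep U.S.-origin transfers within the study window.
    us_transfers = sipri[(sipri["supplier"] == "United States")
                         & (sipri["delivery_year"].between(2008, 2021))]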