Sentence Similarity Analysis

Data

Below is a snapshot of the data we will be working with: Journal of World Business (JWB) articles from 1967–2025, downloaded from Scopus.

import pandas as pd
import numpy as np
data = pd.read_csv('../data/jwb-articles.csv')
# Drop rows with a missing abstract; .copy() avoids a SettingWithCopyWarning when columns are added later
data = data[data['Abstract'].notna()].copy()
data.head()
Authors Author full names Author(s) ID Title Year Source title Volume Issue Art. No. Page start ... ISSN ISBN CODEN PubMed ID Language of Original Document Document Type Publication Stage Open Access Source EID
0 Al Asady, A.; Anokhin, S. Al Asady, Ahmad (57219984746); Anokhin, Sergey... 57219984746; 24482882200 The Trojan horse of international entrepreneur... 2025 Journal of World Business 60 6 101677.0 NaN ... 10909516 NaN NaN NaN English Article Final NaN Scopus 2-s2.0-105014957115
1 Thams, Y.; Dau, L.A.; Doh, J.; Kostova, T.; Ne... Thams, Yannick (55357149800); Dau, Luis Alfons... 55357149800; 35147597100; 7003920280; 66037741... Political ideology and the multinational enter... 2025 Journal of World Business 60 6 101678.0 NaN ... 10909516 NaN NaN NaN English Short survey Final NaN Scopus 2-s2.0-105014844629
2 Lindner, T.; Puck, J.; Puhr, H. Lindner, Thomas (57159151000); Puck, Jonas (85... 57159151000; 8563161700; 57223389639 Artificial intelligence in international busin... 2025 Journal of World Business 60 6 101676.0 NaN ... 10909516 NaN NaN NaN English Short survey Final All Open Access; Hybrid Gold Open Access Scopus 2-s2.0-105014595041
3 Bruton, G.D.; Mejía-Morelos, J.H.; Ahlstrom, D. Bruton, Garry D. (6603867202); Mejía-Morelos, ... 6603867202; 55748855800; 56525447800 Multinational corporations and inclusive suppl... 2025 Journal of World Business 60 6 101663.0 NaN ... 10909516 NaN NaN NaN English Article Final All Open Access; Hybrid Gold Open Access Scopus 2-s2.0-105013512235
4 Liang, Y.; Giroud, A.; Rygh, A.; Chen, Z. Liang, Yanze (57223851564); Giroud, Axèle L.A.... 57223851564; 7003496253; 37117826800; 58631386600 Political embeddedness and post-acquisition in... 2025 Journal of World Business 60 6 101665.0 NaN ... 10909516 NaN NaN NaN English Article Final All Open Access; Hybrid Gold Open Access Scopus 2-s2.0-105013485759

5 rows × 41 columns

Cosine Similarity

We first preprocess the abstracts: we lowercase the text, strip punctuation, tokenize, remove stop words, and stem the remaining tokens.

import re 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Build the stop-word set and the stemmer once rather than on every call
# (requires the NLTK stop-word and tokenizer data to be downloaded)
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text):
    # Lowercase, strip punctuation, and trim surrounding whitespace
    cleaned_text = re.sub(r'[^\w\s]', '', text.lower()).strip()

    # Tokenize
    tokens = word_tokenize(cleaned_text)

    # Remove stop words
    tokens = [word for word in tokens if word not in STOP_WORDS]

    # Stem the remaining tokens and return them as a list
    return [STEMMER.stem(word) for word in tokens]

data['clean_abstract'] = data['Abstract'].apply(preprocess_text)
data[['Abstract', 'clean_abstract']].head(5)
Abstract clean_abstract
0 This study explores the under-theorized relati... [studi, explor, undertheor, relationship, inte...
1 While politics and political issues such as ri... [polit, polit, issu, risk, domin, agenda, inte...
2 This paper discusses the impact of artificial ... [paper, discuss, impact, artifici, intellig, a...
3 An institutional logic represents the way a pa... [institut, logic, repres, way, particular, soc...
4 Political embeddedness has been shown to influ... [polit, embedded, shown, influenc, firm, innov...
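
Note that the first row contains the token "undertheor" rather than separate tokens for "under" and "theorized". This follows from the punctuation step, which deletes the hyphen instead of replacing it with a space. The sketch below illustrates the difference on a made-up sample string (the string and the variable names are illustrative assumptions, not part of the JWB data); whether merged or split tokens are preferable depends on the analysis.

```python
import re

# Hypothetical sample string containing a hyphenated word
sample = "Under-theorized relationship"

# Deleting punctuation (as in preprocess_text) merges the hyphenated parts
merged = re.sub(r'[^\w\s]', '', sample.lower())

# Replacing punctuation with a space keeps the parts as separate words
spaced = re.sub(r'[^\w\s]', ' ', sample.lower())

print(merged)          # undertheorized relationship
print(spaced.split())  # ['under', 'theorized', 'relationship']
```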

We now compute the term frequency-inverse document frequency (TF-IDF) scores of each abstract.

from sklearn.feature_extraction.text import TfidfVectorizer

# Join each list of words into a string
texts = data['clean_abstract'].apply(lambda x: ' '.join(x))

# Compute TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create TF-IDF DataFrame
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Optionally, include the original data (Abstract + clean_abstract)
data_tfidf = pd.concat([data.reset_index(drop=True), df_tfidf], axis=1)
data_tfidf.head(5)
Authors Author full names Author(s) ID Title Year Source title Volume Issue Art. No. Page start ... zeroinfl zeroshot zhirinovski zimbabw zizhu zone zoom zurawicki firm firmlevel
0 Al Asady, A.; Anokhin, S. Al Asady, Ahmad (57219984746); Anokhin, Sergey... 57219984746; 24482882200 The Trojan horse of international entrepreneur... 2025 Journal of World Business 60 6 101677.0 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 Thams, Y.; Dau, L.A.; Doh, J.; Kostova, T.; Ne... Thams, Yannick (55357149800); Dau, Luis Alfons... 55357149800; 35147597100; 7003920280; 66037741... Political ideology and the multinational enter... 2025 Journal of World Business 60 6 101678.0 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 Lindner, T.; Puck, J.; Puhr, H. Lindner, Thomas (57159151000); Puck, Jonas (85... 57159151000; 8563161700; 57223389639 Artificial intelligence in international busin... 2025 Journal of World Business 60 6 101676.0 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Bruton, G.D.; Mejía-Morelos, J.H.; Ahlstrom, D. Bruton, Garry D. (6603867202); Mejía-Morelos, ... 6603867202; 55748855800; 56525447800 Multinational corporations and inclusive suppl... 2025 Journal of World Business 60 6 101663.0 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 Liang, Y.; Giroud, A.; Rygh, A.; Chen, Z. Liang, Yanze (57223851564); Giroud, Axèle L.A.... 57223851564; 7003496253; 37117826800; 58631386600 Political embeddedness and post-acquisition in... 2025 Journal of World Business 60 6 101665.0 NaN ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 6645 columns

We can now compute the pairwise cosine similarity scores and display them in a heat map.

from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

# cosine_similarity accepts the sparse TF-IDF matrix directly,
# so there is no need to convert it to a dense array first
cosim = cosine_similarity(tfidf_matrix)

# Heat map of all pairwise similarities
plt.imshow(cosim)
plt.colorbar(label='Cosine similarity')
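
For reference, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The following sketch checks this definition against scikit-learn on a small made-up matrix (the values are illustrative assumptions, not TF-IDF scores from the data).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical document vectors: rows 0 and 1 point in the same direction,
# row 2 shares no terms with either of them
toy = np.array([[1.0, 2.0, 0.0],
                [2.0, 4.0, 0.0],
                [0.0, 0.0, 3.0]])

# Dot products divided by the product of the row norms
norms = np.linalg.norm(toy, axis=1, keepdims=True)
manual = (toy @ toy.T) / (norms * norms.T)

print(np.round(manual, 2))
print(np.allclose(manual, cosine_similarity(toy)))
```

Because the score depends only on direction, rows 0 and 1 have a similarity of 1 even though one is twice as long as the other, while documents with no terms in common score 0.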

For a closer look, we examine the first 100 abstracts. The brighter the cell, the more similar the abstracts are.

plt.imshow(cosim[:100, :100])
plt.colorbar()
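
A heat map gives an overall impression, but it can be hard to read off individual pairs. One possible complement, sketched here under the assumption that we only care about distinct pairs, is to rank the entries above the diagonal of the similarity matrix. The helper top_similar_pairs and the toy matrix are hypothetical names and values introduced for illustration.

```python
import numpy as np

def top_similar_pairs(sim, k=3):
    """Return the k most similar (i, j, score) pairs with i < j."""
    # Indices strictly above the diagonal: skips self-pairs and mirrored duplicates
    rows, cols = np.triu_indices_from(sim, k=1)
    scores = sim[rows, cols]
    # Sort descending by score and keep the first k pairs
    order = np.argsort(scores)[::-1][:k]
    return [(int(rows[o]), int(cols[o]), float(scores[o])) for o in order]

# Hypothetical 3 x 3 similarity matrix
toy_sim = np.array([[1.0, 0.2, 0.7],
                    [0.2, 1.0, 0.4],
                    [0.7, 0.4, 1.0]])

print(top_similar_pairs(toy_sim, k=2))  # [(0, 2, 0.7), (1, 2, 0.4)]
```

Applied to the matrix computed above, a call such as top_similar_pairs(cosim, k=10) would return row positions that can be looked up in the abstract table after resetting its index.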