Below is a snapshot of the data we will be working with: Journal of World Business (JWB) article records from 1967–2025, downloaded from Scopus.
import pandas as pd
import numpy as np

data = pd.read_csv('../data/jwb-articles.csv')
data = data[data['Abstract'].notna()]  # Keep nonempty abstracts
data.head()
| | Authors | Author full names | Author(s) ID | Title | Year | Source title | Volume | Issue | Art. No. | Page start | ... | ISSN | ISBN | CODEN | PubMed ID | Language of Original Document | Document Type | Publication Stage | Open Access | Source | EID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Al Asady, A.; Anokhin, S. | Al Asady, Ahmad (57219984746); Anokhin, Sergey... | 57219984746; 24482882200 | The Trojan horse of international entrepreneur... | 2025 | Journal of World Business | 60 | 6 | 101677.0 | NaN | ... | 10909516 | NaN | NaN | NaN | English | Article | Final | NaN | Scopus | 2-s2.0-105014957115 |
| 1 | Thams, Y.; Dau, L.A.; Doh, J.; Kostova, T.; Ne... | Thams, Yannick (55357149800); Dau, Luis Alfons... | 55357149800; 35147597100; 7003920280; 66037741... | Political ideology and the multinational enter... | 2025 | Journal of World Business | 60 | 6 | 101678.0 | NaN | ... | 10909516 | NaN | NaN | NaN | English | Short survey | Final | NaN | Scopus | 2-s2.0-105014844629 |
| 2 | Lindner, T.; Puck, J.; Puhr, H. | Lindner, Thomas (57159151000); Puck, Jonas (85... | 57159151000; 8563161700; 57223389639 | Artificial intelligence in international busin... | 2025 | Journal of World Business | 60 | 6 | 101676.0 | NaN | ... | 10909516 | NaN | NaN | NaN | English | Short survey | Final | All Open Access; Hybrid Gold Open Access | Scopus | 2-s2.0-105014595041 |
| 3 | Bruton, G.D.; Mejía-Morelos, J.H.; Ahlstrom, D. | Bruton, Garry D. (6603867202); Mejía-Morelos, ... | 6603867202; 55748855800; 56525447800 | Multinational corporations and inclusive suppl... | 2025 | Journal of World Business | 60 | 6 | 101663.0 | NaN | ... | 10909516 | NaN | NaN | NaN | English | Article | Final | All Open Access; Hybrid Gold Open Access | Scopus | 2-s2.0-105013512235 |
| 4 | Liang, Y.; Giroud, A.; Rygh, A.; Chen, Z. | Liang, Yanze (57223851564); Giroud, Axèle L.A.... | 57223851564; 7003496253; 37117826800; 58631386600 | Political embeddedness and post-acquisition in... | 2025 | Journal of World Business | 60 | 6 | 101665.0 | NaN | ... | 10909516 | NaN | NaN | NaN | English | Article | Final | All Open Access; Hybrid Gold Open Access | Scopus | 2-s2.0-105013485759 |

5 rows × 41 columns
Cosine Similarity
We first preprocess the abstracts: we lowercase the text and strip punctuation, tokenize it, remove stop words, and stem the remaining words.
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def preprocess_text(text):
    # Lowercase and remove punctuation
    lowercased_text = text.lower()
    remove_punctuation = re.sub(r'[^\w\s]', '', lowercased_text)
    remove_white_space = remove_punctuation.strip()
    # Tokenization
    tokenized_text = word_tokenize(remove_white_space)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    stopwords_removed = [word for word in tokenized_text if word not in stop_words]
    # Stemming
    ps = PorterStemmer()
    stemmed_text = [ps.stem(word) for word in stopwords_removed]
    # Return the stemmed text as a list
    return stemmed_text

data['clean_abstract'] = data['Abstract'].apply(preprocess_text)
data[['Abstract', 'clean_abstract']].head(5)
| | Abstract | clean_abstract |
|---|---|---|
| 0 | This study explores the under-theorized relati... | [studi, explor, undertheor, relationship, inte... |
| 1 | While politics and political issues such as ri... | [polit, polit, issu, risk, domin, agenda, inte... |
| 2 | This paper discusses the impact of artificial ... | [paper, discuss, impact, artifici, intellig, a... |
| 3 | An institutional logic represents the way a pa... | [institut, logic, repres, way, particular, soc... |
| 4 | Political embeddedness has been shown to influ... | [polit, embedded, shown, influenc, firm, innov... |
We now compute the term frequency-inverse document frequency (TF-IDF) scores of each abstract.
from sklearn.feature_extraction.text import TfidfVectorizer

# Join each list of words into a string
texts = data['clean_abstract'].apply(lambda x: ' '.join(x))

# Compute TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(texts)

# Get feature names (words)
feature_names = vectorizer.get_feature_names_out()

# Create TF-IDF DataFrame
df_tfidf = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)

# Optionally, include the original data (Abstract + clean_abstract)
data_tfidf = pd.concat([data.reset_index(drop=True), df_tfidf], axis=1)

data_tfidf.head(5)
| | Authors | Author full names | Author(s) ID | Title | Year | Source title | Volume | Issue | Art. No. | Page start | ... | zeroinfl | zeroshot | zhirinovski | zimbabw | zizhu | zone | zoom | zurawicki | firm | firmlevel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Al Asady, A.; Anokhin, S. | Al Asady, Ahmad (57219984746); Anokhin, Sergey... | 57219984746; 24482882200 | The Trojan horse of international entrepreneur... | 2025 | Journal of World Business | 60 | 6 | 101677.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | Thams, Y.; Dau, L.A.; Doh, J.; Kostova, T.; Ne... | Thams, Yannick (55357149800); Dau, Luis Alfons... | 55357149800; 35147597100; 7003920280; 66037741... | Political ideology and the multinational enter... | 2025 | Journal of World Business | 60 | 6 | 101678.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | Lindner, T.; Puck, J.; Puhr, H. | Lindner, Thomas (57159151000); Puck, Jonas (85... | 57159151000; 8563161700; 57223389639 | Artificial intelligence in international busin... | 2025 | Journal of World Business | 60 | 6 | 101676.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | Bruton, G.D.; Mejía-Morelos, J.H.; Ahlstrom, D. | Bruton, Garry D. (6603867202); Mejía-Morelos, ... | 6603867202; 55748855800; 56525447800 | Multinational corporations and inclusive suppl... | 2025 | Journal of World Business | 60 | 6 | 101663.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | Liang, Y.; Giroud, A.; Rygh, A.; Chen, Z. | Liang, Yanze (57223851564); Giroud, Axèle L.A.... | 57223851564; 7003496253; 37117826800; 58631386600 | Political embeddedness and post-acquisition in... | 2025 | Journal of World Business | 60 | 6 | 101665.0 | NaN | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

5 rows × 6645 columns
We can now compute the pairwise cosine similarity scores between abstracts and display them in a heat map.
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt

sim_scores = np.zeros((df_tfidf.shape[0], df_tfidf.shape[0]))
tfidf_scores = df_tfidf.values
cosim = cosine_similarity(tfidf_scores)
plt.imshow(cosim)
plt.colorbar()
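As a quick check on what each cell of the matrix means, the sketch below computes cosine similarity by hand (the dot product of two vectors divided by the product of their lengths) and compares it with `cosine_similarity`. The three vectors and the helper `manual_cosine` are made-up illustrations, not values from the JWB data.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical term-weight vectors for three tiny "documents"
doc_a = np.array([1.0, 1.0, 0.0])
doc_b = np.array([1.0, 0.0, 0.0])
doc_c = np.array([0.0, 0.0, 2.0])

def manual_cosine(u, v):
    """Cosine of the angle between u and v."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

manual_ab = manual_cosine(doc_a, doc_b)  # 1/sqrt(2), about 0.707
manual_ac = manual_cosine(doc_a, doc_c)  # 0.0: no shared terms

# scikit-learn returns the full pairwise matrix in one call
toy_cosim = cosine_similarity(np.vstack([doc_a, doc_b, doc_c]))
print(manual_ab, toy_cosim[0, 1])
```

Because TF-IDF weights are non-negative, the scores fall between 0 (no terms in common) and 1 (identical direction), and the diagonal is 1 wherever an abstract is compared with itself.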
For a closer look, we examine the first 100 abstracts. The brighter the cell, the more similar the abstracts are.