We run the model on only the first 200 words of each abstract, since the default BERT model accepts at most 512 input tokens; other models may allow longer inputs. We then convert the Pandas DataFrame to a Hugging Face Dataset.
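The truncation step can be sketched as follows. This is a minimal illustration with made-up abstracts; the column names `Abstract` and `Abstract_200` are assumptions matching the code later in this section, and the commented line shows the conversion to a Hugging Face Dataset.

```python
import pandas as pd
# from datasets import Dataset  # Hugging Face datasets library

# Toy DataFrame standing in for the real article data
df = pd.DataFrame({'Abstract': ['token ' * 600, 'A short abstract.']})

# Keep only the first 200 words so the tokenized length stays under BERT's 512-token cap
df['Abstract_200'] = df['Abstract'].apply(lambda s: ' '.join(s.split()[:200]))

# dataset = Dataset.from_pandas(df)  # converts the DataFrame to a HF Dataset

print(df['Abstract_200'].str.split().str.len().tolist())  # [200, 3]
```

Note that truncating by words is a rough heuristic: BERT's limit is on subword tokens, so 200 words comfortably clears the 512-token cap for typical English text.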
We now run the model on the abstracts. As an illustration, we apply FinBERT, a model pre-trained on financial communication text.
```python
from transformers import BertTokenizer, BertForSequenceClassification, pipeline

finbert = BertForSequenceClassification.from_pretrained('yiyanghkust/finbert-tone', num_labels=3)
tokenizer = BertTokenizer.from_pretrained('yiyanghkust/finbert-tone')
nlp = pipeline("sentiment-analysis", model=finbert, tokenizer=tokenizer)

def get_sentiment(examples):
    # Initialize lists to store results
    sentiments = []
    scores = []
    # Process each entry in the batch
    for text in examples['Abstract_200']:
        try:
            # Get the sentiment label and score for each article
            result = nlp(text)
            sentiments.append(result[0]['label'])
            scores.append(result[0]['score'])
        except Exception as e:
            print(f'Error processing text: {text}. Error: {e}', flush=True)
            # Append placeholder values so output lists stay aligned with the batch
            sentiments.append(None)
            scores.append(None)
    return {'sentiment': sentiments, 'score': scores}

# Run the sentiment analysis in batches
dataset = dataset.map(get_sentiment, batched=True, batch_size=64)
```
We can examine the results for the first 15 articles.
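A minimal sketch of what such an inspection might look like. The data below is hypothetical, standing in for the mapped dataset; the labels `Positive`, `Neutral`, and `Negative` are the three tone classes produced by `finbert-tone`, and the commented line shows the equivalent selection on a Hugging Face Dataset.

```python
import pandas as pd

# Hypothetical scores standing in for real model output
results = pd.DataFrame({
    'sentiment': ['Positive', 'Neutral', 'Negative'] * 5,
    'score': [0.99, 0.85, 0.93] * 5,
})

# With a Hugging Face Dataset, the equivalent would be:
# dataset.select(range(15)).to_pandas()[['sentiment', 'score']]
print(results.head(15))
```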