ClimateNLP: Analyzing Public Sentiment Towards Climate Change: Proposed Approach

14 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Ajay Krishnan T. K., School of Digital Sciences;

(2) V. S. Anoop, School of Digital Sciences.

4 Proposed Approach

This section describes the proposed methodology for sentiment analysis of climate-related tweets from Twitter using ClimateBERT embeddings and a Random Forest classifier. Figure 1 shows the overall workflow of the proposed approach.

Figure 1: Overall workflow of the sentiment analysis on climate change tweets

4.1 Dataset

The methodology begins with data collection from Twitter using the snscrape library. The collected data is loaded into a pandas DataFrame for further processing. The dataset consists of climate change-related tweets gathered between 1 January 2022 and 2 February 2023. The collected data may be class-imbalanced, with some sentiment categories overrepresented and others underrepresented, which could bias the model’s predictions. The dataset initially consisted of 4410 data points; after data augmentation, the final dataset consists of 5506 data points with three labels: Positive, Negative, and Neutral. The dataset is available at https://github.com/appliednlp-duk/nlp-climate-change. Table 1 shows an example of the labeling for each tweet as Positive, Negative, or Neutral.
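As a minimal sketch of how the class imbalance mentioned above could be inspected, the following computes each label's share of a label list. The function name `label_distribution` and the toy labels are hypothetical and are not taken from the paper's code or dataset.

```python
from collections import Counter


def label_distribution(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical toy labels, not drawn from the actual dataset.
toy_labels = ["Positive", "Negative", "Negative", "Neutral", "Negative", "Positive"]
shares = label_distribution(toy_labels)

# Print the share of each sentiment class in alphabetical order.
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.2f}")
```

A strongly uneven distribution here would be one signal that augmentation or resampling is worth considering before training.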

4.2 Data Preprocessing

To prepare the data for sentiment analysis, several preprocessing steps are applied. Special characters and digits are removed, and the text is converted to lowercase to ensure consistency. Tokenization is then performed, splitting the text into individual tokens. Stopwords (common words with little contextual meaning) are removed, and stemming or lemmatization may be applied to normalize the words. These steps ensure that the text is cleaned and ready for further analysis.
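The steps above can be sketched roughly as follows, using only the Python standard library. The stopword list is a small illustrative assumption (the paper does not name the list it uses), the `preprocess` function is hypothetical, and stemming or lemmatization is omitted here for brevity.

```python
import re

# Small illustrative stopword list (an assumption; the paper does not specify its list).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "to", "of", "in", "on", "for", "we"}


def preprocess(text):
    """Lowercase, strip special characters and digits, tokenize, drop stopwords."""
    text = text.lower()
    # Replace anything that is not a lowercase letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Whitespace tokenization; a real pipeline might use a dedicated tokenizer.
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOPWORDS]


print(preprocess("Climate change is REAL: 2023 was the hottest year on record!"))
```

Whitespace splitting is the simplest possible tokenizer; tweets with URLs, hashtags, or mentions would likely need extra handling before this step.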

4.3 Experimental Setup

This section describes the experimental setup used to implement the proposed approach detailed in section 3. All experiments were executed on an NVIDIA A100 GPU with 80 GB of memory and 1,935 GB/s memory bandwidth. The climate change tweets were preprocessed to make them ready for experimentation. All scripts were written in Python 3.9, and the machine learning models were taken from the scikit-learn library, available at https://scikit-learn.org/stable/.