This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Ajay Krishnan T. K., School of Digital Sciences;
(2) V. S. Anoop, School of Digital Sciences.
Table of Links
- Abstract & Introduction
- Related Studies
- Materials and Methods
- Proposed Approach
- Results and Discussions
- Conclusions and References
4 Proposed Approach
This section describes the proposed methodology for sentiment analysis of climate-related tweets from Twitter using ClimateBERT embeddings and a Random Forest classifier. The overall workflow of the proposed approach is shown in Figure 1.
4.1 Dataset
The methodology begins with data collection from Twitter using the snscrape library. The collected data is loaded into a pandas DataFrame for further processing. The dataset consists of climate change-related tweets gathered between 1 January 2022 and 2 February 2023. The collected data may be class-imbalanced, with some sentiment categories overrepresented and others underrepresented, which could bias the model’s predictions. The dataset initially consisted of 4410 data points. After data augmentation, the final dataset consists of 5506 data points with three labels: Positive, Negative, and Neutral. The dataset is available at https://github.com/appliednlp-duk/nlp-climate-change. Table 1 shows example tweets with their assigned labels (Positive, Negative, or Neutral).
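The class imbalance mentioned above can be checked before any augmentation is applied. The following is a minimal sketch, not the authors' code: the helper names class_distribution and imbalance_ratio and the ten sample labels are hypothetical, and only the three label names come from the paper.

```python
from collections import Counter


def class_distribution(labels):
    """Return {label: (count, share of total)} for a list of sentiment labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


def imbalance_ratio(labels):
    """Ratio of the largest class to the smallest; 1.0 means perfectly balanced."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical labels for illustration only; these are not the paper's data.
sample_labels = ["Positive"] * 2 + ["Negative"] * 5 + ["Neutral"] * 3

distribution = class_distribution(sample_labels)
ratio = imbalance_ratio(sample_labels)
```

A ratio well above 1.0 suggests that some form of rebalancing, such as the data augmentation the paper mentions, may be worth considering.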
4.2 Data Preprocessing
To prepare the data for sentiment analysis, several preprocessing steps are applied. Special characters and digits are removed, and the text is converted to lowercase to ensure consistency. Tokenization is then performed, splitting the text into individual tokens. Stopwords (common words with little contextual meaning) are removed, and stemming or lemmatization techniques may be applied to normalize the words. These steps ensure the text data is cleaned and ready for further analysis.
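The preprocessing steps above can be sketched as follows. This is an illustrative version under stated assumptions rather than the authors' pipeline: the function name preprocess, the small STOPWORDS set, and the whitespace tokenizer are all hypothetical choices, and the optional stemming or lemmatization step is omitted.

```python
import re

# Hypothetical, deliberately small stopword list; a real pipeline would
# likely use a fuller list from an NLP library.
STOPWORDS = frozenset({
    "the", "is", "a", "an", "and", "of", "to", "in", "on",
    "for", "it", "this", "that", "are", "we",
})


def preprocess(text, stopwords=STOPWORDS):
    """Clean a tweet and return its remaining tokens."""
    # Replace special characters and digits with spaces.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Convert to lowercase for consistency.
    text = text.lower()
    # Tokenize on whitespace (an assumption; the paper does not name a tokenizer).
    tokens = text.split()
    # Drop stopwords.
    return [token for token in tokens if token not in stopwords]


example_tokens = preprocess("The Climate is changing FAST in 2023!!! #ActNow")
# example_tokens == ["climate", "changing", "fast", "actnow"]
```

Replacing removed characters with a space rather than an empty string keeps adjacent words from being fused together when punctuation separates them.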
4.3 Experimental setup
This section describes the experimental setup for implementing the proposed approach detailed in section 3. All experiments were executed on an NVIDIA A100 GPU with 80 GB of memory and 1,935 GB/s of memory bandwidth. The collected tweets were pre-processed to make them ready for experimentation. All scripts were written in Python 3.9, and the machine learning models were taken from the Scikit-Learn library, available at https://scikit-learn.org/stable/.
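As a rough illustration of the kind of Scikit-Learn workflow this setup implies, the sketch below trains a Random Forest classifier on placeholder feature vectors. Everything specific here is an assumption made for illustration: the synthetic 16-dimensional features merely stand in for ClimateBERT embeddings, and the split ratio, n_estimators, and random seeds are not taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: 40 samples per class, 120 in total.
labels = np.array(["Positive", "Negative", "Neutral"] * 40)

# Synthetic features standing in for sentence embeddings (assumption):
# each class is drawn around a different mean so the toy task is learnable.
class_means = {"Positive": 1.0, "Negative": -1.0, "Neutral": 0.0}
features = np.stack(
    [rng.normal(loc=class_means[label], scale=0.5, size=16) for label in labels]
)

# Stratified split keeps the class proportions equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

# Hyperparameters are illustrative defaults, not the paper's settings.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
```

In an actual run, the synthetic features would be replaced by one embedding vector per pre-processed tweet, with the rest of the workflow left unchanged.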