This paper is available on arXiv under the CC BY-SA 4.0 DEED license.
Authors:
(1) Mathias-Felipe de-Lima-Santos, Faculty of Humanities, University of Amsterdam, Institute of Science and Technology, Federal University of São Paulo;
(2) Wilson Ceron, Institute of Science and Technology, Federal University of São Paulo.
2. Data and Methods
2.1. Data Collection and Preparation
Recognizing that isolated bots might not represent the most critical issues on the platform, we opted to focus on coordinated activities to investigate disinformation campaigns within Facebook groups. We contend that these communities inherently function as echo chambers, where users intentionally join these Facebook groups to be exposed selectively to information that aligns with their beliefs and values. Hence, these communities offer an ideal context to delve into information dissemination dynamics. To explore this avenue, our study follows a three-step approach.
Initially, we identified disinformation narratives circulating on Facebook by analyzing debunked content from two prominent fact-checking agencies in Brazil: Agência Lupa and Aos Fatos. Both organizations adhere to the transparency standards set by the International Fact-Checking Network (IFCN), a coalition dedicated to upholding excellence in the fact-checking industry [69]. Our data collection spanned from January 2020 to June 2021, yielding a total of 2,860 items. We employed an algorithm to filter out debunks that did not include the term “vaccine” or related variations in their titles. This process yielded 250 debunks specifically addressing COVID-19 vaccines. Subsequently, we subjected these debunks to qualitative analysis, confirming that they were all false or misleading, and eliminating any that did not meet this criterion.
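The title filter described above can be sketched as follows. This is a minimal illustration, not the algorithm used in the study: the paper does not list its exact search terms, so the stems `vacin` and `vaccin` and the `title` field name are assumptions.

```python
import re
import unicodedata

# Assumed stems: Portuguese "vacina/vacinação" and English "vaccine".
# The paper does not list the exact terms it used.
VACCINE_PATTERN = re.compile(r"\b(vacin|vaccin)\w*", re.IGNORECASE)


def strip_accents(text):
    """Remove diacritics so that 'vacinação' is matched as 'vacinacao'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def mentions_vaccine(title):
    """Return True if the title contains a vaccine-related term."""
    return bool(VACCINE_PATTERN.search(strip_accents(title)))


def filter_debunks(debunks):
    """Keep only debunks whose title mentions a vaccine-related term."""
    return [d for d in debunks if mentions_vaccine(d["title"])]


# Hypothetical titles, for illustration only.
sample = [
    {"title": "É falso que vacinas alteram o DNA"},
    {"title": "Vídeo antigo circula como se fosse atual"},
    {"title": "Vacinação não causa infertilidade"},
]
kept = filter_debunks(sample)
```

A keyword filter of this kind only narrows the set of candidates, which is why it is followed by the qualitative review described above.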
Moving on to our second step, we extracted relevant data to locate these debunked posts within Facebook. We utilized academic access to CrowdTangle, an insights tool owned and operated by Meta since 2016. It is important to note that prior research has highlighted certain limitations of this tool, such as incomplete metrics and restricted access to fully public spaces on the broader Facebook or Instagram platforms. CrowdTangle only encompasses public groups with a certain user threshold, as opposed to the entire spectrum of groups [9,70,71].
In our search, we aimed to pinpoint sentences that could be readily identified and would not yield unrelated results. For instance, we refrained from using phrases such as "COVID-19 vaccines" or similar constructs that could encompass both disinformation and credible information. Our search criteria aligned with the timeframe of the debunks, spanning from January 2020 to June 2021. Through this process, we retrieved a total of 21,614 posts containing disinformation across 3,912 groups. Importantly, this data extraction was performed after Facebook’s public announcement that it had removed false content from its platform [72]. This announcement holds particular significance, as these posts should have been eradicated from the platform by that time, which could have hindered our study. Nevertheless, our findings reveal that this announcement was not fully realized, as many debunked posts persisted on the platform. This discrepancy suggests that the volume of such posts within Facebook public groups could be even more substantial.
In our third phase, we proceeded to download all the identified posts. Due to data extraction limitations within the tool, we segmented the process into timeframes, later amalgamating the data into a unified dataset. This database underwent a process of duplicate removal based on post IDs, resulting in the elimination of 1,707 duplicated entries from our initial dataset. Consequently, our final dataset encompassed 19,457 distinct entries.
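A minimal sketch of the merge-and-deduplicate step, assuming each timeframe export has been loaded as a list of dictionaries and that the post identifier lives in a field named `id` (a hypothetical name; the actual export column may differ):

```python
def merge_and_deduplicate(batches, id_field="id"):
    """Merge per-timeframe batches and drop posts whose ID was already seen.

    Returns the unique posts (first occurrence kept) and the number of
    duplicates removed.
    """
    seen = set()
    unique = []
    duplicates = 0
    for batch in batches:
        for post in batch:
            post_id = post[id_field]
            if post_id in seen:
                duplicates += 1
                continue
            seen.add(post_id)
            unique.append(post)
    return unique, duplicates


# Hypothetical batches whose timeframes overlap on one post.
january = [{"id": "p1"}, {"id": "p2"}]
february = [{"id": "p2"}, {"id": "p3"}]
posts, removed = merge_and_deduplicate([january, february])
```

Keeping the first occurrence preserves the order of the original exports, and the returned counter makes the number of removed entries easy to report.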
2.2. Data Analysis and Visualization
In prior investigations of coordinated inauthentic behavior, researchers utilized estimated time thresholds to identify items shared in near-simultaneity over a short period. Similarly, a statistical metric was proposed to identify concurrent link sharing by assessing the interarrival time – the interval difference in seconds between successive shares of URLs [28]. However, we chose not to adopt these thresholds in our study for several reasons.
Unlike previous studies that centered on URLs [9,28], our analysis seeks to identify coordinated inauthentic behavior (CIB) within textual and visual content. Moreover, our study focuses solely on coordinated activities among nonhuman accounts, necessitating a more stringent approach. In similar studies, the threshold was calculated from the subset of the 10% of URLs with the shortest time intervals between the first and second shares [9,28,54–56]. Our empirical tests demonstrated that the timeframe calculated from the shortest intervals of 10% of URLs could range from 30 seconds to a minute. Consequently, depending on the dataset in use, this threshold might extend to around one minute, a timeframe within which humans could feasibly post.
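The URL-based threshold used in earlier work, which this study chose not to adopt, can be approximated as follows. This is a rough sketch only: the text does not state how the fastest 10% of URLs are summarized into a single value, so the median used here is an assumption, as are the function names and the input layout (a mapping from URL to share timestamps in seconds).

```python
import math
from statistics import median


def first_to_second_gaps(shares_by_url):
    """Seconds between the first and second share of each URL."""
    gaps = []
    for timestamps in shares_by_url.values():
        if len(timestamps) < 2:
            continue  # a URL shared only once has no interval
        ordered = sorted(timestamps)
        gaps.append(ordered[1] - ordered[0])
    return gaps


def estimate_threshold(shares_by_url, fraction=0.10):
    """Summarize the fastest `fraction` of URLs (assumption: by their median gap)."""
    gaps = sorted(first_to_second_gaps(shares_by_url))
    if not gaps:
        return None
    subset_size = max(1, math.ceil(len(gaps) * fraction))
    return median(gaps[:subset_size])


# Hypothetical data: twenty URLs whose second share arrives 10 s, 20 s, ... later.
example = {f"url{i}": [0, 10 * i] for i in range(1, 21)}
example["shared_once"] = [5]
threshold = estimate_threshold(example)
```

Because the result depends entirely on the distribution of intervals in the dataset at hand, a procedure of this kind can produce thresholds long enough for a human to post within them.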
Given these constraints, we undertook manual testing to ascertain a timeframe unlikely for consecutive human posting. Our tests indicated that this interval should be less than 30 seconds. We acknowledge that factors like internet speed and computing power might impact this performance. Nevertheless, we opted to adopt a threshold of 30 seconds between two posts, as it represented the minimum time required for consecutive posting. Our approach also considered a recursive 30-second timeframe, accounting for the possibility of repeated new posts within short intervals – a scenario unlikely to occur frequently. This approach allowed us to identify coordinated posts that were disseminated over an extended period.
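The recursive 30-second timeframe can be illustrated with a small sketch that chains posts as long as each one follows the previous within the threshold. The function name and the use of plain timestamps in seconds are assumptions, as is the strict "less than 30 seconds" boundary.

```python
THRESHOLD_SECONDS = 30  # threshold adopted in the text


def chain_rapid_posts(timestamps, threshold=THRESHOLD_SECONDS):
    """Group timestamps into chains whose consecutive gaps are below the threshold.

    A chain may span far more than `threshold` seconds overall, as long as
    every post in it follows the previous one quickly enough.
    """
    chains = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] >= threshold:
            if len(current) > 1:
                chains.append(current)
            current = []
        current.append(ts)
    if len(current) > 1:
        chains.append(current)
    return chains


# Hypothetical posting times in seconds.
chains = chain_rapid_posts([0, 20, 45, 200, 210, 500])
```

In this example the first chain covers 45 seconds even though no single gap reaches 30 seconds, which is how a recursive window captures coordination spread over a longer period.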
Considering these temporal criteria, our computational model assessed four elements to determine coordination between posts. First, the method analyzed the “message” field, encompassing the textual content of a Facebook post. Second, it scrutinized the “description” field, which provides textual information accompanying external URLs or images shared on Facebook thumbnails. For example, the description for the post in Figure 1 was “Uma catastrófica análise sobre as vacinas contra o vírus chinês: ‘Interferem diretamente no material genético’,” identical to the content in the thumbnail. Third, our methodology leveraged CrowdTangle’s computer vision algorithm to detect text within images and ascertain if these visual contents were disseminated through automated means. It is worth noting that prior research has highlighted that CrowdTangle’s computer vision capabilities for text recognition are a recent development and are not without limitations [71]. Lastly, our process examined whether multiple entities rapidly and consistently shared the same URL, which serves as another indicator of coordinated activity [9,28].
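The four-element check can be sketched as below. This is an illustrative reading rather than the study's actual model: the field names `message` and `description` come from the text, whereas `image_text`, `link`, `group_id`, `timestamp`, and `id` are hypothetical names, and the requirements that content match exactly and come from different groups are assumptions.

```python
from collections import defaultdict

# "message" and "description" are named in the text; the other two names are assumed.
FIELDS = ("message", "description", "image_text", "link")
THRESHOLD_SECONDS = 30


def coordinated_pairs(posts, threshold=THRESHOLD_SECONDS):
    """Return (earlier_id, later_id, field) for consecutive posts that carry
    identical content in a field, come from different groups, and are
    less than `threshold` seconds apart."""
    buckets = defaultdict(list)
    for post in posts:
        for field in FIELDS:
            value = (post.get(field) or "").strip()
            if value:  # ignore empty or missing fields
                buckets[(field, value)].append(post)

    pairs = set()
    for (field, _value), bucket in buckets.items():
        bucket.sort(key=lambda p: p["timestamp"])
        for earlier, later in zip(bucket, bucket[1:]):
            gap = later["timestamp"] - earlier["timestamp"]
            if gap < threshold and earlier["group_id"] != later["group_id"]:
                pairs.add((earlier["id"], later["id"], field))
    return pairs


# Hypothetical posts, for illustration only.
posts = [
    {"id": "a", "group_id": "g1", "timestamp": 0, "message": "X"},
    {"id": "b", "group_id": "g2", "timestamp": 12, "message": "X"},
    {"id": "c", "group_id": "g3", "timestamp": 100, "message": "X"},
    {"id": "d", "group_id": "g1", "timestamp": 5, "message": "Y",
     "link": "http://example.org/1"},
    {"id": "e", "group_id": "g2", "timestamp": 20, "message": None,
     "link": "http://example.org/1"},
]
pairs = coordinated_pairs(posts)
```

Recording which field triggered the match makes it possible to separate coordination on text, thumbnails, image text, and URLs in later analysis.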
To visualize the coordinated behaviors among different Facebook groups more effectively, we constructed a graph G=(V,E), where each vertex in V={v1,v2,v3,…,vn} represents a Facebook group, and each edge in E={e1,e2,e3,…,em} indicates the sharing of posts with signals of coordinated activity across these groups. This process was applied to the entire dataset, resulting in the creation of Figure 2. To implement this graph, we utilized the network analysis software Gephi [73], which allowed us to visually demonstrate the stronger connections between certain groups and the presence of structures that resemble “echo chambers.” The Louvain method was employed to identify network communities within this graph [74]. This community detection algorithm relies on modularity optimization, resulting in a fast process to generate clusters [75]. Through this technique, we could pinpoint closely linked Facebook groups that formed more significant echo chambers.
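To make the graph construction concrete, the sketch below builds a weighted edge list between groups and computes the modularity score that the Louvain method optimizes. It does not implement the Louvain method itself, and it is not the Gephi workflow used in the study. Weighting each edge by the number of coordinated items two groups share is an assumption, as are the function names.

```python
from collections import Counter
from itertools import combinations


def build_edges(coordinated_sets):
    """Each item lists the groups that shared one coordinated piece of content.
    Every pair of such groups gains one unit of edge weight (assumed weighting)."""
    weights = Counter()
    for groups in coordinated_sets:
        for u, v in combinations(sorted(set(groups)), 2):
            weights[(u, v)] += 1
    return weights


def modularity(weights, community_of):
    """Modularity Q = sum over communities c of L_c/m - (d_c/(2m))^2, where
    m is the total edge weight, L_c the weight inside c, and d_c the summed
    weighted degree of the nodes in c."""
    total = sum(weights.values())
    internal = Counter()
    degree_sum = Counter()
    for (u, v), w in weights.items():
        degree_sum[community_of[u]] += w
        degree_sum[community_of[v]] += w
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += w
    return sum(
        internal[c] / total - (degree_sum[c] / (2 * total)) ** 2
        for c in degree_sum
    )


# Hypothetical example: two tightly linked triads joined by a single edge.
edges = build_edges([
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
])
split = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
merged = {node: 0 for node in split}
q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
```

A partition that separates the two triads scores higher than one that merges all groups, which is the property a modularity-based method exploits when it proposes clusters.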
Furthermore, we generated a second graph (see Figure 3) illustrating the five most shared instances of disinformation content. This graph, denoted as G=(D, F), consisted of nodes of different types, where the set of disinformation content D={d1,d2,d3,…,dn} was connected to the set of Facebook groups F={f1,f2,f3,…,fn} through an edge set E={e1,e2,e3,…,em}, signifying the coordinated activity signals within the dataset. This graph vividly demonstrates the robust correlation between echo chambers and the widespread dissemination of disinformation. In the subsequent section, we delve into our findings and present these visualizations.
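The two-mode graph can be sketched in the same spirit. The input layout, a list of (content, group) pairs for posts already flagged as coordinated, and the ranking by number of shares are assumptions made for illustration.

```python
from collections import Counter


def top_content_edges(shares, top_n=5):
    """Select the `top_n` most shared content items and return them together
    with the distinct (content, group) edges that connect them to groups."""
    counts = Counter(content for content, _group in shares)
    top = [content for content, _count in counts.most_common(top_n)]
    edges = sorted({(c, g) for c, g in shares if c in top})
    return top, edges


# Hypothetical coordinated shares: (content ID, group ID).
shares = [
    ("d1", "g1"), ("d1", "g2"), ("d1", "g3"),
    ("d2", "g1"), ("d2", "g2"),
    ("d3", "g4"),
]
top, edges = top_content_edges(shares, top_n=2)
```

Each edge links a node in D to a node in F, so groups that appear next to several of the top items stand out as shared distribution points.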