APPENDIX A
Data Repository
The first and most important step in detecting fake news is to collect a benchmark dataset. Despite several existing computational solutions for fake news detection, the lack of comprehensive and community-driven fake news datasets has become one of the major roadblocks. In this appendix, we introduce a multi-dimensional data repository, FakeNewsNet,^1 which contains two datasets with news content, social context, and spatiotemporal information [134]. For related datasets on fake news, rumors, etc., readers can refer to several other survey papers such as [128, 183].
The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. First, the rich set of features in the datasets provides an opportunity to experiment with different approaches to fake news detection, to understand the diffusion of fake news in social networks, and to intervene in it. Second, the temporal information enables the study of early fake news detection by generating synthetic user engagements from historical temporal user engagement patterns in the dataset [112]. Third, we can investigate the fake news diffusion process by identifying provenances and persuaders, and develop better fake news intervention strategies [131]. Our data repository can serve as a starting point for many exploratory studies of fake news and provide a better, shared insight into disinformation tactics. The repository is continuously updated with new sources and features. For a better comparison of the differences, we list popular existing fake news detection datasets below and compare them with the FakeNewsNet repository in Table A.1.
BuzzFeedNews:^2 This dataset comprises a complete sample of news published on Facebook by nine news agencies over a week close to the 2016 U.S. election, from September 19-23 and September 26-27. Every post and the linked article were fact-checked claim-by-claim by five BuzzFeed journalists. It contains 1,627 articles: 826 mainstream, 356 left-wing, and 545 right-wing.
LIAR:^3 This dataset [163] is collected from the fact-checking website PolitiFact. It contains 12,800 human-labeled short statements, each labeled into one of six categories ranging from completely false to completely true: pants on fire, false, barely true, half true, mostly true, and true.
^1 https://github.com/KaiDMML/FakeNewsNet
^2 https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data
^3 https://www.cs.ucsb.edu/~william/software.html
BS Detector:^4 This dataset is collected from a browser extension called BS Detector, developed for checking news veracity. The extension searches all links on a given web page for references to unreliable sources by checking against a manually compiled list of domains. The labels are thus the outputs of the BS Detector tool, rather than of human annotators.
CREDBANK:^5 This is a large-scale crowdsourced dataset [91] of around 60 million tweets covering 96 days starting from October 2015. The tweets relate to over 1,000 news events, and each event is assessed for credibility by 30 annotators from Amazon Mechanical Turk.
BuzzFace:^6 This dataset [120] is collected by extending the BuzzFeed dataset with comments related to news articles on Facebook. It contains 2,263 news articles and 1.6 million comments discussing news content.
FacebookHoax:^7 This dataset [147] comprises information on posts from Facebook pages related to scientific news (non-hoax) and conspiracy pages (hoax), collected using the Facebook Graph API. It contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific) with more than 2,300,000 likes.
NELA-GT-2018:^8 This dataset contains articles collected between February 2018 and November 2018 from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources, resulting in 713k articles. The ground-truth labels are integrated from eight independent assessments.
From Table A.1, we observe that no existing public dataset provides all possible features of news content, social context, and spatiotemporal information. Existing datasets have limitations that we try to address in our data repository. For example, BuzzFeedNews only contains the headline and text of each news piece and covers news articles from very few news agencies. The LIAR dataset contains mostly short statements rather than entire news articles with meta attributes. The BS Detector data is collected and annotated by an automated news-veracity-checking tool rather than by human expert annotators. The CREDBANK dataset was originally collected for evaluating tweet credibility; its tweets are not related to fake news articles and hence cannot be effectively used for fake news detection. The BuzzFace dataset has basic news content and social context information but does not capture temporal information. The FacebookHoax dataset contains very few instances of conspiracy theories and scientific news.
To address the disadvantages of existing fake news detection datasets, the proposed FakeNewsNet repository collects multi-dimensional information spanning news content, social context, and spatiotemporal information from different types of news domains such as political and entertainment sources.
^4 https://github.com/bs-detector/bs-detector
^5 http://compsocial.github.io/CREDBANK-data/
^6 https://github.com/gsantia/BuzzFace
^7 https://github.com/gabll/some-like-it-hoax
^8 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ULHLCB
Table A.1: The comparison with representative fake news detection datasets

Dataset        | News Content        | Social Context                  | Spatiotemporal
               | Linguistic | Visual | User | Post | Response | Network | Spatial | Temporal
BuzzFeedNews   |     ✓      |        |      |      |          |         |         |
LIAR           |     ✓      |        |      |      |          |         |         |
BS Detector    |     ✓      |        |      |      |          |         |         |
CREDBANK       |     ✓      |        |  ✓   |  ✓   |          |         |    ✓    |    ✓
BuzzFace       |     ✓      |        |      |  ✓   |    ✓     |         |         |    ✓
FacebookHoax   |     ✓      |        |  ✓   |  ✓   |    ✓     |         |         |
NELA-GT-2018   |     ✓      |        |      |      |          |         |         |
FakeNewsNet    |     ✓      |   ✓    |  ✓   |  ✓   |    ✓     |    ✓    |    ✓    |    ✓
Data Integration: In this part, we introduce the dataset integration process for the FakeNewsNet repository. Figure A.1 shows how we collect news content with reliable ground-truth labels, and how we obtain additional social context and spatiotemporal information.
News Content: To collect reliable ground-truth labels for fake news, we utilize fact-checking websites such as PolitiFact^9 and GossipCop^10 to obtain news content for both fake and true news.

In PolitiFact, journalists and domain experts review political news and provide fact-checking evaluation results that label news articles as fake^11 or real.^12 We utilize these claims as ground truths for fake and real news pieces. PolitiFact's fact-checking evaluation results provide the source URLs of the web pages that published the news articles, which can be used to fetch the news content. In some cases, the web pages of source news articles have been removed and are no longer available. To tackle this problem, we (i) check whether the removed page was archived and automatically retrieve its content from the Wayback Machine;^13 and (ii) use Google web search in an automated fashion to identify the news article most related to the actual news.
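As a hedged illustration of step (i), the sketch below queries the Wayback Machine's public availability API for the closest archived snapshot of a removed source URL. The helper name fetch_archived_url and the fallback behavior are our own assumptions for illustration, not the actual FakeNewsNet crawler code.

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def fetch_archived_url(dead_url: str, timeout: int = 10) -> str | None:
    """Return the closest Wayback Machine snapshot URL for a page, if any.

    If no snapshot exists, the pipeline would fall back to an automated
    Google web search for the most related article (step ii).
    """
    resp = requests.get(WAYBACK_API, params={"url": dead_url}, timeout=timeout)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest["url"] if closest.get("available") else None

# Usage: recover the content of a removed news page.
snapshot = fetch_archived_url("http://example.com/removed-news-article")
if snapshot is not None:
    html = requests.get(snapshot, timeout=10).text  # archived page content
```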
^9 https://www.politifact.com/
^10 https://www.gossipcop.com/
^11 Available at https://www.politifact.com/subjects/fake-news/.
^12 Available at https://www.politifact.com/truth-o-meter/rulings/true/.
^13 https://archive.org/web/
[Figure A.1: The flowchart of the data integration process for FakeNewsNet. It mainly describes the collection of news content, social context, and spatiotemporal information: PolitiFact, GossipCop, linguistic-content, and visual-content crawlers produce labeled news, while post, response, user, and network crawlers run daily and update periodically.]
GossipCop is a website that fact-checks entertainment stories aggregated from various media outlets. It provides rating scores on a scale of 0-10 to classify a news story along the degree from fake to real. To collect true entertainment news pieces, we crawl news articles from E! Online,^14 a well-known trusted media website for entertainment news, and consider all of its articles as real news. We collect all news stories from GossipCop with rating scores less than 5 as fake news stories. Since GossipCop does not explicitly provide the URL of the source news article, we search the news headline on Google or in the Wayback Machine archive to obtain the news source information.
Social Context: The user engagements related to the fake and real news pieces from fact-checking websites are collected using the search APIs provided by social media platforms, such as Twitter's advanced search API.^15 The search queries for collecting user engagements are formed from the headlines of news articles, with special characters removed to filter out noise. After we obtain the social media posts that directly spread the news pieces, we further fetch the user responses toward these posts, such as replies, likes, and reposts.
^14 https://www.eonline.com/
^15 https://twitter.com/search-advanced?lang=en
In addition, once we obtain all the users engaging in the news dissemination process, we collect all the metadata for user profiles, user posts, and the social network information.
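The query-formation step can be sketched as below; build_search_query is a hypothetical helper, and the actual call to a platform's search endpoint is elided since the exact API and authentication details vary.

```python
import re

def build_search_query(headline: str) -> str:
    """Form a search query from a news headline by stripping special
    characters, which would otherwise add noise to the returned posts."""
    cleaned = re.sub(r"[^\w\s]", " ", headline)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace

query = build_search_query('BREAKING: "Celebrity X" secretly married?!')
# query == "BREAKING Celebrity X secretly married"
# The cleaned query is then submitted to the platform's search API to find
# posts that directly share the news piece.
```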
Spatiotemporal Information: The spatiotemporal information includes spatial and temporal information. For spatial information, we obtain the locations explicitly provided in user profiles. For temporal information, we record the timestamps of user engagements, which can be used to study how fake news pieces propagate on social media and how the topics of fake news change over time. Since fact-checking websites periodically publish newly checked news articles, we dynamically collect these newly added news pieces and update the FakeNewsNet repository accordingly. In addition, we periodically re-collect the user engagements for all the news pieces in the repository, such as recent social media posts and second-order user behaviors such as replies, likes, and retweets. For example, we run the news content crawler and update the tweet collector daily. The spatiotemporal information thus provides comprehensive material for studying the fake news problem from a temporal perspective.
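As one example of what the recorded timestamps enable, the sketch below bins engagement timestamps into hourly counts to trace how quickly a news piece spreads; the input format (ISO-8601 timestamp strings, one per engagement) is an assumption for illustration.

```python
from collections import Counter
from datetime import datetime

def propagation_timeline(timestamps: list[str]) -> list[tuple[str, int]]:
    """Bin engagement timestamps (ISO-8601 strings) into hourly counts,
    ordered in time, to trace how quickly a news piece spreads."""
    hours = (datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00") for ts in timestamps)
    return sorted(Counter(hours).items())

engagements = ["2018-11-05T14:23:09", "2018-11-05T14:47:30", "2018-11-05T16:05:12"]
print(propagation_timeline(engagements))
# [('2018-11-05 14:00', 2), ('2018-11-05 16:00', 1)]
```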