APPENDIX A
Data Repository
The first and most important step in detecting fake news is to collect a benchmark dataset. Despite several existing computational solutions for fake news detection, the lack of comprehensive and community-driven fake news datasets has become one of the major roadblocks. In this appendix, we introduce a multi-dimensional data repository, FakeNewsNet,^1 which contains two datasets with news content, social context, and spatiotemporal information [134]. For related datasets on fake news, rumors, etc., readers can refer to several other survey papers such as [128, 183].
The constructed FakeNewsNet repository has the potential to boost the study of various open research problems related to fake news. First, the rich set of features in the datasets provides an opportunity to experiment with different approaches to fake news detection, to understand the diffusion of fake news in social networks, and to intervene in it. Second, the temporal information enables the study of early fake news detection by generating synthetic user engagements from historical temporal user engagement patterns in the dataset [112]. Third, we can investigate the fake news diffusion process by identifying provenances and persuaders, and develop better fake news intervention strategies [131]. Our data repository can serve as a starting point for many exploratory studies of fake news and provide a better, shared insight into disinformation tactics. The repository is continuously updated with new sources and features. For a better comparison of the differences, we list popular existing fake news detection datasets below and compare them with the FakeNewsNet repository in Table A.1.
BuzzFeedNews:^2 This dataset comprises a complete sample of news published on Facebook by nine news agencies over a week close to the 2016 U.S. election, from September 19-23 and September 26-27. Every post and the linked article were fact-checked claim-by-claim by five BuzzFeed journalists. It contains 1,627 articles: 826 mainstream, 356 left-wing, and 545 right-wing.
LIAR:^3 This dataset [163] is collected from the fact-checking website PolitiFact. It contains 12,800 human-labeled short statements, each labeled into one of six categories ranging from completely false to completely true: pants on fire, false, barely true, half true, mostly true, and true.
^1 https://github.com/KaiDMML/FakeNewsNet
^2 https://github.com/BuzzFeedNews/2016-10-facebook-fact-check/tree/master/data
^3 https://www.cs.ucsb.edu/~william/software.html
BS Detector:^4 This dataset is collected from a browser extension called BS Detector, developed for checking news veracity. The extension searches all links on a given web page for references to unreliable sources by checking against a manually compiled list of domains. The labels are thus the outputs of the BS Detector tool, rather than of human annotators.
CREDBANK:^5 This is a large-scale crowdsourced dataset [91] of around 60 million tweets covering 96 days starting from October 2015. The tweets relate to over 1,000 news events, and each event is assessed for credibility by 30 annotators from Amazon Mechanical Turk.
BuzzFace:^6 This dataset [120] is collected by extending the BuzzFeed dataset with comments related to news articles on Facebook. It contains 2,263 news articles and 1.6 million comments discussing news content.
FacebookHoax:^7 This dataset [147] comprises information on posts from Facebook pages related to scientific news (non-hoax) and conspiracy pages (hoax), collected using the Facebook Graph API. It contains 15,500 posts from 32 pages (14 conspiracy and 18 scientific) with more than 2,300,000 likes.
NELA-GT-2018:^8 This dataset contains articles collected between February 2018 and November 2018 from 194 news and media outlets, including mainstream, hyper-partisan, and conspiracy sources, resulting in 713k articles. The ground-truth labels are integrated from eight independent assessments.
From Table A.1, we observe that no existing public dataset provides all possible features of news content, social context, and spatiotemporal information. Existing datasets have limitations that we try to address in our data repository. For example, BuzzFeedNews only contains the headline and text of each news piece and covers news articles from very few news agencies. The LIAR dataset contains mostly short statements rather than entire news articles with meta attributes. The BS Detector data is collected and annotated by an automated news-veracity-checking tool rather than by human expert annotators. The CREDBANK dataset was originally collected for evaluating tweet credibility; its tweets are not related to fake news articles and hence cannot be effectively used for fake news detection. The BuzzFace dataset has basic news content and social context information but does not capture temporal information. The FacebookHoax dataset contains very few instances of conspiracy theories and scientific news.
To address the disadvantages of existing fake news detection datasets, the proposed FakeNewsNet repository collects multi-dimensional information spanning news content, social context, and spatiotemporal information from different types of news domains such as political and entertainment sources.
^4 https://github.com/bs-detector/bs-detector
^5 http://compsocial.github.io/CREDBANK-data/
^6 https://github.com/gsantia/BuzzFace
^7 https://github.com/gabll/some-like-it-hoax
^8 https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/ULHLCB
Table A.1: The comparison with representative fake news detection datasets

Dataset        | News Content        | Social Context                  | Spatiotemporal
               | Linguistic | Visual | User | Post | Response | Network | Spatial | Temporal
BuzzFeedNews   |     ✓      |        |      |      |          |         |         |
LIAR           |     ✓      |        |      |      |          |         |         |
BS Detector    |     ✓      |        |      |      |          |         |         |
CREDBANK       |     ✓      |        |  ✓   |  ✓   |          |         |    ✓    |    ✓
BuzzFace       |     ✓      |        |      |  ✓   |    ✓     |         |         |    ✓
FacebookHoax   |     ✓      |        |  ✓   |  ✓   |    ✓     |         |         |
NELA-GT-2018   |     ✓      |        |      |      |          |         |         |
FakeNewsNet    |     ✓      |   ✓    |  ✓   |  ✓   |    ✓     |    ✓    |    ✓    |    ✓
Data Integration: In this part, we introduce the dataset integration process for the FakeNewsNet repository. Figure A.1 shows how we collect news content with reliable ground-truth labels, and how we obtain additional social context and spatiotemporal information.
News Content: To collect reliable ground-truth labels for fake news, we utilize fact-checking websites such as PolitiFact^9 and GossipCop^10 to obtain news content for both fake and true news.

In PolitiFact, journalists and domain experts review political news and provide fact-checking evaluation results that label news articles as fake^11 or real.^12 We utilize these claims as ground truths for fake and real news pieces. PolitiFact's fact-checking evaluation results provide the source URLs of the web pages that published the news articles, which can be used to fetch the news content. In some cases, the web pages of source news articles have been removed and are no longer available. To tackle this problem, we (i) check whether the removed page was archived and automatically retrieve its content from the Wayback Machine;^13 and (ii) use Google web search in an automated fashion to identify the news article most related to the actual news.
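As a hedged illustration of step (i), the sketch below queries the Wayback Machine's public availability API for the closest archived snapshot of a removed source URL. The helper name fetch_archived_url and the fallback behavior are our own assumptions for illustration, not the actual FakeNewsNet crawler code.

```python
import requests

WAYBACK_API = "https://archive.org/wayback/available"

def fetch_archived_url(dead_url: str, timeout: int = 10) -> str | None:
    """Return the closest Wayback Machine snapshot URL for a page, if any.

    If no snapshot exists, the pipeline would fall back to an automated
    Google web search for the most related article (step ii).
    """
    resp = requests.get(WAYBACK_API, params={"url": dead_url}, timeout=timeout)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest["url"] if closest.get("available") else None

# Usage: recover the content of a removed news page.
snapshot = fetch_archived_url("http://example.com/removed-news-article")
if snapshot is not None:
    html = requests.get(snapshot, timeout=10).text  # archived page content
```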
^9 https://www.politifact.com/
^10 https://www.gossipcop.com/
^11 Available at https://www.politifact.com/subjects/fake-news/.
^12 Available at https://www.politifact.com/truth-o-meter/rulings/true/.
^13 https://archive.org/web/
[Figure A.1: The flowchart of the data integration process for FakeNewsNet. It mainly describes the collection of news content, social context, and spatiotemporal information: PolitiFact, GossipCop, linguistic-content, and visual-content crawlers produce labeled news, while post, response, user, and network crawlers run daily and update periodically.]
GossipCop is a website that fact-checks entertainment stories aggregated from various media outlets. It provides rating scores on a scale of 0-10 to classify a news story along the degree from fake to real. To collect true entertainment news pieces, we crawl news articles from E! Online,^14 a well-known trusted media website for entertainment news, and consider all of its articles as real news. We collect all news stories from GossipCop with rating scores less than 5 as fake news stories. Since GossipCop does not explicitly provide the URL of the source news article, we search the news headline on Google or in the Wayback Machine archive to obtain the news source information.
Social Context: The user engagements related to the fake and real news pieces from fact-checking websites are collected using the search APIs provided by social media platforms, such as Twitter's advanced search API.^15 The search queries for collecting user engagements are formed from the headlines of news articles, with special characters removed to filter out noise. After we obtain the social media posts that directly spread the news pieces, we further fetch the user responses toward these posts, such as replies, likes, and reposts.
^14 https://www.eonline.com/
^15 https://twitter.com/search-advanced?lang=en
In addition, once we obtain all the users engaging in the news dissemination process, we collect all the metadata for user profiles, user posts, and the social network information.
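The query-formation step can be sketched as below; build_search_query is a hypothetical helper, and the actual call to a platform's search endpoint is elided since the exact API and authentication details vary.

```python
import re

def build_search_query(headline: str) -> str:
    """Form a search query from a news headline by stripping special
    characters, which would otherwise add noise to the returned posts."""
    cleaned = re.sub(r"[^\w\s]", " ", headline)  # drop punctuation and symbols
    return re.sub(r"\s+", " ", cleaned).strip()  # collapse runs of whitespace

query = build_search_query('BREAKING: "Celebrity X" secretly married?!')
# query == "BREAKING Celebrity X secretly married"
# The cleaned query is then submitted to the platform's search API to find
# posts that directly share the news piece.
```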
Spatiotemporal Information: The spatiotemporal information includes spatial and temporal information. For spatial information, we obtain the locations explicitly provided in user profiles. For temporal information, we record the timestamps of user engagements, which can be used to study how fake news pieces propagate on social media and how the topics of fake news change over time. Since fact-checking websites periodically publish newly checked news articles, we dynamically collect these newly added news pieces and update the FakeNewsNet repository accordingly. In addition, we periodically re-collect the user engagements for all the news pieces in the repository, such as recent social media posts and second-order user behaviors such as replies, likes, and retweets. For example, we run the news content crawler and update the tweet collector daily. The spatiotemporal information thus provides comprehensive material for studying the fake news problem from a temporal perspective.
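As one example of what the recorded timestamps enable, the sketch below bins engagement timestamps into hourly counts to trace how quickly a news piece spreads; the input format (ISO-8601 timestamp strings, one per engagement) is an assumption for illustration.

```python
from collections import Counter
from datetime import datetime

def propagation_timeline(timestamps: list[str]) -> list[tuple[str, int]]:
    """Bin engagement timestamps (ISO-8601 strings) into hourly counts,
    ordered in time, to trace how quickly a news piece spreads."""
    hours = (datetime.fromisoformat(ts).strftime("%Y-%m-%d %H:00") for ts in timestamps)
    return sorted(Counter(hours).items())

engagements = ["2018-11-05T14:23:09", "2018-11-05T14:47:30", "2018-11-05T16:05:12"]
print(propagation_timeline(engagements))
# [('2018-11-05 14:00', 2), ('2018-11-05 16:00', 1)]
```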