CHAPTER 3
How Social Context Helps
Social context refers to the entire social environment in which the dissemination of news operates, including how the social data is distributed and how online users interact with each other. It provides useful auxiliary information for inferring the veracity of news articles. The nature of social media provides researchers with additional resources to supplement and enhance news content-based models. There are three major aspects of the social context information that we can represent: users, generated posts, and networks. First, users who spread fake news may have different characteristics from those who spread real news, or may establish different patterns of behavior toward fake news. Second, in the process of fake news dissemination, users express their opinions and emotions through posts and comments. Third, users form different types of networks on social media. We now discuss how social context information can help fake news detection from three perspectives: user-based, post-based, and network-based.
3.1 USER-BASED DETECTION
User-based fake news detection aims to explore the characteristics and behaviors of news consumers on social media to classify fake news. We first examine and compare user profiles for fake news detection, and then discuss how to model user behaviors such as flagging fake news to predict fake news.
3.1.1 USER FEATURE MODELING
Previous research aggregates users' profiles and engagements on news pieces to help infer which articles are fake [22, 59], yielding some promising early results. Recent research has begun to perform principled studies characterizing user profiles and exploring their potential to detect fake news.
Profile Features
We collect and analyze users' meta profile features from two major aspects, i.e., explicit and implicit [142]. Explicit features are obtained directly from the metadata returned by querying social media application programming interfaces (APIs). Implicit features are not directly available but are inferred from user meta information or online behaviors, such as historical tweets. Our selected feature sets are by no means a comprehensive list of all possible features. Rather, we focus on those explicit features that can be easily accessed and are available for almost all public users, and on implicit features that are widely used in the literature for better understanding user characteristics. We first select two subsets of users who are more likely to share fake and real news, respectively, based on FakeNewsNet data [134], and compare the aggregated statistics over these two sets [139].
Explicit Features  A list of representative explicit profile attributes includes the following.
Profile-Related. Basic user description fields:
Verified: whether this is a verified user;
RegisterTime: the number of days since the account was registered.
Content-Related. Attributes of user activities:
StatusCount: the number of posts;
FavorCount: the number of favorites.
Network-Related. Social network attributes:
FollowerCount: the number of followers;
FollowingCount: the number of users being followed.
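To make these attributes concrete, the following is a minimal sketch (not code from [142]) of how the explicit features above could be assembled from a user metadata record; the field names (e.g., created_at, statuses_count, friends_count) follow Twitter-style API conventions and are assumptions rather than a prescribed schema.

```python
from datetime import datetime, timezone

def explicit_profile_features(user, now=None):
    """Assemble explicit profile features from a user metadata dictionary.

    The field names are illustrative, Twitter-style keys; adapt them to the
    actual API response of the platform being studied.
    """
    now = now or datetime.now(timezone.utc)
    created_at = datetime.fromisoformat(user["created_at"])  # e.g., "2012-05-01T00:00:00+00:00"
    return {
        "Verified": int(user.get("verified", False)),
        "RegisterTime": (now - created_at).days,          # days since the account was registered
        "StatusCount": user.get("statuses_count", 0),     # number of posts
        "FavorCount": user.get("favourites_count", 0),    # number of favorites
        "FollowerCount": user.get("followers_count", 0),  # number of followers
        "FollowingCount": user.get("friends_count", 0),   # number of users being followed
    }
```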
Implicit Features  We also explore several implicit profile features, which are not directly provided through user metadata but are widely used to describe and understand user demographics [122]. Note that we adopt widely used tools to predict these implicit features in an unsupervised way. Some representative features are as follows.
Age: studies have shown that age has major impacts on people's psychology and cognition. For example, as age increases, people typically become less open to experiences but more agreeable and conscientious [86]. We infer the age of users using an existing state-of-the-art approach [121]. This method uses a linear regression model with collected predictive lexica (words and their weights); a minimal sketch of this kind of lexicon-based regression is given after this list.
Personality: personality refers to the traits and characteristics that make an individual different from others. We draw on the popular Five Factor Model (or "Big Five"), which classifies personality traits into five dimensions: Extraversion (e.g., outgoing, talkative, active); Agreeableness (e.g., trusting, kind, generous); Conscientiousness (e.g., self-controlled, responsible, thorough); Neuroticism (e.g., anxious, depressive, touchy); and Openness (e.g., intellectual, artistic, insightful). To predict users' personalities, we apply Pear [23], a state-of-the-art unsupervised, text-based personality prediction tool.
Political Bias: political bias plays an important role in shaping users' profiles and affecting news consumption choices on social media. Sociological studies on journalism demonstrate the correlation between partisan bias and news content authenticity (i.e., fake or real news) [44]. Specifically, users tend to share news that confirms their existing political bias [99]. We adopt the method in [74] to measure user political bias scores by exploiting users' interests.
From the empirical comparison analysis, we observe that most of the explicit and implicit profile features exhibit different distributions between the two user groups, which demonstrates their potential for detecting fake news. For example, the box-and-whisker plots in Figure 3.1 show that the distribution of user RegisterTime differs significantly between users who spread fake news and those who spread real news.
Figure 3.1: Profile feature comparison. Box plots of the RegisterTime distribution for users who spread fake news vs. real news on (a) PolitiFact and (b) GossipCop. Based on [142].
Figure 3.2: Age comparison. Predicted ages are ranked from low to high; the x-axis represents the predicted ages and the y-axis indicates the number of users, shown separately for users sharing fake news and users sharing real news on (a) PolitiFact and (b) GossipCop. Based on [142].
The observations on both datasets demonstrate that users who are more likely to share fake news registered much earlier. As an example of implicit features, we show the age comparison in Figure 3.2. We can see that the predicted ages are significantly different: users who spread fake news are predicted to be younger than those who spread real news. Motivated by these observations, we can extract these explicit and implicit features for all the users who spread a news piece in order to predict whether it is fake or not.
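Putting the pieces together, one simple way to use these user features for detection is to aggregate the features of all users who spread a news piece, for example by averaging, and feed the aggregated vectors to a standard classifier. The sketch below assumes each user is represented by a feature dictionary like the ones above; it is an illustration, not the exact pipeline of [142].

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def news_feature_vector(user_feature_dicts):
    """Average the per-user feature dictionaries (explicit and implicit features)
    of all users who spread a given news piece into one fixed-length vector."""
    keys = sorted(user_feature_dicts[0])
    matrix = np.array([[u[k] for k in keys] for u in user_feature_dicts], dtype=float)
    return matrix.mean(axis=0)

def train_detector(news_to_users, labels):
    """news_to_users: list of lists of user feature dictionaries, one inner list
    per news piece; labels: 1 for fake, 0 for real (e.g., from FakeNewsNet)."""
    X = np.array([news_feature_vector(users) for users in news_to_users])
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, np.array(labels))
    return clf
```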
Psychology-Related Features
To understand the characteristics of users who spread fake news, we can rely on psychological theories. Although there is a large body of work on such theories, not many of them can be (1) applied to users and their behaviors on social media and (2) quantitatively measured for fake news articles and spreaders on social media. Hence, based on the psychological theories mentioned in Section 1.2, we can enumerate five categories of features that can potentially express the differences between users who spread fake news and those who spread real news.
Motivational Factors: there are three LIWC categories that are related to uncertainty: discrepancy (e.g., should, would, and could), tentativeness (e.g., maybe, perhaps, and guess), and certainty (e.g., always and never). These categories are abbreviated as discrep, tentat, and certain, respectively. Anxiety can be measured using the LIWC Anxiety category (anx), which includes words such as nervous, afraid, and tense. Importance or outcome-relevance is known to be a difficult feature to measure in psychology, so researchers suggest using proxies to quantify it; we use anxiety as a proxy for this feature, under the assumption that people are more anxious about topics that are more important to them. We use the LIWC Future Focus category (focusfuture), which includes words such as may, will, and soon, to measure lack of control. We do not measure belief explicitly because we assume that any user who tweets fake news articles believes in it.
Demographics: Twitter users are not obligated to include information such as age, race, gender, political orientation, or education on their profiles. Hence, we cannot obtain demographic information from the public profiles of tweeters unless we use a proxy. The prevalence of swear words has been shown to be correlated with gender, social status, and race [12]. Hence, we use the LIWC Swear Words category (swear) as a proxy for the demographics feature.
Social Engagement: the more a user is involved with social media, the less likely it is for him/her to be misguided by fake news. We measure social engagement on Twitter using the average number of tweets per day.
Position in the Network: this feature can be quantified using a variety of metrics when the network structure is known. However, in the case of social networks between Twitter users, we do not have the complete structure, and even collecting local information is time consuming due to the rate limits of the Twitter APIs. Hence, we use the information available in our datasets and extract influence using the number of followees and popularity using the number of followers of each user.
Relationship Enhancement: improving relationships with other social media users and gaining more attention from the community is one of the motivations for spreading fake news. If the number of retweets and likes of a fake news post is higher than the average number of retweets and likes of all the posts by the same user, it indicates that this user has enhanced his/her social relations. Hence, we use the difference between the number of retweets and likes of fake news posts and the user's average values as indicators of the relationship-enhancement motivation.
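The LIWC-based features above can be sketched as simple dictionary look-ups over a user's tweets. Since LIWC is a proprietary dictionary, the word lists below are illustrative stand-ins built from the example words in the text, not the actual LIWC categories.

```python
# Illustrative stand-ins for the LIWC categories used above (not the real dictionaries).
CATEGORY_WORDS = {
    "tentat": {"maybe", "perhaps", "guess"},
    "discrep": {"should", "would", "could"},
    "certain": {"always", "never"},
    "anx": {"nervous", "afraid", "tense", "worried", "fearful"},
    "focusfuture": {"may", "will", "soon"},
    "swear": {"damn"},
}

def liwc_style_features(tweets):
    """Proportion of a user's words that fall into each category, computed over
    all of his/her tweets; a simplified stand-in for LIWC's dictionary counts."""
    words = [w for tweet in tweets for w in tweet.lower().split()]
    total = max(len(words), 1)
    return {category: sum(w in vocabulary for w in words) / total
            for category, vocabulary in CATEGORY_WORDS.items()}
```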
We study the differences between users who spread fake news and those who spread real news in terms of the five feature categories shown in Table 3.1. We set the null hypothesis that the two groups have the same mean for each category of features. If the result of a t-test shows a significant difference (p-value less than 0.05) between the two groups, we can reject the null hypothesis. We observe that: (1) in the Motivational Factors category, there is a significant difference (p-value < 0.005) between users who share fake news and those who share real news in terms of all features except Anxiety; and (2) users also show a significant difference in terms of demographics, as indicated by the presence of swear words in their tweets. These features can be further utilized to detect fake news with various classifiers such as random forests and logistic regression.
Table 3.1: Summary of the metrics used to measure psychological user features

| Feature Category | Feature Name | Metric | Example Words |
|---|---|---|---|
| Motivational Factors | Tentativeness | LIWC tentat | maybe, perhaps |
| | Discrepancy | LIWC discrep | should, would |
| | Certainty | LIWC certain | always, never |
| | Anxiety | LIWC anx | worried, fearful |
| | Lack of Control | LIWC focusfuture | may, will, soon |
| Demographics | Demographics | LIWC swear | damn |
| Social Engagement | Social Engagement | Avg tweets per day | - |
| Position in the Network | Influence | #Followees | - |
| | Popularity | #Followers | - |
| Relationship Enhancement | Boosting #Retweets | Increase in retweets | - |
| | Boosting #Likes | Increase in likes | - |
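For the per-feature significance test described above, a minimal sketch looks as follows; the text does not specify which t-test variant was used, so Welch's unequal-variance version is assumed here.

```python
from scipy import stats

def compare_groups(fake_user_values, real_user_values, alpha=0.05):
    """Two-sample t-test for one feature. The null hypothesis is that users who
    spread fake news and users who spread real news have the same mean."""
    t_stat, p_value = stats.ttest_ind(fake_user_values, real_user_values,
                                      equal_var=False)  # Welch's t-test
    return {"t": t_stat, "p": p_value, "reject_null": p_value < alpha}
```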
3.1.2 USER BEHAVIOR MODELING
In order to mitigate the effect of fake news, social media sites such as Facebook,¹ Twitter,² and Weibo³ propose allowing users to flag a story in their news feed as fake; if the story receives enough flags, it is sent to a trusted third party for manual fact-checking. Such flagging behaviors provide useful signals and have the potential to improve fake news detection and reduce the spread of fake news.

¹https://newsroom.fb.com/news/2016/12/news-feed-fyi-addressing-hoaxes-and-fake-news/
²https://www.washingtonpost.com/news/the-switch/wp/2017/06/29/twitter-is-looking-for-ways-to-let-users-flag-fake-news
³https://www.scmp.com/news/china/policies-politics/article/2055179/how-chinas-highly-censored-wechat-and-weibo-fight-fake
However, the more users that are exposed to a story before it is sent for fact-checking, the greater the confidence that the story may be fake news, but also the higher the potential damage if it indeed turns out to be fake. Thus, it is important to study the when-to-fact-check problem: finding the optimal time to collect a set of user flags so as to reduce fake news exposure and improve detection performance. In addition, users may not always flag fake news accurately. Therefore, it is also important to consider the reliability of user flagging when deciding what-to-fact-check: selecting a set of news pieces for fact-checking given a budget, while taking users' flagging reliabilities into account.
When-to-Fact-Check
As discussed in Section 2.4.1, the fact-checking procedure is time consuming and labor intensive, so it is important to optimize the procedure by deciding when to select stories to fact-check [65]. To develop a model that enables an efficient fact-checking procedure, the authors in [65] use the framework of marked temporal point processes and solve a novel stochastic optimal control problem for fact-checking stories in online social networks.
Given an online social networking site with a set of users $U$ and a set of unverified news posts $C$ on social media, we define two types of user events: endogenous events, which correspond to the publication of stories by users on their own initiative, and exogenous events, which correspond to the resharing and/or flagging of stories by users who are exposed to them through their feeds. For a news post, we can characterize the number of exogenous events as a counting process $N_e(t)$ with intensity function $\lambda_e(t)$, which denotes the expected number of exposures at any given time $t$; and we can use $N_f(t)$ to denote the number of the subset of exposures that involve flags by users. Then, we can compute the average number of users exposed to fake news by time $t$ as $\bar{N}(t)$:

$$\bar{N}(t) := p(y=1 \mid f=1)\, N_f(t) + p(y=1 \mid f=0)\,\big(N_e(t) - N_f(t)\big), \qquad (3.1)$$
where $p(y=1 \mid f=1)$ and $p(y=1 \mid f=0)$ denote the conditional probability of a story being fake news (i.e., $y=1$) given a flag (i.e., $f=1$) and the conditional probability of a story being fake news given no flag (i.e., $f=0$), respectively. $\bar{N}(t)$ can also be represented as a point process with intensity function $\hat{\lambda}(t)$ as follows:

$$\hat{\lambda}(t)\,dt = \mathbb{E}_{f(t),f}\big[d\bar{N}(t)\big] \qquad (3.2)$$
$$\phantom{\hat{\lambda}(t)\,dt} = \mathbb{E}_{f(t),f}\Big[\big(p(y=1 \mid f=1) - p(y=1 \mid f=0)\big)\, f(t) + p(y=1 \mid f=0)\Big]\,\lambda_e(t)\,dt, \qquad (3.3)$$
where $f(t)$ indicates whether there is a flag at time $t$. The probability of flagging a story at a given time depends on whether the story is fake news or not, which we do not know before fact-checking it. Therefore, we estimate this probability from the available data using Bayesian statistics. Now we can characterize the fact-checking problem using the above notation as follows:
$$\underset{b(t_0,\,t_f]}{\text{minimize}} \quad \mathbb{E}\left[\phi\big(\hat{\lambda}(t_f)\big) + \int_{t_0}^{t_f} \ell\big(\hat{\lambda}(\tau), b(\tau)\big)\, d\tau\right] \quad \text{subject to} \quad b(t) \geq 0 \;\; \forall t \in (t_0, t_f], \qquad (3.4)$$
where $b(t)$ is the fact-checking intensity that we can optimize, $\phi(\cdot)$ is an arbitrary penalty function, and $\ell(\cdot,\cdot)$ is a loss function that depends on the expected intensity of the spread of fake news $\hat{\lambda}(t_f)$ and the fact-checking intensity $b(t)$. We can assume the following quadratic forms for $\phi(\cdot)$ and $\ell(\cdot,\cdot)$:
$$\phi\big(\hat{\lambda}(t_f)\big) = \frac{1}{2}\big(\hat{\lambda}(t_f)\big)^2 \qquad (3.5)$$
$$\ell\big(\hat{\lambda}(t), b(t)\big) = \frac{1}{2}\big(\hat{\lambda}(t)\big)^2 + \frac{1}{2}\, q\, b^2(t), \qquad (3.6)$$
where $q$ is a parameter controlling the trade-off between fact-checking and the spread of fake news: smaller values of $q$ indicate more resources available for fact-checking, and larger values indicate fewer available resources.
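As a small numerical sketch of Eq. (3.1), the expected number of fake-news exposures by time $t$ can be computed from the exposure and flag counts together with the two conditional probabilities. The values below are arbitrary illustrations; in [65] these probabilities are estimated from data with Bayesian statistics.

```python
def expected_fake_exposures(n_exposures, n_flags, p_fake_given_flag, p_fake_given_noflag):
    """Equation (3.1): flagged exposures weighted by p(y=1 | f=1) plus
    unflagged exposures weighted by p(y=1 | f=0)."""
    return (p_fake_given_flag * n_flags
            + p_fake_given_noflag * (n_exposures - n_flags))

# Toy example: 1,000 exposures so far, 30 of which carried a user flag.
print(expected_fake_exposures(n_exposures=1000, n_flags=30,
                              p_fake_given_flag=0.8, p_fake_given_noflag=0.05))  # 72.5
```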
What-to-Fact-Check
Different from the when-to-fact-check problem in the previous section, the authors in [152] aim to explore, given a set of news pieces, how to select a small subset for fact-checking so as to minimize the spread of fake news. Specifically, the two problems differ in the following aspects: (1) the reliability of user flagging behaviors is modeled with random variables; (2) the approach is agnostic to the actual temporal dynamics of the news spreading process; and (3) it uses discrete epochs with a fixed budget, instead of continuous time with an overall budget. Let $U$ denote the set of users and $t = 1, 2, \ldots, T$ denote the epochs, where each epoch can be a time window such as one day. The model mainly involves three components: (1) news generation and spreading; (2) users' activity of flagging the news; and (3) selecting news for fact-checking.
At the beginning of each epoch $t$, the set of newly arriving news pieces is denoted as $A^t = \{a^t_1, \ldots, a^t_N\}$. Let the random variables $Y^t = \{y^t_1, \ldots, y^t_N\}$ denote the corresponding set of unknown labels for these news pieces. Each news piece $a^t_i$ is associated with a source user who seeded the news, denoted as $p^t_i$. For each news piece $a^t_i \in A^t$, the set of users exposed to it is initialized as $\pi^{t-1}(a^t_i) = \{\}$, and the set of users who have flagged it is initialized as $\phi^{t-1}(a^t_i) = \{\}$.
In epoch $t$, when a news piece $a \in A^t$ propagates to a new user $u \in U$, this user can flag the news as fake. The set of users who flag news $a$ as fake during epoch $t$ is denoted by $l^t(a)$, and $\phi^t(a)$ denotes the complete set of users who have flagged news $a$ as fake by the end of epoch $t$. We can also introduce $\theta_u \in [0, 1]$ to capture how likely user $u$ is to abstain from flagging: the bigger $\theta_u$ is, the less likely user $u$ is to flag fake news. In addition, the accuracy of a user's labels is conditioned on the news he/she is flagging, represented by two parameters:

$\alpha_u \in [0, 1]$: the probability that user $u$ flags news $a$ as not fake, given that $a$ is not fake; and

$\beta_u \in [0, 1]$: the probability that user $u$ flags news $a$ as fake, given that $a$ is fake.
Then we can quantify the observed flagging behavior of user $u$ for any news piece $a$ with the following matrix, parameterized by $(\bar{\alpha}_u, \bar{\beta}_u)$:

$$\begin{pmatrix} \bar{\alpha}_u & 1-\bar{\beta}_u \\ 1-\bar{\alpha}_u & \bar{\beta}_u \end{pmatrix} = \theta_u \begin{pmatrix} 1 & 1 \\ 0 & 0 \end{pmatrix} + (1-\theta_u) \begin{pmatrix} \alpha_u & 1-\beta_u \\ 1-\alpha_u & \beta_u \end{pmatrix}, \qquad (3.7)$$
where $\bar{\alpha}_u$ and $\bar{\beta}_u$ are defined as follows:

$$\bar{\alpha}_u = p\big(y_u(a) = 0 \mid y(a) = 0\big) \qquad (3.8)$$
$$1 - \bar{\alpha}_u = p\big(y_u(a) = 1 \mid y(a) = 0\big) \qquad (3.9)$$
$$\bar{\beta}_u = p\big(y_u(a) = 1 \mid y(a) = 1\big) \qquad (3.10)$$
$$1 - \bar{\beta}_u = p\big(y_u(a) = 0 \mid y(a) = 1\big). \qquad (3.11)$$
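Following the decomposition in Eq. (3.7), where the first term corresponds to the user abstaining from flagging and the second to labeling according to his/her accuracy parameters, the observed flagging probabilities can be sketched as follows.

```python
def observed_flagging_params(theta, alpha, beta):
    """Equation (3.7): mix the "never flag" behavior (probability theta) with the
    accuracy-driven behavior (alpha, beta) to obtain the observed parameters."""
    alpha_bar = theta * 1.0 + (1.0 - theta) * alpha  # p(user labels not fake | news is not fake)
    beta_bar = theta * 0.0 + (1.0 - theta) * beta    # p(user labels fake | news is fake)
    return alpha_bar, beta_bar

# A user who abstains 70% of the time but is fairly accurate when engaged.
print(observed_flagging_params(theta=0.7, alpha=0.9, beta=0.8))  # approximately (0.97, 0.24)
```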
At the end of epoch $t$, we select a news set $S^t$ to send to an expert for acquiring the true labels. If a news piece is labeled as fake by the expert, it is blocked to avoid further dissemination. The utility is the number of users saved from being exposed to fake news. If the algorithm selects a set $S^t$ in each epoch $t$, the total expected utility of the algorithm for $t = 1, \ldots, T$ is given by

$$\sum_{t=1}^{T} \mathbb{E}\left[\sum_{a \in S^t} \mathbb{1}_{\{y(a)=1\}} \Big(\big|\pi^{\infty}(a)\big| - \big|\pi^{t}(a)\big|\Big)\right], \qquad (3.12)$$

where $|\pi^{\infty}(a)|$ denotes the number of users who would eventually see news $a$, and $|\pi^{t}(a)|$ denotes the number of users who have seen news $a$ by the end of epoch $t$.
To select the optimal set of news $S^t$ for fact-checking at each epoch $t$, the proposed algorithm utilizes a Bayesian approach to infer the news labels and learn the user parameters through posterior sampling [152].
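Finally, the utility in Eq. (3.12) can be sketched for given (realized) outcomes; the expectation in the equation is over the unknown labels and exposure counts, which the sketch below simply takes as inputs.

```python
def total_utility(selections):
    """Equation (3.12), evaluated for realized outcomes: for every selected news
    piece that is actually fake, count the users saved from exposure, i.e., the
    eventual number of exposed users minus those already exposed when it was blocked.

    `selections` holds one list per epoch of (is_fake, eventual_exposures,
    exposures_so_far) tuples for the selected news pieces.
    """
    utility = 0
    for epoch_selection in selections:
        for is_fake, eventual_exposures, exposures_so_far in epoch_selection:
            if is_fake:
                utility += eventual_exposures - exposures_so_far
    return utility

# Two epochs; one selected story per epoch turns out to be fake.
print(total_utility([
    [(True, 500, 120), (False, 300, 40)],  # epoch 1
    [(True, 800, 300)],                    # epoch 2
]))  # 380 + 500 = 880
```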