4.2 WEAKLY SUPERVISED FAKE NEWS DETECTION
Weakly supervised fake news detection aims to predict fake news with limited or no supervision
labels. We introduce some representative methods for semi-supervised and unsupervised fake
news detection.
4.2.1 A TENSOR DECOMPOSITION SEMI-SUPERVISED APPROACH
In the scenario of semi-supervised fake news detection, we assume the labels of a limited number of news pieces are given and aim to predict the labels of the remaining news articles. Formally, we denote the news corpus $\mathcal{A} = \{a_1, a_2, \ldots, a_N\}$, where each document $a_i$ is represented as a vector of terms over a dictionary of size $d$. Assuming that the labels of some news articles are available, let $\mathbf{y} \in \{-1, 0, 1\}^N$ denote the vector containing the partially known labels, such that an entry of $1$ represents real news, $-1$ represents fake news, and $0$ denotes an unknown label. The problem is: given the collection of news articles $\mathcal{A}$ and the label vector $\mathbf{y}$ with entries for labeled real/fake news and unknown articles, predict the class labels of the unknown articles.
The framework for semi-supervised fake news detection mainly consists of three stages (see Figure 4.4): (1) tensor decomposition; (2) building a k-Nearest-Neighbor (KNN) news graph; and (3) belief propagation. The tensor decomposition learns news representations from news contents; the KNN graph links labeled and unlabeled news pieces; and belief propagation utilizes the graph structure to predict the labels of unknown news pieces.
Figure 4.4: The tensor decomposition framework for semi-supervised fake news detection.
Building KNN Graph of News Articles
Spatial relations among words capture their affinities within a news document. To incorporate spatial relations, each news document $a_i$ can be represented as a $d \times d$ matrix of term co-occurrences, so the corpus forms a three-mode tensor $\mathcal{X} \in \mathbb{R}^{N \times d \times d}$. We then decompose the tensor $\mathcal{X}$ into three factor matrices using the CP/PARAFAC decomposition as in Equation (2.2) in Section 2.1.2, obtaining the news representation matrix $\mathbf{F}$.
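As a concrete illustration, the following sketch (assuming the `tensorly` library; the tensor here is a random stand-in for a real co-occurrence tensor) shows how the news representation matrix $\mathbf{F}$ could be obtained from a CP/PARAFAC decomposition:

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac

# Hypothetical sizes: N articles, dictionary of d terms, decomposition rank r.
N, d, r = 100, 50, 10

# Stand-in for the real data: slice i would hold the d x d word co-occurrence
# counts of article a_i within a sliding window.
X = tl.tensor(np.random.rand(N, d, d))

# CP/PARAFAC decomposition into one factor matrix per mode.
weights, factors = parafac(X, rank=r)

# The news-mode factor matrix serves as the representation F (N x r).
F = factors[0]
print(F.shape)  # (100, 10)
```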
The tensor embedding computed in the last step provides a compact and discriminative representation of news articles in a concise set of latent topics. Using this embedding, we construct a graphical representation of news articles. Specifically, we can use the resultant news representation matrix $\mathbf{F} \in \mathbb{R}^{N \times r}$ (where $r$ is the decomposition rank) to build a KNN graph $G$ among labeled and unlabeled news pieces. Each node in $G$ represents a news article, and each edge encodes that two articles are similar in the embedding space. A KNN graph is a graph in which nodes $a_i$ and $a_j$ are connected by an edge if one of them belongs to the K-nearest-neighbor list of the other. The K nearest neighbors of a data point are defined using the closeness relation in the feature space with a distance metric such as the Euclidean $\ell_2$ distance below; in practice, the parameter $K$ is decided empirically:

$d(\mathbf{F}_i, \mathbf{F}_j) = \|\mathbf{F}_i - \mathbf{F}_j\|_2.$ (4.10)

The resultant graph $G$ is an undirected, symmetric graph where each node is connected to at least $K$ nodes. The graph can be compactly represented as an $N \times N$ adjacency matrix $\mathbf{O}$.
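A minimal sketch of this step, assuming `scikit-learn` is available, builds the KNN graph from $\mathbf{F}$ and symmetrizes it so that an edge exists whenever either endpoint is in the other's neighbor list:

```python
from sklearn.neighbors import kneighbors_graph

def build_knn_graph(F, K=5):
    """Build a symmetric KNN adjacency matrix O from news embeddings F (N x r)."""
    # Directed KNN graph under Euclidean distance (Equation 4.10).
    O = kneighbors_graph(F, n_neighbors=K, mode='connectivity', include_self=False)
    # Symmetrize: connect a_i and a_j if either is in the other's KNN list.
    return O.maximum(O.T)  # sparse N x N adjacency matrix

# Usage with the representation matrix F from the decomposition step:
# O = build_knn_graph(F, K=5)
```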
Belief Propagation
Belief propagation propagates the label information of labeled news pieces to unlabeled ones in the KNN graph $G$. The basic assumption is that news articles connected in $G$ are more likely to share the same label, due to the construction method of the tensor embeddings. To this end, fast and linearized belief propagation (FaBP) [73] can be utilized, owing to its efficiency and effectiveness on large graphs. The operative intuition behind FaBP and other such guilt-by-association methods is that nodes which are "close" are likely to have similar labels or belief values. FaBP solves the following linear system:

$[\mathbf{I} + \alpha \mathbf{D} - c'\,\mathbf{O}]\,\mathbf{b}_h = \boldsymbol{\phi}_h,$ (4.11)

where $\boldsymbol{\phi}_h$ and $\mathbf{b}_h$ denote the prior and final beliefs (labels), respectively. $\mathbf{O}$ is the adjacency matrix of the KNN graph, $\mathbf{I}$ is an identity matrix, and $\mathbf{D}$ is a diagonal matrix where $D_{ii} = \sum_j O_{ij}$ and $D_{ij} = 0$ for $i \neq j$.
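The linear system in Equation (4.11) can be solved directly with sparse linear algebra. The sketch below assumes the adjacency matrix `O` from the previous step and derives $\alpha$ and $c'$ from a homophily parameter, following the parameterization in Koutra et al. [73]; the value of the homophily parameter is illustrative:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spsolve

def fabp(O, y, hh=0.1):
    """Linearized belief propagation over adjacency O with priors y in {-1, 0, 1}."""
    # Coefficients from the homophily parameter hh, as in [73].
    a = 4 * hh**2 / (1 - 4 * hh**2)
    c = 2 * hh / (1 - 4 * hh**2)
    N = O.shape[0]
    D = sp.diags(np.asarray(O.sum(axis=1)).ravel())   # degree matrix
    I = sp.identity(N)
    phi = y.astype(float)                             # prior beliefs from known labels
    b = spsolve((I + a * D - c * O).tocsc(), phi)     # final beliefs (Equation 4.11)
    return np.sign(b)                                 # predicted labels for all articles

# Usage: labels = fabp(O, y)  # y holds 1 (real), -1 (fake), 0 (unknown)
```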
4.2.2 A TENSOR DECOMPOSITION UNSUPERVISED APPROACH
News content is suggested to contain crucial information to differentiate fake from real news [53]. Section 2.1.1 introduced linguistic textual feature extraction for fake news detection. These features, such as n-grams and tf-idf (term frequency-inverse document frequency), can capture the correlations and similarities between different news contents. However, they ignore the context of a news document, such as the spatial vicinity of each word. To this end, we model the corpus as a third-order tensor, which can simultaneously leverage the article and term relations, as well as the spatial/contextual relations between the terms. The advantage is that by exploiting both aspects of the corpus, in particular the spatial relations between the words, the learned news representation can be a determining factor for identifying groups of articles that fall under different types of fake news.
Given the corpus of news pieces $\mathcal{A} = \{a_1, a_2, \ldots, a_N\}$, where each document $a_i$ is a vector of terms over a dictionary of size $d$, the problem is to cluster the news documents, based on their terms, into different categories such as fake and real.
The framework for unsupervised fake news detection consists of two stages (see Figure 4.5). First, we explain how to extract spatial relations between terms through tensor decomposition; then, we discuss the co-clustering used to group relevant documents.
Figure 4.5: A tensor decomposition framework for unsupervised fake news detection. It consists of two stages: tensor decomposition for spatial relation extraction and co-clustering decomposition ensembles. Based on [53].
Tensor Ensemble Co-Clustering
We first build a tensor $\mathcal{X} \in \mathbb{R}^{N \times d \times d}$ for the news corpus, where the $i$-th slice encodes the term co-occurrences of news piece $a_i$. We then decompose the tensor $\mathcal{X}$ into three factor matrices using the CP/PARAFAC decomposition as in Equation (2.2) in Section 2.1.2, and obtain the news representation matrix $\mathbf{F}$. After learning the news representation through the spatial relation extraction process, the next step is to cluster the news pieces based on their cluster memberships across a set of different tensor decompositions. Intuitively, the clustering seeks to find a subset of news articles that frequently cluster together in different configurations of the tensor decomposition of the previous step. The intuition is that news articles that frequently appear close to each other across different rank configurations, while having the same ranking within their latent factors, are more likely to belong to the same category. The ranking of a news article with respect to a latent factor is derived by simply sorting the coefficients of that latent factor, which correspond to the clustering memberships of news articles to the latent factor.
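As a sketch of this ensemble step (assuming `tensorly` and the co-occurrence tensor `X` from above; the column normalization is an illustrative choice, not prescribed by [53]), one can run the decomposition at several ranks and stack each article's latent-factor loadings:

```python
import numpy as np
from tensorly.decomposition import parafac

def ensemble_memberships(X, ranks=(3, 4, 5)):
    """Run CP/PARAFAC at several ranks and stack the news-mode factors."""
    blocks = []
    for rank in ranks:
        weights, factors = parafac(X, rank=rank)
        F = np.abs(np.asarray(factors[0]))            # news-by-factor loadings (N x rank)
        blocks.append(F / (F.max(axis=0) + 1e-12))    # normalize columns (illustrative)
    return np.hstack(blocks)                          # collective matrix F' (N x sum(ranks))

# F_prime = ensemble_memberships(X)  # input to the co-clustering step below
```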
To this end, we can combine the clustering results of each individual tensor decomposition into a collective (news-article by latent-factor) matrix, from which we extract co-clusters of news articles and the corresponding latent factors (coming from the ensemble of decompositions). For example, as shown in Figure 4.5, we can perform the tensor decomposition three times with ranks 3, 4, and 5, and then construct a collective feature matrix $\mathbf{F}'$. The co-clustering objective with $\ell_1$-norm regularization for the combined matrix $\mathbf{F}'$ [103] is as follows:

$\min_{\mathbf{R}, \mathbf{Q}} \|\mathbf{F}' - \mathbf{R}\mathbf{Q}^T\|_F^2 + \lambda\,(\|\mathbf{R}\|_1 + \|\mathbf{Q}\|_1),$ (4.12)

where $\mathbf{R} \in \mathbb{R}^{N \times k}$ is the representation matrix of news articles, $\mathbf{Q} \in \mathbb{R}^{M \times k}$ is the coding matrix, and the term $\lambda\,(\|\mathbf{R}\|_1 + \|\mathbf{Q}\|_1)$ enforces the sparsity constraints.
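A minimal numpy sketch of one way to optimize Equation (4.12), using alternating proximal gradient (ISTA-style) updates with soft-thresholding for the $\ell_1$ terms; the step sizes, $\lambda$, and iteration count are illustrative choices, not the solver used in [103]:

```python
import numpy as np

def soft_threshold(A, t):
    """Proximal operator of the l1 norm (elementwise soft-thresholding)."""
    return np.sign(A) * np.maximum(np.abs(A) - t, 0.0)

def sparse_coclustering(F_prime, k=4, lam=0.1, n_iter=200, seed=0):
    """Alternating proximal gradient for
    min_{R,Q} ||F' - R Q^T||_F^2 + lam * (||R||_1 + ||Q||_1)."""
    N, M = F_prime.shape
    rng = np.random.default_rng(seed)
    R, Q = rng.random((N, k)), rng.random((M, k))
    for _ in range(n_iter):
        # Proximal gradient step on R (Lipschitz constant 2 ||Q^T Q||_2).
        step = 0.5 / (np.linalg.norm(Q.T @ Q, 2) + 1e-8)
        R = soft_threshold(R - step * 2.0 * (R @ Q.T - F_prime) @ Q, step * lam)
        # Proximal gradient step on Q.
        step = 0.5 / (np.linalg.norm(R.T @ R, 2) + 1e-8)
        Q = soft_threshold(Q - step * 2.0 * (R @ Q.T - F_prime).T @ R, step * lam)
    return R, Q

# R, Q = sparse_coclustering(F_prime)
# labels = np.abs(R).argmax(axis=1)  # dominant co-cluster of each news article
```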
4.2.3 A PROBABILISTIC GENERATIVE UNSUPERVISED APPROACH
Existing work on fake news detection is mostly based on supervised methods. Although they
have shown some promising results, these supervised methods suffer from a critical limitation,
i.e., they require a reliably pre-annotated dataset to train a classification model. However, ob-
taining a large number of annotations is time consuming and labor intensive, as the process
needs careful checking of news contents as well as other additional evidence such as authorita-
tive reports.
The key idea is to extract users' opinions on the news by exploiting the auxiliary information of the users' engagements with the news tweets on social media, and to aggregate their opinions in a well-designed unsupervised way to generate the estimation results [174]. As news propagates, users engage in different types of behaviors on social media, such as publishing a news tweet, or liking, forwarding, or replying to a news tweet. This information can, to a certain extent, reflect the users' opinions on the news. For example, Figure 4.6 shows two news tweet examples regarding the aforementioned news. According to the users' tweet contexts, we see that the user in Figure 4.6a doubted the authenticity of the news, which may indicate the user's high credibility in identifying fake news. On the other hand, it appears that the user in Figure 4.6b falsely believed the news or intentionally spread the fake news, implying the user's deficiency in the ability to identify fake news. Besides, as for other users who engaged with these tweets, it is likely that the users who liked/retweeted the first tweet also doubted the news, while those who liked/retweeted the second tweet may also have been deceived by the news. The users' opinions toward the news can also be revealed from their replies to the news tweets. Next, we introduce the hierarchical structure of user engagements, and then describe how to use a probabilistic graphical model to infer fake news.

Figure 4.6: An illustration of user opinions on news pieces: (a) doubting the authenticity of the news; (b) agreeing with the authenticity of the news. Based on [174].
Hierarchical User Engagements Figure 4.7 presents an overview of the hierarchical structure of user engagements on social media. Let $\mathcal{A} = \{a_1, \ldots, a_N\}$ denote the news corpus, and let $\mathcal{U}^M$ and $\mathcal{U}^K$ represent the sets of verified and unverified users, respectively. For each given news piece $a_i \in \mathcal{A}$, we collect all the verified users' tweets on this news. Let $\mathcal{U}^M_i \subseteq \mathcal{U}^M$ denote the set of verified users who published tweets about the news $a_i$. Then, for the tweet of each verified user $u_j \in \mathcal{U}^M_i$, we collect the unverified users' social engagements. Let $\mathcal{U}^K_{ij} \subseteq \mathcal{U}^K$ denote the set of unverified users who engaged with that tweet.
Figure 4.7: An illustration of hierarchical user engagements.
For each verified user $u_j \in \mathcal{U}^M_i$, we let $z_{i,j} \in \{0, 1\}$ denote the user's opinion on the news, i.e., $z_{i,j}$ is $1$ if the user thinks the news is real, and $0$ otherwise. Several heuristics can be applied to extract $z_{i,j}$. Let $a_i$ and $c_{i,j}$ denote the news content and the user $u_j$'s own text content of the tweet, respectively. Then, $z_{i,j}$ can be defined as the sentiment of $c_{i,j}$.
For verified user $u_j$'s tweet on news $a_i$, many unverified users may like, retweet, or reply to the tweet. Let $r_{i,j,k} \in \{0, 1\}$ denote the opinion of the unverified user $u_k \in \mathcal{U}^K_{ij}$. We assume that if the user $u_k$ liked or retweeted the tweet, then $u_k$ agreed with the opinion of the tweet. If the user $u_k$ replied to the tweet, then his/her opinion can be extracted by employing off-the-shelf sentiment analysis [55]. When an unverified user conducts multiple engagements with a tweet, the user's opinion $r_{i,j,k}$ is obtained using majority voting.
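As an illustrative sketch of these heuristics (assuming NLTK's VADER analyzer as a stand-in for the sentiment model of [55]; the engagement record format and the zero threshold are hypothetical choices):

```python
from collections import Counter
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Requires: nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

def engagement_opinion(engagement, z_ij):
    """Map one engagement (like/retweet/reply) to an opinion r in {0, 1}."""
    if engagement["type"] in ("like", "retweet"):
        return z_ij  # liking/retweeting implies agreeing with the verified user's opinion
    # For replies, use the sentiment polarity of the reply text.
    return 1 if sia.polarity_scores(engagement["text"])["compound"] >= 0 else 0

def user_opinion(engagements, z_ij):
    """Aggregate one user's multiple engagements on a tweet by majority voting."""
    votes = Counter(engagement_opinion(e, z_ij) for e in engagements)
    return votes.most_common(1)[0][0]

# r_ijk = user_opinion([{"type": "retweet"},
#                       {"type": "reply", "text": "Totally false."}], z_ij=1)
```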
Probabilistic Graphical Model for Inferring Fake News Given the definitions of $a_i$, $z_{i,j}$, and $r_{i,j,k}$, we now present the unsupervised fake news detection framework (UFD). Figure 4.8 shows the probabilistic graphical structure of the model. Each node in the graph represents a random variable or a prior parameter, where darker nodes and white nodes indicate observed and latent variables, respectively.
Figure 4.8: A probabilistic graphical model for unsupervised fake news detection. Gray circles denote observed random variables, white circles denote latent random variables, arrows denote conditional dependencies, and rectangles illustrate the duplication frequencies (plate notation).
For each news piece $a_i$, its label $y_i$ is generated from a Bernoulli distribution with parameter $\theta_i$:

$y_i \sim \mathrm{Bernoulli}(\theta_i).$ (4.13)
The prior probability $\theta_i$ is generated from a Beta distribution with hyperparameters $\gamma = (\gamma_1, \gamma_0)$, i.e., $\theta_i \sim \mathrm{Beta}(\gamma_1, \gamma_0)$, where $\gamma_1$ is the prior count of true news pieces and $\gamma_0$ is the prior count of fake news pieces. If we do not have a strong belief in practice, we can initially assign a uniform prior indicating that each news piece has an equal probability of being true or fake.
For each verified user $u_j$, his/her credibility toward news is modeled with two variables $\phi^1_j$ and $\phi^0_j$, denoting the probability that user $u_j$ thinks a news piece is real given that the true label of the news is real or fake, respectively:

$\phi^1_j := p(z_{i,j} = 1 \mid y_i = 1), \qquad \phi^0_j := p(z_{i,j} = 1 \mid y_i = 0).$ (4.14)
We generate the parameter $\phi^1_j$ from a Beta distribution with hyperparameters $\alpha^1 = (\alpha^1_1, \alpha^1_0)$, where $\alpha^1_1$ is the prior true positive count and $\alpha^1_0$ is the prior false negative count. Similarly, we can generate $\phi^0_j$ from another Beta distribution with hyperparameters $\alpha^0 = (\alpha^0_1, \alpha^0_0)$:

$\phi^1_j \sim \mathrm{Beta}(\alpha^1_1, \alpha^1_0), \qquad \phi^0_j \sim \mathrm{Beta}(\alpha^0_1, \alpha^0_0).$ (4.15)
Given $\phi^1_j$ and $\phi^0_j$, the opinion of each verified user $u_j$ on news $a_i$ is generated from a Bernoulli distribution with parameter $\phi^{y_i}_j$, i.e., $z_{i,j} \sim \mathrm{Bernoulli}(\phi^{y_i}_j)$.
An unverified user engages with the verified users' tweets, and thus his/her opinion is likely to be influenced by both the news itself and the verified users' opinions. Therefore, for each unverified user $u_k \in \mathcal{U}^K$, the following variables are adopted to model his/her credibility:

$\psi^{0,0}_k := p(r_{i,j,k} = 1 \mid y_i = 0, z_{i,j} = 0)$
$\psi^{0,1}_k := p(r_{i,j,k} = 1 \mid y_i = 0, z_{i,j} = 1)$
$\psi^{1,0}_k := p(r_{i,j,k} = 1 \mid y_i = 1, z_{i,j} = 0)$
$\psi^{1,1}_k := p(r_{i,j,k} = 1 \mid y_i = 1, z_{i,j} = 1)$ (4.16)
For each pair $(u, v) \in \{0, 1\}^2$, $\psi^{u,v}_k$ represents the probability that the unverified user $u_k$ thinks the news is true under the condition that the true label of the news is $u$ and the verified user's opinion is $v$. Each $\psi^{u,v}_k$ is generated from a Beta distribution as follows:

$\psi^{u,v}_k \sim \mathrm{Beta}(\beta^{u,v}_1, \beta^{u,v}_0).$ (4.17)
Given the truth estimation $y_i$ of the news and the verified users' opinions $z_{i,j}$, we generate the unverified user's opinion from a Bernoulli distribution with parameter $\psi^{y_i, z_{i,j}}_k$:

$r_{i,j,k} \sim \mathrm{Bernoulli}\big(\psi^{y_i, z_{i,j}}_k\big).$ (4.18)
The overall objective is to find the instantiation of the latent variables that maximizes the joint probability, marginalizing out the parameters:

$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \iiint p(\mathbf{y}, \mathbf{z}, \mathbf{r}, \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{\psi})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\boldsymbol{\psi},$ (4.19)

where, for simplicity of notation, we use $\boldsymbol{\phi}$ and $\boldsymbol{\psi}$ to denote $\{\phi^0, \phi^1\}$ and $\{\psi^{0,0}, \psi^{0,1}, \psi^{1,0}, \psi^{1,1}\}$, respectively. However, exact inference on the posterior distribution has exponential complexity; we propose using Gibbs sampling to efficiently infer the variable estimates.
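A compact Gibbs sampling sketch for this model is given below. It assumes a simplified dense setting (every verified user tweets on every news piece and every unverified user engages with every tweet, with symmetric hyperparameters), whereas the actual sampler in [174] handles sparse engagements; it exploits Beta-Bernoulli conjugacy to resample the parameters and then each truth label $y_i$ from its full conditional:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_ufd(z, r, gamma=(1.0, 1.0), alpha=(1.0, 1.0), beta=(1.0, 1.0), n_iter=500):
    """Gibbs sampler sketch for the UFD model on dense engagement arrays.

    z: (N, M) int array, verified users' opinions z_ij in {0, 1}.
    r: (N, M, K) int array, unverified users' opinions r_ijk in {0, 1}.
    Returns the estimated posterior probability that each news piece is real.
    """
    N, M = z.shape
    y = rng.integers(0, 2, size=N)                     # initial truth labels
    prior = np.array([gamma[1], gamma[0]], dtype=float)
    prior /= prior.sum()                               # p(y=0), p(y=1) from (gamma1, gamma0)
    y_accum = np.zeros(N)
    for it in range(n_iter):
        # --- Resample credibility parameters from their Beta posteriors. ---
        phi = np.empty((2, M))                         # phi[u, j] = p(z_ij=1 | y_i=u)
        for u in (0, 1):
            rows = (y == u)
            pos = z[rows].sum(axis=0)
            phi[u] = rng.beta(alpha[0] + pos, alpha[1] + rows.sum() - pos)
        psi = np.empty((2, 2, r.shape[2]))             # psi[u, v, k] = p(r_ijk=1 | y=u, z=v)
        for u in (0, 1):
            for v in (0, 1):
                mask = (y[:, None] == u) & (z == v)    # (N, M) cells in this regime
                pos = r[mask].sum(axis=0)              # per-unverified-user counts
                psi[u, v] = rng.beta(beta[0] + pos, beta[1] + mask.sum() - pos)
        # --- Resample each truth label y_i from its full conditional. ---
        for i in range(N):
            logp = np.empty(2)
            for u in (0, 1):
                ll = np.log(prior[u])
                ll += (z[i] * np.log(phi[u]) + (1 - z[i]) * np.log(1 - phi[u])).sum()
                psi_uz = psi[u, z[i]]                  # (M, K): selects psi[u, z_ij, :]
                ll += (r[i] * np.log(psi_uz) + (1 - r[i]) * np.log(1 - psi_uz)).sum()
                logp[u] = ll
            p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))
            y[i] = int(rng.random() < p1)
        if it >= n_iter // 2:                          # accumulate after burn-in
            y_accum += y
    return y_accum / (n_iter - n_iter // 2)

# p_real = gibbs_ufd(z, r); predictions = (p_real >= 0.5).astype(int)
```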