CHAPTER 4
Challenging Problems of Fake News Detection
In previous chapters, we introduced how to extract features from news content and social context and how to build machine learning models on them to detect fake news, generally under the standard scenario of binary classification. Since fake news detection is a critical real-world problem, it comes with specific challenges that need to be considered. In addition, recent advances in machine learning methods such as deep neural networks, tensor factorization, and probabilistic models allow us to better capture effective features of news from auxiliary information and to deal with specialized settings of fake news detection.
In this chapter, we discuss several challenging problems of fake news detection. Specifically, there is a need to detect fake news at an early stage to prevent its further propagation on social media. Since obtaining ground-truth labels for fake news is labor intensive and time consuming, it is important to study fake news detection in a weakly supervised setting, i.e., with limited or no labels for training. It is also necessary to understand why a particular piece of news is classified as fake by machine learning models; the derived explanations can provide new insights and knowledge that are not obvious to practitioners.
4.1 FAKE NEWS EARLY DETECTION
Fake news early detection aims to give early alerts of fake news during the dissemination process
so that actions can be taken to help prevent its further propagation on social media.
4.1.1 A USER-RESPONSE GENERATION APPROACH
We have seen that rich social context information provides effective auxiliary signals for fake news detection on social media. However, such social context information, e.g., user comments, only becomes available after people have already engaged in the propagation of the fake news. Therefore, a more practical solution for early fake news detection is to assume that the news content is the only information available at prediction time. In addition, we can assume that we have obtained historical data containing both news content and user responses, and we can leverage this historical data to enhance early detection performance on newly emerging news articles that have no user responses yet.
Let $\mathcal{A} = \{a_1, a_2, \ldots, a_N\}$ denote the news corpus, where each document $a_i$ is a vector of terms in a dictionary $\mathcal{T}$ with size $d = |\mathcal{T}|$, and let $\mathcal{C} = \{c_1, c_2, \ldots, c_n\}$ represent the set of user responses. The detection task can be defined as: given a news article $a$, the goal is to predict whether it is fake or not without assuming that the corresponding user responses exist.
The framework mainly consists of two major components (see Figure 4.1): (1) a convolutional neural network component to learn news representations; and (2) a user response generator to produce auxiliary signals that help detect fake news.
Figure 4.1: e user-response generation framework for early fake news detection. It consists
primarily of two stages: neural news representation learning and deep user response generator.
Based on [112].
Neural Representation Learning for News
To extract semantic information and learn representations for news, we can use a two-level convolutional neural network (TCNN) structure operating at the sentence level and the document level. We introduced similar feature learning techniques in Section 2.1.3. At the sentence level, we first derive a sentence representation as the average of the embeddings of the words in the sentence. Each sentence in a news article can be represented as a one-hot vector $s \in \{0, 1\}^{|\mathcal{T}|}$ indicating which words from the vocabulary are present in the sentence. The sentence representation is then defined by average pooling of the word embedding vectors of the words in the sentence:

$$v(s) = \frac{W s}{\sum_k s_k}, \qquad (4.1)$$

where $W$ is the embedding matrix for all words, with the embedding of each word obtained from a skip-gram algorithm [90] pre-trained on all news articles. At the document level, the news representation is derived from the sentence representations by concatenation ($\oplus$). For a news piece $a_i$ containing $L$ sentences $S = \{s_1, \ldots, s_L\}$, the news representation $a_i$ is:

$$a_i = v(s_1) \oplus v(s_2) \oplus \cdots \oplus v(s_L). \qquad (4.2)$$

After the news representation is obtained, we can use convolutional neural networks to learn higher-level representations as in Section 2.1.3.
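As a minimal sketch of the two-level representation above, with a toy vocabulary and a randomly initialized embedding matrix standing in for the pre-trained skip-gram vectors, Equations (4.1) and (4.2) can be written as:

```python
import numpy as np

def sentence_embedding(s, W):
    """Average pooling of word embeddings, Eq. (4.1): v(s) = W s / sum_k s_k."""
    return (W @ s) / s.sum()

def document_embedding(sentences, W):
    """Concatenate sentence embeddings, Eq. (4.2): a_i = v(s_1) + ... + v(s_L)."""
    return np.concatenate([sentence_embedding(s, W) for s in sentences])

# Toy setup: vocabulary of 6 words, 4-dimensional embeddings.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))          # embedding matrix (dim x |T|)
s1 = np.array([1, 0, 1, 0, 0, 1.0])  # one-hot-style sentence indicator
s2 = np.array([0, 1, 0, 1, 0, 0.0])
doc = document_embedding([s1, s2], W)  # length 2 * 4 = 8
```

The document vector has length $L$ times the embedding dimension, since each sentence vector is concatenated rather than pooled.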
User Response Generator
The goal of the user response generator (URG) is to generate user responses that help learn more effective representations for predicting fake news. We can use a conditional variational auto-encoder (CVAE) [144] to learn a distribution over user responses conditioned on the article, which can then be used to generate varying responses sampled from the learned distribution. Specifically, the CVAE takes the user responses $C^{(i)} = \{c_{i1}, \ldots, c_{in}\}$ and the news article $a_i$ as input, and aims to reconstruct each response $c_{ij}$ conditioned on $a_i$ by learning a latent representation $z$. The objective is:

$$\mathbb{E}_{z \sim q_\phi(z \mid c_{ij}, a_i)}\left[-\log p_\theta(c_{ij} \mid z, a_i)\right] + D_{\mathrm{KL}}\left(q_\phi(z \mid c_{ij}, a_i) \,\|\, p(z)\right). \qquad (4.3)$$

The first term is the reconstruction error, designed as the negative log-likelihood of the data reconstructed from the latent variable $z$ under the influence of the article $a_i$. The second term is a regularizer that minimizes the divergence between the encoder distribution $q_\phi(z \mid c, a)$ and the prior distribution $p(z)$.
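The regularization term in Equation (4.3) has a well-known closed form when, as is common in (C)VAE implementations, the encoder is a diagonal Gaussian and the prior is a standard normal; both distributional choices are an assumption here, not stated in the text above:

```python
import numpy as np

def kl_gaussian_standard_normal(mu, log_var):
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) ),
    the regularization term of the CVAE objective in Eq. (4.3),
    ASSUMING a diagonal-Gaussian encoder and a standard normal prior."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

# When the encoder exactly matches the prior, the KL term vanishes;
# any deviation of the mean from zero increases it.
kl_zero = kl_gaussian_standard_normal(np.zeros(8), np.zeros(8))
kl_pos = kl_gaussian_standard_normal(np.ones(8), np.zeros(8))
```

In training, this term is added to the reconstruction loss and both are minimized jointly by gradient descent.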
We use the learned representation $a$ from the TCNN as the condition and feed it into the user response generator to produce synthetic responses. The response generated by the URG is passed through a nonlinear neural network and then combined with the text features extracted by the TCNN. The final feature vector is then fed into a feed-forward softmax classifier for classification, as in Figure 4.1.
4.1.2 AN EVENT-INVARIANT ADVERSARIAL APPROACH
Most existing fake news detection methods perform supervised learning using historical data collected from different news events. In practice, these methods tend to capture many event-specific features that are not shared among different news events [165]. Such event-specific features, though helpful for classifying posts on verified events, are of limited help, and may even hurt, detection on newly emerged events. Therefore, it is important to learn event-invariant features that are discriminative for detecting fake news from unverified events. The goal is to design an effective model that removes the nontransferable event-specific features and preserves the shared event-invariant features among all events, thereby improving fake news detection performance.
In [165], an event adversarial neural network (EANN) model is proposed to extract transferable multi-modal feature representations for fake news detection (see Figure 4.2). EANN mainly consists of three components: (1) a multi-modal feature extractor; (2) a fake news detector; and (3) an event discriminator. The multi-modal feature extractor cooperates with the fake news detector to carry out the main task of identifying fake news. Simultaneously, the multi-modal feature extractor tries to fool the event discriminator in order to learn event-invariant representations.
Figure 4.2: e illustration of the event adversarial neural networks (EANN). It consists of three
parts: a multi-modal feature extractor, an event discriminator, and a fake news detector. Based
on [165].
Multi-Modal Feature Extractor
The multi-modal feature extractor aims to extract feature representations from news text and images using neural networks. We introduced representative techniques for neural textual and visual feature learning in Sections 2.1.3 and 2.2.3. In [165], for textual features, CNNs are utilized to obtain $R_{\mathrm{cnn}}$; for image features, the VGG19 network is adopted to obtain $R_{\mathrm{vgg}}$. To enforce a standard feature representation for both text and image, we can add another dense layer ("vis-fc") that maps each learned feature representation to the same dimension:

$$R_T = \sigma(W_t \cdot R_{\mathrm{cnn}}), \qquad R_V = \sigma(W_v \cdot R_{\mathrm{vgg}}), \qquad (4.4)$$

where $\sigma(\cdot)$ denotes a nonlinear activation function.
The textual features $R_T$ and the visual features $R_V$ are then concatenated to form the multi-modal feature representation $R_F = R_T \oplus R_V$, which is the output of the multi-modal feature extractor.
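A minimal sketch of this fusion step, using a sigmoid for the activation $\sigma(\cdot)$ (an assumption; the activation is not specified above) and random vectors standing in for the Text-CNN and VGG-19 outputs:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_features(R_cnn, R_vgg, W_t, W_v):
    """Project text and visual features to a common dimension (Eq. 4.4)
    and concatenate them into the multi-modal representation R_F.
    The sigmoid stands in for the unspecified activation sigma(.)."""
    R_T = sigmoid(W_t @ R_cnn)
    R_V = sigmoid(W_v @ R_vgg)
    return np.concatenate([R_T, R_V])   # R_F = R_T concat R_V

rng = np.random.default_rng(1)
R_cnn = rng.normal(size=20)   # toy Text-CNN output
R_vgg = rng.normal(size=40)   # toy VGG-19 output
W_t = rng.normal(size=(16, 20))   # "vis-fc"-style projections to dim 16
W_v = rng.normal(size=(16, 40))
R_F = fuse_features(R_cnn, R_vgg, W_t, W_v)
```

Note that both modalities end up with the same dimensionality (16 here) before concatenation, which is the point of the extra dense layer.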
Fake News Detector
The fake news detector deploys a fully connected layer ("pred-fc") with a softmax function to predict whether a news post is fake or real. It takes the learned multi-modal feature representation $R_F$ as input. The objective of the fake news detector is to minimize the cross-entropy loss:

$$\min L_d(\theta_f, \theta_d) := \min -\mathbb{E}\left[y \log(P_\theta(a)) + (1 - y)\log(1 - P_\theta(a))\right], \qquad (4.5)$$

where $a$ is a news post, and $\theta_f$ and $\theta_d$ are the parameters of the multi-modal feature extractor and the fake news detector, respectively. However, directly optimizing the loss in Equation (4.5) only helps detect fake news on the events already included in the training data, so the model does not generalize well to new events. Thus, we need to enable the model to learn more general feature representations that capture the features common to all events. Such representations should be event-invariant and should not include any event-specific features.
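The cross-entropy objective in Equation (4.5) can be sketched numerically; better-calibrated predictions yield a lower loss:

```python
import numpy as np

def cross_entropy(y, p, eps=1e-12):
    """Binary cross-entropy of Eq. (4.5): -E[ y log p + (1-y) log(1-p) ].
    eps clips probabilities away from 0/1 for numerical stability."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0, 0.0])       # ground-truth labels (fake = 1)
p_good = np.array([0.9, 0.1, 0.8, 0.2])  # confident, mostly correct detector
p_bad = np.array([0.4, 0.6, 0.5, 0.5])   # near-chance detector
loss_good, loss_bad = cross_entropy(y, p_good), cross_entropy(y, p_bad)
```

Minimizing this loss over $\theta_f$ and $\theta_d$ is the detector's half of the game; the adversarial half follows in the event discriminator.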
Event Discriminator
To learn event-invariant feature representations, we need to remove the uniqueness of each event in the training data and focus on extracting features that are shared among different events. To this end, we use an event discriminator, a neural network consisting of two dense layers, to classify each post into one of $E$ events $\{1, \ldots, E\}$. The event discriminator is a classifier and deploys the cross-entropy loss:

$$\min L_e(\theta_f, \theta_e) := \min -\mathbb{E}\left[\sum_{k=1}^{E} \mathbb{1}_{[k = y_e]} \log\big(G_e(G_f(a; \theta_f); \theta_e)\big)\right], \qquad (4.6)$$

where $G_f$ denotes the multi-modal feature extractor, $y_e$ denotes the event label, and $G_e$ represents the event discriminator. Equation (4.6) can estimate the dissimilarity of different events' distributions: a large loss means that the distributions of different events' representations are similar and the learned features are event-invariant. Thus, in order to remove the uniqueness of each event, we need to maximize the discrimination loss $L_e$ by seeking the optimal parameters $\theta_f$.
The above idea motivates a minimax game between the multi-modal feature extractor and the event discriminator. On the one hand, the multi-modal feature extractor tries to fool the event discriminator so as to maximize the discrimination loss; on the other hand, the event discriminator aims to discover the event-specific information included in the feature representations so as to recognize the event. The integration of the three components and the final objective function are introduced in the next section.
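Such minimax games are commonly trained by combining the two losses as $L_d - \lambda L_e$ and implementing the sign flip with a gradient reversal layer; the combination below is a sketch of that common practice, with $\lambda$ an assumed trade-off weight rather than a value given in the text:

```python
def eann_objective(L_d, L_e, lam=1.0):
    """Combined minimax objective sketch: the feature extractor and detector
    minimize the detection loss L_d while *maximizing* the event
    discrimination loss L_e, i.e., they minimize L_d - lam * L_e.
    In practice the maximization is realized by a gradient reversal layer;
    lam is an assumed trade-off hyperparameter."""
    return L_d - lam * L_e

# A more confused event discriminator (larger L_e) lowers the combined
# objective, as does a more accurate fake news detector (smaller L_d).
obj_confused = eann_objective(0.5, 2.0)
obj_sharp = eann_objective(0.5, 1.0)
```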
4.1.3 A PROPAGATION-PATH MODELING APPROACH
The diffusion paths of fake news and real news can be very different on social media. In addition to relying only on news content to detect fake news, auxiliary information from spreaders at the early stage can also be important for early detection. Existing detection methods mainly explore temporal-linguistic features extracted from user comments, or temporal-structural features extracted from propagation paths/trees or networks [80]. However, in the early stage of news propagation, user characteristics are more available, reliable, and robust than the linguistic and structural features widely used by state-of-the-art approaches.
Given the corpus of news pieces $\mathcal{A} = \{a_1, a_2, \ldots, a_N\}$, where each document $a_i$ is a vector of terms in a dictionary $\mathcal{T}$ with size $d = |\mathcal{T}|$, let $\mathcal{U} = \{u_1, \ldots, u_n\}$ denote the set of social media users, where each user is associated with a feature vector $u_i \in \mathbb{R}^k$. The propagation path is defined as a variable-length multivariate time series $P(a_i) = \langle \ldots, (u_j, t), \ldots \rangle$, where each tuple $(u_j, t)$ denotes that user $u_j$ tweeted/retweeted the news story $a_i$ at time $t$. Since the goal is to perform early detection of fake news, the designed model should be able to make predictions based on only a partial propagation path, defined as $P(a_i, T) = \langle (u_j, t) : t < T \rangle$, i.e., the tuples observed before the detection deadline $T$.
This framework consists of three major components (see Figure 4.3): (1) building the propagation path; (2) learning path representations through RNNs and CNNs; and (3) predicting fake news based on the path representations.
Figure 4.3: e propagation-path based framework for early fake news detection.
Building Propagation Path
The first step in building a propagation path is to identify the users who have engaged in the propagation process. The propagation path $P(a_i)$ for news piece $a_i$ is constructed by extracting user characteristics from the profiles of the users who posted/reposted the news piece. Once $P(a_i)$ is obtained, the length of the propagation path differs across news pieces. Therefore, we transform all propagation paths to a fixed length $n$, denoted as $S(a_i) = \langle u_1, \ldots, u_n \rangle$. If there are more than $n$ tuples in $P(a_i)$, then $P(a_i)$ is truncated and only the first $n$ tuples appear in $S(a_i)$; if $P(a_i)$ contains fewer than $n$ tuples, then tuples are over-sampled from $P(a_i)$ to ensure the final length of $S(a_i)$ is $n$.
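The truncate-or-oversample step can be sketched as follows; the exact sampling scheme (uniform with replacement) is an assumption, since the description above only states that tuples are over-sampled:

```python
import random

def fix_length(path, n, seed=0):
    """Transform a propagation path into a fixed-length sequence S(a_i):
    truncate to the first n tuples, or over-sample uniformly with
    replacement (an assumed scheme) when the path has fewer than n."""
    if len(path) >= n:
        return path[:n]
    rng = random.Random(seed)
    padded = list(path)
    while len(padded) < n:
        padded.append(rng.choice(path))
    return padded

# (user, timestamp) tuples for a long and a short propagation path.
long_path = [("u%d" % i, i) for i in range(10)]
short_path = [("u0", 0), ("u1", 3)]
S_long = fix_length(long_path, 5)
S_short = fix_length(short_path, 5)
```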
Learning Path Representations
To learn the representations, both RNNs and CNNs are utilized, preserving global and local temporal information, respectively [80]. We introduced how to use RNNs to learn temporal representations in Section 3.3.3; a similar technique can be used here. As shown in Figure 4.3, we obtain a representation $h_t$ at each timestamp $t$, and the overall RNN representation is computed as the mean pooling of the output vectors $\langle h_1, \ldots, h_n \rangle$ over all timestamps, i.e., $s_R = \frac{1}{n} \sum_{t=1}^{n} h_t$, which encodes the global variation of user characteristics.
To encode the local temporal features of user characteristics, we first construct the local propagation path $U_{t:t+h-1} = \langle u_t, \ldots, u_{t+h-1} \rangle$ for each timestamp $t$, with a moving window of size $h$ over $S(a_i)$. Then we apply a CNN with a convolution filter to $U_{t:t+h-1}$ to get a scalar feature $c_t$:

$$c_t = \mathrm{ReLU}(W_f \cdot U_{t:t+h-1} + b_f), \qquad (4.7)$$

where $\mathrm{ReLU}$ denotes the rectified linear unit activation function, $W_f$ is the convolution filter, and $b_f$ is a bias term.
To map $c_t$ to a latent vector, we can utilize a convolution filter to transform $c_t$ into a $k$-dimensional vector $f_t$. We thereby obtain feature vectors $\langle f_1, \ldots, f_{n-h+1} \rangle$ over all timestamps, to which we apply a mean pooling operation to get the overall representation $s_C = \frac{1}{n} \sum_{t=1}^{n-h+1} f_t$, which encodes the local variation of user characteristics.
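The windowed convolution and pooling above can be sketched as follows; for brevity this folds the two filtering steps (scalar $c_t$ and its expansion to $f_t$) into a single filter applied to each flattened window, a simplification of the description above:

```python
import numpy as np

def local_features(S, h, W_f, b_f):
    """Slide a window of size h over the user-vector sequence S (n x k),
    apply the filter W_f with a ReLU to each flattened window (cf. Eq. 4.7),
    and mean-pool the n - h + 1 resulting vectors into s_C.
    Folding the scalar/vector steps into one filter is a simplification."""
    n = len(S)
    feats = []
    for t in range(n - h + 1):
        window = S[t:t + h].reshape(-1)                    # U_{t:t+h-1}, flattened
        feats.append(np.maximum(0.0, W_f @ window + b_f))  # ReLU activation
    return np.mean(feats, axis=0)                          # mean pooling -> s_C

rng = np.random.default_rng(2)
S = rng.normal(size=(6, 3))        # n = 6 users, k = 3 characteristics each
h = 3                              # moving-window size
W_f = rng.normal(size=(4, h * 3))  # filter producing 4-dim local features
b_f = np.zeros(4)
s_C = local_features(S, h, W_f, b_f)
```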
Predicting Fake News
After learning the representations from propagation paths through both the RNN-based and CNN-based techniques, we concatenate them into a single vector:

$$a = [s_R, s_C], \qquad (4.8)$$

which is then fed into a multi-layer ($q$-layer) feed-forward neural network that finally predicts the class label for the corresponding propagation path:

$$I_j = \mathrm{ReLU}(W_j I_{j-1} + b_j), \quad \forall j \in \{1, \ldots, q\}, \qquad y = \mathrm{softmax}(I_q), \qquad (4.9)$$

where $q$ is the number of hidden layers, $I_j$ is the hidden state of the $j$-th layer, and $y$ is the output distribution over the set of all possible labels of news pieces.
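The final prediction step of Equations (4.8) and (4.9) amounts to a small feed-forward pass; the toy dimensions and random weights below are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # shift for numerical stability
    return e / e.sum()

def predict(s_R, s_C, layers):
    """Concatenate the global (RNN) and local (CNN) path representations
    (Eq. 4.8) and pass them through q ReLU layers followed by a softmax
    (Eq. 4.9). `layers` is a list of (W_j, b_j) pairs."""
    I = np.concatenate([s_R, s_C])          # a = [s_R, s_C]
    for W, b in layers:
        I = np.maximum(0.0, W @ I + b)      # I_j = ReLU(W_j I_{j-1} + b_j)
    return softmax(I)                       # y = softmax(I_q)

rng = np.random.default_rng(3)
s_R, s_C = rng.normal(size=5), rng.normal(size=5)
layers = [(rng.normal(size=(8, 10)), np.zeros(8)),
          (rng.normal(size=(2, 8)), np.zeros(2))]   # 2 labels: fake / real
y = predict(s_R, s_C, layers)
```

The output $y$ is a probability distribution over the two labels, so it is nonnegative and sums to one.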