CHAPTER 2
What News Content Tells
News content features describe the meta information related to a piece of news. Representative news content features include the following.
Source: Author or publisher of the news article.
Headline: Short title text that aims to catch the attention of readers and describes the
main topic of the article.
Body Text: Main text that elaborates the details of the news story; there is usually a major
claim that is specifically highlighted and that shapes the angle of the publisher.
Image/Video: Part of the body content of a news article that provides visual cues to frame
the story.
Based on these raw content features, different kinds of feature representations can be built
to extract discriminative characteristics of fake news. Typically, these representations take the
form of textual features, visual features, and style features.
2.1 TEXTUAL FEATURES
Textual features are extracted from news content with natural language processing (NLP) tech-
niques [102]. Next, we introduce how to extract linguistic, low-rank, and neural textual features.
2.1.1 LINGUISTIC FEATURES
Linguistic features are extracted from the text content in terms of characters, words, sentences,
and documents. In order to capture the different aspects of fake news and real news, both com-
mon linguistic features and domain-specific linguistic features are utilized. Common linguistic
features are often used to represent documents for various tasks in natural language processing.
Common linguistic features include: (i) lexical features, including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words; and (ii) syntactic features, including sentence-level features, such as the frequency of function words and phrases (i.e., n-grams and bag-of-words approaches [43]) or punctuation and part-of-speech (POS) tagging [5, 107]. Domain-specific linguistic features are specifically aligned to the news domain and include, for example, quoted words, external links, the number of graphs, and the average length of graphs [111].
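To make these features concrete, the sketch below computes a handful of common lexical features and a bag-of-words/n-gram representation in Python with scikit-learn; the specific features, the "large word" threshold, and the toy corpus are illustrative assumptions rather than the exact feature sets of [5, 43, 107, 111].

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def lexical_features(text: str) -> dict:
    """A few common character- and word-level (lexical) features."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "total_words": len(words),
        "chars_per_word": sum(len(w) for w in words) / n,
        "large_word_ratio": sum(len(w) >= 6 for w in words) / n,  # "large" = 6+ characters (assumption)
        "unique_word_ratio": len({w.lower() for w in words}) / n,
    }


# Bag-of-words / n-gram features over a toy corpus of news bodies.
corpus = [
    "Breaking: the moon is made of cheese, top scientists say.",
    "The city council approved the new transportation budget on Tuesday.",
]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
bow = vectorizer.fit_transform(corpus)            # sparse news-by-ngram count matrix

print(lexical_features(corpus[0]))
print(bow.shape)
```

POS-tag frequencies and the domain-specific counts (quoted words, external links, graphs) can be added in the same dictionary-returning style with an NLP toolkit such as NLTK or spaCy.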
2.1.2 LOW-RANK TEXTUAL FEATURES
Low-rank modeling for text data has been widely explored in different domains [155]. Low-rank approximation learns compact (low-dimensional) text representations from the high-dimensional and noisy raw feature matrix. Existing low-rank models are mostly based on matrix factorization or tensor factorization techniques, which project the term-news matrix into a k-dimensional latent space.
Matrix Factorization
Matrix factorization for text modeling has been widely used for learning document representations, e.g., for clustering [173]. It learns a low-dimensional basis from the raw feature matrix, where each dimension represents a coherent latent topic. Non-negative matrix factorization (NMF) methods introduce non-negative constraints when factorizing the news-word matrix [138]. Using NMF, we attempt to project the document-word matrix into a joint latent semantic factor space of low dimensionality, such that the document-word relations are modeled as inner products in that space. Specifically, given the news-word matrix $X \in \mathbb{R}_{+}^{N \times d}$, NMF methods try to find two non-negative matrices $F \in \mathbb{R}_{+}^{N \times k}$ and $V \in \mathbb{R}_{+}^{d \times k}$ by solving the following optimization problem:

$$
\min_{F, V \geq 0} \left\| X - F V^{\top} \right\|_{F}^{2}, \qquad (2.1)
$$

where $N$ is the number of news articles, $d$ is the size of the word vocabulary, and $k$ is the dimension of the latent topic space. In addition, $F$ and $V$ are the non-negative matrices indicating the low-dimensional representations of news and words, respectively.
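As a minimal sketch (with a placeholder corpus and an arbitrary choice of $k$), the factorization in Eq. (2.1) can be computed with scikit-learn as follows; the corpus and variable names are illustrative, not data from the cited references.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for news body texts (placeholder data).
news = [
    "scientists discover water on the mars surface",
    "celebrity secretly replaced by a body double, insiders claim",
    "city council passes the new transportation budget",
]

# Build the news-word matrix X (N x d): N articles, d vocabulary terms.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(news)

# Factorize X ~= F V^T under non-negativity constraints (Eq. 2.1).
k = 2  # dimension of the latent topic space (chosen arbitrarily here)
nmf = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
F = nmf.fit_transform(X)   # N x k: low-dimensional news representations
V = nmf.components_.T      # d x k: low-dimensional word representations

print(F.shape, V.shape)
```

Each row of $F$ can then serve directly as the low-rank textual feature vector of the corresponding article.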
Tensor Factorization
The goal of tensor decomposition is to learn the representation of news articles by considering the spatial relations of words in a document. Tensor factorization approaches [46, 53] first build a 3-mode news-word-word tensor $\mathcal{X} \in \mathbb{R}^{N \times d \times d}$ over the news documents, where a horizontal slice $\mathcal{X}_{i,:,:}$ represents the spatial relations of words in the $i$-th news document; for such a horizontal slice $S$, the entry $S(j, l)$ counts the number of times that the $j$-th term and the $l$-th term of the dictionary appear in the vicinity of each other. To learn the representations, different tensor decomposition techniques can be applied. For example, we can use non-negative CP/PARAFAC decomposition with alternating Poisson regression (CP-APR), which uses the Kullback–Leibler divergence because of the very high sparsity of the tensor. The CP/PARAFAC decomposition represents $\mathcal{X}$ as follows:
$$
\mathcal{X} \approx [\![ F, B, H ]\!] = \sum_{r=1}^{R} \lambda_r \, f_r \circ b_r \circ h_r, \qquad (2.2)
$$
where $\circ$ denotes the outer product, $f_r$ (and similarly $b_r$ and $h_r$) denotes the normalized $r$-th column of the non-negative factor matrix $F$ (and similarly $B$ and $H$), $\lambda_r$ is the weight of the $r$-th component, and $R$ is the rank. Each row of $F$ denotes the representation of the corresponding article in the embedding space.
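A hedged sketch of this step in Python: TensorLy does not ship CP-APR, so its non-negative PARAFAC routine is used below as a stand-in for the decomposition in Eq. (2.2), and the tensor itself is random toy data rather than real co-occurrence counts.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# Toy news-word-word tensor: N articles, d vocabulary terms.
# In practice, slice X[i] would count how often term pairs co-occur within a
# small window in the i-th article; Poisson noise stands in for those counts.
N, d, R = 5, 20, 3
rng = np.random.default_rng(0)
X = tl.tensor(rng.poisson(0.3, size=(N, d, d)).astype(float))

# Non-negative CP/PARAFAC decomposition (stand-in for CP-APR, Eq. 2.2).
cp = non_negative_parafac(X, rank=R, init="random", random_state=0)
weights, factors = cp          # lambda_r and the factor matrices
F, B, H = factors              # F: (N, R), B: (d, R), H: (d, R)

print(F.shape)                 # row i is the embedding of the i-th news article
```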
2.1.3 NEURAL TEXTUAL FEATURES
With the recent advancements of deep neural networks in NLP, different neural network structures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been developed to learn latent textual feature representations. Neural textual features are based on dense vector representations rather than high-dimensional and sparse features, and have achieved superior results on various NLP tasks [178]. We introduce major representative deep neural textual methods, including CNNs and RNNs, and some of their variants.
CNN
CNNs have been widely used for fake news detection and have achieved good results [163, 175].
CNNs can extract salient n-gram features from an input sentence to create an informative latent semantic representation of the sentence for downstream tasks [178]. As shown in Figure 2.1, to learn the textual representation of a news sentence, CNNs first build a representation matrix for the sentence from word-embedding vectors such as Word2Vec [90], then apply several convolution layers and a max-pooling layer to obtain the final textual representation.
Specifically, for each word $w_i$ in the news sentence, we can denote its $k$-dimensional word-embedding vector as $\mathbf{w}_i \in \mathbb{R}^{k}$, and a sentence with $n$ words can be represented as:

$$
\mathbf{w}_{1:n} = \mathbf{w}_1 \oplus \mathbf{w}_2 \oplus \cdots \oplus \mathbf{w}_n, \qquad (2.3)
$$
where $\oplus$ denotes the concatenation operation. A convolution filter with window size $h$ takes a contiguous sequence of $h$ words in the sentence as input and outputs a feature:

$$
\tilde{w}_i = \sigma(\mathbf{W} \cdot \mathbf{w}_{i:i+h-1}), \qquad (2.4)
$$
where $\sigma(\cdot)$ is the ReLU activation function and $\mathbf{W}$ represents the weights of the filter. The filter is further applied to the remaining words, and we obtain a feature vector for the sentence:

$$
\tilde{\mathbf{w}} = [\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_{n-h+1}]. \qquad (2.5)
$$

For every such feature vector, we then use a max-pooling operation to take the maximum value and extract the most important information. The process is repeated until we get the features for all filters. Following the max-pooling operations, a fully connected layer produces the final textual feature representation.
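A minimal PyTorch sketch of such a text CNN, in the spirit of [163, 175] but not their exact architecture: the randomly initialized embeddings stand in for Word2Vec, and the filter sizes, dimensions, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Word embeddings -> 1-D convolutions + ReLU -> max-pooling -> fully connected layer."""

    def __init__(self, vocab_size, embed_dim=100, num_filters=32,
                 window_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        # In practice the embedding would be initialized from Word2Vec [90].
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes]
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                # token_ids: (batch, n_words)
        x = self.embedding(token_ids)            # (batch, n, embed_dim)
        x = x.transpose(1, 2)                    # (batch, embed_dim, n) for Conv1d
        feats = []
        for conv in self.convs:
            w = torch.relu(conv(x))              # Eq. (2.4): filter + ReLU over each window
            w = w.max(dim=2).values              # max-pooling over positions of Eq. (2.5)
            feats.append(w)
        return self.fc(torch.cat(feats, dim=1))  # final textual representation -> class scores


# Usage with a toy batch of two 10-token "sentences".
model = TextCNN(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 2])
```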
RNN
RNNs are popular in NLP because they can directly encode the sequence information of sentences and paragraphs. A representative RNN for learning textual representations is the long short-term memory (LSTM) network [64, 114, 119].

For example, two LSTM models [114] were built to detect fake news: one feeds simple word embeddings into the LSTM, and the other concatenates the LSTM output with Linguistic Inquiry and Word Count (LIWC) [105] feature vectors before feeding them into the output layer.
Figure 2.1: The illustration of CNNs for learning neural textual features (word embeddings, convolution layer, max-pooling, and a fully connected layer with a softmax predictor). Based on [163].
In both cases, the LSTM models were more accurate than the Naive Bayes classifier (NBC) and Maximum Entropy models. Karimi and
Tang [64] utilized a bi-directional LSTM to learn sentence-level textual representations. To
encode the uncertainty of news content, Zhang et al. [182] proposed incorporating a Bayesian
LSTM to learn representations of news claims. Recent works [69, 81] incorporated speakers'
profile information, such as names and topics, into the LSTM model to predict fake news. Moreover,
Pham et al. utilized memory networks, a kind of attention-based neural network, to
learn textual representations by memorizing a set of words in the memory [109].
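A hedged PyTorch sketch of the second LSTM variant described above (LSTM output concatenated with LIWC features before the output layer); the hidden sizes, the LIWC dimension, and the use of the final hidden state are assumptions rather than details taken from [114].

```python
import torch
import torch.nn as nn


class LSTMWithLIWC(nn.Module):
    """LSTM over word embeddings; output concatenated with LIWC features before the output layer."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128,
                 liwc_dim=93, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim + liwc_dim, num_classes)

    def forward(self, token_ids, liwc_feats):
        # token_ids: (batch, n_words); liwc_feats: (batch, liwc_dim)
        x = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (1, batch, hidden_dim)
        h = h_n.squeeze(0)                       # final hidden state as the text encoding
        return self.out(torch.cat([h, liwc_feats], dim=1))


model = LSTMWithLIWC(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 30)), torch.rand(2, 93))
print(logits.shape)  # torch.Size([2, 2])
```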
In addition to modeling sentence-level content with RNNs, other approaches model news-level content in a hierarchical way, such as hierarchical attention neural networks [177] and hierarchical sequence-to-sequence auto-encoders [78]. For hierarchical attention neural networks (see Figure 2.2), we first learn sentence vectors by using a word encoder with attention and then learn the document representation through a sentence encoder component. Specifically, we can use bidirectional gated recurrent units (GRUs) [8] to model word sequences from both directions:
$$
\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{1, \ldots, M_i\}, \\
\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{1, \ldots, M_i\}. \qquad (2.6)
$$
We then obtain an annotation of word $w_{it}$ by concatenating the forward hidden state $\overrightarrow{h}_{it}$ and the backward hidden state $\overleftarrow{h}_{it}$, i.e., $h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$, which contains the information of the whole sentence centered around $w_{it}$. Note that not all words contribute equally to the representation of the sentence meaning. Therefore, we introduce an attention mechanism to learn weights that measure the importance of each word, and the sentence vector $v_i \in \mathbb{R}^{2d \times 1}$ is computed as follows:
$$
v_i = \sum_{t=1}^{M_i} \alpha_{it} h_{it}, \qquad (2.7)
$$
Figure 2.2: The illustration of hierarchical attention neural networks (word encoder and sentence encoder with attention) for learning neural textual features. Based on [163].
where $\alpha_{it}$ measures the importance of the $t$-th word for the sentence $s_i$, and $\alpha_{it}$ is calculated as follows:

$$
o_{it} = \tanh(W_w h_{it} + b_w), \\
\alpha_{it} = \frac{\exp(o_{it} o_w^{\top})}{\sum_{k=1}^{M_i} \exp(o_{ik} o_w^{\top})}, \qquad (2.8)
$$
where $o_{it}$ is a hidden representation of $h_{it}$ obtained by feeding the hidden state $h_{it}$ to a fully connected layer, and $o_w$ is a weight parameter that represents the word-level context vector.
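A hedged PyTorch sketch of this word encoder with attention (Eqs. 2.6-2.8); the layer sizes are illustrative, and this is a sketch of the general mechanism rather than the exact architecture of [163] or [177].

```python
import torch
import torch.nn as nn


class WordEncoderWithAttention(nn.Module):
    """Bidirectional GRU over word embeddings plus word-level attention (Eqs. 2.6-2.8)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)      # o_it = tanh(W_w h_it + b_w)
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))   # word-level context vector o_w

    def forward(self, token_ids):                        # token_ids: (sentences, M_i)
        x = self.embedding(token_ids)
        h, _ = self.gru(x)                               # h_it = [forward; backward], (sentences, M_i, 2d)
        o = torch.tanh(self.proj(h))                     # hidden representation o_it
        alpha = torch.softmax(o @ self.context, dim=1)   # attention weights alpha_it
        v = (alpha.unsqueeze(-1) * h).sum(dim=1)         # sentence vector v_i (Eq. 2.7)
        return v, alpha


encoder = WordEncoderWithAttention(vocab_size=5000)
v, alpha = encoder(torch.randint(0, 5000, (3, 12)))      # 3 sentences, 12 words each
print(v.shape, alpha.shape)  # torch.Size([3, 100]) torch.Size([3, 12])
```

The same pattern, applied one level up over the sentence vectors $v_i$, gives the sentence encoder described next.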
Similarly, we use RNNs with GRU units to encode each sentence in the news article:

$$
\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(v_{i}), \quad i \in \{1, \ldots, N\}, \\
\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(v_{i}), \quad i \in \{N, \ldots, 1\}. \qquad (2.9)
$$

We then obtain an annotation of sentence $i$, $h_i \in \mathbb{R}^{2d \times 1}$, by concatenating the forward hidden state $\overrightarrow{h}_i$ and the backward hidden state $\overleftarrow{h}_i$, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, which captures the context from neighbor