CHAPTER 2
What News Content Tells
News content features describe the meta information related to a piece of news. Representative news content features include the following.
Source: Author or publisher of the news article.
Headline: Short title text that aims to catch the attention of readers and describes the
main topic of the article.
Body Text: Main text that elaborates the details of the news story; there is usually a major
claim that is specifically highlighted and that shapes the angle of the publisher.
Image/Video: Part of the body content of a news article that provides visual cues to frame
the story.
Based on these raw content features, different kinds of feature representations can be built
to extract discriminative characteristics of fake news. Typically, these representations take the
form of textual features, visual features, and style features.
2.1 TEXTUAL FEATURES
Textual features are extracted from news content with natural language processing (NLP) tech-
niques [102]. Next, we introduce how to extract linguistic, low-rank, and neural textual features.
2.1.1 LINGUISTIC FEATURES
Linguistic features are extracted from the text content in terms of characters, words, sentences,
and documents. In order to capture the different aspects of fake news and real news, both com-
mon linguistic features and domain-specific linguistic features are utilized. Common linguistic
features are often used to represent documents for various tasks in natural language processing.
Common linguistic features include: (i) lexical features, including character-level and word-level features, such as total words, characters per word, frequency of large words, and unique words; and (ii) syntactic features, including sentence-level features, such as the frequency of function words and phrases (i.e., n-grams and bag-of-words approaches [43]) or punctuation and part-of-speech (POS) tagging [5, 107]. Domain-specific linguistic features are specifically aligned to the news domain and include, for example, quoted words, external links, the number of graphs, and the average length of graphs [111].
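To make these features concrete, the sketch below computes a handful of common lexical features and a bag-of-words/n-gram representation in Python with scikit-learn; the specific features, the "large word" threshold, and the toy corpus are illustrative assumptions rather than the exact feature sets of [5, 43, 107, 111].

```python
import re

from sklearn.feature_extraction.text import CountVectorizer


def lexical_features(text: str) -> dict:
    """A few common character- and word-level (lexical) features."""
    words = re.findall(r"[A-Za-z']+", text)
    n = max(len(words), 1)
    return {
        "total_words": len(words),
        "chars_per_word": sum(len(w) for w in words) / n,
        "large_word_ratio": sum(len(w) >= 6 for w in words) / n,  # "large" = 6+ characters (assumption)
        "unique_word_ratio": len({w.lower() for w in words}) / n,
    }


# Bag-of-words / n-gram features over a toy corpus of news bodies.
corpus = [
    "Breaking: the moon is made of cheese, top scientists say.",
    "The city council approved the new transportation budget on Tuesday.",
]
vectorizer = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
bow = vectorizer.fit_transform(corpus)            # sparse news-by-ngram count matrix

print(lexical_features(corpus[0]))
print(bow.shape)
```

POS-tag frequencies and the domain-specific counts (quoted words, external links, graphs) can be added in the same dictionary-returning style with an NLP toolkit such as NLTK or spaCy.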
2.1.2 LOW-RANK TEXTUAL FEATURES
Low-rank modeling for text data has been widely explored in different domains [155]. Low-rank approximation learns compact (low-dimensional) text representations from the high-dimensional and noisy raw feature matrix. Existing low-rank models are mostly based on matrix factorization or tensor factorization techniques, which project the term-news matrix into a k-dimensional latent space.
Matrix Factorization
Matrix factorization for text modeling has been widely used for learning document representations, e.g., for clustering [173]. It learns a low-dimensional basis from the raw feature matrix, where each dimension represents a coherent latent topic. Non-negative matrix factorization (NMF) methods introduce non-negative constraints when factorizing the news-word matrix [138]. Using NMF, we attempt to project the document-word matrix into a joint latent semantic factor space of low dimensionality, such that the document-word relations are modeled as inner products in that space. Specifically, given the news-word matrix $X \in \mathbb{R}_{+}^{N \times d}$, NMF methods try to find two non-negative matrices $F \in \mathbb{R}_{+}^{N \times k}$ and $V \in \mathbb{R}_{+}^{d \times k}$ by solving the following optimization problem:

$$
\min_{F, V \geq 0} \left\| X - F V^{\top} \right\|_{F}^{2}, \qquad (2.1)
$$

where $N$ is the number of news articles, $d$ is the size of the word vocabulary, and $k$ is the dimension of the latent topic space. In addition, $F$ and $V$ are the non-negative matrices indicating the low-dimensional representations of news and words, respectively.
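As a minimal sketch (with a placeholder corpus and an arbitrary choice of $k$), the factorization in Eq. (2.1) can be computed with scikit-learn as follows; the corpus and variable names are illustrative, not data from the cited references.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for news body texts (placeholder data).
news = [
    "scientists discover water on the mars surface",
    "celebrity secretly replaced by a body double, insiders claim",
    "city council passes the new transportation budget",
]

# Build the news-word matrix X (N x d): N articles, d vocabulary terms.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(news)

# Factorize X ~= F V^T under non-negativity constraints (Eq. 2.1).
k = 2  # dimension of the latent topic space (chosen arbitrarily here)
nmf = NMF(n_components=k, init="nndsvda", random_state=0, max_iter=500)
F = nmf.fit_transform(X)   # N x k: low-dimensional news representations
V = nmf.components_.T      # d x k: low-dimensional word representations

print(F.shape, V.shape)
```

Each row of $F$ can then serve directly as the low-rank textual feature vector of the corresponding article.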
Tensor Factorization
The goal of tensor decomposition is to learn the representation of news articles by considering the spatial relations of words in a document. Tensor factorization approaches [46, 53] first build a 3-mode news-word-word tensor $\mathcal{X} \in \mathbb{R}^{N \times d \times d}$ over the news documents, where a horizontal slice $\mathcal{X}_{i,:,:}$ represents the spatial relations of words in the $i$-th news document; for such a horizontal slice $S$, the entry $S(j, l)$ counts the number of times that the $j$-th term and the $l$-th term of the dictionary appear in the vicinity of each other. To learn the representations, different tensor decomposition techniques can be applied. For example, we can use non-negative CP/PARAFAC decomposition with alternating Poisson regression (CP-APR), which uses the Kullback–Leibler divergence because of the very high sparsity of the tensor. The CP/PARAFAC decomposition represents $\mathcal{X}$ as follows:
$$
\mathcal{X} \approx [\![ F, B, H ]\!] = \sum_{r=1}^{R} \lambda_r \, f_r \circ b_r \circ h_r, \qquad (2.2)
$$
where $\circ$ denotes the outer product, $f_r$ (and similarly $b_r$ and $h_r$) denotes the normalized $r$-th column of the non-negative factor matrix $F$ (and similarly $B$ and $H$), $\lambda_r$ is the weight of the $r$-th component, and $R$ is the rank. Each row of $F$ denotes the representation of the corresponding article in the embedding space.
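A hedged sketch of this step in Python: TensorLy does not ship CP-APR, so its non-negative PARAFAC routine is used below as a stand-in for the decomposition in Eq. (2.2), and the tensor itself is random toy data rather than real co-occurrence counts.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import non_negative_parafac

# Toy news-word-word tensor: N articles, d vocabulary terms.
# In practice, slice X[i] would count how often term pairs co-occur within a
# small window in the i-th article; Poisson noise stands in for those counts.
N, d, R = 5, 20, 3
rng = np.random.default_rng(0)
X = tl.tensor(rng.poisson(0.3, size=(N, d, d)).astype(float))

# Non-negative CP/PARAFAC decomposition (stand-in for CP-APR, Eq. 2.2).
cp = non_negative_parafac(X, rank=R, init="random", random_state=0)
weights, factors = cp          # lambda_r and the factor matrices
F, B, H = factors              # F: (N, R), B: (d, R), H: (d, R)

print(F.shape)                 # row i is the embedding of the i-th news article
```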
2.1.3 NEURAL TEXTUAL FEATURES
With the recent advancements of deep neural networks in NLP, different neural network structures, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been developed to learn latent textual feature representations. Neural textual features are based on dense vector representations rather than high-dimensional and sparse features, and have achieved superior results on various NLP tasks [178]. We introduce major representative deep neural textual methods, including CNNs and RNNs, and some of their variants.
CNN
CNNs have been widely used for fake news detection and have achieved good results [163, 175].
CNNs can extract salient n-gram features from an input sentence to create an informative latent semantic representation of the sentence for downstream tasks [178]. As shown in Figure 2.1, to learn the textual representation of a news sentence, CNNs first build a representation matrix for the sentence from word-embedding vectors such as Word2Vec [90], then apply several convolution layers and a max-pooling layer to obtain the final textual representation.
Specifically, for each word $w_i$ in the news sentence, we can denote its $k$-dimensional word-embedding vector as $\mathbf{w}_i \in \mathbb{R}^{k}$, and a sentence with $n$ words can be represented as:

$$
\mathbf{w}_{1:n} = \mathbf{w}_1 \oplus \mathbf{w}_2 \oplus \cdots \oplus \mathbf{w}_n, \qquad (2.3)
$$
where $\oplus$ denotes the concatenation operation. A convolution filter with window size $h$ takes a contiguous sequence of $h$ words in the sentence as input and outputs a feature:

$$
\tilde{w}_i = \sigma(\mathbf{W} \cdot \mathbf{w}_{i:i+h-1}), \qquad (2.4)
$$
where $\sigma(\cdot)$ is the ReLU activation function and $\mathbf{W}$ represents the weights of the filter. The filter is further applied to the remaining words, and we obtain a feature vector for the sentence:

$$
\tilde{\mathbf{w}} = [\tilde{w}_1, \tilde{w}_2, \ldots, \tilde{w}_{n-h+1}]. \qquad (2.5)
$$

For every such feature vector, we then use a max-pooling operation to take the maximum value and extract the most important information. The process is repeated until we get the features for all filters. Following the max-pooling operations, a fully connected layer produces the final textual feature representation.
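A minimal PyTorch sketch of such a text CNN, in the spirit of [163, 175] but not their exact architecture: the randomly initialized embeddings stand in for Word2Vec, and the filter sizes, dimensions, and class count are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TextCNN(nn.Module):
    """Word embeddings -> 1-D convolutions + ReLU -> max-pooling -> fully connected layer."""

    def __init__(self, vocab_size, embed_dim=100, num_filters=32,
                 window_sizes=(3, 4, 5), num_classes=2):
        super().__init__()
        # In practice the embedding would be initialized from Word2Vec [90].
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=h) for h in window_sizes]
        )
        self.fc = nn.Linear(num_filters * len(window_sizes), num_classes)

    def forward(self, token_ids):                # token_ids: (batch, n_words)
        x = self.embedding(token_ids)            # (batch, n, embed_dim)
        x = x.transpose(1, 2)                    # (batch, embed_dim, n) for Conv1d
        feats = []
        for conv in self.convs:
            w = torch.relu(conv(x))              # Eq. (2.4): filter + ReLU over each window
            w = w.max(dim=2).values              # max-pooling over positions of Eq. (2.5)
            feats.append(w)
        return self.fc(torch.cat(feats, dim=1))  # final textual representation -> class scores


# Usage with a toy batch of two 10-token "sentences".
model = TextCNN(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 2])
```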
RNN
RNNs are popular in NLP because they can directly encode the sequence information of sentences and paragraphs. A representative RNN for learning textual representations is the long short-term memory (LSTM) network [64, 114, 119].

For example, two LSTM models [114] were built to detect fake news: one feeds simple word embeddings into the LSTM, and the other concatenates the LSTM output with Linguistic Inquiry and Word Count (LIWC) [105] feature vectors before feeding them into the output layer.
Figure 2.1: The illustration of CNNs for learning neural textual features (word embeddings, convolution layer, max-pooling, and a fully connected layer with a softmax predictor). Based on [163].
In both cases, the LSTM models were more accurate than the Naive Bayes classifier (NBC) and Maximum Entropy models. Karimi and
Tang [64] utilized a bi-directional LSTM to learn sentence-level textual representations. To
encode the uncertainty of news content, Zhang et al. [182] proposed incorporating a Bayesian
LSTM to learn representations of news claims. Recent works [69, 81] incorporated speakers'
profile information, such as names and topics, into the LSTM model to predict fake news. Moreover,
Pham et al. utilized memory networks, a kind of attention-based neural network, to
learn textual representations by memorizing a set of words in the memory [109].
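A hedged PyTorch sketch of the second LSTM variant described above (LSTM output concatenated with LIWC features before the output layer); the hidden sizes, the LIWC dimension, and the use of the final hidden state are assumptions rather than details taken from [114].

```python
import torch
import torch.nn as nn


class LSTMWithLIWC(nn.Module):
    """LSTM over word embeddings; output concatenated with LIWC features before the output layer."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128,
                 liwc_dim=93, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim + liwc_dim, num_classes)

    def forward(self, token_ids, liwc_feats):
        # token_ids: (batch, n_words); liwc_feats: (batch, liwc_dim)
        x = self.embedding(token_ids)
        _, (h_n, _) = self.lstm(x)               # h_n: (1, batch, hidden_dim)
        h = h_n.squeeze(0)                       # final hidden state as the text encoding
        return self.out(torch.cat([h, liwc_feats], dim=1))


model = LSTMWithLIWC(vocab_size=5000)
logits = model(torch.randint(0, 5000, (2, 30)), torch.rand(2, 93))
print(logits.shape)  # torch.Size([2, 2])
```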
In addition to modeling sentence-level content with RNNs, other approaches model news-level content in a hierarchical way, such as hierarchical attention neural networks [177] and hierarchical sequence-to-sequence auto-encoders [78]. For hierarchical attention neural networks (see Figure 2.2), we first learn sentence vectors by using a word encoder with attention and then learn the document representation through a sentence encoder component. Specifically, we can use bidirectional gated recurrent units (GRUs) [8] to model word sequences from both directions:
$$
\overrightarrow{h}_{it} = \overrightarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{1, \ldots, M_i\}, \\
\overleftarrow{h}_{it} = \overleftarrow{\mathrm{GRU}}(w_{it}), \quad t \in \{1, \ldots, M_i\}. \qquad (2.6)
$$
We then obtain an annotation of word $w_{it}$ by concatenating the forward hidden state $\overrightarrow{h}_{it}$ and the backward hidden state $\overleftarrow{h}_{it}$, i.e., $h_{it} = [\overrightarrow{h}_{it}; \overleftarrow{h}_{it}]$, which contains the information of the whole sentence centered around $w_{it}$. Note that not all words contribute equally to the representation of the sentence meaning. Therefore, we introduce an attention mechanism to learn weights that measure the importance of each word, and the sentence vector $v_i \in \mathbb{R}^{2d \times 1}$ is computed as follows:
$$
v_i = \sum_{t=1}^{M_i} \alpha_{it} h_{it}, \qquad (2.7)
$$
Figure 2.2: The illustration of hierarchical attention neural networks (word encoder and sentence encoder with attention) for learning neural textual features. Based on [163].
where $\alpha_{it}$ measures the importance of the $t$-th word for the sentence $s_i$, and $\alpha_{it}$ is calculated as follows:

$$
o_{it} = \tanh(W_w h_{it} + b_w), \\
\alpha_{it} = \frac{\exp(o_{it} o_w^{\top})}{\sum_{k=1}^{M_i} \exp(o_{ik} o_w^{\top})}, \qquad (2.8)
$$
where $o_{it}$ is a hidden representation of $h_{it}$ obtained by feeding the hidden state $h_{it}$ to a fully connected layer, and $o_w$ is a weight parameter that represents the word-level context vector.
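A hedged PyTorch sketch of this word encoder with attention (Eqs. 2.6-2.8); the layer sizes are illustrative, and this is a sketch of the general mechanism rather than the exact architecture of [163] or [177].

```python
import torch
import torch.nn as nn


class WordEncoderWithAttention(nn.Module):
    """Bidirectional GRU over word embeddings plus word-level attention (Eqs. 2.6-2.8)."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=50):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)      # o_it = tanh(W_w h_it + b_w)
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))   # word-level context vector o_w

    def forward(self, token_ids):                        # token_ids: (sentences, M_i)
        x = self.embedding(token_ids)
        h, _ = self.gru(x)                               # h_it = [forward; backward], (sentences, M_i, 2d)
        o = torch.tanh(self.proj(h))                     # hidden representation o_it
        alpha = torch.softmax(o @ self.context, dim=1)   # attention weights alpha_it
        v = (alpha.unsqueeze(-1) * h).sum(dim=1)         # sentence vector v_i (Eq. 2.7)
        return v, alpha


encoder = WordEncoderWithAttention(vocab_size=5000)
v, alpha = encoder(torch.randint(0, 5000, (3, 12)))      # 3 sentences, 12 words each
print(v.shape, alpha.shape)  # torch.Size([3, 100]) torch.Size([3, 12])
```

The same pattern, applied one level up over the sentence vectors $v_i$, gives the sentence encoder described next.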
Similarly, we use RNNs with GRU units to encode each sentence in the news article:

$$
\overrightarrow{h}_{i} = \overrightarrow{\mathrm{GRU}}(v_{i}), \quad i \in \{1, \ldots, N\}, \\
\overleftarrow{h}_{i} = \overleftarrow{\mathrm{GRU}}(v_{i}), \quad i \in \{N, \ldots, 1\}. \qquad (2.9)
$$

We then obtain an annotation of sentence $i$, $h_i \in \mathbb{R}^{2d \times 1}$, by concatenating the forward hidden state $\overrightarrow{h}_i$ and the backward hidden state $\overleftarrow{h}_i$, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, which captures the context from neighbor