Emerging Trends in Image Processing, Computer Vision, and Pattern Recognition, 1st Edition (2015)

Part II. Computer Vision and Recognition Systems

Chapter 26. A local feature-based facial expression recognition system from depth video

Md. Zia Uddin Department of Computer Education, Sungkyunkwan University, Seoul, Republic of Korea

Abstract

In this chapter, a novel approach is proposed to recognize facial expressions from time-sequential depth videos. Local directional pattern features are extracted from the time-sequential depth faces and then processed with principal component analysis and linear discriminant analysis to make them more robust. Finally, the resulting local features are applied with hidden Markov models to model and recognize different facial expressions. The proposed approach achieves a superior recognition rate compared with conventional approaches.

Keywords

Depth information

LDP

PCA

LDA

HMM

FER

Acknowledgement

This work was supported by Faculty Research Fund, Sungkyunkwan University, 2013.

1 Introduction

Facial expression recognition (FER) gives machines a way of sensing human emotions and is one of the most widely used applications of artificial intelligence and pattern analysis [1–10]. For expression images captured with Red Green Blue (RGB) cameras, most FER work has used principal component analysis (PCA), a technique well known for dimension reduction and applied in many earlier works. In Padgett and Cottrell [3], PCA was used to recognize facial action units (FAUs) from facial expression images. In Donato et al. [5] as well as Ekman and Friesen [6], PCA was used for FER with the facial action coding system.

More recently, independent component analysis (ICA) has been extensively utilized for FER based on local face image features [5,10–21]. In Bartlett et al. [14], the authors used ICA to extract local features and then classified several facial expressions. In Chao-Fa and Shin [15], ICA was used to recognize the FAUs. Besides ICA, local binary patterns (LBP) have lately been used for FER [22–24]. The main properties of LBP features are their tolerance to illumination changes and their computational simplicity. Later, LBP was improved by incorporating each pixel's gradient information, yielding the local directional pattern (LDP) for representing local face features [25]. Like LBP, LDP features tolerate illumination changes, but they are more robust than LBP because they consider the gradient information of each pixel, as mentioned above [25].

Thus, LDP can be considered a robust approach and hence can be adopted for FER. To make the LDP facial expression features more robust, linear discriminant analysis (LDA) can be applied, since LDA is a strong method for obtaining good discrimination among face images of different expressions in a linear feature space. The hidden Markov model (HMM) is a robust tool for modeling and decoding time-sequential events [21,26–28]. Hence, HMM is an appropriate choice for training and recognizing the features of different facial expressions in FER.

RGB cameras are the most widely used for capturing face images, but a face captured by an RGB camera provides no pixel depth distinguishing the near and far parts of the face in a facial expression video, whereas depth information can contribute to more efficient features that describe an expression more strongly. Hence, depth videos should allow more efficient person-independent FER.

In this chapter, a novel FER approach is proposed using LDP, PCA, LDA, and HMM. Local LDP features are first extracted from the facial expression images and further refined by PCA and LDA. These robust features are then converted into discrete symbols using vector quantization, and the symbols are used to train discrete HMMs of the different expressions. To evaluate the proposed approach, comparison studies were conducted with PCA, PCA-LDA, ICA, and ICA-LDA as feature extractors in combination with HMM. The experimental results show the superiority of the proposed method over the conventional approaches.

2 Depth Image Preprocessing

The images of different expressions are captured by a depth camera [29] that generates RGB and distance (i.e., depth) information simultaneously for the objects in the scene. The depth video represents the range of every pixel in the scene as a gray-level intensity (i.e., longer-range pixels take darker values and shorter-range pixels brighter ones, or vice versa). Figure 1 shows the basic steps of the proposed FER system; a small sketch of the depth-to-gray mapping follows it.

FIGURE 1 Basic steps involved in the proposed facial expression recognition system.
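As a concrete illustration of the depth-to-gray mapping described above, the following minimal Python sketch (not part of the original chapter) normalizes a raw depth map into an 8-bit gray-level depth face; the input is assumed to be a 2D array of distances with 0 marking invalid pixels, and the `near_is_bright` flag selects which of the two polarities mentioned above is used:

```python
import numpy as np

def depth_to_gray(depth_map, near_is_bright=True):
    """Map a raw depth map (distances; 0 = invalid) to an 8-bit
    gray-level image so that range differences become intensity."""
    d = depth_map.astype(np.float64)
    valid = d > 0                              # assume 0 marks missing depth
    lo, hi = d[valid].min(), d[valid].max()
    gray = np.zeros_like(d)
    gray[valid] = (d[valid] - lo) / max(hi - lo, 1e-9)
    if near_is_bright:                         # near parts (e.g., nose) bright
        gray[valid] = 1.0 - gray[valid]
    return (gray * 255).astype(np.uint8)
```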

Figure 2(a) shows a depth image of a surprise expression. It can be noticed that in the depth image, higher pixel values represent near parts (e.g., the nose) and lower values far parts (e.g., the eyes). The corresponding pseudo-color image in Figure 2(b) also indicates significant differences in color intensity among the different face regions. Figure 3(a)–(c) shows five sequential depth faces each from the happy, surprise, and disgust expressions, respectively.

FIGURE 2 (a) A depth image and (b) the corresponding pseudo-color image of a surprise expression.

FIGURE 3 Sequential depth facial expression images of (a) happy, (b) surprise, and (c) disgust.

3 Feature Extraction

The feature extraction of the proposed approach consists of three fundamental stages: (1) LDP is performed first on the depth faces of the facial expression videos, (2) PCA is applied on the LDP features for dimensionality reduction, and (3) LDA is then applied to pull images of the same expression as close together as possible and to push images of different expression classes as far apart as possible.

3.1 LDP Features

LDP assigns an 8-bit binary code to each pixel of an input depth image. The code is calculated by comparing the relative edge response values of the pixel in eight different directions. Representative edge detectors include the Kirsch, Prewitt, and Sobel operators; among these, the Kirsch edge detector [20] detects edges most accurately because it considers all eight neighbors. Given a central pixel in the image, the eight directional edge response values {mk}, k = 0, 1, …, 7, are computed by the Kirsch masks Mk in eight different orientations centered on the pixel's position [18]. Figure 4 shows these masks.

FIGURE 4 Kirsch edge masks in eight directions.

The presence of a corner or an edge produces high response values in particular directions; therefore, the p most prominent directions are of interest when generating the LDP code. The bits bk corresponding to the top p directional responses are set to 1, and the remaining bits of the 8-bit LDP pattern are set to 0. Finally, the LDP code is derived as in Equation (1). Figure 5 shows the mask responses and the LDP bit positions, and Figure 6 an example LDP code considering the five top positions, that is, p = 5.

$$\mathrm{LDP} = \sum_{k=0}^{7} b\left(m_k - m_p\right) 2^{k}, \qquad b(a) = \begin{cases} 1, & a \geq 0 \\ 0, & a < 0 \end{cases} \tag{1}$$

where mp is the pth most significant directional response.

FIGURE 5 (a) Edge response to eight directions and (b) LDP binary bit positions.

FIGURE 6 LDP code.

Thus, an image is transformed into its LDP map using the LDP code of each pixel. The texture feature of the image is represented by the histogram of the LDP map, whose qth bin can be defined as

$$H_q = \sum_{x,\,y} P\left(\mathrm{LDP}(x, y) = q\right), \qquad P(A) = \begin{cases} 1, & A \text{ is true} \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where q = 0, 1, …, n − 1 and n is the number of LDP histogram bins (normally n = 256) for an image I. The histogram of the whole LDP map is then presented as

$$H = \left[ H_0, H_1, \ldots, H_{n-1} \right] \tag{3}$$

To describe the LDP features, a depth face image is divided into nonoverlapping rectangular regions and a histogram is computed for each region. The whole LDP feature F is then expressed as a concatenated sequence of histograms

$$F = \left[ H^{(1)}, H^{(2)}, \ldots, H^{(s)} \right] \tag{4}$$

where s represents the number of nonoverlapping regions in the image. After analyzing the LDP features of all the depth face images, the feature positions that never take values > 0 over all face images carry no information and hence can be ignored. Thus, the retained LDP features of the depth faces can be represented as D.
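A compact Python sketch of Equations (1)–(4) is given below. It is an illustrative implementation under stated assumptions rather than the authors' code: absolute Kirsch responses are ranked (as in the original LDP descriptor [25]), the default here is p = 3 whereas the example in Figure 6 uses p = 5, ties at the pth response may set more than p bits, and a hypothetical 4 × 4 region grid is used:

```python
import numpy as np
from scipy.ndimage import convolve

def kirsch_masks():
    """Eight 3x3 Kirsch masks, one per 45-degree direction: the three
    5-weights rotate around the outer ring of the mask."""
    ring = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    weights = [5, 5, 5, -3, -3, -3, -3, -3]
    masks = []
    for k in range(8):
        m = np.zeros((3, 3), dtype=np.int32)
        for i, pos in enumerate(ring):
            m[pos] = weights[(i - k) % 8]
        masks.append(m)
    return masks

def ldp_code_map(image, p=3):
    """Per-pixel LDP codes (Equation (1)): the bits of the p most
    prominent absolute directional responses are set to 1."""
    resp = np.stack([np.abs(convolve(image.astype(np.int32), m))
                     for m in kirsch_masks()])          # shape (8, H, W)
    thresh = np.sort(resp, axis=0)[-p, :, :]            # p-th largest response
    bits = (resp >= thresh).astype(np.uint32)           # top-p bits set
    powers = (2 ** np.arange(8, dtype=np.uint32)).reshape(8, 1, 1)
    return (bits * powers).sum(axis=0).astype(np.uint8)

def ldp_feature(image, grid=(4, 4), p=3, n_bins=256):
    """Regional histograms (Equations (2)-(3)) concatenated into the
    whole-image LDP feature F (Equation (4))."""
    codes = ldp_code_map(image, p)
    h, w = codes.shape
    gh, gw = grid
    hists = []
    for r in range(gh):
        for c in range(gw):
            block = codes[r*h//gh:(r+1)*h//gh, c*w//gw:(c+1)*w//gw]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            hists.append(hist)
    return np.concatenate(hists)                        # one row of D
```

Stacking one such vector per depth face (after dropping the always-zero positions, as described above) yields the feature matrix D used in the next subsection.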

3.2 PCA on LDP Features

PCA is a very popular method for data dimension reduction. It is a subspace projection method that transforms a high-dimensional space into a reduced space while maintaining the maximum variability. The principal components of the covariance matrix Y of the LDP features D can be calculated as

$$\mathbf{Y}\mathbf{P} = \mathbf{P}\boldsymbol{\lambda} \tag{5}$$

where λ represents the eigenvalue matrix and P the eigenvector matrix. The eigenvector associated with the largest eigenvalue defines the axis of maximum variance, the one with the second-largest eigenvalue the axis of the second-largest variance, and so on. Thus, the m eigenvectors with the highest eigenvalues are chosen for projecting the LDP features. The PCA feature-space projections V of the LDP features can be represented as

$$\mathbf{V} = \mathbf{D}\,\mathbf{P}_m \tag{6}$$

where Pm consists of the m chosen eigenvectors.
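Equations (5) and (6) amount to an eigendecomposition of the feature covariance followed by a projection; a minimal sketch, assuming the rows of D are the per-image LDP feature vectors, is:

```python
import numpy as np

def pca_project(D, m):
    """Project the rows of D onto the m leading principal components
    (Equations (5)-(6))."""
    Dc = D - D.mean(axis=0)                   # center the LDP features
    Y = np.cov(Dc, rowvar=False)              # covariance matrix of Eq. (5)
    eigvals, eigvecs = np.linalg.eigh(Y)      # eigh: Y is symmetric
    order = np.argsort(eigvals)[::-1][:m]     # m largest eigenvalues
    P_m = eigvecs[:, order]
    return Dc @ P_m, P_m                      # V = D P_m, Eq. (6)
```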

3.3 LDA on PCA Features

To obtain more robust features, LDA is performed on the PCA feature vectors V. LDA exploits class-specific information and maximizes the ratio of the between-class scatter matrix Qb to the within-class scatter matrix Qw. The optimal discriminant matrix WLDA is chosen by maximizing the ratio of the determinants of the between-class and within-class scatter matrices as

$$\mathbf{W}_{\mathrm{LDA}} = \underset{\mathbf{W}}{\arg\max}\; \frac{\left| \mathbf{W}^{T} \mathbf{Q}_{b} \mathbf{W} \right|}{\left| \mathbf{W}^{T} \mathbf{Q}_{w} \mathbf{W} \right|} \tag{7}$$

where WLDA spans the discriminant feature space. Thus, the LDP-PCA-LDA feature vectors Z of the facial expression images can be obtained as follows:

$$\mathbf{Z} = \mathbf{V}\,\mathbf{W}_{\mathrm{LDA}} \tag{8}$$
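A minimal sketch of the Fisher criterion in Equations (7) and (8) follows, assuming V holds one PCA feature vector per row with a matching expression label per row (for C classes, at most C − 1 useful discriminant directions exist):

```python
import numpy as np

def lda_fit(V, labels, n_components):
    """Discriminant directions W_LDA of Equation (7) from scatter
    matrices Q_w (within-class) and Q_b (between-class)."""
    mean_all = V.mean(axis=0)
    d = V.shape[1]
    Qw = np.zeros((d, d))
    Qb = np.zeros((d, d))
    for c in np.unique(labels):
        Vc = V[labels == c]
        mc = Vc.mean(axis=0)
        Qw += (Vc - mc).T @ (Vc - mc)
        diff = (mc - mean_all).reshape(-1, 1)
        Qb += len(Vc) * (diff @ diff.T)
    # maximize |W^T Qb W| / |W^T Qw W| via the eigenvectors of pinv(Qw) Qb
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Qw) @ Qb)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real             # use as Z = V @ W_LDA, Eq. (8)
```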

Figure 7 shows an example 3D LDA representation of the LDP-PCA features of all the facial expression depth images, illustrating good separation among the depth faces of the different classes.

FIGURE 7 3D plot of LDP-PCA-LDA features of depth faces from six expressions.

3.4 HMM for Expression Modeling and Recognition

To decode the depth information-based time-sequential facial expression features, discrete HMMs are employed. HMMs have been applied extensively to solve a large number of complex problems in various applications such as speech recognition [30].

An HMM is a collection of states where each state is characterized by transition and symbol-observation probabilities. A basic HMM can be expressed as H = {S, π, R, B}, where S denotes the possible states, π the initial probabilities of the states, R the transition probability matrix between the hidden states, and B the observation symbol probabilities from every state. If the number of expressions is N, then there is a dictionary (H1, H2, …, HN) of N trained models. The Baum-Welch algorithm is used for HMM parameter estimation, as applied in [21]. Figure 8 shows the structure and transition probabilities of the sad HMM after training.

FIGURE 8 HMM transition probabilities for the sad expression after training.

To recognize a test facial expression video, the observation sequence O obtained from the corresponding depth image sequence is used to determine the proper model by computing the highest likelihood L over all N trained expression HMMs as follows:

$$\text{decision} = \underset{i}{\arg\max}\; L_i, \qquad L_i = P\left(O \mid H_i\right), \quad i = 1, \ldots, N \tag{9}$$
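A minimal sketch of this recognition step follows. It is illustrative rather than the authors' implementation: the codebook is assumed to have been learned offline (e.g., by k-means clustering, standing in for the chapter's vector quantization), the per-expression HMM parameters (π, R, B) are assumed to have been estimated by Baum-Welch beforehand, and the likelihood of Equation (9) is computed with the standard forward algorithm in log space:

```python
import numpy as np

def quantize(Z, codebook):
    """Map each frame's LDP-PCA-LDA vector (row of Z) to the index of
    its nearest codebook entry, giving a discrete observation sequence."""
    dists = np.linalg.norm(Z[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def log_likelihood(obs, pi, R, B):
    """Forward algorithm in log space: log P(O | H) for one discrete HMM
    with initial probabilities pi, transitions R, and emissions B."""
    log_alpha = np.log(pi + 1e-300) + np.log(B[:, obs[0]] + 1e-300)
    for o in obs[1:]:
        log_alpha = np.logaddexp.reduce(
            log_alpha[:, None] + np.log(R + 1e-300), axis=0
        ) + np.log(B[:, o] + 1e-300)
    return np.logaddexp.reduce(log_alpha)

def recognize(obs, models):
    """Equation (9): the expression whose trained HMM assigns the test
    observation sequence the highest likelihood wins."""
    scores = {name: log_likelihood(obs, *hmm) for name, hmm in models.items()}
    return max(scores, key=scores.get)
```

Here `models` is a hypothetical dictionary mapping each expression name to its trained (π, R, B) triple.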

4 Experiments and Results

The FER database was built for six expressions: Surprise, Sad, Happy, Disgust, Anger, and Fear. Each expression video clip was of variable length, and each expression in each video starts and ends with a neutral expression. A total of 20 sequences from each expression were used to build the feature space. To train and test each facial expression model, 20 and 40 image sequences were used, respectively.

The average recognition rate using PCA on the depth faces is 62.50%, as shown in Table 1. Applying LDA on the PCA features raised the average recognition rate to 65.83%, as shown in Table 2. As the PCA-based global features showed poor recognition performance, ICA-based local features were tried next and obtained an 83.33% average recognition rate, as reported in Table 3. To improve the ICA features, LDA was applied on them; as shown in Table 4, the average recognition rate using the ICA-LDA representation of the depth facial expression images is 83.50%, which is higher than that of the PCA-based features.
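For reference, the average rates quoted here are taken to be per-class averages over the rows of each confusion matrix (an assumption about the exact metric, consistent with the 40 test sequences per expression); the computation is simply:

```python
import numpy as np

def average_recognition_rate(confusion):
    """Mean of the per-class rates: the diagonal of the row-normalized
    confusion matrix, where rows are the true expression classes."""
    rates = np.diag(confusion) / confusion.sum(axis=1)
    return 100.0 * rates.mean()
```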

Table 1

FER Confusion Matrix Using Depth Faces with PCA

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Table 2

FER Confusion Matrix Using Depth Faces with PCA-LDA

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Table 3

FER Confusion Matrix Using Depth Faces with ICA

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Table 4

FER Confusion Matrix Using Depth Faces with ICA-LDA

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Then, LBP was tried on the same database and achieved an average recognition rate of 89.20%, as shown in Table 5. Furthermore, LDP achieved a better recognition rate than LBP, namely 90.83%, as shown in Table 6. Finally, LDP-PCA-LDA was applied with HMM and showed superiority over the other feature extraction methods, achieving the highest recognition rate (i.e., 98.33%), as shown in Table 7. Figure 9 compares the FER performance of the different approaches, where LDP-PCA-LDA again shows its superiority.

Table 5

FER Confusion Matrix Using Depth Faces with LBP

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Table 6

FER Confusion Matrix Using Depth Faces with LDP

Bold values indicate correct expression recognition and others indicate incorrect recognition.

Table 7

FER Confusion Matrix Using Depth Faces with LDP-PCA-LDA

Bold values indicate correct expression recognition and others indicate incorrect recognition.

FIGURE 9 Depth image-based FER performances using different approaches.

5 Concluding Remarks

A robust depth video-based FER system has been proposed in this work, using LDP-PCA-LDA features for facial expression feature extraction and HMM for recognition. The proposed method was compared with other traditional approaches, and its recognition performance showed superiority over the others. The proposed system can be deployed in many settings, such as smart home applications.

References

[1] Uddin MZ, Jehad Sarkar AM. A facial expression recognition system from depth video. In: Proc. 2014 international conference on image processing, computer vision, & pattern recognition (IPCV’14), July 21–24, Las Vegas; 2014.

[2] Kim D-S, Jeon I-J, Lee S-Y, Rhee P-K, Chung D-J. Embedded face recognition based on fast genetic algorithm for intelligent digital photography. IEEE Trans Consum Electron. 2006;52(3):726–734.

[3] Padgett C, Cottrell G. Representing face images for emotion classification. In: Advances in neural information processing systems, vol. 9. Cambridge, MA: MIT Press; 1997.

[4] Mitra S, Acharya T. Gesture recognition: a survey. IEEE Trans Syst Man Cybern C Appl Rev. 2007;37(3):311–324.

[5] Donato G, Bartlett MS, Hagar JC, Ekman P, Sejnowski TJ. Classifying facial actions. IEEE Trans Pattern Anal Mach Intell. 1999;21(10):974–989.

[6] Ekman P, Friesen WV. Facial action coding system: a technique for the measurement of facial movement. Palo Alto, CA: Consulting Psychologists Press; 1978.

[7] Meulders M, Boeck PD, Mechelen IV, Gelman A. Probabilistic feature analysis of facial perception of emotions. Appl Stat. 2005;54(4):781–793.

[8] Calder AJ, Burton AM, Miller P, Young AW, Akamatsu S. A principal component analysis of facial expressions. Vision Res. 2001;41:1179–1208.

[9] Dubuisson S, Davoine F, Masson M. A solution for facial expression representation and recognition. Sign Process Image Commun. 2002;17:657–673.

[10] Buciu I, Kotropoulos C, Pitas I. ICA and Gabor representation for facial expression recognition. In: Proc. IEEE; 2003:855–858.

[11] Chen F, Kotani K. Facial expression recognition by supervised independent component analysis using MAP estimation. IEICE Trans Inf Syst. 2008;E91-D(2):341–350.

[12] Hyvarinen A, Karhunen J, Oja E. Independent component analysis. New York: John Wiley & Sons; 2001.

[13] Karklin Y, Lewicki MS. Learning higher-order structures in natural images. Network: Comput Neural Syst. 2003;14:483–499.

[14] Bartlett MS, Donato G, Movellan JR, Hager JC, Ekman P, Sejnowski TJ. Face image analysis for expression measurement and detection of deceit. In: Proc. sixth joint symposium on neural computation; 1999:8–15.

[15] Chao-Fa C, Shin FY. Recognizing facial action units using independent component analysis and support vector machine. Pattern Recognit. 2006;39:1795–1798.

[16] Calder AJ, Young AW, Keane J. Configural information in facial expression perception. J Exp Psychol Hum Percept Perform. 2000;26(2):527–551.

[17] Lyons MJ, Akamatsu S, Kamachi M, Gyoba J. Coding facial expressions with Gabor wavelets. In: Proc. third IEEE international conference on automatic face and gesture recognition; 1998:200–205.

[18] Bartlett MS, Movellan JR, Sejnowski TJ. Face recognition by independent component analysis. IEEE Trans Neural Netw. 2002;13(6):1450–1464.

[19] Liu C. Enhanced independent component analysis and its application to content based face image retrieval. IEEE Trans Syst Man Cybern B Cybern. 2004;34(2):1117–1127.

[20] Phillips PJ, Wechsler H, Huang J, Rauss P. The FERET database and evaluation procedure for face-recognition algorithms. Image Vis Comput. 1998;16:295–306.

[21] Uddin MZ, Lee JJ, Kim T-S. An enhanced independent component-based human facial expression recognition from video. IEEE Trans Consum Electr. 2009;55(4):2216–2224.

[22] Ojala T, Pietikäinen M, Mäenpää T. Multiresolution gray scale and rotation invariant texture analysis with local binary patterns. IEEE Trans Pattern Anal Mach Intell. 2002;24:971–987.

[23] Shan C, Gong S, McOwan P. Robust facial expression recognition using local binary patterns. In: Proc. IEEE international conference on image processing (ICIP); 2005:370–373.

[24] Shan C, Gong S, McOwan P. Facial expression recognition based on local binary patterns: a comprehensive study. Image Vis Comput. 2009;27:803–816.

[25] Jabid T, Kabir MH, Chae O. Local directional pattern (LDP): a robust image descriptor for object recognition. In: Proc. IEEE advanced video and signal based surveillance (AVSS); 2010:482–487.

[26] Zhu Y, De Silva LC, Ko CC. Using moment invariants and HMM in facial expression recognition. Pattern Recognit Lett. 2002;23(1-3):83–91.

[27] Cohen I, Sebe N, Garg A, Chen LS, Huang TS. Facial expression recognition from video sequences: temporal and static modeling. Comput Vis Image Underst. 2003;91:160–187.

[28] Aleksic PS, Katsaggelos AK. Automatic facial expression recognition using facial animation parameters and multistream HMMs. IEEE Trans Inf Forensics Security. 2006;1:3–11.

[29] Iddan GJ, Yahav G. 3D imaging in the studio (and elsewhere…). Proc. SPIE. 2001;4298:48–55.

[30] Rabiner LR. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE. 1989;77:257–286.