Research

Facial landmark detection

Graphical flowchart of TPAMI 2013
Detections with TPAMI 2013

These works propose algorithms for detecting a set of characteristic facial landmarks using regression techniques (Support Vector Regression in our case). The proposed algorithms are discriminative methods that lie within the family of part-based methods. This type of algorithm typically uses a per-point model based on the local appearance of the landmark. These models are then used to construct a response map, over which a face shape fitting algorithm is run to obtain the final detection.

Using regressors as the appearance models, in place of the classically used classifiers, means that the construction of the response maps can be avoided. More importantly, however, they yield a large performance gain. The application of regressors to facial landmarking has become a standard approach in recent years, and underpins the best algorithms to date.
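As a rough sketch of the regression-plus-accumulation idea (a 1D toy of our own, with invented names; the actual system uses Support Vector Regression on local appearance features), each sampled patch predicts an offset towards the landmark, and the predictions are accumulated into a vote map whose peak gives the detection:

```python
import numpy as np

def accumulate_votes(sample_points, predicted_offsets, grid_size):
    """Accumulate landmark-location votes on a discrete 1D grid.

    Each sampled point casts a vote at (point + predicted offset); the
    accumulated map plays the role of the response map, and its peak is
    taken as the landmark estimate.
    """
    votes = np.zeros(grid_size)
    for p, d in zip(sample_points, predicted_offsets):
        target = int(round(p + d))
        if 0 <= target < grid_size:
            votes[target] += 1
    return votes

# Toy data: the true landmark sits at x = 30 on a 60-pixel strip.  An
# ideal regressor would predict the offset (30 - p) for a patch sampled
# at p; Gaussian noise mimics regressor error.
rng = np.random.default_rng(0)
true_x = 30
samples = rng.integers(0, 60, size=50)
offsets = (true_x - samples) + rng.normal(0.0, 1.5, size=50)

votes = accumulate_votes(samples, offsets, grid_size=60)
estimate = int(np.argmax(votes))
print(estimate)  # close to 30
```

Accumulating many weak predictions rather than trusting a single one is what gives the approach its robustness to locally ambiguous appearance.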

  1. L21-based regression and prediction accumulation across views for robust facial landmark detection. pdf bibtex link code B. Martinez and M. Valstar. In Image and Vision Computing, vol. 47, 2016.
    @article{martinez_300W,
       author = {B. Martinez and M. Valstar},
       title = {L21-based regression and prediction accumulation across views for robust facial landmark detection},
       journal = {Image and Vision Computing},
       volume = {47},
       pages = {36--44},
       year = {2016},
       url = {http://dx.doi.org/10.1016/j.imavis.2015.09.003}
    }
    
  2. Local Evidence Aggregation for Regression-Based Facial Point Detection. pdf bibtex B. Martinez, M. Valstar, X. Binefa and M. Pantic. In IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, number 5, 2013.
    @article{Martinez13,
      author    = {B. Martinez and M. Valstar and X. Binefa and M. Pantic},
      title     = {Local Evidence Aggregation for Regression-Based Facial Point Detection},
      journal   = {IEEE Trans. Pattern Analysis and Machine Intelligence},
      volume    = {35},
      number    = {5},
      pages     = {1149--1163},
      year      = {2013}
    }
  3. Facial Point Detection using Boosted Regression and Graph Models. pdf bibtex M. F. Valstar, B. Martinez, X. Binefa and M. Pantic. In Proceedings of IEEE Int'l Conf. Computer Vision and Pattern Recognition, 2010.
    @inproceedings{Valstar2010fpdub,
        author = {M. F. Valstar and B. Martinez and X. Binefa and M. Pantic},
        title = {Facial Point Detection using Boosted Regression and Graph Models},
        booktitle = {Proceedings of IEEE Int'l Conf. Computer Vision and Pattern Recognition},
        pages = {2729--2736},
        year = {2010}
    }

Face detection

Model templates (Profile left and near-frontal left)
Parts detected
Final detection
Performance on AFLW (True vs. False positive rates)

In this work we trained a Deformable Parts Model (DPM) for the problem of multi-view face detection. We use a star architecture to model shape variations, where the root filter corresponds to the whole face. This filter is computed at a coarse level of a HOG pyramid, while the different parts are computed at a finer level of the pyramid. Because Latent SVM is used to train the models, the parts composing the face do not have to be explicitly defined, and hence do not need to be manually annotated. Furthermore, a mixture model with 4 components is used to allow robust face detection under a wide variation of head poses.
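The star-model score can be sketched as follows (a 1D toy with invented names, not the released code): the root contributes its coarse-level response, and each part is placed wherever its fine-level response minus a quadratic deformation cost, measured from its anchor relative to the root, is maximal.

```python
import numpy as np

def star_model_score(root_response, part_responses, anchors, defcost):
    """Score a star-structured DPM at one root location (1D toy).

    root_response: scalar response of the root (whole-face) filter.
    part_responses: one 1D response array per part, indexed by candidate
        part location (real DPMs use 2D HOG grids).
    anchors: ideal location of each part relative to the root.
    defcost: weight of the quadratic penalty on deviating from the anchor.
    """
    score = root_response
    for resp, anchor in zip(part_responses, anchors):
        locs = np.arange(len(resp))
        # Each part is placed where (appearance - deformation) is maximal.
        score += np.max(resp - defcost * (locs - anchor) ** 2)
    return score

# Toy usage: a root score of 5.0 and two parts; the first part's best
# response coincides with its anchor, the second's sits one cell away.
parts = [np.array([0.0, 1.0, 3.0, 1.0]), np.array([2.0, 0.0, 0.0, 1.0])]
score = star_model_score(5.0, parts, anchors=[2, 0], defcost=1.0)
print(score)  # 10.0
```

The max over part placements is what lets the parts deform away from their anchors when the image evidence justifies paying the deformation cost.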

The proposed methodology for face detection was compared against a multi-view training of the Viola and Jones algorithm, and against Zhu and Ramanan, yielding the best performance. Beyond the performance improvement, the algorithm presents further advantages. In particular, the amount of manual annotation required is drastically reduced with respect to Zhu and Ramanan, as only the bounding box needs to be annotated. Furthermore, good image resolution is less important than for Zhu and Ramanan, as the inner facial structures are not as specific as facial points. Better detection performance is attained while using only 4 mixture components and 5 parts, which significantly reduces the computational requirements. Finally, a cascaded strategy, in a fashion similar to the Viola and Jones algorithm, is used. This results in a dramatic decrease in the computational requirements without compromising detection performance (the cascade strategy is only available with the Linux version of the code).

  1. Empirical Analysis of Cascade Deformable Models for Multi-view Face Detection. pdf bibtex link code J. Orozco, B. Martinez and M. Pantic. In Image and Vision Computing, vol. 42, 2015.
    @article{orzoco15,
       author = {J. Orozco and B. Martinez and M. Pantic},
       title = {Empirical Analysis of Cascade Deformable Models for Multi-view Face Detection},
       journal = {Image and Vision Computing},
       volume = {42},
       pages = {47--61},
       year = {2015},
       url = {http://www.sciencedirect.com/science/article/pii/S0262885615000967}
    }

Facial Action Unit detection

Describing an expression through AUs
Three Orthogonal Planes (TOP) strategy
Flowchart of TSMC-B 2014

Facial expression recognition is often posed as the problem of recognising the 6 prototypical facial expressions. However, the wide variety of expressions in everyday life often do not fall within one of these 6 categories. Facial action unit (FAU) detection aims instead at describing the signals used to transmit a message (e.g. an emotion), while the interpretation of those signs in terms of the message conveyed is analysed later.

AU detection methods typically rely on a frame-based feature descriptor, followed by a learning layer (e.g. an SVM). Our work proposes the use of spatio-temporal texture descriptors for the task of AU detection. Common descriptors like LPQ and LBP were extended to spatio-temporal versions, called LPQ-TOP and LBP-TOP. TOP stands for Three Orthogonal Planes, reflecting that the descriptor results from computing the 2D features on the XY, YZ and XZ planes. These three descriptors are then concatenated into a single histogram.
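A minimal sketch of the TOP construction (our own toy implementation, not the authors' code): a basic 8-neighbour LBP histogram is computed on one central slice per plane and the three histograms are concatenated, whereas the full descriptor averages histograms over all slices of each plane.

```python
import numpy as np

def lbp_hist(img, bins=256):
    """8-neighbour LBP codes for the interior pixels of a 2D slice,
    returned as a normalised 256-bin histogram."""
    c = img[1:-1, 1:-1]
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(shifts):
        # The neighbour at offset (dy, dx) contributes one bit of the code.
        n = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        code += (n >= c).astype(np.int32) << bit
    hist, _ = np.histogram(code, bins=bins, range=(0, bins))
    return hist / hist.sum()

def lbp_top(volume):
    """LBP-TOP: concatenate LBP histograms from the XY, XZ and YZ planes.

    volume has axes (y, x, t).  The XY slice captures spatial appearance,
    while the XZ and YZ slices capture horizontal and vertical texture
    evolution over time.
    """
    h, w, t = volume.shape
    xy = volume[:, :, t // 2]
    xz = volume[h // 2, :, :]
    yz = volume[:, w // 2, :]
    return np.concatenate([lbp_hist(p) for p in (xy, xz, yz)])

rng = np.random.default_rng(0)
descriptor = lbp_top(rng.random((12, 12, 10)))
print(descriptor.shape)  # (768,) -- three concatenated 256-bin histograms
```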

Furthermore, we analyse the applicability of these descriptors to the detection of the temporal segments of an AU (onset, apex and offset). These are by nature spatio-temporal events, and machine learning algorithms cannot effectively distinguish them using traditional frame-based representations. The performance gain when using the spatio-temporal feature descriptors is consistent and significant across several datasets.

  1. Learning to transfer: transferring latent task structures and its application to person-specific facial action unit detection. pdf bibtex T. Almaev, B. Martinez and M.F. Valstar. In IEEE International Conference on Computer Vision, 2015.
    @inproceedings{almaev15,
      author    = {T. Almaev and B. Martinez and M.F. Valstar},
      title     = {Learning to transfer: transferring latent task structures and its application to person-specific facial action unit detection},
      booktitle = {IEEE International Conference on Computer Vision},
      year      = {2015}
    }
  2. Advances, Challenges, and Opportunities in Automatic Facial Expression Recognition. pdf bibtex link B. Martinez, M.F. Valstar. In Advances in Face Detection and Facial Image Analysis, 2016.
    @incollection{martinez_book15,
      author      = {B. Martinez and M.F. Valstar},
      title       = {Advances, Challenges, and Opportunities in Automatic Facial Expression Recognition},
      editor      = {M. Kawulok and E. Celebi and B. Smolka},
      booktitle   = {Advances in Face Detection and Facial Image Analysis},
      publisher   = {Springer},
      pages       = {63--100},
      year        = {2016},
      url         = {http://www.springer.com/us/book/9783319259567}
    }
  3. Learning to combine local models for Facial Action Unit detection. bibtex S. Jaiswal, B. Martinez and M.F. Valstar. In FERA workshop, in conj. with FG 2015, 2015.
    @inproceedings{jaiswal15,
       author = {S. Jaiswal and B. Martinez and M.F. Valstar},
       title = {Learning to combine local models for Facial Action Unit detection},
       booktitle = {FERA workshop, in conj. with FG 2015},
       year = {2015}
    }
  4. A Dynamic Appearance Descriptor Approach to Facial Actions Temporal Modelling. pdf bibtex link B. Jiang, M. F. Valstar, B. Martinez and M. Pantic. In IEEE Transactions on Cybernetics, vol. 44, number 2, 2014.
    @article{JiangEtAl13,
        author = {B. Jiang and M. F. Valstar and B. Martinez and M. Pantic},
        title = {A Dynamic Appearance Descriptor Approach to Facial Actions Temporal Modelling},
        journal = {IEEE Transactions on Cybernetics},
        volume = {44},
        number = {2},
        pages = {161--174},
        year = {2014},
        url = {http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6504484}
    }
  5. Decision Level Fusion of Domain Specific Regions for Facial Action Recognition. pdf bibtex B. Jiang, B. Martinez, M. Valstar and M. Pantic. In IEEE International Conference on Pattern Recognition, 2014.
    @inproceedings{JiangICPR2014,
        author = {B. Jiang and B. Martinez and M. Valstar and M. Pantic},
        title = {Decision Level Fusion of Domain Specific Regions for Facial Action Recognition},
        booktitle = {IEEE International Conference on Pattern Recognition},
        year = {2014}
    }
  6. Parametric temporal alignment for the detection of facial action temporal segments. pdf bibtex B. Jiang and B. Martinez and M. Pantic. In British Machine Vision Conference, 2014.
    @inproceedings{JiangBMVC2014,
        author = {B. Jiang and B. Martinez and M. Pantic},
        booktitle = {British Machine Vision Conference},
        title = {Parametric temporal alignment for the detection of facial action temporal segments},
        year = {2014},
    }

Model-free Object Tracking

Part-based regression-based tracking

Model-free tracking consists of tracking an object based solely on an initialisation given in the first frame. The object to be tracked is of unknown class, so models cannot be pre-trained, and it has unknown properties (e.g. unknown motion pattern, structure, etc.). In particular, we are interested in performing part-based tracking of these objects.

To do so, we have adapted cascaded regression methodologies such as the Supervised Descent Method. This is motivated by their excellent performance on facial landmark detection and tracking, which are part-based alignment problems for a specific class. Since in our case the class is not known in advance, we learn the regression models on the fly. Furthermore, we compute a cohort of predictions and combine them in a robust manner to attain improved robustness to unseen conditions.
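The cascaded update can be sketched as follows (a 1D toy with invented names: the Supervised Descent Method uses SIFT features around each landmark, replaced here by a finite-difference gradient of a synthetic image). Each stage learns a linear regressor from features at the current estimates to the remaining residuals, and the stages are applied in sequence.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy problem: each "image" is a 1D quadratic peaking at the true landmark,
# and the feature is a finite-difference gradient sampled at the current
# estimate (standing in for the real appearance features).
targets = rng.uniform(20.0, 40.0, size=200)

def features(target, x):
    image = lambda u: -(u - target) ** 2
    return np.array([image(x + 1.0) - image(x - 1.0), 1.0])

def train_cascade(targets, x_init, n_stages=3, ridge=1e-6):
    """Learn one linear regressor per stage, mapping features at the
    current estimates to the remaining residuals (SDM-style)."""
    x = x_init.copy()
    cascade = []
    for _ in range(n_stages):
        phi = np.stack([features(t, v) for t, v in zip(targets, x)])
        A = phi.T @ phi + ridge * np.eye(phi.shape[1])
        R = np.linalg.solve(A, phi.T @ (targets - x))  # least squares
        cascade.append(R)
        x = x + phi @ R  # cascaded update: x_{k+1} = x_k + phi(x_k) R_k
    return cascade, x

x_init = targets + rng.normal(0.0, 5.0, size=200)  # perturbed initialisation
cascade, x_final = train_cascade(targets, x_init)
err = float(np.mean(np.abs(x_final - targets)))
print(err)  # near zero
```

In the model-free setting, the training pairs for each stage are generated on the fly from the tracked object itself rather than from an offline annotated set.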

  1. TRIC-track: Tracking by Regression with Incrementally Learned Cascades. pdf bibtex X. Wang, M.F. Valstar, B. Martinez, H.M. Khan and T.P. Pridmore. In IEEE International Conference on Computer Vision, 2015.
    @inproceedings{wang15,
      author    = {X. Wang and M.F. Valstar and B. Martinez and H.M. Khan and T.P. Pridmore},
      title     = {TRIC-track: Tracking by Regression with Incrementally Learned Cascades},
      booktitle = {IEEE International Conference on Computer Vision},
      year      = {2015}
    }

Facial Component Detection in Thermal Imagery

One of the main challenges in the detection of inner facial structures in thermal imagery is the large amount of clutter present. That is to say, if we aim at detecting facial components like the eyes or the nostrils, the number of structures within the face that produce false positives is much larger than in visible imagery. We explored the use of a standard part-based, sliding-window, classification-based detection strategy, but altered the procedure to hard-mine the negatives used to train a classifier per component.
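The hard-mining loop can be sketched as follows (a 2D toy with invented names, using a least-squares linear classifier as a stand-in for the per-component classifier): after each training round, the negatives that the current classifier scores highest, i.e. the ones it mistakes for the component, are added to the negative set.

```python
import numpy as np

rng = np.random.default_rng(0)

def aug(X):
    """Append a bias column."""
    return np.hstack([X, np.ones((len(X), 1))])

def fit_linear(X, y, ridge=1e-3):
    # Ridge-regularised least squares on +/-1 labels: a stand-in for the
    # per-component sliding-window classifier.
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def mine_hard_negatives(pos, neg_pool, rounds=3, per_round=25):
    """Grow the negative training set with the pool's highest-scoring
    (hardest) negatives, retraining after each round."""
    neg = neg_pool[rng.choice(len(neg_pool), per_round, replace=False)]
    w = None
    for _ in range(rounds):
        X = np.vstack([aug(pos), aug(neg)])
        y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
        w = fit_linear(X, y)
        scores = aug(neg_pool) @ w
        neg = np.vstack([neg, neg_pool[np.argsort(scores)[-per_round:]]])
    return w

# Toy 2D "patches": the component clusters around (2, 2); background
# clutter lies both far away and confusingly close to the positives.
pos = rng.normal((2.0, 2.0), 0.5, size=(100, 2))
easy = rng.normal((-2.0, -2.0), 0.5, size=(200, 2))
hard = rng.normal((2.0, -2.0), 0.5, size=(100, 2))
neg_pool = np.vstack([easy, hard])

w = mine_hard_negatives(pos, neg_pool)
pos_acc = float(((aug(pos) @ w) > 0).mean())
neg_acc = float(((aug(neg_pool) @ w) < 0).mean())
print(pos_acc, neg_acc)
```

Focusing the negative set on confusable clutter rather than on random background is what makes the per-component classifiers usable despite the high false-positive potential of thermal faces.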

We have also compiled and published the Mahnob Laughter Database, which contains audiovisual multimodal recordings of induced naturalistic laughter and smiles, together with speech recordings. Thermal recordings are included in the database, and the experimental results on facial component detection are obtained over these recordings.

  1. Facial Component Detection in Thermal Imagery. pdf bibtex B. Martinez, X. Binefa and M. Pantic. In Proceedings of IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR-W), vol. 3, 2010.
    @inproceedings{Martinez2010fcdit,
        author = {B. Martinez and X. Binefa and M. Pantic},
        title = {Facial Component Detection in Thermal Imagery},
        booktitle = {Proceedings of IEEE Int'l Conf. Computer Vision and Pattern Recognition (CVPR-W)},
        volume = {3},
        pages = {48--54},
        year = {2010}
    }
  2. The MAHNOB Laughter Database. pdf bibtex S. Petridis, B. Martinez and M. Pantic. In Image and Vision Computing Journal, vol. 31, number 2, 2013.
    @article{mahnob_laughter_db,
        author = {S. Petridis and B. Martinez and M. Pantic},
        title = {The MAHNOB Laughter Database},
        journal = {Image and Vision Computing Journal},
        volume = {31},
        number = {2},
        pages = {186--202},
        year = {2013}
    }