Face anti-spoofing with cross-stage relation enhancement and spoof material perception

Daiyuan Li; Guo Chen; Xixian Wu; Zitong Yu; Mingkui Tan

doi:10.1016/j.neunet.2024.106275

Neural Netw. 2024 Mar 27:175:106275. doi: 10.1016/j.neunet.2024.106275. Online ahead of print.

Authors

Daiyuan Li¹, Guo Chen², Xixian Wu³, Zitong Yu⁴, Mingkui Tan⁵

Affiliations

¹ South China University of Technology, Guangzhou, 510006, Guangdong, China; Pazhou Laboratory, Guangzhou, 510000, Guangdong, China; Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Guangzhou, 510000, Guangdong, China. Electronic address: selidaiyuan@mail.scut.edu.cn.
² South China University of Technology, Guangzhou, 510006, Guangdong, China; CSSC Systems Engineering Research Institute, Beijing, 100000, Beijing, China. Electronic address: qwead134@gmail.com.
³ HuNan Gmax Intelligent Technology, Changsha, 410000, Hunan, China. Electronic address: wuxixian@gmax-ai.com.
⁴ Great Bay University, Dongguan, 523000, Guangdong, China. Electronic address: zitong.yu@ieee.org.
⁵ South China University of Technology, Guangzhou, 510006, Guangdong, China. Electronic address: mingkuitan@scut.edu.cn.

PMID: 38653078
DOI: 10.1016/j.neunet.2024.106275

Abstract

Face Anti-Spoofing (FAS) seeks to protect face recognition systems from spoofing attacks and is applied extensively in scenarios such as access control, electronic payment, and security surveillance systems. Face anti-spoofing requires the integration of local details and global semantic information. Existing CNN-based methods rely on small-stride or image patch-based feature extraction structures, which struggle to capture spatial and cross-layer feature correlations effectively. Meanwhile, Transformer-based methods have limitations in extracting discriminative detailed features. To address these issues, we introduce a multi-stage CNN-Transformer-based framework, which extracts local features through convolutional layers and long-distance feature relationships via self-attention. Building on this framework, we propose a cross-attention multi-stage feature fusion that uses semantically rich high-stage features to query task-relevant features in low-stage features for further cross-stage feature fusion. To make local features more discriminative of subtle differences, we design pixel-wise material classification supervision and add an auxiliary branch in the intermediate layers of the model. Moreover, to address the limitations of a single acquisition environment and the scarcity of acquisition devices in existing Near-Infrared datasets, we create a large-scale Near-Infrared Face Anti-Spoofing dataset with 380k images of 1040 identities. The proposed method achieves state-of-the-art performance on OULU-NPU and on our proposed Near-Infrared dataset with just 1.3 GFLOPs and 3.2M parameters, which demonstrates the effectiveness of the proposed method.
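
The cross-stage fusion idea described in the abstract, where high-stage features act as queries over low-stage features, can be illustrated with a minimal single-head cross-attention sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the shared feature dimension, and the residual addition are all assumptions made for illustration, and a real model would use learned projections over feature maps.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(high_tokens, low_tokens):
    """Hypothetical cross-stage fusion.

    high_tokens: query vectors from a semantically rich (high) stage.
    low_tokens:  key/value vectors from a detailed (low) stage.
    Assumption: identity projections and equal feature dimension.
    """
    dim = len(high_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for query in high_tokens:
        # Scaled dot-product score of this query against every low-stage token.
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in low_tokens]
        weights = softmax(scores)
        # Weighted sum of low-stage tokens selected by the query.
        attended = [sum(w * tok[d] for w, tok in zip(weights, low_tokens))
                    for d in range(dim)]
        # Residual addition (an assumption) keeps the high-stage semantics.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused


# Toy inputs: 2 high-stage tokens querying 4 low-stage tokens, dimension 2.
high = [[1.0, 0.0], [0.0, 1.0]]
low = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0], [0.0, 0.0]]
fused = cross_attention(high, low)
```

In this toy setup each high-stage token pulls mostly from the low-stage tokens aligned with it, which is the sense in which the queries select task-relevant detail before fusion.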

Keywords: Dataset; Face anti-spoofing; Presentation attack; Transformer.