PET: Parameter-efficient Knowledge Distillation on Transformer

PLoS One. 2023 Jul 6;18(7):e0288060. doi: 10.1371/journal.pone.0288060. eCollection 2023.

Abstract

Given a large Transformer model, how can we obtain a small and computationally efficient model that maintains the performance of the original model? Transformers have shown significant performance improvements on many NLP tasks in recent years. However, their large size, expensive computational cost, and long inference time make it challenging to deploy them on resource-constrained devices. Existing Transformer compression methods mainly focus on reducing the size of the encoder, ignoring the fact that the decoder accounts for the major portion of the inference time. In this paper, we propose PET (Parameter-Efficient knowledge distillation on Transformer), an efficient Transformer compression method that reduces the size of both the encoder and the decoder. In PET, we identify and exploit pairs of parameter groups for efficient weight sharing, and employ a warm-up process on a simplified task to increase the gain from Knowledge Distillation. Extensive experiments on five real-world datasets show that PET outperforms existing methods on machine translation tasks. Specifically, on the IWSLT'14 EN→DE task, PET reduces memory usage by 81.20% and accelerates inference by 45.15% compared to the uncompressed model, with only a minor decrease of 0.27 in BLEU score.
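The abstract names two ingredients: sharing weights between paired parameter groups of the compressed model, and distilling knowledge from the large teacher into the small student (preceded by a warm-up on a simplified task, not shown here). The sketch below is a minimal illustration of these two ideas in generic PyTorch; it is not the authors' implementation, and the particular pairing (one feed-forward block reused by an encoder layer and its decoder counterpart), the temperature, and the mixing weight alpha are illustrative assumptions rather than values from the paper.

# Minimal sketch (not the authors' code): weight sharing between a pair of
# parameter groups, plus a standard soft-label knowledge-distillation loss.
# The pairing choice and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedFFN(nn.Module):
    """A feed-forward block whose parameters can be reused by several layers."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(F.relu(self.fc1(x)))

d_model, d_ff = 256, 1024

# One parameter group shared by an encoder layer and its paired decoder layer:
# both modules reference the same SharedFFN instance, so its weights are stored
# (and updated) only once, reducing the parameter count.
shared_ffn = SharedFFN(d_model, d_ff)
encoder_ffn = shared_ffn
decoder_ffn = shared_ffn

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Soft-label KD: KL divergence between softened teacher and student
    distributions, mixed with cross-entropy on the gold labels."""
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_soft_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# Toy usage: random logits stand in for teacher/student decoder outputs.
vocab_size, batch = 100, 8
teacher_logits = torch.randn(batch, vocab_size)
student_logits = torch.randn(batch, vocab_size, requires_grad=True)
labels = torch.randint(0, vocab_size, (batch,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print(f"distillation loss: {loss.item():.4f}")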

MeSH terms

  • Data Compression*
  • Distillation
  • Electric Power Supplies
  • Knowledge

Grants and funding

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) [No. 2020-0-00894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No. 2021-0-02068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]. U Kang is the corresponding author. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.