DistilBERT
The original English-language BERT comes in two sizes: bert_base, with 12 encoder layers and 12 bidirectional self-attention heads, and bert_large, with 24 encoder layers and 16 bidirectional self-attention heads. While these models led to significant improvements over the previous state of the art, they contain several hundred million parameters, and operating them on the edge or under constrained training or inference budgets proved challenging. Using the technique known as knowledge distillation, researchers proposed a method to pre-train a smaller general-purpose language representation model, DistilBERT, which retains BERT's architecture while being both lighter and faster, and which can then be fine-tuned with good performance on a wide range of tasks, like its larger counterparts.
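As a concrete illustration, the sketch below shows the soft-target component commonly used in knowledge distillation: the student is trained to match the teacher's temperature-softened output distribution. This is a minimal PyTorch sketch, not the paper's exact training code; the published DistilBERT objective additionally combines this term with the masked-language-modelling and cosine-embedding losses.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both output distributions with the temperature, then penalise
    # the KL divergence between the student and the (fixed) teacher.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher,
                    reduction="batchmean") * temperature ** 2
```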
While keeping the general architecture of BERT, DistilBERT is optimised by reducing the number of layers (by a factor of two), since layer count has a larger impact on computation in the Transformer architecture than other factors such as the size of the last (hidden) dimension of the tensors [SDCW19]. In addition, because the teacher and student share the same hidden dimensionality, the student sub-network can be initialised directly from the teacher's weights, giving it a good starting point from which to converge. The result is a general-purpose pre-trained version of BERT that is 40% smaller, 60% faster and retains 97% of the teacher's language understanding capabilities [SDCW19], making it a compelling option for edge applications.
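One way to realise this shared-dimensionality initialisation is sketched below, assuming the Hugging Face transformers API; the model name and the choice of copying every second teacher layer are illustrative assumptions rather than the paper's prescribed recipe. A 6-layer student is built with the same hidden size as the 12-layer teacher, so teacher weights can be copied over directly.

```python
from transformers import BertConfig, BertModel

# Load the 12-layer teacher and build a 6-layer student with the same
# hidden size, so teacher parameters can be copied without reshaping.
teacher = BertModel.from_pretrained("bert-base-uncased")
student_config = BertConfig.from_pretrained("bert-base-uncased",
                                            num_hidden_layers=6)
student = BertModel(student_config)

# Initialise the student from the teacher: embeddings are copied as-is,
# and every second teacher encoder layer seeds one student layer.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )
```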