ModernBERT: The Future of Encoder-Only Models
The release of BERT (Bidirectional Encoder Representations from Transformers) in 2018 marked a revolutionary moment in the field of Natural Language Processing (NLP). As one of the first Transformer models to learn deep bidirectional representations of text, BERT set the stage for massive improvements across a variety of NLP tasks. Fast forward six years, and ModernBERT, the latest evolution in the family of encoder-only models, has arrived to provide an even more powerful, efficient, and scalable solution for numerous NLP applications.
In this detailed guide, we will explore the improvements ModernBERT brings to the table, compare it with its predecessors, and illustrate how it’s designed to address the limitations of older models while adding new features and capabilities. We will also dive deep into the data and training strategies that helped develop ModernBERT, and showcase how this model sets new standards for NLP tasks like retrieval-augmented generation (RAG), classification, code retrieval, and more.
The Evolution of BERT to ModernBERT
BERT became a foundational model in NLP, giving machines a deep, contextual understanding of language. It could handle tasks like sentiment analysis, named entity recognition (NER), question answering, and text classification with remarkable accuracy. However, despite these impressive results, BERT had notable limitations, such as:
- Short Context Length: BERT could only handle 512 tokens, limiting its ability to process long documents or large text data.
- Speed and Efficiency: The model required significant computational resources, especially in real-time applications.
- Lack of Code Awareness: BERT was trained primarily on natural language data, limiting its performance on tasks that involved programming code.
In response to these limitations, ModernBERT has been developed as a Pareto improvement over BERT, addressing both performance and efficiency in several key ways:
- Longer Context Length: ModernBERT can process 8,192 tokens, 16 times BERT's 512-token limit. This allows it to handle longer documents and gives the model a richer understanding of context (see the short snippet after this list).
- Faster and More Efficient: ModernBERT offers up to 4x faster processing compared to previous models, especially when dealing with variable-length inputs. This improvement enables more efficient use of resources, making ModernBERT suitable for deployment in real-time applications and smaller computational environments.
- Code-Aware Training: Unlike earlier encoders such as BERT, which were trained mainly on natural language data, ModernBERT was trained on a large corpus that includes code, allowing it to excel at code retrieval tasks and programming-assistant applications.
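To make the context-length claim concrete, here is a minimal sketch that loads the publicly released base checkpoint and checks its configured context window. It assumes the answerdotai/ModernBERT-base model on the Hugging Face Hub and a recent Transformers release that includes ModernBERT support:

```python
# A quick sketch, assuming the "answerdotai/ModernBERT-base" checkpoint on the
# Hugging Face Hub and a transformers version with ModernBERT support.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# The context window is 8,192 tokens, versus 512 for the original BERT.
print(config.max_position_embeddings)  # expected: 8192

# Long inputs no longer need aggressive truncation or chunking.
long_document = "word " * 5000
tokens = tokenizer(long_document, truncation=True, max_length=8192)
print(len(tokens["input_ids"]))
```

Because the window is 8,192 tokens, documents that would have required chunking with the original BERT can often be encoded in a single pass.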
Architecture and Design Innovations in ModernBERT
To achieve these advancements, ModernBERT incorporates several architectural changes and optimizations. Some of the core innovations include:
1. Rotary Positional Embeddings (RoPE)
One of the critical innovations in ModernBERT's design is the replacement of traditional absolute positional embeddings with rotary positional embeddings (RoPE). Instead of assigning each token a fixed position vector, RoPE encodes relative positions directly in the attention computation, which scales far more gracefully to long sequences and is especially crucial when extending the context window to 8,192 tokens.
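The following is a minimal, illustrative sketch of rotary embeddings applied to a matrix of query (or key) vectors. It is meant to convey the idea of rotating feature pairs by position-dependent angles, not to reproduce ModernBERT's actual implementation:

```python
# Minimal rotary positional embedding sketch (illustrative only).
import torch

def rotary_embed(x, base=10000.0):
    # x: (seq_len, dim) query or key vectors; dim must be even.
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in the RoPE formulation.
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) feature pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8192, 64)
print(rotary_embed(q).shape)  # torch.Size([8192, 64])
```

Because only the relative angle between two positions affects the query-key dot product, attention scores depend on relative rather than absolute positions, which behaves better as sequences grow.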
2. GeGLU Layers
ModernBERT swaps out the traditional MLP (multi-layer perceptron) feed-forward layers for GeGLU layers, a GELU-based variant of the Gated Linear Unit. The gating mechanism lets the network control how much of each feature flows through the layer, which improves the model's ability to capture complex relationships between words and results in better downstream performance.
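A minimal GeGLU feed-forward block in PyTorch looks roughly like the sketch below; the layer sizes are illustrative, and the module is a simplification rather than ModernBERT's exact implementation:

```python
# Illustrative GeGLU feed-forward block (a sketch, not ModernBERT's exact layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        # One projection produces both the "value" path and the "gate" path.
        self.wi = nn.Linear(dim, 2 * hidden_dim, bias=False)
        self.wo = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        value, gate = self.wi(x).chunk(2, dim=-1)
        # The GELU-activated gate multiplies the value path elementwise,
        # replacing the plain "Linear -> GELU -> Linear" MLP.
        return self.wo(value * F.gelu(gate))

x = torch.randn(2, 16, 768)
print(GeGLU(768, 1152)(x).shape)  # torch.Size([2, 16, 768])
```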
3. Efficient Use of Hardware Resources
Unlike many large language models that require specialized hardware and significant memory, ModernBERT has been designed to work efficiently on consumer-grade GPUs. The model is optimized to run smoothly on widely available hardware, making it more accessible for various real-world applications.
4. Flash Attention and Unpadding
ModernBERT uses Flash Attention and unpadding techniques to speed up training and inference. Flash Attention is a memory-efficient attention implementation that reduces the computational overhead of processing long sequences. Unpadding, on the other hand, removes the padding tokens that are normally added to batch variable-length sequences, so no memory or compute is wasted on them.
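The unpadding idea can be illustrated in a few lines: instead of keeping a rectangular batch full of padding tokens, the real tokens are flattened into one tensor and the sequence boundaries are tracked separately. The values below are toy data, and in the real model this flattened layout is handed to variable-length Flash Attention kernels:

```python
# Sketch of the "unpadding" idea with toy token IDs (illustrative only).
import torch

input_ids = torch.tensor([
    [101, 2023, 2003, 102, 0, 0],   # 4 real tokens, 2 padding
    [101, 2460, 102, 0, 0, 0],      # 3 real tokens, 3 padding
])
attention_mask = (input_ids != 0)

# Flatten the batch and drop padded positions entirely.
flat_ids = input_ids[attention_mask]
# cu_seqlens marks where each sequence starts and ends in the flattened tensor,
# which is what variable-length attention kernels consume instead of a mask.
seq_lens = attention_mask.sum(dim=1)
cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long), seq_lens.cumsum(0)])

print(flat_ids.shape)  # torch.Size([7]) instead of (2, 6) = 12 positions
print(cu_seqlens)      # tensor([0, 4, 7])
```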
Training Data and Process
One of the most critical factors behind ModernBERT’s success is the training data and the training process. ModernBERT was trained on an expansive dataset that goes beyond the typical Wikipedia and BooksCorpus data used for BERT.
Diverse Data Sources
ModernBERT was trained on over 2 trillion tokens from a diverse range of data sources, including:
- Web documents: Articles, blogs, and other content from the web.
- Code: Large codebases from GitHub and other open-source platforms.
- Scientific papers: Research papers that span a wide array of domains.
This diverse dataset enables ModernBERT to perform well on tasks that require specialized knowledge, such as code retrieval, programming assistance, and technical document understanding.
Training Process
ModernBERT underwent a three-phase training process (summarized in the short sketch after this list):
- Phase 1: Initial training on 1.7 trillion tokens with a sequence length of 1,024.
- Phase 2: Long-context adaptation, training on 250 billion tokens with a sequence length of 8,192.
- Phase 3: Annealing with 50 billion tokens, focusing on fine-tuning the model to improve its performance on long-context tasks.
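The schedule can be captured in a small, purely illustrative sketch. The dictionary keys are hypothetical, and the 8,192-token sequence length for the annealing phase is an assumption based on its long-context focus:

```python
# Hypothetical summary of the three training phases described above
# (the keys and loop are illustrative, not from the actual training code).
training_phases = [
    {"name": "pretraining",       "tokens": 1.7e12, "seq_len": 1024},
    {"name": "context extension", "tokens": 250e9,  "seq_len": 8192},
    {"name": "annealing",         "tokens": 50e9,   "seq_len": 8192},  # seq_len assumed
]

for p in training_phases:
    print(f'{p["name"]}: {p["tokens"]:.2e} tokens at sequence length {p["seq_len"]}')

total = sum(p["tokens"] for p in training_phases)
print(f"total: {total:.2e} tokens")  # roughly the 2 trillion tokens cited above
```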
These phases allowed the model to improve its general language understanding while ensuring it could handle long documents effectively.
Comparing ModernBERT to Other Models
ModernBERT’s performance stands out when compared to other popular models in the NLP space. The following benchmarks highlight ModernBERT’s impressive capabilities:
- GLUE Benchmark: ModernBERT outperforms all other encoder-only models in this comprehensive evaluation, achieving a score of 88.5 — well above that of BERT (84.7) and RoBERTa (86.4). Its ability to process long contexts and handle complex NLP tasks allows it to shine in real-world applications.
- SQuAD: ModernBERT’s performance on the Stanford Question Answering Dataset (SQuAD) is also notable, with scores that surpass those of earlier models like BERT and DeBERTaV3.
- Code Retrieval: ModernBERT is the first encoder-based model to achieve impressive results on code-related tasks. On the StackOverflow-QA dataset, it scored above 80, a milestone no other encoder-only model had reached.
- Speed and Efficiency: ModernBERT is twice as fast as DeBERTa and up to 3x faster than other high-quality models in long-context retrieval tasks. Its inference speed is particularly beneficial in real-time applications like RAG pipelines and content moderation.
A Practical Example: Code Search with ModernBERT
Let’s explore how ModernBERT can be applied in a code search scenario. Suppose a software development company needs to search through a large repository of code to find the most relevant code snippets for implementing a new feature. ModernBERT can help index the entire codebase and retrieve the most relevant code snippets efficiently.
Here’s an example of how to use ModernBERT for this task:
from transformers import pipeline

# Load ModernBERT as a feature-extraction (embedding) pipeline; the public
# checkpoint on the Hugging Face Hub is "answerdotai/ModernBERT-base".
modern_bert = pipeline("feature-extraction", model="answerdotai/ModernBERT-base")
query = "How to implement a binary search in Python?"
embeddings = modern_bert(query)  # one embedding vector per input token
In this scenario, ModernBERT generates embeddings for the query; comparing them against a database of pre-indexed code snippet embeddings (for example, by cosine similarity) then returns the most relevant results, as sketched below.
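Continuing the example above, a minimal and purely illustrative way to finish the retrieval step is to embed each candidate snippet with the same pipeline, mean-pool the token embeddings, and rank snippets by cosine similarity. In practice you would index the snippet embeddings ahead of time and would likely use a ModernBERT-based embedding model fine-tuned for retrieval rather than the raw base checkpoint:

```python
# Illustrative ranking of code snippets by cosine similarity to the query
# (reuses modern_bert and query from the example above; snippets are toy data).
import numpy as np

def embed(text):
    # Mean-pool the per-token embeddings returned by the pipeline.
    token_vectors = np.array(modern_bert(text)[0])
    return token_vectors.mean(axis=0)

snippets = [
    "def binary_search(arr, target): ...",
    "def bubble_sort(arr): ...",
]

query_vec = embed(query)
scores = []
for snippet in snippets:
    vec = embed(snippet)
    cosine = float(np.dot(query_vec, vec) /
                   (np.linalg.norm(query_vec) * np.linalg.norm(vec)))
    scores.append((cosine, snippet))

# The highest-scoring snippet is the best match for the query.
print(max(scores)[1])
```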
Conclusion: ModernBERT — The Next Step in NLP Evolution
ModernBERT represents the culmination of years of research in NLP and large language models. It successfully combines innovations in architecture, training data, and efficiency to provide a model that is not only faster and more accurate than its predecessors but also more versatile.
With its ability to handle long-context tasks, code-related applications, and real-time deployments, ModernBERT is poised to become the new standard for encoder-only models in a wide range of NLP applications.
For those interested in implementing ModernBERT in their projects, Hugging Face offers a straightforward integration with its Transformers library, allowing you to quickly get started with this state-of-the-art model.
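As a minimal quickstart (again assuming the answerdotai/ModernBERT-base checkpoint and a Transformers release that includes ModernBERT support), the standard fill-mask pipeline works out of the box:

```python
# Minimal quickstart sketch using the masked-language-modeling head.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(prediction["token_str"], round(prediction["score"], 3))
```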
Explore more about ModernBERT on Hugging Face and take advantage of its powerful features to enhance your NLP applications!