Better Intent Classification And Entity Extraction with DIETClassifier Pipeline Optimization

Rasa's DIETClassifier provides state-of-the-art performance for intent classification and entity extraction. In this post you will learn how this algorithm works and how to adapt the pipeline to the specifics of your project to get the best performance out of it. We'll dive deep into the most important steps and show you how to optimize the training for your very specific chatbot.

What is an NLU pipeline?

When you train your model, or when your model is in production and tries to identify the intent and entities associated with a user utterance, the data percolates through a pipeline performing a sequence of operations.

There is technically no limit to the types of operations, but three steps are common to most pipelines: tokenization, featurization, and training or inference.

We'll briefly explain what they do below and go into more detail in the next section.


Tokenization

Tokenization consists of extracting the list of words (tokens) from an utterance. For example, the sentence Set up a pipeline for the Rasa DIETClassifier would be tokenized as:

"Set", "up", "a", "pipeline", "for", "the",  "Rasa", "DIETClassifier"

Many features are derived from words rather than sentences. The tokens resulting from this step will be used downstream in the pipeline for feature extraction.
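Conceptually, a whitespace tokenizer does little more than split the utterance on spaces. Here is a minimal sketch of the idea (not the actual WhitespaceTokenizer implementation):

```python
def whitespace_tokenize(utterance: str) -> list[str]:
    # Split the utterance on runs of whitespace to get the tokens.
    return utterance.split()

tokens = whitespace_tokenize("Set up a pipeline for the Rasa DIETClassifier")
# → ["Set", "up", "a", "pipeline", "for", "the", "Rasa", "DIETClassifier"]
```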


Featurization

Machine learning is about statistics, and statistics work with numbers, not words. Featurization is the process of transforming words into meaningful numbers (or vectors) that can be fed to the training algorithm.

Training / Inference

At training time, the algorithm learns from the features derived from the raw text data. At inference time (when the trained model is used to make predictions), the raw data follows the same path. Utterances are tokenized, features are extracted and used to predict the intent and entities.

Tokenization and featurization

As we have seen, an NLU pipeline is a sequence of steps the raw data goes through. It must include tokenizers, featurizers, and training algorithms. Here is a very simple example:

  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier

The WhitespaceTokenizer separates words on white space. Note that it doesn't do well with other characters such as apostrophes. If one of your project's languages uses them (e.g. French), consider another tokenizer such as the SpacyTokenizer, which in turn requires SpacyNLP:

  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier

Using whole words

By default, the CountVectorsFeaturizer adds only one feature for each word in your training data.

For this example I trained a model using the first pipeline above on the following data:

[Image: training data]

When trying to infer the intent of an utterance containing only restaurant, a word contained in the single example for the book_restaurant intent, the intent is recognized with high confidence. It works because the exact same feature can be extracted from the word restaurant as at training time.

[Image: correct intent recognized with the word analyzer]

However, the model will not be very resilient to typos. When omitting a letter and typing restauran, it cannot recognize the correct intent.

[Image: intent not recognized on a typo with the word analyzer]

This happens because the only feature provided at training time for the word restaurant was the whole word. The algorithm has no way to figure out that restauran is close to restaurant.

Let's see how we can improve that.

Using n-grams

N-grams are sequences of consecutive letters in a word. For example, this is the list of 4-grams contained in restaurant:

rest, esta, stau, taur, aura, uran, rant

The list of 3-grams:

res, est, sta, tau, aur, ura, ran, ant

And the list of bi-grams:

re, es, st, ta, au, ur, ra, an, nt
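The extraction itself is a simple sliding window over the word. A minimal sketch (note that the char_wb analyzer used below also pads words with spaces at the boundaries, which this sketch omits):

```python
def char_ngrams(word: str, n: int) -> list[str]:
    # Slide a window of size n across the word, one letter at a time.
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("restaurant", 4))
# → ['rest', 'esta', 'stau', 'taur', 'aura', 'uran', 'rant']
```

With n ranging from 2 to 4, the word restaurant alone produces 9 + 8 + 7 = 24 n-gram features instead of a single word feature.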

Instead of getting one unique feature to map a single word, we can use the many parts this word is made of. The following pipeline will use all n-grams above as features.

  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
    analyzer: word
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 4
  - name: DIETClassifier

The result is a much richer comprehension of words and an increased resilience to typing errors. In the video below, the model is able to fall back on the right intent because, even with typos, the features extracted at inference time carry a lot of similarity with the features extracted at training time.

When using n-grams, you need to consider training time. Multiplying the number of features (the word restaurant alone yields 24 n-gram features instead of one) will have an impact on your training. Choose a combination of n-grams that brings just enough resilience to typing mistakes.

Using pre-trained language models

Being resilient to mistakes is great, but typing anything other than flight or restaurant will bring random results. It would be great if we could bring in some general language knowledge out of the box. General language knowledge means knowing, for example, that a pizzeria is a restaurant.

With our current pipeline, our bot doesn't know that.

[Image: pizzeria not recognized without a language model]

What is a pre-trained model?

A pre-trained model contains embeddings trained on a typically large corpus such as Wikipedia or CommonCrawl. Those embeddings are a numeric representation of word meanings in terms of similarities. The model doesn't know what an apple is, but it knows that an apple is similar to a pear, or that a pizzeria is similar to a restaurant.

To learn more about embeddings, read our post about how intent classification works.

Those embeddings can be used as features and distill some pre-existing knowledge of the world into your model.

Adding pre-trained embeddings or language models to the Rasa NLU pipeline

Let's see how adding this general knowledge can help. The pipeline below loads pretrained embeddings for about 600k words.

  - name: "SpacyNLP"
    model: "en_core_web_lg"
    case_sensitive: false
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: DIETClassifier

Note that this will not work out of the box: Spacy and the en_core_web_lg model must be installed on your Rasa instance.

You can see the benefits of pre-trained embeddings below:

Our model is able to figure out similarities with many words unseen in our training data set, which only contains the words flight and restaurant. Wine, pizzeria, beer, and meal are more related to restaurant than they are to flight. Plane, ticket, and even crash are more related to flight than they are to restaurant.

Note that in this pipeline, the only features fed to the DIETClassifier are tokens present in the pre-trained model's vocabulary. As a result, this pipeline will not be useful with typos.

Adding a CountVectorsFeaturizer with n-grams won't help here because it will only get n-grams from the words in your dataset, not from the pre-trained embeddings.

This means that if you want your model to pick up the right intent when a user types pizzria, a similar word must exist in your data. To be typo tolerant, you will still have to add those words, possibly with their typing errors, to your training data.

Things to consider when using pre-trained language models in your Rasa NLU pipeline

This raises the question: are pre-trained embeddings useful at all? There are several things to consider:

  1. Pre-trained embeddings consume resources: storing 300-dimensional vectors for 600k words takes a lot of memory, and in practice only a very small subset is relevant to your project.
  2. Using smaller models with smaller vocabularies will only get you so far. HuggingFace's Bert pre-trained models only have 30-50k vectors, which in our experiments was not enough to let our model know that pizzeria and restaurant are similar (at least one of the two words is not in the vocabulary). An exception is a model trained on a corpus relevant to your domain. For example, SciBert is trained on scientific data, so if your chatbot is about science this model might be helpful despite its smaller vocabulary. In practice, however, finding a model pre-trained on your domain's data is unlikely.

So, should we use them?

It is generally a good idea to use them when you start building an assistant: it yields good results when training data contains very few examples.

As you get more training data, their value decreases because a significant part of your domain's vocabulary is already known from your data. They are only useful when users use synonyms you haven't thought of, or for general conversation intents.

At this point you'll have to decide if the extra value they bring is worth the cost in terms of resources, boot time, inference time, etc.

Now that we have covered how to extract good features, let's explore how to get the most out of them when training our NLU model.

Other pre-trained language models available

HuggingFace models

HuggingFace pre-trained models are very easy to load into your pipeline because the model weights are downloaded for you at training time and when loading a trained NLU model.

A variety of models is available with embeddings in many different languages. However, most of them contain between 30k and 50k embeddings and might not bring a lot of valuable knowledge to help with your specific domain.

  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-Trained weights to be loaded
    model_weights: "bert-base-uncased"
    cache_dir: /app/models/.cache # required with Botfront
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier

Note that the tokenizer and featurizer are different from the earlier Spacy example.


Finally, a mention of the pipeline recommended by Rasa for projects in English.

  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier

Intent classification with the Rasa DIETClassifier

Now that we have seen how providing good features can impact training, let's go through a few knobs we have access to when training.

In this section we'll focus on intents.

What is the DIETClassifier and how does it work?

Let's quickly explain how the DIETClassifier works, in general and layman's terms:

The model takes different features as inputs:

  • Dense features from pre-trained embeddings. Dense features are fed from featurizers upstream in the pipeline. For example the SpacyFeaturizer will provide pre-trained embeddings from the Spacy models as features.
  • Sparse features from the training data. Sparse features are provided by the CountVectorsFeaturizers.

The distinction between dense and sparse features is a technicality: pre-trained vectors are dense because every dimension of the vector contains a number, while features from the training data are sparse because they come as one-hot vectors: all values are 0 except one, which is generally equal to 1.
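To make the distinction concrete, here is a toy illustration (the vocabulary and the dense values are made up for the example):

```python
# Sparse feature: a one-hot vector over the training vocabulary.
vocab = ["book", "flight", "restaurant", "ticket"]
sparse_restaurant = [1 if word == "restaurant" else 0 for word in vocab]
# → [0, 0, 1, 0]: all zeros except a single 1

# Dense feature: a pre-trained embedding, where every dimension carries a value.
dense_restaurant = [0.12, -0.48, 0.90, 0.05]  # illustrative numbers only
```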

You don't need to provide both dense and sparse features. You can use pre-trained embeddings, features from your own training data, or both.

As a reminder, at training time, the classifier will learn boundaries between groups of data points:

[Image: boundaries of an NLU classifier. That is why your NLU will work better with more examples.]

Every utterance in the training data is represented in a vector space. Its position is calculated from the features. The algorithm must learn where the separation lies between all groups of utterances (intents).

The illustration above shows only two intents in a two-dimensional space. Imagine how hard this task can be for 500 intents in a 300-dimensional space.

Optimizing the DIETClassifier

The default configuration should work just fine when you have a handful of intents with simple sentences. But as your training data grows, you may need to adjust some knobs.


Epochs

Epochs is the number of times the training goes through all your data. The default is 300. I am mentioning this knob because in most cases you don't need to increase it. In addition to increasing your training time, doing so might cause overfitting: the model becomes too strongly attached to the training data and starts performing worse on unseen data. We want our model to generalize, which means applying what it learns on training data to unseen user utterances.

However, reducing it might help, especially for smaller datasets.

Example configuration:

  - name: DIETClassifier
    epochs: 100

Language model

Technically, a language model is trained to predict missing tokens or words. For example, when asked for the missing token in:

The Saint-Lawrence is a ___________

The model should output river.

Another challenge might be:

The __________ island is between the Saint-Lawrence and Des Prairies River.

Where the correct output is Montreal.
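The training objective itself can be sketched as randomly hiding tokens and asking the model to recover them. Below is a simplified illustration of the masking step (not Rasa's actual implementation; the function name and mask symbol are invented for the example):

```python
import random

def mask_tokens(tokens: list[str], mask_rate: float = 0.15, mask: str = "__MASK__"):
    # Randomly replace a fraction of the tokens with a mask symbol;
    # the model is then trained to predict the original tokens back.
    rng = random.Random(42)  # fixed seed so the example is reproducible
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask)
            targets[i] = token  # the label the model must recover
        else:
            masked.append(token)
    return masked, targets

masked, targets = mask_tokens("The Saint-Lawrence is a river".split())
```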

You might be wondering why we're training our model on an elementary school task. A good way to answer this is to remember what you learned as a kid from such challenges: you probably learned the structural features of your language, grammatical rules, and general knowledge about the world.

Knowing all this helps you understand what people mean when they are talking to you. And that is exactly what we want for our assistant.

By setting use_masked_language_model: True, the DIETClassifier will perform such challenges and acquire some additional domain knowledge from your training data. This knowledge can be used to add context to embeddings.

Enabling use_masked_language_model will help when the number of intents grows, when your domain language has subtle nuances, or when you expect long or complex user utterances.

Note that this is not related to the pre-trained language models discussed in other parts of this post. Pre-trained language models are generally trained on large corpora of data. Here, you are training a language model on your own training data.

Example configuration:

  - name: DIETClassifier
    use_masked_language_model: True

Embedding dimensions

Embeddings are vectors, and dimensions are the number of numbers composing those vectors. The default is 20. Embeddings carry the meaning of a word. Up to a certain point, the more dimensions, the more meaning you can capture. GloVe vectors, for example, have 300 dimensions.

So, should you just bump embedding_dimension to 300? Not so fast!

GloVe is trained on CommonCrawl, a snapshot of the whole web containing billions of words.

[Image: GloVe embeddings capturing comparative and superlative relations]

As you can see, there's plenty to learn from, with many words used in many different contexts. This knowledge could not be captured in 20-dimensional embeddings.

Is your training data rich enough to saturate 20-dimensional vectors of meaning? At some point it might be. When your training data becomes substantial, increasing embedding_dimension might get you better results.

Example configuration:

  - name: DIETClassifier
    embedding_dimension: 30

Number of transformer layers

Transformers look at how words influence each other in a sentence. In other words, they contextualize words. For example, in "Play a game", "Watch a game", and "Watch a play", game and play have slightly different meanings. Knowing the nuances will help identify the correct intent. If at some point you encounter confusion between such intents, you may try to increase the number of transformer layers with the number_of_transformer_layers parameter. The default value is 2.

Example configuration:

  - name: DIETClassifier
    number_of_transformer_layers: 4

Entity extraction

Now that we have a super-powered intent classifier, let's see how we can tweak entity extraction.

How entity extraction works

At training time, the DIETClassifier knows from your data which sections of your training utterances are entities.

At inference time, it goes through all the words of a sentence and evaluates whether they belong to an entity. If two or more contiguous words belong to the same entity, the sequence is tagged as a whole. That is how you can have multi-word entities.
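The merging of contiguous tagged words can be sketched as follows (a simplified illustration; the "O" tag for non-entity words is a convention borrowed from standard sequence tagging, not necessarily Rasa's internal representation):

```python
def merge_entities(tokens: list[str], tags: list[str]) -> list[tuple[str, str]]:
    # Group contiguous tokens sharing the same non-"O" tag into one entity.
    entities = []
    current_words, current_tag = [], None
    for token, tag in zip(tokens, tags):
        if tag != "O" and tag == current_tag:
            current_words.append(token)  # extend the current entity
        else:
            if current_words:
                entities.append((" ".join(current_words), current_tag))
            current_words = [token] if tag != "O" else []
            current_tag = tag if tag != "O" else None
    if current_words:
        entities.append((" ".join(current_words), current_tag))
    return entities

merge_entities(["fly", "to", "New", "York", "tomorrow"],
               ["O", "O", "location", "location", "O"])
# → [("New York", "location")]
```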

To evaluate if a word should be tagged with an entity, the algorithm looks at the features of:

  • The word being evaluated
  • The word preceding the word being evaluated
  • The word following the word being evaluated.

To make that more concrete, let's consider a few examples:

In the sentence I want to book a room in Paris next week, where Paris is a location entity, we can note that it is preceded by the word in and followed by the word next. It is a very common structure for this type of request.

And your model will learn that any word or expression preceded by words like in or near, and followed by next, today, tomorrow, or even nothing, has a good likelihood of being a location entity.

How to featurize your data for entity extraction

The next question is how it looks at those words. Remember that machine learning is about statistics, statistics are about numbers, and features are the way to convert words into meaningful numbers.

We can influence the training by specifying the features we are interested in.

Features available

For reference, here is the exhaustive list of features taken from the Rasa documentation.

Feature Description
BOS Checks if the token is at the beginning of the sentence.
EOS Checks if the token is at the end of the sentence.
low Checks if the token is lower case.
upper Checks if the token is upper case.
title Checks if the token starts with an uppercase character and all remaining characters are lowercased.
digit Checks if the token contains just digits.
prefix5 Take the first five characters of the token.
prefix2 Take the first two characters of the token.
suffix5 Take the last five characters of the token.
suffix3 Take the last three characters of the token.
suffix2 Take the last two characters of the token.
suffix1 Take the last character of the token.
pos Take the Part-of-Speech tag of the token (SpacyTokenizer required).
pos2 Take the first two characters of the Part-of-Speech tag of the token (SpacyTokenizer required).

Specify features of interest in your pipeline

Let's digest that with an example.

In the following pipeline, we have introduced the LexicalSyntacticFeaturizer. This new featurizer produces the features from your training data used to train entity extraction.

It's important to remember that featurization precedes training and inference, and featurizers must therefore be placed before the classifier in the pipeline.

  - name: LexicalSyntacticFeaturizer
    features:
      - [low, title, upper, suffix2] # features for the word preceding the word being evaluated
      - [EOS, title, suffix5] # features for the word being evaluated
      - [prefix2] # features for the word following the word being evaluated
  - name: DIETClassifier

How features are being analyzed

In the features section we can define features for the word being analyzed and the surrounding words.

In the context of the sentence I want to book a room in Paris next week, when evaluating if Paris is a location, the algorithm will look at:

  • The word in, and in particular whether it is lowercased (low), uppercased (upper), or capitalized (title). It will also look at its last two letters (suffix2), which will reinforce the likelihood of picking up the entity after words like in or near (last two letters: ar).
  • The word Paris, and in particular whether it is at the end of the sentence (EOS) or capitalized (title); it will also consider its last five letters (suffix5).
  • And the word next, where it will look at the first two letters (prefix2).
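These lexical checks map directly onto simple string operations. A minimal sketch of how such features could be computed (the function name and dictionary keys are illustrative, not Rasa's internal names):

```python
def lexical_features(prev_word: str, word: str, next_word: str) -> dict:
    # Features for the preceding word, the word being evaluated,
    # and the following word, mirroring the pipeline configuration above.
    return {
        "prev:low": prev_word.islower(),    # is the previous word lowercased?
        "prev:title": prev_word.istitle(),  # is it capitalized?
        "prev:upper": prev_word.isupper(),  # is it uppercased?
        "prev:suffix2": prev_word[-2:],     # its last two letters
        "word:title": word.istitle(),       # is the evaluated word capitalized?
        "word:suffix5": word[-5:],          # its last five letters
        "next:prefix2": next_word[:2],      # first two letters of the next word
    }

lexical_features("in", "Paris", "next")
# → {"prev:low": True, "prev:title": False, "prev:upper": False,
#    "prev:suffix2": "in", "word:title": True, "word:suffix5": "Paris",
#    "next:prefix2": "ne"}
```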

Choosing the right features

There is something important to keep in mind when using prefixes and suffixes, especially with 5 letters: if you have too few examples, your model may tend to memorize them and might fail to recognize entities not seen in your data.

That is why, in that particular case, using prefixes or suffixes makes sense for surrounding words: there aren't that many possibilities besides in, near, or from (e.g. not too far from Paris). However, using suffix5 for the entity itself is questionable. The list of possible cities is virtually infinite, so using it may reinforce cities found in your training data against cities that are not included in your dataset.

This configuration is shared by all your entities, so you must come up with an average strategy that generally works well for all of them.

Wrapping up

I hope this post has given you enough material to start improving your NLU pipelines. If you have any questions or comments, feel free to post in our Spectrum community.

Sources and acknowledgements

An important source of information for this post was Rasa's Algorithm Whiteboard series on the DIET architecture. There is a lot more to learn from it if you are interested in the technical aspects of the architecture.