Rasa's DIETClassifier provides state-of-the-art performance for intent classification and entity extraction. In this post you will learn how this algorithm works and how to adapt the pipeline to the specifics of your project to get the best performance out of it. We'll dive into the most important steps and show you how to optimize the training for your very specific chatbot.
When you train your model, or when your model is in production and tries to identify the intent and entities associated with a user utterance, the data percolates through a pipeline that performs a sequence of operations.
There is technically no limit to the types of operations, but three steps are common to most pipelines: tokenization, featurization, and training or inference.
We'll briefly explain what they do below and go into more detail in the next section.
Tokenization consists of extracting the list of words (tokens) from an utterance. For example, the sentence Set up a pipeline for the Rasa DIETClassifier would be tokenized as:
"Set", "up", "a", "pipeline", "for", "the", "Rasa", "DIETClassifier"
Many features are derived from words rather than sentences. The tokens resulting from this step will be used downstream in the pipeline for feature extraction.
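Whitespace tokenization can be sketched in a couple of lines of Python (a simplification of what a whitespace tokenizer does; real tokenizers also deal with punctuation and special characters):

```python
def whitespace_tokenize(utterance: str) -> list[str]:
    """Split an utterance into tokens on whitespace (simplified sketch)."""
    return utterance.split()

tokens = whitespace_tokenize("Set up a pipeline for the Rasa DIETClassifier")
# ['Set', 'up', 'a', 'pipeline', 'for', 'the', 'Rasa', 'DIETClassifier']
```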
Machine learning is about statistics, and statistics work with numbers, not words. Featurization is the process of transforming words into meaningful numbers (or vectors) that can be fed to the training algorithm.
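As a toy illustration (not Rasa's actual implementation; the function names are made up), a minimal count-based featurizer assigns each vocabulary word an index and turns a sentence into a vector of word counts:

```python
def build_vocabulary(sentences: list[str]) -> dict[str, int]:
    """Assign an index to every distinct word seen in the training data."""
    vocab: dict[str, int] = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def count_vector(sentence: str, vocab: dict[str, int]) -> list[int]:
    """Turn a sentence into a vector of word counts over the vocabulary."""
    vector = [0] * len(vocab)
    for word in sentence.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

vocab = build_vocabulary(["book a restaurant", "book a flight"])
count_vector("a restaurant", vocab)  # counts over: book, a, restaurant, flight
# [0, 1, 1, 0]
```

Note how a word absent from the training vocabulary (a typo, for instance) contributes nothing to the vector, which is exactly the weakness discussed below.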
At training time, the algorithm learns from the features derived from the raw text data. At inference time (when the trained model is used to make predictions), the raw data follows the same path. Utterances are tokenized, features are extracted and used to predict the intent and entities.
As we have seen, an NLU pipeline is a sequence of steps the raw data goes through. It must include tokenizers, featurizers, and a training algorithm. Here is a very simple example:
```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
```
WhitespaceTokenizer separates words on white space. Note that it doesn't do well with other characters such as apostrophes. If one of your project's languages uses them (e.g. French), consider another tokenizer such as the SpacyTokenizer, which in turn requires the SpacyNLP component:
```yaml
pipeline:
  - name: SpacyNLP
  - name: SpacyTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
```
By default, the CountVectorsFeaturizer only adds one feature for each word in your training data.
For this example I trained a model using the first pipeline above on the following data:
When trying to infer the intent of an utterance containing only restaurant, a word contained in the single example for the book_restaurant intent, the intent is recognized with high confidence. It works because the exact same feature can be extracted from the word restaurant as at training time.
However, the model will not be very resilient to typos. When omitting a letter and typing restauran, it cannot recognize the correct intent.
This happens because the only feature provided at training time for the word restaurant was the whole word. The algorithm has no way to figure out that restauran is close to restaurant.
Let's see how we can improve that.
N-grams are sequences of contiguous letters in a word. For example, this is the list of 4-grams contained in restaurant:
rest, esta, stau, taur, aura, uran, rant
The list of 3-grams:
res, est, sta, tau, aur, ura, ran, ant
And the list of bi-grams:
re, es, st, ta, au, ur, ra, an, nt
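The lists above can be generated with a short helper (a sketch; CountVectorsFeaturizer's char_wb analyzer additionally pads word boundaries with spaces before extracting n-grams):

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return all character n-grams of a word (no boundary padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

char_ngrams("restaurant", 4)
# ['rest', 'esta', 'stau', 'taur', 'aura', 'uran', 'rant']
```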
Instead of getting one unique feature mapping a single word, we can use the many parts this word is made of. The following pipeline will use all the n-grams above as features.
```yaml
pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
    analyzer: word
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 2
    max_ngram: 4
  - name: DIETClassifier
```
The result is a much richer comprehension of words, and increased resilience to typing errors. In the video below, the model is able to fall back on the right intent because even with typos, the features extracted at inference time carry a lot of similarity with the features extracted at training time.
When using n-grams, you need to consider training time. Multiplying the number of features (the n-grams above add 24 features for the single word restaurant) will have an impact on your training. Choose a combination of n-grams that brings enough resilience to typing mistakes without inflating the feature space.
Being resilient to mistakes is great, but typing anything other than flight or restaurant will bring random results. It would be great if we could bring in some general language knowledge out of the box. General language knowledge means knowing, for example, that a pizzeria is a restaurant.
With our current pipeline, our bot doesn't know that.
A pre-trained model contains embeddings trained on a typically large corpus such as Wikipedia or CommonCrawl. Those embeddings are a numeric representation of word meanings in terms of similarities. The model doesn't know what an apple is, but it knows that an apple is similar to a pear, or that a pizzeria is similar to a restaurant.
To learn more about embeddings, read our post about how intent classification works.
Those embeddings can be used as features and distill some pre-existing knowledge of the world into your model.
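Similarity between embeddings is typically measured with cosine similarity. The vectors below are made-up 3-dimensional toys (real Spacy embeddings have 300 dimensions); only the relative geometry matters here:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings: pizzeria and restaurant point in similar directions.
restaurant = [0.9, 0.1, 0.2]
pizzeria = [0.8, 0.2, 0.3]
flight = [0.1, 0.9, 0.1]

cosine_similarity(restaurant, pizzeria) > cosine_similarity(restaurant, flight)  # True
```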
Let's see how adding this general knowledge can help. The pipeline below loads pretrained embeddings for about 600k words.
```yaml
pipeline:
  - name: "SpacyNLP"
    model: "en_core_web_lg"
    case_sensitive: false
  - name: SpacyTokenizer
  - name: SpacyFeaturizer
  - name: DIETClassifier
```
Note that this will not work out of the box: Spacy and the en_core_web_lg model must be installed on your Rasa instance.
You can see the benefits of pre-trained embeddings below:
Our model is able to figure out similarities with many words unseen in our training data set, which only contains the words flight and restaurant. Wine, pizzeria, beer, and meal are more related to restaurant than they are to flight. Plane, ticket, and even crash are more related to flight than they are to restaurant.
Note that in this pipeline, the only features fed to the DIETClassifier come from tokens present in the pre-trained model's vocabulary. As a result, this pipeline will not help with typos.
Adding a CountVectorsFeaturizer with n-grams won't help here because it will only get n-grams from the words in your dataset, not from the pre-trained embeddings.
This means that if you want your model to pick up the right intent when a user types pizzria, a similar word must exist in your data. To be typo tolerant, you will still have to add those words, possibly with the typing errors, to your training data.
This raises the question: are pre-trained embeddings useful at all? There are several things to consider:
- Pre-trained embeddings consume resources: storing 300-dimensional vectors for 600k words takes a lot of memory. And in practice, only a very small subset is relevant to your project.
- Using smaller models with smaller vocabularies will only get you so far. HuggingFace's BERT pre-trained models only have 30-50k vectors, which in our experiments was not enough to let our model know that pizzeria and restaurant are similar (at least one of the two words is not in the vocabulary). An exception is a model trained on a corpus relevant to your domain. For example, SciBERT is trained on scientific data, so if your chatbot is about science this model might be helpful despite its smaller vocabulary. In practice, however, finding a model pre-trained on your domain's data is unlikely.
So, should we use them?
It is generally a good idea to use them when you start building an assistant: they yield good results when the training data contains very few examples.
As you get more training data, their value decreases because a significant part of your domain's vocabulary is already known from your data. They are only useful for synonyms you haven't thought of or for general conversation intents.
At this point you'll decide if the extra value they bring is worth the cost in terms of resources, boot time, inference time, etc.
Now that we have covered how to extract good features, let's explore how to get the most out of them when training our NLU model.
HuggingFace pre-trained models are very easy to load in your pipeline: the model weights are downloaded for you at training time and when loading a trained NLU model.
A variety of models is available, with embeddings in many different languages. However, most of them contain between 30k and 50k embeddings and might not bring a lot of valuable knowledge to help with your specific domain.
```yaml
pipeline:
  - name: HFTransformersNLP
    # Name of the language model to use
    model_name: "bert"
    # Pre-trained weights to be loaded
    model_weights: "bert-base-uncased"
    cache_dir: /app/models/.cache # required with Botfront
  - name: LanguageModelTokenizer
  - name: LanguageModelFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: char_wb
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
```
Note that the tokenizer and featurizer are different from the earlier Spacy example.
Finally, a mention of the pipeline recommended by Rasa for projects in English.
```yaml
pipeline:
  - name: ConveRTTokenizer
  - name: ConveRTFeaturizer
  - name: RegexFeaturizer
  - name: CountVectorsFeaturizer
  - name: CountVectorsFeaturizer
    analyzer: "char_wb"
    min_ngram: 1
    max_ngram: 4
  - name: DIETClassifier
```
Now that we have seen how providing good features can impact our training, let's go through a few knobs we have access to when training.
In this section we'll focus on intents.
Let's quickly explain how the DIETClassifier works, in general and in layman's terms.
The model takes different features as inputs:
- Dense features from pre-trained embeddings. Dense features are fed from featurizers upstream in the pipeline. For example, the SpacyFeaturizer will provide pre-trained embeddings from the Spacy models as features.
- Sparse features from the training data. Sparse features are provided by featurizers such as the CountVectorsFeaturizer.
The distinction between dense and sparse features is a technicality: pre-trained vectors are dense because every dimension of the vector contains a number, while features from the training data are sparse because they come as one-hot vectors: all values are 0 except one, which is generally equal to 1.
You don't need to provide both dense and sparse features. You can use pre-trained embeddings, features from your own training data, or both.
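The difference can be illustrated with a toy sketch: a sparse one-hot vector over a 5-word vocabulary has a single 1, while a dense embedding fills every dimension (the dense values below are made up):

```python
def one_hot(index: int, size: int) -> list[int]:
    """Sparse one-hot vector: all zeros except a single 1 at the word's index."""
    vector = [0] * size
    vector[index] = 1
    return vector

sparse = one_hot(2, 5)                     # [0, 0, 1, 0, 0] -- mostly zeros
dense = [0.21, -0.45, 0.07, 0.88, -0.13]   # every dimension carries a value
```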
As a reminder, at training time, the classifier will learn boundaries between groups of data points:
Every utterance in the training data is represented in a vector space. The position is calculated from the features. The algorithm must learn where the separation lies between all groups of utterances (intents).
The illustration above shows only two intents in a two-dimensional space. Imagine how hard this task can be for 500 intents in a 300-dimensional space.
The default configuration should work just fine when you have a handful of intents with simple sentences. But as your training data grows, you may need to adjust some knobs.
Epochs is the number of times the training goes through all your data. The default is 300.
I am mentioning this knob because in most cases you don't need to increase it. In addition to increasing your training time, it might cause overfitting: the model becomes too strongly attached to the training data and starts performing worse on unseen data. We want our model to generalize, which means applying what it learns from training data to unseen user utterances.
However, reducing it might work, especially for smaller datasets.
```yaml
pipeline:
  ...
  - name: DIETClassifier
    epochs: 100
```
Technically, a language model is trained to predict missing tokens or words. For example, when asked for the missing token in:
The Saint-Lawrence is a ___________
The model should output river.
Another challenge might be:
The __________ island is between the Saint-Lawrence and Des Prairies River.
Where the correct output is Montreal.
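A masked-word training example can be sketched as follows (a hypothetical helper for illustration; the real masking happens on token features inside DIET, not on raw strings):

```python
def mask_word(tokens: list[str], index: int) -> tuple[list[str], str]:
    """Hide one token and return the masked sequence plus the expected answer."""
    masked = tokens.copy()
    target = masked[index]
    masked[index] = "__MASK__"
    return masked, target

tokens = "The Saint-Lawrence is a river".split()
masked, target = mask_word(tokens, 4)
# masked: ['The', 'Saint-Lawrence', 'is', 'a', '__MASK__'], target: 'river'
```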
You might be wondering why we're training our model on an elementary school task. A good way to answer this is to remember what you learned as a kid from such exercises: you probably learned the structural features of your language, grammatical rules, and general knowledge about the world.
Knowing all this helps you understand what people mean when they are talking to you. And that is exactly what we want for our assistant.
With use_masked_language_model: True, the DIETClassifier will perform such challenges and acquire some additional domain knowledge from your training data. This knowledge can be used to add context to embeddings.
use_masked_language_model will help when the number of intents grows, when your domain language has subtle nuances, or when you expect long or complex user utterances.
Note that this is not related to the pre-trained language models discussed in other parts of this post. Pre-trained language models are generally trained on large corpora of data. Here, you are training a language model on your own training data.
```yaml
pipeline:
  ...
  - name: DIETClassifier
    use_masked_language_model: True
```
Embeddings are vectors, and dimensions are the number of numbers composing those vectors. The default is 20.
Embeddings carry the meaning of the word. Up to a certain point, the more dimensions, the more meaning you can capture.
GloVe vectors, for example, have 300 dimensions.
So, should you just bump embedding_dimension to 300? Not so fast!
GloVe is trained on CommonCrawl, a snapshot of the whole web containing billions of words.
As you can see, there's plenty to learn from, with many words used in many different contexts. This knowledge could not be captured in 20-dimensional embeddings.
Is your training data rich enough to saturate 20-dimensional vectors of meaning? At some point it might be. When your training data becomes substantial, increasing embedding_dimension might get you better results.
```yaml
pipeline:
  ...
  - name: DIETClassifier
    embedding_dimension: 30
```
Transformers look at how words influence each other in a sentence; in other words, they contextualize words. For example, in "Play a game", "Watch a game", and "Watch a play", game and play have slightly different meanings. Knowing the nuances will help identify the correct intent. If at some point you encounter confusion between such intents, you may try to increase the number of transformer layers with number_of_transformer_layers. The default value is 2.
```yaml
pipeline:
  ...
  - name: DIETClassifier
    number_of_transformer_layers: 4
```
Now that we have a super-powered intent classifier, let's see how we can tweak entity extraction.
At training time, the DIETClassifier knows from your data which sections of your training utterances are entities.
At inference time, it goes through all the words of a sentence and evaluates whether they belong to an entity. If two or more contiguous words belong to the same entity, the sequence is tagged as a whole. That is how you can have multi-word entities.
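Merging contiguous words that share an entity tag into one entity can be sketched like this (a simplification; Rasa internally uses a BILOU/BIO tagging scheme for this):

```python
def merge_entities(tagged_tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Merge runs of adjacent tokens sharing the same entity tag ('O' = no entity)."""
    entities: list[tuple[str, str]] = []
    prev_tag = "O"
    for word, tag in tagged_tokens:
        if tag != "O" and tag == prev_tag:
            # Extend the current multi-word entity.
            entities[-1] = (entities[-1][0] + " " + word, tag)
        elif tag != "O":
            # Start a new entity.
            entities.append((word, tag))
        prev_tag = tag
    return entities

merge_entities([("fly", "O"), ("to", "O"), ("New", "location"), ("York", "location")])
# [('New York', 'location')]
```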
To evaluate if a word should be tagged with an entity, the algorithm looks at the features of:
- The word being evaluated
- The word preceding the word being evaluated
- The word following the word being evaluated.
To make that more concrete, let's consider a few examples:
In the sentence I want to book a room in Paris next week, where Paris is a location entity, we can note that it is preceded by the word in and followed by next. It is a very common structure for this type of request.
And your model will learn that any word or expression preceded by words like in or near, and followed by next, today, tomorrow, or even nothing, has a likelihood of being a location entity.
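The three-word context window described above can be sketched as:

```python
def context_windows(tokens: list[str]) -> list[tuple]:
    """Pair each token with the preceding and following token (None at the edges)."""
    return [
        (tokens[i - 1] if i > 0 else None,
         tokens[i],
         tokens[i + 1] if i < len(tokens) - 1 else None)
        for i in range(len(tokens))
    ]

sentence = "I want to book a room in Paris next week".split()
context_windows(sentence)[7]  # ('in', 'Paris', 'next')
```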
The next question is how it looks at those words. Remember that machine learning is about statistics, statistics are about numbers, and features are the way to convert words into meaningful numbers.
We can influence the training by specifying the features we are interested in.
For reference, here is the exhaustive list of features taken from the Rasa documentation.
| Feature | Description |
|---|---|
| `BOS` | Checks if the token is at the beginning of the sentence. |
| `EOS` | Checks if the token is at the end of the sentence. |
| `low` | Checks if the token is lower case. |
| `upper` | Checks if the token is upper case. |
| `title` | Checks if the token starts with an uppercase character and all remaining characters are lowercased. |
| `digit` | Checks if the token contains just digits. |
| `prefix5` | Take the first five characters of the token. |
| `prefix2` | Take the first two characters of the token. |
| `suffix5` | Take the last five characters of the token. |
| `suffix3` | Take the last three characters of the token. |
| `suffix2` | Take the last two characters of the token. |
| `suffix1` | Take the last character of the token. |
| `pos` | Take the Part-of-Speech tag of the token (requires `SpacyTokenizer`). |
| `pos2` | Take the first two characters of the Part-of-Speech tag of the token (requires `SpacyTokenizer`). |
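Most of these features boil down to simple string checks, as this simplified re-implementation shows (a sketch, not Rasa's code; the Part-of-Speech features are omitted since they require a parser):

```python
def lexical_features(token: str, position: int, length: int) -> dict:
    """Compute a few of the lexical features listed above for one token."""
    return {
        "BOS": position == 0,           # beginning of sentence
        "EOS": position == length - 1,  # end of sentence
        "low": token.islower(),
        "upper": token.isupper(),
        "title": token.istitle(),
        "digit": token.isdigit(),
        "prefix2": token[:2],
        "suffix2": token[-2:],
    }

tokens = "book a room in Paris".split()
lexical_features("Paris", 4, len(tokens))
# {'BOS': False, 'EOS': True, 'low': False, 'upper': False,
#  'title': True, 'digit': False, 'prefix2': 'Pa', 'suffix2': 'is'}
```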
Let's digest that with an example.
In the following pipeline, we have introduced the LexicalSyntacticFeaturizer. This new featurizer will produce features from your training data to train entity extraction.
It's important to remember that featurization precedes training and inference, and the featurizer must therefore be placed before the classifier in the pipeline.
```yaml
pipeline:
  ...
  - name: LexicalSyntacticFeaturizer
    features:
      - [low, title, upper, suffix2]  # features for the word preceding the word being evaluated
      - [EOS, title, suffix5]         # features for the word being evaluated
      - [prefix2]                     # features for the word following the word being evaluated
  - name: DIETClassifier
  ...
```
In the features section we can define features for the word being analyzed and the surrounding words.
In the context of the sentence I want to book a room in Paris next week, when evaluating whether Paris is a location, the algorithm will look at:
- The word in, and in particular whether it is lowercased (low), uppercased (upper), or capitalized (title). It will also look at the last two letters (suffix2), which will reinforce the likelihood of picking up the entity for words like in or near (last two letters: ar).
- The word Paris, and in particular whether it is at the end of the sentence (EOS) and capitalized (title); it will also consider the last five letters (suffix5).
- And for the word next, it will look at the first two letters (prefix2).
There is something important to keep in mind when using prefixes and suffixes, especially with five letters: if you have too few examples, your model may tend to memorize them and might fail to recognize entities not seen in your data.
That's why, in this particular case, using prefixes or suffixes makes sense for the surrounding words: there aren't that many possibilities besides in, near, or from (e.g. not too far from Paris). However, using suffix5 for the entity itself is questionable: the list of possible cities is virtually infinite, so using it may reinforce cities found inside your training data against cities that are not included in your dataset.
This configuration is shared by all your entities, so you must come up with an average strategy that generally works well for all of them.
I hope this post has given you enough material to start improving your NLU pipelines. If you have any questions or comments, feel free to post in our Spectrum community.
An important source of information for this post was the Rasa Algorithm Whiteboard series on the DIET architecture. There is a lot more to learn there if you are interested in the technical aspects of the architecture.