How intent classification works in NLU



If you’re building a serious chatbot, you are probably interested in getting your NLU right. Understanding requests in natural language is a critical part of a successful conversational experience. You can't address a request properly if you don't understand it.

Anatomy of a task-oriented chatbot

Giants like Google, Microsoft, and IBM provide NLU platforms as SaaS offerings, and Rasa (on which Botfront is based) is the leader in the open-source market.

This post explains how intent classification works. It is intended to provide an intuitive understanding of the underlying machine learning.

Teaching semantics to a Martian

An intent is a group of utterances with similar meaning. Meaning is the important word here. Consider these two sentences: “I want to make a reservation in an Italian restaurant” and “I need a table in a pizzeria”.

They both mean the same thing (even if not all Italian restaurants are pizzerias), but all the words are different. How can an NLU system determine they are similar and assign them both the same intent?

Imagine you’re a Martian: you can read but know nothing about words. Just after landing on our planet, you find these sentences written on a stone:

“I’m sitting on a bench with my friend.”
“It rains on the bench we used to sit on.”
“You can sit either on the chair or on the bench.”
“Laurie is sitting on a chair.”
“Hi, welcome, please sit on the chair.”
“That’s the chair your grandfather used to sit on.”

You might figure out that a chair or a bench is something you sit on and sit is something you do with chairs and benches. In other words, words appearing in the same context share semantic meaning.

Of course, you won’t understand what chair or sit means, but that’s ok. We’re interested in finding similarities, so knowing that a chair is similar to a bench is already something.

Now, if you spend the next 2000 years reading all of Wikipedia, you will be able to find many more similarities. You will have read about all sorts of foods, chemicals, cities, professions, skills, and many other subjects, so you'll be able to say that "teacher" is somewhat more similar to "professor" than to "physicist", and that an "apple" is more similar to an "orange" than to a "hamburger".

Do you get the idea? Now that we've taught semantics to a Martian, can we do the same with a computer?

Teaching semantics to a machine: word embeddings

NLU is about machine learning, machine learning is about statistics, and statistics are about numbers. So how can we turn "an 'apple' is more similar to an 'orange' than to a 'hamburger'" into a comparison between numbers?

What if the Martian could read the whole Internet and learn similarities from billions of words? Most Martians can't, but computers are decent at this kind of task! Embedding algorithms are run on huge text corpora such as Wikipedia or Common Crawl to learn those similarities.

As you can see in the tables below, those datasets contain millions of words in many languages, and a lot can be learned by crawling them.

Source: Learning Word Vectors for 157 Languages (https://arxiv.org/pdf/1802.06893.pdf)

The first generation of such algorithms included Word2Vec and GloVe. Training means the algorithm reads the entirety of Wikipedia or Common Crawl and learns the semantics of words from their context. The output of this training is word embeddings, or word vectors: each word is represented by a vector (an array of numbers). As you may remember from geometry, vectors can be placed in a space, and the distance between them can be measured. Similar words will have vectors close to each other: the distance between the vectors for “apple” and “orange” is smaller than the distance between the vectors for “apple” and “hamburger”.
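To make the distance idea concrete, here is a minimal sketch in Python. The vectors are made up (real Word2Vec or GloVe embeddings have hundreds of dimensions), but the geometry is the same: cosine similarity scores similar words higher.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: close to 1 means similar direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-D embeddings; real models use 100-300 dimensions. The values are
# invented purely to illustrate the geometry.
embeddings = {
    "apple":     np.array([0.9, 0.8, 0.1]),
    "orange":    np.array([0.8, 0.9, 0.2]),
    "hamburger": np.array([0.2, 0.3, 0.9]),
}

print(cosine_similarity(embeddings["apple"], embeddings["orange"]))     # high: similar
print(cosine_similarity(embeddings["apple"], embeddings["hamburger"]))  # lower: less similar
```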

The video below explores a 3-D projection of a 200-D word vector space, where we can easily see how similar words cluster together. I made this video with http://projector.tensorflow.org/. Feel free to explore it yourself 🙂

Word embeddings are good for comparing words, but how do we compare sentences? You can perform arithmetic operations on vectors, and a sentence is just a group of words. The meaning of each word is captured in a vector, and averaging those vectors is a simple way to capture the meaning of the whole sentence.

To recap: the meaning of a word is captured by its word embedding, and the meaning of a sentence is captured by the average of the embeddings of its words. So now we have something that can enable a machine to say that “I am hungry” and “I want to eat” are similar sentences.
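A minimal sketch of that averaging, again with made-up vectors (a real embedding lookup such as GloVe would work the same way):

```python
import numpy as np

# Made-up word vectors; in practice you would look these up in a
# pre-trained embedding table.
embeddings = {
    "i":      np.array([0.1, 0.1, 0.1]),
    "am":     np.array([0.1, 0.2, 0.1]),
    "hungry": np.array([0.9, 0.7, 0.2]),
    "want":   np.array([0.2, 0.3, 0.2]),
    "to":     np.array([0.1, 0.1, 0.2]),
    "eat":    np.array([0.8, 0.8, 0.3]),
}

def sentence_vector(sentence):
    # Average the vectors of the words we know:
    # a crude but serviceable sentence embedding.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

v1 = sentence_vector("I am hungry")
v2 = sentence_vector("I want to eat")
# The two averages land close together because "hungry" and "eat"
# point in similar directions.
```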

Intents are labels tagging sentences that have similar meanings. We have seen how meaning can be captured; now we can use that to train our assistant.

Training an NLU model

NLU systems get better with more training data, which is why they need several examples for every intent. Those examples should be similar in meaning, so if you were to plot all those sentences’ vectors, they should be close to each other, forming a sort of cloud of points (one cloud for each intent). When you train your NLU model, it learns the boundaries between the clouds, so when your system encounters a sentence it has never seen, it can map it to the closest cloud of points and determine its intent. The illustration below shows how such clouds and boundaries might look. Each point corresponds to a sentence, or more precisely to a sentence’s vector. Blue crosses are vectors of sentences with the intent check_balance, and red circles correspond to vectors of sentences with the intent transfer.

Boundaries of an NLU classifier: this is why your NLU will work better with more examples.
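In code, the training step boils down to fitting a classifier on those sentence vectors. Here is a minimal sketch with scikit-learn, using 2-D points as stand-ins for real sentence vectors and the two intents from the illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# 2-D stand-ins for sentence vectors; each point is one training example.
X = np.array([
    [0.8, 0.7],  # "what's my balance"        -> check_balance
    [0.9, 0.6],  # "how much money do I have" -> check_balance
    [0.1, 0.2],  # "send $50 to Alice"        -> transfer
    [0.2, 0.1],  # "wire money to my sister"  -> transfer
])
y = ["check_balance", "check_balance", "transfer", "transfer"]

# Fitting the classifier learns the boundary between the two clouds of points.
clf = LogisticRegression().fit(X, y)

# A sentence the model has never seen is mapped to the closest cloud.
print(clf.predict([[0.85, 0.65]]))  # -> ['check_balance']
```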

Limitations of word embeddings

Word embeddings are great because they provide some sense of meaning for a wide vocabulary out of the box. But they also come with limitations that newer generations of NLU systems are addressing.

Homonyms

One limitation is that a single word can have several meanings. A bank can be a financial institution or the edge of a river. The embedding of bank carries an average meaning, based on the frequency of the contexts in which bank appears in the training data.

Plurals, abbreviations and typos

“bk ✈️ 2 SFO”

Another limitation is that using whole words as features makes typos and abbreviations hard or impossible to understand: if a word is not present in the original corpus (e.g. Wikipedia), it is simply unknown.
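You can see the problem in two lines of Python: a whole-word lookup has nothing to offer for tokens it has never seen (the vectors below are invented for illustration):

```python
# Whole-word features: anything outside the known vocabulary is invisible.
embeddings = {"book": [0.5, 0.2], "flight": [0.7, 0.1], "to": [0.1, 0.1]}

for token in "bk flight 2 sfo".split():
    print(token, "->", embeddings.get(token))  # typos and abbreviations return None
```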

Heavy

Finally, pre-trained word embeddings bring knowledge of the world into your model, but your model only needs a fraction of it, and it might require more specific knowledge of the world your chatbot is actually about. If your chatbot is not about a mainstream subject, chances are that the words that matter to your domain are underrepresented in the corpus, and the quality of their embeddings suffers.

Training your own embeddings

Training embeddings on your own dataset is a good shot at solving the issues raised above: you get better vectors for your domain-specific words, and a lighter model that does not contain millions of word embeddings you have no use for.

Rasa's EmbeddingIntentClassifier, based on Facebook's StarSpace algorithm, allowed exactly that: instead of using pre-trained embeddings, it learns embeddings directly from your data and trains a classifier on them.

This dramatically increases accuracy on domain-specific datasets because the model learns directly from the words in your examples. However, because it has no pre-existing knowledge of the world (no pre-trained embeddings), it requires substantially more examples for each intent to get started.

Another benefit of training embeddings on your own vocabulary is that features can be created from n-grams, not just whole words. Character n-grams are combinations of letters inside words, and they make your model tolerant to small variations such as plurals or typos.
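A quick sketch of why character n-grams help: a plural or a typo still shares most of its trigrams with the original word, so the model still gets a strong signal.

```python
def char_ngrams(word, n=3):
    # Character trigrams, with boundary markers so prefixes and suffixes stay distinct.
    padded = f"<{word}>"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

print(char_ngrams("apple") & char_ngrams("apples"))  # the plural shares most trigrams
print(char_ngrams("apple") & char_ngrams("aple"))    # even a typo keeps a lot of overlap
```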

Mixing pre-trained with your own embeddings

The latest Rasa iteration, the DIETClassifier, brings the best of both worlds with the ability to mix pre-trained embeddings with your domain-specific embeddings. It means you can still benefit from general knowledge of the world and add the knowledge of your domain. General knowledge of the world means that your assistant will know that beer and wine are drinks, and that yes and sure mean affirmation. You can then build the knowledge of your domain on top of this general knowledge.
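This is not DIET itself, but the core idea can be sketched in a few lines: concatenate dense pre-trained features (general knowledge) with sparse features learned from your own data (domain knowledge), and train the classifier on both. The random vectors below are stand-ins for real pre-trained sentence embeddings.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["what's my balance", "how much do i have", "send money to alice", "wire $50 to bob"]
labels = ["check_balance", "check_balance", "transfer", "transfer"]

# Sparse, domain-specific features learned from your own examples (character n-grams).
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
sparse_features = vectorizer.fit_transform(texts).toarray()

# Stand-in for dense pre-trained sentence embeddings (GloVe averages, a language
# model, etc.); random here only to keep the sketch self-contained.
rng = np.random.default_rng(0)
dense_features = rng.normal(size=(len(texts), 8))

# Both worlds side by side: general knowledge plus domain knowledge.
X = np.hstack([dense_features, sparse_features])
clf = LogisticRegression().fit(X, labels)
```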

Giving context to embeddings with transformers

Another big improvement is that the DIETClassifier enriches embeddings with context from your data, thanks to its transformer architecture.

Transformers are model components designed to learn from sequences. Sentences are sequences in the sense that order matters and each word is used in the context of the others. Understanding a sentence properly involves understanding how each word relates to the others.
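At the heart of a transformer is self-attention. Stripped of the learned projection matrices and positional encodings a real transformer uses, the mechanism fits in a few lines: every word's vector is recomputed as a similarity-weighted average of all the word vectors in the sentence.

```python
import numpy as np

def self_attention(x):
    # x: one row per word. Scores measure how much each word attends to each other word.
    scores = x @ x.T / np.sqrt(x.shape[1])
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over each row
    return weights @ x  # each word becomes a weighted average of the whole sentence

# Toy 2-D vectors for "The red dog": after attention, each row mixes in context
# from the words it relates to most.
x = np.array([[0.1, 0.2],   # The
              [0.9, 0.1],   # red
              [0.8, 0.3]])  # dog
print(self_attention(x))
```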

Transformers positional encoding. Source: https://www.youtube.com/watch?v=TQQlZhbC5ps

The image above illustrates this. Obviously each word relates strongly to itself, so we shouldn't read too much into that. If we look at the row for the token The (first row), we see that dog is the darkest token (besides The, of course). This indicates that The and dog share context. The same goes for red and dog.

Every embedding is adjusted to reflect how its word relates to the other words in the sentence, which means NLU models are now able to understand that the bank you withdraw money from is different from the bank that follows the river.

Ordering and negation

Transformers look at how each word influences the others in a sentence. It means that a sentence is no longer a dumb average of all its word embeddings, but rather a weighted average, where the weights represent how relevant a given word is for a particular intent.

For example, the word play may be more relevant in the sentence "I want to play chess", where the intent is play, than in "I want to watch a play", where the intent is watch.

“I am hungry” vs. “I am not hungry”

A corollary is that negation is better captured. With older approaches that simply averaged word vectors, the only difference between I am hungry and I am not hungry was the contribution of the not vector, which might not be a big enough shift to distinguish two opposite intents. If you have any experience with NLU, you know how difficult handling negation has always been.
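A quick back-of-the-envelope check with made-up vectors shows how the plain average dilutes not:

```python
import numpy as np

# Invented vectors; only the arithmetic matters here.
v = {
    "i":      np.array([0.1, 0.1]),
    "am":     np.array([0.1, 0.2]),
    "not":    np.array([0.3, 0.2]),
    "hungry": np.array([0.9, 0.7]),
}

hungry     = np.mean([v["i"], v["am"], v["hungry"]], axis=0)            # "I am hungry"
not_hungry = np.mean([v["i"], v["am"], v["not"], v["hungry"]], axis=0)  # "I am not hungry"

# The two opposite sentences land almost on top of each other: the single
# "not" vector is diluted by the average over all the words.
print(np.linalg.norm(hungry - not_hungry))  # a small distance
```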

With transformers, your model has a chance to understand that not is strongly related to hungry and weigh it differently. When you include many other examples with negations, negation becomes a concept your model can learn.

Conclusion and take-away

I hope this post helped you develop an intuition for how NLU works. Intent recognition has evolved a lot in recent years thanks to the innovative work done by Rasa, and you can benefit from it by adding the DIETClassifier to your pipeline.