How MoMo Builds an AI Chatbot


Customer support agents answer the same user questions day after day by searching through user guides and documents, which is tedious and time-consuming work. Question answering systems can significantly reduce the load on customer service teams, enabling them to focus on things that truly matter. In recent years, the success of Natural Language Processing has led to the fast growth of conversational AI. A variety of frameworks for building chatbots, such as DialogFlow and Rasa, and neural generation-based conversational models, such as Google’s LaMDA, have been released and used by big tech firms.

At MoMo, we help customer support agents save time and costs. Rather than chatting with the user directly, our system suggests appropriate answers to a user question (e.g., Why am I not receiving the OTP code?). The agent then selects one of the suggested responses and may make minor edits before sending it. Currently, we only cover banking matters, since they account for a large share of the tickets raised by users who get stuck in MoMo. To respond rapidly and precisely in both single-turn and multi-turn settings, we leveraged Rasa, an open-source conversational AI framework, to classify user problems. Our solution is currently running in production and helped agents reduce support time in up to 61% of conversations (counted in November 2021). We also evaluated our Natural Language Understanding (NLU) component on a test set of nearly 5000 samples; the model achieved an accuracy of 94% for general intents and 80% for specific intents.

Despite its high performance on the test set, our system still cannot provide many valuable instructions in practice because of the broad variety of intents and the lack of high-quality data at scale. In the following sections, we describe how we designed the suggestion system to overcome these challenges, what we achieved, and where we are still stuck.

The look of our dataset

Our raw conversation dataset was noisy, unlabeled, and full of irrelevant information. To prepare a good training set, we ran it through several data filtering pipelines. In the starting phase, we did both automatic and manual labeling. For automatic labeling, we used fast and accurate fuzzy search engines such as Elasticsearch to map user messages to suitable topics. After filtering out unnecessary sentences, we only needed to review the hard-to-detect cases and relabel them manually to obtain more accurate labels. In the following phases, we invested more in manual labeling to grow the high-quality training set. In addition, we ran our data through an embedding clustering method to correct mislabeled samples. This method first encoded inputs into sentence embeddings and then applied clustering techniques to those embeddings and their labels to visualize mislabeled and outlier samples.
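To make the automatic labeling step more concrete, here is a minimal sketch of mapping a noisy message to a topic by fuzzy matching. It uses Python's standard `difflib` as a stand-in for a search engine such as Elasticsearch, and the seed phrases, the `best_topic` helper, and the `min_score` cutoff are all assumptions made for illustration, not our production setup.

```python
from difflib import SequenceMatcher

# Hypothetical topic -> seed phrases; the real mapping is not shown in this post.
TOPIC_SEEDS = {
    "Liên_kết": ["liên kết ngân hàng", "cách liên kết"],
    "Nạp_tiền": ["nạp tiền", "không nhận được mã otp"],
}


def best_topic(message, seeds=TOPIC_SEEDS, min_score=0.5):
    """Return (topic, score) for the closest seed phrase.

    The topic is None when no seed phrase reaches min_score, which marks
    the message as a candidate for manual labeling.
    """
    text = message.lower()
    best, best_score = None, 0.0
    for topic, phrases in seeds.items():
        for phrase in phrases:
            # Character-level similarity in [0, 1]; tolerant of typos
            # and missing diacritics.
            score = SequenceMatcher(None, text, phrase).ratio()
            if score > best_score:
                best, best_score = topic, score
    if best_score < min_score:
        return None, best_score
    return best, best_score


if __name__ == "__main__":
    print(best_topic("nap tien"))      # message typed without diacritics
    print(best_topic("xyz"))           # unrelated text falls below the cutoff
```

Messages that come back with `None` are the hard-to-detect cases that would go to manual review in a pipeline like the one described above.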

At present, we have only collected conversation data related to banking issues. We obtained approximately 4500 sentence-intent pairs in total, covering up to 25 unique intents that belong to 4 main banking problems: linking (Liên_kết), unlinking (Hủy_liên_kết), deposit (Nạp_tiền), and withdrawal (Rút_tiền).

## intent: Liên_kết__Hướng_dẫn_liên_kết_Ví
- tôi không biết cách liên kết với mb
(I don't know how to link with MB)
- làm sao liên kết tay khoản mo mo
(How can I link with MoMo)
- cho em hỏi cách liên momo với ngân hàng
(Let me know how to link MoMo with bank account)
- hướng dẫn mở thẻ
(Instructions to open a bank card)

## intent: Nạp_tiền__OTP_-_Không_nhận_được_mã
- chuyển tiền từ tài khoản về momo không nhận được mã otp.
(Money transfer from account to MoMo did not receive OTP code)
- mình nạp tiền mà không nhận được mã otp từ nh để nhập vào momo.
(I recharged but did not receive OTP code from the bank to enter in MoMo)

Diving into model experiments

In the early stage of model development, we employed an existing pre-trained generative model, GPT-2, as our model version 1. This version took user questions and chat histories as input and generated responses automatically, which made it easy to answer open-domain questions. However, it had a long inference time, and its bias, safety, and knowledge incorporation were hard to control. It also required a large amount of high-quality training data, which we did not have at that time.

We quickly switched to an intent classification approach. In version 2, we chose Rasa, one of the most popular open-source conversational AI frameworks, because of its large community, scalability, and robustness. Rasa is essentially a combination of two modules: Rasa NLU analyzes user messages to grasp their intentions, and Rasa Core takes the structured output of NLU to choose which action the chatbot should take. Rasa's main classifier is called DIET, which stands for Dual Intent and Entity Transformer (see Figure 1 or an interactive demo for more details). It is a Transformer-based model with 2 layers that takes a sequence of tokens as input and ranks all possible intent labels by computing similarity. The architecture allows us to leverage pre-trained word embeddings such as fastText and BERT, but our current serving model only uses sparse features, a combination of token-level one-hot encodings and character-level multi-hot encodings, for two main reasons. First, the existing pre-trained language models for Vietnamese are very large, consuming plenty of computational resources and inference time. Second, most word embedding models are trained on grammatically correct text, so they need pre-processed inputs to perform at their best and are vulnerable to inputs containing many spelling errors.
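The sparse features mentioned above can be illustrated with a small sketch. This is a simplified illustration of the idea, not Rasa's featurizer code: the helper names, the toy vocabulary, and the choice of character 3-grams with space padding are assumptions. It shows why character-level multi-hot features stay informative for misspelled tokens while a token-level one-hot encoding does not.

```python
def build_vocab(items):
    """Map each distinct item to a stable index."""
    return {item: i for i, item in enumerate(sorted(set(items)))}


def char_ngrams(token, n=3):
    """Character n-grams of a token, padded with spaces at both ends."""
    padded = f" {token} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def featurize(token, token_vocab, ngram_vocab, n=3):
    """Concatenate a token-level one-hot and a character-level multi-hot vector."""
    one_hot = [0] * len(token_vocab)
    if token in token_vocab:
        one_hot[token_vocab[token]] = 1

    multi_hot = [0] * len(ngram_vocab)
    for gram in char_ngrams(token, n):
        if gram in ngram_vocab:
            multi_hot[ngram_vocab[gram]] = 1
    return one_hot + multi_hot


if __name__ == "__main__":
    # Toy training vocabulary (hypothetical tokens without diacritics).
    train_tokens = ["lien", "ket", "momo"]
    token_vocab = build_vocab(train_tokens)
    ngram_vocab = build_vocab(g for t in train_tokens for g in char_ngrams(t))

    known = featurize("lien", token_vocab, ngram_vocab)
    typo = featurize("liem", token_vocab, ngram_vocab)  # misspelled "lien"
    print(sum(known), sum(typo))
```

The misspelled token gets an all-zero one-hot part because it is out of vocabulary, yet it still activates the character n-grams it shares with the correct spelling.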

Figure 1. Rasa DIET architecture from the paper.

During model training, we consistently faced a serious class imbalance that prevented our intent classification model from performing well. The model does not learn enough from minority intents and tends to predict majority intents, since they make up most of the training samples. To alleviate this problem, Rasa provides a balanced batching strategy that loads training data so that all classes are represented in every batch, or at least in as many subsequent batches as possible, while still mimicking the fact that some classes are more frequent than others.
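The idea behind balanced batching can be sketched as follows. This is a simplified illustration under our own assumptions, not Rasa's actual implementation: the `balanced_batches` function and its per-class quota rule (a proportional share of the batch, but never fewer than one sample) are hypothetical.

```python
import random
from collections import defaultdict


def balanced_batches(samples, batch_size, seed=0):
    """Group (text, label) pairs into batches that mix all classes.

    Each class contributes a per-batch quota proportional to its frequency,
    with a minimum of one sample, until its examples run out. With many
    classes a batch can therefore be slightly larger than batch_size.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in samples:
        by_label[label].append((text, label))
    for items in by_label.values():
        rng.shuffle(items)

    total = len(samples)
    quota = {
        label: max(1, round(len(items) / total * batch_size))
        for label, items in by_label.items()
    }
    cursor = {label: 0 for label in by_label}

    batches = []
    while any(cursor[label] < len(items) for label, items in by_label.items()):
        batch = []
        for label, items in by_label.items():
            start = cursor[label]
            batch.extend(items[start:start + quota[label]])
            cursor[label] = start + quota[label]
        rng.shuffle(batch)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    data = [(f"a{i}", "A") for i in range(90)] + [(f"b{i}", "B") for i in range(10)]
    for batch in balanced_batches(data, batch_size=10)[:3]:
        print(sorted(label for _, label in batch))
```

In this toy run the minority class appears once in every batch instead of being concentrated in a few of them, while the majority class still dominates each batch.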

Figure 2. Chatbot system with hierarchical architecture for intent classification

The data loading strategy could not fully eliminate the imbalance, so we kept looking for other ways to combat it. Our intents are naturally organized into multiple levels, from general to specific. For example, in the third release, we divided bank issues into two levels: a general level (e.g., Bank linking) and a specific level (e.g., Card not registered for online payment). Figure 2 illustrates our full chatbot system with this hierarchical architecture. An input sentence first goes through the general intent classifier in the NLU component to see whether it belongs to one of the four main banking matters. The corresponding specific intent classifier then predicts the exact problem that the user is facing.
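The two-level routing described above can be sketched in a few lines. The classifiers here are toy keyword stubs standing in for trained models, and the function name, return format, and confidence values are assumptions for illustration only.

```python
from typing import Callable, Dict, Optional, Tuple

# A classifier maps a sentence to (intent, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def classify_hierarchically(
    text: str,
    general: Classifier,
    specific: Dict[str, Classifier],
) -> Dict[str, Optional[object]]:
    """Run the general classifier, then the specific one for that topic."""
    general_intent, general_conf = general(text)
    result: Dict[str, Optional[object]] = {
        "general": general_intent,
        "general_confidence": general_conf,
        "specific": None,
        "specific_confidence": None,
    }
    specific_clf = specific.get(general_intent)
    if specific_clf is not None:
        result["specific"], result["specific_confidence"] = specific_clf(text)
    return result


# Toy stand-ins for trained models (hypothetical rules and scores).
def toy_general(text: str) -> Tuple[str, float]:
    return ("Nạp_tiền", 0.9) if "otp" in text.lower() else ("Liên_kết", 0.8)


def toy_deposit(text: str) -> Tuple[str, float]:
    return ("OTP_-_Không_nhận_được_mã", 0.7)


if __name__ == "__main__":
    routed = classify_hierarchically(
        "nap tien khong nhan duoc ma otp",
        toy_general,
        {"Nạp_tiền": toy_deposit},
    )
    print(routed)
```

Because each specific classifier only has to separate the intents inside one topic, it sees a smaller and less imbalanced label set than a single flat classifier would.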

Figure 3. Improvements made from both better data and more complex models

Figure 3 shows how different model architectures and data quality affect accuracy on a separate set of nearly 5000 bank-related sentences. The Hierarchical model (v3) outperforms the Single model (v2), improving accuracy by approximately 13%. Moreover, a model trained on better data performs better. In particular, we reviewed our bad data, filtered out mislabeled samples and intents with fewer than five samples, and then trained the Hierarchical model (v3) on the cleaned data. The resulting model reached roughly 75% accuracy, 8% higher than the model trained on bad data. This shows that data quality is crucial in building high-performance intent classification models.
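One of the cleaning steps above, dropping intents with fewer than five samples, is simple enough to sketch. The function name and the toy data are hypothetical; removing mislabeled samples required manual review and is not covered here.

```python
from collections import Counter


def drop_rare_intents(pairs, min_samples=5):
    """Keep only (sentence, intent) pairs whose intent has enough samples."""
    counts = Counter(intent for _, intent in pairs)
    return [(text, intent) for text, intent in pairs if counts[intent] >= min_samples]


if __name__ == "__main__":
    data = [(f"s{i}", "frequent") for i in range(6)] + [("s6", "rare"), ("s7", "rare")]
    cleaned = drop_rare_intents(data)
    print(len(data), "->", len(cleaned))
```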

According to the results in Figure 3, our entire current system achieved 29% correct suggestions out of the total number of suggestions on a test set of roughly 900 real dialogues. Meanwhile, our previous system (Hierarchical model (v3) + Bad data) only scored 16%, which means it made far more suggestions than correct ones. The main reason for this gap is that the previous system still gave instructions for user sentences unrelated to banking issues, while the current system largely avoids this. To accomplish that, we tested two approaches to strictly limit when suggestions are made. In the first experiment, we set a strict confidence threshold on predictions, so that only predictions with confidence scores above the threshold were presented to users. In the second, we set a threshold and also added an “others” class containing junk sentences. Figure 4 shows that combining the confidence threshold with the additional “others” class reduced the number of unreasonable suggestions, raising the rate of correct suggestions from 16% to 29%.
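The two filters and the metric reported here can be sketched together. The threshold value of 0.8, the function names, and the toy predictions are assumptions for illustration; the post does not state the threshold we actually use.

```python
def select_suggestion(ranking, threshold=0.8, reject_label="others"):
    """Return the top intent, or None when no suggestion should be made.

    ranking is a list of (intent, confidence) pairs. A suggestion is
    suppressed when the top intent is the reject class or when its
    confidence is below the threshold.
    """
    if not ranking:
        return None
    intent, confidence = max(ranking, key=lambda pair: pair[1])
    if intent == reject_label or confidence < threshold:
        return None
    return intent


def correct_suggestion_rate(suggestions, gold):
    """Share of correct suggestions among the suggestions actually made."""
    made = [(s, g) for s, g in zip(suggestions, gold) if s is not None]
    if not made:
        return 0.0
    return sum(s == g for s, g in made) / len(made)


if __name__ == "__main__":
    rankings = [
        [("deposit", 0.95), ("others", 0.03)],   # confident, suggested
        [("others", 0.90), ("linking", 0.10)],   # junk sentence, suppressed
        [("linking", 0.50), ("deposit", 0.40)],  # low confidence, suppressed
    ]
    suggestions = [select_suggestion(r) for r in rankings]
    print(suggestions)
    print(correct_suggestion_rate(suggestions, ["deposit", "greeting", "unlink"]))
```

Suppressed predictions do not count toward the denominator, which is why cutting unreasonable suggestions raises the rate of correct ones even if the classifier itself is unchanged.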

Figure 4. The proportion of correct suggestions for different approaches

As mentioned before, user questions are very noisy and contain many spelling mistakes, teen-code, and acronyms. To investigate how much spelling errors influence the model’s predictions, we corrected the inputs before feeding them into the model, adopting a Transformer-based spelling correction architecture published by Tran et al. and trained on a corpus of financial texts and subtitles. Spelling correction improved accuracy on general and specific intents by 1% and 5% respectively compared to no spelling correction (Figure 5). Despite this gain, spelling correction is highly time-consuming: correcting an input sentence takes much longer than inferring its label. Therefore, it should be optimized before being plugged into our system.

Figure 5. Influence of spelling correction and pre-trained word embeddings

With a cleaner dataset containing minimal typos, we were eager to examine the effect of pre-trained word embeddings. In a series of experiments, we replaced token-level one-hot encodings with fastText, an open-source library for efficient learning of word representations released by Facebook AI Research. Based on the results, applying word embeddings hurts model performance, which drops from 80% to 78% on specific intents and from 94% to 91% on general intents. One possible reason is that the dataset still contains spelling mistakes, which leads to a loss of information when using word embeddings. Meanwhile, 3-grams of token-level one-hot encodings help capture information aggregated from the surrounding context.

Figure 5 also illustrates the performance difference between the domain-general fastText word embeddings from Facebook and our domain-specific fastText embeddings, trained on 2GB of financial texts with the same settings as Facebook’s public fastText. The results show that our domain-specific pre-trained word embeddings boost performance on specific intents by 2% compared to the domain-general ones. Embeddings pre-trained on text close to our fine-tuning data appear to yield richer features. However, there is little variation on general intents, which leads us to observe that pre-training on domain-specific resources can have a significant impact on datasets with a small number of samples, but matters less on larger datasets that already carry enough information.

Below, we show some examples of how our system answers user questions in a multi-turn manner.

User: Toi lien ket voi tai khoan ngan hang SCB ko duoc ban xem giup
(I can't link to SCB bank account, please help me)
Agent: Có phải bạn đang có nhu cầu muốn liên kết Ngân hàng SCB đúng không ạ?
(Are you in need of a link with SCB Bank?)
User: đúng b
(Yes)
Agent: Dạ cho mình hỏi tài khoản/thẻ của bạn có đang sử dụng bình thường không ạ?
(Can I ask if your account/card is in normal use?)
User: lâu chưa sd nên mình cũng k biết
(I don't know since I haven't used it in a while)
Agent: Dạ mình kiểm tra bạn liên kết ngân hàng không thành công do tài khoản đang bị đóng hoặc không tồn tại nên bạn liên hệ ngân hàng SCB 1800545438 hoặc 19006538 để được hỗ trợ kiểm tra lại trạng thái tài khoản giúp mình nhé.❤
(I checked your bank link failed because the account is closed or does not exist, so please contact SCB bank 1800545438 or 19006538 for support to re-check the status of your account.❤)

Future Improvements

In future work, we will continue working toward a better understanding of user questions through the improvements in model architecture and data listed below:

  • We plan to incorporate pre-trained language models such as BERT (for Vietnamese) into our Rasa intent classification model in order to extract contextualized word embeddings and use them as dense features to make predictions.
  • The spelling correction module plays an essential role in our system if we want to take full advantage of pre-trained language models, since most of them are trained on large-scale and fairly clean corpora. We will try to optimize the spelling correction module to make it faster and more accurate, so that it meets our tight inference time and performance requirements.

From passive to active

In production, when using the chatbot to respond to user queries, we have received useful feedback, for example that responses are becoming more and more mechanical and stereotypical. We attribute this to the fact that we display formal templates from the MoMo customer service team directly to users without any customization.

We have already planned work to tackle this challenge in the near future, such as automatically generating responses from user intents and templates, or operating in a multi-turn manner. However, we realize that there are limits and conflicts between these improvements and the design of a “passive” suggestion system. For example, if agents choose not to select suggestions from our system, the current flow breaks, leading to suggestions that are incorrect and irrelevant to the conversation. Therefore, we are aiming for a complete chatbot platform that chats directly with users and answers their questions in a more human-like fashion.
