Dissertation: An Automated SMS FAQ Retrieval System in a Multi-Lingual Environment

The aim of this dissertation is to investigate automatic replies to client queries sent via SMS in multiple languages, the main languages in this study being English and Maltese. The efficiency of different algorithms is analysed for a system that fetches the FAQ answer corresponding to the client’s question from a list of corpora and sends it back to the same client.

Natural language processing is a field of artificial intelligence, computer science and linguistics concerned with the interactions between computers and human (natural) languages. The anytime/anywhere ease of access provided by mobile networks and the portability of handsets, coupled with the strong human urge for immediate answers, has fuelled the growth of information-based services on mobile devices. These services can be simple advertisements, polls and alerts, or complex applications such as browsing, search and e-commerce.

The latest mobile devices come equipped with high-resolution screens, built-in web browsers and full message keypads; however, the majority of users still use cheaper models that have limited screen space and a basic keypad. Hence, questions are more likely to be noisy and to contain spelling mistakes, abbreviations, deletions, phonetic spellings, transliterations, etc. The challenge for automated FAQ retrieval systems is therefore how to return answers to the user’s natural language questions. The whole process is complicated, as it involves a number of different but closely combined techniques, including query rewriting and formulation, question classification, information retrieval, passage retrieval, answer extraction, answer ranking and justification.

Using natural language processing and the tools it offers, the system would be able to analyse a client’s query received via SMS and return an SMS reply related to that same query. The system would need to handle the noise found in each SMS received; in other words, it would need to clean the SMS, analyse it, fetch an answer from a list of FAQs and return it to the user.
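The cleaning step described above can be sketched as follows. This is only an illustrative outline, not the normalisation procedure used in the dissertation: the function name clean_sms and the ABBREVIATIONS table are hypothetical, and a real system would need far larger, language-specific lexicons for both English and Maltese.

```python
import re

# Hypothetical abbreviation table (an assumption made for illustration only).
ABBREVIATIONS = {
    "u": "you",
    "r": "are",
    "pls": "please",
    "plz": "please",
    "2": "to",
    "msg": "message",
}


def clean_sms(text):
    """Return a list of normalised tokens for a noisy SMS."""
    text = text.lower()
    # Replace every character that is neither a word character nor
    # whitespace (i.e. punctuation and symbols) with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of three or more identical characters to a single one,
    # e.g. "plsss" becomes "pls".
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Split on whitespace and expand any known abbreviation.
    return [ABBREVIATIONS.get(token, token) for token in text.split()]
```

For example, clean_sms("Plsss how 2 setup internet???") yields the tokens please, how, to, setup, internet. Dictionary lookup alone would not cover phonetic spellings or transliterations, which would call for approximate string matching.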

During this project, a widely used NLP technique called term frequency-inverse document frequency (TF-IDF), together with a common similarity measure called cosine similarity, was used to identify an appropriate reply for a given query. The results obtained after testing a number of SMSs falling under four specific categories are rather encouraging. Out of 602 SMSs tested, the system was able to identify the category for 506 (about 84%). Moreover, although the average similarity scores obtained during these tests varied because some of the SMSs were too noisy, this method still yielded good results. When testing 142 SMSs related to Internet, it was found that the system calculated a similarity score of more than 0.1322 for 90% of them. This is a rather promising result.
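To make the retrieval step concrete, the following minimal sketch shows one way TF-IDF weighting and cosine similarity could be combined to pick the closest FAQ entry for a tokenised query. It is written under stated assumptions and is not the dissertation's implementation: the function names, the smoothed IDF formula and the length-normalised term frequency are illustrative choices, and the FAQ list is assumed to be non-empty.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} vector per tokenised document, plus the IDF table."""
    n = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so that a term occurring in every
    # document still receives a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query_tokens, faq_tokens):
    """Return (index, score) of the FAQ entry most similar to the query."""
    vectors, idf = tfidf_vectors(faq_tokens)
    tf = Counter(query_tokens)
    # Query terms that never occur in the FAQ collection are ignored.
    query = {t: (c / len(query_tokens)) * idf[t] for t, c in tf.items() if t in idf}
    scores = [cosine(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

Given the two tokenised FAQ questions "how to configure internet settings" and "how to check my bill", the query tokens internet, settings, please would select the first entry, since the second shares no terms with the query. A system of this kind could also reject matches whose score falls below a chosen cut-off; any such cut-off would have to be tuned on real data.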