LSTM based Ensemble Network to enhance the learning of long-term dependencies in chatbot

A chatbot is software that can simulate a conversation between people and machines, at a specific level of articulation, using natural human language. With the advent of AI, chatbots have evolved from simple rule-based models into increasingly sophisticated ones. A striking feature of current chatbot systems is their capacity to maintain and support specific features and contexts of a conversation, enabling human interaction in real-time environments. The paper presents a detailed review of the models used to handle the learning of long-term dependencies in a chatbot. The paper proposes a novel hybrid Long Short Term Memory based Ensemble model to retain information in specific situations. The proposed model uses a defined number of Long Short Term Memory networks working together to produce the aggregate prediction class for the input query and conversation. We found that both ensemble methods, LSTM and GRU, work well in different dataset environments and that the ensemble technique is an effective one in chatbot applications.


Introduction
A conversational agent, otherwise called a 'chatbot', is a software program that conducts a conversation through auditory or textual methods in a natural language such as English. Chatbots are being integrated into our lives everywhere in the form of virtual assistants and messaging applications. In 1950, Alan Turing proposed the 'Turing Test' as a benchmark of a program's ability to imitate a human in a conversation [3]. ELIZA, Jabberwacky and ALICE were a few of the initial chatbots, built on a rule-based approach [1]. ELIZA, developed in the 1960s to simulate a psychotherapist, is considered the first-ever chatbot. Although it was capable of establishing a dialogue and simulating a human being, its model was based on rephrasing the user input whenever it matched one of a collection of hand-crafted rules. ELIZA was not designed to model human cognitive capabilities; however, it demonstrated how software could create a significant impact through the mere illusion of comprehension. The 'Statistical Revolution', driven by machine learning, blossomed in the late 1980s and mid-1990s. There has been a significant shift in the area of chatbots owing to the development of artificial intelligence. The arrival of the wave of AI-based chatbots has ushered in a new era of conversational interfaces. Another factor contributing to this progress is the noticeable change in the dynamics of human conversation, which now favours short messaging over other forms of communication. Most chatbots work through virtual assistants, messaging applications or organizations' websites. At present, the market of advanced conversational agents is shared by IBM's Watson, Apple's Siri, Google Assistant, Amazon Alexa and Microsoft's Cortana, to name a few. Efforts have been made to embed chatbot functionality seamlessly into services while narrowing the gap with human conversations.
Incorporating a memory analogous to that of the human brain into the architecture of a chatbot helps the chatbot maintain the frame of context for longer durations. Preserving the features related to the context for longer durations is termed learning long-term dependencies. Such a model brings a degree of familiarity and coherence to the conversation between the end user and the chatbot. Broad and unambiguous information about the development of memory-integrated models used in chatbots could give a thorough understanding of, and insights into, the future of chatbot research. The design and development strategy of a typical chatbot relies on the fundamental concepts shown in Figure 1 [2].
A brief look at the concepts to comprehend the variations possible at the formative stage is presented below.
-Text Processing: Word embeddings are vector representations of words within a specific vocabulary, enabling better implementation and use of statistical machine learning models.
-Machine Learning Model: The concept of the Artificial Neural Network is extensively employed for input processing, classification and generating the most appropriate response to the input query. Chatbots like Deepprobe and Superagent use the Long Short-Term Memory (LSTM) model with Seq2Seq, while Rubystar uses Seq2Seq with the Gated Recurrent Unit (GRU) [2].

Literature survey
This section aims to provide an overview of different concepts employed in handling long-term dependencies and discuss their corresponding nuances. The timeline of some fundamental technologies is listed in Table 1.

Artificial neural networks
Artificial Neural Networks (ANNs) are information-processing models inspired by the biological neural system and the capability of the brain to process information. The work on the neuron circuit and the Perceptron by Walter Pitts, Warren McCulloch and F. Rosenblatt laid the groundwork for ANNs to evolve and gain ground over traditional computing frameworks in the 1970s [1]. An ANN is composed of a large number of densely interconnected mathematical function units called 'neurons', clustered into three types of layers, as shown in Figure 2. The input layer is responsible for the initial processing of input data, whereas the output layer aggregates the final outputs and presents the result. The weighted connections between neurons in the hidden layer form the basis of the learning process, giving variable strength to the input data travelling forward towards the output neurons. An activation function such as sigmoid, ReLU or tanh is applied to the sum of the weighted inputs in a neuron [4]. The model is trained using 'Back Propagation', where the calculated error drives the update of the weights. The gradient is calculated as the change in error with respect to the change in weights, $\partial e / \partial w$, and the weight update $\Delta w$ is derived from it (in gradient descent, the negative gradient scaled by a learning rate). The new weight is obtained by adding the update $\Delta w$ to the current weight $w$. The entire process is depicted in Figure 3.
However, the brute force approach to updating weights suffers from the 'curse of dimensionality' [5]. Gradient Descent (GD) and Stochastic Gradient Descent (SGD) offer a faster way to find good weights. Both methods search for a minimum of the cost function, that is, a point where its slope is zero and the error is therefore lowest; for non-convex cost functions this point is not guaranteed to be the global minimum. GD and SGD are compared in Table 2.
ANNs can suffer from both overfitting and underfitting. Overfitting is the outcome of an overly complicated model that fits the training data too closely, showing low bias but high variance. Underfitting is the result of a model that is too simple, showing low variance but high bias [6]. ANNs deal with fixed-size vectors only and do not possess a dedicated memory element to handle sequential data, which makes them an inappropriate choice for a chatbot handling dependent vectors.
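The difference between the two update schemes can be illustrated with a minimal sketch. It assumes a one-parameter model y = w·x with squared error; the function names, learning rate and toy data are illustrative choices, not taken from the paper.

```python
import random

def gradient(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2.0 * (w * x - y) * x

def batch_gd(data, w=0.0, lr=0.01, epochs=200):
    """Gradient Descent: one update per epoch, using the mean gradient over all samples."""
    for _ in range(epochs):
        g = sum(gradient(w, x, y) for x, y in data) / len(data)
        w -= lr * g  # step against the gradient
    return w

def sgd(data, w=0.0, lr=0.01, epochs=200, seed=0):
    """Stochastic Gradient Descent: one update per sample, visited in shuffled order."""
    rng = random.Random(seed)
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            w -= lr * gradient(w, x, y)
    return w

# Toy data (assumption): noiseless samples of y = 3x.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w_gd = batch_gd(data)
w_sgd = sgd(data)
```

Both runs should approach w = 3 on this toy problem; GD does so with one update per pass over the data, SGD with one update per sample.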

RNN
Recurrent Neural Network (RNN) is the class of Artificial Neural Network supplemented by edges that span adjacent time steps. Psychologist David Rumelhart's work on symbolic artificial intelligence from 1986 formed the base for the development of RNN. An RNN has two inputs, the present values and values from the recent past, enabling it to capture the dynamics of a sequence of inputs in scenarios like handwriting recognition, stock price prediction, etc. Owing to the variable size of input and output vectors, RNNs have shown significant improvement over traditional feed-forward networks in chatbots, as they are capable of exploiting a dynamically changing contextual window over input sequences. The overall architecture of RNN is shown in Figure 4.
At a given time t, the output for the state $S_t$ is calculated by applying a function to the output of the previous state $S_{t-1}$ and the current input $X_t$. Mathematically, $S_t = F(S_{t-1}, X_t)$, where F is an activation function such as tanh or ReLU. This process continues, forming an information loop for a given state over time. The unrolled structure of RNN is shown in Figure 5, along with the equations.
Like feed-forward networks, RNNs use backpropagation for training, the difference being the additional parameter 'time'. Hence it is termed 'Backpropagation Through Time (BPTT)', as shown in Figure 5 [7].
The range of context that can be used in practice is limited, as each prediction looks at the state value one step prior. While back-propagating through the recurrent connections, the influence of a given input vector on the corresponding hidden layer, and hence on the overall network output, either decays or blows up exponentially, giving rise to the Vanishing Gradient and Exploding Gradient problems respectively, as shown in Figure 6. Both problems cause the model to train poorly and its performance to degrade.
A prediction of a state at time t may depend on an input presented at an earlier time T, where T << t. When the gap between T and t grows large, it becomes challenging for the model to converge, causing the RNN to fail at handling 'long-term dependencies', which makes it an unsuitable model for chatbots dealing with time series conversations [8].
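The recurrence can be made concrete with a small sketch. It assumes a scalar state, tanh as the activation F, and hypothetical weights `w_s`, `w_x` and bias `b` that do not come from the paper.

```python
import math

def rnn_step(s_prev, x_t, w_s=0.5, w_x=1.0, b=0.0):
    """One recurrence step: S_t = tanh(w_s * S_{t-1} + w_x * X_t + b)."""
    return math.tanh(w_s * s_prev + w_x * x_t + b)

def run_rnn(inputs, s0=0.0):
    """Unroll the recurrence over a sequence and return every state."""
    states = []
    s = s0
    for x_t in inputs:
        s = rnn_step(s, x_t)  # the new state feeds the next step
        states.append(s)
    return states

# A single impulse followed by zeros: later states still carry a trace of it.
states = run_rnn([1.0, 0.0, 0.0])
```

The non-zero states at the second and third steps come only from the loop through the previous state, which is the "information loop" described above.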
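The exponential decay or growth can be seen in a deliberately simplified sketch. It assumes a linear scalar recurrence S_t = w · S_{t-1}, for which the gradient of a state with respect to a state k steps earlier is simply w raised to the power k; real networks add activation derivatives and matrices, but the multiplicative effect is the same in spirit.

```python
def gradient_through_time(recurrent_weight, steps):
    """Gradient of S_t with respect to S_{t-steps} for the linear recurrence S_t = w * S_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_weight  # one multiplicative factor per time step
    return grad

vanishing = gradient_through_time(0.5, 50)  # |w| < 1: shrinks towards zero
exploding = gradient_through_time(1.5, 50)  # |w| > 1: grows without bound
```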

LSTM
Long Short Term Memory (LSTM) networks are an extension of Recurrent Neural Networks with an explicitly extended memory capability, well suited to handling long-term dependencies [9]. LSTM networks were proposed by German researchers Sepp Hochreiter and Juergen Schmidhuber in 1997 as a solution to the vanishing gradient problem [10]. In comparison with a plain RNN, an LSTM can learn to bridge time lags in excess of 1000 discrete time steps by enforcing constant error flow through units termed 'cells', effectively dealing with long-term dependencies [11].
LSTMs hold information from a context in a gated cell. The cells control the data to be written, stored, read and erased using Forget, Input and Output gates, which are implemented as element-wise multiplications by sigmoids, as shown in Figure 7 [7]. The forget gate learns the weights controlling the decay rate of values stored in memory cells. When the input and output gates are off and the forget gate is not causing decay, the memory cell maintains its value over time, causing the gradient of the error to stay constant during backpropagation. This enables the model to remember information for more extended periods.
Mathematically each step can be explained as follows [7]:
-In the first step, the Forget Gate layer $f_t$ decides which features are to be flushed out of the cell state by looking at $h_{t-1}$ and the new input $x_t$.
-In the second step, the information to be stored in the cell state is decided in two parts. The Input Gate layer $i_t$, which is a sigmoid layer, establishes the values to be updated. Then a tanh layer generates the vector of new candidate values $\tilde{C}_t$.
-The old cell state $C_{t-1}$ is updated to the new cell state $C_t$ by summing the old state scaled by the Forget gate output, $f_t \times C_{t-1}$, and $i_t \times \tilde{C}_t$.
-The output is determined in two steps. First, a sigmoid layer $o_t$ decides which parts of the cell state to output. Then the new cell state $C_t$ is passed through tanh and multiplied by the output of the sigmoid gate, giving $h_t$, the selectively decided parts.
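The four steps above can be sketched as a single cell update. This is a minimal scalar version under stated assumptions: each gate has its own hypothetical triple of parameters (weight on h_{t-1}, weight on x_t, bias), and the gate equations follow the commonly used formulation rather than anything specified in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_h, w_x, b)."""
    def pre(gate):
        w_h, w_x, b = p[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("forget"))            # step 1: what to flush from the cell state
    i_t = sigmoid(pre("input"))             # step 2a: which values to update
    c_tilde = math.tanh(pre("candidate"))   # step 2b: new candidate values
    c_t = f_t * c_prev + i_t * c_tilde      # step 3: new cell state
    o_t = sigmoid(pre("output"))            # step 4a: which parts to expose
    h_t = o_t * math.tanh(c_t)              # step 4b: new hidden state
    return h_t, c_t

# Illustrative parameters (assumption): forget gate held open, input gate held shut.
params = {
    "forget": (0.0, 0.0, 20.0),
    "input": (0.0, 0.0, -20.0),
    "candidate": (0.5, 1.0, 0.0),
    "output": (0.5, 1.0, 0.0),
}
h, c = 0.0, 0.7
for x in [0.3, -1.0, 2.0] * 30:
    h, c = lstm_step(x, h, c, params)
```

With the forget gate close to 1 and the input gate close to 0, the cell state stays near its initial value of 0.7 across all 90 steps, which mirrors the memory-retention behaviour described in the text.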
Hyperparameter tuning and optimization is an arduous and experimental task [4]. Training an LSTM model is expensive in terms of memory and computational power. In the domain of chatbots for time series conversations, LSTM is shown to perform well and to maintain the context for longer durations.

GRU
Unlike the LSTM, the Gated Recurrent Unit (GRU) has two gates, Reset and Update, to control the flow of information and refine the outputs. Compared with the LSTM, the update gate can be considered a combination of the LSTM's Forget and Input gates. The update gate determines the portion of information from previous time steps that needs to be passed on to the next states. This gives GRU an edge over LSTM, as the model can decide to maintain all features of earlier time steps. The reset gate is used to determine the irrelevant part of the information, which needs to be discarded [12]. GRU works in the following steps:
-The update gate $z_t$ at time t is calculated.
-The reset gate $r_t$ determines the information to be forgotten.
-New memory content $\tilde{h}_t$ is introduced, which uses the reset gate to store the relevant information.
-The final memory $h_t$ is calculated, which holds the information for the current unit using the update gate output and the memory content from the previous step, $h_{t-1}$.
GRU exposes its complete memory content without a control gate, in contrast to the controlled exposure of LSTM through its Output gate. GRU explicitly controls the influx of information while calculating new memory content using the Update gate. Owing to its less complicated nature and fewer tensor operations, GRU is computationally more efficient and faster to train.
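The GRU steps can be sketched in the same scalar style. The parameter layout (weight on h_{t-1}, weight on x_t, bias) is hypothetical, and the final interpolation uses the convention in which the update gate keeps the previous state; some references swap the roles of z_t and 1 - z_t.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gru_step(x_t, h_prev, p):
    """One step of a scalar GRU; p maps gate name -> (w_h, w_x, b)."""
    w_h, w_x, b = p["update"]
    z_t = sigmoid(w_h * h_prev + w_x * x_t + b)             # update gate
    w_h, w_x, b = p["reset"]
    r_t = sigmoid(w_h * h_prev + w_x * x_t + b)             # reset gate
    w_h, w_x, b = p["candidate"]
    h_tilde = math.tanh(w_h * (r_t * h_prev) + w_x * x_t + b)  # new memory content
    # Final memory: z_t keeps the old state, (1 - z_t) admits the new content (assumed convention).
    return z_t * h_prev + (1.0 - z_t) * h_tilde

keep = {"update": (0.0, 0.0, 20.0), "reset": (0.0, 0.0, 0.0), "candidate": (0.0, 1.0, 0.0)}
overwrite = {"update": (0.0, 0.0, -20.0), "reset": (0.0, 0.0, 0.0), "candidate": (0.0, 1.0, 0.0)}
h_kept = gru_step(0.5, 0.4, keep)        # stays close to the previous state 0.4
h_new = gru_step(0.5, 0.4, overwrite)    # moves close to tanh(0.5)
```

Note that there is no separate output gate: the state `h_t` is exposed in full, as described above.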

NTM
The Neural Turing Machine (NTM) explores the concept of explicitly extending the context memory of an RNN with an addressable external memory. NTMs are an example of Memory Augmented Neural Networks, which decouple computation from memory [13]. NTMs have been shown to outperform LSTMs on sequence learning tasks that demand a large memory for memorizing more extended contexts. The Controller and the Memory matrix are the primary components of an NTM, as shown in Figure 8. The controller is a recurrent or feed-forward neural network which takes the input and returns the output. The external memory unit consists of an N × W memory matrix, where N is the number of memory locations and W is the dimension of each memory cell. The interaction between the Controller and the Memory matrix is carried out by read and write heads. The memory matrix is initialized using schemes like constant initialization or a truncated normal distribution [13]. The NTM model can be trained by variants of stochastic gradient descent, using backpropagation through time in the case of an RNN-based controller.
Algorithmic tasks like priority sort, Associative Recall, Copy, Repeat Copy, etc. can be performed to test if the NTM could be trained via supervised learning for efficient performance. NTM models generalize reasonably well to longer inputs.
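How a read head can interact with the N × W memory matrix is sketched below. The text does not spell out the addressing mechanism, so this assumes content-based addressing (cosine similarity sharpened by a softmax) followed by a weighted read; the memory contents, the key and the sharpening factor `beta` are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def content_weights(memory, key, beta=5.0):
    """Softmax over scaled cosine similarity between the key and each memory row."""
    scores = [beta * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, weights):
    """Read vector: weighted sum of the memory rows."""
    width = len(memory[0])
    return [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]

# N = 4 locations, W = 3 values per location (toy contents).
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
weights = content_weights(memory, key=[0.0, 1.0, 0.0])
r = read(memory, weights)
```

Because every operation is differentiable, the weights can be learned with gradient-based training, which is what allows the controller and memory to be trained end to end.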

Ensemble learning
The concept of ensemble learning was popularized in 1990 by Hansen and Salamon [14], based on the idea that the performance of a set of classifiers exceeds that of a single classifier. The individual models work in unison, and their outputs are combined with a particular decision fusion strategy to produce a single answer [15]. Owing to the combination of various learning models, the generalization ability tends to be better. The basic architecture of the Ensemble model is depicted in Figure 9. The variation among the member models is a critical factor for classification performance [16]; hence the following strategies were proposed for increasing diversity among the member learners:
-Employing different learning algorithms for diverse learners, or using the same algorithm with variation in parameters.
-Training the members with varied datasets by subsampling or changing the attributes.
-Using a combination of the above two methods simultaneously.
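A decision fusion strategy can be as simple as majority voting, sketched below. The three member "models" are hypothetical toy rules standing in for trained classifiers, and majority voting is only one possible fusion strategy.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most members; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# Three deliberately different toy members (assumption: stand-ins for trained models).
members = [
    lambda q: "greeting" if "hello" in q else "other",
    lambda q: "greeting" if q.startswith("h") else "other",
    lambda q: "other",
]
label = majority_vote([m("hello there") for m in members])
```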
An overall comparison between the concepts discussed along with the problem statements each methodology is well suited for is stated in Table 3.
The authors now present a comprehensive review of some of the recent works carried out in this domain (Tab. 4):

Data source
The dataset plays a prominent role in making the conversation with a chatbot as realistic as possible. The chatbot model has to be taught how to understand the text entered by a user and how to answer back. The format and syntax of this conversation vary with the purpose of the chatbot, and various categories of datasets are therefore available for training chatbots. An overview is given in Figure 10.
The dataset used in this research is the Cornell Movie Dialog Corpus, distributed by Cornell University. The dataset consists of several metadata-rich files, and its conversations are extracted from movie scripts. In total, the dataset has 220,579 conversational exchanges between 10,292 pairs of movie characters, collected from 617 movies. Two files are used for establishing the conversation data. 'movie_lines.txt' contains the text of the dialogues and has the attributes lineID, characterID, movieID, character name and the actual text. 'movie_conversations.txt' holds the structure of the conversations: it maps the conversation between two characterIDs together with the movieID of the movie. The string '+++$+++' acts as the field separator between the attributes in each file.
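Reading the two files can be sketched as follows. The sample rows are made-up lines written in the corpus layout rather than quotations from it, the function names are illustrative, and the sketch assumes the last field of 'movie_conversations.txt' is a Python-style list of lineIDs.

```python
import ast

SEP = " +++$+++ "

def parse_lines(rows):
    """Map lineID -> utterance text from rows of movie_lines.txt."""
    id2line = {}
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) == 5:
            line_id, _char_id, _movie_id, _name, text = parts
            id2line[line_id] = text
    return id2line

def parse_conversations(rows):
    """Return each conversation as a list of lineIDs from rows of movie_conversations.txt."""
    conversations = []
    for row in rows:
        parts = row.rstrip("\n").split(SEP)
        if len(parts) == 4:
            conversations.append(ast.literal_eval(parts[3]))
    return conversations

def question_answer_pairs(id2line, conversations):
    """Pair every utterance with the one that follows it in the same conversation."""
    pairs = []
    for conv in conversations:
        for first, second in zip(conv, conv[1:]):
            if first in id2line and second in id2line:
                pairs.append((id2line[first], id2line[second]))
    return pairs

# Made-up sample rows in the corpus layout.
line_rows = [
    "L194 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ Hello.",
    "L195 +++$+++ u2 +++$+++ m0 +++$+++ CAMERON +++$+++ Hi there.",
    "L196 +++$+++ u0 +++$+++ m0 +++$+++ BIANCA +++$+++ How are you?",
]
conv_rows = ["u0 +++$+++ u2 +++$+++ m0 +++$+++ ['L194', 'L195', 'L196']"]
pairs = question_answer_pairs(parse_lines(line_rows), parse_conversations(conv_rows))
```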

Data preprocessing
Python with NumPy is used to preprocess the dataset and build the conversation dictionaries. Dictionaries are created to map each line to its corresponding 'id', and a list of all conversations is created, separating questions and answers. Individual contracted (conjoined) words are cleaned and replaced with their simple forms. A dictionary is also built to map each word to its number of occurrences, and further dictionaries map the question words and answer words to unique integer values.
Ensemble learning forms the basis of the proposed methodology. Classifiers like Support Vector Machines and Linear Regression were used in Ensemble models initially. With the onset of deep learning, a more elaborate approach can be followed to improve the overall performance of the Ensemble model. The idea is to define a number of LSTM networks with variation in hyperparameters as part of the ensemble model. The member models work together in parallel, and their individual outputs are aggregated to generate the output of the overall model. As a fine-tuning measure, the concept of pruning is also employed. An architectural overview is presented in Figure 9, followed by the detailed Ensemble Network algorithm. Segmentation, the Vector Space Model (VSM), the classification algorithm and response generation form the primary components of the chatbot. The flow of operations is shown in Figure 11. After the output class is predicted, the output of the chatbot is returned to the user.
The components of the LSTM based Ensemble network are described in Figure 12. The Encoder-Decoder LSTM acts as the base for defining both the single LSTM and the combination of LSTMs working in unison as part of the ensemble. The implementation can be broken down into specific steps, as discussed below. The proposed model includes a training phase and a testing phase, which are shown in Figures 13 and 14, respectively. In the training phase, a definite number of LSTM networks are generated and trained using variations in the training data. The models with lower accuracy are filtered out. In the testing phase, the models with higher accuracy work in conjunction to predict the output class from the calculated output weights. The details of the two phases are presented in the following sections. The notations used in the algorithm specifications are given in Table 5.

Data preprocessing
The dataset is preprocessed as described in the Data preprocessing section above. Separate functions are created for generating the dictionaries and for cleaning the text.
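The kind of helper functions involved can be sketched as follows. The contraction table, the special tokens `<PAD>` and `<OUT>`, and all function names are hypothetical; the same helpers would be applied separately to the question and answer lists.

```python
import re
from collections import Counter

# A few illustrative contractions; a real table would be longer (assumption).
CONTRACTIONS = {"i'm": "i am", "can't": "cannot", "won't": "will not", "it's": "it is"}

def clean_text(text):
    """Lowercase, expand contractions, drop punctuation and split into tokens."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"[^a-z0-9 ]+", " ", text).split()

def build_vocab(sentences, min_count=1):
    """Return (word -> occurrence count, word -> unique integer id)."""
    counts = Counter(tok for s in sentences for tok in clean_text(s))
    vocab = {"<PAD>": 0, "<OUT>": 1}  # hypothetical padding and out-of-vocabulary tokens
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return counts, vocab

def encode(sentence, vocab):
    """Translate a sentence into integer ids, using <OUT> for unknown words."""
    return [vocab.get(tok, vocab["<OUT>"]) for tok in clean_text(sentence)]

counts, vocab = build_vocab(["I'm fine", "I'm here"])
encoded = encode("I'm lost", vocab)
```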

Building the model
The Encoder LSTM is responsible for reading the input sequence and encoding it into a vector, essentially mapping it to the corresponding vector in the defined vector space. The Decoder LSTM decodes the generated vector and outputs the predicted sequence. The Encoder-Decoder LSTM generates a continuous representation of data from a considerable number of data attributes from previous time steps. This architecture was found useful on long and continuous data influx. We split the dataset into training and validation datasets in order to carry out cross-validation. As seen in Figure 12, three different decoder LSTMs are created: one to decode the training data, one to decode the validation data, and the actual decoder for the encoder created.

Training phase
Hyperparameters such as the number of epochs, batch size, LSTM size, number of layers in the Encoder and Decoder LSTMs, and learning rate are initialized for the single LSTM as well as the Ensemble LSTM model. A training session is initialized, and the models are trained on both portions of the dataset, that is, the training dataset and the validation dataset. As training progresses, the model generates the weights. The model generalizes the data for patterns and features and stores them to be used during testing. The training phase is also depicted in Figure 13.
Fig. 13. Training phase algorithm.
Step 1: Parameters such as the number of LSTM networks N, the maximum number of hidden layers in the model $h_{max}$, the maximum number of neurons in a hidden layer $p_{max}$ and the accuracy threshold for each LSTM model $w_{threshold}$ are initialized.
Step 2: N LSTM models are generated with variations in hyperparameters. The number of hidden layers $h_m$ and the number of neurons in a hidden layer $p_m$ are assigned random values between 0 and the corresponding maximum bounds.
Step 3: The training dataset is split into T partitions. $T - 1$ partitions are used to train each LSTM model. Every LSTM model trains and learns using BPTT.
Step 4: The performance of each LSTM model, along with its average accuracy $A_m$, is evaluated on the remaining single partition, which serves as the testing dataset.
Step 5: The LSTM models with accuracy lower than the accuracy threshold $w_{threshold}$ are dropped, leaving k models in the ensemble network.
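The five steps can be sketched as a small driver. Training itself is abstracted behind a caller-supplied `train_and_score` function (a hypothetical stand-in for BPTT training plus evaluation); the sketch also samples layer and neuron counts from 1 upwards, an assumption made here so that no member is an empty network.

```python
import random

def generate_configs(n_models, h_max, p_max, seed=0):
    """Step 2: random hyperparameters for each member (lower bound of 1 is an assumption)."""
    rng = random.Random(seed)
    return [
        {"hidden_layers": rng.randint(1, h_max), "neurons": rng.randint(1, p_max)}
        for _ in range(n_models)
    ]

def split_partitions(dataset, t):
    """Step 3: split the data into t interleaved partitions."""
    return [dataset[i::t] for i in range(t)]

def train_ensemble(dataset, n_models, h_max, p_max, w_threshold, t, train_and_score):
    """Steps 1-5: build members, train on t-1 partitions, score on the last, prune."""
    configs = generate_configs(n_models, h_max, p_max)
    parts = split_partitions(dataset, t)
    held_out = parts[-1]
    train = [row for part in parts[:-1] for row in part]
    kept = []
    for cfg in configs:
        accuracy = train_and_score(cfg, train, held_out)  # steps 3 and 4
        if accuracy >= w_threshold:                        # step 5: pruning
            kept.append((cfg, accuracy))
    return kept

def toy_score(cfg, train, held_out):
    """Placeholder scorer (assumption): pretends deeper members are more accurate."""
    return min(1.0, 0.5 + 0.1 * cfg["hidden_layers"])

ensemble = train_ensemble(list(range(10)), n_models=5, h_max=4, p_max=64,
                          w_threshold=0.75, t=5, train_and_score=toy_score)
```

The returned list holds the k surviving members together with their accuracies, which can later serve as fusion weights.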

Testing phase
The weights generated in training are loaded, and a session is initiated running the Encoder-Decoder LSTM model as a part of both the single model and Ensemble LSTM. The incoming queries are cleaned and preprocessed using the functions defined in Data Preprocessing. The predicted answer is returned to the user. This phase deals with applying the patterns and features learned on the testing dataset. The testing dataset is fed into the Ensemble model to predict the output classes. The testing phase is depicted in Figure 14.
Step 1: The testing dataset is retrieved and fed into the LSTM models of the ensemble.
Step 2: The output $C_k$ of each individual model k is calculated and multiplied by the corresponding weight of that model. A weighted average of the weighted outputs is then calculated.
Step 3: The weighted average is used to predict the response class.
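The fusion in Steps 2 and 3 can be sketched as below. It assumes each model outputs a list of per-class scores and that the model weights are given (for example, validation accuracies); how the weights are actually derived is not specified in this section.

```python
def weighted_fusion(outputs, weights):
    """Weighted average of per-model class scores C_k, one weight per model."""
    total = sum(weights)
    n_classes = len(outputs[0])
    return [
        sum(w * c[j] for w, c in zip(weights, outputs)) / total
        for j in range(n_classes)
    ]

def predict_class(outputs, weights):
    """Index of the class with the highest fused score."""
    fused = weighted_fusion(outputs, weights)
    return max(range(len(fused)), key=fused.__getitem__)

# Three members scoring two response classes (illustrative numbers).
outputs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
fused = weighted_fusion(outputs, [0.9, 0.5, 0.5])
chosen = predict_class(outputs, [0.9, 0.5, 0.5])
```

Changing the weights changes the outcome: if the first member is trusted far more than the others, its preferred class wins.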
An example of the variations in the hyperparameters of multiple LSTM models forming part of the Ensemble Network is presented in Figure 15.

Performance analysis
RNN, LSTM and GRU are the natural choices for the classifier in the Ensemble Network. LSTM and GRU provide an edge over RNN owing to the presence of a dedicated memory control unit that enables the learning of long-term dependencies. The selection of an appropriate model between the two depends on their key differences and on the dataset. GRU exposes its complete memory content without a control gate, in contrast to the controlled exposure of LSTM through its Output gate. LSTM does not control the amount of information flowing in from previous time steps while computing new memory content. On the other hand, GRU explicitly controls the influx of information while calculating new memory content using the Update gate. An experiment was performed to compare the performance of LSTM and GRU in time series prediction.

Prediction comparison
LSTM and GRU are closely related mechanisms for handling long-term dependencies. A comparison of their performance provided important insight for the selection of LSTM over GRU, as seen in Figure 16. Two models, a single LSTM and a single GRU, were created for the comparison. To make the comparison fair, both the LSTM and GRU models had four hidden layers with 50 neurons each and a 0.2 dropout rate. Both models were trained for 50 epochs with a batch size of 32. The dataset used to select the appropriate model for time series data analysis is the Stock Price Dataset. The dataset is split into portions randomly to generate more stable and evenly spread output values. The first split of the dataset has 700 entries in total. The second split contains 1400 data points. The final dataset consists of 3000 data points. The models were executed on the three variations of the dataset to analyze parameters such as Mean Squared Error, accuracy, loss and training time, and ultimately to decide on the most applicable model for handling time series data as well as long-term dependencies.
For Dataset 1 with 700 data points, GRU appears to perform better than LSTM, as the values predicted by GRU map to the testing values more closely than those of LSTM. For Dataset 2 with 1400 data points, the performance of LSTM and GRU appears almost equal, as the values predicted by both models correspond with the testing values. For Dataset 3, GRU shows more error and deviates from the testing data points. LSTM, on the other hand, offers better performance over the 3000 data points of Dataset 3.

Mean squared error analysis
The mean squared error is calculated for each dataset variation for both the LSTM and GRU models. Mean squared error is one way to calculate the error during backpropagation, which is the basis of training for both LSTM and GRU. A higher error value indicates performance degradation and improper training. The mean squared error for both models on the three datasets is presented in Table 6.
From the values of Mean Squared Error, we can conclude that GRU performs well, with acceptable error, on smaller datasets with fewer data points. LSTM was found to perform well on datasets with a large number of data points.
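For reference, the metric itself is the average of the squared differences between actual and predicted values. A minimal sketch, with made-up numbers that are not taken from Table 6:

```python
def mean_squared_error(actual, predicted):
    """Average of squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Illustrative values only.
mse = mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```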

Average loss calculation
The loss values are calculated for each epoch for both LSTM and GRU. For a given model, the lower the loss, the better the training of the model. Loss is the summation of the errors made on each batch of the training dataset over an epoch. The average of the loss values over the complete training procedure of 50 epochs is given in Table 7 for each variation of the dataset.
Owing to fewer computational steps, GRU generates less error than LSTM. LSTM performs well on the last dataset variation, which has the highest number of data points.

Training time comparison
The training time for the complete training process in minutes is specified for both LSTM and GRU in Table 8.
For all variations of the dataset, the LSTM model takes more time to train than the GRU model. When LSTM and GRU were used to train the chatbot models, the LSTM model showed a better overall accuracy of 71.69% against the 70.12% accuracy of GRU during training. During testing, the LSTM model showed a better accuracy of 71.59% against 70.75% for the GRU model.
Considering the four parameters, Mean Squared Error, loss, training time and accuracy, we deduce that LSTM is the better choice over GRU as a whole.

Comparison between LSTM and Ensemble LSTM
The following graphs present a comparative look at the performance of the LSTM and the Ensemble Model on the same time series data analysis. The Ensemble Model consists of three LSTMs with variation in hyperparameters. The graphs depict the mapping between the values from the testing dataset and the values predicted by the corresponding model. The graphs related to the LSTM models are given in Figure 17, and the graphs associated with the Ensemble LSTM models are given in Figure 18.
From the graphs, we can conclude that the performance of the Ensemble Model is similar to that of the single LSTM model and, in some cases, better. With fine-tuning of the Ensemble models, the performance can be improved over the standalone LSTM model.

Comparison between GRU and Ensemble GRU
The following graphs in Figures 19 and 20 provide overall performance analysis for single GRU model and Ensemble GRU model in respective sections. The Ensemble GRU model consists of three individual GRU models with variations in hyperparameters.
From the graphs, we infer that the performance of the single GRU and the Ensemble GRU is comparatively similar. The performance could be further enhanced by increasing the diversity in the hyperparameters of the individual models constituting the Ensemble model.

Comparison between Ensemble LSTM and Ensemble GRU
This section compares the performance of the Ensemble LSTM model and the Ensemble GRU model against the prediction values from the testing dataset. The training was carried out on variations of the same dataset, employing the strategy of cross-validation to make the comparison as even as possible.

Prediction comparison: Ensemble LSTM, Ensemble GRU
From graphs i and ii in Figure 21, it can be deduced that the performance of both ensemble models is almost equal in terms of close mapping with the prediction values. However, in graph iii, it is observed that Ensemble GRU performs better than Ensemble LSTM. The values predicted by Ensemble GRU are closely mapped to the actual values from the testing dataset.

Mean squared error analysis
The final Mean Squared Error for both Ensemble models is calculated. The lower the values, the better the performance of the model. The values are given in Table 9.
From the values calculated, we can deduce that no specific model performs better than the other in every variation of the dataset. In two cases, the performance of both models was comparatively equal while in one case the Ensemble GRU performs better than Ensemble LSTM.

Average loss calculation
For each epoch during training, loss values are calculated. The lower the loss values, the better the training of the model. The average loss values calculated for each dataset variation over the complete training are presented in Table 10.
From the values in the table, we deduce that the average loss during the training of Ensemble GRU is lower than that of Ensemble LSTM. This indicates that Ensemble GRU trains better on the dataset than Ensemble LSTM.
Based on the analysis of the three parameters, we conclude that neither Ensemble LSTM nor Ensemble GRU is definitively better than the other. In some cases Ensemble GRU performed better, while in others Ensemble LSTM did. By tuning the parameters of the individual models working together as an ensemble, the best performance can be achieved.

Conclusion
This paper presents a review of the evolution of the technologies applied in chatbots handling time series conversations, under the headings of architectural design and implementation. The paper also intends to contribute a sturdy groundwork on the concepts used in learning long-term dependencies, providing a roadmap towards further enhancements oriented towards a minimal set of common requirements. These basic requirements can be stated as: (1) designing a word embedding schema that is not constrained by the knowledge base; (2) a flexible and accurate conversational model; (3) reaching the true peak of imitating human conversation without requiring human intervention. The proposed LSTM based Ensemble Network architecture attempts to enhance the user experience by providing a sense of continuity of context in a series of conversations. The algorithm does so by generalizing the features that are imperative to making the conversation human. A future direction for this concept would be to apply a similar approach using GPT-1, GPT-2 or GPT-3, which are widely applied nowadays in the NLP domain.