Application of PCA-LSTM algorithm for financial market stock return prediction and optimization model

Accurately predicting stock returns can help reduce market risk. This paper briefly introduced the long short-term memory (LSTM) algorithm model for predicting stock returns and combined it with principal component analysis (PCA) to improve the prediction accuracy. Simulation experiments were conducted on 80 stocks, and the PCA-LSTM model was compared with back-propagation neural network (BPNN) and LSTM models. The results showed that the PCA method effectively identified the principal components of the variable indicators. During training, the PCA-LSTM model not only converged faster but also had a smaller error after stabilization. Moreover, the PCA-LSTM model had the highest prediction accuracy, followed by the LSTM model, and the BPNN model had the lowest.


Introduction
The rapid development of the market economy has also driven the development of the stock market. Although the stock market cannot directly create wealth, it promotes economic development and facilitates a certain degree of wealth redistribution [1], while also providing a platform for individuals to increase their wealth. However, the stock market is complex and volatile, and the stock returns of each listed company are affected by various objective and subjective factors [2]. Simply put, the stock market is a risky place, and it is almost impossible to make a profit consistently without experiencing losses. The risk can be reduced only by considering the relevant factors that influence the market. One method of reducing risk is forecasting stock returns [3], which allows investors to make informed decisions based on the results. With the development of computer technology, intelligent algorithms are gradually being used for stock forecasting. Such algorithms can mine large volumes of data to uncover the hidden patterns of stock changes, which can then be used for stock forecasting [4].
Zhang et al. [5] proposed an improved stock forecasting model. The Shanghai Stock Exchange Composite Index and the Taiwan Stock Exchange Capitalization Weighted Stock Index were used to validate the performance of the proposed method. The experimental results showed that the method outperformed other baseline methods. Yang et al. [6] developed a convolutional neural network (CNN)-based model for predicting time series using multi-factor analysis. They found that the prediction accuracy of the model was higher than that of the other models. Xiao et al. [7] proposed a cumulative autoregressive moving average method for basic stock market forecasting and found from simulation experiments that their multi-model fusion algorithm could achieve expected results, indicating universal applicability, market applicability, and stable feasibility.
This paper briefly introduced the long short-term memory (LSTM) algorithm model for forecasting stock returns and combined it with principal component analysis (PCA) to improve accuracy. Simulation experiments were then carried out with 80 stocks, and the PCA-LSTM model was compared with back-propagation neural network (BPNN) and LSTM models. The novelty of this article lies in the combination of PCA and the LSTM model: PCA screens the factors that affect stock returns, which reduces the data dimensions and speeds up computation. Selecting the important indicators also minimizes interference from redundant indicators, thereby improving the accuracy of the LSTM algorithm for predicting stock returns and providing an effective reference for predicting changes in the stock market.

Predictive models for stock returns
Intelligent algorithms that can be applied to stock return prediction include correlation rule analysis, support vector machines, and neural networks, among which the neural network algorithm uses the nonlinear characteristics of the activation function to fit hidden patterns [8]. All of these algorithms require collecting relevant indicator factors and then using them to predict stock returns [9].
The neural network algorithms used in this paper can effectively fit the nonlinear patterns underlying stock returns [10]. While the BPNN algorithm is a common choice for stock return prediction, it fails to consider that stock returns form a time series and are related to previous time periods [11]. Therefore, the LSTM algorithm, which is suitable for processing time series data, is ultimately selected for predicting stock returns. The LSTM algorithm is an extension of the recurrent neural network algorithm that can solve the problem of gradient explosion that occurs when dealing with long series data. Each neuron of the LSTM algorithm contains three gates and one neuron state. The "forget gate" [12] is used to either forget or store past data. The "input gate" is used to input the data of the current moment (a collection of indicators used to predict stock returns). The "output gate" is used to output the weighted combination of the "forget gate" and the "input gate" [13].
In the actual use of the LSTM algorithm for stock return prediction, many indicators need to be input, and some of them may not have a significant impact on stock returns. This can instead cause interference in the prediction process. Therefore, in this paper, to further improve the prediction accuracy of the intelligent algorithm, the PCA [14] and the LSTM algorithm are combined. The PCA is used to screen indicators that have the greatest influence on the direction of stock returns from the input indicators. After screening, these indicator data are used to make predictions. The flowchart of the optimized combined algorithm is shown in Figure 1.
① The indicator data of the collected training sample and the corresponding stock returns are input.
② The PCA screens the indicator data in the training sample [15]. First, the sample data form a matrix of size $i \times j$, where $i$ denotes the number of samples and $j$ denotes the number of indicators in each sample. Then, the data matrix is standardized to remove differences in magnitude between indicators. Next, the correlation coefficient matrix of the standardized data matrix is calculated to determine each indicator's variance contribution rate [16], and the indicators whose cumulative variance contribution rate exceeds 85% are used as principal components. In this article, the variance contribution rate reflects the degree to which each indicator affects changes in stock returns. The variance contribution rate is calculated by the following formula:
$$a_j = \frac{\lambda_j}{\sum_{k=1}^{p} \lambda_k},$$
where $a_j$ is the variance contribution rate of the $j$th indicator, $\lambda_j$ and $\lambda_k$ denote the eigenvalues of the $j$th and the $k$th indicator, respectively, and $p$ is the total number of eigenvalues.
③ The PCA-filtered indicator data are input into the LSTM algorithm for forward calculation. In this calculation, $f_t$ indicates the forget gate output; $b_f$, $U_f$, and $W_f$ indicate the bias term in the forget gate [17], the input term weight, and the forget gate weight, respectively; $s_t$ indicates the cyclic gate output; $b$, $U$, and $W$ are the corresponding parameters in the cyclic gate; $g_t$ indicates the external input gate unit; $b_g$, $U_g$, and $W_g$ are the corresponding parameters in the input gate; $q_t$ indicates the output gate unit; $b_q$, $U_q$, and $W_q$ are the corresponding parameters in the output gate; and $x_t$ is the sample input at the current moment.
④ Whether the LSTM algorithm has reached the termination condition is determined. If it has [18], the training is stopped; if not, the weight parameters in the neurons of the LSTM algorithm are adjusted in reverse using the stochastic gradient descent approach. The termination condition is met when the number of training iterations reaches the preset threshold or when the difference between the results of the forward calculation and the labels converges stably to the preset threshold [19].
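One standard way to write the forward calculation of step ③ with the symbols defined above is sketched below. The hidden state $h_t$, the logistic sigmoid $\sigma$, and the element-wise product $\odot$ do not appear in the text and are assumptions of this sketch; the candidate term inside $s_t$ is also often computed with $\tanh$ instead of $\sigma$.

$$
\begin{aligned}
f_t &= \sigma\left(b_f + U_f x_t + W_f h_{t-1}\right), \\
g_t &= \sigma\left(b_g + U_g x_t + W_g h_{t-1}\right), \\
s_t &= f_t \odot s_{t-1} + g_t \odot \sigma\left(b + U x_t + W h_{t-1}\right), \\
q_t &= \sigma\left(b_q + U_q x_t + W_q h_{t-1}\right), \\
h_t &= \tanh(s_t) \odot q_t.
\end{aligned}
$$

Under this formulation, the forget gate $f_t$ scales how much of the previous state $s_{t-1}$ is kept, the input gate $g_t$ scales how much of the current input enters the state, and the output gate $q_t$ scales what is passed on as $h_t$.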
After repeated training through the aforementioned steps, the indicators screened by the PCA are input into the LSTM algorithm for forward calculation to obtain the final prediction results.
The constituent stocks of the Shanghai Stock Exchange 50 index were selected as the research subject from the CSMAR database [20], with a time range from January 1, 2015 to December 31, 2021. After excluding stocks that were suspended or delisted within this period, data from 80 randomly selected stocks were used for analysis, some of which are shown in Table 1. The CSMAR database is a research-oriented, precise database in the field of economics and finance. It was developed by Shenzhen CSMAR Data Technology Co., Ltd. based on academic research needs, drawing on the professional standards of authoritative databases such as CRSP, COMPUSTAT, TAQ, and THOMSON and taking the actual national conditions of China into account. After 20 years of continuous accumulation and improvement, the CSMAR database covers 18 series, including factor research, character features, green economy, stocks, companies, overseas, information, funds, bonds, industries, economy, and commodity futures. It contains over 150 databases with more than 4,000 tables and over 40,000 fields.
In addition to various indicators, stock returns are also affected by the previous time period, which gives them the characteristics of a time series. Thus, when constructing training and testing sets from the collected stock data, it is necessary to preserve the continuity of the time series. Therefore, this paper took one year as the length of the sliding window and half a year as the sliding step length of the window. Each sliding window was one cycle of training and testing, which gave 13 periods in total. In each period, the first nine months were used as the training period and the last three months as the testing period.
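The sliding-window split described above can be sketched as follows. This is a minimal illustration using only calendar dates; the function names `add_months` and `sliding_periods` are hypothetical, and the paper does not state how trading days inside each window were handled.

```python
from datetime import date, timedelta
from typing import List, Tuple

Period = Tuple[date, date, date, date]


def add_months(d: date, months: int) -> date:
    """Return the first day of the month that lies `months` months after d."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def sliding_periods(start: date, end: date, window: int = 12,
                    step: int = 6, train: int = 9) -> List[Period]:
    """Build (train_start, train_end, test_start, test_end) tuples.

    All end dates are inclusive. `window`, `step`, and `train` are in months.
    """
    periods: List[Period] = []
    window_start = start
    # Keep sliding while the whole window still fits inside the data range.
    while add_months(window_start, window) - timedelta(days=1) <= end:
        test_start = add_months(window_start, train)
        window_end = add_months(window_start, window) - timedelta(days=1)
        train_end = test_start - timedelta(days=1)
        periods.append((window_start, train_end, test_start, window_end))
        window_start = add_months(window_start, step)
    return periods


periods = sliding_periods(date(2015, 1, 1), date(2021, 12, 31))
print(len(periods))  # 13, matching the number of periods stated in the text
print(periods[0])
```

Because the step is half the window length, consecutive windows overlap by six months.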
The variable indicators used for input in the training and test datasets are shown in Table 2. $OP_t$ represents the stock opening index at time $t$, $CP_t$ represents the stock closing index at time $t$, $HP_t$ represents the highest stock index at time $t$, $LP_t$ represents the lowest stock index at time $t$, True range represents the true range of change between the previous and current moment of the stock, MA represents the moving average, MTM represents the momentum indicator of the stock price, and RSI represents the relative strength indicator of the stock price.
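To make three of the derived indicators concrete, the sketch below computes the true range, a simple moving average, and momentum from price series. The formulas are the common textbook definitions and are an assumption here, since the exact definitions and window lengths are given in Table 2 and not in the text; the function names and sample prices are hypothetical, and RSI is omitted for brevity.

```python
from typing import List


def true_range(high: List[float], low: List[float], close: List[float]) -> List[float]:
    """True range for t >= 1: the largest of the current high-low spread and
    the distances from the previous close to the current high and low."""
    return [
        max(high[t] - low[t],
            abs(high[t] - close[t - 1]),
            abs(low[t] - close[t - 1]))
        for t in range(1, len(close))
    ]


def moving_average(close: List[float], n: int) -> List[float]:
    """Simple n-period moving average of the closing index (assumed window n)."""
    return [sum(close[t - n + 1:t + 1]) / n for t in range(n - 1, len(close))]


def momentum(close: List[float], n: int) -> List[float]:
    """Momentum: current close minus the close n periods earlier (assumed lag n)."""
    return [close[t] - close[t - n] for t in range(n, len(close))]


# Hypothetical prices for illustration only (not data from the paper).
high = [10.5, 12.0, 11.5, 10.5]
low = [9.5, 10.0, 10.5, 9.0]
close = [10.0, 11.5, 11.0, 9.5]

tr = true_range(high, low, close)
ma = moving_average(close, 2)
mtm = momentum(close, 2)
print(tr, ma, mtm)
```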

Experimental setup
The relevant parameters of the PCA-LSTM algorithm were set as follows. Four hidden layers were set in the LSTM algorithm, and the number of nodes in each layer was 1,024. The activation function in the hidden layer was the sigmoid function. The stochastic gradient descent approach was used for training. The learning step length was set to 0.02. The maximum number of training iterations was 1,000.
In addition to the PCA-LSTM algorithm, two neural network algorithms, LSTM and BPNN, were also tested. The parameter settings of the LSTM algorithm were consistent with those of the PCA-LSTM algorithm. For the BPNN, the number of nodes in the input layer depended on the number of variable indicators. The number of nodes in the hidden layer was set to 512 based on experience and the orthogonal experiment method. The output layer had one node, which was used to output the predicted stock return at the next moment. The weight adjustment method and the maximum number of training iterations of the BPNN were consistent with those of the other two prediction algorithms.
Table 3 shows the results of performing PCA on the input indicators of the PCA-LSTM algorithm, including the variance contribution rates of all 14 input indicators. After arranging the indicators in descending order of variance contribution rate and calculating the cumulative variance contribution rate, the top k variables whose cumulative variance contribution rate exceeded 85% were used as principal component variables. After calculation, it was found that nine indicators, with serial numbers 1, 5, 11, 12, 13, 14, 3, 9, and 7 in Table 3, were the main component variables.
Figure 2 shows the convergence curves of the three stock return forecasting models, BPNN, LSTM, and PCA-LSTM, during training. It was observed in Figure 2 that the mean square error (MSE) of all three models converged during the training process. The PCA-LSTM model converged the fastest, stabilizing after about 600 iterations, followed by the LSTM model, which stabilized after about 800 iterations, and the BPNN model, which stabilized after about 900 iterations. In addition, after stabilization, the MSE of the BPNN model was the largest, while the MSEs of the LSTM and PCA-LSTM models were relatively close; however, it was still evident that the LSTM model had the larger MSE of the two.
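The indicator selection rule described above (rank by variance contribution rate, then keep the top k indicators whose cumulative rate exceeds 85%) can be sketched as follows. The sketch follows the paper's description of one eigenvalue per indicator; the function name `select_principal_indicators` and the eigenvalues are hypothetical and are not the values behind Table 3.

```python
from typing import List, Tuple


def select_principal_indicators(
    eigenvalues: List[float], threshold: float = 0.85
) -> Tuple[List[int], List[float]]:
    """Return the 1-based serial numbers of the selected indicators
    (in descending order of contribution) and all contribution rates."""
    total = sum(eigenvalues)
    # Variance contribution rate: a_j = lambda_j / sum_k lambda_k
    rates = [lam / total for lam in eigenvalues]
    # Rank indicators by contribution rate, largest first.
    order = sorted(range(len(rates)), key=lambda j: rates[j], reverse=True)
    selected: List[int] = []
    cumulative = 0.0
    for j in order:
        selected.append(j + 1)
        cumulative += rates[j]
        # Stop as soon as the cumulative contribution exceeds the threshold.
        if cumulative > threshold:
            break
    return selected, rates


# Hypothetical eigenvalues for five indicators (illustrative only).
example_eigenvalues = [2.5, 0.2, 1.5, 0.5, 0.3]
selected, rates = select_principal_indicators(example_eigenvalues)
print(selected)  # [1, 3, 4]: cumulative rates are 0.5, 0.8, then 0.9
```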

Experimental results
As there were 13 periods and 80 stocks selected as research subjects, there is not enough space here to show the predicted and true values for all test sets. Therefore, only the real return of stock number 600029 over a period of 30 days and the corresponding predictions of the three forecasting models are shown in Figure 3. It can be observed from Figure 3 that the stock return fluctuated around zero, with the predicted values of the PCA-LSTM model being close to the true values, while the BPNN model showed significant deviation.
[...] 0.9571, the minimum relative error was 0.0007, and the maximum relative error was 0.0211. The MSE of the PCA-LSTM model was 30.231, the mean absolute percentage error (MAPE) was 0.9571, the minimum relative error was 0.0007, and the maximum relative error was 0.0187. The comparison of the data in Table 4 showed that the PCA-LSTM model had the highest prediction accuracy, followed by the LSTM model, and the BPNN model had the lowest.
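The evaluation measures reported above can be computed as in the following sketch. The definitions are the usual ones and are an assumption, since the paper's exact formulas (for example, whether MAPE is scaled to percent) are not given in the text; the function names and sample values are hypothetical.

```python
from typing import List


def relative_errors(y_true: List[float], y_pred: List[float]) -> List[float]:
    """Absolute error divided by the absolute true value, per sample.

    Assumes every true value is nonzero; returns close to zero make
    this ratio unstable.
    """
    return [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)]


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean square error between true and predicted returns."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute percentage error, expressed as a fraction (not percent)."""
    errors = relative_errors(y_true, y_pred)
    return sum(errors) / len(errors)


# Hypothetical true and predicted returns (illustrative only).
y_true = [2.0, 4.0, -5.0]
y_pred = [1.5, 5.0, -5.0]

rel = relative_errors(y_true, y_pred)
print(mse(y_true, y_pred), mape(y_true, y_pred), min(rel), max(rel))
```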

Conclusion
This paper briefly introduced the LSTM model for predicting stock returns and combined it with PCA to form the PCA-LSTM prediction model. Simulation experiments were then conducted on 80 stocks, and the PCA-LSTM model was compared with the BPNN and LSTM models. The findings are as follows. (1) In the results of PCA, 9 out of 14 variable indicators were identified as the main component variables, with serial numbers 1, 5, 11, 12, 13, 14, 3, 9, and 7. (2) The training errors of all three prediction models converged as the number of training iterations increased. The PCA-LSTM model converged the fastest, followed by the LSTM model, and the BPNN model was the slowest to converge. After convergence, the BPNN model had the largest error, the LSTM model had the second largest, and the PCA-LSTM model had the smallest. (3) In terms of the prediction accuracy of stock returns, the PCA-LSTM model performed best, followed by the LSTM model and then the BPNN model.