Real-time fast learning hardware implementation

Machine learning algorithms are widely used in many intelligent applications and cloud services. Currently, the hottest topic in this field is deep learning, usually represented by deep neural network structures; the artificial neural network is a typical machine learning method and an important basis of deep learning. With the massive growth of data, deep learning research has made significant achievements and is widely used in natural language processing (NLP), image recognition and autonomous driving. However, breakthroughs are still needed in the training time and energy consumption of deep learning. Based on our previous research on fast learning architectures for neural networks, this paper analyses theoretically a solution to minimize the learning time of a fully connected neural network, and proposes a new parallel algorithm structure together with a training method with tuned parameters. This strategy leads to an adaptation delay, and the impact of this delay on the learning performance is analysed using a simple benchmark case study. It is shown that a reduction of the adaptation step size can compensate the errors due to the delayed adaptation; the gain in processing time for the learning phase is then analysed as a function of the network parameters chosen in this study. Finally, to achieve real-time learning, this solution is implemented on an FPGA, owing to its parallel architecture and flexibility; this integration shows good performance and low power consumption.


Introduction
In recent years, with the advent of the Big Data era, the amount of data created in just a few years has exploded [1], and thanks to the rise in data volume, increased computing power, and the advent of deep learning, artificial intelligence has begun to grow rapidly. One of the main research domains in artificial intelligence (AI) is the artificial neural network (ANN) [2]. The ANN algorithm is inspired by biological neurons, and it was introduced in 1943 by Warren McCulloch and Walter Pitts [3]. However, in 1969, M. Minsky and S. Papert published the book "Perceptrons" [4], which pointed out that a single-layer perceptron could not implement the exclusive-OR (XOR) function and that no learning algorithm was known for multi-layer perceptrons, suggesting they were of little use. Given the status of the two men in the field of AI, the book generated a great response, and ANN research fell into a slump. In 1985, Rumelhart et al. re-introduced the backpropagation algorithm [5] for training the weights of multilayer perceptrons, which fine-tunes the connection weights of each layer backwards, optimizing the network by comparing the actual output with the desired output, thus solving the problem that Minsky thought could not be solved. This brought a second boom in ANN research. In 2006, Geoffrey Hinton, a professor at the University of Toronto and a leading figure in the field of machine learning, and his students published an article in "Science" that formally introduced the concept of deep learning [6] and started the wave of deep learning in academia and industry.
Since then, with the improvement of algorithms, the increase in computing speed and the emergence of big data, deep learning has achieved remarkable results in NLP, image recognition, board games (AlphaGo, AlphaGo Zero [7]), autonomous driving and other applications; ANN research was rejuvenated, setting off a third wave of interest that continues to date.
In brief, there are two main reasons why deep learning techniques have made great breakthroughs in recent years. The first reason is the continuous expansion of the training data sets: in the 1980s and 1990s, data sets contained tens of thousands of training examples, such as MNIST [8]; then, in the early 2010s, larger data sets (the ImageNet dataset, for instance) appeared with hundreds of thousands to tens of millions of training examples. The most used ImageNet dataset, ILSVRC 2012-2017, consists of approximately 1.5 million images. The second reason is the huge increase in the computing power of hardware devices: improvements in semiconductor devices and computing architectures have significantly reduced the time overhead of network computation, for both the training process and the inference process.
There is no doubt that AI has made very prominent progress in recent years. However, in many of the applications currently proposed [9][10][11], the networks are trained off-line, and the real-time focus is mainly on the processing of the data by the network, not on the learning phase with the adaptation of the free network parameters. Therefore, how to accelerate the training of neural networks, by software as well as by hardware, how to perform real-time training, and how to reduce the power consumption, especially during training, are still big challenges for researchers [12].
To accelerate the training of neural networks, we have proposed a solution to minimize the learning time of a fully connected neural network [13]. Based on this, the present paper describes a processing architecture in which the treatments applied to the examples of the learning base are strongly parallelized and anticipated, even before the parameter adaptation for the previous examples is completed. This strategy leads to a delayed adaptation, and the impact of this delay on the learning performance is analysed through a simple, replicable case study. It is shown that a reduction of the adaptation step size can compensate the errors due to the delayed adaptation; the gain in processing time for the learning phase is then analysed as a function of the network parameters chosen in this study.
For the implementation of this processing architecture, and considering the parallel character of neural network algorithms, GPUs, FPGAs and ASICs all offer possibilities.
The GPU is widely used as a hardware accelerator both for learning and for inference. GPUs have high memory bandwidth and highly efficient matrix-based floating-point units [14]. However, a GPU consumes considerably more power in operation than an FPGA or an ASIC [15].
FPGAs share with ASICs a high degree of hardware parallelism, and both achieve good performance with much lower power consumption than GPUs. ASIC-based accelerators demand a longer development cycle and a higher cost, with less flexibility than FPGA-based accelerators. The FPGA has therefore become another alternative hardware accelerator solution [16].
FPGAs are composed of a set of programmable logic units called Configurable Logic Blocks (CLB), a programmable interconnection network, on-die processors, transceiver I/Os, RAM blocks, DSP engines, etc. FPGAs can be reprogrammed after manufacturing to match the desired application or functionality requirements. They have the following advantages compared to other hardware solutions: (1) low power consumption; (2) customizability; (3) reconfigurability; (4) high performance [17][18][19].
FPGAs used to be developed only with hardware description languages such as VHDL and Verilog, which was a hindrance for software developers. With the adoption of software-level programming frameworks, such as the Open Computing Language (OpenCL), integrated into FPGA development tools, FPGAs are now frequently employed in deep learning [20].
However, FPGAs are often used for inference only, not for real-time learning [21]. In this work, we focus on a full implementation on FPGA (feed-forward propagation and backpropagation) of a Multi-Layer Perceptron neural network (MLPNN) algorithm, which can be used for inference once the network is trained, and for real-time learning when the network is under on-chip training.
The following sections are organized as follows: in Section 2, we introduce the main notations and equations of the algorithm; in Section 3, we present the organization and grouping of the calculations; in Section 4, we give a theoretical study of the impact of a delayed adaptation of the parameters; a practical simulation case is studied in Section 5; in the last two sections, we describe the hardware implementation and summarize the synthesis results of this implementation, with adequate explanations.

Algorithm analysis
A Multi-Layer Perceptron neural network is composed of neurons grouped in layers. The outputs of the neurons of the previous layer are the inputs of each neuron of the current layer. The notations and the equations are presented hereafter.

Notations
In the sequel of the paper, we consider a fully connected neural network (Figure 1) with q layers numbered from 1 to q.
We introduce the following notations: -N_L: the number of neurons of the L-th layer; -w^L_ij: the synaptic coefficient between neuron i of layer L-1 and neuron j of layer L; -u^L_j: the bias of neuron j of layer L; -z^L_j: the output of neuron j of layer L; -t_j: the desired value at output j of the network; -d: the adaptation step size; -E: the error.

Algorithm's equations
Concerning the activation function in the neuron, we consider either a non-linear sigmoid function f(x) defined as follows:

f(x) = 1 / (1 + e^(-x))

or a linear function, especially for the last layer:

f(x) = x

The first input layer is fed with the components of the input example, z^0_i for i ∈ [1, N_0]. The free parameters are all the "synaptic coefficients" w^L_ij and the constant (bias) parameters u^L_i, for all layers L ∈ [1, q]. The main objective in this part is to apply a gradient descent algorithm to find all the w^L_ij and u^L_i terms; for that purpose, we have to calculate the derivatives of the error with respect to the synaptic coefficients and the constant parameters, represented as ∂E/∂w^L_ij and ∂E/∂u^L_i. The equation for the forward propagation, for L = 1 to q, is:

z^L_j = f( Σ_{i=1..N_{L-1}} w^L_ij · z^{L-1}_i + u^L_j )

The error calculation equation is:

E = Σ_{j=1..N_q} (t_j - z^q_j)^2

The backward propagation is therefore initialized, for a sigmoid function in the last layer and for j ∈ [1, N_q], with:

δ^q_j = (t_j - z^q_j) · z^q_j · (1 - z^q_j)

or, for a linear function in the last layer, with:

δ^q_j = t_j - z^q_j

For L = q-1 down to 1, for all j indexes of the concerned layer, the equation is:

δ^L_j = z^L_j · (1 - z^L_j) · Σ_{k=1..N_{L+1}} w^{L+1}_jk · δ^{L+1}_k

Finally, the equations for the adaptation of the "synaptic coefficients" and the biases are:

w^L_ij ← w^L_ij + d · δ^L_j · z^{L-1}_i
u^L_j ← u^L_j + d · δ^L_j

where d is the adaptation step size.

Organization and grouping of calculations

We propose to use an architecture in which a set of calculations is performed on each clock period, as described in Table 1. Most operations can be performed by dedicated hardware resources in one system clock; some functions, which are more complicated to obtain, require several system clocks.
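As an illustration, the forward propagation, backward propagation and adaptation equations above can be sketched with NumPy. This is a minimal sketch, not the hardware implementation: the layer sizes, initialisation, target values and step size d below are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W, u):
    """Forward propagation: z^L = f(W^L z^(L-1) + u^L), sigmoid on every layer."""
    zs = [x]
    for WL, uL in zip(W, u):
        zs.append(sigmoid(WL @ zs[-1] + uL))
    return zs

def backward(zs, t, W):
    """Backward propagation of the local error terms (sigmoid output layer)."""
    deltas = [None] * len(W)
    zq = zs[-1]
    deltas[-1] = (t - zq) * zq * (1.0 - zq)            # delta^q
    for L in range(len(W) - 2, -1, -1):                # L = q-1 down to 1
        zL = zs[L + 1]
        deltas[L] = zL * (1.0 - zL) * (W[L + 1].T @ deltas[L + 1])
    return deltas

def adapt(W, u, zs, deltas, d):
    """Gradient-descent adaptation of the synaptic coefficients and biases."""
    for L in range(len(W)):
        W[L] += d * np.outer(deltas[L], zs[L])
        u[L] += d * deltas[L]

rng = np.random.default_rng(0)
sizes = [4, 8, 2]                                      # N_0, N_1, N_2 (arbitrary)
W = [rng.normal(0, 0.5, (sizes[i + 1], sizes[i])) for i in range(2)]
u = [np.zeros(sizes[i + 1]) for i in range(2)]
x, t = rng.normal(size=4), np.array([0.2, 0.8])
for _ in range(2000):
    zs = forward(x, W, u)
    adapt(W, u, zs, backward(zs, t, W), d=0.5)
print(float(np.sum((t - forward(x, W, u)[-1]) ** 2)))  # squared error shrinks
```

The update signs follow the document's convention of adding d times the local error term, the factor 2 from the derivative of the squared error being absorbed into d.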
For the sake of simplicity of the presentation, and without any loss of generality, we consider a 2-layer neural network (q = 2) in the sequel of the presented tables. To carry out the algorithm on a structure with parallel and anticipated calculations, it is necessary to introduce new variables, identified by letters in Table 2; these variables must be saved in dedicated memories. In our example, the worst case corresponds to F_j(t + 6), which is calculated at (t + 6) and used at (t + 7) and (t + 15). It is also necessary to memorize the synaptic coefficients of layer 2 (w^2_ij), which are used in the forward propagation at time (t + 5) and in the backward error updates at time (t + 13).
Considering Table 2, it appears that dedicated hardware processing resources are necessary to calculate the 17 variables listed in the table. As shown in Figure 2, thanks to a hardware resource allocation algorithm, the treatment can be done with 16 dedicated hardware processing resources, identified as R1 to R16. Some of them are loaded at 100%: this is typically the case for R3 and R6, which must calculate the sigmoid function. Others, such as R1, R2, R4 and R5, are loaded at 66%, and all remaining resources are loaded at 33%, the global load of the hardware resources Ri being equal to 50.7%. The time necessary for the propagation of the calculations is equal, for this q = 2 layers network, to 17 step times, where each step time equals one or several system clocks. This result can be extended to any value of q to obtain the processing time T as a function of the network depth.

Impact of a delayed adaptation of the parameters

Based on the analysis in the previous sections, there is a lag in the synaptic coefficient adaptation in the backpropagation algorithm. In this paper, in addition to accelerating the training by parallel operations, we anticipate the computations before the coefficients have been adapted, which results in a delayed adaptation of the synaptic coefficients. In this section, a theoretical analysis of this delayed adaptation is presented.
We take the simple case of Figure 3, where z(n) is the output of the neuron, w(n - 1) is the synaptic coefficient, t(n) is the desired value at the output of the neural network, f is the activation function, and e(n) is the error between the output of the neuron and the desired output.
The prime objective of our study is to minimize the error. Considering the adaptation over iterations n, the output, the error and the coefficient update are:

z(n) = f( w(n-1) · x(n) )
e(n) = t(n) - z(n)
w(n) = w(n-1) + d · e(n) · x(n)

which is used to search for the minimum of the sum of the squared errors. In order to simplify the description, and since it does not affect the convergence results, we will use a linear function as activation function in the rest of this section, so that z(n) = w(n-1) · x(n). The addition and multiplication operation times are denoted respectively T_+ and T_×, and the time of the multiplication by the step size d is denoted T_d.
The time for a full iteration (Fig. 4) is:

T_it = 2 T_× + 2 T_+ + T_d

so that, for N_R + 1 input data sets, the computation time for N_R + 1 sequential iterations is:

T_{N_R} = (N_R + 1) · (2 T_× + 2 T_+ + T_d)

For i = 0 to N_R, we keep the "old" coefficient w(-1), and we then try to minimize the sum of the squares of the errors made:

J = Σ_{i=0..N_R} e(i)^2, with e(i) = t(i) - w(-1) · x(i)

Hence the adaptation equation is:

w(N_R) = w(-1) + d · Σ_{i=0..N_R} e(i) · x(i)

so that all the error terms e(i), from i = 0 up to i = N_R, are computed with the same coefficient w(-1) before a single adaptation is applied. The whole delay of the calculation is presented in Figure 5. Thus we can parallelize and start the products of the inputs by the coefficient before the adaptation: the parameters are only adapted at the end of the sum of the errors, as described in Figure 6.
Hence, the computation time T'_{N_R} for N_R + 1 iterations with the delayed-adaptation parallelized structure is reduced to the time of N_R + 1 pipelined steps followed by a single adaptation, and the gain in learning time T_{N_R} - T'_{N_R} grows with N_R. On all tested examples, such as the one illustrated in Figure 8, the adaptation with delay converges as fast as the adaptation without delay.
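The delayed-adaptation scheme for the single-neuron linear case can be sketched in a few lines of Python. This is a toy, hypothetical setup (a scalar weight, Gaussian inputs, and a noiseless target coefficient w_true chosen purely for illustration), showing that the delayed adaptation still converges when the step size is reduced accordingly:

```python
import numpy as np

def train(delay_block, d, steps=2000, w0=0.0, w_true=3.0, seed=1):
    """LMS-style adaptation of a single linear weight.
    delay_block=1 adapts after every sample; delay_block=N accumulates
    N gradient terms computed with the 'old' coefficient before adapting."""
    rng = np.random.default_rng(seed)
    w, acc = w0, 0.0
    for n in range(steps):
        x = rng.normal()
        e = w_true * x - w * x          # e(n) = t(n) - z(n), with z = w x
        acc += e * x                    # gradient term uses the old coefficient
        if (n + 1) % delay_block == 0:
            w += d * acc                # (delayed) adaptation
            acc = 0.0
    return w

w_now   = train(delay_block=1,  d=0.05)
w_delay = train(delay_block=10, d=0.005)   # smaller step compensates the delay
print(w_now, w_delay)   # both approach w_true = 3.0
```

Dividing the step size by the block length keeps the effective per-sample step comparable, which mirrors the compensation discussed in the text.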

Simulation results
To test the algorithm, we have chosen to ask the neural network to learn how to calculate a Discrete Fourier Transform. The advantage of this option is to have a perfectly replicable simulation without referring to a training or a generalization base; since the training vectors can be generated randomly in a quasi-infinite way, the simulation does not depend on the size of an existing database either. For each iteration of the training, we therefore randomly generate a vector of N_FFT Gaussian random complex terms {x_i}, i ∈ [0, N_FFT - 1], each term having zero mean and unit variance for its real and imaginary parts (Table 3). Then, for each iteration, we compute the Fourier Transform of this Gaussian vector as follows:

X_k = Σ_{i=0..N_FFT-1} x_i · e^(-j 2π i k / N_FFT),  k ∈ [0, N_FFT - 1]

and use the terms of this Fourier Transform as the desired signal t_{1..N_q}. The input and output signals being complex, we present at the input a vector made up of the real parts followed by the imaginary parts. The input and output vectors therefore have sizes N_0 = 2 · N_FFT and N_q = 2 · N_FFT respectively.
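The generation of one training example can be sketched as follows (a minimal sketch; N_FFT = 16 is an arbitrary size chosen for illustration, and numpy.fft.fft implements the DFT sum above):

```python
import numpy as np

N_FFT = 16   # arbitrary transform size for the illustration

def make_example(rng):
    """One training pair: a Gaussian complex vector as input and its DFT as
    desired output, each mapped to a real vector [real parts, imaginary parts]."""
    x = (rng.normal(0, 1, N_FFT)
         + 1j * rng.normal(0, 1, N_FFT))     # zero mean, unit variance per part
    X = np.fft.fft(x)                        # desired DFT output
    inp = np.concatenate([x.real, x.imag])   # input vector, size N_0 = 2 * N_FFT
    out = np.concatenate([X.real, X.imag])   # desired vector, size N_q = 2 * N_FFT
    return inp, out

rng = np.random.default_rng(0)
inp, out = make_example(rng)
print(inp.shape, out.shape)   # (32,) (32,)
```

Because each example is drawn fresh from the generator, the training stream is effectively unlimited, exactly as argued in the text.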
We let the training run for several million examples (between 5 and 50 million depending on the simulations) for different values of the training delay, and we varied the gradient adaptation step size d. At each iteration, we took the squared error E, which we integrated over an exponential window with a forgetting factor λ by means of the following recursion:

S(n) = λ · S(n-1) + (1 - λ) · E(n)

We then divided this error by the power of the desired signal to obtain a normalized mean squared error (NMSE), which we plotted in base-10 logarithm. The main simulation parameters are summarized in Table 4.

It appears in Figure 9 that the adaptation delay degrades the performances rather quickly and can even block the convergence of the adaptation. This is very noticeable as soon as this delay exceeds about ten clock times. However, the simulation results presented in Figure 10 were obtained with a gradient adaptation step size d equal to 5 × 10^-4. It can be seen in Figure 10 that reducing this step size to 10^-4 improves the performances significantly (Table 5): it tends to solve the non-convergence problem and allows the adaptation with delay to perform very closely to the adaptation without delay. However, this reduction of the adaptation step size increases the convergence time of the learning. For an objective of -2.4 on the logarithm of the normalized mean squared error, we need to present 5 million examples with an adaptation step size of 5 × 10^-4 and 20 million examples with an adaptation step size of 10^-4. From the basic example we have used, we can therefore conclude that the implementation architecture proposed in this article optimizes the processing time, but at the cost of an increase in convergence time by a factor of 4.
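The NMSE evaluation can be sketched as follows, assuming the standard exponential-window recursion with forgetting factor λ (the exact normalisation used in the paper may differ, and the decaying error sequence below is purely synthetic):

```python
import numpy as np

def nmse_curve(sq_errors, desired_power, lam=0.999):
    """Exponentially windowed squared error, normalized by the desired-signal
    power, in base-10 log (assumed form: S(n) = lam*S(n-1) + (1-lam)*E(n))."""
    s, curve = 0.0, []
    for e in sq_errors:
        s = lam * s + (1.0 - lam) * e
        curve.append(np.log10(s / desired_power))
    return np.array(curve)

# toy usage: a synthetic error sequence decaying toward a floor of 0.01
errs = 0.997 ** np.arange(5000) + 0.01
curve = nmse_curve(errs, desired_power=1.0)
print(curve[-1])   # decays toward the floor log10(0.01) = -2
```

The forgetting factor trades smoothness of the plotted curve against how quickly it tracks changes in the instantaneous error.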

The design of real-time learning architecture
In this section, we have designed the hardware at two levels: the computation unit level and the system level. The objective is to parallelize the calculations as much as possible and especially to anticipate them without waiting for the adaptation of the coefficients linked to the previous examples. This allows us to optimize the use of hardware resources at the cost of a latency in the adaptation of the coefficients. We introduce the time of an addition T_+, of a multiplication T_×, of the calculation of the exponential function T_exp and of a division T_/. Based on the algorithm features described previously, and considering the characteristics of FPGAs, the computation unit level of our implementation consists of three parts: the forward propagation, the backpropagation, and the adaptation.

Forward propagation
Denoted FP, this component implements the two forward propagation equations (3) and (4) of Section 2; it is represented in Figure 11. Parallelizing all the multiplications, we arrive, for the FP component, at a latency T for the forward propagation step of the L-th layer equal to:

T_FP^L = T_× + ⌈log2(N_{L-1} + 1)⌉ · T_+ + T_exp + T_+ + T_/

corresponding to the parallel multiplications, the adder tree over the N_{L-1} products and the bias, and the evaluation of the sigmoid 1/(1 + e^(-x)).

Backpropagation
For the backpropagation equations of Section 2, we implement a component denoted BP_q for the last layer and BP_L for an inner layer. The two components are represented in Figures 12 and 13. Parallelizing all the calculations, we arrive, for the output layer and for an inner layer respectively, at the latency times:

T_BP^q = T_+ + 2 T_×
T_BP^L = 2 T_× + ⌈log2(N_{L+1})⌉ · T_+

corresponding to the error term (t_j - z^q_j) · z^q_j · (1 - z^q_j) for the output layer, and to the weighted sum of the next layer's error terms followed by the multiplication by z^L_j · (1 - z^L_j), computed in parallel with the adder tree, for an inner layer.

Adaptation
For the adaptation equations of Section 2 we implement a component denoted UP represented in Figure 14.
For the UP module, the latency of backpropagation in L th layer is: The structures of these modules of each layer are the same with the equivalent numbers of inputs and one output. Once we know the number of neurons, and layers, we can generate and implement the neural network quickly. Because of the reconfigurability of FPGA, the neural network can be regenerated as many times as we want the neural networks in the same chip depending on the requirements. We form the above computation units as IP cores in FPGA and use them to design the System level. The whole architecture of the system level design is presented in Figure 15.
At system level, a control unit was designed to send the control signals to each component and to synchronize the whole system on the same system clock. To regulate the flow of data, delay modules were added to hold the signals. Among all the control signals, the "mode" signal is used to choose between the "inference" mode and the "learning" mode. With this pipelined, parallel system-level architecture, one training data set is received at each system clock, and all the operations run simultaneously after a first total latency. The system repeats the same operation on each system clock until the valid signal becomes '0'. Figure 16 illustrates the flow of the training data. The advantages of this design are the following. Firstly, there is no waiting time at the inputs to load the training data sets. Secondly, all the neurons of a layer are calculated at the same time: whatever the number of neurons in a layer, the latency of this layer is the same for one neuron or for n neurons. Thirdly, the computation units operate at the same time, so the use of the on-chip resources is optimized. Thus, our approach can expedite the learning of the neural network: even when the numbers of neurons and layers are large, this work offers a remarkable reduction in training time.

Synthesis results
In this paper, we have designed a neural network with 32 input neurons, one hidden layer of 32 neurons and an output layer of 32 neurons to calculate a Discrete Fourier Transform. The input and the desired output are generated by a Matlab program, and the data used are 32 bits wide. As shown in Figure 15, we need two modules to manage the delays: a first one, denoted Delay 1, and a second one, denoted Delay 2, whose latencies add to the computation latencies to give the total latency of the system. We may draw from this theoretical analysis that the first adaptation of the synaptic coefficients and biases finishes after the 639th system clock (Table 6); from the 640th system clock onwards, there is one adaptation at each system clock.
In order to choose a suitable target FPGA device family, we analysed the numbers of registers, DSP blocks, RAM blocks, MLABs and ALUTs required for each component and each layer; the summaries are presented in Tables 7 and 8.
Tables 7 and 8 show a considerably large number of registers, which depends on the FPGA board chosen; part of them can be transformed into RAM after placement and routing. Comparing these figures with the Intel Agilex product table, we can see that even a basic member of the Agilex family offers a large enough capacity to receive this implementation. We therefore chose the Intel Agilex AGFB022R31C2E1V as the target device for our project, with Quartus 21.4.
We have developed six components individually with the Intel® HLS Compiler: FP1 for the forward propagation neurons of the hidden layer, FP2 for the forward propagation neurons of the output layer, BP2 for the backpropagation of the output layer neurons, BP1 for the backpropagation of the hidden layer neurons, UP2 for the synaptic and bias adaptation of the output layer neurons, and UP1 for the synaptic and bias adaptation of the hidden layer neurons.
Finally, we have programmed a top-level file integrating all the components to create our neural network. In addition, the basic components previously developed can be reused to regenerate new neural networks easily, for any number of neurons and layers.
After synthesis, in terms of power utilization, the logic units used 10.768 W, the RAM blocks 3.195 W, the DSPs 9.022 W and the clock network 8.392 W, while the static power was 3.210 W. The summary is presented in Figure 17. From an architectural point of view, this implementation saves considerable training time and power. In our case, the adopted frequency is only 400 MHz; however, with the new generations of FPGAs, the frequency may reach 1.5 GHz, and the learning process can thus be even faster.
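As a quick arithmetic check, summing the reported contributions gives the total on-chip power:

```python
# Power contributions reported after synthesis (watts)
power = {
    "logic units": 10.768,
    "RAM blocks": 3.195,
    "DSPs": 9.022,
    "clock network": 8.392,
    "static": 3.210,
}
total = sum(power.values())
print(f"total on-chip power: {total:.3f} W")   # 34.587 W
```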

Conclusion
In this paper, we have proposed a hardware decomposition of the computations of a whole architecture for a real-time learning implementation based on the gradient backpropagation algorithm. It was shown, in our case, that a complete step of coefficient adaptation could be performed in 23 step times. The proposed parallelization leads to an anticipation of the computations and to a delayed update of the free coefficients of the network. It was shown that the effects of this adaptation delay can be compensated by a reduction of the adaptation step size. We therefore conclude that, by slightly slowing down the adaptation phase, we can propose a parallelized architecture which significantly accelerates the processing time, with a positive overall result. Hence, we also realized a hardware implementation on an FPGA, which can easily integrate this kind of design, together with a pipelined structure. This offers a solution that receives one training data set at each system clock: not only does it reduce the training time, but it also maximizes the utilization of the on-chip resources.