Real-time fast learning hardware implementation

Ming Jun Zhang; Samuel Garcia; Michel Terre

doi:10.1051/smdo/2023001

Int. J. Simul. Multidisci. Des. Optim., 14 (2023) 1

Issue: Int. J. Simul. Multidisci. Des. Optim. Volume 14, 2023
Article Number: 1
Number of page(s): 16
DOI: https://doi.org/10.1051/smdo/2023001
Published online: 24 April 2023

Research article

Ming Jun Zhang¹,²,*, Samuel Garcia² and Michel Terre²

¹ DGUT-Cnam Institute, Dongguan University of Technology, 1, Daxue Rd., Songshan Lake, Dongguan, Guangdong Province, PR China
² HESAM/CNAM/CEDRIC, 292 rue Saint Martin, 75003 Paris, France

* e-mail: ming-jun.zhang@lecnam.net

Received: 23 October 2022
Accepted: 30 January 2023

Abstract

Machine learning algorithms are widely used in many intelligent applications and cloud services. Currently, the most active topic in this field is deep learning, which is usually represented by neural network structures. Deep learning is based on deep neural networks, and the artificial neural network is both a typical machine learning method and an important route to deep learning. With the massive growth of data, deep learning research has made significant achievements and is widely used in natural language processing (NLP), image recognition, and autonomous driving. However, breakthroughs are still needed in the training time and energy consumption of deep learning. Building on our previous research on a fast learning architecture for neural networks, this paper theoretically analyses a solution that minimizes the learning time of a fully connected neural network. We propose a new parallel algorithm structure and a training method with over-tuned parameters. This strategy leads to an adaptation delay, and the impact of this delay on the learning performance is analysed using a simple benchmark case study. It is shown that a reduction of the adaptation step size can compensate for the errors due to the delayed adaptation; the gain in processing time for the learning phase is then analysed as a function of the network parameters chosen in this study. Finally, to achieve real-time learning, the solution is implemented on an FPGA, chosen for its parallel architecture and flexibility; this implementation shows good performance and low power consumption.

Key words: Neural networks / learning algorithms / deep learning / parallel architecture / FPGA / hardware accelerator

© M.J. Zhang et al., Published by EDP Sciences, 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1 Introduction

In recent years, with the advent of the Big Data era, the amount of data created has exploded [1]. Thanks to this rise in data volume, increased computing power, and the advent of deep learning, artificial intelligence (AI) has begun to grow rapidly. One of the main research domains in AI is the artificial neural network (ANN) [2]. The ANN algorithm is inspired by biological neurons and was introduced in 1943 by Warren McCulloch and Walter Pitts [3]. In 1969, however, M. Minsky and S. Papert published the book “Perceptrons” [4], which pointed out that a single-layer perceptron could not implement an exclusive-OR (XOR) gate and that no learning algorithm was available for multi-layer perceptrons, and concluded that the approach was of little use. Given the standing of the two authors in the field of AI, the book had a great impact and ANN research fell into a slump. In 1985, Rumelhart et al. reintroduced the backpropagation algorithm [5] for training the weights of multilayer perceptrons. It adjusts the connection weights of each layer backwards, using the error between the actual output and the desired output, and thus solved the problem that Minsky had considered unsolvable. This brought a second boom in ANN research. In 2006, Geoffrey Hinton, a professor at the University of Toronto and a leading figure in the field of machine learning, and his students published an article in “Science” that formally introduced the concept of deep learning [6] and started the wave of deep learning in academia and industry.

Since then, with the improvement of algorithms and computing speed and the emergence of big data, deep learning has achieved remarkable results in NLP, image recognition, board games (AlphaGo, AlphaGo Zero [7]), autonomous driving and other applications. ANN research was rejuvenated, setting off a third boom that continues to date.

In brief, there are two main reasons why deep learning techniques have made great breakthroughs in recent years:

The first reason is the continuous expansion of the training data sets. In the 1980s and 1990s, data sets such as MNIST [8] contained tens of thousands of training examples; in the early 2010s, larger data sets (the ImageNet dataset, for instance) appeared with hundreds of thousands to tens of millions of training examples. The most widely used ImageNet dataset, ILSVRC 2012-2017, consists of approximately 1.5 million images.
The second reason is the huge increase in the computing power of hardware devices. Improvements in semiconductor devices and computing architectures have significantly reduced the time overhead of network computation, for both the training process and the inference process.

There is no doubt that AI has made very prominent progress in recent years. However, in many of the applications currently proposed [9–11], the networks are trained off-line, and the real-time focus is mainly on the processing of the data by the network and not on the learning phase, in which the free network parameters are adapted. Therefore, how to accelerate the training of neural networks in software as well as in hardware, how to train in real time, and how to reduce the power consumption, especially during training, are still big challenges for researchers [12].

To accelerate the training of neural networks, we have proposed a solution to minimize the learning time of a fully connected neural network [13]. Based on that work, this paper presents a processing architecture in which the operations applied to the examples of the learning base are strongly parallelized and anticipated, even before the parameter adaptation for the previous examples is completed. This strategy leads to a delayed adaptation, and the impact of this delay on the learning performance is analysed through a simple, replicable textbook case study. It is shown that a reduction of the adaptation step size can compensate for the errors due to the delayed adaptation; the gain in processing time for the learning phase is then analysed as a function of the network parameters chosen in this study.

Given the parallel nature of neural network algorithms, GPUs, FPGAs and ASICs are all candidates for the implementation of this processing architecture.

GPUs are widely used as hardware accelerators both for learning and for inference. They offer high memory bandwidth and highly efficient matrix-based floating-point calculations [14]. However, a GPU consumes considerably more power in operation than an FPGA or an ASIC [15].

FPGAs and ASICs both offer hardware-level parallelism and achieve good performance with much lower power consumption than GPUs. ASIC-based accelerators, however, require a longer development cycle and a higher cost, and offer less flexibility than FPGA-based accelerators. FPGAs have therefore become an alternative hardware accelerator solution [16].

FPGAs are composed of a set of programmable logic units called Configurable Logic Blocks (CLB), a programmable interconnection network, on-die processors, transceiver I/Os, RAM blocks, DSP engines, etc. FPGAs can be reprogrammed after manufacturing to meet the desired application or functionality requirements. They have the following advantages compared to other hardware solutions: (1) low power consumption; (2) customizable; (3) reconfigurable; (4) high performance [17–19].

FPGAs used to be programmed with hardware description languages such as VHDL and Verilog, which was a hindrance for software developers. With the adoption of software-level programming frameworks such as the Open Computing Language (OpenCL) in FPGA development tools, FPGAs are now frequently employed in deep learning [20].

However, FPGAs are often used for inference, not for real-time learning [21]. In this work, we focus on a full FPGA implementation (forward propagation and backpropagation) of a Multi-Layer Perceptron neural network (MLPNN) algorithm, which can be used for inference once the network is trained and for real-time learning when the network is trained on-chip. The remainder of the paper is organized as follows: Section 2 introduces the main notations and equations of the algorithm. Section 3 presents the organization and grouping of calculations. Section 4 gives a theoretical study of the impact of a delayed adaptation of the parameters. A practical simulation case is examined in Section 5. Sections 6 and 7 describe the hardware implementation and summarize and explain its synthesis results.

2 Algorithm analysis

A Multi-Layer Perceptron neural network is composed of neurons grouped into layers. The outputs of the neurons of the previous layer become the inputs of each neuron of the present layer. The notations and the equations are presented hereafter.

2.1 Notations

In the sequel of the paper, we consider a fully connected neural network (Fig. 1) with q layers numbered from 1 to q.

We introduce the following notations:

N_L : the number of neurons of the Lth layer.
$y_{i}^{L}$ : the input of the activation function of the ith neuron of the Lth layer.
$z_{i}^{L}$ : the output of the ith neuron of the Lth layer.
$w_{i j}^{L}$ : the synaptic coefficient between the jth neuron of the (L − 1)th layer and the ith neuron of the Lth layer.
$θ_{i}^{L}$ : the constant coefficient of the ith neuron of the Lth layer (bias).
t_i : the ith desired value at the output of the neural network.

Fig. 1

Fully connected neural network.

2.2 Algorithm's equations

Concerning the activation function in the neuron, we consider either a non-linear sigmoid function f(x) defined as follows:

$f (x) = \frac{e^{x}}{1 + e^{x}},$ (1)

or a linear function, especially for the last layer:

$f (x) = x .$ (2)

The first input layer is fed with $s = (z_{1}^{0}, z_{2}^{0}, \dots, z_{N_{0}}^{0})$ . The free parameters are all “synaptic coefficients”: $w_{i j}^{L}$ and the constant parameters $θ_{i}^{L}$ , for all layers L ∈ [1, q].

The main objective in this part is to apply a gradient descent algorithm to find all $w_{i j}^{L}$ and $θ_{i}^{L}$ terms. For that purpose, we have to calculate the derivative of the error E with respect to the synaptic coefficients and the constant parameters. By the chain rule applied to equation (3), both derivatives are obtained from the partial derivative of the error with respect to the input of the activation function, $\frac{\partial E}{\partial y_{j}^{L}}$ , which is the quantity propagated backwards through the network.

The equation for the forward propagation, for L = 1 to q is:

$y_{i}^{L} = \sum_{j = 1}^{N_{L - 1}} w_{i j}^{L} z_{j}^{L - 1} + θ_{i}^{L},$ (3)

$z_{i}^{L} = f (y_{i}^{L}),$ (4)

The error calculation equation is:

$E (s) = \frac{1}{2} \sum_{j = 1}^{N_{q}} {(z_{j}^{q} - t_{j})}^{2} .$ (5)

Therefore, the backward propagation is initialized, for a sigmoid function in the last layer and for j ∈ [1, N_q], with:

$\frac{\partial E}{\partial y_{j}^{q}} = (z_{j}^{q} - t_{j}) z_{j}^{q} (1 - z_{j}^{q}) .$ (6)

For a linear function in the last layer, it is initialized with:

$\frac{\partial E}{\partial y_{j}^{q}} = (z_{j}^{q} - t_{j}) .$ (7)
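
As a brief check, using only equations (1), (4) and (5), the factor $z_{j}^{q} (1 - z_{j}^{q})$ of the sigmoid case comes from the derivative of the activation function:

$f' (x) = \frac{e^{x} (1 + e^{x}) - e^{x} e^{x}}{{(1 + e^{x})}^{2}} = \frac{e^{x}}{1 + e^{x}} \cdot \frac{1}{1 + e^{x}} = f (x) (1 - f (x)),$

$\frac{\partial E}{\partial y_{j}^{q}} = \frac{\partial E}{\partial z_{j}^{q}} f' (y_{j}^{q}) = (z_{j}^{q} - t_{j}) z_{j}^{q} (1 - z_{j}^{q}) .$

With the linear function of equation (2), $f' (x) = 1$ and only the factor $(z_{j}^{q} - t_{j})$ remains.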

For L = q − 1 down to 1, and for all indexes j of the layer concerned, the equation (written here for sigmoid activations in these layers) is:

$\frac{\partial E}{\partial y_{j}^{L}} = z_{j}^{L} (1 - z_{j}^{L}) \sum_{k = 1}^{N_{L + 1}} w_{k j}^{L + 1} \frac{\partial E}{\partial y_{k}^{L + 1}} .$ (8)

Finally, the adaptation equations for the “synaptic coefficients” and the biases are:

$w_{i j}^{L} = w_{i j}^{L} - δ \frac{\partial E}{\partial y_{i}^{L}} z_{j}^{L - 1} .$ (9)

$θ_{i}^{L} = θ_{i}^{L} - δ \frac{\partial E}{\partial y_{i}^{L}} .$ (10)
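
Equations (3) to (10) can be sketched in a few lines of plain Python. This is only an illustrative software sketch, not the hardware architecture of the paper: the function names, the list-of-lists storage, the small 2-3-1 network, the initial weight range and the step size `delta = 0.1` are all assumptions chosen for the example. Hidden layers use the sigmoid of equation (1) and the last layer uses the linear function of equation (2).

```python
import math
import random

def sigmoid(x):
    # Equation (1), evaluated in a form that avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def forward(weights, biases, s):
    """Equations (3)-(4). Returns z[0..q], with z[0] the input vector s."""
    z = [list(s)]
    q = len(weights)
    for L in range(q):
        y = [sum(w * zj for w, zj in zip(row, z[-1])) + b
             for row, b in zip(weights[L], biases[L])]
        # Linear last layer (2), sigmoid elsewhere (1).
        z.append(y if L == q - 1 else [sigmoid(v) for v in y])
    return z

def train_step(weights, biases, s, t, delta):
    """One gradient step. Returns E(s) of equation (5) before the update."""
    z = forward(weights, biases, s)
    q = len(weights)
    error = 0.5 * sum((zj - tj) ** 2 for zj, tj in zip(z[q], t))
    grads = [None] * q
    grads[q - 1] = [zj - tj for zj, tj in zip(z[q], t)]  # equation (7)
    for L in range(q - 2, -1, -1):                       # equation (8)
        nxt = weights[L + 1]
        grads[L] = [z[L + 1][j] * (1.0 - z[L + 1][j])
                    * sum(nxt[k][j] * grads[L + 1][k] for k in range(len(nxt)))
                    for j in range(len(weights[L]))]
    for L in range(q):                                   # equations (9)-(10)
        for i, row in enumerate(weights[L]):
            for j in range(len(row)):
                row[j] -= delta * grads[L][i] * z[L][j]
            biases[L][i] -= delta * grads[L][i]
    return error

# Hypothetical 2-3-1 network trained repeatedly on a single example.
random.seed(0)
sizes = [2, 3, 1]
weights = [[[random.uniform(-0.5, 0.5) for _ in range(sizes[L])]
            for _ in range(sizes[L + 1])] for L in range(len(sizes) - 1)]
biases = [[0.0] * sizes[L + 1] for L in range(len(sizes) - 1)]
s, t = [0.5, -1.0], [0.25]
errors = [train_step(weights, biases, s, t, delta=0.1) for _ in range(50)]
print(errors[0], errors[-1])
```

All gradients are computed before any coefficient is modified, so the backward pass of equation (8) uses the same coefficients as the forward pass.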

3 Organization and grouping of calculations

We propose to use an architecture in which a set of calculations is performed on each clock period, as described in Table 1. Most operations can be performed by dedicated hardware resources in one system clock cycle. Some functions, which are more complicated to compute, require several system clock cycles.

For simplicity of presentation, and without loss of generality, we consider a neural network with 2 layers (q = 2) in the tables that follow. To carry out the algorithm on a structure with parallel and anticipated calculations, it is necessary to introduce new variables, identified by letters in Table 2. These variables must be saved in dedicated memories. In our example, the worst case corresponds to F_j (t + 6), which is calculated at (t + 6) and used at (t + 7) and (t + 15). It is also necessary to memorize the synaptic coefficients of layer 2 ( $w_{i j}^{2}$ ), which are used in forward propagation at time (t + 5) and in the backward error updates at time (t + 13).

Considering Table 2, dedicated hardware processing resources are necessary to calculate the 17 variables listed in the table. As shown in Figure 2, thanks to a hardware resource allocation algorithm, the processing can be done with 16 dedicated hardware processing resources, identified as R₁ to R₁₆. Some of them are loaded at 100%, typically R₃ and R₆, which must calculate the sigmoid function, while others, such as R₁, R₂, R₄ and R₅, are loaded at 66%. All other resources are loaded at 33%. The global load of the hardware resources R_i is equal to 50.7%. For this network with q = 2 layers, the time necessary for the propagation of the calculations is equal to 17 step times, where each step time equals one or several system clock cycles. This result can be extended to any value of q, and the processing time T, expressed in step times, is equal to:

$T = 8 q + 1 .$ (11)

Table 1

Clock time periods required for algorithm equations.

Table 2

Variables used in the architecture and resources involved.

Fig. 2

Temporal implementation on dedicated hardware processing resources (green: data of time t, red: data of time t + 1, blue: data of time t + 2, brown: data of time t + 3, yellow: data of time t + 4).

4 A theoretical study on the impact of a delayed adaptation of the synaptic coefficient

Based on the analysis in the previous sections, there is a lag in the synaptic coefficient adaptation of the backpropagation algorithm for neural networks. In this paper, in addition to reducing the training time through parallel operations, we reduce it further by letting the computations proceed without waiting for the synaptic coefficient adaptation, which is therefore applied with a delay. In this section, a theoretical analysis of this delayed adaptation is carried out.

We consider the simple case of Figure 3, where z(n) is the signal presented at the input of the neuron (the output of the preceding neuron), w(n − 1) is the synaptic coefficient, t(n) is the desired value at the output of the neural network, f is the activation function, and e(n) is the error between the output of the neuron and the desired output.

The prime objective of our study is to minimize the error. We consider the following criterion:

$\min_{w} e (n)^{2},$ (12)

which seeks the minimum of the squared error, with

$e (n) = t (n) - f (w (n - 1) z (n)) .$ (13)

The equation

$w (n) = w (n - 1) + δ e (n) z (n),$ (14)

is used to adapt the synaptic coefficient. To simplify the description without affecting the convergence results, we use a linear activation function in the rest of this section.

The times of an addition and of a multiplication are denoted T₊ and T_× respectively. The time of the multiplication by δ is denoted T_δ.

The time for a full iteration (Fig. 4) is:

$T_{i} = 2 T_{\times} + 2 T_{+} + T_{δ} .$ (15)

For a set of N_R + 1 input data, the computation time for N_R + 1 iterations is:

$T_{N_{R}} = (N_{R} + 1) T_{i} = (N_{R} + 1) (2 T_{\times} + 2 T_{+} + T_{δ}) .$ (16)

With a delayed adaptation, we keep the “old” coefficient w′(n − 1) for i = 0 to N_R and minimize the sum of the squared errors made:

$\min_{w'} \sum_{i = 0}^{N_{R}} {e_{n - 1} (n + i)}^{2} .$ (17)

with

$e_{n - 1} (n + i) = t (n + i) - w' (n - 1) z (n + i) .$ (18)

Hence the adaptation equation is:

$w' (n + N_{R}) = w' (n - 1) + δ \sum_{i = 0}^{N_{R}} e_{n - 1} (n + i) z (n + i)$ (19)

For i = 0 to N_R, the sum is accumulated recursively, starting from $S_{n - 1} (n - 1) = 0$ :

$S_{n - 1} (n + i) = S_{n - 1} (n + i - 1) + e_{n - 1} (n + i) z (n + i)$ (20)

Finally, when i = N_R:

$w' (n + N_{R}) = w' (n - 1) + δ S_{n - 1} (n + N_{R})$ (21)

The whole delay of the calculation is presented in Figure 5. We can thus parallelize the calculations and start the products of the inputs by the coefficient before the adaptation. The parameters are adapted only once the sum of the errors is complete, as described in Figure 6.

Hence, the computation time $T'_{N_{R}}$ for N_R + 1 iterations with a delayed adaptation parallelized structure is:

$T'_{N_{R}} = (N_{R} + 1) T_{\times} + T_{\times} + 2 T_{+} + T_{δ}$ (22)

The gain in learning time $(T_{N_{R}} - T'_{N_{R}})$ obtained with a delayed adaptation parallelized structure is:

$N_{R} (T_{\times} + 2 T_{+} + T_{δ})$ (23)
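
The relation between equations (16), (22) and (23) can be checked numerically. The sketch below is only an arithmetic illustration: the function names and the operation latencies (3, 2 and 3 clock cycles) are hypothetical values chosen for the example, not figures from this work.

```python
def sequential_time(n_r, t_mul, t_add, t_delta):
    """Equation (16): N_R + 1 full iterations, one after the other."""
    return (n_r + 1) * (2 * t_mul + 2 * t_add + t_delta)

def delayed_time(n_r, t_mul, t_add, t_delta):
    """Equation (22): delayed adaptation with a parallelized structure."""
    return (n_r + 1) * t_mul + t_mul + 2 * t_add + t_delta

def gain(n_r, t_mul, t_add, t_delta):
    """Equation (23): difference between equations (16) and (22)."""
    return n_r * (t_mul + 2 * t_add + t_delta)

# Hypothetical operation latencies, in clock cycles.
T_MUL, T_ADD, T_DELTA = 3, 2, 3
for n_r in (0, 1, 9, 99):
    t_seq = sequential_time(n_r, T_MUL, T_ADD, T_DELTA)
    t_del = delayed_time(n_r, T_MUL, T_ADD, T_DELTA)
    # The difference must match equation (23) for every N_R.
    assert t_seq - t_del == gain(n_r, T_MUL, T_ADD, T_DELTA)
    print(n_r, t_seq, t_del, t_seq - t_del)
```

For N_R = 0 the two structures take the same time, and the gain grows linearly with N_R.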

On all tested examples, such as illustrated in Figure 8, the adaptation with delay converges as fast as without delay.
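
The two adaptation rules of this section can be compared on a toy problem. The sketch below is a minimal illustration under stated assumptions, not a reproduction of the tested examples: a single linear neuron, a noise-free target t(n) = 0.7 z(n), Gaussian inputs, δ = 0.01 and N_R = 7 are all hypothetical choices, and the function names are introduced only for this example.

```python
import random

def immediate_lms(samples, delta, w0=0.0):
    """Adaptation without delay: equations (13)-(14) with a linear f."""
    w = w0
    for z, t in samples:
        e = t - w * z          # equation (13)
        w = w + delta * e * z  # equation (14)
    return w

def delayed_lms(samples, delta, n_r, w0=0.0):
    """Delayed adaptation: equations (18), (20) and (21)."""
    w = w0
    acc = 0.0    # running sum S of equation (20)
    count = 0
    for z, t in samples:
        e = t - w * z          # equation (18): error with the "old" coefficient
        acc += e * z           # equation (20)
        count += 1
        if count == n_r + 1:   # equation (21): adapt once per N_R + 1 inputs
            w = w + delta * acc
            acc, count = 0.0, 0
    return w

random.seed(1)
w_true = 0.7  # hypothetical coefficient to be learned
inputs = [random.gauss(0.0, 1.0) for _ in range(4000)]
samples = [(z, w_true * z) for z in inputs]

w_immediate = immediate_lms(samples, delta=0.01)
w_delayed = delayed_lms(samples, delta=0.01, n_r=7)
print(w_immediate, w_delayed)
```

In the delayed version each update applies the sum of N_R + 1 contributions at once, so for larger delays or larger step sizes a smaller δ may be needed to keep the adaptation stable.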

Fig. 3

Simplified error model.

Fig. 4

Simplified error model.

Fig. 5

Calculation time for N_R iterations.

Fig. 6

Illustration of adaptation with delay.

Fig. 7

Time comparison of operations with or without delay.

Fig. 8

Illustration of convergence with or without delay.

5 Simulation results

To test the algorithm, we have chosen to ask the neural network to learn how to calculate a Discrete Fourier Transform. The advantage of this option is a perfectly replicable simulation that does not rely on a training base or a generalization base. Because the training vectors can be generated randomly in quasi-unlimited numbers, the simulation does not depend on the size of an existing database either. For each iteration of the training, we therefore randomly generate a vector of N_FFT complex Gaussian random terms $\{x_{i}\}_{i \in (0, N_{FFT} - 1)}$ , each term having zero mean and unit variance for its real and imaginary parts. Then, for each iteration, we compute the Fourier Transform of this Gaussian vector as follows:

$\{t_{k}\}_{k \in (0, N_{FFT} - 1)} = \mathrm{FFT} \{\{x_{i}\}_{i \in (0, N_{FFT} - 1)}\}$ (24)

and use the terms of this Fourier Transform as the desired signal $(t_{1 \to N_{q}})$ . The input and output signals being complex Gaussian, we present at the input a vector made up of the real parts followed by the imaginary parts (Table 3). The input and output vectors therefore have sizes N₀ = 2 * N_FFT and N_q = 2 * N_FFT respectively.
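
The generation of one training pair can be sketched as follows. This is a minimal stdlib-only illustration: the function names are hypothetical, and the unnormalized DFT with the exp(−2πj·ik/N) convention is an assumption, since the scaling and sign convention used by the authors are not stated here.

```python
import cmath
import random

def dft(x):
    """Direct O(N^2) Discrete Fourier Transform (assumed: no normalization)."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]

def complex_to_real(v):
    """Map a complex vector to real parts followed by imaginary parts."""
    return [c.real for c in v] + [c.imag for c in v]

def make_training_pair(n_fft, rng):
    """Return (s, t): network input and desired output, each of size 2 * n_fft."""
    # Zero mean and unit variance for both the real and imaginary parts.
    x = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(n_fft)]
    return complex_to_real(x), complex_to_real(dft(x))

rng = random.Random(0)
s, t = make_training_pair(16, rng)
print(len(s), len(t))
```

A new pair is drawn at every training iteration, so no stored data set is needed.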

We let the training run for several million examples (between 5 and 50 million depending on the simulations) for different values of the training delay, and we varied the gradient adaptation step size δ. Finally, at each iteration, we took the squared error E, which we integrated over an exponential window with a forgetting factor λ by means of the following equation:

$e (t + 1) = λ e (t) + (1 - λ) \sum_{j = 1}^{N_{q}} {(z_{j}^{q} - t_{j})}^{2}$ (25)

We then divided this error by the power of the desired signal to obtain a normalized mean squared error (NMSE), which we plotted as a base-10 logarithm. The main simulation parameters are summarized in Table 4.
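
The error measure of equation (25) and its normalization can be sketched as below. The class name is hypothetical, and estimating the power of the desired signal with the same exponential window is an assumption made for this example; the text only states that the error is divided by that power.

```python
import math

class NmseTracker:
    """Exponentially windowed squared error, normalized and in log10."""

    def __init__(self, lam):
        self.lam = lam      # forgetting factor λ of equation (25)
        self.err = 0.0      # windowed squared error e(t)
        self.power = 0.0    # windowed power of the desired signal (assumption)

    def update(self, z, t):
        sq_err = sum((zj - tj) ** 2 for zj, tj in zip(z, t))
        sq_ref = sum(tj ** 2 for tj in t)
        # Equation (25), applied to the error and to the reference power.
        self.err = self.lam * self.err + (1.0 - self.lam) * sq_err
        self.power = self.lam * self.power + (1.0 - self.lam) * sq_ref
        if self.err == 0.0:
            return float("-inf")  # perfect match: log10 is undefined
        return math.log10(self.err / self.power)

# Toy check: an output equal to 0.9 times the target gives an NMSE of 0.01.
tracker = NmseTracker(lam=0.99)
target = [1.0, -2.0, 0.5, 3.0]
output = [0.9 * v for v in target]
for _ in range(100):
    log_nmse = tracker.update(output, target)
print(log_nmse)
```

With a constant relative error of 10% on every component, the logarithm of the NMSE is close to −2 regardless of the value of λ.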

It appears in Figure 9 that the adaptation delay degrades the performance rather quickly and can even block the convergence of the adaptation. This is very noticeable as soon as the delay exceeds about ten clock times. However, the simulation results presented in Figure 9 were obtained with a gradient adaptation step size δ equal to 5 × 10⁻⁴. It can be seen in Figure 10 that reducing this step size to 10⁻⁴ improves the performance significantly (Table 5). It tends to solve the non-convergence problem and allows the adaptation with delay to perform almost as well as the adaptation without delay. However, this reduction of the adaptation step size increases the convergence time of the learning. For a target value of −2.4 for the logarithm of the normalized mean squared error, we need to present 5 million examples with an adaptation step size of 5 × 10⁻⁴ and 20 million examples with an adaptation step size of 10⁻⁴. From the basic example we have used, we can therefore conclude that the implementation architecture proposed in this article optimizes the processing time, but at the cost of an increase in convergence time by a factor of 4.

Table 3

Complex to real mapping.

Fig. 9

Logarithm of the normalized mean squared error vs number of iterations with δ = 5 × 10⁻⁴ (N_FFT = 16).

Fig. 10

Logarithm of the normalized mean squared error vs number of iterations with δ = 10⁻⁴ (N_FFT = 16).

6 The design of real-time learning architecture

In this section, the hardware is designed at two levels: the computation unit level and the system level. The objective is to parallelize the calculations as much as possible and, in particular, to anticipate them without waiting for the adaptation of the coefficients linked to the previous examples. This optimizes the use of hardware resources at the cost of a latency in the adaptation of the coefficients. We introduce the time of an addition T₊, of a multiplication T_×, of the calculation of the exponential function T_exp, and of a division T_/.

Based on the algorithm features described previously and considering the characteristics of FPGAs, the computation unit level of our implementation consists of three parts: forward propagation, backpropagation, and adaptation.

6.1 Forward propagation

The forward propagation component, denoted FP, implements the two forward propagation equations (3) and (4) of Section 2. It is represented in Figure 11. Parallelizing all the multiplications, we arrive, for the FP component, at a latency for the forward propagation step of the Lth layer equal to:

$T_{NC_{L}} \cong T_{\times} + (\log_{2} N_{L - 1} + 2) T_{+} + T_{exp} + T_{/}$ (26)

Fig. 11

Architecture of FP.

6.2 Backpropagation

For the backpropagation equations of Section 2, we implement a component denoted BP_q for the last layer and BP_L for an inner layer. The two components are represented in Figures 12 and 13.

Parallelizing all the calculations, we arrive at the following latency times for the output layer and for an inner layer respectively:

$T_{BP_{q}} = 2 T_{\times} + T_{+}$ (27)

$T_{BP_{L}} \cong 2 T_{\times} + (\log_{2} N_{L - 1} + 1) T_{+}$ (28)

Fig. 12

Architecture of BP_q for the calculation of the backpropagation equations for the last layer.

Fig. 13

Architecture of BP_L for the calculation of the backpropagation equations for the L layer.

6.3 Adaptation

For the adaptation equations of Section 2, we implement a component denoted UP, represented in Figure 14.

For the UP module, the latency of the adaptation in the Lth layer is:

$T_{UP} = 2 T_{+} + T_{\times}$ (29)

The modules have the same structure in every layer, with the corresponding number of inputs and one output. Once the numbers of neurons and layers are known, the neural network can be generated and implemented quickly. Because the FPGA is reconfigurable, the neural network can be regenerated on the same chip as many times as needed, depending on the requirements.

We package the above computation units as IP cores in the FPGA and use them to design the system level. The whole architecture of the system level design is presented in Figure 15.

At the system level, a control unit sends the control signals to each component and synchronizes the whole system on the same system clock. To align the flow of data, delay modules were added to hold the signals. Among the control signals, the “mode” signal is used to choose between the “inference” mode and the “learning” mode. With this pipelined system level architecture, one training example is received on each system clock, and all the operations run simultaneously once the first total latency has elapsed, which is:

$T_{Lat} = \sum_{L = 1}^{q} T_{NC_{L}} + T_{BP_{q}} + \sum_{L = 1}^{q - 1} T_{BP_{L}} + T_{UP}$ (30)
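
Equations (26) to (30) can be combined into a small latency estimator. The sketch below is an arithmetic illustration only: the function names are hypothetical, the unit operation latencies in the example are placeholders rather than measured values, and rounding log₂ up to an integer number of adder-tree stages is an assumption.

```python
import math

def fp_latency(n_prev, t_mul, t_add, t_exp, t_div):
    """Equation (26): forward propagation of a layer with n_prev inputs."""
    return t_mul + (math.ceil(math.log2(n_prev)) + 2) * t_add + t_exp + t_div

def bp_last_latency(t_mul, t_add):
    """Equation (27): backpropagation for the last layer."""
    return 2 * t_mul + t_add

def bp_inner_latency(n_prev, t_mul, t_add):
    """Equation (28): backpropagation for an inner layer."""
    return 2 * t_mul + (math.ceil(math.log2(n_prev)) + 1) * t_add

def up_latency(t_mul, t_add):
    """Equation (29): adaptation of the coefficients."""
    return 2 * t_add + t_mul

def total_latency(sizes, t_mul, t_add, t_exp, t_div):
    """Equation (30). sizes = [N_0, N_1, ..., N_q]."""
    q = len(sizes) - 1
    fp = sum(fp_latency(sizes[L - 1], t_mul, t_add, t_exp, t_div)
             for L in range(1, q + 1))
    bp = bp_last_latency(t_mul, t_add) + sum(
        bp_inner_latency(sizes[L - 1], t_mul, t_add) for L in range(1, q))
    return fp + bp + up_latency(t_mul, t_add)

# Hypothetical example: 32-32-32 network, every operation taking one step.
print(total_latency([32, 32, 32], t_mul=1, t_add=1, t_exp=1, t_div=1))
```

With these placeholder values each FP stage takes 10 steps, BP_q takes 3, BP_L takes 8 and UP takes 3, for a total of 34 steps; real figures depend on the latencies of the floating-point operators actually synthesized.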

The system repeats the same operation on each system clock until the valid signal becomes ‘0’. Figure 16 illustrates the flow of the training data. This design has three advantages. Firstly, the inputs never wait: a new training example can be loaded on every system clock. Secondly, all the neurons of a layer are calculated at the same time, so the latency of a layer is the same for one neuron or for n neurons. Thirdly, all the computation units operate at the same time, which optimizes the use of on-chip resources. Our approach can thus speed up the learning of the neural network, and the benefit in training time remains when the number of neurons and layers is large.

Fig. 14

Architecture of UP for the calculation of the adaptation equations.

Fig. 15

Whole architecture of a real-time learning implementation.

Fig. 16

Training data flow.

7 Synthesis results

In this paper, we have designed a neural network with 32 inputs, one hidden layer with 32 neurons and one output layer with 32 neurons to calculate a Discrete Fourier Transform. The inputs and the desired outputs are generated by a Matlab program. The data are 32-bit floating-point values. To implement this neural network, we first analysed the latency of each component in order to develop the delay modules.

According to the pipeline architecture of the real-time learning implementation shown in Figure 15, we need two modules to manage the delays. The latency of the first one, denoted Delay 1, is:

$\mathrm{Delay1} = \mathrm{FP1} + \mathrm{FP2} + \mathrm{BP2} + \mathrm{BP1}$ (31)

The latency of the second one, denoted Delay 2, is:

$\mathrm{Delay2} = \mathrm{FP2} + \mathrm{BP2}$ (32)

Hence the total latency is:

$\mathrm{TotalLatency} = \mathrm{FP1} + \mathrm{FP2} + \mathrm{BP2} + \mathrm{BP1} + \mathrm{UP1}$ (33)

From this theoretical analysis, the first adaptation of the synaptic coefficients and biases finishes after the 639th system clock (Table 6); from the 640th system clock onwards, one adaptation is completed on every system clock.
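
The effect of this pipelining on the total training time can be sketched as follows. The total latency of 639 system clocks and the 400 MHz clock frequency are taken from this section; the function names and the non-pipelined baseline, in which each example would wait for the previous adaptation to finish, are hypothetical and serve only as a point of comparison.

```python
def pipelined_clocks(num_examples, total_latency):
    """Clock at which the last adaptation completes, one example per clock."""
    if num_examples <= 0:
        return 0
    return total_latency + num_examples - 1

def unpipelined_clocks(num_examples, total_latency):
    """Hypothetical baseline: each example waits for the previous adaptation."""
    return num_examples * total_latency

TOTAL_LATENCY = 639   # system clocks before the first adaptation completes
CLOCK_HZ = 400e6      # 400 MHz system clock

for n in (1, 2, 5_000_000):
    fast = pipelined_clocks(n, TOTAL_LATENCY)
    slow = unpipelined_clocks(n, TOTAL_LATENCY)
    print(n, fast, slow, fast / CLOCK_HZ, slow / CLOCK_HZ)
```

For a large number of examples, the pipelined time approaches one clock per example, while the baseline grows with the full latency for every example.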

In order to choose a suitable target FPGA device family, we analysed the usage of registers, DSP blocks, RAM, MLABs and ALUTs for each component and each layer; the summaries are presented in Tables 7 and 8.

Tables 7 and 8 show a considerably large number of registers, which depends on the FPGA board that has been chosen. Part of these registers can be converted into RAM after placement and routing. Comparing with the Intel Agilex product table, we can see that even an entry-level device of the Agilex family offers enough capacity for this implementation. Therefore, we chose the Intel Agilex AGFB022R31C2E1V as the target device in our project, with Quartus 21.4.

We have developed six components individually with the Intel® HLS Compiler: FP1 for the neuron of forward propagation of the first hidden layer, FP2 for the neuron of forward propagation of output layer, BP2 for the backpropagation of the neuron of output layer, BP1 for the backpropagation of the neuron of hidden layer, UP2 for synaptic and bias adaptation of the neuron of output layer and UP1 for synaptic and bias adaptation of the neuron of hidden layer.

Finally, we have written a top-level file that integrates all the components to create our neural network. In addition, the basic components developed previously can be reused to regenerate new neural networks easily for any number of neurons and layers.

After synthesis, in terms of power utilization, the logic units used 10.768 watts, the RAM blocks 3.195 watts, the DSPs 9.022 watts, and the clock 8.392 watts, and the static power was 3.210 watts. The summary is given in Figure 17. From an architectural point of view, this implementation is designed to reduce both training time and power. In our case, the clock frequency adopted is only 400 MHz; with the new generation of FPGAs, the frequency may reach 1.5 GHz, so the learning process can be made faster.

Table 4

Main simulation parameters.

Table 5

NMSE with an adaptation step size equals to 10⁻⁴.

Table 6

Latency summary.

Table 7

Resource utilization summary for each cell.

Table 8

Resource utilization summary for each layer.

Fig. 17

Summary of power utilization.

8 Conclusion

In this paper, we have proposed a hardware decomposition of the computations of a complete architecture for a real-time learning implementation based on the gradient backpropagation algorithm. It has been shown, in our case, that a complete step of coefficient adaptation can be performed in 23 step times. The proposed parallelization leads to an anticipation of the computations and to a delayed update of the free coefficients of the network. It was shown that the effects of this adaptation delay can be compensated by a reduction of the adaptation step size. We conclude that, by slightly slowing down the adaptation phase, we can propose a parallelized architecture that significantly reduces the processing time, with a positive overall result. We have also carried out a hardware implementation on an FPGA, which can easily integrate this kind of design and its pipeline structure. This implementation accepts one training example on each system clock, which both reduces the training time and maximizes the utilization of the on-chip resources.

References

Z. Xu, N. Tang, C. Xu, X. Cheng, Data science: connotation, methods, technologies, and development, Data Sci. Manag. 1, 32–37 (2021) [CrossRef] [Google Scholar]
Z. Yan, Z. Jin, S. Teng, G. Chen, D. Bassir, Measurement of Bridge Vibration by UAVs Combined with CNN and KLT Optical-Flow Method. Appl. Sci. 12, 1581 (2022) [Google Scholar]
W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5, 115–133 (1943) [CrossRef] [Google Scholar]
H.D. Block, A review of perceptrons: An introduction to computational geometry, Inf. Control 17, 501–522 (1970) [CrossRef] [Google Scholar]
D.E. Rumelhart, J.L. McClelland, Learning Internal Representations by Error Propagation (1987), pp. 318–362 [Google Scholar]
G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science (New York, N.Y.) 313, 504–507 (2006) [CrossRef] [Google Scholar]
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou et al., Mastering the game of Go without human knowledge, Nature 550, 354–359 (2017) [CrossRef] [Google Scholar]
H. Gao, S. Tomić, J. Nikolić, Z. Perić, D. Aleksić, Performance of post-training two-bits uniform and layer-wise uniform quantization for MNIST dataset from the perspective of support region choice, Math. Probl. Eng. 2022, 1463094 (2022) [Google Scholar]
A. Van Biesbroeck, F. Shang, D. Bassir, CAD model segmentation via deep learning, Int. J. Comput. Methods 18 (2020), 10.1142/S0219876220410054 [Google Scholar]
V. Shelar, S. Subramani, D. Jebaseelan, R-tree data structure implementation for Computer Aided Engineering (CAE) tools, Int. J. Simulat. Multidiscipl. Des. Optim. 12, 6 (2021) [CrossRef] [EDP Sciences] [Google Scholar]
A. Ignatov, G. Malivenko, R. Timofte, S. Chen, X. Xia, Z. Liu, Y. Zhang, F. Zhu, J. Li, X. Xiao, Y. Tian, X. Wu, C. Kyrkou, Y. Chen, Z. Zhang, Y. Peng, Y. Lin, S. Dutta, S. Das, S. Siddiqui, Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report (2021), pp. 2558–2568 [Google Scholar]
Y. Hu, Y. Liu, Z. Liu, A survey on convolutional neural network accelerators: GPU, FPGA and ASIC, 2022 14th International Conference on Computer Research and Development (ICCRD) (2022), pp. 100–107 [CrossRef] [Google Scholar]
M.J. Zhang, S. Garcia, M. Terre, Fast Learning Architecture for Neural Networks, in 2022 30th European Signal Processing Conference (EUSIPCO) (2022), pp. 1611–1615 [CrossRef] [Google Scholar]
L. Ravaglia, M. Rusci, A. Capotondi, F. Conti, L. Pellegrini, V. Lomonaco, D. Maltoni, L. Benini, Memory-latency-accuracy trade-offs for continual learning on a RISC-V extreme-edge node, 2020 IEEE Workshop on Signal Processing Systems (SiPS) (2020), pp. 1–6 [Google Scholar]
Y. Li, S.E. Li, X. Jia, S. Zeng, Y. Wang, FPGA accelerated model predictive control for autonomous driving, J. Intell. Connect. Veh. 5, 63–71 (2022) [Google Scholar]
K. Guo, S. Zeng, J. Yu, Y. Wang, H. Yang, [DL] A survey of FPGA-based neural network inference accelerators, ACM Trans. Reconfig. Technol. Syst. 12, 1–26 (2019) [Google Scholar]
S. Xiong, G. Wu, X. Fan, X. Feng, Z. Huang, W. Cao, X. Zhou, S. Ding, J. Yu, L. Wang, Z. Shi, MRI-based brain tumor segmentation using FPGA-accelerated neural network, BMC Bioinfor. 22, 421 (2021) [CrossRef] [Google Scholar]
C.-C. Sun, A. Ahamad, P.-H. Liu, SoC FPGA accelerated sub-optimized binary fully convolutional neural network for robotic floor region segmentation, Sensors 20, 6133 (2020) [CrossRef] [Google Scholar]
L. Wang, Y. Zhao, X. Li, An automatic conversion tool for caffe neural network configuration oriented to openCL-based FPGA platforms, 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (2019), pp. 195–198 [Google Scholar]
M. Carreras, G. Deriu, L. Raffo, L. Benini, P. Meloni, Optimizing temporal convolutional network inference on FPGA-based accelerators, IEEE J. Emerg. Selected Top. Circ. Syst. 1–1 (2020), 10.1109/JETCAS.2020.3014503 [Google Scholar]
H. Park, C. Lee, H. Lee, Y. Yoo, Y. Park, I. Kim, K. Yi, Work-in-progress: optimizing DCNN FPGA accelerator design for handwritten hangul character recognition, in 2017 International Conference on Compilers, Architectures and Synthesis For Embedded Systems (CASES) (2017), pp. 1–2 [Google Scholar]

Cite this article as: Ming Jun Zhang, Samuel Garcia, Michel Terre, Real-time fast learning hardware implementation, Int. J. Simul. Multidisci. Des. Optim. 14, 1 (2023)

All Tables

Table 1 Clock time periods required for algorithm equations.
Table 2 Variables used in the architecture and resources involved.
Table 3 Complex to real mapping.
Table 4 Main simulation parameters.
Table 5 NMSE with an adaptation step size equal to 10⁻⁴.
Table 6 Latency summary.
Table 7 Resource utilization summary for each cell.
Table 8 Resource utilization summary for each layer.

All Figures

Fig. 1 Fully connected neural network.
Fig. 2 Temporal implementation on dedicated hardware processing resources (green: data of time t, red: data of time t + 1, blue: data of time t + 2, brown: data of time t + 3, yellow: data of time t + 4).
Fig. 3 Simplified error model.
Fig. 4 Simplified error model.
Fig. 5 Calculation time for N_R iterations.
Fig. 6 Illustration of adaptation with delay.
Fig. 7 Time comparison of operations with or without delay.
Fig. 8 Illustration of convergence with or without delay.
Fig. 9 Logarithm of the normalized mean squared error vs number of iterations with δ = 5 × 10⁻⁴ (N_FFT = 16).
Fig. 10 Logarithm of the normalized mean squared error vs number of iterations with δ = 10⁻⁴ (N_FFT = 16).
Fig. 11 Architecture of FP.
Fig. 12 Architecture of BP_q for the calculation of the backpropagation equations for the last layer.
Fig. 13 Architecture of BP_L for the calculation of the backpropagation equations for the L layer.
Fig. 14 Architecture of UP for the calculation of the adaptation equations.
Fig. 15 Whole architecture of a real-time learning implementation.
Fig. 16 Training data flow.
Fig. 17 Summary of power utilization.

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.

[1] Z. Xu, N. Tang, C. Xu, X. Cheng, Data science: connotation, methods, technologies, and development, Data Sci. Manag. 1, 32–37 (2021) [CrossRef] [Google Scholar]

[2] Z. Yan, Z. Jin, S. Teng, G. Chen, D. Bassir, Measurement of Bridge Vibration by UAVs Combined with CNN and KLT Optical-Flow Method. Appl. Sci. 12, 1581 (2022) [Google Scholar]

[3] W.S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity, Bull. Math. Biophys. 5, 115–133 (1943) [CrossRef] [Google Scholar]

[4] H.D. Block, A review of perceptrons: An introduction to computational geometry, Inf. Control 17, 501–522 (1970) [CrossRef] [Google Scholar]

[5] D.E. Rumelhart, J.L. McClelland, Learning Internal Representations by Error Propagation (1987), pp. 318–362 [Google Scholar]

[6] G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science (New York, N.Y.) 313, 504–507 (2006) [CrossRef] [Google Scholar]

[7] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou et al., Mastering the game of Go without human knowledge, Nature 550, 354–359 (2017) [CrossRef] [Google Scholar]

[8] H. Gao, S. Tomić, J. Nikolić, Z. Perić, D. Aleksić, Performance of post-training two-bits uniform and layer-wise uniform quantization for MNIST dataset from the perspective of support region choice, Math. Probl. Eng. 2022, 1463094 (2022) [Google Scholar]

[9] A. Van Biesbroeck, F. Shang, D. Bassir, CAD model segmentation via deep learning, Int. J. Comput. Methods 18 (2020), 10.1142/S0219876220410054 [Google Scholar]

[10] V. Shelar, S. Subramani, D. Jebaseelan, R-tree data structure implementation for Computer Aided Engineering (CAE) tools, Int. J. Simulat. Multidiscipl. Des. Optim. 12, 6 (2021) [CrossRef] [EDP Sciences] [Google Scholar]

[11] A. Ignatov, G. Malivenko, R. Timofte, S. Chen, X. Xia, Z. Liu, Y. Zhang, F. Zhu, J. Li, X. Xiao, Y. Tian, X. Wu, C. Kyrkou, Y. Chen, Z. Zhang, Y. Peng, Y. Lin, S. Dutta, S. Das, S. Siddiqui, Fast and Accurate Quantized Camera Scene Detection on Smartphones, Mobile AI 2021 Challenge: Report (2021), pp. 2558–2568 [Google Scholar]

[12] Y. Hu, Y. Liu, Z. Liu, A survey on convolutional neural network accelerators: GPU, FPGA and ASIC, 2022 14th International Conference on Computer Research and Development (ICCRD) (2022), pp. 100–107 [CrossRef] [Google Scholar]

[13] M.J. Zhang, S. Garcia, M. Terre, Fast Learning Architecture for Neural Networks, in 2022 30th European Signal Processing Conference (EUSIPCO) (2022), pp. 1611–1615 [CrossRef] [Google Scholar]

[14] L. Ravaglia, M. Rusci, A. Capotondi, F. Conti, L. Pellegrini, V. Lomonaco, D. Maltoni, L. Benini, Memory-latency-accuracy trade-offs for continual learning on a RISC-V extreme-edge node, 2020 IEEE Workshop on Signal Processing Systems (SiPS) (2020), pp. 1–6 [Google Scholar]

[15] Y. Li, S.E. Li, X. Jia, S. Zeng, Y. Wang, FPGA accelerated model predictive control for autonomous driving, J. Intell. Connect. Veh. 5, 63–71 (2022) [Google Scholar]

[16] K. Guo, S. Zeng, J. Yu, Y. Wang, H. Yang, DL A survey of FPGA-based neural network inference accelerators, ACM Trans. Reconfig. Technol. Syst. 12, 1–26 (2019) [Google Scholar]

[17] S. Xiong, G. Wu, X. Fan, X. Feng, Z. Huang, W. Cao, X. Zhou, S. Ding, J. Yu, L. Wang, Z. Shi, MRI-based brain tumor segmentation using FPGA-accelerated neural network, BMC Bioinfor. 22, 421 (2021) [CrossRef] [Google Scholar]

[18] C.-C. Sun, A. Ahamad, P.-H. Liu, SoC FPGA accelerated sub-optimized binary fully convolutional neural network for robotic floor region segmentation, Sensors 20, 6133 (2020) [CrossRef] [Google Scholar]

[19] L. Wang, Y. Zhao, X. Li, An automatic conversion tool for caffe neural network configuration oriented to openCL-based FPGA platforms, 2019 IEEE 3rd Information Technology, Networking, Electronic and Automation Control Conference (ITNEC) (2019), pp. 195–198 [Google Scholar]

[20] M. Carreras, G. Deriu, L. Raffo, L. Benini, P. Meloni, Optimizing temporal convolutional network inference on FPGA-based accelerators, IEEE J. Emerg. Selected Top. Circ. Syst. 1–1 (2020), 10.1109/JETCAS.2020.3014503 [Google Scholar]

[21] H. Park, C. Lee, H. Lee, Y. Yoo, Y. Park, I. Kim, K. Yi, Work-in-progress: optimizing DCNN FPGA accelerator design for handwritten hangul character recognition, in 2017 International Conference on Compilers, Architectures and Synthesis For Embedded Systems (CASES) (2017), pp. 1–2 [Google Scholar]