Deep Bayesian active learning using in-memory computing hardware

Synapse weights stochastic characteristics

To implement in-memory DBAL, we used an expandable stochastic CIM computing (ESCIM) system (Extended Data Fig. 1). The ESCIM system, expandable through stacking, integrated three memristor chips in this study. The memristor device adopted the TiN/TaOx/HfOx/TiN material stack. Figure 2a depicts the current–voltage (IV) curves of the 1-transistor-1-memristor (1T1R) cell, which are smooth and symmetrical. Thanks to TaOx, which serves as a thermally enhanced layer33, the multilevel characteristics of the memristor are improved.

Fig. 2: Stochastic characteristics of memristors.

a, A typical measured IV curve of a single 1T1R cell for a quasi-d.c. sweep. b, The probability density of the read noise in the 3.3 μA current state at read voltage Vread = 0.2 V, measured over 1,000 read cycles across individual arrays. c, Typical analog switching behaviors of memristors under identical pulse trains. The dark lines represent the average conductance values, the light-colored regions span plus and minus one standard deviation, and the gray dots represent measured data. d, The statistical distribution of the conductance transition from initial states to subsequent states for devices in a 4K chip, under a single set pulse with constant-amplitude voltage Vset = 2.0 V. The gate voltage of the transistor is Vt = 1.25 V. The current state is measured at read voltage Vread = 0.2 V. e, Probability density curves of the conductance transition at three initial states (Iread = 1 μA, 2 μA and 3 μA). Each curve corresponds to a profile along the vertical lines shown in d.

To analyze the stochastic characteristics of the memristor, we measured conductance variations during the reading and modulation processes. On the one hand, the fluctuation data collected from the reading test can be modeled using a double exponential distribution (Fig. 2b). Memristors in different current states have distinct random fluctuation characteristics (Extended Data Figs. 2 and 3). The measured read noise aligns well with previous reports of current fluctuation behaviors in HfOx-based memristor devices34,35 (Supplementary Note 1). By adjusting the memristor to an appropriate current state, various probability distributions can be obtained (‘In situ sampling via reading memristors’ section in Methods). Such unique stochastic characteristics can facilitate in situ random number generation by reading the current. According to the Lindeberg–Feller central limit theorem (Supplementary Note 2), a Gaussian distribution can be realized by accumulating the currents of multiple memristors (Extended Data Fig. 4). Hence, the ESCIM system can efficiently perform both in situ Gaussian random number generation and in-memory computation, integrating the device-to-device and cycle-to-cycle variabilities of memristors (Extended Data Fig. 5 and Supplementary Note 3).
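
The accumulation argument can be sketched numerically. The snippet below (with illustrative parameters, not measured device values) models each device's read noise as a double-exponential (Laplace) distribution and sums the read currents of several devices; by the Lindeberg–Feller central limit theorem, the heavy-tailed single-device distribution is driven toward a Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def read_current(n_devices, n_samples, mean_uA=3.3, noise_scale=0.05):
    """Model each device's read noise as double-exponential (Laplace)
    around its programmed current and sum across devices.
    All parameter values are illustrative, not measured."""
    currents = rng.laplace(loc=mean_uA, scale=noise_scale,
                           size=(n_samples, n_devices))
    return currents.sum(axis=1)

def excess_kurtosis(x):
    """0 for a Gaussian, 3 for a Laplace distribution."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

single = read_current(1, 100_000)   # heavy-tailed, excess kurtosis near 3
summed = read_current(16, 100_000)  # near-Gaussian, excess kurtosis near 3/16
```

The excess kurtosis of the n-device sum shrinks roughly as 3/n, which is why summing a handful of device currents per weight already gives a good Gaussian approximation.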

On the other hand, memristors also exhibit random fluctuations during the conductance modulation process. Under identical-amplitude voltage pulse modulation, the memristor exhibits continuous bidirectional resistive switching characteristics (Fig. 2c). Meanwhile, random migration of oxygen ions within the resistive layer causes conductance variations between devices and, within a single device, between cycles. To quantitatively analyze these inherent stochastic characteristics, we measured conductance transitions (‘Measurements of the conductance transition’ section in Methods). We applied a constant-amplitude voltage pulse to the 1T1R cells, prompting the memristors to transition from initial to subsequent conductance states. Figure 2d shows the conductance transition distribution during the set operation. While the subsequent conductance generally increases during a set operation, decreases can occur due to random oxygen ion migration. Furthermore, the spread of the transition distribution varies depending on the initial conductance state. Figure 2e shows the transition probability curves for three different initial conductance states, clearly showing an alignment with a Gaussian distribution. In addition, the conductance transition in the case of the reset operation also exhibits a similar Gaussian distribution (Extended Data Fig. 6). Therefore, a single pulse during a memristor’s conductance modulation can be modeled as drawing a random number from a Gaussian distribution.
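
This single-pulse picture can be captured by a toy model. In the sketch below, one set pulse is a Gaussian draw whose mean shift and spread depend on the initial state; the `gain` and `sigma` constants are hypothetical fits for illustration, not the measured values behind Fig. 2d,e:

```python
import numpy as np

rng = np.random.default_rng(1)

def set_pulse(i_read_uA, gain=0.12, sigma0=0.05, sigma_slope=0.03):
    """One constant-amplitude set pulse modeled as a Gaussian draw:
    the mean shift (gain) and the spread (sigma) depend on the initial
    state. All constants are illustrative, not measured."""
    mu = gain                                  # average upward shift
    sigma = sigma0 + sigma_slope * i_read_uA   # spread grows with state
    return i_read_uA + rng.normal(mu, sigma)

# Apply one pulse to many cells starting at 2 uA: most transitions go
# up, but random ion migration makes some go down, as in Fig. 2d.
after = np.array([set_pulse(2.0) for _ in range(10_000)])
frac_down = (after < 2.0).mean()
```

Even though the pulse is a set operation, a nonzero fraction of transitions decrease the conductance, reproducing the downward tail of the measured distribution.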

In-memory DBAL

Consistent with the Lindeberg–Feller central limit theorem, Gaussian weights in a BDNN can be simulated using read currents from multiple devices (Fig. 3a). Within our implementation strategy, a Gaussian weight is produced using three devices in the ESCIM system. As shown in Fig. 3b, we proposed the in-memory DBAL framework built on a memristor BDNN (Supplementary Note 4).
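
A minimal sketch of this weight realization, assuming each weight is split evenly across three devices whose reads carry independent noise (`noise` is an illustrative scale, not a measured one):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_weights(target_w, n_dev=3, noise=0.02):
    """Realize each Gaussian weight as the sum of n_dev noisy device
    read currents; one read operation = one weight sample.
    `noise` is an illustrative per-device read-noise scale."""
    reads = target_w[..., None] / n_dev + rng.normal(
        0.0, noise, size=target_w.shape + (n_dev,))
    return reads.sum(axis=-1)

def stochastic_vmm(x, target_w):
    """One crossbar read: sample every weight in place and perform the
    vector-matrix multiplication in the same step."""
    return x @ sample_weights(target_w)

w_prog = np.array([[0.5, -0.2], [0.1, 0.4]])  # programmed mean weights
x_in = np.array([1.0, 1.0])
outs = np.stack([stochastic_vmm(x_in, w_prog) for _ in range(2000)])
```

Because sampling happens inside the read itself, one crossbar read simultaneously draws all weights from their distributions and computes the vector–matrix product.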

Fig. 3: In-memory DBAL and mSGLD in situ learning method.

a, Realization of a Gaussian weight of a BDNN in a memristor crossbar array. Read currents accumulated from three devices act as one Gaussian weight. b, The flow chart of the proposed in-memory Bayesian active learning. c, The initial phase of the proposed mSGLD. The conductance of the memristor is updated according to the sign of the gradient to mimic an effective stochastic gradient learning algorithm. The added Gaussian noise ηm can be realized by the random fluctuations of the conductance modulation process. d, The latter phase of the proposed mSGLD. The network optimization settles into a flat minimum of the loss function, and the magnitude of the gradient diminishes. Gaussian noise of the total read current dominates, thus mimicking the Langevin dynamic MH process. e, Realization of the weight update. Left: a certain percentage of the weights with the largest gradient magnitudes is selected to be updated, and this update ratio keeps decreasing with the number of training iterations. Right: for each selected weight, one of its three devices is randomly chosen for modulation to realize the weight update. f, By making multiple predictions, the prediction distribution can be efficiently obtained and, thus, the uncertainty can be calculated.

The proposed in-memory DBAL framework integrates a digital computer and our ESCIM system (Extended Data Fig. 7). First, an initial memristor BDNN model is deployed on memristor crossbar arrays (for the pseudocode, refer to Supplementary Fig. 1). This model’s weights are obtained by ex situ training on a digital computer using a small initial training dataset. The read noise model and conductance modulation model are used during this process, enabling the network to learn weight distributions that better fit the integrated memristor arrays (‘Stochastic models of memristors’ section in Methods). Next, the deployed memristor BDNN predicts the classes of data in the unlabeled dataset (Supplementary Fig. 2) and calculates prediction uncertainty (Supplementary Fig. 3). This process involves fully turning on the transistors in the crossbar array, applying the read voltage to the source line (SL) of the devices row by row and sensing, with an analog-to-digital converter (ADC), the fluctuating read current that flows through the virtually grounded bit line (BL). Owing to the weight stochasticity introduced by the memristor cells’ variabilities, the network prediction is a distribution reflecting the variability in read currents rather than a single deterministic value (‘Uncertainty quantification’ section in Methods). Consequently, we can use multiple prediction outputs from the network to derive a prediction distribution and thereby calculate the prediction uncertainty. Subsequently, based on the prediction uncertainty of the samples in the unlabeled dataset, we select the data sample with the highest uncertainty, query its label and incorporate the sample and the queried label into the training dataset. The sample with the highest uncertainty typically contains the most information, and its label is generally the most beneficial for enhancing the network’s classification performance.
Finally, using the updated training dataset, which includes the original training data and the newly added samples, the memristor BDNN performs in situ learning. After in situ learning, the network continues to calculate uncertainty, select high-uncertainty samples and retrain until performance expectations are met or label queries are exhausted.
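
The loop described above can be sketched as follows; `predict`, `train` and `label_oracle` are hypothetical stand-ins for the stochastic crossbar reads, the in situ learning step and the label query described in the text:

```python
import numpy as np

rng = np.random.default_rng(3)

def predictive_entropy(probs):
    """Uncertainty of the averaged prediction distribution.
    probs: (n_reads, n_samples, n_classes) stochastic forward passes."""
    p_mean = probs.mean(axis=0)
    return -(p_mean * np.log(p_mean + 1e-12)).sum(axis=-1)

def active_learning_loop(predict, train, label_oracle,
                         pool, train_x, train_y, n_queries, n_reads=20):
    """Sketch of the in-memory DBAL loop: predict with repeated
    stochastic reads, query the most uncertain pool sample, retrain."""
    for _ in range(n_queries):
        # Each call to predict() is one stochastic read of the crossbar.
        probs = np.stack([predict(pool) for _ in range(n_reads)])
        idx = int(np.argmax(predictive_entropy(probs)))
        x_new = pool[idx]
        train_x = np.vstack([train_x, x_new])
        train_y = np.append(train_y, label_oracle(x_new))
        pool = np.delete(pool, idx, axis=0)
        train(train_x, train_y)           # in situ learning step
    return train_x, train_y, pool
```

Predictive entropy is one common acquisition function; the same loop works with any uncertainty measure computed from the repeated stochastic reads.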

In the process of active learning, the in situ learning step is crucial. Given the limited quantity of samples in the training dataset, inadequate in situ learning capacity could lead to a network with deficient classification capabilities. This might hinder the quantification of uncertainty, thereby challenging the identification of useful unlabeled data samples. Even after multiple rounds of active learning, the network performance might still not meet the anticipated standards.

In situ learning to capture uncertainty

To accurately capture uncertainty in DBAL’s iterative learning, we proposed an in situ learning method using the stochastic property of the conductance modulation process (Supplementary Fig. 4). The method is an improvement based on the stochastic gradient Langevin dynamics (SGLD) algorithm16. The weight parameter update in the SGLD algorithm is very straightforward: it adds Gaussian noise to the gradient step of traditional training algorithms36. The training process of SGLD includes two phases. In the initial phase, the gradient is dominant and the algorithm mimics an efficient stochastic gradient algorithm. As step sizes decay with the number of training iterations, in the latter phase, the injected Gaussian noise becomes dominant and, therefore, the algorithm mimics the Langevin dynamic Metropolis–Hastings (MH) algorithm. As the number of training iterations increases, the algorithm smoothly transitions between the two phases. The SGLD method allows the weight parameters to capture parameter uncertainty rather than simply collapse to the maximum a posteriori solution.
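
For reference, one step of SGLD in its original formulation16 can be written as follows (a generic sketch with an illustrative polynomial step-size decay, not the hardware-adapted variant):

```python
import numpy as np

rng = np.random.default_rng(4)

def sgld_step(w, grad, step_size):
    """One SGLD update: half the step size times the gradient of the
    negative log posterior, plus Gaussian noise whose variance equals
    the step size."""
    noise = rng.normal(0.0, np.sqrt(step_size), size=w.shape)
    return w - 0.5 * step_size * grad + noise

def polynomial_decay(t, a=0.01, b=10.0, gamma=0.55):
    """Decaying step size: early on the gradient term dominates;
    later the injected noise dominates. Constants are illustrative."""
    return a * (b + t) ** (-gamma)
```

The two-phase behavior falls out of the decay: while `step_size` is large the gradient term (proportional to `step_size`) dwarfs the noise (proportional to `sqrt(step_size)`), and the ordering reverses as `step_size` shrinks.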

Based on the stochastic nature of memristors, we improved the SGLD algorithm using sign backpropagation, namely, mSGLD. The stochastic fluctuation under constant-amplitude pulses can also be considered as random number generation. In the initial phase of mSGLD, we calculated the gradient of the memristor conductance \(\frac{\partial \mathrm{Loss}}{\partial I}\) (‘mSGLD training method’ section in Methods) and then updated the value of the memristor weight based on the sign of the gradient to mimic an effective stochastic gradient learning algorithm

$$\Delta I=\mathrm{sign}\left(\frac{\partial \mathrm{Loss}}{\partial I}\right)+\eta_{\mathrm{m}}.$$

(1)

Since the transition probability of the conductance during modulation follows a Gaussian distribution, the added Gaussian noise ηm can be realized by the random fluctuations inherent in the device (Fig. 3c). Therefore, the actual update operation of the device on the memristor array is

$$\mathrm{sign}\left(\frac{\partial \mathrm{Loss}}{\partial I}\right)=\left\{\begin{array}{rcl}1 & \to & \mathrm{Set}\ \mathrm{device}\\ -1 & \to & \mathrm{Reset}\ \mathrm{device}.\end{array}\right.$$

(2)

That is, if the sign of the gradient of a device is positive, a set operation is performed on the device; if it is negative, a reset operation is performed.

In the later phase of mSGLD, with more training iterations, the memristor network’s classification performance improves. As optimization reaches a flat minimum of the loss function, the gradient diminishes and the injected Gaussian noise becomes dominant (Fig. 3d). To transition smoothly between the two phases, we proposed a method in which a decreasing percentage of the weights with the largest gradient magnitudes is updated as training iterations increase (‘mSGLD training method’ section in Methods and Extended Data Fig. 8). For each selected weight, one of its three devices is then randomly chosen for modulation to realize the update (Fig. 3e). Therefore, as the number of training iterations increases, the number of updated weights decreases to a small amount and training ends. Finally, the Gaussian noise of the total read current dominates, thus mimicking the Langevin dynamic MH process. By gradually decreasing the weight update ratio, the smooth transition between phases also reduces the negative impact of excessive conductance stochasticity, stabilizing the network learning process. We also thoroughly discuss the management of noise in mSGLD (Supplementary Figs. 5 and 6), the use of regular SGD instead of mSGLD (Supplementary Fig. 7) and the impact of binarizing the gradient (Supplementary Fig. 8) in a DBAL simulation experiment based on a Modified National Institute of Standards and Technology (MNIST) dataset classification task. The results show the effectiveness and resilience of our proposed mSGLD method (Supplementary Note 5).
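
Putting the sign update (equations (1) and (2)), the decaying update ratio and the random device choice together, one mSGLD iteration might look like the following sketch; `r0`, `pulse` and `sigma` are illustrative constants, and the linear ratio decay is an assumption:

```python
import numpy as np

rng = np.random.default_rng(5)

def msgld_update(dev_currents, grads, t, t_max, r0=0.5,
                 pulse=0.1, sigma=0.03):
    """One mSGLD iteration on a (n_weights, 3) array of device read
    currents. Only the fraction r(t) of weights with the largest
    gradient magnitude is updated; for each, one of its three devices
    receives a set (+) or reset (-) pulse chosen by the gradient's
    sign, and the pulse lands with Gaussian spread (device physics).
    r0, pulse and sigma are illustrative, not the paper's values."""
    ratio = r0 * (1.0 - t / t_max)              # update ratio decays to 0
    n_upd = max(int(ratio * len(grads)), 0)
    if n_upd == 0:
        return dev_currents
    sel = np.argsort(np.abs(grads))[-n_upd:]    # largest-|grad| weights
    dev = rng.integers(0, dev_currents.shape[1], size=n_upd)
    # sign(grad) = +1 -> set pulse, -1 -> reset pulse (equation (2))
    step = np.sign(grads[sel]) * pulse + rng.normal(0, sigma, n_upd)
    dev_currents[sel, dev] += step
    return dev_currents
```

Early in training many weights are pulsed, mimicking stochastic gradient descent; as `ratio` approaches zero almost no weights are pulsed and only the read-current noise remains, mimicking the Langevin dynamic MH phase.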

Our proposed in situ learning algorithm leverages the stochastic characteristics of the reading and conductance modulation processes for Gaussian random number generation during network prediction and learning. In BDNN learning, weights are updated with gradient values plus added Gaussian noise, allowing Bayesian parameter uncertainty to be captured via the in situ mSGLD method. In BDNN prediction, Gaussian weights are sampled and computed through vector–matrix multiplication (VMM) with the input vector. Memristor Gaussian weights in crossbar arrays enable efficient weight sampling and VMM with a single read operation. Multiple predictions yield the prediction distribution, from which the uncertainty is calculated (Fig. 3f).
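
The prediction side can be sketched in the same spirit: repeat the stochastic read, collect the softmax outputs and use their spread as the uncertainty. Here `noisy_logits` is a hypothetical stand-in for a crossbar read in which every weight is jittered, with illustrative weight values and noise scale:

```python
import numpy as np

rng = np.random.default_rng(6)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def prediction_stats(logits_fn, x, n_reads=100):
    """Repeat the stochastic read n_reads times; each read samples the
    Gaussian weights and performs the VMM in one shot. The spread of
    the softmax outputs serves as the prediction uncertainty."""
    probs = np.stack([softmax(logits_fn(x)) for _ in range(n_reads)])
    return probs.mean(axis=0), probs.std(axis=0)

# Toy crossbar: weights jitter on every read, as in the memristor array.
w_prog = np.array([[2.0, -2.0], [-1.0, 1.0]])
def noisy_logits(x, scale=0.4):
    return x @ (w_prog + rng.normal(0, scale, w_prog.shape))

mean_easy, std_easy = prediction_stats(noisy_logits, np.array([3.0, 0.0]))
mean_hard, std_hard = prediction_stats(noisy_logits, np.array([0.3, 0.3]))
```

An input far from the decision boundary yields a tight prediction distribution, while an ambiguous input yields a wide one, which is exactly the signal the active learning loop queries on.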

Robot’s skill learning

To demonstrate the applicability of the proposed methods, we performed a demonstration in a robot’s skill learning task (Fig. 4a). In this task (‘Robot’s pouring skill learning task and robot simulator’ section in Methods), the robot is equipped with a set of basic abilities such as locomotion and basic object manipulation. The robot needs to build on these foundations by training a BDNN model to acquire the high-level skill of pouring. However, the labeled data required to learn the skill are difficult, time-consuming and expensive to obtain. Therefore, the robot needs to learn the pouring skill with as few labeled samples or attempts as possible, thus minimizing the experimental time, labor and cost of obtaining labeled data.

Fig. 4: Robot’s skill learning and experimental results.

a, A schematic illustration of the robot’s pouring skill learning task using in-memory DBAL. b, The evolution of the training accuracy of the memristor BDNN with each epoch for training dataset sizes of 64, 84, 104 and 124. c, Histogram (bars) and distribution curve (line) of the memristor conductance states at the start and end of the active learning loop, measured at read voltage Vread = 0.2 V. d, The classification accuracy on the test dataset of the active learning method and the passive learning method. Active learning has an advantage compared with passive learning, which randomly selects the samples to be queried. e, The learning performance of the active and passive learning methods. We used the percentage of the cup’s contents poured into the bowl as a performance metric. f, Visualization of the active learning results. The robot pours the beads from the cup into the bowl. T1, T2, T3 and T4 represent sequential time points in the pouring process. g, Visualization of the final cup position for 500 valid pouring control parameters. It visualizes the most confident (low-uncertainty) predictions of the memristor BDNN by coloring small values in red and large values in blue.

The expected effect of the pouring action is to pour the contents of a cup into a bowl. We are interested in learning under what constraints the execution of this action will transfer a sufficiently large portion of the initial contents of the cup into the target bowl. These constraints for a pour are the context parameters (the bowl and cup dimensions) and the control parameters that the robot can choose (the axis of rotation, the cup rotation frame and the final pitch). To execute a successful pouring action, we used a memristor BDNN to predict whether the pouring action will succeed under different control parameters and then proceeded to planning and execution based on the control parameters with a higher probability of success (Fig. 4a). Thus, our main objective is to train a BDNN to achieve high accuracy and action effectiveness using as few labeled samples as possible, through the proposed active learning methods.

We implemented an active learning task using an 11 × 50 × 50 × 2 memristor BDNN (‘Experiment system setup’ section in Methods), balancing hardware complexity and network performance (Supplementary Note 6). The BDNN was trained on a digital computer using a 64-sample dataset and then deployed to the ESCIM system. In the active learning loop (Fig. 4a), the memristor BDNN predicts and estimates the uncertainties of 10,000 unlabeled constraints, queries the label of the most uncertain constraint and adds it to the training set. The memristor BDNN then uses the updated dataset for 110 epochs of in situ learning via the proposed mSGLD method. If the training accuracy reaches 98% or more over several consecutive iterations, in situ learning ends early. This loop is repeated 64 times, cumulatively querying 64 constraints and resulting in a final training dataset of 128 samples.

We successfully demonstrated the in-memory active learning process of this task using the ESCIM system. We used a digital computer to set up a three-dimensional (3D) tabletop simulator with a robot and to control the DBAL loop, as shown by the orange and light green parts of Fig. 4a (‘Experiment system setup’ section in Methods). Supplementary Fig. 9 shows the pseudocode for the robot’s skill learning task using the in-memory DBAL framework. Our active learning method was extensively evaluated using the simulator. The ESCIM system, connected to the digital computer, read and modulated the memristor arrays’ conductance during BDNN prediction and in situ learning, as shown by the gray parts in Fig. 4a. Supplementary Note 7 provides additional technical details on the robot’s skill learning process. Figure 4b depicts the evolution of the memristor BDNN’s training classification accuracy across four different training dataset sizes, indicating high accuracy across all networks. Notably, with 64 training samples, in situ learning stops early owing to the smaller dataset’s reduced complexity and noise. The memristor conductance state distribution at the start and end of the active learning loop is shown in Fig. 4c. We also evaluated the passive learning method for comparison, which randomly selects samples for querying instead of selecting on the basis of prediction uncertainty. Figure 4d shows the classification accuracy of the active and passive learning methods on unseen testing data. Increasing the training data size generally enhances the model’s generalization and test accuracy. The initial network classification accuracies of both methods are similar. However, as query samples increase, active learning outperforms passive learning, improving accuracy by about 13%. We also analyzed the impact of cycle-to-cycle variability on the network’s performance over time (Supplementary Note 8).
The network maintains stable performance over time, with accuracy levels similar to those after in situ learning (Supplementary Fig. 10). The likely reason for this stability is that the BDNN can inherently tolerate a certain amount of weight variation caused by cycle-to-cycle variability. Furthermore, we compared the learning performance of active and passive learning on the pouring skill task (Fig. 4e); the results show that active learning outperforms passive learning with the same number of query samples.

We visualized the process of the robot pouring the beads from the cup into the bowl using active learning, as shown in Fig. 4f and Supplementary Movie 1. We also visualized the final tipping angle of the cup at the end of the pouring action. Figure 4g shows a dataset of pouring control parameters for a single bowl-cup pair by showing the final position of the red cup. We find that the memristor BDNN learns that either the cup has a larger pitch and is located directly above the bowl, or it has a smaller pitch and is located slightly to the right of the bowl. This suggests that the memristor BDNN captures intuitively relevant information for a successful pour. These results show that the proposed methods can realize efficient in-memory DBAL.

Moreover, we evaluated the energy consumption and latency of the stochastic CIM computing system in this learning task (Supplementary Fig. 11 and Extended Data Table 1) and compared it with a traditional CMOS-based graphics processing unit computing platform (Supplementary Note 9). The stochastic CIM computing system achieved a remarkable 44% boost in speed and reduced energy consumption by a factor of 153. The speed could be further improved by employing highly parallel modulation methods, and the energy cost could be further reduced by refining the ADC design (Extended Data Fig. 9). Owing to in-memory VMM and in situ sampling facilitated by the intrinsic physical randomness of reading and conductance modulation, memristor crossbars can enable both in situ learning and prediction, thus overcoming the von Neumann bottleneck.
