
The Calculus of Artificial Intelligence and Autonomous Driving

By Robert Bielby - 2019-05-13

There, I did it. I used two of the hottest buzzwords in one sentence: artificial intelligence and autonomous driving. It should be no surprise that the two would be mentioned in the same sentence. Artificial intelligence, and specifically the application of convolutional neural networks (CNNs) for image recognition, has forever changed the automotive playing field by delivering higher levels of accuracy than more traditional computer vision algorithms. In fact, today's state-of-the-art CNN-based image recognition algorithms have been shown to deliver accuracy superior to that of humans. However, this accuracy comes at a price: the highly accurate CNNs used to achieve level 5 autonomous driving can easily require tens to hundreds of teraflops of compute performance. And with this compute performance typically comes the need for high-bandwidth memory.

CNNs, a specific class of deep neural networks (DNNs), rely on convolutional and pooling layers, components that are especially effective for image recognition because they exploit the translation-invariant structure of images. In other words, regardless of where an object appears in the field of view, it can be consistently recognized.
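To make that translation-invariance point concrete, here is a toy sketch (not a real CNN layer): the same small filter, slid across a 1-D signal, produces the same peak response whether the pattern appears near the start or the end of the input. The filter and signals are invented purely for illustration.

```python
# Minimal illustration of translation invariance: a small "edge" filter
# slid across a 1-D signal produces the same peak response no matter
# where the pattern appears. (Toy example, not a real CNN layer.)

def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [-1, 1]                      # detects a rising step
early  = [0, 0, 1, 1, 1, 0, 0, 0, 0]  # step near the start
late   = [0, 0, 0, 0, 0, 0, 1, 1, 1]  # same step near the end

print(max(conv1d(early, kernel)))  # 1
print(max(conv1d(late, kernel)))   # 1 -- same response, different location
```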

Deep learning identifies structure in large data sets through backpropagation, in which the parameters used to compute the representation in each layer are derived from the representation in the previous layer. Deep learning networks have an input and an output layer, plus typically dozens of hidden layers, each with thousands of nodes. Each node applies a set of weights that are fixed once training is complete. The computation at each node is effectively a sum of products, much like traditional digital signal processing (DSP) or analog signal processing. This multiply-accumulate (MAC) operation is carried out at every node of the DNN, and the total MAC count needed to produce a single result can easily reach 4 billion.
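As a rough sketch of that sum-of-products operation, the snippet below computes a single node's output as a chain of multiply-accumulates and tallies the MAC count for one fully connected layer. The layer sizes are illustrative assumptions, not drawn from any particular network.

```python
# Each node computes a sum of products (multiply-accumulate) of its
# inputs and weights -- the operation that dominates DNN compute.

def node_output(inputs, weights, bias=0.0):
    acc = bias
    for x, w in zip(inputs, weights):   # one MAC per input connection
        acc += x * w
    return max(acc, 0.0)                # e.g. a ReLU activation

# MAC count for a fully connected layer is inputs x outputs;
# the layer sizes below are illustrative, not from the article.
inputs_per_node, nodes = 4096, 4096
macs_per_layer = inputs_per_node * nodes
print(f"{macs_per_layer:,} MACs in this one layer")   # ~16.8 million
```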

DNNs have two modes of operation: training and inference. Training is the mode in which the network develops the weights that are ultimately used to accurately classify an image. This article focuses on inference, the mode in which the trained network is deployed in the field and used to detect and classify pedestrians, bicycles, cars, street signs, and so on in real time.

Because a single result can require over 4 billion MAC operations, traditional single instruction, multiple data (SIMD) architectures would seem a natural choice; however, the extensive overhead of moving data into and out of the register file can significantly impact both performance and power consumption. Novel processing architectures based on systolic arrays and neuromorphic processing are emerging to tackle the compute problem in ways that provide greater throughput with a lower cost and power footprint than more traditional computing architectures.
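The sketch below mimics the output-stationary dataflow that systolic arrays exploit: partial sums accumulate in place while operands stream past, rather than shuttling through a register file on every MAC. It is a plain software illustration of the dataflow, not a model of any specific accelerator, and the matrix sizes are arbitrary.

```python
# Sketch of the output-stationary dataflow a systolic array implements:
# each output cell accumulates partial sums as inputs stream past, so
# weights and partial results stay local instead of moving through a
# register file on every MAC. Sizes are illustrative.

def matmul_accumulate(A, B):
    n, k = len(A), len(A[0])
    m = len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for step in range(k):               # one "beat" of streamed data
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][step] * B[step][j]   # local accumulate
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul_accumulate(A, B))   # [[19.0, 22.0], [43.0, 50.0]]
```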

And while the underlying compute architecture determines the potential computational performance, memory, and specifically memory bandwidth, plays an essential role in actually achieving that theoretical performance.

AI processors based on systolic arrays require high memory bandwidth to ensure the array is continuously fed and does not stall. Because the weights don’t change during inference, systolic array-based architectures typically store algorithm weights in large on-chip memories. Ideally, all weights can be stored in on-chip memory for maximum performance. Activations, unlike weights, are typically stored in an external DRAM. A high-bandwidth connection to external DRAM is essential for realizing maximum performance, and the memory bandwidth must scale with the array size.
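A back-of-the-envelope estimate of the activation bandwidth needed to keep such an array fed is shown below. The array size, clock rate, and data width are assumptions chosen only to illustrate how bandwidth scales with the array.

```python
# Rough activation-bandwidth estimate for feeding a systolic array.
# All parameters below are illustrative assumptions, not from the article.

array_edge   = 256          # 256 x 256 MAC array
bytes_per_el = 1            # INT8 activations
clock_hz     = 1.0e9        # 1 GHz

# If one new activation enters each row of the array every cycle
# (i.e. no on-chip reuse of activations):
bytes_per_cycle = array_edge * bytes_per_el
bandwidth_gbs   = bytes_per_cycle * clock_hz / 1e9
print(f"~{bandwidth_gbs:.0f} GB/s of activation traffic")   # ~256 GB/s
```

Doubling the array edge doubles this figure, which is why the external memory bandwidth must scale with the array size.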

Looking closer at the actual process of vision processing, it all starts with raw images of the surroundings, typically sourced from multiple cameras of various resolutions and frame rates. It is not unrealistic for a vehicle to employ 6 to 10 cameras, where the forward-looking cameras can have resolutions of up to 8 megapixels at 60 frames per second, as needed to achieve level 4 or 5 autonomous driving capabilities. The aggregate data throughput associated with the cameras and other sensors can easily exceed 100 Mb/s.
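A rough aggregate-throughput calculation shows just how demanding this sensor stream is. The camera mix and raw bit depth below are assumptions for illustration; the article only gives ranges for resolution, frame rate, and camera count.

```python
# Back-of-the-envelope sensor throughput. Camera mix and bit depth are
# assumed for illustration.

cameras        = 8            # within the 6-10 range cited
pixels         = 8e6          # 8-megapixel forward-looking camera
fps            = 60
bits_per_pixel = 12           # raw Bayer data, assumed

per_camera_gbps = pixels * fps * bits_per_pixel / 1e9
total_gbps      = cameras * per_camera_gbps
print(f"{per_camera_gbps:.1f} Gb/s per camera, ~{total_gbps:.0f} Gb/s aggregate")
# ~5.8 Gb/s per camera, ~46 Gb/s aggregate -- far above 100 Mb/s
```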

Objects and areas of interest (including pedestrians, vehicles, signs, curbs, etc.) are typically located using edge detection. Once the object of interest has been found, the object is then identified using neural networks. The range or distance from the object is determined via stereo cameras, radar or lidar. Multiple image frames are compared to establish optical flow, or more specifically, the speed and direction of movement.

Real-time vision processing based on this rich data stream requires levels of compute performance that were not realistic or practical in an automobile only a decade ago. This is true for both the underlying processor and the associated memory subsystem. Computational performance in the tens to hundreds of teraflops for a centralized processing architecture requires memory bandwidth that can reach 500 GB/s and beyond. LPDDR4 has been a mainstream memory solution for the automotive market for many years, but with a per-pin data rate of 4266 MT/s, delivering 500 GB/s of memory bandwidth would require a memory bus roughly 940 bits wide! For an automotive application, such a solution is not practical from a cost, reliability, or physical footprint perspective. Similarly, it would be impractical for an SoC targeting a mainstream application to host more than 900 dedicated I/Os for memory interfacing.
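The bus-width arithmetic behind that claim is straightforward:

```python
# Bus width needed to reach 500 GB/s with LPDDR4 at 4266 MT/s per pin.

target_gbs     = 500.0                        # GB/s
lpddr4_mtps    = 4266                         # mega-transfers per second, per pin
gbytes_per_pin = lpddr4_mtps / 8 / 1000       # ~0.533 GB/s per pin
pins_needed    = target_gbs / gbytes_per_pin
print(f"~{pins_needed:.0f} data pins")        # ~938 bits of bus width
```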

While graphics processing and the associated GPUs have historically been one of the primary drivers of high memory bandwidth demands, AI applications such as autonomous driving are driving similar, if not greater, levels of DRAM bandwidth. It is only fitting that the automotive industry is now turning to GDDR6, a memory originally developed for the graphics market, as a mainstream, high-bandwidth solution for automotive. With over 3X the I/O performance per pin versus LPDDR4, GDDR6 enables deployment of very high bandwidth applications that are also practical from a power, performance, and cost perspective.
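Repeating the same arithmetic with an assumed 14 Gb/s per-pin GDDR6 data rate (a typical figure, not one quoted in this article) shows how much narrower the bus can be:

```python
# Same 500 GB/s target using GDDR6; 14 Gb/s per pin is an assumed
# (typical) data rate, roughly 3.3x LPDDR4's 4.266 Gb/s per pin.

target_gbs     = 500.0
gddr6_gbps_pin = 14.0                          # assumed per-pin data rate
pins_needed    = target_gbs / (gddr6_gbps_pin / 8)
print(f"~{pins_needed:.0f} data pins")         # ~286 pins, e.g. nine x32 devices
```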

With over 28 years of commitment to the automotive market and over 42% market share, Micron continues to lead the industry in developing innovative memory solutions that play a vital role in enabling the next generations of transportation. Whether it is enabling the in-vehicle experience or level 5 autonomous driving, Micron's automotive-qualified memory solutions continue to be the de facto industry standard.


Robert Bielby
