Industrial Sessions
June 14th, 2022 (Tuesday) 16:10 – 17:40
Industrial Session #1 – Research Facility (KETI & ETRI)
AIWareK: Compiling PyTorch Model for AI Processor Using MLIR Framework
Presenter:
Hyunjeong Kwon (Electronics and Telecommunications Research Institute, Korea)
Abstract
Deep learning compilers have become necessary with the active research on AI hardware. This work compiles PyTorch models into target hardware code using the MLIR framework. The compiler first constructs a graph from the PyTorch model using TorchScript tracing. To construct the graph, our domain-specific parser generates an abstract syntax tree using tokens produced by the lexer. Then, the graph IR (GIR) is built and lowered into the kernel IR (KIR) and the processor IR (PIR) in the MLIR framework. The PIR becomes the input to the backend compiler that generates target machine code. Experimental results show that AIWareK compiled ResNet18 in 7.67 seconds, yielding a mean absolute error of 1.16e-03.
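For context, the TorchScript tracing step that AIWareK starts from can be reproduced with standard PyTorch APIs. The sketch below (using torchvision's ResNet18 as an illustrative model, not the authors' exact setup) shows how the traced graph that a compiler front end would consume is obtained; the subsequent GIR/KIR/PIR lowering is specific to AIWareK and not shown.

    # Minimal sketch: obtain a TorchScript trace of ResNet18, the form of the
    # model that a compiler front end such as AIWareK's parser would consume.
    import torch
    import torchvision

    model = torchvision.models.resnet18(weights=None).eval()
    example_input = torch.randn(1, 3, 224, 224)      # dummy NCHW input

    traced = torch.jit.trace(model, example_input)   # record ops into a graph
    print(traced.graph)                              # the graph a lexer/parser would tokenize and lower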
ArtBrain-K: AI Processor-Based 5-PetaFLOPS AI Server System
Presenter:
Jinho Han (Electronics and Telecommunications Research Institute, Korea)
Abstract
We developed ArtBrain-K, an artificial intelligence server system that delivers 5 PetaFLOPS per rack in a small form factor with low power consumption of 2,400 W, based on a proprietary artificial intelligence processor with 40 TeraFLOPS performance and 15 W power consumption. The system consists of 8 artificial intelligence compute nodes, each providing 590 TeraFLOPS at 300 W and loaded with 20 ABrain-S NPU boards built around the artificial intelligence processor. We also developed AIWareRT, a software development environment for the proprietary artificial intelligence processor. Compared to an existing GPU system, the performance is 3 times higher and the power efficiency is 7 times higher. ArtBrain-K has been applied to a next-generation airport automatic immigration system and an image recognition-based security system, and it can be used in fields that require enormous computing resources for data processing and training, such as huge neural networks like transformer-based artificial intelligence algorithms.
Implementing Binarized Neural Network Processor on FPGA-Based Platform
Presenter:
Jeahack Lee (Korea Electronics Technology Institute, Korea)
Abstract
Binarized neural networks (BNNs) have 1-bit weights and activations, which are well suited to FPGAs. However, BNNs suffer from accuracy loss compared with conventional neural networks, and shortcut connections have been introduced to address this degradation. This work proposes a BNN processor supporting such shortcut connections. To evaluate the performance of the processor, we implement the system on an FPGA (Xilinx Kintex UltraScale). Our experiments show that the proposed processor achieves state-of-the-art energy efficiency.
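As background (not code from the paper), the arithmetic that makes BNNs hardware-friendly is the XNOR-popcount replacement of the multiply-accumulate, and the shortcut connection is an ordinary residual addition. A minimal NumPy sketch of a binarized layer with a shortcut:

    # Illustrative NumPy sketch of binarized arithmetic with a shortcut
    # connection; the paper's processor implements this in FPGA logic.
    import numpy as np

    def binarize(x):
        # Sign binarization to {-1, +1}
        return np.where(x >= 0, 1, -1).astype(np.int8)

    def binary_dot(a_bin, w_bin):
        # XNOR-popcount form of a {-1, +1} dot product:
        # encode +1 as bit 1, -1 as bit 0; dot = 2 * popcount(XNOR) - n
        matches = np.sum((a_bin > 0) == (w_bin > 0))
        return 2 * int(matches) - a_bin.size

    def binarized_layer_with_shortcut(x, W):
        # One fully connected binarized layer plus a residual (shortcut) add,
        # which is what restores accuracy lost to 1-bit quantization.
        xb = binarize(x)
        y = np.array([binary_dot(xb, binarize(row)) for row in W], dtype=np.float32)
        return y + x   # shortcut connection (dimensions must match)

    x = np.random.randn(64).astype(np.float32)
    W = np.random.randn(64, 64).astype(np.float32)
    out = binarized_layer_with_shortcut(x, W)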
Implementation of a Quantum Circuit Simulator Using Classical Bits
Presenter:
Yunpyo Hong (Korea Electronics Technology Institute, Korea)
Abstract
Quantum computers have attracted much attention since they can reduce the running time of many tasks, such as factorization and search, compared to classical computers. However, actual quantum computers are difficult to access, and classical simulation suffers from calculation time and computation cost that grow exponentially with the number of qubits. In this paper, we present an efficient quantum circuit simulator, including quantum circuit emulation hardware and its software framework, using classical bits. The proposed quantum circuit simulator operates in the same way as a conventional quantum software framework and shows fast operation speed.
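For readers unfamiliar with classical simulation of quantum circuits, the sketch below shows the conventional state-vector approach that such simulators reproduce (a generic illustration, not the paper's emulation hardware): an n-qubit state is a vector of 2^n complex amplitudes, and each gate is a small matrix applied to that vector, which is exactly why classical cost grows exponentially with qubit count.

    # Generic state-vector simulation sketch (not the paper's architecture).
    import numpy as np

    H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # Hadamard gate

    def apply_gate(state, gate, target, n_qubits):
        # Build I (x) ... (x) gate (x) ... (x) I and multiply; the operator is
        # 2**n x 2**n, so cost grows exponentially with n_qubits.
        op = np.array([[1]], dtype=complex)
        for q in range(n_qubits):
            op = np.kron(op, gate if q == target else np.eye(2, dtype=complex))
        return op @ state

    n = 3
    state = np.zeros(2**n, dtype=complex)
    state[0] = 1.0                        # start in |000>
    state = apply_gate(state, H, 0, n)    # put qubit 0 into superposition
    probs = np.abs(state)**2              # measurement probabilities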
Weighted Decoupling: An Effective Image Resizing Method for Binarized Neural Network
Presenter:
Seungwoo Im (Korea Electronics Technology Institute, Korea)
Abstract
Much neuromorphic hardware uses a microcontroller unit with low computing power and a small memory. Binarized neural networks have been developed to process deep neural network applications under such hardware-restricted conditions and to alleviate computational latency. Moreover, we can reduce the required multiply-accumulate operations by shrinking the resolution of the input images. In this paper, we compare various image resizing methods and propose weighted decoupling, a new resizing method, for the best accuracy of the ResNetB18 model, an additionally binarized version of ResNetE18. A ResNetB18 model with the proposed resizing method shows better accuracy than that of the same model trained with 12.25x larger images.
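As a rough illustration of why input resizing saves compute (the concrete resolutions below are an assumption, not taken from the paper): the MAC count of a stride-1 convolutional layer scales with the spatial area of its input, so shrinking, for example, 224x224 images to 64x64 (3.5x per side) cuts per-layer MACs by 3.5^2 = 12.25x, matching the 12.25x image-size ratio quoted above.

    # Back-of-the-envelope MAC count for one conv layer at two input
    # resolutions. The 224x224 -> 64x64 pair is an assumed example; only the
    # 12.25x area ratio corresponds to the figure quoted in the abstract.
    def conv_macs(h, w, c_in, c_out, k):
        # Stride-1, 'same'-padded convolution: one k*k*c_in dot product per
        # output pixel and per output channel.
        return h * w * c_in * c_out * k * k

    full = conv_macs(224, 224, 3, 64, 3)
    small = conv_macs(64, 64, 3, 64, 3)
    print(full / small)   # -> 12.25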
Characteristic Comparison of Korean Unstructured Dialogue Corpora by Morphological Analysis
Presenter:
Seona Moon (Korea Electronics Technology Institute, Korea)
Abstract
Natural language processing (NLP) has attracted researchers’ attention globally. Many unstructured dialogue corpora have been collected for NLP research, in English and Chinese as well as other languages. These corpora show various characteristics depending on the relationship between speakers, the dialogue topic, how the dialogues were gathered, and so on. Analyzing their characteristics is therefore essential to comprehend the corpora for studying natural language dialogue. In this paper, we choose five different Korean unstructured dialogue corpora for characteristic comparison, and identify the average numbers of utterances, proper nouns, and pronouns per dialogue using MeCab-ko, a Korean morpheme analyzer.
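The kind of count the authors extract can be approximated with the standard MeCab Python binding and the mecab-ko-dic tagset (NNP for proper nouns, NP for pronouns in the Sejong-based tags); the sketch below is an assumed reconstruction for illustration, not the authors' code.

    # Hedged sketch: count utterances, proper nouns (NNP), and pronouns (NP)
    # in one dialogue with MeCab-ko. Assumes the mecab-python3 binding and a
    # mecab-ko-dic installation.
    import MeCab

    tagger = MeCab.Tagger()   # may need a "-d <mecab-ko-dic path>" argument

    def count_pos(dialogue_utterances):
        proper_nouns = pronouns = 0
        for utt in dialogue_utterances:
            for line in tagger.parse(utt).splitlines():
                if line == "EOS" or "\t" not in line:
                    continue
                pos = line.split("\t")[1].split(",")[0]   # first feature field is the POS tag
                if pos.startswith("NNP"):      # proper noun
                    proper_nouns += 1
                elif pos.startswith("NP"):     # pronoun
                    pronouns += 1
        return len(dialogue_utterances), proper_nouns, pronouns

    utts = ["안녕하세요, 서울에 사세요?", "네, 저는 서울에 삽니다."]
    print(count_pos(utts))   # (utterances, proper nouns, pronouns) for this toy dialogue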
June 15th, 2022 (Wednesday) 08:30 – 09:30
Industrial Session #2 – Industry (Samsung & SK hynix & LG Electronics)
AI Accelerator Embedded Computational Storage for Large-Scale DNN Models
Presenter:
Byungmin Ahn (Samsung Electronics Co., Ltd., Korea)
Abstract
As the model size of Deep Neural Networks (DNNs) is expanding rapidly, especially in the Natural Language Processing (NLP) field, conventional DRAM-based systems lack the capacity to run such AI applications. For this reason, we propose a new computational storage device that integrates a Neural Processing Unit (NPU), which performs DNN inference, into the NAND Flash Memory Controller (FMC). Our NPU uses weight scheduling to efficiently load data from multiple NAND channels, and the optimal number of MAC units in the NPU core module is derived. We evaluate the proposed NPU architecture on a very large DNN model. Compared to a DRAM-based memory system, ours shows equal inference performance, while total memory access power and latency are reduced by 22.6% and 36.3%, respectively. Also, by sharing the data buffer of the NAND flash memory as shared SRAM, the area overhead is reduced by 82% compared to an NPU using dedicated SRAM.
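To make the channel-level parallelism concrete (a generic illustration only; the actual weight scheduler is the paper's contribution and is not described here), weight tiles can be striped across NAND channels so that loads proceed in parallel, roughly as follows; all names and numbers are assumptions.

    # Generic illustration of striping weight tiles across NAND channels so
    # that tile loads can overlap; not the paper's design.
    NUM_CHANNELS = 8
    TILE_BYTES = 16 * 1024

    def stripe_weights(total_weight_bytes):
        # Assign consecutive weight tiles to channels round-robin.
        num_tiles = (total_weight_bytes + TILE_BYTES - 1) // TILE_BYTES
        schedule = [[] for _ in range(NUM_CHANNELS)]
        for tile_id in range(num_tiles):
            schedule[tile_id % NUM_CHANNELS].append(tile_id)
        return schedule

    # With an even spread, load time is bounded by the busiest channel, i.e.
    # roughly total_bytes / (NUM_CHANNELS * per-channel bandwidth).
    sched = stripe_weights(total_weight_bytes=512 * 1024 * 1024)   # e.g. a 512 MB layer group
    print([len(tiles) for tiles in sched])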
An Architecture of Sparse Length Sum Accelerator in AxDIMM
Presenter:
Byeongho Kim (Samsung Electronics Co., Ltd., Korea)
Abstract
In this paper, we have implemented a highly efficient near-memory sparse-length-sum hardware accelerator that is parallelized over each channel or rank to support Meta’s deep learning recommendation model (DLRM). In addition, we describe the high-level architecture and the efforts required to enable it on a conventional x86 server system. From our suggested near-memory accelerator, we obtained a 1.94x performance gain on a two-rank system, close to the physical multiplication of ranks.
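For readers outside the recommendation-model space, the sparse-length-sum (SLS) operation being accelerated is the embedding-bag style gather-and-sum used by DLRM; a plain NumPy reference (illustrative only, not the AxDIMM implementation) looks like this:

    # NumPy reference for the sparse-length-sum (SLS) operation: gather rows
    # of an embedding table by index and sum them per lookup segment. The
    # paper performs this near the DRAM ranks instead of on the host CPU.
    import numpy as np

    def sparse_lengths_sum(table, indices, lengths):
        # table:   (num_rows, dim) embedding table
        # indices: flat row indices for all lookups
        # lengths: number of indices belonging to each output segment
        outputs = []
        offset = 0
        for seg_len in lengths:
            seg = indices[offset:offset + seg_len]
            outputs.append(table[seg].sum(axis=0))
            offset += seg_len
        return np.stack(outputs)

    table = np.random.randn(1000, 64).astype(np.float32)   # toy embedding table
    indices = np.array([3, 17, 256, 42, 7])
    lengths = np.array([2, 3])                              # two segments: 2 and 3 lookups
    pooled = sparse_lengths_sum(table, indices, lengths)    # shape (2, 64)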
CMS: A Computational Memory Solution for High-Performance and Power-Efficient Recommendation System
Presenter:
Minho Ha (SK hynix, Korea)
Abstract
This paper proposes a cost-effective and scalable memory solution for high-performance and power-efficient recommendation systems, called the computational memory solution (CMS). To address the memory bandwidth challenges of deep learning-based recommendation systems, CMS offloads memory-intensive embedding operations to near-data processors equipped with large-capacity, high-bandwidth memory. In contrast to other state-of-the-art near-memory processing accelerators that only support inference and have scalability restrictions, CMS supports training and scales as much as the high-speed serial interface allows. Our evaluation results show that CMS achieves up to 7.5× higher throughput and 12.2× higher power efficiency for training than state-of-the-art accelerators.
AI Engine Structures in TV Processor
Presenter:
Hyun Chul Shin (LG Electronics, Korea)
Abstract
Recently, artificial intelligence (AI) functions have been rapidly adopted in TV systems. We have developed two different kinds of AI picture-processing engine structures. The first is a structure optimized for pixel-level processing, which requires a large logic area in the chip for highly complicated computations on a huge amount of data. Specifically, the activation data can be transferred from the previous layer to the next layer without external data movement, requiring only minimal internal memory transactions. The second is a more general structure that can be used for various applications. Most deep learning engines show a lower MAC utilization ratio than expected, which may result from excessive external memory transactions. For high utilization, we adopted memory bandwidth reduction methods such as weight clustering, layer fusion, and multiple-layer fusion. Our convolution engine with 8-bit integer precision also reduced both memory bandwidth and power consumption. As a result, our TV processor can conduct both pixel-level processing and recognition very efficiently and shows astonishing image quality.
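Of the bandwidth-reduction methods listed above, weight clustering is the easiest to illustrate outside of hardware: weights are grouped into a small number of clusters, and only per-weight cluster indices plus a short codebook are stored, shrinking the bytes moved from external memory. The NumPy sketch below is a generic illustration of the technique, not LG's implementation; the cluster count and sizes are assumptions.

    # Minimal k-means weight clustering sketch: replace 32-bit weights with
    # 4-bit cluster indices plus a 16-entry codebook, cutting weight traffic
    # roughly 8x. Generic illustration only.
    import numpy as np

    def cluster_weights(w, k=16, iters=20):
        flat = w.ravel()
        centroids = np.linspace(flat.min(), flat.max(), k)   # simple 1-D init
        for _ in range(iters):
            # Assign each weight to its nearest centroid, then update centroids.
            idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
            for c in range(k):
                if np.any(idx == c):
                    centroids[c] = flat[idx == c].mean()
        return idx.reshape(w.shape).astype(np.uint8), centroids  # indices + codebook

    w = np.random.randn(64, 64).astype(np.float32)
    indices, codebook = cluster_weights(w)
    w_reconstructed = codebook[indices]   # what the engine actually computes with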