Search Results for author: Tushar Krishna

Found 42 papers, 11 papers with code

H3DFact: Heterogeneous 3D Integrated CIM for Factorization with Holographic Perceptual Representations

no code implementations • 5 Apr 2024 • Zishen Wan, Che-Kai Liu, Mohamed Ibrahim, Hanchen Yang, Samuel Spetalnick, Tushar Krishna, Arijit Raychowdhury

Disentangling attributes of various sensory signals is central to human-like perception and reasoning and a critical task for higher-order cognitive and neuro-symbolic AI systems.

Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition

no code implementations • 12 Mar 2024 • Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna

Next, we develop a software framework, TASDER, to accelerate DNNs by searching for layer-wise, high-quality structured decompositions of both weight and activation tensors so that they can be accelerated by any system with structured-sparse hardware support.

Tensor Decomposition
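To make the idea of searching layer-wise structured decompositions concrete, here is a hypothetical sketch (not TASDER's actual algorithm): for each layer it picks the sparsest N:M pattern whose approximation error stays under an assumed tolerance, and otherwise leaves the layer dense. The candidate patterns, the tolerance, and the layer names are illustrative assumptions.

```python
# Hypothetical layer-wise structured-pattern selection; NOT TASDER's actual search.
# Candidate patterns, error tolerance, and layer names are assumptions.
import numpy as np

def nm_project(w, n, m):
    """Keep the n largest-magnitude values in every contiguous group of m."""
    groups = w.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return (groups * mask).reshape(-1)

def pick_pattern(weight, tol=0.45):
    """Return the sparsest candidate N:M pattern with relative L2 error below tol."""
    flat = weight.reshape(-1)
    for n, m in [(1, 4), (2, 4)]:              # candidates, sparsest first
        approx = nm_project(flat, n, m)
        err = np.linalg.norm(flat - approx) / np.linalg.norm(flat)
        if err < tol:
            return f"{n}:{m}", err
    return "dense", 0.0                         # fall back to keeping the layer dense

rng = np.random.default_rng(0)
layers = {"fc1": rng.normal(size=(256, 256)), "fc2": rng.laplace(scale=0.1, size=(256, 256))}
for name, w in layers.items():
    pattern, err = pick_pattern(w)
    print(f"{name}: {pattern} (relative error {err:.2f})")
```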

Algorithm-Hardware Co-Design of Distribution-Aware Logarithmic-Posit Encodings for Efficient DNN Inference

1 code implementation • 8 Mar 2024 • Akshat Ramachandran, Zishen Wan, Geonhwa Jeong, John Gustafson, Tushar Krishna

Traditional Deep Neural Network (DNN) quantization methods using integer, fixed-point, or floating-point data types struggle to capture diverse DNN parameter distributions at low precision, and often require large silicon overhead and intensive quantization-aware training.

Quantization
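The excerpt above is about fitting diverse weight distributions at low precision. As a toy illustration of why a non-uniform (logarithmic) code helps, the sketch below compares a power-of-two quantizer with a max-scaled uniform quantizer on weights that contain a few outliers: the logarithmic code keeps relative error roughly bounded across the dynamic range, while the uniform code's scale is blown up by the outliers and most small weights collapse to zero. This is a stand-in for intuition only, not the paper's logarithmic-posit encoding; the exponent range, bit budget, and weight distribution are assumptions.

```python
# Toy comparison: power-of-two ("logarithmic") vs max-scaled uniform quantization.
# NOT the paper's logarithmic-posit format; ranges and distributions are assumptions.
import numpy as np

def quantize_pow2(x, exp_min=-8, exp_max=1):
    """Round each value to the nearest signed power of two in [2^exp_min, 2^exp_max]."""
    mag = np.clip(np.abs(x), 2.0 ** exp_min, 2.0 ** exp_max)
    exp = np.clip(np.round(np.log2(mag)), exp_min, exp_max)
    return np.where(x == 0, 0.0, np.sign(x) * 2.0 ** exp)

def quantize_uniform(x, bits=4):
    """Symmetric uniform quantizer whose scale is set by the largest magnitude."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

rng = np.random.default_rng(0)
w = rng.laplace(scale=0.05, size=100_000)      # bulk of the weights is small
w[:50] = rng.normal(scale=1.0, size=50)        # a few outliers widen the dynamic range

for name, wq in [("power-of-two", quantize_pow2(w)), ("uniform 4-bit", quantize_uniform(w))]:
    rel = np.linalg.norm(w - wq) / np.linalg.norm(w)
    print(f"{name:13s} relative error = {rel:.3f}")
```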

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

1 code implementation • 8 Mar 2024 • Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, Tuo Zhao

Key-value (KV) caching has become the de facto technique for accelerating generation speed in large language model (LLM) inference.

Quantization
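For readers new to KV-cache compression, the sketch below quantizes a toy key cache to 4 bits per value with one scale per cached token and reports the memory ratio and reconstruction error. It only sets up the problem the paper targets; GEAR's actual recipe for pushing the error toward near-lossless is more involved, and the cache shape and bit-width here are assumptions.

```python
# Minimal 4-bit per-token quantization of a cached key tensor; NOT GEAR's recipe.
# Cache shape and bit-width are assumptions for illustration.
import numpy as np

def quantize_per_token(x, bits=4):
    """Symmetric uniform quantization with one scale per cached token (row)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x), axis=1, keepdims=True) / qmax
    codes = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return codes, scale

rng = np.random.default_rng(0)
K = rng.normal(size=(4096, 128)).astype(np.float32)    # toy cache: 4096 tokens x 128 dims

codes, scale = quantize_per_token(K)
K_hat = codes.astype(np.float32) * scale

orig_bytes = K.nbytes
comp_bytes = codes.size // 2 + scale.size * 4          # two 4-bit codes per byte + fp32 scales
err = np.linalg.norm(K - K_hat) / np.linalg.norm(K)
print(f"compression ratio ~{orig_bytes / comp_bytes:.1f}x, relative error {err:.3f}")
```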

Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

1 code implementation • 7 Feb 2024 • Abhimanyu Rajeshkumar Bambhaniya, Amir Yazdanbakhsh, Suvinay Subramanian, Sheng-Chun Kao, Shivani Agrawal, Utku Evci, Tushar Krishna

In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions.
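For context, N:M structured sparsity keeps at most N nonzero weights in every group of M consecutive weights. The sketch below builds a standard 2:4 magnitude mask and anneals from dense weights to the fully masked weights with a linear schedule; the schedule and sizes are illustrative assumptions, not the recipe proposed in the paper.

```python
# N:M structured sparsity sketch: build a 2:4 magnitude mask and fade out the pruned
# weights over training. The linear schedule is an assumption, not the paper's recipe.
import numpy as np

def nm_mask(w, n=2, m=4):
    """Boolean mask keeping the n largest-magnitude entries per group of m (last axis)."""
    groups = w.reshape(-1, m)
    keep = np.argsort(-np.abs(groups), axis=1)[:, :n]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))

total_steps = 5
for step in range(total_steps + 1):
    decay = 1.0 - step / total_steps            # 1 -> 0: how much of the pruned weights remains
    mask = nm_mask(w)
    w_eff = np.where(mask, w, decay * w)        # pruned entries fade out instead of vanishing
    density = np.count_nonzero(w_eff) / w_eff.size
    print(f"step {step}: pruned-weight factor {decay:.1f}, density {density:.2f}")
```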

Towards Cognitive AI Systems: a Survey and Prospective on Neuro-Symbolic AI

no code implementations • 2 Jan 2024 • Zishen Wan, Che-Kai Liu, Hanchen Yang, Chaojian Li, Haoran You, Yonggan Fu, Cheng Wan, Tushar Krishna, Yingyan Lin, Arijit Raychowdhury

The remarkable advancements in artificial intelligence (AI), primarily driven by deep neural networks, have significantly impacted various aspects of our lives.

TACOS: Topology-Aware Collective Algorithm Synthesizer for Distributed Machine Learning

no code implementations • 11 Apr 2023 • William Won, Midhilesh Elavazhagan, Sudarshan Srinivasan, Ajaya Durg, Samvit Kaul, Swati Gupta, Tushar Krishna

To this end, this paper introduces TACOS, an automated synthesizer that generates topology-aware collective algorithms for common distributed machine learning collectives across arbitrary input network topologies.

ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale

3 code implementations • 24 Mar 2023 • William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, Tushar Krishna

In this paper, we extend the open-source ASTRA-sim infrastructure and endow it with the capabilities to model state-of-the-art and emerging distributed training models and platforms.

VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs

no code implementations • 17 Feb 2023 • Geonhwa Jeong, Sana Damani, Abhimanyu Rajeshkumar Bambhaniya, Eric Qin, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

Therefore, as DL workloads embrace sparsity to reduce the computations and memory size of models, it is also imperative for CPUs to add support for sparsity to avoid under-utilization of the dense matrix engine and inefficient usage of the caches and registers.
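As background on what structured-sparse hardware support consumes, the sketch below shows the usual 2:4 compressed operand format: two kept weights plus their lane indices (0-3, i.e. 2 bits each) per group of four, with activations gathered by those indices during the dot product. It illustrates the data format only, not VEGETA's ISA extensions or microarchitecture.

```python
# 2:4 structured-sparse operand format sketch: prune each group of 4 weights to its
# 2 largest-magnitude entries, store values + lane indices, and gather activations
# during the dot product. Illustrates the format only, not VEGETA's hardware.
import numpy as np

def compress_2to4(w_row):
    """Compress a 1-D weight row (length divisible by 4) into (values, indices)."""
    groups = w_row.reshape(-1, 4)
    idx = np.sort(np.argsort(-np.abs(groups), axis=1)[:, :2], axis=1)  # kept lanes per group
    vals = np.take_along_axis(groups, idx, axis=1)
    return vals, idx

def sparse_dot(vals, idx, x):
    """Dot product of a compressed 2:4 row with a dense activation vector x."""
    x_groups = x.reshape(-1, 4)
    gathered = np.take_along_axis(x_groups, idx, axis=1)   # pick 2 activations per group
    return np.sum(vals * gathered)

rng = np.random.default_rng(0)
w = rng.normal(size=16)
x = rng.normal(size=16)

vals, idx = compress_2to4(w)
w_pruned = np.zeros_like(w).reshape(-1, 4)
np.put_along_axis(w_pruned, idx, vals, axis=1)             # dense reference of the pruned row

print("sparse result:", sparse_dot(vals, idx, x))
print("dense  result:", float(w_pruned.reshape(-1) @ x))   # should match
```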

COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training

no code implementations • 30 Nov 2022 • Divya Kiran Kadiyala, Saeed Rashidi, Taekyung Heo, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexandros Daglis

To facilitate the design space exploration of such massive DL training clusters, we introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.

Demystifying Map Space Exploration for NPUs

1 code implementation • 7 Oct 2022 • Sheng-Chun Kao, Angshuman Parashar, Po-An Tsai, Tushar Krishna

Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator.

Navigate • Neural Architecture Search
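To make the notion of a map space concrete, here is a toy exhaustive search over GEMM tile sizes: mappings whose tiles fit an assumed on-chip buffer are ranked by a crude DRAM-traffic model. Real map spaces (and the mappers studied in the paper) are vastly larger and use far richer cost models; the buffer size and cost model below are assumptions.

```python
# Toy map-space search: enumerate GEMM tile sizes, keep mappings whose tiles fit an
# assumed on-chip buffer, and rank them by a crude DRAM-traffic model.
import itertools

M, K, N = 256, 256, 256
BUFFER_WORDS = 16_384                     # assumed on-chip buffer capacity

def divisors(x):
    return [d for d in range(1, x + 1) if x % d == 0]

def dram_traffic(tm, tk, tn):
    """Words moved from DRAM if each operand tile is fetched once per tile iteration."""
    steps_m, steps_k, steps_n = M // tm, K // tk, N // tn
    a = tm * tk * steps_m * steps_k * steps_n     # A tiles re-fetched across the N loop
    b = tk * tn * steps_m * steps_k * steps_n     # B tiles re-fetched across the M loop
    c = tm * tn * steps_m * steps_n               # C tiles written once
    return a + b + c

candidates = []
for tm, tk, tn in itertools.product(divisors(M), divisors(K), divisors(N)):
    if tm * tk + tk * tn + tm * tn <= BUFFER_WORDS:       # all three tiles must fit on chip
        candidates.append((dram_traffic(tm, tk, tn), (tm, tk, tn)))

candidates.sort()
print("top-3 mappings (traffic, tile):", candidates[:3])
```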

Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

no code implementations • 15 Sep 2022 • Sheng-Chun Kao, Amir Yazdanbakhsh, Suvinay Subramanian, Shivani Agrawal, Utku Evci, Tushar Krishna

In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs).

Impact of RoCE Congestion Control Policies on Distributed Training of DNNs

no code implementations • 22 Jul 2022 • Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna

Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads, motivating the necessity of designing an optimized, yet low-overhead, congestion control scheme based on the characteristics of distributed training platforms and workloads.

Blocking

DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators

2 code implementations • 26 Jan 2022 • Sheng-Chun Kao, Michael Pellauer, Angshuman Parashar, Tushar Krishna

The design of DNN accelerators includes two key parts: HW resource configuration and mapping strategy.
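The sketch below runs a tiny genetic algorithm over a joint (PE count, tile size) genome with a crude cycle-count fitness, just to illustrate evolutionary HW-mapping co-search. The encoding, cost model, and GA settings are assumptions; they are not DiGamma's domain-aware operators.

```python
# Tiny genetic-algorithm sketch for joint HW/mapping search. Encoding, cost model, and
# GA hyperparameters are illustrative assumptions, not DiGamma's domain-aware operators.
import random

random.seed(0)
M = N = K = 512                                    # toy GEMM workload
PE_CHOICES = [64, 128, 256, 512]
TILE_CHOICES = [8, 16, 32, 64]

def fitness(genome):
    pes, tile = genome
    tiles = (M // tile) * (N // tile) * (K // tile)
    macs_per_tile = tile ** 3
    compute_cycles = tiles * max(1, macs_per_tile // pes)   # ideal time on `pes` MAC units
    fill_drain = tiles * 2 * tile                            # crude pipeline fill/drain overhead
    area_penalty = 1e6 if pes > 256 else 0                   # assumed area budget: 256 PEs
    return compute_cycles + fill_drain + area_penalty

def mutate(genome):
    pes, tile = genome
    if random.random() < 0.5:
        pes = random.choice(PE_CHOICES)
    else:
        tile = random.choice(TILE_CHOICES)
    return (pes, tile)

population = [(random.choice(PE_CHOICES), random.choice(TILE_CHOICES)) for _ in range(16)]
for generation in range(20):
    population.sort(key=fitness)                             # lower estimated cycles is better
    parents = population[:4]                                 # elitist selection
    population = parents + [mutate(random.choice(parents)) for _ in range(12)]

best = min(population, key=fitness)
print("best (PEs, tile):", best, "estimated cost:", fitness(best))
```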

Themis: A Network Bandwidth-Aware Collective Scheduling Policy for Distributed Training of DL Models

no code implementations • 9 Oct 2021 • Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, Tushar Krishna

Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU).

Scheduling
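For intuition about why collective scheduling must be bandwidth-aware, the sketch below evaluates the standard first-order ring all-reduce time, 2*(n-1)/n * message/bandwidth, for two network dimensions. The NPU counts and link speeds are made-up examples; this is not Themis's scheduling policy.

```python
# First-order ring all-reduce time per network dimension. The bandwidth and size numbers
# are made-up examples; this illustrates bandwidth sensitivity, not Themis itself.
def ring_allreduce_seconds(message_bytes, num_npus, link_gbps):
    """2*(n-1)/n * message / bandwidth -- the classic bandwidth term of ring all-reduce."""
    bw_bytes_per_s = link_gbps * 1e9 / 8
    return 2 * (num_npus - 1) / num_npus * message_bytes / bw_bytes_per_s

gradient_bytes = 1 * 1024**3                     # 1 GiB of gradients to all-reduce
dims = [("scale-up (intra-node)", 8, 400), ("scale-out (inter-node)", 64, 100)]

for name, npus, gbps in dims:
    t = ring_allreduce_seconds(gradient_bytes, npus, gbps)
    print(f"{name:24s} {npus:3d} NPUs @ {gbps} Gbps -> {t * 1e3:7.1f} ms")
```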

RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU

no code implementations • 5 Oct 2021 • Geonhwa Jeong, Eric Qin, Ananda Samajdar, Christopher J. Hughes, Sreenivas Subramoney, Hyesoon Kim, Tushar Krishna

As AI-based applications become pervasive, CPU vendors are starting to incorporate matrix engines within the datapath to boost efficiency.

Exploring Multi-dimensional Hierarchical Network Topologies for Efficient Distributed Training of Trillion Parameter DL Models

no code implementations • 24 Sep 2021 • William Won, Saeed Rashidi, Sudarshan Srinivasan, Tushar Krishna

High-performance distributed training platforms should leverage multi-dimensional hierarchical networks, which interconnect accelerators through different levels of the network, to dramatically reduce expensive NICs required for the scale-out network.

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

no code implementations • 15 Sep 2021 • Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna

The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper.

AIRCHITECT: Learning Custom Architecture Design and Mapping Space

no code implementations • 16 Aug 2021 • Ananda Samajdar, Jan Moritz Joseph, Matthew Denton, Tushar Krishna

We design and train a custom network architecture called AIRCHITECT, which is capable of learning the architecture design space with as high as 94.3% test accuracy and predicting optimal configurations which achieve on average (GeoMean) 99.9% of the best possible performance on a test dataset with 10^5 GEMM workloads.

MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

no code implementations • 28 Apr 2021 • Sheng-Chun Kao, Tushar Krishna

In particular, we focus on the problem of mapping jobs from several DNNs simultaneously on an accelerator.

Efficient Exploration

Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration

no code implementations • 12 Jan 2021 • Ananda Samajdar, Michael Pellauer, Tushar Krishna

We demonstrate an instance of SARA with an accelerator we call SAGAR, which introduces a novel reconfigurable systolic array that can be configured to work as a distributed collection of smaller arrays of various sizes or as a single array with flexible aspect ratios.

CLAN: Continuous Learning using Asynchronous Neuroevolution on Commodity Edge Devices

no code implementations • 27 Aug 2020 • Parth Mannan, Ananda Samajdar, Tushar Krishna

The true impact of AI can only be fully realized if we can have AI agents continuously interacting with the real world and solving everyday problems.

Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

no code implementations • 19 Aug 2020 • Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna

In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a.

STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators

1 code implementation • 10 Jun 2020 • Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio, Tushar Krishna

The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays.

Generative Design of Hardware-aware DNNs

no code implementations • 6 Jun 2020 • Sheng-Chun Kao, Arun Ramamurthy, Tushar Krishna

We propose a new way for autonomous quantization and HW-aware tuning.

Quantization

Conditional Neural Architecture Search

no code implementations • 6 Jun 2020 • Sheng-Chun Kao, Arun Ramamurthy, Reed Williams, Tushar Krishna

Designing resource-efficient Deep Neural Networks (DNNs) is critical to deploy deep learning solutions over edge platforms due to diverse performance, power, and memory budgets.

Neural Architecture Search

Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators

no code implementations • 18 Feb 2020 • Prasanth Chatarasi, Hyoukjun Kwon, Natesh Raina, Saurabh Malik, Vaisakh Haridas, Angshuman Parashar, Michael Pellauer, Tushar Krishna, Vivek Sarkar

Searching for the optimal mappings is challenging because of the large space of mappings, and this challenge gets exacerbated with new operators and diverse accelerator configurations. To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace.
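A minimal sketch of the decoupled idea described above: phase 1 searches the off-chip subspace for a tile that minimizes a toy DRAM-traffic model, and phase 2 then searches the on-chip subspace (here, just the loop order) for that fixed tile. The buffer size, cost models, and reuse heuristic are illustrative assumptions, not Marvel's actual models.

```python
# Decoupled two-phase mapping search sketch (off-chip tiling, then on-chip loop order).
# Cost models, buffer size, and the reuse heuristic are assumptions, not Marvel's models.
import itertools

M, K, N = 128, 128, 128
BUF = 8_192                                            # assumed on-chip buffer (words)

def divs(x):
    return [d for d in range(1, x + 1) if x % d == 0]

def traffic(tm, tk, tn):
    """Words from DRAM if both input tiles are re-fetched for every tile step."""
    steps = (M // tm) * (K // tk) * (N // tn)
    return (tm * tk + tk * tn) * steps + tm * tn * (M // tm) * (N // tn)

# Phase 1: off-chip subspace -- pick the feasible tile with the lowest DRAM traffic.
_, (tm, tk, tn) = min((traffic(a, b, c), (a, b, c))
                      for a, b, c in itertools.product(divs(M), divs(K), divs(N))
                      if a * b + b * c + a * c <= BUF)

# Phase 2: on-chip subspace -- pick the loop order whose innermost loop keeps the
# largest operand tile stationary (a toy stand-in for an on-chip cost model).
tile_sizes = {"A": tm * tk, "B": tk * tn, "C": tm * tn}
def stationary_operand(order):
    return {"m": "B", "k": "C", "n": "A"}[order[-1]]   # innermost loop decides reuse
best_order = max(itertools.permutations("mkn"),
                 key=lambda o: tile_sizes[stationary_operand(o)])

print("off-chip tile:", (tm, tk, tn), "on-chip loop order:", "".join(best_order))
```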

Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks

no code implementations • 10 Feb 2020 • Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra, Weiwen Jiang, Yiyu Shi

Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs).

Neural Architecture Search

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads

no code implementations • 13 Sep 2019 • Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra

The results suggest that HDA is an alternative class of Pareto-optimal accelerators to RDA with strength in energy, which can be a better choice than RDAs depending on the use cases.

Distributed, Parallel, and Cluster Computing

SCALE-Sim: Systolic CNN Accelerator

8 code implementations • 16 Oct 2018 • Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna

Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications.

Distributed, Parallel, and Cluster Computing • Hardware Architecture
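A rough first-order runtime model in the same spirit as systolic-array simulation: fold the output over an R x C array and charge each fold the streaming length plus fill/drain skew. The formula below is a common back-of-the-envelope estimate, not SCALE-Sim's cycle-accurate model; the GEMM and array sizes are assumptions.

```python
# Back-of-the-envelope estimate for a (M x K) * (K x N) GEMM on an R x C systolic array:
# ceil(M/R)*ceil(N/C) output folds, each streaming K elements plus array fill/drain.
# A rough first-order model, NOT SCALE-Sim's cycle-accurate simulation.
import math

def systolic_cycles(M, K, N, rows, cols):
    folds = math.ceil(M / rows) * math.ceil(N / cols)
    cycles_per_fold = K + rows + cols - 2          # stream K partial sums + fill/drain skew
    return folds * cycles_per_fold

for rows, cols in [(8, 8), (32, 32), (128, 128)]:
    c = systolic_cycles(1024, 1024, 1024, rows, cols)
    util = 1024**3 / (c * rows * cols)             # MACs performed / MAC-slots available
    print(f"{rows:3d}x{cols:<3d} array: {c:>12,d} cycles, utilization {util:.2f}")
```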

GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware

no code implementations • 3 Aug 2018 • Ananda Samajdar, Parth Mannan, Kartikay Garg, Tushar Krishna

EvE can evolve the topology and weights of neural networks completely in hardware for the task at hand, without requiring hand-optimization or backpropagation training.

Image Classification • OpenAI Gym +2
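As a software-level illustration of neuroevolution (evolving weights with mutation and selection, no backpropagation), the sketch below evolves a tiny 2-2-1 network toward lower error on XOR. The task, population size, and mutation scale are assumptions; this is not the EvE/GeneSys hardware algorithm.

```python
# Minimal neuroevolution sketch: mutate and select the weights of a tiny 2-2-1 network
# on XOR, with no gradients. Task and hyperparameters are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

def forward(params, x):
    w1, b1, w2, b2 = params
    h = np.tanh(x @ w1 + b1)
    return 1 / (1 + np.exp(-(h @ w2 + b2)))            # sigmoid output

def fitness(params):
    return -np.mean((forward(params, X).ravel() - y) ** 2)   # negative MSE

def random_params():
    return [rng.normal(size=(2, 2)), rng.normal(size=2),
            rng.normal(size=(2, 1)), rng.normal(size=1)]

def mutate(params, sigma=0.2):
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

population = [random_params() for _ in range(32)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)          # best (highest fitness) first
    elites = population[:8]
    population = elites + [mutate(elites[i % 8]) for i in range(24)]

best = max(population, key=fitness)
print("predictions:", np.round(forward(best, X).ravel(), 2), "fitness:", round(fitness(best), 4))
```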

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach Using MAESTRO

no code implementations4 May 2018 Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, Tushar Krishna

The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs.

Scheduling
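To illustrate how the dataflow choice changes reuse, the sketch below counts first-order DRAM traffic for a weight-stationary versus an output-stationary schedule of one GEMM on a small array, where the stationary operand never spills but the streamed operands are re-fetched per fold. The traffic model and the layer/array sizes are assumptions, not MAESTRO's data-centric cost model.

```python
# First-order DRAM-traffic comparison of two dataflows on an R x C array. The model and
# sizes are assumptions for intuition only, not MAESTRO's cost model.
import math

def weight_stationary_traffic(M, K, N, R, C):
    """Array holds an R x C weight tile; partial sums spill between K-folds."""
    w = K * N                                   # each weight loaded once
    a = M * K * math.ceil(N / C)                # inputs re-streamed for every column fold
    c = M * N * (2 * math.ceil(K / R) - 1)      # partial sums written/read across K folds
    return w + a + c

def output_stationary_traffic(M, K, N, R, C):
    """Array holds an R x C output tile; both inputs stream, outputs never spill."""
    c = M * N                                   # each output written once
    a = M * K * math.ceil(N / C)                # inputs re-streamed for every column fold
    b = K * N * math.ceil(M / R)                # weights re-streamed for every row fold
    return c + a + b

M, K, N, R, C = 256, 1024, 256, 32, 32          # tall-K layer on a 32x32 array (assumed)
print("weight-stationary traffic:", f"{weight_stationary_traffic(M, K, N, R, C):,}")
print("output-stationary traffic:", f"{output_stationary_traffic(M, K, N, R, C):,}")
```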
