On Transformer (Attention) Acceleration

Software techniques (methods like pruning and knowledge distillation) are a different topic, to be discussed later.


The majority of the computation falls into two blocks: 1) self-attention and 2) the feedforward network.

  • self-attention is BLAS level 3 (matrix-matrix)
  • the feedforward network is BLAS level 2 (matrix-vector) – see the sketch after this list
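A minimal NumPy sketch of these two kernel shapes, assuming single-token (batch-1) decoding so the feedforward collapses to matrix-vector products while attention stays matrix-matrix over the cached sequence; all sizes are illustrative:

```python
import numpy as np

d_model, d_ff, seq_len = 512, 2048, 128           # illustrative sizes

# Self-attention over a cached sequence: matrix-matrix (BLAS level 3).
Q = np.random.randn(seq_len, d_model)
K = np.random.randn(seq_len, d_model)
V = np.random.randn(seq_len, d_model)

scores = Q @ K.T / np.sqrt(d_model)               # (seq_len, seq_len) GEMM
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
context = weights @ V                             # (seq_len, d_model) GEMM

# Position-wise feedforward for one token: matrix-vector (BLAS level 2).
x = context[-1]                                   # single token vector
W1 = np.random.randn(d_ff, d_model)
W2 = np.random.randn(d_model, d_ff)
ffn_out = W2 @ np.maximum(W1 @ x, 0.0)            # two GEMVs + ReLU
```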

Put simply, the more fundamental optimizations would be:

  • Tackle the matmul engine, which is already pervasive (SIMD units, systolic arrays, vector instructions such as mulps)
  • Tackle sparse matrix handling, which arises from activation/model pruning (a CSR sketch follows this list)
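A minimal sketch of what a sparse engine gets to skip, assuming unstructured weight pruning stored in CSR form; the format, sizes, and pruning rule are illustrative:

```python
import numpy as np

def dense_to_csr(W, tol=0.0):
    """Compress a pruned weight matrix: keep only nonzeros (CSR layout)."""
    vals, cols, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(np.abs(row) > tol)[0]
        vals.extend(row[nz]); cols.extend(nz)
        row_ptr.append(len(vals))
    return np.array(vals), np.array(cols), np.array(row_ptr)

def csr_matvec(vals, cols, row_ptr, x):
    """y = W @ x touching only the surviving weights."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        lo, hi = row_ptr[i], row_ptr[i + 1]
        y[i] = vals[lo:hi] @ x[cols[lo:hi]]
    return y

W = np.random.randn(256, 256)
W[np.abs(W) < 1.0] = 0.0            # crude magnitude pruning (~68% zeros)
x = np.random.randn(256)
vals, cols, row_ptr = dense_to_csr(W)
assert np.allclose(csr_matvec(vals, cols, row_ptr, x), W @ x)
```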

However, what about exploiting the characteristics of the Transformer network itself?

  • see Demystifying BERT for an analysis of these characteristics

“Hardware Accelerator for Multi-Head Attention and Position-Wise Feed-Forward in the Transformer” – Nanjing Univ.

The design is built around two key components, a multi-head attention ResBlock and a position-wise feedforward network ResBlock, covering the two most complex layers.

First, an efficient method partitions the huge matrices in the Transformer and lets the two ResBlocks share most of the hardware resources.
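The paper's actual partitioning scheme is not reproduced here; as a rough illustration, the sketch below tiles an arbitrary matmul into blocks sized for one hypothetical 64x64 processing array, which is what allows a single array to be time-shared by both ResBlocks (tile size and matrix shapes are made up):

```python
import numpy as np

TILE = 64   # hypothetical systolic-array dimension shared by both ResBlocks

def tiled_matmul(A, B, tile=TILE):
    """C = A @ B computed tile-by-tile, as a fixed-size PE array would."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                # one tile-sized multiply-accumulate pass on the shared array
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

# Both "huge" matmuls map onto the same tile loop:
scores = tiled_matmul(np.random.randn(512, 64), np.random.randn(64, 512))     # attention QK^T
ffn    = tiled_matmul(np.random.randn(512, 512), np.random.randn(512, 2048))  # feedforward
```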

Second, the computation flow is designed for high utilization of the systolic array.

Third, the complicated nonlinear functions are heavily optimized to reduce hardware complexity and the latency of the entire system.
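The paper's specific approximations are not reproduced here; as one common hardware trick of this kind, the sketch below replaces exp inside softmax with a small lookup table plus linear interpolation (table size and input range are arbitrary choices):

```python
import numpy as np

# Lookup table for exp(t) over t in [-8, 0]; softmax inputs are shifted
# so the max element maps to 0, keeping the argument in range.
LUT_X = np.linspace(-8.0, 0.0, 65)
LUT_Y = np.exp(LUT_X)

def exp_lut(t):
    """Piecewise-linear exp approximation (roughly what a small ROM + interpolator gives)."""
    t = np.clip(t, -8.0, 0.0)
    return np.interp(t, LUT_X, LUT_Y)

def softmax_approx(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = exp_lut(shifted)
    return e / e.sum(axis=-1, keepdims=True)

s = np.random.randn(4, 16) * 3
ref = np.exp(s - s.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
print(np.abs(softmax_approx(s) - ref).max())   # small approximation error
```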


[2022 ISCA] “Accelerating Attention through Gradient-Based Learned Runtime Pruning” – UCSD, Mingu Kang

Prunes low attention scores, using a threshold learned during training with their gradient-based method.
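The threshold itself is learned in training; the sketch below only shows its effect at inference time, with a hand-picked value standing in for the learned one:

```python
import numpy as np

def pruned_attention(Q, K, V, theta=-2.0):
    """Drop score/value work for attention scores below a (here hand-picked) threshold theta."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    keep = scores >= theta                      # a learned threshold would go here
    scores = np.where(keep, scores, -np.inf)    # pruned pairs contribute zero weight
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V, keep.mean()                   # output + fraction of pairs kept

Q, K, V = (np.random.randn(64, 64) for _ in range(3))
out, kept = pruned_attention(Q, K, V)
print(f"{kept:.0%} of query-key pairs survive the threshold")
```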

A bit-serial architecture is devised to best exploit threshold pruning: it processes transformer language models with a bit-level early-termination microarchitectural mechanism.
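The real microarchitecture is more involved; this sketch only captures the core idea of MSB-first bit-serial accumulation that terminates once the remaining low-order bits can no longer lift the score over the pruning threshold (bit width, threshold, and names are illustrative):

```python
import numpy as np

def bitserial_score(q, k_int, nbits=8, theta=0.0):
    """
    MSB-first bit-serial dot product q . k with early termination.
    k_int holds unsigned nbits-bit quantized key values; q stays floating point.
    Returns (score or None if pruned, number of bit planes processed).
    """
    partial = 0.0
    pos_sum = q[q > 0].sum()                     # bounds what the unseen low bits can still add
    for b in range(nbits - 1, -1, -1):           # MSB first
        bit_plane = (k_int >> b) & 1
        partial += (2 ** b) * float(q @ bit_plane)
        max_remaining = (2 ** b - 1) * pos_sum   # best case: all lower bits are 1 where q > 0
        if partial + max_remaining < theta:      # can never reach the threshold: terminate
            return None, nbits - b
    return partial, nbits

q = np.random.randn(64)
k_int = np.random.randint(0, 256, size=64)       # 8-bit quantized key
score, planes = bitserial_score(q, k_int, theta=50.0)
print(score, planes)
```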


[2021 ISCA] “ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in NNs” -SNU, Jae W. Lee

Attention is very expensive: its cost grows quadratically with the number of input tokens, since the n x n score matrix costs O(n²·d) to compute for sequence length n.

ELSA is a HW-SW co-design solution that filters out relations (attention scores) unlikely to affect the final output.
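ELSA's actual approximation pipeline (hash construction, norm handling, threshold selection) is more elaborate than this; the sketch below only illustrates the filtering idea with a sign-random-projection similarity estimate, with made-up parameters and a keep-the-best fallback added just to keep the demo robust:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_hash = 64, 32
P = rng.standard_normal((n_hash, d))          # random projection planes (illustrative)

def sign_hash(X):
    """n_hash-bit signature per row; angular similarity ~ fraction of matching bits."""
    return (X @ P.T) > 0

def filtered_attention(Q, K, V, min_match=0.6):
    hq, hk = sign_hash(Q), sign_hash(K)
    match = (hq[:, None, :] == hk[None, :, :]).mean(-1)     # (n_q, n_k) estimated similarity
    keep = match >= min_match                               # cheap filter
    keep |= match == match.max(-1, keepdims=True)           # always keep each query's best match
    # exact scores only where kept (computed densely here for brevity; hardware skips the rest)
    scores = np.where(keep, Q @ K.T / np.sqrt(d), -np.inf)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V, keep.mean()

Q, K, V = rng.standard_normal((3, 128, d))
out, kept = filtered_attention(Q, K, V)
print(f"exact scores computed for {kept:.0%} of query-key pairs")
```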


[2020 HPCA] “A³: Accelerating Attention Mechanisms in NN with Approximation” – SNU, Jae W. Lee

Approximates attention scores; falls behind ELSA in performance.


[2022] “Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design”
