Sparse and Irregular Tensor Computation

Are sparse matrices of concern to DL as well? – Definitely, yes.


From Survey of Accelerator Arch for DNNs (2020)

“a large proportion of NN connections can be pruned to zero with or without minimum accuracy loss”

“Many corresponding computing architectures have also been proposed”


From Cambricon (2018)

Sparsity in Neural Network section:

“sparsity is an effective way to resolve overfitting, which prunes the network; there are static and dynamic sparsity:

static sparsity removes connections permanently

dynamic sparsity just drops the computation on sight of a 0”
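A minimal toy sketch of that distinction (my own, not from the Cambricon paper), assuming a plain dot product: static sparsity zeroes weights once before deployment, while dynamic sparsity checks operands at runtime and skips the multiply whenever it sees a zero.

```python
import numpy as np

def static_prune(weights, threshold=0.1):
    """Static sparsity: small weights are removed permanently (set to zero)
    before deployment; the sparsity pattern is fixed for every input."""
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def dynamic_dot(weights, activations):
    """Dynamic sparsity: skip the multiply-accumulate whenever either
    operand happens to be zero for this particular input."""
    acc = 0.0
    for w, a in zip(weights, activations):
        if w == 0.0 or a == 0.0:   # "drops the computation on sight of a 0"
            continue
        acc += w * a
    return acc

w = static_prune(np.array([0.05, -0.8, 0.02, 0.6]))  # -> [0, -0.8, 0, 0.6]
x = np.array([1.0, 0.0, 2.0, 3.0])                   # zero comes from the input (e.g. ReLU)
print(dynamic_dot(w, x))                             # only one multiply actually runs: 0.6 * 3.0
```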

to read: 2204.09656.pdf (arxiv.org) (Transformer pruning framework-2022)

But according to this recent paper, pruning is an effective way to reduce inference cost, which also makes sense.

-> i.e., FW-/SW-level removal of heads and filters

Proposes an SW/HW approach to address the irregularity of sparse NNs:

Accelerator Architecture: must look further, but it delves into the RTL level to exploit both static and dynamic sparsity


Lecture 03 – Pruning and Sparsity (Part I) | MIT 6.S965 – YouTube

1] AI models are getting larger

2] Memory is very expensive.

Pruning can be done from a software perspective, as well as from a complementary hardware perspective.

“uses certain heuristics to decide what to prune”

Fine-grain Pruning = unstructured pruning – sparse matrices

  • synapse pruning – erasing certain connections only.
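A minimal sketch of what fine-grain pruning looks like in software, assuming simple magnitude-based selection (the heuristic and the 50% sparsity target are just examples): individual weights are zeroed, so the matrix keeps its shape but becomes sparse.

```python
import numpy as np

def unstructured_prune(weight, sparsity=0.5):
    """Zero out the `sparsity` fraction of individual weights with the
    smallest magnitude; the matrix keeps its shape but becomes sparse."""
    k = int(weight.size * sparsity)
    threshold = np.sort(np.abs(weight), axis=None)[k]
    mask = np.abs(weight) >= threshold
    return weight * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
W_pruned, mask = unstructured_prune(W, sparsity=0.5)
print((W_pruned == 0).mean())   # roughly half the entries are now zero
```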

Neural Network Pruning 101. All you need to know not to get lost | by Hugo Tessier | Towards Data Science – also a really good reference.

Coarse-grain Pruning = structured pruning – prune certain dimensions (see the sketch after this list).

  • = neuron pruning (erase all synapses connected to a neuron)
  • = rows of weight matrices (in a linear layer)
  • certain channels (by pruning entire filters) (convolution layer)
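A minimal sketch of coarse-grain pruning under the usual layout assumptions (linear weights shaped (out_features, in_features), conv weights shaped (out_channels, in_channels, kH, kW)); the L2-norm criterion is just one possible heuristic.

```python
import numpy as np

def prune_linear_rows(W, b, keep_ratio=0.5):
    """Structured pruning of a linear layer: drop whole output rows
    (neurons) with the smallest L2 norm; the result is a smaller dense matrix."""
    norms = np.linalg.norm(W, axis=1)            # one norm per output neuron
    n_keep = max(1, int(len(norms) * keep_ratio))
    keep = np.argsort(norms)[-n_keep:]           # indices of the strongest rows
    return W[keep], b[keep]

def prune_conv_filters(W, keep_ratio=0.5):
    """Structured pruning of a conv layer: drop entire filters
    (output channels), i.e. prune along the out_channels dimension."""
    norms = np.linalg.norm(W.reshape(W.shape[0], -1), axis=1)
    n_keep = max(1, int(W.shape[0] * keep_ratio))
    keep = np.argsort(norms)[-n_keep:]
    return W[keep]

W_fc, b_fc = np.random.randn(8, 16), np.random.randn(8)
W_conv = np.random.randn(32, 16, 3, 3)
print(prune_linear_rows(W_fc, b_fc)[0].shape)   # (4, 16): smaller, still dense
print(prune_conv_filters(W_conv).shape)         # (16, 16, 3, 3)
```

Note that after dropping rows or filters, the next layer's input dimension must shrink to match; this is why structured pruning yields a genuinely smaller dense network rather than one full of zeros.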

From the inference accelerator perspective:

Sparse Weight – static sparsity

Sparse Activation – dynamic sparsity (input-dependent, determined at runtime)
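A small illustration of the difference (toy shapes, my own example): the zero pattern of pruned weights is known before deployment, while the zero pattern of ReLU activations depends on the input and is only known at runtime.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[np.abs(W) < 0.7] = 0.0            # statically pruned weights: pattern is fixed

def relu(x):
    return np.maximum(x, 0.0)

for i in range(3):
    x = rng.normal(size=64)
    act = relu(W @ x)
    # Weight sparsity never changes; activation sparsity differs per input.
    print(f"input {i}: weight sparsity={np.mean(W == 0):.2f}, "
          f"activation sparsity={np.mean(act == 0):.2f}")
```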


Post Training Framework for Transformers

Limiting the scope to post-training techniques.

  • Why add steps to an already compute-heavy training process? (Training already costs much more than inference; “this increases training time and computational overhead”.) (Do current accelerators do this as well?)

Many techniques heavily rely on the zero activations of ReLU functions, while many transformers now use GELU (UNet utilizes ReLU, though); thus many techniques can be activation-function specific.
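A quick sanity check of that claim (toy code, not from the lecture; the tanh GELU approximation is the GPT-2-style one): ReLU produces exact zeros that zero-skipping hardware can exploit, while GELU maps negative inputs to small but nonzero values.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gelu(x):
    # tanh approximation of GELU, as used in GPT-2 style models
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.random.default_rng(0).normal(size=100_000)
print("ReLU exact zeros:", np.mean(relu(x) == 0.0))   # ~0.5: half the activations vanish
print("GELU exact zeros:", np.mean(gelu(x) == 0.0))   # ~0.0: small but nonzero values instead
```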

*Does the attention layer not use an activation function? (It seems ReLU is not inserted between every layer.)

Why does a transformer not use an activation function following the multi-head attention layer? – Artificial Intelligence Stack Exchange

GPT2 seems to have a softmax activation.

What about Stable Diffusion's UNet? Wait, isn't that a Transformer?

(Then we should focus even more on the FFN, but what would the ratio of FFN to MHA be?)


Basically, the framework seeks to reduce model size with a small-scale search process (using sample data, so some might call it training) to find the optimal structured pruning combination (structured in that it seeks to reduce certain dimensions of the parameters).

Thus the method operates at the software level (it explicitly seeks to be independent from specialized hardware logic, which would be better at exploiting unstructured pruning).
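A very rough sketch of the kind of search such a framework might run; this is my own toy structure over hypothetical `evaluate` and `prune` helpers and made-up keep-ratios, not the actual algorithm of arXiv:2204.09656.

```python
import itertools

def search_pruning_config(model, calib_data, evaluate, prune, max_acc_drop=0.01):
    """Toy post-training search: try structured-pruning configurations on a
    small calibration set and return the cheapest one within the accuracy budget.
    `evaluate(model, data)` and `prune(model, ...)` are assumed helpers,
    not part of any real framework."""
    baseline = evaluate(model, calib_data)
    best_cfg, best_cost = None, float("inf")
    # Candidate keep-ratios for attention heads and FFN hidden dimensions.
    for heads_keep, ffn_keep in itertools.product([1.0, 0.75, 0.5], [1.0, 0.75, 0.5]):
        candidate = prune(model, heads_keep=heads_keep, ffn_keep=ffn_keep)
        acc = evaluate(candidate, calib_data)
        cost = heads_keep + ffn_keep          # stand-in for FLOPs / parameter count
        if baseline - acc <= max_acc_drop and cost < best_cost:
            best_cfg, best_cost = (heads_keep, ffn_keep), cost
    return best_cfg
```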

As for whether it makes things more challenging for hardware,

-> in what way? (Does it lead to sparse matrices? Is the result just a smaller network, or a network full of zeros?) – the zero-filled case is a sparse matrix, which can be represented in a compressed form with metadata and indices.
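For reference, a minimal sketch of what “compressed form with metadata and indices” means, using the standard CSR (compressed sparse row) layout: store only the nonzero values, plus their column indices and per-row offsets.

```python
import numpy as np

def to_csr(dense):
    """Convert a dense matrix to CSR: nonzero values, their column indices,
    and row pointers marking where each row starts in the value array."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

W = np.array([[0, 0, 3],
              [4, 0, 0],
              [0, 5, 6]])
vals, cols, ptrs = to_csr(W)
print(vals)   # [3 4 5 6]
print(cols)   # [2 0 1 2]
print(ptrs)   # [0 1 2 4]  (row i occupies vals[ptrs[i]:ptrs[i+1]])
```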

I only mentioned transformers as a good case for the increasing size of DL networks.

At the hardware level, supporting unstructured pruning seems to be the most ideal.
