ELECTRA – Single Thread
ELECTRA is a BERT-based transformer trained with a distinctive pre-training technique (replaced-token detection); architecturally, it shares all the characteristics of an attention-based Vaswani transformer.
BERT summary here
Vaswani transformer summary here
Since multi-head attention and the feed-forward layers are all performed as matrix operations, I expected:
- memory-intensive, low-locality data access
- frequent, concentrated matrix multiplication
I compared two model sizes: ELECTRA small and base.
Tools
The Hugging Face implementation of ELECTRA in PyTorch was chosen as the profiling target.
An analysis of Hugging Face's ELECTRA methods is here.
Intel’s Vtune profiler was recommended for use on Intel CPUs. Vtune was installed on both the remote target server and my local laptop; from the laptop’s Vtune, I would launch a profiling session on the target server, where the actual ELECTRA workload ran.
When profiling ELECTRA with a pre-compiled release of PyTorch (installed with conda, for example), Vtune could not see below the framework-level functions, let alone the operators themselves; everything showed up as “func@address“.
A source build of PyTorch with the debug option was necessary, which I will explain here in the future.
Intel Vtune
As an Intel-CPU-specific profiler, Vtune reports metrics that reflect the characteristics of Intel CPUs (especially uOps). Details are organized here.
Here are some notes I scribbled going into the actual profiling.
Timeline
12/29/2021
start of internship
January
Selecting ELECTRA Implementation
PyTorch Debug build for Vtune Profiler to see operation fns.
February
Vtune metrics
Intel CPU characteristics
IPC profiling – SGEMM analysis
March
Memory profiling
ELECTRA multi-threading
April
OpenMP
Blocked Matrix Multiplication
Cache Performance
May
Paper Writing and Submission
June-July
IEIE poster session
See here for the detailed timeline.
Result and Analysis
IPC
My initial expectations were:
- Matmul size increases as model size increases.
- This will result in decreased IPC for the entire model.
The actual profiling result was the exact opposite.
Note, however, that the IPC of SGEMM is in general much higher than the IPC of any ELECTRA configuration. This can be explained by SGEMM achieving higher port throughput: it keeps more execution ports busy simultaneously.
Therefore, the higher IPC of the larger ELECTRA can be attributed to its higher proportion of SGEMM.
Memory Usage
SGEMM was more core-bound than memory-bound: during matrix operations, references rarely go below the L1 cache. This reflects the effectiveness of the CPU memory hierarchy; it is the core's execution capability, rather than memory, that limits performance.
Another distinctive characteristic of modern CPUs is multiple cores, all sharing the L3 cache.
ELECTRA – Multi-thread
Multi-threading (intra-op parallelism) is implemented with OpenMP.
The speedup from thread-level parallelism is expected to differ with each operation's characteristics:
- libtorch_cpu: non-matmul operations such as layer norm and softmax
- libomp: parallelization overhead
=> SGEMM and libtorch_cpu showed near-ideal speedup.
Computation Breakdown of each BERT Layer
Sequence: MH-Attention -> Add&Norm -> FF
Unlike recurrent seq2seq models, a transformer takes in the entire sequence at once and computes it as a whole. This opens up more parallelization opportunities.
Matrix multiplication is readily parallelized with matrix blocking.
As an opportunity to study OpenMP, I implemented blocked matrix multiplication and profiled the memory hierarchy, for which Vtune's access to memory performance counters proved useful.
Parallelizing Block Matrix Multiplication
— this is the part I used for my graduate thesis.
Naive matrix multiplication iterates over all rows of operand B to calculate a single column of C, so each element of B is fetched again for every column of C.
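A minimal sketch of the naive version (my own illustration in C, not the thesis code; square row-major matrices are an assumption for clarity):

```c
#include <stddef.h>

/* Naive matmul: C = A * B, all n x n, row-major.
 * The inner loop strides through B by n elements (column-wise),
 * which is exactly the low-locality access pattern described above. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```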
Block matrix multiplication breaks the operands into blocks and reuses each block while it resides in a higher-level cache, thus exploiting data reusability.
Reusing data held in higher-level caches, which have faster read/write speeds, effectively lowers the average access latency.
Note that this understanding of matrix multiplication is quite abstract. In the actual inner workings, there is much more to consider:
- The effect of prefetching was unpredictable and thus disregarded.
- Cache-block alignment was assumed as given.
- The matrix operands were assumed to be larger than the cache block size.
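Under those same assumptions (square matrices whose dimension is a multiple of the block size), the blocking scheme can be sketched as follows; the block size BS here is an arbitrary illustrative value, not the tuned blocking factor:

```c
#include <stddef.h>

#define BS 32  /* illustrative block size, not the tuned blocking factor */

/* Blocked matmul: C += A * B over BS x BS tiles (row-major; n must be
 * a multiple of BS, and C must be zero-initialized by the caller).
 * While one tile each of A, B and C is resident in cache, every
 * element of those tiles is reused BS times before eviction. */
void matmul_blocked(const float *A, const float *B, float *C, size_t n) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```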
Single Thread Results
Block matrix multiplication effectively reduced the average access latency.
Compared to naive matmul, the speedup was larger for larger operands.
Blocking Factor Analysis
Runtime and memory stalls reached their minimum when the L1 cache, the fastest cache, could fit all three block operands.
“I would have liked to verify that the three operand blocks are actually resident in the L1 cache at one moment in time.”
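The "all three blocks fit in L1" condition gives a back-of-the-envelope bound on the blocking factor: three B×B float32 blocks occupy 3·B²·4 bytes, which must stay under the L1 data cache size. A sketch (the 32 KiB L1d size used below is a typical Intel figure I am assuming, not a value from the profiled machine):

```c
/* Largest block dimension b such that three b x b float32 tiles
 * (one each of A, B and C) fit in an L1 data cache of l1_bytes,
 * i.e. the largest b with 3 * b^2 * sizeof(float) <= l1_bytes. */
int max_block_dim(unsigned long l1_bytes) {
    int b = 0;
    while (3UL * (b + 1) * (b + 1) * sizeof(float) <= l1_bytes)
        b++;
    return b;
}
```

For an assumed 32 KiB L1d this gives a block dimension of 52; the real optimum also depends on associativity and on what else occupies the cache, which is why the blocking factor was swept empirically.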
Multithreading Performance
OpenMP assigns work to each thread by giving it a blocked portion of the destination operand to compute (a block of C in the previous example). Since C is the only memory region written, there is little to no synchronization overhead; accordingly, I saw a near-ideal runtime speedup.
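That work assignment can be sketched with a single OpenMP pragma over the tiles of C (my illustration, not the thesis code; compile with -fopenmp):

```c
#include <stddef.h>

#define BS 32  /* illustrative block size */

/* Each (ii, jj) pair names one BS x BS tile of C. collapse(2) spreads
 * those tiles across the threads; every thread writes a disjoint
 * region of C, so no locks or atomics are needed. n must be a
 * multiple of BS and C must be zero-initialized by the caller. */
void matmul_blocked_omp(const float *A, const float *B, float *C, size_t n) {
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Reductions over the kk dimension stay inside a single thread's tile here, which is what keeps the synchronization cost near zero.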
Conclusion:
Single Thread Performance Comparison:
Block matmul effectively decreases runtime by reducing memory-access stalls.
Analysis of the blocking factor (yet to be validated):
The machine-dependent blocking factor reaches its optimum when the L1 cache is exploited to its maximum.
Analysis of multi-threaded block matmul:
Reaches ideal speedup thanks to block matmul's small synchronization overhead.
Presentation at IEIE Summer Conference:
Potential Topics:
When SGEMM is compiled...
- AVX intrinsics and MULPS (Multiply Packed Single-Precision Floating-Point Values) (LINK)
-Accurate spotting of memory
Resources:
All code is uploaded to my GitHub repo.
Undergrad thesis outline can be found here (partially in Korean)
Undergrad thesis paper can be found here (only abstract in English)