ELECTRA – Single Thread
ELECTRA is a BERT-based transformer trained with a distinctive pre-training technique (replaced-token detection); architecturally, it shares all the characteristics of an attention-based Vaswani transformer.
BERT summary here
Vaswani transformer summary here
Since multi-head attention and the feed-forward layers are all performed as matrix operations, I expected:
- memory-intensive, low-locality data access
- frequent, concentrated matrix multiplication
I compared two model sizes: ELECTRA small and base.
Tools
The Hugging Face implementation of ELECTRA in PyTorch was chosen as the profiling target.
An analysis of Hugging Face's ELECTRA methods is here.
Intel’s Vtune profiler was recommended for use on Intel CPUs. Vtune was installed on both the remote target server and my local laptop; from the laptop’s Vtune, I would launch a profiling session on the target server, where the actual ELECTRA workload ran.
When profiling ELECTRA with a pre-compiled release of PyTorch (installed with conda, for example), Vtune could not see below the framework-level functions, let alone the operators themselves; everything showed up as “func@address“.
A source build of PyTorch with the debug option was necessary, which I will explain here in the future.
Intel Vtune
As an Intel-CPU-specific profiler, Vtune reports metrics that reflect the characteristics of Intel CPUs (especially uOps). Details are organized here.
Here are some notes I scribbled going into the actual profiling.
Timeline
12/29/2021
start of internship
January
Selecting ELECTRA Implementation
PyTorch Debug build for Vtune Profiler to see operation fns.
February
Vtune metrics
Intel CPU characteristics
IPC profiling – SGEMM analysis
March
Memory profiling
ELECTRA multi-threading
April
OpenMP
Blocked Matrix Multiplication
Cache Performance
May
Paper Writing and Submission
June-July
IEIE poster session
See here for the detailed timeline.
Result and Analysis
IPC
My initial expectations were:
- Matmul size increases as model size increases.
- This will result in decreased IPC for the entire model.
The actual profiling result was the exact opposite.
Note, however, that the IPC of SGEMM is in general much higher than the IPC of any ELECTRA configuration. This can be explained by SGEMM achieving higher port throughput: it keeps more execution ports busy simultaneously.
Therefore, the higher IPC of the larger ELECTRA can be attributed to its higher proportion of SGEMM.
Memory Usage
SGEMM was more core-bound than memory-bound: during matrix operations, references rarely go below the L1 cache. This reflects the effectiveness of the CPU memory hierarchy; it is the core's execution capability, rather than memory, that limits performance.
Another distinctive characteristic of modern CPUs is multiple cores, all sharing the L3 cache.
ELECTRA – Multi-thread
Multi-threading (intra-op parallelism) is implemented with OpenMP.
The speedup from thread-level parallelism is expected to differ with each operation's characteristics:
- libtorch_cpu: non-matmul operations such as layer norm and softmax
- libomp: parallelization overhead
=> SGEMM and libtorch_cpu showed near-ideal speedup.
Computation Breakdown of each BERT Layer
Sequence: MH-Attention -> Add&Norm -> FF
Unlike recurrent seq2seq models, a transformer takes in the entire sequence at once and computes it as a whole. This opens up more parallelization opportunities.
Matrix multiplication is readily parallelized with matrix blocking.
As an opportunity to study OpenMP, I implemented blocked matrix multiplication and profiled the memory hierarchy, for which Vtune's access to memory performance counters proved useful.
Parallelizing Block Matrix Multiplication
— this is the part I used for my graduate thesis.
Naive matrix multiplication iterates over all rows of operand B to calculate a single column of C, so each element of B is fetched again for every column of C.
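A minimal sketch of the naive version (my own illustration in C, not the thesis code; square row-major matrices are an assumption for clarity):

```c
#include <stddef.h>

/* Naive matmul: C = A * B, all n x n, row-major.
 * The inner loop strides through B by n elements (column-wise),
 * which is exactly the low-locality access pattern described above. */
void matmul_naive(const float *A, const float *B, float *C, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```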
Block matrix multiplication breaks the operands into blocks and reuses each block while it resides in a higher-level cache, thus exploiting data reusability.
Reusing data held in higher-level caches, which have faster read/write speeds, effectively lowers the average access latency.
Note that this understanding of matrix multiplication is quite abstract. In the actual inner workings, there is much more to consider:
- The effect of prefetching was unpredictable and thus disregarded.
- Cache-block alignment was assumed as given.
- The matrix operands were assumed to be larger than the cache block size.
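Under those same assumptions (square matrices whose dimension is a multiple of the block size), the blocking scheme can be sketched as follows; the block size BS here is an arbitrary illustrative value, not the tuned blocking factor:

```c
#include <stddef.h>

#define BS 32  /* illustrative block size, not the tuned blocking factor */

/* Blocked matmul: C += A * B over BS x BS tiles (row-major; n must be
 * a multiple of BS, and C must be zero-initialized by the caller).
 * While one tile each of A, B and C is resident in cache, every
 * element of those tiles is reused BS times before eviction. */
void matmul_blocked(const float *A, const float *B, float *C, size_t n) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```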
Single Thread Results
Block matrix multiplication effectively reduced the average access latency.
Compared to naive matmul, the speedup was larger for larger operands.
Blocking Factor Analysis
Runtime and memory stalls reached their minimum when the L1 cache, the fastest cache, could fit all three block operands.
“I would have liked to verify that the three operand blocks are actually resident in the L1 cache at one moment in time.”
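The "all three blocks fit in L1" condition gives a back-of-the-envelope bound on the blocking factor: three B×B float32 blocks occupy 3·B²·4 bytes, which must stay under the L1 data cache size. A sketch (the 32 KiB L1d size used below is a typical Intel figure I am assuming, not a value from the profiled machine):

```c
/* Largest block dimension b such that three b x b float32 tiles
 * (one each of A, B and C) fit in an L1 data cache of l1_bytes,
 * i.e. the largest b with 3 * b^2 * sizeof(float) <= l1_bytes. */
int max_block_dim(unsigned long l1_bytes) {
    int b = 0;
    while (3UL * (b + 1) * (b + 1) * sizeof(float) <= l1_bytes)
        b++;
    return b;
}
```

For an assumed 32 KiB L1d this gives a block dimension of 52; the real optimum also depends on associativity and on what else occupies the cache, which is why the blocking factor was swept empirically.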
Multithreading Performance
OpenMP assigns work to each thread by giving it a blocked portion of the destination operand to compute (a block of C in the previous example). Since C is the only memory region written, there is little to no synchronization overhead; accordingly, I saw a near-ideal runtime speedup.
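That work assignment can be sketched with a single OpenMP pragma over the tiles of C (my illustration, not the thesis code; compile with -fopenmp):

```c
#include <stddef.h>

#define BS 32  /* illustrative block size */

/* Each (ii, jj) pair names one BS x BS tile of C. collapse(2) spreads
 * those tiles across the threads; every thread writes a disjoint
 * region of C, so no locks or atomics are needed. n must be a
 * multiple of BS and C must be zero-initialized by the caller. */
void matmul_blocked_omp(const float *A, const float *B, float *C, size_t n) {
    #pragma omp parallel for collapse(2)
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t kk = 0; kk < n; kk += BS)
                for (size_t i = ii; i < ii + BS; i++)
                    for (size_t k = kk; k < kk + BS; k++) {
                        float a = A[i * n + k];
                        for (size_t j = jj; j < jj + BS; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

Reductions over the kk dimension stay inside a single thread's tile here, which is what keeps the synchronization cost near zero.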
Conclusion:
Single Thread Performance Comparison:
Block matmul effectively decreases runtime by reducing memory-access stalls.
Analysis of the blocking factor (yet to be validated):
The machine-dependent blocking factor reaches its optimum when the L1 cache is exploited to its maximum.
Analysis of multi-threaded block matmul:
Reaches ideal speedup thanks to block matmul's small synchronization overhead.
Presentation at IEIE Summer Conference:
Potential Topics:
When SGEMM is compiled...
- AVX intrinsics and MULPS (Multiply Packed Single-Precision Floating-Point Values) (LINK)
-Accurate spotting of memory
Resources:
All code is uploaded to my GitHub repo.
Undergrad thesis outline can be found here (partially in Korean)
Undergrad thesis paper can be found here (only abstract in English)