INVITED: New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design

My Preface:

As deep learning models get larger, I was wondering how edge devices would deal with the ever-growing model sizes.


Introduction:

Three challenges to the large-scale adoption of DL at the edge:

  • Hardware-constrained IoT devices
  • Data security and privacy in IoT era
  • Lack of network-aware deep learning algorithms for distributed inference across multiple IoT devices

Sending private data to the cloud exposes it to security risks. On-device inference and training is one way to avoid compromising data privacy.

The current trend in model compression seeks to reduce the resource requirements for a single device, but additional challenges still limit the widespread deployment of DL on IoT devices. These challenges can be addressed by exploiting the benefits of the network.

Federated Learning – Avoids sending private data to the cloud by first training ML models on-device with local data. Only the MODEL (never the data) is then sent to the cloud for a global update.
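A minimal sketch of one such round, assuming a FedAvg-style weighted average on the server (PyTorch-style; the client loaders and hyperparameters are placeholders, not from the paper):

```python
import copy
import torch
import torch.nn.functional as F

def federated_round(global_model, client_loaders, local_epochs=1, lr=0.01):
    """One round: each client trains a copy of the global model on its own
    private data; only the weights (never the data) go back to the server."""
    client_states, client_sizes = [], []
    for loader in client_loaders:          # one DataLoader per device
        local_model = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local_model.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(local_model(x), y).backward()
                opt.step()
        client_states.append(local_model.state_dict())
        client_sizes.append(len(loader.dataset))

    # Server-side global update: FedAvg-style weighted average of the weights.
    total = sum(client_sizes)
    new_state = {
        key: sum(s[key].float() * (n / total)
                 for s, n in zip(client_states, client_sizes))
        for key in client_states[0]
    }
    global_model.load_state_dict(new_state)
    return global_model
```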

Data-Independent Model Compression – Most model-compression techniques (e.g., for edge deployment) rely on the original training dataset or on alternate data, which poses a severe issue when that dataset contains private data. Ideal ‘data-independent model compression’ deploys models to the edge without relying on private datasets.

Communication-Aware Model Compression – IoT sensors have strict memory constraints, so a model may have to be distributed across multiple nodes; distributed inference then incurs massive communication costs.


Federated Learning

The model is first trained on a random subset of local devices, then sent to a server for a global update. But due to heterogeneity in data distribution, this may result in activation divergence.

Data across users may be non-IID, so local training updates can pull the global model in different directions. That is, feature activations at the final layers of the model can diverge across users, resulting in lower accuracy of the global model.

Smarter loss functions, such as FedMAX, can better exploit the network of devices for model training without compromising data privacy. FedMAX mitigates activation divergence by pushing each client's final-layer activations toward a common maximum-entropy prior.
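A rough sketch of what the local loss could look like, assuming a FedMAX-style term that pushes each client's final-layer activations toward a uniform (maximum-entropy) distribution; the β weight, the KL direction, and the layer choice are my assumptions, not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def fedmax_style_loss(logits, activations, targets, beta=1.0):
    """Local loss in the spirit of FedMAX: cross-entropy plus a term that
    keeps the final-layer activation distribution close to uniform, so that
    clients with non-IID data do not drift toward divergent activations."""
    ce = F.cross_entropy(logits, targets)
    # Penalize divergence between the activation distribution and uniform;
    # minimizing this KL term maximizes the activations' entropy.
    log_probs = F.log_softmax(activations, dim=1)
    uniform = torch.full_like(log_probs, 1.0 / activations.size(1))
    act_div = F.kl_div(log_probs, uniform, reduction="batchmean")
    return ce + beta * act_div
```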


Data-Independent Model Compression

Model compression here takes the form of Knowledge Distillation (KD): a large teacher network trains a small student network. What is needed is a way to do this without the real, private dataset.
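For reference, a standard Hinton-style KD loss sketch; the question above is about what data this loss is computed on, not about the loss itself (the temperature and weighting values are illustrative):

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, T=4.0, alpha=0.9):
    """Classic knowledge distillation: the student matches the teacher's
    softened class probabilities in addition to the hard labels."""
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    soft_student = F.log_softmax(student_logits / T, dim=1)
    distill = F.kl_div(soft_student, soft_teacher,
                       reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, targets)
    return alpha * distill + (1 - alpha) * hard
```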

Dream Distillation does not require real data for KD; instead, it uses metadata from a single layer to train accurate student models.

  • Use the k-means algorithm to cluster the real activations at the average-pool layer of the teacher network, computed on a small part of CIFAR-10; the resulting cluster centroids act as the metadata.
  • Use the metadata and the teacher network to generate a large number of synthetic images that contain the teacher's knowledge of the classes.

These generated images are the “dreams” of the deep network, hence the name.
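A rough sketch of those two steps, assuming scikit-learn k-means for the metadata and a DeepDream-style optimization from noise for the synthetic images. `teacher_avgpool` (the teacher truncated at its average-pool layer), the 32×32 image size, and the hyperparameters are illustrative assumptions; the paper's actual generation procedure may use richer metadata than just the centroids:

```python
import torch
from sklearn.cluster import KMeans

# Step 1: metadata = cluster centroids of the real average-pool activations,
# computed once from a small fraction of the original data.
def build_metadata(avgpool_activations, k=10):
    km = KMeans(n_clusters=k).fit(avgpool_activations)   # [N, C] array
    return km.cluster_centers_                            # [k, C]

# Step 2: "dream" synthetic images whose teacher activations match a target
# drawn from the metadata, by optimizing random noise images.
def dream_images(teacher_avgpool, target_activation, n=8, steps=200, lr=0.1):
    imgs = torch.randn(n, 3, 32, 32, requires_grad=True)  # CIFAR-sized noise
    target = torch.tensor(target_activation, dtype=torch.float32)
    opt = torch.optim.Adam([imgs], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        act = teacher_avgpool(imgs)            # teacher up to its avg-pool layer
        ((act - target) ** 2).mean().backward()
        opt.step()
    return imgs.detach()
```

The generated images can then be fed to both teacher and student and trained with the usual KD loss, so the private dataset never leaves its owner.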

ref on KD: “A technique for distilling deep learning model knowledge, Knowledge Distillation” | Seongsu (baeseongsu.github.io)

Can it be applied to transformers?


Communication-Aware Model Compression

The network of edge devices must be exploited via distributed learning for high-accuracy, low-latency intelligence, but most model-compression literature focuses on a single device.

As many IoT devices are severely memory-constrained, the model itself must be distributed across multiple devices. This leads to heavy inter-device communication at each layer of the deep network.

Thus, minimizing communication must be pursued alongside memory and computation for efficient distributed inference.
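A back-of-envelope illustration of the cost: if a single CNN is split across devices, roughly the full activation map of every split layer has to cross the network for each inference. The layer shapes below are made up purely for illustration:

```python
# Per-inference activation traffic for a hypothetical 3-stage CNN split
# across devices. Shapes and device count are illustrative, not from the paper.
layer_activations = [(64, 32, 32), (128, 16, 16), (256, 8, 8)]  # (C, H, W)
bytes_per_value = 4  # float32

per_inference = sum(c * h * w for c, h, w in layer_activations) * bytes_per_value
print(f"activations exchanged per inference: {per_inference / 1e6:.2f} MB")
```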

Network of Neural Networks (NoNN)

A new distributed inference paradigm with a memory- and communication-aware student architecture obtained from a single large teacher model. A NoNN consists of multiple disjoint student modules, each of which focuses only on a part of the teacher's knowledge.

NoNN partitions the teacher's final convolution layer: in CNNs, features for different classes are learned by different filters. These activation patterns are used to create a filter activation network that represents how the teacher's knowledge about the various classes is organized across filters. This network is then partitioned, and each partition is used to train an individual student module.
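A sketch of how such a filter activation network might be built and partitioned, assuming a simple co-activation edge weight and networkx community detection as the partitioner; the paper's exact edge definition and partitioning algorithm may differ:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

def partition_filters(final_conv_acts):
    """final_conv_acts: [N, F] array of spatially averaged activations of the
    F filters in the teacher's final conv layer over N samples. Builds a
    filter co-activation graph and splits it into communities; each community
    becomes the knowledge target for one student module."""
    acts = np.asarray(final_conv_acts)
    coact = acts.T @ acts                      # [F, F] co-activation strengths
    G = nx.Graph()
    n_filters = coact.shape[0]
    for i in range(n_filters):
        for j in range(i + 1, n_filters):
            if coact[i, j] > 0:                # dead filters are simply skipped
                G.add_edge(i, j, weight=float(coact[i, j]))
    communities = greedy_modularity_communities(G, weight="weight")
    return [sorted(c) for c in communities]
```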

As separate students mimic parts of the teacher's knowledge, NoNN yields a highly parallel student architecture, with each student module selected to fit the edge constraints. Individual student modules do not communicate until the final fully connected layer, so compared to a horizontally split deep network, this method has a much lighter communication cost.
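A structural sketch of the resulting student, with per-device modules whose outputs are fused only at the final FC layer (the module architectures and feature sizes are placeholders):

```python
import torch
import torch.nn as nn

class NoNNStudent(nn.Module):
    """Disjoint student modules run in parallel -- potentially on different
    devices -- and only their final feature vectors are exchanged and fused
    by a small fully connected classifier."""
    def __init__(self, student_modules, feat_dims, n_classes):
        super().__init__()
        self.parts = nn.ModuleList(student_modules)   # one small CNN per device
        self.fc = nn.Linear(sum(feat_dims), n_classes)

    def forward(self, x):
        # Each part sees the same input but mimics a different group of the
        # teacher's filters; nothing is communicated until the concatenation.
        feats = [part(x) for part in self.parts]
        return self.fc(torch.cat(feats, dim=1))
```

The key property is that the concatenation before the FC layer is the only point where the modules meet, so only small feature vectors, not full activation maps, need to travel between devices.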


A thought:

Both data-independent model compression and communication-aware model compression seem to be demonstrated only on CNNs.

Dream Distillation works on top of Knowledge Distillation, and KD has already been applied to transformers (e.g., DistilBERT).

NoNN is unique in that it minimizes inter-module communication. This is achieved by communicating only at the final FC layer.

  • It is quite intuitive to split the classification task across student modules by class.
  • Unsure whether this would be applicable to transformers, which have thousands of output classes (vocabulary words).
