Please use this identifier to cite or link to this item: http://hdl.handle.net/11375/25576
Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Zheng, Rong | - |
dc.contributor.author | Fang, Yihao | - |
dc.date.accessioned | 2020-08-07T15:35:09Z | - |
dc.date.available | 2020-08-07T15:35:09Z | - |
dc.date.issued | 2020 | - |
dc.identifier.uri | http://hdl.handle.net/11375/25576 | - |
dc.description.abstract | With recent advances in deep learning, there has been increasing interest in applying deep learning algorithms to mobile edge devices (e.g. wireless access points, mobile phones, and self-driving vehicles). Such devices are closer to end users and data sources than cloud data centers, so deep learning on the edge offers several merits: 1) reduced communication overhead (e.g. latency), 2) better data privacy (e.g. sensitive information is not exposed to cloud service providers), and 3) autonomy without the need for continuous network connectivity. However, these merits come with a trade-off: deep learning on the edge often yields lower prediction accuracy or longer inference time. How to optimize this trade-off has drawn much attention in the machine learning and systems research communities, which have explored three main directions: partitioning, caching, and compression. Deep learning model partitioning applies distributed and parallel computing, leveraging computation units (e.g. edge nodes and end devices) of different capabilities to achieve the best of both worlds (accuracy and latency); the inference time of partitioning is nevertheless lower bounded by the smallest of the inference times on the edge nodes (or end devices). In contrast, model caching is not limited by such a lower bound. There are two lines of study in caching: 1) caching the prediction results on the edge node or end device, and 2) caching a partition or a less complex model on the edge node or end device. Caching the prediction results usually compromises accuracy, since a mapping function (e.g. a hash function) from inputs to cached results often cannot match the complex function computed by a full-size neural network. Caching a model's partition, on the other hand, does not sacrifice accuracy, provided a proper partition selection policy is employed. Model compression reduces deep learning model size by, for example, pruning neural network edges or quantizing network parameters. A reduced model has a smaller size and fewer operations to compute on the edge node or end device; however, compression usually sacrifices prediction accuracy in exchange for shorter inference time. In this thesis, our contributions to partitioning, caching, and compression are presented with experiments on state-of-the-art deep learning models. In partitioning, we propose TeamNet, which is based on competitive and selective learning schemes. Experiments using the MNIST and CIFAR-10 datasets show that on Raspberry Pi and Jetson TX2 (with TensorFlow), TeamNet shortens neural network inference time by as much as 53% without compromising prediction accuracy. In caching, we propose CacheNet, which caches low-complexity models on end devices and high-complexity (or full) models on edge or cloud servers. Experiments using CIFAR-10 and FVG show that on Raspberry Pi, Jetson Nano, and Jetson TX2 (with TensorFlow Lite and NCNN), CacheNet is 58-217% faster than baseline approaches that run inference tasks on end devices or edge servers alone. In compression, we propose the logographic subword model for compression in machine translation. Experiments demonstrate that on English-Chinese/Chinese-English translation tasks, the logographic subword model reduces training and inference time by 11-77% (with Theano and Torch). Together, these results show that our approaches are promising for deploying deep learning models on the mobile edge. (Illustrative sketches of the three directions appear below the metadata table.) | en_US |
dc.language.iso | en | en_US |
dc.subject | Deep Learning | en_US |
dc.subject | Edge Artificial Intelligence | en_US |
dc.title | Deep Learning on the Edge: Model Partitioning, Caching, and Compression | en_US |
dc.type | Thesis | en_US |
dc.contributor.department | Computing and Software | en_US |
dc.description.degreetype | Thesis | en_US |
dc.description.degree | Doctor of Philosophy (PhD) | en_US |
dc.description.layabstract | Edge artificial intelligence (EI) has attracted much attention in recent years. EI is a new computing paradigm in which artificial intelligence (e.g. deep learning) algorithms are distributed among the edge nodes and end devices of computer networks. EI offers many merits, such as shorter latency, better privacy, and greater autonomy. These advantages motivate us to contribute to EI by developing intelligent solutions based on partitioning, caching, and compression. | en_US |
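To make the partitioning direction concrete, here is a minimal sketch of selective inference in the spirit of TeamNet (this is not the thesis's actual code; `gate`, `expert_a`, and `expert_b` are hypothetical stand-ins). A cheap gating function routes each input to exactly one small sub-network, so only one expert's cost is paid per input, which is why the per-input inference time is bounded below by the chosen sub-network's own run time.

```python
import numpy as np

# Hypothetical sketch of selective inference: a cheap gate routes each
# input to exactly one small "expert" sub-network, so only that
# expert's compute cost is paid per input.

def gate(x: np.ndarray) -> int:
    """Toy gate: route by a cheap statistic of the input.
    A real gate would be a small learned classifier."""
    return int(np.mean(x) > 0.5)

def expert_a(x: np.ndarray) -> np.ndarray:
    return x * 2.0  # stand-in for a small trained sub-network

def expert_b(x: np.ndarray) -> np.ndarray:
    return x + 1.0  # stand-in for another small sub-network

EXPERTS = [expert_a, expert_b]

def selective_inference(x: np.ndarray) -> np.ndarray:
    # Per-input latency = gate cost + one expert's cost, hence the
    # lower bound by the smallest sub-network inference time.
    return EXPERTS[gate(x)](x)
```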
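Similarly, a hedged sketch of the first caching trend (caching prediction results) shows where the accuracy loss comes from: a hash of the input only retrieves results for inputs already seen (near-)verbatim, so it cannot reproduce the full network's behavior on novel inputs. The `quantize_key` coarsening and the cache policy below are illustrative assumptions, not taken from the thesis.

```python
import hashlib
import numpy as np

# Hypothetical prediction-result cache: map a coarse hash of the input
# to a previously computed prediction.  Near-duplicate inputs hit the
# cache; anything else falls back to the full model, which is why
# result caching alone usually costs accuracy.

def quantize_key(x: np.ndarray, decimals: int = 1) -> str:
    """Coarsen the input so similar inputs can share a cache entry."""
    return hashlib.sha1(np.round(x, decimals).tobytes()).hexdigest()

cache: dict[str, int] = {}

def predict(x: np.ndarray, full_model) -> int:
    key = quantize_key(x)
    if key in cache:
        return cache[key]  # fast path: no model inference
    y = full_model(x)      # slow path: full-size neural network
    cache[key] = y
    return y
```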
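Finally, one common form of the compression the abstract mentions (quantizing network parameters) can be sketched with TensorFlow Lite's post-training quantization. This is a generic illustration of parameter quantization, not the thesis's logographic subword model, which instead targets machine-translation vocabularies; the toy model below is an assumption for demonstration.

```python
import tensorflow as tf

# Generic post-training quantization sketch (TensorFlow Lite): shrink a
# trained Keras model by converting float32 weights to a smaller
# representation before deploying it to an end device.

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10),
])
# ... train the model here ...

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization
tflite_model = converter.convert()

with open("model_quant.tflite", "wb") as f:
    f.write(tflite_model)
```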
Appears in Collections: | Open Access Dissertations and Theses |
Files in This Item:
File | Description | Size | Format
---|---|---|---
Fang_Yihao_2020-07_Doctor-of-Philosophy.pdf | | 3.67 MB | Adobe PDF