In deep learning, embeddings represent categorical entities such as movies, words, and apps. The embedding layer maps every entity to its own vector, so the layer's memory requirement scales with the total number of entities.
In the recommendation domain, a single category can contain an enormous number of entities, and the corresponding embedding layer can consume thousands of gigabytes. Because of their sheer size, such networks cannot be deployed in limited-resource environments (for example, smartphones, smartwatches, and tablets).
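To get a feel for the scale involved, here is a back-of-the-envelope estimate in Python. The entity count and embedding dimension are illustrative assumptions, not figures from any particular production system:

```python
# Rough memory estimate for one uncompressed embedding table.
# Both sizes below are assumed purely for illustration.
num_entities = 50_000_000   # entities in a single categorical feature
embedding_dim = 64          # vector length per entity
bytes_per_float = 4         # float32

table_bytes = num_entities * embedding_dim * bytes_per_float
print(f"Uncompressed table: {table_bytes / 1e9:.1f} GB")  # 12.8 GB for one feature
```

A model with dozens of such categorical features multiplies this cost accordingly.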
How compressed embeddings for on-device inference work
Compressed embeddings for on-device inference shrink the embedding table while still mapping every entity to its own embedding. Instead of a single large embedding table, the approach uses two separate tables.
In the first table, multiple entities share an embedding, while the second table holds a trainable weight for each individual entity. This lets the model differentiate between categorical entities even though they share an embedding.
Since the two tables are trained jointly, the network can learn a unique embedding per entity, maintaining a discriminative capability similar to that of a model with an uncompressed embedding table.
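To make the two-table idea concrete, below is a minimal PyTorch sketch. It is only an illustration of the scheme described above, not the authors' exact implementation: the modulo hash used to assign entities to shared embeddings, the scalar form of the per-entity weight, and all of the sizes are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class CompressedEmbedding(nn.Module):
    """Two-table compressed embedding (illustrative sketch)."""

    def __init__(self, num_entities: int, num_buckets: int, embedding_dim: int):
        super().__init__()
        self.num_buckets = num_buckets
        # Table 1: a small shared table; many entities map to each row.
        self.shared = nn.Embedding(num_buckets, embedding_dim)
        # Table 2: one trainable weight per entity (a single scalar here).
        self.entity_weight = nn.Embedding(num_entities, 1)
        nn.init.ones_(self.entity_weight.weight)

    def forward(self, entity_ids: torch.Tensor) -> torch.Tensor:
        # Entities that hash to the same bucket share a base vector...
        base = self.shared(entity_ids % self.num_buckets)  # (batch, embedding_dim)
        # ...but each entity rescales it with its own trained weight,
        # so the model can still tell them apart.
        return base * self.entity_weight(entity_ids)       # (batch, embedding_dim)

# Usage: 10 million entities compressed into 100,000 shared 64-d vectors.
emb = CompressedEmbedding(num_entities=10_000_000, num_buckets=100_000, embedding_dim=64)
out = emb(torch.tensor([3, 42, 9_000_003]))  # IDs 3 and 9_000_003 share a bucket
print(out.shape)                             # torch.Size([3, 64])
```

With these illustrative sizes, the two tables hold roughly 66 MB of float32 parameters (100,000 × 64 shared values plus 10 million per-entity weights) instead of about 2.6 GB for an uncompressed 10-million-row table.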
Compressed embeddings in modern deep learning
In modern deep learning, some of the most important predictive features are categorical features with very large vocabularies. Natural language processing (NLP) systems depend heavily on inputs such as individual words, characters, and sub-word tokens, which are fed to models directly as categorical features.
Additionally, search and recommender systems have shifted their focus in the following way:
- away from representing inputs with traditional feature-engineering methods
- toward feeding model inputs directly as categorical features
Embeddings have made it practical to work with these categorical features. However, they also bring challenges: as the vocabulary size of each categorical feature and the number of categorical features in a single model grow, the memory footprint grows with them.
This problem is especially common in recommender and search systems. While natural-language vocabularies are already large, the vocabularies of metadata, queries, and documents can easily reach the millions. The resulting embedding matrices become so large that the models struggle to run at all, which severely limits performance on low-resource devices such as tablets and smartphones.
On-Device Inference
Compared with other computational paradigms, on-device inference offers multiple benefits, including improved data privacy, lower latency, and reduced communication bandwidth. On-device inference can be challenging for deep neural networks, however.
Nevertheless, hardware-based, framework-based, and neural-network-architecture-based optimization techniques can help in tackling those challenges.
To Conclude
With compressed embeddings for on-device inference, your deep learning model can work efficiently on device, with faster inference, lower energy usage, and practical mobile deployment.
Reference links:
https://proceedings.mlsys.org/paper/2022/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf
https://arxiv.org/abs/2203.10135
https://deepai.org/publication/learning-compressed-embeddings-for-on-device-inference