Inside eBay’s Optimization Techniques for Scaling AI

(max.ku/Shutterstock)

Getting the software right is important when developing machine learning models, such as recommendation or classification systems. But at eBay, optimizing the software to run on a particular piece of hardware using distillation and quantization techniques was absolutely essential to ensure scalability.

eBay’s head of machine learning and NLP Selcuk Kopru described how the company optimizes its machine learning models in support of its AI-driven marketplace in a presentation made earlier today at the AI Hardware Summit, a hybrid event that’s taking place virtually and at the Computer History Museum in Mountain View, California this week.

“[I]n order to build a truly global marketplace that is driven by state of the art and powerful and scalable AI services,” Kopru said, “you have to do a lot of optimizations after model training, and specifically for the target hardware.”

eBay certainly is no stranger to scale. With 1.5 billion active listings from more than 19 million active sellers trying to reach 159 million active buyers, the ecommerce giant has a global reach that is matched by only a handful of firms. Machine learning and other AI techniques, such as natural language processing (NLP), play big roles in scaling eBay’s operations to reach its massive audience.

For instance, automatically generating descriptions of product listings is crucial for displaying information on the small screens of smartphones, Kopru said.

eBay trains its version of the BERT transformer model using a 64 GPU system (image courtesy eBay)

“The full item description is too big to display on mobile screen, and generation of a description summary is essential in this experience,” he said. “Feature extraction from product reviews, filtering of product reviews, shipping and delivery estimation, and payments and fraud detection in member-to-member communications – they are all benefiting from AI.”

eBay uses a host of AI and ML techniques to glean insight from listings and other data that it possesses. That includes neural network-based transformer models, like BERT, GPT-2, and GPT-3, for inferring information from text, as well as k-Nearest Neighbors (kNN), a supervised machine learning algorithm, for image classification.

For text understanding and summarization, eBay uses a custom version of the large BERT transformer model, called eBERT. As Kopru explained, the company starts by developing a similarity function that represents the titles of the product entries as vectors in a shared space. In his presentation at AI Hardware Summit, the product entry space was limited to two dimensions. “Of course, real applications can increase to hundreds of dimensions,” he said.

Finding the similarity between item titles helps in matching the listings to products in the catalog and can also be used to find duplicate products. “Having a clean catalog is very important in any ecommerce experience,” Kopru said.
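To make the idea concrete, here is a minimal sketch of similarity-based duplicate detection. It is not eBay’s implementation: the two-dimensional vectors are made up for illustration (in practice they would come from a model such as eBERT), cosine similarity is one common choice of similarity function rather than one the talk confirmed, and the 0.99 threshold is an arbitrary assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical 2-D title embeddings; real embeddings have hundreds of dimensions.
title_vectors = {
    "Apple iPhone 12 64GB black": (0.90, 0.10),
    "iPhone 12 black 64 GB unlocked": (0.88, 0.15),
    "Vintage leather armchair": (0.05, 0.95),
}

def find_duplicates(vectors, threshold=0.99):
    """Return pairs of titles whose similarity meets the (assumed) threshold."""
    titles = list(vectors)
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            score = cosine_similarity(vectors[titles[i]], vectors[titles[j]])
            if score >= threshold:
                pairs.append((titles[i], titles[j]))
    return pairs

duplicates = find_duplicates(title_vectors)
```

With these toy vectors, only the two iPhone listings clear the threshold, which is the kind of signal that could flag a likely duplicate for catalog cleanup.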

eBay trains eBERT on a cluster composed of 64 Nvidia V100 GPUs. According to Kopru, it takes two weeks to complete a round of training for the eBERT model. Because training eBERT is such an expensive task, the company takes pains to ensure eBay researchers and engineers can access the trained model in a very efficient way, with just a few lines of Python code.

eBay uses a NUMA setup to speed neural inference for its eBERT model (image courtesy eBay)

But because of its size, eBERT is too large to serve online inference directly. “You cannot just put it into production,” Kopru said. “Therefore, we use techniques like model distillation and quantization to improve the throughput of the models.”

Distillation is a way to compress a model by training a smaller “student” model to match the outputs of a large pre-trained “teacher” model. It includes specific techniques aimed at minimizing the loss of accuracy. Quantization, meanwhile, executes some of a model’s tensor operations using integers rather than floating-point values.
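As an illustration of the student-teacher idea, the sketch below computes one widely used form of distillation loss: a blend of the usual cross-entropy on the true label and a KL-divergence term that pulls the student’s temperature-softened predictions toward the teacher’s. The talk did not say which formulation eBay uses, so the temperature, the `alpha` weighting, and the example logits are all assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Blend of a soft-target loss (match the teacher) and a hard-label loss."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    soft_loss = kl_divergence(soft_teacher, soft_student) * temperature ** 2
    hard_loss = -math.log(softmax(student_logits)[true_label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Hypothetical logits for a three-class problem where class 0 is correct.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
poor_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher, true_label=0)
loss_poor = distillation_loss(poor_student, teacher, true_label=0)
```

A student whose outputs track the teacher’s gets a much lower loss than one that disagrees, which is the signal that training would minimize.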

The accuracy tradeoff from using distillation and quantization is worth it to eBay, Kopru said. “We can train 3x faster models and 3x more throughput can be achieved by giving up 3% of the accuracy,” he said. “This tradeoff is a good one based on the results we are getting. We are willing to continue to do that.”

While eBay does all of its deep learning training purely on GPUs, for inference, it uses a hybrid approach that includes CPUs. For inference tasks, eBay uses Intel’s Deep Learning Boost (DL Boost) technology, which adds Vector Neural Network Instructions (VNNI) to Intel’s AVX-512 instruction set. It also utilizes 8-bit integer quantization, Kopru said.
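The arithmetic behind 8-bit quantization can be sketched in a few lines. The example below uses symmetric per-tensor quantization, which is one common scheme and an assumption here, since the talk did not describe eBay’s exact method. It maps floats to integers in [-127, 127], computes a dot product entirely in integers, then rescales the result. Instructions such as VNNI accelerate this kind of 8-bit multiply with 32-bit accumulation in hardware; plain Python integers stand in for that here.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats to signed integers plus one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale

def integer_dot(qa, qb):
    """Dot product on integers only (the part that int8 hardware speeds up)."""
    return sum(x * y for x, y in zip(qa, qb))

def quantized_dot(a, b):
    """Approximate the float dot product of a and b via 8-bit integers."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    return integer_dot(qa, qb) * scale_a * scale_b

# Hypothetical weights and activations.
weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * x for w, x in zip(weights, activations))
approx = quantized_dot(weights, activations)
```

The quantized result differs slightly from the floating-point one, which is the small accuracy cost traded for cheaper integer arithmetic.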

The system is deployed and scaled horizontally on a Kubernetes cluster configured using NUMA, he said. “Our custom implementation doubled the throughput with half of the latency compared to a non-VNNI implementation,” he said.

eBay uses some of the same techniques to optimize the kNN setup, which is used for image classification. The company has paired its kNN image classification system with a Hierarchical Navigable Small World (HNSW) library to optimize search in the kNN space. For inference, it also brought in 8-bit integer quantization and Intel’s DL Boost technology.
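For readers unfamiliar with kNN classification, the sketch below shows the basic mechanism: find the k labeled examples closest to a query vector and take a majority vote. It uses exact brute-force search, which is the baseline that an index like HNSW approximates at scale. The three-dimensional “image embeddings” and category labels are invented for illustration.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, examples, k=3):
    """Label the query by majority vote among its k nearest labeled examples."""
    nearest = sorted(examples, key=lambda ex: euclidean(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical (embedding, category) pairs; real embeddings are far larger.
examples = [
    ((0.90, 0.10, 0.00), "sneakers"),
    ((0.80, 0.20, 0.10), "sneakers"),
    ((0.85, 0.15, 0.05), "sneakers"),
    ((0.10, 0.90, 0.20), "watches"),
    ((0.20, 0.80, 0.10), "watches"),
]

predicted = knn_classify((0.82, 0.18, 0.05), examples)
```

Brute force compares the query against every example, so its cost grows linearly with catalog size, which is why approximate indices matter for catalogs with very large numbers of items.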

Selcuk Kopru is head of machine learning and NLP at eBay

The use of HNSW enables kNN search to perform at a high level and maintain low latency and high throughput on a catalog that contains a billion items, each with up to 768 dimensions, according to Kopru. “Compared to the existing implementation, we have observed up to 2.5X speed up in terms of latency, from 17 milliseconds to 7 milliseconds,” he said.
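The core idea behind graph-based search of this kind can be illustrated with a heavily simplified sketch. The code below is not HNSW itself and not eBay’s setup: it builds a single-layer neighbor graph by brute force (HNSW builds a multi-layer graph incrementally) and then answers a query by walking the graph greedily with a small candidate beam, so only a fraction of the points are ever compared with the query. The grid data, `m`, and `ef` values are assumptions chosen for the example.

```python
import heapq
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(points, m=4):
    """Link each point to its m nearest neighbors (brute force, illustration only)."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        for j in sorted(others, key=lambda j: dist(p, points[j]))[:m]:
            graph[i].add(j)
            graph[j].add(i)  # keep links bidirectional
    return graph

def search(points, graph, query, k=1, ef=8, entry=0):
    """Greedy beam search over the graph; returns up to k (distance, index) pairs."""
    start = dist(query, points[entry])
    visited = {entry}
    candidates = [(start, entry)]   # min-heap: closest unexpanded node first
    best = [(-start, entry)]        # max-heap (negated): the ef closest seen so far
    while candidates:
        d, node = heapq.heappop(candidates)
        if len(best) >= ef and d > -best[0][0]:
            break  # nothing left that could improve the current best set
        for neighbor in graph[node]:
            if neighbor in visited:
                continue
            visited.add(neighbor)
            dn = dist(query, points[neighbor])
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, neighbor))
                heapq.heappush(best, (-dn, neighbor))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest of the kept set
    return sorted((-neg_d, i) for neg_d, i in best)[:k]

# Toy data: a 10 x 10 grid of 2-D points.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
graph = build_graph(points)
query = (6.2, 3.1)

approx = [i for _, i in search(points, graph, query, k=3)]
exact = sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:3]
```

On this small, well-connected grid the graph walk returns the same neighbors as an exhaustive scan. On real high-dimensional data the results are approximate, and parameters like `ef` trade recall against latency.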

Kopru left his audience with a few thoughts on how they might apply these lessons to their own problems. For starters, similarity is a powerful tool that can be used to solve many AI problems in ecommerce and in other disciplines, Kopru said.

“My second message is for latency and throughput,” he said. “Many optimizations need to be done, almost at every step of the machine learning lifecycle. Whether it is in training or inferencing, we have to make sure that we are using the right optimization methods, such as distillation and quantization on those platforms.”

Finally, Kopru encouraged practitioners to consider using a hybrid approach for inferencing to achieve utilization and cost efficiency goals. eBay also keeps a watchful eye on TPUs for benchmarking purposes.

“These are well-known techniques and I really suggest everyone is using those techniques,” he said.

Related Items:

Optimizing AI and Deep Learning Performance

Three Tricks to Amplify Small Data for Deep Learning

One Model to Rule Them All: Transformer Networks Usher in AI 2.0, Forrester Says