Enabling Cost-Efficient LLM Serving with Ray Serve

8.6K views
Oct 12, 2023
30:28

Ray Serve is the cheapest and easiest way to deploy LLMs, and has served billions of tokens in Anyscale Endpoints. This talk discusses how Ray Serve reduces cost via fine-grained autoscaling, continuous batching, and model-parallel inference, as well as the work we've done to make it easy to deploy any Hugging Face model with these optimizations.

Takeaways:
• Learn how Ray Serve saves costs by using fewer GPUs with fine-grained autoscaling and by integrating with libraries like vLLM to maximize GPU utilization.

About Anyscale
Anyscale is the AI Application Platform for developing, running, and scaling AI. https://www.anyscale.com/
If you're interested in a managed Ray service, check out: https://www.anyscale.com/signup/

About Ray
Ray is the most popular open source framework for scaling and productionizing AI workloads. From Generative AI and LLMs to computer vision, Ray powers the world’s most ambitious AI workloads. https://docs.ray.io/en/latest/

#llm #machinelearning #ray #deeplearning #distributedsystems #python #genai
