vLLM Outperforms Ollama: 6x Throughput at High Concurrency Levels in New Benchmark

Teams choosing machine learning infrastructure have to balance performance against usability. A recent benchmark published in Towards AI compares three popular inference frameworks, vLLM, Ollama, and llama.cpp, and finds performance gaps at higher concurrency that are large enough to shape architectural decisions. The story is not only about raw throughput but about what each framework implies for scalable, multi-tenant services. In the benchmark, vLLM, whose design includes PagedAttention, delivered far higher throughput and stayed stable under load, which matters for teams scaling in a cloud environment.

Context

Machine learning models now power applications ranging from natural language processing to real-time data analysis. As models grow in size and complexity, the frameworks that serve them are expected to deliver not just raw compute but also flexibility and scalability. That makes the serving framework a key decision for organizations building AI capabilities into their products and services.

The three frameworks benchmarked by Towards AI, vLLM, Ollama, and llama.cpp, target different segments of the market. vLLM is a serving engine built for high-throughput inference on GPUs. Ollama, noted for its ease of use and developer-friendly environment, has been a go-to for small teams and local deployments; it slots quickly into existing workflows, which appeals to developers who prioritize simple setup and intuitive interfaces. By contrast, llama.cpp is known for its lightweight, minimalist approach and appeals to developers who need to run models on constrained resources.

The benchmark is timely because it focuses on concurrency, which matters more as applications scale to heavier loads. In cloud environments, serving many users at once without sacrificing performance or reliability is crucial. The results, which show vLLM dominating in high-concurrency scenarios, are directly relevant to teams planning their cloud strategies. As more companies adopt multi-tenant architectures to maximize resource utilization, understanding the trade-offs between serving frameworks becomes essential.

What Happened

The benchmark conducted and released this week by Towards AI rigorously tested vLLM, Ollama, and llama.cpp under increasing concurrency conditions. At one concurrent user, all three frameworks performed comparably, with token generation rates within 13% of each other. Ollama was particularly noted for delivering a superior developer experience at this level due to its streamlined setup and operational simplicity.

As concurrency rose to five users, however, Ollama’s performance began to decline: the single-queue model that underpins its ease of use started to struggle, and latency increased. By 20 concurrent users, llama.cpp’s serial decoding could not keep up and effectively collapsed under the load. From this point on, the differences between the frameworks became stark.

At 50 concurrent users, vLLM held a decisive advantage, delivering roughly six times the aggregate throughput of the other two frameworks. Its p99 latency stayed under three seconds, while Ollama’s p99 latency ballooned to 24.7 seconds, with numerous requests timing out. The advantage is attributed to vLLM’s continuous batching, which schedules tokens from many requests into each decoding step, together with PagedAttention, which stores each request’s key-value cache in fixed-size blocks so that GPU memory is not wasted when sequence lengths vary. Together these keep the GPU busy under mixed workloads. The result positions vLLM as the stronger choice for production environments where high concurrency is the norm, both on performance metrics and on stability under load.
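For readers who want to compute the two headline metrics on their own runs, the following is a minimal sketch, assuming per-request latencies and token counts have already been collected. The function names and the sample data are illustrative assumptions; they are not taken from the Towards AI benchmark, and the benchmark may define its percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def aggregate_throughput(token_counts, wall_clock_seconds):
    """Tokens generated across all requests, divided by the wall-clock length of the run."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return sum(token_counts) / wall_clock_seconds


# Made-up sample: 100 request latencies in milliseconds (100, 200, ..., 10000).
latencies_ms = list(range(100, 10100, 100))
p99_ms = percentile(latencies_ms, 99)

# Made-up sample: three requests finishing within a 2-second window.
tokens_per_second = aggregate_throughput([100, 200, 300], 2.0)
```

Aggregate throughput sums over all concurrent requests, so it can rise with concurrency even while each individual request gets slower. That is why the p99 latency is reported alongside it.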

Why It Matters

The benchmark has direct implications for teams deploying machine learning services at scale. As demand for AI-driven applications grows, so does the need for frameworks that can absorb heavier loads without losing performance. vLLM’s lead in high-concurrency scenarios makes it a strong candidate for organizations building reliable, efficient AI services.

For engineering teams, the benchmark serves as a critical guide in selecting the right framework depending on their specific needs. While Ollama remains a strong candidate for local development and smaller-scale applications, its limitations become apparent in more demanding environments. Teams currently using Ollama in production are advised to consider transitioning to vLLM to avoid potential performance bottlenecks as their user base grows.

Moreover, this benchmark highlights the ongoing evolution in machine learning infrastructure. The industry is moving towards architectures that not only improve raw throughput but also optimize resource utilization and maintain low latencies under load. Such advancements are particularly relevant as businesses strive to offer seamless AI-driven experiences to their users. The choice of framework can thus significantly impact an organization’s ability to scale and meet user expectations effectively.

How We Approached This

At Stack Runner, we approach technology reporting with a focus on the practical implications for developers and IT teams. Our analysis of the benchmark data from Towards AI involved a careful examination of the performance metrics and architectural differences among the frameworks. We emphasized the importance of concurrency as a factor that can define the success of machine learning deployments in real-world applications.

In crafting this article, we prioritized insights that directly affect decision-making for development teams. By highlighting the strengths and weaknesses of each framework in the context of scalability and performance, we aim to give teams a clear picture of the trade-offs in AI infrastructure. We chose to focus on vLLM’s PagedAttention as a differentiator that improves performance metrics and also contributes to overall system reliability, a critical consideration in multi-tenant cloud environments.

Frequently Asked Questions

What is vLLM’s PagedAttention technology?

PagedAttention is vLLM’s memory-management technique for the attention key-value (KV) cache. Instead of reserving one contiguous region per request, it stores the cache in fixed-size blocks, much as an operating system pages virtual memory, which cuts waste when sequence lengths vary. That leaves room to batch more requests at once, and vLLM’s continuous-batching scheduler uses it to interleave tokens from multiple requests and keep the GPU saturated. In the benchmark, this combination let vLLM sustain high concurrency where Ollama and llama.cpp fell behind.
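The bookkeeping behind a paged key-value cache can be illustrated with a toy allocator. This is a sketch under stated assumptions, not vLLM’s implementation: the class name `PagedKVCache`, its methods, and the block sizes are all hypothetical, and no tensors are stored. It only shows the idea of handing out fixed-size blocks on demand and returning them to a shared pool when a request finishes.

```python
class PagedKVCache:
    """Toy block allocator for paged KV-cache bookkeeping (hypothetical, not vLLM's code)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # request id -> list of physical block ids
        self.token_counts = {}  # request id -> number of tokens stored so far

    def append_token(self, request_id):
        """Reserve room for one more token; return False if the pool is exhausted."""
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        # A new block is needed only when every block the request holds is full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                return False
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1
        return True

    def release(self, request_id):
        """Return all of a finished request's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)
```

Because a request holds at most one partially filled block, the wasted space per request is bounded by the block size rather than by the maximum sequence length, which is what lets more requests share the same GPU memory.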

Why does Ollama struggle with higher concurrency?

Ollama’s performance issues at higher concurrency stem from its single-queue model, which simplifies deployment but becomes a bottleneck as user numbers increase. This model leads to backlogged requests and increased latency, as seen in the benchmark where its p99 latency reached 24.7 seconds at 50 concurrent users. This structure makes Ollama less suited for multi-tenant production environments compared to frameworks like vLLM.
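A back-of-the-envelope model shows why a single queue hurts tail latency. The sketch below is a deliberately simplified assumption, not a model of Ollama’s or vLLM’s actual schedulers, and it does not reproduce the benchmark’s numbers: all requests arrive at once, each takes the same service time alone, and the `slowdown_per_request` parameter is a hypothetical stand-in for the cost of sharing a batch.

```python
def serial_queue_latencies(num_requests, service_seconds):
    """All requests arrive at t=0 and are served one at a time, in arrival order."""
    return [service_seconds * (i + 1) for i in range(num_requests)]


def batched_latencies(num_requests, service_seconds, slowdown_per_request):
    """All requests arrive at t=0 and decode together in one batch.

    Each additional request stretches the shared run by a fixed fraction
    (an illustrative assumption, not a measured value).
    """
    finish = service_seconds * (1 + slowdown_per_request * (num_requests - 1))
    return [finish] * num_requests


# With 50 requests of 2 seconds each, the last request in a serial queue
# waits for the 49 requests ahead of it.
serial_worst = max(serial_queue_latencies(50, 2))
```

In the serial case the worst-case latency grows linearly with the number of waiting requests, while in the batched case it grows only as fast as the per-request slowdown allows.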

Is llama.cpp viable for production use?

Yes, in specific contexts: llama.cpp suits production use where resources are tightly constrained and concurrency demands are low. As the benchmark shows, however, it struggles as concurrency increases, particularly from 20 users upward. Its serial decoding approach cannot handle high loads effectively, which makes it a weaker fit than vLLM for environments that need robust scalability.

As the machine learning landscape continues to evolve, the choice of framework can significantly impact an organization’s scalability and performance capabilities. The recent benchmark highlights vLLM as a frontrunner for handling high concurrency efficiently, suggesting that teams prioritize it for scalable production environments. With PagedAttention technology driving its success, vLLM sets a new standard for what developers can expect from AI infrastructure. As more teams evaluate their deployment strategies, the insights from this benchmark should serve as a critical guide in shaping their architectural decisions, ensuring they not only meet current demands but are also prepared for future growth.