Redazione RHC : 18 July 2025 08:21
In recent years, large language models (LLMs) such as GPT, Claude, or LLaMA have demonstrated extraordinary capabilities in understanding and generating natural language. Behind the scenes, however, running an LLM is no child’s play: it requires substantial computational infrastructure, a significant financial investment, and careful architectural choices. Let’s try to understand why.
A 70-billion-parameter LLM, like Meta’s LLaMA 3.3 70B, contains 70 billion “weights”: floating-point numbers (usually FP16 or BF16, i.e., 2 bytes per parameter) that encode what the model learned during training. Just to load the weights into memory, you need approximately 70 billion × 2 bytes ≈ 140 GB.
Add another 20–40 GB of VRAM to handle dynamic operations during inference: the token cache (KV cache), prompt embeddings, temporary activations, and system overhead. In total, a 70-billion-parameter LLM requires approximately 160–180 GB of GPU memory to run efficiently.
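To make the arithmetic concrete, here is a minimal back-of-envelope sketch; the layer count, KV-head count, and head size used for the cache estimate are illustrative values in the ballpark of LLaMA-class 70B models, not exact specifications.

```python
# Back-of-envelope VRAM estimate for a 70B-parameter model served in FP16/BF16.
# Layer count, KV-head count and head size are illustrative figures in the
# ballpark of LLaMA-class 70B models, not exact specifications.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_cached_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache: two tensors (K and V) per layer for every cached token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_cached_tokens * per_token / 1e9

weights = weight_memory_gb(70e9)      # ~140 GB of FP16 weights
cache = kv_cache_gb(100 * 500)        # 100 users x ~500 prompt tokens each
print(f"weights ~{weights:.0f} GB, KV cache ~{cache:.0f} GB")
# Generated tokens, activations and framework overhead push the total higher.
```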
Many people ask: “Why not run the model on a CPU?” The answer is simple: latency and parallelism.
Graphics Processing Units (GPUs) are designed to execute millions of operations in parallel, making them ideal for the tensor computations required by LLMs. CPUs, on the other hand, are optimized for a limited number of highly complex sequential operations. A model like LLaMA 3.3 70B may generate a word only every 5–10 seconds on a CPU, while on dedicated GPUs it can respond in less than a second. In a production context, this difference is unacceptable.
Furthermore, the VRAM of high-end GPUs (e.g., NVIDIA A100, H100) allows the model to be kept resident in memory and to take advantage of hardware acceleration for matrix multiplication, the heart of LLM inference.
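The gap can be reasoned about with a crude model that treats autoregressive decoding as memory-bandwidth-bound: each new token requires streaming most of the weights from memory. The sketch below uses order-of-magnitude bandwidth figures that are assumptions, not vendor specifications.

```python
# Rough estimate of per-token decode latency, treating autoregressive
# generation as memory-bandwidth-bound: producing each new token requires
# streaming (most of) the 140 GB of weights from memory.

def seconds_per_token(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    return weight_bytes / bandwidth_bytes_per_s

WEIGHT_BYTES = 140e9        # 70B parameters in FP16

# Order-of-magnitude bandwidth assumptions, not vendor specifications:
cpu_ddr_bw = 60e9           # ~60 GB/s of server DDR bandwidth
gpu_cluster_bw = 8 * 2e12   # eight GPUs with ~2 TB/s HBM each, tensor-parallel

print(f"CPU:  ~{seconds_per_token(WEIGHT_BYTES, cpu_ddr_bw):.1f} s per token")
print(f"GPUs: ~{seconds_per_token(WEIGHT_BYTES, gpu_cluster_bw) * 1000:.0f} ms per token")
```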
Let’s imagine we want to offer a service similar to ChatGPT for text generation only, based on a 70-billion-parameter LLM, with 100 concurrently active users. Let’s assume each user sends prompts of 300–500 tokens and expects fast responses, with sub-second latency.
A model of this size requires about 140 GB of GPU memory for the FP16 weights alone, plus another 20–40 GB for the token cache (KV cache), temporary activations, and system overhead. A single GPU, even a high-end one, doesn’t have enough memory to run the full model, so it needs to be distributed across multiple GPUs using tensor parallelism techniques.
A typical configuration involves distributing the model across a cluster of eight 80GB A100 GPUs, enough to both load the model in FP16 and manage the memory needed for real-time inference. However, to serve 100 concurrent users while maintaining sub-second latency for an LLM of this size, a single instance of 8 A100 GPUs (80GB) is generally insufficient.
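As a purely illustrative example, an open-source serving engine such as vLLM can shard a model of this size across the eight GPUs through its tensor-parallelism option; the model identifier and settings below are assumptions, not a validated production configuration.

```python
# Illustrative sketch: sharding a 70B model across eight GPUs with the
# open-source vLLM engine. Model ID and settings are examples, not a
# validated production configuration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed Hugging Face model ID
    tensor_parallel_size=8,                     # split the weights across 8 GPUs
    dtype="bfloat16",
    gpu_memory_utilization=0.90,                # leave headroom for the KV cache
)

sampling = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], sampling)
print(outputs[0].outputs[0].text)
```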
To achieve the goal of 100 concurrent users with sub-second latency, a combination of techniques is typically required: replicating the model across several multi-GPU inference instances, batching incoming requests efficiently, and balancing the load among them.
To scale further, these instances can be replicated across multiple GPU PODs, enabling asynchronous, load-balanced handling of thousands of users depending on incoming traffic. Of course, beyond pure inference, additional resources must be provisioned for the supporting infrastructure: orchestration, monitoring, logging, networking, and storage.
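To make the scaling logic explicit, here is a hypothetical capacity-planning helper: given an assumed per-user token rate and an assumed aggregate throughput per 8-GPU instance, it estimates how many replicated instances are needed. Both figures are illustrative, not benchmark results.

```python
# Hypothetical capacity-planning helper: how many 8-GPU instances are needed
# to sustain a given user population. Both rates below are illustrative
# assumptions, not benchmark results.
import math

def instances_needed(concurrent_users: int,
                     tokens_per_s_per_user: float,
                     instance_throughput_tok_s: float) -> int:
    demand = concurrent_users * tokens_per_s_per_user
    return math.ceil(demand / instance_throughput_tok_s)

# 100 users each consuming ~15 tokens/s, with one 8-GPU instance sustaining
# ~600 tokens/s aggregate thanks to request batching (assumed figures):
print(instances_needed(100, 15, 600))   # -> 3 replicated instances
```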
An on-premise implementation requires an initial investment of hundreds of thousands of euros, plus annual costs for management, power, and staff. Alternatively, major cloud providers offer equivalent resources at a more affordable and flexible monthly cost. It’s important to note, however, that even in the cloud, a hardware configuration capable of handling such a load in real time can easily incur monthly costs of tens of thousands of euros or more, depending on usage.
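As an order-of-magnitude check on the cloud figure, consider a single 8-GPU instance billed hourly; the rate used below is an assumed on-demand price, and actual prices vary by provider, region, and commitment.

```python
# Order-of-magnitude monthly cost for one 8-GPU cloud instance billed hourly.
# The hourly rate is an assumed on-demand price; real prices vary widely by
# provider, region and commitment level.
GPUS = 8
EUR_PER_GPU_HOUR = 3.0      # assumed rate for an 80 GB data-center GPU
HOURS_PER_MONTH = 730

monthly_cost = GPUS * EUR_PER_GPU_HOUR * HOURS_PER_MONTH
print(f"~{monthly_cost:,.0f} EUR per month per instance")   # ~17,520 EUR
```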
In both cases, it’s clear that using large-scale LLMs represents not only an algorithmic challenge, but also an infrastructural and economic one, making the search for more efficient and lightweight models increasingly important.
A simple alternative for many companies is to use the APIs of external providers like OpenAI, Anthropic, or Google. However, when confidentiality and data criticality come into play, the approach changes radically. If the data to be processed includes sensitive or personal information (e.g., medical records, business plans, or judicial documents), sending it to external cloud services may conflict with GDPR requirements, particularly with respect to cross-border data transfers and the principle of data minimization.
Many corporate policies based on security standards such as ISO/IEC 27001 also require the processing of critical data in controlled, auditable, and localized environments.
Furthermore, with the entry into force of the European Regulation on Artificial Intelligence (AI Act), providers and users of AI systems must guarantee traceability, transparency, security, and human oversight, especially when the model is used in high-risk contexts (finance, healthcare, education, justice). Using LLMs through cloud APIs can make it impossible to meet these obligations, since inference and data management take place outside the organization’s direct control.
In these cases, the only option truly compliant with regulatory and security standards is to adopt an on-premise infrastructure or a dedicated private cloud, where data, models, and inference remain under the organization’s direct control and every processing step can be audited.
This approach preserves digital sovereignty and supports compliance with the GDPR, ISO/IEC 27001, and the AI Act, but it requires significant technical and financial effort.
Commissioning an LLM is not just an algorithmic challenge, but above all an infrastructure undertaking, involving specialized hardware, complex optimizations, high energy costs, and latency constraints. Cutting-edge models require clusters of dozens of GPUs, with investments ranging from hundreds of thousands to millions of euros per year to ensure a scalable, fast, and reliable service.
A final, but fundamental, consideration concerns the environmental impact of these systems. Large models consume enormous amounts of electricity, both during training and inference. As LLM adoption increases, it becomes urgent to develop smaller, lighter, and more efficient models that can deliver comparable performance with a significantly reduced computational (and energy) footprint.
As with every technological evolution—from personal computers to mobile phones—efficiency is the key to maturity: we don’t always need larger models, but smarter, more adaptive, and sustainable models.