
How an LLM Really Works: Costs, Infrastructure, and the Technical Choices Behind Big Language Models

Redazione RHC: 18 July 2025 08:21

In recent years, large language models (LLMs) such as GPT, Claude, or LLaMA have demonstrated extraordinary capabilities in understanding and generating natural language. However, behind the scenes, running an LLM is no child’s play: it requires substantial computational infrastructure, a significant financial investment, and precise architectural choices. Let’s try to understand why.

70 Billion Parameters: What It Really Means

A 70-billion-parameter LLM, like Meta’s LLaMA 3.3 70B, contains 70 billion “weights,” floating-point numbers (usually FP16 or BF16, i.e., 2 bytes per parameter) that represent the skills learned during training. Just to load this model into memory, you need approximately:

  • 140 GB of GPU RAM (70 billion × 2 bytes).

Add another 20-30 GB of VRAM to handle dynamic operations during inference: token cache (KV cache), prompt embedding, temporary activations, and system overhead. In total, a 70 billion parameter LLM requires approximately 160-180 GB of GPU memory to run efficiently.
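
To make these figures concrete, here is a minimal back-of-the-envelope sketch in Python. The 140 GB follows directly from 70 billion parameters × 2 bytes; the KV-cache term uses assumed, illustrative model shapes (layer count, KV heads, head size) and an assumed concurrent load, so treat it as an order-of-magnitude estimate rather than a property of any specific model.

```python
# A minimal sketch of the memory figures above. The 2 bytes/parameter term
# matches FP16/BF16 weights; the KV-cache term uses assumed, illustrative
# shapes roughly in line with a 70B-class model using grouped-query attention.

def weight_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """VRAM needed just to hold the weights (FP16/BF16 = 2 bytes each)."""
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, concurrent_seqs: int,
                bytes_per_value: int = 2) -> float:
    """KV cache: 2 tensors (K and V) per layer, per token, per sequence."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * concurrent_seqs / 1e9

weights = weight_memory_gb(70e9)  # ~140 GB, as in the article
cache = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,   # assumed shape
                    context_tokens=4096, concurrent_seqs=16)   # assumed load
print(f"weights ≈ {weights:.0f} GB, KV cache ≈ {cache:.0f} GB")
# -> weights ≈ 140 GB, KV cache ≈ 21 GB (plus activations and overhead)
```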

Why you need a GPU: CPU isn’t enough

Many people ask: “Why not run the model on a CPU?” The answer is simple: latency and parallelism.

Graphics Processing Units (GPUs) are designed to execute millions of operations in parallel, making them ideal for the tensor computation required by LLMs. CPUs, on the other hand, are optimized for a limited number of highly complex sequential operations. A model like the LLaMA 3.3 70B can generate a word every 5-10 seconds on a CPU, while on a dedicated GPU it can respond in less than a second. In a production context, this difference is unacceptable.

Furthermore, the VRAM of high-end GPUs (e.g., NVIDIA A100, H100) allows the model to be kept resident in memory and to take advantage of hardware acceleration for matrix multiplication, the heart of LLM inference.
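
A quick way to see this difference for yourself is to time the same large matrix multiplication on both devices. The sketch below uses PyTorch with an arbitrary matrix size; the exact numbers depend on your hardware, but the gap between CPU and GPU is typically one to two orders of magnitude.

```python
# A rough sketch of why GPUs dominate LLM inference: time the same large
# matrix multiplication (the core operation inside every transformer layer)
# on the CPU and, if one is available, on a CUDA GPU.
import time
import torch

def time_matmul(device: str, n: int = 4096, repeats: int = 3) -> float:
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    torch.matmul(a, b)                  # warm-up (lazy init, kernel caching)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        torch.matmul(a, b)
    if device == "cuda":
        torch.cuda.synchronize()        # GPU kernels run asynchronously
    return (time.perf_counter() - start) / repeats

print(f"CPU: {time_matmul('cpu'):.3f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f} s per matmul")
```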

An example: 100 active users on a 70B LLM

Let’s imagine we want to offer a service similar to ChatGPT for text generation only, based on a 70-billion-parameter LLM, with 100 concurrently active users. Let’s assume each user sends prompts of 300–500 tokens and expects fast responses, with sub-second latency.

A model of this size requires about 140 GB of GPU memory for the FP16 weights alone, plus another 20–40 GB for the token cache (KV cache), temporary activations, and system overhead. A single GPU, even a high-end one, doesn’t have enough memory to run the full model, so it needs to be distributed across multiple GPUs using tensor parallelism techniques.

A typical configuration involves distributing the model across a cluster of eight 80GB A100 GPUs, enough to both load the model in FP16 and manage the memory needed for real-time inference. However, to serve 100 concurrent users while maintaining sub-second latency for an LLM of this size, a single instance of 8 A100 GPUs (80GB) is generally insufficient.
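
As an illustration, this is roughly what a single-node, tensor-parallel deployment could look like with an open-source inference engine such as vLLM (discussed further below). The checkpoint name and sampling settings are assumptions for the sketch, not a recommendation.

```python
# A minimal sketch of an 8-GPU deployment with the vLLM inference engine:
# tensor_parallel_size=8 shards the weights across the eight GPUs in the node.
# The model name and sampling settings are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed checkpoint name
    tensor_parallel_size=8,                     # split weights over 8 GPUs
    dtype="bfloat16",
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain tensor parallelism in one paragraph."], params)
print(outputs[0].outputs[0].text)
```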

To achieve the goal of 100 concurrent users with sub-second latency, a combination of the following is typically required:

  • A significantly larger number of A100 GPUs (for example, a cluster with 16-32 or more 80GB A100s), distributed across multiple PODs or in a single larger configuration.
  • Adopting next-generation GPUs like the NVIDIA H100, which offer significant improvements in throughput and latency for LLM inference, but at a higher cost.
  • Maximizing software optimizations, such as using advanced inference frameworks (e.g., vLLM, NVIDIA TensorRT-LLM) with techniques like paged attention and dynamic batching.
  • Implementing quantization (going from FP16 to FP8 or INT8/INT4), which would dramatically reduce memory requirements and increase computation speed, but with a potential loss of output quality (especially for INT4 quantization).
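
As a sketch of the last point, this is how a 70B-class model might be loaded with 4-bit quantized weights using Hugging Face Transformers and bitsandbytes. The checkpoint name and settings are illustrative assumptions, and in practice you would validate output quality before adopting such a configuration.

```python
# A minimal sketch of weight quantization with Transformers + bitsandbytes:
# loading a 70B-class model with 4-bit weights instead of FP16 cuts the weight
# footprint from ~140 GB to roughly 35-40 GB, at some cost in output quality.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-3.3-70B-Instruct"   # assumed checkpoint name

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # INT4/NF4 weights instead of FP16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still runs in BF16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",                      # spread layers across available GPUs
)

inputs = tokenizer("Summarize the trade-offs of INT4 quantization.",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=128)[0]))
```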

To scale further, these instances can be replicated across multiple GPU PODs, enabling asynchronous, load-balanced handling of thousands of total users as traffic fluctuates. Of course, beyond pure inference, it’s essential to provision additional resources for:

  • Dynamic scaling based on demand.
  • Load balancing across instances (a minimal sketch follows this list).
  • Logging, monitoring, orchestration, and data security.
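
For the load-balancing point, here is a deliberately simplified sketch of the idea: an asynchronous dispatcher that spreads requests round-robin across replicated inference endpoints. The endpoint URLs and the OpenAI-style request body are assumptions (vLLM, for instance, can expose a compatible API); a real gateway would add health checks, retries, queuing, and authentication.

```python
# A minimal sketch of load balancing across replicated inference PODs:
# requests are dispatched round-robin to hypothetical endpoint URLs.
import asyncio
import itertools
import aiohttp

# Hypothetical replicated inference PODs, each serving the same model.
ENDPOINTS = itertools.cycle([
    "http://pod-1:8000/v1/completions",
    "http://pod-2:8000/v1/completions",
    "http://pod-3:8000/v1/completions",
])

async def complete(session: aiohttp.ClientSession, prompt: str) -> str:
    url = next(ENDPOINTS)                      # round-robin choice of replica
    payload = {"model": "llama-3.3-70b", "prompt": prompt, "max_tokens": 128}
    async with session.post(url, json=payload) as resp:
        data = await resp.json()
        return data["choices"][0]["text"]      # OpenAI-style response body

async def main() -> None:
    prompts = [f"Question {i}: what is tensor parallelism?" for i in range(100)]
    async with aiohttp.ClientSession() as session:
        answers = await asyncio.gather(*(complete(session, p) for p in prompts))
    print(f"Served {len(answers)} concurrent requests")

asyncio.run(main())
```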

But how much does such an infrastructure cost?

On-premise implementation requires hundreds of thousands of euros of initial investment, plus annual management, power, and staffing costs. Alternatively, major cloud providers offer equivalent resources at a much more affordable and flexible monthly cost. However, it’s important to note that even in the cloud, a hardware configuration capable of handling such a load in real time can incur monthly costs that easily exceed tens of thousands of euros, if not more, depending on usage.
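
To see why even the cloud bill lands in that range, a rough arithmetic sketch helps; the hourly GPU rate and the overhead share below are assumed placeholders for illustration only, not quotes from any provider.

```python
# A rough, hedged order-of-magnitude estimate of the monthly cloud bill for
# the cluster discussed above. Actual prices vary widely by provider, region,
# commitment, and instance type.
GPUS = 16                       # e.g. two 8x A100 80GB nodes
HOURLY_RATE_PER_GPU_EUR = 2.5   # assumed illustrative on-demand rate
HOURS_PER_MONTH = 730

compute = GPUS * HOURLY_RATE_PER_GPU_EUR * HOURS_PER_MONTH
overhead = 0.25 * compute       # assumed share for storage, networking, egress

print(f"GPU compute ≈ €{compute:,.0f}/month")                 # ≈ €29,200
print(f"With overhead ≈ €{compute + overhead:,.0f}/month")    # ≈ €36,500
```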

In both cases, it’s clear that using large-scale LLMs represents not only an algorithmic challenge, but also an infrastructural and economic one, making the search for more efficient and lightweight models increasingly important.

On-premise or API? Privacy is a game-changer

A simple alternative for many companies is to use the APIs of external providers like OpenAI, Anthropic, or Google. However, when confidentiality and data criticality come into play, the approach changes radically. If the data to be processed includes sensitive or personal information (e.g., medical records, business plans, or judicial documents), sending it to external cloud services may conflict with GDPR requirements, particularly with respect to cross-border data transfers and the principle of data minimization.

Many corporate policies based on security standards such as ISO/IEC 27001 also require the processing of critical data in controlled, auditable, and localized environments.

Furthermore, with the entry into force of the European Regulation on Artificial Intelligence (AI Act), providers and users of AI systems must guarantee traceability, transparency, security, and human oversight, especially if the model is used in high-risk contexts (finance, healthcare, education, justice). Using LLMs through cloud APIs can make it impossible to meet these obligations, as data inference and management occur outside the organization’s direct control.

In these cases, the only option truly compliant with regulatory and security standards is to adopt an on-premise infrastructure or a dedicated private cloud, where:

  • Data control is complete;
  • Inference occurs in a closed and compliant environment;
  • Auditing, logging, and accountability metrics are managed internally.

This approach allows you to preserve digital sovereignty and comply with GDPR, ISO 27001, and the AI Act, albeit at the cost of significant technical and financial effort.

Conclusions: Between Power and Control

Putting an LLM into production is not just an algorithmic challenge, but above all an infrastructure undertaking, involving specialized hardware, complex optimizations, high energy costs, and latency constraints. Cutting-edge models require clusters of dozens of GPUs, with investments ranging from hundreds of thousands to millions of euros per year to ensure a scalable, fast, and reliable service.

A final, but fundamental, consideration concerns the environmental impact of these systems. Large models consume enormous amounts of electricity, both during training and inference. As LLM adoption increases, it becomes urgent to develop smaller, lighter, and more efficient models that can deliver comparable performance with a significantly reduced computational (and energy) footprint.

As with every technological evolution—from personal computers to mobile phones—efficiency is the key to maturity: we don’t always need larger models, but smarter, more adaptive, and sustainable models.

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.
