Selecting an AI server without getting carried away: a pragmatic look at models, user numbers, and workflows
When people hear “local AI,” they often think first of a model and a graphics card. In practice, it's more like a small system made up of building blocks: an LLM that generates text, an interface that people use to work with it, and often an automation layer that integrates the model into processes. This is exactly where most bad purchases happen. Not because someone doesn't understand technology well enough, but because the wrong questions are being asked.
This article explains the basics so that you can prepare a hardware decision without getting lost in data sheets. You don't need any prior knowledge. And you don't have to commit to a specific tool setup, because the logic remains the same whether you work with Ollama/OpenWebUI/n8n or with another combination.
What exactly is a “local AI stack”?
1) VRAM: the GPU's memory
The most important term to start with is VRAM. This is the memory on the graphics card. An LLM must fit into this memory for the most part; otherwise it will slow down or fail to run properly. VRAM is therefore not a "nice to have" but the central budget factor.
More VRAM doesn't just mean a “bigger model.” It also means more breathing room for longer contexts, multiple simultaneous requests, and additional components such as vision or reranking.
2) GPU computing power: does it feel fast?
Even if a model fits into the VRAM, it can still be sluggish. Users notice this immediately. The decisive factors are the time until the first output (time to first token) and how quickly the text then flows (tokens per second). This matters more in everyday use than theoretical maximum values.
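A back-of-the-envelope way to think about perceived speed: the total wait is the time to first token plus the generation time for the rest of the answer. The figures below are purely hypothetical examples, not measurements of any specific hardware:

```python
def response_time_s(ttft_s: float, tokens: int, tokens_per_s: float) -> float:
    # Perceived latency = time to first token + time to stream the remaining tokens
    return ttft_s + tokens / tokens_per_s

# Hypothetical figures: the same 300-token answer on two setups
print(f"Fast setup: {response_time_s(0.4, 300, 40):.1f} s")
print(f"Slow setup: {response_time_s(2.5, 300, 8):.1f} s")
```

The same answer that feels instant on one machine can take half a minute on another, which is why throughput numbers alone tell you little about how the system feels.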
3) CPU and RAM: the reality surrounding them
Many tasks related to the model do not run on the GPU: fetching data, processing JSON, parsing documents, preparing PDFs, calculating embeddings, moving files, coordinating workflows. For this, you need CPU cores and RAM. Especially in automation, this is the area that quickly turns “runs in testing” into “becomes tough in everyday use.”
4) Storage: NVMe is not a luxury, but stability
When documents are involved, you need fast storage. Not only for raw throughput, but for consistent latency under load. NVMe reduces waiting times for index access, logs, cache, and file processing.
Understanding model sizes without getting bogged down in numbers
Model sizes are usually specified in "B," meaning billions of parameters: 7B, 14B, 27B, 70B. More parameters often means better quality, but the relationship is not linear. A well-trained 27B can be very convincing in many business tasks, while a poorly suited 70B can be unnecessarily expensive and slow.
In addition to size, there is a second lever that is crucial for local setups: quantization.
Quantization compresses model weights so that they require less VRAM. This is why many models are even feasible locally. You will then see variants such as Q4, Q5, Q8. As a rough rule of thumb:
- Heavier quantization (e.g. Q4): saves VRAM, but may compromise precision or style
- Lighter quantization (e.g. Q8): better quality, but requires more VRAM
QAT (Quantization Aware Training) is particularly interesting in this context: the model has been trained to perform better in quantized form. “gemma3 27B IT QAT” is a good example of this category: a model size that is often noticeably better than typical 7B/14B in practice, but remains within reach of local deployments thanks to QAT.
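To make the VRAM trade-off concrete, here is a rough rule-of-thumb calculation: weights take roughly (parameters × bits per weight) / 8 bytes, plus some overhead for runtime buffers. The effective bits per weight (the Q-values below include metadata) and the 20% overhead are assumptions for illustration, not exact figures for any specific model or runtime:

```python
def vram_estimate_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate for model weights plus ~20% runtime overhead.

    Excludes the KV cache, which grows with context length.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# Assumed effective bits per weight for common quantization levels
for label, bits in [("Q4", 4.5), ("Q5", 5.5), ("Q8", 8.5)]:
    print(f"27B at {label}: ~{vram_estimate_gb(27, bits):.0f} GB")
```

The point of the sketch: the same 27B model can land either comfortably inside or well outside a 24 GB card depending on quantization, which is exactly why Q-variants exist.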
Context length: the hidden VRAM hog
Context is what the model should “keep in mind”: chat history, instructions, document excerpts. More context is convenient, but it costs VRAM. So if you work with documents a lot, you're quickly faced with a choice: dump more context directly into the model or use a retrieval strategy.
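The reason context costs VRAM is the KV cache: the model stores key and value tensors for every token in the context, so the cache grows linearly with context length. The sketch below uses made-up model dimensions (layer count, KV heads, head size) that are merely in the range of a 27B-class model, not the specs of any particular one:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    # 2 tensors (K and V) per layer, one vector per token per KV head,
    # at 16-bit precision by default
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Hypothetical dimensions, roughly 27B-class
print(f"8k context:  ~{kv_cache_gb(46, 16, 128, 8192):.1f} GB")
print(f"32k context: ~{kv_cache_gb(46, 16, 128, 32768):.1f} GB")
```

Quadrupling the context quadruples the cache: several extra gigabytes of VRAM that are spent before a single parallel user is added.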
RAG in a nutshell: Use documents without cramming everything into context
RAG stands for Retrieval Augmented Generation. Instead of giving the model entire documents, you search for the relevant passages and pass only those on. This saves context and often improves the quality of the results.
RAG shifts the load: less VRAM pressure due to huge contexts, but more CPU/RAM/storage for embeddings, indexing, and searching. This is precisely why a local AI server is not purely a GPU issue.
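The retrieval step can be sketched in a few lines. This toy version uses bag-of-words cosine similarity so it stays self-contained; a real setup would use a learned embedding model and a vector index instead, but the shape of the pipeline (embed, score, pass the top hits to the model) is the same:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": word counts. Real RAG uses a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "Invoices must be approved by the finance team within five days.",
    "The cafeteria menu changes every Monday.",
    "VRAM requirements grow with model size and context length.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by similarity to the query; only the top-k reach the LLM
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

print(retrieve("How much VRAM does a model need?"))
```

The embedding and scoring here run on the CPU against storage, not on the GPU, which is the load shift described above.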
Conclusion
When planning local AI for everyday use, it's not enough for it to start up somehow and spit out a few answers in testing. The key is for it to remain stable under real-world use: multiple parallel chats, documents in the background, workflows running alongside.
VRAM determines which model class and contexts are realistic. GPU performance determines whether it feels fast. CPU, RAM, and NVMe determine whether workflows, documents, and secondary loads can sustain the whole thing over the long term.
Our scope7 family reflects precisely these typical growth stages: from a solid start and team operation to a platform for multiple teams and parallel models. The point is not to buy as much as possible, but to choose the right level so that users enjoy using it.
👉 Want to see how it looks in your environment? Book a demo and experience meloki.