You are staring at your terminal. You just typed the command to run Llama 3 or Mistral. You hit Enter. The hard drive lights blink furiously, the SSH session stutters, and then silence. A message appears: Killed.
Your server just committed suicide.
This is the dreaded Linux “Out of Memory” (OOM) Killer: a kernel mechanism that saves the operating system by murdering whichever process is eating all the RAM. In this case, your AI model was the victim.
If you are trying to run a 7-billion or 8-billion parameter model on a 4GB or 8GB server, the math is simply against you. An 8B model in its raw FP16 form requires about 16GB of memory. Even compressed, it needs 5-6GB. If you have 4GB of physical RAM, you are trying to pour a gallon of water into a pint glass.
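That 16GB figure comes straight from the precision: FP16 stores every parameter in 2 bytes, so a quick back-of-envelope check is:

```shell
# FP16 stores 2 bytes per parameter, so the weights alone cost ~2 GB per billion params
params_billions=8
echo "$(( params_billions * 2 )) GB"   # -> 16 GB, before any context or overhead
```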
But don’t upgrade your plan yet. We can fix this with some Linux surgery. Here is how to force your server to run models it has no business running.
The First Line of Defense: Massive Swap Space
Most people create a swap file, but they make it too small. For AI, swap isn’t just “emergency” memory; it is “necessary” memory. Since we are using SSDs (which are fast), we can lean on swap heavily.
If you have 4GB RAM, do not create a 2GB swap file. Create an 8GB or 16GB swap file.
Run these commands to resize or create a massive safety net:
Bash
# Turn off existing swap
sudo swapoff -a
# Create a 16GB swap file (Yes, 16GB. Storage is cheap, RAM is not)
# Note: on filesystems where fallocate-backed swap is unsupported,
# use: sudo dd if=/dev/zero of=/swapfile bs=1M count=16384
sudo fallocate -l 16G /swapfile
# Secure it
sudo chmod 600 /swapfile
# Tell Linux it's swap
sudo mkswap /swapfile
# Turn it on
sudo swapon /swapfile
# Make it permanent
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Now, when your Llama 3 model spills over the 4GB RAM limit, it will land softly on the SSD instead of crashing the server.
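Before you load the model, confirm the swap is actually live. Both of these are standard tools that read /proc/swaps and /proc/meminfo:

```shell
# Verify the new swap file is active and sized as expected
swapon --show          # should list /swapfile at 16G
free -h                # the "Swap:" row shows total / used / free
```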
The Secret Weapon: Tuning “Swappiness”
By default, Linux hates using swap. It tries to keep everything in RAM until the absolute last second. When that second arrives, the spike in pressure is often too fast, and the OOM Killer shoots first.
We need to tell Linux: “It is okay to use the disk. Don’t panic.”
We do this by changing the vm.swappiness value. The scale runs from 0 to 100:
- 0: Never swap (an OOM crash is likely).
- 60: The default.
- 100: Swap aggressively.
For a low-RAM AI server, we want a higher value than usual, so the kernel moves cold pages to swap gradually instead of in one sudden panic. A second knob matters just as much: vm.vfs_cache_pressure, which controls how aggressively Linux throws away cached filesystem metadata. Lowering it keeps that cache in RAM longer, which keeps access to the model file fast.
Run this to optimize memory pressure:
Bash
# Hold on to cached filesystem metadata longer (default is 100; lower = evict less)
sudo sysctl vm.vfs_cache_pressure=50
# Allow the system to use swap more proactively to avoid sudden OOM locks
sudo sysctl vm.swappiness=80
To make this permanent, create a config file:
Bash
echo "vm.swappiness=80" | sudo tee -a /etc/sysctl.conf
echo "vm.vfs_cache_pressure=50" | sudo tee -a /etc/sysctl.conf
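You can read the live values back at any time (no reboot needed) and re-apply the config file after editing it:

```shell
# Current values, straight from the kernel
cat /proc/sys/vm/swappiness          # should print 80 after the change
cat /proc/sys/vm/vfs_cache_pressure  # should print 50
# Re-load everything in /etc/sysctl.conf after editing it
sudo sysctl -p
```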
The “God Mode” Fix: Disable OOM Killer for Ollama
If you are running your AI via a system service (like Ollama or a Python script managed by Systemd), you can tell the Linux kernel: “Under no circumstances are you allowed to kill this specific process.”
We use a setting called OOMScoreAdjust.
1. Edit your system service:
Bash
sudo systemctl edit ollama.service
2. Add this magic line:
In the editor, add OOMScoreAdjust=-1000 to the [Service] section. The value -1000 means “invincible”: the OOM Killer will never select this process.
Ini
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
OOMScoreAdjust=-1000
3. Save and Apply:
Bash
sudo systemctl daemon-reload
sudo systemctl restart ollama
Now, if the server runs out of memory, Linux will kill everything else (SSH, logging, web server) before it kills your AI. This is risky for a production web server, but perfect for a dedicated AI node.
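You can confirm the adjustment took effect by reading it back from /proc; every process exposes its score there. (The pgrep line assumes the process is literally named ollama.)

```shell
# 0 is the default; -1000 means the OOM Killer will never pick this process
cat /proc/self/oom_score_adj                               # your shell's score, for comparison
pid=$(pgrep -x ollama) && cat "/proc/$pid/oom_score_adj"   # should print -1000 after the fix
```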
The Hidden Culprit: The Context Window
This is where 90% of users fail. They download a small model but leave the “Context Window” at default.
The Context Window is the “short-term memory” of the AI. It stores the conversation history.
- A 4096-token context window takes up roughly 2GB to 3GB of extra RAM (the KV cache) on top of the model weights.
- An 8192-token context window can eat 5GB+.
If you are on a 4GB server, you cannot run a standard 8k context. You must lower it.
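Where do those gigabytes come from? The KV cache stores two vectors (key and value) per layer, per attention head, per token. Here is a rough sketch assuming a Llama-style model without grouped-query attention (32 layers, 32 heads, head dimension 128, fp16 cache); models with grouped-query attention, like Llama 3, cache fewer heads and need proportionally less:

```shell
# KV cache bytes = 2 (K and V) * layers * context * heads * head_dim * 2 bytes (fp16)
layers=32 heads=32 head_dim=128 ctx=4096
echo "$(( 2 * layers * ctx * heads * head_dim * 2 / 1024 / 1024 )) MiB"   # -> 2048 MiB
```

Halve the context and you halve this number, which is exactly why shrinking num_ctx rescues a 4GB box.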
How to reduce context in Ollama:
You need to create a custom “Modelfile.”
1. Pull the base model:
Bash
ollama pull llama3
2. Create a file named Modelfile with this content:
Dockerfile
FROM llama3
PARAMETER num_ctx 2048
3. Create the new low-RAM version:
Bash
ollama create llama3-lite -f Modelfile
4. Run it:
Bash
ollama run llama3-lite
By dropping the context from 8192 to 2048, you just saved about 3GB of RAM. The AI will still be smart; it just won’t remember conversations that are 50 pages long.
Pick the Right Quantization (The GGUF Format)
Not all “8B” models are the same size. They are compressed using a method called Quantization.
- F16 (Full Precision): 16GB RAM. (Impossible for you).
- Q8 (8-bit): 8.5GB RAM. (Too big).
- Q4_K_M (4-bit Medium): 4.8GB RAM. (The Standard).
- Q3_K_M (3-bit Medium): 3.5GB RAM. (The low-end hero).
If Q4 is still crashing your server, switch to Q3_K_M. The intelligence drop is barely noticeable for simple tasks, but it fits comfortably in 4GB RAM.
To find these, look for “GGUF” files on HuggingFace or use tags in Ollama like ollama run llama3:8b-instruct-q3_K_M (check the specific tag library for availability).
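If you want to sanity-check a quant before downloading it, a rough sizing rule is: file size in GB ≈ parameters (in billions) × bits-per-weight ÷ 8. The 4.85 bits-per-weight figure used below for Q4_K_M is an approximation, since K-quants mix several bit widths across the tensors:

```shell
# Estimate GGUF file size: params (billions) * average bits per weight / 8
awk -v p=8 -v bpw=4.85 'BEGIN { printf "~%.2f GB\n", p * bpw / 8 }'   # -> ~4.85 GB
```

Swap in bpw=3.9 or so for Q3_K_M and the same 8B model lands in the mid-3GB range, matching the table above.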
Monitoring: Watch It Happen
Don’t fly blind. You need to see exactly when the memory fills up. Standard top is ugly. Install htop or btop.
Bash
sudo apt install btop -y
btop
Open btop in a second terminal window while you load the model. Watch the “Mem” bar.
- It will fill the physical RAM (Purple/Blue).
- Then it will start filling the “Disk” or “Swap” bar (Red).
- If Swap hits 100%, you die. If Swap stays at 50%, you are safe.
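For scripts or cron alerts, the same numbers are available without btop. Here is a one-shot swap-utilisation check that reads /proc/meminfo directly:

```shell
# Print swap usage as a percentage (or note that no swap exists)
awk '/SwapTotal/ {t=$2} /SwapFree/ {f=$2}
     END { if (t) printf "swap used: %d%%\n", (t-f)*100/t; else print "no swap configured" }' /proc/meminfo
```

Wire it into cron and you can get warned well before that 100% cliff.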
Conclusion
You don’t need an NVIDIA H100 to learn AI. You just need to understand how Linux manages resources. By creating a massive swap file, adjusting the kernel’s kill logic, and shrinking the AI’s context window, you can run surprisingly smart models on surprisingly cheap hardware.
It won’t be fast—you are trading speed for possibility—but it will work. And in the world of expensive GPUs, “working” is a victory.