
What Is LLM Quantization? A Simple Explanation

When using AI models, you often see terms like "quantization," "4-bit," and "Q4_K_M." What do they actually mean? We explain it simply using a photo compression analogy.

Mar 27, 2026 · 3 min read

When using local AI or tools like Ollama and LM Studio, you come across expressions like these:

"Q4_K_M", "INT8", "FP16", "4-bit quantization"

They seem important, but the explanations are hard to follow. Today, we'll explain quantization in really simple terms.


The Key Analogy: Photo Quality Compression

When you send a photo taken on your smartphone through a messaging app, the quality drops, right? That's because it was compressed to reduce file size.

Quantization works on exactly the same principle.

Inside an AI model, there are billions to hundreds of billions of numbers (weights). These numbers are the AI's "knowledge."

Storage Method              Size Per Number   Example
FP32 (original)             32 bits           3.14159265358979...
FP16 (half precision)       16 bits           3.1416
INT8 (8-bit quantization)   8 bits            3
INT4 (4-bit quantization)   4 bits            3 (even coarser)

Quantization = representing numbers more coarsely to reduce file size
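The idea fits in a few lines of Python. This is a toy symmetric int8 quantizer with one shared scale — an illustration of the principle, not how real GGUF kernels actually pack weights:

```python
def quantize_int8(weights):
    """Map floats to the int8 range [-127, 127] using one shared scale."""
    scale = max(abs(w) for w in weights) / 127
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [3.14159265, -1.20000044, 0.00052, 2.71828]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Each restored value is within half a scale step of the original:
# the numbers got coarser, but the "knowledge" is mostly preserved.
```

Each stored number now takes 1 byte instead of 4 — exactly the file-size win quantization is after, paid for with a small rounding error per weight.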


Why Does It Matter?

For example, say you have a Llama 3 70B model. That's 70 billion parameters (numbers).

  • FP32 original: 70B x 4 bytes = ~280GB — impossible on a regular PC
  • FP16: 70B x 2 bytes = ~140GB — still difficult
  • INT4 quantization: 70B x 0.5 bytes = ~35GB — possible on a high-end PC!

Thanks to quantization, regular users can run powerful AI models on their own computers.
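The arithmetic above is just parameters × bytes per weight. A quick sketch (weight-only; real memory use also includes activations, the KV cache, and format overhead, which this ignores):

```python
def model_size_gb(params_billion, bits_per_weight):
    """Rough weight-only footprint: parameters x bits per weight, in decimal GB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Llama 3 70B at different precisions
for name, bits in [("FP32", 32), ("FP16", 16), ("INT4", 4)]:
    print(f"{name}: ~{model_size_gb(70, bits):.0f} GB")
# FP32: ~280 GB
# FP16: ~140 GB
# INT4: ~35 GB
```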


What Does Q4_K_M Mean?

This is a notation you frequently see in Ollama and GGUF files.

Q4_K_M
│ │ └── M: Medium (balanced quality)
│ └──── K: K-quant method (a more sophisticated quantization algorithm)
└────── 4: Stored in 4 bits

Notation   Description                           Recommended Use Case
Q8_0       8-bit, closest to original            When you have plenty of VRAM
Q4_K_M     4-bit, optimal quality/size balance   Recommended for most cases
Q3_K_S     3-bit, very small                     When VRAM is very limited
Q2_K       2-bit, minimum size                   Significant quality loss, not recommended
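As a rough illustration of how you'd choose between these levels, here is a hypothetical helper that estimates each level's weight size as the FP16 size scaled by its bit width, then picks the largest one that fits in free VRAM. The scaling rule and the idea of choosing purely by size are simplifying assumptions, not part of the GGUF format:

```python
def pick_quant(free_vram_gb, fp16_size_gb):
    """Pick a quant level by approximating its size as fp16_size * bits / 16.

    Illustrative only: Q8 ~ 8 bits/weight, Q4 ~ 4, Q3 ~ 3, Q2 ~ 2
    (real GGUF levels use slightly more bits per weight than their name says).
    """
    for level, bits in [("Q8_0", 8), ("Q4_K_M", 4), ("Q3_K_S", 3), ("Q2_K", 2)]:
        if fp16_size_gb * bits / 16 <= free_vram_gb:
            return level
    return None  # model too large even at 2-bit

pick_quant(24, 16)  # a model that is ~16 GB in FP16, with 24 GB free -> "Q8_0"
pick_quant(6, 16)   # same model, only 6 GB free -> "Q4_K_M"
```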

Is the Quality Difference Really That Big?

Honestly, for most everyday conversations, there's almost no difference.

  • Writing, translation, summarization — Q4_K_M is sufficient
  • Complex math reasoning, coding — Q6 or higher recommended
  • Academic research — FP16 recommended

It's like watching YouTube — even at 1080p instead of 4K, you can enjoy most videos just fine.


One-Line Summary

Quantization = storing an AI model's numbers at lower precision. The file gets smaller, quality drops slightly, but you can run it on your own PC.

If you want to run AI directly on your computer instead of the cloud, look for a Q4_K_M version. It's sufficient for most use cases.


Related post: What's the Best Llama Model That Runs on My Computer?
