4-bit LLM Training with QAT & Unsloth | Complete Guide

Jaydev Tonde
December 4, 2025·23 min read


Let's be real: we've all been there. You spend days (or nights) fine-tuning a new LLM, feeling pretty good about it. Then, to make it run faster, you quantize it to 4-bit. The model's accuracy drops dramatically.

We shrink models to make them faster and cheaper to run, but it always feels like a bad trade-off: Accuracy for Speed.

The usual culprit is Post-Training Quantization (PTQ). It's the "fast and easy" method where you train your model first, then just squish it down to 4-bit afterward. The problem? The model was never built for that, and the performance hit can be painful.

But what if you could get that 4-bit speed and keep your original accuracy? Or even beat it?

That's what Quantization-Aware Training (QAT) is all about. Instead of squishing the model after, you make it "aware" of the 4-bit limits while it's still training.

The folks at Unsloth and PyTorch AO are the ones who really cracked this. They built the tools that make it possible.

My goal here isn't to reinvent the wheel. It's to show you how to use their work to get the same awesome results. I ran the experiment myself, fine-tuning a Qwen3-4B model, just to prove it out.

And the results are in: This stuff really works.

In my own tests, QAT recovered 69% of the WikiText perplexity that PTQ lost. Even better, on the MMLU-Pro test, the 4-bit QAT model actually beat the original full-precision model by 1.76% (relative).

So, let's walk through how you can do it, too.


Free Credits Inside

Get ₹2,000 free credits to test your AI workloads

Sign up and complete ID verification to unlock free credits. Deploy on NVIDIA H200, H100, and L40S GPUs—no commitment required.

How Quantization Actually Works: PTQ vs. QAT

To understand why QAT is so effective, we first need to cover a few core concepts.

What is Quantization?

At its simplest, quantization is the process of reducing the precision of numbers. Our model's weights (the parameters it learned during training) are typically stored in 32-bit (FP32) or 16-bit (BF16 or FP16) floating point. For this article, let's assume they are in BF16.

A 16-bit number can represent over 65,000 different values. An INT4 (4-bit integer) number can only represent 16 different values.

Image source: https://www.maartengrootendorst.com/blog/quantization/
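To put numbers on that gap, here is a small sketch that counts the bit patterns each format can encode and estimates raw weight storage. The 4-billion parameter count is a round-number assumption for a "4B" model, and the estimate ignores the per-group scale and zero-point metadata that a real 4-bit checkpoint also stores.

```python
def num_levels(bits: int) -> int:
    """Number of distinct bit patterns an n-bit format can encode."""
    return 2 ** bits


def weight_storage_gb(num_params: int, bits_per_weight: float) -> float:
    """Raw weight storage in gigabytes (1 GB = 1e9 bytes), ignoring metadata."""
    return num_params * bits_per_weight / 8 / 1e9


NUM_PARAMS = 4_000_000_000  # assumption: a nominal "4B" model

print(num_levels(16))                     # 65536
print(num_levels(4))                      # 16
print(weight_storage_gb(NUM_PARAMS, 16))  # 8.0
print(weight_storage_gb(NUM_PARAMS, 4))   # 2.0
```

Roughly a 4x reduction in weight memory is what makes the accuracy question worth solving.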

The goal is to map the massive BF16 range to the tiny INT4 range without losing the "meaning" of the weights. We do this using two key parameters:

  1. Scale: A multiplier that scales the numbers up or down.
  2. Zero-Point: An offset that shifts the entire range.

This mapping leads to two main ways of "squishing" the numbers.

Symmetric vs. Asymmetric Quantization

  • Symmetric Quantization: This method is simple and fast. It forces the range of numbers to be centered around 0. The "zero-point" is always 0.
    • Analogy: Imagine a ruler where 0 must be in the exact center. If your BF16 numbers range from -100 to +100, it works perfectly. But if your numbers range from +10 to +210 (like after a ReLU activation), you "waste" half your 16 available 4-bit values representing numbers from -210 to 0 that don't even exist in your data.
  • Asymmetric Quantization: This method is more flexible and accurate. It shifts the "zero-point" to match the data.
    • Analogy: This is a "sliding" ruler. For the data ranging from +10 to +210, it sets the 4-bit "0" to represent the BF16 "+10" and uses all 16 available values to cover the range up to +210. It doesn't waste any space.
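The two rulers can be compared on the same +10 to +210 range with a minimal pure-Python sketch. The data values are made up for illustration, and the asymmetric variant uses the range minimum as its offset (code 0 maps to the smallest value), which is one common way to realize a zero-point and not necessarily what any particular library does internally.

```python
def fake_quant_asymmetric(values, bits=4):
    """Quantize with an offset (zero-point) and a scale, then dequantize."""
    levels = 2 ** bits - 1                  # 15 steps between 16 codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels
    codes = [round((v - lo) / scale) for v in values]   # codes in 0..15
    return codes, [lo + c * scale for c in codes]


def fake_quant_symmetric(values, bits=4):
    """Quantize around zero (zero-point fixed at 0), then dequantize."""
    q_max = 2 ** (bits - 1) - 1             # 7 for signed INT4
    scale = max(abs(v) for v in values) / q_max
    codes = [max(-q_max - 1, min(q_max, round(v / scale))) for v in values]
    return codes, [c * scale for c in codes]


def max_abs_error(original, restored):
    return max(abs(o - r) for o, r in zip(original, restored))


data = [10.0, 35.0, 60.0, 110.0, 160.0, 210.0]  # made-up, all-positive values

sym_codes, sym_restored = fake_quant_symmetric(data)
asym_codes, asym_restored = fake_quant_asymmetric(data)

print("symmetric codes: ", sym_codes)    # only non-negative codes get used
print("asymmetric codes:", asym_codes)   # spread over the full 0..15 range
print("symmetric max error: ", max_abs_error(data, sym_restored))
print("asymmetric max error:", max_abs_error(data, asym_restored))
```

On this data the symmetric ruler never touches its negative codes, so its step size is coarser and its worst-case error is larger.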

Want to learn more? For a more detailed breakdown of these concepts, check out this excellent blog post: Quantization in LLMs.

Now, let's look at when we apply these techniques.

Post-Training Quantization (PTQ)

This is the "squish it after" method. It's fast and easy, but often less accurate. Popular PTQ methods you might have heard of include AWQ and GPTQ. In this blog, however, we use weight-only PTQ, so no calibration dataset is needed.

  1. You train your model in full BF16 precision.
  2. After training, you quantize it. For weight-only PTQ (like what we're using in our experiment), the quantization parameters (scale and zero-point) are determined directly from the statistics (min/max values) of the BF16 weights themselves, often on a per-group basis. This means you don't need a separate calibration dataset to run through the model, which simplifies the process.
  3. The BF16 weights are then converted into their INT4 representation.

The problem? The model was never trained to handle the rounding errors (quantization noise) this process introduces, leading to potential accuracy drops.

Quantization-Aware Training (QAT)

This is the "train it to be squished" method. It's our focus today.

With QAT, you're simulating the 4-bit errors during the fine-tuning process. The model learns to adapt its weights to compensate for this noise (the rounding errors), resulting in dramatically higher accuracy.

This is how it works on every single training step:


Forward Pass (Simulating the Error)

  1. Start: The model begins with its full-precision BF16 weights.
  2. Simulate 4-bit Precision: This is where the magic happens, and it's a two-part process that uses the scale and zero-point we discussed.
    • Quantize: First, the model calculates a scale and zero-point for the BF16 weights. It uses them to "squish" the weights into their 4-bit integer representation. This simulates the precision loss.
    • De-Quantize: It immediately uses that same scale and zero-point to convert those 4-bit integers back into the BF16 format.
  3. Calculate: The forward pass (e.g., matrix multiply) is then performed using these "degraded" BF16 weights. The model is now "seeing" the rounding errors and precision loss that a real 4-bit model would have.
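The forward-pass steps above can be sketched in a few lines of plain Python. This is a toy with one scale and offset for the whole weight vector and made-up numbers; the real training stack works on tensors and per-group parameters, so treat it as an illustration of the idea only.

```python
def fake_quantize(weights, bits=4):
    """Quantize to `bits`-bit codes, then immediately dequantize."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0       # guard against constant weights
    codes = [round((w - lo) / scale) for w in weights]   # quantize
    return [lo + c * scale for c in codes]                # de-quantize


def linear(weights, inputs):
    """A one-output linear layer: the dot product of weights and inputs."""
    return sum(w * x for w, x in zip(weights, inputs))


# Made-up full-precision weights and one input vector.
weights = [0.42, -0.17, 0.88, -0.63, 0.05, 0.31, -0.94, 0.70]
inputs = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 1.5, 0.25]

degraded = fake_quantize(weights)           # what the forward pass really uses

print("full-precision output:", linear(weights, inputs))
print("fake-quantized output:", linear(degraded, inputs))
```

The gap between the two printed outputs is the quantization noise the model gets to see, and learn from, during QAT.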

Backward Pass (Learning from the Error)

  1. Calculate Loss: The model's final output (which includes the 4-bit errors) is compared to the correct answer, and a loss is calculated.
  2. Pass Gradients (The "Trick"): This is the clever part. The rounding step from the forward pass isn't usefully "differentiable": its gradient is zero almost everywhere, which would stop the model from learning. QAT uses a trick called the Straight-Through Estimator (STE).
    • Analogy: Think of STE as a "detour" for the gradient. On the backward pass, the gradient travels back toward the weights. When it hits the "broken" (non-differentiable) round() function, STE simply "pretends" it's not there and directs the gradient straight through to the original weights, as if it were a simple 1-to-1 connection.
  3. Update Weights: Because of STE, the gradients (which contain information about the 4-bit error) successfully reach and update the original, full-precision BF16 weights.

Over thousands of steps, the full-precision weights learn to adjust themselves. They learn to avoid "ambiguous" values that are close to a rounding boundary. In this way, they become "robust" to the 4-bit squishing process, ensuring the final 4-bit model is as accurate as possible.
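To make the STE concrete, here is a deliberately tiny sketch with a single weight and hand-written gradients. The grid step, learning rate, and data are all assumptions chosen for illustration; real QAT does the same thing through autograd on full tensors.

```python
SCALE = 0.25           # assumed grid step for a signed 4-bit weight
Q_MIN, Q_MAX = -8, 7   # signed INT4 code range


def fake_quantize(w: float) -> float:
    """Snap a weight to the nearest representable 4-bit grid point."""
    code = max(Q_MIN, min(Q_MAX, round(w / SCALE)))
    return code * SCALE


def loss_and_grad(w_q, xs, ys):
    """Mean squared error of y = w_q * x and its gradient w.r.t. w_q."""
    n = len(xs)
    loss = sum((w_q * x - y) ** 2 for x, y in zip(xs, ys)) / n
    grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / n
    return loss, grad


xs = [-2.0, -1.0, 0.5, 1.0, 2.0]
ys = [0.3 * x for x in xs]       # toy data generated with a "true" weight of 0.3

w = 1.0                          # full-precision latent weight
initial_loss, _ = loss_and_grad(fake_quantize(w), xs, ys)

for _ in range(200):
    w_q = fake_quantize(w)                    # forward: use the degraded weight
    _, grad_wq = loss_and_grad(w_q, xs, ys)
    grad_w = grad_wq                          # STE: pretend d(w_q)/d(w) == 1
    w -= 0.05 * grad_w                        # update the full-precision weight

final_loss, _ = loss_and_grad(fake_quantize(w), xs, ys)
print("quantized loss before:", initial_loss)
print("quantized loss after: ", final_loss)
```

Without the STE line, the true gradient through `round()` would be zero and `w` would never move from its starting value.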


The Trade-off: Slower Training for a Smarter Model

This simulation adds extra math (the quantize/de-quantize step) to every single training pass.

  • The Trade-off: QAT is slower than a standard fine-tuning run. In our experiment, the QAT model took about 4 hours to train, while the standard PTQ-path model only took 2 hours.
  • The Payoff: You invest more time upfront during training, but the result is a final 4-bit model with dramatically higher accuracy, as our benchmarks will show.

Quick Comparison: PTQ vs. QAT

Here's a simple table to lock in the difference.

| Feature | Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
| --- | --- | --- |
| When? | After training is finished. | During the training process. |
| How? | Squishes a fully-trained model once. | Simulates 4-bit errors on every step. |
| Goal? | Speed and size reduction. | Maximize accuracy while getting speed. |
| Cost? | Very fast (seconds to minutes). | Slower (adds overhead to training). |
| Result? | Good, but almost always loses accuracy. | Excellent. Can match or beat the baseline. |

A Quick Look at Int4WeightOnlyConfig

This configuration, used for our PTQ baseline, implements grouped, asymmetric, weight-only INT4 quantization.

In our PTQ script, to create our 4-bit baseline, we used torchao's Int4WeightOnlyConfig: quant_config = Int4WeightOnlyConfig(group_size=128, ...)

What does this actually do?

  • WeightOnly: This is a crucial distinction. It only quantizes the model's weights (the Linear layers) to INT4. The activations (the data flowing between layers) remain in BF16. This is a very common and effective PTQ method for inference speedups with minimal fuss.

  • group_size=128: This is the secret to its accuracy. Instead of calculating one scale and zero-point for an entire multi-million-parameter weight matrix (which is very inaccurate), it breaks the matrix into small groups of 128 numbers. It then calculates a separate scale and zero-point for each tiny group directly from the weight values within that group. This is far more granular and better handles outliers, which is why "grouped" quantization is so much better than per-tensor quantization.
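Here is a small sketch of why grouping helps when a weight matrix contains an outlier. It uses plain Python, random made-up weights, and a simple min/max scheme per group; it illustrates the principle and is not how torchao implements `Int4WeightOnlyConfig` internally. The storage figure assumes a 16-bit scale and a 16-bit zero-point per group.

```python
import random


def fake_quantize(values, bits=4):
    """Min/max quantize-dequantize with one scale and offset for `values`."""
    levels = 2 ** bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0
    return [lo + round((v - lo) / scale) * scale for v in values]


def fake_quantize_grouped(values, group_size=128, bits=4):
    """Apply fake_quantize independently to each block of `group_size` values."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize(values[start:start + group_size], bits))
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
weights = [rng.uniform(-0.1, 0.1) for _ in range(256)]
weights[5] = 4.0   # a single outlier, located in the first group

per_tensor_err = mean_abs_error(weights, fake_quantize(weights))
grouped_err = mean_abs_error(weights, fake_quantize_grouped(weights, 128))

# Assumption: one 16-bit scale and one 16-bit zero-point per 128 weights.
bits_per_weight = 4 + (16 + 16) / 128

print("per-tensor error:", per_tensor_err)
print("grouped error:   ", grouped_err)
print("bits per weight: ", bits_per_weight)  # 4.25
```

The outlier stretches the range for everything that shares its scale, so confining it to one group protects the other group at a cost of about a quarter of a bit per weight.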


The Power Behind the Recovery: Unsloth and PyTorch AO


The impressive accuracy recovery and performance boost we're discussing aren't just theoretical. They're made possible by the innovative work and seamless integration provided by Unsloth and TorchAO. These are the core components of the solution that Unsloth has championed, and which we've leveraged in our experiments.

  • Unsloth: This is the high-performance framework that makes LoRA fine-tuning incredibly fast and memory-efficient. Unsloth's magic lies in its optimized kernels and its deep integration with quantization techniques. When we pass qat_scheme = "int4" to Unsloth, it intelligently leverages torchao under the hood to perform the QAT loop, transforming a regular fine-tuning run into a Quantization-Aware one with minimal fuss for the developer.

  • PyTorch AO (torchao): This is PyTorch's native library for advanced quantization and optimization. It provides the robust, low-level building blocks for both PTQ and QAT. Unsloth integrates directly with torchao to manage the quantization process during training and conversion.


GPU Machine Setup for Model Training (H100)

If you'd like to follow this tutorial step-by-step, you can easily rent a GPU machine tailored to your needs on the E2E Networks platform. Check the GPU pricing page for current rates. Following are the steps I followed to get my GPU instance ready for model training.

Step 1: Launching Your GPU Instance

First, let's get your GPU instance launched on the TIR platform of E2E Networks. It's a straightforward process.

  1. Log in to your E2E Networks account and find the Instances (Nodes) section (usually under Products in the sidebar).

  2. Click the CREATE INSTANCE button.


  3. Choose Image: Head to the Pre-built tab and select the PyTorch image. This is a massive time-saver, as it comes with the NVIDIA drivers, CUDA, PyTorch, and other key tools already installed.


  4. Choose a Plan: Select a GPU instance that fits your needs. For a powerful model like this, we're selecting an H100 instance, which is perfect for heavy-duty training or high-throughput inference.

  5. Select Pricing: To save on costs, especially for batch jobs that don't need to run 24/7, choose the Spot plan. Just remember the trade-off: spot instances can be interrupted if the capacity is needed elsewhere.


  6. Finalize Details: Configure your Storage (the default 30GB is often enough to start), add your SSH key for secure login, and adjust your Security Group settings (e.g., ensure port 22 is open for SSH).

  7. Click Launch Instance, and you're ready for the next step!


Step 2: Connecting to Your New Instance

  • Once your instance status shows as Running, it's time to connect. You've got two great options:


  • SSH: This is the standard for most development. You can use the provided SSH command directly in your local terminal.

    Pro-Tip: For the best experience, connect your favorite IDE (like VS Code with the Remote - SSH extension) to the instance's IP address. This gives you a full-featured coding environment right on your powerful remote machine.

  • Jupyter Lab: For quick experimentation or notebook-based work, simply click the Jupyter link provided on the instance details page. This will open a complete Jupyter Lab interface directly in your web browser—no local setup needed.


The Experiment: A Single Script for QAT vs. PTQ

Theory is great, but let's see the code. To fairly compare PTQ and QAT, I created a single training script that can run both experiments. The full script is available on GitHub and the trained models are on Hugging Face.

Here's the setup:

  • Model: unsloth/Qwen3-4B-Instruct-2507 (a 4B parameter model from the Qwen3 family, optimized by Unsloth).
  • Dataset: mlabonne/FineTome-100k (a high-quality, 100k-sample subset of the FineTome dataset).
  • The Goal: Fine-tune this model on our dataset using two different methods, then benchmark the results.
    • Experiment 1 (PTQ): Fine-tune in full precision (BF16), then quantize to 4-bit after training.
    • Experiment 2 (QAT): Fine-tune while simulating 4-bit quantization.

The Script Walkthrough

Step 1: Configuration

The most important part of the script is this single flag. It controls whether we run the PTQ or QAT experiment.

python
import wandb

QUANTIZATION_TYPE = "PTQ"  # Options: "PTQ" or "QAT"
RUN_NAME = "Qwen3_4B_" + QUANTIZATION_TYPE
wandb.init(project = "QuantizationTraining", name = RUN_NAME)

This QUANTIZATION_TYPE variable will change our model setup and quantization steps later on.

Step 2: Load Model (in Full Precision)

Load our model using Unsloth's FastLanguageModel.

python
from unsloth import FastLanguageModel

# Load the Model and Tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length = 2048,
    load_in_4bit = False,  # <-- Key!
    load_in_8bit = False,  # <-- Key!
    full_finetuning = False,
)

Important Note: This might seem counter-intuitive! Why are we setting load_in_4bit = False?

It's because we need to fine-tune the model first.

  • For QAT, we load in full precision (BF16 by default on new GPUs) and let Unsloth simulate 4-bit during training.
  • For PTQ, we must train in full precision to create our baseline, which we'll quantize later.

Step 3: Add LoRA Adapters

This is where the paths for QAT and PTQ truly diverge. We use Unsloth's get_peft_model to add LoRA adapters. Pay close attention to the qat_scheme parameter.

python
# Add LoRA Adapters
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    # This is the magic line!
    qat_scheme = "int4" if QUANTIZATION_TYPE == "QAT" else None,
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

If our QUANTIZATION_TYPE is set to "QAT", Unsloth and torchao automatically inject "fake quantization" modules into our model. This is what makes it "Quantization-Aware." If the type is "PTQ", qat_scheme is None, and it performs a standard BF16 LoRA fine-tuning.

Step 4: Data Preparation and Training

We load the mlabonne/FineTome-100k dataset, apply the Qwen3 chat template, and set up the SFTTrainer from TRL.

python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Load training dataset
dataset = load_dataset("mlabonne/FineTome-100k", split = "train")

# Apply chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }

dataset = dataset.map(formatting_prompts_func, batched = True)

# Train the model
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text",
        per_device_train_batch_size = 4,
        gradient_accumulation_steps = 4,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        logging_steps = 50,
        optim = "adamw_8bit",
        seed = 3407,
        report_to = "wandb",
    ),
)
trainer_stats = trainer.train()

After this step, trainer.train() runs. The QAT model trains slower because of the extra quantization simulation, but it's learning how to be 4-bit. The PTQ model just trains normally. In our run, QAT training took approximately 4 hours, while the PTQ-path model took only 2 hours.
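For a rough sense of scale, the training configuration above can be turned into a step count and a per-step time. This is back-of-the-envelope arithmetic that assumes a single GPU and exactly 100,000 training rows (suggested by the dataset name, not verified here), and it spreads the reported wall-clock hours evenly over all steps.

```python
import math


def optimizer_steps(num_examples, per_device_batch, grad_accum, num_gpus=1, epochs=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


effective_batch, total_steps = optimizer_steps(
    num_examples=100_000,   # assumption: FineTome-100k has about 100k rows
    per_device_batch=4,
    grad_accum=4,
)

print(effective_batch)  # 16
print(total_steps)      # 6250

# Average seconds per optimizer step, from the reported wall-clock times.
for name, hours in [("PTQ path", 2), ("QAT", 4)]:
    print(name, round(hours * 3600 / total_steps, 3), "s/step")
```

Under these assumptions, the fake-quantization overhead roughly doubles the average time per optimizer step.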

Step 5: Applying Quantization

Here's how we finalize the models.

Path 1: Post-Training Quantization (PTQ)

If QUANTIZATION_TYPE == "PTQ", our training just finished. We now have a BF16 model with LoRA adapters. We must:

  1. Merge the LoRA adapters into the base model to get a full, fine-tuned BF16 model.
  2. Save this full model.
  3. Reload the saved model and apply 4-bit quantization now.
python
import torch
from torchao.quantization import Int4WeightOnlyConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig

# Check our flag
if QUANTIZATION_TYPE == "PTQ":
    # 1. Merge and save the full-precision model
    baseline_name = RUN_NAME.replace("PTQ", "baseline")
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained(f"./{baseline_name}")
    tokenizer.save_pretrained(f"./{baseline_name}")

    # 2. Define our 4-bit PTQ config from torchao
    quant_config = Int4WeightOnlyConfig(group_size=128, int4_packing_format="tile_packed_to_4d")
    quantization_config = TorchAoConfig(quant_type=quant_config)

    # 3. Reload the model and apply 4-bit quantization
    model = AutoModelForCausalLM.from_pretrained(
        f"./{baseline_name}",
        device_map="auto",
        torch_dtype=torch.bfloat16,
        quantization_config=quantization_config
    )
    tokenizer = AutoTokenizer.from_pretrained(f"./{baseline_name}")

Path 2: Quantization-Aware Training (QAT)

If QUANTIZATION_TYPE == "QAT", our model is already "aware." We don't need to merge and reload. We just tell torchao to convert the "fake quantized" modules into real 4-bit quantized layers.

python
from torchao.quantization import quantize_
from torchao.quantization.qat import QATConfig

# Check our flag
if QUANTIZATION_TYPE == "QAT":
    # Just convert the already-trained model!
    quantize_(model, QATConfig(step = "convert"))

That's it. It's a single line. This step swaps the simulation-time modules for fast, 4-bit inference-time modules.

Step 6: Save the Final Model

Finally, we save our quantized model. The QAT model uses a dedicated save method, save_pretrained_torchao, which Unsloth provides for torchao-quantized models.

python
if QUANTIZATION_TYPE == "QAT":
    model.save_pretrained_torchao(
        RUN_NAME,
        tokenizer,
        torchao_config = model._torchao_config.base_config,
    )
else:  # PTQ
    model.save_pretrained(f"./{RUN_NAME}")
    tokenizer.save_pretrained(f"./{RUN_NAME}")

Now we have three different models, ready for benchmarking.

  1. Baseline model before applying PTQ (16 bit).
  2. Post Training Quantized model (4 bit).
  3. Quantization aware trained model (4 bit).

Model Evaluation on Multiple Benchmarks

To really know if our training and optimization actually helped, we need to test it with standardized benchmarks.

Think of these tests as a "report card" for the model. They move us past vague feelings and give us cold, hard numbers, showing us exactly where the model shines and where it struggles. To get a complete picture, we used two very different tests:

The Benchmarks We Picked

1. MMLU-Pro

MMLU-Pro is a collection of over 12,000 extremely tough, multiple-choice questions.

  • The questions are pulled from 14 professional and academic domains, including law, medicine, advanced mathematics, philosophy, and engineering.
  • This test shows if our model can reason like a specialist, not just repeat facts. Can it solve a complex legal problem or understand an advanced engineering concept?
  • The Score: It's a simple Accuracy (%). You want the highest score possible. Higher is better.

2. WikiText

The wikitext task in lm-eval scores the model on the WikiText-2 test set, a smaller sibling of WikiText-103 (a collection of over 100 million words). Both are drawn from verified "Good" and "Featured" Wikipedia articles.

  • This benchmark measures the model's core grasp of the English language. How good is it at predicting the next word in a sentence? This tests its understanding of grammar, context, and common sense. A model that does well here is fluent and "understands" how sentences are built.
  • The Score: It gives a score called Perplexity (PPL). This one is the opposite of MMLU: Lower is better. A low score means the model was "less surprised" by the text—its predictions were very accurate.
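Perplexity is easier to trust once you see how little math is behind it: it is the exponential of the average negative log-likelihood the model assigns to the correct tokens. The sketch below uses made-up probabilities. Note that lm-eval's wikitext task reports a word-level perplexity (normalized per word, not per token), so this toy will not reproduce the numbers in the results tables.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-likelihood (natural log) per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Made-up examples: the probability the model gave to each correct token.
confident = [math.log(0.5)] * 4   # always 50% sure of the right token
unsure = [math.log(0.1)] * 4      # always 10% sure of the right token

print(perplexity(confident))  # about 2: like choosing between 2 options
print(perplexity(unsure))     # about 10: like choosing between 10 options
```

A perplexity of around 11 therefore means the model is, on average, about as uncertain as if it were picking uniformly among 11 candidates at each position.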

How to Run These Benchmarks with lm-eval

We didn't build our own testing setup from scratch—and you shouldn't either. We used the awesome open-source tool lm-evaluation-harness (everyone just calls it lm-eval) from EleutherAI.

At the time of writing, torch, torchao, and vllm have some package incompatibilities. These have already been reported on the official torchao GitHub repo and will likely be resolved soon. Until then, follow the steps below to run inference on the torchao 4-bit model without hitting them.

Step 1: Create Python env and activate it

bash
python -m venv qat-eval
source qat-eval/bin/activate

Step 2: Install torch and torchao from the nightly index

bash
pip install --pre torch torchvision torchaudio torchao --index-url https://download.pytorch.org/whl/nightly/cu128

Step 3: Install vllm nightly wheel

bash
pip install --pre vllm --extra-index-url https://wheels.vllm.ai/nightly

Step 4: Install Unsloth

bash
pip install unsloth

Step 5: Clone the lm-eval repository

bash
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness

Step 6: Install the dependencies

bash
pip install -e .

Step 7: Run the WikiText Test

You're all set. Here's the command to run the WikiText benchmark. You can point it at a model you've got saved locally or any model on the Hugging Face Hub.

bash
lm_eval --model hf \
    --model_args pretrained=jaytonde05/Qwen_4B_PTQ \
    --tasks wikitext \
    --device cuda:0 \
    --batch_size 1

So, what's all that code mean?

  • --model hf: Selects the model backend. Here, hf loads the model through Hugging Face transformers.
  • --model_args pretrained=...: This is where you point to your model.
    • For a local model: pretrained=./my-quantized-model
    • For a Hub model: pretrained=jaytonde05/Qwen_4B_QAT
  • --tasks wikitext: This is the key! We're telling it which test to run. (If you wanted MMLU-Pro, you'd put its task name here, like --tasks mmlu_pro).
  • --device cuda:0: Selects the GPU to use for evaluation.
  • --batch_size 1: Evaluates one sequence at a time. Raise it if you have spare GPU memory.

Benchmark Results

We've completed the experiments, and it's time to see how our models performed. We're comparing three different versions:

  1. Baseline (BF16): Our original model, fine-tuned with LoRA in bfloat16. This is our "full-precision" benchmark.
  2. PTQ (4-bit): The Baseline model, but squished down to 4-bits after training (Post-Training Quantization).
  3. QAT (4-bit): Our new model, fine-tuned from the start using 4-bit Quantization-Aware Training (QAT).

Let's see if the extra effort of QAT paid off, starting with the MMLU-Pro accuracy test.

MMLU-Pro

MMLU-Pro is a tough benchmark that measures a model's reasoning and knowledge. For this test, a higher score is better.


| Model Version | MMLU-Pro Score (Higher is Better) |
| --- | --- |
| Baseline (BF16) | 0.4818 |
| PTQ (4-bit) | 0.4511 |
| QAT (4-bit) | 0.4903 |

Let's break down these results:

  1. The Baseline: Our original BF16 model set the bar with a score of 0.4818.
  2. The Problem (PTQ): As expected, simply squishing the model with PTQ caused a performance drop. The score fell to 0.4511. This is the accuracy loss we need to fix.
  3. The Solution (QAT): The 4-bit QAT model didn't just fix the drop—it scored 0.4903, beating the original baseline by 1.76%!

The Accuracy and Recovery Calculation

This is the key takeaway. Let's look at it in two simple ways:

1. Performance vs. Original (QAT vs. Baseline)

This compares our final 4-bit model to the original full-precision model.

  • Calculation: ((0.4903 - 0.4818) / 0.4818) * 100
  • Result: Our 4-bit QAT model is 1.76% more accurate than the original BF16 model in relative terms (0.85 percentage points).

2. The "Recovery" Story (QAT vs. PTQ)

This tells us how well QAT fixed the specific problem caused by PTQ.

  • Accuracy Lost by PTQ: 0.4818 - 0.4511 = 0.0307 (This was the "problem")
  • Accuracy Gained by QAT: 0.4903 - 0.4511 = 0.0392 (This is what QAT "fixed")
  • Recovery Percentage: (0.0392 / 0.0307) * 100 = ~127.7%

You read that right. QAT recovered more than 100% of the lost accuracy, with roughly 27% extra on top, pushing past the original baseline. This is a huge win.

WikiText Perplexity

Next, we used WikiText to measure perplexity. Perplexity checks how well a model predicts the next word in a sentence. For this test, lower is better, as it means the model is less "perplexed" or confused.


| Model Version | Perplexity (Lower is Better) |
| --- | --- |
| Baseline (BF16) | 10.9391 |
| PTQ (4-bit) | 11.7545 |
| QAT (4-bit) | 11.1904 |

This benchmark tells a slightly different but equally important story:

  • The PTQ model performed the worst, with the highest perplexity (11.7545).
  • The QAT model (11.1904) landed right between the PTQ model and the Baseline.

It didn't beat the baseline here, but it dramatically improved on the PTQ model. Let's do the recovery math again.

The Recovery Calculation

  1. Performance Gap to Fix: First, how much worse did PTQ do than the baseline?
     11.7545 (PTQ) - 10.9391 (Baseline) = 0.8154
     This 0.8154 increase in perplexity is the "gap" we need to close.

  2. Performance Recovered: How much of that gap did QAT close?
     11.7545 (PTQ) - 11.1904 (QAT) = 0.5641
     QAT clawed back 0.5641 points of perplexity.

  3. Percentage Recovered: (0.5641 / 0.8154) * 100 = ~69.2%

The QAT model closed 69.2% of the performance gap, bringing it much closer to the original baseline than the simple PTQ model.
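If you want to rerun this arithmetic on your own numbers, both recovery calculations fit in one small helper. The function names are mine, not from any library. The same formula works whether higher is better (MMLU-Pro) or lower is better (perplexity), because the sign of the gap cancels out.

```python
def relative_change_pct(new, reference):
    """Relative change of `new` versus `reference`, in percent."""
    return (new - reference) / reference * 100


def gap_recovered_pct(baseline, ptq, qat):
    """Share of the baseline-to-PTQ gap that QAT closes, in percent."""
    return (ptq - qat) / (ptq - baseline) * 100


# Scores copied from the results tables in this article.
mmlu_recovery = gap_recovered_pct(baseline=0.4818, ptq=0.4511, qat=0.4903)
wikitext_recovery = gap_recovered_pct(baseline=10.9391, ptq=11.7545, qat=11.1904)
mmlu_gain = relative_change_pct(new=0.4903, reference=0.4818)

print(round(mmlu_recovery, 1))      # 127.7
print(round(wikitext_recovery, 1))  # 69.2
print(round(mmlu_gain, 2))          # 1.76
```

A value above 100 means the QAT model ended up better than the baseline, not just closer to it.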

QAT is a clear winner. On complex reasoning (MMLU-Pro), it didn't just recover the accuracy lost from 4-bit quantization—it actually exceeded the original baseline's performance. On a language modeling task (WikiText), it recovered over two-thirds of the performance hit.

This shows that quantization-aware training is a powerful technique for getting small, fast models without giving up much performance.

And the best part? These torchao-quantized models are fully compatible with high-performance inference engines like vLLM. You can take the QAT model we just trained, load it into vLLM, and get top-tier accuracy at top-tier speeds in production.


Conclusion

For years, using 4-bit quantization always meant making a compromise. We used it for the speed, but we had to settle for a noticeable drop in model accuracy.

Post-Training Quantization (PTQ) is fast, simple, and still has its place. If you have a fully-trained model and just need a "good enough" 4-bit version right now, it's a valid option.

But our experiment shows that Quantization-Aware Training (QAT) is clearly the superior approach.

By making the model "aware" of its 4-bit future during fine-tuning, the Unsloth and torchao stack delivered a model that:

  1. Recovered 69% of the perplexity lost to PTQ on WikiText.
  2. Performed better than the original BF16 baseline on MMLU-Pro.

Yes, QAT takes longer to train due to the added simulation (about twice as long in our run), but you no longer have to trade accuracy for speed at inference time. You're investing more time in training to get a final model that is both fast and smart.

For any serious production use case where accuracy is non-negotiable, QAT should be your new default.


Get the Code

You don't have to take my word for it. All the code, benchmarks, and trained models are publicly available. I encourage you to run the notebook, test it on your own data, and see the results for yourself.

Happy quantizing!


References

  1. Unsloth blog post: Quantization-Aware Training (QAT)
  2. TorchAO official GitHub repo: TorchAO
  3. TorchAO INT4 PTQ recipe: Hugging Face model repo
  4. MMLU-Pro GitHub repo (installation and setup): MMLU-Pro