GGUF Models
Download and optimize GGUF embedding models
This section is only relevant if you’re using local GGUF models for embedding generation. If you’re using Ollama or an external embedding API (like OpenAI), you don’t need this section.
Recommended Models
nomic-embed-text-v1.5 (Recommended)
Best for: General-purpose embeddings with excellent quality
- Dimensions: 768
- Size: ~84MB (Q4_K_M quantization; the F16 file is ~275MB)
- Download: Hugging Face
wget https://huggingface.co/nomic-ai/nomic-embed-text-v1.5-GGUF/resolve/main/nomic-embed-text-v1.5.Q4_K_M.gguf
Qwen3-Embedding-0.6B-Q8_0
Best for: High-quality embeddings when size is not an issue
- Dimensions: 1024
- Size: ~640MB (Q8_0 quantization; the F16 file is ~1.2GB)
- Download: Hugging Face
wget https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF/resolve/main/Qwen3-Embedding-0.6B-Q8_0.gguf
Quantization Levels
GGUF models come in different quantization levels:
| Quantization | Size | Quality | Speed | Recommended For |
|---|---|---|---|---|
| Q4_K_M | Small | Good | Fast | ⭐ General use |
| Q5_K_M | Medium | Better | Medium | High quality |
| Q8_0 | Large | Near-lossless | Slow | Maximum quality |
| F16 | Very large | Reference (16-bit float) | Very slow | Benchmarking |
Recommendation: Use Q4_K_M for the best balance of size, speed, and quality.
GPU Acceleration
Determining GPU Layers
The --gguf-gpu-layers parameter controls how many layers are offloaded to GPU:
# CPU only
--gguf-gpu-layers 0
# Partial GPU (recommended for testing)
--gguf-gpu-layers 16
# Full GPU (best performance)
--gguf-gpu-layers 32
Finding the right value:
- Start with --gguf-gpu-layers 32
- If you get out-of-memory (OOM) errors, reduce the value in steps of 8
- Monitor GPU memory usage with nvidia-smi (NVIDIA) or rocm-smi (AMD)
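The step-down procedure can be sketched as a small shell helper. This is only an illustration: gpu_layer_candidates is a hypothetical name, not part of the project, and detecting an out-of-memory failure depends on how your setup reports it, so the sketch only generates the values to try in order.

```shell
# Hypothetical helper (not part of the project): print the GPU layer
# counts to try, starting at START and stepping down by STEP.
gpu_layer_candidates() {
  start=${1:-32}
  step=${2:-8}
  # Guard against a zero or negative step, which would never terminate.
  if [ "$step" -le 0 ]; then
    step=8
  fi
  n=$start
  out=""
  while [ "$n" -gt 0 ]; do
    out="$out$n "
    n=$((n - step))
  done
  # Always finish with 0 so CPU-only is the last fallback.
  printf '%s0\n' "$out"
}

gpu_layer_candidates 32 8
```

Try each value in turn with --gguf-gpu-layers and keep the first one that loads without an OOM error.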
Platform-Specific Tips
NVIDIA (CUDA)
# Check CUDA availability
nvidia-smi
# Run with full GPU
./run-remembrances.sh \
--gguf-model-path ./model.gguf \
--gguf-gpu-layers 32 \
--gguf-threads 8
AMD (ROCm)
# Check ROCm availability
rocm-smi
# Run with full GPU
./run-remembrances.sh \
--gguf-model-path ./model.gguf \
--gguf-gpu-layers 32 \
--gguf-threads 8
Apple Silicon (Metal)
# Metal is detected automatically
./run-remembrances.sh \
--gguf-model-path ./model.gguf \
--gguf-gpu-layers 32 \
--gguf-threads 8
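If the same launch script has to run on several machines, the platform checks above can be combined into one helper. This is a rough sketch: detect_gpu_backend is a hypothetical name, and finding nvidia-smi or rocm-smi on the PATH only suggests that a driver is installed. It does not prove that your build was compiled with that backend.

```shell
# Hypothetical helper: guess which GPU backend is available on this machine.
# Prints one of: cuda, rocm, metal, cpu.
detect_gpu_backend() {
  if command -v nvidia-smi >/dev/null 2>&1; then
    echo cuda
  elif command -v rocm-smi >/dev/null 2>&1; then
    echo rocm
  elif [ "$(uname -s)" = "Darwin" ] && [ "$(uname -m)" = "arm64" ]; then
    # Apple Silicon: Metal is detected automatically at runtime.
    echo metal
  else
    echo cpu
  fi
}

backend=$(detect_gpu_backend)
# Fall back to CPU-only when no GPU tooling is found.
if [ "$backend" = "cpu" ]; then
  layers=0
else
  layers=32
fi
echo "backend=$backend gguf-gpu-layers=$layers"
```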
Performance Optimization
Thread Count
# Auto-detect (recommended)
--gguf-threads 0
# Manual (use your CPU core count)
--gguf-threads 8
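If you prefer to set the thread count manually, one way to look up the core count is sketched below. cpu_count is a hypothetical helper name. Note that nproc and hw.ncpu report logical processors, which can be higher than the number of physical cores on machines with SMT or hyper-threading.

```shell
# Hypothetical helper: print the number of logical CPUs.
cpu_count() {
  if command -v nproc >/dev/null 2>&1; then
    nproc                      # GNU coreutils (Linux)
  elif [ "$(uname -s)" = "Darwin" ]; then
    sysctl -n hw.ncpu          # macOS
  else
    getconf _NPROCESSORS_ONLN  # fallback on other Unix-like systems
  fi
}

echo "--gguf-threads $(cpu_count)"
```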
Memory Management
- Small models (< 100MB): Can run entirely in GPU memory
- Medium models (100-500MB): May need partial GPU offloading
- Large models (> 500MB): Consider a lower-bit quantization (for example, Q4_K_M instead of Q8_0)
Model Selection Guide
Choose based on your needs:
| Use Case | Model | Quantization | GPU Layers |
|---|---|---|---|
| Production | nomic-embed-text-v1.5 | Q4_K_M | 32 |
| Development | all-MiniLM-L6-v2 | Q4_K_M | 16 |
| High Quality | nomic-embed-text-v1.5 | Q8_0 | 32 |
| Low Memory | all-MiniLM-L6-v2 | Q4_K_M | 0 |
Troubleshooting
Out of Memory
Reduce GPU layers:
--gguf-gpu-layers 16 # or lower
Slow Performance
Increase GPU layers and threads:
--gguf-gpu-layers 32
--gguf-threads 8
Model Won’t Load
Check file path and permissions:
ls -lh ./model.gguf
chmod +r ./model.gguf
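A file can be present and readable but still not be a valid model, for example when a download was interrupted or an HTML error page was saved in its place. GGUF files begin with the four ASCII bytes GGUF, so a quick sanity check is possible. This is a minimal sketch: is_gguf is a hypothetical helper name, and the check only inspects the header. It does not verify that the rest of the file is intact.

```shell
# Hypothetical helper: succeed if FILE is readable and starts with the
# 4-byte GGUF magic.
is_gguf() {
  [ -r "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

if is_gguf ./model.gguf; then
  echo "looks like a GGUF file"
else
  echo "missing, unreadable, or not a GGUF file"
fi
```

If the check fails on a freshly downloaded file, download it again and compare its size with the one listed on the model page.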
See Also
- Configuration - Server configuration options
- Getting Started - Installation guide