
Choosing embedding models for on-premise RAG deployments

January 15, 2025 · 8 min read · RAG Pioneers Team

For organizations deploying RAG systems on-premise, the choice of embedding model is one of the most consequential technical decisions. Unlike cloud-based deployments where API calls to services like OpenAI or Cohere are straightforward, on-premise requirements demand models that can run locally while still delivering competitive retrieval quality.

Why embedding model selection matters

The embedding model transforms your documents and queries into vector representations that enable semantic search. A poor choice here cascades through your entire RAG pipeline: if the retriever cannot find relevant documents, even the most capable language model cannot generate accurate responses.

For on-premise deployments, you must balance three primary concerns: retrieval quality, computational requirements, and licensing terms. Cloud API users can largely ignore the latter two, but they become critical constraints when running models locally.

Categories of embedding models

Embedding models suitable for on-premise deployment generally fall into three categories based on their architecture and resource requirements.

Lightweight models (under 100M parameters)

Models like all-MiniLM-L6-v2 and paraphrase-MiniLM-L6-v2 from Sentence Transformers offer excellent efficiency. They can run on CPU-only infrastructure with latencies suitable for real-time applications.

Characteristics

  • 384-dimensional embeddings (compact vector storage)
  • Sub-10ms inference on modern CPUs
  • Apache 2.0 license (commercial use permitted)
  • Good general-purpose retrieval, though they may struggle with domain-specific terminology

Mid-size models (100M-500M parameters)

This category includes models like bge-base-en-v1.5, e5-base-v2, and gte-base. They represent the sweet spot for many enterprise deployments, offering significantly better retrieval quality while remaining practical for on-premise infrastructure.

Characteristics

  • 768-1024 dimensional embeddings
  • 10-50ms inference on CPU, under 5ms on GPU
  • Generally permissive licenses (MIT, Apache 2.0)
  • Strong performance on standard benchmarks (MTEB, BEIR)
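Whatever the model size, retrieval itself reduces to nearest-neighbor search over the embedding vectors. A minimal sketch with NumPy, using toy 4-dimensional vectors standing in for real model output (production embeddings in this class would be 768-1024 dimensional):

```python
import numpy as np

# Toy vectors standing in for real model output.
doc_vecs = np.array([
    [0.9, 0.1, 0.0, 0.1],   # doc 0: about authentication
    [0.0, 0.8, 0.5, 0.1],   # doc 1: about billing
    [0.1, 0.1, 0.9, 0.2],   # doc 2: about deployment
])
query_vec = np.array([0.8, 0.2, 0.1, 0.0])

# Normalize so a dot product equals cosine similarity.
doc_norm = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
q_norm = query_vec / np.linalg.norm(query_vec)

scores = doc_norm @ q_norm
ranked = np.argsort(-scores)   # document indices, best match first
print(ranked[0])               # -> 0 (the authentication document)
```

At enterprise scale this brute-force scan is replaced by an approximate nearest-neighbor index, but the scoring logic is the same.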

Large models (500M+ parameters)

Models like bge-large-en-v1.5, e5-large-v2, and the newer gte-large offer state-of-the-art retrieval quality but require GPU infrastructure for practical inference speeds.

Characteristics

  • 1024+ dimensional embeddings
  • Requires GPU for production workloads
  • Top-tier benchmark performance
  • Better handling of nuanced queries and domain terminology
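Embedding dimensionality also drives vector-store footprint, which matters for on-premise capacity planning. A back-of-envelope sketch, assuming float32 vectors and a hypothetical corpus of 5 million chunks:

```python
# Raw vector storage per embedding dimension (float32, no index overhead).
N_CHUNKS = 5_000_000        # hypothetical corpus size
BYTES_PER_FLOAT32 = 4

for dim in (384, 768, 1024):
    gb = N_CHUNKS * dim * BYTES_PER_FLOAT32 / 1e9
    print(f"{dim}-dim: {gb:.1f} GB")   # 7.7 / 15.4 / 20.5 GB
```

Real deployments add index overhead on top of this, but the ratio holds: moving from 384 to 1024 dimensions roughly triples vector storage.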

Licensing considerations

License terms are often overlooked until procurement or legal review. For enterprise on-premise deployments, verify the following before committing to a model:

  • Commercial use rights: Some models trained on proprietary data restrict commercial applications
  • Derivative work permissions: Fine-tuning may require specific license terms
  • Attribution requirements: Some licenses require visible attribution in products
  • Export restrictions: Models from certain organizations may have geographic restrictions

Most Sentence Transformers models use Apache 2.0 or MIT licenses, making them safe choices for enterprise deployment. The BGE and E5 model families also offer permissive terms, though you should verify the specific version you plan to use.

Infrastructure requirements

Your infrastructure constraints often narrow the field of viable models more than benchmark scores do. Consider these factors:

CPU-only deployments

If GPU infrastructure is not available, focus on models optimized for CPU inference. ONNX Runtime can significantly accelerate inference for many models. Quantized versions (INT8) offer 2-4x speedups with minimal quality degradation.

Recommended models for CPU: all-MiniLM-L6-v2, bge-small-en-v1.5, or e5-small-v2.

GPU-accelerated deployments

With GPU resources available, larger models become practical. A single modern GPU (e.g., NVIDIA A10, L4, or RTX 4090) can handle embedding generation for most enterprise workloads with sub-millisecond latencies.

Recommended models for GPU: bge-large-en-v1.5, e5-large-v2, or gte-large.

Evaluation approach

Public benchmarks (MTEB, BEIR) provide useful starting points, but they may not reflect performance on your specific documents and query patterns. We recommend a structured evaluation process:

  1. Create a test dataset from your actual documents and representative queries. Include edge cases and domain-specific terminology.
  2. Define relevance judgments for your test queries. Which documents should be retrieved? What constitutes a partial match?
  3. Evaluate candidate models on your test set using metrics like MRR@10, NDCG@10, and Recall@k.
  4. Measure latency and throughput on your target infrastructure under realistic load conditions.
  5. Assess storage requirements for your document corpus at each embedding dimension.
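The retrieval metrics in step 3 are simple to compute once you have relevance judgments. A minimal sketch over hypothetical data, where `ranked[i]` is the ordered list of document ids a model retrieved for query i and `relevant[i]` is the set of ids judged relevant:

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant document within top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(ranked)

def recall_at_k(ranked, relevant, k=10):
    """Mean fraction of relevant documents recovered in the top k."""
    total = 0.0
    for docs, rel in zip(ranked, relevant):
        total += len(set(docs[:k]) & rel) / len(rel)
    return total / len(ranked)

# Hypothetical judgments for two test queries.
ranked = [[3, 7, 1], [5, 2, 9]]
relevant = [{7}, {5, 9}]
print(mrr_at_k(ranked, relevant))     # (1/2 + 1/1) / 2 = 0.75
print(recall_at_k(ranked, relevant))  # (1/1 + 2/2) / 2 = 1.0
```

Running every candidate model through the same harness keeps the comparison honest: the judgments stay fixed while only the embeddings change.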

Our recommendations

Based on our experience deploying RAG systems across regulated industries, here are our general recommendations:

  • For most enterprise deployments: bge-base-en-v1.5 or e5-base-v2 offer the best balance of quality and efficiency
  • For CPU-constrained environments: bge-small-en-v1.5 with ONNX optimization
  • For maximum retrieval quality: bge-large-en-v1.5 or gte-large with GPU acceleration
  • For multilingual requirements: bge-m3 or multilingual-e5-large

Conclusion

Embedding model selection for on-premise RAG deployments requires balancing retrieval quality against infrastructure constraints and licensing terms. While public benchmarks provide useful guidance, evaluation on your specific data and queries is essential.

The good news is that open-source embedding models have reached a level of quality that makes on-premise deployment genuinely viable. With careful model selection and infrastructure planning, you can achieve retrieval performance competitive with cloud API services while maintaining complete control over your data.

Need help selecting an embedding model?

Our workshop includes a comprehensive evaluation of embedding models for your specific use case, data characteristics, and infrastructure constraints.

Schedule a RAG Workshop