Performance Tuning & Cost Management
A runbook for optimizing latency and cost using NjiraAI's tiered model cascade.
NjiraAI evaluates policies using an intelligent model cascade. This guide explains how operators can tune performance by configuring the appropriate intelligence tier for different agent workloads.
Intelligence Tiers
NjiraAI offers three tiers of model capability. Njira regularly updates the specific LLMs backing each tier to the latest available versions, so the models named in the table are examples and may change.
| Tier | Characteristics | Recommended Usage | Underlying Models (Example) |
|---|---|---|---|
| FAST | High speed, low latency, lowest cost. | Use for real-time chat filtering, simple PII detection, or extremely high-volume, low-risk tooling. | gpt-5-mini, claude-haiku-4.5 |
| STANDARD (Default) | Balanced performance and intelligence. | The default. Use for most complex policy logic and standard context-aware safety evaluation. | gpt-5.2, claude-sonnet-4.5 |
| STRONG | Maximum reasoning capability. Highest latency and cost. | Use exclusively for nuanced threat detection (e.g., sophisticated prompt injection) or secure financial auditing tools. | gpt-5.2-pro, claude-opus-4.5 |
Operational Setup: Selecting a Tier
You can specify the desired tier per request or per tool to optimize your cost/latency profile.
Via Headers (Proxy Gateway)
When routing agents through the Njira Gateway proxy, pass the X-Njira-Tier header. This is the simplest way to tune performance without changing SDK code:
curl https://gateway.njira.ai/v1/chat/completions \
  -H "Authorization: Bearer <your-key>" \
  -H "X-Njira-Tier: fast" \
  ...
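For agents that reach the gateway from Python rather than curl, the same header can be attached programmatically. The sketch below only builds the header dictionary and sends nothing. The helper name `tier_headers` and the validation set are illustrative assumptions, not part of any Njira SDK. The lowercase `fast` value mirrors the curl example; `standard` and `strong` are assumed to follow the same spelling.

```python
# Hypothetical helper (not part of any Njira SDK): builds gateway headers
# with a validated tier value.
VALID_TIERS = {"fast", "standard", "strong"}  # assumed lowercase spellings


def tier_headers(api_key: str, tier: str = "standard") -> dict:
    """Return HTTP headers for a gateway request at the given tier."""
    normalized = tier.strip().lower()
    if normalized not in VALID_TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    return {
        "Authorization": f"Bearer {api_key}",
        "X-Njira-Tier": normalized,
    }
```

The returned dictionary can be passed to whichever HTTP client the agent already uses. Rejecting unknown values locally avoids silently sending a misspelled tier.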
Via SDK Metadata
If you are using deep SDK integrations for pre/post boundaries, include the tier in the metadata payload:
response = njira.audit(
    content="...",
    metadata={"tier": "strong"}
)
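One way to apply tiers per tool is to keep a small lookup table and derive the metadata payload from it. This is a minimal sketch: the tool names, `TOOL_TIERS`, and `audit_metadata` are hypothetical, and only the `{"tier": ...}` metadata key comes from the example above. Tools missing from the table fall back to `standard`, matching the documented default.

```python
from typing import Optional

# Hypothetical per-tool tier table; the tool names are illustrative only.
TOOL_TIERS = {
    "chat_filter": "fast",
    "wire_transfer_audit": "strong",
}
DEFAULT_TIER = "standard"  # documented default when no tier is specified


def audit_metadata(tool_name: str, extra: Optional[dict] = None) -> dict:
    """Build the metadata payload for an audit call made on behalf of a tool."""
    metadata = dict(extra or {})  # copy so the caller's dict is not mutated
    metadata["tier"] = TOOL_TIERS.get(tool_name, DEFAULT_TIER)
    return metadata


# Usage with the SDK call shown above (not executed here):
# response = njira.audit(content="...", metadata=audit_metadata("chat_filter"))
```

Keeping the mapping in one place makes a down-tiering change a one-line edit that is easy to review and revert.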
Triage: When to Tune Tiers
If your agent workflows are experiencing unacceptable latencies:
- Check the p95 evaluation latency in the Usage dashboard.
- If the latency is heavily driven by the STANDARD policy evaluation, attempt down-tiering the specific tool to FAST.
- Verify Safety: Monitor the newly configured FAST tool in Shadow mode first to ensure the smaller model is not generating false negatives (missing actual threats).
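The triage steps above can be sketched as two small calculations over exported data. Everything here is an assumption for illustration: the function names, the latency budget, and the idea that latency samples and paired Shadow-mode verdicts are available as plain lists. The sketch also treats the STANDARD verdict as the reference when counting misses, which is a simplification.

```python
def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def should_try_fast(latencies_ms, budget_ms):
    """True when p95 evaluation latency exceeds the (assumed) budget."""
    return p95(latencies_ms) > budget_ms


def false_negative_rate(standard_flags, fast_flags):
    """Share of events flagged by STANDARD that FAST did not flag.

    Both arguments are equal-length lists of booleans for the same events,
    e.g. collected while the FAST tier runs in Shadow mode.
    """
    flagged = [f for s, f in zip(standard_flags, fast_flags) if s]
    if not flagged:
        return 0.0
    missed = sum(1 for f in flagged if not f)
    return missed / len(flagged)
```

A team might promote the FAST configuration out of Shadow mode only once the false negative rate stays below a threshold it has agreed on for that tool.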
If no tier is explicitly specified in the request context, the system defaults to STANDARD to ensure a safe, robust baseline for all unconfigured traffic.