Serverless Deployment Runbook

A pre-flight guide to ensuring trace delivery in short-lived V8 and Lambda runtimes.

Serverless environments (Vercel Edge, AWS Lambda, Cloudflare Workers) require explicit lifecycle management to guarantee Traces and Spans are delivered to Njira before the compute instance is frozen or destroyed.

The Operational Risk

In standard containerized setups (e.g., Kubernetes), Njira flushes traces asynchronously in the background. In serverless runtimes:

  1. The process may be frozen or terminated immediately after the HTTP response is returned.
  2. Background async tasks (such as sending traces) are killed mid-flight.
  3. Buffered events are silently lost unless explicitly flushed.
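The failure mode can be reproduced locally. A minimal sketch (the `sendTraces` function is a simulated stand-in, not the Njira API):

```typescript
// Hypothetical sketch: an un-awaited async send is still pending when the
// handler returns -- exactly the state in which a serverless runtime
// freezes or kills the instance.
let delivered = false;

const sendTraces = async (): Promise<void> => {
  await new Promise((resolve) => setTimeout(resolve, 50)); // simulated network latency
  delivered = true; // not reached before the handler returns
};

const badHandler = async () => {
  sendTraces(); // fire-and-forget: the promise is NOT awaited
  return { statusCode: 200 };
};

await badHandler();
// In a long-lived container the event loop would eventually finish the send.
// In a frozen Lambda instance, `delivered` stays false and the trace is lost.
```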

The Solution: Explicit Flush

To prevent data loss, you must explicitly instruct the SDK to flush() the event buffer at the end of each request lifecycle.

Pre-Flight Checklist

Before deploying a serverless handler to production, verify:

  • Has an explicit await njira.flush() call or the SDK middleware been applied to the endpoint?
  • Is the flush wrapped in a finally block to guarantee execution during errors?
  • Is timeoutMs (or timeout_ms) configured so a slow network doesn't hang the upstream response?

Implementation Reference (TypeScript)

The SDK middleware handles flushing automatically by hooking into the response lifecycle:

  • Express/Fastify: Calls trace.flush() on the response finish event.
  • Next.js: Calls trace.flush() immediately before returning the final response.
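The finish-hook pattern used by the Express/Fastify middleware can be approximated as follows (a sketch, not the shipped middleware; the `flush` parameter stands in for njira's trace.flush()):

```typescript
import { EventEmitter } from "node:events";

// Sketch of the finish-hook pattern: register the flush once per request,
// after the response has finished being written to the socket.
type Flush = () => Promise<void>;

const njiraMiddleware =
  (flush: Flush) =>
  (_req: unknown, res: EventEmitter, next: () => void): void => {
    // "finish" fires after the response has been handed off, so flushing
    // here does not add latency visible to the client.
    res.once("finish", () => {
      void flush().catch((err) => console.error("njira flush failed", err));
    });
    next();
  };

// Usage with a mock response object standing in for Express's `res`:
let flushed = false;
const res = new EventEmitter();
njiraMiddleware(async () => { flushed = true; })({}, res, () => {});
res.emit("finish");
```

Note that the finish-hook approach assumes a runtime that keeps the event loop alive after the response; on Lambda-style runtimes, prefer the explicit awaited flush shown below.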

Manual Flush (AWS Lambda / Bare Handlers)

If you are not using standard HTTP middleware, you must instrument the handler manually.

import type { APIGatewayEvent } from "aws-lambda";

export const handler = async (event: APIGatewayEvent) => {
  try {
    const result = await processEvent(event);
    return { statusCode: 200, body: JSON.stringify(result) };
  } finally {
    // CRITICAL: await the flush before the cloud provider freezes the instance
    await njira.trace.flush({ timeoutMs: 2000 });
  }
};
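If your SDK version does not accept a timeoutMs option, the same bound can be sketched with Promise.race (flushWithTimeout is a hypothetical helper, not part of the Njira SDK):

```typescript
// Hypothetical helper: bound any flush promise so a slow or blocked network
// path cannot hang the handler's response past `timeoutMs`.
const flushWithTimeout = async (
  flush: Promise<void>,
  timeoutMs: number,
): Promise<"flushed" | "timed-out"> => {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<"timed-out">((resolve) => {
    timer = setTimeout(() => resolve("timed-out"), timeoutMs);
  });
  try {
    return await Promise.race([flush.then(() => "flushed" as const), timeout]);
  } finally {
    clearTimeout(timer); // don't keep the event loop alive after the race
  }
};
```

A "timed-out" result should be logged rather than thrown, so telemetry failures never break the response path.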

Vercel Edge / Cloudflare Workers

Edge functions have stricter execution limits. On Cloudflare Workers, pass the flush promise to ctx.waitUntil so the runtime keeps the instance alive until delivery completes.

export default {
  async fetch(request, env, ctx) {
    const result = await handleRequest(request);
    // Extend the worker lifecycle just long enough to send the telemetry
    ctx.waitUntil(njira.trace.flush());
    return result;
  }
};

Implementation Reference (Python)

The FastAPI middleware flushes automatically after the response is produced, running as a background task before the ASGI request cycle completes.

Manual Flush (AWS Lambda)

import json

def handler(event, context):
    try:
        result = process_event(event)
        return {"statusCode": 200, "body": json.dumps(result)}
    finally:
        # Note: AWS Lambda runtime requires a synchronous flush
        njira.flush_sync(timeout_ms=2000)

Buffer & Timeout Configuration

  Variable                  Description                     Default
  NJIRA_BUFFER_SIZE         Max events before auto-flush    100
  NJIRA_FLUSH_INTERVAL_MS   Auto-flush interval (ms)        5000
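In short-lived runtimes the interval-based auto-flush rarely fires before the instance is frozen, so the explicit flush() at request end remains mandatory regardless of these settings. A sketch of an environment configuration (values are illustrative, not recommendations):

```shell
# Illustrative values only -- tune for your traffic profile.
export NJIRA_BUFFER_SIZE=25          # auto-flush sooner under bursty event volume
export NJIRA_FLUSH_INTERVAL_MS=2000  # shorter interval for longer-running handlers
```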

Triage: Traces are missing in production

If your application works locally but traces disappear when deployed to the cloud, run through this triage flow:

  1. Verify Flush Placement: Ensure flush() is the last SDK operation executed. If you record a trace.event() after flushing, it sits in the buffer and is lost when the instance is frozen or destroyed.
  2. Check for Timeout Silencing: Set timeoutMs to 2000 and check your cloud provider logs (CloudWatch / Vercel Logs). If you see Njira Flush Timeout warnings, the network path from your worker to the Njira endpoint is excessively slow or blocked by a VPC firewall.
  3. Inspect the Finally Block: Ensure the flush resides in a finally block. If your handler throws an exception, it may return a 500 early and bypass a flush invocation placed at the bottom of the function.
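Triage step 1 can be seen with a toy in-memory buffer (hypothetical, not the Njira internals):

```typescript
// Toy buffer illustrating triage step 1: events recorded after flush()
// never leave the instance.
const buffer: string[] = [];
const sent: string[] = [];

const event = (name: string): void => { buffer.push(name); };
const flush = async (): Promise<void> => {
  sent.push(...buffer.splice(0)); // drain everything buffered so far
};

event("request.start");
await flush();
event("request.end"); // WRONG ORDER: recorded after the flush

// `request.end` is stranded in the buffer and dies with the instance.
```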