What are the responsibilities and job description for the AI Diagnostics & Observability Engineer position at Sage Care?

Role Overview

Own and build the full diagnostic, observability, and RCA infrastructure that makes Sage Care’s AI assistant trustworthy and debuggable, both in real time and post-call. This engineer builds the visibility layer across telephony, transcription, reasoning, SOP traversal, and tool-calling; creates dashboards for both engineers and live human supervisors; and implements automated triage and notification pipelines that surface issues to the right module owners immediately.

This role sits at the intersection of LLM orchestration, voice pipelines, transcription, SOP engines, and operations, serving as the connective tissue across the stack. Your work enables rapid root-cause analysis, real-time intervention, and continuous improvement of our clinical AI assistants.

Key Responsibilities

Root Cause Analysis, Tracing & Observability

Build automated RCA pipelines to detect and classify failure modes:

Hallucinations
Misrouted intents
Leaked/invalid tool calls (Transfer, SayMessage, Hangup, NOOP)
Unrecoverable SOP loops
Broken state transitions
Telephony dropouts / DTMF issues

Implement event tracing infrastructure capturing every agentic decision across LLM, telephony, and SOP execution.
Compare expected vs. actual SOP behavior using protocol-driven expectations or human-labeled ground truth.
Automatically compute performance, safety, reliability, and coverage metrics.

Diagnostic Dashboards & Visualization

Build live and post-call dashboards that visualize:

Full call timeline
SOP/state machine traversal
Agent reasoning traces
Tool invocation history
Divergence from expected behavior

Design interactive visualizations: heatmaps, decision-path overlays, branching comparisons, and error hotspots.
Build triage dashboards for engineering and operations teams to rapidly understand system health.

Integration with Core AI Modules

Voice Telephony Integration

Trace call-level events (dropouts, retries, audio playback issues).
Detect DTMF misfires and incorrect action routing.

Transcriber Module Integration

Analyze turn segmentation, word-error-rate drift, boosting performance, and latency.
Visualize errors in context (audio, transcript, aligned timecodes).

LLM Orchestration Integration

Audit intent classification accuracy and subgraph routing.
Trace reasoning sequences, missing tool calls, redundant tool calls, or invalid arguments.
Validate tool call correctness (maps, SMS, search, internal SOP tools).

Live Monitoring & Human-in-the-Loop

Architect a live SOP state-machine tracer with:

Real-time transcript overlays
Current state and next expected state
Deviation alerts

Build dashboards to monitor 10–15 concurrent calls, highlighting sessions with:

Loops
Latency spikes
Failed tool calls
Repeated incorrect decisions

Provide human specialists with escalation alerts and context.

Command & Control Interface

Build an intervention console for on-call specialists, enabling:

“Skip step”
“Say apology”
“Escalate to human”
“Send SMS”
“Repeat last message”
Override of SOP steps while maintaining auditability and continuity.

This system must blend seamlessly into existing agent workflows without breaking call integrity.

Failure Classification, Clustering & Pattern Detection

Build clustering systems (via embeddings or metadata) to group systemic failure modes:

Intent misroutes under noisy audio
Repeated missing tool calls
Looped state machine traversal
Hallucinated follow-ups or invalid summaries

Generate recurring-failure reports to guide engineering improvements.

Auto-Triaging & Notification System (NEW)

Design and implement an automated triage and notification system that:

Detects failure category and severity in real time.
Routes incidents to the correct module owners:

Telephony
Transcription
LLM orchestration
SOP/decision-tree team
Platform reliability

Sends structured payloads containing:

Trace graphs
Relevant logs
Transcript segments
SOP divergence snapshots
Suggested RCA labels

Notifications may integrate with:

PagerDuty
Slack (rich message blocks)
Jira auto-ticket creation
Internal incident pipelines

This ensures rapid operational feedback loops and reduces time-to-resolution.

Post-Call RCA Pipelines & Analytics

Extend pipelines to automatically generate human-readable failure summaries with:

Call-level trace graphs
Tool call sequences
Transcript context
Classified failure types
Suggested root causes

Store snapshots for operational handoff and debugging.

Required Qualifications

Strong backend engineer experienced with diagnostics, observability, and event-driven tracing.
Expert in Python, logging systems, real-time pipelines, and distributed debugging.
Deep familiarity with:

LLM agents
LangGraph or state-machine frameworks
Tool-calling architectures
Telemetry or tracing frameworks

Comfortable designing both:

Backend data pipelines
Frontend dashboards in React, D3, WebSockets, or equivalent.

Preferred Qualifications

Telephony/Voice: SIP, WebRTC, Twilio, audio streaming pipelines.
Clinical operations, call-center workflows, or mission-critical HITL supervision systems.
Observability stacks (Grafana, ELK, OpenTelemetry, Sentry).
Clustering/ML techniques for failure pattern discovery.
