SRE Architect

Incedo Inc
Austin, TX Full Time
POSTED ON 4/29/2026
AVAILABLE BEFORE 5/29/2026

Site Reliability Architect (SRE): Unified Observability & AIOps

Role Summary

We are seeking a Senior SRE with strong expertise in Unified Observability, proactive detection, AIOps, and GenAI-driven operations to support complex, distributed financial services platforms. The role requires hands-on experience designing SLI/SLO-driven monitoring, dynamic thresholds, intelligent alerting, and AI/ML-based anomaly detection across multi-stream architectures.

Key Responsibilities

Observability & Reliability Engineering

  • Design and implement unified observability dashboards across metrics, logs, traces, events, and topology
  • Define and manage SLIs, SLOs, and error budgets aligned to business outcomes
  • Build actionable dashboards for operations, engineering, and leadership
  • Implement alerting strategies using static and dynamic thresholds
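
For candidates unfamiliar with the error-budget terminology above, the arithmetic can be sketched as follows. This is a minimal illustration, not part of the role's tooling: the function name, the request-count framing, and the example numbers are all assumptions chosen for clarity.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the SLO as a fraction (e.g. 0.999 for "99.9% of requests succeed").
    A negative result means the budget for the window is overspent.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The error budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests against a 99.9% SLO tolerates about
# 1,000 failures, so 250 failures leaves roughly three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.2%} of the error budget remains")
```

In practice the SLI behind such a calculation would be defined per service and tied to a business outcome, as the responsibilities above describe.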

Proactive Detection & AIOps

  • Leverage AI/ML/AIOps to detect anomalies, predict incidents, and reduce MTTR
  • Transition monitoring from reactive alerts to proactive insights
  • Implement noise reduction, alert correlation, and root cause analysis
  • Apply baseline modeling, seasonality detection, and anomaly scoring

Distributed Systems & Dependency Analysis

  • Monitor and troubleshoot multi-service architectures involving:
    • Microservices
    • Downstream APIs
    • Kafka / streaming platforms
    • Cloud infrastructure (Terraform, IaC)
  • Identify whether issues originate from:
    • Upstream/downstream dependencies
    • Streaming platform
    • Infrastructure
    • Application code

Tooling & Platforms

  • Deep hands-on experience with Dynatrace (mandatory)
  • Experience with:
    • OpenTelemetry
    • Prometheus / Grafana
    • ELK / EFK
    • Cloud-native monitoring (AWS/Azure/Google Cloud Platform)
  • Strong JSON-based telemetry manipulation and enrichment

GenAI & LLM Enablement

  • Apply GenAI / LLMs for:
    • Incident summarization
    • Root cause explanation
    • Runbook recommendations
    • Auto-remediation suggestions
  • Collaborate with platform teams to operationalize GenAI safely

Required Skills & Experience

  • 15 years in SRE / Production Engineering
  • Strong Unified Observability background (not infra-only)
  • Hands-on Dynatrace experience (metrics, traces, logs, Davis AI)
  • SLI/SLO engineering experience in production systems
  • Experience implementing dynamic thresholds and anomaly detection
  • Knowledge of AI/ML concepts applied to Ops (AIOps)
  • Distributed systems troubleshooting expertise
  • Experience with Kafka or streaming data platforms

Differentiators (Highly Valued)

  • Experience in financial services or regulated environments
  • Proven reduction of alert noise and MTTR using AIOps
  • GenAI / LLM integration into operations workflows

Interview Question Bank (Mapped to LPL Expectations)

  1. Dashboards, SLAs, and Reliability Targets

Purpose: Identify true SREs vs dashboard builders

  • How do you design dashboards differently for engineers vs leadership?
  • Explain how SLIs and SLOs differ from SLAs. Which do you operationalize?
  • How do you map SLOs to alerting without creating noise?
  • What KPIs would you track for a critical trading or advisor-facing platform?

Red Flag: Talks only about CPU, memory, uptime

  2. Alerting Strategy & Threshold Design

Purpose: Assess signal-to-noise maturity

  • How do you decide when to use static vs dynamic thresholds?
  • Explain how you prevent alert storms during high traffic or seasonal spikes.
  • What makes an alert actionable?
  • How do you design alerts for early symptom detection?

Follow-up

  • What happens after an alert fires? Walk me through the lifecycle.

  3. Dynamic Thresholds & Anomaly Detection

Purpose: Validate AIOps fundamentals

  • How do dynamic thresholds work under the hood?
  • How do you account for baseline drift and seasonality?
  • What risks do dynamic thresholds introduce?
  • How would you tune sensitivity to avoid false positives?

Expected Concepts

  • Baselines
  • ML models
  • Adaptive learning
  • Time-series analysis
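
To make the baseline concept concrete, here is a minimal sketch of one simple way a dynamic threshold can work: a rolling mean and standard deviation with a z-score cutoff. It is an illustrative assumption, not a description of how Dynatrace or any other product computes baselines; the class name, window size, warm-up length, and sensitivity value are all hypothetical.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Flag values that deviate from a rolling baseline by more than N standard deviations."""

    def __init__(self, window: int = 60, sensitivity: float = 3.0, warmup: int = 10):
        self.values = deque(maxlen=window)  # the most recent observations form the baseline
        self.sensitivity = sensitivity      # higher means fewer alerts (and more missed anomalies)
        self.warmup = warmup                # never alert before this many samples are seen

    def is_anomaly(self, value: float) -> bool:
        anomalous = False
        if len(self.values) >= self.warmup:
            mu = mean(self.values)
            sigma = max(pstdev(self.values), 1e-9)  # guard against a perfectly flat baseline
            anomalous = abs(value - mu) > self.sensitivity * sigma
        # Every value is appended, so the baseline adapts to drift. The trade-off is
        # that a sustained anomaly gradually becomes the new "normal".
        self.values.append(value)
        return anomalous


baseline = RollingBaseline()
latencies_ms = [100, 102, 98, 101, 99] * 4 + [150]
flags = [baseline.is_anomaly(v) for v in latencies_ms]
print(flags[-1])  # the final 150 ms sample is flagged
```

This sketch ignores seasonality entirely; a strong answer to the questions above would explain how a baseline is conditioned on time of day or day of week, and what happens when the baseline itself is contaminated by an incident.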

  4. Multiplexing (Metrics, Signals, Streams)

Purpose: Test system observability depth

  • What is multiplexing in observability?
  • How do multiple telemetry signals strengthen diagnosis?
  • Provide an example where one signal was misleading.
  • How do you correlate metrics, traces, logs, and events?

  5. JSON Tooling & Proactive Detection

Purpose: Ensure hands-on operational telemetry skills

  • How have you used JSON-based event payloads to enrich observability?
  • How do you normalize data across heterogeneous sources?
  • How do structured logs improve proactive detection?
  • How do you extract signals from high-volume telemetry?
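
As a reference point for the normalization and enrichment questions above, the following is a minimal sketch of mapping heterogeneous JSON event payloads onto one schema and adding ownership context. The field aliases, the `owner_team` attribute, and the sample payload are hypothetical and not taken from any specific platform.

```python
import json

# Hypothetical mapping from a canonical field name to the names different sources use.
FIELD_ALIASES = {
    "service": ("service", "svc", "app"),
    "severity": ("severity", "level", "lvl"),
    "message": ("message", "msg"),
}


def normalize_event(raw: str, source: str, owners: dict) -> dict:
    """Parse a JSON event, map it to canonical fields, and enrich it with an owning team."""
    payload = json.loads(raw)
    event = {"source": source}
    for canonical, aliases in FIELD_ALIASES.items():
        # Take the first alias present in the payload; None marks a missing field.
        event[canonical] = next((payload[a] for a in aliases if a in payload), None)
    if isinstance(event["severity"], str):
        event["severity"] = event["severity"].upper()  # one casing across all sources
    # Enrichment: attach context that the emitting system does not know about.
    event["owner_team"] = owners.get(event["service"], "unknown")
    return event


event = normalize_event(
    '{"svc": "orders", "lvl": "warn", "msg": "consumer lag rising"}',
    source="kafka",
    owners={"orders": "payments-sre"},
)
print(json.dumps(event, indent=2))
```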

  6. Proactive vs Reactive Detection

Purpose: Directly aligned to LPL concern

  • Give an example where you predicted an incident before customer impact.
  • What indicators help you identify impending failures?
  • How do you measure the success of proactive detection?

  7. Multi-Service Failure Diagnosis (Critical Question)

Purpose: Core differentiator at LPL

Scenario Question

A user-facing issue is reported. The architecture includes:

  • Frontend
  • Backend microservices
  • Downstream APIs
  • Kafka streams
  • Terraform-managed infrastructure

Ask:

  • How do you determine if the issue is:
    • Application-related?
    • Kafka or streaming lag?
    • Downstream API latency?
    • Infrastructure drift via Terraform?

Expected Approach

  • Dependency mapping
  • Golden signals
  • Trace correlation
  • Change analysis

  8. Dynatrace (Mandatory)

Purpose: Address explicit gap in feedback

  • What Dynatrace features have you used most?
  • How does Davis AI determine root cause?
  • How do you implement service-level baselining in Dynatrace?
  • How do you reduce alert noise using Dynatrace?

Red Flag: "I've mostly used dashboards"

  9. AI/ML & AIOps Fundamentals

Purpose: Ensure non-theoretical knowledge

  • What ML techniques are commonly used in AIOps?
  • How do supervised vs unsupervised models differ in Ops?
  • Where does AI fail in observability?
  • How do you validate AI-based decisions?

  10. GenAI & LLM Use Cases for SRE

Purpose: Explicit LPL requirement

  • Where do you see GenAI adding value in SRE?
  • Have you used LLMs for incident response?
  • How would you integrate GenAI without introducing risk?
  • What data would you restrict from LLM exposure?

Salary: $140,000 - $160,000
