
Cloud Platform / Site Reliability Engineer #11154

Jobs via Dice
Dallas, TX Full Time
POSTED ON 4/14/2026
AVAILABLE BEFORE 5/13/2026
Dice is the leading career destination for tech experts at every stage of their careers. Our client, ECCO Select, is seeking the following. Apply via Dice today!

Cloud Platform SRE Engineer #11145

ECCO Select | Dallas-Fort Worth Metroplex (Hybrid)

This is a W2, contract-to-hire opportunity.

No C2C

The Platform Engineer / Site Reliability Engineer is a hands-on, hybrid role responsible for building, operating, and continuously improving the Client's cloud platform and production reliability posture. You will work at the intersection of infrastructure engineering, DevOps, and SRE: designing and automating AWS-based platforms with Terraform, establishing observability and incident response practices, and embedding reliability into every layer of the stack.

This role also carries a forward-looking mandate: you will evaluate and integrate AI-powered tooling, including large language models and AWS Bedrock, to accelerate operations, improve developer experience, and drive intelligent automation across the platform.

You will collaborate closely with cloud engineers, application teams, and security stakeholders to deliver infrastructure that is secure, observable, cost-effective, and built for scale. As the Client continues its migration to AWS and the modernization of its platform engineering practices, this role is central to establishing the reliability and automation standards that will define the next chapter of the Client's infrastructure.

Essential Duties & Responsibilities:

  • Champion site reliability engineering practices across the organization: define and enforce service level objectives (SLOs) backed by meaningful service level indicators (SLIs) and error budgets; use reliability metrics to balance feature velocity against system stability and drive prioritization of engineering work.
  • Build and maintain observability across production systems (metrics, logs, traces, and dashboards) so that teams have real-time visibility into system health, can detect and diagnose issues quickly, and can make data-driven decisions about capacity, performance, and reliability improvements.
  • Author and maintain operational runbooks; participate in on-call rotations and lead post-incident reviews (blameless retrospectives) that drive measurable improvements to MTTR and MTTD.
  • Drive reliability improvements through capacity planning, chaos and resilience testing patterns, dependency mapping, and progressive rollout strategies.
  • Build and maintain a Terraform-first infrastructure platform: develop reusable modules, enforce state management strategy, implement policy-as-code (OPA/Sentinel), and maintain consistent tagging and resource governance across a multi-account AWS environment (Organizations/Control Tower).
  • Design and operate CI/CD pipelines using GitHub Actions; create golden-path templates and paved-road workflows for build, test, scan, and deploy, improving developer experience and reducing friction for application teams.
  • Architect and manage core AWS services including VPC networking (subnets, routing, security groups/NACLs, ALB/NLB, PrivateLink, NAT, Transit Gateway, Managed Firewall), compute (EC2, ECS/Fargate, EKS, Lambda), storage (S3, EFS), and data services (RDS/Aurora, DynamoDB).
  • Standardize container image build and deployment patterns (blue/green, canary, rolling), autoscaling policies (HPA, target tracking), and serverless deployment models across the organization.
  • Implement least-privilege IAM policies, KMS encryption, WAF rules, and Secrets Manager baselines; integrate Security Hub, GuardDuty, and Inspector findings into automated response and remediation workflows.
  • Enforce guardrails via Service Control Policies (SCPs), tagging standards, and auditability requirements to meet internal policy and regulatory obligations.
  • Evaluate and deploy AI-powered tooling to enhance platform operations and developer experience, including LLM-based assistants, AI coding agents, and retrieval-augmented workflows.
  • Leverage AWS Bedrock (including Agents for Bedrock), Lambda, Step Functions, and retrieval services (Kendra/OpenSearch) to build and operate AI agent architectures that support DevOps automation, incident triage, and knowledge management.
  • Monitor AI workload performance, including accuracy, latency, token cost, and drift; establish guardrails and evaluation frameworks for production AI integrations.
  • Stay current with the rapidly evolving AI/ML landscape and translate emerging capabilities into practical platform improvements.
  • Apply FinOps practices: right-size resources, implement autoscaling, plan Savings Plans/Reserved Instances, and configure budget and anomaly alerts.
  • Maintain accurate cloud asset and configuration data in ServiceNow CMDB; participate in change, incident, and problem management processes.
  • Design resilient hybrid connectivity patterns (Direct Connect/VPN) and DNS architectures (Route 53) supporting private service access, cross-account resolution, and failover.

Education & Experience:

We're targeting a strong hands-on engineer who blends platform/infrastructure depth with SRE discipline and a genuine curiosity for AI-driven operations. Ideal profile:

  • Bachelor's degree preferred; high school diploma or equivalent with relevant experience required.
  • 4 to 7 years of combined cloud infrastructure, platform engineering, DevOps, or SRE experience with a primary AWS focus.
  • Strong Terraform proficiency: modules, state management, workspaces, CI integration, and policy-as-code patterns.
  • Hands-on experience with container orchestration (ECS/Fargate, EKS) and serverless (Lambda, API Gateway) in production environments.
  • Demonstrated experience defining SLIs/SLOs, building observability pipelines (CloudWatch, X-Ray, or equivalent), and participating in on-call and incident response.
  • Proficiency with Git/GitHub, GitHub Actions or equivalent CI/CD platforms, and infrastructure pipeline design.
  • Linux systems administration fundamentals and scripting (Bash and/or Python).
  • Docker image lifecycle management, registry operations (ECR), and container security basics.
  • Practical understanding of networking: VPCs, subnets, routing, load balancers, DNS, hybrid connectivity, and multi-account architectures.
  • Security fundamentals: least-privilege IAM, KMS, WAF, Secrets Manager, tagging discipline.
  • Familiarity with AI/ML concepts; hands-on experience with LLMs, prompt engineering, or AWS Bedrock is strongly preferred.
  • Experience with AI coding assistants, automation agents, or retrieval-augmented generation (RAG) patterns is a plus.
  • FinOps basics: cost allocation, right-sizing, Savings Plans, and budget alerting.
  • Azure familiarity (AKS, Key Vault, Firewall, Sentinel, Monitor) is helpful during the near-term transition but not a core requirement.

Certificates, Licenses, Registrations:

  • AWS certifications (Solutions Architect Associate, SysOps Administrator, or Developer Associate) preferred; AWS DevOps Professional or Security Specialty a strong plus.
  • Kubernetes certifications (CKA/CKAD) or Docker certifications are a plus.
  • AWS AI Practitioner or Machine Learning Specialty certification is a differentiator.
  • Azure certifications are optional and considered a plus during the transition period.

Tools & Technologies You'll Use:

  • AWS: EC2, ECS/Fargate, EKS, ECR, Lambda, API Gateway, RDS/Aurora, DynamoDB, ElastiCache, S3, EFS, CloudFront, Route 53, ALB/NLB, IAM, KMS, WAF, Secrets Manager, CloudWatch/X-Ray, Organizations/Control Tower, Bedrock, Kendra, OpenSearch, Step Functions.
  • IaC/CI: Terraform (primary), CloudFormation (as needed), GitHub/GitHub Actions, OPA/Sentinel policy-as-code.
  • SRE/Observability: CloudWatch (metrics, logs, alarms, dashboards), X-Ray, approved third-party APM/logging tools, PagerDuty or equivalent.
  • Scripting/OS/Containers: Linux, Bash, Python, Docker, Helm.
  • Networking: VPC, Transit Gateway, Direct Connect/VPN, PrivateLink, Route 53, DNS patterns.
  • AI/ML: AWS Bedrock, Agents for Bedrock, LLM APIs, prompt engineering frameworks, RAG architectures.
  • ITSM/FinOps: ServiceNow, Cost Explorer, Savings Plans/RI planning, budget/anomaly alerting.

ECCO Select is committed to hiring and retaining a diverse workforce. Our policy is to provide equal opportunity to all people without regard to race, color, religion, national origin, ancestry, marital status, veteran status, age, disability, pregnancy, genetic information, citizenship status, sex, sexual orientation, gender identity or any other legally protected category. Veterans of our United States Uniformed Services are specifically encouraged to apply for ECCO Select opportunities.

Salary.com Estimation for Cloud Platform / Site Reliability Engineer #11154 in Dallas, TX
$90,068 to $106,780