What are the responsibilities and job description for the Cloud Platform / Site Reliability Engineer #11154 position at Jobs via Dice?

Dice is the leading career destination for tech experts at every stage of their careers. Our client, ECCO Select, is seeking the following. Apply via Dice today!

Cloud Platform SRE Engineer #11145

ECCO Select Dallas-Fort Worth Metroplex (Hybrid)

This is a W2, contract for hire opportunity

No C2C

The Platform Engineer / Site Reliability Engineer is a hands-on, hybrid role responsible for building, operating, and continuously improving Client's cloud platform and production reliability posture. You will work at the intersection of infrastructure engineering, DevOps, and SRE: designing and automating AWS-based platforms with Terraform, establishing observability and incident response practices, and embedding reliability into every layer of the stack.

This role also carries a forward-looking mandate: you will evaluate and integrate AI-powered tooling, including large language models and AWS Bedrock, to accelerate operations, improve developer experience, and drive intelligent automation across the platform.

You will collaborate closely with cloud engineers, application teams, and security stakeholders to deliver infrastructure that is secure, observable, cost-effective, and built for scale. As the Client continues its migration to AWS and modernization of its platform engineering practices, this role is central to establishing the reliability and automation standards that will define the next chapter of our infrastructure.

Essential Duties & Responsibilities:

Champion site reliability engineering practices across the organization: define and enforce service level objectives (SLOs) backed by meaningful service level indicators (SLIs) and error budgets; use reliability metrics to balance feature velocity against system stability and drive prioritization of engineering work.
Build and maintain observability across production systems, encompassing metrics, logs, traces, and dashboards, to ensure teams have real-time visibility into system health, can detect and diagnose issues quickly, and can make data-driven decisions about capacity, performance, and reliability improvements.
Author and maintain operational runbooks; participate in on-call rotations and lead post-incident reviews (blameless retrospectives) that drive measurable improvements to MTTR and MTTD.
Drive reliability improvements through capacity planning, chaos and resilience testing patterns, dependency mapping, and progressive rollout strategies.
Build and maintain a Terraform-first infrastructure platform: develop reusable modules, enforce state management strategy, implement policy-as-code (OPA/Sentinel), and maintain consistent tagging and resource governance across a multi-account AWS environment (Organizations/Control Tower).
Design and operate CI/CD pipelines using GitHub Actions; create golden-path templates and paved-road workflows for build, test, scan, and deploy, improving developer experience and reducing friction for application teams.
Architect and manage core AWS services including VPC networking (subnets, routing, security groups/NACLs, ALB/NLB, PrivateLink, NAT, Transit Gateway, Managed Firewall), compute (EC2, ECS/Fargate, EKS, Lambda), storage (S3, EFS), and data services (RDS/Aurora, DynamoDB).
Standardize container image build and deployment patterns (blue/green, canary, rolling), autoscaling policies (HPA, target tracking), and serverless deployment models across the organization.
Implement least-privilege IAM policies, KMS encryption, WAF rules, and Secrets Manager baselines; integrate Security Hub, GuardDuty, and Inspector findings into automated response and remediation workflows.
Enforce guardrails via Service Control Policies (SCPs), tagging standards, and auditability requirements to meet internal policy and regulatory obligations.
Evaluate and deploy AI-powered tooling to enhance platform operations and developer experience, including LLM-based assistants, AI coding agents, and retrieval-augmented workflows.
Leverage AWS Bedrock (including Agents for Bedrock), Lambda, Step Functions, and retrieval services (Kendra/OpenSearch) to build and operate AI agent architectures that support DevOps automation, incident triage, and knowledge management.
Monitor AI workload performance, including accuracy, latency, token cost, and drift; establish guardrails and evaluation frameworks for production AI integrations.
Stay current with the rapidly evolving AI/ML landscape and translate emerging capabilities into practical platform improvements.
Apply FinOps practices: right-size resources, implement autoscaling, plan Savings Plans/Reserved Instances, and configure budget and anomaly alerts.
Maintain accurate cloud asset and configuration data in ServiceNow CMDB; participate in change, incident, and problem management processes.
Design resilient hybrid connectivity patterns (Direct Connect/VPN) and DNS architectures (Route 53) supporting private service access, cross-account resolution, and failover.

Education & Experience:

We're targeting a strong hands-on engineer who blends platform/infrastructure depth with SRE discipline and a genuine curiosity for AI-driven operations. Ideal profile:

Bachelor's degree preferred; High School Diploma or Equivalent with relevant experience required.
4-7 years of combined cloud infrastructure, platform engineering, DevOps, or SRE experience with a primary AWS focus.
Strong Terraform proficiency: modules, state management, workspaces, CI integration, and policy-as-code patterns.
Hands-on experience with container orchestration (ECS/Fargate, EKS) and serverless (Lambda, API Gateway) in production environments.
Demonstrated experience defining SLIs/SLOs, building observability pipelines (CloudWatch, X-Ray, or equivalent), and participating in on-call and incident response.
Proficiency with Git/GitHub, GitHub Actions or equivalent CI/CD platforms, and infrastructure pipeline design.
Linux systems administration fundamentals and scripting (Bash and/or Python).
Docker image lifecycle management, registry operations (ECR), and container security basics.
Practical understanding of networking: VPCs, subnets, routing, load balancers, DNS, hybrid connectivity, and multi-account architectures.
Security fundamentals: least-privilege IAM, KMS, WAF, Secrets Manager, tagging discipline.
Familiarity with AI/ML concepts; hands-on experience with LLMs, prompt engineering, or AWS Bedrock is strongly preferred.
Experience with AI coding assistants, automation agents, or retrieval-augmented generation (RAG) patterns is a plus.
FinOps basics: cost allocation, right-sizing, Savings Plans, and budget alerting.
Azure familiarity (AKS, Key Vault, Firewall, Sentinel, Monitor) is helpful during the near-term transition but not a core requirement.

Certificates, Licenses, Registrations:

AWS certifications (Solutions Architect Associate, SysOps Administrator, or Developer Associate) preferred; AWS DevOps Professional or Security Specialty a strong plus.
Kubernetes certifications (CKA/CKAD) or Docker certifications are a plus.
AWS AI Practitioner or Machine Learning Specialty certification is a differentiator.
Azure certifications are optional and considered a plus during the transition period.

Tools & Technologies You'll Use:

AWS: EC2, ECS/Fargate, EKS, ECR, Lambda, API Gateway, RDS/Aurora, DynamoDB, ElastiCache, S3, EFS, CloudFront, Route 53, ALB/NLB, IAM, KMS, WAF, Secrets Manager, CloudWatch/X-Ray, Organizations/Control Tower, Bedrock, Kendra, OpenSearch, Step Functions.
IaC/CI: Terraform (primary), CloudFormation (as needed), GitHub/GitHub Actions, OPA/Sentinel policy-as-code.
SRE/Observability: CloudWatch (metrics, logs, alarms, dashboards), X-Ray, approved third-party APM/logging tools, PagerDuty or equivalent.
Scripting/OS/Containers: Linux, Bash, Python, Docker, Helm.
Networking: VPC, Transit Gateway, Direct Connect/VPN, PrivateLink, Route 53, DNS patterns.
AI/ML: AWS Bedrock, Agents for Bedrock, LLM APIs, prompt engineering frameworks, RAG architectures.
ITSM/FinOps: ServiceNow, Cost Explorer, Savings Plans/RI planning, budget/anomaly alerting.

ECCO Select is committed to hiring and retaining a diverse workforce. Our policy is to provide equal opportunity to all people without regard to race, color, religion, national origin, ancestry, marital status, veteran status, age, disability, pregnancy, genetic information, citizenship status, sex, sexual orientation, gender identity or any other legally protected category. Veterans of our United States Uniformed Services are specifically encouraged to apply for ECCO Select opportunities.
