What are the responsibilities and job description for the Infrastructure Engineer — Systems & Platform position at Sixtyfour?
What You’ll Do
- Design and maintain highly available, scalable infrastructure across AWS (ECS, EKS, Lambda, SQS, CloudFront, CloudWatch).
- Architect automated CI/CD pipelines (GitHub Actions, Terraform) with strong testing, observability, and rollback safety.
- Optimize LLM inference infrastructure, including autoscaling GPU/CPU clusters, caching, async queues, batching, and tracing.
- Improve deployment workflows and environment consistency using Docker, IaC, and lightweight configuration management.
- Work on backend performance, including queue throughput, caching strategies, database indexing, and load balancing.
- Monitor, debug, and improve system reliability and latency across all services (API, inference, and web app).
- Build internal tools that enhance developer productivity and operational visibility.
- Partner with engineers to evolve the workflow and job execution engine for better parallelism, retry logic, and observability.
- Set up metrics, tracing, and alerting (OpenTelemetry, Prometheus, Grafana, Sentry) to make reliability measurable and actionable.
Qualifications
- Strong experience with cloud infrastructure (AWS preferred) including EC2, ECS, EKS, Lambda, S3, VPCs, networking, and IAM.
- Proficiency with Docker and CI/CD tools such as GitHub Actions or CircleCI.
- Experience scaling Python backend systems and modern web APIs (FastAPI preferred).
- Hands-on experience with API servers and background workers (Celery, Redis queues, etc.).
- Comfort with Postgres and Redis, including schema design, caching, rate limiting, and locks.
- Strong observability mindset, including logs, metrics, and traces.
- Production experience with autoscaling, load testing, and cost-aware resource optimization.
- Excellent debugging and on-call discipline with a focus on uptime and reliability.
- Experience managing LLM serving infrastructure (OpenAI-compatible APIs, vLLM, Triton, or similar).
- Familiarity with Next.js and TypeScript to understand end-to-end deployment pipelines.
- Experience with Terraform, Pulumi, or similar IaC tools.
- Security-focused mindset, including network boundaries, secret management, and RBAC.
- Knowledge of real-time systems (SSE or WebSockets) or stream processing.
- Experience building developer platform tools or internal DevOps systems.
Salary: $140,000 - $200,000