What are the responsibilities and job description for the Staff Engineer, CI/CD & Cloud Infrastructure position at Foresite Labs?
Location: San Diego, CA
Job Type: Full-Time
Salary Range: $175,000 - $185,000
Position Overview
We are looking for a Staff CI/CD & Cloud Infrastructure Engineer to own and evolve our build pipelines, deployment workflows, and cloud infrastructure. You will be responsible for ensuring that software — spanning Python, C/C++, and CUDA on Linux — is built, tested, versioned, and deployed reliably across both AWS cloud environments and a fleet of complex embedded instruments operated in our central lab facility.
This is a senior hands-on role for an engineer who thrives at the intersection of DevOps automation, cloud infrastructure management, and release engineering. You will design and maintain CI/CD pipelines, manage complex AWS infrastructure as code, and ensure full traceability from source commits through builds, tests, artifacts, and deployments. You will work cross-functionally with firmware, application, and HPC engineers to keep the entire delivery pipeline fast, reliable, and observable.
Key Responsibilities
CI/CD & Build Engineering
- Design, build, and maintain CI/CD pipelines using GitHub Actions or similar platforms
- Manage build systems for Python, C/C++, and CUDA codebases on Linux
- Integrate build tools (CMake, Make, pip, setuptools) into automated pipelines
- Implement robust versioning, tagging, and artifact management strategies (see the sketch after this list)
- Ensure full traceability of builds, test results, and artifacts from commit to deployment
- Manage Docker-based build environments including base images, caching, and reproducibility
- Maintain and optimize build performance, parallelism, and reliability
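To make the versioning and traceability duties above concrete, here is a minimal Python sketch of the kind of script this role might own: it derives a version string from git state and writes a build manifest alongside the artifacts. The `dist` output directory and the manifest field names are illustrative assumptions, not an established Foresite Labs convention.

```python
# Illustrative sketch only: derive a traceable version from git state and
# stamp it into a build manifest. Paths and field names are hypothetical.
import datetime
import json
import pathlib
import subprocess

def git(*args: str) -> str:
    """Run a git command and return its trimmed stdout."""
    return subprocess.check_output(["git", *args], text=True).strip()

def write_manifest(artifact_dir: pathlib.Path) -> None:
    manifest = {
        # e.g. "v1.4.0-12-gdeadbee": nearest tag, commits since, short SHA
        "version": git("describe", "--tags", "--always", "--dirty"),
        "commit": git("rev-parse", "HEAD"),
        "branch": git("rev-parse", "--abbrev-ref", "HEAD"),
        "built_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    artifact_dir.mkdir(parents=True, exist_ok=True)
    (artifact_dir / "build-manifest.json").write_text(json.dumps(manifest, indent=2))

if __name__ == "__main__":
    write_manifest(pathlib.Path("dist"))  # hypothetical artifact directory
```

A manifest like this, published with every artifact, is one common way to link a deployed binary back to the exact commit and pipeline run that produced it.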
Cloud Infrastructure & AWS
- Architect and manage complex AWS infrastructure including:
  - IAM roles, policies, and access management
  - Storage services (S3, EBS, EFS) with tiered lifecycle policies (see the lifecycle sketch after this list)
  - Databases (RDS, DynamoDB, or similar) with backup and failover strategies
  - Data workflow and pipeline engines (Step Functions, Airflow, or similar)
  - Compute services (EC2, ECS, EKS, Lambda) scaled to workload
- Implement infrastructure as code using Terraform
- Manage Kubernetes clusters and Helm charts for containerized workloads
- Design for scalability, high availability, and disaster recovery
- Manage cost optimization, resource tagging, and infrastructure governance
- Support multi-account and multi-region strategies as needed
- Familiarity with Azure and GCP for secondary or hybrid requirements
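As one example of the tiered lifecycle work referenced above, here is a minimal boto3 sketch that transitions experiment data from hot to warm to cold storage. The bucket name, key prefix, tier boundaries, and retention period are all illustrative assumptions.

```python
# Illustrative sketch: tiered S3 lifecycle policy (hot -> warm -> cold).
# Bucket name, prefix, day counts, and retention are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="example-experiment-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-experiment-data",
                "Filter": {"Prefix": "runs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 180, "StorageClass": "GLACIER"},     # cold tier
                ],
                "Expiration": {"Days": 1825},  # assumed 5-year retention
            }
        ]
    },
)
```

In practice a policy like this would live in Terraform rather than an ad hoc script, so the lifecycle rules stay versioned alongside the rest of the infrastructure.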
HPC & Hybrid Infrastructure
- Provision, configure, and manage on-premises Linux HPC nodes used for secondary and tertiary data processing
- Define infrastructure-as-code (Terraform, Ansible, or similar) for reproducible HPC node provisioning and configuration
- Manage high-speed networking infrastructure between instruments, HPC nodes, and storage (configuration, monitoring, troubleshooting)
- Implement and manage shared storage systems (NFS, parallel filesystems, or similar) accessible to both local HPC and cloud compute
- Design and operate hybrid burst-to-cloud infrastructure — provision and manage AWS compute resources that extend local HPC capacity on demand
- Collaborate with the data pipeline team to ensure infrastructure meets throughput, latency, and reliability requirements
- Manage OS patching, driver updates, and GPU runtime environments across HPC nodes
- Monitor HPC cluster health, utilization, and capacity to inform scaling decisions (a minimal monitoring sketch follows this list)
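As a sketch of the cluster-monitoring duty above: per-GPU utilization can be sampled through nvidia-smi's CSV query interface and fed into burst-to-cloud decisions. The 80% saturation threshold is an illustrative assumption, not a stated policy.

```python
# Illustrative sketch: sample GPU utilization on an HPC node via nvidia-smi,
# as one input to burst-to-cloud scaling decisions. Threshold is hypothetical.
import csv
import io
import subprocess

def gpu_stats() -> list[dict]:
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    stats = []
    for row in csv.reader(io.StringIO(out)):
        idx, util, used, total = (field.strip() for field in row)
        stats.append({"gpu": int(idx), "util_pct": int(util),
                      "mem_used_mib": int(used), "mem_total_mib": int(total)})
    return stats

if __name__ == "__main__":
    stats = gpu_stats()
    # Hypothetical policy: flag the node when every GPU is busy.
    if stats and all(gpu["util_pct"] > 80 for gpu in stats):
        print("node saturated; consider bursting to cloud capacity")
```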
Data Management & Search
- Design and operate data ingestion pipelines for high-volume experiment data from lab instruments
- Implement tiered storage strategies (hot/warm/cold) to balance accessibility, performance, and cost
- Deploy and manage search infrastructure (Elasticsearch/OpenSearch) to make experiment data universally discoverable and queryable
- Build data cataloging and metadata tagging systems so datasets are well-organized and self-describing
- Integrate visualization tools (Grafana, Kibana, or similar) to enable engineers and scientists to explore and analyze experiment data
- Design data lifecycle policies including retention, archival, and compliance requirements
- Ensure data pipelines are reliable, idempotent, and observable with clear error handling and retry logic (see the ingestion sketch after this list)
- Work with engineering and science teams to define data schemas, access patterns, and query requirements
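One way to satisfy the idempotency bullet above, sketched with the opensearch-py client: deriving the document ID from the record content means replaying the same ingest overwrites the existing document instead of creating a duplicate. The endpoint, index name, and record fields are placeholders.

```python
# Illustrative sketch: idempotent ingestion into OpenSearch. Endpoint,
# index name, and record fields are placeholders, not real infrastructure.
import hashlib
import json

from opensearchpy import OpenSearch  # pip install opensearch-py

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def ingest(record: dict, index: str = "experiments") -> str:
    # A content-derived ID makes re-ingestion idempotent: replaying the
    # same record updates one document rather than appending another.
    doc_id = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()
    client.index(index=index, id=doc_id, body=record)
    return doc_id

if __name__ == "__main__":
    ingest({"run_id": "r-001", "instrument": "unit-a", "status": "complete"})
```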
Instrument Deployment & Release Management
- Own deployment workflows for software delivered to embedded instruments in our central lab
- Manage release processes for a small number of complex, high-value lab-operated instruments
- Design deployment strategies that account for rollback, validation, and minimal downtime (a deployment sketch follows this list)
- Coordinate versioned releases across multiple software components and dependencies
- Support development, staging, and production environment parity
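The rollback bullet above points to this sketch: a classic atomic symlink swap, where the instrument runs whatever `current` points at, a post-deploy validation command gates the release, and any failure swaps the link back to the previous tree. The paths and validation command are hypothetical.

```python
# Illustrative sketch: atomic symlink-swap deployment with validation and
# rollback. Paths and the validation command are hypothetical.
import os
import pathlib
import subprocess

def deploy(release_dir: pathlib.Path, current_link: pathlib.Path,
           validate_cmd: list[str]) -> None:
    previous = current_link.resolve() if current_link.exists() else None
    # Build the new link at a temp path, then rename over the old one:
    # os.replace is atomic, so readers never see a missing "current".
    tmp = current_link.with_suffix(".tmp")
    tmp.unlink(missing_ok=True)
    tmp.symlink_to(release_dir)
    os.replace(tmp, current_link)
    try:
        subprocess.run(validate_cmd, check=True, timeout=120)
    except Exception:
        if previous is not None:
            # Validation failed: point "current" back at the last good release.
            tmp.unlink(missing_ok=True)
            tmp.symlink_to(previous)
            os.replace(tmp, current_link)
        raise

if __name__ == "__main__":
    deploy(pathlib.Path("/opt/instrument/releases/1.4.0"),
           pathlib.Path("/opt/instrument/current"),
           ["/opt/instrument/releases/1.4.0/bin/self-test"])
```

Keeping the previous release tree on disk is what makes rollback a constant-time link swap rather than a re-install, which matters for minimizing instrument downtime.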
Observability & Traceability
- Implement centralized log collection and aggregation across cloud and on-site systems
- Deploy and manage observability tooling (Prometheus, Grafana, Loki, CloudWatch, or similar)
- Ensure structured, searchable logging with clear correlation across services (see the logging sketch after this list)
- Build dashboards and alerting for infrastructure health, pipeline status, and deployment state
- Establish traceability standards linking builds, tests, artifacts, and deployments
- Support diagnostics and post-mortem analysis for production incidents
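The structured-logging bullet above references this sketch: a JSON formatter that stamps every line with correlation fields tying a log event back to its build and deployment. The BUILD_SHA and DEPLOY_ID environment variable names are assumptions for illustration.

```python
# Illustrative sketch: structured JSON logs with correlation fields linking
# each line to a build and deployment. Env var names are assumptions.
import json
import logging
import os
import sys
import uuid

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Correlation fields: trace a log line back to its build/deploy.
            "build_sha": os.environ.get("BUILD_SHA", "unknown"),
            "deploy_id": os.environ.get("DEPLOY_ID", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("pipeline")
log.addHandler(handler)
log.setLevel(logging.INFO)

if __name__ == "__main__":
    log.info("rollout started", extra={"trace_id": str(uuid.uuid4())})
```

Because every line carries the same machine-parseable fields, a query in Loki, OpenSearch, or CloudWatch can pivot from a single error straight to the build and deployment that produced it.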
Agentic AI & Automation
- Integrate agentic AI tools into CI/CD workflows to automate code review, test generation, and pipeline troubleshooting
- Evaluate and deploy AI-powered assistants for infrastructure management, incident response, and operational tasks
- Design guardrails and human-in-the-loop controls for AI-driven automation in production environments (a guardrail sketch follows this list)
- Stay current with the rapidly evolving landscape of AI-augmented development and DevOps tooling
- Champion adoption of agentic AI across engineering workflows to accelerate delivery and improve reliability
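As a sketch of the guardrail duty above: one common human-in-the-loop pattern gates AI-proposed remediation behind an action allowlist and an explicit named approver. The action names and dispatch logic are made up for illustration.

```python
# Illustrative sketch: human-in-the-loop guardrail for AI-proposed actions.
# The allowlist contents and action names are hypothetical.
ALLOWED_ACTIONS = {"restart_service", "scale_out", "clear_cache"}

def execute_ai_action(action: str, target: str,
                      approved_by: str | None = None) -> None:
    """Run an AI-proposed remediation only if allowlisted and human-approved."""
    if action not in ALLOWED_ACTIONS:
        raise PermissionError(f"action {action!r} is not on the allowlist")
    if approved_by is None:
        raise PermissionError("production actions require a named human approver")
    # A real implementation would dispatch to infrastructure tooling;
    # printing stands in for that here.
    print(f"executing {action} on {target} (approved by {approved_by})")

if __name__ == "__main__":
    execute_ai_action("restart_service", "ingest-worker-3", approved_by="oncall")
```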
Education:
BS/MS in Computer Science or Engineering
Required:
Experience & Technical Skills
- 7+ years of experience in DevOps, CI/CD, or cloud infrastructure roles
- Strong, hands-on Linux expertise (administration, debugging, performance tuning)
- Deep experience designing and operating CI/CD pipelines (GitHub Actions preferred)
- Proven experience managing complex AWS infrastructure at scale
- Strong knowledge of Docker including multi-stage builds, registries, and orchestration
- Experience with infrastructure as code using Terraform
- Experience with Kubernetes and Helm for container orchestration
- Solid understanding of versioning strategies, artifact management, and release engineering
- Experience integrating agentic AI into DevOps workflows and CI/CD pipelines
- Proficiency in Python and shell scripting for automation and tooling
- Ability to read, debug, and build C/C++ and CUDA applications on Linux
- Experience integrating build systems (CMake, Make) into CI pipelines
- Familiarity with package management and dependency resolution across languages
- Deep AWS experience across IAM, networking (VPC, security groups), storage, compute, and database services
- Experience managing on-premises Linux HPC infrastructure alongside cloud resources
- Experience designing for high availability, failover, and disaster recovery
- Experience with data pipeline and workflow orchestration tools (Step Functions, Airflow, or similar)
- Experience with search and indexing platforms (Elasticsearch, OpenSearch, or similar)
- Understanding of tiered storage strategies and data lifecycle management
- Knowledge of cost management, tagging strategies, and infrastructure governance
- Experience with logging and monitoring stacks (Prometheus, Grafana, Loki, ELK, or CloudWatch)
- Understanding of build and artifact traceability practices
- Experience with structured logging and distributed tracing concepts
- Experience deploying software to embedded or lab-operated instruments
- Experience with high-speed networking (InfiniBand, RDMA, or 10/25/100GbE) in HPC environments
- Experience with CUDA build toolchains and GPU-accelerated workloads
- Familiarity with Azure or GCP in addition to AWS
- Experience in regulated or reliability-sensitive environments
- Experience with GitOps workflows and progressive delivery
- Familiarity with secrets management (Vault, AWS Secrets Manager)
Compensation Range: $175K - $185K