What are the responsibilities and job description for the AI/ML Infrastructure Engineer position at Oracle?
- Experience in scripting and automation using tools like Ansible, Terraform, and/or Kubernetes.
- Experience with containerization technologies (e.g., Docker, Kubernetes) and orchestration tools for managing distributed systems.
- Solid understanding of networking concepts, security principles, and best practices.
- Excellent problem-solving skills, with the ability to troubleshoot complex issues and drive resolution in a fast-paced environment.
- Strong communication and collaboration skills, with the ability to work effectively in cross-functional teams and convey technical concepts to non-technical stakeholders.
- Strong documentation skills with experience documenting infrastructure designs, configurations, procedures, and troubleshooting steps to facilitate knowledge sharing, ensure maintainability, and enhance team collaboration.
- Strong Linux skills with hands-on experience in Oracle Linux/RHEL/CentOS, Ubuntu, and Debian distributions, including system administration, package management, shell scripting, and performance optimization.
Preferred Qualifications
- Strong proficiency in at least one programming language, such as Python, Rust, Go, Java, or Scala.
- Proven experience designing, implementing, and managing infrastructure for AI/ML or HPC workloads.
- Understanding of machine learning frameworks and libraries such as TensorFlow, PyTorch, or scikit-learn, and of their deployment in production environments, is a plus.
- Familiarity with DevOps practices and tools for continuous integration, deployment, and monitoring (e.g., Jenkins, GitLab CI/CD, Prometheus).
- Strong experience with high-performance computing (HPC) systems.
- Ability to take ownership of problems and work to identify solutions.
- Ability to think through a solution and to identify and document potential issues that could affect customers.
- Design, deploy, and manage infrastructure components such as cloud resources, distributed computing systems, and data storage solutions to support AI/ML workflows.
- Collaborate with scientists and software/infrastructure engineers to understand infrastructure requirements for training, testing, and deploying machine learning models.
- Implement automation solutions for provisioning, configuring, and monitoring AI/ML infrastructure to streamline operations and enhance productivity.
- Optimize infrastructure performance by tuning parameters, improving resource utilization, and implementing caching and data pre-processing techniques.
- Ensure security and compliance standards are met throughout the AI/ML infrastructure stack, including data encryption, access control, and vulnerability management.
- Troubleshoot infrastructure performance, scalability, and reliability issues and implement solutions to mitigate risks and minimize downtime.
- Stay updated on emerging technologies and best practices in AI/ML infrastructure and evaluate their potential impact on our systems and workflows.
- Document infrastructure designs, configurations, and procedures to facilitate knowledge sharing and ensure maintainability.