What are the responsibilities and job description for the Pod Software Engineer, Mid-Level position at Jobright.ai?
Jobright is an AI-powered career platform that helps job seekers discover the top opportunities in the US. We are NOT a staffing agency. Jobright does not hire directly for these positions. We connect you with verified openings from employers you can trust.
Job Summary:
Etched is focused on creating powerful servers for transformer inference through advanced hardware solutions. The Pod Software Engineer will be responsible for developing and optimizing high-performance networking solutions, specifically for communication among inference nodes in large-scale clusters.
Responsibilities:
• Design, develop, and implement RDMA-based network peering that supports high-bandwidth, low-latency communication across PCIe nodes within and across racks.
• Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance.
• Furnish burn-in teams with tests that represent both real-world device-to-device networking workloads and extreme-load stress conditions.
• Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads.
• Analyze performance deviations, optimize network stack configurations, and propose kernel tuning parameters for low-latency, high-bandwidth inference workloads.
• Design and execute automated qualification tests for RDMA NICs and interconnects across various server configurations.
• Identify and root-cause firmware, driver, and hardware issues that impact RDMA performance and reliability.
• Collaborate with ODMs and silicon vendors to validate new RDMA features and enhancements.
• Implement and validate peer RDMA support for GPU-to-GPU and accelerator-to-accelerator communication.
• Modify kernel drivers and user-space libraries to optimize direct memory access between inference pods.
• Profile and benchmark inter-node RDMA latency and bandwidth to improve inference job scaling.
• Optimize NIC and switch configurations to balance throughput, congestion control, and reliability.
Qualifications:
Required:
• Proficiency in C/C++.
• Proficiency in at least one scripting language (e.g., Python, Bash, Go).
• Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE.
• Experience with zero-copy networking, RDMA verbs, and memory registration.
• Familiarity with queue pairs, completion queues, and transport types.
• Strong understanding of operating systems (Linux preferred) and server hardware architectures.
• Ability to analyze complex technical problems and provide effective solutions.
• Excellent communication and collaboration skills.
• Ability to work independently and as part of a team.
• Experience with version control systems (e.g., Git).
• Experience with reading and interpreting hardware logs.
Preferred:
• Experience with networking technologies such as NVLink, InfiniBand, and ML pod interconnects.
• Experience with widely deployed top-of-rack switches (Cisco, Juniper, Arista, etc.).
• Knowledge of server virtualization.
• Experience with tracing tools like perf, eBPF, ftrace, etc.
• Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.).
• Familiarity with hardware diagnostic tools and techniques.
• Experience with containerization technologies (e.g., Docker, Kubernetes).
• Experience with CI/CD pipelines.
• Experience with Rust.
Company:
By burning the transformer architecture into our chips, we’re creating the world’s most powerful servers for transformer inference. Founded in 2022, the company is headquartered in Cupertino, California, USA, and has a team of 11-50 employees. The company is currently early stage. Etched has a track record of offering H-1B sponsorships.