What are the responsibilities and job description for the GPU Pipeline Microarchitect & RTL Designer position at Oxmiq Labs?
The Role
Own microarchitecture and RTL for high-throughput GPU pipeline blocks. You’ll translate product goals into clear specs, deliver timing-clean RTL, and partner across verification, physical design, and—critically—the GPU architecture team to land measurable PPA wins.
Responsibilities
- Microarchitecture: Define pipeline stages, flow control, queues/buffers, and interfaces; write concise design specs and lead reviews.
- RTL Design & PPA: Implement clean, synthesizable SystemVerilog; drive performance/power/area optimizations (datapaths, arbitration, backpressure, gating).
- Architecture Collaboration: Work day-to-day with the architecture team to refine requirements, align on performance targets, and iterate on uArch choices with data.
- Verification Partnership: Build unit tests, create coverage plans, and author SVA; collaborate with UVM/formal to close corner cases.
- Quality & Sign-off: Run lint/CDC/RDC; support synthesis/STA and timing convergence; engage with PD/DFT for constraints and test.
- Bring-up & Debug: Support emulation/FPGA and silicon; instrument counters, analyze traces, and root-cause issues end-to-end.
- Communication & Teamwork: Communicate trade-offs clearly across architecture, software, and PD; mentor peers and contribute to cross-IP integration.
Qualifications
- 5 years of industry experience on desktop, mobile, or data center GPUs, with ownership of real, shipped projects.
- Proficient in RTL design (SystemVerilog) and in PPA (performance, power, area) optimization.
- Team player with a strong understanding of overall GPU architecture and microarchitecture (SIMT/SIMD execution, scheduling and flow control, memory hierarchy).
- Hands-on first: Able to build unit tests, drive coverage-based verification (functional/code), and write robust SVA.
Depth in at least one of the following domains:
- Shader Core (execution pipelines, hazards, replay)
- Floating-Point Unit (IEEE-754, exceptions, denormals)
- Instruction Scheduler (warp/wavefront issuing, fairness, QoS)
- Job Scheduler / Command Submission
- L1/L2 Cache Design (coherency, miss handling, prefetch)
- Command Processor (front-end, MMIO, context management)
- Tensor Core Design (matrix/tensor datapaths, mixed precision)
- Tensor DMA (high-BW engines, tiling, compression)
Nice to Have:
- Experience with ray tracing blocks, texture/sampler, ROP/blend, or MMU/TLB.
- Performance modeling, perf counter design, and trace analysis.
- EDA fluency: VCS/Questa, Verdi, Jasper/IFV, DC/Genus, PrimeTime/Tempus; emulation (Palladium/Veloce) or FPGA protos.
- Collaboration with compiler/LLVM and driver/runtime teams.
What Success Looks Like (6–12 Months):
- Tapeout-quality RTL for one or more pipeline blocks with signed-off PPA.
- Coverage closure against a clear plan (≥ target functional/code coverage) with SVA-backed correctness.
- Demonstrated perf/power gains on target workloads vs. baseline.