What are the responsibilities and job description for the Senior Compiler Backend Engineer position at OXMIQ Labs?
About OXMIQ Labs
OXMIQ Labs is re‑architecting the GPU stack “from atoms to agents”—building a licensable GPU hardware and software platform for next‑generation AI, graphics, and multimodal workloads. Founded by GPU architect Raja Koduri, OXMIQ develops GPU IP cores and software, not consumer chips, with a software‑first model and IP licensing business.
At the heart of the hardware roadmap is OxCore™, a RISC‑V–based GPU IP core that integrates scalar, vector, and tensor engines in a modular architecture, designed to scale from tiny edge devices to zettascale data‑center deployments via the OxQuilt™ chiplet/SoC builder. OxCore targets near‑ and in‑memory compute, supports nano‑agents, and aims for SIMT/CUDA compatibility and native Python acceleration.
On the software side, OXMIQ is building:
- OXPython – runs Python‑based CUDA applications unmodified on non‑NVIDIA hardware
- Capsule – a GPU container / deployment layer for heterogeneous systems
This role sits right at that hardware–software boundary.
The role
We’re looking for a Senior Compiler Backend Engineer to own the compiler backend for OxCore hardware IP.
You’ll design and implement the lowering pipeline that maps high‑level IR and Python/CUDA‑style workloads—flowing through systems like OXPython and Capsule—onto OxCore’s scalar, vector, tensor, and near‑/in‑memory engines across many OxQuilt configurations.
This is a foundational role: your work will directly shape how developers target OxCore from Python and other high‑level languages and how customers squeeze performance out of their OxCore‑based SoCs.
What you’ll do
Own the OxCore compiler backend
- Design and implement the OxCore codegen backend (likely on top of LLVM/MLIR or similar) from high‑level IR down to OxCore’s instruction set / micro‑ops.
- Define and evolve OxCore‑specific IR dialects, calling conventions, and ABI details across scalar, vector, and tensor engines.
- Implement lowering passes that map Python/CUDA‑like kernels and ML operators to OxCore execution units and memory hierarchy.
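To make the lowering idea concrete, here is a toy sketch of a table‑driven lowering pass. Every op name, engine label, and table entry below is invented purely for illustration; the real backend would be built on LLVM/MLIR rather than a dictionary lookup.

```python
# Illustrative sketch only: a toy lowering pass mapping high-level IR ops
# to hypothetical OxCore engine ops. All names here are invented.
from dataclasses import dataclass

@dataclass
class HighLevelOp:
    name: str          # e.g. "matmul", "relu"
    operands: list

@dataclass
class OxCoreOp:
    engine: str        # "scalar", "vector", or "tensor" (hypothetical)
    opcode: str        # hypothetical target opcode
    operands: list

# Hypothetical mapping from high-level ops to target engines/opcodes.
LOWERING_TABLE = {
    "matmul": ("tensor", "ox.tensor.mma"),
    "add":    ("vector", "ox.vec.add"),
    "relu":   ("vector", "ox.vec.max0"),
    "branch": ("scalar", "ox.scalar.br"),
}

def lower(ops):
    """Lower high-level ops to per-engine OxCore ops (toy version)."""
    return [OxCoreOp(*LOWERING_TABLE[op.name], op.operands) for op in ops]

kernel = [HighLevelOp("matmul", ["A", "B"]), HighLevelOp("relu", ["C"])]
print([(o.engine, o.opcode) for o in lower(kernel)])
# → [('tensor', 'ox.tensor.mma'), ('vector', 'ox.vec.max0')]
```

In a production backend this table becomes pattern‑based instruction selection with legality checks and cost‑guided choices between engines.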
Architect performance‑critical optimizations for OxCore
- Build OxCore‑aware optimization passes:
  - instruction selection and scheduling across heterogeneous units
  - register allocation tuned for OxCore’s register files
  - memory‑access shaping for near‑/in‑memory compute: coalescing, tiling, and locality
  - warp/SIMD‑style utilization given OxCore’s SIMT/CUDA‑compatible execution model
- Develop cost models and auto‑tuning hooks that understand different OxCore/OxQuilt configurations (ratios of compute, memory, and interconnect).
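A configuration‑aware cost model might look like the following roofline‑style toy. The `QuiltConfig` fields and every number are hypothetical, chosen only to show how tile selection can depend on a configuration’s compute/memory ratio.

```python
# Toy cost model for tile-size selection. Config fields and numbers are
# invented to illustrate configuration-aware tuning, not real OxCore data.
from dataclasses import dataclass

@dataclass
class QuiltConfig:
    flops_per_cycle: float   # tensor-engine throughput (hypothetical)
    bytes_per_cycle: float   # memory bandwidth (hypothetical)
    scratchpad_bytes: int    # near-memory scratchpad capacity (hypothetical)

def tile_cost(n, cfg, elem_bytes=4):
    """Estimated cycles for an n x n x n matmul tile (roofline-style)."""
    flops = 2 * n**3
    traffic = 3 * n**2 * elem_bytes   # load A and B, store C, once per tile
    return max(flops / cfg.flops_per_cycle, traffic / cfg.bytes_per_cycle)

def pick_tile(cfg, candidates=(16, 32, 64, 128), elem_bytes=4):
    """Cheapest-per-flop tile among those that fit in the scratchpad."""
    feasible = [n for n in candidates
                if 3 * n**2 * elem_bytes <= cfg.scratchpad_bytes]
    return min(feasible, key=lambda n: tile_cost(n, cfg) / (2 * n**3))

compute_rich = QuiltConfig(flops_per_cycle=512, bytes_per_cycle=64,
                           scratchpad_bytes=256 * 1024)
print(pick_tile(compute_rich))  # → 64 (smallest tile that is compute-bound)
```

The real cost model would fold in scheduling, interconnect, and occupancy effects; the point is that the same pass must re‑tune itself per OxQuilt configuration.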
Integrate tightly with Capsule, OXPython, and runtime
- Collaborate with the OXPython team to ensure Python‑based CUDA workloads lower efficiently onto OxCore, preserving semantics while exploiting OxCore features.
- Work with Capsule/runtime engineers on:
  - kernel launch strategies and stream/queue design
  - heterogeneous dispatch across OxCore and other accelerators
  - profiling hooks and debug interfaces
Partner with hardware and tools teams
- Work closely with OxCore architecture and OxQuilt teams to:
  - capture hardware capabilities and constraints in compiler models
  - co‑design micro‑architectural features that unlock compiler‑driven performance
- Use pre‑silicon models, simulators, and FPGA/emulation platforms to validate correctness and drive performance prior to customer silicon.
Mentor & lead
- Provide technical leadership for OxCore backend architecture, coding standards, and design reviews.
- Mentor other engineers on compiler/backend internals, GPU/accelerator performance, and RISC‑V nuances.
You might be a fit if you
- Have 7+ years of experience in compiler backend / codegen / low‑level performance engineering (title flexible for exceptional candidates).
- Have shipped or led substantial work on compiler backends (LLVM, GCC, MLIR, custom) targeting GPUs or accelerators.
- Understand deeply:
  - SSA/IR design, CFGs, and dataflow analysis
  - instruction selection and scheduling
  - register allocation strategies
  - loop transforms, tiling, and vectorization
- Have strong GPU/accelerator architecture intuition:
  - SIMD/SIMT, warps/wavefronts, occupancy
  - memory hierarchies (local/shared, HBM/DRAM, scratchpad)
  - throughput vs latency trade‑offs
- Are fluent in modern C++ (and/or Rust) for large systems codebases.
- Have meaningful exposure to RISC‑V or other ISA‑level work (writing backends, intrinsics, or hand‑tuned assembly is a plus).
- Know how to profile and optimize: you’ve used tools like Nsight, perf, VTune, ROCm tools, or custom profilers to chase down performance wins.
- Are comfortable operating in pre‑silicon environments (simulators, emulators, performance modeling).
- Enjoy working across boundaries: hardware, compilers, runtimes, and ML/graphics workloads.
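As a flavor of the dataflow fundamentals listed above, here is a minimal backward liveness analysis over a toy three‑block CFG; the block names and variable sets are invented for illustration.

```python
# Minimal backward liveness analysis over a toy CFG.
# Each block maps to (defs, uses, successor names).
def liveness(cfg):
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for b, (defs, uses, succs) in cfg.items():
            out = set().union(*(live_in[s] for s in succs)) if succs else set()
            inn = uses | (out - defs)
            if inn != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b] = inn, out
                changed = True
    return live_in, live_out

# entry: x = ...   loop: y = x + y   exit: use y
cfg = {
    "entry": ({"x"}, set(), ["loop"]),
    "loop":  ({"y"}, {"x", "y"}, ["loop", "exit"]),
    "exit":  (set(), {"y"}, []),
}
live_in, live_out = liveness(cfg)
print(sorted(live_in["loop"]))  # → ['x', 'y']
```

This is the textbook fixed‑point iteration; production liveness for register allocation adds SSA form, live ranges, and interference construction on top.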
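The occupancy intuition above can be made concrete with back‑of‑envelope math; the numbers below are generic GPU‑style figures, not OxCore specs.

```python
# Back-of-envelope occupancy math (illustrative generic-GPU numbers).
def max_warps(regs_per_sm, regs_per_thread, threads_per_warp=32,
              warp_limit=64):
    """Warps resident per SM when limited by register pressure."""
    by_regs = regs_per_sm // (regs_per_thread * threads_per_warp)
    return min(by_regs, warp_limit)

# With 65,536 registers per SM, a kernel using 32 regs/thread sustains
# 64 warps, while one using 128 regs/thread drops to 16 warps: a classic
# occupancy cliff that register allocation has to reason about.
print(max_warps(65536, 32), max_warps(65536, 128))  # → 64 16
```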
Nice to have
These are bonuses, not hard requirements:
- Experience with ML compilers / DSLs (e.g., MLIR, TVM, XLA, Triton, Halide, IREE).
- Background in GPU IP or licensable core design flows (ARM, IP providers, or custom accelerators).
- Familiarity with Python‑first or CUDA‑centric toolchains, and porting CUDA workloads to new backends.
- Experience with chiplet / heterogeneous SoC design constraints or HW/SW co‑design.
- Contributions to open‑source compilers or runtimes.
- Prior work in AI, graphics, or multimodal workloads (rendering, path tracing, transformer models, etc.).
What success looks like (first 6–12 months)
- A robust OxCore backend capable of compiling and optimizing a core set of workloads (e.g., representative AI/graphics kernels) against current OxCore models.
- Demonstrated end‑to‑end speedups versus generic GPU backends for selected workloads, using OxCore features (near‑/in‑memory compute, scalar/vector/tensor orchestration).
- Tight integration with OXPython and Capsule, with real pipelines running on partner hardware/accelerators.
- A clear roadmap for expanding OxCore ISA coverage, optimization passes, and OxQuilt‑aware codegen.
How we work
- Small, senior team with high ownership.
- Software‑first, but deeply aligned with hardware IP and customer silicon.
- Pragmatic: we care about performance, reliability, and developer experience over buzzwords.