What are the responsibilities and job description for the Staff Machine Learning Engineer / Principal ML Engineer position at SRS Consulting Inc?
Role: Staff Machine Learning Engineer
Location: San Jose, CA (Onsite; local candidates only)
Duration: Long-term
Mode of Interview: Virtual rounds; final round in person
Why this role exists
We're building privacy‐preserving LLM capabilities that help hardware design teams reason over Verilog/SystemVerilog and RTL artifacts—code generation, refactoring, lint explanation, constraint translation, and spec‐to‐RTL assistance. We're looking for a Staff‐level engineer to technically lead a small, high‐leverage team that fine‐tunes and productizes LLMs for these workflows in a strict enterprise data‐privacy environment.
You don't need to be a Verilog/RTL expert to start; curiosity, drive, and deep LLM craftsmanship matter most. Any HDL/EDA fluency is a strong plus.
What you'll do (Responsibilities)
• Own the technical roadmap for Verilog/RTL‐focused LLM capabilities—from model selection and adaptation to evaluation, deployment, and continuous improvement.
• Lead a hands‐on team of applied scientists/engineers: set direction, unblock technically, review designs/code, and raise the bar on experimentation velocity and reliability.
• Fine-tune and customize models using state-of-the-art techniques (LoRA/QLoRA, PEFT, instruction tuning, preference optimization/RLAIF) with robust HDL-specific evals (see the fine-tuning sketch after this list):
o Compile-/lint-/simulate-based pass rates, pass@k for code generation, constrained decoding to enforce syntax, and "does-it-synthesize" checks.
• Design privacy-first ML pipelines on AWS (see the storage/secrets sketch after this list):
o Training/customization and hosting using Amazon Bedrock (including Anthropic models) where appropriate; SageMaker (or EKS KServe/Triton/DJL) for bespoke training needs.
o Artifacts in S3 with KMS CMKs; isolated VPC subnets & PrivateLink (including Bedrock VPC endpoints), IAM least‐privilege, CloudTrail auditing, and Secrets Manager for credentials.
o Enforce encryption in transit and at rest, data minimization, and no public egress for customer/RTL corpora.
• Stand up dependable model serving: Bedrock model invocation where it fits, and/or low‐latency self‐hosted inference (vLLM/TensorRT‐LLM), autoscaling, and canary/blue‐green rollouts.
• Build an evaluation culture: automatic regression suites that run HDL compilers/simulators, measure behavioral fidelity, and detect hallucinations/constraint violations; model cards and experiment tracking (MLflow/Weights & Biases). See the compile-based eval sketch after this list.
• Partner deeply with hardware design, CAD/EDA, Security, and Legal to source/prepare datasets (anonymization, redaction, licensing), define acceptance gates, and meet compliance requirements.
• Drive productization: integrate LLMs with internal developer tools (IDEs/plug‐ins, code review bots, CI), retrieval (RAG) over internal HDL repos/specs, and safe tool‐use/function‐calling.
• Mentor & uplevel: coach ICs on LLM best practices, reproducible training, critical paper reading, and building secure‐by‐default systems.
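For context on the fine-tuning work described above, here is a minimal LoRA sketch using Hugging Face Transformers and PEFT. The base model, dataset file, and hyperparameters are illustrative assumptions, not the team's actual configuration.

```python
# Minimal LoRA fine-tuning sketch (Hugging Face Transformers + PEFT).
# Base model, corpus path, and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base_model = "codellama/CodeLlama-7b-hf"            # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(base_model)
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],            # attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)             # only adapter weights train

# Hypothetical JSONL corpus of HDL snippets, each record with a "text" field.
ds = load_dataset("json", data_files="rtl_corpus.jsonl")["train"]
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", per_device_train_batch_size=1,
                           gradient_accumulation_steps=8, num_train_epochs=1,
                           learning_rate=2e-4, logging_steps=10),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("out/lora_adapter")           # adapter only, a few MB
```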
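The privacy-first pipeline bullets translate roughly into patterns like the following boto3 sketch: SSE-KMS on every artifact write and credentials pulled from Secrets Manager rather than embedded. The bucket name, key alias, and secret name are placeholders.

```python
# Sketch of the artifact-storage conventions described above.
import boto3

KMS_KEY_ALIAS = "alias/ml-artifacts"        # assumed customer-managed key
BUCKET = "example-rtl-artifacts"            # assumed private, VPC-only bucket

s3 = boto3.client("s3")
s3.put_object(
    Bucket=BUCKET,
    Key="checkpoints/lora_adapter.tar.gz",
    Body=open("out/lora_adapter.tar.gz", "rb"),
    ServerSideEncryption="aws:kms",         # force SSE-KMS, not SSE-S3
    SSEKMSKeyId=KMS_KEY_ALIAS,
)

# Experiment-tracker credentials come from Secrets Manager, never from code.
secrets = boto3.client("secretsmanager")
wandb_key = secrets.get_secret_value(SecretId="wandb/api-key")["SecretString"]
```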
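As a rough illustration of the compile-based eval culture described above, this sketch runs Icarus Verilog (iverilog) on generated samples and computes compile-pass rate plus the standard unbiased pass@k estimator. The tool choice and data layout are assumptions.

```python
# Compile-based eval sketch: does each generated module compile, and what is pass@k?
import math
import subprocess
import tempfile
from pathlib import Path

def compiles(verilog_source: str) -> bool:
    """Return True if iverilog accepts the source without errors."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "dut.v"
        src.write_text(verilog_source)
        result = subprocess.run(
            ["iverilog", "-g2012", "-o", str(Path(tmp) / "dut.out"), str(src)],
            capture_output=True, text=True,
        )
        return result.returncode == 0

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k draws from n samples
    (c of which pass) is a passing sample, i.e. 1 - C(n-c,k)/C(n,k)."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

def evaluate(samples: dict[str, list[str]], k: int = 1) -> float:
    """samples: hypothetical mapping of task id -> list of n completions."""
    scores = []
    for completions in samples.values():
        c = sum(compiles(s) for s in completions)
        scores.append(pass_at_k(len(completions), c, k))
    return sum(scores) / len(scores)
```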
What you'll bring (Minimum qualifications)
• 10+ years of total engineering experience, with 5+ years in ML/AI or large-scale distributed systems and 3+ years working directly with transformers/LLMs.
• Proven track record shipping LLM‐powered features in production and leading ambiguous, cross‐functional initiatives at Staff level.
• Deep hands‐on skill with PyTorch, Hugging Face Transformers/PEFT/TRL, distributed training (DeepSpeed/FSDP), quantization‐aware fine‐tuning (LoRA/QLoRA), and constrained/grammar‐guided decoding.
• AWS expertise to design and defend secure enterprise deployments, including:
o Amazon Bedrock (model selection, Anthropic model usage, model customization, Guardrails, Knowledge Bases, Bedrock runtime APIs, VPC endpoints)
o SageMaker (Training, Inference, Pipelines), S3, EC2/EKS/ECR, VPC/Subnets/Security Groups, IAM, KMS, PrivateLink, CloudWatch/CloudTrail, Step Functions, Batch, Secrets Manager.
• Strong software engineering fundamentals: testing, CI/CD, observability, performance tuning; Python is a must (bonus for Go/Java/C++).
• Demonstrated ability to set technical vision and influence across teams; excellent written and verbal communication for execs and engineers.
Nice to have (Preferred qualifications)
• Familiarity with Verilog/SystemVerilog/RTL workflows: lint, synthesis, timing closure, simulation, formal, test benches, and EDA tools (Synopsys/Cadence/Mentor).
• Experience integrating static analysis/AST‐aware tokenization for code models or grammar‐constrained decoding.
• RAG at scale over code/specs (vector stores, chunking strategies), tool‐use/function‐calling for code transformation.
• Inference optimization: TensorRT‐LLM, KV‐cache optimization, speculative decoding; throughput/latency trade‐offs at batch and token levels.
• Model governance/safety in the enterprise: model cards, red‐teaming, secure eval data handling; exposure to SOC2/ISO 27001/NIST frameworks.
• Data anonymization, DLP scanning, and code de‐identification to protect IP.
What success looks like
90 days
• Baseline an HDL‐aware eval harness that compiles/simulates; establish secure AWS training & serving environments (VPC‐only, KMS‐backed, no public egress).
• Ship an initial fine‐tuned/customized model with measurable gains vs. base (e.g., X% compile‐pass rate, −Y% lint findings per K LOC generated).
180 days
• Expand customization/training coverage (Bedrock for managed FMs including Anthropic; SageMaker/EKS for bespoke/open models).
• Add constrained decoding and retrieval over internal design specs; productionize inference with SLOs (p95 latency, availability) and audited rollout to pilot hardware teams.
12 months
• Demonstrably reduce review/iteration cycles for RTL tasks with clear metrics (defect reduction, time‐to‐lint‐clean, % auto‐fix suggestions accepted), and a stable MLOps path for continuous improvement.
How we work (Security & privacy by design)
• Customer and internal design data remain within private AWS VPCs; access via IAM roles and audited by CloudTrail; all artifacts encrypted with KMS.
• No public internet calls for sensitive workloads; Bedrock access via VPC interface endpoints/PrivateLink with endpoint policies (see the invocation sketch after this list); SageMaker and/or EKS run in private subnets.
• Data pipelines enforce minimization, tagging, retention windows, and reproducibility; DLP scanning and redaction are first‐class steps.
• We produce model cards, data lineage, and evaluation artifacts for every release.
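As one illustration of the "no public egress" posture, the sketch below invokes an Anthropic model on Bedrock through a VPC interface endpoint. The endpoint DNS name, region, and model ID are placeholders, and the endpoint policy and IAM role are assumed to be provisioned separately.

```python
# Sketch: call Bedrock runtime over PrivateLink so traffic stays in the VPC.
import json
import boto3

bedrock = boto3.client(
    "bedrock-runtime",
    region_name="us-west-2",
    # Placeholder private DNS name of the bedrock-runtime interface endpoint.
    endpoint_url="https://vpce-0123456789abcdef0-xxxxxxxx.bedrock-runtime.us-west-2.vpce.amazonaws.com",
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # assumed model ID
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,
        "messages": [{"role": "user",
                      "content": "Explain this lint warning: implicit wire declaration."}],
    }),
)
print(json.loads(response["body"].read())["content"][0]["text"])
```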
Tech you'll touch
• Modeling: PyTorch, HF Transformers/PEFT/TRL, DeepSpeed/FSDP, vLLM, TensorRT‐LLM
• AWS & MLOps: Amazon Bedrock (Anthropic and other FMs, Guardrails, Knowledge Bases, Runtime APIs), SageMaker (Training/Inference/Pipelines), MLflow/W&B, ECR, EKS/KServe/Triton, Step Functions
• Platform/Security: S3 + KMS, IAM, VPC/PrivateLink (incl. Bedrock), CloudWatch/CloudTrail, Secrets Manager
• Tooling (nice to have): HDL toolchains for compile/simulate/lint, vector stores (pgvector/OpenSearch), GitHub/GitLab CI