What are the responsibilities and job description for the LLM - Full Stack Python + JS - Remote (Strong Exp in LLM Coding Tools) position at MillenniumSoft Inc?
Role: LLM - Full Stack Python + JS
Geo: LATAM, USA, Europe, West Africa
YoE: 6 years
Engagement Type: Full-time, 40h/week
Project duration: 3 months
Total number of positions: 5
Start Date: Immediate (next week)
Vetting: Two rounds of interviews (a 90-minute technical round on Flocareer and a 15-minute cultural & offer discussion)
Skills: Python, JavaScript / Node.js, TypeScript
Availability: 40 hours per week with 4 hours of overlap with PST
Role Overview: We’re a coding-focused team at Turing that serves as a research partner for a Frontier AI Lab. Our role is to build coding tasks, evaluations, datasets, and tooling that help train and improve large language models (LLMs). You’ll write and debug production-quality code, design rigorous evaluations, and build reproducible workflows that generate clean, high-signal data for model training. Attention to detail matters deeply here—small mistakes can cascade into misleading results, so precision and thoroughness are essential. You’ll also collaborate closely with engineers, researchers, and quality owners to align on standards, review work, and continuously raise the quality bar. If you enjoy solving unusual technical problems, investigating subtle model failures, and working in developer-like environments where correctness, reproducibility, and collaboration matter, this role will keep you very entertained.
What does your day-to-day look like:
- Write, review, and debug code across multiple languages
- Design tasks and evaluation scenarios for coding, reasoning, and debugging
- Investigate LLM outputs and identify hallucinations, regressions, and failure modes
- Build reproducible dev environments using Docker and automation tools
- Develop scripts, pipelines, and tools for data generation, scoring, and validation
- Produce structured annotations, judgments, and high-quality datasets
- Run systematic evaluations that help improve model reliability and reasoning
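To give a flavor of the scoring and validation work above, here is a minimal, purely illustrative Python sketch of a pass/fail scorer for model outputs. The function names, record fields, and exact-match criterion are hypothetical assumptions for illustration, not part of any actual pipeline used by the team.

```python
import json

def score_output(expected: str, actual: str) -> bool:
    """Return True when the model output matches the reference exactly
    after whitespace normalization (a deliberately strict criterion)."""
    return expected.strip() == actual.strip()

def run_eval(records):
    """Score a list of {"expected", "actual"} records; return the pass
    rate plus the indices of failing cases for later triage."""
    failures = [i for i, r in enumerate(records)
                if not score_output(r["expected"], r["actual"])]
    pass_rate = 1 - len(failures) / len(records) if records else 0.0
    return {"pass_rate": pass_rate, "failures": failures}

records = [
    {"expected": "42\n", "actual": "42"},      # passes after normalization
    {"expected": "hello", "actual": "Hello"},  # fails: case differs
]
print(json.dumps(run_eval(records)))
```

Real evaluations typically use richer criteria (test execution, rubric-based judging), but the core loop of scoring records and surfacing failures for triage looks much like this.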
Required Skills:
- Experience using LLM coding tools (Cursor, Copilot, CodeWhisperer)
- Strong hands-on coding experience (professional or research-based) in one or more of: Python, JavaScript / Node.js, TypeScript (additional languages like Go, Java, C, C#, Rust, SQL, R, Dart, etc. are a plus)
- Solid experience with Linux, Bash, scripting, and automation
- Strong with Docker, reproducible environments, and dev containers
- Advanced Git skills (branching, diffs, patches, conflict resolution)
- Solid understanding of testing and QA (unit, integration, negative, edge-case focused)
- Ability to reliably overlap with 8am-12pm PT
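As a sketch of the testing mindset expected here (unit, negative, and edge-case coverage), consider this small self-contained Python example. The `slugify` function is hypothetical, invented only to demonstrate the three kinds of checks; it is not from any existing codebase.

```python
def slugify(text: str) -> str:
    """Lowercase a string and join its words with hyphens,
    collapsing underscores and runs of whitespace."""
    parts = [p for p in text.lower().replace("_", " ").split() if p]
    return "-".join(parts)

# Unit test: typical input
assert slugify("Hello_World  Again") == "hello-world-again"

# Edge case: empty input should yield an empty slug, not crash
assert slugify("") == ""

# Negative test: non-string input should fail loudly rather than
# silently produce a wrong result
try:
    slugify(None)
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError for None input")
```

The point is less the function itself than the habit: every behavior claim gets a happy-path check, a boundary check, and a check that bad input fails visibly.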
Nice-to-Haves:
- Experience with dataset creation, annotation, evaluation, or ML pipelines
- Familiarity with benchmarks like SWE-bench or Terminal-Bench
- Background in QA automation, DevOps, ML systems, or data engineering
Who Thrives Here:
- Engineers who enjoy breaking things and understanding why
- People who like designing tasks, running experiments, and debugging
- Detail-oriented folks who can spot subtle issues in code or model behavior
- Engineers who like building clean, reusable workflows rather than one-off hacks
Preferred Background (any of the following):
- Bachelor’s degree in a technical field with 6 years’ experience
- Master’s degree in a technical field with 4 years’ experience
- PhD in a technical field with 2 years’ experience
Offer Details:
- Commitment required: 8 hours per day with 4 hours of overlap with PST
- Engagement type: Contractor assignment (no medical/paid leave)
- Duration of contract: 3 months; expected start date is next week
- Location: West Africa, LATAM, North America, South America
Evaluation Process (approximately 105 mins): Two rounds of interviews (a 90-minute technical round and a 15-minute cultural & offer discussion)