What are the responsibilities and job description for the Principal SRE position at Veracity Software Inc?
Job Title: Principal SRE
Duration: 12 Months with possibility to extend or convert
Location: Charlotte, NC/Irving, TX/Phoenix, AZ/Minneapolis, MN/Iselin, NJ/San Francisco, CA - Hybrid Role (3 days onsite every week)
Objective: Improve scalability & reliability of existing systems. Drive retroactive mitigation of technical debt for critical systems.
Skills and focus of PREs:
Observability
- Alerts & Dashboards focused on user experience
- Focus: Metrics, Logs, Traces
- Continuous Trend & Usage analysis
- Thresholds for new equipment purchases that account for lead time
- Measure performance from user perspective
- Focus: Response time, Error rate, Latency
- High Availability & Fault Tolerance (within & across DC)
- Focus: RTO should match business expectations
- Act as a Platform Reliability Engineering (PRE) expert, providing deep technical leadership in one core domain (Database, Cloud, Network, Compute/Storage, Middleware, or Application Support), while partnering across broader platform teams.
- Lead analysis and remediation of systemic reliability issues and complex production problems, translating recurring incidents into long-term engineering solutions.
- Design, influence, and implement scalable, highly reliable systems using SRE principles including SLI/SLOs, error budgets, and automation-first approaches.
- Drive observability standards across platforms, with a strong focus on metrics, logs, traces, and user-experience-based alerting and dashboards.
- Partner with application, infrastructure, cloud, and support teams to improve availability, performance, capacity, and resiliency of both new and existing systems.
- Lead or contribute to blameless post-mortems, ensuring actionable outcomes and sustained reduction of repeat incidents.
- Translate advanced technical knowledge and enterprise context into clear guidance for senior leadership on reliability risks, priorities, and investment areas.
- Mentor and guide engineers and support staff on reliability best practices, operational standards, and automation opportunities.
- 10 years of experience in Systems Operations, SRE, Platform Engineering, or Production Support, with deep expertise in at least one of the following domains:
- Database platforms
- Cloud platforms
- Network infrastructure
- Compute / Storage platforms
- Middleware platforms
- Enterprise Application Support
- Strong hands-on experience applying SRE concepts such as SLI/SLO definition, error budgets, reliability metrics, and incident-driven engineering improvements.
- Proven experience diagnosing and resolving complex, large-scale production issues across distributed systems.
- Solid understanding of observability tools and practices covering monitoring, alerting, logging, and tracing.
- Experience driving automation and self-service to reduce operational toil and manual interventions.
- Strong communication skills with the ability to influence engineers, partners, and senior leaders.
- Exposure to capacity management, performance engineering, and resiliency design (HA, fault tolerance, RTO/RPO).
- Experience working in environments with hybrid platforms (on-prem and cloud) and complex enterprise dependencies.
- Ability to drive technical debt remediation for critical legacy systems using structured, prioritized backlogs.
- Familiarity with IT service management, incident/problem management, and continuous improvement frameworks.
- Experience mentoring or leading senior engineers in reliability or operations-focused roles.
- Strong collaboration and partnering skills across platform, application, and support teams.
- Ability to manage multiple priorities in a fast-paced, high-impact production environment.
- Consistent delivery of high-quality reliability outcomes within expected timelines.
- High attention to detail, data-driven problem-solving, and operational rigor.
- Prior project or initiative leadership experience is highly desirable.