What are the responsibilities and job description for the Incident Manager onsite in San Diego, CA position at TestingXperts?
Responsibilities
- Manage incident management bridge calls with support teams, on-call application support teams, and management. Manage, escalate, report status on, and help coordinate repair efforts for all major incidents (P1 – P4).
- Provide regular communication updates to the customer, end users, and other stakeholders throughout the entire incident management cycle.
- Track and document incident updates in real time.
- Handle major incidents with presence of mind and innovation, since they are highly escalated cases.
- Support the development and execution of change management plans to drive adoption and utilization of new processes, systems, and technologies.
- Review changes, including their priority and urgency, and perform risk analysis.
- Create problem tickets and their respective action items, and review root cause analyses and their closure.
- Prepare PIR and postmortem reports.
- Lead site reliability, disaster recovery, game day, switchover, and failover activities.
- Experience with multiple monitoring and collaboration tools such as ServiceNow, PagerDuty, Slack, Zoom, and Jira.
- Perform quality audits and data analytics on incident tickets to ensure quality and uncover new trends.
- Meet the agreed SLAs and other KPIs, and produce the process performance reports.
- Provide documentation for the Known Error Database (KEDB) or a similar repository.
- Develop processes and procedures that ensure incident management action items are tracked and completed.
- Ensure process adherence and meet quality norms.
- Provide management reporting on incident metrics and incident management performance.
Qualifications/Skills Required
- Degree in Computer Science, Information Technology, or a related field.
- 7-10 years of experience in incident management or a related field.
- Knowledge of cloud services (AWS/Azure/GCP) is a must.
- Advanced proficiency in site reliability culture and principles, with the ability to demonstrate how to implement site reliability across platform teams while avoiding common pitfalls.
- Should be able to plan and conduct site reliability testing.
- Should have experience in Application Management Services (AMS).
- Knowledge of incident management/change management/problem management processes and procedures.
- Experience with and knowledge of change management principles, methodologies, and tools.
- Excellent problem-solving and analytical skills.
- Excellent verbal & written communication and interpersonal skills.
- Ability to work independently and as part of a team.
- Ability to manage multiple tasks simultaneously.
- Note: This is NOT an infrastructure support role. It is a semi-technical role that supports an environment hosted 100% in the cloud and drives the resolution of application-related issues.