What are the responsibilities and job description for the Observability and AI Enterprise Architect position at ClifyX?
Role: Observability and AI Enterprise Architect
Location: Edison, NJ
Full-time Position
Travel: 50%
Job Description
Client Cloud Unit is looking for an experienced Observability and AI Enterprise Architect to design and implement enterprise-grade observability and AI solutions that provide deep visibility into infrastructure, applications, and networks, and that transform IT operations. This role requires expertise in leading observability platforms and hands-on experience in IT operations, combined with the ability to integrate AI-driven solutions for IT operations (AIOps) using cutting-edge technologies such as LLMs, agentic frameworks, and industry-leading platforms like Anthropic, OpenAI, Bedrock, and Gemini.
- Design and deploy observability frameworks leveraging tools such as Grafana, Dynatrace, Prometheus, ELK, and Splunk. Define best practices for monitoring, alerting, and visualization across hybrid and multi-cloud environments.
- Develop strategies for monitoring KPIs tied to business outcomes (e.g., sales performance, supply chain efficiency, customer experience).
- Collaborate with business and IT teams to identify key metrics and integrate them into dashboards and alerting systems.
- Implement AIOps solutions using industry-leading platforms like OpenAI, AWS Bedrock, Google Gemini, Anthropic, and similar technologies.
- Develop predictive analytics and anomaly detection models to proactively identify and resolve operational issues.
- Integrate observability tools with ITSM platforms and automation workflows. Enable automated root cause analysis and remediation using AI/ML models.
- Provide observability strategies for infrastructure (servers, storage, cloud), applications (microservices, APIs), and networks (LAN/WAN, SD-WAN). Collaborate with DevOps, SRE, and IT operations teams to ensure end-to-end visibility and reliability.
- Establish observability standards, KPIs, and SLAs for performance and availability. Ensure compliance with security and regulatory requirements in monitoring solutions.
- Develop scalable architecture using LLMs, agentic frameworks, and multi-modal AI technologies.
- Build AI-powered analytics platforms for IT operations analysis, anomaly detection, and predictive insights.
- Architect and deploy intelligent chatbots for IT support and self-service capabilities.
- Integrate AI solutions with existing IT operations tools and workflows.
- Implement automated remediation and root cause analysis using AI/ML models.
Qualifications:
- 10-13 years of relevant experience.
- Hands-on experience with Grafana, Dynatrace, and other monitoring platforms.
- Practical experience implementing AI-based solutions for anomaly detection, predictive maintenance, and automated remediation. Familiarity with OpenAI, Bedrock, Gemini, Anthropic, or similar AI platforms.
- Strong understanding of infrastructure, application architectures, and networking. Experience with cloud platforms (AWS, Azure, GCP) and container orchestration (Kubernetes).
- Proficiency in Python, Bash, or similar scripting languages for automation and integration.
- Strong experience with LLMs (OpenAI, Anthropic, Gemini, Bedrock) and agentic AI solutions.
- Hands-on experience in designing AI architectures for enterprise IT environments.
- Proficiency in Python or similar languages for AI model integration and automation.