Description

Job Title: AI-Assisted SRE / AIOps Lead Engineer

Location: Remote

Employment Type: Contract

Role Overview

We are seeking a highly skilled, hands-on AI-Assisted SRE / AIOps Lead Engineer to drive the operationalization and scaling of an SRE agent-driven operations model. This role combines Site Reliability Engineering, automation, production operations, and AI-assisted workflow enablement to modernize operational practices and improve system reliability.

This is not a traditional support or coordination role. The ideal candidate will be a technical builder and operator who can independently assess risk, validate AI-driven recommendations, and apply sound operational judgment in high-impact production environments. You should be comfortable leading a small team while actively contributing at the technical level.

Key Responsibilities

  • Lead the adoption, onboarding, and operationalization of SRE agent-driven workflows across reliability and support functions
  • Translate existing scripts, runbooks, SOPs, and operational procedures into scalable, agent-compatible workflows
  • Evaluate and determine which operational activities should remain manual, become semi-automated, or be fully automated
  • Validate AI-generated recommendations, remediation actions, and workflow outputs before production implementation
  • Support production releases, release validation, smoke testing, and post-deployment system health checks
  • Drive troubleshooting efforts during production incidents and ensure timely resolution with thorough root cause analysis
  • Improve alert management, event correlation, and incident response effectiveness
  • Partner with engineering, platform, and operations teams to onboard new workflows and drive process improvements
  • Develop and maintain operational documentation, standards, and reusable runbooks
  • Mentor junior engineers and provide technical guidance on workflow design, operational execution, and validation practices
  • Continuously identify opportunities to modernize legacy operational processes and improve efficiency

Required Experience

  • 5–10 years of hands-on experience in Site Reliability Engineering, cloud operations, production engineering, platform operations, or IT operations
  • Strong experience supporting and troubleshooting production environments
  • Demonstrated experience with automation, incident management, and operational process improvement
  • Experience working with release support processes and production validation activities
  • Exposure to AI-assisted operations, AIOps platforms, or automation-led support models is highly preferred
  • Experience leading initiatives while remaining deeply involved in hands-on execution

Required Technical Skills

  • Strong scripting expertise in:
      • Python
      • PowerShell
      • Shell/Bash
  • Hands-on experience with:
      • Monitoring and observability platforms
      • Logging systems and dashboards
      • Alerting and incident workflows
      • Production support and release validation processes
      • Cloud platforms, preferably Azure
      • ITSM/ticketing platforms such as ServiceNow, Jira, or equivalent
      • APIs, integrations, and automation pipelines
  • Working knowledge of or exposure to:
      • Kubernetes / AKS
      • AI productivity and operational tools such as ChatGPT and Copilot
      • Modern automation and orchestration practices

Critical Soft Skills

  • Strong analytical and structured problem-solving skills
  • Ability to operate effectively in ambiguous environments with incomplete documentation
  • Strong ownership mindset with the ability to independently drive outcomes
  • Excellent judgment during high-pressure production incidents
  • Ability to challenge assumptions and validate AI-assisted recommendations rather than relying on them blindly
  • Creative approach to transforming and modernizing legacy operational workflows
  • Strong communication and collaboration skills across technical and non-technical teams

Ideal Candidate Profile

  • Hands-on builder/operator rather than a pure coordinator or process manager
  • Comfortable balancing automation with operational governance and control
  • Able to independently assess:
      • Risk impact
      • Blast radius
      • Rollback strategies
      • Safe execution practices
  • Capable of leading a small team while continuing to contribute technically on a day-to-day basis
  • Practical mindset with a strong focus on operational excellence and reliability engineering

This role is ideal for someone who enjoys combining AI-assisted operations, automation, and modern SRE practices to build scalable and reliable operational systems.