Description
Job Title: AI-Assisted SRE / AIOps Lead Engineer
Location: Remote
Employment Type: Contract
Role Overview
We are seeking a highly skilled and hands-on AI-Assisted SRE / AIOps Lead Engineer to lead the operationalization and scaling of an SRE agent-driven operations model. This role combines Site Reliability Engineering, automation, production operations, and AI-assisted workflow enablement to modernize operational practices and improve system reliability.
This is not a traditional support or coordination role. The ideal candidate will be a technical builder and operator who can independently assess risk, validate AI-driven recommendations, and apply sound operational judgment in high-impact production environments. You should be comfortable leading a small team while actively contributing at the technical level.
Key Responsibilities
- Lead the adoption, onboarding, and operationalization of SRE agent-driven workflows across reliability and support functions
- Translate existing scripts, runbooks, SOPs, and operational procedures into scalable, agent-compatible workflows
- Evaluate and determine which operational activities should remain manual, become semi-automated, or be fully automated
- Validate AI-generated recommendations, remediation actions, and workflow outputs before production implementation
- Support production releases, release validation, smoke testing, and post-deployment system health checks
- Drive troubleshooting efforts during production incidents and ensure timely resolution with thorough root cause analysis
- Improve alert management, event correlation, and incident response effectiveness
- Partner with engineering, platform, and operations teams to onboard new workflows and drive process improvements
- Develop and maintain operational documentation, standards, and reusable runbooks
- Mentor junior engineers and provide technical guidance on workflow design, operational execution, and validation practices
- Continuously identify opportunities to modernize legacy operational processes and improve efficiency
Required Experience
- 5–10 years of hands-on experience in Site Reliability Engineering, cloud operations, production engineering, platform operations, or IT operations
- Strong experience supporting and troubleshooting production environments
- Demonstrated experience with automation, incident management, and operational process improvement
- Experience working with release support processes and production validation activities
- Exposure to AI-assisted operations, AIOps platforms, or automation-led support models is highly preferred
- Experience leading initiatives while remaining deeply involved in hands-on execution
Required Technical Skills
- Strong scripting expertise in:
  - Python
  - PowerShell
  - Shell/Bash
- Hands-on experience with:
  - Monitoring and observability platforms
  - Logging systems and dashboards
  - Alerting and incident workflows
  - Production support and release validation processes
  - Cloud platforms, preferably Azure
  - ITSM/ticketing platforms such as ServiceNow, Jira, or equivalent
  - APIs, integrations, and automation pipelines
- Working knowledge of, or exposure to:
  - Kubernetes / AKS
  - AI productivity and operational tools such as ChatGPT and Copilot
  - Modern automation and orchestration practices
Critical Soft Skills
- Strong analytical and structured problem-solving skills
- Ability to operate effectively in ambiguous environments with incomplete documentation
- Strong ownership mindset with the ability to independently drive outcomes
- Excellent judgment during high-pressure production incidents
- Ability to challenge assumptions and validate AI-assisted recommendations rather than relying on them blindly
- Creative approach toward transforming and modernizing legacy operational workflows
- Strong communication and collaboration skills across technical and non-technical teams
Ideal Candidate Profile
- Hands-on builder/operator rather than a pure coordinator or process manager
- Comfortable balancing automation with operational governance and control
- Able to independently assess:
  - Risk impact
  - Blast radius
  - Rollback strategies
  - Safe execution practices
- Capable of leading a small team while continuing to contribute technically on a day-to-day basis
- Practical mindset with a strong focus on operational excellence and reliability engineering
This role is ideal for someone who enjoys combining AI-assisted operations, automation, and modern SRE practices to build scalable and reliable operational systems.





