Description
Role:- Site Reliability Engineer (SRE) with advanced DevOps expertise
Location:- Toronto, ON – 4 Days Onsite
Type:- Contract
Skills
SRE, Kubernetes, Splunk, Bash, Docker, Terraform, GitLab
Additional Comments:-
Job Summary: We are seeking an experienced Site Reliability Engineer (SRE) with advanced DevOps expertise to help build, scale, and maintain our infrastructure and services.
You will play a critical role in ensuring high availability, performance, scalability, and security of our production systems, while enabling continuous deployment and rapid delivery of features to our customers.
Key Responsibilities:-
· Design, build, and maintain reliable, scalable, and secure cloud-based infrastructure (AWS, Azure, or GCP).
· Develop and improve observability using monitoring, alerting, logging, and tracing tools (e.g., Prometheus, Grafana, ELK, Datadog).
· Automate repetitive tasks and infrastructure using Infrastructure-as-Code (Terraform, CloudFormation, Pulumi).
· Create and maintain CI/CD pipelines (e.g., GitHub Actions, GitLab CI, Jenkins, Argo CD) to support fast and safe delivery.
· Lead incident response, root cause analysis, and postmortems to ensure high uptime and rapid recovery.
· Optimize system performance, reliability, and cost-effectiveness through proactive monitoring and tuning.
· Collaborate with software engineering teams to define SLAs/SLOs and improve service reliability.
· Implement and maintain security best practices across environments (e.g., secrets management, IAM, firewalls).
· Maintain disaster recovery plans, backups, and high-availability strategies.
Qualifications: Required:-
· 8 years of experience as an SRE or DevOps Engineer, or in a similar role.
· Proficiency in scripting and automation (Bash, Python, Go, etc.).
· Strong experience with containerization and orchestration (Docker, Kubernetes, Helm).
· Solid understanding of Linux systems administration and networking fundamentals.
· Experience with cloud platforms (AWS, Azure, or GCP).
· Experience with IaC tools like Terraform or CloudFormation.
· Familiarity with GitOps and modern deployment practices.
· Hands-on experience with observability tools (e.g., Prometheus, Grafana, Datadog).
· Strong troubleshooting and incident response skills.