Summary:
We are seeking a highly skilled and experienced Site Reliability Engineer to join our team. The ideal candidate will be responsible for ensuring the reliability, availability, and performance of our systems and applications. As a Site Reliability Engineer, you will collaborate closely with development teams to enhance the reliability of our services and streamline release procedures. Additionally, you will play a key role in system design consulting, platform management, and capacity planning.
Responsibilities:
- Gather and analyze metrics from operating systems and critical application services to facilitate quick identification of issues and faults.
- Partner with development teams to improve the reliability of application services and optimize release procedures.
- Participate in system design consulting, platform management, and capacity planning activities.
- Maintain a balance between feature development and reliability while adhering to well-defined service level objectives.
- Maintain full control over, and a thorough understanding of, production environments to expedite issue identification and mitigation.
- Define Service Level Agreements (SLAs), Service Level Objectives (SLOs), and Service Level Indicators (SLIs) aligned with organizational requirements.
- Administer Linux servers (Ubuntu and Amazon Linux) effectively.
- Demonstrate expertise in incident management, on-call processes, and the Software Development Life Cycle (SDLC).
- Maintain and update the internal knowledge base.
- Reduce toil through automation, using tools and scripting languages such as Terraform, Bash, and Python.
- Demonstrate familiarity with web servers, application servers, and load balancers, including Apache, Nginx, Tomcat, and HAProxy.
- Administer AWS and Azure cloud services, including but not limited to EC2, RDS, S3, ECS, SNS, SES, CloudWatch, CDN, WAF, CloudFront, CloudTrail, Route 53, VPC, routing, API Gateway, Lambda, IAM roles, security groups, ElastiCache, Memcached, DynamoDB, CodeDeploy, CodeBuild, and serverless technologies.
- Implement and manage container schedulers and orchestration systems such as Kubernetes, Docker Swarm, OpenShift, or AWS EKS/ECS.
- Implement and maintain monitoring and observability platforms, including the Elastic Stack, Grafana, Prometheus, Graphite, and APM tools such as New Relic, AppDynamics, or Dynatrace.
- Assess impact, devise release strategies, manage deployments, and handle incidents and changes effectively.
- Demonstrate excellent written and verbal communication skills, with the ability to collaborate with multiple teams and stakeholders.
Qualifications, Experience, and Competencies:
- Bachelor’s degree in Computer Science or a related STEM field; candidates who majored in Mathematics are preferred.
- Minimum of 4 years of experience in Site Reliability Engineering, with strong hands-on proficiency in the technologies listed in the Responsibilities section.
- Certification in any of the relevant technologies is highly desirable. We value candidates who can back their certifications with detail-oriented problem-solving skills and deep knowledge of the products.
- Strong understanding of microservice architecture and patterns.
- Understanding of how to enable observability for various application stacks.
- Knowledge of observability, high availability, and scalability for relational database management systems (RDBMS) and NoSQL databases.
- Experience with Infrastructure as Code (IaC) tools such as Terraform, Pulumi, or CloudFormation for automated provisioning in cloud and on-premises environments.
- Proficiency in tools such as Jira and Confluence for effective project management and documentation.
Apply to career@avrioc.com