SRE (Site Reliability Engineer)

Supersourcing
Remote (location unspecified)
$130k - $190k (est.) ~market rate
Posted Apr 12, 2026

About This Role

Key technical requirements and skills:

  • Expertise in Java, Spring Boot, Node.js, microservices, RDBMS, and NoSQL

  • Strong experience with AWS services (EC2, S3, Lambda, IAM, ECS, EKS, SQS, Kinesis)

  • Proficiency in observability using Splunk and New Relic

  • Experience with Infrastructure as Code using Terraform

  • Understanding of APIs, event-driven approaches, and security patterns

  • Unix/Linux systems administration and Docker

Team/project information:

  • The role is part of the Site Reliability Engineering team, working closely with the development team and other partner teams to ensure application reliability, efficiency, and performance.

  • The team manages the complex challenges of scale and builds highly reliable, scalable, and secure applications for customers and internal users.

Unique or notable aspects:

  • The role focuses on managing the unique challenges of scale in Digitization.

About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that our services, both our internally critical systems and our externally visible ones, have the reliability and uptime appropriate to our customers' needs, along with a fast rate of improvement. SREs also keep an ever-watchful eye on our systems' capacity and performance.

As a Site Reliability Engineer, you will manage the complex challenges of scale that are unique to Digitization, drawing on your expertise in coding, algorithms, complexity analysis, and large-scale system design. You will provide scalable, reliable, durable, and secure applications for our customers and internal users, and help build highly reliable applications using a customer-first approach while innovating technically. You will understand our customers' needs and how we can meet them.

Responsibilities

  • Work with the Site Reliability Engineering team, the development team, and other partner teams to ensure that applications' reliability, efficiency, and performance meet our customers' needs, while keeping the service's operations reliable, scalable, and automated.

  • Develop and implement projects that improve system reliability, efficiency, and performance.

  • Partner with development teams on feature launches to ensure that our customers receive reliable and scalable functionality.

  • Build deep knowledge of production infrastructure and use it to debug distributed-systems problems and identify improvements to the system.

  • Operations, SLO, and SLA management.

  • Metrics reporting and progress tracking.

  • Be on-call, responding to and managing incidents.

  • Observability (alarms, monitoring, synthetics).

  • Error management.

Qualifications

  • Bachelor's degree in Computer Science or a related engineering degree

  • 8+ years of IT industry experience

  • Strong experience in:

  1. Java, Spring Boot, Node.js, microservices, RDBMS, NoSQL

  2. AWS EC2, S3, Lambda, IAM, ECS, EKS, SQS, Kinesis

  3. Observability using Splunk and New Relic

  4. Infrastructure as Code using Terraform

  5. APIs and event-driven approaches

  6. Security patterns

  7. Unix/Linux systems administration. Familiarity with Docker is a must.

  • Strong experience analyzing and troubleshooting large-scale distributed systems, and the ability to respond quickly to high-severity, customer-impacting incidents.

  • Ability to debug and optimize code and to automate routine tasks.

  • Knowledge of modern software engineering practices and tools, such as Agile and DevOps.

  • Strong communication skills and the ability to explain complex technical matters in an easy-to-understand way.

Originally posted on Himalayas