Senior Site Reliability Engineer
Overview
Azure Cosmos DB is Microsoft’s next-generation globally distributed, massively scalable, multi-model cloud database service. It is designed to enable developers to build planet-scale applications. Azure Cosmos DB is one of the fastest-growing Azure services. Joining the Azure Cosmos DB team is a fantastic opportunity to work with highly talented engineers operating like a startup, and to deliver on our next set of big challenges.
As a Senior Site Reliability Engineer, you will identify and deliver software improvements using your expertise in software development, complexity analysis, and scalable system design to ensure services and systems are highly stable and performant and meet the expectations of our customers. You will work closely with other engineering teams and provide a holistic view of our cloud service.
Responsibilities
- Identify opportunities and drive the design and implementation of end-to-end telemetry, alerting, self-healing and automation capabilities to improve service health, manageability, and reliability.
- Participate in on-call rotations and own, triage, investigate, and resolve service issues, with an emphasis on broad communication, learning, and teaching throughout the process.
- Interact with customers / support representatives and communicate on a deeply technical level with product engineering and product management teams to evolve services.
- Own availability, performance, and supportability targets for the service.
- Author functional and technical documentation and remain current on relevant technologies and procedures.
Qualifications
Knowledge, experience and skills required:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent industry experience.
- 6+ years of experience writing tools, automation and scripting (PowerShell, Python, or similar), programming (C++, C#, or equivalent), and enhancing subcomponents within and around services/products to deliver and manage software in production. Experience that aids understanding of distributed systems and networking is preferred.
- 6+ years of troubleshooting/debugging experience: telemetry-based analysis (KQL or equivalent preferred), troubleshooting skills across network, hardware, and distributed service layers, with demonstrated ability to debug, fix, and optimize code.
- Good communication skills, both verbal and written.