Senior Software Engineer
Job Description
As a Senior Site Reliability Engineer, you will focus on detecting, triaging, and mitigating OCI service-impacting events quickly and efficiently. You will be responsible for minimising downtime by delivering exceptional major incident management and ensuring the reliability, scalability, performance, and security of the systems that prevent incidents from occurring. Your work will directly contribute to reducing event duration by leveraging your operational expertise, best practices, and the ability to develop tools that automate and improve incident management processes.
Oracle Cloud is cutting-edge and continuously evolving. When issues arise, your team will respond within minutes to mitigate customer impact and ensure service continuity. This role will give you deep insight into the inner workings of OCI's systems and operations. You'll collaborate with and influence leaders across Oracle, driving organisational initiatives aimed at continually improving OCI-wide service availability. As part of an agile, high-impact team, you will play a crucial role in shaping the future of Oracle Cloud.
If you're excited to be part of a fast-moving team that's pushing the boundaries of innovation, we'd love to connect with you!
We are looking for candidates who are flexible to work APAC shift hours (6 AM to 2 PM IST) on a rotating roster, including occasional weekends and public holidays.
Career Level - IC3
Responsibilities
Oracle's Cloud is innovative and constantly evolving. When it experiences issues, your team will respond within minutes to ensure customer impact is mitigated. This experience will expose you to the inner workings of OCI's systems and organisations. You will interact with and influence leaders from across the Oracle business and will drive broad cross-organisation programs meant to iteratively improve OCI-wide service availability. We are an agile team with significant impact. If you want to be a part of a fast-moving team breaking new ground, we would like to speak with you!
- Lead major incident recovery by orchestrating cross-functional collaboration, driving rapid escalation, clear communication, and seamless stakeholder alignment to ensure swift and effective resolution.
- Identify opportunities to automate and streamline critical incident workflows, taking full ownership of developing and implementing innovative solutions to enhance efficiency and drive faster resolutions.
- Leverage deep expertise in cloud computing design patterns and dependencies to proactively mitigate complex major incidents and optimize cloud-based solutions, and to quickly diagnose root causes, mitigate impact, and implement long-term fixes.
- Troubleshoot cloud infrastructure issues using observability platforms to monitor, analyse, and resolve performance and reliability challenges.
- Continuously improve operational processes, tools, and workflows to enhance the reliability and efficiency of the cloud infrastructure.
Qualifications
- Bachelor's degree or higher in Computer Science or a related field, or equivalent work experience.
- 5+ years of experience in Site Reliability Engineering (SRE), DevOps, or Systems Engineering.
- Extensive hands-on experience with public cloud operations (e.g., AWS, Azure, GCP, OCI).
- Proven track record in Major Incident Management within cloud-based environments, with the ability to drive effective incident resolution.
- Strong understanding of automation and orchestration principles, with a focus on improving system reliability and efficiency.
- Proficiency in at least one modern object-oriented programming language (e.g., Python, Java, or Go).
- Solid experience in software engineering best practices, including Agile methodologies, coding standards, code reviews, version control, build processes, testing, and operations.
- Familiarity with infrastructure automation tools such as Chef, Ansible, Jenkins, and Terraform.
- Expertise in several key technologies, including Infrastructure-as-a-Service (IaaS), CI/CD systems, Docker, RESTful APIs, log analysis, and debugging tools.
- Experience with observability platforms such as Grafana, Prometheus, and other monitoring, logging, and tracing tools to optimize system visibility, performance, and issue resolution.
- Strong leadership, project planning, and communication skills, with a demonstrated ability to manage and execute complex initiatives.
- Excellent analytical and problem-solving skills, with the ability to troubleshoot and resolve technical issues quickly and efficiently.
- Proven ability to lead high-impact major incidents in cloud-based environments, driving resolution and improvement.
- Ability to manage multiple competing priorities in a fast-paced environment while maintaining focus on key objectives.
- Strong communication skills, with the ability to effectively engage both technical and non-technical stakeholders at all levels.
- Confidence to lead and manage large conference calls during incidents, ensuring alignment and timely decision-making.
- Experience working with distributed, service-oriented architectures and managing system reliability at scale.
About Us
As a world leader in cloud solutions, Oracle uses tomorrow's technology to tackle today's problems. True innovation starts with diverse perspectives and various abilities and backgrounds.
When everyone's voice is heard, we're inspired to go beyond what's been done before. It's why we're committed to expanding our inclusive workforce that promotes diverse insights and perspectives.
We've partnered with industry leaders in almost every sector and continue to thrive after 40+ years of change by operating with integrity.
Oracle careers open the door to global opportunities where work-life balance flourishes. We offer a highly competitive suite of employee benefits designed on the principles of parity and consistency. We put our people first with flexible medical, life insurance and retirement options. We also encourage employees to give back to their communities through our volunteer programs.
We're committed to including people with disabilities at all stages of the employment process. If you require accessibility assistance or accommodation for a disability at any point, let us know by calling +1 888 404 2494, option one.
Disclaimer:
Oracle is an Equal Employment Opportunity Employer*. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, national origin, sexual orientation, gender identity, disability and protected veterans status, or any other characteristic protected by law. Oracle will consider for employment qualified applicants with arrest and conviction records pursuant to applicable law.
* Which includes being a United States Affirmative Action Employer