Exp Level: 6+ Yrs
Relevant Exp: 6+ Yrs
Technical Skills: At least 6 years of experience using, configuring, and building monitoring tools: Splunk, AppDynamics, OneConsole, Argo CD
Role: SRE Lead (SME)
Key Responsibilities:
Team Leadership: Leading and mentoring the SRE team, ensuring they have the resources and guidance needed to perform their roles effectively.
System Design and Architecture: Overseeing the design and architecture of reliable systems, ensuring scalability, fault tolerance, and high availability.
Incident Management: Coordinating response to incidents, conducting post-mortems, and implementing measures to prevent recurrence. Jira knowledge is a must.
Monitoring and Performance: Setting up and maintaining monitoring tools and dashboards to track system performance and detect issues proactively.
Automation: Developing and promoting automation for repetitive tasks to reduce human error and improve efficiency. AI/ML automation experience is preferable.
Collaboration: Working closely with development, operations, and other cross-functional teams to ensure smooth integration and deployment of new features.
Capacity Planning: Analyzing system capacity and planning for future growth to ensure the infrastructure can handle increased demand.
SLA/SLO Management: Defining and managing Service Level Agreements (SLAs) and Service Level Objectives (SLOs) to meet business requirements.
Continuous Improvement: Identifying areas for improvement in system reliability and performance and driving initiatives to address them.
Documentation: Ensuring proper documentation of systems, processes, and incident responses to maintain knowledge sharing and consistency.
Example Daily Activities:
Reviewing system performance metrics and addressing any anomalies.
Leading incident response calls and coordinating with relevant teams.
Meeting with stakeholders to discuss reliability goals and progress.
Developing scripts and automation tools for system maintenance tasks.
Conducting training sessions for team members on best practices.
Planning and executing system upgrades and infrastructure improvements.
Knowledge of DevOps tools such as GitLab is a plus; strong knowledge of managing CI/CD pipelines is expected.
Experience with incident management tools.
Experience with Kubernetes and an understanding of operating systems and databases.
Experience working with Jira, GitLab, and version control tools.
Work shift time: 1:30 PM - 10:30 PM.