Site Reliability Engineering Director
ONUM
Company
Onum is a data optimization and analytics company based in Madrid. We specialize in real-time data analysis to enable rapid decision-making regarding cybersecurity, network performance, and infrastructure management. Onum helps you optimize your data analytics costs by reducing data, avoiding vendor lock-in, and aligning the value of each dataset with actions taken.
About the Role
As the Director of Site Reliability Engineering, you will lead a small but high-impact team of SREs focused on ensuring the reliability, scalability, and efficiency of our infrastructure. This role combines strategic thinking with technical leadership, giving you the opportunity to shape our reliability practices while remaining close to day-to-day operations.
You will collaborate with Engineering, DevOps, and Product teams to embed reliability into everything we build, and drive continuous improvement across systems, processes, and automation. Your leadership will be critical in setting standards, prioritizing initiatives, and elevating our platform's resilience.
Responsibilities
Team Leadership & Development:
- Lead, mentor, and develop a team of 5 Site Reliability Engineers, fostering a culture of technical excellence, accountability, and collaboration.
- Set clear goals and expectations, conduct regular one-on-ones, and support career growth.
- Partner with recruiting to attract and hire top SRE talent.
Technical Strategy & Direction:
- Define the team’s roadmap in alignment with company priorities, focusing on scalability, reliability, and automation.
- Lead key technical initiatives, including infrastructure modernization, observability, and incident response improvements.
- Establish and promote SRE best practices across teams.
Hands-on Technical Leadership:
- Lead by example by participating in technical discussions, incident resolution, and troubleshooting critical system issues.
- Provide guidance on best practices for system reliability, automation, and performance optimization.
- Support the team in designing and implementing reliable, scalable cloud infrastructure, ensuring smooth deployment pipelines and reducing manual toil.
Incident Management & Operational Excellence:
- Own the on-call process and incident response framework, ensuring effective resolution, communication, and postmortems.
- Continuously improve monitoring, alerting, and system health metrics to detect and respond to issues proactively.
- Reduce operational toil through automation and process optimization.
Cross-functional Collaboration:
- Work closely with Engineering, Product, DevOps, and Security teams to ensure reliability is embedded throughout the development lifecycle.
- Serve as a subject matter expert in reliability to influence technical and product direction.
Automation & Process Improvement:
- Identify opportunities for automation in daily operations, helping to improve deployment speed, incident response, and reliability of the platform.
- Ensure the team is leveraging infrastructure-as-code (e.g., Terraform) and other automation tools to reduce manual processes and increase scalability.
Operational Metrics & Monitoring:
- Work with your team to ensure systems are well-monitored and metrics are effectively captured using tools like Prometheus, Grafana, or Datadog.
- Track key performance indicators (KPIs) for system uptime, reliability, and team performance, identifying areas for continuous improvement.
Qualifications
- 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role, including at least 5 years of experience leading a small team or mentoring junior engineers.
- Strong understanding of cloud platforms (AWS, GCP, or Azure) and modern infrastructure practices (e.g., containerization with Docker/Kubernetes, CI/CD pipelines).
- Hands-on experience with infrastructure-as-code tools (Terraform, Ansible, etc.) and cloud automation.
- Proven ability to troubleshoot complex infrastructure issues, perform root cause analysis, and implement system improvements.
- Experience with monitoring and alerting systems like Prometheus, Grafana, Datadog, or equivalent.
- Excellent communication and collaboration skills, with the ability to work cross-functionally and explain technical concepts to non-technical stakeholders.