Site Reliability Engineer
Tricentis
As a Site Reliability Engineer, you'll play a key role in shaping the reliability, scalability, and performance of Tricentis SaaS products. Sitting at the intersection of software engineering and systems engineering, you will apply sound engineering principles and operational discipline to scale our platforms reliably and efficiently.
You will work autonomously on medium-sized projects, support production environments, and drive improvements across deployment, monitoring, and infrastructure systems. Your contributions will directly influence how we build, release, and operate our cloud-based products.
Your Impact as an SRE 🚀
- Own and deliver infrastructure projects end-to-end over a span of weeks with minimal supervision.
- Develop and maintain reliable infrastructure in cloud environments (AWS or Azure) using Terraform, Kubernetes, and GitHub Actions.
- Enhance observability by implementing and tuning monitoring, metrics, and alerting systems.
- Investigate and troubleshoot complex incidents using logs, metrics, traces, and other observability tools.
- Collaborate daily with product engineers and support teams to address reliability and scalability challenges.
- Participate in the on-call rotation, lead incident response, and conduct thorough root cause analysis and postmortems.
- Propose improvements to SRE processes and standards (e.g., deployments, SLOs, CI/CD workflows).
- Champion reliability best practices within your team and contribute to internal documentation and onboarding materials.
As a valuable member of our SRE team, you'll have the opportunity to 💪
- Take ownership of infrastructure components and reliability initiatives end-to-end.
- Serve as a reliability partner for specific product areas, influencing architecture and scaling decisions.
- Improve deployment pipelines and CI/CD processes to accelerate delivery without compromising stability.
- Proactively identify risks and opportunities in our observability stack and act on them.
- Collaborate closely with engineers across teams to deliver resilient and well-monitored systems.
- Contribute to incident response practices and be a first responder during critical outages.
- Share your learnings through documentation, postmortems, and mentoring more junior engineers.
- Suggest process and tooling improvements that help raise the overall reliability bar.
Our Tech Stack 🌐
AWS, Terraform, GitHub Actions, Kubernetes, ArgoCD, DataDog, Prometheus, Grafana, Betterstack, incident.io, Jira, and more
Our Culture 🦄
We don't just preach our values; we embody them in everything we do. We are committed to creating an environment that empowers, supports, and includes individuals, where trust, transparency, creativity, curiosity, and continuous improvement thrive on a daily basis.
About You 🎯
- 2-3+ years of hands-on experience in SRE, DevOps, or CloudOps roles.
- Hands-on experience with AWS and infrastructure-as-code (Terraform).
- Working knowledge of container orchestration with Kubernetes.
- Comfortable with observability tooling and managing alerts (e.g., DataDog, Prometheus, Grafana).
- Skilled in debugging production systems and identifying root causes.
- Familiar with security best practices in infrastructure and deployments.
- Strong communicator and team player with a bias toward action.
- Hands-on experience working with CI/CD pipelines, preferably GitHub Actions.
- Familiarity with SLIs/SLOs and incident management best practices is a plus.