Stabilization of the Connectivity Platform at TeamViewer
Background
TeamViewer faced massive challenges with its Connectivity Platform. Perceived availability was low; customers reported numerous incidents over weeks and months. Clear metrics were lacking, processes were insufficiently defined, and responsibilities were unclear. Trust in the platform and the team was steadily declining.
Challenge
The biggest difficulty was the lack of measurability. Without objective metrics for availability, discussions and decisions were based on assumptions and rumors – solid management was impossible.
Approach
To sustainably stabilize the platform, a package of measures was implemented:
- Measurability: Introduction of clear metrics (Uptime, MTBF, MTTR, MTTD).
- Transparency: Monitoring screens for R&D, additionally displaying platform availability in the headquarters.
- Processes: Establishment of Incident, Change, and Problem Management.
- Tools: Integration of modern tools (Jira, Slack, OpsGenie, StatusPage; monitoring with Grafana, Prometheus, Zabbix).
- Automation: Automated communication and escalation processes for incidents.
- On-Call Structure: Introduction of an official on-call service (SREs, operators, developers).
- Communication: Introduction of status updates and communication rules depending on incident severity.
- Post-Mortems: Regular reviews for root cause analysis and tracking of improvements.
- Reporting: Monthly reports on availability, trends, and measures.
Result
Within a few months, the platform was stabilized and an availability of 99.9x% was achieved.
The combination of measurability, processes, and clear responsibilities not only ensured technical stability but also strengthened the trust of customers and internal stakeholders.
Success Factors
Facts instead of gut feeling
- Introduction of clear metrics (Uptime, MTBF, MTTR, MTTD) → discussions were no longer based on assumptions, but on data.
Radical Transparency & Sense of Urgency
- Monitoring screens and availability displays in the headquarters → availability was suddenly as present as revenue.
- This constant pressure of visibility created a strong Sense of Urgency throughout the company.
Agile Process Discipline
- Clear structures through Incident, Change, and Problem Management → less chaos, faster reactions, predictable improvements.
Automation and Tools with Purpose
- Automated workflows (Slack channels for incidents, escalation chains, reporting) → teams focused on the truly critical problems.
Cultural Change through On-Call & Post-Mortems
- Shared responsibility for availability (SREs, operators, developers).
- Learning culture through structured post-mortems → sustainable improvements instead of blame.
Clear Communication at All Levels
- From operational status updates to monthly reports for management → trust regained, expectations managed.
PROJECT TYP
CLIENT