TikTok

Driving efficiency at scale: Enhancing Kubernetes deployment for TikTok developers

To ensure rapid, uninterrupted content delivery, TikTok invests heavily in building its own data centers and developing a custom Kubernetes deployment platform.

As the lead designer of this platform, I designed several features that improved resource efficiency, resulting in $20 million in cost savings.

Role

Lead product designer

Duration

May 2024 - Nov 2024

Team

1 product manager, 6 engineers

Background

To deploy TikTok applications like Live Streaming and Video on Demand to edge data centers, teams must reserve resources—such as compute and storage—across multiple Kubernetes clusters in advance. This ensures their applications run smoothly and perform optimally.

Problem

When limited resources are shared across teams, inefficiencies arise, leading to significant waste.

User - Hoarding and inequity

Platform - High operational costs from manual intervention

PROBLEM STATEMENT

How might we design a Kubernetes resource reservation system that reduces hoarding and promotes efficient utilization?

Solutions

A two-pronged strategy: clearly expose inefficiencies, and make it easier for users to promptly adjust their resource reservations.

1. Exposing inefficiencies through reports and alerts

The daily project report provides a clear summary of resource utilization across all reserved resources. It breaks down usage by Kubernetes clusters, enabling users to identify imbalances and make informed adjustments.

Weekly alert notifications are sent via internal communication channels, guiding users back to the platform to take timely and relevant actions.
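As a rough illustration of the kind of calculation such a report relies on, the sketch below computes utilization (used divided by reserved) per cluster and flags low-utilization reservations. It is not the platform's actual implementation: the `Reservation` fields, the cluster names, the sample numbers, and the 30% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Reservation:
    cluster: str     # hypothetical cluster name
    resource: str    # e.g. "cpu", "memory"
    reserved: float  # amount reserved in advance
    used: float      # amount actually consumed


def utilization_by_cluster(reservations):
    """Return {cluster: {resource: used / reserved}}."""
    report = {}
    for r in reservations:
        rate = r.used / r.reserved if r.reserved > 0 else 0.0
        report.setdefault(r.cluster, {})[r.resource] = rate
    return report


def over_reserved(report, threshold=0.3):
    """List (cluster, resource, rate) entries whose utilization is below the threshold."""
    return sorted(
        (cluster, resource, rate)
        for cluster, resources in report.items()
        for resource, rate in resources.items()
        if rate < threshold
    )


# Made-up sample data, for illustration only.
sample = [
    Reservation("edge-a", "cpu", reserved=100, used=80),
    Reservation("edge-a", "memory", reserved=200, used=40),
    Reservation("edge-b", "cpu", reserved=50, used=10),
]
report = utilization_by_cluster(sample)
flagged = over_reserved(report)
```

Here `flagged` holds the reservations that a weekly alert might ask a team to review.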

2. Enhancing the usability of the current interface

To bridge the gap between insight and action, I redesigned the reservation interface to reduce the cognitive load of interpreting complex relationships between clusters and resources—making it easier for users to spot and correct over-reservations with minimal friction.

Information architecture explorations

Data visualization explorations

BEFORE
AFTER

Impact

Both north-star and supporting metrics were tracked; here are the results after three months.

North-star metric - cost savings: 20,000,000 USD

Average project resource utilization rate
CPU: 32%
Memory: 25%
Storage: 16%
Bandwidth: 15%

BGE+ Admin support metrics
Average monthly on-call tickets: 25%
Reduced on-call time: 2780 minutes

To learn more about this project

There’s so much more behind this project—deep user research, strategic decisions, roadmap twists, and countless iterations. If you’re curious about the full journey, I’d love to chat and share more!
