Resolved
Production Platform Services Outage
Update 3: Resolved (2026-04-16, 02:20 UTC)
All production services have been fully restored. Both worker nodes have been replaced and all application pods are running normally.
Root cause: Two worker nodes in our production Kubernetes cluster became unresponsive due to CPU resource exhaustion on undersized instances. The failure cascaded across both nodes within minutes, taking all services offline.
Actions taken:
- Replaced both affected nodes with fresh instances
- Applied safeguards to prevent concurrent background job pile-up
- Planning further capacity improvements to prevent recurrence
We will continue monitoring the environment closely over the next 24 hours.
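For readers curious what a safeguard against concurrent background job pile-up can look like, here is a minimal sketch in Python. It is illustrative only and is not the safeguard applied in this incident. The class name `SkipIfRunning` and its `run` method are hypothetical. The idea is that a scheduled job is skipped when the previous run is still in progress, so slow runs cannot stack up and exhaust CPU.

```python
import threading


class SkipIfRunning:
    """Hypothetical guard: run a job only if no earlier run is still active."""

    def __init__(self):
        self._lock = threading.Lock()
        self.skipped = 0  # count of runs skipped because one was in progress

    def run(self, job):
        # Non-blocking acquire returns False immediately if the lock is held,
        # so an overlapping run is dropped rather than queued behind the first.
        if not self._lock.acquire(blocking=False):
            self.skipped += 1
            return False
        try:
            job()
            return True
        finally:
            self._lock.release()
```

A scheduler would call `guard.run(job)` on every tick. A tick that arrives while the previous run is still working returns `False` right away instead of starting a second copy. This only guards jobs within a single process. Coordinating across pods would need a shared lock or a scheduler-level policy.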
Update 2: Identified (2026-04-16, 02:00 UTC)
We have identified the root cause as infrastructure resource exhaustion on our Kubernetes worker nodes. We are replacing the affected nodes to restore service. New nodes are joining the cluster now.
Update 1: Investigating (2026-04-16, 01:50 UTC)
We are investigating reports of our production platform services being unreachable. Our Kubernetes cluster nodes are showing as unhealthy. All services, including CertChain Platform API, CertChain Payments API, and CITB Parser API, are affected.
We are actively working to restore service and will provide updates as we progress.
Affected Services
- Backend
- Payment
- Passport API
Impact Duration
2026-04-16, 00:02 UTC - 02:20 UTC (~2 hours 18 minutes)