Postmortem: INC198745 – Login errors for OIDC service
On Sunday, April 6th, customers using OIDC for login experienced a service interruption beginning at 1:00 a.m. PT (4:00 a.m. ET). The issue was resolved by 4:20 a.m. PT (7:20 a.m. ET), approximately 1 hour and 20 minutes after the end of our scheduled maintenance window.
Point of failure: During a planned change to our server environment early Sunday morning, a system component restarted in a way that retained temporary files that should have been cleared. These leftover files interfered with the login system's startup process and prevented it from functioning correctly. Our team had anticipated this kind of restart and included a script designed to clear out these temporary files, but the script did not work as expected in this case. This specific type of restart had not previously been performed in our production environment, so the issue had not surfaced before.
A permanent fix was developed and deployed on April 9th. It ensures that all temporary files are properly removed regardless of how the system is restarted, which will prevent the issue from recurring and improve the reliability of our login system.
PRB011499 – Root Cause Analysis
PTASK0010313 – Deploy fix for OIDC git initialization script to Production – COMPLETED
PTASK0010314 – Document Kubernetes behavior during restarts – COMPLETED
We recognize the impact this disruption had on our members and their business operations. We take this incident seriously and are implementing targeted improvements to strengthen system resilience and to speed up detection and recovery in the future.
If you have any questions about this postmortem, please contact digitalbanking_support@central1.com