Alert! - INC150901: OAUTH Login errors and 2 Step Verification service interruption
Incident Report for Central 1
Postmortem

Postmortem: INC150901 – Digital Banking: 2SV Login Outage – P1

Summary:

On Friday, November 25, there was an online banking outage for all 2-Step Verification (2SV) login customers, along with RSA database-dependent features including e-Transfers, bill payments and biometric logins. Members trying to log in would have seen an error message between 5:25 p.m. and 7:20 p.m. PT (8:25 p.m. and 10:20 p.m. ET), an online banking service outage of 1 hour and 55 minutes. All services recovered after the RSA databases were restarted.

Postmortem:

On Friday, November 25, at 5:25 p.m. PT (8:25 p.m. ET), Central 1’s RSA database became unresponsive, affecting 2SV logins as well as features inside online banking that use the RSA database for risk scoring (e-Transfers, bill payments and biometric logins). Credit union members with 2SV would have received an error when attempting to log in that read: “Unfortunately there was a problem with your request. Please visit your branch or call us at 1-888-XXXX.” This would have impacted both desktop and mobile apps for 2SV members. Members of Increased Authentication credit unions that are ‘failed open’ (work when Increased Authentication isn’t available) were able to log in; however, they received a ‘try again later’ error when trying to use a payment service such as e-Transfers or bill payments, or when they tried to use the biometric login on their mobile app.
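The ‘failed open’ behaviour described above can be illustrated with a minimal sketch. This is not Central 1’s or RSA’s implementation; the names decide_login, RiskServiceUnavailable and LoginDecision, the 0.8 threshold, and the simple allow/deny outcome are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable


class RiskServiceUnavailable(Exception):
    """Hypothetical error raised when the risk-scoring backend cannot be reached."""


@dataclass
class LoginDecision:
    allowed: bool
    reason: str


def decide_login(
    score_risk: Callable[[str], float],
    member_id: str,
    fail_open: bool,
    threshold: float = 0.8,  # assumed cut-off, not taken from the report
) -> LoginDecision:
    """Decide a login, falling back to the fail-open setting if scoring is down."""
    try:
        score = score_risk(member_id)
    except RiskServiceUnavailable:
        # Fail open: let the member in without a risk score.
        # Fail closed: block the login until the service recovers.
        if fail_open:
            return LoginDecision(True, "risk service unavailable; failed open")
        return LoginDecision(False, "risk service unavailable; failed closed")
    if score >= threshold:
        return LoginDecision(False, "risk score above threshold")
    return LoginDecision(True, "risk score acceptable")
```

In this sketch, a credit union configured with fail_open=True still admits members while the scoring backend is down, which mirrors the login behaviour reported for ‘failed open’ credit unions. Features that require a score of their own would still fail.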

Central 1 received Site 24x7 alerts and server monitoring alerts at 5:40 p.m. PT (8:40 p.m. ET). Credit unions began receiving calls from members advising of the outage. A priority 1 incident was created, and the Central 1 team was engaged by 6 p.m. PT (9 p.m. ET). The team performed application and service restarts to attempt to recover services while investigating the errors to determine the point of failure.

The Central 1 team continued to investigate and identified the database connection as the point of failure. Additional teams were brought in to recycle database services at 7 p.m. PT (10 p.m. ET). All RSA services were recycled and reconnected to the database, restoring full services by 7:20 p.m. PT (10:20 p.m. ET) and resolving the incident.

Point of Failure: The RSA database caches each thread connected to it. Due to the transactional volume added to the RSA service with ongoing 2SV launches, Central 1 passed a threshold where too many threads were cached, causing space to fill up and the database to crash.

On the evening of Friday, November 25, a temporary workaround was completed by adding a script on the RSA database (DB) to clear out old DB connections. The server has memory available for the buffer pool and, when connection threads occupy more than a new space limit, the script runs and trims inactive threads to keep their memory usage down. Monitoring has also been placed on these services and on the script to quickly alert on the point of failure and expedite recovery.
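The trimming logic described above can be sketched roughly as follows. The actual script is not published, so everything here is an assumption: the Connection record, the select_connections_to_trim helper, the five-minute idle cut-off, and the choice to trim the longest-idle connections first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Connection:
    conn_id: int
    idle_seconds: float  # time since the connection last did any work
    memory_bytes: int    # memory held by the connection's cached thread


def select_connections_to_trim(
    connections: List[Connection],
    memory_limit_bytes: int,
    min_idle_seconds: float = 300.0,  # assumed cut-off for "inactive"
) -> List[int]:
    """Return ids of inactive connections to close so usage drops to the limit."""
    total = sum(c.memory_bytes for c in connections)
    if total <= memory_limit_bytes:
        return []  # under the limit: nothing to trim

    # Only inactive connections are candidates; trim the longest-idle first.
    candidates = sorted(
        (c for c in connections if c.idle_seconds >= min_idle_seconds),
        key=lambda c: c.idle_seconds,
        reverse=True,
    )
    to_trim: List[int] = []
    for conn in candidates:
        if total <= memory_limit_bytes:
            break
        to_trim.append(conn.conn_id)
        total -= conn.memory_bytes
    return to_trim
```

If the inactive connections alone cannot bring usage under the limit, the function returns every candidate and usage stays above the limit. That is the condition the added monitoring would need to alert on.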

Central 1 is continuing to improve our product resiliency and will be completing a full review of our RSA support model, monitoring and capacity management as we continue to grow the service with additional 2SV implementations and transactional step-up authentication. We apologize for this incident and understand the impact it had on your members accessing online banking during the outage, and the reputational harm it can cause to your brand. We are working very hard to mitigate these service impacts.

Impact Assessment:

Affected Service(s): 2SV login, e-Transfers, Bill Payments and Biometric Login
Affected FIs: All 2SV implemented customers
Ticket opened: 2022-11-25 18:03 PT and resolved 2022-11-25 19:55 PT
Outage from 5:25 to 7:20 p.m. PT (1 hour and 55 minutes)

Actions:

PRB011015 – 2SV Outage Root Cause Analysis – Product and Platform teams
• Investigation of the point of failure, along with monitoring on data connections, is still underway.
• Optimization of services for thread caching is in review with the vendor.

RITM322541 – Feature Fail Open review for RSA – Product

RITM322543 – Support Escalation Procedure Enhancements – Client Support Services
• Develop play books within PagerDuty for concurrent call outs
• On-call staff incident management refresher training

Jason Seale, PMP
Director, Client Support Services
C 778 558 5627 | jseale@central1.com

Posted Dec 08, 2022 - 11:33 PST

Resolved
The emergency server restart is now complete and all RSA and online banking services are now available and recovered. The incident's start time was 5:41 p.m. PT (8:41 p.m. ET) and services recovered at 7:45 p.m. PT (10:45 p.m. ET). Please contact Support if you experience any further issues.

Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 25, 2022 - 19:54 PST
Update
Central 1 is in the process of performing an RSA server farm emergency restart. An update will be provided by or before 8:30 p.m. PT (11:30 p.m. ET).


Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 25, 2022 - 19:33 PST
Investigating
Please note that we are currently experiencing an RSA OAUTH 2 Step Verification service interruption. This is affecting user logins to MemberDirect, Forge 2, and the mobile app.
The following error appears after entering the login information on the OAUTH sign-in page: "Unfortunately there was a problem with your request. Please contact technical support."
We are actively investigating and an update will be provided by or before 7:30 p.m. PT (10:30 p.m. ET).


Central 1 - DigitalBanking_Support@Central1.com - 1.888.889.7878, option 2
Posted Nov 25, 2022 - 18:31 PST
This incident affected: Incident Alerting.