Postmortem: RSA OAuth and 2SV Recent Degradation | INC153074
Central 1’s RSA service, which is used for our Increased Authentication and 2SV services, experienced degradation on January 21st for 49 minutes (INC152479), February 1st for 2 hours and 11 minutes (INC152835), and February 7th for 51 minutes (INC153074). During service degradation, features reliant on the RSA risk score and/or requiring the RSA database for decision-making were unavailable. The impact to members was:
Other services dependent on the risk scoring would have been impacted as follows:
Known point of failure: The RSA database receives weekly index maintenance to optimize database performance. This job runs on Saturdays starting at 1:30 a.m. PT (4:30 a.m. ET). It typically takes 4 to 5 hours to complete and does not interrupt production. On Saturday, January 21st, the job did not finish until 7:55 a.m. PT (10:55 a.m. ET), taking 6 hours 24 minutes to complete. In reviewing the incident, our database administration team determined that the indexing job collided with another maintenance job that was scheduled to run between 7:05 and 7:55 a.m. PT (10:05 and 10:55 a.m. ET). The second job (to update statistics) blocked records from being written to the database, which prevented the scoring services from working. The indexing job has been slowly increasing in duration as the database has grown. Data is purged after 13 months, but with the increase in 2SV adoption and the increased volume of brute force attacks (which create records), the table has grown large enough that an optimization review needs to be completed with our vendor.
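The scheduling collision described above can be illustrated with a small calculation. This is a minimal sketch, not the scheduler or tooling used by our database administration team: the function name windows_overlap is hypothetical, the year 2023 is an assumption, and the times are the PT values quoted in this postmortem.

```python
from datetime import datetime, timedelta


def windows_overlap(start_a, end_a, start_b, end_b):
    """Return True if the half-open intervals [start_a, end_a) and [start_b, end_b) overlap."""
    return start_a < end_b and start_b < end_a


# Times in PT, taken from the incident description; the year is an assumption.
day = datetime(2023, 1, 21)
index_start = day + timedelta(hours=1, minutes=30)
stats_start = day + timedelta(hours=7, minutes=5)
stats_end = day + timedelta(hours=7, minutes=55)

# Longest the indexing job can run before it reaches the statistics window.
headroom = stats_start - index_start

typical_end = index_start + timedelta(hours=5)       # upper end of the typical 4 to 5 hours
incident_end = day + timedelta(hours=7, minutes=55)  # finish time reported for January 21st

typical_collides = windows_overlap(index_start, typical_end, stats_start, stats_end)
incident_collides = windows_overlap(index_start, incident_end, stats_start, stats_end)

print("headroom:", headroom)
print("typical run collides:", typical_collides)
print("incident run collides:", incident_collides)
```

With these times, the indexing job has 5 hours 35 minutes before it reaches the statistics window, so a typical 4 to 5 hour run stays clear of it while the January 21st run did not.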
On February 1st and 7th, the point of failure was attributed to an application threading problem caused by a Drools bug. Drools provides the rule processing for our decision engine. Our current version has a known bug that can cause database connection pools to become locked (connections are not closed). When database connection usage reaches 100%, the service cannot process requests. To help mitigate this problem, Central 1 has added 4 additional production RSA (MDAuth) servers, which will reduce the chance of the threading issue recurring. Our long-term solution is an RSA and Drools version upgrade (CHG131855 | SVM-2612).
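To illustrate why unclosed connections stop a service from handling requests, the following is a toy model only. It does not reflect how RSA, MDAuth or Drools actually manage database connections; the class BoundedPool, its methods, and the pool size of 4 are all hypothetical names and values chosen for the example.

```python
import threading
from contextlib import contextmanager


class BoundedPool:
    """Toy connection pool that hands out at most `size` connections (hypothetical)."""

    def __init__(self, size):
        self.size = size
        self._slots = threading.BoundedSemaphore(size)
        self._lock = threading.Lock()
        self._in_use = 0

    def acquire(self, timeout=0.1):
        # Wait briefly for a free slot; fail if every connection is still held.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("connection pool exhausted")
        with self._lock:
            self._in_use += 1

    def release(self):
        with self._lock:
            self._in_use -= 1
        self._slots.release()

    def utilization(self):
        with self._lock:
            return self._in_use / self.size

    @contextmanager
    def connection(self):
        # Guarantees the slot is returned even if the request raises an error.
        self.acquire()
        try:
            yield
        finally:
            self.release()


# Leaky usage: connections are taken and never released.
leaky = BoundedPool(size=4)
for _ in range(4):
    leaky.acquire()
leaky_utilization = leaky.utilization()
try:
    leaky.acquire()
    fifth_request_served = True
except TimeoutError:
    fifth_request_served = False

# Healthy usage: every request returns its connection when it finishes.
healthy = BoundedPool(size=4)
for _ in range(10):
    with healthy.connection():
        pass
healthy_utilization = healthy.utilization()

print("leaky pool utilization:", leaky_utilization)
print("fifth request served:", fifth_request_served)
print("healthy pool utilization:", healthy_utilization)
```

In this model the leaky pool sits at full utilization and rejects the next request, while the pool that always releases its connections returns to zero. Adding servers raises the total number of available connections, which is why it reduces the chance of exhaustion without removing the underlying leak.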
Central 1 is also implementing Dynatrace in QA on the new production servers to help improve our monitoring and proactive triage of errors.
The root cause of the point of failure, the Drools bug, is still unknown. Recent increases in RSA traffic due to continuing implementations, new Policies and new cases are contributors, but other factors which may have pushed the service past a daily traffic threshold, causing the Drools bug to be realized, are:
Central 1 completed our most recent health check with a third-party vendor (Saviium in 2021) for our RSA services. This health check looks at our overall RSA service, but we believe that introducing so many changes at once between reviews contributed to the degradation of service. Central 1 is working on several initiatives, highlighted in our actions below, to mitigate further impacts while we upgrade our services and build a strong roadmap for service stability going forward.
Actions:
PRB011044 – RSA service degradation root cause analysis
Assigned to: Product Management
Due Date: Closed
RITM327058 – Build RSA service/DB monitoring/alerting
Assigned to: Platform
Due Date: Closed
RITM329455 – RSA Improvements Roadmap
Assigned to: Product/Bart
Due Date: End of April 2023
PRB011075 – Ongoing RSA Performance Analysis by Product Management
Assigned to: Bart Venlet
Due Date: End of April 2023
PRB011076 – Ongoing RSA Performance Analysis by Platform
Assigned to: Quintin Paulson
Due Date: End of April 2023
At our company, we take the quality of our service delivery very seriously. We understand that when our services are not working as expected, it can have a significant impact on our customers and their businesses. We believe that the best way to address any issues that arise is to be transparent about them and work diligently to improve our processes and systems. Our upcoming RSA upgrade and improvement/stability changes, along with improved monitoring and reviews, will reduce the chances of such incidents occurring again.
If you have any questions about this postmortem, please reach out to me directly.
Jason R Seale
Director of Client Support Services
jseale@central1.com | 778.558.5627