Summary:
On Thursday, November 17, 2022 at approximately 10:55 a.m. PT (1:55 p.m. ET), Central 1 clients that use the OpenText Public Website (PWS) service started to experience high latency when trying to reach their public website. The latency persisted throughout the day with several intermittent outages. Services fully recovered for all clients at 7:15 p.m. PT (10:15 p.m. ET). This incident did not directly impact Online Banking, as the websites are independently hosted. However, many customers navigate to the Online Banking login portlet via the PWS pages, which led to a reduction in desktop banking during this incident.
Postmortem:
On Thursday, November 17, 2022 at approximately 10:55 a.m. PT (1:55 p.m. ET), Central 1 clients using the OpenText Public Website (PWS) service started to experience high latency on the OpenText platform. The latency was restricted to the initial launch of the site in the customer's session. In most (~85%) cases the site would render after ~20 seconds of latency, and the remainder of the session was unaffected. In the other ~15% of cases, the PWS would not render for the customer and the initial loading of the site would time out.
By 11:15 a.m. PT (2:15 p.m. ET) our site 24x7 monitoring tool detected that sites were failing sporadically, triggering our monitoring alerts. Digital Banking Support started to receive a spike in phone calls reporting that customers could not access their credit union's website. In response, a priority 2 incident was raised and Central 1 product and platform teams moved to escalate the incident with OpenText, our vendor who manages our Forge PWS platform.
This incident did not impact Online Banking or the Mobile App. Both were still available, with an approximate 20% reduction in desktop login traffic to Online Banking. Note that customers would have needed a bookmark to the online banking login page to avoid the public website latency/outage. The latency for sites was very intermittent throughout the incident, suggesting that the root cause was either a volume problem (increased load) or cycling of pods (reducing available capacity), either of which would increase latency.
At 12:30 p.m. PT (3:30 p.m. ET), Central 1 called an escalation meeting with all OpenText resources to review their triage and assist with this incident. Because OpenText is a managed service, we are heavily reliant on their triage, analysis and decision making.
The OpenText team could not locate any point of systemic failure and recycled some websites with no improvement to the latency/outages. One suspected point of failure was a bad file pushed live by a client (the conditions that would make it a ‘bad file’ are unknown), so between 1 and 5 p.m. PT (4 and 8 p.m. ET), all website changes pushed live that morning were reviewed, one at a time, and the suspect files were reverted.
A decision was made at 5:40 p.m. PT (8:40 p.m. ET) to take down ALL sites. If the problem was resources stuck in a cycle that prevented services from recovering, then only removing all load would help. Central 1 took down all Forge PWS sites and began bringing them back up slowly under close inspection. All websites recovered with no latency by 7:15 p.m. PT (10:15 p.m. ET).
The investigation teams reconvened the next morning at 9:30 a.m. PT (12:30 p.m. ET) to review Friday morning stability. On Friday morning all PWS performance statistics were green, and services remained stable.
Current point of failure: The C1 and OpenText teams believe that the live pods may have transitioned into a bad state due to too much load (either external traffic or publishing across all clients). When a pod becomes unhealthy it automatically restarts. As this process persisted, the entire cluster moved into a state where the continual restarting of pods put too much load on the remaining available pods, and the system was stuck in an insufficient-capacity cycle without the ability to fully recover. OpenText is also investigating other possible root causes including threading, as well as the potential for a bad workflow going live. Please see “Actions” below for the pending “OpenText” analysis.
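The suspected restart cycle can be illustrated with a minimal sketch. This is a toy model only, not a representation of the OpenText platform: the function name `simulate`, the pod count, per-pod capacity, load figures and restart time are all hypothetical values chosen for illustration. It assumes load is spread evenly across healthy pods and that any pod pushed over its capacity fails its health check and restarts.

```python
def simulate(pods, capacity_per_pod, load, restart_ticks, ticks, initial_down=0):
    """Toy model of a pod restart cascade (all parameters are hypothetical).

    Returns the number of healthy pods observed at each tick.
    """
    # remaining[i] = ticks left until pod i is healthy again (0 means healthy)
    remaining = [restart_ticks] * initial_down + [0] * (pods - initial_down)
    history = []
    for _ in range(ticks):
        # Restarting pods make one tick of progress toward becoming healthy.
        remaining = [max(0, r - 1) for r in remaining]
        healthy = [i for i, r in enumerate(remaining) if r == 0]
        history.append(len(healthy))
        # Load is shared evenly; if each healthy pod is over capacity,
        # every one of them becomes unhealthy and restarts.
        if healthy and load / len(healthy) > capacity_per_pod:
            for i in healthy:
                remaining[i] = restart_ticks
    return history


# Ten pods at 100 requests/tick each, carrying 900 requests/tick of load.
print(simulate(10, 100, 900, 3, 3))                  # all pods healthy: stable
print(simulate(10, 100, 900, 3, 6, initial_down=2))  # two pods down: never recovers
print(simulate(10, 100, 0, 3, 4, initial_down=2))    # load removed: full recovery
```

In this model the cluster is stable while all pods are healthy, but once two pods are restarting the same load overloads the remaining eight and the cluster never returns to full capacity. Removing all load lets every pod finish restarting, which is consistent with the decision to take all sites down before bringing them back slowly.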
Impact Assessment:
Affected Service(s): Public websites
Affected FIs: All Forge Clients
Affected End Customers: Unknown
Impact windows: 10:55 a.m. to 7:15 p.m. PT (1:55 to 10:15 p.m. ET)
Central 1 Actions:
Product and Vendor Management to coordinate with OpenText
PRB011012 – Central 1 ongoing investigation into PWS outage
PRB011013 – OpenText ongoing investigation into PWS outage
Due Date: By end of 2022
RITM321765 – Review C1’s OT Architecture
Due Date: End of January 2023
• Review the current tenant instances
• Complete a code review on multitenancy architecture
RITM321768 – Review and update products Threat Risk Analysis and Third-Party Risk Analysis reports
Due Date: By end of 2022
RITM321769 – Vendor Management improvements for our OpenText support model
Due Date: End of January 2023
• Confirm C1’s access to hourly OpenText logs
• Review master agreements for support improvements
RITM321772 – Ongoing new performance testing
Due Date: End of January 2023
• Review pod and thread count thresholds under performance strain.
• Attempt to clear the cache while there is load
• Review Queue and drip protects
We understand the severity of the impact that such a long outage of your public website has on your reputation and on your members' ability to perform the expected, reliable online banking and to view your site's content.
We are working with our vendor on a full review of service support, architecture and service resiliency planning. Central 1 is reviewing how we can improve this relationship and support model, and we expect a quick turnaround on our deliverables.
If you have any questions, please do not hesitate to reach out to me.
Jason Seale, PMP
Director, Client Support Services
Central 1 Client Support Services
1441 Creekside Drive, Vancouver, BC, Canada V6J 4S7
T 1 800 661 6813 ext. 5185 C 778 558 5627 Support 888 889 7878
jseale@central1.com www.central1.com