Weekend maintenance kicks an Italian bank offline for days

It is now day five of Italian bank Sella’s apps and internet banking being down, after a weekend systems update went south. The problem seems to be database-related: “something, something Oracle.”

👋 Hi, this is Gergely with a bonus, free issue of the Pragmatic Engineer Newsletter. In every issue, I cover topics related to Big Tech and startups through the lens of engineering managers and senior engineers. In this article, we cover one out of four topics from today’s subscriber-only The Pulse issue. To get full issues twice a week, subscribe here.

Sella is a mid-sized bank in Italy with around $14B in assets under management. However, since Sunday (7 April), most of its services have been inaccessible. The Sella banking app, the Sella Invest app, and internet banking are all down. Two trading apps, Sella XTrading and Sella Trader, continue to operate normally. So, it seems all the bank’s digital properties, except those for trading, have gone down hard.

“Something, something Oracle.” What caused this issue? From Sella’s status page:
It’s hard to get much out of these vague reports, but here’s my attempt at deciphering what might have happened:
You might expect something catastrophic like this to happen if Sella was trying to “jump ahead” by upgrading across several versions of Oracle from a very old version. After all, if it was “just” a schema change, it should be straightforward enough to revert.

Still, I’m puzzled by how long the system has been down. If it was a risky update, just restore the backups! Every sensible tech company has a backup strategy to bring back its systems, and performing a backup before a major update is common practice. If you perform a risky update, this is a good reminder to start by doing a backup, or inserting a “rollback point” that you can revert to, should things go wrong.

If it was an update to Oracle, or to the operating system, then why not roll back the update? This is also why banks utilize a 2-day “blackout period,” which creates space to revert failed migrations or updates. Of course, if it was that simple, then Sella would surely have done so already. It all suggests there was an unexpected edge case.

The problem might be caused by the bank’s external IT vendor. It’s common enough for banks to contract with development agencies to build and maintain their services. I understand that Banca Sella uses a company called Fabrick for some of its IT needs. Confirming this suspicion is Fabrick’s status page, which says Fabrick’s tech staff are involved:
We can already see a problem: Sella doesn’t seem to own its core tech systems, and hired another company to run them. Would this outage have occurred if Sella had in-house staff responsible for updates, who understood the risks of such changes and their impact on the bank? It is likely that full-time staff would have moved more cautiously. In contrast, for a vendor, the biggest loss they face from messing up is losing a client. This is a loss, but usually not a catastrophic one.

Are 2-day “blackout” windows good for an engineering culture?

Banks regularly do maintenance during weekends; it’s when they run data migration projects, schema updates, rollouts of new systems, and other riskier changes. In some ways, banks are lucky to have these “blackout periods” when money movements are frozen, and engineering teams have up to 48 hours to make changes, test them, and roll them back if needed.

In contrast, most other industries, like ecommerce, utilities, and airlines, have no such luxury and must invest in zero-downtime migrations. This means these companies need to get very good at detecting and resolving incidents in real time. Banks rarely do!

So, it’s fair to ask whether it is a net good or a net bad that banks don’t need to plan for short-downtime or zero-downtime maintenance and migrations. I’d argue this approach unintentionally helps create a weaker engineering culture, and a worse work-life balance. Banks can simply have engineers work at weekends on risky migrations. They don’t need to make rollback plans, because they already have one: “if it goes wrong on Saturday, we have another 24 hours to fix it on Sunday.”

Good luck to the Sella and Fabrick teams in resolving this outage. Perhaps we’ll learn what caused this serious incident after it’s been resolved.

As related reading, here are past articles touching on the topic of migrations and tricky outages:
This was one out of the five topics covered in this week’s The Pulse. The full edition additionally covers:
Featured Pragmatic Engineer Jobs

Important update: this jobs board will be retired in 2 weeks. Read the reasons why I decided to sunset this job board. Featured Pragmatic Engineer jobs: