A case of two towers
This is the famous Leaning Tower of Pisa, as you might know. The tower started to lean as soon as construction began in the twelfth century, apparently because the soft ground on which it was built could not support it. The situation went from bad to worse, and by the time construction was completed in the fourteenth century, the tower was already leaning significantly.
Here is another, slightly less famous example. This is the Leaning Tower of Bad Frankenhausen, the church tower of the Oberkirche. This 56-meter-high tower leans 4.93 degrees, which is more than the Tower of Pisa. The reason for its tilt is different from that of Pisa. The ground under the tower contained salt deposits; as these dissolved into the nearby stream, small sinkholes formed and the foundation sank, tilting the tower.
These two instances exemplify the relevance of operational resilience, especially in the case of the information systems that we build for business operations.
Think of the visible towers as the business operating systems (BOS). These are what users see. Users can feel and see the deterioration of these operational systems, just as people can see the lean of the towers. However, users don’t see the reasons for the deterioration.
How to improve?
Like these towers, business operating systems lose resilience because the supporting information systems deteriorate. Information systems can deteriorate because they are built on an architectural foundation that does not support the operating environment, or the changes (new regulations, for example) made to it over the years.
They may also deteriorate because of technology obsolescence, for example, when monolithic, centralised information systems give way to distributed systems. In either case, the end result is that the business operating systems become risky, less resilient, costly to maintain, and possibly non-compliant.
In the case of these two towers, which are solid structures, resilience had to be restored by underpinning and strengthening the foundations. However, in the case of a business operating system such as a loan origination system, a different option is available: one can choose to transform the system from the ground up, with a completely new foundation.
An information system’s resilience depends on its hardware and software, cyber security, operating procedures, and staffing. It can also be affected by natural disasters such as fires, floods, and earthquakes, which can damage the physical facilities hosting the information systems. Pandemics can cause talent shortages and supply chain issues, which also affect resilience. Hence, business continuity strategies play an important part in maintaining operational resilience.
Protecting information systems from these disruptive agents is an expensive task, especially when the modern digital world, dominated by Gen Z, expects 24x7 operations and real-time responses. Many small and medium organisations find it increasingly difficult to make these investments.
Most public cloud operators, taking advantage of economies of scale, protect their estate against these disruptive agents well. For a smaller player, protecting information systems to the same degree is very expensive. So, one option for improving overall operational resilience could be cloud migration.
The customer Cloud dilemma
However, moving foundational information systems to the cloud has its own challenges. I am listing a few of them that I have heard from customers in the recent past.
1. It takes longer to do anything, now that we are on the Cloud
This customer moved some of their information systems to the Cloud, using a lift-and-shift paradigm. The move was swift and done with clockwork precision. However, this shift merely moved some on-premise workloads onto the Cloud, without taking advantage of the Cloud platform services or any modern architectural practices. It also left some workloads on-premise, owing to some valid reasons.
In this hybrid environment, teams were still using the procurement practices of the on-premise world instead of the cloud model of requisitioning resources through self-service. As a result, severe bottlenecks appeared when they replicated their on-premise processes in a hybrid cloud environment.
The key to addressing this problem is to adjust the processes for the cloud and allow for two procurement approaches, one for each environment. Force-fitting a single process that is clearly unsuitable for one of the environments is what causes the problem. More agile approaches such as DevSecOps and SRE (Site Reliability Engineering) should be considered too, as they are aligned with newer ways of putting best practices into operation, for example through observability platforms.
2. It is more expensive on Cloud than what we used to pay on-premise
This is one of the consequences of merely moving workloads to the cloud without taking advantage of the features the cloud provides. In the cloud, resource costs are linear by default, whereas on-premise costs benefit from the non-linearity provided by economies of scale, depreciation, and so on. Cloud providers also offer features such as reserved instances and burstable instances that deliver the benefits of economies of scale and a pay-when-used model. A blind lift-and-shift move, though, fails to take advantage of these features.
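To make the linear-versus-flat cost argument concrete, here is a minimal sketch in Python. The prices and the function names are hypothetical, chosen only to illustrate the shape of the trade-off; they are not taken from any provider's price list.

```python
# Hypothetical prices, for illustration only.
HOURS_PER_MONTH = 730          # average hours in a month
ON_DEMAND_RATE = 0.10          # assumed price per instance-hour
RESERVED_MONTHLY = 45.0        # assumed flat monthly fee for a reserved instance


def on_demand_cost(hours_used: float, hourly_rate: float = ON_DEMAND_RATE) -> float:
    """Pay-as-you-go: cost grows linearly with the hours used."""
    return hours_used * hourly_rate


def reserved_cost(monthly_commitment: float = RESERVED_MONTHLY) -> float:
    """Reserved capacity: a flat fee, however many hours are used."""
    return monthly_commitment


def break_even_hours(hourly_rate: float = ON_DEMAND_RATE,
                     monthly_commitment: float = RESERVED_MONTHLY) -> float:
    """Hours per month above which the reservation becomes cheaper."""
    return monthly_commitment / hourly_rate


def cheaper_option(hours_used: float) -> str:
    """Pick the cheaper pricing model for a given monthly usage."""
    if on_demand_cost(hours_used) > reserved_cost():
        return "reserved"
    return "on-demand"


if __name__ == "__main__":
    for hours in (200, HOURS_PER_MONTH):
        print(hours, "hours/month ->", cheaper_option(hours))
```

Under these assumed prices, a lifted-and-shifted server that runs around the clock is better off on a reservation, while a workload that runs only part of the month is cheaper on demand. The real break-even point depends on the provider's actual pricing.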
An application and architecture modernisation exercise following the migration is key to addressing this problem. Treat the cloud migration not merely as an infrastructure migration, but also as an architectural migration. Application modernisation can clear the technical debt but can leave traces of unsuitable architecture in the environment.
An unsuitable architecture, over time, contributes to more technical debt. Architecture modernisation transforms the overall architecture, thus reducing the possibility of technical debt building up again. So it is necessary to modernise both the application portfolio and the underlying architecture.
3. It has become free for all, we have lost control
Some organisations go to the other extreme, abandoning central control of resource requisition and allowing on-demand resource allocation without any control at all. Architectural features such as function-as-a-service, in particular, make it very easy for IT staff to use resources indiscriminately. Without proper oversight, the IT sprawl soon starts to consume a large part of the operational budget, and financial control of operational expense is jeopardised. This is also highlighted by Gartner in the article “Why Cloud Budgets Don’t Stay in Check — And How to Make Sure Yours Do”.
Now, most cloud providers provide good mechanisms for organising operational units and budget control. They also offer best practices for implementing and maintaining the cloud operating model. Some organisations though, instead of adjusting for a hybrid cloud operating model, go to either the extreme of complete control or no control. In their case, it is advised to adopt the budget control best practices.
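As an illustration of a middle path between complete control and no control, the sketch below flags operational units whose spend is approaching or exceeding their budget. It is a minimal sketch: the `UnitSpend` record, the 80% warning threshold, and the unit names are assumptions made for this example. In practice the figures would come from the cloud provider's own billing and budgeting tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitSpend:
    """Hypothetical monthly spend record for one operational unit."""
    unit: str
    monthly_budget: float
    spend_to_date: float


def budget_status(item: UnitSpend, warn_ratio: float = 0.8) -> str:
    """Classify a unit as 'ok', 'warning', 'over', or 'no-budget'."""
    if item.monthly_budget <= 0:
        # A unit with no budget at all is itself a finding worth reviewing.
        return "no-budget"
    ratio = item.spend_to_date / item.monthly_budget
    if ratio >= 1.0:
        return "over"
    if ratio >= warn_ratio:
        return "warning"
    return "ok"


def units_needing_attention(items: list[UnitSpend]) -> dict[str, str]:
    """Return only the units whose status is something other than 'ok'."""
    statuses = {item.unit: budget_status(item) for item in items}
    return {unit: status for unit, status in statuses.items() if status != "ok"}


if __name__ == "__main__":
    spend = [
        UnitSpend("loan-origination", 10_000, 4_200),
        UnitSpend("analytics", 5_000, 4_500),
        UnitSpend("sandbox", 1_000, 1_750),
    ]
    print(units_needing_attention(spend))
```

Teams keep their self-service freedom, while finance gets an early signal instead of a surprise at month end.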
4. It’s a nightmare protecting the IT estate from Cyber-threats
The most drastic difference caused by cloud migration is felt by the organisation’s cyber security unit. The typical security organisation is used to defending a perimeter that delineates a trusted zone from the big bad world of the internet, the supposedly untrustworthy zone. All the security processes and posturing are aligned with this core concept. The cloud can potentially turn this trust assumption on its head. Though cloud providers do provide primitives for creating a defensible perimeter around a trusted zone, the operational unit may often unintentionally open itself to threats through oversight or misconfiguration of those security primitives.
In such cases, every server and every platform service can potentially expose an organisation to cyber threats and needs defending. The environment also needs ongoing review, and the teams involved need relevant training. After all, “You don’t know what you don’t know.”
To avoid exposing yourself to cyber threats, the typical cloud environment should be treated as a zero-trust environment. This requires a drastic rethinking of the cyber security processes and posturing. This Forrester article “The Definition Of Modern Zero Trust” is a good starting point for getting acquainted with this model.
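The core idea of zero trust can be sketched in a few lines: every request is denied by default and is allowed only when identity, device, and role checks all pass, regardless of where on the network it comes from. The `AccessRequest` fields, the `POLICY` table, and the resource and role names below are hypothetical and exist only for illustration. This is a simplified sketch of the principle, not a rendering of Forrester's definition or of any product's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical description of a single access attempt."""
    user_authenticated: bool     # identity verified for this request
    device_compliant: bool       # device passed its posture check
    role: str                    # role claimed by the verified identity
    resource: str                # resource being requested
    from_internal_network: bool  # recorded, but deliberately never trusted


# Hypothetical policy: which roles may reach which resource.
POLICY: dict[str, set[str]] = {
    "loan-database": {"loan-officer", "auditor"},
}


def authorize(request: AccessRequest,
              policy: dict[str, set[str]] = POLICY) -> bool:
    """Default deny: allow only when every check passes explicitly."""
    allowed_roles = policy.get(request.resource)
    if allowed_roles is None:
        return False  # unknown resources are never reachable
    # Note that from_internal_network plays no part in the decision.
    return (
        request.user_authenticated
        and request.device_compliant
        and request.role in allowed_roles
    )


if __name__ == "__main__":
    insider = AccessRequest(False, True, "loan-officer", "loan-database", True)
    remote = AccessRequest(True, True, "loan-officer", "loan-database", False)
    print("unauthenticated insider allowed:", authorize(insider))
    print("verified remote user allowed:", authorize(remote))
```

The contrast with the perimeter model is that being inside the network grants nothing: an unauthenticated request from the internal network is refused, while a fully verified request from outside is allowed.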
Like the famous monuments, business operating systems can face an erosion of their foundation, leading to operational resilience issues. Unlike the monuments, however, their foundation can be rebuilt with a move to the cloud. A best-practices approach to cloud migration, built on experience, can prevent many of the problems that would otherwise be faced.