Cloud Data

Cloud Computing Fault Tolerance And Resilience

Cloud Computing Fault Tolerance And Resilience – In this article, I will discuss whether there is a difference between resilience and fault tolerance when talking about IT systems.

In my previous post, I discussed why sustainability has become perhaps the most important paradigm of the 21st century. However, the examples I used were mostly about individuals and organizations, and you might wonder if this “resilience article” applies to IT systems as well – or if it’s just fault tolerance in disguise using a new and catchier term.

Cloud Computing Fault Tolerance And Resilience

Cloud Computing Fault Tolerance And Resilience

How many times have I heard the phrase “resistance is nothing new”. It’s just tolerance of sin dressed up in a shiny new term. Of course, these people were not completely wrong. When we talk about resilience in the context of IT systems, we are often actually talking about fault tolerance.

Govardhana Miriyala Kannaiah On Linkedin: #gopuwrites #aws #gcp #azure #highavailability #faulttolerance #cloud

For example, if you look at the definition of “fault tolerance” found on Wikipedia, it is very similar to what we usually talk about when it comes to resilience in IT systems:

Fault tolerance is a feature that ensures the system will function properly if one or more of its components fail. If the quality of performance declines at all, the decline is proportional to the severity of the failure. The ability to maintain performance when parts of the system fail is called graceful degradation.

In case of partial failure, continue your work. Graceful degradation of service. If we talk about the flexibility of IT systems, this is very similar to what we usually talk about.

Moreover, if you look at traditional resilience literature, you will read not only about resisting unpleasant events (and perhaps subtly humiliating your operations for a while), or about recovering from them in time. You will also read about systems that expand the level of adverse events to which they can respond, develop persistent behavior, continuously adapt to adverse changes, or even change themselves over time.

An Essential Guide To Microservices Resilience Testing

In terms of IT systems, it is very similar to systems that incorporate some kind of advanced artificial intelligence that modifies the system and code at runtime. It doesn’t matter whether we find this idea personally attractive or terrifying: the systems we typically develop today, and the resilience practices we typically talk about, are far from this idea.

Personally, I wouldn’t say so. Of course, there is significant overlap between resilience and fault tolerance in relation to IT systems. Both disciplines deal with adverse events that may adversely affect the proper functioning of the system.

Some simpler robustness concepts can also be mapped 1:1 to well-known fault tolerance metrics. For example, monitoring latency or checking the correctness of request parameters from an upstream call or return values ​​from a downstream call are common measures of fault tolerance.

Cloud Computing Fault Tolerance And Resilience

However, they are also measures of reliability if we consider the aspect of resilience that you want to withstand adverse external events. Recognizing such a phenomenon is the first step to confronting it. Therefore, they are also examples of related resilience.

Build Your Own Game Day To Support Operational Resilience

Before moving on to the non-covered aspects of fault tolerance, let’s first ask whether there are fault tolerance measures that are not resilience measures.

To be honest, I’m not entirely sure. When I review the fault tolerance literature, I sometimes come across “strange” fault tolerance measures, measures that you wouldn’t implement in the context of enterprise IT systems. For example, in highly safety-critical environments such as space travel, you may want to have heterogeneous redundancy.

This means that you will be working on a different design for each, a different programming language, a different operating environment, and a different hardware platform for each. While this could mean the difference between life and death in an environment like manned space travel, it’s economic nonsense in most enterprise software contexts.

Does this mean that such measures belong to the domain of fault tolerance but not to the domain of resilience? I do not know. I think you can find arguments for both positions. Personally, I think they might belong in the realm of sustainability, but they aren’t criteria I often refer to.

How To Test Your Fault Isolation Boundaries In The Cloud

On the other hand, durability certainly covers aspects that are not included in the field of fault tolerance. Whenever we talk about a graceful response to as yet unknown failure modes, we are talking about an extraordinarily flexible behavior of system parts.

By adapting to, or even transforming based on, changing levels of failure, we’ve left the realm of traditional fault tolerance.

You could argue that responding to unknown failure modes is still part of fault tolerance – and again, there can be arguments for both positions. However, traditional fault tolerance is usually “how can we think about failure modes and how should the system respond?”. Therefore, usually more is known about the failure modes (in which case the boundaries may again be blurred).

Cloud Computing Fault Tolerance And Resilience

However, a resilient system must be able to respond well to unknown and thus unpredictable failure modes. As I wrote at the beginning, we are still in the early stages of creating such systems, systems that are smart enough to understand that something truly unexpected is happening and can respond better than shouting for help or shutting themselves down.

Pdf) Fault Tolerance Techniques And Comparative Implementation In Cloud Computing

However, I think we’re getting there with agile software design. Our system landscapes are becoming more complex every day, which means that unexpected failure modes are becoming more and more visible. At the same time, information technology is becoming more and more important to our business and personal lives every day, and this means that it is critical that these systems continue to work even if something unexpected happens.

Currently, we compensate for the lack of graceful behavior and adaptability of our IT systems with human involvement. Whenever IT systems don’t know how to react to an unexpected event, they call for human help: they send an alert to a system operator or some member of the DevOps team on call. This means that today, if we talk about sustainability in the context of information technology systems, we are actually talking about socio-technical systems, information technology systems.

At the same time, we are constantly shifting the boundaries between errors that IT systems can resolve on their own (“self-healing”) and errors that cannot correct themselves when humans are involved. Thus, we are continuously expanding the capabilities of our IT systems, moving away from traditional fault tolerance towards a more complete concept of resilience.

For me, it’s an exciting journey. Of course, depending on the twists and turns we take in our journey, there is the possibility of encountering not only interesting scenic road trips, but also eerie valleys of nightmare – in fact, I’m sure we’ll encounter the latter fairly soon. several times and I hope to learn. But still, I’m very curious to see where this will lead.

Approaches For System Level Fault Tolerance In Distributed Real Time Computer Systems

While we must be careful not to create systems that completely spiral out of control due to a lack of self-healing or even self-consistent capabilities, we cannot leave the ever-increasing complexity of our system landscapes to some poor human operator without improvement. how we do it. Information technology systems support them in performing their duties.

In this article, we discussed whether resilience in the context of IT systems is the same as fault tolerance, just using a more catchy term. While there is certainly a lot of overlap, especially when discussing basic resilience measures (which we often do), resilience goes far beyond traditional fault tolerance.

Resilience refers to systems that respond well to unknown failure modes, develop extraordinarily resilient behavior, adapt to changing levels of failure, or even change. Currently, we achieve these advanced sustainability features by involving people, including technical systems into socio-technical systems, including the systems themselves and the people who operate and change them.

Cloud Computing Fault Tolerance And Resilience

At the same time, we’re constantly pushing the limits of what systems can handle on their own, and we’re constantly making (technical) IT systems a little more resilient – not just more fault-tolerant.

Secure And Reliable Cloud Software Development Services With .net

I hope this helps clarify a bit the difference between fault tolerance and persistence. Once we’ve done that, we’re ready to move deeper into the realm of resilience. Follow us…;) Institutional Open Access Policy Open Access Program Special Issues Guidelines Editorial Process Research and Ethics Publication Fees Awards Certificates

All articles published by Fura are available worldwide under an open access license. No specific permission is required for reusing all or part of an article published by, including images and tables. For articles published under the Creative Commons CC BY open access license, any part of the article may be reused without permission, provided the original article is clearly credited. For more information, visit https:///openaccess.

Selected articles represent state-of-the-art research in the field with significant potential for high impact. A feature article must be an original and substantial article containing it

Resilience in cloud computing, cloud computing and services, fault tolerance in cloud computing, enterprise and cloud computing, cloud fault tolerance, what is fault tolerance in cloud computing, cloud computing and business, cloud and edge computing, fault tolerance in cloud computing ppt, resilience meaning in cloud computing, cloud computing fault tolerance, cloud computing and security

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button