Besa Program - Week 6: Architecting For Resilience

🌟 Excited to Share Insights from Besa Session 6: Architecting for Resilience! 🌟

In our latest Besa session, we delved deep into the crucial realm of Architecting for Resilience, and what an eye-opening journey it was! 💡

Key Takeaways: We explored the paramount importance of resilient architectures in preventing downtime and mitigating the severe impacts on business operations. A firsthand account of a 10-hour application outage truly underscored the significance of this aspect, highlighting the dire consequences it can entail.

Resilience Strategies: From infrastructure to application and database resilience, we covered it all! Discussing cutting-edge AWS solutions like elastic load balancing, auto scaling, and self-healing features opened up new avenues for fortifying our systems against unforeseen disruptions. Application-level techniques such as circuit breakers, isolation, timeouts, and retries emerged as powerful tools in our arsenal.
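To make the timeout-and-retry idea concrete, here is a minimal sketch in Python. It was not shown in the session; the helper name `call_with_retries`, its parameters, and the `flaky_lookup` dependency are all hypothetical. It assumes the wrapped operation enforces its own timeout and raises `TimeoutError` or `ConnectionError` on a transient failure, and it waits with exponential backoff plus jitter between attempts.

```python
import random
import time


def call_with_retries(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                      retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The wait ceiling doubles on each attempt (base_delay, 2x, 4x, ...) up to
    max_delay, and the actual wait is a random value below that ceiling so
    that many clients do not retry in lockstep.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Hypothetical flaky dependency: times out twice, then succeeds.
calls = {"count": 0}


def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated slow dependency")
    return "ok"


# The sleep function is swapped out so the demo finishes instantly.
result = call_with_retries(flaky_lookup, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")  # ok after 3 attempts
```

Capping both the number of attempts and the delay matters: unbounded retries against a struggling dependency tend to make an outage worse rather than better.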

Resilience strategies include:

  1. Load balancing

  2. Auto scaling

  3. Loose coupling

  4. Graceful degradation

  5. Timeout and retry pattern

  6. Self healing pattern

  7. Circuit Breaker Pattern

  8. Fault Isolation Pattern

  9. Bulkhead pattern

  10. Caching
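As an illustration of the circuit breaker pattern from the list above, here is a small Python sketch. It is a simplified version written for this post, not a production library: the class name, the thresholds, and the `failing_service` function are assumptions. After a set number of consecutive failures the breaker "opens" and rejects calls immediately; once the reset timeout has passed, it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Reset timeout elapsed: half-open, so one trial call goes through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def failing_service():
    # Hypothetical downstream dependency that is currently down.
    raise ConnectionError("simulated outage")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
outcomes = []
for _ in range(4):
    try:
        breaker.call(failing_service)
    except CircuitOpenError:
        outcomes.append("rejected")  # a caller could serve a cached or default response here
    except ConnectionError:
        outcomes.append("failed")

print(outcomes)  # ['failed', 'failed', 'rejected', 'rejected']
```

The "rejected" branch is where graceful degradation fits in: instead of waiting on a dependency that is known to be down, the caller can return a cached or reduced response right away.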

Multi-Region Architectures & Disaster Recovery: Navigating through resilient multi-region architectures, we explored game-changing solutions like Amazon DynamoDB global tables and secondary clusters in Amazon Aurora Global Database. Additionally, we delved into disaster recovery strategies like:

Disaster Recovery Strategies

  1. Backup and Restore [RTO/RPO: hours]

  2. Pilot Light [RTO/RPO: tens of minutes]

  3. Warm Standby [RTO/RPO: minutes]

  4. Active/Active [RTO/RPO: near real time]

Two targets drive the choice between them:

  1. RPO (Recovery Point Objective): how much data can you afford to lose?

  2. RTO (Recovery Time Objective): how quickly must you recover? What is the cost of downtime?
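One way to read the RTO/RPO tiers is as a lookup: pick the cheapest strategy that still meets your tightest target. The Python sketch below shows that reasoning. The minute values are illustrative assumptions standing in for "hours", "tens of minutes", "minutes", and "near real time"; they are not figures from the session or from AWS, and the function name is hypothetical.

```python
# Ordered from cheapest to most expensive to run. The minute values are
# illustrative assumptions, not AWS-published numbers.
DR_STRATEGIES = [
    ("Backup and Restore", 240),
    ("Pilot Light", 30),
    ("Warm Standby", 5),
    ("Active/Active", 0),
]


def pick_dr_strategy(rto_minutes, rpo_minutes):
    """Return the cheapest strategy whose assumed figure fits both targets."""
    tightest = min(rto_minutes, rpo_minutes)
    for name, typical_minutes in DR_STRATEGIES:
        if typical_minutes <= tightest:
            return name
    return DR_STRATEGIES[-1][0]  # fallback for nonsensical (negative) targets


print(pick_dr_strategy(rto_minutes=480, rpo_minutes=1440))  # Backup and Restore
print(pick_dr_strategy(rto_minutes=45, rpo_minutes=60))     # Pilot Light
print(pick_dr_strategy(rto_minutes=10, rpo_minutes=5))      # Warm Standby
print(pick_dr_strategy(rto_minutes=1, rpo_minutes=0))       # Active/Active
```

In practice the cost of downtime decides how tight the targets need to be, so the real numbers should come from the business rather than from a table like this.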

Reliability & Operational Excellence: The session emphasized the pivotal role of reliability and operational excellence in crafting resilient systems. Introducing the AWS Well-Architected Tool for assessing workload resilience showcased the holistic approach we're adopting towards this endeavor. Furthermore, using a mix of instance types and capturing crucial data before an instance is terminated emerged as key practices in our pursuit of resilience.

Elastic Load Balancing options:

  1. Application Load Balancer [operates at Layer 7: HTTP/HTTPS]

  2. Network Load Balancer [operates at Layer 4, the transport layer (TCP/UDP); built for very high throughput and low latency]

  3. Gateway Load Balancer [for deploying third-party virtual appliances such as security tools]

For workloads running on EC2, an Auto Scaling group lets a resilient architecture scale with demand.

Amazon EC2 Auto Scaling group (ASG) features:

  1. Warm pools

  2. Capacity Rebalancing

  3. Lifecycle hooks, for custom tasks before an instance enters service and before it terminates

  4. Instance refresh

  5. Maximum instance lifetime

  6. Attribute-based instance type selection

How do we measure how resilient an architecture is?

Resilient system KPIs:

  1. MTBF / MTTR / MTTD (mean time between failures, mean time to recovery, mean time to detect): observability is used to reduce MTTR

  2. Failure rate [number of HTTP 4xx and 5xx status codes]: tracked with monitoring tools such as Prometheus/Grafana and Splunk

  3. Scalability [auto scaling is triggered by thresholds on CPU or memory usage]

  4. Fault tolerance [strategies: process redundancy, data replication, storing system state]; this is how pod and node failures are managed in Kubernetes

  5. Redundancy level
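To show how the first two KPIs can be computed, here is a small Python sketch over made-up data. The `Incident` record, the sample incidents, and the status codes are hypothetical. MTTR is taken here as the mean time from the start of a failure until service is restored, and availability uses the common approximation MTBF / (MTBF + MTTR).

```python
from dataclasses import dataclass


@dataclass
class Incident:
    started: float   # hour at which the failure began
    detected: float  # hour at which monitoring noticed it
    resolved: float  # hour at which service was restored


def resilience_kpis(incidents, observation_hours):
    """Compute MTTD, MTTR, MTBF (all in hours) and availability."""
    if not incidents:
        raise ValueError("need at least one incident to compute the means")
    count = len(incidents)
    downtime = sum(i.resolved - i.started for i in incidents)
    mttd = sum(i.detected - i.started for i in incidents) / count
    mttr = downtime / count
    mtbf = (observation_hours - downtime) / count  # uptime per failure
    availability = mtbf / (mtbf + mttr)
    return {"MTTD": mttd, "MTTR": mttr, "MTBF": mtbf, "availability": availability}


def failure_rate(status_codes):
    """Share of responses with a 4xx or 5xx HTTP status code."""
    errors = sum(1 for code in status_codes if 400 <= code <= 599)
    return errors / len(status_codes)


# Made-up sample: two incidents in a 30-day (720-hour) window.
incidents = [Incident(100.0, 100.5, 102.0), Incident(400.0, 400.25, 401.0)]
kpis = resilience_kpis(incidents, observation_hours=720)
rate = failure_rate([200, 200, 404, 500, 200, 200, 200, 503, 200, 200])

print(kpis)
print(f"failure rate: {rate:.0%}")
```

Faster detection (a lower MTTD) shortens every incident, which is why observability shows up directly in MTTR and, through it, in availability.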

The journey towards Architecting for Resilience is indeed an ongoing one, but each session brings us closer to our goal of building robust, future-proof systems. Kudos to the Besa team for curating yet another insightful and impactful session!

💪 #Besa #ArchitectingForResilience #AWS #Technology #Innovation #ContinuousLearning
