🟣 Availability & Reliability Interview Questions Answered to help you get ready for your next Design Patterns & System Architecture interview.
Availability refers to whether a system is accessible and able to respond to requests at a given moment. Reliability, in contrast, denotes how consistently the system operates without unexpected shutdowns or errors.
Consider a system whose operational time is recorded at fixed intervals, say every hour, as the fraction of each hour it was up.
Here is the Python code:
import statistics
# Times in hours the system was operational
operational_times = [1, 1, 1, 1, 0.75, 1, 1]
reliability = statistics.mean(operational_times)
print(f"System was operational {reliability * 100}% of the time.")
System Availability quantifies the time a system is operational and can independently assess and fulfill its tasks. It’s typically represented as a percentage.
\text{Availability} = \frac{\text{Uptime}}{\text{Total Time}} \times 100\%
Mean Time Between Failures (MTBF): This measures the average operational time until a failure occurs.
MTBF = Total Operational Time / Number of Failures
Mean Time To Repair (MTTR): This measures the average time needed to restore a failed system.
MTTR = Total Repair Time / Number of Failures
An important consideration is that the same time units should be used for both MTBF and MTTR to obtain accurate availability percentages.
Availability is then calculated using MTBF and MTTR:
\text{Availability} = 1 - \frac{\text{Downtime}}{\text{Total Time}} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
For example, with an MTBF of 100 hours and an MTTR of 2 hours:
\text{Availability} = \frac{100}{100 + 2} \times 100\% = \frac{100}{102} \times 100\% \approx 98.04\%
This system is available about 98.04% of the time.
“Five Nines”, or 99.999% availability, represents the pinnacle in system dependability. It translates to a mere 5.26 minutes of downtime annually, making such systems extremely reliable.
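As a quick back-of-the-envelope check, here is a short Python sketch that converts availability targets into annual downtime budgets:
# Annual downtime budget for common availability targets
MINUTES_PER_YEAR = 365.25 * 24 * 60

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime = (1 - availability) * MINUTES_PER_YEAR
    print(f"{availability * 100:g}% availability -> {downtime:.2f} minutes of downtime per year")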
The use of redundancy in a system effectively employs backups or duplicates of components or processes to minimize the impact of potential failures. This strategy directly enhances system reliability by providing alternative means to accomplish tasks when primary components or processes fail.
Component-Level Redundancy: Involves incorporating backup or mirrored components so that if the primary ones fail, the system can seamlessly transition to the backups. Common examples include RAID storage systems and network interface cards in computers.
Subsystem-Level Redundancy: Ensures entire subsystems have backups or diverse paths, enhancing reliability at a larger scale. For instance, dual power supply units in servers and electrical distribution systems with redundant transformers and switches.
Information Redundancy: Employed to replicate and synchronize critical data or information quickly and accurately. This redundancy type is fundamental to ensuring data integrity and resilience, often seen in data mirroring for failover and disaster recovery.
Failover Mechanisms: Systems with redundancy are designed to transition seamlessly to redundant components when a primary one fails. This ability to “failover” is critical for ensuring uninterrupted services.
Parallel Paths and Load Balancing: Multiple routes or channels can give redundant systems the agility to steer traffic away from faulty components. Load balancers distribute incoming network or application traffic across multiple targets, ensuring no single resource is overwhelmed.
Cross-Verification and Consensus Building: In some setups, redundancy enables the system to rely on the agreement of multiple components. For instance, in three-node clusters, the decision is made by majority consent. If one node deviates, the redundant nodes can maintain system integrity.
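As a simple illustration of this idea, here is a minimal Python sketch of majority voting across redundant nodes; the node readings are hypothetical:
from collections import Counter

def majority_vote(readings):
    # Accept the value reported by a strict majority of redundant nodes
    value, count = Counter(readings).most_common(1)[0]
    return value if count > len(readings) // 2 else None

# Two healthy nodes agree, so the deviating third node is outvoted
print(majority_vote(["42", "42", "41"]))  # -> 42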
Here is Java code for a simple RAID 1 (mirroring) controller, illustrating component-level redundancy:
public class HardDrive {
    private String data;

    public String readData() {
        return data;
    }

    public void writeData(String data) {
        this.data = data;
    }
}

public class RAID1Controller {
    private HardDrive primary;
    private HardDrive backup;

    public RAID1Controller(HardDrive primary, HardDrive backup) {
        this.primary = primary;
        this.backup = backup;
    }

    public String readData() {
        String data = primary.readData();
        // If primary is down, read from backup
        if (data == null) {
            data = backup.readData();
        }
        return data;
    }

    public void writeData(String data) {
        primary.writeData(data);
        backup.writeData(data);
    }
}
Here is Java code for failover between a primary and a backup network interface card:
public class NetworkInterfaceCard {
    private boolean operational = true;

    public boolean isOperational() {
        return operational;
    }

    public void setOperational(boolean operational) {
        this.operational = operational;
    }

    public void sendData(byte[] data) {
        // Transmit the data over this interface (details omitted for brevity)
    }
}

public class Server {
    private NetworkInterfaceCard primaryNIC;
    private NetworkInterfaceCard backupNIC;

    public Server(NetworkInterfaceCard primaryNIC, NetworkInterfaceCard backupNIC) {
        this.primaryNIC = primaryNIC;
        this.backupNIC = backupNIC;
    }

    public void sendData(byte[] data) {
        // Prefer the primary NIC; fail over to the backup if it is down
        if (primaryNIC.isOperational()) {
            primaryNIC.sendData(data);
        } else if (backupNIC.isOperational()) {
            backupNIC.sendData(data);
        } else {
            throw new RuntimeException("Both primary and backup NICs are down!");
        }
    }
}
A Single Point of Failure (SPOF) is a component within a system whose failure could lead to a total system outage.
SPOFs are undesirable because the failure of that single component can bring down the entire system, turning one localized fault into a complete outage.
Here is Python code sketching how a load balancer spreads requests across redundant web servers so the web tier has no single point of failure:
import random

def load_balancer(webservers, request):
    # Hand the request to any of several equivalent web servers
    server = random.choice(webservers)
    return f"{server} handles {request}"
MTBF helps in estimating the average time between two failures for a system or component.
Typically, MTBF uses the following formula:
\text{MTBF} = \frac{\text{Total Up Time}}{\text{Number of Failures}}
Measure of Reliability: MTBF provides an indication of system reliability. For instance, a higher MTBF implies better reliability, whereas a lower MTBF means the system is more prone to failures.
Service Predictability: Organizations use MTBF to anticipate service schedules, ensuring minimal downtime and improved customer satisfaction. Under a constant-failure-rate assumption, the failure rate is the reciprocal of MTBF: $\lambda = \frac{1}{\text{MTBF}}$.
Assumption of Constant Failure Rate: This method might not be accurate for systems that do not exhibit a consistent rate of failure over time.
Contextual Dependencies: MTBF values are often application-specific and can be affected by environmental, operational, and design factors.
SSD Lifetime Estimations: In the context of SSDs, MTBF assists in predicting the drive’s lifespan and its subsequent replacement schedule.
Redundancy Planning: MTBF helps in designing redundant systems, ensuring that a backup is available before the main component fails, based on expected failure rates.
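For instance, if failures of redundant components are independent, the combined availability of N components in parallel can be estimated as $1 - (1 - A)^N$. A minimal Python sketch of this estimate (the 98% figure is just an example):
def parallel_availability(single_availability, n):
    # Probability that at least one of n independent components is up
    return 1 - (1 - single_availability) ** n

print(f"{parallel_availability(0.98, 1):.4f}")  # 0.9800 with a single component
print(f"{parallel_availability(0.98, 2):.4f}")  # 0.9996 with one redundant copy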
Mean Time to Repair (MTTR) is a vital metric in evaluating system reliability and availability.
MTTR measures the time from failure recognition to restoration. A lower MTTR results in improved system availability, as downtimes are minimized.
When MTTR declines, both planned and unplanned outages become shorter, meaning operational states are restored more quickly.
System availability and MTTR are intricately linked through the following formula:
\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
where MTBF is the mean time between failures and MTTR is the mean time to repair.
MTTR and availability thus have an inverse relationship: as MTTR increases, overall availability diminishes, and vice versa.
Let’s say a system has an MTBF of 125 hours and an MTTR of 5 hours. Using the formula:
\text{Availability} = \frac{125}{125 + 5} \approx 0.96 = 96\%
Therefore, the system is available about 96% of the time.
However, if the MTTR increases to 10 hours:
\text{Availability} = \frac{125}{125 + 10} \approx 0.93 = 93\%
This shows that a 5-hour increase in MTTR leads to roughly a 3-percentage-point reduction in system availability.
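The same calculation in a short Python sketch, using the figures above:
def availability(mtbf_hours, mttr_hours):
    # Fraction of time the system is expected to be operational
    return mtbf_hours / (mtbf_hours + mttr_hours)

print(f"{availability(125, 5):.2%}")   # 96.15%
print(f"{availability(125, 10):.2%}")  # 92.59%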
Fault tolerance (FT) and high availability (HA) are both key considerations in system design, each emphasizing different attributes and strategies.
High Availability (HA): Focuses on minimizing downtime and providing continuous service.
Fault Tolerance (FT): Prioritizes system stability and data integrity, even when components fail.
Load Balancing (HA): Distributes workloads evenly to ensure swift responses. Common techniques include round-robin and least connections.
Load Balancing (FT): Offers redundancy, enabling failover when one server or component is at capacity or becomes unresponsive. This promotes consistent system performance.
Data Replication (HA): Replicates data across multiple nodes, typically at the data layer, guaranteeing that services can quickly access data even if a node fails.
Data Replication (FT): Data is redundantly stored for integrity and accuracy. The data layers synchronize across nodes to ensure consistency. This is crucial for systems like databases, ensuring that even if one node fails, data integrity and availability are maintained.
Geographic Distribution (HA): Uses multiple data centers located at distinct geographical locations to ensure service uptime, even during regional outages. Potential downtime is offset as traffic is diverted to operational data centers.
Geographic Distribution (FT): In the event of a region-specific failure, data and services can be seamlessly redirected to other regions, mitigating any data loss or inconsistency and maintaining operational continuity.
Monitoring and Recovery (HA): Constantly monitors the health and performance of systems, quickly identifying issues so they can be addressed before service is affected.
Monitoring and Recovery (FT): Not only identifies issues but can also proactively make adjustments, such as launching new instances or services.
An online platform uses multiple load-balanced web servers and a central Redis cache.
High Availability: If one web server lags in performance or becomes unresponsive, the load balancer detects this and redirects traffic to healthier servers.
Fault Tolerance: If the Redis cache fails or lags, web servers can operate using a locally cached copy or a secondary Redis cache, ensuring data integrity and operations continuity, even in the presence of a cache failure.
Here is the Nginx configuration:
http {
    upstream my_server {
        server server1;
        # "backup": server2 only receives traffic when server1 is unavailable
        server server2 backup;
    }

    server {
        location / {
            proxy_pass http://my_server;
        }
    }
}
Designing systems for high availability (HA) requires a robust architecture that minimizes downtime and prioritizes a seamless user experience. The following practices outline how to build such systems.
Load Balancing: Helps in distributing the incoming traffic across multiple resources, thereby ensuring better resource utilization and availability. Implement it at the DNS or application level for optimal configuration.
Database Technologies for Redundancy: Utilize technologies such as clustering, replication, and sharding for dynamic data distribution amongst nodes, thereby reducing the probability of a single point of failure.
Multi-Data-Center Deployment: Duplicating the infrastructure across disparate data centers ensures service availability even in the event of an entire data center outage.
Health Checks: Automated, recurring checks confirm that each system and its components are healthy.
Auto-Scaling: Leverages predefined rules or conditions for automatically adjusting the allocated resources based on the traffic load, thereby ensuring optimal performance and availability.
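As an illustration of rule-based auto-scaling, here is a minimal sketch; the thresholds and instance limits are arbitrary examples rather than recommendations for any particular platform:
def desired_instances(current, cpu_utilization,
                      scale_up_at=0.75, scale_down_at=0.25,
                      min_instances=2, max_instances=10):
    # Add capacity under heavy load, release it when load is light
    if cpu_utilization > scale_up_at:
        return min(current + 1, max_instances)
    if cpu_utilization < scale_down_at:
        return max(current - 1, min_instances)
    return current

print(desired_instances(3, 0.85))  # -> 4 (scale up)
print(desired_instances(3, 0.10))  # -> 2 (scale down to the minimum)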
The CAP theorem states that it’s impossible for a distributed system to simultaneously guarantee all three of the following:
Consistency: Every read receives the most recent write or an error.
Availability: Every request receives a non-error response, even if it may not reflect the most recent write.
Partition Tolerance: The system continues to operate despite messages between nodes being dropped or delayed.
In practice, many distributed systems relax strict consistency in exchange for high availability and tolerance for network partitions. Solutions that embrace these softer shades of consistency are widely used in distributed data systems.
Concepts such as eventual consistency, read/write quorums, and the use of NoSQL databases have proven to be valuable tools for architects who must navigate the complexities of distributed systems.
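As a quick illustration of the read/write quorum idea, here is a minimal sketch; the replica count and quorum sizes are arbitrary example values:
def quorums_overlap(n_replicas, write_quorum, read_quorum):
    # If R + W > N, every read quorum shares at least one replica
    # with the latest write quorum, so reads can see the newest value.
    return read_quorum + write_quorum > n_replicas

print(quorums_overlap(3, 2, 2))  # True
print(quorums_overlap(3, 1, 1))  # False: a read may miss the latest write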
To enhance system availability, consider implementing the following design patterns.
The Singleton pattern restricts the instantiation of a class to a single object, which prevents redundant allocation of shared resources such as configuration objects or connection managers.
Here is the Java code:
public class Singleton {
    private static Singleton instance = null;

    private Singleton() {}

    // Note: this lazy initialization is not thread-safe; in concurrent code,
    // synchronize getInstance() or use an enum-based singleton.
    public static Singleton getInstance() {
        if (instance == null) {
            instance = new Singleton();
        }
        return instance;
    }

    public void doSomething() {
        System.out.println("Doing something..");
    }
}
The Object Pool optimizes object creation by keeping a dynamic pool of initialized objects, ready for use. This reduces latency by eliminating the need to create an object from scratch.
Here is the Java code:
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ObjectPool<T> {
    private List<T> availableObjects = new ArrayList<>();
    private List<T> inUseObjects = new ArrayList<>();
    private Supplier<T> objectFactory;

    public ObjectPool(Supplier<T> objectFactory, int initialSize) {
        this.objectFactory = objectFactory;
        // Pre-populate the pool so objects are ready before the first request
        for (int i = 0; i < initialSize; i++) {
            availableObjects.add(objectFactory.get());
        }
    }

    public T getObject() {
        if (availableObjects.isEmpty()) {
            // Pool exhausted: create a new object and hand it out directly
            T newObject = objectFactory.get();
            inUseObjects.add(newObject);
            return newObject;
        } else {
            T object = availableObjects.remove(availableObjects.size() - 1);
            inUseObjects.add(object);
            return object;
        }
    }

    public void returnObject(T object) {
        inUseObjects.remove(object);
        availableObjects.add(object);
    }
}
Load balancing plays a pivotal role in enhancing system availability by directing incoming traffic efficiently across multiple servers or processes. However, it comes with its own set of challenges.
Round Robin: This straightforward method cycles through a list of servers, sending each new request to the next server in line. It’s easy to implement but may not be ideal if servers have different capacities or loads.
Least Connections: Serving an incoming request from the server with the fewest active connections helps maintain balanced loads, which is sensible for systems with varying server capacities (a sketch follows the round-robin example below).
IP Hash: This strategy maps client IP addresses to specific servers, offering session persistence for users while ensuring load distribution. It’s useful for applications that rely on session affinity.
Challenge: Maintaining session persistence could lead to uneven traffic distribution.
Solution: Implement backup cookies and session synchronization between servers.
Challenge: Not all clients may support session cookies, impacting load distribution.
Solution: For such clients, consider other identifying factors like their originating IP address.
Challenge: Too frequent checks might intensify server load.
Solution: Adopt smarter health checks that are less frequent but still reliable, such as verifying service on-demand when a user’s request arrives.
Central Point of Failure: Load balancers can become a single point of failure, although using multiple balancers can mitigate this.
Complexity Induced by Layer 7 Load Balancing: Layer 7 load balancers, while powerful, can introduce complications in managing HTTPS certificates and more.
Here is the Python code:
servers = ["server1", "server2", "server3"]

def round_robin(servers, current_index):
    # Return the server for this request and the index to use for the next one
    server = servers[current_index]
    next_index = (current_index + 1) % len(servers)
    return server, next_index

current_server_index = 0
for _ in range(10):
    server, current_server_index = round_robin(servers, current_server_index)
    print(f"Redirecting request to {server}")
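The least-connections strategy mentioned earlier can be sketched in the same style; the connection counts below are hypothetical:
def least_connections(active_connections):
    # Pick the server currently handling the fewest active connections
    return min(active_connections, key=active_connections.get)

active_connections = {"server1": 12, "server2": 4, "server3": 9}
print(f"Redirecting request to {least_connections(active_connections)}")  # server2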
Health checks are an integral part of system operations, focusing on preemptive fault resolution and ensuring that components are able to handle their intended workload.
Continuous Monitoring: Health checks are frequently scheduled, assessing both individual components and the system as a whole.
Rapid Feedback Loop: Quick assessments enable prompt responses to failures or performance issues.
Automated Actions: Systems can be designed to initiate recovery or adaptive procedures based on health check results.
Granularity: Health checks can target specific functionalities or the system at large.
Multi-Level Inspection: System checks can range from high-level operational metrics to cross-component interfaces and individual functionalities.
Predictive Analysis: By detecting and addressing potential issues, a system remains more resilient.
Proactive Checks: These are scheduled assessments ensuring that core components are operational and responsive.
Reactive Checks: Triggers, such as user interactions, can initiate evaluations of the system or its functionalities.
Performance Checks: Beyond simple ‘up’ or ‘down’ assessments, these routines evaluate whether components are meeting performance benchmarks.
HTTP Endpoints: Presence and responsiveness can be determined through HTTP status codes (a minimal check is sketched after this list).
Resource Usage Evaluation: Evaluate access and consumption of memory, CPU, and disk space.
Database Connectivity: Ensure the system can interact with its data storage effectively.
Queue Monitoring: Assess the state and performance of queues used for asynchronous processing.
Service Dependencies: Assess the health of dependent services.
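Here is a minimal Python sketch of an HTTP endpoint check using only the standard library; the URL and timeout are placeholder values:
from urllib import error, request

def is_healthy(url, timeout=2):
    # Treat any 2xx response as healthy; errors and timeouts as unhealthy
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (error.URLError, TimeoutError):
        return False

print(is_healthy("http://localhost:8080/health"))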
The Circuit Breaker Pattern acts as a safeguard in distributed systems, protecting against system failures and temporary overload issues. It is a core component in maintaining the availability and reliability of applications.
Tripped State: When the circuit is “open” or “tripped,” calls to the failing dependency are short-circuited and handled by a fallback instead of being attempted. This gives the underlying system time to recover without being overwhelmed by traffic.
Monitoring: The Circuit Breaker continuously monitors the behavior of external dependencies, such as remote services, databases, or APIs. If the number of failures or response times exceed a certain threshold, the circuit is tripped.
Timeouts: Limiting the time for a potential resource to respond or providing an easy path for handling failures ensures that an application doesn’t get bogged down in requests.
Fallback Mechanism: When the circuit is “open,” requests can be redirected to a predefined fallback method. This ensures essential operations can continue even when a service is degraded.
Reduced Latency: By swiftly terminating requests to failing components, the pattern helps improve system response times.
Improved Resilience: The pattern proactively identifies when a component or service is struggling, limiting the potential for cascading failures.
Enhanced User Experience: Instead of allowing users to be confronted with delayed or erroneous responses, the circuit is tripped, and they are quickly directed to a viable alternative.
Here is the Python code:
import time

class CircuitBreaker:
    def __init__(self, failure_threshold, recovery_timeout, fallback):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.fallback = fallback
        self.current_failures = 0
        self.last_failure = None

    def is_open(self):
        # Stay open until the recovery timeout has elapsed
        if self.last_failure and (time.time() - self.last_failure) < self.recovery_timeout:
            return True
        # Trip the breaker once failures reach the threshold
        if self.current_failures >= self.failure_threshold:
            self.last_failure = time.time()
            self.current_failures = 0
            return True
        return False

    def execute(self, operation):
        if self.is_open():
            return self.fallback()
        try:
            result = operation()
            # Reset on success
            self.current_failures = 0
            return result
        except Exception:
            self.current_failures += 1
            return self.fallback()
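A minimal usage sketch, assuming the class above plus a hypothetical flaky fetch_remote_data call and a cached fallback:
import random

def fetch_remote_data():
    # Hypothetical dependency that fails about half of the time
    if random.random() < 0.5:
        raise RuntimeError("remote service unavailable")
    return "fresh data"

def cached_fallback():
    return "stale cached data"

breaker = CircuitBreaker(failure_threshold=3, recovery_timeout=30, fallback=cached_fallback)
for _ in range(10):
    print(breaker.execute(fetch_remote_data))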
Reliability in a system is about ensuring consistent and predictable behavior over time. Monitoring a set of key indicators can help maintain and improve reliability.
Here is the Python code:
# Import necessary libraries
import pandas as pd

# Failure and restoration timestamps (assumed to be in DD-MM-YYYY format)
failures_data = {
    'Failure Time': ['01-01-2021 08:00:00', '03-01-2021 14:30:00', '06-01-2021 19:45:00'],
    'Restore Time': ['01-01-2021 10:00:00', '03-01-2021 15:30:00', '06-01-2021 20:30:00']
}

# Create a DataFrame with the failures data
failures_df = pd.DataFrame(failures_data)

# Parse the timestamps explicitly to avoid ambiguous day/month ordering
fmt = '%d-%m-%Y %H:%M:%S'
failure_times = pd.to_datetime(failures_df['Failure Time'], format=fmt)
restore_times = pd.to_datetime(failures_df['Restore Time'], format=fmt)

# MTBF: average time between consecutive failures, in hours
mtbf = (failure_times.diff() / pd.Timedelta(hours=1)).mean()

# MTTR: average time from failure to restoration
mttr = (restore_times - failure_times).mean()

print(f"MTBF: {mtbf:.2f} hours")
print(f"MTTR: {mttr.total_seconds() / 3600:.2f} hours")
Ensuring high system availability is essential for critical services. A comprehensive monitoring system is key to promptly detecting and addressing any issues.
To plot these metrics, you can use various tools. For example, for visualizing availability and downtime, a line chart would be suitable. If you’re monitoring MTBF and MTTR over time, a scatter plot can provide insights.
Here is the Python code:
from datetime import datetime

class SystemMonitor:
    def __init__(self):
        self.start_time = datetime.now()

    def get_operational_time(self):
        # Seconds elapsed since monitoring started
        return (datetime.now() - self.start_time).total_seconds()

    def get_failure_count(self):
        # Replace with your failure detection logic.
        return 0

    def total_repair_time(self):
        # Replace with your repair time aggregation logic.
        return 0

monitor = SystemMonitor()
failures = monitor.get_failure_count()

# Guard against division by zero when no failures have been recorded yet
if failures > 0:
    mtbf = monitor.get_operational_time() / failures
    mttr = monitor.total_repair_time() / failures
    print(f"MTBF: {mtbf:.2f} seconds, MTTR: {mttr:.2f} seconds")
else:
    print("No failures recorded yet.")