Imagine your business without failover capabilities. What
impact would disasters such as ransomware attacks, natural
catastrophes, hardware failures, file corruption, human
errors, and numerous other issues have? Without proper
preventive measures, these disruptions can halt production and
seriously degrade your business value.
What is Failover? Understanding the Basics
Failover is a core part of backup and disaster recovery.
It is the process of switching essential workloads,
systems, and applications to a standby or secondary site
when the main site is down or unavailable.
This ensures business operations can continue almost or
entirely uninterrupted, even during disasters or scheduled
maintenance.
Why is Failover Crucial for Your Business?
Failover is vital for business continuity as it reduces or
prevents total system failures, enhancing system resilience
against single points of failure. This resilience ensures
services continue despite component failures, contributing
significantly to a business's uptime and reliability.
Failover is particularly important for mission-critical
systems that must always be available. It allows employees
to continue their
work without interruption and ensures easy access to files
and systems, even during unplanned outages. This is
especially important for businesses with strict uptime
requirements.
Failover Configurations: How to Set Them Up
High Availability (HA) can be set up in Active-Active or
Active-Standby mode. Here is how AHM configures each:
Active-Active (Independent Disks)
In an Active-Active HA configuration, two
or more nodes perform the same tasks simultaneously. This
setup ensures that the workload is evenly distributed and
balanced across all nodes, preventing any node from being
overloaded. Active-Active clusters can
utilize more nodes, thus improving throughput and response
times. For the HA cluster to operate smoothly, each node's
configuration and settings must be identical.
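To make the idea of evenly distributed work concrete, here is a minimal sketch of round-robin distribution across active nodes. It is an illustration only, not how AHM balances load: the `distribute` function, the node names, and the request IDs are all assumptions made for this example.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, nodes):
    """Assign each request to the next node in round-robin order."""
    if not nodes:
        raise ValueError("at least one active node is required")
    rotation = cycle(nodes)
    return [(request, next(rotation)) for request in requests]


# Hypothetical node names and request IDs, for illustration only.
assignments = distribute(range(6), ["node-a", "node-b", "node-c"])
load = Counter(node for _, node in assignments)
print(dict(load))  # {'node-a': 2, 'node-b': 2, 'node-c': 2}
```

Because every node receives the same share, no single node is overloaded, and adding a node to the list lowers the share each one handles.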
The AHM (Agens High Availability Manager) system manages
and provides high availability for servers. It
distinguishes two types of servers. If there is an issue
with the primary server, the secondary server takes over
as the primary server and operates accordingly.
Primary, master, read/write server: This
server allows data input, modification, deletion, and
queries.
Secondary, slave, standby server: This
server reflects changes made on the primary server.
Active-Passive (Standby - Independent Disks)
Similar to an Active-Active cluster, an Active-Passive or
Active-Standby High Availability (HA) configuration also
consists of at least two nodes. However, as the name
suggests, not all nodes are active. In a typical two-node
setup, one node is always active while
the other remains passive or on standby, ready to take
over if the active node fails. For a smooth failover
process, both nodes must have identical settings.
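One way to check that both nodes really do have identical settings is to compare them before relying on failover. The sketch below is a generic illustration, not an AHM feature: the `config_drift` function and the sample setting names and values are assumptions chosen for the example.

```python
def config_drift(active: dict, standby: dict) -> dict:
    """Return {setting: (active_value, standby_value)} for every mismatch."""
    drift = {}
    for key in sorted(active.keys() | standby.keys()):
        if active.get(key) != standby.get(key):
            drift[key] = (active.get(key), standby.get(key))
    return drift


# Illustrative settings only; real nodes would report their own values.
active_settings = {"max_connections": 200, "port": 5432, "timezone": "UTC"}
standby_settings = {"max_connections": 100, "port": 5432}

print(config_drift(active_settings, standby_settings))
# {'max_connections': (200, 100), 'timezone': ('UTC', None)}
```

An empty result means the two nodes match. Any entry points to a setting that could make the standby behave differently after it takes over.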
Benefits of Effective Failover Strategies
Implementing a robust failover solution is critical for
maintaining business continuity and preserving value,
offering numerous benefits, including:
-
Guaranteed Business Continuity: Failover ensures that business operations continue as
usual, even if disasters occur or key components go
offline
-
Improved Uptime: The failover process
quickly switches from the primary system to a redundant
standby system when the main system is unavailable,
minimizing downtime and allowing work to continue
without interruption
-
Cost Savings: Implementing a failover
solution reduces costs associated with downtime, such as
lost revenue, productivity, opportunities, and brand
reputation damage
Considerations for Implementing Efficient Failover
While properly implemented failover can benefit a company,
it is essential to be aware of potential drawbacks and
make efforts to mitigate them:
-
Cost: Setting up, managing, and
monitoring a failover system involves significant costs,
including hardware and software expenses. Ensuring that
failover operations run smoothly and automatically may
require substantial capital investment in high-bandwidth
systems with synchronous data transmission capabilities
-
Expertise Needed: Just like the primary
systems, failover systems require professional
maintenance, testing, and validation to operate
smoothly. If deploying and managing a failover system
requires more expertise than is available in-house,
businesses might need to rely on external experts, which
can significantly increase costs
Activating Failover with AHM: A Step-by-Step Guide
AHM monitors the health of nodes within a cluster to
detect failures and initiate failover. If an active node
fails and cannot provide services, the HA solution detects
this and switches the standby server to an active role.
During normal operation, the standby server, which is
already running, replicates data from the active server.
This minimizes downtime and allows the standby server to
continue services seamlessly when the active database is
down.
Steps Involved in Failover Process:
-
AHM on each node detects a failure in the active
database
-
The standby AHM attempts to connect to the primary
database
- The failover process initiates
-
The most recently updated node executes a pre-promotion
script
-
Standby AHM promotes the standby database to primary
-
AHM assigns a virtual IP address to the new leader node
to normalize services
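The steps above can be sketched as a small simulation. This is a simplified model under stated assumptions, not AHM's implementation: the `Node` class, the `run_failover` function, and the `replayed_position` field (a stand-in for "most recently updated") are hypothetical, and steps 1 and 2 (detection and the standby's reconnection attempt) are collapsed into a single `healthy` flag.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    role: str                   # "primary", "standby", or "failed"
    healthy: bool = True
    replayed_position: int = 0  # hypothetical measure of how up to date the node is
    has_vip: bool = False


def run_failover(
    nodes: List[Node],
    pre_promotion: Callable[[Node], None] = lambda node: None,
) -> Optional[Node]:
    """Promote the most up-to-date healthy standby if the primary is down.

    Returns the new primary, or None when no failover is needed or possible.
    """
    primary = next((n for n in nodes if n.role == "primary"), None)
    # Steps 1-2: failure detection and re-check, modeled here by one flag.
    if primary is None or primary.healthy:
        return None
    candidates = [n for n in nodes if n.role == "standby" and n.healthy]
    if not candidates:
        return None
    # Steps 3-4: start failover and run the pre-promotion script
    # on the most recently updated node.
    leader = max(candidates, key=lambda n: n.replayed_position)
    pre_promotion(leader)
    # Step 5: promote the standby database to primary.
    primary.role, primary.has_vip = "failed", False
    leader.role = "primary"
    # Step 6: move the virtual IP to the new leader node.
    leader.has_vip = True
    return leader


cluster = [
    Node("db1", "primary", healthy=False, replayed_position=120, has_vip=True),
    Node("db2", "standby", replayed_position=90),
    Node("db3", "standby", replayed_position=118),
]
new_primary = run_failover(
    cluster, pre_promotion=lambda n: print(f"pre-promotion on {n.name}")
)
print(new_primary.name, new_primary.has_vip)  # db3 True
```

Choosing the standby with the highest replay position keeps data loss as small as possible, which is why the pre-promotion script runs on that node rather than on an arbitrary one.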
Failover - Recovery and Reversion to Primary:
-
The node affected by the failure is restored as a
standby and can then be reverted to primary.
-
The restored node calls the
`ahm_promote_node` command to
re-promote the node to the primary role.
AHM: The Ultimate Solution for Ensuring Stable Business
Continuity
Architecture of AHM
AHM operates with components that monitor cluster nodes
to detect failures and trigger failover. It includes:
Failover Process
-
Real-time monitoring of failures in systems, networks,
storage, and applications
-
Consensus-building on decisions related to failover
- Executing failover during failures
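The article does not describe how AHM builds consensus, so the following is only a sketch of one common approach: a strict majority vote. The `quorum_reached` function and the node names are assumptions for illustration.

```python
def quorum_reached(votes: dict, cluster_size: int) -> bool:
    """True when a strict majority of the cluster reports the primary as down."""
    down_votes = sum(1 for seen_down in votes.values() if seen_down)
    return down_votes > cluster_size // 2


# Each monitor reports whether it sees the primary as down (hypothetical data).
print(quorum_reached({"node-a": True, "node-b": True, "node-c": False}, 3))   # True
print(quorum_reached({"node-a": True, "node-b": False, "node-c": False}, 3))  # False
```

Requiring a majority means a single monitor that loses its own network connection cannot trigger a failover by itself.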
Heartbeat
-
Health checks between redundant nodes for Database and
AHM
-
Provides a single connection point to client
applications (using virtual IP)
- Manages cluster status
-
Executes pre- and post-scripts provided by the user
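A heartbeat health check can be pictured as counting consecutive missed responses. The sketch below is a generic illustration rather than AHM's actual mechanism: the `HeartbeatMonitor` class and its `max_missed` threshold are assumptions for this example.

```python
class HeartbeatMonitor:
    """Flags a node as failed after several consecutive missed heartbeats."""

    def __init__(self, max_missed: int = 3):
        self.max_missed = max_missed
        self.missed = {}  # node name -> consecutive missed heartbeats

    def record(self, node: str, responded: bool) -> bool:
        """Record one health-check result; return True if the node is now considered failed."""
        if responded:
            self.missed[node] = 0  # any successful check resets the counter
        else:
            self.missed[node] = self.missed.get(node, 0) + 1
        return self.missed[node] >= self.max_missed


monitor = HeartbeatMonitor(max_missed=3)
checks = [True, False, False, True, False, False, False]
results = [monitor.record("db1", ok) for ok in checks]
print(results)  # [False, False, False, False, False, False, True]
```

Waiting for several misses in a row avoids failing over because of one dropped packet, at the cost of a slightly longer detection time.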
AgensSQL's HA Solution, Pgpool-II:
-
Supports high availability as an HA extension
solution
-
Utilizes a distributed mechanism at the DB session level
through the expansion of read-only nodes
-
Detects failures in primary or standby databases and
automatically takes appropriate actions to ensure
service continuity
Features of Pgpool-II:
-
Connection pool management: Enhances overall performance
through the reuse of connections
-
Load balancing: Distributes queries
across multiple servers if they hold the same data using
the replication function
-
Failover: In the event of a failure in
the Master Server, the Slave Server assumes its
functions
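The load-balancing idea can be sketched as a router that sends writes to the primary and spreads reads across all servers. This is a simplified illustration and not Pgpool-II's implementation: the `QueryRouter` class is hypothetical, and treating every statement that starts with `SELECT` as a read is a deliberate simplification.

```python
from itertools import cycle


class QueryRouter:
    """Sends writes to the primary and rotates reads across all servers."""

    def __init__(self, primary: str, standbys: list):
        self.primary = primary
        self._readers = cycle([primary, *standbys])

    def route(self, sql: str) -> str:
        # Simplification: only statements starting with SELECT count as reads.
        is_read = sql.lstrip().lower().startswith("select")
        return next(self._readers) if is_read else self.primary


router = QueryRouter("primary", ["standby1"])
print(router.route("SELECT * FROM orders"))           # primary
print(router.route("select count(*) FROM orders"))    # standby1
print(router.route("INSERT INTO orders VALUES (1)"))  # primary
```

Reads can go to any server only because replication keeps the same data on each of them, which is the condition the load-balancing feature depends on.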
Agens High Availability Manager (AHM)
-
High Availability Components: AHM is a
high-availability component developed by AGEDB to
resolve Single Points of Failure (SPOF) in AgensSQL
database servers
-
Fault Detection and Automatic Measures:
Uses a distributed mechanism to detect failures in
primary or standby databases and automatically takes
appropriate measures to ensure service continuity
-
Guaranteed Availability: Installed alongside each
primary and standby database, AHM ensures the
availability of database servers for client
applications. If a failure occurs in the primary
database or system, AHM coordinates to designate one of
the available standby database servers as the new
primary and quickly restores services by bringing up the
virtual IP (VIP) on the new primary server
-
Support for Stable System Operation:
Provides features necessary for cluster operation, such
as failover handling, thus ensuring stable system
operation and ease of system expansion
This session covered the critical component of failover in
preserving business continuity and value. In our next
session, we will delve into more technical content and
share success stories of AGEDB's failover implementations.
For more information about AGEDB's advanced technology and
training or if you have any inquiries, please contact us
at marketing@agedb.io for professional
consultation.