1.1. What is a Cluster?

ShaoLin InfiniCluster (SIC) connects, or clusters, multiple independent systems into a single management framework for increased availability. Each system, or node, runs its own operating system and cooperates with the other nodes at the software level to form a cluster. SIC links commodity hardware with intelligent software to provide application failover and control. When a node or a monitored application fails, other nodes can take predefined actions to take over and bring up the affected services elsewhere in the cluster.

1.1.1. Failure Detection

SIC can detect both application failure and node failure in the cluster domain.

1.1.1.1. Application Failure Detection

SIC is typically deployed to keep business-critical applications online and available to users. SIC detects application failure by issuing specific commands, tests, or scripts that monitor the overall health of an application. SIC also determines the health of the underlying resources that support the application, such as file systems and network interfaces. In addition, SIC can detect system overload to decide whether a system is running abnormally.
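The following is a minimal sketch of command-based application monitoring, not the SIC implementation. The function names check_application and is_failed, the timeout, and the three-strike threshold are assumptions chosen for illustration. The monitor command is considered healthy only if it exits with status 0 within the timeout.

```python
import subprocess


def check_application(command, timeout_seconds=5.0):
    """Run a monitor command; return True if it exits with status 0 in time.

    `command` is a list such as ["/path/to/monitor-script"] (hypothetical).
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_seconds,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung monitor, or one that cannot be launched, counts as a failed check.
        return False
    return result.returncode == 0


def is_failed(check_results, threshold=3):
    """Declare failure only after `threshold` consecutive failed checks.

    `check_results` is a list of booleans, oldest first. The threshold is an
    assumed policy that avoids reacting to a single transient failure.
    """
    recent = check_results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

Requiring several consecutive failures before declaring the application down is one common design choice. A real deployment would tune the interval and threshold for each application.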

1.1.1.2. Node Failure Detection

One of the most difficult tasks in clustering is correctly discriminating between loss of a system and loss of communication between systems. SIC uses the ShaoLin General Parallel Cluster Infrastructure (SGPCI), a kernel-to-kernel heartbeat, membership, and communication system that runs over redundant networks, to detect node failure and to perform fencing. For more information on detecting node failure and how SIC protects data, see Cluster Control, Communications, and Membership on page 15.
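The distinction between a lost link and a lost node can be sketched as follows. This is an illustrative simplification and not the SGPCI algorithm. The function name classify_peer, the link names, and the timeout value are assumptions. A peer is suspected down only when heartbeats have stopped on every redundant link. Silence on only some links is treated as a communication fault.

```python
def classify_peer(last_seen_by_link, now, timeout=2.0):
    """Classify a peer node from heartbeat timestamps on redundant links.

    `last_seen_by_link` maps a link name to the time (in seconds) at which
    the most recent heartbeat from the peer arrived on that link.
    """
    if not last_seen_by_link:
        raise ValueError("at least one heartbeat link is required")

    # Links on which the peer has been silent for longer than the timeout.
    silent = [
        link for link, seen in last_seen_by_link.items()
        if now - seen > timeout
    ]

    if not silent:
        return "healthy"
    if len(silent) < len(last_seen_by_link):
        # The peer still answers on at least one link, so it is alive.
        return "link_failure"
    # No link has heard from the peer. The node may be down, or all links
    # may have failed at once, which is why fencing is still needed.
    return "node_suspected_down"
```

Even with redundant links, total silence cannot prove that the peer has stopped. That is the reason a membership system pairs heartbeat monitoring with fencing before taking over shared data.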

1.1.2. Switchover, Failover and Failback

Failover, failback and switchover are the processes of bringing up application services on a different node in a cluster. In all cases, an application and its network identity are brought up on a selected node. Client systems access a virtual IP address that moves with the service, so they are unaware of which server they are using. A virtual IP address is an address brought up in addition to the base address of a system in the cluster. For example, in a 2-node cluster consisting of db-server1 and db-server2, the virtual address may be named db-server. Clients then access db-server regardless of which physical server currently hosts it. Virtual IP addresses use a technology known as IP Aliasing.
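As an illustration of IP aliasing, the sketch below builds the Linux iproute2 commands that add and remove a secondary address on an interface. It only constructs the argument lists and does not run them. The helper name alias_commands, the address 192.0.2.10, the interface eth0, and the alias index are assumptions for illustration. How SIC itself manages virtual addresses is not described here.

```python
import ipaddress


def alias_commands(virtual_ip, prefix_len, device, alias_index=0):
    """Return (bring_up, bring_down) argument lists for a virtual IP alias.

    The lists follow iproute2 syntax:
      ip addr add <ip>/<prefix> dev <device> label <device>:<index>
      ip addr del <ip>/<prefix> dev <device>
    """
    # Raises ValueError if the address is malformed.
    ipaddress.ip_address(virtual_ip)

    cidr = f"{virtual_ip}/{prefix_len}"
    label = f"{device}:{alias_index}"
    bring_up = ["ip", "addr", "add", cidr, "dev", device, "label", label]
    bring_down = ["ip", "addr", "del", cidr, "dev", device]
    return bring_up, bring_down


# Example: the db-server virtual address on whichever node hosts the service.
up, down = alias_commands("192.0.2.10", 24, "eth0")
print(" ".join(up))
print(" ".join(down))
```

Because the alias is added alongside the base address, the node stays reachable under its own name (for example, db-server1) while also answering for db-server.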

Figure 1-1. Switchover, Failover and Failback

1.1.2.1. Switchover

A switchover is an orderly shutdown of an application and its supporting resources on one host and a controlled startup on another host. Typically this means unassigning the virtual IP address, stopping the application, and deporting shared storage. On the other host, the process is reversed. Storage is imported, file systems are mounted, the application is started, and the virtual IP address is brought up at the other host.
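The ordering described above can be sketched as a plan in which the shutdown on the original host is the exact reverse of the startup on the other host. The step names and the function switchover_plan are hypothetical labels, not SIC commands. The unmount step is an assumption inferred from the statement that the startup reverses the shutdown.

```python
# Startup order on the host that takes over the service.
START_ORDER = [
    "import_storage",
    "mount_file_systems",
    "start_application",
    "assign_virtual_ip",
]

# The shutdown action that undoes each startup action.
STOP_ACTION = {
    "import_storage": "deport_storage",
    "mount_file_systems": "unmount_file_systems",  # assumed, see lead-in
    "start_application": "stop_application",
    "assign_virtual_ip": "unassign_virtual_ip",
}


def switchover_plan(source, target):
    """Return an ordered list of (host, action) pairs for a switchover."""
    # Stop on the source in the reverse of the startup order.
    stop_steps = [(source, STOP_ACTION[step]) for step in reversed(START_ORDER)]
    # Then start on the target in the normal order.
    start_steps = [(target, step) for step in START_ORDER]
    return stop_steps + start_steps


for host, action in switchover_plan("db-server1", "db-server2"):
    print(f"{host}: {action}")
```

Deriving the shutdown sequence from the startup sequence keeps the two in step, so a resource is never released while something that depends on it is still running.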

1.1.2.2. Failover

A failover is similar to a switchover, except that an orderly shutdown of applications on the original node may not be possible, so the services are simply started on another node. The process of starting the application on the new node is identical for a failover and a switchover.

1.1.2.3. Failback

Failback follows the same procedure as a switchover. It is triggered automatically when a host recovers that has been assigned a higher priority to run a particular application.
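A minimal sketch of the failback decision, assuming each application has a numeric priority per node in which a higher number means more preferred. The function name failback_target and the priority values are hypothetical. The actual SIC configuration format is not shown here.

```python
def failback_target(priorities, online_nodes, current_node):
    """Return the node to fail back to, or None if no failback is needed.

    `priorities` maps node name to priority for one application.
    `online_nodes` is the set of nodes currently in the cluster.
    `current_node` is the node running the application now.
    """
    # Only online nodes configured for this application are candidates.
    candidates = [node for node in sorted(online_nodes) if node in priorities]
    if not candidates:
        return None

    best = max(candidates, key=lambda node: priorities[node])
    current_priority = priorities.get(current_node, float("-inf"))

    # Fail back only when a strictly more preferred node is available.
    if best != current_node and priorities[best] > current_priority:
        return best
    return None


priorities = {"db-server1": 100, "db-server2": 50}

# db-server1 is still down, so the service stays on db-server2.
print(failback_target(priorities, {"db-server2"}, "db-server2"))

# db-server1 has recovered, so the service fails back to it.
print(failback_target(priorities, {"db-server1", "db-server2"}, "db-server2"))
```

Once a target is chosen, moving the service uses the same ordered stop and start as a switchover.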