The Corosync/Pacemaker pair is the successor of the long obsolete and unsupported Heartbeat v2 software package. While Heartbeat v2 played the role of both the Group Communication System (GCS) and the Cluster Resource Manager (CRM), these roles are split under the new system: Corosync is the GCS and in charge of communication, while Pacemaker sits on top of the GCS and plays the role of the CRM, managing the resources and responding to changes in the cluster status.
An existing cluster under Heartbeat v2 can be migrated to Corosync/Pacemaker using the script ngcp-migrate-ha-crm. Among other steps, this script modifies /etc/ha.d/haresources to disable stopping of resources by Heartbeat v2.
The script must be run on the standby node. The steps are performed first locally (on the standby node) and then on the node's peer (the active node), in order to minimise resource downtime.
The migration needs to be run on non-migrated pairs of nodes, from the standby node of each pair, and all services need to be in a good state (from the ngcp-service summary point of view). The program will perform these sanity checks before making it possible to proceed. The user will also be prompted for confirmation, which can be skipped for non-interactive use by setting the FORCE environment variable.
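For example, a non-interactive run on the standby node could look like the following minimal sketch, with FORCE=yes skipping the confirmation prompt as described above:

root@sp1:~# FORCE=yes ngcp-migrate-ha-crm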
For a Carrier system the configuration settings will be set per pair of nodes, so that the migration does not affect the entire cluster. Once the whole cluster has been migrated, these configuration files should be merged into the global one. If the switch does not need to be staged, the ngcp-parallel-ssh(8) command can help with that, using its inactive host selector, e.g. ngcp-parallel-ssh inactive "FORCE=yes ngcp-migrate-ha-crm".
Corosync is the Group Communication System (GCS). Its configuration resides in
/etc/corosync/corosync.conf and describes the communication details for both nodes.
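The following is a minimal sketch of what such a two-node configuration might look like; the transport, addresses, and node IDs are illustrative assumptions rather than the exact values shipped with NGCP:

totem {
    version: 2
    cluster_name: sp        # must match Pacemaker's cluster-name property
    transport: udpu         # unicast UDP, an illustrative choice
}

nodelist {
    node {
        ring0_addr: 192.168.255.251   # assumed address of sp1
        name: sp1
        nodeid: 1
    }
    node {
        ring0_addr: 192.168.255.252   # assumed address of sp2
        name: sp2
        nodeid: 2
    }
}

quorum {
    provider: corosync_votequorum     # see the quorum discussion below
}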
Corosync uses a voting system to determine the state of the cluster. Each configured node in the cluster receives one vote. A quorum is defined as a majority presence within the cluster, i.e. more than half of the configured nodes, or q = n/2 + 1 (using integer division). For example, if 8 nodes were configured, a quorum would be present if at least 5 nodes are communicating with each other. In this state, the cluster is said to be quorate, which means it can operate normally. (Any remaining nodes, 3 in the worst case, would see the cluster as inquorate and would relinquish all their resources.)
A two-node cluster is a special case: under the formula above, a quorum would consist of 2 functioning nodes. The Corosync config setting two_node: 1 overrides this and artificially sets the quorum to 1. This means that under a split-brain scenario (each node seeing only 1 vote), both nodes would see the cluster as quorate and try to become active, instead of both nodes going standby.
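In corosync.conf this corresponds to the votequorum configuration; a minimal sketch:

quorum {
    provider: corosync_votequorum
    two_node: 1     # artificially set the quorum of a two-node cluster to 1
}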
In addition to this, Pacemaker itself also uses an internal scoring system for individual resources. This mechanism is described below and not directly related to the quorum.
Pacemaker uses the communication service provided by Corosync to manage local resources. All status and configuration information is shared between all Pacemaker instances within the cluster as long as communication is up. This means that any configuration change done on any node will immediately and automatically be propagated to all other nodes in the cluster.
Pacemaker internally uses an XML document to store its configuration, called the "CIB", stored in /var/lib/pacemaker/cib/cib.xml. However, this XML document must never be edited or modified directly. Instead, a shell-like interface crm is provided to talk to Pacemaker, query status information, alter cluster state, view and modify the configuration, etc. Any configuration change done through crm is immediately reflected in the CIB XML, locally as well as on all other nodes in the cluster.
To repeat, do not ever directly modify Pacemaker’s XML configuration.
As an added bonus, just to make things more awkward, the syntax used by crm is not XML at all, but rather a Cisco-like command hierarchy.
Commands can be issued to
crm either directly from the shell as command-line
arguments, or interactively by entering a Cisco-like shell. So for example, the
current config can be viewed either from the shell with:
root@sp1:~# crm config show
...
Or interactively, as either:
root@sp1:~# crm
crm(live/sp1)# config
crm(live/sp1)configure# show
...
root@sp1:~# crm
crm(live/sp1)# config show
...
Interactive online help is provided by the ls command, which lists the commands valid in the current context, and by the help command for more verbose help output.
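For example, a quick interactive session might look like this (output abbreviated):

root@sp1:~# crm
crm(live/sp1)# ls
...
crm(live/sp1)# help status
...
crm(live/sp1)# bye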
The current cluster status can be viewed with the top-level status command:
crm(live/sp1)# status
Stack: corosync
Current DC: sp2 (version unknown) - partition with quorum
Last updated: Fri Nov 22 18:38:06 2019
Last change: Fri Nov 22 18:25:28 2019 by hacluster via crmd on sp1

2 nodes configured
7 resources configured

Online: [ sp1 sp2 ]

Full list of resources:

Resource Group: g_vips
    p_vip_eth1_v4_1   (ocf::heartbeat:IPaddr):      Started sp1
    p_vip_eth2_v4_1   (ocf::heartbeat:IPaddr):      Started sp1
Resource Group: g_ngcp
    p_monit_services  (ocf::ngcp:monit-services):   Started sp1
Clone Set: c_ping [p_ping]
    Started: [ sp1 sp2 ]
Clone Set: fencing [st-null]
    Started: [ sp1 sp2 ]
If the status is queried from sp2 instead, the output will be the same. Most importantly, the resources will not show up as "stopped" on sp2, but instead will be reported as running on sp1.
The resources reported are described in the configuration section below.
The NGCP templates do not operate on Pacemaker's CIB XML directly, but instead produce a file in CRM syntax in /etc/pacemaker/cluster.crm. This file is not handled by Pacemaker directly, but instead is loaded into Pacemaker via the crm command config load replace. It shouldn't be necessary to do this manually, as the script ngcp-ha-crm-reload handles this automatically; it is called from the config file's postbuild script.
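Should a manual reload ever be needed, the underlying operation is conceptually a sketch like the following, with maintenance mode enabled around it as discussed below:

crm maintenance on
crm config load replace /etc/pacemaker/cluster.crm
crm maintenance off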
Changes to the config don't need to be saved explicitly. This is done automatically by Pacemaker, which also shares any changes with all other members of the cluster. Within crm, changes made to the config are cached until made active with commit, or discarded with refresh. Unwanted changes to resource status can be avoided by enabling maintenance mode (see below).
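As an illustration of this caching, the following sketch changes a resource default, reviews the cached config, and then discards the change with refresh instead of committing it (the stickiness value is just an example):

root@sp1:~# crm
crm(live/sp1)# configure
crm(live/sp1)configure# rsc_defaults resource-stickiness=200
crm(live/sp1)configure# show
...
crm(live/sp1)configure# refresh
crm(live/sp1)configure# bye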
However, since our config is loaded from a template, any changes done to the config through crm will be lost the next time the config is rebuilt and reloaded from the template.
The currently active config can be shown with config show and should be logically identical to the contents of /etc/pacemaker/cluster.crm:
crm(live/sp1)# config show
node 1: sp1
node 2: sp2
primitive p_monit_services ocf:ngcp:monit-services \
    meta migration-threshold=20 \
    meta failure-timeout=800 \
    op monitor interval=20 timeout=60 on-fail=restart \
    op_params on-fail=restart
primitive p_ping ocf:pacemaker:ping \
    params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
    meta failure-timeout=800 \
    op monitor interval=1 timeout=60 on-fail=restart \
    op_params timeout=60 on-fail=restart
primitive p_vip_eth1_v4_1 IPaddr \
    params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
    op monitor interval=5 timeout=60 on-fail=restart \
    op_params on-fail=restart
primitive p_vip_eth2_v4_1 IPaddr \
    params ip=192.168.1.161 nic=eth2 cidr_netmask=24 \
    op monitor interval=5 timeout=60 on-fail=restart \
    op_params on-fail=restart
primitive st-null stonith:null \
    params hostlist="sp1 sp2"
group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1
clone c_ping p_ping
clone fencing st-null
location l_ngcp g_ngcp \
    rule pingd: defined pingd
colocation l_ngcp_with_vip inf: g_ngcp g_vips
location l_vips g_vips \
    rule pingd: defined pingd
order o_vip_then_ngcp Mandatory: g_vips g_ngcp
property cib-bootstrap-options: \
    have-watchdog=false \
    cluster-infrastructure=corosync \
    cluster-name=sp \
    stonith-enabled=yes \
    no-quorum-policy=ignore \
    startup-fencing=yes \
    maintenance-mode=false \
    last-lrm-refresh=1574443528
rsc_defaults rsc-options: \
    resource-stickiness=100
Each config object has a name. For example, the line clone c_ping p_ping defines a clone-type object with the name c_ping. This name is then used whenever the object needs to be referenced, e.g. when deleting it (config del ...), when starting or stopping a resource, when referring to resources from a group, etc.
Resources are the primary type of objects that Pacemaker handles. A resource is
anything that can be started or stopped, and a resource is normally allowed to
run on one node only. A resource is defined as a
primitive type object.
Pacemaker supports many types of resources, all of which have different options
that can be given to them. The config syntax defines that options given to a
resource itself are prefixed with
params, while options that influence how a
resource should be managed are prefixed with
meta. Options that are relevant
to operations that can be performed on a resource are prefixed with op.
Resources are grouped into classes, providers, and types. Details about them
(e.g. which options they support) can be obtained through the ra submenu:

crm(live/sp1)ra# info IPaddr
Manages virtual IPv4 addresses (portable version) (ocf:heartbeat:IPaddr)
...
primitive p_vip_eth1_v4_1 IPaddr \
    params ip=192.168.255.250 nic=eth1 cidr_netmask=24 \
    op monitor interval=5 timeout=60 on-fail=restart \
    op_params on-fail=restart
This defines a resource of type
IPaddr with name
p_vip_eth1_v4_1 and the
given parameters (address, netmask, interface). Pacemaker will check for the
existence of the address every 5 seconds, with an action timeout of 60 seconds.
If the monitor action fails, the resource is restarted.
primitive p_monit_services ocf:ngcp:monit-services \
    meta migration-threshold=20 \
    meta failure-timeout=800 \
    op monitor interval=20 timeout=60 on-fail=restart \
    op_params on-fail=restart
While Pacemaker has support for native systemd services, for the time being we're still relying on monit to manage our services. Therefore, services are defined in Pacemaker virtually identically to how they were defined in Heartbeat v2: through a monit-services start/stop script. The old Heartbeat script was /etc/ha.d/resource.d/monit-services, while the new script used by Pacemaker is the resource agent behind ocf:ngcp:monit-services (which, following the OCF convention, lives under /usr/lib/ocf/resource.d/ngcp/).
The primary difference between the two scripts is the support for a monitor action, as used by the op monitor definition above.
meta migration-threshold=20 means that the resource will be migrated away (instead of restarted) after 20 failures. See the discussion on failure counts below.

meta failure-timeout=800 means that the failure count should be reset to zero if the last failure occurred more than 800 seconds ago. (However, the actual timer depends on the cluster-recheck-interval; see the discussion of failure counts below.)
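These meta options can also be inspected or adjusted at runtime via the crm shell; a sketch, where the value 30 is just an example:

crm resource meta p_monit_services show migration-threshold
crm resource meta p_monit_services set migration-threshold 30

Note that such manual changes will be overwritten the next time the config is rebuilt from the template.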
primitive p_ping ocf:pacemaker:ping \
    params host_list="10.15.20.30 192.168.211.1" multiplier=1000 dampen=5s \
    meta failure-timeout=800 \
    op monitor interval=1 timeout=60 on-fail=restart \
    op_params timeout=60 on-fail=restart
The pingd service (using a resource name that intelligently is not pingd but rather just ping) replaces Heartbeat's ping nodes. It supports multiple ping backends, and uses fping by default.
Each configured ping node (each entry in host_list) produces a score of 1 if that ping node is up. The scores are summed up and multiplied by the multiplier. So in the example above, a score of 2000 is generated if both ping nodes are up. Pacemaker will then prefer the node which produces the higher score. dampen=5s means to wait 5 seconds after a change occurred, to prevent transient glitches from causing service flapping.
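The resulting score is stored as a transient node attribute named pingd, which can be queried manually; a sketch, assuming the node name sp1:

crm_attribute -N sp1 -n pingd -l reboot -G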
primitive st-null stonith:null \
    params hostlist="sp1 sp2"
Pacemaker will generate a warning if no fencing mechanism is configured,
therefore we configure the
null fencing mechanism.
Pacemaker supports several proper fencing mechanisms, and these might eventually get supported in the future.
group g_ngcp p_monit_services
group g_vips p_vip_eth1_v4_1 p_vip_eth2_v4_1
To manage, control, and restrict multiple resources at the same time, resources
can be grouped into single objects. The group
g_ngcp is pointless for the time
being (it contains only a single other resource) but will become useful once
native systemd resources are in use. The group
g_vips ensures that all shared
IP addresses are active at the same time.
clone c_ping p_ping
clone fencing st-null
Since a single resource normally only runs on one node, a clone can be defined
to allow a resource to run on all nodes. We want the
pingd service and the
fencing service to always run on all nodes.
colocation l_ngcp_with_vip inf: g_ngcp g_vips
This tells Pacemaker that we want to force the g_ngcp resource onto the same node that is running the g_vips resource.
location l_ngcp g_ngcp \
    rule pingd: defined pingd
location l_vips g_vips \
    rule pingd: defined pingd
This tells Pacemaker that these resources depend on the pingd service being up. If pingd fails on one node (its ping nodes are unavailable), then Pacemaker will shut down the constrained resources on that node.
order o_vip_then_ngcp Mandatory: g_vips g_ngcp
This tells Pacemaker that the shared IP addresses must be up and running before the system services can be started.
property cib-bootstrap-options: \
    have-watchdog=false \
    cluster-infrastructure=corosync \
    cluster-name=sp \
    stonith-enabled=yes \
    no-quorum-policy=ignore \
    startup-fencing=yes \
    maintenance-mode=false \
    last-lrm-refresh=1574443528
Relevant options are:
have-watchdog=false indicates that no external watchdog service such as SBD is in use.

cluster-name=sp is to match the configuration of Corosync.

stonith-enabled=yes is required to suppress a warning message, even though no real STONITH (only the null fencing mechanism) is in use.

no-quorum-policy=ignore tells Pacemaker to continue normally if quorum is lost. This is the only setting that makes sense in a two-node cluster.

startup-fencing=yes is also needed to suppress a warning even though no real fencing is in use. This tells Pacemaker to shoot nodes that are not present immediately after startup.

maintenance-mode=false tells Pacemaker to actually perform resource actions. If maintenance mode is enabled, Pacemaker will continue to run, but will not start or stop any services. Maintenance mode should be enabled before loading a new config, and disabled afterwards; the script ngcp-ha-crm-reload does this automatically.
Pacemaker keeps a failure count for each resource, which is somewhat hidden from view, but can largely influence its behaviour. Each time a service fails (either during runtime or during startup), the failure count is increased by one. If the failure count exceeds the configured migration-threshold, Pacemaker will cease trying to start the service and will migrate it away to another node. In crm status this simply shows up as the resource running on the other node, with no direct indication of the accumulated failures.
Failure counts can be cleared automatically if the failure-timeout setting is configured for a resource. This timeout is counted from the last time the resource failed, and is checked periodically according to the cluster-recheck-interval. In other words, a very short failure timeout won't have any effect unless the recheck interval is also very short.
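The recheck interval itself is a cluster property and could be shortened if needed; a sketch, where the 60-second value is just an example:

crm configure property cluster-recheck-interval=60s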
If no failure timeout is configured, any existing failure count must be cleared manually.
The failure count for a resource can be checked from the shell via
crm_failcount, for example:
root@sp1:~# crm_failcount -G -r p_monit_services
scope=status  name=fail-count-p_monit_services  value=0
The failure count on a different node can also be examined:
root@sp1:~# crm_failcount -G -r g_ngcp -N sp2
scope=status  name=fail-count-g_ngcp  value=0
The same can be done via the crm shell:
crm(live/sp1)resource# failcount g_vips show sp1
scope=status  name=fail-count-g_vips  value=0
crm(live/sp1)resource# failcount c_ping show sp2
scope=status  name=fail-count-c_ping  value=0
As a shortcut, the script
ngcp-ha-show-failcounts is provided:
root@sp1:~# ngcp-ha-show-failcounts
p_vip_eth1_v4_1:
  sp1: 0
  sp2: 0
p_vip_eth2_v4_1:
  sp1: 0
  sp2: 0
p_monit_services:
  sp1: 0
  sp2: 0
Analogous to checking a failure count, it can be cleared using any one of these methods:
root@sp1:~# crm_failcount -D -r p_monit_services
Cleaned up p_monit_services on sp1
root@sp1:~# crm resource failcount g_ngcp delete sp2
Cleaned up p_monit_services on sp2
root@sp1:~# crm
crm(live/sp1)# resource
crm(live/sp1)resource# failcount c_ping delete sp1
Cleaned up p_ping:0 on sp1
Cleaned up p_ping:1 on sp1
crm(live/sp1)resource# bye
root@sp1:~# ngcp-ha-clear-failcounts
Cleaned up p_monit_services on sp2
Cleaned up p_monit_services on sp1
In addition, the crm command resource cleanup also resets failure counts.
Pacemaker uses an internal scoring system to determine which resources to run where. A resource will be run on the node on which it received the highest score. If a resource has a negative score, that resource will not be run at all. If a resource has the same score on multiple nodes, then the resource will be run on any one of those nodes. Scores can be calculated and acted upon through various config settings.
A score value of infinity (and negative infinity) is provided to force certain states; it evaluates not to infinity at all, but rather to a static value of one million. This can be used to artificially manipulate resource scores, to force running a resource on a particular node, or to forbid a resource from running on particular nodes.
Scores can be inspected through the crm_simulate -sL command. The usual switchover scripts such as ngcp-make-standby work normally. Under Pacemaker, they function through the crm command resource move, which creates a temporary location constraint on g_vips. This can be done manually through:
crm resource move g_vips sp1 30
The lifetime of 30 seconds is needed because
g_ngcp depends on the location of
g_vips, and therefore
g_ngcp needs to be stopped before
g_vips can be
stopped. The location constraint must remain active until
g_ngcp has been
completely and successfully stopped.
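If the constraint should be removed before its lifetime expires, this can be done with the corresponding crm command; a sketch:

crm resource unmove g_vips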
These commands only affect the status of the running resources, and not the status of the node itself. This means that after going standby, Pacemaker will immediately be ready to take over the resources again if needed. See below for a discussion of node status.
The script ngcp-check-active uses the output of crm resource status g_vips to determine whether the local node is active or not.
In addition to the status and location of individual resources, nodes themselves can also go into standby mode. The node submenu of crm has the relevant commands.
A node in standby mode will not only give up all of its resources, but will also refuse to take them over until it’s back online. Therefore, it’s possible to set both nodes to standby mode and shut down all resources on both nodes.
A node in standby mode will still participate in GCS communications and remain visible to the rest of the cluster.
Use crm node standby to set the local node to standby mode. A remote node can be set to standby using e.g. crm node standby sp2.
By default, the standby mode remains active until it's cancelled manually (a lifetime of forever). Alternatively, a lifetime of reboot can be specified to tell Pacemaker that after the next reboot, the node should automatically come back online. Example:
crm node standby sp2 reboot
To cancel standby mode, use crm node online, optionally followed by the node name.
To show the current status of all nodes, use
crm node show. The top-level
crm status also shows this.
If Pacemaker's maintenance mode is enabled, it will continue to operate normally, i.e. continue to run and monitor resources, but will refuse to stop or start any resources. This is useful for making changes to the running config, and is done automatically by ngcp-ha-crm-reload.
To enable and disable maintenance mode:
crm maintenance on
crm maintenance off
or using the lower-level method:
crm configure property maintenance-mode=true
crm configure property maintenance-mode=false
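To check whether maintenance mode is currently enabled, the bootstrap options shown in the config above can be displayed; a sketch:

crm configure show cib-bootstrap-options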