Aspera Cluster Manager (Overview)

Aspera Cluster Manager (ACM) is the software module that starts the appropriate services on an Orchestrator node according to that node’s current status (active or passive). It also monitors the active node to determine when to fail over the active/passive services from the active node to the passive node, which happens when the active node becomes unresponsive.
Note: ACM must run as root.

How does it work?

ACM is installed on both Orchestrator nodes. Each ACM instance first determines the status of the node on which it is running by checking a common status file stored on the shared space dedicated to ACM. To avoid a race condition while accessing that common status file, a dedicated locking mechanism (aslockfile) synchronizes both instances.
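The idea of serializing access to a shared status file can be illustrated with a minimal Python sketch. It does not use aslockfile, whose interface is not described here; it substitutes an advisory lock from Python's standard fcntl module (Unix only) as a stand-in. The function names, the separate lock file, and the assumption that the status file holds the hostname of the active node are all hypothetical, and advisory locks may behave differently depending on the shared file system in use.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path for the duration of the block.

    Stand-in for aslockfile: this uses flock(2), not the real ACM mechanism.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until any other holder releases
        yield
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


def determine_role(status_path, lock_path, this_host):
    """Return "active" or "passive" for this_host.

    Assumption (hypothetical format): the status file contains the
    hostname of the node that is currently active.
    """
    with exclusive_lock(lock_path):
        try:
            with open(status_path, encoding="utf-8") as f:
                active_host = f.read().strip()
        except FileNotFoundError:
            active_host = ""  # hypothetical choice: no file means no active node recorded
    return "active" if active_host == this_host else "passive"
```

Because both instances take the same lock before reading, neither can observe the file while the other is in the middle of using it.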

Once the status of a node is determined, the ACM instance running on the active node verifies that all of the services are running, and it starts any service that is not running. Once this is done, the instance updates the status file in order to keep its last modification date current.
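As a rough illustration of this active-node pass, the following Python sketch checks each service, starts any that is not running, and then refreshes the status file's modification time. The function name and the is_running and start callbacks are hypothetical placeholders; how ACM actually checks and starts services is not covered here.

```python
import os


def active_node_pass(services, is_running, start, status_path):
    """Run one active-node pass and return the names of services that were started.

    services   -- iterable of service names (hypothetical)
    is_running -- callable(name) -> bool, supplied by the caller (hypothetical)
    start      -- callable(name) that starts a service (hypothetical)
    """
    started = []
    for name in services:
        if not is_running(name):
            start(name)
            started.append(name)

    # Refresh the status file so its last modification date stays current.
    with open(status_path, "a"):
        pass  # create the file if it does not exist yet
    os.utime(status_path, None)  # set access and modification time to now
    return started
```

Updating the timestamp only after the service check means a current status file also implies that the active node recently completed a full pass.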

The ACM instance running on the passive node checks that the status file is current, meaning that its last modification date is not older than 2 minutes. If the file is current, ACM checks that the active/passive services are up and running, and then starts any service that should be running but is not. If the common status file is no longer current, this is a failover scenario, and ACM takes over as the new active node by starting all of the services.
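The freshness test at the heart of this decision can be sketched in Python as a comparison between the current time and the status file's modification time. The 2-minute threshold comes from the description above; the function names and the returned labels are hypothetical, and a missing file is deliberately left to raise an error rather than being interpreted either way.

```python
import os
import time

STALE_AFTER_SECONDS = 120  # 2 minutes, as described above


def status_file_is_current(path, now=None, max_age=STALE_AFTER_SECONDS):
    """Return True if the file's last modification is no older than max_age seconds.

    Raises OSError if the file cannot be read; how that case should be
    handled is not specified here, so it is left to the caller.
    """
    if now is None:
        now = time.time()
    return (now - os.path.getmtime(path)) <= max_age


def passive_node_action(path, now=None):
    """Return "stay_passive" while the status file is current, else "take_over"."""
    return "stay_passive" if status_file_is_current(path, now) else "take_over"
```

Passing now explicitly makes the check easy to exercise without waiting for real time to elapse.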

How long does a failover process take?

If the passive node fails, then ACM does nothing. It is up to the load balancer to detect that the passive node is unresponsive and redirect the traffic accordingly. See Load Balancer Behavior for more information. This process typically takes one minute or less.

If the active node fails, ACM on the passive node eventually detects that the status file is no longer current and triggers a failover. In addition, the load balancer detects that the active node is down and redirects all traffic to the healthy node. This process typically takes up to 5 minutes.
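To see where a delay of this order can come from, consider a back-of-the-envelope estimate in Python. If the active node fails right after refreshing the status file, the file stays current for the full 2-minute window, and the passive node notices only at its next check after that. The check interval used below is a hypothetical value, not a documented ACM setting, and the estimate excludes the time needed to start services and for the load balancer to redirect traffic.

```python
def worst_case_detection_seconds(stale_after, check_interval):
    """Upper bound on the time between the active node's last status-file
    update and the passive node noticing that the file is stale.

    stale_after    -- age at which the status file stops being current
    check_interval -- time between passive-node checks (hypothetical)
    """
    return stale_after + check_interval


# 2-minute staleness window plus an assumed 60-second check interval.
print(worst_case_detection_seconds(120, 60))  # prints 180
```

Under these assumptions detection alone can take about 3 minutes, which is consistent with an overall failover time of up to 5 minutes once service startup is included.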