Symptoms
MDS WRN: CS#1025 have reported IO error on pushing chunk 1cee of 'data.0', please check disks
MDS ERR CS#1026 detected back storage I/O failure
MDS ERR CS#1026 detected journal I/O failure
MDS WRN: Integrity failed accessing 'data.0' by the client at 192.168.1.11:42356
MDS WRN: CS#1025 is failed permanently and will not be used for new chunks allocation
Cause
If an I/O error is returned by any disk, the Chunk Server (CS) located on this disk is switched to the 'failed' state. Acronis Storage does not automatically recover the CS from this state, even after a Storage Node reboot.
Right after an I/O error occurs, the file system is remounted in read-only mode and Acronis Storage no longer allocates any data chunks on this CS. At the same time, if the drive is still available for reading, Acronis Storage tries to replicate all chunks off it.
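For example, a minimal way to confirm that the file system backing a CS has been remounted read-only is to check its mount options (the mount point shown in the Solution section below is used here as an assumption; it will differ in your environment):
grep /vstorage /proc/mounts
A CS mount whose options start with ro instead of rw is in the read-only state described above.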
Solution
The following workflow is recommended to troubleshoot the issue:
- Determine the affected disk.
- Check its health status.
- Decide if the device needs replacement.
- Based on the information above, return the failed CS to Active status or decommission it.
1. Determine the affected device
How to find the affected node and drive with WebCP
In the left menu, go to Nodes and click the node marked as Failed. Note the name of this node. Click Disks and find the disk marked as Failed. Note the device name of this disk (for example, sdc).
How to find the affected disk with SSH and CLI
Log in to any node of the Acronis Storage cluster with SSH.
Issue the following command:
vstorage -c <cluster_name> stat | grep failed
Example output:
[root@ ~]# vstorage -c PCKGW1 stat | grep failed
connected to MDS#2
CS nodes: 6 of 6 (5 avail, 0 inactive, 0 offline, 1 out of space, 1 failed), storage version: 122
1026 failed 98.2GB 0B 6 2 0% 0/0 0.0 172.29.38.210 7.5.111-1.as7
Note the CS ID displayed in the first column (1026 in the example above) and the IP address of the node where the CS is located (172.29.38.210 in the example above).
Log in to the affected node.
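For example, using the IP address from the output above:
ssh root@172.29.38.210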
To determine the disk where the affected CS is located, use the following command:
vstorage -c <cluster_name> list-services
Example output:
[root@PCKGW1 ~]# vstorage -c PCKGW1 list-services
TYPE ID ENABLED STATUS DEVICE/VOLUME GROUP DEVICE INFO PATH
CS 1025 enabled active [1297] /dev/sdd1 VMware Virtual disk /vstorage/df218335/cs
CS 1026 enabled active [1288] /dev/sdc1 VMware Virtual disk /vstorage/12bb6baf/cs
MDS 1 enabled active [1295] /dev/sdb1 VMware Virtual disk /vstorage/38b5fb92/mds
In the ID column, find the CS with the ID you noted in the previous step. Note the Device/volume for this CS and its path (see the PATH column). The PATH column is useful when you need to review the log file for a given CS: the log files are located under <PATH>/logs (/vstorage/12bb6baf/cs/logs in the example above).
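For instance, a quick way to list and scan the most recent CS logs for I/O errors, using the example path above (the exact log file names may differ, so adjust the pattern as needed):
ls -lt /vstorage/12bb6baf/cs/logs
grep -i error /vstorage/12bb6baf/cs/logs/*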
2. Check the affected disk health status
The goal of this step is to collect the information required to decide whether it is possible to continue using the affected disk or whether it should be replaced.
The following information should be reviewed and analyzed for any data related to the issue:
- dmesg command output. It is handy to use dmesg -T in order to see human-readable timestamps.
- /var/log/messages file
- SMART status of the physical hard drive. It can be acquired with: smartctl -a /dev/<affected device> (see the example commands after this list).
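A minimal set of commands to collect this information, assuming the affected device is sdc as in the examples above (smartctl requires the smartmontools package):
dmesg -T | grep -i sdc
grep -i sdc /var/log/messages
smartctl -a /dev/sdc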
3. Decide if the device needs replacement
Depending on the physical storage type (directly attached JBOD, iSCSI LUN, Fibre Channel, etc.) and the particular circumstances, the exact error messages and patterns vary greatly.
Here are some rules of thumb to facilitate the decision-making process:
- If the SMART status of the physical disk is unsatisfactory, the disk usually needs to be replaced.
- Check whether similar issues or any other error messages were previously logged for this disk. If the issue appears for the first time, the CS can usually be reused without configuration changes. Nevertheless, pay special attention to this CS in the future.
- If there are multiple error messages in dmesg and/or /var/log/messages for several disks on a single backplane or RAID controller, the hardware itself could be the culprit (see the example after this list). Contact your hardware vendor for additional review.
- In case of an iSCSI device, I/O errors could be the result of poor network connectivity or incorrect network configuration. Troubleshooting should start with a thorough network check.
- If Acronis Storage is installed in a virtual machine and the CS is located on a .vmdk or .vhd file stored on a NAS, such a system should be carefully checked for reliability before going to production. Acronis Storage ships a special tool, vstorage-hwflush-check, for checking how a storage device flushes data to disk in an emergency such as a power outage. We strongly recommend using this tool to make sure your storage behaves correctly in case of power-off events. This article explains how to use the tool.
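As an illustration for the backplane/RAID controller point above, a rough way to check whether dmesg reports I/O problems for more than one disk (the search pattern is only an assumption; adjust it to the messages actually logged on your system):
dmesg -T | grep -iE 'i/o error|medium error|hard resetting'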
4. Return the failed CS to Active status
If it is decided to reuse the same CS on the same drive, follow the steps below:
- Reboot the affected Acronis Storage node
- Check dmesg | grep <disk name> (e.g., dmesg | grep sdc in the example above) for any messages about file system errors on the affected drive. In case of errors, check the file system with fsck or e2fsck.
- Use the following command to override the failed status of the CS:
vstorage -c <cluster_name> rm-cs -U <CSID>
- Verify and confirm Active state for the CS with the following command:
vstorage -c <cluster_name> stat | grep <CSID>
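Putting these steps together for the example used throughout this article (cluster PCKGW1, CS 1026 on /dev/sdc1); device names and IDs will differ in your environment, and a repairing fsck must only be run on an unmounted file system:
# After rebooting the node, check the kernel log for file system errors on the affected drive
dmesg -T | grep -i sdc
# Optionally perform a read-only file system check (-n makes no changes)
e2fsck -n /dev/sdc1
# Clear the failed status for CS 1026
vstorage -c PCKGW1 rm-cs -U 1026
# Confirm that the CS is reported as active again
vstorage -c PCKGW1 stat | grep 1026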