Sergey Nuzhdin

A few days ago, when I tried to install a Helm chart in my Kubernetes cluster, I noticed that all new pods that required storage were stuck in the Pending state. After a quick check of the logs, I found out that the pods were unable to get PVCs from GlusterFS. I recently wrote about my experience deploying a GlusterFS cluster. This time I will go through recovering data from the broken GlusterFS cluster, and some problems I faced deploying the new one.

State: Peer Rejected

The first thing I noticed was the bad state of the peers.

$ kubectl exec -ti glusterfs0-2272744551-a4ghp gluster peer status
Number of Peers: 2

Hostname: XXX
Uuid: fd499492-e26f-4c3f-919a-57ebced1439a
State: Peer Rejected (Connected)

This case is covered in the GlusterFS documentation.
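
As far as I remember, the documented fix boils down to roughly the following, run on the rejected peer (or inside the affected glusterfs pod). Treat it as a sketch rather than an exact recipe; the placeholder node name is yours to fill in:

$ systemctl stop glusterd
$ mv /var/lib/glusterd/glusterd.info /tmp/        # keep this node's UUID
$ rm -rf /var/lib/glusterd/*                      # wipe the rest of the peer state
$ mv /tmp/glusterd.info /var/lib/glusterd/
$ systemctl start glusterd
$ gluster peer probe <healthy-node>               # probe any peer that is in a good state
$ systemctl restart glusterd
$ gluster peer status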

NOTE: If you find your peers in the 'Accepted peer request' state after following the manual, restart glusterd one more time.

After all my nodes were in a good state, I tried to use the storage again. It didn't work, but I found these errors in the heketi logs (I don't remember if they were there before):

[heketi] ERROR 2017/03/24 12:24:51 /src/github.com/heketi/heketi/apps/glusterfs/app_volume.go:148: Failed to create volume: Unable to execute command on glusterfs-6bjq2: volume create: vol_9d00fda936724e9f0500575f734b8fe3: failed: Brick: 192.168.11.69:/var/lib/heketi/mounts/vg_d16a4532c7879392d4546009d7dab6d0/brick_2830bc82e8836bb5913949ca32d0f8d4/brick not available. Brick may be containing or be contained by an existing brick.

[sshexec] ERROR 2017/03/24 12:24:48 /src/github.com/heketi/heketi/executors/sshexec/volume.go:139: Unable to delete volume vol_9d00fda936724e9f0500575f734b8fe3: Unable to execute command on glusterfs-6bjq2: volume delete: vol_9d00fda936724e9f0500575f734b8fe3: failed: Volume vol_9d00fda936724e9f0500575f734b8fe3 does not exist

I spent several hours trying to heal the volumes using all possible commands from the GlusterFS mailing lists and different manuals [1][2], but it didn't help. The heal command showed that everything was fine, but the old volumes looked empty from inside the PVCs.
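
For reference, these are the sort of commands I was running (replace <volname> with one of your own volumes):

$ gluster volume status <volname>
$ gluster volume heal <volname> info
$ gluster volume heal <volname> full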

At some point, I realized that it would be easier to recreate the whole cluster, so I started looking into how to get the data off my volumes.

LVM intro

Since a GlusterFS brick is not a regular partition, you can't just mount the disk and copy your data. Instead, you need to access the logical volumes created by GlusterFS. In this setup, GlusterFS (through heketi) uses LVM (the Logical Volume Manager) to create and manage storage, which means there are three main types of objects to know about:

LVM diagram

Volume Group (VG) - the highest level of abstraction in LVM; it gathers a collection of logical and physical volumes.

Physical Volume (PV) - usually a hard disk, or a device that 'looks' like a hard disk.

Logical Volume (LV) - the equivalent of a disk partition in a regular system.
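
A quick way to see all three layers on a node is the short-form LVM commands (part of the lvm2 tools):

$ pvs        # physical volumes and the VG they belong to
$ vgs        # volume groups and their sizes
$ lvs        # logical volumes: the GlusterFS bricks and their thin pools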


Recovering data

For the recovery, we need to ssh into the data node. We only need information about the logical volumes, but for a better understanding, I added similar commands for PVs and VGs.

Physical Volumes

# List all physical volumes
$ pvscan
  PV /dev/sdd   VG vg_af24fae4b067b6e1ff353d36b6a6918d   lvm2 [1023.87 GiB / 861.94 GiB free]
  Total: 1 [1023.87 GiB] / in use: 1 [1023.87 GiB] / in no VG: 0 [0   ]

# Display various attributes of physical volumes
$ pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdd
  VG Name               vg_af24fae4b067b6e1ff353d36b6a6918d
  PV Size               1.00 TiB / not usable 132.00 MiB
  Allocatable           yes
  PE Size               4.00 MiB
  Total PE              262111
  Free PE               220657
  Allocated PE          41454
  PV UUID               eHxHEy-bMhv-rzF1-HOQn-ds1f-fs3T-GChR1p

Volume groups

$ vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "vg_af24fae4b067b6e1ff353d36b6a6918d" using metadata type lvm2

$ vgdisplay
  --- Volume group ---
  VG Name               vg_af24fae4b067b6e1ff353d36b6a6918d
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  194
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                26
  Open LV               13
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               1023.87 GiB
  PE Size               4.00 MiB
  Total PE              262111
  Alloc PE / Size       41454 / 161.93 GiB
  Free  PE / Size       220657 / 861.94 GiB
  VG UUID               Vhnkxf-HCHU-1USc-Y1fo-sNej-PNzr-fZXd3A

Logical Volumes

$ lvscan
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/tp_32a08cbefd56fbdb4bcceba3b20ab4aa' [2.00 GiB] inherit
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/brick_32a08cbefd56fbdb4bcceba3b20ab4aa' [2.00 GiB] inherit
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/tp_8a72359bb8912053e87af7cf411282ab' [15.00 GiB] inherit
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/brick_8a72359bb8912053e87af7cf411282ab' [15.00 GiB] inherit
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/tp_b9b58b673d04adc56531bc4553c566e8' [20.00 GiB] inherit
  ACTIVE            '/dev/vg_af24fae4b067b6e1ff353d36b6a6918d/brick_b9b58b673d04adc56531bc4553c566e8' [20.00 GiB] inherit

$ lvdisplay
  --- Logical volume ---
  LV Name                tp_32a08cbefd56fbdb4bcceba3b20ab4aa
  VG Name                vg_af24fae4b067b6e1ff353d36b6a6918d
  LV UUID                ySroXS-reeP-L3mh-PJ8F-JzXS-8IWE-BubioM
  LV Write Access        read/write
  LV Creation host, time server.example.com, 2017-03-25 14:54:02 +0100
  LV Pool metadata       tp_32a08cbefd56fbdb4bcceba3b20ab4aa_tmeta
  LV Pool data           tp_32a08cbefd56fbdb4bcceba3b20ab4aa_tdata
  LV Status              available
  # open                 2
  LV Size                2.00 GiB
  Allocated pool data    0.71%
  Allocated metadata     0.33%
  Current LE             512
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:2

  --- Logical volume ---
  LV Path                /dev/vg_af24fae4b067b6e1ff353d36b6a6918d/brick_32a08cbefd56fbdb4bcceba3b20ab4aa
  LV Name                brick_32a08cbefd56fbdb4bcceba3b20ab4aa
  VG Name                vg_af24fae4b067b6e1ff353d36b6a6918d
  LV UUID                QHUP2i-n2Tv-zz89-xpke-dr34-4jAJ-SJ286F
  LV Write Access        read/write
  LV Creation host, time server.example.com, 2017-03-25 14:54:03 +0100
  LV Pool name           tp_32a08cbefd56fbdb4bcceba3b20ab4aa
  LV Status              available
  # open                 1
  LV Size                2.00 GiB
  Mapped size            0.71%
  Current LE             512
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:4

  --- Logical volume ---
  LV Name                tp_8a72359bb8912053e87af7cf411282ab
  VG Name                vg_af24fae4b067b6e1ff353d36b6a6918d
  LV UUID                C7TByo-VdOq-mL1y-sdyY-7Pza-4IxI-1k3kxs
  LV Write Access        read/write
  LV Creation host, time server.example.com, 2017-03-25 16:43:06 +0100
  LV Pool metadata       tp_8a72359bb8912053e87af7cf411282ab_tmeta
  LV Pool data           tp_8a72359bb8912053e87af7cf411282ab_tdata
  LV Status              available
  # open                 2
  LV Size                15.00 GiB
  Allocated pool data    0.12%
  Allocated metadata     0.07%
  Current LE             3840
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:7

  --- Logical volume ---
  LV Path                /dev/vg_af24fae4b067b6e1ff353d36b6a6918d/brick_8a72359bb8912053e87af7cf411282ab
  LV Name                brick_8a72359bb8912053e87af7cf411282ab
  VG Name                vg_af24fae4b067b6e1ff353d36b6a6918d
  LV UUID                zuUtZB-oRJP-lmZt-Qmn2-j8ta-LR3k-4rH2VA
  LV Write Access        read/write
  LV Creation host, time server.example.com, 2017-03-25 16:43:07 +0100
  LV Pool name           tp_8a72359bb8912053e87af7cf411282ab
  LV Status              available
  # open                 1
  LV Size                15.00 GiB
  Mapped size            0.12%
  Current LE             3840
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           252:9

What we need from all of this is the output of the lvdisplay command, specifically the LV Path field, because we can mount that path like a regular disk.

$ mount /dev/vg_d16a4532c7879392d4546009d7dab6d0/brick_be8a8a7cbf5239caa185cb8d00ae4cfd /mount/volume1

Since I had more than 30 GlusterFS volumes, I ended up with this bash one-liner. It mounts each volume to its own directory.

$ cd your-target-directory
$ for i in $(lvdisplay | grep /dev/ | cut -d" " -f20); do DIR="${i##/*/}"; echo $DIR; mkdir $DIR; mount $i $DIR; done
$ ls
brick_be8a8a7cbf5239caa185cb8d00ae4cfd 
brick_8a72359bb8912053e87af7cf411282ab 
...
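
The cut -d" " -f20 part depends on the exact column layout of lvdisplay, so if it misbehaves, a sturdier variant (assuming the lvm2 tools are installed) could pull the paths from lvs instead and mount only the brick LVs, skipping the thin pools:

$ for i in $(lvs --noheadings -o lv_path | grep brick_); do DIR="${i##*/}"; mkdir -p "$DIR"; mount "$i" "$DIR"; done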

Now you can inspect the volumes and copy your data off the bricks as usual.
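
In my case (heketi-managed bricks), the actual files sat in a brick subdirectory inside each mount, next to the internal .glusterfs metadata, so copying a volume out looked something like this (paths are examples from my setup):

$ rsync -a --exclude='.glusterfs' brick_be8a8a7cbf5239caa185cb8d00ae4cfd/brick/ /backup/volume1/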

Deploy new cluster

All the data is saved and the disks are clean. Time to deploy GlusterFS again. I was very interested to see whether anything had changed since my previous experience, so I decided to follow the installation manual from the beginning.

I was pleasantly surprised that I didn't face any of the problems I had before, probably because the system was no longer fresh. Instead, I faced a new problem, apparently caused by leftovers from the previous deployment:

$ heketi-cli topology load --json=topology-sample.json
        Found node node-03 on cluster b55122d933d2a850f69524d9758c7c49
                Adding device /dev/sdb ... Unable to add device: Unable to execute command on glusterfs-b0n9m:   WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  Can't open /dev/sdb exclusively.  Mounted filesystem?
        Found node node-02 on cluster b55122d933d2a850f69524d9758c7c49
                Adding device /dev/sdb ... Unable to add device: Unable to execute command on glusterfs-xnftq:   WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  Can't open /dev/sdb exclusively.  Mounted filesystem?
        Found node node-01 on cluster b55122d933d2a850f69524d9758c7c49
                Adding device /dev/sdd ... Unable to add device: Unable to execute command on glusterfs-x4kdf:   WARNING: Not using lvmetad because config setting use_lvmetad=0.
  WARNING: To avoid corruption, rescan devices to make changes visible (pvscan --cache).
  Can't open /dev/sdd exclusively.  Mounted filesystem?

Running pvscan --cache as suggested didn't help, and neither did deleting all the LVs, VGs, and PVs. It didn't work even after a reboot. In the end, I found the solution on Stack Overflow.

$ dmsetup remove sdX

dmsetup is a low-level utility for working with device-mapper, the kernel layer that LVM is built on. More information can be found in its man page.
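
In practice, the stale mappings left over from the old bricks can be listed and removed one by one; here is a rough sketch (the mapping names come from dmsetup ls and will differ on your system):

$ dmsetup ls                          # list leftover device-mapper entries
$ dmsetup remove <mapping-name>       # drop a stale entry so the disk can be opened exclusively again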

Anyway, I continued to follow the manual:

$ heketi-cli setup-openshift-heketi-storage
$ kubectl create -f heketi-storage.json
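
Before deleting the helper objects, it's worth checking that the DB copy job created from heketi-storage.json has actually completed; in my version it was called heketi-storage-copy-job, but check the generated file to be sure:

$ kubectl get jobs
$ kubectl get pods --selector=job-name=heketi-storage-copy-job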

At this point everything was fine. Running heketi-cli topology info returned all my nodes and volumes. Then I deleted all the helper objects and deployed heketi-deployment:

# kubectl delete all,service,jobs,deployment,secret --selector="deploy-heketi" 
# kubectl create -f heketi-deployment.json 
service "heketi" created
deployment "heketi" created

After this, heketi-cli cluster info returned nothing.

I thought that maybe I hadn't waited for the deployment job to finish. Following the manual a second time gave me the same result.

After some debugging, I traced the problem to an empty heketi.db in the heketi-db-backup secret.

Setting the heketi.db value from the generated heketi-storage.json file seemed to fix the issue.
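
A hedged sketch of that fix, assuming jq is available and the generated heketi-storage.json still contains the Secret object with the heketi.db payload:

$ DB=$(jq -r '.items[] | select(.kind=="Secret") | .data["heketi.db"]' heketi-storage.json)
$ kubectl patch secret heketi-db-backup -p "{\"data\":{\"heketi.db\":\"$DB\"}}"

After patching, I would expect to have to restart the heketi pod so it picks up the restored database.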

Conclusion

I don't know what caused the problem with my storage; one guess is some network-related issue. I didn't lose any data, but there were some scary moments :).

I will continue to use GlusterFS on the staging cluster, but for now, I'm not ready to use it in production.

