* Problem with query and any operation on PGs
       [not found] <175484591.20170523135449@tlen.pl>
@ 2017-05-23 12:48 ` Łukasz Chrustek
  2017-05-23 14:17   ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-23 12:48 UTC (permalink / raw)
  To: ceph-devel

Hello,

After a terrible outage caused by the failure of a 10 Gbit switch, the ceph cluster
went to HEALTH_ERR (three whole storage servers went offline at the same time
and didn't come back quickly). After the cluster recovered, two PGs ended up in an
incomplete state; I can't query them, and I can't do anything with them
that would get the cluster working again. Here is an strace of
the query command: https://pastebin.com/HpNFvR8Z. But... the cluster isn't entirely off:

[root@cc1 ~]# rbd ls management-vms
os-mongodb1
os-mongodb1-database
os-gitlab-root
os-mongodb1-database2
os-wiki-root
[root@cc1 ~]# rbd ls volumes
^C
[root@cc1 ~]#

and the same for all mon hosts (I don't paste all three here):

[root@cc1 ~]# rbd -m 192.168.128.1 list management-vms
os-mongodb1
os-mongodb1-database
os-gitlab-root
os-mongodb1-database2
os-wiki-root
[root@cc1 ~]# rbd -m 192.168.128.1 list volumes
^C
[root@cc1 ~]#

and for all other pools on the list I can list images, except for the
(most important) volumes pool.

Funny thing, I can get rbd info for a particular image:

[root@cc1 ~]# rbd info
volumes/volume-197602d7-40f9-40ad-b286-cdec688b1497
rbd image 'volume-197602d7-40f9-40ad-b286-cdec688b1497':
        size 20480 MB in 1280 objects
        order 24 (16384 kB objects)
        block_name_prefix: rbd_data.64a21a0a9acf52
        format: 2
        features: layering
        flags:
        parent: images/37bdf0ca-f1f3-46ce-95b9-c04bb9ac8a53@snap
        overlap: 3072 MB

but I can't list the whole content of the volumes pool.

[root@cc1 ~]# ceph osd pool ls
volumes
images
backups
volumes-ssd-intel-s3700
management-vms
.rgw.root
.rgw.control
.rgw
.rgw.gc
.log
.users.uid
.rgw.buckets.index
.users
.rgw.buckets.extra
.rgw.buckets
volumes-cached
cache-ssd

here is ceph osd tree:

ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
 -7  20.88388 root ssd-intel-s3700
-11   3.19995     host ssd-stor1
 56   0.79999         osd.56            up  1.00000          1.00000
 57   0.79999         osd.57            up  1.00000          1.00000
 58   0.79999         osd.58            up  1.00000          1.00000
 59   0.79999         osd.59            up  1.00000          1.00000
 -9   2.12999     host ssd-stor2
 60   0.70999         osd.60            up  1.00000          1.00000
 61   0.70999         osd.61            up  1.00000          1.00000
 62   0.70999         osd.62            up  1.00000          1.00000
 -8   2.12999     host ssd-stor3
 63   0.70999         osd.63            up  1.00000          1.00000
 64   0.70999         osd.64            up  1.00000          1.00000
 65   0.70999         osd.65            up  1.00000          1.00000
-10   4.19998     host ssd-stor4
 25   0.70000         osd.25            up  1.00000          1.00000
 26   0.70000         osd.26            up  1.00000          1.00000
 27   0.70000         osd.27            up  1.00000          1.00000
 28   0.70000         osd.28            up  1.00000          1.00000
 29   0.70000         osd.29            up  1.00000          1.00000
 24   0.70000         osd.24            up  1.00000          1.00000
-12   3.41199     host ssd-stor5
 73   0.85300         osd.73            up  1.00000          1.00000
 74   0.85300         osd.74            up  1.00000          1.00000
 75   0.85300         osd.75            up  1.00000          1.00000
 76   0.85300         osd.76            up  1.00000          1.00000
-13   3.41199     host ssd-stor6
 77   0.85300         osd.77            up  1.00000          1.00000
 78   0.85300         osd.78            up  1.00000          1.00000
 79   0.85300         osd.79            up  1.00000          1.00000
 80   0.85300         osd.80            up  1.00000          1.00000
-15   2.39999     host ssd-stor7
 90   0.79999         osd.90            up  1.00000          1.00000
 91   0.79999         osd.91            up  1.00000          1.00000
 92   0.79999         osd.92            up  1.00000          1.00000
 -1 167.69969 root default
 -2  33.99994     host stor1
  6   3.39999         osd.6           down        0          1.00000
  7   3.39999         osd.7             up  1.00000          1.00000
  8   3.39999         osd.8             up  1.00000          1.00000
  9   3.39999         osd.9             up  1.00000          1.00000
 10   3.39999         osd.10          down        0          1.00000
 11   3.39999         osd.11          down        0          1.00000
 69   3.39999         osd.69            up  1.00000          1.00000
 70   3.39999         osd.70            up  1.00000          1.00000
 71   3.39999         osd.71          down        0          1.00000
 81   3.39999         osd.81            up  1.00000          1.00000
 -3  20.99991     host stor2
 13   2.09999         osd.13            up  1.00000          1.00000
 12   2.09999         osd.12            up  1.00000          1.00000
 14   2.09999         osd.14            up  1.00000          1.00000
 15   2.09999         osd.15            up  1.00000          1.00000
 16   2.09999         osd.16            up  1.00000          1.00000
 17   2.09999         osd.17            up  1.00000          1.00000
 18   2.09999         osd.18          down        0          1.00000
 19   2.09999         osd.19            up  1.00000          1.00000
 20   2.09999         osd.20            up  1.00000          1.00000
 21   2.09999         osd.21            up  1.00000          1.00000
 -4  25.00000     host stor3
 30   2.50000         osd.30            up  1.00000          1.00000
 31   2.50000         osd.31            up  1.00000          1.00000
 32   2.50000         osd.32            up  1.00000          1.00000
 33   2.50000         osd.33          down        0          1.00000
 34   2.50000         osd.34            up  1.00000          1.00000
 35   2.50000         osd.35            up  1.00000          1.00000
 66   2.50000         osd.66            up  1.00000          1.00000
 67   2.50000         osd.67            up  1.00000          1.00000
 68   2.50000         osd.68            up  1.00000          1.00000
 72   2.50000         osd.72          down        0          1.00000
 -5  25.00000     host stor4
 44   2.50000         osd.44            up  1.00000          1.00000
 45   2.50000         osd.45            up  1.00000          1.00000
 46   2.50000         osd.46          down        0          1.00000
 47   2.50000         osd.47            up  1.00000          1.00000
  0   2.50000         osd.0             up  1.00000          1.00000
  1   2.50000         osd.1             up  1.00000          1.00000
  2   2.50000         osd.2             up  1.00000          1.00000
  3   2.50000         osd.3             up  1.00000          1.00000
  4   2.50000         osd.4             up  1.00000          1.00000
  5   2.50000         osd.5             up  1.00000          1.00000
 -6  14.19991     host stor5
 48   1.79999         osd.48            up  1.00000          1.00000
 49   1.59999         osd.49            up  1.00000          1.00000
 50   1.79999         osd.50            up  1.00000          1.00000
 51   1.79999         osd.51          down        0          1.00000
 52   1.79999         osd.52            up  1.00000          1.00000
 53   1.79999         osd.53            up  1.00000          1.00000
 54   1.79999         osd.54            up  1.00000          1.00000
 55   1.79999         osd.55            up  1.00000          1.00000
-14  14.39999     host stor6
 82   1.79999         osd.82            up  1.00000          1.00000
 83   1.79999         osd.83            up  1.00000          1.00000
 84   1.79999         osd.84            up  1.00000          1.00000
 85   1.79999         osd.85            up  1.00000          1.00000
 86   1.79999         osd.86            up  1.00000          1.00000
 87   1.79999         osd.87            up  1.00000          1.00000
 88   1.79999         osd.88            up  1.00000          1.00000
 89   1.79999         osd.89            up  1.00000          1.00000
-16  12.59999     host stor7
 93   1.79999         osd.93            up  1.00000          1.00000
 94   1.79999         osd.94            up  1.00000          1.00000
 95   1.79999         osd.95            up  1.00000          1.00000
 96   1.79999         osd.96            up  1.00000          1.00000
 97   1.79999         osd.97            up  1.00000          1.00000
 98   1.79999         osd.98            up  1.00000          1.00000
 99   1.79999         osd.99            up  1.00000          1.00000
-17  21.49995     host stor8
 22   1.59999         osd.22            up  1.00000          1.00000
 23   1.59999         osd.23            up  1.00000          1.00000
 36   2.09999         osd.36            up  1.00000          1.00000
 37   2.09999         osd.37            up  1.00000          1.00000
 38   2.50000         osd.38            up  1.00000          1.00000
 39   2.50000         osd.39            up  1.00000          1.00000
 40   2.50000         osd.40            up  1.00000          1.00000
 41   2.50000         osd.41          down        0          1.00000
 42   2.50000         osd.42            up  1.00000          1.00000
 43   1.59999         osd.43            up  1.00000          1.00000
[root@cc1 ~]#

and ceph health detail:

ceph health detail | grep down
HEALTH_WARN 23 pgs backfilling; 23 pgs degraded; 2 pgs down; 2 pgs
peering; 2 pgs stuck inactive; 25 pgs stuck unclean; 23 pgs
undersized; recovery 176211/14148564 objects degraded (1.245%);
recovery 238972/14148564 objects misplaced (1.689%); noout flag(s) set
pg 1.60 is stuck inactive since forever, current state
down+remapped+peering, last acting [66,69,40]
pg 1.165 is stuck inactive since forever, current state
down+remapped+peering, last acting [37]
pg 1.60 is stuck unclean since forever, current state
down+remapped+peering, last acting [66,69,40]
pg 1.165 is stuck unclean since forever, current state
down+remapped+peering, last acting [37]
pg 1.165 is down+remapped+peering, acting [37]
pg 1.60 is down+remapped+peering, acting [66,69,40]


problematic pgs are 1.165 and 1.60.

Please advise how to unblock the volumes pool and/or get these two PGs
working - during the last night and day, when we tried to solve this issue,
these PGs ended up 100% empty of data.




-- 
Regards,
 Łukasz Chrustek



* Re: Problem with query and any operation on PGs
  2017-05-23 12:48 ` Problem with query and any operation on PGs Łukasz Chrustek
@ 2017-05-23 14:17   ` Sage Weil
  2017-05-23 14:43     ` Łukasz Chrustek
       [not found]     ` <1464688590.20170523185052@tlen.pl>
  0 siblings, 2 replies; 35+ messages in thread
From: Sage Weil @ 2017-05-23 14:17 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel


On Tue, 23 May 2017, Łukasz Chrustek wrote:
> Hello,
> 
> After a terrible outage caused by the failure of a 10 Gbit switch, the ceph cluster
> went to HEALTH_ERR (three whole storage servers went offline at the same time
> and didn't come back quickly). After the cluster recovered, two PGs ended up in an
> incomplete state; I can't query them, and I can't do anything with them

The thing where you can't query a PG is because the OSD is throttling 
incoming work and the throttle is exhausted (the PG can't do work so it 
isn't making progress).  A workaround for jewel is to restart the OSD 
serving the PG and do the query quickly after that (probably in a loop so 
that you catch it after it starts up but before the throttle is 
exhausted again).  (In luminous this is fixed.)

Once you have the query output ('ceph tell $pgid query') you'll be able to 
tell what is preventing the PG from peering.

You can identify the osd(s) hosting the pg with 'ceph pg map $pgid'.
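
For example, something along these lines (just a rough sketch, assuming
systemd-managed OSDs; substitute your real pg id and the primary osd id
reported by 'ceph pg map'):

  pgid=1.165
  osd=37                                  # primary from 'ceph pg map $pgid'
  systemctl restart ceph-osd@$osd         # run on the host that owns the osd
  # query repeatedly right after startup, before the throttle fills up again
  while ! timeout 10 ceph tell $pgid query > /tmp/query.$pgid.json; do
      sleep 1
  done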

HTH!
sage


> that would get the cluster working again. Here is an strace of
> the query command: https://pastebin.com/HpNFvR8Z. But... the cluster isn't entirely off:
> 
> [root@cc1 ~]# rbd ls management-vms
> os-mongodb1
> os-mongodb1-database
> os-gitlab-root
> os-mongodb1-database2
> os-wiki-root
> [root@cc1 ~]# rbd ls volumes
> ^C
> [root@cc1 ~]#
> 
> and the same for all mon hosts (I don't paste all three here):
> 
> [root@cc1 ~]# rbd -m 192.168.128.1 list management-vms
> os-mongodb1
> os-mongodb1-database
> os-gitlab-root
> os-mongodb1-database2
> os-wiki-root
> [root@cc1 ~]# rbd -m 192.168.128.1 list volumes
> ^C
> [root@cc1 ~]#
> 
> and for all other pools on the list I can list images, except for the
> (most important) volumes pool.
> 
> Funny thing, I can get rbd info for a particular image:
> 
> [root@cc1 ~]# rbd info
> volumes/volume-197602d7-40f9-40ad-b286-cdec688b1497
> rbd image 'volume-197602d7-40f9-40ad-b286-cdec688b1497':
>         size 20480 MB in 1280 objects
>         order 24 (16384 kB objects)
>         block_name_prefix: rbd_data.64a21a0a9acf52
>         format: 2
>         features: layering
>         flags:
>         parent: images/37bdf0ca-f1f3-46ce-95b9-c04bb9ac8a53@snap
>         overlap: 3072 MB
> 
> but I can't list the whole content of the volumes pool.
> 
> [root@cc1 ~]# ceph osd pool ls
> volumes
> images
> backups
> volumes-ssd-intel-s3700
> management-vms
> .rgw.root
> .rgw.control
> .rgw
> .rgw.gc
> .log
> .users.uid
> .rgw.buckets.index
> .users
> .rgw.buckets.extra
> .rgw.buckets
> volumes-cached
> cache-ssd
> 
> here is ceph osd tree:
> 
> ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
>  -7  20.88388 root ssd-intel-s3700
> -11   3.19995     host ssd-stor1
>  56   0.79999         osd.56            up  1.00000          1.00000
>  57   0.79999         osd.57            up  1.00000          1.00000
>  58   0.79999         osd.58            up  1.00000          1.00000
>  59   0.79999         osd.59            up  1.00000          1.00000
>  -9   2.12999     host ssd-stor2
>  60   0.70999         osd.60            up  1.00000          1.00000
>  61   0.70999         osd.61            up  1.00000          1.00000
>  62   0.70999         osd.62            up  1.00000          1.00000
>  -8   2.12999     host ssd-stor3
>  63   0.70999         osd.63            up  1.00000          1.00000
>  64   0.70999         osd.64            up  1.00000          1.00000
>  65   0.70999         osd.65            up  1.00000          1.00000
> -10   4.19998     host ssd-stor4
>  25   0.70000         osd.25            up  1.00000          1.00000
>  26   0.70000         osd.26            up  1.00000          1.00000
>  27   0.70000         osd.27            up  1.00000          1.00000
>  28   0.70000         osd.28            up  1.00000          1.00000
>  29   0.70000         osd.29            up  1.00000          1.00000
>  24   0.70000         osd.24            up  1.00000          1.00000
> -12   3.41199     host ssd-stor5
>  73   0.85300         osd.73            up  1.00000          1.00000
>  74   0.85300         osd.74            up  1.00000          1.00000
>  75   0.85300         osd.75            up  1.00000          1.00000
>  76   0.85300         osd.76            up  1.00000          1.00000
> -13   3.41199     host ssd-stor6
>  77   0.85300         osd.77            up  1.00000          1.00000
>  78   0.85300         osd.78            up  1.00000          1.00000
>  79   0.85300         osd.79            up  1.00000          1.00000
>  80   0.85300         osd.80            up  1.00000          1.00000
> -15   2.39999     host ssd-stor7
>  90   0.79999         osd.90            up  1.00000          1.00000
>  91   0.79999         osd.91            up  1.00000          1.00000
>  92   0.79999         osd.92            up  1.00000          1.00000
>  -1 167.69969 root default
>  -2  33.99994     host stor1
>   6   3.39999         osd.6           down        0          1.00000
>   7   3.39999         osd.7             up  1.00000          1.00000
>   8   3.39999         osd.8             up  1.00000          1.00000
>   9   3.39999         osd.9             up  1.00000          1.00000
>  10   3.39999         osd.10          down        0          1.00000
>  11   3.39999         osd.11          down        0          1.00000
>  69   3.39999         osd.69            up  1.00000          1.00000
>  70   3.39999         osd.70            up  1.00000          1.00000
>  71   3.39999         osd.71          down        0          1.00000
>  81   3.39999         osd.81            up  1.00000          1.00000
>  -3  20.99991     host stor2
>  13   2.09999         osd.13            up  1.00000          1.00000
>  12   2.09999         osd.12            up  1.00000          1.00000
>  14   2.09999         osd.14            up  1.00000          1.00000
>  15   2.09999         osd.15            up  1.00000          1.00000
>  16   2.09999         osd.16            up  1.00000          1.00000
>  17   2.09999         osd.17            up  1.00000          1.00000
>  18   2.09999         osd.18          down        0          1.00000
>  19   2.09999         osd.19            up  1.00000          1.00000
>  20   2.09999         osd.20            up  1.00000          1.00000
>  21   2.09999         osd.21            up  1.00000          1.00000
>  -4  25.00000     host stor3
>  30   2.50000         osd.30            up  1.00000          1.00000
>  31   2.50000         osd.31            up  1.00000          1.00000
>  32   2.50000         osd.32            up  1.00000          1.00000
>  33   2.50000         osd.33          down        0          1.00000
>  34   2.50000         osd.34            up  1.00000          1.00000
>  35   2.50000         osd.35            up  1.00000          1.00000
>  66   2.50000         osd.66            up  1.00000          1.00000
>  67   2.50000         osd.67            up  1.00000          1.00000
>  68   2.50000         osd.68            up  1.00000          1.00000
>  72   2.50000         osd.72          down        0          1.00000
>  -5  25.00000     host stor4
>  44   2.50000         osd.44            up  1.00000          1.00000
>  45   2.50000         osd.45            up  1.00000          1.00000
>  46   2.50000         osd.46          down        0          1.00000
>  47   2.50000         osd.47            up  1.00000          1.00000
>   0   2.50000         osd.0             up  1.00000          1.00000
>   1   2.50000         osd.1             up  1.00000          1.00000
>   2   2.50000         osd.2             up  1.00000          1.00000
>   3   2.50000         osd.3             up  1.00000          1.00000
>   4   2.50000         osd.4             up  1.00000          1.00000
>   5   2.50000         osd.5             up  1.00000          1.00000
>  -6  14.19991     host stor5
>  48   1.79999         osd.48            up  1.00000          1.00000
>  49   1.59999         osd.49            up  1.00000          1.00000
>  50   1.79999         osd.50            up  1.00000          1.00000
>  51   1.79999         osd.51          down        0          1.00000
>  52   1.79999         osd.52            up  1.00000          1.00000
>  53   1.79999         osd.53            up  1.00000          1.00000
>  54   1.79999         osd.54            up  1.00000          1.00000
>  55   1.79999         osd.55            up  1.00000          1.00000
> -14  14.39999     host stor6
>  82   1.79999         osd.82            up  1.00000          1.00000
>  83   1.79999         osd.83            up  1.00000          1.00000
>  84   1.79999         osd.84            up  1.00000          1.00000
>  85   1.79999         osd.85            up  1.00000          1.00000
>  86   1.79999         osd.86            up  1.00000          1.00000
>  87   1.79999         osd.87            up  1.00000          1.00000
>  88   1.79999         osd.88            up  1.00000          1.00000
>  89   1.79999         osd.89            up  1.00000          1.00000
> -16  12.59999     host stor7
>  93   1.79999         osd.93            up  1.00000          1.00000
>  94   1.79999         osd.94            up  1.00000          1.00000
>  95   1.79999         osd.95            up  1.00000          1.00000
>  96   1.79999         osd.96            up  1.00000          1.00000
>  97   1.79999         osd.97            up  1.00000          1.00000
>  98   1.79999         osd.98            up  1.00000          1.00000
>  99   1.79999         osd.99            up  1.00000          1.00000
> -17  21.49995     host stor8
>  22   1.59999         osd.22            up  1.00000          1.00000
>  23   1.59999         osd.23            up  1.00000          1.00000
>  36   2.09999         osd.36            up  1.00000          1.00000
>  37   2.09999         osd.37            up  1.00000          1.00000
>  38   2.50000         osd.38            up  1.00000          1.00000
>  39   2.50000         osd.39            up  1.00000          1.00000
>  40   2.50000         osd.40            up  1.00000          1.00000
>  41   2.50000         osd.41          down        0          1.00000
>  42   2.50000         osd.42            up  1.00000          1.00000
>  43   1.59999         osd.43            up  1.00000          1.00000
> [root@cc1 ~]#
> 
> and ceph health detail:
> 
> ceph health detail | grep down
> HEALTH_WARN 23 pgs backfilling; 23 pgs degraded; 2 pgs down; 2 pgs
> peering; 2 pgs stuck inactive; 25 pgs stuck unclean; 23 pgs
> undersized; recovery 176211/14148564 objects degraded (1.245%);
> recovery 238972/14148564 objects misplaced (1.689%); noout flag(s) set
> pg 1.60 is stuck inactive since forever, current state
> down+remapped+peering, last acting [66,69,40]
> pg 1.165 is stuck inactive since forever, current state
> down+remapped+peering, last acting [37]
> pg 1.60 is stuck unclean since forever, current state
> down+remapped+peering, last acting [66,69,40]
> pg 1.165 is stuck unclean since forever, current state
> down+remapped+peering, last acting [37]
> pg 1.165 is down+remapped+peering, acting [37]
> pg 1.60 is down+remapped+peering, acting [66,69,40]
> 
> 
> problematic pgs are 1.165 and 1.60.
> 
> Please advise how to unblock the volumes pool and/or get these two PGs
> working - during the last night and day, when we tried to solve this issue,
> these PGs ended up 100% empty of data.
> 
> 
> 
> 
> -- 
> Regards,
>  Łukasz Chrustek
> 
> 
> 


* Re: Problem with query and any operation on PGs
  2017-05-23 14:17   ` Sage Weil
@ 2017-05-23 14:43     ` Łukasz Chrustek
       [not found]     ` <1464688590.20170523185052@tlen.pl>
  1 sibling, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-23 14:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> Hello,
>> 
>> After a terrible outage caused by the failure of a 10 Gbit switch, the ceph cluster
>> went to HEALTH_ERR (three whole storage servers went offline at the same time
>> and didn't come back quickly). After the cluster recovered, two PGs ended up in an
>> incomplete state; I can't query them, and I can't do anything with them

> The thing where you can't query a PG is because the OSD is throttling 
> incoming work and the throttle is exhausted (the PG can't do work so it
> isn't making progress).  A workaround for jewel is to restart the OSD 
> serving the PG and do the query quickly after that (probably in a loop so
> that you catch it after it starts up but before the throttle is 
> exhausted again).  (In luminous this is fixed.)

Thank you for the clarification.

> Once you have the query output ('ceph tell $pgid query') you'll be able to
> tell what is preventing the PG from peering.

Hm.. what kind of loop do you suggest? When I do ceph tell $pgid query,
it hangs and never returns to the console.

> You can identify the osd(s) hosting the pg with 'ceph pg map $pgid'.

There is something strange here for 1.165: how is it possible that acting
is [37] when it isn't in the up set [84,38,48]?:

ceph pg map 1.165
osdmap e114855 pg 1.165 (1.165) -> up [84,38,48] acting [37]

The second one looks ok, but I still can't run a pg query on it:

[root@cc1 ~]# ceph pg map 1.60
osdmap e114855 pg 1.60 (1.60) -> up [66,84,40] acting [66,69,40]


Do I need to restart all three OSDs at the same time?

Can you advise how to unblock access to one of the pools for this kind
of command:

[root@cc1 ~]# rbd ls volumes
^C

The strace for this is here: https://pastebin.com/hpbDg6gP - this time it
hangs on some futex call. Are these cases (the pg query hang and the
rbd ls problem) connected to each other?

If I can find a solution for this, you will make my day (and night :) ).


Regards
Lukasz

> HTH!
> sage


>> that would get the cluster working again. Here is an strace of
>> the query command: https://pastebin.com/HpNFvR8Z. But... the cluster isn't entirely off:
>> 
>> [root@cc1 ~]# rbd ls management-vms
>> os-mongodb1
>> os-mongodb1-database
>> os-gitlab-root
>> os-mongodb1-database2
>> os-wiki-root
>> [root@cc1 ~]# rbd ls volumes
>> ^C
>> [root@cc1 ~]#
>> 
>> and the same for all mon hosts (I don't paste all three here):
>> 
>> [root@cc1 ~]# rbd -m 192.168.128.1 list management-vms
>> os-mongodb1
>> os-mongodb1-database
>> os-gitlab-root
>> os-mongodb1-database2
>> os-wiki-root
>> [root@cc1 ~]# rbd -m 192.168.128.1 list volumes
>> ^C
>> [root@cc1 ~]#
>> 
>> and for all other pools on the list I can list images, except for the
>> (most important) volumes pool.
>> 
>> Funny thing, I can get rbd info for a particular image:
>> 
>> [root@cc1 ~]# rbd info
>> volumes/volume-197602d7-40f9-40ad-b286-cdec688b1497
>> rbd image 'volume-197602d7-40f9-40ad-b286-cdec688b1497':
>>         size 20480 MB in 1280 objects
>>         order 24 (16384 kB objects)
>>         block_name_prefix: rbd_data.64a21a0a9acf52
>>         format: 2
>>         features: layering
>>         flags:
>>         parent: images/37bdf0ca-f1f3-46ce-95b9-c04bb9ac8a53@snap
>>         overlap: 3072 MB
>> 
>> but I can't list the whole content of the volumes pool.
>> 
>> [root@cc1 ~]# ceph osd pool ls
>> volumes
>> images
>> backups
>> volumes-ssd-intel-s3700
>> management-vms
>> .rgw.root
>> .rgw.control
>> .rgw
>> .rgw.gc
>> .log
>> .users.uid
>> .rgw.buckets.index
>> .users
>> .rgw.buckets.extra
>> .rgw.buckets
>> volumes-cached
>> cache-ssd
>> 
>> here is ceph osd tree:
>> 
>> ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
>>  -7  20.88388 root ssd-intel-s3700
>> -11   3.19995     host ssd-stor1
>>  56   0.79999         osd.56            up  1.00000          1.00000
>>  57   0.79999         osd.57            up  1.00000          1.00000
>>  58   0.79999         osd.58            up  1.00000          1.00000
>>  59   0.79999         osd.59            up  1.00000          1.00000
>>  -9   2.12999     host ssd-stor2
>>  60   0.70999         osd.60            up  1.00000          1.00000
>>  61   0.70999         osd.61            up  1.00000          1.00000
>>  62   0.70999         osd.62            up  1.00000          1.00000
>>  -8   2.12999     host ssd-stor3
>>  63   0.70999         osd.63            up  1.00000          1.00000
>>  64   0.70999         osd.64            up  1.00000          1.00000
>>  65   0.70999         osd.65            up  1.00000          1.00000
>> -10   4.19998     host ssd-stor4
>>  25   0.70000         osd.25            up  1.00000          1.00000
>>  26   0.70000         osd.26            up  1.00000          1.00000
>>  27   0.70000         osd.27            up  1.00000          1.00000
>>  28   0.70000         osd.28            up  1.00000          1.00000
>>  29   0.70000         osd.29            up  1.00000          1.00000
>>  24   0.70000         osd.24            up  1.00000          1.00000
>> -12   3.41199     host ssd-stor5
>>  73   0.85300         osd.73            up  1.00000          1.00000
>>  74   0.85300         osd.74            up  1.00000          1.00000
>>  75   0.85300         osd.75            up  1.00000          1.00000
>>  76   0.85300         osd.76            up  1.00000          1.00000
>> -13   3.41199     host ssd-stor6
>>  77   0.85300         osd.77            up  1.00000          1.00000
>>  78   0.85300         osd.78            up  1.00000          1.00000
>>  79   0.85300         osd.79            up  1.00000          1.00000
>>  80   0.85300         osd.80            up  1.00000          1.00000
>> -15   2.39999     host ssd-stor7
>>  90   0.79999         osd.90            up  1.00000          1.00000
>>  91   0.79999         osd.91            up  1.00000          1.00000
>>  92   0.79999         osd.92            up  1.00000          1.00000
>>  -1 167.69969 root default
>>  -2  33.99994     host stor1
>>   6   3.39999         osd.6           down        0          1.00000
>>   7   3.39999         osd.7             up  1.00000          1.00000
>>   8   3.39999         osd.8             up  1.00000          1.00000
>>   9   3.39999         osd.9             up  1.00000          1.00000
>>  10   3.39999         osd.10          down        0          1.00000
>>  11   3.39999         osd.11          down        0          1.00000
>>  69   3.39999         osd.69            up  1.00000          1.00000
>>  70   3.39999         osd.70            up  1.00000          1.00000
>>  71   3.39999         osd.71          down        0          1.00000
>>  81   3.39999         osd.81            up  1.00000          1.00000
>>  -3  20.99991     host stor2
>>  13   2.09999         osd.13            up  1.00000          1.00000
>>  12   2.09999         osd.12            up  1.00000          1.00000
>>  14   2.09999         osd.14            up  1.00000          1.00000
>>  15   2.09999         osd.15            up  1.00000          1.00000
>>  16   2.09999         osd.16            up  1.00000          1.00000
>>  17   2.09999         osd.17            up  1.00000          1.00000
>>  18   2.09999         osd.18          down        0          1.00000
>>  19   2.09999         osd.19            up  1.00000          1.00000
>>  20   2.09999         osd.20            up  1.00000          1.00000
>>  21   2.09999         osd.21            up  1.00000          1.00000
>>  -4  25.00000     host stor3
>>  30   2.50000         osd.30            up  1.00000          1.00000
>>  31   2.50000         osd.31            up  1.00000          1.00000
>>  32   2.50000         osd.32            up  1.00000          1.00000
>>  33   2.50000         osd.33          down        0          1.00000
>>  34   2.50000         osd.34            up  1.00000          1.00000
>>  35   2.50000         osd.35            up  1.00000          1.00000
>>  66   2.50000         osd.66            up  1.00000          1.00000
>>  67   2.50000         osd.67            up  1.00000          1.00000
>>  68   2.50000         osd.68            up  1.00000          1.00000
>>  72   2.50000         osd.72          down        0          1.00000
>>  -5  25.00000     host stor4
>>  44   2.50000         osd.44            up  1.00000          1.00000
>>  45   2.50000         osd.45            up  1.00000          1.00000
>>  46   2.50000         osd.46          down        0          1.00000
>>  47   2.50000         osd.47            up  1.00000          1.00000
>>   0   2.50000         osd.0             up  1.00000          1.00000
>>   1   2.50000         osd.1             up  1.00000          1.00000
>>   2   2.50000         osd.2             up  1.00000          1.00000
>>   3   2.50000         osd.3             up  1.00000          1.00000
>>   4   2.50000         osd.4             up  1.00000          1.00000
>>   5   2.50000         osd.5             up  1.00000          1.00000
>>  -6  14.19991     host stor5
>>  48   1.79999         osd.48            up  1.00000          1.00000
>>  49   1.59999         osd.49            up  1.00000          1.00000
>>  50   1.79999         osd.50            up  1.00000          1.00000
>>  51   1.79999         osd.51          down        0          1.00000
>>  52   1.79999         osd.52            up  1.00000          1.00000
>>  53   1.79999         osd.53            up  1.00000          1.00000
>>  54   1.79999         osd.54            up  1.00000          1.00000
>>  55   1.79999         osd.55            up  1.00000          1.00000
>> -14  14.39999     host stor6
>>  82   1.79999         osd.82            up  1.00000          1.00000
>>  83   1.79999         osd.83            up  1.00000          1.00000
>>  84   1.79999         osd.84            up  1.00000          1.00000
>>  85   1.79999         osd.85            up  1.00000          1.00000
>>  86   1.79999         osd.86            up  1.00000          1.00000
>>  87   1.79999         osd.87            up  1.00000          1.00000
>>  88   1.79999         osd.88            up  1.00000          1.00000
>>  89   1.79999         osd.89            up  1.00000          1.00000
>> -16  12.59999     host stor7
>>  93   1.79999         osd.93            up  1.00000          1.00000
>>  94   1.79999         osd.94            up  1.00000          1.00000
>>  95   1.79999         osd.95            up  1.00000          1.00000
>>  96   1.79999         osd.96            up  1.00000          1.00000
>>  97   1.79999         osd.97            up  1.00000          1.00000
>>  98   1.79999         osd.98            up  1.00000          1.00000
>>  99   1.79999         osd.99            up  1.00000          1.00000
>> -17  21.49995     host stor8
>>  22   1.59999         osd.22            up  1.00000          1.00000
>>  23   1.59999         osd.23            up  1.00000          1.00000
>>  36   2.09999         osd.36            up  1.00000          1.00000
>>  37   2.09999         osd.37            up  1.00000          1.00000
>>  38   2.50000         osd.38            up  1.00000          1.00000
>>  39   2.50000         osd.39            up  1.00000          1.00000
>>  40   2.50000         osd.40            up  1.00000          1.00000
>>  41   2.50000         osd.41          down        0          1.00000
>>  42   2.50000         osd.42            up  1.00000          1.00000
>>  43   1.59999         osd.43            up  1.00000          1.00000
>> [root@cc1 ~]#
>> 
>> and ceph health detail:
>> 
>> ceph health detail | grep down
>> HEALTH_WARN 23 pgs backfilling; 23 pgs degraded; 2 pgs down; 2 pgs
>> peering; 2 pgs stuck inactive; 25 pgs stuck unclean; 23 pgs
>> undersized; recovery 176211/14148564 objects degraded (1.245%);
>> recovery 238972/14148564 objects misplaced (1.689%); noout flag(s) set
>> pg 1.60 is stuck inactive since forever, current state
>> down+remapped+peering, last acting [66,69,40]
>> pg 1.165 is stuck inactive since forever, current state
>> down+remapped+peering, last acting [37]
>> pg 1.60 is stuck unclean since forever, current state
>> down+remapped+peering, last acting [66,69,40]
>> pg 1.165 is stuck unclean since forever, current state
>> down+remapped+peering, last acting [37]
>> pg 1.165 is down+remapped+peering, acting [37]
>> pg 1.60 is down+remapped+peering, acting [66,69,40]
>> 
>> 
>> problematic pgs are 1.165 and 1.60.
>> 
>> Please advise how to unblock the volumes pool and/or get these two PGs
>> working - during the last night and day, when we tried to solve this issue,
>> these PGs ended up 100% empty of data.
>> 
>> 
>> 
>> 
>> -- 
>> Regards,
>>  Łukasz Chrustek
>> 
>> 
>> 



-- 
Regards,
 Łukasz Chrustek



* Re: Problem with query and any operation on PGs
       [not found]     ` <1464688590.20170523185052@tlen.pl>
@ 2017-05-23 17:40       ` Sage Weil
  2017-05-23 21:43         ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-23 17:40 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel


On Tue, 23 May 2017, Łukasz Chrustek wrote:
> I have not slept for over 30 hours and still can't find a solution. I
> did as you wrote, but turning off these osds
> (https://pastebin.com/1npBXeMV) didn't resolve the issue...

The important bit is:

            "blocked": "peering is blocked due to down osds",
            "down_osds_we_would_probe": [
                6,
                10,
                33,
                37,
                72
            ],
            "peering_blocked_by": [
                {
                    "osd": 6,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let 
us proceed"
                },
                {
                    "osd": 10,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let 
us proceed"
                },
                {
                    "osd": 37,
                    "current_lost_at": 0,
                    "comment": "starting or marking this osd lost may let 
us proceed"
                },
                {
                    "osd": 72,
                    "current_lost_at": 113771,
                    "comment": "starting or marking this osd lost may let 
us proceed"
                }
            ]
        },

Are any of those OSDs startable?
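
If they are, just starting them again should let the PG peer; roughly
(a sketch only, assuming systemd-managed OSDs -- run this on the host
that owns each osd):

  systemctl status ceph-osd@6     # can the daemon run at all?
  systemctl start  ceph-osd@6     # repeat for osd 10, 33, 37 and 72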

sage


> 
> Regards
> Lukasz Chrustek
> 
> 
> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> Hello,
> >> 
> >> After a terrible outage caused by the failure of a 10 Gbit switch, the ceph cluster
> >> went to HEALTH_ERR (three whole storage servers went offline at the same time
> >> and didn't come back quickly). After the cluster recovered, two PGs ended up in an
> >> incomplete state; I can't query them, and I can't do anything with them
> 
> > The thing where you can't query a PG is because the OSD is throttling 
> > incoming work and the throttle is exhausted (the PG can't do work so it
> > isn't making progress).  A workaround for jewel is to restart the OSD 
> > serving the PG and do the query quickly after that (probably in a loop so
> > that you catch it after it starts up but before the throttle is 
> > exhausted again).  (In luminous this is fixed.)
> 
> > Once you have the query output ('ceph tell $pgid query') you'll be able to
> > tell what is preventing the PG from peering.
> 
> > You can identify the osd(s) hosting the pg with 'ceph pg map $pgid'.
> 
> > HTH!
> > sage
> 
> 
> >> that would get the cluster working again. Here is an strace of
> >> the query command: https://pastebin.com/HpNFvR8Z. But... the cluster isn't entirely off:
> >> 
> >> [root@cc1 ~]# rbd ls management-vms
> >> os-mongodb1
> >> os-mongodb1-database
> >> os-gitlab-root
> >> os-mongodb1-database2
> >> os-wiki-root
> >> [root@cc1 ~]# rbd ls volumes
> >> ^C
> >> [root@cc1 ~]#
> >> 
> >> and the same for all mon hosts (I don't paste all three here):
> >> 
> >> [root@cc1 ~]# rbd -m 192.168.128.1 list management-vms
> >> os-mongodb1
> >> os-mongodb1-database
> >> os-gitlab-root
> >> os-mongodb1-database2
> >> os-wiki-root
> >> [root@cc1 ~]# rbd -m 192.168.128.1 list volumes
> >> ^C
> >> [root@cc1 ~]#
> >> 
> >> and for all other pools on the list I can list images, except for the
> >> (most important) volumes pool.
> >> 
> >> Funny thing, I can get rbd info for a particular image:
> >> 
> >> [root@cc1 ~]# rbd info
> >> volumes/volume-197602d7-40f9-40ad-b286-cdec688b1497
> >> rbd image 'volume-197602d7-40f9-40ad-b286-cdec688b1497':
> >>         size 20480 MB in 1280 objects
> >>         order 24 (16384 kB objects)
> >>         block_name_prefix: rbd_data.64a21a0a9acf52
> >>         format: 2
> >>         features: layering
> >>         flags:
> >>         parent: images/37bdf0ca-f1f3-46ce-95b9-c04bb9ac8a53@snap
> >>         overlap: 3072 MB
> >> 
> >> but I can't list the whole content of the volumes pool.
> >> 
> >> [root@cc1 ~]# ceph osd pool ls
> >> volumes
> >> images
> >> backups
> >> volumes-ssd-intel-s3700
> >> management-vms
> >> .rgw.root
> >> .rgw.control
> >> .rgw
> >> .rgw.gc
> >> .log
> >> .users.uid
> >> .rgw.buckets.index
> >> .users
> >> .rgw.buckets.extra
> >> .rgw.buckets
> >> volumes-cached
> >> cache-ssd
> >> 
> >> here is ceph osd tree:
> >> 
> >> ID  WEIGHT    TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
> >>  -7  20.88388 root ssd-intel-s3700
> >> -11   3.19995     host ssd-stor1
> >>  56   0.79999         osd.56            up  1.00000          1.00000
> >>  57   0.79999         osd.57            up  1.00000          1.00000
> >>  58   0.79999         osd.58            up  1.00000          1.00000
> >>  59   0.79999         osd.59            up  1.00000          1.00000
> >>  -9   2.12999     host ssd-stor2
> >>  60   0.70999         osd.60            up  1.00000          1.00000
> >>  61   0.70999         osd.61            up  1.00000          1.00000
> >>  62   0.70999         osd.62            up  1.00000          1.00000
> >>  -8   2.12999     host ssd-stor3
> >>  63   0.70999         osd.63            up  1.00000          1.00000
> >>  64   0.70999         osd.64            up  1.00000          1.00000
> >>  65   0.70999         osd.65            up  1.00000          1.00000
> >> -10   4.19998     host ssd-stor4
> >>  25   0.70000         osd.25            up  1.00000          1.00000
> >>  26   0.70000         osd.26            up  1.00000          1.00000
> >>  27   0.70000         osd.27            up  1.00000          1.00000
> >>  28   0.70000         osd.28            up  1.00000          1.00000
> >>  29   0.70000         osd.29            up  1.00000          1.00000
> >>  24   0.70000         osd.24            up  1.00000          1.00000
> >> -12   3.41199     host ssd-stor5
> >>  73   0.85300         osd.73            up  1.00000          1.00000
> >>  74   0.85300         osd.74            up  1.00000          1.00000
> >>  75   0.85300         osd.75            up  1.00000          1.00000
> >>  76   0.85300         osd.76            up  1.00000          1.00000
> >> -13   3.41199     host ssd-stor6
> >>  77   0.85300         osd.77            up  1.00000          1.00000
> >>  78   0.85300         osd.78            up  1.00000          1.00000
> >>  79   0.85300         osd.79            up  1.00000          1.00000
> >>  80   0.85300         osd.80            up  1.00000          1.00000
> >> -15   2.39999     host ssd-stor7
> >>  90   0.79999         osd.90            up  1.00000          1.00000
> >>  91   0.79999         osd.91            up  1.00000          1.00000
> >>  92   0.79999         osd.92            up  1.00000          1.00000
> >>  -1 167.69969 root default
> >>  -2  33.99994     host stor1
> >>   6   3.39999         osd.6           down        0          1.00000
> >>   7   3.39999         osd.7             up  1.00000          1.00000
> >>   8   3.39999         osd.8             up  1.00000          1.00000
> >>   9   3.39999         osd.9             up  1.00000          1.00000
> >>  10   3.39999         osd.10          down        0          1.00000
> >>  11   3.39999         osd.11          down        0          1.00000
> >>  69   3.39999         osd.69            up  1.00000          1.00000
> >>  70   3.39999         osd.70            up  1.00000          1.00000
> >>  71   3.39999         osd.71          down        0          1.00000
> >>  81   3.39999         osd.81            up  1.00000          1.00000
> >>  -3  20.99991     host stor2
> >>  13   2.09999         osd.13            up  1.00000          1.00000
> >>  12   2.09999         osd.12            up  1.00000          1.00000
> >>  14   2.09999         osd.14            up  1.00000          1.00000
> >>  15   2.09999         osd.15            up  1.00000          1.00000
> >>  16   2.09999         osd.16            up  1.00000          1.00000
> >>  17   2.09999         osd.17            up  1.00000          1.00000
> >>  18   2.09999         osd.18          down        0          1.00000
> >>  19   2.09999         osd.19            up  1.00000          1.00000
> >>  20   2.09999         osd.20            up  1.00000          1.00000
> >>  21   2.09999         osd.21            up  1.00000          1.00000
> >>  -4  25.00000     host stor3
> >>  30   2.50000         osd.30            up  1.00000          1.00000
> >>  31   2.50000         osd.31            up  1.00000          1.00000
> >>  32   2.50000         osd.32            up  1.00000          1.00000
> >>  33   2.50000         osd.33          down        0          1.00000
> >>  34   2.50000         osd.34            up  1.00000          1.00000
> >>  35   2.50000         osd.35            up  1.00000          1.00000
> >>  66   2.50000         osd.66            up  1.00000          1.00000
> >>  67   2.50000         osd.67            up  1.00000          1.00000
> >>  68   2.50000         osd.68            up  1.00000          1.00000
> >>  72   2.50000         osd.72          down        0          1.00000
> >>  -5  25.00000     host stor4
> >>  44   2.50000         osd.44            up  1.00000          1.00000
> >>  45   2.50000         osd.45            up  1.00000          1.00000
> >>  46   2.50000         osd.46          down        0          1.00000
> >>  47   2.50000         osd.47            up  1.00000          1.00000
> >>   0   2.50000         osd.0             up  1.00000          1.00000
> >>   1   2.50000         osd.1             up  1.00000          1.00000
> >>   2   2.50000         osd.2             up  1.00000          1.00000
> >>   3   2.50000         osd.3             up  1.00000          1.00000
> >>   4   2.50000         osd.4             up  1.00000          1.00000
> >>   5   2.50000         osd.5             up  1.00000          1.00000
> >>  -6  14.19991     host stor5
> >>  48   1.79999         osd.48            up  1.00000          1.00000
> >>  49   1.59999         osd.49            up  1.00000          1.00000
> >>  50   1.79999         osd.50            up  1.00000          1.00000
> >>  51   1.79999         osd.51          down        0          1.00000
> >>  52   1.79999         osd.52            up  1.00000          1.00000
> >>  53   1.79999         osd.53            up  1.00000          1.00000
> >>  54   1.79999         osd.54            up  1.00000          1.00000
> >>  55   1.79999         osd.55            up  1.00000          1.00000
> >> -14  14.39999     host stor6
> >>  82   1.79999         osd.82            up  1.00000          1.00000
> >>  83   1.79999         osd.83            up  1.00000          1.00000
> >>  84   1.79999         osd.84            up  1.00000          1.00000
> >>  85   1.79999         osd.85            up  1.00000          1.00000
> >>  86   1.79999         osd.86            up  1.00000          1.00000
> >>  87   1.79999         osd.87            up  1.00000          1.00000
> >>  88   1.79999         osd.88            up  1.00000          1.00000
> >>  89   1.79999         osd.89            up  1.00000          1.00000
> >> -16  12.59999     host stor7
> >>  93   1.79999         osd.93            up  1.00000          1.00000
> >>  94   1.79999         osd.94            up  1.00000          1.00000
> >>  95   1.79999         osd.95            up  1.00000          1.00000
> >>  96   1.79999         osd.96            up  1.00000          1.00000
> >>  97   1.79999         osd.97            up  1.00000          1.00000
> >>  98   1.79999         osd.98            up  1.00000          1.00000
> >>  99   1.79999         osd.99            up  1.00000          1.00000
> >> -17  21.49995     host stor8
> >>  22   1.59999         osd.22            up  1.00000          1.00000
> >>  23   1.59999         osd.23            up  1.00000          1.00000
> >>  36   2.09999         osd.36            up  1.00000          1.00000
> >>  37   2.09999         osd.37            up  1.00000          1.00000
> >>  38   2.50000         osd.38            up  1.00000          1.00000
> >>  39   2.50000         osd.39            up  1.00000          1.00000
> >>  40   2.50000         osd.40            up  1.00000          1.00000
> >>  41   2.50000         osd.41          down        0          1.00000
> >>  42   2.50000         osd.42            up  1.00000          1.00000
> >>  43   1.59999         osd.43            up  1.00000          1.00000
> >> [root@cc1 ~]#
> >> 
> >> and ceph health detail:
> >> 
> >> ceph health detail | grep down
> >> HEALTH_WARN 23 pgs backfilling; 23 pgs degraded; 2 pgs down; 2 pgs
> >> peering; 2 pgs stuck inactive; 25 pgs stuck unclean; 23 pgs
> >> undersized; recovery 176211/14148564 objects degraded (1.245%);
> >> recovery 238972/14148564 objects misplaced (1.689%); noout flag(s) set
> >> pg 1.60 is stuck inactive since forever, current state
> >> down+remapped+peering, last acting [66,69,40]
> >> pg 1.165 is stuck inactive since forever, current state
> >> down+remapped+peering, last acting [37]
> >> pg 1.60 is stuck unclean since forever, current state
> >> down+remapped+peering, last acting [66,69,40]
> >> pg 1.165 is stuck unclean since forever, current state
> >> down+remapped+peering, last acting [37]
> >> pg 1.165 is down+remapped+peering, acting [37]
> >> pg 1.60 is down+remapped+peering, acting [66,69,40]
> >> 
> >> 
> >> problematic pgs are 1.165 and 1.60.
> >> 
> >> Please advise how to unblock the volumes pool and/or get these two PGs
> >> working - during the last night and day, when we tried to solve this issue,
> >> these PGs ended up 100% empty of data.
> >> 
> >> 
> >> 
> >> 
> >> -- 
> >> Regards,
> >>  Łukasz Chrustek
> >> 
> >> 
> >> 
> 
> 
> 
> -- 
> Regards,
>  Łukasz Chrustek
> 
> 


* Re: Problem with query and any operation on PGs
  2017-05-23 17:40       ` Sage Weil
@ 2017-05-23 21:43         ` Łukasz Chrustek
  2017-05-23 21:48           ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-23 21:43 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> I have not slept for over 30 hours and still can't find a solution. I
>> did as you wrote, but turning off these osds
>> (https://pastebin.com/1npBXeMV) didn't resolve the issue...

> The important bit is:

>             "blocked": "peering is blocked due to down osds",
>             "down_osds_we_would_probe": [
>                 6,
>                 10,
>                 33,
>                 37,
>                 72
>             ],
>             "peering_blocked_by": [
>                 {
>                     "osd": 6,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let
> us proceed"
>                 },
>                 {
>                     "osd": 10,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let
> us proceed"
>                 },
>                 {
>                     "osd": 37,
>                     "current_lost_at": 0,
>                     "comment": "starting or marking this osd lost may let
> us proceed"
>                 },
>                 {
>                     "osd": 72,
>                     "current_lost_at": 113771,
>                     "comment": "starting or marking this osd lost may let
> us proceed"
>                 }
>             ]
>         },

> Are any of those OSDs startable?

They were all up and running - but I decided to shut them down and out
them from ceph. Now it looks like ceph is working ok, but two PGs are
still in the down state. How do I get rid of that?

ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
pg 1.165 is stuck inactive since forever, current state down+remapped+peering, last acting [38,48]
pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
pg 1.60 is down+remapped+peering, acting [66,40]
pg 1.165 is down+remapped+peering, acting [38,48]
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115175: 100 osds: 88 up, 86 in; 2 remapped pgs
      pgmap v67583069: 3520 pgs, 17 pools, 26675 GB data, 4849 kobjects
            76638 GB used, 107 TB / 182 TB avail
                3515 active+clean
                   3 active+clean+scrubbing+deep
                   2 down+remapped+peering
  client io 0 B/s rd, 869 kB/s wr, 14 op/s rd, 113 op/s wr

-- 
Regards
 Łukasz Chrustek



* Re: Problem with query and any operation on PGs
  2017-05-23 21:43         ` Łukasz Chrustek
@ 2017-05-23 21:48           ` Sage Weil
  2017-05-24 13:19             ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-23 21:48 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel


On Tue, 23 May 2017, Łukasz Chrustek wrote:
> Hi,
> 
> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> I have not slept for over 30 hours and still can't find a solution. I
> >> did as you wrote, but turning off these osds
> >> (https://pastebin.com/1npBXeMV) didn't resolve the issue...
> 
> > The important bit is:
> 
> >             "blocked": "peering is blocked due to down osds",
> >             "down_osds_we_would_probe": [
> >                 6,
> >                 10,
> >                 33,
> >                 37,
> >                 72
> >             ],
> >             "peering_blocked_by": [
> >                 {
> >                     "osd": 6,
> >                     "current_lost_at": 0,
> >                     "comment": "starting or marking this osd lost may let
> > us proceed"
> >                 },
> >                 {
> >                     "osd": 10,
> >                     "current_lost_at": 0,
> >                     "comment": "starting or marking this osd lost may let
> > us proceed"
> >                 },
> >                 {
> >                     "osd": 37,
> >                     "current_lost_at": 0,
> >                     "comment": "starting or marking this osd lost may let
> > us proceed"
> >                 },
> >                 {
> >                     "osd": 72,
> >                     "current_lost_at": 113771,
> >                     "comment": "starting or marking this osd lost may let
> > us proceed"
> >                 }
> >             ]
> >         },
> 
> > Are any of those OSDs startable?
> 
> They were all up and running - but I decided to shut them down and out
> them from ceph. Now it looks like ceph is working ok, but two PGs are
> still in the down state. How do I get rid of that?

If you haven't deleted the data, you should start the OSDs back up.

If they are partially damaged you can use ceph-objectstore-tool to 
extract just the PGs in question to make sure you haven't lost anything, 
inject them on some other OSD(s) and restart those, and *then* mark the 
bad OSDs as 'lost'.

If all else fails, you can just mark those OSDs 'lost', but in doing so 
you might be telling the cluster to lose data.

The best thing to do is definitely to get those OSDs started again.
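
Roughly like this (a sketch only -- the paths, osd ids and pg id below are
examples; the source and destination OSDs must be stopped while
ceph-objectstore-tool runs):

  # on the host with the bad osd (e.g. osd.37): export the stuck pg
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-37 \
      --journal-path /var/lib/ceph/osd/ceph-37/journal \
      --pgid 1.165 --op export --file /tmp/pg.1.165.export

  # on a healthy (stopped) osd: import it, then start that osd again
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-38 \
      --journal-path /var/lib/ceph/osd/ceph-38/journal \
      --op import --file /tmp/pg.1.165.export

  # only after the data is safe (or you accept losing it):
  ceph osd lost 37 --yes-i-really-mean-it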

sage


> 
> ceph health detail
> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
> pg 1.165 is stuck inactive since forever, current state down+remapped+peering, last acting [38,48]
> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
> pg 1.60 is down+remapped+peering, acting [66,40]
> pg 1.165 is down+remapped+peering, acting [38,48]
> [root@cc1 ~]# ceph -s
>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>      health HEALTH_WARN
>             2 pgs down
>             2 pgs peering
>             2 pgs stuck inactive
>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>      osdmap e115175: 100 osds: 88 up, 86 in; 2 remapped pgs
>       pgmap v67583069: 3520 pgs, 17 pools, 26675 GB data, 4849 kobjects
>             76638 GB used, 107 TB / 182 TB avail
>                 3515 active+clean
>                    3 active+clean+scrubbing+deep
>                    2 down+remapped+peering
>   client io 0 B/s rd, 869 kB/s wr, 14 op/s rd, 113 op/s wr
> 
> -- 
> Regards
>  Łukasz Chrustek
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-23 21:48           ` Sage Weil
@ 2017-05-24 13:19             ` Łukasz Chrustek
  2017-05-24 13:37               ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 13:19 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Cześć,

> On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
>> >> did,      as      You      wrote,     but     turning     off     this
>> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
>> 
>> > The important bit is:
>> 
>> >             "blocked": "peering is blocked due to down osds",
>> >             "down_osds_we_would_probe": [
>> >                 6,
>> >                 10,
>> >                 33,
>> >                 37,
>> >                 72
>> >             ],
>> >             "peering_blocked_by": [
>> >                 {
>> >                     "osd": 6,
>> >                     "current_lost_at": 0,
>> >                     "comment": "starting or marking this osd lost may let
>> > us proceed"
>> >                 },
>> >                 {
>> >                     "osd": 10,
>> >                     "current_lost_at": 0,
>> >                     "comment": "starting or marking this osd lost may let
>> > us proceed"
>> >                 },
>> >                 {
>> >                     "osd": 37,
>> >                     "current_lost_at": 0,
>> >                     "comment": "starting or marking this osd lost may let
>> > us proceed"
>> >                 },
>> >                 {
>> >                     "osd": 72,
>> >                     "current_lost_at": 113771,
>> >                     "comment": "starting or marking this osd lost may let
>> > us proceed"
>> >                 }
>> >             ]
>> >         },
>> 
>> > Are any of those OSDs startable?
>> 
>> They were all up and running - but I decided to shut them down and out
>> them  from  ceph, now it looks like ceph working ok, but still two PGs
>> are in down state, how to get rid of it ?

> If you haven't deleted the data, you should start the OSDs back up.

> If they are partially damanged you can use ceph-objectstore-tool to 
> extract just the PGs in question to make sure you haven't lost anything,
> inject them on some other OSD(s) and restart those, and *then* mark the
> bad OSDs as 'lost'.

> If all else fails, you can just mark those OSDs 'lost', but in doing so
> you might be telling the cluster to lose data.

> The best thing to do is definitely to get those OSDs started again.

Now the situation looks like this:

[root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
        size 500 GB in 128000 objects
        order 22 (4096 kB objects)
        block_name_prefix: rbd_data.ed9d394a851426
        format: 2
        features: layering
        flags:

[root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
(output cut)
rbd_data.ed9d394a851426.000000000000447c
rbd_data.ed9d394a851426.0000000000010857
rbd_data.ed9d394a851426.000000000000ec8b
rbd_data.ed9d394a851426.000000000000fa43
rbd_data.ed9d394a851426.000000000001ef2d
^C

It hangs on this object and doesn't go any further. rbd cp also hangs...
rbd map does too...

Can you advise what the solution for this case could be?


-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 13:19             ` Łukasz Chrustek
@ 2017-05-24 13:37               ` Sage Weil
  2017-05-24 13:58                 ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-24 13:37 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3882 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
> >> >> did,      as      You      wrote,     but     turning     off     this
> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
> >> 
> >> > The important bit is:
> >> 
> >> >             "blocked": "peering is blocked due to down osds",
> >> >             "down_osds_we_would_probe": [
> >> >                 6,
> >> >                 10,
> >> >                 33,
> >> >                 37,
> >> >                 72
> >> >             ],
> >> >             "peering_blocked_by": [
> >> >                 {
> >> >                     "osd": 6,
> >> >                     "current_lost_at": 0,
> >> >                     "comment": "starting or marking this osd lost may let
> >> > us proceed"
> >> >                 },
> >> >                 {
> >> >                     "osd": 10,
> >> >                     "current_lost_at": 0,
> >> >                     "comment": "starting or marking this osd lost may let
> >> > us proceed"
> >> >                 },
> >> >                 {
> >> >                     "osd": 37,
> >> >                     "current_lost_at": 0,
> >> >                     "comment": "starting or marking this osd lost may let
> >> > us proceed"
> >> >                 },
> >> >                 {
> >> >                     "osd": 72,
> >> >                     "current_lost_at": 113771,
> >> >                     "comment": "starting or marking this osd lost may let
> >> > us proceed"
> >> >                 }
> >> >             ]
> >> >         },
> >> 
> >> > Are any of those OSDs startable?
> >> 
> >> They were all up and running - but I decided to shut them down and out
> >> them  from  ceph, now it looks like ceph working ok, but still two PGs
> >> are in down state, how to get rid of it ?
> 
> > If you haven't deleted the data, you should start the OSDs back up.
> 
> > If they are partially damanged you can use ceph-objectstore-tool to 
> > extract just the PGs in question to make sure you haven't lost anything,
> > inject them on some other OSD(s) and restart those, and *then* mark the
> > bad OSDs as 'lost'.
> 
> > If all else fails, you can just mark those OSDs 'lost', but in doing so
> > you might be telling the cluster to lose data.
> 
> > The best thing to do is definitely to get those OSDs started again.
> 
> Now situation looks like this:
> 
> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
>         size 500 GB in 128000 objects
>         order 22 (4096 kB objects)
>         block_name_prefix: rbd_data.ed9d394a851426
>         format: 2
>         features: layering
>         flags:
> 
> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
> (output cutted)
> rbd_data.ed9d394a851426.000000000000447c
> rbd_data.ed9d394a851426.0000000000010857
> rbd_data.ed9d394a851426.000000000000ec8b
> rbd_data.ed9d394a851426.000000000000fa43
> rbd_data.ed9d394a851426.000000000001ef2d
> ^C
> 
> it hangs on this object and isn't going further. rbd cp also hangs...
> rbd map - also...
> 
> can  You advice what can be solution for this case ?

The hang is due to OSD throttling (see my first reply for how to work 
around that and get a pg query).  But you already did that, and the cluster 
told you which OSDs it needs to see up in order for it to peer and 
recover.  If you haven't destroyed those disks, you should start those 
osds and it should be fine.  If you've destroyed the data or the disks are 
truly broken and dead, then you can mark those OSDs lost and the cluster 
*may* recover (but it's hard to say given the information you've shared).
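
For reference, the thing to re-check once those osds are up is just the pg
query again, e.g. (pgids taken from your health detail):

  ceph pg 1.60 query  > /tmp/pg-1.60.json
  ceph pg 1.165 query > /tmp/pg-1.165.json

and look at recovery_state -> down_osds_we_would_probe / peering_blocked_by
in the output.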

sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 13:37               ` Sage Weil
@ 2017-05-24 13:58                 ` Łukasz Chrustek
  2017-05-24 14:02                   ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 13:58 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Cześć,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >> 
>> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
>> >> >> did,      as      You      wrote,     but     turning     off     this
>> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
>> >> 
>> >> > The important bit is:
>> >> 
>> >> >             "blocked": "peering is blocked due to down osds",
>> >> >             "down_osds_we_would_probe": [
>> >> >                 6,
>> >> >                 10,
>> >> >                 33,
>> >> >                 37,
>> >> >                 72
>> >> >             ],
>> >> >             "peering_blocked_by": [
>> >> >                 {
>> >> >                     "osd": 6,
>> >> >                     "current_lost_at": 0,
>> >> >                     "comment": "starting or marking this osd lost may let
>> >> > us proceed"
>> >> >                 },
>> >> >                 {
>> >> >                     "osd": 10,
>> >> >                     "current_lost_at": 0,
>> >> >                     "comment": "starting or marking this osd lost may let
>> >> > us proceed"
>> >> >                 },
>> >> >                 {
>> >> >                     "osd": 37,
>> >> >                     "current_lost_at": 0,
>> >> >                     "comment": "starting or marking this osd lost may let
>> >> > us proceed"
>> >> >                 },
>> >> >                 {
>> >> >                     "osd": 72,
>> >> >                     "current_lost_at": 113771,
>> >> >                     "comment": "starting or marking this osd lost may let
>> >> > us proceed"
>> >> >                 }
>> >> >             ]
>> >> >         },
>> >> 
>> >> > Are any of those OSDs startable?
>> >> 
>> >> They were all up and running - but I decided to shut them down and out
>> >> them  from  ceph, now it looks like ceph working ok, but still two PGs
>> >> are in down state, how to get rid of it ?
>> 
>> > If you haven't deleted the data, you should start the OSDs back up.
>> 
>> > If they are partially damanged you can use ceph-objectstore-tool to 
>> > extract just the PGs in question to make sure you haven't lost anything,
>> > inject them on some other OSD(s) and restart those, and *then* mark the
>> > bad OSDs as 'lost'.
>> 
>> > If all else fails, you can just mark those OSDs 'lost', but in doing so
>> > you might be telling the cluster to lose data.
>> 
>> > The best thing to do is definitely to get those OSDs started again.
>> 
>> Now situation looks like this:
>> 
>> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
>> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
>>         size 500 GB in 128000 objects
>>         order 22 (4096 kB objects)
>>         block_name_prefix: rbd_data.ed9d394a851426
>>         format: 2
>>         features: layering
>>         flags:
>> 
>> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
>> (output cutted)
>> rbd_data.ed9d394a851426.000000000000447c
>> rbd_data.ed9d394a851426.0000000000010857
>> rbd_data.ed9d394a851426.000000000000ec8b
>> rbd_data.ed9d394a851426.000000000000fa43
>> rbd_data.ed9d394a851426.000000000001ef2d
>> ^C
>> 
>> it hangs on this object and isn't going further. rbd cp also hangs...
>> rbd map - also...
>> 
>> can  You advice what can be solution for this case ?

> The hang is due to OSD throttling (see my first reply for how to wrok 
> around that and get a pg query).  But you already did that and the cluster
> told you which OSDs it needs to see up in order for it to peer and 
> recover.  If you haven't destroyed those disks, you should start those
> osds and it shoudl be fine.  If you've destroyed the data or the disks are
> truly broken and dead, then you can mark those OSDs lost and the cluster
> *maybe* recover (but hard to say given the information you've shared).

> sage

What information can I provide so you can say whether it is recoverable?

Here are ceph -s and ceph health detail:

[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
      pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
            76705 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+remapped+peering
                   1 down+peering
  client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
[root@cc1 ~]# ceph health detail
HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
pg 1.60 is down+remapped+peering, acting [66,40]
pg 1.165 is down+peering, acting [67,88,48]
[root@cc1 ~]#

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 13:58                 ` Łukasz Chrustek
@ 2017-05-24 14:02                   ` Sage Weil
  2017-05-24 14:18                     ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-24 14:02 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5806 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> Cześć,
> >> >> 
> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
> >> >> >> did,      as      You      wrote,     but     turning     off     this
> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
> >> >> 
> >> >> > The important bit is:
> >> >> 
> >> >> >             "blocked": "peering is blocked due to down osds",
> >> >> >             "down_osds_we_would_probe": [
> >> >> >                 6,
> >> >> >                 10,
> >> >> >                 33,
> >> >> >                 37,
> >> >> >                 72
> >> >> >             ],
> >> >> >             "peering_blocked_by": [
> >> >> >                 {
> >> >> >                     "osd": 6,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> > us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 10,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> > us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 37,
> >> >> >                     "current_lost_at": 0,
> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> > us proceed"
> >> >> >                 },
> >> >> >                 {
> >> >> >                     "osd": 72,
> >> >> >                     "current_lost_at": 113771,
> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> > us proceed"

These are the osds (6, 10, 37, 72).

> >> >> >                 }
> >> >> >             ]
> >> >> >         },
> >> >> 
> >> >> > Are any of those OSDs startable?

This

> >> >> 
> >> >> They were all up and running - but I decided to shut them down and out
> >> >> them  from  ceph, now it looks like ceph working ok, but still two PGs
> >> >> are in down state, how to get rid of it ?
> >> 
> >> > If you haven't deleted the data, you should start the OSDs back up.

This

> >> 
> >> > If they are partially damanged you can use ceph-objectstore-tool to 
> >> > extract just the PGs in question to make sure you haven't lost anything,
> >> > inject them on some other OSD(s) and restart those, and *then* mark the
> >> > bad OSDs as 'lost'.
> >> 
> >> > If all else fails, you can just mark those OSDs 'lost', but in doing so
> >> > you might be telling the cluster to lose data.
> >> 
> >> > The best thing to do is definitely to get those OSDs started again.

This

> >> 
> >> Now situation looks like this:
> >> 
> >> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
> >> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
> >>         size 500 GB in 128000 objects
> >>         order 22 (4096 kB objects)
> >>         block_name_prefix: rbd_data.ed9d394a851426
> >>         format: 2
> >>         features: layering
> >>         flags:
> >> 
> >> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
> >> (output cutted)
> >> rbd_data.ed9d394a851426.000000000000447c
> >> rbd_data.ed9d394a851426.0000000000010857
> >> rbd_data.ed9d394a851426.000000000000ec8b
> >> rbd_data.ed9d394a851426.000000000000fa43
> >> rbd_data.ed9d394a851426.000000000001ef2d
> >> ^C
> >> 
> >> it hangs on this object and isn't going further. rbd cp also hangs...
> >> rbd map - also...
> >> 
> >> can  You advice what can be solution for this case ?
> 
> > The hang is due to OSD throttling (see my first reply for how to wrok 
> > around that and get a pg query).  But you already did that and the cluster
> > told you which OSDs it needs to see up in order for it to peer and 
> > recover.  If you haven't destroyed those disks, you should start those

> > osds and it shoudl be fine.  If you've destroyed the data or the disks are
> > truly broken and dead, then you can mark those OSDs lost and the cluster
> > *maybe* recover (but hard to say given the information you've shared).

This

> 
> > sage
> 
> What information I can bring to You to say it is recoverable ?
> 
> here are ceph -s and ceph health detail:
> 
> [root@cc1 ~]# ceph -s
>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>      health HEALTH_WARN
>             2 pgs down
>             2 pgs peering
>             2 pgs stuck inactive
>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>      osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
>       pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
>             76705 GB used, 107 TB / 182 TB avail
>                 4030 active+clean
>                    1 down+remapped+peering
>                    1 down+peering
>   client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
> [root@cc1 ~]# ceph health detail
> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
> pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
> pg 1.60 is down+remapped+peering, acting [66,40]
> pg 1.165 is down+peering, acting [67,88,48]
> [root@cc1 ~]#
> 
> -- 
> Regards,
>  Łukasz Chrustek
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 14:02                   ` Sage Weil
@ 2017-05-24 14:18                     ` Łukasz Chrustek
  2017-05-24 14:47                       ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 14:18 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Cześć,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >> 
>> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> Cześć,
>> >> >> 
>> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
>> >> >> >> did,      as      You      wrote,     but     turning     off     this
>> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
>> >> >> 
>> >> >> > The important bit is:
>> >> >> 
>> >> >> >             "blocked": "peering is blocked due to down osds",
>> >> >> >             "down_osds_we_would_probe": [
>> >> >> >                 6,
>> >> >> >                 10,
>> >> >> >                 33,
>> >> >> >                 37,
>> >> >> >                 72
>> >> >> >             ],
>> >> >> >             "peering_blocked_by": [
>> >> >> >                 {
>> >> >> >                     "osd": 6,
>> >> >> >                     "current_lost_at": 0,
>> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> > us proceed"
>> >> >> >                 },
>> >> >> >                 {
>> >> >> >                     "osd": 10,
>> >> >> >                     "current_lost_at": 0,
>> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> > us proceed"
>> >> >> >                 },
>> >> >> >                 {
>> >> >> >                     "osd": 37,
>> >> >> >                     "current_lost_at": 0,
>> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> > us proceed"
>> >> >> >                 },
>> >> >> >                 {
>> >> >> >                     "osd": 72,
>> >> >> >                     "current_lost_at": 113771,
>> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> > us proceed"

> These are the osds (6, 10, 37, 72).

>> >> >> >                 }
>> >> >> >             ]
>> >> >> >         },
>> >> >> 
>> >> >> > Are any of those OSDs startable?

> This

osd 6 - isn't startable

osd 10, 37, 72 are startable

>> >> >> 
>> >> >> They were all up and running - but I decided to shut them down and out
>> >> >> them  from  ceph, now it looks like ceph working ok, but still two PGs
>> >> >> are in down state, how to get rid of it ?
>> >> 
>> >> > If you haven't deleted the data, you should start the OSDs back up.

> This

By OSD backup, do you mean copying /var/lib/ceph/osd/ceph-72/* to some other
(non-ceph) disk?

>> >> 
>> >> > If they are partially damanged you can use ceph-objectstore-tool to 
>> >> > extract just the PGs in question to make sure you haven't lost anything,
>> >> > inject them on some other OSD(s) and restart those, and *then* mark the
>> >> > bad OSDs as 'lost'.
>> >> 
>> >> > If all else fails, you can just mark those OSDs 'lost', but in doing so
>> >> > you might be telling the cluster to lose data.
>> >> 
>> >> > The best thing to do is definitely to get those OSDs started again.

> This

There were actions on these PGs that destroyed them. I started those
OSDs (the three that are startable) - this didn't solve the situation.
I should add that there are other pools on this cluster; the problem is
only with the pool that contains the broken/down PGs.
>> >> 
>> >> Now situation looks like this:
>> >> 
>> >> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
>> >> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
>> >>         size 500 GB in 128000 objects
>> >>         order 22 (4096 kB objects)
>> >>         block_name_prefix: rbd_data.ed9d394a851426
>> >>         format: 2
>> >>         features: layering
>> >>         flags:
>> >> 
>> >> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
>> >> (output cutted)
>> >> rbd_data.ed9d394a851426.000000000000447c
>> >> rbd_data.ed9d394a851426.0000000000010857
>> >> rbd_data.ed9d394a851426.000000000000ec8b
>> >> rbd_data.ed9d394a851426.000000000000fa43
>> >> rbd_data.ed9d394a851426.000000000001ef2d
>> >> ^C
>> >> 
>> >> it hangs on this object and isn't going further. rbd cp also hangs...
>> >> rbd map - also...
>> >> 
>> >> can  You advice what can be solution for this case ?
>> 
>> > The hang is due to OSD throttling (see my first reply for how to wrok 
>> > around that and get a pg query).  But you already did that and the cluster
>> > told you which OSDs it needs to see up in order for it to peer and 
>> > recover.  If you haven't destroyed those disks, you should start those

>> > osds and it shoudl be fine.  If you've destroyed the data or the disks are
>> > truly broken and dead, then you can mark those OSDs lost and the cluster
>> > *maybe* recover (but hard to say given the information you've shared).

> This


[root@cc1 ~]# ceph osd lost 10 --yes-i-really-mean-it
marked osd lost in epoch 115310
[root@cc1 ~]# ceph osd lost 37 --yes-i-really-mean-it
marked osd lost in epoch 115314
[root@cc1 ~]# ceph osd lost 72 --yes-i-really-mean-it
marked osd lost in epoch 115317
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
      pgmap v67642483: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
            76718 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+remapped+peering
                   1 down+peering
  client io 14624 kB/s rd, 31619 kB/s wr, 382 op/s rd, 228 op/s wr
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
      pgmap v67642485: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
            76718 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+remapped+peering
                   1 down+peering
  client io 17805 kB/s rd, 18787 kB/s wr, 215 op/s rd, 107 op/s wr

>> 
>> > sage
>> 
>> What information I can bring to You to say it is recoverable ?
>> 
>> here are ceph -s and ceph health detail:
>> 
>> [root@cc1 ~]# ceph -s
>>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>>      health HEALTH_WARN
>>             2 pgs down
>>             2 pgs peering
>>             2 pgs stuck inactive
>>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>>      osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
>>       pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
>>             76705 GB used, 107 TB / 182 TB avail
>>                 4030 active+clean
>>                    1 down+remapped+peering
>>                    1 down+peering
>>   client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
>> [root@cc1 ~]# ceph health detail
>> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
>> pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
>> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
>> pg 1.60 is down+remapped+peering, acting [66,40]
>> pg 1.165 is down+peering, acting [67,88,48]
>> [root@cc1 ~]#
>> 
>> -- 
>> Regards,
>>  Łukasz Chrustek
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 



-- 
Pozdrowienia,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 14:18                     ` Łukasz Chrustek
@ 2017-05-24 14:47                       ` Sage Weil
  2017-05-24 15:00                         ` Łukasz Chrustek
  2017-05-24 21:38                         ` Łukasz Chrustek
  0 siblings, 2 replies; 35+ messages in thread
From: Sage Weil @ 2017-05-24 14:47 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 9357 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> >> Cześć,
> >> >> 
> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> Cześć,
> >> >> >> 
> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
> >> >> >> >> did,      as      You      wrote,     but     turning     off     this
> >> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
> >> >> >> 
> >> >> >> > The important bit is:
> >> >> >> 
> >> >> >> >             "blocked": "peering is blocked due to down osds",
> >> >> >> >             "down_osds_we_would_probe": [
> >> >> >> >                 6,
> >> >> >> >                 10,
> >> >> >> >                 33,
> >> >> >> >                 37,
> >> >> >> >                 72
> >> >> >> >             ],
> >> >> >> >             "peering_blocked_by": [
> >> >> >> >                 {
> >> >> >> >                     "osd": 6,
> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> > us proceed"
> >> >> >> >                 },
> >> >> >> >                 {
> >> >> >> >                     "osd": 10,
> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> > us proceed"
> >> >> >> >                 },
> >> >> >> >                 {
> >> >> >> >                     "osd": 37,
> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> > us proceed"
> >> >> >> >                 },
> >> >> >> >                 {
> >> >> >> >                     "osd": 72,
> >> >> >> >                     "current_lost_at": 113771,
> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> > us proceed"
> 
> > These are the osds (6, 10, 37, 72).
> 
> >> >> >> >                 }
> >> >> >> >             ]
> >> >> >> >         },
> >> >> >> 
> >> >> >> > Are any of those OSDs startable?
> 
> > This
> 
> osd 6 - isn't startable

Disk completely 100% dead, or just broken enough that ceph-osd won't 
start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs 
from this osd to recover any important writes on that osd.

> osd 10, 37, 72 are startable

With those started, I'd repeat the original sequence and get a fresh pg 
query to confirm that it still wants just osd.6.

Use ceph-objectstore-tool to export the pg from osd.6, stop some other 
random osd (not one of these ones), import the pg into that osd, and start 
it again.  Once it is up, 'ceph osd lost 6'.  The pg *should* peer at that 
point.  Repeat the same basic process for the other pg.
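
Concretely, something like this (a sketch only -- shown for pg 1.165,
assuming the fresh query for it still points at osd.6; NN is whichever
healthy osd you pick, and paths depend on your layout):

  # with osd.6 stopped
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-6 \
      --journal-path /var/lib/ceph/osd/ceph-6/journal \
      --pgid 1.165 --op export --file /root/pg-1.165.export

  # stop osd.NN (not one of 10/37/72), import, then start it again
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NN \
      --journal-path /var/lib/ceph/osd/ceph-NN/journal \
      --op import --file /root/pg-1.165.export

  ceph osd lost 6 --yes-i-really-mean-it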

s


> 
> >> >> >> 
> >> >> >> They were all up and running - but I decided to shut them down and out
> >> >> >> them  from  ceph, now it looks like ceph working ok, but still two PGs
> >> >> >> are in down state, how to get rid of it ?
> >> >> 
> >> >> > If you haven't deleted the data, you should start the OSDs back up.
> 
> > This
> 
> By OSDs backup You mean copy /var/lib/ceph/osd/ceph-72/* to some other
> (non ceph) disk ?
> 
> >> >> 
> >> >> > If they are partially damanged you can use ceph-objectstore-tool to 
> >> >> > extract just the PGs in question to make sure you haven't lost anything,
> >> >> > inject them on some other OSD(s) and restart those, and *then* mark the
> >> >> > bad OSDs as 'lost'.
> >> >> 
> >> >> > If all else fails, you can just mark those OSDs 'lost', but in doing so
> >> >> > you might be telling the cluster to lose data.
> >> >> 
> >> >> > The best thing to do is definitely to get those OSDs started again.
> 
> > This
> 
> There were actions on this PGs, that make them destroy. I started this
> osds   (these  three,  which  are  startable)  -  this  dosn't  solved
> situation.  I  need to add, that on this cluster are other pools, only
> with pool with broken/down PGs is problem.
> >> >> 
> >> >> Now situation looks like this:
> >> >> 
> >> >> [root@cc1 ~]# rbd info volumes/volume-ccc5d976-cecf-4938-a452-1bee6188987b
> >> >> rbd image 'volume-ccc5d976-cecf-4938-a452-1bee6188987b':
> >> >>         size 500 GB in 128000 objects
> >> >>         order 22 (4096 kB objects)
> >> >>         block_name_prefix: rbd_data.ed9d394a851426
> >> >>         format: 2
> >> >>         features: layering
> >> >>         flags:
> >> >> 
> >> >> [root@cc1 ~]# rados -p volumes ls | grep rbd_data.ed9d394a851426
> >> >> (output cutted)
> >> >> rbd_data.ed9d394a851426.000000000000447c
> >> >> rbd_data.ed9d394a851426.0000000000010857
> >> >> rbd_data.ed9d394a851426.000000000000ec8b
> >> >> rbd_data.ed9d394a851426.000000000000fa43
> >> >> rbd_data.ed9d394a851426.000000000001ef2d
> >> >> ^C
> >> >> 
> >> >> it hangs on this object and isn't going further. rbd cp also hangs...
> >> >> rbd map - also...
> >> >> 
> >> >> can  You advice what can be solution for this case ?
> >> 
> >> > The hang is due to OSD throttling (see my first reply for how to wrok 
> >> > around that and get a pg query).  But you already did that and the cluster
> >> > told you which OSDs it needs to see up in order for it to peer and 
> >> > recover.  If you haven't destroyed those disks, you should start those
> 
> >> > osds and it shoudl be fine.  If you've destroyed the data or the disks are
> >> > truly broken and dead, then you can mark those OSDs lost and the cluster
> >> > *maybe* recover (but hard to say given the information you've shared).
> 
> > This
> 
> 
> [root@cc1 ~]# ceph osd lost 10 --yes-i-really-mean-it
> marked osd lost in epoch 115310
> [root@cc1 ~]# ceph osd lost 37 --yes-i-really-mean-it
> marked osd lost in epoch 115314
> [root@cc1 ~]# ceph osd lost 72 --yes-i-really-mean-it
> marked osd lost in epoch 115317
> [root@cc1 ~]# ceph -s
>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>      health HEALTH_WARN
>             2 pgs down
>             2 pgs peering
>             2 pgs stuck inactive
>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>      osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
>       pgmap v67642483: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
>             76718 GB used, 107 TB / 182 TB avail
>                 4030 active+clean
>                    1 down+remapped+peering
>                    1 down+peering
>   client io 14624 kB/s rd, 31619 kB/s wr, 382 op/s rd, 228 op/s wr
> [root@cc1 ~]# ceph -s
>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
>      health HEALTH_WARN
>             2 pgs down
>             2 pgs peering
>             2 pgs stuck inactive
>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
>      osdmap e115434: 100 osds: 89 up, 86 in; 1 remapped pgs
>       pgmap v67642485: 4032 pgs, 18 pools, 26713 GB data, 4857 kobjects
>             76718 GB used, 107 TB / 182 TB avail
>                 4030 active+clean
>                    1 down+remapped+peering
>                    1 down+peering
>   client io 17805 kB/s rd, 18787 kB/s wr, 215 op/s rd, 107 op/s wr
> 
> >> 
> >> > sage
> >> 
> >> What information I can bring to You to say it is recoverable ?
> >> 
> >> here are ceph -s and ceph health detail:
> >> 
> >> [root@cc1 ~]# ceph -s
> >>     cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
> >>      health HEALTH_WARN
> >>             2 pgs down
> >>             2 pgs peering
> >>             2 pgs stuck inactive
> >>      monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
> >>             election epoch 872, quorum 0,1,2 cc1,cc2,cc3
> >>      osdmap e115431: 100 osds: 89 up, 86 in; 1 remapped pgs
> >>       pgmap v67641261: 4032 pgs, 18 pools, 26706 GB data, 4855 kobjects
> >>             76705 GB used, 107 TB / 182 TB avail
> >>                 4030 active+clean
> >>                    1 down+remapped+peering
> >>                    1 down+peering
> >>   client io 5704 kB/s rd, 24685 kB/s wr, 49 op/s rd, 165 op/s wr
> >> [root@cc1 ~]# ceph health detail
> >> HEALTH_WARN 2 pgs down; 2 pgs peering; 2 pgs stuck inactive
> >> pg 1.165 is stuck inactive since forever, current state down+peering, last acting [67,88,48]
> >> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [66,40]
> >> pg 1.60 is down+remapped+peering, acting [66,40]
> >> pg 1.165 is down+peering, acting [67,88,48]
> >> [root@cc1 ~]#
> >> 
> >> -- 
> >> Regards,
> >>  Łukasz Chrustek
> >> 
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> 
> >> 
> 
> 
> 
> -- 
> Pozdrowienia,
>  Łukasz Chrustek
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 14:47                       ` Sage Weil
@ 2017-05-24 15:00                         ` Łukasz Chrustek
  2017-05-24 15:07                           ` Łukasz Chrustek
  2017-05-24 15:11                           ` Sage Weil
  2017-05-24 21:38                         ` Łukasz Chrustek
  1 sibling, 2 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 15:00 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >> 
>> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> >> Cześć,
>> >> >> 
>> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> Cześć,
>> >> >> >> 
>> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
>> >> >> >> >> did,      as      You      wrote,     but     turning     off     this
>> >> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
>> >> >> >> 
>> >> >> >> > The important bit is:
>> >> >> >> 
>> >> >> >> >             "blocked": "peering is blocked due to down osds",
>> >> >> >> >             "down_osds_we_would_probe": [
>> >> >> >> >                 6,
>> >> >> >> >                 10,
>> >> >> >> >                 33,
>> >> >> >> >                 37,
>> >> >> >> >                 72
>> >> >> >> >             ],
>> >> >> >> >             "peering_blocked_by": [
>> >> >> >> >                 {
>> >> >> >> >                     "osd": 6,
>> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> > us proceed"
>> >> >> >> >                 },
>> >> >> >> >                 {
>> >> >> >> >                     "osd": 10,
>> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> > us proceed"
>> >> >> >> >                 },
>> >> >> >> >                 {
>> >> >> >> >                     "osd": 37,
>> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> > us proceed"
>> >> >> >> >                 },
>> >> >> >> >                 {
>> >> >> >> >                     "osd": 72,
>> >> >> >> >                     "current_lost_at": 113771,
>> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> > us proceed"
>> 
>> > These are the osds (6, 10, 37, 72).
>> 
>> >> >> >> >                 }
>> >> >> >> >             ]
>> >> >> >> >         },
>> >> >> >> 
>> >> >> >> > Are any of those OSDs startable?
>> 
>> > This
>> 
>> osd 6 - isn't startable

> Disk completely 100% dead, or just borken enough that ceph-osd won't 
> start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> from this osd to recover any important writes on that osd.

2017-05-24 11:21:23.341938 7f6830a36940  0 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 1375
2017-05-24 11:21:23.350180 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) backend btrfs (magic 0x9123683e)
2017-05-24 11:21:23.350610 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: FIEMAP ioctl is supported and appears to work
2017-05-24 11:21:23.350617 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
2017-05-24 11:21:23.350633 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: splice is supported
2017-05-24 11:21:23.351897 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2017-05-24 11:21:23.351951 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: CLONE_RANGE ioctl is supported
2017-05-24 11:21:23.351970 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to create simple subvolume test_subvol: (17) File exists
2017-05-24 11:21:23.351981 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE is supported
2017-05-24 11:21:23.351984 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
2017-05-24 11:21:23.351987 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed with EPERM as non-root; remount with -o user_subvol_rm_allowed
2017-05-24 11:21:23.351996 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: snaps enabled, but no SNAP_DESTROY ioctl; DISABLING
2017-05-24 11:21:23.352573 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: START_SYNC is supported (transid 252877)
2017-05-24 11:21:23.353001 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: WAIT_SYNC is supported
2017-05-24 11:21:23.353012 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: removing old async_snap_test
2017-05-24 11:21:23.353016 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove old async_snap_test: (1) Operation not permitted
2017-05-24 11:21:23.353021 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE_V2 is supported
2017-05-24 11:21:23.353022 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
2017-05-24 11:21:23.353027 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove test_subvol: (1) Operation not permitted
2017-05-24 11:21:23.355156 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2017-05-24 11:21:23.355881 7f6830a36940 -1 filestore(/var/lib/ceph/osd/ceph-6) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
2017-05-24 11:21:23.355891 7f6830a36940 -1 osd.6 0 OSD::init() : unable to read osd superblock
2017-05-24 11:21:23.356411 7f6830a36940 -1 ^[[0;31m ** ERROR: osd init failed: (22) Invalid argument^[[0m

That is all I get in the logs for this OSD when I try to start it.

>> osd 10, 37, 72 are startable

> With those started, I'd repeat the original sequence and get a fresh pg
> query to confirm that it still wants just osd.6.

Do you mean the procedure with the loop, taking down the OSDs that the
broken PGs are pointing to?
pg 1.60 is down+remapped+peering, acting [66,40]
pg 1.165 is down+peering, acting [67,88,48]

So for pg 1.60: take osd.66 down, then check pg query in a loop?



> use ceph-objectstore-tool to export the pg from osd.6, stop some other
> ranodm osd (not one of these ones), import the pg into that osd, and start
> again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> point.  repeat with the same basic process with the other pg.

I have already done 'ceph osd lost 6'; do I need to do it once again?


-- 
Regards
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 15:00                         ` Łukasz Chrustek
@ 2017-05-24 15:07                           ` Łukasz Chrustek
  2017-05-24 15:11                           ` Sage Weil
  1 sibling, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 15:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel






> it is all I get for this osd in logs, when I try to start it.

>>> osd 10, 37, 72 are startable

>> With those started, I'd repeat the original sequence and get a fresh pg
>> query to confirm that it still wants just osd.6.

> You  mean about procedure with loop and taking down OSDs, which broken
> PGs are pointing to ?
> pg 1.60 is down+remapped+peering, acting [66,40]
> pg 1.165 is down+peering, acting [67,88,48]

> for pg 1.60 <--> 66 down, then in loop check pg query ?



>> use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> ranodm osd (not one of these ones), import the pg into that osd, and start
>> again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> point.  repeat with the same basic process with the other pg.

> I have already did 'ceph osd lost 6', do I need to do this once again ?

/dev/sdb1       3,7T   34M  3,7T   1% /var/lib/ceph/osd/ceph-6

This disk has no data; it was migrated away while this OSD was still
able to come up.

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 15:00                         ` Łukasz Chrustek
  2017-05-24 15:07                           ` Łukasz Chrustek
@ 2017-05-24 15:11                           ` Sage Weil
  2017-05-24 15:24                             ` Łukasz Chrustek
  2017-05-24 15:54                             ` Łukasz Chrustek
  1 sibling, 2 replies; 35+ messages in thread
From: Sage Weil @ 2017-05-24 15:11 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 7223 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:

> Hello,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> >> Cześć,
> >> >> 
> >> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> >> >> Cześć,
> >> >> >> 
> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> >> Cześć,
> >> >> >> >> 
> >> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
> >> >> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
> >> >> >> >> >> did,      as      You      wrote,     but     turning     off     this
> >> >> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
> >> >> >> >> 
> >> >> >> >> > The important bit is:
> >> >> >> >> 
> >> >> >> >> >             "blocked": "peering is blocked due to down osds",
> >> >> >> >> >             "down_osds_we_would_probe": [
> >> >> >> >> >                 6,
> >> >> >> >> >                 10,
> >> >> >> >> >                 33,
> >> >> >> >> >                 37,
> >> >> >> >> >                 72
> >> >> >> >> >             ],
> >> >> >> >> >             "peering_blocked_by": [
> >> >> >> >> >                 {
> >> >> >> >> >                     "osd": 6,
> >> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> >> > us proceed"
> >> >> >> >> >                 },
> >> >> >> >> >                 {
> >> >> >> >> >                     "osd": 10,
> >> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> >> > us proceed"
> >> >> >> >> >                 },
> >> >> >> >> >                 {
> >> >> >> >> >                     "osd": 37,
> >> >> >> >> >                     "current_lost_at": 0,
> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> >> > us proceed"
> >> >> >> >> >                 },
> >> >> >> >> >                 {
> >> >> >> >> >                     "osd": 72,
> >> >> >> >> >                     "current_lost_at": 113771,
> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
> >> >> >> >> > us proceed"
> >> 
> >> > These are the osds (6, 10, 37, 72).
> >> 
> >> >> >> >> >                 }
> >> >> >> >> >             ]
> >> >> >> >> >         },
> >> >> >> >> 
> >> >> >> >> > Are any of those OSDs startable?
> >> 
> >> > This
> >> 
> >> osd 6 - isn't startable
> 
> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> > from this osd to recover any important writes on that osd.
> 
> 2017-05-24 11:21:23.341938 7f6830a36940  0 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 1375
> 2017-05-24 11:21:23.350180 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) backend btrfs (magic 0x9123683e)
> 2017-05-24 11:21:23.350610 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: FIEMAP ioctl is supported and appears to work
> 2017-05-24 11:21:23.350617 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
> 2017-05-24 11:21:23.350633 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: splice is supported
> 2017-05-24 11:21:23.351897 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
> 2017-05-24 11:21:23.351951 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: CLONE_RANGE ioctl is supported
> 2017-05-24 11:21:23.351970 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to create simple subvolume test_subvol: (17) File exists
> 2017-05-24 11:21:23.351981 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE is supported
> 2017-05-24 11:21:23.351984 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
> 2017-05-24 11:21:23.351987 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed with EPERM as non-root; remount with -o user_subvol_rm_allowed
> 2017-05-24 11:21:23.351996 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: snaps enabled, but no SNAP_DESTROY ioctl; DISABLING
> 2017-05-24 11:21:23.352573 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: START_SYNC is supported (transid 252877)
> 2017-05-24 11:21:23.353001 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: WAIT_SYNC is supported
> 2017-05-24 11:21:23.353012 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: removing old async_snap_test
> 2017-05-24 11:21:23.353016 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove old async_snap_test: (1) Operation not permitted
> 2017-05-24 11:21:23.353021 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE_V2 is supported
> 2017-05-24 11:21:23.353022 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
> 2017-05-24 11:21:23.353027 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove test_subvol: (1) Operation not permitted
> 2017-05-24 11:21:23.355156 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
> 2017-05-24 11:21:23.355881 7f6830a36940 -1 filestore(/var/lib/ceph/osd/ceph-6) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
> 2017-05-24 11:21:23.355891 7f6830a36940 -1 osd.6 0 OSD::init() : unable to read osd superblock
> 2017-05-24 11:21:23.356411 7f6830a36940 -1 ^[[0;31m ** ERROR: osd init failed: (22) Invalid argument^[[0m
> 
> it is all I get for this osd in logs, when I try to start it.
> 
> >> osd 10, 37, 72 are startable
> 
> > With those started, I'd repeat the original sequence and get a fresh pg
> > query to confirm that it still wants just osd.6.
> 
> You  mean about procedure with loop and taking down OSDs, which broken
> PGs are pointing to ?
> pg 1.60 is down+remapped+peering, acting [66,40]
> pg 1.165 is down+peering, acting [67,88,48]
> 
> for pg 1.60 <--> 66 down, then in loop check pg query ?

Right.
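
Something like (a sketch -- pg 1.60 and osd.66 taken from your health
detail; stop the osd however you normally do on your boxes):

  systemctl stop ceph-osd@66          # or: ceph osd down 66
  while ! timeout 60 ceph pg 1.60 query > /tmp/pg-1.60.json; do sleep 5; done
  grep -A 20 down_osds_we_would_probe /tmp/pg-1.60.json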

> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
> > ranodm osd (not one of these ones), import the pg into that osd, and start
> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> > point.  repeat with the same basic process with the other pg.
> 
> I have already did 'ceph osd lost 6', do I need to do this once again ?

Hmm, not sure; if the OSD is empty then there is no harm in doing it again.  
Try that first since it might resolve it.  If not, do the query loop 
above.

s

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 15:11                           ` Sage Weil
@ 2017-05-24 15:24                             ` Łukasz Chrustek
  2017-05-24 15:54                             ` Łukasz Chrustek
  1 sibling, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 15:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

>> 
>> >> osd 10, 37, 72 are startable
>> 
>> > With those started, I'd repeat the original sequence and get a fresh pg
>> > query to confirm that it still wants just osd.6.
>> 
>> You  mean about procedure with loop and taking down OSDs, which broken
>> PGs are pointing to ?
>> pg 1.60 is down+remapped+peering, acting [66,40]
>> pg 1.165 is down+peering, acting [67,88,48]
>> 
>> for pg 1.60 <--> 66 down, then in loop check pg query ?

> Right.

>> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> > ranodm osd (not one of these ones), import the pg into that osd, and start
>> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> > point.  repeat with the same basic process with the other pg.
>> 
>> I have already did 'ceph osd lost 6', do I need to do this once again ?

> Hmm not sure, if the OSD is empty then there is no harm in doing it again.
> Try that first since it might resolve it.  If not, do the query loop 
> above.
[root@cc1 ~]# ceph osd lost 6 --yes-i-really-mean-it
marked osd lost in epoch 113414
[root@cc1 ~]#
[root@cc1 ~]# ceph -s
    cluster 8cdfbff9-b7be-46de-85bd-9d49866fcf60
     health HEALTH_WARN
            2 pgs down
            2 pgs peering
            2 pgs stuck inactive
     monmap e1: 3 mons at {cc1=192.168.128.1:6789/0,cc2=192.168.128.2:6789/0,cc3=192.168.128.3:6789/0}
            election epoch 872, quorum 0,1,2 cc1,cc2,cc3
     osdmap e115449: 100 osds: 88 up, 86 in; 1 remapped pgs
      pgmap v67646402: 4032 pgs, 18 pools, 26733 GB data, 4862 kobjects
            76759 GB used, 107 TB / 182 TB avail
                4030 active+clean
                   1 down+peering
                   1 down+remapped+peering
  client io 57154 kB/s rd, 1189 kB/s wr, 95 op/s


Marking this osd as lost again had no effect.


-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 15:11                           ` Sage Weil
  2017-05-24 15:24                             ` Łukasz Chrustek
@ 2017-05-24 15:54                             ` Łukasz Chrustek
  2017-05-24 16:02                               ` Łukasz Chrustek
  1 sibling, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 15:54 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:

>> Hello,
>> 
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >> 
>> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> >> Cześć,
>> >> >> 
>> >> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> >> >> Cześć,
>> >> >> >> 
>> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> >> Cześć,
>> >> >> >> >> 
>> >> >> >> >> > On Tue, 23 May 2017, Łukasz Chrustek wrote:
>> >> >> >> >> >> I'm  not  sleeping for over 30 hours, and still can't find solution. I
>> >> >> >> >> >> did,      as      You      wrote,     but     turning     off     this
>> >> >> >> >> >> (https://pastebin.com/1npBXeMV) osds didn't resolve issue...
>> >> >> >> >> 
>> >> >> >> >> > The important bit is:
>> >> >> >> >> 
>> >> >> >> >> >             "blocked": "peering is blocked due to down osds",
>> >> >> >> >> >             "down_osds_we_would_probe": [
>> >> >> >> >> >                 6,
>> >> >> >> >> >                 10,
>> >> >> >> >> >                 33,
>> >> >> >> >> >                 37,
>> >> >> >> >> >                 72
>> >> >> >> >> >             ],
>> >> >> >> >> >             "peering_blocked_by": [
>> >> >> >> >> >                 {
>> >> >> >> >> >                     "osd": 6,
>> >> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> >> > us proceed"
>> >> >> >> >> >                 },
>> >> >> >> >> >                 {
>> >> >> >> >> >                     "osd": 10,
>> >> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> >> > us proceed"
>> >> >> >> >> >                 },
>> >> >> >> >> >                 {
>> >> >> >> >> >                     "osd": 37,
>> >> >> >> >> >                     "current_lost_at": 0,
>> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> >> > us proceed"
>> >> >> >> >> >                 },
>> >> >> >> >> >                 {
>> >> >> >> >> >                     "osd": 72,
>> >> >> >> >> >                     "current_lost_at": 113771,
>> >> >> >> >> >                     "comment": "starting or marking this osd lost may let
>> >> >> >> >> > us proceed"
>> >> 
>> >> > These are the osds (6, 10, 37, 72).
>> >> 
>> >> >> >> >> >                 }
>> >> >> >> >> >             ]
>> >> >> >> >> >         },
>> >> >> >> >> 
>> >> >> >> >> > Are any of those OSDs startable?
>> >> 
>> >> > This
>> >> 
>> >> osd 6 - isn't startable
>> 
>> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
>> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
>> > from this osd to recover any important writes on that osd.
>> 
>> 2017-05-24 11:21:23.341938 7f6830a36940  0 ceph version 9.2.1 (752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 1375
>> 2017-05-24 11:21:23.350180 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) backend btrfs (magic 0x9123683e)
>> 2017-05-24 11:21:23.350610 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: FIEMAP ioctl is supported and appears to work
>> 2017-05-24 11:21:23.350617 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option
>> 2017-05-24 11:21:23.350633 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: splice is supported
>> 2017-05-24 11:21:23.351897 7f6830a36940  0 genericfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
>> 2017-05-24 11:21:23.351951 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: CLONE_RANGE ioctl is supported
>> 2017-05-24 11:21:23.351970 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to create simple subvolume test_subvol: (17) File exists
>> 2017-05-24 11:21:23.351981 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE is supported
>> 2017-05-24 11:21:23.351984 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
>> 2017-05-24 11:21:23.351987 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed with EPERM as non-root; remount with -o user_subvol_rm_allowed
>> 2017-05-24 11:21:23.351996 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: snaps enabled, but no SNAP_DESTROY ioctl; DISABLING
>> 2017-05-24 11:21:23.352573 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: START_SYNC is supported (transid 252877)
>> 2017-05-24 11:21:23.353001 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: WAIT_SYNC is supported
>> 2017-05-24 11:21:23.353012 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: removing old async_snap_test
>> 2017-05-24 11:21:23.353016 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove old async_snap_test: (1) Operation not permitted
>> 2017-05-24 11:21:23.353021 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_CREATE_V2 is supported
>> 2017-05-24 11:21:23.353022 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: SNAP_DESTROY failed: (1) Operation not permitted
>> 2017-05-24 11:21:23.353027 7f6830a36940  0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-6) detect_feature: failed to remove test_subvol: (1) Operation not permitted
>> 2017-05-24 11:21:23.355156 7f6830a36940  0 filestore(/var/lib/ceph/osd/ceph-6) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
>> 2017-05-24 11:21:23.355881 7f6830a36940 -1 filestore(/var/lib/ceph/osd/ceph-6) could not find -1/23c2fcde/osd_superblock/0 in index: (2) No such file or directory
>> 2017-05-24 11:21:23.355891 7f6830a36940 -1 osd.6 0 OSD::init() : unable to read osd superblock
>> 2017-05-24 11:21:23.356411 7f6830a36940 -1 ^[[0;31m ** ERROR: osd init failed: (22) Invalid argument^[[0m
>> 
>> it is all I get for this osd in logs, when I try to start it.
>> 
>> >> osd 10, 37, 72 are startable
>> 
>> > With those started, I'd repeat the original sequence and get a fresh pg
>> > query to confirm that it still wants just osd.6.
>> 
>> You  mean about procedure with loop and taking down OSDs, which broken
>> PGs are pointing to ?
>> pg 1.60 is down+remapped+peering, acting [66,40]
>> pg 1.165 is down+peering, acting [67,88,48]
>> 
>> for pg 1.60 <--> 66 down, then in loop check pg query ?

> Right.

And now it is very weird... I brought osd.37 up and ran the loop

while true; do ceph tell 1.165 query; done

which caught this:

https://pastebin.com/zKu06fJn

Can you tell what is wrong now?

>> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> > ranodm osd (not one of these ones), import the pg into that osd, and start
>> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> > point.  repeat with the same basic process with the other pg.
>> 
>> I have already did 'ceph osd lost 6', do I need to do this once again ?

> Hmm not sure, if the OSD is empty then there is no harm in doing it again.
> Try that first since it might resolve it.  If not, do the query loop 
> above.

> s



-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 15:54                             ` Łukasz Chrustek
@ 2017-05-24 16:02                               ` Łukasz Chrustek
  2017-05-24 17:07                                 ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 16:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

> And  now  it  is very weird.... I made osd.37 up, and loop
> while true;do; ceph tell 1.165 query ;done

I need to explain a bit more - all I did was start ceph-osd id=37 on the
storage node; in ceph osd tree this osd is marked as out:


-17  21.49995     host stor8
 22   1.59999         osd.22            up  1.00000          1.00000 
 23   1.59999         osd.23            up  1.00000          1.00000 
 36   2.09999         osd.36            up  1.00000          1.00000 
 37   2.09999         osd.37            up        0          1.00000 
 38   2.50000         osd.38            up  1.00000          1.00000 
 39   2.50000         osd.39            up  1.00000          1.00000 
 40   2.50000         osd.40            up        0          1.00000 
 41   2.50000         osd.41          down        0          1.00000 
 42   2.50000         osd.42            up  1.00000          1.00000 
 43   1.59999         osd.43            up  1.00000          1.00000

 After starting this osd, ceph tell 1.165 query worked for only one call of this command.
> catch this:

> https://pastebin.com/zKu06fJn

> Can You tell, what is wrong now ?


>>> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>>> > ranodm osd (not one of these ones), import the pg into that osd, and start
>>> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>>> > point.  repeat with the same basic process with the other pg.
>>> 
>>> I have already did 'ceph osd lost 6', do I need to do this once again ?

>> Hmm not sure, if the OSD is empty then there is no harm in doing it again.
>> Try that first since it might resolve it.  If not, do the query loop 
>> above.

>> s






-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 16:02                               ` Łukasz Chrustek
@ 2017-05-24 17:07                                 ` Łukasz Chrustek
  2017-05-24 17:16                                   ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 17:07 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel



>> And  now  it  is very weird.... I made osd.37 up, and loop
>> while true;do; ceph tell 1.165 query ;done

> Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
> storage node, in ceph osd tree this osd osd is marked as out:


> -17  21.49995     host stor8
>  22   1.59999         osd.22            up  1.00000          1.00000 
>  23   1.59999         osd.23            up  1.00000          1.00000 
>  36   2.09999         osd.36            up  1.00000          1.00000 
>  37   2.09999         osd.37            up        0          1.00000 
>  38   2.50000         osd.38            up  1.00000          1.00000 
>  39   2.50000         osd.39            up  1.00000          1.00000 
>  40   2.50000         osd.40            up        0          1.00000 
>  41   2.50000         osd.41          down        0          1.00000 
>  42   2.50000         osd.42            up  1.00000          1.00000 
>  43   1.59999         osd.43            up  1.00000          1.00000

>  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
>> catch this:

>> https://pastebin.com/zKu06fJn

Here is the same query output for pg 1.60:

https://pastebin.com/Xuk5iFXr

>> Can You tell, what is wrong now ?


>>>> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>>>> > ranodm osd (not one of these ones), import the pg into that osd, and start
>>>> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>>>> > point.  repeat with the same basic process with the other pg.
>>>> 
>>>> I have already did 'ceph osd lost 6', do I need to do this once again ?

>>> Hmm not sure, if the OSD is empty then there is no harm in doing it again.
>>> Try that first since it might resolve it.  If not, do the query loop 
>>> above.

>>> s









-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 17:07                                 ` Łukasz Chrustek
@ 2017-05-24 17:16                                   ` Sage Weil
  2017-05-24 17:28                                     ` Łukasz Chrustek
  2017-05-24 17:30                                     ` Łukasz Chrustek
  0 siblings, 2 replies; 35+ messages in thread
From: Sage Weil @ 2017-05-24 17:16 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1424 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> 
> >> And  now  it  is very weird.... I made osd.37 up, and loop
> >> while true;do; ceph tell 1.165 query ;done
> 
> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
> > storage node, in ceph osd tree this osd osd is marked as out:
> 
> 
> > -17  21.49995     host stor8
> >  22   1.59999         osd.22            up  1.00000          1.00000 
> >  23   1.59999         osd.23            up  1.00000          1.00000 
> >  36   2.09999         osd.36            up  1.00000          1.00000 
> >  37   2.09999         osd.37            up        0          1.00000 
> >  38   2.50000         osd.38            up  1.00000          1.00000 
> >  39   2.50000         osd.39            up  1.00000          1.00000 
> >  40   2.50000         osd.40            up        0          1.00000 
> >  41   2.50000         osd.41          down        0          1.00000 
> >  42   2.50000         osd.42            up  1.00000          1.00000 
> >  43   1.59999         osd.43            up  1.00000          1.00000
> 
> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
> >> catch this:
> 
> >> https://pastebin.com/zKu06fJn
> 
> here is for pg 1.60:
> 
> https://pastebin.com/Xuk5iFXr

Look at the bottom, after it says

            "blocked": "peering is blocked due to down osds",

Did the 1.165 pg recover?
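
(Something like this should pull just that section out of a fresh query -
plain grep, the key name is the one quoted above:

  ceph tell 1.60 query | grep -A 25 '"blocked"'
)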

sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 17:16                                   ` Sage Weil
@ 2017-05-24 17:28                                     ` Łukasz Chrustek
  2017-05-24 18:16                                       ` Sage Weil
  2017-05-24 17:30                                     ` Łukasz Chrustek
  1 sibling, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 17:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> 
>> >> And  now  it  is very weird.... I made osd.37 up, and loop
>> >> while true;do; ceph tell 1.165 query ;done
>> 
>> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
>> > storage node, in ceph osd tree this osd osd is marked as out:
>> 
>> 
>> > -17  21.49995     host stor8
>> >  22   1.59999         osd.22            up  1.00000          1.00000 
>> >  23   1.59999         osd.23            up  1.00000          1.00000 
>> >  36   2.09999         osd.36            up  1.00000          1.00000 
>> >  37   2.09999         osd.37            up        0          1.00000 
>> >  38   2.50000         osd.38            up  1.00000          1.00000 
>> >  39   2.50000         osd.39            up  1.00000          1.00000 
>> >  40   2.50000         osd.40            up        0          1.00000 
>> >  41   2.50000         osd.41          down        0          1.00000 
>> >  42   2.50000         osd.42            up  1.00000          1.00000 
>> >  43   1.59999         osd.43            up  1.00000          1.00000
>> 
>> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
>> >> catch this:
>> 
>> >> https://pastebin.com/zKu06fJn
>> 
>> here is for pg 1.60:
>> 
>> https://pastebin.com/Xuk5iFXr

> Look at the bottom, after it says

>             "blocked": "peering is blocked due to down osds",

> Did the 1.165 pg recover?

No it didn't:

[root@cc1 ~]# ceph health detail
HEALTH_WARN 1 pgs down; 1 pgs incomplete; 1 pgs peering; 2 pgs stuck inactive
pg 1.165 is stuck inactive since forever, current state incomplete, last acting [67,88,48]
pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [68]
pg 1.60 is down+remapped+peering, acting [68]
pg 1.165 is incomplete, acting [67,88,48]
[root@cc1 ~]#

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 17:16                                   ` Sage Weil
  2017-05-24 17:28                                     ` Łukasz Chrustek
@ 2017-05-24 17:30                                     ` Łukasz Chrustek
  2017-05-24 17:35                                       ` Łukasz Chrustek
  1 sibling, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 17:30 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> 
>> >> And  now  it  is very weird.... I made osd.37 up, and loop
>> >> while true;do; ceph tell 1.165 query ;done
>> 
>> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
>> > storage node, in ceph osd tree this osd osd is marked as out:
>> 
>> 
>> > -17  21.49995     host stor8
>> >  22   1.59999         osd.22            up  1.00000          1.00000 
>> >  23   1.59999         osd.23            up  1.00000          1.00000 
>> >  36   2.09999         osd.36            up  1.00000          1.00000 
>> >  37   2.09999         osd.37            up        0          1.00000 
>> >  38   2.50000         osd.38            up  1.00000          1.00000 
>> >  39   2.50000         osd.39            up  1.00000          1.00000 
>> >  40   2.50000         osd.40            up        0          1.00000 
>> >  41   2.50000         osd.41          down        0          1.00000 
>> >  42   2.50000         osd.42            up  1.00000          1.00000 
>> >  43   1.59999         osd.43            up  1.00000          1.00000
>> 
>> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
>> >> catch this:
>> 
>> >> https://pastebin.com/zKu06fJn
>> 
>> here is for pg 1.60:
>> 
>> https://pastebin.com/Xuk5iFXr

> Look at the bottom, after it says

>             "blocked": "peering is blocked due to down osds",

For pg 1.60: all of its osds were down when the ceph tell 1.60 query loop
caught that single reply.

> Did the 1.165 pg recover?

> sage



-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 17:30                                     ` Łukasz Chrustek
@ 2017-05-24 17:35                                       ` Łukasz Chrustek
  0 siblings, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 17:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

>> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>>> 
>>> >> And  now  it  is very weird.... I made osd.37 up, and loop
>>> >> while true;do; ceph tell 1.165 query ;done
>>> 
>>> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
>>> > storage node, in ceph osd tree this osd osd is marked as out:
>>> 
>>> 
>>> > -17  21.49995     host stor8
>>> >  22   1.59999         osd.22            up  1.00000          1.00000 
>>> >  23   1.59999         osd.23            up  1.00000          1.00000 
>>> >  36   2.09999         osd.36            up  1.00000          1.00000 
>>> >  37   2.09999         osd.37            up        0          1.00000 
>>> >  38   2.50000         osd.38            up  1.00000          1.00000 
>>> >  39   2.50000         osd.39            up  1.00000          1.00000 
>>> >  40   2.50000         osd.40            up        0          1.00000 
>>> >  41   2.50000         osd.41          down        0          1.00000 
>>> >  42   2.50000         osd.42            up  1.00000          1.00000 
>>> >  43   1.59999         osd.43            up  1.00000          1.00000
>>> 
>>> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
>>> >> catch this:
>>> 
>>> >> https://pastebin.com/zKu06fJn
>>> 
>>> here is for pg 1.60:
>>> 
>>> https://pastebin.com/Xuk5iFXr

>> Look at the bottom, after it says

>>             "blocked": "peering is blocked due to down osds",

> for   pg  1.60: all osds was down, when ceph tell 1.60 query catch one
> 'interrupt'.

When I try to use ceph-objectstore-tool I get:

[root@stor3 ~]# ceph-objectstore-tool --op export --pgid 1.60 --data-path /mnt --journal-path /mnt/journal --file 1.60.export
Mount failed with '(95) Operation not supported'

[root@stor3 ~]# du -sh /mnt/current/1.60_head
276M    /mnt/current/1.60_head
[root@stor3 ~]# ls -al /mnt/current/1.60_head | wc -l
49
[root@stor3 ~]#

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 17:28                                     ` Łukasz Chrustek
@ 2017-05-24 18:16                                       ` Sage Weil
  2017-05-24 19:47                                         ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-24 18:16 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2211 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> 
> >> >> And  now  it  is very weird.... I made osd.37 up, and loop
> >> >> while true;do; ceph tell 1.165 query ;done
> >> 
> >> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
> >> > storage node, in ceph osd tree this osd osd is marked as out:
> >> 
> >> 
> >> > -17  21.49995     host stor8
> >> >  22   1.59999         osd.22            up  1.00000          1.00000 
> >> >  23   1.59999         osd.23            up  1.00000          1.00000 
> >> >  36   2.09999         osd.36            up  1.00000          1.00000 
> >> >  37   2.09999         osd.37            up        0          1.00000 
> >> >  38   2.50000         osd.38            up  1.00000          1.00000 
> >> >  39   2.50000         osd.39            up  1.00000          1.00000 
> >> >  40   2.50000         osd.40            up        0          1.00000 
> >> >  41   2.50000         osd.41          down        0          1.00000 
> >> >  42   2.50000         osd.42            up  1.00000          1.00000 
> >> >  43   1.59999         osd.43            up  1.00000          1.00000
> >> 
> >> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
> >> >> catch this:
> >> 
> >> >> https://pastebin.com/zKu06fJn
> >> 
> >> here is for pg 1.60:
> >> 
> >> https://pastebin.com/Xuk5iFXr
> 
> > Look at the bottom, after it says
> 
> >             "blocked": "peering is blocked due to down osds",
> 
> > Did the 1.165 pg recover?
> 
> No it didn't:
> 
> [root@cc1 ~]# ceph health detail
> HEALTH_WARN 1 pgs down; 1 pgs incomplete; 1 pgs peering; 2 pgs stuck inactive
> pg 1.165 is stuck inactive since forever, current state incomplete, last acting [67,88,48]
> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [68]
> pg 1.60 is down+remapped+peering, acting [68]
> pg 1.165 is incomplete, acting [67,88,48]
> [root@cc1 ~]#

Hrm.

 ceph daemon osd.67 config set debug_osd 20
 ceph daemon osd.67 config set debug_ms 1
 ceph osd down 67

and capture the resulting log segment, then post it with
ceph-post-file.
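
(Roughly, as a sketch - the log path is the stock one, and the last two
lines just drop the verbosity back down afterwards:

  ceph daemon osd.67 config set debug_osd 20
  ceph daemon osd.67 config set debug_ms 1
  ceph osd down 67       # forces osd.67 to re-peer and log the attempt
  # wait for the pg to go through peering, then upload the log:
  ceph-post-file /var/log/ceph/ceph-osd.67.log
  ceph daemon osd.67 config set debug_osd 0/5
  ceph daemon osd.67 config set debug_ms 0/5
)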

sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 18:16                                       ` Sage Weil
@ 2017-05-24 19:47                                         ` Łukasz Chrustek
  0 siblings, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 19:47 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> 
>> >> >> And  now  it  is very weird.... I made osd.37 up, and loop
>> >> >> while true;do; ceph tell 1.165 query ;done
>> >> 
>> >> > Here  need  to  explain  more  - all I did was start ceph-osd id=37 on
>> >> > storage node, in ceph osd tree this osd osd is marked as out:
>> >> 
>> >> 
>> >> > -17  21.49995     host stor8
>> >> >  22   1.59999         osd.22            up  1.00000          1.00000 
>> >> >  23   1.59999         osd.23            up  1.00000          1.00000 
>> >> >  36   2.09999         osd.36            up  1.00000          1.00000 
>> >> >  37   2.09999         osd.37            up        0          1.00000 
>> >> >  38   2.50000         osd.38            up  1.00000          1.00000 
>> >> >  39   2.50000         osd.39            up  1.00000          1.00000 
>> >> >  40   2.50000         osd.40            up        0          1.00000 
>> >> >  41   2.50000         osd.41          down        0          1.00000 
>> >> >  42   2.50000         osd.42            up  1.00000          1.00000 
>> >> >  43   1.59999         osd.43            up  1.00000          1.00000
>> >> 
>> >> >  after start of this osd, ceph tell 1.165 query  worked only for one call of this command
>> >> >> catch this:
>> >> 
>> >> >> https://pastebin.com/zKu06fJn
>> >> 
>> >> here is for pg 1.60:
>> >> 
>> >> https://pastebin.com/Xuk5iFXr
>> 
>> > Look at the bottom, after it says
>> 
>> >             "blocked": "peering is blocked due to down osds",
>> 
>> > Did the 1.165 pg recover?
>> 
>> No it didn't:
>> 
>> [root@cc1 ~]# ceph health detail
>> HEALTH_WARN 1 pgs down; 1 pgs incomplete; 1 pgs peering; 2 pgs stuck inactive
>> pg 1.165 is stuck inactive since forever, current state incomplete, last acting [67,88,48]
>> pg 1.60 is stuck inactive since forever, current state down+remapped+peering, last acting [68]
>> pg 1.60 is down+remapped+peering, acting [68]
>> pg 1.165 is incomplete, acting [67,88,48]
>> [root@cc1 ~]#

> Hrm.

>  ceph daemon osd.67 config set debug_osd 20
>  ceph daemon osd.67 config set debug_ms 1
>  ceph osd down 67

> and capture the log resulting log segment, then post it with 
> ceph-post-file.

args: -- /var/log/ceph/ceph-osd.67.log
/usr/bin/ceph-post-file: upload tag 05a02f14-8fd6-43da-9b9c-e42cd1fce560
/usr/bin/ceph-post-file: user: root@stor3
/usr/bin/ceph-post-file: will upload file /var/log/ceph/ceph-osd.67.log
sftp> mkdir post/05a02f14-8fd6-43da-9b9c-e42cd1fce560_root@stor3_8612f2d9-bb31-4d5e-b3e7-3722f8d13314
sftp> cd post/05a02f14-8fd6-43da-9b9c-e42cd1fce560_root@stor3_8612f2d9-bb31-4d5e-b3e7-3722f8d13314
sftp> put /tmp/tmp.rggR3suNMt user
sftp> put /var/log/ceph/ceph-osd.67.log

/usr/bin/ceph-post-file: copy the upload id below to share with a dev:

ceph-post-file: 05a02f14-8fd6-43da-9b9c-e42cd1fce560

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 14:47                       ` Sage Weil
  2017-05-24 15:00                         ` Łukasz Chrustek
@ 2017-05-24 21:38                         ` Łukasz Chrustek
  2017-05-24 21:53                           ` Sage Weil
  1 sibling, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 21:38 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

>>
>> > This
>> 
>> osd 6 - isn't startable

> Disk completely 100% dead, or just borken enough that ceph-osd won't 
> start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> from this osd to recover any important writes on that osd.

>> osd 10, 37, 72 are startable

> With those started, I'd repeat the original sequence and get a fresh pg
> query to confirm that it still wants just osd.6.

> use ceph-objectstore-tool to export the pg from osd.6, stop some other
> ranodm osd (not one of these ones), import the pg into that osd, and start
> again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> point.  repeat with the same basic process with the other pg.

Here is the output from ceph-objectstore-tool - it also didn't succeed:

https://pastebin.com/7XGAHdKH


-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 21:38                         ` Łukasz Chrustek
@ 2017-05-24 21:53                           ` Sage Weil
  2017-05-24 22:09                             ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-24 21:53 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 5583 bytes --]

On Wed, 24 May 2017, Łukasz Chrustek wrote:
> Hello,
> 
> >>
> >> > This
> >> 
> >> osd 6 - isn't startable
> 
> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> > from this osd to recover any important writes on that osd.
> 
> >> osd 10, 37, 72 are startable
> 
> > With those started, I'd repeat the original sequence and get a fresh pg
> > query to confirm that it still wants just osd.6.
> 
> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
> > ranodm osd (not one of these ones), import the pg into that osd, and start
> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> > point.  repeat with the same basic process with the other pg.
> 
> Here is output from ceph-objectstore-tool - also didn't success:
> 
> https://pastebin.com/7XGAHdKH

Hmm, btrfs:

2017-05-24 23:28:58.547456 7f500948e940 -1 
filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
/var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid 
losing new data

You could try setting --osd-use-stale-snap as suggested.

Is it the same error with the other one?


Looking at the log you sent earlier for 1.165 on osd.67, and the primary 
reports:

2017-05-24 21:37:11.505256 7efdbc1e5700  5 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] enter Started/Primary/Peering/GetLog
2017-05-24 21:37:11.505291 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] calc_acting osd.37 1.165( v 112598'67281552 (112574'67278547,112598'67281552] lb 1/56500165/rbd_data.674a3ed7dffd473.0000000000000b38/
head (NIBBLEWISE) local-les=112584 n=1 ec=253 les/c/f 112600/112582/70621 115601/115601/115601)
2017-05-24 21:37:11.505299 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] calc_acting osd.38 1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/h
ead (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601)
2017-05-24 21:37:11.505306 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] calc_acting osd.48 1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/h
ead (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601)
2017-05-24 21:37:11.505313 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] calc_acting osd.67 1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/h
ead (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601)
2017-05-24 21:37:11.505319 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] calc_acting osd.88 1.165( empty local-les=0 n=0 ec=253 les/c/f 112600/112582/70621 115601/115601/115601)
2017-05-24 21:37:11.505326 7efdbc1e5700 10 osd.67 pg_epoch: 115601 pg[1.165( v 112959'67282586 (112574'67278552,112959'67282586] lb 1/db616165/rbd_data.ed9979641a9d82.000000000001dcee/head (NIBBLEWISE) local-les=112600 n=354 ec=253 les/c/f 112600/112582/70621 115601/115601/115601) [67,88,48] r=0 lpr=115601 pi=112581-115600/111 crt=112959'67282586 lcod 0'0 mlcod 0'0 peering NIBBLEWISE] choose_acting failed

in particular, osd 37 38 48 67 all have incomplete copies of the PG (they 
are mid-backfill) and 68 has nothing.  Some data is lost unless you can 
recover another OSD with that PG.

The set of OSDs that might have data are: 6,10,33,72,84

If that bears no fruit, then you can force last_backfill to report 
complete on one of those OSDs and it'll think it has all the data even 
though some of it is likely gone.  (We can pick one that is farther 
along... 38 48 and 67 seem to all match.)

sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 21:53                           ` Sage Weil
@ 2017-05-24 22:09                             ` Łukasz Chrustek
  2017-05-24 22:27                               ` Sage Weil
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 22:09 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> Hello,
>> 
>> >>
>> >> > This
>> >> 
>> >> osd 6 - isn't startable
>> 
>> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
>> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
>> > from this osd to recover any important writes on that osd.
>> 
>> >> osd 10, 37, 72 are startable
>> 
>> > With those started, I'd repeat the original sequence and get a fresh pg
>> > query to confirm that it still wants just osd.6.
>> 
>> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> > ranodm osd (not one of these ones), import the pg into that osd, and start
>> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> > point.  repeat with the same basic process with the other pg.
>> 
>> Here is output from ceph-objectstore-tool - also didn't success:
>> 
>> https://pastebin.com/7XGAHdKH

> Hmm, btrfs:

> 2017-05-24 23:28:58.547456 7f500948e940 -1 
> filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
> /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
> losing new data

> You could try setting --osd-use-stale-snap as suggested.

Yes... tried... and I simply get rided of 39GB data...

> Is it the same error with the other one?

Yes: https://pastebin.com/7XGAHdKH




> in particular, osd 37 38 48 67 all have incomplete copies of the PG (they
> are mid-backfill) and 68 has nothing.  Some data is lost unless you can
> recovery another OSD with that PG.

> The set of OSDs that might have data are: 6,10,33,72,84

> If that bears no fruit, then you can force last_backfill to report

How do I force last_backfill?

> complete on one of those OSDs and it'll think it has all the data even
> though some of it is likely gone.  (We can pick one that is farther 
> along... 38 48 and 67 seem to all match.)

> sage



-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 22:09                             ` Łukasz Chrustek
@ 2017-05-24 22:27                               ` Sage Weil
  2017-05-24 22:46                                 ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-24 22:27 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2337 bytes --]

On Thu, 25 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> Hello,
> >> 
> >> >>
> >> >> > This
> >> >> 
> >> >> osd 6 - isn't startable
> >> 
> >> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
> >> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> >> > from this osd to recover any important writes on that osd.
> >> 
> >> >> osd 10, 37, 72 are startable
> >> 
> >> > With those started, I'd repeat the original sequence and get a fresh pg
> >> > query to confirm that it still wants just osd.6.
> >> 
> >> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
> >> > ranodm osd (not one of these ones), import the pg into that osd, and start
> >> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> >> > point.  repeat with the same basic process with the other pg.
> >> 
> >> Here is output from ceph-objectstore-tool - also didn't success:
> >> 
> >> https://pastebin.com/7XGAHdKH
> 
> > Hmm, btrfs:
> 
> > 2017-05-24 23:28:58.547456 7f500948e940 -1 
> > filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
> > /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
> > losing new data
> 
> > You could try setting --osd-use-stale-snap as suggested.
> 
> Yes... tried... and I simply get rided of 39GB data...

What does "get rided" mean?



> 
> > Is it the same error with the other one?
> 
> Yes: https://pastebin.com/7XGAHdKH
> 
> 
> 
> 
> > in particular, osd 37 38 48 67 all have incomplete copies of the PG (they
> > are mid-backfill) and 68 has nothing.  Some data is lost unless you can
> > recovery another OSD with that PG.
> 
> > The set of OSDs that might have data are: 6,10,33,72,84
> 
> > If that bears no fruit, then you can force last_backfill to report
> 
> how to force last_backfill ?
> 
> > complete on one of those OSDs and it'll think it has all the data even
> > though some of it is likely gone.  (We can pick one that is farther 
> > along... 38 48 and 67 seem to all match.)
> 
> > sage
> 
> 
> 
> -- 
> Pozdrowienia,
>  Łukasz Chrustek
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 22:27                               ` Sage Weil
@ 2017-05-24 22:46                                 ` Łukasz Chrustek
  2017-05-25  2:06                                   ` Sage Weil
  2017-05-30 13:21                                   ` Sage Weil
  0 siblings, 2 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-24 22:46 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Thu, 25 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> Hello,
>> >> 
>> >> >>
>> >> >> > This
>> >> >> 
>> >> >> osd 6 - isn't startable
>> >> 
>> >> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
>> >> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
>> >> > from this osd to recover any important writes on that osd.
>> >> 
>> >> >> osd 10, 37, 72 are startable
>> >> 
>> >> > With those started, I'd repeat the original sequence and get a fresh pg
>> >> > query to confirm that it still wants just osd.6.
>> >> 
>> >> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> >> > ranodm osd (not one of these ones), import the pg into that osd, and start
>> >> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> >> > point.  repeat with the same basic process with the other pg.
>> >> 
>> >> Here is output from ceph-objectstore-tool - also didn't success:
>> >> 
>> >> https://pastebin.com/7XGAHdKH
>> 
>> > Hmm, btrfs:
>> 
>> > 2017-05-24 23:28:58.547456 7f500948e940 -1 
>> > filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
>> > /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
>> > losing new data
>> 
>> > You could try setting --osd-use-stale-snap as suggested.
>> 
>> Yes... tried... and I simply get rided of 39GB data...

> What does "get rided" mean?

according to this pastebin: https://pastebin.com/QPcpkjg4

ls -R /var/lib/ceph/osd/ceph-33/current/

/var/lib/ceph/osd/ceph-33/current/:

commit_op_seq  omap



/var/lib/ceph/osd/ceph-33/current/omap:

000003.log  CURRENT  LOCK  MANIFEST-000002

Earlier there were data files in there.

>> 
>> > Is it the same error with the other one?
>> 
>> Yes: https://pastebin.com/7XGAHdKH
>> 
>> 
>> 
>> 
>> > in particular, osd 37 38 48 67 all have incomplete copies of the PG (they
>> > are mid-backfill) and 68 has nothing.  Some data is lost unless you can
>> > recovery another OSD with that PG.
>> 
>> > The set of OSDs that might have data are: 6,10,33,72,84
>> 
>> > If that bears no fruit, then you can force last_backfill to report
>> complete on one of those OSDs and it'll think it has all the data even
>> though some of it is likely gone.  (We can pick one that is farther
>> along... 38 48 and 67 seem to all match.

Can you explain what you mean by 'force last_backfill to report
complete'? The current value for PG 1.60 is MAX and for 1.165 is
1\/db616165\/rbd_data.ed9979641a9d82.000000000001dcee\/head

-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 22:46                                 ` Łukasz Chrustek
@ 2017-05-25  2:06                                   ` Sage Weil
  2017-05-25 11:22                                     ` Łukasz Chrustek
  2017-05-30 13:21                                   ` Sage Weil
  1 sibling, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-25  2:06 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3049 bytes --]

On Thu, 25 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Thu, 25 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> >> Hello,
> >> >> 
> >> >> >>
> >> >> >> > This
> >> >> >> 
> >> >> >> osd 6 - isn't startable
> >> >> 
> >> >> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
> >> >> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> >> >> > from this osd to recover any important writes on that osd.
> >> >> 
> >> >> >> osd 10, 37, 72 are startable
> >> >> 
> >> >> > With those started, I'd repeat the original sequence and get a fresh pg
> >> >> > query to confirm that it still wants just osd.6.
> >> >> 
> >> >> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
> >> >> > ranodm osd (not one of these ones), import the pg into that osd, and start
> >> >> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> >> >> > point.  repeat with the same basic process with the other pg.
> >> >> 
> >> >> Here is output from ceph-objectstore-tool - also didn't success:
> >> >> 
> >> >> https://pastebin.com/7XGAHdKH
> >> 
> >> > Hmm, btrfs:
> >> 
> >> > 2017-05-24 23:28:58.547456 7f500948e940 -1 
> >> > filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
> >> > /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
> >> > losing new data
> >> 
> >> > You could try setting --osd-use-stale-snap as suggested.
> >> 
> >> Yes... tried... and I simply get rided of 39GB data...
> 
> > What does "get rided" mean?
> 
> according to this pastebin: https://pastebin.com/QPcpkjg4
> 
> ls -R /var/lib/ceph/osd/ceph-33/current/
> 
> /var/lib/ceph/osd/ceph-33/current/:
> 
> commit_op_seq  omap
> 
> 
> 
> /var/lib/ceph/osd/ceph-33/current/omap:
> 
> 000003.log  CURRENT  LOCK  MANIFEST-000002
> 
> earlier there were same data files.

Yeah, looks like all the data was deleted from the device.  :(

> >> 
> >> > Is it the same error with the other one?
> >> 
> >> Yes: https://pastebin.com/7XGAHdKH
> >> 
> >> 
> >> 
> >> 
> >> > in particular, osd 37 38 48 67 all have incomplete copies of the PG (they
> >> > are mid-backfill) and 68 has nothing.  Some data is lost unless you can
> >> > recovery another OSD with that PG.
> >> 
> >> > The set of OSDs that might have data are: 6,10,33,72,84
> >> 
> >> > If that bears no fruit, then you can force last_backfill to report
> >> complete on one of those OSDs and it'll think it has all the data even
> >> though some of it is likely gone.  (We can pick one that is farther
> >> along... 38 48 and 67 seem to all match.
> 
> Can  You  explain  what  do You mean by 'force last_backfill to report
> complete'  ?  The  current  value  for PG 1.60 is MAX and for 1.165 is
> 1\/db616165\/rbd_data.ed9979641a9d82.000000000001dcee\/head 

ceph-objectstore-tool has a mark-complete operation.  Do that on one of
the OSDs that has the more advanced last_backfill (like the one above).
After you restart it, the PG should recover.
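
(A rough sketch of that invocation - osd.48 here is just one of the
candidates named above, and the paths assume the same layout as the other
OSDs in this thread; run it with that OSD stopped:

  ceph-objectstore-tool --op mark-complete --pgid 1.165 \
      --data-path /var/lib/ceph/osd/ceph-48 \
      --journal-path /var/lib/ceph/osd/ceph-48/journal
  # then start that OSD again and watch whether the pg peers
)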

Good luck!
sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-25  2:06                                   ` Sage Weil
@ 2017-05-25 11:22                                     ` Łukasz Chrustek
  2017-05-29 15:31                                       ` Łukasz Chrustek
  0 siblings, 1 reply; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-25 11:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi,

> On Thu, 25 May 2017, Łukasz Chrustek wrote:
>> Cześć,
>> 
>> > On Thu, 25 May 2017, Łukasz Chrustek wrote:
>> >> Cześć,
>> >> 
>> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
>> >> >> Hello,
>> >> >> 
>> >> >> >>
>> >> >> >> > This
>> >> >> >> 
>> >> >> >> osd 6 - isn't startable
>> >> >> 
>> >> >> > Disk completely 100% dead, or just borken enough that ceph-osd won't 
>> >> >> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
>> >> >> > from this osd to recover any important writes on that osd.
>> >> >> 
>> >> >> >> osd 10, 37, 72 are startable
>> >> >> 
>> >> >> > With those started, I'd repeat the original sequence and get a fresh pg
>> >> >> > query to confirm that it still wants just osd.6.
>> >> >> 
>> >> >> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
>> >> >> > ranodm osd (not one of these ones), import the pg into that osd, and start
>> >> >> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
>> >> >> > point.  repeat with the same basic process with the other pg.
>> >> >> 
>> >> >> Here is output from ceph-objectstore-tool - also didn't success:
>> >> >> 
>> >> >> https://pastebin.com/7XGAHdKH
>> >> 
>> >> > Hmm, btrfs:
>> >> 
>> >> > 2017-05-24 23:28:58.547456 7f500948e940 -1 
>> >> > filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
>> >> > /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
>> >> > losing new data
>> >> 
>> >> > You could try setting --osd-use-stale-snap as suggested.
>> >> 
>> >> Yes... tried... and I simply get rided of 39GB data...
>> 
>> > What does "get rided" mean?
>> 
>> according to this pastebin: https://pastebin.com/QPcpkjg4
>> 
>> ls -R /var/lib/ceph/osd/ceph-33/current/
>> 
>> /var/lib/ceph/osd/ceph-33/current/:
>> 
>> commit_op_seq  omap
>> 
>> 
>> 
>> /var/lib/ceph/osd/ceph-33/current/omap:
>> 
>> 000003.log  CURRENT  LOCK  MANIFEST-000002
>> 
>> earlier there were same data files.

> Yeah, looks like all the data was deleted from the device.  :(

>> >> 
>> >> > Is it the same error with the other one?
>> >> 
>> >> Yes: https://pastebin.com/7XGAHdKH
>> >> 
>> >> 
>> >> 
>> >> 
>> >> > in particular, osd 37 38 48 67 all have incomplete copies of the PG (they
>> >> > are mid-backfill) and 68 has nothing.  Some data is lost unless you can
>> >> > recovery another OSD with that PG.
>> >> 
>> >> > The set of OSDs that might have data are: 6,10,33,72,84
>> >> 
>> >> > If that bears no fruit, then you can force last_backfill to report
>> >> complete on one of those OSDs and it'll think it has all the data even
>> >> though some of it is likely gone.  (We can pick one that is farther
>> >> along... 38 48 and 67 seem to all match.
>> 
>> Can  You  explain  what  do You mean by 'force last_backfill to report
>> complete'  ?  The  current  value  for PG 1.60 is MAX and for 1.165 is
>> 1\/db616165\/rbd_data.ed9979641a9d82.000000000001dcee\/head 

> ceph-objectstore-tool has a mark-complete operation.  Do that one one of
> the OSDs that has the more advanced last_backfill (like the one above).
> After you restart the PG should recover.

This is (https://pastebin.com/Jv2DpcB3) the pg dump_stuck output BEFORE running:
ceph-objectstore-tool --debug --op mark-complete --pgid 1.165 --data-path /var/lib/ceph/osd/ceph-48 --journal-path /var/lib/ceph/osd/ceph-48/journal --osd-use-stale-snap

As with the previous use of this tool, the data went away:

[root@stor5 /var/lib/ceph/osd/ceph-48]# du -sh current
20K     current

[root@stor5 /var/lib/ceph/osd/ceph-48/current]# ls -R
.:
commit_op_seq  nosnap  omap/

./omap:
000011.log  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000010

After running ceph-objectstore-tool it looks like this:

ceph pg dump_stuck
ok
pg_stat state   up      up_primary      acting  acting_primary
1.39    active+remapped+backfilling     [11,4,39]       11      [5,39,70]       5
1.1a9   active+remapped+backfilling     [11,30,3]       11      [0,30,8]        0
1.b     active+remapped+backfilling     [11,36,94]      11      [38,97,70]      38
1.12f   active+remapped+backfilling     [14,11,47]      14      [14,5,69]       14
1.1d2   active+remapped+backfilling     [11,2,38]       11      [0,36,49]       0
1.133   active+remapped+backfilling     [42,11,83]      42      [42,89,21]      42
40.69   stale+active+undersized+degraded        [48]    48      [48]    48
1.9d    active+remapped+backfilling     [39,2,11]       39      [39,2,86]       39
1.a2    active+remapped+backfilling     [11,12,34]      11      [14,35,95]      14
1.10a   active+remapped+backfilling     [11,2,87]       11      [1,87,81]       1
1.70    active+remapped+backfilling     [14,39,11]      14      [14,39,4]       14
1.60    down+remapped+peering   [83,69,68]      83      [9]     9
1.eb    active+remapped+backfilling     [11,18,53]      11      [14,53,69]      14
1.8d    active+remapped+backfilling     [11,0,30]       11      [36,0,30]       36
1.118   active+remapped+backfilling     [34,11,12]      34      [34,20,86]      34
1.121   active+remapped+backfilling     [43,11,35]      43      [43,35,2]       43
1.177   active+remapped+backfilling     [14,1,11]       14      [14,1,38]       14
1.17c   active+remapped+backfilling     [5,94,11]       5       [5,94,7]        5
1.16d   active+remapped+backfilling     [96,11,53]      96      [96,52,9]       96
1.19a   active+remapped+backfilling     [11,0,14]       11      [0,17,35]       0
1.165   down+peering    [39,55,82]      39      [39,55,82]      39
1.1a    active+remapped+backfilling     [36,52,11]      36      [36,52,96]      36
1.e7    active+remapped+backfilling     [11,35,44]      11      [34,44,9]       34


Is there any chance to rescue this cluster?


-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-25 11:22                                     ` Łukasz Chrustek
@ 2017-05-29 15:31                                       ` Łukasz Chrustek
  0 siblings, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-05-29 15:31 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hello,

> ./omap:
> 000011.log  CURRENT  LOCK  LOG  LOG.old  MANIFEST-000010

> after running ceph-objectstore-tool it is:

> ceph pg dump_stuck
> ok
> pg_stat state   up      up_primary      acting  acting_primary
> 1.39    active+remapped+backfilling     [11,4,39]       11      [5,39,70]       5
> 1.1a9   active+remapped+backfilling     [11,30,3]       11      [0,30,8]        0
> 1.b     active+remapped+backfilling     [11,36,94]      11      [38,97,70]      38
> 1.12f   active+remapped+backfilling     [14,11,47]      14      [14,5,69]       14
> 1.1d2   active+remapped+backfilling     [11,2,38]       11      [0,36,49]       0
> 1.133   active+remapped+backfilling     [42,11,83]      42      [42,89,21]      42
> 40.69   stale+active+undersized+degraded        [48]    48      [48]    48
> 1.9d    active+remapped+backfilling     [39,2,11]       39      [39,2,86]       39
> 1.a2    active+remapped+backfilling     [11,12,34]      11      [14,35,95]      14
> 1.10a   active+remapped+backfilling     [11,2,87]       11      [1,87,81]       1
> 1.70    active+remapped+backfilling     [14,39,11]      14      [14,39,4]       14
> 1.60    down+remapped+peering   [83,69,68]      83      [9]     9
> 1.eb    active+remapped+backfilling     [11,18,53]      11      [14,53,69]      14
> 1.8d    active+remapped+backfilling     [11,0,30]       11      [36,0,30]       36
> 1.118   active+remapped+backfilling     [34,11,12]      34      [34,20,86]      34
> 1.121   active+remapped+backfilling     [43,11,35]      43      [43,35,2]       43
> 1.177   active+remapped+backfilling     [14,1,11]       14      [14,1,38]       14
> 1.17c   active+remapped+backfilling     [5,94,11]       5       [5,94,7]        5
> 1.16d   active+remapped+backfilling     [96,11,53]      96      [96,52,9]       96
> 1.19a   active+remapped+backfilling     [11,0,14]       11      [0,17,35]       0
> 1.165   down+peering    [39,55,82]      39      [39,55,82]      39
> 1.1a    active+remapped+backfilling     [36,52,11]      36      [36,52,96]      36
> 1.e7    active+remapped+backfilling     [11,35,44]      11      [34,44,9]       34


> Is there any chance to rescue this cluster ?

I have now turned off all OSDs and MONs, and after that turned two of the
three MONs back on to form a quorum. On all osd hosts every ceph process is
off. But ceph osd tree still shows old/stale data: https://pastebin.com/pVGLxAPs

Why doesn't ceph see that all the osds are down? What could be blocking it
like this?



-- 
Regards,
 Łukasz Chrustek


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-24 22:46                                 ` Łukasz Chrustek
  2017-05-25  2:06                                   ` Sage Weil
@ 2017-05-30 13:21                                   ` Sage Weil
  2017-06-10 22:45                                     ` Łukasz Chrustek
  1 sibling, 1 reply; 35+ messages in thread
From: Sage Weil @ 2017-05-30 13:21 UTC (permalink / raw)
  To: Łukasz Chrustek; +Cc: ceph-devel


On Thu, 25 May 2017, Łukasz Chrustek wrote:
> Cześć,
> 
> > On Thu, 25 May 2017, Łukasz Chrustek wrote:
> >> Cześć,
> >> 
> >> > On Wed, 24 May 2017, Łukasz Chrustek wrote:
> >> >> Hello,
> >> >> 
> >> >> >>
> >> >> >> > This
> >> >> >> 
> >> >> >> osd 6 - isn't startable
> >> >> 
> >> >> > Disk completely 100% dead, or just broken enough that ceph-osd won't 
> >> >> > start?  ceph-objectstore-tool can be used to extract a copy of the 2 pgs
> >> >> > from this osd to recover any important writes on that osd.
> >> >> 
> >> >> >> osd 10, 37, 72 are startable
> >> >> 
> >> >> > With those started, I'd repeat the original sequence and get a fresh pg
> >> >> > query to confirm that it still wants just osd.6.
> >> >> 
> >> >> > use ceph-objectstore-tool to export the pg from osd.6, stop some other
> >> >> > random osd (not one of these ones), import the pg into that osd, and start
> >> >> > again.  once it is up, 'ceph osd lost 6'.  the pg *should* peer at that
> >> >> > point.  repeat with the same basic process with the other pg.
> >> >> 
> >> >> Here is the output from ceph-objectstore-tool - it also didn't succeed:
> >> >> 
> >> >> https://pastebin.com/7XGAHdKH
> >> 
> >> > Hmm, btrfs:
> >> 
> >> > 2017-05-24 23:28:58.547456 7f500948e940 -1 
> >> > filestore(/var/lib/ceph/osd/ceph-84) ERROR: 
> >> > /var/lib/ceph/osd/ceph-84/current/nosnap exists, not rolling back to avoid
> >> > losing new data
> >> 
> >> > You could try setting --osd-use-stale-snap as suggested.
> >> 
> >> Yes... tried... and I simply get rided of 39GB data...
> 
> > What does "get rided" mean?
> 
> according to this pastebin: https://pastebin.com/QPcpkjg4
> 
> ls -R /var/lib/ceph/osd/ceph-33/current/
> 
> /var/lib/ceph/osd/ceph-33/current/:
> 
> commit_op_seq  omap
> 
> 
> 
> /var/lib/ceph/osd/ceph-33/current/omap:
> 
> 000003.log  CURRENT  LOCK  MANIFEST-000002
> 
> earlier there were some data files there.

Okay, sorry I took a while to get back to you.  It looks like I gave 
you bad advice here!  The 'nosnap' file means filestore was 
operating in non-snapshotting mode, and the --osd-use-stale-snap 
warning that it would lose data was real... it rolled back to an empty 
state and threw out the data on the device.  :( :(  I'm *very* sorry about 
this!  I haven't looked at or worked with the btrfs mode in ages (we 
don't recommend it and almost nobody uses it) but I should have been 
paying close attention.
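
(For anyone who ends up in the same spot, the safer order is presumably to
take an offline export of the affected PGs with ceph-objectstore-tool before
attempting any rollback or repair option, so the data can be re-imported
elsewhere if the attempt goes wrong. A minimal sketch with illustrative OSD
paths and pgid; the OSDs must be stopped first:)

# export the PG from the stopped OSD's filestore
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-84 \
    --journal-path /var/lib/ceph/osd/ceph-84/journal \
    --pgid 1.165 --op export --file /root/pg-1.165.export

# later, import it into another stopped OSD if needed
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 \
    --journal-path /var/lib/ceph/osd/ceph-33/journal \
    --op import --file /root/pg-1.165.export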

What is the state of the cluster now?

sage

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: Problem with query and any operation on PGs
  2017-05-30 13:21                                   ` Sage Weil
@ 2017-06-10 22:45                                     ` Łukasz Chrustek
  0 siblings, 0 replies; 35+ messages in thread
From: Łukasz Chrustek @ 2017-06-10 22:45 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

> Okay, sorry I took a while to get back to you.

Sorry too - most of the time I was focused on this problem.

>  It looks like I gave
> you bad advice here!  The 'nosnap' file means filestore was 
> operating in non-snapshotting mode, and the --osd-use-stale-snap 
> warning that it would lose data was real... it rolled back to an empty
> state and threw out the data on the device.  :( :(  I'm *very* sorry about
> this!  I haven't looked at or worked with the btrfs mode in ages (we 
> don't recommend it and almost nobody uses it) but I should have been 
> paying close attention.

Thank you for your time and effort, it was important to have such help.
There were many errors in the setup of this cluster. We didn't realize
that there could be so many strange things that were f...ed up...

> What is the state of the cluster now?

The cluster is dead. After a few more days of fighting with it we decided
to shut it down. We fixed the scripts for recovering volumes from a
turned-off ceph cluster (this one:
https://github.com/cmgitdream/ceph-rbd-recover-tool) and made them work
with the jewel release (10.2.7). I set up a brand new cluster on other
hardware, and the images are now being imported into it. With some direct
edits to the mysql databases of the openstack services we didn't have to
change anything for our clients from the horizon point of view. Once the
dust settles, we will push our changes to this tool back to github.
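
(A minimal sketch of the import step into the new cluster, assuming the
recovery tool produces one flat image file per volume; the pool and image
names below are illustrative:)

# import a recovered image file as a format 2 RBD image in the new cluster
rbd import --image-format 2 ./volume-example.img volumes/volume-example

# sanity-check the imported image
rbd info volumes/volume-example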

After the migration is finished we will try to bring this dead cluster back
up and take some more aggressive action to make it work anyway.



-- 
Regards,
 Lukasz


^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2017-06-10 22:45 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <175484591.20170523135449@tlen.pl>
2017-05-23 12:48 ` Problem with query and any operation on PGs Łukasz Chrustek
2017-05-23 14:17   ` Sage Weil
2017-05-23 14:43     ` Łukasz Chrustek
     [not found]     ` <1464688590.20170523185052@tlen.pl>
2017-05-23 17:40       ` Sage Weil
2017-05-23 21:43         ` Łukasz Chrustek
2017-05-23 21:48           ` Sage Weil
2017-05-24 13:19             ` Łukasz Chrustek
2017-05-24 13:37               ` Sage Weil
2017-05-24 13:58                 ` Łukasz Chrustek
2017-05-24 14:02                   ` Sage Weil
2017-05-24 14:18                     ` Łukasz Chrustek
2017-05-24 14:47                       ` Sage Weil
2017-05-24 15:00                         ` Łukasz Chrustek
2017-05-24 15:07                           ` Łukasz Chrustek
2017-05-24 15:11                           ` Sage Weil
2017-05-24 15:24                             ` Łukasz Chrustek
2017-05-24 15:54                             ` Łukasz Chrustek
2017-05-24 16:02                               ` Łukasz Chrustek
2017-05-24 17:07                                 ` Łukasz Chrustek
2017-05-24 17:16                                   ` Sage Weil
2017-05-24 17:28                                     ` Łukasz Chrustek
2017-05-24 18:16                                       ` Sage Weil
2017-05-24 19:47                                         ` Łukasz Chrustek
2017-05-24 17:30                                     ` Łukasz Chrustek
2017-05-24 17:35                                       ` Łukasz Chrustek
2017-05-24 21:38                         ` Łukasz Chrustek
2017-05-24 21:53                           ` Sage Weil
2017-05-24 22:09                             ` Łukasz Chrustek
2017-05-24 22:27                               ` Sage Weil
2017-05-24 22:46                                 ` Łukasz Chrustek
2017-05-25  2:06                                   ` Sage Weil
2017-05-25 11:22                                     ` Łukasz Chrustek
2017-05-29 15:31                                       ` Łukasz Chrustek
2017-05-30 13:21                                   ` Sage Weil
2017-06-10 22:45                                     ` Łukasz Chrustek
