* OSD recovery failed because of "leveldb: Corruption : checksum mismatch"
@ 2013-07-16  2:26 wanhai zhu
  2013-07-16  7:11 ` Fwd: " wanhai zhu
  0 siblings, 1 reply; 2+ messages in thread
From: wanhai zhu @ 2013-07-16  2:26 UTC (permalink / raw)
  To: Ceph Development; +Cc: Sage Weil

Dear guys,

I have a Ceph cluster that serves as backend storage for KVM guests.
The cluster has four nodes with three disks each, and the Ceph
version is 0.61.4.

Several days ago the cluster was shut down uncleanly because of a
power failure. When I restarted all the nodes and brought the Ceph
service back up on each node, two OSDs were down and out, and the
error message said the file system on those disks needed repair, so I
ran xfs_check and xfs_repair -L (the exact commands are sketched
below). After that I could mount the disks on their usual directories
and the raw object data looked intact, but when I restarted those
OSDs they went down and out again with "leveldb: Corruption :
checksum mismatch" in the log. Because of this error, several PGs are
now stale+active+clean and some PGs are lost from the cluster.
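
The repair commands I ran were roughly the following (the device and
mount point names here are just examples; the real ones differ per
node):

    umount /osd0
    xfs_check /dev/sdb1
    xfs_repair -L /dev/sdb1     # -L zeroes the dirty XFS log
    mount /dev/sdb1 /osd0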

The details of the error log are as follows:

2013-07-09 16:45:31.940767 7f9a5a7ee780  0 ceph version 0.61.4
(1669132fcfc27d0c0b5e5bb93ade59d147e23404), process ceph-osd, pid 4640

2013-07-09 16:45:31.986070 7f9a5a7ee780  0 filestore(/osd0) mount
FIEMAP ioctl is supported and appears to work

2013-07-09 16:45:31.986084 7f9a5a7ee780  0 filestore(/osd0) mount
FIEMAP ioctl is disabled via 'filestore fiemap' config option

2013-07-09 16:45:31.986649 7f9a5a7ee780  0 filestore(/osd0) mount did
NOT detect btrfs

2013-07-09 16:45:32.001812 7f9a5a7ee780  0 filestore(/osd0) mount
syncfs(2) syscall fully supported (by glibc and kernel)

2013-07-09 16:45:32.001895 7f9a5a7ee780  0 filestore(/osd0) mount found snaps <>

2013-07-09 16:45:32.003550 7f9a5a7ee780 -1 filestore(/osd0) Error
initializing leveldb: Corruption: checksum mismatch

2013-07-09 16:45:32.003619 7f9a5a7ee780 -1 ** ERROR: error
converting store /osd0: (1) Operation not permitted

      Over the past few days I have tried several ways to fix this
problem and recover the OSDs, but all of them failed. I have also
ruled out xfs_check and xfs_repair as the cause of the issue. So I
need your help, or some advice on how to resolve this problem.

      At the same time, I have some questions about the Ceph cluster
here; maybe someone can help me or give me a detailed explanation.

1)       Are there any tools or command lines to move or recover a PG
from one OSD to another manually? Or is there any way to fix the
leveldb corruption?
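
One thing I was considering, but have not dared to run yet, is
leveldb's own repair routine via the Python bindings, roughly like
this (it assumes the py-leveldb module is available and that the
OSD's leveldb lives under current/omap in its data directory; both
are guesses on my part, please correct me if they are wrong):

    # stop the osd before touching its store; paths are assumptions
    service ceph stop osd.0
    python -c "import leveldb; leveldb.RepairDB('/osd0/current/omap')"
    service ceph start osd.0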

2)       I use RBD for the guests' block storage, and when I run the
CLI "ceph osd pg map image-name", I can see only one PG for the rbd
image. Does that mean an rbd image is stored in only one PG? And does
it follow that the maximum rbd image size is limited by the capacity
of a single disk?
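
To make the question concrete, this is the kind of check I mean (the
image and object names are only examples standing in for my real
ones):

    ceph osd map rbd vm-disk-1        # reports exactly one pg for that name
    rados -p rbd ls | grep '^rb\.'    # the rb.* data objects behind the image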

3)       Are there any ways or best practices to prevent the cluster
from losing PG data when two OSDs are down and out (pool size is 2)?
For example, would customizing the CRUSH map and rule set to split
the OSDs into different failure zones, similar to Swift's zone
concept, be a good way to do it?
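
What I had in mind for the failure zones is the usual CRUSH map edit
cycle, something like the following (the file names and the bucket
type are placeholders, not what my cluster currently uses):

    ceph osd getcrushmap -o crush.bin
    crushtool -d crush.bin -o crush.txt
    # edit crush.txt: group each node's osds under a host bucket, group
    # hosts into rack (or a custom "zone") buckets, and change the
    # replication rule to "step chooseleaf firstn 0 type rack"
    crushtool -c crush.txt -o crush.new
    ceph osd setcrushmap -i crush.new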

      Any idea or suggestion would be very much appreciated. Thanks.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
