* OSD cascading crash during recovery with corrupted replica
@ 2015-10-15  0:20 GuangYang
From: GuangYang @ 2015-10-15  0:20 UTC
  To: sjust, dzafman; +Cc: ceph-devel

Hi Sam/David,
We have come across this problem a couple of times, and it is extremely painful to work around via operational steps. I would like to work on a patch, but before I start, it would be nice to hear your suggestions.

The problem is:
On an erasure-coded pool, when an object has a corrupted shard and is also a recovery candidate, the primary currently crashes while trying to recover the object. The crash then repeats as each of the other OSDs in the acting set is promoted to primary in turn, until the PG goes down.

Solution:
I think one way to fix it is to put the object back on the recovery waiting list together with the corruption information (add the corrupted shard to peering_missing), and then let it be picked up by the next round of recovery.
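
To make the idea concrete, here is a minimal sketch in Python. It is a toy model, not Ceph code: RecoveryState, ShardCorrupt, read_shards and recover_round are hypothetical names invented for illustration, and the "missing" map only stands in for the per-shard missing information mentioned above. It assumes a read can report which shard is corrupt, and that any k good shards are enough to rebuild an object.

```python
from collections import deque
from dataclasses import dataclass, field


class ShardCorrupt(Exception):
    """Hypothetical error raised when a shard fails its integrity check."""

    def __init__(self, shard):
        super().__init__(f"shard {shard} is corrupt")
        self.shard = shard


@dataclass
class RecoveryState:
    # object name -> shard ids known to be missing or unusable
    missing: dict = field(default_factory=dict)
    # objects waiting for a recovery attempt
    waiting: deque = field(default_factory=deque)
    # objects with fewer than k usable shards left
    unrecoverable: list = field(default_factory=list)


def read_shards(obj, shards, corrupt, exclude):
    """Read every shard not in `exclude`; raise on the first corrupt one."""
    data = {}
    for shard in shards:
        if shard in exclude:
            continue
        if (obj, shard) in corrupt:
            raise ShardCorrupt(shard)
        data[shard] = f"{obj}:{shard}"
    return data


def recover_round(state, shards, corrupt, k):
    """Run one recovery round and return the objects recovered in it."""
    recovered = []
    # Only visit objects queued at the start of the round, so an object
    # re-queued below is retried in the next round, not this one.
    for _ in range(len(state.waiting)):
        obj = state.waiting.popleft()
        exclude = state.missing.setdefault(obj, set())
        try:
            data = read_shards(obj, shards, corrupt, exclude)
        except ShardCorrupt as err:
            # Instead of crashing, record the bad shard and re-queue.
            exclude.add(err.shard)
            state.waiting.append(obj)
            continue
        if len(data) >= k:
            recovered.append(obj)
        else:
            state.unrecoverable.append(obj)
    return recovered


# Example: 6 shards, any 4 rebuild an object; obj1 already misses shard 5
# and has silent corruption on shard 2.
state = RecoveryState(missing={"obj1": {5}}, waiting=deque(["obj1", "obj2"]))
first = recover_round(state, range(6), {("obj1", 2)}, k=4)
second = recover_round(state, range(6), {("obj1", 2)}, k=4)
```

In this model the first round recovers obj2 and re-queues obj1 with shard 2 marked as bad; the second round skips shards 2 and 5 and rebuilds obj1 from the remaining four. An object that ends up with fewer than k usable shards is set aside rather than retried forever.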

Does that sound like a good approach to pursue? Do you have any other suggestions I should look into?

Thanks,
Guang
