* Crash of almost full ceph
From: Vladimir Bashkirtsev @ 2012-08-04 10:37 UTC
  To: ceph-devel

Hello,

Yesterday I finally managed to screw up my installation of ceph! :)

My ceph was at 80% capacity. I rebooted one of the OSDs remotely and
managed to screw up its fstab. The host failed to come back up, and
while I was driving from home to my office ceph took recovery action.
That filled another OSD completely and it failed. Ceph continued to
recover and killed the other OSDs in the same fashion. Not good.
Attempts to restart the OSDs were in vain: they were unable to test
for xattrs because the file system was full, and only growing the
file system allowed them to restart.

This leads me to a question/proposal: is there a feature which allows
ceph to halt the recovery process if any live OSD exceeds, say, 95%
capacity? It is quite distinct from what is considered a full or
near-full OSD: when an OSD is near full or full, writes are still
coming from clients, and the inability to write leads to client
lock-up. But halting recovery should allow clients to continue even
though ceph is in a degraded state. It does not make sense to me to
let ceph go from a degraded state to a crashed state when no client
needs it.

Regards,
Vladimir


* Re: Crash of almost full ceph
From: Gregory Farnum @ 2012-08-06 16:25 UTC
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On Sat, Aug 4, 2012 at 3:37 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> Hello,
>
> Yesterday I finally managed to screw up my installation of ceph! :)
>
> My ceph was at 80% capacity. I rebooted one of the OSDs remotely and
> managed to screw up its fstab. The host failed to come back up, and while
> I was driving from home to my office ceph took recovery action. That
> filled another OSD completely and it failed. Ceph continued to recover
> and killed the other OSDs in the same fashion. Not good. Attempts to
> restart the OSDs were in vain: they were unable to test for xattrs
> because the file system was full, and only growing the file system
> allowed them to restart.
>
> This leads me to a question/proposal: is there a feature which allows
> ceph to halt the recovery process if any live OSD exceeds, say, 95%
> capacity? It is quite distinct from what is considered a full or
> near-full OSD: when an OSD is near full or full, writes are still coming
> from clients, and the inability to write leads to client lock-up. But
> halting recovery should allow clients to continue even though ceph is in
> a degraded state. It does not make sense to me to let ceph go from a
> degraded state to a crashed state when no client needs it.

There is not yet any such feature, no — dealing with full systems is
notoriously hard and we haven't come up with a great solution yet. One
thing you can do is experiment with the "mon_osd_min_in_ratio"
parameter, which prevents the monitors from marking out more than a
certain percentage of the OSD cluster (and without something being
marked out, no data will be moved around). If you don't want the
cluster to automatically mark any OSDs out, you can also set the
"mon_osd_down_out_interval" to zero.
-Greg


* Re: Crash of almost full ceph
From: Vladimir Bashkirtsev @ 2012-08-06 16:39 UTC
  To: Gregory Farnum; +Cc: ceph-devel

On 07/08/12 01:55, Gregory Farnum wrote:
> There is not yet any such feature, no — dealing with full systems is 
> notoriously hard and we haven't come up with a great solution yet. One 
> thing you can do is experiment with the "mon_osd_min_in_ratio" 
> parameter, which prevents the monitors from marking out more than a 
> certain percentage of the OSD cluster (and without something being 
> marked out, no data will be moved around). If you don't want the 
> cluster to automatically mark any OSDs out, you can also set the 
> "mon_osd_down_out_interval" to zero. -Greg 
But it would be a good idea to have such a feature as a fail-safe
device. The settings you mention may help a bit when the cluster is
almost full and there is a good number of OSDs, but a hard refusal by
ceph to run recovery if ANY live OSD is over a certain limit is quite
unambiguous. If recovery cannot proceed because one OSD is at capacity,
then it should be handed over to the admin to decide what to do:
rebalance the CRUSH map, add a new OSD, remove some objects. Certainly
ceph should not be able to fill up an OSD with activity which is not
required (though desired) by end clients.
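
To give a rough idea of the check I mean (a hypothetical sketch only;
nothing like this exists in ceph today):

#include <vector>

// Hypothetical fail-safe: refuse any recovery activity while a live OSD
// sits above a hard limit; from there it is the admin's call, not ceph's.
bool recovery_allowed(const std::vector<double>& live_osd_usage,
                      double hard_limit = 0.95)
{
    for (double usage : live_osd_usage) {
        if (usage >= hard_limit)
            return false;   // one OSD at capacity halts all recovery
    }
    return true;
}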


* Re: Crash of almost full ceph
From: Gregory Farnum @ 2012-08-06 16:53 UTC
  To: Vladimir Bashkirtsev; +Cc: ceph-devel

On Mon, Aug 6, 2012 at 9:39 AM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
> On 07/08/12 01:55, Gregory Farnum wrote:
>>
>> There is not yet any such feature, no — dealing with full systems is
>> notoriously hard and we haven't come up with a great solution yet. One thing
>> you can do is experiment with the "mon_osd_min_in_ratio" parameter, which
>> prevents the monitors from marking out more than a certain percentage of the
>> OSD cluster (and without something being marked out, no data will be moved
>> around). If you don't want the cluster to automatically mark any OSDs out,
>> you can also set the "mon_osd_down_out_interval" to zero. -Greg
>
> But it would be a good idea to have such a feature as a fail-safe device.
> The settings you mention may help a bit when the cluster is almost full and
> there is a good number of OSDs, but a hard refusal by ceph to run recovery
> if ANY live OSD is over a certain limit is quite unambiguous. If recovery
> cannot proceed because one OSD is at capacity, then it should be handed over
> to the admin to decide what to do: rebalance the CRUSH map, add a new OSD,
> remove some objects. Certainly ceph should not be able to fill up an OSD
> with activity which is not required (though desired) by end clients.

Oh, I see what you're saying. Given how distributed Ceph is, this is
actually harder than it sounds — we could get closer by refusing to
mark OSDs out whenever the full list is non-empty, but we could not,
for instance, do partial recovery and then stop once an OSD gets full.
In any case, I've filed a bug (http://tracker.newdream.net/issues/2911)
since this isn't something I can hack together right now. :)
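
Roughly, that mark-out guard amounts to nothing more than this (an
illustrative sketch, not the actual monitor code):

#include <set>

// Illustrative only: while any OSD is flagged full, refuse the down -> out
// transition, so the monitors don't trigger any further data movement.
bool may_mark_out(const std::set<int>& full_osds)
{
    return full_osds.empty();
}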
-Greg


* Re: Crash of almost full ceph
From: Gregory Farnum @ 2012-08-07 19:35 UTC
  To: Vladimir Bashkirtsev, Samuel Just; +Cc: ceph-devel

On Mon, Aug 6, 2012 at 11:23 PM, Vladimir Bashkirtsev
<vladimir@bashkirtsev.com> wrote:
>> Oh, I see what you're saying. Given how distributed Ceph is, this is
>> actually harder than it sounds — we could get closer by refusing to
>> mark OSDs out whenever the full list is non-empty, but we could not,
>> for instance, do partial recovery and then stop once an OSD gets full.
>> In any case, I've filed a bug (http://tracker.newdream.net/issues/2911)
>> since this isn't something I can hack together right now. :)
>> -Greg
>
> Refusing to mark OSDs out when the full list is non-empty will definitely
> be a big step in the right direction. It will prevent the cascading failure
> I described originally. But on the other hand, I think the way around the
> complication of the distributed nature is to have the OSD itself refuse
> backfill when it has reached the full state, regardless of where the
> backfill is coming from. In this case backfill will stall while client
> activity is still handled as normal. Or does the OSD not distinguish
> backfill requests from other OSDs from client requests? The question, of
> course, is what the other OSD will do when its peer refuses to accept
> backfill while it is still marked as in. Some stand-off period is required
> on the OSD which sends the backfill. So it looks like the OSD should have
> two things:
> 1. When receiving backfill, check if already full and if so drop the
> request, don't ack back.
> 2. When sending backfill, if the ack does not arrive in a reasonable time,
> retry later after some time has passed (something tells me that such
> functionality is already in place).
>
> I should admit I have not read the ceph code, but with all the experience
> I have gained with ceph, it seems it should be fairly easy to implement.

That's one possibility, but it has a lot of side effects which could
be troubling — for instance, it means that pg_temp entries go from
being around until backfill completes, to being around until the OSD
is no longer full and backfill completes. The chances for map growth
etc. are high and worrying.
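
For concreteness, the receive/retry behaviour quoted above boils down
to roughly this (a hypothetical sketch; none of these names or timings
come from the ceph code, and real backfill handling is far more
involved):

#include <chrono>

// 1. Receiving side: a full OSD drops the backfill request and sends no
//    ack, while continuing to serve ordinary client I/O.
bool accept_backfill(double usage_ratio, double full_ratio)
{
    return usage_ratio < full_ratio;   // false => drop silently, no ack
}

// 2. Sending side: if the ack never arrives, retry only after a stand-off
//    period so the full peer is not hammered.
struct BackfillSender {
    std::chrono::steady_clock::time_point sent_at;
    std::chrono::seconds ack_timeout{30};
    std::chrono::seconds standoff{300};

    bool should_retry(std::chrono::steady_clock::time_point now,
                      bool ack_received) const
    {
        if (ack_received)
            return false;              // acked: nothing to resend
        return now - sent_at > ack_timeout + standoff;
    }
};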

Do you have any thoughts, Sam?

