* osd full still writing data while cluster recovering
@ 2017-06-28  9:57 handong He
  2017-06-28 16:37 ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: handong He @ 2017-06-28  9:57 UTC (permalink / raw)
  To: ceph-devel

Hello,

I'm using ceph-jewel 10.2.7 for some tests.
I discovered that when an OSD is full (e.g. full_ratio=0.95), client
writes fail, which is expected. But a full OSD does not stop a recovering
cluster from writing data to it, which pushes the OSD's usage from 95% to
100%. When that happens, the OSD goes down because there is no space left
and it cannot start up anymore.

So the question is: can the cluster automatically stop recovery when an
OSD is getting full, without setting the norecover flag manually? Or is
this already fixed in the latest version?
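
(For now the only workaround I know of is to set the flags by hand, roughly:
# ceph osd set norecover
# ceph osd set nobackfill
and unset them with 'ceph osd unset ...' once there is free space again.)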

Consider this situation: a half-full cluster with many OSDs. Through some
bad luck in the middle of the night (a network link down, a server down, or
something else), some OSDs go down or out and trigger cluster recovery,
which drives some other healthy OSDs' usage to 100% (I have little
operations experience, so please correct me if I'm wrong). Unluckily, this
spreads like a plague and takes many more OSDs down. It may be easy to fix
one OSD that went down like that, but it is a disaster to fix 10+ OSDs at
100% space used.


Here are my test environment and steps:

Three nodes running in VMs; each node has one monitor and one OSD (a 10 GB
HDD, for convenience).
The ceph.conf is basic.
The pool size is set to 2.
Data is written to the OSDs with 'rados bench'.
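
The write load for the steps below was generated with a command like this
(the pool name and duration are only placeholders):
# rados bench -p rbd 600 write --no-cleanup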

1. Set the OSD full ratios:
# ceph pg set_full_ratio 0.8
# ceph pg set_nearfull_ratio 0.7

2. Write data; when an OSD is close to full, stop writing and mark one
OSD out:
# ceph osd out 0

3. Wait for recovery to finish, then run:
# ceph osd df
# ceph osd tree

We can see that the other OSDs are down.

Thanks and Best Regards!

He Handong

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: osd full still writing data while cluster recovering
  2017-06-28  9:57 osd full still writing data while cluster recovering handong He
@ 2017-06-28 16:37 ` Sage Weil
  2017-06-28 17:04   ` David Zafman
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2017-06-28 16:37 UTC (permalink / raw)
  To: handong He, dzafman; +Cc: ceph-devel


On Wed, 28 Jun 2017, handong He wrote:
> Hello,
> 
> I'm using ceph-jewel 10.2.7 for some tests.
> I discovered that when an OSD is full (e.g. full_ratio=0.95), client
> writes fail, which is expected. But a full OSD does not stop a recovering
> cluster from writing data to it, which pushes the OSD's usage from 95% to
> 100%. When that happens, the OSD goes down because there is no space left
> and it cannot start up anymore.
> 
> So the question is: can the cluster automatically stop recovery when an
> OSD is getting full, without setting the norecover flag manually? Or is
> this already fixed in the latest version?
> 
> Consider this situation: a half-full cluster with many OSDs. Through some
> bad luck in the middle of the night (a network link down, a server down, or
> something else), some OSDs go down or out and trigger cluster recovery,
> which drives some other healthy OSDs' usage to 100% (I have little
> operations experience, so please correct me if I'm wrong). Unluckily, this
> spreads like a plague and takes many more OSDs down. It may be easy to fix
> one OSD that went down like that, but it is a disaster to fix 10+ OSDs at
> 100% space used.

There are additional thresholds for stopping backfill and (later) a 
failsafe to prevent any writes, but you're not the first one to see these 
not work properly in jewel.  David recently made a ton of 
improvements here in master for luminous, but I'm not sure what the 
status is for backporting some of the critical pieces to jewel...
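
(For reference, on a Jewel OSD you can check where those extra thresholds
sit with something like
# ceph daemon osd.0 config get osd_backfill_full_ratio
# ceph daemon osd.0 config get osd_failsafe_full_ratio
and lower them at runtime via 'ceph tell osd.* injectargs ...'; the option
names here are from memory, so double-check them against your build.)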

sage

 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: osd full still writing data while cluster recovering
  2017-06-28 16:37 ` Sage Weil
@ 2017-06-28 17:04   ` David Zafman
  2017-06-28 17:21     ` Nathan Cutler
  0 siblings, 1 reply; 5+ messages in thread
From: David Zafman @ 2017-06-28 17:04 UTC (permalink / raw)
  To: Sage Weil, handong He; +Cc: ceph-devel


Luminous has the more complex fix which prevents recovery/backfill from 
filling up a disk.

In your 3-node test cluster with 1 OSD out, you have 66% of your storage 
available with up to 80% in use, so you are out of space. In Luminous, not 
only would new writes be blocked, but PGs would also be marked 
"backfill_toofull" or "recovery_toofull".


A portion of the Luminous changes are in a pending Jewel backport.  It 
includes code that warns about uneven OSD usage and increases 
mon_osd_min_in_ratio to .75 (75%).

In a more realistic Jewel cluster you can increase the value of 
mon_osd_min_in_ratio to what is best for your situation.  This will 
prevent too many OSDs from being marked out.
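
For example, to raise it at runtime on the monitors (the value here is just
an illustration; pick what matches your failure domains):
# ceph tell mon.* injectargs '--mon_osd_min_in_ratio 0.75'
or set it persistently under [mon] in ceph.conf as
'mon osd min in ratio = 0.75'.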

David


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: osd full still writing data while cluster recovering
  2017-06-28 17:04   ` David Zafman
@ 2017-06-28 17:21     ` Nathan Cutler
  2017-06-29  1:43       ` handong He
  0 siblings, 1 reply; 5+ messages in thread
From: Nathan Cutler @ 2017-06-28 17:21 UTC (permalink / raw)
  To: David Zafman, Sage Weil, handong He; +Cc: ceph-devel



On 06/28/2017 07:04 PM, David Zafman wrote:
> 
> Luminous has the more complex fix which prevents recovery/backfill from 
> filling up a disk.
> 
> In your 3-node test cluster with 1 OSD out, you have 66% of your storage 
> available with up to 80% in use, so you are out of space. In Luminous, not 
> only would new writes be blocked, but PGs would also be marked 
> "backfill_toofull" or "recovery_toofull".
> 
> 
> A portion of the Luminous changes are in a pending Jewel backport.

That's https://github.com/ceph/ceph/pull/15050 in case anyone was wondering.

> It 
> includes code that warns about uneven OSD usage and increases 
> mon_osd_min_in_ratio to .75 (75%).
> 
> In a more realistic Jewel cluster you can increase the value of 
> mon_osd_min_in_ratio to what is best for your situation.  This will 
> prevent too many OSDs from being marked out.
> 
> David
> 
> 
> On 6/28/17 9:37 AM, Sage Weil wrote:
>> On Wed, 28 Jun 2017, handong He wrote:
>>> Hello,
>>>
>>> I'm using ceph-jewel 10.2.7 for some test.
>>> Discovered that when an osd is full(like full_ratio=0.95), client
>>> write failed, which is normal. But a full osd cannot stop a recovering
>>> cluster writing data, make osd used ratio from 95% to100%. When that
>>> happen, osd will be down for no space left and cannot startup anymore.
>>>
>>> So the question is : can the cluster auto stop recovering while osd is
>>> reaching full without setting the norecover flag manually?  Or is it
>>> already fix in the latest version?
>>>
>>> Consider this situation: a half-full cluster with many osds. For some
>>> bad luck(netlink down| server down | or others) in midnight, some osds
>>> down|out and trigger cluster recovery, makes some  other health osds'
>>> used% to 100% (experienceless in operation and maintenance, please fix
>>> me if i'm wrong). Unluckly, this just like a plague and make much more
>>> osds down. It maybe easy to fix one down osd like that, but a disaster
>>> to fix 10+ osds with 100% space used.
>> There are additional thresholds for stopping backfill and (later) a
>> failsafe to prevent any writes, but you're not hte first one to see these
>> not work properly in jewel.  David recently made a ton of
>> improvements here in master for luminous, but I'm not sure what the
>> status is for backporting some of the critical pieces to jewel...
>>
>> sage
>>
>>> here is my test environment and steps:
>>>
>>> three nodes, each node has one monitor and one osd(10G hdd for
>>> convenient), running in vm.
>>> ceph conf is basic.
>>> pool size set to 2.
>>> using 'rados bench' writing data to osds.
>>>
>>> 1. exec command  to set osd full ratio:
>>> # ceph pg set_full_ratio 0.8
>>> # ceph pg set nearfull_ratio 0.7
>>>
>>> 2. writing data, when an osd is reaching full, stop writing and mark
>>> out one osd with command:
>>> # ceph osd out 0
>>>
>>> 3. waiting for cluster recovering finished , and exec command:
>>> # ceph osd df
>>> # ceph osd tree
>>>
>>> we can find that other osds is down.
>>>
>>> Thanks and Best Regards!
>>>
>>> He Handong
>>> -- 
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
> 
> -- 
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

-- 
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: osd full still writing data while cluster recovering
  2017-06-28 17:21     ` Nathan Cutler
@ 2017-06-29  1:43       ` handong He
  0 siblings, 0 replies; 5+ messages in thread
From: handong He @ 2017-06-29  1:43 UTC (permalink / raw)
  To: Nathan Cutler, David Zafman, Sage Weil; +Cc: ceph-devel

Thanks for the reply, it helps a lot.

I will try Luminous later and keep following this issue in Jewel.


Thanks,
Handong

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2017-06-29  1:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-06-28  9:57 osd full still writing data while cluster recovering handong He
2017-06-28 16:37 ` Sage Weil
2017-06-28 17:04   ` David Zafman
2017-06-28 17:21     ` Nathan Cutler
2017-06-29  1:43       ` handong He
