* osd full still writing data while cluster recovering
From: handong He @ 2017-06-28 9:57 UTC (permalink / raw)
To: ceph-devel
Hello,
I'm using ceph-jewel 10.2.7 for some tests.
I discovered that when an OSD is full (e.g. full_ratio=0.95), client
writes fail, which is normal. But a full OSD does not stop a recovering
cluster from writing data, which pushes the OSD's usage from 95% to 100%.
When that happens, the OSD goes down because no space is left, and it
cannot start up anymore.
So the question is: can the cluster automatically stop recovery when an
OSD is approaching full, without setting the norecover flag manually? Or
is this already fixed in the latest version?
Consider this situation: a half-full cluster with many OSDs. Through some
bad luck in the middle of the night (network down, server down, or
something else), some OSDs go down or out and trigger cluster recovery,
which drives some other healthy OSDs' usage to 100% (I'm inexperienced
in operations and maintenance, so please correct me if I'm wrong).
Unluckily, this spreads like a plague and takes down many more OSDs. It
may be easy to fix one such down OSD, but it is a disaster to fix 10+
OSDs at 100% space used.
Here are my test environment and steps:
Three nodes, each with one monitor and one OSD (a 10G HDD, for
convenience), running in VMs.
The ceph.conf is basic.
Pool size is set to 2.
Data is written to the OSDs with 'rados bench'.
1. Set the OSD full ratios:
# ceph pg set_full_ratio 0.8
# ceph pg set_nearfull_ratio 0.7
2. Write data; when an OSD is approaching full, stop writing and mark
one OSD out:
# ceph osd out 0
3. Wait for the cluster to finish recovering, then run:
# ceph osd df
# ceph osd tree
We can see that the other OSDs are down.
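The runaway effect in step 2 can be sketched with a toy back-of-the-envelope
model (plain Python, not Ceph code; it assumes the same total data is
rebalanced perfectly evenly over the surviving OSDs):

```python
def usage_after_out(n_osds: int, used_ratio: float, n_out: int) -> float:
    """Per-OSD used ratio after n_out OSDs are marked out, assuming the
    same total data is rebalanced evenly over the survivors."""
    survivors = n_osds - n_out
    if survivors <= 0:
        raise ValueError("no OSDs left to hold the data")
    return used_ratio * n_osds / survivors

# 3 OSDs written up to the 0.8 full ratio, then 1 marked out: each
# survivor would need 120% of its capacity, so it fills to 100% and
# goes down, matching what step 3 shows.
print(round(usage_after_out(3, 0.8, 1), 3))  # -> 1.2
```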
Thanks and Best Regards!
He Handong
* Re: osd full still writing data while cluster recovering
From: Sage Weil @ 2017-06-28 16:37 UTC (permalink / raw)
To: handong He, dzafman; +Cc: ceph-devel
On Wed, 28 Jun 2017, handong He wrote:
> [...]
There are additional thresholds for stopping backfill and (later) a
failsafe to prevent any writes, but you're not the first one to see these
not work properly in jewel. David recently made a ton of
improvements here in master for luminous, but I'm not sure what the
status is for backporting some of the critical pieces to jewel...
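The thresholds referred to here are OSD config options; a rough ceph.conf
sketch (option names as in jewel-era Ceph, values are the jewel defaults
shown for illustration, not recommendations):

```ini
[osd]
# stop backfill to an OSD above this usage ratio (jewel default: 0.85)
osd backfill full ratio = 0.85
# hard stop: the OSD refuses any writes above this ratio (jewel default: 0.97)
osd failsafe full ratio = 0.97
```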
sage
> here is my test environment and steps:
>
> three nodes, each node has one monitor and one osd(10G hdd for
> convenient), running in vm.
> ceph conf is basic.
> pool size set to 2.
> using 'rados bench' writing data to osds.
>
> 1. exec command to set osd full ratio:
> # ceph pg set_full_ratio 0.8
> # ceph pg set nearfull_ratio 0.7
>
> 2. writing data, when an osd is reaching full, stop writing and mark
> out one osd with command:
> # ceph osd out 0
>
> 3. waiting for cluster recovering finished , and exec command:
> # ceph osd df
> # ceph osd tree
>
> we can find that other osds is down.
>
> Thanks and Best Regards!
>
> He Handong
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: osd full still writing data while cluster recovering
From: David Zafman @ 2017-06-28 17:04 UTC (permalink / raw)
To: Sage Weil, handong He; +Cc: ceph-devel
Luminous has the more complex fix which prevents recovery/backfill from
filling up a disk.
In your 3 node test cluster with 1 osd out you have 66% of your storage
available with up to 80% in use, so you are out of space. In Luminous
not only would new writes be blocked but PGs would be marked
"backfill_toofull" or "recovery_toofull."
A portion of the Luminous changes are in a pending Jewel backport. It
includes code that warns about uneven OSD usage and increases
mon_osd_min_in_ratio to .75 (75%).
In a more realistic Jewel cluster you can increase the value of
mon_osd_min_in_ratio to what is best for your situation. This will
prevent too many OSDs from being marked out.
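The capacity arithmetic above can be checked directly (a sketch in plain
Python using the 3 x 10G numbers from the test setup earlier in the
thread; not Ceph code):

```python
# 3 OSDs of 10 GB each; data written up to the 0.8 full ratio, then
# one OSD marked out leaves 2/3 of the raw capacity available.
raw_total = 3 * 10.0                 # GB of raw storage
used = 0.8 * raw_total               # 24 GB of raw data to place
available = raw_total * 2 / 3        # 20 GB left after one OSD is out

print(round(used, 1), round(available, 1))  # -> 24.0 20.0
# used > available: recovery cannot fit the data, so the surviving
# OSDs fill to 100% instead of stopping cleanly (what Luminous's
# backfill_toofull / recovery_toofull states prevent).
assert used > available
```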
David
On 6/28/17 9:37 AM, Sage Weil wrote:
> [...]
* Re: osd full still writing data while cluster recovering
From: Nathan Cutler @ 2017-06-28 17:21 UTC (permalink / raw)
To: David Zafman, Sage Weil, handong He; +Cc: ceph-devel
On 06/28/2017 07:04 PM, David Zafman wrote:
> [...]
> A portion of the Luminous changes are in a pending Jewel backport.
That's https://github.com/ceph/ceph/pull/15050 in case anyone was wondering.
--
Nathan Cutler
Software Engineer Distributed Storage
SUSE LINUX, s.r.o.
Tel.: +420 284 084 037
* Re: osd full still writing data while cluster recovering
From: handong He @ 2017-06-29 1:43 UTC (permalink / raw)
To: Nathan Cutler, David Zafman, Sage Weil; +Cc: ceph-devel
Thanks for the replies; they help a lot.
I will try Luminous later and keep tracking this issue in jewel.
Thanks,
Handong
2017-06-29 1:21 GMT+08:00 Nathan Cutler <ncutler@suse.cz>:
> [...]