* full_ratios - please explain?
@ 2015-02-18 14:39 Wyllys Ingersoll
  2015-02-18 15:05 ` Wido den Hollander
  0 siblings, 1 reply; 5+ messages in thread
From: Wyllys Ingersoll @ 2015-02-18 14:39 UTC (permalink / raw)
  To: ceph-devel

Can someone explain the interaction and effects of all of these
"full_ratio" parameters?  I haven't found any really good explanation of
how they affect the distribution of data once the cluster gets above the
"nearfull" and close to the "full" ratios.


mon_osd_full_ratio
mon_osd_nearfull_ratio

osd_backfill_full_ratio
osd_failsafe_full_ratio
osd_failsafe_nearfull_ratio

We have a cluster with about 144 OSDs (518 TB) and we are trying to get
it to 90% full for testing purposes.
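
For context, here is roughly where these live in ceph.conf (a minimal
sketch; the values are what I believe are the defaults, and .95 is indeed
what we have for mon_osd_full_ratio):

  [global]
    mon osd full ratio = .95
    mon osd nearfull ratio = .85

  [osd]
    osd backfill full ratio = .85
    osd failsafe full ratio = .97
    osd failsafe nearfull ratio = .90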

We've found that when some of the OSDs get above the mon_osd_full_ratio
value (.95 in our system), then it stops accepting any new data, even
though there is plenty of space left on other OSDs that are not yet even up
to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
bit, but eventually it becomes unbalanced and stops working again.

Is there a recommended combination of values to use that will allow the
cluster to continue accepting data and rebalancing correctly above 90%?

thanks,
 Wyllys Ingersoll


* Re: full_ratios - please explain?
  2015-02-18 14:39 full_ratios - please explain? Wyllys Ingersoll
@ 2015-02-18 15:05 ` Wido den Hollander
  2015-02-18 15:21   ` Wyllys Ingersoll
  0 siblings, 1 reply; 5+ messages in thread
From: Wido den Hollander @ 2015-02-18 15:05 UTC (permalink / raw)
  To: Wyllys Ingersoll, ceph-devel

On 18-02-15 15:39, Wyllys Ingersoll wrote:
> Can someone explain the interaction and effects of all of these
> "full_ratio" parameters?  I haven't found any really good explanation of
> how they affect the distribution of data once the cluster gets above the
> "nearfull" and close to the "full" ratios.
> 

When just ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster
goes from HEALTH_OK to HEALTH_WARN.

> 
> mon_osd_full_ratio
> mon_osd_nearfull_ratio
> 
> osd_backfill_full_ratio
> osd_failsafe_full_ratio
> osd_failsafe_nearfull_ratio
> 
> We have a cluster with about 144 OSDs (518 TB) and we are trying to get
> it to 90% full for testing purposes.
> 
> We've found that when some of the OSDs get above the mon_osd_full_ratio
> value (.95 in our system), then it stops accepting any new data, even
> though there is plenty of space left on other OSDs that are not yet even up
> to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
> bit, but eventually it becomes unbalanced and stops working again.
> 

Yes, that is because with Ceph, safety goes first. When even one OSD goes
over the full ratio, the whole cluster stops I/O.

CRUSH does not take OSD utilization into account when placing data, so
it's almost impossible to predict which I/O can continue.

Data safety and integrity is priority number 1. Full disks are a danger
to those priorities, so I/O is stopped.

> Is there a recommended combination of values to use that will allow the
> cluster to continue accepting data and rebalancing correctly above 90%?
> 

No, not with those values. Monitor your filesystems so that they stay below
those thresholds. If one OSD becomes too full you can weight it down in
CRUSH to move some data away from it.
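
For example, something along these lines (a sketch; the OSD id and the new
weight are made up, check 'ceph osd tree' for your real ones):

  ceph health detail                    # shows which OSDs are near full / full
  ceph osd tree                         # current CRUSH weights per OSD
  ceph osd crush reweight osd.12 3.2    # lower the weight of a too-full OSD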

> thanks,
>  Wyllys Ingersoll


-- 
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on


* Re: full_ratios - please explain?
  2015-02-18 15:05 ` Wido den Hollander
@ 2015-02-18 15:21   ` Wyllys Ingersoll
  2015-02-18 15:52     ` Sage Weil
  0 siblings, 1 reply; 5+ messages in thread
From: Wyllys Ingersoll @ 2015-02-18 15:21 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

Thanks!  More below inline...

On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander <wido@42on.com> wrote:
> On 18-02-15 15:39, Wyllys Ingersoll wrote:
>> Can someone explain the interaction and effects of all of these
>> "full_ratio" parameters?  I haven't found any really good explanation of
>> how they affect the distribution of data once the cluster gets above the
>> "nearfull" and close to the "full" ratios.
>>
>
> When just ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster
> goes from HEALTH_OK to HEALTH_WARN.
>
>>
>> mon_osd_full_ratio
>> mon_osd_nearfull_ratio
>>
>> osd_backfill_full_ratio
>> osd_failsafe_full_ratio
>> osd_failsafe_nearfull_ratio
>>
>> We have a cluster with about 144 OSDs (518 TB) and we are trying to get
>> it to 90% full for testing purposes.
>>
>> We've found that when some of the OSDs get above the mon_osd_full_ratio
>> value (.95 in our system), then it stops accepting any new data, even
>> though there is plenty of space left on other OSDs that are not yet even up
>> to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
>> bit, but eventually it becomes unbalanced and stops working again.
>>
>
> Yes, that is because with Ceph, safety goes first. When even one OSD goes
> over the full ratio, the whole cluster stops I/O.



Which full_ratio?  The problem is that there are at least 3
"full_ratios" - mon_osd_full_ratio, osd_failsafe_full_ratio, and
osd_backfill_full_ratio - how do they interact? What is the
consequence of having one be higher than the others?


It seems extreme that one full OSD out of potentially hundreds would
cause all I/O into the cluster to stop when there are literally tens or
hundreds of terabytes of space left on other, less-full OSDs.

The confusion for me (and probably for others) is the proliferation of
"full_ratio" parameters and a lack of clarity on how they all affect
the cluster health and ability to balance when things start to fill
up.


>
> CRUSH does not take OSD utilization into account when placing data, so
> it's almost impossible to predict which I/O can continue.
>
> Data safety and integrity is priority number 1. Full disks are a danger
> to those priorities, so I/O is stopped.


Understood, but one full disk out of hundreds should not cause the
entire system to stop accepting new data, or even stop balancing the
data that it already has, especially when there is still room to grow on
other OSDs.

If one disk reaches the "full_ratio", but 99 (or 999) others are still
well below that value, why doesn't the data get balanced out (assuming the
CRUSH map considers all OSDs equal and all the pools have similar
pg_num values)?


* Re: full_ratios - please explain?
  2015-02-18 15:21   ` Wyllys Ingersoll
@ 2015-02-18 15:52     ` Sage Weil
  2015-02-18 15:53       ` Wyllys Ingersoll
  0 siblings, 1 reply; 5+ messages in thread
From: Sage Weil @ 2015-02-18 15:52 UTC (permalink / raw)
  To: Wyllys Ingersoll; +Cc: Wido den Hollander, ceph-devel

On Wed, 18 Feb 2015, Wyllys Ingersoll wrote:
> Thanks!  More below inline...
> 
> On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander <wido@42on.com> wrote:
> > On 18-02-15 15:39, Wyllys Ingersoll wrote:
> >> Can someone explain the interaction and effects of all of these
> >> "full_ratio" parameters?  I haven't found any really good explanation of
> >> how they affect the distribution of data once the cluster gets above the
> >> "nearfull" and close to the "full" ratios.
> >>
> >
> > When just ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster
> > goes from HEALTH_OK to HEALTH_WARN.
> >
> >>
> >> mon_osd_full_ratio
> >> mon_osd_nearfull_ratio
> >>
> >> osd_backfill_full_ratio
> >> osd_failsafe_full_ratio
> >> osd_failsafe_nearfull_ratio
> >>
> >> We have a cluster with about 144 OSDs (518 TB) and we are trying to get
> >> it to 90% full for testing purposes.
> >>
> >> We've found that when some of the OSDs get above the mon_osd_full_ratio
> >> value (.95 in our system), then it stops accepting any new data, even
> >> though there is plenty of space left on other OSDs that are not yet even up
> >> to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
> >> bit, but eventually it becomes unbalanced and stops working again.
> >>
> >
> > Yes, that is because with Ceph, safety goes first. When even one OSD goes
> > over the full ratio, the whole cluster stops I/O.
> 
> 
> 
> Which full_ratio?  The problem is that there are at least 3
> "full_ratios" - mon_osd_full_ratio, osd_failsafe_full_ratio, and
> osd_backfill_full_ratio - how do they interact? What is the
> consequence of having one be higher than the others?

mon_osd_full_ratio (.95) ... when any OSD reaches this threshold the 
monitor marks the cluster as 'full' and client writes are not accepted.

mon_osd_nearfull_ratio (.85) ... when any OSD reaches this threshold the 
cluster goes HEALTH_WARN and calls out near-full OSDs.

osd_backfill_full_ratio (.85) ... when an OSD locally reaches this 
threshold it will refuse to migrate a PG to itself.  This prevents 
rebalancing or repair from overfilling an OSD.  It should be lower than
the full ratio.

The osd_failsafe_full_ratio (.97) is a final sanity check that makes the 
OSD throw out writes if it is really close to full.

It's bad news if an OSD fills up completely so we do what we can to 
prevent it.
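
So, roughly in increasing order:

  .85  mon_osd_nearfull_ratio   -> cluster goes HEALTH_WARN
  .85  osd_backfill_full_ratio  -> an OSD refuses to backfill PGs onto itself
  .95  mon_osd_full_ratio       -> cluster marked full, client writes blocked
  .97  osd_failsafe_full_ratio  -> the OSD itself rejects writes, last resort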

> It seems extreme that one full OSD out of potentially hundreds would
> cause all I/O into the cluster to stop when there are literally tens or
> hundreds of terabytes of space left on other, less-full OSDs.

Yes, but the nature of hash-based distribution is that you don't know 
where a write will go, so you don't want to let the cluster fill up.  85% 
is pretty conservative; you could increase it if you're comfortable.  Just 
be aware that file systems over 80% start to get very slow so it is a 
bad idea to run them this full anyway.
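
If you do want to push it for a test, something like this should do it (a
sketch from memory; double-check the command and option names against your
version):

  # raise the cluster-wide nearfull/full ratios stored in the pgmap
  ceph pg set_nearfull_ratio 0.90
  ceph pg set_full_ratio 0.96      # keep this below osd_failsafe_full_ratio
  # and let rebalancing fill OSDs a bit further than the .85 default
  ceph tell osd.* injectargs '--osd_backfill_full_ratio 0.92'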

> The confusion for me (and probably for others) is the proliferation of
> "full_ratio" parameters and a lack of clarity on how they all affect
> the cluster health and ability to balance when things start to fill
> up.
> 
> 
> >
> > CRUSH does not take OSD utilization into account when placing data, so
> > it's almost impossible to predict which I/O can continue.
> >
> > Data safety and integrity is priority number 1. Full disks are a danger
> > to those priorities, so I/O is stopped.
> 
> 
> Understood, but one full disk out of hundreds should not cause the
> entire system to stop accepting new data, or even stop balancing the
> data that it already has, especially when there is still room to grow on
> other OSDs.

The "proper" response to this currently is that if an OSD reaches the 
lower nearfull threshold the admin gets a warning and triggers some 
rebalancing.  That's why it's 10% lower than the actual full cutoff--there
is plenty of time to adjust weights and/or expand the cluster.
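
In practice that usually looks something like this (a sketch; the id and
thresholds are only examples):

  ceph health detail                      # lists the near-full OSDs
  ceph osd crush reweight osd.7 3.0       # shrink one OSD's weight by hand
  ceph osd reweight-by-utilization 110    # or bulk-reweight OSDs more than
                                          # 10% above the average utilization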

It's not an ideal approach, perhaps, but it's simple and works well 
enough.  And it's not clear that there is anything better we can do that
isn't also very complicated...

sage


* Re: full_ratios - please explain?
  2015-02-18 15:52     ` Sage Weil
@ 2015-02-18 15:53       ` Wyllys Ingersoll
  0 siblings, 0 replies; 5+ messages in thread
From: Wyllys Ingersoll @ 2015-02-18 15:53 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wido den Hollander, ceph-devel

OK, thanks for the clarifications!

-Wyllys


On Wed, Feb 18, 2015 at 10:52 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 18 Feb 2015, Wyllys Ingersoll wrote:
>> Thanks!  More below inline...
>>
>> On Wed, Feb 18, 2015 at 10:05 AM, Wido den Hollander <wido@42on.com> wrote:
>> > On 18-02-15 15:39, Wyllys Ingersoll wrote:
>> >> Can someone explain the interaction and effects of all of these
>> >> "full_ratio" parameters?  I haven't found any really good explanation of
>> >> how they affect the distribution of data once the cluster gets above the
>> >> "nearfull" and close to the "full" ratios.
>> >>
>> >
>> > When just ONE (1) OSD goes over the mon_osd_nearfull_ratio, the cluster
>> > goes from HEALTH_OK to HEALTH_WARN.
>> >
>> >>
>> >> mon_osd_full_ratio
>> >> mon_osd_nearfull_ratio
>> >>
>> >> osd_backfill_full_ratio
>> >> osd_failsafe_full_ratio
>> >> osd_failsafe_nearfull_ratio
>> >>
>> >> We have a cluster with about 144 OSDs (518 TB) and we are trying to get
>> >> it to 90% full for testing purposes.
>> >>
>> >> We've found that when some of the OSDs get above the mon_osd_full_ratio
>> >> value (.95 in our system), then it stops accepting any new data, even
>> >> though there is plenty of space left on other OSDs that are not yet even up
>> >> to 90%.  Tweaking the osd_failsafe ratios enabled data to move again for a
>> >> bit, but eventually it becomes unbalanced and stops working again.
>> >>
>> >
>> > Yes, that is because with Ceph, safety goes first. When even one OSD goes
>> > over the full ratio, the whole cluster stops I/O.
>>
>>
>>
>> Which full_ratio?  The problem is that there are at least 3
>> "full_ratios" - mon_osd_full_ratio, osd_failsafe_full_ratio, and
>> osd_backfill_full_ratio - how do they interact? What is the
>> consequence of having one be higher than the others?
>
> mon_osd_full_ratio (.95) ... when any OSD reaches this threshold the
> monitor marks the cluster as 'full' and client writes are not accepted.
>
> mon_osd_nearfull_ratio (.85) ... when any OSD reaches this threshold the
> cluster goes HEALTH_WARN and calls out near-full OSDs.
>
> osd_backfill_full_ratio (.85) ... when an OSD locally reaches this
> threshold it will refuse to migrate a PG to itself.  This prevents
> rebalancing or repair from overfilling an OSD.  It should be lower than
> the full ratio.
>
> The osd_failsafe_full_ratio (.97) is a final sanity check that makes the
> OSD throw out writes if it is really close to full.
>
> It's bad news if an OSD fills up completely so we do what we can to
> prevent it.
>
>> It seems extreme that one full OSD out of potentially hundreds would
>> cause all I/O into the cluster to stop when there are literally tens or
>> hundreds of terabytes of space left on other, less-full OSDs.
>
> Yes, but the nature of hash-based distribution is that you don't know
> where a write will go, so you don't want to let the cluster fill up.  85%
> is pretty conservative; you could increase it if you're comfortable.  Just
> be aware that file systems over 80% start to get very slow so it is a
> bad idea to run them this full anyway.
>
>> The confusion for me (and probably for others) is the proliferation of
>> "full_ratio" parameters and a lack of clarity on how they all affect
>> the cluster health and ability to balance when things start to fill
>> up.
>>
>>
>> >
>> > CRUSH does not take OSD utilization into account when placing data, so
>> > it's almost impossible to predict which I/O can continue.
>> >
>> > Data safety and integrity is priority number 1. Full disks are a danger
>> > to those priorities, so I/O is stopped.
>>
>>
>> Understood, but one full disk out of hundreds should not cause the
>> entire system to stop accepting new data, or even stop balancing the
>> data that it already has, especially when there is still room to grow on
>> other OSDs.
>
> The "proper" response to this currently is that if an OSD reaches the
> lower nearfull threshold the admin gets a warning and triggers some
> rebalancing.  That's why it's 10% lower than the actual full cutoff--there
> is plenty of time to adjust weights and/or expand the cluster.
>
> It's not an ideal approach, perhaps, but it's simple and works well
> enough.  And it's not clear that there is anything better we can do that
> isn't also very complicated...
>
> sage

