* global backfill reservation?
@ 2017-05-12 18:53 Sage Weil
  2017-05-12 20:49 ` Peter Maloney
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Sage Weil @ 2017-05-12 18:53 UTC (permalink / raw)
  To: ceph-devel

A common complaint is that recovery/backfill/rebalancing has a high 
impact.  That isn't news.  What I realized this week after hearing more 
operators describe their workaround is that everybody's workaround is 
roughly the same: make small changes to the crush map so that only a small 
number of PGs are backfilling at a time.  In retrospect it seems obvious, 
but the problem is that our backfill throttling is per-OSD: the "slowest" 
we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one 
replica due to separate reservation thresholds to avoid deadlock.)  That 
means that every OSD is impacted.  Doing fewer PGs doesn't make the 
recovery vs client scheduling better, but it means it affects fewer PGs 
and fewer client IOs and the net observed impact is smaller.

Anyway, in short, I think we need to be able to set a *global* threshold 
of "no more than X % of OSDs should be backfilling at a time," which is 
impossible given the current reservation approach.

This could be done naively by having OSDs reserve a slot via the mon or 
mgr.  If we only did it for backfill the impact should be minimal (those 
are big slow long-running operations already).

I think you can *almost* do it cleverly by inferring the set of PGs that 
have to backfill by pg_temp.  However, that doesn't take any priority or 
stuck PGs into consideration.

Anyway, the naive thing probably isn't so bad...

1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with 
one or more backfilling PGs).

2) For the first step of the backfill (recovery?) reservation, OSDs ask 
the mgr for a reservation slot.  The reservation is (pgid,interval epoch) 
so that the mgr can throw out the reservation request without needing an 
explicit cancellation if there is an interval change.

3) mgr grants as many reservations as it can without (backfilling + 
grants) > whatever the max is.

We can set the max with a global tunable like

 max_osd_backfilling_ratio = .3

so that only 30% of the osds can be backfilling at once?
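
For reference, an operator can approximate the count in 1) from the
command line today.  This is only a sketch: it assumes jq is installed,
and the pg dump JSON field names are from memory, so they may differ
between releases:

 # count the distinct OSDs that appear in the acting set of a
 # backfilling PG, and compare against the total OSD count
 total=$(ceph osd ls | wc -l)
 busy=$(ceph pg dump pgs_brief -f json 2>/dev/null \
     | jq -r '.[] | select(.state | contains("backfilling")) | .acting[]' \
     | sort -un | wc -l)
 echo "$busy of $total OSDs have a backfilling PG"

A max_osd_backfilling_ratio would just mean the mgr keeps busy/total
below the configured cap when granting reservations.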

sage


* Re: global backfill reservation?
  2017-05-12 18:53 global backfill reservation? Sage Weil
@ 2017-05-12 20:49 ` Peter Maloney
  2017-05-15 22:02   ` Gregory Farnum
  2017-05-13 16:55 ` Dan van der Ster
  2017-05-20 14:24 ` Ning Yao
  2 siblings, 1 reply; 14+ messages in thread
From: Peter Maloney @ 2017-05-12 20:49 UTC (permalink / raw)
  To: Sage Weil, ceph-devel

On 05/12/17 20:53, Sage Weil wrote:
> A common complaint is that recovery/backfill/rebalancing has a high 
> impact.  That isn't news.  What I realized this week after hearing more 
> operators describe their workaround is that everybody's workaround is 
> roughly the same: make small changes to the crush map so that only a small 
> number of PGs are backfilling at a time.  In retrospect it seems obvious, 
> but the problem is that our backfill throttling is per-OSD: the "slowest" 
> we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one 
> replica due to separate reservation thresholds to avoid deadlock.)  That 
> means that every OSD is impacted.  Doing fewer PGs doesn't make the 
> recovery vs client scheduling better, but it means it affects fewer PGs 
> and fewer client IOs and the net observed impact is smaller.
>
> Anyway, in short, I think we need to be able to set a *global* threshold 
> of "no more than X % of OSDs should be backfilling at a time," which is 
> impossible given the current reservation appoach.
>
> This could be done naively by having OSDs reserve a slot via the mon or 
> mgr.  If we only did it for backfill the impact should be minimal (those 
> are big slow long-running operations already).
>
> I think you can *almost* do it cleverly by inferring the set of PGs that 
> have to backfill by pg_temp.  However, that doesn't take any priority or 
> stuck PGs into consideration.
>
> Anyway, the naive thing probably isn't so bad...
>
> 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with 
> one or more backfilling PGs).
>
> 2) For the first step of the backfill (recovery?) reservation, OSDs ask 
> the mgr for a reservation slot.  The reservation is (pgid,interval epoch) 
> so that the mgr can throw out the reservation require without needing an 
> explicit cancellation if there is an interval change.
>
> 3) mgr grants as many reservations as it can without (backfilling + 
> grants) > whatever the max is.
>
> We can set the max with a global tunable like
>
>  max_osd_backfilling_ratio = .3
>
> so that only 30% of the osds can be backfilling at once?
>
> sage

I think the biggest problem is not how many OSDs are busy, but that any
single osd is overloaded long enough for a human user to call it laggy
(eg. "ls" takes 5s because of blocked requests). A setting to say you
want all osds 30% busy would be better than saying you want 30% of your
osds overloaded and 70% idle (where another word for idle is wasted).
The problems with clients seem to happen when they hit an overly busy
osd, rather than because many are moderately busy. (Is the future QoS
code supposed to handle this, for recovery [and scrub, snap trim,
flatten, rbd resize, etc.] not just clients? And I find resize [shrink
with snaps present] and flatten to be the worst since there appears to
be no config options to slow them down)

I always have max backfills = 1 and recovery max active = 1, but with my
small cluster (3 nodes and 36 osds so far), I find that letting it go
fully parallel is better than trying to make small changes one at a
time. I have tested things like running fio or xfs_fsr to defrag and
overloading one osd makes it far worse than having many osds a bit busy.
And I verified that by putting those things in cgroups where they are
limited to a certain iops and bandwidth per disk, and then they can't
cause blocked requests easily.
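
For anyone curious, that cgroup setup is something along these lines
(cgroup v1 blkio controller; the device numbers, limits, and paths below
are only examples -- check lsblk for your own devices):

 # throttle a maintenance shell (and its children) on /dev/sdb (8:16)
 mkdir -p /sys/fs/cgroup/blkio/maintenance
 echo "8:16 50"       > /sys/fs/cgroup/blkio/maintenance/blkio.throttle.write_iops_device
 echo "8:16 20971520" > /sys/fs/cgroup/blkio/maintenance/blkio.throttle.write_bps_device
 echo $$ > /sys/fs/cgroup/blkio/maintenance/cgroup.procs
 xfs_fsr -v /srv/data   # the defrag now runs under the iops/bandwidth limits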

Peter



* Re: global backfill reservation?
  2017-05-12 18:53 global backfill reservation? Sage Weil
  2017-05-12 20:49 ` Peter Maloney
@ 2017-05-13 16:55 ` Dan van der Ster
  2017-06-02 14:05   ` Peter Maloney
  2017-05-20 14:24 ` Ning Yao
  2 siblings, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2017-05-13 16:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

On Fri, May 12, 2017 at 8:53 PM, Sage Weil <sweil@redhat.com> wrote:
> A common complaint is that recovery/backfill/rebalancing has a high
> impact.  That isn't news.  What I realized this week after hearing more
> operators describe their workaround is that everybody's workaround is
> roughly the same: make small changes to the crush map so that only a small
> number of PGs are backfilling at a time.  In retrospect it seems obvious,
> but the problem is that our backfill throttling is per-OSD: the "slowest"
> we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one
> replica due to separate reservation thresholds to avoid deadlock.)  That
> means that every OSD is impacted.  Doing fewer PGs doesn't make the
> recovery vs client scheduling better, but it means it affects fewer PGs
> and fewer client IOs and the net observed impact is smaller.
>
> Anyway, in short, I think we need to be able to set a *global* threshold
> of "no more than X % of OSDs should be backfilling at a time," which is
> impossible given the current reservation appoach.
>
> This could be done naively by having OSDs reserve a slot via the mon or
> mgr.  If we only did it for backfill the impact should be minimal (those
> are big slow long-running operations already).
>
> I think you can *almost* do it cleverly by inferring the set of PGs that
> have to backfill by pg_temp.  However, that doesn't take any priority or
> stuck PGs into consideration.
>
> Anyway, the naive thing probably isn't so bad...
>
> 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with
> one or more backfilling PGs).
>
> 2) For the first step of the backfill (recovery?) reservation, OSDs ask
> the mgr for a reservation slot.  The reservation is (pgid,interval epoch)
> so that the mgr can throw out the reservation require without needing an
> explicit cancellation if there is an interval change.
>
> 3) mgr grants as many reservations as it can without (backfilling +
> grants) > whatever the max is.
>
> We can set the max with a global tunable like
>
>  max_osd_backfilling_ratio = .3
>
> so that only 30% of the osds can be backfilling at once?
>
> sage

+1, this is something I've wanted for awhile. Using my "gentle
reweight" scripts, I've found that backfilling stays pretty
transparent as long as we limit to <5% of OSDs backfilling on our
large clusters. I think it will take some experimentation to find the
best default ratio to ship.
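
(For anyone who hasn't seen such a script, the basic shape is roughly
the following. This is only a sketch of the idea, not the actual
ceph-gentle-reweight tool; the osd name, target weight and step size
are made up.)

 osd=osd.10; target=5.46; step=0.05; w=0
 while awk -v w="$w" -v t="$target" 'BEGIN{exit !(w < t)}'; do
     w=$(awk -v w="$w" -v s="$step" -v t="$target" \
         'BEGIN{n = w + s; if (n > t) n = t; print n}')
     ceph osd crush reweight "$osd" "$w"
     # let peering/backfill settle before taking the next small step
     while ceph health | grep -Eq 'peering|backfill'; do sleep 30; done
 done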

On the other hand, the *other* reason that we operators like to make
small changes is to limit the number of PGs that go through peering
all at once. Correct me if I'm wrong, but as an operator I'd hesitate
to trigger a re-peering of *all* PGs in an active pool -- users would
surely notice such an operation. Does luminous or luminous++ have some
improvements to this half of the problem?

Cheers, Dan


* Re: global backfill reservation?
  2017-05-12 20:49 ` Peter Maloney
@ 2017-05-15 22:02   ` Gregory Farnum
  2017-05-16  7:21     ` David Butterfield
  0 siblings, 1 reply; 14+ messages in thread
From: Gregory Farnum @ 2017-05-15 22:02 UTC (permalink / raw)
  To: Peter Maloney; +Cc: Sage Weil, ceph-devel

On Fri, May 12, 2017 at 1:49 PM, Peter Maloney
<peter.maloney@brockmann-consult.de> wrote:
> On 05/12/17 20:53, Sage Weil wrote:
>> A common complaint is that recovery/backfill/rebalancing has a high
>> impact.  That isn't news.  What I realized this week after hearing more
>> operators describe their workaround is that everybody's workaround is
>> roughly the same: make small changes to the crush map so that only a small
>> number of PGs are backfilling at a time.  In retrospect it seems obvious,
>> but the problem is that our backfill throttling is per-OSD: the "slowest"
>> we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one
>> replica due to separate reservation thresholds to avoid deadlock.)  That
>> means that every OSD is impacted.  Doing fewer PGs doesn't make the
>> recovery vs client scheduling better, but it means it affects fewer PGs
>> and fewer client IOs and the net observed impact is smaller.
>>
>> Anyway, in short, I think we need to be able to set a *global* threshold
>> of "no more than X % of OSDs should be backfilling at a time," which is
>> impossible given the current reservation appoach.
>>
>> This could be done naively by having OSDs reserve a slot via the mon or
>> mgr.  If we only did it for backfill the impact should be minimal (those
>> are big slow long-running operations already).
>>
>> I think you can *almost* do it cleverly by inferring the set of PGs that
>> have to backfill by pg_temp.  However, that doesn't take any priority or
>> stuck PGs into consideration.
>>
>> Anyway, the naive thing probably isn't so bad...
>>
>> 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with
>> one or more backfilling PGs).
>>
>> 2) For the first step of the backfill (recovery?) reservation, OSDs ask
>> the mgr for a reservation slot.  The reservation is (pgid,interval epoch)
>> so that the mgr can throw out the reservation require without needing an
>> explicit cancellation if there is an interval change.
>>
>> 3) mgr grants as many reservations as it can without (backfilling +
>> grants) > whatever the max is.
>>
>> We can set the max with a global tunable like
>>
>>  max_osd_backfilling_ratio = .3
>>
>> so that only 30% of the osds can be backfilling at once?
>>
>> sage
>
> I think the biggest problem is not how many OSDs are busy, but that any
> single osd is overloaded long enough for a human user to call it laggy
> (eg. "ls" takes 5s because of blocked requests). A setting to say you
> want all osds 30% busy would be better than saying you want 30% of your
> osds overloaded and 70% idle (where another word for idle is wasted).

Yeah, this.

I think your first instinct was right, Sage: the client-visible
backfill impact is mostly a result of poor scheduling and
prioritization. The workaround of minimizing how much work we do at
once is really about reducing the tail size to a level low enough
people don't complain about it, but I think anybody aggregating data
metrics and looking at 99th%ile latencies and expecting some kind of
SLA would remain fairly unhappy with these outcomes. (The other issue
is as Dan notes — peering all at once is very visible; something that
delays only a small percentage of ops means other ops can keep
processing and client VMs don't seize up the same way).

That said, global backfill scheduling has other uses (...and might be
faster to implement than proper prioritization). It lets us restrict
network bandwidth devoted to backfill, not just local disk ops. And a
central daemon like the manager can do better prioritization than the
OSDs are really capable of in the case of degraded stuff (especially
with more complicated things like the undersized level on erasure
coded data across varying rules).
Those use cases make me think we might not want to start with such a
naive approach though. Perhaps OSDs report their personal backfill
limits to the manager when asking for the number of reservations they
want, and the manager decides which ones to issue based on that data,
its global limits, and the priorities it can see in terms of overall
PG states and backfill progress?
(In particular, it may want to "save" reservations for somebody that
is currently a backfill target but will shortly be freeing up a slot
or something.)
-Greg

> The problems with clients seem to happen when they hit an overly busy
> osd, rather than because many are moderately busy. (Is the future QoS
> code supposed to handle this, for recovery [and scrub, snap trim,
> flatten, rbd resize, etc.] not just clients? And I find resize [shrink
> with snaps present] and flatten to be the worst since there appears to
> be no config options to slow them down)
>
> I always have max backfills = 1 and recovery max active = 1, but with my
> small cluster (3 nodes and 36 osds so far), I find that letting it go
> fully parallel is better than trying to make small changes one at a
> time. I have tested things like running fio or xfs_fsr to defrag and
> overloading one osd makes it far worse than having many osds a bit busy.
> And I verified that by putting those things in cgroups where they are
> limited to a certain iops and bandwidth per disk, and then they can't
> cause blocked requests easily.
>
> Peter
>


* Re: global backfill reservation?
  2017-05-15 22:02   ` Gregory Farnum
@ 2017-05-16  7:21     ` David Butterfield
  0 siblings, 0 replies; 14+ messages in thread
From: David Butterfield @ 2017-05-16  7:21 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Peter Maloney, Sage Weil, ceph-devel

On Mon, May 15, 2017 at 4:02 PM, Gregory Farnum <gfarnum@redhat.com> wrote:
> On Fri, May 12, 2017 at 1:49 PM, Peter Maloney wrote:
>> I think the biggest problem is not how many OSDs are busy, but that any
>> single osd is overloaded long enough for a human user to call it laggy
>> (eg. "ls" takes 5s because of blocked requests). A setting to say you
>> want all osds 30% busy would be better than saying you want 30% of your
>> osds overloaded and 70% idle (where another word for idle is wasted).
>
> That said, global backfill scheduling has other uses (...and might be
> faster to implement than proper prioritization). It lets us restrict
> network bandwidth devoted to backfill, not just local disk ops.

I worked on the performance of resynchronization (after node recovery)
and restripe (after node add/remove) of a distributed SAN that already
had an adjustable bandwidth limit when I started on it (leaky bucket
sort of thing).  It limited bandwidth, but the restripe after adding a new
node could take a week (it was cruder with its fixed geometry than
newer techniques).

I found it worked better to disable the bandwidth limiter and instead
control the resync load by adjusting the number of network I/O ops
a recovering node will issue and have outstanding to other nodes for
resync I/O at any given time.  Queue Depths like 2 or 3 or 4 finished
sooner and with less impact on client I/O than using the B/W limiter.

It still wasn't wonderful, but it was better, so it might be an approach
to consider.  Note that in this system all nodes would do recovery
concurrently.  The QD limit can be set independently by each node
without resort to a central or distributed algorithm.  If necessary each
node could dynamically control its own pull rate by adjusting its
recovery QD based on its current load or whatever.


* Re: global backfill reservation?
  2017-05-12 18:53 global backfill reservation? Sage Weil
  2017-05-12 20:49 ` Peter Maloney
  2017-05-13 16:55 ` Dan van der Ster
@ 2017-05-20 14:24 ` Ning Yao
  2017-05-21  3:34   ` David Butterfield
  2017-06-02 21:44   ` LIU, Fei
  2 siblings, 2 replies; 14+ messages in thread
From: Ning Yao @ 2017-05-20 14:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

I think the most efficient way to solve this problem is not to
restrict the number of backfilling PGs.  The only reason operators
reduce the number of PGs backfilling at a time is that it is the only
knob Ceph currently offers.  As David mentioned above, backfilling
fewer PGs at a time increases the total recovery time, which in turn
lowers reliability and increases the probability of data loss.

Actually, end-users do not care what happens in the Ceph backend.  If
there is enough bandwidth they want their data recovered as fast as
possible, but at the same time they want user IO served first.  That
means that if the cluster has 10GB/s and 100k IOPS of bandwidth, then
at night user IO might consume 20% of it, leaving 80% for recovery,
while during the day user IO consumes 80%, leaving 20% for recovery.
So it seems reasonable to do this with a dynamic QoS strategy that
serves user IO first at all times.  Only in this way can we reach the
final goal behind this issue.

Regards
Ning Yao


2017-05-13 2:53 GMT+08:00 Sage Weil <sweil@redhat.com>:
> A common complaint is that recovery/backfill/rebalancing has a high
> impact.  That isn't news.  What I realized this week after hearing more
> operators describe their workaround is that everybody's workaround is
> roughly the same: make small changes to the crush map so that only a small
> number of PGs are backfilling at a time.  In retrospect it seems obvious,
> but the problem is that our backfill throttling is per-OSD: the "slowest"
> we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one
> replica due to separate reservation thresholds to avoid deadlock.)  That
> means that every OSD is impacted.  Doing fewer PGs doesn't make the
> recovery vs client scheduling better, but it means it affects fewer PGs
> and fewer client IOs and the net observed impact is smaller.
>
> Anyway, in short, I think we need to be able to set a *global* threshold
> of "no more than X % of OSDs should be backfilling at a time," which is
> impossible given the current reservation appoach.
>
> This could be done naively by having OSDs reserve a slot via the mon or
> mgr.  If we only did it for backfill the impact should be minimal (those
> are big slow long-running operations already).
>
> I think you can *almost* do it cleverly by inferring the set of PGs that
> have to backfill by pg_temp.  However, that doesn't take any priority or
> stuck PGs into consideration.
>
> Anyway, the naive thing probably isn't so bad...
>
> 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with
> one or more backfilling PGs).
>
> 2) For the first step of the backfill (recovery?) reservation, OSDs ask
> the mgr for a reservation slot.  The reservation is (pgid,interval epoch)
> so that the mgr can throw out the reservation require without needing an
> explicit cancellation if there is an interval change.
>
> 3) mgr grants as many reservations as it can without (backfilling +
> grants) > whatever the max is.
>
> We can set the max with a global tunable like
>
>  max_osd_backfilling_ratio = .3
>
> so that only 30% of the osds can be backfilling at once?
>
> sage


* Re: global backfill reservation?
  2017-05-20 14:24 ` Ning Yao
@ 2017-05-21  3:34   ` David Butterfield
  2017-06-02 21:44   ` LIU, Fei
  1 sibling, 0 replies; 14+ messages in thread
From: David Butterfield @ 2017-05-21  3:34 UTC (permalink / raw)
  To: Ning Yao; +Cc: Sage Weil, ceph-devel

On Sat, May 20, 2017 at 8:24 AM, Ning Yao <zay11022@gmail.com> wrote:
> so it seems pretty reasonable to do it
> with dynamic QoS strategy and serve the user IO first at anytime. Only
> in this way, it can achieve the final goal for this issue.

But part of the final goal is to minimize unhappiness including from loss
of data after a double failure, which means completing a timely recovery.
Giving strict priority to user I/O could starve recovery indefinitely.  Some
systems are *always* busy.

It seems likely to result in highly variable and unpredictable recovery times.
I think unpredictability about when their data "will be fully protected again"
is a source of anxiety for customers, if it can take more than a few hours.

One nice thing about controlling with queue depth is that it self-adjusts to
the load.  If the network and peer machine are idle, the operations will flow
at their maximum rate for a given queue depth (IOPS = QD / RTT, the
round-trip time of the entire circuit of the network and the peer service
together).

But if other load is present on the network or on the peer CPU, its requested
operations will interleave with the recovery I/O; this drives up RTT (by
slowing the peer server and/or delaying the network), automatically reducing
IOPS without adjusting Queue Depth.  Under high client load there will be
many client I/O operations for each recovery operation.

The Queue Depth can still be adjusted to set the overall aggressiveness
of the recovery process.
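
To put rough numbers on that (purely illustrative): with QD = 4 and an
idle round-trip time of 2 ms, recovery can push about 4 / 0.002 = 2000
ops/s; if client load stretches the RTT to 20 ms, the same QD = 4
yields only ~200 ops/s, so recovery backs off by a factor of 10 without
any explicit rate limit.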


* Re: global backfill reservation?
  2017-05-13 16:55 ` Dan van der Ster
@ 2017-06-02 14:05   ` Peter Maloney
  2017-06-02 15:38     ` Sage Weil
  2017-06-03  7:51     ` Dan van der Ster
  0 siblings, 2 replies; 14+ messages in thread
From: Peter Maloney @ 2017-06-02 14:05 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Sage Weil

On 05/13/17 18:55, Dan van der Ster wrote:
> +1, this is something I've wanted for awhile. Using my "gentle
> reweight" scripts, I've found that backfilling stays pretty
> transparent as long as we limit to <5% of OSDs backfilling on our
> large clusters. I think it will take some experimentation to find the
> best default ratio to ship.
>
> On the other hand, the *other* reason that we operators like to make
> small changes is to limit the number of PGs that go through peering
> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
> to trigger a re-peering of *all* PGs in an active pool -- users would
> surely notice such an operation. Does luminous or luminous++ have some
> improvements to this half of the problem?
>
> Cheers, Dan
>

Hi Dan,

I have read your script:
https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42

And at that line I see you using "ceph osd crush reweight" instead of
"ceph osd reweight".

And I just added 2 nodes to my cluster and had some related issues and
solved them. Doing it like your script, crush reweighting a tiny bit at a
time causes blocked requests for long durations, even just moving 1 pg
... I let one go for 40s before stopping it. It seemed impossible to
ever get one pg to peer without such a long block. I also tried making a
special pool with those 12 osds to test and it took 1 minute to make 64
pgs without any clients using them, which is still unreasonable for a
blocked request. (Also the "normal" way to just blindly add osds with
full weight and not take any special care would just do the same in one
big jump instead of many.)

And the solution in the end was quite painless... have osds up (with
either weight 0), then just set reweight 0, crush weight normal (TB) and
then it does peering (one sort of peering?) and then after peering is
done, change "ceph osd reweight", even a bunch at once and it has barely
any impact... it does peering (the other sort of peering, not repeating
the slow terrible sort it did already?), but very fast and with only a
few 5s blocked requests (which is fairly normal here due to rbd
snapshots). Maybe the crush weight peering with 0 reweight makes it do
the slow terrible sort of peering, but without blocking any real pgs,
and therefore without blocking clients, so it's tolerable (blocking
empty osds, not used pools and pgs). And then the other peering is fast.
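
Concretely, that sequence is something like this (the osd id and
weight are only examples):

 ceph osd reweight 50 0               # keep it "out" for data placement
 ceph osd crush reweight osd.50 6.0   # the slow peering happens now, but
                                      # no client-facing pgs are blocked
 while ceph health | grep -q peering; do sleep 5; done
 ceph osd reweight 50 1               # short peering, little client impact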

And Sage, if that's true, then couldn't ceph by default just do the
first kind of peering work before any pgs, pools, clients, etc. are
affected, before moving on to the stuff that affects clients, regardless
of which steps were used? At some point during adding those 2 nodes I
was thinking how could ceph be so broken and mysterious... why does it
just hang there? Would it do this during recovery of a dead osd too? Now
I know how to avoid it and that it shouldn't affect recovering dead osds
(not changing crush weight)... but it would be nice for all users not to
ever think that way. :)

And Dan, I am curious about why you use crush reweight for this (which I
failed to), and whether you tried it the way I describe above, or
another way.

And I'm using jewel 10.2.7. I don't know how other versions behave.




* Re: global backfill reservation?
  2017-06-02 14:05   ` Peter Maloney
@ 2017-06-02 15:38     ` Sage Weil
  2017-06-03 23:11       ` Peter Maloney
  2017-06-03  7:51     ` Dan van der Ster
  1 sibling, 1 reply; 14+ messages in thread
From: Sage Weil @ 2017-06-02 15:38 UTC (permalink / raw)
  To: Peter Maloney; +Cc: Dan van der Ster, ceph-devel

On Fri, 2 Jun 2017, Peter Maloney wrote:
> On 05/13/17 18:55, Dan van der Ster wrote:
> > +1, this is something I've wanted for awhile. Using my "gentle
> > reweight" scripts, I've found that backfilling stays pretty
> > transparent as long as we limit to <5% of OSDs backfilling on our
> > large clusters. I think it will take some experimentation to find the
> > best default ratio to ship.
> >
> > On the other hand, the *other* reason that we operators like to make
> > small changes is to limit the number of PGs that go through peering
> > all at once. Correct me if I'm wrong, but as an operator I'd hesitate
> > to trigger a re-peering of *all* PGs in an active pool -- users would
> > surely notice such an operation. Does luminous or luminous++ have some
> > improvements to this half of the problem?
> >
> > Cheers, Dan
> >
> 
> Hi Dan,
> 
> I have read your script:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
> 
> And at that line I see you using "ceph osd crush reweight" instead of
> "ceph osd reweight".
> 
> And I just added 2 nodes to my cluster and had some related issues and
> solved them. Doing it like your script, crush reweighting a tiny bit a
> time causes blocked requests for long durations, even just moving 1 pg
> ... I let one go for 40s before stopping it. It seemed impossible to
> ever get one pg to peer without such a long block. I also tried making a
> special pool with those 12 osds to test and it took 1 minute to make 64
> pgs without any clients using them, which is still unreasonable for a
> blocked request. (Also the "normal" way to just blindly add osds with
> full weight and not take any special care would just do the same in one
> big jump instead of many.)

FWIW this sounds a lot like the problem that Josh is solving now (deletes 
in the workload can make peering slow).  "Slow peering" is not very 
specific, I guess, but that's the one known issue that makes peering 10s 
of seconds slow.

> And the solution in the end was quite painless... have osds up (with
> either weight 0), then just set reweight 0, crush weight normal (TB) and
> then it does peering (one sort of peering?) and then after peering is
> done, change "ceph osd reweight", even a bunch at once and it has barely
> any impact... it does peering (the other sort of peering, not repeating
> the slow terrible sort it did already?), but very fast and with only a
> few 5s blocked requests (which is fairly normal here due to rbd
> snapshots). Maybe the crush weight peering with 0 reweight makes it do
> the slow terrible sort of peering, but without blocking any real pgs,
> and therefore without blocking clients, so it's tolerable (blocking
> empty osds, not used pools and pgs). And then the other peering is fast.

I don't see how this would be any different from a peering perspective.  
The pattern of data movement and remapping would be different, but there's 
no difference in this sequence that seems like it would relate to peering 
taking 10s of seconds.  :/

How confident are you that this was a real effect?  Could it be that when 
you tried the second method your disk caches were warm vs the first time 
around when they were cold?

sage

> And Sage, if that's true, then couldn't ceph by default just do the
> first kind of peering work before any pgs, pools, clients, etc. are
> affected, before moving on to the stuff that affects clients, regardless
> of which steps were used? At some point during adding t hose 2 nodes I
> was thinking how could ceph be so broken and mysterious... why does it
> just hang there? Would it do this during recovery of a dead osd too? Now
> I know how to avoid it and that it shouldn't affect recovering dead osds
> (not changing crush weight)... but it would be nice for all users not to
> ever think that way. :)
> 
> And Dan, I am curious about why you use crush reweight for this (which I
> failed to), and whether you tried it the way I describe above, or
> another way.
> 
> And I'm using jewel 10.2.7. I don't know how other versions behave.
> 
> 
> 


* Re: global backfill reservation?
  2017-05-20 14:24 ` Ning Yao
  2017-05-21  3:34   ` David Butterfield
@ 2017-06-02 21:44   ` LIU, Fei
  1 sibling, 0 replies; 14+ messages in thread
From: LIU, Fei @ 2017-06-02 21:44 UTC (permalink / raw)
  To: Ning Yao, Sage Weil; +Cc: ceph-devel

I agree with what Ning said about Ceph users' expectations.  Recovery/backfill, and even scrub, should be scheduled dynamically based on the SLA and the available cluster resources.

Regards,
James

On 5/20/17, 7:24 AM, "Ning Yao" <ceph-devel-owner@vger.kernel.org on behalf of zay11022@gmail.com> wrote:

    I think the most efficient way to solve this problem is not to
    restrict the number of backfilling pgs.  The reason why they want to
    reduce backfilling pgs at the same time is because this is the only
    thing we can do in Ceph currently. As David mentioned above, reducing
    the active backfilling pgs at a time will increase the total recovery
    time, which in turn leads to lower reliability and increase the data
    loss probability.
    
    Actually, for end-users, they do not care what happens in the ceph
    backend. They wanna if there is enough bandwidth, then recover my data
    as fast as possible. But at the same time, they want the user IO is
    served first. That means if the cluster has 10GB/s, 100k iops IO
    bandwidth, at night, user IO cost 20% bandwidth so that 80% bandwidth
    for recovery, while at daytime, user IO cost 80% bandwidth  so that
    20% bandwidth for recovery. so it seems pretty reasonable to do it
    with dynamic QoS strategy and serve the user IO first at anytime. Only
    in this way, it can achieve the final goal for this issue.
    
    Therefore
    Regards
    Ning Yao
    
    
    2017-05-13 2:53 GMT+08:00 Sage Weil <sweil@redhat.com>:
    > A common complaint is that recovery/backfill/rebalancing has a high
    > impact.  That isn't news.  What I realized this week after hearing more
    > operators describe their workaround is that everybody's workaround is
    > roughly the same: make small changes to the crush map so that only a small
    > number of PGs are backfilling at a time.  In retrospect it seems obvious,
    > but the problem is that our backfill throttling is per-OSD: the "slowest"
    > we can go is 1 backfilling PG per OSD.  (Actually, 2.. one primary and one
    > replica due to separate reservation thresholds to avoid deadlock.)  That
    > means that every OSD is impacted.  Doing fewer PGs doesn't make the
    > recovery vs client scheduling better, but it means it affects fewer PGs
    > and fewer client IOs and the net observed impact is smaller.
    >
    > Anyway, in short, I think we need to be able to set a *global* threshold
    > of "no more than X % of OSDs should be backfilling at a time," which is
    > impossible given the current reservation appoach.
    >
    > This could be done naively by having OSDs reserve a slot via the mon or
    > mgr.  If we only did it for backfill the impact should be minimal (those
    > are big slow long-running operations already).
    >
    > I think you can *almost* do it cleverly by inferring the set of PGs that
    > have to backfill by pg_temp.  However, that doesn't take any priority or
    > stuck PGs into consideration.
    >
    > Anyway, the naive thing probably isn't so bad...
    >
    > 1) PGMap counts backfilling PGs per OSD (and then the number of OSDs with
    > one or more backfilling PGs).
    >
    > 2) For the first step of the backfill (recovery?) reservation, OSDs ask
    > the mgr for a reservation slot.  The reservation is (pgid,interval epoch)
    > so that the mgr can throw out the reservation require without needing an
    > explicit cancellation if there is an interval change.
    >
    > 3) mgr grants as many reservations as it can without (backfilling +
    > grants) > whatever the max is.
    >
    > We can set the max with a global tunable like
    >
    >  max_osd_backfilling_ratio = .3
    >
    > so that only 30% of the osds can be backfilling at once?
    >
    > sage
    




* Re: global backfill reservation?
  2017-06-02 14:05   ` Peter Maloney
  2017-06-02 15:38     ` Sage Weil
@ 2017-06-03  7:51     ` Dan van der Ster
  2017-06-03 22:58       ` Peter Maloney
  1 sibling, 1 reply; 14+ messages in thread
From: Dan van der Ster @ 2017-06-03  7:51 UTC (permalink / raw)
  To: Peter Maloney; +Cc: ceph-devel, Sage Weil

On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
<peter.maloney@brockmann-consult.de> wrote:
> On 05/13/17 18:55, Dan van der Ster wrote:
>> +1, this is something I've wanted for awhile. Using my "gentle
>> reweight" scripts, I've found that backfilling stays pretty
>> transparent as long as we limit to <5% of OSDs backfilling on our
>> large clusters. I think it will take some experimentation to find the
>> best default ratio to ship.
>>
>> On the other hand, the *other* reason that we operators like to make
>> small changes is to limit the number of PGs that go through peering
>> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
>> to trigger a re-peering of *all* PGs in an active pool -- users would
>> surely notice such an operation. Does luminous or luminous++ have some
>> improvements to this half of the problem?
>>
>> Cheers, Dan
>>
>
> Hi Dan,
>
> I have read your script:
> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
>
> And at that line I see you using "ceph osd crush reweight" instead of
> "ceph osd reweight".
>
> And I just added 2 nodes to my cluster and had some related issues and
> solved them. Doing it like your script, crush reweighting a tiny bit a
> time causes blocked requests for long durations, even just moving 1 pg
> ... I let one go for 40s before stopping it. It seemed impossible to
> ever get one pg to peer without such a long block. I also tried making a
> special pool with those 12 osds to test and it took 1 minute to make 64
> pgs without any clients using them, which is still unreasonable for a
> blocked request. (Also the "normal" way to just blindly add osds with
> full weight and not take any special care would just do the same in one
> big jump instead of many.)
>
> And the solution in the end was quite painless... have osds up (with
> either weight 0), then just set reweight 0, crush weight normal (TB) and
> then it does peering (one sort of peering?) and then after peering is
> done, change "ceph osd reweight", even a bunch at once and it has barely
> any impact... it does peering (the other sort of peering, not repeating
> the slow terrible sort it did already?), but very fast and with only a
> few 5s blocked requests (which is fairly normal here due to rbd
> snapshots). Maybe the crush weight peering with 0 reweight makes it do
> the slow terrible sort of peering, but without blocking any real pgs,
> and therefore without blocking clients, so it's tolerable (blocking
> empty osds, not used pools and pgs). And then the other peering is fast.
>
> And Sage, if that's true, then couldn't ceph by default just do the
> first kind of peering work before any pgs, pools, clients, etc. are
> affected, before moving on to the stuff that affects clients, regardless
> of which steps were used? At some point during adding t hose 2 nodes I
> was thinking how could ceph be so broken and mysterious... why does it
> just hang there? Would it do this during recovery of a dead osd too? Now
> I know how to avoid it and that it shouldn't affect recovering dead osds
> (not changing crush weight)... but it would be nice for all users not to
> ever think that way. :)
>
> And Dan, I am curious about why you use crush reweight for this (which I
> failed to), and whether you tried it the way I describe above, or
> another way.
>
> And I'm using jewel 10.2.7. I don't know how other versions behave.
>
>

Here's what we do:
  1. Create and start new OSDs with initial crush weight = 0.0. No PGs
should re-peer when these are booted.
  2. Run the reweight script, e.g. like this for some 6T drives:

   ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46

In practice we've added >150 drives at once with that script -- using
that tiny delta.

We use crush reweight because it "works for us (tm)". We haven't seen
any strange peering hangs, though we exercise this on hammer, not
(yet) jewel.
I hadn't thought of your method using osd reweight -- how do you add
new osds with an initial osd reweight? Maybe you create the osds in a
non-default root then move them after being reweighted to 0.0?

Cheers, Dan


* Re: global backfill reservation?
  2017-06-03  7:51     ` Dan van der Ster
@ 2017-06-03 22:58       ` Peter Maloney
  2017-06-06 14:51         ` Peter Maloney
  0 siblings, 1 reply; 14+ messages in thread
From: Peter Maloney @ 2017-06-03 22:58 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Sage Weil

On 06/03/17 09:51, Dan van der Ster wrote:
> On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
> <peter.maloney@brockmann-consult.de> wrote:
>> On 05/13/17 18:55, Dan van der Ster wrote:
>>> +1, this is something I've wanted for awhile. Using my "gentle
>>> reweight" scripts, I've found that backfilling stays pretty
>>> transparent as long as we limit to <5% of OSDs backfilling on our
>>> large clusters. I think it will take some experimentation to find the
>>> best default ratio to ship.
>>>
>>> On the other hand, the *other* reason that we operators like to make
>>> small changes is to limit the number of PGs that go through peering
>>> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
>>> to trigger a re-peering of *all* PGs in an active pool -- users would
>>> surely notice such an operation. Does luminous or luminous++ have some
>>> improvements to this half of the problem?
>>>
>>> Cheers, Dan
>>>
>> Hi Dan,
>>
>> I have read your script:
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
>>
>> And at that line I see you using "ceph osd crush reweight" instead of
>> "ceph osd reweight".
>>
>> And I just added 2 nodes to my cluster and had some related issues and
>> solved them. Doing it like your script, crush reweighting a tiny bit a
>> time causes blocked requests for long durations, even just moving 1 pg
>> ... I let one go for 40s before stopping it. It seemed impossible to
>> ever get one pg to peer without such a long block. I also tried making a
>> special pool with those 12 osds to test and it took 1 minute to make 64
>> pgs without any clients using them, which is still unreasonable for a
>> blocked request. (Also the "normal" way to just blindly add osds with
>> full weight and not take any special care would just do the same in one
>> big jump instead of many.)
>>
>> And the solution in the end was quite painless... have osds up (with
>> either weight 0), then just set reweight 0, crush weight normal (TB) and
>> then it does peering (one sort of peering?) and then after peering is
>> done, change "ceph osd reweight", even a bunch at once and it has barely
>> any impact... it does peering (the other sort of peering, not repeating
>> the slow terrible sort it did already?), but very fast and with only a
>> few 5s blocked requests (which is fairly normal here due to rbd
>> snapshots). Maybe the crush weight peering with 0 reweight makes it do
>> the slow terrible sort of peering, but without blocking any real pgs,
>> and therefore without blocking clients, so it's tolerable (blocking
>> empty osds, not used pools and pgs). And then the other peering is fast.
>>
>> And Sage, if that's true, then couldn't ceph by default just do the
>> first kind of peering work before any pgs, pools, clients, etc. are
>> affected, before moving on to the stuff that affects clients, regardless
>> of which steps were used? At some point during adding t hose 2 nodes I
>> was thinking how could ceph be so broken and mysterious... why does it
>> just hang there? Would it do this during recovery of a dead osd too? Now
>> I know how to avoid it and that it shouldn't affect recovering dead osds
>> (not changing crush weight)... but it would be nice for all users not to
>> ever think that way. :)
>>
>> And Dan, I am curious about why you use crush reweight for this (which I
>> failed to), and whether you tried it the way I describe above, or
>> another way.
>>
>> And I'm using jewel 10.2.7. I don't know how other versions behave.
>>
>>
> Here's what we do:
>   1. Create and start new OSDs with initial crush weight = 0.0. No PGs
> should re-peer when these are booted.
>   2. Run the reweight script, e.g. like this for some 6T drives:
>
>    ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46
>
> In practice we've added >150 drives at once with that script -- using
> that tiny delta.
>
> We use crush reweight because it "works for us (tm)". We haven't seen
> any strange peering hangs, though we exercise this on hammer, not
> (yet) jewel.
> I hadn't thought of your method using osd reweight -- how do you add
> new osds with an initial osd reweight? Maybe you create the osds in a
> non-default root then move them after being reweighted to 0.0?
>
> Cheers, Dan

I added them with crush weight 0, then my plan was to raise the weight
like you do. That's basically what I did for all the other servers. But
I fiddled with the crush map and had them in another root when I set the
reweight 0, then weight 6, then moved them to root default (long
peering), then reweight 1 (short peering). But that wasn't what I
planned on doing or plan to do in the future.

I expect that would be the same as crush weight 0 and in the normal root
when created, then when ready for peering, set reweight 0 first, then
crush weight 6, then after peering is done, reweight 1 for a few at a
time (ceph osd reweight ...; sleep 2; while ceph health | grep peering;
do sleep 1; done ...).

The next step in this upgrade is to replace 18 2TB disks with 6TB
ones... I'll do it that way and find out if it works without the extra root.



* Re: global backfill reservation?
  2017-06-02 15:38     ` Sage Weil
@ 2017-06-03 23:11       ` Peter Maloney
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Maloney @ 2017-06-03 23:11 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, ceph-devel

On 06/02/17 17:38, Sage Weil wrote:
> On Fri, 2 Jun 2017, Peter Maloney wrote:
>> On 05/13/17 18:55, Dan van der Ster wrote:
>>> +1, this is something I've wanted for awhile. Using my "gentle
>>> reweight" scripts, I've found that backfilling stays pretty
>>> transparent as long as we limit to <5% of OSDs backfilling on our
>>> large clusters. I think it will take some experimentation to find the
>>> best default ratio to ship.
>>>
>>> On the other hand, the *other* reason that we operators like to make
>>> small changes is to limit the number of PGs that go through peering
>>> all at once. Correct me if I'm wrong, but as an operator I'd hesitate
>>> to trigger a re-peering of *all* PGs in an active pool -- users would
>>> surely notice such an operation. Does luminous or luminous++ have some
>>> improvements to this half of the problem?
>>>
>>> Cheers, Dan
>>>
>> Hi Dan,
>>
>> I have read your script:
>> https://github.com/cernceph/ceph-scripts/blob/master/tools/ceph-gentle-reweight#L42
>>
>> And at that line I see you using "ceph osd crush reweight" instead of
>> "ceph osd reweight".
>>
>> And I just added 2 nodes to my cluster and had some related issues and
>> solved them. Doing it like your script, crush reweighting a tiny bit a
>> time causes blocked requests for long durations, even just moving 1 pg
>> ... I let one go for 40s before stopping it. It seemed impossible to
>> ever get one pg to peer without such a long block. I also tried making a
>> special pool with those 12 osds to test and it took 1 minute to make 64
>> pgs without any clients using them, which is still unreasonable for a
>> blocked request. (Also the "normal" way to just blindly add osds with
>> full weight and not take any special care would just do the same in one
>> big jump instead of many.)
> FWIW this sounds a lot like the problem that Josh is solving now (deletes 
> in the workload can make peering slow).  "Slow peering" is not very 
> specific, I guess, but that's the one known issue that makes peering 10s 
> of seconds slow.
>
>> And the solution in the end was quite painless... have osds up (with
>> either weight 0), then just set reweight 0, crush weight normal (TB) and
>> then it does peering (one sort of peering?) and then after peering is
>> done, change "ceph osd reweight", even a bunch at once and it has barely
>> any impact... it does peering (the other sort of peering, not repeating
>> the slow terrible sort it did already?), but very fast and with only a
>> few 5s blocked requests (which is fairly normal here due to rbd
>> snapshots). Maybe the crush weight peering with 0 reweight makes it do
>> the slow terrible sort of peering, but without blocking any real pgs,
>> and therefore without blocking clients, so it's tolerable (blocking
>> empty osds, not used pools and pgs). And then the other peering is fast.
> I don't see how this would be any different from a peering perspective.  
> The pattern of data movement and remapping would be different, but there's 
> no difference in this sequence that seems like it relate to peering 
> taking 10s of seconds.  :/
Maybe I explained it badly.... I mean it took just as long to change the
crush weight and peer, but when reweight was 0, the clients weren't
affected. Then when I set reweight 1, it was faster and clients seemed
happy still.
> How confident are you that this was a real effect?  Could it be that when 
> you tried the second method your disk caches were warm vs the first time 
> around when they were cold?
I don't know how to judge whether it cached anything... what is there to
cache on an empty disk? And does repeating the test use the same data? It
was trying to peer the same pg each time.

I repeatedly re-tested the same osd to try to get it to peer many
times...like 30 or 40 times probably, spread over 2 days. Each time I
just let it block clients for about 5-20 seconds, and then when I
managed to somehow get it to only block 1 pg I know didn't matter much
(a basically idle pool), then I let it go 40s or longer.

I considered that doing the test with the separate root prepared the
osds for peering in the real root... but thought that's probably wrong
since the first osd was still slow doing my same test as before, until I
thought of using reweight instead of crush reweight. So that's like 40
times trying crush weight on one osd (a few times with 2-3 osds)... one
time testing a separate root and it fully peered... then a few times
trying crush weight again... then the reweight idea with one disk, then
one more, etc. and then the last 3 or 4 at once.

And I checked iostat and didn't think the disks looked very busy while
peering. I'll pay closer attention to that stuff (and anything you
suggest before then) when I do the next 18 osds (first removing, then
adding larger ones).

>
> sage
>
>> And Sage, if that's true, then couldn't ceph by default just do the
>> first kind of peering work before any pgs, pools, clients, etc. are
>> affected, before moving on to the stuff that affects clients, regardless
>> of which steps were used? At some point during adding t hose 2 nodes I
>> was thinking how could ceph be so broken and mysterious... why does it
>> just hang there? Would it do this during recovery of a dead osd too? Now
>> I know how to avoid it and that it shouldn't affect recovering dead osds
>> (not changing crush weight)... but it would be nice for all users not to
>> ever think that way. :)
>>
>> And Dan, I am curious about why you use crush reweight for this (which I
>> failed to), and whether you tried it the way I describe above, or
>> another way.
>>
>> And I'm using jewel 10.2.7. I don't know how other versions behave.
>>
>>
>>




* Re: global backfill reservation?
  2017-06-03 22:58       ` Peter Maloney
@ 2017-06-06 14:51         ` Peter Maloney
  0 siblings, 0 replies; 14+ messages in thread
From: Peter Maloney @ 2017-06-06 14:51 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel, Sage Weil

On 06/02/17 17:38, Sage Weil wrote:
> I don't see how this would be any different from a peering perspective.  
> The pattern of data movement and remapping would be different, but there's 
> no difference in this sequence that seems like it relate to peering 
> taking 10s of seconds.  :/
>
> How confident are you that this was a real effect?  Could it be that when 
> you tried the second method your disk caches were warm vs the first time 
> around when they were cold?
>
> sage


After the new disks are added, much more confident. See below... one
time I crush weighted 6 at once, with issues, and the other times it was
other disks, with no issues if I don't crush reweight too many at once.


On 06/04/17 00:58, Peter Maloney wrote:
> On 06/03/17 09:51, Dan van der Ster wrote:
>> On Fri, Jun 2, 2017 at 4:05 PM, Peter Maloney
>> <peter.maloney@brockmann-consult.de> wrote:
>>> ...
>>> And Sage, if that's true, then couldn't ceph by default just do the
>>> first kind of peering work before any pgs, pools, clients, etc. are
>>> affected, before moving on to the stuff that affects clients, regardless
>>> of which steps were used? At some point during adding t hose 2 nodes I
>>> was thinking how could ceph be so broken and mysterious... why does it
>>> just hang there? Would it do this during recovery of a dead osd too? Now
>>> I know how to avoid it and that it shouldn't affect recovering dead osds
>>> (not changing crush weight)... but it would be nice for all users not to
>>> ever think that way. :)
>>>
>>> ...
>> Here's what we do:
>>   1. Create and start new OSDs with initial crush weight = 0.0. No PGs
>> should re-peer when these are booted.
>>   2. Run the reweight script, e.g. like this for some 6T drives:
>>
>>    ceph-gentle-reweight -o osd.10,osd.11,osd.12 -l 15 -b 50 -d 0.01 -t 5.46
>>
>> In practice we've added >150 drives at once with that script -- using
>> that tiny delta.
>>
>> We use crush reweight because it "works for us (tm)". We haven't seen
>> any strange peering hangs, though we exercise this on hammer, not
>> (yet) jewel.
>> I hadn't thought of your method using osd reweight -- how do you add
>> new osds with an initial osd reweight? Maybe you create the osds in a
>> non-default root then move them after being reweighted to 0.0?
>>
>> Cheers, Dan
> I added them with crush weight 0, then my plan was to raise the weight
> like you do. That's basically what I did for all the other servers. But
> I fiddled with the crush map and had them in another root when I set the
> reweight 0, then weight 6, then moved them to root default (long
> peering), then reweight 1 (short peering). But that wasn't what I
> planned on doing or plan to do in the future.
>
> I expect that would be the same as crush weight 0 and in the normal root
> when created, then when ready for peering, set reweight 0 first, then
> crush weight 6, then after peering is done, reweight 1 for a few at a
> time (ceph osd reweight ...; sleep 2; while ceph health | grep peering;
> do sleep 1; done ...).
>
> The next step in this upgrade is to replace 18 2TB disks with 6TB
> ones... I'll do it that way and find out if it works without the extra root.

So I'm done removing the 18 2TB disks and adding the 6TB ones (plus
replacing a dead one). I did 6 disks at a time (all the 2TB disks on
each node).

I didn't test raising the crush weight slowly, but I did test that setting
the crush weight straight to 6 on all of them at once (with reweight still
0) causes client issues.  (Setting reweight to 1 on all of them at once,
even from multiple processes like I do here, works fine.)

Here's the script that does the job well. First have the new osds
created with weight 0, and daemons running. Then this script finds them
by weight 0 and works with them:

> # list osds with hosts next to them for easy filtering with awk
> # (doesn't support chassis, rack, etc. buckets)
> ceph_list_osd() {
>     ceph osd tree | awk '
>         BEGIN {found=0; host=""};
>         $3 == "host" {found=1; host=$4; getline};
>         $3 == "host" {found=0}
>         found || $3 ~ /osd\./ {print $0 " " host}'
> }
>
> peering_sleep() {
>     echo "sleeping"
>     sleep 2
>     while ceph health | grep -q peer; do
>         echo -n .
>         sleep 1
>     done
>     echo
>     sleep 5
> }
>
> # after an osd is already created, this reweights them to 'activate' them
> ceph_activate_osds() {
>     weight="$1"
>     host=$(hostname -s)
>     
>     if [ -z "$weight" ]; then
>         weight=6.00099
>     fi
>     
>     # for crush weight 0 osds, set reweight 0 so the non-zero crush
>     # weight won't cause as many blocked requests
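>     # (ceph_list_osd fields: $1=id, $2=crush weight, $5=reweight, $7=host)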
>     for id in $(ceph_list_osd | awk '$2 == 0 {print $1}'); do
>         ceph osd reweight $id 0 &
>     done
>     wait
>     peering_sleep
>     
>     # the harsh reweight which we do slowly
>     for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
>         echo ceph osd crush reweight "osd.$id" "$weight"
>         ceph osd crush reweight "osd.$id" "$weight"
>         peering_sleep
>     done
>     
>     # the light reweight
>     for id in $(ceph_list_osd | awk -v host="$host" '$5 == 0 && $7 == host {print $1}'); do
>         ceph osd reweight $id 1 &
>     done
>     wait
> }
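
For reference, here's roughly how I run it (the file path is made up; the
functions above just need to be sourced on the host with the new osds):

# on the host whose new osds were created with crush weight 0:
. /root/ceph_activate_osds.sh   # hypothetical file holding the functions above
ceph_list_osd                   # sanity check: the new osds should show weight 0
ceph_activate_osds              # 6TB drives, default crush weight 6.00099
# ceph_activate_osds 4.00099    # or pass a weight, e.g. for the 4TB drives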

And here's the ceph status in case it's somehow useful:
> root@ceph1:~ # ceph -s
>     cluster 684e4a3f-25fb-4b78-8756-62befa9be15e
>      health HEALTH_WARN
>             756 pgs backfill_wait
>             6 pgs backfilling
>             260 pgs degraded
>             183 pgs recovery_wait
>             260 pgs stuck degraded
>             945 pgs stuck unclean
>             60 pgs stuck undersized
>             60 pgs undersized
>             recovery 494450/38357551 objects degraded (1.289%)
>             recovery 26900171/38357551 objects misplaced (70.130%)
>      monmap e3: 3 mons at
> {ceph1=10.3.0.131:6789/0,ceph2=10.3.0.132:6789/0,ceph3=10.3.0.133:6789/0}
>             election epoch 614, quorum 0,1,2 ceph1,ceph2,ceph3
>       fsmap e322: 1/1/1 up {0=ceph2=up:active}, 2 up:standby
>      osdmap e119625: 60 osds: 60 up, 60 in; 933 remapped pgs
>             flags sortbitwise,require_jewel_osds
>       pgmap v19175947: 1152 pgs, 4 pools, 31301 GB data, 8172 kobjects
>             94851 GB used, 212 TB / 305 TB avail
>             494450/38357551 objects degraded (1.289%)
>             26900171/38357551 objects misplaced (70.130%)
>                  685 active+remapped+wait_backfill
>                  200 active+clean
>                  164 active+recovery_wait+degraded+remapped
>                   52 active+undersized+degraded+remapped+wait_backfill
>                   19 active+degraded+remapped+wait_backfill
>                   12 active+recovery_wait+degraded
>                    7 active+clean+scrubbing
>                    7 active+recovery_wait+undersized+degraded+remapped
>                    5 active+degraded+remapped+backfilling
>                    1 active+undersized+degraded+remapped+backfilling
>   recovery io 900 MB/s, 240 objects/s
>   client io 79721 B/s rd, 10418 kB/s wr, 19 op/s rd, 137 op/s wr
>
> root@ceph1:~ # ceph osd tree
> ID WEIGHT    TYPE NAME      UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 336.06061 root default                                     
> -2  64.01199     host ceph1                                   
>  0   4.00099         osd.0       up  0.61998          1.00000
>  1   4.00099         osd.1       up  0.59834          1.00000
>  2   4.00099         osd.2       up  0.79213          1.00000
> 27   4.00099         osd.27      up  0.69460          1.00000
> 30   6.00099         osd.30      up  0.73935          1.00000
> 31   6.00099         osd.31      up  0.81180          1.00000
> 10   6.00099         osd.10      up  0.64571          1.00000
> 12   6.00099         osd.12      up  0.94655          1.00000
> 13   6.00099         osd.13      up  0.75957          1.00000
> 14   6.00099         osd.14      up  0.77515          1.00000
> 15   6.00099         osd.15      up  0.74663          1.00000
> 16   6.00099         osd.16      up  0.93401          1.00000
> -3  64.01181     host ceph2                                   
>  3   4.00099         osd.3       up  0.69209          1.00000
>  4   4.00099         osd.4       up  0.75365          1.00000
>  5   4.00099         osd.5       up  0.80797          1.00000
> 28   4.00099         osd.28      up  0.66307          1.00000
> 32   6.00099         osd.32      up  0.81369          1.00000
> 33   6.00099         osd.33      up  1.00000          1.00000
>  9   6.00098         osd.9       up  0.58499          1.00000
> 17   6.00098         osd.17      up  0.90613          1.00000
> 18   6.00098         osd.18      up  0.73138          1.00000
> 19   6.00098         osd.19      up  0.80649          1.00000
> 20   6.00098         osd.20      up  0.51999          1.00000
> 21   6.00098         osd.21      up  0.79404          1.00000
> -4  64.01181     host ceph3                                   
>  6   4.00099         osd.6       up  0.56717          1.00000
>  7   4.00099         osd.7       up  0.72240          1.00000
>  8   4.00099         osd.8       up  0.79919          1.00000
> 29   4.00099         osd.29      up  0.80109          1.00000
> 34   6.00099         osd.34      up  0.71120          1.00000
> 35   6.00099         osd.35      up  0.63611          1.00000
> 11   6.00098         osd.11      up  0.67000          1.00000
> 22   6.00098         osd.22      up  0.80756          1.00000
> 23   6.00098         osd.23      up  0.67000          1.00000
> 24   6.00098         osd.24      up  0.71599          1.00000
> 25   6.00098         osd.25      up  0.64540          1.00000
> 26   6.00098         osd.26      up  0.76378          1.00000
> -5  72.01199     host ceph4                                   
> 36   6.00099         osd.36      up  0.74846          1.00000
> 37   6.00099         osd.37      up  0.71387          1.00000
> 38   6.00099         osd.38      up  0.71129          1.00000
> 39   6.00099         osd.39      up  0.76547          1.00000
> 40   6.00099         osd.40      up  0.73967          1.00000
> 41   6.00099         osd.41      up  0.64742          1.00000
> 42   6.00099         osd.42      up  0.81006          1.00000
> 44   6.00099         osd.44      up  0.65381          1.00000
> 45   6.00099         osd.45      up  0.77457          1.00000
> 46   6.00099         osd.46      up  0.82390          1.00000
> 47   6.00099         osd.47      up  0.85431          1.00000
> 43   6.00099         osd.43      up  0.64775          1.00000
> -6  72.01300     host ceph5                                   
> 48   6.00099         osd.48      up  0.71269          1.00000
> 49   6.00099         osd.49      up  0.97649          1.00000
> 50   6.00099         osd.50      up  0.98079          1.00000
> 51   6.00099         osd.51      up  0.75307          1.00000
> 52   6.00099         osd.52      up  0.86545          1.00000
> 53   6.00099         osd.53      up  0.64278          1.00000
> 54   6.00099         osd.54      up  0.94551          1.00000
> 55   6.00099         osd.55      up  0.73465          1.00000
> 56   6.00099         osd.56      up  0.69908          1.00000
> 57   6.00099         osd.57      up  0.78789          1.00000
> 58   6.00099         osd.58      up  0.89081          1.00000
> 59   6.00099         osd.59      up  0.66379          1.00000


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2017-06-06 14:52 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-12 18:53 global backfill reservation? Sage Weil
2017-05-12 20:49 ` Peter Maloney
2017-05-15 22:02   ` Gregory Farnum
2017-05-16  7:21     ` David Butterfield
2017-05-13 16:55 ` Dan van der Ster
2017-06-02 14:05   ` Peter Maloney
2017-06-02 15:38     ` Sage Weil
2017-06-03 23:11       ` Peter Maloney
2017-06-03  7:51     ` Dan van der Ster
2017-06-03 22:58       ` Peter Maloney
2017-06-06 14:51         ` Peter Maloney
2017-05-20 14:24 ` Ning Yao
2017-05-21  3:34   ` David Butterfield
2017-06-02 21:44   ` LIU, Fei
