* CRUSH puzzle: step weighted-take
@ 2018-09-27 15:18 Dan van der Ster
       [not found] ` <CABZ+qqmMiGSF3g-WR9TLTUAxwdK3aCt0wq+ku5XWzukp0tJ04w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Dan van der Ster @ 2018-09-27 15:18 UTC (permalink / raw)
  To: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

Dear Ceph friends,

I have a CRUSH data migration puzzle and wondered if someone could
think of a clever solution.

Consider an osd tree like this:

  -2       4428.02979     room 0513-R-0050
 -72        911.81897         rack RA01
  -4        917.27899         rack RA05
  -6        917.25500         rack RA09
  -9        786.23901         rack RA13
 -14        895.43903         rack RA17
 -65       1161.16003     room 0513-R-0060
 -71        578.76001         ipservice S513-A-IP38
 -70        287.56000             rack BA09
 -80        291.20001             rack BA10
 -76        582.40002         ipservice S513-A-IP63
 -75        291.20001             rack BA11
 -78        291.20001             rack BA12

In the beginning, for reasons that are not important, we created two pools:
  * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
  * poolB chooses room=0513-R-0060, replicates 2x across the
ipservices, then puts a 3rd replica in room 0513-R-0050.

For clarity, here is the crush rule for poolB:
        type replicated
        min_size 1
        max_size 10
        step take 0513-R-0060
        step chooseleaf firstn 2 type ipservice
        step emit
        step take 0513-R-0050
        step chooseleaf firstn -2 type rack
        step emit

Now to the puzzle.
For reasons that are not important, we now want to change the rule for
poolB to put all three replicas in room 0513-R-0060.
And we need to do this in a way which is totally non-disruptive
(latency-wise) to the users of either pool. (These are both *very*
active RBD pools.)

I see two obvious ways to proceed:
  (1) simply change the rule for poolB to put a third replica on any
osd in room 0513-R-0060. I'm afraid though that this would involve way
too many concurrent backfills, cluster-wide, even with
osd_max_backfills=1.
  (2) change poolB size to 2, then change the crush rule to that from
(1), then reset poolB size to 3. This would risk data availability
during the time that the pool is size=2, and also risks that every osd
in room 0513-R-0050 would be too busy deleting for some indeterminate
time period (10s of minutes, I expect).

So I would probably exclude those two approaches.

Conceptually, what I'd like to be able to do is a gradual migration.
If I may invent some syntax on the fly to illustrate...

Instead of
       step take 0513-R-0050
do
       step weighted-take 99 0513-R-0050 1 0513-R-0060

That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
of the time take room 0513-R-0060.
With a mechanism like that, we could gradually adjust those "step
weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.

I have a feeling that something equivalent to that is already possible
with weight-sets or some other clever crush trickery.
Any ideas?

Best Regards,

Dan

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found] ` <CABZ+qqmMiGSF3g-WR9TLTUAxwdK3aCt0wq+ku5XWzukp0tJ04w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-27 16:33   ` Luis Periquito
       [not found]     ` <CACx0BdPEebcwKH6LU-BSosAJw3Yd_Dsgfn6=N1j=7vvqUP2fow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-09-27 19:50   ` Maged Mokhtar
  2018-09-27 22:50   ` Goncalo Borges
  2 siblings, 1 reply; 9+ messages in thread
From: Luis Periquito @ 2018-09-27 16:33 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

I think your objective is to move the data without anyone else
noticing. What I usually do is reduce the priority of the recovery
process as much as possible. Do note this will make the recovery take
a looong time, and will also make recovery from failures slow...
ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
ceph tell osd.* injectargs '--osd-max-backfills 1'
ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
ceph tell osd.* injectargs '--osd-client-op-priority 63'
ceph tell osd.* injectargs '--osd-recovery-max-active 1'
ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'

I would also assume you have set osd_scrub_during_recovery to false.
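If not, something along these lines should do it (same injectargs style as
above -- double-check the exact option name/format on your version):
ceph tell osd.* injectargs '--osd_scrub_during_recovery false'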



On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster <dan-EOCVfBHj35C+XT7JhA+gdA@public.gmane.org> wrote:
>
> Dear Ceph friends,
>
> I have a CRUSH data migration puzzle and wondered if someone could
> think of a clever solution.
>
> Consider an osd tree like this:
>
>   -2       4428.02979     room 0513-R-0050
>  -72        911.81897         rack RA01
>   -4        917.27899         rack RA05
>   -6        917.25500         rack RA09
>   -9        786.23901         rack RA13
>  -14        895.43903         rack RA17
>  -65       1161.16003     room 0513-R-0060
>  -71        578.76001         ipservice S513-A-IP38
>  -70        287.56000             rack BA09
>  -80        291.20001             rack BA10
>  -76        582.40002         ipservice S513-A-IP63
>  -75        291.20001             rack BA11
>  -78        291.20001             rack BA12
>
> In the beginning, for reasons that are not important, we created two pools:
>   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
>   * poolB chooses room=0513-R-0060, replicates 2x across the
> ipservices, then puts a 3rd replica in room 0513-R-0050.
>
> For clarity, here is the crush rule for poolB:
>         type replicated
>         min_size 1
>         max_size 10
>         step take 0513-R-0060
>         step chooseleaf firstn 2 type ipservice
>         step emit
>         step take 0513-R-0050
>         step chooseleaf firstn -2 type rack
>         step emit
>
> Now to the puzzle.
> For reasons that are not important, we now want to change the rule for
> poolB to put all three 3 replicas in room 0513-R-0060.
> And we need to do this in a way which is totally non-disruptive
> (latency-wise) to the users of either pools. (These are both *very*
> active RBD pools).
>
> I see two obvious ways to proceed:
>   (1) simply change the rule for poolB to put a third replica on any
> osd in room 0513-R-0060. I'm afraid though that this would involve way
> too many concurrent backfills, cluster-wide, even with
> osd_max_backfills=1.
>   (2) change poolB size to 2, then change the crush rule to that from
> (1), then reset poolB size to 3. This would risk data availability
> during the time that the pool is size=2, and also risks that every osd
> in room 0513-R-0050 would be too busy deleting for some indeterminate
> time period (10s of minutes, I expect).
>
> So I would probably exclude those two approaches.
>
> Conceptually what I'd like to be able to do is a gradual migration,
> which if I may invent some syntax on the fly...
>
> Instead of
>        step take 0513-R-0050
> do
>        step weighted-take 99 0513-R-0050 1 0513-R-0060
>
> That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> of the time take room 0513-R-0060.
> With a mechanism like that, we could gradually adjust those "step
> weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
>
> I have a feeling that something equivalent to that is already possible
> with weight-sets or some other clever crush trickery.
> Any ideas?
>
> Best Regards,
>
> Dan
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found] ` <CABZ+qqmMiGSF3g-WR9TLTUAxwdK3aCt0wq+ku5XWzukp0tJ04w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-09-27 16:33   ` Luis Periquito
@ 2018-09-27 19:50   ` Maged Mokhtar
       [not found]     ` <005bdde7-f206-1405-8255-05025c6d0734-6jkX1og9WkpAfugRpC6u6w@public.gmane.org>
  2018-09-27 22:50   ` Goncalo Borges
  2 siblings, 1 reply; 9+ messages in thread
From: Maged Mokhtar @ 2018-09-27 19:50 UTC (permalink / raw)
  To: Dan van der Ster, ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA



On 27/09/18 17:18, Dan van der Ster wrote:
> Dear Ceph friends,
>
> I have a CRUSH data migration puzzle and wondered if someone could
> think of a clever solution.
>
> Consider an osd tree like this:
>
>    -2       4428.02979     room 0513-R-0050
>   -72        911.81897         rack RA01
>    -4        917.27899         rack RA05
>    -6        917.25500         rack RA09
>    -9        786.23901         rack RA13
>   -14        895.43903         rack RA17
>   -65       1161.16003     room 0513-R-0060
>   -71        578.76001         ipservice S513-A-IP38
>   -70        287.56000             rack BA09
>   -80        291.20001             rack BA10
>   -76        582.40002         ipservice S513-A-IP63
>   -75        291.20001             rack BA11
>   -78        291.20001             rack BA12
>
> In the beginning, for reasons that are not important, we created two pools:
>    * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
>    * poolB chooses room=0513-R-0060, replicates 2x across the
> ipservices, then puts a 3rd replica in room 0513-R-0050.
>
> For clarity, here is the crush rule for poolB:
>          type replicated
>          min_size 1
>          max_size 10
>          step take 0513-R-0060
>          step chooseleaf firstn 2 type ipservice
>          step emit
>          step take 0513-R-0050
>          step chooseleaf firstn -2 type rack
>          step emit
>
> Now to the puzzle.
> For reasons that are not important, we now want to change the rule for
> poolB to put all three 3 replicas in room 0513-R-0060.
> And we need to do this in a way which is totally non-disruptive
> (latency-wise) to the users of either pools. (These are both *very*
> active RBD pools).
>
> I see two obvious ways to proceed:
>    (1) simply change the rule for poolB to put a third replica on any
> osd in room 0513-R-0060. I'm afraid though that this would involve way
> too many concurrent backfills, cluster-wide, even with
> osd_max_backfills=1.
>    (2) change poolB size to 2, then change the crush rule to that from
> (1), then reset poolB size to 3. This would risk data availability
> during the time that the pool is size=2, and also risks that every osd
> in room 0513-R-0050 would be too busy deleting for some indeterminate
> time period (10s of minutes, I expect).
>
> So I would probably exclude those two approaches.
>
> Conceptually what I'd like to be able to do is a gradual migration,
> which if I may invent some syntax on the fly...
>
> Instead of
>         step take 0513-R-0050
> do
>         step weighted-take 99 0513-R-0050 1 0513-R-0060
>
> That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> of the time take room 0513-R-0060.
> With a mechanism like that, we could gradually adjust those "step
> weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
>
> I have a feeling that something equivalent to that is already possible
> with weight-sets or some other clever crush trickery.
> Any ideas?
>
> Best Regards,
>
> Dan
> _______________________________________________
> ceph-users mailing list
> ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Would it be possible in your case to create a parent datacenter bucket
to hold both rooms and assign their relative weights there, then for the
third replica do a step take to this parent bucket? It's not elegant, but
it may do the trick.
The suggested step weighted-take would be more flexible, as it could be
changed per replica, but I do not know if you can do this with the
existing code.
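
If you wanted to try the parent-bucket idea, the usual crush map round trip
would do it -- a rough sketch only (the bucket name "dc-thirdcopy" and the
hand-edited weights are placeholders, and I have not tested this for your
exact layout):

ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# edit crush.txt: add a "dc-thirdcopy" bucket with both rooms as items,
# giving them whatever relative weights you want for the 3rd copy, and
# point the last "step take" of the poolB rule at that bucket
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new

You could also dry-run the edited map with crushtool --test before injecting it.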

Maged

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found] ` <CABZ+qqmMiGSF3g-WR9TLTUAxwdK3aCt0wq+ku5XWzukp0tJ04w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2018-09-27 16:33   ` Luis Periquito
  2018-09-27 19:50   ` Maged Mokhtar
@ 2018-09-27 22:50   ` Goncalo Borges
       [not found]     ` <CAL8KHarp1AyOCx5ymHUT4h9d7oJvWVFGh8UY87LXhfF7-heF2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2 siblings, 1 reply; 9+ messages in thread
From: Goncalo Borges @ 2018-09-27 22:50 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users



Hi Dan

Hope this finds you well.

Here goes a suggestion from someone who has been sitting on the sidelines
for the last 2 years but following things as much as possible.

Would a weight set per pool help?

This is only possible in Luminous, but according to the docs there is the
possibility to adjust positional weights for the devices hosting replicas of
objects for a given bucket.
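
From memory of the docs (untested, and "poolB" below is just your pool's name),
it would be something like:

ceph osd crush weight-set create poolB positional
ceph osd crush weight-set reweight poolB <item> <weight> [<weight>...]

where giving several weights lets you set a different weight for each replica
position, so in principle only the 3rd-copy placement would be affected.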

Cheers
Goncalo


_______________________________________________
ceph-users mailing list
ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found]     ` <CACx0BdPEebcwKH6LU-BSosAJw3Yd_Dsgfn6=N1j=7vvqUP2fow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-28  6:59       ` Dan van der Ster
  0 siblings, 0 replies; 9+ messages in thread
From: Dan van der Ster @ 2018-09-28  6:59 UTC (permalink / raw)
  To: Luis Periquito; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 27, 2018 at 6:34 PM Luis Periquito <periquito-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> I think your objective is to move the data without anyone else
> noticing. What I usually do is reduce the priority of the recovery
> process as much as possible. Do note this will make the recovery take
> a looong time, and will also make recovery from failures slow...
> ceph tell osd.* injectargs '--osd_recovery_sleep 0.9'
> ceph tell osd.* injectargs '--osd-max-backfills 1'
> ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
> ceph tell osd.* injectargs '--osd-client-op-priority 63'
> ceph tell osd.* injectargs '--osd-recovery-max-active 1'
> ceph tell osd.* injectargs '--osd_recovery_max_chunk 524288'
>
> I would also assume you have set osd_scrub_during_recovery to false.
>

Thanks Luis -- that will definitely be how we backfill if we go that
route. However I would prefer to avoid one big massive change that
takes a long time to complete.

- dan

>
>
> On Thu, Sep 27, 2018 at 4:19 PM Dan van der Ster <dan-EOCVfBHj35C+XT7JhA+gdA@public.gmane.org> wrote:
> >
> > Dear Ceph friends,
> >
> > I have a CRUSH data migration puzzle and wondered if someone could
> > think of a clever solution.
> >
> > Consider an osd tree like this:
> >
> >   -2       4428.02979     room 0513-R-0050
> >  -72        911.81897         rack RA01
> >   -4        917.27899         rack RA05
> >   -6        917.25500         rack RA09
> >   -9        786.23901         rack RA13
> >  -14        895.43903         rack RA17
> >  -65       1161.16003     room 0513-R-0060
> >  -71        578.76001         ipservice S513-A-IP38
> >  -70        287.56000             rack BA09
> >  -80        291.20001             rack BA10
> >  -76        582.40002         ipservice S513-A-IP63
> >  -75        291.20001             rack BA11
> >  -78        291.20001             rack BA12
> >
> > In the beginning, for reasons that are not important, we created two pools:
> >   * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> >   * poolB chooses room=0513-R-0060, replicates 2x across the
> > ipservices, then puts a 3rd replica in room 0513-R-0050.
> >
> > For clarity, here is the crush rule for poolB:
> >         type replicated
> >         min_size 1
> >         max_size 10
> >         step take 0513-R-0060
> >         step chooseleaf firstn 2 type ipservice
> >         step emit
> >         step take 0513-R-0050
> >         step chooseleaf firstn -2 type rack
> >         step emit
> >
> > Now to the puzzle.
> > For reasons that are not important, we now want to change the rule for
> > poolB to put all three 3 replicas in room 0513-R-0060.
> > And we need to do this in a way which is totally non-disruptive
> > (latency-wise) to the users of either pools. (These are both *very*
> > active RBD pools).
> >
> > I see two obvious ways to proceed:
> >   (1) simply change the rule for poolB to put a third replica on any
> > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > too many concurrent backfills, cluster-wide, even with
> > osd_max_backfills=1.
> >   (2) change poolB size to 2, then change the crush rule to that from
> > (1), then reset poolB size to 3. This would risk data availability
> > during the time that the pool is size=2, and also risks that every osd
> > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > time period (10s of minutes, I expect).
> >
> > So I would probably exclude those two approaches.
> >
> > Conceptually what I'd like to be able to do is a gradual migration,
> > which if I may invent some syntax on the fly...
> >
> > Instead of
> >        step take 0513-R-0050
> > do
> >        step weighted-take 99 0513-R-0050 1 0513-R-0060
> >
> > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > of the time take room 0513-R-0060.
> > With a mechanism like that, we could gradually adjust those "step
> > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> >
> > I have a feeling that something equivalent to that is already possible
> > with weight-sets or some other clever crush trickery.
> > Any ideas?
> >
> > Best Regards,
> >
> > Dan
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found]     ` <005bdde7-f206-1405-8255-05025c6d0734-6jkX1og9WkpAfugRpC6u6w@public.gmane.org>
@ 2018-09-28  7:02       ` Dan van der Ster
       [not found]         ` <CABZ+qq=LgcMKJxvxg0fa9x-gNAR18Yviou5XiriLCOfnEj+w0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Dan van der Ster @ 2018-09-28  7:02 UTC (permalink / raw)
  To: mmokhtar-6jkX1og9WkpAfugRpC6u6w
  Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar <mmokhtar-6jkX1og9WkpAfugRpC6u6w@public.gmane.org> wrote:
>
>
>
> On 27/09/18 17:18, Dan van der Ster wrote:
> > Dear Ceph friends,
> >
> > I have a CRUSH data migration puzzle and wondered if someone could
> > think of a clever solution.
> >
> > Consider an osd tree like this:
> >
> >    -2       4428.02979     room 0513-R-0050
> >   -72        911.81897         rack RA01
> >    -4        917.27899         rack RA05
> >    -6        917.25500         rack RA09
> >    -9        786.23901         rack RA13
> >   -14        895.43903         rack RA17
> >   -65       1161.16003     room 0513-R-0060
> >   -71        578.76001         ipservice S513-A-IP38
> >   -70        287.56000             rack BA09
> >   -80        291.20001             rack BA10
> >   -76        582.40002         ipservice S513-A-IP63
> >   -75        291.20001             rack BA11
> >   -78        291.20001             rack BA12
> >
> > In the beginning, for reasons that are not important, we created two pools:
> >    * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> >    * poolB chooses room=0513-R-0060, replicates 2x across the
> > ipservices, then puts a 3rd replica in room 0513-R-0050.
> >
> > For clarity, here is the crush rule for poolB:
> >          type replicated
> >          min_size 1
> >          max_size 10
> >          step take 0513-R-0060
> >          step chooseleaf firstn 2 type ipservice
> >          step emit
> >          step take 0513-R-0050
> >          step chooseleaf firstn -2 type rack
> >          step emit
> >
> > Now to the puzzle.
> > For reasons that are not important, we now want to change the rule for
> > poolB to put all three 3 replicas in room 0513-R-0060.
> > And we need to do this in a way which is totally non-disruptive
> > (latency-wise) to the users of either pools. (These are both *very*
> > active RBD pools).
> >
> > I see two obvious ways to proceed:
> >    (1) simply change the rule for poolB to put a third replica on any
> > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > too many concurrent backfills, cluster-wide, even with
> > osd_max_backfills=1.
> >    (2) change poolB size to 2, then change the crush rule to that from
> > (1), then reset poolB size to 3. This would risk data availability
> > during the time that the pool is size=2, and also risks that every osd
> > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > time period (10s of minutes, I expect).
> >
> > So I would probably exclude those two approaches.
> >
> > Conceptually what I'd like to be able to do is a gradual migration,
> > which if I may invent some syntax on the fly...
> >
> > Instead of
> >         step take 0513-R-0050
> > do
> >         step weighted-take 99 0513-R-0050 1 0513-R-0060
> >
> > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > of the time take room 0513-R-0060.
> > With a mechanism like that, we could gradually adjust those "step
> > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> >
> > I have a feeling that something equivalent to that is already possible
> > with weight-sets or some other clever crush trickery.
> > Any ideas?
> >
> > Best Regards,
> >
> > Dan
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> would it be possible in your case to create a parent datacenter bucket
> to hold both rooms and assign their relative weights there, then for the
> third replica do a step take to this parent bucket ? its not elegant but
> may do the trick.

Hey, that might work! Both rooms are already in the default root:

  -1       5589.18994 root default
  -2       4428.02979     room 0513-R-0050
 -65       1161.16003     room 0513-R-0060
 -71        578.76001         ipservice S513-A-IP38
 -76        582.40002         ipservice S513-A-IP63

so I'll play with a test pool and weighting down room 0513-R-0060 to
see if this can work.

Thanks!

-- dan

> The suggested step weighted-take would be more flexible as it can be
> changed on a replica level, but i do not know if you can do this with
> existing code.
>
> Maged
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found]     ` <CAL8KHarp1AyOCx5ymHUT4h9d7oJvWVFGh8UY87LXhfF7-heF2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-28  7:04       ` Dan van der Ster
  0 siblings, 0 replies; 9+ messages in thread
From: Dan van der Ster @ 2018-09-28  7:04 UTC (permalink / raw)
  To: goncalofilipeborges-Re5JQEeQqe8AvxtiuMwx3w
  Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA, ceph-users

On Fri, Sep 28, 2018 at 12:51 AM Goncalo Borges
<goncalofilipeborges-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>
> Hi Dan
>
> Hope to find you ok.
>
> Here goes a suggestion from someone who has been sitting in the side line for the last 2 years but following stuff as much as possible
>
> Will weight set per pool help?
>
> This is only possible in luminous but according to the docs there is the possibility to adjust positional weights for devices hosting replicas of objects for a given bucket.

We're running luminous, so weight-sets are indeed in the game.
I need to read the docs in detail to see if it could help...
combining with Maged's idea might be the solution.

Thanks!

dan


> Cheers
> Goncalo
>
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found]         ` <CABZ+qq=LgcMKJxvxg0fa9x-gNAR18Yviou5XiriLCOfnEj+w0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-01 18:09           ` Gregory Farnum
       [not found]             ` <CAJ4mKGYFAgbu0fW_zy7stR5YU7ERG0=jWciFYMmDG_sfZAbUwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Gregory Farnum @ 2018-10-01 18:09 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Ceph Users, ceph-devel

On Fri, Sep 28, 2018 at 12:03 AM Dan van der Ster <dan@vanderster.com> wrote:
>
> On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar <mmokhtar@petasan.org> wrote:
> >
> >
> >
> > On 27/09/18 17:18, Dan van der Ster wrote:
> > > Dear Ceph friends,
> > >
> > > I have a CRUSH data migration puzzle and wondered if someone could
> > > think of a clever solution.
> > >
> > > Consider an osd tree like this:
> > >
> > >    -2       4428.02979     room 0513-R-0050
> > >   -72        911.81897         rack RA01
> > >    -4        917.27899         rack RA05
> > >    -6        917.25500         rack RA09
> > >    -9        786.23901         rack RA13
> > >   -14        895.43903         rack RA17
> > >   -65       1161.16003     room 0513-R-0060
> > >   -71        578.76001         ipservice S513-A-IP38
> > >   -70        287.56000             rack BA09
> > >   -80        291.20001             rack BA10
> > >   -76        582.40002         ipservice S513-A-IP63
> > >   -75        291.20001             rack BA11
> > >   -78        291.20001             rack BA12
> > >
> > > In the beginning, for reasons that are not important, we created two pools:
> > >    * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> > >    * poolB chooses room=0513-R-0060, replicates 2x across the
> > > ipservices, then puts a 3rd replica in room 0513-R-0050.
> > >
> > > For clarity, here is the crush rule for poolB:
> > >          type replicated
> > >          min_size 1
> > >          max_size 10
> > >          step take 0513-R-0060
> > >          step chooseleaf firstn 2 type ipservice
> > >          step emit
> > >          step take 0513-R-0050
> > >          step chooseleaf firstn -2 type rack
> > >          step emit
> > >
> > > Now to the puzzle.
> > > For reasons that are not important, we now want to change the rule for
> > > poolB to put all three 3 replicas in room 0513-R-0060.
> > > And we need to do this in a way which is totally non-disruptive
> > > (latency-wise) to the users of either pools. (These are both *very*
> > > active RBD pools).
> > >
> > > I see two obvious ways to proceed:
> > >    (1) simply change the rule for poolB to put a third replica on any
> > > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > > too many concurrent backfills, cluster-wide, even with
> > > osd_max_backfills=1.
> > >    (2) change poolB size to 2, then change the crush rule to that from
> > > (1), then reset poolB size to 3. This would risk data availability
> > > during the time that the pool is size=2, and also risks that every osd
> > > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > > time period (10s of minutes, I expect).
> > >
> > > So I would probably exclude those two approaches.
> > >
> > > Conceptually what I'd like to be able to do is a gradual migration,
> > > which if I may invent some syntax on the fly...
> > >
> > > Instead of
> > >         step take 0513-R-0050
> > > do
> > >         step weighted-take 99 0513-R-0050 1 0513-R-0060
> > >
> > > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > > of the time take room 0513-R-0060.
> > > With a mechanism like that, we could gradually adjust those "step
> > > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> > >
> > > I have a feeling that something equivalent to that is already possible
> > > with weight-sets or some other clever crush trickery.
> > > Any ideas?
> > >
> > > Best Regards,
> > >
> > > Dan
> > > _______________________________________________
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > would it be possible in your case to create a parent datacenter bucket
> > to hold both rooms and assign their relative weights there, then for the
> > third replica do a step take to this parent bucket ? its not elegant but
> > may do the trick.
>
> Hey, that might work! both rooms are already in the default root:
>
>   -1       5589.18994 root default
>   -2       4428.02979     room 0513-R-0050
>  -65       1161.16003     room 0513-R-0060
>  -71        578.76001         ipservice S513-A-IP38
>  -76        582.40002         ipservice S513-A-IP63
>
> so I'll play with a test pool and weighting down room 0513-R-0060 to
> see if this can work.

I don't think this will work — it will probably change the seed that
is used and mean that the rule tries to move *everything*, not just
the third PG replicas. But perhaps I'm mistaken about the details of
this mechanic...

The crush weighted-take is interesting, but I'm not sure I would want
to do something probabilistic like that in this situation. What we've
discussed before — but *not* implemented or even scheduled, sadly for
you here — is having multiple CRUSH "epochs" active at the same time,
and letting the OSDMap specify a pg as the crossover point from one
CRUSH epoch to the next. (Among other things, this would let us
finally limit the number of backfills in progress at the cluster
level!)

I'm less familiar with the weight-set mechanism, so you might have a
chance there? Mostly though this is just not something RADOS is set up
to do, because we expect the cluster to be able to handle the backfill
you throw at it, once the per-OSD config is correct. (It has become
clear that the per-OSD configs need to do a better prioritization job
if that's ever going to work, or maybe we're just completely wrong
anyway. But obviously it takes more time to change the architecture
and the code to handle it than to just identify there's a problem.)
*sigh*
-Greg

>
> Thanks!
>
> -- dan
>
> > The suggested step weighted-take would be more flexible as it can be
> > changed on a replica level, but i do not know if you can do this with
> > existing code.
> >
> > Maged
> >
> >
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: CRUSH puzzle: step weighted-take
       [not found]             ` <CAJ4mKGYFAgbu0fW_zy7stR5YU7ERG0=jWciFYMmDG_sfZAbUwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-10-02 10:02               ` Dan van der Ster
  0 siblings, 0 replies; 9+ messages in thread
From: Dan van der Ster @ 2018-10-02 10:02 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-users, ceph-devel-u79uwXL29TY76Z2rM5mHXA

On Mon, Oct 1, 2018 at 8:09 PM Gregory Farnum <gfarnum@redhat.com> wrote:
>
> On Fri, Sep 28, 2018 at 12:03 AM Dan van der Ster <dan@vanderster.com> wrote:
> >
> > On Thu, Sep 27, 2018 at 9:57 PM Maged Mokhtar <mmokhtar@petasan.org> wrote:
> > >
> > >
> > >
> > > On 27/09/18 17:18, Dan van der Ster wrote:
> > > > Dear Ceph friends,
> > > >
> > > > I have a CRUSH data migration puzzle and wondered if someone could
> > > > think of a clever solution.
> > > >
> > > > Consider an osd tree like this:
> > > >
> > > >    -2       4428.02979     room 0513-R-0050
> > > >   -72        911.81897         rack RA01
> > > >    -4        917.27899         rack RA05
> > > >    -6        917.25500         rack RA09
> > > >    -9        786.23901         rack RA13
> > > >   -14        895.43903         rack RA17
> > > >   -65       1161.16003     room 0513-R-0060
> > > >   -71        578.76001         ipservice S513-A-IP38
> > > >   -70        287.56000             rack BA09
> > > >   -80        291.20001             rack BA10
> > > >   -76        582.40002         ipservice S513-A-IP63
> > > >   -75        291.20001             rack BA11
> > > >   -78        291.20001             rack BA12
> > > >
> > > > In the beginning, for reasons that are not important, we created two pools:
> > > >    * poolA chooses room=0513-R-0050 then replicates 3x across the racks.
> > > >    * poolB chooses room=0513-R-0060, replicates 2x across the
> > > > ipservices, then puts a 3rd replica in room 0513-R-0050.
> > > >
> > > > For clarity, here is the crush rule for poolB:
> > > >          type replicated
> > > >          min_size 1
> > > >          max_size 10
> > > >          step take 0513-R-0060
> > > >          step chooseleaf firstn 2 type ipservice
> > > >          step emit
> > > >          step take 0513-R-0050
> > > >          step chooseleaf firstn -2 type rack
> > > >          step emit
> > > >
> > > > Now to the puzzle.
> > > > For reasons that are not important, we now want to change the rule for
> > > > poolB to put all three 3 replicas in room 0513-R-0060.
> > > > And we need to do this in a way which is totally non-disruptive
> > > > (latency-wise) to the users of either pools. (These are both *very*
> > > > active RBD pools).
> > > >
> > > > I see two obvious ways to proceed:
> > > >    (1) simply change the rule for poolB to put a third replica on any
> > > > osd in room 0513-R-0060. I'm afraid though that this would involve way
> > > > too many concurrent backfills, cluster-wide, even with
> > > > osd_max_backfills=1.
> > > >    (2) change poolB size to 2, then change the crush rule to that from
> > > > (1), then reset poolB size to 3. This would risk data availability
> > > > during the time that the pool is size=2, and also risks that every osd
> > > > in room 0513-R-0050 would be too busy deleting for some indeterminate
> > > > time period (10s of minutes, I expect).
> > > >
> > > > So I would probably exclude those two approaches.
> > > >
> > > > Conceptually what I'd like to be able to do is a gradual migration,
> > > > which if I may invent some syntax on the fly...
> > > >
> > > > Instead of
> > > >         step take 0513-R-0050
> > > > do
> > > >         step weighted-take 99 0513-R-0050 1 0513-R-0060
> > > >
> > > > That is, 99% of the time take room 0513-R-0050 for the 3rd copies, 1%
> > > > of the time take room 0513-R-0060.
> > > > With a mechanism like that, we could gradually adjust those "step
> > > > weighted-take" lines until 100% of the 3rd copies were in 0513-R-0060.
> > > >
> > > > I have a feeling that something equivalent to that is already possible
> > > > with weight-sets or some other clever crush trickery.
> > > > Any ideas?
> > > >
> > > > Best Regards,
> > > >
> > > > Dan
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > would it be possible in your case to create a parent datacenter bucket
> > > to hold both rooms and assign their relative weights there, then for the
> > > third replica do a step take to this parent bucket ? its not elegant but
> > > may do the trick.
> >
> > Hey, that might work! both rooms are already in the default root:
> >
> >   -1       5589.18994 root default
> >   -2       4428.02979     room 0513-R-0050
> >  -65       1161.16003     room 0513-R-0060
> >  -71        578.76001         ipservice S513-A-IP38
> >  -76        582.40002         ipservice S513-A-IP63
> >
> > so I'll play with a test pool and weighting down room 0513-R-0060 to
> > see if this can work.
>
> I don't think this will work — it will probably change the seed that
> is used and mean that the rule tries to move *everything*, not just
> the third PG replicas. But perhaps I'm mistaken about the details of
> this mechanic...
>

Indeed, osdmaptool indicated that this isn't going to work -- the
first two replicas were the same (a, b), but the third replica was
always (a) again... it seems that two "overlapping" chooseleafs in the
same crush rule will choose the same osd :(
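
(For anyone who wants to repeat that kind of dry run: I mean something along
the lines of the following, where crush.new is a compiled crush map containing
the candidate rule/weights and pool id 2 is just an example:

ceph osd getmap -o osdmap.bin
osdmaptool osdmap.bin --import-crush crush.new --test-map-pgs-dump --pool 2

and then comparing the resulting up sets against the current ones.)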

> The crush weighted-take is interesting, but I'm not sure I would want
> to do something probabilistic like that in this situation. What we've
> discussed before — but *not* implemented or even scheduled, sadly for
> you here — is having multiple CRUSH "epochs" active at the same time,
> and letting the OSDMap specify a pg as the crossover point from one
> CRUSH epoch to the next. (Among other things, this would let us
> finally limit the number of backfills in progress at the cluster
> level!)
>
> I'm less familiar with the weight-set mechanism, so you might have a
> chance there? Mostly though this is just not something RADOS is set up
> to do, because we expect the cluster to be able to handle the backfill
> you throw at it, once the per-OSD config is correct. (It has become
> clear that the per-OSD configs need to do a better prioritization job
> if that's ever going to work, or maybe we're just completely wrong
> anyway. But obviously it takes more time to change the architecture
> and the code to handle it than to just identify there's a problem.)
> *sigh*

In the end we found a good solution using upmap:

1. We created a new crush rule with our desired placement... three
replicas in room 0513-R-0060.
2. We took a snapshot of the PG up mappings.
3. We set norebalance, then changed the crush rule for poolB to the
new rule. (PGs re-peered, with lots of data movement required, but
nothing actually moved yet thanks to norebalance.)
4. We then used a script that calls pg-upmap-items to map each PG back
to its original OSDs. (This put the cluster back in HEALTH_OK.)
5. We now have a cron job removing 10 of the pg-upmap-items entries every
~15 minutes, so data slowly moves into 0513-R-0060.

The upmapping script is a rough implementation of the ceph osd crush
"freeze" idea presented at ye olde CDM, April 2018:
https://pad.ceph.com/p/cephalocon-usability-brainstorming
It seems to be quite useful in this case.
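
For the record, the core of it boils down to something like the following
(the pool name, pool id 2, pg/osd ids and the rule name are all just examples;
the real script reads the before/after up sets from "ceph pg dump"):

# prerequisite for using upmap at all
ceph osd set-require-min-compat-client luminous

# 2. snapshot the current up sets
ceph pg dump pgs --format json > pgs.before.json

# 3. freeze rebalancing, then switch poolB to the new rule
ceph osd set norebalance
ceph osd pool set poolB crush_rule poolB-room-0060-3x

# 4. for each PG whose up set changed, pin the new osd(s) back to the old
#    ones, e.g. for pg 2.7f where osd.250 replaced osd.42:
ceph osd pg-upmap-items 2.7f 250 42
# (once everything is pinned back and the cluster is HEALTH_OK again,
# the norebalance flag can be dropped)
ceph osd unset norebalance

# 5. from cron, remove a handful of the pins every ~15 minutes; each removal
#    lets just that PG backfill to its new home
ceph osd rm-pg-upmap-items 2.7f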

-- Dan


> -Greg
>
> >
> > Thanks!
> >
> > -- dan
> >
> > > The suggested step weighted-take would be more flexible as it can be
> > > changed on a replica level, but i do not know if you can do this with
> > > existing code.
> > >
> > > Maged
> > >
> > >
> > _______________________________________________
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread (newest message: 2018-10-02 10:02 UTC)

Thread overview: 9+ messages
-- links below jump to the message on this page --
2018-09-27 15:18 CRUSH puzzle: step weighted-take Dan van der Ster
     [not found] ` <CABZ+qqmMiGSF3g-WR9TLTUAxwdK3aCt0wq+ku5XWzukp0tJ04w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-27 16:33   ` Luis Periquito
     [not found]     ` <CACx0BdPEebcwKH6LU-BSosAJw3Yd_Dsgfn6=N1j=7vvqUP2fow-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-28  6:59       ` Dan van der Ster
2018-09-27 19:50   ` Maged Mokhtar
     [not found]     ` <005bdde7-f206-1405-8255-05025c6d0734-6jkX1og9WkpAfugRpC6u6w@public.gmane.org>
2018-09-28  7:02       ` Dan van der Ster
     [not found]         ` <CABZ+qq=LgcMKJxvxg0fa9x-gNAR18Yviou5XiriLCOfnEj+w0w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-01 18:09           ` Gregory Farnum
     [not found]             ` <CAJ4mKGYFAgbu0fW_zy7stR5YU7ERG0=jWciFYMmDG_sfZAbUwg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-10-02 10:02               ` Dan van der Ster
2018-09-27 22:50   ` Goncalo Borges
     [not found]     ` <CAL8KHarp1AyOCx5ymHUT4h9d7oJvWVFGh8UY87LXhfF7-heF2g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-28  7:04       ` Dan van der Ster
