* crush devices class types
@ 2017-01-22 16:44 Loic Dachary
  2017-01-23 13:38 ` Wido den Hollander
  2017-01-23 14:12 ` Sage Weil
  0 siblings, 2 replies; 25+ messages in thread
From: Loic Dachary @ 2017-01-22 16:44 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi Sage,

You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD-only pools. After years of practice it feels more natural :-)

The source of my confusion was mostly that I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssds that contain devices" is counterintuitive. Changing:

    # devices
    device 0 osd.0
    device 1 osd.1
    device 2 osd.2
    device 3 osd.3

into:

    # devices
    device 0 osd.0 ssd
    device 1 osd.1 ssd
    device 2 osd.2 hdd
    device 3 osd.3 hdd

where ssd/hdd is the device class would be much better. However, using the device class like so:

    rule ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default:ssd
            step chooseleaf firstn 0 type host
            step emit
    }

looks arcane. Since the goal is to simplify the description for the first-time user, maybe we could have something like:

    rule ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            device class = ssd
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }

What do you think?

Cheers

[1] http://pad.ceph.com/p/crush-types and http://tracker.ceph.com/projects/ceph/wiki/CDM_01-FEB-2017

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush devices class types
  2017-01-22 16:44 crush devices class types Loic Dachary
@ 2017-01-23 13:38 ` Wido den Hollander
  2017-01-23 14:29   ` Sage Weil
  2017-01-23 14:12 ` Sage Weil
  1 sibling, 1 reply; 25+ messages in thread
From: Wido den Hollander @ 2017-01-23 13:38 UTC (permalink / raw)
  To: Sage Weil, Loic Dachary; +Cc: Ceph Development


> On 22 January 2017 at 17:44, Loic Dachary <loic@dachary.org> wrote:
> 
> [...]
> 
>     rule ssd {
>             ruleset 1
>             type replicated
>             min_size 1
>             max_size 10
>             device class = ssd

Would that be sane?

Why not:

step set-class ssd
step take default
step chooseleaf firstn 0 type host
step emit

Since it's a 'step' you take, am I right?

Wido

>             step take default
>             step chooseleaf firstn 0 type host
>             step emit
>     }

* Re: crush devices class types
  2017-01-22 16:44 crush devices class types Loic Dachary
  2017-01-23 13:38 ` Wido den Hollander
@ 2017-01-23 14:12 ` Sage Weil
  1 sibling, 0 replies; 25+ messages in thread
From: Sage Weil @ 2017-01-23 14:12 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Sun, 22 Jan 2017, Loic Dachary wrote:
> [...]
> 
>     rule ssd {
>             ruleset 1
>             type replicated
>             min_size 1
>             max_size 10
>             device class = ssd
>             step take default
>             step chooseleaf firstn 0 type host
>             step emit
>     }
> 
> What do you think ?

Yes! Maybe adjust it to be consistent with the other directives, like

	device_class ssd

We could also adjust the naming of the generated nodes to be something 
like ssd@default instead of default:ssd, as that reads semi-intuitively?
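
For example, putting the directive and the naming together, the rule above 
might then read (a sketch only, not final syntax):

    rule ssd {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            device_class ssd
            # with device_class set, the compiler would rewrite this to
            # take the generated ssd@default bucket
            step take default
            step chooseleaf firstn 0 type host
            step emit
    }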

sage


* Re: crush devices class types
  2017-01-23 13:38 ` Wido den Hollander
@ 2017-01-23 14:29   ` Sage Weil
  2017-01-23 14:41     ` Loic Dachary
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2017-01-23 14:29 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Loic Dachary, Ceph Development

On Mon, 23 Jan 2017, Wido den Hollander wrote:
> > [...]
> 
> Would that be sane?
> 
> Why not:
> 
> step set-class ssd
> step take default
> step chooseleaf firstn 0 type host
> step emit
> 
> Since it's a 'step' you take, am I right?

Good idea... a step is a cleaner way to extend the syntax!

sage


* Re: crush devices class types
  2017-01-23 14:29   ` Sage Weil
@ 2017-01-23 14:41     ` Loic Dachary
  2017-02-02 20:57       ` Sage Weil
  0 siblings, 1 reply; 25+ messages in thread
From: Loic Dachary @ 2017-01-23 14:41 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Ceph Development

Hi Wido,

Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax

Cheers

On 01/23/2017 03:29 PM, Sage Weil wrote:
> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>> [...]
>>
>> Would that be sane?
>>
>> Why not:
>>
>> step set-class ssd
>> step take default
>> step chooseleaf firstn 0 type host
>> step emit
>>
>> Since it's a 'step' you take, am I right?
> 
> Good idea... a step is a cleaner way to extend the syntax!
> 
> sage

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush devices class types
  2017-01-23 14:41     ` Loic Dachary
@ 2017-02-02 20:57       ` Sage Weil
  2017-02-03 10:52         ` Wido den Hollander
                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Sage Weil @ 2017-02-02 20:57 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Wido den Hollander, Ceph Development


Hi everyone,

I made more updates to http://pad.ceph.com/p/crush-types after the CDM 
discussion yesterday:

- consolidated notes into a single proposal
- use an otherwise illegal character (e.g., ~) as the separator for generated 
buckets.  This avoids ambiguity with user-defined buckets.
- class-id $class $id properties for each bucket.  This allows us to 
preserve the derivative bucket ids across a decompile->compile cycle so 
that data does not move (the bucket id is one of many inputs into crush's 
hash during placement).
- simpler rule syntax:

    rule ssd {
            ruleset 1
            step take default class ssd
            step chooseleaf firstn 0 type host
            step emit
    }

My rationale here is that we don't want to make this a separate 'step' 
call since steps map to underlying crush rule step ops, and this is a 
directive only to the compiler.  Making it an optional step argument seems 
like the cleanest way to do that.
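
For example (names, ids and weights made up), under one possible reading of 
the class-id property, a host bucket in the decompiled map could carry:

    host myhost {
            id -2
            # class-id pins the ids of the generated myhost~ssd and
            # myhost~hdd buckets so a decompile->compile cycle does
            # not move data
            class-id ssd -21
            class-id hdd -22
            alg straw
            hash 0
            item osd.0 weight 1.000
            item osd.2 weight 1.000
    }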

Any other comments before we kick this off?

Thanks!
sage


On Mon, 23 Jan 2017, Loic Dachary wrote:

> Hi Wido,
> 
> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
> 
> Cheers

* Re: crush devices class types
  2017-02-02 20:57       ` Sage Weil
@ 2017-02-03 10:52         ` Wido den Hollander
  2017-02-03 10:57           ` John Spray
  2017-06-28 17:54           ` Kyle Bader
  2017-02-03 11:24         ` John Spray
  2017-02-03 12:22         ` Loic Dachary
  2 siblings, 2 replies; 25+ messages in thread
From: Wido den Hollander @ 2017-02-03 10:52 UTC (permalink / raw)
  To: Sage Weil, Loic Dachary; +Cc: Ceph Development


> On 2 February 2017 at 21:57, Sage Weil <sage@newdream.net> wrote:
> 
> [...]
> 
> Any other comments before we kick this off?
> 

No, looks good to me! I like combining the class into the 'step'.

Would be very nice to have this in L!

It would also be interesting if OSD daemons could somehow access this while parsing their configuration.

E.g.:

[class.ssd]
  osd_op_threads = 16

[class.hdd]
  osd_max_backfills = 1

That way you can keep the configuration generic, which makes config management a lot easier.

Wido


* Re: crush devices class types
  2017-02-03 10:52         ` Wido den Hollander
@ 2017-02-03 10:57           ` John Spray
  2017-02-03 12:23             ` Wido den Hollander
  2017-06-28 17:54           ` Kyle Bader
  1 sibling, 1 reply; 25+ messages in thread
From: John Spray @ 2017-02-03 10:57 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Sage Weil, Loic Dachary, Ceph Development

On Fri, Feb 3, 2017 at 10:52 AM, Wido den Hollander <wido@42on.com> wrote:
>
>> [...]
>
> No, looks good to me! I like combining the class into the 'step'.
>
> Would be very nice to have this in L!
>
> It would also be interesting if OSD daemons could somehow access this while parsing their configuration.
>
> E.g.:
>
> [class.ssd]
>   osd_op_threads = 16
>
> [class.hdd]
>   osd_max_backfills = 1
>
> That way you can keep the configuration generic, which makes config management a lot easier.

I think there's a general desirable concept of applying configs
according to a CRUSH selector, which would tie in nicely with hosting
configs on the mons: rather than the OSD doing the filtering, it would
just be sent the proper configuration for its location.

John


* Re: crush devices class types
  2017-02-02 20:57       ` Sage Weil
  2017-02-03 10:52         ` Wido den Hollander
@ 2017-02-03 11:24         ` John Spray
  2017-02-03 12:22         ` Loic Dachary
  2 siblings, 0 replies; 25+ messages in thread
From: John Spray @ 2017-02-03 11:24 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, Wido den Hollander, Ceph Development

On Thu, Feb 2, 2017 at 8:57 PM, Sage Weil <sage@newdream.net> wrote:
> [...]
>
> Any other comments before we kick this off?

I wonder if it's worth specifying how commands/settings would refer to
things within the class-specific hierarchies, for commands that want
to act on populations of drives.  I think with hosts it's
straightforward (myhost~ssd for the ssds, or just myhost for all
drives), but for all drives "root~ssd" is kind of awkward for users
who have never thought about the upper levels of the tree.  Maybe a
shorthand of "~ssd" to refer to the root?

John


* Re: crush devices class types
  2017-02-02 20:57       ` Sage Weil
  2017-02-03 10:52         ` Wido den Hollander
  2017-02-03 11:24         ` John Spray
@ 2017-02-03 12:22         ` Loic Dachary
  2017-02-03 12:46           ` John Spray
  2 siblings, 1 reply; 25+ messages in thread
From: Loic Dachary @ 2017-02-03 12:22 UTC (permalink / raw)
  To: Sage Weil; +Cc: Ceph Development

Hi,

Reading Wido's and John's comments I thought of something; not sure if it's a good idea or not. Here it is anyway ;-)

The device class problem we're trying to solve is one instance of a more general need: producing crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for it. But maybe we could instead implement it as a crushmap generator.

Let's say you want help creating the hierarchies to implement the ssd/hdd separation. You write your crushmap using the proposed syntax, but instead of feeding it directly to crushtool -c, you would do something like:

   crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap

The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also inspect the generated crushmap by themselves.

Cons:

* the users need to be aware of the transformation step and be able to read and understand the generated result.
* it could look like it's not part of the standard way of doing things, i.e. that it's a hack.

Pros:

* it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
* it can be implemented using python to lower the barrier of entry

I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case, nobody writes any other plugin, but that does not make this one less useful.
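
To give an idea, the core of such a plugin could start from a small Python 
filter like this one (a sketch only: it shows the device grouping step, not 
the full bucket rewriting):

    #!/usr/bin/env python
    # Sketch of the first step of a 'device-class' transformation.
    # Reads a crushmap text with "device <id> osd.<id> <class>" lines on
    # stdin and prints the per-class device lists from which per-class
    # buckets (root_ssd, root_hdd, ...) would be generated.
    import sys
    from collections import defaultdict

    devices_by_class = defaultdict(list)
    for line in sys.stdin:
        fields = line.split()
        # e.g. "device 0 osd.0 ssd" -> class "ssd", device "osd.0"
        if len(fields) == 4 and fields[0] == 'device':
            devices_by_class[fields[3]].append(fields[2])

    for cls, devices in sorted(devices_by_class.items()):
        print('# class %s: %s' % (cls, ', '.join(devices)))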

Cheers

On 02/02/2017 09:57 PM, Sage Weil wrote:
> [...]

-- 
Loïc Dachary, Artisan Logiciel Libre


* Re: crush devices class types
  2017-02-03 10:57           ` John Spray
@ 2017-02-03 12:23             ` Wido den Hollander
  0 siblings, 0 replies; 25+ messages in thread
From: Wido den Hollander @ 2017-02-03 12:23 UTC (permalink / raw)
  To: John Spray; +Cc: Sage Weil, Loic Dachary, Ceph Development


> On 3 February 2017 at 11:57, John Spray <jspray@redhat.com> wrote:
> 
> [...]
> 
> I think there's a general desirable concept of applying configs
> according to a CRUSH selector, which would tie in nicely with hosting
> configs on the mons: rather than the OSD doing the filtering, it would
> just be sent the proper configuration for its location.
> 

I would love to see the configs hosted on the MONs instead of a local config file :)

Just wanted to suggest this, as in the current situation it is still difficult to have different configs for SSD- and HDD-based OSDs.

Wido


* Re: crush devices class types
  2017-02-03 12:22         ` Loic Dachary
@ 2017-02-03 12:46           ` John Spray
  2017-02-03 12:52             ` Brett Niver
  2017-02-03 13:21             ` Loic Dachary
  0 siblings, 2 replies; 25+ messages in thread
From: John Spray @ 2017-02-03 12:46 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Sage Weil, Ceph Development

On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
> [...]
>
> Pros:
>
> * it can inspire people to implement other crushmap transformation / generators (an alternative, simpler, syntax comes to mind ;-)
> * it can be implemented using python to lower the barrier of entry
>
> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario nobody write any plugin but that does not make this one plugin less useful.

I think this is basically the alternative approach that Sam was
suggesting during CDM: the idea of layering a new (perhaps very
similar) syntax on top of the existing one, instead of extending the
existing one directly.

The main argument against doing that was the complexity, not just of
implementation but for users, who would now potentially have two
separate sets of commands, one operating on the "high level" map
(which would have a "myhost" object in it), and one operating on the
native crush map (which would only have myhost~ssd, myhost~hdd
entries, and would have no concept that a thing called myhost
existed).

As for implementing other generators, the trouble with that is that
the resulting conventions would be unknown to other tools, and to any
commands built into Ceph.  We *really* need a variant of "set noout"
that operates on a crush subtree (typically a host), as it's the sane
way to let people temporarily mark some OSDs while they
reboot/upgrade a host, but to implement that command we have to have
an unambiguous way of identifying which buckets in the crush map
belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
whatever), it needs to be defined and built into Ceph in order to be
interoperable.

John

* Re: crush devices class types
  2017-02-03 12:46           ` John Spray
@ 2017-02-03 12:52             ` Brett Niver
  2017-02-03 13:21             ` Loic Dachary
  1 sibling, 0 replies; 25+ messages in thread
From: Brett Niver @ 2017-02-03 12:52 UTC (permalink / raw)
  To: John Spray; +Cc: Loic Dachary, Sage Weil, Ceph Development

+1
I think there may be multiple reasons to have a generator.

On Fri, Feb 3, 2017 at 7:46 AM, John Spray <jspray@redhat.com> wrote:
> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi,
>>
>> Reading Wido & John comments I thought of something, not sure if that's a good idea or not. Here it is anyway ;-)
>>
>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that to be a crush table generator.
>>
>> Let say you want help to create the hierarchies to implement the ssd/hdd separation, you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>
>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>
>> The 'device-class' transformation documents the naming conventions so the user knows root will generate root_ssd and root_hdd. And the users can also check by themselves the generated crushmap.
>>
>> Cons:
>>
>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>
>> Pros:
>>
>> * it can inspire people to implement other crushmap transformation / generators (an alternative, simpler, syntax comes to mind ;-)
>> * it can be implemented using python to lower the barrier of entry
>>
>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario nobody write any plugin but that does not make this one plugin less useful.
>
> I think this is basically the alternative approach that Sam was
> suggesting during CDM: the idea of layering a new (perhaps very
> similar) syntax on top of the existing one, instead of extending the
> existing one directly.
>
> The main argument against doing that was the complexity, not just of
> implementation but for users, who would now potentially have two
> separate sets of commands, one operating on the "high level" map
> (which would have a "myhost" object in it), and one operating on the
> native crush map (which would only have myhost~ssd, myhost~hdd
> entries, and would have no concept that a thing called myhost
> existed).
>
> As for implemetning other generators, the trouble with that is that
> the resulting conventions would be unknown to other tools, and to any
> commands built in to Ceph.  We *really* need a variant of "set noout"
> that operates on a crush subtree (typically a host), as it's the sane
> way to get people to temporarily mark some OSDs while they
> reboot/upgrade a host, but to implement that command we have to have
> an unambiguous way of identifying which buckets in the crush map
> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
> whatever), it needs to be defined and built into Ceph in order to be
> interoperable.
>
> John
>
>
>
>> Cheers
>>
>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>> discussion yesterday:
>>>
>>> - consolidated notes into a single proposal
>>> - use otherwise illegal character (e.g., ~) as separater for generated
>>> buckets.  This avoids ambiguity with user-defined buckets.
>>> - class-id $class $id properties for each bucket.  This allows us to
>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>> that data does not move (the bucket id is one of many inputs into crush's
>>> hash during placement).
>>> - simpler rule syntax:
>>>
>>>     rule ssd {
>>>             ruleset 1
>>>             step take default class ssd
>>>             step chooseleaf firstn 0 type host
>>>             step emit
>>>     }
>>>
>>> My rationale here is that we don't want to make this a separate 'step'
>>> call since steps map to underlying crush rule step ops, and this is a
>>> directive only to the compiler.  Making it an optional step argument seems
>>> like the cleanest way to do that.
>>>
>>> Any other comments before we kick this off?
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>
>>>> Hi Wido,
>>>>
>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>
>>>> Cheers
>>>>
>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>
>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>
>>>>>>>     # devices
>>>>>>>     device 0 osd.0
>>>>>>>     device 1 osd.1
>>>>>>>     device 2 osd.2
>>>>>>>     device 3 osd.3
>>>>>>>
>>>>>>> into:
>>>>>>>
>>>>>>>     # devices
>>>>>>>     device 0 osd.0 ssd
>>>>>>>     device 1 osd.1 ssd
>>>>>>>     device 2 osd.2 hdd
>>>>>>>     device 3 osd.3 hdd
>>>>>>>
>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>
>>>>>>>     rule ssd {
>>>>>>>             ruleset 1
>>>>>>>             type replicated
>>>>>>>             min_size 1
>>>>>>>             max_size 10
>>>>>>>             step take default:ssd
>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>             step emit
>>>>>>>     }
>>>>>>>
>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>
>>>>>>>     rule ssd {
>>>>>>>             ruleset 1
>>>>>>>             type replicated
>>>>>>>             min_size 1
>>>>>>>             max_size 10
>>>>>>>             device class = ssd
>>>>>>
>>>>>> Would that be sane?
>>>>>>
>>>>>> Why not:
>>>>>>
>>>>>> step set-class ssd
>>>>>> step take default
>>>>>> step chooseleaf firstn 0 type host
>>>>>> step emit
>>>>>>
>>>>>> Since it's a 'step' you take, am I right?
>>>>>
>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>
>>>>> sage
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-02-03 12:46           ` John Spray
  2017-02-03 12:52             ` Brett Niver
@ 2017-02-03 13:21             ` Loic Dachary
  2017-02-15 11:57               ` John Spray
  1 sibling, 1 reply; 25+ messages in thread
From: Loic Dachary @ 2017-02-03 13:21 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development



On 02/03/2017 01:46 PM, John Spray wrote:
> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi,
>>
>> Reading Wido's & John's comments I thought of something, not sure if it's a good idea or not. Here it is anyway ;-)
>>
>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that as a crush table generator.
>>
>> Let's say you want help to create the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>
>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>
>> The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also check the generated crushmap for themselves.
>>
>> Cons:
>>
>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>
>> Pros:
>>
>> * it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
>> * it can be implemented using python to lower the barrier of entry
>>
>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario, nobody writes any plugin, but that does not make this one plugin less useful.
> 
> I think this is basically the alternative approach that Sam was
> suggesting during CDM: the idea of layering a new (perhaps very
> similar) syntax on top of the existing one, instead of extending the
> existing one directly.

Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below. Please bear in mind that I'm not sure this is the way to go, even though I'm writing as if I am.

> The main argument against doing that was the complexity, not just of
> implementation but for users, who would now potentially have two
> separate sets of commands, one operating on the "high level" map
> (which would have a "myhost" object in it), and one operating on the
> native crush map (which would only have myhost~ssd, myhost~hdd
> entries, and would have no concept that a thing called myhost
> existed).

As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator, because I already know how crush works. I'll welcome the help and be relieved that I don't have to do that manually anymore. But having that as a native syntax may be a little uncomfortable for me, because I will want to verify that the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to adjust one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be spared from understanding the transformation step, and the syntactic sugar may even make that more complicated to grasp.

If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices. 
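
To make sure we are talking about the same thing, here is a sketch of what I imagine the generated map would contain for a host with two ssd and two hdd devices (bucket names, ids and weights are made up, and I left out the alg/hash lines):

    host myhost~ssd {
            id -11          # class-id ssd -11
            item osd.0 weight 1.000
            item osd.1 weight 1.000
    }
    host myhost~hdd {
            id -12          # class-id hdd -12
            item osd.2 weight 1.000
            item osd.3 weight 1.000
    }
    host myhost {
            id -2
            item osd.0 weight 1.000
            item osd.1 weight 1.000
            item osd.2 weight 1.000
            item osd.3 weight 1.000
    }

A rule doing "step take default" would then walk the hierarchy that contains myhost, while "step take default class ssd" would be compiled to take from the hierarchy that contains myhost~ssd, whose weights only count the ssd devices.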

> As for implementing other generators, the trouble with that is that
> the resulting conventions would be unknown to other tools, and to any
> commands built in to Ceph. 

Yes. But do we really want to insert the concept of "device class" in Ceph? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction?

> We *really* need a variant of "set noout"
> that operates on a crush subtree (typically a host), as it's the sane
> way to get people to temporarily mark some OSDs while they
> reboot/upgrade a host, but to implement that command we have to have
> an unambiguous way of identifying which buckets in the crush map
> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
> whatever), it needs to be defined and built into Ceph in order to be
> interoperable.

That goes back (above) to my understanding of Sage's proposal (which I may have gotten wrong?), in which the host bucket still exists and still contains all devices regardless of their class.

Cheers

> 
> John
> 
> 
> 
>> Cheers
>>
>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>> Hi everyone,
>>>
>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>> discussion yesterday:
>>>
>>> - consolidated notes into a single proposal
>>> - use an otherwise illegal character (e.g., ~) as separator for generated
>>> buckets.  This avoids ambiguity with user-defined buckets.
>>> - class-id $class $id properties for each bucket.  This allows us to
>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>> that data does not move (the bucket id is one of many inputs into crush's
>>> hash during placement).
>>> - simpler rule syntax:
>>>
>>>     rule ssd {
>>>             ruleset 1
>>>             step take default class ssd
>>>             step chooseleaf firstn 0 type host
>>>             step emit
>>>     }
>>>
>>> My rationale here is that we don't want to make this a separate 'step'
>>> call since steps map to underlying crush rule step ops, and this is a
>>> directive only to the compiler.  Making it an optional step argument seems
>>> like the cleanest way to do that.
>>>
>>> Any other comments before we kick this off?
>>>
>>> Thanks!
>>> sage
>>>
>>>
>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>
>>>> Hi Wido,
>>>>
>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>
>>>> Cheers
>>>>
>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>
>>>>>>>
>>>>>>> Hi Sage,
>>>>>>>
>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>
>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>
>>>>>>>     # devices
>>>>>>>     device 0 osd.0
>>>>>>>     device 1 osd.1
>>>>>>>     device 2 osd.2
>>>>>>>     device 3 osd.3
>>>>>>>
>>>>>>> into:
>>>>>>>
>>>>>>>     # devices
>>>>>>>     device 0 osd.0 ssd
>>>>>>>     device 1 osd.1 ssd
>>>>>>>     device 2 osd.2 hdd
>>>>>>>     device 3 osd.3 hdd
>>>>>>>
>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>
>>>>>>>     rule ssd {
>>>>>>>             ruleset 1
>>>>>>>             type replicated
>>>>>>>             min_size 1
>>>>>>>             max_size 10
>>>>>>>             step take default:ssd
>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>             step emit
>>>>>>>     }
>>>>>>>
>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>
>>>>>>>     rule ssd {
>>>>>>>             ruleset 1
>>>>>>>             type replicated
>>>>>>>             min_size 1
>>>>>>>             max_size 10
>>>>>>>             device class = ssd
>>>>>>
>>>>>> Would that be sane?
>>>>>>
>>>>>> Why not:
>>>>>>
>>>>>> step set-class ssd
>>>>>> step take default
>>>>>> step chooseleaf firstn 0 type host
>>>>>> step emit
>>>>>>
>>>>>> Since it's a 'step' you take, am I right?
>>>>>
>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>
>>>>> sage
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-02-03 13:21             ` Loic Dachary
@ 2017-02-15 11:57               ` John Spray
  2017-02-15 12:14                 ` Loic Dachary
  0 siblings, 1 reply; 25+ messages in thread
From: John Spray @ 2017-02-15 11:57 UTC (permalink / raw)
  To: Loic Dachary; +Cc: Ceph Development

On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
>
>
> On 02/03/2017 01:46 PM, John Spray wrote:
>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>>> Hi,
>>>
>>> Reading Wido's & John's comments I thought of something, not sure if it's a good idea or not. Here it is anyway ;-)
>>>
>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that as a crush table generator.
>>>
>>> Let's say you want help to create the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>>
>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>>
>>> The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also check the generated crushmap for themselves.
>>>
>>> Cons:
>>>
>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>>
>>> Pros:
>>>
>>> * it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
>>> * it can be implemented using python to lower the barrier of entry
>>>
>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario, nobody writes any plugin, but that does not make this one plugin less useful.
>>
>> I think this is basically the alternative approach that Sam was
>> suggesting during CDM: the idea of layering a new (perhaps very
>> similar) syntax on top of the existing one, instead of extending the
>> existing one directly.
>
> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below. Please bear in mind that I'm not sure this is the way to go, even though I'm writing as if I am.
>
>> The main argument against doing that was the complexity, not just of
>> implementation but for users, who would now potentially have two
>> separate sets of commands, one operating on the "high level" map
>> (which would have a "myhost" object in it), and one operating on the
>> native crush map (which would only have myhost~ssd, myhost~hdd
>> entries, and would have no concept that a thing called myhost
>> existed).
>
> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator, because I already know how crush works. I'll welcome the help and be relieved that I don't have to do that manually anymore. But having that as a native syntax may be a little uncomfortable for me, because I will want to verify that the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to adjust one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be spared from understanding the transformation step, and the syntactic sugar may even make that more complicated to grasp.
>
> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.

(Sorry this response is so late)

I think the extra work is not so much in the formats, as it is
exposing that syntax via all the commands that we have, and/or new
commands.  We would either need two lots of commands, or we would need
to pick one layer (the 'generator' or the native one) for the
commands, and treat the other layer as a hidden thing.

It's also not just the extra work of implementing the commands/syntax; it's the extra complexity that ends up being exposed to users.

>
>> As for implementing other generators, the trouble with that is that
>> the resulting conventions would be unknown to other tools, and to any
>> commands built in to Ceph.
>
> Yes. But do we really want to insert the concept of "device class" in Ceph? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction?

Currently, if someone has done the manual stuff to set up SSD/HDD
crush trees, any external tool has no way of knowing that two hosts
(one ssd, one hdd) are actually the same host.  That's the key thing
here for me -- the time saving during setup is a nice side effect, but
the primary value of having a Ceph-defined way to do this is that
every tool building on Ceph can rely on it.



>> We *really* need a variant of "set noout"
>> that operates on a crush subtree (typically a host), as it's the sane
>> way to get people to temporarily mark some OSDs while they
>> reboot/upgrade a host, but to implement that command we have to have
>> an unambiguous way of identifying which buckets in the crush map
>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
>> whatever), it needs to be defined and built into Ceph in order to be
>> interoperable.
>
> That goes back (above) to my understanding of Sage's proposal (which I may have gotten wrong?), in which the host bucket still exists and still contains all devices regardless of their class.

In Sage's proposal as I understand it, there's an underlying native
crush map that uses today's format (i.e. clients need no upgrade),
which is generated in response to either commands that edit the map,
or the user inputting a modified map in the text format.  That
conversion would follow pretty simple rules (assuming a host 'myhost'
with ssd and hdd devices):
 * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
 * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
 * When running a CLI command, something targeting 'myhost' will
target both 'myhost~hdd' and 'myhost~ssd'

It's that last part that probably isn't captured properly by something
external that does a syntax conversion during import/export.
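
To make that last point concrete (the command name below is hypothetical, the point is only about how names get resolved):

    # the operator types:
    ceph osd set-noout-subtree myhost
    # and Ceph has to expand that internally to both generated buckets:
    #   myhost~ssd -> osd.0, osd.1
    #   myhost~hdd -> osd.2, osd.3

A filter that only rewrites the map text during import/export is not around when such a command runs, so the myhost <-> myhost~* relationship has to be known to Ceph itself.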

John

> Cheers
>
>>
>> John
>>
>>
>>
>>> Cheers
>>>
>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>>> Hi everyone,
>>>>
>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>>> discussion yesterday:
>>>>
>>>> - consolidated notes into a single proposal
>>>> - use an otherwise illegal character (e.g., ~) as separator for generated
>>>> buckets.  This avoids ambiguity with user-defined buckets.
>>>> - class-id $class $id properties for each bucket.  This allows us to
>>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>>> that data does not move (the bucket id is one of many inputs into crush's
>>>> hash during placement).
>>>> - simpler rule syntax:
>>>>
>>>>     rule ssd {
>>>>             ruleset 1
>>>>             step take default class ssd
>>>>             step chooseleaf firstn 0 type host
>>>>             step emit
>>>>     }
>>>>
>>>> My rationale here is that we don't want to make this a separate 'step'
>>>> call since steps map to underlying crush rule step ops, and this is a
>>>> directive only to the compiler.  Making it an optional step argument seems
>>>> like the cleanest way to do that.
>>>>
>>>> Any other comments before we kick this off?
>>>>
>>>> Thanks!
>>>> sage
>>>>
>>>>
>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>>
>>>>> Hi Wido,
>>>>>
>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>>
>>>>> Cheers
>>>>>
>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Sage,
>>>>>>>>
>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>>
>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>>
>>>>>>>>     # devices
>>>>>>>>     device 0 osd.0
>>>>>>>>     device 1 osd.1
>>>>>>>>     device 2 osd.2
>>>>>>>>     device 3 osd.3
>>>>>>>>
>>>>>>>> into:
>>>>>>>>
>>>>>>>>     # devices
>>>>>>>>     device 0 osd.0 ssd
>>>>>>>>     device 1 osd.1 ssd
>>>>>>>>     device 2 osd.2 hdd
>>>>>>>>     device 3 osd.3 hdd
>>>>>>>>
>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>>
>>>>>>>>     rule ssd {
>>>>>>>>             ruleset 1
>>>>>>>>             type replicated
>>>>>>>>             min_size 1
>>>>>>>>             max_size 10
>>>>>>>>             step take default:ssd
>>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>>             step emit
>>>>>>>>     }
>>>>>>>>
>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>>
>>>>>>>>     rule ssd {
>>>>>>>>             ruleset 1
>>>>>>>>             type replicated
>>>>>>>>             min_size 1
>>>>>>>>             max_size 10
>>>>>>>>             device class = ssd
>>>>>>>
>>>>>>> Would that be sane?
>>>>>>>
>>>>>>> Why not:
>>>>>>>
>>>>>>> step set-class ssd
>>>>>>> step take default
>>>>>>> step chooseleaf firstn 0 type host
>>>>>>> step emit
>>>>>>>
>>>>>>> Since it's a 'step' you take, am I right?
>>>>>>
>>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>>
>>>>>> sage
>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-02-15 11:57               ` John Spray
@ 2017-02-15 12:14                 ` Loic Dachary
  2017-03-08  9:42                   ` Dan van der Ster
  0 siblings, 1 reply; 25+ messages in thread
From: Loic Dachary @ 2017-02-15 12:14 UTC (permalink / raw)
  To: John Spray; +Cc: Ceph Development

Hi John,

Thanks for the discussion :-) I'll start implementing the proposal as described originally.

Cheers

On 02/15/2017 12:57 PM, John Spray wrote:
> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>
>>
>> On 02/03/2017 01:46 PM, John Spray wrote:
>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>>>> Hi,
>>>>
>>>> Reading Wido's & John's comments I thought of something, not sure if it's a good idea or not. Here it is anyway ;-)
>>>>
>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that as a crush table generator.
>>>>
>>>> Let's say you want help to create the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>>>
>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>>>
>>>> The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also check the generated crushmap for themselves.
>>>>
>>>> Cons:
>>>>
>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>>>
>>>> Pros:
>>>>
>>>> * it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
>>>> * it can be implemented using python to lower the barrier of entry
>>>>
>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario, nobody writes any plugin, but that does not make this one plugin less useful.
>>>
>>> I think this is basically the alternative approach that Sam was
>>> suggesting during CDM: the idea of layering a new (perhaps very
>>> similar) syntax on top of the existing one, instead of extending the
>>> existing one directly.
>>
>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below. Please bear in mind that I'm not sure this is the way to go, even though I'm writing as if I am.
>>
>>> The main argument against doing that was the complexity, not just of
>>> implementation but for users, who would now potentially have two
>>> separate sets of commands, one operating on the "high level" map
>>> (which would have a "myhost" object in it), and one operating on the
>>> native crush map (which would only have myhost~ssd, myhost~hdd
>>> entries, and would have no concept that a thing called myhost
>>> existed).
>>
>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator, because I already know how crush works. I'll welcome the help and be relieved that I don't have to do that manually anymore. But having that as a native syntax may be a little uncomfortable for me, because I will want to verify that the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to adjust one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be spared from understanding the transformation step, and the syntactic sugar may even make that more complicated to grasp.
>>
>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
> 
> (Sorry this response is so late)
> 
> I think the extra work is not so much in the formats, as it is
> exposing that syntax via all the commands that we have, and/or new
> commands.  We would either need two lots of commands, or we would need
> to pick one layer (the 'generator' or the native one) for the
> commands, and treat the other layer as a hidden thing.
> 
> It's also not just the extra work of implementing the commands/syntax; it's the extra complexity that ends up being exposed to users.
> 
>>
>>> As for implementing other generators, the trouble with that is that
>>> the resulting conventions would be unknown to other tools, and to any
>>> commands built in to Ceph.
>>
>> Yes. But do we really want to insert the concept of "device class" in Ceph? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction?
> 
> Currently, if someone has done the manual stuff to set up SSD/HDD
> crush trees, any external tool has no way of knowing that two hosts
> (one ssd, one hdd) are actually the same host.  That's the key thing
> here for me -- the time saving during setup is a nice side effect, but
> the primary value of having a Ceph-defined way to do this is that
> every tool building on Ceph can rely on it.
> 
> 
> 
>>> We *really* need a variant of "set noout"
>>> that operates on a crush subtree (typically a host), as it's the sane
>>> way to get people to temporarily mark some OSDs while they
>>> reboot/upgrade a host, but to implement that command we have to have
>>> an unambiguous way of identifying which buckets in the crush map
>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
>>> whatever), it needs to be defined and built into Ceph in order to be
>>> interoperable.
>>
>> That goes back (above) to my understanding of Sage's proposal (which I may have gotten wrong?), in which the host bucket still exists and still contains all devices regardless of their class.
> 
> In Sage's proposal as I understand it, there's an underlying native
> crush map that uses today's format (i.e. clients need no upgrade),
> which is generated in response to either commands that edit the map,
> or the user inputting a modified map in the text format.  That
> conversion would follow pretty simple rules (assuming a host 'myhost'
> with ssd and hdd devices):
>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
>  * When running a CLI command, something targeting 'myhost' will
> target both 'myhost~hdd' and 'myhost~ssd'
> 
> It's that last part that probably isn't captured properly by something
> external that does a syntax conversion during import/export.
> 
> John
> 
>> Cheers
>>
>>>
>>> John
>>>
>>>
>>>
>>>> Cheers
>>>>
>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>>>> discussion yesterday:
>>>>>
>>>>> - consolidated notes into a single proposal
>>>>> - use an otherwise illegal character (e.g., ~) as separator for generated
>>>>> buckets.  This avoids ambiguity with user-defined buckets.
>>>>> - class-id $class $id properties for each bucket.  This allows us to
>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>>>> that data does not move (the bucket id is one of many inputs into crush's
>>>>> hash during placement).
>>>>> - simpler rule syntax:
>>>>>
>>>>>     rule ssd {
>>>>>             ruleset 1
>>>>>             step take default class ssd
>>>>>             step chooseleaf firstn 0 type host
>>>>>             step emit
>>>>>     }
>>>>>
>>>>> My rationale here is that we don't want to make this a separate 'step'
>>>>> call since steps map to underlying crush rule step ops, and this is a
>>>>> directive only to the compiler.  Making it an optional step argument seems
>>>>> like the cleanest way to do that.
>>>>>
>>>>> Any other comments before we kick this off?
>>>>>
>>>>> Thanks!
>>>>> sage
>>>>>
>>>>>
>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>>>
>>>>>> Hi Wido,
>>>>>>
>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi Sage,
>>>>>>>>>
>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>>>
>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>>>
>>>>>>>>>     # devices
>>>>>>>>>     device 0 osd.0
>>>>>>>>>     device 1 osd.1
>>>>>>>>>     device 2 osd.2
>>>>>>>>>     device 3 osd.3
>>>>>>>>>
>>>>>>>>> into:
>>>>>>>>>
>>>>>>>>>     # devices
>>>>>>>>>     device 0 osd.0 ssd
>>>>>>>>>     device 1 osd.1 ssd
>>>>>>>>>     device 2 osd.2 hdd
>>>>>>>>>     device 3 osd.3 hdd
>>>>>>>>>
>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>>>
>>>>>>>>>     rule ssd {
>>>>>>>>>             ruleset 1
>>>>>>>>>             type replicated
>>>>>>>>>             min_size 1
>>>>>>>>>             max_size 10
>>>>>>>>>             step take default:ssd
>>>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>>>             step emit
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>>>
>>>>>>>>>     rule ssd {
>>>>>>>>>             ruleset 1
>>>>>>>>>             type replicated
>>>>>>>>>             min_size 1
>>>>>>>>>             max_size 10
>>>>>>>>>             device class = ssd
>>>>>>>>
>>>>>>>> Would that be sane?
>>>>>>>>
>>>>>>>> Why not:
>>>>>>>>
>>>>>>>> step set-class ssd
>>>>>>>> step take default
>>>>>>>> step chooseleaf firstn 0 type host
>>>>>>>> step emit
>>>>>>>>
>>>>>>>> Since it's a 'step' you take, am I right?
>>>>>>>
>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>>>
>>>>>>> sage
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-02-15 12:14                 ` Loic Dachary
@ 2017-03-08  9:42                   ` Dan van der Ster
  2017-03-08 10:05                     ` Loic Dachary
  2017-03-08 14:39                     ` Sage Weil
  0 siblings, 2 replies; 25+ messages in thread
From: Dan van der Ster @ 2017-03-08  9:42 UTC (permalink / raw)
  To: Loic Dachary; +Cc: John Spray, Ceph Development

Hi Loic,

Did you already have a plan for how an operator would declare the
device class of each OSD?
Would this be a new --device-class option to ceph-disk prepare, which
would perhaps create a device-class file in the root of the OSD's xfs
dir?
Then osd crush create-or-move in ceph-osd-prestart.sh would be a
combination of ceph.conf's "crush location" and this per-OSD file.
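
Concretely I was imagining something like this (option, path and argument names are just a straw man):

    # record the class when the disk is prepared
    ceph-disk prepare --device-class ssd /dev/sdb
    # which would drop a one-line file next to the other OSD metadata
    cat /var/lib/ceph/osd/ceph-12/device-class
    ssd
    # so that the prestart hook can combine it with the crush location
    ceph osd crush create-or-move osd.12 1.0 host=myhost class=ssd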

Cheers, Dan



On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary <loic@dachary.org> wrote:
> Hi John,
>
> Thanks for the discussion :-) I'll start implementing the proposal as described originally.
>
> Cheers
>
> On 02/15/2017 12:57 PM, John Spray wrote:
>> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>>
>>>
>>> On 02/03/2017 01:46 PM, John Spray wrote:
>>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>> Hi,
>>>>>
>>>>> Reading Wido's & John's comments I thought of something, not sure if it's a good idea or not. Here it is anyway ;-)
>>>>>
>>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that as a crush table generator.
>>>>>
>>>>> Let's say you want help to create the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>>>>
>>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>>>>
>>>>> The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also check the generated crushmap for themselves.
>>>>>
>>>>> Cons:
>>>>>
>>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>>>>
>>>>> Pros:
>>>>>
>>>>> * it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
>>>>> * it can be implemented using python to lower the barrier of entry
>>>>>
>>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario, nobody writes any plugin, but that does not make this one plugin less useful.
>>>>
>>>> I think this is basically the alternative approach that Sam was
>>>> suggesting during CDM: the idea of layering a new (perhaps very
>>>> similar) syntax on top of the existing one, instead of extending the
>>>> existing one directly.
>>>
>>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below. Please bear in mind that I'm not sure this is the way to go, even though I'm writing as if I am.
>>>
>>>> The main argument against doing that was the complexity, not just of
>>>> implementation but for users, who would now potentially have two
>>>> separate sets of commands, one operating on the "high level" map
>>>> (which would have a "myhost" object in it), and one operating on the
>>>> native crush map (which would only have myhost~ssd, myhost~hdd
>>>> entries, and would have no concept that a thing called myhost
>>>> existed).
>>>
>>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator, because I already know how crush works. I'll welcome the help and be relieved that I don't have to do that manually anymore. But having that as a native syntax may be a little uncomfortable for me, because I will want to verify that the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to adjust one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be spared from understanding the transformation step, and the syntactic sugar may even make that more complicated to grasp.
>>>
>>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
>>
>> (Sorry this response is so late)
>>
>> I think the extra work is not so much in the formats, as it is
>> exposing that syntax via all the commands that we have, and/or new
>> commands.  We would either need two lots of commands, or we would need
>> to pick one layer (the 'generator' or the native one) for the
>> commands, and treat the other layer as a hidden thing.
>>
>> It's also not just the extra work of implementing the commands/syntax; it's the extra complexity that ends up being exposed to users.
>>
>>>
>>>> As for implementing other generators, the trouble with that is that
>>>> the resulting conventions would be unknown to other tools, and to any
>>>> commands built in to Ceph.
>>>
>>> Yes. But do we really want to insert the concept of "device class" in Ceph? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction?
>>
>> Currently, if someone has done the manual stuff to set up SSD/HDD
>> crush trees, any external tool has no way of knowing that two hosts
>> (one ssd, one hdd) are actually the same host.  That's the key thing
>> here for me -- the time saving during setup is a nice side effect, but
>> the primary value of having a Ceph-defined way to do this is that
>> every tool building on Ceph can rely on it.
>>
>>
>>
>>>> We *really* need a variant of "set noout"
>>>> that operates on a crush subtree (typically a host), as it's the sane
>>>> way to get people to temporarily mark some OSDs while they
>>>> reboot/upgrade a host, but to implement that command we have to have
>>>> an unambiguous way of identifying which buckets in the crush map
>>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
>>>> whatever), it needs to be defined and built into Ceph in order to be
>>>> interoperable.
>>>
>>> That goes back (above) to my understanding of Sage's proposal (which I may have gotten wrong?), in which the host bucket still exists and still contains all devices regardless of their class.
>>
>> In Sage's proposal as I understand it, there's an underlying native
>> crush map that uses today's format (i.e. clients need no upgrade),
>> which is generated in response to either commands that edit the map,
>> or the user inputting a modified map in the text format.  That
>> conversion would follow pretty simple rules (assuming a host 'myhost'
>> with ssd and hdd devices):
>>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
>>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
>>  * When running a CLI command, something targeting 'myhost' will
>> target both 'myhost~hdd' and 'myhost~ssd'
>>
>> It's that last part that probably isn't captured properly by something
>> external that does a syntax conversion during import/export.
>>
>> John
>>
>>> Cheers
>>>
>>>>
>>>> John
>>>>
>>>>
>>>>
>>>>> Cheers
>>>>>
>>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>>>>> discussion yesterday:
>>>>>>
>>>>>> - consolidated notes into a single proposal
>>>>>> - use an otherwise illegal character (e.g., ~) as separator for generated
>>>>>> buckets.  This avoids ambiguity with user-defined buckets.
>>>>>> - class-id $class $id properties for each bucket.  This allows us to
>>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>>>>> that data does not move (the bucket id is one of many inputs into crush's
>>>>>> hash during placement).
>>>>>> - simpler rule syntax:
>>>>>>
>>>>>>     rule ssd {
>>>>>>             ruleset 1
>>>>>>             step take default class ssd
>>>>>>             step chooseleaf firstn 0 type host
>>>>>>             step emit
>>>>>>     }
>>>>>>
>>>>>> My rationale here is that we don't want to make this a separate 'step'
>>>>>> call since steps map to underlying crush rule step ops, and this is a
>>>>>> directive only to the compiler.  Making it an optional step argument seems
>>>>>> like the cleanest way to do that.
>>>>>>
>>>>>> Any other comments before we kick this off?
>>>>>>
>>>>>> Thanks!
>>>>>> sage
>>>>>>
>>>>>>
>>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>>>>
>>>>>>> Hi Wido,
>>>>>>>
>>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Hi Sage,
>>>>>>>>>>
>>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>>>>
>>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>>>>
>>>>>>>>>>     # devices
>>>>>>>>>>     device 0 osd.0
>>>>>>>>>>     device 1 osd.1
>>>>>>>>>>     device 2 osd.2
>>>>>>>>>>     device 3 osd.3
>>>>>>>>>>
>>>>>>>>>> into:
>>>>>>>>>>
>>>>>>>>>>     # devices
>>>>>>>>>>     device 0 osd.0 ssd
>>>>>>>>>>     device 1 osd.1 ssd
>>>>>>>>>>     device 2 osd.2 hdd
>>>>>>>>>>     device 3 osd.3 hdd
>>>>>>>>>>
>>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>>>>
>>>>>>>>>>     rule ssd {
>>>>>>>>>>             ruleset 1
>>>>>>>>>>             type replicated
>>>>>>>>>>             min_size 1
>>>>>>>>>>             max_size 10
>>>>>>>>>>             step take default:ssd
>>>>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>>>>             step emit
>>>>>>>>>>     }
>>>>>>>>>>
>>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>>>>
>>>>>>>>>>     rule ssd {
>>>>>>>>>>             ruleset 1
>>>>>>>>>>             type replicated
>>>>>>>>>>             min_size 1
>>>>>>>>>>             max_size 10
>>>>>>>>>>             device class = ssd
>>>>>>>>>
>>>>>>>>> Would that be sane?
>>>>>>>>>
>>>>>>>>> Why not:
>>>>>>>>>
>>>>>>>>> step set-class ssd
>>>>>>>>> step take default
>>>>>>>>> step chooseleaf firstn 0 type host
>>>>>>>>> step emit
>>>>>>>>>
>>>>>>>>> Since it's a 'step' you take, am I right?
>>>>>>>>
>>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>>>>
>>>>>>>> sage
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>
>>>>>
>>>>> --
>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>
>>>
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
>>
>
> --
> Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-03-08  9:42                   ` Dan van der Ster
@ 2017-03-08 10:05                     ` Loic Dachary
  2017-03-08 14:39                     ` Sage Weil
  1 sibling, 0 replies; 25+ messages in thread
From: Loic Dachary @ 2017-03-08 10:05 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: John Spray, Ceph Development



On 03/08/2017 10:42 AM, Dan van der Ster wrote:
> Hi Loic,
> 
> Did you already have a plan for how an operator would declare the
> device class of each OSD?
> Would this be a new --device-class option to ceph-disk prepare, which
> would perhaps create a device-class file in the root of the OSD's xfs
> dir?
> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
> combination of ceph.conf's "crush location" and this per-OSD file.

Nothing yet, but it's a perfect time to discuss that :-)

Cheers

> 
> Cheers, Dan
> 
> 
> 
> On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary <loic@dachary.org> wrote:
>> Hi John,
>>
>> Thanks for the discussion :-) I'll start implementing the proposal as described originally.
>>
>> Cheers
>>
>> On 02/15/2017 12:57 PM, John Spray wrote:
>>> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>
>>>>
>>>> On 02/03/2017 01:46 PM, John Spray wrote:
>>>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Reading Wido's & John's comments I thought of something, not sure if it's a good idea or not. Here it is anyway ;-)
>>>>>>
>>>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that as a crush table generator.
>>>>>>
>>>>>> Let's say you want help to create the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>>>>>>
>>>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>>>>>>
>>>>>> The 'device-class' transformation documents the naming conventions, so the user knows that root will generate root_ssd and root_hdd. And users can also check the generated crushmap for themselves.
>>>>>>
>>>>>> Cons:
>>>>>>
>>>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>>>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>>>>>>
>>>>>> Pros:
>>>>>>
>>>>>> * it can inspire people to implement other crushmap transformations / generators (an alternative, simpler syntax comes to mind ;-)
>>>>>> * it can be implemented using python to lower the barrier of entry
>>>>>>
>>>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario, nobody writes any plugin, but that does not make this one plugin less useful.
>>>>>
>>>>> I think this is basically the alternative approach that Sam was
>>>>> suggesting during CDM: the idea of layering a new (perhaps very
>>>>> similar) syntax on top of the existing one, instead of extending the
>>>>> existing one directly.
>>>>
>>>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below. Please bear in mind that I'm not sure this is the way to go, even though I'm writing as if I am.
>>>>
>>>>> The main argument against doing that was the complexity, not just of
>>>>> implementation but for users, who would now potentially have two
>>>>> separate sets of commands, one operating on the "high level" map
>>>>> (which would have a "myhost" object in it), and one operating on the
>>>>> native crush map (which would only have myhost~ssd, myhost~hdd
>>>>> entries, and would have no concept that a thing called myhost
>>>>> existed).
>>>>
>>>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator, because I already know how crush works. I'll welcome the help and be relieved that I don't have to do that manually anymore. But having that as a native syntax may be a little uncomfortable for me, because I will want to verify that the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to adjust one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be spared from understanding the transformation step, and the syntactic sugar may even make that more complicated to grasp.
>>>>
>>>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
>>>
>>> (Sorry this response is so late)
>>>
>>> I think the extra work is not so much in the formats, as it is
>>> exposing that syntax via all the commands that we have, and/or new
>>> commands.  We would either need two lots of commands, or we would need
>>> to pick one layer (the 'generator' or the native one) for the
>>> commands, and treat the other layer as a hidden thing.
>>>
>>> It's also not just the extra work of implementing the commands/syntax; it's the extra complexity that ends up being exposed to users.
>>>
>>>>
>>>>> As for implementing other generators, the trouble with that is that
>>>>> the resulting conventions would be unknown to other tools, and to any
>>>>> commands built in to Ceph.
>>>>
>>>> Yes. But do we really want to insert the concept of "device class" in Ceph? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction?
>>>
>>> Currently, if someone has done the manual stuff to set up SSD/HDD
>>> crush trees, any external tool has no way of knowing that two hosts
>>> (one ssd, one hdd) are actually the same host.  That's the key thing
>>> here for me -- the time saving during setup is a nice side effect, but
>>> the primary value of having a Ceph-defined way to do this is that
>>> every tool building on Ceph can rely on it.
>>>
>>>
>>>
>>>>> We *really* need a variant of "set noout"
>>>>> that operates on a crush subtree (typically a host), as it's the sane
>>>>> way to get people to temporarily mark some OSDs while they
>>>>> reboot/upgrade a host, but to implement that command we have to have
>>>>> an unambiguous way of identifying which buckets in the crush map
>>>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
>>>>> whatever), it needs to be defined and built into Ceph in order to be
>>>>> interoperable.
>>>>
>>>> That goes back (above) to my understanding of Sage's proposal (which I may have gotten wrong?), in which the host bucket still exists and still contains all devices regardless of their class.
>>>
>>> In Sage's proposal as I understand it, there's an underlying native
>>> crush map that uses today's format (i.e. clients need no upgrade),
>>> which is generated in response to either commands that edit the map,
>>> or the user inputting a modified map in the text format.  That
>>> conversion would follow pretty simple rules (assuming a host 'myhost'
>>> with ssd and hdd devices):
>>>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
>>>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
>>>  * When running a CLI command, something targeting 'myhost' will
>>> target both 'myhost~hdd' and 'myhost~ssd'
>>>
>>> It's that last part that probably isn't captured properly by something
>>> external that does a syntax conversion during import/export.
>>>
>>> John
>>>
>>>> Cheers
>>>>
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>>>>>>> discussion yesterday:
>>>>>>>
>>>>>>> - consolidated notes into a single proposal
>>>>>>> - use an otherwise illegal character (e.g., ~) as separator for generated
>>>>>>> buckets.  This avoids ambiguity with user-defined buckets.
>>>>>>> - class-id $class $id properties for each bucket.  This allows us to
>>>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
>>>>>>> that data does not move (the bucket id is one of many inputs into crush's
>>>>>>> hash during placement).
>>>>>>> - simpler rule syntax:
>>>>>>>
>>>>>>>     rule ssd {
>>>>>>>             ruleset 1
>>>>>>>             step take default class ssd
>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>             step emit
>>>>>>>     }
>>>>>>>
>>>>>>> My rationale here is that we don't want to make this a separate 'step'
>>>>>>> call since steps map to underlying crush rule step ops, and this is a
>>>>>>> directive only to the compiler.  Making it an optional step argument seems
>>>>>>> like the cleanest way to do that.
>>>>>>>
>>>>>>> Any other comments before we kick this off?
>>>>>>>
>>>>>>> Thanks!
>>>>>>> sage
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>>>>>>>
>>>>>>>> Hi Wido,
>>>>>>>>
>>>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>>>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>>>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Sage,
>>>>>>>>>>>
>>>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>>>>>>>>>>>
>>>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>>>>>>>>>>>
>>>>>>>>>>>     # devices
>>>>>>>>>>>     device 0 osd.0
>>>>>>>>>>>     device 1 osd.1
>>>>>>>>>>>     device 2 osd.2
>>>>>>>>>>>     device 3 osd.3
>>>>>>>>>>>
>>>>>>>>>>> into:
>>>>>>>>>>>
>>>>>>>>>>>     # devices
>>>>>>>>>>>     device 0 osd.0 ssd
>>>>>>>>>>>     device 1 osd.1 ssd
>>>>>>>>>>>     device 2 osd.2 hdd
>>>>>>>>>>>     device 3 osd.3 hdd
>>>>>>>>>>>
>>>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>>>>>>>>>>>
>>>>>>>>>>>     rule ssd {
>>>>>>>>>>>             ruleset 1
>>>>>>>>>>>             type replicated
>>>>>>>>>>>             min_size 1
>>>>>>>>>>>             max_size 10
>>>>>>>>>>>             step take default:ssd
>>>>>>>>>>>             step chooseleaf firstn 0 type host
>>>>>>>>>>>             step emit
>>>>>>>>>>>     }
>>>>>>>>>>>
>>>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>>>>>>>>>>>
>>>>>>>>>>>     rule ssd {
>>>>>>>>>>>             ruleset 1
>>>>>>>>>>>             type replicated
>>>>>>>>>>>             min_size 1
>>>>>>>>>>>             max_size 10
>>>>>>>>>>>             device class = ssd
>>>>>>>>>>
>>>>>>>>>> Would that be sane?
>>>>>>>>>>
>>>>>>>>>> Why not:
>>>>>>>>>>
>>>>>>>>>> step set-class ssd
>>>>>>>>>> step take default
>>>>>>>>>> step chooseleaf firstn 0 type host
>>>>>>>>>> step emit
>>>>>>>>>>
>>>>>>>>>> Since it's a 'step' you take, am I right?
>>>>>>>>>
>>>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
>>>>>>>>>
>>>>>>>>> sage
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Loïc Dachary, Artisan Logiciel Libre
>>>>>
>>>>
>>>> --
>>>> Loïc Dachary, Artisan Logiciel Libre
>>>
>>
>> --
>> Loïc Dachary, Artisan Logiciel Libre
> 

-- 
Loïc Dachary, Artisan Logiciel Libre

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-03-08  9:42                   ` Dan van der Ster
  2017-03-08 10:05                     ` Loic Dachary
@ 2017-03-08 14:39                     ` Sage Weil
  2017-03-08 15:55                       ` Dan van der Ster
  2017-06-28  5:28                       ` Kyle Bader
  1 sibling, 2 replies; 25+ messages in thread
From: Sage Weil @ 2017-03-08 14:39 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Loic Dachary, John Spray, Ceph Development

On Wed, 8 Mar 2017, Dan van der Ster wrote:
> Hi Loic,
> 
> Did you already have a plan for how an operator would declare the
> device class of each OSD?
> Would this be a new --device-class option to ceph-disk prepare, which
> would perhaps create a device-class file in the root of the OSD's xfs
> dir?
> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
> combination of ceph.conf's "crush location" and this per-OSD file.

Hmm we haven't talked about this part yet.  I see a few options...

1) explicit ceph-disk argument, recorded as a file in osd_data

2) osd can autodetect this based on the 'rotational' flag in sysfs.  The 
trick here, I think, is to come up with suitable defaults.  We might have 
NVMe, SATA/SAS SSDs, HDD, or a combination (with journal and data (and 
db) spread across multiple types).   Perhaps those could break down into 
classes like

	hdd
	ssd
	nvme
	hdd+ssd-journal
	hdd+nvme-journal
	hdd+ssd-db+nvme-journal

which is probably sufficient for most users.  And if the admin likes they 
can override.
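
For reference, the sysfs flag in question is a per-device file; the 
kernel reports 1 for rotational media and 0 for everything else 
(device names below are just examples):

    $ cat /sys/block/sda/queue/rotational
    1
    $ cat /sys/block/nvme0n1/queue/rotational
    0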

- Then the osd adjusts device-class on startup, just like it does with the 
crush map position.  (Note that this will have no real effect until the 
CRUSH rule(s) are changed to use device class.)

- We'll need an 'osd crush set-device-class <osd.NNN> <class>' command.  
The only danger I see here is that if you set it to something other than  
what the OSD autodetects above, it'll get clobbered on the next OSD 
restart.  Maybe the autodetection *only* sets the device class if it isn't 
already set?
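
For example, with the syntax proposed above (not an existing command 
yet, and subject to the clobbering caveat just mentioned):

    ceph osd crush set-device-class osd.2 hdd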

- We need to adjust the crush rule commands to allow a device class. 
Currently we have

osd crush rule create-erasure <name>     create crush rule <name> for erasure 
 {<profile>}                              coded pool created with <profile> (
                                          default default)
osd crush rule create-simple <name>      create crush rule <name> to start from 
 <root> <type> {firstn|indep}             <root>, replicate across buckets of 
                                          type <type>, using a choose mode of 
                                          <firstn|indep> (default firstn; indep 
                                          best for erasure pools)

...so we could add another optional arg at the end for the device class.
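
For example (hypothetical syntax, with the device class as a trailing 
optional argument as suggested above):

    ceph osd crush rule create-simple fast default host firstn ssd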

sage





> 
> Cheers, Dan
> 
> 
> 
> On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary <loic@dachary.org> wrote:
> > Hi John,
> >
> > Thanks for the discussion :-) I'll start implementing the proposal as described originally.
> >
> > Cheers
> >
> > On 02/15/2017 12:57 PM, John Spray wrote:
> >> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
> >>>
> >>>
> >>> On 02/03/2017 01:46 PM, John Spray wrote:
> >>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
> >>>>> Hi,
> >>>>>
> >>>>> Reading Wido & John comments I thought of something, not sure if that's a good idea or not. Here it is anyway ;-)
> >>>>>
> >>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that to be a crush table generator.
> >>>>>
> >>>>> Let's say you want help creating the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
> >>>>>
> >>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
> >>>>>
> >>>>> The 'device-class' transformation documents the naming conventions so the user knows root will generate root_ssd and root_hdd. And the users can also check by themselves the generated crushmap.
> >>>>>
> >>>>> Cons:
> >>>>>
> >>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
> >>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
> >>>>>
> >>>>> Pros:
> >>>>>
> >>>>> * it can inspire people to implement other crushmap transformation / generators (an alternative, simpler, syntax comes to mind ;-)
> >>>>> * it can be implemented using python to lower the barrier of entry
> >>>>>
> >>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario nobody writes any plugin, but that does not make this one plugin less useful.
> >>>>
> >>>> I think this is basically the alternative approach that Sam was
> >>>> suggesting during CDM: the idea of layering a new (perhaps very
> >>>> similar) syntax on top of the existing one, instead of extending the
> >>>> existing one directly.
> >>>
> >>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below then. Please bear in mind that I'm not sure this is the way to go even though I'm writing as if I am.
> >>>
> >>>> The main argument against doing that was the complexity, not just of
> >>>> implementation but for users, who would now potentially have two
> >>>> separate sets of commands, one operating on the "high level" map
> >>>> (which would have a "myhost" object in it), and one operating on the
> >>>> native crush map (which would only have myhost~ssd, myhost~hdd
> >>>> entries, and would have no concept that a thing called myhost
> >>>> existed).
> >>>
> >>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator because I already know how crush works. I'll welcome the help and be relieved that I don't have to manually do that anymore. But having that as a native syntax may be a little uncomfortable for me because I will want to verify the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to match one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be exempted from understanding the transformation step, and the syntactic sugar may even make that more complicated to get.
> >>>
> >>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
> >>
> >> (Sorry this response is so late)
> >>
> >> I think the extra work is not so much in the formats, as it is
> >> exposing that syntax via all the commands that we have, and/or new
> >> commands.  We would either need two lots of commands, or we would need
> >> to pick one layer (the 'generator' or the native one) for the
> >> commands, and treat the other layer as a hidden thing.
> >>
> >> It's also not just the extra work of implementing the commands/syntax,
> >> it's the extra complexity that ends up being exposed to users.
> >>
> >>>
> >>>> As for implementing other generators, the trouble with that is that
> >>>> the resulting conventions would be unknown to other tools, and to any
> >>>> commands built in to Ceph.
> >>>
> >>> Yes. But do we really want to insert the concept of "device class" in Ceph ? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction ?
> >>
> >> Currently, if someone has done the manual stuff to set up SSD/HDD
> >> crush trees, any external tool has no way of knowing that two hosts
> >> (one ssd, one hdd) are actually the same host.  That's the key thing
> >> here for me -- the time saving during setup is a nice side effect, but
> >> the primary value of having a Ceph-defined way to do this is that
> >> every tool building on Ceph can rely on it.
> >>
> >>
> >>
> >>>> We *really* need a variant of "set noout"
> >>>> that operates on a crush subtree (typically a host), as it's the sane
> >>>> way to get people to temporarily mark some OSDs while they
> >>>> reboot/upgrade a host, but to implement that command we have to have
> >>>> an unambiguous way of identifying which buckets in the crush map
> >>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
> >>>> whatever), it needs to be defined and built into Ceph in order to be
> >>>> interoperable.
> >>>
> >>> That goes back (above) to my understanding of Sage's proposal (which I may have wrong ?) in which the host bucket still exists and still contains all devices regardless of their class.
> >>
> >> In Sage's proposal as I understand it, there's an underlying native
> >> crush map that uses today's format (i.e. clients need no upgrade),
> >> which is generated in response to either commands that edit the map,
> >> or the user inputting a modified map in the text format.  That
> >> conversion would follow pretty simple rules (assuming a host 'myhost'
> >> with ssd and hdd devices):
> >>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
> >>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
> >>  * When running a CLI command, something targeting 'myhost' will
> >> target both 'myhost~hdd' and 'myhost~ssd'
> >>
> >> It's that last part that probably isn't captured properly by something
> >> external that does a syntax conversion during import/export.
> >>
> >> John
> >>
> >>> Cheers
> >>>
> >>>>
> >>>> John
> >>>>
> >>>>
> >>>>
> >>>>> Cheers
> >>>>>
> >>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
> >>>>>> Hi everyone,
> >>>>>>
> >>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
> >>>>>> discussion yesterday:
> >>>>>>
> >>>>>> - consolidated notes into a single proposal
> >>>>>> - use otherwise illegal character (e.g., ~) as separator for generated
> >>>>>> buckets.  This avoids ambiguity with user-defined buckets.
> >>>>>> - class-id $class $id properties for each bucket.  This allows us to
> >>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
> >>>>>> that data does not move (the bucket id is one of many inputs into crush's
> >>>>>> hash during placement).
> >>>>>> - simpler rule syntax:
> >>>>>>
> >>>>>>     rule ssd {
> >>>>>>             ruleset 1
> >>>>>>             step take default class ssd
> >>>>>>             step chooseleaf firstn 0 type host
> >>>>>>             step emit
> >>>>>>     }
> >>>>>>
> >>>>>> My rationale here is that we don't want to make this a separate 'step'
> >>>>>> call since steps map to underlying crush rule step ops, and this is a
> >>>>>> directive only to the compiler.  Making it an optional step argument seems
> >>>>>> like the cleanest way to do that.
> >>>>>>
> >>>>>> Any other comments before we kick this off?
> >>>>>>
> >>>>>> Thanks!
> >>>>>> sage
> >>>>>>
> >>>>>>
> >>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
> >>>>>>
> >>>>>>> Hi Wido,
> >>>>>>>
> >>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
> >>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
> >>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Hi Sage,
> >>>>>>>>>>
> >>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
> >>>>>>>>>>
> >>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
> >>>>>>>>>>
> >>>>>>>>>>     # devices
> >>>>>>>>>>     device 0 osd.0
> >>>>>>>>>>     device 1 osd.1
> >>>>>>>>>>     device 2 osd.2
> >>>>>>>>>>     device 3 osd.3
> >>>>>>>>>>
> >>>>>>>>>> into:
> >>>>>>>>>>
> >>>>>>>>>>     # devices
> >>>>>>>>>>     device 0 osd.0 ssd
> >>>>>>>>>>     device 1 osd.1 ssd
> >>>>>>>>>>     device 2 osd.2 hdd
> >>>>>>>>>>     device 3 osd.3 hdd
> >>>>>>>>>>
> >>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
> >>>>>>>>>>
> >>>>>>>>>>     rule ssd {
> >>>>>>>>>>             ruleset 1
> >>>>>>>>>>             type replicated
> >>>>>>>>>>             min_size 1
> >>>>>>>>>>             max_size 10
> >>>>>>>>>>             step take default:ssd
> >>>>>>>>>>             step chooseleaf firstn 0 type host
> >>>>>>>>>>             step emit
> >>>>>>>>>>     }
> >>>>>>>>>>
> >>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
> >>>>>>>>>>
> >>>>>>>>>>     rule ssd {
> >>>>>>>>>>             ruleset 1
> >>>>>>>>>>             type replicated
> >>>>>>>>>>             min_size 1
> >>>>>>>>>>             max_size 10
> >>>>>>>>>>             device class = ssd
> >>>>>>>>>
> >>>>>>>>> Would that be sane?
> >>>>>>>>>
> >>>>>>>>> Why not:
> >>>>>>>>>
> >>>>>>>>> step set-class ssd
> >>>>>>>>> step take default
> >>>>>>>>> step chooseleaf firstn 0 type host
> >>>>>>>>> step emit
> >>>>>>>>>
> >>>>>>>>> Since it's a 'step' you take, am I right?
> >>>>>>>>
> >>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
> >>>>>>>>
> >>>>>>>> sage
> >>>>>>>>
> >>>>>>>
> >>>>>>> --
> >>>>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>>>>
> >>>>>
> >>>>> --
> >>>>> Loïc Dachary, Artisan Logiciel Libre
> >>>>
> >>>
> >>> --
> >>> Loïc Dachary, Artisan Logiciel Libre
> >>
> >
> > --
> > Loïc Dachary, Artisan Logiciel Libre
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-03-08 14:39                     ` Sage Weil
@ 2017-03-08 15:55                       ` Dan van der Ster
  2017-03-08 17:00                         ` Sage Weil
  2017-06-28  5:28                       ` Kyle Bader
  1 sibling, 1 reply; 25+ messages in thread
From: Dan van der Ster @ 2017-03-08 15:55 UTC (permalink / raw)
  To: Sage Weil; +Cc: Loic Dachary, John Spray, Ceph Development

On Wed, Mar 8, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 8 Mar 2017, Dan van der Ster wrote:
>> Hi Loic,
>>
>> Did you already have a plan for how an operator would declare the
>> device class of each OSD?
>> Would this be a new --device-class option to ceph-disk prepare, which
>> would perhaps create a device-class file in the root of the OSD's xfs
>> dir?
>> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
>> combination of ceph.conf's "crush location" and this per-OSD file.
>
> Hmm we haven't talked about this part yet.  I see a few options...
>
> 1) explicit ceph-disk argument, recorded as a file in osd_data
>
> 2) osd can autodetect this based on the 'rotational' flag in sysfs.  The
> trick here, I think, is to come up with suitable defaults.  We might have
> NVMe, SATA/SAS SSDs, HDD, or a combination (with journal and data (and
> db) spread across multiple types).   Perhaps those could break down into
> classes like
>
>         hdd
>         ssd
>         nvme
>         hdd+ssd-journal
>         hdd+nvme-journal
>         hdd+ssd-db+nvme-journal
>
> which is probably sufficient for most users.  And if the admin likes they
> can override.
>
> - Then the osd adjusts device-class on startup, just like it does with the
> crush map position.  (Note that this will have no real effect until the
> CRUSH rule(s) are changed to use device class.)
>
> - We'll need an 'osd crush set-device-class <osd.NNN> <class>' command.
> The only danger I see here is that if you set it to something other than
> what the OSD autodetects above, it'll get clobbered on the next OSD
> restart.  Maybe the autodetection *only* sets the device class if it isn't
> already set?

This is the same issue we have with crush locations, hence the osd
crush update on start option, right?
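
For reference, that knob in ceph.conf looks like this (setting it to 
false disables the on-start relocation):

    [osd]
    osd crush update on start = false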

>
> - We need to adjust the crush rule commands to allow a device class.
> Currently we have
>
> osd crush rule create-erasure <name>     create crush rule <name> for erasure
>  {<profile>}                              coded pool created with <profile> (
>                                           default default)
> osd crush rule create-simple <name>      create crush rule <name> to start from
>  <root> <type> {firstn|indep}             <root>, replicate across buckets of
>                                           type <type>, using a choose mode of
>                                           <firstn|indep> (default firstn; indep
>                                           best for erasure pools)
>
> ...so we could add another optional arg at the end for the device class.
>

How far along in the implementation are you? Still time for discussing
the basic idea?

I wonder if you all had thought about using device classes like we use
buckets (i.e. to choose across device types)? Suppose I have two
brands of ssds: I want to define two classes ssd-a and ssd-b. And I
want to replicate across these classes (and across, say, hosts as
well). I think I'd need a choose step to choose 2 from classtype ssd
(out of ssd-a, ssd-b, etc...), and then chooseleaf across hosts.
IOW, device classes could be an orthogonal, but similarly flexible,
structure to crush buckets: device classes would have a hierarchy.
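
Sketched as a rule, in purely hypothetical syntax (no 'classtype' step 
exists today), it might look like:

    rule across-ssd-classes {
            ruleset 2
            type replicated
            step take default
            # hypothetical: choose 2 distinct classes under the 'ssd' classtype
            step choose firstn 2 classtype ssd
            # then one host's worth of devices within each chosen class
            step chooseleaf firstn 1 type host
            step emit
    }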

So we could still have:

device 0 osd.0 class ssd-a
device 1 osd.1 class ssd-b
device 2 osd.2 class hdd-c
device 3 osd.3 class hdd-d

but then we define the class-types and their hierarchy like we already
do for osds. Shown in a "class tree" we could have, for example:

TYPE               NAME
root                  default
    classtype    hdd
        class        hdd-c
        class        hdd-d
    classtype    ssd
        class        ssd-a
        class        ssd-b

Sorry to bring this up late in the thread.

Cheers, Dan


> sage
>
>
>
>
>
>>
>> Cheers, Dan
>>
>>
>>
>> On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary <loic@dachary.org> wrote:
>> > Hi John,
>> >
>> > Thanks for the discussion :-) I'll start implementing the proposal as described originally.
>> >
>> > Cheers
>> >
>> > On 02/15/2017 12:57 PM, John Spray wrote:
>> >> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
>> >>>
>> >>>
>> >>> On 02/03/2017 01:46 PM, John Spray wrote:
>> >>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
>> >>>>> Hi,
>> >>>>>
>> >>>>> Reading Wido & John comments I thought of something, not sure if that's a good idea or not. Here it is anyway ;-)
>> >>>>>
>> >>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that to be a crush table generator.
>> >>>>>
>> >>>>> Let's say you want help creating the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
>> >>>>>
>> >>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
>> >>>>>
>> >>>>> The 'device-class' transformation documents the naming conventions so the user knows root will generate root_ssd and root_hdd. And the users can also check by themselves the generated crushmap.
>> >>>>>
>> >>>>> Cons:
>> >>>>>
>> >>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
>> >>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
>> >>>>>
>> >>>>> Pros:
>> >>>>>
>> >>>>> * it can inspire people to implement other crushmap transformation / generators (an alternative, simpler, syntax comes to mind ;-)
>> >>>>> * it can be implemented using python to lower the barrier of entry
>> >>>>>
>> >>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario nobody writes any plugin, but that does not make this one plugin less useful.
>> >>>>
>> >>>> I think this is basically the alternative approach that Sam was
>> >>>> suggesting during CDM: the idea of layering a new (perhaps very
>> >>>> similar) syntax on top of the existing one, instead of extending the
>> >>>> existing one directly.
>> >>>
>> >>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below then. Please bear in mind that I'm not sure this is the way to go even though I'm writing as if I am.
>> >>>
>> >>>> The main argument against doing that was the complexity, not just of
>> >>>> implementation but for users, who would now potentially have two
>> >>>> separate sets of commands, one operating on the "high level" map
>> >>>> (which would have a "myhost" object in it), and one operating on the
>> >>>> native crush map (which would only have myhost~ssd, myhost~hdd
>> >>>> entries, and would have no concept that a thing called myhost
>> >>>> existed).
>> >>>
>> >>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator because I already know how crush works. I'll welcome the help and be relieved that I don't have to manually do that anymore. But having that as a native syntax may be a little uncomfortable for me because I will want to verify the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to match one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be exempted from understanding the transformation step, and the syntactic sugar may even make that more complicated to get.
>> >>>
>> >>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
>> >>
>> >> (Sorry this response is so late)
>> >>
>> >> I think the extra work is not so much in the formats, as it is
>> >> exposing that syntax via all the commands that we have, and/or new
>> >> commands.  We would either need two lots of commands, or we would need
>> >> to pick one layer (the 'generator' or the native one) for the
>> >> commands, and treat the other layer as a hidden thing.
>> >>
>> >> It's also not just the extra work of implementing the commands/syntax,
>> >> it's the extra complexity that ends up being exposed to users.
>> >>
>> >>>
>> >>>> As for implementing other generators, the trouble with that is that
>> >>>> the resulting conventions would be unknown to other tools, and to any
>> >>>> commands built in to Ceph.
>> >>>
>> >>> Yes. But do we really want to insert the concept of "device class" in Ceph ? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction ?
>> >>
>> >> Currently, if someone has done the manual stuff to set up SSD/HDD
>> >> crush trees, any external tool has no way of knowing that two hosts
>> >> (one ssd, one hdd) are actually the same host.  That's the key thing
>> >> here for me -- the time saving during setup is a nice side effect, but
>> >> the primary value of having a Ceph-defined way to do this is that
>> >> every tool building on Ceph can rely on it.
>> >>
>> >>
>> >>
>> >>>> We *really* need a variant of "set noout"
>> >>>> that operates on a crush subtree (typically a host), as it's the sane
>> >>>> way to get people to temporarily mark some OSDs while they
>> >>>> reboot/upgrade a host, but to implement that command we have to have
>> >>>> an unambiguous way of identifying which buckets in the crush map
>> >>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
>> >>>> whatever), it needs to be defined and built into Ceph in order to be
>> >>>> interoperable.
>> >>>
>> >>> That goes back (above) to my understanding of Sage's proposal (which I may have wrong ?) in which the host bucket still exists and still contains all devices regardless of their class.
>> >>
>> >> In Sage's proposal as I understand it, there's an underlying native
>> >> crush map that uses today's format (i.e. clients need no upgrade),
>> >> which is generated in response to either commands that edit the map,
>> >> or the user inputting a modified map in the text format.  That
>> >> conversion would follow pretty simple rules (assuming a host 'myhost'
>> >> with ssd and hdd devices):
>> >>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
>> >>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
>> >>  * When running a CLI command, something targeting 'myhost' will
>> >> target both 'myhost~hdd' and 'myhost~ssd'
>> >>
>> >> It's that last part that probably isn't captured properly by something
>> >> external that does a syntax conversion during import/export.
>> >>
>> >> John
>> >>
>> >>> Cheers
>> >>>
>> >>>>
>> >>>> John
>> >>>>
>> >>>>
>> >>>>
>> >>>>> Cheers
>> >>>>>
>> >>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
>> >>>>>> Hi everyone,
>> >>>>>>
>> >>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>> >>>>>> discussion yesterday:
>> >>>>>>
>> >>>>>> - consolidated notes into a single proposal
>> >>>>>> - use otherwise illegal character (e.g., ~) as separator for generated
>> >>>>>> buckets.  This avoids ambiguity with user-defined buckets.
>> >>>>>> - class-id $class $id properties for each bucket.  This allows us to
>> >>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
>> >>>>>> that data does not move (the bucket id is one of many inputs into crush's
>> >>>>>> hash during placement).
>> >>>>>> - simpler rule syntax:
>> >>>>>>
>> >>>>>>     rule ssd {
>> >>>>>>             ruleset 1
>> >>>>>>             step take default class ssd
>> >>>>>>             step chooseleaf firstn 0 type host
>> >>>>>>             step emit
>> >>>>>>     }
>> >>>>>>
>> >>>>>> My rationale here is that we don't want to make this a separate 'step'
>> >>>>>> call since steps map to underlying crush rule step ops, and this is a
>> >>>>>> directive only to the compiler.  Making it an optional step argument seems
>> >>>>>> like the cleanest way to do that.
>> >>>>>>
>> >>>>>> Any other comments before we kick this off?
>> >>>>>>
>> >>>>>> Thanks!
>> >>>>>> sage
>> >>>>>>
>> >>>>>>
>> >>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
>> >>>>>>
>> >>>>>>> Hi Wido,
>> >>>>>>>
>> >>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
>> >>>>>>>
>> >>>>>>> Cheers
>> >>>>>>>
>> >>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
>> >>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
>> >>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> Hi Sage,
>> >>>>>>>>>>
>> >>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
>> >>>>>>>>>>
>> >>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
>> >>>>>>>>>>
>> >>>>>>>>>>     # devices
>> >>>>>>>>>>     device 0 osd.0
>> >>>>>>>>>>     device 1 osd.1
>> >>>>>>>>>>     device 2 osd.2
>> >>>>>>>>>>     device 3 osd.3
>> >>>>>>>>>>
>> >>>>>>>>>> into:
>> >>>>>>>>>>
>> >>>>>>>>>>     # devices
>> >>>>>>>>>>     device 0 osd.0 ssd
>> >>>>>>>>>>     device 1 osd.1 ssd
>> >>>>>>>>>>     device 2 osd.2 hdd
>> >>>>>>>>>>     device 3 osd.3 hdd
>> >>>>>>>>>>
>> >>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
>> >>>>>>>>>>
>> >>>>>>>>>>     rule ssd {
>> >>>>>>>>>>             ruleset 1
>> >>>>>>>>>>             type replicated
>> >>>>>>>>>>             min_size 1
>> >>>>>>>>>>             max_size 10
>> >>>>>>>>>>             step take default:ssd
>> >>>>>>>>>>             step chooseleaf firstn 0 type host
>> >>>>>>>>>>             step emit
>> >>>>>>>>>>     }
>> >>>>>>>>>>
>> >>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
>> >>>>>>>>>>
>> >>>>>>>>>>     rule ssd {
>> >>>>>>>>>>             ruleset 1
>> >>>>>>>>>>             type replicated
>> >>>>>>>>>>             min_size 1
>> >>>>>>>>>>             max_size 10
>> >>>>>>>>>>             device class = ssd
>> >>>>>>>>>
>> >>>>>>>>> Would that be sane?
>> >>>>>>>>>
>> >>>>>>>>> Why not:
>> >>>>>>>>>
>> >>>>>>>>> step set-class ssd
>> >>>>>>>>> step take default
>> >>>>>>>>> step chooseleaf firstn 0 type host
>> >>>>>>>>> step emit
>> >>>>>>>>>
>> >>>>>>>>> Since it's a 'step' you take, am I right?
>> >>>>>>>>
>> >>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
>> >>>>>>>>
>> >>>>>>>> sage
>> >>>>>>>>
>> >>>>>>>
>> >>>>>>> --
>> >>>>>>> Loïc Dachary, Artisan Logiciel Libre
>> >>>>>>>
>> >>>>>
>> >>>>> --
>> >>>>> Loïc Dachary, Artisan Logiciel Libre
>> >>>>
>> >>>
>> >>> --
>> >>> Loïc Dachary, Artisan Logiciel Libre
>> >>
>> >
>> > --
>> > Loïc Dachary, Artisan Logiciel Libre
>>
>>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-03-08 15:55                       ` Dan van der Ster
@ 2017-03-08 17:00                         ` Sage Weil
       [not found]                           ` <201706281000476718115@gmail.com>
  0 siblings, 1 reply; 25+ messages in thread
From: Sage Weil @ 2017-03-08 17:00 UTC (permalink / raw)
  To: Dan van der Ster; +Cc: Loic Dachary, John Spray, Ceph Development

On Wed, 8 Mar 2017, Dan van der Ster wrote:
> On Wed, Mar 8, 2017 at 3:39 PM, Sage Weil <sage@newdream.net> wrote:
> > On Wed, 8 Mar 2017, Dan van der Ster wrote:
> >> Hi Loic,
> >>
> >> Did you already have a plan for how an operator would declare the
> >> device class of each OSD?
> >> Would this be a new --device-class option to ceph-disk prepare, which
> >> would perhaps create a device-class file in the root of the OSD's xfs
> >> dir?
> >> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
> >> combination of ceph.conf's "crush location" and this per-OSD file.
> >
> > Hmm we haven't talked about this part yet.  I see a few options...
> >
> > 1) explicit ceph-disk argument, recorded as a file in osd_data
> >
> > 2) osd can autodetect this based on the 'rotational' flag in sysfs.  The
> > trick here, I think, is to come up with suitable defaults.  We might have
> > NVMe, SATA/SAS SSDs, HDD, or a combination (with journal and data (and
> > db) spread across multiple types).   Perhaps those could break down into
> > classes like
> >
> >         hdd
> >         ssd
> >         nvme
> >         hdd+ssd-journal
> >         hdd+nvme-journal
> >         hdd+ssd-db+nvme-journal
> >
> > which is probably sufficient for most users.  And if the admin likes they
> > can override.
> >
> > - Then the osd adjusts device-class on startup, just like it does with the
> > crush map position.  (Note that this will have no real effect until the
> > CRUSH rule(s) are changed to use device class.)
> >
> > - We'll need an 'osd crush set-device-class <osd.NNN> <class>' command.
> > The only danger I see here is that if you set it to something other than
> > what the OSD autodetects above, it'll get clobbered on the next OSD
> > restart.  Maybe the autodetection *only* sets the device class if it isn't
> > already set?
> 
> This is the same issue we have with crush locations, hence the osd
> crush update on start option, right?
> 
> >
> > - We need to adjust the crush rule commands to allow a device class.
> > Currently we have
> >
> > osd crush rule create-erasure <name>     create crush rule <name> for erasure
> >  {<profile>}                              coded pool created with <profile> (
> >                                           default default)
> > osd crush rule create-simple <name>      create crush rule <name> to start from
> >  <root> <type> {firstn|indep}             <root>, replicate across buckets of
> >                                           type <type>, using a choose mode of
> >                                           <firstn|indep> (default firstn; indep
> >                                           best for erasure pools)
> >
> > ...so we could add another optional arg at the end for the device class.
> >
> 
> How far along in the implementation are you? Still time for discussing
> the basic idea?
> 
> I wonder if you all had thought about using device classes like we use
> buckets (i.e. to choose across device types)? Suppose I have two
> brands of ssds: I want to define two classes ssd-a and ssd-b. And I
> want to replicate across these classes (and across, say, hosts as
> well). I think I'd need a choose step to choose 2 from classtype ssd
> (out of ssd-a, ssd-b, etc...), and then chooseleaf across hosts.
> IOW, device classes could be an orthogonal, but similarly flexible,
> structure to crush buckets: device classes would have a hierarchy.
> 
> So we could still have:
> 
> device 0 osd.0 class ssd-a
> device 1 osd.1 class ssd-b
> device 2 osd.2 class hdd-c
> device 3 osd.3 class hdd-d
> 
> but then we define the class-types and their hierarchy like we already
> do for osds. Shown in a "class tree" we could have, for example:
> 
> TYPE               NAME
> root                  default
>     classtype    hdd
>         class        hdd-c
>         class        hdd-d
>     classtype    ssd
>         class        ssd-a
>         class        ssd-b
> 
> Sorry to bring this up late in the thread.

John mentioned something similar in a related thread several weeks back.  
This would be a pretty cool capability.  It's quite a bit harder to 
realize, though.

First, you need to ensure that you have a broad enough mix of device 
classes to make this an enforceable constraint.  Like if you're doing 3x 
replication, that means at least 3 brands/models of SSDs.  And, like the 
normal hierarchy, you need to ensure that there are sufficient numbers of 
each to actually place the data in a way that satisfies the constraint.

Mainly, though, it requires a big change to the crush mapping algorithm 
itself.  (A nice property of the current device classes is that crush on 
the client doesn't need to change--this will work fine with any legacy 
client.)  Here, though, we'd need to do the crush rules in 2 
dimensions.  Something like first choosing the device types for the 
replicas, and then using a separate tree for each type, while also 
recognizing the equivalence of other nodes in the hierarchy (racks, hosts, 
etc.) to enforce the usual placement constraints.

Anyway, it would be much more involved.  I think the main thing to do now 
is try to ensure we don't make our lives harder later if we go down that 
path.  My guess is we'd want to adopt some naming mechanism for classes 
that is friendly to a class hierarchy like the one you have above (e.g. hdd/a, 
hdd/b), but otherwise the "each device has a class" property we're adding 
now wouldn't really change.  The new bit would be how the rule is defined, 
but since larger changes would be needed there I don't think the small 
tweak we've just made would be an issue...?

BTW, the initial CRUSH device class support just merged.  Next up are the 
various mon commands and osd hooks to make it easy to use...

sage



> 
> Cheers, Dan
> 
> 
> > sage
> >
> >
> >
> >
> >
> >>
> >> Cheers, Dan
> >>
> >>
> >>
> >> On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary <loic@dachary.org> wrote:
> >> > Hi John,
> >> >
> >> > Thanks for the discussion :-) I'll start implementing the proposal as described originally.
> >> >
> >> > Cheers
> >> >
> >> > On 02/15/2017 12:57 PM, John Spray wrote:
> >> >> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary <loic@dachary.org> wrote:
> >> >>>
> >> >>>
> >> >>> On 02/03/2017 01:46 PM, John Spray wrote:
> >> >>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary <loic@dachary.org> wrote:
> >> >>>>> Hi,
> >> >>>>>
> >> >>>>> Reading Wido & John comments I thought of something, not sure if that's a good idea or not. Here it is anyway ;-)
> >> >>>>>
> >> >>>>> The device class problem we're trying to solve is one instance of a more general need to produce crush tables that implement a given use case. The SSD / HDD use case is so frequent that it would make sense to modify the crush format for this. But maybe we could instead implement that to be a crush table generator.
> >> >>>>>
> >> >>>>> Let's say you want help creating the hierarchies that implement the ssd/hdd separation: you write your crushmap using the proposed syntax. But instead of feeding it directly to crushtool -c, you would do something like:
> >> >>>>>
> >> >>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
> >> >>>>>
> >> >>>>> The 'device-class' transformation documents the naming conventions so the user knows root will generate root_ssd and root_hdd. And the users can also check by themselves the generated crushmap.
> >> >>>>>
> >> >>>>> Cons:
> >> >>>>>
> >> >>>>> * the users need to be aware of the transformation step and be able to read and understand the generated result.
> >> >>>>> * it could look like it's not part of the standard way of doing things, that it's a hack.
> >> >>>>>
> >> >>>>> Pros:
> >> >>>>>
> >> >>>>> * it can inspire people to implement other crushmap transformation / generators (an alternative, simpler, syntax comes to mind ;-)
> >> >>>>> * it can be implemented using python to lower the barrier of entry
> >> >>>>>
> >> >>>>> I don't think it makes the implementation of the current proposal any simpler or more complex. Worst case scenario nobody writes any plugin, but that does not make this one plugin less useful.
> >> >>>>
> >> >>>> I think this is basically the alternative approach that Sam was
> >> >>>> suggesting during CDM: the idea of layering a new (perhaps very
> >> >>>> similar) syntax on top of the existing one, instead of extending the
> >> >>>> existing one directly.
> >> >>>
> >> >>> Ha nice, not such a stupid idea then :-) I'll try to defend it a little more below then. Please bear in mind that I'm not sure this is the way to go even though I'm writing as if I am.
> >> >>>
> >> >>>> The main argument against doing that was the complexity, not just of
> >> >>>> implementation but for users, who would now potentially have two
> >> >>>> separate sets of commands, one operating on the "high level" map
> >> >>>> (which would have a "myhost" object in it), and one operating on the
> >> >>>> native crush map (which would only have myhost~ssd, myhost~hdd
> >> >>>> entries, and would have no concept that a thing called myhost
> >> >>>> existed).
> >> >>>
> >> >>> As a user I'm not sure what is more complicated / confusing. If I'm an experienced Ceph user I'll think of this new syntax as a generator because I already know how crush works. I'll welcome the help and be relieved that I don't have to manually do that anymore. But having that as a native syntax may be a little uncomfortable for me because I will want to verify the new syntax matches what I expect, which comes naturally if the transformation step is separate. I may even tweak it a little with an intermediate script to match one thing or two. If I'm a new Ceph user this is one more concept I need to learn: the device class. And to understand what it means, the documentation will have to explain that it creates an independent crush hierarchy for each device class, with weights that only take into account the devices of that given class. I will not be exempted from understanding the transformation step, and the syntactic sugar may even make that more complicated to get.
> >> >>>
> >> >>> If I understand correctly, the three would co-exist: host, host~ssd, host~hdd so that you can write a rule that takes from all devices.
> >> >>
> >> >> (Sorry this response is so late)
> >> >>
> >> >> I think the extra work is not so much in the formats, as it is
> >> >> exposing that syntax via all the commands that we have, and/or new
> >> >> commands.  We would either need two lots of commands, or we would need
> >> >> to pick one layer (the 'generator' or the native one) for the
> >> >> commands, and treat the other layer as a hidden thing.
> >> >>
> >> >> It's also not just the extra work of implementing the commands/syntax,
> >> >> it's the extra complexity that ends up being exposed to users.
> >> >>
> >> >>>
> >> >>>> As for implementing other generators, the trouble with that is that
> >> >>>> the resulting conventions would be unknown to other tools, and to any
> >> >>>> commands built in to Ceph.
> >> >>>
> >> >>> Yes. But do we really want to insert the concept of "device class" in Ceph ? There are recurring complaints about manually creating the crushmap required to separate ssd from hdd. But is it inconvenient in any way that Ceph is otherwise unaware of this distinction ?
> >> >>
> >> >> Currently, if someone has done the manual stuff to set up SSD/HDD
> >> >> crush trees, any external tool has no way of knowing that two hosts
> >> >> (one ssd, one hdd) are actually the same host.  That's the key thing
> >> >> here for me -- the time saving during setup is a nice side effect, but
> >> >> the primary value of having a Ceph-defined way to do this is that
> >> >> every tool building on Ceph can rely on it.
> >> >>
> >> >>
> >> >>
> >> >>>> We *really* need a variant of "set noout"
> >> >>>> that operates on a crush subtree (typically a host), as it's the sane
> >> >>>> way to get people to temporarily mark some OSDs while they
> >> >>>> reboot/upgrade a host, but to implement that command we have to have
> >> >>>> an unambiguous way of identifying which buckets in the crush map
> >> >>>> belong to a host.  Whatever the convention is (myhost~ssd, myhost_ssd,
> >> >>>> whatever), it needs to be defined and built into Ceph in order to be
> >> >>>> interoperable.
> >> >>>
> >> >>> That goes back (above) to my understanding of Sage's proposal (which I may have wrong ?) in which the host bucket still exists and still contains all devices regardless of their class.
> >> >>
> >> >> In Sage's proposal as I understand it, there's an underlying native
> >> >> crush map that uses today's format (i.e. clients need no upgrade),
> >> >> which is generated in response to either commands that edit the map,
> >> >> or the user inputting a modified map in the text format.  That
> >> >> conversion would follow pretty simple rules (assuming a host 'myhost'
> >> >> with ssd and hdd devices):
> >> >>  * On the way in, bucket 'myhost' generates 'myhost~ssd', 'myhost~hdd' buckets
> >> >>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged into 'myhost'
> >> >>  * When running a CLI command, something targeting 'myhost' will
> >> >> target both 'myhost~hdd' and 'myhost~ssd'
> >> >>
> >> >> It's that last part that probably isn't captured properly by something
> >> >> external that does a syntax conversion during import/export.
> >> >>
> >> >> John
> >> >>
> >> >>> Cheers
> >> >>>
> >> >>>>
> >> >>>> John
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>> Cheers
> >> >>>>>
> >> >>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
> >> >>>>>> Hi everyone,
> >> >>>>>>
> >> >>>>>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
> >> >>>>>> discussion yesterday:
> >> >>>>>>
> >> >>>>>> - consolidated notes into a single proposal
> >> >>>>>> - use otherwise illegal character (e.g., ~) as separator for generated
> >> >>>>>> buckets.  This avoids ambiguity with user-defined buckets.
> >> >>>>>> - class-id $class $id properties for each bucket.  This allows us to
> >> >>>>>> preserve the derivative bucket ids across a decompile->compile cycle so
> >> >>>>>> that data does not move (the bucket id is one of many inputs into crush's
> >> >>>>>> hash during placement).
> >> >>>>>> - simpler rule syntax:
> >> >>>>>>
> >> >>>>>>     rule ssd {
> >> >>>>>>             ruleset 1
> >> >>>>>>             step take default class ssd
> >> >>>>>>             step chooseleaf firstn 0 type host
> >> >>>>>>             step emit
> >> >>>>>>     }
> >> >>>>>>
> >> >>>>>> My rationale here is that we don't want to make this a separate 'step'
> >> >>>>>> call since steps map to underlying crush rule step ops, and this is a
> >> >>>>>> directive only to the compiler.  Making it an optional step argument seems
> >> >>>>>> like the cleanest way to do that.
> >> >>>>>>
> >> >>>>>> Any other comments before we kick this off?
> >> >>>>>>
> >> >>>>>> Thanks!
> >> >>>>>> sage
> >> >>>>>>
> >> >>>>>>
> >> >>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
> >> >>>>>>
> >> >>>>>>> Hi Wido,
> >> >>>>>>>
> >> >>>>>>> Updated http://pad.ceph.com/p/crush-types with your proposal for the rule syntax
> >> >>>>>>>
> >> >>>>>>> Cheers
> >> >>>>>>>
> >> >>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
> >> >>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
> >> >>>>>>>>>> Op 22 januari 2017 om 17:44 schreef Loic Dachary <loic@dachary.org>:
> >> >>>>>>>>>>
> >> >>>>>>>>>>
> >> >>>>>>>>>> Hi Sage,
> >> >>>>>>>>>>
> >> >>>>>>>>>> You proposed an improvement to the crush map to address different device types (SSD, HDD, etc.)[1]. When learning how to create a crush map, I was indeed confused by the tricks required to create SSD only pools. After years of practice it feels more natural :-)
> >> >>>>>>>>>>
> >> >>>>>>>>>> The source of my confusion was mostly because I had to use a hierarchical description to describe something that is not organized hierarchically. "The rack contains hosts that contain devices" is intuitive. "The rack contains hosts that contain ssd that contain devices" is counter intuitive. Changing:
> >> >>>>>>>>>>
> >> >>>>>>>>>>     # devices
> >> >>>>>>>>>>     device 0 osd.0
> >> >>>>>>>>>>     device 1 osd.1
> >> >>>>>>>>>>     device 2 osd.2
> >> >>>>>>>>>>     device 3 osd.3
> >> >>>>>>>>>>
> >> >>>>>>>>>> into:
> >> >>>>>>>>>>
> >> >>>>>>>>>>     # devices
> >> >>>>>>>>>>     device 0 osd.0 ssd
> >> >>>>>>>>>>     device 1 osd.1 ssd
> >> >>>>>>>>>>     device 2 osd.2 hdd
> >> >>>>>>>>>>     device 3 osd.3 hdd
> >> >>>>>>>>>>
> >> >>>>>>>>>> where ssd/hdd is the device class would be much better. However, using the device class like so:
> >> >>>>>>>>>>
> >> >>>>>>>>>>     rule ssd {
> >> >>>>>>>>>>             ruleset 1
> >> >>>>>>>>>>             type replicated
> >> >>>>>>>>>>             min_size 1
> >> >>>>>>>>>>             max_size 10
> >> >>>>>>>>>>             step take default:ssd
> >> >>>>>>>>>>             step chooseleaf firstn 0 type host
> >> >>>>>>>>>>             step emit
> >> >>>>>>>>>>     }
> >> >>>>>>>>>>
> >> >>>>>>>>>> looks arcane. Since the goal is to simplify the description for the first time user, maybe we could have something like:
> >> >>>>>>>>>>
> >> >>>>>>>>>>     rule ssd {
> >> >>>>>>>>>>             ruleset 1
> >> >>>>>>>>>>             type replicated
> >> >>>>>>>>>>             min_size 1
> >> >>>>>>>>>>             max_size 10
> >> >>>>>>>>>>             device class = ssd
> >> >>>>>>>>>
> >> >>>>>>>>> Would that be sane?
> >> >>>>>>>>>
> >> >>>>>>>>> Why not:
> >> >>>>>>>>>
> >> >>>>>>>>> step set-class ssd
> >> >>>>>>>>> step take default
> >> >>>>>>>>> step chooseleaf firstn 0 type host
> >> >>>>>>>>> step emit
> >> >>>>>>>>>
> >> >>>>>>>>> Since it's a 'step' you take, am I right?
> >> >>>>>>>>
> >> >>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
> >> >>>>>>>>
> >> >>>>>>>> sage
> >> >>>>>>>>
> >> >>>>>>>
> >> >>>>>>> --
> >> >>>>>>> Loïc Dachary, Artisan Logiciel Libre
> >> >>>>>>>
> >> >>>>>
> >> >>>>> --
> >> >>>>> Loïc Dachary, Artisan Logiciel Libre
> >> >>>>
> >> >>>
> >> >>> --
> >> >>> Loïc Dachary, Artisan Logiciel Libre
> >> >>
> >> >
> >> > --
> >> > Loïc Dachary, Artisan Logiciel Libre
> >>
> >>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Re: crush devices class types
       [not found]                           ` <201706281000476718115@gmail.com>
@ 2017-06-28  4:26                             ` Sage Weil
  0 siblings, 0 replies; 25+ messages in thread
From: Sage Weil @ 2017-06-28  4:26 UTC (permalink / raw)
  To: clive.xc; +Cc: Dan van der Ster, Loic Dachary, John Spray, Ceph Development


On Wed, 28 Jun 2017, clive.xc@gmail.com wrote:
> Hi Sage,
> I am trying Ceph 12.2.0 and ran into a problem.
> 
> My buckets can be created successfully:
> 
> [root@node1 ~]# ceph osd tree
> ID WEIGHT  TYPE NAME          UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -6 0.01939 root default~ssd                                    
> -5 0.01939     host node1~ssd                                  
>  0 0.01939         osd.0           up  1.00000          1.00000
> -4 0.01939 root default~hdd                                    
> -3 0.01939     host node1~hdd                                  
>  1 0.01939         osd.1           up  1.00000          1.00000
> -1 0.03879 root default                                        
> -2 0.03879     host node1                                      
>  0 0.01939         osd.0           up  1.00000          1.00000
>  1 0.01939         osd.1           up  1.00000          1.00000
> 
> but the crush rule cannot be created:
> 
> [root@node1 ~]# ceph osd crush rule create-simple hdd default~hdd host
> Invalid command:  invalid chars ~ in default~hdd
> osd crush rule create-simple <name> <root> <type> {firstn|indep} :  create
> crush rule <name> to start from <root>, replicate across buckets of type
> <type>, using a choose mode of <firstn|indep> (default firstn; indep best
> for erasure pools)
> Error EINVAL: invalid command

Eep... this is an oversight.  We need to fix the rule creation commands to 
allow specifying a device class.  I'll make sure this is in the next 
RC.

Until then, you can extract the crush map and create the rule manually.  
The updated syntax adds 'class <foo>' to the end of the 'take' step.  
e.g.,

rule replicated_ssd_rule {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}
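
For completeness, the manual round-trip looks roughly like this (a sketch;
file names are arbitrary):

ceph osd getcrushmap -o crushmap.bin        # extract the current binary map
crushtool -d crushmap.bin -o crushmap.txt   # decompile to editable text
# add the rule above to crushmap.txt, then:
crushtool -c crushmap.txt -o crushmap.new   # recompile
ceph osd setcrushmap -i crushmap.new        # inject the updated map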

sage




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-03-08 14:39                     ` Sage Weil
  2017-03-08 15:55                       ` Dan van der Ster
@ 2017-06-28  5:28                       ` Kyle Bader
  2017-06-28 13:46                         ` Sage Weil
  1 sibling, 1 reply; 25+ messages in thread
From: Kyle Bader @ 2017-06-28  5:28 UTC (permalink / raw)
  To: Sage Weil; +Cc: Dan van der Ster, Loic Dachary, John Spray, Ceph Development

On Wed, Mar 8, 2017 at 6:39 AM, Sage Weil <sage@newdream.net> wrote:
> On Wed, 8 Mar 2017, Dan van der Ster wrote:
>> Hi Loic,
>>
>> Did you already have a plan for how an operator would declare the
>> device class of each OSD?
>> Would this be a new --device-class option to ceph-disk prepare, which
>> would perhaps create a device-class file in the root of the OSD's xfs
>> dir?
>> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
>> combination of ceph.conf's "crush location" and this per-OSD file.
>
> Hmm we haven't talked about this part yet.  I see a few options...
>
> 1) explicit ceph-disk argument, recorded as a file in osd_data
>
> 2) osd can autodetect this based on the 'rotational' flag in sysfs.  The
> trick here, I think, is to come up with suitable defaults.  We might have
> NVMe, SATA/SAS SSDs, HDD, or a combination (with journal and data (and
> db) spread across multiple types).   Perhaps those could break down into
> classes like
>
>         hdd
>         ssd
>         nvme
>         hdd+ssd-journal
>         hdd+nvme-journal
>         hdd+ssd-db+nvme-journal
>
> which is probably sufficient for most users.  And if the admin likes they
> can override.
>
> - Then the osd adjusts device-class on startup, just like it does with the
> crush map position.  (Note that this will have no real effect until the
> CRUSH rule(s) are changed to use device class.)
>
> - We'll need an 'osd crush set-device-class <osd.NNN> <class>' command.
> The only danger I see here is that if you set it to something other than
> what the OSD autodetects above, it'll get clobbered on the next OSD
> restart.  Maybe the autodetection *only* sets the device class if it isn't
> already set?

One of the classes could be "autodetect" and it could be the default,
kinda like auto-negotiation on NIC/switch ports?

-- 

Kyle Bader

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-06-28  5:28                       ` Kyle Bader
@ 2017-06-28 13:46                         ` Sage Weil
  0 siblings, 0 replies; 25+ messages in thread
From: Sage Weil @ 2017-06-28 13:46 UTC (permalink / raw)
  To: Kyle Bader; +Cc: Dan van der Ster, Loic Dachary, John Spray, Ceph Development

On Tue, 27 Jun 2017, Kyle Bader wrote:
> On Wed, Mar 8, 2017 at 6:39 AM, Sage Weil <sage@newdream.net> wrote:
> > On Wed, 8 Mar 2017, Dan van der Ster wrote:
> >> Hi Loic,
> >>
> >> Did you already have a plan for how an operator would declare the
> >> device class of each OSD?
> >> Would this be a new --device-class option to ceph-disk prepare, which
> >> would perhaps create a device-class file in the root of the OSD's xfs
> >> dir?
> >> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
> >> combination of ceph.conf's "crush location" and this per-OSD file.
> >
> > Hmm we haven't talked about this part yet.  I see a few options...
> >
> > 1) explicit ceph-disk argument, recorded as a file in osd_data
> >
> > 2) osd can autodetect this based on the 'rotational' flag in sysfs.  The
> > trick here, I think, is to come up with suitable defaults.  We might have
> > NVMe, SATA/SAS SSDs, HDD, or a combination (with journal and data (and
> > db) spread across multiple types).   Perhaps those could break down into
> > classes like
> >
> >         hdd
> >         ssd
> >         nvme
> >         hdd+ssd-journal
> >         hdd+nvme-journal
> >         hdd+ssd-db+nvme-journal
> >
> > which is probably sufficient for most users.  And if the admin likes they
> > can override.
> >
> > - Then the osd adjusts device-class on startup, just like it does with the
> > crush map position.  (Note that this will have no real effect until the
> > CRUSH rule(s) are changed to use device class.)
> >
> > - We'll need an 'osd crush set-device-class <osd.NNN> <class>' command.
> > The only danger I see here is that if you set it to something other than
> > what the OSD autodetects above, it'll get clobbered on the next OSD
> > restart.  Maybe the autodetection *only* sets the device class if it isn't
> > already set?
> 
> One of the classes could be "autodetect" and it could be the default,
> kinda like auto-negotiation on NIC/switch ports?

Yes!  I think we already have osd_crush_update_on_start (default: yes) 
which controls whether we update the crush location on OSD start.  We can 
do the same with the device class if it is not already set, via an option like 
osd_crush_update_class_on_start.  I'm thinking just a simple 'hdd' and 
'ssd' class...
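
In ceph.conf terms, something like this (a sketch; the class option name is
only the proposal above, not an existing setting):

[osd]
osd_crush_update_on_start = true        # existing: update crush location
osd_crush_update_class_on_start = true  # proposed: set device class if unset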

sage


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: crush devices class types
  2017-02-03 10:52         ` Wido den Hollander
  2017-02-03 10:57           ` John Spray
@ 2017-06-28 17:54           ` Kyle Bader
  1 sibling, 0 replies; 25+ messages in thread
From: Kyle Bader @ 2017-06-28 17:54 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: Sage Weil, Loic Dachary, Ceph Development

I like John's idea of getting configuration specific to a class from
the monitors, but I think I've thought of a situation where it would
be desirable to have a local configuration like Wido's. On some of the
really high-end flash configurations we have two network adapters,
each on their own NUMA node, with a distinct IP address. This is an
operations nightmare, because each OSD needs its own definition that
points to the appropriate IP.
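
Today that means per-OSD sections along these lines (addresses made up
for illustration):

[osd.0]
cluster addr = 10.1.0.10

[osd.1]
cluster addr = 10.1.1.10

...repeated for every OSD on the box. With per-class configuration it
could collapse to: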

[class.numa.1]
cluster_network = 10.1.0.0/24

[class.numa.2]
cluster_network = 10.1.1.0/24

Maybe it doesn't make sense for "classes" to be used this way, because
I can't think of a reason you would want a pool to only use the left
half of every machine, but using "classes" as a form of "label"
could make this sort of configuration more approachable.

[label.numa.1]
cluster_network = 10.1.0.0/24

[label.numa.2]
cluster_network = 10.1.1.0/24

Perhaps we have a pluggable system for applying "labels" to OSDs, and
the "class" of an OSD is dictated by possession of some combination of
labels. Example labels:

mfg: [intel|samsung|sandisk|micron]
numa: [1,2]
bus: [sata,sas,nvme]
rotational: [0,1]
type: [rust,2dnand,3dnand,xpoint]
over_provisioning: [1.1,1.2,1.3]

Then you could create a "gpssd" classifier that includes OSDs with:

bus = sas
rotational = 0

And a "piops" classifier that includes OSDs with:

bus: nvme
over_provisioning: 1.3
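
In config form those classifiers might look something like this (purely
illustrative; no such syntax exists today):

[classifier.gpssd]
bus = sas
rotational = 0

[classifier.piops]
bus = nvme
over_provisioning = 1.3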



On Fri, Feb 3, 2017 at 2:52 AM, Wido den Hollander <wido@42on.com> wrote:
>
>> Op 2 februari 2017 om 21:57 schreef Sage Weil <sage@newdream.net>:
>>
>>
>> Hi everyone,
>>
>> I made more updates to http://pad.ceph.com/p/crush-types after the CDM
>> discussion yesterday:
>>
>> - consolidated notes into a single proposal
>> - use an otherwise illegal character (e.g., ~) as separator for generated
>> buckets.  This avoids ambiguity with user-defined buckets.
>> - class-id $class $id properties for each bucket.  This allows us to
>> preserve the derivative bucket ids across a decompile->compile cycle so
>> that data does not move (the bucket id is one of many inputs into crush's
>> hash during placement).
>> - simpler rule syntax:
>>
>>     rule ssd {
>>             ruleset 1
>>             step take default class ssd
>>             step chooseleaf firstn 0 type host
>>             step emit
>>     }
>>
>> My rationale here is that we don't want to make this a separate 'step'
>> call since steps map to underlying crush rule step ops, and this is a
>> directive only to the compiler.  Making it an optional step argument seems
>> like the cleanest way to do that.
>>
>> Any other comments before we kick this off?
>>
>
> No, looks good to me! Like combining the class into the 'step'.
>
> Would be very nice to have this in L!
>
> What would be interesting as well is if OSD daemons could somehow access this while parsing their configuration.
>
> Eg
>
> [class.ssd]
>   osd_op_threads = 16
>
> [class.hdd]
>    osd_max_backfills = 1
>
> That way you can keep configuration generic and makes config management a lot easier.
>
> Wido
>
>> Thanks!
>> sage
>>
>>



-- 

Kyle Bader

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread

Thread overview: 25+ messages
2017-01-22 16:44 crush devices class types Loic Dachary
2017-01-23 13:38 ` Wido den Hollander
2017-01-23 14:29   ` Sage Weil
2017-01-23 14:41     ` Loic Dachary
2017-02-02 20:57       ` Sage Weil
2017-02-03 10:52         ` Wido den Hollander
2017-02-03 10:57           ` John Spray
2017-02-03 12:23             ` Wido den Hollander
2017-06-28 17:54           ` Kyle Bader
2017-02-03 11:24         ` John Spray
2017-02-03 12:22         ` Loic Dachary
2017-02-03 12:46           ` John Spray
2017-02-03 12:52             ` Brett Niver
2017-02-03 13:21             ` Loic Dachary
2017-02-15 11:57               ` John Spray
2017-02-15 12:14                 ` Loic Dachary
2017-03-08  9:42                   ` Dan van der Ster
2017-03-08 10:05                     ` Loic Dachary
2017-03-08 14:39                     ` Sage Weil
2017-03-08 15:55                       ` Dan van der Ster
2017-03-08 17:00                         ` Sage Weil
     [not found]                           ` <201706281000476718115@gmail.com>
2017-06-28  4:26                             ` Sage Weil
2017-06-28  5:28                       ` Kyle Bader
2017-06-28 13:46                         ` Sage Weil
2017-01-23 14:12 ` Sage Weil
