From: Sage Weil
Subject: Re: Re: crush devices class types
Date: Wed, 28 Jun 2017 04:26:20 +0000 (UTC)
To: "clive.xc@gmail.com"
Cc: Dan van der Ster, Loic Dachary, John Spray, Ceph Development

On Wed, 28 Jun 2017, clive.xc@gmail.com wrote:
> Hi Sage,
> I am trying ceph 12.2.0, and got one problem:
>
> my bucket can be created successfully,
>
> [root@node1 ~]# ceph osd tree
> ID WEIGHT  TYPE NAME            UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -6 0.01939 root default~ssd
> -5 0.01939     host node1~ssd
>  0 0.01939         osd.0             up  1.00000          1.00000
> -4 0.01939 root default~hdd
> -3 0.01939     host node1~hdd
>  1 0.01939         osd.1             up  1.00000          1.00000
> -1 0.03879 root default
> -2 0.03879     host node1
>  0 0.01939         osd.0             up  1.00000          1.00000
>  1 0.01939         osd.1             up  1.00000          1.00000
>
> but crush rule cannot be created:
>
> [root@node1 ~]# ceph osd crush rule create-simple hdd default~hdd host
> Invalid command: invalid chars ~ in default~hdd
> osd crush rule create-simple <name> <root> <type> {firstn|indep} :
> create crush rule <name> to start from <root>, replicate across
> buckets of type <type>, using a choose mode of <firstn|indep>
> (default firstn; indep best for erasure pools)
> Error EINVAL: invalid command

Eep.. this is an oversight.  We need to fix the create rule command to
allow rules specifying a device class.  I'll make sure this is in the
next RC.

Until then, you can extract the crush map and create the rule manually
[see the command sketch appended at the end of this message].  The
updated syntax adds 'class <class>' to the end of the 'take' step,
e.g.,

rule replicated_ssd_rule {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default class ssd
        step chooseleaf firstn 0 type host
        step emit
}

sage

> ____________________________________________________________________________
> clive.xc@gmail.com
>
> From: Sage Weil
> Date: 2017-03-09 01:00
> To: Dan van der Ster
> CC: Loic Dachary; John Spray; Ceph Development
> Subject: Re: crush devices class types
> On Wed, 8 Mar 2017, Dan van der Ster wrote:
> > On Wed, Mar 8, 2017 at 3:39 PM, Sage Weil wrote:
> > > On Wed, 8 Mar 2017, Dan van der Ster wrote:
> > >> Hi Loic,
> > >>
> > >> Did you already have a plan for how an operator would declare the
> > >> device class of each OSD?
> > >> Would this be a new --device-class option to ceph-disk prepare,
> > >> which would perhaps create a device-class file in the root of the
> > >> OSD's xfs dir?
> > >> Then osd crush create-or-move in ceph-osd-prestart.sh would be a
> > >> combination of ceph.conf's "crush location" and this per-OSD file.
> > >
> > > Hmm we haven't talked about this part yet.  I see a few options...
> > >
> > > 1) explicit ceph-disk argument, recorded as a file in osd_data
> > >
> > > 2) osd can autodetect this based on the 'rotational' flag in sysfs.
> > > The trick here, I think, is to come up with suitable defaults.  We
> > > might have NVMe, SATA/SAS SSDs, HDD, or a combination (with journal
> > > and data (and db) spread across multiple types).  Perhaps those could
> > > break down into classes like
> > >
> > >         hdd
> > >         ssd
> > >         nvme
> > >         hdd+ssd-journal
> > >         hdd+nvme-journal
> > >         hdd+ssd-db+nvme-journal
> > >
> > > which is probably sufficient for most users.  And if the admin likes
> > > they can override.
> > >
> > > - Then the osd adjusts device-class on startup, just like it does
> > > with the crush map position.  (Note that this will have no real
> > > effect until the CRUSH rule(s) are changed to use device class.)
> > >
> > > - We'll need an 'osd crush set-device-class <class>' command.  The
> > > only danger I see here is that if you set it to something other than
> > > what the OSD autodetects above, it'll get clobbered on the next OSD
> > > restart.  Maybe the autodetection *only* sets the device class if it
> > > isn't already set?
> >
> > This is the same issue we have with crush locations, hence the osd
> > crush update on start option, right?
> >
> > >
> > > - We need to adjust the crush rule commands to allow a device class.
> > > Currently we have
> > >
> > > osd crush rule create-erasure <name>    create crush rule <name> for erasure
> > >  {<profile>}                            coded pool created with <profile> (
> > >                                         default default)
> > > osd crush rule create-simple <name>     create crush rule <name> to start from
> > >  <root> <type> {firstn|indep}           <root>, replicate across buckets of
> > >                                         type <type>, using a choose mode of
> > >                                         <firstn|indep> (default firstn; indep
> > >                                         best for erasure pools)
> > >
> > > ...so we could add another optional arg at the end for the device
> > > class.
> > >
> >
> > How far along in the implementation are you? Still time for discussing
> > the basic idea?
> >
> > I wonder if you all had thought about using device classes like we use
> > buckets (i.e. to choose across device types)? Suppose I have two
> > brands of ssds: I want to define two classes ssd-a and ssd-b. And I
> > want to replicate across these classes (and across, say, hosts as
> > well). I think I'd need a choose step to choose 2 from classtype ssd
> > (out of ssd-a, ssd-b, etc...), and then chooseleaf across hosts.
> > IOW, device classes could be an orthogonal, but similarly flexible,
> > structure to crush buckets: device classes would have a hierarchy.
> >
> > So we could still have:
> >
> > device 0 osd.0 class ssd-a
> > device 1 osd.1 class ssd-b
> > device 2 osd.2 class hdd-c
> > device 3 osd.3 class hdd-d
> >
> > but then we define the class-types and their hierarchy like we
> > already do for osds.
> > Shown in a "class tree" we could have, for example:
> >
> > TYPE               NAME
> > root                  default
> >     classtype    hdd
> >         class        hdd-c
> >         class        hdd-d
> >     classtype    ssd
> >         class        ssd-a
> >         class        ssd-b
> >
> > Sorry to bring this up late in the thread.
>
> John mentioned something similar in a related thread several weeks
> back.  This would be a pretty cool capability.  It's quite a bit harder
> to realize, though.
>
> First, you need to ensure that you have a broad enough mix of device
> classes to make this an enforceable constraint.  Like if you're doing
> 3x replication, that means at least 3 brands/models of SSDs.  And, like
> the normal hierarchy, you need to ensure that there are sufficient
> numbers of each to actually place the data in a way that satisfies the
> constraint.
>
> Mainly, though, it requires a big change to the crush mapping algorithm
> itself.  (A nice property of the current device classes is that crush
> on the client doesn't need to change--this will work fine with any
> legacy client.)  Here, though, we'd need to do the crush rules in 2
> dimensions.  Something like first choosing the device types for the
> replicas, and then using a separate tree for each device, while also
> recognizing the equivalence of other nodes in the hierarchy (racks,
> hosts, etc.) to enforce the usual placement constraints.
>
> Anyway, it would be much more involved.  I think the main thing to do
> now is try to ensure we don't make our lives harder later if we go down
> that path.  My guess is we'd want to adopt some naming mechanism for
> classes that is friendly to class hierarchy like you have above (e.g.
> hdd/a, hdd/b), but otherwise the "each device has a class" property
> we're adding now wouldn't really change.  The new bit would be how the
> rule is defined, but since larger changes would be needed there I don't
> think the small tweak we've just made would be an issue...?
>
> BTW, the initial CRUSH device class support just merged.  Next up are
> the various mon commands and osd hooks to make it easy to use...
>
> sage
>
> > Cheers, Dan
> >
> > > sage
> > >
> > >
> > >>
> > >> Cheers, Dan
> > >>
> > >>
> > >> On Wed, Feb 15, 2017 at 1:14 PM, Loic Dachary wrote:
> > >> > Hi John,
> > >> >
> > >> > Thanks for the discussion :-) I'll start implementing the
> > >> > proposal as described originally.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On 02/15/2017 12:57 PM, John Spray wrote:
> > >> >> On Fri, Feb 3, 2017 at 1:21 PM, Loic Dachary wrote:
> > >> >>>
> > >> >>>
> > >> >>> On 02/03/2017 01:46 PM, John Spray wrote:
> > >> >>>> On Fri, Feb 3, 2017 at 12:22 PM, Loic Dachary wrote:
> > >> >>>>> Hi,
> > >> >>>>>
> > >> >>>>> Reading Wido & John's comments I thought of something, not
> > >> >>>>> sure if that's a good idea or not. Here it is anyway ;-)
> > >> >>>>>
> > >> >>>>> The device class problem we're trying to solve is one
> > >> >>>>> instance of a more general need to produce crush tables that
> > >> >>>>> implement a given use case. The SSD / HDD use case is so
> > >> >>>>> frequent that it would make sense to modify the crush format
> > >> >>>>> for this. But maybe we could instead implement that as a
> > >> >>>>> crush table generator.
> > >> >>>>>
> > >> >>>>> Let's say you want help to create the hierarchies to
> > >> >>>>> implement the ssd/hdd separation: you write your crushmap
> > >> >>>>> using the proposed syntax.
> > >> >>>>> But instead of feeding it directly to crushtool -c, you
> > >> >>>>> would do something like:
> > >> >>>>>
> > >> >>>>>    crushtool --plugin 'device-class' --transform < mycrushmap.txt | crushtool -c - -o mycrushmap
> > >> >>>>>
> > >> >>>>> The 'device-class' transformation documents the naming
> > >> >>>>> conventions so the user knows root will generate root_ssd and
> > >> >>>>> root_hdd. And the users can also check by themselves the
> > >> >>>>> generated crushmap.
> > >> >>>>>
> > >> >>>>> Cons:
> > >> >>>>>
> > >> >>>>> * the users need to be aware of the transformation step and
> > >> >>>>>   be able to read and understand the generated result.
> > >> >>>>> * it could look like it's not part of the standard way of
> > >> >>>>>   doing things, that it's a hack.
> > >> >>>>>
> > >> >>>>> Pros:
> > >> >>>>>
> > >> >>>>> * it can inspire people to implement other crushmap
> > >> >>>>>   transformations / generators (an alternative, simpler,
> > >> >>>>>   syntax comes to mind ;-)
> > >> >>>>> * it can be implemented using python to lower the barrier of
> > >> >>>>>   entry
> > >> >>>>>
> > >> >>>>> I don't think it makes the implementation of the current
> > >> >>>>> proposal any simpler or more complex. Worst case scenario
> > >> >>>>> nobody writes any plugin, but that does not make this one
> > >> >>>>> plugin less useful.
> > >> >>>>
> > >> >>>> I think this is basically the alternative approach that Sam
> > >> >>>> was suggesting during CDM: the idea of layering a new (perhaps
> > >> >>>> very similar) syntax on top of the existing one, instead of
> > >> >>>> extending the existing one directly.
> > >> >>>
> > >> >>> Ha nice, not such a stupid idea then :-) I'll try to defend it
> > >> >>> a little more below then. Please bear in mind that I'm not sure
> > >> >>> this is the way to go even though I'm writing as if I am.
> > >> >>>
> > >> >>>> The main argument against doing that was the complexity, not
> > >> >>>> just of implementation but for users, who would now
> > >> >>>> potentially have two separate sets of commands, one operating
> > >> >>>> on the "high level" map (which would have a "myhost" object in
> > >> >>>> it), and one operating on the native crush map (which would
> > >> >>>> only have myhost~ssd, myhost~hdd entries, and would have no
> > >> >>>> concept that a thing called myhost existed).
> > >> >>>
> > >> >>> As a user I'm not sure what is more complicated / confusing. If
> > >> >>> I'm an experienced Ceph user I'll think of this new syntax as a
> > >> >>> generator because I already know how crush works. I'll welcome
> > >> >>> the help and be relieved that I don't have to do that manually
> > >> >>> anymore. But having that as a native syntax may be a little
> > >> >>> uncomfortable for me, because I will want to verify the new
> > >> >>> syntax matches what I expect, which comes naturally if the
> > >> >>> transformation step is separate. I may even tweak it a little
> > >> >>> with an intermediate script to match one thing or two. If I'm a
> > >> >>> new Ceph user this is one more concept I need to learn: the
> > >> >>> device class. And to understand what it means, the
> > >> >>> documentation will have to explain that it creates an
> > >> >>> independent crush hierarchy for each device class, with weights
> > >> >>> that only take into account the devices of that given class. I
> > >> >>> will not be exonerated from understanding the transformation
> > >> >>> step, and the syntactic sugar may even make that more
> > >> >>> complicated to get.
> > >> >>>
> > >> >>> If I understand correctly, the three would co-exist: host,
> > >> >>> host~ssd, host~hdd, so that you can write a rule that takes
> > >> >>> from all devices.
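[For illustration -- a minimal sketch, in decompiled crushmap text, of
the three buckets co-existing as described above. The bucket names and
the '~' separator come from this thread; the IDs, weights, and bucket
algorithm are assumptions patterned on the `ceph osd tree` output at the
top of this message:]

    # the plain host bucket still holds all devices...
    host node1 {
            id -2
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 0.019
            item osd.1 weight 0.019
    }
    # ...while each generated per-class bucket holds only the devices
    # of its class, with weights counting only those devices
    host node1~ssd {
            id -5
            alg straw
            hash 0  # rjenkins1
            item osd.0 weight 0.019
    }
    host node1~hdd {
            id -3
            alg straw
            hash 0  # rjenkins1
            item osd.1 weight 0.019
    }

[A rule doing 'step take default' would thus still reach every device,
while 'step take default class ssd' (per Sage's reply above) compiles
down to taking the generated ~ssd subtree instead.]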
> > >> >>
> > >> >> (Sorry this response is so late)
> > >> >>
> > >> >> I think the extra work is not so much in the formats, as it is
> > >> >> exposing that syntax via all the commands that we have, and/or
> > >> >> new commands.  We would either need two lots of commands, or we
> > >> >> would need to pick one layer (the 'generator' or the native one)
> > >> >> for the commands, and treat the other layer as a hidden thing.
> > >> >>
> > >> >> It's also not just the extra work of implementing the
> > >> >> commands/syntax, it's the extra complexity that ends up being
> > >> >> exposed to users.
> > >> >>
> > >> >>>
> > >> >>>> As for implementing other generators, the trouble with that is
> > >> >>>> that the resulting conventions would be unknown to other
> > >> >>>> tools, and to any commands built in to Ceph.
> > >> >>>
> > >> >>> Yes. But do we really want to insert the concept of "device
> > >> >>> class" into Ceph? There are recurring complaints about manually
> > >> >>> creating the crushmap required to separate ssd from hdd. But is
> > >> >>> it inconvenient in any way that Ceph is otherwise unaware of
> > >> >>> this distinction?
> > >> >>
> > >> >> Currently, if someone has done the manual stuff to set up
> > >> >> SSD/HDD crush trees, any external tool has no way of knowing
> > >> >> that two hosts (one ssd, one hdd) are actually the same host.
> > >> >> That's the key thing here for me -- the time saving during setup
> > >> >> is a nice side effect, but the primary value of having a
> > >> >> Ceph-defined way to do this is that every tool building on Ceph
> > >> >> can rely on it.
> > >> >>
> > >> >>
> > >> >>>> We *really* need a variant of "set noout" that operates on a
> > >> >>>> crush subtree (typically a host), as it's the sane way to get
> > >> >>>> people to temporarily mark some OSDs while they reboot/upgrade
> > >> >>>> a host, but to implement that command we have to have an
> > >> >>>> unambiguous way of identifying which buckets in the crush map
> > >> >>>> belong to a host.  Whatever the convention is (myhost~ssd,
> > >> >>>> myhost_ssd, whatever), it needs to be defined and built into
> > >> >>>> Ceph in order to be interoperable.
> > >> >>>
> > >> >>> That goes back (above) to my understanding of Sage's proposal
> > >> >>> (which I may have wrong?) in which the host bucket still exists
> > >> >>> and still contains all devices regardless of their class.
> > >> >>
> > >> >> In Sage's proposal as I understand it, there's an underlying
> > >> >> native crush map that uses today's format (i.e. clients need no
> > >> >> upgrade), which is generated in response to either commands that
> > >> >> edit the map, or the user inputting a modified map in the text
> > >> >> format.  That conversion would follow pretty simple rules
> > >> >> (assuming a host 'myhost' with ssd and hdd devices):
> > >> >>  * On the way in, bucket 'myhost' generates 'myhost~ssd',
> > >> >>    'myhost~hdd' buckets
> > >> >>  * On the way out, buckets 'myhost~ssd', 'myhost~hdd' get merged
> > >> >>    into 'myhost'
> > >> >>  * When running a CLI command, something targeting 'myhost' will
> > >> >>    target both 'myhost~hdd' and 'myhost~ssd'
> > >> >>
> > >> >> It's that last part that probably isn't captured properly by
> > >> >> something external that does a syntax conversion during
> > >> >> import/export.
> > >> >>
> > >> >> John
> > >> >>
> > >> >>> Cheers
> > >> >>>
> > >> >>>>
> > >> >>>> John
> > >> >>>>
> > >> >>>>
> > >> >>>>> Cheers
> > >> >>>>>
> > >> >>>>> On 02/02/2017 09:57 PM, Sage Weil wrote:
> > >> >>>>>> Hi everyone,
> > >> >>>>>>
> > >> >>>>>> I made more updates to http://pad.ceph.com/p/crush-types
> > >> >>>>>> after the CDM discussion yesterday:
> > >> >>>>>>
> > >> >>>>>> - consolidated notes into a single proposal
> > >> >>>>>> - use an otherwise illegal character (e.g., ~) as separator
> > >> >>>>>> for generated buckets.  This avoids ambiguity with
> > >> >>>>>> user-defined buckets.
> > >> >>>>>> - class-id $class $id properties for each bucket.  This
> > >> >>>>>> allows us to preserve the derivative bucket ids across a
> > >> >>>>>> decompile->compile cycle so that data does not move (the
> > >> >>>>>> bucket id is one of many inputs into crush's hash during
> > >> >>>>>> placement).
> > >> >>>>>> - simpler rule syntax:
> > >> >>>>>>
> > >> >>>>>>     rule ssd {
> > >> >>>>>>             ruleset 1
> > >> >>>>>>             step take default class ssd
> > >> >>>>>>             step chooseleaf firstn 0 type host
> > >> >>>>>>             step emit
> > >> >>>>>>     }
> > >> >>>>>>
> > >> >>>>>> My rationale here is that we don't want to make this a
> > >> >>>>>> separate 'step' call since steps map to underlying crush
> > >> >>>>>> rule step ops, and this is a directive only to the compiler.
> > >> >>>>>> Making it an optional step argument seems like the cleanest
> > >> >>>>>> way to do that.
> > >> >>>>>>
> > >> >>>>>> Any other comments before we kick this off?
> > >> >>>>>>
> > >> >>>>>> Thanks!
> > >> >>>>>> sage
> > >> >>>>>>
> > >> >>>>>>
> > >> >>>>>> On Mon, 23 Jan 2017, Loic Dachary wrote:
> > >> >>>>>>
> > >> >>>>>>> Hi Wido,
> > >> >>>>>>>
> > >> >>>>>>> Updated http://pad.ceph.com/p/crush-types with your
> > >> >>>>>>> proposal for the rule syntax
> > >> >>>>>>>
> > >> >>>>>>> Cheers
> > >> >>>>>>>
> > >> >>>>>>> On 01/23/2017 03:29 PM, Sage Weil wrote:
> > >> >>>>>>>> On Mon, 23 Jan 2017, Wido den Hollander wrote:
> > >> >>>>>>>>>> On 22 January 2017 at 17:44, Loic Dachary wrote:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> Hi Sage,
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> You proposed an improvement to the crush map to address
> > >> >>>>>>>>>> different device types (SSD, HDD, etc.)[1]. When
> > >> >>>>>>>>>> learning how to create a crush map, I was indeed
> > >> >>>>>>>>>> confused by the tricks required to create SSD-only
> > >> >>>>>>>>>> pools. After years of practice it feels more natural :-)
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> The source of my confusion was mostly that I had to use
> > >> >>>>>>>>>> a hierarchical description to describe something that is
> > >> >>>>>>>>>> not organized hierarchically. "The rack contains hosts
> > >> >>>>>>>>>> that contain devices" is intuitive. "The rack contains
> > >> >>>>>>>>>> hosts that contain ssd that contain devices" is
> > >> >>>>>>>>>> counter-intuitive. Changing:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>     # devices
> > >> >>>>>>>>>>     device 0 osd.0
> > >> >>>>>>>>>>     device 1 osd.1
> > >> >>>>>>>>>>     device 2 osd.2
> > >> >>>>>>>>>>     device 3 osd.3
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> into:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>     # devices
> > >> >>>>>>>>>>     device 0 osd.0 ssd
> > >> >>>>>>>>>>     device 1 osd.1 ssd
> > >> >>>>>>>>>>     device 2 osd.2 hdd
> > >> >>>>>>>>>>     device 3 osd.3 hdd
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> where ssd/hdd is the device class, would be much better.
> > >> >>>>>>>>>> However, using the device class like so:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>     rule ssd {
> > >> >>>>>>>>>>             ruleset 1
> > >> >>>>>>>>>>             type replicated
> > >> >>>>>>>>>>             min_size 1
> > >> >>>>>>>>>>             max_size 10
> > >> >>>>>>>>>>             step take default:ssd
> > >> >>>>>>>>>>             step chooseleaf firstn 0 type host
> > >> >>>>>>>>>>             step emit
> > >> >>>>>>>>>>     }
> > >> >>>>>>>>>>
> > >> >>>>>>>>>> looks arcane. Since the goal is to simplify the
> > >> >>>>>>>>>> description for the first-time user, maybe we could have
> > >> >>>>>>>>>> something like:
> > >> >>>>>>>>>>
> > >> >>>>>>>>>>     rule ssd {
> > >> >>>>>>>>>>             ruleset 1
> > >> >>>>>>>>>>             type replicated
> > >> >>>>>>>>>>             min_size 1
> > >> >>>>>>>>>>             max_size 10
> > >> >>>>>>>>>>             device class = ssd
> > >> >>>>>>>>>
> > >> >>>>>>>>> Would that be sane?
> > >> >>>>>>>>>
> > >> >>>>>>>>> Why not:
> > >> >>>>>>>>>
> > >> >>>>>>>>> step set-class ssd
> > >> >>>>>>>>> step take default
> > >> >>>>>>>>> step chooseleaf firstn 0 type host
> > >> >>>>>>>>> step emit
> > >> >>>>>>>>>
> > >> >>>>>>>>> Since it's a 'step' you take, am I right?
> > >> >>>>>>>>
> > >> >>>>>>>> Good idea... a step is a cleaner way to extend the syntax!
> > >> >>>>>>>>
> > >> >>>>>>>> sage
> > >> >>>>>>>
> > >> >>>>>>> --
> > >> >>>>>>> Loïc Dachary, Artisan Logiciel Libre
> > >> >>>>>
> > >> >>>>> --
> > >> >>>>> Loïc Dachary, Artisan Logiciel Libre
> > >> >>>
> > >> >>> --
> > >> >>> Loïc Dachary, Artisan Logiciel Libre
> > >> >
> > >> > --
> > >> > Loïc Dachary, Artisan Logiciel Libre
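[For reference -- the manual workaround Sage describes at the top of
this message (extract the crush map, add the rule by hand, inject it
back) can be sketched with the standard ceph/crushtool commands; the
file names here are arbitrary:]

    # dump the in-use crushmap and decompile it to editable text
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # edit crushmap.txt: append a rule such as the replicated_ssd_rule
    # shown in Sage's reply, using 'step take default class ssd'

    # recompile the edited text and inject it back into the cluster
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new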