* Add dmclock QoS client calls to librados -- request for comments
@ 2017-12-18 19:04 J. Eric Ivancich
  2017-12-19 16:13 ` Sage Weil
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: J. Eric Ivancich @ 2017-12-18 19:04 UTC (permalink / raw)
  To: Ceph Development

We are asking the Ceph community to provide their thoughts on this
draft proposal for expanding the librados API with calls that would
allow clients to specify QoS (quality of service) parameters for
their operations.

We have an ongoing effort to provide Ceph users with more options to
manage QoS. With the release of Luminous we introduced access to a
prototype of the mClock QoS algorithm, which queues operations by class
of operation and either differentiates clients or treats them as a
unit. Although not yet integrated, the library we're using also supports
dmClock, a distributed version of mClock. Both are documented in
_mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
by Gulati, Merchant, and Varman (2010).

In order to offer greater flexibility, we'd like to move forward with
providing clients with the ability to use different QoS parameters. We
are keeping our options open w.r.t. the ultimate set of algorithm(s)
we'll use. The mClock/dmClock algorithm allows a "client", which we
can interpret broadly, to set a minimum ops/sec (reservation) and a
maximum ops/sec (limit). Furthermore, a "client" can also define a
weight (a.k.a. priority), a scalar value that determines relative
weighting.

We think reservation, limit, and weight are sufficiently generic that
we'd be able to use or adapt them to other QoS algorithms we may try or
use in the future.

[To give you a sense of how broadly we can interpret "client", we
currently have code that interprets classes of operations (e.g.,
background replication or background snap-trimming) as a client.]

== Units ==

One key difference we're considering, however, is changing the unit
that reservations and limits are expressed in from ops/sec to
something more appropriate for Ceph. Operations have payloads of
different sizes and will therefore take different amounts of time, and
that should be factored in. We might refer to this as the "cost" of
the operation. And the cost is not linear with the size of the
payload. For example, a write of 4 MB might only take 20 times as long
as a write of 4 KB even though the sizes differ by a factor of
1000. Using cost would allow us, for example, to achieve a fairer
prioritization between a client doing many small writes and a client
doing a few larger writes.

One proposed formula to translate one op into cost would be something
along the lines of:

    cost_units = a + b * log(payload_size)

where a and b would have to be chosen or tuned based on the storage
back-end.
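
As an entirely illustrative sketch (the coefficients a and b below are
placeholders, not tuned values, and the log base is effectively folded
into b), the translation might look like:

#include <math.h>
#include <stdint.h>

/* Placeholder cost function: cost_units = a + b * log(payload_size). */
static uint64_t op_cost_units(uint64_t payload_size)
{
    const double a = 100.0;  /* fixed per-op overhead (placeholder) */
    const double b = 10.0;   /* size scaling factor (placeholder) */

    if (payload_size == 0)   /* guard log(0); a zero-byte op pays only a */
        return (uint64_t)a;
    return (uint64_t)(a + b * log((double)payload_size));
}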

And that gets us to the units for defining reservation and limit --
cost_units per unit of time. Typically these are floating-point
values; however, we do not use floating-point types in librados calls
because qemu, when calling into librbd, does not save and restore the
CPU's floating-point mode.

There are two ways of getting appropriate ranges of values given that
we need to use integral types for cost_units per unit of time. One is
to use a large time unit in the denominator, such as minutes or even
hours; that would leave us with cost_units per minute. We are unsure
whether such an unusual unit is the best approach, and your feedback
would be appreciated.

An alternative would be to use a standard time unit, such as seconds,
but treat the integers as fixed-point values. So a floating-point
value in cost_units per second would be multiplied by, say, 1000 and
rounded to get the corresponding integer value.
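
A minimal sketch of that fixed-point convention, using the example
scale factor of 1000 from above:

#include <math.h>
#include <stdint.h>

#define QOS_FIXED_POINT_SCALE 1000  /* example scale factor */

/* Convert a floating-point rate in cost_units/sec into the integer
 * fixed-point form that would cross the librados boundary;
 * e.g. qos_rate_to_fixed(2.5) yields 2500. */
static uint64_t qos_rate_to_fixed(double cost_units_per_sec)
{
    return (uint64_t)llround(cost_units_per_sec * QOS_FIXED_POINT_SCALE);
}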

== librados Additions ==

The basic idea is that one would be able to create (and destroy) qos
profiles and then associate a profile with an ioctx. Ops on the ioctx
would use the qos profile associated with it.

typedef void* rados_qos_profile_t; // opaque

// parameters are uint64_t, in cost_units per time unit as discussed above
profile1 = rados_qos_profile_create(reservation, weight, limit);

rados_ioctx_set_qos_profile(ioctx3, profile1);

...
// ops to ioctx3 would now use the specified profile
...

// use the profile just for a particular operation
rados_write_op_set_qos_profile(op1, profile1);

rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile

rados_qos_profile_destroy(profile1);
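
For concreteness, the signatures might look roughly like the following
(hypothetical; exact names, parameter types, and return conventions are
all open to discussion -- rados_ioctx_t and rados_write_op_t are the
existing librados handle types):

rados_qos_profile_t rados_qos_profile_create(uint64_t reservation,
                                             uint64_t weight,
                                             uint64_t limit);
int rados_ioctx_set_qos_profile(rados_ioctx_t ioctx,
                                rados_qos_profile_t profile);
int rados_write_op_set_qos_profile(rados_write_op_t op,
                                   rados_qos_profile_t profile);
int rados_qos_profile_destroy(rados_qos_profile_t profile);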

== MOSDOp and MOSDOpReply Changes ==

Because the qos_profile would be managed by the ioctx, MOSDOps sent
via that ioctx would include the reservation, weight, and limit. At
this point we think this would be better than keeping the profiles on
the back-end, although it increases the MOSDOp data structure by about
128 bits.

The MOSDOp type already contains dmclock's delta and rho parameters
and MOSDOpReply already contains the dmclock phase indicator due to
prior work. Given that we're moving towards using cost_unit per
time_unit rather than ops per sec, perhaps we should also include the
calculated cost in the MOSDOpReply.
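
For illustration only, the added per-op fields might look something
like the struct below; the field widths are placeholders chosen to
match the ~128-bit estimate above, and the profile id field is purely
hypothetical:

struct qos_params_t {
    uint32_t reservation;     /* cost_units per time unit */
    uint32_t weight;          /* relative priority */
    uint32_t limit;           /* cost_units per time unit */
    uint32_t qos_profile_id;  /* hypothetical client-local profile id */
};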

== Conclusion ==

So that's what we're thinking about and your own thoughts and feedback
would be appreciated. Thanks!


* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-18 19:04 Add dmclock QoS client calls to librados -- request for comments J. Eric Ivancich
@ 2017-12-19 16:13 ` Sage Weil
  2018-01-02 15:26   ` J. Eric Ivancich
  2017-12-19 17:45 ` Mark Nelson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2017-12-19 16:13 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

Hi Eric,

On Mon, 18 Dec 2017, J. Eric Ivancich wrote:
> We are asking the Ceph community to provide their thoughts on this
> draft proposal for expanding the librados API with calls that would
> allow clients to specify QoS (quality of service) parameters for
> their operations.
> 
> We have an on-going effort to provide Ceph users with more options to
> manage QoS. With the release of Luminous we introduced access to a
> prototype of the mclock QoS algorithm for queuing operations by class
> of operation and either differentiating clients or treating them as a
> unit. Although not yet integrated, the library we're using supports
> dmClock, a distributed version of mClock. Both are documented in
> _mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
> by Gulati, Merchant, and Varman 2010.
> 
> In order to offer greater flexibility, we'd like to move forward with
> providing clients with the ability to use different QoS parameters. We
> are keeping our options open w.r.t. the ultimate set of algorithm(s)
> we'll use. The mClock/dmClock algorithm allows a "client", which we
> can interpret broadly, to set a minimum ops/sec (reservation) and a
> maximum ops/sec (limit). Furthermore a "client" can also define a
> weight (a.k.a.  priority), which is a scalar value to determine
> relative weighting.
> 
> We think reservation, limit, and weight are sufficiently generic that
> we'd be able to use or adapt them other QoS algorithms we may try or
> use in the future.
> 
> [To give you a sense of how broadly we can interpret "client", we
> currently have code that interprets classes of operations (e.g.,
> background replication or background snap-trimming) as a client.]
> 
> == Units ==
> 
> One key difference we're considering, however, is changing the unit
> that reservations and limits are expressed in from ops/sec to
> something more appropriate for Ceph. Operations have payloads of
> different sizes and will therefore take different amounts of time, and
> that should be factored in. We might refer to this as the "cost" of
> the operation. And the cost is not linear with the size of the
> payload. For example, a write of 4 MB might only take 20 times as long
> as a write of 4 KB even though the sizes differ by a factor of
> 1000. Using cost would allow us to, for example, achieve a fairer
> prioritization of a client doing many small writes against a client
> that's doing a few larger writes.
> 
> One proposed formula to translate one op into cost would be something
> along the lines of:
> 
>     cost_units = a + b * log(payload_size)
> 
> where a and b would have to be chosen or tuned based on the storage
> back-end.
> 
> And that gets us to the units for defining reservation and limit --
> cost_units per unit of time. Typically these are floating point
> values, however we do not use floating point types in librados calls
> because qemu, when calling into librbd, does not save and restore the
> cpu's floating point mode.
> 
> There are two ways of getting appropriate ranges of values given that
> we need to use integral types for cost_units per unit of time. One is
> a large time unit in the denominator, such as minutes or even
> hours. That would leave us with cost_units per minute. We are unsure
> that the strange unit is the best approach and your feedback would be
> appreciated.
> 
> A standard alternative would be to use a standard time unit, such as
> seconds, but integers as fixed-point values. So a floating-point value
> in cost_units per second would be multiplied by, say, 1000 and rounded
> to get the corresponding integer value.

I think if payload_size above is bytes, then any reasonable value for 
cost_units will be a non-tiny integer, and we won't need floating point, 
right?  E.g., a 4KB write would be (at a minimum) 10, but probably larger 
if a and b >= 1.  That would let us keep seconds as the time base?
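
For example, with placeholder values a = b = 1 and a base-2 log:

    4 KB write: 1 + log2(4096)    = 1 + 12 = 13 cost_units
    4 MB write: 1 + log2(4194304) = 1 + 22 = 23 cost_units

so even small writes come out comfortably above 1.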

> == librados Additions ==
> 
> The basic idea is that one would be able to create (and destroy) qos
> profiles and then associate a profile with an ioctx. Ops on the ioctx
> would use the qos profile associated with it.
> 
> typedef void* rados_qos_profile_t; // opaque
> 
> // parameters uint64_t in cost_units per time unit as discussed above
> profile1 = rados_qos_profile_create(reservation, weight, limit);
> 
> rados_ioctx_set_qos_profile(ioctx3, profile1);
> 
> ...
> // ops to ioctx3 would now use the specified profile
> ...
> 
> // use the profile just for a particular operation
> rados_write_op_set_qos_prefile(op1, profile1);
> 
> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
> 
> rados_qos_profile_destroy(profile1);

I would s/destroy/release/, as the profile will be implicitly 
reference counted (with a ref consumed by the ioctx that is pointing to 
it).

It might be useful to add a rados_qos_profile_get_id(handle) that returns 
the client-local integer id that we're using to identify the profile.  
This isn't really useful for the application per se, but it will be 
helpful for debugging purposes perhaps?
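
Something along the lines of (hypothetical sketch):

/* Return the client-local integer id for a profile (debugging aid). */
uint64_t rados_qos_profile_get_id(rados_qos_profile_t profile);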

> == MOSDOp and MOSDOpReply Changes ==
> 
> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> via that ioctx would include the reservation, weight, and limit. At
> this point we think this would be better than keeping the profiles on
> the back-end, although it increases the MOSDOp data structure by about
> 128 bits.
> 
> The MOSDOp type already contains dmclock's delta and rho parameters
> and MOSDOpReply already contains the dmclock phase indicator due to
> prior work. Given that we're moving towards using cost_unit per
> time_unit rather than ops per sec, perhaps we should also include the
> calculated cost in the MOSDOpReply.

Good idea!

sage


> 
> == Conclusion ==
> 
> So that's what we're thinking about and your own thoughts and feedback
> would be appreciated. Thanks!

* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-18 19:04 Add dmclock QoS client calls to librados -- request for comments J. Eric Ivancich
  2017-12-19 16:13 ` Sage Weil
@ 2017-12-19 17:45 ` Mark Nelson
  2018-01-02 15:11   ` J. Eric Ivancich
  2018-01-02 17:58 ` Gregory Farnum
  2018-01-03 19:26 ` Gregory Farnum
  3 siblings, 1 reply; 11+ messages in thread
From: Mark Nelson @ 2017-12-19 17:45 UTC (permalink / raw)
  To: J. Eric Ivancich, Ceph Development

Hi Eric,

This is pretty dense! :) (I have the same problem with emails
sometimes.)  Responses inline.

On 12/18/2017 01:04 PM, J. Eric Ivancich wrote:
> We are asking the Ceph community to provide their thoughts on this
> draft proposal for expanding the librados API with calls that would
> allow clients to specify QoS (quality of service) parameters for
> their operations.
> 
> We have an on-going effort to provide Ceph users with more options to
> manage QoS. With the release of Luminous we introduced access to a
> prototype of the mclock QoS algorithm for queuing operations by class
> of operation and either differentiating clients or treating them as a
> unit. Although not yet integrated, the library we're using supports
> dmClock, a distributed version of mClock. Both are documented in
> _mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
> by Gulati, Merchant, and Varman 2010.
> 
> In order to offer greater flexibility, we'd like to move forward with
> providing clients with the ability to use different QoS parameters. We
> are keeping our options open w.r.t. the ultimate set of algorithm(s)
> we'll use. The mClock/dmClock algorithm allows a "client", which we
> can interpret broadly, to set a minimum ops/sec (reservation) and a
> maximum ops/sec (limit). Furthermore a "client" can also define a
> weight (a.k.a.  priority), which is a scalar value to determine
> relative weighting.
> 
> We think reservation, limit, and weight are sufficiently generic that
> we'd be able to use or adapt them other QoS algorithms we may try or
> use in the future.
> 
> [To give you a sense of how broadly we can interpret "client", we
> currently have code that interprets classes of operations (e.g.,
> background replication or background snap-trimming) as a client.]
> 
> == Units ==
> 
> One key difference we're considering, however, is changing the unit
> that reservations and limits are expressed in from ops/sec to
> something more appropriate for Ceph. Operations have payloads of
> different sizes and will therefore take different amounts of time, and
> that should be factored in. We might refer to this as the "cost" of
> the operation. And the cost is not linear with the size of the
> payload. For example, a write of 4 MB might only take 20 times as long
> as a write of 4 KB even though the sizes differ by a factor of
> 1000. Using cost would allow us to, for example, achieve a fairer
> prioritization of a client doing many small writes against a client
> that's doing a few larger writes.

Getting away from ops/s is a good idea imho, and I generally agree here.

> 
> One proposed formula to translate one op into cost would be something
> along the lines of:
> 
>      cost_units = a + b * log(payload_size)
> 
> where a and b would have to be chosen or tuned based on the storage
> back-end.

I guess the idea is that we can generally approximate the curve of both 
HDDs and solid state storage with this formula by tweaking a and b? 
I've got a couple of concerns:

1) I don't think most users are going to get a and b right.  If anything 
I suspect we'll end up with a couple of competing values for HDD and 
SSDs that people will just copy/paste from each other or the mailing 
list.  I'd much rather that we had hdd/ssd defaults like we do for other 
options in ceph that get us in the right ballparks and get set 
automatically based on the disk type.

2) log() is kind of expensive.  It's not *that* bad, but it's enough 
that for small NVMe read ops we could start to see it show up in profiles.

http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/

I suspect it might be a good idea to pre-compute the cost_units for the 
first 64k (or whatever) payload_sizes, especially if that value is 
64bit.  It would take minimal memory and I could see it becoming more 
important as flash becomes more common (especially on ARM and similar CPUs).

3) If there were an easy way to express it, it might be nice to just 
give advanced users the option to write their own function here as an 
override vs the defaults. ie (not real numbers):

notreal_qos_cost_unit_algorithm = ""
notreal_qos_cost_unit_algorithm_ssd = "1024 + 0.25*log(payload_size)"
notreal_qos_cost_unit_algorithm_hdd = "64 + 32*log(payload_size)"

I think this is cleaner than needing to specify hdd_a, hdd_b, ssd_a, 
ssd_b on nodes with mixed HDD/flash OSDs.

> 
> And that gets us to the units for defining reservation and limit --
> cost_units per unit of time. Typically these are floating point
> values, however we do not use floating point types in librados calls
> because qemu, when calling into librbd, does not save and restore the
> cpu's floating point mode.
> 
> There are two ways of getting appropriate ranges of values given that
> we need to use integral types for cost_units per unit of time. One is
> a large time unit in the denominator, such as minutes or even
> hours. That would leave us with cost_units per minute. We are unsure
> that the strange unit is the best approach and your feedback would be
> appreciated.
> 
> A standard alternative would be to use a standard time unit, such as
> seconds, but integers as fixed-point values. So a floating-point value
> in cost_units per second would be multiplied by, say, 1000 and rounded
> to get the corresponding integer value.

In the 2nd scenario it's just a question of how we handle it
internally, right?

> 
> == librados Additions ==
> 
> The basic idea is that one would be able to create (and destroy) qos
> profiles and then associate a profile with an ioctx. Ops on the ioctx
> would use the qos profile associated with it.
> 
> typedef void* rados_qos_profile_t; // opaque
> 
> // parameters uint64_t in cost_units per time unit as discussed above
> profile1 = rados_qos_profile_create(reservation, weight, limit);
> 
> rados_ioctx_set_qos_profile(ioctx3, profile1);
> 
> ...
> // ops to ioctx3 would now use the specified profile
> ...
> 
> // use the profile just for a particular operation
> rados_write_op_set_qos_prefile(op1, profile1);
> 
> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
> 
> rados_qos_profile_destroy(profile1);
> 
> == MOSDOp and MOSDOpReply Changes ==
> 
> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> via that ioctx would include the reservation, weight, and limit. At
> this point we think this would be better than keeping the profiles on
> the back-end, although it increases the MOSDOp data structure by about
> 128 bits.
> 
> The MOSDOp type already contains dmclock's delta and rho parameters
> and MOSDOpReply already contains the dmclock phase indicator due to
> prior work. Given that we're moving towards using cost_unit per
> time_unit rather than ops per sec, perhaps we should also include the
> calculated cost in the MOSDOpReply.

Does it change things at all if we have fast pre-calculated values of
cost_unit available for a given payload size?

> 
> == Conclusion ==
> 
> So that's what we're thinking about and your own thoughts and feedback
> would be appreciated. Thanks!

* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-19 17:45 ` Mark Nelson
@ 2018-01-02 15:11   ` J. Eric Ivancich
  2018-01-03 13:43     ` 김태웅
  0 siblings, 1 reply; 11+ messages in thread
From: J. Eric Ivancich @ 2018-01-02 15:11 UTC (permalink / raw)
  To: Mark Nelson, Ceph Development

Thanks, Mark, for those thoughts.

> On Dec 19, 2017, at 12:45 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
> 
> On 12/18/2017 01:04 PM, J. Eric Ivancich wrote:
>> == Units ==
>> One key difference we're considering, however, is changing the unit
>> that reservations and limits are expressed in from ops/sec to
>> something more appropriate for Ceph. Operations have payloads of
>> different sizes and will therefore take different amounts of time, and
>> that should be factored in. We might refer to this as the "cost" of
>> the operation. And the cost is not linear with the size of the
>> payload. For example, a write of 4 MB might only take 20 times as long
>> as a write of 4 KB even though the sizes differ by a factor of
>> 1000. Using cost would allow us to, for example, achieve a fairer
>> prioritization of a client doing many small writes against a client
>> that's doing a few larger writes.
> 
> Getting away from ops/s is a good idea imho, and I generally agree here.

Cool!

>> One proposed formula to translate one op into cost would be something
>> along the lines of:
>>     cost_units = a + b * log(payload_size)
>> where a and b would have to be chosen or tuned based on the storage
>> back-end.
> 
> I guess the idea is that we can generally approximate the curve of both HDDs and solid state storage with this formula by tweaking a and b? I've got a couple of concerns:

That’s correct.

> 1) I don't think most users are going to get a and b right.  If anything I suspect we'll end up with a couple of competing values for HDD and SSDs that people will just copy/paste from each other or the mailing list.  I'd much rather that we had hdd/ssd defaults like we do for other options in ceph that get us in the right ballparks and get set automatically based on the disk type.

I agree; best to have sensible defaults.

> 2) log() is kind of expensive.  It's not *that* bad, but it's enough that for small NVMe read ops we could start to see it show up in profiles.
> 
> http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/
> 
> I suspect it might be a good idea to pre-compute the cost_units for the first 64k (or whatever) payload_sizes, especially if that value is 64bit.  It would take minimal memory and I could see it becoming more important as flash becomes more common (especially on ARM and similar CPUs).

I agree. Rounding to the nearest kb and then doing a table look-up is likely the right way to go, only doing the full calculation when necessary. We could even consider pre-computing the values for powers-of-2 kb (e.g., 1k, 2k, 4k, 8k, 16k, ..., 128k, 256k, ...) and rounding each payload up to the next such size, assuming it's not problematic to treat, say, 20k, 25k, and 30k as having the same cost (i.e., the same as 32k). Or use a combination of the two approaches -- a linear table for smaller payloads and an exponential table for the larger payloads.
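
Here's a rough sketch of that combined table approach (entirely illustrative -- the table sizes and the coefficients are placeholders, not tuned values):

#include <math.h>
#include <stdint.h>

/* Placeholder coefficients for cost_units = a + b * log(payload_size). */
#define COST_A 100.0
#define COST_B 10.0

#define LINEAR_ENTRIES 64   /* linear table: 0..64 KB in 1 KB steps */
#define EXP_ENTRIES    6    /* exponential table: 128 KB ... 4 MB */

static uint64_t linear_cost[LINEAR_ENTRIES + 1];
static uint64_t exp_cost[EXP_ENTRIES];

static uint64_t calc_cost(uint64_t bytes)
{
    return (uint64_t)(COST_A + COST_B * log((double)(bytes ? bytes : 1)));
}

static void init_cost_tables(void)
{
    for (int kb = 0; kb <= LINEAR_ENTRIES; ++kb)
        linear_cost[kb] = calc_cost((uint64_t)kb * 1024);
    for (int i = 0; i < EXP_ENTRIES; ++i)
        exp_cost[i] = calc_cost((uint64_t)(128 * 1024) << i);
}

static uint64_t lookup_cost(uint64_t bytes)
{
    uint64_t kb = (bytes + 1023) / 1024;        /* round up to nearest KB */
    if (kb <= LINEAR_ENTRIES)
        return linear_cost[kb];
    for (int i = 0; i < EXP_ENTRIES; ++i)       /* next power-of-2 bucket */
        if (bytes <= (uint64_t)(128 * 1024) << i)
            return exp_cost[i];
    return calc_cost(bytes);                    /* fall back to the formula */
}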

> 3) If there were an easy way to express it, it might be nice to just give advanced users the option to write their own function here as an override vs the defaults. ie (not real numbers):
> 
> notreal_qos_cost_unit_algorithm = ""
> notreal_qos_cost_unit_algorithm_ssd = "1024 + 0.25*log(payload_size)"
> notreal_qos_cost_unit_algorithm_hdd = "64 + 32*log(payload_size)"
> 
> I think this is cleaner than needing to specify hdd_a, hdd_b, ssd_a, ssd_b on nodes with mixed HDD/flash OSDs.

I’m inclined to go with the more simple implementation, at least for the first take, but certainly open to the more general implementation that you suggest.

>> And that gets us to the units for defining reservation and limit --
>> cost_units per unit of time. Typically these are floating point
>> values, however we do not use floating point types in librados calls
>> because qemu, when calling into librbd, does not save and restore the
>> cpu's floating point mode.
>> There are two ways of getting appropriate ranges of values given that
>> we need to use integral types for cost_units per unit of time. One is
>> a large time unit in the denominator, such as minutes or even
>> hours. That would leave us with cost_units per minute. We are unsure
>> that the strange unit is the best approach and your feedback would be
>> appreciated.
>> A standard alternative would be to use a standard time unit, such as
>> seconds, but integers as fixed-point values. So a floating-point value
>> in cost_units per second would be multiplied by, say, 1000 and rounded
>> to get the corresponding integer value.
> 
> In the 2nd scenario it's just a question of how we handle it internally right?

The client calling into librados would have to do the conversion of floating-point into fixed-point. I'll reply to Sage's reply to this thread next, but I think he makes a good point that the number of cost units for typical payload sizes will be (much?) larger than 1, so we might be able to use seconds as our time unit *and* avoid fixed-point math. In other words, I'm now thinking that the caller would simply need to round to an integral value *if* they started with a floating-point value.

>> == librados Additions ==
>> The basic idea is that one would be able to create (and destroy) qos
>> profiles and then associate a profile with an ioctx. Ops on the ioctx
>> would use the qos profile associated with it.
>> typedef void* rados_qos_profile_t; // opaque
>> // parameters uint64_t in cost_units per time unit as discussed above
>> profile1 = rados_qos_profile_create(reservation, weight, limit);
>> rados_ioctx_set_qos_profile(ioctx3, profile1);
>> ...
>> // ops to ioctx3 would now use the specified profile
>> ...
>> // use the profile just for a particular operation
>> rados_write_op_set_qos_prefile(op1, profile1);
>> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
>> rados_qos_profile_destroy(profile1);
>> == MOSDOp and MOSDOpReply Changes ==
>> Because the qos_profile would be managed by the ioctx, MOSDOps sent
>> via that ioctx would include the reservation, weight, and limit. At
>> this point we think this would be better than keeping the profiles on
>> the back-end, although it increases the MOSDOp data structure by about
>> 128 bits.
>> The MOSDOp type already contains dmclock's delta and rho parameters
>> and MOSDOpReply already contains the dmclock phase indicator due to
>> prior work. Given that we're moving towards using cost_unit per
>> time_unit rather than ops per sec, perhaps we should also include the
>> calculated cost in the MOSDOpReply.
> 
> Does it change things at all if we have fast per-calculated values of cost_unit available for a given payload size?

No, that wouldn’t change anything. This value will help the new piece in librados that handles dmclock correctly apportion the work done by each server to ensure fairness across servers. When using “ops” the value was 1. With cost units it gets a little more complex. This would all be internal to librados and the client wouldn’t have to deal with this value.

Eric



* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-19 16:13 ` Sage Weil
@ 2018-01-02 15:26   ` J. Eric Ivancich
  0 siblings, 0 replies; 11+ messages in thread
From: J. Eric Ivancich @ 2018-01-02 15:26 UTC (permalink / raw)
  To: Sage Weil, Ceph Development

Hi Sage,

> On Dec 19, 2017, at 11:13 AM, Sage Weil <sage@newdream.net> wrote:
> 
> Hi Eric,
> 
> On Mon, 18 Dec 2017, J. Eric Ivancich wrote:
>> == Units ==
>> 
>> One key difference we're considering, however, is changing the unit
>> that reservations and limits are expressed in from ops/sec to
>> something more appropriate for Ceph. Operations have payloads of
>> different sizes and will therefore take different amounts of time, and
>> that should be factored in. We might refer to this as the "cost" of
>> the operation. And the cost is not linear with the size of the
>> payload. For example, a write of 4 MB might only take 20 times as long
>> as a write of 4 KB even though the sizes differ by a factor of
>> 1000. Using cost would allow us to, for example, achieve a fairer
>> prioritization of a client doing many small writes against a client
>> that's doing a few larger writes.
>> 
>> One proposed formula to translate one op into cost would be something
>> along the lines of:
>> 
>>    cost_units = a + b * log(payload_size)
>> 
>> where a and b would have to be chosen or tuned based on the storage
>> back-end.
>> 
>> And that gets us to the units for defining reservation and limit --
>> cost_units per unit of time. Typically these are floating point
>> values, however we do not use floating point types in librados calls
>> because qemu, when calling into librbd, does not save and restore the
>> cpu's floating point mode.
>> 
>> There are two ways of getting appropriate ranges of values given that
>> we need to use integral types for cost_units per unit of time. One is
>> a large time unit in the denominator, such as minutes or even
>> hours. That would leave us with cost_units per minute. We are unsure
>> that the strange unit is the best approach and your feedback would be
>> appreciated.
>> 
>> A standard alternative would be to use a standard time unit, such as
>> seconds, but integers as fixed-point values. So a floating-point value
>> in cost_units per second would be multiplied by, say, 1000 and rounded
>> to get the corresponding integer value.
> 
> I think if payload_size above is bytes, then any reasonable value for 
> cost_units will be a non-tiny integer, and we won't need floating point, 
> right?  E.g., a 4KB write would be (at a minimum) 10, but probably larger 
> if a and b >= 1.  That would let us keep seconds as the time base?

Very good point! As long as the cost is greater than 10 (maybe even much greater than 10), a reservation or limit as low as 1 would still represent a low rate, so we can avoid both odd time-unit denominators and fixed-point math and still achieve low settings.

>> == librados Additions ==
>> 
>> The basic idea is that one would be able to create (and destroy) qos
>> profiles and then associate a profile with an ioctx. Ops on the ioctx
>> would use the qos profile associated with it.
>> 
>> typedef void* rados_qos_profile_t; // opaque
>> 
>> // parameters uint64_t in cost_units per time unit as discussed above
>> profile1 = rados_qos_profile_create(reservation, weight, limit);
>> 
>> rados_ioctx_set_qos_profile(ioctx3, profile1);
>> 
>> ...
>> // ops to ioctx3 would now use the specified profile
>> ...
>> 
>> // use the profile just for a particular operation
>> rados_write_op_set_qos_prefile(op1, profile1);
>> 
>> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
>> 
>> rados_qos_profile_destroy(profile1);
> 
> I would s/destroy/release/, as the profile will be implicitly 
> reference counted (with a ref consumed by the ioctx that is pointing to 
> it).
> 
> It might be useful to add a rados_qos_profile_get_id(handle) that returns 
> the client-local integer id that we're using to identify the profile.  
> This isn't really useful for the application per se, but it will be 
> helpful for debugging purposes perhaps?

Both sound good.

>> == MOSDOp and MOSDOpReply Changes ==
>> 
>> Because the qos_profile would be managed by the ioctx, MOSDOps sent
>> via that ioctx would include the reservation, weight, and limit. At
>> this point we think this would be better than keeping the profiles on
>> the back-end, although it increases the MOSDOp data structure by about
>> 128 bits.
>> 
>> The MOSDOp type already contains dmclock's delta and rho parameters
>> and MOSDOpReply already contains the dmclock phase indicator due to
>> prior work. Given that we're moving towards using cost_unit per
>> time_unit rather than ops per sec, perhaps we should also include the
>> calculated cost in the MOSDOpReply.
> 
> Good idea!

Thank you,

Eric



* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-18 19:04 Add dmclock QoS client calls to librados -- request for comments J. Eric Ivancich
  2017-12-19 16:13 ` Sage Weil
  2017-12-19 17:45 ` Mark Nelson
@ 2018-01-02 17:58 ` Gregory Farnum
  2018-01-03 19:26 ` Gregory Farnum
  3 siblings, 0 replies; 11+ messages in thread
From: Gregory Farnum @ 2018-01-02 17:58 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

On Mon, Dec 18, 2017 at 11:04 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> We are asking the Ceph community to provide their thoughts on this
> draft proposal for expanding the librados API with calls that would
> allow clients to specify QoS (quality of service) parameters for
> their operations.
>
> We have an on-going effort to provide Ceph users with more options to
> manage QoS. With the release of Luminous we introduced access to a
> prototype of the mclock QoS algorithm for queuing operations by class
> of operation and either differentiating clients or treating them as a
> unit. Although not yet integrated, the library we're using supports
> dmClock, a distributed version of mClock. Both are documented in
> _mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
> by Gulati, Merchant, and Varman 2010.
>
> In order to offer greater flexibility, we'd like to move forward with
> providing clients with the ability to use different QoS parameters. We
> are keeping our options open w.r.t. the ultimate set of algorithm(s)
> we'll use. The mClock/dmClock algorithm allows a "client", which we
> can interpret broadly, to set a minimum ops/sec (reservation) and a
> maximum ops/sec (limit). Furthermore a "client" can also define a
> weight (a.k.a.  priority), which is a scalar value to determine
> relative weighting.
>
> We think reservation, limit, and weight are sufficiently generic that
> we'd be able to use or adapt them other QoS algorithms we may try or
> use in the future.
>
> [To give you a sense of how broadly we can interpret "client", we
> currently have code that interprets classes of operations (e.g.,
> background replication or background snap-trimming) as a client.]
>
> == Units ==
>
> One key difference we're considering, however, is changing the unit
> that reservations and limits are expressed in from ops/sec to
> something more appropriate for Ceph. Operations have payloads of
> different sizes and will therefore take different amounts of time, and
> that should be factored in. We might refer to this as the "cost" of
> the operation. And the cost is not linear with the size of the
> payload. For example, a write of 4 MB might only take 20 times as long
> as a write of 4 KB even though the sizes differ by a factor of
> 1000. Using cost would allow us to, for example, achieve a fairer
> prioritization of a client doing many small writes against a client
> that's doing a few larger writes.

I'm sure we'll need to convert to a different type of rate limiter, so
this is good. I suspect we'll want to experiment with the right ways
to convert costs to a single dimension, though — and it will probably
vary quite a lot depending on the underlying storage medium. I'd be
careful about embedding a cost function into the code and instead make
it a function that can be updated (or even configured by the cluster
admin).

>
> One proposed formula to translate one op into cost would be something
> along the lines of:
>
>     cost_units = a + b * log(payload_size)
>
> where a and b would have to be chosen or tuned based on the storage
> back-end.
>
> And that gets us to the units for defining reservation and limit --
> cost_units per unit of time. Typically these are floating point
> values, however we do not use floating point types in librados calls
> because qemu, when calling into librbd, does not save and restore the
> cpu's floating point mode.
>
> There are two ways of getting appropriate ranges of values given that
> we need to use integral types for cost_units per unit of time. One is
> a large time unit in the denominator, such as minutes or even
> hours. That would leave us with cost_units per minute. We are unsure
> that the strange unit is the best approach and your feedback would be
> appreciated.
>
> A standard alternative would be to use a standard time unit, such as
> seconds, but integers as fixed-point values. So a floating-point value
> in cost_units per second would be multiplied by, say, 1000 and rounded
> to get the corresponding integer value.
>
> == librados Additions ==
>
> The basic idea is that one would be able to create (and destroy) qos
> profiles and then associate a profile with an ioctx. Ops on the ioctx
> would use the qos profile associated with it.
>
> typedef void* rados_qos_profile_t; // opaque
>
> // parameters uint64_t in cost_units per time unit as discussed above
> profile1 = rados_qos_profile_create(reservation, weight, limit);
>
> rados_ioctx_set_qos_profile(ioctx3, profile1);
>
> ...
> // ops to ioctx3 would now use the specified profile
> ...
>
> // use the profile just for a particular operation
> rados_write_op_set_qos_prefile(op1, profile1);
>
> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
>
> rados_qos_profile_destroy(profile1);

So this sounds like QoS will be *entirely* set up and configured by the client?

I can understand the appeal of that, but we've at least discussed the
possibility in the past of this being cryptographically secured and
part of the cephx caps. The problem with secured QoS is that it's not
entirely clear how we'd share the amount of used reservation across
the OSDs involved. (Presumably they'd periodically send the client
back signed tickets, and the client would share those, and the OSDs
would adjust local reservations from that usage, but of course that's
hand-waving a ton.) It would be very sad if our librados interface
wasn't designed with a future secured QoS in mind. (And doing it on
the IoCtx would seem to preclude that?)
-Greg

> == MOSDOp and MOSDOpReply Changes ==
>
> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> via that ioctx would include the reservation, weight, and limit. At
> this point we think this would be better than keeping the profiles on
> the back-end, although it increases the MOSDOp data structure by about
> 128 bits.
>
> The MOSDOp type already contains dmclock's delta and rho parameters
> and MOSDOpReply already contains the dmclock phase indicator due to
> prior work. Given that we're moving towards using cost_unit per
> time_unit rather than ops per sec, perhaps we should also include the
> calculated cost in the MOSDOpReply.
>
> == Conclusion ==
>
> So that's what we're thinking about and your own thoughts and feedback
> would be appreciated. Thanks!

* Re: Add dmclock QoS client calls to librados -- request for comments
  2018-01-02 15:11   ` J. Eric Ivancich
@ 2018-01-03 13:43     ` 김태웅
  2018-01-05  4:35       ` Byung Su Park
  0 siblings, 1 reply; 11+ messages in thread
From: 김태웅 @ 2018-01-03 13:43 UTC (permalink / raw)
  To: J. Eric Ivancich, KIM TAEWOONG, bspark8; +Cc: Mark Nelson, Ceph Development

2018-01-03 0:11 GMT+09:00 J. Eric Ivancich <ivancich@redhat.com>:
>
> Thanks, Mark, for those thoughts.
>
> > On Dec 19, 2017, at 12:45 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
> >
> > On 12/18/2017 01:04 PM, J. Eric Ivancich wrote:
> >> == Units ==
> >> One key difference we're considering, however, is changing the unit
> >> that reservations and limits are expressed in from ops/sec to
> >> something more appropriate for Ceph. Operations have payloads of
> >> different sizes and will therefore take different amounts of time, and
> >> that should be factored in. We might refer to this as the "cost" of
> >> the operation. And the cost is not linear with the size of the
> >> payload. For example, a write of 4 MB might only take 20 times as long
> >> as a write of 4 KB even though the sizes differ by a factor of
> >> 1000. Using cost would allow us to, for example, achieve a fairer
> >> prioritization of a client doing many small writes against a client
> >> that's doing a few larger writes.
> >
> > Getting away from ops/s is a good idea imho, and I generally agree here.
>
> Cool!
>
> >> One proposed formula to translate one op into cost would be something
> >> along the lines of:
> >>     cost_units = a + b * log(payload_size)
> >> where a and b would have to be chosen or tuned based on the storage
> >> back-end.
> >
> > I guess the idea is that we can generally approximate the curve of both HDDs and solid state storage with this formula by tweaking a and b? I've got a couple of concerns:
>
> That’s correct.
>
> > 1) I don't think most users are going to get a and b right.  If anything I suspect we'll end up with a couple of competing values for HDD and SSDs that people will just copy/paste from each other or the mailing list.  I'd much rather that we had hdd/ssd defaults like we do for other options in ceph that get us in the right ballparks and get set automatically based on the disk type.
>
> I agree; best to have sensible defaults.
>
> > 2) log() is kind of expensive.  It's not *that* bad, but it's enough that for small NVMe read ops we could start to see it show up in profiles.
> >
> > http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/
> >
> > I suspect it might be a good idea to pre-compute the cost_units for the first 64k (or whatever) payload_sizes, especially if that value is 64bit.  It would take minimal memory and I could see it becoming more important as flash becomes more common (especially on ARM and similar CPUs).
>
> I agree. Rounding to the nearest kb and then doing a table look-up is likely the right way to go and then only doing the full calculation when necessary. We could even consider pre-computing the values for powers-of-2-kb (e.g., 1k, 2k, 4k, 8k, 16k, …. 128k, 256k, …) and rounding each payload to the next highest, assuming it’s not problematic to treat, say, 20k, 25k, and 30k as having the same cost (i.e., same as 32k). Or use a combination of the two approaches — linear table for smaller payloads and exponential table for the larger payloads.

Pre-computing a cost table seems like a good idea. I think that would
let us use more complicated formulas, since the computation is only
done when necessary.
I wonder if the log function is really needed. In past tests performed
in my environment, the cost seemed to be linear in the request size,
not logarithmic.
According to my observations, the larger the size, the stronger the
linearity. Maybe that depends on the environment.
To cover these various environments, we could change the formula to
something like:

    cost_units = a + b * payload_size + c * log(d * payload_size)

I'm not sure which term should be removed at this point. The exact form
of the formula should be determined with more testing.
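
A hedged sketch of that more general form (every coefficient is a
placeholder to be determined by testing, and some terms may turn out
to be unnecessary):

#include <math.h>
#include <stdint.h>

/* cost_units = a + b * payload_size + c * log(d * payload_size) */
static uint64_t general_cost(uint64_t payload_size,
                             double a, double b, double c, double d)
{
    double s = (double)(payload_size ? payload_size : 1);
    return (uint64_t)(a + b * s + c * log(d * s));
}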

>
> > 3) If there were an easy way to express it, it might be nice to just give advanced users the option to write their own function here as an override vs the defaults. ie (not real numbers):
> >
> > notreal_qos_cost_unit_algorithm = ""
> > notreal_qos_cost_unit_algorithm_ssd = "1024 + 0.25*log(payload_size)"
> > notreal_qos_cost_unit_algorithm_hdd = "64 + 32*log(payload_size)"
> >
> > I think this is cleaner than needing to specify hdd_a, hdd_b, ssd_a, ssd_b on nodes with mixed HDD/flash OSDs.
>
> I’m inclined to go with the more simple implementation, at least for the first take, but certainly open to the more general implementation that you suggest.
>
> >> And that gets us to the units for defining reservation and limit --
> >> cost_units per unit of time. Typically these are floating point
> >> values, however we do not use floating point types in librados calls
> >> because qemu, when calling into librbd, does not save and restore the
> >> cpu's floating point mode.
> >> There are two ways of getting appropriate ranges of values given that
> >> we need to use integral types for cost_units per unit of time. One is
> >> a large time unit in the denominator, such as minutes or even
> >> hours. That would leave us with cost_units per minute. We are unsure
> >> that the strange unit is the best approach and your feedback would be
> >> appreciated.
> >> A standard alternative would be to use a standard time unit, such as
> >> seconds, but integers as fixed-point values. So a floating-point value
> >> in cost_units per second would be multiplied by, say, 1000 and rounded
> >> to get the corresponding integer value.
> >
> > In the 2nd scenario it's just a question of how we handle it internally right?
>
> The client calling into librados would have to do the conversion of floating-point into fixed-point. I’ll reply to Sage’s reply to this thread next, but I think he makes a good point that the number of cost units for typical payload sizes will be (much?) larger than 1, so we might be able to use seconds as are time unit *and* avoid fixed-point math. In other words, I’m now thinking that the caller would simply need to round to an integral value *if* they started with a floating point value.
>
> >> == librados Additions ==
> >> The basic idea is that one would be able to create (and destroy) qos
> >> profiles and then associate a profile with an ioctx. Ops on the ioctx
> >> would use the qos profile associated with it.
> >> typedef void* rados_qos_profile_t; // opaque
> >> // parameters uint64_t in cost_units per time unit as discussed above
> >> profile1 = rados_qos_profile_create(reservation, weight, limit);
> >> rados_ioctx_set_qos_profile(ioctx3, profile1);
> >> ...
> >> // ops to ioctx3 would now use the specified profile
> >> ...
> >> // use the profile just for a particular operation
> >> rados_write_op_set_qos_prefile(op1, profile1);
> >> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
> >> rados_qos_profile_destroy(profile1);
> >> == MOSDOp and MOSDOpReply Changes ==
> >> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> >> via that ioctx would include the reservation, weight, and limit. At
> >> this point we think this would be better than keeping the profiles on
> >> the back-end, although it increases the MOSDOp data structure by about
> >> 128 bits.
> >> The MOSDOp type already contains dmclock's delta and rho parameters
> >> and MOSDOpReply already contains the dmclock phase indicator due to
> >> prior work. Given that we're moving towards using cost_unit per
> >> time_unit rather than ops per sec, perhaps we should also include the
> >> calculated cost in the MOSDOpReply.
> >
> > Does it change things at all if we have fast per-calculated values of cost_unit available for a given payload size?
>
> No, that wouldn’t change anything. This value will help the new piece in librados that handles dmclock correctly apportion the work done by each server to ensure fairness across servers. When using “ops” the value was 1. With cost units it gets a little more complex. This would all be internal to librados and the client wouldn’t have to deal with this value.
>
> Eric
>

* Re: Add dmclock QoS client calls to librados -- request for comments
  2017-12-18 19:04 Add dmclock QoS client calls to librados -- request for comments J. Eric Ivancich
                   ` (2 preceding siblings ...)
  2018-01-02 17:58 ` Gregory Farnum
@ 2018-01-03 19:26 ` Gregory Farnum
  2018-01-03 20:03   ` Sage Weil
  3 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2018-01-03 19:26 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

On Mon, Dec 18, 2017 at 11:04 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> We are asking the Ceph community to provide their thoughts on this
> draft proposal for expanding the librados API with calls that would
> allow clients to specify QoS (quality of service) parameters for
> their operations.
>
> We have an on-going effort to provide Ceph users with more options to
> manage QoS. With the release of Luminous we introduced access to a
> prototype of the mclock QoS algorithm for queuing operations by class
> of operation and either differentiating clients or treating them as a
> unit. Although not yet integrated, the library we're using supports
> dmClock, a distributed version of mClock. Both are documented in
> _mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
> by Gulati, Merchant, and Varman 2010.
>
> In order to offer greater flexibility, we'd like to move forward with
> providing clients with the ability to use different QoS parameters. We
> are keeping our options open w.r.t. the ultimate set of algorithm(s)
> we'll use. The mClock/dmClock algorithm allows a "client", which we
> can interpret broadly, to set a minimum ops/sec (reservation) and a
> maximum ops/sec (limit). Furthermore a "client" can also define a
> weight (a.k.a.  priority), which is a scalar value to determine
> relative weighting.
>
> We think reservation, limit, and weight are sufficiently generic that
> we'd be able to use or adapt them other QoS algorithms we may try or
> use in the future.
>
> [To give you a sense of how broadly we can interpret "client", we
> currently have code that interprets classes of operations (e.g.,
> background replication or background snap-trimming) as a client.]
>
> == Units ==
>
> One key difference we're considering, however, is changing the unit
> that reservations and limits are expressed in from ops/sec to
> something more appropriate for Ceph. Operations have payloads of
> different sizes and will therefore take different amounts of time, and
> that should be factored in. We might refer to this as the "cost" of
> the operation. And the cost is not linear with the size of the
> payload. For example, a write of 4 MB might only take 20 times as long
> as a write of 4 KB even though the sizes differ by a factor of
> 1000. Using cost would allow us to, for example, achieve a fairer
> prioritization of a client doing many small writes against a client
> that's doing a few larger writes.
>
> One proposed formula to translate one op into cost would be something
> along the lines of:
>
>     cost_units = a + b * log(payload_size)
>
> where a and b would have to be chosen or tuned based on the storage
> back-end.
>
> And that gets us to the units for defining reservation and limit --
> cost_units per unit of time. Typically these are floating point
> values, however we do not use floating point types in librados calls
> because qemu, when calling into librbd, does not save and restore the
> cpu's floating point mode.
>
> There are two ways of getting appropriate ranges of values given that
> we need to use integral types for cost_units per unit of time. One is
> a large time unit in the denominator, such as minutes or even
> hours. That would leave us with cost_units per minute. We are unsure
> that the strange unit is the best approach and your feedback would be
> appreciated.
>
> A standard alternative would be to use a standard time unit, such as
> seconds, but integers as fixed-point values. So a floating-point value
> in cost_units per second would be multiplied by, say, 1000 and rounded
> to get the corresponding integer value.
>
> == librados Additions ==
>
> The basic idea is that one would be able to create (and destroy) qos
> profiles and then associate a profile with an ioctx. Ops on the ioctx
> would use the qos profile associated with it.
>
> typedef void* rados_qos_profile_t; // opaque
>
> // parameters uint64_t in cost_units per time unit as discussed above
> profile1 = rados_qos_profile_create(reservation, weight, limit);
>
> rados_ioctx_set_qos_profile(ioctx3, profile1);
>
> ...
> // ops to ioctx3 would now use the specified profile
> ...
>
> // use the profile just for a particular operation
> rados_write_op_set_qos_prefile(op1, profile1);
>
> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
>
> rados_qos_profile_destroy(profile1);

Oh, one more thing I noticed. It's not clear to me from this interface
if it's possible to use the same profile across more than one ioctx
and have them share a common reservation. Or will it just be a
configuration struct that the IoCtx uses to set up its internal
tracking state, and then they run independently even if reused?
-Greg


>
> == MOSDOp and MOSDOpReply Changes ==
>
> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> via that ioctx would include the reservation, weight, and limit. At
> this point we think this would be better than keeping the profiles on
> the back-end, although it increases the MOSDOp data structure by about
> 128 bits.
>
> The MOSDOp type already contains dmclock's delta and rho parameters
> and MOSDOpReply already contains the dmclock phase indicator due to
> prior work. Given that we're moving towards using cost_unit per
> time_unit rather than ops per sec, perhaps we should also include the
> calculated cost in the MOSDOpReply.
>
> == Conclusion ==
>
> So that's what we're thinking about and your own thoughts and feedback
> would be appreciated. Thanks!

* Re: Add dmclock QoS client calls to librados -- request for comments
  2018-01-03 19:26 ` Gregory Farnum
@ 2018-01-03 20:03   ` Sage Weil
  0 siblings, 0 replies; 11+ messages in thread
From: Sage Weil @ 2018-01-03 20:03 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: J. Eric Ivancich, Ceph Development

On Wed, 3 Jan 2018, Gregory Farnum wrote:
> On Mon, Dec 18, 2017 at 11:04 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> > We are asking the Ceph community to provide their thoughts on this
> > draft proposal for expanding the librados API with calls that would
> > allow clients to specify QoS (quality of service) parameters for
> > their operations.
> >
> > We have an on-going effort to provide Ceph users with more options to
> > manage QoS. With the release of Luminous we introduced access to a
> > prototype of the mclock QoS algorithm for queuing operations by class
> > of operation and either differentiating clients or treating them as a
> > unit. Although not yet integrated, the library we're using supports
> > dmClock, a distributed version of mClock. Both are documented in
> > _mClock: Handling Throughput Variability for Hypervisor IO Scheduling_
> > by Gulati, Merchant, and Varman 2010.
> >
> > In order to offer greater flexibility, we'd like to move forward with
> > providing clients with the ability to use different QoS parameters. We
> > are keeping our options open w.r.t. the ultimate set of algorithm(s)
> > we'll use. The mClock/dmClock algorithm allows a "client", which we
> > can interpret broadly, to set a minimum ops/sec (reservation) and a
> > maximum ops/sec (limit). Furthermore a "client" can also define a
> > weight (a.k.a.  priority), which is a scalar value to determine
> > relative weighting.
> >
> > We think reservation, limit, and weight are sufficiently generic that
> > we'd be able to use or adapt them other QoS algorithms we may try or
> > use in the future.
> >
> > [To give you a sense of how broadly we can interpret "client", we
> > currently have code that interprets classes of operations (e.g.,
> > background replication or background snap-trimming) as a client.]
> >
> > == Units ==
> >
> > One key difference we're considering, however, is changing the unit
> > that reservations and limits are expressed in from ops/sec to
> > something more appropriate for Ceph. Operations have payloads of
> > different sizes and will therefore take different amounts of time, and
> > that should be factored in. We might refer to this as the "cost" of
> > the operation. And the cost is not linear with the size of the
> > payload. For example, a write of 4 MB might only take 20 times as long
> > as a write of 4 KB even though the sizes differ by a factor of
> > 1000. Using cost would allow us to, for example, achieve a fairer
> > prioritization of a client doing many small writes against a client
> > that's doing a few larger writes.
> >
> > One proposed formula to translate one op into cost would be something
> > along the lines of:
> >
> >     cost_units = a + b * log(payload_size)
> >
> > where a and b would have to be chosen or tuned based on the storage
> > back-end.
> >
> > And that gets us to the units for defining reservation and limit --
> > cost_units per unit of time. Typically these are floating point
> > values, however we do not use floating point types in librados calls
> > because qemu, when calling into librbd, does not save and restore the
> > cpu's floating point mode.
> >
> > There are two ways of getting appropriate ranges of values given that
> > we need to use integral types for cost_units per unit of time. One is
> > a large time unit in the denominator, such as minutes or even
> > hours. That would leave us with cost_units per minute. We are unsure
> > that the strange unit is the best approach and your feedback would be
> > appreciated.
> >
> > A standard alternative would be to use a standard time unit, such as
> > seconds, but integers as fixed-point values. So a floating-point value
> > in cost_units per second would be multiplied by, say, 1000 and rounded
> > to get the corresponding integer value.
> >
> > == librados Additions ==
> >
> > The basic idea is that one would be able to create (and destroy) qos
> > profiles and then associate a profile with an ioctx. Ops on the ioctx
> > would use the qos profile associated with it.
> >
> > typedef void* rados_qos_profile_t; // opaque
> >
> > // parameters uint64_t in cost_units per time unit as discussed above
> > profile1 = rados_qos_profile_create(reservation, weight, limit);
> >
> > rados_ioctx_set_qos_profile(ioctx3, profile1);
> >
> > ...
> > // ops to ioctx3 would now use the specified profile
> > ...
> >
> > // use the profile just for a particular operation
> > rados_write_op_set_qos_profile(op1, profile1);
> >
> > rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
> >
> > rados_qos_profile_destroy(profile1);
> 
> Oh, one more thing I noticed. It's not clear to me from this interface
> if it's possible to use the same profile across more than one ioctx
> and have them share a common reservation. Or will it just be a
> configuration struct that the IoCtx uses to set up its internal
> tracking state, and then they run independently even if reused?

I think the idea is that there is an internal id associated with the qos 
profile, and the reservation pool id that is exposed to the osd etc to 
shape traffic is the <client_id, profile_id> pair.  So it would let you 
share the profile across two ioctx such that they come out of the same 
reservation.
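
For illustration, a minimal sketch (in C, written against the calls
proposed earlier in this thread, which do not exist in librados today;
the parameter values are purely hypothetical) of sharing one profile
across two ioctxs:

    #include <rados/librados.h>

    /* Sketch only: rados_qos_profile_t and the rados_*_qos_* calls are
     * the additions proposed in this thread, not an existing API. */
    void share_qos_profile(rados_ioctx_t ioctx_a, rados_ioctx_t ioctx_b)
    {
        /* Hypothetical values, in cost_units per time unit. */
        rados_qos_profile_t profile =
            rados_qos_profile_create(1000 /* reservation */,
                                     500  /* weight */,
                                     4000 /* limit */);

        /* Both ioctxs reference the same profile; under the
         * <client_id, profile_id> scheme above, their ops would be
         * shaped out of one shared reservation rather than two
         * independent ones. */
        rados_ioctx_set_qos_profile(ioctx_a, profile);
        rados_ioctx_set_qos_profile(ioctx_b, profile);

        /* ... issue ops on either ioctx ... */

        rados_ioctx_set_qos_profile(ioctx_a, NULL); /* back to default */
        rados_ioctx_set_qos_profile(ioctx_b, NULL);
        rados_qos_profile_destroy(profile);
    }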

sage

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Add dmclock QoS client calls to librados -- request for comments
  2018-01-03 13:43     ` 김태웅
@ 2018-01-05  4:35       ` Byung Su Park
  2018-01-05 21:29         ` J. Eric Ivancich
  0 siblings, 1 reply; 11+ messages in thread
From: Byung Su Park @ 2018-01-05  4:35 UTC (permalink / raw)
  To: J. Eric Ivancich, Ceph Development
  Cc: Sage Weil, Mark Nelson, KIM TAEWOONG, 박병수

Hi Eric,

2018-01-03 22:43 GMT+09:00 김태웅 <isis1054@gmail.com>:
>
> 2018-01-03 0:11 GMT+09:00 J. Eric Ivancich <ivancich@redhat.com>:
> >
> > Thanks, Mark, for those thoughts.
> >
> > > On Dec 19, 2017, at 12:45 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
> > >
> > > On 12/18/2017 01:04 PM, J. Eric Ivancich wrote:
> > >> == Units ==
> > >> One key difference we're considering, however, is changing the unit
> > >> that reservations and limits are expressed in from ops/sec to
> > >> something more appropriate for Ceph. Operations have payloads of
> > >> different sizes and will therefore take different amounts of time, and
> > >> that should be factored in. We might refer to this as the "cost" of
> > >> the operation. And the cost is not linear with the size of the
> > >> payload. For example, a write of 4 MB might only take 20 times as long
> > >> as a write of 4 KB even though the sizes differ by a factor of
> > >> 1000. Using cost would allow us to, for example, achieve a fairer
> > >> prioritization of a client doing many small writes against a client
> > >> that's doing a few larger writes.
> > >
> > > Getting away from ops/s is a good idea imho, and I generally agree here.
> >
> > Cool!
> >
> > >> One proposed formula to translate one op into cost would be something
> > >> along the lines of:
> > >>     cost_units = a + b * log(payload_size)
> > >> where a and b would have to be chosen or tuned based on the storage
> > >> back-end.
> > >
> > > I guess the idea is that we can generally approximate the curve of both HDDs and solid state storage with this formula by tweaking a and b? I've got a couple of concerns:
> >
> > That’s correct.
> >
> > > 1) I don't think most users are going to get a and b right.  If anything I suspect we'll end up with a couple of competing values for HDD and SSDs that people will just copy/paste from each other or the mailing list.  I'd much rather that we had hdd/ssd defaults like we do for other options in ceph that get us in the right ballparks and get set automatically based on the disk type.
> >
> > I agree; best to have sensible defaults.
> >
> > > 2) log() is kind of expensive.  It's not *that* bad, but it's enough that for small NVMe read ops we could start to see it show up in profiles.
> > >
> > > http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/
> > >
> > > I suspect it might be a good idea to pre-compute the cost_units for the first 64k (or whatever) payload_sizes, especially if that value is 64bit.  It would take minimal memory and I could see it becoming more important as flash becomes more common (especially on ARM and similar CPUs).
> >
> > I agree. Rounding to the nearest kb and then doing a table look-up is likely the right way to go and then only doing the full calculation when necessary. We could even consider pre-computing the values for powers-of-2-kb (e.g., 1k, 2k, 4k, 8k, 16k, …. 128k, 256k, …) and rounding each payload to the next highest, assuming it’s not problematic to treat, say, 20k, 25k, and 30k as having the same cost (i.e., same as 32k). Or use a combination of the two approaches — linear table for smaller payloads and exponential table for the larger payloads.
>
> Pre-computing the cost table seems like a good idea. I think it also
> lets us use more complicated formulas, because the full computation is
> only needed when a value is not already in the table.
> I wonder whether the log function is really needed. In past tests in
> my environment, the cost appeared to be linear in the request size
> rather than logarithmic.
> In my observation, the larger the size, the stronger the linearity,
> though this may depend on the environment.
> To cover these various environments, we could generalize the formula as below:
> cost_units = a + b * payload_size + c * log(d * payload_size)
> I'm not sure which terms should be removed at this point. The exact
> form of the formula should be determined with more tests.
>

To add to Taewoong's observation, the environment in which I/O cost
increases linearly with payload_size is an SSD-based Ceph cluster.
We also think we need predefined, per-I/O-type coefficients (e.g., b1
for reads and b2 for writes) when calculating I/O cost.
For I/O cost modeling, the following paper may be a useful reference:
https://people.ucsc.edu/~hlitz/papers/reflex.pdf
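
For concreteness, here is a minimal sketch of a cost model along those
lines, with separate read/write coefficients next to the linear and log
terms discussed above; every name and coefficient value here is made up
and would have to be tuned per back-end:

    #include <math.h>
    #include <stdint.h>

    enum io_type { IO_READ, IO_WRITE };

    /* Illustrative only: b1 applies to reads, b2 to writes, on top of
     * the constant, linear, and log terms from the thread. */
    static uint64_t op_cost_units(enum io_type type, uint64_t payload_size)
    {
        const double a  = 64.0;
        const double b1 = 0.010;  /* per-byte coefficient for reads  */
        const double b2 = 0.025;  /* per-byte coefficient for writes */
        const double c  = 32.0;
        const double b  = (type == IO_READ) ? b1 : b2;

        return (uint64_t)(a + b * (double)payload_size
                            + c * log((double)payload_size + 1.0));
    }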

> >
> > > 3) If there were an easy way to express it, it might be nice to just give advanced users the option to write their own function here as an override vs the defaults. ie (not real numbers):
> > >
> > > notreal_qos_cost_unit_algorithm = ""
> > > notreal_qos_cost_unit_algorithm_ssd = "1024 + 0.25*log(payload_size)"
> > > notreal_qos_cost_unit_algorithm_hdd = "64 + 32*log(payload_size)"
> > >
> > > I think this is cleaner than needing to specify hdd_a, hdd_b, ssd_a, ssd_b on nodes with mixed HDD/flash OSDs.
> >
> > I’m inclined to go with the more simple implementation, at least for the first take, but certainly open to the more general implementation that you suggest.
> >
> > >> And that gets us to the units for defining reservation and limit --
> > >> cost_units per unit of time. Typically these are floating point
> > >> values, however we do not use floating point types in librados calls
> > >> because qemu, when calling into librbd, does not save and restore the
> > >> cpu's floating point mode.
> > >> There are two ways of getting appropriate ranges of values given that
> > >> we need to use integral types for cost_units per unit of time. One is
> > >> a large time unit in the denominator, such as minutes or even
> > >> hours. That would leave us with cost_units per minute. We are unsure
> > >> that the strange unit is the best approach and your feedback would be
> > >> appreciated.
> > >> A standard alternative would be to use a standard time unit, such as
> > >> seconds, but integers as fixed-point values. So a floating-point value
> > >> in cost_units per second would be multiplied by, say, 1000 and rounded
> > >> to get the corresponding integer value.
> > >
> > > In the 2nd scenario it's just a question of how we handle it internally right?
> >
> > The client calling into librados would have to do the conversion of floating-point into fixed-point. I’ll reply to Sage’s reply to this thread next, but I think he makes a good point that the number of cost units for typical payload sizes will be (much?) larger than 1, so we might be able to use seconds as our time unit *and* avoid fixed-point math. In other words, I’m now thinking that the caller would simply need to round to an integral value *if* they started with a floating point value.
> >
> > >> == librados Additions ==
> > >> The basic idea is that one would be able to create (and destroy) qos
> > >> profiles and then associate a profile with an ioctx. Ops on the ioctx
> > >> would use the qos profile associated with it.
> > >> typedef void* rados_qos_profile_t; // opaque
> > >> // parameters uint64_t in cost_units per time unit as discussed above
> > >> profile1 = rados_qos_profile_create(reservation, weight, limit);
> > >> rados_ioctx_set_qos_profile(ioctx3, profile1);
> > >> ...
> > >> // ops to ioctx3 would now use the specified profile
> > >> ...
> > >> // use the profile just for a particular operation
> > >> rados_write_op_set_qos_profile(op1, profile1);
> > >> rados_ioctx_set_qos_profile(ioctx3, NULL); // set to default profile
> > >> rados_qos_profile_destroy(profile1);
> > >> == MOSDOp and MOSDOpReply Changes ==
> > >> Because the qos_profile would be managed by the ioctx, MOSDOps sent
> > >> via that ioctx would include the reservation, weight, and limit. At
> > >> this point we think this would be better than keeping the profiles on
> > >> the back-end, although it increases the MOSDOp data structure by about
> > >> 128 bits.
> > >> The MOSDOp type already contains dmclock's delta and rho parameters
> > >> and MOSDOpReply already contains the dmclock phase indicator due to
> > >> prior work. Given that we're moving towards using cost_unit per
> > >> time_unit rather than ops per sec, perhaps we should also include the
> > >> calculated cost in the MOSDOpReply.

Currently, the architecture you suggest performs the I/O cost
calculation and profiling on the client side.
I would like to hear more about why you are considering a client-side
implementation rather than a server-side one.

As we already know, the dmClock algorithm already throttles requests
using the delta/rho values on the client side, while a fair cost
estimate for each size and type of I/O is required on the server side.
I think calculating I/O cost on the server side is at least worth
considering.

Thank you.

> > >
> > > Does it change things at all if we have fast pre-calculated values of cost_unit available for a given payload size?
> >
> > No, that wouldn’t change anything. This value will help the new piece in librados that handles dmclock correctly apportion the work done by each server to ensure fairness across servers. When using “ops” the value was 1. With cost units it gets a little more complex. This would all be internal to librados and the client wouldn’t have to deal with this value.
> >
> > Eric
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Add dmclock QoS client calls to librados -- request for comments
  2018-01-05  4:35       ` Byung Su Park
@ 2018-01-05 21:29         ` J. Eric Ivancich
  0 siblings, 0 replies; 11+ messages in thread
From: J. Eric Ivancich @ 2018-01-05 21:29 UTC (permalink / raw)
  To: Byung Su Park, Ceph Development
  Cc: Sage Weil, Mark Nelson, KIM TAEWOONG, 박병수

Hi Byung Su and Taewoong,

On 01/04/2018 11:35 PM, Byung Su Park wrote:
> Hi Eric,
> 
> 2018-01-03 22:43 GMT+09:00 김태웅 <isis1054@gmail.com>:
>>
>> 2018-01-03 0:11 GMT+09:00 J. Eric Ivancich <ivancich@redhat.com>:
>>>
>>> Thanks, Mark, for those thoughts.
>>>
>>>> On Dec 19, 2017, at 12:45 PM, Mark Nelson <mark.a.nelson@gmail.com> wrote:
>>>>
>>>> On 12/18/2017 01:04 PM, J. Eric Ivancich wrote:
>>>>> == Units ==
>>>>> One key difference we're considering, however, is changing the unit
>>>>> that reservations and limits are expressed in from ops/sec to
>>>>> something more appropriate for Ceph. Operations have payloads of
>>>>> different sizes and will therefore take different amounts of time, and
>>>>> that should be factored in. We might refer to this as the "cost" of
>>>>> the operation. And the cost is not linear with the size of the
>>>>> payload. For example, a write of 4 MB might only take 20 times as long
>>>>> as a write of 4 KB even though the sizes differ by a factor of
>>>>> 1000. Using cost would allow us to, for example, achieve a fairer
>>>>> prioritization of a client doing many small writes against a client
>>>>> that's doing a few larger writes.
>>>>
>>>> Getting away from ops/s is a good idea imho, and I generally agree here.
>>>
>>> Cool!
>>>
>>>>> One proposed formula to translate one op into cost would be something
>>>>> along the lines of:
>>>>>     cost_units = a + b * log(payload_size)
>>>>> where a and b would have to be chosen or tuned based on the storage
>>>>> back-end.
>>>>
>>>> I guess the idea is that we can generally approximate the curve of both HDDs and solid state storage with this formula by tweaking a and b? I've got a couple of concerns:
>>>
>>> That’s correct.
>>>
>>>> 1) I don't think most users are going to get a and b right.  If anything I suspect we'll end up with a couple of competing values for HDD and SSDs that people will just copy/paste from each other or the mailing list.  I'd much rather that we had hdd/ssd defaults like we do for other options in ceph that get us in the right ballparks and get set automatically based on the disk type.
>>>
>>> I agree; best to have sensible defaults.
>>>
>>>> 2) log() is kind of expensive.  It's not *that* bad, but it's enough that for small NVMe read ops we could start to see it show up in profiles.
>>>>
>>>> http://nicolas.limare.net/pro/notes/2014/12/16_math_speed/
>>>>
>>>> I suspect it might be a good idea to pre-compute the cost_units for the first 64k (or whatever) payload_sizes, especially if that value is 64bit.  It would take minimal memory and I could see it becoming more important as flash becomes more common (especially on ARM and similar CPUs).
>>>
>>> I agree. Rounding to the nearest kb and then doing a table look-up is likely the right way to go and then only doing the full calculation when necessary. We could even consider pre-computing the values for powers-of-2-kb (e.g., 1k, 2k, 4k, 8k, 16k, …. 128k, 256k, …) and rounding each payload to the next highest, assuming it’s not problematic to treat, say, 20k, 25k, and 30k as having the same cost (i.e., same as 32k). Or use a combination of the two approaches — linear table for smaller payloads and exponential table for the larger payloads.
>>
>> Pre-computing the cost table seems like a good idea. I think it also
>> lets us use more complicated formulas, because the full computation is
>> only needed when a value is not already in the table.
>> I wonder whether the log function is really needed. In past tests in
>> my environment, the cost appeared to be linear in the request size
>> rather than logarithmic.
>> In my observation, the larger the size, the stronger the linearity,
>> though this may depend on the environment.
>> To cover these various environments, we could generalize the formula as below:
>> cost_units = a + b * payload_size + c * log(d * payload_size)
>> I'm not sure which terms should be removed at this point. The exact
>> form of the formula should be determined with more tests.
>>
> 
> To add to Taewoong's observation, the environment in which I/O cost
> increases linearly with payload_size is an SSD-based Ceph cluster.
> We also think we need predefined, per-I/O-type coefficients (e.g., b1
> for reads and b2 for writes) when calculating I/O cost.
> For I/O cost modeling, the following paper may be a useful reference:
> https://people.ucsc.edu/~hlitz/papers/reflex.pdf

Thank you for that reference; I will read it. I'm certainly open to
making the modeling function more complex. In a way you're arguing for
Mark Nelson's idea (see immediately below) of allowing a somewhat
free-form function to be defined. And since such a function would need
to be parsed and likely stored as a computation tree and thereby
interpreted, it argues even further for pre-computing these values in
one or more tables.
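
To make the table idea concrete, here is a minimal sketch, assuming a
hypothetical tuned cost function and a linear table that covers payloads
up to 64 KiB, rounded up to the next KiB, with the full calculation as a
fallback for larger payloads:

    #include <math.h>
    #include <stdint.h>

    #define COST_TABLE_KIB 64  /* precompute entries for 0 .. 64 KiB */

    static uint64_t cost_table[COST_TABLE_KIB + 1];

    /* Hypothetical tuned coefficients (the a and b discussed above). */
    static const double cost_a = 64.0, cost_b = 32.0;

    static uint64_t compute_cost(uint64_t payload_kib)
    {
        /* +1 avoids log(0) for zero-length payloads. */
        return (uint64_t)(cost_a + cost_b * log((double)payload_kib + 1.0));
    }

    static void cost_table_init(void)
    {
        uint64_t kib;
        for (kib = 0; kib <= COST_TABLE_KIB; ++kib)
            cost_table[kib] = compute_cost(kib);
    }

    /* Round the payload up to the next KiB and look the cost up; fall
     * back to the full calculation only beyond the table. */
    static uint64_t lookup_op_cost(uint64_t payload_bytes)
    {
        uint64_t kib = (payload_bytes + 1023) / 1024;
        return kib <= COST_TABLE_KIB ? cost_table[kib] : compute_cost(kib);
    }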

>>>> 3) If there were an easy way to express it, it might be nice to just give advanced users the option to write their own function here as an override vs the defaults. ie (not real numbers):
>>>>
>>>> notreal_qos_cost_unit_algorithm = ""
>>>> notreal_qos_cost_unit_algorithm_ssd = "1024 + 0.25*log(payload_size)"
>>>> notreal_qos_cost_unit_algorithm_hdd = "64 + 32*log(payload_size)"
>>>>
>>>> I think this is cleaner than needing to specify hdd_a, hdd_b, ssd_a, ssd_b on nodes with mixed HDD/flash OSDs.
>>>
>>> I’m inclined to go with the more simple implementation, at least for the first take, but certainly open to the more general implementation that you suggest.

...

>>>>> == MOSDOp and MOSDOpReply Changes ==
>>>>> Because the qos_profile would be managed by the ioctx, MOSDOps sent
>>>>> via that ioctx would include the reservation, weight, and limit. At
>>>>> this point we think this would be better than keeping the profiles on
>>>>> the back-end, although it increases the MOSDOp data structure by about
>>>>> 128 bits.
>>>>> The MOSDOp type already contains dmclock's delta and rho parameters
>>>>> and MOSDOpReply already contains the dmclock phase indicator due to
>>>>> prior work. Given that we're moving towards using cost_unit per
>>>>> time_unit rather than ops per sec, perhaps we should also include the
>>>>> calculated cost in the MOSDOpReply.
> 
> Currently, the architecture you suggest performs the I/O cost
> calculation and profiling on the client side.
> I would like to hear more about why you are considering a client-side
> implementation rather than a server-side one.
> 
> As we already know, the dmClock algorithm already throttles requests
> using the delta/rho values on the client side, while a fair cost
> estimate for each size and type of I/O is required on the server side.
> I think calculating I/O cost on the server side is at least worth
> considering.

Sorry that wasn't clear. Yes, the cost is calculated on the server side,
which is why it needs to be sent back to the client in the MOSDOpReply,
so that the client side of dmclock can update its state correctly when
calculating future delta and rho values.
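
To make that bookkeeping concrete, here is a rough, purely illustrative
sketch (not the dmclock library's actual interface) of the kind of
per-server state the client side could update from the phase indicator
and cost carried in each MOSDOpReply, instead of simply counting one op
per reply:

    #include <stdint.h>

    /* Illustrative only: per-OSD accounting on the client side. With
     * ops as the unit each reply would add 1; with cost_units each
     * reply adds the cost the server calculated and returned in the
     * MOSDOpReply. The accumulated totals would feed the delta/rho
     * values attached to future requests. */
    enum dmclock_phase { PHASE_RESERVATION, PHASE_PRIORITY };

    struct server_qos_state {
        uint64_t reservation_cost;  /* cost served under reservation */
        uint64_t priority_cost;     /* cost served under weight/priority */
    };

    static void track_reply(struct server_qos_state *s,
                            enum dmclock_phase phase,
                            uint64_t cost_from_reply)
    {
        if (phase == PHASE_RESERVATION)
            s->reservation_cost += cost_from_reply;
        else
            s->priority_cost += cost_from_reply;
    }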

Since you're familiar with the internals of the dmclock library, I'll
add that having the server side calculate the cost would make it
difficult (likely impossible) to correctly use the BorrowingTracker. To
use that tracker the client would need to independently calculate the
cost of a request.

Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2018-01-05 21:29 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-18 19:04 Add dmclock QoS client calls to librados -- request for comments J. Eric Ivancich
2017-12-19 16:13 ` Sage Weil
2018-01-02 15:26   ` J. Eric Ivancich
2017-12-19 17:45 ` Mark Nelson
2018-01-02 15:11   ` J. Eric Ivancich
2018-01-03 13:43     ` 김태웅
2018-01-05  4:35       ` Byung Su Park
2018-01-05 21:29         ` J. Eric Ivancich
2018-01-02 17:58 ` Gregory Farnum
2018-01-03 19:26 ` Gregory Farnum
2018-01-03 20:03   ` Sage Weil
