* How best to integrate dmClock QoS library into ceph codebase
@ 2017-04-04 13:32 J. Eric Ivancich
  2017-04-04 16:00 ` Adam C. Emerson
  2017-05-16  2:46 ` Ming Lin
  0 siblings, 2 replies; 15+ messages in thread
From: J. Eric Ivancich @ 2017-04-04 13:32 UTC (permalink / raw)
  To: ceph-devel

In our work to improve QoS with ceph, we implemented the dmClock algorithm (see: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf).

The algorithm is implemented as a general library that could be used in ceph and in other projects unrelated to ceph. To do this the dmClock library code makes use of C++ templates. On the server side, the key class takes template parameters to describe the type of a request and the type of a client identifier. On the client side, the key class takes a template parameter for the type of a server identifier.
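
To give a sense of the shape of the interface, here is a rough sketch (the names below are illustrative only, not the library's actual declarations; those are in the repo linked below):

// Illustrative sketch only; the real class and parameter names in
// ceph/dmclock differ.

// Server side: parameterized on the client-identifier type C and the
// request type R that the queue holds.
template <typename C, typename R>
class ServerQueueSketch {
  // ... per-client tag state keyed by C, queued requests of type R ...
};

// Client side: parameterized on the server-identifier type S, so a client
// can track its per-server state (the paper's rho/delta counters).
template <typename S>
class ClientTrackerSketch {
  // ... per-server counters keyed by S ...
};

// A ceph-flavored instantiation might look roughly like:
//   ServerQueueSketch<client_id_t, OpRequestRef> osd_op_queue;
//   ClientTrackerSketch<int /* osd id */> osd_tracker;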

You can find the library here:

   https://github.com/ceph/dmclock

The question is how best to integrate this library with the ceph code in our git repo going forward.

There are three obvious options:

   1. Keep dmClock as a separate repo and incorporate it into ceph as a git submodule.
   2. Keep dmClock as a separate repo and incorporate it into ceph as a git subtree.
   3. Move the code into the ceph tree and stop maintaining it as a generalized library.

Both git submodules and git subtrees have their own set of challenges; neither is perfect. Many have weighed in on their relative advantages and disadvantages (https://www.google.com/search?q=git+submodule+subtree).

I’m inclined to keep it separate (option 1 or 2) so that others might use it in other projects.

When I started the integration, Sam recommended that it be maintained as a subtree. So that’s how it’s implemented in my not-yet-merged branch. The key challenges have been with rebases that include the commit in which the subtree was added and with pushing code changes back to the library. Since relatively few would likely be doing these types of ops, perhaps option 2 might be easiest. Currently, our git PR process has an automated check for "Unmodified Submodules". I’m guessing we’d likely want a similar check for changes in subtree code.

An argument for making it a submodule is that ceph already uses submodules and ceph developers are familiar with working with (and around) them.

But perhaps others would like to weigh in.

Thanks,

Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-04-04 13:32 How best to integrate dmClock QoS library into ceph codebase J. Eric Ivancich
@ 2017-04-04 16:00 ` Adam C. Emerson
  2017-05-16  2:46 ` Ming Lin
  1 sibling, 0 replies; 15+ messages in thread
From: Adam C. Emerson @ 2017-04-04 16:00 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: The Sacred Order of the Squid Cybernetic

On 04/04/2017, J. Eric Ivancich wrote:
> An argument for making it a submodule is that ceph already uses submodules and ceph developers are familiar with working with (and around) them.

I think this would be the overriding argument. If both submodules and
subtrees have their own set of drawbacks, it seems like a bad idea for
Ceph as a project to have to deal with /both/ sets of drawbacks
instead of just one.

-- 
Senior Software Engineer           Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-04-04 13:32 How best to integrate dmClock QoS library into ceph codebase J. Eric Ivancich
  2017-04-04 16:00 ` Adam C. Emerson
@ 2017-05-16  2:46 ` Ming Lin
  2017-05-16 12:29   ` J. Eric Ivancich
  1 sibling, 1 reply; 15+ messages in thread
From: Ming Lin @ 2017-05-16  2:46 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

Hi Eric,

Do you have any integration patches I can try?

Thanks,
Ming

On Tue, Apr 4, 2017 at 6:32 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> In our work to improve QoS with ceph, we implemented the dmClock algorithm (see: https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Gulati.pdf).
>
> The algorithm is implemented as a general library that could be used in ceph and in other projects unrelated to ceph. To do this the dmClock library code makes use of C++ templates. On the server side, the key class takes template parameters to describe the type of a request and the type of a client identifier. On the client side, the key class takes a template parameter for the type of a server identifier.
>
> You can find the library here:
>
>    https://github.com/ceph/dmclock
>
> The question is how best to integrate this library with the ceph code in our git repo going forward.
>
> There are three obvious options:
>
>    1. Keep dmClock as a separate repo and incorporate it into ceph as a git submodule.
>    2. Keep dmClock as a separate repo and incorporate it into ceph as a git subtree.
>    3. Move the code into the ceph tree and stop maintaining it as a generalized library.
>
> Both git submodules and git subtrees have their own set of challenges; neither is perfect. Many have weighed in on their relative advantages and disadvantages (https://www.google.com/search?q=git+submodule+subtree).
>
> I’m inclined to keep it separate (option 1 or 2) so that others might use it in other projects.
>
> When I started the integration, Sam recommended that it be maintained as a subtree. So that’s how it’s implemented in my not-yet-merged branch. The key challenges have been with rebases that include the commit in which the subtree was added and with pushing code changes back to the library. Since relatively few would likely be doing these types of ops, perhaps option 2 might be easiest. Currently, our git PR process has an automated check for "Unmodified Submodules". I’m guessing we’d likely want a similar check for changes in subtree code.
>
> An argument for making it a submodule is that ceph already uses submodules and ceph developers are familiar with working with (and around) them.
>
> But perhaps others would like to weigh in.
>
> Thanks,
>
> Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-05-16  2:46 ` Ming Lin
@ 2017-05-16 12:29   ` J. Eric Ivancich
  2017-05-16 17:59     ` Ming Lin
  2017-06-21 17:38     ` sheng qiu
  0 siblings, 2 replies; 15+ messages in thread
From: J. Eric Ivancich @ 2017-05-16 12:29 UTC (permalink / raw)
  To: Ming Lin; +Cc: Ceph Development

On 05/15/2017 10:46 PM, Ming Lin wrote:
> Hi Eric,
> 
> Do you have any integration patches I can try?
Hi Ming,

The dmClock library became part of the master branch in early May.

Also, two implementations of the dmClock QoS are in a pull request
currently being reviewed:

    https://github.com/ceph/ceph/pull/14997

Eric



* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-05-16 12:29   ` J. Eric Ivancich
@ 2017-05-16 17:59     ` Ming Lin
  2017-06-21 17:38     ` sheng qiu
  1 sibling, 0 replies; 15+ messages in thread
From: Ming Lin @ 2017-05-16 17:59 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

On Tue, May 16, 2017 at 5:29 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> On 05/15/2017 10:46 PM, Ming Lin wrote:
>> Hi Eric,
>>
>> Do you have any integration patches I can try?
> Hi Ming,
>
> The dmClock library became part of the master branch in early May.
>
> Also, two implementations of the dmClock QoS are in a pull request
> currently being reviewed:
>
>     https://github.com/ceph/ceph/pull/14997

That's great. I'll test it soon.

Thanks.

>
> Eric
>


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-05-16 12:29   ` J. Eric Ivancich
  2017-05-16 17:59     ` Ming Lin
@ 2017-06-21 17:38     ` sheng qiu
  2017-06-21 21:04       ` J. Eric Ivancich
  1 sibling, 1 reply; 15+ messages in thread
From: sheng qiu @ 2017-06-21 17:38 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

hi Eric,

We are pretty interested in your dmclock integration work with Ceph.
After reading your pull request, I am a little confused.
May I ask whether config settings such as
osd_op_queue_mclock_client_op_res are actually used in the dmclock
queues you added and in their enqueue and dequeue methods?
The enqueue function below inserts a request into a map<priority,
subqueue>; I guess that for the mclock_opclass queue you set a high
priority for client ops and lower priorities for scrub, recovery, etc.
Within each subqueue of the same priority, did you do FIFO?

void enqueue_strict(K cl, unsigned priority, T item) override final {
    high_queue[priority].enqueue(cl, 0, item);
}

I would appreciate it if you could provide some comments, especially if I
didn't understand correctly.

Thanks,
Sheng


On Tue, May 16, 2017 at 5:29 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> On 05/15/2017 10:46 PM, Ming Lin wrote:
>> Hi Eric,
>>
>> Do you have any integration patches I can try?
> Hi Ming,
>
> The dmClock library became part of the master branch in early May.
>
> Also, two implementations of the dmClock QoS are in a pull request
> currently being reviewed:
>
>     https://github.com/ceph/ceph/pull/14997
>
> Eric
>


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-21 17:38     ` sheng qiu
@ 2017-06-21 21:04       ` J. Eric Ivancich
  2017-06-27 21:21         ` sheng qiu
  0 siblings, 1 reply; 15+ messages in thread
From: J. Eric Ivancich @ 2017-06-21 21:04 UTC (permalink / raw)
  To: sheng qiu; +Cc: Ceph Development

Hi Sheng,

I'll interleave responses below.

On 06/21/2017 01:38 PM, sheng qiu wrote:
> hi Eric,
> 
> We are pretty interested in your dmclock integration work with Ceph.
> After reading your pull request, I am a little confused.
> May I ask whether config settings such as
> osd_op_queue_mclock_client_op_res are actually used in the dmclock
> queues you added and in their enqueue and dequeue methods?

Yes, that (and related) configuration option is used. You'll see it
referenced in both src/osd/mClockOpClassQueue.cc and
src/osd/mClockClientQueue.cc.

Let me answer for mClockOpClassQueue, but the process is similar in
mClockClientQueue.

The configuration value is brought into an instance of
mClockOpClassQueue::mclock_op_tags_t. The variable
mClockOpClassQueue::mclock_op_tags holds a unique_ptr to a singleton of
that type. And then when a new operation is enqueued, the function
mClockOpClassQueue::op_class_client_info_f is called to determine its
mclock parameters at which time the value is used.
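
In other words, the flow is roughly as follows. This is a simplified sketch, not the actual ceph source; the *_sketch names and the set of op classes shown are placeholders:

// Simplified sketch; the real code lives in src/osd/mClockOpClassQueue.{h,cc}
// and uses dmclock's own types rather than these stand-ins.
#include <memory>

struct ClientInfoSketch {          // stand-in for the res/wgt/lim triple
  double reservation, weight, limit;
};

enum class op_class_t { client_op, recovery, scrub /* , ... */ };

struct mclock_op_tags_sketch_t {   // one entry per op class, filled from the
  ClientInfoSketch client_op;      // osd_op_queue_mclock_* options at startup
  ClientInfoSketch recovery;
  ClientInfoSketch scrub;
};

// Singleton holding the parsed configuration (population at startup not shown).
static std::unique_ptr<mclock_op_tags_sketch_t> mclock_op_tags_sketch;

// Looked up when an op is enqueued, to give mclock its parameters for that class.
static ClientInfoSketch op_class_client_info_sketch(op_class_t c) {
  switch (c) {
  case op_class_t::client_op: return mclock_op_tags_sketch->client_op;
  case op_class_t::recovery:  return mclock_op_tags_sketch->recovery;
  default:                    return mclock_op_tags_sketch->scrub;
  }
}

The real code distinguishes more op classes than shown here, but the lookup has the same shape.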

> The enqueue function below inserts a request into a map<priority,
> subqueue>; I guess that for the mclock_opclass queue you set a high
> priority for client ops and lower priorities for scrub, recovery, etc.
> Within each subqueue of the same priority, did you do FIFO?
> 
> void enqueue_strict(K cl, unsigned priority, T item) override final {
>     high_queue[priority].enqueue(cl, 0, item);
> }

Yes, higher priority operations use a strict queue and lower priority
operations use mclock. That basic behavior was based on the two earlier
op queue implementations (src/common/WeightedPriorityQueue.h and
src/common/PrioritizedQueue.h). The priority value that's used as a
cut-off is determined by the configuration option osd_op_queue_cut_off
(which can be "low" or "high", which map to values CEPH_MSG_PRIO_LOW and
CEPH_MSG_PRIO_HIGH (defined in src/include/msgr.h); see function
OSD::get_io_prio_cut).

And those operations that end up in the high queue are handled strictly
-- higher priorities before lower priorities.
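
Schematically, the two tiers fit together something like this. It's a self-contained toy, not the actual ceph queue classes; TwoTierQueueSketch and its members are made-up names:

#include <deque>
#include <iterator>
#include <map>
#include <string>
#include <utility>

struct Op { std::string op_class; int payload; };

class TwoTierQueueSketch {
  unsigned cutoff;                                 // from osd_op_queue_cut_off
  std::map<unsigned, std::deque<Op>> high_queue;   // strict tier, keyed by priority
  std::deque<Op> mclock_tier;                      // stand-in for the dmclock-scheduled tier

public:
  explicit TwoTierQueueSketch(unsigned cutoff_priority) : cutoff(cutoff_priority) {}

  void enqueue(unsigned priority, Op op) {
    if (priority >= cutoff) {
      // high-priority ops bypass mclock (plain FIFO here; the real high_queue
      // keeps per-class subqueues, as in the enqueue_strict snippet above)
      high_queue[priority].push_back(std::move(op));
    } else {
      // everything else would be tagged and scheduled by mclock
      mclock_tier.push_back(std::move(op));
    }
  }

  bool dequeue(Op& out) {
    if (!high_queue.empty()) {
      auto it = std::prev(high_queue.end());       // strictly highest priority first
      out = std::move(it->second.front());
      it->second.pop_front();
      if (it->second.empty()) high_queue.erase(it);
      return true;
    }
    if (!mclock_tier.empty()) {
      out = std::move(mclock_tier.front());        // real code asks dmclock for the next op
      mclock_tier.pop_front();
      return true;
    }
    return false;
  }
};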

> I would appreciate it if you could provide some comments, especially if I
> didn't understand correctly.

I hope that's helpful. Please let me know if you have further questions.

Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-21 21:04       ` J. Eric Ivancich
@ 2017-06-27 21:21         ` sheng qiu
  2017-06-28 18:33           ` J. Eric Ivancich
  0 siblings, 1 reply; 15+ messages in thread
From: sheng qiu @ 2017-06-27 21:21 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

Hi Eric,

I appreciate your kind reply.

In our test, we set the following in the ceph.conf:

osd_op_queue = mclock_client
osd_op_queue_cut_off = high
osd_op_queue_mclock_client_op_lim = 100.0
osd_op_queue_mclock_client_op_res = 50.0
osd_op_num_shards = 1
osd_op_num_threads_per_shard = 1


In this setup, all IO requests should go to one mclock_client queue
and use mclock scheduling (osd_op_queue_cut_off = high).
We use fio for the test, with job=1, bs=4k, and qd=1 or 16.

We expect the IOPS visible to fio to be < 100, but we see a
much higher value.
Did we understand your work correctly, or did we miss anything?

Thanks,
Sheng



On Wed, Jun 21, 2017 at 2:04 PM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> Hi Sheng,
>
> I'll interleave responses below.
>
> On 06/21/2017 01:38 PM, sheng qiu wrote:
>> hi Eric,
>>
>> We are pretty interested in your dmclock integration work with Ceph.
>> After reading your pull request, I am a little confused.
>> May I ask whether config settings such as
>> osd_op_queue_mclock_client_op_res are actually used in the dmclock
>> queues you added and in their enqueue and dequeue methods?
>
> Yes, that (and related) configuration option is used. You'll see it
> referenced in both src/osd/mClockOpClassQueue.cc and
> src/osd/mClockClientQueue.cc.
>
> Let me answer for mClockOpClassQueue, but the process is similar in
> mClockClientQueue.
>
> The configuration value is brought into an instance of
> mClockOpClassQueue::mclock_op_tags_t. The variable
> mClockOpClassQueue::mclock_op_tags holds a unique_ptr to a singleton of
> that type. And then when a new operation is enqueued, the function
> mClockOpClassQueue::op_class_client_info_f is called to determine its
> mclock parameters at which time the value is used.
>
>> The enqueue function below inserts a request into a map<priority,
>> subqueue>; I guess that for the mclock_opclass queue you set a high
>> priority for client ops and lower priorities for scrub, recovery, etc.
>> Within each subqueue of the same priority, did you do FIFO?
>>
>> void enqueue_strict(K cl, unsigned priority, T item) override final {
>>     high_queue[priority].enqueue(cl, 0, item);
>> }
>
> Yes, higher priority operations use a strict queue and lower priority
> operations use mclock. That basic behavior was based on the two earlier
> op queue implementations (src/common/WeightedPriorityQueue.h and
> src/common/PrioritizedQueue.h). The priority value that's used as a
> cut-off is determined by the configuration option osd_op_queue_cut_off
> (which can be "low" or "high", which map to values CEPH_MSG_PRIO_LOW and
> CEPH_MSG_PRIO_HIGH (defined in src/include/msgr.h); see function
> OSD::get_io_prio_cut).
>
> And those operations that end up in the high queue are handled strictly
> -- higher priorities before lower priorities.
>
>> I would appreciate it if you could provide some comments, especially if I
>> didn't understand correctly.
>
> I hope that's helpful. Please let me know if you have further questions.
>
> Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-27 21:21         ` sheng qiu
@ 2017-06-28 18:33           ` J. Eric Ivancich
  2017-06-28 18:55             ` sheng qiu
  2017-07-11 18:14             ` sheng qiu
  0 siblings, 2 replies; 15+ messages in thread
From: J. Eric Ivancich @ 2017-06-28 18:33 UTC (permalink / raw)
  To: sheng qiu; +Cc: Ceph Development

On 06/27/2017 05:21 PM, sheng qiu wrote:
> I appreciate your kind reply.
> 
> In our test, we set the following in the ceph.conf:
> 
> osd_op_queue = mclock_client
> osd_op_queue_cut_off = high
> osd_op_queue_mclock_client_op_lim = 100.0
> osd_op_queue_mclock_client_op_res = 50.0
> osd_op_num_shards = 1
> osd_op_num_threads_per_shard = 1
> 
> 
> In this setup, all IO requests should go to one mclock_client queue
> and use mclock scheduling (osd_op_queue_cut_off = high).
> We use fio for the test, with job=1, bs=4k, and qd=1 or 16.
>
> We expect the IOPS visible to fio to be < 100, but we see a
> much higher value.
> Did we understand your work correctly, or did we miss anything?

Hi Sheng,

I think you understand things well, but there is one additional detail
you may not have noticed yet. And that is what should be done when all
clients have reached their limit momentarily and the ObjectStore would
like another op to keep itself busy? We either a) refuse to provide it
with an op, or b) give it the op that's most appropriate by
weight. The ceph code currently is not designed to handle a) and it's
not even clear that we should starve the ObjectStore in that manner. So
we do b), and that means we can exceed the limit.

dmclock's PullPriorityQueue constructors have a parameter
_allow_limit_break, which ceph sets to true. That is how we do b) above.
If you ever wanted to set that to false you'd need to make other changes
to the ObjectStore ceph code to handle cases where the op queue is not
empty but is not ready/willing to return an op when one is requested.
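
In code, the choice boils down to something like the following. It's only a toy sketch of the decision, not dmclock's actual interface; QueuedOpSketch, weight_tag, etc. are made-up names:

#include <optional>
#include <vector>

struct QueuedOpSketch { int client; double weight_tag; bool under_limit; };

// (a) vs (b): when no client is under its limit, either return nothing and
// let the caller idle, or hand out the best op by weight tag anyway.
std::optional<QueuedOpSketch>
pull_next_sketch(const std::vector<QueuedOpSketch>& ready, bool allow_limit_break) {
  const QueuedOpSketch* best = nullptr;
  for (const auto& op : ready)                        // normal case: choose among
    if (op.under_limit && (!best || op.weight_tag < best->weight_tag))
      best = &op;                                     // clients still under limit
  if (best) return *best;
  if (!allow_limit_break) return std::nullopt;        // (a): starve the ObjectStore
  for (const auto& op : ready)                        // (b): everyone is at limit,
    if (!best || op.weight_tag < best->weight_tag)    // so break the limit and pick
      best = &op;                                     // by weight tag anyway
  return best ? std::optional<QueuedOpSketch>(*best) : std::nullopt;
}

Since ceph passes true, the std::nullopt branch is never taken, which is why the observed iops can exceed the limit.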

Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-28 18:33           ` J. Eric Ivancich
@ 2017-06-28 18:55             ` sheng qiu
  2017-06-29 18:03               ` J. Eric Ivancich
  2017-07-11 18:14             ` sheng qiu
  1 sibling, 1 reply; 15+ messages in thread
From: sheng qiu @ 2017-06-28 18:55 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

Thanks, Eric!

Yes, we found the "allow_limit_break" logic and did a little hack to
enforce the limit when it's set to false.
It seems to be working as we expected.

May I ask whether you have any plan to implement the client-side logic for a
true "D"mclock? Right now it seems to be mclock on each individual OSD.
And each client also has a common iops config. We are planning to
work on that part and integrate it with your current work.

Thanks,
Sheng

On Wed, Jun 28, 2017 at 11:33 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> On 06/27/2017 05:21 PM, sheng qiu wrote:
>> I appreciate your kind reply.
>>
>> In our test, we set the following in the ceph.conf:
>>
>> osd_op_queue = mclock_client
>> osd_op_queue_cut_off = high
>> osd_op_queue_mclock_client_op_lim = 100.0
>> osd_op_queue_mclock_client_op_res = 50.0
>> osd_op_num_shards = 1
>> osd_op_num_threads_per_shard = 1
>>
>>
>> In this setup, all IO requests should go to one mclock_client queue
>> and use mclock scheduling (osd_op_queue_cut_off = high).
>> We use fio for the test, with job=1, bs=4k, and qd=1 or 16.
>>
>> We expect the IOPS visible to fio to be < 100, but we see a
>> much higher value.
>> Did we understand your work correctly, or did we miss anything?
>
> Hi Sheng,
>
> I think you understand things well, but there is one additional detail
> you may not have noticed yet. And that is what should be done when all
> clients have reached their limit momentarily and the ObjectStore would
> like another op to keep itself busy? We either a) refuse to provide it
> with an op, or b) give it the op that's most appropriate by
> weight. The ceph code currently is not designed to handle a) and it's
> not even clear that we should starve the ObjectStore in that manner. So
> we do b), and that means we can exceed the limit.
>
> dmclock's PullPriorityQueue constructors have a parameter
> _allow_limit_break, which ceph sets to true. That is how we do b) above.
> If you ever wanted to set that to false you'd need to make other changes
> to the ObjectStore ceph code to handle cases where the op queue is not
> empty but is not ready/willing to return an op when one is requested.
>
> Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-28 18:55             ` sheng qiu
@ 2017-06-29 18:03               ` J. Eric Ivancich
  2017-07-04 12:35                 ` Jin Cai
  0 siblings, 1 reply; 15+ messages in thread
From: J. Eric Ivancich @ 2017-06-29 18:03 UTC (permalink / raw)
  To: sheng qiu; +Cc: Ceph Development

On 06/28/2017 02:55 PM, sheng qiu wrote:
> May I ask whether you have any plan to implement the client-side logic for a
> true "D"mclock? Right now it seems to be mclock on each individual OSD.
> And each client also has a common iops config. We are planning to
> work on that part and integrate it with your current work.

That is not a high priority in the short-term. Our main goal with
integrating dmclock/mclock was to better manage priorities among
operation classes.

Developers at SK Telecom have done some work towards this, though. For
example, see here:

    https://github.com/ivancich/ceph/pull/1

Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-29 18:03               ` J. Eric Ivancich
@ 2017-07-04 12:35                 ` Jin Cai
  2017-07-05 22:06                   ` J. Eric Ivancich
  0 siblings, 1 reply; 15+ messages in thread
From: Jin Cai @ 2017-07-04 12:35 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: sheng qiu, Ceph Development

Hi, Eric
     We are now testing the mclock priority queue that you contributed
several days ago.
     Our test environment consists of four machines: one machine for
the monitor and mgr daemon, and two OSDs on each of the remaining three.
     The mclock-related configurations are as follows:

    osd_op_queue = mclock_opclass
    osd_op_queue_mclock_client_op_res = 20000.0
    osd_op_queue_mclock_client_op_wgt = 0.0
    osd_op_queue_mclock_client_op_lim = 30000.0
    osd_op_queue_mclock_recov_res = 0.0
    osd_op_queue_mclock_recov_wgt = 0.0
    osd_op_queue_mclock_recov_lim = 2000.0


   When we killed one OSD daemon to test the effects of recovery on
client ops, the other OSDs crashed as well because of an assert failure:

   ceph version 12.0.3-2318-g32ab095
(32ab09536207b4b261874c0063b3275b97537045) luminous (dev)
 1: (()+0x9e86b1) [0x7f4287e846b1]
 2: (()+0xf100) [0x7f4284d18100]
 3: (gsignal()+0x37) [0x7f4283d415f7]
 4: (abort()+0x148) [0x7f4283d42ce8]
 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x284) [0x7f4287ec2364]
 6: (ceph::mClockQueue<std::pair<spg_t, PGQueueable>,
ceph::mClockOpClassQueue::osd_op_type_t>::dequeue()+0x45f)
[0x7f4287baac3f]
 7: (ceph::mClockOpClassQueue::dequeue()+0xd) [0x7f4287baacfd]
 8: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x314) [0x7f428798a174]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9)
[0x7f4287ec7d39]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f4287ec9ec0]
 11: (()+0x7dc5) [0x7f4284d10dc5]
 12: (clone()+0x6d) [0x7f4283e02ced]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.


However, if the value of osd_op_queue was set to wpq or prio, it worked well.
Do you know why the assert was triggered?

Thanks very much.




2017-06-30 2:03 GMT+08:00 J. Eric Ivancich <ivancich@redhat.com>:
> On 06/28/2017 02:55 PM, sheng qiu wrote:
>> May I ask whether you have any plan to implement the client-side logic for a
>> true "D"mclock? Right now it seems to be mclock on each individual OSD.
>> And each client also has a common iops config. We are planning to
>> work on that part and integrate it with your current work.
>
> That is not a high priority in the short-term. Our main goal with
> integrating dmclock/mclock was to better manage priorities among
> operation classes.
>
> Developers at SK Telecom have done some work towards this, though. For
> example, see here:
>
>     https://github.com/ivancich/ceph/pull/1
>
> Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-07-04 12:35                 ` Jin Cai
@ 2017-07-05 22:06                   ` J. Eric Ivancich
  0 siblings, 0 replies; 15+ messages in thread
From: J. Eric Ivancich @ 2017-07-05 22:06 UTC (permalink / raw)
  To: Jin Cai; +Cc: sheng qiu, Ceph Development

Hello,

Thank you for reporting that. I need to verify whether this error was generated by the code on the master branch, or whether the code has been modified in any way.

The dequeue function with the failed assert (in mClockPriorityQueue.h, line 315) only has two asserts.

The first assert checks that the queue is not empty, and it should not be since OSD::ShardedOpWQ::_process verifies that it’s not empty before calling the dequeue function.

The second assert verifies that when an op was requested from the dmclock queue it actually got an op, and not an empty code or a future code (indicating that limits were strictly enforced, which they should not be, since allow_limit_break should be true).

So please let me know if this is modified code. If need be I’ll try to duplicate your scenario, so I can debug it. Please provide all the necessary details, so I can reproduce it.

Thanks,

Eric

> On Jul 4, 2017, at 8:35 AM, Jin Cai <caijin.laurence@gmail.com> wrote:
> 
> Hi, Eric
>     We are now testing the mclock priority queue that you contributed
> several days ago.
>     Our test environment consists of four machines: one machine for
> the monitor and mgr daemon, and two OSDs on each of the remaining three.
>     The mclock-related configurations are as follows:
> 
>    osd_op_queue = mclock_opclass
>    osd_op_queue_mclock_client_op_res = 20000.0
>    osd_op_queue_mclock_client_op_wgt = 0.0
>    osd_op_queue_mclock_client_op_lim = 30000.0
>    osd_op_queue_mclock_recov_res = 0.0
>    osd_op_queue_mclock_recov_wgt = 0.0
>    osd_op_queue_mclock_recov_lim = 2000.0
> 
> 
>   When we killed one OSD daemon to test the effects of recovery on
> client ops, the other OSDs crashed as well because of an assert failure:
> 
>   ceph version 12.0.3-2318-g32ab095
> (32ab09536207b4b261874c0063b3275b97537045) luminous (dev)
> 1: (()+0x9e86b1) [0x7f4287e846b1]
> 2: (()+0xf100) [0x7f4284d18100]
> 3: (gsignal()+0x37) [0x7f4283d415f7]
> 4: (abort()+0x148) [0x7f4283d42ce8]
> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x7f4287ec2364]
> 6: (ceph::mClockQueue<std::pair<spg_t, PGQueueable>,
> ceph::mClockOpClassQueue::osd_op_type_t>::dequeue()+0x45f)
> [0x7f4287baac3f]
> 7: (ceph::mClockOpClassQueue::dequeue()+0xd) [0x7f4287baacfd]
> 8: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0x314) [0x7f428798a174]
> 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8e9)
> [0x7f4287ec7d39]
> 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f4287ec9ec0]
> 11: (()+0x7dc5) [0x7f4284d10dc5]
> 12: (clone()+0x6d) [0x7f4283e02ced]
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
> 
> 
> However, if the value of osd_op_queue was set to wpq or prio, it worked well.
> Do you know why the assert was triggered?
> 
> Thanks very much.
> 
> 
> 
> 
> 2017-06-30 2:03 GMT+08:00 J. Eric Ivancich <ivancich@redhat.com>:
>> On 06/28/2017 02:55 PM, sheng qiu wrote:
>>> May I ask whether you have any plan to implement the client-side logic for a
>>> true "D"mclock? Right now it seems to be mclock on each individual OSD.
>>> And each client also has a common iops config. We are planning to
>>> work on that part and integrate it with your current work.
>> 
>> That is not a high priority in the short-term. Our main goal with
>> integrating dmclock/mclock was to better manage priorities among
>> operation classes.
>> 
>> Developers at SK Telecom have done some work towards this, though. For
>> example, see here:
>> 
>>    https://github.com/ivancich/ceph/pull/1
>> 
>> Eric



* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-06-28 18:33           ` J. Eric Ivancich
  2017-06-28 18:55             ` sheng qiu
@ 2017-07-11 18:14             ` sheng qiu
  2017-07-27 20:24               ` J. Eric Ivancich
  1 sibling, 1 reply; 15+ messages in thread
From: sheng qiu @ 2017-07-11 18:14 UTC (permalink / raw)
  To: J. Eric Ivancich; +Cc: Ceph Development

Hi Eric,

We are trying to evaluate dmclock's effect on controlling the recovery
traffic in order to reduce its impact on client IO.
However, we are experiencing some problems and didn't get the results we expected.

We set up a small cluster with several OSD machines. In our
configuration, we set recovery limit = 0.001 or even smaller, with
res=0.0 and wgt=1.0.
We set client res = 20k or even higher, with limit=0.0 and wgt=500.

Then we killed an OSD while running fio on the client side and brought it back to
trigger recovery. We saw fio IOPS still drop a lot compared to
not using the dmclock queue. We did some debugging and saw that when
recovery is active, fio requests are enqueued much less frequently than
before.
Overall, it seems dmclock's configuration for the recovery part does not make
any difference. Since the enqueue rate of fio requests is reduced,
when dmclock tries to dequeue a request, there's less chance to pull a
fio request.

Can you give some comments on this?

Thanks,
Sheng





On Wed, Jun 28, 2017 at 11:33 AM, J. Eric Ivancich <ivancich@redhat.com> wrote:
> On 06/27/2017 05:21 PM, sheng qiu wrote:
>> I appreciate your kind reply.
>>
>> In our test, we set the following in the ceph.conf:
>>
>> osd_op_queue = mclock_client
>> osd_op_queue_cut_off = high
>> osd_op_queue_mclock_client_op_lim = 100.0
>> osd_op_queue_mclock_client_op_res = 50.0
>> osd_op_num_shards = 1
>> osd_op_num_threads_per_shard = 1
>>
>>
>> In this setup, all IO requests should go to one mclock_client queue
>> and use mclock scheduling (osd_op_queue_cut_off = high).
>> We use fio for the test, with job=1, bs=4k, and qd=1 or 16.
>>
>> We expect the IOPS visible to fio to be < 100, but we see a
>> much higher value.
>> Did we understand your work correctly, or did we miss anything?
>
> Hi Sheng,
>
> I think you understand things well, but there is one additional detail
> you may not have noticed yet. And that is what should be done when all
> clients have reached their limit momentarily and the ObjectStore would
> like another op to keep itself busy? We either a) refuse to provide it
> with an op, or b) give it the op that's most appropriate by
> weight. The ceph code currently is not designed to handle a) and it's
> not even clear that we should starve the ObjectStore in that manner. So
> we do b), and that means we can exceed the limit.
>
> dmclock's PullPriorityQueue constructors have a parameter
> _allow_limit_break, which ceph sets to true. That is how we do b) above.
> If you ever wanted to set that to false you'd need to make other changes
> to the ObjectStore ceph code to handle cases where the op queue is not
> empty but is not ready/willing to return an op when one is requested.
>
> Eric


* Re: How best to integrate dmClock QoS library into ceph codebase
  2017-07-11 18:14             ` sheng qiu
@ 2017-07-27 20:24               ` J. Eric Ivancich
  0 siblings, 0 replies; 15+ messages in thread
From: J. Eric Ivancich @ 2017-07-27 20:24 UTC (permalink / raw)
  To: sheng qiu; +Cc: Ceph Development

Hi Sheng,

I’ll interleave responses below.

> On Jul 11, 2017, at 2:14 PM, sheng qiu <herbert1984106@gmail.com> wrote:
> We are trying to evaluate dmclock's effect on controlling the recovery
> traffic in order to reduce its impact on client IO.
> However, we are experiencing some problems and didn't get the results we expected.
> 
> We set up a small cluster with several OSD machines. In our
> configuration, we set recovery limit = 0.001 or even smaller, with
> res=0.0 and wgt=1.0.
> We set client res = 20k or even higher, with limit=0.0 and wgt=500.

As presently implemented, limits are not enforced. There is a PR that makes modifications to enforce them (https://github.com/ceph/ceph/pull/16242), which I’m still evaluating. You can find some discussion of the issue here: http://marc.info/?l=ceph-devel&m=149867479701646&w=2.

> Then we killed an OSD while running fio on the client side and brought it back to
> trigger recovery. We saw fio IOPS still drop a lot compared to
> not using the dmclock queue. We did some debugging and saw that when
> recovery is active, fio requests are enqueued much less frequently than
> before.

Are you saying that fio requests from the client slow down? I assume you’re using the fio tool. If so, what is max-jobs set to?

Also, are you saying that fio iops had lower values with mclock compared with the weighted priority queue (“wpq”)?

> Overall, it seems dmclock's configuration for the recovery part does not make
> any difference. Since the enqueue rate of fio requests is reduced,
> when dmclock tries to dequeue a request, there's less chance to pull a
> fio request.

Theoretically, at least, with a higher reservation value, the request tags should have smaller reservation tags, which should bias mclock to dequeueing them. So I’d like to know more about your experiment.
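
(For reference, the reservation tag rule from the mClock paper is roughly

    R^k = max(R^{k-1} + 1/r, now)

for a client's k-th request with reservation r, so a larger r packs the tags closer together and they come due sooner, which is what should pull those requests ahead in the queue.)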

Thank you,

Eric

