* False waker detection in BFQ
@ 2021-05-05 16:20 Jan Kara
  2021-05-20 15:05 ` Paolo Valente
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2021-05-05 16:20 UTC (permalink / raw)
  To: Paolo Valente; +Cc: linux-block

Hi Paolo!

I have two processes doing direct IO writes like:

dd if=/dev/zero of=/mnt/file$i bs=128k oflag=direct count=4000M

Each of these processes belongs to a different cgroup with a different
bfq.weight. I was looking into why these processes do not split bandwidth
according to the BFQ weights. Or rather, the bandwidth is split according
to the weights initially but eventually degrades into a 50/50 split. After
some debugging I've found out that, by bad luck, one of the processes gets
detected as a waker of the other process and at that point we lose
isolation between the two cgroups. This pretty reliably happens at some
point during the run of these two processes on my test VM. So can we tweak
the waker logic to reduce the chances of false positives? Essentially, when
there are only two processes doing heavy IO against the device, the logic
in bfq_check_waker() is such that they are very likely to eventually become
wakers of one another. AFAICT the only condition that needs to be fulfilled
is that each submits IO within 4 ms of the completion of the other
process's IO, three times.
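
To make that concrete, here is a toy model of the heuristic as I read it
(illustration only - the real logic lives in bfq_check_waker() and uses
different names and a few extra conditions):

#include <stdint.h>

#define NSEC_PER_MSEC 1000000ULL

struct toy_bfqq {				/* stands in for struct bfq_queue */
	struct toy_bfqq *waker;			/* detected waker, if any */
	struct toy_bfqq *tentative_waker;	/* current candidate waker */
	int num_detections;			/* consecutive hits for the candidate */
};

struct toy_bfqd {				/* stands in for struct bfq_data */
	struct toy_bfqq *last_completed_rq_bfqq;	/* queue whose IO completed last */
	uint64_t last_completion_ns;			/* time of that completion */
};

/* Called when @bfqq submits new IO at @now_ns. */
static void toy_check_waker(struct toy_bfqd *bfqd, struct toy_bfqq *bfqq,
			    uint64_t now_ns)
{
	struct toy_bfqq *cand = bfqd->last_completed_rq_bfqq;

	/* Only a queue whose IO completed within the last 4 ms qualifies,
	 * and a queue cannot be its own waker. */
	if (!cand || cand == bfqq ||
	    now_ns - bfqd->last_completion_ns >= 4 * NSEC_PER_MSEC)
		return;

	if (cand != bfqq->tentative_waker) {
		/* new candidate: restart counting */
		bfqq->tentative_waker = cand;
		bfqq->num_detections = 1;
	} else if (++bfqq->num_detections == 3) {
		/* third hit -> cand becomes the waker of bfqq */
		bfqq->waker = cand;
		bfqq->tentative_waker = NULL;
	}
}

With two dd processes hammering the same device, each one's submissions
almost always land within 4 ms of the other's completions, so sooner or
later the counter reaches three.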

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: False waker detection in BFQ
  2021-05-05 16:20 False waker detection in BFQ Jan Kara
@ 2021-05-20 15:05 ` Paolo Valente
  2021-05-21 13:10   ` Jan Kara
  2021-08-13 14:01   ` Jan Kara
  0 siblings, 2 replies; 9+ messages in thread
From: Paolo Valente @ 2021-05-20 15:05 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-block



> On 5 May 2021, at 18:20, Jan Kara <jack@suse.cz> wrote:
> 
> Hi Paolo!
> 
> I have two processes doing direct IO writes like:
> 
> dd if=/dev/zero of=/mnt/file$i bs=128k oflag=direct count=4000M
> 
> Each of these processes belongs to a different cgroup with a different
> bfq.weight. I was looking into why these processes do not split bandwidth
> according to the BFQ weights. Or rather, the bandwidth is split according
> to the weights initially but eventually degrades into a 50/50 split. After
> some debugging I've found out that, by bad luck, one of the processes gets
> detected as a waker of the other process and at that point we lose
> isolation between the two cgroups. This pretty reliably happens at some
> point during the run of these two processes on my test VM. So can we tweak
> the waker logic to reduce the chances of false positives? Essentially, when
> there are only two processes doing heavy IO against the device, the logic
> in bfq_check_waker() is such that they are very likely to eventually become
> wakers of one another. AFAICT the only condition that needs to be fulfilled
> is that each submits IO within 4 ms of the completion of the other
> process's IO, three times.
> 

Hi Jan!
as I happened to tell you months ago, I feared that some corner case like
this was likely to show up eventually.  Actually, I was even more
pessimistic than reality proved to be :)

I'm sorry for my delay, but I've had to think about this issue for a
while.  Being too strict would easily rule out journald as a waker for
processes belonging to a different group.

So, what do you think of this proposal: add the extra filter that a
waker must belong to the same group as the woken, or, at most, to the
root group?
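
In code it would be little more than an extra bail-out in bfq_check_waker(),
something like this (just a sketch, not a patch; candidate_bfqq stands for
the queue being evaluated as a possible waker, while bfqq_group() is the
existing helper returning the bfq_group a queue belongs to):

	/* Sketch of the proposed filter: refuse cross-group wakers unless
	 * the candidate sits in the root group. */
	if (bfqq_group(candidate_bfqq) != bfqq_group(bfqq) &&
	    bfqq_group(candidate_bfqq) != bfqd->root_group)
		return;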

Thanks,
Paolo

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: False waker detection in BFQ
  2021-05-20 15:05 ` Paolo Valente
@ 2021-05-21 13:10   ` Jan Kara
  2021-08-13 14:01   ` Jan Kara
  1 sibling, 0 replies; 9+ messages in thread
From: Jan Kara @ 2021-05-21 13:10 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jan Kara, linux-block

On Thu 20-05-21 17:05:45, Paolo Valente wrote:
> > On 5 May 2021, at 18:20, Jan Kara <jack@suse.cz> wrote:
> > 
> > Hi Paolo!
> > 
> > I have two processes doing direct IO writes like:
> > 
> > dd if=/dev/zero of=/mnt/file$i bs=128k oflag=direct count=4000M
> > 
> > Each of these processes belongs to a different cgroup with a different
> > bfq.weight. I was looking into why these processes do not split bandwidth
> > according to the BFQ weights. Or rather, the bandwidth is split according
> > to the weights initially but eventually degrades into a 50/50 split. After
> > some debugging I've found out that, by bad luck, one of the processes gets
> > detected as a waker of the other process and at that point we lose
> > isolation between the two cgroups. This pretty reliably happens at some
> > point during the run of these two processes on my test VM. So can we tweak
> > the waker logic to reduce the chances of false positives? Essentially, when
> > there are only two processes doing heavy IO against the device, the logic
> > in bfq_check_waker() is such that they are very likely to eventually become
> > wakers of one another. AFAICT the only condition that needs to be fulfilled
> > is that each submits IO within 4 ms of the completion of the other
> > process's IO, three times.
>
> as I happened to tell you months ago, I feared that some corner case like
> this was likely to show up eventually.  Actually, I was even more
> pessimistic than reality proved to be :)

:)

> I'm sorry for my delay, but I've had to think about this issue for a
> while.  Being too strict would easily rule out journald as a waker for
> processes belonging to a different group.
> 
> So, what do you think of this proposal: add the extra filter that a
> waker must belong to the same group as the woken, or, at most, to the
> root group?

I thought you would suggest that :) Well, I'd probably allow a waker-wakee
relationship if the two cgroups are in an 'ancestor' - 'descendant'
relationship, not necessarily only the root cgroup vs. some other cgroup.
That being said, in my opinion this is just a poor man's band-aid that fixes
this particular setup. It will not fix e.g. a similar problem when those two
processes are in the same cgroup but have, say, different IO priorities.
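
For the cgroup part I mean roughly the following (pseudo-code only;
cgroup_of() is not a real helper, getting from a bfq_group to its cgroup is
hand-waved here):

	/* Allow the waker-wakee link only if one cgroup is an ancestor of
	 * the other (or they are the same group). */
	struct cgroup *wg = cgroup_of(bfqq_group(candidate_bfqq));
	struct cgroup *qg = cgroup_of(bfqq_group(bfqq));

	if (!cgroup_is_descendant(wg, qg) && !cgroup_is_descendant(qg, wg))
		return;		/* unrelated groups: don't set up a waker */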

The question is how we could do better. But so far I have no great idea
either.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: False waker detection in BFQ
  2021-05-20 15:05 ` Paolo Valente
  2021-05-21 13:10   ` Jan Kara
@ 2021-08-13 14:01   ` Jan Kara
  2021-08-23 13:58     ` Paolo Valente
  1 sibling, 1 reply; 9+ messages in thread
From: Jan Kara @ 2021-08-13 14:01 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jan Kara, linux-block

Hi Paolo!

On Thu 20-05-21 17:05:45, Paolo Valente wrote:
> > On 5 May 2021, at 18:20, Jan Kara <jack@suse.cz> wrote:
> > 
> > Hi Paolo!
> > 
> > I have two processes doing direct IO writes like:
> > 
> > dd if=/dev/zero of=/mnt/file$i bs=128k oflag=direct count=4000M
> > 
> > Each of these processes belongs to a different cgroup with a different
> > bfq.weight. I was looking into why these processes do not split bandwidth
> > according to the BFQ weights. Or rather, the bandwidth is split according
> > to the weights initially but eventually degrades into a 50/50 split. After
> > some debugging I've found out that, by bad luck, one of the processes gets
> > detected as a waker of the other process and at that point we lose
> > isolation between the two cgroups. This pretty reliably happens at some
> > point during the run of these two processes on my test VM. So can we tweak
> > the waker logic to reduce the chances of false positives? Essentially, when
> > there are only two processes doing heavy IO against the device, the logic
> > in bfq_check_waker() is such that they are very likely to eventually become
> > wakers of one another. AFAICT the only condition that needs to be fulfilled
> > is that each submits IO within 4 ms of the completion of the other
> > process's IO, three times.
> > 
> 
> Hi Jan!
> as I happened to tell you months ago, I feared that some corner case like
> this was likely to show up eventually.  Actually, I was even more
> pessimistic than reality proved to be :)
> 
> I'm sorry for my delay, but I've had to think about this issue for a
> while.  Being too strict would easily rule out journald as a waker for
> processes belonging to a different group.
> 
> So, what do you think of this proposal: add the extra filter that a
> waker must belong to the same group as the woken, or, at most, to the
> root group?

Returning to this :). I've been debugging other BFQ problems where IO
priorities do not really lead to service differentiation (mostly because of
scheduler tag exhaustion, false waker detection, and the way we inject IO
for a waker) and as a result I have come up with a couple of patches that
also address this issue as a side effect - I've added an upper time limit
(128*slice_idle) on detecting the "third cooperation", and that mostly got
rid of these false waker detections. We could fail to detect waker-wakee
processes if they do not cooperate frequently, but then the value of the
detection is small and the lack of isolation may do more harm than good
anyway.
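
The idea is simply to time-limit the detection window, roughly like this
(just a sketch of the idea, not the actual patch, and the field names are
made up; 'cand' is the queue whose IO just completed, 'now_ns' the time the
new IO is submitted):

	/* A tentative waker is forgotten unless all three detections
	 * happen within 128 * slice_idle of the first one. */
	if (cand != bfqq->tentative_waker ||
	    now_ns > bfqq->detection_started_ns + 128 * slice_idle_ns) {
		/* new candidate, or the window expired: restart counting */
		bfqq->tentative_waker = cand;
		bfqq->num_detections = 1;
		bfqq->detection_started_ns = now_ns;
	} else if (++bfqq->num_detections == 3) {
		bfqq->waker = cand;
		bfqq->tentative_waker = NULL;
	}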

Currently I'm running a wider set of benchmarks for the patches to see
whether I've regressed anything else. If not, I'll post the patches to
the list.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: False waker detection in BFQ
  2021-08-13 14:01   ` Jan Kara
@ 2021-08-23 13:58     ` Paolo Valente
  2021-08-23 16:06       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Valente @ 2021-08-23 13:58 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-block



> On 13 Aug 2021, at 16:01, Jan Kara <jack@suse.cz> wrote:
> 
> Hi Paolo!
> 
> On Thu 20-05-21 17:05:45, Paolo Valente wrote:
>>> On 5 May 2021, at 18:20, Jan Kara <jack@suse.cz> wrote:
>>> 
>>> Hi Paolo!
>>> 
>>> I have two processes doing direct IO writes like:
>>> 
>>> dd if=/dev/zero of=/mnt/file$i bs=128k oflag=direct count=4000M
>>> 
>>> Each of these processes belongs to a different cgroup with a different
>>> bfq.weight. I was looking into why these processes do not split bandwidth
>>> according to the BFQ weights. Or rather, the bandwidth is split according
>>> to the weights initially but eventually degrades into a 50/50 split. After
>>> some debugging I've found out that, by bad luck, one of the processes gets
>>> detected as a waker of the other process and at that point we lose
>>> isolation between the two cgroups. This pretty reliably happens at some
>>> point during the run of these two processes on my test VM. So can we tweak
>>> the waker logic to reduce the chances of false positives? Essentially, when
>>> there are only two processes doing heavy IO against the device, the logic
>>> in bfq_check_waker() is such that they are very likely to eventually become
>>> wakers of one another. AFAICT the only condition that needs to be fulfilled
>>> is that each submits IO within 4 ms of the completion of the other
>>> process's IO, three times.
>>> 
>> 
>> Hi Jan!
>> as I happened to tell you months ago, I feared that some corner case like
>> this was likely to show up eventually.  Actually, I was even more
>> pessimistic than reality proved to be :)
>> 
>> I'm sorry for my delay, but I've had to think about this issue for a
>> while.  Being too strict would easily rule out journald as a waker for
>> processes belonging to a different group.
>> 
>> So, what do you think of this proposal: add the extra filter that a
>> waker must belong to the same group as the woken, or, at most, to the
>> root group?
> 
> Returning to this :). I've been debugging other BFQ problems where IO
> priorities do not really lead to service differentiation (mostly because
> of scheduler tag exhaustion, false waker detection, and the way we inject
> IO for a waker) and as a result I have come up with a couple of patches
> that also address this issue as a side effect - I've added an upper time
> limit (128*slice_idle) on detecting the "third cooperation", and that
> mostly got rid of these false waker detections.

Great!

> We could fail to detect waker-wakee
> processes if they do not cooperate frequently, but then the value of the
> detection is small and the lack of isolation may do more harm than good
> anyway.
> 

IIRC, dbench was our best benchmark for checking whether the detection is
(still) effective.


> Currently I'm running a wider set of benchmarks for the patches to see
> whether I've regressed anything else. If not, I'll post the patches to
> the list.
> 

Any news?

Thanks,
Paolo

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: False waker detection in BFQ
  2021-08-23 13:58     ` Paolo Valente
@ 2021-08-23 16:06       ` Jan Kara
  2021-08-25 16:43         ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2021-08-23 16:06 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jan Kara, linux-block

On Mon 23-08-21 15:58:25, Paolo Valente wrote:
> > Currently I'm running a wider set of benchmarks for the patches to see
> > whether I've regressed anything else. If not, I'll post the patches to
> > the list.
> 
> Any news?

It took a while for all those benchmarks to run. Overall the results look
sane; I'm just verifying by hand now whether some of the localized
regressions (usually specific to a particular fs+machine config) are due to
measurement noise or are real regressions...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: False waker detection in BFQ
  2021-08-23 16:06       ` Jan Kara
@ 2021-08-25 16:43         ` Jan Kara
  2021-08-26  9:45           ` Paolo Valente
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2021-08-25 16:43 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jan Kara, linux-block

On Mon 23-08-21 18:06:18, Jan Kara wrote:
> On Mon 23-08-21 15:58:25, Paolo Valente wrote:
> > > Currently I'm running a wider set of benchmarks for the patches to see
> > > whether I've regressed anything else. If not, I'll post the patches to
> > > the list.
> > 
> > Any news?
> 
> It took a while for all those benchmarks to run. Overall the results look
> sane; I'm just verifying by hand now whether some of the localized
> regressions (usually specific to a particular fs+machine config) are due to
> measurement noise or are real regressions...

OK, so after some manual analysis I've found out that dbench indeed becomes
noisier with my changes for high numbers of processes. I'm leaving for
vacation soon so I will probably not be able to debug it before I leave,
but let me ask you one thing: the problematic change seems to be mostly a
revert of 7cc4ffc55564 ("block, bfq: put reqs of waker and woken in
dispatch list") and I'm currently puzzled why it has such an effect. What
I've found out is that 7cc4ffc55564 results in the IO of the higher-priority
process being injected into the time slice of the lower-priority process,
and thus there's always only a single busy queue (that of the lower-priority
process) and the queue of the higher-priority process never gets scheduled.
As a result higher-priority IO always competes with lower-priority IO and
there's no service differentiation (we get a 50/50 split of throughput
between the processes despite different IO priorities).  And this scenario
shows that always injecting waker/wakee IO isn't desirable, especially in
the way it is done in 7cc4ffc55564, where injected IO isn't accounted
within BFQ at all (which easily allows service degradation to go unnoticed
by BFQ).  That's why I've basically reverted that commit, on the grounds
that on the next dispatch we call bfq_select_queue(), which will see that
the waker/wakee has IO to do and can decide to inject the IO anyway. We do
more CPU work but the IO pattern should be similar. But apparently I was
wrong :) I just wanted to bounce this off of you in case you have any
suggestion what to look for or any tips on why 7cc4ffc55564 apparently
achieves much more reliable request injection for dbench.
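
To spell out what I mean by the direct injection: conceptually 7cc4ffc55564
does something like the following at insertion time (a sketch from memory,
not the actual diff), while my revert always queues the request on its bfqq
and leaves injection decisions to bfq_select_queue() on the next dispatch:

	struct bfq_queue *in_serv = bfqd->in_service_queue;

	if (in_serv &&
	    (bfqq->waker_bfqq == in_serv || in_serv->waker_bfqq == bfqq)) {
		/*
		 * Bypass bfqq: the request goes straight onto the device
		 * dispatch list, gets served within in_serv's slice and is
		 * never accounted to bfqq's own service.
		 */
		list_add_tail(&rq->queuelist, &bfqd->dispatch);
		return;
	}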
								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: False waker detection in BFQ
  2021-08-25 16:43         ` Jan Kara
@ 2021-08-26  9:45           ` Paolo Valente
  2021-08-26 17:51             ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Paolo Valente @ 2021-08-26  9:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-block



> On 25 Aug 2021, at 18:43, Jan Kara <jack@suse.cz> wrote:
> 
> On Mon 23-08-21 18:06:18, Jan Kara wrote:
>> On Mon 23-08-21 15:58:25, Paolo Valente wrote:
>>>> Currently I'm running a wider set of benchmarks for the patches to see
>>>> whether I've regressed anything else. If not, I'll post the patches to
>>>> the list.
>>> 
>>> Any news?
>> 
>> It took a while for all those benchmarks to run. Overall the results look
>> sane; I'm just verifying by hand now whether some of the localized
>> regressions (usually specific to a particular fs+machine config) are due
>> to measurement noise or are real regressions...
> 
> OK, so after some manual analysis I've found out that dbench indeed
> becomes noisier with my changes for high numbers of processes. I'm
> leaving for vacation soon so I will probably not be able to debug it
> before I leave, but let me ask you one thing: the problematic change
> seems to be mostly a revert of 7cc4ffc55564 ("block, bfq: put reqs of
> waker and woken in dispatch list") and I'm currently puzzled why it has
> such an effect. What I've found out is that 7cc4ffc55564 results in the
> IO of the higher-priority process being injected into the time slice of
> the lower-priority process, and thus there's always only a single busy
> queue (that of the lower-priority process) and the queue of the
> higher-priority process never gets scheduled. As a result higher-priority
> IO always competes with lower-priority IO and there's no service
> differentiation (we get a 50/50 split of throughput between the processes
> despite different IO priorities).

I need a little help here.  Since high-priority I/O is immediately
injected, I wonder why it does not receive all the bandwidth it
demands.  Maybe, from your analysis, you have an answer.  Perhaps it
happens because:
1) high-priority I/O is FIFO-queued with lower-priority I/O in the
   dispatch list?
or
2) immediate injection prevents idling from being performed in favor
   of high-priority I/O?


> And this scenario shows that always injecting waker/wakee IO isn't
> desirable, especially in the way it is done in 7cc4ffc55564, where
> injected IO isn't accounted within BFQ at all (which easily allows
> service degradation to go unnoticed by BFQ).

Not sure that accounting would help high-priority I/O in your scenario.

> That's why I've basically reverted that commit, on the grounds that on
> the next dispatch we call bfq_select_queue(), which will see that the
> waker/wakee has IO to do and can decide to inject the IO anyway. We do
> more CPU work but the IO pattern should be similar. But apparently I was
> wrong :)

For the pattern to be similar, I guess that, when new high-priority
I/O arrives, this I/O should preempt lower-priority I/O.
Unfortunately, this is not always the case, depending on other
parameters.  Waker/wakee I/O is guaranteed to be injected only when the
in-service queue has no I/O.

At any rate, probably we can solve this puzzle by just analyzing a
trace in which you detect a loss of throughput without 7cc4ffc55564.
The best case would be one with the minimum possible number of
threads, to get a simpler trace.

> I just wanted to bounce this off of you in case you have any suggestion
> what to look for or any tips on why 7cc4ffc55564 apparently achieves much
> more reliable request injection for dbench.

I hope my considerations above help a little bit.

Thanks,
Paolo

> 								Honza
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR



* Re: False waker detection in BFQ
  2021-08-26  9:45           ` Paolo Valente
@ 2021-08-26 17:51             ` Jan Kara
  0 siblings, 0 replies; 9+ messages in thread
From: Jan Kara @ 2021-08-26 17:51 UTC (permalink / raw)
  To: Paolo Valente; +Cc: Jan Kara, linux-block

On Thu 26-08-21 11:45:17, Paolo Valente wrote:
> 
> 
> > On 25 Aug 2021, at 18:43, Jan Kara <jack@suse.cz> wrote:
> > 
> > On Mon 23-08-21 18:06:18, Jan Kara wrote:
> >> On Mon 23-08-21 15:58:25, Paolo Valente wrote:
> >>>> Currently I'm running a wider set of benchmarks for the patches to see
> >>>> whether I've regressed anything else. If not, I'll post the patches to
> >>>> the list.
> >>> 
> >>> Any news?
> >> 
> >> It took a while for all those benchmarks to run. Overall the results
> >> look sane; I'm just verifying by hand now whether some of the localized
> >> regressions (usually specific to a particular fs+machine config) are due
> >> to measurement noise or are real regressions...
> > 
> > OK, so after some manual analysis I've found out that dbench indeed
> > becomes noisier with my changes for high numbers of processes. I'm
> > leaving for vacation soon so I will probably not be able to debug it
> > before I leave, but let me ask you one thing: the problematic change
> > seems to be mostly a revert of 7cc4ffc55564 ("block, bfq: put reqs of
> > waker and woken in dispatch list") and I'm currently puzzled why it has
> > such an effect. What I've found out is that 7cc4ffc55564 results in the
> > IO of the higher-priority process being injected into the time slice of
> > the lower-priority process, and thus there's always only a single busy
> > queue (that of the lower-priority process) and the queue of the
> > higher-priority process never gets scheduled. As a result higher-priority
> > IO always competes with lower-priority IO and there's no service
> > differentiation (we get a 50/50 split of throughput between the processes
> > despite different IO priorities).
> 
> I need a little help here.  Since high-priority I/O is immediately
> injected, I wonder why it does not receive all the bandwidth it
> demands.  Maybe, from your analysis, you have an answer.  Perhaps it
> happens because:
> 1) high-priority I/O is FIFO-queued with lower-priority I/O in the
>    dispatch list?

Yes, this is the case.

> > And this scenario shows that always injecting waker/wakee IO isn't
> > desirable, especially in the way it is done in 7cc4ffc55564, where
> > injected IO isn't accounted within BFQ at all (which easily allows
> > service degradation to go unnoticed by BFQ).
> 
> Not sure that accounting would help high-priority I/O in your scenario.

It did help noticeably, because then both the high- and low-priority bfq
queues become busy, so bfq_select_queue() sees both queues and schedules
the higher-priority one.

> > That's why I've basically reverted that commit, on the grounds that on
> > the next dispatch we call bfq_select_queue(), which will see that the
> > waker/wakee has IO to do and can decide to inject the IO anyway. We do
> > more CPU work but the IO pattern should be similar. But apparently I was
> > wrong :)
> 
> For the pattern to be similar, I guess that, when new high-priority
> I/O arrives, this I/O should preempt lower-priority I/O.
> Unfortunately, this is not always the case, depending on other
> parameters.  Waker/wakee I/O is guaranteed to be injected only when the
> in-service queue has no I/O.
> 
> At any rate, probably we can solve this puzzle by just analyzing a
> trace in which you detect a loss of throughput without 7cc4ffc55564.
> The best case would be one with the minimum possible number of
> threads, to get a simpler trace.

Yeah, OK, I'll gather the trace once I return from vacation and look into
it. Thanks for the help!

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


