linux-block.vger.kernel.org archive mirror
* CFQ idling kills I/O performance on ext4 with blkio cgroup controller
@ 2019-05-17 22:16 Srivatsa S. Bhat
  2019-05-18 18:39 ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-17 22:16 UTC (permalink / raw)
  To: linux-fsdevel, linux-block, linux-ext4, cgroups, linux-kernel
  Cc: axboe, paolo.valente, jack, jmoyer, tytso, amakhalov, anishs,
	srivatsab, Srivatsa S. Bhat


Hi,

One of my colleagues noticed up to a 10x - 30x drop in I/O throughput
when running the following command with the CFQ I/O scheduler:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

Throughput with CFQ: 60 KB/s
Throughput with noop or deadline: 1.5 MB/s - 2 MB/s

I spent some time looking into it and found that this is caused by an
undesirable interaction between 4 different components:

- blkio cgroup controller enabled
- ext4 with the jbd2 kthread running in the root blkio cgroup
- dd running on ext4, in any other blkio cgroup than that of jbd2
- CFQ I/O scheduler with defaults for slice_idle and group_idle


When docker is enabled, systemd creates a blkio cgroup called
system.slice to run system services (and docker) under it, and a
separate blkio cgroup called user.slice for user processes. So, when
dd is invoked, it runs under user.slice.
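
(A quick way to confirm where dd and jbd2 actually run, assuming cgroup
v1 and that pgrep matches the intended processes, is something like:

grep blkio /proc/$(pgrep -o jbd2)/cgroup
grep blkio /proc/$(pgrep -ox dd)/cgroup    # while the dd is running

The first should report the root blkio cgroup ("/"), the second a path
under user.slice.)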

The dd command above includes oflag=dsync, which makes every write to
the output file synchronous (the file is opened with O_DSYNC, so each
write effectively behaves like a write followed by an fdatasync). Since
dd is writing to a file on ext4, jbd2 will be active, committing
transactions corresponding to those synchronous writes from dd. (In
other words, dd depends on jbd2 in order to make forward progress.)
But jbd2, being a kernel thread, runs in the root blkio cgroup, as
opposed to dd, which runs under user.slice.
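
If useful, this is easy to double-check with strace (count reduced here
just for brevity): the output file should show up as opened with
O_DSYNC, followed by one write() per 512-byte block, each returning
only once the data has reached stable storage. For example:

strace -e trace=open,openat,write dd if=/dev/zero of=/root/test.img bs=512 count=10 oflag=dsync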

Now, if the I/O scheduler in use for the underlying block device is
CFQ, then its inter-queue/inter-group idling takes effect (via the
slice_idle and group_idle parameters, both of which default to 8ms).
Therefore, every time CFQ switches between processing requests from dd
and jbd2, this 8ms idle time is injected, which slows down the overall
throughput tremendously!
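
As a rough back-of-the-envelope check: if every 512-byte synchronous
write ends up paying for one or two of those 8ms idle windows, the
expected throughput is on the order of 512 B / 8-16 ms, i.e. roughly
30-60 KB/s, which is right around the ~60 KB/s observed with CFQ.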

To verify this theory, I tried various experiments, and in all cases,
the 4 pre-conditions mentioned above were necessary to reproduce this
performance drop. For example, if I used an XFS filesystem (which
doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
directly to a block device, I couldn't reproduce the performance
issue. Similarly, running dd in the root blkio cgroup (where jbd2
runs) also gets full performance; as does using the noop or deadline
I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
to zero.
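
For reference, the last workaround amounts to the following (sdX stands
in for the actual disk, and the iosched directory is only present while
CFQ is the active scheduler for that device):

echo 0 > /sys/block/sdX/queue/iosched/slice_idle
echo 0 > /sys/block/sdX/queue/iosched/group_idle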

These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
both with virtualized storage as well as with disk pass-through,
backed by a rotational hard disk in both cases. The same problem was
also seen with the BFQ I/O scheduler in kernel v5.1.

Searching for any earlier discussions of this problem, I found an old
thread on LKML that encountered this behavior [1], as well as a docker
github issue [2] with similar symptoms (mentioned later in the
thread).

So, I'm curious to know if this is a well-understood problem and if
anybody has any thoughts on how to fix it.

Thank you very much!


[1]. https://lkml.org/lkml/2015/11/19/359

[2]. https://github.com/moby/moby/issues/21485
     https://github.com/moby/moby/issues/21485#issuecomment-222941103

Regards,
Srivatsa


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-17 22:16 CFQ idling kills I/O performance on ext4 with blkio cgroup controller Srivatsa S. Bhat
@ 2019-05-18 18:39 ` Paolo Valente
  2019-05-18 19:28   ` Theodore Ts'o
  2019-05-18 20:50   ` Srivatsa S. Bhat
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-18 18:39 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, linux-kernel,
	axboe, jack, jmoyer, tytso, amakhalov, anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 3579 bytes --]

I've addressed these issues in my last batch of improvements for BFQ,
which landed in the upcoming 5.2. If you give it a try, and still see
the problem, then I'll be glad to reproduce it, and hopefully fix it
for you.

Thanks,
Paolo

> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> 
> Hi,
> 
> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
> running the following command, with the CFQ I/O scheduler:
> 
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
> 
> Throughput with CFQ: 60 KB/s
> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
> 
> I spent some time looking into it and found that this is caused by the
> undesirable interaction between 4 different components:
> 
> - blkio cgroup controller enabled
> - ext4 with the jbd2 kthread running in the root blkio cgroup
> - dd running on ext4, in any other blkio cgroup than that of jbd2
> - CFQ I/O scheduler with defaults for slice_idle and group_idle
> 
> 
> When docker is enabled, systemd creates a blkio cgroup called
> system.slice to run system services (and docker) under it, and a
> separate blkio cgroup called user.slice for user processes. So, when
> dd is invoked, it runs under user.slice.
> 
> The dd command above includes the dsync flag, which performs an
> fdatasync after every write to the output file. Since dd is writing to
> a file on ext4, jbd2 will be active, committing transactions
> corresponding to those fdatasync requests from dd. (In other words, dd
> depends on jdb2, in order to make forward progress). But jdb2 being a
> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
> runs under user.slice.
> 
> Now, if the I/O scheduler in use for the underlying block device is
> CFQ, then its inter-queue/inter-group idling takes effect (via the
> slice_idle and group_idle parameters, both of which default to 8ms).
> Therefore, everytime CFQ switches between processing requests from dd
> vs jbd2, this 8ms idle time is injected, which slows down the overall
> throughput tremendously!
> 
> To verify this theory, I tried various experiments, and in all cases,
> the 4 pre-conditions mentioned above were necessary to reproduce this
> performance drop. For example, if I used an XFS filesystem (which
> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
> directly to a block device, I couldn't reproduce the performance
> issue. Similarly, running dd in the root blkio cgroup (where jbd2
> runs) also gets full performance; as does using the noop or deadline
> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
> to zero.
> 
> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
> both with virtualized storage as well as with disk pass-through,
> backed by a rotational hard disk in both cases. The same problem was
> also seen with the BFQ I/O scheduler in kernel v5.1.
> 
> Searching for any earlier discussions of this problem, I found an old
> thread on LKML that encountered this behavior [1], as well as a docker
> github issue [2] with similar symptoms (mentioned later in the
> thread).
> 
> So, I'm curious to know if this is a well-understood problem and if
> anybody has any thoughts on how to fix it.
> 
> Thank you very much!
> 
> 
> [1]. https://lkml.org/lkml/2015/11/19/359
> 
> [2]. https://github.com/moby/moby/issues/21485
>     https://github.com/moby/moby/issues/21485#issuecomment-222941103
> 
> Regards,
> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-18 18:39 ` Paolo Valente
@ 2019-05-18 19:28   ` Theodore Ts'o
  2019-05-20  9:15     ` Jan Kara
  2019-05-20 10:38     ` Paolo Valente
  2019-05-18 20:50   ` Srivatsa S. Bhat
  1 sibling, 2 replies; 52+ messages in thread
From: Theodore Ts'o @ 2019-05-18 19:28 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Srivatsa S. Bhat, linux-fsdevel, linux-block, linux-ext4,
	cgroups, linux-kernel, axboe, jack, jmoyer, amakhalov, anishs,
	srivatsab

On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> I've addressed these issues in my last batch of improvements for
> BFQ, which landed in the upcoming 5.2. If you give it a try, and
> still see the problem, then I'll be glad to reproduce it, and
> hopefully fix it for you.

Hi Paolo, I'm curious if you could give a quick summary about what you
changed in BFQ?

I was considering adding support so that if userspace calls fsync(2)
or fdatasync(2), we attach the process's CSS to the transaction, and
then charge all of the journal metadata writes to that process's CSS.
If there are multiple fsync's batched into the transaction, the first
process which forced the early transaction commit would get charged
the entire journal write.  OTOH, journal writes are sequential I/O, so
the amount of disk time for writing the journal is going to be
relatively small; more importantly, the work from other cgroups is
going to be minimal, especially if they hadn't issued an fsync().

In the case where you have three cgroups all issuing fsync(2) and they
all landed in the same jbd2 transaction thanks to commit batching, in
the ideal world we would split up the disk time usage equally across
those three cgroups.  But it's probably not worth doing that...

That being said, we probably do need some BFQ support, since in the
case where we have multiple processes doing buffered writes w/o fsync,
we do charge the data=ordered writeback to each block cgroup.  Worse,
the commit can't complete until all of the data integrity
writebacks have completed.  And if there are N cgroups with dirty
inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
of idle time tacked onto the commit time.

If we charge the journal I/O to the cgroup, and there's only one
process doing the 

   dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

then we don't need to worry about this failure mode, since both the
journal I/O and the data writeback will be hitting the same cgroup.
But that's arguably an artificial use case, and much more commonly
there will be multiple cgroups all trying to do at least some file
system I/O.

						- Ted


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-18 18:39 ` Paolo Valente
  2019-05-18 19:28   ` Theodore Ts'o
@ 2019-05-18 20:50   ` Srivatsa S. Bhat
  2019-05-20 10:19     ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-18 20:50 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, linux-kernel,
	axboe, jack, jmoyer, tytso, amakhalov, anishs, srivatsab

On 5/18/19 11:39 AM, Paolo Valente wrote:
> I've addressed these issues in my last batch of improvements for BFQ,
> which landed in the upcoming 5.2. If you give it a try, and still see
> the problem, then I'll be glad to reproduce it, and hopefully fix it
> for you.
>

Hi Paolo,

Thank you for looking into this!

I just tried current mainline at commit 72cf0b07, but unfortunately
didn't see any improvement:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

With mq-deadline, I get:

5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s

With bfq, I get:
5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
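
In case it matters, I'm switching schedulers between runs via sysfs in
the usual way (sdX stands in for the test disk):

echo mq-deadline > /sys/block/sdX/queue/scheduler
echo bfq > /sys/block/sdX/queue/scheduler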

Please let me know if any more info about my setup might be helpful.

Thank you!

Regards,
Srivatsa
VMware Photon OS

> 
>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>>
>> Hi,
>>
>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>> running the following command, with the CFQ I/O scheduler:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>
>> Throughput with CFQ: 60 KB/s
>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>
>> I spent some time looking into it and found that this is caused by the
>> undesirable interaction between 4 different components:
>>
>> - blkio cgroup controller enabled
>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>
>>
>> When docker is enabled, systemd creates a blkio cgroup called
>> system.slice to run system services (and docker) under it, and a
>> separate blkio cgroup called user.slice for user processes. So, when
>> dd is invoked, it runs under user.slice.
>>
>> The dd command above includes the dsync flag, which performs an
>> fdatasync after every write to the output file. Since dd is writing to
>> a file on ext4, jbd2 will be active, committing transactions
>> corresponding to those fdatasync requests from dd. (In other words, dd
>> depends on jdb2, in order to make forward progress). But jdb2 being a
>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>> runs under user.slice.
>>
>> Now, if the I/O scheduler in use for the underlying block device is
>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>> slice_idle and group_idle parameters, both of which default to 8ms).
>> Therefore, everytime CFQ switches between processing requests from dd
>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>> throughput tremendously!
>>
>> To verify this theory, I tried various experiments, and in all cases,
>> the 4 pre-conditions mentioned above were necessary to reproduce this
>> performance drop. For example, if I used an XFS filesystem (which
>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>> directly to a block device, I couldn't reproduce the performance
>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>> runs) also gets full performance; as does using the noop or deadline
>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>> to zero.
>>
>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>> both with virtualized storage as well as with disk pass-through,
>> backed by a rotational hard disk in both cases. The same problem was
>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>
>> Searching for any earlier discussions of this problem, I found an old
>> thread on LKML that encountered this behavior [1], as well as a docker
>> github issue [2] with similar symptoms (mentioned later in the
>> thread).
>>
>> So, I'm curious to know if this is a well-understood problem and if
>> anybody has any thoughts on how to fix it.
>>
>> Thank you very much!
>>
>>
>> [1]. https://lkml.org/lkml/2015/11/19/359
>>
>> [2]. https://github.com/moby/moby/issues/21485
>>     https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>
>> Regards,
>> Srivatsa
> 



* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-18 19:28   ` Theodore Ts'o
@ 2019-05-20  9:15     ` Jan Kara
  2019-05-20 10:45       ` Paolo Valente
  2019-05-21 16:48       ` Theodore Ts'o
  2019-05-20 10:38     ` Paolo Valente
  1 sibling, 2 replies; 52+ messages in thread
From: Jan Kara @ 2019-05-20  9:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Paolo Valente, Srivatsa S. Bhat, linux-fsdevel, linux-block,
	linux-ext4, cgroups, linux-kernel, axboe, jack, jmoyer,
	amakhalov, anishs, srivatsab

On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
> > I've addressed these issues in my last batch of improvements for
> > BFQ, which landed in the upcoming 5.2. If you give it a try, and
> > still see the problem, then I'll be glad to reproduce it, and
> > hopefully fix it for you.
> 
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
> 
> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), to attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes the process's CSS.  If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write.  OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and especially, the fact that work from other
> cgroups is going to be minimal, especially if hadn't issued an
> fsync().

But this makes the priority-inversion problems with the ext4 journal
worse, doesn't it? If we submit the journal commit in the blkio cgroup of
some random process, it may get throttled, which then effectively blocks
the whole filesystem. Or do you want to implement a more complex
back-pressure mechanism where you'd just account to a different blkio
cgroup during journal commit and then throttle at a different point,
where you are not blocking other tasks from making progress?

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups.  But it's probably not worth doing that...
> 
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charnge the data=ordered writeback to each block cgroup.  Worse,
> the commit can't complete until the all of the data integrity
> writebacks have completed.  And if there are N cgroups with dirty
> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.

Yeah. At least in some cases, we know there won't be any more IO from a
particular cgroup in the near future (e.g. transaction commit completing,
or when the layers above the IO scheduler already know which IO they are
going to submit next) and in that case idling is just a waste of time. But
so far I haven't decided what a reasonably clean interface for this, one
that isn't specific to a particular IO scheduler implementation, should
look like.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-18 20:50   ` Srivatsa S. Bhat
@ 2019-05-20 10:19     ` Paolo Valente
  2019-05-20 22:45       ` Srivatsa S. Bhat
  2019-05-21 11:25       ` Paolo Valente
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-20 10:19 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, tytso, amakhalov, anishs,
	srivatsab

[-- Attachment #1: Type: text/plain, Size: 5806 bytes --]



> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/18/19 11:39 AM, Paolo Valente wrote:
>> I've addressed these issues in my last batch of improvements for BFQ,
>> which landed in the upcoming 5.2. If you give it a try, and still see
>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>> for you.
>> 
> 
> Hi Paolo,
> 
> Thank you for looking into this!
> 
> I just tried current mainline at commit 72cf0b07, but unfortunately
> didn't see any improvement:
> 
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 
> With mq-deadline, I get:
> 
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
> 
> With bfq, I get:
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
> 

Hi Srivatsa,
thanks for reproducing this on mainline.  I seem to have reproduced a
bonsai-tree version of this issue.  Before digging into the block
trace, I'd like to ask you for some feedback.

First, in my test, the total throughput of the disk happens to be
about 20 times as high as that enjoyed by dd, regardless of the I/O
scheduler.  I guess this massive overhead is normal with dsync, but
I'd like to know whether it is about the same on your side.  This will
help me understand whether I'll actually be analyzing the same
problem as yours.

Second, the commands I used follow.  Do they implement your test case
correctly?

[root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
[root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
[root@localhost tmp]# cat /sys/block/sda/queue/scheduler
[mq-deadline] bfq none
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
[root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s

Thanks,
Paolo

> Please let me know if any more info about my setup might be helpful.
> 
> Thank you!
> 
> Regards,
> Srivatsa
> VMware Photon OS
> 
>> 
>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> 
>>> Hi,
>>> 
>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>> running the following command, with the CFQ I/O scheduler:
>>> 
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>> 
>>> Throughput with CFQ: 60 KB/s
>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>> 
>>> I spent some time looking into it and found that this is caused by the
>>> undesirable interaction between 4 different components:
>>> 
>>> - blkio cgroup controller enabled
>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>> 
>>> 
>>> When docker is enabled, systemd creates a blkio cgroup called
>>> system.slice to run system services (and docker) under it, and a
>>> separate blkio cgroup called user.slice for user processes. So, when
>>> dd is invoked, it runs under user.slice.
>>> 
>>> The dd command above includes the dsync flag, which performs an
>>> fdatasync after every write to the output file. Since dd is writing to
>>> a file on ext4, jbd2 will be active, committing transactions
>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>> runs under user.slice.
>>> 
>>> Now, if the I/O scheduler in use for the underlying block device is
>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>> Therefore, everytime CFQ switches between processing requests from dd
>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>> throughput tremendously!
>>> 
>>> To verify this theory, I tried various experiments, and in all cases,
>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>> performance drop. For example, if I used an XFS filesystem (which
>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>> directly to a block device, I couldn't reproduce the performance
>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>> runs) also gets full performance; as does using the noop or deadline
>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>> to zero.
>>> 
>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>> both with virtualized storage as well as with disk pass-through,
>>> backed by a rotational hard disk in both cases. The same problem was
>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>> 
>>> Searching for any earlier discussions of this problem, I found an old
>>> thread on LKML that encountered this behavior [1], as well as a docker
>>> github issue [2] with similar symptoms (mentioned later in the
>>> thread).
>>> 
>>> So, I'm curious to know if this is a well-understood problem and if
>>> anybody has any thoughts on how to fix it.
>>> 
>>> Thank you very much!
>>> 
>>> 
>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>> 
>>> [2]. https://github.com/moby/moby/issues/21485
>>>    https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>> 
>>> Regards,
>>> Srivatsa
>> 
> 


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-18 19:28   ` Theodore Ts'o
  2019-05-20  9:15     ` Jan Kara
@ 2019-05-20 10:38     ` Paolo Valente
  2019-05-21  7:38       ` Andrea Righi
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-20 10:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Srivatsa S. Bhat, linux-fsdevel, linux-block, linux-ext4,
	cgroups, kernel list, Jens Axboe, Jan Kara, jmoyer, amakhalov,
	anishs, srivatsab, Andrea Righi

[-- Attachment #1: Type: text/plain, Size: 3264 bytes --]



> Il giorno 18 mag 2019, alle ore 21:28, Theodore Ts'o <tytso@mit.edu> ha scritto:
> 
> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>> I've addressed these issues in my last batch of improvements for
>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>> still see the problem, then I'll be glad to reproduce it, and
>> hopefully fix it for you.
> 
> Hi Paolo, I'm curious if you could give a quick summary about what you
> changed in BFQ?
> 

Here is the idea: while idling for a process, inject I/O from other
processes, to such an extent that no harm is caused to the process for
which we are idling.  Details in this LWN article:
https://lwn.net/Articles/784267/
in section "Improving extra-service injection".

> I was considering adding support so that if userspace calls fsync(2)
> or fdatasync(2), to attach the process's CSS to the transaction, and
> then charge all of the journal metadata writes the process's CSS.  If
> there are multiple fsync's batched into the transaction, the first
> process which forced the early transaction commit would get charged
> the entire journal write.  OTOH, journal writes are sequential I/O, so
> the amount of disk time for writing the journal is going to be
> relatively small, and especially, the fact that work from other
> cgroups is going to be minimal, especially if hadn't issued an
> fsync().
> 

Yeah, that's a longstanding and difficult instance of the general
too-short-blanket problem.  Jan has already highlighted one of the
main issues in his reply.  I'll add a design issue (from my point of
view): I'd find it a little odd that explicit sync transactions have
an owner to charge, while generic buffered writes do not.

I think Andrea Righi addressed related issues in his recent patch
proposal [1], so I've CCed him too.

[1] https://lkml.org/lkml/2019/3/9/220

> In the case where you have three cgroups all issuing fsync(2) and they
> all landed in the same jbd2 transaction thanks to commit batching, in
> the ideal world we would split up the disk time usage equally across
> those three cgroups.  But it's probably not worth doing that...
> 
> That being said, we probably do need some BFQ support, since in the
> case where we have multiple processes doing buffered writes w/o fsync,
> we do charnge the data=ordered writeback to each block cgroup.  Worse,
> the commit can't complete until the all of the data integrity
> writebacks have completed.  And if there are N cgroups with dirty
> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> of idle time tacked onto the commit time.
> 

Jan already wrote part of what I wanted to reply here, so I'll
continue from his reply.

Thanks,
Paolo

> If we charge the journal I/O to the cgroup, and there's only one
> process doing the
> 
>   dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
> 
> then we don't need to worry about this failure mode, since both the
> journal I/O and the data writeback will be hitting the same cgroup.
> But that's arguably an artificial use case, and much more commonly
> there will be multiple cgroups all trying to at least some file system
> I/O.
> 
> 						- Ted


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20  9:15     ` Jan Kara
@ 2019-05-20 10:45       ` Paolo Valente
  2019-05-21 16:48       ` Theodore Ts'o
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-20 10:45 UTC (permalink / raw)
  To: Jan Kara
  Cc: Theodore Ts'o, Srivatsa S. Bhat, linux-fsdevel, linux-block,
	linux-ext4, cgroups, linux-kernel, axboe, jmoyer, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 3349 bytes --]



> Il giorno 20 mag 2019, alle ore 11:15, Jan Kara <jack@suse.cz> ha scritto:
> 
> On Sat 18-05-19 15:28:47, Theodore Ts'o wrote:
>> On Sat, May 18, 2019 at 08:39:54PM +0200, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for
>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>> still see the problem, then I'll be glad to reproduce it, and
>>> hopefully fix it for you.
>> 
>> Hi Paolo, I'm curious if you could give a quick summary about what you
>> changed in BFQ?
>> 
>> I was considering adding support so that if userspace calls fsync(2)
>> or fdatasync(2), to attach the process's CSS to the transaction, and
>> then charge all of the journal metadata writes the process's CSS.  If
>> there are multiple fsync's batched into the transaction, the first
>> process which forced the early transaction commit would get charged
>> the entire journal write.  OTOH, journal writes are sequential I/O, so
>> the amount of disk time for writing the journal is going to be
>> relatively small, and especially, the fact that work from other
>> cgroups is going to be minimal, especially if hadn't issued an
>> fsync().
> 
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?
> 
>> In the case where you have three cgroups all issuing fsync(2) and they
>> all landed in the same jbd2 transaction thanks to commit batching, in
>> the ideal world we would split up the disk time usage equally across
>> those three cgroups.  But it's probably not worth doing that...
>> 
>> That being said, we probably do need some BFQ support, since in the
>> case where we have multiple processes doing buffered writes w/o fsync,
>> we do charnge the data=ordered writeback to each block cgroup. Worse,
>> the commit can't complete until the all of the data integrity
>> writebacks have completed.  And if there are N cgroups with dirty
>> inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
>> of idle time tacked onto the commit time.
> 
> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time.

Yep.  Issues like this are targeted exactly by the improvement I
mentioned in my previous reply.

> But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.
> 

That's an interesting point.  So far, I've assumed that nobody would
tell BFQ anything.  But if you guys think that such communication
may be acceptable to some degree, then I'd be glad to try to come up
with some solution.  For instance: some hook that any I/O scheduler
may export if meaningful.

Thanks,
Paolo

> 								Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20 10:19     ` Paolo Valente
@ 2019-05-20 22:45       ` Srivatsa S. Bhat
  2019-05-21  6:23         ` Paolo Valente
  2019-05-21 11:25       ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-20 22:45 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, tytso, amakhalov, anishs,
	srivatsab

On 5/20/19 3:19 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for BFQ,
>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>> for you.
>>>
>>
>> Hi Paolo,
>>
>> Thank you for looking into this!
>>
>> I just tried current mainline at commit 72cf0b07, but unfortunately
>> didn't see any improvement:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>
>> With mq-deadline, I get:
>>
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>
>> With bfq, I get:
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>
> 
> Hi Srivatsa,
> thanks for reproducing this on mainline.  I seem to have reproduced a
> bonsai-tree version of this issue.  Before digging into the block
> trace, I'd like to ask you for some feedback.
> 
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler.  I guess this massive overhead is normal with dsync, but
> I'd like know whether it is about the same on your side.  This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
> 

Do you mean to say the throughput obtained by dd'ing directly to the
block device (bypassing the filesystem)? That does give me a 20x
speedup with bs=512, but much more with a bigger block size (achieving
a max throughput of about 110 MB/s).

dd if=/dev/zero of=/dev/sdc bs=512 count=10000 conv=fsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.15257 s, 33.6 MB/s

dd if=/dev/zero of=/dev/sdc bs=4k count=10000 conv=fsync
10000+0 records in
10000+0 records out
40960000 bytes (41 MB, 39 MiB) copied, 0.395081 s, 104 MB/s

I'm testing this on a Toshiba MG03ACA1 (1TB) hard disk.

> Second, the commands I used follow.  Do they implement your test case
> correctly?
> 
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
> 

Yes, this is indeed the testcase, although I see a much bigger
drop in performance with bfq, compared to the results from
your setup.

Regards,
Srivatsa


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20 22:45       ` Srivatsa S. Bhat
@ 2019-05-21  6:23         ` Paolo Valente
  2019-05-21  7:19           ` Srivatsa S. Bhat
  2019-05-21  9:10           ` Jan Kara
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-21  6:23 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 4449 bytes --]



> Il giorno 21 mag 2019, alle ore 00:45, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/20/19 3:19 AM, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>> for you.
>>>> 
>>> 
>>> Hi Paolo,
>>> 
>>> Thank you for looking into this!
>>> 
>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>> didn't see any improvement:
>>> 
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 
>>> With mq-deadline, I get:
>>> 
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>> 
>>> With bfq, I get:
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>> 
>> 
>> Hi Srivatsa,
>> thanks for reproducing this on mainline.  I seem to have reproduced a
>> bonsai-tree version of this issue.  Before digging into the block
>> trace, I'd like to ask you for some feedback.
>> 
>> First, in my test, the total throughput of the disk happens to be
>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>> scheduler.  I guess this massive overhead is normal with dsync, but
>> I'd like know whether it is about the same on your side.  This will
>> help me understand whether I'll actually be analyzing about the same
>> problem as yours.
>> 
> 
> Do you mean to say the throughput obtained by dd'ing directly to the
> block device (bypassing the filesystem)?

No no, I mean simply what follows.

1) in one terminal:
[root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 record dentro
10000+0 record fuori
5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s

2) In a second terminal, while the dd is in progress in the first
terminal:
$ iostat -tmd /dev/sda 3
Linux 5.1.0+ (localhost.localdomain) 	20/05/2019 	_x86_64_	(2 CPU)

...
20/05/2019 11:40:17
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2288,00         0,00         9,77          0         29

20/05/2019 11:40:20
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2325,33         0,00         9,93          0         29

20/05/2019 11:40:23
Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda            2351,33         0,00        10,05          0         30
...

As you can see, the overall throughput (~10 MB/s) is more than 20
times as high as the dd throughput (~350 KB/s).  But the dd is the
only source of I/O.

Do you also see such a huge difference?

Thanks,
Paolo

> That does give me a 20x
> speedup with bs=512, but much more with a bigger block size (achieving
> a max throughput of about 110 MB/s).
> 
> dd if=/dev/zero of=/dev/sdc bs=512 count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 0.15257 s, 33.6 MB/s
> 
> dd if=/dev/zero of=/dev/sdc bs=4k count=10000 conv=fsync
> 10000+0 records in
> 10000+0 records out
> 40960000 bytes (41 MB, 39 MiB) copied, 0.395081 s, 104 MB/s
> 
> I'm testing this on a Toshiba MG03ACA1 (1TB) hard disk.
> 
>> Second, the commands I used follow.  Do they implement your test case
>> correctly?
>> 
>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>> [mq-deadline] bfq none
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>> 
> 
> Yes, this is indeed the testcase, although I see a much bigger
> drop in performance with bfq, compared to the results from
> your setup.
> 
> Regards,
> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21  6:23         ` Paolo Valente
@ 2019-05-21  7:19           ` Srivatsa S. Bhat
  2019-05-21  9:10           ` Jan Kara
  1 sibling, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-21  7:19 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/20/19 11:23 PM, Paolo Valente wrote:
> 
> 
>> Il giorno 21 mag 2019, alle ore 00:45, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> On 5/20/19 3:19 AM, Paolo Valente wrote:
>>>
>>>
>>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>
>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>> for you.
>>>>>
>>>>
>>>> Hi Paolo,
>>>>
>>>> Thank you for looking into this!
>>>>
>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>> didn't see any improvement:
>>>>
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>
>>>> With mq-deadline, I get:
>>>>
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>
>>>> With bfq, I get:
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>
>>>
>>> Hi Srivatsa,
>>> thanks for reproducing this on mainline.  I seem to have reproduced a
>>> bonsai-tree version of this issue.  Before digging into the block
>>> trace, I'd like to ask you for some feedback.
>>>
>>> First, in my test, the total throughput of the disk happens to be
>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>> scheduler.  I guess this massive overhead is normal with dsync, but
>>> I'd like know whether it is about the same on your side.  This will
>>> help me understand whether I'll actually be analyzing about the same
>>> problem as yours.
>>>
>>
>> Do you mean to say the throughput obtained by dd'ing directly to the
>> block device (bypassing the filesystem)?
> 
> No no, I mean simply what follows.
> 
> 1) in one terminal:
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> 
> 2) In a second terminal, while the dd is in progress in the first
> terminal:
> $ iostat -tmd /dev/sda 3
> Linux 5.1.0+ (localhost.localdomain) 	20/05/2019 	_x86_64_	(2 CPU)
> 
> ...
> 20/05/2019 11:40:17
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2288,00         0,00         9,77          0         29
> 
> 20/05/2019 11:40:20
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2325,33         0,00         9,93          0         29
> 
> 20/05/2019 11:40:23
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2351,33         0,00        10,05          0         30
> ...
> 
> As you can see, the overall throughput (~10 MB/s) is more than 20
> times as high as the dd throughput (~350 KB/s).  But the dd is the
> only source of I/O.
> 
> Do you also see such a huge difference?
> 
Ah, I see what you mean. Yes, I get a huge difference as well:

I/O scheduler    dd throughput    Total throughput (via iostat)
-------------    -------------    -----------------------------

mq-deadline
    or              1.6 MB/s               50 MB/s (30x)
  kyber

   bfq               60 KB/s                1 MB/s (16x)


Regards,
Srivatsa
VMware Photon OS


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20 10:38     ` Paolo Valente
@ 2019-05-21  7:38       ` Andrea Righi
  0 siblings, 0 replies; 52+ messages in thread
From: Andrea Righi @ 2019-05-21  7:38 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Theodore Ts'o, Srivatsa S. Bhat, linux-fsdevel, linux-block,
	linux-ext4, cgroups, kernel list, Jens Axboe, Jan Kara, jmoyer,
	amakhalov, anishs, srivatsab, Josef Bacik, Tejun Heo

On Mon, May 20, 2019 at 12:38:32PM +0200, Paolo Valente wrote:
...
> > I was considering adding support so that if userspace calls fsync(2)
> > or fdatasync(2), to attach the process's CSS to the transaction, and
> > then charge all of the journal metadata writes the process's CSS.  If
> > there are multiple fsync's batched into the transaction, the first
> > process which forced the early transaction commit would get charged
> > the entire journal write.  OTOH, journal writes are sequential I/O, so
> > the amount of disk time for writing the journal is going to be
> > relatively small, and especially, the fact that work from other
> > cgroups is going to be minimal, especially if hadn't issued an
> > fsync().
> > 
> 
> Yeah, that's a longstanding and difficult instance of the general
> too-short-blanket problem.  Jan has already highlighted one of the
> main issues in his reply.  I'll add a design issue (from my point of
> view): I'd find a little odd that explicit sync transactions have an
> owner to charge, while generic buffered writes have not.
> 
> I think Andrea Righi addressed related issues in his recent patch
> proposal [1], so I've CCed him too.
> 
> [1] https://lkml.org/lkml/2019/3/9/220

If journal metadata writes are submitted using a process's CSS, the
commit may be throttled, and that can also indirectly throttle other
"high-priority" blkio cgroups, so I think that logic alone isn't enough.

We have discussed this priority-inversion problem with Josef and Tejun
(adding both of them in cc); the idea that seemed most reasonable was to
temporarily boost the priority of blkio cgroups when there are multiple
sync(2) waiters in the system.

More exactly, when I/O is going to be throttled for a specific blkio
cgroup, if there's any other blkio cgroup waiting for writeback I/O,
no throttling is applied (this logic can be refined by saving a list of
blkio sync(2) waiters and taking the highest I/O rate among them).

In addition to that Tejun mentioned that he would like to see a better
sync(2) isolation done at the fs namespace level. This last part still
needs to be defined and addressed.

However, even the simple logic above "no throttling if there's any other
sync(2) waiter" can already prevent big system lockups (see for example
the simple test case that I suggested here https://lkml.org/lkml/2019/),
so I think having this change alone would be a nice improvement already:

 https://lkml.org/lkml/2019/3/9/220

Thanks,
-Andrea

> 
> > In the case where you have three cgroups all issuing fsync(2) and they
> > all landed in the same jbd2 transaction thanks to commit batching, in
> > the ideal world we would split up the disk time usage equally across
> > those three cgroups.  But it's probably not worth doing that...
> > 
> > That being said, we probably do need some BFQ support, since in the
> > case where we have multiple processes doing buffered writes w/o fsync,
> > we do charnge the data=ordered writeback to each block cgroup.  Worse,
> > the commit can't complete until the all of the data integrity
> > writebacks have completed.  And if there are N cgroups with dirty
> > inodes, and slice_idle set to 8ms, there is going to be 8*N ms worth
> > of idle time tacked onto the commit time.
> > 
> 
> Jan already wrote part of what I wanted to reply here, so I'll
> continue from his reply.
> 
> Thanks,
> Paolo
> 
> > If we charge the journal I/O to the cgroup, and there's only one
> > process doing the
> > 
> >   dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
> > 
> > then we don't need to worry about this failure mode, since both the
> > journal I/O and the data writeback will be hitting the same cgroup.
> > But that's arguably an artificial use case, and much more commonly
> > there will be multiple cgroups all trying to at least some file system
> > I/O.
> > 
> > 						- Ted
> 




* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21  6:23         ` Paolo Valente
  2019-05-21  7:19           ` Srivatsa S. Bhat
@ 2019-05-21  9:10           ` Jan Kara
  2019-05-21 16:31             ` Theodore Ts'o
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Kara @ 2019-05-21  9:10 UTC (permalink / raw)
  To: Paolo Valente
  Cc: Srivatsa S. Bhat, linux-fsdevel, linux-block, linux-ext4,
	cgroups, kernel list, Jens Axboe, Jan Kara, jmoyer,
	Theodore Ts'o, amakhalov, anishs, srivatsab

On Tue 21-05-19 08:23:05, Paolo Valente wrote:
> > Il giorno 21 mag 2019, alle ore 00:45, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> > 
> > On 5/20/19 3:19 AM, Paolo Valente wrote:
> >> 
> >> 
> >>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> >>> 
> >>> On 5/18/19 11:39 AM, Paolo Valente wrote:
> >>>> I've addressed these issues in my last batch of improvements for BFQ,
> >>>> which landed in the upcoming 5.2. If you give it a try, and still see
> >>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
> >>>> for you.
> >>>> 
> >>> 
> >>> Hi Paolo,
> >>> 
> >>> Thank you for looking into this!
> >>> 
> >>> I just tried current mainline at commit 72cf0b07, but unfortunately
> >>> didn't see any improvement:
> >>> 
> >>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> >>> 
> >>> With mq-deadline, I get:
> >>> 
> >>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
> >>> 
> >>> With bfq, I get:
> >>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
> >>> 
> >> 
> >> Hi Srivatsa,
> >> thanks for reproducing this on mainline.  I seem to have reproduced a
> >> bonsai-tree version of this issue.  Before digging into the block
> >> trace, I'd like to ask you for some feedback.
> >> 
> >> First, in my test, the total throughput of the disk happens to be
> >> about 20 times as high as that enjoyed by dd, regardless of the I/O
> >> scheduler.  I guess this massive overhead is normal with dsync, but
> >> I'd like know whether it is about the same on your side.  This will
> >> help me understand whether I'll actually be analyzing about the same
> >> problem as yours.
> >> 
> > 
> > Do you mean to say the throughput obtained by dd'ing directly to the
> > block device (bypassing the filesystem)?
> 
> No no, I mean simply what follows.
> 
> 1) in one terminal:
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> 
> 2) In a second terminal, while the dd is in progress in the first
> terminal:
> $ iostat -tmd /dev/sda 3
> Linux 5.1.0+ (localhost.localdomain) 	20/05/2019 	_x86_64_	(2 CPU)
> 
> ...
> 20/05/2019 11:40:17
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2288,00         0,00         9,77          0         29
> 
> 20/05/2019 11:40:20
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2325,33         0,00         9,93          0         29
> 
> 20/05/2019 11:40:23
> Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
> sda            2351,33         0,00        10,05          0         30
> ...
> 
> As you can see, the overall throughput (~10 MB/s) is more than 20
> times as high as the dd throughput (~350 KB/s).  But the dd is the
> only source of I/O.

Yes, and that's expected. It just shows how inefficient small synchronous IO
is. Look, dd(1) writes 512 bytes. From the FS point of view we have to write:
the full fs block with the data (+4KB), the inode to the journal (+4KB), a
journal descriptor block (+4KB), the journal superblock (+4KB), and the
transaction commit block (+4KB) - so that's 20KB, just off the top of my
head, to write 512 bytes...
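
(That is roughly a 40x write amplification, which lines up with the
~20-30x gap between the total iostat throughput and the dd throughput
reported earlier in the thread.)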

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20 10:19     ` Paolo Valente
  2019-05-20 22:45       ` Srivatsa S. Bhat
@ 2019-05-21 11:25       ` Paolo Valente
  2019-05-21 13:20         ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-21 11:25 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, tytso, amakhalov, anishs,
	srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 1587 bytes --]



> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>> 
>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>> I've addressed these issues in my last batch of improvements for BFQ,
>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>> for you.
>>> 
>> 
>> Hi Paolo,
>> 
>> Thank you for looking into this!
>> 
>> I just tried current mainline at commit 72cf0b07, but unfortunately
>> didn't see any improvement:
>> 
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 
>> With mq-deadline, I get:
>> 
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>> 
>> With bfq, I get:
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>> 
> 
> Hi Srivatsa,
> thanks for reproducing this on mainline.  I seem to have reproduced a
> bonsai-tree version of this issue.

Hi again Srivatsa,
I've analyzed the trace, and I've found the cause of the loss of
throughput on my side.  To find out whether it is the same cause as
on your side, I've prepared a script that executes your test and takes
a trace during the test.  If that's ok for you, could you please
- change the value of the DEVS parameter in the attached script, if
  needed
- execute the script
- send me the trace file that the script will leave in your working
  dir

Looking forward to your trace,
Paolo


[-- Attachment #1.2: dsync_test.sh --]
[-- Type: application/octet-stream, Size: 1941 bytes --]

#!/bin/bash

DEVS=sda # please set this parameter to the dev name for your test drive

SCHED=bfq
TRACE=1

function init_tracing {
	if [ "$TRACE" == "1" ] ; then
		if [ ! -d /sys/kernel/debug/tracing ] ; then
			mount -t debugfs none /sys/kernel/debug
		fi
		echo nop > /sys/kernel/debug/tracing/current_tracer
		echo 500000 > /sys/kernel/debug/tracing/buffer_size_kb
		echo "${SCHED}*" "__${SCHED}*" >\
			/sys/kernel/debug/tracing/set_ftrace_filter
		echo blk > /sys/kernel/debug/tracing/current_tracer
	fi
}

function set_tracing {
	if [ "$TRACE" == "1" ] ; then
	    if [[ -e /sys/kernel/debug/tracing/tracing_enabled && \
		$(cat /sys/kernel/debug/tracing/tracing_enabled) -ne $1 ]]; then
			echo "echo $1 > /sys/kernel/debug/tracing/tracing_enabled"
			echo $1 > /sys/kernel/debug/tracing/tracing_enabled
		fi
		dev=$(echo $DEVS | awk '{ print $1 }')
		if [[ -e /sys/block/$dev/trace/enable && \
			  $(cat /sys/block/$dev/trace/enable) -ne $1 ]]; then
		    echo "echo $1 > /sys/block/$dev/trace/enable"
		    echo $1 > /sys/block/$dev/trace/enable
		fi

		if [ "$1" == 0 ]; then
		    for cpu_path in /sys/kernel/debug/tracing/per_cpu/cpu?
		    do
			stat_file=$cpu_path/stats
			OVER=$(grep "overrun" $stat_file | \
			    grep -v "overrun: 0")
			if [ "$OVER" != "" ]; then
			    cpu=$(basename $cpu_path)
			    echo $OVER on $cpu, please increase buffer size!
			fi
		    done
		fi
	fi
}

init_tracing

mkdir /sys/fs/cgroup/blkio/testgrp
echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
echo > /sys/kernel/debug/tracing/trace
set_tracing 1 
dev=$(echo $DEVS | awk '{ print $1 }') # act on the first device listed in DEVS
echo bfq > /sys/block/$dev/queue/scheduler
cat /sys/block/$dev/queue/scheduler
echo 0 > /sys/block/$dev/queue/iosched/low_latency
dd if=/dev/zero of=/root/test.img bs=512 count=5000 oflag=dsync
set_tracing 0
echo 1 > /sys/block/$dev/queue/iosched/low_latency
cp /sys/kernel/debug/tracing/trace .
echo $BASHPID > /sys/fs/cgroup/blkio/cgroup.procs 
rmdir /sys/fs/cgroup/blkio/testgrp

[-- Attachment #1.3: Type: text/plain, Size: 5006 bytes --]


>  Before digging into the block
> trace, I'd like to ask you for some feedback.
> 
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler.  I guess this massive overhead is normal with dsync, but
> I'd like know whether it is about the same on your side.  This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
> 
> Second, the commands I used follow.  Do they implement your test case
> correctly?
> 
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 record dentro
> 10000+0 record fuori
> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
> 
> Thanks,
> Paolo
> 
>> Please let me know if any more info about my setup might be helpful.
>> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS
>> 
>>> 
>>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>> 
>>>> 
>>>> Hi,
>>>> 
>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>> running the following command, with the CFQ I/O scheduler:
>>>> 
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>> 
>>>> Throughput with CFQ: 60 KB/s
>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>> 
>>>> I spent some time looking into it and found that this is caused by the
>>>> undesirable interaction between 4 different components:
>>>> 
>>>> - blkio cgroup controller enabled
>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>> 
>>>> 
>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>> system.slice to run system services (and docker) under it, and a
>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>> dd is invoked, it runs under user.slice.
>>>> 
>>>> The dd command above includes the dsync flag, which performs an
>>>> fdatasync after every write to the output file. Since dd is writing to
>>>> a file on ext4, jbd2 will be active, committing transactions
>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>> runs under user.slice.
>>>> 
>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>> throughput tremendously!
>>>> 
>>>> To verify this theory, I tried various experiments, and in all cases,
>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>> performance drop. For example, if I used an XFS filesystem (which
>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>> directly to a block device, I couldn't reproduce the performance
>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>> runs) also gets full performance; as does using the noop or deadline
>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>> to zero.
>>>> 
>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>> both with virtualized storage as well as with disk pass-through,
>>>> backed by a rotational hard disk in both cases. The same problem was
>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>> 
>>>> Searching for any earlier discussions of this problem, I found an old
>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>> github issue [2] with similar symptoms (mentioned later in the
>>>> thread).
>>>> 
>>>> So, I'm curious to know if this is a well-understood problem and if
>>>> anybody has any thoughts on how to fix it.
>>>> 
>>>> Thank you very much!
>>>> 
>>>> 
>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>> 
>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>   https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>> 
>>>> Regards,
>>>> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 11:25       ` Paolo Valente
@ 2019-05-21 13:20         ` Paolo Valente
  2019-05-21 16:21           ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-21 13:20 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, tytso, amakhalov, anishs,
	srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 1857 bytes --]



> Il giorno 21 mag 2019, alle ore 13:25, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>> 
>> 
>> 
>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>> for you.
>>>> 
>>> 
>>> Hi Paolo,
>>> 
>>> Thank you for looking into this!
>>> 
>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>> didn't see any improvement:
>>> 
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 
>>> With mq-deadline, I get:
>>> 
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>> 
>>> With bfq, I get:
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>> 
>> 
>> Hi Srivatsa,
>> thanks for reproducing this on mainline.  I seem to have reproduced a
>> bonsai-tree version of this issue.
> 
> Hi again Srivatsa,
> I've analyzed the trace, and I've found the cause of the loss of
> throughput in on my side.  To find out whether it is the same cause as
> on your side, I've prepared a script that executes your test and takes
> a trace during the test.  If ok for you, could you please
> - change the value for the DEVS parameter in the attached script, if
>  needed
> - execute the script
> - send me the trace file that the script will leave in your working
> dir
> 

Sorry, I forgot to add that I also need you to, first, apply the
attached patch (it will make BFQ generate the log I need).
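For reference, applying the gzipped patch could look something like this
(assuming a git checkout of the tree you tested, with the attachment
saved in the current directory):

  gunzip 0001-block-bfq-add-logs-and-BUG_ONs.patch.gz
  git am 0001-block-bfq-add-logs-and-BUG_ONs.patch   # or: patch -p1 < 0001-block-bfq-add-logs-and-BUG_ONs.patch
  # rebuild and boot the patched kernel before re-running the test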

Thanks,
Paolo


[-- Attachment #1.2: 0001-block-bfq-add-logs-and-BUG_ONs.patch.gz --]
[-- Type: application/x-gzip, Size: 27285 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 5186 bytes --]



> Looking forward to your trace,
> Paolo
> 
> <dsync_test.sh>
>> Before digging into the block
>> trace, I'd like to ask you for some feedback.
>> 
>> First, in my test, the total throughput of the disk happens to be
>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>> scheduler.  I guess this massive overhead is normal with dsync, but
>> I'd like know whether it is about the same on your side.  This will
>> help me understand whether I'll actually be analyzing about the same
>> problem as yours.
>> 
>> Second, the commands I used follow.  Do they implement your test case
>> correctly?
>> 
>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>> [mq-deadline] bfq none
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 record dentro
>> 10000+0 record fuori
>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>> 
>> Thanks,
>> Paolo
>> 
>>> Please let me know if any more info about my setup might be helpful.
>>> 
>>> Thank you!
>>> 
>>> Regards,
>>> Srivatsa
>>> VMware Photon OS
>>> 
>>>> 
>>>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>> 
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>>> running the following command, with the CFQ I/O scheduler:
>>>>> 
>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>> 
>>>>> Throughput with CFQ: 60 KB/s
>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>> 
>>>>> I spent some time looking into it and found that this is caused by the
>>>>> undesirable interaction between 4 different components:
>>>>> 
>>>>> - blkio cgroup controller enabled
>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>> 
>>>>> 
>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>> system.slice to run system services (and docker) under it, and a
>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>> dd is invoked, it runs under user.slice.
>>>>> 
>>>>> The dd command above includes the dsync flag, which performs an
>>>>> fdatasync after every write to the output file. Since dd is writing to
>>>>> a file on ext4, jbd2 will be active, committing transactions
>>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>>> runs under user.slice.
>>>>> 
>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>>> throughput tremendously!
>>>>> 
>>>>> To verify this theory, I tried various experiments, and in all cases,
>>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>>> performance drop. For example, if I used an XFS filesystem (which
>>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>>> directly to a block device, I couldn't reproduce the performance
>>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>>> runs) also gets full performance; as does using the noop or deadline
>>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>>> to zero.
>>>>> 
>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>> both with virtualized storage as well as with disk pass-through,
>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>> 
>>>>> Searching for any earlier discussions of this problem, I found an old
>>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>>> github issue [2] with similar symptoms (mentioned later in the
>>>>> thread).
>>>>> 
>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>> anybody has any thoughts on how to fix it.
>>>>> 
>>>>> Thank you very much!
>>>>> 
>>>>> 
>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>> 
>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>  https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>> 
>>>>> Regards,
>>>>> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 13:20         ` Paolo Valente
@ 2019-05-21 16:21           ` Paolo Valente
  2019-05-21 17:38             ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-21 16:21 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 2519 bytes --]



> Il giorno 21 mag 2019, alle ore 15:20, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 21 mag 2019, alle ore 13:25, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>> 
>> 
>> 
>>> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>> 
>>> 
>>> 
>>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>> 
>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>> for you.
>>>>> 
>>>> 
>>>> Hi Paolo,
>>>> 
>>>> Thank you for looking into this!
>>>> 
>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>> didn't see any improvement:
>>>> 
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 
>>>> With mq-deadline, I get:
>>>> 
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>> 
>>>> With bfq, I get:
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>> 
>>> 
>>> Hi Srivatsa,
>>> thanks for reproducing this on mainline.  I seem to have reproduced a
>>> bonsai-tree version of this issue.
>> 
>> Hi again Srivatsa,
>> I've analyzed the trace, and I've found the cause of the loss of
>> throughput in on my side.  To find out whether it is the same cause as
>> on your side, I've prepared a script that executes your test and takes
>> a trace during the test.  If ok for you, could you please
>> - change the value for the DEVS parameter in the attached script, if
>> needed
>> - execute the script
>> - send me the trace file that the script will leave in your working
>> dir
>> 
> 
> Sorry, I forgot to add that I also need you to, first, apply the
> attached patch (it will make BFQ generate the log I need).
> 

Sorry again :) This time for attaching one more patch.  This is
basically a blind fix attempt, based on what I see in my VM.

So, instead of only sending me a trace, could you please:
1) apply this new patch on top of the one I attached in my previous email
2) repeat your test and report results
3) regardless of whether bfq performance improves, take a trace with
   my script (I've attached a new version that doesn't risk outputting
   an annoying error message like the previous one did)

Thanks,
Paolo


[-- Attachment #1.2: dsync_test.sh --]
[-- Type: application/octet-stream, Size: 1848 bytes --]

#!/bin/bash

DEVS=sda # please set this parameter to the dev name for your test drive

TRACE=1

function init_tracing {
	if [ "$TRACE" == "1" ] ; then
		if [ ! -d /sys/kernel/debug/tracing ] ; then
			mount -t debugfs none /sys/kernel/debug
		fi
		echo nop > /sys/kernel/debug/tracing/current_tracer
		echo 500000 > /sys/kernel/debug/tracing/buffer_size_kb
		echo blk > /sys/kernel/debug/tracing/current_tracer
	fi
}

function set_tracing {
	if [ "$TRACE" == "1" ] ; then
	    if [[ -e /sys/kernel/debug/tracing/tracing_enabled && \
		$(cat /sys/kernel/debug/tracing/tracing_enabled) -ne $1 ]]; then
			echo "echo $1 > /sys/kernel/debug/tracing/tracing_enabled"
			echo $1 > /sys/kernel/debug/tracing/tracing_enabled
		fi
		dev=$(echo $DEVS | awk '{ print $1 }')
		if [[ -e /sys/block/$dev/trace/enable && \
			  $(cat /sys/block/$dev/trace/enable) -ne $1 ]]; then
		    echo "echo $1 > /sys/block/$dev/trace/enable"
		    echo $1 > /sys/block/$dev/trace/enable
		fi

		if [ "$1" == 0 ]; then
		    for cpu_path in /sys/kernel/debug/tracing/per_cpu/cpu?
		    do
			stat_file=$cpu_path/stats
			OVER=$(grep "overrun" $stat_file | \
			    grep -v "overrun: 0")
			if [ "$OVER" != "" ]; then
			    cpu=$(basename $cpu_path)
			    echo $OVER on $cpu, please increase buffer size!
			fi
		    done
		fi
	fi
}

init_tracing

mkdir /sys/fs/cgroup/blkio/testgrp
echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
echo > /sys/kernel/debug/tracing/trace
set_tracing 1 
dev=$(echo $DEVS | awk '{ print $1 }') # act on the first device listed in DEVS
echo bfq > /sys/block/$dev/queue/scheduler
cat /sys/block/$dev/queue/scheduler
echo 0 > /sys/block/$dev/queue/iosched/low_latency
dd if=/dev/zero of=/root/test.img bs=512 count=5000 oflag=dsync
set_tracing 0
echo 1 > /sys/block/$dev/queue/iosched/low_latency
cp /sys/kernel/debug/tracing/trace .
echo $BASHPID > /sys/fs/cgroup/blkio/cgroup.procs 
rmdir /sys/fs/cgroup/blkio/testgrp

[-- Attachment #1.3: 0001-block-bfq-boost-injection.patch.gz --]
[-- Type: application/x-gzip, Size: 2462 bytes --]

[-- Attachment #1.4: Type: text/plain, Size: 5380 bytes --]



> Thanks,
> Paolo
> 
> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
> 
>> Looking forward to your trace,
>> Paolo
>> 
>> <dsync_test.sh>
>>> Before digging into the block
>>> trace, I'd like to ask you for some feedback.
>>> 
>>> First, in my test, the total throughput of the disk happens to be
>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>> scheduler.  I guess this massive overhead is normal with dsync, but
>>> I'd like know whether it is about the same on your side. This will
>>> help me understand whether I'll actually be analyzing about the same
>>> problem as yours.
>>> 
>>> Second, the commands I used follow.  Do they implement your test case
>>> correctly?
>>> 
>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>> [mq-deadline] bfq none
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 record dentro
>>> 10000+0 record fuori
>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 record dentro
>>> 10000+0 record fuori
>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>>> 
>>> Thanks,
>>> Paolo
>>> 
>>>> Please let me know if any more info about my setup might be helpful.
>>>> 
>>>> Thank you!
>>>> 
>>>> Regards,
>>>> Srivatsa
>>>> VMware Photon OS
>>>> 
>>>>> 
>>>>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>>> 
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>>>> running the following command, with the CFQ I/O scheduler:
>>>>>> 
>>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>>> 
>>>>>> Throughput with CFQ: 60 KB/s
>>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>> 
>>>>>> I spent some time looking into it and found that this is caused by the
>>>>>> undesirable interaction between 4 different components:
>>>>>> 
>>>>>> - blkio cgroup controller enabled
>>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>> 
>>>>>> 
>>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>>> system.slice to run system services (and docker) under it, and a
>>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>>> dd is invoked, it runs under user.slice.
>>>>>> 
>>>>>> The dd command above includes the dsync flag, which performs an
>>>>>> fdatasync after every write to the output file. Since dd is writing to
>>>>>> a file on ext4, jbd2 will be active, committing transactions
>>>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>>>> runs under user.slice.
>>>>>> 
>>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>>>> throughput tremendously!
>>>>>> 
>>>>>> To verify this theory, I tried various experiments, and in all cases,
>>>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>>>> performance drop. For example, if I used an XFS filesystem (which
>>>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>>>> directly to a block device, I couldn't reproduce the performance
>>>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>>>> runs) also gets full performance; as does using the noop or deadline
>>>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>>>> to zero.
>>>>>> 
>>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>>> both with virtualized storage as well as with disk pass-through,
>>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>> 
>>>>>> Searching for any earlier discussions of this problem, I found an old
>>>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>>>> github issue [2] with similar symptoms (mentioned later in the
>>>>>> thread).
>>>>>> 
>>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>>> anybody has any thoughts on how to fix it.
>>>>>> 
>>>>>> Thank you very much!
>>>>>> 
>>>>>> 
>>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>> 
>>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>> 
>>>>>> Regards,
>>>>>> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21  9:10           ` Jan Kara
@ 2019-05-21 16:31             ` Theodore Ts'o
  0 siblings, 0 replies; 52+ messages in thread
From: Theodore Ts'o @ 2019-05-21 16:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, Srivatsa S. Bhat, linux-fsdevel, linux-block,
	linux-ext4, cgroups, kernel list, Jens Axboe, jmoyer, amakhalov,
	anishs, srivatsab

On Tue, May 21, 2019 at 11:10:26AM +0200, Jan Kara wrote:
> > [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 
> Yes and that's expected. It just shows how inefficient small synchronous IO
> is. Look, dd(1) writes 512-bytes. From FS point of view we have to write:
> full fs block with data (+4KB), inode to journal (+4KB), journal descriptor
> block (+4KB), journal superblock (+4KB), transaction commit block (+4KB) -
> so that's 20KB just from top of my head to write 512 bytes...

Well, it's not *that* bad.  With fdatasync(), we only have to do this
worst-case thing every 8 writes.  For the other writes, we don't
actually need to do any file-system level block allocation, so it's
only a 512-byte write to the disk[1] seven out of eight writes.

That's also true for the slice_idle hit, of course.  We only need to do
a jbd2 transaction when there is a block allocation, and that's only
going to happen once in every eight writes.
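To make the 1-in-8 ratio concrete (a quick sketch, assuming ext4's
default 4KB block size):

  # a new fs block is only needed when the 512-byte appends cross a
  # block boundary, i.e. once every 4096/512 writes
  echo $(( 4096 / 512 ))   # -> 8, so only 1 fdatasync in 8 hits the worst case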

       	   	      	     	     	   - Ted

[1] Of course, small synchronous writes to a HDD are *also* terrible
for performance, just from the HDD's perspective.  For a random write
workload, if you are using disks with a 4k physical sector size, it's
having to do a read/modify/write for each 512 byte write.  And HDD
vendors are talking about wanting to go to a 32k or 64k physical
sector size...  In this sequential write workload, you'll mostly be
shielded from this by the HDD's cache, but the fact that you have to
wait for the bits to hit the platter is always going to be painful.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-20  9:15     ` Jan Kara
  2019-05-20 10:45       ` Paolo Valente
@ 2019-05-21 16:48       ` Theodore Ts'o
  2019-05-21 18:19         ` Josef Bacik
  1 sibling, 1 reply; 52+ messages in thread
From: Theodore Ts'o @ 2019-05-21 16:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, Srivatsa S. Bhat, linux-fsdevel, linux-block,
	linux-ext4, cgroups, linux-kernel, axboe, jmoyer, amakhalov,
	anishs, srivatsab

On Mon, May 20, 2019 at 11:15:58AM +0200, Jan Kara wrote:
> But this makes priority-inversion problems with ext4 journal worse, doesn't
> it? If we submit journal commit in blkio cgroup of some random process, it
> may get throttled which then effectively blocks the whole filesystem. Or do
> you want to implement a more complex back-pressure mechanism where you'd
> just account to different blkio cgroup during journal commit and then
> throttle as different point where you are not blocking other tasks from
> progress?

Good point, yes, it can.  It depends on what cgroup the file system is
mounted in (and hence what cgroup the jbd2 kernel thread is in).  If it
was mounted in the root cgroup, then the jbd2 thread is going to be
completely unthrottled (except for the data=ordered writebacks, which
will be charged to the cgroup which wrote those pages), so the only
thing which is nuking us will be the slice_idle timeout --- both for
the writebacks (which could get charged to N different cgroups, with
disastrous effects --- and this is going to be true for any file
system on a syncfs(2) call as well) and for switching between the jbd2
thread's cgroup and the writeback cgroup.

One thing the I/O scheduler could do is use the synchronous flag as a
hint that it should ix-nay on the idle-way.  Or maybe we need to have
a different way to signal this to the jbd2 thread, since I do
recognize that this issue is ext4-specific, *because* we do the
transaction handling in a separate thread, and because of the
data=ordered scheme, both of which are unique to ext4.  So exempting
synchronous writes from cgroup control doesn't make sense for other
file systems.

So maybe a special flag meaning "entangled writes", where the
slice_idle hacks should get suppressed for the data=ordered
writebacks, but we still charge the block I/O to the relevant CSS's?

I could also imagine that, if there were some way the file system could
track whether all of the file system modifications were charged to a
single cgroup, we could in that case charge it to that cgroup?

> Yeah. At least in some cases, we know there won't be any more IO from a
> particular cgroup in the near future (e.g. transaction commit completing,
> or when the layers above IO scheduler already know which IO they are going
> to submit next) and in that case idling is just a waste of time. But so far
> I haven't decided how should look a reasonably clean interface for this
> that isn't specific to a particular IO scheduler implementation.

The best I've come up with is some way of signalling that all of the
writes coming from the jbd2 commit are entangled, probably via a bio
flag.

If we don't have cgroup support, the other thing we could do is assume
that the jbd2 thread should always be in the root (unconstrained)
cgroup, and then force all writes, including data=ordered writebacks,
to be in the jbd2 thread's cgroup.  But that would make the block
cgroup controls trivially bypassable by an application, which could
just be fsync-happy and exempt all of its buffered I/O writes from
cgroup control.  So that's probably not a great way to go --- but it
would at least fix this particular performance issue.  :-/

						- Ted

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 16:21           ` Paolo Valente
@ 2019-05-21 17:38             ` Paolo Valente
  2019-05-21 22:51               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-21 17:38 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 8521 bytes --]



> Il giorno 21 mag 2019, alle ore 18:21, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 21 mag 2019, alle ore 15:20, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>> 
>> 
>> 
>>> Il giorno 21 mag 2019, alle ore 13:25, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>> 
>>> 
>>> 
>>>> Il giorno 20 mag 2019, alle ore 12:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>> 
>>>> 
>>>> 
>>>>> Il giorno 18 mag 2019, alle ore 22:50, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>> 
>>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>>> I've addressed these issues in my last batch of improvements for BFQ,
>>>>>> which landed in the upcoming 5.2. If you give it a try, and still see
>>>>>> the problem, then I'll be glad to reproduce it, and hopefully fix it
>>>>>> for you.
>>>>>> 
>>>>> 
>>>>> Hi Paolo,
>>>>> 
>>>>> Thank you for looking into this!
>>>>> 
>>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>>> didn't see any improvement:
>>>>> 
>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>> 
>>>>> With mq-deadline, I get:
>>>>> 
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>> 
>>>>> With bfq, I get:
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>> 
>>>> 
>>>> Hi Srivatsa,
>>>> thanks for reproducing this on mainline.  I seem to have reproduced a
>>>> bonsai-tree version of this issue.
>>> 
>>> Hi again Srivatsa,
>>> I've analyzed the trace, and I've found the cause of the loss of
>>> throughput in on my side.  To find out whether it is the same cause as
>>> on your side, I've prepared a script that executes your test and takes
>>> a trace during the test.  If ok for you, could you please
>>> - change the value for the DEVS parameter in the attached script, if
>>> needed
>>> - execute the script
>>> - send me the trace file that the script will leave in your working
>>> dir
>>> 
>> 
>> Sorry, I forgot to add that I also need you to, first, apply the
>> attached patch (it will make BFQ generate the log I need).
>> 
> 
> Sorry again :) This time for attaching one more patch.  This is
> basically a blind fix attempt, based on what I see in my VM.
> 
> So, instead of only sending me a trace, could you please:
> 1) apply this new patch on top of the one I attached in my previous email
> 2) repeat your test and report results

One last thing (I swear!): as you can see from my script, I have tested
only the low_latency=0 case so far.  So please, for the moment, do your
test with low_latency=0.  You can find the full path to this parameter
in, e.g., my script.
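Concretely, with sda as the test drive (the script's default DEVS) and
bfq already selected as the scheduler:

  cat /sys/block/sda/queue/iosched/low_latency    # check the current value
  echo 0 > /sys/block/sda/queue/iosched/low_latency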

Thanks,
Paolo

> 3) regardless of whether bfq performance improves, take a trace with
>   my script (I've attached a new version that doesn't risk to output an
>   annoying error message as the previous one)
> 
> Thanks,
> Paolo
> 
> <dsync_test.sh><0001-block-bfq-boost-injection.patch.gz>
> 
>> Thanks,
>> Paolo
>> 
>> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
>> 
>>> Looking forward to your trace,
>>> Paolo
>>> 
>>> <dsync_test.sh>
>>>> Before digging into the block
>>>> trace, I'd like to ask you for some feedback.
>>>> 
>>>> First, in my test, the total throughput of the disk happens to be
>>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>>> scheduler.  I guess this massive overhead is normal with dsync, but
>>>> I'd like know whether it is about the same on your side. This will
>>>> help me understand whether I'll actually be analyzing about the same
>>>> problem as yours.
>>>> 
>>>> Second, the commands I used follow.  Do they implement your test case
>>>> correctly?
>>>> 
>>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>>> [mq-deadline] bfq none
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 record dentro
>>>> 10000+0 record fuori
>>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 14,6892 s, 349 kB/s
>>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 record dentro
>>>> 10000+0 record fuori
>>>> 5120000 bytes (5,1 MB, 4,9 MiB) copied, 20,1953 s, 254 kB/s
>>>> 
>>>> Thanks,
>>>> Paolo
>>>> 
>>>>> Please let me know if any more info about my setup might be helpful.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Regards,
>>>>> Srivatsa
>>>>> VMware Photon OS
>>>>> 
>>>>>> 
>>>>>>> Il giorno 18 mag 2019, alle ore 00:16, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>>>> 
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> One of my colleagues noticed upto 10x - 30x drop in I/O throughput
>>>>>>> running the following command, with the CFQ I/O scheduler:
>>>>>>> 
>>>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflags=dsync
>>>>>>> 
>>>>>>> Throughput with CFQ: 60 KB/s
>>>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>>> 
>>>>>>> I spent some time looking into it and found that this is caused by the
>>>>>>> undesirable interaction between 4 different components:
>>>>>>> 
>>>>>>> - blkio cgroup controller enabled
>>>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>>> 
>>>>>>> 
>>>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>>>> system.slice to run system services (and docker) under it, and a
>>>>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>>>>> dd is invoked, it runs under user.slice.
>>>>>>> 
>>>>>>> The dd command above includes the dsync flag, which performs an
>>>>>>> fdatasync after every write to the output file. Since dd is writing to
>>>>>>> a file on ext4, jbd2 will be active, committing transactions
>>>>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>>>>> depends on jdb2, in order to make forward progress). But jdb2 being a
>>>>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>>>>> runs under user.slice.
>>>>>>> 
>>>>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>>>>> Therefore, everytime CFQ switches between processing requests from dd
>>>>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>>>>> throughput tremendously!
>>>>>>> 
>>>>>>> To verify this theory, I tried various experiments, and in all cases,
>>>>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>>>>> performance drop. For example, if I used an XFS filesystem (which
>>>>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>>>>> directly to a block device, I couldn't reproduce the performance
>>>>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>>>>> runs) also gets full performance; as does using the noop or deadline
>>>>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>>>>> to zero.
>>>>>>> 
>>>>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>>>>> both with virtualized storage as well as with disk pass-through,
>>>>>>> backed by a rotational hard disk in both cases. The same problem was
>>>>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>>> 
>>>>>>> Searching for any earlier discussions of this problem, I found an old
>>>>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>>>>> github issue [2] with similar symptoms (mentioned later in the
>>>>>>> thread).
>>>>>>> 
>>>>>>> So, I'm curious to know if this is a well-understood problem and if
>>>>>>> anybody has any thoughts on how to fix it.
>>>>>>> 
>>>>>>> Thank you very much!
>>>>>>> 
>>>>>>> 
>>>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>>> 
>>>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>>> https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Srivatsa


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 16:48       ` Theodore Ts'o
@ 2019-05-21 18:19         ` Josef Bacik
  2019-05-21 19:10           ` Theodore Ts'o
  0 siblings, 1 reply; 52+ messages in thread
From: Josef Bacik @ 2019-05-21 18:19 UTC (permalink / raw)
  To: Theodore Ts'o, Jan Kara, Paolo Valente, Srivatsa S. Bhat,
	linux-fsdevel, linux-block, linux-ext4, cgroups, linux-kernel,
	axboe, jmoyer, amakhalov, anishs, srivatsab

On Tue, May 21, 2019 at 12:48:14PM -0400, Theodore Ts'o wrote:
> On Mon, May 20, 2019 at 11:15:58AM +0200, Jan Kara wrote:
> > But this makes priority-inversion problems with ext4 journal worse, doesn't
> > it? If we submit journal commit in blkio cgroup of some random process, it
> > may get throttled which then effectively blocks the whole filesystem. Or do
> > you want to implement a more complex back-pressure mechanism where you'd
> > just account to different blkio cgroup during journal commit and then
> > throttle as different point where you are not blocking other tasks from
> > progress?
> 
> Good point, yes, it can.  It depends in what cgroup the file system is
> mounted (and hence what cgroup the jbd2 kernel thread is on).  If it
> was mounted in the root cgroup, then jbd2 thread is going to be
> completely unthrottled (except for the data=ordered writebacks, which
> will be charged to the cgroup which write those pages) so the only
> thing which is nuking us will be the slice_idle timeout --- both for
> the writebacks (which could get charged to N different cgroups, with
> disastrous effects --- and this is going to be true for any file
> system on a syncfs(2) call as well) and switching between the jbd2
> thread's cgroup and the writeback cgroup.
> 
> One thing the I/O scheduler could do is use the synchronous flag as a
> hint that it should ix-nay on the idle-way.  Or maybe we need to have
> a different way to signal this to the jbd2 thread, since I do
> recognize that this issue is ext4-specific, *because* we do the
> transaction handling in a separate thread, and because of the
> data=ordered scheme, both of which are unique to ext4.  So exempting
> synchronous writes from cgroup control doesn't make sense for other
> file systems.
> 
> So maybe a special flag meaning "entangled writes", where the
> sched_idle hacks should get suppressed for the data=ordered
> writebacks, but we still charge the block I/O to the relevant CSS's?
> 
> I could also imagine if there was some way that file system could
> track whether all of the file system modifications were charged to a
> single cgroup, we could in that case charge it to that cgroup?
> 
> > Yeah. At least in some cases, we know there won't be any more IO from a
> > particular cgroup in the near future (e.g. transaction commit completing,
> > or when the layers above IO scheduler already know which IO they are going
> > to submit next) and in that case idling is just a waste of time. But so far
> > I haven't decided how should look a reasonably clean interface for this
> > that isn't specific to a particular IO scheduler implementation.
> 
> The best I've come up with is some way of signalling that all of the
> writes coming from the jbd2 commit are entangled, probably via a bio
> flag.
> 
> If we don't have cgroup support, the other thing we could do is assume
> that the jbd2 thread should always be in the root (unconstrained)
> cgroup, and then force all writes, include data=ordered writebacks, to
> be in the jbd2's cgroup.  But that would make the block cgroup
> controls trivially bypassable by an application, which could just be
> fsync-happy and exempt all of its buffered I/O writes from cgroup
> control.  So that's probably not a great way to go --- but it would at
> least fix this particular performance issue.  :-/
> 

Chris is adding a REQ_ROOT (or something) flag that means "don't throttle me
now", but the blkcg attached to the bio is the one that is responsible for this
IO.  Then for io.latency we'll let the IO go through unmolested, but it gets
counted to the right cgroup, and then if we're exceeding latency guarantees we
have the ability to schedule throttling for that cgroup in a safer place.  This
would eliminate the data=ordered issue for ext4: you guys keep doing what you
are doing and we'll handle throttling elsewhere; as long as the bios are
tagged with the correct source, all is well.  Thanks,

Josef

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 18:19         ` Josef Bacik
@ 2019-05-21 19:10           ` Theodore Ts'o
  0 siblings, 0 replies; 52+ messages in thread
From: Theodore Ts'o @ 2019-05-21 19:10 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Jan Kara, Paolo Valente, Srivatsa S. Bhat, linux-fsdevel,
	linux-block, linux-ext4, cgroups, linux-kernel, axboe, jmoyer,
	amakhalov, anishs, srivatsab

On Tue, May 21, 2019 at 02:19:53PM -0400, Josef Bacik wrote:
> Chris is adding a REQ_ROOT (or something) flag that means don't throttle me now,
> but the the blkcg attached to the bio is the one that is responsible for this
> IO.  Then for io.latency we'll let the io go through unmolested but it gets
> counted to the right cgroup, and if then we're exceeding latency guarantees we
> have the ability to schedule throttling for that cgroup in a safer place.  This
> would eliminate the data=ordered issue for ext4, you guys keep doing what you
> are doing and we'll handle throttling elsewhere, just so long as the bio's are
> tagged with the correct source then all is well.  Thanks,

Great, it sounds like Chris also came up with the entangled writes
flag idea (although probably with a better name than I did :-).  So
now all we need to do is to plumb a flag through the writeback code so
that file systems (or the VFS layer) implementing syncfs(2) or
fsync(2) can arrange to have that flag set if necessary.

Speaking of syncfs(2), something which we considered doing at Google
many years ago (but never did) was to implement a hack so that a
syncfs(2) or sync(2) call from a non-root user would become a no-op.
The reason for this was that on heavily loaded machines, an SRE logged
in as a non-root user might absent-mindedly type "sync", and that
would cause a storm of I/O traffic that would really mess up the
machine.  The jobs that were in the low-latency bucket would be
protected (since we didn't run with journalling), but those that were
in the best-efforts bucket would be really unhappy.

If we have a "don't throttle me now" REQ_ROOT flag combined with
journalling, then someone running "sync", even if it's by accident,
could really ruin a low-latency job's day, and in a container
environment, there really is no reason for a non-root user to be
wanting to request a syncfs(2) or sync(2).  So maybe we should have a
way to make it be a no-op (or return an error, but that might surprise
some applications) for non-privileged users.  Maybe as a per-mount
flag/option, or via some other tunable?

						- Ted

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 17:38             ` Paolo Valente
@ 2019-05-21 22:51               ` Srivatsa S. Bhat
  2019-05-22  8:05                 ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-21 22:51 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[ Resending this mail with a dropbox link to the traces (instead
of a file attachment), since it didn't go through the last time. ]

On 5/21/19 10:38 AM, Paolo Valente wrote:
> 
>> So, instead of only sending me a trace, could you please:
>> 1) apply this new patch on top of the one I attached in my previous email
>> 2) repeat your test and report results
> 
> One last thing (I swear!): as you can see from my script, I tested the
> case low_latency=0 so far.  So please, for the moment, do your test
> with low_latency=0.  You find the whole path to this parameter in,
> e.g., my script.
> 
No problem! :) Thank you for sharing patches for me to test!

I have good news :) Your patch improves the throughput significantly
when low_latency = 0.

Without any patch:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s


With both patches applied:

dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
10000+0 records in
10000+0 records out
5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s

The performance is still not as good as mq-deadline (which achieves
1.6 MB/s), but this is a huge improvement for BFQ nonetheless!

A tarball with the trace output from the 2 scenarios you requested,
one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
and another with both patches applied (trace-bfq-boost-injection) is
available here:

https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0

Thank you!
 
Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-21 22:51               ` Srivatsa S. Bhat
@ 2019-05-22  8:05                 ` Paolo Valente
  2019-05-22  9:02                   ` Srivatsa S. Bhat
  2019-05-22  9:09                   ` Paolo Valente
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-22  8:05 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 2121 bytes --]



> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> [ Resending this mail with a dropbox link to the traces (instead
> of a file attachment), since it didn't go through the last time. ]
> 
> On 5/21/19 10:38 AM, Paolo Valente wrote:
>> 
>>> So, instead of only sending me a trace, could you please:
>>> 1) apply this new patch on top of the one I attached in my previous email
>>> 2) repeat your test and report results
>> 
>> One last thing (I swear!): as you can see from my script, I tested the
>> case low_latency=0 so far.  So please, for the moment, do your test
>> with low_latency=0.  You find the whole path to this parameter in,
>> e.g., my script.
>> 
> No problem! :) Thank you for sharing patches for me to test!
> 
> I have good news :) Your patch improves the throughput significantly
> when low_latency = 0.
> 
> Without any patch:
> 
> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
> 
> 
> With both patches applied:
> 
> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
> 
> The performance is still not as good as mq-deadline (which achieves
> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
> 
> A tarball with the trace output from the 2 scenarios you requested,
> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
> and another with both patches applied (trace-bfq-boost-injection) is
> available here:
> 
> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
> 

Hi Srivatsa,
I've seen the bugzilla you've created.  I'm a little confused on how
to better proceed.  Shall we move this discussion to the bugzilla, or
should we continue this discussion here, where it has started, and
then update the bugzilla?

Let me know,
Paolo

> Thank you!
> 
> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22  8:05                 ` Paolo Valente
@ 2019-05-22  9:02                   ` Srivatsa S. Bhat
  2019-05-22  9:12                     ` Paolo Valente
  2019-05-22  9:09                   ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-22  9:02 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/22/19 1:05 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> [ Resending this mail with a dropbox link to the traces (instead
>> of a file attachment), since it didn't go through the last time. ]
>>
>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>>
>>>> So, instead of only sending me a trace, could you please:
>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>> 2) repeat your test and report results
>>>
>>> One last thing (I swear!): as you can see from my script, I tested the
>>> case low_latency=0 so far.  So please, for the moment, do your test
>>> with low_latency=0.  You find the whole path to this parameter in,
>>> e.g., my script.
>>>
>> No problem! :) Thank you for sharing patches for me to test!
>>
>> I have good news :) Your patch improves the throughput significantly
>> when low_latency = 0.
>>
>> Without any patch:
>>
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>>
>>
>> With both patches applied:
>>
>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>>
>> The performance is still not as good as mq-deadline (which achieves
>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>>
>> A tarball with the trace output from the 2 scenarios you requested,
>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>> and another with both patches applied (trace-bfq-boost-injection) is
>> available here:
>>
>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>>
> 
> Hi Srivatsa,
> I've seen the bugzilla you've created.  I'm a little confused on how
> to better proceed.  Shall we move this discussion to the bugzilla, or
> should we continue this discussion here, where it has started, and
> then update the bugzilla?
> 

Let's continue here on LKML itself. The only reason I created the
bugzilla entry was to attach the tarball of the traces, assuming
that it would allow me to upload a 20 MB file (since the email
attachment didn't go through). But bugzilla's file size limit is much
smaller than that, so it didn't work out either, and I resorted to
using dropbox. So we don't need the bugzilla entry anymore; I might as
well close it to avoid confusion.

Regards,
Srivatsa
VMware Photon OS


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22  8:05                 ` Paolo Valente
  2019-05-22  9:02                   ` Srivatsa S. Bhat
@ 2019-05-22  9:09                   ` Paolo Valente
  2019-05-22 10:01                     ` Srivatsa S. Bhat
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-22  9:09 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 3446 bytes --]



> Il giorno 22 mag 2019, alle ore 10:05, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>> 
>> [ Resending this mail with a dropbox link to the traces (instead
>> of a file attachment), since it didn't go through the last time. ]
>> 
>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>> 
>>>> So, instead of only sending me a trace, could you please:
>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>> 2) repeat your test and report results
>>> 
>>> One last thing (I swear!): as you can see from my script, I tested the
>>> case low_latency=0 so far.  So please, for the moment, do your test
>>> with low_latency=0.  You find the whole path to this parameter in,
>>> e.g., my script.
>>> 
>> No problem! :) Thank you for sharing patches for me to test!
>> 
>> I have good news :) Your patch improves the throughput significantly
>> when low_latency = 0.
>> 
>> Without any patch:
>> 
>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>> 
>> 
>> With both patches applied:
>> 
>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>> 10000+0 records in
>> 10000+0 records out
>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>> 
>> The performance is still not as good as mq-deadline (which achieves
>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>> 
>> A tarball with the trace output from the 2 scenarios you requested,
>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>> and another with both patches applied (trace-bfq-boost-injection) is
>> available here:
>> 
>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>> 
> 
> Hi Srivatsa,
> I've seen the bugzilla you've created.  I'm a little confused on how
> to better proceed.  Shall we move this discussion to the bugzilla, or
> should we continue this discussion here, where it has started, and
> then update the bugzilla?
> 

Ok, I've received some feedback on this point, and I'll continue the
discussion here.  Then I'll report back on the bugzilla.

First, thank you very much for testing my patches, and, above all, for
sharing those huge traces!

According to your traces, the residual 20% lower throughput that you
record is due to the fact that the BFQ injection mechanism takes a few
hundredths of seconds to stabilize, at the beginning of the workload.
During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
that you see without this new patch.  After that time, there
seems to be no loss according to the trace.

The problem is that a loss lasting only a few hundredths of seconds is
however not negligible for a write workload that lasts only 3-4
seconds.  Could you please try writing a larger file?
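
For instance (purely an illustrative invocation; any count large enough
to make the run last a minute or more would do):

dd if=/dev/zero of=/root/test0.img bs=512 count=200000 oflag=dsync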

In addition, I wanted to ask you whether you measured BFQ throughput
with traces disabled.  This may make a difference.

After trying writing a larger file, you can try with low_latency on.
On my side, it causes results to become a little unstable across
repetitions (which is expected).
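
In case it is useful, low_latency can be toggled at run time through
sysfs; sda below is just an example device name:

cat /sys/block/sda/queue/scheduler                  # check that bfq is the active scheduler
echo 1 > /sys/block/sda/queue/iosched/low_latency   # and 0 to switch it back off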

Thanks,
Paolo


> Let me know,
> Paolo
> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22  9:02                   ` Srivatsa S. Bhat
@ 2019-05-22  9:12                     ` Paolo Valente
  2019-05-22 10:02                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-22  9:12 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 3091 bytes --]



> Il giorno 22 mag 2019, alle ore 11:02, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/22/19 1:05 AM, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 22 mag 2019, alle ore 00:51, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> [ Resending this mail with a dropbox link to the traces (instead
>>> of a file attachment), since it didn't go through the last time. ]
>>> 
>>> On 5/21/19 10:38 AM, Paolo Valente wrote:
>>>> 
>>>>> So, instead of only sending me a trace, could you please:
>>>>> 1) apply this new patch on top of the one I attached in my previous email
>>>>> 2) repeat your test and report results
>>>> 
>>>> One last thing (I swear!): as you can see from my script, I tested the
>>>> case low_latency=0 so far.  So please, for the moment, do your test
>>>> with low_latency=0.  You find the whole path to this parameter in,
>>>> e.g., my script.
>>>> 
>>> No problem! :) Thank you for sharing patches for me to test!
>>> 
>>> I have good news :) Your patch improves the throughput significantly
>>> when low_latency = 0.
>>> 
>>> Without any patch:
>>> 
>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 58.0915 s, 88.1 kB/s
>>> 
>>> 
>>> With both patches applied:
>>> 
>>> dd if=/dev/zero of=/root/test0.img bs=512 count=10000 oflag=dsync
>>> 10000+0 records in
>>> 10000+0 records out
>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.87487 s, 1.3 MB/s
>>> 
>>> The performance is still not as good as mq-deadline (which achieves
>>> 1.6 MB/s), but this is a huge improvement for BFQ nonetheless!
>>> 
>>> A tarball with the trace output from the 2 scenarios you requested,
>>> one with only the debug patch applied (trace-bfq-add-logs-and-BUG_ONs),
>>> and another with both patches applied (trace-bfq-boost-injection) is
>>> available here:
>>> 
>>> https://www.dropbox.com/s/pdf07vi7afido7e/bfq-traces.tar.gz?dl=0
>>> 
>> 
>> Hi Srivatsa,
>> I've seen the bugzilla you've created.  I'm a little confused on how
>> to better proceed.  Shall we move this discussion to the bugzilla, or
>> should we continue this discussion here, where it has started, and
>> then update the bugzilla?
>> 
> 
> Let's continue here on LKML itself.

Just done :)

> The only reason I created the
> bugzilla entry is to attach the tarball of the traces, assuming
> that it would allow me to upload a 20 MB file (since email attachment
> didn't work). But bugzilla's file restriction is much smaller than
> that, so it didn't work out either, and I resorted to using dropbox.
> So we don't need the bugzilla entry anymore; I might as well close it
> to avoid confusion.
> 

No no, don't close it: it can reach people that don't use LKML.  We
just have to remember to report back at the end of this.  BTW, I also
think that the bug is incorrectly filed against 5.1, while all these
tests and results concern 5.2-rcX.

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22  9:09                   ` Paolo Valente
@ 2019-05-22 10:01                     ` Srivatsa S. Bhat
  2019-05-22 10:54                       ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-22 10:01 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/22/19 2:09 AM, Paolo Valente wrote:
> 
> First, thank you very much for testing my patches, and, above all, for
> sharing those huge traces!
> 
> According to your traces, the residual 20% lower throughput that you
> record is due to the fact that the BFQ injection mechanism takes a few
> hundredths of seconds to stabilize, at the beginning of the workload.
> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
> that you see without this new patch.  After that time, there
> seems to be no loss according to the trace.
> 
> The problem is that a loss lasting only a few hundredths of seconds is
> however not negligible for a write workload that lasts only 3-4
> seconds.  Could you please try writing a larger file?
> 

I tried running dd for longer (about 100 seconds), but still saw around
1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
mq-deadline and noop. But I'm not too worried about that difference.

> In addition, I wanted to ask you whether you measured BFQ throughput
> with traces disabled.  This may make a difference.
> 

The above result (1.4 MB/s) was obtained with traces disabled.

> After trying writing a larger file, you can try with low_latency on.
> On my side, it causes results to become a little unstable across
> repetitions (which is expected).
> 
With low_latency on, I get between 60 KB/s - 100 KB/s.

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22  9:12                     ` Paolo Valente
@ 2019-05-22 10:02                       ` Srivatsa S. Bhat
  0 siblings, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-22 10:02 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/22/19 2:12 AM, Paolo Valente wrote:
> 
>> Il giorno 22 mag 2019, alle ore 11:02, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>>
>> Let's continue here on LKML itself.
> 
> Just done :)
> 
>> The only reason I created the
>> bugzilla entry is to attach the tarball of the traces, assuming
>> that it would allow me to upload a 20 MB file (since email attachment
>> didn't work). But bugzilla's file restriction is much smaller than
>> that, so it didn't work out either, and I resorted to using dropbox.
>> So we don't need the bugzilla entry anymore; I might as well close it
>> to avoid confusion.
>>
> 
> No no, don't close it: it can reach people that don't use LKML.  We
> just have to remember to report back at the end of this.

Ah, good point!

>  BTW, I also
> think that the bug is incorrectly filed against 5.1, while all these
> tests and results concern 5.2-rcX.
> 

Fixed now, thank you for pointing that out!
 
Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22 10:01                     ` Srivatsa S. Bhat
@ 2019-05-22 10:54                       ` Paolo Valente
  2019-05-23  2:30                         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-22 10:54 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 3322 bytes --]



> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/22/19 2:09 AM, Paolo Valente wrote:
>> 
>> First, thank you very much for testing my patches, and, above all, for
>> sharing those huge traces!
>> 
>> According to your traces, the residual 20% lower throughput that you
>> record is due to the fact that the BFQ injection mechanism takes a few
>> hundredths of seconds to stabilize, at the beginning of the workload.
>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>> that you see without this new patch.  After that time, there
>> seems to be no loss according to the trace.
>> 
>> The problem is that a loss lasting only a few hundredths of seconds is
>> however not negligible for a write workload that lasts only 3-4
>> seconds.  Could you please try writing a larger file?
>> 
> 
> I tried running dd for longer (about 100 seconds), but still saw around
> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
> mq-deadline and noop.

Ok, then now the cause is the periodic reset of the mechanism.

It would be super easy to fill this gap, by just gearing the mechanism
toward a very aggressive injection.  The problem is maintaining
control.  As you can imagine from the performance gap between CFQ (or
BFQ with malfunctioning injection) and BFQ with this fix, it is very
hard to succeed in maximizing the throughput while at the same time
preserving control on per-group I/O.

On the bright side, you might be interested in one of the benefits
that BFQ gives in return for this ~10% loss of throughput, in a
scenario that may be important for you (according to affiliation you
report): from ~500% to ~1000% higher throughput when you have to serve
the I/O of multiple VMs, and to guarantee at least no starvation to
any VM [1].  The same holds with multiple clients or containers, and
in general with any set of entities that may compete for storage.

[1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> But I'm not too worried about that difference.
> 
>> In addition, I wanted to ask you whether you measured BFQ throughput
>> with traces disabled.  This may make a difference.
>> 
> 
> The above result (1.4 MB/s) was obtained with traces disabled.
> 
>> After trying writing a larger file, you can try with low_latency on.
>> On my side, it causes results to become a little unstable across
>> repetitions (which is expected).
>> 
> With low_latency on, I get between 60 KB/s - 100 KB/s.
> 

Gosh, full regression.  Fortunately, it is simply meaningless to use
low_latency in a scenario where the goal is to guarantee per-group
bandwidths.  Low-latency heuristics, to reach their (low-latency)
goals, modify the I/O schedule compared to the best schedule for
honoring group weights and boosting throughput.  So, as recommended in
BFQ documentation, just switch low_latency off if you want to control
I/O with groups.  It may still make sense to leave low_latency on
in some specific case, which I don't want to bother you about.

However, I feel bad with such a low throughput :)  Would you be so
kind to provide me with a trace?

Thanks,
Paolo

> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-22 10:54                       ` Paolo Valente
@ 2019-05-23  2:30                         ` Srivatsa S. Bhat
  2019-05-23  9:19                           ` Paolo Valente
  2019-05-23 23:32                           ` Srivatsa S. Bhat
  0 siblings, 2 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-23  2:30 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/22/19 3:54 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>
>>> First, thank you very much for testing my patches, and, above all, for
>>> sharing those huge traces!
>>>
>>> According to your traces, the residual 20% lower throughput that you
>>> record is due to the fact that the BFQ injection mechanism takes a few
>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>> that you see without this new patch.  After that time, there
>>> seems to be no loss according to the trace.
>>>
>>> The problem is that a loss lasting only a few hundredths of seconds is
>>> however not negligible for a write workload that lasts only 3-4
>>> seconds.  Could you please try writing a larger file?
>>>
>>
>> I tried running dd for longer (about 100 seconds), but still saw around
>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
>> mq-deadline and noop.
> 
> Ok, then now the cause is the periodic reset of the mechanism.
> 
> It would be super easy to fill this gap, by just gearing the mechanism
> toward a very aggressive injection.  The problem is maintaining
> control.  As you can imagine from the performance gap between CFQ (or
> BFQ with malfunctioning injection) and BFQ with this fix, it is very
> hard to succeed in maximizing the throughput while at the same time
> preserving control on per-group I/O.
> 

Ah, I see. Just to make sure that this fix doesn't overly optimize for
total throughput (because of the testcase we've been using) and end up
causing regressions in per-group I/O control, I ran a test with
multiple simultaneous dd instances, each writing to a different
portion of the filesystem (well separated, to induce seeks), and each
dd task bound to its own blkio cgroup. I saw similar results with and
without this patch, and the throughput was equally distributed among
all the dd tasks.
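
For reference, the setup was along these lines (a rough sketch only:
the cgroup names, mount points and file paths below are placeholders
rather than my actual script, and it assumes the legacy cgroup v1
blkio hierarchy):

for i in 1 2 3 4; do
    # one blkio cgroup per dd instance
    mkdir -p /sys/fs/cgroup/blkio/ddtest$i
    # put a fresh shell into that cgroup, then exec dd so it inherits it
    bash -c "echo \$\$ > /sys/fs/cgroup/blkio/ddtest$i/cgroup.procs; \
             exec dd if=/dev/zero of=/mnt/ext4/dir$i/test.img \
                  bs=512 count=100000 oflag=dsync" &
done
wait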

> On the bright side, you might be interested in one of the benefits
> that BFQ gives in return for this ~10% loss of throughput, in a
> scenario that may be important for you (according to affiliation you
> report): from ~500% to ~1000% higher throughput when you have to serve
> the I/O of multiple VMs, and to guarantee at least no starvation to
> any VM [1].  The same holds with multiple clients or containers, and
> in general with any set of entities that may compete for storage.
> 
> [1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
> 

Great article! :) Thank you for sharing it!

>> But I'm not too worried about that difference.
>>
>>> In addition, I wanted to ask you whether you measured BFQ throughput
>>> with traces disabled.  This may make a difference.
>>>
>>
>> The above result (1.4 MB/s) was obtained with traces disabled.
>>
>>> After trying writing a larger file, you can try with low_latency on.
>>> On my side, it causes results to become a little unstable across
>>> repetitions (which is expected).
>>>
>> With low_latency on, I get between 60 KB/s - 100 KB/s.
>>
> 
> Gosh, full regression.  Fortunately, it is simply meaningless to use
> low_latency in a scenario where the goal is to guarantee per-group
> bandwidths.  Low-latency heuristics, to reach their (low-latency)
> goals, modify the I/O schedule compared to the best schedule for
> honoring group weights and boosting throughput.  So, as recommended in
> BFQ documentation, just switch low_latency off if you want to control
> I/O with groups.  It may still make sense to leave low_latency on
> in some specific case, which I don't want to bother you about.
> 

My main concern here is about Linux's I/O performance out-of-the-box,
i.e., with all default settings, which are:

- cgroups and blkio enabled (systemd default)
- blkio non-root cgroups in use (this is the implicit systemd behavior
  if docker is installed; i.e., it runs tasks under user.slice)
- I/O scheduler with blkio group sched support: bfq
- bfq default configuration: low_latency = 1

If this yields a throughput that is 10x-30x slower than what is
achievable, I think we should either fix the code (if possible) or
change the defaults such that they don't lead to this performance
collapse (perhaps default low_latency to 0 if bfq group scheduling
is in use?)
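
For completeness, this state is easy to inspect on a running system
(sda is just an example device):

cat /sys/block/sda/queue/scheduler            # the active scheduler is shown in brackets, e.g. "[bfq]"
cat /sys/block/sda/queue/iosched/low_latency  # prints 1 with the bfq defaults above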

> However, I feel bad with such a low throughput :)  Would you be so
> kind to provide me with a trace?
> 
Certainly! Short runs of dd resulted in a lot of variation in the
throughput (between 60 KB/s - 1 MB/s), so I increased dd's runtime
to get repeatable numbers (~70 KB/s). As a result, the trace file
(trace-bfq-boost-injection-low-latency-71KBps) is quite large, and
is available here:

https://www.dropbox.com/s/svqfbv0idcg17pn/bfq-traces.tar.gz?dl=0

Also, I'm very happy to run additional tests or experiments to help
track down this issue. So, please don't hesitate to let me know if
you'd like me to try anything else or get you additional traces etc. :)

Thank you!

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23  2:30                         ` Srivatsa S. Bhat
@ 2019-05-23  9:19                           ` Paolo Valente
  2019-05-23 17:22                             ` Paolo Valente
  2019-05-23 23:32                           ` Srivatsa S. Bhat
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-23  9:19 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 6826 bytes --]



> Il giorno 23 mag 2019, alle ore 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/22/19 3:54 AM, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>> 
>>>> First, thank you very much for testing my patches, and, above all, for
>>>> sharing those huge traces!
>>>> 
>>>> According to your traces, the residual 20% lower throughput that you
>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>> that you see without this new patch.  After that time, there
>>>> seems to be no loss according to the trace.
>>>> 
>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>> however not negligible for a write workload that lasts only 3-4
>>>> seconds.  Could you please try writing a larger file?
>>>> 
>>> 
>>> I tried running dd for longer (about 100 seconds), but still saw around
>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
>>> mq-deadline and noop.
>> 
>> Ok, then now the cause is the periodic reset of the mechanism.
>> 
>> It would be super easy to fill this gap, by just gearing the mechanism
>> toward a very aggressive injection.  The problem is maintaining
>> control.  As you can imagine from the performance gap between CFQ (or
>> BFQ with malfunctioning injection) and BFQ with this fix, it is very
>> hard to succeed in maximizing the throughput while at the same time
>> preserving control on per-group I/O.
>> 
> 
> Ah, I see. Just to make sure that this fix doesn't overly optimize for
> total throughput (because of the testcase we've been using) and end up
> causing regressions in per-group I/O control, I ran a test with
> multiple simultaneous dd instances, each writing to a different
> portion of the filesystem (well separated, to induce seeks), and each
> dd task bound to its own blkio cgroup. I saw similar results with and
> without this patch, and the throughput was equally distributed among
> all the dd tasks.
> 

Thank you very much for pre-testing this change; this lets me know in
advance that I shouldn't find issues when I test for regressions at
the end of this change phase.

>> On the bright side, you might be interested in one of the benefits
>> that BFQ gives in return for this ~10% loss of throughput, in a
>> scenario that may be important for you (according to affiliation you
>> report): from ~500% to ~1000% higher throughput when you have to serve
>> the I/O of multiple VMs, and to guarantee at least no starvation to
>> any VM [1].  The same holds with multiple clients or containers, and
>> in general with any set of entities that may compete for storage.
>> 
>> [1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>> 
> 
> Great article! :) Thank you for sharing it!

Thanks! I mentioned it just to better put things into context.

> 
>>> But I'm not too worried about that difference.
>>> 
>>>> In addition, I wanted to ask you whether you measured BFQ throughput
>>>> with traces disabled.  This may make a difference.
>>>> 
>>> 
>>> The above result (1.4 MB/s) was obtained with traces disabled.
>>> 
>>>> After trying writing a larger file, you can try with low_latency on.
>>>> On my side, it causes results to become a little unstable across
>>>> repetitions (which is expected).
>>>> 
>>> With low_latency on, I get between 60 KB/s - 100 KB/s.
>>> 
>> 
>> Gosh, full regression.  Fortunately, it is simply meaningless to use
>> low_latency in a scenario where the goal is to guarantee per-group
>> bandwidths.  Low-latency heuristics, to reach their (low-latency)
>> goals, modify the I/O schedule compared to the best schedule for
>> honoring group weights and boosting throughput.  So, as recommended in
>> BFQ documentation, just switch low_latency off if you want to control
>> I/O with groups.  It may still make sense to leave low_latency on
>> in some specific case, which I don't want to bother you about.
>> 
> 
> My main concern here is about Linux's I/O performance out-of-the-box,
> i.e., with all default settings, which are:
> 
> - cgroups and blkio enabled (systemd default)
> - blkio non-root cgroups in use (this is the implicit systemd behavior
>  if docker is installed; i.e., it runs tasks under user.slice)
> - I/O scheduler with blkio group sched support: bfq
> - bfq default configuration: low_latency = 1
> 
> If this yields a throughput that is 10x-30x slower than what is
> achievable, I think we should either fix the code (if possible) or
> change the defaults such that they don't lead to this performance
> collapse (perhaps default low_latency to 0 if bfq group scheduling
> is in use?)

Yeah, I thought of this after sending my last email yesterday.  Group
scheduling and low-latency heuristics may simply happen to fight
against each other in personal systems.  Let's proceed this way.  I'll
try first to make the BFQ low-latency mechanism clever enough to not
hinder throughput when groups are in place.  If I make it, then we
will get the best of the two worlds: group isolation and intra-group
low latency; with no configuration change needed.  If I don't make it,
I'll try to think of the best solution to cope with this non-trivial
situation.


>> However, I feel bad with such a low throughput :)  Would you be so
>> kind to provide me with a trace?
>> 
> Certainly! Short runs of dd resulted in a lot of variation in the
> throughput (between 60 KB/s - 1 MB/s), so I increased dd's runtime
> to get repeatable numbers (~70 KB/s). As a result, the trace file
> (trace-bfq-boost-injection-low-latency-71KBps) is quite large, and
> is available here:
> 
> https://www.dropbox.com/s/svqfbv0idcg17pn/bfq-traces.tar.gz?dl=0
> 

Thank you very much for your patience and professional help.

> Also, I'm very happy to run additional tests or experiments to help
> track down this issue. So, please don't hesitate to let me know if
> you'd like me to try anything else or get you additional traces etc. :)
> 

Here's to you!  :) I've attached a new small improvement that may
reduce fluctuations (patch to apply on top of the others, of course).
Unfortunately, I don't expect this change to boost the throughput
though.

In contrast, I've thought of a solution that might be rather
effective: making BFQ aware (heuristically) of trivial
synchronizations between processes in different groups.  This will
require a little more work and time.


Thanks,
Paolo


[-- Attachment #1.2: 0001-block-bfq-re-sample-req-service-times-when-possible.patch.gz --]
[-- Type: application/x-gzip, Size: 666 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 60 bytes --]



> Thank you!
> 
> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23  9:19                           ` Paolo Valente
@ 2019-05-23 17:22                             ` Paolo Valente
  2019-05-23 23:43                               ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-23 17:22 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 7728 bytes --]



> Il giorno 23 mag 2019, alle ore 11:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 23 mag 2019, alle ore 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>> 
>> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>> 
>>> 
>>>> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>> 
>>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>> 
>>>>> First, thank you very much for testing my patches, and, above all, for
>>>>> sharing those huge traces!
>>>>> 
>>>>> According to your traces, the residual 20% lower throughput that you
>>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>>> that you see without this new patch.  After that time, there
>>>>> seems to be no loss according to the trace.
>>>>> 
>>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>>> however not negligible for a write workload that lasts only 3-4
>>>>> seconds.  Could you please try writing a larger file?
>>>>> 
>>>> 
>>>> I tried running dd for longer (about 100 seconds), but still saw around
>>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
>>>> mq-deadline and noop.
>>> 
>>> Ok, then now the cause is the periodic reset of the mechanism.
>>> 
>>> It would be super easy to fill this gap, by just gearing the mechanism
>>> toward a very aggressive injection.  The problem is maintaining
>>> control.  As you can imagine from the performance gap between CFQ (or
>>> BFQ with malfunctioning injection) and BFQ with this fix, it is very
>>> hard to succeed in maximizing the throughput while at the same time
>>> preserving control on per-group I/O.
>>> 
>> 
>> Ah, I see. Just to make sure that this fix doesn't overly optimize for
>> total throughput (because of the testcase we've been using) and end up
>> causing regressions in per-group I/O control, I ran a test with
>> multiple simultaneous dd instances, each writing to a different
>> portion of the filesystem (well separated, to induce seeks), and each
>> dd task bound to its own blkio cgroup. I saw similar results with and
>> without this patch, and the throughput was equally distributed among
>> all the dd tasks.
>> 
> 
> Thank you very much for pre-testing this change; this lets me know in
> advance that I shouldn't find issues when I test for regressions at
> the end of this change phase.
> 
>>> On the bright side, you might be interested in one of the benefits
>>> that BFQ gives in return for this ~10% loss of throughput, in a
>>> scenario that may be important for you (according to affiliation you
>>> report): from ~500% to ~1000% higher throughput when you have to serve
>>> the I/O of multiple VMs, and to guarantee at least no starvation to
>>> any VM [1].  The same holds with multiple clients or containers, and
>>> in general with any set of entities that may compete for storage.
>>> 
>>> [1] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
>>> 
>> 
>> Great article! :) Thank you for sharing it!
> 
> Thanks! I mentioned it just to better put things into context.
> 
>> 
>>>> But I'm not too worried about that difference.
>>>> 
>>>>> In addition, I wanted to ask you whether you measured BFQ throughput
>>>>> with traces disabled.  This may make a difference.
>>>>> 
>>>> 
>>>> The above result (1.4 MB/s) was obtained with traces disabled.
>>>> 
>>>>> After trying writing a larger file, you can try with low_latency on.
>>>>> On my side, it causes results to become a little unstable across
>>>>> repetitions (which is expected).
>>>>> 
>>>> With low_latency on, I get between 60 KB/s - 100 KB/s.
>>>> 
>>> 
>>> Gosh, full regression.  Fortunately, it is simply meaningless to use
>>> low_latency in a scenario where the goal is to guarantee per-group
>>> bandwidths.  Low-latency heuristics, to reach their (low-latency)
>>> goals, modify the I/O schedule compared to the best schedule for
>>> honoring group weights and boosting throughput.  So, as recommended in
>>> BFQ documentation, just switch low_latency off if you want to control
>>> I/O with groups.  It may still make sense to leave low_latency on
>>> in some specific case, which I don't want to bother you about.
>>> 
>> 
>> My main concern here is about Linux's I/O performance out-of-the-box,
>> i.e., with all default settings, which are:
>> 
>> - cgroups and blkio enabled (systemd default)
>> - blkio non-root cgroups in use (this is the implicit systemd behavior
>> if docker is installed; i.e., it runs tasks under user.slice)
>> - I/O scheduler with blkio group sched support: bfq
>> - bfq default configuration: low_latency = 1
>> 
>> If this yields a throughput that is 10x-30x slower than what is
>> achievable, I think we should either fix the code (if possible) or
>> change the defaults such that they don't lead to this performance
>> collapse (perhaps default low_latency to 0 if bfq group scheduling
>> is in use?)
> 
> Yeah, I thought of this after sending my last email yesterday. Group
> scheduling and low-latency heuristics may simply happen to fight
> against each other in personal systems.  Let's proceed this way. I'll
> try first to make the BFQ low-latency mechanism clever enough to not
> hinder throughput when groups are in place.  If I make it, then we
> will get the best of the two worlds: group isolation and intra-group
> low latency; with no configuration change needed.  If I don't make it,
> I'll try to think of the best solution to cope with this non-trivial
> situation.
> 
> 
>>> However, I feel bad with such a low throughput :)  Would you be so
>>> kind to provide me with a trace?
>>> 
>> Certainly! Short runs of dd resulted in a lot of variation in the
>> throughput (between 60 KB/s - 1 MB/s), so I increased dd's runtime
>> to get repeatable numbers (~70 KB/s). As a result, the trace file
>> (trace-bfq-boost-injection-low-latency-71KBps) is quite large, and
>> is available here:
>> 
>> https://www.dropbox.com/s/svqfbv0idcg17pn/bfq-traces.tar.gz?dl=0
>> 
> 
> Thank you very much for your patience and professional help.
> 
>> Also, I'm very happy to run additional tests or experiments to help
>> track down this issue. So, please don't hesitate to let me know if
>> you'd like me to try anything else or get you additional traces etc. :)
>> 
> 
> Here's to you!  :) I've attached a new small improvement that may
> reduce fluctuations (patch to apply on top of the others, of course).
> Unfortunately, I don't expect this change to boost the throughput
> though.
> 
> In contrast, I've thought of a solution that might be rather
> effective: making BFQ aware (heuristically) of trivial
> synchronizations between processes in different groups.  This will
> require a little more work and time.
> 

Hi Srivatsa,
I'm back :)

First, there was a mistake in the last patch I sent you, namely in
0001-block-bfq-re-sample-req-service-times-when-possible.patch.
Please don't apply that patch at all.

I've attached a new series of patches instead.  The first patch in this
series is a fixed version of the faulty patch above (if I'm creating too
much confusion, I'll send you again all patches to apply on top of
mainline).
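
In case it helps, my assumption is that you will back out the faulty
patch and then apply the attached series in order; for example (only a
sketch, valid only if the faulty patch went in with "patch -p1", and
with paths adjusted to wherever the tarball extracts):

patch -R -p1 < 0001-block-bfq-re-sample-req-service-times-when-possible.patch
tar xzf patches-with-waker-detection.tgz
for p in 00*.patch; do patch -p1 < "$p"; done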

This series also implements the more effective idea I told you a few
hours ago.  In my system, the loss is now around only 10%, even with
low_latency on.

Looking forward to your results,
Paolo


[-- Attachment #1.2: patches-with-waker-detection.tgz --]
[-- Type: application/octet-stream, Size: 2956 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 162 bytes --]



> 
> Thanks,
> Paolo
> 
> <0001-block-bfq-re-sample-req-service-times-when-possible.patch.gz>
> 
>> Thank you!
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23  2:30                         ` Srivatsa S. Bhat
  2019-05-23  9:19                           ` Paolo Valente
@ 2019-05-23 23:32                           ` Srivatsa S. Bhat
  2019-05-30  8:38                             ` Srivatsa S. Bhat
  1 sibling, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-23 23:32 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/22/19 7:30 PM, Srivatsa S. Bhat wrote:
> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>
>>
>>> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>
>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>
>>>> First, thank you very much for testing my patches, and, above all, for
>>>> sharing those huge traces!
>>>>
>>>> According to your traces, the residual 20% lower throughput that you
>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>> that you see without this new patch.  After that time, there
>>>> seems to be no loss according to the trace.
>>>>
>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>> however not negligible for a write workload that lasts only 3-4
>>>> seconds.  Could you please try writing a larger file?
>>>>
>>>
>>> I tried running dd for longer (about 100 seconds), but still saw around
>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
>>> mq-deadline and noop.
>>
>> Ok, then now the cause is the periodic reset of the mechanism.
>>
>> It would be super easy to fill this gap, by just gearing the mechanism
>> toward a very aggressive injection.  The problem is maintaining
>> control.  As you can imagine from the performance gap between CFQ (or
>> BFQ with malfunctioning injection) and BFQ with this fix, it is very
>> hard to succeed in maximizing the throughput while at the same time
>> preserving control on per-group I/O.
>>
> 
> Ah, I see. Just to make sure that this fix doesn't overly optimize for
> total throughput (because of the testcase we've been using) and end up
> causing regressions in per-group I/O control, I ran a test with
> multiple simultaneous dd instances, each writing to a different
> portion of the filesystem (well separated, to induce seeks), and each
> dd task bound to its own blkio cgroup. I saw similar results with and
> without this patch, and the throughput was equally distributed among
> all the dd tasks.
> 
Actually, it turns out that I ran the dd tasks directly on the block
device for this experiment, and not on top of ext4. I'll redo this on
ext4 and report back soon.

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23 17:22                             ` Paolo Valente
@ 2019-05-23 23:43                               ` Srivatsa S. Bhat
  2019-05-24  6:51                                 ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-23 23:43 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/23/19 10:22 AM, Paolo Valente wrote:
> 
>> Il giorno 23 mag 2019, alle ore 11:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>
>>> Il giorno 23 mag 2019, alle ore 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>
[...]
>>> Also, I'm very happy to run additional tests or experiments to help
>>> track down this issue. So, please don't hesitate to let me know if
>>> you'd like me to try anything else or get you additional traces etc. :)
>>>
>>
>> Here's to you!  :) I've attached a new small improvement that may
>> reduce fluctuations (patch to apply on top of the others, of course).
>> Unfortunately, I don't expect this change to boost the throughput
>> though.
>>
>> In contrast, I've thought of a solution that might be rather
>> effective: making BFQ aware (heuristically) of trivial
>> synchronizations between processes in different groups.  This will
>> require a little more work and time.
>>
> 
> Hi Srivatsa,
> I'm back :)
> 
> First, there was a mistake in the last patch I sent you, namely in
> 0001-block-bfq-re-sample-req-service-times-when-possible.patch.
> Please don't apply that patch at all.
> 
> I've attached a new series of patches instead.  The first patch in this
> series is a fixed version of the faulty patch above (if I'm creating too
> much confusion, I'll send you again all patches to apply on top of
> mainline).
> 

No problem, I got it :)

> This series also implements the more effective idea I told you a few
> hours ago.  In my system, the loss is now around only 10%, even with
> low_latency on.
> 

When trying to run multiple dd tasks simultaneously, I get the kernel
panic shown below (mainline is fine, without these patches).

[  568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
[  568.232257] #PF: supervisor read access in kernel mode
[  568.232273] #PF: error_code(0x0000) - not-present page
[  568.232289] PGD 0 P4D 0
[  568.232299] Oops: 0000 [#1] SMP PTI
[  568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G            E     5.1.0-io-dbg-4+ #6
[  568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
[  568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
[  568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
[  568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
[  568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
[  568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
[  568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
[  568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
[  568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
[  568.232592] FS:  00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
[  568.232615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
[  568.232690] Call Trace:
[  568.232703]  bfq_select_queue+0x781/0x1000
[  568.232717]  bfq_dispatch_request+0x1d7/0xd60
[  568.232731]  ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
[  568.232751]  blk_mq_do_dispatch_sched+0xa8/0xe0
[  568.232765]  blk_mq_sched_dispatch_requests+0xe3/0x150
[  568.232783]  __blk_mq_run_hw_queue+0x56/0x100
[  568.232798]  __blk_mq_delay_run_hw_queue+0x107/0x160
[  568.232814]  blk_mq_run_hw_queue+0x75/0x190
[  568.232828]  blk_mq_sched_insert_requests+0x7a/0x100
[  568.232844]  blk_mq_flush_plug_list+0x1d7/0x280
[  568.232859]  blk_flush_plug_list+0xc2/0xe0
[  568.232872]  blk_finish_plug+0x2c/0x40
[  568.232886]  ext4_writepages+0x592/0xe60
[  568.233381]  ? ext4_mark_iloc_dirty+0x52b/0x860
[  568.233851]  do_writepages+0x3c/0xd0
[  568.234304]  ? ext4_mark_inode_dirty+0x1a0/0x1a0
[  568.234748]  ? do_writepages+0x3c/0xd0
[  568.235197]  ? __generic_write_end+0x4e/0x80
[  568.235644]  __filemap_fdatawrite_range+0xa5/0xe0
[  568.236089]  ? __filemap_fdatawrite_range+0xa5/0xe0
[  568.236533]  ? ext4_da_write_end+0x13c/0x280
[  568.236983]  file_write_and_wait_range+0x5a/0xb0
[  568.237407]  ext4_sync_file+0x11e/0x3e0
[  568.237819]  vfs_fsync_range+0x48/0x80
[  568.238217]  ext4_file_write_iter+0x234/0x3d0
[  568.238610]  ? _cond_resched+0x19/0x40
[  568.238982]  new_sync_write+0x112/0x190
[  568.239347]  __vfs_write+0x29/0x40
[  568.239705]  vfs_write+0xb1/0x1a0
[  568.240078]  ksys_write+0x89/0xc0
[  568.240428]  __x64_sys_write+0x1a/0x20
[  568.240771]  do_syscall_64+0x5b/0x140
[  568.241115]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  568.241456] RIP: 0033:0x7fb5b02325f4
[  568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[  568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
[  568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
[  568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
[  568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
[  568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
[  568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
[  568.248651] CR2: 0000000000000024
[  568.249142] ---[ end trace 0ddd315e0a5bdfba ]---


Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23 23:43                               ` Srivatsa S. Bhat
@ 2019-05-24  6:51                                 ` Paolo Valente
  2019-05-24  7:56                                   ` Paolo Valente
  2019-05-29  1:09                                   ` Srivatsa S. Bhat
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-24  6:51 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 6464 bytes --]



> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/23/19 10:22 AM, Paolo Valente wrote:
>> 
>>> Il giorno 23 mag 2019, alle ore 11:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>> 
>>>> Il giorno 23 mag 2019, alle ore 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>> 
> [...]
>>>> Also, I'm very happy to run additional tests or experiments to help
>>>> track down this issue. So, please don't hesitate to let me know if
>>>> you'd like me to try anything else or get you additional traces etc. :)
>>>> 
>>> 
>>> Here's to you!  :) I've attached a new small improvement that may
>>> reduce fluctuations (patch to apply on top of the others, of course).
>>> Unfortunately, I don't expect this change to boost the throughput
>>> though.
>>> 
>>> In contrast, I've thought of a solution that might be rather
>>> effective: making BFQ aware (heuristically) of trivial
>>> synchronizations between processes in different groups.  This will
>>> require a little more work and time.
>>> 
>> 
>> Hi Srivatsa,
>> I'm back :)
>> 
>> First, there was a mistake in the last patch I sent you, namely in
>> 0001-block-bfq-re-sample-req-service-times-when-possible.patch.
>> Please don't apply that patch at all.
>> 
>> I've attached a new series of patches instead.  The first patch in this
>> series is a fixed version of the faulty patch above (if I'm creating too
>> much confusion, I'll send you again all patches to apply on top of
>> mainline).
>> 
> 
> No problem, I got it :)
> 
>> This series also implements the more effective idea I told you a few
>> hours ago.  In my system, the loss is now around only 10%, even with
>> low_latency on.
>> 
> 
> When trying to run multiple dd tasks simultaneously, I get the kernel
> panic shown below (mainline is fine, without these patches).
> 

Could you please provide me somehow with the output of list *(bfq_serv_to_charge+0x21) ?
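
In case it helps, that listing can be obtained from your vmlinux (as
long as it was built with debug info), for example with:

gdb vmlinux
(gdb) list *(bfq_serv_to_charge+0x21)

or with the helper script shipped with the kernel sources:

./scripts/faddr2line vmlinux bfq_serv_to_charge+0x21/0x50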

Thanks,
Paolo

> [  568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
> [  568.232257] #PF: supervisor read access in kernel mode
> [  568.232273] #PF: error_code(0x0000) - not-present page
> [  568.232289] PGD 0 P4D 0
> [  568.232299] Oops: 0000 [#1] SMP PTI
> [  568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G            E     5.1.0-io-dbg-4+ #6
> [  568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
> [  568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
> [  568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
> [  568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
> [  568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
> [  568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
> [  568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
> [  568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
> [  568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
> [  568.232592] FS:  00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
> [  568.232615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
> [  568.232690] Call Trace:
> [  568.232703]  bfq_select_queue+0x781/0x1000
> [  568.232717]  bfq_dispatch_request+0x1d7/0xd60
> [  568.232731]  ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
> [  568.232751]  blk_mq_do_dispatch_sched+0xa8/0xe0
> [  568.232765]  blk_mq_sched_dispatch_requests+0xe3/0x150
> [  568.232783]  __blk_mq_run_hw_queue+0x56/0x100
> [  568.232798]  __blk_mq_delay_run_hw_queue+0x107/0x160
> [  568.232814]  blk_mq_run_hw_queue+0x75/0x190
> [  568.232828]  blk_mq_sched_insert_requests+0x7a/0x100
> [  568.232844]  blk_mq_flush_plug_list+0x1d7/0x280
> [  568.232859]  blk_flush_plug_list+0xc2/0xe0
> [  568.232872]  blk_finish_plug+0x2c/0x40
> [  568.232886]  ext4_writepages+0x592/0xe60
> [  568.233381]  ? ext4_mark_iloc_dirty+0x52b/0x860
> [  568.233851]  do_writepages+0x3c/0xd0
> [  568.234304]  ? ext4_mark_inode_dirty+0x1a0/0x1a0
> [  568.234748]  ? do_writepages+0x3c/0xd0
> [  568.235197]  ? __generic_write_end+0x4e/0x80
> [  568.235644]  __filemap_fdatawrite_range+0xa5/0xe0
> [  568.236089]  ? __filemap_fdatawrite_range+0xa5/0xe0
> [  568.236533]  ? ext4_da_write_end+0x13c/0x280
> [  568.236983]  file_write_and_wait_range+0x5a/0xb0
> [  568.237407]  ext4_sync_file+0x11e/0x3e0
> [  568.237819]  vfs_fsync_range+0x48/0x80
> [  568.238217]  ext4_file_write_iter+0x234/0x3d0
> [  568.238610]  ? _cond_resched+0x19/0x40
> [  568.238982]  new_sync_write+0x112/0x190
> [  568.239347]  __vfs_write+0x29/0x40
> [  568.239705]  vfs_write+0xb1/0x1a0
> [  568.240078]  ksys_write+0x89/0xc0
> [  568.240428]  __x64_sys_write+0x1a/0x20
> [  568.240771]  do_syscall_64+0x5b/0x140
> [  568.241115]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  568.241456] RIP: 0033:0x7fb5b02325f4
> [  568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
> [  568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
> [  568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
> [  568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
> [  568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
> [  568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
> [  568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
> [  568.248651] CR2: 0000000000000024
> [  568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
> 
> 
> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-24  6:51                                 ` Paolo Valente
@ 2019-05-24  7:56                                   ` Paolo Valente
  2019-05-29  1:09                                   ` Srivatsa S. Bhat
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Valente @ 2019-05-24  7:56 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab


[-- Attachment #1.1: Type: text/plain, Size: 2262 bytes --]



> Il giorno 24 mag 2019, alle ore 08:51, Paolo Valente <paolo.valente@linaro.org> ha scritto:
> 
> 
> 
>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>> 
>> On 5/23/19 10:22 AM, Paolo Valente wrote:
>>> 
>>>> Il giorno 23 mag 2019, alle ore 11:19, Paolo Valente <paolo.valente@linaro.org> ha scritto:
>>>> 
>>>>> Il giorno 23 mag 2019, alle ore 04:30, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>> 
>> [...]
>>>>> Also, I'm very happy to run additional tests or experiments to help
>>>>> track down this issue. So, please don't hesitate to let me know if
>>>>> you'd like me to try anything else or get you additional traces etc. :)
>>>>> 
>>>> 
>>>> Here's to you!  :) I've attached a new small improvement that may
>>>> reduce fluctuations (patch to apply on top of the others, of course).
>>>> Unfortunately, I don't expect this change to boost the throughput
>>>> though.
>>>> 
>>>> In contrast, I've thought of a solution that might be rather
>>>> effective: making BFQ aware (heuristically) of trivial
>>>> synchronizations between processes in different groups. This will
>>>> require a little more work and time.
>>>> 
>>> 
>>> Hi Srivatsa,
>>> I'm back :)
>>> 
>>> First, there was a mistake in the last patch I sent you, namely in
>>> 0001-block-bfq-re-sample-req-service-times-when-possible.patch.
>>> Please don't apply that patch at all.
>>> 
>>> I've attached a new series of patches instead.  The first patch in this
>>> series is a fixed version of the faulty patch above (if I'm creating too
>>> much confusion, I'll send you again all patches to apply on top of
>>> mainline).
>>> 
>> 
>> No problem, I got it :)
>> 
>>> This series also implements the more effective idea I told you a few
>>> hours ago.  In my system, the loss is now around only 10%, even with
>>> low_latency on.
>>> 
>> 
>> When trying to run multiple dd tasks simultaneously, I get the kernel
>> panic shown below (mainline is fine, without these patches).
>> 
> 
> Could you please provide me somehow with the output of list *(bfq_serv_to_charge+0x21) ?
> 

Maybe I've found the cause. Please also apply the two attached patches and retry.

Thanks,
Paolo


[-- Attachment #1.2: fix-patches-for-waker-detection.tgz --]
[-- Type: application/octet-stream, Size: 1228 bytes --]

[-- Attachment #1.3: Type: text/plain, Size: 4543 bytes --]


> Thanks,
> Paolo
> 
>> [  568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
>> [  568.232257] #PF: supervisor read access in kernel mode
>> [  568.232273] #PF: error_code(0x0000) - not-present page
>> [  568.232289] PGD 0 P4D 0
>> [  568.232299] Oops: 0000 [#1] SMP PTI
>> [  568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G            E     5.1.0-io-dbg-4+ #6
>> [  568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>> [  568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
>> [  568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
>> [  568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
>> [  568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
>> [  568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
>> [  568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
>> [  568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
>> [  568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
>> [  568.232592] FS:  00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
>> [  568.232615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
>> [  568.232690] Call Trace:
>> [  568.232703]  bfq_select_queue+0x781/0x1000
>> [  568.232717]  bfq_dispatch_request+0x1d7/0xd60
>> [  568.232731]  ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
>> [  568.232751]  blk_mq_do_dispatch_sched+0xa8/0xe0
>> [  568.232765]  blk_mq_sched_dispatch_requests+0xe3/0x150
>> [  568.232783]  __blk_mq_run_hw_queue+0x56/0x100
>> [  568.232798]  __blk_mq_delay_run_hw_queue+0x107/0x160
>> [  568.232814]  blk_mq_run_hw_queue+0x75/0x190
>> [  568.232828]  blk_mq_sched_insert_requests+0x7a/0x100
>> [  568.232844]  blk_mq_flush_plug_list+0x1d7/0x280
>> [  568.232859]  blk_flush_plug_list+0xc2/0xe0
>> [  568.232872]  blk_finish_plug+0x2c/0x40
>> [  568.232886]  ext4_writepages+0x592/0xe60
>> [  568.233381]  ? ext4_mark_iloc_dirty+0x52b/0x860
>> [  568.233851]  do_writepages+0x3c/0xd0
>> [  568.234304]  ? ext4_mark_inode_dirty+0x1a0/0x1a0
>> [  568.234748]  ? do_writepages+0x3c/0xd0
>> [  568.235197]  ? __generic_write_end+0x4e/0x80
>> [  568.235644]  __filemap_fdatawrite_range+0xa5/0xe0
>> [  568.236089]  ? __filemap_fdatawrite_range+0xa5/0xe0
>> [  568.236533]  ? ext4_da_write_end+0x13c/0x280
>> [  568.236983]  file_write_and_wait_range+0x5a/0xb0
>> [  568.237407]  ext4_sync_file+0x11e/0x3e0
>> [  568.237819]  vfs_fsync_range+0x48/0x80
>> [  568.238217]  ext4_file_write_iter+0x234/0x3d0
>> [  568.238610]  ? _cond_resched+0x19/0x40
>> [  568.238982]  new_sync_write+0x112/0x190
>> [  568.239347]  __vfs_write+0x29/0x40
>> [  568.239705]  vfs_write+0xb1/0x1a0
>> [  568.240078]  ksys_write+0x89/0xc0
>> [  568.240428]  __x64_sys_write+0x1a/0x20
>> [  568.240771]  do_syscall_64+0x5b/0x140
>> [  568.241115]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  568.241456] RIP: 0033:0x7fb5b02325f4
>> [  568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>> [  568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [  568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
>> [  568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
>> [  568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
>> [  568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
>> [  568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
>> [  568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
>> [  568.248651] CR2: 0000000000000024
>> [  568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
>> 
>> 
>> Regards,
>> Srivatsa
>> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-24  6:51                                 ` Paolo Valente
  2019-05-24  7:56                                   ` Paolo Valente
@ 2019-05-29  1:09                                   ` Srivatsa S. Bhat
  2019-05-29  7:41                                     ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-29  1:09 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/23/19 11:51 PM, Paolo Valente wrote:
> 
>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> When trying to run multiple dd tasks simultaneously, I get the kernel
>> panic shown below (mainline is fine, without these patches).
>>
> 
> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
> 

Hi Paolo,

Sorry for the delay! Here you go:

(gdb) list *(bfq_serv_to_charge+0x21)
0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
914	
915	extern unsigned int blk_rq_err_bytes(const struct request *rq);
916	
917	static inline unsigned int blk_rq_sectors(const struct request *rq)
918	{
919		return blk_rq_bytes(rq) >> SECTOR_SHIFT;
920	}
921	
922	static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
923	{
(gdb) 


For some reason, I've not been able to reproduce this issue after
reporting it here. (Perhaps I got lucky when I hit the kernel panic
a bunch of times last week).

I'll test with your fix applied and see how it goes.

Thank you!

Regards,
Srivatsa

> 
>> [  568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
>> [  568.232257] #PF: supervisor read access in kernel mode
>> [  568.232273] #PF: error_code(0x0000) - not-present page
>> [  568.232289] PGD 0 P4D 0
>> [  568.232299] Oops: 0000 [#1] SMP PTI
>> [  568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G            E     5.1.0-io-dbg-4+ #6
>> [  568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>> [  568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
>> [  568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
>> [  568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
>> [  568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
>> [  568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
>> [  568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
>> [  568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
>> [  568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
>> [  568.232592] FS:  00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
>> [  568.232615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
>> [  568.232690] Call Trace:
>> [  568.232703]  bfq_select_queue+0x781/0x1000
>> [  568.232717]  bfq_dispatch_request+0x1d7/0xd60
>> [  568.232731]  ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
>> [  568.232751]  blk_mq_do_dispatch_sched+0xa8/0xe0
>> [  568.232765]  blk_mq_sched_dispatch_requests+0xe3/0x150
>> [  568.232783]  __blk_mq_run_hw_queue+0x56/0x100
>> [  568.232798]  __blk_mq_delay_run_hw_queue+0x107/0x160
>> [  568.232814]  blk_mq_run_hw_queue+0x75/0x190
>> [  568.232828]  blk_mq_sched_insert_requests+0x7a/0x100
>> [  568.232844]  blk_mq_flush_plug_list+0x1d7/0x280
>> [  568.232859]  blk_flush_plug_list+0xc2/0xe0
>> [  568.232872]  blk_finish_plug+0x2c/0x40
>> [  568.232886]  ext4_writepages+0x592/0xe60
>> [  568.233381]  ? ext4_mark_iloc_dirty+0x52b/0x860
>> [  568.233851]  do_writepages+0x3c/0xd0
>> [  568.234304]  ? ext4_mark_inode_dirty+0x1a0/0x1a0
>> [  568.234748]  ? do_writepages+0x3c/0xd0
>> [  568.235197]  ? __generic_write_end+0x4e/0x80
>> [  568.235644]  __filemap_fdatawrite_range+0xa5/0xe0
>> [  568.236089]  ? __filemap_fdatawrite_range+0xa5/0xe0
>> [  568.236533]  ? ext4_da_write_end+0x13c/0x280
>> [  568.236983]  file_write_and_wait_range+0x5a/0xb0
>> [  568.237407]  ext4_sync_file+0x11e/0x3e0
>> [  568.237819]  vfs_fsync_range+0x48/0x80
>> [  568.238217]  ext4_file_write_iter+0x234/0x3d0
>> [  568.238610]  ? _cond_resched+0x19/0x40
>> [  568.238982]  new_sync_write+0x112/0x190
>> [  568.239347]  __vfs_write+0x29/0x40
>> [  568.239705]  vfs_write+0xb1/0x1a0
>> [  568.240078]  ksys_write+0x89/0xc0
>> [  568.240428]  __x64_sys_write+0x1a/0x20
>> [  568.240771]  do_syscall_64+0x5b/0x140
>> [  568.241115]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  568.241456] RIP: 0033:0x7fb5b02325f4
>> [  568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>> [  568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>> [  568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
>> [  568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
>> [  568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
>> [  568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
>> [  568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
>> [  568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
>> [  568.248651] CR2: 0000000000000024
>> [  568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
>>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-29  1:09                                   ` Srivatsa S. Bhat
@ 2019-05-29  7:41                                     ` Paolo Valente
  2019-05-30  8:29                                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-29  7:41 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

[-- Attachment #1: Type: text/plain, Size: 6036 bytes --]



> Il giorno 29 mag 2019, alle ore 03:09, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/23/19 11:51 PM, Paolo Valente wrote:
>> 
>>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> When trying to run multiple dd tasks simultaneously, I get the kernel
>>> panic shown below (mainline is fine, without these patches).
>>> 
>> 
>> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>> 
> 
> Hi Paolo,
> 
> Sorry for the delay! Here you go:
> 
> (gdb) list *(bfq_serv_to_charge+0x21)
> 0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
> 914
> 915	extern unsigned int blk_rq_err_bytes(const struct request *rq);
> 916
> 917	static inline unsigned int blk_rq_sectors(const struct request *rq)
> 918	{
> 919		return blk_rq_bytes(rq) >> SECTOR_SHIFT;
> 920	}
> 921
> 922	static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
> 923	{
> (gdb)
> 
> 
> For some reason, I've not been able to reproduce this issue after
> reporting it here. (Perhaps I got lucky when I hit the kernel panic
> a bunch of times last week).
> 
> I'll test with your fix applied and see how it goes.
> 

Great!  the offending line above gives me hope that my fix is correct.
If no more failures occur, then I'm eager (and a little worried ...)
to see how it goes with throughput :)

Thanks,
Paolo

> Thank you!
> 
> Regards,
> Srivatsa
> 
>> 
>>> [  568.232231] BUG: kernel NULL pointer dereference, address: 0000000000000024
>>> [  568.232257] #PF: supervisor read access in kernel mode
>>> [  568.232273] #PF: error_code(0x0000) - not-present page
>>> [  568.232289] PGD 0 P4D 0
>>> [  568.232299] Oops: 0000 [#1] SMP PTI
>>> [  568.232312] CPU: 0 PID: 1029 Comm: dd Tainted: G            E     5.1.0-io-dbg-4+ #6
>>> [  568.232334] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016
>>> [  568.232388] RIP: 0010:bfq_serv_to_charge+0x21/0x50
>>> [  568.232404] Code: ff e8 c3 5e bc ff 0f 1f 00 0f 1f 44 00 00 48 8b 86 20 01 00 00 55 48 89 e5 53 48 89 fb a8 40 75 09 83 be a0 01 00 00 01 76 09 <8b> 43 24 c1 e8 09 5b 5d c3 48 8b 7e 08 e8 5d fd ff ff 84 c0 75 ea
>>> [  568.232473] RSP: 0018:ffffa73a42dab750 EFLAGS: 00010002
>>> [  568.232489] RAX: 0000000000001052 RBX: 0000000000000000 RCX: ffffa73a42dab7a0
>>> [  568.232510] RDX: ffffa73a42dab657 RSI: ffff8b7b6ba2ab70 RDI: 0000000000000000
>>> [  568.232530] RBP: ffffa73a42dab758 R08: 0000000000000000 R09: 0000000000000001
>>> [  568.232551] R10: 0000000000000000 R11: ffffa73a42dab7a0 R12: ffff8b7b6aed3800
>>> [  568.232571] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8b7b6aed3800
>>> [  568.232592] FS:  00007fb5b0724540(0000) GS:ffff8b7b6f800000(0000) knlGS:0000000000000000
>>> [  568.232615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [  568.232632] CR2: 0000000000000024 CR3: 00000004266be002 CR4: 00000000001606f0
>>> [  568.232690] Call Trace:
>>> [  568.232703]  bfq_select_queue+0x781/0x1000
>>> [  568.232717]  bfq_dispatch_request+0x1d7/0xd60
>>> [  568.232731]  ? bfq_bfqq_handle_idle_busy_switch.isra.36+0x2cd/0xb20
>>> [  568.232751]  blk_mq_do_dispatch_sched+0xa8/0xe0
>>> [  568.232765]  blk_mq_sched_dispatch_requests+0xe3/0x150
>>> [  568.232783]  __blk_mq_run_hw_queue+0x56/0x100
>>> [  568.232798]  __blk_mq_delay_run_hw_queue+0x107/0x160
>>> [  568.232814]  blk_mq_run_hw_queue+0x75/0x190
>>> [  568.232828]  blk_mq_sched_insert_requests+0x7a/0x100
>>> [  568.232844]  blk_mq_flush_plug_list+0x1d7/0x280
>>> [  568.232859]  blk_flush_plug_list+0xc2/0xe0
>>> [  568.232872]  blk_finish_plug+0x2c/0x40
>>> [  568.232886]  ext4_writepages+0x592/0xe60
>>> [  568.233381]  ? ext4_mark_iloc_dirty+0x52b/0x860
>>> [  568.233851]  do_writepages+0x3c/0xd0
>>> [  568.234304]  ? ext4_mark_inode_dirty+0x1a0/0x1a0
>>> [  568.234748]  ? do_writepages+0x3c/0xd0
>>> [  568.235197]  ? __generic_write_end+0x4e/0x80
>>> [  568.235644]  __filemap_fdatawrite_range+0xa5/0xe0
>>> [  568.236089]  ? __filemap_fdatawrite_range+0xa5/0xe0
>>> [  568.236533]  ? ext4_da_write_end+0x13c/0x280
>>> [  568.236983]  file_write_and_wait_range+0x5a/0xb0
>>> [  568.237407]  ext4_sync_file+0x11e/0x3e0
>>> [  568.237819]  vfs_fsync_range+0x48/0x80
>>> [  568.238217]  ext4_file_write_iter+0x234/0x3d0
>>> [  568.238610]  ? _cond_resched+0x19/0x40
>>> [  568.238982]  new_sync_write+0x112/0x190
>>> [  568.239347]  __vfs_write+0x29/0x40
>>> [  568.239705]  vfs_write+0xb1/0x1a0
>>> [  568.240078]  ksys_write+0x89/0xc0
>>> [  568.240428]  __x64_sys_write+0x1a/0x20
>>> [  568.240771]  do_syscall_64+0x5b/0x140
>>> [  568.241115]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  568.241456] RIP: 0033:0x7fb5b02325f4
>>> [  568.241787] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 48 8d 05 09 11 2d 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
>>> [  568.242842] RSP: 002b:00007ffcb12e2968 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [  568.243220] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fb5b02325f4
>>> [  568.243616] RDX: 0000000000000200 RSI: 000055698f2ad000 RDI: 0000000000000001
>>> [  568.244026] RBP: 0000000000000200 R08: 0000000000000004 R09: 0000000000000003
>>> [  568.244401] R10: 00007fb5b04feca0 R11: 0000000000000246 R12: 000055698f2ad000
>>> [  568.244775] R13: 0000000000000000 R14: 0000000000000000 R15: 000055698f2ad000
>>> [  568.245154] Modules linked in: xt_MASQUERADE(E) nf_conntrack_netlink(E) nfnetlink(E) xfrm_user(E) xfrm_algo(E) xt_addrtype(E) br_netfilter(E) bridge(E) stp(E) llc(E) overlay(E) vmw_vsock_vmci_transport(E) vsock(E) ip6table_filter(E) ip6_tables(E) xt_conntrack(E) iptable_mangle(E) iptable_nat(E) nf_nat(E) iptable_filter
>>> [  568.248651] CR2: 0000000000000024
>>> [  568.249142] ---[ end trace 0ddd315e0a5bdfba ]---
>>> 


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-29  7:41                                     ` Paolo Valente
@ 2019-05-30  8:29                                       ` Srivatsa S. Bhat
  2019-05-30 10:45                                         ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-30  8:29 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/29/19 12:41 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 29 mag 2019, alle ore 03:09, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
>> On 5/23/19 11:51 PM, Paolo Valente wrote:
>>>
>>>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>
>>>> When trying to run multiple dd tasks simultaneously, I get the kernel
>>>> panic shown below (mainline is fine, without these patches).
>>>>
>>>
>>> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>>>
>>
>> Hi Paolo,
>>
>> Sorry for the delay! Here you go:
>>
>> (gdb) list *(bfq_serv_to_charge+0x21)
>> 0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
>> 914
>> 915	extern unsigned int blk_rq_err_bytes(const struct request *rq);
>> 916
>> 917	static inline unsigned int blk_rq_sectors(const struct request *rq)
>> 918	{
>> 919		return blk_rq_bytes(rq) >> SECTOR_SHIFT;
>> 920	}
>> 921
>> 922	static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
>> 923	{
>> (gdb)
>>
>>
>> For some reason, I've not been able to reproduce this issue after
>> reporting it here. (Perhaps I got lucky when I hit the kernel panic
>> a bunch of times last week).
>>
>> I'll test with your fix applied and see how it goes.
>>
> 
> Great!  the offending line above gives me hope that my fix is correct.
> If no more failures occur, then I'm eager (and a little worried ...)
> to see how it goes with throughput :)
> 

Your fix held up well under my testing :)

As for throughput, with low_latency = 1, I get around 1.4 MB/s with
bfq (vs 1.6 MB/s with mq-deadline). This is a huge improvement
compared to what it was before (70 KB/s).
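
For reference, both the scheduler selection and BFQ's low_latency knob
are per-device sysfs settings; a minimal example, with sdb standing in
for the actual device used here:

  echo bfq > /sys/block/sdb/queue/scheduler
  echo 1 > /sys/block/sdb/queue/iosched/low_latency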

With tracing on, the throughput is a bit lower (as expected I guess),
about 1 MB/s, and the corresponding trace file
(trace-waker-detection-1MBps) is available at:

https://www.dropbox.com/s/3roycp1zwk372zo/bfq-traces.tar.gz?dl=0

Thank you so much for your tireless efforts in fixing this issue!

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-23 23:32                           ` Srivatsa S. Bhat
@ 2019-05-30  8:38                             ` Srivatsa S. Bhat
  0 siblings, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-30  8:38 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, jmoyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab

On 5/23/19 4:32 PM, Srivatsa S. Bhat wrote:
> On 5/22/19 7:30 PM, Srivatsa S. Bhat wrote:
>> On 5/22/19 3:54 AM, Paolo Valente wrote:
>>>
>>>
>>>> Il giorno 22 mag 2019, alle ore 12:01, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>
>>>> On 5/22/19 2:09 AM, Paolo Valente wrote:
>>>>>
>>>>> First, thank you very much for testing my patches, and, above all, for
>>>>> sharing those huge traces!
>>>>>
>>>>> According to the your traces, the residual 20% lower throughput that you
>>>>> record is due to the fact that the BFQ injection mechanism takes a few
>>>>> hundredths of seconds to stabilize, at the beginning of the workload.
>>>>> During that setup time, the throughput is equal to the dreadful ~60-90 KB/s
>>>>> that you see without this new patch.  After that time, there
>>>>> seems to be no loss according to the trace.
>>>>>
>>>>> The problem is that a loss lasting only a few hundredths of seconds is
>>>>> however not negligible for a write workload that lasts only 3-4
>>>>> seconds.  Could you please try writing a larger file?
>>>>>
>>>>
>>>> I tried running dd for longer (about 100 seconds), but still saw around
>>>> 1.4 MB/s throughput with BFQ, and between 1.5 MB/s - 1.6 MB/s with
>>>> mq-deadline and noop.
>>>
>>> Ok, so now the cause is the periodic reset of the mechanism.
>>>
>>> It would be super easy to fill this gap, by just gearing the mechanism
>>> toward a very aggressive injection.  The problem is maintaining
>>> control.  As you can imagine from the performance gap between CFQ (or
>>> BFQ with malfunctioning injection) and BFQ with this fix, it is very
>>> hard to succeed in maximizing the throughput while at the same time
>>> preserving control on per-group I/O.
>>>
>>
>> Ah, I see. Just to make sure that this fix doesn't overly optimize for
>> total throughput (because of the testcase we've been using) and end up
>> causing regressions in per-group I/O control, I ran a test with
>> multiple simultaneous dd instances, each writing to a different
>> portion of the filesystem (well separated, to induce seeks), and each
>> dd task bound to its own blkio cgroup. I saw similar results with and
>> without this patch, and the throughput was equally distributed among
>> all the dd tasks.
>>
> Actually, it turns out that I ran the dd tasks directly on the block
> device for this experiment, and not on top of ext4. I'll redo this on
> ext4 and report back soon.
> 

With all your patches applied (including waker detection for the low
latency case), I ran four simultaneous dd instances, each writing to a
different ext4 partition, and each dd task bound to its own blkio
cgroup.  The throughput continued to be well distributed among the dd
tasks, as shown below (I increased dd's block size from 512B to 8KB
for these experiments to get double-digit throughput numbers, so as to
make comparisons easier).

bfq with low_latency = 1:

819200000 bytes (819 MB, 781 MiB) copied, 16452.6 s, 49.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17139.6 s, 47.8 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17251.7 s, 47.5 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17384 s, 47.1 kB/s

bfq with low_latency = 0:

819200000 bytes (819 MB, 781 MiB) copied, 16257.9 s, 50.4 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17204.5 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17220.6 s, 47.6 kB/s
819200000 bytes (819 MB, 781 MiB) copied, 17348.1 s, 47.2 kB/s
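
The setup was along these lines (a rough sketch only; the cgroup names,
mount points and partition layout below are illustrative, and the
cgroup-v1 blkio hierarchy is assumed):

  for i in 1 2 3 4; do
      # one blkio cgroup per dd task
      mkdir -p /sys/fs/cgroup/blkio/dd-test-$i

      # each dd writes to a different ext4 partition;
      # 8K x 100000 = 819200000 bytes, matching the sizes above
      dd if=/dev/zero of=/mnt/part$i/test.img bs=8K count=100000 oflag=dsync &

      # move the just-started dd into its own cgroup (done right after
      # launch, so only the very first writes may land in the parent cgroup)
      echo $! > /sys/fs/cgroup/blkio/dd-test-$i/cgroup.procs
  done
  wait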
 
Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-30  8:29                                       ` Srivatsa S. Bhat
@ 2019-05-30 10:45                                         ` Paolo Valente
  2019-06-02  7:04                                           ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-05-30 10:45 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab, Ulf Hansson, Linus Walleij

[-- Attachment #1: Type: text/plain, Size: 3966 bytes --]



> Il giorno 30 mag 2019, alle ore 10:29, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 5/29/19 12:41 AM, Paolo Valente wrote:
>> 
>> 
>>> Il giorno 29 mag 2019, alle ore 03:09, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>> 
>>> On 5/23/19 11:51 PM, Paolo Valente wrote:
>>>> 
>>>>> Il giorno 24 mag 2019, alle ore 01:43, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>>>> 
>>>>> When trying to run multiple dd tasks simultaneously, I get the kernel
>>>>> panic shown below (mainline is fine, without these patches).
>>>>> 
>>>> 
>>>> Could you please provide me somehow with a list *(bfq_serv_to_charge+0x21) ?
>>>> 
>>> 
>>> Hi Paolo,
>>> 
>>> Sorry for the delay! Here you go:
>>> 
>>> (gdb) list *(bfq_serv_to_charge+0x21)
>>> 0xffffffff814bad91 is in bfq_serv_to_charge (./include/linux/blkdev.h:919).
>>> 914
>>> 915	extern unsigned int blk_rq_err_bytes(const struct request *rq);
>>> 916
>>> 917	static inline unsigned int blk_rq_sectors(const struct request *rq)
>>> 918	{
>>> 919		return blk_rq_bytes(rq) >> SECTOR_SHIFT;
>>> 920	}
>>> 921
>>> 922	static inline unsigned int blk_rq_cur_sectors(const struct request *rq)
>>> 923	{
>>> (gdb)
>>> 
>>> 
>>> For some reason, I've not been able to reproduce this issue after
>>> reporting it here. (Perhaps I got lucky when I hit the kernel panic
>>> a bunch of times last week).
>>> 
>>> I'll test with your fix applied and see how it goes.
>>> 
>> 
>> Great!  the offending line above gives me hope that my fix is correct.
>> If no more failures occur, then I'm eager (and a little worried ...)
>> to see how it goes with throughput :)
>> 
> 
> Your fix held up well under my testing :)
> 

Great!

> As for throughput, with low_latency = 1, I get around 1.4 MB/s with
> bfq (vs 1.6 MB/s with mq-deadline). This is a huge improvement
> compared to what it was before (70 KB/s).
> 

That's beautiful news!

So, now we have the best of both worlds: maximum throughput and
total control over I/O (including minimum latency for interactive and
soft real-time applications).  Besides, no manual configuration
needed.  Of course, this holds unless/until you find other flaws ... ;)

> With tracing on, the throughput is a bit lower (as expected I guess),
> about 1 MB/s, and the corresponding trace file
> (trace-waker-detection-1MBps) is available at:
> 
> https://www.dropbox.com/s/3roycp1zwk372zo/bfq-traces.tar.gz?dl=0
> 

Thank you for the new trace.  I've analyzed it carefully, and, as I
imagined, this residual 12% throughput loss is due to a couple of
heuristics that occasionally get something wrong.  Most likely, ~12%
is the worst-case loss, and if one repeats the tests, the loss may be
much lower in some runs.

I think it is very hard to eliminate this fluctuation while keeping
full I/O control.  But, who knows, I might have some lucky idea in the
future.

At any rate, since you pointed out that you are interested in
out-of-the-box performance, let me complete the context: in case
low_latency is left set, one gets, in return for this 12% loss,
a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
times of applications under load [1];
b) 500-1000% higher throughput in multi-client server workloads, as I
already pointed out [2].

I'm going to prepare complete patches.  In addition, if ok for you,
I'll report these results on the bug you created.  Then I guess we can
close it.

[1] https://algo.ing.unimo.it/people/paolo/disk_sched/results.php
[2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/

> Thank you so much for your tireless efforts in fixing this issue!
> 

I did enjoy working on this with you: your test case and your support
enabled me to make important improvements.  So, thank you very much
for your collaboration so far,
Paolo


> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-05-30 10:45                                         ` Paolo Valente
@ 2019-06-02  7:04                                           ` Srivatsa S. Bhat
  2019-06-11 22:34                                             ` Srivatsa S. Bhat
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-02  7:04 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab, Ulf Hansson, Linus Walleij

On 5/30/19 3:45 AM, Paolo Valente wrote:
> 
> 
>> Il giorno 30 mag 2019, alle ore 10:29, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
>>
[...]
>>
>> Your fix held up well under my testing :)
>>
> 
> Great!
> 
>> As for throughput, with low_latency = 1, I get around 1.4 MB/s with
>> bfq (vs 1.6 MB/s with mq-deadline). This is a huge improvement
>> compared to what it was before (70 KB/s).
>>
> 
> That's beautiful news!
> 
> So, now we have the best of both worlds: maximum throughput and
> total control over I/O (including minimum latency for interactive and
> soft real-time applications).  Besides, no manual configuration
> needed.  Of course, this holds unless/until you find other flaws ... ;)
> 

Indeed, that's awesome! :)

>> With tracing on, the throughput is a bit lower (as expected I guess),
>> about 1 MB/s, and the corresponding trace file
>> (trace-waker-detection-1MBps) is available at:
>>
>> https://www.dropbox.com/s/3roycp1zwk372zo/bfq-traces.tar.gz?dl=0
>>
> 
> Thank you for the new trace.  I've analyzed it carefully, and, as I
> imagined, this residual 12% throughput loss is due to a couple of
> heuristics that occasionally get something wrong.  Most likely, ~12%
> is the worst-case loss, and if one repeats the tests, the loss may be
> much lower in some runs.
>

Ah, I see.
 
> I think it is very hard to eliminate this fluctuation while keeping
> full I/O control.  But, who knows, I might have some lucky idea in the
> future.
> 

:)

> At any rate, since you pointed out that you are interested in
> out-of-the-box performance, let me complete the context: in case
> low_latency is left set, one gets, in return for this 12% loss,
> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> times of applications under load [1];
> b) 500-1000% higher throughput in multi-client server workloads, as I
> already pointed out [2].
> 

I'm very happy that you could solve the problem without having to
compromise on any of the performance characteristics/features of BFQ!


> I'm going to prepare complete patches.  In addition, if ok for you,
> I'll report these results on the bug you created.  Then I guess we can
> close it.
> 

Sounds great!

> [1] https://algo.ing.unimo.it/people/paolo/disk_sched/results.php
> [2] https://www.linaro.org/blog/io-bandwidth-management-for-production-quality-services/
> 
>> Thank you so much for your tireless efforts in fixing this issue!
>>
> 
> I did enjoy working on this with you: your test case and your support
> enabled me to make important improvements.  So, thank you very much
> for your collaboration so far,
> Paolo

My pleasure! :)
 
Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-02  7:04                                           ` Srivatsa S. Bhat
@ 2019-06-11 22:34                                             ` Srivatsa S. Bhat
  2019-06-12 13:04                                               ` Jan Kara
  2019-06-13  5:46                                               ` Paolo Valente
  0 siblings, 2 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-11 22:34 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab, Ulf Hansson, Linus Walleij

On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>
[...]
>> At any rate, since you pointed out that you are interested in
>> out-of-the-box performance, let me complete the context: in case
>> low_latency is left set, one gets, in return for this 12% loss,
>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>> times of applications under load [1];
>> b) 500-1000% higher throughput in multi-client server workloads, as I
>> already pointed out [2].
>>
> 
> I'm very happy that you could solve the problem without having to
> compromise on any of the performance characteristics/features of BFQ!
> 
> 
>> I'm going to prepare complete patches.  In addition, if ok for you,
>> I'll report these results on the bug you created.  Then I guess we can
>> close it.
>>
> 
> Sounds great!
>

Hi Paolo,

Hope you are doing great!

I was wondering if you got a chance to post these patches to LKML for
review and inclusion... (No hurry, of course!)

Also, since your fixes address the performance issues in BFQ, do you
have any thoughts on whether they can be adapted to CFQ as well, to
benefit the older stable kernels that still support CFQ?

Thank you!

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-11 22:34                                             ` Srivatsa S. Bhat
@ 2019-06-12 13:04                                               ` Jan Kara
  2019-06-12 19:36                                                 ` Srivatsa S. Bhat
  2019-06-13  5:46                                               ` Paolo Valente
  1 sibling, 1 reply; 52+ messages in thread
From: Jan Kara @ 2019-06-12 13:04 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Paolo Valente, linux-fsdevel, linux-block, linux-ext4, cgroups,
	kernel list, Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij

On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> > On 5/30/19 3:45 AM, Paolo Valente wrote:
> >>
> [...]
> >> At any rate, since you pointed out that you are interested in
> >> out-of-the-box performance, let me complete the context: in case
> >> low_latency is left set, one gets, in return for this 12% loss,
> >> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> >> times of applications under load [1];
> >> b) 500-1000% higher throughput in multi-client server workloads, as I
> >> already pointed out [2].
> >>
> > 
> > I'm very happy that you could solve the problem without having to
> > compromise on any of the performance characteristics/features of BFQ!
> > 
> > 
> >> I'm going to prepare complete patches.  In addition, if ok for you,
> >> I'll report these results on the bug you created.  Then I guess we can
> >> close it.
> >>
> > 
> > Sounds great!
> >
> 
> Hi Paolo,
> 
> Hope you are doing great!
> 
> I was wondering if you got a chance to post these patches to LKML for
> review and inclusion... (No hurry, of course!)
> 
> Also, since your fixes address the performance issues in BFQ, do you
> have any thoughts on whether they can be adapted to CFQ as well, to
> benefit the older stable kernels that still support CFQ?

Since CFQ doesn't exist in current upstream kernel anymore, I seriously
doubt you'll be able to get any performance improvements for it in the
stable kernels...

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-12 13:04                                               ` Jan Kara
@ 2019-06-12 19:36                                                 ` Srivatsa S. Bhat
  2019-06-13  6:02                                                   ` Greg Kroah-Hartman
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-12 19:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, linux-fsdevel, linux-block, linux-ext4, cgroups,
	kernel list, Jens Axboe, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij,
	Greg Kroah-Hartman, Stable


[ Adding Greg to CC ]

On 6/12/19 6:04 AM, Jan Kara wrote:
> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>
>> [...]
>>>> At any rate, since you pointed out that you are interested in
>>>> out-of-the-box performance, let me complete the context: in case
>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>> times of applications under load [1];
>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>> already pointed out [2].
>>>>
>>>
>>> I'm very happy that you could solve the problem without having to
>>> compromise on any of the performance characteristics/features of BFQ!
>>>
>>>
>>>> I'm going to prepare complete patches.  In addition, if ok for you,
>>>> I'll report these results on the bug you created.  Then I guess we can
>>>> close it.
>>>>
>>>
>>> Sounds great!
>>>
>>
>> Hi Paolo,
>>
>> Hope you are doing great!
>>
>> I was wondering if you got a chance to post these patches to LKML for
>> review and inclusion... (No hurry, of course!)
>>
>> Also, since your fixes address the performance issues in BFQ, do you
>> have any thoughts on whether they can be adapted to CFQ as well, to
>> benefit the older stable kernels that still support CFQ?
> 
> Since CFQ doesn't exist in current upstream kernel anymore, I seriously
> doubt you'll be able to get any performance improvements for it in the
> stable kernels...
> 

I suspected as much, but that seems unfortunate though. The latest LTS
kernel is based on 4.19, which still supports CFQ. It would have been
great to have a process to address significant issues on older
kernels too.

Greg, do you have any thoughts on this? The context is that both CFQ
and BFQ I/O schedulers have issues that cause I/O throughput to suffer
upto 10x - 30x on certain workloads and system configurations, as
reported in [1].

In this thread, Paolo posted patches to fix BFQ performance on
mainline. However CFQ suffers from the same performance collapse, but
CFQ was removed from the kernel in v5.0. So obviously the usual stable
backporting path won't work here for several reasons:

  1. There won't be a mainline commit to backport from, as CFQ no
     longer exists in mainline.

  2. This is not a security/stability fix, and is likely to involve
     invasive changes.

I was wondering if there was a way to address the performance issues
in CFQ in the older stable kernels (including the latest LTS 4.19),
despite the above constraints, since the performance drop is much too
significant. I guess not, but thought I'd ask :-)

[1]. https://lore.kernel.org/lkml/8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu/ 


Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-11 22:34                                             ` Srivatsa S. Bhat
  2019-06-12 13:04                                               ` Jan Kara
@ 2019-06-13  5:46                                               ` Paolo Valente
  2019-06-13 19:13                                                 ` Srivatsa S. Bhat
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Valente @ 2019-06-13  5:46 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab, Ulf Hansson, Linus Walleij

[-- Attachment #1: Type: text/plain, Size: 2348 bytes --]



> Il giorno 12 giu 2019, alle ore 00:34, Srivatsa S. Bhat <srivatsa@csail.mit.edu> ha scritto:
> 
> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>> 
> [...]
>>> At any rate, since you pointed out that you are interested in
>>> out-of-the-box performance, let me complete the context: in case
>>> low_latency is left set, one gets, in return for this 12% loss,
>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>> times of applications under load [1];
>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>> already pointed out [2].
>>> 
>> 
>> I'm very happy that you could solve the problem without having to
>> compromise on any of the performance characteristics/features of BFQ!
>> 
>> 
>>> I'm going to prepare complete patches.  In addition, if ok for you,
>>> I'll report these results on the bug you created.  Then I guess we can
>>> close it.
>>> 
>> 
>> Sounds great!
>> 
> 
> Hi Paolo,
> 

Hi

> Hope you are doing great!
> 

Sort of, thanks :)

> I was wondering if you got a chance to post these patches to LKML for
> review and inclusion... (No hurry, of course!)
> 


I'm having trouble testing these new patches on 5.2-rc4.  As happened
with the first release candidates for 5.1, the CPU of my test machine
(Intel Core i7-2760QM@2.40GHz) gets slowed down so much that results
are heavily distorted with every I/O scheduler.

Unfortunately, I'm not competent enough to spot the cause of this
regression in a feasible amount of time.  I hope it'll go away with
the next release candidates; otherwise I'll test on 5.1.

> Also, since your fixes address the performance issues in BFQ, do you
> have any thoughts on whether they can be adapted to CFQ as well, to
> benefit the older stable kernels that still support CFQ?
> 

I have built my fixes on top of the existing throughput-boosting
infrastructure of BFQ.  CFQ doesn't have such an infrastructure.

If you need I/O control with older kernels, you may want to check out
my version of BFQ for legacy block, named bfq-sq, available in this
repo:
https://github.com/Algodev-github/bfq-mq/

I'm willing to provide you with any information or help if needed.

Thanks,
Paolo


> Thank you!
> 
> Regards,
> Srivatsa
> VMware Photon OS


[-- Attachment #2: Message signed with OpenPGP --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-12 19:36                                                 ` Srivatsa S. Bhat
@ 2019-06-13  6:02                                                   ` Greg Kroah-Hartman
  2019-06-13 19:03                                                     ` Srivatsa S. Bhat
  2019-06-13  8:20                                                   ` Jan Kara
  2019-06-13  8:37                                                   ` Jens Axboe
  2 siblings, 1 reply; 52+ messages in thread
From: Greg Kroah-Hartman @ 2019-06-13  6:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Jan Kara, Paolo Valente, linux-fsdevel, linux-block, linux-ext4,
	cgroups, kernel list, Jens Axboe, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij, Stable

On Wed, Jun 12, 2019 at 12:36:53PM -0700, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
> > On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
> >> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> >>> On 5/30/19 3:45 AM, Paolo Valente wrote:
> >>>>
> >> [...]
> >>>> At any rate, since you pointed out that you are interested in
> >>>> out-of-the-box performance, let me complete the context: in case
> >>>> low_latency is left set, one gets, in return for this 12% loss,
> >>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> >>>> times of applications under load [1];
> >>>> b) 500-1000% higher throughput in multi-client server workloads, as I
> >>>> already pointed out [2].
> >>>>
> >>>
> >>> I'm very happy that you could solve the problem without having to
> >>> compromise on any of the performance characteristics/features of BFQ!
> >>>
> >>>
> >>>> I'm going to prepare complete patches.  In addition, if ok for you,
> >>>> I'll report these results on the bug you created.  Then I guess we can
> >>>> close it.
> >>>>
> >>>
> >>> Sounds great!
> >>>
> >>
> >> Hi Paolo,
> >>
> >> Hope you are doing great!
> >>
> >> I was wondering if you got a chance to post these patches to LKML for
> >> review and inclusion... (No hurry, of course!)
> >>
> >> Also, since your fixes address the performance issues in BFQ, do you
> >> have any thoughts on whether they can be adapted to CFQ as well, to
> >> benefit the older stable kernels that still support CFQ?
> > 
> > Since CFQ doesn't exist in current upstream kernel anymore, I seriously
> > doubt you'll be able to get any performance improvements for it in the
> > stable kernels...
> > 
> 
> I suspected as much, but that seems unfortunate though. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process to address significant issues on older
> kernels too.
> 
> Greg, do you have any thoughts on this? The context is that both CFQ
> and BFQ I/O schedulers have issues that cause I/O throughput to suffer
> upto 10x - 30x on certain workloads and system configurations, as
> reported in [1].
> 
> In this thread, Paolo posted patches to fix BFQ performance on
> mainline. However CFQ suffers from the same performance collapse, but
> CFQ was removed from the kernel in v5.0. So obviously the usual stable
> backporting path won't work here for several reasons:
> 
>   1. There won't be a mainline commit to backport from, as CFQ no
>      longer exists in mainline.
> 
>   2. This is not a security/stability fix, and is likely to involve
>      invasive changes.
> 
> I was wondering if there was a way to address the performance issues
> in CFQ in the older stable kernels (including the latest LTS 4.19),
> despite the above constraints, since the performance drop is much too
> significant. I guess not, but thought I'd ask :-)

If someone cares about something like this, then I strongly just
recommend they move to the latest kernel version.  There should not be
anything stopping them from doing that, right?  Nothing "forces" anyone
to be on the 4.19.y release, especially when it really starts to show
its age.

Don't ever treat the LTS releases as "the only thing someone can run, so
we must backport huge things to it!"  Just use 5.1, and then move to 5.2
when it is out and so on.  That's always the preferred way, you always
get better support, faster kernels, newer features, better hardware
support, and most importantly, more bugfixes.

I wrote a whole essay on this thing, but no one ever seems to read it...

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-12 19:36                                                 ` Srivatsa S. Bhat
  2019-06-13  6:02                                                   ` Greg Kroah-Hartman
@ 2019-06-13  8:20                                                   ` Jan Kara
  2019-06-13 19:05                                                     ` Srivatsa S. Bhat
  2019-06-13  8:37                                                   ` Jens Axboe
  2 siblings, 1 reply; 52+ messages in thread
From: Jan Kara @ 2019-06-13  8:20 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Jan Kara, Paolo Valente, linux-fsdevel, linux-block, linux-ext4,
	cgroups, kernel list, Jens Axboe, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij,
	Greg Kroah-Hartman, Stable

On Wed 12-06-19 12:36:53, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
> > On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
> >> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
> >>> On 5/30/19 3:45 AM, Paolo Valente wrote:
> >>>>
> >> [...]
> >>>> At any rate, since you pointed out that you are interested in
> >>>> out-of-the-box performance, let me complete the context: in case
> >>>> low_latency is left set, one gets, in return for this 12% loss,
> >>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
> >>>> times of applications under load [1];
> >>>> b) 500-1000% higher throughput in multi-client server workloads, as I
> >>>> already pointed out [2].
> >>>>
> >>>
> >>> I'm very happy that you could solve the problem without having to
> >>> compromise on any of the performance characteristics/features of BFQ!
> >>>
> >>>
> >>>> I'm going to prepare complete patches.  In addition, if ok for you,
> >>>> I'll report these results on the bug you created.  Then I guess we can
> >>>> close it.
> >>>>
> >>>
> >>> Sounds great!
> >>>
> >>
> >> Hi Paolo,
> >>
> >> Hope you are doing great!
> >>
> >> I was wondering if you got a chance to post these patches to LKML for
> >> review and inclusion... (No hurry, of course!)
> >>
> >> Also, since your fixes address the performance issues in BFQ, do you
> >> have any thoughts on whether they can be adapted to CFQ as well, to
> >> benefit the older stable kernels that still support CFQ?
> > 
> > Since CFQ doesn't exist in current upstream kernel anymore, I seriously
> > doubt you'll be able to get any performance improvements for it in the
> > stable kernels...
> > 
> 
> I suspected as much, but that seems unfortunate though. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process to address significant issues on older
> kernels too.

Well, you could still tune the performance difference by changing
slice_idle and group_idle tunables for CFQ (in
/sys/block/<device>/queue/iosched/).  Changing these to lower values will
reduce the throughput loss when switching between cgroups at the cost of
lower accuracy of enforcing configured IO proportions among cgroups.
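
For example (sdb is just a placeholder for the device in question; the
values are in milliseconds, with 8 being the default for both, and 0
disabling idling altogether):

  # reduce/disable CFQ idling on this device, trading per-cgroup
  # isolation accuracy for higher aggregate throughput
  echo 0 > /sys/block/sdb/queue/iosched/slice_idle
  echo 0 > /sys/block/sdb/queue/iosched/group_idle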

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-12 19:36                                                 ` Srivatsa S. Bhat
  2019-06-13  6:02                                                   ` Greg Kroah-Hartman
  2019-06-13  8:20                                                   ` Jan Kara
@ 2019-06-13  8:37                                                   ` Jens Axboe
  2 siblings, 0 replies; 52+ messages in thread
From: Jens Axboe @ 2019-06-13  8:37 UTC (permalink / raw)
  To: Srivatsa S. Bhat, Jan Kara
  Cc: Paolo Valente, linux-fsdevel, linux-block, linux-ext4, cgroups,
	kernel list, Jeff Moyer, Theodore Ts'o, amakhalov, anishs,
	srivatsab, Ulf Hansson, Linus Walleij, Greg Kroah-Hartman,
	Stable

On 6/12/19 1:36 PM, Srivatsa S. Bhat wrote:
> 
> [ Adding Greg to CC ]
> 
> On 6/12/19 6:04 AM, Jan Kara wrote:
>> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>>
>>> [...]
>>>>> At any rate, since you pointed out that you are interested in
>>>>> out-of-the-box performance, let me complete the context: in case
>>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>>> times of applications under load [1];
>>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>>> already pointed out [2].
>>>>>
>>>>
>>>> I'm very happy that you could solve the problem without having to
>>>> compromise on any of the performance characteristics/features of BFQ!
>>>>
>>>>
>>>>> I'm going to prepare complete patches.  In addition, if ok for you,
>>>>> I'll report these results on the bug you created.  Then I guess we can
>>>>> close it.
>>>>>
>>>>
>>>> Sounds great!
>>>>
>>>
>>> Hi Paolo,
>>>
>>> Hope you are doing great!
>>>
>>> I was wondering if you got a chance to post these patches to LKML for
>>> review and inclusion... (No hurry, of course!)
>>>
>>> Also, since your fixes address the performance issues in BFQ, do you
>>> have any thoughts on whether they can be adapted to CFQ as well, to
>>> benefit the older stable kernels that still support CFQ?
>>
>> Since CFQ doesn't exist in current upstream kernel anymore, I seriously
>> doubt you'll be able to get any performance improvements for it in the
>> stable kernels...
>>
> 
> I suspected as much, but that seems unfortunate though. The latest LTS
> kernel is based on 4.19, which still supports CFQ. It would have been
> great to have a process to address significant issues on older
> kernels too.
> 
> Greg, do you have any thoughts on this? The context is that both CFQ
> and BFQ I/O schedulers have issues that cause I/O throughput to suffer
> upto 10x - 30x on certain workloads and system configurations, as
> reported in [1].
> 
> In this thread, Paolo posted patches to fix BFQ performance on
> mainline. However CFQ suffers from the same performance collapse, but
> CFQ was removed from the kernel in v5.0. So obviously the usual stable
> backporting path won't work here for several reasons:
> 
>    1. There won't be a mainline commit to backport from, as CFQ no
>       longer exists in mainline.
> 
>    2. This is not a security/stability fix, and is likely to involve
>       invasive changes.
> 
> I was wondering if there was a way to address the performance issues
> in CFQ in the older stable kernels (including the latest LTS 4.19),
> despite the above constraints, since the performance drop is much too
> significant. I guess not, but thought I'd ask :-)
> 
> [1]. https://lore.kernel.org/lkml/8d72fcf7-bbb4-2965-1a06-e9fc177a8938@csail.mit.edu/

This issue has always been there. There will be no specific patches made
for stable for something that doesn't even exist in the newer kernels.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-13  6:02                                                   ` Greg Kroah-Hartman
@ 2019-06-13 19:03                                                     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-13 19:03 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Jan Kara, Paolo Valente, linux-fsdevel, linux-block, linux-ext4,
	cgroups, kernel list, Jens Axboe, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij, Stable

On 6/12/19 11:02 PM, Greg Kroah-Hartman wrote:
> On Wed, Jun 12, 2019 at 12:36:53PM -0700, Srivatsa S. Bhat wrote:
>>
>> [ Adding Greg to CC ]
>>
>> On 6/12/19 6:04 AM, Jan Kara wrote:
>>> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>>>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>>>
>>>> [...]
>>>>>> At any rate, since you pointed out that you are interested in
>>>>>> out-of-the-box performance, let me complete the context: in case
>>>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>>>> times of applications under load [1];
>>>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>>>> already pointed out [2].
>>>>>>
>>>>>
>>>>> I'm very happy that you could solve the problem without having to
>>>>> compromise on any of the performance characteristics/features of BFQ!
>>>>>
>>>>>
>>>>>> I'm going to prepare complete patches.  In addition, if ok for you,
>>>>>> I'll report these results on the bug you created.  Then I guess we can
>>>>>> close it.
>>>>>>
>>>>>
>>>>> Sounds great!
>>>>>
>>>>
>>>> Hi Paolo,
>>>>
>>>> Hope you are doing great!
>>>>
>>>> I was wondering if you got a chance to post these patches to LKML for
>>>> review and inclusion... (No hurry, of course!)
>>>>
>>>> Also, since your fixes address the performance issues in BFQ, do you
>>>> have any thoughts on whether they can be adapted to CFQ as well, to
>>>> benefit the older stable kernels that still support CFQ?
>>>
>>> Since CFQ doesn't exist in current upstream kernel anymore, I seriously
>>> doubt you'll be able to get any performance improvements for it in the
>>> stable kernels...
>>>
>>
>> I suspected as much, but that seems unfortunate though. The latest LTS
>> kernel is based on 4.19, which still supports CFQ. It would have been
>> great to have a process to address significant issues on older
>> kernels too.
>>
>> Greg, do you have any thoughts on this? The context is that both CFQ
>> and BFQ I/O schedulers have issues that cause I/O throughput to suffer
>> upto 10x - 30x on certain workloads and system configurations, as
>> reported in [1].
>>
>> In this thread, Paolo posted patches to fix BFQ performance on
>> mainline. However CFQ suffers from the same performance collapse, but
>> CFQ was removed from the kernel in v5.0. So obviously the usual stable
>> backporting path won't work here for several reasons:
>>
>>   1. There won't be a mainline commit to backport from, as CFQ no
>>      longer exists in mainline.
>>
>>   2. This is not a security/stability fix, and is likely to involve
>>      invasive changes.
>>
>> I was wondering if there was a way to address the performance issues
>> in CFQ in the older stable kernels (including the latest LTS 4.19),
>> despite the above constraints, since the performance drop is much too
>> significant. I guess not, but thought I'd ask :-)
> 
> If someone cares about something like this, then I just strongly
> recommend they move to the latest kernel version.  There should not be
> anything stopping them from doing that, right?  Nothing "forces" anyone
> to be on the 4.19.y release, especially when it really starts to show
> its age.
> 
> Don't ever treat the LTS releases as "the only thing someone can run, so
> we must backport huge things to it!"  Just use 5.1, and then move to 5.2
> when it is out and so on.  That's always the preferred way: you always
> get better support, faster kernels, newer features, better hardware
> support, and most importantly, more bugfixes.
> 

Thank you for the clarification!
 
Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-13  8:20                                                   ` Jan Kara
@ 2019-06-13 19:05                                                     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-13 19:05 UTC (permalink / raw)
  To: Jan Kara
  Cc: Paolo Valente, linux-fsdevel, linux-block, linux-ext4, cgroups,
	kernel list, Jens Axboe, Jeff Moyer, Theodore Ts'o,
	amakhalov, anishs, srivatsab, Ulf Hansson, Linus Walleij,
	Greg Kroah-Hartman, Stable

On 6/13/19 1:20 AM, Jan Kara wrote:
> On Wed 12-06-19 12:36:53, Srivatsa S. Bhat wrote:
>>
>> [ Adding Greg to CC ]
>>
>> On 6/12/19 6:04 AM, Jan Kara wrote:
>>> On Tue 11-06-19 15:34:48, Srivatsa S. Bhat wrote:
>>>> On 6/2/19 12:04 AM, Srivatsa S. Bhat wrote:
>>>>> On 5/30/19 3:45 AM, Paolo Valente wrote:
>>>>>>
>>>> [...]
>>>>>> At any rate, since you pointed out that you are interested in
>>>>>> out-of-the-box performance, let me complete the context: in case
>>>>>> low_latency is left set, one gets, in return for this 12% loss,
>>>>>> a) at least 1000% higher responsiveness, e.g., 1000% lower start-up
>>>>>> times of applications under load [1];
>>>>>> b) 500-1000% higher throughput in multi-client server workloads, as I
>>>>>> already pointed out [2].
>>>>>>
>>>>>
>>>>> I'm very happy that you could solve the problem without having to
>>>>> compromise on any of the performance characteristics/features of BFQ!
>>>>>
>>>>>
>>>>>> I'm going to prepare complete patches.  In addition, if ok for you,
>>>>>> I'll report these results on the bug you created.  Then I guess we can
>>>>>> close it.
>>>>>>
>>>>>
>>>>> Sounds great!
>>>>>
>>>>
>>>> Hi Paolo,
>>>>
>>>> Hope you are doing great!
>>>>
>>>> I was wondering if you got a chance to post these patches to LKML for
>>>> review and inclusion... (No hurry, of course!)
>>>>
>>>> Also, since your fixes address the performance issues in BFQ, do you
>>>> have any thoughts on whether they can be adapted to CFQ as well, to
>>>> benefit the older stable kernels that still support CFQ?
>>>
>>> Since CFQ doesn't exist in the current upstream kernel anymore, I seriously
>>> doubt you'll be able to get any performance improvements for it in the
>>> stable kernels...
>>>
>>
>> I suspected as much, though that seems unfortunate. The latest LTS
>> kernel is based on 4.19, which still supports CFQ. It would have been
>> great to have a process to address significant issues on older
>> kernels too.
> 
> Well, you could still reduce the performance gap by changing the
> slice_idle and group_idle tunables for CFQ (in
> /sys/block/<device>/queue/iosched/).  Lowering these values will reduce
> the throughput loss when switching between cgroups, at the cost of less
> accurate enforcement of the configured IO proportions among cgroups.
> 

Good point, and seems fair enough, thank you!
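
For the archive, here is a minimal sketch of that tuning (sdb is just a
placeholder for the actual backing device, and the values below are
examples rather than recommendations):

  # Lower the idling windows to reduce the penalty paid on every switch
  # between cgroups; this trades away some accuracy of the configured
  # per-cgroup IO proportions.
  echo 1 > /sys/block/sdb/queue/iosched/slice_idle
  echo 1 > /sys/block/sdb/queue/iosched/group_idle

Setting both to 0 disables idling entirely, which (as you noted) further
weakens the enforcement of the configured proportions.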

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: CFQ idling kills I/O performance on ext4 with blkio cgroup controller
  2019-06-13  5:46                                               ` Paolo Valente
@ 2019-06-13 19:13                                                 ` Srivatsa S. Bhat
  0 siblings, 0 replies; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-06-13 19:13 UTC (permalink / raw)
  To: Paolo Valente
  Cc: linux-fsdevel, linux-block, linux-ext4, cgroups, kernel list,
	Jens Axboe, Jan Kara, Jeff Moyer, Theodore Ts'o, amakhalov,
	anishs, srivatsab, Ulf Hansson, Linus Walleij

On 6/12/19 10:46 PM, Paolo Valente wrote:
> 
>> On 12 Jun 2019, at 00:34, Srivatsa S. Bhat <srivatsa@csail.mit.edu> wrote:
>>
[...]
>>
>> Hi Paolo,
>>
> 
> Hi
> 
>> Hope you are doing great!
>>
> 
> Sort of, thanks :)
> 
>> I was wondering if you got a chance to post these patches to LKML for
>> review and inclusion... (No hurry, of course!)
>>
> 
> 
> I'm having trouble testing these new patches on 5.2-rc4.  As it
> happened with the first release candidates for 5.1, the CPU of my test
> machine (Intel Core i7-2760QM@2.40GHz) is so slowed down that results
> are heavily distorted with every I/O scheduler.
> 

Oh, that's unfortunate!

> Unfortunately, I'm not competent enough to spot the cause of this
> regression in a feasible amount of time.  I hope it'll go away with
> the next release candidates, or I'll test on 5.1.
> 

Sounds good to me!

>> Also, since your fixes address the performance issues in BFQ, do you
>> have any thoughts on whether they can be adapted to CFQ as well, to
>> benefit the older stable kernels that still support CFQ?
>>
> 
> I have built my fixes on top of the existing throughput-boosting
> infrastructure of BFQ.  CFQ doesn't have such an infrastructure.
> 
> If you need I/O control with older kernels, you may want to check my
> version of BFQ for legacy block, named bfq-sq and available in this
> repo:
> https://github.com/Algodev-github/bfq-mq/
>

Great! Thank you for sharing this!
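
For anyone else following this thread: once a kernel with bfq-sq is
built and installed, I assume switching to it is the usual scheduler
change via sysfs (sdb is just a placeholder device name here, and
"bfq-sq" as the scheduler string is my guess based on the name you
mentioned; please correct me if the repo documents otherwise):

  cat /sys/block/sdb/queue/scheduler            # list available schedulers
  echo bfq-sq > /sys/block/sdb/queue/scheduler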
 
> I'm willing to provide you with any information or help if needed.
> 
Thank you!

Regards,
Srivatsa
VMware Photon OS

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2019-06-13 19:13 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-05-17 22:16 CFQ idling kills I/O performance on ext4 with blkio cgroup controller Srivatsa S. Bhat
2019-05-18 18:39 ` Paolo Valente
2019-05-18 19:28   ` Theodore Ts'o
2019-05-20  9:15     ` Jan Kara
2019-05-20 10:45       ` Paolo Valente
2019-05-21 16:48       ` Theodore Ts'o
2019-05-21 18:19         ` Josef Bacik
2019-05-21 19:10           ` Theodore Ts'o
2019-05-20 10:38     ` Paolo Valente
2019-05-21  7:38       ` Andrea Righi
2019-05-18 20:50   ` Srivatsa S. Bhat
2019-05-20 10:19     ` Paolo Valente
2019-05-20 22:45       ` Srivatsa S. Bhat
2019-05-21  6:23         ` Paolo Valente
2019-05-21  7:19           ` Srivatsa S. Bhat
2019-05-21  9:10           ` Jan Kara
2019-05-21 16:31             ` Theodore Ts'o
2019-05-21 11:25       ` Paolo Valente
2019-05-21 13:20         ` Paolo Valente
2019-05-21 16:21           ` Paolo Valente
2019-05-21 17:38             ` Paolo Valente
2019-05-21 22:51               ` Srivatsa S. Bhat
2019-05-22  8:05                 ` Paolo Valente
2019-05-22  9:02                   ` Srivatsa S. Bhat
2019-05-22  9:12                     ` Paolo Valente
2019-05-22 10:02                       ` Srivatsa S. Bhat
2019-05-22  9:09                   ` Paolo Valente
2019-05-22 10:01                     ` Srivatsa S. Bhat
2019-05-22 10:54                       ` Paolo Valente
2019-05-23  2:30                         ` Srivatsa S. Bhat
2019-05-23  9:19                           ` Paolo Valente
2019-05-23 17:22                             ` Paolo Valente
2019-05-23 23:43                               ` Srivatsa S. Bhat
2019-05-24  6:51                                 ` Paolo Valente
2019-05-24  7:56                                   ` Paolo Valente
2019-05-29  1:09                                   ` Srivatsa S. Bhat
2019-05-29  7:41                                     ` Paolo Valente
2019-05-30  8:29                                       ` Srivatsa S. Bhat
2019-05-30 10:45                                         ` Paolo Valente
2019-06-02  7:04                                           ` Srivatsa S. Bhat
2019-06-11 22:34                                             ` Srivatsa S. Bhat
2019-06-12 13:04                                               ` Jan Kara
2019-06-12 19:36                                                 ` Srivatsa S. Bhat
2019-06-13  6:02                                                   ` Greg Kroah-Hartman
2019-06-13 19:03                                                     ` Srivatsa S. Bhat
2019-06-13  8:20                                                   ` Jan Kara
2019-06-13 19:05                                                     ` Srivatsa S. Bhat
2019-06-13  8:37                                                   ` Jens Axboe
2019-06-13  5:46                                               ` Paolo Valente
2019-06-13 19:13                                                 ` Srivatsa S. Bhat
2019-05-23 23:32                           ` Srivatsa S. Bhat
2019-05-30  8:38                             ` Srivatsa S. Bhat

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).