linux-block.vger.kernel.org archive mirror
* CFQ idling kills I/O performance on ext4 with blkio cgroup controller
@ 2019-05-17 22:16 Srivatsa S. Bhat
  2019-05-18 18:39 ` Paolo Valente
  0 siblings, 1 reply; 52+ messages in thread
From: Srivatsa S. Bhat @ 2019-05-17 22:16 UTC (permalink / raw)
  To: linux-fsdevel, linux-block, linux-ext4, cgroups, linux-kernel
  Cc: axboe, paolo.valente, jack, jmoyer, tytso, amakhalov, anishs,
	srivatsab, Srivatsa S. Bhat


Hi,

One of my colleagues noticed up to a 10x - 30x drop in I/O throughput
running the following command, with the CFQ I/O scheduler:

dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync

Throughput with CFQ: 60 KB/s
Throughput with noop or deadline: 1.5 MB/s - 2 MB/s

I spent some time looking into it and found that this is caused by the
undesirable interaction between 4 different components:

- blkio cgroup controller enabled
- ext4 with the jbd2 kthread running in the root blkio cgroup
- dd running on ext4, in any blkio cgroup other than that of jbd2
- CFQ I/O scheduler with defaults for slice_idle and group_idle


When docker is enabled, systemd creates a blkio cgroup called
system.slice to run system services (and docker) under it, and a
separate blkio cgroup called user.slice for user processes. So, when
dd is invoked, it runs under user.slice.

The dd command above includes the dsync flag, which performs an
fdatasync after every write to the output file. Since dd is writing to
a file on ext4, jbd2 will be active, committing transactions
corresponding to those fdatasync requests from dd. (In other words, dd
depends on jbd2 in order to make forward progress.) But jbd2, being a
kernel thread, runs in the root blkio cgroup, as opposed to dd, which
runs under user.slice.
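
The cgroup split described above can be checked from /proc. The
following is a minimal sketch, assuming a cgroup v1 layout where each
line of /proc/<pid>/cgroup has the form <id>:<controllers>:<path>. The
function name and the PROC_ROOT override are hypothetical, introduced
only for illustration:

```shell
# Print the blkio cgroup path of a given PID (cgroup v1 assumed).
# PROC_ROOT is a hypothetical override so the function can be pointed
# at a test tree; it defaults to /proc.
blkio_cgroup_of() {
    pid=$1
    # Field 2 is a comma-separated controller list; match blkio exactly.
    awk -F: '$2 ~ /(^|,)blkio(,|$)/ { print $3 }' \
        "${PROC_ROOT:-/proc}/$pid/cgroup"
}

# Usage: pass the PID of dd or of the jbd2 kernel thread, e.g.
#   blkio_cgroup_of "$$"
```

If the description above holds, the dd process should report
/user.slice and the jbd2 thread should report /.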

Now, if the I/O scheduler in use for the underlying block device is
CFQ, then its inter-queue/inter-group idling takes effect (via the
slice_idle and group_idle parameters, both of which default to 8ms).
Therefore, every time CFQ switches between processing requests from dd
vs jbd2, this 8ms idle time is injected, which slows down the overall
throughput tremendously!
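
As a rough sanity check on the numbers: if every 512-byte synchronous
write ends up waiting out about one 8ms idle window, the throughput
ceiling is roughly 64 KB/s, which is in the same range as the 60 KB/s
observed with CFQ. This is only a back-of-the-envelope sketch; the
one-idle-window-per-write assumption is not measured, and the function
name is hypothetical:

```shell
# Estimate bytes/second if each write of $1 bytes pays one idle
# window of $2 milliseconds.
# Assumption (not measured): exactly one idle window per write, and
# the device service time itself is negligible.
estimate_bps() {
    bytes_per_write=$1
    idle_ms=$2
    echo $(( bytes_per_write * 1000 / idle_ms ))
}

estimate_bps 512 8    # bs=512 with an 8ms idle window; prints 64000
```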

To verify this theory, I tried various experiments, and in all cases,
the 4 pre-conditions mentioned above were necessary to reproduce this
performance drop. For example, if I used an XFS filesystem (which
doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
directly to a block device, I couldn't reproduce the performance
issue. Similarly, running dd in the root blkio cgroup (where jbd2
runs) also gets full performance; as does using the noop or deadline
I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
to zero.
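
For reference, the scheduler and the two idle tunables can be inspected
through sysfs. The following is a minimal sketch, assuming the usual
/sys/block/<dev>/queue layout; the function name, the SYSFS_ROOT
override and the device name "sda" are placeholders, not taken from the
test setup:

```shell
# Print the scheduler line and the CFQ idle tunables for a block device.
# SYSFS_ROOT is a hypothetical override for testing; it defaults to /sys.
show_idle_tunables() {
    q="${SYSFS_ROOT:-/sys}/block/$1/queue"
    echo "scheduler: $(cat "$q/scheduler")"
    for knob in slice_idle group_idle; do
        # Skip a tunable that the active scheduler does not expose.
        if [ -r "$q/iosched/$knob" ]; then
            echo "$knob: $(cat "$q/iosched/$knob")"
        fi
    done
}

# Usage (as root; "sda" is a placeholder device name):
#   show_idle_tunables sda
#   echo 0 > /sys/block/sda/queue/iosched/slice_idle
#   echo 0 > /sys/block/sda/queue/iosched/group_idle
```

The two echo lines correspond to the "slice_idle and group_idle set to
zero" experiment; the settings do not persist across reboots.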

These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
both with virtualized storage as well as with disk pass-through,
backed by a rotational hard disk in both cases. The same problem was
also seen with the BFQ I/O scheduler in kernel v5.1.

Searching for any earlier discussions of this problem, I found an old
thread on LKML that encountered this behavior [1], as well as a docker
github issue [2] with similar symptoms (mentioned later in the
thread).

So, I'm curious to know if this is a well-understood problem and if
anybody has any thoughts on how to fix it.

Thank you very much!


[1]. https://lkml.org/lkml/2015/11/19/359

[2]. https://github.com/moby/moby/issues/21485
     https://github.com/moby/moby/issues/21485#issuecomment-222941103

Regards,
Srivatsa



Thread overview: 52+ messages
2019-05-17 22:16 CFQ idling kills I/O performance on ext4 with blkio cgroup controller Srivatsa S. Bhat
2019-05-18 18:39 ` Paolo Valente
2019-05-18 19:28   ` Theodore Ts'o
2019-05-20  9:15     ` Jan Kara
2019-05-20 10:45       ` Paolo Valente
2019-05-21 16:48       ` Theodore Ts'o
2019-05-21 18:19         ` Josef Bacik
2019-05-21 19:10           ` Theodore Ts'o
2019-05-20 10:38     ` Paolo Valente
2019-05-21  7:38       ` Andrea Righi
2019-05-18 20:50   ` Srivatsa S. Bhat
2019-05-20 10:19     ` Paolo Valente
2019-05-20 22:45       ` Srivatsa S. Bhat
2019-05-21  6:23         ` Paolo Valente
2019-05-21  7:19           ` Srivatsa S. Bhat
2019-05-21  9:10           ` Jan Kara
2019-05-21 16:31             ` Theodore Ts'o
2019-05-21 11:25       ` Paolo Valente
2019-05-21 13:20         ` Paolo Valente
2019-05-21 16:21           ` Paolo Valente
2019-05-21 17:38             ` Paolo Valente
2019-05-21 22:51               ` Srivatsa S. Bhat
2019-05-22  8:05                 ` Paolo Valente
2019-05-22  9:02                   ` Srivatsa S. Bhat
2019-05-22  9:12                     ` Paolo Valente
2019-05-22 10:02                       ` Srivatsa S. Bhat
2019-05-22  9:09                   ` Paolo Valente
2019-05-22 10:01                     ` Srivatsa S. Bhat
2019-05-22 10:54                       ` Paolo Valente
2019-05-23  2:30                         ` Srivatsa S. Bhat
2019-05-23  9:19                           ` Paolo Valente
2019-05-23 17:22                             ` Paolo Valente
2019-05-23 23:43                               ` Srivatsa S. Bhat
2019-05-24  6:51                                 ` Paolo Valente
2019-05-24  7:56                                   ` Paolo Valente
2019-05-29  1:09                                   ` Srivatsa S. Bhat
2019-05-29  7:41                                     ` Paolo Valente
2019-05-30  8:29                                       ` Srivatsa S. Bhat
2019-05-30 10:45                                         ` Paolo Valente
2019-06-02  7:04                                           ` Srivatsa S. Bhat
2019-06-11 22:34                                             ` Srivatsa S. Bhat
2019-06-12 13:04                                               ` Jan Kara
2019-06-12 19:36                                                 ` Srivatsa S. Bhat
2019-06-13  6:02                                                   ` Greg Kroah-Hartman
2019-06-13 19:03                                                     ` Srivatsa S. Bhat
2019-06-13  8:20                                                   ` Jan Kara
2019-06-13 19:05                                                     ` Srivatsa S. Bhat
2019-06-13  8:37                                                   ` Jens Axboe
2019-06-13  5:46                                               ` Paolo Valente
2019-06-13 19:13                                                 ` Srivatsa S. Bhat
2019-05-23 23:32                           ` Srivatsa S. Bhat
2019-05-30  8:38                             ` Srivatsa S. Bhat
