> Before digging into the block trace, I'd like to ask you for some feedback.
>
> First, in my test, the total throughput of the disk happens to be
> about 20 times as high as that enjoyed by dd, regardless of the I/O
> scheduler. I guess this massive overhead is normal with dsync, but
> I'd like to know whether it is about the same on your side. This will
> help me understand whether I'll actually be analyzing about the same
> problem as yours.
>
> Second, the commands I used follow. Do they implement your test case
> correctly?
>
> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 14.6892 s, 349 kB/s
> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
> 10000+0 records in
> 10000+0 records out
> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 20.1953 s, 254 kB/s
>
> Thanks,
> Paolo
>
>> Please let me know if any more info about my setup might be helpful.
>>
>> Thank you!
>>
>> Regards,
>> Srivatsa
>> VMware Photon OS
>>
>>>
>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat wrote:
>>>>
>>>>
>>>> Hi,
>>>>
>>>> One of my colleagues noticed a 10x-30x drop in I/O throughput when
>>>> running the following command with the CFQ I/O scheduler:
>>>>
>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>
>>>> Throughput with CFQ: 60 KB/s
>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>
>>>> I spent some time looking into it and found that this is caused by the
>>>> undesirable interaction between 4 different components:
>>>>
>>>> - blkio cgroup controller enabled
>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>> - dd running on ext4, in any other blkio cgroup than that of jbd2
>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>
>>>>
>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>> system.slice to run system services (and docker) under it, and a
>>>> separate blkio cgroup called user.slice for user processes. So, when
>>>> dd is invoked, it runs under user.slice.
>>>>
>>>> The dd command above includes the dsync flag, which performs an
>>>> fdatasync after every write to the output file. Since dd is writing to
>>>> a file on ext4, jbd2 will be active, committing transactions
>>>> corresponding to those fdatasync requests from dd. (In other words, dd
>>>> depends on jbd2 in order to make forward progress.) But jbd2, being a
>>>> kernel thread, runs in the root blkio cgroup, as opposed to dd, which
>>>> runs under user.slice.
>>>>
>>>> Now, if the I/O scheduler in use for the underlying block device is
>>>> CFQ, then its inter-queue/inter-group idling takes effect (via the
>>>> slice_idle and group_idle parameters, both of which default to 8ms).
>>>> Therefore, every time CFQ switches between processing requests from dd
>>>> vs jbd2, this 8ms idle time is injected, which slows down the overall
>>>> throughput tremendously!
>>>>
>>>> To verify this theory, I tried various experiments, and in all cases,
>>>> the 4 pre-conditions mentioned above were necessary to reproduce this
>>>> performance drop.
>>>> For example, if I used an XFS filesystem (which
>>>> doesn't use a separate kthread like jbd2 for journaling), or if I dd'ed
>>>> directly to a block device, I couldn't reproduce the performance
>>>> issue. Similarly, running dd in the root blkio cgroup (where jbd2
>>>> runs) also gets full performance; as does using the noop or deadline
>>>> I/O schedulers; or even CFQ itself, with slice_idle and group_idle set
>>>> to zero.
>>>>
>>>> These results were reproduced on a Linux VM (kernel v4.19) on ESXi,
>>>> both with virtualized storage as well as with disk pass-through,
>>>> backed by a rotational hard disk in both cases. The same problem was
>>>> also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>
>>>> Searching for any earlier discussions of this problem, I found an old
>>>> thread on LKML that encountered this behavior [1], as well as a docker
>>>> github issue [2] with similar symptoms (mentioned later in the
>>>> thread).
>>>>
>>>> So, I'm curious to know if this is a well-understood problem and if
>>>> anybody has any thoughts on how to fix it.
>>>>
>>>> Thank you very much!
>>>>
>>>>
>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>
>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>      https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>
>>>> Regards,
>>>> Srivatsa
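
As a concrete form of the mitigation Srivatsa describes (CFQ with slice_idle and
group_idle set to zero), here is a minimal sketch. It assumes a kernel where the
legacy CFQ scheduler is available and selectable for the device; the device name
sda is taken from Paolo's transcript, the target file from the reported command,
and the iosched paths are CFQ's standard sysfs tunables.

# Check which scheduler is active for the device, then select CFQ
# (only possible on kernels that still ship the legacy CFQ scheduler).
cat /sys/block/sda/queue/scheduler
echo cfq > /sys/block/sda/queue/scheduler

# Disable the 8 ms idling injected when CFQ switches between queues/groups
# (e.g. between dd's blkio cgroup and the root cgroup where jbd2 runs).
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 0 > /sys/block/sda/queue/iosched/group_idle

# Re-run the reported test case and compare throughput.
dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync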