> On 21 May 2019, at 18:21, Paolo Valente wrote:
> 
>> On 21 May 2019, at 15:20, Paolo Valente wrote:
>> 
>>> On 21 May 2019, at 13:25, Paolo Valente wrote:
>>> 
>>>> On 20 May 2019, at 12:19, Paolo Valente wrote:
>>>> 
>>>>> On 18 May 2019, at 22:50, Srivatsa S. Bhat wrote:
>>>>> 
>>>>> On 5/18/19 11:39 AM, Paolo Valente wrote:
>>>>>> I've addressed these issues in my last batch of improvements for
>>>>>> BFQ, which landed in the upcoming 5.2. If you give it a try, and
>>>>>> still see the problem, then I'll be glad to reproduce it, and
>>>>>> hopefully fix it for you.
>>>>>> 
>>>>> Hi Paolo,
>>>>> 
>>>>> Thank you for looking into this!
>>>>> 
>>>>> I just tried current mainline at commit 72cf0b07, but unfortunately
>>>>> didn't see any improvement:
>>>>> 
>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>> 
>>>>> With mq-deadline, I get:
>>>>> 
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 3.90981 s, 1.3 MB/s
>>>>> 
>>>>> With bfq, I get:
>>>>> 
>>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 84.8216 s, 60.4 kB/s
>>>>> 
>>>> Hi Srivatsa,
>>>> thanks for reproducing this on mainline. I seem to have reproduced a
>>>> bonsai-tree version of this issue.
>>> 
>>> Hi again Srivatsa,
>>> I've analyzed the trace, and I've found the cause of the loss of
>>> throughput on my side. To find out whether it is the same cause as on
>>> your side, I've prepared a script that executes your test and takes a
>>> trace during the test. If that's ok for you, could you please
>>> - change the value of the DEVS parameter in the attached script, if
>>>   needed
>>> - execute the script
>>> - send me the trace file that the script will leave in your working
>>>   dir
>>> 
>> Sorry, I forgot to add that I also need you to, first, apply the
>> attached patch (it will make BFQ generate the log I need).
>> 
> Sorry again :) This time for attaching one more patch. This is
> basically a blind fix attempt, based on what I see in my VM.
> 
> So, instead of only sending me a trace, could you please:
> 1) apply this new patch on top of the one I attached in my previous
>    email
> 2) repeat your test and report results

One last thing (I swear!): as you can see from my script, I have tested
only the case low_latency=0 so far. So please, for the moment, do your
test with low_latency=0. You can find the full path to this parameter
in, e.g., my script.

Thanks,
Paolo

> 3) regardless of whether bfq performance improves, take a trace with
> my script (I've attached a new version that doesn't risk printing an
> annoying error message, as the previous one did)
> 
> Thanks,
> Paolo
> 
> <0001-block-bfq-boost-injection.patch.gz>
> 
>> Thanks,
>> Paolo
>> 
>> <0001-block-bfq-add-logs-and-BUG_ONs.patch.gz>
>> 
>>> Looking forward to your trace,
>>> Paolo
>>> 
>>>> Before digging into the block trace, I'd like to ask you for some
>>>> feedback.
>>>> 
>>>> First, in my test, the total throughput of the disk happens to be
>>>> about 20 times as high as that enjoyed by dd, regardless of the I/O
>>>> scheduler. I guess this massive overhead is normal with dsync, but
>>>> I'd like to know whether it is about the same on your side. This
>>>> will help me understand whether I'll actually be analyzing roughly
>>>> the same problem as yours.
>>>> 
>>>> Second, the commands I used follow. Do they implement your test case
>>>> correctly?
>>>> 
>>>> [root@localhost tmp]# mkdir /sys/fs/cgroup/blkio/testgrp
>>>> [root@localhost tmp]# echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs
>>>> [root@localhost tmp]# cat /sys/block/sda/queue/scheduler
>>>> [mq-deadline] bfq none
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 14.6892 s, 349 kB/s
>>>> [root@localhost tmp]# echo bfq > /sys/block/sda/queue/scheduler
>>>> [root@localhost tmp]# dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>> 10000+0 records in
>>>> 10000+0 records out
>>>> 5120000 bytes (5.1 MB, 4.9 MiB) copied, 20.1953 s, 254 kB/s
>>>> 
>>>> Thanks,
>>>> Paolo
>>>> 
>>>>> Please let me know if any more info about my setup might be
>>>>> helpful.
>>>>> 
>>>>> Thank you!
>>>>> 
>>>>> Regards,
>>>>> Srivatsa
>>>>> VMware Photon OS
>>>>> 
>>>>>> 
>>>>>>> On 18 May 2019, at 00:16, Srivatsa S. Bhat wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> One of my colleagues noticed up to a 10x-30x drop in I/O
>>>>>>> throughput when running the following command, with the CFQ I/O
>>>>>>> scheduler:
>>>>>>> 
>>>>>>> dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync
>>>>>>> 
>>>>>>> Throughput with CFQ: 60 KB/s
>>>>>>> Throughput with noop or deadline: 1.5 MB/s - 2 MB/s
>>>>>>> 
>>>>>>> I spent some time looking into it and found that this is caused
>>>>>>> by an undesirable interaction between 4 different components:
>>>>>>> 
>>>>>>> - blkio cgroup controller enabled
>>>>>>> - ext4 with the jbd2 kthread running in the root blkio cgroup
>>>>>>> - dd running on ext4, in any blkio cgroup other than that of jbd2
>>>>>>> - CFQ I/O scheduler with defaults for slice_idle and group_idle
>>>>>>> 
>>>>>>> When docker is enabled, systemd creates a blkio cgroup called
>>>>>>> system.slice to run system services (and docker) under it, and a
>>>>>>> separate blkio cgroup called user.slice for user processes. So,
>>>>>>> when dd is invoked, it runs under user.slice.
>>>>>>> 
>>>>>>> The dd command above includes the dsync flag, which performs an
>>>>>>> fdatasync after every write to the output file. Since dd is
>>>>>>> writing to a file on ext4, jbd2 will be active, committing
>>>>>>> transactions corresponding to those fdatasync requests from dd.
>>>>>>> (In other words, dd depends on jbd2 in order to make forward
>>>>>>> progress.) But jbd2, being a kernel thread, runs in the root
>>>>>>> blkio cgroup, as opposed to dd, which runs under user.slice.
>>>>>>> 
>>>>>>> Now, if the I/O scheduler in use for the underlying block device
>>>>>>> is CFQ, then its inter-queue/inter-group idling takes effect (via
>>>>>>> the slice_idle and group_idle parameters, both of which default
>>>>>>> to 8 ms). Therefore, every time CFQ switches between processing
>>>>>>> requests from dd vs. jbd2, this 8 ms idle time is injected, which
>>>>>>> slows down the overall throughput tremendously!
>>>>>>> 
>>>>>>> To verify this theory, I tried various experiments, and in all
>>>>>>> cases, the 4 pre-conditions mentioned above were necessary to
>>>>>>> reproduce this performance drop. For example, if I used an XFS
>>>>>>> filesystem (which doesn't use a separate kthread like jbd2 for
>>>>>>> journaling), or if I dd'ed directly to a block device, I couldn't
>>>>>>> reproduce the performance issue.
>>>>>>> Similarly, running dd in the root blkio cgroup (where jbd2 runs)
>>>>>>> also gets full performance; as does using the noop or deadline
>>>>>>> I/O schedulers; or even CFQ itself, with slice_idle and
>>>>>>> group_idle set to zero.
>>>>>>> 
>>>>>>> These results were reproduced on a Linux VM (kernel v4.19) on
>>>>>>> ESXi, both with virtualized storage and with disk pass-through,
>>>>>>> backed by a rotational hard disk in both cases. The same problem
>>>>>>> was also seen with the BFQ I/O scheduler in kernel v5.1.
>>>>>>> 
>>>>>>> Searching for any earlier discussions of this problem, I found an
>>>>>>> old thread on LKML that encountered this behavior [1], as well as
>>>>>>> a docker github issue [2] with similar symptoms (mentioned later
>>>>>>> in the thread).
>>>>>>> 
>>>>>>> So, I'm curious to know if this is a well-understood problem and
>>>>>>> if anybody has any thoughts on how to fix it.
>>>>>>> 
>>>>>>> Thank you very much!
>>>>>>> 
>>>>>>> [1]. https://lkml.org/lkml/2015/11/19/359
>>>>>>> 
>>>>>>> [2]. https://github.com/moby/moby/issues/21485
>>>>>>>      https://github.com/moby/moby/issues/21485#issuecomment-222941103
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Srivatsa
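
For anyone who wants to poke at the same knobs, here is a rough,
untested sketch of the settings discussed in the thread. It is not
taken from the thread itself: it assumes the target disk is sda, a
cgroup-v1 blkio hierarchy (as in Paolo's commands above), and that the
active scheduler exposes these iosched attributes under sysfs; exact
names and availability vary by kernel version and scheduler.

# Put the current shell in its own blkio cgroup, as in the test above
mkdir /sys/fs/cgroup/blkio/testgrp
echo $BASHPID > /sys/fs/cgroup/blkio/testgrp/cgroup.procs

# Check and switch the I/O scheduler for the device
cat /sys/block/sda/queue/scheduler
echo bfq > /sys/block/sda/queue/scheduler

# BFQ: run the test with low_latency disabled, as Paolo requests
echo 0 > /sys/block/sda/queue/iosched/low_latency

# CFQ (kernel v4.19 legacy block layer): disabling idling restores
# throughput, per Srivatsa's experiments
echo 0 > /sys/block/sda/queue/iosched/slice_idle
echo 0 > /sys/block/sda/queue/iosched/group_idle

# The workload itself
dd if=/dev/zero of=/root/test.img bs=512 count=10000 oflag=dsync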