* Potential regression with iomap DIO for 4k writes @ 2021-06-02 23:35 Gabriel Krisman Bertazi 2021-06-16 0:17 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 5+ messages in thread From: Gabriel Krisman Bertazi @ 2021-06-02 23:35 UTC (permalink / raw) To: Theodore Ts'o; +Cc: kernel, linux-ext4 Hi, While I've been exploring the performance of different DIO implementations, I've come across what seems a noticeable regression (22% slowdown) in 4k writes in Ext4 when comparing the original direct-io with the current iomap dio implementation, as existing on linus/master. Perhaps you already know about this, but I'm having a hard time understanding the root cause, in order to attempt to improve the situation. * Benchmark For starter, I'm comparing three kernels, built with same config and compiler (gcc-8.4.0 (locally built)). My DUT is a Samsung SSD 970 EVO Plus 250GB dedicated to this test (no concurrent IO). - Kernel 1: Commit immediately before iomap for ext4 is merged ("f112a2fd1f59"). On the data below, this kernel is identified as 5.4.0-original-dio. Available in a public branch at: <gitlab.collabora.com:krisman/linux.git -b dio/original-dio> - Kernel 2: tag 5.5 (first release with dio-iomap). In the data below, identified as 5.5.0-old-iomap. For completeness, it is available at: <gitlab.collabora.com:krisman/linux.git -b dio/old-dio> - Kernel 3: Kernel tag 5.13-rc3. In the data below, identified as 5.13-rc3-iomap. For completeness, it is available at: <gitlab.collabora.com:krisman/linux.git -b dio/iomap> I ran the fio job below with the combinations: BS=4k,16k and RW=read,write fio --ioengine libaio --size=2G --direct=1 --iodepth=64 --time_based=1 \ --thread=1 --overwrite=1 --runtime=100 --output-format=terse For every kernel test, the file system was recreated, and the 2GB file was pre-allocated. In an attempt to further isolate the problem, I tested both xfs and ext4 in the same condition. The script I used is available at: <https://people.collabora.com/~krisman/dio/bench.sh> * Results I obtained the following performance results, relative to the baseline 5.4.0-original-dio. | IOPS | | kernel | read-4k | read-16k | write-4k | write-16k | |------------------------+------------------+-------------------+-------------------+-------------------+ | 5.13.0-rc3-iomap-ext4 | 1.01192950082305 | 1.00026413252562 | 0.806377013901006 | 1.00020735846057 | | 5.5.0-old-iomap-ext4 | 1.01154156662508 | 0.998753983520427 | 0.777051125458035 | 0.999937792461829 | | 5.13.0-rc3-iomap-xfs | 1.00234888443008 | 1.00027645151444 | 1.00996172750095 | 1.00156349447934 | | 5.5.0-old-iomap-xfs | 1.00010412786902 | 1.00202731110586 | 1.01502846821264 | 1.00149431330769 | Total IO is the amount of data copied (relative to baseline). | TOTAL_IO | kernel | read-4k | read-16k | write-4k | write-16k | |------------------------+------------------+-------------------+-------------------+-------------------| | 5.13.0-rc3-iomap-ext4 | 1.01193023173156 | 1.00026332569559 | 0.806377530301477 | 1.00014686835205 | | 5.5.0-old-iomap-ext4 | 1.01154196621591 | 0.998758131673757 | 0.777050753425118 | 0.999902824986834 | | 5.13.0-rc3-iomap-xfs | 1.00234893734134 | 1.00027535318322 | 1.00996437458991 | 1.00156305646789 | | 5.5.0-old-iomap-xfs | 1.00010328564078 | 1.00202831801018 | 1.01503060595258 | 1.00149069402364 | With a visualization of the above data here: <https://people.collabora.com/~krisman/dio/bench.png> The only out of the ordinary result seems to be in write-4k for Ext4, which suggests around 20% less IOPS (and total IO) for iomap in comparison to the original DIO. This is not a one-off run, as it seems to be consistently reproducible with more test runs in my environment. The performance reduction also doesn't reproduce on XFS. I tried to limit the influence of other parts of the kernel that could affect the behavior by comparing the kernel immediately before the introduction of dio-iomap for ext4 with the first version with that feature. By also observing that xfs doesn't change, I believe it to be ext4 specific. I'm also publishing raw data and all related material to the link below, in case anyone wants to tinker with my data: https://people.collabora.com/~krisman/dio/ Perhaps I'm missing something obvious. But I can't pinpoint a specific problem with my analysis. Is this expected, given the way ext4 iomap work? Do you have any idea of the root cause or how it can be improved? I will keep looking to this issue, but I'd like to share this partial result, in case there is a problem with my analysis, or if you have any suggestion. Thanks, -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Potential regression with iomap DIO for 4k writes 2021-06-02 23:35 Potential regression with iomap DIO for 4k writes Gabriel Krisman Bertazi @ 2021-06-16 0:17 ` Gabriel Krisman Bertazi 2021-06-18 17:21 ` Theodore Ts'o 0 siblings, 1 reply; 5+ messages in thread From: Gabriel Krisman Bertazi @ 2021-06-16 0:17 UTC (permalink / raw) To: Theodore Ts'o; +Cc: kernel, linux-ext4 Gabriel Krisman Bertazi <krisman@collabora.com> writes: > Hi, > > While I've been exploring the performance of different DIO > implementations, I've come across what seems a noticeable regression > (22% slowdown) in 4k writes in Ext4 when comparing the original > direct-io with the current iomap dio implementation, as existing on > linus/master. Perhaps you already know about this, but I'm having a > hard time understanding the root cause, in order to attempt to improve > the situation. Hi Ted, Sorry for the ping, but do you have any ideas of what we are seeing here? Thank you, > * Benchmark > > For starter, I'm comparing three kernels, built with same config and > compiler (gcc-8.4.0 (locally built)). My DUT is a Samsung SSD 970 EVO > Plus 250GB dedicated to this test (no concurrent IO). > > - Kernel 1: Commit immediately before iomap for ext4 is merged > ("f112a2fd1f59"). On the data below, this kernel is identified as > 5.4.0-original-dio. Available in a public branch at: > > <gitlab.collabora.com:krisman/linux.git -b dio/original-dio> > > - Kernel 2: tag 5.5 (first release with dio-iomap). In the data > below, identified as 5.5.0-old-iomap. For completeness, it is > available at: > > <gitlab.collabora.com:krisman/linux.git -b dio/old-dio> > > - Kernel 3: Kernel tag 5.13-rc3. In the data below, identified as > 5.13-rc3-iomap. For completeness, it is available at: > > <gitlab.collabora.com:krisman/linux.git -b dio/iomap> > > I ran the fio job below with the combinations: BS=4k,16k and RW=read,write > > fio --ioengine libaio --size=2G --direct=1 --iodepth=64 --time_based=1 \ > --thread=1 --overwrite=1 --runtime=100 --output-format=terse > > For every kernel test, the file system was recreated, and the 2GB file > was pre-allocated. > > In an attempt to further isolate the problem, I tested both xfs and ext4 > in the same condition. > > The script I used is available at: > > <https://people.collabora.com/~krisman/dio/bench.sh> > > * Results > > I obtained the following performance results, relative to the baseline > 5.4.0-original-dio. > > | IOPS | > | kernel | read-4k | read-16k | write-4k | write-16k | > |------------------------+------------------+-------------------+-------------------+-------------------+ > | 5.13.0-rc3-iomap-ext4 | 1.01192950082305 | 1.00026413252562 | 0.806377013901006 | 1.00020735846057 | > | 5.5.0-old-iomap-ext4 | 1.01154156662508 | 0.998753983520427 | 0.777051125458035 | 0.999937792461829 | > | 5.13.0-rc3-iomap-xfs | 1.00234888443008 | 1.00027645151444 | 1.00996172750095 | 1.00156349447934 | > | 5.5.0-old-iomap-xfs | 1.00010412786902 | 1.00202731110586 | 1.01502846821264 | 1.00149431330769 | > > > Total IO is the amount of data copied (relative to baseline). > > | TOTAL_IO > | kernel | read-4k | read-16k | write-4k | write-16k | > |------------------------+------------------+-------------------+-------------------+-------------------| > | 5.13.0-rc3-iomap-ext4 | 1.01193023173156 | 1.00026332569559 | 0.806377530301477 | 1.00014686835205 | > | 5.5.0-old-iomap-ext4 | 1.01154196621591 | 0.998758131673757 | 0.777050753425118 | 0.999902824986834 | > | 5.13.0-rc3-iomap-xfs | 1.00234893734134 | 1.00027535318322 | 1.00996437458991 | 1.00156305646789 | > | 5.5.0-old-iomap-xfs | 1.00010328564078 | 1.00202831801018 | 1.01503060595258 | 1.00149069402364 | > > With a visualization of the above data here: > > <https://people.collabora.com/~krisman/dio/bench.png> > > The only out of the ordinary result seems to be in write-4k for Ext4, > which suggests around 20% less IOPS (and total IO) for iomap in > comparison to the original DIO. This is not a one-off run, as it seems > to be consistently reproducible with more test runs in my environment. > The performance reduction also doesn't reproduce on XFS. > > I tried to limit the influence of other parts of the kernel that could > affect the behavior by comparing the kernel immediately before the > introduction of dio-iomap for ext4 with the first version with that > feature. By also observing that xfs doesn't change, I believe it to be > ext4 specific. > > I'm also publishing raw data and all related material to the link below, > in case anyone wants to tinker with my data: > > https://people.collabora.com/~krisman/dio/ > > Perhaps I'm missing something obvious. But I can't pinpoint a specific > problem with my analysis. Is this expected, given the way ext4 iomap > work? Do you have any idea of the root cause or how it can be improved? > > I will keep looking to this issue, but I'd like to share this partial > result, in case there is a problem with my analysis, or if you have any > suggestion. > > Thanks, -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Potential regression with iomap DIO for 4k writes 2021-06-16 0:17 ` Gabriel Krisman Bertazi @ 2021-06-18 17:21 ` Theodore Ts'o 2021-06-23 16:04 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 5+ messages in thread From: Theodore Ts'o @ 2021-06-18 17:21 UTC (permalink / raw) To: Gabriel Krisman Bertazi; +Cc: kernel, linux-ext4 On Tue, Jun 15, 2021 at 08:17:55PM -0400, Gabriel Krisman Bertazi wrote: > Gabriel Krisman Bertazi <krisman@collabora.com> writes: > > > While I've been exploring the performance of different DIO > > implementations, I've come across what seems a noticeable regression > > (22% slowdown) in 4k writes in Ext4 when comparing the original > > direct-io with the current iomap dio implementation, as existing on > > linus/master. Perhaps you already know about this, but I'm having a > > hard time understanding the root cause, in order to attempt to improve > > the situation. > > Sorry for the ping, but do you have any ideas of what we are seeing > here? Apologies for the delay in responding; somehow I missed your initial e-mail on the subject on June 2nd, although I haven't found it in the mailing list archives[1]. I don't know if it got caught in a spam trap, or was accidentally deleted from my inbox. [1] https://lore.kernel.org/linux-ext4/87lf7rkffv.fsf@collabora.com/ I didn't do any bs=4k benchmarks before we landed the DIO iomap changes, and it's interesting that it largely goes away with a 16k block size[2] [2] https://people.collabora.com/~krisman/dio/week21/bench.png Looking at your flame graphs[3][4] [3] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.4.0-dio_original-dio-ext4-write-4k.svg [4] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.5.0-dio_old-iomap-ext4-write-4k.svg ... nothing immediately jumps out at me. Have you compared the output of /proc/lock_stat for the two kernels? And is there anything obvious in the blktrace of the two kernels? Cheers, - Ted > > * Benchmark > > > > For starter, I'm comparing three kernels, built with same config and > > compiler (gcc-8.4.0 (locally built)). My DUT is a Samsung SSD 970 EVO > > Plus 250GB dedicated to this test (no concurrent IO). > > > > - Kernel 1: Commit immediately before iomap for ext4 is merged > > ("f112a2fd1f59"). On the data below, this kernel is identified as > > 5.4.0-original-dio. Available in a public branch at: > > > > <gitlab.collabora.com:krisman/linux.git -b dio/original-dio> > > > > - Kernel 2: tag 5.5 (first release with dio-iomap). In the data > > below, identified as 5.5.0-old-iomap. For completeness, it is > > available at: > > > > <gitlab.collabora.com:krisman/linux.git -b dio/old-dio> > > > > - Kernel 3: Kernel tag 5.13-rc3. In the data below, identified as > > 5.13-rc3-iomap. For completeness, it is available at: > > > > <gitlab.collabora.com:krisman/linux.git -b dio/iomap> > > > > I ran the fio job below with the combinations: BS=4k,16k and RW=read,write > > > > fio --ioengine libaio --size=2G --direct=1 --iodepth=64 --time_based=1 \ > > --thread=1 --overwrite=1 --runtime=100 --output-format=terse > > > > For every kernel test, the file system was recreated, and the 2GB file > > was pre-allocated. > > > > In an attempt to further isolate the problem, I tested both xfs and ext4 > > in the same condition. > > > > The script I used is available at: > > > > <https://people.collabora.com/~krisman/dio/bench.sh> > > > > * Results > > > > I obtained the following performance results, relative to the baseline > > 5.4.0-original-dio. > > > > | IOPS | > > | kernel | read-4k | read-16k | write-4k | write-16k | > > |------------------------+------------------+-------------------+-------------------+-------------------+ > > | 5.13.0-rc3-iomap-ext4 | 1.01192950082305 | 1.00026413252562 | 0.806377013901006 | 1.00020735846057 | > > | 5.5.0-old-iomap-ext4 | 1.01154156662508 | 0.998753983520427 | 0.777051125458035 | 0.999937792461829 | > > | 5.13.0-rc3-iomap-xfs | 1.00234888443008 | 1.00027645151444 | 1.00996172750095 | 1.00156349447934 | > > | 5.5.0-old-iomap-xfs | 1.00010412786902 | 1.00202731110586 | 1.01502846821264 | 1.00149431330769 | > > > > > > Total IO is the amount of data copied (relative to baseline). > > > > | TOTAL_IO > > | kernel | read-4k | read-16k | write-4k | write-16k | > > |------------------------+------------------+-------------------+-------------------+-------------------| > > | 5.13.0-rc3-iomap-ext4 | 1.01193023173156 | 1.00026332569559 | 0.806377530301477 | 1.00014686835205 | > > | 5.5.0-old-iomap-ext4 | 1.01154196621591 | 0.998758131673757 | 0.777050753425118 | 0.999902824986834 | > > | 5.13.0-rc3-iomap-xfs | 1.00234893734134 | 1.00027535318322 | 1.00996437458991 | 1.00156305646789 | > > | 5.5.0-old-iomap-xfs | 1.00010328564078 | 1.00202831801018 | 1.01503060595258 | 1.00149069402364 | > > > > With a visualization of the above data here: > > > > <https://people.collabora.com/~krisman/dio/bench.png> > > > > The only out of the ordinary result seems to be in write-4k for Ext4, > > which suggests around 20% less IOPS (and total IO) for iomap in > > comparison to the original DIO. This is not a one-off run, as it seems > > to be consistently reproducible with more test runs in my environment. > > The performance reduction also doesn't reproduce on XFS. > > > > I tried to limit the influence of other parts of the kernel that could > > affect the behavior by comparing the kernel immediately before the > > introduction of dio-iomap for ext4 with the first version with that > > feature. By also observing that xfs doesn't change, I believe it to be > > ext4 specific. > > > > I'm also publishing raw data and all related material to the link below, > > in case anyone wants to tinker with my data: > > > > https://people.collabora.com/~krisman/dio/ > > > > Perhaps I'm missing something obvious. But I can't pinpoint a specific > > problem with my analysis. Is this expected, given the way ext4 iomap > > work? Do you have any idea of the root cause or how it can be improved? > > > > I will keep looking to this issue, but I'd like to share this partial > > result, in case there is a problem with my analysis, or if you have any > > suggestion. > > > > Thanks, > > -- > Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Potential regression with iomap DIO for 4k writes 2021-06-18 17:21 ` Theodore Ts'o @ 2021-06-23 16:04 ` Gabriel Krisman Bertazi 2021-08-10 15:28 ` Gabriel Krisman Bertazi 0 siblings, 1 reply; 5+ messages in thread From: Gabriel Krisman Bertazi @ 2021-06-23 16:04 UTC (permalink / raw) To: Theodore Ts'o; +Cc: kernel, linux-ext4 "Theodore Ts'o" <tytso@mit.edu> writes: > On Tue, Jun 15, 2021 at 08:17:55PM -0400, Gabriel Krisman Bertazi wrote: > Apologies for the delay in responding; somehow I missed your initial > e-mail on the subject on June 2nd, although I haven't found it in the > mailing list archives[1]. I don't know if it got caught in a spam > trap, or was accidentally deleted from my inbox. > > [1] https://lore.kernel.org/linux-ext4/87lf7rkffv.fsf@collabora.com/ > > I didn't do any bs=4k benchmarks before we landed the DIO iomap > changes, and it's interesting that it largely goes away with a 16k > block size[2] > > [2] https://people.collabora.com/~krisman/dio/week21/bench.png > > Looking at your flame graphs[3][4] > > [3] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.4.0-dio_original-dio-ext4-write-4k.svg > [4] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.5.0-dio_old-iomap-ext4-write-4k.svg > > ... nothing immediately jumps out at me. > > Have you compared the output of /proc/lock_stat for the two kernels? > And is there anything obvious in the blktrace of the two kernels? Hi Ted, I updated and rerun my test script [1] to collect lock_stat and blktrace. I particularly don't see anything standing out on lock-stat, but I'm sharing the raw data [2], in case you have a better insight. blktrace, on the other hand, shows something interesting. Looks like the original DIO is always keeping a lower device queue depth than iomap [3], although I'd expect the opposite if there is FS contention, correct? Iff ext4 is taking longer to submit bios, it would be expected for the device queue depth to be more empty instead of more full, right? [1] https://people.collabora.com/~krisman/dio/week24/bench-regression.sh [2] https://people.collabora.com/~krisman/dio/week24/ [3] (LARGE SVG!) https://people.collabora.com/~krisman/dio/week24/iowatcher-charts.svg -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Potential regression with iomap DIO for 4k writes 2021-06-23 16:04 ` Gabriel Krisman Bertazi @ 2021-08-10 15:28 ` Gabriel Krisman Bertazi 0 siblings, 0 replies; 5+ messages in thread From: Gabriel Krisman Bertazi @ 2021-08-10 15:28 UTC (permalink / raw) To: Theodore Ts'o; +Cc: kernel, linux-ext4, Khazhy Kumykov Gabriel Krisman Bertazi <krisman@collabora.com> writes: > "Theodore Ts'o" <tytso@mit.edu> writes: > >> On Tue, Jun 15, 2021 at 08:17:55PM -0400, Gabriel Krisman Bertazi wrote: > >> Apologies for the delay in responding; somehow I missed your initial >> e-mail on the subject on June 2nd, although I haven't found it in the >> mailing list archives[1]. I don't know if it got caught in a spam >> trap, or was accidentally deleted from my inbox. >> >> [1] https://lore.kernel.org/linux-ext4/87lf7rkffv.fsf@collabora.com/ >> >> I didn't do any bs=4k benchmarks before we landed the DIO iomap >> changes, and it's interesting that it largely goes away with a 16k >> block size[2] >> >> [2] https://people.collabora.com/~krisman/dio/week21/bench.png >> >> Looking at your flame graphs[3][4] >> >> [3] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.4.0-dio_original-dio-ext4-write-4k.svg >> [4] https://people.collabora.com/~krisman/dio/week23/clean_flames/5.5.0-dio_old-iomap-ext4-write-4k.svg >> >> ... nothing immediately jumps out at me. >> >> Have you compared the output of /proc/lock_stat for the two kernels? >> And is there anything obvious in the blktrace of the two kernels? > > Hi Ted, > > I updated and rerun my test script [1] to collect lock_stat and blktrace. > I particularly don't see anything standing out on lock-stat, but I'm > sharing the raw data [2], in case you have a better insight. > > blktrace, on the other hand, shows something interesting. Looks like > the original DIO is always keeping a lower device queue depth than iomap > [3], although I'd expect the opposite if there is FS contention, > correct? Iff ext4 is taking longer to submit bios, it would be expected > for the device queue depth to be more empty instead of more full, > right? Hello, I continued to explore this issue with different parameters and types of device under test. In particular, I attempted to isolate the parameters where the slowdown is observed. Is anyone also able to reproduce the issue I'm reporting here? * Different queue depths First of all, I attempted to reproduce it with different queue depths and the results can be seen in Figure [1]. This shows that, for lower fio queue depths, 4k write performance is much better in iomap than the original DIO. But, as the queue depth increases, the original DIO performance increases, eventually becoming much better than iomap for higher queue depths. The odd thing is that this chart suggests a similar behavior for other types of IO. The main difference from my previous tests where I detected it only on 4k-writes is that this test iteration lacks perf/blktrace instrumentation during the test run. I don't know if this explains the difference from what I reported earlier. Further tests is needed. * Different block schedulers Another thing that I attempted to do is using different block schedulers, as shown in Figure [2]. This result reiterates the previous queue depth results for different kinds of IO, but also suggests that IO schedulers are not playing much of a role here. Despite a change on the total IOPS, an offset between original DIO and iomap performance remains. * Different type of device I also experimented with using different types of devices. For that, I ran the same Fio benchmark with ramfs and nvme. The results shown in Figure [3] suggests I'm only able to observe the problem with nvme. On ramfs, iomap behavior is constantly better than original DIO. * Differences between original DIO vs. iomap. By looking at the flame graphs I observed that some of the completions were happening in the submission path, even if I was doing async IO. This seemed to happen, if I understand correctly, because from the time the last BIO was submitted, the last reference was already dropped by the BIO completion, so iomap_dio_rw calls iomap_dio_complete synchronously. I modified it, and could verify through the flame graphs it was working correctly and all the completions where happening inside the s_dio_done_wq workqueue. Nevertheless, this didn't yield gains in IOPS for the same benchmark[4]. * next steps I understand there is some mixed information, in particular regarding the types of IO where the slowdown is observed. While I believe this is related to the instrumentation overhead, I want to further verify that hypothesis. I also noticed that the actual device queue depth never goes too high[5], despite the fio settings. While I suspect this is specific to my device, I couldn't confirm it. I plan to test on more NVMes to see if the behavior is reproducible elsewhere. For reference, the NVMe I'm using is a Samsung 970 EVO NVMe® M.2 SSD. I'd love to know if this is reproducible by someone else, or if anything in my analysis is wrong. I'm not an expert in most of this, so I'm trying to figure out as I go. [1] https://people.collabora.com/~krisman/dio/week27/per-depth.png [2] https://people.collabora.com/~krisman/dio/week27/per-depth-per-sched.png [3] https://people.collabora.com/~krisman/dio/week27/nvme-vs-ramfs.png [4] https://people.collabora.com/~krisman/dio/week29/iops.png [5] https://people.collabora.com/~krisman/dio/week24/iowatcher-charts.svg Thank you, -- Gabriel Krisman Bertazi ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2021-08-10 15:29 UTC | newest] Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2021-06-02 23:35 Potential regression with iomap DIO for 4k writes Gabriel Krisman Bertazi 2021-06-16 0:17 ` Gabriel Krisman Bertazi 2021-06-18 17:21 ` Theodore Ts'o 2021-06-23 16:04 ` Gabriel Krisman Bertazi 2021-08-10 15:28 ` Gabriel Krisman Bertazi
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.