* Strange behavior with log IO fake-completions
@ 2018-09-10 18:07 Joshi
  2018-09-10 23:58 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread

From: Joshi @ 2018-09-10 18:07 UTC (permalink / raw)
To: linux-xfs

Hi folks,

I wanted to check the impact of log IO speed on an fsync-heavy
workload. To obtain theoretical-maximum performance data, I
fake-completed all log IOs (i.e. the log IO cost is made zero).

Test: iozone, single thread, 1GiB file, 4K record, sync for each 4K
('-eo' option).
Disk: 800GB NVMe disk. XFS based on 4.15, default options except log
size = 184M.
Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 CPUs each.

And the results are:

------------------------------------------------
baseline        log fake-completion
 109,845                     45,538
------------------------------------------------

I wondered why fake-completion turned out to be ~50% slower!
May I know if anyone has encountered this before, or knows why this
can happen?

For fake-completion, I just tag all log IO buffer pointers (in
xlog_sync), and later, in xfs_buf_submit, I complete those tagged log
IOs without any real bio formation (by commenting out the call to
_xfs_buf_ioapply). I hope this is correct/enough to do nothing!

It seems to me that CPU count/frequency is playing a role here. The
data above was obtained with the CPU frequency set to higher values.
To keep the running CPU at a nearly constant high frequency, I tried
things such as the performance governor, BIOS-based performance
settings, explicitly setting the CPU scaling max frequency, etc.
However, the results did not differ much, and moreover the frequency
did not remain constant/high.

But when I used the "affine/bind" option of iozone (-P option), iozone
runs on a single CPU all the time, and I get to see the expected
result:

-------------------------------------------------------------
baseline (affine)       log fake-completion (affine)
 125,253                                     163,367
-------------------------------------------------------------

Also, during the above episode, I felt the need to find the best way
to eliminate CPU frequency variation from benchmarking. I'd be
thankful to know about it.

Thanks,
--
Joshi

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions
  2018-09-10 18:07 Strange behavior with log IO fake-completions Joshi
@ 2018-09-10 23:58 ` Dave Chinner
  2018-09-11  4:10   ` Joshi
  0 siblings, 1 reply; 10+ messages in thread

From: Dave Chinner @ 2018-09-10 23:58 UTC (permalink / raw)
To: Joshi; +Cc: linux-xfs

On Mon, Sep 10, 2018 at 11:37:45PM +0530, Joshi wrote:
> Hi folks,
> I wanted to check the impact of log IO speed on an fsync-heavy
> workload. To obtain theoretical-maximum performance data, I
> fake-completed all log IOs (i.e. the log IO cost is made zero).
>
> Test: iozone, single thread, 1GiB file, 4K record, sync for each 4K
> ('-eo' option).
> Disk: 800GB NVMe disk. XFS based on 4.15, default options except log
> size = 184M.
> Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 CPUs each.
>
> And the results are:
>
> ------------------------------------------------
> baseline        log fake-completion
>  109,845                     45,538
> ------------------------------------------------
>
> I wondered why fake-completion turned out to be ~50% slower!
> May I know if anyone has encountered this before, or knows why this
> can happen?

You made all log IO submission/completion synchronous.

https://marc.info/?l=linux-xfs&m=153532933529663&w=2

> For fake-completion, I just tag all log IO buffer pointers (in
> xlog_sync), and later, in xfs_buf_submit, I complete those tagged log
> IOs without any real bio formation (by commenting out the call to
> _xfs_buf_ioapply). I hope this is correct/enough to do nothing!

It'll work, but it's a pretty silly thing to do. See above.

> It seems to me that CPU count/frequency is playing a role here.

Unlikely.

> The data above was obtained with the CPU frequency set to higher
> values. To keep the running CPU at a nearly constant high frequency,
> I tried things such as the performance governor, BIOS-based
> performance settings, explicitly setting the CPU scaling max
> frequency, etc. However, the results did not differ much, and
> moreover the frequency did not remain constant/high.
>
> But when I used the "affine/bind" option of iozone (-P option),
> iozone runs on a single CPU all the time, and I get to see the
> expected result:
>
> -------------------------------------------------------------
> baseline (affine)       log fake-completion (affine)
>  125,253                                     163,367
> -------------------------------------------------------------

Yup, because now it forces the work that gets handed off to another
workqueue (the CIL push workqueue) to also run on the same CPU rather
than asynchronously on another CPU. The result is that you essentially
force everything to run in a tight loop on a hot CPU cache. Hitting a
hot cache can make code run much, much faster in microbenchmark
situations like this, leading to optimisations that don't actually
work in the real world, where those same code paths never run confined
to a single pre-primed, hot CPU cache.

When you combine that with the fact that IOZone has a very well known
susceptibility to CPU cache residency effects, it means the results
are largely useless for comparison between different kernel builds.
This is because small code changes can result in sufficiently large
changes in kernel CPU cache footprint that they perturb IOZone
behaviour. We typically see variations of over +/-10% from IOZone just
by running two kernels that have slightly different config parameters.

IOWs, don't use IOZone for anything related to performance testing.

> Also, during the above episode, I felt the need to find the best way
> to eliminate CPU frequency variation from benchmarking. I'd be
> thankful to know about it.

I've never bothered with tuning for affinity or CPU frequency scaling
when perf testing. If you have to rely on such things to get optimal
performance from your filesystem algorithms, you are doing it wrong.

That is: a CPU running at near full utilisation will always run at
maximum frequency, hence if you have to tune CPU frequency to get
decent performance, your algorithm is limited by something that
prevents full CPU utilisation, not CPU frequency.

Similarly, if you have to use affinity to get decent performance,
you're optimising for limited system utilisation rather than being
able to use all the resources in the machine effectively. The first
goal of filesystem optimisation is to utilise every resource as
efficiently as possible. Then people can constrain their workloads
with affinity, containers, etc. however they want without having to
care about performance - it will never be worse than the performance
at full resource utilisation.....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions
  2018-09-10 23:58 ` Dave Chinner
@ 2018-09-11  4:10   ` Joshi
  2018-09-12  0:42     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread

From: Joshi @ 2018-09-11 4:10 UTC (permalink / raw)
To: david; +Cc: linux-xfs

> > Test: iozone, single thread, 1GiB file, 4K record, sync for each 4K
> > ('-eo' option).
> > Disk: 800GB NVMe disk. XFS based on 4.15, default options except
> > log size = 184M.
> > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 CPUs each.
> >
> > And the results are:
> > ------------------------------------------------
> > baseline        log fake-completion
> >  109,845                     45,538
> > ------------------------------------------------
> > I wondered why fake-completion turned out to be ~50% slower!
> > May I know if anyone has encountered this before, or knows why this
> > can happen?
>
> You made all log IO submission/completion synchronous.
>
> https://marc.info/?l=linux-xfs&m=153532933529663&w=2
>
> > For fake-completion, I just tag all log IO buffer pointers (in
> > xlog_sync), and later, in xfs_buf_submit, I complete those tagged
> > log IOs without any real bio formation (by commenting out the call
> > to _xfs_buf_ioapply). I hope this is correct/enough to do nothing!
>
> It'll work, but it's a pretty silly thing to do. See above.

Thank you very much, Dave. I feel things are somewhat different here
than in the other email thread you pointed to. Only the log IO was
fake-completed; it is not that the entire XFS volume was on a ramdisk.
The underlying disk was NVMe, and I checked that no merging/batching
happened for log IO submissions in the base case. The completion
counts were the same (as many as submitted), too.

The call that I disabled in the base code (in xfs_buf_submit, for log
IOs) is _xfs_buf_ioapply. So the only thing that happened differently
for the log IO submitter thread is that it executed the buffer
completion-handling code (xfs_buf_ioend_async). And that pushes the
processing to a worker anyway. It still remains mostly async, I
suppose.

In the original form, it would have executed extra code to form/send
the bio (with the possibility of submission/completion merging, but
that did not happen for this workload), and the completion would have
arrived after some time, say T. I wondered about the impact on XFS if
this time T can be made very low by the underlying storage for certain
IOs. If the underlying device/layers provide some sort of
differentiated I/O service enabling ultra-low-latency completion for
certain IOs (flagged as urgent), and one chooses log IO to take that
low-latency path - won't we see the same problem as shown by
fake-completion?

Thanks,
--
Joshi

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions
  2018-09-11  4:10   ` Joshi
@ 2018-09-12  0:42     ` Dave Chinner
  2018-09-14 11:51       ` Joshi
  0 siblings, 1 reply; 10+ messages in thread

From: Dave Chinner @ 2018-09-12 0:42 UTC (permalink / raw)
To: Joshi; +Cc: linux-xfs

On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote:
> > > Test: iozone, single thread, 1GiB file, 4K record, sync for each
> > > 4K ('-eo' option).
> > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except
> > > log size = 184M.
> > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 CPUs each.
> > >
> > > And the results are:
> > > ------------------------------------------------
> > > baseline        log fake-completion
> > >  109,845                     45,538
> > > ------------------------------------------------
> > > I wondered why fake-completion turned out to be ~50% slower!
> > > May I know if anyone has encountered this before, or knows why
> > > this can happen?
> >
> > You made all log IO submission/completion synchronous.
> >
> > https://marc.info/?l=linux-xfs&m=153532933529663&w=2
> >
> > > For fake-completion, I just tag all log IO buffer pointers (in
> > > xlog_sync), and later, in xfs_buf_submit, I complete those tagged
> > > log IOs without any real bio formation (by commenting out the
> > > call to _xfs_buf_ioapply). I hope this is correct/enough to do
> > > nothing!
> >
> > It'll work, but it's a pretty silly thing to do. See above.
>
> Thank you very much, Dave. I feel things are somewhat different here
> than in the other email thread you pointed to. Only the log IO was
> fake-completed; it is not that the entire XFS volume was on a
> ramdisk.

The "fake completion" forces log IO to be completed synchronously,
just like a ramdisk does. It's exactly the same behaviour.

> The underlying disk was NVMe, and I checked that no merging/batching
> happened for log IO submissions in the base case. The completion
> counts were the same (as many as submitted), too.

Of course. Async or sync, the amount of IO work an isolated single
threaded, synchronous write workload does is the same, because there
is no other IO that can be aggregated for either the data or journal
writes....

> The call that I disabled in the base code (in xfs_buf_submit, for
> log IOs) is _xfs_buf_ioapply.

Yes, I assumed you were doing something like that. The problem is,
you're not looking at what the journal is doing as a whole when
xfs_log_force_lsn() is called from fsync.

In this case, the initial log state is that the current iclogbuf is
ACTIVE because we have no journal writes in progress. The typical,
full IO path through this code is as follows:

fsync() process                 CIL workqueue           log workqueue

xfs_log_force_lsn()
  xlog_cil_force_lsn()
    xlog_cil_push_now(seq)
      <flushes pending push work>
      <queues required push work>
    xlog_wait(xc_commit_wait);
    <blocks waiting for CIL push>
                                <walks items in CIL>
                                <formats into log buffer>
                                xlog_write()
                                  .....
                                  iclogbuf goes to SYNCING
                                  xfs_buf_submit(iclogbuf)
                                    __xfs_buf_submit()
                                      submit_bio()
                                <push work complete>
                                <wakes push waiters>
<wakes, returns to __xfs_log_force_lsn()>
__xfs_log_force_lsn()
  <iclogbuf in SYNCING>
  xlog_wait(ic_force_wait)
  <blocks waiting for IO completion>

<IO completion on some CPU>
xfs_buf_endio_async()
  queue_work(log-wq)
                                                        <wakes>
                                                        xlog_iodone
                                                          xlog_state_do_callback
                                                            wake_up_all(ic_force_wait)
                                                        <work complete>
<wakes, returns to __xfs_log_force_lsn()>
<log force done>
<returns to fsync context>

In normal cases, the overhead of IO completion queuing the
xlog_iodone completion work to the log wq can be mostly ignored -
it's just part of the IO completion delay time. What's important to
realise is that it runs some time after the CIL push waiters were
woken and have gone back to sleep waiting for the log IO to complete,
so the processing is completely separated and there's no contention
between submission and completion.

> So the only thing that happened differently for the log IO submitter
> thread is that it executed the buffer completion-handling code
> (xfs_buf_ioend_async). And that pushes the processing to a worker
> anyway. It still remains mostly async, I suppose.

Well, when and how the completion runs depends on a *lot* of things.
Let's simplify - the above case is:

fsync proc           CIL                IO-complete       log
xfs_log_force_lsn
  <wait for CIL>
                     xlog_write()
                     wake CIL waiters
  <wait for iclogbuf>
                                        <queue to log>
                                                          xlog_iodone()
                                                          wake iclogbuf waiters
  log force done.

Essentially, there are 7 context switches in that, but we just don't
see that overhead because it's tiny compared to the IO time. There's
also no contention between the fsync proc waiting on iclogbuf IO
completion and the iclogbuf completion processing, so that runs as
fast as possible.

> In the original form, it would have executed extra code to form/send
> the bio (with the possibility of submission/completion merging, but
> that did not happen for this workload), and the completion would
> have arrived after some time, say T. I wondered about the impact on
> XFS if this time T can be made very low by the underlying storage
> for certain IOs.

Even when IO times are as short as a few tens of microseconds, this
is sufficient to avoid contention between submission and completion
processing and turn the context switches into noise. And even with
shorter IO delays, the async processing still allows the log state
machine to aggregate concurrent fsyncs - see the recent work with AIO
fsync to scale out to concurrently writing and fsyncing tens of
thousands of files per second on high speed nvme SSDs.

So, why don't fake completions work well? It's because the IO delay
goes away completely and that causes submission/completion processing
contention in the log code.
If we have lots of idle CPUs sitting around with a synchronous log IO
completion, we get this fsync behaviour:

fsync proc           CIL                IO-complete       log
xfs_log_force_lsn
  <wait for CIL>
                     xlog_write()
                                        <queue to log-wq>
                     wake CIL waiters                     xlog_iodone()
  <wait for iclogbuf>                                     wake iclogbuf waiters
  log force done.

IOWs, we now have the iclogbuf IO completion work racing with the
completion of the CIL push, finalising the log state after the
writes, waking CIL waiters and going to sleep waiting for log IO
completion and state machine transitions.

Basically, it causes cross-CPU contention for the same log cachelines
and locks because we've triggered a synchronous IO completion of the
iclogbuf. That contention kills CPU performance because it's now
having to wait for the cache coherency protocol to fetch data from
other CPU caches and/or memory. This can easily slow critical code
down by a factor of 3-4 orders of magnitude. Racing/spurious
sleep/wakeup events have a fair bit of overhead, too.

Now, your other "unexpected" result - let's use affinity to confine
everything to one CPU:

fsync proc           CIL                IO-complete       log
xfs_log_force_lsn
  <wait for CIL>
                     xlog_write()
                                        <queue to log>
                     wake CIL waiters
                                                          xlog_iodone()
                                                          wake iclogbuf waiters
  log force done

Note the difference? The log work can't start until the CIL work has
completed because it's constrained to the same CPU, so it doesn't
cause any contention with the finalising of the CIL push and waking
waiters.

Further, because the log workqueue is a high priority workqueue and
we're confined to a single CPU, it gets processed by the workqueue
infrastructure before it releases the CPU to schedule other tasks.
This changes the order of operations in the code. And because it's
all on the one CPU, this is effectively a CPU bound operation and the
cachelines stay hot in the CPU and there's no lock contention to
invalidate them. The code runs as fast as it is possible to run as a
result.
Combine those things and the result is that we do less work and the
CPU executes it faster. Not to mention that only running on one CPU
core allows modern CPUs to maximise the clock frequency on that one
core, making the affinity-constrained workload look even faster than
one that is spread across multiple cores. (Reliable performance
benchmarking is hard!)

> If the underlying device/layers provide some sort of differentiated
> I/O service enabling ultra-low-latency completion for certain IOs
> (flagged as urgent), and one chooses log IO to take that low-latency
> path - won't we see the same problem as shown by fake-completion?

Maybe, maybe not. It really depends on what is happening system wide,
not on what is happening to any single thread or IO....

IOWs, I wouldn't be trying to extrapolate how a filesystem will react
to a specific type of storage from a single threaded, synchronous
write workload. All you'll succeed in doing is micro-optimising
behaviour for single threaded, synchronous writes on that specific
storage device.

Perhaps it would be better for you to describe what you are trying to
achieve first, so we can point you in the right direction rather than
playing "20 Questions" and still not understanding why the filesystem
isn't doing what you expect it to be doing....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-12 0:42 ` Dave Chinner @ 2018-09-14 11:51 ` Joshi 2018-09-17 2:56 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Joshi @ 2018-09-14 11:51 UTC (permalink / raw) To: david; +Cc: linux-xfs Thank you Dave, for sharing the details and putting the pieces together. > So, why don't fake completions work well? It's because the IO delay > goes away completely and that causes submission/completion > processing contention in the log code. If we have lots of idle CPUs > sitting around with a synchronous log IO completion, we get this > fsync behaviour: > > fsync proc CIL IO-complete log > xfs_log_force_lsn > <wait for CIL> > xlog_write() > <queue to log-wq> > wake CIL waiters xlog_iodone() > <wait for iclogbuf> wake iclogbug waiters > > log force done. > > IOWs, we now have the iclogbuf IO completion work racing with the > completion of the CIL push, finalising the log state after the > writes, waking CIL waiters and going to sleep waiting for log IO > completion and state machine transitions. > > Basically, it causes cross-CPU contention for the same log cachelines > and locks because we've triggered a synchronous IO completion of the > iclogbuf. That contention kills CPU performance because it's now > having to wait for the cache coherency protocol to fetch data from > other CPU caches and/or memory Sorry, but the contention that you mentioned seems to be more about fast-completion-of-log-IO rather than synchronous-completion-of-log-IO. Isn't it? The code that gets executed in case of fake-completion would be exactly same had storage completed log IO blazing fast. And I do not see any different/better numbers when I perform fake-completion in asynchronous way (i.e. increment b_io_remaining, and queue bp in a workqueue). I gathered some more numbers with iozone, and with filebench's varmail (another sync-heavy benchmark). With 48 and 16 cpus (reduced through nr_cpus kernel parameter). 
Numbers indicate that whatever contention was applicable for 48 cpus, does seem to vanish while running with 16 cpus. In fact, fake-completion outperforms base with good margins while running with 16 cpus. ************************ Varmail (nr_cpus = 48 vs nr_cpus=16) *************************************** Threads Base Fake-completion Diff% (fake-completion vs base) 1 53,662 / 55,815 38,322 / 63,919 -28.6% / 14.5% 2 89,382 / 80,196 84,183 / 94,269 -5.8% / 17.5% 4 148,953 / 144,781 141,030 / 163,434 -5.3% / 12.9% 8 230,970 / 226,314 230,410/ 243,300 -0.2% / 7.5% 16 235,077 / 242,282 234,817/ 244,312 -0.1% / 0.8% 32 235,752 / 243,885 233,517/ 245,685 -0.9% / 1.0% 64 234,659 / 240,872 232,855/ 243,884 -0.8% / 1.3% ************************ Iozone (nr_cpus = 48 vs nr_cpus=16) **************************************** Threads Base Fake-completion Diff% (fake-completion vs base) 1 99,691 / 110,804 45,801 / 151,478 -54.1% / 36.7% 2 182,221 / 188,503 145,968 / 227,948 -19.8% / 20.9% 4 336,585 / 333,111 326,697 / 392,058 -2.9% / 17.7% 8 553,251 / 544,356 621,871 / 677,801 12.4% / 24.5% 16 768,023 / 779,373 944,148 / 964,868 22.9% / 23.8% 32 934,050 / 959,441 1088,719 / 1094,018 16.6% / 14% 64 1053,738 / 1015,464 1086,711 / 1045,853 3.1% / 3% > Now, your other "unexpected" result - lets use affinity to confine > everything to one CPU: > > fsync proc CIL IO-complete log > xfs_log_force_lsn > <wait for CIL> > xlog_write() > <queue to log> > wake CIL waiters > xlog_iodone() > wake iclogbug waiters > log force done > > Note the difference? The log work can't start until the CIL work has > completed because it's constrained to the same CPU, so it doesn't > cause any contention with the finalising of the CIL push and waking > waiters. "Log work can't start" part does not sound very good either. It needed to be done anyway before task waiting for fsync is woken. And even without affinity thing, work-queue infrastructure executes deferred work on the same cpu where it was queued. 
At least on 4.15, I see that if I queue work on cpu X, kworker running on cpu X picks it up and processes. So I fancy that for a single fsync IO, all the workers that needed to be executed would have executed on same cpu, even without affinity. > > If underlying device/layers provide some sort of differentiated I/O > > service enabling ultra-low-latency completion for certain IOs (flagged > > as urgent), and one chooses log IO to take that low-latency path - > > won't we see same problem as shown by fake-completion? > > Maybe, maybe not. It really depends on what is happening system > wide, not what is happening to any single thread or IO.... > > IOWs, I wouldn't be trying to extrapolate how a filesystem > will react to a specific type of storage from a single threaded, > synchronous write workload. All you'll succeed in doing is > micro-optimising behaviour for single threaded, synchronous > writes on that specific storage device. > > Perhaps it would be better for you to describe what you are trying > to acheive first so we can point you in the right direction rather > than playing "20 Questions" and still not understanding of why the > filesystem isn't doing what you expect it to be doing.... Single threaded IO is about latency. And storage-systems need to care about that apart from multi-threaded, throughput scenarios. In the above varmail/iozone data, XFS seems to be friendlier to throughput than to latency. After 8 or more threads, XFS transaction speed became independent of log IO speed (which is a good thing). But below 8, it responded odd to log-speed-increase. Underlying storage (NVMe) allowed me to complete certain IOs with a very low latency (around ~3 microsecond for 4K ). And I leveraged that facility to complete log-IO with very low-latency. With the expectation that latency of single-threaded fsync IOs would be improved as a result of that. I hope it is not too unnatural to expect that. 
To rule-out the device peculiarities and to determine where exactly problem lies, I chose to try the whole thing with fake completion. I'd appreciate if you have suggestions about what could be changed in XFS to get advantage of low-latency log IO. Thanks, On Wed, Sep 12, 2018 at 6:13 AM Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote: > > > > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K ( > > > > '-eo' option). > > > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M > > > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each > > > > > > > > And results are : > > > > ------------------------------------------------ > > > > baseline log fake-completion > > > > 109,845 45,538 > > > > ------------------------------------------------ > > > > I wondered why fake-completion turned out to be ~50% slower! > > > > May I know if anyone encountered this before, or knows why this can happen? > > > > > > You made all log IO submission/completion synchronous. > > > > > > https://marc.info/?l=linux-xfs&m=153532933529663&w=2 > > > > > > > For fake-completion, I just tag all log IOs bufer-pointers (in > > > > xlog_sync). And later in xfs_buf_submit, I just complete those tagged > > > > log IOs without any real bio-formation (comment call to > > > > _xfs_bio_ioapply). Hope this is correct/enough to do nothing! > > > > > > It'll work, but it's a pretty silly thing to do. See above. > > > > Thank you very much, Dave. I feel things are somewhat different here > > than other email-thread you pointed. > > Only log IO was fake-completed, not that entire XFS volume was on > > ramdisk. > > The "fake completion" forces log IO to be completed synchronously, > just like a ramdisk does. It's exactly the same behaviour. > > > Underlying disk was NVMe, and I checked that no > > merging/batching happened for log IO submissions in base case. 
> > Completion count were same (as many as submitted) too. > > Of course. async or sync, the amount of IO work an isolated single > threaded, synchronous write workload does is the same because there > is no other IO that can be aggregated for either the data or journal > writes.... > > > Call that I disabled in base code (in xfs_buf_submit, for log IOs) is > > _xfs_buf_ioapply. > > Yes, I assumed you were doing something like that. The problem is, > you're not looking at what the journal is doing as a whole when > xfs_log_force_lsn() is called from fsync. > > In this case, the initial log state is that the current iclogbuf is > ACTIVE because we have no journal writes in progress. The typical, > full IO path through this code is as follows: > > fsync() process CIL workqueue log workqueue > > xfs_log_force_lsn() > xlog_cil_force_lsn() > xlog_cil_push_now(seq) > <flushes pending push work> > <queues required push work> > xlog_wait(xc_commit_wait); > <blocks waiting for CIL push> > > <walks items in CIL> > <formats into log buffer> > xlog_write() > ..... > iclogbuf goes to SYNCING > xfs_buf_submit(iclogbuf) > __xfs_buf_submit() > submit_bio() > <push work complete> > <wakes push waiters> > > <wakes, returns to __xfs_log_force_lsn()> > __xfs_log_force_lsn() > <iclogbuf in SYNCING> > xlog_wait(ic_force_wait) > <blocks waiting for IO completion> > > <IO completion on some CPU> > xfs_buf_endio_async() > queue_work(log-wq) > <wakes> > xlog_iodone > xlog_state_do_callback > wake_up_all(ic_force_wait) > <work complete> > <wakes, returns to __xfs_log_force_lsn()> > <log force done> > <returns to fsync context> > > In normal cases, the overhead of IO completion queuing the > xlog_iodone completion work to the log wq can be mostly ignored - > it's just part of the IO completion delay time. 
What's important to > realise is that it runs some time after the CIL push waiters were > woken and have gone back to sleep waiting for the log IO to complete, > so the processing is completely separated and there's no contention > between submission and completion. > > > So the only thing that happened differently for the log IO > > submitter thread is that it executed the bp-completion-handling code > > (xfs_buf_ioend_async). And that anyway pushes the processing to a > > worker. It still remains mostly async, I suppose. > > Well, when and how the completion runs depends on a *lot* of things. > Let's simplify - the above case is: > > fsync proc CIL IO-complete log > xfs_log_force_lsn > <wait for CIL> > xlog_write() > wake CIL waiters > <wait for iclogbuf> > <queue to log> > xlog_iodone() > wake iclogbuf waiters > log force done. > > Essentially, there are 7 context switches in that, but we just don't > see that overhead because it's tiny compared to the IO time. There's > also no contention between the fsync proc waiting on iclogbuf IO > completion and iclogbuf completion processing, so that runs as fast > as possible. > > > In its original form, it > > would have executed extra code to form/send a bio (possibility of > > submission/completion merging, but that did not happen for this > > workload), and completion would have come after some time, say T. > > I wondered about the impact on XFS if this time T can be made very low by > > underlying storage for certain IOs. > > Even when IO times are as short as a few tens of microseconds, this is > sufficient to avoid contention between submission and completion > processing and turn the context switches into noise. And even with > shorter IO delays, the async processing still allows the log state > machine to aggregate concurrent fsyncs - see the recent > work with AIO fsync to scale out to concurrently writing and > fsyncing tens of thousands of files per second on high speed nvme > SSDs. > > So, why don't fake completions work well?
It's because the IO delay > goes away completely and that causes submission/completion > processing contention in the log code. If we have lots of idle CPUs > sitting around with a synchronous log IO completion, we get this > fsync behaviour: > > fsync proc CIL IO-complete log > xfs_log_force_lsn > <wait for CIL> > xlog_write() > <queue to log-wq> > wake CIL waiters xlog_iodone() > <wait for iclogbuf> wake iclogbuf waiters > > log force done. > > IOWs, we now have the iclogbuf IO completion work racing with the > completion of the CIL push, finalising the log state after the > writes, waking CIL waiters and going to sleep waiting for log IO > completion and state machine transitions. > > Basically, it causes cross-CPU contention for the same log cachelines > and locks because we've triggered a synchronous IO completion of the > iclogbuf. That contention kills CPU performance because it's now > having to wait for the cache coherency protocol to fetch data from > other CPU caches and/or memory. This can easily slow critical code > down by 3-4 orders of magnitude. Racing/spurious > sleep/wakeup events have a fair bit of overhead, too. > > Now, your other "unexpected" result - let's use affinity to confine > everything to one CPU: > > fsync proc CIL IO-complete log > xfs_log_force_lsn > <wait for CIL> > xlog_write() > <queue to log> > wake CIL waiters > xlog_iodone() > wake iclogbuf waiters > log force done > > Note the difference? The log work can't start until the CIL work has > completed because it's constrained to the same CPU, so it doesn't > cause any contention with the finalising of the CIL push and waking > waiters. > > Further, because the log workqueue is a high priority workqueue and > we're confined to a single CPU, it gets processed by the workqueue > infrastructure before it releases the CPU to schedule other tasks. > This changes the order of operations in the code.
> > And because it's all on the one CPU, this is effectively a CPU bound > operation and the cachelines stay hot in the CPU and there's no lock > contention to invalidate them. The code runs as fast as it is > possible to run as a result. > > Combine those things and the result is that we do less work and the > CPU executes it faster. Not to mention that only running on one CPU > core allows modern CPUs to maximise the clock frequency on that one > core, making the affinity-constrained workload look even faster than > one that is spread across multiple cores. (Reliable performance > benchmarking is hard!) > > > If underlying device/layers provide some sort of differentiated I/O > > service enabling ultra-low-latency completion for certain IOs (flagged > > as urgent), and one chooses log IO to take that low-latency path - > > won't we see the same problem as shown by fake-completion? > > Maybe, maybe not. It really depends on what is happening system > wide, not what is happening to any single thread or IO.... > > IOWs, I wouldn't be trying to extrapolate how a filesystem > will react to a specific type of storage from a single threaded, > synchronous write workload. All you'll succeed in doing is > micro-optimising behaviour for single threaded, synchronous > writes on that specific storage device. > > Perhaps it would be better for you to describe what you are trying > to achieve first so we can point you in the right direction rather > than playing "20 Questions" and still not understanding why the > filesystem isn't doing what you expect it to be doing.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Joshi ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-14 11:51 ` Joshi @ 2018-09-17 2:56 ` Dave Chinner 2018-09-21 0:23 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2018-09-17 2:56 UTC (permalink / raw) To: Joshi; +Cc: linux-xfs On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote: > Thank you Dave, for sharing the details and putting the pieces together. > > > So, why don't fake completions work well? It's because the IO delay > > goes away completely and that causes submission/completion > > processing contention in the log code. If we have lots of idle CPUs > > sitting around with a synchronous log IO completion, we get this > > fsync behaviour: > > > > fsync proc CIL IO-complete log > > xfs_log_force_lsn > > <wait for CIL> > > xlog_write() > > <queue to log-wq> > > wake CIL waiters xlog_iodone() > > <wait for iclogbuf> wake iclogbug waiters > > > > log force done. > > > > IOWs, we now have the iclogbuf IO completion work racing with the > > completion of the CIL push, finalising the log state after the > > writes, waking CIL waiters and going to sleep waiting for log IO > > completion and state machine transitions. > > > > Basically, it causes cross-CPU contention for the same log cachelines > > and locks because we've triggered a synchronous IO completion of the > > iclogbuf. That contention kills CPU performance because it's now > > having to wait for the cache coherency protocol to fetch data from > > other CPU caches and/or memory > > Sorry, but the contention that you mentioned seems to be more about > fast-completion-of-log-IO rather than > synchronous-completion-of-log-IO. Isn't it? > The code that gets executed in case of fake-completion would be > exactly same had storage completed log IO blazing fast. No, no it isn't. Blazing fast storage is not the same zero-delay, IO bypass completion like your fake-completion hack. 
Especially considering that it avoids sending REQ_FLUSH to the underlying device on each fsync, which typically adds some additional delay to the IO that it is delivered with. From that perspective, fake-completion is actually causing out of order completion of the journal IO vs the rest of the IO that is in flight. i.e. because it doesn't go through the IO layer, scheduler, etc, it's not actually being issued and completed with all the other IO that is in progress. Hence it's basically skipping the pending IO completion queue at all layers and jumping straight to the head of the queue. IOWs, it's in no way reflective of how the journal behaves when there is real IO load present on the underlying device. And now that I've mentioned REQ_FLUSH, I have to point out that fake-completion is breaking the integrity model of the filesystem. Failing to issue REQ_FLUSH on log IOs means that volatile caches below the filesystem are not flushed before the journal is written to. This also changes the way the underlying device processes IO, and so it may be operating way faster than it should because it's not getting regular cache flushes. But even more important: to provide the correct data integrity semantics to fsync() callers, we really do need to issue two IOs for each fsync - both the data and the log for related metadata need to be written. So let's go back to looking at where all our time in fsync is being spent, and why fake-completion is not telling you what you think it is. With fast nvme storage (using your numbers of 3uS per IO) and ignoring all the CPU time between the IOs, we could do ~160k fsyncs per second from a single thread on such a device due to the 6uS per fsync IOwait time constraint. If we add in 1uS of CPU time to do the data write, and another 1uS of CPU time for the fsync (i.e. all the journal write code) we're now at 8uS per fsync and so down to 125k operations per second.
And then we've got application time to set up for the next write+fsync, which if there's real data to be processed is going to be a few uS too. But let's consider that it's IOZone, so it's just feeding the same data in a tight loop, which gives us roughly another 1uS per loop. IOWs, we're now at 9uS of CPU+IOwait time per write/fsync loop, and that puts us at ~111k fsyncs per second. That's right on the number that IOZone is giving you for single threaded performance. To make it go faster, we have to either reduce CPU overhead or reduce IO time. We can't reduce IO time - we have to do two physical IOs. That leaves us with reducing CPU time processing the IO. Your fake completion code does not just reduce the CPU time of the journal processing slightly - it completely omits one of those two IOs (the journal IO) that are required for fsync. IOWs, instead of 6uS of IOwait time, we now only have 3uS. That means the per-fsync time is back down to 6uS, which is somewhere around 160k fsyncs per second - that's roughly 40% faster. And if we look at your single threaded 16 CPU iozone results we see: > Threads Base Fake-completion Diff% > 1 110,804 151,478 36.7% The change in performance is right on the number I'd expect from eliding the log IO. The measured improvement hasn't come from fake-completion simulating lower CPU usage, it's come directly from not issuing and waiting for journal IO. And your high cpu count results confirm that the contention problems between submission and completion I described result in reduced performance at low thread counts. IOWs, your results are confirming what I've been saying tends to happen with zero-delay IO completion setups. IOWs, we already know that removing the journal IO wait time gives a performance improvement, that it is visible even with really fast IO subsystems, and we know we cannot make optimisations in this manner.
IOWs, your test doesn't tell us that there's a problem with asynchronous log IO completion algorithms on asynchronous storage, it just tells us that: a) the journal IO is only ~35% of the fsync overhead on your device; and b) synchronous and/or zero delay IO completions are not handled well, but that still only extends to ramdisks and pmem in practice. > And I do not see any different/better numbers when I perform > fake-completion in an asynchronous way (i.e. increment b_io_remaining, > and queue bp in a workqueue). That's because you're not changing CPU usage in any meaningful way - the change in performance you've measured is caused by eliding the journal IO and scheduling immediate IO completion, not by the extra context switch that queuing the log IO completion causes. And, FWIW, workqueues can queue work onto other CPUs - if the work is already running, then it will continue to be run on the CPU it is running on, not get scheduled on the queuing CPU. When the XFS log is busy, this happens a *lot*. [snip] > > Now, your other "unexpected" result - let's use affinity to confine > > everything to one CPU: > > > > fsync proc CIL IO-complete log > > xfs_log_force_lsn > > <wait for CIL> > > xlog_write() > > <queue to log> > > wake CIL waiters > > xlog_iodone() > > wake iclogbuf waiters > > log force done > > > > Note the difference? The log work can't start until the CIL work has > > completed because it's constrained to the same CPU, so it doesn't > > cause any contention with the finalising of the CIL push and waking > > waiters. > > "Log work can't start" part does not sound very good either. It needed > to be done anyway before the task waiting for fsync is woken. I'm beginning to think you don't really understand how logging in XFS works.
If you want to improve the logging subsystem, it would be a good idea for you to read and understand this document so you have some idea of how the data that is written to the log gets there and what happens to it after it's in the log. Documentation/filesystems/xfs-delayed-logging-design.txt > > Perhaps it would be better for you to describe what you are trying > > to acheive first so we can point you in the right direction rather > > than playing "20 Questions" and still not understanding of why the > > filesystem isn't doing what you expect it to be doing.... > > Single threaded IO is about latency. And storage-systems need to care > about that apart from multi-threaded, throughput scenarios. Who does single threaded write IO these days? What real world application problem are you trying to solve? IMO, If you care about single operation latency in isolation from everything else, XFS is not the filesystem to use. XFS's algorithms are CPU intensive because they are designed to scale out to large CPU counts efficiently. e.g. a single write+fsync on ext4 takes less than 50% of the CPU time to execute and complete than XFS. So for the sorts of single threaded fsync workloads you are talking about, ext4 will always have much lower latency for single operations than XFS because it burns much less CPU between waiting for IOs. > To rule-out the device peculiarities and to determine where exactly > problem lies, I chose to try the whole thing with fake completion. > > I'd appreciate if you have suggestions about what could be changed in > XFS to get advantage of low-latency log IO. I still don't understand what application you have that requires optimisation of single threaded fsync latency. Without some context of the application level problem you are trying to solve I can't answer questions like this. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-17 2:56 ` Dave Chinner @ 2018-09-21 0:23 ` Dave Chinner 2018-09-21 12:54 ` Joshi 0 siblings, 1 reply; 10+ messages in thread From: Dave Chinner @ 2018-09-21 0:23 UTC (permalink / raw) To: Joshi; +Cc: linux-xfs On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote: > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote: > > > Now, your other "unexpected" result - lets use affinity to confine > > > everything to one CPU: > > > > > > fsync proc CIL IO-complete log > > > xfs_log_force_lsn > > > <wait for CIL> > > > xlog_write() > > > <queue to log> > > > wake CIL waiters > > > xlog_iodone() > > > wake iclogbug waiters > > > log force done > > > > > > Note the difference? The log work can't start until the CIL work has > > > completed because it's constrained to the same CPU, so it doesn't > > > cause any contention with the finalising of the CIL push and waking > > > waiters. > > > > "Log work can't start" part does not sound very good either. It needed > > to be done anyway before task waiting for fsync is woken. > > I'm beginning to think you don't really understand how logging in > XFS works. If you want to improve the logging subsystem, it would > be a good idea for you to read and understand this document so you > have some idea of how the data that is written to the log gets > there and what happens to it after it's in the log. > > Documentation/filesystems/xfs-delayed-logging-design.txt It occurred to me yesterday that looking at the journal in a different way might help explain how it all works. Instead of looking at it as an IO engine, think of how an out-of-order CPU is designed to be made up of multiple stages in a pipeline - each stage does a small piece of work that it passes on to the next stage to process. Individual operation progresses serially through the pipeline, but, each stage of the pipeline can be operating on a different operation. 
Hence we can have multiple operations in flight at once, and the operations can also be run out of order as dynamic stage completion scheduling dictates. However, from a high level everything appears to complete in order because the re-ordering stages put everything in order once the individual operations have been executed. Similarly, the XFS journalling subsystem is an out of order, multi-stage pipeline with a post-IO re-ordering stage to ensure the end result is that individual operations always appear to complete in order. Indeed, what ends up on disk in the journal is not in order, so one of the things that log recovery does is rebuild the state necessary to reorder operations correctly before replay so that, again, it appears like everything occurred in the order that the transactions were committed to the journal. So perhaps looking at it as a multi-stage pipeline might also help explain why fake-completion changes the behaviour in unpredictable ways. i.e. it basically chops out stages of the pipeline, changing the length of the pipeline and the order in which stages of the pipeline are executed. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-21 0:23 ` Dave Chinner @ 2018-09-21 12:54 ` Joshi 2018-09-24 14:25 ` Joshi 0 siblings, 1 reply; 10+ messages in thread From: Joshi @ 2018-09-21 12:54 UTC (permalink / raw) To: david; +Cc: linux-xfs I added some timing measurement code and obtained data for "dd" sync requests (dd if=/dev/zero of=/mnt/f1 bs=4k count=1000 oflag=sync). Let me define these times - T1 = xfs_log_force_lsn time (loosely, this is the overall time to complete one fsync request) And within xfs_log_force_lsn, these were added - T2 = xlog_cil_force_lsn time T3 = xlog_wait time for ic_force_wait Last one is this - T4 = time elapsed between issuing the wake-up on ic_force_wait (during completion) and the thread waiting on ic_force_wait (dd in this case) resuming execution Basically T1 = T2 + T3 + some_more, and T3 = T4 + some_more. Here is the data for 1 dd-sync request (the trend remains mostly the same for all requests) - Base case: T1 = 37096, T2 = 11070, T3 = 19130, T4 = 2410 FC case: T1 = 101206, T2 = 47150, T3 = 45443, T4 = 29483 T4 in the fake-completion case seemed particularly odd to me. "dd" took so much time to resume even after the wake up was issued on ic_force_wait! Looks like fast log completion is making xfs wait queues behave oddly. I changed all log wait queues (ic_force_wait, ic_write_wait, and xc_commit_wait) into completion variables (despite knowing that it is also implemented using wait_queue). And it brought down the value of T4. Overall numbers with iozone/varmail look like this - Iozone(48 cpus) Threads Base FC 1 149149.2 151444.5 2 251794.5 255033.1 Varmail (48 cpus) Threads Base FC 1 61500.9 62396.4 2 102834.9 103080.1 However, the base case numbers also improved. Earlier base numbers were comparatively very low with 48 cpus.
Thanks, On Fri, Sep 21, 2018 at 5:53 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote: > > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote: > > > > Now, your other "unexpected" result - lets use affinity to confine > > > > everything to one CPU: > > > > > > > > fsync proc CIL IO-complete log > > > > xfs_log_force_lsn > > > > <wait for CIL> > > > > xlog_write() > > > > <queue to log> > > > > wake CIL waiters > > > > xlog_iodone() > > > > wake iclogbug waiters > > > > log force done > > > > > > > > Note the difference? The log work can't start until the CIL work has > > > > completed because it's constrained to the same CPU, so it doesn't > > > > cause any contention with the finalising of the CIL push and waking > > > > waiters. > > > > > > "Log work can't start" part does not sound very good either. It needed > > > to be done anyway before task waiting for fsync is woken. > > > > I'm beginning to think you don't really understand how logging in > > XFS works. If you want to improve the logging subsystem, it would > > be a good idea for you to read and understand this document so you > > have some idea of how the data that is written to the log gets > > there and what happens to it after it's in the log. > > > > Documentation/filesystems/xfs-delayed-logging-design.txt > > It occurred to me yesterday that looking at the journal in a > different way might help explain how it all works. > > Instead of looking at it as an IO engine, think of how an > out-of-order CPU is designed to be made up of multiple stages in a > pipeline - each stage does a small piece of work that it passes on > to the next stage to process. Individual operation progresses serially > through the pipeline, but, each stage of the pipeline can be > operating on a different operation. 
Hence we can have multiple > operations in flight at once, and the operations can also be run > out of order as dynamical stage completion scheduling dictates. > However, from a high level everything appears to complete in order > because of re-ordering stages put everythign in order once the > indiivdual operations have been executed. > > Similarly, the XFS journalling subsystem is an out of order, > multi-stage pipeline with a post-IO re-ordering stage to ensure the > end result is that individual operations always appear to complete > in order. Indeed, what ends up on disk in the journal is not in > order, so one of the things that log recovery does is rebuild the > state necessarily to reorder operations correctly before replay so > that, again, it appears like everything occurred in the order that > the transactions were committed to the journal. > > So perhaps looking at it as a multi-stage pipeline might also help > explain why fake-completion changes the behaviour in unpredictable > ways. i.e. it basically chops out stages of the pipeline, changing > the length of the pipeline and the order in which stages of the > pipleine are executed. > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com -- Joshi ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-21 12:54 ` Joshi @ 2018-09-24 14:25 ` Joshi 2018-09-26 0:59 ` Dave Chinner 0 siblings, 1 reply; 10+ messages in thread From: Joshi @ 2018-09-24 14:25 UTC (permalink / raw) To: david; +Cc: linux-xfs > I changed all log wait queues (ic_force_wait, ic_write_wait, and xc_commit_wait) into completion variables (despite knowing that it is also implemented using wait_queue). Sorry, ignore this and the subsequent part of the text. I think replacing with completion requires more work than I did. Calling "complete_all" is tricky, given the need of calling "reinit_completion", which may get executed by multiple threads in this case. On Fri, Sep 21, 2018 at 6:24 PM Joshi <joshiiitr@gmail.com> wrote: > > I added some timing measurement code and obtained data for "dd" sync > requests (dd if=/dev/zero of=/mnt/f1 bs=4k count=1000 oflag=sync). > Let me define these times - > T1= xfs_log_force_lsn time (loosely, this is overall time to complete > one fsync request) > And within xfs_log_force_lsn , these were added - > T2 = xlog_cil_force_lsn time > T3 = xlog_wait time for ic_force_wait > Last one is this - > T4 = Time elapsed between issuing wake-up on ic_force_wait (during > completion) and thread waiting on ic_force_wait (dd in this case) > resumed execution > Basically T1 = T2+T3 +some_more, and T3 = T4 + some_more. > > Here is the data for 1 dd-sync request (trend remains mostly same for > all requests) - > Base case: T1= 37096, T2=11070, T3=19130, T4=2410 > FC case: T1= 101206, T2=47150, T3=45443, T4=29483 > > T4 in fake-completion case seemed particularly odd to me. "dd" took so > much time to resume even after wake up was issued on ic_force_wait! > Looks like fast log completion is making xfs wait queues behave odd. > I changed all log wait queues (ic_force_wait, ic_write_wait, and > xc_commit_wait) into completion variables (despite knowing that it is > also implemented using wait_queue).
> And it brought down the value of T4. > Overall numbers with iozone/varmail look like this - > > Iozone(48 cpus) > Threads Base FC > 1 149149.2 151444.5 > 2 251794.5 255033.1 > Varmail (48 cpus) > Threads Base FC > 1 61500.9 62396.4 > 2 102834.9 103080.1 > > However, base case numbers also got improved. Earlier base numbers > were comparatively very low with 48 cpus. > > Thanks, > On Fri, Sep 21, 2018 at 5:53 AM Dave Chinner <david@fromorbit.com> wrote: > > > > On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote: > > > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote: > > > > > Now, your other "unexpected" result - lets use affinity to confine > > > > > everything to one CPU: > > > > > > > > > > fsync proc CIL IO-complete log > > > > > xfs_log_force_lsn > > > > > <wait for CIL> > > > > > xlog_write() > > > > > <queue to log> > > > > > wake CIL waiters > > > > > xlog_iodone() > > > > > wake iclogbug waiters > > > > > log force done > > > > > > > > > > Note the difference? The log work can't start until the CIL work has > > > > > completed because it's constrained to the same CPU, so it doesn't > > > > > cause any contention with the finalising of the CIL push and waking > > > > > waiters. > > > > > > > > "Log work can't start" part does not sound very good either. It needed > > > > to be done anyway before task waiting for fsync is woken. > > > > > > I'm beginning to think you don't really understand how logging in > > > XFS works. If you want to improve the logging subsystem, it would > > > be a good idea for you to read and understand this document so you > > > have some idea of how the data that is written to the log gets > > > there and what happens to it after it's in the log. > > > > > > Documentation/filesystems/xfs-delayed-logging-design.txt > > > > It occurred to me yesterday that looking at the journal in a > > different way might help explain how it all works. 
> > > > Instead of looking at it as an IO engine, think of how an > > out-of-order CPU is designed to be made up of multiple stages in a > > pipeline - each stage does a small piece of work that it passes on > > to the next stage to process. Individual operation progresses serially > > through the pipeline, but, each stage of the pipeline can be > > operating on a different operation. Hence we can have multiple > > operations in flight at once, and the operations can also be run > > out of order as dynamical stage completion scheduling dictates. > > However, from a high level everything appears to complete in order > > because of re-ordering stages put everythign in order once the > > indiivdual operations have been executed. > > > > Similarly, the XFS journalling subsystem is an out of order, > > multi-stage pipeline with a post-IO re-ordering stage to ensure the > > end result is that individual operations always appear to complete > > in order. Indeed, what ends up on disk in the journal is not in > > order, so one of the things that log recovery does is rebuild the > > state necessarily to reorder operations correctly before replay so > > that, again, it appears like everything occurred in the order that > > the transactions were committed to the journal. > > > > So perhaps looking at it as a multi-stage pipeline might also help > > explain why fake-completion changes the behaviour in unpredictable > > ways. i.e. it basically chops out stages of the pipeline, changing > > the length of the pipeline and the order in which stages of the > > pipleine are executed. > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david@fromorbit.com > > > > -- > Joshi -- Joshi ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Strange behavior with log IO fake-completions 2018-09-24 14:25 ` Joshi @ 2018-09-26 0:59 ` Dave Chinner 0 siblings, 0 replies; 10+ messages in thread From: Dave Chinner @ 2018-09-26 0:59 UTC (permalink / raw) To: Joshi; +Cc: linux-xfs On Mon, Sep 24, 2018 at 07:55:02PM +0530, Joshi wrote: > > I changed all log wait queues (ic_force_wait, ic_write_wait, and > xc_commit_wait) into completion variables (despite knowing that it is > also implemented using wait_queue). > > Sorry, ignore this and subsequent part of the text. I think replacing > with completion requires more work than I did.Calling "complete_all" > is tricky, given the need of calling "reinit_completion" which may get > executed by multiple threads in this case. Yup. Perhaps you might like to look at this recent patch set aimed at reducing spurious wake-ups and lock contention in the log reservation code: https://marc.info/?l=linux-kernel&m=153531683127094&w=2 Not sure it applies to this case, but it still may be useful to you. BTW, when posting benchmark results, can you also include a baseline result from a completely unmodified kernel and the patches you are testing? Nobody can evaluate the correctness or effectiveness of the changes you are testing without these things.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2018-09-26 7:10 UTC | newest] Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-09-10 18:07 Strange behavior with log IO fake-completions Joshi 2018-09-10 23:58 ` Dave Chinner 2018-09-11 4:10 ` Joshi 2018-09-12 0:42 ` Dave Chinner 2018-09-14 11:51 ` Joshi 2018-09-17 2:56 ` Dave Chinner 2018-09-21 0:23 ` Dave Chinner 2018-09-21 12:54 ` Joshi 2018-09-24 14:25 ` Joshi 2018-09-26 0:59 ` Dave Chinner