* Strange behavior with log IO fake-completions
@ 2018-09-10 18:07 Joshi
  2018-09-10 23:58 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joshi @ 2018-09-10 18:07 UTC (permalink / raw)
  To: linux-xfs

Hi folks,
I wanted to check the impact of log IO speed on an fsync-heavy workload.
To obtain a theoretical-maximum performance figure, I fake-completed
all log IOs (i.e. the log IO cost is made zero).

Test: iozone, single thread, 1GiB file, 4K record size, sync after each 4K
write ('-eo' options).
Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 CPUs each

And the results are:
------------------------------------------------
baseline                log fake-completion
109,845                 45,538
------------------------------------------------
I wondered why fake-completion turned out to be ~50% slower!
May I know if anyone encountered this before, or knows why this can happen?

For fake-completion, I just tag all log IO buffer pointers (in
xlog_sync). And later, in xfs_buf_submit, I just complete those tagged
log IOs without any real bio formation (by commenting out the call to
_xfs_buf_ioapply). I hope this is correct/enough to make the log IO do nothing!
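
Roughly, the hack looks like this (a sketch only - the _XBF_FAKE_LOG_IO
flag name is made up for illustration, and buffer reference counting and
error handling are omitted):

    /* in xlog_sync(): tag the iclog buffer before submitting it */
    bp->b_flags |= _XBF_FAKE_LOG_IO;

    /* in xfs_buf_submit(): short-circuit tagged log buffers */
    if (bp->b_flags & _XBF_FAKE_LOG_IO) {
            /* no bio is built; complete through the normal async path */
            xfs_buf_ioend_async(bp);
            return;
    }
    _xfs_buf_ioapply(bp);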

It seems to me that CPU count/frequency is playing a role here.
The above data was obtained with the CPU frequency set high. To keep
the running CPU at a nearly constant high frequency, I tried things
such as the performance governor, BIOS-based performance settings,
explicitly setting the cpufreq scaling max frequency, etc. However,
the results did not differ much, and the frequency still did not remain
constant/high.

But when I used "affine/bind" option of iozone (-P option), iozone
runs on single cpu all the time, and I get to see expected result -
-------------------------------------------------------------
baseline (affine)         log fake-completion(affine)
125,253                     163,367
-------------------------------------------------------------

Also, during the above exercise, I felt the need for a good way to
eliminate CPU frequency variations from benchmarking. I'd be
thankful to learn about it.


Thanks,
-- 
Joshi


* Re: Strange behavior with log IO fake-completions
  2018-09-10 18:07 Strange behavior with log IO fake-completions Joshi
@ 2018-09-10 23:58 ` Dave Chinner
  2018-09-11  4:10   ` Joshi
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2018-09-10 23:58 UTC (permalink / raw)
  To: Joshi; +Cc: linux-xfs

On Mon, Sep 10, 2018 at 11:37:45PM +0530, Joshi wrote:
> Hi folks,
> I wanted to check log IO speed impact during fsync-heavy workload.
> To obtain theoretical maximum performance data, I did fake-completion
> of all log IOs (i.e. log IO cost is made 0).
> 
> Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> '-eo' option).
> Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> 
> And results are :
> ------------------------------------------------
> baseline                log fake-completion
> 109,845                 45,538
> ------------------------------------------------
> I wondered why fake-completion turned out to be ~50% slower!
> May I know if anyone encountered this before, or knows why this can happen?

You made all log IO submission/completion synchronous.

https://marc.info/?l=linux-xfs&m=153532933529663&w=2

> For fake-completion, I just tag all log IOs bufer-pointers (in
> xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> log IOs without any real bio-formation (comment call to
> _xfs_bio_ioapply). Hope this is correct/enough to do nothing!

It'll work, but it's a pretty silly thing to do. See above.

> It seems to me that CPU count/frequency is playing a role here.

Unlikely.

> Above data was obtained with CPU frequency set to higher values. In
> order to keep running CPU at nearly constant high frequency, I tried
> things such as - performance governor, bios-based performance
> settings, explicit setting of cpu scaling max frequency etc. However,
> results did not differ much. Moreover frequency did not remain
> constant/high.
> 
> But when I used "affine/bind" option of iozone (-P option), iozone
> runs on single cpu all the time, and I get to see expected result -
> -------------------------------------------------------------
> baseline (affine)         log fake-completion(affine)
> 125,253                     163,367
> -------------------------------------------------------------

Yup, because now it forces the work that gets handed off to another
workqueue (the CIL push workqueue) to also run on the same CPU
rather than asynchronously on another CPU. The result is that you
essentially force everything to run in a tight loop on a hot CPU
cache. Hitting a hot cache can make code run much, much faster in
microbenchmark situations like this, leading to optimisations that
don't actually work in the real world where those same code paths
never run confined to a single pre-primed, hot CPU cache.

When you combine that with the fact that IOZone has a very well
known susceptibility to CPU cache residency effects, it means the
results are largely useless for comparison between different kernel
builds. This is because small code changes can result in
sufficiently large changes in kernel CPU cache footprint that it
perturbs IOZone behaviour. We typically see variations of over
+/-10% from IOZone just by running two kernels that have slightly
different config parameters.

IOWs, don't use IOZone for anything related to performance testing.

> Also, during above episode, I felt the need to discover best way to
> eliminate cpu frequency variations out of benchmarking. I'd be
> thankful knowing about it.

I've never bothered with tuning for affinity or CPU frequency
scaling when perf testing. If you have to rely on such things to get
optimal performance from your filesystem algorithms, you are doing
it wrong.

That is: a CPU running at near full utilisation will always be run
at maximum frequency, hence if you have to tune CPU frequency to get
decent performance your algorithm is limited by something that
prevents full CPU utilisation, not CPU frequency.

Similarly, if you have to use affinity to get decent performance,
you're optimising for limited system utilisation rather than
being able to use all the resources in the machine effectively. The
first goal of filesystem optimisation is to utilise every resource as
efficiently as possible. Then people can constrain their workloads
with affinity, containers, etc however they want without having to
care about performance - it will never be worse than the performance
at full resource utilisation.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Strange behavior with log IO fake-completions
  2018-09-10 23:58 ` Dave Chinner
@ 2018-09-11  4:10   ` Joshi
  2018-09-12  0:42     ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joshi @ 2018-09-11  4:10 UTC (permalink / raw)
  To: david; +Cc: linux-xfs

> > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> > '-eo' option).
> > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> >
> > And results are :
> > ------------------------------------------------
> > baseline                log fake-completion
> > 109,845                 45,538
> > ------------------------------------------------
> > I wondered why fake-completion turned out to be ~50% slower!
> > May I know if anyone encountered this before, or knows why this can happen?
>
> You made all log IO submission/completion synchronous.
>
> https://marc.info/?l=linux-xfs&m=153532933529663&w=2
>
> > For fake-completion, I just tag all log IOs bufer-pointers (in
> > xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> > log IOs without any real bio-formation (comment call to
> > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
>
> It'll work, but it's a pretty silly thing to do. See above.

Thank you very much, Dave. I feel things are somewhat different here
than in the other email thread you pointed to.
Only the log IO was fake-completed; the entire XFS volume was not on a
ramdisk. The underlying disk was NVMe, and I checked that no
merging/batching happened for log IO submissions in the base case.
The completion count was the same (as many as submitted), too.
The call that I disabled in the base code (in xfs_buf_submit, for log IOs) is
_xfs_buf_ioapply. So the only thing that happened differently for the log IO
submitter thread is that it executed the bp completion-handling code
(xfs_buf_ioend_async), and that anyway pushes the processing to a
worker. It still remains mostly async, I suppose. In the original form, it
would have executed extra code to form/send the bio (with a possibility of
submission/completion merging, though that did not happen for this
workload), and the completion would have come after some time, say T.
I wondered about the impact on XFS if this time T can be made very low by
the underlying storage for certain IOs.
If the underlying device/layers provide some sort of differentiated I/O
service enabling ultra-low-latency completion for certain IOs (flagged
as urgent), and one chooses the log IO to take that low-latency path,
won't we see the same problem as shown by fake-completion?

Thanks,

On Tue, Sep 11, 2018 at 5:28 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Sep 10, 2018 at 11:37:45PM +0530, Joshi wrote:
> > Hi folks,
> > I wanted to check log IO speed impact during fsync-heavy workload.
> > To obtain theoretical maximum performance data, I did fake-completion
> > of all log IOs (i.e. log IO cost is made 0).
> >
> > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> > '-eo' option).
> > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> >
> > And results are :
> > ------------------------------------------------
> > baseline                log fake-completion
> > 109,845                 45,538
> > ------------------------------------------------
> > I wondered why fake-completion turned out to be ~50% slower!
> > May I know if anyone encountered this before, or knows why this can happen?
>
> You made all log IO submission/completion synchronous.
>
> https://marc.info/?l=linux-xfs&m=153532933529663&w=2
>
> > For fake-completion, I just tag all log IOs bufer-pointers (in
> > xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> > log IOs without any real bio-formation (comment call to
> > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
>
> It'll work, but it's a pretty silly thing to do. See above.
>
> > It seems to me that CPU count/frequency is playing a role here.
>
> Unlikely.
>
> > Above data was obtained with CPU frequency set to higher values. In
> > order to keep running CPU at nearly constant high frequency, I tried
> > things such as - performance governor, bios-based performance
> > settings, explicit setting of cpu scaling max frequency etc. However,
> > results did not differ much. Moreover frequency did not remain
> > constant/high.
> >
> > But when I used "affine/bind" option of iozone (-P option), iozone
> > runs on single cpu all the time, and I get to see expected result -
> > -------------------------------------------------------------
> > baseline (affine)         log fake-completion(affine)
> > 125,253                     163,367
> > -------------------------------------------------------------
>
> Yup, because now it forces the work that gets handed off to another
> workqueue (the CIL push workqueue) to also run on the same CPU
> rather than asynchronously on another CPU. The result is that you
> essentially force everything to run in a tight loop on a hot CPU
> cache. Hitting a hot cache can make code run much, much faster in
> microbenchmark situations like this, leading to optimisations that
> don't actually work in the real world where those same code paths
> never run confined to a single pre-primed, hot CPU cache.
>
> When you combine that with the fact that IOZone has a very well
> known susceptibility to CPU cache residency effects, it means the
> results are largely useless for comparison between different kernel
> builds. This is because small code changes can result in
> sufficiently large changes in kernel CPU cache footprint that it
> perturbs IOZone behaviour. We typically see variations of over
> +/-10% from IOZone just by running two kernels that have slightly
> different config parameters.
>
> IOWs, don't use IOZone for anything related to performance testing.
>
> > Also, during above episode, I felt the need to discover best way to
> > eliminate cpu frequency variations out of benchmarking. I'd be
> > thankful knowing about it.
>
> I've never bothered with tuning for affinity or CPU frequency
> scaling when perf testing. If you have to rely on such things to get
> optimal performance from your filesystem algorithms, you are doing
> it wrong.
>
> That is: a CPU running at near full utilisation will always be run
> at maximum frequency, hence if you have to tune CPU frequency to get
> decent performance your algorithm is limited by something that
> prevents full CPU utilisation, not CPU frequency.
>
> Similarly, if you have to use affinity to get decent performance,
> you're optimising for limited system utilisation rather than
> bein able to use all the resources in the machine effectively. The
> first goal of filesystem optimisation is to utilise every resource as
> efficiently as possible. Then people can constraint their workloads
> with affinity, containers, etc however they want without having to
> care about performance - it will never be worse than the performance
> at full resource utilisation.....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Joshi


* Re: Strange behavior with log IO fake-completions
  2018-09-11  4:10   ` Joshi
@ 2018-09-12  0:42     ` Dave Chinner
  2018-09-14 11:51       ` Joshi
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2018-09-12  0:42 UTC (permalink / raw)
  To: Joshi; +Cc: linux-xfs

On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote:
> > > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> > > '-eo' option).
> > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> > >
> > > And results are :
> > > ------------------------------------------------
> > > baseline                log fake-completion
> > > 109,845                 45,538
> > > ------------------------------------------------
> > > I wondered why fake-completion turned out to be ~50% slower!
> > > May I know if anyone encountered this before, or knows why this can happen?
> >
> > You made all log IO submission/completion synchronous.
> >
> > https://marc.info/?l=linux-xfs&m=153532933529663&w=2
> >
> > > For fake-completion, I just tag all log IOs bufer-pointers (in
> > > xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> > > log IOs without any real bio-formation (comment call to
> > > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
> >
> > It'll work, but it's a pretty silly thing to do. See above.
> 
> Thank you very much, Dave. I feel things are somewhat different here
> than other email-thread you pointed.
> Only log IO was fake-completed, not that entire XFS volume was on
> ramdisk.

The "fake completion" forces log IO to be completed synchronously,
just like a ramdisk does. It's exactly the same behaviour.

> Underlying disk was NVMe, and I checked that no
> merging/batching happened for log IO submissions in base case.
> Completion count were same (as many as submitted) too.

Of course. async or sync, the amount of IO work an isolated single
threaded, synchronous write workload does is the same because there
is no other IO that can be aggregated for either the data or journal
writes....

> Call that I disabled in base code (in xfs_buf_submit, for log IOs) is
> _xfs_buf_ioapply.

Yes, I assumed you were doing something like that. The problem is,
you're not looking at what the journal is doing as a whole when
xfs_log_force_lsn() is called from fsync.

In this case, the initial log state is that the current iclogbuf is
ACTIVE because we have no journal writes in progress. The typical,
full IO path through this code is as follows:

fsync() process			CIL workqueue		log workqueue

xfs_log_force_lsn()
  xlog_cil_force_lsn()
    xlog_cil_push_now(seq)
    	<flushes pending push work>
	<queues required push work>
    xlog_wait(xc_commit_wait);
        <blocks waiting for CIL push>

				<walks items in CIL>
				<formats into log buffer>
				xlog_write()
				  .....
				  iclogbuf goes to SYNCING
				  xfs_buf_submit(iclogbuf)
				    __xfs_buf_submit()
					submit_bio()
				<push work complete>
				<wakes push waiters>

<wakes, returns to __xfs_log_force_lsn()>
  __xfs_log_force_lsn()
    <iclogbuf in SYNCING>
    xlog_wait(ic_force_wait)
      <blocks waiting for IO completion>

				<IO completion on some CPU>
				   xfs_buf_ioend_async()
				     queue_work(log-wq)
							<wakes>
							xlog_iodone
							  xlog_state_do_callback
							    wake_up_all(ic_force_wait)
							<work complete>
<wakes, returns to __xfs_log_force_lsn()>
<log force done>
<returns to fsync context>

In normal cases, the overhead of IO completion queuing the
xlog_iodone completion work to the log wq can be mostly ignored -
it's just part of the IO completion delay time. What's important to
realise is that it runs some time after the CIL push waiters were
woken and have then gone back to sleep waiting for the log IO to complete,
so the processing is completely separated and there's no contention
between submission and completion.

> So only thing happened differently for log IO
> submitter thread is that it executed bp-completion-handling-code
> (xfs_buf_ioend_async). And that anyway pushes the processing to
> worker. It still remains mostly async, I suppose.

Well, when and how the completion runs depends on a *lot* of things.
Let's simplify - the above case is:

fsync proc		CIL		IO-complete	log
xfs_log_force_lsn
  <wait for CIL>
			xlog_write()
			wake CIL waiters
  <wait for iclogbuf>
					<queue to log>
							xlog_iodone()
							wake iclogbuf waiters
log force done.

Essentially, there are 7 context switches in that, but we just don't
see that overhead because it's tiny compared to the IO time. There's
also no contention between the fsync proc waiting on iclogbuf io
completion and iclogbuf completion processing, so that runs as fast
as possible.

> In original form, it
> would have executed extra code to form/sent bio (possibility of
> submission/completion merging, but that did not happen for this
> workload), and completion would have come after some time say T.
> I wondered about impact on XFS if this time T can be made very low by
> underlying storage for certain IOs.

Even when IO times are as short as a few tens of microseconds, this is
sufficient to avoid contention between submission and completion
processing and turn the context switches into noise. And even with
shorter IO delays, the async processing still allows the log state
machine to aggregate concurrent fsyncs - see the recent
work with AIO fsync to scale out to concurrently writing and
fsyncing tens of thousands of files per second on high speed nvme
SSDs.

So, why don't fake completions work well? It's because the IO delay
goes away completely and that causes submission/completion
processing contention in the log code. If we have lots of idle CPUs
sitting around with a synchronous log IO completion, we get this
fsync behaviour:

fsync proc		CIL		IO-complete	log
xfs_log_force_lsn
  <wait for CIL>
			xlog_write()
			  <queue to log-wq>
			wake CIL waiters		xlog_iodone()
  <wait for iclogbuf>					wake iclogbuf waiters
							
log force done.

IOWs, we now have the iclogbuf IO completion work racing with the
completion of the CIL push, finalising the log state after the
writes, waking CIL waiters and going to sleep waiting for log IO
completion and state machine transitions.

Basically, it causes cross-CPU contention for the same log cachelines
and locks because we've triggered a synchronous IO completion of the
iclogbuf. That contention kills CPU performance because it's now
having to wait for the cache coherency protocol to fetch data from
other CPU caches and/or memory. This can easily slow critical code
down by a factor of 3-4 orders of magnitude. Racing/spurious
sleep/wakeup events have a fair bit of overhead, too. 

Now, your other "unexpected" result - lets use affinity to confine
everything to one CPU:

fsync proc		CIL		IO-complete	log
xfs_log_force_lsn
  <wait for CIL>
			xlog_write()
			<queue to log>
			wake CIL waiters
							xlog_iodone()
							wake iclogbuf waiters
log force done

Note the difference? The log work can't start until the CIL work has
completed because it's constrained to the same CPU, so it doesn't
cause any contention with the finalising of the CIL push and waking
waiters.

Further, because the log workqueue is a high priority workqueue and
we're confined to a single CPU, it gets processed by the workqueue
infrastructure before it releases the CPU to schedule other tasks.
This changes the order of operations in the code.
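
(For reference, the log workqueue is created as a high priority
workqueue at mount time - from memory, something along these lines,
though the exact flags can vary by kernel version:

    mp->m_log_workqueue = alloc_workqueue("xfs-log/%s",
                    WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_HIGHPRI, 0,
                    mp->m_fsname);

It's the WQ_HIGHPRI flag that gives the log IO completion work this
scheduling priority.)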

And because it's all on the one CPU, this is effectively a CPU bound
operation and the cachelines stay hot in the CPU and there's no lock
contention to invalidate them. The code runs as fast as it is
possible to run as a result.

Combine those things and the result is that we do less work and the
CPU executes it faster. Not to mention that only running on one CPU
core allows modern CPUs to maximise the clock frequency on that one
core, making the affinity constrained workload look even faster than
one that is spread across multiple cores. (Reliable performance
benchmarking is hard!)

> If underlying device/layers provide some sort of differentiated I/O
> service enabling ultra-low-latency completion for certain IOs (flagged
> as urgent), and one chooses log IO to take that low-latency path  -
> won't we see same problem as shown by fake-completion?

Maybe, maybe not. It really depends on what is happening system
wide, not what is happening to any single thread or IO....

IOWs, I wouldn't be trying to extrapolate how a filesystem
will react to a specific type of storage from a single threaded,
synchronous write workload. All you'll succeed in doing is
micro-optimising behaviour for single threaded, synchronous
writes on that specific storage device.

Perhaps it would be better for you to describe what you are trying
to achieve first so we can point you in the right direction rather
than playing "20 Questions" and still not understanding why the
filesystem isn't doing what you expect it to be doing....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Strange behavior with log IO fake-completions
  2018-09-12  0:42     ` Dave Chinner
@ 2018-09-14 11:51       ` Joshi
  2018-09-17  2:56         ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joshi @ 2018-09-14 11:51 UTC (permalink / raw)
  To: david; +Cc: linux-xfs

Thank you Dave, for sharing the details and putting the pieces together.

> So, why don't fake completions work well? It's because the IO delay
> goes away completely and that causes submission/completion
> processing contention in the log code. If we have lots of idle CPUs
> sitting around with a synchronous log IO completion, we get this
> fsync behaviour:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                           <queue to log-wq>
>                         wake CIL waiters                xlog_iodone()
>   <wait for iclogbuf>                                   wake iclogbug waiters
>
> log force done.
>
> IOWs, we now have the iclogbuf IO completion work racing with the
> completion of the CIL push, finalising the log state after the
> writes, waking CIL waiters and going to sleep waiting for log IO
> completion and state machine transitions.
>
> Basically, it causes cross-CPU contention for the same log cachelines
> and locks because we've triggered a synchronous IO completion of the
> iclogbuf. That contention kills CPU performance because it's now
> having to wait for the cache coherency protocol to fetch data from
> other CPU caches and/or memory

Sorry, but the contention that you mentioned seems to be more about
fast completion of log IO rather than synchronous completion of log
IO, isn't it?
The code that gets executed in the case of fake-completion would be
exactly the same had the storage completed the log IO blazingly fast.
And I do not see any different/better numbers when I perform the
fake-completion in an asynchronous way (i.e. incrementing b_io_remaining
and queuing the bp on a workqueue).

I gathered some more numbers with iozone, and with filebench's varmail
(another sync-heavy benchmark), with 48 and 16 CPUs (reduced through
the nr_cpus kernel parameter).
The numbers indicate that whatever contention applied with 48 CPUs
seems to vanish when running with 16 CPUs. In fact, fake-completion
outperforms the base by a good margin when running with 16 CPUs.

************ Varmail (nr_cpus=48 / nr_cpus=16) ************
Threads    Base                  Fake-completion       Diff% (fake-completion vs base)
1           53,662 /  55,815      38,322 /  63,919     -28.6% / 14.5%
2           89,382 /  80,196      84,183 /  94,269      -5.8% / 17.5%
4          148,953 / 144,781     141,030 / 163,434      -5.3% / 12.9%
8          230,970 / 226,314     230,410 / 243,300      -0.2% /  7.5%
16         235,077 / 242,282     234,817 / 244,312      -0.1% /  0.8%
32         235,752 / 243,885     233,517 / 245,685      -0.9% /  1.0%
64         234,659 / 240,872     232,855 / 243,884      -0.8% /  1.3%

************ Iozone (nr_cpus=48 / nr_cpus=16) ************
Threads    Base                      Fake-completion           Diff% (fake-completion vs base)
1             99,691 /   110,804        45,801 /   151,478     -54.1% / 36.7%
2            182,221 /   188,503       145,968 /   227,948     -19.8% / 20.9%
4            336,585 /   333,111       326,697 /   392,058      -2.9% / 17.7%
8            553,251 /   544,356       621,871 /   677,801      12.4% / 24.5%
16           768,023 /   779,373       944,148 /   964,868      22.9% / 23.8%
32           934,050 /   959,441     1,088,719 / 1,094,018      16.6% / 14.0%
64         1,053,738 / 1,015,464     1,086,711 / 1,045,853       3.1% /  3.0%

> Now, your other "unexpected" result - lets use affinity to confine
> everything to one CPU:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         <queue to log>
>                         wake CIL waiters
>                                                         xlog_iodone()
>                                                         wake iclogbug waiters
> log force done
>
> Note the difference? The log work can't start until the CIL work has
> completed because it's constrained to the same CPU, so it doesn't
> cause any contention with the finalising of the CIL push and waking
> waiters.

"Log work can't start" part does not sound very good either. It needed
to be done anyway before task waiting for fsync is woken.
And even without affinity thing, work-queue infrastructure executes
deferred work on the same cpu where it was queued.
At least on 4.15, I see that if I queue work on cpu X, kworker running
on cpu X picks it up and processes.
So I fancy that for a single fsync IO, all the workers that needed to
be executed would have executed on same cpu, even without affinity.

> > If underlying device/layers provide some sort of differentiated I/O
> > service enabling ultra-low-latency completion for certain IOs (flagged
> > as urgent), and one chooses log IO to take that low-latency path  -
> > won't we see same problem as shown by fake-completion?
>
> Maybe, maybe not. It really depends on what is happening system
> wide, not what is happening to any single thread or IO....
>
> IOWs, I wouldn't be trying to extrapolate how a filesystem
> will react to a specific type of storage from a single threaded,
> synchronous write workload. All you'll succeed in doing is
> micro-optimising behaviour for single threaded, synchronous
> writes on that specific storage device.
>
> Perhaps it would be better for you to describe what you are trying
> to acheive first so we can point you in the right direction rather
> than playing "20 Questions" and still not understanding of why the
> filesystem isn't doing what you expect it to be doing....

Single-threaded IO is about latency, and storage systems need to care
about that apart from multi-threaded, throughput scenarios.
In the above varmail/iozone data, XFS seems to be friendlier to
throughput than to latency. At 8 or more threads, XFS transaction
speed became independent of log IO speed (which is a good thing). But
below 8, it responded oddly to the increase in log speed.
The underlying storage (NVMe) allows me to complete certain IOs with
very low latency (around ~3 microseconds for 4K), and I leveraged that
facility to complete log IO with very low latency, with the expectation
that the latency of single-threaded fsync IOs would improve as a
result. I hope it is not too unnatural to expect that.
To rule out device peculiarities and to determine where exactly the
problem lies, I chose to try the whole thing with fake completion.

I'd appreciate any suggestions about what could be changed in
XFS to take advantage of low-latency log IO.

Thanks,
On Wed, Sep 12, 2018 at 6:13 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, Sep 11, 2018 at 09:40:59AM +0530, Joshi wrote:
> > > > Test: iozone, single-thread, 1GiB file, 4K record, sync for each 4K (
> > > > '-eo' option).
> > > > Disk: 800GB NVMe disk. XFS based on 4.15, default options except log size = 184M
> > > > Machine: Intel Xeon E5-2690 @2.6 GHz, 2 NUMA nodes, 24 cpus each
> > > >
> > > > And results are :
> > > > ------------------------------------------------
> > > > baseline                log fake-completion
> > > > 109,845                 45,538
> > > > ------------------------------------------------
> > > > I wondered why fake-completion turned out to be ~50% slower!
> > > > May I know if anyone encountered this before, or knows why this can happen?
> > >
> > > You made all log IO submission/completion synchronous.
> > >
> > > https://marc.info/?l=linux-xfs&m=153532933529663&w=2
> > >
> > > > For fake-completion, I just tag all log IOs bufer-pointers (in
> > > > xlog_sync). And later in xfs_buf_submit, I just complete those tagged
> > > > log IOs without any real bio-formation (comment call to
> > > > _xfs_bio_ioapply). Hope this is correct/enough to do nothing!
> > >
> > > It'll work, but it's a pretty silly thing to do. See above.
> >
> > Thank you very much, Dave. I feel things are somewhat different here
> > than other email-thread you pointed.
> > Only log IO was fake-completed, not that entire XFS volume was on
> > ramdisk.
>
> The "fake completion" forces log IO to be completed synchronously,
> just like a ramdisk does. It's exactly the same behaviour.
>
> > Underlying disk was NVMe, and I checked that no
> > merging/batching happened for log IO submissions in base case.
> > Completion count were same (as many as submitted) too.
>
> Of course. async or sync, the amount of IO work an isolated single
> threaded, synchronous write workload does is the same because there
> is no other IO that can be aggregated for either the data or journal
> writes....
>
> > Call that I disabled in base code (in xfs_buf_submit, for log IOs) is
> > _xfs_buf_ioapply.
>
> Yes, I assumed you were doing something like that. The problem is,
> you're not looking at what the journal is doing as a whole when
> xfs_log_force_lsn() is called from fsync.
>
> In this case, the initial log state is that the current iclogbuf is
> ACTIVE because we have no journal writes in progress. The typical,
> full IO path through this code is as follows:
>
> fsync() process                 CIL workqueue           log workqueue
>
> xfs_log_force_lsn()
>   xlog_cil_force_lsn()
>     xlog_cil_push_now(seq)
>         <flushes pending push work>
>         <queues required push work>
>     xlog_wait(xc_commit_wait);
>         <blocks waiting for CIL push>
>
>                                 <walks items in CIL>
>                                 <formats into log buffer>
>                                 xlog_write()
>                                   .....
>                                   iclogbuf goes to SYNCING
>                                   xfs_buf_submit(iclogbuf)
>                                     __xfs_buf_submit()
>                                         submit_bio()
>                                 <push work complete>
>                                 <wakes push waiters>
>
> <wakes, returns to __xfs_log_force_lsn()>
>   __xfs_log_force_lsn()
>     <iclogbuf in SYNCING>
>     xlog_wait(ic_force_wait)
>       <blocks waiting for IO completion>
>
>                                 <IO completion on some CPU>
>                                    xfs_buf_endio_async()
>                                      queue_work(log-wq)
>                                                         <wakes>
>                                                         xlog_iodone
>                                                           xlog_state_do_callback
>                                                             wake_up_all(ic_force_wait)
>                                                         <work complete>
> <wakes, returns to __xfs_log_force_lsn()>
> <log force done>
> <returns to fsync context>
>
> In normal cases, the overhead of IO completion queuing the
> xlog_iodone completion work to the log wq can be mostly ignored -
> it's just part of the IO completion delay time. What's important to
> realise is that it runs some time after the CIL push waiters were
> woken and then go back to sleep waiting for the log IO to complete,
> so the processing is completely separated and there's no contention
> between submission and completion.
>
> > So only thing happened differently for log IO
> > submitter thread is that it executed bp-completion-handling-code
> > (xfs_buf_ioend_async). And that anyway pushes the processing to
> > worker. It still remains mostly async, I suppose.
>
> Well, when andhow the completion runs depends on a *lot* of things.
> Let's simplify - the above case is:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         wake CIL waiters
>   <wait for iclogbuf>
>                                         <queue to log>
>                                                         xlog_iodone()
>                                                         wake iclogbug waiters
> log force done.
>
> Essentially, there are 7 context switches in that, but we just don't
> see that overhead because it's tiny compared to the IO time. There's
> also no contention between the fsync proc waiting on iclogbuf io
> completion and iclogbuf completion processing, so that runs as fast
> as possible.
>
> > In original form, it
> > would have executed extra code to form/sent bio (possibility of
> > submission/completion merging, but that did not happen for this
> > workload), and completion would have come after some time say T.
> > I wondered about impact on XFS if this time T can be made very low by
> > underlying storage for certain IOs.
>
> Even when IO times are short as a few tens of microseconds, this is
> sufficient to avoid contention between submission and completion
> processing and turn the context switches into noise. And even with
> shorter IO delays, the async processing still allows the log state
> machine to aggregate concurrent fsyncs - see the recent
> work with AIO fsync to scale out to concurrently writing and
> fsyncing tens of thousands of files per second on high speed nvme
> SSDs.
>
> So, why don't fake completions work well? It's because the IO delay
> goes away completely and that causes submission/completion
> processing contention in the log code. If we have lots of idle CPUs
> sitting around with a synchronous log IO completion, we get this
> fsync behaviour:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                           <queue to log-wq>
>                         wake CIL waiters                xlog_iodone()
>   <wait for iclogbuf>                                   wake iclogbug waiters
>
> log force done.
>
> IOWs, we now have the iclogbuf IO completion work racing with the
> completion of the CIL push, finalising the log state after the
> writes, waking CIL waiters and going to sleep waiting for log IO
> completion and state machine transitions.
>
> Basically, it causes cross-CPU contention for the same log cachelines
> and locks because we've triggered a synchronous IO completion of the
> iclogbuf. That contention kills CPU performance because it's now
> having to wait for the cache coherency protocol to fetch data from
> other CPU caches and/or memory. This can easily slow critical code
> down by a factor of 3-4 orders of magnitude. Racing/spurious
> sleep/wakeup events have a fair bit of overhead, too.
>
> Now, your other "unexpected" result - lets use affinity to confine
> everything to one CPU:
>
> fsync proc              CIL             IO-complete     log
> xfs_log_force_lsn
>   <wait for CIL>
>                         xlog_write()
>                         <queue to log>
>                         wake CIL waiters
>                                                         xlog_iodone()
>                                                         wake iclogbug waiters
> log force done
>
> Note the difference? The log work can't start until the CIL work has
> completed because it's constrained to the same CPU, so it doesn't
> cause any contention with the finalising of the CIL push and waking
> waiters.
>
> Further, because the log workqueue is a high priority workqueue and
> we're confined to a single CPU, it gets processed by the workqueue
> infrastructure before it releases the CPU to schedule other tasks.
> This changes the order of operations in the code.
>
> And because it's all on the one CPU, this is effectively a CPU bound
> operation and the cachelines stay hot in the CPU and there's no lock
> contention to invalidate them. The code runs as fast as it is
> possible to run as a result.
>
> Combine those things and the result is that we do less work and the
> CPU executes it faster. Not to mention that only running on one CPU
> core allows modern CPUs to maximise the clock frequency on that one
> core, making the affinity constrained workload look even faster than
> one that is spread across multiple cores. (Reliable performance
> benchmarking is hard!)
>
> > If underlying device/layers provide some sort of differentiated I/O
> > service enabling ultra-low-latency completion for certain IOs (flagged
> > as urgent), and one chooses log IO to take that low-latency path  -
> > won't we see same problem as shown by fake-completion?
>
> Maybe, maybe not. It really depends on what is happening system
> wide, not what is happening to any single thread or IO....
>
> IOWs, I wouldn't be trying to extrapolate how a filesystem
> will react to a specific type of storage from a single threaded,
> synchronous write workload. All you'll succeed in doing is
> micro-optimising behaviour for single threaded, synchronous
> writes on that specific storage device.
>
> Perhaps it would be better for you to describe what you are trying
> to acheive first so we can point you in the right direction rather
> than playing "20 Questions" and still not understanding of why the
> filesystem isn't doing what you expect it to be doing....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Joshi


* Re: Strange behavior with log IO fake-completions
  2018-09-14 11:51       ` Joshi
@ 2018-09-17  2:56         ` Dave Chinner
  2018-09-21  0:23           ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2018-09-17  2:56 UTC (permalink / raw)
  To: Joshi; +Cc: linux-xfs

On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> Thank you Dave, for sharing the details and putting the pieces together.
> 
> > So, why don't fake completions work well? It's because the IO delay
> > goes away completely and that causes submission/completion
> > processing contention in the log code. If we have lots of idle CPUs
> > sitting around with a synchronous log IO completion, we get this
> > fsync behaviour:
> >
> > fsync proc              CIL             IO-complete     log
> > xfs_log_force_lsn
> >   <wait for CIL>
> >                         xlog_write()
> >                           <queue to log-wq>
> >                         wake CIL waiters                xlog_iodone()
> >   <wait for iclogbuf>                                   wake iclogbug waiters
> >
> > log force done.
> >
> > IOWs, we now have the iclogbuf IO completion work racing with the
> > completion of the CIL push, finalising the log state after the
> > writes, waking CIL waiters and going to sleep waiting for log IO
> > completion and state machine transitions.
> >
> > Basically, it causes cross-CPU contention for the same log cachelines
> > and locks because we've triggered a synchronous IO completion of the
> > iclogbuf. That contention kills CPU performance because it's now
> > having to wait for the cache coherency protocol to fetch data from
> > other CPU caches and/or memory
> 
> Sorry, but the contention that you mentioned seems to be more about
> fast-completion-of-log-IO rather than
> synchronous-completion-of-log-IO. Isn't it?
> The code that gets executed in case of fake-completion would be
> exactly same had storage completed log IO blazing fast.

No, no it isn't. Blazing fast storage is not the same as zero-delay,
IO-bypass completion like your fake-completion hack. Especially
considering that it avoids sending REQ_FLUSH to the underlying
device on each fsync, which typically adds some delay to the IO it
is delivered with.

From that perspective, fake-completion is actually causing out-of-order
completion of the journal IO vs the rest of the IO that is in
flight. i.e. because it doesn't go through the IO layer, scheduler,
etc, it's not actually being issued and completed with all the other
IO that is in progress. Hence it's basically skipping the pending IO
completion queue at all layers and jumping straight to the head of
the queue.  IOWs, it's in no way reflective of how the journal
behaves when there is real IO load present on the underlying
device.

And now that I've mentioned REQ_FLUSH, I have to point out that
fake-completion is breaking the integrity model of the filesystem.
Failing to issue REQ_FLUSH on log IOs means that volatile caches
below the filesystem are not flushed before the journal is written
to. This also changes the way the underlying device processes IO, so it
may be operating way faster than it should because it's not getting
regular cache flushes.

But even more important: to provide the correct data integrity
semantics to fsync() callers, we really do need to issue two IOs for
each fsync - both data and log for related metadata need to be
written. So let's go back to looking at where all our time in fsync
is being spent, and why fake-completion is not telling you what you
think it is.

With fast nvme storage (using your numbers of 3uS per IO) and
ignoring all the CPU time between the IOs, we could do ~160k
fsync's per second from a single thread on such a device due to the
6uS per fsync IOWait time constraint.

If we add in 1uS of CPU time to do the data write, and another 1uS
of CPU time for the fsync (i.e. all the journal write code) we're now
at 8uS per fsync and so down to 125k operations per second.

And then we've got application time to set up for the next
write+fsync, which, if there's real data to be processed, is going to
be a few uS too. But let's consider that it's IOZone, so it's just
feeding the same data in a tight loop, which gives us roughly another
1uS per loop.  IOWs, we're now at 9uS of CPU+IOwait time per
write/fsync loop, and that puts us at ~111k fsync's per second.

That's right on the number that IOZone is giving you for single
threaded performance. To make it go faster, we have to either reduce
CPU overhead or reduce IO time. We can't reduce IO time - we have to
do two physical IOs. That leaves us with reducing CPU time
processing the IO.

Your fake completion code does not just reduce the CPU time of the
journal processing slightly - it completely omits one of those two
IOs (the journal IO) that are required for fsync. IOWs, instead of
6uS of IOwait time, we now only have 3uS. That means the
per-fsync time is back down to 6uS, which is somewhere around
160k fsyncs per second - that's roughly 40% faster.
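
To make the arithmetic explicit, here's a trivial sketch of the same
back-of-the-envelope calculation (the 1uS CPU costs are just the
illustrative guesses used above, not measurements):

    #include <stdio.h>

    int main(void)
    {
            double data_io = 3.0, log_io = 3.0;   /* uS per IO, from your device */
            double cpu = 1.0 + 1.0 + 1.0;         /* write + fsync + loop, assumed */

            printf("base:  ~%.0fk fsyncs/s\n", 1000.0 / (data_io + log_io + cpu));
            printf("faked: ~%.0fk fsyncs/s\n", 1000.0 / (data_io + cpu));
            return 0;
    }

which prints ~111k for the base case and ~167k with the journal IO
elided - consistent with the numbers above.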

And if we look at your single threaded 16 CPU iozone results we see:

> Threads	Base		Fake-completion	Diff%
> 1		110,804		151,478		36.7%

The change in performance is right on the number I'd expect from
eliding the log IO.  The measured improvement hasn't come from
fake-completion simulating lower CPU usage, it's come directly from
not issuing and waiting for journal IO.

And your high CPU count results confirm that the contention problems
between submission and completion I described result in reduced
performance at low thread counts. IOWs, your results are confirming
what I've been saying tends to happen with zero-delay IO completion
setups.

IOWs, we already know that removing the journal IO wait time gives a
performance improvement, that it is visible even with really fast IO
subsystems, and we know we cannot make optimisations in this manner.
IOWs, your test doesn't tell us that there's a problem with
asynchronous log IO completion algorithms on asynchronous storage,
it just tells us that:

	a) the journal IO is only ~35% of the fsync overhead on your
	device; and
	b) synchronous and/or zero-delay IO completions are not
	handled well, but that still only extends to ramdisks and
	pmem in practice.

> And I do not see any different/better numbers when I perform
> fake-completion in asynchronous way (i.e. increment b_io_remaining,
> and queue bp in a workqueue).

That's because you're not changing CPU usage in any meaningful way -
the change in performance you've measured is caused by eliding the
journal IO and scheduling immediate IO completion, not by the extra
context switch that queuing the log IO completion causes.

And, FWIW, workqueues can queue work onto other CPUs - if the work
is already running, then it will continue to be run on the CPU it is
running on, not get scheduled on the queuing CPU. When the XFS log
is busy, this happens a *lot*.

[snip]

> > Now, your other "unexpected" result - lets use affinity to confine
> > everything to one CPU:
> >
> > fsync proc              CIL             IO-complete     log
> > xfs_log_force_lsn
> >   <wait for CIL>
> >                         xlog_write()
> >                         <queue to log>
> >                         wake CIL waiters
> >                                                         xlog_iodone()
> >                                                         wake iclogbug waiters
> > log force done
> >
> > Note the difference? The log work can't start until the CIL work has
> > completed because it's constrained to the same CPU, so it doesn't
> > cause any contention with the finalising of the CIL push and waking
> > waiters.
> 
> "Log work can't start" part does not sound very good either. It needed
> to be done anyway before task waiting for fsync is woken.

I'm beginning to think you don't really understand how logging in
XFS works.  If you want to improve the logging subsystem, it would
be a good idea for you to read and understand this document so you
have some idea of how the data that is written to the log gets
there and what happens to it after it's in the log.

Documentation/filesystems/xfs-delayed-logging-design.txt

> > Perhaps it would be better for you to describe what you are trying
> > to acheive first so we can point you in the right direction rather
> > than playing "20 Questions" and still not understanding of why the
> > filesystem isn't doing what you expect it to be doing....
> 
> Single threaded IO is about latency. And storage-systems need to care
> about that apart from multi-threaded, throughput scenarios.

Who does single threaded write IO these days? What real world
application problem are you trying to solve?

IMO, if you care about single operation latency in isolation from
everything else, XFS is not the filesystem to use.  XFS's algorithms
are CPU intensive because they are designed to scale out to large
CPU counts efficiently. e.g. a single write+fsync on ext4 takes less
than 50% of the CPU time that XFS needs to execute and complete. So for
the sorts of single threaded fsync workloads you are talking about,
ext4 will always have much lower latency for single operations than
XFS because it burns much less CPU between waiting for IOs.

> To rule-out the device peculiarities and to determine where exactly
> problem lies, I chose to try the whole thing with fake completion.
> 
> I'd appreciate if you have suggestions about what could be changed in
> XFS to get advantage of low-latency log IO.

I still don't understand what application you have that requires
optimisation of single threaded fsync latency.  Without some context
of the application level problem you are trying to solve I can't
answer questions like this.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Strange behavior with log IO fake-completions
  2018-09-17  2:56         ` Dave Chinner
@ 2018-09-21  0:23           ` Dave Chinner
  2018-09-21 12:54             ` Joshi
  0 siblings, 1 reply; 10+ messages in thread
From: Dave Chinner @ 2018-09-21  0:23 UTC (permalink / raw)
  To: Joshi; +Cc: linux-xfs

On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote:
> On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> > > Now, your other "unexpected" result - lets use affinity to confine
> > > everything to one CPU:
> > >
> > > fsync proc              CIL             IO-complete     log
> > > xfs_log_force_lsn
> > >   <wait for CIL>
> > >                         xlog_write()
> > >                         <queue to log>
> > >                         wake CIL waiters
> > >                                                         xlog_iodone()
> > >                                                         wake iclogbug waiters
> > > log force done
> > >
> > > Note the difference? The log work can't start until the CIL work has
> > > completed because it's constrained to the same CPU, so it doesn't
> > > cause any contention with the finalising of the CIL push and waking
> > > waiters.
> > 
> > "Log work can't start" part does not sound very good either. It needed
> > to be done anyway before task waiting for fsync is woken.
> 
> I'm beginning to think you don't really understand how logging in
> XFS works.  If you want to improve the logging subsystem, it would
> be a good idea for you to read and understand this document so you
> have some idea of how the data that is written to the log gets
> there and what happens to it after it's in the log.
> 
> Documentation/filesystems/xfs-delayed-logging-design.txt

It occurred to me yesterday that looking at the journal in a
different way might help explain how it all works.

Instead of looking at it as an IO engine, think of how an
out-of-order CPU is designed to be made up of multiple stages in a
pipeline - each stage does a small piece of work that it passes on
to the next stage to process. An individual operation progresses serially
through the pipeline, but each stage of the pipeline can be
operating on a different operation. Hence we can have multiple
operations in flight at once, and the operations can also be run
out of order as dynamic stage completion scheduling dictates.
However, from a high level everything appears to complete in order
because the re-ordering stages put everything in order once the
individual operations have been executed.

Similarly, the XFS journalling subsystem is an out of order,
multi-stage pipeline with a post-IO re-ordering stage to ensure the
end result is that individual operations always appear to complete
in order.  Indeed, what ends up on disk in the journal is not in
order, so one of the things that log recovery does is rebuild the
state necessary to reorder operations correctly before replay so
that, again, it appears like everything occurred in the order that
the transactions were committed to the journal.

So perhaps looking at it as a multi-stage pipeline might also help
explain why fake-completion changes the behaviour in unpredictable
ways. i.e. it basically chops out stages of the pipeline, changing
the length of the pipeline and the order in which stages of the
pipeline are executed.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Strange behavior with log IO fake-completions
  2018-09-21  0:23           ` Dave Chinner
@ 2018-09-21 12:54             ` Joshi
  2018-09-24 14:25               ` Joshi
  0 siblings, 1 reply; 10+ messages in thread
From: Joshi @ 2018-09-21 12:54 UTC (permalink / raw)
  To: david; +Cc: linux-xfs

I added some timing measurement code and obtained data for "dd" sync
requests (dd if=/dev/zero of=/mnt/f1 bs=4k count=1000 oflag=sync).
Let me define these times -
T1= xfs_log_force_lsn time (loosely, this is overall time to complete
one fsync request)
And within xfs_log_foce_lsn , these were added -
T2 = xlog_cil_force_lsn time
T3 = xlog_wait time for ic_force_wait
Last one is this -
T4 =  Time elapsed between issuing wake-up on ic_force_wait (during
completion) and thread waiting on ic_force_wait (dd in this case)
resumed execution
Basically T1 = T2+T3 +some_more, and T3 = T4 + some_more.
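
The measurement itself was crude - roughly along the lines of this
sketch, wrapped around each call of interest (the exact placement in
the real instrumentation may differ):

    /* e.g. T1: time spent in one log force */
    u64 start = ktime_get_ns();

    /* ... existing xfs_log_force_lsn() body runs here ... */

    trace_printk("T1: %llu ns\n", ktime_get_ns() - start);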

Here is the data for one dd sync request (the trend remains mostly the
same for all requests) -
Base case: T1 =  37096,  T2 = 11070,  T3 = 19130,  T4 =  2410
FC case:   T1 = 101206,  T2 = 47150,  T3 = 45443,  T4 = 29483

T4 in fake-completion case seemed particularly odd to me. "dd" took so
much time to resume even after wake up was issued on ic_force_wait!
Looks like fast log completion is making xfs wait queues behave odd.
I changed all log wait queues (ic_force_wait, ic_write_wait, and
xc_commit_wait) into completion variables (despite knowing that it is
also implemented using wait_queue).
And it brought down the value of T4.
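
The conversion was along these lines for ic_force_wait (a rough,
incomplete sketch - in particular, correctly re-arming the completion
with reinit_completion for the next use is not shown):

    /* declaration: the wait queue head becomes a completion */
    struct completion       ic_force_wait;

    /* waiter side, instead of xlog_wait(&iclog->ic_force_wait, &log->l_icloglock): */
    spin_unlock(&log->l_icloglock);
    wait_for_completion(&iclog->ic_force_wait);

    /* waker side, instead of wake_up_all(&iclog->ic_force_wait): */
    complete_all(&iclog->ic_force_wait);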
Overall numbers with iozone/varmail look like this -

Iozone (48 CPUs)
Threads      Base           FC
1            149,149.2      151,444.5
2            251,794.5      255,033.1

Varmail (48 CPUs)
Threads      Base           FC
1             61,500.9       62,396.4
2            102,834.9      103,080.1

However, the base case numbers also improved. The earlier base numbers
were comparatively very low with 48 CPUs.
Thanks,
On Fri, Sep 21, 2018 at 5:53 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote:
> > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> > > > Now, your other "unexpected" result - lets use affinity to confine
> > > > everything to one CPU:
> > > >
> > > > fsync proc              CIL             IO-complete     log
> > > > xfs_log_force_lsn
> > > >   <wait for CIL>
> > > >                         xlog_write()
> > > >                         <queue to log>
> > > >                         wake CIL waiters
> > > >                                                         xlog_iodone()
> > > >                                                         wake iclogbug waiters
> > > > log force done
> > > >
> > > > Note the difference? The log work can't start until the CIL work has
> > > > completed because it's constrained to the same CPU, so it doesn't
> > > > cause any contention with the finalising of the CIL push and waking
> > > > waiters.
> > >
> > > "Log work can't start" part does not sound very good either. It needed
> > > to be done anyway before task waiting for fsync is woken.
> >
> > I'm beginning to think you don't really understand how logging in
> > XFS works.  If you want to improve the logging subsystem, it would
> > be a good idea for you to read and understand this document so you
> > have some idea of how the data that is written to the log gets
> > there and what happens to it after it's in the log.
> >
> > Documentation/filesystems/xfs-delayed-logging-design.txt
>
> It occurred to me yesterday that looking at the journal in a
> different way might help explain how it all works.
>
> Instead of looking at it as an IO engine, think of how an
> out-of-order CPU is designed to be made up of multiple stages in a
> pipeline - each stage does a small piece of work that it passes on
> to the next stage to process. Individual operation progresses serially
> through the pipeline, but, each stage of the pipeline can be
> operating on a different operation. Hence we can have multiple
> operations in flight at once, and the operations can also be run
> out of order as dynamic stage completion scheduling dictates.
> However, from a high level everything appears to complete in order
> because the re-ordering stages put everything back in order once the
> individual operations have been executed.
>
> Similarly, the XFS journalling subsystem is an out of order,
> multi-stage pipeline with a post-IO re-ordering stage to ensure the
> end result is that individual operations always appear to complete
> in order.  Indeed, what ends up on disk in the journal is not in
> order, so one of the things that log recovery does is rebuild the
> state necessary to reorder operations correctly before replay so
> that, again, it appears like everything occurred in the order that
> the transactions were committed to the journal.
>
> So perhaps looking at it as a multi-stage pipeline might also help
> explain why fake-completion changes the behaviour in unpredictable
> ways. i.e. it basically chops out stages of the pipeline, changing
> the length of the pipeline and the order in which stages of the
> pipeline are executed.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com



-- 
Joshi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Strange behavior with log IO fake-completions
  2018-09-21 12:54             ` Joshi
@ 2018-09-24 14:25               ` Joshi
  2018-09-26  0:59                 ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Joshi @ 2018-09-24 14:25 UTC (permalink / raw)
  To: david; +Cc: linux-xfs

> I changed all log wait queues (ic_force_wait, ic_write_wait, and
> xc_commit_wait) into completion variables (despite knowing that it is
> also implemented using wait_queue).

Sorry, ignore this and the subsequent part of the text. I think
replacing the wait queues with completions requires more work than I
did. Calling "complete_all" is tricky, given the need to call
"reinit_completion", which may end up being executed by multiple
threads in this case.
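
For the record, the problem is roughly this (hypothetical helper to
illustrate the race, not real XFS code):

#include <linux/completion.h>

/*
 * complete_all() leaves the completion permanently "done", so it has
 * to be re-armed with reinit_completion() before the iclog can be
 * reused for the next force.
 */
static void iclog_force_wait_sketch(struct completion *done)
{
	wait_for_completion(done);
	/*
	 * Racy: if every waiter re-arms here, a late waiter can call
	 * reinit_completion() after the *next* complete_all() and then
	 * sleep forever; if nobody re-arms, the next wait returns
	 * immediately. There is no obvious single safe place to do it.
	 */
	reinit_completion(done);
}
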
On Fri, Sep 21, 2018 at 6:24 PM Joshi <joshiiitr@gmail.com> wrote:
>
> I added some timing measurement code and obtained data for "dd" sync
> requests (dd if=/dev/zero of=/mnt/f1 bs=4k count=1000 oflag=sync).
> Let me define these times -
> T1= xfs_log_force_lsn time (loosely, this is overall time to complete
> one fsync request)
> And within xfs_log_force_lsn, these were added -
> T2 = xlog_cil_force_lsn time
> T3 = xlog_wait time for ic_force_wait
> Last one is this -
> T4 =  Time elapsed between issuing wake-up on ic_force_wait (during
> completion) and thread waiting on ic_force_wait (dd in this case)
> resumed execution
> Basically T1 = T2+T3 +some_more, and T3 = T4 + some_more.
>
> Here is the data for 1 dd-sync request (trend remains mostly same for
> all requests) -
> Base case: T1= 37096,  T2=11070,  T3=19130,  T4=2410
> FC case:    T1= 101206, T2=47150, T3=45443,  T4=29483
>
> T4 in fake-completion case seemed particularly odd to me. "dd" took so
> much time to resume even after wake up was issued on ic_force_wait!
> Looks like fast log completion is making xfs wait queues behave odd.
> I changed all log wait queues (ic_force_wait, ic_write_wait, and
> xc_commit_wait) into completion variables (despite knowing that it is
> also implemented using wait_queue).
> And it brought down the value of T4.
> Overall numbers with iozone/varmail look like this -
>
> Iozone(48 cpus)
> Threads       Base           FC
> 1                149149.2    151444.5
> 2                251794.5     255033.1
> Varmail (48 cpus)
> Threads      Base          FC
> 1                 61500.9     62396.4
> 2                 102834.9   103080.1
>
> However, base case numbers also got improved. Earlier base numbers
> were comparatively very low with 48 cpus.
>
> Thanks,
> On Fri, Sep 21, 2018 at 5:53 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Sep 17, 2018 at 12:56:01PM +1000, Dave Chinner wrote:
> > > On Fri, Sep 14, 2018 at 05:21:43PM +0530, Joshi wrote:
> > > > > Now, your other "unexpected" result - lets use affinity to confine
> > > > > everything to one CPU:
> > > > >
> > > > > fsync proc              CIL             IO-complete     log
> > > > > xfs_log_force_lsn
> > > > >   <wait for CIL>
> > > > >                         xlog_write()
> > > > >                         <queue to log>
> > > > >                         wake CIL waiters
> > > > >                                                         xlog_iodone()
> > > > >                                                         wake iclogbug waiters
> > > > > log force done
> > > > >
> > > > > Note the difference? The log work can't start until the CIL work has
> > > > > completed because it's constrained to the same CPU, so it doesn't
> > > > > cause any contention with the finalising of the CIL push and waking
> > > > > waiters.
> > > >
> > > > "Log work can't start" part does not sound very good either. It needed
> > > > to be done anyway before task waiting for fsync is woken.
> > >
> > > I'm beginning to think you don't really understand how logging in
> > > XFS works.  If you want to improve the logging subsystem, it would
> > > be a good idea for you to read and understand this document so you
> > > have some idea of how the data that is written to the log gets
> > > there and what happens to it after it's in the log.
> > >
> > > Documentation/filesystems/xfs-delayed-logging-design.txt
> >
> > It occurred to me yesterday that looking at the journal in a
> > different way might help explain how it all works.
> >
> > Instead of looking at it as an IO engine, think of how an
> > out-of-order CPU is designed to be made up of multiple stages in a
> > pipeline - each stage does a small piece of work that it passes on
> > to the next stage to process. Individual operation progresses serially
> > through the pipeline, but, each stage of the pipeline can be
> > operating on a different operation. Hence we can have multiple
> > operations in flight at once, and the operations can also be run
> > out of order as dynamic stage completion scheduling dictates.
> > However, from a high level everything appears to complete in order
> > because the re-ordering stages put everything back in order once the
> > individual operations have been executed.
> >
> > Similarly, the XFS journalling subsystem is an out of order,
> > multi-stage pipeline with a post-IO re-ordering stage to ensure the
> > end result is that individual operations always appear to complete
> > in order.  Indeed, what ends up on disk in the journal is not in
> > order, so one of the things that log recovery does is rebuild the
> > state necessary to reorder operations correctly before replay so
> > that, again, it appears like everything occurred in the order that
> > the transactions were committed to the journal.
> >
> > So perhaps looking at it as a multi-stage pipeline might also help
> > explain why fake-completion changes the behaviour in unpredictable
> > ways. i.e. it basically chops out stages of the pipeline, changing
> > the length of the pipeline and the order in which stages of the
> > pipeline are executed.
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
>
>
>
> --
> Joshi



--
Joshi

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: Strange behavior with log IO fake-completions
  2018-09-24 14:25               ` Joshi
@ 2018-09-26  0:59                 ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2018-09-26  0:59 UTC (permalink / raw)
  To: Joshi; +Cc: linux-xfs

On Mon, Sep 24, 2018 at 07:55:02PM +0530, Joshi wrote:
> > I changed all log wait queues (ic_force_wait, ic_write_wait, and
> > xc_commit_wait) into completion variables (despite knowing that it is
> > also implemented using wait_queue).
> 
> Sorry, ignore this and subsequent part of the text. I think replacing
> with completion requires more work than I did. Calling "complete_all"
> is tricky, given the need of calling "reinit_completion" which may get
> executed by multiple threads in this case.

Yup.

Perhaps you might like to look at this recent patch set, aimed at
reducing spurious wake-ups and lock contention in the log reservation
code:

https://marc.info/?l=linux-kernel&m=153531683127094&w=2

Not sure it applies to this case, but it still may be useful to you.

BTW, when posting benchmark results, can you also include a baseline
result from a completely unmodified kernel and the patches you are
testing? Nobody can evaluate the correctness or effectiveness of the
changes you are testing without these things....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2018-09-26  7:10 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-10 18:07 Strange behavior with log IO fake-completions Joshi
2018-09-10 23:58 ` Dave Chinner
2018-09-11  4:10   ` Joshi
2018-09-12  0:42     ` Dave Chinner
2018-09-14 11:51       ` Joshi
2018-09-17  2:56         ` Dave Chinner
2018-09-21  0:23           ` Dave Chinner
2018-09-21 12:54             ` Joshi
2018-09-24 14:25               ` Joshi
2018-09-26  0:59                 ` Dave Chinner
