* [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
@ 2013-07-26 23:28 Paul E. McKenney
  2013-07-27  0:29 ` Linus Torvalds
  2013-07-27  2:57 ` Dave Chinner
  0 siblings, 2 replies; 9+ messages in thread
From: Paul E. McKenney @ 2013-07-26 23:28 UTC (permalink / raw)
  To: davej
  Cc: linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe, linux-fsdevel

Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
amazingly long NMI handlers from a trinity-induced workload involving
lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
There are any number of things that one might do to make sync() behave
better under high levels of contention, but it is also the case that
multiple concurrent sync() system calls can be satisfied by a single
sys_sync() invocation.

Given that this situation is reminiscent of rcu_barrier(), this commit
applies the rcu_barrier() approach to sys_sync().  This approach uses
a global mutex and a sequence counter.  The mutex is held across the
sync() operation, which eliminates contention between concurrent sync()
operations.  The counter is incremented at the beginning and end of
each sync() operation, so that it is odd while a sync() operation is in
progress and even otherwise, just like sequence locks.

The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
now handles the concurrency.  The sys_sync() function first takes a
snapshot of the counter, then acquires the mutex, and then takes another
snapshot of the counter.  If the values of the two snapshots indicate that
a full do_sync() executed during the mutex acquisition, the sys_sync()
function releases the mutex and returns ("Our work is done!").  Otherwise,
sys_sync() increments the counter, invokes do_sync(), and increments
the counter again.
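
As a concrete illustration of the arithmetic (example values, not part of the
patch): if the first snapshot is even, say 4, a full do_sync() completing
while we waited for the mutex leaves the counter at 6 or more; if the first
snapshot is odd, say 5, a do_sync() was already in flight when we sampled, so
it must finish (6) and a complete later one must run (8) before we may skip
our own.  The skip test in the patch below encodes exactly these thresholds:

	/* Sketch only: snap == 4 gives threshold 6, snap == 5 gives threshold 8. */
	if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2))
		return 0;	/* somebody else already did a full sync for us */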

This approach allows a single call to do_sync() to satisfy an arbitrarily
large number of sync() system calls, which should eliminate issues due
to large numbers of concurrent invocations of the sync() system call.

Changes since v1 (https://lkml.org/lkml/2013/7/24/683):

o	Add a pair of memory barriers to keep the increments from
	bleeding into the do_sync() code.  (The failure probability
	is insanely low, but when you have several hundred million
	devices running Linux, you can expect several hundred instances
	of one-in-a-million failures.)

o	Actually CC some people who have experience in this area.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: linux-fsdevel@vger.kernel.org

 b/fs/sync.c |   45 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/fs/sync.c b/fs/sync.c
index 905f3f6..6e851db 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -99,7 +99,7 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
  * just write metadata (such as inodes or bitmaps) to block device page cache
  * and do not sync it on their own in ->sync_fs().
  */
-SYSCALL_DEFINE0(sync)
+static void do_sync(void)
 {
 	int nowait = 0, wait = 1;
 
@@ -111,6 +111,49 @@ SYSCALL_DEFINE0(sync)
 	iterate_bdevs(fdatawait_one_bdev, NULL);
 	if (unlikely(laptop_mode))
 		laptop_sync_completion();
+	return;
+}
+
+static DEFINE_MUTEX(sync_mutex);	/* One do_sync() at a time. */
+static unsigned long sync_seq;		/* Many sync()s from one do_sync(). */
+					/*  Overflow harmless, extra wait. */
+
+/*
+ * Only allow one task to do sync() at a time, and further allow
+ * concurrent sync() calls to be satisfied by a single do_sync()
+ * invocation.
+ */
+SYSCALL_DEFINE0(sync)
+{
+	unsigned long snap;
+	unsigned long snap_done;
+
+	snap = ACCESS_ONCE(sync_seq);
+	smp_mb();  /* Prevent above from bleeding into critical section. */
+	mutex_lock(&sync_mutex);
+	snap_done = ACCESS_ONCE(sync_seq);
+	if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {
+		/*
+		 * A full do_sync() executed between our two fetches from
+		 * sync_seq, so our work is done!
+		 */
+		smp_mb(); /* Order test with caller's subsequent code. */
+		mutex_unlock(&sync_mutex);
+		return 0;
+	}
+
+	/* Record the start of do_sync(). */
+	ACCESS_ONCE(sync_seq)++;
+	WARN_ON_ONCE((sync_seq & 0x1) != 1);
+	smp_mb(); /* Keep prior increment out of do_sync(). */
+
+	do_sync();
+
+	/* Record the end of do_sync(). */
+	smp_mb(); /* Keep subsequent increment out of do_sync(). */
+	ACCESS_ONCE(sync_seq)++;
+	WARN_ON_ONCE((sync_seq & 0x1) != 0);
+	mutex_unlock(&sync_mutex);
 	return 0;
 }
 



* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-26 23:28 [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation Paul E. McKenney
@ 2013-07-27  0:29 ` Linus Torvalds
  2013-07-27  1:23   ` Paul E. McKenney
  2013-07-27  2:57 ` Dave Chinner
  1 sibling, 1 reply; 9+ messages in thread
From: Linus Torvalds @ 2013-07-27  0:29 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Dave Jones, Linux Kernel Mailing List, Al Viro,
	Christoph Hellwig, Jan Kara, curtw, Jens Axboe, linux-fsdevel

On Fri, Jul 26, 2013 at 4:28 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> +
> +       snap = ACCESS_ONCE(sync_seq);
> +       smp_mb();  /* Prevent above from bleeding into critical section. */
> +       mutex_lock(&sync_mutex);
> +       snap_done = ACCESS_ONCE(sync_seq);
> +       if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {

Ugh. I dislike this RCU'ism. It's bad code. It doesn't just look ugly
and complex, it's also not even clever.

It is possible that the compiler can fix up this horrible stuff and
turn it into the nice clever stuff, but I dunno.

The two things that make me go "Eww":

 - "((snap + 1) & ~0x1) + 2" just isn't the smart way of doing things.
Afaik, "(snap+3)&~1" gives the same answer with a simpler arithmetic.

 - that ULONG_CMP_GE() macro is disgusting. What's wrong with doing it
the sane way, which is how (for example) the time comparison functions
do it (see time_before() and friends): Just do it

     ((long)(a-b) >= 0)

   which doesn't need large constants.

And yeah, a smart compiler will hopefully do one or both of those, but
what annoys me about the source code is that it actually isn't even
any more readable despite being more complicated and needing more
compiler tricks for good code generation.

So that one line is (a) totally undocumented, (b) not obvious and (c)
not very clever.

I'm also not a huge believer in those two WARN_ON_ONCE's you have. The
sequence count is *only* updated in this place, it is *only* updated
inside a lock, and dammit, if those tests ever trigger, we have bigger
problems than that piece of code. Those warnings may make sense in
code when you write it the first time (because you're thinking things
through), but they do *not* make sense at the point where that code is
actually committed to the project. I notice that you have those
warnings in the RCU code itself, and I don't really think they make
sense there either.

Finally, the ACCESS_ONCE() is also only correct in the one place where
you do the access speculatively outside the lock. Inside the lock,
there is no excuse/reason for them, since the value is stable, and you
need the memory barriers anyway, so there's no way the compiler could
migrate things regardless. So the other two ACCESS_ONCE calls are
actually misleading and wrong, and only likely to make the compiler
generate much worse code.

In fact, the ACCESS_ONCE() is pretty much *guaranteed* to cause the
compiler to unnecessarily generate worse code, since there is
absolutely no reason why the compiler couldn't reuse the "snap_done"
value it reads when it then does the "sync_seq++". There's no way the
value could possibly have changed from the "snap_done" value earlier,
since we're inside the lock, so why force the compiler to reload it?
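
(A minimal sketch of the reload in question, purely illustrative: once the
skip test above has failed, sync_seq is known to equal snap_done, so the two
lines below do the same thing, but only the second lets the compiler reuse
the value it already has in a register:)

	ACCESS_ONCE(sync_seq)++;	/* forces a fresh load, increment, store */
	sync_seq = snap_done + 1;	/* reuses the previously loaded value */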

In short, I think the code does too much. I'm sure it works, but I
think it might make people believe that the extra work (like those
later ACCESS_ONCE ones) is meaningful, when it isn't. It's just
make-believe, afaik.

But maybe I'm missing something, and there actually *is* reason for
the extra work/complexity?

                 Linus


* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-27  0:29 ` Linus Torvalds
@ 2013-07-27  1:23   ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2013-07-27  1:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Jones, Linux Kernel Mailing List, Al Viro,
	Christoph Hellwig, Jan Kara, curtw, Jens Axboe, linux-fsdevel

On Fri, Jul 26, 2013 at 05:29:44PM -0700, Linus Torvalds wrote:
> On Fri, Jul 26, 2013 at 4:28 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > +
> > +       snap = ACCESS_ONCE(sync_seq);
> > +       smp_mb();  /* Prevent above from bleeding into critical section. */
> > +       mutex_lock(&sync_mutex);
> > +       snap_done = ACCESS_ONCE(sync_seq);
> > +       if (ULONG_CMP_GE(snap_done, ((snap + 1) & ~0x1) + 2)) {
> 
> Ugh. I dislike this RCU'ism. It's bad code. It doesn't just look ugly
> and complex, it's also not even clever.
> 
> It is possible that the compiler can fix up this horrible stuff and
> turn it into the nice clever stuff, but I dunno.
> 
> The two things that make me go "Eww":
> 
>  - "((snap + 1) & ~0x1) + 2" just isn't the smart way of doing things.
> Afaik, "(snap+3)&~1" gives the same answer with a simpler arithmetic.

You are right, this is a better approach, and I have changed this
patch to use it.

I will also apply it to the similar code in RCU.

>  - that ULONG_CMP_GE() macro is disgusting. What's wrong with doing it
> the sane way, which is how (for example) the time comparison functions
> do it (see time_before() and friends): Just do it
> 
>      ((long)(a-b) >= 0)
> 
>    which doesn't need large constants.

True, and I used to use this approach, but it can result in signed integer
overflow, which is undefined in C.  (Yes, we use -fno-strict-overflow,
but there might come a day when we don't want to.)  And ULONG_CMP_GE()
generated exactly the same code as ((long)(a-b) >= 0) last I tried it.
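
(For reference, ULONG_CMP_GE() is defined in the RCU headers roughly as
follows -- quoted from memory, so the exact form may differ -- alongside a
hypothetical spelling of the time_after()-style comparison being discussed:)

	#define ULONG_CMP_GE(a, b)	(ULONG_MAX / 2 >= (a) - (b))

	/* hypothetical name for the time_after()-style form: */
	#define seq_cmp_ge(a, b)	((long)((a) - (b)) >= 0)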

> And yeah, a smart compiler will hopefully do one or both of those, but
> what annoys me about the source code is that it actually isn't even
> any more readable despite being more complicated and needing more
> compiler tricks for good code generation.
> 
> So that one line is (a) totally undocumented, (b) not obvious and (c)
> not very clever.

For (a), how about I add the following comment?

	/*
	 * If the value in snap is odd, we need to wait for the current
	 * do_sync() to complete, then wait for the next one, in other
	 * words, we need the value of snap_done to be three larger than
	 * the value of snap.  On the other hand, if the value in snap is
	 * even, we only have to wait for the next request to complete,
	 * in other words, we need the value of snap_done to be only two
	 * greater than the value of snap.  The "(snap + 3) & ~0x1" computes
	 * this for us.
	 */

Hopefully, this helps with (b).  For (c), I now use the expression
you suggested above.

Does that help, or is more needed?

> I'm also not a huge believer in those two WARN_ON_ONCE's you have. The
> sequence count is *only* updated in this place, it is *only* updated
> inside a lock, and dammit, if those tests ever trigger, we have bigger
> problems than that piece of code. Those warnings may make sense in
> code when you write it the first time (because you're thinking things
> through), but they do *not* make sense at the point where that code is
> actually committed to the project. I notice that you have those
> warnings in the RCU code itself, and I don't really think they make
> sense there either.

I agree that the fact that this variable is updated only in this one
place in sys_sync() makes these warnings less than useful in production.
I will therefore remove them once this patch gets beyond RFC status.

However, the similar warnings in RCU have been very helpful in spotting
bugs in the callers of rcu_idle_enter() and friends, so I would very
much prefer to keep them.

> Finally, the ACCESS_ONCE() is also only correct in the one place where
> you do the access speculatively outside the lock. Inside the lock,
> there is no excuse/reason for them, since the value is stable, and you
> need the memory barriers anyway, so there's no way the compiler could
> migrate things regardless. So the other two ACCESS_ONCE calls are
> actually misleading and wrong, and only likely to make the compiler
> generate much worse code.
> 
> In fact, the ACCESS_ONCE() is pretty much *guaranteed* to cause the
> compiler to unnecessarily generate worse code, since there is
> absolutely no reason why the compiler couldn't reuse the "snap_done"
> value it reads when it then does the "sync_seq++". There's no way the
> value could possible have changed from the "snap_done" value earlier,
> since we're inside the lock, so why force the compiler to reload it?

Good point, I have removed the second ACCESS_ONCE().

> In short, I think the code does too much. I'm sure it works, but I
> think it might make people believe that the extra work (like those
> later ACCESS_ONCE ones) is meaningful, when it isn't. It's just
> make-believe, afaik.
> 
> But maybe I'm missing something, and there actually *is* reason for
> the extra work/complexity?

Only for the ULONG_CMP_GE().  On the rest, you are quite right, and
the updated patch is as follows.  

But I will believe it works only if it helps Dave Jones's tests.  ;-)

							Thanx, Paul

------------------------------------------------------------------------

Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
amazingly long NMI handlers from a trinity-induced workload involving
lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
There are any number of things that one might do to make sync() behave
better under high levels of contention, but it is also the case that
multiple concurrent sync() system calls can be satisfied by a single
sys_sync() invocation.

Given that this situation is reminiscent of rcu_barrier(), this commit
applies the rcu_barrier() approach to sys_sync().  This approach uses
a global mutex and a sequence counter.  The mutex is held across the
sync() operation, which eliminates contention between concurrent sync()
operations.  The counter is incremented at the beginning and end of
each sync() operation, so that it is odd while a sync() operation is in
progress and even otherwise, just like sequence locks.

The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
now handles the concurrency.  The sys_sync() function first takes a
snapshot of the counter, then acquires the mutex, and then takes another
snapshot of the counter.  If the values of the two snapshots indicate that
a full do_sync() executed during the mutex acquisition, the sys_sync()
function releases the mutex and returns ("Our work is done!").  Otherwise,
sys_sync() increments the counter, invokes do_sync(), and increments
the counter again.

This approach allows a single call to do_sync() to satisfy an arbitrarily
large number of sync() system calls, which should eliminate issues due
to large numbers of concurrent invocations of the sync() system call.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jan Kara <jack@suse.cz>
Cc: Curt Wohlgemuth <curtw@google.com>
Cc: Jens Axboe <jaxboe@fusionio.com>
Cc: linux-fsdevel@vger.kernel.org
[ paulmck: Updated based on Linus Torvalds feedback. ]

diff --git a/fs/sync.c b/fs/sync.c
index 905f3f6..174ad9b 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -99,7 +99,7 @@ static void fdatawait_one_bdev(struct block_device *bdev, void *arg)
  * just write metadata (such as inodes or bitmaps) to block device page cache
  * and do not sync it on their own in ->sync_fs().
  */
-SYSCALL_DEFINE0(sync)
+static void do_sync(void)
 {
 	int nowait = 0, wait = 1;
 
@@ -111,6 +111,60 @@ SYSCALL_DEFINE0(sync)
 	iterate_bdevs(fdatawait_one_bdev, NULL);
 	if (unlikely(laptop_mode))
 		laptop_sync_completion();
+	return;
+}
+
+static DEFINE_MUTEX(sync_mutex);	/* One do_sync() at a time. */
+static unsigned long sync_seq;		/* Many sync()s from one do_sync(). */
+					/*  Overflow harmless, extra wait. */
+
+/*
+ * Only allow one task to do sync() at a time, and further allow
+ * concurrent sync() calls to be satisfied by a single do_sync()
+ * invocation.
+ */
+SYSCALL_DEFINE0(sync)
+{
+	unsigned long snap;
+	unsigned long snap_done;
+
+	snap = ACCESS_ONCE(sync_seq);
+	smp_mb();  /* Prevent above from bleeding into critical section. */
+	mutex_lock(&sync_mutex);
+	snap_done = sync_seq;
+
+	/*
+	 * If the value in snap is odd, we need to wait for the current
+	 * do_sync() to complete, then wait for the next one, in other
+	 * words, we need the value of snap_done to be three larger than
+	 * the value of snap.  On the other hand, if the value in snap is
+	 * even, we only have to wait for the next request to complete,
+	 * in other words, we need the value of snap_done to be only two
+	 * greater than the value of snap.  The "(snap + 3) & ~0x1" computes
+	 * this for us (thank you, Linus!).
+	 */
+	if (ULONG_CMP_GE(snap_done, (snap + 3) & ~0x1)) {
+		/*
+		 * A full do_sync() executed between our two fetches from
+		 * sync_seq, so our work is done!
+		 */
+		smp_mb(); /* Order test with caller's subsequent code. */
+		mutex_unlock(&sync_mutex);
+		return 0;
+	}
+
+	/* Record the start of do_sync(). */
+	ACCESS_ONCE(sync_seq)++;
+	WARN_ON_ONCE((sync_seq & 0x1) != 1);
+	smp_mb(); /* Keep prior increment out of do_sync(). */
+
+	do_sync();
+
+	/* Record the end of do_sync(). */
+	smp_mb(); /* Keep subsequent increment out of do_sync(). */
+	ACCESS_ONCE(sync_seq)++;
+	WARN_ON_ONCE((sync_seq & 0x1) != 0);
+	mutex_unlock(&sync_mutex);
 	return 0;
 }
 



* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-26 23:28 [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation Paul E. McKenney
  2013-07-27  0:29 ` Linus Torvalds
@ 2013-07-27  2:57 ` Dave Chinner
  2013-07-27  4:05   ` Paul E. McKenney
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-07-27  2:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote:
> Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
> amazingly long NMI handlers from a trinity-induced workload involving
> lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
> There are any number of things that one might do to make sync() behave
> better under high levels of contention, but it is also the case that
> multiple concurrent sync() system calls can be satisfied by a single
> sys_sync() invocation.
> 
> Given that this situation is reminiscent of rcu_barrier(), this commit
> applies the rcu_barrier() approach to sys_sync().  This approach uses
> a global mutex and a sequence counter.  The mutex is held across the
> sync() operation, which eliminates contention between concurrent sync()
> operations.
>
> The counter is incremented at the beginning and end of
> each sync() operation, so that it is odd while a sync() operation is in
> progress and even otherwise, just like sequence locks.
> 
> The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
> now handles the concurrency.  The sys_sync() function first takes a
> snapshot of the counter, then acquires the mutex, and then takes another
> snapshot of the counter.  If the values of the two snapshots indicate that
> a full do_sync() executed during the mutex acquisition, the sys_sync()
> function releases the mutex and returns ("Our work is done!").  Otherwise,
> sys_sync() increments the counter, invokes do_sync(), and increments
> the counter again.
> 
> This approach allows a single call to do_sync() to satisfy an arbitrarily
> large number of sync() system calls, which should eliminate issues due
> to large numbers of concurrent invocations of the sync() system call.

This is not addressing the problem that is causing issues during
sync. Indeed, it only puts a bandaid over the currently observed
trigger.

Indeed, I suspect that this will significantly slow down concurrent
sync operations, as it serialises sync across all superblocks rather
than serialising per-superblock as is currently done. That
per-superblock serialisation is where all the lock contention
problems are. And it's not sync alone that causes the contention
problems - it has to be combined with other concurrent workloads
that add or remove inodes from the inode cache at the same time.

I have patches to address that by removing the source
of the lock contention completely, and not just for the sys_sync
trigger. Those patches make the problems with concurrent
sys_sync operation go away completely for me, not to mention improve
performance for 8+ thread metadata workloads on XFS significantly.

IOWs, I don't see that concurrent sys_sync operation is a problem at
all, and it is actively desirable for systems that have multiple
busy filesystems as it allows concurrent dispatch of IO across those
multiple filesystems. Serialising all sys_sync work might stop the
contention problems, but it will also slow down concurrent sync
operations on busy systems as it only allows one thread to dispatch
and wait for IO at a time.

So, let's not slap a bandaid over a symptom - let's address the
cause of the lock contention properly....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-27  2:57 ` Dave Chinner
@ 2013-07-27  4:05   ` Paul E. McKenney
  2013-07-27  6:21     ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2013-07-27  4:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote:
> On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote:
> > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
> > amazingly long NMI handlers from a trinity-induced workload involving
> > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
> > There are any number of things that one might do to make sync() behave
> > better under high levels of contention, but it is also the case that
> > multiple concurrent sync() system calls can be satisfied by a single
> > sys_sync() invocation.
> > 
> > Given that this situation is reminiscent of rcu_barrier(), this commit
> > applies the rcu_barrier() approach to sys_sync().  This approach uses
> > a global mutex and a sequence counter.  The mutex is held across the
> > sync() operation, which eliminates contention between concurrent sync()
> > operations.
> >
> > The counter is incremented at the beginning and end of
> > each sync() operation, so that it is odd while a sync() operation is in
> > progress and even otherwise, just like sequence locks.
> > 
> > The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
> > now handles the concurrency.  The sys_sync() function first takes a
> > snapshot of the counter, then acquires the mutex, and then takes another
> > snapshot of the counter.  If the values of the two snapshots indicate that
> > a full do_sync() executed during the mutex acquisition, the sys_sync()
> > function releases the mutex and returns ("Our work is done!").  Otherwise,
> > sys_sync() increments the counter, invokes do_sync(), and increments
> > the counter again.
> > 
> > This approach allows a single call to do_sync() to satisfy an arbitrarily
> > large number of sync() system calls, which should eliminate issues due
> > to large numbers of concurrent invocations of the sync() system call.
> 
> This is not addressing the problem that is causing issues during
> sync. Indeed, it only puts a bandaid over the currently observed
> trigger.
> 
> Indeed, i suspect that this will significantly slow down concurrent
> sync operations, as it serialised sync across all superblocks rather
> than serialising per-superblock like is currently done. Indeed, that
> per-superblock serialisation is where all the lock contention
> problems are. And it's not sync alone that causes the contention
> problems - it has to be combined with other concurrent workloads
> that add or remove inodes from the inode cache at tha same time.

Seems like something along the lines of wakeup_flusher_threads()
currently at the start of sys_sync() would address this.
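
(For reference, the sys_sync() being discussed looks roughly like this in
trees of that era -- reconstructed from memory, so details may differ -- with
the wakeup_flusher_threads() call as its very first step:)

	SYSCALL_DEFINE0(sync)
	{
		int nowait = 0, wait = 1;

		wakeup_flusher_threads(0, WB_REASON_SYNC);	/* async, best effort */
		iterate_supers(sync_inodes_one_sb, NULL);	/* per-sb data writeback */
		iterate_supers(sync_fs_one_sb, &nowait);	/* ->sync_fs(), don't wait */
		iterate_supers(sync_fs_one_sb, &wait);		/* ->sync_fs(), wait */
		iterate_bdevs(fdatawrite_one_bdev, NULL);
		iterate_bdevs(fdatawait_one_bdev, NULL);
		if (unlikely(laptop_mode))
			laptop_sync_completion();
		return 0;
	}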

> I have patches to address that by removing the source
> of the lock contention completely, and not just for the sys_sync
> trigger. Those patches make the problems with concurrent
> sys_sync operation go away completely for me, not to mention improve
> performance for 8+ thread metadata workloads on XFS significantly.
> 
> IOWs, I don't see that concurrent sys_sync operation is a problem at
> all, and it is actively desirable for systems that have multiple
> busy filesystems as it allows concurrent dispatch of IO across those
> multiple filesystems. Serialising all sys_sync work might stop the
> contention problems, but it will also slow down concurrent sync
> operations on busy systems as it only allows one thread to dispatch
> and wait for IO at a time.
> 
> So, let's not slap a bandaid over a symptom - let's address the
> cause of the lock contention properly....

Hmmm...

Could you please send your patches over to Dave Jones right now?  I am
getting quite tired of getting RCU CPU stall warning complaints from
him that turn out to be due to highly contended sync() system calls.

							Thanx, Paul



* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-27  4:05   ` Paul E. McKenney
@ 2013-07-27  6:21     ` Dave Chinner
  2013-07-27 11:26       ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-07-27  6:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote:
> On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote:
> > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote:
> > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
> > > amazingly long NMI handlers from a trinity-induced workload involving
> > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
> > > There are any number of things that one might do to make sync() behave
> > > better under high levels of contention, but it is also the case that
> > > multiple concurrent sync() system calls can be satisfied by a single
> > > sys_sync() invocation.
> > > 
> > > Given that this situation is reminiscent of rcu_barrier(), this commit
> > > applies the rcu_barrier() approach to sys_sync().  This approach uses
> > > a global mutex and a sequence counter.  The mutex is held across the
> > > sync() operation, which eliminates contention between concurrent sync()
> > > operations.
> > >
> > > The counter is incremented at the beginning and end of
> > > each sync() operation, so that it is odd while a sync() operation is in
> > > progress and even otherwise, just like sequence locks.
> > > 
> > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
> > > now handles the concurrency.  The sys_sync() function first takes a
> > > snapshot of the counter, then acquires the mutex, and then takes another
> > > snapshot of the counter.  If the values of the two snapshots indicate that
> > > a full do_sync() executed during the mutex acquisition, the sys_sync()
> > > function releases the mutex and returns ("Our work is done!").  Otherwise,
> > > sys_sync() increments the counter, invokes do_sync(), and increments
> > > the counter again.
> > > 
> > > This approach allows a single call to do_sync() to satisfy an arbitrarily
> > > large number of sync() system calls, which should eliminate issues due
> > > to large numbers of concurrent invocations of the sync() system call.
> > 
> > This is not addressing the problem that is causing issues during
> > sync. Indeed, it only puts a bandaid over the currently observed
> > trigger.
> > 
> > Indeed, i suspect that this will significantly slow down concurrent
> > sync operations, as it serialised sync across all superblocks rather
> > than serialising per-superblock like is currently done. Indeed, that
> > per-superblock serialisation is where all the lock contention
> > problems are. And it's not sync alone that causes the contention
> > problems - it has to be combined with other concurrent workloads
> > that add or remove inodes from the inode cache at tha same time.
> 
> Seems like something along the lines of wakeup_flusher_threads()
> currently at the start of sys_sync() would address this.

That's the first thing sync() currently does, but it doesn't solve
the problem because that only triggers WB_SYNC_NONE writeback. ie.
it's not data integrity writeback, so if there's any contention it
will skip pages and inodes. So we still need sync to dispatch IO.
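
(A sketch of the distinction, not the actual writeback paths: the flusher
wakeup queues best-effort WB_SYNC_NONE work, whereas the sync path issues
WB_SYNC_ALL writeback that waits on every page and inode:)

	struct writeback_control wbc_flusher = {
		.sync_mode	= WB_SYNC_NONE,	/* may skip busy pages/inodes */
		.nr_to_write	= LONG_MAX,
	};

	struct writeback_control wbc_sync = {
		.sync_mode	= WB_SYNC_ALL,	/* data integrity: wait on everything */
		.nr_to_write	= LONG_MAX,
	};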

Further, it requires scheduling of the work before anything will be
done, and so if you are CPU loaded then there's a good chance that
the flusher thread doesn't get to a CPU until sync actually queues
the sync work to it and then waits for completion on it...

> > I have patches to address that by removing the source
> > of the lock contention completely, and not just for the sys_sync
> > trigger. Those patches make the problems with concurrent
> > sys_sync operation go away completely for me, not to mention improve
> > performance for 8+ thread metadata workloads on XFS significantly.
> > 
> > IOWs, I don't see that concurrent sys_sync operation is a problem at
> > all, and it is actively desirable for systems that have multiple
> > busy filesystems as it allows concurrent dispatch of IO across those
> > multiple filesystems. Serialising all sys_sync work might stop the
> > contention problems, but it will also slow down concurrent sync
> > operations on busy systems as it only allows one thread to dispatch
> > and wait for IO at a time.
> > 
> > So, let's not slap a bandaid over a symptom - let's address the
> > cause of the lock contention properly....
> 
> Hmmm...
> 
> Could you please send your patches over to Dave Jones right now?  I am
> getting quite tired of getting RCU CPU stall warning complaints from
> him that turn out to be due to highly contended sync() system calls.

Then ignore them until the code is ready - it'll be 3.12 before the
fixes are merged, anyway, because the lock contention fix requires
infrastructure that is currently in mmotm that is queued for 3.12
(i.e. the per-node list infrastructure) to fix a whole bunch of
other, more critical VFS lock contention problems. Seeing as a new
mmotm went out last week, I should have the patches ready for review
early next week.

FWIW, we (as in XFS filesystem testers) regularly run tests that
have hundreds of concurrent sys_sync() calls running at the same
time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while
freezing and unfreezing the filesystem, and that usually shows
hundreds of threads running sys_sync concurrently after a short
amount of runtime. So it's pretty clear that what Dave is seeing
is not necessarily representative of what happens when there are lots
of sys_sync() calls run concurrently.

BTW, concurrent syncfs() calls are going to have exactly the same
problem as concurrent sync() calls, as is any other operation that
results in a walk of the per-superblock inodes list.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-27  6:21     ` Dave Chinner
@ 2013-07-27 11:26       ` Paul E. McKenney
  2013-07-29  7:06         ` Dave Chinner
  0 siblings, 1 reply; 9+ messages in thread
From: Paul E. McKenney @ 2013-07-27 11:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote:
> On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 27, 2013 at 12:57:03PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 26, 2013 at 04:28:52PM -0700, Paul E. McKenney wrote:
> > > > Dave Jones reported RCU stalls, overly long hrtimer interrupts, and
> > > > amazingly long NMI handlers from a trinity-induced workload involving
> > > > lots of concurrent sync() calls (https://lkml.org/lkml/2013/7/23/369).
> > > > There are any number of things that one might do to make sync() behave
> > > > better under high levels of contention, but it is also the case that
> > > > multiple concurrent sync() system calls can be satisfied by a single
> > > > sys_sync() invocation.
> > > > 
> > > > Given that this situation is reminiscent of rcu_barrier(), this commit
> > > > applies the rcu_barrier() approach to sys_sync().  This approach uses
> > > > a global mutex and a sequence counter.  The mutex is held across the
> > > > sync() operation, which eliminates contention between concurrent sync()
> > > > operations.
> > > >
> > > > The counter is incremented at the beginning and end of
> > > > each sync() operation, so that it is odd while a sync() operation is in
> > > > progress and even otherwise, just like sequence locks.
> > > > 
> > > > The code that used to be in sys_sync() is now in do_sync(), and sys_sync()
> > > > now handles the concurrency.  The sys_sync() function first takes a
> > > > snapshot of the counter, then acquires the mutex, and then takes another
> > > > snapshot of the counter.  If the values of the two snapshots indicate that
> > > > a full do_sync() executed during the mutex acquisition, the sys_sync()
> > > > function releases the mutex and returns ("Our work is done!").  Otherwise,
> > > > sys_sync() increments the counter, invokes do_sync(), and increments
> > > > the counter again.
> > > > 
> > > > This approach allows a single call to do_sync() to satisfy an arbitrarily
> > > > large number of sync() system calls, which should eliminate issues due
> > > > to large numbers of concurrent invocations of the sync() system call.
> > > 
> > > This is not addressing the problem that is causing issues during
> > > sync. Indeed, it only puts a bandaid over the currently observed
> > > trigger.
> > > 
> > > Indeed, i suspect that this will significantly slow down concurrent
> > > sync operations, as it serialised sync across all superblocks rather
> > > than serialising per-superblock like is currently done. Indeed, that
> > > per-superblock serialisation is where all the lock contention
> > > problems are. And it's not sync alone that causes the contention
> > > problems - it has to be combined with other concurrent workloads
> > > that add or remove inodes from the inode cache at tha same time.
> > 
> > Seems like something along the lines of wakeup_flusher_threads()
> > currently at the start of sys_sync() would address this.
> 
> That's the first thing sync() currently does, but it doesn't solve
> the problem because that only triggers WB_SYNC_NONE writeback. ie.
> it's not data integrity writeback, so if there's any contention it
> will skip pages and inodes. So we still need sync to dispatch IO.
> 
> Further, it requires scheduling of the work before anything will be
> done, and so if you are CPU loaded then there's a good chance that
> the flusher thread doesn't get to a CPU until sync actually queues
> the sync work to it and then waits for completion on it...

I could address these fairly straightforwardly, but I will hold off
for a bit to see what you come up with.

> > > I have patches to address that by removing the source
> > > of the lock contention completely, and not just for the sys_sync
> > > trigger. Those patches make the problems with concurrent
> > > sys_sync operation go away completely for me, not to mention improve
> > > performance for 8+ thread metadata workloads on XFS significantly.
> > > 
> > > IOWs, I don't see that concurrent sys_sync operation is a problem at
> > > all, and it is actively desirable for systems that have multiple
> > > busy filesystems as it allows concurrent dispatch of IO across those
> > > multiple filesystems. Serialising all sys_sync work might stop the
> > > contention problems, but it will also slow down concurrent sync
> > > operations on busy systems as it only allows one thread to dispatch
> > > and wait for IO at a time.
> > > 
> > > So, let's not slap a bandaid over a symptom - let's address the
> > > cause of the lock contention properly....
> > 
> > Hmmm...
> > 
> > Could you please send your patches over to Dave Jones right now?  I am
> > getting quite tired of getting RCU CPU stall warning complaints from
> > him that turn out to be due to highly contended sync() system calls.
> 
> Then ignore them until the code is ready - it'll be 3.12 before the
> fixes are merged, anyway, because the lock contention fix requires
> infrastructure that is currently in mmotm that is queued for 3.12
> (i.e. the per-node list infrastructure) to fix a whole bunch of
> other, more critical VFS lock contention problems. Seeing as a new
> mmotm went out last week, I should have the patches ready for review
> early next week.
> 
> FWIW, we (as in XFS filesystem testers) regularly run tests that
> have hundreds of concurrent sys_sync() calls running at the same
> time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while
> freezing and unfreezing the filesystem, and that usually shows
> hundreds of threads running sys_sync concurrently after a short
> amount of runtime. So it's pretty clear that what Dave is seeing
> is not necessarily representative of what happens when there ar lots
> of sys_sync() calls run concurrently.

So Dave might be finding an additional problem.  ;-)

> BTW, concurrent syncfs() calls are going to have exactly the same
> problem as concurrent sync() calls, as is any other operation that
> results in a walk of the per-superblock inodes list.

Yep!  Your upcoming patch addresses these as well?

							Thanx, Paul

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-27 11:26       ` Paul E. McKenney
@ 2013-07-29  7:06         ` Dave Chinner
  2013-07-30 17:43           ` Paul E. McKenney
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2013-07-29  7:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Sat, Jul 27, 2013 at 04:26:28AM -0700, Paul E. McKenney wrote:
> On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote:
> > On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote:
> > > Could you please send your patches over to Dave Jones right now?  I am
> > > getting quite tired of getting RCU CPU stall warning complaints from
> > > him that turn out to be due to highly contended sync() system calls.
> > 
> > Then ignore them until the code is ready - it'll be 3.12 before the
> > fixes are merged, anyway, because the lock contention fix requires
> > infrastructure that is currently in mmotm that is queued for 3.12
> > (i.e. the per-node list infrastructure) to fix a whole bunch of
> > other, more critical VFS lock contention problems. Seeing as a new
> > mmotm went out last week, I should have the patches ready for review
> > early next week.
> > 
> > FWIW, we (as in XFS filesystem testers) regularly run tests that
> > have hundreds of concurrent sys_sync() calls running at the same
> > time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while
> > freezing and unfreezing the filesystem, and that usually shows
> > hundreds of threads running sys_sync concurrently after a short
> > amount of runtime. So it's pretty clear that what Dave is seeing
> > is not necessarily representative of what happens when there ar lots
> > of sys_sync() calls run concurrently.
> 
> So Dave might be finding an additional problem.  ;-)

Dave will always find problems. If you want something broken, give
it to Dave and he'll hand it back in pieces. :)

> > BTW, concurrent syncfs() calls are going to have exactly the same
> > problem as concurrent sync() calls, as is any other operation that
> > results in a walk of the per-superblock inodes list.
> 
> Yep!  Your upcoming patch addresses these as well?

Yes, it does.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH RFC fs] v2 Make sync() satisfy many requests with one invocation
  2013-07-29  7:06         ` Dave Chinner
@ 2013-07-30 17:43           ` Paul E. McKenney
  0 siblings, 0 replies; 9+ messages in thread
From: Paul E. McKenney @ 2013-07-30 17:43 UTC (permalink / raw)
  To: Dave Chinner
  Cc: davej, linux-kernel, torvalds, viro, hch, jack, curtw, jaxboe,
	linux-fsdevel

On Mon, Jul 29, 2013 at 05:06:43PM +1000, Dave Chinner wrote:
> On Sat, Jul 27, 2013 at 04:26:28AM -0700, Paul E. McKenney wrote:
> > On Sat, Jul 27, 2013 at 04:21:01PM +1000, Dave Chinner wrote:
> > > On Fri, Jul 26, 2013 at 09:05:24PM -0700, Paul E. McKenney wrote:
> > > > Could you please send your patches over to Dave Jones right now?  I am
> > > > getting quite tired of getting RCU CPU stall warning complaints from
> > > > him that turn out to be due to highly contended sync() system calls.
> > > 
> > > Then ignore them until the code is ready - it'll be 3.12 before the
> > > fixes are merged, anyway, because the lock contention fix requires
> > > infrastructure that is currently in mmotm that is queued for 3.12
> > > (i.e. the per-node list infrastructure) to fix a whole bunch of
> > > other, more critical VFS lock contention problems. Seeing as a new
> > > mmotm went out last week, I should have the patches ready for review
> > > early next week.
> > > 
> > > FWIW, we (as in XFS filesystem testers) regularly run tests that
> > > have hundreds of concurrent sys_sync() calls running at the same
> > > time. e.g. xfstests::xfs/297 runs a 1000 fsstress processes while
> > > freezing and unfreezing the filesystem, and that usually shows
> > > hundreds of threads running sys_sync concurrently after a short
> > > amount of runtime. So it's pretty clear that what Dave is seeing
> > > is not necessarily representative of what happens when there ar lots
> > > of sys_sync() calls run concurrently.
> > 
> > So Dave might be finding an additional problem.  ;-)
> 
> Dave will always find problems. If you want something broken, give
> it to Dave and he'll hand it back in pieces. :)

So rather than Wreck-it Ralph, we have Destroy-it Dave?  ;-)

And I must hasten to add that Dave's destroy-it services, though
sometimes irritating, are almost always quite valuable.

> > > BTW, concurrent syncfs() calls are going to have exactly the same
> > > problem as concurrent sync() calls, as is any other operation that
> > > results in a walk of the per-superblock inodes list.
> > 
> > Yep!  Your upcoming patch addresses these as well?
> 
> Yes, it does.

Good to hear, looking forward to seeing them!

							Thanx, Paul


