* [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files @ 2015-06-19 21:50 Dave Hansen 2015-06-19 23:33 ` Andi Kleen 2015-06-23 15:17 ` Jan Kara 0 siblings, 2 replies; 20+ messages in thread From: Dave Hansen @ 2015-06-19 21:50 UTC (permalink / raw) To: dave Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, ak, linux-kernel From: Dave Hansen <dave.hansen@linux.intel.com> I have a _tiny_ microbenchmark that sits in a loop and writes single bytes to a file. Writing one byte to a tmpfs file is around 2x slower than reading one byte from a file, which is a _bit_ more than I expected. This is a dumb benchmark, but I think it's hard to deny that write() is a hot path and we should avoid unnecessary overhead there. I did a 'perf record' of 30-second samples of read and write. The top item in a diffprofile is srcu_read_lock() from fsnotify(). There are active inotify fd's from systemd, but nothing is actually listening to the file or that part of the filesystem. I *think* we can avoid taking the srcu_read_lock() for the common case where there are no actual marks on the file being modified *or* the vfsmount. The *_fsnotify_mask is an aggregate of each of the masks from each mark. If we have nothing set in the masks at all then there are no marks and no need to do anything with 'ignored masks' since none exist. This keeps us from having to do the costly srcu_read_lock() for a check which is very cheap. This patch gave a 10.8% speedup in writes/second on my test. 
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: linux-kernel@vger.kernel.org --- b/fs/notify/fsnotify.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 !(test_mask & to_tell->i_fsnotify_mask) && !(mnt && test_mask & mnt->mnt_fsnotify_mask)) return 0; + /* + * Optimization: The *_fsnotify_mask is an aggregate of each of the + * masks from each mark. If we have nothing set in the masks at + * all then there are no marks and no need to do anything with + * 'ignored masks' since none exist. This keeps us from having to + * do the costly srcu_read_lock() for a check which is very cheap. + */ + if (!to_tell->i_fsnotify_mask && + (!mnt || !mnt->mnt_fsnotify_mask)) + return 0; idx = srcu_read_lock(&fsnotify_mark_srcu); _ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 21:50 [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files Dave Hansen @ 2015-06-19 23:33 ` Andi Kleen 2015-06-20 0:29 ` Paul E. McKenney 2015-06-20 0:39 ` Dave Hansen 2015-06-23 15:17 ` Jan Kara 1 sibling, 2 replies; 20+ messages in thread From: Andi Kleen @ 2015-06-19 23:33 UTC (permalink / raw) To: Dave Hansen Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel, paulmck On Fri, Jun 19, 2015 at 02:50:25PM -0700, Dave Hansen wrote: > > From: Dave Hansen <dave.hansen@linux.intel.com> > > I have a _tiny_ microbenchmark that sits in a loop and writes > single bytes to a file. Writing one byte to a tmpfs file is > around 2x slower than reading one byte from a file, which is a > _bit_ more than I expecte. This is a dumb benchmark, but I think > it's hard to deny that write() is a hot path and we should avoid > unnecessary overhead there. > > I did a 'perf record' of 30-second samples of read and write. > The top item in a diffprofile is srcu_read_lock() from > fsnotify(). There are active inotify fd's from systemd, but > nothing is actually listening to the file or its part of > the filesystem. > > I *think* we can avoid taking the srcu_read_lock() for the > common case where there are no actual marks on the file > being modified *or* the vfsmount. What is so expensive in it? Just the memory barrier in it? Perhaps the function can be tuned in general. -Andi int __srcu_read_lock(struct srcu_struct *sp) { int idx; idx = ACCESS_ONCE(sp->completed) & 0x1; preempt_disable(); __this_cpu_inc(sp->per_cpu_ref->c[idx]); smp_mb(); /* B */ /* Avoid leaking the critical section. */ __this_cpu_inc(sp->per_cpu_ref->seq[idx]); preempt_enable(); return idx; } > > The *_fsnotify_mask is an aggregate of each of the masks from > each mark. 
If we have nothing set in the masks at all then there > are no marks and no need to do anything with 'ignored masks' > since none exist. This keeps us from having to do the costly > srcu_read_lock() for a check which is very cheap. > > This patch gave a 10.8% speedup in writes/second on my test. > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Eric Paris <eparis@redhat.com> > Cc: John McCutchan <john@johnmccutchan.com> > Cc: Robert Love <rlove@rlove.org> > Cc: Tim Chen <tim.c.chen@linux.intel.com> > Cc: Andi Kleen <ak@linux.intel.com> > Cc: linux-kernel@vger.kernel.org > --- > > b/fs/notify/fsnotify.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c > --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 > +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 > @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 > !(test_mask & to_tell->i_fsnotify_mask) && > !(mnt && test_mask & mnt->mnt_fsnotify_mask)) > return 0; > + /* > + * Optimization: The *_fsnotify_mask is an aggregate of each of the > + * masks from each mark. If we have nothing set in the masks at > + * all then there are no marks and no need to do anything with > + * 'ignored masks' since none exist. This keeps us from having to > + * do the costly srcu_read_lock() for a check which is very cheap. > + */ > + if (!to_tell->i_fsnotify_mask && > + (!mnt || !mnt->mnt_fsnotify_mask)) > + return 0; > > idx = srcu_read_lock(&fsnotify_mark_srcu); > > _ -- ak@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 23:33 ` Andi Kleen @ 2015-06-20 0:29 ` Paul E. McKenney 2015-06-20 0:39 ` Dave Hansen 1 sibling, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-20 0:29 UTC (permalink / raw) To: Andi Kleen Cc: Dave Hansen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Fri, Jun 19, 2015 at 04:33:06PM -0700, Andi Kleen wrote: > On Fri, Jun 19, 2015 at 02:50:25PM -0700, Dave Hansen wrote: > > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > > > I have a _tiny_ microbenchmark that sits in a loop and writes > > single bytes to a file. Writing one byte to a tmpfs file is > > around 2x slower than reading one byte from a file, which is a > > _bit_ more than I expecte. This is a dumb benchmark, but I think > > it's hard to deny that write() is a hot path and we should avoid > > unnecessary overhead there. > > > > I did a 'perf record' of 30-second samples of read and write. > > The top item in a diffprofile is srcu_read_lock() from > > fsnotify(). There are active inotify fd's from systemd, but > > nothing is actually listening to the file or its part of > > the filesystem. > > > > I *think* we can avoid taking the srcu_read_lock() for the > > common case where there are no actual marks on the file > > being modified *or* the vfsmount. > > What is so expensive in it? Just the memory barrier in it? > > Perhaps the function can be tuned in general. The memory barrier we are pretty much stuck with unless we want synchronize_srcu() to be quite a bit more expensive (and for SRCU to be unusable from offline and idle) -- and that synchronize_srcu() expense drove rewrite from the earlier version to this one. It is possible to cut down from two to one instances of __this_cpu_inc(), however. 
It of course would be possible to have two types of SRCU, one for fast grace periods and the other for memory-barrier-free read-side critical sections, but obviously a very clear case would need to be made for this. At least judging from the reactions the last time I introduced a new flavor of RCU. ;-) So, echoing Andi, what exactly is expensive? Thanx, Paul > -Andi > > int __srcu_read_lock(struct srcu_struct *sp) > { > int idx; > > idx = ACCESS_ONCE(sp->completed) & 0x1; > preempt_disable(); > __this_cpu_inc(sp->per_cpu_ref->c[idx]); > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > __this_cpu_inc(sp->per_cpu_ref->seq[idx]); > preempt_enable(); > return idx; > } > > > > > > The *_fsnotify_mask is an aggregate of each of the masks from > > each mark. If we have nothing set in the masks at all then there > > are no marks and no need to do anything with 'ignored masks' > > since none exist. This keeps us from having to do the costly > > srcu_read_lock() for a check which is very cheap. > > > > This patch gave a 10.8% speedup in writes/second on my test. 
> > > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Jan Kara <jack@suse.cz> > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > Cc: Eric Paris <eparis@redhat.com> > > Cc: John McCutchan <john@johnmccutchan.com> > > Cc: Robert Love <rlove@rlove.org> > > Cc: Tim Chen <tim.c.chen@linux.intel.com> > > Cc: Andi Kleen <ak@linux.intel.com> > > Cc: linux-kernel@vger.kernel.org > > --- > > > > b/fs/notify/fsnotify.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c > > --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 > > +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 > > @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 > > !(test_mask & to_tell->i_fsnotify_mask) && > > !(mnt && test_mask & mnt->mnt_fsnotify_mask)) > > return 0; > > + /* > > + * Optimization: The *_fsnotify_mask is an aggregate of each of the > > + * masks from each mark. If we have nothing set in the masks at > > + * all then there are no marks and no need to do anything with > > + * 'ignored masks' since none exist. This keeps us from having to > > + * do the costly srcu_read_lock() for a check which is very cheap. > > + */ > > + if (!to_tell->i_fsnotify_mask && > > + (!mnt || !mnt->mnt_fsnotify_mask)) > > + return 0; > > > > idx = srcu_read_lock(&fsnotify_mark_srcu); > > > > _ > > -- > ak@linux.intel.com -- Speaking for myself only > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 23:33 ` Andi Kleen 2015-06-20 0:29 ` Paul E. McKenney @ 2015-06-20 0:39 ` Dave Hansen 2015-06-20 2:21 ` Paul E. McKenney 1 sibling, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-20 0:39 UTC (permalink / raw) To: Andi Kleen Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel, paulmck On 06/19/2015 04:33 PM, Andi Kleen wrote: >> > I *think* we can avoid taking the srcu_read_lock() for the >> > common case where there are no actual marks on the file >> > being modified *or* the vfsmount. > What is so expensive in it? Just the memory barrier in it? The profiling doesn't hit on the mfence directly, but I assume that the overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical before and after the barrier, but it appears much more expensive _after_. That makes no sense unless the barrier is the thing causing it. Here's how the annotation mode of 'perf top' breaks it down: > │ ffffffff810fb480 <load0>: > │ nop > │ mov (%rdi),%rax > 0.58 │ push %rbp > │ incl %gs:0x7ef0f488(%rip) > 1.73 │ mov %rsp,%rbp > │ and $0x1,%eax > │ movslq %eax,%rdx > 0.58 │ mov 0x8(%rdi),%rcx > │ incq %gs:(%rcx,%rdx,8) > │ mfence > 69.94 │ add $0x2,%rdx > 7.51 │ mov 0x8(%rdi),%rcx > 4.05 │ incq %gs:(%rcx,%rdx,8) > 13.87 │ decl %gs:0x7ef0f45f(%rip) > │ pop %rbp > 1.73 │ ← retq > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 0:39 ` Dave Hansen @ 2015-06-20 2:21 ` Paul E. McKenney 2015-06-20 18:02 ` Dave Hansen 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-20 2:21 UTC (permalink / raw) To: Dave Hansen Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Fri, Jun 19, 2015 at 05:39:11PM -0700, Dave Hansen wrote: > On 06/19/2015 04:33 PM, Andi Kleen wrote: > >> > I *think* we can avoid taking the srcu_read_lock() for the > >> > common case where there are no actual marks on the file > >> > being modified *or* the vfsmount. > > What is so expensive in it? Just the memory barrier in it? > > The profiling doesn't hit on the mfence directly, but I assume that the > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical > before and after the barrier, but it appears much more expensive > _after_. That makes no sense unless the barrier is the thing causing it. OK, one thing to try is to simply delete the memory barrier. The resulting code will be unsafe, but will probably run well enough to get benchmark results. If it is the memory barrier, you should of course get increased throughput. Thanx, Paul > Here's how the annotation mode of 'perf top' breaks it down: > > > │ ffffffff810fb480 <load0>: > > │ nop > > │ mov (%rdi),%rax > > 0.58 │ push %rbp > > │ incl %gs:0x7ef0f488(%rip) > > 1.73 │ mov %rsp,%rbp > > │ and $0x1,%eax > > │ movslq %eax,%rdx > > 0.58 │ mov 0x8(%rdi),%rcx > > │ incq %gs:(%rcx,%rdx,8) > > │ mfence > > 69.94 │ add $0x2,%rdx > > 7.51 │ mov 0x8(%rdi),%rcx > > 4.05 │ incq %gs:(%rcx,%rdx,8) > > 13.87 │ decl %gs:0x7ef0f45f(%rip) > > │ pop %rbp > > 1.73 │ ← retq > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 2:21 ` Paul E. McKenney @ 2015-06-20 18:02 ` Dave Hansen 2015-06-21 1:30 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-20 18:02 UTC (permalink / raw) To: paulmck Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On 06/19/2015 07:21 PM, Paul E. McKenney wrote: >>> > > What is so expensive in it? Just the memory barrier in it? >> > >> > The profiling doesn't hit on the mfence directly, but I assume that the >> > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical >> > before and after the barrier, but it appears much more expensive >> > _after_. That makes no sense unless the barrier is the thing causing it. > OK, one thing to try is to simply delete the memory barrier. The > resulting code will be unsafe, but will probably run well enough to > get benchmark results. If it is the memory barrier, you should of > course get increased throughput. So I took the smp_mb() out of __srcu_read_lock(). The benchmark didn't improve at all. Looking at the profile, all of the overhead had just shifted to __srcu_read_unlock() and its memory barrier! Removing the barrier in __srcu_read_unlock() got essentially the same gains out of the benchmark as the original patch in this thread that just avoids RCU. I think that's fairly conclusive that the source of the overhead is, indeed, the memory barriers. Although I said this test was single threaded, I also had another thought. The benchmark is single-threaded, but 'perf' is sitting doing profiling and who knows what else on the other core, and the profiling NMIs are certainly writing plenty of data to memory. So, there might be plenty of work for that smp_mb()/mfence to do _despite_ the benchmark itself being single threaded. 
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 18:02 ` Dave Hansen @ 2015-06-21 1:30 ` Paul E. McKenney 2015-06-22 13:28 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-21 1:30 UTC (permalink / raw) To: Dave Hansen Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Sat, Jun 20, 2015 at 11:02:08AM -0700, Dave Hansen wrote: > On 06/19/2015 07:21 PM, Paul E. McKenney wrote: > >>> > > What is so expensive in it? Just the memory barrier in it? > >> > > >> > The profiling doesn't hit on the mfence directly, but I assume that the > >> > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical > >> > before and after the barrier, but it appears much more expensive > >> > _after_. That makes no sense unless the barrier is the thing causing it. > > OK, one thing to try is to simply delete the memory barrier. The > > resulting code will be unsafe, but will probably run well enough to > > get benchmark results. If it is the memory barrier, you should of > > course get increased throughput. > > So I took the smp_mb() out of __srcu_read_lock(). The benchmark didn't > improve at all. Looking at the profile, all of the overhead had just > shifted to __srcu_read_unlock() and its memory barrier! Removing the > barrier in __srcu_read_unlock() got essentially the same gains out of > the benchmark as the original patch in this thread that just avoids RCU. > > I think that's fairly conclusive that the source of the overhead is, > indeed, the memory barriers. > > Although I said this test was single threaded, I also had another > thought. The benchmark is single-threaded, but 'perf' is sitting doing > profiling and who knows what else on the other core, and the profiling > NMIs are certainly writing plenty of data to memory. So, there might be > plenty of work for that smp_mb()/mfence to do _despite_ the benchmark > itself being single threaded. 
Well, it is not hard to have an SRCU-like thing that doesn't have read-side memory barriers, given that older versions of SRCU didn't have them. However, the price is increased latency for the analog to synchronize_srcu(). I am guessing that this would not be a problem for notification-group destruction, which is presumably rare. That said, if empty *_fsnotify_mask is the common case or if the overhead of processing notification overwhelms srcu_read_lock(), your original patch seems a bit simpler. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-21 1:30 ` Paul E. McKenney @ 2015-06-22 13:28 ` Peter Zijlstra 2015-06-22 15:11 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 13:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Sat, Jun 20, 2015 at 06:30:58PM -0700, Paul E. McKenney wrote: > Well, it is not hard to have an SRCU-like thing that doesn't have > read-side memory barriers, given that older versions of SRCU didn't > have them. However, the price is increased latency for the analog to > synchronize_srcu(). I am guessing that this would not be a problem > for notification-group destruction, which is presumably rare. I don't think it ever makes sense to optimize for a global state. So screw sync_srcu() and make the srcu_read_lock() thing go fast. If you need fast global state you're doing it wrong.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 13:28 ` Peter Zijlstra @ 2015-06-22 15:11 ` Paul E. McKenney 2015-06-22 15:20 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-22 15:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 03:28:21PM +0200, Peter Zijlstra wrote: > On Sat, Jun 20, 2015 at 06:30:58PM -0700, Paul E. McKenney wrote: > > Well, it is not hard to have an SRCU-like thing that doesn't have > > read-side memory barriers, given that older versions of SRCU didn't > > have them. However, the price is increased latency for the analog to > > synchronize_srcu(). I am guessing that this would not be a problem > > for notification-group destruction, which is presumably rare. > > I don't think it ever makes sense to optimize for a global state. So > screw sync_srcu() and make the srcu_read_lock() thing go fast. > > If you need fast global state you're doing it wrong. That depends on how slow the resulting slow global state would be. We have some use cases (definitely KVM, perhaps also some of the VFS code) that need the current speed, as opposed to the profound slowness that three trips through synchronize_sched() would provide. Plus we would lose the ability to have SRCU readers on idle and offline CPUs. But if Dave is willing to test it, I would be happy to send along a fast-readers patch, easy to do. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:11 ` Paul E. McKenney @ 2015-06-22 15:20 ` Peter Zijlstra 2015-06-22 16:29 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 15:20 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote: > That depends on how slow the resulting slow global state would be. > We have some use cases (definitely KVM, perhaps also some of the VFS > code) that need the current speed, as opposed to the profound slowness > that three trips through synchronize_sched() would provide. But we have call_srcu() these days, not everything needs to use sync_srcu() anymore. Although I've not checked recently. > Plus we > would lose the ability to have SRCU readers on idle and offline CPUs. Are we actually doing that? offline CPUs in particular seems iffy, I don't think we need (or should) worry about that. I know its been an issue with regular RCU due to tracing, but I'm not sure we should care for it for SRCU.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:20 ` Peter Zijlstra @ 2015-06-22 16:29 ` Paul E. McKenney 2015-06-22 19:03 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-22 16:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 05:20:13PM +0200, Peter Zijlstra wrote: > On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote: > > That depends on how slow the resulting slow global state would be. > > We have some use cases (definitely KVM, perhaps also some of the VFS > > code) that need the current speed, as opposed to the profound slowness > > that three trips through synchronize_sched() would provide. > > But we have call_srcu() these days, not everything needs to use > sync_srcu() anymore. Although I've not checked recently. I believe that the KVM guys do need synchronize_srcu(), but yes, there probably are at least some cases where people might do well to move from synchronize_srcu() to call_srcu(). That said, the added complexity might or might not be worthwhile in all cases. > > Plus we > > would lose the ability to have SRCU readers on idle and offline CPUs. > > Are we actually doing that? offline CPUs in particular seems iffy, I > don't think we need (or should) worry about that. I know its been an > issue with regular RCU due to tracing, but I'm not sure we should care > for it for SRCU. I believe that there still are some cases. But why would offline CPUs seem so iffy? CPUs coming up execute code before they are fully operational, and during that time, much of the kernel views them as being offline. Yet they do have to execute significant code in order to get themselves set up. 
Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 16:29 ` Paul E. McKenney @ 2015-06-22 19:03 ` Peter Zijlstra 2015-06-23 0:31 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 19:03 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 09:29:49AM -0700, Paul E. McKenney wrote: > I believe that there still are some cases. But why would offline > CPUs seem so iffy? CPUs coming up execute code before they are fully > operational, and during that time, much of the kernel views them as > being offline. Yet they do have to execute significant code in order > to get themselves set up. I'm thinking we do far too much during bringup and tear-down as it is. But yes maybe.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 19:03 ` Peter Zijlstra @ 2015-06-23 0:31 ` Paul E. McKenney 0 siblings, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-23 0:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 09:03:08PM +0200, Peter Zijlstra wrote: > On Mon, Jun 22, 2015 at 09:29:49AM -0700, Paul E. McKenney wrote: > > > I believe that there still are some cases. But why would offline > > CPUs seem so iffy? CPUs coming up execute code before they are fully > > operational, and during that time, much of the kernel views them as > > being offline. Yet they do have to execute significant code in order > > to get themselves set up. > > I'm thinking we do far too much during bringup and tear-down as it is. > But yes maybe. Boot, suspend, and hibernation indeed would be faster if we did less, but we still will have to do something. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:11 ` Paul E. McKenney 2015-06-22 15:20 ` Peter Zijlstra @ 2015-06-22 18:50 ` Dave Hansen 2015-06-23 0:26 ` Paul E. McKenney 2015-06-22 18:52 ` Peter Zijlstra 2 siblings, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-22 18:50 UTC (permalink / raw) To: paulmck, Peter Zijlstra Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On 06/22/2015 08:11 AM, Paul E. McKenney wrote: > But if Dave is willing to test it, I would be happy to send along > a fast-readers patch, easy to do. I'm always willing to test, but the cost of the srcu_read_lock() barrier shows up even on my 2-year-old "Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz" laptop. The numbers I shared in this thread are on a newer CPU than that, so I'm fairly confident this will show up on just about any (big core) Intel CPU newer than SandyBridge. The tests I've been running are: https://github.com/antonblanchard/will-it-scale with two new 1-byte read/write tests copied in to "tests/": https://www.sr71.net/~dave/intel/read1byte.c https://www.sr71.net/~dave/intel/write1byte.c The one-byte thing is silly but it does help isolate the kernel's overhead from what the app is doing.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 18:50 ` Dave Hansen @ 2015-06-23 0:26 ` Paul E. McKenney 2015-06-24 16:50 ` Dave Hansen 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-23 0:26 UTC (permalink / raw) To: Dave Hansen Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 11:50:50AM -0700, Dave Hansen wrote: > On 06/22/2015 08:11 AM, Paul E. McKenney wrote: > > But if Dave is willing to test it, I would be happy to send along > > a fast-readers patch, easy to do. > > I'm always willing to test, but the cost of the srcu_read_lock() barrier > shows up even on my 2-year-old "Intel(R) Core(TM) i5-3320M CPU @ > 2.60GHz" laptop. The numbers I shared in this thread are on a newer CPU > than that, so I'm fairly confident this will show up on just about any > (big core) Intel CPU newer than SandyBridge. > > The tests I've been running are: > > https://github.com/antonblanchard/will-it-scale > > with two new 1-byte read/write tests copied in to "tests/": > > https://www.sr71.net/~dave/intel/read1byte.c > https://www.sr71.net/~dave/intel/write1byte.c > > The one-byte thing is silly but it does help isolate the kernel's > overhead from what the app is doing. OK, here is an experimental patch that provides a fast-readers variant of RCU, forward-ported from v3.3. Because we didn't have call_srcu() and srcu_barrier() back then, it is not a drop-in replacement for SRCU, so you need to adapt the code to the API, which means putting an "fr" in front of the "srcu" in the API members. Understood on the overhead of the memory-barrier instruction showing up consistently. My point was instead that getting rid of this memory-barrier instruction does not come for free, as it greatly increases the latency of synchronize_frsrcu(). 
In a real workload, it is entirely possible that the savings from eliminating the memory barrier are overwhelmed by the increased grace-period latency. Anyway, the patch is below. Very lightly tested. Thanx, Paul ------------------------------------------------------------------------ commit d28d4e08ee2f87e6746ac77b07d5c94a65bcbd95 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Mon Jun 22 14:51:18 2015 -0700 rcu: Add a fast-readers variant of SRCU Dave Hansen's testing of single-byte I/O to tmpfs showed significant system-level overhead from the memory barriers in srcu_read_lock() and srcu_read_unlock(). This experimental not-for-merging commit therefore re-introduces the old variant of SRCU that avoided read-side memory barriers. This of course also introduces that variant's slow grace periods. There is no analog to call_rcu(), nor is there an analog to rcu_barrier(), so this is not a drop-in replacement for SRCU. Unlike SRCU, this implementation cannot be invoked from idle or offline CPUs. This variant therefore gets its own API: frsrcu_read_lock(), frsrcu_read_unlock(), and so on. Reported-by: Dave Hansen <dave@sr71.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> diff --git a/include/linux/frsrcu.h b/include/linux/frsrcu.h new file mode 100644 index 000000000000..5ad60d2d5bca --- /dev/null +++ b/include/linux/frsrcu.h @@ -0,0 +1,250 @@ +/* + * Fast-Reader Sleepable Read-Copy Update mechanism for mutual exclusion + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2015
+ *
+ * Author: Paul McKenney <paulmck@linux.vnet.ibm.com>
+ */
+
+#ifndef _LINUX_FRSRCU_H
+#define _LINUX_FRSRCU_H
+
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+
+struct frsrcu_struct_array {
+	int c[2];
+};
+
+struct frsrcu_struct {
+	int completed;
+	struct frsrcu_struct_array __percpu *per_cpu_ref;
+	struct mutex mutex;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+};
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+int __init_frsrcu_struct(struct frsrcu_struct *frsp, const char *name,
+			 struct lock_class_key *key);
+
+#define init_frsrcu_struct(frsp) \
+({ \
+	static struct lock_class_key __frsrcu_key; \
+ \
+	__init_frsrcu_struct((frsp), #frsp, &__frsrcu_key); \
+})
+
+#define __FRSRCU_DEP_MAP_INIT(frsrcu_name) .dep_map = { .name = #frsrcu_name },
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+int init_frsrcu_struct(struct frsrcu_struct *frsp);
+
+#define __FRSRCU_DEP_MAP_INIT(frsrcu_name)
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#define __FRSRCU_STRUCT_INIT(name) \
+	{ \
+		.completed = -300, \
+		.per_cpu_ref = &name##_frsrcu_array, \
+		.mutex = __MUTEX_INITIALIZER(name.mutex), \
+		__FRSRCU_DEP_MAP_INIT(name) \
+	}
+
+/*
+ * define and init a frsrcu struct at build time.
+ * don't call init_frsrcu_struct() nor cleanup_frsrcu_struct() on it.
+ */
+#define __DEFINE_FRSRCU(name, is_static) \
+	static DEFINE_PER_CPU(struct frsrcu_struct_array, name##_frsrcu_array);\
+	is_static struct frsrcu_struct name = __FRSRCU_STRUCT_INIT(name)
+#define DEFINE_FRSRCU(name)		__DEFINE_FRSRCU(name, /* not static */)
+#define DEFINE_STATIC_FRSRCU(name)	__DEFINE_FRSRCU(name, static)
+
+void cleanup_frsrcu_struct(struct frsrcu_struct *frsp);
+int __frsrcu_read_lock(struct frsrcu_struct *frsp) __acquires(frsp);
+void __frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx) __releases(frsp);
+void synchronize_frsrcu(struct frsrcu_struct *frsp);
+void synchronize_frsrcu_expedited(struct frsrcu_struct *frsp);
+long frsrcu_batches_completed(struct frsrcu_struct *frsp);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+/**
+ * frsrcu_read_lock_held - might we be in FRSRCU read-side critical section?
+ *
+ * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an FRSRCU
+ * read-side critical section.  In absence of CONFIG_DEBUG_LOCK_ALLOC,
+ * this assumes we are in an FRSRCU read-side critical section unless it can
+ * prove otherwise.
+ *
+ * Checks debug_lockdep_rcu_enabled() to prevent false positives during boot
+ * and while lockdep is disabled.
+ *
+ * Note that if the CPU is in the idle loop from an RCU point of view
+ * (ie: that we are in the section between rcu_idle_enter() and
+ * rcu_idle_exit()) then frsrcu_read_lock_held() returns false even if
+ * the CPU did an frsrcu_read_lock().  The reason for this is that RCU
+ * ignores CPUs that are in such a section, considering these as in
+ * extended quiescent state, so such a CPU is effectively never in an
+ * RCU read-side critical section regardless of what RCU primitives it
+ * invokes.  This state of affairs is required --- we need to keep an
+ * RCU-free window in idle where the CPU may possibly enter into low
+ * power mode.  This way we can notice an extended quiescent state to
+ * other CPUs that started a grace period.  Otherwise we would delay any
+ * grace period as long as we run in the idle task.
+ *
+ * Similarly, we avoid claiming an FRSRCU read lock held if the current
+ * CPU is offline.
+ */
+static inline int frsrcu_read_lock_held(struct frsrcu_struct *frsp)
+{
+	if (!debug_lockdep_rcu_enabled())
+		return 1;
+
+	if (!rcu_is_watching())
+		return 0;
+	if (!rcu_lockdep_current_cpu_online())
+		return 0;
+	return lock_is_held(&frsp->dep_map);
+}
+
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int frsrcu_read_lock_held(struct frsrcu_struct *frsp)
+{
+	return 1;
+}
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+/**
+ * frsrcu_dereference_check - fetch FRSRCU-protected pointer for later deref
+ * @p: the pointer to fetch and protect for later dereferencing
+ * @frsp: pointer to the frsrcu_struct, which is used to check that we
+ *	really are in an FRSRCU read-side critical section.
+ * @c: condition to check for update-side use
+ *
+ * If PROVE_RCU is enabled, invoking this outside of an RCU read-side
+ * critical section will result in an RCU-lockdep splat, unless @c evaluates
+ * to 1.  The @c argument will normally be a logical expression containing
+ * lockdep_is_held() calls.
+ */
+#define frsrcu_dereference_check(p, frsp, c) \
+	__rcu_dereference_check((p), frsrcu_read_lock_held(frsp) || (c), __rcu)
+
+/**
+ * frsrcu_dereference - fetch FRSRCU-protected pointer for later dereferencing
+ * @p: the pointer to fetch and protect for later dereferencing
+ * @frsp: pointer to the frsrcu_struct, which is used to check that we
+ *	really are in an FRSRCU read-side critical section.
+ *
+ * Makes rcu_dereference_check() do the dirty work.  If PROVE_RCU
+ * is enabled, invoking this outside of an RCU read-side critical
+ * section will result in an RCU-lockdep splat.
+ */
+#define frsrcu_dereference(p, frsp) frsrcu_dereference_check((p), (frsp), 0)
+
+/**
+ * frsrcu_read_lock - register a new reader for an FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to register the new reader.
+ *
+ * Enter an FRSRCU read-side critical section.  Note that FRSRCU read-side
+ * critical sections may be nested.  However, it is illegal to
+ * call anything that waits on an FRSRCU grace period for the same
+ * frsrcu_struct, whether directly or indirectly.  Please note that
+ * one way to indirectly wait on an FRSRCU grace period is to acquire
+ * a mutex that is held elsewhere while calling synchronize_frsrcu() or
+ * synchronize_frsrcu_expedited().
+ *
+ * Note that frsrcu_read_lock() and the matching frsrcu_read_unlock() must
+ * occur in the same context, for example, it is illegal to invoke
+ * frsrcu_read_unlock() in an irq handler if the matching frsrcu_read_lock()
+ * was invoked in process context.
+ */
+static inline int frsrcu_read_lock(struct frsrcu_struct *frsp) __acquires(frsp)
+{
+	int retval = __frsrcu_read_lock(frsp);
+
+	rcu_lock_acquire(&(frsp)->dep_map);
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "frsrcu_read_lock() used illegally while idle");
+	return retval;
+}
+
+/**
+ * frsrcu_read_unlock - unregister old reader from FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding frsrcu_read_lock().
+ *
+ * Exit an FRSRCU read-side critical section.
+ */
+static inline void frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx)
+	__releases(frsp)
+{
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "frsrcu_read_unlock() used illegally while idle");
+	rcu_lock_release(&(frsp)->dep_map);
+	__frsrcu_read_unlock(frsp, idx);
+}
+
+/**
+ * frsrcu_read_lock_raw - register new reader for FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to register the new reader.
+ *
+ * Enter an FRSRCU read-side critical section.  Similar to frsrcu_read_lock(),
+ * but avoids the RCU-lockdep checking.  This means that it is legal to
+ * use frsrcu_read_lock_raw() in one context, for example, in an exception
+ * handler, and then have the matching frsrcu_read_unlock_raw() in another
+ * context, for example in the task that took the exception.
+ *
+ * However, the entire FRSRCU read-side critical section must reside within
+ * a single task.  For example, beware of using frsrcu_read_lock_raw() in a
+ * device interrupt handler and frsrcu_read_unlock() in the interrupted task:
+ * This will not work if interrupts are threaded.
+ */
+static inline int frsrcu_read_lock_raw(struct frsrcu_struct *frsp)
+{
+	unsigned long flags;
+	int ret;
+
+	local_irq_save(flags);
+	ret = __frsrcu_read_lock(frsp);
+	local_irq_restore(flags);
+	return ret;
+}
+
+/**
+ * frsrcu_read_unlock_raw - unregister reader from FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding frsrcu_read_lock_raw().
+ *
+ * Exit an FRSRCU read-side critical section without lockdep-RCU checking.
+ * See frsrcu_read_lock_raw() for more details.
+ */
+static inline void frsrcu_read_unlock_raw(struct frsrcu_struct *frsp, int idx)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__frsrcu_read_unlock(frsp, idx);
+	local_irq_restore(flags);
+}
+
+#endif

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 03a899aabd17..7ff71dbabf86 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -69,6 +69,7 @@ void rcu_unexpedite_gp(void);
 #endif /* #else #ifdef CONFIG_TINY_RCU */

 enum rcutorture_type {
+	FRSRCU_FLAVOR,
 	RCU_FLAVOR,
 	RCU_BH_FLAVOR,
 	RCU_SCHED_FLAVOR,
diff --git a/init/Kconfig b/init/Kconfig
index 4c08197044f1..381d75cafe7d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -517,6 +517,15 @@ config SRCU
 	  permits arbitrary sleeping or blocking within RCU read-side critical
 	  sections.
+config FRSRCU
+	bool
+	help
+	  This option selects the fast-readers sleepable version of
+	  RCU.  This version permits arbitrary sleeping or blocking within
+	  RCU read-side critical sections, but also avoids memory barriers
+	  within read-side markers.  Nothing comes for free, though.
+	  Grace periods are quite a bit more expensive than those of SRCU.
+
 config TASKS_RCU
 	bool
 	default n
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 50a808424b06..d2d7272a69bb 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -1,4 +1,5 @@
 obj-y += update.o
+obj-$(CONFIG_FRSRCU) += frsrcu.o
 obj-$(CONFIG_SRCU) += srcu.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += tree.o
diff --git a/kernel/rcu/frsrcu.c b/kernel/rcu/frsrcu.c
new file mode 100644
index 000000000000..3bea0a020379
--- /dev/null
+++ b/kernel/rcu/frsrcu.c
@@ -0,0 +1,320 @@
+/*
+ * Fast-Reader Sleepable Read-Copy Update mechanism for mutual exclusion.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2015
+ *
+ * Author: Paul McKenney <paulmck@linux.vnet.ibm.com>
+ */
+
+#include <linux/export.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/delay.h>
+#include <linux/frsrcu.h>
+
+/**
+ * init_frsrcu_struct - initialize a sleep-RCU structure
+ * @frsp: structure to initialize.
+ *
+ * Must invoke this on a given frsrcu_struct before passing that
+ * frsrcu_struct to any other function.  Each frsrcu_struct represents a
+ * separate domain of FRSRCU protection.
+ */
+int init_frsrcu_struct(struct frsrcu_struct *frsp)
+{
+	frsp->completed = 0;
+	mutex_init(&frsp->mutex);
+	frsp->per_cpu_ref = alloc_percpu(struct frsrcu_struct_array);
+	return frsp->per_cpu_ref ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(init_frsrcu_struct);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+int __init_frsrcu_struct(struct frsrcu_struct *frsp, const char *name,
+			 struct lock_class_key *key)
+{
+	/* Don't re-initialize a lock while it is held. */
+	debug_check_no_locks_freed((void *)frsp, sizeof(*frsp));
+	lockdep_init_map(&frsp->dep_map, name, key, 0);
+	return init_frsrcu_struct(frsp);
+}
+EXPORT_SYMBOL_GPL(__init_frsrcu_struct);
+
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+/*
+ * frsrcu_readers_active_idx -- returns approximate number of readers
+ * active on the specified rank of per-CPU counters.
+ */
+
+static int frsrcu_readers_active_idx(struct frsrcu_struct *frsp, int idx)
+{
+	int cpu;
+	int sum;
+
+	sum = 0;
+	for_each_possible_cpu(cpu)
+		sum += READ_ONCE(per_cpu_ptr(frsp->per_cpu_ref, cpu)->c[idx]);
+	return sum;
+}
+
+/**
+ * frsrcu_readers_active - returns approximate number of readers.
+ * @frsp: which frsrcu_struct counts active readers (holding frsrcu_read_lock).
+ *
+ * Note that this is not an atomic primitive, and can therefore suffer
+ * severe errors when invoked on an active frsrcu_struct.  That said, it
+ * can be useful as an error check at cleanup time.
+ */
+static int frsrcu_readers_active(struct frsrcu_struct *frsp)
+{
+	return frsrcu_readers_active_idx(frsp, 0) +
+	       frsrcu_readers_active_idx(frsp, 1);
+}
+
+/**
+ * cleanup_frsrcu_struct - deconstruct a sleep-RCU structure
+ * @frsp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given frsrcu_struct that
+ * was initialized via init_frsrcu_struct(), else you leak memory.
+ */
+void cleanup_frsrcu_struct(struct frsrcu_struct *frsp)
+{
+	int sum;
+
+	sum = frsrcu_readers_active(frsp);
+	WARN_ON(sum);  /* Leakage unless caller handles error. */
+	if (sum != 0)
+		return;
+	free_percpu(frsp->per_cpu_ref);
+	frsp->per_cpu_ref = NULL;
+}
+EXPORT_SYMBOL_GPL(cleanup_frsrcu_struct);
+
+/*
+ * Counts the new reader in the appropriate per-CPU element of the
+ * frsrcu_struct.  Must be called from process context.  Returns an index
+ * that must be passed to the matching frsrcu_read_unlock().
+ */
+int __frsrcu_read_lock(struct frsrcu_struct *frsp)
+{
+	int idx;
+
+	preempt_disable();
+	idx = frsp->completed & 0x1;
+	barrier();  /* ensure compiler looks -once- at frsp->completed. */
+	__this_cpu_inc(frsp->per_cpu_ref->c[idx]);
+	preempt_enable();
+	return idx;
+}
+EXPORT_SYMBOL_GPL(__frsrcu_read_lock);
+
+/*
+ * Removes the count for the old reader from the appropriate per-CPU
+ * element of the frsrcu_struct.  Note that this may well be a different CPU
+ * than that which was incremented by the corresponding frsrcu_read_lock().
+ * Must be called from process context.
+ */
+void __frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx)
+{
+	preempt_disable();
+	__this_cpu_dec(frsp->per_cpu_ref->c[idx]);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(__frsrcu_read_unlock);
+
+/*
+ * We use an adaptive strategy for synchronize_frsrcu() and especially for
+ * synchronize_frsrcu_expedited().  We spin for a fixed time period (defined
+ * below) to allow FRSRCU readers to exit their read-side critical sections.
+ * If there are still some readers after 10 microseconds, we repeatedly
+ * block for 1-millisecond time periods.  This approach has done well in
+ * testing, so there is no need for a config parameter.
+ */
+#define SYNCHRONIZE_FRSRCU_READER_DELAY 10
+
+/*
+ * Helper function for synchronize_frsrcu() and synchronize_frsrcu_expedited().
+ */
+static void
+__synchronize_frsrcu(struct frsrcu_struct *frsp, void (*sync_func)(void))
+{
+	int idx;
+
+	rcu_lockdep_assert(!lock_is_held(&frsp->dep_map) &&
+			   !lock_is_held(&rcu_bh_lock_map) &&
+			   !lock_is_held(&rcu_lock_map) &&
+			   !lock_is_held(&rcu_sched_lock_map),
+			   "Illegal synchronize_frsrcu() in same-type FRSRCU (or RCU) read-side critical section");
+
+	idx = frsp->completed;
+	mutex_lock(&frsp->mutex);
+
+	/*
+	 * Check to see if someone else did the work for us while we were
+	 * waiting to acquire the lock.  We need -two- advances of
+	 * the counter, not just one.  If there was but one, we might have
+	 * shown up -after- our helper's first synchronize_sched(), thus
+	 * having failed to prevent CPU-reordering races with concurrent
+	 * frsrcu_read_unlock()s on other CPUs (see comment below).  So we
+	 * either (1) wait for two or (2) supply the second ourselves.
+	 */
+
+	if ((frsp->completed - idx) >= 2) {
+		mutex_unlock(&frsp->mutex);
+		return;
+	}
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * The preceding synchronize_sched() ensures that any CPU that
+	 * sees the new value of frsp->completed will also see any preceding
+	 * changes to data structures made by this CPU.  This prevents
+	 * some other CPU from reordering the accesses in its FRSRCU
+	 * read-side critical section to precede the corresponding
+	 * frsrcu_read_lock() -- ensuring that such references will in
+	 * fact be protected.
+	 *
+	 * So it is now safe to do the flip.
+	 */
+
+	idx = frsp->completed & 0x1;
+	frsp->completed++;
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * At this point, because of the preceding synchronize_sched(),
+	 * all frsrcu_read_lock() calls using the old counters have completed.
+	 * Their corresponding critical sections might well be still
+	 * executing, but the frsrcu_read_lock() primitives themselves
+	 * will have finished executing.  We initially give readers
+	 * an arbitrarily chosen 10 microseconds to get out of their
+	 * FRSRCU read-side critical sections, then loop waiting 1/HZ
+	 * seconds per iteration.  The 10-microsecond value has done
+	 * very well in testing.
+	 */
+
+	if (frsrcu_readers_active_idx(frsp, idx))
+		udelay(SYNCHRONIZE_FRSRCU_READER_DELAY);
+	while (frsrcu_readers_active_idx(frsp, idx))
+		schedule_timeout_interruptible(1);
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * The preceding synchronize_sched() forces all frsrcu_read_unlock()
+	 * primitives that were executing concurrently with the preceding
+	 * for_each_possible_cpu() loop to have completed by this point.
+	 * More importantly, it also forces the corresponding FRSRCU
+	 * read-side critical sections to have also completed, and the
+	 * corresponding references to FRSRCU-protected data items to
+	 * be dropped.
+	 *
+	 * Note:
+	 *
+	 *	Despite what you might think at first glance, the
+	 *	preceding synchronize_sched() -must- be within the
+	 *	critical section ended by the following mutex_unlock().
+	 *	Otherwise, a task taking the early exit can race
+	 *	with a frsrcu_read_unlock(), which might have executed
+	 *	just before the preceding frsrcu_readers_active() check,
+	 *	and whose CPU might have reordered the frsrcu_read_unlock()
+	 *	with the preceding critical section.  In this case, there
+	 *	is nothing preventing the synchronize_sched() task that is
+	 *	taking the early exit from freeing a data structure that
+	 *	is still being referenced (out of order) by the task
+	 *	doing the frsrcu_read_unlock().
+	 *
+	 *	Alternatively, the comparison with "2" on the early
+	 *	exit could be changed to "3", but this increases
+	 *	synchronize_frsrcu() latency for bulk loads.  So the
+	 *	current code is preferred.
+	 */
+
+	mutex_unlock(&frsp->mutex);
+}
+
+/**
+ * synchronize_frsrcu - wait for pre-existing FRSRCU read-side critical sections
+ * @frsp: frsrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates.  Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_frsrcu() from the
+ * corresponding FRSRCU read-side critical section; doing so will result
+ * in deadlock.  However, it is perfectly legal to call synchronize_frsrcu()
+ * on one frsrcu_struct from some other frsrcu_struct's read-side critical
+ * section.
+ */
+void synchronize_frsrcu(struct frsrcu_struct *frsp)
+{
+	__synchronize_frsrcu(frsp, synchronize_sched);
+}
+EXPORT_SYMBOL_GPL(synchronize_frsrcu);
+
+/**
+ * synchronize_frsrcu_expedited - Brute-force FRSRCU grace period
+ * @frsp: frsrcu_struct with which to synchronize.
+ *
+ * Wait for an FRSRCU grace period to elapse, but use a "big hammer"
+ * approach to force the grace period to end quickly.  This consumes
+ * significant time on all CPUs and is unfriendly to real-time workloads,
+ * so is thus not recommended for any sort of common-case code.  In fact, if
+ * you are using synchronize_frsrcu_expedited() in a loop, please
+ * restructure your code to batch your updates, and then use a single
+ * synchronize_frsrcu() instead.
+ *
+ * Note that it is illegal to call this function while holding any lock
+ * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
+ * to call this function from a CPU-hotplug notifier.  Failing to observe
+ * these restrictions will result in deadlock.  It is also illegal to call
+ * synchronize_frsrcu_expedited() from the corresponding FRSRCU read-side
+ * critical section; doing so will result in deadlock.  However, it is
+ * perfectly legal to call synchronize_frsrcu_expedited() on one
+ * frsrcu_struct from some other frsrcu_struct's read-side critical
+ * section, as long as the resulting graph of frsrcu_structs is acyclic.
+ */
+void synchronize_frsrcu_expedited(struct frsrcu_struct *frsp)
+{
+	__synchronize_frsrcu(frsp, synchronize_sched_expedited);
+}
+EXPORT_SYMBOL_GPL(synchronize_frsrcu_expedited);
+
+/**
+ * frsrcu_batches_completed - return batches completed.
+ * @frsp: frsrcu_struct on which to report batch completion.
+ *
+ * Report the number of batches, correlated with, but not necessarily
+ * precisely the same as, the number of grace periods that have elapsed.
+ */
+
+long frsrcu_batches_completed(struct frsrcu_struct *frsp)
+{
+	return frsp->completed;
+}
+EXPORT_SYMBOL_GPL(frsrcu_batches_completed);
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 59e32684c23b..6dfd9b9bd895 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -44,6 +44,7 @@
 #include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/stat.h>
+#include <linux/frsrcu.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/trace_clock.h>
@@ -264,6 +265,102 @@ struct rcu_torture_ops {
 static struct rcu_torture_ops *cur_ops;

 /*
+ * Definitions for frsrcu torture testing.
+ */
+
+DEFINE_STATIC_FRSRCU(frsrcu_ctl);
+static struct frsrcu_struct frsrcu_ctld;
+static struct frsrcu_struct *frsrcu_ctlp = &frsrcu_ctl;
+
+static int frsrcu_torture_read_lock(void) __acquires(frsrcu_ctlp)
+{
+	return frsrcu_read_lock(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_read_unlock(int idx) __releases(frsrcu_ctlp)
+{
+	frsrcu_read_unlock(frsrcu_ctlp, idx);
+}
+
+static unsigned long frsrcu_torture_completed(void)
+{
+	return frsrcu_batches_completed(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_synchronize(void)
+{
+	synchronize_frsrcu(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_stats(void)
+{
+	int cpu;
+	int idx = frsrcu_ctlp->completed & 0x1;
+
+	pr_alert("%s%s per-CPU(idx=%d):",
+		 torture_type, TORTURE_FLAG, idx);
+	for_each_possible_cpu(cpu) {
+		long c0, c1;
+
+		c0 = (long)per_cpu_ptr(frsrcu_ctlp->per_cpu_ref, cpu)->c[!idx];
+		c1 = (long)per_cpu_ptr(frsrcu_ctlp->per_cpu_ref, cpu)->c[idx];
+		pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
+	}
+	pr_cont("\n");
+}
+
+static void frsrcu_torture_synchronize_expedited(void)
+{
+	synchronize_frsrcu_expedited(frsrcu_ctlp);
+}
+
+static void rcu_sync_torture_init(void);
+static void srcu_read_delay(struct torture_random_state *rrsp);
+
+static struct rcu_torture_ops frsrcu_ops = {
+	.ttype		= FRSRCU_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= frsrcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= frsrcu_torture_read_unlock,
+	.started	= NULL,
+	.completed	= frsrcu_torture_completed,
+	.sync		= frsrcu_torture_synchronize,
+	.exp_sync	= frsrcu_torture_synchronize_expedited,
+	.stats		= frsrcu_torture_stats,
+	.name		= "frsrcu"
+};
+
+static void frsrcu_torture_init(void)
+{
+	rcu_sync_torture_init();
+	WARN_ON(init_frsrcu_struct(&frsrcu_ctld));
+	frsrcu_ctlp = &frsrcu_ctld;
+}
+
+static void frsrcu_torture_cleanup(void)
+{
+	cleanup_frsrcu_struct(&frsrcu_ctld);
+	frsrcu_ctlp = &frsrcu_ctl; /* In case of a later rcutorture run. */
+}
+
+/* As above, but dynamically allocated. */
+static struct rcu_torture_ops frsrcud_ops = {
+	.ttype		= FRSRCU_FLAVOR,
+	.init		= frsrcu_torture_init,
+	.cleanup	= frsrcu_torture_cleanup,
+	.readlock	= frsrcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= frsrcu_torture_read_unlock,
+	.started	= NULL,
+	.completed	= frsrcu_torture_completed,
+	.sync		= frsrcu_torture_synchronize,
+	.exp_sync	= frsrcu_torture_synchronize_expedited,
+	.stats		= frsrcu_torture_stats,
+	.name		= "frsrcud"
+};
+
+/*
  * Definitions for rcu torture testing.
  */
@@ -810,7 +907,7 @@ rcu_torture_cbflood(void *arg)
 	int err = 1;
 	int i;
 	int j;
-	struct rcu_head *rhp;
+	struct rcu_head *rhp = NULL;

 	if (cbflood_n_per_burst > 0 &&
 	    cbflood_inter_holdoff > 0 &&
@@ -823,9 +920,7 @@ rcu_torture_cbflood(void *arg)
 	}
 	if (err) {
 		VERBOSE_TOROUT_STRING("rcu_torture_cbflood disabled: Bad args or OOM");
-		while (!torture_must_stop())
-			schedule_timeout_interruptible(HZ);
-		return 0;
+		goto wait_for_stop;
 	}
 	VERBOSE_TOROUT_STRING("rcu_torture_cbflood task started");
 	do {
@@ -844,6 +939,7 @@ rcu_torture_cbflood(void *arg)
 		stutter_wait("rcu_torture_cbflood");
 	} while (!torture_must_stop());
 	vfree(rhp);
+wait_for_stop:
 	torture_kthread_stopping("rcu_torture_cbflood");
 	return 0;
 }
@@ -1709,6 +1805,7 @@ rcu_torture_init(void)
 	int cpu;
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
+		&frsrcu_ops, &frsrcud_ops,
 		&rcu_ops, &rcu_bh_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
 		&sched_ops, RCUTORTURE_TASKS_OPS
 	};
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b908048f8d6a..fb39380eb1d5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1233,6 +1233,7 @@ config RCU_TORTURE_TEST
 	depends on DEBUG_KERNEL
 	select TORTURE_TEST
 	select SRCU
+	select FRSRCU
 	select TASKS_RCU
 	default n
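For readers skimming the patch, the core trick (per-CPU counter pairs indexed by the low bit of `->completed`, with a grace period that flips the parity and drains the old counters) can be sketched in plain userspace C. This is a hypothetical single-threaded illustration only: a fixed array stands in for per-CPU data, and the real code additionally needs the synchronize_sched() barriers discussed above to be correct under concurrency.

```c
#define NCPU 4				/* stand-in for per-CPU storage */

struct frs {
	int completed;			/* low bit selects the active counter */
	int c[NCPU][2];
};

/* Reader: no memory barrier, just pick the active index and count in. */
static int frs_read_lock(struct frs *s, int cpu)
{
	int idx = s->completed & 0x1;

	s->c[cpu][idx]++;
	return idx;
}

static void frs_read_unlock(struct frs *s, int cpu, int idx)
{
	s->c[cpu][idx]--;
}

/* Sum one rank of counters across all "CPUs". */
static int readers_active_idx(struct frs *s, int idx)
{
	int sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += s->c[cpu][idx];
	return sum;
}

/* Writer: flip the parity, then wait for old-index readers to drain.
 * The real implementation brackets this with synchronize_sched() calls,
 * which is exactly where the large grace-period latency comes from. */
static void frs_synchronize(struct frs *s)
{
	int idx = s->completed & 0x1;

	s->completed++;
	while (readers_active_idx(s, idx))
		;			/* the kernel version sleeps here */
}
```

New readers arriving after the flip land on the other counter rank, so the writer only has to wait out readers that started before the flip.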
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-23  0:26               ` Paul E. McKenney
@ 2015-06-24 16:50                 ` Dave Hansen
  2015-06-24 17:29                   ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2015-06-24 16:50 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On 06/22/2015 05:26 PM, Paul E. McKenney wrote:
> OK, here is an experimental patch that provides a fast-readers variant
> of RCU, forward-ported from v3.3.  Because we didn't have call_srcu()
> and srcu_barrier() back then, it is not a drop-in replacement for SRCU,
> so you need to adapt the code to the API, which means putting an "fr"
> in front of the "srcu" in the API members.
>
> Understood on the overhead of the memory-barrier instruction showing
> up consistently.  My point was instead that getting rid of this
> memory-barrier instruction does not come for free, as it greatly
> increases the latency of synchronize_frsrcu().  In a real workload,
> it is entirely possible that the savings from eliminating the memory
> barrier are overwhelmed by the increased grace-period latency.
>
> Anyway, the patch is below.  Very lightly tested.

This does give a very similar performance boost to the other
optimization I posted.  I measured this patch to boost the writes/second
by 11.0% while my previous optimization did 10.8%.

I don't think this workload will see any of the overhead of the
synchronize_frsrcu(), but this helps confirm the source of the
overhead.
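The shape of the 1-byte write test quoted above is roughly the following (a hypothetical sketch only; the real tests live at the URLs Dave posted and run inside the will-it-scale harness, which counts iterations over a fixed wall-clock interval rather than running a fixed loop):

```c
#include <stdio.h>
#include <unistd.h>

/* Repeatedly write one byte at offset 0 of an already-open file and
 * return how many writes succeeded.  The tiny buffer keeps the
 * benchmark measuring kernel entry/exit and per-write overhead (such
 * as fsnotify()) rather than data copying. */
static unsigned long write_one_byte_loop(int fd, unsigned long iters)
{
	unsigned long done = 0;

	for (unsigned long i = 0; i < iters; i++)
		if (pwrite(fd, "x", 1, 0) == 1)
			done++;
	return done;
}
```

Run against a file on tmpfs, nearly all of the per-iteration cost is in the write() path itself, which is why a change in srcu_read_lock() shows up so clearly in the numbers.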
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-24 16:50                 ` Dave Hansen
@ 2015-06-24 17:29                   ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-06-24 17:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Wed, Jun 24, 2015 at 09:50:50AM -0700, Dave Hansen wrote:
> On 06/22/2015 05:26 PM, Paul E. McKenney wrote:
> > OK, here is an experimental patch that provides a fast-readers variant
> > of RCU, forward-ported from v3.3.  Because we didn't have call_srcu()
> > and srcu_barrier() back then, it is not a drop-in replacement for SRCU,
> > so you need to adapt the code to the API, which means putting an "fr"
> > in front of the "srcu" in the API members.
> >
> > Understood on the overhead of the memory-barrier instruction showing
> > up consistently.  My point was instead that getting rid of this
> > memory-barrier instruction does not come for free, as it greatly
> > increases the latency of synchronize_frsrcu().  In a real workload,
> > it is entirely possible that the savings from eliminating the memory
> > barrier are overwhelmed by the increased grace-period latency.
> >
> > Anyway, the patch is below.  Very lightly tested.
>
> This does give a very similar performance boost as the other
> optimization I posted.  I measured this patch to boost the writes/second
> by 11.0% while my previous optimization did 10.8%.
>
> I don't think this workload will see any of the overhead of the
> synchronize_frsrcu(), though, but this helps confirm the source of the
> overhead.

Thank you for testing it!  I agree that your patch is much simpler than
adding another flavor of RCU, so I prefer your patch, but the
confirmation is nevertheless valuable.

							Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-22 15:11             ` Paul E. McKenney
  2015-06-22 15:20               ` Peter Zijlstra
  2015-06-22 18:50               ` Dave Hansen
@ 2015-06-22 18:52               ` Peter Zijlstra
  2015-06-23  0:29                 ` Paul E. McKenney
  2 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-06-22 18:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote:
> That depends on how slow the resulting slow global state would be.
> We have some use cases (definitely KVM, perhaps also some of the VFS
> code) that need the current speed, as opposed to the profound slowness
> that three trips through synchronize_sched() would provide.

So what we have with that percpu-rwsem code that I send out earlier
today is a conditional smp_mb(), and I think we can do the same for
SRCU.

I'm just not sure !GP is common enough for all SRCU cases to be worth
doing.

Those that rely on sync_srcu() and who do it rarely would definitely
benefit.  The same with those that rarely do call_srcu().

But those that heavily use call_srcu() would be better off with the
prolonged GP with 3 sync_sched() calls in.
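The "conditional smp_mb()" shape Peter describes might look roughly like the following. This is purely an illustrative userspace sketch, not his percpu-rwsem code: the reader pays for a full barrier only while a writer has announced a grace period, and the genuinely hard part (making the flag transition itself visible without reader-side barriers, which is what the synchronize_sched() trips buy you) is waved away here.

```c
#include <stdatomic.h>

/* Illustration only: plain increments on the fast path, a full fence
 * only while gp_in_progress is set.  A real implementation must also
 * guarantee that readers observe the flag flip in time, which is where
 * the expensive sync_sched() machinery comes in. */
static atomic_int gp_in_progress;
static atomic_int nreaders;		/* stand-in for per-CPU counters */

static void cond_read_lock(void)
{
	atomic_fetch_add_explicit(&nreaders, 1, memory_order_relaxed);
	if (atomic_load_explicit(&gp_in_progress, memory_order_relaxed))
		atomic_thread_fence(memory_order_seq_cst);	/* slow path */
}

static void cond_read_unlock(void)
{
	if (atomic_load_explicit(&gp_in_progress, memory_order_relaxed))
		atomic_thread_fence(memory_order_seq_cst);	/* slow path */
	atomic_fetch_sub_explicit(&nreaders, 1, memory_order_relaxed);
}
```

The tradeoff Peter is weighing is visible in the structure: the common !GP case costs only a relaxed increment, while the grace-period machinery becomes correspondingly more expensive on the writer side.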
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-22 18:52               ` Peter Zijlstra
@ 2015-06-23  0:29                 ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-06-23  0:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Mon, Jun 22, 2015 at 08:52:29PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote:
> > That depends on how slow the resulting slow global state would be.
> > We have some use cases (definitely KVM, perhaps also some of the VFS
> > code) that need the current speed, as opposed to the profound slowness
> > that three trips through synchronize_sched() would provide.
>
> So what we have with that percpu-rwsem code that I send out earlier
> today is a conditional smp_mb(), and I think we can do the same for
> SRCU.
>
> I'm just not sure !GP is common enough for all SRCU cases to be worth
> doing.

Especially given that we don't want the readers to have to acquire a
lock in order to get a consistent view of whether or not a grace period
is in progress.

> Those that rely on sync_srcu() and who do it rarely would definitely
> benefit.  The same with those that rarely do call_srcu().
>
> But those that heavily use call_srcu() would be better off with the
> prolonged GP with 3 sync_sched() calls in.

Those are indeed two likely possibilities.  Other possibilities include
cases where synchronize_srcu() is invoked rarely, but where its latency
is visible to userspace, and those where there really is a need to wait
synchronously for a grace period, so that call_srcu() doesn't buy
you anything.

							Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files

From: Jan Kara @ 2015-06-23 15:17 UTC
To: Dave Hansen
Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen,
    ak, linux-kernel

On Fri 19-06-15 14:50:25, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I have a _tiny_ microbenchmark that sits in a loop and writes
> single bytes to a file.  Writing one byte to a tmpfs file is
> around 2x slower than reading one byte from a file, which is a
> _bit_ more than I expected.  This is a dumb benchmark, but I think
> it's hard to deny that write() is a hot path and we should avoid
> unnecessary overhead there.
>
> I did a 'perf record' of 30-second samples of read and write.
> The top item in a diffprofile is srcu_read_lock() from
> fsnotify().  There are active inotify fds from systemd, but
> nothing is actually listening to the file or that part of
> the filesystem.
>
> I *think* we can avoid taking the srcu_read_lock() for the
> common case where there are no actual marks on the file
> being modified *or* the vfsmount.
>
> The *_fsnotify_mask is an aggregate of the masks from each
> mark.  If we have nothing set in the masks at all then there
> are no marks and no need to do anything with 'ignored masks'
> since none exist.  This keeps us from having to do the costly
> srcu_read_lock() for a check which is very cheap.
>
> This patch gave a 10.8% speedup in writes/second on my test.
...
> diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c
> --- a/fs/notify/fsnotify.c~optimize-fsnotify	2015-06-19 13:29:53.117283581 -0700
> +++ b/fs/notify/fsnotify.c	2015-06-19 13:29:53.123283853 -0700
> @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3
>  	    !(test_mask & to_tell->i_fsnotify_mask) &&
>  	    !(mnt && test_mask & mnt->mnt_fsnotify_mask))
>  		return 0;
> +	/*
> +	 * Optimization: The *_fsnotify_mask is an aggregate of each of the
> +	 * masks from each mark.  If we have nothing set in the masks at
> +	 * all then there are no marks and no need to do anything with
> +	 * 'ignored masks' since none exist.  This keeps us from having to
> +	 * do the costly srcu_read_lock() for a check which is very cheap.
> +	 */
> +	if (!to_tell->i_fsnotify_mask &&
> +	    (!mnt || !mnt->mnt_fsnotify_mask))
> +		return 0;

But this changes userspace-visible behavior.  You can have an ignored
mask set without any of the notification masks set, and you are expected
to clear the ignored mask on the first IN_MODIFY event.  So the test
just above your check is already dealing with all the cases we can
easily detect.

That being said, we could further refine things by storing a flag in the
inode & struct mount telling whether any of the attached marks have an
ignored_mask that needs clearing, and only doing the traversal of the
list of marks if any of the ignored_masks is set.  That would avoid the
traversal not only when nobody is watching but also in cases where
nobody is watching for the IN_MODIFY event.

Finally, we could do something even without the flag.  We can have a
look at to_tell->i_fsnotify_marks.first and mnt->mnt_fsnotify_marks.first
even without SRCU to check whether they are both NULL.  If they are, we
know we can safely skip the SRCU thing.  If they are != NULL, we grab
the SRCU read lock and refetch the pointers.  This should be correct as
we don't dereference any pointer we fetched outside of the SRCU critical
section.  Am I right?
And this optimizes the case when nobody is watching without unwanted
side-effects.

								Honza

> 	idx = srcu_read_lock(&fsnotify_mark_srcu);

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
end of thread, other threads:[~2015-06-24 17:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-19 21:50 [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files Dave Hansen
2015-06-19 23:33 ` Andi Kleen
2015-06-20  0:29 ` Paul E. McKenney
2015-06-20  0:39 ` Dave Hansen
2015-06-20  2:21 ` Paul E. McKenney
2015-06-20 18:02 ` Dave Hansen
2015-06-21  1:30 ` Paul E. McKenney
2015-06-22 13:28 ` Peter Zijlstra
2015-06-22 15:11 ` Paul E. McKenney
2015-06-22 15:20 ` Peter Zijlstra
2015-06-22 16:29 ` Paul E. McKenney
2015-06-22 19:03 ` Peter Zijlstra
2015-06-23  0:31 ` Paul E. McKenney
2015-06-22 18:50 ` Dave Hansen
2015-06-23  0:26 ` Paul E. McKenney
2015-06-24 16:50 ` Dave Hansen
2015-06-24 17:29 ` Paul E. McKenney
2015-06-22 18:52 ` Peter Zijlstra
2015-06-23  0:29 ` Paul E. McKenney
2015-06-23 15:17 ` Jan Kara