* [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files @ 2015-06-19 21:50 Dave Hansen 2015-06-19 23:33 ` Andi Kleen 2015-06-23 15:17 ` Jan Kara 0 siblings, 2 replies; 20+ messages in thread From: Dave Hansen @ 2015-06-19 21:50 UTC (permalink / raw) To: dave Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, ak, linux-kernel From: Dave Hansen <dave.hansen@linux.intel.com> I have a _tiny_ microbenchmark that sits in a loop and writes single bytes to a file. Writing one byte to a tmpfs file is around 2x slower than reading one byte from a file, which is a _bit_ more than I expected. This is a dumb benchmark, but I think it's hard to deny that write() is a hot path and we should avoid unnecessary overhead there. I did a 'perf record' of 30-second samples of read and write. The top item in a diffprofile is srcu_read_lock() from fsnotify(). There are active inotify fd's from systemd, but nothing is actually listening to the file or that part of the filesystem. I *think* we can avoid taking the srcu_read_lock() for the common case where there are no actual marks on the file being modified *or* the vfsmount. The *_fsnotify_mask is an aggregate of each of the masks from each mark. If we have nothing set in the masks at all then there are no marks and no need to do anything with 'ignored masks' since none exist. This keeps us from having to do the costly srcu_read_lock() for a check which is very cheap. This patch gave a 10.8% speedup in writes/second on my test. 
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric Paris <eparis@redhat.com> Cc: John McCutchan <john@johnmccutchan.com> Cc: Robert Love <rlove@rlove.org> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: linux-kernel@vger.kernel.org --- b/fs/notify/fsnotify.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 !(test_mask & to_tell->i_fsnotify_mask) && !(mnt && test_mask & mnt->mnt_fsnotify_mask)) return 0; + /* + * Optimization: The *_fsnotify_mask is an aggregate of each of the + * masks from each mark. If we have nothing set in the masks at + * all then there are no marks and no need to do anything with + * 'ignored masks' since none exist. This keeps us from having to + * do the costly srcu_read_lock() for a check which is very cheap. + */ + if (!to_tell->i_fsnotify_mask && + (!mnt || !mnt->mnt_fsnotify_mask)) + return 0; idx = srcu_read_lock(&fsnotify_mark_srcu); _ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 21:50 [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files Dave Hansen @ 2015-06-19 23:33 ` Andi Kleen 2015-06-20 0:29 ` Paul E. McKenney 2015-06-20 0:39 ` Dave Hansen 2015-06-23 15:17 ` Jan Kara 1 sibling, 2 replies; 20+ messages in thread From: Andi Kleen @ 2015-06-19 23:33 UTC (permalink / raw) To: Dave Hansen Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel, paulmck On Fri, Jun 19, 2015 at 02:50:25PM -0700, Dave Hansen wrote: > > From: Dave Hansen <dave.hansen@linux.intel.com> > > I have a _tiny_ microbenchmark that sits in a loop and writes > single bytes to a file. Writing one byte to a tmpfs file is > around 2x slower than reading one byte from a file, which is a > _bit_ more than I expecte. This is a dumb benchmark, but I think > it's hard to deny that write() is a hot path and we should avoid > unnecessary overhead there. > > I did a 'perf record' of 30-second samples of read and write. > The top item in a diffprofile is srcu_read_lock() from > fsnotify(). There are active inotify fd's from systemd, but > nothing is actually listening to the file or its part of > the filesystem. > > I *think* we can avoid taking the srcu_read_lock() for the > common case where there are no actual marks on the file > being modified *or* the vfsmount. What is so expensive in it? Just the memory barrier in it? Perhaps the function can be tuned in general. -Andi int __srcu_read_lock(struct srcu_struct *sp) { int idx; idx = ACCESS_ONCE(sp->completed) & 0x1; preempt_disable(); __this_cpu_inc(sp->per_cpu_ref->c[idx]); smp_mb(); /* B */ /* Avoid leaking the critical section. */ __this_cpu_inc(sp->per_cpu_ref->seq[idx]); preempt_enable(); return idx; } > > The *_fsnotify_mask is an aggregate of each of the masks from > each mark. 
If we have nothing set in the masks at all then there > are no marks and no need to do anything with 'ignored masks' > since none exist. This keeps us from having to do the costly > srcu_read_lock() for a check which is very cheap. > > This patch gave a 10.8% speedup in writes/second on my test. > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: Jan Kara <jack@suse.cz> > Cc: Al Viro <viro@zeniv.linux.org.uk> > Cc: Eric Paris <eparis@redhat.com> > Cc: John McCutchan <john@johnmccutchan.com> > Cc: Robert Love <rlove@rlove.org> > Cc: Tim Chen <tim.c.chen@linux.intel.com> > Cc: Andi Kleen <ak@linux.intel.com> > Cc: linux-kernel@vger.kernel.org > --- > > b/fs/notify/fsnotify.c | 10 ++++++++++ > 1 file changed, 10 insertions(+) > > diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c > --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 > +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 > @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 > !(test_mask & to_tell->i_fsnotify_mask) && > !(mnt && test_mask & mnt->mnt_fsnotify_mask)) > return 0; > + /* > + * Optimization: The *_fsnotify_mask is an aggregate of each of the > + * masks from each mark. If we have nothing set in the masks at > + * all then there are no marks and no need to do anything with > + * 'ignored masks' since none exist. This keeps us from having to > + * do the costly srcu_read_lock() for a check which is very cheap. > + */ > + if (!to_tell->i_fsnotify_mask && > + (!mnt || !mnt->mnt_fsnotify_mask)) > + return 0; > > idx = srcu_read_lock(&fsnotify_mark_srcu); > > _ -- ak@linux.intel.com -- Speaking for myself only -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 23:33 ` Andi Kleen @ 2015-06-20 0:29 ` Paul E. McKenney 2015-06-20 0:39 ` Dave Hansen 1 sibling, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-20 0:29 UTC (permalink / raw) To: Andi Kleen Cc: Dave Hansen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Fri, Jun 19, 2015 at 04:33:06PM -0700, Andi Kleen wrote: > On Fri, Jun 19, 2015 at 02:50:25PM -0700, Dave Hansen wrote: > > > > From: Dave Hansen <dave.hansen@linux.intel.com> > > > > I have a _tiny_ microbenchmark that sits in a loop and writes > > single bytes to a file. Writing one byte to a tmpfs file is > > around 2x slower than reading one byte from a file, which is a > > _bit_ more than I expecte. This is a dumb benchmark, but I think > > it's hard to deny that write() is a hot path and we should avoid > > unnecessary overhead there. > > > > I did a 'perf record' of 30-second samples of read and write. > > The top item in a diffprofile is srcu_read_lock() from > > fsnotify(). There are active inotify fd's from systemd, but > > nothing is actually listening to the file or its part of > > the filesystem. > > > > I *think* we can avoid taking the srcu_read_lock() for the > > common case where there are no actual marks on the file > > being modified *or* the vfsmount. > > What is so expensive in it? Just the memory barrier in it? > > Perhaps the function can be tuned in general. The memory barrier we are pretty much stuck with unless we want synchronize_srcu() to be quite a bit more expensive (and for SRCU to be unusable from offline and idle) -- and that synchronize_srcu() expense drove rewrite from the earlier version to this one. It is possible to cut down from two to one instances of __this_cpu_inc(), however. 
It of course would be possible to have two types of SRCU, one for fast grace periods and the other for memory-barrier-free read-side critical sections, but obviously a very clear case would need to be made for this. At least judging from the reactions the last time I introduced a new flavor of RCU. ;-) So, echoing Andi, what exactly is expensive? Thanx, Paul > -Andi > > int __srcu_read_lock(struct srcu_struct *sp) > { > int idx; > > idx = ACCESS_ONCE(sp->completed) & 0x1; > preempt_disable(); > __this_cpu_inc(sp->per_cpu_ref->c[idx]); > smp_mb(); /* B */ /* Avoid leaking the critical section. */ > __this_cpu_inc(sp->per_cpu_ref->seq[idx]); > preempt_enable(); > return idx; > } > > > > > > The *_fsnotify_mask is an aggregate of each of the masks from > > each mark. If we have nothing set in the masks at all then there > > are no marks and no need to do anything with 'ignored masks' > > since none exist. This keeps us from having to do the costly > > srcu_read_lock() for a check which is very cheap. > > > > This patch gave a 10.8% speedup in writes/second on my test. 
> > > > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> > > Cc: Andrew Morton <akpm@linux-foundation.org> > > Cc: Jan Kara <jack@suse.cz> > > Cc: Al Viro <viro@zeniv.linux.org.uk> > > Cc: Eric Paris <eparis@redhat.com> > > Cc: John McCutchan <john@johnmccutchan.com> > > Cc: Robert Love <rlove@rlove.org> > > Cc: Tim Chen <tim.c.chen@linux.intel.com> > > Cc: Andi Kleen <ak@linux.intel.com> > > Cc: linux-kernel@vger.kernel.org > > --- > > > > b/fs/notify/fsnotify.c | 10 ++++++++++ > > 1 file changed, 10 insertions(+) > > > > diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c > > --- a/fs/notify/fsnotify.c~optimize-fsnotify 2015-06-19 13:29:53.117283581 -0700 > > +++ b/fs/notify/fsnotify.c 2015-06-19 13:29:53.123283853 -0700 > > @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3 > > !(test_mask & to_tell->i_fsnotify_mask) && > > !(mnt && test_mask & mnt->mnt_fsnotify_mask)) > > return 0; > > + /* > > + * Optimization: The *_fsnotify_mask is an aggregate of each of the > > + * masks from each mark. If we have nothing set in the masks at > > + * all then there are no marks and no need to do anything with > > + * 'ignored masks' since none exist. This keeps us from having to > > + * do the costly srcu_read_lock() for a check which is very cheap. > > + */ > > + if (!to_tell->i_fsnotify_mask && > > + (!mnt || !mnt->mnt_fsnotify_mask)) > > + return 0; > > > > idx = srcu_read_lock(&fsnotify_mark_srcu); > > > > _ > > -- > ak@linux.intel.com -- Speaking for myself only > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-19 23:33 ` Andi Kleen 2015-06-20 0:29 ` Paul E. McKenney @ 2015-06-20 0:39 ` Dave Hansen 2015-06-20 2:21 ` Paul E. McKenney 1 sibling, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-20 0:39 UTC (permalink / raw) To: Andi Kleen Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel, paulmck On 06/19/2015 04:33 PM, Andi Kleen wrote: >> > I *think* we can avoid taking the srcu_read_lock() for the >> > common case where there are no actual marks on the file >> > being modified *or* the vfsmount. > What is so expensive in it? Just the memory barrier in it? The profiling doesn't hit on the mfence directly, but I assume that the overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical before and after the barrier, but it appears much more expensive _after_. That makes no sense unless the barrier is the thing causing it. Here's how the annotation mode of 'perf top' breaks it down: > │ ffffffff810fb480 <load0>: > │ nop > │ mov (%rdi),%rax > 0.58 │ push %rbp > │ incl %gs:0x7ef0f488(%rip) > 1.73 │ mov %rsp,%rbp > │ and $0x1,%eax > │ movslq %eax,%rdx > 0.58 │ mov 0x8(%rdi),%rcx > │ incq %gs:(%rcx,%rdx,8) > │ mfence > 69.94 │ add $0x2,%rdx > 7.51 │ mov 0x8(%rdi),%rcx > 4.05 │ incq %gs:(%rcx,%rdx,8) > 13.87 │ decl %gs:0x7ef0f45f(%rip) > │ pop %rbp > 1.73 │ ← retq > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 0:39 ` Dave Hansen @ 2015-06-20 2:21 ` Paul E. McKenney 2015-06-20 18:02 ` Dave Hansen 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-20 2:21 UTC (permalink / raw) To: Dave Hansen Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Fri, Jun 19, 2015 at 05:39:11PM -0700, Dave Hansen wrote: > On 06/19/2015 04:33 PM, Andi Kleen wrote: > >> > I *think* we can avoid taking the srcu_read_lock() for the > >> > common case where there are no actual marks on the file > >> > being modified *or* the vfsmount. > > What is so expensive in it? Just the memory barrier in it? > > The profiling doesn't hit on the mfence directly, but I assume that the > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical > before and after the barrier, but it appears much more expensive > _after_. That makes no sense unless the barrier is the thing causing it. OK, one thing to try is to simply delete the memory barrier. The resulting code will be unsafe, but will probably run well enough to get benchmark results. If it is the memory barrier, you should of course get increased throughput. Thanx, Paul > Here's how the annotation mode of 'perf top' breaks it down: > > > │ ffffffff810fb480 <load0>: > > │ nop > > │ mov (%rdi),%rax > > 0.58 │ push %rbp > > │ incl %gs:0x7ef0f488(%rip) > > 1.73 │ mov %rsp,%rbp > > │ and $0x1,%eax > > │ movslq %eax,%rdx > > 0.58 │ mov 0x8(%rdi),%rcx > > │ incq %gs:(%rcx,%rdx,8) > > │ mfence > > 69.94 │ add $0x2,%rdx > > 7.51 │ mov 0x8(%rdi),%rcx > > 4.05 │ incq %gs:(%rcx,%rdx,8) > > 13.87 │ decl %gs:0x7ef0f45f(%rip) > > │ pop %rbp > > 1.73 │ ← retq > > > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 2:21 ` Paul E. McKenney @ 2015-06-20 18:02 ` Dave Hansen 2015-06-21 1:30 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-20 18:02 UTC (permalink / raw) To: paulmck Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On 06/19/2015 07:21 PM, Paul E. McKenney wrote: >>> > > What is so expensive in it? Just the memory barrier in it? >> > >> > The profiling doesn't hit on the mfence directly, but I assume that the >> > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical >> > before and after the barrier, but it appears much more expensive >> > _after_. That makes no sense unless the barrier is the thing causing it. > OK, one thing to try is to simply delete the memory barrier. The > resulting code will be unsafe, but will probably run well enough to > get benchmark results. If it is the memory barrier, you should of > course get increased throughput. So I took the smp_mb() out of __srcu_read_lock(). The benchmark didn't improve at all. Looking at the profile, all of the overhead had just shifted to __srcu_read_unlock() and its memory barrier! Removing the barrier in __srcu_read_unlock() got essentially the same gains out of the benchmark as the original patch in this thread that just avoids RCU. I think that's fairly conclusive that the source of the overhead is, indeed, the memory barriers. Although I said this test was single threaded, I also had another thought. The benchmark is single-threaded, but 'perf' is sitting doing profiling and who knows what else on the other core, and the profiling NMIs are certainly writing plenty of data to memory. So, there might be plenty of work for that smp_mb()/mfence to do _despite_ the benchmark itself being single threaded. 
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-20 18:02 ` Dave Hansen @ 2015-06-21 1:30 ` Paul E. McKenney 2015-06-22 13:28 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-21 1:30 UTC (permalink / raw) To: Dave Hansen Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Sat, Jun 20, 2015 at 11:02:08AM -0700, Dave Hansen wrote: > On 06/19/2015 07:21 PM, Paul E. McKenney wrote: > >>> > > What is so expensive in it? Just the memory barrier in it? > >> > > >> > The profiling doesn't hit on the mfence directly, but I assume that the > >> > overhead is coming from there. The "mov 0x8(%rdi),%rcx" is identical > >> > before and after the barrier, but it appears much more expensive > >> > _after_. That makes no sense unless the barrier is the thing causing it. > > OK, one thing to try is to simply delete the memory barrier. The > > resulting code will be unsafe, but will probably run well enough to > > get benchmark results. If it is the memory barrier, you should of > > course get increased throughput. > > So I took the smp_mb() out of __srcu_read_lock(). The benchmark didn't > improve at all. Looking at the profile, all of the overhead had just > shifted to __srcu_read_unlock() and its memory barrier! Removing the > barrier in __srcu_read_unlock() got essentially the same gains out of > the benchmark as the original patch in this thread that just avoids RCU. > > I think that's fairly conclusive that the source of the overhead is, > indeed, the memory barriers. > > Although I said this test was single threaded, I also had another > thought. The benchmark is single-threaded, but 'perf' is sitting doing > profiling and who knows what else on the other core, and the profiling > NMIs are certainly writing plenty of data to memory. So, there might be > plenty of work for that smp_mb()/mfence to do _despite_ the benchmark > itself being single threaded. 
Well, it is not hard to have an SRCU-like thing that doesn't have read-side memory barriers, given that older versions of SRCU didn't have them. However, the price is increased latency for the analog to synchronize_srcu(). I am guessing that this would not be a problem for notification-group destruction, which is presumably rare. That said, if empty *_fsnotify_mask is the common case or if the overhead of processing notification overwhelms srcu_read_lock(), your original patch seems a bit simpler. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-21 1:30 ` Paul E. McKenney @ 2015-06-22 13:28 ` Peter Zijlstra 2015-06-22 15:11 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 13:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Sat, Jun 20, 2015 at 06:30:58PM -0700, Paul E. McKenney wrote: > Well, it is not hard to have an SRCU-like thing that doesn't have > read-side memory barriers, given that older versions of SRCU didn't > have them. However, the price is increased latency for the analog to > synchronize_srcu(). I am guessing that this would not be a problem > for notification-group destruction, which is presumably rare. I don't think it ever makes sense to optimize for a global state. So screw sync_srcu() and make the srcu_read_lock() thing go fast. If you need fast global state you're doing it wrong.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 13:28 ` Peter Zijlstra @ 2015-06-22 15:11 ` Paul E. McKenney 2015-06-22 15:20 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-22 15:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 03:28:21PM +0200, Peter Zijlstra wrote: > On Sat, Jun 20, 2015 at 06:30:58PM -0700, Paul E. McKenney wrote: > > Well, it is not hard to have an SRCU-like thing that doesn't have > > read-side memory barriers, given that older versions of SRCU didn't > > have them. However, the price is increased latency for the analog to > > synchronize_srcu(). I am guessing that this would not be a problem > > for notification-group destruction, which is presumably rare. > > I don't think it ever makes sense to optimize for a global state. So > screw sync_srcu() and make the srcu_read_lock() thing go fast. > > If you need fast global state you're doing it wrong. That depends on how slow the resulting slow global state would be. We have some use cases (definitely KVM, perhaps also some of the VFS code) that need the current speed, as opposed to the profound slowness that three trips through synchronize_sched() would provide. Plus we would lose the ability to have SRCU readers on idle and offline CPUs. But if Dave is willing to test it, I would be happy to send along a fast-readers patch, easy to do. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:11 ` Paul E. McKenney @ 2015-06-22 15:20 ` Peter Zijlstra 2015-06-22 16:29 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 15:20 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote: > That depends on how slow the resulting slow global state would be. > We have some use cases (definitely KVM, perhaps also some of the VFS > code) that need the current speed, as opposed to the profound slowness > that three trips through synchronize_sched() would provide. But we have call_srcu() these days, not everything needs to use sync_srcu() anymore. Although I've not checked recently. > Plus we > would lose the ability to have SRCU readers on idle and offline CPUs. Are we actually doing that? offline CPUs in particular seems iffy, I don't think we need (or should) worry about that. I know its been an issue with regular RCU due to tracing, but I'm not sure we should care for it for SRCU.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:20 ` Peter Zijlstra @ 2015-06-22 16:29 ` Paul E. McKenney 2015-06-22 19:03 ` Peter Zijlstra 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-22 16:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 05:20:13PM +0200, Peter Zijlstra wrote: > On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote: > > That depends on how slow the resulting slow global state would be. > > We have some use cases (definitely KVM, perhaps also some of the VFS > > code) that need the current speed, as opposed to the profound slowness > > that three trips through synchronize_sched() would provide. > > But we have call_srcu() these days, not everything needs to use > sync_srcu() anymore. Although I've not checked recently. I believe that the KVM guys do need synchronize_srcu(), but yes, there probably are at least some cases where people might do well to move from synchronize_srcu() to call_srcu(). That said, the added complexity might or might not be worthwhile in all cases. > > Plus we > > would lose the ability to have SRCU readers on idle and offline CPUs. > > Are we actually doing that? offline CPUs in particular seems iffy, I > don't think we need (or should) worry about that. I know its been an > issue with regular RCU due to tracing, but I'm not sure we should care > for it for SRCU. I believe that there still are some cases. But why would offline CPUs seem so iffy? CPUs coming up execute code before they are fully operational, and during that time, much of the kernel views them as being offline. Yet they do have to execute significant code in order to get themselves set up. 
Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 16:29 ` Paul E. McKenney @ 2015-06-22 19:03 ` Peter Zijlstra 2015-06-23 0:31 ` Paul E. McKenney 0 siblings, 1 reply; 20+ messages in thread From: Peter Zijlstra @ 2015-06-22 19:03 UTC (permalink / raw) To: Paul E. McKenney Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 09:29:49AM -0700, Paul E. McKenney wrote: > I believe that there still are some cases. But why would offline > CPUs seem so iffy? CPUs coming up execute code before they are fully > operational, and during that time, much of the kernel views them as > being offline. Yet they do have to execute significant code in order > to get themselves set up. I'm thinking we do far too much during bringup and tear-down as it is. But yes maybe.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 19:03 ` Peter Zijlstra @ 2015-06-23 0:31 ` Paul E. McKenney 0 siblings, 0 replies; 20+ messages in thread From: Paul E. McKenney @ 2015-06-23 0:31 UTC (permalink / raw) To: Peter Zijlstra Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 09:03:08PM +0200, Peter Zijlstra wrote: > On Mon, Jun 22, 2015 at 09:29:49AM -0700, Paul E. McKenney wrote: > > > I believe that there still are some cases. But why would offline > > CPUs seem so iffy? CPUs coming up execute code before they are fully > > operational, and during that time, much of the kernel views them as > > being offline. Yet they do have to execute significant code in order > > to get themselves set up. > > I'm thinking we do far too much during bringup and tear-down as it is. > But yes maybe. Boot, suspend, and hibernation indeed would be faster if we did less, but we still will have to do something. Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 15:11 ` Paul E. McKenney 2015-06-22 15:20 ` Peter Zijlstra @ 2015-06-22 18:50 ` Dave Hansen 2015-06-23 0:26 ` Paul E. McKenney 2015-06-22 18:52 ` Peter Zijlstra 2 siblings, 1 reply; 20+ messages in thread From: Dave Hansen @ 2015-06-22 18:50 UTC (permalink / raw) To: paulmck, Peter Zijlstra Cc: Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On 06/22/2015 08:11 AM, Paul E. McKenney wrote: > But if Dave is willing to test it, I would be happy to send along > a fast-readers patch, easy to do. I'm always willing to test, but the cost of the srcu_read_lock() barrier shows up even on my 2-year-old "Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz" laptop. The numbers I shared in this thread are on a newer CPU than that, so I'm fairly confident this will show up on just about any (big core) Intel CPU newer than SandyBridge. The tests I've been running are: https://github.com/antonblanchard/will-it-scale with two new 1-byte read/write tests copied in to "tests/": https://www.sr71.net/~dave/intel/read1byte.c https://www.sr71.net/~dave/intel/write1byte.c The one-byte thing is silly but it does help isolate the kernel's overhead from what the app is doing.
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files 2015-06-22 18:50 ` Dave Hansen @ 2015-06-23 0:26 ` Paul E. McKenney 2015-06-24 16:50 ` Dave Hansen 0 siblings, 1 reply; 20+ messages in thread From: Paul E. McKenney @ 2015-06-23 0:26 UTC (permalink / raw) To: Dave Hansen Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen, linux-kernel On Mon, Jun 22, 2015 at 11:50:50AM -0700, Dave Hansen wrote: > On 06/22/2015 08:11 AM, Paul E. McKenney wrote: > > But if Dave is willing to test it, I would be happy to send along > > a fast-readers patch, easy to do. > > I'm always willing to test, but the cost of the srcu_read_lock() barrier > shows up even on my 2-year-old "Intel(R) Core(TM) i5-3320M CPU @ > 2.60GHz" laptop. The numbers I shared in this thread are on a newer CPU > than that, so I'm fairly confident this will show up on just about any > (big core) Intel CPU newer than SandyBridge. > > The tests I've been running are: > > https://github.com/antonblanchard/will-it-scale > > with two new 1-byte read/write tests copied in to "tests/": > > https://www.sr71.net/~dave/intel/read1byte.c > https://www.sr71.net/~dave/intel/write1byte.c > > The one-byte thing is silly but it does help isolate the kernel's > overhead from what the app is doing. OK, here is an experimental patch that provides a fast-readers variant of RCU, forward-ported from v3.3. Because we didn't have call_srcu() and srcu_barrier() back then, it is not a drop-in replacement for SRCU, so you need to adapt the code to the API, which means putting an "fr" in front of the "srcu" in the API members. Understood on the overhead of the memory-barrier instruction showing up consistently. My point was instead that getting rid of this memory-barrier instruction does not come for free, as it greatly increases the latency of synchronize_frsrcu(). 
In a real workload, it is entirely possible that the savings from eliminating the memory barrier are overwhelmed by the increased grace-period latency. Anyway, the patch is below. Very lightly tested. Thanx, Paul ------------------------------------------------------------------------ commit d28d4e08ee2f87e6746ac77b07d5c94a65bcbd95 Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Date: Mon Jun 22 14:51:18 2015 -0700 rcu: Add a fast-readers variant of SRCU Dave Hansen's testing of single-byte I/O to tmpfs showed significant system-level overhead from the memory barriers in srcu_read_lock() and srcu_read_unlock(). This experimental not-for-merging commit therefore re-introduces the old variant of SRCU that avoided read-side memory barriers. This of course also introduces that variant's slow grace periods. There is no analog to call_rcu(), nor is there an analog to rcu_barrier(), so this is not a drop-in replacement for SRCU. Unlike SRCU, this implementation cannot be invoked from idle or offline CPUs. This variant therefore gets its own API: frsrcu_read_lock(), frsrcu_read_unlock(), and so on. Reported-by: Dave Hansen <dave@sr71.net> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> diff --git a/include/linux/frsrcu.h b/include/linux/frsrcu.h new file mode 100644 index 000000000000..5ad60d2d5bca --- /dev/null +++ b/include/linux/frsrcu.h @@ -0,0 +1,250 @@ +/* + * Fast-Reader Sleepable Read-Copy Update mechanism for mutual exclusion + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2015
+ *
+ * Author: Paul McKenney <paulmck@linux.vnet.ibm.com>
+ */
+
+#ifndef _LINUX_FRSRCU_H
+#define _LINUX_FRSRCU_H
+
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+
+struct frsrcu_struct_array {
+	int c[2];
+};
+
+struct frsrcu_struct {
+	int completed;
+	struct frsrcu_struct_array __percpu *per_cpu_ref;
+	struct mutex mutex;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map dep_map;
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+};
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+int __init_frsrcu_struct(struct frsrcu_struct *frsp, const char *name,
+			 struct lock_class_key *key);
+
+#define init_frsrcu_struct(frsp) \
+({ \
+	static struct lock_class_key __frsrcu_key; \
+ \
+	__init_frsrcu_struct((frsp), #frsp, &__frsrcu_key); \
+})
+
+#define __FRSRCU_DEP_MAP_INIT(frsrcu_name) .dep_map = { .name = #frsrcu_name },
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+int init_frsrcu_struct(struct frsrcu_struct *frsp);
+
+#define __FRSRCU_DEP_MAP_INIT(frsrcu_name)
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+#define __FRSRCU_STRUCT_INIT(name) \
+	{ \
+		.completed = -300, \
+		.per_cpu_ref = &name##_frsrcu_array, \
+		.mutex = __MUTEX_INITIALIZER(name.mutex), \
+		__FRSRCU_DEP_MAP_INIT(name) \
+	}
+
+/*
+ * define and init a frsrcu struct at build time.
+ * don't call init_frsrcu_struct() nor cleanup_frsrcu_struct() on it.
+ */
+#define __DEFINE_FRSRCU(name, is_static) \
+	static DEFINE_PER_CPU(struct frsrcu_struct_array, name##_frsrcu_array);\
+	is_static struct frsrcu_struct name = __FRSRCU_STRUCT_INIT(name)
+#define DEFINE_FRSRCU(name)		__DEFINE_FRSRCU(name, /* not static */)
+#define DEFINE_STATIC_FRSRCU(name)	__DEFINE_FRSRCU(name, static)
+
+void cleanup_frsrcu_struct(struct frsrcu_struct *frsp);
+int __frsrcu_read_lock(struct frsrcu_struct *frsp) __acquires(frsp);
+void __frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx) __releases(frsp);
+void synchronize_frsrcu(struct frsrcu_struct *frsp);
+void synchronize_frsrcu_expedited(struct frsrcu_struct *frsp);
+long frsrcu_batches_completed(struct frsrcu_struct *frsp);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+/**
+ * frsrcu_read_lock_held - might we be in FRSRCU read-side critical section?
+ *
+ * If CONFIG_DEBUG_LOCK_ALLOC is selected, returns nonzero iff in an FRSRCU
+ * read-side critical section.  In absence of CONFIG_DEBUG_LOCK_ALLOC,
+ * this assumes we are in an FRSRCU read-side critical section unless it can
+ * prove otherwise.
+ *
+ * Checks debug_lockdep_rcu_enabled() to prevent false positives during boot
+ * and while lockdep is disabled.
+ *
+ * Note that if the CPU is in the idle loop from an RCU point of view
+ * (ie: that we are in the section between rcu_idle_enter() and
+ * rcu_idle_exit()) then frsrcu_read_lock_held() returns false even if
+ * the CPU did an frsrcu_read_lock().  The reason for this is that RCU
+ * ignores CPUs that are in such a section, considering these as in
+ * extended quiescent state, so such a CPU is effectively never in an
+ * RCU read-side critical section regardless of what RCU primitives it
+ * invokes.  This state of affairs is required --- we need to keep an
+ * RCU-free window in idle where the CPU may possibly enter into low
+ * power mode.  This way we can notice an extended quiescent state to
+ * other CPUs that started a grace period.  Otherwise we would delay any
+ * grace period as long as we run in the idle task.
+ *
+ * Similarly, we avoid claiming an FRSRCU read lock held if the current
+ * CPU is offline.
+ */
+static inline int frsrcu_read_lock_held(struct frsrcu_struct *frsp)
+{
+	if (!debug_lockdep_rcu_enabled())
+		return 1;
+
+	if (!rcu_is_watching())
+		return 0;
+	if (!rcu_lockdep_current_cpu_online())
+		return 0;
+	return lock_is_held(&frsp->dep_map);
+}
+
+#else /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int frsrcu_read_lock_held(struct frsrcu_struct *frsp)
+{
+	return 1;
+}
+
+#endif /* #else #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+/**
+ * frsrcu_dereference_check - fetch FRSRCU-protected pointer for later deref
+ * @p: the pointer to fetch and protect for later dereferencing
+ * @frsp: pointer to the frsrcu_struct, which is used to check that we
+ *	really are in an FRSRCU read-side critical section.
+ * @c: condition to check for update-side use
+ *
+ * If PROVE_RCU is enabled, invoking this outside of an RCU read-side
+ * critical section will result in an RCU-lockdep splat, unless @c evaluates
+ * to 1.  The @c argument will normally be a logical expression containing
+ * lockdep_is_held() calls.
+ */
+#define frsrcu_dereference_check(p, frsp, c) \
+	__rcu_dereference_check((p), frsrcu_read_lock_held(frsp) || (c), __rcu)
+
+/**
+ * frsrcu_dereference - fetch FRSRCU-protected pointer for later dereferencing
+ * @p: the pointer to fetch and protect for later dereferencing
+ * @frsp: pointer to the frsrcu_struct, which is used to check that we
+ *	really are in an FRSRCU read-side critical section.
+ *
+ * Makes rcu_dereference_check() do the dirty work.  If PROVE_RCU
+ * is enabled, invoking this outside of an RCU read-side critical
+ * section will result in an RCU-lockdep splat.
+ */
+#define frsrcu_dereference(p, frsp) frsrcu_dereference_check((p), (frsp), 0)
+
+/**
+ * frsrcu_read_lock - register a new reader for an FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to register the new reader.
+ *
+ * Enter an FRSRCU read-side critical section.  Note that FRSRCU read-side
+ * critical sections may be nested.  However, it is illegal to
+ * call anything that waits on an FRSRCU grace period for the same
+ * frsrcu_struct, whether directly or indirectly.  Please note that
+ * one way to indirectly wait on an FRSRCU grace period is to acquire
+ * a mutex that is held elsewhere while calling synchronize_frsrcu() or
+ * synchronize_frsrcu_expedited().
+ *
+ * Note that frsrcu_read_lock() and the matching frsrcu_read_unlock() must
+ * occur in the same context, for example, it is illegal to invoke
+ * frsrcu_read_unlock() in an irq handler if the matching frsrcu_read_lock()
+ * was invoked in process context.
+ */
+static inline int frsrcu_read_lock(struct frsrcu_struct *frsp) __acquires(frsp)
+{
+	int retval = __frsrcu_read_lock(frsp);
+
+	rcu_lock_acquire(&(frsp)->dep_map);
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "frsrcu_read_lock() used illegally while idle");
+	return retval;
+}
+
+/**
+ * frsrcu_read_unlock - unregister old reader from FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding frsrcu_read_lock().
+ *
+ * Exit an FRSRCU read-side critical section.
+ */
+static inline void frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx)
+	__releases(frsp)
+{
+	rcu_lockdep_assert(rcu_is_watching(),
+			   "frsrcu_read_unlock() used illegally while idle");
+	rcu_lock_release(&(frsp)->dep_map);
+	__frsrcu_read_unlock(frsp, idx);
+}
+
+/**
+ * frsrcu_read_lock_raw - register new reader for FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to register the new reader.
+ *
+ * Enter an FRSRCU read-side critical section.  Similar to frsrcu_read_lock(),
+ * but avoids the RCU-lockdep checking.  This means that it is legal to
+ * use frsrcu_read_lock_raw() in one context, for example, in an exception
+ * handler, and then have the matching frsrcu_read_unlock_raw() in another
+ * context, for example in the task that took the exception.
+ *
+ * However, the entire FRSRCU read-side critical section must reside within
+ * a single task.  For example, beware of using frsrcu_read_lock_raw() in a
+ * device interrupt handler and frsrcu_read_unlock() in the interrupted task:
+ * This will not work if interrupts are threaded.
+ */
+static inline int frsrcu_read_lock_raw(struct frsrcu_struct *frsp)
+{
+	unsigned long flags;
+	int ret;
+
+	local_irq_save(flags);
+	ret = __frsrcu_read_lock(frsp);
+	local_irq_restore(flags);
+	return ret;
+}
+
+/**
+ * frsrcu_read_unlock_raw - unregister reader from FRSRCU-protected structure
+ * @frsp: frsrcu_struct in which to unregister the old reader.
+ * @idx: return value from corresponding frsrcu_read_lock_raw().
+ *
+ * Exit an FRSRCU read-side critical section without lockdep-RCU checking.
+ * See frsrcu_read_lock_raw() for more details.
+ */
+static inline void frsrcu_read_unlock_raw(struct frsrcu_struct *frsp, int idx)
+{
+	unsigned long flags;
+
+	local_irq_save(flags);
+	__frsrcu_read_unlock(frsp, idx);
+	local_irq_restore(flags);
+}
+
+#endif

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 03a899aabd17..7ff71dbabf86 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -69,6 +69,7 @@ void rcu_unexpedite_gp(void);
 #endif /* #else #ifdef CONFIG_TINY_RCU */

 enum rcutorture_type {
+	FRSRCU_FLAVOR,
 	RCU_FLAVOR,
 	RCU_BH_FLAVOR,
 	RCU_SCHED_FLAVOR,
diff --git a/init/Kconfig b/init/Kconfig
index 4c08197044f1..381d75cafe7d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -517,6 +517,15 @@ config SRCU
 	  permits arbitrary sleeping or blocking within RCU read-side critical
 	  sections.
+config FRSRCU
+	bool
+	help
+	  This option selects the fast-readers sleepable version of
+	  RCU.  This version permits arbitrary sleeping or blocking within
+	  RCU read-side critical sections, but also avoids memory barriers
+	  within read-side markers.  Nothing comes for free, though.
+	  Grace periods are quite a bit more expensive than those of SRCU.
+
 config TASKS_RCU
 	bool
 	default n
diff --git a/kernel/rcu/Makefile b/kernel/rcu/Makefile
index 50a808424b06..d2d7272a69bb 100644
--- a/kernel/rcu/Makefile
+++ b/kernel/rcu/Makefile
@@ -1,4 +1,5 @@
 obj-y += update.o
+obj-$(CONFIG_FRSRCU) += frsrcu.o
 obj-$(CONFIG_SRCU) += srcu.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += tree.o
diff --git a/kernel/rcu/frsrcu.c b/kernel/rcu/frsrcu.c
new file mode 100644
index 000000000000..3bea0a020379
--- /dev/null
+++ b/kernel/rcu/frsrcu.c
@@ -0,0 +1,320 @@
+/*
+ * Fast-Reader Sleepable Read-Copy Update mechanism for mutual exclusion.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, you can access it online at
+ * http://www.gnu.org/licenses/gpl-2.0.html.
+ *
+ * Copyright (C) IBM Corporation, 2006, 2015
+ *
+ * Author: Paul McKenney <paulmck@linux.vnet.ibm.com>
+ */
+
+#include <linux/export.h>
+#include <linux/mutex.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/smp.h>
+#include <linux/delay.h>
+#include <linux/frsrcu.h>
+
+/**
+ * init_frsrcu_struct - initialize a sleep-RCU structure
+ * @frsp: structure to initialize.
+ *
+ * Must invoke this on a given frsrcu_struct before passing that
+ * frsrcu_struct to any other function.  Each frsrcu_struct represents a
+ * separate domain of FRSRCU protection.
+ */
+int init_frsrcu_struct(struct frsrcu_struct *frsp)
+{
+	frsp->completed = 0;
+	mutex_init(&frsp->mutex);
+	frsp->per_cpu_ref = alloc_percpu(struct frsrcu_struct_array);
+	return frsp->per_cpu_ref ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(init_frsrcu_struct);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+
+int __init_frsrcu_struct(struct frsrcu_struct *frsp, const char *name,
+			 struct lock_class_key *key)
+{
+	/* Don't re-initialize a lock while it is held. */
+	debug_check_no_locks_freed((void *)frsp, sizeof(*frsp));
+	lockdep_init_map(&frsp->dep_map, name, key, 0);
+	return init_frsrcu_struct(frsp);
+}
+EXPORT_SYMBOL_GPL(__init_frsrcu_struct);
+
+#endif /* #ifdef CONFIG_DEBUG_LOCK_ALLOC */
+
+/*
+ * frsrcu_readers_active_idx -- returns approximate number of readers
+ * active on the specified rank of per-CPU counters.
+ */
+
+static int frsrcu_readers_active_idx(struct frsrcu_struct *frsp, int idx)
+{
+	int cpu;
+	int sum;
+
+	sum = 0;
+	for_each_possible_cpu(cpu)
+		sum += READ_ONCE(per_cpu_ptr(frsp->per_cpu_ref, cpu)->c[idx]);
+	return sum;
+}
+
+/**
+ * frsrcu_readers_active - returns approximate number of readers.
+ * @frsp: which frsrcu_struct counts active readers (holding frsrcu_read_lock).
+ *
+ * Note that this is not an atomic primitive, and can therefore suffer
+ * severe errors when invoked on an active frsrcu_struct.  That said, it
+ * can be useful as an error check at cleanup time.
+ */
+static int frsrcu_readers_active(struct frsrcu_struct *frsp)
+{
+	return frsrcu_readers_active_idx(frsp, 0) +
+	       frsrcu_readers_active_idx(frsp, 1);
+}
+
+/**
+ * cleanup_frsrcu_struct - deconstruct a sleep-RCU structure
+ * @frsp: structure to clean up.
+ *
+ * Must invoke this after you are finished using a given frsrcu_struct that
+ * was initialized via init_frsrcu_struct(), else you leak memory.
+ */
+void cleanup_frsrcu_struct(struct frsrcu_struct *frsp)
+{
+	int sum;
+
+	sum = frsrcu_readers_active(frsp);
+	WARN_ON(sum);  /* Leakage unless caller handles error. */
+	if (sum != 0)
+		return;
+	free_percpu(frsp->per_cpu_ref);
+	frsp->per_cpu_ref = NULL;
+}
+EXPORT_SYMBOL_GPL(cleanup_frsrcu_struct);
+
+/*
+ * Counts the new reader in the appropriate per-CPU element of the
+ * frsrcu_struct.  Must be called from process context.  Returns an index
+ * that must be passed to the matching frsrcu_read_unlock().
+ */
+int __frsrcu_read_lock(struct frsrcu_struct *frsp)
+{
+	int idx;
+
+	preempt_disable();
+	idx = frsp->completed & 0x1;
+	barrier();  /* ensure compiler looks -once- at frsp->completed. */
+	__this_cpu_inc(frsp->per_cpu_ref->c[idx]);
+	preempt_enable();
+	return idx;
+}
+EXPORT_SYMBOL_GPL(__frsrcu_read_lock);
+
+/*
+ * Removes the count for the old reader from the appropriate per-CPU
+ * element of the frsrcu_struct.  Note that this may well be a different CPU
+ * than that which was incremented by the corresponding frsrcu_read_lock().
+ * Must be called from process context.
+ */
+void __frsrcu_read_unlock(struct frsrcu_struct *frsp, int idx)
+{
+	preempt_disable();
+	__this_cpu_dec(frsp->per_cpu_ref->c[idx]);
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(__frsrcu_read_unlock);
+
+/*
+ * We use an adaptive strategy for synchronize_frsrcu() and especially for
+ * synchronize_frsrcu_expedited().  We spin for a fixed time period (defined
+ * below) to allow FRSRCU readers to exit their read-side critical sections.
+ * If there are still some readers after 10 microseconds, we repeatedly
+ * block for 1-millisecond time periods.  This approach has done well in
+ * testing, so there is no need for a config parameter.
+ */
+#define SYNCHRONIZE_FRSRCU_READER_DELAY 10
+
+/*
+ * Helper function for synchronize_frsrcu() and synchronize_frsrcu_expedited().
+ */
+static void
+__synchronize_frsrcu(struct frsrcu_struct *frsp, void (*sync_func)(void))
+{
+	int idx;
+
+	rcu_lockdep_assert(!lock_is_held(&frsp->dep_map) &&
+			   !lock_is_held(&rcu_bh_lock_map) &&
+			   !lock_is_held(&rcu_lock_map) &&
+			   !lock_is_held(&rcu_sched_lock_map),
+			   "Illegal synchronize_frsrcu() in same-type FRSRCU (or RCU) read-side critical section");
+
+	idx = frsp->completed;
+	mutex_lock(&frsp->mutex);
+
+	/*
+	 * Check to see if someone else did the work for us while we were
+	 * waiting to acquire the lock.  We need -two- advances of
+	 * the counter, not just one.  If there was but one, we might have
+	 * shown up -after- our helper's first synchronize_sched(), thus
+	 * having failed to prevent CPU-reordering races with concurrent
+	 * frsrcu_read_unlock()s on other CPUs (see comment below).  So we
+	 * either (1) wait for two or (2) supply the second ourselves.
+	 */
+
+	if ((frsp->completed - idx) >= 2) {
+		mutex_unlock(&frsp->mutex);
+		return;
+	}
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * The preceding synchronize_sched() ensures that any CPU that
+	 * sees the new value of frsp->completed will also see any preceding
+	 * changes to data structures made by this CPU.  This prevents
+	 * some other CPU from reordering the accesses in its FRSRCU
+	 * read-side critical section to precede the corresponding
+	 * frsrcu_read_lock() -- ensuring that such references will in
+	 * fact be protected.
+	 *
+	 * So it is now safe to do the flip.
+	 */
+
+	idx = frsp->completed & 0x1;
+	frsp->completed++;
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * At this point, because of the preceding synchronize_sched(),
+	 * all frsrcu_read_lock() calls using the old counters have completed.
+	 * Their corresponding critical sections might well be still
+	 * executing, but the frsrcu_read_lock() primitives themselves
+	 * will have finished executing.  We initially give readers
+	 * an arbitrarily chosen 10 microseconds to get out of their
+	 * FRSRCU read-side critical sections, then loop waiting 1/HZ
+	 * seconds per iteration.  The 10-microsecond value has done
+	 * very well in testing.
+	 */
+
+	if (frsrcu_readers_active_idx(frsp, idx))
+		udelay(SYNCHRONIZE_FRSRCU_READER_DELAY);
+	while (frsrcu_readers_active_idx(frsp, idx))
+		schedule_timeout_interruptible(1);
+
+	sync_func();  /* Force memory barrier on all CPUs. */
+
+	/*
+	 * The preceding synchronize_sched() forces all frsrcu_read_unlock()
+	 * primitives that were executing concurrently with the preceding
+	 * for_each_possible_cpu() loop to have completed by this point.
+	 * More importantly, it also forces the corresponding FRSRCU
+	 * read-side critical sections to have also completed, and the
+	 * corresponding references to FRSRCU-protected data items to
+	 * be dropped.
+	 *
+	 * Note:
+	 *
+	 *	Despite what you might think at first glance, the
+	 *	preceding synchronize_sched() -must- be within the
+	 *	critical section ended by the following mutex_unlock().
+	 *	Otherwise, a task taking the early exit can race
+	 *	with a frsrcu_read_unlock(), which might have executed
+	 *	just before the preceding frsrcu_readers_active() check,
+	 *	and whose CPU might have reordered the frsrcu_read_unlock()
+	 *	with the preceding critical section.  In this case, there
+	 *	is nothing preventing the synchronize_sched() task that is
+	 *	taking the early exit from freeing a data structure that
+	 *	is still being referenced (out of order) by the task
+	 *	doing the frsrcu_read_unlock().
+	 *
+	 *	Alternatively, the comparison with "2" on the early
+	 *	exit could be changed to "3", but this increases
+	 *	synchronize_frsrcu() latency for bulk loads.  So the
+	 *	current code is preferred.
+	 */
+
+	mutex_unlock(&frsp->mutex);
+}
+
+/**
+ * synchronize_frsrcu - wait for pre-existing FRSRCU read-side critical sections
+ * @frsp: frsrcu_struct with which to synchronize.
+ *
+ * Flip the completed counter, and wait for the old count to drain to zero.
+ * As with classic RCU, the updater must use some separate means of
+ * synchronizing concurrent updates.  Can block; must be called from
+ * process context.
+ *
+ * Note that it is illegal to call synchronize_frsrcu() from the
+ * corresponding FRSRCU read-side critical section; doing so will result
+ * in deadlock.  However, it is perfectly legal to call synchronize_frsrcu()
+ * on one frsrcu_struct from some other frsrcu_struct's read-side critical
+ * section.
+ */
+void synchronize_frsrcu(struct frsrcu_struct *frsp)
+{
+	__synchronize_frsrcu(frsp, synchronize_sched);
+}
+EXPORT_SYMBOL_GPL(synchronize_frsrcu);
+
+/**
+ * synchronize_frsrcu_expedited - Brute-force FRSRCU grace period
+ * @frsp: frsrcu_struct with which to synchronize.
+ *
+ * Wait for an FRSRCU grace period to elapse, but use a "big hammer"
+ * approach to force the grace period to end quickly.  This consumes
+ * significant time on all CPUs and is unfriendly to real-time workloads,
+ * so is thus not recommended for any sort of common-case code.  In fact, if
+ * you are using synchronize_frsrcu_expedited() in a loop, please
+ * restructure your code to batch your updates, and then use a single
+ * synchronize_frsrcu() instead.
+ *
+ * Note that it is illegal to call this function while holding any lock
+ * that is acquired by a CPU-hotplug notifier.  And yes, it is also illegal
+ * to call this function from a CPU-hotplug notifier.  Failing to observe
+ * these restrictions will result in deadlock.  It is also illegal to call
+ * synchronize_frsrcu_expedited() from the corresponding FRSRCU read-side
+ * critical section; doing so will result in deadlock.  However, it is
+ * perfectly legal to call synchronize_frsrcu_expedited() on one
+ * frsrcu_struct from some other frsrcu_struct's read-side critical
+ * section, as long as the resulting graph of frsrcu_structs is acyclic.
+ */
+void synchronize_frsrcu_expedited(struct frsrcu_struct *frsp)
+{
+	__synchronize_frsrcu(frsp, synchronize_sched_expedited);
+}
+EXPORT_SYMBOL_GPL(synchronize_frsrcu_expedited);
+
+/**
+ * frsrcu_batches_completed - return batches completed.
+ * @frsp: frsrcu_struct on which to report batch completion.
+ *
+ * Report the number of batches, correlated with, but not necessarily
+ * precisely the same as, the number of grace periods that have elapsed.
+ */
+
+long frsrcu_batches_completed(struct frsrcu_struct *frsp)
+{
+	return frsp->completed;
+}
+EXPORT_SYMBOL_GPL(frsrcu_batches_completed);
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 59e32684c23b..6dfd9b9bd895 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -44,6 +44,7 @@
 #include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/stat.h>
+#include <linux/frsrcu.h>
 #include <linux/srcu.h>
 #include <linux/slab.h>
 #include <linux/trace_clock.h>
@@ -264,6 +265,102 @@ struct rcu_torture_ops {
 static struct rcu_torture_ops *cur_ops;

 /*
+ * Definitions for frsrcu torture testing.
+ */
+
+DEFINE_STATIC_FRSRCU(frsrcu_ctl);
+static struct frsrcu_struct frsrcu_ctld;
+static struct frsrcu_struct *frsrcu_ctlp = &frsrcu_ctl;
+
+static int frsrcu_torture_read_lock(void) __acquires(frsrcu_ctlp)
+{
+	return frsrcu_read_lock(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_read_unlock(int idx) __releases(frsrcu_ctlp)
+{
+	frsrcu_read_unlock(frsrcu_ctlp, idx);
+}
+
+static unsigned long frsrcu_torture_completed(void)
+{
+	return frsrcu_batches_completed(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_synchronize(void)
+{
+	synchronize_frsrcu(frsrcu_ctlp);
+}
+
+static void frsrcu_torture_stats(void)
+{
+	int cpu;
+	int idx = frsrcu_ctlp->completed & 0x1;
+
+	pr_alert("%s%s per-CPU(idx=%d):",
+		 torture_type, TORTURE_FLAG, idx);
+	for_each_possible_cpu(cpu) {
+		long c0, c1;
+
+		c0 = (long)per_cpu_ptr(frsrcu_ctlp->per_cpu_ref, cpu)->c[!idx];
+		c1 = (long)per_cpu_ptr(frsrcu_ctlp->per_cpu_ref, cpu)->c[idx];
+		pr_cont(" %d(%ld,%ld)", cpu, c0, c1);
+	}
+	pr_cont("\n");
+}
+
+static void frsrcu_torture_synchronize_expedited(void)
+{
+	synchronize_frsrcu_expedited(frsrcu_ctlp);
+}
+
+static void rcu_sync_torture_init(void);
+static void srcu_read_delay(struct torture_random_state *rrsp);
+
+static struct rcu_torture_ops frsrcu_ops = {
+	.ttype		= FRSRCU_FLAVOR,
+	.init		= rcu_sync_torture_init,
+	.readlock	= frsrcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= frsrcu_torture_read_unlock,
+	.started	= NULL,
+	.completed	= frsrcu_torture_completed,
+	.sync		= frsrcu_torture_synchronize,
+	.exp_sync	= frsrcu_torture_synchronize_expedited,
+	.stats		= frsrcu_torture_stats,
+	.name		= "frsrcu"
+};
+
+static void frsrcu_torture_init(void)
+{
+	rcu_sync_torture_init();
+	WARN_ON(init_frsrcu_struct(&frsrcu_ctld));
+	frsrcu_ctlp = &frsrcu_ctld;
+}
+
+static void frsrcu_torture_cleanup(void)
+{
+	cleanup_frsrcu_struct(&frsrcu_ctld);
+	frsrcu_ctlp = &frsrcu_ctl; /* In case of a later rcutorture run. */
+}
+
+/* As above, but dynamically allocated. */
+static struct rcu_torture_ops frsrcud_ops = {
+	.ttype		= FRSRCU_FLAVOR,
+	.init		= frsrcu_torture_init,
+	.cleanup	= frsrcu_torture_cleanup,
+	.readlock	= frsrcu_torture_read_lock,
+	.read_delay	= srcu_read_delay,
+	.readunlock	= frsrcu_torture_read_unlock,
+	.started	= NULL,
+	.completed	= frsrcu_torture_completed,
+	.sync		= frsrcu_torture_synchronize,
+	.exp_sync	= frsrcu_torture_synchronize_expedited,
+	.stats		= frsrcu_torture_stats,
+	.name		= "frsrcud"
+};
+
+/*
  * Definitions for rcu torture testing.
  */
@@ -810,7 +907,7 @@ rcu_torture_cbflood(void *arg)
 	int err = 1;
 	int i;
 	int j;
-	struct rcu_head *rhp;
+	struct rcu_head *rhp = NULL;

 	if (cbflood_n_per_burst > 0 &&
 	    cbflood_inter_holdoff > 0 &&
@@ -823,9 +920,7 @@ rcu_torture_cbflood(void *arg)
 	}
 	if (err) {
 		VERBOSE_TOROUT_STRING("rcu_torture_cbflood disabled: Bad args or OOM");
-		while (!torture_must_stop())
-			schedule_timeout_interruptible(HZ);
-		return 0;
+		goto wait_for_stop;
 	}
 	VERBOSE_TOROUT_STRING("rcu_torture_cbflood task started");
 	do {
@@ -844,6 +939,7 @@ rcu_torture_cbflood(void *arg)
 		stutter_wait("rcu_torture_cbflood");
 	} while (!torture_must_stop());
 	vfree(rhp);
+wait_for_stop:
 	torture_kthread_stopping("rcu_torture_cbflood");
 	return 0;
 }
@@ -1709,6 +1805,7 @@ rcu_torture_init(void)
 	int cpu;
 	int firsterr = 0;
 	static struct rcu_torture_ops *torture_ops[] = {
+		&frsrcu_ops, &frsrcud_ops,
 		&rcu_ops, &rcu_bh_ops, &rcu_busted_ops, &srcu_ops, &srcud_ops,
 		&sched_ops, RCUTORTURE_TASKS_OPS
 	};
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b908048f8d6a..fb39380eb1d5 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1233,6 +1233,7 @@ config RCU_TORTURE_TEST
 	depends on DEBUG_KERNEL
 	select TORTURE_TEST
 	select SRCU
+	select FRSRCU
 	select TASKS_RCU
 	default n
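For readers skimming the patch, the core trick (per-CPU counter pairs indexed by the low bit of `->completed`, with a grace period that flips the parity and drains the old counters) can be sketched in plain userspace C. This is a hypothetical single-threaded illustration only: a fixed array stands in for per-CPU data, and the real code additionally needs the synchronize_sched() barriers discussed above to be correct under concurrency.

```c
#define NCPU 4				/* stand-in for per-CPU storage */

struct frs {
	int completed;			/* low bit selects the active counter */
	int c[NCPU][2];
};

/* Reader: no memory barrier, just pick the active index and count in. */
static int frs_read_lock(struct frs *s, int cpu)
{
	int idx = s->completed & 0x1;

	s->c[cpu][idx]++;
	return idx;
}

static void frs_read_unlock(struct frs *s, int cpu, int idx)
{
	s->c[cpu][idx]--;
}

/* Sum one rank of counters across all "CPUs". */
static int readers_active_idx(struct frs *s, int idx)
{
	int sum = 0;

	for (int cpu = 0; cpu < NCPU; cpu++)
		sum += s->c[cpu][idx];
	return sum;
}

/* Writer: flip the parity, then wait for old-index readers to drain.
 * The real implementation brackets this with synchronize_sched() calls,
 * which is exactly where the large grace-period latency comes from. */
static void frs_synchronize(struct frs *s)
{
	int idx = s->completed & 0x1;

	s->completed++;
	while (readers_active_idx(s, idx))
		;			/* the kernel version sleeps here */
}
```

New readers arriving after the flip land on the other counter rank, so the writer only has to wait out readers that started before the flip.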
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-23  0:26               ` Paul E. McKenney
@ 2015-06-24 16:50                 ` Dave Hansen
  2015-06-24 17:29                   ` Paul E. McKenney
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Hansen @ 2015-06-24 16:50 UTC (permalink / raw)
  To: paulmck
  Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On 06/22/2015 05:26 PM, Paul E. McKenney wrote:
> OK, here is an experimental patch that provides a fast-readers variant
> of RCU, forward-ported from v3.3.  Because we didn't have call_srcu()
> and srcu_barrier() back then, it is not a drop-in replacement for SRCU,
> so you need to adapt the code to the API, which means putting an "fr"
> in front of the "srcu" in the API members.
>
> Understood on the overhead of the memory-barrier instruction showing
> up consistently.  My point was instead that getting rid of this
> memory-barrier instruction does not come for free, as it greatly
> increases the latency of synchronize_frsrcu().  In a real workload,
> it is entirely possible that the savings from eliminating the memory
> barrier are overwhelmed by the increased grace-period latency.
>
> Anyway, the patch is below.  Very lightly tested.

This does give a very similar performance boost to the other
optimization I posted.  I measured this patch to boost the writes/second
by 11.0% while my previous optimization did 10.8%.

I don't think this workload will see any of the overhead of the
synchronize_frsrcu(), but this helps confirm the source of the
overhead.
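The shape of the 1-byte write test quoted above is roughly the following (a hypothetical sketch only; the real tests live at the URLs Dave posted and run inside the will-it-scale harness, which counts iterations over a fixed wall-clock interval rather than running a fixed loop):

```c
#include <stdio.h>
#include <unistd.h>

/* Repeatedly write one byte at offset 0 of an already-open file and
 * return how many writes succeeded.  The tiny buffer keeps the
 * benchmark measuring kernel entry/exit and per-write overhead (such
 * as fsnotify()) rather than data copying. */
static unsigned long write_one_byte_loop(int fd, unsigned long iters)
{
	unsigned long done = 0;

	for (unsigned long i = 0; i < iters; i++)
		if (pwrite(fd, "x", 1, 0) == 1)
			done++;
	return done;
}
```

Run against a file on tmpfs, nearly all of the per-iteration cost is in the write() path itself, which is why a change in srcu_read_lock() shows up so clearly in the numbers.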
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-24 16:50                 ` Dave Hansen
@ 2015-06-24 17:29                   ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-06-24 17:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Wed, Jun 24, 2015 at 09:50:50AM -0700, Dave Hansen wrote:
> On 06/22/2015 05:26 PM, Paul E. McKenney wrote:
> > OK, here is an experimental patch that provides a fast-readers variant
> > of RCU, forward-ported from v3.3.  Because we didn't have call_srcu()
> > and srcu_barrier() back then, it is not a drop-in replacement for SRCU,
> > so you need to adapt the code to the API, which means putting an "fr"
> > in front of the "srcu" in the API members.
> >
> > Understood on the overhead of the memory-barrier instruction showing
> > up consistently.  My point was instead that getting rid of this
> > memory-barrier instruction does not come for free, as it greatly
> > increases the latency of synchronize_frsrcu().  In a real workload,
> > it is entirely possible that the savings from eliminating the memory
> > barrier are overwhelmed by the increased grace-period latency.
> >
> > Anyway, the patch is below.  Very lightly tested.
>
> This does give a very similar performance boost as the other
> optimization I posted.  I measured this patch to boost the writes/second
> by 11.0% while my previous optimization did 10.8%.
>
> I don't think this workload will see any of the overhead of the
> synchronize_frsrcu(), though, but this helps confirm the source of the
> overhead.

Thank you for testing it!  I agree that your patch is much simpler than
adding another flavor of RCU, so I prefer your patch, but the
confirmation is nevertheless valuable.

							Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-22 15:11             ` Paul E. McKenney
  2015-06-22 15:20               ` Peter Zijlstra
  2015-06-22 18:50               ` Dave Hansen
@ 2015-06-22 18:52               ` Peter Zijlstra
  2015-06-23  0:29                 ` Paul E. McKenney
  2 siblings, 1 reply; 20+ messages in thread
From: Peter Zijlstra @ 2015-06-22 18:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote:
> That depends on how slow the resulting slow global state would be.
> We have some use cases (definitely KVM, perhaps also some of the VFS
> code) that need the current speed, as opposed to the profound slowness
> that three trips through synchronize_sched() would provide.

So what we have with that percpu-rwsem code that I send out earlier
today is a conditional smp_mb(), and I think we can do the same for
SRCU.

I'm just not sure !GP is common enough for all SRCU cases to be worth
doing.

Those that rely on sync_srcu() and who do it rarely would definitely
benefit.  The same with those that rarely do call_srcu().

But those that heavily use call_srcu() would be better off with the
prolonged GP with 3 sync_sched() calls in.
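The "conditional smp_mb()" shape Peter describes might look roughly like the following. This is purely an illustrative userspace sketch, not his percpu-rwsem code: the reader pays for a full barrier only while a writer has announced a grace period, and the genuinely hard part (making the flag transition itself visible without reader-side barriers, which is what the synchronize_sched() trips buy you) is waved away here.

```c
#include <stdatomic.h>

/* Illustration only: plain increments on the fast path, a full fence
 * only while gp_in_progress is set.  A real implementation must also
 * guarantee that readers observe the flag flip in time, which is where
 * the expensive sync_sched() machinery comes in. */
static atomic_int gp_in_progress;
static atomic_int nreaders;		/* stand-in for per-CPU counters */

static void cond_read_lock(void)
{
	atomic_fetch_add_explicit(&nreaders, 1, memory_order_relaxed);
	if (atomic_load_explicit(&gp_in_progress, memory_order_relaxed))
		atomic_thread_fence(memory_order_seq_cst);	/* slow path */
}

static void cond_read_unlock(void)
{
	if (atomic_load_explicit(&gp_in_progress, memory_order_relaxed))
		atomic_thread_fence(memory_order_seq_cst);	/* slow path */
	atomic_fetch_sub_explicit(&nreaders, 1, memory_order_relaxed);
}
```

The tradeoff Peter is weighing is visible in the structure: the common !GP case costs only a relaxed increment, while the grace-period machinery becomes correspondingly more expensive on the writer side.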
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files
  2015-06-22 18:52               ` Peter Zijlstra
@ 2015-06-23  0:29                 ` Paul E. McKenney
  0 siblings, 0 replies; 20+ messages in thread
From: Paul E. McKenney @ 2015-06-23  0:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dave Hansen, Andi Kleen, dave.hansen, akpm, jack, viro,
      eparis, john, rlove, tim.c.chen, linux-kernel

On Mon, Jun 22, 2015 at 08:52:29PM +0200, Peter Zijlstra wrote:
> On Mon, Jun 22, 2015 at 08:11:21AM -0700, Paul E. McKenney wrote:
> > That depends on how slow the resulting slow global state would be.
> > We have some use cases (definitely KVM, perhaps also some of the VFS
> > code) that need the current speed, as opposed to the profound slowness
> > that three trips through synchronize_sched() would provide.
>
> So what we have with that percpu-rwsem code that I send out earlier
> today is a conditional smp_mb(), and I think we can do the same for
> SRCU.
>
> I'm just not sure !GP is common enough for all SRCU cases to be worth
> doing.

Especially given that we don't want the readers to have to acquire a
lock in order to get a consistent view of whether or not a grace period
is in progress.

> Those that rely on sync_srcu() and who do it rarely would definitely
> benefit.  The same with those that rarely do call_srcu().
>
> But those that heavily use call_srcu() would be better off with the
> prolonged GP with 3 sync_sched() calls in.

Those are indeed two likely possibilities.  Other possibilities include
cases where synchronize_srcu() is invoked rarely, but where its latency
is visible to userspace, and those where there really is a need to wait
synchronously for a grace period, so that call_srcu() doesn't buy
you anything.

							Thanx, Paul
* Re: [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files

From: Jan Kara @ 2015-06-23 15:17 UTC
To: Dave Hansen
Cc: dave.hansen, akpm, jack, viro, eparis, john, rlove, tim.c.chen,
    ak, linux-kernel

On Fri 19-06-15 14:50:25, Dave Hansen wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> I have a _tiny_ microbenchmark that sits in a loop and writes
> single bytes to a file.  Writing one byte to a tmpfs file is
> around 2x slower than reading one byte from a file, which is a
> _bit_ more than I expected.  This is a dumb benchmark, but I think
> it's hard to deny that write() is a hot path and we should avoid
> unnecessary overhead there.
>
> I did a 'perf record' of 30-second samples of read and write.
> The top item in a diffprofile is srcu_read_lock() from
> fsnotify().  There are active inotify fds from systemd, but
> nothing is actually listening to the file or that part of
> the filesystem.
>
> I *think* we can avoid taking the srcu_read_lock() for the
> common case where there are no actual marks on the file
> being modified *or* the vfsmount.
>
> The *_fsnotify_mask is an aggregate of the masks from each
> mark.  If we have nothing set in the masks at all then there
> are no marks and no need to do anything with 'ignored masks'
> since none exist.  This keeps us from having to do the costly
> srcu_read_lock() for a check which is very cheap.
>
> This patch gave a 10.8% speedup in writes/second on my test.
...
> diff -puN fs/notify/fsnotify.c~optimize-fsnotify fs/notify/fsnotify.c
> --- a/fs/notify/fsnotify.c~optimize-fsnotify	2015-06-19 13:29:53.117283581 -0700
> +++ b/fs/notify/fsnotify.c	2015-06-19 13:29:53.123283853 -0700
> @@ -213,6 +213,16 @@ int fsnotify(struct inode *to_tell, __u3
>  	    !(test_mask & to_tell->i_fsnotify_mask) &&
>  	    !(mnt && test_mask & mnt->mnt_fsnotify_mask))
>  		return 0;
> +	/*
> +	 * Optimization: The *_fsnotify_mask is an aggregate of each of the
> +	 * masks from each mark.  If we have nothing set in the masks at
> +	 * all then there are no marks and no need to do anything with
> +	 * 'ignored masks' since none exist.  This keeps us from having to
> +	 * do the costly srcu_read_lock() for a check which is very cheap.
> +	 */
> +	if (!to_tell->i_fsnotify_mask &&
> +	    (!mnt || !mnt->mnt_fsnotify_mask))
> +		return 0;

But this changes userspace-visible behavior.  You can have an ignored
mask set without any of the notification masks set, and you are expected
to clear the ignored mask on the first IN_MODIFY event.  So the test
just above your check is already dealing with all the cases we can
easily detect.

That being said, we could further refine things by storing a flag in the
inode & struct mount telling whether any of the attached marks have an
ignored_mask that needs clearing, and only doing the traversal of the
list of marks if any of the ignored_masks is set.  That would avoid the
traversal not only when nobody is watching but also in cases where
nobody is watching for the IN_MODIFY event.

Finally, we could do something even without the flag.  We can have a
look at to_tell->i_fsnotify_marks.first and mnt->mnt_fsnotify_marks.first
even without SRCU to check whether they are both NULL.  If they are, we
know we can safely skip the SRCU thing.  If they are != NULL, we grab
the SRCU read lock and refetch the pointers.  This should be correct as
we don't dereference any pointer we fetched outside of the SRCU critical
section.  Am I right?
And this optimizes the case when nobody is watching without unwanted
side-effects.

								Honza

> 	idx = srcu_read_lock(&fsnotify_mark_srcu);

--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
end of thread, other threads:[~2015-06-24 17:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-06-19 21:50 [RFC][PATCH] fs: optimize inotify/fsnotify code for unwatched files Dave Hansen
2015-06-19 23:33 ` Andi Kleen
2015-06-20  0:29 ` Paul E. McKenney
2015-06-20  0:39 ` Dave Hansen
2015-06-20  2:21 ` Paul E. McKenney
2015-06-20 18:02 ` Dave Hansen
2015-06-21  1:30 ` Paul E. McKenney
2015-06-22 13:28 ` Peter Zijlstra
2015-06-22 15:11 ` Paul E. McKenney
2015-06-22 15:20 ` Peter Zijlstra
2015-06-22 16:29 ` Paul E. McKenney
2015-06-22 19:03 ` Peter Zijlstra
2015-06-23  0:31 ` Paul E. McKenney
2015-06-22 18:50 ` Dave Hansen
2015-06-23  0:26 ` Paul E. McKenney
2015-06-24 16:50 ` Dave Hansen
2015-06-24 17:29 ` Paul E. McKenney
2015-06-22 18:52 ` Peter Zijlstra
2015-06-23  0:29 ` Paul E. McKenney
2015-06-23 15:17 ` Jan Kara