* [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
@ 2011-12-19  3:36 mengcong
  2011-12-19  4:11 ` Al Viro
  2011-12-19  7:31 ` Srivatsa S. Bhat
  0 siblings, 2 replies; 37+ messages in thread
From: mengcong @ 2011-12-19  3:36 UTC (permalink / raw)
  To: Alexander Viro; +Cc: linux-kernel, linux-fsdevel, Nick Piggin

In a heavily loaded system, when frequently turning CPUs on and off, the
kernel will detect soft lockups on multiple CPUs. The detailed bug report
is at https://lkml.org/lkml/2011/8/24/185.

The root cause is that the brlock functions br_write_lock() and
br_write_unlock() only lock/unlock the per-CPU spinlocks of the CPUs that
are online. This means that if a CPU is locked while online and then goes
offline, any unlock operation that happens while it is offline will not
touch its lock; when it comes back online again, it carries an incorrect
brlock state. This has been verified on the current kernel.
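
For reference, the _online variants generated by DEFINE_LGLOCK() boil down
to roughly the following (simplified sketch; lockdep annotations and other
details omitted):

	void name##_global_lock_online(void)
	{
		int i;
		preempt_disable();
		for_each_online_cpu(i)		/* online set, snapshot #1 */
			arch_spin_lock(&per_cpu(name##_lock, i));
	}

	void name##_global_unlock_online(void)
	{
		int i;
		for_each_online_cpu(i)		/* online set, snapshot #2 */
			arch_spin_unlock(&per_cpu(name##_lock, i));
		preempt_enable();
	}

If a CPU goes offline between the two loops, its per-CPU spinlock is left
held; if a CPU comes online in between, an unlock is issued on a lock that
was never taken.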

I can reproduce this bug on an unmodified 3.1 kernel. With this patch
applied, I ran an 8-hour test (using the test script provided by the bug
reporter) and no soft lockup occurred.

Signed-off-by: Cong Meng <mc@linux.vnet.ibm.com>
Reported-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---
 include/linux/lglock.h |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..08b9e84 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -27,8 +27,8 @@
 #define br_lock_init(name)	name##_lock_init()
 #define br_read_lock(name)	name##_local_lock()
 #define br_read_unlock(name)	name##_local_unlock()
-#define br_write_lock(name)	name##_global_lock_online()
-#define br_write_unlock(name)	name##_global_unlock_online()
+#define br_write_lock(name)	name##_global_lock()
+#define br_write_unlock(name)	name##_global_unlock()

 #define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
 #define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
-- 
1.7.5.4




^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  3:36 [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs mengcong
@ 2011-12-19  4:11 ` Al Viro
  2011-12-19  5:00   ` Dave Chinner
  2011-12-19  7:31 ` Srivatsa S. Bhat
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-19  4:11 UTC (permalink / raw)
  To: mengcong; +Cc: linux-kernel, linux-fsdevel, Nick Piggin

On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote:
> In a heavily loaded system, when frequently turning on and off CPUs, the
> kernel will detect soft-lockups on multiple CPUs. The detailed bug report
> is at https://lkml.org/lkml/2011/8/24/185.
> 
> The root cause is that brlock functions, i.e. br_write_lock() and
> br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that
> are online, which means, if one online CPU is locked and then goes
> offline, any later unlocking operation happens during its offline state
> will not touch it; and when it goes online again, it has the incorrect
> brlock state. This has been verified in current kernel.
> 
> I can reproduce this bug on the intact 3.1 kernel. After my patch applied, 
> I've ran an 8-hours long test(test script provided by the bug reporter), 
> and no soft lockup happened again.

Argh...  OK, that's seriously nasty.  I agree that this is broken, but
your patch makes br_write_lock() very costly on kernels built with a
huge number of possible CPUs, even when it's running on a box with few
CPUs ;-/

That sucks.  Worse, AFAICS, the only way to prevent on-/off-line status
changes is blocking (and both directions are bad - if the thing goes online
between br_write_lock() and br_write_unlock(), we'll get spin_unlock without
spin_lock).  And I really don't want to make vfsmount_lock writers blocking -
we *probably* could get away with that, but it'll suck very badly.  Especially
since we'll have that nested inside namespace_sem...

The alternative is to do get_online_cpus()/put_online_cpus() around the stuff
in fs/namespace.c, putting it *outside* everything but actual IO.  We can
do that (since right now vfsmount_lock is non-blocking and the only
potentially blocking operation under namespace_sem is kmalloc()), but
I'm not particularly comfortable doing that - I've never played with the code
in kernel/cpu.c and I don't know if there's anything subtle to watch out
for.
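
Very roughly, and only as an illustration (the actual call sites would need
auditing), something like:

	/* hypothetical sketch: pin CPU hotplug before taking namespace_sem,
	 * so that the br_write_lock()/br_write_unlock() pair below runs
	 * against a stable set of online CPUs */
	get_online_cpus();
	down_write(&namespace_sem);
	br_write_lock(vfsmount_lock);
	/* ... modify the mount tree ... */
	br_write_unlock(vfsmount_lock);
	up_write(&namespace_sem);
	put_online_cpus();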

The same issue exists for lg_global_lock_online(), but that beast is
never used (and the only remaining user of lg_global_lock() is
hardly time-critical - with Miklos' patches it's only done on
mount -o remount,force,ro).

Nick, any comments?  That's your code...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  4:11 ` Al Viro
@ 2011-12-19  5:00   ` Dave Chinner
  2011-12-19  6:07     ` mengcong
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2011-12-19  5:00 UTC (permalink / raw)
  To: Al Viro; +Cc: mengcong, linux-kernel, linux-fsdevel, Nick Piggin

On Mon, Dec 19, 2011 at 04:11:42AM +0000, Al Viro wrote:
> On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote:
> > In a heavily loaded system, when frequently turning on and off CPUs, the
> > kernel will detect soft-lockups on multiple CPUs. The detailed bug report
> > is at https://lkml.org/lkml/2011/8/24/185.
> > 
> > The root cause is that brlock functions, i.e. br_write_lock() and
> > br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that
> > are online, which means, if one online CPU is locked and then goes
> > offline, any later unlocking operation happens during its offline state
> > will not touch it; and when it goes online again, it has the incorrect
> > brlock state. This has been verified in current kernel.
> > 
> > I can reproduce this bug on the intact 3.1 kernel. After my patch applied, 
> > I've ran an 8-hours long test(test script provided by the bug reporter), 
> > and no soft lockup happened again.
> 
> Argh...  OK, that's seriously nasty.  I agree that this is broken, but
> your patch makes br_write_lock() very costly on kernels build with
> huge number of possible CPUs, even when it's run on a box with few
> CPUs ;-/

I fixed this problem for the XFS per-cpu superblock counters
(bit lock + counters in per-cpu structs) back in 2006. It basically
uses an lglock-like local/global locking structure and iterates over
the per-cpu structures using for_each_online_cpu().

I fixed it simply by registering a hotplug notifier and
draining/reinitialising the counters on the appropriate event under a
global lock context, i.e. making CPU hotplug serialise against
concurrent lock operations. See commit e8234a68 ("[XFS] Add support
for hotplug CPUs...").
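
The general shape of such a notifier (sketch only; the foo_* names are
placeholders, not the actual XFS code):

	static int foo_cpu_notify(struct notifier_block *nb,
				  unsigned long action, void *hcpu)
	{
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_ONLINE:
			/* bring the new CPU's per-cpu state into line,
			 * under the global lock */
			break;
		case CPU_DEAD:
			/* fold the dead CPU's per-cpu state back into
			 * the global state, under the global lock */
			break;
		}
		return NOTIFY_OK;
	}

	static struct notifier_block foo_cpu_nb = {
		.notifier_call = foo_cpu_notify,
	};

	/* at init time */
	register_hotcpu_notifier(&foo_cpu_nb);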

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  5:00   ` Dave Chinner
@ 2011-12-19  6:07     ` mengcong
  0 siblings, 0 replies; 37+ messages in thread
From: mengcong @ 2011-12-19  6:07 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-kernel, linux-fsdevel, Nick Piggin, Dave Chinner

On Mon, 2011-12-19 at 16:00 +1100, Dave Chinner wrote:
> On Mon, Dec 19, 2011 at 04:11:42AM +0000, Al Viro wrote:
> > On Mon, Dec 19, 2011 at 11:36:15AM +0800, mengcong wrote:
> > > In a heavily loaded system, when frequently turning on and off CPUs, the
> > > kernel will detect soft-lockups on multiple CPUs. The detailed bug report
> > > is at https://lkml.org/lkml/2011/8/24/185.
> > > 
> > > The root cause is that brlock functions, i.e. br_write_lock() and
> > > br_write_unlock(), only locks/unlocks the per-CPU spinlock of CPUs that
> > > are online, which means, if one online CPU is locked and then goes
> > > offline, any later unlocking operation happens during its offline state
> > > will not touch it; and when it goes online again, it has the incorrect
> > > brlock state. This has been verified in current kernel.
> > > 
> > > I can reproduce this bug on the intact 3.1 kernel. After my patch applied, 
> > > I've ran an 8-hours long test(test script provided by the bug reporter), 
> > > and no soft lockup happened again.
> > 
> > Argh...  OK, that's seriously nasty.  I agree that this is broken, but
> > your patch makes br_write_lock() very costly on kernels build with
> > huge number of possible CPUs, even when it's run on a box with few
> > CPUs ;-/
> 
> I fixed this problem with the XFS per-cpu superblock counters
> (bit lock + counters in per-cpu structs) back in 2006. It basically
> uses a lglock-like local/global locking structure and iterates them
> using for_each_online_cpu().
> 
> I fixed it simply by registering a hotplug notifier and
> draining/reinitialising counters on the appropriate event under a
> global lock context. i.e. make CPU hotplug serialise against
> concurrent lock operations. See commit e8234a68 ("[XFS] Add support
> for hotplug CPUs...")
> 

How about registering a cpu hotplug notifier to fix up the brlock state,
as Dave did?

> Cheers,
> 
> Dave.



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  3:36 [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs mengcong
  2011-12-19  4:11 ` Al Viro
@ 2011-12-19  7:31 ` Srivatsa S. Bhat
  2011-12-19  9:12   ` Stephen Boyd
  1 sibling, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-19  7:31 UTC (permalink / raw)
  To: mc
  Cc: Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin, david,
	akpm, Maciej Rutecki


Hi,

I feel the following patch is a better fix, for 2 reasons:

1. As Al Viro pointed out, if we use for_each_possible_cpu() we might
incur an unnecessary performance hit in some scenarios. So working with
only online cpus, safely (i.e. race-free), if possible, would be a good
solution (which this patch implements).

2. *_global_lock_online() and *_global_unlock_online() need fixing as well
because the names suggest that they lock/unlock the per-CPU locks of only the
currently online CPUs, but unfortunately they do not have any synchronization
to prevent those CPUs from being offlined in between, if they happen to race
with a CPU hotplug operation.

And if we solve issue 2 above "carefully" (as mentioned in the changelog below),
it solves this whole thing!

---
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] VFS: Fix race between CPU hotplug and *_global_[un]lock_online()


The *_global_[un]lock_online() macros defined in include/linux/lglock.h
can race with CPU hotplug operations. Fix this race by using the pair
get_online_cpus()/put_online_cpus() around them, so as to prevent CPU
hotplug from happening at the same time.

But be careful to note the semantics here: if we allow CPU hotplug in between
a lock_online() and the corresponding unlock_online(), the lock states can
get messed up, since we might end up, for example, in a situation such as taking
a lock on an online CPU but not releasing it because that CPU was offline when
unlock_online() was invoked (thanks to Cong Meng for debugging this issue).
[Soft lockups detected as a consequence of this issue were reported earlier in
https://lkml.org/lkml/2011/8/24/185.]

So, we are careful to allow CPU hotplug only after the lock-unlock sequence
is complete.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/lglock.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..583d1a8 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -126,6 +127,7 @@
 	int i;								\
 	preempt_disable();						\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
+	get_online_cpus();						\
 	for_each_online_cpu(i) {					\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
@@ -142,6 +144,7 @@
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_unlock(lock);					\
 	}								\
+	put_online_cpus();						\
 	preempt_enable();						\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\




^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  7:31 ` Srivatsa S. Bhat
@ 2011-12-19  9:12   ` Stephen Boyd
  2011-12-19 11:03     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Stephen Boyd @ 2011-12-19  9:12 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mc, Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/18/2011 11:31 PM, Srivatsa S. Bhat wrote:
> Hi,
>
> I feel the following patch is a better fix for 2 reasons:
>
> 1. As Al Viro pointed out, if we do for_each_possible_cpus() then we might
> encounter unnecessary performance hit in some scenarios. So working with
> only online cpus, safely(a.k.a race-free), if possible, would be a good
> solution (which this patch implements).
>
> 2. *_global_lock_online() and *_global_unlock_online() needs fixing as well
> because, the names suggest that they lock/unlock per-CPU locks of only the
> currently online CPUs, but unfortunately they do not have any synchronization
> to prevent offlining those CPUs in between, if it happens to race with a CPU
> hotplug operation.
>
> And if we solve issue 2 above "carefully" (as mentioned in the changelog below),
> it solves this whole thing!

We started seeing this same problem last week. I've come up with almost 
the same solution but you beat me to the list!

> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
> index f549056..583d1a8 100644
> --- a/include/linux/lglock.h
> +++ b/include/linux/lglock.h
> @@ -126,6 +127,7 @@
>   	int i;								\
>   	preempt_disable();						\
>   	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> +	get_online_cpus();						\
>   	for_each_online_cpu(i) {					\
>   		arch_spinlock_t *lock;					\
>   		lock =&per_cpu(name##_lock, i);			\
> @@ -142,6 +144,7 @@
>   		lock =&per_cpu(name##_lock, i);			\
>   		arch_spin_unlock(lock);					\
>   	}								\
> +	put_online_cpus();						\
>   	preempt_enable();						\
>    }									\
>    EXPORT_SYMBOL(name##_global_unlock_online);				\

Don't you want to call {get,put}_online_cpus() outside the 
preempt_{disable,enable}()? Otherwise you are scheduling while atomic?
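
I.e., roughly this ordering, since get_online_cpus() can sleep:

	get_online_cpus();	/* may sleep: takes a mutex */
	preempt_disable();
	/* ... lock/unlock the per-CPU locks ... */
	preempt_enable();
	put_online_cpus();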

With that fixed

Acked-by: Stephen Boyd <sboyd@codeaurora.org>

but I wonder if taking the hotplug mutex even for a short time reduces 
the effectiveness of these locks? Or is it more about fast readers and 
slow writers?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19  9:12   ` Stephen Boyd
@ 2011-12-19 11:03     ` Srivatsa S. Bhat
  2011-12-19 12:11       ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-19 11:03 UTC (permalink / raw)
  To: Stephen Boyd
  Cc: mc, Alexander Viro, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/19/2011 02:42 PM, Stephen Boyd wrote:

> On 12/18/2011 11:31 PM, Srivatsa S. Bhat wrote:
>> Hi,
>>
>> I feel the following patch is a better fix for 2 reasons:
>>
>> 1. As Al Viro pointed out, if we do for_each_possible_cpus() then we
>> might
>> encounter unnecessary performance hit in some scenarios. So working with
>> only online cpus, safely(a.k.a race-free), if possible, would be a good
>> solution (which this patch implements).
>>
>> 2. *_global_lock_online() and *_global_unlock_online() needs fixing as
>> well
>> because, the names suggest that they lock/unlock per-CPU locks of only
>> the
>> currently online CPUs, but unfortunately they do not have any
>> synchronization
>> to prevent offlining those CPUs in between, if it happens to race with
>> a CPU
>> hotplug operation.
>>
>> And if we solve issue 2 above "carefully" (as mentioned in the
>> changelog below),
>> it solves this whole thing!
> 
> We started seeing this same problem last week. I've come up with almost
> the same solution but you beat me to the list!
> 


Oh :-)

>> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
>> index f549056..583d1a8 100644
>> --- a/include/linux/lglock.h
>> +++ b/include/linux/lglock.h
>> @@ -126,6 +127,7 @@
>>       int i;                                \
>>       preempt_disable();                        \
>>       rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);        \
>> +    get_online_cpus();                        \
>>       for_each_online_cpu(i) {                    \
>>           arch_spinlock_t *lock;                    \
>>           lock =&per_cpu(name##_lock, i);            \
>> @@ -142,6 +144,7 @@
>>           lock =&per_cpu(name##_lock, i);            \
>>           arch_spin_unlock(lock);                    \
>>       }                                \
>> +    put_online_cpus();                        \
>>       preempt_enable();                        \
>>    }                                    \
>>    EXPORT_SYMBOL(name##_global_unlock_online);                \
> 
> Don't you want to call {get,put}_online_cpus() outside the
> preempt_{disable,enable}()? Otherwise you are scheduling while atomic?
>


Oh right, thanks for spotting this! (and thanks for your Ack too!)


> With that fixed
> 
> Acked-by: Stephen Boyd <sboyd@codeaurora.org>
> 
> but I wonder if taking the hotplug mutex even for a short time reduces
> the effectiveness of these locks? Or is it more about fast readers and
> slow writers?
> 


IMHO, we don't need to be concerned here because {get,put}_online_cpus()
implement a refcounting solution, and they don't really serialize stuff
unnecessarily. The readers (those who prevent cpu hotplug, such as this lock-
unlock code) are fast and can be concurrent, while the writer (the task that
is doing the cpu hotplug) waits till all existing readers are gone/done with
their work.

So, since we are readers here, IMO, we don't have to worry about performance.
(I know that we get serialized just for a moment while incrementing the
refcount, but that should not be worrisome, right?)

Moreover, using for_each_online_cpu() without {get,put}_online_cpus()
around it is plain wrong, because of the unhandled race with cpu hotplug.
IOW, our primary concern here is functionality, isn't it?

To summarize, in the current design of these VFS locks, using
{get,put}_online_cpus() is *essential* to fix a functionality-related bug
(and it is not bad performance-wise either).
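
(For reference, get_online_cpus()/put_online_cpus() in kernel/cpu.c are
roughly the following, simplified:)

	void get_online_cpus(void)
	{
		might_sleep();
		if (cpu_hotplug.active_writer == current)
			return;
		mutex_lock(&cpu_hotplug.lock);
		cpu_hotplug.refcount++;		/* reader reference */
		mutex_unlock(&cpu_hotplug.lock);
	}

	/* put_online_cpus() drops the refcount and wakes the hotplug
	 * writer (cpu_up()/cpu_down()), which waits until the refcount
	 * reaches zero before proceeding. */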

The following patch (v2) incorporates your comments:

---
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH v2] VFS: Fix race between CPU hotplug and *_global_[un]lock_online()

The *_global_[un]lock_online() macros defined in include/linux/lglock.h
can race with CPU hotplug operations. Fix this race by using the pair
get_online_cpus()/put_online_cpus() around them, so as to prevent CPU
hotplug from happening at the same time.

But be careful to note the semantics here: if we allow CPU hotplug in between
a lock_online() and the corresponding unlock_online(), the lock states can
get messed up, since we might end up, for example, in a situation such as taking
a lock on an online CPU but not releasing it because that CPU was offline when
unlock_online() was invoked (thanks to Cong Meng for debugging this issue).
[Soft lockups detected as a consequence of this issue were reported earlier in
https://lkml.org/lkml/2011/8/24/185.]

So, we are careful to allow CPU hotplug only after the lock-unlock sequence
is complete.

v2: Moved {get,put}_online_cpus() outside of preempt_{disable,enable} to avoid
scheduling while atomic.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Acked-by: Stephen Boyd <sboyd@codeaurora.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/lglock.h |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..7ef257d 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -124,6 +125,7 @@
 									\
  void name##_global_lock_online(void) {					\
 	int i;								\
+	get_online_cpus();						\
 	preempt_disable();						\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
 	for_each_online_cpu(i) {					\
@@ -143,6 +145,7 @@
 		arch_spin_unlock(lock);					\
 	}								\
 	preempt_enable();						\
+	put_online_cpus();						\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\
 									\



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 11:03     ` Srivatsa S. Bhat
@ 2011-12-19 12:11       ` Al Viro
  2011-12-19 20:23         ` Srivatsa S. Bhat
  2011-12-19 23:56         ` Dave Chinner
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2011-12-19 12:11 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote:

> IMHO, we don't need to be concerned here because, {get,put}_online_cpus()
> implement a refcounting solution, and they don't really serialize stuff
> unnecessarily. The readers (those who prevent cpu hotplug, such as this lock-
> unlock code) are fast and can be concurrent, while the writers (the task that
> is doing the cpu hotplug) waits till all existing readers are gone/done with
> their work.
> 
> So, since we are readers here, IMO, we don't have to worry about performance.
> (I know that we get serialized just for a moment while incrementing the
> refcount, but that should not be worrisome right?)
> 
> Moreover, using for_each_online_cpu() without using {get,put}_online_cpus()
> around that, is plain wrong, because of the unhandled race with cpu hotplug.
> IOW, our primary concern here is functionality, isn't it?
> 
> To summarize, in the current design of these VFS locks, using
> {get,put}_online_cpus() is *essential* to fix a functionality-related bug,
> (and not so bad performance-wise as well).
> 
> The following patch (v2) incorporates your comments:

I really don't like that.  Amount of contention is not a big issue, but the
fact that br_write_lock(vfsmount_lock) now becomes blocking is really nasty.
Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem...
BTW, it's seriously blocking - if nothing else, it waits for cpu_down()
in progress to complete.  Which can involve any number of interesting
locks taken by notifiers.

Dave's variant is also no good; consider this:
CPU1: br_write_lock(); spinlocks grabbed
CPU2: br_read_lock(); spinning on one of them
CPU3: try to take CPU2 down.  We *can't* proceed to the end, notifiers or no
notifiers, until CPU2 gets through the critical area.  Which can't happen
until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock().
Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go
into the critical area when it's really not safe there.

That's got one hell of a deadlock potential ;-/  So far I'm more or less
in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside
of namespace_sem.  But I still have not convinced myself that it's
really safe ;-/

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 12:11       ` Al Viro
@ 2011-12-19 20:23         ` Srivatsa S. Bhat
  2011-12-19 20:52           ` Al Viro
  2011-12-19 23:56         ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-19 20:23 UTC (permalink / raw)
  To: Al Viro
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/19/2011 05:41 PM, Al Viro wrote:

> On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote:
> 
>> IMHO, we don't need to be concerned here because, {get,put}_online_cpus()
>> implement a refcounting solution, and they don't really serialize stuff
>> unnecessarily. The readers (those who prevent cpu hotplug, such as this lock-
>> unlock code) are fast and can be concurrent, while the writers (the task that
>> is doing the cpu hotplug) waits till all existing readers are gone/done with
>> their work.
>>
>> So, since we are readers here, IMO, we don't have to worry about performance.
>> (I know that we get serialized just for a moment while incrementing the
>> refcount, but that should not be worrisome right?)
>>
>> Moreover, using for_each_online_cpu() without using {get,put}_online_cpus()
>> around that, is plain wrong, because of the unhandled race with cpu hotplug.
>> IOW, our primary concern here is functionality, isn't it?
>>
>> To summarize, in the current design of these VFS locks, using
>> {get,put}_online_cpus() is *essential* to fix a functionality-related bug,
>> (and not so bad performance-wise as well).
>>
>> The following patch (v2) incorporates your comments:
> 
> I really don't like that.  Amount of contention is not a big issue, but the
> fact that now br_write_lock(vfsmount_lock) became blocking is really nasty.
> Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem...
> BTW, it's seriously blocking - if nothing else, it waits for cpu_down()
> in progress to complete.  Which can involve any number of interesting
> locks taken by notifiers.
> 
> Dave's variant is also no good; consider this:
> CPU1: br_write_lock(); spinlocks grabbed
> CPU2: br_read_lock(); spinning on one of them
> CPU3: try to take CPU2 down.  We *can't* proceed to the end, notifiers or no
> notifiers, until CPU2 gets through the critical area.  Which can't happen
> until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock().
> Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go
> into the critical area when it's really not safe there.
> 
> That got one hell of a deadlock potential ;-/  So far I'm more or less
> in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside
> of namespace_sem.  But I still have not convinced myself that it's
> really safe ;-/
> 


Okay, clearly you want br_write_locks to remain non-blocking. And for reasons
related to long-term code understandability/maintainability, I am not a fan of
the idea of sprinkling {get,put}_online_cpus() in fs/namespace.c, away from the
place where the race with CPU hotplug actually exists (though I understand that
that would work just fine).

So, considering the above two requirements, let me propose another approach
(though it might sound a bit hacky) :

We note that we can simplify our requirement in *global_[un]lock_online() to:

"lock and unlock must be invoked on the same set of CPUs, and that sequence
must not get missed for any CPU, even if the CPU is offline. We only care about
the correctness of lock-unlock operations, and not really about CPU Hotplug or
'working with only online CPUs' or anything of that sort."

If this new definition of our requirement is acceptable (correct me if I am
wrong), then we can do something like the following patch, while still
retaining br locks as non-blocking.

We make a copy of the current cpu_online_mask, and lock the per-cpu locks of
all those cpus. Then while unlocking, we unlock the per-cpu locks of these cpus
(by using that temporary copy of cpu_online_mask we created earlier), without
caring about the cpus actually online at that moment.
IOW, we do lock-unlock on the same set of cpus, and that too, without missing
the complete lock-unlock sequence in any of them. Guaranteed.

[I will provide the changelog later, if people are OK with this approach].

But note that this doesn't actually synchronize with cpu hotplug; it just
tries to work around the issue. Comments/suggestions?

---

 include/linux/lglock.h |    9 +++++++--
 1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..6351ae3 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -122,11 +122,16 @@
  }									\
  EXPORT_SYMBOL(name##_local_unlock_cpu);				\
 									\
+static DECLARE_BITMAP(name##_locked_cpu_bits, CONFIG_NR_CPUS);		\
+static struct cpumask *name##_locked_cpu_mask =				\
+				to_cpumask(name##_locked_cpu_bits);	\
+									\
  void name##_global_lock_online(void) {					\
 	int i;								\
 	preempt_disable();						\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	cpumask_copy(name##_locked_cpu_mask, cpu_online_mask);		\
+	for_each_cpu(i, name##_locked_cpu_mask) {			\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_lock(lock);					\
@@ -137,7 +142,7 @@
  void name##_global_unlock_online(void) {				\
 	int i;								\
 	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, name##_locked_cpu_mask) {			\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_unlock(lock);					\



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 20:23         ` Srivatsa S. Bhat
@ 2011-12-19 20:52           ` Al Viro
  2011-12-20  4:56             ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-19 20:52 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 01:53:42AM +0530, Srivatsa S. Bhat wrote:
> If this new definition of our requirement is acceptable (correct me if I am
> wrong), then we can do something like the following patch, while still
> retaining br locks as non-blocking.
> 
> We make a copy of the current cpu_online_mask, and lock the per-cpu locks of
> all those cpus. Then while unlocking, we unlock the per-cpu locks of these cpus
> (by using that temporary copy of cpu_online_mask we created earlier), without
> caring about the cpus actually online at that moment.
> IOW, we do lock-unlock on the same set of cpus, and that too, without missing
> the complete lock-unlock sequence in any of them. Guaranteed.

	And what's to stop a process on a newly added CPU from _not_
spinning in br_read_lock(), even though br_write_unlock() hadn't been
done yet?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 12:11       ` Al Viro
  2011-12-19 20:23         ` Srivatsa S. Bhat
@ 2011-12-19 23:56         ` Dave Chinner
  2011-12-20  4:05           ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2011-12-19 23:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Srivatsa S. Bhat, Stephen Boyd, mc, linux-kernel, linux-fsdevel,
	Nick Piggin, akpm, Maciej Rutecki

On Mon, Dec 19, 2011 at 12:11:00PM +0000, Al Viro wrote:
> On Mon, Dec 19, 2011 at 04:33:47PM +0530, Srivatsa S. Bhat wrote:
> 
> > IMHO, we don't need to be concerned here because, {get,put}_online_cpus()
> > implement a refcounting solution, and they don't really serialize stuff
> > unnecessarily. The readers (those who prevent cpu hotplug, such as this lock-
> > unlock code) are fast and can be concurrent, while the writers (the task that
> > is doing the cpu hotplug) waits till all existing readers are gone/done with
> > their work.
> > 
> > So, since we are readers here, IMO, we don't have to worry about performance.
> > (I know that we get serialized just for a moment while incrementing the
> > refcount, but that should not be worrisome right?)
> > 
> > Moreover, using for_each_online_cpu() without using {get,put}_online_cpus()
> > around that, is plain wrong, because of the unhandled race with cpu hotplug.
> > IOW, our primary concern here is functionality, isn't it?
> > 
> > To summarize, in the current design of these VFS locks, using
> > {get,put}_online_cpus() is *essential* to fix a functionality-related bug,
> > (and not so bad performance-wise as well).
> > 
> > The following patch (v2) incorporates your comments:
> 
> I really don't like that.  Amount of contention is not a big issue, but the
> fact that now br_write_lock(vfsmount_lock) became blocking is really nasty.
> Moreover, we suddenly get cpu_hotplug.lock nested inside namespace_sem...
> BTW, it's seriously blocking - if nothing else, it waits for cpu_down()
> in progress to complete.  Which can involve any number of interesting
> locks taken by notifiers.
> 
> Dave's variant is also no good; consider this:
> CPU1: br_write_lock(); spinlocks grabbed
> CPU2: br_read_lock(); spinning on one of them
> CPU3: try to take CPU2 down.  We *can't* proceed to the end, notifiers or no
> notifiers, until CPU2 gets through the critical area.  Which can't happen
> until the spinlock is unlocked, i.e. until CPU1 does br_write_unlock().
> Notifier can't silently do spin_unlock() here or we'll get CPU2 free to go
> into the critical area when it's really not safe there.

Yeah, XFS has some, er, complexity to handle this.

Basically, it has global state (the on-disk superblock) that the
per-cpu counters are synchronised back to every so often, and hence the
per-cpu counters can be switched on and off.  There's also a global
state lock that is held through the counter modification slow path and
during notifier operations, and the combination of these is used to
avoid such race conditions.  Hence when a cpu dies, we do:

        case CPU_DEAD:
        case CPU_DEAD_FROZEN:
                /* Disable all the counters, then fold the dead cpu's
                 * count into the total on the global superblock and
                 * re-enable the counters. */
                xfs_icsb_lock(mp);
                spin_lock(&mp->m_sb_lock);
                xfs_icsb_disable_counter(mp, XFS_SBS_ICOUNT);
                xfs_icsb_disable_counter(mp, XFS_SBS_IFREE);
                xfs_icsb_disable_counter(mp, XFS_SBS_FDBLOCKS);
....

Which is basically:

	1. take global counter state modification mutex
	2. take in-core superblock lock (global in-core fs state
	   that the per-cpu counters sync to)
	3. disable each online per-cpu counter
		a. lock all online per-cpu locks for the counter
		b. clear counter enabled bit
		c. unlock all online per-cpu locks

And when the counter is re-enabled after the cleanup of the per-cpu
counter state on the dead CPU, it does it via a rebalancing
operation:

	1. disable each online per-cpu counter
		a. lock all online per-cpu locks for the counter
		b. clear counter enabled bit
		c. unlock all online per-cpu locks
	2. balance counter across all online per-cpu structures
	3. enable each online per-cpu counter:
		a. lock all online per-cpu locks for the counter
		b. set counter enabled bit
		c. unlock all online per-cpu locks
	4. drop in-core superblock lock
	5. drop global counter state modification mutex

Hence, in the situation you describe above, if CPU 2 gets the lock
before the notifier, all is well. In the case it doesn't, it gets
blocked like this:

	prempt_disable()
	if (counter disabled)
		goto slow path
	lock_local_counter()			<<<< spin here
					cpu notifier disables counter
					and unlocks it. We get the lock
	if (counter disabled) {
		unlock_local_counter()
		goto slow path
	}

.....
slow_path:
	preempt_enable()
	xfs_icsb_lock(mp)		<<<< serialises on global notifier lock
					not on any of the spin locks

Like I said, there's quite a bit of complexity in all this to handle
the cpu notifiers in (what I think is) a race-free manner. I've been
looking at replacing all this complexity (it's close to 1000 lines
of code) with the generic per-cpu counters, but that's got its own
problems that involve adding lots of complexity....

> That got one hell of a deadlock potential ;-/  So far I'm more or less
> in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside
> of namespace_sem.  But I still have not convinced myself that it's
> really safe ;-/

Agreed, it looks like a much simpler solution to this problem than a
notifier. But I don't think I know enough about the usage context to
determine if it is safe, either, so I can't really help you there. :/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 23:56         ` Dave Chinner
@ 2011-12-20  4:05           ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2011-12-20  4:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Srivatsa S. Bhat, Stephen Boyd, mc, linux-kernel, linux-fsdevel,
	Nick Piggin, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 10:56:59AM +1100, Dave Chinner wrote:

> > That got one hell of a deadlock potential ;-/  So far I'm more or less
> > in favor of doing get_online_cpus() explicitly in fs/namespace.c, outside
> > of namespace_sem.  But I still have not convinced myself that it's
> > really safe ;-/
> 
> Agreed, it looks like a lot simpler solution to this problem than a
> notifier. But I don't think I know enough about the usage context to
> determine if it is safe, either, so i can't really help you there. :/

That's really nasty; mntput_no_expire() (and thus mntput()) wants
br_write_lock()/br_write_unlock().  Right now we *know* that mntput()
is non-blocking in situations when we are holding more than one reference.
With that kind of change that isn't true anymore - one needs to have
long-term refs to make it safe.  And that's not going to be fun to audit...

Can we get some kind of non-blocking exclusion against CPU hotplug?  Note
that we care about it only for writers, i.e. when we are going to cause
cacheline bouncing from hell, no matter what.

I *really* hate making br_write_lock() blocking, and an explicit
get_online_cpus() around it isn't really any better.  Too much of a PITA
verifying correctness after the locking change.

At this point in the cycle the original patch (loop over all CPUs, online or
not) may turn out to be the only sane variant, as much as it's going to
hurt us.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-19 20:52           ` Al Viro
@ 2011-12-20  4:56             ` Srivatsa S. Bhat
  2011-12-20  6:27               ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20  4:56 UTC (permalink / raw)
  To: Al Viro
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 02:22 AM, Al Viro wrote:

> On Tue, Dec 20, 2011 at 01:53:42AM +0530, Srivatsa S. Bhat wrote:
>> If this new definition of our requirement is acceptable (correct me if I am
>> wrong), then we can do something like the following patch, while still
>> retaining br locks as non-blocking.
>>
>> We make a copy of the current cpu_online_mask, and lock the per-cpu locks of
>> all those cpus. Then while unlocking, we unlock the per-cpu locks of these cpus
>> (by using that temporary copy of cpu_online_mask we created earlier), without
>> caring about the cpus actually online at that moment.
>> IOW, we do lock-unlock on the same set of cpus, and that too, without missing
>> the complete lock-unlock sequence in any of them. Guaranteed.
> 
> 	And what's to stop a process on a newly added CPU from _not_
> spinning in br_read_lock(), even though br_write_unlock() hadn't been
> done yet?
> 


Oh, right, that has to be handled as well...

Hmmm... How about registering a CPU hotplug notifier callback during lock init
time, and then for every cpu that gets onlined (after we took a copy of the
cpu_online_mask to work with), we see if that cpu is different from the ones
we have already locked, and if it is, we lock it in the callback handler and
update the locked_cpu_mask appropriately (so that we release the locks properly
during the unlock operation).

Handling the newly introduced race between the callback handler and lock-unlock
code must not be difficult, I believe..

Any loopholes in this approach? Or is the additional complexity just not worth
it here?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  4:56             ` Srivatsa S. Bhat
@ 2011-12-20  6:27               ` Al Viro
  2011-12-20  7:28                 ` Srivatsa S. Bhat
  2011-12-20  7:30                 ` mengcong
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2011-12-20  6:27 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
> Oh, right, that has to be handled as well...
> 
> Hmmm... How about registering a CPU hotplug notifier callback during lock init
> time, and then for every cpu that gets onlined (after we took a copy of the
> cpu_online_mask to work with), we see if that cpu is different from the ones
> we have already locked, and if it is, we lock it in the callback handler and
> update the locked_cpu_mask appropriately (so that we release the locks properly
> during the unlock operation).
> 
> Handling the newly introduced race between the callback handler and lock-unlock
> code must not be difficult, I believe..
> 
> Any loopholes in this approach? Or is the additional complexity just not worth
> it here?

To summarize the modified variant of that approach hashed out on IRC:

	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
notifier.
	* foo_global_lock_online starts with grabbing that spinlock and
loops over the cpus in that bitmap.
	* foo_global_unlock_online loops over the same bitmap and then drops
that spinlock
	* callback of the notifier is going to do all bitmap updates.  Under
that spinlock.  Events that need handling definitely include the things like
"was going up but failed", since we need the bitmap to contain all online CPUs
at all time, preferably without too much junk beyond that.  IOW, we need to add
it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
that we want to clean up on failed attempt to up it.  Taking a CPU down is
probably less PITA; just clear bit on the final "the sucker's dead" event.
	* bitmap is initialized once, at the same time we set the notifier
up.  Just grab the spinlock and do
	for_each_online_cpu(N)
		add N to bitmap
then release the spinlock and let the callbacks handle all updates.

I think that'll work with relatively little pain, but I'm not familiar enough
with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
to supply the set of events to watch for...
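
A rough sketch of what that could look like inside DEFINE_LGLOCK()
(illustrative only - names and the exact set of notifier events are
approximate; lockdep and the macro's backslash continuations are omitted):

	/* extra per-lock state */
	static DEFINE_SPINLOCK(name##_cpu_lock);
	static DECLARE_BITMAP(name##_cpus, NR_CPUS);

	static int name##_lg_cpu_callback(struct notifier_block *nb,
					  unsigned long action, void *hcpu)
	{
		switch (action & ~CPU_TASKS_FROZEN) {
		case CPU_UP_PREPARE:
			spin_lock(&name##_cpu_lock);
			cpumask_set_cpu((unsigned long)hcpu,
					to_cpumask(name##_cpus));
			spin_unlock(&name##_cpu_lock);
			break;
		case CPU_UP_CANCELED:
		case CPU_DEAD:
			spin_lock(&name##_cpu_lock);
			cpumask_clear_cpu((unsigned long)hcpu,
					  to_cpumask(name##_cpus));
			spin_unlock(&name##_cpu_lock);
			break;
		}
		return NOTIFY_OK;
	}
	/* registered from name##_lock_init(); the bitmap is seeded there
	 * from cpu_online_mask while holding name##_cpu_lock */

	void name##_global_lock_online(void)
	{
		int i;
		/* holding this spinlock also blocks bitmap updates (and
		 * disables preemption) until the matching unlock */
		spin_lock(&name##_cpu_lock);
		for_each_cpu(i, to_cpumask(name##_cpus))
			arch_spin_lock(&per_cpu(name##_lock, i));
	}

	void name##_global_unlock_online(void)
	{
		int i;
		for_each_cpu(i, to_cpumask(name##_cpus))
			arch_spin_unlock(&per_cpu(name##_lock, i));
		spin_unlock(&name##_cpu_lock);
	}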

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  6:27               ` Al Viro
@ 2011-12-20  7:28                 ` Srivatsa S. Bhat
  2011-12-20  9:37                   ` mengcong
  2011-12-20  7:30                 ` mengcong
  1 sibling, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20  7:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Stephen Boyd, mc, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 11:57 AM, Al Viro wrote:

> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
>> Oh, right, that has to be handled as well...
>>
>> Hmmm... How about registering a CPU hotplug notifier callback during lock init
>> time, and then for every cpu that gets onlined (after we took a copy of the
>> cpu_online_mask to work with), we see if that cpu is different from the ones
>> we have already locked, and if it is, we lock it in the callback handler and
>> update the locked_cpu_mask appropriately (so that we release the locks properly
>> during the unlock operation).
>>
>> Handling the newly introduced race between the callback handler and lock-unlock
>> code must not be difficult, I believe..
>>
>> Any loopholes in this approach? Or is the additional complexity just not worth
>> it here?
> 
> To summarize the modified variant of that approach hashed out on IRC:
> 
> 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
> notifier.
> 	* foo_global_lock_online starts with grabbing that spinlock and
> loops over the cpus in that bitmap.
> 	* foo_global_unlock_online loops over the same bitmap and then drops
> that spinlock
> 	* callback of the notifier is going to do all bitmap updates.  Under
> that spinlock.  Events that need handling definitely include the things like
> "was going up but failed", since we need the bitmap to contain all online CPUs
> at all time, preferably without too much junk beyond that.  IOW, we need to add
> it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
> that we want to clean up on failed attempt to up it.  Taking a CPU down is
> probably less PITA; just clear bit on the final "the sucker's dead" event.
> 	* bitmap is initialized once, at the same time we set the notifier
> up.  Just grab the spinlock and do
> 	for_each_online_cpu(N)
> 		add N to bitmap
> then release the spinlock and let the callbacks handle all updates.
> 
> I think that'll work with relatively little pain, but I'm not familiar enough
> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
> to supply the set of events to watch for...
> 


We need not watch out for "up failed" events. It is enough if we handle
CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only
upon successful online or offline operation, and these notifications are
more than enough for our purpose (to update our bitmaps). Also, those cpus
which came online wont start running until these "success notifications"
are all done, which is where we do our stuff in the callback (ie., try
grabbing the spinlock..).

Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD
events), our bitmap will probably be one step behind cpu_online_mask
(which means, we'll still have to take the snapshot of cpu_online_mask and
work with it instead of using for_each_online_cpu()).
But that doesn't matter, as long as:
  * we don't allow the newly onlined CPU to start executing code (this
    is achieved by taking the spinlock in the callback)
  * we stick to our bitmap while taking and releasing the spinlocks.

Both of these have been handled in the design proposed above. So we are good
to go I guess.

I am working on translating all these to working code.. Will post the patch
as soon as I'm done.

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  6:27               ` Al Viro
  2011-12-20  7:28                 ` Srivatsa S. Bhat
@ 2011-12-20  7:30                 ` mengcong
  2011-12-20  7:37                   ` Srivatsa S. Bhat
  1 sibling, 1 reply; 37+ messages in thread
From: mengcong @ 2011-12-20  7:30 UTC (permalink / raw)
  To: Al Viro
  Cc: Srivatsa S. Bhat, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, david, akpm, Maciej Rutecki

On Tue, 2011-12-20 at 06:27 +0000, Al Viro wrote:
> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
> > Oh, right, that has to be handled as well...
> > 
> > Hmmm... How about registering a CPU hotplug notifier callback during lock init
> > time, and then for every cpu that gets onlined (after we took a copy of the
> > cpu_online_mask to work with), we see if that cpu is different from the ones
> > we have already locked, and if it is, we lock it in the callback handler and
> > update the locked_cpu_mask appropriately (so that we release the locks properly
> > during the unlock operation).
> > 
> > Handling the newly introduced race between the callback handler and lock-unlock
> > code must not be difficult, I believe..
> > 
> > Any loopholes in this approach? Or is the additional complexity just not worth
> > it here?
> 
> To summarize the modified variant of that approach hashed out on IRC:
> 
On which IRC do you discuss?

> 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
> notifier.
> 	* foo_global_lock_online starts with grabbing that spinlock and
> loops over the cpus in that bitmap.
> 	* foo_global_unlock_online loops over the same bitmap and then drops
> that spinlock
> 	* callback of the notifier is going to do all bitmap updates.  Under
> that spinlock.  Events that need handling definitely include the things like
> "was going up but failed", since we need the bitmap to contain all online CPUs
> at all time, preferably without too much junk beyond that.  IOW, we need to add
> it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
> that we want to clean up on failed attempt to up it.  Taking a CPU down is
> probably less PITA; just clear bit on the final "the sucker's dead" event.
> 	* bitmap is initialized once, at the same time we set the notifier
> up.  Just grab the spinlock and do
> 	for_each_online_cpu(N)
> 		add N to bitmap
> then release the spinlock and let the callbacks handle all updates.
> 
> I think that'll work with relatively little pain, but I'm not familiar enough
> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
> to supply the set of events to watch for...
> 



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  7:30                 ` mengcong
@ 2011-12-20  7:37                   ` Srivatsa S. Bhat
  0 siblings, 0 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20  7:37 UTC (permalink / raw)
  To: mc
  Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 01:00 PM, mengcong wrote:

> On Tue, 2011-12-20 at 06:27 +0000, Al Viro wrote:
>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
>>> Oh, right, that has to be handled as well...
>>>
>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init
>>> time, and then for every cpu that gets onlined (after we took a copy of the
>>> cpu_online_mask to work with), we see if that cpu is different from the ones
>>> we have already locked, and if it is, we lock it in the callback handler and
>>> update the locked_cpu_mask appropriately (so that we release the locks properly
>>> during the unlock operation).
>>>
>>> Handling the newly introduced race between the callback handler and lock-unlock
>>> code must not be difficult, I believe..
>>>
>>> Any loopholes in this approach? Or is the additional complexity just not worth
>>> it here?
>>
>> To summarize the modified variant of that approach hashed out on IRC:
>>
> On which IRC do you discuss?


#kernel on tinc.sekrit.org :-)

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  7:28                 ` Srivatsa S. Bhat
@ 2011-12-20  9:37                   ` mengcong
  2011-12-20 10:36                     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: mengcong @ 2011-12-20  9:37 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote:
> On 12/20/2011 11:57 AM, Al Viro wrote:
> 
> > On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
> >> Oh, right, that has to be handled as well...
> >>
> >> Hmmm... How about registering a CPU hotplug notifier callback during lock init
> >> time, and then for every cpu that gets onlined (after we took a copy of the
> >> cpu_online_mask to work with), we see if that cpu is different from the ones
> >> we have already locked, and if it is, we lock it in the callback handler and
> >> update the locked_cpu_mask appropriately (so that we release the locks properly
> >> during the unlock operation).
> >>
> >> Handling the newly introduced race between the callback handler and lock-unlock
> >> code must not be difficult, I believe..
> >>
> >> Any loopholes in this approach? Or is the additional complexity just not worth
> >> it here?
> > 
> > To summarize the modified variant of that approach hashed out on IRC:
> > 
> > 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
> > notifier.
> > 	* foo_global_lock_online starts with grabbing that spinlock and
> > loops over the cpus in that bitmap.
> > 	* foo_global_unlock_online loops over the same bitmap and then drops
> > that spinlock
> > 	* callback of the notifier is going to do all bitmap updates.  Under
> > that spinlock.  Events that need handling definitely include the things like
> > "was going up but failed", since we need the bitmap to contain all online CPUs
> > at all time, preferably without too much junk beyond that.  IOW, we need to add
> > it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
> > that we want to clean up on failed attempt to up it.  Taking a CPU down is
> > probably less PITA; just clear bit on the final "the sucker's dead" event.
> > 	* bitmap is initialized once, at the same time we set the notifier
> > up.  Just grab the spinlock and do
> > 	for_each_online_cpu(N)
> > 		add N to bitmap
> > then release the spinlock and let the callbacks handle all updates.
> > 
> > I think that'll work with relatively little pain, but I'm not familiar enough
> > with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
> > to supply the set of events to watch for...
> > 
> 
> 
> We need not watch out for "up failed" events. It is enough if we handle
> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only
> upon successful online or offline operation, and these notifications are
> more than enough for our purpose (to update our bitmaps). Also, those cpus
> which came online wont start running until these "success notifications"
> are all done, which is where we do our stuff in the callback (ie., try
> grabbing the spinlock..).
> 
> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD
> events), our bitmap will probably be one step behind cpu_online_mask
> (which means, we'll still have to take the snapshot of cpu_online_mask and
> work with it instead of using for_each_online_cpu()).
> But that doesn't matter, as long as:
>   * we don't allow the newly onlined CPU to start executing code (this
>     is achieved by taking the spinlock in the callback)

I think the cpu notifier callback doesn't always run on the CPU being
brought up. Actually, it rarely runs on that CPU.
And even if I'm wrong about that, there is still a chance that lglock
operations are scheduled on the CPU being brought up before the callback is called.

>   * we stick to our bitmap while taking and releasing the spinlocks.
> 
> Both of these have been handled in the design proposed above. So we are good
> to go I guess.
> 
> I am working on translating all these to working code.. Will post the patch
> as soon as I'm done.
> 
> Regards,
> Srivatsa S. Bhat



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20  9:37                   ` mengcong
@ 2011-12-20 10:36                     ` Srivatsa S. Bhat
  2011-12-20 11:08                       ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20 10:36 UTC (permalink / raw)
  To: mc
  Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 03:07 PM, mengcong wrote:

> On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote:
>> On 12/20/2011 11:57 AM, Al Viro wrote:
>>
>>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
>>>> Oh, right, that has to be handled as well...
>>>>
>>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init
>>>> time, and then for every cpu that gets onlined (after we took a copy of the
>>>> cpu_online_mask to work with), we see if that cpu is different from the ones
>>>> we have already locked, and if it is, we lock it in the callback handler and
>>>> update the locked_cpu_mask appropriately (so that we release the locks properly
>>>> during the unlock operation).
>>>>
>>>> Handling the newly introduced race between the callback handler and lock-unlock
>>>> code must not be difficult, I believe..
>>>>
>>>> Any loopholes in this approach? Or is the additional complexity just not worth
>>>> it here?
>>>
>>> To summarize the modified variant of that approach hashed out on IRC:
>>>
>>> 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
>>> notifier.
>>> 	* foo_global_lock_online starts with grabbing that spinlock and
>>> loops over the cpus in that bitmap.
>>> 	* foo_global_unlock_online loops over the same bitmap and then drops
>>> that spinlock
>>> 	* callback of the notifier is going to do all bitmap updates.  Under
>>> that spinlock.  Events that need handling definitely include the things like
>>> "was going up but failed", since we need the bitmap to contain all online CPUs
>>> at all time, preferably without too much junk beyond that.  IOW, we need to add
>>> it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
>>> that we want to clean up on failed attempt to up it.  Taking a CPU down is
>>> probably less PITA; just clear bit on the final "the sucker's dead" event.
>>> 	* bitmap is initialized once, at the same time we set the notifier
>>> up.  Just grab the spinlock and do
>>> 	for_each_online_cpu(N)
>>> 		add N to bitmap
>>> then release the spinlock and let the callbacks handle all updates.
>>>
>>> I think that'll work with relatively little pain, but I'm not familiar enough
>>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
>>> to supply the set of events to watch for...
>>>
>>
>>
>> We need not watch out for "up failed" events. It is enough if we handle
>> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only
>> upon successful online or offline operation, and these notifications are
>> more than enough for our purpose (to update our bitmaps). Also, those cpus
>> which came online wont start running until these "success notifications"
>> are all done, which is where we do our stuff in the callback (ie., try
>> grabbing the spinlock..).
>>
>> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD
>> events), our bitmap will probably be one step behind cpu_online_mask
>> (which means, we'll still have to take the snapshot of cpu_online_mask and
>> work with it instead of using for_each_online_cpu()).
>> But that doesn't matter, as long as:
>>   * we don't allow the newly onlined CPU to start executing code (this
>>     is achieved by taking the spinlock in the callback)
> 
> I think cpu notifier callback doesn't always run on the UPing cpu.
> Actually, it rarely runs on the UPing cpu.
> If I was wrong about the above thought, there is still a chance that lg-lock
> operations are scheduled on the UPing cpu before calling the callback.
> 


I wasn't actually banking on that, but you have raised a very good point.
The scheduler uses its own set of cpu hotplug callback handlers to start
using the newly added cpu (see the set of callbacks in kernel/sched.c).

So, now we have a race between our callback and the scheduler's callbacks.
("Placing" our callback appropriately in a safe position using priority
for notifiers doesn't appeal to me that much, since it looks like too much
hackery. It should probably be our last resort).

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 10:36                     ` Srivatsa S. Bhat
@ 2011-12-20 11:08                       ` Srivatsa S. Bhat
  2011-12-20 12:50                         ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20 11:08 UTC (permalink / raw)
  To: mc
  Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 04:06 PM, Srivatsa S. Bhat wrote:

> On 12/20/2011 03:07 PM, mengcong wrote:
> 
>> On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote:
>>> On 12/20/2011 11:57 AM, Al Viro wrote:
>>>
>>>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
>>>>> Oh, right, that has to be handled as well...
>>>>>
>>>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init
>>>>> time, and then for every cpu that gets onlined (after we took a copy of the
>>>>> cpu_online_mask to work with), we see if that cpu is different from the ones
>>>>> we have already locked, and if it is, we lock it in the callback handler and
>>>>> update the locked_cpu_mask appropriately (so that we release the locks properly
>>>>> during the unlock operation).
>>>>>
>>>>> Handling the newly introduced race between the callback handler and lock-unlock
>>>>> code must not be difficult, I believe..
>>>>>
>>>>> Any loopholes in this approach? Or is the additional complexity just not worth
>>>>> it here?
>>>>
>>>> To summarize the modified variant of that approach hashed out on IRC:
>>>>
>>>> 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
>>>> notifier.
>>>> 	* foo_global_lock_online starts with grabbing that spinlock and
>>>> loops over the cpus in that bitmap.
>>>> 	* foo_global_unlock_online loops over the same bitmap and then drops
>>>> that spinlock
>>>> 	* callback of the notifier is going to do all bitmap updates.  Under
>>>> that spinlock.  Events that need handling definitely include the things like
>>>> "was going up but failed", since we need the bitmap to contain all online CPUs
>>>> at all time, preferably without too much junk beyond that.  IOW, we need to add
>>>> it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
>>>> that we want to clean up on failed attempt to up it.  Taking a CPU down is
>>>> probably less PITA; just clear bit on the final "the sucker's dead" event.
>>>> 	* bitmap is initialized once, at the same time we set the notifier
>>>> up.  Just grab the spinlock and do
>>>> 	for_each_online_cpu(N)
>>>> 		add N to bitmap
>>>> then release the spinlock and let the callbacks handle all updates.
>>>>
>>>> I think that'll work with relatively little pain, but I'm not familiar enough
>>>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
>>>> to supply the set of events to watch for...
>>>>
>>>
>>>
>>> We need not watch out for "up failed" events. It is enough if we handle
>>> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only
>>> upon successful online or offline operation, and these notifications are
>>> more than enough for our purpose (to update our bitmaps). Also, those cpus
>>> which came online wont start running until these "success notifications"
>>> are all done, which is where we do our stuff in the callback (ie., try
>>> grabbing the spinlock..).
>>>
>>> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD
>>> events), our bitmap will probably be one step behind cpu_online_mask
>>> (which means, we'll still have to take the snapshot of cpu_online_mask and
>>> work with it instead of using for_each_online_cpu()).
>>> But that doesn't matter, as long as:
>>>   * we don't allow the newly onlined CPU to start executing code (this
>>>     is achieved by taking the spinlock in the callback)
>>
>> I think cpu notifier callback doesn't always run on the UPing cpu.
>> Actually, it rarely runs on the UPing cpu.
>> If I was wrong about the above thought, there is still a chance that lg-lock
>> operations are scheduled on the UPing cpu before calling the callback.
>>
> 
> 
> I wasn't actually banking on that, but you have raised a very good point.
> The scheduler uses its own set of cpu hotplug callback handlers to start
> using the newly added cpu (see the set of callbacks in kernel/sched.c)
> 
> So, now we have a race between our callback and the scheduler's callbacks.
> ("Placing" our callback appropriately in a safe position using priority
> for notifiers doesn't appeal to me that much, since it looks like too much
> hackery. It should probably be our last resort).
> 


Hey, actually there is a simple solution: just nip it (or rather delay it)
in the bud ;) That is, we watch out for the CPU_UP_PREPARE event and grab our
spinlock right there.. that way, that cpu will not come up until we are done
executing br_write_unlock(). In fact, we could even fail the onlining of that
cpu by returning an appropriate value from our callback, but that would be
too harsh.. so we can settle for delaying the cpu online operation :)

And we do the same thing for the CPU_DOWN_PREPARE event. And we can forget about
taking a snapshot of cpu_online_mask, since cpu_online_mask is guaranteed to
remain unmodified until we are done with br_write_unlock().

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 11:08                       ` Srivatsa S. Bhat
@ 2011-12-20 12:50                         ` Srivatsa S. Bhat
  2011-12-20 14:06                           ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20 12:50 UTC (permalink / raw)
  To: mc
  Cc: Al Viro, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 04:38 PM, Srivatsa S. Bhat wrote:

> On 12/20/2011 04:06 PM, Srivatsa S. Bhat wrote:
> 
>> On 12/20/2011 03:07 PM, mengcong wrote:
>>
>>> On Tue, 2011-12-20 at 12:58 +0530, Srivatsa S. Bhat wrote:
>>>> On 12/20/2011 11:57 AM, Al Viro wrote:
>>>>
>>>>> On Tue, Dec 20, 2011 at 10:26:05AM +0530, Srivatsa S. Bhat wrote:
>>>>>> Oh, right, that has to be handled as well...
>>>>>>
>>>>>> Hmmm... How about registering a CPU hotplug notifier callback during lock init
>>>>>> time, and then for every cpu that gets onlined (after we took a copy of the
>>>>>> cpu_online_mask to work with), we see if that cpu is different from the ones
>>>>>> we have already locked, and if it is, we lock it in the callback handler and
>>>>>> update the locked_cpu_mask appropriately (so that we release the locks properly
>>>>>> during the unlock operation).
>>>>>>
>>>>>> Handling the newly introduced race between the callback handler and lock-unlock
>>>>>> code must not be difficult, I believe..
>>>>>>
>>>>>> Any loopholes in this approach? Or is the additional complexity just not worth
>>>>>> it here?
>>>>>
>>>>> To summarize the modified variant of that approach hashed out on IRC:
>>>>>
>>>>> 	* lglock grows three extra things: spinlock, cpu bitmap and cpu hotplug
>>>>> notifier.
>>>>> 	* foo_global_lock_online starts with grabbing that spinlock and
>>>>> loops over the cpus in that bitmap.
>>>>> 	* foo_global_unlock_online loops over the same bitmap and then drops
>>>>> that spinlock
>>>>> 	* callback of the notifier is going to do all bitmap updates.  Under
>>>>> that spinlock.  Events that need handling definitely include the things like
>>>>> "was going up but failed", since we need the bitmap to contain all online CPUs
>>>>> at all time, preferably without too much junk beyond that.  IOW, we need to add
>>>>> it there _before_ low-level __cpu_up() calls set_cpu_online().  Which means
>>>>> that we want to clean up on failed attempt to up it.  Taking a CPU down is
>>>>> probably less PITA; just clear bit on the final "the sucker's dead" event.
>>>>> 	* bitmap is initialized once, at the same time we set the notifier
>>>>> up.  Just grab the spinlock and do
>>>>> 	for_each_online_cpu(N)
>>>>> 		add N to bitmap
>>>>> then release the spinlock and let the callbacks handle all updates.
>>>>>
>>>>> I think that'll work with relatively little pain, but I'm not familiar enough
>>>>> with the cpuhotplug notifiers, so I'd rather have the folks familiar with those
>>>>> to supply the set of events to watch for...
>>>>>
>>>>
>>>>
>>>> We need not watch out for "up failed" events. It is enough if we handle
>>>> CPU_ONLINE and CPU_DEAD events. Because, these 2 events are triggered only
>>>> upon successful online or offline operation, and these notifications are
>>>> more than enough for our purpose (to update our bitmaps). Also, those cpus
>>>> which came online wont start running until these "success notifications"
>>>> are all done, which is where we do our stuff in the callback (ie., try
>>>> grabbing the spinlock..).
>>>>
>>>> Of course, by doing this (only looking out for CPU_ONLINE and CPU_DEAD
>>>> events), our bitmap will probably be one step behind cpu_online_mask
>>>> (which means, we'll still have to take the snapshot of cpu_online_mask and
>>>> work with it instead of using for_each_online_cpu()).
>>>> But that doesn't matter, as long as:
>>>>   * we don't allow the newly onlined CPU to start executing code (this
>>>>     is achieved by taking the spinlock in the callback)
>>>
>>> I think cpu notifier callback doesn't always run on the UPing cpu.
>>> Actually, it rarely runs on the UPing cpu.
>>> If I was wrong about the above thought, there is still a chance that lg-lock
>>> operations are scheduled on the UPing cpu before calling the callback.
>>>
>>
>>
>> I wasn't actually banking on that, but you have raised a very good point.
>> The scheduler uses its own set of cpu hotplug callback handlers to start
>> using the newly added cpu (see the set of callbacks in kernel/sched.c)
>>
>> So, now we have a race between our callback and the scheduler's callbacks.
>> ("Placing" our callback appropriately in a safe position using priority
>> for notifiers doesn't appeal to me that much, since it looks like too much
>> hackery. It should probably be our last resort).
>>
> 
> 
> Hey, actually there is a simple solution: just nip it (or rather delay it)
> in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it
> up there itself using our spinlock.. that way, that cpu will not come up
> until we are done executing br_write_unlock(). In fact, we can even fail
> the onlining of that cpu by returning appropriate value from our callback,
> but that would be too harsh.. so we can settle for delaying the cpu online
> operation :)
> 
> And we do the same thing for CPU_DOWN_PREPARE event. And we can forget about
> taking snapshot of cpu_online_mask, since cpu_online_mask is guaranteed to
> be unmodified until we are done with br_write_unlock().
> 


Had to modify this a bit, to handle a race while using cpu_online_mask.

Anyway, here is the code:

The idea here is that the notifier will grab the spinlock and not release it
until the entire cpu hotplug operation is complete. Hence the pairs
CPU_UP_PREPARE <-> CPU_UP_CANCELED, CPU_ONLINE
and
CPU_DOWN_PREPARE <-> CPU_DOWN_FAILED, CPU_DEAD

However, if the notifier couldn't acquire the spinlock, it will keep spinning
at the CPU_UP_PREPARE or CPU_DOWN_PREPARE event, thereby preventing cpu
hotplug, as well as any updates to cpu_online_mask. Hence, taking a snapshot of
cpu_online_mask becomes unnecessary.

[Actually I have another approach, using CPU_STARTING and CPU_DYING events,
which addresses Cong Meng's concern in a more direct manner, but I would
rather post that patch after some time, to avoid the risk of confusing
everyone. Moreover, that approach is not a "clear winner", so it can wait
until the below patch gets reviewed].

---

 include/linux/lglock.h |   39 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..71079b1 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -75,6 +76,41 @@
  DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
  DEFINE_LGLOCK_LOCKDEP(name);						\
 									\
+DEFINE_SPINLOCK(name##_notifier_lock);					\
+									\
+static int								\
+name##_lg_cpu_callback(struct notifier_block *nb,			\
+				unsigned long action, void *hcpu)	\
+{									\
+	switch (action & ~CPU_TASKS_FROZEN) {				\
+	case CPU_UP_PREPARE:						\
+		spin_lock(&name##_notifier_lock);			\
+		break;							\
+									\
+	case CPU_UP_CANCELED:						\
+	case CPU_ONLINE:						\
+		spin_unlock(&name##_notifier_lock);			\
+		break;							\
+									\
+	case CPU_DOWN_PREPARE:						\
+		spin_lock(&name##_notifier_lock);			\
+		break;							\
+									\
+	case CPU_DOWN_FAILED:						\
+	case CPU_DEAD:							\
+		spin_unlock(&name##_notifier_lock);			\
+		break;							\
+									\
+	default:							\
+		return NOTIFY_DONE;					\
+	}								\
+	return NOTIFY_OK;						\
+}									\
+									\
+static struct notifier_block name##_lg_cpu_notifier = {			\
+	.notifier_call = name##_lg_cpu_callback,			\
+};									\
+									\
  void name##_lock_init(void) {						\
 	int i;								\
 	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
@@ -83,6 +119,7 @@
 		lock = &per_cpu(name##_lock, i);			\
 		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
 	}								\
+	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
  }									\
  EXPORT_SYMBOL(name##_lock_init);					\
 									\
@@ -125,6 +162,7 @@
  void name##_global_lock_online(void) {					\
 	int i;								\
 	preempt_disable();						\
+	spin_lock(&name##_notifier_lock);				\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
 	for_each_online_cpu(i) {					\
 		arch_spinlock_t *lock;					\
@@ -142,6 +180,7 @@
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_unlock(lock);					\
 	}								\
+	spin_unlock(&name##_notifier_lock);				\
 	preempt_enable();						\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\




Regards,
Srivatsa S. Bhat


^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 12:50                         ` Srivatsa S. Bhat
@ 2011-12-20 14:06                           ` Al Viro
  2011-12-20 14:35                             ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-20 14:06 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 06:20:44PM +0530, Srivatsa S. Bhat wrote:

> > Hey, actually there is a simple solution: just nip it (or rather delay it)
> > in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it
> > up there itself using our spinlock.. that way, that cpu will not come up
> > until we are done executing br_write_unlock(). In fact, we can even fail
> > the onlining of that cpu by returning appropriate value from our callback,
> > but that would be too harsh.. so we can settle for delaying the cpu online

Eeeek...  Are you serious about leaving a spinlock grabbed by notifier
callback and not released until another callback call?  That would be one
hell of a constraint on what these notifiers can do - _nothing_ between
these calls (including other notifier callbacks, etc.) would be allowed
to block.

That sounds extremely brittle...

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 14:06                           ` Al Viro
@ 2011-12-20 14:35                             ` Srivatsa S. Bhat
  2011-12-20 17:59                               ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20 14:35 UTC (permalink / raw)
  To: Al Viro
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 07:36 PM, Al Viro wrote:

> On Tue, Dec 20, 2011 at 06:20:44PM +0530, Srivatsa S. Bhat wrote:
> 
>>> Hey, actually there is a simple solution: just nip it (or rather delay it)
>>> in the bud ;) That is, we watch out for CPU_UP_PREPARE event and lock it
>>> up there itself using our spinlock.. that way, that cpu will not come up
>>> until we are done executing br_write_unlock(). In fact, we can even fail
>>> the onlining of that cpu by returning appropriate value from our callback,
>>> but that would be too harsh.. so we can settle for delaying the cpu online
> 
> Eeeek...  Are you serious about leaving a spinlock grabbed by notifier
> callback and not released until another callback call?  That would be one
> hell of a constraint on what these notifiers can do - _nothing_ between
> these calls (including other notifier callbacks, etc.) would be allowed
> to block.
> 
> That sounds extremely brittle...
> 


Sorry but I didn't quite get your point...
No two cpu hotplug operations can race because of the cpu_hotplug lock they
use. Hence, if a cpu online operation begins, it has to succeed or fail
eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline
operations.

Hence a CPU_UP_PREPARE event *will* be followed by a corresponding
CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the
CPU_STARTING event that comes in between, on purpose, so as to avoid the race
with cpu_online_mask). Similar is the story for offline operation.

And if the notifier grabs the spinlock and keeps it locked between these 2
points of a cpu hotplug operation, it ensures that our br locks will spin,
instead of block till the cpu hotplug operation is complete. Isn't this what
we desired all along? "A non-blocking way to sync br locks with cpu hotplug"?

Or am I missing something?

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 14:35                             ` Srivatsa S. Bhat
@ 2011-12-20 17:59                               ` Al Viro
  2011-12-20 19:12                                 ` Srivatsa S. Bhat
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-20 17:59 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 08:05:32PM +0530, Srivatsa S. Bhat wrote:

> Sorry but I didn't quite get your point...
> No two cpu hotplug operations can race because of the cpu_hotplug lock they
> use. Hence, if a cpu online operation begins, it has to succeed or fail
> eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline
> operations.
> 
> Hence a CPU_UP_PREPARE event *will* be followed by a corresponding
> CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the
> CPU_STARTING event that comes in between, on purpose, so as to avoid the race
> with cpu_online_mask). Similar is the story for offline operation.
> 
> And if the notifier grabs the spinlock and keeps it locked between these 2
> points of a cpu hotplug operation, it ensures that our br locks will spin,
> instead of block till the cpu hotplug operation is complete. Isn't this what
> we desired all along? "A non-blocking way to sync br locks with cpu hotplug"?
> 
> Or am I missing something?

The standard reason why losing the timeslice while holding a spinlock means
deadlocks?
CPU1: grabs spinlock
CPU[2..n]: tries to grab the same spinlock, spins
CPU1: does something blocking, process loses timeslice
CPU1: whatever got scheduled there happens to try and grab the same
spinlock and you are stuck.  At that point *all* CPUs are spinning on
that spinlock and your code that would eventually unlock it has no chance
to get any CPU to run on.

Having the callback grab and release a spinlock is fine (as long as you
don't do anything blocking between these spin_lock/spin_unlock).  Having
it leave with spinlock held, though, means that the area where you can't
block has expanded a whole lot.  As I said, brittle...

A quick grep through the actual callbacks immediately shows e.g.
del_timer_sync() done on CPU_DOWN_PREPARE.  And sysfs_remove_group(),
which leads to outright mutex_lock().  And sysfs_remove_link() (ditto).
And via_cputemp_device_remove() (again, mutex_lock()).  And free_irq().
And perf_event_exit_cpu() (mutex_lock()).  And...

IOW, there are shitloads of deadlocks right there.  If your callback's
position in the chain is earlier than any of those, you are screwed.
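
To make the constraint concrete, here is the shape of the problem (a sketch
with made-up names, not code from any posted patch):

static DEFINE_SPINLOCK(lg_cpu_lock);
static DEFINE_MUTEX(some_mutex);

/* early in the notifier chain: returns with the spinlock still held */
static int lg_cpu_callback(struct notifier_block *nb,
			   unsigned long action, void *hcpu)
{
	if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
		spin_lock(&lg_cpu_lock);
	return NOTIFY_OK;
}

/* later in the same chain: perfectly legal for a notifier on its own,
 * but now it sleeps while lg_cpu_lock is held, and whatever gets the
 * CPU next may spin on that same lock with no way to release it */
static int later_cpu_callback(struct notifier_block *nb,
			      unsigned long action, void *hcpu)
{
	if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE) {
		mutex_lock(&some_mutex);	/* may sleep */
		mutex_unlock(&some_mutex);
	}
	return NOTIFY_OK;
}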

No, what I had in mind was different - use the callbacks to maintain a
bitmap that would contain
	a) all CPUs that were online at the moment
	b) ... and not too much else
Updates protected by spinlock; in all cases it gets dropped before the
callback returns.  br_write_lock() grabs that spinlock and iterates over
the set; it *does* leave the spinlock grabbed - that's OK, since all
code between br_write_lock() and br_write_unlock() must be non-blocking
anyway.  br_write_unlock() iterates over the same bitmap (unchanged since
br_write_lock()) and finally drops the spinlock.

AFAICS, what we want in callback is
	CPU_DEAD, CPU_DEAD_FROZEN, CPU_UP_CANCELED, CPU_UP_CANCELED_FROZEN:
		grab spinlock
		remove cpu from bitmap
		drop spinlock
	CPU_UP_PREPARE, CPU_UP_PREPARE_FROZEN
		grab spinlock
		add cpu to bitmap
		drop spinlock
That ought to keep bitmap close to cpu_online_mask, which is enough for
our purposes.
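
For reference, the write side all of this serves looks roughly like the
following (the helper and its body are made up for illustration; vfsmount_lock
is the one existing brlock, in fs/namespace.c).  Everything between the lock
and the unlock runs with every per-cpu spinlock held - and, under the scheme
above, with the extra spinlock held too - so it must not block:

static void example_detach(struct vfsmount *mnt)
{
	br_write_lock(vfsmount_lock);
	/* only non-blocking work is allowed in here */
	list_del_init(&mnt->mnt_hash);
	br_write_unlock(vfsmount_lock);
}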

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 17:59                               ` Al Viro
@ 2011-12-20 19:12                                 ` Srivatsa S. Bhat
  2011-12-20 19:58                                   ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-20 19:12 UTC (permalink / raw)
  To: Al Viro
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/20/2011 11:29 PM, Al Viro wrote:

> On Tue, Dec 20, 2011 at 08:05:32PM +0530, Srivatsa S. Bhat wrote:
> 
>> Sorry but I didn't quite get your point...
>> No two cpu hotplug operations can race because of the cpu_hotplug lock they
>> use. Hence, if a cpu online operation begins, it has to succeed or fail
>> eventually. No other cpu hotplug operation can intervene. Ditto for cpu offline
>> operations.
>>
>> Hence a CPU_UP_PREPARE event *will* be followed by a corresponding
>> CPU_UP_CANCELED or CPU_ONLINE event for the same cpu. (And we ignore the
>> CPU_STARTING event that comes in between, on purpose, so as to avoid the race
>> with cpu_online_mask). Similar is the story for offline operation.
>>
>> And if the notifier grabs the spinlock and keeps it locked between these 2
>> points of a cpu hotplug operation, it ensures that our br locks will spin,
>> instead of block till the cpu hotplug operation is complete. Isn't this what
>> we desired all along? "A non-blocking way to sync br locks with cpu hotplug"?
>>
>> Or am I missing something?
> 
> The standard reason why losing the timeslice while holding a spinlock means
> deadlocks?
> CPU1: grabs spinlock
> CPU[2..n]: tries to grab the same spinlock, spins
> CPU1: does something blocking, process loses timeslice
> CPU1: whatever got scheduled there happens to to try and grab the same
> spinlock and you are stuck.  At that point *all* CPUs are spinning on
> that spinlock and your code that would eventually unlock it has no chance
> to get any CPU to run on.
> 
> Having the callback grab and release a spinlock is fine (as long as you
> don't do anything blocking between these spin_lock/spin_unlock).  Having
> it leave with spinlock held, though, means that the area where you can't
> block has expanded a whole lot.  As I said, brittle...
> 


Ah, now I see your point! Thanks for the explanation.

> A quick grep through the actual callbacks immediately shows e.g.
> del_timer_sync() done on CPU_DOWN_PREPARE.  And sysfs_remove_group(),
> which leads to outright mutex_lock().  And sysfs_remove_link() (ditto).
> And via_cputemp_device_remove() (again, mutex_lock()).  And free_irq().
> And perf_event_exit_cpu() (mutex_lock()).  And...
> 
> IOW, there are shitloads of deadlocks right there.  If your callback's
> position in the chain is earlier than any of those, you are screwed.
> 


The thought makes me shudder!

> No, what I had in mind was different - use the callbacks to maintain a

> bitmap that would contain
> 	a) all CPUs that were online at the moment
> 	b) ... and not too much else
> Updates protected by spinlock; in all cases it gets dropped before the
> callback returns.  br_write_lock() grabs that spinlock and iterates over
> the set; it *does* leave the spinlock grabbed - that's OK, since all
> code between br_write_lock() and br_write_unlock() must be non-blocking
> anyway.  br_write_unlock() iterates over the same bitmap (unchanged since
> br_write_lock()) and finally drops the spinlock.
> 


I had this same thing in mind when I started out to write the patch.. but
after Cong raised that concern, I changed track and in the meantime, tried
to get rid of maintaining our own bitmap as well...
But unfortunately that turned out to be disastrous!

> AFAICS, what we want in callback is
> 	CPU_DEAD, CPU_DEAD_FROZEN, CPU_UP_CANCELLED, CPU_UP_CANCELLED_FROZEN:
> 		grab spinlock
> 		remove cpu from bitmap
> 		drop spinlock
> 	CPU_UP_PREPARE, CPU_UP_PREPARE_FROZEN
> 		grab spinlock
> 		add cpu to bitmap
> 		drop spinlock
> That ought to keep bitmap close to cpu_online_mask, which is enough for
> our purposes.
> 


Yes, that should do. And while initializing our bitmap, we could use
  
get_online_cpus()
make a copy of cpu_online_mask
put_online_cpus()

since blocking here is acceptable, as this is done in the lock_init() code.
Right?

That would be better than

register_hotcpu_notifier(...);
grab spinlock
for_each_online_cpu(N)
  add N to bitmap
release spinlock

because the latter code is not fully race-free (because we don't handle
CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get
updated in-between). But it would still work since cpus going down don't
really pose problems for us.

However the former code is race-free, and we can afford it since we are
free to block at that point.
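
Concretely, the initialization I have in mind is just this (a sketch with a
made-up lock name 'foo'; foo_cpus and foo_lg_cpu_notifier stand for the bitmap
and the hotplug callback discussed above, with the callback body elided):

static cpumask_t foo_cpus;
static struct notifier_block foo_lg_cpu_notifier;	/* .notifier_call: the callback sketched above */

void foo_lock_init(void)
{
	int i;

	/* ... per-cpu spinlock initialisation as before ... */
	register_hotcpu_notifier(&foo_lg_cpu_notifier);

	/* blocking is fine here, so the snapshot can be made race-free */
	get_online_cpus();
	for_each_online_cpu(i)
		cpu_set(i, foo_cpus);
	put_online_cpus();
}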

Regards,
Srivatsa S. Bhat


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 19:12                                 ` Srivatsa S. Bhat
@ 2011-12-20 19:58                                   ` Al Viro
  2011-12-20 22:27                                     ` Dave Chinner
  2011-12-21 21:15                                     ` Srivatsa S. Bhat
  0 siblings, 2 replies; 37+ messages in thread
From: Al Viro @ 2011-12-20 19:58 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote:
> register_hotcpu_notifier(...);
> grab spinlock
> for_each_online_cpu(N)
>   add N to bitmap
> release spinlock
> 
> because the latter code is not fully race-free (because we don't handle
> CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get
> updated in-between). But it would still work since cpus going down don't
> really pose problems for us.

Um?  Sure, that loop can end up adding CPUs on their way down into the set.
And as soon as they get their CPU_DEAD, notifier will prune them out...  Or
is there something I'm missing here?  Anyway, the variant I have here
(untested) follows:

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..1951f67 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -72,9 +73,31 @@
 
 #define DEFINE_LGLOCK(name)						\
 									\
+ DEFINE_SPINLOCK(name##_cpu_lock);					\
+ cpumask_t name##_cpus __read_mostly;					\
  DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
  DEFINE_LGLOCK_LOCKDEP(name);						\
 									\
+ static int								\
+ name##_lg_cpu_callback(struct notifier_block *nb,			\
+				unsigned long action, void *hcpu)	\
+ {									\
+	switch (action & ~CPU_TASKS_FROZEN) {				\
+	case CPU_UP_PREPARE:						\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_set((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+		break;							\
+	case CPU_UP_CANCELED: case CPU_DEAD:				\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_clear((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+	}								\
+	return NOTIFY_OK;						\
+ }									\
+ static struct notifier_block name##_lg_cpu_notifier = {		\
+	.notifier_call = name##_lg_cpu_callback,			\
+ };									\
  void name##_lock_init(void) {						\
 	int i;								\
 	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
@@ -83,6 +106,11 @@
 		lock = &per_cpu(name##_lock, i);			\
 		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
 	}								\
+	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
+	spin_lock(&name##_cpu_lock);					\
+	for_each_online_cpu(i)						\
+		cpu_set(i, name##_cpus);				\
+	spin_unlock(&name##_cpu_lock);					\
  }									\
  EXPORT_SYMBOL(name##_lock_init);					\
 									\
@@ -124,9 +152,9 @@
 									\
  void name##_global_lock_online(void) {					\
 	int i;								\
-	preempt_disable();						\
+	spin_lock(&name##_cpu_lock);					\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, &name##_cpus) {					\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_lock(lock);					\
@@ -137,12 +165,12 @@
  void name##_global_unlock_online(void) {				\
 	int i;								\
 	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, &name##_cpus) {					\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_unlock(lock);					\
 	}								\
-	preempt_enable();						\
+	spin_unlock(&name##_cpu_lock);					\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\
 									\

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 19:58                                   ` Al Viro
@ 2011-12-20 22:27                                     ` Dave Chinner
  2011-12-20 23:31                                       ` Al Viro
  2011-12-21 21:15                                     ` Srivatsa S. Bhat
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2011-12-20 22:27 UTC (permalink / raw)
  To: Al Viro
  Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, akpm, Maciej Rutecki

On Tue, Dec 20, 2011 at 07:58:06PM +0000, Al Viro wrote:
> On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote:
> > register_hotcpu_notifier(...);
> > grab spinlock
> > for_each_online_cpu(N)
> >   add N to bitmap
> > release spinlock
> > 
> > because the latter code is not fully race-free (because we don't handle
> > CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get
> > updated in-between). But it would still work since cpus going down don't
> > really pose problems for us.
> 
> Um?  Sure, that loop can end up adding CPUs on their way down into the set.
> And as soon as they get their CPU_DEAD, notifier will prune them out...  Or
> is there something I'm missing here?  Anyway, the variant I have here
> (untested) follows:

Only thing that concerns me about this patch is the bitmap changing
between lock and unlock operations. i.e.

CPU 1: lock all cpus in mask
CPU 2: brings up new cpu, notifier adds CPU to bitmask
CPU 1: unlock all cpus in mask

And in this case the unlock tries to unlock a cpu that wasn't locked
to start with. It really seems to me that while a global lock is in
progress, the online bitmask cannot be allowed to change.

Perhaps something can be passed between the lock and unlock
operations to be able to detect a changed mask between lock/unlock
operations (e.g. a generation number) and then handle that via a
slow path that unlocks only locks that are active in the online
bitmask?  i.e. all the notifier does is bump the generation count,
and the slow path on the unlock handles everything else?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 22:27                                     ` Dave Chinner
@ 2011-12-20 23:31                                       ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2011-12-20 23:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, akpm, Maciej Rutecki

On Wed, Dec 21, 2011 at 09:27:34AM +1100, Dave Chinner wrote:

> Only thing that concerns me about this patch is the bitmap changing
> between lock and unlock operations. i.e.
> 
> CPU 1: lock all cpus in mask
... after having taken the spinlock
> CPU 2: brings up new cpu, notifier adds CPU to bitmask
... which would also take the same spinlock
> CPU 1: unlock all cpus in mask
... which is where that spinlock would've been released

> And in this case the unlock tries to unlock a cpu that wasn't locked
> to start with. It really seems to me that while a global lock is in
> progress, the online bitmask cannot be allowed to change.

Well, yes.  See spin_lock() in br_write_lock() before the loop and
spin_unlock() in br_write_unlock() after the loop...

> Perhaps something can be passed between the lock and unlock
> operations to be able to detect a changed mask between lock/unlock
> operations (e.g. a generation number) and then handle that via a
> slow path that unlocks only locks that are active in the online
> bitmask?  i.e. all the notifier does is bump the generation count,
> and the slow path on the unlock handles everything else?

???

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-20 19:58                                   ` Al Viro
  2011-12-20 22:27                                     ` Dave Chinner
@ 2011-12-21 21:15                                     ` Srivatsa S. Bhat
  2011-12-21 22:02                                       ` Al Viro
  2011-12-21 22:12                                       ` Andrew Morton
  1 sibling, 2 replies; 37+ messages in thread
From: Srivatsa S. Bhat @ 2011-12-21 21:15 UTC (permalink / raw)
  To: Al Viro
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On 12/21/2011 01:28 AM, Al Viro wrote:

> On Wed, Dec 21, 2011 at 12:42:04AM +0530, Srivatsa S. Bhat wrote:
>> register_hotcpu_notifier(...);
>> grab spinlock
>> for_each_online_cpu(N)
>>   add N to bitmap
>> release spinlock
>>
>> because the latter code is not fully race-free (because we don't handle
>> CPU_DOWN_PREPARE event in the callback and hence cpu_online_mask can get
>> updated in-between). But it would still work since cpus going down don't
>> really pose problems for us.
> 
> Um?  Sure, that loop can end up adding CPUs on their way down into the set.
> And as soon as they get their CPU_DEAD, notifier will prune them out...  Or
> is there something I'm missing here?
>


No, everything is fine overall (as I mentioned earlier - it would still work,
because even though the for_each_online_cpu() thing is racy, that race is
harmless). 

Now considering that we have come to a working solution to this whole issue
(though it looks a bit convoluted), IMHO it is worth trying to make this
whole thing look a bit more intuitive and appear "obviously race-free", at
least at places where we can afford it. IOW, I feel using
{get,put}_online_cpus() as mentioned in my previous post avoids a
"suspicious looking" usage of for_each_online_cpu() :-)

The following patch is the same as the patch you posted, but with this small
change (aimed merely at making the code a bit easier to understand) and a
commit message added. Please point out if this change seems bad for
any reason. And if it is fine, Viro, can you please sign off on this patch?
(since this patch has code contributions from both you and me)

I tested this patch locally - the originally reported issue did not crop up
(soft-lockups in VFS callpaths).

---
From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Subject: [PATCH] VFS: Fix race between CPU hotplug and lglocks

Currently, the *_global_[un]lock_online() routines are not at all synchronized
with CPU hotplug. Soft-lockups detected as a consequence of this race were
reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
for finding out that the root-cause of this issue is the race condition
between br_write_[un]lock() and CPU hotplug, which results in the lock states
getting messed up).

Fixing this race by just adding {get,put}_online_cpus() at appropriate places
in *_global_[un]lock_online() is not a good option, because then
br_write_[un]lock() would suddenly become blocking, whereas they have been
non-blocking all this time, and we would want to keep them that way.

So, overall, we want to ensure 3 things:
1. br_write_lock() and br_write_unlock() must remain as non-blocking.
2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
   for different sets of CPUs.
3. Either prevent any new CPU online operation in between this lock-unlock, or
   ensure that the newly onlined CPU does not proceed with its corresponding
   per-cpu spinlock unlocked.

To achieve all this:
(a) We introduce a new spinlock that is taken by the *_global_lock_online()
    routine and released by the *_global_unlock_online() routine.
(b) We register a callback for CPU hotplug notifications, and this callback
    takes the same spinlock as above.
(c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
    initialized in the lock_init() code, all future updates to it are done in
    the callback, under the above spinlock.
(d) The above bitmap is used (instead of cpu_online_mask) while locking and
    unlocking the per-cpu locks.

The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
br_write_lock-unlock sequence is in progress, the callback keeps spinning,
thus preventing the CPU online operation till the lock-unlock sequence is
complete. This takes care of requirement (3).

The bitmap that we maintain remains unmodified throughout the lock-unlock
sequence, since all updates to it are managed by the callback, which takes
the same spinlock as the one taken by the lock code and released only by the
unlock routine. Combining this with (d) above, satisfies requirement (2).

Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
operations from racing with br_write_lock-unlock, requirement (1) is also
taken care of.

By the way, it is to be noted that a CPU offline operation can actually run
in parallel with our lock-unlock sequence, because our callback doesn't react
to notifications earlier than CPU_DEAD (in order to maintain our bitmap
properly). And this means, since we use our own bitmap (which is stale, on
purpose) during the lock-unlock sequence, we could end up unlocking the
per-cpu lock of an offline CPU (because we had locked it earlier, when the
CPU was online), in order to satisfy requirement (2). But this is harmless,
though it looks a bit awkward.

Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
---

 include/linux/lglock.h |   36 ++++++++++++++++++++++++++++++++----
 1 files changed, 32 insertions(+), 4 deletions(-)

diff --git a/include/linux/lglock.h b/include/linux/lglock.h
index f549056..87f402c 100644
--- a/include/linux/lglock.h
+++ b/include/linux/lglock.h
@@ -22,6 +22,7 @@
 #include <linux/spinlock.h>
 #include <linux/lockdep.h>
 #include <linux/percpu.h>
+#include <linux/cpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
 #define br_lock_init(name)	name##_lock_init()
@@ -72,9 +73,31 @@
 
 #define DEFINE_LGLOCK(name)						\
 									\
+ DEFINE_SPINLOCK(name##_cpu_lock);					\
+ cpumask_t name##_cpus __read_mostly;					\
  DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
  DEFINE_LGLOCK_LOCKDEP(name);						\
 									\
+ static int								\
+ name##_lg_cpu_callback(struct notifier_block *nb,			\
+				unsigned long action, void *hcpu)	\
+ {									\
+	switch (action & ~CPU_TASKS_FROZEN) {				\
+	case CPU_UP_PREPARE:						\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_set((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+		break;							\
+	case CPU_UP_CANCELED: case CPU_DEAD:				\
+		spin_lock(&name##_cpu_lock);				\
+		cpu_clear((unsigned long)hcpu, name##_cpus);		\
+		spin_unlock(&name##_cpu_lock);				\
+	}								\
+	return NOTIFY_OK;						\
+ }									\
+ static struct notifier_block name##_lg_cpu_notifier = {		\
+	.notifier_call = name##_lg_cpu_callback,			\
+ };									\
  void name##_lock_init(void) {						\
 	int i;								\
 	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
@@ -83,6 +106,11 @@
 		lock = &per_cpu(name##_lock, i);			\
 		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
 	}								\
+	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
+	get_online_cpus();						\
+	for_each_online_cpu(i)						\
+		cpu_set(i, name##_cpus);				\
+	put_online_cpus();						\
  }									\
  EXPORT_SYMBOL(name##_lock_init);					\
 									\
@@ -124,9 +152,9 @@
 									\
  void name##_global_lock_online(void) {					\
 	int i;								\
-	preempt_disable();						\
+	spin_lock(&name##_cpu_lock);					\
 	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, &name##_cpus) {					\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_lock(lock);					\
@@ -137,12 +165,12 @@
  void name##_global_unlock_online(void) {				\
 	int i;								\
 	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
+	for_each_cpu(i, &name##_cpus) {					\
 		arch_spinlock_t *lock;					\
 		lock = &per_cpu(name##_lock, i);			\
 		arch_spin_unlock(lock);					\
 	}								\
-	preempt_enable();						\
+	spin_unlock(&name##_cpu_lock);					\
  }									\
  EXPORT_SYMBOL(name##_global_unlock_online);				\
 									\



^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-21 21:15                                     ` Srivatsa S. Bhat
@ 2011-12-21 22:02                                       ` Al Viro
  2011-12-21 22:12                                       ` Andrew Morton
  1 sibling, 0 replies; 37+ messages in thread
From: Al Viro @ 2011-12-21 22:02 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: mc, Stephen Boyd, linux-kernel, linux-fsdevel, Nick Piggin,
	david, akpm, Maciej Rutecki

On Thu, Dec 22, 2011 at 02:45:29AM +0530, Srivatsa S. Bhat wrote:

> The following patch is the same as the patch you posted, but with this small
> change (aimed merely at making the code a bit easier to understand) and a
> commit message added. Please point out if this change seems bad for
> any reason. And if it is fine, Viro, can you please sign-off on this patch?
> (since this patch has code contributions from both you and me)

I can live with that.  I still think it's not particularly useful, but...
Anyway,

ACKed-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

I couldn't care less who ends up in the Author: fields.  I can feed it to
Linus tonight, if you prefer it to go through the VFS tree.  But IMO that's
properly a CPU-hotplug issue and VFS is only involved as it's the only
current user of br-locks.  Should go into -stable as well, all the way back to
2.6.36...

Sad that Nick is still MIA, BTW - it's his code and he'd been Cc'ed on the
thread all along... ;-/

> I tested this patch locally - the originally reported issue did not crop up
> (soft-lockups in VFS callpaths).
> 
> ---
> From: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> Subject: [PATCH] VFS: Fix race between CPU hotplug and lglocks
> 
> Currently, the *_global_[un]lock_online() routines are not at all synchronized
> with CPU hotplug. Soft-lockups detected as a consequence of this race was
> reported earlier at https://lkml.org/lkml/2011/8/24/185. (Thanks to Cong Meng
> for finding out that the root-cause of this issue is the race condition
> between br_write_[un]lock() and CPU hotplug, which results in the lock states
> getting messed up).
> 
> Fixing this race by just adding {get,put}_online_cpus() at appropriate places
> in *_global_[un]lock_online() is not a good option, because, then suddenly
> br_write_[un]lock() would become blocking, whereas they have been kept as
> non-blocking all this time, and we would want to keep them that way.
> 
> So, overall, we want to ensure 3 things:
> 1. br_write_lock() and br_write_unlock() must remain as non-blocking.
> 2. The corresponding lock and unlock of the per-cpu spinlocks must not happen
>    for different sets of CPUs.
> 3. Either prevent any new CPU online operation in between this lock-unlock, or
>    ensure that the newly onlined CPU does not proceed with its corresponding
>    per-cpu spinlock unlocked.
> 
> To achieve all this:
> (a) We introduce a new spinlock that is taken by the *_global_lock_online()
>     routine and released by the *_global_unlock_online() routine.
> (b) We register a callback for CPU hotplug notifications, and this callback
>     takes the same spinlock as above.
> (c) We maintain a bitmap which is close to the cpu_online_mask, and once it is
>     initialized in the lock_init() code, all future updates to it are done in
>     the callback, under the above spinlock.
> (d) The above bitmap is used (instead of cpu_online_mask) while locking and
>     unlocking the per-cpu locks.
> 
> The callback takes the spinlock upon the CPU_UP_PREPARE event. So, if the
> br_write_lock-unlock sequence is in progress, the callback keeps spinning,
> thus preventing the CPU online operation till the lock-unlock sequence is
> complete. This takes care of requirement (3).
> 
> The bitmap that we maintain remains unmodified throughout the lock-unlock
> sequence, since all updates to it are managed by the callback, which takes
> the same spinlock as the one taken by the lock code and released only by the
> unlock routine. Combining this with (d) above, satisfies requirement (2).
> 
> Overall, since we use a spinlock (mentioned in (a)) to prevent CPU hotplug
> operations from racing with br_write_lock-unlock, requirement (1) is also
> taken care of.
> 
> By the way, it is to be noted that a CPU offline operation can actually run
> in parallel with our lock-unlock sequence, because our callback doesn't react
> to notifications earlier than CPU_DEAD (in order to maintain our bitmap
> properly). And this means, since we use our own bitmap (which is stale, on
> purpose) during the lock-unlock sequence, we could end up unlocking the
> per-cpu lock of an offline CPU (because we had locked it earlier, when the
> CPU was online), in order to satisfy requirement (2). But this is harmless,
> though it looks a bit awkward.
> 
> Debugged-by: Cong Meng <mc@linux.vnet.ibm.com>
> Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
> ---
> 
>  include/linux/lglock.h |   36 ++++++++++++++++++++++++++++++++----
>  1 files changed, 32 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/lglock.h b/include/linux/lglock.h
> index f549056..87f402c 100644
> --- a/include/linux/lglock.h
> +++ b/include/linux/lglock.h
> @@ -22,6 +22,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/lockdep.h>
>  #include <linux/percpu.h>
> +#include <linux/cpu.h>
>  
>  /* can make br locks by using local lock for read side, global lock for write */
>  #define br_lock_init(name)	name##_lock_init()
> @@ -72,9 +73,31 @@
>  
>  #define DEFINE_LGLOCK(name)						\
>  									\
> + DEFINE_SPINLOCK(name##_cpu_lock);					\
> + cpumask_t name##_cpus __read_mostly;					\
>   DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
>   DEFINE_LGLOCK_LOCKDEP(name);						\
>  									\
> + static int								\
> + name##_lg_cpu_callback(struct notifier_block *nb,			\
> +				unsigned long action, void *hcpu)	\
> + {									\
> +	switch (action & ~CPU_TASKS_FROZEN) {				\
> +	case CPU_UP_PREPARE:						\
> +		spin_lock(&name##_cpu_lock);				\
> +		cpu_set((unsigned long)hcpu, name##_cpus);		\
> +		spin_unlock(&name##_cpu_lock);				\
> +		break;							\
> +	case CPU_UP_CANCELED: case CPU_DEAD:				\
> +		spin_lock(&name##_cpu_lock);				\
> +		cpu_clear((unsigned long)hcpu, name##_cpus);		\
> +		spin_unlock(&name##_cpu_lock);				\
> +	}								\
> +	return NOTIFY_OK;						\
> + }									\
> + static struct notifier_block name##_lg_cpu_notifier = {		\
> +	.notifier_call = name##_lg_cpu_callback,			\
> + };									\
>   void name##_lock_init(void) {						\
>  	int i;								\
>  	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
> @@ -83,6 +106,11 @@
>  		lock = &per_cpu(name##_lock, i);			\
>  		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
>  	}								\
> +	register_hotcpu_notifier(&name##_lg_cpu_notifier);		\
> +	get_online_cpus();						\
> +	for_each_online_cpu(i)						\
> +		cpu_set(i, name##_cpus);				\
> +	put_online_cpus();						\
>   }									\
>   EXPORT_SYMBOL(name##_lock_init);					\
>  									\
> @@ -124,9 +152,9 @@
>  									\
>   void name##_global_lock_online(void) {					\
>  	int i;								\
> -	preempt_disable();						\
> +	spin_lock(&name##_cpu_lock);					\
>  	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
> -	for_each_online_cpu(i) {					\
> +	for_each_cpu(i, &name##_cpus) {					\
>  		arch_spinlock_t *lock;					\
>  		lock = &per_cpu(name##_lock, i);			\
>  		arch_spin_lock(lock);					\
> @@ -137,12 +165,12 @@
>   void name##_global_unlock_online(void) {				\
>  	int i;								\
>  	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
> -	for_each_online_cpu(i) {					\
> +	for_each_cpu(i, &name##_cpus) {					\
>  		arch_spinlock_t *lock;					\
>  		lock = &per_cpu(name##_lock, i);			\
>  		arch_spin_unlock(lock);					\
>  	}								\
> -	preempt_enable();						\
> +	spin_unlock(&name##_cpu_lock);					\
>   }									\
>   EXPORT_SYMBOL(name##_global_unlock_online);				\
>  									\
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-21 21:15                                     ` Srivatsa S. Bhat
  2011-12-21 22:02                                       ` Al Viro
@ 2011-12-21 22:12                                       ` Andrew Morton
  2011-12-22  7:02                                         ` Al Viro
  1 sibling, 1 reply; 37+ messages in thread
From: Andrew Morton @ 2011-12-21 22:12 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: Al Viro, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, david, Maciej Rutecki

On Thu, 22 Dec 2011 02:45:29 +0530
"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> wrote:

off-topic, but the lockdep stuff in include/linux/lglock.h
(LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.

> --- a/include/linux/lglock.h
> +++ b/include/linux/lglock.h
> @@ -22,6 +22,7 @@
>  #include <linux/spinlock.h>
>  #include <linux/lockdep.h>
>  #include <linux/percpu.h>
> +#include <linux/cpu.h>
>  
>  /* can make br locks by using local lock for read side, global lock for write */
>  #define br_lock_init(name)	name##_lock_init()
> @@ -72,9 +73,31 @@
>  
>  #define DEFINE_LGLOCK(name)						\
>  									\
> + DEFINE_SPINLOCK(name##_cpu_lock);					\
> + cpumask_t name##_cpus __read_mostly;					\
>   DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
>   DEFINE_LGLOCK_LOCKDEP(name);						\
>  									\

also off-topic: we can't define a static lglock with this macro.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-21 22:12                                       ` Andrew Morton
@ 2011-12-22  7:02                                         ` Al Viro
  2011-12-22  7:20                                           ` Andrew Morton
  0 siblings, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-22  7:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, david, Maciej Rutecki

On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote:
> off-topic, but the lockdep stuff in include/linux/lglock.h
> (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.

Um?  See ..._lock_init(); it's used there.

> > + DEFINE_SPINLOCK(name##_cpu_lock);					\
> > + cpumask_t name##_cpus __read_mostly;					\
> >   DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
> >   DEFINE_LGLOCK_LOCKDEP(name);						\
> >  									\
> 
> also off-topic: we can't define a static lglock with this macro.

True...  I don't like the general STL feel of that code, TBH.  The only
thing that stops me from putting all that stuff into a struct and getting
rid of those macros from hell is that we'd need to use alloc_percpu(),
and that means extra overhead, potentially quite painful on the "local"
side of those locks.  It may be worth experimenting with later, but not
at this point in the cycle...
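
As a rough sketch of what the alloc_percpu() variant would look like
(lg_lock_alloc()/lg_lock_free() are made-up names, nothing from this
thread; the point is just that the per-CPU locks become a dynamic
allocation, hence the extra pointer dereference on the local path):

#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/percpu.h>
#include <linux/spinlock.h>

struct dyn_lglock {
	arch_spinlock_t __percpu *lock;
};

static int lg_lock_alloc(struct dyn_lglock *lg)
{
	int cpu;

	lg->lock = alloc_percpu(arch_spinlock_t);
	if (!lg->lock)
		return -ENOMEM;
	for_each_possible_cpu(cpu)
		*per_cpu_ptr(lg->lock, cpu) =
			(arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;
	return 0;
}

static void lg_lock_free(struct dyn_lglock *lg)
{
	free_percpu(lg->lock);
}

The read side then becomes arch_spin_lock(this_cpu_ptr(lg->lock)) behind
preempt_disable(), which is where the extra dereference shows up.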

Anyway, to Linus it goes.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-22  7:02                                         ` Al Viro
@ 2011-12-22  7:20                                           ` Andrew Morton
  2011-12-22  8:08                                             ` Al Viro
  2011-12-22  8:22                                             ` Andi Kleen
  0 siblings, 2 replies; 37+ messages in thread
From: Andrew Morton @ 2011-12-22  7:20 UTC (permalink / raw)
  To: Al Viro
  Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, david, Maciej Rutecki, Andi Kleen

On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:

> On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote:
> > off-topic, but the lockdep stuff in include/linux/lglock.h
> > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.
> 
> Um?  See ..._lock_init(); it's used there.

oops, I had Andi's patch applied.

Wanna take a look at it while things are fresh in your mind?


From: Andi Kleen <ak@linux.intel.com>
Subject: brlocks/lglocks: clean up code

lglocks and brlocks are currently generated with some complicated macros
in lglock.h.  But there's no reason I can see to not just use common
utility functions that get pointers to the lglock.

Since there are at least two users it makes sense to share this code in a
library.

This will also make it later possible to dynamically allocate lglocks.

In general the users now look more like normal function calls with
pointers, not magic macros.

The patch is rather large because I move over all users in one go to keep
it bisectable.  This impacts the VFS somewhat in terms of lines changed,
but there is no actual behaviour change.
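
As a compressed usage sketch (the helper is hypothetical; the calls
mirror the do_remount() hunk further down), a write-side user now reads:

	/* the brlock is a real object; every helper takes a pointer */
	DEFINE_BRLOCK(vfsmount_lock);

	static void example_touch_ns(struct mnt_namespace *ns)
	{
		br_write_lock(&vfsmount_lock);   /* was: br_write_lock(vfsmount_lock) */
		touch_mnt_namespace(ns);
		br_write_unlock(&vfsmount_lock); /* was: br_write_unlock(vfsmount_lock) */
	}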

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/dcache.c            |   10 +-
 fs/file_table.c        |   23 ++---
 fs/internal.h          |    3 
 fs/namei.c             |   26 +++---
 fs/namespace.c         |  124 +++++++++++++++----------------
 fs/pnode.c             |    4 -
 include/linux/lglock.h |  154 ++++++---------------------------------
 kernel/Makefile        |    2 
 kernel/lglock.c        |  101 +++++++++++++++++++++++++
 9 files changed, 223 insertions(+), 224 deletions(-)

diff -puN fs/dcache.c~brlocks-lglocks-clean-up-code fs/dcache.c
--- a/fs/dcache.c~brlocks-lglocks-clean-up-code
+++ a/fs/dcache.c
@@ -2454,7 +2454,7 @@ static int prepend_path(const struct pat
 	bool slash = false;
 	int error = 0;
 
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	while (dentry != root->dentry || vfsmnt != root->mnt) {
 		struct dentry * parent;
 
@@ -2485,7 +2485,7 @@ static int prepend_path(const struct pat
 		error = prepend(buffer, buflen, "/", 1);
 
 out:
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	return error;
 
 global_root:
@@ -2859,11 +2859,11 @@ int path_is_under(struct path *path1, st
 	struct dentry *dentry = path1->dentry;
 	int res;
 
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	if (mnt != path2->mnt) {
 		for (;;) {
 			if (mnt->mnt_parent == mnt) {
-				br_read_unlock(vfsmount_lock);
+				br_read_unlock(&vfsmount_lock);
 				return 0;
 			}
 			if (mnt->mnt_parent == path2->mnt)
@@ -2873,7 +2873,7 @@ int path_is_under(struct path *path1, st
 		dentry = mnt->mnt_mountpoint;
 	}
 	res = is_subdir(dentry, path2->dentry);
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	return res;
 }
 EXPORT_SYMBOL(path_is_under);
diff -puN fs/file_table.c~brlocks-lglocks-clean-up-code fs/file_table.c
--- a/fs/file_table.c~brlocks-lglocks-clean-up-code
+++ a/fs/file_table.c
@@ -34,7 +34,6 @@ struct files_stat_struct files_stat = {
 	.max_files = NR_FILE
 };
 
-DECLARE_LGLOCK(files_lglock);
 DEFINE_LGLOCK(files_lglock);
 
 /* SLAB cache for file structures */
@@ -422,9 +421,9 @@ static inline void __file_sb_list_add(st
  */
 void file_sb_list_add(struct file *file, struct super_block *sb)
 {
-	lg_local_lock(files_lglock);
+	lg_local_lock(&files_lglock);
 	__file_sb_list_add(file, sb);
-	lg_local_unlock(files_lglock);
+	lg_local_unlock(&files_lglock);
 }
 
 /**
@@ -437,9 +436,9 @@ void file_sb_list_add(struct file *file,
 void file_sb_list_del(struct file *file)
 {
 	if (!list_empty(&file->f_u.fu_list)) {
-		lg_local_lock_cpu(files_lglock, file_list_cpu(file));
+		lg_local_lock_cpu(&files_lglock, file_list_cpu(file));
 		list_del_init(&file->f_u.fu_list);
-		lg_local_unlock_cpu(files_lglock, file_list_cpu(file));
+		lg_local_unlock_cpu(&files_lglock, file_list_cpu(file));
 	}
 }
 
@@ -478,7 +477,7 @@ int fs_may_remount_ro(struct super_block
 {
 	struct file *file;
 	/* Check that no files are currently opened for writing. */
-	lg_global_lock(files_lglock);
+	lg_global_lock(&files_lglock);
 	do_file_list_for_each_entry(sb, file) {
 		struct inode *inode = file->f_path.dentry->d_inode;
 
@@ -490,10 +489,10 @@ int fs_may_remount_ro(struct super_block
 		if (S_ISREG(inode->i_mode) && (file->f_mode & FMODE_WRITE))
 			goto too_bad;
 	} while_file_list_for_each_entry;
-	lg_global_unlock(files_lglock);
+	lg_global_unlock(&files_lglock);
 	return 1; /* Tis' cool bro. */
 too_bad:
-	lg_global_unlock(files_lglock);
+	lg_global_unlock(&files_lglock);
 	return 0;
 }
 
@@ -509,7 +508,7 @@ void mark_files_ro(struct super_block *s
 	struct file *f;
 
 retry:
-	lg_global_lock(files_lglock);
+	lg_global_lock(&files_lglock);
 	do_file_list_for_each_entry(sb, f) {
 		struct vfsmount *mnt;
 		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
@@ -526,12 +525,12 @@ retry:
 		file_release_write(f);
 		mnt = mntget(f->f_path.mnt);
 		/* This can sleep, so we can't hold the spinlock. */
-		lg_global_unlock(files_lglock);
+		lg_global_unlock(&files_lglock);
 		mnt_drop_write(mnt);
 		mntput(mnt);
 		goto retry;
 	} while_file_list_for_each_entry;
-	lg_global_unlock(files_lglock);
+	lg_global_unlock(&files_lglock);
 }
 
 void __init files_init(unsigned long mempages)
@@ -549,6 +548,6 @@ void __init files_init(unsigned long mem
 	n = (mempages * (PAGE_SIZE / 1024)) / 10;
 	files_stat.max_files = max_t(unsigned long, n, NR_FILE);
 	files_defer_init();
-	lg_lock_init(files_lglock);
+	lg_lock_init(&files_lglock, "files_lglock");
 	percpu_counter_init(&nr_files, 0);
 } 
diff -puN fs/internal.h~brlocks-lglocks-clean-up-code fs/internal.h
--- a/fs/internal.h~brlocks-lglocks-clean-up-code
+++ a/fs/internal.h
@@ -77,8 +77,7 @@ extern void mnt_make_shortterm(struct vf
 
 extern void __init mnt_init(void);
 
-DECLARE_BRLOCK(vfsmount_lock);
-
+extern struct lglock vfsmount_lock;
 
 /*
  * fs_struct.c
diff -puN fs/namei.c~brlocks-lglocks-clean-up-code fs/namei.c
--- a/fs/namei.c~brlocks-lglocks-clean-up-code
+++ a/fs/namei.c
@@ -463,7 +463,7 @@ static int unlazy_walk(struct nameidata 
 	mntget(nd->path.mnt);
 
 	rcu_read_unlock();
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	nd->flags &= ~LOOKUP_RCU;
 	return 0;
 
@@ -521,14 +521,14 @@ static int complete_walk(struct nameidat
 		if (unlikely(!__d_rcu_to_refcount(dentry, nd->seq))) {
 			spin_unlock(&dentry->d_lock);
 			rcu_read_unlock();
-			br_read_unlock(vfsmount_lock);
+			br_read_unlock(&vfsmount_lock);
 			return -ECHILD;
 		}
 		BUG_ON(nd->inode != dentry->d_inode);
 		spin_unlock(&dentry->d_lock);
 		mntget(nd->path.mnt);
 		rcu_read_unlock();
-		br_read_unlock(vfsmount_lock);
+		br_read_unlock(&vfsmount_lock);
 	}
 
 	if (likely(!(nd->flags & LOOKUP_JUMPED)))
@@ -693,15 +693,15 @@ int follow_up(struct path *path)
 	struct vfsmount *parent;
 	struct dentry *mountpoint;
 
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	parent = path->mnt->mnt_parent;
 	if (parent == path->mnt) {
-		br_read_unlock(vfsmount_lock);
+		br_read_unlock(&vfsmount_lock);
 		return 0;
 	}
 	mntget(parent);
 	mountpoint = dget(path->mnt->mnt_mountpoint);
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	dput(path->dentry);
 	path->dentry = mountpoint;
 	mntput(path->mnt);
@@ -833,7 +833,7 @@ static int follow_managed(struct path *p
 			/* Something is mounted on this dentry in another
 			 * namespace and/or whatever was mounted there in this
 			 * namespace got unmounted before we managed to get the
-			 * vfsmount_lock */
+			 * &vfsmount_lock */
 		}
 
 		/* Handle an automount point */
@@ -959,7 +959,7 @@ failed:
 	if (!(nd->flags & LOOKUP_ROOT))
 		nd->root.mnt = NULL;
 	rcu_read_unlock();
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	return -ECHILD;
 }
 
@@ -1253,7 +1253,7 @@ static void terminate_walk(struct nameid
 		if (!(nd->flags & LOOKUP_ROOT))
 			nd->root.mnt = NULL;
 		rcu_read_unlock();
-		br_read_unlock(vfsmount_lock);
+		br_read_unlock(&vfsmount_lock);
 	}
 }
 
@@ -1487,7 +1487,7 @@ static int path_init(int dfd, const char
 		nd->path = nd->root;
 		nd->inode = inode;
 		if (flags & LOOKUP_RCU) {
-			br_read_lock(vfsmount_lock);
+			br_read_lock(&vfsmount_lock);
 			rcu_read_lock();
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
 		} else {
@@ -1500,7 +1500,7 @@ static int path_init(int dfd, const char
 
 	if (*name=='/') {
 		if (flags & LOOKUP_RCU) {
-			br_read_lock(vfsmount_lock);
+			br_read_lock(&vfsmount_lock);
 			rcu_read_lock();
 			set_root_rcu(nd);
 		} else {
@@ -1513,7 +1513,7 @@ static int path_init(int dfd, const char
 			struct fs_struct *fs = current->fs;
 			unsigned seq;
 
-			br_read_lock(vfsmount_lock);
+			br_read_lock(&vfsmount_lock);
 			rcu_read_lock();
 
 			do {
@@ -1549,7 +1549,7 @@ static int path_init(int dfd, const char
 			if (fput_needed)
 				*fp = file;
 			nd->seq = __read_seqcount_begin(&nd->path.dentry->d_seq);
-			br_read_lock(vfsmount_lock);
+			br_read_lock(&vfsmount_lock);
 			rcu_read_lock();
 		} else {
 			path_get(&file->f_path);
diff -puN fs/namespace.c~brlocks-lglocks-clean-up-code fs/namespace.c
--- a/fs/namespace.c~brlocks-lglocks-clean-up-code
+++ a/fs/namespace.c
@@ -419,7 +419,7 @@ static int mnt_make_readonly(struct vfsm
 {
 	int ret = 0;
 
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	mnt->mnt_flags |= MNT_WRITE_HOLD;
 	/*
 	 * After storing MNT_WRITE_HOLD, we'll read the counters. This store
@@ -453,15 +453,15 @@ static int mnt_make_readonly(struct vfsm
 	 */
 	smp_wmb();
 	mnt->mnt_flags &= ~MNT_WRITE_HOLD;
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	return ret;
 }
 
 static void __mnt_unmake_readonly(struct vfsmount *mnt)
 {
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	mnt->mnt_flags &= ~MNT_READONLY;
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 }
 
 static void free_vfsmnt(struct vfsmount *mnt)
@@ -508,10 +508,10 @@ struct vfsmount *lookup_mnt(struct path 
 {
 	struct vfsmount *child_mnt;
 
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	if ((child_mnt = __lookup_mnt(path->mnt, path->dentry, 1)))
 		mntget(child_mnt);
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 	return child_mnt;
 }
 
@@ -776,34 +776,34 @@ static void mntput_no_expire(struct vfsm
 {
 put_again:
 #ifdef CONFIG_SMP
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	if (likely(atomic_read(&mnt->mnt_longterm))) {
 		mnt_dec_count(mnt);
-		br_read_unlock(vfsmount_lock);
+		br_read_unlock(&vfsmount_lock);
 		return;
 	}
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	mnt_dec_count(mnt);
 	if (mnt_get_count(mnt)) {
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 		return;
 	}
 #else
 	mnt_dec_count(mnt);
 	if (likely(mnt_get_count(mnt)))
 		return;
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 #endif
 	if (unlikely(mnt->mnt_pinned)) {
 		mnt_add_count(mnt, mnt->mnt_pinned + 1);
 		mnt->mnt_pinned = 0;
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 		acct_auto_close_mnt(mnt);
 		goto put_again;
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	mntfree(mnt);
 }
 
@@ -828,20 +828,20 @@ EXPORT_SYMBOL(mntget);
 
 void mnt_pin(struct vfsmount *mnt)
 {
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	mnt->mnt_pinned++;
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 }
 EXPORT_SYMBOL(mnt_pin);
 
 void mnt_unpin(struct vfsmount *mnt)
 {
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	if (mnt->mnt_pinned) {
 		mnt_inc_count(mnt);
 		mnt->mnt_pinned--;
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 }
 EXPORT_SYMBOL(mnt_unpin);
 
@@ -931,12 +931,12 @@ int mnt_had_events(struct proc_mounts *p
 	struct mnt_namespace *ns = p->ns;
 	int res = 0;
 
-	br_read_lock(vfsmount_lock);
+	br_read_lock(&vfsmount_lock);
 	if (p->m.poll_event != ns->event) {
 		p->m.poll_event = ns->event;
 		res = 1;
 	}
-	br_read_unlock(vfsmount_lock);
+	br_read_unlock(&vfsmount_lock);
 
 	return res;
 }
@@ -1157,12 +1157,12 @@ int may_umount_tree(struct vfsmount *mnt
 	struct vfsmount *p;
 
 	/* write lock needed for mnt_get_count */
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	for (p = mnt; p; p = next_mnt(p, mnt)) {
 		actual_refs += mnt_get_count(p);
 		minimum_refs += 2;
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 
 	if (actual_refs > minimum_refs)
 		return 0;
@@ -1189,10 +1189,10 @@ int may_umount(struct vfsmount *mnt)
 {
 	int ret = 1;
 	down_read(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	if (propagate_mount_busy(mnt, 2))
 		ret = 0;
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_read(&namespace_sem);
 	return ret;
 }
@@ -1209,13 +1209,13 @@ void release_mounts(struct list_head *he
 			struct dentry *dentry;
 			struct vfsmount *m;
 
-			br_write_lock(vfsmount_lock);
+			br_write_lock(&vfsmount_lock);
 			dentry = mnt->mnt_mountpoint;
 			m = mnt->mnt_parent;
 			mnt->mnt_mountpoint = mnt->mnt_root;
 			mnt->mnt_parent = mnt;
 			m->mnt_ghosts--;
-			br_write_unlock(vfsmount_lock);
+			br_write_unlock(&vfsmount_lock);
 			dput(dentry);
 			mntput(m);
 		}
@@ -1281,12 +1281,12 @@ static int do_umount(struct vfsmount *mn
 		 * probably don't strictly need the lock here if we examined
 		 * all race cases, but it's a slowpath.
 		 */
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		if (mnt_get_count(mnt) != 2) {
-			br_write_unlock(vfsmount_lock);
+			br_write_unlock(&vfsmount_lock);
 			return -EBUSY;
 		}
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 
 		if (!xchg(&mnt->mnt_expiry_mark, 1))
 			return -EAGAIN;
@@ -1328,7 +1328,7 @@ static int do_umount(struct vfsmount *mn
 	}
 
 	down_write(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	event++;
 
 	if (!(flags & MNT_DETACH))
@@ -1340,7 +1340,7 @@ static int do_umount(struct vfsmount *mn
 			umount_tree(mnt, 1, &umount_list);
 		retval = 0;
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	return retval;
@@ -1452,19 +1452,19 @@ struct vfsmount *copy_tree(struct vfsmou
 			q = clone_mnt(p, p->mnt_root, flag);
 			if (!q)
 				goto Enomem;
-			br_write_lock(vfsmount_lock);
+			br_write_lock(&vfsmount_lock);
 			list_add_tail(&q->mnt_list, &res->mnt_list);
 			attach_mnt(q, &path);
-			br_write_unlock(vfsmount_lock);
+			br_write_unlock(&vfsmount_lock);
 		}
 	}
 	return res;
 Enomem:
 	if (res) {
 		LIST_HEAD(umount_list);
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		umount_tree(res, 0, &umount_list);
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 		release_mounts(&umount_list);
 	}
 	return NULL;
@@ -1483,9 +1483,9 @@ void drop_collected_mounts(struct vfsmou
 {
 	LIST_HEAD(umount_list);
 	down_write(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	umount_tree(mnt, 0, &umount_list);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 }
@@ -1613,7 +1613,7 @@ static int attach_recursive_mnt(struct v
 	if (err)
 		goto out_cleanup_ids;
 
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 
 	if (IS_MNT_SHARED(dest_mnt)) {
 		for (p = source_mnt; p; p = next_mnt(p, source_mnt))
@@ -1632,7 +1632,7 @@ static int attach_recursive_mnt(struct v
 		list_del_init(&child->mnt_hash);
 		commit_tree(child);
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 
 	return 0;
 
@@ -1729,10 +1729,10 @@ static int do_change_type(struct path *p
 			goto out_unlock;
 	}
 
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	for (m = mnt; m; m = (recurse ? next_mnt(m, mnt) : NULL))
 		change_mnt_propagation(m, type);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 
  out_unlock:
 	up_write(&namespace_sem);
@@ -1779,9 +1779,9 @@ static int do_loopback(struct path *path
 
 	err = graft_tree(mnt, path);
 	if (err) {
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		umount_tree(mnt, 0, &umount_list);
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 	}
 out2:
 	unlock_mount(path);
@@ -1838,16 +1838,16 @@ static int do_remount(struct path *path,
 	else
 		err = do_remount_sb(sb, flags, data, 0);
 	if (!err) {
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		mnt_flags |= path->mnt->mnt_flags & MNT_PROPAGATION_MASK;
 		path->mnt->mnt_flags = mnt_flags;
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 	}
 	up_write(&sb->s_umount);
 	if (!err) {
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		touch_mnt_namespace(path->mnt->mnt_ns);
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 	}
 	return err;
 }
@@ -2052,9 +2052,9 @@ fail:
 	/* remove m from any expiration list it may be on */
 	if (!list_empty(&m->mnt_expire)) {
 		down_write(&namespace_sem);
-		br_write_lock(vfsmount_lock);
+		br_write_lock(&vfsmount_lock);
 		list_del_init(&m->mnt_expire);
-		br_write_unlock(vfsmount_lock);
+		br_write_unlock(&vfsmount_lock);
 		up_write(&namespace_sem);
 	}
 	mntput(m);
@@ -2070,11 +2070,11 @@ fail:
 void mnt_set_expiry(struct vfsmount *mnt, struct list_head *expiry_list)
 {
 	down_write(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 
 	list_add_tail(&mnt->mnt_expire, expiry_list);
 
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_write(&namespace_sem);
 }
 EXPORT_SYMBOL(mnt_set_expiry);
@@ -2094,7 +2094,7 @@ void mark_mounts_for_expiry(struct list_
 		return;
 
 	down_write(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 
 	/* extract from the expiration list every vfsmount that matches the
 	 * following criteria:
@@ -2113,7 +2113,7 @@ void mark_mounts_for_expiry(struct list_
 		touch_mnt_namespace(mnt->mnt_ns);
 		umount_tree(mnt, 1, &umounts);
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_write(&namespace_sem);
 
 	release_mounts(&umounts);
@@ -2376,9 +2376,9 @@ void mnt_make_shortterm(struct vfsmount 
 #ifdef CONFIG_SMP
 	if (atomic_add_unless(&mnt->mnt_longterm, -1, 1))
 		return;
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	atomic_dec(&mnt->mnt_longterm);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 #endif
 }
 
@@ -2406,9 +2406,9 @@ static struct mnt_namespace *dup_mnt_ns(
 		kfree(new_ns);
 		return ERR_PTR(-ENOMEM);
 	}
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	list_add_tail(&new_ns->list, &new_ns->root->mnt_list);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 
 	/*
 	 * Second pass: switch the tsk->fs->* elements and mark new vfsmounts
@@ -2647,7 +2647,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 			goto out4;
 	} else if (!is_subdir(old.dentry, new.dentry))
 		goto out4;
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	detach_mnt(new.mnt, &parent_path);
 	detach_mnt(root.mnt, &root_parent);
 	/* mount old root on put_old */
@@ -2655,7 +2655,7 @@ SYSCALL_DEFINE2(pivot_root, const char _
 	/* mount new_root on / */
 	attach_mnt(new.mnt, &root_parent);
 	touch_mnt_namespace(current->nsproxy->mnt_ns);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	chroot_fs_refs(&root, &new);
 	error = 0;
 out4:
@@ -2718,7 +2718,7 @@ void __init mnt_init(void)
 	for (u = 0; u < HASH_SIZE; u++)
 		INIT_LIST_HEAD(&mount_hashtable[u]);
 
-	br_lock_init(vfsmount_lock);
+	br_lock_init(&vfsmount_lock);
 
 	err = sysfs_init();
 	if (err)
@@ -2738,9 +2738,9 @@ void put_mnt_ns(struct mnt_namespace *ns
 	if (!atomic_dec_and_test(&ns->count))
 		return;
 	down_write(&namespace_sem);
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	umount_tree(ns->root, 0, &umount_list);
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	up_write(&namespace_sem);
 	release_mounts(&umount_list);
 	kfree(ns);
diff -puN fs/pnode.c~brlocks-lglocks-clean-up-code fs/pnode.c
--- a/fs/pnode.c~brlocks-lglocks-clean-up-code
+++ a/fs/pnode.c
@@ -273,12 +273,12 @@ int propagate_mnt(struct vfsmount *dest_
 		prev_src_mnt  = child;
 	}
 out:
-	br_write_lock(vfsmount_lock);
+	br_write_lock(&vfsmount_lock);
 	while (!list_empty(&tmp_list)) {
 		child = list_first_entry(&tmp_list, struct vfsmount, mnt_hash);
 		umount_tree(child, 0, &umount_list);
 	}
-	br_write_unlock(vfsmount_lock);
+	br_write_unlock(&vfsmount_lock);
 	release_mounts(&umount_list);
 	return ret;
 }
diff -puN include/linux/lglock.h~brlocks-lglocks-clean-up-code include/linux/lglock.h
--- a/include/linux/lglock.h~brlocks-lglocks-clean-up-code
+++ a/include/linux/lglock.h
@@ -24,26 +24,14 @@
 #include <linux/percpu.h>
 
 /* can make br locks by using local lock for read side, global lock for write */
-#define br_lock_init(name)	name##_lock_init()
-#define br_read_lock(name)	name##_local_lock()
-#define br_read_unlock(name)	name##_local_unlock()
-#define br_write_lock(name)	name##_global_lock_online()
-#define br_write_unlock(name)	name##_global_unlock_online()
+#define br_lock_init(name)	lg_lock_init(name, #name)
+#define br_read_lock(name)	lg_local_lock(name)
+#define br_read_unlock(name)	lg_local_unlock(name)
+#define br_write_lock(name)	lg_global_lock_online(name)
+#define br_write_unlock(name)	lg_global_unlock_online(name)
 
-#define DECLARE_BRLOCK(name)	DECLARE_LGLOCK(name)
 #define DEFINE_BRLOCK(name)	DEFINE_LGLOCK(name)
 
-
-#define lg_lock_init(name)	name##_lock_init()
-#define lg_local_lock(name)	name##_local_lock()
-#define lg_local_unlock(name)	name##_local_unlock()
-#define lg_local_lock_cpu(name, cpu)	name##_local_lock_cpu(cpu)
-#define lg_local_unlock_cpu(name, cpu)	name##_local_unlock_cpu(cpu)
-#define lg_global_lock(name)	name##_global_lock()
-#define lg_global_unlock(name)	name##_global_unlock()
-#define lg_global_lock_online(name) name##_global_lock_online()
-#define lg_global_unlock_online(name) name##_global_unlock_online()
-
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
 #define LOCKDEP_INIT_MAP lockdep_init_map
 
@@ -58,115 +46,27 @@
 #define DEFINE_LGLOCK_LOCKDEP(name)
 #endif
 
+struct lglock {
+	arch_spinlock_t __percpu *lock;
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lock_class_key lock_key;
+	struct lockdep_map    lock_dep_map;
+#endif 
+};
+
+#define DEFINE_LGLOCK(name) \
+	DEFINE_PER_CPU(arch_spinlock_t, name ## _lock) = __ARCH_SPIN_LOCK_UNLOCKED; \
+	struct lglock name = { .lock = &name ## _lock }
+
+/* Only valid for statics */
+void lg_lock_init(struct lglock *lg, char *name);
+void lg_local_lock(struct lglock *lg);
+void lg_local_unlock(struct lglock *lg);
+void lg_local_lock_cpu(struct lglock *lg, int cpu);
+void lg_local_unlock_cpu(struct lglock *lg, int cpu);
+void lg_global_lock_online(struct lglock *lg);
+void lg_global_unlock_online(struct lglock *lg);
+void lg_global_lock(struct lglock *lg);
+void lg_global_unlock(struct lglock *lg);
 
-#define DECLARE_LGLOCK(name)						\
- extern void name##_lock_init(void);					\
- extern void name##_local_lock(void);					\
- extern void name##_local_unlock(void);					\
- extern void name##_local_lock_cpu(int cpu);				\
- extern void name##_local_unlock_cpu(int cpu);				\
- extern void name##_global_lock(void);					\
- extern void name##_global_unlock(void);				\
- extern void name##_global_lock_online(void);				\
- extern void name##_global_unlock_online(void);				\
-
-#define DEFINE_LGLOCK(name)						\
-									\
- DEFINE_PER_CPU(arch_spinlock_t, name##_lock);				\
- DEFINE_LGLOCK_LOCKDEP(name);						\
-									\
- void name##_lock_init(void) {						\
-	int i;								\
-	LOCKDEP_INIT_MAP(&name##_lock_dep_map, #name, &name##_lock_key, 0); \
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		*lock = (arch_spinlock_t)__ARCH_SPIN_LOCK_UNLOCKED;	\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_lock_init);					\
-									\
- void name##_local_lock(void) {						\
-	arch_spinlock_t *lock;						\
-	preempt_disable();						\
-	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
-	lock = &__get_cpu_var(name##_lock);				\
-	arch_spin_lock(lock);						\
- }									\
- EXPORT_SYMBOL(name##_local_lock);					\
-									\
- void name##_local_unlock(void) {					\
-	arch_spinlock_t *lock;						\
-	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
-	lock = &__get_cpu_var(name##_lock);				\
-	arch_spin_unlock(lock);						\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_local_unlock);					\
-									\
- void name##_local_lock_cpu(int cpu) {					\
-	arch_spinlock_t *lock;						\
-	preempt_disable();						\
-	rwlock_acquire_read(&name##_lock_dep_map, 0, 0, _THIS_IP_);	\
-	lock = &per_cpu(name##_lock, cpu);				\
-	arch_spin_lock(lock);						\
- }									\
- EXPORT_SYMBOL(name##_local_lock_cpu);					\
-									\
- void name##_local_unlock_cpu(int cpu) {				\
-	arch_spinlock_t *lock;						\
-	rwlock_release(&name##_lock_dep_map, 1, _THIS_IP_);		\
-	lock = &per_cpu(name##_lock, cpu);				\
-	arch_spin_unlock(lock);						\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_local_unlock_cpu);				\
-									\
- void name##_global_lock_online(void) {					\
-	int i;								\
-	preempt_disable();						\
-	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_lock(lock);					\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_global_lock_online);				\
-									\
- void name##_global_unlock_online(void) {				\
-	int i;								\
-	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_online_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_unlock(lock);					\
-	}								\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_global_unlock_online);				\
-									\
- void name##_global_lock(void) {					\
-	int i;								\
-	preempt_disable();						\
-	rwlock_acquire(&name##_lock_dep_map, 0, 0, _RET_IP_);		\
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_lock(lock);					\
-	}								\
- }									\
- EXPORT_SYMBOL(name##_global_lock);					\
-									\
- void name##_global_unlock(void) {					\
-	int i;								\
-	rwlock_release(&name##_lock_dep_map, 1, _RET_IP_);		\
-	for_each_possible_cpu(i) {					\
-		arch_spinlock_t *lock;					\
-		lock = &per_cpu(name##_lock, i);			\
-		arch_spin_unlock(lock);					\
-	}								\
-	preempt_enable();						\
- }									\
- EXPORT_SYMBOL(name##_global_unlock);
 #endif
diff -puN kernel/Makefile~brlocks-lglocks-clean-up-code kernel/Makefile
--- a/kernel/Makefile~brlocks-lglocks-clean-up-code
+++ a/kernel/Makefile
@@ -6,7 +6,7 @@ obj-y     = fork.o exec_domain.o panic.o
 	    cpu.o exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o sysctl_binary.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o extable.o params.o posix-timers.o \
+	    rcupdate.o extable.o params.o posix-timers.o lglock.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o cred.o \
diff -puN /dev/null kernel/lglock.c
--- /dev/null
+++ a/kernel/lglock.c
@@ -0,0 +1,101 @@
+/* See include/linux/lglock.h for description */
+#include <linux/module.h>
+#include <linux/lglock.h>
+
+void lg_lock_init(struct lglock *lg, char *name)
+{
+	LOCKDEP_INIT_MAP(&lg->lock_dep_map, name, &lg->lock_key, 0);
+}
+EXPORT_SYMBOL(lg_lock_init);
+
+void lg_local_lock(struct lglock *lg)
+{
+	arch_spinlock_t *lock;
+	preempt_disable();
+	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	lock = this_cpu_ptr(lg->lock);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock);
+
+void lg_local_unlock(struct lglock *lg)
+{
+	arch_spinlock_t *lock;
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	lock = this_cpu_ptr(lg->lock);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock);
+
+void lg_local_lock_cpu(struct lglock *lg, int cpu)
+{
+	arch_spinlock_t *lock;
+	preempt_disable();
+	rwlock_acquire_read(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	lock = per_cpu_ptr(lg->lock, cpu);
+	arch_spin_lock(lock);
+}
+EXPORT_SYMBOL(lg_local_lock_cpu);
+
+void lg_local_unlock_cpu(struct lglock *lg, int cpu)
+{
+	arch_spinlock_t *lock;
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	lock = per_cpu_ptr(lg->lock, cpu);
+	arch_spin_unlock(lock);
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_local_unlock_cpu);
+
+void lg_global_lock_online(struct lglock *lg)
+{
+	int i;
+	preempt_disable();
+	rwlock_acquire(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	for_each_online_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_lock(lock);
+	}
+}
+EXPORT_SYMBOL(lg_global_lock_online);
+
+void lg_global_unlock_online(struct lglock *lg)
+{
+	int i;
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	for_each_online_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_unlock(lock);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_global_unlock_online);
+
+void lg_global_lock(struct lglock *lg)
+{
+	int i;
+	preempt_disable();
+	rwlock_acquire(&lg->lock_dep_map, 0, 0, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_lock(lock);
+	}
+}
+EXPORT_SYMBOL(lg_global_lock);
+
+void lg_global_unlock(struct lglock *lg)
+{
+	int i;
+	rwlock_release(&lg->lock_dep_map, 1, _RET_IP_);
+	for_each_possible_cpu(i) {
+		arch_spinlock_t *lock;
+		lock = per_cpu_ptr(lg->lock, i);
+		arch_spin_unlock(lock);
+	}
+	preempt_enable();
+}
+EXPORT_SYMBOL(lg_global_unlock);
_


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-22  7:20                                           ` Andrew Morton
@ 2011-12-22  8:08                                             ` Al Viro
  2011-12-22  8:17                                               ` Andi Kleen
  2011-12-22  8:22                                             ` Andi Kleen
  1 sibling, 1 reply; 37+ messages in thread
From: Al Viro @ 2011-12-22  8:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel, linux-fsdevel,
	Nick Piggin, david, Maciej Rutecki, Andi Kleen

On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote:
> > > off-topic, but the lockdep stuff in include/linux/lglock.h
> > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.
> > 
> > Um?  See ..._lock_init(); it's used there.
> 
> oops, I had Andi's patch applied.
> 
> Wanna take a look at it while things are fresh in your mind?

a) tons of trivial conflicts with fs/namespace.c changes in my tree
b) more seriously, the question of overhead - see the mail you replied
to.

I really, really don't like templates, so in general I would prefer to do
something fairly close to what Andi seems to have done in that patch.
However, we *are* talking about some fairly hot paths, and I'm really not
comfortable with making that kind of change this late in the cycle.
If somebody cares to take Andi's patch and see what it does to overhead,
I'd love to see the results; we might do that in the next merge window.
Again, the idea itself is obviously OK; the only open question is the
performance impact.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-22  8:08                                             ` Al Viro
@ 2011-12-22  8:17                                               ` Andi Kleen
  2011-12-22  8:39                                                 ` Al Viro
  0 siblings, 1 reply; 37+ messages in thread
From: Andi Kleen @ 2011-12-22  8:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Andrew Morton, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel,
	linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen

On Thu, Dec 22, 2011 at 08:08:56AM +0000, Al Viro wrote:
> On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote:
> > On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> > 
> > > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote:
> > > > off-topic, but the lockdep stuff in include/linux/lglock.h
> > > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.
> > > 
> > > Um?  See ..._lock_init(); it's used there.
> > 
> > oops, I had Andi's patch applied.
> > 
> > Wanna take a look at it while things are fresh in your mind?
> 
> a) tons of trivial conflicts with fs/namespace.c changes in my tree
> b) more seriously, the question of overhead - see the mail you replied
> to.
> 

The costly operations here are the atomics and nothing really changes
for them. So I don't expect any measurable difference.

I actually have an idea to remove them for the common case, but not in 
that patchkit or cycle :)

I can run an ftrace comparison if you want, but I expect any difference
to be below the measurement noise.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-22  7:20                                           ` Andrew Morton
  2011-12-22  8:08                                             ` Al Viro
@ 2011-12-22  8:22                                             ` Andi Kleen
  1 sibling, 0 replies; 37+ messages in thread
From: Andi Kleen @ 2011-12-22  8:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Al Viro, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel,
	linux-fsdevel, Nick Piggin, david, Maciej Rutecki, Andi Kleen

On Wed, Dec 21, 2011 at 11:20:47PM -0800, Andrew Morton wrote:
> On Thu, 22 Dec 2011 07:02:15 +0000 Al Viro <viro@ZenIV.linux.org.uk> wrote:
> 
> > On Wed, Dec 21, 2011 at 02:12:29PM -0800, Andrew Morton wrote:
> > > off-topic, but the lockdep stuff in include/linux/lglock.h
> > > (LOCKDEP_INIT_MAP and DEFINE_LGLOCK_LOCKDEP) appears to be dead code.
> > 
> > Um?  See ..._lock_init(); it's used there.
> 
> oops, I had Andi's patch applied.
> 
> Wanna take a look at it while things are fresh in your mind?

Double checked: DEFINE_LGLOCK_LOCKDEP is dead now and can be removed.
The stuff it declared is in the struct lglock now.

I can send a patch for that tomorrow if you haven't done it already.

LOCKDEP_INIT_MAP should still be used in lg_lock_init(), and that is
still called.

-Andi

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs
  2011-12-22  8:17                                               ` Andi Kleen
@ 2011-12-22  8:39                                                 ` Al Viro
  0 siblings, 0 replies; 37+ messages in thread
From: Al Viro @ 2011-12-22  8:39 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andrew Morton, Srivatsa S. Bhat, mc, Stephen Boyd, linux-kernel,
	linux-fsdevel, Nick Piggin, david, Maciej Rutecki

On Thu, Dec 22, 2011 at 09:17:57AM +0100, Andi Kleen wrote:

> The costly operations here are the atomics and nothing really changes
> for them. So I don't expect any measurable difference.
> 
> I actually have an idea to remove them for the common case, but not in 
> that patchkit or cycle :)
> 
> I can run a ftrace if you want, but I expect any difference to be below
> the measurement inaccuracy.

What I'm concerned with is not the cost of the extra dereference per se;
it's more about cacheline bouncing - note that with this fix in place
you'll end up with the spinlock in the same cacheline as the pointer to
the per-cpu stuff.  Hell knows; it might not matter at all, since we
take it only for br_write_lock() (and the definitely rare CPU up/down
path), but still, I'd like to see the data.  In any case, that's not
-stable material.  The race fix, OTOH, is...
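
Purely as an illustration of that concern (hypothetical, not from any
posted patch), one way to keep the hotplug bookkeeping off the fast-path
cacheline would be explicit alignment in a combined structure:

#include <linux/cache.h>
#include <linux/cpumask.h>
#include <linux/lockdep.h>
#include <linux/spinlock.h>

struct lglock_padded {
	/* dereferenced on every lg_local_lock()/br_read_lock() */
	arch_spinlock_t __percpu *lock;
#ifdef CONFIG_DEBUG_LOCK_ALLOC
	struct lock_class_key lock_key;
	struct lockdep_map lock_dep_map;
#endif
	/* write-side/hotplug state, pushed onto its own cacheline so the
	 * read side never bounces on it */
	spinlock_t cpu_lock ____cacheline_aligned_in_smp;
	cpumask_t cpus;
};

Whether that padding is worth the space is exactly the kind of question
the requested measurements would answer.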

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2011-12-22  8:40 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-19  3:36 [PATCH] VFS: br_write_lock locks on possible CPUs other than online CPUs mengcong
2011-12-19  4:11 ` Al Viro
2011-12-19  5:00   ` Dave Chinner
2011-12-19  6:07     ` mengcong
2011-12-19  7:31 ` Srivatsa S. Bhat
2011-12-19  9:12   ` Stephen Boyd
2011-12-19 11:03     ` Srivatsa S. Bhat
2011-12-19 12:11       ` Al Viro
2011-12-19 20:23         ` Srivatsa S. Bhat
2011-12-19 20:52           ` Al Viro
2011-12-20  4:56             ` Srivatsa S. Bhat
2011-12-20  6:27               ` Al Viro
2011-12-20  7:28                 ` Srivatsa S. Bhat
2011-12-20  9:37                   ` mengcong
2011-12-20 10:36                     ` Srivatsa S. Bhat
2011-12-20 11:08                       ` Srivatsa S. Bhat
2011-12-20 12:50                         ` Srivatsa S. Bhat
2011-12-20 14:06                           ` Al Viro
2011-12-20 14:35                             ` Srivatsa S. Bhat
2011-12-20 17:59                               ` Al Viro
2011-12-20 19:12                                 ` Srivatsa S. Bhat
2011-12-20 19:58                                   ` Al Viro
2011-12-20 22:27                                     ` Dave Chinner
2011-12-20 23:31                                       ` Al Viro
2011-12-21 21:15                                     ` Srivatsa S. Bhat
2011-12-21 22:02                                       ` Al Viro
2011-12-21 22:12                                       ` Andrew Morton
2011-12-22  7:02                                         ` Al Viro
2011-12-22  7:20                                           ` Andrew Morton
2011-12-22  8:08                                             ` Al Viro
2011-12-22  8:17                                               ` Andi Kleen
2011-12-22  8:39                                                 ` Al Viro
2011-12-22  8:22                                             ` Andi Kleen
2011-12-20  7:30                 ` mengcong
2011-12-20  7:37                   ` Srivatsa S. Bhat
2011-12-19 23:56         ` Dave Chinner
2011-12-20  4:05           ` Al Viro
