* [rfc] "fair" rw spinlocks
@ 2009-11-23 14:54 Nick Piggin
  2009-11-24 20:19 ` David Miller
                   ` (4 more replies)
  0 siblings, 5 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-23 14:54 UTC (permalink / raw)
  To: Linux Kernel Mailing List, Linus Torvalds

Hi,

Last time this issue came up that I could see, I don't think
there were objections to making rwlocks fair, the main
difficulty seemed to be that we allow reentrant read locks
(so a write lock waiting must not block arbitrary read lockers).

Nowadays our rwlock usage is smaller, although there are still quite a
few, so it would make more sense to do a conversion by
introducing a new lock type and moving them over, I guess.

Anyway, I would like to add some kind of fairness or at least
anti starvation for writers. We have a customer seeing total
livelock on tasklist_lock for write locking on a system as small
as 8 core Opteron.

This was basically reproduced by several cores executing wait
with WNOHANG.
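
A rough sketch of the shape of that reproducer (illustration only; the
names and details here are assumptions, not the actual test case):

        #include <pthread.h>
        #include <sys/wait.h>
        #include <unistd.h>

        /* each waitpid(WNOHANG) takes tasklist_lock for read in the
         * kernel, so one spinning thread per core keeps the lock
         * read-held almost continuously */
        static void *hammer(void *arg)
        {
                int status;

                for (;;)
                        waitpid(-1, &status, WNOHANG);
                return NULL;
        }

        int main(void)
        {
                long i, ncpus = sysconf(_SC_NPROCESSORS_ONLN);
                pthread_t t;

                for (i = 0; i < ncpus; i++)
                        pthread_create(&t, NULL, hammer, NULL);

                /* fork/exit need tasklist_lock for write and starve */
                for (;;) {
                        if (fork() == 0)
                                _exit(0);
                }
        }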

Of course it would always be nice to improve locking so
contention isn't an issue, but so long as we have rwlocks, we
could possibly get into a situation where starvation is
triggered *somehow*. So I'd really like to fix this.

This particular starvation on tasklist lock I guess is a local
DoS vulnerability even if the workload is not particularly
realistic.

Anyway, I don't have a patch yet. I'm sure it can be done
without extra atomics in fastpaths. Comments?



* Re: [rfc] "fair" rw spinlocks
  2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
@ 2009-11-24 20:19 ` David Miller
  2009-11-25  6:52   ` Nick Piggin
  2009-11-25  8:49   ` Andi Kleen
  2009-11-24 20:47 ` Andi Kleen
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 58+ messages in thread
From: David Miller @ 2009-11-24 20:19 UTC (permalink / raw)
  To: npiggin; +Cc: linux-kernel, torvalds

From: Nick Piggin <npiggin@suse.de>
Date: Mon, 23 Nov 2009 15:54:09 +0100

> This was basically reproduced by several cores executing wait
> with WNOHANG.
> 
> Of course it would always be nice to improve locking so
> contention isn't an issue, but so long as we have rwlocks, we
> could possibly get into a situation where starvation is
> triggered *somehow*. So I'd really like to fix this.
> 
> This particular starvation on tasklist lock I guess is a local
> DoS vulnerability even if the workload is not particularly
> realistic.
> 
> Anyway, I don't have a patch yet. I'm sure it can be done
> without extra atomics in fastpaths. Comments?

I think nobody would notice if you changed tasklist_lock into
a spinlock_t, and this would solve the DoS because at least on
x86 you'd end up with the ticket spinlocks.

And this is a repeating theme every time the topic of rwlocks comes up.
All uses should just simply be converted gradually to some other
locking mechanism over time, the cases causing problems taking
priority.
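
For reference, the ticket idea in heavily simplified, illustrative C
(the real x86 implementation differs, but the fairness property is the
same: lockers are served strictly in arrival order, so nobody starves):

        struct ticket_lock {
                unsigned int next_ticket;       /* next ticket to hand out */
                unsigned int now_serving;       /* ticket allowed to hold the lock */
        };

        static void ticket_lock(struct ticket_lock *l)
        {
                /* atomically take a ticket number */
                unsigned int me = __sync_fetch_and_add(&l->next_ticket, 1);

                /* spin until our number comes up: strict FIFO order */
                while (*(volatile unsigned int *)&l->now_serving != me)
                        ;       /* cpu_relax() in real kernel code */
        }

        static void ticket_unlock(struct ticket_lock *l)
        {
                /* pass the lock to the next waiter in line */
                __sync_fetch_and_add(&l->now_serving, 1);
        }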


* Re: [rfc] "fair" rw spinlocks
  2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
  2009-11-24 20:19 ` David Miller
@ 2009-11-24 20:47 ` Andi Kleen
  2009-11-25  6:54   ` Nick Piggin
  2009-11-28  2:07 ` Paul E. McKenney
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2009-11-24 20:47 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, Linus Torvalds

Nick Piggin <npiggin@suse.de> writes:

> Hi,
>
> Last time this issue came up that I could see, I don't think
> there were objections to making rwlocks fair, the main
> difficulty seemed to be that we allow reentrant read locks
> (so a write lock waiting must not block arbitrary read lockers).
>
> Nowadays our rwlock usage is smaller, although there are still quite a
> few, so it would make more sense to do a conversion by
> introducing a new lock type and moving them over, I guess.

You want to do a new lock type for potentially nested rwlocks?

From the basic idea it sounds good, but according
to grep the current tree has hundreds of rwlocks all over,
and how would you reliably detect whether they are nestable
or not?

I assume it's not something that could be easily analyzed
at compile time and relying on runtime would seem dangerous.

Basically it sounds like quite a lot of work.

A better plan might be to have new types which are non nestable
and only move over audited code to fair rwlocks.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [rfc] "fair" rw spinlocks
  2009-11-24 20:19 ` David Miller
@ 2009-11-25  6:52   ` Nick Piggin
  2009-11-25  8:49   ` Andi Kleen
  1 sibling, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-25  6:52 UTC (permalink / raw)
  To: David Miller; +Cc: linux-kernel, torvalds

On Tue, Nov 24, 2009 at 12:19:59PM -0800, David Miller wrote:
> From: Nick Piggin <npiggin@suse.de>
> Date: Mon, 23 Nov 2009 15:54:09 +0100
> 
> > This was basically reproduced by several cores executing wait
> > with WNOHANG.
> > 
> > Of course it would always be nice to improve locking so
> > contention isn't an issue, but so long as we have rwlocks, we
> > could possibly get into a situation where starvation is
> > triggered *somehow*. So I'd really like to fix this.
> > 
> > This particular starvation on tasklist lock I guess is a local
> > DoS vulnerability even if the workload is not particularly
> > realistic.
> > 
> > Anyway, I don't have a patch yet. I'm sure it can be done
> > without extra atomics in fastpaths. Comments?
> 
> I think nobody would notice if you changed tasklist_lock into
> a spinlock_t, and this would solve the DoS because at least on
> x86 you'd end up with the ticket spinlocks.
> 
> And this is a repeating theme every time the topic of rwlocks comes up.
> All uses should just simply be converted gradually to some other
> locking mechanism over time, the cases causing problems taking
> priority.

For tasklist_lock, I'm not so sure. It gets taken fairly often
for read, and has some quite long critical sections. I think
we might start running into contention on lock hold times.

For other locks, I agree. In fact there are a couple of them in
the vfs which I think should be just spinlocks. I'll send some
patches for those.



* Re: [rfc] "fair" rw spinlocks
  2009-11-24 20:47 ` Andi Kleen
@ 2009-11-25  6:54   ` Nick Piggin
  2009-11-25  8:48     ` Andi Kleen
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2009-11-25  6:54 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Kernel Mailing List, Linus Torvalds

On Tue, Nov 24, 2009 at 09:47:56PM +0100, Andi Kleen wrote:
> Nick Piggin <npiggin@suse.de> writes:
> 
> > Hi,
> >
> > Last time this issue came up that I could see, I don't think
> > there were objections to making rwlocks fair, the main
> > difficulty seemed to be that we allow reentrant read locks
> > (so a write lock waiting must not block arbitrary read lockers).
> >
> > Nowadays our rwlock usage is smaller, although there are still quite a
> > few, so it would make more sense to do a conversion by
> > introducing a new lock type and moving them over, I guess.
> 
> You want to do a new lock type for potentially nested rwlocks?
> 
> From the basic idea it sounds good, but according
> to grep the current tree has hundreds of rwlocks all over,
> and how would you reliably detect whether they are nestable
> or not?
> 
> I assume it's not something that could be easily analyzed
> at compile time and relying on runtime would seem dangerous.
> 
> Basically it sounds like quite a lot of work.
> 
> A better plan might be to have new types which are non nestable
> and only move over audited code to fair rwlocks.

No that's what I meant, the new type would be non nestable.




* Re: [rfc] "fair" rw spinlocks
  2009-11-25  6:54   ` Nick Piggin
@ 2009-11-25  8:48     ` Andi Kleen
  2009-11-25 13:09       ` Arnd Bergmann
  0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2009-11-25  8:48 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andi Kleen, Linux Kernel Mailing List, Linus Torvalds

> No that's what I meant, the new type would be non nestable.

Okay.  Unfortunately it's still quite a lot of work.

%git/linux-2.6> gid rwlock_t | wc -l
221

Just tasklist_lock alone would be tough I guess, on level with one of your
nasty VFS locks that protect a thousand different things without any
comments.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [rfc] "fair" rw spinlocks
  2009-11-24 20:19 ` David Miller
  2009-11-25  6:52   ` Nick Piggin
@ 2009-11-25  8:49   ` Andi Kleen
  2009-11-25  8:56     ` Nick Piggin
  1 sibling, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2009-11-25  8:49 UTC (permalink / raw)
  To: David Miller; +Cc: npiggin, linux-kernel, torvalds

David Miller <davem@davemloft.net> writes:
>
> I think nobody would notice if you changed tasklist_lock into
> a spinlock_t, and this would solve the DoS because at least on
> x86 you'd end up with the ticket spinlocks.

iirc tasklist_lock was one of those that could potentially be nested.

So just turning it into a spinlock might deadlock.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [rfc] "fair" rw spinlocks
  2009-11-25  8:49   ` Andi Kleen
@ 2009-11-25  8:56     ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-25  8:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: David Miller, linux-kernel, torvalds

On Wed, Nov 25, 2009 at 09:49:23AM +0100, Andi Kleen wrote:
> David Miller <davem@davemloft.net> writes:
> >
> > I think nobody would notice if you changed tasklist_lock into
> > a spinlock_t, and this would solve the DoS because at least on
> > x86 you'd end up with the ticket spinlocks.
> 
> iirc tasklist_lock was one of those that could potentially be nested.
> 
> So just turning it into a spinlock might deadlock.

It can be nested I think due to interrupt context. But if the read
side disables interrupts too, then I think it should be OK (though
I haven't audited this closely).



* Re: [rfc] "fair" rw spinlocks
  2009-11-25  8:48     ` Andi Kleen
@ 2009-11-25 13:09       ` Arnd Bergmann
  0 siblings, 0 replies; 58+ messages in thread
From: Arnd Bergmann @ 2009-11-25 13:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds

On Wednesday 25 November 2009, Andi Kleen wrote:
> > No that's what I meant, the new type would be non nestable.
> 
> Okay.  Unfortunately it's still quite a lot of work.
> 
> %git/linux-2.6> gid rwlock_t | wc -l
> 221
> 
> Just tasklist_lock alone would be tough I guess, on level with one of your
> nasty VFS locks that protect a thousand different things without any
> comments.

You missed all the static DEFINE_RWLOCK instances, which are another
130 or so. Fortunately many of them are in the obscure category.
Looking only at the files used in an x86 defconfig build, less than
80 remain, mostly networking related.

	Arnd <><

arch/x86/kernel/amd_iommu.c:static DEFINE_RWLOCK(amd_iommu_devtable_lock);
drivers/md/dm.c:	rwlock_t map_lock;
drivers/md/dm-region-hash.c:	rwlock_t hash_lock;
drivers/scsi/sg.c:	rwlock_t rq_list_lock;	/* protect access to list in req_arr */
drivers/scsi/sg.c:static DEFINE_RWLOCK(sg_index_lock);	/* Also used to lock
fs/binfmt_misc.c:static DEFINE_RWLOCK(entries_lock);
fs/exec.c:static DEFINE_RWLOCK(binfmt_lock);
fs/fcntl.c:static DEFINE_RWLOCK(fasync_lock);
fs/filesystems.c:static DEFINE_RWLOCK(file_systems_lock);
fs/proc/kcore.c:static DEFINE_RWLOCK(kclist_lock);
include/linux/atalk.h:extern rwlock_t atalk_interfaces_lock;
include/linux/atalk.h:extern rwlock_t atalk_routes_lock;
include/linux/atalk.h:extern rwlock_t atalk_sockets_lock;
include/linux/atmdev.h:extern rwlock_t vcc_sklist_lock;
include/linux/fs.h:	rwlock_t lock;          /* protects pid, uid, euid fields */
include/linux/fs_struct.h:	rwlock_t lock;
include/linux/inetdevice.h:	rwlock_t		mc_list_lock;
include/linux/leds.h:	rwlock_t	  leddev_list_lock;
include/linux/netdevice.h:extern rwlock_t				dev_base_lock;		/* Device list lock */
include/linux/sched.h:extern rwlock_t tasklist_lock;
include/linux/sunrpc/cache.h:	rwlock_t		hash_lock;
include/linux/vmalloc.h:extern rwlock_t vmlist_lock;
include/net/act_api.h:	rwlock_t		*lock;
include/net/bluetooth/bluetooth.h:	rwlock_t          lock;
include/net/if_inet6.h:	rwlock_t		lock;
include/net/if_inet6.h:	rwlock_t		mc_lock;
include/net/if_inet6.h:	rwlock_t		sflock;
include/net/inet_frag.h:	rwlock_t		lock;
include/net/ip6_fib.h:	rwlock_t		tb6_lock;
include/net/ip.h:extern rwlock_t ip_ra_lock;
include/net/ipv6.h:extern rwlock_t ip6_ra_lock;
include/net/ip_vs.h:	rwlock_t		sched_lock;    /* lock sched_data */
include/net/llc.h:extern rwlock_t llc_sap_list_lock;
include/net/llc.h:		rwlock_t	  lock;
include/net/neighbour.h:	rwlock_t		lock;
include/net/neighbour.h:	rwlock_t		lock;
include/net/netns/packet.h:	rwlock_t		sklist_lock;
include/net/raw.h:	rwlock_t lock;
include/net/request_sock.h:	rwlock_t		syn_wait_lock;
include/net/sock.h:	rwlock_t		sk_callback_lock;
include/net/sock.h:	rwlock_t		sk_dst_lock;
include/net/xfrm.h:	rwlock_t		lock;
include/sound/core.h:	rwlock_t ctl_files_rwlock;	/* ctl_files list lock */
include/sound/pcm.h:extern rwlock_t snd_pcm_link_rwlock;
kernel/cgroup.c:static DEFINE_RWLOCK(css_set_lock);
kernel/exec_domain.c:static DEFINE_RWLOCK(exec_domains_lock);
kernel/fork.c:__cacheline_aligned DEFINE_RWLOCK(tasklist_lock);  /* outer */
kernel/resource.c:static DEFINE_RWLOCK(resource_lock);
mm/vmalloc.c:DEFINE_RWLOCK(vmlist_lock);
net/core/dev.c:DEFINE_RWLOCK(dev_base_lock);
net/core/gen_estimator.c:static DEFINE_RWLOCK(est_lock);
net/core/neighbour.c:static DEFINE_RWLOCK(neigh_tbl_lock);
net/core/sock.c:static DEFINE_RWLOCK(proto_list_lock);
net/ipv4/fib_hash.c:static DEFINE_RWLOCK(fib_hash_lock);
net/ipv4/inetpeer.c:static DEFINE_RWLOCK(peer_pool_lock);
net/ipv4/ipmr.c:static DEFINE_RWLOCK(mrt_lock);
net/ipv4/ip_sockglue.c:DEFINE_RWLOCK(ip_ra_lock);
net/ipv6/addrconf.c:static DEFINE_RWLOCK(addrconf_hash_lock);
net/ipv6/anycast.c:static DEFINE_RWLOCK(ipv6_sk_ac_lock);
net/ipv6/ip6_fib.c:static DEFINE_RWLOCK(fib6_walker_lock);
net/ipv6/ip6_flowlabel.c:static DEFINE_RWLOCK(ip6_fl_lock);
net/ipv6/ip6_flowlabel.c:static DEFINE_RWLOCK(ip6_sk_fl_lock);
net/ipv6/ipv6_sockglue.c:DEFINE_RWLOCK(ip6_ra_lock);
net/ipv6/mcast.c:static DEFINE_RWLOCK(ipv6_sk_mc_lock);
net/llc/llc_core.c:DEFINE_RWLOCK(llc_sap_list_lock);
net/netfilter/nfnetlink_log.c:static DEFINE_RWLOCK(instances_lock);
net/netlink/af_netlink.c:static DEFINE_RWLOCK(nl_table_lock);
net/sched/act_api.c:static DEFINE_RWLOCK(act_mod_lock);
net/sched/cls_api.c:static DEFINE_RWLOCK(cls_mod_lock);
net/sched/ematch.c:static DEFINE_RWLOCK(ematch_mod_lock);
net/sched/sch_api.c:static DEFINE_RWLOCK(qdisc_mod_lock);
net/xfrm/xfrm_policy.c:static DEFINE_RWLOCK(xfrm_policy_afinfo_lock);
net/xfrm/xfrm_policy.c:static DEFINE_RWLOCK(xfrm_policy_lock);
net/xfrm/xfrm_state.c:static DEFINE_RWLOCK(xfrm_km_lock);
net/xfrm/xfrm_state.c:static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
security/keys/keyring.c:static DEFINE_RWLOCK(keyring_name_lock);
security/selinux/ss/services.c:static DEFINE_RWLOCK(policy_rwlock);
sound/core/pcm_native.c:DEFINE_RWLOCK(snd_pcm_link_rwlock);
sound/core/seq/seq_clientmgr.h:	rwlock_t ports_lock;
sound/core/seq/seq_ports.h:	rwlock_t list_lock;


* Re: [rfc] "fair" rw spinlocks
  2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
  2009-11-24 20:19 ` David Miller
  2009-11-24 20:47 ` Andi Kleen
@ 2009-11-28  2:07 ` Paul E. McKenney
  2009-11-28 11:15   ` Andi Kleen
  2009-11-28 17:30 ` Linus Torvalds
  2009-12-01 19:01 ` Mathieu Desnoyers
  4 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-28  2:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, Linus Torvalds

On Mon, Nov 23, 2009 at 03:54:09PM +0100, Nick Piggin wrote:
> Hi,
> 
> Last time this issue came up that I could see, I don't think
> there were objections to making rwlocks fair, the main
> difficulty seemed to be that we allow reentrant read locks
> (so a write lock waiting must not block arbitrary read lockers).
> 
> Nowadays our rwlock usage is smaller, although there are still quite a
> few, so it would make more sense to do a conversion by
> introducing a new lock type and moving them over, I guess.
> 
> Anyway, I would like to add some kind of fairness or at least
> anti starvation for writers. We have a customer seeing total
> livelock on tasklist_lock for write locking on a system as small
> as 8 core Opteron.
> 
> This was basically reproduced by several cores executing wait
> with WNOHANG.
> 
> Of course it would always be nice to improve locking so
> contention isn't an issue, but so long as we have rwlocks, we
> could possibly get into a situation where starvation is
> triggered *somehow*. So I'd really like to fix this.
> 
> This particular starvation on tasklist lock I guess is a local
> DoS vulnerability even if the workload is not particularly
> realistic.
> 
> Anyway, I don't have a patch yet. I'm sure it can be done
> without extra atomics in fastpaths. Comments?

The usual trick would be to keep per-fair-rwlock state in per-CPU
variables.  If it is forbidden to read-acquire one nestable fair rwlock
while read-holding another, then this per-CPU state can be a single
pointer and a nesting count.  On the other hand, if it is permitted to
read-acquire one nestable fair rwlock while holding another, then one
can use a small per-CPU array of pointer/count pairs.

Readers check the per-CPU state.  If they already read-hold the lock,
they increment the nesting count, otherwise, they contend directly for
the lock (and set up the per-CPU state).
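
In (very rough, invented-name) code, the single-lock case might look
something like the following; fair_queue_for_read()/fair_release_read()
stand in for whatever fair queued lock is used underneath, and
interrupt-context readers are ignored here:

        struct fair_rwlock;                     /* the queued fair lock, elided */

        struct fair_reader_state {
                struct fair_rwlock *held;       /* lock this CPU read-holds, if any */
                unsigned int nesting;           /* read-hold depth */
        };
        static DEFINE_PER_CPU(struct fair_reader_state, fair_reader_state);

        static void fair_read_lock(struct fair_rwlock *lock)
        {
                struct fair_reader_state *rs;

                preempt_disable();              /* stay on this CPU while holding */
                rs = &__get_cpu_var(fair_reader_state);
                if (rs->held == lock) {
                        rs->nesting++;          /* recursive reader: never blocks */
                        return;
                }
                fair_queue_for_read(lock);      /* may wait behind queued writers */
                rs->held = lock;
                rs->nesting = 1;
        }

        static void fair_read_unlock(struct fair_rwlock *lock)
        {
                struct fair_reader_state *rs = &__get_cpu_var(fair_reader_state);

                if (!--rs->nesting) {
                        rs->held = NULL;
                        fair_release_read(lock);
                }
                preempt_enable();
        }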

Same number of atomics on the fastpath as the current implementation.
Too bad about those array accesses, though!  ;-)

(Though on modern hardware, the array accesses might be a non-event,
performance-wise.)

Hey, you asked!!!  And there are other ways to make this work, including
variations on brlock.

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-28  2:07 ` Paul E. McKenney
@ 2009-11-28 11:15   ` Andi Kleen
  2009-11-28 15:20     ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2009-11-28 11:15 UTC (permalink / raw)
  To: paulmck; +Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
> The usual trick would be to keep per-fair-rwlock state in per-CPU
> variables.  If it is forbidden to read-acquire one nestable fair rwlock
> while read-holding another, then this per-CPU state can be a single
> pointer and a nesting count.  On the other hand, if it is permitted to
> read-acquire one nestable fair rwlock while holding another, then one
> can use a small per-CPU array of pointer/count pairs.

The problem is that in preemptible kernels, kernel threads can switch
CPUs all the time.  How would you sync the per CPU state then?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [rfc] "fair" rw spinlocks
  2009-11-28 11:15   ` Andi Kleen
@ 2009-11-28 15:20     ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-28 15:20 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Nick Piggin, Linux Kernel Mailing List, Linus Torvalds

On Sat, Nov 28, 2009 at 12:15:07PM +0100, Andi Kleen wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >
> > The usual trick would be to keep per-fair-rwlock state in per-CPU
> > variables.  If it is forbidden to read-acquire one nestable fair rwlock
> > while read-holding another, then this per-CPU state can be a single
> > pointer and a nesting count.  On the other hand, if it is permitted to
> > read-acquire one nestable fair rwlock while holding another, then one
> > can use a small per-CPU array of pointer/count pairs.
> 
> The problem is that in preemptible kernels, kernel threads can switch
> CPUs all the time.  How would you sync the per CPU state then?

In preemptible kernels, put the state into the task structure.  Perhaps do
this in non-preemptible kernels too, just to save a bit of source code.

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
                   ` (2 preceding siblings ...)
  2009-11-28  2:07 ` Paul E. McKenney
@ 2009-11-28 17:30 ` Linus Torvalds
  2009-11-29 18:51   ` Paul E. McKenney
                     ` (2 more replies)
  2009-12-01 19:01 ` Mathieu Desnoyers
  4 siblings, 3 replies; 58+ messages in thread
From: Linus Torvalds @ 2009-11-28 17:30 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List



On Mon, 23 Nov 2009, Nick Piggin wrote:
> 
> Last time this issue came up that I could see, I don't think
> there were objections to making rwlocks fair, the main
> difficulty seemed to be that we allow reentrant read locks
> (so a write lock waiting must not block arbitrary read lockers).

We have at least one major rwlock user - tasklist_lock or whatever. And 
that one definitely depends on being able to do 'rwlock()' in an 
interrupt, without other rwlock'ers having to disable irq's (even if there 
might be a new writer coming in on another cpu).

That usage case _might_ be turned into RCU or something similar, in which 
case I don't think any major rwlock users remain. However, if that's the 
case, then why should anybody care about fairness any more either?

So as far as I can tell, we have only one real user of rwlocks where 
livelocks might be relevant, but that one real user absolutely _requires_ 
the unfair behavior.

		Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-28 17:30 ` Linus Torvalds
@ 2009-11-29 18:51   ` Paul E. McKenney
  2009-11-30  7:57     ` Nick Piggin
  2009-11-30  7:55   ` Nick Piggin
  2009-11-30 10:00   ` Christoph Hellwig
  2 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-29 18:51 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Linux Kernel Mailing List

On Sat, Nov 28, 2009 at 09:30:18AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 23 Nov 2009, Nick Piggin wrote:
> > 
> > Last time this issue came up that I could see, I don't think
> > there were objections to making rwlocks fair, the main
> > difficulty seemed to be that we allow reentrant read locks
> > (so a write lock waiting must not block arbitrary read lockers).
> 
> We have at least one major rwlock user - tasklist_lock or whatever. And 
> that one definitely depends on being able to do 'rwlock()' in an 
> interrupt, without other rwlock'ers having to disable irq's (even if there 
> might be a new writer coming in on another cpu).

In other words, any fair rwlock must unconditionally grant read access
to any CPU that already read-holds the lock, regardless of the state of
any writer wannabes.  Or is there another requirement I am missing?

> That usage case _might_ be turned into RCU or something similar, in which 
> case I don't think any major rwlock users remain. However, if that's the 
> case, then why should anybody care about fairness any more either?

Indeed, RCU goes a step further, permitting new readers unconditionally,
regardless of writer state.  ;-)

> So as far as I can tell, we have only one real user of rwlocks where 
> livelocks might be relevant, but that one real user absolutely _requires_ 
> the unfair behavior.

But the required unfairness is limited to unconditionally granting
recursive read requests, right?  If I understand correctly, if a given
CPU does not already read-hold the lock, then we can safely make that
CPU wait for a writer that might otherwise be starved.  Again, is there
another requirement that I am missing?

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-28 17:30 ` Linus Torvalds
  2009-11-29 18:51   ` Paul E. McKenney
@ 2009-11-30  7:55   ` Nick Piggin
  2009-11-30 15:22     ` Linus Torvalds
  2009-11-30 16:20     ` Paul E. McKenney
  2009-11-30 10:00   ` Christoph Hellwig
  2 siblings, 2 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-30  7:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Paul McKenney

On Sat, Nov 28, 2009 at 09:30:18AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 23 Nov 2009, Nick Piggin wrote:
> > 
> > Last time this issue came up that I could see, I don't think
> > there were objections to making rwlocks fair, the main
> > difficulty seemed to be that we allow reentrant read locks
> > (so a write lock waiting must not block arbitrary read lockers).
> 
> We have at least one major rwlock user - tasklist_lock or whatever. And 
> that one definitely depends on being able to do 'rwlock()' in an 
> interrupt, without other rwlock'ers having to disable irq's (even if there 
> might be a new writer coming in on another cpu).
> 
> That usage case _might_ be turned into RCU or something similar, in which 
> case I don't think any major rwlock users remain. However, if that's the 
> case, then why should anybody care about fairness any more either?

We do have quite a large number of rwlocks really. If they are so
important as to be rwlocks, that presumably means we can expect
multiple readers in the critical sections. Therefore, we can have
a possibility of livelock. (It isn't so hard, I have a test case
which _totally_ livelocks the write-side of the tasklist lock on a
2 socket opteron, just with several threads calling wait(2)).

I'm just not sure why you don't think the other rwlocks are a problem.
Certainly they are probably not as central and visible or likely to
be a problem as tasklist_lock, but at least for a DoS, couldn't they
be an issue?

But anyway, yes I would be happy to just start with tasklist_lock
for now and if we could keep slowly trying to move away from rwlocks
in other areas, that would be good.

 
> So as far as I can tell, we have only one real user of rwlocks where 
> livelocks might be relevant, but that one real user absolutely _requires_ 
> the unfair behavior.

Yes, although the behaviour required is that it can be recursively
acquired. So we could still have a lock that disallows new non recursive
read acquires when there is a pending write locker.

RCU seems nicer, but tasklist lock locking scares me so I wanted to fix
it the easy way :)



* Re: [rfc] "fair" rw spinlocks
  2009-11-29 18:51   ` Paul E. McKenney
@ 2009-11-30  7:57     ` Nick Piggin
  0 siblings, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-30  7:57 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Sun, Nov 29, 2009 at 10:51:22AM -0800, Paul E. McKenney wrote:
> On Sat, Nov 28, 2009 at 09:30:18AM -0800, Linus Torvalds wrote:
> > So as far as I can tell, we have only one real user of rwlocks where 
> > livelocks might be relevant, but that one real user absolutely _requires_ 
> > the unfair behavior.
> 
> But the required unfairness is limited to unconditionally granting
> recursive read requests, right?  If I understand correctly, if a given
> CPU does not already read-hold the lock, then we can safely make that
> CPU wait for a writer that might otherwise be starved.  Again, is there
> another requirement that I am missing?

I think this is the only ordering requirement.



* Re: [rfc] "fair" rw spinlocks
  2009-11-28 17:30 ` Linus Torvalds
  2009-11-29 18:51   ` Paul E. McKenney
  2009-11-30  7:55   ` Nick Piggin
@ 2009-11-30 10:00   ` Christoph Hellwig
  2009-11-30 15:52     ` Linus Torvalds
  2 siblings, 1 reply; 58+ messages in thread
From: Christoph Hellwig @ 2009-11-30 10:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Linux Kernel Mailing List

On Sat, Nov 28, 2009 at 09:30:18AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 23 Nov 2009, Nick Piggin wrote:
> > 
> > Last time this issue came up that I could see, I don't think
> > there were objections to making rwlocks fair, the main
> > difficulty seemed to be that we allow reentrant read locks
> > (so a write lock waiting must not block arbitrary read lockers).
> 
> We have at least one major rwlock user - tasklist_lock or whatever. And 
> that one definitely depends on being able to do 'rwlock()' in an 
> interrupt, without other rwlock'ers having to disable irq's (even if there 
> might be a new writer coming in on another cpu).

How long will this use be around?  I've seen some slow progress toward
replacing most read side uses of the task list lock with RCU.  While we
still have lots of read side users now I wonder when they'll go away.



* Re: [rfc] "fair" rw spinlocks
  2009-11-30  7:55   ` Nick Piggin
@ 2009-11-30 15:22     ` Linus Torvalds
  2009-11-30 15:40       ` Nick Piggin
  2009-11-30 16:20     ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 15:22 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, Paul McKenney



On Mon, 30 Nov 2009, Nick Piggin wrote:
> 
> We do have quite a large number of rwlocks really.

Dynamically they tend to be unimportant, except for the tasklist_lock. 
Many of them are in drivers and/or finegrained, or get called only by 
fairly unusual things (registering filesystems etc).

> If they are so important as to be rwlocks,

Stop making total red-herring arguments.

It's not about "so important as to be rwlocks". Quite the reverse. I'm 
saying that most rwlocks are totally _unimportant_. Being a rwlock does 
_not_ make anything more important or less important in itself, so your 
argument is bogus. You have to base importance on other issues than 
whether they are rwlocks or not.

As far as I can tell there is _one_ single important rwlock, and that's 
tasklist_lock. Everything else could probably trivially and individually 
be turned into a spinlock if fairness matters for them. But tasklist_lock 
fundamentally depends on the semantics of rwlocks.

And that one rwlock requires unfair behavior, and is not going to be happy 
with some more complicated thing (because it is also called from some 
pretty critical pathways).

So my argument is purely:

 - there is absolutely NOBODY who cares about "fair" rwlocks, because no 
   other user will ever hit its lock enough for it to matter. And if they
   really do, most of them tend to be fairly simple and localized and 
   might be turned into spinlocks.

 - the _one_ major exception to this - somebody who does hit the lock 
   enough for fairness to matter - is not likely amenable to any kind of 
   trivial fairness.

Now, I'd love to come up with some solution to tasklist_lock, but I just 
don't see it. At least nothing easy that doesn't have tons of downsides 
(like turning it into a spinlock, using the irq-safe versions, and having 
irq's potentially disabled for much longer than I think is good).

			Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 15:22     ` Linus Torvalds
@ 2009-11-30 15:40       ` Nick Piggin
  2009-11-30 16:07         ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Nick Piggin @ 2009-11-30 15:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Paul McKenney

On Mon, Nov 30, 2009 at 07:22:13AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Nov 2009, Nick Piggin wrote:
> > 
> > We do have quite a large number of rwlocks really.
> 
> Dynamically they tend to be unimportant, except for the tasklist_lock. 
> Many of them are in drivers and/or finegrained, or get called only by 
> fairly unusual things (registering filesystems etc).
> 
> > If they are so important as to be rwlocks,
> 
> Stop making total red-herring arguments.

I just mean, if their usage patterns justify their being an rwlock
(as opposed to a plain spinlock). That would mean they can have
significant parallelism in read-side critical sections. And that means
they could have livelock/DoS problems.


> It's not about "so important as to be rwlocks". Quite the reverse. I'm 
> saying that most rwlocks are totally _unimportant_. Being a rwlock does 
> _not_ make anything more important or less important in itself, so your 
> argument is bogus. You have to base importance on other issues than 
> whether they are rwlocks or not.

I agree there are probably many of them which should just be spinlocks
by nature of their usage patterns. But I would have thought that at
least *some* small percentage of them other than tasklist_lock
would be valid rwlocks.

Anyway, pointless to care about that point I guess... either they exist
or they don't, either way we can't just make all rwlocks fair of course.
So they would have to be tackled one at a time.

 
> As far as I can tell there is _one_ single important rwlock, and that's 
> tasklist_lock. Everything else could probably trivially and individually 
> be turned into a spinlock if fairness matters for them. But tasklist_lock 
> fundamentally depends on the semantics of rwlocks.
> 
> And that one rwlock requires unfair behavior, and is not going to be happy 
> with some more complicated thing (because it is also called from some 
> pretty critical pathways).
> 
> So my argument is purely:
> 
>  - there is absolutely NOBODY who cares about "fair" rwlocks, because no 
>    other user will ever hit its lock enough for it to matter. And if they
>    really do, most of them tend to be fairly simple and localized and 
>    might be turned into spinlocks.
> 
>  - the _one_ major exception to this - somebody who does hit the lock 
>    enough for fairness to matter - is not likely amenable to any kind of 
>    trivial fairness.
> 
> Now, I'd love to come up with some solution to tasklist_lock, but I just 
> don't see it. At least nothing easy that doesn't have tons of downsides 
> (like turning it into a spinlock, using the irq-safe versions, and having 
> irq's potentially disabled for much longer than I think is good).

Well the simple thing I tried earlier was a per-cpu array of nesting
counters there. It's not _too_ expensive, but it does add another cacheline
access and branch there. It seems to work in solving the livelock though.



* Re: [rfc] "fair" rw spinlocks
  2009-11-30 10:00   ` Christoph Hellwig
@ 2009-11-30 15:52     ` Linus Torvalds
  2009-11-30 17:46       ` Ingo Molnar
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 15:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Nick Piggin, Linux Kernel Mailing List



On Mon, 30 Nov 2009, Christoph Hellwig wrote:
> 
> How long will this use be around?  I've seen some slow progress toward
> replacing most read side uses of the task list lock with RCU.  While we
> still have lots of read side users now I wonder when they'll go away.

tasklist_lock is pretty nasty. I threw out "replace it with RCU" because 
it would be nice, but the data structures used are not just simple linked 
lists that we have RCU helpers for traversing.

There are various real exclusion rules about things like 'tsk->exit_state' 
etc, which do not translate directly to RCU usage. Of course, _maybe_ all 
the places that care already take the thing for writing and would just 
automatically have exclusion anyway.

So I'd love to see somebody try to do the conversion. To a first 
approximation, you probably could do

 - turn tasklist_lock into a spinlock

 - sed 's/write_lock_irq(&tasklist_lock)/spin_lock(&tasklist_lock)/g'
   sed 's/write_unlock_irq(&tasklist_lock)/spin_unlock(&tasklist_lock)/g'

 - sed 's/read_lock(&tasklist_lock)/rcu_read_lock()/g'
   sed 's/read_unlock(&tasklist_lock)/rcu_read_unlock()/g'

 - make all the task lists use the RCU versions of the list routines

 - free the task structure using RCU

and you'd be _pretty_ close to a working system.
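
As an illustration, the read-side part of a typical converted call site
would end up looking something like this (process_matches() and
do_something() are placeholders; it assumes the RCU-ified list walk and
RCU-freed task_struct from the steps above):

        struct task_struct *p;

        rcu_read_lock();
        for_each_process(p) {
                /* p can't be freed while we're in the RCU read-side
                 * critical section, but nothing here excludes writers,
                 * so fields that relied on tasklist_lock for stability
                 * (exit_state and friends) need their own protection */
                if (process_matches(p))
                        do_something(p);
        }
        rcu_read_unlock();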

But I'd worry about current read-lockers that depend on things like that 
tsk->exit_state thing being stable from just read-locking (since only 
write-lockers should change it). So you can _probably_ do 99% of it fairly 
mindlessly, but the remaining 1% is the subtle stuff.

Maybe it's easier than I think. Or maybe I've totally ignored something, 
and there are much much worse issues than the occasional exit_state thing.
(Things like 'tsk->mm' are already protected by per-task locks, but there 
may be other things that depend on holding tasklist lock for exclusion).

			Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 15:40       ` Nick Piggin
@ 2009-11-30 16:07         ` Linus Torvalds
  2009-11-30 16:17           ` Nick Piggin
  2009-11-30 16:39           ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 16:07 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linux Kernel Mailing List, Paul McKenney



On Mon, 30 Nov 2009, Nick Piggin wrote:
> 
> Well the simple thing I tried earlier was a per-cpu array of nesting
> counters there. It's not _too_ expensive, but it does add another cacheline
> access and branch there. It seems to work in solving the livelock though.

So how did you do the nesting counter? Afaik, it needs to be something 
like

	local_irq_save(flags);
	if (!get_cpu_var(tasklist_counter)++)
		spin_lock(&tasklist_lock);
	local_irq_restore(flags);

on the read_lock side (and the same in reverse on unlock). Which seems 
quite a bit more expensive than what we have now. Especially on UP, but I 
guess you can make it conditional on CONFIG_SMP (but that won't help 
generic kernels).
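
The unlock side would presumably be the mirror image (same sketch
status and caveats as the snippet above):

        local_irq_save(flags);
        if (!--get_cpu_var(tasklist_counter))
                spin_unlock(&tasklist_lock);
        local_irq_restore(flags);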

				Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 16:07         ` Linus Torvalds
@ 2009-11-30 16:17           ` Nick Piggin
  2009-11-30 16:39           ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-30 16:17 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Linux Kernel Mailing List, Paul McKenney

On Mon, Nov 30, 2009 at 08:07:16AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Nov 2009, Nick Piggin wrote:
> > 
> > Well the simple thing I tried earlier was a per-cpu array of nesting
> > counters there. It's not _too_ expensive, but it does add another cacheline
> > access and branch there. It seems to work in solving the livelock though.
> 
> So how did you do the nesting counter? Afaik, it needs to be something 
> like
> 
> 	local_irq_save(flags);
> 	if (!get_cpu_var(tasklist_counter)++)
> 		spin_lock(&tasklist_lock);
> 	local_irq_restore(flags);
> 
> on the read_lock side (and the same in reverse on unlock). Which seems 
> quite a bit more expensive than what we have now. Especially on UP, but I 
> guess you can make it conditional on CONFIG_SMP (but that won't help 
> generic kernels).

Ah yes, the irq manipulations are going to be fairly expensive :(
Let's see... I was just using local_* ops for the counter, but
that is sadly racy.

Hmm.


* Re: [rfc] "fair" rw spinlocks
  2009-11-30  7:55   ` Nick Piggin
  2009-11-30 15:22     ` Linus Torvalds
@ 2009-11-30 16:20     ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-30 16:20 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, Nov 30, 2009 at 08:55:57AM +0100, Nick Piggin wrote:
> On Sat, Nov 28, 2009 at 09:30:18AM -0800, Linus Torvalds wrote:
> > So as far as I can tell, we have only one real user of rwlocks where 
> > livelocks might be relevant, but that one real user absolutely _requires_ 
> > the unfair behavior.
> 
> Yes, although the behaviour required is that it can be recursively
> acquired. So we could still have a lock that disallows new non recursive
> read acquires when there is a pending write locker.
> 
> RCU seems nicer, but tasklist lock locking scares me so I wanted to fix
> it the easy way :)

Having a livelock-free tasklist lock would certainly make it easier
to apply things like RCU on a code-path-by-code-path basis as needed.
Much less scary than a big-bang rip-and-replace.

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 16:07         ` Linus Torvalds
  2009-11-30 16:17           ` Nick Piggin
@ 2009-11-30 16:39           ` Paul E. McKenney
  2009-11-30 17:05             ` Linus Torvalds
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-30 16:39 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Linux Kernel Mailing List

On Mon, Nov 30, 2009 at 08:07:16AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Nov 2009, Nick Piggin wrote:
> > 
> > Well the simple thing I tried earlier was a per-cpu array of nesting
> > counters there. It's not _too_ expensive, but it does add another cacheline
> > access and branch there. It seems to work in solving the livelock though.
> 
> So how did you do the nesting counter? Afaik, it needs to be something 
> like
> 
> 	local_irq_save(flags);
> 	if (!get_cpu_var(tasklist_counter)++)
> 		spin_lock(&tasklist_lock);
> 	local_irq_restore(flags);
> 
> on the read_lock side (and the same in reverse on unlock). Which seems 
> quite a bit more expensive than what we have now. Especially on UP, but I 
> guess you can make it conditional on CONFIG_SMP (but that won't help 
> generic kernels).

My suggestion would be to put the nesting counter in the task structure
to avoid this problem.

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 16:39           ` Paul E. McKenney
@ 2009-11-30 17:05             ` Linus Torvalds
  2009-11-30 17:13               ` Nick Piggin
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 17:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Nick Piggin, Linux Kernel Mailing List



On Mon, 30 Nov 2009, Paul E. McKenney wrote:
> 
> My suggestion would be to put the nesting counter in the task structure
> to avoid this problem.

It still doesn't end up being all that cheap. Now you'd need to disable 
preemption in order to fix the race between the local counter and the real 
lock.

That should be cheaper than cli/sti, but the downside is that now you need 
that task struct pointer (both for the preemption disable and the 
counter), so now you're adding some register pressure too. Of course, 
maybe you don't even want to inline it anyway, in which case that doesn't 
matter.

One advantage with your suggestion of using preemption is that (unlike irq 
disables) you can keep the preempt counter over the whole lock, so you 
don't need to re-do the preempt disable/enable in both read-lock and 
read-unlock.

So you might end up with something like (UNTESTED!):

	static void tasklist_write_lock(void)
	{
		spin_lock_irq(&tasklist_lock);
	}

	static void tasklist_write_unlock(void)
	{
		spin_unlock_irq(&tasklist_lock);
	}

	static void tasklist_read_lock(void)
	{
		preempt_disable();
		if (!current->tasklist_count++)
			spin_lock(&tasklist_lock);
	}

	static void tasklist_read_unlock(void)
	{
		if (!--current->tasklist_count)
			spin_unlock(&tasklist_lock);
		preempt_enable();
	}

And the upside, of course, is that a spin_unlock() is cheaper than a 
read_unlock() (no serializing atomic op), so while there is overhead, 
there are also some advantages. Maybe that atomic op advantage is enough 
to offset the extra instructions.

And maybe we could use 'raw_spin_[un]lock()' in the above read-[un]lock 
sequences, since tasklist_lock is pretty special, and since we do the 
preempt disable by hand (no need to do it again in the spinlock code). 
That looks like it might cut down the overhead of all of the above to 
almost nothing for what is probably the common case today (ie preemption 
enabled).

			Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 17:05             ` Linus Torvalds
@ 2009-11-30 17:13               ` Nick Piggin
  2009-11-30 17:18                 ` Linus Torvalds
  2009-11-30 18:29                 ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Nick Piggin @ 2009-11-30 17:13 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Paul E. McKenney, Linux Kernel Mailing List

On Mon, Nov 30, 2009 at 09:05:57AM -0800, Linus Torvalds wrote:
> 
> 
> On Mon, 30 Nov 2009, Paul E. McKenney wrote:
> > 
> > My suggestion would be to put the nesting counter in the task structure
> > to avoid this problem.
> 
> It still doesn't end up being all that cheap. Now you'd need to disable 
> preemption in order to fix the race between the local counter and the real 
> lock.
> 
> That should be cheaper than cli/sti, but the downside is that now you need 
> that task struct pointer (both for the preemption disable and the 
> counter), so now you're adding some register pressure too. Of course, 
> maybe you don't even want to inline it anyway, in which case that doesn't 
> matter.
> 
> One advantage with your suggestion of using preemption is that (unlike irq 
> disables) you can keep the preempt counter over the whole lock, so you 
> don't need to re-do the preempt disable/enable in both read-lock and 
> read-unlock.
> 
> So you might end up with something like (UNTESTED!):
> 
> 	static void tasklist_write_lock(void)
> 	{
> 		spin_lock_irq(&tasklist_lock);
> 	}
> 
> 	static void tasklist_write_unlock(void)
> 	{
> 		spin_unlock_irq(&tasklist_lock);
> 	}

Two questions. Firstly, does tasklist_lock benefit much from read
side parallelism? Looking at some of the critical sections, some seem
to hold it for quite a while (over task and thread iterations). So
it might not be the right thing to convert it to a spinlock?

Secondly:

> 	static void tasklist_read_lock(void)
> 	{
> 		preempt_disable();
> 		if (!current->tasklist_count++)

What happens if an interrupt and nested tasklist_read_lock() happens
here?

> 			spin_lock(&tasklist_lock);
> 	}
> 
> 	static void tasklist_read_unlock(void)
> 	{
> 		if (!--current->tasklist_count)
> 			spin_unlock(&tasklist_lock);
> 		preempt_enable();
> 	}
> 
> And the upside, of course, is that a spin_unlock() is cheaper than a 
> read_unlock() (no serializing atomic op), so while there is overhead, 
> there are also some advantages. Maybe that atomic op advantage is enough 
> to offset the extra instructions.
> 
> And maybe we could use 'raw_spin_[un]lock()' in the above read-[un]lock 
> sequences, since tasklist_lock is pretty special, and since we do the 
> preempt disable by hand (no need to do it again in the spinlock code). 
> That looks like it might cut down the overhead of all of the above to 
> almost nothing for what is probably the common case today (ie preemption 
> enabled).
> 
> 			Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 17:13               ` Nick Piggin
@ 2009-11-30 17:18                 ` Linus Torvalds
  2009-12-01 17:03                   ` Arnd Bergmann
  2009-11-30 18:29                 ` Paul E. McKenney
  1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 17:18 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Paul E. McKenney, Linux Kernel Mailing List



On Mon, 30 Nov 2009, Nick Piggin wrote:
> 
> Two questions. Firstly, does tasklist_lock benefit much from read
> side paralellism? Looking at some of the critical sections some seem
> to hold it for quite a while (over task and thread iterations). So
> it might not be the right thing to convert it to a spinlock?

I do agree that it's a potential problem.

> Secondly:
> 
> > 	static void tasklist_read_lock(void)
> > 	{
> > 		preempt_disable();
> > 		if (!current->tasklist_count++)
> 
> What happens if an interrupt and nested tasklist_read_lock() happens
> here?

You're right. We'd still need to disable hardware interrupts. Very 
annoying, because that is pretty much _guaranteed_ to eat up any advantage 
of the faster spin_unlock path. 
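
Spelled out, the interleaving in question (against the sketch quoted
above) is something like:

        /*
         * task: preempt_disable();
         * task: current->tasklist_count++;   0 -> 1, spin_lock() not yet taken
         * irq:  tasklist_read_lock();        sees count == 1, bumps it to 2,
         *                                    skips spin_lock() entirely
         * irq:  walks the task lists "read-locked" while a writer on
         *       another CPU may actually be holding tasklist_lock
         */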

So scratch that approach. My earlier suggestion still looks technically 
correct, but has all the performance disadvantages.

The best option really would be to try to make it all use RCU, rather than 
paper over things. That _really_ should improve performance.

			Linus


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 15:52     ` Linus Torvalds
@ 2009-11-30 17:46       ` Ingo Molnar
  2009-11-30 21:12         ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Ingo Molnar @ 2009-11-30 17:46 UTC (permalink / raw)
  To: Linus Torvalds, Peter Zijlstra, Thomas Gleixner
  Cc: Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, 30 Nov 2009, Christoph Hellwig wrote:
> > 
> > How long will this use be around?  I've seen some slow progress toward
> > replacing most read side uses of the task list lock with RCU.  While we
> > still have lots of read side users now I wonder when they'll go away.
> 
> tasklist_lock is pretty nasty. I threw out "replace it with RCU" 
> because it would be nice, but the data structures used are not just 
> simple linked lists that we have RCU helpers for traversing.
> 
> There are various real exclusion rules about things like 
> 'tsk->exit_state' etc, which do not translate directly to RCU usage. 
> Of course, _maybe_ all the places that care already take the thing for 
> writing and would just automatically have exclusion anyway.
> 
> So I'd love to see somebody try to do the conversion. To a first 
> approximation, you probably could do
> 
>  - turn tasklist_lock into a spinlock
> 
>  - sed 's/write_lock_irq(&tasklist_lock)/spin_lock(&tasklist_lock)/g'
>    sed 's/write_unlock_irq(&tasklist_lock)/spin_unlock(&tasklist_lock)/g'
> 
>  - sed 's/read_lock(&tasklist_lock)/rcu_read_lock()/g'
>    sed 's/read_unlock(&tasklist_lock)/rcu_read_unlock()/g'
> 
>  - make all the task lists use the RCU versions of the list routines
> 
>  - free the task structure using RCU
> 
> and you'd be _pretty_ close to a working system.

In -rt we've got that in essence, and it's indeed working fine (with a 
few caveats). A few RCU conversions of tasklist_lock usage in that area 
even trickled upstream, because the simple lock would hurt so much under 
-rt.

	Ingo


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 17:13               ` Nick Piggin
  2009-11-30 17:18                 ` Linus Torvalds
@ 2009-11-30 18:29                 ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2009-11-30 18:29 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Linus Torvalds, Linux Kernel Mailing List

On Mon, Nov 30, 2009 at 06:13:33PM +0100, Nick Piggin wrote:
> On Mon, Nov 30, 2009 at 09:05:57AM -0800, Linus Torvalds wrote:
> Secondly:
> 
> > 	static void tasklist_read_lock(void)
> > 	{
> > 		preempt_disable();
> > 		if (!current->tasklist_count++)
> 
> What happens if an interrupt and nested tasklist_read_lock() happens
> here?

Ouch!!!  This race affects the reader-writer-lock version as well.
Ah well, looks like interrupt disabling is required either way.

							Thanx, Paul


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 17:46       ` Ingo Molnar
@ 2009-11-30 21:12         ` Thomas Gleixner
  2009-11-30 21:27           ` Peter Zijlstra
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2009-11-30 21:12 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

On Mon, 30 Nov 2009, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> > On Mon, 30 Nov 2009, Christoph Hellwig wrote:
> > > 
> > > How long will this use be around?  I've seen some slow progress toward
> > > replacing most read side uses of the task list lock with RCU.  While we
> > > still have lots of read side users now I wonder when they'll go away.
> > 
> > tasklist_lock is pretty nasty. I threw out "replace it with RCU" 
> > because it would be nice, but the data structures used are not just 
> > simple linked lists that we have RCU helpers for traversing.
> > 
> > There are various real exclusion rules about things like 
> > 'tsk->exit_state' etc, which do not translate directly to RCU usage. 
> > Of course, _maybe_ all the places that care already take the thing for 
> > writing and would just automatically have exclusion anyway.
> > 
> > So I'd love to see somebody try to do the conversion. To a first 
> > approximation, you probably could do
> > 
> >  - turn tasklist_lock into a spinlock
> > 
> >  - sed 's/write_lock_irq(&tasklist_lock)/spin_lock(&tasklist_lock)/g'
> >    sed 's/write_unlock_irq(&tasklist_lock)/spin_unlock(&tasklist_lock)/g'
> > 
> >  - sed 's/read_lock(&tasklist_lock)/rcu_read_lock()/g'
> >    sed 's/read_unlock(&tasklist_lock)/rcu_read_unlock()/g'
> > 
> >  - make all the task lists use the RCU versions of the list routines
> > 
> >  - free the task structure using RCU
> > 
> > and you'd be _pretty_ close to a working system.
> 
> In -rt we've got that in essence, and it's indeed working fine (with a 
> few caveats). A few RCU conversions of tasklist_lock usage in that area 
> even trickled upstream, because the simple lock would hurt so much under 
> -rt.

I think the conversion Linus proposed is pretty feasible. I went
through the read_lock sites and most of them are protecting function
calls which we already use under rcu_read_lock() in other places like
find_task* and thread or pid iterators.

There are a few non obvious ones in signal.c and posix-cpu-timers.c
(what a surprise) but nothing looks too scary.

If nobody beats me I'm going to let sed loose on the kernel, lift the
task_struct rcu free code from -rt and figure out what explodes.

Thanks,

	tglx


* Re: [rfc] "fair" rw spinlocks
  2009-11-30 21:12         ` Thomas Gleixner
@ 2009-11-30 21:27           ` Peter Zijlstra
  2009-11-30 22:02             ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2009-11-30 21:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Linus Torvalds, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

On Mon, 2009-11-30 at 22:12 +0100, Thomas Gleixner wrote:

> I think the conversion Linus proposed is pretty feasible. I went
> through the read_lock sites and most of them are protecting function
> calls which we already use under rcu_read_lock() in other places like
> find_task* and thread or pid iterators.
> 
> There are a few non obvious ones in signal.c and posix-cpu-timers.c
> (what a surprise) but nothing looks too scary.
> 
> If nobody beats me I'm going to let sed loose on the kernel, lift the
> task_struct rcu free code from -rt and figure out what explodes.

Things like sched.c:tg_set_bandwidth() take the tasklist_lock in
read-mode to exclude tasks being added concurrently to avoid
sched_rt_can_attach() races with tg_has_rt_tasks().

Possibly the cgroup stuff has a smaller lock to use for this.



* Re: [rfc] "fair" rw spinlocks
  2009-11-30 21:27           ` Peter Zijlstra
@ 2009-11-30 22:02             ` Thomas Gleixner
  2009-11-30 22:11               ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2009-11-30 22:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Linus Torvalds, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List



On Mon, 30 Nov 2009, Peter Zijlstra wrote:

> On Mon, 2009-11-30 at 22:12 +0100, Thomas Gleixner wrote:
> 
> > I think the conversion Linus proposed is pretty feasible. I went
> > through the read_lock sites and most of them are protecting function
> > calls which we already use under rcu_read_lock() in other places like
> > find_task* and thread or pid iterators.
> > 
> > There are a few non obvious ones in signal.c and posix-cpu-timers.c
> > (what a surprise) but nothing looks too scary.
> > 
> > If nobody beats me I'm going to let sed loose on the kernel, lift the
> > task_struct rcu free code from -rt and figure out what explodes.
> 
> Things like sched.c:tg_set_bandwidth() take the tasklist_lock in
> read-mode to exclude tasks being added concurrently to avoid
> sched_rt_can_attach() races with tg_has_rt_tasks().

Yeah, forgot to mention sched.c, but that's solvable
 
> Possibly the cgroup stuff has a smaller lock to use for this.

Worth checking.

Thanks,

	tglx


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-30 22:02             ` Thomas Gleixner
@ 2009-11-30 22:11               ` Linus Torvalds
  2009-11-30 22:37                 ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 22:11 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List



On Mon, 30 Nov 2009, Thomas Gleixner wrote:
> 
> Yeah, forgot to mention sched.c, but that's solvable

It should be fairly easy to add a few 'spin_lock(&tasklist_lock)' around 
stuff that really depended on exclusion from writers. That should 
_hopefully_ be the rare case.

The biggest problem is that there will almost inevitably be things that 
get missed, and any races exposed by lacking locking will be _very_ hard 
to debug and trigger. So what I'd be worried about is not getting to a 
"practically working" state, but any really subtle cases that nobody 
really hits in practice.

			Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-30 22:11               ` Linus Torvalds
@ 2009-11-30 22:37                 ` Thomas Gleixner
  2009-11-30 22:49                   ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2009-11-30 22:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

On Mon, 30 Nov 2009, Linus Torvalds wrote:
> On Mon, 30 Nov 2009, Thomas Gleixner wrote:
> > 
> > Yeah, forgot to mention sched.c, but that's solvable
> 
> It should be fairly easy to add a few 'spin_lock(&tasklist_lock)' around 
> stuff that really depended on exclusion from writers. That should 
> _hopefully_ be the rare case.
> 
> The biggest problem is that there will almost inevitably be things that 
> get missed, and any races exposed by lacking locking will be _very_ hard 
> to debug and trigger. So what I'd be worried about is not getting to a 
> "practically working" state, but any really subtle cases that nobody 
> really hits in practice.

I'm aware of that. The number of places where we read_lock
tasklist_lock is 79 in 36 files right now. That's not a horrible task
to go through them one by one and do a case by case conversion with a
proper changelog. That would only leave the write_lock sites. 

We can then either do the rw_lock to spin_lock conversion or keep the
rw_lock which has no readers anymore and behaves like a spinlock for a
transition time so reverts of one of the read_lock -> rcu patches
could be done to debug stuff.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-30 22:37                 ` Thomas Gleixner
@ 2009-11-30 22:49                   ` Linus Torvalds
  2009-12-01 17:37                     ` [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region Thomas Gleixner
  2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
  0 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2009-11-30 22:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Peter Zijlstra, Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List



On Mon, 30 Nov 2009, Thomas Gleixner wrote:
> 
> I'm aware of that. The number of places where we read_lock
> tasklist_lock is 79 in 36 files right now. That's not a horrible task
> to go through them one by one and do a case by case conversion with a
> proper changelog. That would only leave the write_lock sites. 

The write_lock sites should be fine, since just changing them to a 
spinlock should be 100% semantically equivalent - except for the lack of 
interrupt disable. And the lack of interrupt disable will result in a nice 
big deadlock if some interrupt really does take the spinlock, which is 
much easier to debug than a subtle race that would get the wrong read 
value.

> We can then either do the rw_lock to spin_lock conversion or keep the
> rw_lock which has no readers anymore and behaves like a spinlock for a
> transition time so reverts of one of the read_lock -> rcu patches
> could be done to debug stuff.

So as per the above, I wouldn't worry about the write lockers. Might as 
well change it to a spinlock, since that's what it will act as. It's not 
as if there is any chance that the spinlock code is subtly buggy.

So the only reason to keep it as a rwlock would be if you decide to do the 
read-locked cases one by one, and don't end up with all of them converted. 
Which is a reasonable strategy too, of course. We don't _have_ to convert 
them all - if the main problem is some starvation issue, it's sufficient 
to convert just the main read-lock cases so that writers never get 
starved.

But converting it all would be nice, because that whole

	write_lock_irq(&tasklist_lock);

to

	spin_lock(&tasklist_lock);

conversion would likely be a measurable performance win. Both because 
spinlocks are fundamentally faster (no atomic on unlock), and because you 
get rid of the irq disable/enable. But in order to get there, you'd have 
to convert _all_ the read-lockers, so you'd miss the opportunity to only 
convert the easy cases.

			Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-30 17:18                 ` Linus Torvalds
@ 2009-12-01 17:03                   ` Arnd Bergmann
  2009-12-01 17:15                     ` Linus Torvalds
  0 siblings, 1 reply; 58+ messages in thread
From: Arnd Bergmann @ 2009-12-01 17:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nick Piggin, Paul E. McKenney, Linux Kernel Mailing List

On Monday 30 November 2009, Linus Torvalds wrote:
> The best option really would be to try to make it all use RCU, rather than 
> paper over things. That really should improve performance.

Are there any writers at interrupt time? If not, another option might be
to first convert all the readers that can happen from interrupts to RCU,
which lets us get rid of the irq disable in the write path.

Step two could either be the conversion of all nested readers to RCU,
or your tasklist_read_lock(), which should then be safe. Doing both of
these lets us turn tasklist_lock into a plain spinlock, possibly followed
by converting remaining readers that show measurable lock contention.

	Arnd <><

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-01 17:03                   ` Arnd Bergmann
@ 2009-12-01 17:15                     ` Linus Torvalds
  0 siblings, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2009-12-01 17:15 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: Nick Piggin, Paul E. McKenney, Linux Kernel Mailing List



On Tue, 1 Dec 2009, Arnd Bergmann wrote:
> On Monday 30 November 2009, Linus Torvalds wrote:
> > The best option really would be to try to make it all use RCU, rather than 
> > paper over things. That really should improve performance.
> 
> Are there any writers at interrupt time?

No, there can't be. That would already be a deadlock, since we take the 
read lock without irq protection (exactly because many of the read-lockers 
are pretty performance-sensitive).

> If not, another option might be to first convert all the readers that 
> can happen from interrupts to RCU, which lets us get rid of the irq 
> disable in the write path.

If you convert the irq readers, you generally really need to convert the 
rest too. In particular, you still need to convert the write-side to use 
the RCU versions of the insert/remove code, and to free the things from 
RCU in order for it all to be safe (think: irq reader on another CPU than 
the writer, now without any locking).

So you really don't win all that much. At a minimum, you always have to 
convert all the writers to use RCU (even if you then keep the rwlock as 
the exclusion model), and since that involves a large portion of the 
complexity (including at least the RCU freeing side), what you end up with 
is that you can avoid converting _some_ of the readers.
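
Roughly, the write side would end up looking like this (sketch only,
assuming tasklist_lock has already become a plain spinlock; the helper
names are made up and the pid-detach/reparenting details are elided):

	static void delayed_free_task(struct rcu_head *rhp)
	{
		struct task_struct *p = container_of(rhp, struct task_struct, rcu);

		free_task(p);		/* actual free after a grace period */
	}

	void unhash_and_free_task(struct task_struct *p)
	{
		spin_lock(&tasklist_lock);	/* writers still exclude each other */
		list_del_rcu(&p->tasks);	/* RCU variant of list_del() */
		/* ... detach pids, reparent children, etc ... */
		spin_unlock(&tasklist_lock);

		call_rcu(&p->rcu, delayed_free_task);
	}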

So I do agree that you can do things in two stages, but I suspect the irq 
disable on the write path part is the least of our problems.

				Linus

^ permalink raw reply	[flat|nested] 58+ messages in thread

* [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region
  2009-11-30 22:49                   ` Linus Torvalds
@ 2009-12-01 17:37                     ` Thomas Gleixner
  2009-12-01 18:22                       ` Oleg Nesterov
  2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Thomas Gleixner @ 2009-12-01 17:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Al Viro, James Morris, Oleg Nesterov

While auditing all tasklist_lock read_lock sites I stumbled over the
following call chain:

audit_prepare_user_tty()
  read_lock(&tasklist_lock);
  tty_audit_push_task();
     mutex_lock(&buf->mutex);

     --> buf->mutex is locked with preemption disabled.

Solve this by acquiring a reference to the task struct under
rcu_read_lock() and calling tty_audit_push_task() outside of the
preempt disabled region.

Move all code which needs to be protected by sighand lock into
tty_audit_push_task() and use lock/unlock_sighand as we do not hold
tasklist_lock.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 drivers/char/tty_audit.c |   29 +++++++++++++++++++++--------
 include/linux/tty.h      |    8 ++++----
 kernel/audit.c           |   25 +++++++++----------------
 3 files changed, 34 insertions(+), 28 deletions(-)

Index: linux-2.6-tip/drivers/char/tty_audit.c
===================================================================
--- linux-2.6-tip.orig/drivers/char/tty_audit.c
+++ linux-2.6-tip/drivers/char/tty_audit.c
@@ -188,25 +188,38 @@ void tty_audit_tiocsti(struct tty_struct
 }
 
 /**
- *	tty_audit_push_task	-	Flush task's pending audit data
+ * tty_audit_push_task	-	Flush task's pending audit data
+ * @tsk:		task pointer
+ * @loginuid:		sender login uid
+ * @sessionid:		sender session id
+ *
+ * Called with a ref on @tsk held. Try to lock sighand and get a
+ * reference to the tty audit buffer if available.
+ * Flush the buffer or return an appropriate error code.
  */
-void tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
+int tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
 {
-	struct tty_audit_buf *buf;
+	struct tty_audit_buf *buf = NULL;
+	unsigned long flags;
 
-	spin_lock_irq(&tsk->sighand->siglock);
-	buf = tsk->signal->tty_audit_buf;
-	if (buf)
+	if (!lock_task_sighand(tsk, &flags))
+		return -ESRCH;
+
+	if (tsk->signal->audit_tty && tsk->signal->tty_audit_buf) {
+		buf = tsk->signal->tty_audit_buf;
 		atomic_inc(&buf->count);
-	spin_unlock_irq(&tsk->sighand->siglock);
+	}
+	unlock_task_sighand(tsk, &flags);
+
 	if (!buf)
-		return;
+		return -EPERM;
 
 	mutex_lock(&buf->mutex);
 	tty_audit_buf_push(tsk, loginuid, sessionid, buf);
 	mutex_unlock(&buf->mutex);
 
 	tty_audit_buf_put(buf);
+	return 0;
 }
 
 /**
Index: linux-2.6-tip/include/linux/tty.h
===================================================================
--- linux-2.6-tip.orig/include/linux/tty.h
+++ linux-2.6-tip/include/linux/tty.h
@@ -494,8 +494,8 @@ extern void tty_audit_exit(void);
 extern void tty_audit_fork(struct signal_struct *sig);
 extern void tty_audit_tiocsti(struct tty_struct *tty, char ch);
 extern void tty_audit_push(struct tty_struct *tty);
-extern void tty_audit_push_task(struct task_struct *tsk,
-					uid_t loginuid, u32 sessionid);
+extern int tty_audit_push_task(struct task_struct *tsk,
+			       uid_t loginuid, u32 sessionid);
 #else
 static inline void tty_audit_add_data(struct tty_struct *tty,
 				      unsigned char *data, size_t size)
@@ -513,8 +513,8 @@ static inline void tty_audit_fork(struct
 static inline void tty_audit_push(struct tty_struct *tty)
 {
 }
-static inline void tty_audit_push_task(struct task_struct *tsk,
-					uid_t loginuid, u32 sessionid)
+static inline int tty_audit_push_task(struct task_struct *tsk,
+				      uid_t loginuid, u32 sessionid)
 {
 }
 #endif
Index: linux-2.6-tip/kernel/audit.c
===================================================================
--- linux-2.6-tip.orig/kernel/audit.c
+++ linux-2.6-tip/kernel/audit.c
@@ -467,23 +467,16 @@ static int audit_prepare_user_tty(pid_t 
 	struct task_struct *tsk;
 	int err;
 
-	read_lock(&tasklist_lock);
+	rcu_read_lock();
 	tsk = find_task_by_vpid(pid);
-	err = -ESRCH;
-	if (!tsk)
-		goto out;
-	err = 0;
-
-	spin_lock_irq(&tsk->sighand->siglock);
-	if (!tsk->signal->audit_tty)
-		err = -EPERM;
-	spin_unlock_irq(&tsk->sighand->siglock);
-	if (err)
-		goto out;
-
-	tty_audit_push_task(tsk, loginuid, sessionid);
-out:
-	read_unlock(&tasklist_lock);
+	if (!tsk) {
+		rcu_read_unlock();
+		return -ESRCH;
+	}
+	get_task_struct(tsk);
+	rcu_read_unlock();
+	err = tty_audit_push_task(tsk, loginuid, sessionid);
+	put_task_struct(tsk);
 	return err;
 }
 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region
  2009-12-01 17:37                     ` [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region Thomas Gleixner
@ 2009-12-01 18:22                       ` Oleg Nesterov
  2009-12-01 19:53                         ` Thomas Gleixner
  0 siblings, 1 reply; 58+ messages in thread
From: Oleg Nesterov @ 2009-12-01 18:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Christoph Hellwig,
	Nick Piggin, Linux Kernel Mailing List, Al Viro, James Morris

On 12/01, Thomas Gleixner wrote:
>
> -void tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
> +int tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
>  {
> -	struct tty_audit_buf *buf;
> +	struct tty_audit_buf *buf = NULL;
> +	unsigned long flags;
>
> -	spin_lock_irq(&tsk->sighand->siglock);
> -	buf = tsk->signal->tty_audit_buf;
> -	if (buf)
> +	if (!lock_task_sighand(tsk, &flags))
> +		return -ESRCH;
> +
> +	if (tsk->signal->audit_tty && tsk->signal->tty_audit_buf) {
> +		buf = tsk->signal->tty_audit_buf;
>  		atomic_inc(&buf->count);
> -	spin_unlock_irq(&tsk->sighand->siglock);
> +	}
> +	unlock_task_sighand(tsk, &flags);
> +
>  	if (!buf)
> -		return;
> +		return -EPERM;

I think the patch is correct, but it changes the behaviour of
audit_prepare_user_tty() a bit.

Suppose that signal->audit_tty != NULL but signal->tty_audit_buf
is not allocated yet. In this case audit_prepare_user_tty() returns 0
before the patch and -EPERM after.

I do not know if this matters, just to be sure this is OK.
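
If it does matter, keeping the old behaviour should only need the two
cases to be distinguished, something like this (untested, on top of the
hunk above):

	if (!lock_task_sighand(tsk, &flags))
		return -ESRCH;

	if (!tsk->signal->audit_tty) {
		unlock_task_sighand(tsk, &flags);
		return -EPERM;
	}
	buf = tsk->signal->tty_audit_buf;
	if (buf)
		atomic_inc(&buf->count);
	unlock_task_sighand(tsk, &flags);

	if (!buf)
		return 0;	/* audit enabled, but nothing buffered yet */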

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
                   ` (3 preceding siblings ...)
  2009-11-28 17:30 ` Linus Torvalds
@ 2009-12-01 19:01 ` Mathieu Desnoyers
  4 siblings, 0 replies; 58+ messages in thread
From: Mathieu Desnoyers @ 2009-12-01 19:01 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Arnd Bergmann, Linus Torvalds, Paul E. McKenney,
	Linux Kernel Mailing List

Nick Piggin <npiggin@suse.de> wrote:
> Anyway, I don't have a patch yet. I'm sure it can be done
> without extra atomics in fastpaths. Comments?

Hi Nick,

I actually have an implementation doing what you are looking for. I did
it last year, but after that had to focus on finishing my Ph.D., so it
went a bit out of focus. You can see this old thread:

http://lwn.net/Articles/297493/
"Priority Sifting Reader-Writer Lock v13"

It's not fair per-se: it's biased to favor writers over readers.
However, writers vs writers are fair.

Disclaimer: there may be some amount of dust accumulated on these
patches over the past year. Also, I did not do any formal proof of
correctness of these algorithms (yet) ;)

The code is in the LTTng tree for the 2.6.32-rc8 kernel, they can be
listed by searching for "psrwlock" commits, e.g.:

http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fcompudj%2Flinux-2.6-lttng.git&a=search&h=HEAD&st=commit&s=psrwlock

The main commit is:
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=commit;h=4ae9a3960df1eefb726709ed6b697d739db3380b

Documentation (design explanation, performance tests):
http://git.kernel.org/?p=linux/kernel/git/compudj/linux-2.6-lttng.git;a=blob;f=Documentation/psrwlock.txt;h=5309aeb3d04d645d890b5dd7702982631a81e4c8;hb=18f769c8588a9368254b6c4bffc870f08cee6f97

I currently have to focus on my thesis defense (it's actually this
Friday), but I'll try to catch up with feedback if there happens to be
some. Please feel free to look at the code and clean it up if you think
it's worthwhile. It would be a shame to duplicate that work.

Thanks,

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region
  2009-12-01 18:22                       ` Oleg Nesterov
@ 2009-12-01 19:53                         ` Thomas Gleixner
  0 siblings, 0 replies; 58+ messages in thread
From: Thomas Gleixner @ 2009-12-01 19:53 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Peter Zijlstra, Ingo Molnar, Christoph Hellwig,
	Nick Piggin, Linux Kernel Mailing List, Al Viro, James Morris

On Tue, 1 Dec 2009, Oleg Nesterov wrote:
> On 12/01, Thomas Gleixner wrote:
> >
> > -void tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
> > +int tty_audit_push_task(struct task_struct *tsk, uid_t loginuid, u32 sessionid)
> >  {
> > -	struct tty_audit_buf *buf;
> > +	struct tty_audit_buf *buf = NULL;
> > +	unsigned long flags;
> >
> > -	spin_lock_irq(&tsk->sighand->siglock);
> > -	buf = tsk->signal->tty_audit_buf;
> > -	if (buf)
> > +	if (!lock_task_sighand(tsk, &flags))
> > +		return -ESRCH;
> > +
> > +	if (tsk->signal->audit_tty && tsk->signal->tty_audit_buf) {
> > +		buf = tsk->signal->tty_audit_buf;
> >  		atomic_inc(&buf->count);
> > -	spin_unlock_irq(&tsk->sighand->siglock);
> > +	}
> > +	unlock_task_sighand(tsk, &flags);
> > +
> >  	if (!buf)
> > -		return;
> > +		return -EPERM;
> 
> I think the patch is correct, but it changes the behaviour of
> audit_prepare_user_tty() a bit.
> 
> Suppose that signal->audit_tty != NULL but signal->tty_audit_buf
> is not allocated yet. In this audit_prepare_user_tty() returns 0
> before the patch and -EPERM after.
> 
> I do not know if this matters, just to be sure this is OK.

Hmm, true. Missed that. Thanks for catching it.

     tglx

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-11-30 22:49                   ` Linus Torvalds
  2009-12-01 17:37                     ` [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region Thomas Gleixner
@ 2009-12-06  3:12                     ` Eric W. Biederman
  2009-12-07 18:18                       ` Paul E. McKenney
  2009-12-07 18:32                       ` Oleg Nesterov
  1 sibling, 2 replies; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-06  3:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Thomas Gleixner, Peter Zijlstra, Ingo Molnar, Christoph Hellwig,
	Nick Piggin, Linux Kernel Mailing List, Oleg Nesterov

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Mon, 30 Nov 2009, Thomas Gleixner wrote:
>> 
>> I'm aware of that. The number of places where we read_lock
>> tasklist_lock is 79 in 36 files right now. That's not a horrible task
>> to go through them one by one and do a case by case conversion with a
>> proper changelog. That would only leave the write_lock sites. 
>
> The write_lock sites should be fine, since just changing them to a 
> spinlock should be 100% semantically equivalent - except for the lack of 
> interrupt disable. And the lack of interrupt disable will result in a nice 
> big deadlock if some interrupt really does take the spinlock, which is 
> much easier to debug than a subtle race that would get the wrong read 
> value.
>
>> We can then either do the rw_lock to spin_lock conversion or keep the
>> rw_lock which has no readers anymore and behaves like a spinlock for a
>> transition time so reverts of one of the read_lock -> rcu patches
>> could be done to debug stuff.
>
> So as per the above, I wouldn't worry about the write lockers. Might as 
> well change it to a spinlock, since that's what it will act as. It's not 
> as if there is any chance that the spinlock code is subtly buggy.
>
> So the only reason to keep it as a rwlock would be if you decide to do the 
> read-locked cases one by one, and don't end up with all of them converted. 
> Which is a reasonable strategy too, of course. We don't _have_ to convert 
> them all - if the main problem is some starvation issue, it's sufficient 
> to convert just the main read-lock cases so that writers never get 
> starved.
>
> But converting it all would be nice, because that whole
>
> 	write_lock_irq(&tasklist_lock);
>
> to
>
> 	spin_lock(&tasklist_lock);
>
> conversion would likely be a measurable performance win. Both because 
> spinlocks are fundamentally faster (no atomic on unlock), and because you 
> get rid of the irq disable/enable. But in order to get there, you'd have 
> to convert _all_ the read-lockers, so you'd miss the opportunity to only 
> convert the easy cases.


Atomically sending a signal to every member of a process group is the
big fly in the ointment I am aware of.  Last time I looked I could
not see how to convert it to RCU.

Fundamentally: "kill -KILL -pgrp" should be usable to kill all of
the processes in a process group, and "kill -KILL -1" should be usable
to kill everything except the sender and init.  Something I have seen
in shutdown scripts on more than one occasion.

This is subtle in the sense that it won't show up in simple tests if
you get it wrong.

This is a pain because we occasionally signal a process group from
interrupt context.

The trouble as I recall is how to ensure new processes see the signal.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
@ 2009-12-07 18:18                       ` Paul E. McKenney
  2009-12-07 22:24                         ` Eric W. Biederman
  2009-12-07 18:32                       ` Oleg Nesterov
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-12-07 18:18 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List,
	Oleg Nesterov

On Sat, Dec 05, 2009 at 07:12:28PM -0800, Eric W. Biederman wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
> 
> > On Mon, 30 Nov 2009, Thomas Gleixner wrote:
> >> 
> >> I'm aware of that. The number of places where we read_lock
> >> tasklist_lock is 79 in 36 files right now. That's not a horrible task
> >> to go through them one by one and do a case by case conversion with a
> >> proper changelog. That would only leave the write_lock sites. 
> >
> > The write_lock sites should be fine, since just changing them to a 
> > spinlock should be 100% semantically equivalent - except for the lack of 
> > interrupt disable. And the lack of interrupt disable will result in a nice 
> > big deadlock if some interrupt really does take the spinlock, which is 
> > much easier to debug than a subtle race that would get the wrong read 
> > value.
> >
> >> We can then either do the rw_lock to spin_lock conversion or keep the
> >> rw_lock which has no readers anymore and behaves like a spinlock for a
> >> transition time so reverts of one of the read_lock -> rcu patches
> >> could be done to debug stuff.
> >
> > So as per the above, I wouldn't worry about the write lockers. Might as 
> > well change it to a spinlock, since that's what it will act as. It's not 
> > as if there is any chance that the spinlock code is subtly buggy.
> >
> > So the only reason to keep it as a rwlock would be if you decide to do the 
> > read-locked cases one by one, and don't end up with all of them converted. 
> > Which is a reasonable strategy too, of course. We don't _have_ to convert 
> > them all - if the main problem is some starvation issue, it's sufficient 
> > to convert just the main read-lock cases so that writers never get 
> > starved.
> >
> > But converting it all would be nice, because that whole
> >
> > 	write_lock_irq(&tasklist_lock);
> >
> > to
> >
> > 	spin_lock(&tasklist_lock);
> >
> > conversion would likely be a measurable performance win. Both because 
> > spinlocks are fundamentally faster (no atomic on unlock), and because you 
> > get rid of the irq disable/enable. But in order to get there, you'd have 
> > to convert _all_ the read-lockers, so you'd miss the opportunity to only 
> > convert the easy cases.
> 
> Atomically sending signal to every member of a process group, is the
> big fly in the ointment I am aware of.  Last time I looked I could
> not see how to convert it rcu.
> 
> Fundamentally: "kill -KILL -pgrp" should be usable to kill all of
> the processes in a process group, and "kill -KILL -1" should be usable
> to kill everything except the sender and init.  Something I have seen
> in shutdown scripts on more than one occasion.
> 
> This is a subtle in the sense that it won't show up in simple tests if
> you get it wrong.
> 
> This is a pain because we occasionally signal a process group from
> interrupt context.

Is it required that all of the processes see the signal before the
corresponding interrupt handler returns?  (My guess is "no", which
enables a trick or two, but thought I should ask.)

> The trouble as I recall is how to ensure new processes see the signal.

And can we afford to serialize signals to groups of processes?  Not
necessarily one at a time, but a limited set at a given time?
Alternatively, a long list of pending group signals for each new task to
walk?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
  2009-12-07 18:18                       ` Paul E. McKenney
@ 2009-12-07 18:32                       ` Oleg Nesterov
  2009-12-07 20:38                         ` Peter Zijlstra
  2009-12-07 22:10                         ` Eric W. Biederman
  1 sibling, 2 replies; 58+ messages in thread
From: Oleg Nesterov @ 2009-12-07 18:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List

On 12/05, Eric W. Biederman wrote:
>
> Atomically sending signal to every member of a process group, is the
> big fly in the ointment I am aware of.  Last time I looked I could
> not see how to convert it rcu.

I am not sure, but iirc we can do this lockless (under rcu_lock).
We need to modify pid_link to use list_entry and attach_pid() should
add the new task to the end. Of course we need more changes, but
(again iirc) this is not too hard.

> This is a pain because we occasionally signal a process group from
> interrupt context.

Only send_sigio/etc does so, right?


I didn't read the previous discussion yet (will try tomorrow), sorry
if I am off-topic. But I think the nastiest problem with tasklist
is that it protects parent/child relationship. We need per-process
lock, but first we should change ptrace to avoid this lock somehow.
(this is one of the goals of ptrace-utrace, but not "immediate").

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 18:32                       ` Oleg Nesterov
@ 2009-12-07 20:38                         ` Peter Zijlstra
  2009-12-09 15:55                           ` Oleg Nesterov
  2009-12-07 22:10                         ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2009-12-07 20:38 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linus Torvalds, Thomas Gleixner, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List

On Mon, 2009-12-07 at 19:32 +0100, Oleg Nesterov wrote:
> On 12/05, Eric W. Biederman wrote:
> >
> > Atomically sending signal to every member of a process group, is the
> > big fly in the ointment I am aware of.  Last time I looked I could
> > not see how to convert it rcu.
> 
> I am not sure, but iirc we can do this lockless (under rcu_lock).
> We need to modify pid_link to use list_entry and attach_pid() should
> add the new task to the end. Of course we need more changes, but
> (again iirc) this is not too hard.
> 
> > This is a pain because we occasionally signal a process group from
> > interrupt context.
> 
> Only send_sigio/etc does so, right?
> 
> 
> I didn't read the previous discussion yet (will try tomorrow), sorry
> if I am off-topic. But I think the nastiest problem with tasklist
> is that it protects parent/child relationship. We need per-process
> lock, but first we should change ptrace to avoid this lock somehow.
> (this is one of the goals of ptrace-utrace, but not "immediate").

Didn't Thomas and you also come up with a scheme to push most signal
processing into task context?


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 18:32                       ` Oleg Nesterov
  2009-12-07 20:38                         ` Peter Zijlstra
@ 2009-12-07 22:10                         ` Eric W. Biederman
  2009-12-09 15:37                           ` Oleg Nesterov
  1 sibling, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-07 22:10 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List,
	Paul E. McKenney

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/05, Eric W. Biederman wrote:
>>
>> Atomically sending signal to every member of a process group, is the
>> big fly in the ointment I am aware of.  Last time I looked I could
>> not see how to convert it rcu.
>
> I am not sure, but iirc we can do this lockless (under rcu_lock).
> We need to modify pid_link to use list_entry and attach_pid() should
> add the new task to the end. Of course we need more changes, but
> (again iirc) this is not too hard.

The problem is that even adding to the end of the list, we could run
into a deleted entry and not see the new end of the list.

Suppose when we start iterating the list we have:

  A -> B -> C -> D

Then someone deletes some of the entries while we are iterating the list.

A ->
 B' -> C' -> D'

We will continue on traversing through the deleted entries.

Then someone adds a new entry to the end of the list.

A-> N

Since we are at B', C' or D' we will never see the new entry on the
end of the list.

If there was a way to iterate a list with a guarantee that we always saw
new entries while we are traversing it I would love to see it.

Additionally we have the other possibility: if a child is forking,
we send the signal to the parent after the child forks away but before
the child joins whichever list we are walking, and we complete our
traversal without seeing the child.

>> This is a pain because we occasionally signal a process group from
>> interrupt context.
>
> Only send_sigio/etc does so, right?

That seems right.  My memory is vague and as I recall you were the one
who found the senders in interrupt context.

> I didn't read the previous discussion yet (will try tomorrow), sorry
> if I am off-topic. But I think the nastiest problem with tasklist
> is that it protects parent/child relationship. We need per-process
> lock, but first we should change ptrace to avoid this lock somehow.
> (this is one of the goals of ptrace-utrace, but not "immediate").

That is another good one.  The discussion was roughly:
We should get rid of rwlocks.
tasklist_lock is a rwlock.
What prevents us from making tasklist_lock a spinlock etc.

On that score since we take tasklist_lock for write in ptrace_attach
and I presume in the other parent/child cases  I don't think that
issue prevents us from moving off of rwlocks for tasklist_lock.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 18:18                       ` Paul E. McKenney
@ 2009-12-07 22:24                         ` Eric W. Biederman
  2009-12-07 22:35                           ` Andi Kleen
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-07 22:24 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List,
	Oleg Nesterov

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>
> Is it required that all of the processes see the signal before the
> corresponding interrupt handler returns?  (My guess is "no", which
> enables a trick or two, but thought I should ask.)

Not that I recall.  I think it is just an I/O completed signal.

>> The trouble as I recall is how to ensure new processes see the signal.
>
> And can we afford to serialize signals to groups of processes?  Not
> necessarily one at a time, but a limited set at a given time?
> Alternatively, a long list of pending group signals for each new task to
> walk?

Semantically I don't believe there are any particular ordering
requirements, except that the work must be done before we return from
the kernel.

In the ideal implementation we could have hundreds of processes sending
signals to the same process group without affecting the rest of the
kernel.  The current implementation is a concern for scaling and we have
been removing it where we can.  

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 22:24                         ` Eric W. Biederman
@ 2009-12-07 22:35                           ` Andi Kleen
  2009-12-07 23:19                             ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2009-12-07 22:35 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: paulmck, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Oleg Nesterov

ebiederm@xmission.com (Eric W. Biederman) writes:

> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>
>> Is it required that all of the processes see the signal before the
>> corresponding interrupt handler returns?  (My guess is "no", which
>> enables a trick or two, but thought I should ask.)
>
> Not that I recall.  I think it is just an I/O completed signal.

Wasn't there the sysrq SAK too? That one definitely would need
to be careful about synchronicity.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 22:35                           ` Andi Kleen
@ 2009-12-07 23:19                             ` Eric W. Biederman
  2009-12-08  1:39                               ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-07 23:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: paulmck, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Oleg Nesterov

Andi Kleen <andi@firstfloor.org> writes:

> ebiederm@xmission.com (Eric W. Biederman) writes:
>
>> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>>>
>>> Is it required that all of the processes see the signal before the
>>> corresponding interrupt handler returns?  (My guess is "no", which
>>> enables a trick or two, but thought I should ask.)
>>
>> Not that I recall.  I think it is just an I/O completed signal.
>
> Wasn't there the sysrq SAK too? That one definitely would need
> to be careful about synchronicity.

SAK from sysrq is done through schedule work, I seem to recall the
locking being impossible otherwise.  There is also send_sig_all and a
few others from sysrq.  I expect we could legitimately make them
schedule_work as well if needed.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 23:19                             ` Eric W. Biederman
@ 2009-12-08  1:39                               ` Paul E. McKenney
  2009-12-08  2:11                                 ` Eric W. Biederman
  0 siblings, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-12-08  1:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Oleg Nesterov

On Mon, Dec 07, 2009 at 03:19:59PM -0800, Eric W. Biederman wrote:
> Andi Kleen <andi@firstfloor.org> writes:
> 
> > ebiederm@xmission.com (Eric W. Biederman) writes:
> >
> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >>>
> >>> Is it required that all of the processes see the signal before the
> >>> corresponding interrupt handler returns?  (My guess is "no", which
> >>> enables a trick or two, but thought I should ask.)
> >>
> >> Not that I recall.  I think it is just an I/O completed signal.
> >
> > Wasn't there the sysrq SAK too? That one definitely would need
> > to be careful about synchronicity.
> 
> SAK from sysrq is done through schedule work, I seem to recall the
> locking being impossible otherwise.  There is also send_sig_all and a
> few others from sysrq.  I expect we could legitimately make them
> schedule_work as well if needed.

OK, I will chance it...  Here is one possible trick:

o	Maintain a list of ongoing group-signal operations, protected
	by some suitable lock.  These could be in a per-chain-locked
	hash table, hashed by the signal target (e.g., pgrp).

o	When a task is created, it scans the above list, committing
	suicide (or doing whatever the signal requires) if appropriate.

o	When creating a child task, the parent holds an SRCU across
	creation.  It acquires SRCU before starting creation, and
	releases it when it knows that the child has completed
	scanning the above list.

o	The updater does the following:

	o	Add its request to the above list.

	o	Wait for an SRCU grace period to elapse.

	o	Kill off everything currently in the task list,
		and then wait for each such task to get to a point
		where it can be guaranteed not to spawn additional
		tasks.  (This might be mediated via a reference
		count in the corresponding list element, or by
		rescanning the task list, or any of a number of
		similar tricks.)

		Of course, if the signal is non-fatal, then it is
		necessary only to wait until the child has taken
		the signal.

	o	If it is possible for a given task's children to
		outlive it, despite the fact that the children must
		commit suicide upon finding themselves indicated by the
		list, wait for another SRCU grace period to elapse.
		(This additional SRCU grace period would be required
		for a non-fatal pgrp signal, for example.)

	o	Remove the element from the list.

Does this approach make sense, or am I misunderstanding the problem?
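
In other words, something like the following, where every name except
the SRCU primitives themselves is made up, the hash table is simplified
to a single list, and the second grace period and reference counting
are omitted:

	static struct srcu_struct pgrp_signal_srcu;	/* init_srcu_struct() at boot */
	static LIST_HEAD(pending_group_signals);
	static DEFINE_SPINLOCK(pending_lock);

	struct group_signal {		/* one ongoing group-signal operation */
		struct list_head list;
		struct pid *pgrp;
		int sig;
	};

	/* fork side: bracket creation so the child cannot miss an entry */
	int fork_side_sketch(struct task_struct *child)
	{
		int idx = srcu_read_lock(&pgrp_signal_srcu);

		attach_child_to_lists(child);		/* made up */
		act_on_pending_group_signals(child);	/* made up: scan the list above */
		srcu_read_unlock(&pgrp_signal_srcu, idx);
		return 0;
	}

	/* signal side */
	void group_signal_sketch(struct group_signal *gs)
	{
		spin_lock(&pending_lock);
		list_add(&gs->list, &pending_group_signals);
		spin_unlock(&pending_lock);

		synchronize_srcu(&pgrp_signal_srcu);	/* wait out in-flight forks */

		signal_current_members(gs);		/* made up: walk the pgrp */

		spin_lock(&pending_lock);
		list_del(&gs->list);
		spin_unlock(&pending_lock);
	}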

Either way, one additional question...  It seems to me that non-fatal
signals really don't require the above mechanism, because if a task
handles the signal, and then spawns a child, one can argue that the
child came after the signal and should thus be unaffected.  Right?
Or more confusion on my part?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-08  1:39                               ` Paul E. McKenney
@ 2009-12-08  2:11                                 ` Eric W. Biederman
  2009-12-08  2:37                                   ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-08  2:11 UTC (permalink / raw)
  To: paulmck
  Cc: Andi Kleen, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Oleg Nesterov

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Mon, Dec 07, 2009 at 03:19:59PM -0800, Eric W. Biederman wrote:
>> Andi Kleen <andi@firstfloor.org> writes:
>> 
>> > ebiederm@xmission.com (Eric W. Biederman) writes:
>> >
>> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> >>>
>> >>> Is it required that all of the processes see the signal before the
>> >>> corresponding interrupt handler returns?  (My guess is "no", which
>> >>> enables a trick or two, but thought I should ask.)
>> >>
>> >> Not that I recall.  I think it is just an I/O completed signal.
>> >
>> > Wasn't there the sysrq SAK too? That one definitely would need
>> > to be careful about synchronicity.
>> 
>> SAK from sysrq is done through schedule work, I seem to recall the
>> locking being impossible otherwise.  There is also send_sig_all and a
>> few others from sysrq.  I expect we could legitimately make them
>> schedule_work as well if needed.
>
> OK, I will chance it...  Here is one possible trick:
>
> o	Maintain a list of ongoing group-signal operations, protected
> 	by some suitable lock.  These could be in a per-chain-locked
> 	hash table, hashed by the signal target (e.g., pgrp).
>
> o	When a task is created, it scans the above list, committing
> 	suicide (or doing whatever the signal requires) if appropriate.
>
> o	When creating a child task, the parent holds an SRCU across
> 	creation.  It acquires SRCU before starting creation, and
> 	releases it when it knows that the child has completed
> 	scanning the above list.
>
> o	The updater does the following:
>
> 	o	Add its request to the above list.
>
> 	o	Wait for an SRCU grace period to elapse.
>
> 	o	Kill off everything currently in the task list,
> 		and then wait for each such task to get to a point
> 		where it can be guaranteed not to spawn additional
> 		tasks.  (This might be mediated via a reference
> 		count in the corresponding list element, or by
> 		rescanning the task list, or any of a number of
> 		similar tricks.)
>
> 		Of course, if the signal is non-fatal, then it is
> 		necessary only to wait until the child has taken
> 		the signal.
>
> 	o	If it is possible for a given task's children to
> 		outlive it, despite the fact that the children must
> 		commit suicide upon finding themselves indicated by the
> 		list, wait for another SRCU grace period to elapse.
> 		(This additional SRCU grace period would be required
> 		for a non-fatal pgrp signal, for example.)
>
> 	o	Remove the element from the list.
>
> Does this approach make sense, or am I misunderstanding the problem?

I think that is about right.  I played with that idea a little bit.
I was thinking of simply having new children return -ERESTARTSYS, and
retry the fork.  I put it down because I decided that seems like a
very twisted implementation of a read/write lock.

If we can scale noticeably better than tasklist_lock it is
definitely worth doing.  I think it is really easy to tie yourself up
in pretzels thinking about this.

An srcu in the pid structure that we hold while signaling tasks.
Interesting.

> Either way, one additional question...  It seems to me that non-fatal
> signals really don't require the above mechanism, because if a task
> handles the signal, and then spawns a child, one can argue that the
> child came after the signal and should thus be unaffected.  Right?
> Or more confusion on my part?

SIGSTOP also seems pretty important not to escape.  I'm not certain of
the others.  I think I would get a bit upset if job control signals in
the shell stopped working properly.  I think asking the question "did
that app do something wrong with SIGTERM or did the kernel drop it?"
would drive me a bit batty.

It is hard to tell what breaks because most buggy implementations will
work correctly most of the time.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-08  2:11                                 ` Eric W. Biederman
@ 2009-12-08  2:37                                   ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2009-12-08  2:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Andi Kleen, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List, Oleg Nesterov

On Mon, Dec 07, 2009 at 06:11:49PM -0800, Eric W. Biederman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > On Mon, Dec 07, 2009 at 03:19:59PM -0800, Eric W. Biederman wrote:
> >> Andi Kleen <andi@firstfloor.org> writes:
> >> 
> >> > ebiederm@xmission.com (Eric W. Biederman) writes:
> >> >
> >> >> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> >> >>>
> >> >>> Is it required that all of the processes see the signal before the
> >> >>> corresponding interrupt handler returns?  (My guess is "no", which
> >> >>> enables a trick or two, but thought I should ask.)
> >> >>
> >> >> Not that I recall.  I think it is just an I/O completed signal.
> >> >
> >> > Wasn't there the sysrq SAK too? That one definitely would need
> >> > to be careful about synchronicity.
> >> 
> >> SAK from sysrq is done through schedule work, I seem to recall the
> >> locking being impossible otherwise.  There is also send_sig_all and a
> >> few others from sysrq.  I expect we could legitimately make them
> >> schedule_work as well if needed.
> >
> > OK, I will chance it...  Here is one possible trick:
> >
> > o	Maintain a list of ongoing group-signal operations, protected
> > 	by some suitable lock.  These could be in a per-chain-locked
> > 	hash table, hashed by the signal target (e.g., pgrp).
> >
> > o	When a task is created, it scans the above list, committing
> > 	suicide (or doing whatever the signal requires) if appropriate.
> >
> > o	When creating a child task, the parent holds an SRCU across
> > 	creation.  It acquires SRCU before starting creation, and
> > 	releases it when it knows that the child has completed
> > 	scanning the above list.
> >
> > o	The updater does the following:
> >
> > 	o	Add its request to the above list.
> >
> > 	o	Wait for an SRCU grace period to elapse.
> >
> > 	o	Kill off everything currently in the task list,
> > 		and then wait for each such task to get to a point
> > 		where it can be guaranteed not to spawn additional
> > 		tasks.  (This might be mediated via a reference
> > 		count in the corresponding list element, or by
> > 		rescanning the task list, or any of a number of
> > 		similar tricks.)
> >
> > 		Of course, if the signal is non-fatal, then it is
> > 		necessary only to wait until the child has taken
> > 		the signal.
> >
> > 	o	If it is possible for a given task's children to
> > 		outlive it, despite the fact that the children must
> > 		commit suicide upon finding themselves indicated by the
> > 		list, wait for another SRCU grace period to elapse.
> > 		(This additional SRCU grace period would be required
> > 		for a non-fatal pgrp signal, for example.)
> >
> > 	o	Remove the element from the list.
> >
> > Does this approach make sense, or am I misunderstanding the problem?
> 
> I think that is about right.  I played with that idea a little bit.
> I was thinking of simply having new children return -ERESTARTSYS, and
> retry the fork.  I put it down because I decided that seems like a
> very twisted implementation of a read/write lock.
> 
> If we can scale noticeably better a than tasklist_lock it is
> definitely worth doing.  I think it is really easy to tie yourself up
> in pretzels thinking about this.

No argument here!!!

> An srcu in the pid structure that we hold while signaling tasks.
> Interesting.

;-)

> > Either way, one additional question...  It seems to me that non-fatal
> > signals really don't require the above mechanism, because if a task
> > handles the signal, and then spawns a child, one can argue that the
> > child came after the signal and should thus be unaffected.  Right?
> > Or more confusion on my part?
> 
> SIGSTOP also seems pretty important not to escape.  I'm not certain of
> the others.  I think I would get a bit upset if job control signals in
> the shell stopped working properly.  I think asking the question did
> that app do something wrong with SIGTERM or did the kernel drop it
> would drive me a bit batty.

Good point!!!  It does indeed seem to me that SIGSTOP needs to be
handled as carefully as does a fatal signal.  I guess that SIGCONT
is easier, at least unless there is some tricky way that a stopped
task can nevertheless spawn a new task.  ;-)

> It is hard to tell what breaks because most buggy implementations will
> work correctly most of the time.

Indeed you are quite right, and it is thus worthwhile burning a few
extra CPU cycles to faithfully emulate the old behavior.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 22:10                         ` Eric W. Biederman
@ 2009-12-09 15:37                           ` Oleg Nesterov
  2009-12-10  3:36                             ` Eric W. Biederman
  2009-12-10  6:22                             ` Paul E. McKenney
  0 siblings, 2 replies; 58+ messages in thread
From: Oleg Nesterov @ 2009-12-09 15:37 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List,
	Paul E. McKenney

On 12/07, Eric W. Biederman wrote:
>
> Oleg Nesterov <oleg@redhat.com> writes:
>
> > On 12/05, Eric W. Biederman wrote:
> >>
> >> Atomically sending signal to every member of a process group, is the
> >> big fly in the ointment I am aware of.  Last time I looked I could
> >> not see how to convert it rcu.
> >
> > I am not sure, but iirc we can do this lockless (under rcu_lock).
> > We need to modify pid_link to use list_entry and attach_pid() should
> > add the new task to the end. Of course we need more changes, but
> > (again iirc) this is not too hard.
>
> The problem is that even adding to the end of the list, we could run
> into a deleted entry and not see the new end of the list.
>
> Suppose when we start iterating the list we have:
>
>   A -> B -> C -> D
>
> Then someone deletes some of the entries while we are iterating the list.
>
> A ->
>  B' -> C' -> D'
>
> We will continue on traversing through the deleted entries.
>
> Then someone adds a new entry to the end of the list.
>
> A-> N
>
> Since we are at B', C' or D' we will never see the new entry on the
> end of the list.

Yes, but who can add the new entry?

Let's forget about setpgrp/etc for the moment, I think we have "races"
with or without tasklist. Say, setpgrp() can add the new process to the
already "killed" pgrp.

Then, I think the only important case is SIGKILL/SIGSTOP (or other
signals which can't be blocked/ignored). We must kill/stop the entire
pgrp, we must not race with fork() and miss a child.

In this case I _think_ rcu_read_lock() is enough,

	rcu_read_lock()

	list_for_each_entry_rcu(task, pid->tasks[PIDTYPE_PGID])
		group_send_sig_info(sig, task);

	rcu_read_unlock();

except group_send_sig_info() can race with mt-exec, but this is simple
to fix.

If we send a signal (not necessarily SIGKILL) to a process P, we must see
all children which were forked by P; both send_signal() and copy_process()
take the same ->siglock, we must see the result of list_add_tail_rcu().
And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.
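
That is, the two critical sections that matter look roughly like this
(only a sketch: the list and member names are invented, and it assumes
fork keeps bailing out with -ERESTARTNOINTR when a signal is pending at
attach time, as copy_process() does now, iirc):

	/* fork side: the child is attached under the parent's ->siglock */
	spin_lock_irq(&current->sighand->siglock);
	recalc_sigpending();
	if (signal_pending(current)) {
		spin_unlock_irq(&current->sighand->siglock);
		return -ERESTARTNOINTR;	/* the signal won the race, no new child */
	}
	list_add_tail_rcu(&child->pgid_node, &pgrp_pid->tasks_pgid); /* invented */
	spin_unlock_irq(&current->sighand->siglock);

	/* kill side: delivery to each task takes the same ->siglock */
	spin_lock_irqsave(&p->sighand->siglock, flags);
	deliver_sig(sig, p);	/* invented stand-in for send_signal() */
	spin_unlock_irqrestore(&p->sighand->siglock, flags);

Either the attach happens first and the walk (which takes ->siglock per
task) must observe the list_add_tail_rcu(), or the signal gets there
first and the fork does not produce a child we could miss.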

If list_for_each_entry() does not see the exited process P, this means
we see the result of list_del_rcu(). But this also means we must see
the result of the previous list_add_rcu().

IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(): if we
don't see the old entry on the list, we must see the new one, right?

(I am ignoring the case when list_for_each_entry_rcu() sees a process
 P but lock_task_sighand(P) fails, I think this is the same as if we
 we missed P)

Now suppose a signal is blocked/ignored or has a handler. In this case
we can miss a child, but I think this is OK, we can pretend the new
child was forked after kill_pgrp() completes. Say, this child C was
forked by some process P. We can miss C only if it was forked after
we already sent the signal to P.

However. I do not pretend the reasoning above is "complete", and
perhaps I missed something else.

> Additionally we have the other possibility that if a child is forking
> we send the signal to the parent after the child forks away but before
> the child joins whichever list we are walking, and we complete our
> traversal without seeing the child.

Not sure I understand... But afaics this case is covered above.
->siglock should serialize this, copy_process() does attach_pid()
under this lock.

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-07 20:38                         ` Peter Zijlstra
@ 2009-12-09 15:55                           ` Oleg Nesterov
  0 siblings, 0 replies; 58+ messages in thread
From: Oleg Nesterov @ 2009-12-09 15:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric W. Biederman, Linus Torvalds, Thomas Gleixner, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List

On 12/07, Peter Zijlstra wrote:
>
> Didn't Thomas and you also come up with a scheme to push most signal
> processing into task context?

We discussed another case (posix timers), this doesn't involve tasklist.
Oh, and it still lacks a solution :/

Oleg.


^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-09 15:37                           ` Oleg Nesterov
@ 2009-12-10  3:36                             ` Eric W. Biederman
  2009-12-10  6:22                             ` Paul E. McKenney
  1 sibling, 0 replies; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-10  3:36 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Linus Torvalds, Thomas Gleixner, Peter Zijlstra, Ingo Molnar,
	Christoph Hellwig, Nick Piggin, Linux Kernel Mailing List,
	Paul E. McKenney

Oleg Nesterov <oleg@redhat.com> writes:

> On 12/07, Eric W. Biederman wrote:
>>
>> Oleg Nesterov <oleg@redhat.com> writes:
>>
>> > On 12/05, Eric W. Biederman wrote:
>> >>
>> >> Atomically sending signal to every member of a process group, is the
>> >> big fly in the ointment I am aware of.  Last time I looked I could
>> >> not see how to convert it rcu.
>> >
>> > I am not sure, but iirc we can do this lockless (under rcu_lock).
>> > We need to modify pid_link to use list_entry and attach_pid() should
>> > add the new task to the end. Of course we need more changes, but
>> > (again iirc) this is not too hard.
>>
>> The problem is that even adding to the end of the list, we could run
>> into a deleted entry and not see the new end of the list.
>>
>> Suppose when we start iterating the list we have:
>>
>>   A -> B -> C -> D
>>
>> Then someone deletes some of the entries while we are iterating the list.
>>
>> A ->
>>  B' -> C' -> D'
>>
>> We will continue on traversing through the deleted entries.
>>
>> Then someone adds a new entry to the end of the list.
>>
>> A-> N
>>
>> Since we are at B', C' or D' we will never see the new entry on the
>> end of the list.
>
> Yes, but who can add the new entry?
>
> Let's forget about setpgrp/etc for the moment, I think we have "races"
> with or without tasklist. Say, setpgrp() can add the new process to the
> already "killed" pgrp.

Agreed. A setpgrp call that we miss can be considered to have happened
after the signal was sent to the process group.

> Then, I think the only important case is SIGKILL/SIGSTOP (or other
> signals which can't be blocked/ignored). We must kill/stop the entire
> pgrp, we must not race with fork() and miss a child.
>
> In this case I _think_ rcu_read_lock() is enough,
>
> 	rcu_read_lock()
>
> 	list_for_each_entry_rcu(task, &pid->tasks[PIDTYPE_PGID])
> 		group_send_sig_info(sig, task);
>
> 	rcu_read_unlock();
>
> except group_send_sig_info() can race with mt-exec, but this is simple
> to fix.

After we change the code to always add new entries to the end of
the rcu list, you are correct.

The danger I saw is having a new process (that we must handle) show
up while we have gotten into an rcu cul-de-sac of the process list,
where we do not see new processes added to the end.

Once a process has gotten a signal we don't care: any children it spawns
happen after the signal was sent.

So we only care about children for processes that we have not delivered
the signal to.  If we add children to the end of the list they will be
visible if you traverse through the parent.  Since we have not traversed
through the parent and delivered the signal (by definition), we don't
care.

This works for all signals, and especially for SIGKILL and SIGSTOP.

I am tempted to apply the test that if user space can prove the ordering
is not what is expected then there is a problem.  However, signals have
not always arrived strongly ordered with other user space events, and
the tasklist lock today does not appear to preclude a task that has
received a signal from talking to a task that has not yet received a
signal via pipes, etc.  

So as long as we can guarantee that if you traverse a parent you will
also traverse all of the children forked before you traversed the
parent, then we should be fine.
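
For concreteness, a minimal sketch of the lockless walk under
discussion (the function name and the list head / link member names
below are made up for illustration, assuming pid_link is switched to
a list_head with tail adds):

	static int kill_pgrp_info_rcu(int sig, struct siginfo *info,
				      struct pid *pgrp)
	{
		struct task_struct *p;
		int err = 0;

		rcu_read_lock();
		list_for_each_entry_rcu(p, &pgrp->tasks[PIDTYPE_PGID],
					pids[PIDTYPE_PGID].node) {
			/*
			 * group_send_sig_info() takes ->siglock, so any
			 * child forked before p gets the signal is already
			 * tail-linked and will be visited later in the walk.
			 */
			err = group_send_sig_info(sig, info, p);
		}
		rcu_read_unlock();

		return err;
	}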


> If we send a signal (not necessarily SIGKILL) to a process P, we must see
> all children which were forked by P, both send_signal() and copy_process()
> take the same ->siglock, we must see the result of list_add_tail_rcu().
> And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.

Sounds right.

> If list_for_each_entry() does not see the exited process P, this means
> we see the result of list_del_rcu(). But this also means we must see
> the result of the previous list_add_rcu().
>
> IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(); if we
> don't see the exited entry on the list, we must see the new one, right?
>
> (I am ignoring the case when list_for_each_entry_rcu() sees a process
>  P but lock_task_sighand(P) fails, I think this is the same as if
>  we missed P)
>
> Now suppose a signal is blocked/ignored or has a handler. In this case
> we can miss a child, but I think this is OK, we can pretend the new
> child was forked after kill_pgrp() completes. Say, this child C was
> forked by some process P. We can miss C only if it was forked after
> we already sent the signal to P.

I don't see how we can miss a child that matters.

> However. I do not pretend the reasoning above is "complete", and
> perhaps I missed something else.

I can not find fault with your idea.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-09 15:37                           ` Oleg Nesterov
  2009-12-10  3:36                             ` Eric W. Biederman
@ 2009-12-10  6:22                             ` Paul E. McKenney
  2009-12-10 10:31                               ` Eric W. Biederman
  1 sibling, 1 reply; 58+ messages in thread
From: Paul E. McKenney @ 2009-12-10  6:22 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric W. Biederman, Linus Torvalds, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

On Wed, Dec 09, 2009 at 04:37:09PM +0100, Oleg Nesterov wrote:
> On 12/07, Eric W. Biederman wrote:
> >
> > Oleg Nesterov <oleg@redhat.com> writes:
> >
> > > On 12/05, Eric W. Biederman wrote:
> > >>
> > >> Atomically sending signal to every member of a process group, is the
> > >> big fly in the ointment I am aware of.  Last time I looked I could
> > >> not see how to convert it rcu.
> > >
> > > I am not sure, but iirc we can do this lockless (under rcu_lock).
> > > We need to modify pid_link to use list_entry and attach_pid() should
> > > add the new task to the end. Of course we need more changes, but
> > > (again iirc) this is not too hard.
> >
> > The problem is that even adding to the end of the list, we could run
> > into a deleted entry and not see the new end of the list.
> >
> > Suppose when we start iterating the list we have:
> >
> >   A -> B -> C -> D
> >
> > Then someone deletes some of the entries while we are iterating the list.
> >
> > A ->
> >  B' -> C' -> D'
> >
> > We will continue on traversing through the deleted entries.
> >
> > Then someone adds a new entry to the end of the list.
> >
> > A-> N
> >
> > Since we are at B', C' or D' we will never see the new entry on the
> > end of the list.
> 
> Yes, but who can add the new entry?
> 
> Let's forget about setpgrp/etc for the moment, I think we have "races"
> with or without tasklist. Say, setpgrp() can add the new process to the
> already "killed" pgrp.
> 
> Then, I think the only important case is SIGKILL/SIGSTOP (or other
> signals which can't be blocked/ignored). We must kill/stop the entire
> pgrp, we must not race with fork() and miss a child.
> 
> In this case I _think_ rcu_read_lock() is enough,
> 
> 	rcu_read_lock()
> 
> 	list_for_each_entry_rcu(task, &pid->tasks[PIDTYPE_PGID])
> 		group_send_sig_info(sig, task);
> 
> 	rcu_read_unlock();
> 
> except group_send_sig_info() can race with mt-exec, but this is simple
> to fix.
> 
> If we send a signal (not necessarily SIGKILL) to a process P, we must see
> all children which were forked by P, both send_signal() and copy_process()
> take the same ->siglock, we must see the result of list_add_tail_rcu().
> And, after we sent SIGKILL/SIGSTOP, it can't fork the new child.
> 
> If list_for_each_entry() does not see the exited process P, this means
> we see the result of list_del_rcu(). But this also means we must see
> the result of the previous list_add_rcu().
> 
> IOW, fork+exit means list_add_rcu() + wmb() + list_del_rcu(); if we
> don't see the exited entry on the list, we must see the new one, right?
> 
> (I am ignoring the case when list_for_each_entry_rcu() sees a process
>  P but lock_task_sighand(P) fails, I think this is the same as if
>  we missed P)
> 
> Now suppose a signal is blocked/ignored or has a handler. In this case
> we can miss a child, but I think this is OK, we can pretend the new
> child was forked after kill_pgrp() completes. Say, this child C was
> forked by some process P. We can miss C only if it was forked after
> we already sent the signal to P.
> 
> However. I do not pretend the reasoning above is "complete", and
> perhaps I missed something else.

My main concern would be "fork storms", where each CPU in a large
system is spawning children in a pgrp that some other CPU is attempting
to kill.  The CPUs spawning children might be able to keep ahead of
the single CPU, so that the pgrp never is completely killed.

Enlisting the aid of the CPUs doing the spawning (e.g., by having them
consult a list of signals being sent) prevents this fork-storm scenario.
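
Purely as a sketch of that idea (every field, lock and helper below is
hypothetical, only to illustrate "consult a list of signals being
sent"):

	/* sender: record the in-flight group signal, then do the rcu walk */
	spin_lock_irq(&pgrp_sig_lock);		/* hypothetical lock */
	pgrp->pending_sig = sig;		/* hypothetical field */
	spin_unlock_irq(&pgrp_sig_lock);
	/* ... rcu walk of the pgrp, signalling each member ... */

	/* fork side, in copy_process(), right after attach_pid(): deliver
	 * any recorded signal to the new child ourselves, so the forking
	 * CPUs can never outrun the one CPU doing the kill */
	if (pgrp->pending_sig)			/* hypothetical check */
		group_send_sig_info(pgrp->pending_sig, SEND_SIG_PRIV, p);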

							Thanx, Paul

> > Additionally we have the other possibility that if a child is forking
> > we send the signal to the parent after the child forks away but before
> > the child joins whichever list we are walking, and we complete our
> > traversal without seeing the child.
> 
> Not sure I understand... But afaics this case is covered above.
> ->siglock should serialize this, copy_process() does attach_pid()
> under this lock.
> 
> Oleg.
> 

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-10  6:22                             ` Paul E. McKenney
@ 2009-12-10 10:31                               ` Eric W. Biederman
  2009-12-10 16:41                                 ` Paul E. McKenney
  0 siblings, 1 reply; 58+ messages in thread
From: Eric W. Biederman @ 2009-12-10 10:31 UTC (permalink / raw)
  To: paulmck
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> My main concern would be "fork storms", where each CPU in a large
> system is spawning children in a pgrp that some other CPU is attempting
> to kill.  The CPUs spawning children might be able to keep ahead of
> the single CPU, so that the pgrp never is completely killed.
>
> Enlisting the aid of the CPUs doing the spawning (e.g., by having them
> consult a list of signals being sent) prevents this fork-storm scenario.

We almost have a worst case bound. We can have at most max_threads
processes.  Unfortunately it appears we don't force an rcu grace
period anywhere.  So it does appear theoretically possible to fork and
exit on a bunch of other cpus, indefinitely extending the rcu interval.

Still that is all inside the tasklist_lock, which serializes all of those
other cpus.  So as long as the cost of queuing signals is less than the
cost of adding processes to the task lists we won't have a problem.

Eric

^ permalink raw reply	[flat|nested] 58+ messages in thread

* Re: [rfc] "fair" rw spinlocks
  2009-12-10 10:31                               ` Eric W. Biederman
@ 2009-12-10 16:41                                 ` Paul E. McKenney
  0 siblings, 0 replies; 58+ messages in thread
From: Paul E. McKenney @ 2009-12-10 16:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Oleg Nesterov, Linus Torvalds, Thomas Gleixner, Peter Zijlstra,
	Ingo Molnar, Christoph Hellwig, Nick Piggin,
	Linux Kernel Mailing List

On Thu, Dec 10, 2009 at 02:31:39AM -0800, Eric W. Biederman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> 
> > My main concern would be "fork storms", where each CPU in a large
> > system is spawning children in a pgrp that some other CPU is attempting
> > to kill.  The CPUs spawning children might be able to keep ahead of
> > the single CPU, so that the pgrp never is completely killed.
> >
> > Enlisting the aid of the CPUs doing the spawning (e.g., by having them
> > consult a list of signals being sent) prevents this fork-storm scenario.
> 
> We almost have a worst case bound. We can have at most max_threads
> processes.  Unfortunately it appears we don't force an rcu grace
> period anywhere.  So it does appear theoretically possible to fork and
> exit on a bunch of other cpus, indefinitely extending the rcu interval.

The RCU grace period will still complete in a timely fashion, at least
assuming that each RCU read-side critical section completes in a timely
fashion.  The old Classic implementations need only a context switch on
each CPU (which should happen at some point upon return to user space),
and the counter-based implementations (SRCU and preemptible RCU) use
pairs of counters to avoid waiting on new RCU read-side critical sections.

Either way, the RCU grace period waits only for the RCU read-side
critical sections that started before it did, not for any later RCU
read-side critical sections.
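
(In code terms, for the updater side; standard RCU API, with the foo
names used as placeholders:)

	struct foo *new = kmalloc(sizeof(*new), GFP_KERNEL);
	struct foo *old;

	spin_lock(&foo_lock);
	old = gbl_foo;
	rcu_assign_pointer(gbl_foo, new);
	spin_unlock(&foo_lock);

	synchronize_rcu();	/* waits only for read-side critical
				 * sections that began before this call;
				 * later rcu_read_lock()s are not waited on */
	kfree(old);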

> Still that is all inside the tasklist_lock, which serializes all of those
> other cpus.  So as long as the cost of queuing signals is less than the
> cost of adding processes to the task lists we won't have a problem.

Agreed, as long as we continue to serialize task creation, we should be OK.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 58+ messages in thread

end of thread, other threads:[~2009-12-10 16:41 UTC | newest]

Thread overview: 58+ messages
2009-11-23 14:54 [rfc] "fair" rw spinlocks Nick Piggin
2009-11-24 20:19 ` David Miller
2009-11-25  6:52   ` Nick Piggin
2009-11-25  8:49   ` Andi Kleen
2009-11-25  8:56     ` Nick Piggin
2009-11-24 20:47 ` Andi Kleen
2009-11-25  6:54   ` Nick Piggin
2009-11-25  8:48     ` Andi Kleen
2009-11-25 13:09       ` Arnd Bergmann
2009-11-28  2:07 ` Paul E. McKenney
2009-11-28 11:15   ` Andi Kleen
2009-11-28 15:20     ` Paul E. McKenney
2009-11-28 17:30 ` Linus Torvalds
2009-11-29 18:51   ` Paul E. McKenney
2009-11-30  7:57     ` Nick Piggin
2009-11-30  7:55   ` Nick Piggin
2009-11-30 15:22     ` Linus Torvalds
2009-11-30 15:40       ` Nick Piggin
2009-11-30 16:07         ` Linus Torvalds
2009-11-30 16:17           ` Nick Piggin
2009-11-30 16:39           ` Paul E. McKenney
2009-11-30 17:05             ` Linus Torvalds
2009-11-30 17:13               ` Nick Piggin
2009-11-30 17:18                 ` Linus Torvalds
2009-12-01 17:03                   ` Arnd Bergmann
2009-12-01 17:15                     ` Linus Torvalds
2009-11-30 18:29                 ` Paul E. McKenney
2009-11-30 16:20     ` Paul E. McKenney
2009-11-30 10:00   ` Christoph Hellwig
2009-11-30 15:52     ` Linus Torvalds
2009-11-30 17:46       ` Ingo Molnar
2009-11-30 21:12         ` Thomas Gleixner
2009-11-30 21:27           ` Peter Zijlstra
2009-11-30 22:02             ` Thomas Gleixner
2009-11-30 22:11               ` Linus Torvalds
2009-11-30 22:37                 ` Thomas Gleixner
2009-11-30 22:49                   ` Linus Torvalds
2009-12-01 17:37                     ` [PATCH] audit: Call tty_audit_push_task() outside preempt disabled region Thomas Gleixner
2009-12-01 18:22                       ` Oleg Nesterov
2009-12-01 19:53                         ` Thomas Gleixner
2009-12-06  3:12                     ` [rfc] "fair" rw spinlocks Eric W. Biederman
2009-12-07 18:18                       ` Paul E. McKenney
2009-12-07 22:24                         ` Eric W. Biederman
2009-12-07 22:35                           ` Andi Kleen
2009-12-07 23:19                             ` Eric W. Biederman
2009-12-08  1:39                               ` Paul E. McKenney
2009-12-08  2:11                                 ` Eric W. Biederman
2009-12-08  2:37                                   ` Paul E. McKenney
2009-12-07 18:32                       ` Oleg Nesterov
2009-12-07 20:38                         ` Peter Zijlstra
2009-12-09 15:55                           ` Oleg Nesterov
2009-12-07 22:10                         ` Eric W. Biederman
2009-12-09 15:37                           ` Oleg Nesterov
2009-12-10  3:36                             ` Eric W. Biederman
2009-12-10  6:22                             ` Paul E. McKenney
2009-12-10 10:31                               ` Eric W. Biederman
2009-12-10 16:41                                 ` Paul E. McKenney
2009-12-01 19:01 ` Mathieu Desnoyers
