* [RFC] NUMA futex hashing
@ 2006-08-08  7:07 Ravikiran G Thirumalai
  2006-08-08  9:14 ` Eric Dumazet
                   ` (2 more replies)
  0 siblings, 3 replies; 78+ messages in thread
From: Ravikiran G Thirumalai @ 2006-08-08  7:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar

The current futex hash scheme is not the best for NUMA.  The futex hash table
is an array of struct futex_hash_bucket, which is just a spinlock and a
list_head -- this means multiple spinlocks on the same cacheline, and on NUMA
machines, on the same internode cacheline.  If futexes of two unrelated
threads running on two different nodes happen to hash onto adjacent hash
buckets, or onto buckets on the same internode cacheline, then we have the
internode cacheline bouncing between nodes.
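
To illustrate the layout (a sketch; sizes assume a typical x86-64 config
and vary with debug options):

struct futex_hash_bucket {
	spinlock_t       lock;   /* ~4 bytes */
	struct list_head chain;  /* 16 bytes: two pointers */
};

/*
 * sizeof(struct futex_hash_bucket) is ~24 bytes after padding, so a
 * 128-byte internode cacheline holds ~5 adjacent buckets and their
 * spinlocks; contention on any one lock dirties the line for all.
 */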

Here is a simple scheme which maintains per-node hash tables for futexes.  

In this scheme, a private futex is assigned to the node id of the futex's KVA.
The reasoning is that the futex KVA is allocated from the node indicated by
the memory policy set by the process, and that should be a good 'home node'
for that futex.  Of course this helps workloads where all the threads of a
process are bound to the same node, but it seems reasonable to run all
threads of a process on the same node anyway.

A shared futex is assigned a home node based on jhash2 itself.  Since the
inode and offset are used as the key, the same inode and offset always map
to the same home node for a shared futex.  This distributes shared futexes
across all nodes.

Comments? Suggestions? Particularly regarding shared futexes.  Any policy
suggestions?

Thanks,
Kiran

Note: This patch needs to have kvaddr_to_nid() reintroduced.  This was taken
out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8

Index: linux-2.6.18-rc3/kernel/futex.c
===================================================================
--- linux-2.6.18-rc3.orig/kernel/futex.c	2006-08-02 12:11:34.000000000 -0700
+++ linux-2.6.18-rc3/kernel/futex.c	2006-08-02 16:48:47.000000000 -0700
@@ -137,20 +137,35 @@ struct futex_hash_bucket {
        struct list_head       chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly;
 
 /* Futex-fs vfsmount entry: */
 static struct vfsmount *futex_mnt;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
+ * With NUMA aware futex hashing, we have per-node hash tables.
+ * We determine the home node of a futex based on the KVA -- if the futex
+ * is a private futex.  For shared futexes, we use jhash2 itself on the
+ * futex_key to arrive at a home node.
  */
 static struct futex_hash_bucket *hash_futex(union futex_key *key)
 {
+	int nodeid;
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
 			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+	if (key->both.offset & 0x1) {
+		/* 
+		 * Shared futex: Use any of the 'possible' nodes as home node.
+		 */
+		nodeid = hash & (MAX_NUMNODES -1);
+		BUG_ON(!node_possible(nodeid));
+	} else
+		/* Private futex */ 
+		nodeid = kvaddr_to_nid(key->both.ptr);
+	
+	return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)];
 }
 
 /*
@@ -1909,13 +1924,25 @@ static int __init init(void)
 {
 	unsigned int i;
 
+	int nid;
+
+	for_each_node(nid)
+	{
+		futex_queues[nid] = kmalloc_node(
+					(sizeof(struct futex_hash_bucket) *
+					(1 << FUTEX_HASHBITS)),
+					GFP_KERNEL, nid);
+		if (!futex_queues[nid])
+			panic("futex_init: Allocation of multi-node futex_queues failed");
+		for (i = 0; i < (1 << FUTEX_HASHBITS); i++) {
+			INIT_LIST_HEAD(&futex_queues[nid][i].chain);
+			spin_lock_init(&futex_queues[nid][i].lock);
+		}
+	}
+
 	register_filesystem(&futex_fs_type);
 	futex_mnt = kern_mount(&futex_fs_type);
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
-		INIT_LIST_HEAD(&futex_queues[i].chain);
-		spin_lock_init(&futex_queues[i].lock);
-	}
 	return 0;
 }
 __initcall(init);



* Re: [RFC] NUMA futex hashing
  2006-08-08  7:07 [RFC] NUMA futex hashing Ravikiran G Thirumalai
@ 2006-08-08  9:14 ` Eric Dumazet
  2006-08-08 20:31   ` Ravikiran G Thirumalai
  2006-08-08  9:37 ` Jes Sorensen
  2006-08-08  9:57 ` Andi Kleen
  2 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08  9:14 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote:
> Current futex hash scheme is not the best for NUMA.   The futex hash table
> is an array of struct futex_hash_bucket, which is just a spinlock and a
> list_head -- this means multiple spinlocks on the same cacheline and on
> NUMA machines, on the same internode cacheline.  If futexes of two
> unrelated threads running on two different nodes happen to hash onto
> adjacent hash buckets, or buckets on the same internode cacheline, then we
> have the internode cacheline bouncing between nodes.
>
> Here is a simple scheme which maintains per-node hash tables for futexes.
>
> In this scheme, a private futex is assigned to the node id of the futex's
> KVA. The reasoning is, the futex KVA is allocated from the node as
> indicated by memory policy set by the process, and that should be a good
> 'home node' for that futex.  Of course this helps workloads where all the
> threads of a process are bound to the same node, but it seems reasonable to
> run all threads of a process on the same node.
>
> A shared futex is assigned a home node based on jhash2 itself.  Since inode
> and offset are used as the key, the same inode offset is used to arrive at
> the home node of a shared futex.  This distributes shared futexes across
> all nodes.
>
> Comments? Suggestions? Particularly regarding shared futexes.  Any policy
> suggestions?
>

Your patch seems fine, but I have one comment.

For a non-NUMA machine, we would have one useless indirection to get the
futex_queues pointer.

static struct futex_hash_bucket *futex_queues[1];

I think it is worth redesigning your patch so that this extra indirection is
needed only on NUMA machines.

#if defined(CONFIG_NUMA)
static struct futex_hash_bucket *futex_queues[MAX_NUMNODES];
#define FUTEX_QUEUES(nodeid, hash) \
	(&futex_queues[nodeid][(hash) & ((1 << FUTEX_HASHBITS) - 1)])
#else
static struct futex_hash_bucket futex_queues[1 << FUTEX_HASHBITS];
#define FUTEX_QUEUES(nodeid, hash) \
	(&futex_queues[(hash) & ((1 << FUTEX_HASHBITS) - 1)])
#endif

Thank you



* Re: [RFC] NUMA futex hashing
  2006-08-08  7:07 [RFC] NUMA futex hashing Ravikiran G Thirumalai
  2006-08-08  9:14 ` Eric Dumazet
@ 2006-08-08  9:37 ` Jes Sorensen
  2006-08-08  9:58   ` Andi Kleen
  2006-08-08  9:57 ` Andi Kleen
  2 siblings, 1 reply; 78+ messages in thread
From: Jes Sorensen @ 2006-08-08  9:37 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

>>>>> "Ravikiran" == Ravikiran G Thirumalai <kiran@scalex86.org> writes:

Ravikiran> Current futex hash scheme is not the best for NUMA.  The
Ravikiran> futex hash table is an array of struct futex_hash_bucket,
Ravikiran> which is just a spinlock and a list_head -- this means
Ravikiran> multiple spinlocks on the same cacheline and on NUMA
Ravikiran> machines, on the same internode cacheline.  If futexes of
Ravikiran> two unrelated threads running on two different nodes happen
Ravikiran> to hash onto adjacent hash buckets, or buckets on the same
Ravikiran> internode cacheline, then we have the internode cacheline
Ravikiran> bouncing between nodes.

Ravikiran,

Using that argument, all you need to do is add
____cacheline_aligned_in_smp to the definition of
struct futex_hash_bucket and the problem is solved, given that the
internode cacheline size in a NUMA system is defined to be the same as the
SMP cacheline size.
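
That is, something like this (a minimal sketch against 2.6.18-rc3's
kernel/futex.c):

struct futex_hash_bucket {
	spinlock_t       lock;
	struct list_head chain;
} ____cacheline_aligned_in_smp;

Each bucket then owns its cacheline(s), at the cost of padding the hash
table out to at least one line per bucket.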

Ravikiran> Here is a simple scheme which maintains per-node hash
Ravikiran> tables for futexes.

Ravikiran> In this scheme, a private futex is assigned to the node id
Ravikiran> of the futex's KVA.  The reasoning is, the futex KVA is
Ravikiran> allocated from the node as indicated by memory policy set
Ravikiran> by the process, and that should be a good 'home node' for
Ravikiran> that futex.  Of course this helps workloads where all the
Ravikiran> threads of a process are bound to the same node, but it
Ravikiran> seems reasonable to run all threads of a process on the
Ravikiran> same node.

You can't make that assumption at all. In many NUMA workloads it is
not common to have all threads of a process run on the same node. You
often see a case where one thread spawns a number of threads that are
then grouped onto the various nodes.

If we want to make futexes really NUMA-aware, having them
explicitly allocated on a given node, or alternatively allocated on a
first-touch basis, would be more useful.

But to be honest, I doubt it matters much: the futex cacheline is most
likely to end up in cache on the node where it's being used, and as long
as the other nodes don't try to touch the same futex, this becomes a
non-issue with the proper alignment.

I don't think your patch is harmful, but it looks awfully complex for
something that could be solved just as well by a simple alignment
statement.

Cheers,
Jes


* Re: [RFC] NUMA futex hashing
  2006-08-08  7:07 [RFC] NUMA futex hashing Ravikiran G Thirumalai
  2006-08-08  9:14 ` Eric Dumazet
  2006-08-08  9:37 ` Jes Sorensen
@ 2006-08-08  9:57 ` Andi Kleen
  2006-08-08 10:10   ` Eric Dumazet
  2 siblings, 1 reply; 78+ messages in thread
From: Andi Kleen @ 2006-08-08  9:57 UTC (permalink / raw)
  To: Ravikiran G Thirumalai
  Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Ravikiran G Thirumalai <kiran@scalex86.org> writes:

> Current futex hash scheme is not the best for NUMA.   The futex hash table is 
> an array of struct futex_hash_bucket, which is just a spinlock and a 
> list_head -- this means multiple spinlocks on the same cacheline and on NUMA 
> machines, on the same internode cacheline.  If futexes of two unrelated 
> threads running on two different nodes happen to hash onto adjacent hash 
> buckets, or buckets on the same internode cacheline, then we have the 
> internode cacheline bouncing between nodes.

When I did some testing with an (arguably far too lock-intensive) benchmark
on a bigger box, I got most bouncing cycles not in the futex locks themselves,
but in the down_read on the mm semaphore.

I guess that is not addressed?

-Andi


* Re: [RFC] NUMA futex hashing
  2006-08-08  9:37 ` Jes Sorensen
@ 2006-08-08  9:58   ` Andi Kleen
  2006-08-08 10:07     ` Jes Sorensen
  0 siblings, 1 reply; 78+ messages in thread
From: Andi Kleen @ 2006-08-08  9:58 UTC (permalink / raw)
  To: Jes Sorensen
  Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

Jes Sorensen <jes@sgi.com> writes:
> 
> Using that argument, all you need to do is to add the alignment
> ____cacheline_aligned_in_smp to the definition of
> struct futex_hash_bucket and the problem is solved, given that the
> internode cacheline in a NUMA system is defined to be the same as the
> SMP cacheline size.

Yes but it would waste quite a lot of memory and cache.
Wasted cache = slow.

-Andi


* Re: [RFC] NUMA futex hashing
  2006-08-08  9:58   ` Andi Kleen
@ 2006-08-08 10:07     ` Jes Sorensen
  0 siblings, 0 replies; 78+ messages in thread
From: Jes Sorensen @ 2006-08-08 10:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:

Andi> Jes Sorensen <jes@sgi.com> writes:
>>  Using that argument, all you need to do is to add the alignment
>> ____cacheline_aligned_in_smp to the definition of struct
>> futex_hash_bucket and the problem is solved, given that the
>> internode cacheline in a NUMA system is defined to be the same as
>> the SMP cacheline size.

Andi> Yes but it would waste quite a lot of memory and cache.  Wasted
Andi> cache = slow.

Compared to the extra level of indirection, I doubt it would be
measurable. The cache space is barely wasted; we're talking about
approximately half a cacheline per futex hash bucket in use.

Jes


* Re: [RFC] NUMA futex hashing
  2006-08-08  9:57 ` Andi Kleen
@ 2006-08-08 10:10   ` Eric Dumazet
  2006-08-08 10:36     ` Andi Kleen
  2006-08-09  0:13     ` [RFC] NUMA futex hashing Ravikiran G Thirumalai
  0 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 10:10 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> writes:
> > Current futex hash scheme is not the best for NUMA.   The futex hash
> > table is an array of struct futex_hash_bucket, which is just a spinlock
> > and a list_head -- this means multiple spinlocks on the same cacheline
> > and on NUMA machines, on the same internode cacheline.  If futexes of two
> > unrelated threads running on two different nodes happen to hash onto
> > adjacent hash buckets, or buckets on the same internode cacheline, then
> > we have the internode cacheline bouncing between nodes.
>
> When I did some testing with a (arguably far too lock intensive) benchmark
> on a bigger box I got most bouncing cycles not in the futex locks itself,
> but in the down_read on the mm semaphore.

This is true, even with a normal application (not a biased benchmark) and
using oprofile. mmap_sem is the killer.

We may have a special case for PRIVATE futexes (they don't need to be chained
in a global table, but in a process-private table).

The POSIX thread API already lets the application tell glibc/the kernel that
a mutex/futex has process scope.

For these private futexes, I think we would not need to down_read(mmap_sem)
at all (only one or more locks protecting the process-private table).

Eric
 


* Re: [RFC] NUMA futex hashing
  2006-08-08 10:10   ` Eric Dumazet
@ 2006-08-08 10:36     ` Andi Kleen
  2006-08-08 12:29       ` Eric Dumazet
  2006-08-09  0:13     ` [RFC] NUMA futex hashing Ravikiran G Thirumalai
  1 sibling, 1 reply; 78+ messages in thread
From: Andi Kleen @ 2006-08-08 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

>
> We may have special case for PRIVATE futexes (they dont need to be chained in 
> a global table, but a process private table)

What do you mean by PRIVATE futex?

Even if the futex mapping is only visible to a single MM, mmap_sem is still
needed to protect against other threads doing mmap.
 
-Andi


* Re: [RFC] NUMA futex hashing
  2006-08-08 10:36     ` Andi Kleen
@ 2006-08-08 12:29       ` Eric Dumazet
  2006-08-08 12:47         ` Andi Kleen
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 12:29 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > We may have special case for PRIVATE futexes (they dont need to be
> > chained in a global table, but a process private table)
>
> What do you mean with PRIVATE futex?
>
> Even if the futex mapping is only visible by a single MM mmap_sem is still
> needed to protect against other threads doing mmap.

Hum... I would call that a user error.

If a thread is munmap()ing the vma that contains active futexes, the result
is undefined. Same as today, I think (a thread blocked in FUTEX_WAIT should
stay blocked).

The point is that private futexes could be managed using virtual addresses
alone, with no call to find_extend_vma(), hence no mmap_sem contention.
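
In terms of the existing get_futex_key() code, the private case could then
be as simple as this (a sketch; it skips the vma walk entirely):

	key->both.offset = address % PAGE_SIZE;	/* offset within the page */
	address -= key->both.offset;
	key->private.mm = current->mm;		/* no find_extend_vma() */
	key->private.address = address;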

There could be a problem if the same futex (32-bit integer) were mapped at
different virtual addresses in the same process.

Eric


* Re: [RFC] NUMA futex hashing
  2006-08-08 12:29       ` Eric Dumazet
@ 2006-08-08 12:47         ` Andi Kleen
  2006-08-08 12:57           ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Andi Kleen @ 2006-08-08 12:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > We may have special case for PRIVATE futexes (they dont need to be
> > > chained in a global table, but a process private table)
> >
> > What do you mean with PRIVATE futex?
> >
> > Even if the futex mapping is only visible by a single MM mmap_sem is still
> > needed to protect against other threads doing mmap.
> 
> Hum... I would call that a user error.
> 
> If a thread is munmap()ing the vma that contains active futexes, result is 
> undefined.

We can't allow anything that could crash the kernel, corrupt a kernel
data structure, allow writing to freed memory, etc.  No matter how
defined the behavior is or not. Working with a vma that doesn't have
an existence guarantee would be just that.

-Andi


* Re: [RFC] NUMA futex hashing
  2006-08-08 12:47         ` Andi Kleen
@ 2006-08-08 12:57           ` Eric Dumazet
  2006-08-08 14:39             ` Ulrich Drepper
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 12:57 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 14:47, Andi Kleen wrote:
> On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> > On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > > We may have special case for PRIVATE futexes (they dont need to be
> > > > chained in a global table, but a process private table)
> > >
> > > What do you mean with PRIVATE futex?
> > >
> > > Even if the futex mapping is only visible by a single MM mmap_sem is
> > > still needed to protect against other threads doing mmap.
> >
> > Hum... I would call that a user error.
> >
> > If a thread is munmap()ing the vma that contains active futexes, result
> > is undefined.
>
> We can't allow anything that could crash the kernel, corrupt a kernel,
> data structure, allow writing to freed memory etc.  No matter how
> defined it is or not. Working with a vma that doesn't have
> an existence guarantee would be just that.

As I said, we would not walk the vmas anymore, so no crashes are ever
possible.

Just keep a process-private list of 'private futexes', indexed by their
virtual address. This list can of course be stored in an efficient data
structure: an AVL or RB tree, or a hash table.

The validity of the virtual address is still tested by a normal get_user()
call. If the memory was freed by a thread, then a normal EFAULT error will
be reported... eventually.
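
(That validity test is just the usual pattern, sketched here with the same
get_user() call futex_wait() already uses:)

	u32 uval;

	if (get_user(uval, uaddr))
		return -EFAULT;	/* memory gone: report it... eventually */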

Eric


* Re: [RFC] NUMA futex hashing
  2006-08-08 12:57           ` Eric Dumazet
@ 2006-08-08 14:39             ` Ulrich Drepper
  2006-08-08 15:11               ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2006-08-08 14:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> The validity of the virtual address is still tested by normal get_user()
> call.. If the memory was freed by a thread, then a normal EFAULT error will
> be reported... eventually.

This is indeed what should be done.  Private futexes are by far the
more frequent case, and I bet you'd see improvements from avoiding the
mm mutex even on normal machines, since futexes really are everywhere.
For shared mutexes you end up doing two lookups, and that's fine IMO
as long as the first lookup is fast.

As for the NUMA case, I would oppose any change which has the
slightest impact on non-NUMA machines.  It cannot be allowed that the
majority of systems is slowed down significantly just because of NUMA.
Especially since the effects of NUMA, besides cacheline transfer
penalties, are IMO probably negligible.  The in-kernel futex
representation only exists when there are waiters, so the memory
needed is only allocated when we are waiting.  In that case it would
be easy enough to use local memory.  But this is unlikely to help much,
since the waker thread is ideally not on the same processor, maybe not
even on the same node.  So there will be cacheline transfers in most
cases, and every possible improvement will be minimal and maybe not
even generally measurable.

If you want to do anything in this area, first remove the global
mutex.  Then really measure with real-world applications.  And I don't
mean specially designed HPC apps which assign threads/processes to
processors or nodes.  Those are special cases of a special case.


* Re: [RFC] NUMA futex hashing
  2006-08-08 14:39             ` Ulrich Drepper
@ 2006-08-08 15:11               ` Nick Piggin
  2006-08-08 15:36                 ` Ulrich Drepper
  2006-08-08 16:08                 ` Eric Dumazet
  0 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2006-08-08 15:11 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> The validity of the virtual address is still tested by normal get_user()
>> call.. If the memory was freed by a thread, then a normal EFAULT error 
>> will
>> be reported... eventually.
> 
> 
> This is indeed what should be done.  Private futexes are the by far
> more frequent case and I bet you'd see improvements when avoiding the
> mm mutex even for normal machines since futexes really are everywhere.
> For shared mutexes you end up doing two lookups and that's fine IMO
> as long as the first lookup is fast.

The private futex's namespace is its virtual address, so I don't see
how you can decouple that from the management of virtual addresses.

Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree. Then to wake
up a process and remove the futex, you need to take the rbtree lock. Or
to unmap any memory you also need to take the rbtree lock and ensure
there are no futexes there.

So you just add another lock for no reason, or have I got a few screws
loose myself? I don't see how you can significantly reduce lock
cacheline bouncing in a futex-heavy workload if you're just going to
add another shared data structure. But if you can, sweet ;)

-- 
SUSE Labs, Novell Inc.


* Re: [RFC] NUMA futex hashing
  2006-08-08 15:11               ` Nick Piggin
@ 2006-08-08 15:36                 ` Ulrich Drepper
  2006-08-08 16:22                   ` Nick Piggin
  2006-08-08 16:08                 ` Eric Dumazet
  1 sibling, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2006-08-08 15:36 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Let me get this straight: to insert a contended futex into your rbtree,
> you need to hold the mmap sem to ensure that address remains valid,
> then you need to take a lock which protects your rbtree.

Why does it have to remain valid?  As long as the kernel doesn't crash
on any of the operations associated with the futex syscalls, let the
address space region explode, implode, whatever.  It's a bug in the
program if the address region is changed while a futex is placed
there.  If the futex syscall hangs forever or returns with a bogus
state (error or even success), this is perfectly acceptable.  We
shouldn't slow down correct uses just to make it possible for broken
programs to receive a more detailed error description.


* Re: [RFC] NUMA futex hashing
  2006-08-08 15:11               ` Nick Piggin
  2006-08-08 15:36                 ` Ulrich Drepper
@ 2006-08-08 16:08                 ` Eric Dumazet
  2006-08-08 16:34                   ` Nick Piggin
  2006-08-08 16:58                   ` Ulrich Drepper
  1 sibling, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 16:08 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 17:11, Nick Piggin wrote:
> Ulrich Drepper wrote:
> > On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> >> The validity of the virtual address is still tested by normal get_user()
> >> call.. If the memory was freed by a thread, then a normal EFAULT error
> >> will
> >> be reported... eventually.
> >
> > This is indeed what should be done.  Private futexes are the by far
> > more frequent case and I bet you'd see improvements when avoiding the
> > mm mutex even for normal machines since futexes really are everywhere.
> > For shared mutexes you end up doing two lookups and that's fine IMO
> > as long as the first lookup is fast.
>
> The private futex's namespace is its virtual address, so I don't see
> how you can decouple that from the management of virtual addresses.
>
> Let me get this straight: to insert a contended futex into your rbtree,
> you need to hold the mmap sem to ensure that address remains valid,
> then you need to take a lock which protects your rbtree. Then to wake
> up a process and remove the futex, you need to take the rbtree lock. Or
> to unmap any memory you also need to take the rbtree lock and ensure
> there are no futexes there.
>
> So you just add another lock for no reason, or have I got a few screws
> loose myself? I don't see how you can significantly reduce lock
> cacheline bouncing in a futex heavy workload if you're just going to
> add another shared data structure. But if you can, sweet ;)

We certainly can. But if you insist on using the mmap sem at all, then we
have a problem.

An rbtree would not reduce cacheline bouncing, so:

We could use a hashtable (allocated on demand) of size N, with N depending on
NR_CPUS for example, and each chain protected by a private spinlock. If N is
well chosen, we might reduce lock cacheline bouncing (different threads
fighting over different private futexes would have a good chance of hitting
different cachelines in this hashtable).

As soon as a process enters the 'private futex' code, the futex code
allocates this hashtable if the process has a NULL hash table (set to NULL
at exec() time, or maybe pre-allocated so that we can be sure the futex
syscall always succeeds (no ENOMEM)).
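
A rough sketch of the idea (hypothetical names; the mm field and the helper
below do not exist today):

struct private_futex_table {
	atomic_t	refcount;	/* shared by the process's threads */
	unsigned int	hash_bits;	/* N = 1 << hash_bits chains */
	struct futex_hash_bucket buckets[0];
};

/* in the futex syscall path: */
if (unlikely(!current->mm->private_futexes))
	ret = alloc_private_futex_table(current->mm);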

So we really can... but only for 'private futexes', which are the vast
majority of futexes needed by a typical program (using the POSIX pshared
mutex attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc).

Of course we would need a new syscall, and to change glibc to be able to
actually use this new private_futex syscall.

Probably a lot of work, still, but it could help heavily threaded programs
avoid touching mmap_sem.

We might have a refcounting problem on this hashtable, since several threads
share the structure, but only at thread creation/destruction, not in the
futex call (i.e. no cacheline bouncing on the refcount).

Eric


* Re: [RFC] NUMA futex hashing
  2006-08-08 15:36                 ` Ulrich Drepper
@ 2006-08-08 16:22                   ` Nick Piggin
  2006-08-08 16:26                     ` Nick Piggin
  2006-08-08 16:49                     ` Ulrich Drepper
  0 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2006-08-08 16:22 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>> Let me get this straight: to insert a contended futex into your rbtree,
>> you need to hold the mmap sem to ensure that address remains valid,
>> then you need to take a lock which protects your rbtree.
> 
> 
> Why does it have to remain valid?  As long as the kernel doesn't crash
> on any of the operations associated with the futex syscalls let the
> address space region explode, implode, whatever.  It's  a bug in the
> program if the address region is changed while a futex is placed
> there.  If the futex syscall hangs forever or returns with a bogus
> state (error or even success) this is perfectly acceptable.  We

I was thinking of mremap (no, that's already kind of messed up); or
even just getting consistency in failures (e.g. so you don't have
the situation where a futex op can succeed on a previously
unmapped region).

If you're not worried about the latter, then it might work...

I didn't initially click that the private futex API operates
purely on tokens rather than virtual memory... comments in
futex.c talk about futexes being hashed to a particular
physical page (which is the case for shared ones). That's whacked.

So actually you would change semantics in some weird corner
cases, like mremapping a shared futex over a private futex's address.
Arguably that's broken, though ;)

> shouldn't slow down correct uses just to make it possible for broken
> programs to receive a more detailed error description.
> 

No, we shouldn't slow them down. I'd be interested to see whether
locking is significantly sped up with this new data structure,
though.

You might also slow down shared futexes, since you'd have to take the
lock and unconditionally traverse the private futexes even for
them.

-- 
SUSE Labs, Novell Inc.


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:22                   ` Nick Piggin
@ 2006-08-08 16:26                     ` Nick Piggin
  2006-08-08 16:49                     ` Ulrich Drepper
  1 sibling, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2006-08-08 16:26 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Nick Piggin wrote:

> No we shouldn't slow them down. I'd be interested to see whether
> locking is significantly sped up with this new data structure,
> though.

OTOH, maybe you don't need a new data structure. Maybe you could
use the hash and check that for a match on a private futex before
trying to find a possible shared futex.

Locking I guess becomes no more of a problem than now, and in some
cases maybe much less. So OK, I stand corrected.

-- 
SUSE Labs, Novell Inc.


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:08                 ` Eric Dumazet
@ 2006-08-08 16:34                   ` Nick Piggin
  2006-08-08 16:49                     ` Eric Dumazet
  2006-08-08 16:58                   ` Ulrich Drepper
  1 sibling, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2006-08-08 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:

> We certainly can. But if you insist of using mmap sem at all, then we have a 
> problem.
> 
> rbtree would not reduce cacheline bouncing, so :
> 
> We could use a hashtable (allocated on demand) of size N, N depending on 
> NR_CPUS for example. each chain protected by a private spinlock. If N is well 
> chosen, we might reduce lock cacheline bouncing. (different threads fighting 
> on different private futexes would have a good chance to get different 
> cachelines in this hashtable)

See other mail. We already have a hash table ;)

-- 
SUSE Labs, Novell Inc.


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:34                   ` Nick Piggin
@ 2006-08-08 16:49                     ` Eric Dumazet
  2006-08-08 16:59                       ` Eric Dumazet
  2006-08-09  1:56                       ` Nick Piggin
  0 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 16:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
> Eric Dumazet wrote:
> > We certainly can. But if you insist of using mmap sem at all, then we
> > have a problem.
> >
> > rbtree would not reduce cacheline bouncing, so :
> >
> > We could use a hashtable (allocated on demand) of size N, N depending on
> > NR_CPUS for example. each chain protected by a private spinlock. If N is
> > well chosen, we might reduce lock cacheline bouncing. (different threads
> > fighting on different private futexes would have a good chance to get
> > different cachelines in this hashtable)
>
> See other mail. We already have a hash table ;)

Yes, but you still want, at FUTEX_WAIT time, to tell the kernel that the
futex is private to this process.

Giving the same info at FUTEX_WAKE time could let the kernel skip the second
pass (using only a private futex lookup), again avoiding the mmap_sem touch
in case no threads are waiting on this futex anymore.

Eric


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:22                   ` Nick Piggin
  2006-08-08 16:26                     ` Nick Piggin
@ 2006-08-08 16:49                     ` Ulrich Drepper
  1 sibling, 0 replies; 78+ messages in thread
From: Ulrich Drepper @ 2006-08-08 16:49 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> I thought mremap (no, that's already kind of messed up); or
> even just getting consistency in failures (eg. so you don't have
> the situation that a futex op can succeed on a previously
> unmapped region).
>
> If you're not worried about the latter, then it might work...

I'm not the least bit worried about this.  It's 100% an application's
fault.  You cannot touch an address space region while it's in use, e.g.,
for mutexes.


> I didn't initially click that the private futex API operates
> purely on tokens rather than virtual memory...

I haven't looked at the code in some time, but I thought this got
clarified in the comments.  For waiting on private mutexes we need
nothing but the address value itself.  There is the FUTEX_WAKE_OP
operation, which will also write to memory, but this is only on the
waker side, and if the memory mapping is gone, just flag an error.  It's
another program error which shouldn't in any way slow down normal
operations.


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:08                 ` Eric Dumazet
  2006-08-08 16:34                   ` Nick Piggin
@ 2006-08-08 16:58                   ` Ulrich Drepper
  2006-08-08 17:08                     ` Eric Dumazet
  2006-08-09  1:58                     ` Nick Piggin
  1 sibling, 2 replies; 78+ messages in thread
From: Ulrich Drepper @ 2006-08-08 16:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> So we really can... but for 'private futexes' which are the vast majority of
> futexes needed by typical program (using POSIX pshared thread mutex attribute
> PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)

Nonsense.  Mutexes are by default always private.  They explicitly
have to be marked as sharable.  This happens using the
pthread_mutexattr_setpshared function which takes
PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
parameter.  So the former _is_ clearly used.
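
For reference, that marking uses the standard pthread API:

	pthread_mutexattr_t attr;
	pthread_mutex_t mutex;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
	pthread_mutex_init(&mutex, &attr);	/* default would be PRIVATE */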


> Of course we would need a new syscall, and to change glibc to be able to
> actually use this new private_futex syscall.

No, why?  The kernel already does recognize private mutexes.  It just
checks whether the pages used to store them are private or mapped.  This
requires some interaction with the memory subsystem, but as long as no
crashes happen the data can change underneath.  It's the program's
fault if it does.

On the waker side you would search the local futex hash table/tree
first, and if this doesn't yield a match, search the global table.
Wakeup calls without any waiters are usually rare.
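
In pseudo-C, the wake path could then look like this (a sketch;
wake_private() and wake_shared() are invented names for the two lookups):

static int futex_wake(u32 __user *uaddr, int nr_wake)
{
	int ret;

	/* first pass: process-local table, keyed by (mm, uaddr), no mmap_sem */
	ret = wake_private(current->mm, uaddr, nr_wake);
	if (ret > 0)
		return ret;		/* waiters found locally */

	/* second pass: global table, with the vma walk as today */
	return wake_shared(uaddr, nr_wake);
}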


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:49                     ` Eric Dumazet
@ 2006-08-08 16:59                       ` Eric Dumazet
  2006-08-09  1:56                       ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 16:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 18:49, Eric Dumazet wrote:
> On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
> > Eric Dumazet wrote:
> > > We certainly can. But if you insist of using mmap sem at all, then we
> > > have a problem.
> > >
> > > rbtree would not reduce cacheline bouncing, so :
> > >
> > > We could use a hashtable (allocated on demand) of size N, N depending
> > > on NR_CPUS for example. each chain protected by a private spinlock. If
> > > N is well chosen, we might reduce lock cacheline bouncing. (different
> > > threads fighting on different private futexes would have a good chance
> > > to get different cachelines in this hashtable)
> >
> > See other mail. We already have a hash table ;)
>
> Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is
> private to this process.
>
> Giving the same info at FUTEX_WAKE time could avoid the kernel to make the
> second pass (using only a private futex lookup), avoiding again the
> mmap_sem touch in case no threads are waiting anymore on this futex.

After looking at kernel/futex.c, I realize we also can avoid the atomic ops 
(and another cacheline bouncing) done in get_key_refs()/drop_key_refs(), 
touching the inode i_count or mm_count refcounter)

Eric


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:58                   ` Ulrich Drepper
@ 2006-08-08 17:08                     ` Eric Dumazet
  2006-08-09  1:58                     ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-08 17:08 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Nick Piggin, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tuesday 08 August 2006 18:58, Ulrich Drepper wrote:
> On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> > So we really can... but for 'private futexes' which are the vast majority
> > of futexes needed by typical program (using POSIX pshared thread mutex
> > attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc)
>
> Nonsense.  Mutexes are by default always private.  They explicitly
> have to be marked as sharable.  This happens using the
> pthread_mutexattr_setpshared function which takes
> PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second
> parameter.  So the former _is_ clearly used.
>

I was saying that the PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED info
is not provided to the kernel (because the futex API/implementation doesn't
need it to be).  It was not an attack on glibc.

> > Of course we would need a new syscall, and to change glibc to be able to
> > actually use this new private_futex syscall.
>
> No, why?  The kernel already does recognize private mutexes.  It just
> checks whether the pages used to store it are private or mapped.  This
> requires some interaction with the memory subsystem but as long as no
> crashes happen the data can change underneath.  It's the program's
> fault if it does.

But if you let the futex code do the vma walk to check the private/shared
status, you still need the mmap_sem locking.

Moreover, a program can mmap() a file (shared in terms of VMA) and continue
to use a PTHREAD_PROCESS_PRIVATE mutex lying in this shared zone
(example: a shmem or hugetlb mapping, whose API might always give a 'shared'
vma).

>
> On the waker side you would search the local futex hash table/tree
> first and if this doesn't yield a match, search the global table.
> Wakeup calls without any waiters are usually rare.

If the two searches touch two different cachelines in the hash table, we
might have a performance regression.
Of course we might choose a hash function so that the same slot is accessed.

Eric



* Re: [RFC] NUMA futex hashing
  2006-08-08  9:14 ` Eric Dumazet
@ 2006-08-08 20:31   ` Ravikiran G Thirumalai
  0 siblings, 0 replies; 78+ messages in thread
From: Ravikiran G Thirumalai @ 2006-08-08 20:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

On Tue, Aug 08, 2006 at 11:14:49AM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote:
> >
> 
> Your patch seems fine, but I have one comment.
> 
> For non NUMA machine, we would have one useless indirection to get the 
> futex_queues pointer.
> 
> static struct futex_hash_bucket *futex_queues[1];
> 
> I think it is worth to redesign your patch so that this extra-indirection is 
> needed only for NUMA machines.

Yes.  Will do in the next iteration.

Thanks,
Kiran


* Re: [RFC] NUMA futex hashing
  2006-08-08 10:10   ` Eric Dumazet
  2006-08-08 10:36     ` Andi Kleen
@ 2006-08-09  0:13     ` Ravikiran G Thirumalai
  1 sibling, 0 replies; 78+ messages in thread
From: Ravikiran G Thirumalai @ 2006-08-09  0:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andi Kleen, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Tue, Aug 08, 2006 at 12:10:39PM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> > Ravikiran G Thirumalai <kiran@scalex86.org> writes:
> > > Current futex hash scheme is not the best for NUMA.   The futex hash
> > > table is an array of struct futex_hash_bucket, which is just a spinlock
> > > and a list_head -- this means multiple spinlocks on the same cacheline
> > > and on NUMA machines, on the same internode cacheline.  If futexes of two
> > > unrelated threads running on two different nodes happen to hash onto
> > > adjacent hash buckets, or buckets on the same internode cacheline, then
> > > we have the internode cacheline bouncing between nodes.
> >
> > When I did some testing with a (arguably far too lock intensive) benchmark
> > on a bigger box I got most bouncing cycles not in the futex locks itself,
> > but in the down_read on the mm semaphore.
> 
> This is true, even with a normal application (not a biased benchmark) and 
> using oprofile. mmap_sem is the killer.

Not if two threads of two different processes (so not the same mmap_sem) hash
onto futexes on the same cacheline.  But agreed, mmap_sem needs to be fixed
too.  If everyone agrees on a per-process hash table for private futexes,
then we will work on that approach.

Thanks,
Kiran


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:49                     ` Eric Dumazet
  2006-08-08 16:59                       ` Eric Dumazet
@ 2006-08-09  1:56                       ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2006-08-09  1:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:

>On Tuesday 08 August 2006 18:34, Nick Piggin wrote:
>
>>Eric Dumazet wrote:
>>
>>>We certainly can. But if you insist of using mmap sem at all, then we
>>>have a problem.
>>>
>>>rbtree would not reduce cacheline bouncing, so :
>>>
>>>We could use a hashtable (allocated on demand) of size N, N depending on
>>>NR_CPUS for example. each chain protected by a private spinlock. If N is
>>>well chosen, we might reduce lock cacheline bouncing. (different threads
>>>fighting on different private futexes would have a good chance to get
>>>different cachelines in this hashtable)
>>>
>>See other mail. We already have a hash table ;)
>>
>
>Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is 
>private to this process.
>

Yes, but I'm saying we already have a hash table. The hash table.

I'm *not* saying you *don't* also want a private directive from userspace.


* Re: [RFC] NUMA futex hashing
  2006-08-08 16:58                   ` Ulrich Drepper
  2006-08-08 17:08                     ` Eric Dumazet
@ 2006-08-09  1:58                     ` Nick Piggin
  2006-08-09  6:26                       ` Eric Dumazet
  1 sibling, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2006-08-09  1:58 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Ulrich Drepper wrote:

>> Of course we would need a new syscall, and to change glibc to be able to
>> actually use this new private_futex syscall.
>
>
> No, why?  The kernel already does recognize private mutexes.  It just
> checks whether the pages used to store it are private or mapped.  This
> requires some interaction with the memory subsystem but as long as no
> crashes happen the data can change underneath.  It's the program's
> fault if it does.


Because that requires taking mmap_sem. Avoiding that is the whole
purpose, isn't it?



* Re: [RFC] NUMA futex hashing
  2006-08-09  1:58                     ` Nick Piggin
@ 2006-08-09  6:26                       ` Eric Dumazet
  2006-08-09  6:43                         ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2006-08-09  6:26 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai
  Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1347 bytes --]

Based on various discussions and feedback, I cooked a patch that implements
the notion of private futexes (private to a process, in the spirit of the
POSIX pshared attribute PTHREAD_PROCESS_PRIVATE).

[PATCH] futex: Add new PRIVATE futex primitives for performance improvements

When a futex is used privately by a process, we don't really need to look up
the process's list of vmas to discover whether the futex is backed by an
inode or by the mm struct, and we don't need to keep a refcount on the inode
or mm.

This patch introduces new futex calls that userland (glibc, of course) could
use when futexes are known to be private.

Avoiding the vma lookup means avoiding taking the mmap_sem (and forcing
cacheline bounces).

Avoiding refcounting on the underlying inode or mm struct also avoids
cacheline bouncing.

That's two cacheline bounces avoided per futex syscall.

glibc could use the new futex primitives introduced here (in particular for
the PTHREAD_PROCESS_PRIVATE semantics), and fall back to the old ones when
running on an older kernel.  The fallback could set a global variable on the
first failed syscall so that only one failed syscall is done in the process
lifetime.

Note: compatibility should be maintained by this patch, as old applications
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: futex_priv1.patch --]
[-- Type: text/x-diff, Size: 9592 bytes --]

--- linux-2.6.18-rc4/include/linux/futex.h	2006-08-08 22:46:13.000000000 +0200
+++ linux-2.6.18-rc4-ed/include/linux/futex.h	2006-08-08 23:23:13.000000000 +0200
@@ -15,6 +15,11 @@
 #define FUTEX_LOCK_PI		6
 #define FUTEX_UNLOCK_PI		7
 #define FUTEX_TRYLOCK_PI	8
+#define FUTEX_WAIT_PRIVATE         9
+#define FUTEX_WAKE_PRIVATE        10
+#define FUTEX_REQUEUE_PRIVATE     11
+#define FUTEX_CMP_REQUEUE_PRIVATE 12
+#define FUTEX_WAKE_OP_PRIVATE     13
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
--- linux-2.6.18-rc4/kernel/futex.c	2006-08-08 22:45:46.000000000 +0200
+++ linux-2.6.18-rc4-ed/kernel/futex.c	2006-08-09 07:26:19.000000000 +0200
@@ -60,8 +60,9 @@
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
  */
+#define OFF_INODE    1 /* We set bit 0 if it has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */
 union futex_key {
 	struct {
 		unsigned long pgoff;
@@ -79,6 +80,8 @@
 		int offset;
 	} both;
 };
+#define FUT_SHARED  1 /* we should walk vmas */
+#define FUT_PRIVATE 0 /* private futex: no need to walk vmas*/
 
 /*
  * Priority Inheritance state:
@@ -140,7 +143,7 @@
 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
 
 /* Futex-fs vfsmount entry: */
-static struct vfsmount *futex_mnt;
+static struct vfsmount *futex_mnt __read_mostly;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -175,7 +178,7 @@
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -191,6 +194,11 @@
 		return -EINVAL;
 	address -= key->both.offset;
 
+	if (shared == FUT_PRIVATE) {
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
 	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
@@ -215,6 +223,7 @@
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset += OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -224,7 +233,7 @@
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset += OFF_INODE; /* reference taken on inode */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -256,8 +265,8 @@
  */
 static inline void get_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		if (key->both.offset & OFF_INODE)
 			atomic_inc(&key->shared.inode->i_count);
 		else
 			atomic_inc(&key->private.mm->mm_count);
@@ -270,8 +279,8 @@
  */
 static void drop_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		if (key->both.offset & OFF_INODE)
 			iput(key->shared.inode);
 		else
 			mmdrop(key->private.mm);
@@ -650,7 +659,7 @@
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake, int shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -658,9 +667,10 @@
 	union futex_key key;
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -682,7 +692,8 @@
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(&current->mm->mmap_sem);
 	return ret;
 }
 
@@ -692,7 +703,7 @@
  */
 static int
 futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op)
+	      int nr_wake, int nr_wake2, int op, int shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -703,10 +714,10 @@
 retryfull:
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -802,7 +813,7 @@
  * physical page.
  */
 static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
-			 int nr_wake, int nr_requeue, u32 *cmpval)
+			 int nr_wake, int nr_requeue, u32 *cmpval, int shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -811,12 +822,13 @@
 	int ret, drop_count = 0;
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -839,7 +851,8 @@
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (shared)
+				up_read(&current->mm->mmap_sem);
 
 			ret = get_user(curval, uaddr1);
 
@@ -888,7 +901,8 @@
 		drop_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(&current->mm->mmap_sem);
 	return ret;
 }
 
@@ -999,7 +1013,7 @@
 	drop_key_refs(&q->key);
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int shared)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1010,9 +1024,10 @@
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1047,7 +1062,8 @@
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(&curr->mm->mmap_sem);
 
 		ret = get_user(uval, uaddr);
 
@@ -1066,7 +1082,8 @@
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(&curr->mm->mmap_sem);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1108,7 +1125,8 @@
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(&curr->mm->mmap_sem);
 	return ret;
 }
 
@@ -1134,7 +1152,7 @@
  retry:
 	down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, FUT_SHARED);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1435,7 +1453,7 @@
 	 */
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, FUT_SHARED);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1551,7 +1569,7 @@
 	return ret;
 }
 
-static struct file_operations futex_fops = {
+static const struct file_operations futex_fops = {
 	.release	= futex_close,
 	.poll		= futex_poll,
 };
@@ -1600,7 +1618,7 @@
 	q->pi_state = NULL;
 
 	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	err = get_futex_key(uaddr, &q->key, FUT_SHARED);
 
 	if (unlikely(err != 0)) {
 		up_read(&current->mm->mmap_sem);
@@ -1742,7 +1760,7 @@
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake(uaddr, 1);
+				futex_wake(uaddr, 1, FUT_SHARED);
 		}
 	}
 	return 0;
@@ -1830,23 +1848,38 @@
 
 	switch (op) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout);
+		ret = futex_wait(uaddr, val, timeout, FUT_SHARED);
+		break;
+	case FUTEX_WAIT_PRIVATE:
+		ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, val, FUT_SHARED);
+		break;
+	case FUTEX_WAKE_PRIVATE:
+		ret = futex_wake(uaddr, val, FUT_PRIVATE);
 		break;
 	case FUTEX_FD:
 		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
 		ret = futex_fd(uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED);
+		break;
+	case FUTEX_REQUEUE_PRIVATE:
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED);
+		break;
+	case FUTEX_CMP_REQUEUE_PRIVATE:
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED);
+		break;
+	case FUTEX_WAKE_OP_PRIVATE:
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE);
 		break;
 	case FUTEX_LOCK_PI:
 		ret = futex_lock_pi(uaddr, val, timeout, val2, 0);


* Re: [RFC] NUMA futex hashing
  2006-08-09  6:26                       ` Eric Dumazet
@ 2006-08-09  6:43                         ` Eric Dumazet
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
                                             ` (3 more replies)
  0 siblings, 4 replies; 78+ messages in thread
From: Eric Dumazet @ 2006-08-09  6:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

(patch inlined this time)
Based on various discussions and feedback, I cooked a patch that implements 
the notion of private futexes (private to a process, in the spirit of POSIX 
pshared PTHREAD_PROCESS_PRIVATE).

[PATCH] futex : Add new PRIVATE futex primitives for performance improvements

When a futex is used privately by a process, we don't really need to look up 
the process's list of vmas in order to discover whether the futex is backed by 
an inode or by the mm struct, and we don't need to keep a refcount on that 
inode or mm.

This patch introduces new futex calls that user land (glibc, of course) could 
use when futexes are known to be private.

Avoiding the vma lookup means avoiding taking the mmap_sem (and the cacheline 
bouncing it forces).

Avoiding refcounting on the underlying inode or mm struct avoids another 
cacheline bounce.

That's two cacheline bounces avoided per futex() syscall.

glibc could use the new futex primitives introduced here (in particular for 
the PTHREAD_PROCESS_PRIVATE semantic), and fall back to the old ones when 
running on an older kernel. The fallback could set a global variable recording 
which variant to use, so that at most one syscall fails in the process 
lifetime (hypothetical sketch below).
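
For illustration, such a fallback could look like this (user-space sketch;
FUTEX_WAIT_PRIVATE == 9 is the value proposed below, and futex_wait_auto()
is a made-up name, not a real glibc symbol):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define FUTEX_WAIT         0
	#define FUTEX_WAIT_PRIVATE 9	/* from this patch */

	static int have_private_futex = 1;	/* cleared on the first ENOSYS */

	static long futex_wait_auto(int *uaddr, int val)
	{
		if (have_private_futex) {
			long ret = syscall(SYS_futex, uaddr, FUTEX_WAIT_PRIVATE,
					   val, NULL, NULL, 0);
			if (ret != -1 || errno != ENOSYS)
				return ret;
			/* old kernel: remember, so only this one call failed */
			have_private_futex = 0;
		}
		return syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL, NULL, 0);
	}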

Note : Compatibility should be maintained by this patch, as old applications 
will use the 'SHARED' functionality, unchanged.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
--- linux-2.6.18-rc4/include/linux/futex.h	2006-08-08 22:46:13.000000000 +0200
+++ linux-2.6.18-rc4-ed/include/linux/futex.h	2006-08-08 23:23:13.000000000 +0200
@@ -15,6 +15,11 @@
 #define FUTEX_LOCK_PI		6
 #define FUTEX_UNLOCK_PI		7
 #define FUTEX_TRYLOCK_PI	8
+#define FUTEX_WAIT_PRIVATE         9
+#define FUTEX_WAKE_PRIVATE        10
+#define FUTEX_REQUEUE_PRIVATE     11
+#define FUTEX_CMP_REQUEUE_PRIVATE 12
+#define FUTEX_WAKE_OP_PRIVATE     13
 
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
--- linux-2.6.18-rc4/kernel/futex.c	2006-08-08 22:45:46.000000000 +0200
+++ linux-2.6.18-rc4-ed/kernel/futex.c	2006-08-09 07:26:19.000000000 +0200
@@ -60,8 +60,9 @@
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
  */
+#define OFF_INODE    1 /* We set bit 0 if it has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */
 union futex_key {
 	struct {
 		unsigned long pgoff;
@@ -79,6 +80,8 @@
 		int offset;
 	} both;
 };
+#define FUT_SHARED  1 /* we should walk vmas */
+#define FUT_PRIVATE 0 /* private futex: no need to walk vmas */
 
 /*
  * Priority Inheritance state:
@@ -140,7 +143,7 @@
 static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
 
 /* Futex-fs vfsmount entry: */
-static struct vfsmount *futex_mnt;
+static struct vfsmount *futex_mnt __read_mostly;
 
 /*
  * We hash on the keys returned from get_futex_key (see below).
@@ -175,7 +178,7 @@
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -191,6 +194,11 @@
 		return -EINVAL;
 	address -= key->both.offset;
 
+	if (shared == FUT_PRIVATE) {
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
 	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
@@ -215,6 +223,7 @@
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset += OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -224,7 +233,7 @@
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset += OFF_INODE; /* reference taken on inode */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -256,8 +265,8 @@
  */
 static inline void get_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		if (key->both.offset & OFF_INODE)
 			atomic_inc(&key->shared.inode->i_count);
 		else
 			atomic_inc(&key->private.mm->mm_count);
@@ -270,8 +279,8 @@
  */
 static void drop_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		if (key->both.offset & OFF_INODE)
 			iput(key->shared.inode);
 		else
 			mmdrop(key->private.mm);
@@ -650,7 +659,7 @@
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake, int shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -658,9 +667,10 @@
 	union futex_key key;
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -682,7 +692,8 @@
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(&current->mm->mmap_sem);
 	return ret;
 }
 
@@ -692,7 +703,7 @@
  */
 static int
 futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op)
+	      int nr_wake, int nr_wake2, int op, int shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -703,10 +714,10 @@
 retryfull:
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -802,7 +813,7 @@
  * physical page.
  */
 static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
-			 int nr_wake, int nr_requeue, u32 *cmpval)
+			 int nr_wake, int nr_requeue, u32 *cmpval, int shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -811,12 +822,13 @@
 	int ret, drop_count = 0;
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -839,7 +851,8 @@
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (shared)
+				up_read(&current->mm->mmap_sem);
 
 			ret = get_user(curval, uaddr1);
 
@@ -888,7 +901,8 @@
 		drop_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(&current->mm->mmap_sem);
 	return ret;
 }
 
@@ -999,7 +1013,7 @@
 	drop_key_refs(&q->key);
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int shared)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1010,9 +1024,10 @@
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1047,7 +1062,8 @@
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(&curr->mm->mmap_sem);
 
 		ret = get_user(uval, uaddr);
 
@@ -1066,7 +1082,8 @@
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(&curr->mm->mmap_sem);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1108,7 +1125,8 @@
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(&curr->mm->mmap_sem);
 	return ret;
 }
 
@@ -1134,7 +1152,7 @@
  retry:
 	down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, FUT_SHARED);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1435,7 +1453,7 @@
 	 */
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, FUT_SHARED);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1551,7 +1569,7 @@
 	return ret;
 }
 
-static struct file_operations futex_fops = {
+static const struct file_operations futex_fops = {
 	.release	= futex_close,
 	.poll		= futex_poll,
 };
@@ -1600,7 +1618,7 @@
 	q->pi_state = NULL;
 
 	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	err = get_futex_key(uaddr, &q->key, FUT_SHARED);
 
 	if (unlikely(err != 0)) {
 		up_read(&current->mm->mmap_sem);
@@ -1742,7 +1760,7 @@
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake(uaddr, 1);
+				futex_wake(uaddr, 1, FUT_SHARED);
 		}
 	}
 	return 0;
@@ -1830,23 +1848,38 @@
 
 	switch (op) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout);
+		ret = futex_wait(uaddr, val, timeout, FUT_SHARED);
+		break;
+	case FUTEX_WAIT_PRIVATE:
+		ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, val, FUT_SHARED);
+		break;
+	case FUTEX_WAKE_PRIVATE:
+		ret = futex_wake(uaddr, val, FUT_PRIVATE);
 		break;
 	case FUTEX_FD:
 		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
 		ret = futex_fd(uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED);
+		break;
+	case FUTEX_REQUEUE_PRIVATE:
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED);
+		break;
+	case FUTEX_CMP_REQUEUE_PRIVATE:
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED);
+		break;
+	case FUTEX_WAKE_OP_PRIVATE:
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE);
 		break;
 	case FUTEX_LOCK_PI:
 		ret = futex_lock_pi(uaddr, val, timeout, val2, 0);


* [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2006-08-09  6:43                         ` Eric Dumazet
@ 2007-03-15 19:10                           ` Eric Dumazet
  2007-03-15 20:15                             ` Nick Piggin
                                               ` (2 more replies)
  2007-03-15 19:13                           ` [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic Eric Dumazet
                                             ` (2 subsequent siblings)
  3 siblings, 3 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:10 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
  Cc: Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Hi

I'm pleased to present these patches, which improve Linux futex performance and 
scalability on UP, SMP and NUMA configs.

I had this idea last year but was not understood, probably because I did not 
give enough explanation. Sorry if this mail is really long...


Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to 
perform lookups or insertion/deletion of a futex_q.

When a futex_wait() is done, the thread has to (a condensed sketch follows 
this list):

1) Obtain a read lock on mmap_sem to be able to validate the user pointer
   (calling find_vma()). This validation tells us whether the futex uses
   an inode-based store (mapped file) or an mm-based store (anonymous memory).

2) Compute a hash key.

3) Atomically increment a reference counter on the inode or the mm.

4) Lock the relevant part of the futex_queues[] hash table.

5) Perform the test on the value of the futex
   (rollback if value != expected_value, returning EWOULDBLOCK;
   various loops if the test triggers mm faults).

6) Queue the context into the hash table, releasing the lock taken in 4).

7) Release the read lock on mmap_sem.

   <block>

8) Eventually unqueue the context (but rarely, as this part
   may be done by the futex_wake()).
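
As promised, the condensed sketch (simplified from kernel/futex.c; the names
are the real ones, but declarations, fault retries and error handling are
omitted, so this is for orientation only, not compilable code):

	down_read(&current->mm->mmap_sem);		/* step 1 */
	ret = get_futex_key(uaddr, &q.key);		/* steps 1-2: vma walk, hash key */
	hb = queue_lock(&q, -1, NULL);			/* steps 3-4: key ref, slot lock */
	ret = get_futex_value_locked(&uval, uaddr);	/* step 5 */
	if (uval != val) {
		queue_unlock(&q, hb);			/* rollback: -EWOULDBLOCK */
	} else {
		__queue_me(&q, hb);			/* step 6: enqueue, drops lock */
		up_read(&current->mm->mmap_sem);	/* step 7 */
		schedule();				/* <block> */
		unqueue_me(&q);				/* step 8, if not done by waker */
	}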

Futexes were designed to improve scalability, but the current implementation
has various problems :

- Central hashtable :
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 It also means NUMA imbalance, because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

 Even though mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
 ops on mmap_sem, dirtying its cache line :
        - lots of cache line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using atomic_inc()/atomic_dec() on the inode or mm ref counter :
 It's also a cache line ping-pong on SMP, and it increases mmap_sem hold time
 because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'. This has a cost.

But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
separation, and optimal performance if carefully implemented. The time has
come for Linux to have better threading performance.

The PTHREAD_PROCESS_PRIVATE semantic allows the implementation to use separate
repositories :
- One 'global' namespace for all PROCESS_SHARED futexes.
- One "per process private repository" for PROCESS_PRIVATE futexes.
  This repository is NUMA aware; it is allocated the first time a process
  issues a futex(XXXX_PRIVATE) call. If allocation is not possible because
  of memory shortage, we just fall back to the central repository.

The goal is to permit new futex commands to avoid :
 - Using the central hash table (still used by PTHREAD_PROCESS_SHARED futexes).
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on the mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

This is what these patches do :

1) Define new futex subcommands (basically adding a _PRIVATE flag),
   avoiding the use of mmap_sem and of the ref counter on the inode or mm.

2) Allow each process to have a private repository (a small hash table)
   where its active PROCESS_PRIVATE futexes are stored, instead of
   the global repository.
   (If CONFIG_BASE_SMALL, we still use the global repository.)

3) NUMA optimization : we allocate the global hash table with vmalloc().

If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be
prepared to fall back to the old subcommands on old kernels. Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK (a
sketch follows).
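
A hypothetical user-space sketch of that fallback (FUTEX_PRIVATE_FLAG == 128
as defined in patch 1/3; unlike the 2006 proposal's separate opcode numbers,
the flag is simply OR'ed into the command; futex_wake_auto() is a made-up
name):

	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define FUTEX_WAKE         1
	#define FUTEX_PRIVATE_FLAG 128

	static int futex_private_flag = FUTEX_PRIVATE_FLAG;	/* or 0 on old kernels */

	static long futex_wake_auto(int *uaddr, int nr)
	{
		long ret = syscall(SYS_futex, uaddr,
				   FUTEX_WAKE | futex_private_flag, nr,
				   NULL, NULL, 0);
		if (ret == -1 && errno == ENOSYS && futex_private_flag) {
			futex_private_flag = 0;	/* old kernel: shared ops from now on */
			ret = syscall(SYS_futex, uaddr, FUTEX_WAKE, nr,
				      NULL, NULL, 0);
		}
		return ret;
	}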

Only PTHREAD_PROCESS_SHARED futexes should use the old subcommands.

Compatibility with old applications is preserved; they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries 
*and* new binaries, because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default. Almost all applications should benefit from
these changes (with a new kernel and an updated libc).

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
450 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
337 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
332 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
214 cycles per futex(1000) call (returns ENOSYS)
186 cycles per getppid() call
187 cycles per umask() call
182 cycles per ni_syscall() call
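
(Eric's benchmark program was not posted; a rough, hypothetical x86
approximation of the FUTEX_WAIT_PRIVATE measurement, using the patch 1/3
constants, could look like this -- the wait returns EWOULDBLOCK immediately
because val != *uaddr:)

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	#define FUTEX_WAIT         0
	#define FUTEX_PRIVATE_FLAG 128
	#define N 1000000

	static inline unsigned long long rdtsc(void)
	{
		unsigned int lo, hi;
		asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
		return ((unsigned long long)hi << 32) | lo;
	}

	int main(void)
	{
		int futex_word = 0;
		unsigned long long t0 = rdtsc();
		int i;

		for (i = 0; i < N; i++)	/* val == 1 != *uaddr: returns at once */
			syscall(SYS_futex, &futex_word,
				FUTEX_WAIT | FUTEX_PRIVATE_FLAG, 1,
				NULL, NULL, 0);
		printf("%llu cycles per futex(FUTEX_WAIT_PRIVATE) call\n",
		       (rdtsc() - t0) / N);
		return 0;
	}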

Thank you for reading this mail

[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
[PATCH 2/3] FUTEX : introduce private hashtables
[PATCH 3/3] FUTEX : NUMA friendly global hashtable

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


* [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
  2006-08-09  6:43                         ` Eric Dumazet
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
@ 2007-03-15 19:13                           ` Eric Dumazet
  2007-03-15 19:16                           ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
  2007-03-15 19:20                           ` [PATCH 3/3] FUTEX : NUMA friendly global hashtable Eric Dumazet
  3 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:13 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
  Cc: Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel


[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic

This first patch introduces XXX_PRIVATE futexes operations.

When a process uses a XXX_PRIVATE futex primitive, the kernel can avoid taking
a read lock on mmap_sem to find the vma that contains the futex and to learn
whether it is associated with an inode (shared) or with the mm (private to the
process).

We also avoid taking a reference on the found inode or the mm.

Even though mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
 ops on mmap_sem, dirtying its cache line :
        - lots of cache line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using atomic_inc()/atomic_dec() on the inode or mm ref counter :
 It's also a cache line ping-pong on SMP, and it increases mmap_sem hold time
 because of cache misses.

This first patch is possible because, for one process using 
PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their 
virtual address, no matter what the underlying mm storage is. The case of 
multiple virtual addresses mapped on the same physical address is just insane : 
"Don't do it on PROCESS_PRIVATE futexes, please ?"
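
A hypothetical user-space demo of that aliasing caveat (assumes a kernel with
the proposed _PRIVATE commands, whose values are defined locally below;
compile with -lpthread). The waiter queues on the key (mm, a); a wake through
the alias 'b' computes the key (mm, b) and finds nobody:

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <fcntl.h>
	#include <pthread.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>

	#define FUTEX_WAIT_PRIVATE (0 | 128)	/* values from this series */
	#define FUTEX_WAKE_PRIVATE (1 | 128)

	static int *a, *b;	/* two virtual aliases of one shared page */

	static long fut(int *uaddr, int op, int val)
	{
		return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
	}

	static void *waiter(void *arg)
	{
		fut(a, FUTEX_WAIT_PRIVATE, 0);	/* queued on key (mm, a) */
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		int fd = open("/tmp/futex-alias-demo", O_RDWR | O_CREAT, 0600);

		ftruncate(fd, 4096);
		a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		*a = 0;
		pthread_create(&t, NULL, waiter, NULL);
		sleep(1);	/* crude: let the waiter queue itself */
		printf("woken via b: %ld\n", fut(b, FUTEX_WAKE_PRIVATE, 1)); /* 0 */
		printf("woken via a: %ld\n", fut(a, FUTEX_WAKE_PRIVATE, 1)); /* 1 */
		pthread_join(t, NULL);
		return 0;
	}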

If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be
prepared to fall back to the old subcommands on old kernels. Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK, so that 
only one syscall might fail.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : SHARED futexes can be used by old binaries *and* new binaries,
because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default. Almost all applications should benefit from
these changes (with a new kernel and an updated libc).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   12 +
 kernel/futex.c        |  273 +++++++++++++++++++++++++---------------
 2 files changed, 188 insertions(+), 97 deletions(-)


--- linux-2.6.21-rc3/kernel/futex.c	2007-03-13 13:22:31.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c	2007-03-15 18:30:15.000000000 +0100
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ *  Introduction of PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -60,8 +63,18 @@
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *	mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
  */
+
+#define OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	struct {
 		unsigned long pgoff;
@@ -129,9 +142,6 @@ struct futex_q {
 	struct task_struct *task;
 };
 
-/*
- * Split the global futex_lock into every hash list lock.
- */
 struct futex_hash_bucket {
        spinlock_t              lock;
        struct list_head       chain;
@@ -175,7 +185,8 @@ static inline int match_futex(union fute
  *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-static int get_futex_key(u32 __user *uaddr, union futex_key *key)
+static int get_futex_key(u32 __user *uaddr, union futex_key *key,
+		struct rw_semaphore *shared)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -192,6 +203,22 @@ static int get_futex_key(u32 __user *uad
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'address' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 * Note : At this point, address points to the start of page,
+	 *        not the real futex address, this is ok.
+	 */
+	if (!shared) {
+		if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -215,6 +242,7 @@ static int get_futex_key(u32 __user *uad
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset += OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -224,7 +252,7 @@ static int get_futex_key(u32 __user *uad
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset += OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -251,16 +279,21 @@ static int get_futex_key(u32 __user *uad
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
+ * NOTE: for SHARED futexes, mmap_sem MUST be held between get_futex_key()
+ * and calling this function, if it is called at all.  mmap_sem keeps 
+ * key->shared.inode valid.
  */
 static inline void get_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	switch (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		case OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
+			break;
+		default:
+			break;
 	}
 }
 
@@ -270,11 +303,15 @@ static inline void get_key_refs(union fu
  */
 static void drop_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	switch (key->both.offset & (OFF_INODE|OFF_MMSHARED)) {
+		case OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
+		default:
+			break;
 	}
 }
 
@@ -286,32 +323,44 @@ static inline int get_futex_value_locked
 	ret = __copy_from_user_inatomic(dest, from, sizeof(u32));
 	pagefault_enable();
 
-	return ret ? -EFAULT : 0;
+	return ret;
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if shared is non NULL, current->mm->mmap_sem is held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address, int attempt,
+	struct rw_semaphore *shared)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = 0;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+	if (attempt > 2)
 		return -EFAULT;
 
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
-	}
-	return 0;
+	if (!shared)
+		down_read(&mm->mmap_sem);
+
+	if (!(vma = find_vma(mm, address)) ||
+	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+		ret = -EFAULT;
+
+	else
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			current->maj_flt++;
+			break;
+		default:
+			ret = -EFAULT;
+		}
+	if (!shared)
+		up_read(&current->mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -649,7 +698,8 @@ double_lock_hb(struct futex_hash_bucket 
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, int nr_wake,
+	struct rw_semaphore *shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -657,9 +707,10 @@ static int futex_wake(u32 __user *uaddr,
 	union futex_key key;
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -681,7 +732,8 @@ static int futex_wake(u32 __user *uaddr,
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -691,7 +743,7 @@ out:
  */
 static int
 futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op)
+	      int nr_wake, int nr_wake2, int op, struct rw_semaphore *shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -700,12 +752,13 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 	int ret, op_ret, attempt = 0;
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -741,15 +794,14 @@ retry:
 		 * futex_atomic_op_inuser needs to both read and write
 		 * *(int __user *)uaddr2, but we can't modify it
 		 * non-atomically.  Therefore, if get_user below is not
-		 * enough, we need to handle the fault ourselves, while
-		 * still holding the mmap_sem.
+		 * enough, we need to handle the fault ourselves. Make 
+		 * sure we hold mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						attempt, shared);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -757,7 +809,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 
 		ret = get_user(dummy, uaddr2);
 		if (ret)
@@ -794,7 +847,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -803,7 +857,8 @@ out:
  * physical page.
  */
 static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
-			 int nr_wake, int nr_requeue, u32 *cmpval)
+			 int nr_wake, int nr_requeue, u32 *cmpval,
+			 struct rw_semaphore *shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -812,12 +867,13 @@ static int futex_requeue(u32 __user *uad
 	int ret, drop_count = 0;
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -840,7 +896,8 @@ static int futex_requeue(u32 __user *uad
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (shared)
+				up_read(shared);
 
 			ret = get_user(curval, uaddr1);
 
@@ -889,7 +946,8 @@ out_unlock:
 		drop_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -1000,7 +1058,8 @@ static void unqueue_me_pi(struct futex_q
 	drop_key_refs(&q->key);
 }
 
-static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time)
+static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time,
+			struct rw_semaphore *shared)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1011,9 +1070,10 @@ static int futex_wait(u32 __user *uaddr,
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1036,8 +1096,8 @@ static int futex_wait(u32 __user *uaddr,
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr);
 
@@ -1048,7 +1108,8 @@ static int futex_wait(u32 __user *uaddr,
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 
 		ret = get_user(uval, uaddr);
 
@@ -1067,7 +1128,8 @@ static int futex_wait(u32 __user *uaddr,
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1109,7 +1171,8 @@ static int futex_wait(u32 __user *uaddr,
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -1120,7 +1183,7 @@ static int futex_wait(u32 __user *uaddr,
  * races the kernel might see a 0 value of the futex too.)
  */
 static int futex_lock_pi(u32 __user *uaddr, int detect, unsigned long sec,
-			 long nsec, int trylock)
+			 long nsec, int trylock, struct rw_semaphore *shared)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1141,9 +1204,10 @@ static int futex_lock_pi(u32 __user *uad
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1237,7 +1301,8 @@ static int futex_lock_pi(u32 __user *uad
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	WARN_ON(!q.pi_state);
 	/*
@@ -1251,7 +1316,8 @@ static int futex_lock_pi(u32 __user *uad
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 	spin_lock(q.lock_ptr);
 
 	/*
@@ -1279,7 +1345,8 @@ static int futex_lock_pi(u32 __user *uad
 
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q, hb);
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 		/*
 		 * We own it, so we have to replace the pending owner
 		 * TID. This must be atomic as we have preserve the
@@ -1308,7 +1375,8 @@ static int futex_lock_pi(u32 __user *uad
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q, hb);
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 	}
 
 	if (!detect && ret == -EDEADLK && 0)
@@ -1320,7 +1388,8 @@ static int futex_lock_pi(u32 __user *uad
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 
  uaddr_faulted:
@@ -1331,15 +1400,15 @@ static int futex_lock_pi(u32 __user *uad
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, attempt, shared);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}
 
 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
@@ -1353,7 +1422,7 @@ static int futex_lock_pi(u32 __user *uad
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(u32 __user *uaddr)
+static int futex_unlock_pi(u32 __user *uaddr, struct rw_semaphore *shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1373,9 +1442,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1434,7 +1504,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	return ret;
 
@@ -1446,15 +1517,15 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, attempt, shared);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}
 
 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
@@ -1506,6 +1577,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *shared;
 	static unsigned long printk_interval;
 
 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -1547,11 +1619,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;
 
-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	shared = &current->mm->mmap_sem;
+	down_read(shared);
+	err = get_futex_key(uaddr, &q->key, shared);
 
 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(shared);
 		kfree(q);
 		goto error;
 	}
@@ -1563,7 +1636,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(shared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -1690,7 +1763,7 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake(uaddr, 1);
+				futex_wake(uaddr, 1, &curr->mm->mmap_sem);
 		}
 	}
 	return 0;
@@ -1776,35 +1849,40 @@ long do_futex(u32 __user *uaddr, int op,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
 	int ret;
+	int opm = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *shared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		shared = &current->mm->mmap_sem;
 
-	switch (op) {
+	switch (opm) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout);
+		ret = futex_wait(uaddr, val, timeout, shared);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, val, shared);
 		break;
 	case FUTEX_FD:
 		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
 		ret = futex_fd(uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, shared);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, shared);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, shared);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, val2, 0);
+		ret = futex_lock_pi(uaddr, val, timeout, val2, 0, shared);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr);
+		ret = futex_unlock_pi(uaddr, shared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, val2, 1);
+		ret = futex_lock_pi(uaddr, 0, timeout, val2, 1, shared);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -1820,8 +1898,9 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec t;
 	unsigned long timeout = MAX_SCHEDULE_TIMEOUT;
 	u32 val2 = 0;
+	int opm = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&t, utime, sizeof(t)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&t))
@@ -1834,9 +1913,9 @@ asmlinkage long sys_futex(u32 __user *ua
 		}
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if opm == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE)
+	if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE)
 		val2 = (u32) (unsigned long) utime;
 
 	return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3);
--- linux-2.6.21-rc3/include/linux/futex.h	2007-03-13 13:22:31.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/futex.h	2007-03-15 18:08:37.000000000 +0100
@@ -16,6 +16,18 @@
 #define FUTEX_UNLOCK_PI		7
 #define FUTEX_TRYLOCK_PI	8
 
+#define FUTEX_PRIVATE_FLAG 128
+#define FUTEX_CMD_MASK     ~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.


* [PATCH 2/3] FUTEX : introduce private hashtables
  2006-08-09  6:43                         ` Eric Dumazet
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
  2007-03-15 19:13                           ` [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic Eric Dumazet
@ 2007-03-15 19:16                           ` Eric Dumazet
  2007-03-15 20:25                             ` Nick Piggin
  2007-03-15 19:20                           ` [PATCH 3/3] FUTEX : NUMA friendly global hashtable Eric Dumazet
  3 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:16 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
  Cc: Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel


[PATCH 2/3] FUTEX : introduce private hashtables

This patch introduces a separate hashtable per process to store _PRIVATE 
futexes.
This hashtable is dynamically allocated on the first _PRIVATE futex syscall.
If memory cannot be allocated, the process will use the global hashtable.

Using a separate hashtable has the advantage of lowering the contention on the 
global hashtable. NUMA should benefit from this separation, because the 
allocation should respect the mm policy of the process.

The code uses kmalloc() or vmalloc() depending on the resulting table size 
(which varies with the size of spinlocks). For a normal setup, the private 
hashtable should be 768 bytes on 32bit arches and 1536 bytes on 64bit arches.
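
(As a back-of-envelope check of those numbers, assuming no spinlock debugging:
a futex_hash_bucket is a spinlock_t (4 bytes) plus a struct list_head (two
pointers), i.e. 12 bytes on 32bit and 24 bytes on 64bit (4 bytes of lock, 4 of
padding, 16 of pointers); the 64 private slots then give 64 * 12 = 768 and
64 * 24 = 1536 bytes.)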

The private hashtable is freed when the process exits.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |    4 +
 include/linux/sched.h |    7 ++
 kernel/fork.c         |    1 
 kernel/futex.c        |  112 ++++++++++++++++++++++++++++++++++++++--
 4 files changed, 120 insertions(+), 4 deletions(-)


--- linux-2.6.21-rc3/kernel/futex.c	2007-03-15 18:30:15.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c	2007-03-15 18:54:47.000000000 +0100
@@ -51,11 +51,11 @@
 #include <linux/pagemap.h>
 #include <linux/syscalls.h>
 #include <linux/signal.h>
+#include <linux/vmalloc.h>
 #include <asm/futex.h>
 
 #include "rtmutex_common.h"
 
-#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8)
 
 /*
  * Futexes are matched on equal values of this key.
@@ -147,11 +147,96 @@ struct futex_hash_bucket {
        struct list_head       chain;
 };
 
-static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS];
+
+#if CONFIG_BASE_SMALL
+# define FUTEX_HASH_SLOTS	16
+# define FUTEX_NOPRIVHASH	/* no private hashtable, only one global */
+#else
+# define FUTEX_HASH_SLOTS	256
+# define FUTEX_PRIVHASH_SLOTS	64
+#endif
+
+#define FUTEX_PRIVHASH_SIZE \
+	(FUTEX_PRIVHASH_SLOTS * sizeof(struct futex_hash_bucket))
+/*
+ * futex_queues[] : global hash table
+ *
+ * PTHREAD_PROCESS_SHARED futexes are hashed into this table
+ *
+ * PTHREAD_PROCESS_PRIVATE futexes may be hashed into this table too if the 
+ * owner process failed to allocate its private hashtable (or CONFIG_BASE_SMALL)
+ *
+ */
+static struct futex_hash_bucket futex_queues[FUTEX_HASH_SLOTS];
 
 /* Futex-fs vfsmount entry: */
 static struct vfsmount *futex_mnt;
 
+
+/*
+ * private futexes are hashed into a process private table.
+ * As this table is dynamically allocated, it might be in fact
+ * the global table in case of memory stress.
+ * A pointer to this table is kept in mm_struct.
+ */
+#ifndef FUTEX_NOPRIVHASH
+static void mm_priv_queues_alloc(struct mm_struct *mm)
+{
+	unsigned int ui;
+	struct futex_hash_bucket *queues;
+
+	/*
+	 * FUTEX_PRIVHASH_SIZE is a constant, compiler should choose
+	 * either vmalloc()/kmalloc()  :)
+	 */
+	if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE)
+		queues = vmalloc(FUTEX_PRIVHASH_SIZE);
+	else
+		queues = kmalloc(FUTEX_PRIVHASH_SIZE, GFP_KERNEL);
+
+	if (queues) {
+		for (ui = 0; ui < FUTEX_PRIVHASH_SLOTS; ui++) {
+			spin_lock_init(&queues[ui].lock);
+			INIT_LIST_HEAD(&queues[ui].chain);
+		}
+		spin_lock(&mm->page_table_lock);
+		/*
+		 * check if another thread installed a table before me
+		 */
+		if (mm->mm_priv_futex_queues) {
+			if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE)
+				vfree(queues);
+			else
+				kfree(queues);
+		}
+		else
+			mm->mm_priv_futex_queues = queues;
+	}
+	else {
+		spin_lock(&mm->page_table_lock);
+		if (!mm->mm_priv_futex_queues)
+			mm->mm_priv_futex_queues = futex_queues;
+	}
+	spin_unlock(&mm->page_table_lock);
+}
+#endif
+
+/*
+ * Called from __mmdrop()/mm_free_futex() to eventually free private futexes
+ * hash table attached to mm
+ */
+void __mm_free_futex(struct mm_struct *mm)
+{
+#ifndef FUTEX_NOPRIVHASH
+	if (mm->mm_priv_futex_queues != futex_queues) {
+		if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE)
+			vfree(mm->mm_priv_futex_queues);
+		else
+			kfree(mm->mm_priv_futex_queues);
+	}
+#endif
+}
+
 /*
  * We hash on the keys returned from get_futex_key (see below).
  */
@@ -159,8 +244,27 @@ static struct futex_hash_bucket *hash_fu
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
-			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+			  key->both.offset) % FUTEX_HASH_SLOTS;
+
+#ifdef FUTEX_NOPRIVHASH
+	return &futex_queues[hash];
+#else
+	struct mm_struct *mm;
+	/*
+	 * PROCESS_SHARED futexes are hashed into futex_queues[]
+	 */
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED))
+		return &futex_queues[hash];
+
+	if (FUTEX_PRIVHASH_SLOTS < FUTEX_HASH_SLOTS)
+		hash %= FUTEX_PRIVHASH_SLOTS;
+
+	mm = current->mm;
+	if (unlikely(!mm->mm_priv_futex_queues))
+		mm_priv_queues_alloc(mm);
+
+	return &mm->mm_priv_futex_queues[hash];
+#endif
 }
 
 /*
--- linux-2.6.21-rc3/include/linux/futex.h	2007-03-15 18:08:37.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/futex.h	2007-03-15 18:31:40.000000000 +0100
@@ -115,6 +115,7 @@ handle_futex_death(u32 __user *uaddr, st
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+extern void __mm_free_futex(struct mm_struct *mm);
 #else
 static inline void exit_robust_list(struct task_struct *curr)
 {
@@ -122,6 +123,9 @@ static inline void exit_robust_list(stru
 static inline void exit_pi_state_list(struct task_struct *curr)
 {
 }
+static inline void __mm_free_futex(struct mm_struct *mm)
+{
+}
 #endif
 #endif /* __KERNEL__ */
 
--- linux-2.6.21-rc3/include/linux/sched.h	2007-03-15 18:32:08.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/sched.h	2007-03-15 18:31:40.000000000 +0100
@@ -373,6 +373,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t		ioctx_list_lock;
 	struct kioctx		*ioctx_list;
+	/* private futexes */
+	struct futex_hash_bucket *mm_priv_futex_queues;
 };
 
 struct sighand_struct {
@@ -1374,6 +1376,11 @@ static inline int sas_ss_flags(unsigned 
  * Routines for handling mm_structs
  */
 extern struct mm_struct * mm_alloc(void);
+static inline void mm_free_futex(struct mm_struct * mm)
+{
+	if (mm->mm_priv_futex_queues)
+		__mm_free_futex(mm);
+}
 
 /* mmdrop drops the mm and the page tables */
 extern void FASTCALL(__mmdrop(struct mm_struct *));
--- linux-2.6.21-rc3/kernel/fork.c	2007-03-15 18:32:08.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/fork.c	2007-03-15 18:31:40.000000000 +0100
@@ -374,6 +374,7 @@ void fastcall __mmdrop(struct mm_struct 
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	mm_free_futex(mm);
 	free_mm(mm);
 }
 


* [PATCH 3/3] FUTEX : NUMA friendly global hashtable
  2006-08-09  6:43                         ` Eric Dumazet
                                             ` (2 preceding siblings ...)
  2007-03-15 19:16                           ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
@ 2007-03-15 19:20                           ` Eric Dumazet
  3 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:20 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
  Cc: Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel


[PATCH 3/3] FUTEX : NUMA friendly global hashtable

On NUMA machines, we should get better performance using a big futex 
hashtable, allocated with vmalloc() so that it is spread over several nodes.

I chose a static size of four pages. (Very big NUMA machines have 64k page 
size)
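
(For concreteness, assuming 4 KB pages and 24-byte buckets on 64bit as in
patch 2/3: FUTEX_HASH_SLOTS = (4 * 4096) / 24 = 682 global slots, about 16 KB,
and FUTEX_PRIVHASH_SLOTS = 4096 / 24 = 170 slots per process.)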

This patch should have a temporary effect, as most futexes are expected to be 
stored in process private tables. We can probably drop it in five years :)


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 kernel/futex.c |   26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)


--- linux-2.6.21-rc3/kernel/futex.c	2007-03-15 18:54:47.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c	2007-03-15 18:54:47.000000000 +0100
@@ -152,8 +152,13 @@ struct futex_hash_bucket {
 # define FUTEX_HASH_SLOTS	16
 # define FUTEX_NOPRIVHASH	/* no private hashtable, only one global */
 #else
-# define FUTEX_HASH_SLOTS	256
-# define FUTEX_PRIVHASH_SLOTS	64
+# ifdef CONFIG_NUMA
+#  define FUTEX_HASH_SLOTS	((4*PAGE_SIZE)/sizeof(struct futex_hash_bucket))
+#  define FUTEX_PRIVHASH_SLOTS	(4096 / sizeof(struct futex_hash_bucket))
+# else
+#  define FUTEX_HASH_SLOTS	256
+#  define FUTEX_PRIVHASH_SLOTS	64
+# endif
 #endif
 
 #define FUTEX_PRIVHASH_SIZE \
@@ -166,8 +171,14 @@ struct futex_hash_bucket {
  * PTHREAD_PROCESS_PRIVATE futexes may be hashed into this table too if the 
  * owner process failed to allocate its private hashtable (or CONFIG_BASE_SMALL)
  *
+ * On NUMA configs, table is allocated with vmalloc() to spread this hash table
+ * up to 4 nodes (we use 4 pages)
  */
+#ifdef CONFIG_NUMA
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#else
 static struct futex_hash_bucket futex_queues[FUTEX_HASH_SLOTS];
+#endif
 
 /* Futex-fs vfsmount entry: */
 static struct vfsmount *futex_mnt;
@@ -2051,7 +2062,16 @@ static int __init init(void)
 		return PTR_ERR(futex_mnt);
 	}
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+#ifdef CONFIG_NUMA
+	/*
+	 * vmalloc() is supposed to obey mempolicy and spread our 4 pages
+	 * on several nodes
+	 */
+	futex_queues = vmalloc(FUTEX_HASH_SLOTS * sizeof(*futex_queues));
+	if (!futex_queues)
+		panic("Failed to allocate futex hash table\n");
+#endif
+	for (i = 0; i < FUTEX_HASH_SLOTS; i++) {
 		INIT_LIST_HEAD(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
@ 2007-03-15 20:15                             ` Nick Piggin
  2007-03-16  8:05                             ` Peter Zijlstra
  2007-04-04  7:16                             ` Ulrich Drepper
  2 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> Hi
> 
> I'm pleased to present these patches, which improve Linux futex performance and 
> scalability on UP, SMP and NUMA configs alike.
> 
> I had this idea last year but I was not understood, probably because I did 
> not give enough explanations. Sorry if this mail is really long...

Yes please, this is really nice. The mmap_sem is already overworked just
doing real mm stuff, let alone making it the central point of our "scalable"
thread synchronisation syscall.

To summarise: not only will this make futexes scale much better, but it will
also take some load off mmap_sem.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 19:16                           ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
@ 2007-03-15 20:25                             ` Nick Piggin
  2007-03-15 21:09                               ` Ulrich Drepper
  2007-03-15 22:59                               ` William Lee Irwin III
  0 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 20:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> [PATCH 2/3] FUTEX : introduce private hashtables
> 
> This patch introduces a separate hashtable per process to store _PRIVATE 
> futexes.
> This hashtable is dynamically allocated on the first _PRIVATE futex syscall.
> If memory cannot be allocated, the process will use the global hashtable.
> 
> Using a separate hashtable has the advantage of lowering the contention on the 
> global hashtable. NUMA should benefit from this separation, because the 
> allocation should respect the mm policy of the process.
> 
> The code uses kmalloc()/vmalloc() depending on the size of the spinlocks. For a 
> normal setup, the size of the private hashtable should be 768 bytes on 32bit 
> arches, 1536 bytes on 64bit arches.
> 
> The private hashtable is freed when the process exits.


I do disagree with this patch, though.

There should be little contention on the memory in the global hash anyway,
because we can roughly reduce contention as a factor of hash-size/cacheline-size.

What we will have are cache misses on the global table... but we're going to
get cache misses on those private tables as well. Also, you never know what
the use cases are going to be... there could be an application with thousands
of threads and mutexes in which case your private hash could be too small... I
think we want to stay away from per-mm _anything_ with private futexes, which
is the direction that your patch 1 takes us.

I would just avoid the complexity and setup/teardown costs, and just use a
vmalloc'ed global hash for NUMA.
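
Back-of-envelope, with sizes assumed rather than measured (a sketch of
the factor, not kernel code):

#include <stdio.h>

/* Rough stand-in for struct futex_hash_bucket: a spinlock plus a
 * list_head.  The sizes are assumptions for a common x86-64 config. */
struct bucket_est {
        unsigned int lock;      /* spinlock_t */
        void *chain_next;       /* struct list_head */
        void *chain_prev;
};

int main(void)
{
        const unsigned int cacheline = 64;      /* assumed */
        const unsigned int slots = 256;         /* FUTEX_HASH_SLOTS */
        unsigned int per_line = cacheline / sizeof(struct bucket_est);

        /* Two unrelated futexes contend on the same line with
         * probability roughly per_line/slots, i.e. contention drops
         * by a factor of hash-size/cacheline-size. */
        printf("%u buckets per cacheline, %u independent cachelines\n",
               per_line, slots / per_line);
        return 0;
}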

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 20:25                             ` Nick Piggin
@ 2007-03-15 21:09                               ` Ulrich Drepper
  2007-03-15 21:29                                 ` Nick Piggin
  2007-03-15 22:59                               ` William Lee Irwin III
  1 sibling, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2007-03-15 21:09 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 3/15/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> There should be little contention on the memory in the global hash anyway,
> because we can roughly reduce contention as a factor of hash-size/cacheline-size.
>
> What we will have are cache misses on the global table... but we're going to
> get cache misses on those private tables as well.

I'm thinking about NUMA cases.  If you have private tables for a
process which is pinned to some cluster in a NUMA machine the table is
local to the node.  If you have a global table you cannot optimize
your application for such a situation because at least some of the
pages of the global table are remote.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 21:09                               ` Ulrich Drepper
@ 2007-03-15 21:29                                 ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 21:29 UTC (permalink / raw)
  To: Ulrich Drepper
  Cc: Eric Dumazet, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 3/15/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
>> There should be little contention on the memory in the global hash 
>> anyway,
>> because we can roughly reduce contention as a factor of 
>> hash-size/cacheline-size.
>>
>> What we will have are cache misses on the global table... but we're 
>> going to
>> get cache misses on those private tables as well.
> 
> 
> I'm thinking about NUMA cases.  If you have private tables for a
> process which is pinned to some cluster in a NUMA machine the table is
> local to the node.  If you have a global table you cannot optimize
> your application for such a situation because at least some of the
> pages of the global table are remote.
> 

That's true, but it might also be improved in other ways.

At least once we get the basic support in the kernel, and glibc picks
it up, then we have a better base to evaluate these more exotic changes
against.

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 20:25                             ` Nick Piggin
  2007-03-15 21:09                               ` Ulrich Drepper
@ 2007-03-15 22:59                               ` William Lee Irwin III
  1 sibling, 0 replies; 78+ messages in thread
From: William Lee Irwin III @ 2007-03-15 22:59 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Fri, Mar 16, 2007 at 07:25:53AM +1100, Nick Piggin wrote:
> I would just avoid the complexity and setup/teardown costs, and just
> use a vmalloc'ed global hash for NUMA.

This patch is not the way to go, but neither are vmalloc()'d global
hashtables. When you just happen to hash to the wrong node, you're in
for quasi-unreproducible poor performance. The size is never right, at
which point RCU resizing is required with all its overhead and memory
freeing delays and failure to resize (even if only to contract) under
pressure.

Better would be to use a different data structure admitting
locality of reference and adaptively sizing itself, furthermore
localized to the appropriate sharing domain.  For file-backed futexes,
this would be the struct address_space. For anonymous-backed futexes,
this would be the COW sharing group, which an anon_vma could almost be
used to represent.

Using an object to properly represent the COW
sharing group (i.e. Hugh's struct anon) would do the trick, and one
might as well move the rmap code over to it while we're at it since the
anon_vma scanning tricks are all pointless overhead once the COW
sharing group is accurately tracked (the scanning around for nearby vmas
with ->anon_vma set is not great anyway, though the overhead is hidden
in the noise of large teardown and setup operations; inheriting on
fork() is much simpler and faster).

In such a manner localization is accomplished while no interface
extensions are required.


-- wli

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
  2007-03-15 20:15                             ` Nick Piggin
@ 2007-03-16  8:05                             ` Peter Zijlstra
  2007-03-16  9:30                               ` Eric Dumazet
  2007-04-04  7:16                             ` Ulrich Drepper
  2 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16  8:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> Hi
> 
> I'm pleased to present these patches, which improve Linux futex performance and 
> scalability on UP, SMP and NUMA configs alike.
> 
> I had this idea last year but I was not understood, probably because I did 
> not give enough explanations. Sorry if this mail is really long...

I started playing with it after your last reference to it, I have some
code here (against -rt):
  http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/

Which I will post once I have found what keeps pthread_join() from
completing :-(

It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16  8:05                             ` Peter Zijlstra
@ 2007-03-16  9:30                               ` Eric Dumazet
  2007-03-16 10:10                                 ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-03-16  9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Friday 16 March 2007 09:05, Peter Zijlstra wrote:
> On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> > Hi
> >
> > I'm pleased to present these patches, which improve Linux futex
> > performance and scalability on UP, SMP and NUMA configs alike.
> >
> > I had this idea last year but I was not understood, probably because I
> > did not give enough explanations. Sorry if this mail is really long...
>
> I started playing with it after your last reference to it, I have some
> code here (against -rt):
>   http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
>
> Which I will post once I have found what keeps pthread_join() from
> completing :-(
>
> It basically adds a per task vma lookup cache which can also activate
> the private logic without explicit use of the new interface.

Hi Peter

I don't think yet another cache will help in the general case.
A typical program uses many vmas at once...

glibc has internal futexes, on a different vma than futexes declared in your 
program. Each shared library is going to have its own vma for its data (and 
futexes)

(244 vmas on one kmail program for example)

About your guess_futex_shared() thing, I'm missing the vma_anon() definition.
But if it has to walk the vmas (and take mmap_sem), you already lose the 
PRIVATE benefit.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16  9:30                               ` Eric Dumazet
@ 2007-03-16 10:10                                 ` Peter Zijlstra
  2007-03-16 10:30                                   ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16 10:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Fri, 2007-03-16 at 10:30 +0100, Eric Dumazet wrote:
> On Friday 16 March 2007 09:05, Peter Zijlstra wrote:
> > On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> > > Hi
> > >
> > > I'm pleased to present these patches, which improve Linux futex
> > > performance and scalability on UP, SMP and NUMA configs alike.
> > >
> > > I had this idea last year but I was not understood, probably because I
> > > did not give enough explanations. Sorry if this mail is really long...
> >
> > I started playing with it after your last reference to it, I have some
> > code here (against -rt):
> >   http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
> >
> > Which I will post once I have found what keeps pthread_join() from
> > completing :-(
> >
> > It basically adds a per task vma lookup cache which can also activate
> > the private logic without explicit use of the new interface.
> 
> Hi Peter
> 
> I don't think yet another cache will help in the general case.
> A typical program uses many vmas at once...
> 
> glibc has internal futexes, on a different vma than futexes declared in your 
> program. Each shared library is going to have its own vma for its data (and 
> futexes)
> 
> (244 vmas on one kmail program for example)

Yeah, I was just hoping a few cache entries would be enough to get the
worst of them. A benchmark will have to tell I guess.

> About your guess_futex_shared() thing, I'm missing the vma_anon() definition.

http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch

> But if it has to walk the vmas (and take mmap_sem), you already lose the 
> PRIVATE benefit.

It doesn't take mmap_sem, I am aware of the problems.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16 10:10                                 ` Peter Zijlstra
@ 2007-03-16 10:30                                   ` Eric Dumazet
  2007-03-16 10:36                                     ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-03-16 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Friday 16 March 2007 11:10, Peter Zijlstra wrote:

> http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch

Oh thanks

>
> > But if it has to walk the vmas (and take mmap_sem), you already lose the
> > PRIVATE benefit.
>
> It doesn't take mmap_sem, I am aware of the problems.

Yes, but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on 
x86_64 (sizeof(struct vma_cache) = 136)
and dirty one bit, so it might be more expensive than the mmap_sem ...
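
For reference, the arithmetic (the 136 and 64 byte figures are taken
from this mail and assumed, not re-measured):

#include <stdio.h>

int main(void)
{
        const unsigned int sz = 136;    /* sizeof(struct vma_cache), x86_64 */
        const unsigned int line = 64;   /* x86_64 cacheline size */

        printf("cachelines touched: %u\n", (sz + line - 1) / line); /* 3 */
        return 0;
}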


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16 10:30                                   ` Eric Dumazet
@ 2007-03-16 10:36                                     ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Fri, 2007-03-16 at 11:30 +0100, Eric Dumazet wrote:
> On Friday 16 March 2007 11:10, Peter Zijlstra wrote:
> 
> > http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch
> 
> Oh thanks
> 
> >
> > > But if it has to walk the vmas (and take mmap_sem), you already lose the
> > > PRIVATE benefit.
> >
> > It doesn't take mmap_sem, I am aware of the problems.
> 
> Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on 
> x86_64 (sizeof(struct vma_cache) = 136)
> and dirty one bit, so it might be more expensive than the mmap_sem ...

I thought the cacheline was 128 bytes, but point taken.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
  2007-03-15 20:15                             ` Nick Piggin
  2007-03-16  8:05                             ` Peter Zijlstra
@ 2007-04-04  7:16                             ` Ulrich Drepper
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
  2 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2007-04-04  7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 3/15/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> But the fact is POSIX defined PRIVATE and SHARED, allowing a clear separation and
> optimal performance if carefully implemented. The time has come for Linux to have
> better threading performance.

Now that I've been pointed to the thread I can comment on it.

Yes, this approach makes a lot of sense.  Programs which share sync
objects without declaring them correctly are broken and deserve to
fail.  It would be quite a lot of change in libpthread but it's
manageable.

I haven't tested the code nor will I likely do until I get a Fedora
kernel with it.  So, convince DaveJ to take it for testing.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH] FUTEX : new PRIVATE futexes
  2007-04-04  7:16                             ` Ulrich Drepper
@ 2007-04-05 17:49                               ` Eric Dumazet
  2007-04-05 20:43                                 ` Ulrich Drepper
                                                   ` (4 more replies)
  0 siblings, 5 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-05 17:49 UTC (permalink / raw)
  To: Ulrich Drepper, Andrew Morton, Dave Jones
  Cc: Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Hi all

I'm pleased to present this patch, which improves Linux futex performance and 
scalability, merely by avoiding taking the mmap_sem rwsem.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4
(so supports 64bit futexes)

In this third version I dropped the NUMA optimizations and the process-private
hash table, to let the new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current Linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to 
perform lookups or insertion/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to :


1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on the value of the futex
                (bail out if value != expected_value, returning EWOULDBLOCK;
                 see the sketch after this list)
        (various loops if the test triggers mm faults)

6) - queue the context into the hash table, release the lock taken in 4)

7) - release the read_lock on mmap_sem

   <block>

8) - Eventually unqueue the context (but rarely, as this part
 may be done by futex_wake())
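
A minimal userspace sketch of the step 5 bail-out (the EWOULDBLOCK
path; illustrative only, not kernel code):

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_word = 1;

int main(void)
{
        /* We claim to expect 0, but the word holds 1: the kernel
         * detects the mismatch in step 5 and returns without queueing. */
        long rc = syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, NULL);

        if (rc == -1 && errno == EAGAIN)        /* EWOULDBLOCK == EAGAIN */
                printf("value mismatch, no sleep\n");
        return 0;
}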

Futexes were designed to improve scalability, but the current implementation
has various problems :

- Central hashtable :
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 This means NUMA imbalance, because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

 Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
 ops on mmap_sem, dirtying its cache line :
        - lots of cache line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
 It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
 because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'. This has a cost.

But the fact is POSIX defined PRIVATE and SHARED, allowing a clear separation and
optimal performance if carefully implemented. The time has come for Linux to have 
better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.
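
In other words, the key can simply be the pair (mm, page-aligned
address).  A toy model of that (the mixing function and the 4K page
size are assumptions of this sketch, not the kernel's jhash2):

#include <stdint.h>
#include <stdio.h>

struct private_key {
        uintptr_t mm;           /* stands in for current->mm */
        uintptr_t address;      /* futex VA rounded down to page start */
};

static unsigned int hash_slot(const struct private_key *k, unsigned int slots)
{
        uintptr_t h = k->mm ^ (k->address >> 2);

        return (unsigned int)(h * 2654435761u) & (slots - 1);
}

int main(void)
{
        int futex_word = 0;
        struct private_key k = {
                .mm = 0xdead000,        /* fake mm pointer */
                .address = (uintptr_t)&futex_word & ~(uintptr_t)4095,
        };

        printf("private futex hashes to slot %u of 256\n",
               hash_slot(&k, 256));
        return 0;
}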



If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and
be prepared to fall back on the old subcommands for old kernels. Using one
global variable holding FUTEX_PRIVATE_FLAG or 0 should be OK (sketched
below).

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.
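
A minimal sketch of the userspace side (my guess at the shape of the
fallback, not actual NPTL code): the global starts as FUTEX_PRIVATE_FLAG
and is cleared for good the first time an old kernel rejects a _PRIVATE
subcommand with ENOSYS.

#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef FUTEX_PRIVATE_FLAG
#define FUTEX_PRIVATE_FLAG 128  /* from this patch; absent in old headers */
#endif

static int futex_private_flag = FUTEX_PRIVATE_FLAG;

long futex_wait_maybe_private(int *uaddr, int val)
{
        long rc = syscall(SYS_futex, uaddr,
                          FUTEX_WAIT | futex_private_flag, val, NULL);

        if (rc == -1 && errno == ENOSYS && futex_private_flag) {
                futex_private_flag = 0; /* old kernel: fall back for good */
                rc = syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL);
        }
        return rc;
}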

Compatibility with old applications is preserved; they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries 
*and* new binaries, because both binaries will use the old subcommands.

Note : The vast majority of futexes should be using PROCESS_PRIVATE semantics,
as this is the default. Almost all applications should benefit
from these changes (new kernel and updated libc).

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call
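
For what it's worth, a rough sketch of how such numbers can be measured
(x86 rdtsc; the loop count and methodology are my assumptions, not
necessarily what was used above):

#define _GNU_SOURCE
#include <linux/futex.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;

        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
        int futex_word = 1;     /* never matches the expected value 0 */
        const int loops = 100000;
        uint64_t t0 = rdtsc();
        int i;

        for (i = 0; i < loops; i++)
                syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0, NULL);

        printf("%llu cycles per futex(FUTEX_WAIT) call\n",
               (unsigned long long)((rdtsc() - t0) / loops));
        return 0;
}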

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   45 +++++-
 kernel/futex.c        |  294 +++++++++++++++++++++++++---------------
 2 files changed, 230 insertions(+), 109 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_CMP_REQUEUE_PI	9
 
+#define FUTEX_PRIVATE_FLAG	128
+#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *	mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
  */
+
+#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	unsigned long __user *uaddr;
 	struct {
@@ -135,8 +157,27 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(void __user *uaddr, union futex_key *key);
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, union futex_key *key,
+		  struct rw_semaphore *shared);
+
+/**
+ * get_futex_key_refs - Take a reference to the resource addressed by a key
+ */
 void get_futex_key_refs(union futex_key *key);
+
+/**
+ * drop_futex_key_refs - Drop a reference to the resource addressed by a key.
+ */
 void drop_futex_key_refs(union futex_key *key);
 
 #ifdef CONFIG_FUTEX
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -199,9 +202,12 @@ static inline int match_futex(union fute
  * Returns: 0, or negative error code.
  * The key words are stored in *key on success.
  *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * shared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, union futex_key *key,
+	struct rw_semaphore *shared)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'address' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 * Note : At this point, address points to the start of page,
+	 *        not the real futex address, this is ok.
+	 */
+	if (!shared) {
+		if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset += FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
-	}
+			break;
+  	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
@@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if shared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address, int attempt,
+	struct rw_semaphore *shared)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = 0;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+	if (attempt > 2)
 		return -EFAULT;
 
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
-	}
-	return 0;
+	if (!shared)
+		down_read(&mm->mmap_sem);
+
+	if (!(vma = find_vma(mm, address)) ||
+	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
+		ret = -EFAULT;
+
+	else
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			current->maj_flt++;
+			break;
+		default:
+			ret = -EFAULT;
+		}
+	if (!shared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -705,7 +746,8 @@ double_lock_hb(struct futex_hash_bucket 
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int nr_wake,
+	struct rw_semaphore *shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -713,9 +755,10 @@ static int futex_wake(unsigned long __us
 	union futex_key key;
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -807,7 +851,8 @@ retry:
  */
 static int
 futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+		 struct rw_semaphore *shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -825,12 +870,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -998,7 +1044,8 @@ out:
  */
 static int
 futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op, int futex64)
+	      int nr_wake, int nr_wake2, int op, int futex64,
+		struct rw_semaphore *shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1007,12 +1054,13 @@ futex_wake_op(unsigned long __user *uadd
 	int ret, op_ret, attempt = 0;
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1055,15 +1103,14 @@ retry:
 		 * futex_atomic_op_inuser needs to both read and write
 		 * *(int __user *)uaddr2, but we can't modify it
 		 * non-atomically.  Therefore, if get_user below is not
-		 * enough, we need to handle the fault ourselves, while
-		 * still holding the mmap_sem.
+		 * enough, we need to handle the fault ourselves. Make 
+		 * sure we hold mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						attempt, shared);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -1071,7 +1118,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 
 		ret = futex_get_user(&dummy, uaddr2, futex64);
 		if (ret)
@@ -1108,7 +1156,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -1118,7 +1167,8 @@ out:
  */
 static int
 futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+		struct rw_semaphore *shared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1127,12 +1177,13 @@ futex_requeue(unsigned long __user *uadd
 	int ret, drop_count = 0;
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, &key1, shared);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, &key2, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1155,7 +1206,8 @@ futex_requeue(unsigned long __user *uadd
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (shared)
+				up_read(shared);
 
 			ret = futex_get_user(&curval, uaddr1, futex64);
 
@@ -1208,7 +1260,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -1392,7 +1445,8 @@ static int fixup_pi_state_owner(unsigned
 
 static long futex_wait_restart(struct restart_block *restart);
 static int futex_wait(unsigned long __user *uaddr, unsigned long val,
-		      ktime_t *abs_time, int futex64)
+		      ktime_t *abs_time, int futex64,
+		      struct rw_semaphore *shared)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1405,9 +1459,10 @@ static int futex_wait(unsigned long __us
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1430,8 +1485,8 @@ static int futex_wait(unsigned long __us
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr, futex64);
 
@@ -1442,7 +1497,8 @@ static int futex_wait(unsigned long __us
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 		ret = futex_get_user(&uval, uaddr, futex64);
 
 		if (!ret)
@@ -1468,7 +1524,8 @@ static int futex_wait(unsigned long __us
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1568,7 +1625,8 @@ static int futex_wait(unsigned long __us
 			}
 			/* Unqueue and drop the lock */
 			unqueue_me_pi(&q);
-			up_read(&curr->mm->mmap_sem);
+			if (shared)
+				up_read(shared);
 		}
 
 		debug_rt_mutex_free_waiter(&q.waiter);
@@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
 		restart->arg1 = val;
 		restart->arg2 = (unsigned long)abs_time;
 		restart->arg3 = (unsigned long)futex64;
+		if (shared)
+			restart->arg3 |= 2;
 		return -ERESTART_RESTARTBLOCK;
 	}
 
@@ -1605,7 +1665,8 @@ static int futex_wait(unsigned long __us
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 }
 
@@ -1615,10 +1676,11 @@ static long futex_wait_restart(struct re
 	unsigned long __user *uaddr = (unsigned long __user *)restart->arg0;
 	unsigned long val = restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
-	int futex64 = (int)restart->arg3;
+	int futex64 = (int)restart->arg3 & 1 ;
+	struct rw_semaphore *shared = (restart->arg3 & 2) ? &current->mm->mmap_sem : NULL;
 
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, val, abs_time, futex64);
+	return (long)futex_wait(uaddr, val, abs_time, futex64, shared);
 }
 
 
@@ -1674,7 +1736,7 @@ static void set_pi_futex_owner(struct fu
  * races the kernel might see a 0 value of the futex too.)
  */
 static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time,
-			 int trylock, int futex64)
+			 int trylock, int futex64, struct rw_semaphore *shared)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1695,9 +1757,10 @@ static int futex_lock_pi(unsigned long _
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, &q.key, shared);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1818,7 +1881,8 @@ static int futex_lock_pi(unsigned long _
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	WARN_ON(!q.pi_state);
 	/*
@@ -1832,7 +1896,8 @@ static int futex_lock_pi(unsigned long _
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	down_read(&curr->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 	spin_lock(q.lock_ptr);
 
 	/*
@@ -1854,7 +1919,8 @@ static int futex_lock_pi(unsigned long _
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q);
-		up_read(&curr->mm->mmap_sem);
+		if (shared)
+			up_read(shared);
 	}
 
 	if (!detect && ret == -EDEADLK && 0)
@@ -1866,7 +1932,8 @@ static int futex_lock_pi(unsigned long _
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 	return ret;
 
  uaddr_faulted:
@@ -1877,15 +1944,15 @@ static int futex_lock_pi(unsigned long _
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, attempt, shared);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}
 
 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -1899,7 +1966,8 @@ static int futex_lock_pi(unsigned long _
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+	struct rw_semaphore *shared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1919,9 +1987,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (shared)
+		down_read(shared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, &key, shared);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1980,7 +2049,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	return ret;
 
@@ -1992,15 +2062,15 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, attempt, shared);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}
 
 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (shared)
+		up_read(shared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -2052,6 +2122,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *shared;
 	static unsigned long printk_interval;
 
 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2093,11 +2164,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;
 
-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	shared = &current->mm->mmap_sem;
+	down_read(shared);
+	err = get_futex_key(uaddr, &q->key, shared);
 
 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(shared);
 		kfree(q);
 		goto error;
 	}
@@ -2109,7 +2181,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(shared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -2238,7 +2310,8 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake((unsigned long __user *)uaddr, 1);
+				futex_wake((unsigned long __user *)uaddr, 1,
+						&curr->mm->mmap_sem);
 		}
 	}
 	return 0;
@@ -2326,13 +2399,18 @@ long do_futex(unsigned long __user *uadd
 	      unsigned long val2, unsigned long val3, int fut64)
 {
 	int ret;
+	int opm = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *shared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		shared = &current->mm->mmap_sem;
 
-	switch (op) {
+	switch (opm) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout, fut64);
+		ret = futex_wait(uaddr, val, timeout, fut64, shared);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, val, shared);
 		break;
 	case FUTEX_FD:
 		if (fut64)
@@ -2342,25 +2420,25 @@ long do_futex(unsigned long __user *uadd
 			ret = futex_fd((u32 __user *)uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64, shared);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64, shared);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64, shared);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64);
+		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64, shared);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr, fut64);
+		ret = futex_unlock_pi(uaddr, fut64, shared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64, shared);
 		break;
 	case FUTEX_CMP_REQUEUE_PI:
-		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64, shared);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u64 val2 = 0;
+	int opm = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (opm == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if opm == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE
+	    || opm == FUTEX_CMP_REQUEUE_PI)
 		val2 = (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2409,23 +2488,24 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u32 val2 = 0;
+	int opm = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (opm == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
 	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE
+	    || opm == FUTEX_CMP_REQUEUE_PI)
 		val2 = (u32) (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
@ 2007-04-05 20:43                                 ` Ulrich Drepper
  2007-04-06  1:19                                 ` Nick Piggin
                                                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 78+ messages in thread
From: Ulrich Drepper @ 2007-04-05 20:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 4/5/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  include/linux/futex.h |   45 +++++-
>  kernel/futex.c        |  294 +++++++++++++++++++++++++---------------
>  2 files changed, 230 insertions(+), 109 deletions(-)

I cannot vouch for the whole code but the concept is very sound and
definitely within what the specification allows:

Acked-by: Ulrich Drepper <drepper@redhat.com>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
  2007-04-05 20:43                                 ` Ulrich Drepper
@ 2007-04-06  1:19                                 ` Nick Piggin
  2007-04-06  5:53                                   ` Eric Dumazet
  2007-04-06  6:05                                   ` Hugh Dickins
  2007-04-06 12:26                                 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra
                                                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-06  1:19 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Hi Eric,

Thanks for doing this... It's looking good, I just have some minor
comments:

Eric Dumazet wrote:

> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

> --- linux-2.6.21-rc5-mm4/kernel/futex.c
> +++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
> @@ -16,6 +16,9 @@
>   *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
>   *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
>   *
> + *  PRIVATE futexes by Eric Dumazet
> + *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
> + *
>   *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
>   *  enough at me, Linus for the original (flawed) idea, Matthew
>   *  Kirkwood for proof-of-concept implementation.
> @@ -199,9 +202,12 @@ static inline int match_futex(union fute
>   * Returns: 0, or negative error code.
>   * The key words are stored in *key on success.
>   *
> - * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
> + * shared is NULL for PROCESS_PRIVATE futexes
> + * For other futexes, it points to &current->mm->mmap_sem and
> + * caller must have taken the reader lock. but NOT any spinlocks.
>   */
> -int get_futex_key(void __user *uaddr, union futex_key *key)
> +int get_futex_key(void __user *uaddr, union futex_key *key,
> +	struct rw_semaphore *shared)

Can we pass in something other than the rw_semaphore here? Seeing as
it only actually gets used as a flag, it might be nicer just to pass
a 0 or 1? And all through the call stack...

Did the whole thing just turn out neater when you passed the rwsem?
We always know to use current->mm->mmap_sem, so it doesn't seem like
a boolean flag would hurt?

>  {
>  	unsigned long address = (unsigned long)uaddr;
>  	struct mm_struct *mm = current->mm;
> @@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
>  	address -= key->both.offset;
>  
>  	/*
> +	 * PROCESS_PRIVATE futexes are fast.
> +	 * As the mm cannot disappear under us and the 'key' only needs
> +	 * virtual address, we dont even have to find the underlying vma.
> +	 * Note : We do have to check 'address' is a valid user address,
> +	 *        but access_ok() should be faster than find_vma()
> +	 * Note : At this point, address points to the start of page,
> +	 *        not the real futex address, this is ok.
> +	 */
> +	if (!shared) {
> +		if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
> +			return -EFAULT;

Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
should depend on the size of the operation. Maybe the access_ok check
should go outside get_futex_key?


> +		key->private.mm = mm;
> +		key->private.address = address;
> +		return 0;
> +	}
> +	/*
>  	 * The futex is hashed differently depending on whether
>  	 * it's in a shared or private mapping.  So check vma first.
>  	 */
> @@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un
>  	 * mappings of _writable_ handles.
>  	 */
>  	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
> +		key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm */
>  		key->private.mm = mm;
>  		key->private.address = address;
>  		return 0;
> @@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un
>  	 * Linear file mappings are also simple.
>  	 */
>  	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
> -	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
> +	key->both.offset += FUT_OFF_INODE; /* inode-based key. */
>  	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
>  		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
>  				     + vma->vm_pgoff);

I like |= for adding flags, it seems less ambiguous. But I guess that's
a matter of opinion. Hugh seems to like +=, and I can't argue with him
about style issues ;)

> @@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
>   * Take a reference to the resource addressed by a key.
>   * Can be called while holding spinlocks.
>   *
> - * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
> - * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
>   */
>  inline void get_futex_key_refs(union futex_key *key)
>  {
> -	if (key->both.ptr != 0) {
> -		if (key->both.offset & 1)
> +	if (key->both.ptr == 0)
> +		return;
> +	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> +		case FUT_OFF_INODE:
>  			atomic_inc(&key->shared.inode->i_count);
> -		else
> +			break;
> +		case FUT_OFF_MMSHARED:
>  			atomic_inc(&key->private.mm->mm_count);
> -	}
> +			break;
> +  	}
>  }
>  EXPORT_SYMBOL_GPL(get_futex_key_refs);
>  
> @@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
>   */
>  void drop_futex_key_refs(union futex_key *key)
>  {
> -	if (key->both.ptr != 0) {
> -		if (key->both.offset & 1)
> +	if (key->both.ptr == 0)
> +		return;
> +	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
> +		case FUT_OFF_INODE:
>  			iput(key->shared.inode);
> -		else
> +			break;
> +		case FUT_OFF_MMSHARED:
>  			mmdrop(key->private.mm);
> +			break;
>  	}
>  }
>  EXPORT_SYMBOL_GPL(drop_futex_key_refs);

I wonder if it would be worthwhile inlining and likely()ing the
private fastpath? Might make it pretty compact... I guess that's
something to worry about after glibc gets support.

> @@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de
>  }
>  
>  /*
> - * Fault handling. Called with current->mm->mmap_sem held.
> + * Fault handling.
> + * if shared is non NULL, current->mm->mmap_sem is already held
>   */
> -static int futex_handle_fault(unsigned long address, int attempt)
> +static int futex_handle_fault(unsigned long address, int attempt,
> +	struct rw_semaphore *shared)
>  {
>  	struct vm_area_struct * vma;
>  	struct mm_struct *mm = current->mm;
> +	int ret = 0;
>  
> -	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
> -	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
> +	if (attempt > 2)
>  		return -EFAULT;
>  
> -	switch (handle_mm_fault(mm, vma, address, 1)) {
> -	case VM_FAULT_MINOR:
> -		current->min_flt++;
> -		break;
> -	case VM_FAULT_MAJOR:
> -		current->maj_flt++;
> -		break;
> -	default:
> -		return -EFAULT;
> -	}
> -	return 0;
> +	if (!shared)
> +		down_read(&mm->mmap_sem);
> +
> +	if (!(vma = find_vma(mm, address)) ||
> +	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
> +		ret = -EFAULT;
> +
> +	else
> +		switch (handle_mm_fault(mm, vma, address, 1)) {
> +		case VM_FAULT_MINOR:
> +			current->min_flt++;
> +			break;
> +		case VM_FAULT_MAJOR:
> +			current->maj_flt++;
> +			break;
> +		default:
> +			ret = -EFAULT;
> +		}
> +	if (!shared)
> +		up_read(&mm->mmap_sem);
> +	return ret;
>  }
>  
>  /*

You've got an extra space after the if (maybe for clarity?). In this
situation I prefer putting braces around both the if and the else, and
if you get rid of that blank line, it doesn't cost you anything more ;)

> @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
>  		restart->arg1 = val;
>  		restart->arg2 = (unsigned long)abs_time;
>  		restart->arg3 = (unsigned long)futex64;
> +		if (shared)
> +			restart->arg3 |= 2;

Could you make this into a proper flags argument and use #define CONSTANTs for it?

> @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
>  	struct timespec ts;
>  	ktime_t t, *tp = NULL;
>  	u64 val2 = 0;
> +	int opm = op & FUTEX_CMD_MASK;

What's opm stand for?

>  
> -	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
> +	if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) {


-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-06  1:19                                 ` Nick Piggin
@ 2007-04-06  5:53                                   ` Eric Dumazet
  2007-04-06 11:50                                     ` Nick Piggin
  2007-04-06  6:05                                   ` Hugh Dickins
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-06  5:53 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Nick Piggin wrote:
> Hi Eric,
> 
> Thanks for doing this... It's looking good, I just have some minor
> comments:

Hi Nick, thanks for reviewing.

> 
> Eric Dumazet wrote:
>>   */
>> -int get_futex_key(void __user *uaddr, union futex_key *key)
>> +int get_futex_key(void __user *uaddr, union futex_key *key,
>> +    struct rw_semaphore *shared)
> 
> Can we pass in something other than the rw_semaphore here? Seeing as
> it only actually gets used as a flag, it might be nicer just to pass
> a 0 or 1? And all through the call stack...
> 
> Did the whole thing just turn out neater when you passed the rwsem?
> We always know to use current->mm->mmap_sem, so it doesn't seem like
> a boolean flag would hurt?

That's a good question.

current->mm->mmap_sem being computed once is a win in itself, because 
accessing current is not cheap.
It also issues the memory access through part of the pointer chain in advance, 
before its use. It is a prefetch() equivalent for free: if current->mm is 
not in the CPU cache, the CPU won't stall, because the next instructions 
don't depend on it.

This means fewer CPU stalls when current->mm is not in the CPU cache. That's 
difficult to benchmark, but you can trust me.

A flag means:

if (flag)
     up_read(&current->mm->mmap_sem);

This generates quite bad code.

if (ptr)
    up_read(ptr);

generates *much* better code.

So this is a cleanup and a runtime optimization.

I did a similar optimization in commit 163da958ba5282cbf85e8b3dc08e4f51f8b01c5e

I invite you to check it:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=163da958ba5282cbf85e8b3dc08e4f51f8b01c5e



> 
>>  {
>>      unsigned long address = (unsigned long)uaddr;
>>      struct mm_struct *mm = current->mm;
>> @@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un
>>      address -= key->both.offset;
>>  
>>      /*
>> +     * PROCESS_PRIVATE futexes are fast.
>> +     * As the mm cannot disappear under us and the 'key' only needs
>> +     * virtual address, we dont even have to find the underlying vma.
>> +     * Note : We do have to check 'address' is a valid user address,
>> +     *        but access_ok() should be faster than find_vma()
>> +     * Note : At this point, address points to the start of page,
>> +     *        not the real futex address, this is ok.
>> +     */
>> +    if (!shared) {
>> +        if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
>> +            return -EFAULT;
> 
> Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
> should depend on the size of the operation. Maybe the access_ok check
> should go outside get_futex_key?

If you check again, you'll see that address points to the start of the PAGE, 
not the real u32/u64 futex address. This checks the PAGE. We can use char, 
short, int, long, or char[PAGE_SIZE] as long as we know a futex cannot span 
two pages.
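
To illustrate the arithmetic (a userland sketch, just restating the argument):

	#include <assert.h>

	#define PAGE_SIZE 4096UL

	/* A naturally aligned futex cannot span two pages: the offset is a
	 * multiple of the futex size, and PAGE_SIZE is too, so
	 * offset + size <= PAGE_SIZE whenever the alignment check passes. */
	static void check(unsigned long addr, unsigned long size)
	{
		unsigned long offset = addr % PAGE_SIZE;

		assert(offset % size == 0);		/* natural alignment */
		assert(offset + size <= PAGE_SIZE);	/* stays in one page */
	}

	int main(void)
	{
		check(0x1000ffcUL, 4);	/* last u32 slot in a page */
		check(0x1000ff8UL, 8);	/* last u64 slot in a page */
		return 0;
	}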


>>       */
>>      key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
>> -    key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
>> +    key->both.offset += FUT_OFF_INODE; /* inode-based key. */
>>      if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
>>          key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
>>                       + vma->vm_pgoff);
> 
> I like |= for adding flags, it seems less ambiguous. But I guess that's
> a matter of opinion. Hugh seems to like +=, and I can't argue with him
> about style issues ;)


Previous code was doing offset++, which means offset += 1;
I didn't want to hurt Hugh :)

>>  EXPORT_SYMBOL_GPL(drop_futex_key_refs);
> 
> I wonder if it would be worthwhile inlining and likely()ing the
> private fastpath? Might make it pretty compact... I guess that's
> something to worry about after glibc gets support.

Yes, in a future patch, in about one year :)

>> +
>> +    if (!(vma = find_vma(mm, address)) ||
>> +        vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
>> +        ret = -EFAULT;
>> +
>> +    else
>> +        switch (handle_mm_fault(mm, vma, address, 1)) {
>> +        case VM_FAULT_MINOR:
>> +            current->min_flt++;
>> +            break;
>> +        case VM_FAULT_MAJOR:
>> +            current->maj_flt++;
>> +            break;
>> +        default:
>> +            ret = -EFAULT;
>> +        }
>> +    if (!shared)
>> +        up_read(&mm->mmap_sem);
>> +    return ret;
>>  }
>>  
>>  /*
> 
> You've got an extra space after the if (maybe for clarity?). In this
> situation I prefer putting braces around both the if and the else, and
> if you get rid of that blank line, it doesn't cost you anything more ;)

Oh well...

> 
>> @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
>>          restart->arg1 = val;
>>          restart->arg2 = (unsigned long)abs_time;
>>          restart->arg3 = (unsigned long)futex64;
>> +        if (shared)
>> +            restart->arg3 |= 2;
> 
> Could you make this into a proper flags argument and use #define 
> CONSTANTs for it?

Yes, but I'm not sure it will improve readability.

> 
>> @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
>>      struct timespec ts;
>>      ktime_t t, *tp = NULL;
>>      u64 val2 = 0;
>> +    int opm = op & FUTEX_CMD_MASK;
> 
> What's opm stand for?

I guess 'm' stands for 'mask' or 'masked'?

Thank you

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-06  1:19                                 ` Nick Piggin
  2007-04-06  5:53                                   ` Eric Dumazet
@ 2007-04-06  6:05                                   ` Hugh Dickins
  2007-04-06 17:41                                     ` Jan Engelhardt
  1 sibling, 1 reply; 78+ messages in thread
From: Hugh Dickins @ 2007-04-06  6:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Fri, 6 Apr 2007, Nick Piggin wrote:
> 
> I like |= for adding flags, it seems less ambiguous. But I guess that's
> a matter of opinion. Hugh seems to like +=,

Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
that's a hybrid case where using + and - helped minimize the casting);
but in general I'd agree with you that it's |= for setting flag bits.

Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

> and I can't argue with him about style issues ;)

I feel a warm glow.  Nobody ever called me a style guru before.

Hugh

p.s.  Please don't interpret this as any useful contribution to
reviewing Eric's futex work: seems sensible, but I've hardly looked.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-06  5:53                                   ` Eric Dumazet
@ 2007-04-06 11:50                                     ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-06 11:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> Nick Piggin wrote:

>> Did the whole thing just turn out neater when you passed the rwsem?
>> We always know to use current->mm->mmap_sem, so it doesn't seem like
>> a boolean flag would hurt?
> 
> 
> That's a good question
> 
> current->mm->mmap_sem being calculated once is a win in itself, because 
> current access is not cheap.
> It also does the memory access to go through part of the chain in 
> advance, before its use. It does a prefetch() equivalent for free : If 
> current->mm is not in CPU cache, CPU wont stall because next 
> instructions dont depend on it.

Fair enough. Current access I think should be cheap though (it is
effectively a constant), but I guess it is still an improvement.

>> Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it
>> should depend on the size of the operation. Maybe the access_ok check
>> should go outside get_futex_key?
> 
> 
> If you check again, you'll see that address points to the start of the 
> PAGE, not the real u32/u64 futex address. This checks the PAGE. We can 
> use char, short, int, long, or char[PAGE_SIZE] as long as we know a 
> futex cannot span two pages.

Ah, that works.

>>>       */
>>>      key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
>>> -    key->both.offset++; /* Bit 0 of offset indicates inode-based 
>>> key. */
>>> +    key->both.offset += FUT_OFF_INODE; /* inode-based key. */
>>>      if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
>>>          key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
>>>                       + vma->vm_pgoff);
>>
>>
>> I like |= for adding flags, it seems less ambiguous. But I guess that's
>> a matter of opinion. Hugh seems to like +=, and I can't argue with him
>> about style issues ;)
> 
> 
> 
> Previous code was doing offset++, which means offset += 1;

But it doesn't mean you have to ;)

>>> @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us
>>>          restart->arg1 = val;
>>>          restart->arg2 = (unsigned long)abs_time;
>>>          restart->arg3 = (unsigned long)futex64;
>>> +        if (shared)
>>> +            restart->arg3 |= 2;
>>
>>
>> Could you make this into a proper flags argument and use #define 
>> CONSTANTs for it?
> 
> 
> Yes, but I'm not sure it will improve readability.

Well that bit of code alone is obviously unreadable.

restart->arg3 = 0;
if (futex64)
     restart->arg3 |= FUTEX_64;
if (shared)
     restart->arg3 |= FUTEX_SHARED;

Maybe a matter of taste.

> 
>>
>>> @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u
>>>      struct timespec ts;
>>>      ktime_t t, *tp = NULL;
>>>      u64 val2 = 0;
>>> +    int opm = op & FUTEX_CMD_MASK;
>>
>>
>> What's opm stand for?
> 
> 
> I guess 'm' stands for 'mask' or 'masked'?

Why not call it cmd? (ie. what it is, rather than what you have done
to derive it).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
  2007-04-05 20:43                                 ` Ulrich Drepper
  2007-04-06  1:19                                 ` Nick Piggin
@ 2007-04-06 12:26                                 ` Peter Zijlstra
  2007-04-06 13:02                                   ` Hugh Dickins
  2007-04-06 12:31                                 ` [PATCH] FUTEX : new PRIVATE futexes Peter Zijlstra
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
  4 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-04-06 12:26 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, hugh, Pierre.Peiffer


Hi,

some thoughts on shared futexes;

Could we get rid of the mmap_sem on the shared futexes in the following
manner:

 - do a page table walk to find the pte;
 - get a page using pfn_to_page (skipping VM_PFNMAP)
 - get the futex key from page->mapping->host and page->index
   and offset from addr % PAGE_SIZE.

or given a key:

 - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
   possibly loading the page using mapping->a_ops->readpage().

then:

 - perform the futex operation on a kmap of the page
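
In rough, untested pseudo-kernel-C (the function name is mine, and it
glosses over locking entirely), the lookup side would be something like:

	static int futex_key_from_pte(struct mm_struct *mm, unsigned long addr,
				      union futex_key *key)
	{
		pgd_t *pgd;
		pud_t *pud;
		pmd_t *pmd;
		pte_t *pte;
		struct page *page;

		pgd = pgd_offset(mm, addr);
		if (pgd_none(*pgd))
			return -EFAULT;
		pud = pud_offset(pgd, addr);
		if (pud_none(*pud))
			return -EFAULT;
		pmd = pmd_offset(pud, addr);
		if (pmd_none(*pmd))
			return -EFAULT;

		pte = pte_offset_map(pmd, addr);
		if (!pte_present(*pte)) {
			pte_unmap(pte);
			return -EFAULT;
		}
		page = pfn_to_page(pte_pfn(*pte));	/* bogus for VM_PFNMAP */
		pte_unmap(pte);

		key->shared.inode = page->mapping->host;
		key->shared.pgoff = page->index;
		key->both.offset = addr % PAGE_SIZE;
		return 0;
	}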


This should all work except for VM_PFNMAP.

Since the address is passed from userspace we cannot trust it to not
point into a VM_PFNMAP area.

However, with the RCU VMA lookup patches I'm working on we could do that
check without holding locks and without exclusive cachelines; the
question is, is that good enough?

Or is there an alternative way of determining a pfnmap given a
pfn/struct page?


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
                                                   ` (2 preceding siblings ...)
  2007-04-06 12:26                                 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra
@ 2007-04-06 12:31                                 ` Peter Zijlstra
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
  4 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-04-06 12:31 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote:
> Hi all
> 
> I'm pleased to present this patch which improves linux futexes performance and 
> scalability, merely avoiding taking mmap_sem rwlock.
> 
> Ulrich agreed with the API and said glibc work could start as soon
> as he gets a Fedora kernel with it :)
> 
> Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4
> (so supports 64bit futexes)
> 
> In this third version I dropped the NUMA optims and process private hash table,
> to let new API come in and be tested.

Good work, Thanks!

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-06 12:26                                 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra
@ 2007-04-06 13:02                                   ` Hugh Dickins
  2007-04-06 13:15                                     ` Peter Zijlstra
  2007-04-06 13:15                                     ` Nick Piggin
  0 siblings, 2 replies; 78+ messages in thread
From: Hugh Dickins @ 2007-04-06 13:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones,
	Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Pierre.Peiffer

On Fri, 6 Apr 2007, Peter Zijlstra wrote:
> 
> some thoughts on shared futexes;
> 
> Could we get rid of the mmap_sem on the shared futexes in the following
> manner:
> 
>  - do a page table walk to find the pte;

("walk" meaning descent down the levels, I presume, rather than across)

I've not had time to digest your proposal, and I'm about to go out:
let me sound a warning that springs to mind; maybe it's entirely
inappropriate, but better said than kept silent.

It looks as if you're supposing that mmap_sem is needed to find_vma,
but not for going down the pagetables.  It's not as simple as that:
you need to be careful that a concurrent munmap from another thread
isn't freeing pagetables from under you.

Holding (down_read) of mmap_sem is one way to protect against that.
try_to_unmap doesn't have that luxury: in its case, it's made safe
by the way free_pgtables does anon_vma_unlink and unlink_file_vma
before freeing any pagetables, so try_to_unmap etc. won't get there;
but you can't do that.

Hugh

>  - get a page using pfn_to_page (skipping VM_PFNMAP)
>  - get the futex key from page->mapping->host and page->index
>    and offset from addr % PAGE_SIZE.
> 
> or given a key:
> 
>  - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
>    possibly loading the page using mapping->a_ops->readpage().
> 
> then:
> 
>  - perform the futex operation on a kmap of the page
> 
> 
> This should all work except for VM_PFNMAP.
> 
> Since the address is passed from userspace we cannot trust it to not
> point into a VM_PFNMAP area.
> 
> However, with the RCU VMA lookup patches I'm working on we could do that
> check without holding locks and without exclusive cachelines; the
> question is, is that good enough?
> 
> Or is there an alternative way of determining a pfnmap given a
> pfn/struct page?
> 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-06 13:02                                   ` Hugh Dickins
@ 2007-04-06 13:15                                     ` Peter Zijlstra
  2007-04-06 13:15                                     ` Nick Piggin
  1 sibling, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-04-06 13:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones,
	Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Pierre.Peiffer

On Fri, 2007-04-06 at 14:02 +0100, Hugh Dickins wrote:
> On Fri, 6 Apr 2007, Peter Zijlstra wrote:
> > 
> > some thoughts on shared futexes;
> > 
> > Could we get rid of the mmap_sem on the shared futexes in the following
> > manner:
> > 
> >  - do a page table walk to find the pte;
> 
> ("walk" meaning descent down the levels, I presume, rather than across)

indeed.

> I've not had time to digest your proposal, and I'm about to go out:
> let me sound a warning that springs to mind; maybe it's entirely
> inappropriate, but better said than kept silent.
> 
> It looks as if you're supposing that mmap_sem is needed to find_vma,
> but not for going down the pagetables.  It's not as simple as that:
> you need to be careful that a concurrent munmap from another thread
> isn't freeing pagetables from under you.
> 
> Holding (down_read) of mmap_sem is one way to protect against that.
> try_to_unmap doesn't have that luxury: in its case, it's made safe
> by the way free_pgtables does anon_vma_unlink and unlink_file_vma
> before freeing any pagetables, so try_to_unmap etc. won't get there;
> but you can't do that.

Ah, drat.

I had hoped that the pte_lock would solve that race, but that doesn't
cover the upper levels of the tree.

Back to the drawing board..


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-06 13:02                                   ` Hugh Dickins
  2007-04-06 13:15                                     ` Peter Zijlstra
@ 2007-04-06 13:15                                     ` Nick Piggin
  2007-04-06 13:22                                       ` Peter Zijlstra
  1 sibling, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2007-04-06 13:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Peter Zijlstra, Eric Dumazet, Ulrich Drepper, Andrew Morton,
	Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Pierre.Peiffer

Hugh Dickins wrote:
> On Fri, 6 Apr 2007, Peter Zijlstra wrote:
> 
>>some thoughts on shared futexes;
>>
>>Could we get rid of the mmap_sem on the shared futexes in the following
>>manner:

I'd imagine shared futexes would be much less common than private for
threaded programs... I'd say we should reevaluate things once we have
private futexes, and malloc/free stop hammering mmap_sem so hard...

>> - get a page using pfn_to_page (skipping VM_PFNMAP)
>> - get the futex key from page->mapping->host and page->index
>>   and offset from addr % PAGE_SIZE.
>>
>>or given a key:
>>
>> - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
>>   possibly loading the page using mapping->a_ops->readpage().

For shared futexes, wouldn't i_mapping be worse, because you'd be
ping-ponging the tree_lock between processes, rather than have each
use their own mmap_sem?

That also only helps for the wakeup case too, doesn't it? You have
to use the vmas to find out which inode to use to do the wait, I think?
(unless you introduce a new shared futex API).

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-06 13:15                                     ` Nick Piggin
@ 2007-04-06 13:22                                       ` Peter Zijlstra
  2007-04-06 13:40                                         ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-04-06 13:22 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Hugh Dickins, Eric Dumazet, Ulrich Drepper, Andrew Morton,
	Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Pierre.Peiffer

On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote:
> Hugh Dickins wrote:
> > On Fri, 6 Apr 2007, Peter Zijlstra wrote:
> > 
> >>some thoughts on shared futexes;
> >>
> >>Could we get rid of the mmap_sem on the shared futexes in the following
> >>manner:
> 
> I'd imagine shared futexes would be much less common than private for
> threaded programs... I'd say we should reevaluate things once we have
> private futexes, and malloc/free stop hammering mmap_sem so hard...

Indeed, private futexes are by far the most common.

> >> - get a page using pfn_to_page (skipping VM_PFNMAP)
> >> - get the futex key from page->mapping->host and page->index
> >>   and offset from addr % PAGE_SIZE.
> >>
> >>or given a key:
> >>
> >> - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
> >>   possibly loading the page using mapping->a_ops->readpage().
> 
> For shared futexes, wouldn't i_mapping be worse, because you'd be
> ping-ponging the tree_lock between processes, rather than have each
> use their own mmap_sem?

Your lockless pagecache work would solve most of that, no?

> That also only helps for the wakeup case too, doesn't it? You have
> to use the vmas to find out which inode to use to do the wait, I think?
> (unless you introduce a new shared futex API).

one could do something like this:

 struct address_space *mapping = page_mapping(page);
 if (!mapping || mapping == &swapper_space)
   do_private_futex();
 else
   do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);


But alas, it seems I overlooked that the mmap_sem also protects the page
tables, as pointed out by Hugh, so this is all in vain.

Ah well.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes)
  2007-04-06 13:22                                       ` Peter Zijlstra
@ 2007-04-06 13:40                                         ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-06 13:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Hugh Dickins, Eric Dumazet, Ulrich Drepper, Andrew Morton,
	Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Pierre.Peiffer

Peter Zijlstra wrote:
> On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote:

>>>>or given a key:
>>>>
>>>>- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff
>>>>  possibly loading the page using mapping->a_ops->readpage().
>>
>>For shared futexes, wouldn't i_mapping be worse, because you'd be
>>ping-ponging the tree_lock between processes, rather than have each
>>use their own mmap_sem?
> 
> 
> Your lockless pagecache work would solve most of that, no?

Yeah it would.

>>That also only helps for the wakeup case too, doesn't it? You have
>>to use the vmas to find out which inode to use to do the wait, I think?
>>(unless you introduce a new shared futex API).
> 
> 
> one could do something like this:
> 
>  struct address_space *mapping = page_mapping(page);
>  if (!mapping || mapping == &swapper_space)
>    do_private_futex();
>  else
>    do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);
> 
> 
> But alas, it seems I overlooked that the mmap_sem also protects the page
> tables as pointed out by Hugh, so this is all in fain it seems.
> 
> A well.

Well if it turned out to be a real problem, we _could_ introduce a new
futex API that takes a file descriptor+offset.

The old futex API is basically just a shorthand for choosing between
shared and private, depending on what the vma at the given address
happens to be.
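
(Purely hypothetical, but something like:

	/* wait on a shared futex named by file + offset; no vma lookup
	 * needed, the key is (f->f_path.dentry->d_inode, pgoff) directly */
	asmlinkage long sys_futex_fd_wait(int fd, loff_t offset, u32 val,
					  struct timespec __user *utime);

would make the shared case explicit.)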

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH] FUTEX : new PRIVATE futexes
  2007-04-06  6:05                                   ` Hugh Dickins
@ 2007-04-06 17:41                                     ` Jan Engelhardt
  0 siblings, 0 replies; 78+ messages in thread
From: Jan Engelhardt @ 2007-04-06 17:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Eric Dumazet, Ulrich Drepper, Andrew Morton,
	Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel


On Apr 6 2007 07:05, Hugh Dickins wrote:
>On Fri, 6 Apr 2007, Nick Piggin wrote:
>> 
>> I like |= for adding flags, it seems less ambiguous. But I guess that's
>> a matter of opinion. Hugh seems to like +=,
>
>Do I?  You probably have a shaming example in mind (PAGE_MAPPING_ANON?
>that's a hybrid case where using + and - helped minimize the casting);
>but in general I'd agree with you that it's |= for setting flag bits.
>
>Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the +=

If a bit is already set, you can't set it again using +=.
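
For instance, with FUT_OFF_INODE == 1:

	int offset = 0;

	offset += FUT_OFF_INODE;	/* bit 0 set */
	offset += FUT_OFF_INODE;	/* oops: now 2 == FUT_OFF_MMSHARED */

	offset = 0;
	offset |= FUT_OFF_INODE;	/* bit 0 set */
	offset |= FUT_OFF_INODE;	/* still just bit 0: |= is idempotent */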


Jan
-- 

^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
                                                   ` (3 preceding siblings ...)
  2007-04-06 12:31                                 ` [PATCH] FUTEX : new PRIVATE futexes Peter Zijlstra
@ 2007-04-07  8:43                                 ` Eric Dumazet
  2007-04-07  9:30                                   ` Nick Piggin
                                                     ` (3 more replies)
  4 siblings, 4 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-07  8:43 UTC (permalink / raw)
  To: Andrew Morton, Dave Jones
  Cc: Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Hi all

Updates in this take4:

- All remarks from Nick were addressed, I hope

- Current mm code has a problem with 64bit futexes, as spotted by Nick:

get_futex_key() does a check against sizeof(u32) regardless of whether the futex
is 64 bits or not.  So it is possible that a 64bit futex spans two pages of memory...
I had to change the get_futex_key() prototype to be able to do a correct test.

History:

I'm pleased to present this patch, which improves Linux futex performance and 
scalability, merely by avoiding taking the mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well?  This version is against 2.6.21-rc5-mm4
(so it supports 64bit futexes).

In this third version I dropped the NUMA optimizations and the process private
hash table, to let the new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current Linux futex code:
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to 
perform lookups or insertion/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to:

1) Obtain a read lock on mmap_sem to be able to validate the user pointer
   (calling find_vma()).  This validation tells us if the futex uses
   an inode based store (mapped file) or an mm based store (anonymous memory).

2) Compute a hash key.

3) Atomically increment a reference counter on an inode or an mm_struct.

4) Lock part of the futex_queues[] hash table.

5) Perform the test on the value of the futex
   (rollback if value != expected_value, returns EWOULDBLOCK)
   (various loops if the test triggers mm faults)

6) Queue the context into the hash table, release the lock taken in 4).

7) Release the read lock on mmap_sem.

   <block>

8) Eventually unqueue the context (but rarely, as this part
   may be done by futex_wake()).

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable:
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 It also means NUMA imbalance, because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall:

 Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
 ops on mmap_sem, dirtying its cache line:
        - lots of cache line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
programs because of contention on the mmap_sem cache line.

- Using atomic_inc()/atomic_dec() on an inode ref counter or mm ref counter:
 It's also a cache line ping-pong on SMP.  It also increases mmap_sem hold time
 because of cache misses.

Most of these scalability problems come from the fact that futexes live in
one global namespace.  As we use a central hash table, we must make sure
they all use the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'.  This has a cost.

But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
separation, and optimal performance if carefully implemented.  The time has
come for Linux to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and
be prepared to fall back on the old subcommands for old kernels.  Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved; they still hit the
scalability problems, but new applications can fly :)

Note: the same SHARED futex (mapped on a file) can be used by old binaries 
*and* new binaries, because both will use the old subcommands.

Note: The vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default.  Almost all applications should benefit
from this change (new kernel and updated libc).
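
For illustration, the fallback scheme could look like this in userland
(a sketch only; the helper name and the ENOSYS handling are mine, not part
of this patch):

	#include <errno.h>
	#include <linux/futex.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef FUTEX_PRIVATE_FLAG
	#define FUTEX_PRIVATE_FLAG 128	/* matches the patch below */
	#endif

	/* Start optimistic; the first ENOSYS clears the flag for good. */
	static int futex_private_flag = FUTEX_PRIVATE_FLAG;

	static long futex_wait_auto(int *uaddr, int val)
	{
		long ret = syscall(SYS_futex, uaddr,
				   FUTEX_WAIT | futex_private_flag, val, NULL);

		if (ret < 0 && errno == ENOSYS && futex_private_flag) {
			futex_private_flag = 0;	/* old kernel: fall back */
			ret = syscall(SYS_futex, uaddr, FUTEX_WAIT, val, NULL);
		}
		return ret;
	}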

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   46 +++++
 kernel/futex.c        |  340 ++++++++++++++++++++++++++--------------
 2 files changed, 268 insertions(+), 118 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_CMP_REQUEUE_PI	9
 
+#define FUTEX_PRIVATE_FLAG	128
+#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *	mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
  */
+
+#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	unsigned long __user *uaddr;
 	struct {
@@ -135,8 +157,28 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(void __user *uaddr, union futex_key *key);
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @size: size of futex (4 or 8)
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared,
+		  union futex_key *key);
+
+/**
+ * get_futex_key_refs - Take a reference to the resource addressed by a key
+ */
 void get_futex_key_refs(union futex_key *key);
+
+/**
+ * drop_futex_key_refs - Drop a reference to the resource addressed by a key.
+ */
 void drop_futex_key_refs(union futex_key *key);
 
 #ifdef CONFIG_FUTEX
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -199,11 +202,14 @@ static inline int match_futex(union fute
  * Returns: 0, or negative error code.
  * The key words are stored in *key on success.
  *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock, but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *fuaddr, int fsize, struct rw_semaphore *fshared,
+	union futex_key *key)
 {
-	unsigned long address = (unsigned long)uaddr;
+	unsigned long address = (unsigned long)fuaddr;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct page *page;
@@ -213,11 +219,29 @@ int get_futex_key(void __user *uaddr, un
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((key->both.offset % sizeof(u32)) != 0))
+	if (unlikely((key->both.offset & (fsize - 1)) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'address' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 */
+	if (!fshared) {
+		/*
+		 * Note : At this point, address points to the start of page,
+		 *        not the real futex address, this is OK.
+		 */
+		if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -232,7 +256,7 @@ int get_futex_key(void __user *uaddr, un
 		return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
 
 	/* Save the user address in the key */
-	key->uaddr = uaddr;
+	key->uaddr = fuaddr;
 
 	/*
 	 * Private mappings are handled in a simple way.
@@ -244,6 +268,7 @@ int get_futex_key(void __user *uaddr, un
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -253,7 +278,7 @@ int get_futex_key(void __user *uaddr, un
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -281,17 +306,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
-	}
+			break;
+  	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
@@ -301,11 +328,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -339,28 +370,38 @@ get_futex_value_locked(unsigned long *de
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+			      struct rw_semaphore *fshared, int attempt)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = -EFAULT;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
-		return -EFAULT;
-
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
+	if (attempt > 2)
+		return ret;
+	if (!fshared)
+		down_read(&mm->mmap_sem);
+	vma = find_vma(mm, address);
+	if (vma &&
+	    address >= vma->vm_start &&
+	    (vma->vm_flags & VM_WRITE)) {
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			ret = 0;
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			ret = 0;
+			current->maj_flt++;
+			break;
+		}
 	}
-	return 0;
+	if (!fshared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -702,20 +743,22 @@ double_lock_hb(struct futex_hash_bucket 
 }
 
 /*
- * Wake up all waiters hashed on the physical page that is mapped
- * to this virtual address:
+ * Wake up all waiters on a futex (fuaddr, futex64, fshared)
  */
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *fuaddr, int futex64,
+		      struct rw_semaphore *fshared, int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	struct plist_head *head;
 	union futex_key key;
 	int ret;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(fuaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -807,7 +851,8 @@ retry:
  */
 static int
 futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+		 struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -817,6 +862,7 @@ futex_requeue_pi(unsigned long __user *u
 	struct rt_mutex_waiter *waiter, *top_waiter = NULL;
 	struct rt_mutex *lock2 = NULL;
 	int ret, drop_count = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -825,12 +871,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -998,21 +1045,24 @@ out:
  */
 static int
 futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op, int futex64)
+	      int nr_wake, int nr_wake2, int op, int futex64,
+		struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1055,15 +1105,14 @@ retry:
 		 * futex_atomic_op_inuser needs to both read and write
 		 * *(int __user *)uaddr2, but we can't modify it
 		 * non-atomically.  Therefore, if get_user below is not
-		 * enough, we need to handle the fault ourselves, while
-		 * still holding the mmap_sem.
+		 * enough, we need to handle the fault ourselves. Make 
+		 * sure we hold mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						fshared, attempt);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -1071,7 +1120,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 
 		ret = futex_get_user(&dummy, uaddr2, futex64);
 		if (ret)
@@ -1108,7 +1158,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1118,21 +1169,23 @@ out:
  */
 static int
 futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+		struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head1;
 	struct futex_q *this, *next;
 	int ret, drop_count = 0;
-
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1155,7 +1208,8 @@ futex_requeue(unsigned long __user *uadd
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = futex_get_user(&curval, uaddr1, futex64);
 
@@ -1208,7 +1262,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1390,9 +1445,17 @@ static int fixup_pi_state_owner(unsigned
 	return ret;
 }
 
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' both futex64 and shared capabilities
+ */
+#define ARG3_FUTEX64 1
+#define ARG3_SHARED  2
+
 static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(unsigned long __user *uaddr, unsigned long val,
-		      ktime_t *abs_time, int futex64)
+static int futex_wait(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared,
+		      unsigned long val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1402,12 +1465,14 @@ static int futex_wait(unsigned long __us
 	int ret;
 	struct hrtimer_sleeper t, *to = NULL;
 	int rem = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1430,8 +1495,8 @@ static int futex_wait(unsigned long __us
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr, futex64);
 
@@ -1442,7 +1507,8 @@ static int futex_wait(unsigned long __us
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 		ret = futex_get_user(&uval, uaddr, futex64);
 
 		if (!ret)
@@ -1468,7 +1534,8 @@ static int futex_wait(unsigned long __us
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1568,7 +1635,8 @@ static int futex_wait(unsigned long __us
 			}
 			/* Unqueue and drop the lock */
 			unqueue_me_pi(&q);
-			up_read(&curr->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 		}
 
 		debug_rt_mutex_free_waiter(&q.waiter);
@@ -1597,7 +1665,13 @@ static int futex_wait(unsigned long __us
 		restart->arg0 = (unsigned long)uaddr;
 		restart->arg1 = val;
 		restart->arg2 = (unsigned long)abs_time;
-		restart->arg3 = (unsigned long)futex64;
+		restart->arg3 = 0;
+#ifdef CONFIG_64BIT
+		if (futex64)
+			restart->arg3 |= ARG3_FUTEX64;
+#endif
+		if (fshared)
+			restart->arg3 |= ARG3_SHARED;
 		return -ERESTART_RESTARTBLOCK;
 	}
 
@@ -1605,7 +1679,8 @@ static int futex_wait(unsigned long __us
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1615,10 +1690,17 @@ static long futex_wait_restart(struct re
 	unsigned long __user *uaddr = (unsigned long __user *)restart->arg0;
 	unsigned long val = restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
-	int futex64 = (int)restart->arg3;
+	int futex64 = 0;
+	struct rw_semaphore *fshared = NULL;
 
+#ifdef CONFIG_64BIT
+	if (restart->arg3 & ARG3_FUTEX64)
+		futex64 = 1;
+#endif
+	if (restart->arg3 & ARG3_SHARED)
+		fshared = &current->mm->mmap_sem;
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, val, abs_time, futex64);
+	return (long)futex_wait(uaddr, futex64, fshared, val, abs_time);
 }
 
 
@@ -1674,7 +1756,7 @@ static void set_pi_futex_owner(struct fu
  * races the kernel might see a 0 value of the futex too.)
  */
 static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time,
-			 int trylock, int futex64)
+			 int trylock, int futex64, struct rw_semaphore *fshared)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1682,6 +1764,7 @@ static int futex_lock_pi(unsigned long _
 	unsigned long uval, newval, curval;
 	struct futex_q q;
 	int ret, lock_held, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -1695,9 +1778,10 @@ static int futex_lock_pi(unsigned long _
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1818,7 +1902,8 @@ static int futex_lock_pi(unsigned long _
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	WARN_ON(!q.pi_state);
 	/*
@@ -1832,7 +1917,8 @@ static int futex_lock_pi(unsigned long _
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 	spin_lock(q.lock_ptr);
 
 	/*
@@ -1854,7 +1940,8 @@ static int futex_lock_pi(unsigned long _
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q);
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 	}
 
 	if (!detect && ret == -EDEADLK && 0)
@@ -1866,7 +1953,8 @@ static int futex_lock_pi(unsigned long _
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 
  uaddr_faulted:
@@ -1877,15 +1965,16 @@ static int futex_lock_pi(unsigned long _
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}
 
 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -1899,7 +1988,8 @@ static int futex_lock_pi(unsigned long _
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+	struct rw_semaphore *fshared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1907,6 +1997,7 @@ static int futex_unlock_pi(unsigned long
 	struct plist_head *head;
 	union futex_key key;
 	int ret, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retry:
 	if (futex_get_user(&uval, uaddr, futex64))
@@ -1919,9 +2010,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1980,7 +2072,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	return ret;
 
@@ -1992,15 +2085,16 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}
 
 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -2052,6 +2146,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *fshared;
 	static unsigned long printk_interval;
 
 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2093,11 +2188,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;
 
-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	fshared = &current->mm->mmap_sem;
+	down_read(fshared);
+	err = get_futex_key(uaddr, sizeof(u32), fshared, &q->key);
 
 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(fshared);
 		kfree(q);
 		goto error;
 	}
@@ -2109,7 +2205,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(fshared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -2238,7 +2334,8 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake((unsigned long __user *)uaddr, 1);
+				futex_wake((unsigned long __user *)uaddr,
+					   0, &curr->mm->mmap_sem, 1);
 		}
 	}
 	return 0;
@@ -2326,13 +2423,18 @@ long do_futex(unsigned long __user *uadd
 	      unsigned long val2, unsigned long val3, int fut64)
 {
 	int ret;
+	int cmd = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *fshared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		fshared = &current->mm->mmap_sem;
 
-	switch (op) {
+	switch (cmd) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout, fut64);
+		ret = futex_wait(uaddr, fut64, fshared, val, timeout);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, fut64, fshared, val);
 		break;
 	case FUTEX_FD:
 		if (fut64)
@@ -2342,25 +2444,29 @@ long do_futex(unsigned long __user *uadd
 			ret = futex_fd((u32 __user *)uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64,
+				    fshared);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64,
+				    fshared);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64,
+				    fshared);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64);
+		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64, fshared);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr, fut64);
+		ret = futex_unlock_pi(uaddr, fut64, fshared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64, fshared);
 		break;
 	case FUTEX_CMP_REQUEUE_PI:
-		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64,
+				       fshared);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -2377,23 +2483,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u64 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2409,23 +2516,24 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u32 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
 	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (u32) (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
@ 2007-04-07  9:30                                   ` Nick Piggin
  2007-04-07 10:00                                     ` Eric Dumazet
  2007-04-07 11:18                                   ` Jakub Jelinek
                                                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2007-04-07  9:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> Hi all
> 
> Updates on this take4 :
> 
> - All remarks from Nick were addressed I hope

Yeah looks very nice. Thanks for doing that.

> 
> - Current mm code has a problem with 64bit futexes, as spotted by Nick:
> 
> get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...
> I had to change get_futex_key() prototype to be able to do a correct test.

I wonder if it should be enforcing alignment to keep it on 1 page?

Otherwise,
Acked-by: Nick Piggin <npiggin@suse.de>

I'll be away for a couple of days, but I'll look at running some performance
tests when I get back.

Thanks,
Nick

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07  9:30                                   ` Nick Piggin
@ 2007-04-07 10:00                                     ` Eric Dumazet
  2007-04-11  7:22                                       ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-07 10:00 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Sat, 07 Apr 2007 19:30:14 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Eric Dumazet wrote:
 
> > 
> > - Current mm code has a problem with 64bit futexes, as spotted by Nick:
> > 
> > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> > So it is possible a 64bit futex spans two pages of memory...
> > I had to change get_futex_key() prototype to be able to do a correct test.
> 
> I wonder if it should be enforcing alignment to keep it on 1 page?

I believe I just did that :)

Before the patch :

Alignment was only 4 bytes for all futexes, but some user app could trigger a kernel bug (since one 64bit futex could sit on two different pages, hence possibly on separate vmas, so the inode refcounting was wrong, and access_ok() did not perform a correct check).

After the patch :

Alignment is 8 bytes for 64 bit futexes, 4 bytes for 32bit futexes.
All futexes are constrained to lie within a single page.
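
To make that concrete, here is a small userspace sketch of the
natural-alignment test (my illustration, not part of the patch; PAGE_SIZE is
assumed to be 4096):

	#include <stdint.h>

	/* An address passing this check cannot straddle a page boundary:
	 * size is 4 or 8, so a page offset that is a multiple of size
	 * leaves at least 'size' bytes before the end of the page. */
	static int futex_naturally_aligned(const void *uaddr, int size)
	{
		uintptr_t offset = (uintptr_t)uaddr % 4096;	/* PAGE_SIZE */

		return (offset & (size - 1)) == 0;
	}

For example, an 8-byte futex at page offset 0xffc fails the check
(0xffc & 7 != 0), which is precisely the two-page case described above.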


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
  2007-04-07  9:30                                   ` Nick Piggin
@ 2007-04-07 11:18                                   ` Jakub Jelinek
  2007-04-07 11:54                                     ` Eric Dumazet
  2007-04-07 22:15                                   ` Andrew Morton
  2007-04-11  9:19                                   ` [PATCH, take5] " Eric Dumazet
  3 siblings, 1 reply; 78+ messages in thread
From: Jakub Jelinek @ 2007-04-07 11:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
> get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...

That would be a user bug.  32-bit futexes have to be 32-bit aligned, 64-bit
futexes have to be 64-bit aligned.

	Jakub

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07 11:18                                   ` Jakub Jelinek
@ 2007-04-07 11:54                                     ` Eric Dumazet
  2007-04-07 16:40                                       ` Ulrich Drepper
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-07 11:54 UTC (permalink / raw)
  To: Jakub Jelinek
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Jakub Jelinek a écrit :
> On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote:
>> get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
>> So it is possible a 64bit futex spans two pages of memory...
> 
> That would be a user bug.  32-bit futexes have to be 32-bit aligned, 64-bit
> futexes have to be 64-bit aligned.

I am not sure what you want to say.

A user doing sys_futex64(0x......FFC, FUTEX_WAKE_OP, ...) and crashing the kernel
or corrupting data is OK because it's a user bug?


Userspace is allowed to do anything; the kernel must check and protect the innocent.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07 11:54                                     ` Eric Dumazet
@ 2007-04-07 16:40                                       ` Ulrich Drepper
  0 siblings, 0 replies; 78+ messages in thread
From: Ulrich Drepper @ 2007-04-07 16:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jakub Jelinek, Andrew Morton, Dave Jones, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 4/7/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> I am not sure what you want to say.

What Jakub meant is that it is OK for the kernel to reject using
unaligned 64-bit futexes.  Just return an error in all cases (not just
in some).

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
  2007-04-07  9:30                                   ` Nick Piggin
  2007-04-07 11:18                                   ` Jakub Jelinek
@ 2007-04-07 22:15                                   ` Andrew Morton
  2007-04-10  9:21                                     ` Eric Dumazet
  2007-04-11  9:19                                   ` [PATCH, take5] " Eric Dumazet
  3 siblings, 1 reply; 78+ messages in thread
From: Andrew Morton @ 2007-04-07 22:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Sat, 7 Apr 2007 10:43:39 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:

> Hi all
> 
> Updates on this take4 :
> 
> - All remarks from Nick were addressed I hope
> 
> - Current mm code has a problem with 64bit futexes, as spotted by Nick:
> 
> get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> So it is possible a 64bit futex spans two pages of memory...
> I had to change get_futex_key() prototype to be able to do a correct test.
> 

Could we please have that in a separate patch?  It's logically a part of the
64-bit-futex work, is it not?

> +
> +/**
> + * get_futex_key - Get parameters which are the keys for a futex.
> + * @uaddr: virtual address of the futex
> + * @size: size of futex (4 or 8)
> + * @shared: NULL for a PROCESS_PRIVATE futex,
> + *	&current->mm->mmap_sem for a PROCESS_SHARED futex
> + * @key: address where result is stored.
> + *
> + * Returns an error code or 0
> + */
> +int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared,
> +		  union futex_key *key);

Thanks for documenting the interface, but please do it in the .c file at
the function's definition site.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07 22:15                                   ` Andrew Morton
@ 2007-04-10  9:21                                     ` Eric Dumazet
  0 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-10  9:21 UTC (permalink / raw)
  To: Andrew Morton, Pierre Peiffer
  Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Sat, 7 Apr 2007 15:15:56 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Sat, 7 Apr 2007 10:43:39 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote:
> > 
> > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> > So it is possible a 64bit futex spans two pages of memory...
> > I had to change get_futex_key() prototype to be able to do a correct test.
> > 
> 
> Cold we please have that in a separate patch?  It's logically a part of the
> 64-bit-futex work, is it not?

Yes, you probably want this patch to fix 64bit futex support.

[PATCH] get_futex_key() must check proper alignment for 64bit futexes

get_futex_key() does an alignment check against sizeof(u32) regardless of
whether the futex is 64-bit.

So it is possible for a 64bit futex to span two pages of memory, and some
malicious user code can trigger data corruption.

We must add a 'fsize' parameter to get_futex_key(), telling it the size of the
futex (4 or 8 bytes).

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |    2 -
 kernel/futex.c        |   78 ++++++++++++++++++++++++++--------------
 2 files changed, 52 insertions(+), 28 deletions(-)

--- linux-2.6.21-rc6-mm1/kernel/futex.c
+++ linux-2.6.21-rc6-mm1-ed/kernel/futex.c
@@ -189,19 +189,22 @@ static inline int match_futex(union fute
 		&& key1->both.offset == key2->both.offset);
 }
 
-/*
- * Get parameters which are the keys for a futex.
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @size: size of futex (4 or 8)
+ * @key: address where result is stored.
+ *
+ * Returns a negative error code or 0
+ * The key words are stored in *key on success.
  *
  * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode,
  * offset_within_page).  For private mappings, it's (uaddr, current->mm).
  * We can usually work out the index without swapping in the page.
  *
- * Returns: 0, or negative error code.
- * The key words are stored in *key on success.
- *
  * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *uaddr, int size, union futex_key *key)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -213,7 +216,7 @@ int get_futex_key(void __user *uaddr, un
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((key->both.offset % sizeof(u32)) != 0))
+	if (unlikely((key->both.offset & (size - 1)) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
@@ -705,17 +708,18 @@ double_lock_hb(struct futex_hash_bucket 
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int futex64, int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	struct plist_head *head;
 	union futex_key key;
 	int ret;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, fsize, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -817,6 +821,7 @@ futex_requeue_pi(unsigned long __user *u
 	struct rt_mutex_waiter *waiter, *top_waiter = NULL;
 	struct rt_mutex *lock2 = NULL;
 	int ret, drop_count = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -827,10 +832,10 @@ retry:
 	 */
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1005,14 +1010,15 @@ futex_wake_op(unsigned long __user *uadd
 	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retryfull:
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1125,14 +1131,15 @@ futex_requeue(unsigned long __user *uadd
 	struct plist_head *head1;
 	struct futex_q *this, *next;
 	int ret, drop_count = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
  retry:
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1390,9 +1397,15 @@ static int fixup_pi_state_owner(unsigned
 	return ret;
 }
 
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' futex64 capability
+ */
+#define ARG3_FUTEX64 1
+
 static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(unsigned long __user *uaddr, unsigned long val,
-		      ktime_t *abs_time, int futex64)
+static int futex_wait(unsigned long __user *uaddr, int futex64,
+		      unsigned long val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1402,12 +1415,13 @@ static int futex_wait(unsigned long __us
 	int ret;
 	struct hrtimer_sleeper t, *to = NULL;
 	int rem = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	q.pi_state = NULL;
  retry:
 	down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fsize, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1597,7 +1611,11 @@ static int futex_wait(unsigned long __us
 		restart->arg0 = (unsigned long)uaddr;
 		restart->arg1 = val;
 		restart->arg2 = (unsigned long)abs_time;
-		restart->arg3 = (unsigned long)futex64;
+		restart->arg3 = 0;
+#ifdef CONFIG_64BIT
+		if (futex64)
+			restart->arg3 |= ARG3_FUTEX64;
+#endif
 		return -ERESTART_RESTARTBLOCK;
 	}
 
@@ -1615,10 +1633,14 @@ static long futex_wait_restart(struct re
 	unsigned long __user *uaddr = (unsigned long __user *)restart->arg0;
 	unsigned long val = restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
-	int futex64 = (int)restart->arg3;
+	int futex64 = 0;
 
+#ifdef CONFIG_64BIT
+	if (restart->arg3 & ARG3_FUTEX64)
+		futex64 = 1;
+#endif
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, val, abs_time, futex64);
+	return (long)futex_wait(uaddr, futex64, val, abs_time);
 }
 
 
@@ -1682,6 +1704,7 @@ static int futex_lock_pi(unsigned long _
 	unsigned long uval, newval, curval;
 	struct futex_q q;
 	int ret, lock_held, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -1697,7 +1720,7 @@ static int futex_lock_pi(unsigned long _
  retry:
 	down_read(&curr->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fsize, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1907,6 +1930,7 @@ static int futex_unlock_pi(unsigned long
 	struct plist_head *head;
 	union futex_key key;
 	int ret, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retry:
 	if (futex_get_user(&uval, uaddr, futex64))
@@ -1921,7 +1945,7 @@ retry:
 	 */
 	down_read(&current->mm->mmap_sem);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, fsize, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -2094,7 +2118,7 @@ static int futex_fd(u32 __user *uaddr, i
 	q->pi_state = NULL;
 
 	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	err = get_futex_key(uaddr, sizeof(u32), &q->key);
 
 	if (unlikely(err != 0)) {
 		up_read(&current->mm->mmap_sem);
@@ -2238,7 +2262,7 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake((unsigned long __user *)uaddr, 1);
+				futex_wake((unsigned long __user *)uaddr, 0, 1);
 		}
 	}
 	return 0;
@@ -2329,10 +2353,10 @@ long do_futex(unsigned long __user *uadd
 
 	switch (op) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout, fut64);
+		ret = futex_wait(uaddr, fut64, val, timeout);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, fut64, val);
 		break;
 	case FUTEX_FD:
 		if (fut64)
--- linux-2.6.21-rc6-mm1/include/linux/futex.h
+++ linux-2.6.21-rc6-mm1-ed/include/linux/futex.h
@@ -135,7 +135,7 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(void __user *uaddr, union futex_key *key);
+int get_futex_key(void __user *uaddr, int size, union futex_key *key);
 void get_futex_key_refs(union futex_key *key);
 void drop_futex_key_refs(union futex_key *key);
 


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-07 10:00                                     ` Eric Dumazet
@ 2007-04-11  7:22                                       ` Nick Piggin
  2007-04-11  8:14                                         ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2007-04-11  7:22 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> On Sat, 07 Apr 2007 19:30:14 +1000
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Eric Dumazet wrote:
> 
>  
> 
>>>- Current mm code has a problem with 64bit futexes, as spotted by Nick:
>>>
>>>get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
>>>So it is possible a 64bit futex spans two pages of memory...
>>>I had to change get_futex_key() prototype to be able to do a correct test.
>>
>>I wonder if it should be enforcing alignment to keep it on 1 page?
> 
> 
> I believe I just did that :)

Yes :P What I was trying to say before jumping on a plane is that
sys_futex/sys_futex64 calls should each check their own address alignment, so
the deeper parts of the call stack always know alignment is correct.

This will remove all the fsize you pass around, and also sanitise the userspace
argument much higher in the call stack, which is very preferable and more
conventional.

Maybe this isn't possible (it's very obvious, so there may be a good reason it
hasn't been done).
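
A minimal sketch of that idea, as an assumption of what such an entry-point
check could look like (this is not from any posted patch):

	/* Each syscall entry would validate its own futex alignment, so
	 * deeper code no longer needs an fsize parameter: sys_futex would
	 * pass sizeof(u32) here, sys_futex64 sizeof(u64).  EINVAL mirrors
	 * what get_futex_key() returns for a misaligned address. */
	static inline int futex_check_alignment(const void __user *uaddr,
						size_t size)
	{
		if ((unsigned long)uaddr & (size - 1))
			return -EINVAL;
		return 0;
	}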

-- 
SUSE Labs, Novell Inc.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  7:22                                       ` Nick Piggin
@ 2007-04-11  8:14                                         ` Eric Dumazet
  2007-04-11  9:23                                           ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-11  8:14 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Wed, 11 Apr 2007 17:22:57 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:

> Eric Dumazet wrote:
> > On Sat, 07 Apr 2007 19:30:14 +1000
> > Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> > 
> > 
> >>Eric Dumazet wrote:
> > 
> >  
> > 
> >>>- Current mm code has a problem with 64bit futexes, as spotted by Nick:
> >>>
> >>>get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
> >>>So it is possible a 64bit futex spans two pages of memory...
> >>>I had to change get_futex_key() prototype to be able to do a correct test.
> >>
> >>I wonder if it should be enforcing alignment to keep it on 1 page?
> > 
> > 
> > I believe I just did that :)
> 
> Yes :P What I was trying to say before jumping on a plane is that
> sys_futex/sys_futex64 calls should each check their own address alignment, so
> the deeper parts of the call stack always know alignment is correct.
> 
> This will remove all the fsize you pass around, and also sanitise the userspace
> argument much higher in the call stack, which is very preferable and more
> conventional.
> 
> Maybe this isn't possible (it's very obvious, so there may be a good reason it
> hasn't been done).

I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of-tree thing is using it.

As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you: if we want to maximize performance, we could say the check *must* be done by the caller.


^ permalink raw reply	[flat|nested] 78+ messages in thread

* [PATCH, take5] FUTEX : new PRIVATE futexes
  2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
                                                     ` (2 preceding siblings ...)
  2007-04-07 22:15                                   ` Andrew Morton
@ 2007-04-11  9:19                                   ` Eric Dumazet
  2007-04-11 12:23                                     ` Rusty Russell
  2007-04-26 12:55                                     ` [PATCH, take6] " Eric Dumazet
  3 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-11  9:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Rusty Russel

Hi Andrew

Update on this take5 :

- Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper alignment for 64bit futexes
- compile tested on x86_64 (one minor typo)
- Added Rusty in CC since he may have to change drivers/lguest/io.c again, since get_futex_key() has yet another parameter (fshared). (I couldn't find this file in the 2.6.21-rc6-mm1 tree)

Thank you

History :

take4 :

- All remarks from Nick were addressed I hope

- Current mm code has a problem with 64bit futexes, as spotted by Nick:

get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not.
So it is possible a 64bit futex spans two pages of memory...
I had to change get_futex_key() prototype to be able to do a correct test.

take3:

I'm pleased to present this patch, which improves Linux futex performance and
scalability merely by avoiding taking the mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

In this third version I dropped the NUMA optimizations and the process-private
hash table, to let the new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current Linux futex code:
-------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insertion/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to:


1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode-based store (mapped file) or an mm-based store (anonymous memory)

2) - Compute a hash key

3) - Atomically increment a reference counter on an inode or a mm_struct

4) - Lock part of the futex_queues[] hash table

5) - Perform the test on the value of the futex
        (roll back if value != expected_value, returning EWOULDBLOCK)
        (various loops if the test triggers mm faults)

6) - Queue the context into the hash table, release the lock taken in 4)

7) - Release the read lock on mmap_sem

   <block>

8) - Eventually unqueue the context (but rarely, as this part
     may be done by futex_wake())

Futexes were designed to improve scalability, but the current implementation
has various problems:

- Central hashtable:
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 This means NUMA imbalance, because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall:

 Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic
 ops on mmap_sem, dirtying its cache line:
        - lots of cache line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()).
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
 programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on the inode ref counter or mm ref counter:
 It's also a cache line ping-pong on SMP. It also increases mmap_sem hold time
 because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'. This has a cost.

But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
separation, and optimal performance if carefully implemented. The time has
come for Linux to have better threading performance.

The goal is to permit new futex commands to avoid:
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on an mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.



If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and
be prepared to fall back on the old subcommands for old kernels. Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK (see the
sketch after the notes below).

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved; they still hit the
scalability problems, but new applications can fly :)

Note: the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both will use the old subcommands.

Note: the vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default. Almost all applications should benefit
from these changes (new kernel and updated libc).
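
To illustrate that fallback scheme, here is a userspace sketch (mine, not
glibc code). It relies on old kernels returning -ENOSYS for unknown
subcommands, with FUTEX_WAKE == 1 and FUTEX_PRIVATE_FLAG == 128 as defined
in this patch:

	#include <errno.h>
	#include <stddef.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define FUTEX_WAKE		1
	#define FUTEX_PRIVATE_FLAG	128

	/* One global: FUTEX_PRIVATE_FLAG on new kernels, 0 on old ones. */
	static int futex_private = FUTEX_PRIVATE_FLAG;

	static void futex_detect_private(void)
	{
		int dummy = 0;

		/* Old kernels reject the unknown subcommand with ENOSYS;
		 * fall back to the plain (shared) subcommands then. */
		if (syscall(SYS_futex, &dummy, FUTEX_WAKE | FUTEX_PRIVATE_FLAG,
			    1, NULL, NULL, 0) == -1 && errno == ENOSYS)
			futex_private = 0;
	}

	static long futex_wake_one(int *uaddr)
	{
		return syscall(SYS_futex, uaddr, FUTEX_WAKE | futex_private,
			       1, NULL, NULL, 0);
	}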

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call
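
For the curious, here is a sketch of the kind of measurement loop behind these
numbers (my reconstruction, not the actual benchmark): it times FUTEX_WAIT
calls that fail immediately with EWOULDBLOCK because value != *addr.

	#include <stdint.h>
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	#define FUTEX_WAIT	0

	static inline uint64_t rdtsc(void)
	{
		uint32_t lo, hi;

		__asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
		return ((uint64_t)hi << 32) | lo;
	}

	int main(void)
	{
		int futex_word = 1;		/* *addr == 1 ... */
		const int loops = 100000;
		uint64_t t0 = rdtsc();
		int i;

		for (i = 0; i < loops; i++)	/* ... but we wait for 0 */
			syscall(SYS_futex, &futex_word, FUTEX_WAIT, 0,
				NULL, NULL, 0);

		printf("%llu cycles per futex(FUTEX_WAIT) call\n",
		       (unsigned long long)((rdtsc() - t0) / loops));
		return 0;
	}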

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   29 +++
 kernel/futex.c        |  339 +++++++++++++++++++++++++---------------
 2 files changed, 245 insertions(+), 123 deletions(-)

--- linux-2.6.21-rc6-mm1/kernel/futex.c
+++ linux-2.6.21-rc6-mm1-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -193,6 +196,8 @@ static inline int match_futex(union fute
  * get_futex_key - Get parameters which are the keys for a futex.
  * @uaddr: virtual address of the futex
  * @size: size of futex (4 or 8)
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
  * @key: address where result is stored.
  *
  * Returns a negative error code or 0
@@ -202,9 +207,12 @@ static inline int match_futex(union fute
  * offset_within_page).  For private mappings, it's (uaddr, current->mm).
  * We can usually work out the index without swapping in the page.
  *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock, but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, int size, union futex_key *key)
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *fshared,
+		  union futex_key *key)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -221,6 +229,20 @@ int get_futex_key(void __user *uaddr, in
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'uaddr' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 */
+	if (!fshared) {
+		if (!access_ok(VERIFY_WRITE, uaddr, size))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -247,6 +269,7 @@ int get_futex_key(void __user *uaddr, in
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -256,7 +279,7 @@ int get_futex_key(void __user *uaddr, in
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -284,16 +307,18 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
@@ -304,11 +329,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -342,28 +371,39 @@ get_futex_value_locked(unsigned long *de
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+			      struct rw_semaphore *fshared, int attempt)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = -EFAULT;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
-		return -EFAULT;
+	if (attempt > 2)
+		return ret;
 
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
+	if (!fshared)
+		down_read(&mm->mmap_sem);
+	vma = find_vma(mm, address);
+	if (vma &&
+	    address >= vma->vm_start &&
+	    (vma->vm_flags & VM_WRITE)) {
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			ret = 0;
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			ret = 0;
+			current->maj_flt++;
+			break;
+		}
 	}
-	return 0;
+	if (!fshared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -705,10 +745,10 @@ double_lock_hb(struct futex_hash_bucket 
 }
 
 /*
- * Wake up all waiters hashed on the physical page that is mapped
- * to this virtual address:
+ * Wake up all waiters on a futex (uaddr, futex64, fshared)
  */
-static int futex_wake(unsigned long __user *uaddr, int futex64, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared, int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -717,9 +757,10 @@ static int futex_wake(unsigned long __us
 	int ret;
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, fsize, &key);
+	ret = get_futex_key(uaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -741,7 +782,8 @@ static int futex_wake(unsigned long __us
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -810,8 +852,10 @@ retry:
  * one physical page to another physical page (PI-futex uaddr2)
  */
 static int
-futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+futex_requeue_pi(unsigned long __user *uaddr1,
+		 int futex64, struct rw_semaphore *fshared,
+		 unsigned long __user *uaddr2,
+		 int nr_wake, int nr_requeue, unsigned long *cmpval)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -830,12 +874,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -858,7 +903,8 @@ retry:
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = futex_get_user(&curval, uaddr1, futex64);
 
@@ -993,7 +1039,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1002,8 +1049,10 @@ out:
  * to this virtual address:
  */
 static int
-futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op, int futex64)
+futex_wake_op(unsigned long __user *uaddr1,
+	      int futex64, struct rw_semaphore *fshared,
+	      unsigned long __user *uaddr2,
+	      int nr_wake, int nr_wake2, int op)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1013,12 +1062,13 @@ futex_wake_op(unsigned long __user *uadd
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1065,11 +1115,10 @@ retry:
 		 * still holding the mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						fshared, attempt);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -1077,7 +1126,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 
 		ret = futex_get_user(&dummy, uaddr2, futex64);
 		if (ret)
@@ -1114,7 +1164,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1123,8 +1174,10 @@ out:
  * physical page.
  */
 static int
-futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+futex_requeue(unsigned long __user *uaddr1,
+	      int futex64, struct rw_semaphore *fshared,
+	      unsigned long __user *uaddr2,
+	      int nr_wake, int nr_requeue, unsigned long *cmpval)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1134,12 +1187,13 @@ futex_requeue(unsigned long __user *uadd
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1162,7 +1216,8 @@ futex_requeue(unsigned long __user *uadd
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = futex_get_user(&curval, uaddr1, futex64);
 
@@ -1215,7 +1270,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1346,12 +1402,14 @@ static void unqueue_me_pi(struct futex_q
 /*
  * Fixup the pi_state owner with current.
  *
- * The cur->mm semaphore must be  held, it is released at return of this
- * function.
+ * For PROCESS_SHARED futexes, the cur->mm semaphore must be held; it is
+ * released at return of this function.
  */
-static int fixup_pi_state_owner(unsigned long  __user *uaddr, struct futex_q *q,
+static int fixup_pi_state_owner(unsigned long  __user *uaddr, int futex64,
+				struct rw_semaphore *fshared,
+				struct futex_q *q,
 				struct futex_hash_bucket *hb,
-				struct task_struct *curr, int futex64)
+				struct task_struct *curr)
 {
 	unsigned long newtid = curr->pid | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
@@ -1376,7 +1434,8 @@ static int fixup_pi_state_owner(unsigned
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(q);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	/*
 	 * We own it, so we have to replace the pending owner
 	 * TID. This must be atomic as we have preserve the
@@ -1399,12 +1458,14 @@ static int fixup_pi_state_owner(unsigned
 
 /*
  * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'arg3' futex64 capability
+ * we encode in the 'arg3' both futex64 and shared capabilities
  */
 #define ARG3_FUTEX64 1
+#define ARG3_SHARED  2
 
 static long futex_wait_restart(struct restart_block *restart);
 static int futex_wait(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared,
 		      unsigned long val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
@@ -1419,9 +1480,10 @@ static int futex_wait(unsigned long __us
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, fsize, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1444,8 +1506,8 @@ static int futex_wait(unsigned long __us
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr, futex64);
 
@@ -1456,7 +1518,8 @@ static int futex_wait(unsigned long __us
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 		ret = futex_get_user(&uval, uaddr, futex64);
 
 		if (!ret)
@@ -1482,7 +1545,8 @@ static int futex_wait(unsigned long __us
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1552,7 +1616,8 @@ static int futex_wait(unsigned long __us
 		else
 			ret = rt_mutex_timed_lock(lock, to, 1);
 
-		down_read(&curr->mm->mmap_sem);
+		if (fshared)
+			down_read(fshared);
 		spin_lock(q.lock_ptr);
 
 		/*
@@ -1569,7 +1634,8 @@ static int futex_wait(unsigned long __us
 
 			/* mmap_sem and hash_bucket lock are unlocked at
 			   return of this function */
-			ret = fixup_pi_state_owner(uaddr, &q, hb, curr, futex64);
+			ret = fixup_pi_state_owner(uaddr, futex64, fshared,
+						   &q, hb, curr);
 		} else {
 			/*
 			 * Catch the rare case, where the lock was released
@@ -1582,7 +1648,8 @@ static int futex_wait(unsigned long __us
 			}
 			/* Unqueue and drop the lock */
 			unqueue_me_pi(&q);
-			up_read(&curr->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 		}
 
 		debug_rt_mutex_free_waiter(&q.waiter);
@@ -1616,6 +1683,8 @@ static int futex_wait(unsigned long __us
 		if (futex64)
 			restart->arg3 |= ARG3_FUTEX64;
 #endif
+		if (fshared)
+			restart->arg3 |= ARG3_SHARED;
 		return -ERESTART_RESTARTBLOCK;
 	}
 
@@ -1623,7 +1692,8 @@ static int futex_wait(unsigned long __us
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1634,13 +1704,16 @@ static long futex_wait_restart(struct re
 	unsigned long val = restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
 	int futex64 = 0;
+	struct rw_semaphore *fshared = NULL;
 
 #ifdef CONFIG_64BIT
 	if (restart->arg3 & ARG3_FUTEX64)
 		futex64 = 1;
 #endif
+	if (restart->arg3 & ARG3_SHARED)
+		fshared = &current->mm->mmap_sem;
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, futex64, val, abs_time);
+	return (long)futex_wait(uaddr, futex64, fshared, val, abs_time);
 }
 
 
@@ -1695,8 +1768,9 @@ static void set_pi_futex_owner(struct fu
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time,
-			 int trylock, int futex64)
+static int futex_lock_pi(unsigned long __user *uaddr,
+			 int futex64, struct rw_semaphore *fshared,
+			 int detect, ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1718,9 +1792,10 @@ static int futex_lock_pi(unsigned long _
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, fsize, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1841,7 +1916,8 @@ static int futex_lock_pi(unsigned long _
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	WARN_ON(!q.pi_state);
 	/*
@@ -1855,7 +1931,8 @@ static int futex_lock_pi(unsigned long _
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 	spin_lock(q.lock_ptr);
 
 	/*
@@ -1864,7 +1941,8 @@ static int futex_lock_pi(unsigned long _
 	 */
 	if (!ret && q.pi_state->owner != curr)
 		/* mmap_sem is unlocked at return of this function */
-		ret = fixup_pi_state_owner(uaddr, &q, hb, curr, futex64);
+		ret = fixup_pi_state_owner(uaddr, futex64, fshared,
+					   &q, hb, curr);
 	else {
 		/*
 		 * Catch the rare case, where the lock was released
@@ -1877,7 +1955,8 @@ static int futex_lock_pi(unsigned long _
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q);
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 	}
 
 	if (!detect && ret == -EDEADLK && 0)
@@ -1889,7 +1968,8 @@ static int futex_lock_pi(unsigned long _
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 
  uaddr_faulted:
@@ -1900,15 +1980,16 @@ static int futex_lock_pi(unsigned long _
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}
 
 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -1922,7 +2003,8 @@ static int futex_lock_pi(unsigned long _
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+			   struct rw_semaphore *fshared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1943,9 +2025,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, fsize, &key);
+	ret = get_futex_key(uaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -2004,7 +2087,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	return ret;
 
@@ -2016,15 +2100,16 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}
 
 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -2076,6 +2161,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *fshared;
 	static unsigned long printk_interval;
 
 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2117,11 +2203,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;
 
-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, sizeof(u32), &q->key);
+	fshared = &current->mm->mmap_sem;
+	down_read(fshared);
+	err = get_futex_key(uaddr, sizeof(u32), fshared, &q->key);
 
 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(fshared);
 		kfree(q);
 		goto error;
 	}
@@ -2133,7 +2220,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(fshared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -2262,7 +2349,8 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake((unsigned long __user *)uaddr, 0, 1);
+				futex_wake((unsigned long __user *)uaddr, 0,
+					   &curr->mm->mmap_sem, 1);
 		}
 	}
 	return 0;
@@ -2350,13 +2438,18 @@ long do_futex(unsigned long __user *uadd
 	      unsigned long val2, unsigned long val3, int fut64)
 {
 	int ret;
+	int cmd = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *fshared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		fshared = &current->mm->mmap_sem;
 
-	switch (op) {
+	switch (cmd) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, fut64, val, timeout);
+		ret = futex_wait(uaddr, fut64, fshared, val, timeout);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, fut64, val);
+		ret = futex_wake(uaddr, fut64, fshared, val);
 		break;
 	case FUTEX_FD:
 		if (fut64)
@@ -2366,25 +2459,29 @@ long do_futex(unsigned long __user *uadd
 			ret = futex_fd((u32 __user *)uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+		ret = futex_requeue(uaddr, fut64, fshared,
+				    uaddr2, val, val2, NULL);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue(uaddr, fut64, fshared,
+				    uaddr2, val, val2, &val3);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+		ret = futex_wake_op(uaddr, fut64, fshared,
+				    uaddr2, val, val2, val3);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64);
+		ret = futex_lock_pi(uaddr, fut64, fshared, val, timeout, 0);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr, fut64);
+		ret = futex_unlock_pi(uaddr, fut64, fshared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+		ret = futex_lock_pi(uaddr, fut64, fshared, 0, timeout, 1);
 		break;
 	case FUTEX_CMP_REQUEUE_PI:
-		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue_pi(uaddr, fut64, fshared,
+				       uaddr2, val, val2, &val3);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -2401,23 +2498,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u64 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2433,23 +2531,24 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u32 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (u32) (unsigned long) utime;
 
 	return do_futex((unsigned long __user*)uaddr, op, val, tp,
--- linux-2.6.21-rc6-mm1/include/linux/futex.h
+++ linux-2.6.21-rc6-mm1-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_CMP_REQUEUE_PI	9
 
+#define FUTEX_PRIVATE_FLAG	128
+#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *	mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
+*/
+
+#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	unsigned long __user *uaddr;
 	struct {
@@ -135,7 +157,8 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(void __user *uaddr, int size, union futex_key *key);
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared,
+		  union futex_key *key);
 void get_futex_key_refs(union futex_key *key);
 void drop_futex_key_refs(union futex_key *key);
 


* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  8:14                                         ` Eric Dumazet
@ 2007-04-11  9:23                                           ` Nick Piggin
  2007-04-11  9:30                                             ` Pierre Peiffer
  2007-04-11  9:35                                             ` Eric Dumazet
  0 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-11  9:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> On Wed, 11 Apr 2007 17:22:57 +1000
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> 
> 
>>Eric Dumazet wrote:
>>
>>>On Sat, 07 Apr 2007 19:30:14 +1000
>>>Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>>>
>>>
>>>
>>>>Eric Dumazet wrote:
>>>
>>> 
>>>
>>>
>>>>>- Current -mm code has a problem with 64bit futexes, as spotted by Nick:
>>>>>
>>>>>get_futex_key() does a check against sizeof(u32) regardless of the futex being 64bits or not.
>>>>>So it is possible that a 64bit futex spans two pages of memory...
>>>>>I had to change the get_futex_key() prototype to be able to do a correct test.
>>>>
>>>>I wonder if it should be enforcing alignment to keep it on 1 page?
>>>
>>>
>>>I believe I just did that :)
>>
>>Yes :P What I was trying to say before jumping on a plane is that
>>sys_futex/sys_futex64 calls should each check their own address alignment, so
>>the deeper parts of the call stack always know alignment is correct.
>>
>>This will remove all the fsize you pass around, and also sanitise the userspace
>>argument much higher in the call stack, which is very preferable and more
>>conventional.
>>
>>Maybe this isn't possible (it's very obvious, so there may be a good reason it
>>hasn't been done).
> 
> 
> I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of-tree thing is using it.

You're changing the API anyway, so just get rid of the declaration from futex.h
(that doesn't actually make for an export anyway, unless it is inline).

But... that isn't there in mainline. Why is it in -mm? At any rate, that makes
it a no-brainer to change.

> 
> As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you: if we want to maximize performance, we could say: the check *must* be done by the caller.

Well we _control_ the API, so let's make it as clean and performant as possible
from the start.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  9:23                                           ` Nick Piggin
@ 2007-04-11  9:30                                             ` Pierre Peiffer
  2007-04-11  9:39                                               ` Nick Piggin
  2007-04-11  9:35                                             ` Eric Dumazet
  1 sibling, 1 reply; 78+ messages in thread
From: Pierre Peiffer @ 2007-04-11  9:30 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Eric Dumazet, Andrew Morton, Dave Jones, Ulrich Drepper,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Nick Piggin wrote:
> 
> But... that isn't there in mainline. Why is it in -mm?

This was introduced by the lguest code...
I did not follow exactly why.

Pierre

> At any rate, that makes it a no-brainer to change.
> 
>>
>> As this external thing certainly is not doing the check itself, to be 
>> on the safe side we should enforce it in get_futex_key(). I agree with 
>> you: if we want to maximize performance, we could say: the check
>> *must* be done by the caller.
> 
> Well we _control_ the API, so let's make it as clean and performant as
> possible from the start.
> 

-- 
Pierre Peiffer


* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  9:23                                           ` Nick Piggin
  2007-04-11  9:30                                             ` Pierre Peiffer
@ 2007-04-11  9:35                                             ` Eric Dumazet
  2007-04-12  1:57                                               ` Nick Piggin
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-11  9:35 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Wed, 11 Apr 2007 19:23:26 +1000
Nick Piggin <nickpiggin@yahoo.com.au> wrote:


> But... that isn't there in mainline. Why is it in -mm? At any rate, that makes
> it a no-brainer to change.

Seems to be related to lguest. Ask Rusty :)

> 
> > 
> > As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you: if we want to maximize performance, we could say: the check *must* be done by the caller.
> 
> Well we _control_ the API, so let's make it as clean and performant as possible
> from the start.

Take a look at do_futex().
Adding checks in callers just increases code size. I tried this and got only bad results.
This would speed up only the slow path (i.e. when some user code wants to give us non-aligned addrs).
A single factorized check is cleaner and not slower, since we reduce icache pressure.



* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  9:30                                             ` Pierre Peiffer
@ 2007-04-11  9:39                                               ` Nick Piggin
  2007-04-11  9:40                                                 ` Nick Piggin
  0 siblings, 1 reply; 78+ messages in thread
From: Nick Piggin @ 2007-04-11  9:39 UTC (permalink / raw)
  To: Pierre Peiffer
  Cc: Eric Dumazet, Andrew Morton, Dave Jones, Ulrich Drepper,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Pierre Peiffer wrote:
> Nick Piggin wrote:
> 
>>
>> But... that isn't there in mainline. Why is it in -mm?
> 
> 
>> This was introduced by the lguest code...
>> I did not follow exactly why.

OK, that's no problem; then it can remain exported, but we just
have to document and audit that callers must pass in an aligned
address. We can also BUG_ON(address & ~PAGE_MASK); to handle
the security aspect.

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  9:39                                               ` Nick Piggin
@ 2007-04-11  9:40                                                 ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-11  9:40 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pierre Peiffer, Eric Dumazet, Andrew Morton, Dave Jones,
	Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Rusty Russell

Nick Piggin wrote:
> Pierre Peiffer wrote:
> 
>> Nick Piggin wrote:
>>
>>>
>>> But... that isn't there in mainline. Why is it in -mm?
>>
>>
>>
>> This was introduced by the lguest code...
>> I did not follow exactly why.
> 
> 
> OK, that's no problem; then it can remain exported, but we just
> have to document and audit that callers must pass in an aligned
> address. We can also BUG_ON(address & ~PAGE_MASK); to handle
> the security aspect.

Err, duh no we can't because we don't know the size, which is the
whole point :P

Anyway.... just ensure callers have to fix alignment.
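
A minimal sketch of such a caller-side check (hypothetical code, not from
any of the posted patches) could look like this:

#include <stddef.h>
#include <stdint.h>
#include <errno.h>

/*
 * Hypothetical helper: sys_futex would pass fsize == sizeof(u32) and
 * sys_futex64 fsize == sizeof(u64), so the deeper parts of the call
 * stack can assume a naturally aligned address and a 64bit futex can
 * never span two pages.
 */
static inline int check_futex_alignment(const void *uaddr, size_t fsize)
{
	if ((uintptr_t)uaddr % fsize)
		return -EINVAL;
	return 0;
}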

-- 
SUSE Labs, Novell Inc.


* Re: [PATCH, take5] FUTEX : new PRIVATE futexes
  2007-04-11  9:19                                   ` [PATCH, take5] " Eric Dumazet
@ 2007-04-11 12:23                                     ` Rusty Russell
  2007-04-26 12:55                                     ` [PATCH, take6] " Eric Dumazet
  1 sibling, 0 replies; 78+ messages in thread
From: Rusty Russell @ 2007-04-11 12:23 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Wed, 2007-04-11 at 11:19 +0200, Eric Dumazet wrote:
> Hi Andrew
> 
> Update on this take5:
> 
> - Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper alignment for 64bit futexes
> - compile tested on x86_64 (one minor typo)

> - Added Rusty in CC since he may have to change drivers/lguest/io.c
> again, since get_futex_key() has yet another parameter (fshared). (I
> couldn't find this file in the 2.6.21-rc6-mm1 tree)

That's fine.  I just use the futex key lookup code to get a unique key
for shared memory between lguests (this is how lg.ko decides if they can
send DMA & interrupts to each other).

Looks like Andrew has already patched the lguest callers (I resubmitted
the moved lguest code just today).

Thanks!
Rusty.
PS.  As original futex author, I could have been cc'd on these patches
anyway.  But since the futex code has become unforgivably ugly (but, oh
so fast for pthreads), it's probably better not to see them 8)




* Re: [PATCH, take4] FUTEX : new PRIVATE futexes
  2007-04-11  9:35                                             ` Eric Dumazet
@ 2007-04-12  1:57                                               ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-04-12  1:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar,
	Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> On Wed, 11 Apr 2007 19:23:26 +1000
> Nick Piggin <nickpiggin@yahoo.com.au> wrote:

>>>As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you: if we want to maximize performance, we could say: the check *must* be done by the caller.
>>
>>Well we _control_ the API, so let's make it as clean and performant as possible
>>from the start.
> 
> 
> Take a look at do_futex().
> Adding checks in callers just increases code size. I tried this and got only bad results.
> This would speed up only the slow path (i.e. when some user code wants to give us non-aligned addrs).
> A single factorized check is cleaner and not slower, since we reduce icache pressure.

1 extra check versus all that additional argument passing? I don't think it
is conclusive.

-- 
SUSE Labs, Novell Inc.


* [PATCH, take6] FUTEX : new PRIVATE futexes
  2007-04-11  9:19                                   ` [PATCH, take5] " Eric Dumazet
  2007-04-11 12:23                                     ` Rusty Russell
@ 2007-04-26 12:55                                     ` Eric Dumazet
  2007-04-26 13:35                                       ` Pierre Peiffer
  1 sibling, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-04-26 12:55 UTC (permalink / raw)
  To: Andrew Morton, Pierre Peiffer
  Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Rusty Russell

Hi Andrew

Not sure if you prefer to wait for Pierre's work on futex64, so just in case, I prepared this patch.

Update on this take6:

- Rebased on linux-2.6.21-rc7-mm2, since futex64 was dropped from -mm


Pierre, I can resubmit another patch on top of your next patch, so please do as you prefer (ignoring this patch or not).

Thank you

History:

take5:

- Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper alignment for 64bit futexes
- compile tested on x86_64 (one minor typo)
- Added Rusty in CC since he may have to change drivers/lguest/io.c again, since get_futex_key() has yet another parameter (fshared). (I couldn't find this file in the 2.6.21-rc6-mm1 tree)


take4:

- All remarks from Nick were addressed, I hope

- Current -mm code has a problem with 64bit futexes, as spotted by Nick:

get_futex_key() does a check against sizeof(u32) regardless of the futex being 64bits or not.
So it is possible that a 64bit futex spans two pages of memory...
I had to change the get_futex_key() prototype to be able to do a correct test.

take3:

I'm pleased to present this patch, which improves Linux futex performance and
scalability, merely by avoiding taking the mmap_sem rw-semaphore.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

In this third version I dropped the NUMA optimizations and the process-private
hash table, to let the new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current Linux futex code:
-------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting 
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or the insertion/deletion of a futex_q.

When a futex_wait() is done, the calling thread has to (see the sketch after
this list):

1) - obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()).  This validation tells us if the futex uses
     an inode-based store (mapped file) or an mm-based store (anonymous mem)

2) - compute a hash key

3) - atomically increment a reference counter on an inode or an mm_struct

4) - lock part of the futex_queues[] hash table

5) - perform the test on the value of the futex
        (rollback if value != expected_value, returns EWOULDBLOCK)
        (various loops if the test triggers mm faults)

6) - queue the context into the hash table, release the lock taken in 4)

7) - release the read lock on mmap_sem

   <block>

8) - eventually unqueue the context (but rarely, as this part
     may be done by the futex_wake())
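
In C-shaped pseudocode (a sketch of the flow above, built from function
names seen in this thread, not the actual kernel source), this is roughly:

/* Rough sketch only; fault handling and retry loops are elided. */
static int futex_wait_sketch(u32 __user *uaddr, u32 expected)
{
	union futex_key key;
	struct futex_hash_bucket *hb;
	u32 uval;
	int ret;

	down_read(&current->mm->mmap_sem);	/* 1) validate pointer (find_vma()) */
	ret = get_futex_key(uaddr, &key);	/* 2) compute the hash key */
	if (ret)
		goto out;
	get_futex_key_refs(&key);		/* 3) atomic inc of inode/mm refcount */
	hb = hash_futex(&key);
	spin_lock(&hb->lock);			/* 4) lock one futex_queues[] bucket */
	ret = get_futex_value_locked(&uval, uaddr);	/* 5) test the futex value */
	if (ret || uval != expected) {
		spin_unlock(&hb->lock);		/* rollback */
		drop_futex_key_refs(&key);
		ret = ret ? ret : -EWOULDBLOCK;
		goto out;
	}
	/* 6) queue the context on hb->chain, then spin_unlock(&hb->lock) */
	up_read(&current->mm->mmap_sem);	/* 7) drop mmap_sem, then <block> */
	/* 8) unqueue the context (rare: usually futex_wake() does it) */
	return 0;
out:
	up_read(&current->mm->mmap_sem);
	return ret;
}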

Futexes were designed to improve scalability but the current implementation
has various problems:

- Central hashtable:
 This means scalability problems if many processes/threads want to use
 futexes at the same time.
 It also means NUMA imbalance, because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall:

 Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
 ops on mmap_sem, dirtying its cache line:
        - lots of cache-line ping-pongs on SMP configurations.

 mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()).
 Highly threaded processes might suffer from mmap_sem contention.

 mmap_sem is also used by the oprofile code.  Enabling oprofile hurts threaded
 programs because of contention on the mmap_sem cache line.

- Using atomic_inc()/atomic_dec() on the inode or mm reference counter:
 This is also a cache-line ping-pong on SMP.  It also increases mmap_sem hold
 time because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes to be 'shared'.  This has a cost.

But the fact is that POSIX defined PRIVATE and SHARED, allowing a clear
separation, and optimal performance if carefully implemented.  The time has
come for Linux to have better threading performance.

The goal is to permit new futex commands to avoid:
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a reference count on an mm or an inode, again conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use the new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be
prepared to fall back on the old subcommands for old kernels.  Using one
global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK; a
sketch follows.
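
For example (a hypothetical userland sketch, not actual glibc code; it
assumes the kernel returns -ENOSYS for unknown subcommands):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

/* Start optimistic: use the _PRIVATE subcommands, and clear the global
 * flag once if the kernel is too old to understand them. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

static int futex_wait_with_fallback(unsigned int *addr, unsigned int val)
{
	int ret = syscall(SYS_futex, addr, FUTEX_WAIT | futex_private_flag,
			  val, NULL);
	if (ret < 0 && errno == ENOSYS && futex_private_flag) {
		futex_private_flag = 0;	/* old kernel: shared subcommands */
		ret = syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL);
	}
	return ret;
}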

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved; they still hit the
scalability problems, but new applications can fly :)

Note: the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both will use the old subcommands.

Note: the vast majority of futexes should be using the PROCESS_PRIVATE
semantic, as this is the default.  Almost all applications should benefit
from these changes (new kernel and updated libc).

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine); a
reconstruction of the bench loop follows the numbers:

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference:
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call
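
The bench loop is roughly of the following shape (a reconstruction, not
the original bench code; the rdtsc() helper makes it x86-only):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#define N 1000000

static unsigned int futex_word;	/* stays 0; we wait for value 1 */

static unsigned long long rdtsc(void)
{
	unsigned int lo, hi;

	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((unsigned long long)hi << 32) | lo;
}

int main(void)
{
	unsigned long long t0 = rdtsc();
	int i;

	/* val (1) != *addr (0): each call returns EWOULDBLOCK at once,
	 * after doing the whole key lookup and hash-locking work. */
	for (i = 0; i < N; i++)
		syscall(SYS_futex, &futex_word, FUTEX_WAIT, 1, NULL);
	printf("%llu cycles per futex(FUTEX_WAIT) call\n",
	       (rdtsc() - t0) / N);
	return 0;
}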

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   29 +++
 kernel/futex.c        |  324 +++++++++++++++++++++++++---------------
 2 files changed, 236 insertions(+), 117 deletions(-)

--- linux-2.6.21-rc7-mm2/include/linux/futex.h
+++ linux-2.6.21-rc7-mm2-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_CMP_REQUEUE_PI	9
 
+#define FUTEX_PRIVATE_FLAG	128
+#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
@@ -114,8 +126,18 @@ handle_futex_death(u32 __user *uaddr, st
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
- */
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *	mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
+*/
+
+#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	u32 __user *uaddr;
 	struct {
@@ -134,7 +156,8 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(u32 __user *uaddr, union futex_key *key);
+int get_futex_key(u32 __user *uaddr, struct rw_semaphore *shared,
+		  union futex_key *key);
 void get_futex_key_refs(union futex_key *key);
 void drop_futex_key_refs(union futex_key *key);
 
--- linux-2.6.21-rc7-mm2/kernel/futex.c
+++ linux-2.6.21-rc7-mm2-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  *  Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  *  Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ *  PRIVATE futexes by Eric Dumazet
+ *  Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  *  enough at me, Linus for the original (flawed) idea, Matthew
  *  Kirkwood for proof-of-concept implementation.
@@ -150,19 +153,26 @@ static inline int match_futex(union fute
 		&& key1->both.offset == key2->both.offset);
 }
 
-/*
- * Get parameters which are the keys for a futex.
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns a negative error code or 0
+ * The key words are stored in *key on success.
  *
  * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode,
  * offset_within_page).  For private mappings, it's (uaddr, current->mm).
  * We can usually work out the index without swapping in the page.
  *
- * Returns: 0, or negative error code.
- * The key words are stored in *key on success.
- *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(u32 __user *uaddr, union futex_key *key)
+int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared,
+		  union futex_key *key)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -174,11 +184,25 @@ int get_futex_key(u32 __user *uaddr, uni
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((key->both.offset % sizeof(u32)) != 0))
+	if (unlikely((address % sizeof(u32)) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'uaddr' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 */
+	if (!fshared) {
+		if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32))))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -205,6 +229,7 @@ int get_futex_key(u32 __user *uaddr, uni
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -214,7 +239,7 @@ int get_futex_key(u32 __user *uaddr, uni
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -242,16 +267,18 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
@@ -262,11 +289,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -283,28 +314,38 @@ static inline int get_futex_value_locked
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+			      struct rw_semaphore *fshared, int attempt)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = -EFAULT;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
-		return -EFAULT;
+	if (attempt > 2)
+		return ret;
 
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
+	if (!fshared)
+		down_read(&mm->mmap_sem);
+	vma = find_vma(mm, address);
+	if (vma && address >= vma->vm_start &&
+	    (vma->vm_flags & VM_WRITE)) {
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			ret = 0;
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			ret = 0;
+			current->maj_flt++;
+			break;
+		}
 	}
-	return 0;
+	if (!fshared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -647,7 +688,8 @@ double_lock_hb(struct futex_hash_bucket 
  * Wake up all waiters hashed on the physical page that is mapped
  * to this virtual address:
  */
-static int futex_wake(u32 __user *uaddr, int nr_wake)
+static int futex_wake(u32 __user *uaddr, struct rw_semaphore *fshared,
+		      int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -655,9 +697,10 @@ static int futex_wake(u32 __user *uaddr,
 	union futex_key key;
 	int ret;
 
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -679,7 +722,8 @@ static int futex_wake(u32 __user *uaddr,
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -746,7 +790,9 @@ retry:
  * and requeue the next nr_requeue waiters following hashed on
  * one physical page to another physical page (PI-futex uaddr2)
  */
-static int futex_requeue_pi(u32 __user *uaddr1, u32 __user *uaddr2,
+static int futex_requeue_pi(u32 __user *uaddr1, 
+			    struct rw_semaphore *fshared,
+			    u32 __user *uaddr2,
 			    int nr_wake, int nr_requeue, u32 *cmpval)
 {
 	union futex_key key1, key2;
@@ -765,12 +811,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -793,7 +840,8 @@ retry:
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = get_user(curval, uaddr1);
 
@@ -927,7 +975,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -936,7 +985,8 @@ out:
  * to this virtual address:
  */
 static int
-futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2,
+futex_wake_op(u32 __user *uaddr1, struct rw_semaphore *fshared,
+	      u32 __user *uaddr2,
 	      int nr_wake, int nr_wake2, int op)
 {
 	union futex_key key1, key2;
@@ -946,12 +996,13 @@ futex_wake_op(u32 __user *uaddr1, u32 __
 	int ret, op_ret, attempt = 0;
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -991,11 +1042,10 @@ retry:
 		 * still holding the mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						fshared, attempt);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -1003,7 +1053,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 
 		ret = get_user(dummy, uaddr2);
 		if (ret)
@@ -1040,7 +1091,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1048,7 +1100,8 @@ out:
  * Requeue all waiters hashed on one physical page to another
  * physical page.
  */
-static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2,
+static int futex_requeue(u32 __user *uaddr1, struct rw_semaphore *fshared,
+			 u32 __user *uaddr2,
 			 int nr_wake, int nr_requeue, u32 *cmpval)
 {
 	union futex_key key1, key2;
@@ -1058,12 +1111,13 @@ static int futex_requeue(u32 __user *uad
 	int ret, drop_count = 0;
 
  retry:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1086,7 +1140,8 @@ static int futex_requeue(u32 __user *uad
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = get_user(curval, uaddr1);
 
@@ -1139,7 +1194,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1273,7 +1329,8 @@ static void unqueue_me_pi(struct futex_q
  * The cur->mm semaphore must be  held, it is released at return of this
  * function.
  */
-static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q,
+static int fixup_pi_state_owner(u32 __user *uaddr, struct rw_semaphore *fshared,
+				struct futex_q *q,
 				struct futex_hash_bucket *hb,
 				struct task_struct *curr)
 {
@@ -1300,7 +1357,8 @@ static int fixup_pi_state_owner(u32 __us
 
 	/* Unqueue and drop the lock */
 	unqueue_me_pi(q);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	/*
 	 * We own it, so we have to replace the pending owner
 	 * TID. This must be atomic as we have preserve the
@@ -1321,8 +1379,15 @@ static int fixup_pi_state_owner(u32 __us
 	return ret;
 }
 
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' shared capability
+ */
+#define ARG3_SHARED  1
+
 static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *abs_time)
+static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared,
+		      u32 val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1335,9 +1400,10 @@ static int futex_wait(u32 __user *uaddr,
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1360,8 +1426,8 @@ static int futex_wait(u32 __user *uaddr,
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr);
 
@@ -1372,7 +1438,8 @@ static int futex_wait(u32 __user *uaddr,
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 
 		ret = get_user(uval, uaddr);
 
@@ -1399,7 +1466,8 @@ static int futex_wait(u32 __user *uaddr,
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1469,7 +1537,8 @@ static int futex_wait(u32 __user *uaddr,
 		else
 			ret = rt_mutex_timed_lock(lock, to, 1);
 
-		down_read(&curr->mm->mmap_sem);
+		if (fshared)
+			down_read(fshared);
 		spin_lock(q.lock_ptr);
 
 		/*
@@ -1486,7 +1555,8 @@ static int futex_wait(u32 __user *uaddr,
 
 			/* mmap_sem and hash_bucket lock are unlocked at
 			   return of this function */
-			ret = fixup_pi_state_owner(uaddr, &q, hb, curr);
+			ret = fixup_pi_state_owner(uaddr, fshared,
+						   &q, hb, curr);
 		} else {
 			/*
 			 * Catch the rare case, where the lock was released
@@ -1499,7 +1569,8 @@ static int futex_wait(u32 __user *uaddr,
 			}
 			/* Unqueue and drop the lock */
 			unqueue_me_pi(&q);
-			up_read(&curr->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 		}
 
 		debug_rt_mutex_free_waiter(&q.waiter);
@@ -1528,6 +1599,9 @@ static int futex_wait(u32 __user *uaddr,
 		restart->arg0 = (unsigned long)uaddr;
 		restart->arg1 = (unsigned long)val;
 		restart->arg2 = (unsigned long)abs_time;
+		restart->arg3 = 0;
+		if (fshared)
+			restart->arg3 |= ARG3_SHARED;
 		return -ERESTART_RESTARTBLOCK;
 	}
 
@@ -1535,7 +1609,8 @@ static int futex_wait(u32 __user *uaddr,
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1545,9 +1620,12 @@ static long futex_wait_restart(struct re
 	u32 __user *uaddr = (u32 __user *)restart->arg0;
 	u32 val = (u32)restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
+	struct rw_semaphore *fshared = NULL;
 
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, val, abs_time);
+	if (restart->arg3 & ARG3_SHARED)
+		fshared = &current->mm->mmap_sem;
+	return (long)futex_wait(uaddr, fshared, val, abs_time);
 }
 
 
@@ -1602,8 +1680,8 @@ static void set_pi_futex_owner(struct fu
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(u32 __user *uaddr, int detect, ktime_t *time,
-			 int trylock)
+static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshared,
+			 int detect, ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1624,9 +1702,10 @@ static int futex_lock_pi(u32 __user *uad
 
 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1747,7 +1826,8 @@ static int futex_lock_pi(u32 __user *uad
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	WARN_ON(!q.pi_state);
 	/*
@@ -1761,7 +1841,8 @@ static int futex_lock_pi(u32 __user *uad
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}
 
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 	spin_lock(q.lock_ptr);
 
 	/*
@@ -1770,7 +1851,7 @@ static int futex_lock_pi(u32 __user *uad
 	 */
 	if (!ret && q.pi_state->owner != curr)
 		/* mmap_sem is unlocked at return of this function */
-		ret = fixup_pi_state_owner(uaddr, &q, hb, curr);
+		ret = fixup_pi_state_owner(uaddr, fshared, &q, hb, curr);
 	else {
 		/*
 		 * Catch the rare case, where the lock was released
@@ -1783,7 +1864,8 @@ static int futex_lock_pi(u32 __user *uad
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q);
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 	}
 
 	if (!detect && ret == -EDEADLK && 0)
@@ -1795,7 +1877,8 @@ static int futex_lock_pi(u32 __user *uad
 	queue_unlock(&q, hb);
 
  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 
  uaddr_faulted:
@@ -1806,15 +1889,16 @@ static int futex_lock_pi(u32 __user *uad
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}
 
 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
@@ -1828,7 +1912,7 @@ static int futex_lock_pi(u32 __user *uad
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(u32 __user *uaddr)
+static int futex_unlock_pi(u32 __user *uaddr, struct rw_semaphore *fshared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1848,9 +1932,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(uaddr, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1909,7 +1994,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	return ret;
 
@@ -1921,15 +2007,16 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}
 
 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 
 	ret = get_user(uval, uaddr);
 	if (!ret && (uval != -EFAULT))
@@ -1981,6 +2068,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *fshared;
 	static unsigned long printk_interval;
 
 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2022,11 +2110,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;
 
-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, &q->key);
+	fshared = &current->mm->mmap_sem;
+	down_read(fshared);
+	err = get_futex_key(uaddr, fshared, &q->key);
 
 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(fshared);
 		kfree(q);
 		goto error;
 	}
@@ -2038,7 +2127,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;
 
 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(fshared);
 
 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -2167,7 +2256,7 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake(uaddr, 1);
+				futex_wake(uaddr, &curr->mm->mmap_sem, 1);
 		}
 	}
 	return 0;
@@ -2223,7 +2312,8 @@ void exit_robust_list(struct task_struct
 		return;
 
 	if (pending)
-		handle_futex_death((void __user *)pending + futex_offset, curr, pip);
+		handle_futex_death((void __user *)pending + futex_offset,
+				   curr, pip);
 
 	while (entry != &head->list) {
 		/*
@@ -2253,38 +2343,43 @@ long do_futex(u32 __user *uaddr, int op,
 		u32 __user *uaddr2, u32 val2, u32 val3)
 {
 	int ret;
+	int cmd = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *fshared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		fshared = &current->mm->mmap_sem;
 
-	switch (op) {
+	switch (cmd) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, val, timeout);
+		ret = futex_wait(uaddr, fshared, val, timeout);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, val);
+		ret = futex_wake(uaddr, fshared, val);
 		break;
 	case FUTEX_FD:
 		/* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */
 		ret = futex_fd(uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL);
+		ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3);
+		ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3);
+		ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, 0);
+		ret = futex_lock_pi(uaddr, fshared, val, timeout, 0);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr);
+		ret = futex_unlock_pi(uaddr, fshared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, 1);
+		ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1);
 		break;
 	case FUTEX_CMP_REQUEUE_PI:
-		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3);
+		ret = futex_requeue_pi(uaddr, fshared, uaddr2, val, val2, &val3);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -2300,23 +2395,24 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u32 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;
 
-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;
 
 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (u32) (unsigned long) utime;
 
 	return do_futex(uaddr, op, val, tp, uaddr2, val2, val3);


* Re: [PATCH, take6] FUTEX : new PRIVATE futexes
  2007-04-26 12:55                                     ` [PATCH, take6] " Eric Dumazet
@ 2007-04-26 13:35                                       ` Pierre Peiffer
  0 siblings, 0 replies; 78+ messages in thread
From: Pierre Peiffer @ 2007-04-26 13:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin,
	Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
	Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Rusty Russell

Eric Dumazet wrote:
> Hi Andrew
> 
> Not sure if you prefer to wait for Pierre's work on futex64, so just in case, I prepared this patch.
> 
> Update on this take6:
> 
> - Rebased on linux-2.6.21-rc7-mm2, since futex64 was dropped from -mm
> 
> 
> Pierre, I can resubmit another patch on top of your next patch, so please do as you prefer (ignoring this patch or not).

Thank you for taking care of this.
But I think your patch is more mature than the futex64 patch; so it's OK for
me, and it's probably simpler and better for "the community" (and for Andrew's
work) to put this patch first, because futex64 may still need several reworks...

-- 
Pierre


Thread overview: 78+ messages
2006-08-08  7:07 [RFC] NUMA futex hashing Ravikiran G Thirumalai
2006-08-08  9:14 ` Eric Dumazet
2006-08-08 20:31   ` Ravikiran G Thirumalai
2006-08-08  9:37 ` Jes Sorensen
2006-08-08  9:58   ` Andi Kleen
2006-08-08 10:07     ` Jes Sorensen
2006-08-08  9:57 ` Andi Kleen
2006-08-08 10:10   ` Eric Dumazet
2006-08-08 10:36     ` Andi Kleen
2006-08-08 12:29       ` Eric Dumazet
2006-08-08 12:47         ` Andi Kleen
2006-08-08 12:57           ` Eric Dumazet
2006-08-08 14:39             ` Ulrich Drepper
2006-08-08 15:11               ` Nick Piggin
2006-08-08 15:36                 ` Ulrich Drepper
2006-08-08 16:22                   ` Nick Piggin
2006-08-08 16:26                     ` Nick Piggin
2006-08-08 16:49                     ` Ulrich Drepper
2006-08-08 16:08                 ` Eric Dumazet
2006-08-08 16:34                   ` Nick Piggin
2006-08-08 16:49                     ` Eric Dumazet
2006-08-08 16:59                       ` Eric Dumazet
2006-08-09  1:56                       ` Nick Piggin
2006-08-08 16:58                   ` Ulrich Drepper
2006-08-08 17:08                     ` Eric Dumazet
2006-08-09  1:58                     ` Nick Piggin
2006-08-09  6:26                       ` Eric Dumazet
2006-08-09  6:43                         ` Eric Dumazet
2007-03-15 19:10                           ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
2007-03-15 20:15                             ` Nick Piggin
2007-03-16  8:05                             ` Peter Zijlstra
2007-03-16  9:30                               ` Eric Dumazet
2007-03-16 10:10                                 ` Peter Zijlstra
2007-03-16 10:30                                   ` Eric Dumazet
2007-03-16 10:36                                     ` Peter Zijlstra
2007-04-04  7:16                             ` Ulrich Drepper
2007-04-05 17:49                               ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
2007-04-05 20:43                                 ` Ulrich Drepper
2007-04-06  1:19                                 ` Nick Piggin
2007-04-06  5:53                                   ` Eric Dumazet
2007-04-06 11:50                                     ` Nick Piggin
2007-04-06  6:05                                   ` Hugh Dickins
2007-04-06 17:41                                     ` Jan Engelhardt
2007-04-06 12:26                                 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra
2007-04-06 13:02                                   ` Hugh Dickins
2007-04-06 13:15                                     ` Peter Zijlstra
2007-04-06 13:15                                     ` Nick Piggin
2007-04-06 13:22                                       ` Peter Zijlstra
2007-04-06 13:40                                         ` Nick Piggin
2007-04-06 12:31                                 ` [PATCH] FUTEX : new PRIVATE futexes Peter Zijlstra
2007-04-07  8:43                                 ` [PATCH, take4] " Eric Dumazet
2007-04-07  9:30                                   ` Nick Piggin
2007-04-07 10:00                                     ` Eric Dumazet
2007-04-11  7:22                                       ` Nick Piggin
2007-04-11  8:14                                         ` Eric Dumazet
2007-04-11  9:23                                           ` Nick Piggin
2007-04-11  9:30                                             ` Pierre Peiffer
2007-04-11  9:39                                               ` Nick Piggin
2007-04-11  9:40                                                 ` Nick Piggin
2007-04-11  9:35                                             ` Eric Dumazet
2007-04-12  1:57                                               ` Nick Piggin
2007-04-07 11:18                                   ` Jakub Jelinek
2007-04-07 11:54                                     ` Eric Dumazet
2007-04-07 16:40                                       ` Ulrich Drepper
2007-04-07 22:15                                   ` Andrew Morton
2007-04-10  9:21                                     ` Eric Dumazet
2007-04-11  9:19                                   ` [PATCH, take5] " Eric Dumazet
2007-04-11 12:23                                     ` Rusty Russell
2007-04-26 12:55                                     ` [PATCH, take6] " Eric Dumazet
2007-04-26 13:35                                       ` Pierre Peiffer
2007-03-15 19:13                           ` [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic Eric Dumazet
2007-03-15 19:16                           ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
2007-03-15 20:25                             ` Nick Piggin
2007-03-15 21:09                               ` Ulrich Drepper
2007-03-15 21:29                                 ` Nick Piggin
2007-03-15 22:59                               ` William Lee Irwin III
2007-03-15 19:20                           ` [PATCH 3/3] FUTEX : NUMA friendly global hashtable Eric Dumazet
2006-08-09  0:13     ` [RFC] NUMA futex hashing Ravikiran G Thirumalai
