* [RFC] NUMA futex hashing @ 2006-08-08 7:07 Ravikiran G Thirumalai 2006-08-08 9:14 ` Eric Dumazet ` (2 more replies) 0 siblings, 3 replies; 78+ messages in thread From: Ravikiran G Thirumalai @ 2006-08-08 7:07 UTC (permalink / raw) To: linux-kernel; +Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar Current futex hash scheme is not the best for NUMA. The futex hash table is an array of struct futex_hash_bucket, which is just a spinlock and a list_head -- this means multiple spinlocks on the same cacheline and on NUMA machines, on the same internode cacheline. If futexes of two unrelated threads running on two different nodes happen to hash onto adjacent hash buckets, or buckets on the same internode cacheline, then we have the internode cacheline bouncing between nodes. Here is a simple scheme which maintains per-node hash tables for futexes. In this scheme, a private futex is assigned to the node id of the futex's KVA. The reasoning is, the futex KVA is allocated from the node as indicated by memory policy set by the process, and that should be a good 'home node' for that futex. Of course this helps workloads where all the threads of a process are bound to the same node, but it seems reasonable to run all threads of a process on the same node. A shared futex is assigned a home node based on jhash2 itself. Since inode and offset are used as the key, the same inode offset is used to arrive at the home node of a shared futex. This distributes private futexes across all nodes. Comments? Suggestions? Particularly regarding shared futexes. Any policy suggestions? Thanks, Kiran Note: This patch needs to have kvaddr_to_nid() reintroduced. 
This was taken out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8 Index: linux-2.6.18-rc3/kernel/futex.c =================================================================== --- linux-2.6.18-rc3.orig/kernel/futex.c 2006-08-02 12:11:34.000000000 -0700 +++ linux-2.6.18-rc3/kernel/futex.c 2006-08-02 16:48:47.000000000 -0700 @@ -137,20 +137,35 @@ struct futex_hash_bucket { struct list_head chain; }; -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; +static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly; /* Futex-fs vfsmount entry: */ static struct vfsmount *futex_mnt; /* * We hash on the keys returned from get_futex_key (see below). + * With NUMA aware futex hashing, we have per-node hash tables. + * We determine the home node of a futex based on the KVA -- if the futex + * is a private futex. For shared futexes, we use jhash2 itself on the + * futex_key to arrive at a home node. */ static struct futex_hash_bucket *hash_futex(union futex_key *key) { + int nodeid; u32 hash = jhash2((u32*)&key->both.word, (sizeof(key->both.word)+sizeof(key->both.ptr))/4, key->both.offset); - return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)]; + if (key->both.offset & 0x1) { + /* + * Shared futex: Use any of the 'possible' nodes as home node. 
+ */ + nodeid = hash & (MAX_NUMNODES -1); + BUG_ON(!node_possible(nodeid)); + } else + /* Private futex */ + nodeid = kvaddr_to_nid(key->both.ptr); + + return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)]; } /* @@ -1909,13 +1924,25 @@ static int __init init(void) { unsigned int i; + int nid; + + for_each_node(nid) + { + futex_queues[nid] = kmalloc_node( + (sizeof(struct futex_hash_bucket) * + (1 << FUTEX_HASHBITS)), + GFP_KERNEL, nid); + if (!futex_queues[nid]) + panic("futex_init: Allocation of multi-node futex_queues failed"); + for (i = 0; i < (1 << FUTEX_HASHBITS); i++) { + INIT_LIST_HEAD(&futex_queues[nid][i].chain); + spin_lock_init(&futex_queues[nid][i].lock); + } + } + register_filesystem(&futex_fs_type); futex_mnt = kern_mount(&futex_fs_type); - for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { - INIT_LIST_HEAD(&futex_queues[i].chain); - spin_lock_init(&futex_queues[i].lock); - } return 0; } __initcall(init); ^ permalink raw reply [flat|nested] 78+ messages in thread
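The bucket selection in the patch above can be modelled in user space. This is only a sketch under stated assumptions, not the kernel code: `SKETCH_MAX_NODES`, `struct sketch_bucket`, `sketch_tables_init()` and `sketch_pick_bucket()` are hypothetical names, `calloc()` stands in for `kmalloc_node()`, and the home node of a private futex is passed in by the caller instead of coming from `kvaddr_to_nid()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FUTEX_HASHBITS   8
#define SKETCH_NBUCKETS  (1u << FUTEX_HASHBITS)
#define SKETCH_MAX_NODES 4   /* hypothetical stand-in for MAX_NUMNODES */

/* Simplified bucket: the kernel struct holds a spinlock and a list_head. */
struct sketch_bucket {
    int lock;
    int nwaiters;
};

static struct sketch_bucket *sketch_tables[SKETCH_MAX_NODES];

/* Allocate one table per node. Returns 0 on success, -1 on failure. */
static int sketch_tables_init(void)
{
    for (int nid = 0; nid < SKETCH_MAX_NODES; nid++) {
        sketch_tables[nid] = calloc(SKETCH_NBUCKETS, sizeof(struct sketch_bucket));
        if (!sketch_tables[nid])
            return -1;
    }
    return 0;
}

/*
 * Shared futex: derive the node from the hash.
 * Private futex: use the home node supplied by the caller.
 */
static struct sketch_bucket *sketch_pick_bucket(uint32_t hash, int is_shared,
                                                int home_nid)
{
    /* Node bits are taken above the bucket bits (a choice made for this sketch). */
    int nid = is_shared ? (int)((hash >> FUTEX_HASHBITS) % SKETCH_MAX_NODES)
                        : home_nid;

    assert(nid >= 0 && nid < SKETCH_MAX_NODES);
    return &sketch_tables[nid][hash & (SKETCH_NBUCKETS - 1)];
}
```

One difference from the patch is deliberate: the patch takes both the node id and the bucket index from the low bits of the same hash, so shared futexes landing on a given node can only reach a fraction of that node's buckets. The sketch uses the bits above `FUTEX_HASHBITS` for the node to avoid that; whether this matters in practice is not discussed in the thread.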
* Re: [RFC] NUMA futex hashing 2006-08-08 7:07 [RFC] NUMA futex hashing Ravikiran G Thirumalai @ 2006-08-08 9:14 ` Eric Dumazet 2006-08-08 20:31 ` Ravikiran G Thirumalai 2006-08-08 9:37 ` Jes Sorensen 2006-08-08 9:57 ` Andi Kleen 2 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2006-08-08 9:14 UTC (permalink / raw) To: Ravikiran G Thirumalai Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote: > Current futex hash scheme is not the best for NUMA. The futex hash table > is an array of struct futex_hash_bucket, which is just a spinlock and a > list_head -- this means multiple spinlocks on the same cacheline and on > NUMA machines, on the same internode cacheline. If futexes of two > unrelated threads running on two different nodes happen to hash onto > adjacent hash buckets, or buckets on the same internode cacheline, then we > have the internode cacheline bouncing between nodes. > > Here is a simple scheme which maintains per-node hash tables for futexes. > > In this scheme, a private futex is assigned to the node id of the futex's > KVA. The reasoning is, the futex KVA is allocated from the node as > indicated by memory policy set by the process, and that should be a good > 'home node' for that futex. Of course this helps workloads where all the > threads of a process are bound to the same node, but it seems reasonable to > run all threads of a process on the same node. > > A shared futex is assigned a home node based on jhash2 itself. Since inode > and offset are used as the key, the same inode offset is used to arrive at > the home node of a shared futex. This distributes private futexes across > all nodes. > > Comments? Suggestions? Particularly regarding shared futexes. Any policy > suggestions? > Your patch seems fine, but I have one comment. For non NUMA machine, we would have one useless indirection to get the futex_queues pointer. 
static struct futex_hash_bucket *futex_queues[1]; I think it is worth to redesign your patch so that this extra-indirection is needed only for NUMA machines. #if defined(CONFIG_NUMA) static struct futex_hash_bucket *futex_queues[MAX_NUMNODES]; #define FUTEX_QUEUES(nodeid, hash) \ &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)]; #else static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; # define FUTEX_QUEUES(nodeid, hash) \ &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)]; #endif Thank you > Thanks, > Kiran > > Note: This patch needs to have kvaddr_to_nid() reintroduced. This was > taken out in git commit 9f3fd602aef96c2a490e3bfd669d06475aeba8d8 > > Index: linux-2.6.18-rc3/kernel/futex.c > =================================================================== > --- linux-2.6.18-rc3.orig/kernel/futex.c 2006-08-02 12:11:34.000000000 > -0700 +++ linux-2.6.18-rc3/kernel/futex.c 2006-08-02 16:48:47.000000000 > -0700 @@ -137,20 +137,35 @@ struct futex_hash_bucket { > struct list_head chain; > }; > > -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; > +static struct futex_hash_bucket *futex_queues[MAX_NUMNODES] __read_mostly; > > /* Futex-fs vfsmount entry: */ > static struct vfsmount *futex_mnt; > > /* > * We hash on the keys returned from get_futex_key (see below). > + * With NUMA aware futex hashing, we have per-node hash tables. > + * We determine the home node of a futex based on the KVA -- if the futex > + * is a private futex. For shared futexes, we use jhash2 itself on the > + * futex_key to arrive at a home node. > */ > static struct futex_hash_bucket *hash_futex(union futex_key *key) > { > + int nodeid; > u32 hash = jhash2((u32*)&key->both.word, > (sizeof(key->both.word)+sizeof(key->both.ptr))/4, > key->both.offset); > - return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)]; > + if (key->both.offset & 0x1) { > + /* > + * Shared futex: Use any of the 'possible' nodes as home node. 
> + */ > + nodeid = hash & (MAX_NUMNODES -1); > + BUG_ON(!node_possible(nodeid)); > + } else > + /* Private futex */ > + nodeid = kvaddr_to_nid(key->both.ptr); > + > + return &futex_queues[nodeid][hash & ((1 << FUTEX_HASHBITS)-1)]; > } > > /* > @@ -1909,13 +1924,25 @@ static int __init init(void) > { > unsigned int i; > > + int nid; > + > + for_each_node(nid) > + { > + futex_queues[nid] = kmalloc_node( > + (sizeof(struct futex_hash_bucket) * > + (1 << FUTEX_HASHBITS)), > + GFP_KERNEL, nid); > + if (!futex_queues[nid]) > + panic("futex_init: Allocation of multi-node futex_queues failed"); > + for (i = 0; i < (1 << FUTEX_HASHBITS); i++) { > + INIT_LIST_HEAD(&futex_queues[nid][i].chain); > + spin_lock_init(&futex_queues[nid][i].lock); > + } > + } > + > register_filesystem(&futex_fs_type); > futex_mnt = kern_mount(&futex_fs_type); > > - for (i = 0; i < ARRAY_SIZE(futex_queues); i++) { > - INIT_LIST_HEAD(&futex_queues[i].chain); > - spin_lock_init(&futex_queues[i].lock); > - } > return 0; > } > __initcall(init); > > - > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing
From: Ravikiran G Thirumalai @ 2006-08-08 20:31 UTC
To: Eric Dumazet
Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

On Tue, Aug 08, 2006 at 11:14:49AM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 09:07, Ravikiran G Thirumalai wrote:
> >
>
> Your patch seems fine, but I have one comment.
>
> For non NUMA machine, we would have one useless indirection to get the
> futex_queues pointer.
>
> static struct futex_hash_bucket *futex_queues[1];
>
> I think it is worth to redesign your patch so that this extra-indirection is
> needed only for NUMA machines.

Yes. Will do in the next iteration.

Thanks,
Kiran
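A compilable variant of the accessor Eric proposes might look like the sketch below. It differs from his snippet in two ways: the macro body has no trailing semicolon, so it can be used inside a larger expression, and the arguments are parenthesized. `SKETCH_NUMA` and `SKETCH_MAX_NODES` are hypothetical stand-ins for `CONFIG_NUMA` and `MAX_NUMNODES`, the bucket struct is simplified, and `sketch_hash_to_bucket()` is an illustrative helper, not something from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define FUTEX_HASHBITS 8
#define FUTEX_HASHMASK ((1u << FUTEX_HASHBITS) - 1)

/* Simplified bucket: the kernel struct holds a spinlock and a list_head. */
struct futex_hash_bucket {
    int lock;
    int chain;
};

#ifdef SKETCH_NUMA   /* hypothetical stand-in for CONFIG_NUMA */
#define SKETCH_MAX_NODES 4
static struct futex_hash_bucket node_tables[SKETCH_MAX_NODES][1u << FUTEX_HASHBITS];
static struct futex_hash_bucket *futex_queues[SKETCH_MAX_NODES] = {
    node_tables[0], node_tables[1], node_tables[2], node_tables[3],
};
/* NUMA: one extra pointer load to reach the per-node table. */
#define FUTEX_QUEUES(nodeid, hash) \
    (&futex_queues[(nodeid)][(hash) & FUTEX_HASHMASK])
#else
static struct futex_hash_bucket futex_queues[1u << FUTEX_HASHBITS];
/* Non-NUMA: a single flat table, the node id is evaluated and ignored. */
#define FUTEX_QUEUES(nodeid, hash) \
    ((void)(nodeid), &futex_queues[(hash) & FUTEX_HASHMASK])
#endif

static struct futex_hash_bucket *sketch_hash_to_bucket(int nodeid, unsigned int hash)
{
    /* The macro is used inside a subtraction, which a trailing ';' would break. */
    ptrdiff_t idx = FUTEX_QUEUES(nodeid, hash) - FUTEX_QUEUES(nodeid, 0);

    assert(idx >= 0 && idx < (ptrdiff_t)(1u << FUTEX_HASHBITS));
    return FUTEX_QUEUES(nodeid, hash);
}
```

Built without `SKETCH_NUMA`, callers pay no indirection at all, which is the property Eric asks for.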
* Re: [RFC] NUMA futex hashing
From: Jes Sorensen @ 2006-08-08 9:37 UTC
To: Ravikiran G Thirumalai
Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

>>>>> "Ravikiran" == Ravikiran G Thirumalai <kiran@scalex86.org> writes:

Ravikiran> Current futex hash scheme is not the best for NUMA. The
Ravikiran> futex hash table is an array of struct futex_hash_bucket,
Ravikiran> which is just a spinlock and a list_head -- this means
Ravikiran> multiple spinlocks on the same cacheline and on NUMA
Ravikiran> machines, on the same internode cacheline. If futexes of
Ravikiran> two unrelated threads running on two different nodes happen
Ravikiran> to hash onto adjacent hash buckets, or buckets on the same
Ravikiran> internode cacheline, then we have the internode cacheline
Ravikiran> bouncing between nodes.

Ravikiran,

Using that argument, all you need to do is to add the alignment
____cacheline_aligned_in_smp to the definition of struct
futex_hash_bucket and the problem is solved, given that the internode
cacheline in a NUMA system is defined to be the same as the SMP
cacheline size.

Ravikiran> Here is a simple scheme which maintains per-node hash
Ravikiran> tables for futexes.

Ravikiran> In this scheme, a private futex is assigned to the node id
Ravikiran> of the futex's KVA. The reasoning is, the futex KVA is
Ravikiran> allocated from the node as indicated by memory policy set
Ravikiran> by the process, and that should be a good 'home node' for
Ravikiran> that futex. Of course this helps workloads where all the
Ravikiran> threads of a process are bound to the same node, but it
Ravikiran> seems reasonable to run all threads of a process on the
Ravikiran> same node.

You can't make that assumption at all. In many NUMA workloads it is
not common to have all threads of a process run on the same node. You
often see a case where one thread spawns a number of threads that are
then grouped onto the various nodes.

If we want to make the futexes really NUMA aware, having them
explicitly allocated on a given node would be more useful, or
alternatively have them allocated on a first touch basis.

But to be honest, I doubt it matters too much, since the futex
cacheline is most likely to end up in cache on the node where it's
being used. As long as the other nodes don't try to touch the same
futex, this becomes a non-issue with the proper alignment.

I don't think your patch is harmful, but it looks awfully complex for
something that could be solved just as well by a simple alignment
statement.

Cheers,
Jes
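The effect of the alignment Jes suggests can be illustrated in user space with C11 `alignas`. This is a sketch, not the kernel macro: `SKETCH_CACHELINE` is an assumed line size of 64 bytes (the real value is architecture specific), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHELINE 64   /* assumed line size; the real value is per-arch */

/* Unaligned bucket: several of these can share one cache line. */
struct sketch_bucket_packed {
    int   lock;
    void *next;
    void *prev;
};

/* Aligned bucket: padded so that each one starts on its own line. */
struct sketch_bucket_aligned {
    alignas(SKETCH_CACHELINE) int lock;
    void *next;
    void *prev;
};

/* The padding is the memory cost of the approach: one full line per bucket. */
static_assert(sizeof(struct sketch_bucket_aligned) == SKETCH_CACHELINE,
              "aligned bucket should occupy exactly one assumed cache line");

static struct sketch_bucket_aligned sketch_aligned_table[4];

/* Returns nonzero if the two addresses fall on the same assumed cache line. */
static int sketch_same_line(const void *a, const void *b)
{
    return ((uintptr_t)a / SKETCH_CACHELINE) == ((uintptr_t)b / SKETCH_CACHELINE);
}
```

With the aligned layout, locks in adjacent buckets can no longer bounce the same line between CPUs, at the cost of padding each bucket out to a full line.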
* Re: [RFC] NUMA futex hashing
From: Andi Kleen @ 2006-08-08 9:58 UTC
To: Jes Sorensen
Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

Jes Sorensen <jes@sgi.com> writes:
>
> Using that argument, all you need to do is to add the alignment
> ____cacheline_aligned_in_smp to the definition of
> struct futex_hash_bucket and the problem is solved, given that the
> internode cacheline in a NUMA system is defined to be the same as the
> SMP cacheline size.

Yes but it would waste quite a lot of memory and cache. Wasted cache = slow.

-Andi
* Re: [RFC] NUMA futex hashing
From: Jes Sorensen @ 2006-08-08 10:07 UTC
To: Andi Kleen
Cc: linux-kernel, Shai Fultheim (Shai@scalex86.org), pravin b shelar

>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:

Andi> Jes Sorensen <jes@sgi.com> writes:
>> Using that argument, all you need to do is to add the alignment
>> ____cacheline_aligned_in_smp to the definition of struct
>> futex_hash_bucket and the problem is solved, given that the
>> internode cacheline in a NUMA system is defined to be the same as
>> the SMP cacheline size.

Andi> Yes but it would waste quite a lot of memory and cache. Wasted
Andi> cache = slow.

Compared to the extra level of indirection, I doubt it would be
measurable. The cache space is barely wasted, we're talking
approximately half a cacheline per futex hash bucket in use.

Jes
* Re: [RFC] NUMA futex hashing
From: Andi Kleen @ 2006-08-08 9:57 UTC
To: Ravikiran G Thirumalai
Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Ravikiran G Thirumalai <kiran@scalex86.org> writes:

> Current futex hash scheme is not the best for NUMA. The futex hash table is
> an array of struct futex_hash_bucket, which is just a spinlock and a
> list_head -- this means multiple spinlocks on the same cacheline and on NUMA
> machines, on the same internode cacheline. If futexes of two unrelated
> threads running on two different nodes happen to hash onto adjacent hash
> buckets, or buckets on the same internode cacheline, then we have the
> internode cacheline bouncing between nodes.

When I did some testing with an (arguably far too lock intensive)
benchmark on a bigger box, I got most bouncing cycles not in the futex
locks themselves, but in the down_read on the mm semaphore.

I guess that is not addressed?

-Andi
* Re: [RFC] NUMA futex hashing
From: Eric Dumazet @ 2006-08-08 10:10 UTC
To: Andi Kleen
Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> Ravikiran G Thirumalai <kiran@scalex86.org> writes:
> > Current futex hash scheme is not the best for NUMA. The futex hash
> > table is an array of struct futex_hash_bucket, which is just a spinlock
> > and a list_head -- this means multiple spinlocks on the same cacheline
> > and on NUMA machines, on the same internode cacheline. If futexes of two
> > unrelated threads running on two different nodes happen to hash onto
> > adjacent hash buckets, or buckets on the same internode cacheline, then
> > we have the internode cacheline bouncing between nodes.
>
> When I did some testing with a (arguably far too lock intensive) benchmark
> on a bigger box I got most bouncing cycles not in the futex locks itself,
> but in the down_read on the mm semaphore.

This is true, even with a normal application (not a biased benchmark)
and using oprofile. mmap_sem is the killer.

We may have a special case for PRIVATE futexes (they don't need to be
chained in a global table, but in a process private table).

The POSIX thread API already lets the application tell glibc/kernel
that a mutex/futex has process scope.

For these private futexes, I think we would not need to
down_read(mmap_sem) at all (only one or more locks protecting the
process private table).

Eric
* Re: [RFC] NUMA futex hashing
From: Andi Kleen @ 2006-08-08 10:36 UTC
To: Eric Dumazet
Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

> We may have special case for PRIVATE futexes (they dont need to be chained in
> a global table, but a process private table)

What do you mean by PRIVATE futex?

Even if the futex mapping is only visible to a single MM, mmap_sem is
still needed to protect against other threads doing mmap.

-Andi
* Re: [RFC] NUMA futex hashing
From: Eric Dumazet @ 2006-08-08 12:29 UTC
To: Andi Kleen
Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > We may have special case for PRIVATE futexes (they dont need to be
> > chained in a global table, but a process private table)
>
> What do you mean with PRIVATE futex?
>
> Even if the futex mapping is only visible by a single MM mmap_sem is still
> needed to protect against other threads doing mmap.

Hum... I would call that a user error.

If a thread is munmap()ing the vma that contains active futexes, the
result is undefined. Same as today, I think (a thread blocked in a
FUTEX_WAIT should stay blocked).

The point is that private futexes could be managed using virtual
addresses, with no call to find_extend_vma(), hence no mmap_sem
contention.

There could be a problem if the same futex (a 32-bit integer) could be
mapped at different virtual addresses in the same process.

Eric
* Re: [RFC] NUMA futex hashing
From: Andi Kleen @ 2006-08-08 12:47 UTC
To: Eric Dumazet
Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > We may have special case for PRIVATE futexes (they dont need to be
> > > chained in a global table, but a process private table)
> >
> > What do you mean with PRIVATE futex?
> >
> > Even if the futex mapping is only visible by a single MM mmap_sem is still
> > needed to protect against other threads doing mmap.
>
> Hum... I would call that a user error.
>
> If a thread is munmap()ing the vma that contains active futexes, result is
> undefined.

We can't allow anything that could crash the kernel, corrupt a kernel
data structure, allow writing to freed memory etc., no matter how
defined it is or not. Working with a vma that doesn't have an
existence guarantee would be just that.

-Andi
* Re: [RFC] NUMA futex hashing
From: Eric Dumazet @ 2006-08-08 12:57 UTC
To: Andi Kleen
Cc: Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On Tuesday 08 August 2006 14:47, Andi Kleen wrote:
> On Tuesday 08 August 2006 14:29, Eric Dumazet wrote:
> > On Tuesday 08 August 2006 12:36, Andi Kleen wrote:
> > > > We may have special case for PRIVATE futexes (they dont need to be
> > > > chained in a global table, but a process private table)
> > >
> > > What do you mean with PRIVATE futex?
> > >
> > > Even if the futex mapping is only visible by a single MM mmap_sem is
> > > still needed to protect against other threads doing mmap.
> >
> > Hum... I would call that a user error.
> >
> > If a thread is munmap()ing the vma that contains active futexes, result
> > is undefined.
>
> We can't allow anything that could crash the kernel, corrupt a kernel,
> data structure, allow writing to freed memory etc. No matter how
> defined it is or not. Working with a vma that doesn't have
> an existence guarantee would be just that.

As I said, we do not walk the vmas anymore, so no crashes are possible.

Just keep a process private list of 'private futexes', indexed by
their virtual address. This list can of course be stored in an
efficient data structure: an AVL or RB tree, or a hash table.

The validity of the virtual address is still tested by the normal
get_user() call. If the memory was freed by a thread, then a normal
EFAULT error will be reported... eventually.

Eric
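The process private table Eric describes, keyed purely by virtual address, could be sketched in user space as follows. Everything here is hypothetical and simplified: a small chained hash table stands in for whatever structure (tree or hash) would actually be chosen, "waking" just unlinks and frees an entry, and the lock that would protect the table is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PRIV_BUCKETS 64   /* hypothetical table size */

struct sketch_waiter {
    uintptr_t uaddr;             /* user virtual address, used as the key */
    int tid;                     /* hypothetical waiter id */
    struct sketch_waiter *next;
};

/* One such table would exist per process (per mm), not globally. */
struct sketch_priv_table {
    struct sketch_waiter *heads[SKETCH_PRIV_BUCKETS];
};

static unsigned int sketch_priv_hash(uintptr_t uaddr)
{
    /* Futex words are 4-byte aligned, so the two low bits carry no information. */
    return (unsigned int)((uaddr >> 2) % SKETCH_PRIV_BUCKETS);
}

/* Queue a waiter on uaddr. No vma lookup is involved, only the address value. */
static int sketch_priv_enqueue(struct sketch_priv_table *t, uintptr_t uaddr, int tid)
{
    struct sketch_waiter *w;
    unsigned int h;

    assert(t != NULL);
    w = malloc(sizeof(*w));
    if (!w)
        return -1;
    h = sketch_priv_hash(uaddr);
    w->uaddr = uaddr;
    w->tid = tid;
    w->next = t->heads[h];
    t->heads[h] = w;
    return 0;
}

/* Remove up to nr waiters queued on uaddr. Returns how many were removed. */
static int sketch_priv_wake(struct sketch_priv_table *t, uintptr_t uaddr, int nr)
{
    struct sketch_waiter **pp = &t->heads[sketch_priv_hash(uaddr)];
    int woken = 0;

    while (*pp && woken < nr) {
        struct sketch_waiter *w = *pp;

        if (w->uaddr == uaddr) {
            *pp = w->next;       /* unlink the matching waiter */
            free(w);
            woken++;
        } else {
            pp = &w->next;       /* different address in the same bucket */
        }
    }
    return woken;
}
```

The sketch inserts at the head of each chain, so it does not preserve wake-up order, and it says nothing about the objection raised elsewhere in the thread that the table itself becomes another shared, locked structure.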
* Re: [RFC] NUMA futex hashing
From: Ulrich Drepper @ 2006-08-08 14:39 UTC
To: Eric Dumazet
Cc: Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
> The validity of the virtual address is still tested by normal get_user()
> call.. If the memory was freed by a thread, then a normal EFAULT error will
> be reported... eventually.

This is indeed what should be done. Private futexes are by far the
more frequent case, and I bet you'd see improvements from avoiding the
mm mutex even on normal machines, since futexes really are everywhere.
For shared mutexes you end up doing two lookups, and that's fine IMO as
long as the first lookup is fast.

As for the NUMA case, I would oppose any change which has the
slightest impact on non-NUMA machines. It cannot be allowed that the
majority of systems is slowed down significantly just because of NUMA.
Especially since the effects of NUMA, besides cache line transfer
penalties, are IMO probably negligible. The in-kernel futex
representation only exists when there are waiters, so the memory
needed is only allocated when we are waiting. In this case it would be
easy enough to use local memory. But this is unlikely to help much,
since the waker thread is ideally not on the same processor, maybe not
even on the same node. So there will be cacheline transfers in most
cases, and every possible improvement will be minimal and maybe not
even generally measurable.

If you want to do anything in this area, first remove the global
mutex. Then really measure with real world applications. And I don't
mean specially designed HPC apps which assign threads/processes to
processors or nodes. Those are special cases of a special case.
* Re: [RFC] NUMA futex hashing
From: Nick Piggin @ 2006-08-08 15:11 UTC
To: Ulrich Drepper
Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>> The validity of the virtual address is still tested by normal get_user()
>> call.. If the memory was freed by a thread, then a normal EFAULT error
>> will be reported... eventually.
>
> This is indeed what should be done. Private futexes are the by far
> more frequent case and I bet you'd see improvements when avoiding the
> mm mutex even for normal machines since futexes really are everywhere.
> For shared mutexes you end up doing two lookups and that's fine IMO
> as long as the first lookup is fast.

The private futex's namespace is its virtual address, so I don't see
how you can decouple that from the management of virtual addresses.

Let me get this straight: to insert a contended futex into your rbtree,
you need to hold the mmap sem to ensure that address remains valid,
then you need to take a lock which protects your rbtree. Then to wake
up a process and remove the futex, you need to take the rbtree lock. Or
to unmap any memory you also need to take the rbtree lock and ensure
there are no futexes there.

So you just add another lock for no reason, or have I got a few screws
loose myself? I don't see how you can significantly reduce lock
cacheline bouncing in a futex heavy workload if you're just going to
add another shared data structure. But if you can, sweet ;)

--
SUSE Labs, Novell Inc.
* Re: [RFC] NUMA futex hashing
From: Ulrich Drepper @ 2006-08-08 15:36 UTC
To: Nick Piggin
Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> Let me get this straight: to insert a contended futex into your rbtree,
> you need to hold the mmap sem to ensure that address remains valid,
> then you need to take a lock which protects your rbtree.

Why does it have to remain valid? As long as the kernel doesn't crash
on any of the operations associated with the futex syscalls, let the
address space region explode, implode, whatever. It's a bug in the
program if the address region is changed while a futex is placed
there. If the futex syscall hangs forever or returns with a bogus
state (error or even success), this is perfectly acceptable. We
shouldn't slow down correct uses just to make it possible for broken
programs to receive a more detailed error description.
* Re: [RFC] NUMA futex hashing
From: Nick Piggin @ 2006-08-08 16:22 UTC
To: Ulrich Drepper
Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> Let me get this straight: to insert a contended futex into your rbtree,
>> you need to hold the mmap sem to ensure that address remains valid,
>> then you need to take a lock which protects your rbtree.
>
> Why does it have to remain valid? As long as the kernel doesn't crash
> on any of the operations associated with the futex syscalls let the
> address space region explode, implode, whatever. It's a bug in the
> program if the address region is changed while a futex is placed
> there. If the futex syscall hangs forever or returns with a bogus
> state (error or even success) this is perfectly acceptable. We

I thought mremap (no, that's already kind of messed up); or
even just getting consistency in failures (eg. so you don't have
the situation that a futex op can succeed on a previously
unmapped region).

If you're not worried about the latter, then it might work...

I didn't initially click that the private futex API operates
purely on tokens rather than virtual memory... comments in
futex.c talk about futexes being hashed to a particular
physical page (which is the case for shared). That's whacked.

So actually you would change semantics in some weird corner
cases, like mremaping a shared futex over a private futex's

Arguably that's broken, though ;)

> shouldn't slow down correct uses just to make it possible for broken
> programs to receive a more detailed error description.

No we shouldn't slow them down. I'd be interested to see whether
locking is significantly sped up with this new data structure,
though.

You might also slow down due to the fact that you'd have to do
the locking and unconditionally traverse the private futexes even
for shared futexes.

--
SUSE Labs, Novell Inc.
* Re: [RFC] NUMA futex hashing
From: Nick Piggin @ 2006-08-08 16:26 UTC
To: Ulrich Drepper
Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Nick Piggin wrote:
> No we shouldn't slow them down. I'd be interested to see whether
> locking is significantly sped up with this new data structure,
> though.

OTOH, maybe you don't need a new data structure. Maybe you could
use the hash and check that for a match on a private futex before
trying to find a possible shared futex.

Locking I guess becomes no more of a problem than now, and in
some cases maybe much less. So OK, I stand corrected.

--
SUSE Labs, Novell Inc.
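Nick's suggestion, trying a private-key match in the existing hash before falling back to the shared lookup, can be sketched as a two-phase search. This is an interpretation of his one-paragraph idea, not code from the thread: the key layout is simplified, a linear scan stands in for the hash bucket walk, and `sketch_resolve_shared()` is a hypothetical stand-in for the slow path that has to resolve the mapping (the part that needs mmap_sem in the kernel).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified key: the low bit of 'offset' marks a shared key, as in the patch. */
struct sketch_key {
    const void  *owner;    /* mm for private keys, inode for shared keys */
    uintptr_t    word;     /* virtual address (private) or page index (shared) */
    unsigned int offset;
};

static int sketch_key_equal(const struct sketch_key *a, const struct sketch_key *b)
{
    return a->owner == b->owner && a->word == b->word && a->offset == b->offset;
}

/* Linear scan standing in for the walk over one hash bucket's chain. */
static const struct sketch_key *sketch_find(const struct sketch_key *q, size_t n,
                                            const struct sketch_key *k)
{
    for (size_t i = 0; i < n; i++)
        if (sketch_key_equal(&q[i], k))
            return &q[i];
    return NULL;
}

static int sketch_inode;   /* dummy object whose address plays the inode */

/* Hypothetical slow path: map an address to an (inode, page index) key. */
static struct sketch_key sketch_resolve_shared(uintptr_t uaddr)
{
    struct sketch_key k = { &sketch_inode, uaddr / 4096, 1 };
    return k;
}

/*
 * Phase 1 builds the private key from the mm and address alone.
 * Phase 2 runs only when phase 1 finds nothing.
 */
static const struct sketch_key *sketch_lookup(const struct sketch_key *q, size_t n,
                                              const void *mm, uintptr_t uaddr,
                                              int *slow_path_taken)
{
    struct sketch_key priv = { mm, uaddr, 0 };
    struct sketch_key shared;
    const struct sketch_key *hit;

    assert(slow_path_taken != NULL);
    *slow_path_taken = 0;
    hit = sketch_find(q, n, &priv);
    if (hit)
        return hit;

    *slow_path_taken = 1;
    shared = sketch_resolve_shared(uaddr);
    return sketch_find(q, n, &shared);
}
```

In this arrangement private futexes never reach the slow path, while shared futexes pay for one failed private lookup first, which is the cost Nick raised in his previous message.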
* Re: [RFC] NUMA futex hashing
From: Ulrich Drepper @ 2006-08-08 16:49 UTC
To: Nick Piggin
Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

On 8/8/06, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> I thought mremap (no, that's already kind of messed up); or
> even just getting consistency in failures (eg. so you don't have
> the situation that a futex op can succeed on a previously
> unmapped region).
>
> If you're not worried about the latter, then it might work...

I'm not the least bit worried about this. It's 100% an application's
fault. You cannot touch an address space if it's used, e.g., for
mutexes.

> I didn't initially click that the private futex API operates
> purely on tokens rather than virtual memory...

I haven't looked at the code in some time but I thought this got
clarified in the comments. For waiting on private mutexes we need
nothing but the address value itself. There is the FUTEX_WAKE_OP
operation which will also write to memory, but this is only the waker
side, and if the memory mapping is gone, just flag an error. It's
another program error which shouldn't in any way slow down normal
operations.
* Re: [RFC] NUMA futex hashing 2006-08-08 15:11 ` Nick Piggin 2006-08-08 15:36 ` Ulrich Drepper @ 2006-08-08 16:08 ` Eric Dumazet 2006-08-08 16:34 ` Nick Piggin 2006-08-08 16:58 ` Ulrich Drepper 1 sibling, 2 replies; 78+ messages in thread From: Eric Dumazet @ 2006-08-08 16:08 UTC (permalink / raw) To: Nick Piggin Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Tuesday 08 August 2006 17:11, Nick Piggin wrote: > Ulrich Drepper wrote: > > On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote: > >> The validity of the virtual address is still tested by normal get_user() > >> call.. If the memory was freed by a thread, then a normal EFAULT error > >> will > >> be reported... eventually. > > > > This is indeed what should be done. Private futexes are the by far > > more frequent case and I bet you'd see improvements when avoiding the > > mm mutex even for normal machines since futexes really are everywhere. > > For shared mutexes you end up doing two lookups and that's fine IMO > > as long as the first lookup is fast. > > The private futex's namespace is its virtual address, so I don't see > how you can decouple that from the management of virtual addresses. > > Let me get this straight: to insert a contended futex into your rbtree, > you need to hold the mmap sem to ensure that address remains valid, > then you need to take a lock which protects your rbtree. Then to wake > up a process and remove the futex, you need to take the rbtree lock. Or > to unmap any memory you also need to take the rbtree lock and ensure > there are no futexes there. > > So you just add another lock for no reason, or have I got a few screws > loose myself? I don't see how you can significantly reduce lock > cacheline bouncing in a futex heavy workload if you're just going to > add another shared data structure. But if you can, sweet ;) We certainly can. But if you insist of using mmap sem at all, then we have a problem. 
rbtree would not reduce cacheline bouncing, so: We could use a hashtable (allocated on demand) of size N, N depending on NR_CPUS for example, each chain protected by a private spinlock. If N is well chosen, we might reduce lock cacheline bouncing. (Different threads fighting on different private futexes would have a good chance to get different cachelines in this hashtable.) As soon as a process enters 'private futex' code, the futex code allocates this hashtable if the process has a NULL hash table (set to NULL at exec() time, or maybe re-allocated because we want to be sure the futex syscall always succeeds (no ENOMEM)). So we really can... but for 'private futexes', which are the vast majority of futexes needed by a typical program (using POSIX pshared thread mutex attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc). Of course we would need a new syscall, and to change glibc to be able to actually use this new private_futex syscall. Probably a lot of work, still, but could help heavily threaded programs by not touching mmap_sem. We might have a refcounting problem on this 'hashtable' since several threads share this structure, but only at thread creation/destruction, not in the futex call (i.e. no cacheline bouncing on the refcount). Eric ^ permalink raw reply [flat|nested] 78+ messages in thread
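A minimal sketch of the per-process table proposed in the message above, assuming a 64-byte cache line and a fixed table of 64 buckets (the mail only says N would depend on NR_CPUS). The names priv_bucket, priv_table, priv_bucket_index and priv_bucket_for are hypothetical, and the int lock is only a stand-in for a kernel spinlock. The point it illustrates is that padding each chain to its own cache line keeps unrelated private futexes from contending on the same line:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE    64              /* assumed line size; real hardware may differ */
#define PRIV_BITS    6               /* hypothetical: 2^6 = 64 buckets */
#define PRIV_BUCKETS (1u << PRIV_BITS)

/* One chain per bucket, aligned so that two buckets never share a cache line. */
struct priv_bucket {
    _Alignas(CACHELINE) int lock;    /* stand-in for a spinlock */
    void *chain;                     /* stand-in for a list head */
};

static_assert(sizeof(struct priv_bucket) % CACHELINE == 0,
              "each bucket must occupy whole cache lines");

/* Hypothetical per-process table, allocated on first private futex use. */
struct priv_table {
    struct priv_bucket bucket[PRIV_BUCKETS];
};

/*
 * Map a futex address to a bucket index. Futex words are 4-byte aligned,
 * so the two low bits carry no information and are dropped before a
 * multiplicative hash spreads nearby addresses across the table.
 */
static size_t priv_bucket_index(uintptr_t uaddr)
{
    uint64_t h = ((uint64_t)uaddr >> 2) * UINT64_C(0x9E3779B97F4A7C15);
    return (size_t)(h >> (64 - PRIV_BITS));
}

/* Look up the bucket a given futex address would be queued on. */
static struct priv_bucket *priv_bucket_for(struct priv_table *t, uintptr_t uaddr)
{
    return &t->bucket[priv_bucket_index(uaddr)];
}
```

Whether such a table would beat the existing global hash table is exactly what the replies that follow dispute; the sketch takes no position on that.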
* Re: [RFC] NUMA futex hashing 2006-08-08 16:08 ` Eric Dumazet @ 2006-08-08 16:34 ` Nick Piggin 2006-08-08 16:49 ` Eric Dumazet 2006-08-08 16:58 ` Ulrich Drepper 1 sibling, 1 reply; 78+ messages in thread From: Nick Piggin @ 2006-08-08 16:34 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > We certainly can. But if you insist of using mmap sem at all, then we have a > problem. > > rbtree would not reduce cacheline bouncing, so : > > We could use a hashtable (allocated on demand) of size N, N depending on > NR_CPUS for example. each chain protected by a private spinlock. If N is well > chosen, we might reduce lock cacheline bouncing. (different threads fighting > on different private futexes would have a good chance to get different > cachelines in this hashtable) See other mail. We already have a hash table ;) -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-08 16:34 ` Nick Piggin @ 2006-08-08 16:49 ` Eric Dumazet 2006-08-08 16:59 ` Eric Dumazet 2006-08-09 1:56 ` Nick Piggin 0 siblings, 2 replies; 78+ messages in thread From: Eric Dumazet @ 2006-08-08 16:49 UTC (permalink / raw) To: Nick Piggin Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Tuesday 08 August 2006 18:34, Nick Piggin wrote: > Eric Dumazet wrote: > > We certainly can. But if you insist of using mmap sem at all, then we > > have a problem. > > > > rbtree would not reduce cacheline bouncing, so : > > > > We could use a hashtable (allocated on demand) of size N, N depending on > > NR_CPUS for example. each chain protected by a private spinlock. If N is > > well chosen, we might reduce lock cacheline bouncing. (different threads > > fighting on different private futexes would have a good chance to get > > different cachelines in this hashtable) > > See other mail. We already have a hash table ;) Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is private to this process. Giving the same info at FUTEX_WAKE time could avoid the kernel to make the second pass (using only a private futex lookup), avoiding again the mmap_sem touch in case no threads are waiting anymore on this futex. Eric ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-08 16:49 ` Eric Dumazet @ 2006-08-08 16:59 ` Eric Dumazet 2006-08-09 1:56 ` Nick Piggin 1 sibling, 0 replies; 78+ messages in thread From: Eric Dumazet @ 2006-08-08 16:59 UTC (permalink / raw) To: Nick Piggin Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Tuesday 08 August 2006 18:49, Eric Dumazet wrote: > On Tuesday 08 August 2006 18:34, Nick Piggin wrote: > > Eric Dumazet wrote: > > > We certainly can. But if you insist of using mmap sem at all, then we > > > have a problem. > > > > > > rbtree would not reduce cacheline bouncing, so : > > > > > > We could use a hashtable (allocated on demand) of size N, N depending > > > on NR_CPUS for example. each chain protected by a private spinlock. If > > > N is well chosen, we might reduce lock cacheline bouncing. (different > > > threads fighting on different private futexes would have a good chance > > > to get different cachelines in this hashtable) > > > > See other mail. We already have a hash table ;) > > Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is > private to this process. > > Giving the same info at FUTEX_WAKE time could avoid the kernel to make the > second pass (using only a private futex lookup), avoiding again the > mmap_sem touch in case no threads are waiting anymore on this futex. After looking at kernel/futex.c, I realize we also can avoid the atomic ops (and another cacheline bouncing) done in get_key_refs()/drop_key_refs(), touching the inode i_count or mm_count refcounter) Eric ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-08 16:49 ` Eric Dumazet 2006-08-08 16:59 ` Eric Dumazet @ 2006-08-09 1:56 ` Nick Piggin 1 sibling, 0 replies; 78+ messages in thread From: Nick Piggin @ 2006-08-09 1:56 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: >On Tuesday 08 August 2006 18:34, Nick Piggin wrote: > >>Eric Dumazet wrote: >> >>>We certainly can. But if you insist of using mmap sem at all, then we >>>have a problem. >>> >>>rbtree would not reduce cacheline bouncing, so : >>> >>>We could use a hashtable (allocated on demand) of size N, N depending on >>>NR_CPUS for example. each chain protected by a private spinlock. If N is >>>well chosen, we might reduce lock cacheline bouncing. (different threads >>>fighting on different private futexes would have a good chance to get >>>different cachelines in this hashtable) >>> >>See other mail. We already have a hash table ;) >> > >Yes but still you want at FUTEX_WAIT time to tell the kernel the futex is >private to this process. > Yes, but I'm saying we already have a hash table. The hash table. I'm *not* saying you *don't* also want a private directive from userspace. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-08 16:08 ` Eric Dumazet 2006-08-08 16:34 ` Nick Piggin @ 2006-08-08 16:58 ` Ulrich Drepper 2006-08-08 17:08 ` Eric Dumazet 2006-08-09 1:58 ` Nick Piggin 1 sibling, 2 replies; 78+ messages in thread From: Ulrich Drepper @ 2006-08-08 16:58 UTC (permalink / raw) To: Eric Dumazet Cc: Nick Piggin, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote: > So we really can... but for 'private futexes' which are the vast majority of > futexes needed by typical program (using POSIX pshared thread mutex attribute > PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc) Nonsense. Mutexes are by default always private. They explicitly have to be marked as sharable. This happens using the pthread_mutexattr_setpshared function which takes PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second parameter. So the former _is_ clearly used. > Of course we would need a new syscall, and to change glibc to be able to > actually use this new private_futex syscall. No, why? The kernel already does recognize private mutexes. It just checks whether the pages used to store it are private or mapped. This requires some interaction with the memory subsystem but as long as no crashes happen the data can change underneath. It's the program's fault if it does. On the waker side you would search the local futex hash table/tree first and if this doesn't yield a match, search the global table. Wakeup calls without any waiters are usually rare. ^ permalink raw reply [flat|nested] 78+ messages in thread
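The point about the default process-shared attribute can be checked from userspace with the standard POSIX calls. This is a small sketch, not glibc internals; the helper names default_pshared and requested_shared are made up for illustration, and -1 is used here only as an error marker:

```c
#include <assert.h>
#include <pthread.h>

/* Return the pshared setting of a freshly initialised attribute, or -1 on error. */
static int default_pshared(void)
{
    pthread_mutexattr_t attr;
    int pshared = -1;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_getpshared(&attr, &pshared) != 0)
        pshared = -1;
    pthread_mutexattr_destroy(&attr);
    return pshared;
}

/* Explicitly opt in to a process-shared mutex and read the setting back. */
static int requested_shared(void)
{
    pthread_mutexattr_t attr;
    int pshared = -1;

    if (pthread_mutexattr_init(&attr) != 0)
        return -1;
    if (pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) == 0 &&
        pthread_mutexattr_getpshared(&attr, &pshared) != 0)
        pshared = -1;
    pthread_mutexattr_destroy(&attr);
    return pshared;
}
```

On a POSIX-conforming system default_pshared() is expected to yield PTHREAD_PROCESS_PRIVATE. What the sketch cannot show is the thread's open question: this attribute lives in the library, and the futex call as it stood did not pass it to the kernel.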
* Re: [RFC] NUMA futex hashing 2006-08-08 16:58 ` Ulrich Drepper @ 2006-08-08 17:08 ` Eric Dumazet 2006-08-09 1:58 ` Nick Piggin 1 sibling, 0 replies; 78+ messages in thread From: Eric Dumazet @ 2006-08-08 17:08 UTC (permalink / raw) To: Ulrich Drepper Cc: Nick Piggin, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Tuesday 08 August 2006 18:58, Ulrich Drepper wrote: > On 8/8/06, Eric Dumazet <dada1@cosmosbay.com> wrote: > > So we really can... but for 'private futexes' which are the vast majority > > of futexes needed by typical program (using POSIX pshared thread mutex > > attribute PTHREAD_PROCESS_PRIVATE, currently not used by NPTL glibc) > > Nonsense. Mutexes are by default always private. They explicitly > have to be marked as sharable. This happens using the > pthread_mutexattr_setpshared function which takes > PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED in the second > parameter. So the former _is_ clearly used. > I was saying that PTHREAD_PROCESS_PRIVATE or PTHREAD_PROCESS_SHARED info is not provided to the kernel (because futex api/implementation dont need to). It was not an attack on glibc. > > Of course we would need a new syscall, and to change glibc to be able to > > actually use this new private_futex syscall. > > No, why? The kernel already does recognize private mutexes. It just > checks whether the pages used to store it are private or mapped. This > requires some interaction with the memory subsystem but as long as no > crashes happen the data can change underneath. It's the program's > fault if it does. But if you let futex code doing the vma walk to check the private/shared status, you still need the mmap_sem locking. 
Moreover, a program can mmap() a file (shared in terms of VMA), and continue to use a PTHREAD_PROCESS_PRIVATE mutex lying in this shared zone (example: a shmem or hugetlb mapping, whose API might always give a 'shared' vma) > > On the waker side you would search the local futex hash table/tree > first and if this doesn't yield a match, search the global table. > Wakeup calls without any waiters are usually rare. If the two searches touch two different cache lines in the hash table, we might have a performance regression. Of course we might choose a hash function so that the same slot is accessed. Eric ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-08 16:58 ` Ulrich Drepper 2006-08-08 17:08 ` Eric Dumazet @ 2006-08-09 1:58 ` Nick Piggin 2006-08-09 6:26 ` Eric Dumazet 1 sibling, 1 reply; 78+ messages in thread From: Nick Piggin @ 2006-08-09 1:58 UTC (permalink / raw) To: Ulrich Drepper Cc: Eric Dumazet, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Ulrich Drepper wrote: >> Of course we would need a new syscall, and to change glibc to be able to >> actually use this new private_futex syscall. > > > No, why? The kernel already does recognize private mutexes. It just > checks whether the pages used to store it are private or mapped. This > requires some interaction with the memory subsystem but as long as no > crashes happen the data can change underneath. It's the program's > fault if it does. Because that requires taking mmap_sem. Avoiding that is the whole purpose, isn't it? ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [RFC] NUMA futex hashing 2006-08-09 1:58 ` Nick Piggin @ 2006-08-09 6:26 ` Eric Dumazet 2006-08-09 6:43 ` Eric Dumazet 0 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2006-08-09 6:26 UTC (permalink / raw) To: Nick Piggin, Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai Cc: Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1347 bytes --] Based on various discussions and feedback, I cooked up a patch that implements the notion of private futexes (private to a process, in the spirit of POSIX pshared PTHREAD_PROCESS_PRIVATE). [PATCH] futex : Add new PRIVATE futex primitives for performance improvements When a futex is used privately by a process, we don't really need to look up the process's list of vmas in order to discover whether the futex is backed by an inode or by the mm struct. We don't really need to keep a refcount on the inode or mm either. This patch introduces new futex calls that could be used by userland (glibc, of course) when private futexes are used. Avoiding the vma lookup means avoiding taking the mmap_sem (and the cacheline bouncing that forces). Avoiding refcounting on the underlying inode or mm struct also avoids cacheline bouncing. That's two cacheline bounces avoided per futex syscall. glibc could use the new futex primitives introduced here (in particular for PTHREAD_PROCESS_PRIVATE semantics), and fall back to the old ones if running on an older kernel. The fallback could record the futex operation number to use in a global variable, so that only one failed syscall is done in the process lifetime. Note: Compatibility should be maintained by this patch, as old applications will use the 'SHARED' functionality, unchanged.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: futex_priv1.patch --] [-- Type: text/x-diff, Size: 9592 bytes --] --- linux-2.6.18-rc4/include/linux/futex.h 2006-08-08 22:46:13.000000000 +0200 +++ linux-2.6.18-rc4-ed/include/linux/futex.h 2006-08-08 23:23:13.000000000 +0200 @@ -15,6 +15,11 @@ #define FUTEX_LOCK_PI 6 #define FUTEX_UNLOCK_PI 7 #define FUTEX_TRYLOCK_PI 8 +#define FUTEX_WAIT_PRIVATE 9 +#define FUTEX_WAKE_PRIVATE 10 +#define FUTEX_REQUEUE_PRIVATE 11 +#define FUTEX_CMP_REQUEUE_PRIVATE 12 +#define FUTEX_WAKE_OP_PRIVATE 13 /* * Support for robust futexes: the kernel cleans up held futexes at --- linux-2.6.18-rc4/kernel/futex.c 2006-08-08 22:45:46.000000000 +0200 +++ linux-2.6.18-rc4-ed/kernel/futex.c 2006-08-09 07:26:19.000000000 +0200 @@ -60,8 +60,9 @@ * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key. */ +#define OFF_INODE 1 /* We set bit 0 if it has a reference on inode */ +#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */ union futex_key { struct { unsigned long pgoff; @@ -79,6 +80,8 @@ int offset; } both; }; +#define FUT_SHARED 1 /* we should walk vmas */ +#define FUT_PRIVATE 0 /* private futex: no need to walk vmas*/ /* * Priority Inheritance state: @@ -140,7 +143,7 @@ static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; /* Futex-fs vfsmount entry: */ -static struct vfsmount *futex_mnt; +static struct vfsmount *futex_mnt __read_mostly; /* * We hash on the keys returned from get_futex_key (see below). @@ -175,7 +178,7 @@ * * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/ -static int get_futex_key(u32 __user *uaddr, union futex_key *key) +static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -191,6 +194,11 @@ return -EINVAL; address -= key->both.offset; + if (shared == FUT_PRIVATE) { + key->private.mm = mm; + key->private.address = address; + return 0; + } /* * The futex is hashed differently depending on whether * it's in a shared or private mapping. So check vma first. @@ -215,6 +223,7 @@ * mappings of _writable_ handles. */ if (likely(!(vma->vm_flags & VM_MAYSHARE))) { + key->both.offset += OFF_MMSHARED; /* reference taken on mm */ key->private.mm = mm; key->private.address = address; return 0; @@ -224,7 +233,7 @@ * Linear file mappings are also simple. */ key->shared.inode = vma->vm_file->f_dentry->d_inode; - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ + key->both.offset += OFF_INODE; /* reference taken on inode */ if (likely(!(vma->vm_flags & VM_NONLINEAR))) { key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff); @@ -256,8 +265,8 @@ */ static inline void get_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + if (key->both.offset & OFF_INODE) atomic_inc(&key->shared.inode->i_count); else atomic_inc(&key->private.mm->mm_count); @@ -270,8 +279,8 @@ */ static void drop_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + if (key->both.offset & OFF_INODE) iput(key->shared.inode); else mmdrop(key->private.mm); @@ -650,7 +659,7 @@ * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(u32 __user *uaddr, int nr_wake) +static int futex_wake(u32 __user *uaddr, int nr_wake, int shared) { struct futex_hash_bucket *hb; struct futex_q 
*this, *next; @@ -658,9 +667,10 @@ union futex_key key; int ret; - down_read(&current->mm->mmap_sem); + if (shared) + down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -682,7 +692,8 @@ spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); return ret; } @@ -692,7 +703,7 @@ */ static int futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_wake2, int op) + int nr_wake, int nr_wake2, int op, int shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -703,10 +714,10 @@ retryfull: down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -802,7 +813,7 @@ * physical page. */ static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_requeue, u32 *cmpval) + int nr_wake, int nr_requeue, u32 *cmpval, int shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -811,12 +822,13 @@ int ret, drop_count = 0; retry: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -839,7 +851,8 @@ * If we would have faulted, release mmap_sem, fault * it in and start all over again.
*/ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); ret = get_user(curval, uaddr1); @@ -888,7 +901,8 @@ drop_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); return ret; } @@ -999,7 +1013,7 @@ drop_key_refs(&q->key); } -static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time) +static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int shared) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1010,9 +1024,10 @@ q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1047,7 +1062,8 @@ * If we would have faulted, release mmap_sem, fault it in and * start all over again. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); ret = get_user(uval, uaddr); @@ -1066,7 +1082,8 @@ * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep.
*/ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); /* * There might have been scheduling since the queue_me(), as we @@ -1108,7 +1125,8 @@ queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); return ret; } @@ -1134,7 +1152,7 @@ retry: down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, FUT_SHARED); if (unlikely(ret != 0)) goto out_release_sem; @@ -1435,7 +1453,7 @@ */ down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, FUT_SHARED); if (unlikely(ret != 0)) goto out; @@ -1551,7 +1569,7 @@ return ret; } -static struct file_operations futex_fops = { +static const struct file_operations futex_fops = { .release = futex_close, .poll = futex_poll, }; @@ -1600,7 +1618,7 @@ q->pi_state = NULL; down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + err = get_futex_key(uaddr, &q->key, FUT_SHARED); if (unlikely(err != 0)) { up_read(&current->mm->mmap_sem); @@ -1742,7 +1760,7 @@ */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake(uaddr, 1); + futex_wake(uaddr, 1, FUT_SHARED); } } return 0; @@ -1830,23 +1848,38 @@ switch (op) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout); + ret = futex_wait(uaddr, val, timeout, FUT_SHARED); + break; + case FUTEX_WAIT_PRIVATE: + ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, val, FUT_SHARED); + break; + case FUTEX_WAKE_PRIVATE: + ret = futex_wake(uaddr, val, FUT_PRIVATE); break; case FUTEX_FD: /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */ ret = futex_fd(uaddr, val); break; case FUTEX_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, NULL); + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED); + break; + case FUTEX_REQUEUE_PRIVATE: + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE); break; case
FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3); + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED); + break; + case FUTEX_CMP_REQUEUE_PRIVATE: + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3); + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED); + break; + case FUTEX_WAKE_OP_PRIVATE: + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE); break; case FUTEX_LOCK_PI: ret = futex_lock_pi(uaddr, val, timeout, val2, 0); ^ permalink raw reply [flat|nested] 78+ messages in thread
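The glibc fallback suggested in the patch description could look roughly like the sketch below. It is an assumption-laden illustration, not glibc code: the op numbers are the ones proposed in this patch (they are not what mainline Linux later assigned), the syscall is reached through a hypothetical function pointer so the logic can be exercised without a patched kernel, a simple flag stands in for the "global variable" the description mentions, and an unpatched kernel is assumed to reject unknown ops with -ENOSYS:

```c
#include <assert.h>
#include <errno.h>

/* Op codes as proposed in this patch, for illustration only. */
#define FUTEX_WAKE          1
#define FUTEX_WAKE_PRIVATE 10

/* Hypothetical hook standing in for the futex syscall. */
typedef int (*futex_fn)(int *uaddr, int op, int val);

/* Set once, after the first rejected private op; never probed again. */
static int private_ops_missing;

/* Wake up to nr waiters, preferring the private op when the kernel has it. */
static int wake_private(futex_fn do_futex, int *uaddr, int nr)
{
    if (!private_ops_missing) {
        int ret = do_futex(uaddr, FUTEX_WAKE_PRIVATE, nr);
        if (ret != -ENOSYS)
            return ret;
        private_ops_missing = 1;    /* remember the failure for the process lifetime */
    }
    return do_futex(uaddr, FUTEX_WAKE, nr);
}

/* Test double: an "old kernel" that only knows FUTEX_WAKE and counts rejections. */
static int old_kernel_rejects;

static int old_kernel(int *uaddr, int op, int val)
{
    (void)uaddr;
    if (op != FUTEX_WAKE) {
        old_kernel_rejects++;
        return -ENOSYS;
    }
    return val;                     /* pretend val waiters were woken */
}
```

With the old_kernel stub, the first wake_private() call costs one rejected attempt and every later call goes straight to FUTEX_WAKE, which is the "only one failed syscall" behaviour the description asks for.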
* Re: [RFC] NUMA futex hashing 2006-08-09 6:26 ` Eric Dumazet @ 2006-08-09 6:43 ` Eric Dumazet 2007-03-15 19:10 ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet ` (3 more replies) 0 siblings, 4 replies; 78+ messages in thread From: Eric Dumazet @ 2006-08-09 6:43 UTC (permalink / raw) To: Nick Piggin Cc: Ulrich Drepper, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel (patch inlined this time) Based on various discussions and feedback, I cooked up a patch that implements the notion of private futexes (private to a process, in the spirit of POSIX pshared PTHREAD_PROCESS_PRIVATE). [PATCH] futex : Add new PRIVATE futex primitives for performance improvements When a futex is used privately by a process, we don't really need to look up the process's list of vmas in order to discover whether the futex is backed by an inode or by the mm struct. We don't really need to keep a refcount on the inode or mm either. This patch introduces new futex calls that could be used by userland (glibc, of course) when private futexes are used. Avoiding the vma lookup means avoiding taking the mmap_sem (and the cacheline bouncing that forces). Avoiding refcounting on the underlying inode or mm struct also avoids cacheline bouncing. That's two cacheline bounces avoided per futex syscall. glibc could use the new futex primitives introduced here (in particular for PTHREAD_PROCESS_PRIVATE semantics), and fall back to the old ones if running on an older kernel. The fallback could record the futex operation number to use in a global variable, so that only one failed syscall is done in the process lifetime. Note: Compatibility should be maintained by this patch, as old applications will use the 'SHARED' functionality, unchanged.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- linux-2.6.18-rc4/include/linux/futex.h 2006-08-08 22:46:13.000000000 +0200 +++ linux-2.6.18-rc4-ed/include/linux/futex.h 2006-08-08 23:23:13.000000000 +0200 @@ -15,6 +15,11 @@ #define FUTEX_LOCK_PI 6 #define FUTEX_UNLOCK_PI 7 #define FUTEX_TRYLOCK_PI 8 +#define FUTEX_WAIT_PRIVATE 9 +#define FUTEX_WAKE_PRIVATE 10 +#define FUTEX_REQUEUE_PRIVATE 11 +#define FUTEX_CMP_REQUEUE_PRIVATE 12 +#define FUTEX_WAKE_OP_PRIVATE 13 /* * Support for robust futexes: the kernel cleans up held futexes at --- linux-2.6.18-rc4/kernel/futex.c 2006-08-08 22:45:46.000000000 +0200 +++ linux-2.6.18-rc4-ed/kernel/futex.c 2006-08-09 07:26:19.000000000 +0200 @@ -60,8 +60,9 @@ * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key. */ +#define OFF_INODE 1 /* We set bit 0 if it has a reference on inode */ +#define OFF_MMSHARED 2 /* We set bit 1 if it has a reference on mm */ union futex_key { struct { unsigned long pgoff; @@ -79,6 +80,8 @@ int offset; } both; }; +#define FUT_SHARED 1 /* we should walk vmas */ +#define FUT_PRIVATE 0 /* private futex: no need to walk vmas*/ /* * Priority Inheritance state: @@ -140,7 +143,7 @@ static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; /* Futex-fs vfsmount entry: */ -static struct vfsmount *futex_mnt; +static struct vfsmount *futex_mnt __read_mostly; /* * We hash on the keys returned from get_futex_key (see below). @@ -175,7 +178,7 @@ * * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
*/ -static int get_futex_key(u32 __user *uaddr, union futex_key *key) +static int get_futex_key(u32 __user *uaddr, union futex_key *key, int shared) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -191,6 +194,11 @@ return -EINVAL; address -= key->both.offset; + if (shared == FUT_PRIVATE) { + key->private.mm = mm; + key->private.address = address; + return 0; + } /* * The futex is hashed differently depending on whether * it's in a shared or private mapping. So check vma first. @@ -215,6 +223,7 @@ * mappings of _writable_ handles. */ if (likely(!(vma->vm_flags & VM_MAYSHARE))) { + key->both.offset += OFF_MMSHARED; /* reference taken on mm */ key->private.mm = mm; key->private.address = address; return 0; @@ -224,7 +233,7 @@ * Linear file mappings are also simple. */ key->shared.inode = vma->vm_file->f_dentry->d_inode; - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ + key->both.offset += OFF_INODE; /* reference taken on inode */ if (likely(!(vma->vm_flags & VM_NONLINEAR))) { key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff); @@ -256,8 +265,8 @@ */ static inline void get_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + if (key->both.offset & OFF_INODE) atomic_inc(&key->shared.inode->i_count); else atomic_inc(&key->private.mm->mm_count); @@ -270,8 +279,8 @@ */ static void drop_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + if (key->both.offset & OFF_INODE) iput(key->shared.inode); else mmdrop(key->private.mm); @@ -650,7 +659,7 @@ * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(u32 __user *uaddr, int nr_wake) +static int futex_wake(u32 __user *uaddr, int nr_wake, int shared) { struct futex_hash_bucket *hb; struct futex_q 
*this, *next; @@ -658,9 +667,10 @@ union futex_key key; int ret; - down_read(&current->mm->mmap_sem); + if (shared) + down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -682,7 +692,8 @@ spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); return ret; } @@ -692,7 +703,7 @@ */ static int futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_wake2, int op) + int nr_wake, int nr_wake2, int op, int shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -703,10 +714,10 @@ retryfull: down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -802,7 +813,7 @@ * physical page. */ static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_requeue, u32 *cmpval) + int nr_wake, int nr_requeue, u32 *cmpval, int shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -811,12 +822,13 @@ int ret, drop_count = 0; retry: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -839,7 +851,8 @@ * If we would have faulted, release mmap_sem, fault * it in and start all over again.
*/ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); ret = get_user(curval, uaddr1); @@ -888,7 +901,8 @@ drop_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(&current->mm->mmap_sem); return ret; } @@ -999,7 +1013,7 @@ drop_key_refs(&q->key); } -static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time) +static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, int shared) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1010,9 +1024,10 @@ q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1047,7 +1062,8 @@ * If we would have faulted, release mmap_sem, fault it in and * start all over again. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); ret = get_user(uval, uaddr); @@ -1066,7 +1082,8 @@ * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep.
*/ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); /* * There might have been scheduling since the queue_me(), as we @@ -1108,7 +1125,8 @@ queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(&curr->mm->mmap_sem); return ret; } @@ -1134,7 +1152,7 @@ retry: down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, FUT_SHARED); if (unlikely(ret != 0)) goto out_release_sem; @@ -1435,7 +1453,7 @@ */ down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, FUT_SHARED); if (unlikely(ret != 0)) goto out; @@ -1551,7 +1569,7 @@ return ret; } -static struct file_operations futex_fops = { +static const struct file_operations futex_fops = { .release = futex_close, .poll = futex_poll, }; @@ -1600,7 +1618,7 @@ q->pi_state = NULL; down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + err = get_futex_key(uaddr, &q->key, FUT_SHARED); if (unlikely(err != 0)) { up_read(&current->mm->mmap_sem); @@ -1742,7 +1760,7 @@ */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake(uaddr, 1); + futex_wake(uaddr, 1, FUT_SHARED); } } return 0; @@ -1830,23 +1848,38 @@ switch (op) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout); + ret = futex_wait(uaddr, val, timeout, FUT_SHARED); + break; + case FUTEX_WAIT_PRIVATE: + ret = futex_wait(uaddr, val, timeout, FUT_PRIVATE); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, val, FUT_SHARED); + break; + case FUTEX_WAKE_PRIVATE: + ret = futex_wake(uaddr, val, FUT_PRIVATE); break; case FUTEX_FD: /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */ ret = futex_fd(uaddr, val); break; case FUTEX_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, NULL); + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_SHARED); + break; + case FUTEX_REQUEUE_PRIVATE: + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, FUT_PRIVATE); break; case
FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3); + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_SHARED); + break; + case FUTEX_CMP_REQUEUE_PRIVATE: + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, FUT_PRIVATE); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3); + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_SHARED); + break; + case FUTEX_WAKE_OP_PRIVATE: + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, FUT_PRIVATE); break; case FUTEX_LOCK_PI: ret = futex_lock_pi(uaddr, val, timeout, val2, 0); ^ permalink raw reply [flat|nested] 78+ messages in thread
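The pattern this patch repeats in futex_wake(), futex_requeue() and futex_wait() is "take mmap_sem for reading only when the futex is shared". A minimal userspace sketch of that pattern follows. It is not kernel code: struct mock_rwsem, mock_down_read(), mock_up_read() and mock_futex_wake() are hypothetical stand-ins, and the numeric values given to FUT_PRIVATE and FUT_SHARED are assumed for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for mm->mmap_sem: it only counts readers. */
struct mock_rwsem {
	int readers;		/* readers currently holding the lock */
	int acquisitions;	/* total number of read acquisitions */
};

static void mock_down_read(struct mock_rwsem *sem)
{
	sem->readers++;
	sem->acquisitions++;
}

static void mock_up_read(struct mock_rwsem *sem)
{
	assert(sem->readers > 0);	/* every release must match an acquire */
	sem->readers--;
}

/* Values assumed for this sketch only. */
enum { FUT_PRIVATE = 0, FUT_SHARED = 1 };

/*
 * Shape of the patched futex_wake(): the semaphore is touched only on
 * the shared path, and acquire and release are guarded by the same flag
 * so they always pair up.
 */
static int mock_futex_wake(struct mock_rwsem *mmap_sem, int shared)
{
	int ret = 0;

	if (shared)
		mock_down_read(mmap_sem);

	/* key lookup and hash bucket walk would happen here */

	if (shared)
		mock_up_read(mmap_sem);
	return ret;
}
```

The point of guarding both calls with the same flag is that a private futex never writes to the mmap_sem cache line at all, while the shared path keeps its existing locking unchanged.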
* [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2006-08-09  6:43 ` Eric Dumazet
@ 2007-03-15 19:10   ` Eric Dumazet
  2007-03-15 20:15     ` Nick Piggin
                       ` (2 more replies)
  2007-03-15 19:13   ` [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic Eric Dumazet
                     ` (2 subsequent siblings)
  3 siblings, 3 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:10 UTC (permalink / raw)
  To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
  Cc: Andi Kleen, Ravikiran G Thirumalai,
      Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Hi

I'm pleased to present these patches which improve linux futex performance and
scalability, on both UP, SMP and NUMA configs.

I had this idea last year but I was not understood, probably because I gave
not enough explanations. Sorry if this mail is really long...

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic Increment of reference counter on an inode or a mm

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part
   may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable :
  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
optimal performance if carefully implemented. Time has come for linux to have
better threading performance.

PTHREAD_PROCESS_PRIVATE semantic allows implementation to use separate
repositories :
- One 'global' namespace for all PROCESS_SHARED futexes.
- One "per process private repository" for PROCESS_PRIVATE futexes.
  This repository is NUMA aware, it is allocated the first time a process
  issues a futex(XXXX_PRIVATE) call.
  If allocation is not possible because of memory shortage, we just fall back
  using the central repository.

The goal is to permit new futex commands to avoid :
- Using the central hash table (still used by PTHREAD_PROCESS_SHARED futexes)
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

This is why these patches :

1) Define new futex subcommands (basically adding a _PRIVATE flag)
   Avoids using mmap_sem, and ref counter on inode or mm.

2) Allows each process to have a private repository (a small hash table)
   where its PROCESS_PRIVATE active futexes are stored, instead of the global
   repository.
   if CONFIG_BASE_SMALL, we still use the global repository

3) NUMA optimization : we allocate the global hash table with vmalloc()

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fall back on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.
Only PTHREAD_PROCESS_SHARED futexes should use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
from these changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
450 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
337 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
332 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
214 cycles per futex(1000) call (returns ENOSYS)
186 cycles per getppid() call
187 cycles per umask() call
182 cycles per ni_syscall() call

Thank you for reading this mail

[PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic
[PATCH 2/3] FUTEX : introduce private hashtables
[PATCH 3/3] FUTEX : NUMA friendly global hashtable

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
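The libc-side fallback suggested above ("one global variable with the FUTEX_PRIVATE_FLAG or 0 value") could look roughly like the sketch below. This is an assumption-laden illustration, not glibc code: futex_private_flag, futex_private_op(), the futex_fn callback type and the old_kernel_futex() mock are all hypothetical names. The flag value 128 is the one proposed by the patch posted later in this thread, and the ENOSYS behaviour for an unknown command matches the futex(1000) line in the benchmark above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Command values as used in this thread, defined locally for the sketch. */
#define FUTEX_WAIT		0
#define FUTEX_WAKE		1
#define FUTEX_PRIVATE_FLAG	128

/* Hypothetical process-wide flag: FUTEX_PRIVATE_FLAG, or 0 on old kernels. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

/*
 * Hypothetical stand-in for the futex syscall, so that the sketch can be
 * exercised without a kernel that knows the new commands. It follows the
 * syscall convention of returning -1 and setting errno on failure.
 */
typedef long (*futex_fn)(int *uaddr, int op, int val);

/* Issue a PROCESS_PRIVATE futex command, falling back once on ENOSYS. */
static long futex_private_op(futex_fn do_futex, int *uaddr, int cmd, int val)
{
	long ret;

	assert(do_futex != NULL);
	ret = do_futex(uaddr, cmd | futex_private_flag, val);
	if (ret == -1 && errno == ENOSYS && futex_private_flag != 0) {
		/* Old kernel: remember that, then retry the old command. */
		futex_private_flag = 0;
		ret = do_futex(uaddr, cmd, val);
	}
	return ret;
}

/* Mock of a kernel that predates the _PRIVATE commands. */
static long old_kernel_futex(int *uaddr, int op, int val)
{
	(void)uaddr;
	(void)val;
	if (op & FUTEX_PRIVATE_FLAG) {
		errno = ENOSYS;
		return -1;
	}
	return 0;
}
```

After the first ENOSYS the flag stays 0, so later calls on an old kernel pay no extra syscall. A real implementation would also have to decide how to publish the flag update between threads, which this sketch ignores.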
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10   ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
@ 2007-03-15 20:15     ` Nick Piggin
  2007-03-16  8:05     ` Peter Zijlstra
  2007-04-04  7:16     ` Ulrich Drepper
  2 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 20:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

Eric Dumazet wrote:
> Hi
>
> I'm pleased to present these patches which improve linux futex performance and
> scalability, on both UP, SMP and NUMA configs.
>
> I had this idea last year but I was not understood, probably because I gave
> not enough explanations. Sorry if this mail is really long...

Yes please, this is really nice. The mmap_sem is already overworked just
doing real mm stuff, let alone making it the central point of our "scalable"
thread synchronisation syscall.

To summarise: not only will this make futexes scale much better, but it will
also take some load off mmap_sem.

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10   ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
  2007-03-15 20:15     ` Nick Piggin
@ 2007-03-16  8:05     ` Peter Zijlstra
  2007-03-16  9:30       ` Eric Dumazet
  2007-04-04  7:16     ` Ulrich Drepper
  2 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16 8:05 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> Hi
>
> I'm pleased to present these patches which improve linux futex performance and
> scalability, on both UP, SMP and NUMA configs.
>
> I had this idea last year but I was not understood, probably because I gave
> not enough explanations. Sorry if this mail is really long...

I started playing with it after your last reference to it, I have some
code here (against -rt):
  http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/

Which I will post once I have found what keeps pthread_join() from
completing :-(

It basically adds a per task vma lookup cache which can also activate
the private logic without explicit use of the new interface.
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16  8:05     ` Peter Zijlstra
@ 2007-03-16  9:30       ` Eric Dumazet
  2007-03-16 10:10         ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-03-16 9:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On Friday 16 March 2007 09:05, Peter Zijlstra wrote:
> On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> > Hi
> >
> > I'm pleased to present these patches which improve linux futex
> > performance and scalability, on both UP, SMP and NUMA configs.
> >
> > I had this idea last year but I was not understood, probably because I
> > gave not enough explanations. Sorry if this mail is really long...
>
> I started playing with it after your last reference to it, I have some
> code here (against -rt):
>   http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
>
> Which I will post once I have found what keeps pthread_join() from
> completing :-(
>
> It basically adds a per task vma lookup cache which can also activate
> the private logic without explicit use of the new interface.

Hi Peter

I don't think yet another cache will help in the general case.
A typical program uses many vmas at once...

glibc has internal futexes, on a different vma than futexes declared in your
program. Each shared library is going to have its own vma for its data (and
futexes)

(244 vmas on one kmail program for example)

About your guess_futex_shared() thing, I miss the vma_anon() definition.
But if it has to walk the vmas (and take mmap_sem), you already lose the
PRIVATE benefit.
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16  9:30       ` Eric Dumazet
@ 2007-03-16 10:10         ` Peter Zijlstra
  2007-03-16 10:30           ` Eric Dumazet
  0 siblings, 1 reply; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16 10:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On Fri, 2007-03-16 at 10:30 +0100, Eric Dumazet wrote:
> On Friday 16 March 2007 09:05, Peter Zijlstra wrote:
> > On Thu, 2007-03-15 at 20:10 +0100, Eric Dumazet wrote:
> > > Hi
> > >
> > > I'm pleased to present these patches which improve linux futex
> > > performance and scalability, on both UP, SMP and NUMA configs.
> > >
> > > I had this idea last year but I was not understood, probably because I
> > > gave not enough explanations. Sorry if this mail is really long...
> >
> > I started playing with it after your last reference to it, I have some
> > code here (against -rt):
> >   http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/
> >
> > Which I will post once I have found what keeps pthread_join() from
> > completing :-(
> >
> > It basically adds a per task vma lookup cache which can also activate
> > the private logic without explicit use of the new interface.
>
> Hi Peter
>
> I don't think yet another cache will help in the general case.
> A typical program uses many vmas at once...
>
> glibc has internal futexes, on a different vma than futexes declared in your
> program. Each shared library is going to have its own vma for its data (and
> futexes)
>
> (244 vmas on one kmail program for example)

Yeah, I was just hoping a few cache entries would be enough to get the
worst of them. A benchmark will have to tell I guess.

> About your guess_futex_shared() thing, I miss the vma_anon() definition.

http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch

> But if it has to walk the vmas (and take mmap_sem), you already lose the
> PRIVATE benefit.

It doesn't take mmap_sem, I am aware of the problems.
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16 10:10         ` Peter Zijlstra
@ 2007-03-16 10:30           ` Eric Dumazet
  2007-03-16 10:36             ` Peter Zijlstra
  0 siblings, 1 reply; 78+ messages in thread
From: Eric Dumazet @ 2007-03-16 10:30 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On Friday 16 March 2007 11:10, Peter Zijlstra wrote:

> http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch

Oh thanks

> > But if it has to walk the vmas (and take mmap_sem), you already lose the
> > PRIVATE benefit.
>
> It doesn't take mmap_sem, I am aware of the problems.

Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on
x86_64 (sizeof(struct vma_cache) = 136)
and dirty one bit, so it might be more expensive than the mmap_sem ...
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-16 10:30           ` Eric Dumazet
@ 2007-03-16 10:36             ` Peter Zijlstra
  0 siblings, 0 replies; 78+ messages in thread
From: Peter Zijlstra @ 2007-03-16 10:36 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On Fri, 2007-03-16 at 11:30 +0100, Eric Dumazet wrote:
> On Friday 16 March 2007 11:10, Peter Zijlstra wrote:
>
> > http://programming.kicks-ass.net/kernel-patches/futex-vma-cache/vma_cache.patch
>
> Oh thanks
>
> > > But if it has to walk the vmas (and take mmap_sem), you already lose the
> > > PRIVATE benefit.
> >
> > It doesn't take mmap_sem, I am aware of the problems.
>
> Yes but the vma_anon() -> vma_cache_find() needs to read 3 cache lines on
> x86_64 (sizeof(struct vma_cache) = 136)
> and dirty one bit, so it might be more expensive than the mmap_sem ...

I thought the cacheline was 128 bytes, but point taken.
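The "3 cache lines" figure in this exchange can be checked with a little arithmetic. The sketch below assumes the structure starts on a cache-line boundary and that the x86_64 line size Eric has in mind is 64 bytes; neither is stated in the thread, and cache_lines_spanned() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of cache lines touched by reading an object of object_size
 * bytes that starts on a cache-line boundary (rounding up).
 */
static size_t cache_lines_spanned(size_t object_size, size_t line_size)
{
	assert(line_size > 0);
	return (object_size + line_size - 1) / line_size;
}
```

With sizeof(struct vma_cache) = 136, a 64-byte line gives 3 lines, matching Eric's count, while the 128-byte line Peter had in mind would give 2. Either way the lookup touches more than one line, which is why Peter concedes the point.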
* Re: [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements
  2007-03-15 19:10   ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet
  2007-03-15 20:15     ` Nick Piggin
  2007-03-16  8:05     ` Peter Zijlstra
@ 2007-04-04  7:16     ` Ulrich Drepper
  2007-04-05 17:49       ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet
  2 siblings, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2007-04-04 7:16 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Nick Piggin, Andrew Morton, Ingo Molnar, Andi Kleen,
      Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
      pravin b shelar, linux-kernel

On 3/15/07, Eric Dumazet <dada1@cosmosbay.com> wrote:
> But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
> optimal performance if carefully implemented. Time has come for linux to have
> better threading performance.

Now that I've been pointed to the thread I can comment on it.

Yes, this approach makes a lot of sense. Programs which share sync
objects without declaring them correctly are broken and deserve to
fail.

It would be quite a lot of change in libpthread but it's manageable.
I haven't tested the code nor will I likely do until I get a Fedora
kernel with it. So, convince DaveJ to take it for testing.
* [PATCH] FUTEX : new PRIVATE futexes
  2007-04-04  7:16     ` Ulrich Drepper
@ 2007-04-05 17:49       ` Eric Dumazet
  2007-04-05 20:43         ` Ulrich Drepper
                           ` (4 more replies)
  0 siblings, 5 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-05 17:49 UTC (permalink / raw)
  To: Ulrich Drepper, Andrew Morton, Dave Jones
  Cc: Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai,
      Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Hi all

I'm pleased to present this patch which improves linux futexes performance and
scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon
as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against
2.6.21-rc5-mm4 (so supports 64bit futexes)

In this third version I dropped the NUMA optims and process private hash
table, to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part
   may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable :
  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time
  because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
optimal performance if carefully implemented. Time has come for linux to have
better threading performance.

The goal is to permit new futex commands to avoid :
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fall back on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic. Almost all applications should benefit
from these changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   45 +++++-
 kernel/futex.c        |  294 +++++++++++++++++++++++++---------------
 2 files changed, 230 insertions(+), 109 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI 8
 #define FUTEX_CMP_REQUEUE_PI 9
 
+#define FUTEX_PRIVATE_FLAG 128
+#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG) +#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG) +#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG) + /* * Support for robust futexes: the kernel cleans up held futexes at * thread exit time. @@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key. + * We use the two low order bits of offset to tell what is the kind of key : + * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE) + * (no reference on an inode or mm) + * 01 : Shared futex (PTHREAD_PROCESS_SHARED) + * mapped on a file (reference on the underlying inode) + * 10 : Shared futex (PTHREAD_PROCESS_SHARED) + * (but private mapping on an mm, and reference taken on it) */ + +#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ +#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ + union futex_key { unsigned long __user *uaddr; struct { @@ -135,8 +157,27 @@ union futex_key { int offset; } both; }; -int get_futex_key(void __user *uaddr, union futex_key *key); + +/** + * get_futex_key - Get parameters which are the keys for a futex. + * @uaddr: virtual address of the futex + * @shared: NULL for a PROCESS_PRIVATE futex, + * &current->mm->mmap_sem for a PROCESS_SHARED futex + * @key: address where result is stored.
+ * + * Returns an error code or 0 + */ +int get_futex_key(void __user *uaddr, union futex_key *key, + struct rw_semaphore *shared); + +/** + * get_futex_key_refs - Take a reference to the resource addressed by a key + */ void get_futex_key_refs(union futex_key *key); + +/** + * drop_futex_key_refs - Drop a reference to the resource addressed by a key. + */ void drop_futex_key_refs(union futex_key *key); #ifdef CONFIG_FUTEX --- linux-2.6.21-rc5-mm4/kernel/futex.c +++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c @@ -16,6 +16,9 @@ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> * + * PRIVATE futexes by Eric Dumazet + * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> + * * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly * enough at me, Linus for the original (flawed) idea, Matthew * Kirkwood for proof-of-concept implementation. @@ -199,9 +202,12 @@ static inline int match_futex(union fute * Returns: 0, or negative error code. * The key words are stored in *key on success. * - * Should be called with &current->mm->mmap_sem but NOT any spinlocks. + * shared is NULL for PROCESS_PRIVATE futexes + * For other futexes, it points to &current->mm->mmap_sem and + * caller must have taken the reader lock. but NOT any spinlocks. */ -int get_futex_key(void __user *uaddr, union futex_key *key) +int get_futex_key(void __user *uaddr, union futex_key *key, + struct rw_semaphore *shared) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un address -= key->both.offset; /* + * PROCESS_PRIVATE futexes are fast. + * As the mm cannot disappear under us and the 'key' only needs + * virtual address, we dont even have to find the underlying vma.
+ * Note : We do have to check 'address' is a valid user address, + * but access_ok() should be faster than find_vma() + * Note : At this point, address points to the start of page, + * not the real futex address, this is ok. + */ + if (!shared) { + if (!access_ok(VERIFY_WRITE, address, sizeof(int))) + return -EFAULT; + key->private.mm = mm; + key->private.address = address; + return 0; + } + /* * The futex is hashed differently depending on whether * it's in a shared or private mapping. So check vma first. */ @@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un * mappings of _writable_ handles. */ if (likely(!(vma->vm_flags & VM_MAYSHARE))) { + key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm */ key->private.mm = mm; key->private.address = address; return 0; @@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un * Linear file mappings are also simple. */ key->shared.inode = vma->vm_file->f_path.dentry->d_inode; - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ + key->both.offset += FUT_OFF_INODE; /* inode-based key. */ if (likely(!(vma->vm_flags & VM_NONLINEAR))) { key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff); @@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key); * Take a reference to the resource addressed by a key. * Can be called while holding spinlocks. * - * NOTE: mmap_sem MUST be held between get_futex_key() and calling this - * function, if it is called at all. mmap_sem keeps key->shared.inode valid. 
*/ inline void get_futex_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.ptr == 0) + return; + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { + case FUT_OFF_INODE: atomic_inc(&key->shared.inode->i_count); - else + break; + case FUT_OFF_MMSHARED: atomic_inc(&key->private.mm->mm_count); - } + break; + } } EXPORT_SYMBOL_GPL(get_futex_key_refs); @@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs); */ void drop_futex_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.ptr == 0) + return; + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { + case FUT_OFF_INODE: iput(key->shared.inode); - else + break; + case FUT_OFF_MMSHARED: mmdrop(key->private.mm); + break; } } EXPORT_SYMBOL_GPL(drop_futex_key_refs); @@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de } /* - * Fault handling. Called with current->mm->mmap_sem held. + * Fault handling. + * if shared is non NULL, current->mm->mmap_sem is already held */ -static int futex_handle_fault(unsigned long address, int attempt) +static int futex_handle_fault(unsigned long address, int attempt, + struct rw_semaphore *shared) { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + int ret = 0; - if (attempt > 2 || !(vma = find_vma(mm, address)) || - vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) + if (attempt > 2) return -EFAULT; - switch (handle_mm_fault(mm, vma, address, 1)) { - case VM_FAULT_MINOR: - current->min_flt++; - break; - case VM_FAULT_MAJOR: - current->maj_flt++; - break; - default: - return -EFAULT; - } - return 0; + if (!shared) + down_read(&mm->mmap_sem); + + if (!(vma = find_vma(mm, address)) || + vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) + ret = -EFAULT; + + else + switch (handle_mm_fault(mm, vma, address, 1)) { + case VM_FAULT_MINOR: + current->min_flt++; + break; + case VM_FAULT_MAJOR: + current->maj_flt++; + break; + 
default: + ret = -EFAULT; + } + if (!shared) + up_read(&mm->mmap_sem); + return ret; } /* @@ -705,7 +746,8 @@ double_lock_hb(struct futex_hash_bucket * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(unsigned long __user *uaddr, int nr_wake) +static int futex_wake(unsigned long __user *uaddr, int nr_wake, + struct rw_semaphore *shared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -713,9 +755,10 @@ static int futex_wake(unsigned long __us union futex_key key; int ret; - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -807,7 +851,8 @@ retry: */ static int futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2, - int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64) + int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64, + struct rw_semaphore *shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -825,12 +870,13 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -998,7 +1044,8 @@ out: */ static int futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2, - int nr_wake, int nr_wake2, int op, int futex64) + int nr_wake, int nr_wake2, int op, int futex64, + struct rw_semaphore *shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -1007,12 +1054,13 @@ futex_wake_op(unsigned long
__user *uadd int ret, op_ret, attempt = 0; retryfull: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -1055,15 +1103,14 @@ retry: * futex_atomic_op_inuser needs to both read and write * *(int __user *)uaddr2, but we can't modify it * non-atomically. Therefore, if get_user below is not - * enough, we need to handle the fault ourselves, while - * still holding the mmap_sem. + * enough, we need to handle the fault ourselves. Make + * sure we hold mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr2, - attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr2, + attempt, shared); + if (ret) goto out; - } goto retry; } @@ -1071,7 +1118,8 @@ retry: * If we would have faulted, release mmap_sem, * fault it in and start all over again.
*/ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = futex_get_user(&dummy, uaddr2, futex64); if (ret) @@ -1108,7 +1156,8 @@ retry: if (hb1 != hb2) spin_unlock(&hb2->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -1118,7 +1167,8 @@ out: */ static int futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2, - int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64) + int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64, + struct rw_semaphore *shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -1127,12 +1177,13 @@ futex_requeue(unsigned long __user *uadd int ret, drop_count = 0; retry: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -1155,7 +1206,8 @@ futex_requeue(unsigned long __user *uadd * If we would have faulted, release mmap_sem, fault * it in and start all over again.
*/ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = futex_get_user(&curval, uaddr1, futex64); @@ -1208,7 +1260,8 @@ out_unlock: drop_futex_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -1392,7 +1445,8 @@ static int fixup_pi_state_owner(unsigned static long futex_wait_restart(struct restart_block *restart); static int futex_wait(unsigned long __user *uaddr, unsigned long val, - ktime_t *abs_time, int futex64) + ktime_t *abs_time, int futex64, + struct rw_semaphore *shared) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1405,9 +1459,10 @@ static int futex_wait(unsigned long __us q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1430,8 +1485,8 @@ static int futex_wait(unsigned long __us * a wakeup when *uaddr != val on entry to the syscall. This is * rare, but normal. * - * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up in get_futex_key. + * for shared futexes, we hold the mmap semaphore, so the mapping + * cannot have changed since we looked it up in get_futex_key. */ ret = get_futex_value_locked(&uval, uaddr, futex64); @@ -1442,7 +1497,8 @@ static int futex_wait(unsigned long __us * If we would have faulted, release mmap_sem, fault it in and * start all over again. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret) @@ -1468,7 +1524,8 @@ static int futex_wait(unsigned long __us * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep.
*/ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); /* * There might have been scheduling since the queue_me(), as we @@ -1568,7 +1625,8 @@ static int futex_wait(unsigned long __us } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); } debug_rt_mutex_free_waiter(&q.waiter); @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us restart->arg1 = val; restart->arg2 = (unsigned long)abs_time; restart->arg3 = (unsigned long)futex64; + if (shared) + restart->arg3 |= 2; return -ERESTART_RESTARTBLOCK; } @@ -1605,7 +1665,8 @@ static int futex_wait(unsigned long __us queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -1615,10 +1676,11 @@ static long futex_wait_restart(struct re unsigned long __user *uaddr = (unsigned long __user *)restart->arg0; unsigned long val = restart->arg1; ktime_t *abs_time = (ktime_t *)restart->arg2; - int futex64 = (int)restart->arg3; + int futex64 = (int)restart->arg3 & 1 ; + struct rw_semaphore *shared = (restart->arg3 & 2) ? &current->mm->mmap_sem : NULL; restart->fn = do_no_restart_syscall; - return (long)futex_wait(uaddr, val, abs_time, futex64); + return (long)futex_wait(uaddr, val, abs_time, futex64, shared); } @@ -1674,7 +1736,7 @@ static void set_pi_futex_owner(struct fu * races the kernel might see a 0 value of the futex too.)
*/ static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time, - int trylock, int futex64) + int trylock, int futex64, struct rw_semaphore *shared) { struct hrtimer_sleeper timeout, *to = NULL; struct task_struct *curr = current; @@ -1695,9 +1757,10 @@ static int futex_lock_pi(unsigned long _ q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1818,7 +1881,8 @@ static int futex_lock_pi(unsigned long _ * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); WARN_ON(!q.pi_state); /* @@ -1832,7 +1896,8 @@ static int futex_lock_pi(unsigned long _ ret = ret ? 0 : -EWOULDBLOCK; } - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); spin_lock(q.lock_ptr); /* @@ -1854,7 +1919,8 @@ static int futex_lock_pi(unsigned long _ } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); } if (!detect && ret == -EDEADLK && 0) @@ -1866,7 +1932,8 @@ static int futex_lock_pi(unsigned long _ queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); return ret; uaddr_faulted: @@ -1877,15 +1944,15 @@ static int futex_lock_pi(unsigned long _ * still holding the mmap_sem. 
*/ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, attempt, shared); + if (ret) goto out_unlock_release_sem; - } goto retry_locked; } queue_unlock(&q, hb); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret && (uval != -EFAULT)) @@ -1899,7 +1966,8 @@ static int futex_lock_pi(unsigned long _ * This is the in-kernel slowpath: we look up the PI state (if any), * and do the rt-mutex unlock. */ -static int futex_unlock_pi(unsigned long __user *uaddr, int futex64) +static int futex_unlock_pi(unsigned long __user *uaddr, int futex64, + struct rw_semaphore *shared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -1919,9 +1987,10 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -1980,7 +2049,8 @@ retry_locked: out_unlock: spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; @@ -1992,15 +2062,15 @@ pi_faulted: * still holding the mmap_sem.
*/ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, attempt, shared); + if (ret) goto out_unlock; - } goto retry_locked; } spin_unlock(&hb->lock); - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret && (uval != -EFAULT)) @@ -2052,6 +2122,7 @@ static int futex_fd(u32 __user *uaddr, i struct futex_q *q; struct file *filp; int ret, err; + struct rw_semaphore *shared; static unsigned long printk_interval; if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) { @@ -2093,11 +2164,12 @@ static int futex_fd(u32 __user *uaddr, i } q->pi_state = NULL; - down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + shared = &current->mm->mmap_sem; + down_read(shared); + err = get_futex_key(uaddr, &q->key, shared); if (unlikely(err != 0)) { - up_read(&current->mm->mmap_sem); + up_read(shared); kfree(q); goto error; } @@ -2109,7 +2181,7 @@ static int futex_fd(u32 __user *uaddr, i filp->private_data = q; queue_me(q, ret, filp); - up_read(&current->mm->mmap_sem); + up_read(shared); /* Now we map fd to filp, so userspace can access it */ fd_install(ret, filp); @@ -2238,7 +2310,8 @@ retry: */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake((unsigned long __user *)uaddr, 1); + futex_wake((unsigned long __user *)uaddr, 1, + &curr->mm->mmap_sem); } } return 0; @@ -2326,13 +2399,18 @@ long do_futex(unsigned long __user *uadd unsigned long val2, unsigned long val3, int fut64) { int ret; + int opm = op & FUTEX_CMD_MASK; + struct rw_semaphore *shared = NULL; + + if (!(op & FUTEX_PRIVATE_FLAG)) + shared = &current->mm->mmap_sem; - switch (op) { + switch (opm) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout, fut64); + ret = futex_wait(uaddr, val, timeout, fut64, shared); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, val, shared); break; case FUTEX_FD: if (fut64) @@ -2342,25 +2420,25 @@ long
do_futex(unsigned long __user *uadd ret = futex_fd((u32 __user *)uaddr, val); break; case FUTEX_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64); + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64, shared); break; case FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64); + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64, shared); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64); + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64, shared); break; case FUTEX_LOCK_PI: - ret = futex_lock_pi(uaddr, val, timeout, 0, fut64); + ret = futex_lock_pi(uaddr, val, timeout, 0, fut64, shared); break; case FUTEX_UNLOCK_PI: - ret = futex_unlock_pi(uaddr, fut64); + ret = futex_unlock_pi(uaddr, fut64, shared); break; case FUTEX_TRYLOCK_PI: - ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64); + ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64, shared); break; case FUTEX_CMP_REQUEUE_PI: - ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64); + ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64, shared); break; default: ret = -ENOSYS; @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u struct timespec ts; ktime_t t, *tp = NULL; u64 val2 = 0; + int opm = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); - if (op == FUTEX_WAIT) + if (opm == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } /* - * requeue parameter in 'utime' if op == FUTEX_REQUEUE. + * requeue parameter in 'utime' if opm == FUTEX_REQUEUE. 
*/ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE + || opm == FUTEX_CMP_REQUEUE_PI) val2 = (unsigned long) utime; return do_futex((unsigned long __user*)uaddr, op, val, tp, @@ -2409,23 +2488,24 @@ asmlinkage long sys_futex(u32 __user *ua struct timespec ts; ktime_t t, *tp = NULL; u32 val2 = 0; + int opm = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); - if (op == FUTEX_WAIT) + if (opm == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } /* * requeue parameter in 'utime' if op == FUTEX_REQUEUE. */ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE + || opm == FUTEX_CMP_REQUEUE_PI) val2 = (u32) (unsigned long) utime; return do_futex((unsigned long __user*)uaddr, op, val, tp, ^ permalink raw reply [flat|nested] 78+ messages in thread
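Editor's sketch: the do_futex() hunk above splits the syscall's op argument into a command (op & FUTEX_CMD_MASK) and a "process private" flag, and only passes the mmap_sem pointer down when the flag is clear. The fragment below mimics that split in plain userspace C. The numeric values (FUTEX_PRIVATE_FLAG as 128, the mask as its complement) are assumptions, since the patch's include/linux/futex.h hunk is not shown in this excerpt, and struct futex_op and decode_futex_op() are hypothetical names used only for illustration.

```c
#include <stdbool.h>

/* Assumed values; the futex.h part of the patch is not quoted in this thread excerpt. */
#define FUTEX_WAIT          0
#define FUTEX_WAKE          1
#define FUTEX_PRIVATE_FLAG  128
#define FUTEX_CMD_MASK      (~FUTEX_PRIVATE_FLAG)

/* Hypothetical helper type: the decoded form of the op argument. */
struct futex_op {
	int  cmd;     /* command with the private flag stripped */
	bool shared;  /* true when the caller did NOT set FUTEX_PRIVATE_FLAG */
};

/* Mirrors the "opm" / "shared" computation at the top of do_futex(). */
static struct futex_op decode_futex_op(int op)
{
	struct futex_op d;

	d.cmd = op & FUTEX_CMD_MASK;
	d.shared = !(op & FUTEX_PRIVATE_FLAG);
	return d;
}
```

With these assumed values, an old binary that passes a bare FUTEX_WAKE decodes as shared, so it keeps the existing mmap_sem behaviour; only callers that opt in with the flag take the private path.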
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-05 17:49 ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet @ 2007-04-05 20:43 ` Ulrich Drepper 2007-04-06 1:19 ` Nick Piggin ` (3 subsequent siblings) 4 siblings, 0 replies; 78+ messages in thread From: Ulrich Drepper @ 2007-04-05 20:43 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On 4/5/07, Eric Dumazet <dada1@cosmosbay.com> wrote: >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > include/linux/futex.h | 45 +++++- > kernel/futex.c | 294 +++++++++++++++++++++++++--------------- > 2 files changed, 230 insertions(+), 109 deletions(-) I cannot vouch for the whole code but the concept is very sound and definitely along the way the specification allows it: Acked-by: Ulrich Drepper <drepper@redhat.com> ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-05 17:49 ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet 2007-04-05 20:43 ` Ulrich Drepper @ 2007-04-06 1:19 ` Nick Piggin 2007-04-06 5:53 ` Eric Dumazet 2007-04-06 6:05 ` Hugh Dickins 2007-04-06 12:26 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 2 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-06 1:19 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Hi Eric, Thanks for doing this... It's looking good, I just have some minor comments: Eric Dumazet wrote: > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- linux-2.6.21-rc5-mm4/kernel/futex.c > +++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c > @@ -16,6 +16,9 @@ > * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> > * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> > * > + * PRIVATE futexes by Eric Dumazet > + * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> > + * > * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly > * enough at me, Linus for the original (flawed) idea, Matthew > * Kirkwood for proof-of-concept implementation. > @@ -199,9 +202,12 @@ static inline int match_futex(union fute > * Returns: 0, or negative error code. > * The key words are stored in *key on success. > * > - * Should be called with &current->mm->mmap_sem but NOT any spinlocks. > + * shared is NULL for PROCESS_PRIVATE futexes > + * For other futexes, it points to &current->mm->mmap_sem and > + * caller must have taken the reader lock. but NOT any spinlocks. > */ > -int get_futex_key(void __user *uaddr, union futex_key *key) > +int get_futex_key(void __user *uaddr, union futex_key *key, > + struct rw_semaphore *shared) Can we pass in something other than the rw_semaphore here?
Seeing as it only actually gets used as a flag, it might be nicer just to pass a 0 or 1? And all through the call stack... Did the whole thing just turn out neater when you passed the rwsem? We always know to use current->mm->mmap_sem, so it doesn't seem like a boolean flag would hurt? > { > unsigned long address = (unsigned long)uaddr; > struct mm_struct *mm = current->mm; > @@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un > address -= key->both.offset; > > /* > + * PROCESS_PRIVATE futexes are fast. > + * As the mm cannot disappear under us and the 'key' only needs > + * virtual address, we dont even have to find the underlying vma. > + * Note : We do have to check 'address' is a valid user address, > + * but access_ok() should be faster than find_vma() > + * Note : At this point, address points to the start of page, > + * not the real futex address, this is ok. > + */ > + if (!shared) { > + if (!access_ok(VERIFY_WRITE, address, sizeof(int))) > + return -EFAULT; Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it should depend on the size of the operation. Maybe the access_ok check should go outside get_futex_key? > + key->private.mm = mm; > + key->private.address = address; > + return 0; > + } > + /* > * The futex is hashed differently depending on whether > * it's in a shared or private mapping. So check vma first. > */ > @@ -244,6 +266,7 @@ int get_futex_key(void __user *uaddr, un > * mappings of _writable_ handles. > */ > if (likely(!(vma->vm_flags & VM_MAYSHARE))) { > + key->both.offset += FUT_OFF_MMSHARED; /* reference taken on mm */ > key->private.mm = mm; > key->private.address = address; > return 0; > @@ -253,7 +276,7 @@ int get_futex_key(void __user *uaddr, un > * Linear file mappings are also simple. > */ > key->shared.inode = vma->vm_file->f_path.dentry->d_inode; > - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ > + key->both.offset += FUT_OFF_INODE; /* inode-based key. 
*/ > if (likely(!(vma->vm_flags & VM_NONLINEAR))) { > key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) > + vma->vm_pgoff); I like |= for adding flags, it seems less ambiguous. But I guess that's a matter of opinion. Hugh seems to like +=, and I can't argue with him about style issues ;) > @@ -281,17 +304,19 @@ EXPORT_SYMBOL_GPL(get_futex_key); > * Take a reference to the resource addressed by a key. > * Can be called while holding spinlocks. > * > - * NOTE: mmap_sem MUST be held between get_futex_key() and calling this > - * function, if it is called at all. mmap_sem keeps key->shared.inode valid. > */ > inline void get_futex_key_refs(union futex_key *key) > { > - if (key->both.ptr != 0) { > - if (key->both.offset & 1) > + if (key->both.ptr == 0) > + return; > + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { > + case FUT_OFF_INODE: > atomic_inc(&key->shared.inode->i_count); > - else > + break; > + case FUT_OFF_MMSHARED: > atomic_inc(&key->private.mm->mm_count); > - } > + break; > + } > } > EXPORT_SYMBOL_GPL(get_futex_key_refs); > > @@ -301,11 +326,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs); > */ > void drop_futex_key_refs(union futex_key *key) > { > - if (key->both.ptr != 0) { > - if (key->both.offset & 1) > + if (key->both.ptr == 0) > + return; > + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { > + case FUT_OFF_INODE: > iput(key->shared.inode); > - else > + break; > + case FUT_OFF_MMSHARED: > mmdrop(key->private.mm); > + break; > } > } > EXPORT_SYMBOL_GPL(drop_futex_key_refs); I wonder if it would be worthwhile inlining and likely()ing the private fastpath? Might make it pretty compact... I guess that's something to worry about after glibc gets support. > @@ -339,28 +368,40 @@ get_futex_value_locked(unsigned long *de > } > > /* > - * Fault handling. Called with current->mm->mmap_sem held. > + * Fault handling.
> + * if shared is non NULL, current->mm->mmap_sem is already held > */ > -static int futex_handle_fault(unsigned long address, int attempt) > +static int futex_handle_fault(unsigned long address, int attempt, > + struct rw_semaphore *shared) > { > struct vm_area_struct * vma; > struct mm_struct *mm = current->mm; > + int ret = 0; > > - if (attempt > 2 || !(vma = find_vma(mm, address)) || > - vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) > + if (attempt > 2) > return -EFAULT; > > - switch (handle_mm_fault(mm, vma, address, 1)) { > - case VM_FAULT_MINOR: > - current->min_flt++; > - break; > - case VM_FAULT_MAJOR: > - current->maj_flt++; > - break; > - default: > - return -EFAULT; > - } > - return 0; > + if (!shared) > + down_read(&mm->mmap_sem); > + > + if (!(vma = find_vma(mm, address)) || > + vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) > + ret = -EFAULT; > + > + else > + switch (handle_mm_fault(mm, vma, address, 1)) { > + case VM_FAULT_MINOR: > + current->min_flt++; > + break; > + case VM_FAULT_MAJOR: > + current->maj_flt++; > + break; > + default: > + ret = -EFAULT; > + } > + if (!shared) > + up_read(&mm->mmap_sem); > + return ret; > } > > /* You've got an extra space after the if (maybe for clarity?). In this situation I prefer putting braces around both the if and the else, and if you get rid of that blank line, it doesn't cost you anything more ;) > @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us > restart->arg1 = val; > restart->arg2 = (unsigned long)abs_time; > restart->arg3 = (unsigned long)futex64; > + if (shared) > + restart->arg3 |= 2; Could you make this into a proper flags argument and use #define CONSTANTs for it? > @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u > struct timespec ts; > ktime_t t, *tp = NULL; > u64 val2 = 0; > + int opm = op & FUTEX_CMD_MASK; What's opm stand for? 
> > - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { > + if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) { -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-06 1:19 ` Nick Piggin @ 2007-04-06 5:53 ` Eric Dumazet 2007-04-06 11:50 ` Nick Piggin 2007-04-06 6:05 ` Hugh Dickins 1 sibling, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-06 5:53 UTC (permalink / raw) To: Nick Piggin Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Nick Piggin wrote: > Hi Eric, > > Thanks for doing this... It's looking good, I just have some minor > comments: Hi Nick, thanks for reviewing. > > Eric Dumazet wrote: >> */ >> -int get_futex_key(void __user *uaddr, union futex_key *key) >> +int get_futex_key(void __user *uaddr, union futex_key *key, >> + struct rw_semaphore *shared) > > Can we pass in something other than the rw_semaphore here? Seeing as > it only actually gets used as a flag, it might be nicer just to pass > a 0 or 1? And all through the call stack... > > Did the whole thing just turn out neater when you passed the rwsem? > We always know to use current->mm->mmap_sem, so it doesn't seem like > a boolean flag would hurt? That's a good question. current->mm->mmap_sem being calculated once is a win in itself, because current access is not cheap. It also does the memory access to go through part of the chain in advance, before its use. It does a prefetch() equivalent for free : If current->mm is not in CPU cache, CPU wont stall because next instructions dont depend on it. This means less CPU stall in case current->mm is not in CPU cache. That's difficult to benchmark, but you can trust me. A flag means : if (flag) up_read(&current->mm->mmap_sem) This generates quite bad code. if (ptr) up_read(ptr) generates *much* better code. So this is a cleanup and a runtime optimization.
I did a similar optimization in commit 163da958ba5282cbf85e8b3dc08e4f51f8b01c5e I invite you to check it : http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=163da958ba5282cbf85e8b3dc08e4f51f8b01c5e > >> { >> unsigned long address = (unsigned long)uaddr; >> struct mm_struct *mm = current->mm; >> @@ -218,6 +224,22 @@ int get_futex_key(void __user *uaddr, un >> address -= key->both.offset; >> >> /* >> + * PROCESS_PRIVATE futexes are fast. >> + * As the mm cannot disappear under us and the 'key' only needs >> + * virtual address, we dont even have to find the underlying vma. >> + * Note : We do have to check 'address' is a valid user address, >> + * but access_ok() should be faster than find_vma() >> + * Note : At this point, address points to the start of page, >> + * not the real futex address, this is ok. >> + */ >> + if (!shared) { >> + if (!access_ok(VERIFY_WRITE, address, sizeof(int))) >> + return -EFAULT; > > Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it > should depend on the size of the operation. Maybe the access_ok check > should go outside get_futex_key? If you check again, you'll see that address points to the start of the PAGE, not the real u32/u64 futex address. This checks the PAGE. We can use char, short, int, long, or char[PAGE_SIZE] as long as we know a futex cannot span two pages. >> */ >> key->shared.inode = vma->vm_file->f_path.dentry->d_inode; >> - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ >> + key->both.offset += FUT_OFF_INODE; /* inode-based key. */ >> if (likely(!(vma->vm_flags & VM_NONLINEAR))) { >> key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) >> + vma->vm_pgoff); > > I like |= for adding flags, it seems less ambiguous. But I guess that's > a matter of opinion.
Hugh seems to like +=, and I can't argue with him > about style issues ;) Previous code was doing offset++ which means offset += 1; I didn't want to hurt Hugh :) >> EXPORT_SYMBOL_GPL(drop_futex_key_refs); > > I wonder if it would be worthwhile inlining and likely()ing the > private fastpath? Might make it pretty compact... I guess that's > something to worry about after glibc gets support. Yes, in a future patch, in about one year :) >> + >> + if (!(vma = find_vma(mm, address)) || >> + vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) >> + ret = -EFAULT; >> + >> + else >> + switch (handle_mm_fault(mm, vma, address, 1)) { >> + case VM_FAULT_MINOR: >> + current->min_flt++; >> + break; >> + case VM_FAULT_MAJOR: >> + current->maj_flt++; >> + break; >> + default: >> + ret = -EFAULT; >> + } >> + if (!shared) >> + up_read(&mm->mmap_sem); >> + return ret; >> } >> >> /* > > You've got an extra space after the if (maybe for clarity?). In this > situation I prefer putting braces around both the if and the else, and > if you get rid of that blank line, it doesn't cost you anything more ;) Oh well... > >> @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us >> restart->arg1 = val; >> restart->arg2 = (unsigned long)abs_time; >> restart->arg3 = (unsigned long)futex64; >> + if (shared) >> + restart->arg3 |= 2; > > Could you make this into a proper flags argument and use #define > CONSTANTs for it? Yes, but I'm not sure it will improve readability. > >> @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u >> struct timespec ts; >> ktime_t t, *tp = NULL; >> u64 val2 = 0; >> + int opm = op & FUTEX_CMD_MASK; > > What's opm stand for? I guess 'm' stands for 'mask' or 'masked' ? Thank you ^ permalink raw reply [flat|nested] 78+ messages in thread
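Editor's sketch: Eric's answer about access_ok() rests on get_futex_key() first splitting the user address into a page-aligned base and an in-page offset (the quoted hunk shows "address -= key->both.offset"). The userspace fragment below illustrates that split and the "a futex cannot span two pages" condition for naturally aligned 4-byte and 8-byte words. SKETCH_PAGE_SIZE, struct uaddr_split and both helpers are hypothetical names, and 4096 is only an assumed page size.

```c
/* Assumed page size for the illustration; real kernels use PAGE_SIZE. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical result type: page base plus offset inside that page. */
struct uaddr_split {
	unsigned long page;    /* address rounded down to the page start */
	unsigned long offset;  /* what the patch keeps in key->both.offset */
};

/* Same arithmetic as "offset = address % PAGE_SIZE; address -= offset;". */
static struct uaddr_split split_uaddr(unsigned long uaddr)
{
	struct uaddr_split s;

	s.offset = uaddr % SKETCH_PAGE_SIZE;
	s.page = uaddr - s.offset;
	return s;
}

/* Returns 1 if an object of 'size' bytes at 'uaddr' crosses a page boundary. */
static int spans_two_pages(unsigned long uaddr, unsigned long size)
{
	return (uaddr % SKETCH_PAGE_SIZE) + size > SKETCH_PAGE_SIZE;
}
```

A naturally aligned u32 or u64 never crosses a page boundary, so checking any small range at the page start covers the same page the futex lives in. That is the property Eric relies on, provided alignment is enforced elsewhere in get_futex_key().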
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-06 5:53 ` Eric Dumazet @ 2007-04-06 11:50 ` Nick Piggin 0 siblings, 0 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-06 11:50 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > Nick Piggin a écrit : >> Did the whole thing just turn out neater when you passed the rwsem? >> We always know to use current->mm->mmap_sem, so it doesn't seem like >> a boolean flag would hurt? > > > That's a good question > > current->mm->mmap_sem being calculated once is a win in itself, because > current access is not cheap. > It also does the memory access to go through part of the chain in > advance, before its use. It does a prefetch() equivalent for free : If > current->mm is not in CPU cache, CPU wont stall because next > instructions dont depend on it. Fair enough. Current access I think should be cheap though (it is effectively a constant), but I guess it is still improvement. >> Shouldn't that be sizeof(long) to handle 64 bit futexes? Or strictly, it >> should depend on the size of the operation. Maybe the access_ok check >> should go outside get_futex_key? > > > If you check again, you'll see that address points to the start of the > PAGE, not the real u32/u64 futex address. This checks the PAGE. We can > use char, short, int, long, or char[PAGE_SIZE] as long as we know a > futex cannot span two pages. Ah, that works. >>> */ >>> key->shared.inode = vma->vm_file->f_path.dentry->d_inode; >>> - key->both.offset++; /* Bit 0 of offset indicates inode-based >>> key. */ >>> + key->both.offset += FUT_OFF_INODE; /* inode-based key. */ >>> if (likely(!(vma->vm_flags & VM_NONLINEAR))) { >>> key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) >>> + vma->vm_pgoff); >> >> >> I like |= for adding flags, it seems less ambiguous. 
But I guess that's >> a matter of opinion. Hugh seems to like +=, and I can't argue with him >> about style issues ;) > > > > Previous code was doing offset++ which means offset += 1; But it doesn't mean you have to ;) >>> @@ -1598,6 +1656,8 @@ static int futex_wait(unsigned long __us >>> restart->arg1 = val; >>> restart->arg2 = (unsigned long)abs_time; >>> restart->arg3 = (unsigned long)futex64; >>> + if (shared) >>> + restart->arg3 |= 2; >> >> >> Could you make this into a proper flags argument and use #define >> CONSTANTs for it? > > > Yes, but I'm not sure it will improve readability. Well that bit of code alone is obviously unreadable. restart->arg3 = 0; if (futex64) restart->arg3 |= FUTEX_64; if (shared) restart->arg3 |= FUTEX_SHARED; Maybe a matter of taste. > >> >>> @@ -2377,23 +2455,24 @@ sys_futex64(u64 __user *uaddr, int op, u >>> struct timespec ts; >>> ktime_t t, *tp = NULL; >>> u64 val2 = 0; >>> + int opm = op & FUTEX_CMD_MASK; >> >> >> What's opm stand for? > > > I guess 'm' stands for 'mask' or 'masked' ? Why not call it cmd? (ie. what it is, rather than what you have done to derive it). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
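Editor's sketch: Nick's suggestion is to give the two bits stored in restart->arg3 names instead of the bare 1 and 2 used in the patch. A minimal userspace version of that packing and unpacking could look like the fragment below. The constant names and helper functions are hypothetical (Nick wrote FUTEX_64 and FUTEX_SHARED without giving values); only the bit positions 1 and 2 come from the patch.

```c
/* Hypothetical names for the two bits the patch stores in restart->arg3. */
#define RESTART_FUTEX_64      0x1UL  /* patch: (unsigned long)futex64 */
#define RESTART_FUTEX_SHARED  0x2UL  /* patch: restart->arg3 |= 2     */

/* Build the flags word that futex_wait() would store before restarting. */
static unsigned long pack_restart_flags(int futex64, int shared)
{
	unsigned long flags = 0;

	if (futex64)
		flags |= RESTART_FUTEX_64;
	if (shared)
		flags |= RESTART_FUTEX_SHARED;
	return flags;
}

/* Decode side, as futex_wait_restart() would need it. */
static int restart_is_futex64(unsigned long flags)
{
	return (flags & RESTART_FUTEX_64) != 0;
}

static int restart_is_shared(unsigned long flags)
{
	return (flags & RESTART_FUTEX_SHARED) != 0;
}
```

One side effect of this form is that any nonzero futex64 is normalised to a single bit, whereas the patch's cast only works because futex64 is already 0 or 1.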
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-06 1:19 ` Nick Piggin 2007-04-06 5:53 ` Eric Dumazet @ 2007-04-06 6:05 ` Hugh Dickins 2007-04-06 17:41 ` Jan Engelhardt 1 sibling, 1 reply; 78+ messages in thread From: Hugh Dickins @ 2007-04-06 6:05 UTC (permalink / raw) To: Nick Piggin Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Fri, 6 Apr 2007, Nick Piggin wrote: > > I like |= for adding flags, it seems less ambiguous. But I guess that's > a matter of opinion. Hugh seems to like +=, Do I? You probably have a shaming example in mind (PAGE_MAPPING_ANON? that's a hybrid case where using + and - helped minimize the casting); but in general I'd agree with you that it's |= for setting flag bits. Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the += > and I can't argue with him about style issues ;) I feel a warm glow. Nobody ever called me a style guru before. Hugh p.s. Please don't interpret this as any useful contribution to reviewing Eric's futex work: seems sensible, but I've hardly looked. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-06 6:05 ` Hugh Dickins @ 2007-04-06 17:41 ` Jan Engelhardt 0 siblings, 0 replies; 78+ messages in thread From: Jan Engelhardt @ 2007-04-06 17:41 UTC (permalink / raw) To: Hugh Dickins Cc: Nick Piggin, Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Apr 6 2007 07:05, Hugh Dickins wrote: >On Fri, 6 Apr 2007, Nick Piggin wrote: >> >> I like |= for adding flags, it seems less ambiguous. But I guess that's >> a matter of opinion. Hugh seems to like +=, > >Do I? You probably have a shaming example in mind (PAGE_MAPPING_ANON? >that's a hybrid case where using + and - helped minimize the casting); >but in general I'd agree with you that it's |= for setting flag bits. > >Hmm, Eric's FUT_OFF_INODE is hybrid too, that might justify the += If a bit is already set, you can't set it again using +=. Jan -- ^ permalink raw reply [flat|nested] 78+ messages in thread
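Editor's sketch: Jan's remark is that += and |= only agree while the flag bit is still clear. The fragment below shows the difference on a plain unsigned long. SKETCH_FLAG and both helpers are hypothetical names, not taken from the patch.

```c
/* Hypothetical single-bit flag kept in the low bits of a word. */
#define SKETCH_FLAG 0x1UL

/* Arithmetic form: only correct if the bit is known to be clear. */
static unsigned long set_flag_add(unsigned long word)
{
	return word + SKETCH_FLAG;
}

/* Bitwise form: idempotent, safe to apply any number of times. */
static unsigned long set_flag_or(unsigned long word)
{
	return word | SKETCH_FLAG;
}
```

Starting from 0x1000, both forms give 0x1001 on the first call. A second call leaves the |= result unchanged, while the += result carries into the next bit and becomes 0x1002. In the patch the offset is freshly computed before the flag is added, so the bit is clear there; the sketch only shows why |= is the more defensive spelling.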
* Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-05 17:49 ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet 2007-04-05 20:43 ` Ulrich Drepper 2007-04-06 1:19 ` Nick Piggin @ 2007-04-06 12:26 ` Peter Zijlstra 2007-04-06 13:02 ` Hugh Dickins 2007-04-06 12:31 ` [PATCH] FUTEX : new PRIVATE futexes Peter Zijlstra 2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet 4 siblings, 1 reply; 78+ messages in thread From: Peter Zijlstra @ 2007-04-06 12:26 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, hugh, Pierre.Peiffer Hi, some thoughts on shared futexes; Could we get rid of the mmap_sem on the shared futexes in the following manner: - do a page table walk to find the pte; - get a page using pfn_to_page (skipping VM_PFNMAP) - get the futex key from page->mapping->host and page->index and offset from addr % PAGE_SIZE. or given a key: - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff possibly loading the page using mapping->a_ops->readpage(). then: - perform the futex operation on a kmap of the page This should all work except for VM_PFNMAP. Since the address is passed from userspace we cannot trust it to not point into a VM_PFNMAP area. However, with the RCU VMA lookup patches I'm working on we could do that check without holding locks and without exclusive cachelines; the question is, is that good enough? Or is there an alternative way of determining a pfnmap given a pfn/struct page? ^ permalink raw reply [flat|nested] 78+ messages in thread
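The "offset from addr % PAGE_SIZE" step in the proposal above can be sketched in user space as follows. This is only an illustration: SKETCH_PAGE_SIZE assumes 4 KiB pages, and sketch_page_offset() and sketch_page_base() are hypothetical names, not kernel functions.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Offset of a user address within its page (addr % PAGE_SIZE). */
unsigned long sketch_page_offset(uintptr_t addr)
{
	return (unsigned long)(addr % SKETCH_PAGE_SIZE);
}

/* Address of the first byte of the page that contains addr. */
uintptr_t sketch_page_base(uintptr_t addr)
{
	return addr - (addr % SKETCH_PAGE_SIZE);
}
```

The page part identifies which page (and so which page->index) the futex lives in, and the offset distinguishes futexes that share a page.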
* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-06 12:26 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra @ 2007-04-06 13:02 ` Hugh Dickins 2007-04-06 13:15 ` Peter Zijlstra 2007-04-06 13:15 ` Nick Piggin 0 siblings, 2 replies; 78+ messages in thread From: Hugh Dickins @ 2007-04-06 13:02 UTC (permalink / raw) To: Peter Zijlstra Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Pierre.Peiffer On Fri, 6 Apr 2007, Peter Zijlstra wrote: > > some thoughts on shared futexes; > > Could we get rid of the mmap_sem on the shared futexes in the following > manner: > > - do a page table walk to find the pte; ("walk" meaning descent down the levels, I presume, rather than across) I've not had time to digest your proposal, and I'm about to go out: let me sound a warning that springs to mind, maybe it's entirely inappropriate, but better said than kept silent. It looks as if you're supposing that mmap_sem is needed to find_vma, but not for going down the pagetables. It's not as simple as that: you need to be careful that a concurrent munmap from another thread isn't freeing pagetables from under you. Holding (down_read) of mmap_sem is one way to protect against that. try_to_unmap doesn't have that luxury: in its case, it's made safe by the way free_pgtables does anon_vma_unlink and unlink_file_vma before freeing any pagetables, so try_to_unmap etc. won't get there; but you can't do that. Hugh > - get a page using pfn_to_page (skipping VM_PFNMAP) > - get the futex key from page->mapping->host and page->index > and offset from addr % PAGE_SIZE. > > or given a key: > > - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff > possibly loading the page using mapping->a_ops->readpage().
> > then: > > - perform the futex operation on a kmap of the page > > > This should all work except for VM_PFNMAP. > > Since the address is passed from userspace we cannot trust it to not > point into a VM_PFNMAP area. > > However, with the RCU VMA lookup patches I'm working on we could do that > check without holding locks and without exclusive cachelines; the > question is, is that good enough? > > Or is there an alternative way of determining a pfnmap given a > pfn/struct page? > ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-06 13:02 ` Hugh Dickins @ 2007-04-06 13:15 ` Peter Zijlstra 2007-04-06 13:15 ` Nick Piggin 1 sibling, 0 replies; 78+ messages in thread From: Peter Zijlstra @ 2007-04-06 13:15 UTC (permalink / raw) To: Hugh Dickins Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Pierre.Peiffer On Fri, 2007-04-06 at 14:02 +0100, Hugh Dickins wrote: > On Fri, 6 Apr 2007, Peter Zijlstra wrote: > > > > some thoughts on shared futexes; > > > > Could we get rid of the mmap_sem on the shared futexes in the following > > manner: > > > > - do a page table walk to find the pte; > > ("walk" meaning descent down the levels, I presume, rather than across) indeed. > I've not had time to digest your proposal, and I'm about to go out: > let me sound a warning that springs to mind, maybe it's entirely > inappropriate, but better said than kept silent. > > It looks as if you're supposing that mmap_sem is needed to find_vma, > but not for going down the pagetables. It's not as simple as that: > you need to be careful that a concurrent munmap from another thread > isn't freeing pagetables from under you. > > Holding (down_read) of mmap_sem is one way to protect against that. > try_to_unmap doesn't have that luxury: in its case, it's made safe > by the way free_pgtables does anon_vma_unlink and unlink_file_vma > before freeing any pagetables, so try_to_unmap etc. won't get there; > but you can't do that. Ah, drat. I had hoped that the pte_lock would solve that race, but that doesn't cover the upper levels of the tree. Back to the drawing board.. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-06 13:02 ` Hugh Dickins 2007-04-06 13:15 ` Peter Zijlstra @ 2007-04-06 13:15 ` Nick Piggin 2007-04-06 13:22 ` Peter Zijlstra 1 sibling, 1 reply; 78+ messages in thread From: Nick Piggin @ 2007-04-06 13:15 UTC (permalink / raw) To: Hugh Dickins Cc: Peter Zijlstra, Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Pierre.Peiffer Hugh Dickins wrote: > On Fri, 6 Apr 2007, Peter Zijlstra wrote: > >>some thoughts on shared futexes; >> >>Could we get rid of the mmap_sem on the shared futexes in the following >>manner: I'd imagine shared futexes would be much less common than private for threaded programs... I'd say we should reevaluate things once we have private futexes, and malloc/free stop hammering mmap_sem so hard... >> - get a page using pfn_to_page (skipping VM_PFNMAP) >> - get the futex key from page->mapping->host and page->index >> and offset from addr % PAGE_SIZE. >> >>or given a key: >> >> - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff >> possibly loading the page using mapping->a_ops->readpage(). For shared futexes, wouldn't i_mapping be worse, because you'd be ping-ponging the tree_lock between processes, rather than have each use their own mmap_sem? That also only helps for the wakeup case too, doesn't it? You have to use the vmas to find out which inode to use to do the wait, I think? (unless you introduce a new shared futex API). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-06 13:15 ` Nick Piggin @ 2007-04-06 13:22 ` Peter Zijlstra 2007-04-06 13:40 ` Nick Piggin 0 siblings, 1 reply; 78+ messages in thread From: Peter Zijlstra @ 2007-04-06 13:22 UTC (permalink / raw) To: Nick Piggin Cc: Hugh Dickins, Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Pierre.Peiffer On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote: > Hugh Dickins wrote: > > On Fri, 6 Apr 2007, Peter Zijlstra wrote: > > > >>some thoughts on shared futexes; > >> > >>Could we get rid of the mmap_sem on the shared futexes in the following > >>manner: > > I'd imagine shared futexes would be much less common than private for > threaded programs... I'd say we should reevaluate things once we have > private futexes, and malloc/free stop hammering mmap_sem so hard... Indeed, private futexes are by far the most common. > >> - get a page using pfn_to_page (skipping VM_PFNMAP) > >> - get the futex key from page->mapping->host and page->index > >> and offset from addr % PAGE_SIZE. > >> > >>or given a key: > >> > >> - lookup the page from key.shared.inode->i_mapping by key.shared.pgoff > >> possibly loading the page using mapping->a_ops->readpage(). > > For shared futexes, wouldn't i_mapping be worse, because you'd be > ping-ponging the tree_lock between processes, rather than have each > use their own mmap_sem? Your lockless pagecache work would solve most of that, no? > That also only helps for the wakeup case too, doesn't it? You have > to use the vmas to find out which inode to use to do the wait, I think? > (unless you introduce a new shared futex API). 
One could do something like this:

	struct address_space *mapping = page_mapping(page);
	if (!mapping || mapping == &swapper_space)
		do_private_futex();
	else
		do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE);

But alas, it seems I overlooked that the mmap_sem also protects the page tables, as pointed out by Hugh, so this is all in vain it seems.

Ah well.

^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) 2007-04-06 13:22 ` Peter Zijlstra @ 2007-04-06 13:40 ` Nick Piggin 0 siblings, 0 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-06 13:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Hugh Dickins, Eric Dumazet, Ulrich Drepper, Andrew Morton, Dave Jones, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Pierre.Peiffer Peter Zijlstra wrote: > On Fri, 2007-04-06 at 23:15 +1000, Nick Piggin wrote: >>>>or given a key: >>>> >>>>- lookup the page from key.shared.inode->i_mapping by key.shared.pgoff >>>> possibly loading the page using mapping->a_ops->readpage(). >> >>For shared futexes, wouldn't i_mapping be worse, because you'd be >>ping-ponging the tree_lock between processes, rather than have each >>use their own mmap_sem? > > > Your lockless pagecache work would solve most of that, no? Yeah it would. >>That also only helps for the wakeup case too, doesn't it? You have >>to use the vmas to find out which inode to use to do the wait, I think? >>(unless you introduce a new shared futex API). > > > one could do something like this: > > struct address_space *mapping = page_mapping(page); > if (!mapping || mapping == &swapper_space) > do_private_futex(); > else > do_shared_futex(mapping->host, page->index, addr % PAGE_SIZE); > > > But alas, it seems I overlooked that the mmap_sem also protects the page > tables as pointed out by Hugh, so this is all in fain it seems. > > A well. Well if it turned out to be a real problem, we _could_ introduce a new futex API that takes a file descriptor+offset. The old futex API is basically just a shorthand for choosing between shared and private, depending on what the vma at the given address happens to be. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH] FUTEX : new PRIVATE futexes 2007-04-05 17:49 ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet ` (2 preceding siblings ...) 2007-04-06 12:26 ` Shared futexes (was [PATCH] FUTEX : new PRIVATE futexes) Peter Zijlstra @ 2007-04-06 12:31 ` Peter Zijlstra 2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet 4 siblings, 0 replies; 78+ messages in thread From: Peter Zijlstra @ 2007-04-06 12:31 UTC (permalink / raw) To: Eric Dumazet Cc: Ulrich Drepper, Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Thu, 2007-04-05 at 19:49 +0200, Eric Dumazet wrote: > Hi all > > I'm pleased to present this patch which improves linux futexes performance and > scalability, merely avoiding taking mmap_sem rwlock. > > Ulrich agreed with the API and said glibc work could start as soon > as he gets a Fedora kernel with it :) > > Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4 > (so supports 64bit futexes) > > In this third version I dropped the NUMA optims and process private hash table, > to let new API come in and be tested. Good work, Thanks! Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> ^ permalink raw reply [flat|nested] 78+ messages in thread
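As a rough illustration of the API being acked here: the patch forms each _PRIVATE subcommand by OR-ing a flag bit into the existing futex op, and FUTEX_CMD_MASK strips it again. The sketch below copies FUTEX_PRIVATE_FLAG and FUTEX_CMD_MASK from the patch and uses the usual values of FUTEX_WAIT and FUTEX_WAKE. The helper names futex_cmd(), futex_is_private() and futex_op_for() are hypothetical and are not taken from the kernel or glibc.

```c
/* Values as given in the patch and in include/linux/futex.h. */
#define FUTEX_WAIT		0
#define FUTEX_WAKE		1
#define FUTEX_PRIVATE_FLAG	128
#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG

#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)

/* Hypothetical helper: strip the private flag to get the base command. */
int futex_cmd(int op)
{
	return op & FUTEX_CMD_MASK;
}

/* Hypothetical helper: nonzero if op asks for process-private semantics. */
int futex_is_private(int op)
{
	return (op & FUTEX_PRIVATE_FLAG) != 0;
}

/*
 * Hypothetical libc-side helper: private_flag is a variable holding
 * either FUTEX_PRIVATE_FLAG (new kernel) or 0 (fall back to old ops).
 */
int futex_op_for(int base_cmd, int private_flag)
{
	return base_cmd | private_flag;
}
```

Keeping the flag in one variable that holds either FUTEX_PRIVATE_FLAG or 0 matches the fallback suggested in the patch description: if the kernel turns out not to support the private commands, the variable is cleared and every later call uses the old subcommands.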
* [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-05 17:49 ` [PATCH] FUTEX : new PRIVATE futexes Eric Dumazet ` (3 preceding siblings ...) 2007-04-06 12:31 ` [PATCH] FUTEX : new PRIVATE futexes Peter Zijlstra @ 2007-04-07 8:43 ` Eric Dumazet 2007-04-07 9:30 ` Nick Piggin ` (3 more replies) 4 siblings, 4 replies; 78+ messages in thread From: Eric Dumazet @ 2007-04-07 8:43 UTC (permalink / raw) To: Andrew Morton, Dave Jones Cc: Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel

Hi all

Updates on this take4 :

- All remarks from Nick were addressed, I hope.

- Current mm code has a problem with 64bit futexes, as spotted by Nick :
  get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. So it is possible a 64bit futex spans two pages of memory... I had to change get_futex_key() prototype to be able to do a correct test.

History :

I'm pleased to present this patch which improves linux futexes performance and scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon as he gets a Fedora kernel with it :)

Andrew, could we get this in mm as well ? This version is against 2.6.21-rc5-mm4 (so supports 64bit futexes)

In this third version I dropped the NUMA optims and process private hash table, to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting threads. Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer (calling find_vma()).
   This validation tells us if the futex uses an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation has various problems :

- Central hashtable :
  This means scalability problems if many processes/threads want to use futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :
  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.
  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.
  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold time because of cache misses.

Most of these scalability problems come from the fact that futexes are in one global namespace. As we use a central hash table, we must make sure they are all using the same reference (given by the mm subsystem). We chose to force all futexes to be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and optimal performance if carefully implemented.
Time has come for linux to have better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, no matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be prepared to fall back to the old subcommands for old kernels. Using one global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries *and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic, as this is the default semantic.
Almost all applications should benefit from these changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   46 +++++
 kernel/futex.c        |  340 ++++++++++++++++++++++++++--------------
 2 files changed, 268 insertions(+), 118 deletions(-)

--- linux-2.6.21-rc5-mm4/include/linux/futex.h
+++ linux-2.6.21-rc5-mm4-ed/include/linux/futex.h
@@ -19,6 +19,18 @@ union ktime;
 #define FUTEX_TRYLOCK_PI	8
 #define FUTEX_CMP_REQUEUE_PI	9
 
+#define FUTEX_PRIVATE_FLAG	128
+#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG
+
+#define FUTEX_WAIT_PRIVATE	(FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_PRIVATE	(FUTEX_WAKE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_REQUEUE_PRIVATE	(FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG)
+#define FUTEX_WAKE_OP_PRIVATE	(FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG)
+#define FUTEX_LOCK_PI_PRIVATE	(FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_UNLOCK_PI_PRIVATE	(FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG)
+#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG)
+
 /*
  * Support for robust futexes: the kernel cleans up held futexes at
  * thread exit time.
@@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st
  * Don't rearrange members without looking at hash_futex().
  *
  * offset is aligned to a multiple of sizeof(u32) (== 4) by definition.
- * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key :
+ *  00 : Private process futex (PTHREAD_PROCESS_PRIVATE)
+ *       (no reference on an inode or mm)
+ *  01 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       mapped on a file (reference on the underlying inode)
+ *  10 : Shared futex (PTHREAD_PROCESS_SHARED)
+ *       (but private mapping on an mm, and reference taken on it)
  */
+
+#define FUT_OFF_INODE    1 /* We set bit 0 if key has a reference on inode */
+#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */
+
 union futex_key {
 	unsigned long __user *uaddr;
 	struct {
@@ -135,8 +157,28 @@ union futex_key {
 		int offset;
 	} both;
 };
-int get_futex_key(void __user *uaddr, union futex_key *key);
+
+/**
+ * get_futex_key - Get parameters which are the keys for a futex.
+ * @uaddr: virtual address of the futex
+ * @size: size of futex (4 or 8)
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
+ * @key: address where result is stored.
+ *
+ * Returns an error code or 0
+ */
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared,
+		  union futex_key *key);
+
+/**
+ * get_futex_key_refs - Take a reference to the resource addressed by a key
+ */
 void get_futex_key_refs(union futex_key *key);
+
+/**
+ * drop_futex_key_refs - Drop a reference to the resource addressed by a key.
+ */
 void drop_futex_key_refs(union futex_key *key);
 
 #ifdef CONFIG_FUTEX
--- linux-2.6.21-rc5-mm4/kernel/futex.c
+++ linux-2.6.21-rc5-mm4-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ * PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  * enough at me, Linus for the original (flawed) idea, Matthew
  * Kirkwood for proof-of-concept implementation.
@@ -199,11 +202,14 @@ static inline int match_futex(union fute
  * Returns: 0, or negative error code.
  * The key words are stored in *key on success.
  *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, union futex_key *key)
+int get_futex_key(void __user *fuaddr, int fsize, struct rw_semaphore *fshared,
+		  union futex_key *key)
 {
-	unsigned long address = (unsigned long)uaddr;
+	unsigned long address = (unsigned long)fuaddr;
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	struct page *page;
@@ -213,11 +219,29 @@ int get_futex_key(void __user *uaddr, un
 	 * The futex address must be "naturally" aligned.
 	 */
 	key->both.offset = address % PAGE_SIZE;
-	if (unlikely((key->both.offset % sizeof(u32)) != 0))
+	if (unlikely((key->both.offset & (fsize - 1)) != 0))
 		return -EINVAL;
 	address -= key->both.offset;
 
 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'address' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 */
+	if (!fshared) {
+		/*
+		 * Note : At this point, address points to the start of page,
+		 * not the real futex address, this is OK.
+		 */
+		if (!access_ok(VERIFY_WRITE, address, sizeof(int)))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping.  So check vma first.
 	 */
@@ -232,7 +256,7 @@ int get_futex_key(void __user *uaddr, un
 		return (vma->vm_flags & VM_IO) ? -EPERM : -EACCES;
 
 	/* Save the user address in the ley */
-	key->uaddr = uaddr;
+	key->uaddr = fuaddr;
 
 	/*
 	 * Private mappings are handled in a simple way.
@@ -244,6 +268,7 @@ int get_futex_key(void __user *uaddr, un
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -253,7 +278,7 @@ int get_futex_key(void __user *uaddr, un
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -281,17 +306,19 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all.  mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
-	}
+			break;
+	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
 
@@ -301,11 +328,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -339,28 +370,38 @@ get_futex_value_locked(unsigned long *de
 }
 
 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+			      struct rw_semaphore *fshared, int attempt)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = -EFAULT;
 
-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
-		return -EFAULT;
-
-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
+	if (attempt > 2)
+		return ret;
+	if (!fshared)
+		down_read(&mm->mmap_sem);
+	vma = find_vma(mm, address);
+	if (vma &&
+	    address >= vma->vm_start &&
+	    (vma->vm_flags & VM_WRITE)) {
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			ret = 0;
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			ret = 0;
+			current->maj_flt++;
+			break;
+		}
 	}
-	return 0;
+	if (!fshared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }
 
 /*
@@ -702,20 +743,22 @@ double_lock_hb(struct futex_hash_bucket
 }
 
 /*
- * Wake up all waiters hashed on the physical page that is mapped
- * to this virtual address:
+ * Wake up all waiters on a futex (fuaddr, futex64, fshared)
  */
-static int futex_wake(unsigned long __user *uaddr, int nr_wake)
+static int futex_wake(unsigned long __user *fuaddr, int futex64,
+		      struct rw_semaphore *fshared, int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
 	struct plist_head *head;
 	union futex_key key;
 	int ret;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &key);
+	ret = get_futex_key(fuaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -737,7 +780,8 @@ static int futex_wake(unsigned long __us
 
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -807,7 +851,8 @@ retry:
  */
 static int
 futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+		 struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -817,6 +862,7 @@ futex_requeue_pi(unsigned long __user *u
 	struct rt_mutex_waiter *waiter, *top_waiter = NULL;
 	struct rt_mutex *lock2 = NULL;
 	int ret, drop_count = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	if (refill_pi_state_cache())
 		return -ENOMEM;
@@ -825,12 +871,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -998,21 +1045,24 @@ out:
  */
 static int
 futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op, int futex64)
+	      int nr_wake, int nr_wake2, int op, int futex64,
+	      struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head;
 	struct futex_q *this, *next;
 	int ret, op_ret, attempt = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1055,15 +1105,14 @@ retry:
 		 * futex_atomic_op_inuser needs to both read and write
 		 * *(int __user *)uaddr2, but we can't modify it
 		 * non-atomically.  Therefore, if get_user below is not
-		 * enough, we need to handle the fault ourselves, while
-		 * still holding the mmap_sem.
+		 * enough, we need to handle the fault ourselves. Make
+		 * sure we hold mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						 fshared, attempt);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}
 
@@ -1071,7 +1120,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 
 		ret = futex_get_user(&dummy, uaddr2, futex64);
 		if (ret)
@@ -1108,7 +1158,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1118,21 +1169,23 @@ out:
  */
 static int
 futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64,
+	      struct rw_semaphore *fshared)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head1;
 	struct futex_q *this, *next;
 	int ret, drop_count = 0;
-
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 retry:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr1, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;
 
@@ -1155,7 +1208,8 @@ futex_requeue(unsigned long __user *uadd
 			 * If we would have faulted, release mmap_sem, fault
 			 * it in and start all over again.
 			 */
-			up_read(&current->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 
 			ret = futex_get_user(&curval, uaddr1, futex64);
 
@@ -1208,7 +1262,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);
 
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }
 
@@ -1390,9 +1445,17 @@ static int fixup_pi_state_owner(unsigned
 	return ret;
 }
 
+/*
+ * In case we must use restart_block to restart a futex_wait,
+ * we encode in the 'arg3' both futex64 and shared capabilities
+ */
+#define ARG3_FUTEX64 1
+#define ARG3_SHARED  2
+
 static long futex_wait_restart(struct restart_block *restart);
-static int futex_wait(unsigned long __user *uaddr, unsigned long val,
-		      ktime_t *abs_time, int futex64)
+static int futex_wait(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared,
+		      unsigned long val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
 	DECLARE_WAITQUEUE(wait, curr);
@@ -1402,12 +1465,14 @@ static int futex_wait(unsigned long __us
 	int ret;
 	struct hrtimer_sleeper t, *to = NULL;
 	int rem = 0;
+	int fsize = futex64 ? sizeof(u64) : sizeof(u32);
 
 	q.pi_state = NULL;
 retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 
-	ret = get_futex_key(uaddr, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;
 
@@ -1430,8 +1495,8 @@ static int futex_wait(unsigned long __us
 	 * a wakeup when *uaddr != val on entry to the syscall.  This is
 	 * rare, but normal.
* - * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up in get_futex_key. + * for shared futexes, we hold the mmap semaphore, so the mapping + * cannot have changed since we looked it up in get_futex_key. */ ret = get_futex_value_locked(&uval, uaddr, futex64); @@ -1442,7 +1507,8 @@ static int futex_wait(unsigned long __us * If we would have faulted, release mmap_sem, fault it in and * start all over again. */ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret) @@ -1468,7 +1534,8 @@ static int futex_wait(unsigned long __us * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); /* * There might have been scheduling since the queue_me(), as we @@ -1568,7 +1635,8 @@ static int futex_wait(unsigned long __us } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); } debug_rt_mutex_free_waiter(&q.waiter); @@ -1597,7 +1665,13 @@ static int futex_wait(unsigned long __us restart->arg0 = (unsigned long)uaddr; restart->arg1 = val; restart->arg2 = (unsigned long)abs_time; - restart->arg3 = (unsigned long)futex64; + restart->arg3 = 0; +#ifdef CONFIG_64BIT + if (futex64) + restart->arg3 |= ARG3_FUTEX64; +#endif + if (fshared) + restart->arg3 |= ARG3_SHARED; return -ERESTART_RESTARTBLOCK; } @@ -1605,7 +1679,8 @@ static int futex_wait(unsigned long __us queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -1615,10 +1690,17 @@ static long futex_wait_restart(struct re unsigned long __user *uaddr = (unsigned long __user *)restart->arg0; unsigned long val = restart->arg1; ktime_t *abs_time = (ktime_t *)restart->arg2; - int futex64 = (int)restart->arg3; + int futex64 = 0; + struct rw_semaphore *fshared = NULL; +#ifdef CONFIG_64BIT 
+ if (restart->arg3 & ARG3_FUTEX64) + futex64 = 1; +#endif + if (restart->arg3 & ARG3_SHARED) + fshared = &current->mm->mmap_sem; restart->fn = do_no_restart_syscall; - return (long)futex_wait(uaddr, val, abs_time, futex64); + return (long)futex_wait(uaddr, futex64, fshared, val, abs_time); } @@ -1674,7 +1756,7 @@ static void set_pi_futex_owner(struct fu * races the kernel might see a 0 value of the futex too.) */ static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time, - int trylock, int futex64) + int trylock, int futex64, struct rw_semaphore *fshared) { struct hrtimer_sleeper timeout, *to = NULL; struct task_struct *curr = current; @@ -1682,6 +1764,7 @@ static int futex_lock_pi(unsigned long _ unsigned long uval, newval, curval; struct futex_q q; int ret, lock_held, attempt = 0; + int fsize = futex64 ? sizeof(u64) : sizeof(u32); if (refill_pi_state_cache()) return -ENOMEM; @@ -1695,9 +1778,10 @@ static int futex_lock_pi(unsigned long _ q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, fsize, fshared, &q.key); if (unlikely(ret != 0)) goto out_release_sem; @@ -1818,7 +1902,8 @@ static int futex_lock_pi(unsigned long _ * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); WARN_ON(!q.pi_state); /* @@ -1832,7 +1917,8 @@ static int futex_lock_pi(unsigned long _ ret = ret ?
0 : -EWOULDBLOCK; } - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); spin_lock(q.lock_ptr); /* @@ -1854,7 +1940,8 @@ static int futex_lock_pi(unsigned long _ } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); } if (!detect && ret == -EDEADLK && 0) @@ -1866,7 +1953,8 @@ static int futex_lock_pi(unsigned long _ queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; uaddr_faulted: @@ -1877,15 +1965,16 @@ static int futex_lock_pi(unsigned long _ * still holding the mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, fshared, + attempt); + if (ret) goto out_unlock_release_sem; - } goto retry_locked; } queue_unlock(&q, hb); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret && (uval != -EFAULT)) @@ -1899,7 +1988,8 @@ static int futex_lock_pi(unsigned long _ * This is the in-kernel slowpath: we look up the PI state (if any), * and do the rt-mutex unlock. */ -static int futex_unlock_pi(unsigned long __user *uaddr, int futex64) +static int futex_unlock_pi(unsigned long __user *uaddr, int futex64, + struct rw_semaphore *fshared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -1907,6 +1997,7 @@ static int futex_unlock_pi(unsigned long struct plist_head *head; union futex_key key; int ret, attempt = 0; + int fsize = futex64 ? 
sizeof(u64) : sizeof(u32); retry: if (futex_get_user(&uval, uaddr, futex64)) @@ -1919,9 +2010,10 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, fsize, fshared, &key); if (unlikely(ret != 0)) goto out; @@ -1980,7 +2072,8 @@ retry_locked: out_unlock: spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; @@ -1992,15 +2085,16 @@ pi_faulted: * still holding the mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, fshared, + attempt); + if (ret) goto out_unlock; - } goto retry_locked; } spin_unlock(&hb->lock); - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = futex_get_user(&uval, uaddr, futex64); if (!ret && (uval != -EFAULT)) @@ -2052,6 +2146,7 @@ static int futex_fd(u32 __user *uaddr, i struct futex_q *q; struct file *filp; int ret, err; + struct rw_semaphore *fshared; static unsigned long printk_interval; if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) { @@ -2093,11 +2188,12 @@ static int futex_fd(u32 __user *uaddr, i } q->pi_state = NULL; - down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + fshared = &current->mm->mmap_sem; + down_read(fshared); + err = get_futex_key(uaddr, sizeof(u32), fshared, &q->key); if (unlikely(err != 0)) { - up_read(&current->mm->mmap_sem); + up_read(fshared); kfree(q); goto error; } @@ -2109,7 +2205,7 @@ static int futex_fd(u32 __user *uaddr, i filp->private_data = q; queue_me(q, ret, filp); - up_read(&current->mm->mmap_sem); + up_read(fshared); /* Now we map fd to filp, so userspace can access it */ fd_install(ret, filp); @@ -2238,7 +2334,8 @@ retry: */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake((unsigned long __user *)uaddr, 1); + futex_wake((unsigned long __user *)uaddr, + 0, &curr->mm->mmap_sem, 1); } }
return 0; @@ -2326,13 +2423,18 @@ long do_futex(unsigned long __user *uadd unsigned long val2, unsigned long val3, int fut64) { int ret; + int cmd = op & FUTEX_CMD_MASK; + struct rw_semaphore *fshared = NULL; + + if (!(op & FUTEX_PRIVATE_FLAG)) + fshared = &current->mm->mmap_sem; - switch (op) { + switch (cmd) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout, fut64); + ret = futex_wait(uaddr, fut64, fshared, val, timeout); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, fut64, fshared, val); break; case FUTEX_FD: if (fut64) @@ -2342,25 +2444,29 @@ long do_futex(unsigned long __user *uadd ret = futex_fd((u32 __user *)uaddr, val); break; case FUTEX_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64); + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64, + fshared); break; case FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64); + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64, + fshared); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64); + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64, + fshared); break; case FUTEX_LOCK_PI: - ret = futex_lock_pi(uaddr, val, timeout, 0, fut64); + ret = futex_lock_pi(uaddr, val, timeout, 0, fut64, fshared); break; case FUTEX_UNLOCK_PI: - ret = futex_unlock_pi(uaddr, fut64); + ret = futex_unlock_pi(uaddr, fut64, fshared); break; case FUTEX_TRYLOCK_PI: - ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64); + ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64, fshared); break; case FUTEX_CMP_REQUEUE_PI: - ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64); + ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64, + fshared); break; default: ret = -ENOSYS; @@ -2377,23 +2483,24 @@ sys_futex64(u64 __user *uaddr, int op, u struct timespec ts; ktime_t t, *tp = NULL; u64 val2 = 0; + int cmd = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if
(utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); - if (op == FUTEX_WAIT) + if (cmd == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } /* - * requeue parameter in 'utime' if op == FUTEX_REQUEUE. + * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE. */ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE + || cmd == FUTEX_CMP_REQUEUE_PI) val2 = (unsigned long) utime; return do_futex((unsigned long __user*)uaddr, op, val, tp, @@ -2409,23 +2516,24 @@ asmlinkage long sys_futex(u32 __user *ua struct timespec ts; ktime_t t, *tp = NULL; u32 val2 = 0; + int cmd = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); - if (op == FUTEX_WAIT) + if (cmd == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } /* * requeue parameter in 'utime' if op == FUTEX_REQUEUE. */ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE + || cmd == FUTEX_CMP_REQUEUE_PI) val2 = (u32) (unsigned long) utime; return do_futex((unsigned long __user*)uaddr, op, val, tp, ^ permalink raw reply [flat|nested] 78+ messages in thread
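The patch above packs flags into plain integers in two places: do_futex() splits 'op' into a command plus a private/shared bit, and futex_wait() stores the futex64 and shared state in restart->arg3 so that futex_wait_restart() can recover them. A minimal user-space sketch of both encodings follows. ARG3_FUTEX64 and ARG3_SHARED carry the values given in the patch; SKETCH_PRIVATE_FLAG, SKETCH_CMD_MASK and every function name below are assumptions made for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only; the patch names the symbols
 * FUTEX_PRIVATE_FLAG and FUTEX_CMD_MASK but does not show their values. */
#define SKETCH_PRIVATE_FLAG 128
#define SKETCH_CMD_MASK     (~SKETCH_PRIVATE_FLAG)

/* These two values are the ones defined in the patch. */
#define ARG3_FUTEX64 1
#define ARG3_SHARED  2

struct decoded_op {
	int cmd;      /* command with the flag bit stripped */
	bool shared;  /* true when the private flag is absent */
};

/* Mirrors the split done at the top of do_futex(). */
static struct decoded_op decode_op(int op)
{
	struct decoded_op d;

	d.cmd = op & SKETCH_CMD_MASK;
	d.shared = !(op & SKETCH_PRIVATE_FLAG);
	return d;
}

/* Mirrors how futex_wait() fills restart->arg3. */
static unsigned long pack_arg3(bool futex64, bool shared)
{
	unsigned long arg3 = 0;

	if (futex64)
		arg3 |= ARG3_FUTEX64;
	if (shared)
		arg3 |= ARG3_SHARED;
	return arg3;
}

/* Mirrors how futex_wait_restart() reads restart->arg3 back. */
static void unpack_arg3(unsigned long arg3, bool *futex64, bool *shared)
{
	assert(futex64 && shared);
	*futex64 = (arg3 & ARG3_FUTEX64) != 0;
	*shared = (arg3 & ARG3_SHARED) != 0;
}
```

Keeping the flag in a high bit of 'op' means existing callers that never set it keep the shared behaviour, which seems to be the compatibility property the patch relies on.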
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet @ 2007-04-07 9:30 ` Nick Piggin 2007-04-07 10:00 ` Eric Dumazet 2007-04-07 11:18 ` Jakub Jelinek ` (2 subsequent siblings) 3 siblings, 1 reply; 78+ messages in thread From: Nick Piggin @ 2007-04-07 9:30 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > Hi all > > Updates on this take4 : > > - All remarks from Nick were addressed I hope Yeah looks very nice. Thanks for doing that. > > - Current mm code has a problem with 64bit futexes, as spotted by Nick : > > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > So it is possible a 64bit futex spans two pages of memory... > I had to change get_futex_key() prototype to be able to do a correct test. I wonder if it should be enforcing alignment to keep it on 1 page? Otherwise, Acked-by: Nick Piggin <npiggin@suse.de> I'll be away for a couple of days, but I'll look at running some performance tests when I get back. Thanks, Nick -- SUSE Labs, Novell Inc.
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 9:30 ` Nick Piggin @ 2007-04-07 10:00 ` Eric Dumazet 2007-04-11 7:22 ` Nick Piggin 0 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-07 10:00 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Sat, 07 Apr 2007 19:30:14 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Eric Dumazet wrote: > > > > - Current mm code has a problem with 64bit futexes, as spotted by Nick : > > > > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > > So it is possible a 64bit futex spans two pages of memory... > > I had to change get_futex_key() prototype to be able to do a correct test. > > I wonder if it should be enforcing alignment to keep it on 1 page? I believe I just did that :) Before the patch : Alignment was only 4 bytes for all futexes, but some user app could trigger a kernel bug (since one 64bit futex could sit on two different pages, so possibly separate vmas, so the inode refcounting was wrong, and access_ok did not do a correct check) After the patch : Alignment is 8 bytes for 64 bit futexes, 4 bytes for 32bit futexes. All futexes are constrained to be in one single page.
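To make the before/after contrast concrete, here is a small user-space sketch of the natural-alignment test, modelled on the 'key->both.offset & (size - 1)' check in the get_futex_key() patch. The 4096-byte page size and the helper names are assumptions for illustration; the kernel uses PAGE_SIZE and does this inline.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size, illustration only */

/* Returns 0 when a futex of 'size' bytes at 'addr' is naturally aligned,
 * -EINVAL otherwise. 'size' must be 4 or 8, as in the patch. */
static int check_futex_alignment(uintptr_t addr, unsigned int size)
{
	uintptr_t offset;

	assert(size == 4 || size == 8);
	offset = addr % SKETCH_PAGE_SIZE;
	if (offset & (size - 1))
		return -EINVAL;
	return 0;
}

/* True when the object [addr, addr + size) crosses a page boundary. */
static int spans_two_pages(uintptr_t addr, unsigned int size)
{
	return (addr / SKETCH_PAGE_SIZE) !=
	       ((addr + size - 1) / SKETCH_PAGE_SIZE);
}
```

With these helpers, an address ending in 0xFFC passes the old 4-byte test yet an 8-byte futex placed there crosses into the next page; the size-aware test rejects it. Because 4 and 8 both divide the page size, a naturally aligned futex can never straddle two pages.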
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 10:00 ` Eric Dumazet @ 2007-04-11 7:22 ` Nick Piggin 2007-04-11 8:14 ` Eric Dumazet 0 siblings, 1 reply; 78+ messages in thread From: Nick Piggin @ 2007-04-11 7:22 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > On Sat, 07 Apr 2007 19:30:14 +1000 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>Eric Dumazet wrote: > > > >>>- Current mm code have a problem with 64bit futexes, as spoted by Nick : >>> >>>get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. >>>So it is possible a 64bit futex spans two pages of memory... >>>I had to change get_futex_key() prototype to be able to do a correct test. >> >>I wonder if it should be encfocing alignment to keep in on 1 page? > > > I believe I just did that :) Yes :P What I was trying to say before jumping on a plane is that sys_futex/sys_futex64 calls should each check their own address alignment, so the deeper parts of the call stack always know alignment is correct. This will remove all the fsize you pass around, and also sanitise the userspace argument much higher in the call stack, which is very preferable and more conventional. Maybe this isn't possible (it's very obvious, so there may be a good reason it hasn't been done). -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 7:22 ` Nick Piggin @ 2007-04-11 8:14 ` Eric Dumazet 2007-04-11 9:23 ` Nick Piggin 0 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-11 8:14 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Wed, 11 Apr 2007 17:22:57 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > Eric Dumazet wrote: > > On Sat, 07 Apr 2007 19:30:14 +1000 > > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > > > > >>Eric Dumazet wrote: > > > > > > > >>>- Current mm code have a problem with 64bit futexes, as spoted by Nick : > >>> > >>>get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > >>>So it is possible a 64bit futex spans two pages of memory... > >>>I had to change get_futex_key() prototype to be able to do a correct test. > >> > >>I wonder if it should be encfocing alignment to keep in on 1 page? > > > > > > I believe I just did that :) > > Yes :P What I was trying to say before jumping on a plane is that > sys_futex/sys_futex64 calls should each check their own address alignment, so > the deeper parts of the call stack always know alignment is correct. > > This will remove all the fsize you pass around, and also sanitise the userspace > argument much higher in the call stack, which is very preferable and more > conventional. > > Maybe this isn't possible (it's very obvious, so there may be a good reason it > hasn't been done). I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of tree thing is using it. As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller. 
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 8:14 ` Eric Dumazet @ 2007-04-11 9:23 ` Nick Piggin 2007-04-11 9:30 ` Pierre Peiffer 2007-04-11 9:35 ` Eric Dumazet 0 siblings, 2 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-11 9:23 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > On Wed, 11 Apr 2007 17:22:57 +1000 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: > > >>Eric Dumazet wrote: >> >>>On Sat, 07 Apr 2007 19:30:14 +1000 >>>Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>> >>> >>> >>>>Eric Dumazet wrote: >>> >>> >>> >>> >>>>>- Current mm code have a problem with 64bit futexes, as spoted by Nick : >>>>> >>>>>get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. >>>>>So it is possible a 64bit futex spans two pages of memory... >>>>>I had to change get_futex_key() prototype to be able to do a correct test. >>>> >>>>I wonder if it should be encfocing alignment to keep in on 1 page? >>> >>> >>>I believe I just did that :) >> >>Yes :P What I was trying to say before jumping on a plane is that >>sys_futex/sys_futex64 calls should each check their own address alignment, so >>the deeper parts of the call stack always know alignment is correct. >> >>This will remove all the fsize you pass around, and also sanitise the userspace >>argument much higher in the call stack, which is very preferable and more >>conventional. >> >>Maybe this isn't possible (it's very obvious, so there may be a good reason it >>hasn't been done). > > > I had this idea as well, but considering get_futex_key() is exported in include/linux/futex.h, I believe some out-of tree thing is using it. You're changing the API anyway, so just get rid of the declaration from futex.h (that doesn't actually make for an export anyway, unless it is inline). But... 
that isn't there in mainline. Why is it in -mm? At any rate, that makes it a no brainer to change. > > As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller. Well we _control_ the API, so let's make it as clean and performant as possible from the start. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 9:23 ` Nick Piggin @ 2007-04-11 9:30 ` Pierre Peiffer 2007-04-11 9:39 ` Nick Piggin 2007-04-11 9:35 ` Eric Dumazet 1 sibling, 1 reply; 78+ messages in thread From: Pierre Peiffer @ 2007-04-11 9:30 UTC (permalink / raw) To: Nick Piggin Cc: Eric Dumazet, Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Nick Piggin wrote: > > But... that isn't there in mainline. Why is it in -mm? This was introduced by lguest code.... I did not follow exactly why. Pierre > At any rate, that > makes > it a no brainer to change. > >> >> As this external thing certainly is not doing the check itself, to be >> on the safe side we should enforce it in get_futex_key(). I agree with >> you : If we want to maximize performance, we could say : The check >> *must* be done by the caller. > > Well we _control_ the API, so let's make it as clean and performant as > possible > from the start. > -- Pierre Peiffer
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 9:30 ` Pierre Peiffer @ 2007-04-11 9:39 ` Nick Piggin 2007-04-11 9:40 ` Nick Piggin 0 siblings, 1 reply; 78+ messages in thread From: Nick Piggin @ 2007-04-11 9:39 UTC (permalink / raw) To: Pierre Peiffer Cc: Eric Dumazet, Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Pierre Peiffer wrote: > Nick Piggin wrote: > >> >> But... that isn't there in mainline. Why is it in -mm? > > > This was introduced by lguest code.... > I did not follow exactly why. OK, that's no problem, then it can remain exported but we just have to document and audit that callers must pass in an aligned address. We can also BUG_ON(address & ~PAGE_MASK); to handle the security aspect. -- SUSE Labs, Novell Inc.
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 9:39 ` Nick Piggin @ 2007-04-11 9:40 ` Nick Piggin 0 siblings, 0 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-11 9:40 UTC (permalink / raw) To: Nick Piggin Cc: Pierre Peiffer, Eric Dumazet, Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Rusty Russell Nick Piggin wrote: > Pierre Peiffer wrote: > >> Nick Piggin wrote: >> >>> >>> But... that isn't there in mainline. Why is it in -mm? >> >> >> >> This was introduced by lguest code.... >> I did not follow exactly why. > > > OK, that's no problem, then it can remain exported but we just > have to document and audit that callers must pass in an aligned > address. We can also BUG_ON(address & ~PAGE_MASK); to handle > the security aspect. Err, duh no we can't because we don't know the size, which is the whole point :P Anyway.... just ensure callers have to fix alignment. -- SUSE Labs, Novell Inc.
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 9:23 ` Nick Piggin 2007-04-11 9:30 ` Pierre Peiffer @ 2007-04-11 9:35 ` Eric Dumazet 2007-04-12 1:57 ` Nick Piggin 1 sibling, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-11 9:35 UTC (permalink / raw) To: Nick Piggin Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Wed, 11 Apr 2007 19:23:26 +1000 Nick Piggin <nickpiggin@yahoo.com.au> wrote: > But... that isn't there in mainline. Why is it in -mm? At any rate, that makes > it a no brainer to change. Seems to be related to lguest. Ask Rusty :) > > > > > As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller. > > Well we _control_ the API, so let's make it as clean and performant as possible > from the start. Take a look at do_futex(). Adding checks in callers just increase code size. I tried this got only bad results. This would speedup only the slow path (ie when some user code want to give us non aligned addrs) A single factorized check is cleaner and not slower, since we reduce icache pressure. ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-11 9:35 ` Eric Dumazet @ 2007-04-12 1:57 ` Nick Piggin 0 siblings, 0 replies; 78+ messages in thread From: Nick Piggin @ 2007-04-12 1:57 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Eric Dumazet wrote: > On Wed, 11 Apr 2007 19:23:26 +1000 > Nick Piggin <nickpiggin@yahoo.com.au> wrote: >>>As this external thing certainly is not doing the check itself, to be on the safe side we should enforce it in get_futex_key(). I agree with you : If we want to maximize performance, we could say : The check *must* be done by the caller. >> >>Well we _control_ the API, so let's make it as clean and performant as possible >>from the start. > > > Take a look at do_futex(). > Adding checks in callers just increase code size. I tried this got only bad results. > This would speedup only the slow path (ie when some user code want to give us non aligned addrs) > A single factorized check is cleaner and not slower, since we reduce icache pressure. 1 extra check versus all that additional argument passing? I don't think it is conclusive. -- SUSE Labs, Novell Inc. ^ permalink raw reply [flat|nested] 78+ messages in thread
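The two positions in this sub-thread can be contrasted with a short sketch of the caller-side variant Nick describes: each entry point validates the alignment for its own futex width, so the shared inner routine needs no size argument. This is only an illustration under assumed names (entry_futex32, entry_futex64 and inner_futex_op are hypothetical and do not exist in the kernel); the posted patch takes the other route and passes fsize down to get_futex_key().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical inner routine: it relies on its callers having validated
 * the address, so it takes no size parameter. */
static int inner_futex_op(uintptr_t uaddr)
{
	/* Every caller guarantees at least 4-byte alignment. */
	assert((uaddr & (sizeof(uint32_t) - 1)) == 0);
	return 0;
}

/* Hypothetical 32-bit entry point: checks 4-byte alignment itself. */
static int entry_futex32(uintptr_t uaddr)
{
	if (uaddr & (sizeof(uint32_t) - 1))
		return -EINVAL;
	return inner_futex_op(uaddr);
}

/* Hypothetical 64-bit entry point: checks 8-byte alignment itself. */
static int entry_futex64(uintptr_t uaddr)
{
	if (uaddr & (sizeof(uint64_t) - 1))
		return -EINVAL;
	return inner_futex_op(uaddr);
}
```

The trade-off debated above is visible here: the check is duplicated once per entry point, but nothing below the entry points has to carry the futex size. Which layout is faster is not settled by the thread.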
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet 2007-04-07 9:30 ` Nick Piggin @ 2007-04-07 11:18 ` Jakub Jelinek 2007-04-07 11:54 ` Eric Dumazet 2007-04-07 22:15 ` Andrew Morton 2007-04-11 9:19 ` [PATCH, take5] " Eric Dumazet 3 siblings, 1 reply; 78+ messages in thread From: Jakub Jelinek @ 2007-04-07 11:18 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote: > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > So it is possible a 64bit futex spans two pages of memory... That would be a user bug. 32-bit futexes have to be 32-bit aligned, 64-bit futexes have to be 64-bit aligned. Jakub ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 11:18 ` Jakub Jelinek @ 2007-04-07 11:54 ` Eric Dumazet 2007-04-07 16:40 ` Ulrich Drepper 0 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-07 11:54 UTC (permalink / raw) To: Jakub Jelinek Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel Jakub Jelinek wrote: > On Sat, Apr 07, 2007 at 10:43:39AM +0200, Eric Dumazet wrote: >> get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. >> So it is possible a 64bit futex spans two pages of memory... > > That would be a user bug. 32-bit futexes have to be 32-bit aligned, 64-bit > futexes have to be 64-bit aligned. I am not sure what you want to say. User doing sys_futex64(0x......FFC, FUTEX_WAKE_OP, ...) and crashing kernel or corrupting data is ok because it's a user bug ? User is allowed to do anything, kernel must check and protect innocents.
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 11:54 ` Eric Dumazet @ 2007-04-07 16:40 ` Ulrich Drepper 0 siblings, 0 replies; 78+ messages in thread From: Ulrich Drepper @ 2007-04-07 16:40 UTC (permalink / raw) To: Eric Dumazet Cc: Jakub Jelinek, Andrew Morton, Dave Jones, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On 4/7/07, Eric Dumazet <dada1@cosmosbay.com> wrote: > I am not sure what you want to say. What Jakub meant is that it is OK for the kernel to reject using unaligned 64-bit futexes. Just return an error in all cases (not just in some). ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet 2007-04-07 9:30 ` Nick Piggin 2007-04-07 11:18 ` Jakub Jelinek @ 2007-04-07 22:15 ` Andrew Morton 2007-04-10 9:21 ` Eric Dumazet 2007-04-11 9:19 ` [PATCH, take5] " Eric Dumazet 3 siblings, 1 reply; 78+ messages in thread From: Andrew Morton @ 2007-04-07 22:15 UTC (permalink / raw) To: Eric Dumazet Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Sat, 7 Apr 2007 10:43:39 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote: > Hi all > > Updates on this take4 : > > - All remarks from Nick were addressed I hope > > - Current mm code has a problem with 64bit futexes, as spotted by Nick : > > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > So it is possible a 64bit futex spans two pages of memory... > I had to change get_futex_key() prototype to be able to do a correct test. > Could we please have that in a separate patch? It's logically a part of the 64-bit-futex work, is it not? > + > +/** > + * get_futex_key - Get parameters which are the keys for a futex. > + * @uaddr: virtual address of the futex > + * @size: size of futex (4 or 8) > + * @shared: NULL for a PROCESS_PRIVATE futex, > + * &current->mm->mmap_sem for a PROCESS_SHARED futex > + * @key: address where result is stored. > + * > + * Returns an error code or 0 > + */ > +int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared, > + union futex_key *key); Thanks for documenting the interface, but please do it in the .c file at the function's definition site.
* Re: [PATCH, take4] FUTEX : new PRIVATE futexes 2007-04-07 22:15 ` Andrew Morton @ 2007-04-10 9:21 ` Eric Dumazet 0 siblings, 0 replies; 78+ messages in thread From: Eric Dumazet @ 2007-04-10 9:21 UTC (permalink / raw) To: Andrew Morton, Pierre Peiffer Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Sat, 7 Apr 2007 15:15:56 -0700 Andrew Morton <akpm@linux-foundation.org> wrote: > On Sat, 7 Apr 2007 10:43:39 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote: > > > > get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. > > So it is possible a 64bit futex spans two pages of memory... > > I had to change get_futex_key() prototype to be able to do a correct test. > > > > Cold we please have that in a separate patch? It's logically a part of the > 64-bit-futex work, is it not? Yes you probably want this patch to fix 64bit futexes support. [PATCH] get_futex_key() must check proper alignement for 64bit futexes get_futex_key() does an alignment check against sizeof(u32) regardless of futex being 64bits or not. So it is possible a 64bit futex spans two pages of memory, and some malicious user code can trigger data corruption. We must add a 'fsize' parameter to get_futex_key(), telling it the size of the futex (4 or 8 bytes) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/futex.h | 2 - kernel/futex.c | 78 ++++++++++++++++++++++++++-------------- 2 files changed, 52 insertions(+), 28 deletions(-) --- linux-2.6.21-rc6-mm1/kernel/futex.c +++ linux-2.6.21-rc6-mm1-ed/kernel/futex.c @@ -189,19 +189,22 @@ static inline int match_futex(union fute && key1->both.offset == key2->both.offset); } -/* - * Get parameters which are the keys for a futex. +/** + * get_futex_key - Get parameters which are the keys for a futex. 
+ * @uaddr: virtual address of the futex + * @size: size of futex (4 or 8) + * @key: address where result is stored. + * + * Returns a negative error code or 0 + * The key words are stored in *key on success. * * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode, * offset_within_page). For private mappings, it's (uaddr, current->mm). * We can usually work out the index without swapping in the page. * - * Returns: 0, or negative error code. - * The key words are stored in *key on success. - * * Should be called with &current->mm->mmap_sem but NOT any spinlocks. */ -int get_futex_key(void __user *uaddr, union futex_key *key) +int get_futex_key(void __user *uaddr, int size, union futex_key *key) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -213,7 +216,7 @@ int get_futex_key(void __user *uaddr, un * The futex address must be "naturally" aligned. */ key->both.offset = address % PAGE_SIZE; - if (unlikely((key->both.offset % sizeof(u32)) != 0)) + if (unlikely((key->both.offset & (size - 1)) != 0)) return -EINVAL; address -= key->both.offset; @@ -705,17 +708,18 @@ double_lock_hb(struct futex_hash_bucket * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(unsigned long __user *uaddr, int nr_wake) +static int futex_wake(unsigned long __user *uaddr, int futex64, int nr_wake) { struct futex_hash_bucket *hb; struct futex_q *this, *next; struct plist_head *head; union futex_key key; int ret; + int fsize = futex64 ? sizeof(u64) : sizeof(u32); down_read(&current->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, fsize, &key); if (unlikely(ret != 0)) goto out; @@ -817,6 +821,7 @@ futex_requeue_pi(unsigned long __user *u struct rt_mutex_waiter *waiter, *top_waiter = NULL; struct rt_mutex *lock2 = NULL; int ret, drop_count = 0; + int fsize = futex64 ?
sizeof(u64) : sizeof(u32); if (refill_pi_state_cache()) return -ENOMEM; @@ -827,10 +832,10 @@ retry: */ down_read(¤t->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fsize, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fsize, &key2); if (unlikely(ret != 0)) goto out; @@ -1005,14 +1010,15 @@ futex_wake_op(unsigned long __user *uadd struct plist_head *head; struct futex_q *this, *next; int ret, op_ret, attempt = 0; + int fsize = futex64 ? sizeof(u64) : sizeof(u32); retryfull: down_read(¤t->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fsize, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fsize, &key2); if (unlikely(ret != 0)) goto out; @@ -1125,14 +1131,15 @@ futex_requeue(unsigned long __user *uadd struct plist_head *head1; struct futex_q *this, *next; int ret, drop_count = 0; + int fsize = futex64 ? 
sizeof(u64) : sizeof(u32); retry: down_read(¤t->mm->mmap_sem); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fsize, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fsize, &key2); if (unlikely(ret != 0)) goto out; @@ -1390,9 +1397,15 @@ static int fixup_pi_state_owner(unsigned return ret; } +/* + * In case we must use restart_block to restart a futex_wait, + * we encode in the 'arg3' futex64 capability + */ +#define ARG3_FUTEX64 1 + static long futex_wait_restart(struct restart_block *restart); -static int futex_wait(unsigned long __user *uaddr, unsigned long val, - ktime_t *abs_time, int futex64) +static int futex_wait(unsigned long __user *uaddr, int futex64, + unsigned long val, ktime_t *abs_time) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1402,12 +1415,13 @@ static int futex_wait(unsigned long __us int ret; struct hrtimer_sleeper t, *to = NULL; int rem = 0; + int fsize = futex64 ? 
sizeof(u64) : sizeof(u32); q.pi_state = NULL; retry: down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, fsize, &q.key); if (unlikely(ret != 0)) goto out_release_sem; @@ -1597,7 +1611,11 @@ static int futex_wait(unsigned long __us restart->arg0 = (unsigned long)uaddr; restart->arg1 = val; restart->arg2 = (unsigned long)abs_time; - restart->arg3 = (unsigned long)futex64; + restart->arg3 = 0; +#ifdef CONFIG_64BIT + if (futex64) + restart->arg3 |= ARG3_FUTEX64; +#endif return -ERESTART_RESTARTBLOCK; } @@ -1615,10 +1633,14 @@ static long futex_wait_restart(struct re unsigned long __user *uaddr = (unsigned long __user *)restart->arg0; unsigned long val = restart->arg1; ktime_t *abs_time = (ktime_t *)restart->arg2; - int futex64 = (int)restart->arg3; + int futex64 = 0; +#ifdef CONFIG_64BIT + if (restart->arg3 & ARG3_FUTEX64) + futex64 = 1; +#endif restart->fn = do_no_restart_syscall; - return (long)futex_wait(uaddr, val, abs_time, futex64); + return (long)futex_wait(uaddr, futex64, val, abs_time); } @@ -1682,6 +1704,7 @@ static int futex_lock_pi(unsigned long _ unsigned long uval, newval, curval; struct futex_q q; int ret, lock_held, attempt = 0; + int fsize = futex64 ? sizeof(u64) : sizeof(u32); if (refill_pi_state_cache()) return -ENOMEM; @@ -1697,7 +1720,7 @@ static int futex_lock_pi(unsigned long _ retry: down_read(&curr->mm->mmap_sem); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, fsize, &q.key); if (unlikely(ret != 0)) goto out_release_sem; @@ -1907,6 +1930,7 @@ static int futex_unlock_pi(unsigned long struct plist_head *head; union futex_key key; int ret, attempt = 0; + int fsize = futex64 ? 
sizeof(u64) : sizeof(u32); retry: if (futex_get_user(&uval, uaddr, futex64)) @@ -1921,7 +1945,7 @@ retry: */ down_read(¤t->mm->mmap_sem); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, fsize, &key); if (unlikely(ret != 0)) goto out; @@ -2094,7 +2118,7 @@ static int futex_fd(u32 __user *uaddr, i q->pi_state = NULL; down_read(¤t->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + err = get_futex_key(uaddr, sizeof(u32), &q->key); if (unlikely(err != 0)) { up_read(¤t->mm->mmap_sem); @@ -2238,7 +2262,7 @@ retry: */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake((unsigned long __user *)uaddr, 1); + futex_wake((unsigned long __user *)uaddr, 0, 1); } } return 0; @@ -2329,10 +2353,10 @@ long do_futex(unsigned long __user *uadd switch (op) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout, fut64); + ret = futex_wait(uaddr, fut64, val, timeout); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, fut64, val); break; case FUTEX_FD: if (fut64) --- linux-2.6.21-rc6-mm1/include/linux/futex.h +++ linux-2.6.21-rc6-mm1-ed/include/linux/futex.h @@ -135,7 +135,7 @@ union futex_key { int offset; } both; }; -int get_futex_key(void __user *uaddr, union futex_key *key); +int get_futex_key(void __user *uaddr, int size, union futex_key *key); void get_futex_key_refs(union futex_key *key); void drop_futex_key_refs(union futex_key *key); ^ permalink raw reply [flat|nested] 78+ messages in thread
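The alignment problem described in the patch above can be illustrated with a small user-space sketch. This is not part of the patch: SKETCH_PAGE_SIZE (a 4 KiB page is assumed) and the three helper names are hypothetical, chosen only for illustration. It shows why a check against sizeof(u32) accepts an 8-byte futex that straddles a page boundary, while the `offset & (size - 1)` test rejects it:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Check as done before the fix: only 4-byte alignment is enforced. */
static int old_check_ok(unsigned long address)
{
	unsigned long offset = address % SKETCH_PAGE_SIZE;

	return (offset % sizeof(uint32_t)) == 0;
}

/* Check as done by the patch: size is the futex size, 4 or 8. */
static int new_check_ok(unsigned long address, int size)
{
	unsigned long offset = address % SKETCH_PAGE_SIZE;

	assert(size == 4 || size == 8);
	return (offset & (unsigned long)(size - 1)) == 0;
}

/* True if [address, address + size) crosses a page boundary. */
static int crosses_page(unsigned long address, int size)
{
	return (address % SKETCH_PAGE_SIZE) + (unsigned long)size
		> SKETCH_PAGE_SIZE;
}
```

For an address whose page offset is 4092, old_check_ok() accepts it even though an 8-byte object there crosses into the next page; new_check_ok() with size 8 rejects it and still accepts the same address for a 4-byte futex.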
* [PATCH, take5] FUTEX : new PRIVATE futexes
  2007-04-07 8:43 ` [PATCH, take4] " Eric Dumazet
  ` (2 preceding siblings ...)
  2007-04-07 22:15 ` Andrew Morton
@ 2007-04-11 9:19 ` Eric Dumazet
  2007-04-11 12:23 ` Rusty Russell
  2007-04-26 12:55 ` [PATCH, take6] " Eric Dumazet
  3 siblings, 2 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-04-11 9:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel, Rusty Russel

Hi Andrew

Update on this take5 :

- Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper
  alignment for 64bit futexes
- compile test on x86_64 (one minor typo)
- Added Rusty in CC since he may have to change drivers/lguest/io.c again,
  since get_futex_key() has yet another parameter (fshared).
  (I couldn't find this file in 2.6.21-rc6-mm1 tree)

Thank you

History :

take4 :
- All remarks from Nick were addressed I hope
- Current mm code has a problem with 64bit futexes, as spotted by Nick :

  get_futex_key() does a check against sizeof(u32) regardless of futex being
  64bits or not.
  So it is possible a 64bit futex spans two pages of memory...
  I had to change get_futex_key() prototype to be able to do a correct test.

take3:

I'm pleased to present this patch which improves linux futexes performance
and scalability, merely avoiding taking mmap_sem rwlock.

Ulrich agreed with the API and said glibc work could start as soon as he
gets a Fedora kernel with it :)

In this third version I dropped the NUMA optims and process private hash
table, to let new API come in and be tested.

Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------

A central hash table futex_queues[] holds all contexts (futex_q) of waiting
threads.
Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to
perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer
     (calling find_vma()). This validation tells us if the futex uses
     an inode based store (mapped file), or mm based store (anonymous mem)

2) - compute a hash key

3) - Atomic increment of reference counter on an inode or a mm_struct

4) - lock part of futex_queues[] hash table

5) - perform the test on value of futex.
     (rollback if value != expected_value, returns EWOULDBLOCK)
     (various loops if test triggers mm faults)

6) queue the context into hash table, release the lock got in 4)

7) - release the read_lock on mmap_sem

   <block>

8) Eventually unqueue the context (but rarely, as this part may be done
   by the futex_wake())

Futexes were designed to improve scalability but current implementation
has various problems :

- Central hashtable :
  This means scalability problems if many processes/threads want to use
  futexes at the same time.
  This means NUMA unbalance because this hashtable is located on one node.

- Using mmap_sem on every futex() syscall :

  Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic
  ops on mmap_sem, dirtying cache line :
    - lot of cache line ping pongs on SMP configurations.

  mmap_sem is also extensively used by mm code (page faults, mmap()/munmap())
  Highly threaded processes might suffer from mmap_sem contention.

  mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded
  programs because of contention on the mmap_sem cache line.

- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter:
  It's also a cache line ping pong on SMP. It also increases mmap_sem hold
  time because of cache misses.

Most of these scalability problems come from the fact that futexes are in
one global namespace. As we use a central hash table, we must make sure
they are all using the same reference (given by the mm subsystem).
We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and
optimal performance if carefully implemented. Time has come for linux to have
better threading performance.

The goal is to permit new futex commands to avoid :
 - Taking the mmap_sem semaphore, conflicting with other subsystems.
 - Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE
futexes, we only need to distinguish futexes by their virtual address, no
matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new
_PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And
be prepared to fall back on old subcommands for old kernels. Using one
global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the
scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries
*and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic,
as this is the default semantic.
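
To make the library side of this concrete, here is a hedged user-space sketch (not glibc code and not part of the patch). FUTEX_PRIVATE_FLAG and FUTEX_CMD_MASK are copied from the include/linux/futex.h hunk of this patch, FUTEX_WAIT and FUTEX_WAKE carry their existing values, and every `sketch_` name is hypothetical. It keeps one global holding FUTEX_PRIVATE_FLAG or 0, ORs it into the subcommand, and clears it when the kernel answers ENOSYS, which is what the unpatched do_futex() returns for an op it does not know:

```c
#include <assert.h>
#include <errno.h>

/* Existing subcommand values from include/linux/futex.h. */
#define FUTEX_WAIT		0
#define FUTEX_WAKE		1
/* New in this patch. */
#define FUTEX_PRIVATE_FLAG	128
#define FUTEX_CMD_MASK		~FUTEX_PRIVATE_FLAG

/* Hypothetical library-wide variable: FUTEX_PRIVATE_FLAG or 0. */
static int sketch_private_flag = FUTEX_PRIVATE_FLAG;

/* Op to pass to futex() for a PTHREAD_PROCESS_PRIVATE object. */
static int sketch_private_op(int cmd)
{
	assert((cmd & FUTEX_PRIVATE_FLAG) == 0);
	return cmd | sketch_private_flag;
}

/*
 * Hypothetical fallback hook: call with the errno of a failed futex().
 * Returns 1 if the caller should retry with the old subcommand.
 */
static int sketch_note_error(int err)
{
	if (err == ENOSYS && sketch_private_flag) {
		sketch_private_flag = 0;
		return 1;
	}
	return 0;
}

/* The same split do_futex() performs in the patch. */
static int sketch_cmd(int op)
{
	return op & FUTEX_CMD_MASK;
}

static int sketch_is_shared(int op)
{
	return !(op & FUTEX_PRIVATE_FLAG);
}
```

PTHREAD_PROCESS_SHARED objects would bypass sketch_private_op() and pass the plain subcommand, so old and new binaries keep agreeing on shared futexes.
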
Almost all applications should benefit from these changes (new kernel and
updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
434 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
427 cycles per futex(FUTEX_WAIT) call (using one futex)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
345 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)
For reference :
187 cycles per getppid() call
188 cycles per umask() call
183 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h |   29 +++
 kernel/futex.c        |  339 +++++++++++++++++++++++++---------------
 2 files changed, 245 insertions(+), 123 deletions(-)

--- linux-2.6.21-rc6-mm1/kernel/futex.c
+++ linux-2.6.21-rc6-mm1-ed/kernel/futex.c
@@ -16,6 +16,9 @@
  * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com>
  * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com>
  *
+ * PRIVATE futexes by Eric Dumazet
+ * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com>
+ *
  * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
  * enough at me, Linus for the original (flawed) idea, Matthew
  * Kirkwood for proof-of-concept implementation.
@@ -193,6 +196,8 @@ static inline int match_futex(union fute
  * get_futex_key - Get parameters which are the keys for a futex.
  * @uaddr: virtual address of the futex
  * @size: size of futex (4 or 8)
+ * @shared: NULL for a PROCESS_PRIVATE futex,
+ *	&current->mm->mmap_sem for a PROCESS_SHARED futex
  * @key: address where result is stored.
  *
  * Returns a negative error code or 0
@@ -202,9 +207,12 @@ static inline int match_futex(union fute
  * offset_within_page). For private mappings, it's (uaddr, current->mm).
  * We can usually work out the index without swapping in the page.
  *
- * Should be called with &current->mm->mmap_sem but NOT any spinlocks.
+ * fshared is NULL for PROCESS_PRIVATE futexes
+ * For other futexes, it points to &current->mm->mmap_sem and
+ * caller must have taken the reader lock. but NOT any spinlocks.
  */
-int get_futex_key(void __user *uaddr, int size, union futex_key *key)
+int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *fshared,
+		  union futex_key *key)
 {
 	unsigned long address = (unsigned long)uaddr;
 	struct mm_struct *mm = current->mm;
@@ -221,6 +229,20 @@ int get_futex_key(void __user *uaddr, in
 	address -= key->both.offset;

 	/*
+	 * PROCESS_PRIVATE futexes are fast.
+	 * As the mm cannot disappear under us and the 'key' only needs
+	 * virtual address, we dont even have to find the underlying vma.
+	 * Note : We do have to check 'uaddr' is a valid user address,
+	 *        but access_ok() should be faster than find_vma()
+	 */
+	if (!fshared) {
+		if (!access_ok(VERIFY_WRITE, uaddr, size))
+			return -EFAULT;
+		key->private.mm = mm;
+		key->private.address = address;
+		return 0;
+	}
+	/*
 	 * The futex is hashed differently depending on whether
 	 * it's in a shared or private mapping. So check vma first.
 	 */
@@ -247,6 +269,7 @@ int get_futex_key(void __user *uaddr, in
 	 * mappings of _writable_ handles.
 	 */
 	if (likely(!(vma->vm_flags & VM_MAYSHARE))) {
+		key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */
 		key->private.mm = mm;
 		key->private.address = address;
 		return 0;
@@ -256,7 +279,7 @@ int get_futex_key(void __user *uaddr, in
 	 * Linear file mappings are also simple.
 	 */
 	key->shared.inode = vma->vm_file->f_path.dentry->d_inode;
-	key->both.offset++; /* Bit 0 of offset indicates inode-based key. */
+	key->both.offset |= FUT_OFF_INODE; /* inode-based key. */
 	if (likely(!(vma->vm_flags & VM_NONLINEAR))) {
 		key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT)
 				     + vma->vm_pgoff);
@@ -284,16 +307,18 @@ EXPORT_SYMBOL_GPL(get_futex_key);
  * Take a reference to the resource addressed by a key.
  * Can be called while holding spinlocks.
  *
- * NOTE: mmap_sem MUST be held between get_futex_key() and calling this
- * function, if it is called at all. mmap_sem keeps key->shared.inode valid.
  */
 inline void get_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			atomic_inc(&key->shared.inode->i_count);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			atomic_inc(&key->private.mm->mm_count);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(get_futex_key_refs);
@@ -304,11 +329,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs);
  */
 void drop_futex_key_refs(union futex_key *key)
 {
-	if (key->both.ptr != 0) {
-		if (key->both.offset & 1)
+	if (key->both.ptr == 0)
+		return;
+	switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) {
+		case FUT_OFF_INODE:
 			iput(key->shared.inode);
-		else
+			break;
+		case FUT_OFF_MMSHARED:
 			mmdrop(key->private.mm);
+			break;
 	}
 }
 EXPORT_SYMBOL_GPL(drop_futex_key_refs);
@@ -342,28 +371,39 @@ get_futex_value_locked(unsigned long *de
 }

 /*
- * Fault handling. Called with current->mm->mmap_sem held.
+ * Fault handling.
+ * if fshared is non NULL, current->mm->mmap_sem is already held
  */
-static int futex_handle_fault(unsigned long address, int attempt)
+static int futex_handle_fault(unsigned long address,
+			      struct rw_semaphore *fshared, int attempt)
 {
 	struct vm_area_struct * vma;
 	struct mm_struct *mm = current->mm;
+	int ret = -EFAULT;

-	if (attempt > 2 || !(vma = find_vma(mm, address)) ||
-	    vma->vm_start > address || !(vma->vm_flags & VM_WRITE))
-		return -EFAULT;
+	if (attempt > 2)
+		return ret;

-	switch (handle_mm_fault(mm, vma, address, 1)) {
-	case VM_FAULT_MINOR:
-		current->min_flt++;
-		break;
-	case VM_FAULT_MAJOR:
-		current->maj_flt++;
-		break;
-	default:
-		return -EFAULT;
+	if (!fshared)
+		down_read(&mm->mmap_sem);
+	vma = find_vma(mm, address);
+	if (vma &&
+	    address >= vma->vm_start &&
+	    (vma->vm_flags & VM_WRITE)) {
+		switch (handle_mm_fault(mm, vma, address, 1)) {
+		case VM_FAULT_MINOR:
+			ret = 0;
+			current->min_flt++;
+			break;
+		case VM_FAULT_MAJOR:
+			ret = 0;
+			current->maj_flt++;
+			break;
+		}
 	}
-	return 0;
+	if (!fshared)
+		up_read(&mm->mmap_sem);
+	return ret;
 }

 /*
@@ -705,10 +745,10 @@ double_lock_hb(struct futex_hash_bucket
 }

 /*
- * Wake up all waiters hashed on the physical page that is mapped
- * to this virtual address:
+ * Wake up all waiters on a futex (fuaddr, futex64, fshared)
  */
-static int futex_wake(unsigned long __user *uaddr, int futex64, int nr_wake)
+static int futex_wake(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared, int nr_wake)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -717,9 +757,10 @@ static int futex_wake(unsigned long __us
 	int ret;
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);

-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr, fsize, &key);
+	ret = get_futex_key(uaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;

@@ -741,7 +782,8 @@ static int futex_wake(unsigned long __us

 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }

@@ -810,8 +852,10 @@ retry:
  * one physical page to another physical page (PI-futex uaddr2)
  */
 static int
-futex_requeue_pi(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-		 int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+futex_requeue_pi(unsigned long __user *uaddr1,
+		 int futex64, struct rw_semaphore *fshared,
+		 unsigned long __user *uaddr2,
+		 int nr_wake, int nr_requeue, unsigned long *cmpval)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -830,12 +874,13 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;

@@ -858,7 +903,8 @@ retry:
 		 * If we would have faulted, release mmap_sem, fault
 		 * it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);

 		ret = futex_get_user(&curval, uaddr1, futex64);

@@ -993,7 +1039,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);

 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }

@@ -1002,8 +1049,10 @@ out:
  * to this virtual address:
  */
 static int
-futex_wake_op(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_wake2, int op, int futex64)
+futex_wake_op(unsigned long __user *uaddr1,
+	      int futex64, struct rw_semaphore *fshared,
+	      unsigned long __user *uaddr2,
+	      int nr_wake, int nr_wake2, int op)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1013,12 +1062,13 @@ futex_wake_op(unsigned long __user *uadd
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);

 retryfull:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;

@@ -1065,11 +1115,10 @@ retry:
 		 * still holding the mmap_sem.
 		 */
 		if (attempt++) {
-			if (futex_handle_fault((unsigned long)uaddr2,
-						attempt)) {
-				ret = -EFAULT;
+			ret = futex_handle_fault((unsigned long)uaddr2,
+						fshared, attempt);
+			if (ret)
 				goto out;
-			}
 			goto retry;
 		}

@@ -1077,7 +1126,8 @@ retry:
 		 * If we would have faulted, release mmap_sem,
 		 * fault it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);

 		ret = futex_get_user(&dummy, uaddr2, futex64);
 		if (ret)
@@ -1114,7 +1164,8 @@ retry:
 	if (hb1 != hb2)
 		spin_unlock(&hb2->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }

@@ -1123,8 +1174,10 @@ out:
  * physical page.
  */
 static int
-futex_requeue(unsigned long __user *uaddr1, unsigned long __user *uaddr2,
-	      int nr_wake, int nr_requeue, unsigned long *cmpval, int futex64)
+futex_requeue(unsigned long __user *uaddr1,
+	      int futex64, struct rw_semaphore *fshared,
+	      unsigned long __user *uaddr2,
+	      int nr_wake, int nr_requeue, unsigned long *cmpval)
 {
 	union futex_key key1, key2;
 	struct futex_hash_bucket *hb1, *hb2;
@@ -1134,12 +1187,13 @@ futex_requeue(unsigned long __user *uadd
 	int fsize = futex64 ? sizeof(u64) : sizeof(u32);

  retry:
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr1, fsize, &key1);
+	ret = get_futex_key(uaddr1, fsize, fshared, &key1);
 	if (unlikely(ret != 0))
 		goto out;
-	ret = get_futex_key(uaddr2, fsize, &key2);
+	ret = get_futex_key(uaddr2, fsize, fshared, &key2);
 	if (unlikely(ret != 0))
 		goto out;

@@ -1162,7 +1216,8 @@ futex_requeue(unsigned long __user *uadd
 		 * If we would have faulted, release mmap_sem, fault
 		 * it in and start all over again.
 		 */
-		up_read(&current->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);

 		ret = futex_get_user(&curval, uaddr1, futex64);

@@ -1215,7 +1270,8 @@ out_unlock:
 		drop_futex_key_refs(&key1);

 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }

@@ -1346,12 +1402,14 @@ static void unqueue_me_pi(struct futex_q
 /*
  * Fixup the pi_state owner with current.
  *
- * The cur->mm semaphore must be held, it is released at return of this
- * function.
+ * for PROCESS_SHARED futexes, cur->mm semaphore must be held, it is
+ * released at return of this function.
  */
-static int fixup_pi_state_owner(unsigned long __user *uaddr, struct futex_q *q,
+static int fixup_pi_state_owner(unsigned long __user *uaddr, int futex64,
+				struct rw_semaphore *fshared,
+				struct futex_q *q,
 				struct futex_hash_bucket *hb,
-				struct task_struct *curr, int futex64)
+				struct task_struct *curr)
 {
 	unsigned long newtid = curr->pid | FUTEX_WAITERS;
 	struct futex_pi_state *pi_state = q->pi_state;
@@ -1376,7 +1434,8 @@ static int fixup_pi_state_owner(unsigned

 	/* Unqueue and drop the lock */
 	unqueue_me_pi(q);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	/*
 	 * We own it, so we have to replace the pending owner
 	 * TID. This must be atomic as we have preserve the
@@ -1399,12 +1458,14 @@ static int fixup_pi_state_owner(unsigned

 /*
  * In case we must use restart_block to restart a futex_wait,
- * we encode in the 'arg3' futex64 capability
+ * we encode in the 'arg3' both futex64 and shared capabilities
  */
 #define ARG3_FUTEX64 1
+#define ARG3_SHARED  2

 static long futex_wait_restart(struct restart_block *restart);
 static int futex_wait(unsigned long __user *uaddr, int futex64,
+		      struct rw_semaphore *fshared,
 		      unsigned long val, ktime_t *abs_time)
 {
 	struct task_struct *curr = current;
@@ -1419,9 +1480,10 @@ static int futex_wait(unsigned long __us

 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr, fsize, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;

@@ -1444,8 +1506,8 @@ static int futex_wait(unsigned long __us
 	 * a wakeup when *uaddr != val on entry to the syscall. This is
 	 * rare, but normal.
 	 *
-	 * We hold the mmap semaphore, so the mapping cannot have changed
-	 * since we looked it up in get_futex_key.
+	 * for shared futexes, we hold the mmap semaphore, so the mapping
+	 * cannot have changed since we looked it up in get_futex_key.
 	 */
 	ret = get_futex_value_locked(&uval, uaddr, futex64);

@@ -1456,7 +1518,8 @@ static int futex_wait(unsigned long __us
 		 * If we would have faulted, release mmap_sem, fault it in and
 		 * start all over again.
 		 */
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);

 		ret = futex_get_user(&uval, uaddr, futex64);
 		if (!ret)
@@ -1482,7 +1545,8 @@ static int futex_wait(unsigned long __us
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);

 	/*
 	 * There might have been scheduling since the queue_me(), as we
@@ -1552,7 +1616,8 @@ static int futex_wait(unsigned long __us
 		else
 			ret = rt_mutex_timed_lock(lock, to, 1);

-		down_read(&curr->mm->mmap_sem);
+		if (fshared)
+			down_read(fshared);
 		spin_lock(q.lock_ptr);

 		/*
@@ -1569,7 +1634,8 @@ static int futex_wait(unsigned long __us

 			/* mmap_sem and hash_bucket lock are unlocked at
 			   return of this function */
-			ret = fixup_pi_state_owner(uaddr, &q, hb, curr, futex64);
+			ret = fixup_pi_state_owner(uaddr, futex64, fshared,
+						   &q, hb, curr);
 		} else {
 			/*
 			 * Catch the rare case, where the lock was released
@@ -1582,7 +1648,8 @@ static int futex_wait(unsigned long __us
 			}
 			/* Unqueue and drop the lock */
 			unqueue_me_pi(&q);
-			up_read(&curr->mm->mmap_sem);
+			if (fshared)
+				up_read(fshared);
 		}

 		debug_rt_mutex_free_waiter(&q.waiter);
@@ -1616,6 +1683,8 @@ static int futex_wait(unsigned long __us
 		if (futex64)
 			restart->arg3 |= ARG3_FUTEX64;
 #endif
+		if (fshared)
+			restart->arg3 |= ARG3_SHARED;
 		return -ERESTART_RESTARTBLOCK;
 	}

@@ -1623,7 +1692,8 @@ static int futex_wait(unsigned long __us
 	queue_unlock(&q, hb);

  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;
 }

@@ -1634,13 +1704,16 @@ static long futex_wait_restart(struct re
 	unsigned long val = restart->arg1;
 	ktime_t *abs_time = (ktime_t *)restart->arg2;
 	int futex64 = 0;
+	struct rw_semaphore *fshared = NULL;

 #ifdef CONFIG_64BIT
 	if (restart->arg3 & ARG3_FUTEX64)
 		futex64 = 1;
 #endif
+	if (restart->arg3 & ARG3_SHARED)
+		fshared = &current->mm->mmap_sem;
 	restart->fn = do_no_restart_syscall;
-	return (long)futex_wait(uaddr, futex64, val, abs_time);
+	return (long)futex_wait(uaddr, futex64, fshared, val, abs_time);
 }

@@ -1695,8 +1768,9 @@ static void set_pi_futex_owner(struct fu
  * if there are waiters then it will block, it does PI, etc. (Due to
  * races the kernel might see a 0 value of the futex too.)
  */
-static int futex_lock_pi(unsigned long __user *uaddr, int detect, ktime_t *time,
-			 int trylock, int futex64)
+static int futex_lock_pi(unsigned long __user *uaddr,
+			 int futex64, struct rw_semaphore *fshared,
+			 int detect, ktime_t *time, int trylock)
 {
 	struct hrtimer_sleeper timeout, *to = NULL;
 	struct task_struct *curr = current;
@@ -1718,9 +1792,10 @@ static int futex_lock_pi(unsigned long _

 	q.pi_state = NULL;
  retry:
-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr, fsize, &q.key);
+	ret = get_futex_key(uaddr, fsize, fshared, &q.key);
 	if (unlikely(ret != 0))
 		goto out_release_sem;

@@ -1841,7 +1916,8 @@ static int futex_lock_pi(unsigned long _
 	 * Now the futex is queued and we have checked the data, we
 	 * don't want to hold mmap_sem while we sleep.
 	 */
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);

 	WARN_ON(!q.pi_state);
 	/*
@@ -1855,7 +1931,8 @@ static int futex_lock_pi(unsigned long _
 		ret = ret ? 0 : -EWOULDBLOCK;
 	}

-	down_read(&curr->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);
 	spin_lock(q.lock_ptr);

 	/*
@@ -1864,7 +1941,8 @@ static int futex_lock_pi(unsigned long _
 	 */
 	if (!ret && q.pi_state->owner != curr)
 		/* mmap_sem is unlocked at return of this function */
-		ret = fixup_pi_state_owner(uaddr, &q, hb, curr, futex64);
+		ret = fixup_pi_state_owner(uaddr, futex64, fshared,
+					   &q, hb, curr);
 	else {
 		/*
 		 * Catch the rare case, where the lock was released
@@ -1877,7 +1955,8 @@ static int futex_lock_pi(unsigned long _
 		}
 		/* Unqueue and drop the lock */
 		unqueue_me_pi(&q);
-		up_read(&curr->mm->mmap_sem);
+		if (fshared)
+			up_read(fshared);
 	}

 	if (!detect && ret == -EDEADLK && 0)
@@ -1889,7 +1968,8 @@ static int futex_lock_pi(unsigned long _
 	queue_unlock(&q, hb);

  out_release_sem:
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);
 	return ret;

  uaddr_faulted:
@@ -1900,15 +1980,16 @@ static int futex_lock_pi(unsigned long _
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock_release_sem;
-		}
 		goto retry_locked;
 	}

 	queue_unlock(&q, hb);
-	up_read(&curr->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);

 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -1922,7 +2003,8 @@ static int futex_lock_pi(unsigned long _
  * This is the in-kernel slowpath: we look up the PI state (if any),
  * and do the rt-mutex unlock.
  */
-static int futex_unlock_pi(unsigned long __user *uaddr, int futex64)
+static int futex_unlock_pi(unsigned long __user *uaddr, int futex64,
+			   struct rw_semaphore *fshared)
 {
 	struct futex_hash_bucket *hb;
 	struct futex_q *this, *next;
@@ -1943,9 +2025,10 @@ retry:
 	/*
 	 * First take all the futex related locks:
 	 */
-	down_read(&current->mm->mmap_sem);
+	if (fshared)
+		down_read(fshared);

-	ret = get_futex_key(uaddr, fsize, &key);
+	ret = get_futex_key(uaddr, fsize, fshared, &key);
 	if (unlikely(ret != 0))
 		goto out;

@@ -2004,7 +2087,8 @@ retry_locked:
 out_unlock:
 	spin_unlock(&hb->lock);
 out:
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);

 	return ret;

@@ -2016,15 +2100,16 @@ pi_faulted:
 	 * still holding the mmap_sem.
 	 */
 	if (attempt++) {
-		if (futex_handle_fault((unsigned long)uaddr, attempt)) {
-			ret = -EFAULT;
+		ret = futex_handle_fault((unsigned long)uaddr, fshared,
+					 attempt);
+		if (ret)
 			goto out_unlock;
-		}
 		goto retry_locked;
 	}

 	spin_unlock(&hb->lock);
-	up_read(&current->mm->mmap_sem);
+	if (fshared)
+		up_read(fshared);

 	ret = futex_get_user(&uval, uaddr, futex64);
 	if (!ret && (uval != -EFAULT))
@@ -2076,6 +2161,7 @@ static int futex_fd(u32 __user *uaddr, i
 	struct futex_q *q;
 	struct file *filp;
 	int ret, err;
+	struct rw_semaphore *fshared;
 	static unsigned long printk_interval;

 	if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) {
@@ -2117,11 +2203,12 @@ static int futex_fd(u32 __user *uaddr, i
 	}
 	q->pi_state = NULL;

-	down_read(&current->mm->mmap_sem);
-	err = get_futex_key(uaddr, sizeof(u32), &q->key);
+	fshared = &current->mm->mmap_sem;
+	down_read(fshared);
+	err = get_futex_key(uaddr, sizeof(u32), fshared, &q->key);

 	if (unlikely(err != 0)) {
-		up_read(&current->mm->mmap_sem);
+		up_read(fshared);
 		kfree(q);
 		goto error;
 	}
@@ -2133,7 +2220,7 @@ static int futex_fd(u32 __user *uaddr, i
 	filp->private_data = q;

 	queue_me(q, ret, filp);
-	up_read(&current->mm->mmap_sem);
+	up_read(fshared);

 	/* Now we map fd to filp, so userspace can access it */
 	fd_install(ret, filp);
@@ -2262,7 +2349,8 @@ retry:
 		 */
 		if (!pi) {
 			if (uval & FUTEX_WAITERS)
-				futex_wake((unsigned long __user *)uaddr, 0, 1);
+				futex_wake((unsigned long __user *)uaddr, 0,
+					   &curr->mm->mmap_sem, 1);
 		}
 	}
 	return 0;
@@ -2350,13 +2438,18 @@ long do_futex(unsigned long __user *uadd
 		unsigned long val2, unsigned long val3, int fut64)
 {
 	int ret;
+	int cmd = op & FUTEX_CMD_MASK;
+	struct rw_semaphore *fshared = NULL;
+
+	if (!(op & FUTEX_PRIVATE_FLAG))
+		fshared = &current->mm->mmap_sem;

-	switch (op) {
+	switch (cmd) {
 	case FUTEX_WAIT:
-		ret = futex_wait(uaddr, fut64, val, timeout);
+		ret = futex_wait(uaddr, fut64, fshared, val, timeout);
 		break;
 	case FUTEX_WAKE:
-		ret = futex_wake(uaddr, fut64, val);
+		ret = futex_wake(uaddr, fut64, fshared, val);
 		break;
 	case FUTEX_FD:
 		if (fut64)
@@ -2366,25 +2459,29 @@ long do_futex(unsigned long __user *uadd
 		ret = futex_fd((u32 __user *)uaddr, val);
 		break;
 	case FUTEX_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, fut64);
+		ret = futex_requeue(uaddr, fut64, fshared,
+				    uaddr2, val, val2, NULL);
 		break;
 	case FUTEX_CMP_REQUEUE:
-		ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue(uaddr, fut64, fshared,
+				    uaddr2, val, val2, &val3);
 		break;
 	case FUTEX_WAKE_OP:
-		ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, fut64);
+		ret = futex_wake_op(uaddr, fut64, fshared,
+				    uaddr2, val, val2, val3);
 		break;
 	case FUTEX_LOCK_PI:
-		ret = futex_lock_pi(uaddr, val, timeout, 0, fut64);
+		ret = futex_lock_pi(uaddr, fut64, fshared, val, timeout, 0);
 		break;
 	case FUTEX_UNLOCK_PI:
-		ret = futex_unlock_pi(uaddr, fut64);
+		ret = futex_unlock_pi(uaddr, fut64, fshared);
 		break;
 	case FUTEX_TRYLOCK_PI:
-		ret = futex_lock_pi(uaddr, 0, timeout, 1, fut64);
+		ret = futex_lock_pi(uaddr, fut64, fshared, 0, timeout, 1);
 		break;
 	case FUTEX_CMP_REQUEUE_PI:
-		ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3, fut64);
+		ret = futex_requeue_pi(uaddr, fut64, fshared,
+				       uaddr2, val, val2, &val3);
 		break;
 	default:
 		ret = -ENOSYS;
@@ -2401,23 +2498,24 @@ sys_futex64(u64 __user *uaddr, int op, u
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u64 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;

-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;

 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
 	 */
-	if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE
-	    || op == FUTEX_CMP_REQUEUE_PI)
+	if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE
+	    || cmd == FUTEX_CMP_REQUEUE_PI)
 		val2 = (unsigned long) utime;

 	return do_futex((unsigned long __user*)uaddr, op, val, tp,
@@ -2433,23 +2531,24 @@ asmlinkage long sys_futex(u32 __user *ua
 	struct timespec ts;
 	ktime_t t, *tp = NULL;
 	u32 val2 = 0;
+	int cmd = op & FUTEX_CMD_MASK;

-	if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) {
+	if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) {
 		if (copy_from_user(&ts, utime, sizeof(ts)) != 0)
 			return -EFAULT;
 		if (!timespec_valid(&ts))
 			return -EINVAL;

 		t = timespec_to_ktime(ts);
-		if (op == FUTEX_WAIT)
+		if (cmd == FUTEX_WAIT)
 			t = ktime_add(ktime_get(), t);
 		tp = &t;
 	}
 	/*
-	 * requeue parameter in 'utime' if op == FUTEX_REQUEUE.
+	 * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE.
*/ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE + || cmd == FUTEX_CMP_REQUEUE_PI) val2 = (u32) (unsigned long) utime; return do_futex((unsigned long __user*)uaddr, op, val, tp, --- linux-2.6.21-rc6-mm1/include/linux/futex.h +++ linux-2.6.21-rc6-mm1-ed/include/linux/futex.h @@ -19,6 +19,18 @@ union ktime; #define FUTEX_TRYLOCK_PI 8 #define FUTEX_CMP_REQUEUE_PI 9 +#define FUTEX_PRIVATE_FLAG 128 +#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG + +#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG) +#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG) +#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG) + /* * Support for robust futexes: the kernel cleans up held futexes at * thread exit time. @@ -115,8 +127,18 @@ handle_futex_death(u32 __user *uaddr, st * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key. 
- */ + * We use the two low order bits of offset to tell what is the kind of key : + * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE) + * (no reference on an inode or mm) + * 01 : Shared futex (PTHREAD_PROCESS_SHARED) + * mapped on a file (reference on the underlying inode) + * 10 : Shared futex (PTHREAD_PROCESS_SHARED) + * (but private mapping on an mm, and reference taken on it) +*/ + +#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ +#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ + union futex_key { unsigned long __user *uaddr; struct { @@ -135,7 +157,8 @@ union futex_key { int offset; } both; }; -int get_futex_key(void __user *uaddr, int size, union futex_key *key); +int get_futex_key(void __user *uaddr, int size, struct rw_semaphore *shared, + union futex_key *key); void get_futex_key_refs(union futex_key *key); void drop_futex_key_refs(union futex_key *key); ^ permalink raw reply [flat|nested] 78+ messages in thread
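The two encodings introduced by the header changes above (the private bit in the futex op, and the two low-order bits of the key offset) can be sketched in plain userspace C. This is only an illustration under stated assumptions: FUTEX_PRIVATE_FLAG, FUT_OFF_INODE and FUT_OFF_MMSHARED carry the values from the patch, FUTEX_WAIT and FUTEX_WAKE use their mainline values (0 and 1), the mask is parenthesized here for safety, and the helper names futex_cmd(), futex_is_private(), futex_key_kind() and the enum key_kind are hypothetical, not part of the patch.

```c
#include <stdbool.h>

/* Mainline values for the two base commands used in this sketch. */
#define FUTEX_WAIT          0
#define FUTEX_WAKE          1

/* Values taken from the include/linux/futex.h hunk of the patch.
 * The mask is parenthesized here; the patch writes it without parentheses. */
#define FUTEX_PRIVATE_FLAG  128
#define FUTEX_CMD_MASK      (~FUTEX_PRIVATE_FLAG)
#define FUTEX_WAIT_PRIVATE  (FUTEX_WAIT | FUTEX_PRIVATE_FLAG)
#define FUTEX_WAKE_PRIVATE  (FUTEX_WAKE | FUTEX_PRIVATE_FLAG)

/* Low-order bits of the key offset, as defined by the patch. */
#define FUT_OFF_INODE       1   /* key holds a reference on an inode */
#define FUT_OFF_MMSHARED    2   /* key holds a reference on an mm */

/* Hypothetical classification of a key, for illustration only. */
enum key_kind { KEY_PRIVATE, KEY_SHARED_INODE, KEY_SHARED_MM, KEY_INVALID };

/* Strip the private bit to recover the base command, as do_futex() does. */
int futex_cmd(int op)
{
    return op & FUTEX_CMD_MASK;
}

/* True when the caller asked for PROCESS_PRIVATE semantics. */
bool futex_is_private(int op)
{
    return (op & FUTEX_PRIVATE_FLAG) != 0;
}

/* Decode the two low-order bits of an offset into a key kind.
 * The offset is a multiple of 4, so these bits are free for tagging. */
enum key_kind futex_key_kind(int offset)
{
    switch (offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)) {
    case 0:
        return KEY_PRIVATE;       /* no reference taken */
    case FUT_OFF_INODE:
        return KEY_SHARED_INODE;  /* file mapping, inode reference */
    case FUT_OFF_MMSHARED:
        return KEY_SHARED_MM;     /* shared semantics, private mapping */
    default:
        return KEY_INVALID;       /* both bits set is not a defined kind */
    }
}
```

For example, futex_cmd(FUTEX_WAIT_PRIVATE) yields FUTEX_WAIT, which is why the switch in do_futex() can stay keyed on the old command values while the flag alone decides whether mmap_sem is taken.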
* Re: [PATCH, take5] FUTEX : new PRIVATE futexes 2007-04-11 9:19 ` [PATCH, take5] " Eric Dumazet @ 2007-04-11 12:23 ` Rusty Russell 2007-04-26 12:55 ` [PATCH, take6] " Eric Dumazet 1 sibling, 0 replies; 78+ messages in thread From: Rusty Russell @ 2007-04-11 12:23 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel On Wed, 2007-04-11 at 11:19 +0200, Eric Dumazet wrote: > Hi Andrew > > Update on this take5 : > > - Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper alignement for 64bit futexes > - compile test on x86_64 (one minor typo) > - Added Rusty in CC since he may have to change drivers/lguest/io.c > again, since get_futex_key() have yet another parameter (fshared). (I > couldnt find this file in 2.6.21-rc6-mm1 tree) That's fine. I just use the futex key lookup code to get a unique key for shared memory between lguests (this is how lg.ko decides if they can send DMA & interrupts to each other). Looks like Andrew has already patched the lguest callers (I resubmitted the moved lguest code just today). Thanks! Rusty. PS. As original futex author, I could have been cc'd on these patches anyway. But since the futex code has become unforgivably ugly (but, oh so fast for pthreads), it's probably better not to see them 8) ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH, take6] FUTEX : new PRIVATE futexes 2007-04-11 9:19 ` [PATCH, take5] " Eric Dumazet 2007-04-11 12:23 ` Rusty Russell @ 2007-04-26 12:55 ` Eric Dumazet 2007-04-26 13:35 ` Pierre Peiffer 1 sibling, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-04-26 12:55 UTC (permalink / raw) To: Andrew Morton, Pierre Peiffer Cc: Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Rusty Russel Hi Andrew Not sure if you prefer to wait for Pierre's work on futex64, so just in case, I prepared this patch. Update on this take6 : - Rebased on linux-2.6.21-rc7-mm2 , since futex64 was dropped from mm Pierre, I can resubmit another patch on top of your next patch, so please do as you prefer (ignoring or not this patch) Thank you History : take5 : - Rebased on linux-2.6.21-rc6-mm1 + get_futex_key() must check proper alignment for 64bit futexes - compile test on x86_64 (one minor typo) - Added Rusty in CC since he may have to change drivers/lguest/io.c again, since get_futex_key() has yet another parameter (fshared). (I couldn't find this file in 2.6.21-rc6-mm1 tree) take4 : - All remarks from Nick were addressed I hope - Current mm code has a problem with 64bit futexes, as spotted by Nick : get_futex_key() does a check against sizeof(u32) regardless of futex being 64bits or not. So it is possible a 64bit futex spans two pages of memory... I had to change get_futex_key() prototype to be able to do a correct test. take3: I'm pleased to present this patch which improves linux futexes performance and scalability, merely avoiding taking mmap_sem rwlock. Ulrich agreed with the API and said glibc work could start as soon as he gets a Fedora kernel with it :) In this third version I dropped the NUMA optims and process private hash table, to let new API come in and be tested. 
Thank you

[PATCH] FUTEX : new PRIVATE futexes

Analysis of current linux futex code :
--------------------------------------
A central hash table futex_queues[] holds all contexts (futex_q) of waiting threads. Each futex_wait()/futex_wake() has to obtain a spinlock on a hash slot to perform lookups or insert/deletion of a futex_q.

When a futex_wait() is done, calling thread has to :

1) - Obtain a read lock on mmap_sem to be able to validate the user pointer (calling find_vma()). This validation tells us if the futex uses an inode based store (mapped file), or mm based store (anonymous mem)
2) - compute a hash key
3) - Atomic increment of reference counter on an inode or a mm_struct
4) - lock part of futex_queues[] hash table
5) - perform the test on value of futex. (rollback if value != expected_value, returns EWOULDBLOCK) (various loops if test triggers mm faults)
6) queue the context into hash table, release the lock got in 4)
7) - release the read_lock on mmap_sem
<block>
8) Eventually unqueue the context (but rarely, as this part may be done by the futex_wake())

Futexes were designed to improve scalability but current implementation has various problems :

- Central hashtable : This means scalability problems if many processes/threads want to use futexes at the same time. This means NUMA unbalance because this hashtable is located on one node.
- Using mmap_sem on every futex() syscall : Even if mmap_sem is a rw_semaphore, up_read()/down_read() are doing atomic ops on mmap_sem, dirtying cache line :
  - lot of cache line ping pongs on SMP configurations. mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()) Highly threaded processes might suffer from mmap_sem contention. mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line.
- Using an atomic_inc()/atomic_dec() on inode ref counter or mm ref counter: It's also a cache line ping pong on SMP.
  It also increases mmap_sem hold time because of cache misses.

Most of these scalability problems come from the fact that futexes are in one global namespace. As we use a central hash table, we must make sure they are all using the same reference (given by the mm subsystem). We chose to force all futexes be 'shared'. This has a cost.

But fact is POSIX defined PRIVATE and SHARED, allowing clear separation, and optimal performance if carefully implemented. Time has come for linux to have better threading performance.

The goal is to permit new futex commands to avoid :
- Taking the mmap_sem semaphore, conflicting with other subsystems.
- Modifying a ref_count on mm or an inode, still conflicting with mm or fs.

This is possible because, for one process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, no matter what the underlying mm storage is.

If glibc wants to exploit this new infrastructure, it should use new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes. And be prepared to fall back on old subcommands for old kernels. Using one global variable with the FUTEX_PRIVATE_FLAG or 0 value should be OK.

PTHREAD_PROCESS_SHARED futexes should still use the old subcommands.

Compatibility with old applications is preserved, they still hit the scalability problems, but new applications can fly :)

Note : the same SHARED futex (mapped on a file) can be used by old binaries *and* new binaries, because both binaries will use the old subcommands.

Note : Vast majority of futexes should be using PROCESS_PRIVATE semantic, as this is the default semantic.
Almost all applications should benefit from these changes (new kernel and updated libc)

Some bench results on a Pentium M 1.6 GHz (SMP kernel on a UP machine)

/* calling futex_wait(addr, value) with value != *addr */
433 cycles per futex(FUTEX_WAIT) call (mixing 2 futexes)
424 cycles per futex(FUTEX_WAIT) call (using one futex)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (mixing 2 futexes)
334 cycles per futex(FUTEX_WAIT_PRIVATE) call (using one futex)

For reference :
187 cycles per getppid() call
188 cycles per umask() call
181 cycles per ni_syscall() call

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/futex.h | 29 +++
 kernel/futex.c | 324 +++++++++++++++++++++++++---------------
 2 files changed, 236 insertions(+), 117 deletions(-)

--- linux-2.6.21-rc7-mm2/include/linux/futex.h +++ linux-2.6.21-rc7-mm2-ed/include/linux/futex.h @@ -19,6 +19,18 @@ union ktime; #define FUTEX_TRYLOCK_PI 8 #define FUTEX_CMP_REQUEUE_PI 9 +#define FUTEX_PRIVATE_FLAG 128 +#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG + +#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG) +#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG) +#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG) + /* * Support for robust futexes: the kernel cleans up held futexes at * thread exit time. @@ -114,8 +126,18 @@ handle_futex_death(u32 __user *uaddr, st * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key. 
- */ + * We use the two low order bits of offset to tell what is the kind of key : + * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE) + * (no reference on an inode or mm) + * 01 : Shared futex (PTHREAD_PROCESS_SHARED) + * mapped on a file (reference on the underlying inode) + * 10 : Shared futex (PTHREAD_PROCESS_SHARED) + * (but private mapping on an mm, and reference taken on it) +*/ + +#define FUT_OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ +#define FUT_OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ + union futex_key { u32 __user *uaddr; struct { @@ -134,7 +156,8 @@ union futex_key { int offset; } both; }; -int get_futex_key(u32 __user *uaddr, union futex_key *key); +int get_futex_key(u32 __user *uaddr, struct rw_semaphore *shared, + union futex_key *key); void get_futex_key_refs(union futex_key *key); void drop_futex_key_refs(union futex_key *key); --- linux-2.6.21-rc7-mm2/kernel/futex.c +++ linux-2.6.21-rc7-mm2-ed/kernel/futex.c @@ -16,6 +16,9 @@ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> * + * PRIVATE futexes by Eric Dumazet + * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> + * * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly * enough at me, Linus for the original (flawed) idea, Matthew * Kirkwood for proof-of-concept implementation. @@ -150,19 +153,26 @@ static inline int match_futex(union fute && key1->both.offset == key2->both.offset); } -/* - * Get parameters which are the keys for a futex. +/** + * get_futex_key - Get parameters which are the keys for a futex. + * @uaddr: virtual address of the futex + * @shared: NULL for a PROCESS_PRIVATE futex, + * &current->mm->mmap_sem for a PROCESS_SHARED futex + * @key: address where result is stored. + * + * Returns a negative error code or 0 + * The key words are stored in *key on success. 
* * For shared mappings, it's (page->index, vma->vm_file->f_path.dentry->d_inode, * offset_within_page). For private mappings, it's (uaddr, current->mm). * We can usually work out the index without swapping in the page. * - * Returns: 0, or negative error code. - * The key words are stored in *key on success. - * - * Should be called with &current->mm->mmap_sem but NOT any spinlocks. + * fshared is NULL for PROCESS_PRIVATE futexes + * For other futexes, it points to &current->mm->mmap_sem and + * caller must have taken the reader lock. but NOT any spinlocks. */ -int get_futex_key(u32 __user *uaddr, union futex_key *key) +int get_futex_key(u32 __user *uaddr, struct rw_semaphore *fshared, + union futex_key *key) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -174,11 +184,25 @@ int get_futex_key(u32 __user *uaddr, uni * The futex address must be "naturally" aligned. */ key->both.offset = address % PAGE_SIZE; - if (unlikely((key->both.offset % sizeof(u32)) != 0)) + if (unlikely((address % sizeof(u32)) != 0)) return -EINVAL; address -= key->both.offset; /* + * PROCESS_PRIVATE futexes are fast. + * As the mm cannot disappear under us and the 'key' only needs + * virtual address, we dont even have to find the underlying vma. + * Note : We do have to check 'uaddr' is a valid user address, + * but access_ok() should be faster than find_vma() + */ + if (!fshared) { + if (unlikely(!access_ok(VERIFY_WRITE, uaddr, sizeof(u32)))) + return -EFAULT; + key->private.mm = mm; + key->private.address = address; + return 0; + } + /* * The futex is hashed differently depending on whether * it's in a shared or private mapping. So check vma first. */ @@ -205,6 +229,7 @@ int get_futex_key(u32 __user *uaddr, uni * mappings of _writable_ handles. 
*/ if (likely(!(vma->vm_flags & VM_MAYSHARE))) { + key->both.offset |= FUT_OFF_MMSHARED; /* reference taken on mm */ key->private.mm = mm; key->private.address = address; return 0; @@ -214,7 +239,7 @@ int get_futex_key(u32 __user *uaddr, uni * Linear file mappings are also simple. */ key->shared.inode = vma->vm_file->f_path.dentry->d_inode; - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ + key->both.offset |= FUT_OFF_INODE; /* inode-based key. */ if (likely(!(vma->vm_flags & VM_NONLINEAR))) { key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff); @@ -242,16 +267,18 @@ EXPORT_SYMBOL_GPL(get_futex_key); * Take a reference to the resource addressed by a key. * Can be called while holding spinlocks. * - * NOTE: mmap_sem MUST be held between get_futex_key() and calling this - * function, if it is called at all. mmap_sem keeps key->shared.inode valid. */ inline void get_futex_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.ptr == 0) + return; + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { + case FUT_OFF_INODE: atomic_inc(&key->shared.inode->i_count); - else + break; + case FUT_OFF_MMSHARED: atomic_inc(&key->private.mm->mm_count); + break; } } EXPORT_SYMBOL_GPL(get_futex_key_refs); @@ -262,11 +289,15 @@ EXPORT_SYMBOL_GPL(get_futex_key_refs); */ void drop_futex_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + if (key->both.ptr == 0) + return; + switch (key->both.offset & (FUT_OFF_INODE|FUT_OFF_MMSHARED)) { + case FUT_OFF_INODE: iput(key->shared.inode); - else + break; + case FUT_OFF_MMSHARED: mmdrop(key->private.mm); + break; } } EXPORT_SYMBOL_GPL(drop_futex_key_refs); @@ -283,28 +314,38 @@ static inline int get_futex_value_locked } /* - * Fault handling. Called with current->mm->mmap_sem held. + * Fault handling. 
+ * if fshared is non NULL, current->mm->mmap_sem is already held */ -static int futex_handle_fault(unsigned long address, int attempt) +static int futex_handle_fault(unsigned long address, + struct rw_semaphore *fshared, int attempt) { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + int ret = -EFAULT; - if (attempt > 2 || !(vma = find_vma(mm, address)) || - vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) - return -EFAULT; + if (attempt > 2) + return ret; - switch (handle_mm_fault(mm, vma, address, 1)) { - case VM_FAULT_MINOR: - current->min_flt++; - break; - case VM_FAULT_MAJOR: - current->maj_flt++; - break; - default: - return -EFAULT; + if (!fshared) + down_read(&mm->mmap_sem); + vma = find_vma(mm, address); + if (vma && address >= vma->vm_start && + (vma->vm_flags & VM_WRITE)) { + switch (handle_mm_fault(mm, vma, address, 1)) { + case VM_FAULT_MINOR: + ret = 0; + current->min_flt++; + break; + case VM_FAULT_MAJOR: + ret = 0; + current->maj_flt++; + break; + } } - return 0; + if (!fshared) + up_read(&mm->mmap_sem); + return ret; } /* @@ -647,7 +688,8 @@ double_lock_hb(struct futex_hash_bucket * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(u32 __user *uaddr, int nr_wake) +static int futex_wake(u32 __user *uaddr, struct rw_semaphore *fshared, + int nr_wake) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -655,9 +697,10 @@ static int futex_wake(u32 __user *uaddr, union futex_key key; int ret; - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, fshared, &key); if (unlikely(ret != 0)) goto out; @@ -679,7 +722,8 @@ static int futex_wake(u32 __user *uaddr, spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -746,7 +790,9 @@ retry: * and requeue the next nr_requeue waiters following hashed on * one physical page to 
another physical page (PI-futex uaddr2) */ -static int futex_requeue_pi(u32 __user *uaddr1, u32 __user *uaddr2, +static int futex_requeue_pi(u32 __user *uaddr1, + struct rw_semaphore *fshared, + u32 __user *uaddr2, int nr_wake, int nr_requeue, u32 *cmpval) { union futex_key key1, key2; @@ -765,12 +811,13 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fshared, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fshared, &key2); if (unlikely(ret != 0)) goto out; @@ -793,7 +840,8 @@ retry: * If we would have faulted, release mmap_sem, fault * it in and start all over again. */ - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(curval, uaddr1); @@ -927,7 +975,8 @@ out_unlock: drop_futex_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -936,7 +985,8 @@ out: * to this virtual address: */ static int -futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2, +futex_wake_op(u32 __user *uaddr1, struct rw_semaphore *fshared, + u32 __user *uaddr2, int nr_wake, int nr_wake2, int op) { union futex_key key1, key2; @@ -946,12 +996,13 @@ futex_wake_op(u32 __user *uaddr1, u32 __ int ret, op_ret, attempt = 0; retryfull: - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fshared, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fshared, &key2); if (unlikely(ret != 0)) goto out; @@ -991,11 +1042,10 @@ retry: * still holding the mmap_sem. 
*/ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr2, - attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr2, + fshared, attempt); + if (ret) goto out; - } goto retry; } @@ -1003,7 +1053,8 @@ retry: * If we would have faulted, release mmap_sem, * fault it in and start all over again. */ - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(dummy, uaddr2); if (ret) @@ -1040,7 +1091,8 @@ retry: if (hb1 != hb2) spin_unlock(&hb2->lock); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -1048,7 +1100,8 @@ out: * Requeue all waiters hashed on one physical page to another * physical page. */ -static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2, +static int futex_requeue(u32 __user *uaddr1, struct rw_semaphore *fshared, + u32 __user *uaddr2, int nr_wake, int nr_requeue, u32 *cmpval) { union futex_key key1, key2; @@ -1058,12 +1111,13 @@ static int futex_requeue(u32 __user *uad int ret, drop_count = 0; retry: - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, fshared, &key1); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, fshared, &key2); if (unlikely(ret != 0)) goto out; @@ -1086,7 +1140,8 @@ static int futex_requeue(u32 __user *uad * If we would have faulted, release mmap_sem, fault * it in and start all over again. */ - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(curval, uaddr1); @@ -1139,7 +1194,8 @@ out_unlock: drop_futex_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -1273,7 +1329,8 @@ static void unqueue_me_pi(struct futex_q * The cur->mm semaphore must be held, it is released at return of this * function. 
*/ -static int fixup_pi_state_owner(u32 __user *uaddr, struct futex_q *q, +static int fixup_pi_state_owner(u32 __user *uaddr, struct rw_semaphore *fshared, + struct futex_q *q, struct futex_hash_bucket *hb, struct task_struct *curr) { @@ -1300,7 +1357,8 @@ static int fixup_pi_state_owner(u32 __us /* Unqueue and drop the lock */ unqueue_me_pi(q); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); /* * We own it, so we have to replace the pending owner * TID. This must be atomic as we have preserve the @@ -1321,8 +1379,15 @@ static int fixup_pi_state_owner(u32 __us return ret; } +/* + * In case we must use restart_block to restart a futex_wait, + * we encode in the 'arg3' shared capability + */ +#define ARG3_SHARED 1 + static long futex_wait_restart(struct restart_block *restart); -static int futex_wait(u32 __user *uaddr, u32 val, ktime_t *abs_time) +static int futex_wait(u32 __user *uaddr, struct rw_semaphore *fshared, + u32 val, ktime_t *abs_time) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1335,9 +1400,10 @@ static int futex_wait(u32 __user *uaddr, q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, fshared, &q.key); if (unlikely(ret != 0)) goto out_release_sem; @@ -1360,8 +1426,8 @@ static int futex_wait(u32 __user *uaddr, * a wakeup when *uaddr != val on entry to the syscall. This is * rare, but normal. * - * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up in get_futex_key. + * for shared futexes, we hold the mmap semaphore, so the mapping + * cannot have changed since we looked it up in get_futex_key. */ ret = get_futex_value_locked(&uval, uaddr); @@ -1372,7 +1438,8 @@ static int futex_wait(u32 __user *uaddr, * If we would have faulted, release mmap_sem, fault it in and * start all over again. 
*/ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(uval, uaddr); @@ -1399,7 +1466,8 @@ static int futex_wait(u32 __user *uaddr, * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); /* * There might have been scheduling since the queue_me(), as we @@ -1469,7 +1537,8 @@ static int futex_wait(u32 __user *uaddr, else ret = rt_mutex_timed_lock(lock, to, 1); - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); spin_lock(q.lock_ptr); /* @@ -1486,7 +1555,8 @@ static int futex_wait(u32 __user *uaddr, /* mmap_sem and hash_bucket lock are unlocked at return of this function */ - ret = fixup_pi_state_owner(uaddr, &q, hb, curr); + ret = fixup_pi_state_owner(uaddr, fshared, + &q, hb, curr); } else { /* * Catch the rare case, where the lock was released @@ -1499,7 +1569,8 @@ static int futex_wait(u32 __user *uaddr, } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); } debug_rt_mutex_free_waiter(&q.waiter); @@ -1528,6 +1599,9 @@ static int futex_wait(u32 __user *uaddr, restart->arg0 = (unsigned long)uaddr; restart->arg1 = (unsigned long)val; restart->arg2 = (unsigned long)abs_time; + restart->arg3 = 0; + if (fshared) + restart->arg3 |= ARG3_SHARED; return -ERESTART_RESTARTBLOCK; } @@ -1535,7 +1609,8 @@ static int futex_wait(u32 __user *uaddr, queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; } @@ -1545,9 +1620,12 @@ static long futex_wait_restart(struct re u32 __user *uaddr = (u32 __user *)restart->arg0; u32 val = (u32)restart->arg1; ktime_t *abs_time = (ktime_t *)restart->arg2; + struct rw_semaphore *fshared = NULL; restart->fn = do_no_restart_syscall; - return (long)futex_wait(uaddr, val, abs_time); + if (restart->arg3 & ARG3_SHARED) + fshared = &current->mm->mmap_sem; + return 
(long)futex_wait(uaddr, fshared, val, abs_time); } @@ -1602,8 +1680,8 @@ static void set_pi_futex_owner(struct fu * if there are waiters then it will block, it does PI, etc. (Due to * races the kernel might see a 0 value of the futex too.) */ -static int futex_lock_pi(u32 __user *uaddr, int detect, ktime_t *time, - int trylock) +static int futex_lock_pi(u32 __user *uaddr, struct rw_semaphore *fshared, + int detect, ktime_t *time, int trylock) { struct hrtimer_sleeper timeout, *to = NULL; struct task_struct *curr = current; @@ -1624,9 +1702,10 @@ static int futex_lock_pi(u32 __user *uad q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, fshared, &q.key); if (unlikely(ret != 0)) goto out_release_sem; @@ -1747,7 +1826,8 @@ static int futex_lock_pi(u32 __user *uad * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); WARN_ON(!q.pi_state); /* @@ -1761,7 +1841,8 @@ static int futex_lock_pi(u32 __user *uad ret = ret ? 
0 : -EWOULDBLOCK; } - down_read(&curr->mm->mmap_sem); + if (fshared) + down_read(fshared); spin_lock(q.lock_ptr); /* @@ -1770,7 +1851,7 @@ static int futex_lock_pi(u32 __user *uad */ if (!ret && q.pi_state->owner != curr) /* mmap_sem is unlocked at return of this function */ - ret = fixup_pi_state_owner(uaddr, &q, hb, curr); + ret = fixup_pi_state_owner(uaddr, fshared, &q, hb, curr); else { /* * Catch the rare case, where the lock was released @@ -1783,7 +1864,8 @@ static int futex_lock_pi(u32 __user *uad } /* Unqueue and drop the lock */ unqueue_me_pi(&q); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); } if (!detect && ret == -EDEADLK && 0) @@ -1795,7 +1877,8 @@ static int futex_lock_pi(u32 __user *uad queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; uaddr_faulted: @@ -1806,15 +1889,16 @@ static int futex_lock_pi(u32 __user *uad * still holding the mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, fshared, + attempt); + if (ret) goto out_unlock_release_sem; - } goto retry_locked; } queue_unlock(&q, hb); - up_read(&curr->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(uval, uaddr); if (!ret && (uval != -EFAULT)) @@ -1828,7 +1912,7 @@ static int futex_lock_pi(u32 __user *uad * This is the in-kernel slowpath: we look up the PI state (if any), * and do the rt-mutex unlock. 
*/ -static int futex_unlock_pi(u32 __user *uaddr) +static int futex_unlock_pi(u32 __user *uaddr, struct rw_semaphore *fshared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -1848,9 +1932,10 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (fshared) + down_read(fshared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, fshared, &key); if (unlikely(ret != 0)) goto out; @@ -1909,7 +1994,8 @@ retry_locked: out_unlock: spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); return ret; @@ -1921,15 +2007,16 @@ pi_faulted: * still holding the mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, fshared, + attempt); + if (ret) goto out_unlock; - } goto retry_locked; } spin_unlock(&hb->lock); - up_read(&current->mm->mmap_sem); + if (fshared) + up_read(fshared); ret = get_user(uval, uaddr); if (!ret && (uval != -EFAULT)) @@ -1981,6 +2068,7 @@ static int futex_fd(u32 __user *uaddr, i struct futex_q *q; struct file *filp; int ret, err; + struct rw_semaphore *fshared; static unsigned long printk_interval; if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) { @@ -2022,11 +2110,12 @@ static int futex_fd(u32 __user *uaddr, i } q->pi_state = NULL; - down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + fshared = &current->mm->mmap_sem; + down_read(fshared); + err = get_futex_key(uaddr, fshared, &q->key); if (unlikely(err != 0)) { - up_read(&current->mm->mmap_sem); + up_read(fshared); kfree(q); goto error; } @@ -2038,7 +2127,7 @@ static int futex_fd(u32 __user *uaddr, i filp->private_data = q; queue_me(q, ret, filp); - up_read(&current->mm->mmap_sem); + up_read(fshared); /* Now we map fd to filp, so userspace can access it */ fd_install(ret, filp); @@ -2167,7 +2256,7 @@ retry: */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake(uaddr, 1); + futex_wake(uaddr, 
&curr->mm->mmap_sem, 1); } } return 0; @@ -2223,7 +2312,8 @@ void exit_robust_list(struct task_struct return; if (pending) - handle_futex_death((void __user *)pending + futex_offset, curr, pip); + handle_futex_death((void __user *)pending + futex_offset, + curr, pip); while (entry != &head->list) { /* @@ -2253,38 +2343,43 @@ long do_futex(u32 __user *uaddr, int op, u32 __user *uaddr2, u32 val2, u32 val3) { int ret; + int cmd = op & FUTEX_CMD_MASK; + struct rw_semaphore *fshared = NULL; + + if (!(op & FUTEX_PRIVATE_FLAG)) + fshared = &current->mm->mmap_sem; - switch (op) { + switch (cmd) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout); + ret = futex_wait(uaddr, fshared, val, timeout); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, fshared, val); break; case FUTEX_FD: /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */ ret = futex_fd(uaddr, val); break; case FUTEX_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, NULL); + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, NULL); break; case FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3); + ret = futex_requeue(uaddr, fshared, uaddr2, val, val2, &val3); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3); + ret = futex_wake_op(uaddr, fshared, uaddr2, val, val2, val3); break; case FUTEX_LOCK_PI: - ret = futex_lock_pi(uaddr, val, timeout, 0); + ret = futex_lock_pi(uaddr, fshared, val, timeout, 0); break; case FUTEX_UNLOCK_PI: - ret = futex_unlock_pi(uaddr); + ret = futex_unlock_pi(uaddr, fshared); break; case FUTEX_TRYLOCK_PI: - ret = futex_lock_pi(uaddr, 0, timeout, 1); + ret = futex_lock_pi(uaddr, fshared, 0, timeout, 1); break; case FUTEX_CMP_REQUEUE_PI: - ret = futex_requeue_pi(uaddr, uaddr2, val, val2, &val3); + ret = futex_requeue_pi(uaddr, fshared, uaddr2, val, val2, &val3); break; default: ret = -ENOSYS; @@ -2300,23 +2395,24 @@ asmlinkage long sys_futex(u32 __user *ua struct timespec ts; ktime_t t,
*tp = NULL; u32 val2 = 0; + int cmd = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if (utime && (cmd == FUTEX_WAIT || cmd == FUTEX_LOCK_PI)) { if (copy_from_user(&ts, utime, sizeof(ts)) != 0) return -EFAULT; if (!timespec_valid(&ts)) return -EINVAL; t = timespec_to_ktime(ts); - if (op == FUTEX_WAIT) + if (cmd == FUTEX_WAIT) t = ktime_add(ktime_get(), t); tp = &t; } /* - * requeue parameter in 'utime' if op == FUTEX_REQUEUE. + * requeue parameter in 'utime' if cmd == FUTEX_REQUEUE. */ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE - || op == FUTEX_CMP_REQUEUE_PI) + if (cmd == FUTEX_REQUEUE || cmd == FUTEX_CMP_REQUEUE + || cmd == FUTEX_CMP_REQUEUE_PI) val2 = (u32) (unsigned long) utime; return do_futex(uaddr, op, val, tp, uaddr2, val2, val3); ^ permalink raw reply [flat|nested] 78+ messages in thread
* Re: [PATCH, take6] FUTEX : new PRIVATE futexes 2007-04-26 12:55 ` [PATCH, take6] " Eric Dumazet @ 2007-04-26 13:35 ` Pierre Peiffer 0 siblings, 0 replies; 78+ messages in thread From: Pierre Peiffer @ 2007-04-26 13:35 UTC (permalink / raw) To: Eric Dumazet Cc: Andrew Morton, Dave Jones, Ulrich Drepper, Nick Piggin, Ingo Molnar, Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel, Rusty Russel Eric Dumazet wrote: > Hi Andrew > > Not sure if you prefer to wait for Pierre's work on futex64, so just in case, I prepared this patch. > > Update on this take6 : > > - Rebased on linux-2.6.21-rc7-mm2 , since futex64 was dropped from mm > > > Pierre, I can resubmit another patch on top of your next patch, so please do as you prefer (ignoring or not this patch) Thank you for taking care of this. But I think your patch is more mature than the futex64 patch; So, it's ok for me, and it's probably simpler and better for "the community" (and for Andrew's work) to put this patch first, because futex64 still may need several reworks... -- Pierre ^ permalink raw reply [flat|nested] 78+ messages in thread
* [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic 2006-08-09 6:43 ` Eric Dumazet 2007-03-15 19:10 ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet @ 2007-03-15 19:13 ` Eric Dumazet 2007-03-15 19:16 ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet 2007-03-15 19:20 ` [PATCH 3/3] FUTEX : NUMA friendly global hashtable Eric Dumazet 3 siblings, 0 replies; 78+ messages in thread From: Eric Dumazet @ 2007-03-15 19:13 UTC (permalink / raw) To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar Cc: Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 2352 bytes --] [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic This first patch introduces XXX_PRIVATE futex operations. When a process uses a XXX_PRIVATE futex primitive, the kernel can avoid taking a read lock on mmap_sem to find the vma that contains the futex and learn whether it is associated with an inode (shared) or the mm (private to the process). We also avoid taking a reference on the found inode or the mm. Even if mmap_sem is a rw_semaphore, up_read()/down_read() do atomic ops on mmap_sem, dirtying its cache line : - lots of cache line ping pongs on SMP configurations. mmap_sem is also extensively used by mm code (page faults, mmap()/munmap()). Highly threaded processes might suffer from mmap_sem contention. mmap_sem is also used by oprofile code. Enabling oprofile hurts threaded programs because of contention on the mmap_sem cache line. - Using an atomic_inc()/atomic_dec() on the inode ref counter or mm ref counter: it's also a cache line ping pong on SMP. It also increases mmap_sem hold time because of cache misses. This first patch is possible because, for one process using PTHREAD_PROCESS_PRIVATE futexes, we only need to distinguish futexes by their virtual address, no matter what the underlying mm storage is.
The case of multiple virtual addresses mapped on the same physical address is just insane : "Don't do it on PROCESS_PRIVATE futexes, please ?" If glibc wants to exploit this new infrastructure, it should use the new _PRIVATE futex subcommands for PTHREAD_PROCESS_PRIVATE futexes, and be prepared to fall back to the old subcommands on old kernels. Using one global variable holding either FUTEX_PRIVATE_FLAG or 0 should be OK, so that only one syscall might fail. Compatibility with old applications is preserved: they still hit the scalability problems, but new applications can fly :) Note : SHARED futexes can be used by old binaries *and* new binaries, because both binaries will use the old subcommands. Note : The vast majority of futexes should be using the PROCESS_PRIVATE semantic, as this is the default semantic. Almost all applications should benefit from these changes (new kernel and updated libc) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/futex.h | 12 + kernel/futex.c | 273 +++++++++++++++++++++++++--------------- 2 files changed, 188 insertions(+), 97 deletions(-) [-- Attachment #2: futex_p1.patch --] [-- Type: text/plain, Size: 20078 bytes --] --- linux-2.6.21-rc3/kernel/futex.c 2007-03-13 13:22:31.000000000 +0100 +++ linux-2.6.21-rc3-ed/kernel/futex.c 2007-03-15 18:30:15.000000000 +0100 @@ -16,6 +16,9 @@ * Copyright (C) 2006 Red Hat, Inc., Ingo Molnar <mingo@redhat.com> * Copyright (C) 2006 Timesys Corp., Thomas Gleixner <tglx@timesys.com> * + * Introduction of PRIVATE futexes by Eric Dumazet + * Copyright (C) 2007 Eric Dumazet <dada1@cosmosbay.com> + * * Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly * enough at me, Linus for the original (flawed) idea, Matthew * Kirkwood for proof-of-concept implementation. @@ -60,8 +63,18 @@ * Don't rearrange members without looking at hash_futex(). * * offset is aligned to a multiple of sizeof(u32) (== 4) by definition. - * We set bit 0 to indicate if it's an inode-based key.
+ * We use the two low order bits of offset to tell what is the kind of key : + * 00 : Private process futex (PTHREAD_PROCESS_PRIVATE) + * (no reference on an inode or mm) + * 01 : Shared futex (PTHREAD_PROCESS_SHARED) + * mapped on a file (reference on the underlying inode) + * 10 : Shared futex (PTHREAD_PROCESS_SHARED) + * (but private mapping on an mm, and reference taken on it) */ + +#define OFF_INODE 1 /* We set bit 0 if key has a reference on inode */ +#define OFF_MMSHARED 2 /* We set bit 1 if key has a reference on mm */ + union futex_key { struct { unsigned long pgoff; @@ -129,9 +142,6 @@ struct futex_q { struct task_struct *task; }; -/* - * Split the global futex_lock into every hash list lock. - */ struct futex_hash_bucket { spinlock_t lock; struct list_head chain; @@ -175,7 +185,8 @@ static inline int match_futex(union fute * * Should be called with &current->mm->mmap_sem but NOT any spinlocks. */ -static int get_futex_key(u32 __user *uaddr, union futex_key *key) +static int get_futex_key(u32 __user *uaddr, union futex_key *key, + struct rw_semaphore *shared) { unsigned long address = (unsigned long)uaddr; struct mm_struct *mm = current->mm; @@ -192,6 +203,22 @@ static int get_futex_key(u32 __user *uad address -= key->both.offset; /* + * PROCESS_PRIVATE futexes are fast. + * As the mm cannot disappear under us and the 'key' only needs + * virtual address, we dont even have to find the underlying vma. + * Note : We do have to check 'address' is a valid user address, + * but access_ok() should be faster than find_vma() + * Note : At this point, address points to the start of page, + * not the real futex address, this is ok. + */ + if (!shared) { + if (!access_ok(VERIFY_WRITE, address, sizeof(int))) + return -EFAULT; + key->private.mm = mm; + key->private.address = address; + return 0; + } + /* * The futex is hashed differently depending on whether * it's in a shared or private mapping. So check vma first.
*/ @@ -215,6 +242,7 @@ static int get_futex_key(u32 __user *uad * mappings of _writable_ handles. */ if (likely(!(vma->vm_flags & VM_MAYSHARE))) { + key->both.offset += OFF_MMSHARED; /* reference taken on mm */ key->private.mm = mm; key->private.address = address; return 0; @@ -224,7 +252,7 @@ static int get_futex_key(u32 __user *uad * Linear file mappings are also simple. */ key->shared.inode = vma->vm_file->f_path.dentry->d_inode; - key->both.offset++; /* Bit 0 of offset indicates inode-based key. */ + key->both.offset += OFF_INODE; /* inode-based key. */ if (likely(!(vma->vm_flags & VM_NONLINEAR))) { key->shared.pgoff = (((address - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff); @@ -251,16 +279,21 @@ static int get_futex_key(u32 __user *uad * Take a reference to the resource addressed by a key. * Can be called while holding spinlocks. * - * NOTE: mmap_sem MUST be held between get_futex_key() and calling this - * function, if it is called at all. mmap_sem keeps key->shared.inode valid. + * NOTE: for SHARED futexes, mmap_sem MUST be held between get_futex_key() + * and calling this function, if it is called at all. mmap_sem keeps + * key->shared.inode valid. 
*/ static inline void get_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + switch (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + case OFF_INODE: atomic_inc(&key->shared.inode->i_count); - else + break; + case OFF_MMSHARED: atomic_inc(&key->private.mm->mm_count); + break; + default: + break; } } @@ -270,11 +303,15 @@ static inline void get_key_refs(union fu */ static void drop_key_refs(union futex_key *key) { - if (key->both.ptr != 0) { - if (key->both.offset & 1) + switch (key->both.offset & (OFF_INODE|OFF_MMSHARED)) { + case OFF_INODE: iput(key->shared.inode); - else + break; + case OFF_MMSHARED: mmdrop(key->private.mm); + break; + default: + break; } } @@ -286,32 +323,44 @@ static inline int get_futex_value_locked ret = __copy_from_user_inatomic(dest, from, sizeof(u32)); pagefault_enable(); - return ret ? -EFAULT : 0; + return ret; } /* - * Fault handling. Called with current->mm->mmap_sem held. + * Fault handling. + * if shared is non NULL, current->mm->mmap_sem is held */ -static int futex_handle_fault(unsigned long address, int attempt) +static int futex_handle_fault(unsigned long address, int attempt, + struct rw_semaphore *shared) { struct vm_area_struct * vma; struct mm_struct *mm = current->mm; + int ret = 0; - if (attempt > 2 || !(vma = find_vma(mm, address)) || - vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) + if (attempt > 2) return -EFAULT; - switch (handle_mm_fault(mm, vma, address, 1)) { - case VM_FAULT_MINOR: - current->min_flt++; - break; - case VM_FAULT_MAJOR: - current->maj_flt++; - break; - default: - return -EFAULT; - } - return 0; + if (!shared) + down_read(&mm->mmap_sem); + + if (!(vma = find_vma(mm, address)) || + vma->vm_start > address || !(vma->vm_flags & VM_WRITE)) + ret = -EFAULT; + + else + switch (handle_mm_fault(mm, vma, address, 1)) { + case VM_FAULT_MINOR: + current->min_flt++; + break; + case VM_FAULT_MAJOR: + current->maj_flt++; + break; + default: + ret = -EFAULT; + } + if 
(!shared) + up_read(&current->mm->mmap_sem); + return ret; } /* @@ -649,7 +698,8 @@ double_lock_hb(struct futex_hash_bucket * Wake up all waiters hashed on the physical page that is mapped * to this virtual address: */ -static int futex_wake(u32 __user *uaddr, int nr_wake) +static int futex_wake(u32 __user *uaddr, int nr_wake, + struct rw_semaphore *shared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -657,9 +707,10 @@ static int futex_wake(u32 __user *uaddr, union futex_key key; int ret; - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -681,7 +732,8 @@ static int futex_wake(u32 __user *uaddr, spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -691,7 +743,7 @@ out: */ static int futex_wake_op(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_wake2, int op) + int nr_wake, int nr_wake2, int op, struct rw_semaphore *shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -700,12 +752,13 @@ futex_wake_op(u32 __user *uaddr1, u32 __ int ret, op_ret, attempt = 0; retryfull: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -741,15 +794,14 @@ retry: * futex_atomic_op_inuser needs to both read and write * *(int __user *)uaddr2, but we can't modify it * non-atomically. Therefore, if get_user below is not - * enough, we need to handle the fault ourselves, while - * still holding the mmap_sem. + * enough, we need to handle the fault ourselves. Make + * sure we hold mmap_sem.
*/ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr2, - attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr2, + attempt, shared); + if (ret) goto out; - } goto retry; } @@ -757,7 +809,8 @@ retry: * If we would have faulted, release mmap_sem, * fault it in and start all over again. */ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = get_user(dummy, uaddr2); if (ret) @@ -794,7 +847,8 @@ retry: if (hb1 != hb2) spin_unlock(&hb2->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -803,7 +857,8 @@ out: * physical page. */ static int futex_requeue(u32 __user *uaddr1, u32 __user *uaddr2, - int nr_wake, int nr_requeue, u32 *cmpval) + int nr_wake, int nr_requeue, u32 *cmpval, + struct rw_semaphore *shared) { union futex_key key1, key2; struct futex_hash_bucket *hb1, *hb2; @@ -812,12 +867,13 @@ static int futex_requeue(u32 __user *uad int ret, drop_count = 0; retry: - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr1, &key1); + ret = get_futex_key(uaddr1, &key1, shared); if (unlikely(ret != 0)) goto out; - ret = get_futex_key(uaddr2, &key2); + ret = get_futex_key(uaddr2, &key2, shared); if (unlikely(ret != 0)) goto out; @@ -840,7 +896,8 @@ static int futex_requeue(u32 __user *uad * If we would have faulted, release mmap_sem, fault * it in and start all over again.
*/ - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = get_user(curval, uaddr1); @@ -889,7 +946,8 @@ out_unlock: drop_key_refs(&key1); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -1000,7 +1058,8 @@ static void unqueue_me_pi(struct futex_q drop_key_refs(&q->key); } -static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time) +static int futex_wait(u32 __user *uaddr, u32 val, unsigned long time, + struct rw_semaphore *shared) { struct task_struct *curr = current; DECLARE_WAITQUEUE(wait, curr); @@ -1011,9 +1070,10 @@ static int futex_wait(u32 __user *uaddr, q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1036,8 +1096,8 @@ static int futex_wait(u32 __user *uaddr, * a wakeup when *uaddr != val on entry to the syscall. This is * rare, but normal. * - * We hold the mmap semaphore, so the mapping cannot have changed - * since we looked it up in get_futex_key. + * for shared futexes, we hold the mmap semaphore, so the mapping + * cannot have changed since we looked it up in get_futex_key. */ ret = get_futex_value_locked(&uval, uaddr); @@ -1048,7 +1108,8 @@ static int futex_wait(u32 __user *uaddr, * If we would have faulted, release mmap_sem, fault it in and * start all over again. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); ret = get_user(uval, uaddr); @@ -1067,7 +1128,8 @@ static int futex_wait(u32 __user *uaddr, * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep.
*/ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); /* * There might have been scheduling since the queue_me(), as we @@ -1109,7 +1171,8 @@ static int futex_wait(u32 __user *uaddr, queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); return ret; } @@ -1120,7 +1183,7 @@ static int futex_wait(u32 __user *uaddr, * races the kernel might see a 0 value of the futex too.) */ static int futex_lock_pi(u32 __user *uaddr, int detect, unsigned long sec, - long nsec, int trylock) + long nsec, int trylock, struct rw_semaphore *shared) { struct hrtimer_sleeper timeout, *to = NULL; struct task_struct *curr = current; @@ -1141,9 +1204,10 @@ static int futex_lock_pi(u32 __user *uad q.pi_state = NULL; retry: - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &q.key); + ret = get_futex_key(uaddr, &q.key, shared); if (unlikely(ret != 0)) goto out_release_sem; @@ -1237,7 +1301,8 @@ static int futex_lock_pi(u32 __user *uad * Now the futex is queued and we have checked the data, we * don't want to hold mmap_sem while we sleep. */ - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); WARN_ON(!q.pi_state); /* @@ -1251,7 +1316,8 @@ static int futex_lock_pi(u32 __user *uad ret = ret ? 0 : -EWOULDBLOCK; } - down_read(&curr->mm->mmap_sem); + if (shared) + down_read(shared); spin_lock(q.lock_ptr); /* @@ -1279,7 +1345,8 @@ static int futex_lock_pi(u32 __user *uad /* Unqueue and drop the lock */ unqueue_me_pi(&q, hb); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); /* * We own it, so we have to replace the pending owner * TID. 
This must be atomic as we have preserve the @@ -1308,7 +1375,8 @@ static int futex_lock_pi(u32 __user *uad } /* Unqueue and drop the lock */ unqueue_me_pi(&q, hb); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); } if (!detect && ret == -EDEADLK && 0) @@ -1320,7 +1388,8 @@ static int futex_lock_pi(u32 __user *uad queue_unlock(&q, hb); out_release_sem: - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); return ret; uaddr_faulted: @@ -1331,15 +1400,15 @@ static int futex_lock_pi(u32 __user *uad * still holding the mmap_sem. */ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, attempt, shared); + if (ret) goto out_unlock_release_sem; - } goto retry_locked; } queue_unlock(&q, hb); - up_read(&curr->mm->mmap_sem); + if (shared) + up_read(shared); ret = get_user(uval, uaddr); if (!ret && (uval != -EFAULT)) @@ -1353,7 +1422,7 @@ static int futex_lock_pi(u32 __user *uad * This is the in-kernel slowpath: we look up the PI state (if any), * and do the rt-mutex unlock. */ -static int futex_unlock_pi(u32 __user *uaddr) +static int futex_unlock_pi(u32 __user *uaddr, struct rw_semaphore *shared) { struct futex_hash_bucket *hb; struct futex_q *this, *next; @@ -1373,9 +1442,10 @@ retry: /* * First take all the futex related locks: */ - down_read(&current->mm->mmap_sem); + if (shared) + down_read(shared); - ret = get_futex_key(uaddr, &key); + ret = get_futex_key(uaddr, &key, shared); if (unlikely(ret != 0)) goto out; @@ -1434,7 +1504,8 @@ retry_locked: out_unlock: spin_unlock(&hb->lock); out: - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); return ret; @@ -1446,15 +1517,15 @@ pi_faulted: * still holding the mmap_sem.
*/ if (attempt++) { - if (futex_handle_fault((unsigned long)uaddr, attempt)) { - ret = -EFAULT; + ret = futex_handle_fault((unsigned long)uaddr, attempt, shared); + if (ret) goto out_unlock; - } goto retry_locked; } spin_unlock(&hb->lock); - up_read(&current->mm->mmap_sem); + if (shared) + up_read(shared); ret = get_user(uval, uaddr); if (!ret && (uval != -EFAULT)) @@ -1506,6 +1577,7 @@ static int futex_fd(u32 __user *uaddr, i struct futex_q *q; struct file *filp; int ret, err; + struct rw_semaphore *shared; static unsigned long printk_interval; if (printk_timed_ratelimit(&printk_interval, 60 * 60 * 1000)) { @@ -1547,11 +1619,12 @@ static int futex_fd(u32 __user *uaddr, i } q->pi_state = NULL; - down_read(&current->mm->mmap_sem); - err = get_futex_key(uaddr, &q->key); + shared = &current->mm->mmap_sem; + down_read(shared); + err = get_futex_key(uaddr, &q->key, shared); if (unlikely(err != 0)) { - up_read(&current->mm->mmap_sem); + up_read(shared); kfree(q); goto error; } @@ -1563,7 +1636,7 @@ static int futex_fd(u32 __user *uaddr, i filp->private_data = q; queue_me(q, ret, filp); - up_read(&current->mm->mmap_sem); + up_read(shared); /* Now we map fd to filp, so userspace can access it */ fd_install(ret, filp); @@ -1690,7 +1763,7 @@ retry: */ if (!pi) { if (uval & FUTEX_WAITERS) - futex_wake(uaddr, 1); + futex_wake(uaddr, 1, &curr->mm->mmap_sem); } } return 0; @@ -1776,35 +1849,40 @@ long do_futex(u32 __user *uaddr, int op, u32 __user *uaddr2, u32 val2, u32 val3) { int ret; + int opm = op & FUTEX_CMD_MASK; + struct rw_semaphore *shared = NULL; + + if (!(op & FUTEX_PRIVATE_FLAG)) + shared = &current->mm->mmap_sem; - switch (op) { + switch (opm) { case FUTEX_WAIT: - ret = futex_wait(uaddr, val, timeout); + ret = futex_wait(uaddr, val, timeout, shared); break; case FUTEX_WAKE: - ret = futex_wake(uaddr, val); + ret = futex_wake(uaddr, val, shared); break; case FUTEX_FD: /* non-zero val means F_SETOWN(getpid()) & F_SETSIG(val) */ ret = futex_fd(uaddr, val); break; case FUTEX_REQUEUE: - ret =
futex_requeue(uaddr, uaddr2, val, val2, NULL); + ret = futex_requeue(uaddr, uaddr2, val, val2, NULL, shared); break; case FUTEX_CMP_REQUEUE: - ret = futex_requeue(uaddr, uaddr2, val, val2, &val3); + ret = futex_requeue(uaddr, uaddr2, val, val2, &val3, shared); break; case FUTEX_WAKE_OP: - ret = futex_wake_op(uaddr, uaddr2, val, val2, val3); + ret = futex_wake_op(uaddr, uaddr2, val, val2, val3, shared); break; case FUTEX_LOCK_PI: - ret = futex_lock_pi(uaddr, val, timeout, val2, 0); + ret = futex_lock_pi(uaddr, val, timeout, val2, 0, shared); break; case FUTEX_UNLOCK_PI: - ret = futex_unlock_pi(uaddr); + ret = futex_unlock_pi(uaddr, shared); break; case FUTEX_TRYLOCK_PI: - ret = futex_lock_pi(uaddr, 0, timeout, val2, 1); + ret = futex_lock_pi(uaddr, 0, timeout, val2, 1, shared); break; default: ret = -ENOSYS; @@ -1820,8 +1898,9 @@ asmlinkage long sys_futex(u32 __user *ua struct timespec t; unsigned long timeout = MAX_SCHEDULE_TIMEOUT; u32 val2 = 0; + int opm = op & FUTEX_CMD_MASK; - if (utime && (op == FUTEX_WAIT || op == FUTEX_LOCK_PI)) { + if (utime && (opm == FUTEX_WAIT || opm == FUTEX_LOCK_PI)) { if (copy_from_user(&t, utime, sizeof(t)) != 0) return -EFAULT; if (!timespec_valid(&t)) @@ -1834,9 +1913,9 @@ asmlinkage long sys_futex(u32 __user *ua } } /* - * requeue parameter in 'utime' if op == FUTEX_REQUEUE. + * requeue parameter in 'utime' if opm == FUTEX_REQUEUE. 
*/ - if (op == FUTEX_REQUEUE || op == FUTEX_CMP_REQUEUE) + if (opm == FUTEX_REQUEUE || opm == FUTEX_CMP_REQUEUE) val2 = (u32) (unsigned long) utime; return do_futex(uaddr, op, val, timeout, uaddr2, val2, val3); --- linux-2.6.21-rc3/include/linux/futex.h 2007-03-13 13:22:31.000000000 +0100 +++ linux-2.6.21-rc3-ed/include/linux/futex.h 2007-03-15 18:08:37.000000000 +0100 @@ -16,6 +16,18 @@ #define FUTEX_UNLOCK_PI 7 #define FUTEX_TRYLOCK_PI 8 +#define FUTEX_PRIVATE_FLAG 128 +#define FUTEX_CMD_MASK ~FUTEX_PRIVATE_FLAG + +#define FUTEX_WAIT_PRIVATE (FUTEX_WAIT | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_PRIVATE (FUTEX_WAKE | FUTEX_PRIVATE_FLAG) +#define FUTEX_REQUEUE_PRIVATE (FUTEX_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_CMP_REQUEUE_PRIVATE (FUTEX_CMP_REQUEUE | FUTEX_PRIVATE_FLAG) +#define FUTEX_WAKE_OP_PRIVATE (FUTEX_WAKE_OP | FUTEX_PRIVATE_FLAG) +#define FUTEX_LOCK_PI_PRIVATE (FUTEX_LOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_UNLOCK_PI_PRIVATE (FUTEX_UNLOCK_PI | FUTEX_PRIVATE_FLAG) +#define FUTEX_TRYLOCK_PI_PRIVATE (FUTEX_TRYLOCK_PI | FUTEX_PRIVATE_FLAG) + /* * Support for robust futexes: the kernel cleans up held futexes at * thread exit time. ^ permalink raw reply [flat|nested] 78+ messages in thread
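The libc-side fallback suggested in the 1/3 changelog (one global variable holding either FUTEX_PRIVATE_FLAG or 0, so that at most one syscall fails on an old kernel) could look roughly like the sketch below. It is a userspace illustration only, not glibc code: fake_futex(), kernel_supports_private, futex_private_flag and private_futex_op() are hypothetical names, and fake_futex() merely mimics the op decoding done by do_futex() in the patch. The constants are the ones from the include/linux/futex.h hunk.

```c
#include <errno.h>

/* Values from the include/linux/futex.h hunk of the patch. */
#define FUTEX_WAIT         0
#define FUTEX_WAKE         1
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CMD_MASK     (~FUTEX_PRIVATE_FLAG)

/* Hypothetical knob: does the simulated kernel know the _PRIVATE ops? */
static int kernel_supports_private = 1;

/*
 * Hypothetical stand-in for the futex syscall. Only the op decoding is
 * modelled: an old kernel switches on the raw op, so any op carrying
 * FUTEX_PRIVATE_FLAG hits the default case and returns -ENOSYS.
 */
static int fake_futex(int op)
{
	int cmd = op & FUTEX_CMD_MASK;

	if ((op & FUTEX_PRIVATE_FLAG) && !kernel_supports_private)
		return -ENOSYS;
	if (cmd != FUTEX_WAIT && cmd != FUTEX_WAKE)
		return -ENOSYS;
	return 0;
}

/* Holds FUTEX_PRIVATE_FLAG or 0, as suggested in the changelog. */
static int futex_private_flag = FUTEX_PRIVATE_FLAG;

/* Issue a process-private futex op, falling back once on old kernels. */
static int private_futex_op(int cmd)
{
	int ret = fake_futex(cmd | futex_private_flag);

	if (ret == -ENOSYS && futex_private_flag) {
		/* Old kernel: remember that, and retry with the old subcommand. */
		futex_private_flag = 0;
		ret = fake_futex(cmd);
	}
	return ret;
}
```

After the first -ENOSYS the flag stays cleared, so later calls go straight to the old subcommands; a real implementation would also have to consider concurrent updates of the global, which this sketch ignores.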
* [PATCH 2/3] FUTEX : introduce private hashtables 2006-08-09 6:43 ` Eric Dumazet 2007-03-15 19:10 ` [PATCH 0/3] FUTEX : new PRIVATE futexes, SMP and NUMA improvements Eric Dumazet 2007-03-15 19:13 ` [PATCH 1/3] FUTEX : introduce PROCESS_PRIVATE semantic Eric Dumazet @ 2007-03-15 19:16 ` Eric Dumazet 2007-03-15 20:25 ` Nick Piggin 2007-03-15 19:20 ` [PATCH 3/3] FUTEX : NUMA friendly global hashtable Eric Dumazet 3 siblings, 1 reply; 78+ messages in thread From: Eric Dumazet @ 2007-03-15 19:16 UTC (permalink / raw) To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar Cc: Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org), pravin b shelar, linux-kernel [-- Attachment #1: Type: text/plain, Size: 1002 bytes --] [PATCH 2/3] FUTEX : introduce private hashtables This patch introduces a separate hashtable per process to store _PRIVATE futexes. This hashtable is dynamically allocated on the first _PRIVATE futex syscall. If memory cannot be allocated, the process will use the global hashtable. Using a separate hashtable has the advantage of lowering the contention on the global hashtable. NUMA should benefit from this separation because the allocation should respect the mm policy of the process. Code is using kmalloc()/vmalloc() depending on the size of spinlocks. For a normal setup, the size of the private hashtable should be 768 bytes on 32bit arches, 1536 bytes on 64bit arches. The private hashtable is freed when the process exits.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- include/linux/futex.h | 4 + include/linux/sched.h | 7 ++ kernel/fork.c | 1 kernel/futex.c | 112 ++++++++++++++++++++++++++++++++++++++-- 4 files changed, 120 insertions(+), 4 deletions(-) [-- Attachment #2: futex_p2.patch --] [-- Type: text/plain, Size: 5570 bytes --] --- linux-2.6.21-rc3/kernel/futex.c 2007-03-15 18:30:15.000000000 +0100 +++ linux-2.6.21-rc3-ed/kernel/futex.c 2007-03-15 18:54:47.000000000 +0100 @@ -51,11 +51,11 @@ #include <linux/pagemap.h> #include <linux/syscalls.h> #include <linux/signal.h> +#include <linux/vmalloc.h> #include <asm/futex.h> #include "rtmutex_common.h" -#define FUTEX_HASHBITS (CONFIG_BASE_SMALL ? 4 : 8) /* * Futexes are matched on equal values of this key. @@ -147,11 +147,96 @@ struct futex_hash_bucket { struct list_head chain; }; -static struct futex_hash_bucket futex_queues[1<<FUTEX_HASHBITS]; + +#if CONFIG_BASE_SMALL +# define FUTEX_HASH_SLOTS 16 +# define FUTEX_NOPRIVHASH /* no private hashtable, only one global */ +#else +# define FUTEX_HASH_SLOTS 256 +# define FUTEX_PRIVHASH_SLOTS 64 +#endif + +#define FUTEX_PRIVHASH_SIZE \ + (FUTEX_PRIVHASH_SLOTS * sizeof(struct futex_hash_bucket)) +/* + * futex_queues[] : global hash table + * + * PTHREAD_PROCESS_SHARED futexes are hashed into this table + * + * PTHREAD_PROCESS_PRIVATE futexes may be hashed into this table too if the + * owner process failed to allocate its private hashtable (or CONFIG_BASE_SMALL) + * + */ +static struct futex_hash_bucket futex_queues[FUTEX_HASH_SLOTS]; /* Futex-fs vfsmount entry: */ static struct vfsmount *futex_mnt; + +/* + * private futexes are hashed into a process private table. + * As this table is dynamically allocated, it might be in fact + * the global table in case of memory stress. + * A pointer to this table is kept in mm_struct. 
+ */ +#ifndef FUTEX_NOPRIVHASH +static void mm_priv_queues_alloc(struct mm_struct *mm) +{ + unsigned int ui; + struct futex_hash_bucket *queues; + + /* + * FUTEX_PRIVHASH_SIZE is a constant, compiler should choose + * either vmalloc()/kmalloc() :) + */ + if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE) + queues = vmalloc(FUTEX_PRIVHASH_SIZE); + else + queues = kmalloc(FUTEX_PRIVHASH_SIZE, GFP_KERNEL); + + if (queues) { + for (ui = 0; ui < FUTEX_PRIVHASH_SLOTS; ui++) { + spin_lock_init(&queues[ui].lock); + INIT_LIST_HEAD(&queues[ui].chain); + } + spin_lock(&mm->page_table_lock); + /* + * check if another thread installed a table before me + */ + if (mm->mm_priv_futex_queues) { + if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE) + vfree(queues); + else + kfree(queues); + } + else + mm->mm_priv_futex_queues = queues; + } + else { + spin_lock(&mm->page_table_lock); + if (!mm->mm_priv_futex_queues) + mm->mm_priv_futex_queues = futex_queues; + } + spin_unlock(&mm->page_table_lock); +} +#endif + +/* + * Called from __mmdrop()/mm_free_futex() to eventually free private futexes + * hash table attached to mm + */ +void __mm_free_futex(struct mm_struct *mm) +{ +#ifndef FUTEX_NOPRIVHASH + if (mm->mm_priv_futex_queues != futex_queues) { + if (FUTEX_PRIVHASH_SIZE > PAGE_SIZE) + vfree(mm->mm_priv_futex_queues); + else + kfree(mm->mm_priv_futex_queues); + } +#endif +} + /* * We hash on the keys returned from get_futex_key (see below). 
  */
@@ -159,8 +244,27 @@ static struct futex_hash_bucket *hash_fu
 {
 	u32 hash = jhash2((u32*)&key->both.word,
 			  (sizeof(key->both.word)+sizeof(key->both.ptr))/4,
-			  key->both.offset);
-	return &futex_queues[hash & ((1 << FUTEX_HASHBITS)-1)];
+			  key->both.offset) % FUTEX_HASH_SLOTS;
+
+#ifdef FUTEX_NOPRIVHASH
+	return &futex_queues[hash];
+#else
+	struct mm_struct *mm;
+	/*
+	 * PROCESS_SHARED futexes are hashed into futex_queues[]
+	 */
+	if (key->both.offset & (OFF_INODE|OFF_MMSHARED))
+		return &futex_queues[hash];
+
+	if (FUTEX_PRIVHASH_SLOTS < FUTEX_HASH_SLOTS)
+		hash %= FUTEX_PRIVHASH_SLOTS;
+
+	mm = current->mm;
+	if (unlikely(!mm->mm_priv_futex_queues))
+		mm_priv_queues_alloc(mm);
+
+	return &mm->mm_priv_futex_queues[hash];
+#endif
 }
 
 /*
--- linux-2.6.21-rc3/include/linux/futex.h	2007-03-15 18:08:37.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/futex.h	2007-03-15 18:31:40.000000000 +0100
@@ -115,6 +115,7 @@ handle_futex_death(u32 __user *uaddr, st
 #ifdef CONFIG_FUTEX
 extern void exit_robust_list(struct task_struct *curr);
 extern void exit_pi_state_list(struct task_struct *curr);
+extern void __mm_free_futex(struct mm_struct *mm);
 #else
 static inline void exit_robust_list(struct task_struct *curr)
 {
@@ -122,6 +123,9 @@ static inline void exit_robust_list(stru
 static inline void exit_pi_state_list(struct task_struct *curr)
 {
 }
+static inline void __mm_free_futex(struct mm_struct *mm)
+{
+}
 #endif
 
 #endif /* __KERNEL__ */
--- linux-2.6.21-rc3/include/linux/sched.h	2007-03-15 18:32:08.000000000 +0100
+++ linux-2.6.21-rc3-ed/include/linux/sched.h	2007-03-15 18:31:40.000000000 +0100
@@ -373,6 +373,8 @@ struct mm_struct {
 	/* aio bits */
 	rwlock_t ioctx_list_lock;
 	struct kioctx *ioctx_list;
+	/* private futexes */
+	struct futex_hash_bucket *mm_priv_futex_queues;
 };
 
 struct sighand_struct {
@@ -1374,6 +1376,11 @@ static inline int sas_ss_flags(unsigned
  * Routines for handling mm_structs
  */
 extern struct mm_struct * mm_alloc(void);
+static inline void mm_free_futex(struct mm_struct * mm)
+{
+	if (mm->mm_priv_futex_queues)
+		__mm_free_futex(mm);
+}
 
 /* mmdrop drops the mm and the page tables */
 extern void FASTCALL(__mmdrop(struct mm_struct *));
--- linux-2.6.21-rc3/kernel/fork.c	2007-03-15 18:32:08.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/fork.c	2007-03-15 18:31:40.000000000 +0100
@@ -374,6 +374,7 @@ void fastcall __mmdrop(struct mm_struct
 	BUG_ON(mm == &init_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
+	mm_free_futex(mm);
 	free_mm(mm);
 }
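As a reading aid, the bucket selection done by the hash_futex() hunk in this patch can be modelled in plain userspace C. This is only a sketch under stated assumptions, not kernel code: every `model_*` / `MODEL_*` name is hypothetical, the raw hash is passed in instead of being computed with jhash2(), the slot counts are the non-NUMA values (256 and 64), and the fallback to the global table when a process has no private table follows the patch description rather than code visible in this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for FUTEX_HASH_SLOTS and FUTEX_PRIVHASH_SLOTS. */
#define MODEL_HASH_SLOTS     256u
#define MODEL_PRIVHASH_SLOTS 64u

enum model_table { MODEL_TABLE_GLOBAL, MODEL_TABLE_PRIVATE };

struct model_slot {
	enum model_table table; /* which table the futex lands in */
	uint32_t index;         /* bucket index inside that table */
};

/*
 * raw_hash:           value that jhash2() would have produced (assumed input)
 * shared:             true for PROCESS_SHARED futexes
 * have_private_table: false models a failed private-table allocation
 */
struct model_slot model_pick_slot(uint32_t raw_hash, bool shared,
				  bool have_private_table)
{
	struct model_slot slot;
	uint32_t hash = raw_hash % MODEL_HASH_SLOTS;

	/* Shared futexes, and processes without a private table, use the global table. */
	if (shared || !have_private_table) {
		slot.table = MODEL_TABLE_GLOBAL;
		slot.index = hash;
		return slot;
	}

	/* Private futexes are folded again into the smaller per-process table. */
	slot.table = MODEL_TABLE_PRIVATE;
	slot.index = hash % MODEL_PRIVHASH_SLOTS;
	return slot;
}
```

For example, a raw hash of 200 lands in global bucket 200 when the futex is shared, and in private bucket 8 (200 % 64) when it is private and the process already has its own table.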
* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 19:16 ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
@ 2007-03-15 20:25 ` Nick Piggin
  2007-03-15 21:09 ` Ulrich Drepper
  2007-03-15 22:59 ` William Lee Irwin III
  0 siblings, 2 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 20:25 UTC
To: Eric Dumazet
Cc: Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Eric Dumazet wrote:
> [PATCH 2/3] FUTEX : introduce private hashtables
>
> This patch introduces a separate hashtable per process to store _PRIVATE
> futexes.
> This hashtable is dynamically allocated on the first _PRIVATE futex syscall.
> If memory cannot be allocated, the process will use the global hashtable.
>
> Using a separate hashtable has the advantage of lowering the contention on the
> global hashtable. NUMA should benefit from this separation because the
> allocation should respect the mm policy of the process.
>
> Code is using kmalloc()/vmalloc() depending on the size of spinlocks. For
> normal setup, size of the private hashtable should be 768 bytes on 32bit
> arches, 1536 bytes on 64bit arches.
>
> Private hashtable is freed() when process exits.

I do disagree with this patch, though.

There should be little contention on the memory in the global hash anyway,
because we can roughly reduce contention as a factor of hash-size/cacheline-size.

What we will have are cache misses on the global table... but we're going to
get cache misses on those private tables as well.

Also, you never know what the use cases are going to be... there could be an
application with thousands of threads and mutexes in which case your private
hash could be too small...

I think we want to stay away from per-mm _anything_ with private futexes,
which is the direction that your patch 1 takes us.
I would just avoid the complexity and setup/teardown costs, and just
use a vmalloc'ed global hash for NUMA.

-- 
SUSE Labs, Novell Inc.
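The "hash-size/cacheline-size" factor in this reply can be made concrete with a little arithmetic. The following is a sketch under stated assumptions: 64-byte cache lines, a table that starts on a line boundary, and a 24-byte bucket on 64-bit (inferred from the 1536-byte figure quoted for the private table, assuming that table has 64 slots). `table_cachelines()` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of cache lines spanned by an array of `slots` buckets of
 * `bucket_bytes` each, rounded up, assuming the array starts on a
 * line boundary.
 */
size_t table_cachelines(size_t slots, size_t bucket_bytes, size_t line_bytes)
{
	assert(line_bytes != 0);
	return (slots * bucket_bytes + line_bytes - 1) / line_bytes;
}
```

With these numbers a 256-slot global table spans 96 lines and a 64-slot private table spans 24, so under a uniform hash two unrelated futexes would start on the same line very roughly once in 96 and once in 24 pairings respectively. Whether that translates into measurable contention depends on the workload, which is the point under debate in the thread.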
* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 20:25 ` Nick Piggin
@ 2007-03-15 21:09 ` Ulrich Drepper
  2007-03-15 21:29 ` Nick Piggin
  2007-03-15 22:59 ` William Lee Irwin III
  1 sibling, 1 reply; 78+ messages in thread
From: Ulrich Drepper @ 2007-03-15 21:09 UTC
To: Nick Piggin
Cc: Eric Dumazet, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On 3/15/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
> There should be little contention on the memory in the global hash anyway,
> because we can roughly reduce contention as a factor of hash-size/cacheline-size.
>
> What we will have are cache misses on the global table... but we're going to
> get cache misses on those private tables as well.

I'm thinking about NUMA cases. If you have private tables for a
process which is pinned to some cluster in a NUMA machine the table is
local to the node. If you have a global table you cannot optimize
your application for such a situation because at least some of the
pages of the global table are remote.
* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 21:09 ` Ulrich Drepper
@ 2007-03-15 21:29 ` Nick Piggin
  0 siblings, 0 replies; 78+ messages in thread
From: Nick Piggin @ 2007-03-15 21:29 UTC
To: Ulrich Drepper
Cc: Eric Dumazet, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

Ulrich Drepper wrote:
> On 3/15/07, Nick Piggin <nickpiggin@yahoo.com.au> wrote:
>
>> There should be little contention on the memory in the global hash anyway,
>> because we can roughly reduce contention as a factor of
>> hash-size/cacheline-size.
>>
>> What we will have are cache misses on the global table... but we're going to
>> get cache misses on those private tables as well.
>
> I'm thinking about NUMA cases. If you have private tables for a
> process which is pinned to some cluster in a NUMA machine the table is
> local to the node. If you have a global table you cannot optimize
> your application for such a situation because at least some of the
> pages of the global table are remote.

That's true, but it also might be able to be improved in other ways.
At least once we get the basic support in the kernel, and glibc picks
it up, then we have a better base to evaluate these more exotic changes
against.

-- 
SUSE Labs, Novell Inc.
* Re: [PATCH 2/3] FUTEX : introduce private hashtables
  2007-03-15 20:25 ` Nick Piggin
  2007-03-15 21:09 ` Ulrich Drepper
@ 2007-03-15 22:59 ` William Lee Irwin III
  1 sibling, 0 replies; 78+ messages in thread
From: William Lee Irwin III @ 2007-03-15 22:59 UTC
To: Nick Piggin
Cc: Eric Dumazet, Ulrich Drepper, Andrew Morton, Ingo Molnar, Andi Kleen,
	Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

On Fri, Mar 16, 2007 at 07:25:53AM +1100, Nick Piggin wrote:
> I would just avoid the complexity and setup/teardown costs, and just
> use a vmalloc'ed global hash for NUMA.

This patch is not the way to go, but neither are vmalloc()'d global
hashtables. When you just happen to hash to the wrong node, you're in
for quasi-unreproducible poor performance. The size is never right, at
which point RCU resizing is required with all its overhead and memory
freeing delays and failure to resize (even if only to contract) under
pressure.

Better would be to use a different data structure admitting locality of
reference and adaptively sizing itself, furthermore localized to the
appropriate sharing domain. For file-backed futexes, this would be the
struct address_space. For anonymous-backed futexes, this would be the
COW sharing group, which an anon_vma could almost be used to represent.
Using an object to properly represent the COW sharing group (i.e. Hugh's
struct anon) would do the trick, and one might as well move the rmap
code over to it while we're at it since the anon_vma scanning tricks
are all pointless overhead once the COW sharing group is accurately
tracked (the scanning around for nearby vmas with ->anon_vma set is not
great anyway, though the overhead is hidden in the noise of large
teardown and setup operations; inheriting on fork() is much simpler and
faster). In such a manner localization is accomplished while no
interface extensions are required.

-- wli
* [PATCH 3/3] FUTEX : NUMA friendly global hashtable
  2006-08-09 6:43 ` Eric Dumazet
  ` (2 preceding siblings ...)
  2007-03-15 19:16 ` [PATCH 2/3] FUTEX : introduce private hashtables Eric Dumazet
@ 2007-03-15 19:20 ` Eric Dumazet
  3 siblings, 0 replies; 78+ messages in thread
From: Eric Dumazet @ 2007-03-15 19:20 UTC
To: Nick Piggin, Ulrich Drepper, Andrew Morton, Ingo Molnar
Cc: Andi Kleen, Ravikiran G Thirumalai, Shai Fultheim (Shai@scalex86.org),
	pravin b shelar, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 594 bytes --]

[PATCH 3/3] FUTEX : NUMA friendly global hashtable

On NUMA machines, we should get better performance using a big futex
hashtable, allocated with vmalloc() so that it is spread across several
nodes.

I chose a static size of four pages. (Very big NUMA machines have 64k page
size)

This patch should have a temporary effect, as most futexes are expected to be
stored in process private tables. We probably can drop it in five years :)

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 kernel/futex.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

[-- Attachment #2: futex_p3.patch --]
[-- Type: text/plain, Size: 1770 bytes --]

--- linux-2.6.21-rc3/kernel/futex.c	2007-03-15 18:54:47.000000000 +0100
+++ linux-2.6.21-rc3-ed/kernel/futex.c	2007-03-15 18:54:47.000000000 +0100
@@ -152,8 +152,13 @@ struct futex_hash_bucket {
 # define FUTEX_HASH_SLOTS 16
 # define FUTEX_NOPRIVHASH /* no private hashtable, only one global */
 #else
-# define FUTEX_HASH_SLOTS 256
-# define FUTEX_PRIVHASH_SLOTS 64
+# ifdef CONFIG_NUMA
+# define FUTEX_HASH_SLOTS ((4*PAGE_SIZE)/sizeof(struct futex_hash_bucket))
+# define FUTEX_PRIVHASH_SLOTS (4096 / sizeof(struct futex_hash_bucket))
+# else
+# define FUTEX_HASH_SLOTS 256
+# define FUTEX_PRIVHASH_SLOTS 64
+# endif
 #endif
 
 #define FUTEX_PRIVHASH_SIZE \
@@ -166,8 +171,14 @@ struct futex_hash_bucket {
  * PTHREAD_PROCESS_PRIVATE futexes may be hashed into this table too if the
  * owner process failed to allocate its private hashtable (or CONFIG_BASE_SMALL)
  *
+ * On NUMA configs, table is allocated with vmalloc() to spread this hash table
+ * up to 4 nodes (we use 4 pages)
  */
+#ifdef CONFIG_NUMA
+static struct futex_hash_bucket *futex_queues __read_mostly;
+#else
 static struct futex_hash_bucket futex_queues[FUTEX_HASH_SLOTS];
+#endif
 
 /* Futex-fs vfsmount entry: */
 static struct vfsmount *futex_mnt;
@@ -2051,7 +2062,16 @@ static int __init init(void)
 		return PTR_ERR(futex_mnt);
 	}
 
-	for (i = 0; i < ARRAY_SIZE(futex_queues); i++) {
+#ifdef CONFIG_NUMA
+	/*
+	 * vmalloc() is supposed to obey mempolicy and spread our 4 pages
+	 * on several nodes
+	 */
+	futex_queues = vmalloc(FUTEX_HASH_SLOTS * sizeof(*futex_queues));
+	if (!futex_queues)
+		panic("Failed to allocate futex hash table\n");
+#endif
+	for (i = 0; i < FUTEX_HASH_SLOTS; i++) {
 		INIT_LIST_HEAD(&futex_queues[i].chain);
 		spin_lock_init(&futex_queues[i].lock);
 	}
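To see what the CONFIG_NUMA sizing in this patch works out to, the slot arithmetic can be replayed in userspace. This is a sketch only: the 24-byte bucket size is an assumption (the 64-bit figure implied by the 1536-byte private table mentioned in patch 2/3), and `numa_global_slots()` and `slot_first_page()` are hypothetical helpers that mirror the `(4*PAGE_SIZE)/sizeof(struct futex_hash_bucket)` expression, not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Slots in the NUMA global table: four pages' worth of buckets. */
size_t numa_global_slots(size_t page_bytes, size_t bucket_bytes)
{
	assert(bucket_bytes != 0);
	return (4 * page_bytes) / bucket_bytes;
}

/* Index (0..3) of the page that holds the first byte of bucket `slot`. */
size_t slot_first_page(size_t slot, size_t page_bytes, size_t bucket_bytes)
{
	assert(page_bytes != 0);
	return (slot * bucket_bytes) / page_bytes;
}
```

Under these assumptions a 4 KiB page size gives 682 slots and a 64 KiB page size gives 10922. Neither is a power of two, which is consistent with patch 2/3 reducing the hash with `% FUTEX_HASH_SLOTS` rather than a bit mask. Because 24 does not divide 4096, some buckets straddle a page boundary (bucket 170 starts at byte 4080), and whether the four pages really end up on different nodes depends on vmalloc() and the memory policy in effect, as the patch comment itself hedges.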
* Re: [RFC] NUMA futex hashing
  2006-08-08 10:10 ` Eric Dumazet
  2006-08-08 10:36 ` Andi Kleen
@ 2006-08-09 0:13 ` Ravikiran G Thirumalai
  1 sibling, 0 replies; 78+ messages in thread
From: Ravikiran G Thirumalai @ 2006-08-09 0:13 UTC
To: Eric Dumazet
Cc: Andi Kleen, Shai Fultheim (Shai@scalex86.org), pravin b shelar,
	linux-kernel

On Tue, Aug 08, 2006 at 12:10:39PM +0200, Eric Dumazet wrote:
> On Tuesday 08 August 2006 11:57, Andi Kleen wrote:
> > Ravikiran G Thirumalai <kiran@scalex86.org> writes:
> > > Current futex hash scheme is not the best for NUMA. The futex hash
> > > table is an array of struct futex_hash_bucket, which is just a spinlock
> > > and a list_head -- this means multiple spinlocks on the same cacheline
> > > and on NUMA machines, on the same internode cacheline. If futexes of two
> > > unrelated threads running on two different nodes happen to hash onto
> > > adjacent hash buckets, or buckets on the same internode cacheline, then
> > > we have the internode cacheline bouncing between nodes.
> >
> > When I did some testing with a (arguably far too lock intensive) benchmark
> > on a bigger box I got most bouncing cycles not in the futex locks itself,
> > but in the down_read on the mm semaphore.
>
> This is true, even with a normal application (not a biased benchmark) and
> using oprofile. mmap_sem is the killer.

Not if two threads of two different processes (so not the same mmap_sem)
hash onto futexes on the same cacheline. But agreed, mmap_sem needs to be
fixed too. If everyone agrees on a per-process hash table for private
futexes, then we will work on that approach.

Thanks,
Kiran
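The case described in this reply, two unrelated processes whose futexes hash to buckets on one cache line, can be checked with simple index arithmetic. This is an illustrative sketch: `buckets_share_line()` is a hypothetical helper, the table is assumed to start on a line boundary, only the first byte of each bucket is considered, and the 24-byte bucket, 64-byte line and 4096-byte "internode line" used in the example are assumptions rather than values taken from the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* True if buckets i and j start within the same line of `line_bytes` bytes. */
bool buckets_share_line(size_t i, size_t j, size_t bucket_bytes,
			size_t line_bytes)
{
	assert(line_bytes != 0);
	return (i * bucket_bytes) / line_bytes ==
	       (j * bucket_bytes) / line_bytes;
}
```

With 24-byte buckets and 64-byte lines, buckets 0 and 1 share a line while buckets 0 and 3 do not. If the coherence unit between nodes were as large as 4096 bytes, buckets 0 and 100 would still fall on the same unit, which is why a larger internode line makes the bouncing described in the original RFC more likely.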