linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Fast Userspace Mutexes III.
@ 2002-03-04  3:55 Rusty Russell
  2002-03-04 19:49 ` Robert Love
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Rusty Russell @ 2002-03-04  3:55 UTC (permalink / raw)
  To: torvalds, matthew, bcrl, david, wli, linux-kernel; +Cc: Hubertus Franke

1) Use mmap/mprotect bits, not new syscall (thanks RTH, Erik Biederman)
2) Fix wakeup race in kernel (thanks Martin Wirth, Paul Mackerras)
3) Simplify locking to a single atomic (no more arch specifics!)
4) Use wake-one by handcoding queues.
5) Comments added.

Thanks to all for feedback and review: I'd appreciate a comment from
those arch's which need to do something with the PROT_SEM bit.

Once again, tested on 2.4.18 UP PPC, compiles on 2.5.6-pre1.

Bad news is that we're up to 206 lines again.
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/kernel/futex.c working-2.5.6-pre1-futex/kernel/futex.c
--- linux-2.5.6-pre1/kernel/futex.c	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-pre1-futex/kernel/futex.c	Mon Mar  4 14:48:47 2002
@@ -0,0 +1,206 @@
+/*
+ *  Fast Userspace Mutexes (which I call "Futexes!").
+ *  (C) Rusty Russell, IBM 2002
+ *
+ *  Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
+ *  enough at me, Linus for the original (flawed) idea, Matthew
+ *  Kirkwood for proof-of-concept implementation.
+ *
+ *  "The futexes are also cursed."
+ *  "But they come in a choice of three flavours!"
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; either version 2 of the License, or
+ *  (at your option) any later version.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, write to the Free Software
+ *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ */
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/hash.h>
+#include <linux/init.h>
+#include <linux/fs.h>
+#include <asm/atomic.h>
+
+/* These mutexes are a very simple counter: the winner is the one who
+   decrements from 1 to 0.  The counter starts at 1 when the lock is
+   free.  A value other than 0 or 1 means someone may be sleeping.
+   This is simple enough to work on all architectures, but has the
+   problem that if we never "up" the semaphore it could eventually
+   wrap around. */
+
+/* FIXME: This may be way too small. --RR */
+#define FUTEX_HASHBITS 6
+
+/* We use this instead of a normal wait_queue_t, so we can wake only
+   the relevent ones (hashed waitqueues may be shared) */
+struct futex_q {
+	struct list_head list;
+	struct task_struct *task;
+	atomic_t *count;
+};
+
+/* The key for the hash is the address + index + offset within page */
+static struct list_head futex_queues[1<<FUTEX_HASHBITS];
+static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
+
+static inline struct list_head *hash_futex(struct page *page,
+					   unsigned long offset)
+{
+	unsigned long h;
+
+	/* If someone is sleeping, page is pinned.  ie. page_address
+           is a constant when we care about it. */
+	h = (unsigned long)page_address(page) + page->index + offset;
+	return &futex_queues[hash_long(h, FUTEX_HASHBITS)];
+}
+
+static inline void wake_one_waiter(struct list_head *head, atomic_t *count)
+{
+	struct list_head *i;
+
+	spin_lock(&futex_lock);
+	list_for_each(i, head) {
+		struct futex_q *this = list_entry(i, struct futex_q, list);
+
+		if (this->count == count) {
+			wake_up_process(this->task);
+			break;
+		}
+	}
+	spin_unlock(&futex_lock);
+}
+
+/* Add at end to avoid starvation */
+static inline void queue_me(struct list_head *head,
+			    struct futex_q *q,
+			    atomic_t *count)
+{
+	q->task = current;
+	q->count = count;
+
+	spin_lock(&futex_lock);
+	list_add_tail(&q->list, head);
+	spin_unlock(&futex_lock);
+}
+
+static inline void unqueue_me(struct futex_q *q)
+{
+	spin_lock(&futex_lock);
+	list_del(&q->list);
+	spin_unlock(&futex_lock);
+}
+
+/* Get kernel address of the user page and pin it. */
+static struct page *pin_page(unsigned long page_start)
+{
+	struct mm_struct *mm = current->mm;
+	struct page *page;
+	int err;
+
+	down_read(&mm->mmap_sem);
+	err = get_user_pages(current, current->mm, page_start,
+			     1 /* one page */,
+			     1 /* writable */,
+			     0 /* don't force */,
+			     &page,
+			     NULL /* don't return vmas */);
+	up_read(&mm->mmap_sem);
+
+	if (err < 0)
+		return ERR_PTR(err);
+	return page;
+}
+
+/* Simplified from arch/ppc/kernel/semaphore.c: Paul M. is a genius. */
+static int futex_down(struct list_head *head, atomic_t *count)
+{
+	int retval = 0;
+	struct futex_q q;
+
+	current->state = TASK_INTERRUPTIBLE;
+	queue_me(head, &q, count);
+
+	/* If we take the semaphore from 1 to 0, it's ours. */
+	while (!atomic_dec_and_test(count)) {
+		if (signal_pending(current)) {
+			retval = -EINTR;
+			break;
+		}
+		schedule();
+		current->state = TASK_INTERRUPTIBLE;
+	}
+	current->state = TASK_RUNNING;
+	unqueue_me(&q);
+	/* If we were signalled, we might have just been woken: we
+	   must wake another one.  Otherwise we need to wake someone
+	   else (if they are waiting) so they drop the count below 0,
+	   and when we "up" in userspace, we know there is a
+	   waiter. */
+	wake_one_waiter(head, count);
+	return retval;
+}
+
+static int futex_up(struct list_head *head, atomic_t *count)
+{
+	atomic_set(count, 1);
+	smp_wmb();
+	wake_one_waiter(head, count);
+	return 0;
+}
+
+asmlinkage int sys_futex(void *uaddr, int op)
+{
+	int ret;
+	unsigned long pos_in_page;
+	struct list_head *head;
+	struct page *page;
+
+	pos_in_page = ((unsigned long)uaddr) % PAGE_SIZE;
+
+	/* Must be "naturally" aligned, and not on page boundary. */
+	if ((pos_in_page % __alignof__(atomic_t)) != 0
+	    || pos_in_page + sizeof(atomic_t) > PAGE_SIZE)
+		return -EINVAL;
+
+	/* Simpler if it doesn't vanish underneath us. */
+	page = pin_page((unsigned long)uaddr - pos_in_page);
+	if (IS_ERR(page))
+		return PTR_ERR(page);
+
+	head = hash_futex(page, pos_in_page);
+	switch (op) {
+	case 1:
+		ret = futex_up(head, page_address(page) + pos_in_page);
+		break;
+	case -1:
+		ret = futex_down(head, page_address(page) + pos_in_page);
+		break;
+	/* Add other lock types here... */
+	default:
+		ret = -EINVAL;
+	}
+	put_page(page);
+
+	return ret;
+}
+
+static int __init init(void)
+{
+	unsigned int i;
+
+	for (i = 0; i < ARRAY_SIZE(futex_queues); i++)
+		INIT_LIST_HEAD(&futex_queues[i]);
+	return 0;
+}
+__initcall(init);
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/arch/i386/kernel/entry.S working-2.5.6-pre1-futex/arch/i386/kernel/entry.S
--- linux-2.5.6-pre1/arch/i386/kernel/entry.S	Wed Feb 20 17:56:59 2002
+++ working-2.5.6-pre1-futex/arch/i386/kernel/entry.S	Mon Mar  4 14:35:13 2002
@@ -716,6 +716,7 @@
 	.long SYMBOL_NAME(sys_lremovexattr)
 	.long SYMBOL_NAME(sys_fremovexattr)
 	.long SYMBOL_NAME(sys_tkill)
+	.long SYMBOL_NAME(sys_futex)
 
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long SYMBOL_NAME(sys_ni_syscall)
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/arch/ppc/kernel/misc.S working-2.5.6-pre1-futex/arch/ppc/kernel/misc.S
--- linux-2.5.6-pre1/arch/ppc/kernel/misc.S	Wed Feb 20 17:57:04 2002
+++ working-2.5.6-pre1-futex/arch/ppc/kernel/misc.S	Mon Mar  4 14:35:56 2002
@@ -1246,6 +1246,7 @@
 	.long sys_removexattr
 	.long sys_lremovexattr
 	.long sys_fremovexattr	/* 220 */
+	.long sys_futex
 	.rept NR_syscalls-(.-sys_call_table)/4
 		.long sys_ni_syscall
 	.endr
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/arch/ppc/kernel/semaphore.c working-2.5.6-pre1-futex/arch/ppc/kernel/semaphore.c
--- linux-2.5.6-pre1/arch/ppc/kernel/semaphore.c	Wed Feb 20 17:57:04 2002
+++ working-2.5.6-pre1-futex/arch/ppc/kernel/semaphore.c	Fri Mar  1 22:52:06 2002
@@ -21,34 +21,6 @@
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
 
-/*
- * Atomically update sem->count.
- * This does the equivalent of the following:
- *
- *	old_count = sem->count;
- *	tmp = MAX(old_count, 0) + incr;
- *	sem->count = tmp;
- *	return old_count;
- */
-static inline int __sem_update_count(struct semaphore *sem, int incr)
-{
-	int old_count, tmp;
-
-	__asm__ __volatile__("\n"
-"1:	lwarx	%0,0,%3\n"
-"	srawi	%1,%0,31\n"
-"	andc	%1,%0,%1\n"
-"	add	%1,%1,%4\n"
-	PPC405_ERR77(0,%3)
-"	stwcx.	%1,0,%3\n"
-"	bne	1b"
-	: "=&r" (old_count), "=&r" (tmp), "=m" (sem->count)
-	: "r" (&sem->count), "r" (incr), "m" (sem->count)
-	: "cc");
-
-	return old_count;
-}
-
 void __up(struct semaphore *sem)
 {
 	/*
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-i386/atomic.h working-2.5.6-pre1-futex/include/asm-i386/atomic.h
--- linux-2.5.6-pre1/include/asm-i386/atomic.h	Fri Nov 23 06:46:18 2001
+++ working-2.5.6-pre1-futex/include/asm-i386/atomic.h	Fri Mar  1 22:51:25 2002
@@ -2,6 +2,7 @@
 #define __ARCH_I386_ATOMIC__
 
 #include <linux/config.h>
+#include <asm/system.h>
 
 /*
  * Atomic operations that C can't guarantee us.  Useful for
@@ -201,4 +202,43 @@
 #define smp_mb__before_atomic_inc()	barrier()
 #define smp_mb__after_atomic_inc()	barrier()
 
+/*
+ * Atomically update count.
+ * This does the equivalent of the following:
+ *
+ *	old_count = *count;
+ *	tmp = MAX(old_count, 0) + incr;
+ *	*count = tmp;
+ *	return old_count;
+ */
+#ifdef CONFIG_M386
+#ifndef CONFIG_SMP
+#define __HAVE_ARCH_UPDATE_COUNT
+/* We can still do this (no userspace can be running). */
+static inline int __update_count(atomic_t *count, int incr)
+{
+	int old_count, tmp;
+
+	/* preempt_disable() */
+	old_count = atomic_read(count);
+	tmp = old_count > 0 ? old_count : 0;
+	atomic_set(count, tmp + incr);
+	return old_count;
+}
+#else
+/* SMP 386 update_count is not possible. 8( */
+#else /* ! 386 */
+#define __HAVE_ARCH_UPDATE_COUNT
+static inline int __update_count(atomic_t *count, int incr)
+{
+	int old_count, tmp;
+
+	do {
+		old_count = atomic_read(count);
+		tmp = old_count > 0 ? old_count : 0;
+	} while (cmpxchg(&count->counter, old_count, tmp + incr) != old_count);
+
+	return old_count;
+}
+#endif /* CONFIG_M386 */
 #endif
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-i386/mman.h working-2.5.6-pre1-futex/include/asm-i386/mman.h
--- linux-2.5.6-pre1/include/asm-i386/mman.h	Wed Mar 15 12:45:20 2000
+++ working-2.5.6-pre1-futex/include/asm-i386/mman.h	Mon Mar  4 14:51:26 2002
@@ -4,6 +4,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-i386/unistd.h working-2.5.6-pre1-futex/include/asm-i386/unistd.h
--- linux-2.5.6-pre1/include/asm-i386/unistd.h	Wed Feb 20 17:56:40 2002
+++ working-2.5.6-pre1-futex/include/asm-i386/unistd.h	Mon Mar  4 14:36:32 2002
@@ -243,6 +243,7 @@
 #define __NR_lremovexattr	236
 #define __NR_fremovexattr	237
 #define __NR_tkill		238
+#define __NR_futex		239
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-ppc/atomic.h working-2.5.6-pre1-futex/include/asm-ppc/atomic.h
--- linux-2.5.6-pre1/include/asm-ppc/atomic.h	Wed Feb 20 17:57:18 2002
+++ working-2.5.6-pre1-futex/include/asm-ppc/atomic.h	Sat Mar  2 10:34:29 2002
@@ -198,5 +198,33 @@
 #define smp_mb__before_atomic_inc()	smp_mb()
 #define smp_mb__after_atomic_inc()	smp_mb()
 
+/*
+ * Atomically update count.
+ * This does the equivalent of the following:
+ *
+ *	old_count = *count;
+ *	tmp = MAX(old_count, 0) + incr;
+ *	*count = tmp;
+ *	return old_count;
+ */
+#define __HAVE_ARCH_UPDATE_COUNT
+static inline int __update_count(atomic_t *count, int incr)
+{
+	int old_count, tmp;
+
+	__asm__ __volatile__("\n"
+"1:	lwarx	%0,0,%3\n"
+"	srawi	%1,%0,31\n"
+"	andc	%1,%0,%1\n"
+"	add	%1,%1,%4\n"
+	PPC405_ERR77(0,%3)
+"	stwcx.	%1,0,%3\n"
+"	bne	1b"
+	: "=&r" (old_count), "=&r" (tmp), "=m" (*count)
+	: "r" (count), "r" (incr), "m" (*count)
+	: "cc");
+
+	return old_count;
+}
 #endif /* __KERNEL__ */
 #endif /* _ASM_PPC_ATOMIC_H_ */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-ppc/mman.h working-2.5.6-pre1-futex/include/asm-ppc/mman.h
--- linux-2.5.6-pre1/include/asm-ppc/mman.h	Tue May 22 08:02:06 2001
+++ working-2.5.6-pre1-futex/include/asm-ppc/mman.h	Mon Mar  4 14:51:26 2002
@@ -7,6 +7,7 @@
 #define PROT_READ	0x1		/* page can be read */
 #define PROT_WRITE	0x2		/* page can be written */
 #define PROT_EXEC	0x4		/* page can be executed */
+#define PROT_SEM	0x8		/* page may be used for atomic ops */
 #define PROT_NONE	0x0		/* page can not be accessed */
 
 #define MAP_SHARED	0x01		/* Share changes */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/asm-ppc/unistd.h working-2.5.6-pre1-futex/include/asm-ppc/unistd.h
--- linux-2.5.6-pre1/include/asm-ppc/unistd.h	Wed Feb 20 17:57:18 2002
+++ working-2.5.6-pre1-futex/include/asm-ppc/unistd.h	Mon Mar  4 14:36:23 2002
@@ -228,6 +228,7 @@
 #define __NR_removexattr	218
 #define __NR_lremovexattr	219
 #define __NR_fremovexattr	220
+#define __NR_futex		221
 
 #define __NR(n)	#n
 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/linux/hash.h working-2.5.6-pre1-futex/include/linux/hash.h
--- linux-2.5.6-pre1/include/linux/hash.h	Thu Jan  1 10:00:00 1970
+++ working-2.5.6-pre1-futex/include/linux/hash.h	Fri Mar  1 22:51:25 2002
@@ -0,0 +1,58 @@
+#ifndef _LINUX_HASH_H
+#define _LINUX_HASH_H
+/* Fast hashing routine for a long.
+   (C) 2002 William Lee Irwin III, IBM */
+
+/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ *
+ * These primes are chosen to be bit-sparse, that is operations on
+ * them can use shifts and additions instead of multiplications for
+ * machines where multiplications are slow.
+ */
+#if BITS_PER_LONG == 32
+/* 2^31 + 2^29 - 2^25 + 2^22 - 2^19 - 2^16 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e370001UL
+#elif BITS_PER_LONG == 64
+/*  2^63 + 2^61 - 2^57 + 2^54 - 2^51 - 2^18 + 1 */
+#define GOLDEN_RATIO_PRIME 0x9e37fffffffc0001UL
+#else
+#error Define GOLDEN_RATIO_PRIME for your wordsize.
+#endif
+
+static inline unsigned long hash_long(unsigned long val, unsigned int bits)
+{
+	unsigned long hash = val;
+
+#if BITS_PER_LONG == 64
+	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
+	unsigned long n = hash;
+	n <<= 18;
+	hash -= n;
+	n <<= 33;
+	hash -= n;
+	n <<= 3;
+	hash += n;
+	n <<= 3;
+	hash -= n;
+	n <<= 4;
+	hash += n;
+	n <<= 2;
+	hash += n;
+#else
+	/* On some cpus multiply is faster, on others gcc will do shifts */
+	hash *= GOLDEN_RATIO_PRIME;
+#endif
+
+	/* High bits are more random, so use them. */
+	return hash >> (BITS_PER_LONG - bits);
+}
+	
+static inline unsigned long hash_ptr(void *ptr, unsigned int bits)
+{
+	return hash_long((unsigned long)ptr, bits);
+}
+#endif /* _LINUX_HASH_H */
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/include/linux/mmzone.h working-2.5.6-pre1-futex/include/linux/mmzone.h
--- linux-2.5.6-pre1/include/linux/mmzone.h	Fri Mar  1 22:58:34 2002
+++ working-2.5.6-pre1-futex/include/linux/mmzone.h	Mon Mar  4 14:37:39 2002
@@ -51,8 +51,7 @@
 	/*
 	 * wait_table		-- the array holding the hash table
 	 * wait_table_size	-- the size of the hash table array
-	 * wait_table_shift	-- wait_table_size
-	 * 				== BITS_PER_LONG (1 << wait_table_bits)
+	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
 	 *
 	 * The purpose of all these is to keep track of the people
 	 * waiting for a page to become available and make them
@@ -75,7 +74,7 @@
 	 */
 	wait_queue_head_t	* wait_table;
 	unsigned long		wait_table_size;
-	unsigned long		wait_table_shift;
+	unsigned long		wait_table_bits;
 
 	/*
 	 * Discontig memory support fields.
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/kernel/Makefile working-2.5.6-pre1-futex/kernel/Makefile
--- linux-2.5.6-pre1/kernel/Makefile	Wed Feb 20 17:56:17 2002
+++ working-2.5.6-pre1-futex/kernel/Makefile	Fri Mar  1 22:53:36 2002
@@ -15,7 +15,7 @@
 obj-y     = sched.o dma.o fork.o exec_domain.o panic.o printk.o \
 	    module.o exit.o itimer.o info.o time.o softirq.o resource.o \
 	    sysctl.o acct.o capability.o ptrace.o timer.o user.o \
-	    signal.o sys.o kmod.o context.o 
+	    signal.o sys.o kmod.o context.o futex.o
 
 obj-$(CONFIG_UID16) += uid16.o
 obj-$(CONFIG_MODULES) += ksyms.o
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/mm/filemap.c working-2.5.6-pre1-futex/mm/filemap.c
--- linux-2.5.6-pre1/mm/filemap.c	Wed Feb 27 14:45:43 2002
+++ working-2.5.6-pre1-futex/mm/filemap.c	Fri Mar  1 23:10:14 2002
@@ -25,6 +25,7 @@
 #include <linux/iobuf.h>
 #include <linux/compiler.h>
 #include <linux/fs.h>
+#include <linux/hash.h>
 
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
@@ -773,32 +774,8 @@
 static inline wait_queue_head_t *page_waitqueue(struct page *page)
 {
 	const zone_t *zone = page_zone(page);
-	wait_queue_head_t *wait = zone->wait_table;
-	unsigned long hash = (unsigned long)page;
 
-#if BITS_PER_LONG == 64
-	/*  Sigh, gcc can't optimise this alone like it does for 32 bits. */
-	unsigned long n = hash;
-	n <<= 18;
-	hash -= n;
-	n <<= 33;
-	hash -= n;
-	n <<= 3;
-	hash += n;
-	n <<= 3;
-	hash -= n;
-	n <<= 4;
-	hash += n;
-	n <<= 2;
-	hash += n;
-#else
-	/* On some cpus multiply is faster, on others gcc will do shifts */
-	hash *= GOLDEN_RATIO_PRIME;
-#endif
-
-	hash >>= zone->wait_table_shift;
-
-	return &wait[hash];
+	return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
 }
 
 /* 
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/mm/mprotect.c working-2.5.6-pre1-futex/mm/mprotect.c
--- linux-2.5.6-pre1/mm/mprotect.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre1-futex/mm/mprotect.c	Mon Mar  4 14:51:26 2002
@@ -280,7 +280,7 @@
 	end = start + len;
 	if (end < start)
 		return -EINVAL;
-	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC))
+	if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM))
 		return -EINVAL;
 	if (end == start)
 		return 0;
diff -urN -I \$.*\$ --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal linux-2.5.6-pre1/mm/page_alloc.c working-2.5.6-pre1-futex/mm/page_alloc.c
--- linux-2.5.6-pre1/mm/page_alloc.c	Wed Feb 20 17:57:21 2002
+++ working-2.5.6-pre1-futex/mm/page_alloc.c	Fri Mar  1 23:10:14 2002
@@ -776,8 +776,8 @@
 		 * per zone.
 		 */
 		zone->wait_table_size = wait_table_size(size);
-		zone->wait_table_shift =
-			BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
+		zone->wait_table_bits =
+			wait_table_bits(zone->wait_table_size);
 		zone->wait_table = (wait_queue_head_t *)
 			alloc_bootmem_node(pgdat, zone->wait_table_size
 						* sizeof(wait_queue_head_t));


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04  3:55 [PATCH] Fast Userspace Mutexes III Rusty Russell
@ 2002-03-04 19:49 ` Robert Love
  2002-03-04 20:13   ` Davide Libenzi
  2002-03-05  1:34   ` Rusty Russell
  2002-03-07  1:52 ` Richard Henderson
  2002-03-07  3:39 ` Rusty Russell
  2 siblings, 2 replies; 24+ messages in thread
From: Robert Love @ 2002-03-04 19:49 UTC (permalink / raw)
  To: Rusty Russell
  Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, Hubertus Franke

On Sun, 2002-03-03 at 22:55, Rusty Russell wrote:
> 1) Use mmap/mprotect bits, not new syscall (thanks RTH, Erik Biederman)
> 2) Fix wakeup race in kernel (thanks Martin Wirth, Paul Mackerras)
> 3) Simplify locking to a single atomic (no more arch specifics!)
> 4) Use wake-one by handcoding queues.
> 5) Comments added.
> 
> Thanks to all for feedback and review: I'd appreciate a comment from
> those arch's which need to do something with the PROT_SEM bit.
> 
> Once again, tested on 2.4.18 UP PPC, compiles on 2.5.6-pre1.
> 
> Bad news is that we're up to 206 lines again.
> Rusty.

Good work.  I likee.

I have a couple comments and question:

> +static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;

Could we make this per-waitqueue?

> +asmlinkage int sys_futex(void *uaddr, int op)
< [...]
> +	switch (op) {
> +	case 1:
> +		ret = futex_up(head, page_address(page) + pos_in_page);
> +		break;
> +	case -1:

We should do:

	#define FUTEX_UP	1
	#define FUTEX_DOWN	-1

and put them in a common header (i.e. include/linux so both the kernel
and glibc will use it) and use that in our code and the kernel code. 
Just a finishing detail ...

> +static inline int __update_count(atomic_t *count, int incr)
> +{
> +	int old_count, tmp;
> +
> +	/* preempt_disable() */
> +	old_count = atomic_read(count);
> +	tmp = old_count > 0 ? old_count : 0;
> +	atomic_set(count, tmp + incr);
> +	return old_count;
> +}

You will want to do:

	int old_count, tmp;

	preempt_disable();
	old count = atomic_read(count);
	tmp = old_count > 0 ? old_count : 0;
	atomic_set(count, tmp + incr);
	preempt_enable();

	return old_count;

here.  The preempt statements compile away if CONFIG_PREEMPT is not set,
so you can just put them in, even on arches that don't do preemption
yet.

... oh, and I would love an example of using it in userspace ;)

Nice work, Rusty.

	Robert Love


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 19:49 ` Robert Love
@ 2002-03-04 20:13   ` Davide Libenzi
  2002-03-04 20:20     ` Matthew Kirkwood
                       ` (2 more replies)
  2002-03-05  1:34   ` Rusty Russell
  1 sibling, 3 replies; 24+ messages in thread
From: Davide Libenzi @ 2002-03-04 20:13 UTC (permalink / raw)
  To: Robert Love
  Cc: Rusty Russell, Linus Torvalds, matthew, Benjamin LaHaise, david,
	wli, Linux Kernel Mailing List, Hubertus Franke

On 4 Mar 2002, Robert Love wrote:

> On Sun, 2002-03-03 at 22:55, Rusty Russell wrote:
> > 1) Use mmap/mprotect bits, not new syscall (thanks RTH, Erik Biederman)
> > 2) Fix wakeup race in kernel (thanks Martin Wirth, Paul Mackerras)
> > 3) Simplify locking to a single atomic (no more arch specifics!)
> > 4) Use wake-one by handcoding queues.
> > 5) Comments added.
> >
> > Thanks to all for feedback and review: I'd appreciate a comment from
> > those arch's which need to do something with the PROT_SEM bit.
> >
> > Once again, tested on 2.4.18 UP PPC, compiles on 2.5.6-pre1.
> >
> > Bad news is that we're up to 206 lines again.
> > Rusty.
>
> Good work.  I likee.

Ok, i reply to this because my 'd' Pine's key is heavily used in these days :-)
I took a look at the code yesterday night and i've a stupid question
Rusty. I do not know what is the code used in the upper part of the
iceberg ( userspace ) but are you doing a dec_and_test() on userspace
before entering the kernel ? Or, is the kernel code the slow path or you
enter it by default ?



- Davide



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 20:13   ` Davide Libenzi
@ 2002-03-04 20:20     ` Matthew Kirkwood
  2002-03-04 20:48     ` Hubertus Franke
  2002-03-05  4:48     ` Rusty Russell
  2 siblings, 0 replies; 24+ messages in thread
From: Matthew Kirkwood @ 2002-03-04 20:20 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Linux Kernel Mailing List

On Mon, 4 Mar 2002, Davide Libenzi wrote:

[ cc: list trimmage ]

> Or, is the kernel code the slow path or you enter it by default ?

The kernel part is the slow path -- we only enter it
when we have contention and want to sleep.

Matthew.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 20:13   ` Davide Libenzi
  2002-03-04 20:20     ` Matthew Kirkwood
@ 2002-03-04 20:48     ` Hubertus Franke
  2002-03-04 22:15       ` Davide Libenzi
  2002-03-05  4:48     ` Rusty Russell
  2 siblings, 1 reply; 24+ messages in thread
From: Hubertus Franke @ 2002-03-04 20:48 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Robert Love, Rusty Russell, Linux Kernel Mailing List

On Mon, Mar 04, 2002 at 12:13:50PM -0800, Davide Libenzi wrote:
> On 4 Mar 2002, Robert Love wrote:
> 
> > On Sun, 2002-03-03 at 22:55, Rusty Russell wrote:
> > > 1) Use mmap/mprotect bits, not new syscall (thanks RTH, Erik Biederman)
> > > 2) Fix wakeup race in kernel (thanks Martin Wirth, Paul Mackerras)
> > > 3) Simplify locking to a single atomic (no more arch specifics!)
> > > 4) Use wake-one by handcoding queues.
> > > 5) Comments added.
> > >
> > > Thanks to all for feedback and review: I'd appreciate a comment from
> > > those arch's which need to do something with the PROT_SEM bit.
> > >
> > > Once again, tested on 2.4.18 UP PPC, compiles on 2.5.6-pre1.
> > >
> > > Bad news is that we're up to 206 lines again.
> > > Rusty.
> >
> > Good work.  I likee.
> 
> Ok, i reply to this because my 'd' Pine's key is heavily used in these days :-)
> I took a look at the code yesterday night and i've a stupid question
> Rusty. I do not know what is the code used in the upper part of the
> iceberg ( userspace ) but are you doing a dec_and_test() on userspace
> before entering the kernel ? Or, is the kernel code the slow path or you
> enter it by default ?
> 
> 
> 
> - Davide
> 

Yes, that section is missing. It should be just as you said.

int futex_down(atomic_t *count)
{
	if (!atomic_dec_and_test(count)) 
		return sys_futex(count,FUTEX_DOWN);
	return 0;
}	

One needs to ensure that the atomic nature is compiled in (i.e. SMP) 
even if SMP is not enabled in the machine.

Rusty, could you post that code that you are using.


Also, the check on PROT_SEM is missing. I tried this before and glibc filtered
these flags out when set. But effectively, one still needs to check
whether semaphores are allowed during the sys_futex call.
This is expensive, because the protections are associated with the VMA.
find_vma() is not an option here, that's why I did the hashing and more persistent
objects rather than hashed wait queues.

I'd suggest to drop the requirements for this flag PROT_SEM.
I don't see tremendous value for it anyway, and it creates quite some 
headache if its to be enforced, which it is not in Rusty's code.
My code could do it, but as it comes with the complexity discussed earlier.

-- Hubertus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 20:48     ` Hubertus Franke
@ 2002-03-04 22:15       ` Davide Libenzi
  2002-03-05  1:50         ` Robert Love
  2002-03-07  1:58         ` Richard Henderson
  0 siblings, 2 replies; 24+ messages in thread
From: Davide Libenzi @ 2002-03-04 22:15 UTC (permalink / raw)
  To: Hubertus Franke
  Cc: Davide Libenzi, Robert Love, Rusty Russell, Linux Kernel Mailing List

On Mon, 4 Mar 2002, Hubertus Franke wrote:

> On Mon, Mar 04, 2002 at 12:13:50PM -0800, Davide Libenzi wrote:
> > On 4 Mar 2002, Robert Love wrote:
> >
> > > On Sun, 2002-03-03 at 22:55, Rusty Russell wrote:
> > > > 1) Use mmap/mprotect bits, not new syscall (thanks RTH, Erik Biederman)
> > > > 2) Fix wakeup race in kernel (thanks Martin Wirth, Paul Mackerras)
> > > > 3) Simplify locking to a single atomic (no more arch specifics!)
> > > > 4) Use wake-one by handcoding queues.
> > > > 5) Comments added.
> > > >
> > > > Thanks to all for feedback and review: I'd appreciate a comment from
> > > > those arch's which need to do something with the PROT_SEM bit.
> > > >
> > > > Once again, tested on 2.4.18 UP PPC, compiles on 2.5.6-pre1.
> > > >
> > > > Bad news is that we're up to 206 lines again.
> > > > Rusty.
> > >
> > > Good work.  I likee.
> >
> > Ok, i reply to this because my 'd' Pine's key is heavily used in these days :-)
> > I took a look at the code yesterday night and i've a stupid question
> > Rusty. I do not know what is the code used in the upper part of the
> > iceberg ( userspace ) but are you doing a dec_and_test() on userspace
> > before entering the kernel ? Or, is the kernel code the slow path or you
> > enter it by default ?
> >
> >
> >
> > - Davide
> >
>
> Yes, that section is missing. It should be just as you said.
>
> int futex_down(atomic_t *count)
> {
> 	if (!atomic_dec_and_test(count))
> 		return sys_futex(count,FUTEX_DOWN);
> 	return 0;
> }

That's great. What if the process holding the mutex dies while there're
sleeping tasks waiting for it ?




- Davide




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 19:49 ` Robert Love
  2002-03-04 20:13   ` Davide Libenzi
@ 2002-03-05  1:34   ` Rusty Russell
  1 sibling, 0 replies; 24+ messages in thread
From: Rusty Russell @ 2002-03-05  1:34 UTC (permalink / raw)
  To: Robert Love
  Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, Hubertus Franke

In message <1015271393.15277.112.camel@phantasy> you write:
> > +static spinlock_t futex_lock = SPIN_LOCK_UNLOCKED;
> 
> Could we make this per-waitqueue?

Yes, once someone gives benchmarks proving it's worth doing the whole
"multiple locks and cache aligned" thing.  Until then, it's premature
optimization.

> We should do:
> 
> 	#define FUTEX_UP	1
> 	#define FUTEX_DOWN	-1

Ack.  Definitely.

> here.  The preempt statements compile away if CONFIG_PREEMPT is not set,
> so you can just put them in, even on arches that don't do preemption
> yet.

Oops, that code shouldn't have been in patch, and the only reason that
preempt_disable() was commented out is that I tested the patch on 2.4.

> ... oh, and I would love an example of using it in userspace ;)

I'll throw it in for patch IV. 8)

> Nice work, Rusty.

I don't know if I can accept the kudos: it's now hovering at about 70%
my code, but only 20% my ideas.

Cheers,
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 22:15       ` Davide Libenzi
@ 2002-03-05  1:50         ` Robert Love
  2002-03-05  2:53           ` Davide Libenzi
                             ` (2 more replies)
  2002-03-07  1:58         ` Richard Henderson
  1 sibling, 3 replies; 24+ messages in thread
From: Robert Love @ 2002-03-05  1:50 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Hubertus Franke, Rusty Russell, Linux Kernel Mailing List

On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:

> That's great. What if the process holding the mutex dies while there're
> sleeping tasks waiting for it ?

I can't find an answer in the code (meaning the lock is lost...) and no
one has yet answered.  Davide, have you noticed anything?

I think this needs a proper solution..

	Robert Love


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  1:50         ` Robert Love
@ 2002-03-05  2:53           ` Davide Libenzi
  2002-03-05  3:45           ` Rusty Russell
  2002-03-05  4:30           ` Edgar Toernig
  2 siblings, 0 replies; 24+ messages in thread
From: Davide Libenzi @ 2002-03-05  2:53 UTC (permalink / raw)
  To: Robert Love; +Cc: Hubertus Franke, Rusty Russell, Linux Kernel Mailing List

On 4 Mar 2002, Robert Love wrote:

> On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
>
> > That's great. What if the process holding the mutex dies while there're
> > sleeping tasks waiting for it ?
>
> I can't find an answer in the code (meaning the lock is lost...) and no
> one has yet answered.  Davide, have you noticed anything?

not inside the code i saw yesterday ...


> I think this needs a proper solution..

i think so, even if it'll make things a bit more complex ...



- Davide



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  1:50         ` Robert Love
  2002-03-05  2:53           ` Davide Libenzi
@ 2002-03-05  3:45           ` Rusty Russell
  2002-03-05  3:55             ` Davide Libenzi
  2002-03-05 15:09             ` Hubertus Franke
  2002-03-05  4:30           ` Edgar Toernig
  2 siblings, 2 replies; 24+ messages in thread
From: Rusty Russell @ 2002-03-05  3:45 UTC (permalink / raw)
  To: Robert Love; +Cc: Hubertus Franke, Davide Libenzi, Linux Kernel Mailing List

In message <1015293007.882.87.camel@phantasy> you write:
> On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> 
> > That's great. What if the process holding the mutex dies while there're
> > sleeping tasks waiting for it ?
> 
> I can't find an answer in the code (meaning the lock is lost...) and no
> one has yet answered.  Davide, have you noticed anything?
> 
> I think this needs a proper solution..

No.  From my previous post:

Date: Mon, 4 Mar 2002 17:13:46 +1100
From: Rusty Russell <rusty@rustcorp.com.au>
To: Paul Jackson <pj@engr.sgi.com>
Cc: frankeh@watson.ibm.com, martin.wirth@dlr.de, rusty@rustycorp.com.au, linux-kernel@vger.kernel.org, lse-tech@lists.sourceforge.net
Subject: Re: [Lse-tech] Re: [PATCH] Lightweight userspace semaphores...
Message-Id: <20020304171346.04c9cac6.rusty@rustcorp.com.au>
In-Reply-To: <Pine.SGI.4.21.0203031410310.623951-100000@sam.engr.sgi.com>
In-Reply-To: <20020302090856.A1332@elinux01.watson.ibm.com>
	<Pine.SGI.4.21.0203031410310.623951-100000@sam.engr.sgi.com>

On Sun, 3 Mar 2002 14:13:45 -0800
Paul Jackson <pj@engr.sgi.com> wrote:

> On Sat, 2 Mar 2002, Hubertus Franke wrote:
> > But more conceptually, if you process dies while holding a lock ... 
> > your app is toast at that point.
> 
> Without application specific knowledge, certainly.
> 
> Is there someway one could support a hook, to enable
> an application to register a handler that could clean
> up, for those apps that found that worthwhile?

If you want this, use fcntl locks (see TDB).  If you don't tell the kernel
what you are doing (which is the reason these locks are fast), it cannot
clean up for you.

One could conceive of a solution where a process told the kernel
"here is where I keep my lock states: if I die, clean up".  Of course,
there's a race there too.

IMHO, given that the lock is protecting something which is left in an
unknown state, this is something which would require serious testing
to be proven worthwhile.

Hope that helps,
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  3:45           ` Rusty Russell
@ 2002-03-05  3:55             ` Davide Libenzi
  2002-03-05  6:11               ` Rusty Russell
  2002-03-05 15:09             ` Hubertus Franke
  1 sibling, 1 reply; 24+ messages in thread
From: Davide Libenzi @ 2002-03-05  3:55 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Robert Love, Hubertus Franke, Linux Kernel Mailing List

On Tue, 5 Mar 2002, Rusty Russell wrote:

> In message <1015293007.882.87.camel@phantasy> you write:
> > On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> >
> > > That's great. What if the process holding the mutex dies while there're
> > > sleeping tasks waiting for it ?
> >
> > I can't find an answer in the code (meaning the lock is lost...) and no
> > one has yet answered.  Davide, have you noticed anything?
> >
> > I think this needs a proper solution..
>
> No.  From my previous post:

Uhmm, you better comment this one.

	"please do not die while holding it, please ..."

:-)



- Davide



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  1:50         ` Robert Love
  2002-03-05  2:53           ` Davide Libenzi
  2002-03-05  3:45           ` Rusty Russell
@ 2002-03-05  4:30           ` Edgar Toernig
  2 siblings, 0 replies; 24+ messages in thread
From: Edgar Toernig @ 2002-03-05  4:30 UTC (permalink / raw)
  To: Robert Love
  Cc: Davide Libenzi, Hubertus Franke, Rusty Russell,
	Linux Kernel Mailing List

Robert Love wrote:
> 
> On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> 
> > That's great. What if the process holding the mutex dies while there're
> > sleeping tasks waiting for it ?
> 
> I can't find an answer in the code (meaning the lock is lost...) and no
> one has yet answered.  Davide, have you noticed anything?
> 
> I think this needs a proper solution..

You can't do very much.  The futex holder has probably damaged some
data.  The only thing you could do is to kill all current and future
waiters too.  But the "future waiters" is difficult.  The process
may hold other locks the kernel does not know anything about.

The only thing one could do is to kill all processes that share a
MAP_SEM page with a dying process.

Ciao, ET.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 20:13   ` Davide Libenzi
  2002-03-04 20:20     ` Matthew Kirkwood
  2002-03-04 20:48     ` Hubertus Franke
@ 2002-03-05  4:48     ` Rusty Russell
  2002-03-05 15:15       ` Hubertus Franke
  2 siblings, 1 reply; 24+ messages in thread
From: Rusty Russell @ 2002-03-05  4:48 UTC (permalink / raw)
  To: Hubertus Franke; +Cc: davidel, rml, linux-kernel

On Mon, 4 Mar 2002 15:48:48 -0500
Hubertus Franke <frankeh@watson.ibm.com> wrote:

> Also, the check on PROT_SEM is missing. I tried this before and glibc filtered
> these flags out when set. But effectively, one still needs to check
> whether semaphores are allowed during the sys_futex call.

Neither arch I care about (ppc, x86) needs to do anything with PROT_SEM, so it's
OK.  glibc will have to be fixed on any architectures which require help here,
and a hook will be needed somewhere in the kernel for them.

I didn't implement it because I don't *know* which archs will need something,
and what they will need.  Hence my request for arch maintainers to step
forward (Linus said they exist, and I believe him).

Hope that clarifies this particular wart...
Rusty.
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  3:55             ` Davide Libenzi
@ 2002-03-05  6:11               ` Rusty Russell
  2002-03-05 17:23                 ` Davide Libenzi
  0 siblings, 1 reply; 24+ messages in thread
From: Rusty Russell @ 2002-03-05  6:11 UTC (permalink / raw)
  To: Davide Libenzi; +Cc: Robert Love, Hubertus Franke, Linux Kernel Mailing List

In message <Pine.LNX.4.44.0203041953480.1561-100000@blue1.dev.mcafeelabs.com> y
ou write:
> On Tue, 5 Mar 2002, Rusty Russell wrote:
> 
> > In message <1015293007.882.87.camel@phantasy> you write:
> > > On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> > >
> > > > That's great. What if the process holding the mutex dies while there're
> > > > sleeping tasks waiting for it ?

I don't see that.  Your data, your program, your funeral, yes?

Want to autoclean the lock?  There are many ways to do this in
userspace.  For most cases, the problem is "and then what"?

Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  3:45           ` Rusty Russell
  2002-03-05  3:55             ` Davide Libenzi
@ 2002-03-05 15:09             ` Hubertus Franke
  1 sibling, 0 replies; 24+ messages in thread
From: Hubertus Franke @ 2002-03-05 15:09 UTC (permalink / raw)
  To: Rusty Russell, Robert Love; +Cc: Davide Libenzi, Linux Kernel Mailing List

On Monday 04 March 2002 10:45 pm, Rusty Russell wrote:
> In message <1015293007.882.87.camel@phantasy> you write:
> > On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> > > That's great. What if the process holding the mutex dies while there're
> > > sleeping tasks waiting for it ?
> >
> > I can't find an answer in the code (meaning the lock is lost...) and no
> > one has yet answered.  Davide, have you noticed anything?
> >
> > I think this needs a proper solution..
>
>
> If you want this, use fcntl locks (see TDB).  If you don't tell the kernel
> what you are doing (which is the reason these locks are fast), it cannot
> clean up for you.
>
> One could conceive of a solution where a process told the kernel
> "here is where I keep my lock states: if I die, clean up".  Of course,
> there's a race there too.
>

Yes, the problem goes even deeper. A simple hook is not enough.
One must know who is actually holding the lock, so that the cleanup
routines do the right thing.
E.g. store the pid with the lock. As Rusty stated this has still race 
conditions.
Anyway, this should be orthogonal to the low level services
provided. 
Another issue is that of rwlocks. Here its perfectly OK to die if
you hold the lock in read mode and clean up before going away.

Again, this should not be part of the base service.

> IMHO, given that the lock is protecting something which is left in an
> unknown state, this is something which would require serious testing
> to be proven worthwhile.
>


> Hope that helps,
> Rusty.

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  4:48     ` Rusty Russell
@ 2002-03-05 15:15       ` Hubertus Franke
  2002-03-06  1:31         ` Rusty Russell
  0 siblings, 1 reply; 24+ messages in thread
From: Hubertus Franke @ 2002-03-05 15:15 UTC (permalink / raw)
  To: Rusty Russell; +Cc: davidel, rml, linux-kernel

On Monday 04 March 2002 11:48 pm, Rusty Russell wrote:
> On Mon, 4 Mar 2002 15:48:48 -0500
>
> Hubertus Franke <frankeh@watson.ibm.com> wrote:
> > Also, the check on PROT_SEM is missing. I tried this before and glibc
> > filtered these flags out when set. But effectively, one still needs to
> > check whether semaphores are allowed during the sys_futex call.
>
> Neither arch I care about (ppc, x86) needs to do anything with PROT_SEM, so
> it's OK.  glibc will have to be fixed on any architectures which require
> help here, and a hook will be needed somewhere in the kernel for them.
>
> I didn't implement it because I don't *know* which archs will need
> something, and what they will need.  Hence my request for arch maintainers
> to step forward (Linus said they exist, and I believe him).
>
> Hope that clarifies this particular wart...
> Rusty.

Clarifies only partially.

I agree to put it there if its not used as a means to define whether
user locks are permitted or not. If that is the intention, then the current 
futex will need to check every access through find_vma(), which we
both know nobody wants to do.

So it can only be used for architectural hints, agreed ?

-- 
-- Hubertus Franke  (frankeh@watson.ibm.com)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05  6:11               ` Rusty Russell
@ 2002-03-05 17:23                 ` Davide Libenzi
  0 siblings, 0 replies; 24+ messages in thread
From: Davide Libenzi @ 2002-03-05 17:23 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Robert Love, Hubertus Franke, Linux Kernel Mailing List

On Tue, 5 Mar 2002, Rusty Russell wrote:

> In message <Pine.LNX.4.44.0203041953480.1561-100000@blue1.dev.mcafeelabs.com> y
> ou write:
> > On Tue, 5 Mar 2002, Rusty Russell wrote:
> >
> > > In message <1015293007.882.87.camel@phantasy> you write:
> > > > On Mon, 2002-03-04 at 17:15, Davide Libenzi wrote:
> > > >
> > > > > That's great. What if the process holding the mutex dies while there're
> > > > > sleeping tasks waiting for it ?
>
> I don't see that.  Your data, your program, your funeral, yes?
>
> Want to autoclean the lock?  There are many ways to do this in
> userspace.  For most cases, the problem is "and then what"?

I tried hard to remember cases where fetures of IPC sems have been handy
to me but i failed.

	*as designed*



- Davide



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-05 15:15       ` Hubertus Franke
@ 2002-03-06  1:31         ` Rusty Russell
  0 siblings, 0 replies; 24+ messages in thread
From: Rusty Russell @ 2002-03-06  1:31 UTC (permalink / raw)
  To: frankeh; +Cc: davidel, rml, linux-kernel

In message <20020305151439.457E03FE06@smtp.linux.ibm.com> you write:
> I agree to put it there if its not used as a means to define whether
> user locks are permitted or not. If that is the intention, then the current 
> futex will need to check every access through find_vma(), which we
> both know nobody wants to do.
> 
> So it can only be used for architectural hints, agreed ?

Yes.  It *might* work if you don't PROT_SEM the page the semaphore is
on, but it's still a bug waiting to happen.  OTOH, it'd be nice if
PROT_SEM returns EINVAL was a reliably indicator of no futex support.
This way you actually need to call the futex syscall once to see if it
works.

Cheers!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04  3:55 [PATCH] Fast Userspace Mutexes III Rusty Russell
  2002-03-04 19:49 ` Robert Love
@ 2002-03-07  1:52 ` Richard Henderson
  2002-03-07  3:39 ` Rusty Russell
  2 siblings, 0 replies; 24+ messages in thread
From: Richard Henderson @ 2002-03-07  1:52 UTC (permalink / raw)
  To: Rusty Russell
  Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, Hubertus Franke

On Mon, Mar 04, 2002 at 02:55:36PM +1100, Rusty Russell wrote:
> +	/* If we take the semaphore from 1 to 0, it's ours. */
> +	while (!atomic_dec_and_test(count)) {
> +		if (signal_pending(current)) {
> +			retval = -EINTR;
> +			break;

This is not safe from wraparound.  Let one thread hold the
lock forever; let other threads keep trying to take the lock
while periodically getting SIGALRM.  Eventually one of the
spinning threads will incorrectly acquire the mutex.

On sparc32, atomic_t is only 24 bits wide, so it wouldn't
take very long at all to wrap.  It's slightly longer for 
the other platforms, but it can still happen.  And note
that even 64-bit platforms may be using a 32-bit atomic_t.

You really do need that cmpxchg loop.


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04 22:15       ` Davide Libenzi
  2002-03-05  1:50         ` Robert Love
@ 2002-03-07  1:58         ` Richard Henderson
  2002-03-07  2:10           ` Davide Libenzi
  1 sibling, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2002-03-07  1:58 UTC (permalink / raw)
  To: Davide Libenzi
  Cc: Hubertus Franke, Robert Love, Rusty Russell, Linux Kernel Mailing List

On Mon, Mar 04, 2002 at 02:15:58PM -0800, Davide Libenzi wrote:
> That's great. What if the process holding the mutex dies while there're
> sleeping tasks waiting for it ?

The lock is lost.  The same thing would happen with locks completely
implemented in userspace.

I don't see that the kernel should do anything about this.  If a 
thread is killed with predudice (i.e. without pthread_cancel) then
there are all sorts of cleanups that won't happen.  Having the
kernel automatically unlock the locks doesn't help much, since
the data structures are quite likely in an inconsistent state.


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-07  1:58         ` Richard Henderson
@ 2002-03-07  2:10           ` Davide Libenzi
  0 siblings, 0 replies; 24+ messages in thread
From: Davide Libenzi @ 2002-03-07  2:10 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Hubertus Franke, Robert Love, Rusty Russell, Linux Kernel Mailing List

On Wed, 6 Mar 2002, Richard Henderson wrote:

> On Mon, Mar 04, 2002 at 02:15:58PM -0800, Davide Libenzi wrote:
> > That's great. What if the process holding the mutex dies while there're
> > sleeping tasks waiting for it ?
>
> The lock is lost.  The same thing would happen with locks completely
> implemented in userspace.
>
> I don't see that the kernel should do anything about this.  If a
> thread is killed with predudice (i.e. without pthread_cancel) then
> there are all sorts of cleanups that won't happen.  Having the
> kernel automatically unlock the locks doesn't help much, since
> the data structures are quite likely in an inconsistent state.

agreed, whatever solution does not solve it completely and makes things is
lot more complex. it's not an issue ...



- Davide



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-04  3:55 [PATCH] Fast Userspace Mutexes III Rusty Russell
  2002-03-04 19:49 ` Robert Love
  2002-03-07  1:52 ` Richard Henderson
@ 2002-03-07  3:39 ` Rusty Russell
  2002-03-07  8:48   ` Richard Henderson
  2 siblings, 1 reply; 24+ messages in thread
From: Rusty Russell @ 2002-03-07  3:39 UTC (permalink / raw)
  To: Richard Henderson
  Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, frankeh

On Wed, 6 Mar 2002 17:52:03 -0800
Richard Henderson <rth@twiddle.net> wrote:

> On Mon, Mar 04, 2002 at 02:55:36PM +1100, Rusty Russell wrote:
> > +	/* If we take the semaphore from 1 to 0, it's ours. */
> > +	while (!atomic_dec_and_test(count)) {
> > +		if (signal_pending(current)) {
> > +			retval = -EINTR;
> > +			break;
> 
> This is not safe from wraparound.  Let one thread hold the
> lock forever; let other threads keep trying to take the lock
> while periodically getting SIGALRM.  Eventually one of the
> spinning threads will incorrectly acquire the mutex.

Yes, this was noted.  And yes, it's about time we fixed sparc32
or threw it out of the tree.  But since the real problem here is
"lock held forever", so I don't care.

> You really do need that cmpxchg loop.

Well, not decrementing if count < 0 already also works (as seen in
later patches), and I'm not going to break those SMP 386s if I don't
have to.

Cheers!
Rusty.
PS. Will Alpha have to do any special magic with the mmap PROT_SEM flag?
-- 
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-07  3:39 ` Rusty Russell
@ 2002-03-07  8:48   ` Richard Henderson
  2002-03-07  9:17     ` Rusty Russell
  0 siblings, 1 reply; 24+ messages in thread
From: Richard Henderson @ 2002-03-07  8:48 UTC (permalink / raw)
  To: Rusty Russell; +Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, frankeh

On Thu, Mar 07, 2002 at 02:39:47PM +1100, Rusty Russell wrote:
> But since the real problem here is "lock held forever", so I don't care.

No, "lock held forever" merely makes the example trivial.  
"Lock held for a while" is the real problem.

> > You really do need that cmpxchg loop.
> 
> Well, not decrementing if count < 0 already also works

How, exactly, are you planning on doing that atomically?
Clue: 386 SMP requires an extra spinlock.

> PS. Will Alpha have to do any special magic with the mmap PROT_SEM flag?

No.


r~

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] Fast Userspace Mutexes III.
  2002-03-07  8:48   ` Richard Henderson
@ 2002-03-07  9:17     ` Rusty Russell
  0 siblings, 0 replies; 24+ messages in thread
From: Rusty Russell @ 2002-03-07  9:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: torvalds, matthew, bcrl, david, wli, linux-kernel, frankeh

In message <20020307004821.A26624@twiddle.net> you write:
> On Thu, Mar 07, 2002 at 02:39:47PM +1100, Rusty Russell wrote:
> > But since the real problem here is "lock held forever", so I don't care.
> 
> No, "lock held forever" merely makes the example trivial.  
> "Lock held for a while" is the real problem.
> 
> > > You really do need that cmpxchg loop.
> > 
> > Well, not decrementing if count < 0 already also works
> 
> How, exactly, are you planning on doing that atomically?
> Clue: 386 SMP requires an extra spinlock.

Fortuntely, it doesn't need to be atomic.  This got me too, but Paul
Mackerras convinced me.

We don't care if someone decrements it between us checking < 0 and
doing the atomic_dec_and_test.  So the only thing we worry about is
the "up" case.  A userspace "up" will go into the kernel (since count
< 0), and the kernel "up" will wake us since we are on the queue
before we do this test.

> > PS. Will Alpha have to do any special magic with the mmap PROT_SEM flag?
> 
> No.

Damn, I was hoping to find out which arch actually is going to use
this.  I hate untested code.

Thanks!
Rusty.
--
  Anyone who quotes me in their sig is an idiot. -- Rusty Russell.

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2002-03-07  9:14 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-03-04  3:55 [PATCH] Fast Userspace Mutexes III Rusty Russell
2002-03-04 19:49 ` Robert Love
2002-03-04 20:13   ` Davide Libenzi
2002-03-04 20:20     ` Matthew Kirkwood
2002-03-04 20:48     ` Hubertus Franke
2002-03-04 22:15       ` Davide Libenzi
2002-03-05  1:50         ` Robert Love
2002-03-05  2:53           ` Davide Libenzi
2002-03-05  3:45           ` Rusty Russell
2002-03-05  3:55             ` Davide Libenzi
2002-03-05  6:11               ` Rusty Russell
2002-03-05 17:23                 ` Davide Libenzi
2002-03-05 15:09             ` Hubertus Franke
2002-03-05  4:30           ` Edgar Toernig
2002-03-07  1:58         ` Richard Henderson
2002-03-07  2:10           ` Davide Libenzi
2002-03-05  4:48     ` Rusty Russell
2002-03-05 15:15       ` Hubertus Franke
2002-03-06  1:31         ` Rusty Russell
2002-03-05  1:34   ` Rusty Russell
2002-03-07  1:52 ` Richard Henderson
2002-03-07  3:39 ` Rusty Russell
2002-03-07  8:48   ` Richard Henderson
2002-03-07  9:17     ` Rusty Russell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).