linux-kernel.vger.kernel.org archive mirror
* Deadlock on the mm->mmap_sem
@ 2001-09-17 20:57 Ulrich Weigand
  0 siblings, 0 replies; 49+ messages in thread
From: Ulrich Weigand @ 2001-09-17 20:57 UTC (permalink / raw)
  To: linux-kernel

Hello,

we're experiencing deadlocks on the mm->mmap_sem which appear to be
caused by proc_pid_read_maps (on S/390, but I believe this is arch-
independent).

What happens is that proc_pid_read_maps grabs the mmap_sem as a reader,
and *while it holds the lock*, does a copy_to_user.  This can of course
page-fault, and the handler will also grab the mmap_sem (if it is the
same task).

Now, normally this just works because both are readers.  However, on SMP
it might just so happen that another thread sharing the mm wants to grab
the lock as a writer after proc_pid_read_maps grabbed it as reader, but
before the page fault handler grabs it.

In that situation, that second thread blocks (because there's already a
writer), and then the first thread blocks in the page fault handler
(because a writer is pending).  Instant deadlock ...

Btw, S/390 uses the generic spinlock-based rwsem code, in case that is
relevant.

Any ideas how to fix this?  Should proc_pid_read_maps just drop the lock
before copy_to_user?


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: Ulrich.Weigand@de.ibm.com


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-20 18:24   ` Andrea Arcangeli
  2001-09-20 21:43     ` Manfred Spraul
@ 2001-09-22 21:06     ` Manfred Spraul
  1 sibling, 0 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-22 21:06 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David Howells, linux-kernel, torvalds

Andrea Arcangeli wrote:
> 
> > I'll write a patch that moves the locking into the coredump handlers,
> > then we can compare that with Andrea's proposal.
> 
> Ok.
>
I've changed my mind:

Modifying the mmap_sem is a better solution for 2.4 than integrating the
locking into elf_core_dump.

My patch copies the vm areas into a list (under down_write()) and calls
up_write(), but I found 2 races:
* The kernel must not touch VM_IO memory, but once the lock is dropped
another thread could call munmap() and then mmap() a VM_IO area in the
same range.
* If another thread calls munmap(), my coredump handler would abort
dumping due to the resulting page fault.

The proper solution would be to use a page table walker in elf_core_dump
(similar to access_process_vm()), with everything under down_write().

But that would be a large rewrite. I'm aware of at least 4 users who
want such a page table walker: map_user_kiobuf, access_process_vm,
singlecopy pipe (not merged), elf_core_dump.
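
Such a shared walker might look roughly like the following kernel-style
pseudocode (the names walk_page_range and pte_fn_t are invented here for
illustration; 2.4 had no such helper):

```c
/* Hypothetical sketch only -- not real 2.4 code.  A shared walker
 * would let elf_core_dump, access_process_vm etc. visit each present
 * pte under page_table_lock without duplicating the pgd/pmd loops. */
typedef int (*pte_fn_t)(pte_t pte, unsigned long addr, void *data);

static int walk_page_range(struct mm_struct *mm, unsigned long start,
			   unsigned long end, pte_fn_t fn, void *data)
{
	unsigned long addr;
	int ret = 0;

	spin_lock(&mm->page_table_lock);
	for (addr = start; addr < end && !ret; addr += PAGE_SIZE) {
		pgd_t *pgd = pgd_offset(mm, addr);
		pmd_t *pmd;

		if (pgd_none(*pgd))
			continue;
		pmd = pmd_offset(pgd, addr);
		if (pmd_none(*pmd))
			continue;
		ret = fn(*pte_offset(pmd, addr), addr, data);
	}
	spin_unlock(&mm->page_table_lock);
	return ret;
}
```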

--
	Manfred


* Re: Deadlock on the mm->mmap_sem
  2001-09-20 18:24   ` Andrea Arcangeli
@ 2001-09-20 21:43     ` Manfred Spraul
  2001-09-22 21:06     ` Manfred Spraul
  1 sibling, 0 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-20 21:43 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: David Howells, linux-kernel, torvalds

[-- Attachment #1: Type: text/plain, Size: 1645 bytes --]

Andrea Arcangeli wrote:
> > elf_core_dump should call down_write to prevent concurrent expand_stack
> 
> expand_stack doesn't need the write sem, see the locking comments in the
> 00_silent-stack-overflow patch in -aa.
>
You misunderstood me, it's the other way around:
elf_core_dump walks the vma list twice and assumes that the segment
sizes don't change --> elf_core_dump needs a write lock to ensure that.

> > calls, and acquire the pagetable_lock around some lines (right now it
> > walks the page tables without locking). I'll check the other coredump
> 
> Also expand_stack needs the page_table_lock, that's ok.
>
Ditto: elf_core_dump walks the page tables without locking.

> > I'll write a patch that moves the locking into the coredump handlers,
> > then we can compare that with Andrea's proposal.
> 
> Ok.
> 
Attached is a beta version:

* remove {down,up}_read from fs/exec.c::do_coredump
* add down_read into each arch/*/process.c::dump_thread. Most of them
are probably safe without the locking, but it's better for consistency.
Most of them access mm->brk and perform some calculations.
* explicit memset(,0,) of all structures that are dumped in
fs/binfmt_elf.c: depending on the structure alignment, we could
otherwise leak kernel stack.
* Do not walk the vma list twice, copy it into a temporary kernel
buffer. down_write around it to prevent concurrent expand_stack.
* spin_lock(&current->mm->page_table_lock) around the code that walks
the page tables.
* all other binfmts have trivial coredump implementations that only call
dump_thread, or no coredump at all.

I'll do more extensive testing tomorrow.

--
	Manfred

[-- Attachment #2: patch-coredump --]
[-- Type: text/plain, Size: 15801 bytes --]

// $Header$
// Kernel Version:
//  VERSION = 2
//  PATCHLEVEL = 4
//  SUBLEVEL = 10
//  EXTRAVERSION =-pre12
diff -ur 2.4/fs/binfmt_elf.c build-2.4/fs/binfmt_elf.c
--- 2.4/fs/binfmt_elf.c	Wed Sep 19 22:39:35 2001
+++ build-2.4/fs/binfmt_elf.c	Thu Sep 20 22:48:52 2001
@@ -31,6 +31,7 @@
 #include <linux/init.h>
 #include <linux/highuid.h>
 #include <linux/smp_lock.h>
+#include <linux/vmalloc.h>
 
 #include <asm/uaccess.h>
 #include <asm/param.h>
@@ -895,22 +896,22 @@
  *
  * I think we should skip something. But I am not sure how. H.J.
  */
-static inline int maydump(struct vm_area_struct *vma)
+static inline int maydump(unsigned long flags)
 {
 	/*
 	 * If we may not read the contents, don't allow us to dump
 	 * them either. "dump_write()" can't handle it anyway.
 	 */
-	if (!(vma->vm_flags & VM_READ))
+	if (!(flags & VM_READ))
 		return 0;
 
 	/* Do not dump I/O mapped devices! -DaveM */
-	if (vma->vm_flags & VM_IO)
+	if (flags & VM_IO)
 		return 0;
 #if 1
-	if (vma->vm_flags & (VM_WRITE|VM_GROWSUP|VM_GROWSDOWN))
+	if (flags & (VM_WRITE|VM_GROWSUP|VM_GROWSDOWN))
 		return 1;
-	if (vma->vm_flags & (VM_READ|VM_EXEC|VM_EXECUTABLE|VM_SHARED))
+	if (flags & (VM_READ|VM_EXEC|VM_EXECUTABLE|VM_SHARED))
 		return 0;
 #endif
 	return 1;
@@ -967,6 +968,7 @@
 {
 	struct elf_note en;
 
+	memset(&en,0,sizeof(en));
 	en.n_namesz = strlen(men->name);
 	en.n_descsz = men->datasz;
 	en.n_type = men->type;
@@ -989,6 +991,46 @@
 #define DUMP_SEEK(off)	\
 	if (!dump_seek(file, (off))) \
 		goto end_coredump;
+
+struct elf_dumpinfo {
+	unsigned long start;
+	size_t len;
+	unsigned long flags;
+};
+
+static struct elf_dumpinfo * alloc_dumpinfo(int segs)
+{
+	int len = sizeof(struct elf_dumpinfo)*segs;
+	if (len < PAGE_SIZE)
+		return kmalloc(len, GFP_KERNEL);
+	return vmalloc(len);
+}
+
+void free_dumpinfo(struct elf_dumpinfo * ptr, int segs)
+{
+	int len = sizeof(struct elf_dumpinfo)*segs;
+	if (len < PAGE_SIZE)
+		return kfree(ptr);
+	return vfree(ptr);
+}
+
+static struct elf_dumpinfo *get_dumpinfo(void)
+{
+	int i;
+	struct vm_area_struct *vma;
+	struct elf_dumpinfo *di = alloc_dumpinfo(current->mm->map_count);
+	if (!di)
+		return NULL;
+
+	vma = current->mm->mmap;
+	for(i = 0, vma = current->mm->mmap; vma != NULL; i++,vma = vma->vm_next) {
+		di[i].start = vma->vm_start;
+		di[i].len =  vma->vm_end - vma->vm_start;
+		di[i].flags = vma->vm_flags;
+	}
+	if (i != current->mm->map_count) BUG();
+	return di;
+}
 /*
  * Actual dumper
  *
@@ -1003,7 +1045,6 @@
 	int segs;
 	size_t size = 0;
 	int i;
-	struct vm_area_struct *vma;
 	struct elfhdr elf;
 	off_t offset = 0, dataoff;
 	unsigned long limit = current->rlim[RLIMIT_CORE].rlim_cur;
@@ -1012,19 +1053,26 @@
 	struct elf_prstatus prstatus;	/* NT_PRSTATUS */
 	elf_fpregset_t fpu;		/* NT_PRFPREG */
 	struct elf_prpsinfo psinfo;	/* NT_PRPSINFO */
+	struct elf_dumpinfo *di;
 
+	/* stop all vm operations, including expand_stack */
+	down_write(&current->mm->mmap_sem);
 	segs = current->mm->map_count;
+	di = get_dumpinfo();
+	up_write(&current->mm->mmap_sem);
+	if (!di)
+		return 0;
 
 #ifdef DEBUG
 	printk("elf_core_dump: %d segs %lu limit\n", segs, limit);
 #endif
 
 	/* Set up header */
+	memset(&elf, 0, sizeof(elf));
 	memcpy(elf.e_ident, ELFMAG, SELFMAG);
 	elf.e_ident[EI_CLASS] = ELF_CLASS;
 	elf.e_ident[EI_DATA] = ELF_DATA;
 	elf.e_ident[EI_VERSION] = EV_CURRENT;
-	memset(elf.e_ident+EI_PAD, 0, EI_NIDENT-EI_PAD);
 
 	elf.e_type = ET_CORE;
 	elf.e_machine = ELF_ARCH;
@@ -1156,42 +1204,34 @@
 		for(i = 0; i < numnote; i++)
 			sz += notesize(&notes[i]);
 
+		memset(&phdr, 0, sizeof(phdr));
 		phdr.p_type = PT_NOTE;
 		phdr.p_offset = offset;
-		phdr.p_vaddr = 0;
-		phdr.p_paddr = 0;
 		phdr.p_filesz = sz;
-		phdr.p_memsz = 0;
-		phdr.p_flags = 0;
-		phdr.p_align = 0;
 
 		offset += phdr.p_filesz;
 		DUMP_WRITE(&phdr, sizeof(phdr));
-	}
 
-	/* Page-align dumped data */
-	dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
+		/* Page-align dumped data */
+		dataoff = offset = roundup(offset, ELF_EXEC_PAGESIZE);
 
-	/* Write program headers for segments dump */
-	for(vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
-		struct elf_phdr phdr;
-		size_t sz;
+		/* Write program headers for segments dump */
+		for(i = 0;i < segs; i++) {
 
-		sz = vma->vm_end - vma->vm_start;
+			phdr.p_type = PT_LOAD;
+			phdr.p_offset = offset;
+			phdr.p_vaddr = di[i].start;
+			phdr.p_paddr = 0;
+			phdr.p_filesz = maydump(di[i].flags) ? di[i].len : 0;
+			phdr.p_memsz = di[i].len;
+			offset += phdr.p_filesz;
+			phdr.p_flags = di[i].flags & VM_READ ? PF_R : 0;
+			if (di[i].flags & VM_WRITE) phdr.p_flags |= PF_W;
+			if (di[i].flags & VM_EXEC) phdr.p_flags |= PF_X;
+			phdr.p_align = ELF_EXEC_PAGESIZE;
 
-		phdr.p_type = PT_LOAD;
-		phdr.p_offset = offset;
-		phdr.p_vaddr = vma->vm_start;
-		phdr.p_paddr = 0;
-		phdr.p_filesz = maydump(vma) ? sz : 0;
-		phdr.p_memsz = sz;
-		offset += phdr.p_filesz;
-		phdr.p_flags = vma->vm_flags & VM_READ ? PF_R : 0;
-		if (vma->vm_flags & VM_WRITE) phdr.p_flags |= PF_W;
-		if (vma->vm_flags & VM_EXEC) phdr.p_flags |= PF_X;
-		phdr.p_align = ELF_EXEC_PAGESIZE;
-
-		DUMP_WRITE(&phdr, sizeof(phdr));
+			DUMP_WRITE(&phdr, sizeof(phdr));
+		}
 	}
 
 	for(i = 0; i < numnote; i++)
@@ -1202,29 +1242,29 @@
 
 	DUMP_SEEK(dataoff);
 
-	for(vma = current->mm->mmap; vma != NULL; vma = vma->vm_next) {
+	for(i = 0;i < segs; i++) {
 		unsigned long addr;
 
-		if (!maydump(vma))
+		if (!maydump(di[i].flags))
 			continue;
 #ifdef DEBUG
 		printk("elf_core_dump: writing %08lx %lx\n", addr, len);
 #endif
-		for (addr = vma->vm_start;
-		     addr < vma->vm_end;
-		     addr += PAGE_SIZE) {
+		for (addr = di[i].start; addr < di[i].start+di[i].len; addr += PAGE_SIZE) {
 			pgd_t *pgd;
 			pmd_t *pmd;
-			pte_t *pte;
-
-			pgd = pgd_offset(vma->vm_mm, addr);
+			pte_t pte;
+			
+			spin_lock(&current->mm->page_table_lock);
+			pgd = pgd_offset(current->mm, addr);
 			if (pgd_none(*pgd))
 				goto nextpage_coredump;
 			pmd = pmd_offset(pgd, addr);
 			if (pmd_none(*pmd))
 				goto nextpage_coredump;
-			pte = pte_offset(pmd, addr);
-			if (pte_none(*pte)) {
+			pte = *pte_offset(pmd, addr);
+			spin_unlock(&current->mm->page_table_lock);
+			if (pte_none(pte)) {
 nextpage_coredump:
 				DUMP_SEEK (file->f_pos + PAGE_SIZE);
 			} else {
@@ -1241,6 +1281,7 @@
 
  end_coredump:
 	set_fs(fs);
+	free_dumpinfo(di, segs);
 	return has_dumped;
 }
 #endif		/* USE_ELF_CORE_DUMP */
diff -ur 2.4/fs/exec.c build-2.4/fs/exec.c
--- 2.4/fs/exec.c	Wed Sep 19 22:39:35 2001
+++ build-2.4/fs/exec.c	Thu Sep 20 19:41:24 2001
@@ -969,9 +969,7 @@
 	if (do_truncate(file->f_dentry, 0) != 0)
 		goto close_fail;
 
-	down_read(&current->mm->mmap_sem);
 	retval = binfmt->core_dump(signr, regs, file);
-	up_read(&current->mm->mmap_sem);
 
 close_fail:
 	filp_close(file, NULL);
diff -ur 2.4/arch/alpha/kernel/process.c build-2.4/arch/alpha/kernel/process.c
--- 2.4/arch/alpha/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/alpha/kernel/process.c	Thu Sep 20 21:26:54 2001
@@ -344,6 +344,7 @@
 	struct switch_stack * sw = ((struct switch_stack *) pt) - 1;
 
 	dump->magic = CMAGIC;
+	down_read(&current->mm->mmap_sem);
 	dump->start_code  = current->mm->start_code;
 	dump->start_data  = current->mm->start_data;
 	dump->start_stack = rdusp() & ~(PAGE_SIZE - 1);
@@ -353,6 +354,7 @@
 			 >> PAGE_SHIFT);
 	dump->u_ssize = (current->mm->start_stack - dump->start_stack
 			 + PAGE_SIZE-1) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 
 	/*
 	 * We store the registers in an order/format that is
diff -ur 2.4/arch/arm/kernel/process.c build-2.4/arch/arm/kernel/process.c
--- 2.4/arch/arm/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/arm/kernel/process.c	Thu Sep 20 21:27:45 2001
@@ -339,11 +339,13 @@
 	struct task_struct *tsk = current;
 
 	dump->magic = CMAGIC;
+	down_read(&tsk->mm->mmap_sem);
 	dump->start_code = tsk->mm->start_code;
 	dump->start_stack = regs->ARM_sp & ~(PAGE_SIZE - 1);
 
 	dump->u_tsize = (tsk->mm->end_code - tsk->mm->start_code) >> PAGE_SHIFT;
 	dump->u_dsize = (tsk->mm->brk - tsk->mm->start_data + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	up_read(&tsk->mm->mmap_sem);
 	dump->u_ssize = 0;
 
 	dump->u_debugreg[0] = tsk->thread.debug.bp[0].address;
diff -ur 2.4/arch/cris/kernel/process.c build-2.4/arch/cris/kernel/process.c
--- 2.4/arch/cris/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/cris/kernel/process.c	Thu Sep 20 21:28:38 2001
@@ -227,8 +227,10 @@
 	dump->magic = CMAGIC;
 	dump->start_code = 0;
 	dump->start_stack = regs->esp & ~(PAGE_SIZE - 1);
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT;
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_ssize = 0;
 	for (i = 0; i < 8; i++)
diff -ur 2.4/arch/i386/kernel/process.c build-2.4/arch/i386/kernel/process.c
--- 2.4/arch/i386/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/i386/kernel/process.c	Thu Sep 20 19:47:58 2001
@@ -611,8 +611,10 @@
 	dump->magic = CMAGIC;
 	dump->start_code = 0;
 	dump->start_stack = regs->esp & ~(PAGE_SIZE - 1);
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT;
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_ssize = 0;
 	for (i = 0; i < 8; i++)
diff -ur 2.4/arch/m68k/kernel/process.c build-2.4/arch/m68k/kernel/process.c
--- 2.4/arch/m68k/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/m68k/kernel/process.c	Thu Sep 20 21:29:11 2001
@@ -291,9 +291,11 @@
 	dump->magic = CMAGIC;
 	dump->start_code = 0;
 	dump->start_stack = rdusp() & ~(PAGE_SIZE - 1);
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT;
 	dump->u_dsize = ((unsigned long) (current->mm->brk +
 					  (PAGE_SIZE-1))) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_ssize = 0;
 
diff -ur 2.4/arch/mips/kernel/process.c build-2.4/arch/mips/kernel/process.c
--- 2.4/arch/mips/kernel/process.c	Wed Sep 19 22:39:31 2001
+++ build-2.4/arch/mips/kernel/process.c	Thu Sep 20 21:29:34 2001
@@ -140,6 +140,7 @@
 void dump_thread(struct pt_regs *regs, struct user *dump)
 {
 	dump->magic = CMAGIC;
+	down_read(&current->mm->mmap_sem);
 	dump->start_code  = current->mm->start_code;
 	dump->start_data  = current->mm->start_data;
 	dump->start_stack = regs->regs[29] & ~(PAGE_SIZE - 1);
@@ -147,6 +148,7 @@
 	dump->u_dsize = (current->mm->brk + (PAGE_SIZE - 1) - dump->start_data) >> PAGE_SHIFT;
 	dump->u_ssize =
 		(current->mm->start_stack - dump->start_stack + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	memcpy(&dump->regs[0], regs, sizeof(struct pt_regs));
 	memcpy(&dump->regs[EF_SIZE/4], &current->thread.fpu, sizeof(current->thread.fpu));
 }
diff -ur 2.4/arch/mips64/kernel/process.c build-2.4/arch/mips64/kernel/process.c
--- 2.4/arch/mips64/kernel/process.c	Fri Feb 23 15:24:13 2001
+++ build-2.4/arch/mips64/kernel/process.c	Thu Sep 20 21:29:51 2001
@@ -133,6 +133,7 @@
 void dump_thread(struct pt_regs *regs, struct user *dump)
 {
 	dump->magic = CMAGIC;
+	down_read(&current->mm->mmap_sem);
 	dump->start_code  = current->mm->start_code;
 	dump->start_data  = current->mm->start_data;
 	dump->start_stack = regs->regs[29] & ~(PAGE_SIZE - 1);
@@ -142,6 +143,7 @@
 	                >> PAGE_SHIFT;
 	dump->u_ssize = (current->mm->start_stack - dump->start_stack +
 	                 PAGE_SIZE - 1) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	memcpy(&dump->regs[0], regs, sizeof(struct pt_regs));
 	memcpy(&dump->regs[EF_SIZE/4], &current->thread.fpu,
 	       sizeof(current->thread.fpu));
diff -ur 2.4/arch/s390/kernel/process.c build-2.4/arch/s390/kernel/process.c
--- 2.4/arch/s390/kernel/process.c	Fri Aug 17 18:24:46 2001
+++ build-2.4/arch/s390/kernel/process.c	Thu Sep 20 21:30:14 2001
@@ -415,10 +415,12 @@
 	dump->magic = CMAGIC;
 	dump->start_code = 0;
 	dump->start_stack = regs->gprs[15] & ~(PAGE_SIZE - 1);
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT;
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT;
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_ssize = 0;
+	up_read(&current->mm->mmap_sem);
 	if (dump->start_stack < TASK_SIZE)
 		dump->u_ssize = ((unsigned long) (TASK_SIZE - dump->start_stack)) >> PAGE_SHIFT;
 	memcpy(&dump->regs.gprs[0],regs,sizeof(s390_regs));
diff -ur 2.4/arch/s390x/kernel/process.c build-2.4/arch/s390x/kernel/process.c
--- 2.4/arch/s390x/kernel/process.c	Fri Aug 17 18:24:46 2001
+++ build-2.4/arch/s390x/kernel/process.c	Thu Sep 20 21:30:29 2001
@@ -409,10 +409,12 @@
 	dump->magic = CMAGIC;
 	dump->start_code = 0;
 	dump->start_stack = regs->gprs[15] & ~(PAGE_SIZE - 1);
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = ((unsigned long) current->mm->end_code) >> PAGE_SHIFT;
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1))) >> PAGE_SHIFT;
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_ssize = 0;
+	up_read(&current->mm->mmap_sem);
 	if (dump->start_stack < TASK_SIZE)
 		dump->u_ssize = ((unsigned long) (TASK_SIZE - dump->start_stack)) >> PAGE_SHIFT;
 	memcpy(&dump->regs.gprs[0],regs,sizeof(s390_regs));
diff -ur 2.4/arch/sh/kernel/process.c build-2.4/arch/sh/kernel/process.c
--- 2.4/arch/sh/kernel/process.c	Wed Sep 19 22:39:32 2001
+++ build-2.4/arch/sh/kernel/process.c	Thu Sep 20 21:30:47 2001
@@ -236,6 +236,7 @@
 void dump_thread(struct pt_regs * regs, struct user * dump)
 {
 	dump->magic = CMAGIC;
+	down_read(&current->mm->mmap_sem);
 	dump->start_code = current->mm->start_code;
 	dump->start_data  = current->mm->start_data;
 	dump->start_stack = regs->regs[15] & ~(PAGE_SIZE - 1);
@@ -243,6 +244,7 @@
 	dump->u_dsize = (current->mm->brk + (PAGE_SIZE-1) - dump->start_data) >> PAGE_SHIFT;
 	dump->u_ssize = (current->mm->start_stack - dump->start_stack +
 			 PAGE_SIZE - 1) >> PAGE_SHIFT;
+	up_read(&current->mm->mmap_sem);
 	/* Debug registers will come here. */
 
 	dump->regs = *regs;
diff -ur 2.4/arch/sparc/kernel/process.c build-2.4/arch/sparc/kernel/process.c
--- 2.4/arch/sparc/kernel/process.c	Fri Feb 23 15:24:18 2001
+++ build-2.4/arch/sparc/kernel/process.c	Thu Sep 20 21:31:28 2001
@@ -581,9 +581,11 @@
 	/* fuck me plenty */
 	memcpy(&dump->regs.regs[0], &regs->u_regs[1], (sizeof(unsigned long) * 15));
 	dump->uexec = current->thread.core_exec;
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = (((unsigned long) current->mm->end_code) -
 		((unsigned long) current->mm->start_code)) & ~(PAGE_SIZE - 1);
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1)));
+	up_read(&current->mm->mmap_sem);
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_dsize &= ~(PAGE_SIZE - 1);
 	first_stack_page = (regs->u_regs[UREG_FP] & ~(PAGE_SIZE - 1));
diff -ur 2.4/arch/sparc64/kernel/process.c build-2.4/arch/sparc64/kernel/process.c
--- 2.4/arch/sparc64/kernel/process.c	Sat Jul  7 13:05:52 2001
+++ build-2.4/arch/sparc64/kernel/process.c	Thu Sep 20 21:35:32 2001
@@ -699,9 +699,11 @@
 	dump->regs.y = regs->y;
 	/* fuck me plenty */
 	memcpy(&dump->regs.regs[0], &regs->u_regs[1], (sizeof(unsigned long) * 15));
+	down_read(&current->mm->mmap_sem);
 	dump->u_tsize = (((unsigned long) current->mm->end_code) -
 		((unsigned long) current->mm->start_code)) & ~(PAGE_SIZE - 1);
 	dump->u_dsize = ((unsigned long) (current->mm->brk + (PAGE_SIZE-1)));
+	up_read(&current->mm->mmap_sem);
 	dump->u_dsize -= dump->u_tsize;
 	dump->u_dsize &= ~(PAGE_SIZE - 1);
 	first_stack_page = (regs->u_regs[UREG_FP] & ~(PAGE_SIZE - 1));


* Re: Deadlock on the mm->mmap_sem
  2001-09-20 10:57 ` Studierende der Universitaet des Saarlandes
  2001-09-20 12:40   ` David Howells
@ 2001-09-20 18:24   ` Andrea Arcangeli
  2001-09-20 21:43     ` Manfred Spraul
  2001-09-22 21:06     ` Manfred Spraul
  1 sibling, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-20 18:24 UTC (permalink / raw)
  To: manfred; +Cc: David Howells, linux-kernel, torvalds

On Thu, Sep 20, 2001 at 10:57:08AM +0000, Studierende der Universitaet des Saarlandes wrote:
> * A fair, recursive mmap_sem (a task that already owns the mmap_sem can
> acquire it again without deadlocking, all other cases are fair). That's
> what Andrea proposes. (Andrea, is that correct?)

Exactly.

> elf_core_dump should call down_write to prevent concurrent expand_stack

expand_stack doesn't need the write sem, see the locking comments in the
00_silent-stack-overflow patch in -aa.

> calls, and acquire the pagetable_lock around some lines (right now it
> walks the page tables without locking). I'll check the other coredump

Also expand_stack needs the page_table_lock, that's ok.

> I'll write a patch that moves the locking into the coredump handlers,
> then we can compare that with Andrea's proposal.

Ok.

Andrea


* Re: Deadlock on the mm->mmap_sem
  2001-09-20 10:57 ` Studierende der Universitaet des Saarlandes
@ 2001-09-20 12:40   ` David Howells
  2001-09-20 18:24   ` Andrea Arcangeli
  1 sibling, 0 replies; 49+ messages in thread
From: David Howells @ 2001-09-20 12:40 UTC (permalink / raw)
  To: manfred, andrea; +Cc: David Howells, linux-kernel, torvalds


> David, coredump is the only difficult recursive user of mmap_sem.  ptrace &
> /proc/pid/mem double buffer into kernel buffers, fork just doesn't lock the
> new mm_struct - it's new, no one can get a pointer to it before it's linked
> into the various lists.

Yes, you're right. So what you and Andrea are proposing is to have a field in
the task struct that counts the number of active read locks a task holds on
its own mm_struct. If this is >0, the task may take another read lock, and
that can be done with an extra asm-rwsem operation that simply increments the
semaphore counter. BUT you can only use this operation if you _know_ you
already hold a read lock - and since some function higher up the stack holds
the lock, you can guarantee that the lock isn't going to go away.

Give me a few minutes, and I can handle this:-)

David


* Re: Deadlock on the mm->mmap_sem
@ 2001-09-20 10:57 ` Studierende der Universitaet des Saarlandes
  2001-09-20 12:40   ` David Howells
  2001-09-20 18:24   ` Andrea Arcangeli
  0 siblings, 2 replies; 49+ messages in thread
From: Studierende der Universitaet des Saarlandes @ 2001-09-20 10:57 UTC (permalink / raw)
  To: andrea, David Howells; +Cc: linux-kernel, torvalds

> On Thu, Sep 20, 2001 at 09:01:13AM +0100, David Howells wrote:
> > 
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > > the process doesn't need to lock multiple mm_structs at the same time.
> > 
> > fork, ptrace, /proc/pid/mem, /proc/pid/maps
> >

David, coredump is the only difficult recursive user of mmap_sem.
ptrace & /proc/pid/mem double buffer into kernel buffers, fork just
doesn't lock the new mm_struct - it's new, no one can get a pointer to it
before it's linked into the various lists.

> for /proc/<pid>/maps this check takes care of it of course (or it could
> get unfair again: only when we're faulting on our vm we're allowed to go
> through):
> 
>         if (task == current)
>                 down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
>         else
>                 down_read(&mm->mmap_sem);
> 
Andrea, my rewrite of proc_pid_read_maps fixes that without any ugly
recursive/nonrecursive tests.

Short summary of the possible fixes for the deadlock:

* A simple unfair mmap_sem (rw_lock-like) is not possible.
* Copying the mm_struct is ugly.
* A fair, recursive mmap_sem (a task that already owns the mmap_sem can
acquire it again without deadlocking, all other cases are fair). That's
what Andrea proposes. (Andrea, is that correct?)
* Moving the locking into each coredump handler. The main advantage is
that each handler can take exactly the locking it needs - for some
handlers down_read is not enough, e.g. elf_core_dump should call
down_write to prevent concurrent expand_stack calls, and acquire the
page_table_lock around some lines (right now it walks the page tables
without locking). I'll check the other coredump handlers - during a
quick check I couldn't find any oopsable races if only a read lock is
taken.

I'll write a patch that moves the locking into the coredump handlers,
then we can compare that with Andrea's proposal.

--
	Manfred


* Re: Deadlock on the mm->mmap_sem
  2001-09-20  8:01                           ` David Howells
@ 2001-09-20  8:09                             ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-20  8:09 UTC (permalink / raw)
  To: David Howells
  Cc: Manfred Spraul, Linus Torvalds, Ulrich.Weigand, linux-kernel

On Thu, Sep 20, 2001 at 09:01:13AM +0100, David Howells wrote:
> 
> Andrea Arcangeli <andrea@suse.de> wrote:
> > the process doesn't need to lock multiple mm_structs at the same time.
> 
> fork, ptrace, /proc/pid/mem, /proc/pid/maps
> 
> All have to be able to lock two processes' mm_structs simultaneously, even if
> it's indirectly through copy_to_user() or copy_from_user().

ptrace doesn't use down_read_recursive, nor /proc/<>/mem, nor fork.

for /proc/<pid>/maps this check takes care of it of course (or it could
get unfair again: only when we're faulting on our vm we're allowed to go
through):

	if (task == current)
		down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
	else
		down_read(&mm->mmap_sem);

Andrea


* Re: Deadlock on the mm->mmap_sem
  2001-09-20  7:19                         ` Andrea Arcangeli
@ 2001-09-20  8:01                           ` David Howells
  2001-09-20  8:09                             ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-20  8:01 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Howells, Manfred Spraul, Linus Torvalds, Ulrich.Weigand,
	linux-kernel


Andrea Arcangeli <andrea@suse.de> wrote:
> the process doesn't need to lock multiple mm_structs at the same time.

fork, ptrace, /proc/pid/mem, /proc/pid/maps

All have to be able to lock two processes' mm_structs simultaneously, even if
it's indirectly through copy_to_user() or copy_from_user().

David


* Re: Deadlock on the mm->mmap_sem
  2001-09-20  7:05                       ` David Howells
@ 2001-09-20  7:19                         ` Andrea Arcangeli
  2001-09-20  8:01                           ` David Howells
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-20  7:19 UTC (permalink / raw)
  To: David Howells
  Cc: Manfred Spraul, Linus Torvalds, Ulrich.Weigand, linux-kernel

On Thu, Sep 20, 2001 at 08:05:57AM +0100, David Howells wrote:
> 
> Andrea Arcangeli <andrea@suse.de> wrote:
> > yes, one solution to the latency problem without writing the
> > ugly code would be simply to add a per-process counter to pass to a
> > modified rwsem api, then to hide the trickery in a mm_down_read macro.
> > such way it will be recursive _and_ fair.
> 
> You'd need a counter per-process per-mm_struct. Otherwise you couldn't do a
> recursive read lock simultaneously in two or more different processes, and
> also allow any one process to lock multiple mm_structs.

the process doesn't need to lock multiple mm_structs at the same time.

I mean, we just need to allow a single task to go through; it doesn't
matter if the other tasks/threads are stuck, they will wait for the write
to finish. That's exactly where the fairness comes from.

The only thing that matters is that if a certain task passes the first
read lock of its mm_struct semaphore, it will also pass any further
recursive lock of that _same_ mm_struct. So a per-process recursor is
all we need.

It must not be per-mm: per-mm would work, but it would simply reintroduce
the unfairness.

Andrea


* Re: Deadlock on the mm->mmap_sem
  2001-09-20  2:07                     ` Andrea Arcangeli
  2001-09-20  4:37                       ` Andrea Arcangeli
@ 2001-09-20  7:05                       ` David Howells
  2001-09-20  7:19                         ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-20  7:05 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Manfred Spraul, David Howells, Linus Torvalds, Ulrich.Weigand,
	linux-kernel


Andrea Arcangeli <andrea@suse.de> wrote:
> yes, one solution to the latency problem without writing the
> ugly code would be simply to add a per-process counter to pass to a
> modified rwsem api, then to hide the trickery in a mm_down_read macro.
> such way it will be recursive _and_ fair.

You'd need a counter per-process per-mm_struct. Otherwise you couldn't do a
recursive read lock simultaneously in two or more different processes, and
also allow any one process to lock multiple mm_structs.

David


* Re: Deadlock on the mm->mmap_sem
  2001-09-20  2:07                     ` Andrea Arcangeli
@ 2001-09-20  4:37                       ` Andrea Arcangeli
  2001-09-20  7:05                       ` David Howells
  1 sibling, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-20  4:37 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: David Howells, Linus Torvalds, Ulrich.Weigand, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 9939 bytes --]

On Thu, Sep 20, 2001 at 04:07:02AM +0200, Andrea Arcangeli wrote:
> On Wed, Sep 19, 2001 at 08:19:09PM +0200, Manfred Spraul wrote:
> > > if we go generic then I strongly recommend my version of the generic
> > > semaphores is _much_ faster (and cleaner) than this one (it even
> > allows
> > > more than 2^31 concurrent readers on 64 bit archs ;).
> > >
> > Andrea,
> > 
> > implementing recursive semaphores is trivial, but do you have any idea
> > how to fix the latency problem?
> 
> yes, one solution to the latency problem without writing the
> ugly code would be simply to add a per-process counter to pass to a
> modified rwsem api, then to hide the trickery in a mm_down_read macro.
> such way it will be recursive _and_ fair.

OK, attached is a rwsem patch (now with the fast path inlined :) with my
version of the rwsem-spinlock semaphores (as the only option across all
the ports). It's against pre12 (it is not inlined in the email since it
would not be readable anyway).

and below you find (this time inlined) an incremental patch to be
applied on top of the attachment that implements the read recursive and
at the same time fair rw semaphores. This should close all the problems.

But keep in mind that the rwsems are _fair_ by default, so you cannot do
read recursion unless you use the recursive version and pass a
rw_sem_recursor to it.

diff -urN rwsem/arch/alpha/mm/fault.c rwsem-recurisve/arch/alpha/mm/fault.c
--- rwsem/arch/alpha/mm/fault.c	Thu Sep 20 01:43:26 2001
+++ rwsem-recurisve/arch/alpha/mm/fault.c	Thu Sep 20 06:31:37 2001
@@ -113,7 +113,7 @@
 		goto vmalloc_fault;
 #endif
 
-	down_read(&mm->mmap_sem);
+	down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -147,7 +147,7 @@
 	 * the fault.
 	 */
 	fault = handle_mm_fault(mm, vma, address, cause > 0);
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	if (fault < 0)
 		goto out_of_memory;
@@ -161,7 +161,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	if (user_mode(regs)) {
 		force_sig(SIGSEGV, current);
@@ -198,7 +198,7 @@
 	if (current->pid == 1) {
 		current->policy |= SCHED_YIELD;
 		schedule();
-		down_read(&mm->mmap_sem);
+		down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 		goto survive;
 	}
 	printk(KERN_ALERT "VM: killing process %s(%d)\n",
diff -urN rwsem/arch/i386/mm/fault.c rwsem-recurisve/arch/i386/mm/fault.c
--- rwsem/arch/i386/mm/fault.c	Thu Sep 20 01:43:27 2001
+++ rwsem-recurisve/arch/i386/mm/fault.c	Thu Sep 20 06:31:53 2001
@@ -191,7 +191,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -265,7 +265,7 @@
 		if (bit < 32)
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	return;
 
 /*
@@ -273,7 +273,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
@@ -341,11 +341,11 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	if (tsk->pid == 1) {
 		tsk->policy |= SCHED_YIELD;
 		schedule();
-		down_read(&mm->mmap_sem);
+		down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 		goto survive;
 	}
 	printk("VM: killing process %s\n", tsk->comm);
@@ -354,7 +354,7 @@
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -urN rwsem/arch/ia64/mm/fault.c rwsem-recurisve/arch/ia64/mm/fault.c
--- rwsem/arch/ia64/mm/fault.c	Tue May  1 19:35:18 2001
+++ rwsem-recurisve/arch/ia64/mm/fault.c	Thu Sep 20 06:04:45 2001
@@ -60,7 +60,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 
 	vma = find_vma_prev(mm, address, &prev_vma);
 	if (!vma)
@@ -112,7 +112,7 @@
 	      default:
 		goto out_of_memory;
 	}
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	return;
 
   check_expansion:
@@ -135,7 +135,7 @@
 	goto good_area;
 
   bad_area:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	if (isr & IA64_ISR_SP) {
 		/*
 		 * This fault was due to a speculative load set the "ed" bit in the psr to
@@ -184,7 +184,7 @@
 	return;
 
   out_of_memory:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	printk("VM: killing process %s\n", current->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
diff -urN rwsem/arch/ppc/mm/fault.c rwsem-recurisve/arch/ppc/mm/fault.c
--- rwsem/arch/ppc/mm/fault.c	Wed Jul  4 04:03:45 2001
+++ rwsem-recurisve/arch/ppc/mm/fault.c	Thu Sep 20 06:10:09 2001
@@ -103,7 +103,7 @@
 		bad_page_fault(regs, address, SIGSEGV);
 		return;
 	}
-	down_read(&mm->mmap_sem);
+	down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -163,7 +163,7 @@
                 goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	/*
 	 * keep track of tlb+htab misses that are good addrs but
 	 * just need pte's created via handle_mm_fault()
@@ -173,7 +173,7 @@
 	return;
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	pte_errors++;	
 
 	/* User mode accesses cause a SIGSEGV */
@@ -194,7 +194,7 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	printk("VM: killing process %s\n", current->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
@@ -202,7 +202,7 @@
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
 	info.si_signo = SIGBUS;
 	info.si_errno = 0;
 	info.si_code = BUS_ADRERR;
diff -urN rwsem/fs/exec.c rwsem-recurisve/fs/exec.c
--- rwsem/fs/exec.c	Thu Sep 20 01:44:06 2001
+++ rwsem-recurisve/fs/exec.c	Thu Sep 20 06:31:06 2001
@@ -969,9 +969,9 @@
 	if (do_truncate(file->f_dentry, 0) != 0)
 		goto close_fail;
 
-	down_read(&current->mm->mmap_sem);
+	down_read_recursive(&current->mm->mmap_sem, &current->mm_recursor);
 	retval = binfmt->core_dump(signr, regs, file);
-	up_read(&current->mm->mmap_sem);
+	up_read_recursive(&current->mm->mmap_sem, &current->mm_recursor);
 
 close_fail:
 	filp_close(file, NULL);
diff -urN rwsem/fs/proc/array.c rwsem-recurisve/fs/proc/array.c
--- rwsem/fs/proc/array.c	Sat Aug 11 08:04:22 2001
+++ rwsem-recurisve/fs/proc/array.c	Thu Sep 20 06:22:35 2001
@@ -577,7 +577,10 @@
 	column = *ppos & (MAPS_LINE_LENGTH-1);
 
 	/* quickly go to line lineno */
-	down_read(&mm->mmap_sem);
+	if (task == current)
+		down_read_recursive(&mm->mmap_sem, &current->mm_recursor);
+	else
+		down_read(&mm->mmap_sem);
 	for (map = mm->mmap, i = 0; map && (i < lineno); map = map->vm_next, i++)
 		continue;
 
@@ -658,7 +661,10 @@
 		if (volatile_task)
 			break;
 	}
-	up_read(&mm->mmap_sem);
+	if (task == current)
+		up_read_recursive(&mm->mmap_sem, &current->mm_recursor);
+	else
+		up_read(&mm->mmap_sem);
 
 	/* encode f_pos */
 	*ppos = (lineno << MAPS_LINE_SHIFT) + column;
diff -urN rwsem/include/linux/rwsem.h rwsem-recurisve/include/linux/rwsem.h
--- rwsem/include/linux/rwsem.h	Thu Sep 20 05:08:56 2001
+++ rwsem-recurisve/include/linux/rwsem.h	Thu Sep 20 06:25:49 2001
@@ -18,6 +18,11 @@
 #endif
 };
 
+struct rw_sem_recursor
+{
+	int counter;
+};
+
 #if RWSEM_DEBUG
 #define __SEM_DEBUG_INIT(name) \
 	, (long)&(name).__magic
@@ -42,6 +47,7 @@
 	__SEM_DEBUG_INIT(name)			\
 }
 #define RWSEM_INITIALIZER(name) __RWSEM_INITIALIZER(name, 0)
+#define RWSEM_RECURSOR_INITIALIZER ((struct rw_sem_recursor) { 0, })
 
 #define __DECLARE_RWSEM(name, count) \
 	struct rw_semaphore name = __RWSEM_INITIALIZER(name, count)
@@ -112,6 +118,34 @@
 	spin_lock(&sem->lock);
 	sem->count -= RWSEM_WRITE_BIAS;
 	if (unlikely(sem->count))
+		rwsem_wake(sem);
+	spin_unlock(&sem->lock);
+}
+
+static inline void down_read_recursive(struct rw_semaphore *sem,
+				       struct rw_sem_recursor * recursor)
+{
+	int count, counter;
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	count = sem->count;
+	sem->count += RWSEM_READ_BIAS;
+	counter = recursor->counter++;
+	if (unlikely(count < 0 && !counter && !(count & RWSEM_READ_MASK)))
+		rwsem_down_failed(sem, RWSEM_READ_BLOCKING_BIAS);
+	spin_unlock(&sem->lock);
+}
+
+static inline void up_read_recursive(struct rw_semaphore *sem,
+				     struct rw_sem_recursor * recursor)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	sem->count -= RWSEM_READ_BIAS;
+	recursor->counter--;
+	if (unlikely(sem->count < 0 && !(sem->count & RWSEM_READ_MASK)))
 		rwsem_wake(sem);
 	spin_unlock(&sem->lock);
 }
diff -urN rwsem/include/linux/sched.h rwsem-recurisve/include/linux/sched.h
--- rwsem/include/linux/sched.h	Thu Sep 20 05:09:07 2001
+++ rwsem-recurisve/include/linux/sched.h	Thu Sep 20 06:25:50 2001
@@ -315,6 +315,7 @@
 
 	struct task_struct *next_task, *prev_task;
 	struct mm_struct *active_mm;
+	struct rw_sem_recursor mm_recursor;
 	struct list_head local_pages;
 	unsigned int allocation_order, nr_local_pages;
 
@@ -460,6 +461,7 @@
     policy:		SCHED_OTHER,					\
     mm:			NULL,						\
     active_mm:		&init_mm,					\
+    mm_recursor:	RWSEM_RECURSOR_INITIALIZER,			\
     cpus_allowed:	-1,						\
     run_list:		LIST_HEAD_INIT(tsk.run_list),			\
     next_task:		&tsk,						\

Andrea

[-- Attachment #2: 00_rwsem-fair-20 --]
[-- Type: text/plain, Size: 38913 bytes --]

diff -urN 2.4.10pre12/arch/alpha/config.in rwsem/arch/alpha/config.in
--- 2.4.10pre12/arch/alpha/config.in	Thu Aug 16 22:03:22 2001
+++ rwsem/arch/alpha/config.in	Thu Sep 20 03:02:18 2001
@@ -5,8 +5,6 @@
 
 define_bool CONFIG_ALPHA y
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 
 mainmenu_name "Kernel configuration of Linux for Alpha machines"
 
diff -urN 2.4.10pre12/arch/arm/config.in rwsem/arch/arm/config.in
--- 2.4.10pre12/arch/arm/config.in	Thu Aug 16 22:03:22 2001
+++ rwsem/arch/arm/config.in	Thu Sep 20 03:02:22 2001
@@ -9,8 +9,6 @@
 define_bool CONFIG_SBUS n
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 
 mainmenu_option next_comment
diff -urN 2.4.10pre12/arch/cris/config.in rwsem/arch/cris/config.in
--- 2.4.10pre12/arch/cris/config.in	Sat Aug 11 08:03:53 2001
+++ rwsem/arch/cris/config.in	Thu Sep 20 03:02:32 2001
@@ -5,8 +5,6 @@
 mainmenu_name "Linux/CRIS Kernel Configuration"
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN 2.4.10pre12/arch/i386/config.in rwsem/arch/i386/config.in
--- 2.4.10pre12/arch/i386/config.in	Thu Sep 20 01:43:26 2001
+++ rwsem/arch/i386/config.in	Thu Sep 20 03:02:39 2001
@@ -50,8 +50,6 @@
    define_bool CONFIG_X86_CMPXCHG n
    define_bool CONFIG_X86_XADD n
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 else
    define_bool CONFIG_X86_WP_WORKS_OK y
    define_bool CONFIG_X86_INVLPG y
@@ -59,8 +57,6 @@
    define_bool CONFIG_X86_XADD y
    define_bool CONFIG_X86_BSWAP y
    define_bool CONFIG_X86_POPAD_OK y
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 fi
 if [ "$CONFIG_M486" = "y" ]; then
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
diff -urN 2.4.10pre12/arch/ia64/config.in rwsem/arch/ia64/config.in
--- 2.4.10pre12/arch/ia64/config.in	Sat Aug 11 08:03:54 2001
+++ rwsem/arch/ia64/config.in	Thu Sep 20 03:02:44 2001
@@ -23,8 +23,6 @@
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
 define_bool CONFIG_SBUS n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 if [ "$CONFIG_IA64_HP_SIM" = "n" ]; then
   define_bool CONFIG_ACPI y
diff -urN 2.4.10pre12/arch/m68k/config.in rwsem/arch/m68k/config.in
--- 2.4.10pre12/arch/m68k/config.in	Wed Jul  4 04:03:45 2001
+++ rwsem/arch/m68k/config.in	Thu Sep 20 03:02:48 2001
@@ -4,8 +4,6 @@
 #
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux/68k Kernel Configuration"
 
diff -urN 2.4.10pre12/arch/mips/config.in rwsem/arch/mips/config.in
--- 2.4.10pre12/arch/mips/config.in	Thu Sep 20 01:43:27 2001
+++ rwsem/arch/mips/config.in	Thu Sep 20 03:02:52 2001
@@ -68,8 +68,6 @@
    fi
 bool 'Support for Alchemy Semi PB1000 board' CONFIG_MIPS_PB1000
 
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 #
 # Select some configuration options automatically for certain systems.
diff -urN 2.4.10pre12/arch/mips64/config.in rwsem/arch/mips64/config.in
--- 2.4.10pre12/arch/mips64/config.in	Thu Sep 20 01:43:30 2001
+++ rwsem/arch/mips64/config.in	Thu Sep 20 03:03:02 2001
@@ -27,9 +27,6 @@
 fi
 endmenu
 
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
-
 #
 # Select some configuration options automatically based on user selections
 #
diff -urN 2.4.10pre12/arch/parisc/config.in rwsem/arch/parisc/config.in
--- 2.4.10pre12/arch/parisc/config.in	Tue May  1 19:35:20 2001
+++ rwsem/arch/parisc/config.in	Thu Sep 20 03:03:06 2001
@@ -7,8 +7,6 @@
 
 define_bool CONFIG_PARISC y
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN 2.4.10pre12/arch/ppc/config.in rwsem/arch/ppc/config.in
--- 2.4.10pre12/arch/ppc/config.in	Thu Sep 20 01:43:31 2001
+++ rwsem/arch/ppc/config.in	Thu Sep 20 03:03:10 2001
@@ -4,8 +4,6 @@
 # see Documentation/kbuild/config-language.txt.
 #
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 
 mainmenu_name "Linux/PowerPC Kernel Configuration"
 
diff -urN 2.4.10pre12/arch/s390/config.in rwsem/arch/s390/config.in
--- 2.4.10pre12/arch/s390/config.in	Sat Aug 11 08:03:56 2001
+++ rwsem/arch/s390/config.in	Thu Sep 20 03:03:13 2001
@@ -7,8 +7,6 @@
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -urN 2.4.10pre12/arch/s390x/config.in rwsem/arch/s390x/config.in
--- 2.4.10pre12/arch/s390x/config.in	Sat Aug 11 08:04:00 2001
+++ rwsem/arch/s390x/config.in	Thu Sep 20 03:03:17 2001
@@ -6,8 +6,6 @@
 define_bool CONFIG_ISA n
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -urN 2.4.10pre12/arch/sh/config.in rwsem/arch/sh/config.in
--- 2.4.10pre12/arch/sh/config.in	Thu Sep 20 01:43:33 2001
+++ rwsem/arch/sh/config.in	Thu Sep 20 03:03:20 2001
@@ -7,8 +7,6 @@
 define_bool CONFIG_SUPERH y
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN 2.4.10pre12/arch/sparc/config.in rwsem/arch/sparc/config.in
--- 2.4.10pre12/arch/sparc/config.in	Wed Jul  4 04:03:45 2001
+++ rwsem/arch/sparc/config.in	Thu Sep 20 03:03:23 2001
@@ -48,8 +48,6 @@
 define_bool CONFIG_SUN_CONSOLE y
 define_bool CONFIG_SUN_AUXIO y
 define_bool CONFIG_SUN_IO y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 bool 'Support for SUN4 machines (disables SUN4[CDM] support)' CONFIG_SUN4
 if [ "$CONFIG_SUN4" != "y" ]; then
diff -urN 2.4.10pre12/arch/sparc64/config.in rwsem/arch/sparc64/config.in
--- 2.4.10pre12/arch/sparc64/config.in	Thu Aug 16 22:03:25 2001
+++ rwsem/arch/sparc64/config.in	Thu Sep 20 03:03:27 2001
@@ -33,8 +33,6 @@
 
 # Global things across all Sun machines.
 define_bool CONFIG_HAVE_DEC_LOCK y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 define_bool CONFIG_ISA n
 define_bool CONFIG_ISAPNP n
 define_bool CONFIG_EISA n
diff -urN 2.4.10pre12/include/asm-alpha/rwsem.h rwsem/include/asm-alpha/rwsem.h
--- 2.4.10pre12/include/asm-alpha/rwsem.h	Sat Jul 21 00:04:29 2001
+++ rwsem/include/asm-alpha/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,208 +0,0 @@
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky <ink@jurassic.park.msu.ru>, 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <asm/compiler.h>
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	long			count;
-#define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS		0x0000000000000001L
-#define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-	{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
-	LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count += RWSEM_ACTIVE_READ_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	mb\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount < 0, 0))
-		rwsem_down_read_failed(sem);
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count += RWSEM_ACTIVE_WRITE_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	mb\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount, 0))
-		rwsem_down_write_failed(sem);
-}
-
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count -= RWSEM_ACTIVE_READ_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"	mb\n"
-	"1:	ldq_l	%0,%1\n"
-	"	subq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount < 0, 0)) 
-		if ((int)oldcount - RWSEM_ACTIVE_READ_BIAS == 0)
-			rwsem_wake(sem);
-}
-
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	long count;
-#ifndef	CONFIG_SMP
-	sem->count -= RWSEM_ACTIVE_WRITE_BIAS;
-	count = sem->count;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"	mb\n"
-	"1:	ldq_l	%0,%1\n"
-	"	subq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	subq	%0,%3,%0\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (count), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(count, 0))
-		if ((int)count == 0)
-			rwsem_wake(sem);
-}
-
-static inline void rwsem_atomic_add(long val, struct rw_semaphore *sem)
-{
-#ifndef	CONFIG_SMP
-	sem->count += val;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%2,%0\n"
-	"	stq_c	%0,%1\n"
-	"	beq	%0,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (sem->count)
-	:"Ir" (val), "m" (sem->count));
-#endif
-}
-
-static inline long rwsem_atomic_update(long val, struct rw_semaphore *sem)
-{
-#ifndef	CONFIG_SMP
-	sem->count += val;
-	return sem->count;
-#else
-	long ret, temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq 	%0,%3,%2\n"
-	"	addq	%0,%3,%0\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (ret), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (val), "m" (sem->count));
-
-	return ret;
-#endif
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ALPHA_RWSEM_H */
diff -urN 2.4.10pre12/include/asm-i386/rwsem.h rwsem/include/asm-i386/rwsem.h
--- 2.4.10pre12/include/asm-i386/rwsem.h	Fri Aug 17 05:02:27 2001
+++ rwsem/include/asm-i386/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,226 +0,0 @@
-/* rwsem.h: R/W semaphores implemented using XADD/CMPXCHG for i486+
- *
- * Written by David Howells (dhowells@redhat.com).
- *
- * Derived from asm-i386/semaphore.h
- *
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consequtive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _I386_RWSEM_H
-#define _I386_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *FASTCALL(rwsem_down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *));
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	signed long		count;
-#define RWSEM_UNLOCKED_VALUE		0x00000000
-#define RWSEM_ACTIVE_BIAS		0x00000001
-#define RWSEM_ACTIVE_MASK		0x0000ffff
-#define RWSEM_WAITING_BIAS		(-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
-	__RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning down_read\n\t"
-LOCK_PREFIX	"  incl      (%%eax)\n\t" /* adds 0x00000001, returns the old value */
-		"  js        2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  pushl     %%edx\n\t"
-		"  call      rwsem_down_read_failed\n\t"
-		"  popl      %%edx\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous"
-		"# ending down_read\n\t"
-		: "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	int tmp;
-
-	tmp = RWSEM_ACTIVE_WRITE_BIAS;
-	__asm__ __volatile__(
-		"# beginning down_write\n\t"
-LOCK_PREFIX	"  xadd      %0,(%%eax)\n\t" /* subtract 0x0000ffff, returns the old value */
-		"  testl     %0,%0\n\t" /* was the count 0 before? */
-		"  jnz       2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_down_write_failed\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending down_write"
-		: "+d"(tmp), "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	__s32 tmp = -RWSEM_ACTIVE_READ_BIAS;
-	__asm__ __volatile__(
-		"# beginning __up_read\n\t"
-LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n\t" /* subtracts 1, returns the old value */
-		"  js        2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* do nothing if still outstanding active readers */
-		"  jnz       1b\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_read\n"
-		: "+m"(sem->count), "+d"(tmp)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning __up_write\n\t"
-		"  movl      %2,%%edx\n\t"
-LOCK_PREFIX	"  xaddl     %%edx,(%%eax)\n\t" /* tries to transition 0xffff0001 -> 0x00000000 */
-		"  jnz       2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* did the active count reduce to 0? */
-		"  jnz       1b\n\t" /* jump back if not */
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_write\n"
-		: "+m"(sem->count)
-		: "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS)
-		: "memory", "cc", "edx");
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-LOCK_PREFIX	"addl %1,%0"
-		:"=m"(sem->count)
-		:"ir"(delta), "m"(sem->count));
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
-	int tmp = delta;
-
-	__asm__ __volatile__(
-LOCK_PREFIX	"xadd %0,(%2)"
-		: "+r"(tmp), "=m"(sem->count)
-		: "r"(sem), "m"(sem->count)
-		: "memory");
-
-	return tmp+delta;
-}
-
-#endif /* __KERNEL__ */
-#endif /* _I386_RWSEM_H */
diff -urN 2.4.10pre12/include/linux/rwsem-spinlock.h rwsem/include/linux/rwsem-spinlock.h
--- 2.4.10pre12/include/linux/rwsem-spinlock.h	Wed Aug 29 15:05:24 2001
+++ rwsem/include/linux/rwsem-spinlock.h	Thu Jan  1 01:00:00 1970
@@ -1,62 +0,0 @@
-/* rwsem-spinlock.h: fallback C implementation
- *
- * Copyright (c) 2001   David Howells (dhowells@redhat.com).
- * - Derived partially from ideas by Andrea Arcangeli <andrea@suse.de>
- * - Derived also from comments by Linus
- */
-
-#ifndef _LINUX_RWSEM_SPINLOCK_H
-#define _LINUX_RWSEM_SPINLOCK_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include linux/rwsem-spinlock.h directly, use linux/rwsem.h instead
-#endif
-
-#include <linux/spinlock.h>
-#include <linux/list.h>
-
-#ifdef __KERNEL__
-
-#include <linux/types.h>
-
-struct rwsem_waiter;
-
-/*
- * the rw-semaphore definition
- * - if activity is 0 then there are no active readers or writers
- * - if activity is +ve then that is the number of active readers
- * - if activity is -1 then there is one active writer
- * - if wait_list is not empty, then there are processes waiting for the semaphore
- */
-struct rw_semaphore {
-	__s32			activity;
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void FASTCALL(init_rwsem(struct rw_semaphore *sem));
-extern void FASTCALL(__down_read(struct rw_semaphore *sem));
-extern void FASTCALL(__down_write(struct rw_semaphore *sem));
-extern void FASTCALL(__up_read(struct rw_semaphore *sem));
-extern void FASTCALL(__up_write(struct rw_semaphore *sem));
-
-#endif /* __KERNEL__ */
-#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff -urN 2.4.10pre12/include/linux/rwsem.h rwsem/include/linux/rwsem.h
--- 2.4.10pre12/include/linux/rwsem.h	Wed Aug 29 15:05:24 2001
+++ rwsem/include/linux/rwsem.h	Thu Sep 20 05:08:56 2001
@@ -1,80 +1,120 @@
-/* rwsem.h: R/W semaphores, public interface
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from asm-i386/semaphore.h
- */
-
 #ifndef _LINUX_RWSEM_H
 #define _LINUX_RWSEM_H
 
-#include <linux/linkage.h>
-
-#define RWSEM_DEBUG 0
-
 #ifdef __KERNEL__
 
-#include <linux/config.h>
-#include <linux/types.h>
+#include <linux/compiler.h>
 #include <linux/kernel.h>
-#include <asm/system.h>
-#include <asm/atomic.h>
 
-struct rw_semaphore;
-
-#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
-#include <linux/rwsem-spinlock.h> /* use a generic implementation */
-#else
-#include <asm/rwsem.h> /* use an arch-specific implementation */
+struct rw_semaphore
+{
+	spinlock_t lock;
+	long count;
+#define RWSEM_READ_BIAS 1
+#define RWSEM_WRITE_BIAS (~(~0UL >> (BITS_PER_LONG>>1)))
+	struct list_head wait;
+#if RWSEM_DEBUG
+	long __magic;
 #endif
+};
 
-#ifndef rwsemtrace
 #if RWSEM_DEBUG
-extern void FASTCALL(rwsemtrace(struct rw_semaphore *sem, const char *str));
+#define __SEM_DEBUG_INIT(name) \
+	, (long)&(name).__magic
+#define CHECK_MAGIC(x)							\
+	do {								\
+		if ((x) != (long)&(x)) {				\
+			printk("rwsem bad magic %lx (should be %lx), ",	\
+				(long)x, (long)&(x));			\
+			BUG();						\
+		}							\
+	} while (0)
 #else
-#define rwsemtrace(SEM,FMT)
+#define __SEM_DEBUG_INIT(name)
+#define CHECK_MAGIC(x)
 #endif
+
+#define __RWSEM_INITIALIZER(name, count)	\
+{						\
+	SPIN_LOCK_UNLOCKED,			\
+	(count),				\
+	LIST_HEAD_INIT((name).wait)		\
+	__SEM_DEBUG_INIT(name)			\
+}
+#define RWSEM_INITIALIZER(name) __RWSEM_INITIALIZER(name, 0)
+
+#define __DECLARE_RWSEM(name, count) \
+	struct rw_semaphore name = __RWSEM_INITIALIZER(name, count)
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM(name, 0)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_READ_BIAS)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_WRITE_BIAS)
+
+#define RWSEM_READ_BLOCKING_BIAS (RWSEM_WRITE_BIAS-RWSEM_READ_BIAS)
+#define RWSEM_WRITE_BLOCKING_BIAS (0)
+
+#define RWSEM_READ_MASK (~RWSEM_WRITE_BIAS)
+#define RWSEM_WRITE_MASK (RWSEM_WRITE_BIAS)
+
+extern void FASTCALL(rwsem_down_failed(struct rw_semaphore *, long));
+extern void FASTCALL(rwsem_wake(struct rw_semaphore *));
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+	spin_lock_init(&sem->lock);
+	sem->count = 0;
+	INIT_LIST_HEAD(&sem->wait);
+#if RWSEM_DEBUG
+	sem->__magic = (long)&sem->__magic;
 #endif
+}
 
-/*
- * lock for reading
- */
 static inline void down_read(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering down_read");
-	__down_read(sem);
-	rwsemtrace(sem,"Leaving down_read");
+	int count;
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	count = sem->count;
+	sem->count += RWSEM_READ_BIAS;
+	if (unlikely(count < 0))
+		rwsem_down_failed(sem, RWSEM_READ_BLOCKING_BIAS);
+	spin_unlock(&sem->lock);
 }
 
-/*
- * lock for writing
- */
 static inline void down_write(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering down_write");
-	__down_write(sem);
-	rwsemtrace(sem,"Leaving down_write");
+	long count;
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	count = sem->count;
+	sem->count += RWSEM_WRITE_BIAS;
+	if (unlikely(count))
+		rwsem_down_failed(sem, RWSEM_WRITE_BLOCKING_BIAS);
+	spin_unlock(&sem->lock);
 }
 
-/*
- * release a read lock
- */
 static inline void up_read(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering up_read");
-	__up_read(sem);
-	rwsemtrace(sem,"Leaving up_read");
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	sem->count -= RWSEM_READ_BIAS;
+	if (unlikely(sem->count < 0 && !(sem->count & RWSEM_READ_MASK)))
+		rwsem_wake(sem);
+	spin_unlock(&sem->lock);
 }
 
-/*
- * release a write lock
- */
 static inline void up_write(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering up_write");
-	__up_write(sem);
-	rwsemtrace(sem,"Leaving up_write");
-}
+	CHECK_MAGIC(sem->__magic);
 
+	spin_lock(&sem->lock);
+	sem->count -= RWSEM_WRITE_BIAS;
+	if (unlikely(sem->count))
+		rwsem_wake(sem);
+	spin_unlock(&sem->lock);
+}
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_H */
diff -urN 2.4.10pre12/include/linux/sched.h rwsem/include/linux/sched.h
--- 2.4.10pre12/include/linux/sched.h	Thu Sep 20 01:44:18 2001
+++ rwsem/include/linux/sched.h	Thu Sep 20 05:09:07 2001
@@ -239,7 +239,7 @@
 	pgd:		swapper_pg_dir, 		\
 	mm_users:	ATOMIC_INIT(2), 		\
 	mm_count:	ATOMIC_INIT(1), 		\
-	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
+	mmap_sem:	RWSEM_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
 	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
 }
diff -urN 2.4.10pre12/lib/Makefile rwsem/lib/Makefile
--- 2.4.10pre12/lib/Makefile	Thu Sep 20 01:44:19 2001
+++ rwsem/lib/Makefile	Thu Sep 20 04:38:44 2001
@@ -8,12 +8,9 @@
 
 L_TARGET := lib.a
 
-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o
+export-objs := cmdline.o dec_and_lock.o rwsem.o
 
-obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
-
-obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
-obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
+obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o rwsem.o
 
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y) 
   obj-y += dec_and_lock.o
diff -urN 2.4.10pre12/lib/rwsem-spinlock.c rwsem/lib/rwsem-spinlock.c
--- 2.4.10pre12/lib/rwsem-spinlock.c	Tue May  1 19:35:33 2001
+++ rwsem/lib/rwsem-spinlock.c	Thu Jan  1 01:00:00 1970
@@ -1,239 +0,0 @@
-/* rwsem-spinlock.c: R/W semaphores: contention handling functions for generic spinlock
- *                                   implementation
- *
- * Copyright (c) 2001   David Howells (dhowells@redhat.com).
- * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de>
- * - Derived also from comments by Linus
- */
-#include <linux/rwsem.h>
-#include <linux/sched.h>
-#include <linux/module.h>
-
-struct rwsem_waiter {
-	struct list_head	list;
-	struct task_struct	*task;
-	unsigned int		flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-};
-
-#if RWSEM_DEBUG
-void rwsemtrace(struct rw_semaphore *sem, const char *str)
-{
-	if (sem->debug)
-		printk("[%d] %s({%d,%d})\n",
-		       current->pid,str,sem->activity,list_empty(&sem->wait_list)?0:1);
-}
-#endif
-
-/*
- * initialise the semaphore
- */
-void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->activity = 0;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active count' _reached_ zero
- *   - the 'waiting count' is non-zero
- * - the spinlock must be held by the caller
- * - woken process blocks are discarded from the list after having flags zeroised
- */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-	int woken;
-
-	rwsemtrace(sem,"Entering __rwsem_do_wake");
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-
-	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - we leave the 'waiting count' incremented to signify potential contention
-	 */
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
-		sem->activity = -1;
-		list_del(&waiter->list);
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-		goto out;
-	}
-
-	/* grant an infinite number of read locks to the readers at the front of the queue */
-	woken = 0;
-	do {
-		list_del(&waiter->list);
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-		woken++;
-		if (list_empty(&sem->wait_list))
-			break;
-		waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-	} while (waiter->flags&RWSEM_WAITING_FOR_READ);
-
-	sem->activity += woken;
-
- out:
-	rwsemtrace(sem,"Leaving __rwsem_do_wake");
-	return sem;
-}
-
-/*
- * wake a single writer
- */
-static inline struct rw_semaphore *__rwsem_wake_one_writer(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-
-	sem->activity = -1;
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-	list_del(&waiter->list);
-
-	waiter->flags = 0;
-	wake_up_process(waiter->task);
-	return sem;
-}
-
-/*
- * get a read lock on the semaphore
- */
-void __down_read(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-	struct task_struct *tsk;
-
-	rwsemtrace(sem,"Entering __down_read");
-
-	spin_lock(&sem->wait_lock);
-
-	if (sem->activity>=0 && list_empty(&sem->wait_list)) {
-		/* granted */
-		sem->activity++;
-		spin_unlock(&sem->wait_lock);
-		goto out;
-	}
-
-	tsk = current;
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-
-	list_add_tail(&waiter.list,&sem->wait_list);
-
-	/* we don't need to touch the semaphore struct anymore */
-	spin_unlock(&sem->wait_lock);
-
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.flags)
-			break;
-		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
- out:
-	rwsemtrace(sem,"Leaving __down_read");
-}
-
-/*
- * get a write lock on the semaphore
- * - note that we increment the waiting count anyway to indicate an exclusive lock
- */
-void __down_write(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-	struct task_struct *tsk;
-
-	rwsemtrace(sem,"Entering __down_write");
-
-	spin_lock(&sem->wait_lock);
-
-	if (sem->activity==0 && list_empty(&sem->wait_list)) {
-		/* granted */
-		sem->activity = -1;
-		spin_unlock(&sem->wait_lock);
-		goto out;
-	}
-
-	tsk = current;
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-
-	list_add_tail(&waiter.list,&sem->wait_list);
-
-	/* we don't need to touch the semaphore struct anymore */
-	spin_unlock(&sem->wait_lock);
-
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.flags)
-			break;
-		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
- out:
-	rwsemtrace(sem,"Leaving __down_write");
-}
-
-/*
- * release a read lock on the semaphore
- */
-void __up_read(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering __up_read");
-
-	spin_lock(&sem->wait_lock);
-
-	if (--sem->activity==0 && !list_empty(&sem->wait_list))
-		sem = __rwsem_wake_one_writer(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving __up_read");
-}
-
-/*
- * release a write lock on the semaphore
- */
-void __up_write(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering __up_write");
-
-	spin_lock(&sem->wait_lock);
-
-	sem->activity = 0;
-	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving __up_write");
-}
-
-EXPORT_SYMBOL(init_rwsem);
-EXPORT_SYMBOL(__down_read);
-EXPORT_SYMBOL(__down_write);
-EXPORT_SYMBOL(__up_read);
-EXPORT_SYMBOL(__up_write);
-#if RWSEM_DEBUG
-EXPORT_SYMBOL(rwsemtrace);
-#endif
diff -urN 2.4.10pre12/lib/rwsem.c rwsem/lib/rwsem.c
--- 2.4.10pre12/lib/rwsem.c	Sat Jul 21 00:04:34 2001
+++ rwsem/lib/rwsem.c	Thu Sep 20 05:27:06 2001
@@ -1,210 +1,63 @@
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
+/*
+ *  rw_semaphores generic spinlock version
+ *  Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
  */
-#include <linux/rwsem.h>
+
 #include <linux/sched.h>
 #include <linux/module.h>
+#include <asm/semaphore.h>
 
-struct rwsem_waiter {
-	struct list_head	list;
-	struct task_struct	*task;
-	unsigned int		flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
+struct rwsem_wait_queue {
+	unsigned long retire;
+	struct task_struct * task;
+	struct list_head task_list;
 };
 
-#if RWSEM_DEBUG
-#undef rwsemtrace
-void rwsemtrace(struct rw_semaphore *sem, const char *str)
-{
-	printk("sem=%p\n",sem);
-	printk("(sem)=%08lx\n",sem->count);
-	if (sem->debug)
-		printk("[%d] %s({%08lx})\n",current->pid,str,sem->count);
-}
-#endif
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
- *   - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
- *   - there must be someone on the queue
- * - the spinlock must be held by the caller
- * - woken process blocks are discarded from the list after having flags zeroised
- */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-	struct list_head *next;
-	signed long oldcount;
-	int woken, loop;
-
-	rwsemtrace(sem,"Entering __rwsem_do_wake");
-
-	/* only wake someone up if we can transition the active part of the count from 0 -> 1 */
- try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
-	if (oldcount & RWSEM_ACTIVE_MASK)
-		goto undo;
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-
-	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - note we leave the 'active part' of the count incremented by 1 and the waiting part
-	 *   incremented by 0x00010000
-	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
-
-	list_del(&waiter->list);
-	waiter->flags = 0;
-	wake_up_process(waiter->task);
-	goto out;
-
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
-	 */
- readers_only:
-	woken = 0;
-	do {
-		woken++;
-
-		if (waiter->list.next==&sem->wait_list)
-			break;
-
-		waiter = list_entry(waiter->list.next,struct rwsem_waiter,list);
-
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
-
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
-	rwsem_atomic_add(woken,sem);
-
-	next = sem->wait_list.next;
-	for (; loop>0; loop--) {
-		waiter = list_entry(next,struct rwsem_waiter,list);
-		next = waiter->list.next;
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
-	rwsemtrace(sem,"Leaving __rwsem_do_wake");
-	return sem;
-
-	/* undo the change to count, but check for a transition 1->0 */
- undo:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem)!=0)
-		goto out;
-	goto try_again;
-}
-
-/*
- * wait for a lock to be granted
- */
-static inline struct rw_semaphore *rwsem_down_failed_common(struct rw_semaphore *sem,
-								 struct rwsem_waiter *waiter,
-								 signed long adjustment)
+void rwsem_down_failed(struct rw_semaphore *sem, long retire)
 {
 	struct task_struct *tsk = current;
-	signed long count;
-
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	spin_lock(&sem->wait_lock);
-	waiter->task = tsk;
-
-	list_add_tail(&waiter->list,&sem->wait_list);
-
-	/* note that we're now waiting on the lock, but no longer actively read-locking */
-	count = rwsem_atomic_update(adjustment,sem);
-
-	/* if there are no longer active locks, wake the front queued process(es) up
-	 * - it might even be this process, since the waker takes a more active part
-	 */
-	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem);
+	struct rwsem_wait_queue wait;
 
-	spin_unlock(&sem->wait_lock);
+	sem->count += retire;
+	wait.retire = retire;
+	wait.task = tsk;
+	INIT_LIST_HEAD(&wait.task_list);
+	list_add(&wait.task_list, &sem->wait);
 
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter->flags)
-			break;
+	do {
+		__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&sem->lock);
 		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
-	return sem;
-}
-
-/*
- * wait for the read lock to be granted
- */
-struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-
-	rwsemtrace(sem,"Entering rwsem_down_read_failed");
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem,&waiter,RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_read_failed");
-	return sem;
+		spin_lock(&sem->lock);
+	} while(wait.task_list.next);
 }
 
-/*
- * wait for the write lock to be granted
- */
-struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem)
+void rwsem_wake(struct rw_semaphore *sem)
 {
-	struct rwsem_waiter waiter;
+	struct list_head * entry, * head = &sem->wait;
+	int last = 0;
 
-	rwsemtrace(sem,"Entering rwsem_down_write_failed");
+	while ((entry = head->prev) != head) {
+		struct rwsem_wait_queue * wait;
 
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem,&waiter,-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_write_failed");
-	return sem;
-}
+		wait = list_entry(entry, struct rwsem_wait_queue, task_list);
 
-/*
- * handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
- */
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering rwsem_wake");
+		if (wait->retire == RWSEM_WRITE_BLOCKING_BIAS) {
+			if (sem->count & RWSEM_READ_MASK)
+				break;
+			last = 1;
+		}
 
-	spin_lock(&sem->wait_lock);
-
-	/* do nothing if list empty */
-	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving rwsem_wake");
-
-	return sem;
+		/* convert write lock into read lock when read become active */
+		sem->count -= wait->retire;
+		list_del(entry);
+		entry->next = NULL;
+		wake_up_process(wait->task);
+			
+		if (last)
+			break;
+	}
 }
 
-EXPORT_SYMBOL_NOVERS(rwsem_down_read_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_down_write_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_wake);
-#if RWSEM_DEBUG
-EXPORT_SYMBOL(rwsemtrace);
-#endif
+EXPORT_SYMBOL(rwsem_down_failed);
+EXPORT_SYMBOL(rwsem_wake);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:19                   ` Manfred Spraul
@ 2001-09-20  2:07                     ` Andrea Arcangeli
  2001-09-20  4:37                       ` Andrea Arcangeli
  2001-09-20  7:05                       ` David Howells
  0 siblings, 2 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-20  2:07 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: David Howells, Linus Torvalds, Ulrich.Weigand, linux-kernel

On Wed, Sep 19, 2001 at 08:19:09PM +0200, Manfred Spraul wrote:
> > if we go generic then I strongly recommend my version of the generic
> > semaphores is _much_ faster (and cleaner) than this one (it even
> allows
> > more than 2^31 concurrent readers on 64 bit archs ;).
> >
> Andrea,
> 
> implementing recursive semaphores is trivial, but do you have any idea
> how to fix the latency problem?

yes, one solution to the latency problem, without writing the ugly
code, would be simply to add a per-process counter that is passed to a
modified rwsem API, and then to hide the trickery in an mm_down_read
macro.  That way it will be recursive _and_ fair.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 23:34                         ` Andrea Arcangeli
@ 2001-09-19 23:46                           ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:46 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Manfred Spraul, Ulrich.Weigand, linux-kernel

On Thu, Sep 20, 2001 at 01:34:21AM +0200, Andrea Arcangeli wrote:
> On Thu, Sep 20, 2001 at 12:25:40AM +0100, David Howells wrote:
> > 
> > Andrea Arcangeli <andrea@suse.de> wrote:
> > > On Wed, Sep 19, 2001 at 07:26:33PM +0100, David Howells wrote:
> > > > > if we go generic then I strongly recommend my version of the generic
> > > > > semaphores is _much_ faster (and cleaner) than this one
> > > >
> > > > Not so:-) Your patch, Andrea, grabs the spinlock far more than is necessary.
> > > 
> > > then why your microbenchmarks says my version is much faster?
> > 
> > They don't:
> 
> ok, so I drop my objections, you must have changed something radical

hey no! It was a very unfair comparison: my rwsems are not inlined,
yours are.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 23:25                       ` David Howells
@ 2001-09-19 23:34                         ` Andrea Arcangeli
  2001-09-19 23:46                           ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 23:34 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Manfred Spraul, Ulrich.Weigand, linux-kernel

On Thu, Sep 20, 2001 at 12:25:40AM +0100, David Howells wrote:
> 
> Andrea Arcangeli <andrea@suse.de> wrote:
> > On Wed, Sep 19, 2001 at 07:26:33PM +0100, David Howells wrote:
> > > > if we go generic then I strongly recommend my version of the generic
> > > > semaphores is _much_ faster (and cleaner) than this one
> > >
> > > Not so:-) Your patch, Andrea, grabs the spinlock far more than is necessary.
> > 
> > then why your microbenchmarks says my version is much faster?
> 
> They don't:

ok, so I drop my objections.  You must have changed something radical
since the last time I benchmarked your generic rw-semaphores; I must
have missed those changes, sorry.  So I'll just drop my rwsem patch
and apply yours.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:47                     ` Andrea Arcangeli
@ 2001-09-19 23:25                       ` David Howells
  2001-09-19 23:34                         ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19 23:25 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel


Andrea Arcangeli <andrea@suse.de> wrote:
> On Wed, Sep 19, 2001 at 07:26:33PM +0100, David Howells wrote:
> > > if we go generic then I strongly recommend my version of the generic
> > > semaphores is _much_ faster (and cleaner) than this one
> >
> > Not so:-) Your patch, Andrea, grabs the spinlock far more than is necessary.
> 
> then why your microbenchmarks says my version is much faster?

They don't:

[my rwsem-unfair-2.diff patch]
#rds #wrs time :  rd granted [spread]  wr granted [spread]
==== ==== =====   ===================  ===================
   1    0   10s:    25384639 [0.008%]           0 [0.000%]
   2    0   10s:     8007388 [0.013%]           0 [0.000%]
   4    0   10s:     8004833 [0.007%]           0 [0.000%]
   0    1   10s:           0 [0.000%]    25717288 [0.009%]
   0    4   10s:           0 [0.000%]     2817293 [2.323%]
   4    2   10s:    12742523 [0.008%]         114 [5.394%]
  30    1   10s:    12739163 [0.002%]          92 [1.739%]
  30   15   50s:    63733591 [0.003%]         286 [2.711%]

[Andrea's 00_rwsem-19 patch]
#rds #wrs time :  rd granted [spread]  wr granted [spread]
==== ==== =====   ===================  ===================
   1    0   10s:    24577938 [0.003%]           0 [0.000%]
   2    0   10s:     5631224 [0.016%]           0 [0.000%]
   4    0   10s:     5633280 [0.009%]           0 [0.000%]
   0    1   10s:           0 [0.000%]    24572785 [0.007%]
   0    4   10s:           0 [0.000%]     3444786 [0.091%]
   4    2   10s:     8036291 [0.008%]         122 [4.918%]
  30    1   10s:     8036490 [0.006%]          89 [1.253%]
  30   15   50s:    40200472 [0.001%]         265 [2.717%]

[Unpatched kernel]
#rds #wrs time :  rd granted [spread]  wr granted [spread]
==== ==== =====   ===================  ===================
   1    0   10s:    30118903 [0.008%]           0 [0.000%]
   2    0   10s:    12543737 [0.007%]           0 [0.000%]
   4    0   10s:    12543731 [0.006%]           0 [0.000%]
   0    1   10s:           0 [0.000%]    28408147 [0.008%]
   0    4   10s:           0 [0.000%]     1554495 [0.172%]
   4    2   10s:     1119656 [0.076%]      560020 [0.077%]
  30    1   10s:     5361923 [0.024%]       55787 [0.012%]
  30   15   50s:     5322473 [0.049%]     2661621 [0.050%]

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 21:14                       ` Benjamin LaHaise
@ 2001-09-19 22:07                         ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 22:07 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

On Wed, Sep 19, 2001 at 05:14:04PM -0400, Benjamin LaHaise wrote:
> On Wed, Sep 19, 2001 at 08:45:46PM +0200, Andrea Arcangeli wrote:
> > To be pedantic the only idea I shared with the old code (but that's just
> > the idea, not the implementation, so AFIK only a patent on such idea
> > could protect it from its free usage usage) is to return the rwsem again
> > from rwsem_wake and friends to avoid saving it in the asm slow path, and
> > I written that:
> 
> Your patch moved a bunch of code into asm-i386/rwsem_xchgadd.h.  That 
> code was derived from the spinlock code by me into the first rwsems, 
> then David reworked bits of it, as wel as you.  But there is no 
> copyright on that file indicating this heritage.  If you look at 
> how strict commercial copyright control can be, even copying a 
> single line of code mentally by retyping it can still mandate the 
> copyright legacy.  I'm sure it's just an oversight, but it's 
> probably one we *all* need to be reminded of every now and again.

I recall that I wrote that code without copying anything.  I certainly
copied the way of doing things (pushl %dx,%cx and saving %ax by
returning it in the slow path), as I wrote in the comment, but not the
code itself.  In fact, at first I was probably also pushing %eax, and
even now that I use the same logic the two versions should be slightly
different.  And I think mine has a race, which is why I'm not using
that code; with the unfair change it is scheduled for removal anyway,
since it's unfixable.  It really does look to me, too, as if I cut and
pasted the `static inline void __down_read(struct rw_semaphore *sem)`
declarations, since I tend to leave a space between "*" and the
variable name, but not always.

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:45                     ` Andrea Arcangeli
@ 2001-09-19 21:14                       ` Benjamin LaHaise
  2001-09-19 22:07                         ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Benjamin LaHaise @ 2001-09-19 21:14 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

On Wed, Sep 19, 2001 at 08:45:46PM +0200, Andrea Arcangeli wrote:
> To be pedantic the only idea I shared with the old code (but that's just
> the idea, not the implementation, so AFIK only a patent on such idea
> could protect it from its free usage usage) is to return the rwsem again
> from rwsem_wake and friends to avoid saving it in the asm slow path, and
> I written that:

Your patch moved a bunch of code into asm-i386/rwsem_xchgadd.h.  That 
code was derived from the spinlock code by me into the first rwsems, 
then David reworked bits of it, as well as you.  But there is no 
copyright on that file indicating this heritage.  If you look at 
how strict commercial copyright control can be, even copying a 
single line of code mentally by retyping it can still mandate the 
copyright legacy.  I'm sure it's just an oversight, but it's 
probably one we *all* need to be reminded of every now and again.

		-ben

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:27                     ` David Howells
@ 2001-09-19 18:48                       ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 18:48 UTC (permalink / raw)
  To: David Howells
  Cc: Benjamin LaHaise, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

On Wed, Sep 19, 2001 at 07:27:40PM +0100, David Howells wrote:
> 
> > I don't know about you, but I'm mildly concerned that copyright attributions 
> > vanished.
> 
> I concur with that.

can you be a little more specific about which copyright attributions
vanished by mistake?

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:26                   ` David Howells
@ 2001-09-19 18:47                     ` Andrea Arcangeli
  2001-09-19 23:25                       ` David Howells
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 18:47 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Manfred Spraul, Ulrich.Weigand, linux-kernel

On Wed, Sep 19, 2001 at 07:26:33PM +0100, David Howells wrote:
> > if we go generic then I strongly recommend my version of the generic
> > semaphores is _much_ faster (and cleaner) than this one
> 
> Not so:-) Your patch, Andrea, grabs the spinlock far more than is necessary.

then why do your microbenchmarks say my version is much faster?

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:16                   ` Benjamin LaHaise
  2001-09-19 18:27                     ` David Howells
@ 2001-09-19 18:45                     ` Andrea Arcangeli
  2001-09-19 21:14                       ` Benjamin LaHaise
  1 sibling, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 18:45 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

On Wed, Sep 19, 2001 at 02:16:56PM -0400, Benjamin LaHaise wrote:
> I don't know about you, but I'm mildly concerned that copyright attributions 
> vanished.

can you be more precise? Which copyright attribution? My rwsems are
written totally from scratch; they don't share anything with the
previous code, it is just a code replacement.  As far as I can tell I
deleted some files, and with them also their headers, but that does
not seem to infringe any copyright; I didn't remove only the copyright
attribution.  If I did remove only the copyright attribution, please
let me know of course; that would have been a silly mistake.

To be pedantic, the only idea I shared with the old code (but that's
just the idea, not the implementation, so AFAIK only a patent on such
an idea could protect it from free usage) is to return the rwsem again
from rwsem_wake and friends, to avoid saving it in the asm slow path,
and I wrote this:

/*
 * We return the semaphore itself from the C functions so we can pass it
 * in %eax via regparm and we don't need to declare %eax clobbered by C.
 * This is mostly for x86 but maybe other archs can make a use of it
 * too.
 * Idea is from David Howells <dhowells@redhat.com>.
 */

And the xadd version is scheduled for removal soon anyway, btw (David
just dropped it in his implementation).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:16                   ` Benjamin LaHaise
@ 2001-09-19 18:27                     ` David Howells
  2001-09-19 18:48                       ` Andrea Arcangeli
  2001-09-19 18:45                     ` Andrea Arcangeli
  1 sibling, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19 18:27 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Andrea Arcangeli, David Howells, Linus Torvalds, Manfred Spraul,
	Ulrich.Weigand, linux-kernel


> I don't know about you, but I'm mildly concerned that copyright attributions 
> vanished.

I concur with that.

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:03                 ` Andrea Arcangeli
  2001-09-19 18:16                   ` Benjamin LaHaise
  2001-09-19 18:19                   ` Manfred Spraul
@ 2001-09-19 18:26                   ` David Howells
  2001-09-19 18:47                     ` Andrea Arcangeli
  2 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19 18:26 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

> if we go generic then I strongly recommend my version of the generic
> semaphores is _much_ faster (and cleaner) than this one

Not so:-) Your patch, Andrea, grabs the spinlock far more than is necessary.

> (it even allows more than 2^31 concurrent readers on 64 bit archs ;).

Easy enough to fix. Just apply this as well:

--- linux-rwsem-old/include/linux/rwsem.h       Wed Sep 19 19:23:44 2001
+++ linux-rwsem/include/linux/rwsem.h   Wed Sep 19 19:23:47 2001
@@ -26,7 +26,7 @@
  * - if wait_list is not empty, then there are processes waiting for the semaphore
  */
 struct rw_semaphore {
-       int                     activity;
+       long                    activity;
        spinlock_t              lock;
        struct list_head        wait_list;
 };

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:03                 ` Andrea Arcangeli
  2001-09-19 18:16                   ` Benjamin LaHaise
@ 2001-09-19 18:19                   ` Manfred Spraul
  2001-09-20  2:07                     ` Andrea Arcangeli
  2001-09-19 18:26                   ` David Howells
  2 siblings, 1 reply; 49+ messages in thread
From: Manfred Spraul @ 2001-09-19 18:19 UTC (permalink / raw)
  To: Andrea Arcangeli, David Howells
  Cc: Linus Torvalds, Ulrich.Weigand, linux-kernel

> if we go generic then I strongly recommend my version of the generic
> semaphores is _much_ faster (and cleaner) than this one (it even
allows
> more than 2^31 concurrent readers on 64 bit archs ;).
>
Andrea,

implementing recursive semaphores is trivial, but do you have any idea
how to fix the latency problem?
Multithreaded, I/O-bound apps that use mmap are unusable with unfair
semaphores.

My test app hangs for minutes within mprotect().

--
    Manfred




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 18:03                 ` Andrea Arcangeli
@ 2001-09-19 18:16                   ` Benjamin LaHaise
  2001-09-19 18:27                     ` David Howells
  2001-09-19 18:45                     ` Andrea Arcangeli
  2001-09-19 18:19                   ` Manfred Spraul
  2001-09-19 18:26                   ` David Howells
  2 siblings, 2 replies; 49+ messages in thread
From: Benjamin LaHaise @ 2001-09-19 18:16 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: David Howells, Linus Torvalds, Manfred Spraul, Ulrich.Weigand,
	linux-kernel

On Wed, Sep 19, 2001 at 08:03:57PM +0200, Andrea Arcangeli wrote:
> I inlined the patch below (may need some conversion to likely/unlikely
> as well).

I don't know about you, but I'm mildly concerned that copyright attributions 
vanished.

		-ben

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 14:53               ` David Howells
@ 2001-09-19 18:03                 ` Andrea Arcangeli
  2001-09-19 18:16                   ` Benjamin LaHaise
                                     ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 18:03 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Manfred Spraul, Ulrich.Weigand, linux-kernel

On Wed, Sep 19, 2001 at 03:53:23PM +0100, David Howells wrote:
> 
> Here's a patch to make rwsems unfair.

if we go generic then I strongly recommend my version of the generic
semaphores: it is _much_ faster (and cleaner) than this one (it even
allows more than 2^31 concurrent readers on 64 bit archs ;).

just apply this patch on top of pre12:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.10pre11aa1/00_rwsem-19

and then I'll send another incremental patch to kill the XCHGADD
garbage (but this patch alone is enough to provide efficient, unfair
rwsems on all architectures, thus fixing the deadlock).

I inlined the patch below (may need some conversion to likely/unlikely
as well).

diff -urN rwsem-ref/arch/alpha/config.in rwsem/arch/alpha/config.in
--- rwsem-ref/arch/alpha/config.in	Thu Aug 16 22:03:22 2001
+++ rwsem/arch/alpha/config.in	Tue Sep 18 10:31:29 2001
@@ -5,8 +5,7 @@
 
 define_bool CONFIG_ALPHA y
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
+define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
 
 mainmenu_name "Kernel configuration of Linux for Alpha machines"
 
diff -urN rwsem-ref/arch/arm/config.in rwsem/arch/arm/config.in
--- rwsem-ref/arch/arm/config.in	Thu Aug 16 22:03:22 2001
+++ rwsem/arch/arm/config.in	Tue Sep 18 10:31:29 2001
@@ -10,8 +10,6 @@
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
-
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN rwsem-ref/arch/cris/config.in rwsem/arch/cris/config.in
--- rwsem-ref/arch/cris/config.in	Sat Aug 11 08:03:53 2001
+++ rwsem/arch/cris/config.in	Tue Sep 18 10:31:29 2001
@@ -6,7 +6,6 @@
 
 define_bool CONFIG_UID16 y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN rwsem-ref/arch/i386/config.in rwsem/arch/i386/config.in
--- rwsem-ref/arch/i386/config.in	Sat Jul 21 00:04:05 2001
+++ rwsem/arch/i386/config.in	Tue Sep 18 10:31:29 2001
@@ -51,7 +51,6 @@
    define_bool CONFIG_X86_XADD n
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
    define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 else
    define_bool CONFIG_X86_WP_WORKS_OK y
    define_bool CONFIG_X86_INVLPG y
@@ -59,8 +58,7 @@
    define_bool CONFIG_X86_XADD y
    define_bool CONFIG_X86_BSWAP y
    define_bool CONFIG_X86_POPAD_OK y
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
+   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
 fi
 if [ "$CONFIG_M486" = "y" ]; then
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
diff -urN rwsem-ref/arch/ia64/config.in rwsem/arch/ia64/config.in
--- rwsem-ref/arch/ia64/config.in	Sat Aug 11 08:03:54 2001
+++ rwsem/arch/ia64/config.in	Tue Sep 18 10:31:29 2001
@@ -24,7 +24,6 @@
 define_bool CONFIG_MCA n
 define_bool CONFIG_SBUS n
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 if [ "$CONFIG_IA64_HP_SIM" = "n" ]; then
   define_bool CONFIG_ACPI y
diff -urN rwsem-ref/arch/m68k/config.in rwsem/arch/m68k/config.in
--- rwsem-ref/arch/m68k/config.in	Wed Jul  4 04:03:45 2001
+++ rwsem/arch/m68k/config.in	Tue Sep 18 10:31:29 2001
@@ -5,7 +5,6 @@
 
 define_bool CONFIG_UID16 y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux/68k Kernel Configuration"
 
diff -urN rwsem-ref/arch/mips/config.in rwsem/arch/mips/config.in
--- rwsem-ref/arch/mips/config.in	Tue Sep 18 02:42:02 2001
+++ rwsem/arch/mips/config.in	Tue Sep 18 10:31:29 2001
@@ -69,7 +69,6 @@
 bool 'Support for Alchemy Semi PB1000 board' CONFIG_MIPS_PB1000
 
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 #
 # Select some configuration options automatically for certain systems.
diff -urN rwsem-ref/arch/mips64/config.in rwsem/arch/mips64/config.in
--- rwsem-ref/arch/mips64/config.in	Tue Sep 18 02:42:10 2001
+++ rwsem/arch/mips64/config.in	Tue Sep 18 10:31:29 2001
@@ -28,7 +28,6 @@
 endmenu
 
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 #
 # Select some configuration options automatically based on user selections
diff -urN rwsem-ref/arch/parisc/config.in rwsem/arch/parisc/config.in
--- rwsem-ref/arch/parisc/config.in	Tue May  1 19:35:20 2001
+++ rwsem/arch/parisc/config.in	Tue Sep 18 10:31:29 2001
@@ -8,7 +8,6 @@
 define_bool CONFIG_PARISC y
 define_bool CONFIG_UID16 n
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN rwsem-ref/arch/ppc/config.in rwsem/arch/ppc/config.in
--- rwsem-ref/arch/ppc/config.in	Tue Sep 18 02:42:16 2001
+++ rwsem/arch/ppc/config.in	Tue Sep 18 10:31:29 2001
@@ -4,8 +4,7 @@
 # see Documentation/kbuild/config-language.txt.
 #
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
+define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
 
 mainmenu_name "Linux/PowerPC Kernel Configuration"
 
diff -urN rwsem-ref/arch/s390/config.in rwsem/arch/s390/config.in
--- rwsem-ref/arch/s390/config.in	Sat Aug 11 08:03:56 2001
+++ rwsem/arch/s390/config.in	Tue Sep 18 10:31:29 2001
@@ -8,7 +8,6 @@
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -urN rwsem-ref/arch/s390x/config.in rwsem/arch/s390x/config.in
--- rwsem-ref/arch/s390x/config.in	Sat Aug 11 08:04:00 2001
+++ rwsem/arch/s390x/config.in	Tue Sep 18 10:31:29 2001
@@ -7,7 +7,6 @@
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -urN rwsem-ref/arch/sh/config.in rwsem/arch/sh/config.in
--- rwsem-ref/arch/sh/config.in	Tue Sep 18 02:42:19 2001
+++ rwsem/arch/sh/config.in	Tue Sep 18 10:31:29 2001
@@ -8,7 +8,6 @@
 
 define_bool CONFIG_UID16 y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -urN rwsem-ref/arch/sparc/config.in rwsem/arch/sparc/config.in
--- rwsem-ref/arch/sparc/config.in	Wed Jul  4 04:03:45 2001
+++ rwsem/arch/sparc/config.in	Tue Sep 18 10:31:29 2001
@@ -49,7 +49,6 @@
 define_bool CONFIG_SUN_AUXIO y
 define_bool CONFIG_SUN_IO y
 define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 bool 'Support for SUN4 machines (disables SUN4[CDM] support)' CONFIG_SUN4
 if [ "$CONFIG_SUN4" != "y" ]; then
diff -urN rwsem-ref/arch/sparc64/config.in rwsem/arch/sparc64/config.in
--- rwsem-ref/arch/sparc64/config.in	Thu Aug 16 22:03:25 2001
+++ rwsem/arch/sparc64/config.in	Tue Sep 18 10:31:29 2001
@@ -33,8 +33,8 @@
 
 # Global things across all Sun machines.
 define_bool CONFIG_HAVE_DEC_LOCK y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
+# sorry I broke it again
+define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
 define_bool CONFIG_ISA n
 define_bool CONFIG_ISAPNP n
 define_bool CONFIG_EISA n
diff -urN rwsem-ref/include/asm-alpha/rwsem_xchgadd.h rwsem/include/asm-alpha/rwsem_xchgadd.h
--- rwsem-ref/include/asm-alpha/rwsem_xchgadd.h	Thu Jan  1 01:00:00 1970
+++ rwsem/include/asm-alpha/rwsem_xchgadd.h	Tue Sep 18 10:31:29 2001
@@ -0,0 +1,27 @@
+#ifndef _ALPHA_RWSEM_XCHGADD_H
+#define _ALPHA_RWSEM_XCHGADD_H
+
+/* WRITEME */
+
+static inline void __down_read(struct rw_semaphore *sem)
+{
+}
+
+static inline void __down_write(struct rw_semaphore *sem)
+{
+}
+
+static inline void __up_read(struct rw_semaphore *sem)
+{
+}
+
+static inline void __up_write(struct rw_semaphore *sem)
+{
+}
+
+static inline long rwsem_xchgadd(long value, long * count)
+{
+	return value;
+}
+
+#endif
diff -urN rwsem-ref/include/asm-i386/rwsem.h rwsem/include/asm-i386/rwsem.h
--- rwsem-ref/include/asm-i386/rwsem.h	Fri Aug 17 05:02:27 2001
+++ rwsem/include/asm-i386/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,226 +0,0 @@
-/* rwsem.h: R/W semaphores implemented using XADD/CMPXCHG for i486+
- *
- * Written by David Howells (dhowells@redhat.com).
- *
- * Derived from asm-i386/semaphore.h
- *
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consequtive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _I386_RWSEM_H
-#define _I386_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *FASTCALL(rwsem_down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *));
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	signed long		count;
-#define RWSEM_UNLOCKED_VALUE		0x00000000
-#define RWSEM_ACTIVE_BIAS		0x00000001
-#define RWSEM_ACTIVE_MASK		0x0000ffff
-#define RWSEM_WAITING_BIAS		(-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
-	__RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning down_read\n\t"
-LOCK_PREFIX	"  incl      (%%eax)\n\t" /* adds 0x00000001, returns the old value */
-		"  js        2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  pushl     %%edx\n\t"
-		"  call      rwsem_down_read_failed\n\t"
-		"  popl      %%edx\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous"
-		"# ending down_read\n\t"
-		: "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	int tmp;
-
-	tmp = RWSEM_ACTIVE_WRITE_BIAS;
-	__asm__ __volatile__(
-		"# beginning down_write\n\t"
-LOCK_PREFIX	"  xadd      %0,(%%eax)\n\t" /* subtract 0x0000ffff, returns the old value */
-		"  testl     %0,%0\n\t" /* was the count 0 before? */
-		"  jnz       2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_down_write_failed\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending down_write"
-		: "+d"(tmp), "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	__s32 tmp = -RWSEM_ACTIVE_READ_BIAS;
-	__asm__ __volatile__(
-		"# beginning __up_read\n\t"
-LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n\t" /* subtracts 1, returns the old value */
-		"  js        2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* do nothing if still outstanding active readers */
-		"  jnz       1b\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_read\n"
-		: "+m"(sem->count), "+d"(tmp)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning __up_write\n\t"
-		"  movl      %2,%%edx\n\t"
-LOCK_PREFIX	"  xaddl     %%edx,(%%eax)\n\t" /* tries to transition 0xffff0001 -> 0x00000000 */
-		"  jnz       2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* did the active count reduce to 0? */
-		"  jnz       1b\n\t" /* jump back if not */
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_write\n"
-		: "+m"(sem->count)
-		: "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS)
-		: "memory", "cc", "edx");
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-LOCK_PREFIX	"addl %1,%0"
-		:"=m"(sem->count)
-		:"ir"(delta), "m"(sem->count));
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
-	int tmp = delta;
-
-	__asm__ __volatile__(
-LOCK_PREFIX	"xadd %0,(%2)"
-		: "+r"(tmp), "=m"(sem->count)
-		: "r"(sem), "m"(sem->count)
-		: "memory");
-
-	return tmp+delta;
-}
-
-#endif /* __KERNEL__ */
-#endif /* _I386_RWSEM_H */
diff -urN rwsem-ref/include/asm-i386/rwsem_xchgadd.h rwsem/include/asm-i386/rwsem_xchgadd.h
--- rwsem-ref/include/asm-i386/rwsem_xchgadd.h	Thu Jan  1 01:00:00 1970
+++ rwsem/include/asm-i386/rwsem_xchgadd.h	Tue Sep 18 10:31:29 2001
@@ -0,0 +1,93 @@
+#ifndef _X86_RWSEM_XCHGADD_H
+#define _X86_RWSEM_XCHGADD_H
+
+static inline void __down_read(struct rw_semaphore *sem)
+{
+	__asm__ __volatile__(LOCK "incl %0\n\t"
+			     "js 2f\n"
+			     "1:\n"
+			     ".section .text.lock,\"ax\"\n" 
+			     "2:\t"
+			     "pushl %%edx\n\t"
+			     "pushl %%ecx\n\t"
+			     "movl %2, %%edx\n\t"
+			     "call rwsem_down_failed\n\t"
+			     "popl %%ecx\n\t"
+			     "popl %%edx\n\t"
+			     "jmp 1b\n"
+			     ".previous"
+			     : "+m" (sem->count)
+			     : "a" (sem), "i" (RWSEM_READ_BLOCKING_BIAS)
+			     : "memory", "cc");
+}
+
+static inline void __down_write(struct rw_semaphore *sem)
+{
+	long count;
+
+	count = RWSEM_WRITE_BIAS + RWSEM_READ_BIAS;
+	__asm__ __volatile(LOCK "xaddl %0, %1\n\t"
+			   "testl %0,%0\n\t"
+			   "jnz 2f\n"
+			   "1:\n"
+			   ".section .text.lock,\"ax\"\n"
+			   "2:\t"
+			   "pushl %%ecx\n\t"
+			   "movl %3, %%edx\n\t"
+			   "call rwsem_down_failed\n\t"
+			   "popl %%ecx\n\t"
+			   "jmp 1b\n"
+			   ".previous"
+			   : "+d" (count), "+m" (sem->count)
+			   : "a" (sem), "i" (RWSEM_WRITE_BLOCKING_BIAS)
+			   : "memory", "cc");
+}
+
+static inline void __up_read(struct rw_semaphore *sem)
+{
+	long count;
+
+	count = -RWSEM_READ_BIAS;
+	__asm__ __volatile__(LOCK "xaddl %0, %1\n\t"
+			     "js 2f\n"
+			     "1:\n"
+			     ".section .text.lock,\"ax\"\n"
+			     "2:\t"
+			     "cmpw $1, %w0\n\t"
+			     "jnz 1b\n\t"
+			     "pushl %%ecx\n\t"
+			     "call rwsem_wake\n\t"
+			     "popl %%ecx\n\t"
+			     "jmp 1b\n"
+			     ".previous"
+			     : "+d" (count), "+m" (sem->count)
+			     : "a" (sem)
+			     : "memory", "cc");
+}
+static inline void __up_write(struct rw_semaphore *sem)
+{
+	__asm__ __volatile__(LOCK "subl %2, %0\n\t"
+			     "js 2f\n"
+			     "1:\n"
+			     ".section .text.lock,\"ax\"\n"
+			     "2:\t"
+			     "pushl %%edx\n\t"
+			     "pushl %%ecx\n\t"
+			     "call rwsem_wake\n\t"
+			     "popl %%ecx\n\t"
+			     "popl %%edx\n\t"
+			     "jmp 1b\n"
+			     ".previous"
+			     : "+m" (sem->count)
+			     : "a" (sem), "i" (RWSEM_READ_BIAS + RWSEM_WRITE_BIAS)
+			     : "memory", "cc");
+}
+
+static inline long rwsem_xchgadd(long value, long * count)
+{
+	__asm__ __volatile__(LOCK "xaddl %0,%1"
+			     : "+r" (value), "+m" (*count));
+	return value;
+}
+
+#endif
diff -urN rwsem-ref/include/linux/rwsem-spinlock.h rwsem/include/linux/rwsem-spinlock.h
--- rwsem-ref/include/linux/rwsem-spinlock.h	Wed Aug 29 15:05:24 2001
+++ rwsem/include/linux/rwsem-spinlock.h	Thu Jan  1 01:00:00 1970
@@ -1,62 +0,0 @@
-/* rwsem-spinlock.h: fallback C implementation
- *
- * Copyright (c) 2001   David Howells (dhowells@redhat.com).
- * - Derived partially from ideas by Andrea Arcangeli <andrea@suse.de>
- * - Derived also from comments by Linus
- */
-
-#ifndef _LINUX_RWSEM_SPINLOCK_H
-#define _LINUX_RWSEM_SPINLOCK_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include linux/rwsem-spinlock.h directly, use linux/rwsem.h instead
-#endif
-
-#include <linux/spinlock.h>
-#include <linux/list.h>
-
-#ifdef __KERNEL__
-
-#include <linux/types.h>
-
-struct rwsem_waiter;
-
-/*
- * the rw-semaphore definition
- * - if activity is 0 then there are no active readers or writers
- * - if activity is +ve then that is the number of active readers
- * - if activity is -1 then there is one active writer
- * - if wait_list is not empty, then there are processes waiting for the semaphore
- */
-struct rw_semaphore {
-	__s32			activity;
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void FASTCALL(init_rwsem(struct rw_semaphore *sem));
-extern void FASTCALL(__down_read(struct rw_semaphore *sem));
-extern void FASTCALL(__down_write(struct rw_semaphore *sem));
-extern void FASTCALL(__up_read(struct rw_semaphore *sem));
-extern void FASTCALL(__up_write(struct rw_semaphore *sem));
-
-#endif /* __KERNEL__ */
-#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff -urN rwsem-ref/include/linux/rwsem.h rwsem/include/linux/rwsem.h
--- rwsem-ref/include/linux/rwsem.h	Wed Aug 29 15:05:24 2001
+++ rwsem/include/linux/rwsem.h	Tue Sep 18 10:31:34 2001
@@ -1,80 +1,19 @@
-/* rwsem.h: R/W semaphores, public interface
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from asm-i386/semaphore.h
- */
-
 #ifndef _LINUX_RWSEM_H
 #define _LINUX_RWSEM_H
 
-#include <linux/linkage.h>
-
-#define RWSEM_DEBUG 0
-
 #ifdef __KERNEL__
 
 #include <linux/config.h>
-#include <linux/types.h>
-#include <linux/kernel.h>
-#include <asm/system.h>
-#include <asm/atomic.h>
 
-struct rw_semaphore;
+#undef RWSEM_DEBUG
 
 #ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
-#include <linux/rwsem-spinlock.h> /* use a generic implementation */
-#else
-#include <asm/rwsem.h> /* use an arch-specific implementation */
-#endif
-
-#ifndef rwsemtrace
-#if RWSEM_DEBUG
-extern void FASTCALL(rwsemtrace(struct rw_semaphore *sem, const char *str));
+#include <linux/rwsem_spinlock.h>
+#elif defined(CONFIG_RWSEM_XCHGADD)
+#include <linux/rwsem_xchgadd.h>
 #else
-#define rwsemtrace(SEM,FMT)
+#include <asm/rwsem.h>
 #endif
-#endif
-
-/*
- * lock for reading
- */
-static inline void down_read(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering down_read");
-	__down_read(sem);
-	rwsemtrace(sem,"Leaving down_read");
-}
-
-/*
- * lock for writing
- */
-static inline void down_write(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering down_write");
-	__down_write(sem);
-	rwsemtrace(sem,"Leaving down_write");
-}
-
-/*
- * release a read lock
- */
-static inline void up_read(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering up_read");
-	__up_read(sem);
-	rwsemtrace(sem,"Leaving up_read");
-}
-
-/*
- * release a write lock
- */
-static inline void up_write(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering up_write");
-	__up_write(sem);
-	rwsemtrace(sem,"Leaving up_write");
-}
-
 
 #endif /* __KERNEL__ */
 #endif /* _LINUX_RWSEM_H */
diff -urN rwsem-ref/include/linux/rwsem_spinlock.h rwsem/include/linux/rwsem_spinlock.h
--- rwsem-ref/include/linux/rwsem_spinlock.h	Thu Jan  1 01:00:00 1970
+++ rwsem/include/linux/rwsem_spinlock.h	Tue Sep 18 10:31:34 2001
@@ -0,0 +1,62 @@
+#ifndef _LINUX_RWSEM_SPINLOCK_H
+#define _LINUX_RWSEM_SPINLOCK_H
+
+#include <linux/compiler.h>
+#include <linux/kernel.h>
+
+struct rw_semaphore
+{
+	spinlock_t lock;
+	long count;
+#define RWSEM_READ_BIAS 1
+#define RWSEM_WRITE_BIAS (~(~0UL >> (BITS_PER_LONG>>1)))
+	struct list_head wait;
+#if RWSEM_DEBUG
+	long __magic;
+#endif
+};
+
+#if RWSEM_DEBUG
+#define __SEM_DEBUG_INIT(name) \
+	, (long)&(name).__magic
+#define RWSEM_MAGIC(x)							\
+	do {								\
+		if ((x) != (long)&(x)) {				\
+			printk("rwsem bad magic %lx (should be %lx), ",	\
+				(long)x, (long)&(x));			\
+			BUG();						\
+		}							\
+	} while (0)
+#else
+#define __SEM_DEBUG_INIT(name)
+#define CHECK_MAGIC(x)
+#endif
+
+#define __RWSEM_INITIALIZER(name, count)	\
+{						\
+	SPIN_LOCK_UNLOCKED,			\
+	(count),				\
+	LIST_HEAD_INIT((name).wait)		\
+	__SEM_DEBUG_INIT(name)			\
+}
+#define RWSEM_INITIALIZER(name) __RWSEM_INITIALIZER(name, 0)
+
+#define __DECLARE_RWSEM(name, count) \
+	struct rw_semaphore name = __RWSEM_INITIALIZER(name, count)
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM(name, 0)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_READ_BIAS)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_WRITE_BIAS)
+
+#define RWSEM_READ_BLOCKING_BIAS (RWSEM_WRITE_BIAS-RWSEM_READ_BIAS)
+#define RWSEM_WRITE_BLOCKING_BIAS (0)
+
+#define RWSEM_READ_MASK (~RWSEM_WRITE_BIAS)
+#define RWSEM_WRITE_MASK (RWSEM_WRITE_BIAS)
+
+extern void FASTCALL(init_rwsem(struct rw_semaphore *));
+extern void FASTCALL(down_read(struct rw_semaphore *));
+extern void FASTCALL(down_write(struct rw_semaphore *));
+extern void FASTCALL(up_read(struct rw_semaphore *));
+extern void FASTCALL(up_write(struct rw_semaphore *));
+
+#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff -urN rwsem-ref/include/linux/rwsem_xchgadd.h rwsem/include/linux/rwsem_xchgadd.h
--- rwsem-ref/include/linux/rwsem_xchgadd.h	Thu Jan  1 01:00:00 1970
+++ rwsem/include/linux/rwsem_xchgadd.h	Tue Sep 18 10:31:34 2001
@@ -0,0 +1,104 @@
+#ifndef _LINUX_RWSEM_XCHGADD_H
+#define _LINUX_RWSEM_XCHGADD_H
+
+#include <linux/kernel.h>
+
+struct rw_semaphore
+{
+	long count;
+	spinlock_t lock;
+#define RWSEM_READ_BIAS 1
+#define RWSEM_WRITE_BIAS (~(~0UL >> (BITS_PER_LONG>>1)))
+	struct list_head wait;
+#if RWSEM_DEBUG
+	long __magic;
+#endif
+};
+
+#if RWSEM_DEBUG
+#define __SEM_DEBUG_INIT(name) \
+	, (int)&(name).__magic
+#define RWSEM_MAGIC(x)							\
+	do {								\
+		if ((x) != (long)&(x)) {				\
+			printk("rwsem bad magic %lx (should be %lx), ",	\
+				(long)x, (long)&(x));			\
+			BUG();						\
+		}							\
+	} while (0)
+#else
+#define __SEM_DEBUG_INIT(name)
+#define CHECK_MAGIC(x)
+#endif
+
+#define __RWSEM_INITIALIZER(name, count)	\
+{						\
+	(count),				\
+	SPIN_LOCK_UNLOCKED,			\
+	LIST_HEAD_INIT((name).wait)		\
+	__SEM_DEBUG_INIT(name)			\
+}
+#define RWSEM_INITIALIZER(name) __RWSEM_INITIALIZER(name, 0)
+
+#define __DECLARE_RWSEM(name, count) \
+	struct rw_semaphore name = __RWSEM_INITIALIZER(name, count)
+#define DECLARE_RWSEM(name) __DECLARE_RWSEM(name, 0)
+#define DECLARE_RWSEM_READ_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_READ_BIAS)
+#define DECLARE_RWSEM_WRITE_LOCKED(name) __DECLARE_RWSEM(name, RWSEM_WRITE_BIAS+RWSEM_READ_BIAS)
+
+#define RWSEM_READ_BLOCKING_BIAS (RWSEM_WRITE_BIAS-RWSEM_READ_BIAS)
+#define RWSEM_WRITE_BLOCKING_BIAS (-RWSEM_READ_BIAS)
+
+#define RWSEM_READ_MASK (~RWSEM_WRITE_BIAS)
+#define RWSEM_WRITE_MASK (RWSEM_WRITE_BIAS)
+
+/*
+ * We return the semaphore itself from the C functions so we can pass it
+ * in %eax via regparm and we don't need to declare %eax clobbered by C.
+ * This is mostly for x86 but maybe other archs can make a use of it too.
+ * Idea is from David Howells <dhowells@redhat.com>.
+ */
+extern struct rw_semaphore * FASTCALL(rwsem_down_failed(struct rw_semaphore *, long));
+extern struct rw_semaphore * FASTCALL(rwsem_wake(struct rw_semaphore *));
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+	sem->count = 0;
+	spin_lock_init(&sem->lock);
+	INIT_LIST_HEAD(&sem->wait);
+#if RWSEM_DEBUG
+	sem->__magic = (long)&sem->__magic;
+#endif
+}
+
+#include <asm/rwsem_xchgadd.h>
+
+static inline void down_read(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	__down_read(sem);
+}
+
+static inline void down_write(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	__down_write(sem);
+}
+
+static inline void up_read(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	__up_read(sem);
+}
+
+static inline void up_write(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	__up_write(sem);
+}
+
+#endif /* _LINUX_RWSEM_XCHGADD_H */
diff -urN rwsem-ref/include/linux/sched.h rwsem/include/linux/sched.h
--- rwsem-ref/include/linux/sched.h	Tue Sep 18 02:43:03 2001
+++ rwsem/include/linux/sched.h	Tue Sep 18 10:31:34 2001
@@ -239,7 +239,7 @@
 	pgd:		swapper_pg_dir, 		\
 	mm_users:	ATOMIC_INIT(2), 		\
 	mm_count:	ATOMIC_INIT(1), 		\
-	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
+	mmap_sem:	RWSEM_INITIALIZER(name.mmap_sem), \
 	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
 	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
 }
diff -urN rwsem-ref/lib/Makefile rwsem/lib/Makefile
--- rwsem-ref/lib/Makefile	Tue Sep 18 02:43:04 2001
+++ rwsem/lib/Makefile	Tue Sep 18 10:31:34 2001
@@ -8,12 +8,12 @@
 
 L_TARGET := lib.a
 
-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o
+export-objs := cmdline.o dec_and_lock.o rwsem_spinlock.o rwsem_xchgadd.o
 
 obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
 
-obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
-obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
+obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem_spinlock.o
+obj-$(CONFIG_RWSEM_XCHGADD) += rwsem_xchgadd.o
 
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y) 
   obj-y += dec_and_lock.o
diff -urN rwsem-ref/lib/rwsem.c rwsem/lib/rwsem.c
--- rwsem-ref/lib/rwsem.c	Sat Jul 21 00:04:34 2001
+++ rwsem/lib/rwsem.c	Thu Jan  1 01:00:00 1970
@@ -1,210 +0,0 @@
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- */
-#include <linux/rwsem.h>
-#include <linux/sched.h>
-#include <linux/module.h>
-
-struct rwsem_waiter {
-	struct list_head	list;
-	struct task_struct	*task;
-	unsigned int		flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-};
-
-#if RWSEM_DEBUG
-#undef rwsemtrace
-void rwsemtrace(struct rw_semaphore *sem, const char *str)
-{
-	printk("sem=%p\n",sem);
-	printk("(sem)=%08lx\n",sem->count);
-	if (sem->debug)
-		printk("[%d] %s({%08lx})\n",current->pid,str,sem->count);
-}
-#endif
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
- *   - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
- *   - there must be someone on the queue
- * - the spinlock must be held by the caller
- * - woken process blocks are discarded from the list after having flags zeroised
- */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-	struct list_head *next;
-	signed long oldcount;
-	int woken, loop;
-
-	rwsemtrace(sem,"Entering __rwsem_do_wake");
-
-	/* only wake someone up if we can transition the active part of the count from 0 -> 1 */
- try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
-	if (oldcount & RWSEM_ACTIVE_MASK)
-		goto undo;
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-
-	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - note we leave the 'active part' of the count incremented by 1 and the waiting part
-	 *   incremented by 0x00010000
-	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
-
-	list_del(&waiter->list);
-	waiter->flags = 0;
-	wake_up_process(waiter->task);
-	goto out;
-
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
-	 */
- readers_only:
-	woken = 0;
-	do {
-		woken++;
-
-		if (waiter->list.next==&sem->wait_list)
-			break;
-
-		waiter = list_entry(waiter->list.next,struct rwsem_waiter,list);
-
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
-
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
-	rwsem_atomic_add(woken,sem);
-
-	next = sem->wait_list.next;
-	for (; loop>0; loop--) {
-		waiter = list_entry(next,struct rwsem_waiter,list);
-		next = waiter->list.next;
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
-
- out:
-	rwsemtrace(sem,"Leaving __rwsem_do_wake");
-	return sem;
-
-	/* undo the change to count, but check for a transition 1->0 */
- undo:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem)!=0)
-		goto out;
-	goto try_again;
-}
-
-/*
- * wait for a lock to be granted
- */
-static inline struct rw_semaphore *rwsem_down_failed_common(struct rw_semaphore *sem,
-								 struct rwsem_waiter *waiter,
-								 signed long adjustment)
-{
-	struct task_struct *tsk = current;
-	signed long count;
-
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	spin_lock(&sem->wait_lock);
-	waiter->task = tsk;
-
-	list_add_tail(&waiter->list,&sem->wait_list);
-
-	/* note that we're now waiting on the lock, but no longer actively read-locking */
-	count = rwsem_atomic_update(adjustment,sem);
-
-	/* if there are no longer active locks, wake the front queued process(es) up
-	 * - it might even be this process, since the waker takes a more active part
-	 */
-	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter->flags)
-			break;
-		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
-	return sem;
-}
-
-/*
- * wait for the read lock to be granted
- */
-struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-
-	rwsemtrace(sem,"Entering rwsem_down_read_failed");
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem,&waiter,RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_read_failed");
-	return sem;
-}
-
-/*
- * wait for the write lock to be granted
- */
-struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-
-	rwsemtrace(sem,"Entering rwsem_down_write_failed");
-
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem,&waiter,-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_write_failed");
-	return sem;
-}
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
- */
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering rwsem_wake");
-
-	spin_lock(&sem->wait_lock);
-
-	/* do nothing if list empty */
-	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving rwsem_wake");
-
-	return sem;
-}
-
-EXPORT_SYMBOL_NOVERS(rwsem_down_read_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_down_write_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_wake);
-#if RWSEM_DEBUG
-EXPORT_SYMBOL(rwsemtrace);
-#endif
diff -urN rwsem-ref/lib/rwsem_spinlock.c rwsem/lib/rwsem_spinlock.c
--- rwsem-ref/lib/rwsem_spinlock.c	Thu Jan  1 01:00:00 1970
+++ rwsem/lib/rwsem_spinlock.c	Tue Sep 18 10:31:34 2001
@@ -0,0 +1,126 @@
+/*
+ *  rw_semaphores generic spinlock version
+ *  Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <asm/semaphore.h>
+
+struct rwsem_wait_queue {
+	unsigned long retire;
+	struct task_struct * task;
+	struct list_head task_list;
+};
+
+static void FASTCALL(rwsem_down_failed(struct rw_semaphore *, long));
+static void rwsem_down_failed(struct rw_semaphore *sem, long retire)
+{
+	struct task_struct *tsk = current;
+	struct rwsem_wait_queue wait;
+
+	sem->count += retire;
+	wait.retire = retire;
+	wait.task = tsk;
+	INIT_LIST_HEAD(&wait.task_list);
+	list_add(&wait.task_list, &sem->wait);
+
+	do {
+		__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&sem->lock);
+		schedule();
+		spin_lock(&sem->lock);
+	} while(wait.task_list.next);
+}
+
+static void FASTCALL(rwsem_wake(struct rw_semaphore *));
+static void rwsem_wake(struct rw_semaphore *sem)
+{
+	struct list_head * entry, * head = &sem->wait;
+	int last = 0;
+
+	while ((entry = head->prev) != head) {
+		struct rwsem_wait_queue * wait;
+
+		wait = list_entry(entry, struct rwsem_wait_queue, task_list);
+
+		if (wait->retire == RWSEM_WRITE_BLOCKING_BIAS) {
+			if (sem->count & RWSEM_READ_MASK)
+				break;
+			last = 1;
+		}
+
+		/* convert write lock into read lock when reads become active */
+		sem->count -= wait->retire;
+		list_del(entry);
+		entry->next = NULL;
+		wake_up_process(wait->task);
+			
+		if (last)
+			break;
+	}
+}
+
+void init_rwsem(struct rw_semaphore *sem)
+{
+	spin_lock_init(&sem->lock);
+	sem->count = 0;
+	INIT_LIST_HEAD(&sem->wait);
+#if RWSEM_DEBUG
+	sem->__magic = (long)&sem->__magic;
+#endif
+}
+
+void down_read(struct rw_semaphore *sem)
+{
+	int count;
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	count = sem->count;
+	sem->count += RWSEM_READ_BIAS;
+	if (__builtin_expect(count < 0 && !(count & RWSEM_READ_MASK), 0))
+		rwsem_down_failed(sem, RWSEM_READ_BLOCKING_BIAS);
+	spin_unlock(&sem->lock);
+}
+
+void down_write(struct rw_semaphore *sem)
+{
+	long count;
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	count = sem->count;
+	sem->count += RWSEM_WRITE_BIAS;
+	if (__builtin_expect(count, 0))
+		rwsem_down_failed(sem, RWSEM_WRITE_BLOCKING_BIAS);
+	spin_unlock(&sem->lock);
+}
+
+void up_read(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	sem->count -= RWSEM_READ_BIAS;
+	if (__builtin_expect(sem->count < 0 && !(sem->count & RWSEM_READ_MASK), 0))
+		rwsem_wake(sem);
+	spin_unlock(&sem->lock);
+}
+
+void up_write(struct rw_semaphore *sem)
+{
+	CHECK_MAGIC(sem->__magic);
+
+	spin_lock(&sem->lock);
+	sem->count -= RWSEM_WRITE_BIAS;
+	if (__builtin_expect(sem->count, 0))
+		rwsem_wake(sem);
+	spin_unlock(&sem->lock);
+}
+
+EXPORT_SYMBOL(init_rwsem);
+EXPORT_SYMBOL(down_read);
+EXPORT_SYMBOL(down_write);
+EXPORT_SYMBOL(up_read);
+EXPORT_SYMBOL(up_write);
diff -urN rwsem-ref/lib/rwsem_xchgadd.c rwsem/lib/rwsem_xchgadd.c
--- rwsem-ref/lib/rwsem_xchgadd.c	Thu Jan  1 01:00:00 1970
+++ rwsem/lib/rwsem_xchgadd.c	Tue Sep 18 10:31:34 2001
@@ -0,0 +1,92 @@
+/*
+ *  rw_semaphores xchgadd version
+ *  Copyright (C) 2001 Andrea Arcangeli <andrea@suse.de> SuSE
+ */
+
+#include <linux/sched.h>
+#include <linux/module.h>
+#include <asm/semaphore.h>
+
+struct rwsem_wait_queue {
+	unsigned long retire;
+	struct task_struct * task;
+	struct list_head task_list;
+};
+
+static void FASTCALL(__rwsem_wake(struct rw_semaphore *));
+static void __rwsem_wake(struct rw_semaphore *sem)
+{
+	struct list_head * entry, * head = &sem->wait;
+	int wake_write = 0, wake_read = 0;
+
+	while ((entry = head->prev) != head) {
+		struct rwsem_wait_queue * wait;
+		long count;
+
+		wait = list_entry(entry, struct rwsem_wait_queue, task_list);
+
+		if (wait->retire == RWSEM_WRITE_BLOCKING_BIAS) {
+			if (wake_read)
+				break;
+			wake_write = 1;
+		}
+
+	again:
+		count = rwsem_xchgadd(-wait->retire, &sem->count);
+		if (!wake_read && (count & RWSEM_READ_MASK)) {
+			count = rwsem_xchgadd(wait->retire, &sem->count);
+			if ((count & RWSEM_READ_MASK) == 1)
+				goto again;
+			break;
+		}
+		
+		list_del(entry);
+		entry->next = NULL;
+		wake_up_process(wait->task);
+			
+		if (wake_write)
+			break;
+		wake_read = 1;
+	}
+}
+
+struct rw_semaphore * rwsem_down_failed(struct rw_semaphore *sem, long retire)
+{
+	struct task_struct *tsk = current;
+	struct rwsem_wait_queue wait;
+	long count;
+
+	wait.retire = retire;
+	wait.task = tsk;
+	INIT_LIST_HEAD(&wait.task_list);
+
+	spin_lock(&sem->lock);
+	list_add(&wait.task_list, &sem->wait);
+
+	count = rwsem_xchgadd(retire, &sem->count);
+	if ((count & RWSEM_READ_MASK) == 1)
+		__rwsem_wake(sem);
+
+	while (wait.task_list.next) {
+		__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&sem->lock);
+		schedule();
+		spin_lock(&sem->lock);
+	}
+
+	spin_unlock(&sem->lock);
+
+	return sem;
+}
+
+struct rw_semaphore * rwsem_wake(struct rw_semaphore *sem)
+{
+	spin_lock(&sem->lock);
+	__rwsem_wake(sem);
+	spin_unlock(&sem->lock);
+
+	return sem;
+}
+
+EXPORT_SYMBOL_NOVERS(rwsem_down_failed);
+EXPORT_SYMBOL_NOVERS(rwsem_wake);

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19 14:51               ` David Howells
@ 2001-09-19 15:18                 ` Manfred Spraul
  0 siblings, 0 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-19 15:18 UTC (permalink / raw)
  To: Linus Torvalds, David Howells
  Cc: David Howells, Andrea Arcangeli, Ulrich.Weigand, linux-kernel

>
> I also don't think the hack is that bad. All it's doing is taking a
> copy of the process's VM description so that it knows that
> nobody is going to modify it whilst a coredump is in progress.

You break the locking scheme of the mm structure.
Right now the rules are

1 get a mm_struct pointer by whatever means (walk the process list and
    read task->mm, or walk the mm_list)
2 increase mm_users
3 release the spinlock you acquired for 1
4 you can do with the result what you want.

With your patch applied, we would have to restrict rule 4 - at least
modifying the vma list would no longer be possible, and probably
further restrictions would follow.
AFAIK right now no external mm_struct user modifies the vma list, but it
could be a problem in the future.

>
> However, if you don't like that, how about just changing the lock on
> mm_struct to a special mm_struct-only type lock that has a
> recursive lock operation for use by the pagefault handler (and
> _only_ the pagefault handler)? I've attached  a patch to do just that.
> This introduces five operations:

Does that solve the latency problem? That problem is page faults vs.
another operation.

--
    Manfred



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 16:49             ` Linus Torvalds
                                 ` (3 preceding siblings ...)
  2001-09-19 14:53               ` David Howells
@ 2001-09-19 14:58               ` David Howells
  4 siblings, 0 replies; 49+ messages in thread
From: David Howells @ 2001-09-19 14:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Manfred Spraul, Andrea Arcangeli, Ulrich.Weigand,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 46 bytes --]

Here's a patch to make rwsems unfair.

David


[-- Attachment #2: rwsem.diff.bz2 --]
[-- Type: application/octet-stream, Size: 10338 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 16:49             ` Linus Torvalds
                                 ` (2 preceding siblings ...)
  2001-09-19 14:51               ` David Howells
@ 2001-09-19 14:53               ` David Howells
  2001-09-19 18:03                 ` Andrea Arcangeli
  2001-09-19 14:58               ` David Howells
  4 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19 14:53 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Manfred Spraul, Andrea Arcangeli, Ulrich.Weigand,
	linux-kernel


Here's a patch to make rwsems unfair.

David


diff -uNr linux-2.4.10-pre12/arch/alpha/config.in linux-rwsem/arch/alpha/config.in
--- linux-2.4.10-pre12/arch/alpha/config.in	Tue Sep 18 08:45:58 2001
+++ linux-rwsem/arch/alpha/config.in	Wed Sep 19 14:46:18 2001
@@ -5,8 +5,6 @@
 
 define_bool CONFIG_ALPHA y
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 
 mainmenu_name "Kernel configuration of Linux for Alpha machines"
 
diff -uNr linux-2.4.10-pre12/arch/arm/config.in linux-rwsem/arch/arm/config.in
--- linux-2.4.10-pre12/arch/arm/config.in	Tue Sep 18 08:46:39 2001
+++ linux-rwsem/arch/arm/config.in	Wed Sep 19 14:46:18 2001
@@ -9,8 +9,6 @@
 define_bool CONFIG_SBUS n
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 
 mainmenu_option next_comment
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/anakin linux-rwsem/arch/arm/def-configs/anakin
--- linux-2.4.10-pre12/arch/arm/def-configs/anakin	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/anakin	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/assabet linux-rwsem/arch/arm/def-configs/assabet
--- linux-2.4.10-pre12/arch/arm/def-configs/assabet	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/assabet	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/bitsy linux-rwsem/arch/arm/def-configs/bitsy
--- linux-2.4.10-pre12/arch/arm/def-configs/bitsy	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/bitsy	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/ebsa110 linux-rwsem/arch/arm/def-configs/ebsa110
--- linux-2.4.10-pre12/arch/arm/def-configs/ebsa110	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/ebsa110	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/flexanet linux-rwsem/arch/arm/def-configs/flexanet
--- linux-2.4.10-pre12/arch/arm/def-configs/flexanet	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/flexanet	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/integrator linux-rwsem/arch/arm/def-configs/integrator
--- linux-2.4.10-pre12/arch/arm/def-configs/integrator	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/integrator	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/lart linux-rwsem/arch/arm/def-configs/lart
--- linux-2.4.10-pre12/arch/arm/def-configs/lart	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/lart	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/neponset linux-rwsem/arch/arm/def-configs/neponset
--- linux-2.4.10-pre12/arch/arm/def-configs/neponset	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/neponset	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/pleb linux-rwsem/arch/arm/def-configs/pleb
--- linux-2.4.10-pre12/arch/arm/def-configs/pleb	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/pleb	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/rpc linux-rwsem/arch/arm/def-configs/rpc
--- linux-2.4.10-pre12/arch/arm/def-configs/rpc	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/rpc	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/arm/def-configs/shark linux-rwsem/arch/arm/def-configs/shark
--- linux-2.4.10-pre12/arch/arm/def-configs/shark	Tue Sep 18 08:46:40 2001
+++ linux-rwsem/arch/arm/def-configs/shark	Wed Sep 19 14:46:18 2001
@@ -6,8 +6,6 @@
 # CONFIG_SBUS is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/cris/config.in linux-rwsem/arch/cris/config.in
--- linux-2.4.10-pre12/arch/cris/config.in	Tue Sep 18 08:46:43 2001
+++ linux-rwsem/arch/cris/config.in	Wed Sep 19 14:46:18 2001
@@ -5,8 +5,6 @@
 mainmenu_name "Linux/CRIS Kernel Configuration"
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -uNr linux-2.4.10-pre12/arch/cris/defconfig linux-rwsem/arch/cris/defconfig
--- linux-2.4.10-pre12/arch/cris/defconfig	Tue Sep 18 08:46:43 2001
+++ linux-rwsem/arch/cris/defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/i386/config.in linux-rwsem/arch/i386/config.in
--- linux-2.4.10-pre12/arch/i386/config.in	Wed Sep 19 10:39:05 2001
+++ linux-rwsem/arch/i386/config.in	Wed Sep 19 14:46:18 2001
@@ -50,8 +50,6 @@
    define_bool CONFIG_X86_CMPXCHG n
    define_bool CONFIG_X86_XADD n
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 else
    define_bool CONFIG_X86_WP_WORKS_OK y
    define_bool CONFIG_X86_INVLPG y
@@ -59,8 +57,6 @@
    define_bool CONFIG_X86_XADD y
    define_bool CONFIG_X86_BSWAP y
    define_bool CONFIG_X86_POPAD_OK y
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 fi
 if [ "$CONFIG_M486" = "y" ]; then
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4
diff -uNr linux-2.4.10-pre12/arch/i386/defconfig linux-rwsem/arch/i386/defconfig
--- linux-2.4.10-pre12/arch/i386/defconfig	Wed Sep 19 10:39:05 2001
+++ linux-rwsem/arch/i386/defconfig	Wed Sep 19 14:46:18 2001
@@ -42,8 +42,6 @@
 CONFIG_X86_XADD=y
 CONFIG_X86_BSWAP=y
 CONFIG_X86_POPAD_OK=y
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 CONFIG_X86_L1_CACHE_SHIFT=5
 CONFIG_X86_TSC=y
 CONFIG_X86_GOOD_APIC=y
diff -uNr linux-2.4.10-pre12/arch/ia64/config.in linux-rwsem/arch/ia64/config.in
--- linux-2.4.10-pre12/arch/ia64/config.in	Tue Sep 18 08:46:41 2001
+++ linux-rwsem/arch/ia64/config.in	Wed Sep 19 14:46:18 2001
@@ -23,8 +23,6 @@
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
 define_bool CONFIG_SBUS n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 if [ "$CONFIG_IA64_HP_SIM" = "n" ]; then
   define_bool CONFIG_ACPI y
diff -uNr linux-2.4.10-pre12/arch/m68k/config.in linux-rwsem/arch/m68k/config.in
--- linux-2.4.10-pre12/arch/m68k/config.in	Tue Sep 18 08:46:04 2001
+++ linux-rwsem/arch/m68k/config.in	Wed Sep 19 14:46:18 2001
@@ -4,8 +4,6 @@
 #
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux/68k Kernel Configuration"
 
diff -uNr linux-2.4.10-pre12/arch/mips/config.in linux-rwsem/arch/mips/config.in
--- linux-2.4.10-pre12/arch/mips/config.in	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/config.in	Wed Sep 19 14:46:18 2001
@@ -68,8 +68,6 @@
    fi
 bool 'Support for Alchemy Semi PB1000 board' CONFIG_MIPS_PB1000
 
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 #
 # Select some configuration options automatically for certain systems.
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig linux-rwsem/arch/mips/defconfig
--- linux-2.4.10-pre12/arch/mips/defconfig	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_ARC32=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-atlas linux-rwsem/arch/mips/defconfig-atlas
--- linux-2.4.10-pre12/arch/mips/defconfig-atlas	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-atlas	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_PCI=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-ddb5476 linux-rwsem/arch/mips/defconfig-ddb5476
--- linux-2.4.10-pre12/arch/mips/defconfig-ddb5476	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-ddb5476	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_ISA=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-ddb5477 linux-rwsem/arch/mips/defconfig-ddb5477
--- linux-2.4.10-pre12/arch/mips/defconfig-ddb5477	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-ddb5477	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_CPU_LITTLE_ENDIAN=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-decstation linux-rwsem/arch/mips/defconfig-decstation
--- linux-2.4.10-pre12/arch/mips/defconfig-decstation	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-decstation	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 # CONFIG_ISA is not set
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-ip22 linux-rwsem/arch/mips/defconfig-ip22
--- linux-2.4.10-pre12/arch/mips/defconfig-ip22	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-ip22	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_ARC32=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-it8172 linux-rwsem/arch/mips/defconfig-it8172
--- linux-2.4.10-pre12/arch/mips/defconfig-it8172	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-it8172	Wed Sep 19 14:46:18 2001
@@ -37,8 +37,6 @@
 # CONFIG_IT8172_SCR1 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_PCI=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-malta linux-rwsem/arch/mips/defconfig-malta
--- linux-2.4.10-pre12/arch/mips/defconfig-malta	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-malta	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_I8259=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-nino linux-rwsem/arch/mips/defconfig-nino
--- linux-2.4.10-pre12/arch/mips/defconfig-nino	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-nino	Wed Sep 19 14:46:18 2001
@@ -35,8 +35,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_PC_KEYB=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-ocelot linux-rwsem/arch/mips/defconfig-ocelot
--- linux-2.4.10-pre12/arch/mips/defconfig-ocelot	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-ocelot	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_PCI=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-pb1000 linux-rwsem/arch/mips/defconfig-pb1000
--- linux-2.4.10-pre12/arch/mips/defconfig-pb1000	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-pb1000	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 CONFIG_MIPS_PB1000=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_MIPS_AU1000=y
diff -uNr linux-2.4.10-pre12/arch/mips/defconfig-rm200 linux-rwsem/arch/mips/defconfig-rm200
--- linux-2.4.10-pre12/arch/mips/defconfig-rm200	Wed Sep 19 10:39:06 2001
+++ linux-rwsem/arch/mips/defconfig-rm200	Wed Sep 19 14:46:18 2001
@@ -32,8 +32,6 @@
 # CONFIG_MIPS_ITE8172 is not set
 # CONFIG_MIPS_IVR is not set
 # CONFIG_MIPS_PB1000 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_MCA is not set
 # CONFIG_SBUS is not set
 CONFIG_ARC32=y
diff -uNr linux-2.4.10-pre12/arch/mips64/config.in linux-rwsem/arch/mips64/config.in
--- linux-2.4.10-pre12/arch/mips64/config.in	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/mips64/config.in	Wed Sep 19 14:46:18 2001
@@ -27,8 +27,6 @@
 fi
 endmenu
 
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 #
 # Select some configuration options automatically based on user selections
diff -uNr linux-2.4.10-pre12/arch/mips64/defconfig linux-rwsem/arch/mips64/defconfig
--- linux-2.4.10-pre12/arch/mips64/defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/mips64/defconfig	Wed Sep 19 14:46:18 2001
@@ -19,8 +19,6 @@
 # CONFIG_REPLICATE_KTEXT is not set
 # CONFIG_REPLICATE_EXHANDLERS is not set
 CONFIG_SMP=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_BOOT_ELF64=y
 CONFIG_ARC64=y
 CONFIG_COHERENT_IO=y
diff -uNr linux-2.4.10-pre12/arch/mips64/defconfig-ip22 linux-rwsem/arch/mips64/defconfig-ip22
--- linux-2.4.10-pre12/arch/mips64/defconfig-ip22	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/mips64/defconfig-ip22	Wed Sep 19 14:46:18 2001
@@ -12,8 +12,6 @@
 #
 CONFIG_SGI_IP22=y
 # CONFIG_SGI_IP27 is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_BOOT_ELF32=y
 CONFIG_ARC32=y
 CONFIG_BOARD_SCACHE=y
diff -uNr linux-2.4.10-pre12/arch/mips64/defconfig-ip27 linux-rwsem/arch/mips64/defconfig-ip27
--- linux-2.4.10-pre12/arch/mips64/defconfig-ip27	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/mips64/defconfig-ip27	Wed Sep 19 14:46:18 2001
@@ -19,8 +19,6 @@
 # CONFIG_REPLICATE_KTEXT is not set
 # CONFIG_REPLICATE_EXHANDLERS is not set
 CONFIG_SMP=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_BOOT_ELF64=y
 CONFIG_ARC64=y
 CONFIG_COHERENT_IO=y
diff -uNr linux-2.4.10-pre12/arch/mips64/defconfig-ip32 linux-rwsem/arch/mips64/defconfig-ip32
--- linux-2.4.10-pre12/arch/mips64/defconfig-ip32	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/mips64/defconfig-ip32	Wed Sep 19 14:46:18 2001
@@ -13,8 +13,6 @@
 # CONFIG_SGI_IP22 is not set
 # CONFIG_SGI_IP27 is not set
 CONFIG_SGI_IP32=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_BOOT_ELF32=y
 CONFIG_ARC32=y
 CONFIG_PC_KEYB=y
diff -uNr linux-2.4.10-pre12/arch/parisc/config.in linux-rwsem/arch/parisc/config.in
--- linux-2.4.10-pre12/arch/parisc/config.in	Tue Sep 18 08:46:43 2001
+++ linux-rwsem/arch/parisc/config.in	Wed Sep 19 14:46:18 2001
@@ -7,8 +7,6 @@
 
 define_bool CONFIG_PARISC y
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -uNr linux-2.4.10-pre12/arch/ppc/config.in linux-rwsem/arch/ppc/config.in
--- linux-2.4.10-pre12/arch/ppc/config.in	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/config.in	Wed Sep 19 14:46:18 2001
@@ -4,8 +4,6 @@
 # see Documentation/kbuild/config-language.txt.
 #
 define_bool CONFIG_UID16 n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 
 mainmenu_name "Linux/PowerPC Kernel Configuration"
 
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/IVMS8_defconfig linux-rwsem/arch/ppc/configs/IVMS8_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/IVMS8_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/IVMS8_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/SM850_defconfig linux-rwsem/arch/ppc/configs/SM850_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/SM850_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/SM850_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/SPD823TS_defconfig linux-rwsem/arch/ppc/configs/SPD823TS_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/SPD823TS_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/SPD823TS_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/TQM823L_defconfig linux-rwsem/arch/ppc/configs/TQM823L_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/TQM823L_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/TQM823L_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/TQM850L_defconfig linux-rwsem/arch/ppc/configs/TQM850L_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/TQM850L_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/TQM850L_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/TQM860L_defconfig linux-rwsem/arch/ppc/configs/TQM860L_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/TQM860L_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/TQM860L_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/apus_defconfig linux-rwsem/arch/ppc/configs/apus_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/apus_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/apus_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/bseip_defconfig linux-rwsem/arch/ppc/configs/bseip_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/bseip_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/bseip_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/common_defconfig linux-rwsem/arch/ppc/configs/common_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/common_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/common_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/est8260_defconfig linux-rwsem/arch/ppc/configs/est8260_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/est8260_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/est8260_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/gemini_defconfig linux-rwsem/arch/ppc/configs/gemini_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/gemini_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/gemini_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/ibmchrp_defconfig linux-rwsem/arch/ppc/configs/ibmchrp_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/ibmchrp_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/ibmchrp_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/mbx_defconfig linux-rwsem/arch/ppc/configs/mbx_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/mbx_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/mbx_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/oak_defconfig linux-rwsem/arch/ppc/configs/oak_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/oak_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/oak_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/power3_defconfig linux-rwsem/arch/ppc/configs/power3_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/power3_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/power3_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/rpxcllf_defconfig linux-rwsem/arch/ppc/configs/rpxcllf_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/rpxcllf_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/rpxcllf_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/rpxlite_defconfig linux-rwsem/arch/ppc/configs/rpxlite_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/rpxlite_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/rpxlite_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/configs/walnut_defconfig linux-rwsem/arch/ppc/configs/walnut_defconfig
--- linux-2.4.10-pre12/arch/ppc/configs/walnut_defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/configs/walnut_defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/ppc/defconfig linux-rwsem/arch/ppc/defconfig
--- linux-2.4.10-pre12/arch/ppc/defconfig	Wed Sep 19 10:39:07 2001
+++ linux-rwsem/arch/ppc/defconfig	Wed Sep 19 14:46:18 2001
@@ -2,8 +2,6 @@
 # Automatically generated make config: don't edit
 #
 # CONFIG_UID16 is not set
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 
 #
 # Code maturity level options
diff -uNr linux-2.4.10-pre12/arch/s390/config.in linux-rwsem/arch/s390/config.in
--- linux-2.4.10-pre12/arch/s390/config.in	Tue Sep 18 08:46:42 2001
+++ linux-rwsem/arch/s390/config.in	Wed Sep 19 14:46:18 2001
@@ -7,8 +7,6 @@
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -uNr linux-2.4.10-pre12/arch/s390/defconfig linux-rwsem/arch/s390/defconfig
--- linux-2.4.10-pre12/arch/s390/defconfig	Tue Sep 18 08:46:42 2001
+++ linux-rwsem/arch/s390/defconfig	Wed Sep 19 14:46:18 2001
@@ -5,8 +5,6 @@
 # CONFIG_EISA is not set
 # CONFIG_MCA is not set
 CONFIG_UID16=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_ARCH_S390=y
 
 #
diff -uNr linux-2.4.10-pre12/arch/s390x/config.in linux-rwsem/arch/s390x/config.in
--- linux-2.4.10-pre12/arch/s390x/config.in	Tue Sep 18 08:46:43 2001
+++ linux-rwsem/arch/s390x/config.in	Wed Sep 19 14:46:16 2001
@@ -6,8 +6,6 @@
 define_bool CONFIG_ISA n
 define_bool CONFIG_EISA n
 define_bool CONFIG_MCA n
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_name "Linux Kernel Configuration"
 define_bool CONFIG_ARCH_S390 y
diff -uNr linux-2.4.10-pre12/arch/s390x/defconfig linux-rwsem/arch/s390x/defconfig
--- linux-2.4.10-pre12/arch/s390x/defconfig	Tue Sep 18 08:46:43 2001
+++ linux-rwsem/arch/s390x/defconfig	Wed Sep 19 14:45:43 2001
@@ -4,8 +4,6 @@
 # CONFIG_ISA is not set
 # CONFIG_EISA is not set
 # CONFIG_MCA is not set
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 CONFIG_ARCH_S390=y
 CONFIG_ARCH_S390X=y
 
diff -uNr linux-2.4.10-pre12/arch/sh/config.in linux-rwsem/arch/sh/config.in
--- linux-2.4.10-pre12/arch/sh/config.in	Wed Sep 19 10:39:08 2001
+++ linux-rwsem/arch/sh/config.in	Wed Sep 19 14:46:18 2001
@@ -7,8 +7,6 @@
 define_bool CONFIG_SUPERH y
 
 define_bool CONFIG_UID16 y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 mainmenu_option next_comment
 comment 'Code maturity level options'
diff -uNr linux-2.4.10-pre12/arch/sparc/config.in linux-rwsem/arch/sparc/config.in
--- linux-2.4.10-pre12/arch/sparc/config.in	Tue Sep 18 08:45:59 2001
+++ linux-rwsem/arch/sparc/config.in	Wed Sep 19 14:46:18 2001
@@ -48,8 +48,6 @@
 define_bool CONFIG_SUN_CONSOLE y
 define_bool CONFIG_SUN_AUXIO y
 define_bool CONFIG_SUN_IO y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 
 bool 'Support for SUN4 machines (disables SUN4[CDM] support)' CONFIG_SUN4
 if [ "$CONFIG_SUN4" != "y" ]; then
diff -uNr linux-2.4.10-pre12/arch/sparc/defconfig linux-rwsem/arch/sparc/defconfig
--- linux-2.4.10-pre12/arch/sparc/defconfig	Tue Sep 18 08:45:59 2001
+++ linux-rwsem/arch/sparc/defconfig	Wed Sep 19 14:46:18 2001
@@ -38,8 +38,6 @@
 CONFIG_SUN_CONSOLE=y
 CONFIG_SUN_AUXIO=y
 CONFIG_SUN_IO=y
-CONFIG_RWSEM_GENERIC_SPINLOCK=y
-# CONFIG_RWSEM_XCHGADD_ALGORITHM is not set
 # CONFIG_SUN4 is not set
 # CONFIG_PCI is not set
 CONFIG_SUN_OPENPROMFS=m
diff -uNr linux-2.4.10-pre12/arch/sparc64/config.in linux-rwsem/arch/sparc64/config.in
--- linux-2.4.10-pre12/arch/sparc64/config.in	Tue Sep 18 08:46:06 2001
+++ linux-rwsem/arch/sparc64/config.in	Wed Sep 19 14:46:18 2001
@@ -33,8 +33,6 @@
 
 # Global things across all Sun machines.
 define_bool CONFIG_HAVE_DEC_LOCK y
-define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
 define_bool CONFIG_ISA n
 define_bool CONFIG_ISAPNP n
 define_bool CONFIG_EISA n
diff -uNr linux-2.4.10-pre12/arch/sparc64/defconfig linux-rwsem/arch/sparc64/defconfig
--- linux-2.4.10-pre12/arch/sparc64/defconfig	Wed Sep 19 10:39:08 2001
+++ linux-rwsem/arch/sparc64/defconfig	Wed Sep 19 14:46:18 2001
@@ -23,8 +23,6 @@
 CONFIG_SMP=y
 CONFIG_SPARC64=y
 CONFIG_HAVE_DEC_LOCK=y
-# CONFIG_RWSEM_GENERIC_SPINLOCK is not set
-CONFIG_RWSEM_XCHGADD_ALGORITHM=y
 # CONFIG_ISA is not set
 # CONFIG_ISAPNP is not set
 # CONFIG_EISA is not set
diff -uNr linux-2.4.10-pre12/drivers/char/sysrq.c linux-rwsem/drivers/char/sysrq.c
--- linux-2.4.10-pre12/drivers/char/sysrq.c	Wed Sep 19 10:39:11 2001
+++ linux-rwsem/drivers/char/sysrq.c	Wed Sep 19 15:17:20 2001
@@ -32,7 +32,7 @@
 
 #include <asm/ptrace.h>
 
-extern void wakeup_bdflush(int);
+extern void wakeup_bdflush();
 extern void reset_vc(unsigned int);
 extern struct list_head super_blocks;
 
@@ -221,7 +221,7 @@
 static void sysrq_handle_sync(int key, struct pt_regs *pt_regs,
 		struct kbd_struct *kbd, struct tty_struct *tty) {
 	emergency_sync_scheduled = EMERG_SYNC;
-	wakeup_bdflush(0);
+	wakeup_bdflush();
 }
 static struct sysrq_key_op sysrq_sync_op = {
 	handler:	sysrq_handle_sync,
@@ -232,7 +232,7 @@
 static void sysrq_handle_mountro(int key, struct pt_regs *pt_regs,
 		struct kbd_struct *kbd, struct tty_struct *tty) {
 	emergency_sync_scheduled = EMERG_REMOUNT;
-	wakeup_bdflush(0);
+	wakeup_bdflush();
 }
 static struct sysrq_key_op sysrq_mountro_op = {
 	handler:	sysrq_handle_mountro,
diff -uNr linux-2.4.10-pre12/include/asm-alpha/rwsem.h linux-rwsem/include/asm-alpha/rwsem.h
--- linux-2.4.10-pre12/include/asm-alpha/rwsem.h	Tue Sep 18 08:45:14 2001
+++ linux-rwsem/include/asm-alpha/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,208 +0,0 @@
-#ifndef _ALPHA_RWSEM_H
-#define _ALPHA_RWSEM_H
-
-/*
- * Written by Ivan Kokshaysky <ink@jurassic.park.msu.ru>, 2001.
- * Based on asm-alpha/semaphore.h and asm-i386/rwsem.h
- */
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <asm/compiler.h>
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *);
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	long			count;
-#define RWSEM_UNLOCKED_VALUE		0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS		0x0000000000000001L
-#define RWSEM_ACTIVE_MASK		0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS		(-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-	{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
-	LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count += RWSEM_ACTIVE_READ_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	mb\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount < 0, 0))
-		rwsem_down_read_failed(sem);
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count += RWSEM_ACTIVE_WRITE_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	mb\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount, 0))
-		rwsem_down_write_failed(sem);
-}
-
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	long oldcount;
-#ifndef	CONFIG_SMP
-	oldcount = sem->count;
-	sem->count -= RWSEM_ACTIVE_READ_BIAS;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"	mb\n"
-	"1:	ldq_l	%0,%1\n"
-	"	subq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_READ_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(oldcount < 0, 0)) 
-		if ((int)oldcount - RWSEM_ACTIVE_READ_BIAS == 0)
-			rwsem_wake(sem);
-}
-
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	long count;
-#ifndef	CONFIG_SMP
-	sem->count -= RWSEM_ACTIVE_WRITE_BIAS;
-	count = sem->count;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"	mb\n"
-	"1:	ldq_l	%0,%1\n"
-	"	subq	%0,%3,%2\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	"	subq	%0,%3,%0\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (count), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (RWSEM_ACTIVE_WRITE_BIAS), "m" (sem->count) : "memory");
-#endif
-	if (__builtin_expect(count, 0))
-		if ((int)count == 0)
-			rwsem_wake(sem);
-}
-
-static inline void rwsem_atomic_add(long val, struct rw_semaphore *sem)
-{
-#ifndef	CONFIG_SMP
-	sem->count += val;
-#else
-	long temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq	%0,%2,%0\n"
-	"	stq_c	%0,%1\n"
-	"	beq	%0,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (temp), "=m" (sem->count)
-	:"Ir" (val), "m" (sem->count));
-#endif
-}
-
-static inline long rwsem_atomic_update(long val, struct rw_semaphore *sem)
-{
-#ifndef	CONFIG_SMP
-	sem->count += val;
-	return sem->count;
-#else
-	long ret, temp;
-	__asm__ __volatile__(
-	"1:	ldq_l	%0,%1\n"
-	"	addq 	%0,%3,%2\n"
-	"	addq	%0,%3,%0\n"
-	"	stq_c	%2,%1\n"
-	"	beq	%2,2f\n"
-	".subsection 2\n"
-	"2:	br	1b\n"
-	".previous"
-	:"=&r" (ret), "=m" (sem->count), "=&r" (temp)
-	:"Ir" (val), "m" (sem->count));
-
-	return ret;
-#endif
-}
-
-#endif /* __KERNEL__ */
-#endif /* _ALPHA_RWSEM_H */
diff -uNr linux-2.4.10-pre12/include/asm-i386/rwsem.h linux-rwsem/include/asm-i386/rwsem.h
--- linux-2.4.10-pre12/include/asm-i386/rwsem.h	Tue Sep 18 08:45:13 2001
+++ linux-rwsem/include/asm-i386/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,226 +0,0 @@
-/* rwsem.h: R/W semaphores implemented using XADD/CMPXCHG for i486+
- *
- * Written by David Howells (dhowells@redhat.com).
- *
- * Derived from asm-i386/semaphore.h
- *
- *
- * The MSW of the count is the negated number of active writers and waiting
- * lockers, and the LSW is the total number of active locks
- *
- * The lock count is initialized to 0 (no active and no waiting lockers).
- *
- * When a writer subtracts WRITE_BIAS, it'll get 0xffff0001 for the case of an
- * uncontended lock. This can be determined because XADD returns the old value.
- * Readers increment by 1 and see a positive value when uncontended, negative
- * if there are writers (and maybe) readers waiting (in which case it goes to
- * sleep).
- *
- * The value of WAITING_BIAS supports up to 32766 waiting processes. This can
- * be extended to 65534 by manually checking the whole MSW rather than relying
- * on the S flag.
- *
- * The value of ACTIVE_BIAS supports up to 65535 active processes.
- *
- * This should be totally fair - if anything is waiting, a process that wants a
- * lock will go to the back of the queue. When the currently active lock is
- * released, if there's a writer at the front of the queue, then that and only
- * that will be woken up; if there's a bunch of consequtive readers at the
- * front, then they'll all be woken up, but no other readers will be.
- */
-
-#ifndef _I386_RWSEM_H
-#define _I386_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *FASTCALL(rwsem_down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *));
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	signed long		count;
-#define RWSEM_UNLOCKED_VALUE		0x00000000
-#define RWSEM_ACTIVE_BIAS		0x00000001
-#define RWSEM_ACTIVE_MASK		0x0000ffff
-#define RWSEM_WAITING_BIAS		(-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) \
-	__RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning down_read\n\t"
-LOCK_PREFIX	"  incl      (%%eax)\n\t" /* adds 0x00000001, returns the old value */
-		"  js        2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  pushl     %%edx\n\t"
-		"  call      rwsem_down_read_failed\n\t"
-		"  popl      %%edx\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous"
-		"# ending down_read\n\t"
-		: "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	int tmp;
-
-	tmp = RWSEM_ACTIVE_WRITE_BIAS;
-	__asm__ __volatile__(
-		"# beginning down_write\n\t"
-LOCK_PREFIX	"  xadd      %0,(%%eax)\n\t" /* subtract 0x0000ffff, returns the old value */
-		"  testl     %0,%0\n\t" /* was the count 0 before? */
-		"  jnz       2f\n\t" /* jump if we weren't granted the lock */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_down_write_failed\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending down_write"
-		: "+d"(tmp), "+m"(sem->count)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	__s32 tmp = -RWSEM_ACTIVE_READ_BIAS;
-	__asm__ __volatile__(
-		"# beginning __up_read\n\t"
-LOCK_PREFIX	"  xadd      %%edx,(%%eax)\n\t" /* subtracts 1, returns the old value */
-		"  js        2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* do nothing if still outstanding active readers */
-		"  jnz       1b\n\t"
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_read\n"
-		: "+m"(sem->count), "+d"(tmp)
-		: "a"(sem)
-		: "memory", "cc");
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"# beginning __up_write\n\t"
-		"  movl      %2,%%edx\n\t"
-LOCK_PREFIX	"  xaddl     %%edx,(%%eax)\n\t" /* tries to transition 0xffff0001 -> 0x00000000 */
-		"  jnz       2f\n\t" /* jump if the lock is being waited upon */
-		"1:\n\t"
-		".section .text.lock,\"ax\"\n"
-		"2:\n\t"
-		"  decw      %%dx\n\t" /* did the active count reduce to 0? */
-		"  jnz       1b\n\t" /* jump back if not */
-		"  pushl     %%ecx\n\t"
-		"  call      rwsem_wake\n\t"
-		"  popl      %%ecx\n\t"
-		"  jmp       1b\n"
-		".previous\n"
-		"# ending __up_write\n"
-		: "+m"(sem->count)
-		: "a"(sem), "i"(-RWSEM_ACTIVE_WRITE_BIAS)
-		: "memory", "cc", "edx");
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-LOCK_PREFIX	"addl %1,%0"
-		:"=m"(sem->count)
-		:"ir"(delta), "m"(sem->count));
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
-	int tmp = delta;
-
-	__asm__ __volatile__(
-LOCK_PREFIX	"xadd %0,(%2)"
-		: "+r"(tmp), "=m"(sem->count)
-		: "r"(sem), "m"(sem->count)
-		: "memory");
-
-	return tmp+delta;
-}
-
-#endif /* __KERNEL__ */
-#endif /* _I386_RWSEM_H */
diff -uNr linux-2.4.10-pre12/include/asm-ppc/rwsem.h linux-rwsem/include/asm-ppc/rwsem.h
--- linux-2.4.10-pre12/include/asm-ppc/rwsem.h	Tue Sep 18 08:45:14 2001
+++ linux-rwsem/include/asm-ppc/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,137 +0,0 @@
-/*
- * BK Id: SCCS/s.rwsem.h 1.6 05/17/01 18:14:25 cort
- */
-/*
- * include/asm-ppc/rwsem.h: R/W semaphores for PPC using the stuff
- * in lib/rwsem.c.  Adapted largely from include/asm-i386/rwsem.h
- * by Paul Mackerras <paulus@samba.org>.
- */
-
-#ifndef _PPC_RWSEM_H
-#define _PPC_RWSEM_H
-
-#ifdef __KERNEL__
-#include <linux/list.h>
-#include <linux/spinlock.h>
-#include <asm/atomic.h>
-#include <asm/system.h>
-
-/*
- * the semaphore definition
- */
-struct rw_semaphore {
-	/* XXX this should be able to be an atomic_t  -- paulus */
-	signed long		count;
-#define RWSEM_UNLOCKED_VALUE		0x00000000
-#define RWSEM_ACTIVE_BIAS		0x00000001
-#define RWSEM_ACTIVE_MASK		0x0000ffff
-#define RWSEM_WAITING_BIAS		(-0x00010000)
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-	{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, \
-	  LIST_HEAD_INIT((name).wait_list) \
-	  __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name)		\
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
-extern struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * lock for reading
- */
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	if (atomic_inc_return((atomic_t *)(&sem->count)) >= 0)
-		smp_wmb();
-	else
-		rwsem_down_read_failed(sem);
-}
-
-/*
- * lock for writing
- */
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	int tmp;
-
-	tmp = atomic_add_return(RWSEM_ACTIVE_WRITE_BIAS,
-				(atomic_t *)(&sem->count));
-	if (tmp == RWSEM_ACTIVE_WRITE_BIAS)
-		smp_wmb();
-	else
-		rwsem_down_write_failed(sem);
-}
-
-/*
- * unlock after reading
- */
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	int tmp;
-
-	smp_wmb();
-	tmp = atomic_dec_return((atomic_t *)(&sem->count));
-	if (tmp < -1 && (tmp & RWSEM_ACTIVE_MASK) == 0)
-		rwsem_wake(sem);
-}
-
-/*
- * unlock after writing
- */
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	smp_wmb();
-	if (atomic_sub_return(RWSEM_ACTIVE_WRITE_BIAS,
-			      (atomic_t *)(&sem->count)) < 0)
-		rwsem_wake(sem);
-}
-
-/*
- * implement atomic add functionality
- */
-static inline void rwsem_atomic_add(int delta, struct rw_semaphore *sem)
-{
-	atomic_add(delta, (atomic_t *)(&sem->count));
-}
-
-/*
- * implement exchange and add functionality
- */
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
-	smp_mb();
-	return atomic_add_return(delta, (atomic_t *)(&sem->count));
-}
-
-#endif /* __KERNEL__ */
-#endif /* _PPC_RWSEM_XADD_H */
diff -uNr linux-2.4.10-pre12/include/asm-sparc64/rwsem.h linux-rwsem/include/asm-sparc64/rwsem.h
--- linux-2.4.10-pre12/include/asm-sparc64/rwsem.h	Tue Sep 18 08:45:14 2001
+++ linux-rwsem/include/asm-sparc64/rwsem.h	Thu Jan  1 01:00:00 1970
@@ -1,233 +0,0 @@
-/* $Id: rwsem.h,v 1.4 2001/04/26 02:36:36 davem Exp $
- * rwsem.h: R/W semaphores implemented using CAS
- *
- * Written by David S. Miller (davem@redhat.com), 2001.
- * Derived from asm-i386/rwsem.h
- */
-#ifndef _SPARC64_RWSEM_H
-#define _SPARC64_RWSEM_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include asm/rwsem.h directly, use linux/rwsem.h instead
-#endif
-
-#ifdef __KERNEL__
-
-#include <linux/list.h>
-#include <linux/spinlock.h>
-
-struct rwsem_waiter;
-
-extern struct rw_semaphore *FASTCALL(rwsem_down_read_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_down_write_failed(struct rw_semaphore *sem));
-extern struct rw_semaphore *FASTCALL(rwsem_wake(struct rw_semaphore *));
-
-struct rw_semaphore {
-	signed int count;
-#define RWSEM_UNLOCKED_VALUE		0x00000000
-#define RWSEM_ACTIVE_BIAS		0x00000001
-#define RWSEM_ACTIVE_MASK		0x0000ffff
-#define RWSEM_WAITING_BIAS		0xffff0000
-#define RWSEM_ACTIVE_READ_BIAS		RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS		(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-};
-
-#define __RWSEM_INITIALIZER(name) \
-{ RWSEM_UNLOCKED_VALUE, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-static inline void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->count = RWSEM_UNLOCKED_VALUE;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-}
-
-static inline void __down_read(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"! beginning __down_read\n"
-		"1:\tlduw	[%0], %%g5\n\t"
-		"add		%%g5, 1, %%g7\n\t"
-		"cas		[%0], %%g5, %%g7\n\t"
-		"cmp		%%g5, %%g7\n\t"
-		"bne,pn		%%icc, 1b\n\t"
-		" add		%%g7, 1, %%g7\n\t"
-		"cmp		%%g7, 0\n\t"
-		"bl,pn		%%icc, 3f\n\t"
-		" membar	#StoreStore\n"
-		"2:\n\t"
-		".subsection	2\n"
-		"3:\tmov	%0, %%g5\n\t"
-		"save		%%sp, -160, %%sp\n\t"
-		"mov		%%g1, %%l1\n\t"
-		"mov		%%g2, %%l2\n\t"
-		"mov		%%g3, %%l3\n\t"
-		"call		%1\n\t"
-		" mov		%%g5, %%o0\n\t"
-		"mov		%%l1, %%g1\n\t"
-		"mov		%%l2, %%g2\n\t"
-		"ba,pt		%%xcc, 2b\n\t"
-		" restore	%%l3, %%g0, %%g3\n\t"
-		".previous\n\t"
-		"! ending __down_read"
-		: : "r" (sem), "i" (rwsem_down_read_failed)
-		: "g5", "g7", "memory", "cc");
-}
-
-static inline void __down_write(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"! beginning __down_write\n\t"
-		"sethi		%%hi(%2), %%g1\n\t"
-		"or		%%g1, %%lo(%2), %%g1\n"
-		"1:\tlduw	[%0], %%g5\n\t"
-		"add		%%g5, %%g1, %%g7\n\t"
-		"cas		[%0], %%g5, %%g7\n\t"
-		"cmp		%%g5, %%g7\n\t"
-		"bne,pn		%%icc, 1b\n\t"
-		" cmp		%%g7, 0\n\t"
-		"bne,pn		%%icc, 3f\n\t"
-		" membar	#StoreStore\n"
-		"2:\n\t"
-		".subsection	2\n"
-		"3:\tmov	%0, %%g5\n\t"
-		"save		%%sp, -160, %%sp\n\t"
-		"mov		%%g2, %%l2\n\t"
-		"mov		%%g3, %%l3\n\t"
-		"call		%1\n\t"
-		" mov		%%g5, %%o0\n\t"
-		"mov		%%l2, %%g2\n\t"
-		"ba,pt		%%xcc, 2b\n\t"
-		" restore	%%l3, %%g0, %%g3\n\t"
-		".previous\n\t"
-		"! ending __down_write"
-		: : "r" (sem), "i" (rwsem_down_write_failed),
-		    "i" (RWSEM_ACTIVE_WRITE_BIAS)
-		: "g1", "g5", "g7", "memory", "cc");
-}
-
-static inline void __up_read(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"! beginning __up_read\n\t"
-		"1:\tlduw	[%0], %%g5\n\t"
-		"sub		%%g5, 1, %%g7\n\t"
-		"cas		[%0], %%g5, %%g7\n\t"
-		"cmp		%%g5, %%g7\n\t"
-		"bne,pn		%%icc, 1b\n\t"
-		" cmp		%%g7, 0\n\t"
-		"bl,pn		%%icc, 3f\n\t"
-		" membar	#StoreStore\n"
-		"2:\n\t"
-		".subsection	2\n"
-		"3:\tsethi	%%hi(%2), %%g1\n\t"
-		"sub		%%g7, 1, %%g7\n\t"
-		"or		%%g1, %%lo(%2), %%g1\n\t"
-		"andcc		%%g7, %%g1, %%g0\n\t"
-		"bne,pn		%%icc, 2b\n\t"
-		" mov		%0, %%g5\n\t"
-		"save		%%sp, -160, %%sp\n\t"
-		"mov		%%g2, %%l2\n\t"
-		"mov		%%g3, %%l3\n\t"
-		"call		%1\n\t"
-		" mov		%%g5, %%o0\n\t"
-		"mov		%%l2, %%g2\n\t"
-		"ba,pt		%%xcc, 2b\n\t"
-		" restore	%%l3, %%g0, %%g3\n\t"
-		".previous\n\t"
-		"! ending __up_read"
-		: : "r" (sem), "i" (rwsem_wake),
-		    "i" (RWSEM_ACTIVE_MASK)
-		: "g1", "g5", "g7", "memory", "cc");
-}
-
-static inline void __up_write(struct rw_semaphore *sem)
-{
-	__asm__ __volatile__(
-		"! beginning __up_write\n\t"
-		"sethi		%%hi(%2), %%g1\n\t"
-		"or		%%g1, %%lo(%2), %%g1\n"
-		"1:\tlduw	[%0], %%g5\n\t"
-		"sub		%%g5, %%g1, %%g7\n\t"
-		"cas		[%0], %%g5, %%g7\n\t"
-		"cmp		%%g5, %%g7\n\t"
-		"bne,pn		%%icc, 1b\n\t"
-		" sub		%%g7, %%g1, %%g7\n\t"
-		"cmp		%%g7, 0\n\t"
-		"bl,pn		%%icc, 3f\n\t"
-		" membar	#StoreStore\n"
-		"2:\n\t"
-		".subsection 2\n"
-		"3:\tmov	%0, %%g5\n\t"
-		"save		%%sp, -160, %%sp\n\t"
-		"mov		%%g2, %%l2\n\t"
-		"mov		%%g3, %%l3\n\t"
-		"call		%1\n\t"
-		" mov		%%g5, %%o0\n\t"
-		"mov		%%l2, %%g2\n\t"
-		"ba,pt		%%xcc, 2b\n\t"
-		" restore	%%l3, %%g0, %%g3\n\t"
-		".previous\n\t"
-		"! ending __up_write"
-		: : "r" (sem), "i" (rwsem_wake),
-		    "i" (RWSEM_ACTIVE_WRITE_BIAS)
-		: "g1", "g5", "g7", "memory", "cc");
-}
-
-static inline int rwsem_atomic_update(int delta, struct rw_semaphore *sem)
-{
-	int tmp = delta;
-
-	__asm__ __volatile__(
-		"1:\tlduw	[%2], %%g5\n\t"
-		"add		%%g5, %1, %%g7\n\t"
-		"cas		[%2], %%g5, %%g7\n\t"
-		"cmp		%%g5, %%g7\n\t"
-		"bne,pn		%%icc, 1b\n\t"
-		" nop\n\t"
-		"mov		%%g7, %0\n\t"
-		: "=&r" (tmp)
-		: "0" (tmp), "r" (sem)
-		: "g5", "g7", "memory");
-
-	return tmp + delta;
-}
-
-#define rwsem_atomic_add rwsem_atomic_update
-
-static inline __u16 rwsem_cmpxchgw(struct rw_semaphore *sem, __u16 __old, __u16 __new)
-{
-	u32 old = (sem->count & 0xffff0000) | (u32) __old;
-	u32 new = (old & 0xffff0000) | (u32) __new;
-	u32 prev;
-
-again:
-	__asm__ __volatile__("cas	[%2], %3, %0\n\t"
-			     "membar	#StoreStore | #StoreLoad"
-			     : "=&r" (prev)
-			     : "0" (new), "r" (sem), "r" (old)
-			     : "memory");
-
-	/* To give the same semantics as x86 cmpxchgw, keep trying
-	 * if only the upper 16-bits changed.
-	 */
-	if (prev != old &&
-	    ((prev & 0xffff) == (old & 0xffff)))
-		goto again;
-
-	return prev & 0xffff;
-}
-
-static inline signed long rwsem_cmpxchg(struct rw_semaphore *sem, signed long old, signed long new)
-{
-	return cmpxchg(&sem->count,old,new);
-}
-
-#endif /* __KERNEL__ */
-
-#endif /* _SPARC64_RWSEM_H */
diff -uNr linux-2.4.10-pre12/include/linux/rwsem-spinlock.h linux-rwsem/include/linux/rwsem-spinlock.h
--- linux-2.4.10-pre12/include/linux/rwsem-spinlock.h	Tue Sep 18 08:45:13 2001
+++ linux-rwsem/include/linux/rwsem-spinlock.h	Thu Jan  1 01:00:00 1970
@@ -1,62 +0,0 @@
-/* rwsem-spinlock.h: fallback C implementation
- *
- * Copyright (c) 2001   David Howells (dhowells@redhat.com).
- * - Derived partially from ideas by Andrea Arcangeli <andrea@suse.de>
- * - Derived also from comments by Linus
- */
-
-#ifndef _LINUX_RWSEM_SPINLOCK_H
-#define _LINUX_RWSEM_SPINLOCK_H
-
-#ifndef _LINUX_RWSEM_H
-#error please dont include linux/rwsem-spinlock.h directly, use linux/rwsem.h instead
-#endif
-
-#include <linux/spinlock.h>
-#include <linux/list.h>
-
-#ifdef __KERNEL__
-
-#include <linux/types.h>
-
-struct rwsem_waiter;
-
-/*
- * the rw-semaphore definition
- * - if activity is 0 then there are no active readers or writers
- * - if activity is +ve then that is the number of active readers
- * - if activity is -1 then there is one active writer
- * - if wait_list is not empty, then there are processes waiting for the semaphore
- */
-struct rw_semaphore {
-	__s32			activity;
-	spinlock_t		wait_lock;
-	struct list_head	wait_list;
-#if RWSEM_DEBUG
-	int			debug;
-#endif
-};
-
-/*
- * initialisation
- */
-#if RWSEM_DEBUG
-#define __RWSEM_DEBUG_INIT      , 0
-#else
-#define __RWSEM_DEBUG_INIT	/* */
-#endif
-
-#define __RWSEM_INITIALIZER(name) \
-{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) __RWSEM_DEBUG_INIT }
-
-#define DECLARE_RWSEM(name) \
-	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
-
-extern void FASTCALL(init_rwsem(struct rw_semaphore *sem));
-extern void FASTCALL(__down_read(struct rw_semaphore *sem));
-extern void FASTCALL(__down_write(struct rw_semaphore *sem));
-extern void FASTCALL(__up_read(struct rw_semaphore *sem));
-extern void FASTCALL(__up_write(struct rw_semaphore *sem));
-
-#endif /* __KERNEL__ */
-#endif /* _LINUX_RWSEM_SPINLOCK_H */
diff -uNr linux-2.4.10-pre12/include/linux/rwsem.h linux-rwsem/include/linux/rwsem.h
--- linux-2.4.10-pre12/include/linux/rwsem.h	Tue Sep 18 08:45:13 2001
+++ linux-rwsem/include/linux/rwsem.h	Wed Sep 19 14:50:22 2001
@@ -9,40 +9,59 @@
 
 #include <linux/linkage.h>
 
-#define RWSEM_DEBUG 0
-
 #ifdef __KERNEL__
 
 #include <linux/config.h>
 #include <linux/types.h>
 #include <linux/kernel.h>
+#include <linux/spinlock.h>
+#include <linux/list.h>
 #include <asm/system.h>
-#include <asm/atomic.h>
 
-struct rw_semaphore;
+/*
+ * the rw-semaphore definition
+ * - if activity is 0 then there are no active readers or writers
+ * - if activity is +ve then that is the number of active readers
+ * - if activity is -1 then there is one active writer
+ * - if wait_list is not empty, then there are processes waiting for the semaphore
+ */
+struct rw_semaphore {
+	int			activity;
+	spinlock_t		lock;
+	struct list_head	wait_list;
+};
+
+/*
+ * initialisation
+ */
+#define __RWSEM_INITIALIZER(name) \
+{ 0, SPIN_LOCK_UNLOCKED, LIST_HEAD_INIT((name).wait_list) }
+
+#define DECLARE_RWSEM(name) \
+	struct rw_semaphore name = __RWSEM_INITIALIZER(name)
+
+static inline void init_rwsem(struct rw_semaphore *sem)
+{
+	sem->activity = 0;
+	spin_lock_init(&sem->lock);
+	INIT_LIST_HEAD(&sem->wait_list);
+}
 
-#ifdef CONFIG_RWSEM_GENERIC_SPINLOCK
-#include <linux/rwsem-spinlock.h> /* use a generic implementation */
-#else
-#include <asm/rwsem.h> /* use an arch-specific implementation */
-#endif
-
-#ifndef rwsemtrace
-#if RWSEM_DEBUG
-extern void FASTCALL(rwsemtrace(struct rw_semaphore *sem, const char *str));
-#else
-#define rwsemtrace(SEM,FMT)
-#endif
-#endif
+extern void FASTCALL(__rwsem_wait(struct rw_semaphore *sem, int bias));
+extern void FASTCALL(__rwsem_wake(struct rw_semaphore *sem));
 
 /*
  * lock for reading
  */
 static inline void down_read(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering down_read");
-	__down_read(sem);
-	rwsemtrace(sem,"Leaving down_read");
+	spin_lock(&sem->lock);
+	if (sem->activity>=0) {
+		sem->activity++;
+		spin_unlock(&sem->lock);
+	}
+	else
+		__rwsem_wait(sem,1);
 }
 
 /*
@@ -50,9 +69,13 @@
  */
 static inline void down_write(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering down_write");
-	__down_write(sem);
-	rwsemtrace(sem,"Leaving down_write");
+	spin_lock(&sem->lock);
+	if (sem->activity==0) {
+		sem->activity--;
+		spin_unlock(&sem->lock);
+	}
+	else
+		__rwsem_wait(sem,-1);
 }
 
 /*
@@ -60,9 +83,10 @@
  */
 static inline void up_read(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering up_read");
-	__up_read(sem);
-	rwsemtrace(sem,"Leaving up_read");
+	spin_lock(&sem->lock);
+	if (!--sem->activity && !list_empty(&sem->wait_list))
+		__rwsem_wake(sem);
+	spin_unlock(&sem->lock);
 }
 
 /*
@@ -70,9 +94,11 @@
  */
 static inline void up_write(struct rw_semaphore *sem)
 {
-	rwsemtrace(sem,"Entering up_write");
-	__up_write(sem);
-	rwsemtrace(sem,"Leaving up_write");
+	spin_lock(&sem->lock);
+	sem->activity++;
+	if (!list_empty(&sem->wait_list))
+		__rwsem_wake(sem);
+	spin_unlock(&sem->lock);
 }
 
 
diff -uNr linux-2.4.10-pre12/lib/Makefile linux-rwsem/lib/Makefile
--- linux-2.4.10-pre12/lib/Makefile	Wed Sep 19 10:39:23 2001
+++ linux-rwsem/lib/Makefile	Wed Sep 19 14:49:09 2001
@@ -8,12 +8,9 @@
 
 L_TARGET := lib.a
 
-export-objs := cmdline.o dec_and_lock.o rwsem-spinlock.o rwsem.o
+export-objs := cmdline.o dec_and_lock.o rwsem.o
 
-obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o
-
-obj-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
-obj-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
+obj-y := errno.o ctype.o string.o vsprintf.o brlock.o cmdline.o bust_spinlocks.o rbtree.o rwsem.o
 
 ifneq ($(CONFIG_HAVE_DEC_LOCK),y) 
   obj-y += dec_and_lock.o
diff -uNr linux-2.4.10-pre12/lib/rwsem-spinlock.c linux-rwsem/lib/rwsem-spinlock.c
--- linux-2.4.10-pre12/lib/rwsem-spinlock.c	Tue Sep 18 08:45:12 2001
+++ linux-rwsem/lib/rwsem-spinlock.c	Thu Jan  1 01:00:00 1970
@@ -1,239 +0,0 @@
-/* rwsem-spinlock.c: R/W semaphores: contention handling functions for generic spinlock
- *                                   implementation
- *
- * Copyright (c) 2001   David Howells (dhowells@redhat.com).
- * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de>
- * - Derived also from comments by Linus
- */
-#include <linux/rwsem.h>
-#include <linux/sched.h>
-#include <linux/module.h>
-
-struct rwsem_waiter {
-	struct list_head	list;
-	struct task_struct	*task;
-	unsigned int		flags;
-#define RWSEM_WAITING_FOR_READ	0x00000001
-#define RWSEM_WAITING_FOR_WRITE	0x00000002
-};
-
-#if RWSEM_DEBUG
-void rwsemtrace(struct rw_semaphore *sem, const char *str)
-{
-	if (sem->debug)
-		printk("[%d] %s({%d,%d})\n",
-		       current->pid,str,sem->activity,list_empty(&sem->wait_list)?0:1);
-}
-#endif
-
-/*
- * initialise the semaphore
- */
-void init_rwsem(struct rw_semaphore *sem)
-{
-	sem->activity = 0;
-	spin_lock_init(&sem->wait_lock);
-	INIT_LIST_HEAD(&sem->wait_list);
-#if RWSEM_DEBUG
-	sem->debug = 0;
-#endif
-}
-
-/*
- * handle the lock being released whilst there are processes blocked on it that can now run
- * - if we come here, then:
- *   - the 'active count' _reached_ zero
- *   - the 'waiting count' is non-zero
- * - the spinlock must be held by the caller
- * - woken process blocks are discarded from the list after having flags zeroised
- */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-	int woken;
-
-	rwsemtrace(sem,"Entering __rwsem_do_wake");
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-
-	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - we leave the 'waiting count' incremented to signify potential contention
-	 */
-	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
-		sem->activity = -1;
-		list_del(&waiter->list);
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-		goto out;
-	}
-
-	/* grant an infinite number of read locks to the readers at the front of the queue */
-	woken = 0;
-	do {
-		list_del(&waiter->list);
-		waiter->flags = 0;
-		wake_up_process(waiter->task);
-		woken++;
-		if (list_empty(&sem->wait_list))
-			break;
-		waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-	} while (waiter->flags&RWSEM_WAITING_FOR_READ);
-
-	sem->activity += woken;
-
- out:
-	rwsemtrace(sem,"Leaving __rwsem_do_wake");
-	return sem;
-}
-
-/*
- * wake a single writer
- */
-static inline struct rw_semaphore *__rwsem_wake_one_writer(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter *waiter;
-
-	sem->activity = -1;
-
-	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
-	list_del(&waiter->list);
-
-	waiter->flags = 0;
-	wake_up_process(waiter->task);
-	return sem;
-}
-
-/*
- * get a read lock on the semaphore
- */
-void __down_read(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-	struct task_struct *tsk;
-
-	rwsemtrace(sem,"Entering __down_read");
-
-	spin_lock(&sem->wait_lock);
-
-	if (sem->activity>=0 && list_empty(&sem->wait_list)) {
-		/* granted */
-		sem->activity++;
-		spin_unlock(&sem->wait_lock);
-		goto out;
-	}
-
-	tsk = current;
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-
-	list_add_tail(&waiter.list,&sem->wait_list);
-
-	/* we don't need to touch the semaphore struct anymore */
-	spin_unlock(&sem->wait_lock);
-
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.flags)
-			break;
-		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
- out:
-	rwsemtrace(sem,"Leaving __down_read");
-}
-
-/*
- * get a write lock on the semaphore
- * - note that we increment the waiting count anyway to indicate an exclusive lock
- */
-void __down_write(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-	struct task_struct *tsk;
-
-	rwsemtrace(sem,"Entering __down_write");
-
-	spin_lock(&sem->wait_lock);
-
-	if (sem->activity==0 && list_empty(&sem->wait_list)) {
-		/* granted */
-		sem->activity = -1;
-		spin_unlock(&sem->wait_lock);
-		goto out;
-	}
-
-	tsk = current;
-	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
-
-	/* set up my own style of waitqueue */
-	waiter.task = tsk;
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-
-	list_add_tail(&waiter.list,&sem->wait_list);
-
-	/* we don't need to touch the semaphore struct anymore */
-	spin_unlock(&sem->wait_lock);
-
-	/* wait to be given the lock */
-	for (;;) {
-		if (!waiter.flags)
-			break;
-		schedule();
-		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-	}
-
-	tsk->state = TASK_RUNNING;
-
- out:
-	rwsemtrace(sem,"Leaving __down_write");
-}
-
-/*
- * release a read lock on the semaphore
- */
-void __up_read(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering __up_read");
-
-	spin_lock(&sem->wait_lock);
-
-	if (--sem->activity==0 && !list_empty(&sem->wait_list))
-		sem = __rwsem_wake_one_writer(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving __up_read");
-}
-
-/*
- * release a write lock on the semaphore
- */
-void __up_write(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering __up_write");
-
-	spin_lock(&sem->wait_lock);
-
-	sem->activity = 0;
-	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving __up_write");
-}
-
-EXPORT_SYMBOL(init_rwsem);
-EXPORT_SYMBOL(__down_read);
-EXPORT_SYMBOL(__down_write);
-EXPORT_SYMBOL(__up_read);
-EXPORT_SYMBOL(__up_write);
-#if RWSEM_DEBUG
-EXPORT_SYMBOL(rwsemtrace);
-#endif
diff -uNr linux-2.4.10-pre12/lib/rwsem.c linux-rwsem/lib/rwsem.c
--- linux-2.4.10-pre12/lib/rwsem.c	Tue Sep 18 08:45:12 2001
+++ linux-rwsem/lib/rwsem.c	Wed Sep 19 15:09:25 2001
@@ -1,7 +1,8 @@
 /* rwsem.c: R/W semaphores: contention handling functions
  *
- * Written by David Howells (dhowells@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
+ * Copyright (c) 2001   David Howells (dhowells@redhat.com).
+ * - Derived partially from idea by Andrea Arcangeli <andrea@suse.de>
+ * - Derived also from comments by Linus
  */
 #include <linux/rwsem.h>
 #include <linux/sched.h>
@@ -15,196 +16,78 @@
 #define RWSEM_WAITING_FOR_WRITE	0x00000002
 };
 
-#if RWSEM_DEBUG
-#undef rwsemtrace
-void rwsemtrace(struct rw_semaphore *sem, const char *str)
-{
-	printk("sem=%p\n",sem);
-	printk("(sem)=%08lx\n",sem->count);
-	if (sem->debug)
-		printk("[%d] %s({%08lx})\n",current->pid,str,sem->count);
-}
-#endif
-
 /*
  * handle the lock being released whilst there are processes blocked on it that can now run
  * - if we come here, then:
- *   - the 'active part' of the count (&0x0000ffff) reached zero but has been re-incremented
- *   - the 'waiting part' of the count (&0xffff0000) is negative (and will still be so)
- *   - there must be someone on the queue
+ *   - the 'active count' _reached_ zero
+ *   - the 'waiting count' is non-zero
  * - the spinlock must be held by the caller
  * - woken process blocks are discarded from the list after having flags zeroised
  */
-static inline struct rw_semaphore *__rwsem_do_wake(struct rw_semaphore *sem)
+void __rwsem_wake(struct rw_semaphore *sem)
 {
 	struct rwsem_waiter *waiter;
-	struct list_head *next;
-	signed long oldcount;
-	int woken, loop;
-
-	rwsemtrace(sem,"Entering __rwsem_do_wake");
-
-	/* only wake someone up if we can transition the active part of the count from 0 -> 1 */
- try_again:
-	oldcount = rwsem_atomic_update(RWSEM_ACTIVE_BIAS,sem) - RWSEM_ACTIVE_BIAS;
-	if (oldcount & RWSEM_ACTIVE_MASK)
-		goto undo;
+	int woken;
 
 	waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
 
 	/* try to grant a single write lock if there's a writer at the front of the queue
-	 * - note we leave the 'active part' of the count incremented by 1 and the waiting part
-	 *   incremented by 0x00010000
+	 * - we leave the 'waiting count' incremented to signify potential contention
 	 */
-	if (!(waiter->flags & RWSEM_WAITING_FOR_WRITE))
-		goto readers_only;
+	if (waiter->flags & RWSEM_WAITING_FOR_WRITE) {
+		sem->activity = -1;
+		list_del(&waiter->list);
+		waiter->flags = 0;
+		wake_up_process(waiter->task);
+		return;
+	}
 
-	list_del(&waiter->list);
-	waiter->flags = 0;
-	wake_up_process(waiter->task);
-	goto out;
-
-	/* grant an infinite number of read locks to the readers at the front of the queue
-	 * - note we increment the 'active part' of the count by the number of readers (less one
-	 *   for the activity decrement we've already done) before waking any processes up
-	 */
- readers_only:
+	/* grant an infinite number of read locks to the readers at the front of the queue */
 	woken = 0;
 	do {
-		woken++;
-
-		if (waiter->list.next==&sem->wait_list)
-			break;
-
-		waiter = list_entry(waiter->list.next,struct rwsem_waiter,list);
-
-	} while (waiter->flags & RWSEM_WAITING_FOR_READ);
-
-	loop = woken;
-	woken *= RWSEM_ACTIVE_BIAS-RWSEM_WAITING_BIAS;
-	woken -= RWSEM_ACTIVE_BIAS;
-	rwsem_atomic_add(woken,sem);
-
-	next = sem->wait_list.next;
-	for (; loop>0; loop--) {
-		waiter = list_entry(next,struct rwsem_waiter,list);
-		next = waiter->list.next;
+		list_del(&waiter->list);
 		waiter->flags = 0;
 		wake_up_process(waiter->task);
-	}
-
-	sem->wait_list.next = next;
-	next->prev = &sem->wait_list;
+		woken++;
+		if (list_empty(&sem->wait_list))
+			break;
+		waiter = list_entry(sem->wait_list.next,struct rwsem_waiter,list);
+	} while (waiter->flags&RWSEM_WAITING_FOR_READ);
 
- out:
-	rwsemtrace(sem,"Leaving __rwsem_do_wake");
-	return sem;
-
-	/* undo the change to count, but check for a transition 1->0 */
- undo:
-	if (rwsem_atomic_update(-RWSEM_ACTIVE_BIAS,sem)!=0)
-		goto out;
-	goto try_again;
+	sem->activity += woken;
 }
 
 /*
- * wait for a lock to be granted
+ * wait for a lock on the rw_semaphore
+ * - must be entered with the rwsemsem_lock spinlock held
  */
-static inline struct rw_semaphore *rwsem_down_failed_common(struct rw_semaphore *sem,
-								 struct rwsem_waiter *waiter,
-								 signed long adjustment)
+void __rwsem_wait(struct rw_semaphore *sem, int bias)
 {
-	struct task_struct *tsk = current;
-	signed long count;
+	struct rwsem_waiter waiter;
+	struct task_struct *tsk;
 
+	tsk = current;
 	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
 
-	/* set up my own style of waitqueue */
-	spin_lock(&sem->wait_lock);
-	waiter->task = tsk;
-
-	list_add_tail(&waiter->list,&sem->wait_list);
-
-	/* note that we're now waiting on the lock, but no longer actively read-locking */
-	count = rwsem_atomic_update(adjustment,sem);
+	/* add to the waitqueue */
+	waiter.task = tsk;
+	waiter.flags = RWSEM_WAITING_FOR_READ;
 
-	/* if there are no longer active locks, wake the front queued process(es) up
-	 * - it might even be this process, since the waker takes a more active part
-	 */
-	if (!(count & RWSEM_ACTIVE_MASK))
-		sem = __rwsem_do_wake(sem);
+	list_add_tail(&waiter.list,&sem->wait_list);
 
-	spin_unlock(&sem->wait_lock);
+	/* we don't need to touch the semaphore anymore */
+	spin_unlock(&sem->lock);
 
 	/* wait to be given the lock */
 	for (;;) {
-		if (!waiter->flags)
+		if (!waiter.flags)
 			break;
 		schedule();
 		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
 	}
 
 	tsk->state = TASK_RUNNING;
-
-	return sem;
-}
-
-/*
- * wait for the read lock to be granted
- */
-struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-
-	rwsemtrace(sem,"Entering rwsem_down_read_failed");
-
-	waiter.flags = RWSEM_WAITING_FOR_READ;
-	rwsem_down_failed_common(sem,&waiter,RWSEM_WAITING_BIAS-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_read_failed");
-	return sem;
-}
-
-/*
- * wait for the write lock to be granted
- */
-struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-	struct rwsem_waiter waiter;
-
-	rwsemtrace(sem,"Entering rwsem_down_write_failed");
-
-	waiter.flags = RWSEM_WAITING_FOR_WRITE;
-	rwsem_down_failed_common(sem,&waiter,-RWSEM_ACTIVE_BIAS);
-
-	rwsemtrace(sem,"Leaving rwsem_down_write_failed");
-	return sem;
-}
-
-/*
- * handle waking up a waiter on the semaphore
- * - up_read has decremented the active part of the count if we come here
- */
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
-{
-	rwsemtrace(sem,"Entering rwsem_wake");
-
-	spin_lock(&sem->wait_lock);
-
-	/* do nothing if list empty */
-	if (!list_empty(&sem->wait_list))
-		sem = __rwsem_do_wake(sem);
-
-	spin_unlock(&sem->wait_lock);
-
-	rwsemtrace(sem,"Leaving rwsem_wake");
-
-	return sem;
 }
 
-EXPORT_SYMBOL_NOVERS(rwsem_down_read_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_down_write_failed);
-EXPORT_SYMBOL_NOVERS(rwsem_wake);
-#if RWSEM_DEBUG
-EXPORT_SYMBOL(rwsemtrace);
-#endif
+EXPORT_SYMBOL(__rwsem_wait);
+EXPORT_SYMBOL(__rwsem_wake);


* Re: Deadlock on the mm->mmap_sem
  2001-09-18 16:49             ` Linus Torvalds
  2001-09-19  9:51               ` David Howells
  2001-09-19 14:08               ` Manfred Spraul
@ 2001-09-19 14:51               ` David Howells
  2001-09-19 15:18                 ` Manfred Spraul
  2001-09-19 14:53               ` David Howells
  2001-09-19 14:58               ` David Howells
  4 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19 14:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Manfred Spraul, Andrea Arcangeli, Ulrich.Weigand,
	linux-kernel


Linus Torvalds <torvalds@transmeta.com> wrote:
> On Tue, 18 Sep 2001, David Howells wrote:
> >
> > Okay preliminary as-yet-untested patch to cure coredumping of the need
> > to hold the mm semaphore:
> >
> > 	- kernel/fork.c: function to partially copy an mm_struct and attach it
> > 			 to the task_struct in place of the old.
> 
> Oh, please no.
> 
> If the choice is between a hack to do strange and incomprehensible things
> for a special case, and just making the semaphores do the same thing
> rw-spinlocks do and make the problem go away naturally, I'll take #2 any
> day. The patches already exist, after all.

But surely giving rw-semaphores this behaviour is even worse... It introduces
the possibility of livelock, and thus of DoS attacks, and it affects more than
just access to the mm_struct.

Also comparing them to rw-spinlocks isn't really fair IMHO since they have
different restrictions. Things inside spinlocks aren't allowed to sleep, and
mustn't incur pagefaults.

I also don't think the hack is that bad. All it's doing is taking a copy of
the process's VM description so that it knows that nobody is going to modify it
whilst a coredump is in progress. Furthermore, it _only_ affects the coredump
path, and the coredump path is just about the last thing on the agenda for a
dying process.

However, if you don't like that, how about just changing the lock on mm_struct
to a special mm_struct-only type lock that has a recursive lock operation for
use by the pagefault handler (and _only_ the pagefault handler)? I've attached
a patch to do just that. This introduces five operations:

	- mm_lock_shared()		- get shared lock fairly
	- mm_lock_shared_recursive()	- get shared lock unfairly
	- mm_unlock_shared()		- release shared lock
	- mm_lock_exclusive()		- get exclusive lock
	- mm_unlock_exclusive()		- release exclusive lock

David


diff -uNr linux-2.4.10-pre12/arch/alpha/kernel/osf_sys.c linux-mmsem/arch/alpha/kernel/osf_sys.c
--- linux-2.4.10-pre12/arch/alpha/kernel/osf_sys.c	Tue Sep 18 08:45:58 2001
+++ linux-mmsem/arch/alpha/kernel/osf_sys.c	Wed Sep 19 12:57:04 2001
@@ -241,9 +241,9 @@
 			goto out;
 	}
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_mmap(file, addr, len, prot, flags, off);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (file)
 		fput(file);
 out:
diff -uNr linux-2.4.10-pre12/arch/alpha/mm/fault.c linux-mmsem/arch/alpha/mm/fault.c
--- linux-2.4.10-pre12/arch/alpha/mm/fault.c	Wed Sep 19 10:39:05 2001
+++ linux-mmsem/arch/alpha/mm/fault.c	Wed Sep 19 13:55:13 2001
@@ -113,7 +113,7 @@
 		goto vmalloc_fault;
 #endif
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -147,7 +147,7 @@
 	 * the fault.
 	 */
 	fault = handle_mm_fault(mm, vma, address, cause > 0);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	if (fault < 0)
 		goto out_of_memory;
@@ -161,7 +161,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	if (user_mode(regs)) {
 		force_sig(SIGSEGV, current);
@@ -198,7 +198,7 @@
 	if (current->pid == 1) {
 		current->policy |= SCHED_YIELD;
 		schedule();
-		down_read(&mm->mmap_sem);
+		mm_lock_shared_recursive(mm);
 		goto survive;
 	}
 	printk(KERN_ALERT "VM: killing process %s(%d)\n",
diff -uNr linux-2.4.10-pre12/arch/arm/kernel/sys_arm.c linux-mmsem/arch/arm/kernel/sys_arm.c
--- linux-2.4.10-pre12/arch/arm/kernel/sys_arm.c	Tue Sep 18 08:46:39 2001
+++ linux-mmsem/arch/arm/kernel/sys_arm.c	Wed Sep 19 12:57:09 2001
@@ -74,9 +74,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
@@ -125,9 +125,9 @@
 	    vectors_base() == 0)
 		goto out;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 out:
 	return ret;
diff -uNr linux-2.4.10-pre12/arch/arm/mm/fault-common.c linux-mmsem/arch/arm/mm/fault-common.c
--- linux-2.4.10-pre12/arch/arm/mm/fault-common.c	Tue Sep 18 08:46:39 2001
+++ linux-mmsem/arch/arm/mm/fault-common.c	Wed Sep 19 13:56:39 2001
@@ -251,9 +251,9 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	fault = __do_page_fault(mm, addr, error_code, tsk);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Handle the "normal" case first
diff -uNr linux-2.4.10-pre12/arch/cris/kernel/sys_cris.c linux-mmsem/arch/cris/kernel/sys_cris.c
--- linux-2.4.10-pre12/arch/cris/kernel/sys_cris.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/cris/kernel/sys_cris.c	Wed Sep 19 12:57:13 2001
@@ -59,9 +59,9 @@
                         goto out;
         }
 
-        down_write(&current->mm->mmap_sem);
+        mm_lock_exclusive(current->mm);
         error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-        up_write(&current->mm->mmap_sem);
+        mm_unlock_exclusive(current->mm);
 
         if (file)
                 fput(file);
diff -uNr linux-2.4.10-pre12/arch/cris/mm/fault.c linux-mmsem/arch/cris/mm/fault.c
--- linux-2.4.10-pre12/arch/cris/mm/fault.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/cris/mm/fault.c	Wed Sep 19 13:57:18 2001
@@ -266,7 +266,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -324,7 +324,7 @@
                 goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 	
 	/*
@@ -334,7 +334,7 @@
 
  bad_area:
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
  bad_area_nosemaphore:
 	DPG(show_registers(regs));
@@ -397,14 +397,14 @@
 	 */
 
  out_of_memory:
-        up_read(&mm->mmap_sem);
+        mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if(user_mode(regs))
 		do_exit(SIGKILL);
 	goto no_context;
 
  do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
          * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/i386/kernel/ldt.c linux-mmsem/arch/i386/kernel/ldt.c
--- linux-2.4.10-pre12/arch/i386/kernel/ldt.c	Tue Sep 18 08:45:58 2001
+++ linux-mmsem/arch/i386/kernel/ldt.c	Wed Sep 19 12:57:03 2001
@@ -73,7 +73,7 @@
 	 * the GDT index of the LDT is allocated dynamically, and is
 	 * limited by MAX_LDT_DESCRIPTORS.
 	 */
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 	if (!mm->context.segments) {
 		void * segments = vmalloc(LDT_ENTRIES*LDT_ENTRY_SIZE);
 		error = -ENOMEM;
@@ -124,7 +124,7 @@
 	error = 0;
 
 out_unlock:
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 out:
 	return error;
 }
diff -uNr linux-2.4.10-pre12/arch/i386/kernel/sys_i386.c linux-mmsem/arch/i386/kernel/sys_i386.c
--- linux-2.4.10-pre12/arch/i386/kernel/sys_i386.c	Tue Sep 18 08:45:58 2001
+++ linux-mmsem/arch/i386/kernel/sys_i386.c	Wed Sep 19 12:57:03 2001
@@ -55,9 +55,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/i386/mm/fault.c linux-mmsem/arch/i386/mm/fault.c
--- linux-2.4.10-pre12/arch/i386/mm/fault.c	Wed Sep 19 10:39:06 2001
+++ linux-mmsem/arch/i386/mm/fault.c	Wed Sep 19 13:55:18 2001
@@ -191,7 +191,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -265,7 +265,7 @@
 		if (bit < 32)
 			tsk->thread.screen_bitmap |= 1 << bit;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 /*
@@ -273,7 +273,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/* User mode accesses just cause a SIGSEGV */
 	if (error_code & 4) {
@@ -341,11 +341,11 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	if (tsk->pid == 1) {
 		tsk->policy |= SCHED_YIELD;
 		schedule();
-		down_read(&mm->mmap_sem);
+		mm_lock_shared_recursive(mm);
 		goto survive;
 	}
 	printk("VM: killing process %s\n", tsk->comm);
@@ -354,7 +354,7 @@
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/ia64/ia32/sys_ia32.c linux-mmsem/arch/ia64/ia32/sys_ia32.c
--- linux-2.4.10-pre12/arch/ia64/ia32/sys_ia32.c	Tue Sep 18 08:46:41 2001
+++ linux-mmsem/arch/ia64/ia32/sys_ia32.c	Wed Sep 19 12:57:11 2001
@@ -245,9 +245,9 @@
 		}
 		__copy_user(back, (char *)addr + len, PAGE_SIZE - ((addr + len) & ~PAGE_MASK));
 	}
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	r = do_mmap(0, baddr, len + (addr - baddr), prot, flags | MAP_ANONYMOUS, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (r < 0)
 		return(r);
 	if (addr == 0)
@@ -291,9 +291,9 @@
 		poff = offset & PAGE_MASK;
 		len += offset - poff;
 
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
 		error = do_mmap_pgoff(file, addr, len, prot, flags, poff >> PAGE_SHIFT);
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 
 		if (!IS_ERR((void *) error))
 			error += offset - poff;
@@ -338,9 +338,9 @@
 	if ((a.offset & ~PAGE_MASK) != 0)
 		return -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap_pgoff(file, a.addr, a.len, a.prot, a.flags, a.offset >> PAGE_SHIFT);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 #else
 	retval = ia32_do_mmap(file, a.addr, a.len, a.prot, a.flags, a.fd, a.offset);
 #endif
@@ -2605,11 +2605,11 @@
 		return(-EFAULT);
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	addr = do_mmap_pgoff(file, IA32_IOBASE,
 			     IOLEN, PROT_READ|PROT_WRITE, MAP_SHARED,
 			     (ia64_iobase & ~PAGE_OFFSET) >> PAGE_SHIFT);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (addr >= 0) {
 		ia64_set_kr(IA64_KR_IO_BASE, addr);
diff -uNr linux-2.4.10-pre12/arch/ia64/kernel/sys_ia64.c linux-mmsem/arch/ia64/kernel/sys_ia64.c
--- linux-2.4.10-pre12/arch/ia64/kernel/sys_ia64.c	Tue Sep 18 08:46:41 2001
+++ linux-mmsem/arch/ia64/kernel/sys_ia64.c	Wed Sep 19 12:57:11 2001
@@ -106,7 +106,7 @@
 	 * check and the clearing of r8.  However, we can't call sys_brk() because we need
 	 * to acquire the mmap_sem before we can do the test...
 	 */
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 
 	if (brk < mm->end_code)
 		goto out;
@@ -146,7 +146,7 @@
 	mm->brk = brk;
 out:
 	retval = mm->brk;
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 	regs->r8 = 0;		/* ensure large retval isn't mistaken as error code */
 	return retval;
 }
@@ -205,9 +205,9 @@
 	if (rgn_index(addr) != rgn_index(addr + len))
 		return -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/ia64/mm/fault.c linux-mmsem/arch/ia64/mm/fault.c
--- linux-2.4.10-pre12/arch/ia64/mm/fault.c	Tue Sep 18 08:46:41 2001
+++ linux-mmsem/arch/ia64/mm/fault.c	Wed Sep 19 13:56:55 2001
@@ -60,7 +60,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 
 	vma = find_vma_prev(mm, address, &prev_vma);
 	if (!vma)
@@ -112,7 +112,7 @@
 	      default:
 		goto out_of_memory;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
   check_expansion:
@@ -135,7 +135,7 @@
 	goto good_area;
 
   bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	if (isr & IA64_ISR_SP) {
 		/*
 		 * This fault was due to a speculative load set the "ed" bit in the psr to
@@ -184,7 +184,7 @@
 	return;
 
   out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", current->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
diff -uNr linux-2.4.10-pre12/arch/m68k/kernel/sys_m68k.c linux-mmsem/arch/m68k/kernel/sys_m68k.c
--- linux-2.4.10-pre12/arch/m68k/kernel/sys_m68k.c	Tue Sep 18 08:46:05 2001
+++ linux-mmsem/arch/m68k/kernel/sys_m68k.c	Wed Sep 19 12:57:08 2001
@@ -59,9 +59,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
@@ -146,9 +146,9 @@
 	}
 	a.flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, a.addr, a.len, a.prot, a.flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (file)
 		fput(file);
 out:
diff -uNr linux-2.4.10-pre12/arch/m68k/mm/fault.c linux-mmsem/arch/m68k/mm/fault.c
--- linux-2.4.10-pre12/arch/m68k/mm/fault.c	Tue Sep 18 08:46:05 2001
+++ linux-mmsem/arch/m68k/mm/fault.c	Wed Sep 19 13:56:17 2001
@@ -101,7 +101,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -168,7 +168,7 @@
 	#warning should be obsolete now...
 	if (CPU_IS_040_OR_060)
 		flush_tlb_page(vma, address);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return 0;
 
 /*
@@ -203,6 +203,6 @@
 	current->thread.faddr = address;
 
 send_sig:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return send_fault_sig(regs);
 }
diff -uNr linux-2.4.10-pre12/arch/mips/kernel/irixelf.c linux-mmsem/arch/mips/kernel/irixelf.c
--- linux-2.4.10-pre12/arch/mips/kernel/irixelf.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/mips/kernel/irixelf.c	Wed Sep 19 12:57:05 2001
@@ -314,12 +314,12 @@
 		   (unsigned long) elf_prot, (unsigned long) elf_type,
 		   (unsigned long) (eppnt->p_offset & 0xfffff000));
 #endif
-	    down_write(&current->mm->mmap_sem);
+	    mm_lock_exclusive(current->mm);
 	    error = do_mmap(interpreter, vaddr,
 			    eppnt->p_filesz + (eppnt->p_vaddr & 0xfff),
 			    elf_prot, elf_type,
 			    eppnt->p_offset & 0xfffff000);
-	    up_write(&current->mm->mmap_sem);
+	    mm_unlock_exclusive(current->mm);
 
 	    if(error < 0 && error > -1024) {
 		    printk("Aieee IRIX interp mmap error=%d\n", error);
@@ -498,12 +498,12 @@
 		prot  = (epp->p_flags & PF_R) ? PROT_READ : 0;
 		prot |= (epp->p_flags & PF_W) ? PROT_WRITE : 0;
 		prot |= (epp->p_flags & PF_X) ? PROT_EXEC : 0;
-	        down_write(&current->mm->mmap_sem);
+	        mm_lock_exclusive(current->mm);
 		(void) do_mmap(fp, (epp->p_vaddr & 0xfffff000),
 			       (epp->p_filesz + (epp->p_vaddr & 0xfff)),
 			       prot, EXEC_MAP_FLAGS,
 			       (epp->p_offset & 0xfffff000));
-	        up_write(&current->mm->mmap_sem);
+	        mm_unlock_exclusive(current->mm);
 
 		/* Fixup location tracking vars. */
 		if((epp->p_vaddr & 0xfffff000) < *estack)
@@ -762,10 +762,10 @@
 	 * Since we do not have the power to recompile these, we
 	 * emulate the SVr4 behavior.  Sigh.
 	 */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	(void) do_mmap(NULL, 0, 4096, PROT_READ | PROT_EXEC,
 		       MAP_FIXED | MAP_PRIVATE, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 #endif
 
 	start_thread(regs, elf_entry, bprm->p);
@@ -837,14 +837,14 @@
 	while(elf_phdata->p_type != PT_LOAD) elf_phdata++;
 	
 	/* Now use mmap to map the library into memory. */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap(file,
 			elf_phdata->p_vaddr & 0xfffff000,
 			elf_phdata->p_filesz + (elf_phdata->p_vaddr & 0xfff),
 			PROT_READ | PROT_WRITE | PROT_EXEC,
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
 			elf_phdata->p_offset & 0xfffff000);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	k = elf_phdata->p_vaddr + elf_phdata->p_filesz;
 	if (k > elf_bss) elf_bss = k;
@@ -916,12 +916,12 @@
 		prot  = (hp->p_flags & PF_R) ? PROT_READ : 0;
 		prot |= (hp->p_flags & PF_W) ? PROT_WRITE : 0;
 		prot |= (hp->p_flags & PF_X) ? PROT_EXEC : 0;
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
 		retval = do_mmap(filp, (hp->p_vaddr & 0xfffff000),
 				 (hp->p_filesz + (hp->p_vaddr & 0xfff)),
 				 prot, (MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE),
 				 (hp->p_offset & 0xfffff000));
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 
 		if(retval != (hp->p_vaddr & 0xfffff000)) {
 			printk("irix_mapelf: do_mmap fails with %d!\n", retval);
diff -uNr linux-2.4.10-pre12/arch/mips/kernel/syscall.c linux-mmsem/arch/mips/kernel/syscall.c
--- linux-2.4.10-pre12/arch/mips/kernel/syscall.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/mips/kernel/syscall.c	Wed Sep 19 12:57:05 2001
@@ -69,9 +69,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/mips/kernel/sysirix.c linux-mmsem/arch/mips/kernel/sysirix.c
--- linux-2.4.10-pre12/arch/mips/kernel/sysirix.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/mips/kernel/sysirix.c	Wed Sep 19 12:57:05 2001
@@ -471,7 +471,7 @@
 		if (retval)
 			return retval;
 
-		down_read(&mm->mmap_sem);
+		mm_lock_shared(mm);
 		pgdp = pgd_offset(mm, addr);
 		pmdp = pmd_offset(pgdp, addr);
 		ptep = pte_offset(pmdp, addr);
@@ -484,7 +484,7 @@
 				                   PAGE_SHIFT, pageno);
 			}
 		}
-		up_read(&mm->mmap_sem);
+		mm_unlock_shared(mm);
 		break;
 	}
 
@@ -534,7 +534,7 @@
 	struct mm_struct *mm = current->mm;
 	int ret;
 
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 	if (brk < mm->end_code) {
 		ret = -ENOMEM;
 		goto out;
@@ -592,7 +592,7 @@
 	ret = 0;
 
 out:
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 	return ret;
 }
 
@@ -1082,9 +1082,9 @@
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap(file, addr, len, prot, flags, offset);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (file)
 		fput(file);
 
@@ -1642,9 +1642,9 @@
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/mips/mm/fault.c linux-mmsem/arch/mips/mm/fault.c
--- linux-2.4.10-pre12/arch/mips/mm/fault.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/mips/mm/fault.c	Wed Sep 19 13:55:33 2001
@@ -72,7 +72,7 @@
 	printk("[%s:%d:%08lx:%ld:%08lx]\n", current->comm, current->pid,
 	       address, write, regs->cp0_epc);
 #endif
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -115,7 +115,7 @@
 		goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 /*
@@ -123,7 +123,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -177,14 +177,14 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/mips64/kernel/linux32.c linux-mmsem/arch/mips64/kernel/linux32.c
--- linux-2.4.10-pre12/arch/mips64/kernel/linux32.c	Wed Sep 19 10:39:07 2001
+++ linux-mmsem/arch/mips64/kernel/linux32.c	Wed Sep 19 12:57:12 2001
@@ -443,10 +443,10 @@
 	 *  `execve' frees all current memory we only have to do an
 	 *  `munmap' if the `execve' failes.
 	 */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	av = (char **) do_mmap_pgoff(0, 0, len, PROT_READ | PROT_WRITE,
 				     MAP_PRIVATE | MAP_ANONYMOUS, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (IS_ERR(av))
 		return (long) av;
diff -uNr linux-2.4.10-pre12/arch/mips64/kernel/syscall.c linux-mmsem/arch/mips64/kernel/syscall.c
--- linux-2.4.10-pre12/arch/mips64/kernel/syscall.c	Wed Sep 19 10:39:07 2001
+++ linux-mmsem/arch/mips64/kernel/syscall.c	Wed Sep 19 12:57:12 2001
@@ -65,9 +65,9 @@
 	}
         flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
         error = do_mmap(file, addr, len, prot, flags, offset);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
         if (file)
                 fput(file);
 out:
diff -uNr linux-2.4.10-pre12/arch/mips64/mm/fault.c linux-mmsem/arch/mips64/mm/fault.c
--- linux-2.4.10-pre12/arch/mips64/mm/fault.c	Wed Sep 19 10:39:07 2001
+++ linux-mmsem/arch/mips64/mm/fault.c	Wed Sep 19 13:57:01 2001
@@ -124,7 +124,7 @@
 	printk("Cpu%d[%s:%d:%08lx:%ld:%08lx]\n", smp_processor_id(), current->comm,
 		current->pid, address, write, regs->cp0_epc);
 #endif
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -167,7 +167,7 @@
 		goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 /*
@@ -175,7 +175,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 bad_area_nosemaphore:
 	if (user_mode(regs)) {
@@ -233,14 +233,14 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/parisc/kernel/sys_parisc.c linux-mmsem/arch/parisc/kernel/sys_parisc.c
--- linux-2.4.10-pre12/arch/parisc/kernel/sys_parisc.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/parisc/kernel/sys_parisc.c	Wed Sep 19 12:57:13 2001
@@ -51,7 +51,7 @@
 	struct file * file = NULL;
 	int error;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	lock_kernel();
 	if (!(flags & MAP_ANONYMOUS)) {
 		error = -EBADF;
@@ -65,7 +65,7 @@
 		fput(file);
 out:
 	unlock_kernel();
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return error;
 }
 
diff -uNr linux-2.4.10-pre12/arch/parisc/mm/fault.c linux-mmsem/arch/parisc/mm/fault.c
--- linux-2.4.10-pre12/arch/parisc/mm/fault.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/parisc/mm/fault.c	Wed Sep 19 13:57:13 2001
@@ -175,7 +175,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = pa_find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -218,14 +218,14 @@
 	      default:
 		goto out_of_memory;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 /*
  * Something tried to access memory that isn't in our memory map..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	if (user_mode(regs)) {
 		struct siginfo si;
@@ -275,7 +275,7 @@
 	parisc_terminate("Bad Address (null pointer deref?)",regs,code,address);
 
   out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", current->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
diff -uNr linux-2.4.10-pre12/arch/ppc/kernel/syscalls.c linux-mmsem/arch/ppc/kernel/syscalls.c
--- linux-2.4.10-pre12/arch/ppc/kernel/syscalls.c	Tue Sep 18 08:46:01 2001
+++ linux-mmsem/arch/ppc/kernel/syscalls.c	Wed Sep 19 12:57:06 2001
@@ -202,9 +202,9 @@
 			goto out;
 	}
 	
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (file)
 		fput(file);
 out:
diff -uNr linux-2.4.10-pre12/arch/ppc/mm/fault.c linux-mmsem/arch/ppc/mm/fault.c
--- linux-2.4.10-pre12/arch/ppc/mm/fault.c	Tue Sep 18 08:46:01 2001
+++ linux-mmsem/arch/ppc/mm/fault.c	Wed Sep 19 13:55:48 2001
@@ -103,7 +103,7 @@
 		bad_page_fault(regs, address, SIGSEGV);
 		return;
 	}
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -163,7 +163,7 @@
                 goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	/*
 	 * keep track of tlb+htab misses that are good addrs but
 	 * just need pte's created via handle_mm_fault()
@@ -173,7 +173,7 @@
 	return;
 
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	pte_errors++;	
 
 	/* User mode accesses cause a SIGSEGV */
@@ -194,7 +194,7 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", current->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
@@ -202,7 +202,7 @@
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	info.si_signo = SIGBUS;
 	info.si_errno = 0;
 	info.si_code = BUS_ADRERR;
diff -uNr linux-2.4.10-pre12/arch/s390/kernel/sys_s390.c linux-mmsem/arch/s390/kernel/sys_s390.c
--- linux-2.4.10-pre12/arch/s390/kernel/sys_s390.c	Tue Sep 18 08:46:42 2001
+++ linux-mmsem/arch/s390/kernel/sys_s390.c	Wed Sep 19 12:57:12 2001
@@ -61,9 +61,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/s390/mm/fault.c linux-mmsem/arch/s390/mm/fault.c
--- linux-2.4.10-pre12/arch/s390/mm/fault.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/s390/mm/fault.c	Wed Sep 19 13:57:08 2001
@@ -113,7 +113,7 @@
 	 * task's user address space, so we search the VMAs
 	 */
 
-        down_read(&mm->mmap_sem);
+        mm_lock_shared_recursive(mm);
 
         vma = find_vma(mm, address);
         if (!vma)
@@ -164,7 +164,7 @@
 		goto out_of_memory;
 	}
 
-        up_read(&mm->mmap_sem);
+        mm_unlock_shared(mm);
         return;
 
 /*
@@ -172,7 +172,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-        up_read(&mm->mmap_sem);
+        mm_unlock_shared(mm);
 
         /* User mode accesses just cause a SIGSEGV */
         if (regs->psw.mask & PSW_PROBLEM_STATE) {
@@ -231,14 +231,14 @@
  * us unable to handle the page fault gracefully.
 */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (regs->psw.mask & PSW_PROBLEM_STATE)
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/s390x/kernel/binfmt_elf32.c linux-mmsem/arch/s390x/kernel/binfmt_elf32.c
--- linux-2.4.10-pre12/arch/s390x/kernel/binfmt_elf32.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/s390x/kernel/binfmt_elf32.c	Wed Sep 19 12:57:13 2001
@@ -194,11 +194,11 @@
 	if(!addr)
 		addr = 0x40000000;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
 			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
 			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return(map_addr);
 }
 
diff -uNr linux-2.4.10-pre12/arch/s390x/kernel/exec32.c linux-mmsem/arch/s390x/kernel/exec32.c
--- linux-2.4.10-pre12/arch/s390x/kernel/exec32.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/s390x/kernel/exec32.c	Wed Sep 19 12:57:13 2001
@@ -54,7 +54,7 @@
 	if (!mpnt) 
 		return -ENOMEM; 
 	
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	{
 		mpnt->vm_mm = current->mm;
 		mpnt->vm_start = PAGE_MASK & (unsigned long) bprm->p;
@@ -77,7 +77,7 @@
 		}
 		stack_base += PAGE_SIZE;
 	}
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	
 	return 0;
 }
diff -uNr linux-2.4.10-pre12/arch/s390x/kernel/linux32.c linux-mmsem/arch/s390x/kernel/linux32.c
--- linux-2.4.10-pre12/arch/s390x/kernel/linux32.c	Tue Sep 18 08:46:43 2001
+++ linux-mmsem/arch/s390x/kernel/linux32.c	Wed Sep 19 12:57:13 2001
@@ -4186,14 +4186,14 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
 	if (!IS_ERR((void *) error) && error + len >= 0x80000000ULL) {
 		/* Result is out of bounds.  */
 		do_munmap(current->mm, addr, len);
 		error = -ENOMEM;
 	}
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/s390x/kernel/sys_s390.c linux-mmsem/arch/s390x/kernel/sys_s390.c
--- linux-2.4.10-pre12/arch/s390x/kernel/sys_s390.c	Tue Sep 18 08:46:44 2001
+++ linux-mmsem/arch/s390x/kernel/sys_s390.c	Wed Sep 19 12:57:13 2001
@@ -61,9 +61,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/s390x/mm/fault.c linux-mmsem/arch/s390x/mm/fault.c
--- linux-2.4.10-pre12/arch/s390x/mm/fault.c	Tue Sep 18 08:46:44 2001
+++ linux-mmsem/arch/s390x/mm/fault.c	Wed Sep 19 13:57:25 2001
@@ -141,7 +141,7 @@
 	 * task's user address space, so we search the VMAs
 	 */
 
-        down_read(&mm->mmap_sem);
+        mm_lock_shared_recursive(mm);
 
         vma = find_vma(mm, address);
         if (!vma) {
@@ -195,7 +195,7 @@
 		goto out_of_memory;
 	}
 
-        up_read(&mm->mmap_sem);
+        mm_unlock_shared(mm);
         return;
 
 /*
@@ -203,7 +203,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-        up_read(&mm->mmap_sem);
+        mm_unlock_shared(mm);
 
         /* User mode accesses just cause a SIGSEGV */
         if (regs->psw.mask & PSW_PROBLEM_STATE) {
@@ -262,14 +262,14 @@
  * us unable to handle the page fault gracefully.
 */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (regs->psw.mask & PSW_PROBLEM_STATE)
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/sh/kernel/sys_sh.c linux-mmsem/arch/sh/kernel/sys_sh.c
--- linux-2.4.10-pre12/arch/sh/kernel/sys_sh.c	Wed Sep 19 10:39:08 2001
+++ linux-mmsem/arch/sh/kernel/sys_sh.c	Wed Sep 19 12:57:10 2001
@@ -96,9 +96,9 @@
 			goto out;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	if (file)
 		fput(file);
diff -uNr linux-2.4.10-pre12/arch/sh/mm/fault.c linux-mmsem/arch/sh/mm/fault.c
--- linux-2.4.10-pre12/arch/sh/mm/fault.c	Wed Sep 19 10:39:08 2001
+++ linux-mmsem/arch/sh/mm/fault.c	Wed Sep 19 13:56:46 2001
@@ -105,7 +105,7 @@
 	if (in_interrupt() || !mm)
 		goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared_recursive(mm);
 
 	vma = find_vma(mm, address);
 	if (!vma)
@@ -147,7 +147,7 @@
 		goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 /*
@@ -155,7 +155,7 @@
  * Fix it, but check if it's kernel or user first..
  */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	if (user_mode(regs)) {
 		tsk->thread.address = address;
@@ -204,14 +204,14 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (user_mode(regs))
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/sparc/kernel/sys_sparc.c linux-mmsem/arch/sparc/kernel/sys_sparc.c
--- linux-2.4.10-pre12/arch/sparc/kernel/sys_sparc.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/sparc/kernel/sys_sparc.c	Wed Sep 19 12:57:04 2001
@@ -242,9 +242,9 @@
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 out_putf:
 	if (file)
@@ -288,7 +288,7 @@
 	if (old_len > TASK_SIZE - PAGE_SIZE ||
 	    new_len > TASK_SIZE - PAGE_SIZE)
 		goto out;
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if (flags & MREMAP_FIXED) {
 		if (ARCH_SUN4C_SUN4 &&
 		    new_addr < 0xe0000000 &&
@@ -323,7 +323,7 @@
 	}
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
 out_sem:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 out:
 	return ret;       
 }
diff -uNr linux-2.4.10-pre12/arch/sparc/kernel/sys_sunos.c linux-mmsem/arch/sparc/kernel/sys_sunos.c
--- linux-2.4.10-pre12/arch/sparc/kernel/sys_sunos.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/sparc/kernel/sys_sunos.c	Wed Sep 19 12:57:04 2001
@@ -116,9 +116,9 @@
 	}
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap(file, addr, len, prot, flags, off);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if(!ret_type)
 		retval = ((retval < PAGE_OFFSET) ? 0 : retval);
 
@@ -145,7 +145,7 @@
 	unsigned long rlim;
 	unsigned long newbrk, oldbrk;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if(ARCH_SUN4C_SUN4) {
 		if(brk >= 0x20000000 && brk < 0xe0000000) {
 			goto out;
@@ -208,7 +208,7 @@
 	do_brk(oldbrk, newbrk-oldbrk);
 	retval = 0;
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return retval;
 }
 
diff -uNr linux-2.4.10-pre12/arch/sparc/mm/fault.c linux-mmsem/arch/sparc/mm/fault.c
--- linux-2.4.10-pre12/arch/sparc/mm/fault.c	Tue Sep 18 08:45:59 2001
+++ linux-mmsem/arch/sparc/mm/fault.c	Wed Sep 19 12:57:05 2001
@@ -222,7 +222,7 @@
         if (in_interrupt() || !mm)
                 goto no_context;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 
 	/*
 	 * The kernel referencing a bad kernel pointer can lock up
@@ -272,7 +272,7 @@
 	default:
 		goto out_of_memory;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 
 	/*
@@ -280,7 +280,7 @@
 	 * Fix it, but check if it's kernel or user first..
 	 */
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 bad_area_nosemaphore:
 	/* User mode accesses just cause a SIGSEGV */
@@ -336,14 +336,14 @@
  * us unable to handle the page fault gracefully.
  */
 out_of_memory:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", tsk->comm);
 	if (from_user)
 		do_exit(SIGKILL);
 	goto no_context;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	info.si_signo = SIGBUS;
 	info.si_errno = 0;
 	info.si_code = BUS_ADRERR;
@@ -477,7 +477,7 @@
 	printk("wf<pid=%d,wr=%d,addr=%08lx>\n",
 	       tsk->pid, write, address);
 #endif
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	vma = find_vma(mm, address);
 	if(!vma)
 		goto bad_area;
@@ -498,10 +498,10 @@
 	}
 	if (!handle_mm_fault(mm, vma, address, write))
 		goto do_sigbus;
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return;
 bad_area:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 #if 0
 	printk("Window whee %s [%d]: segfaults at %08lx\n",
 	       tsk->comm, tsk->pid, address);
@@ -516,7 +516,7 @@
 	return;
 
 do_sigbus:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	info.si_signo = SIGBUS;
 	info.si_errno = 0;
 	info.si_code = BUS_ADRERR;
diff -uNr linux-2.4.10-pre12/arch/sparc64/kernel/binfmt_aout32.c linux-mmsem/arch/sparc64/kernel/binfmt_aout32.c
--- linux-2.4.10-pre12/arch/sparc64/kernel/binfmt_aout32.c	Tue Sep 18 08:46:07 2001
+++ linux-mmsem/arch/sparc64/kernel/binfmt_aout32.c	Wed Sep 19 12:57:09 2001
@@ -277,24 +277,24 @@
 			goto beyond_if;
 		}
 
-	        down_write(&current->mm->mmap_sem);
+	        mm_lock_exclusive(current->mm);
 		error = do_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
 			PROT_READ | PROT_EXEC,
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
 			fd_offset);
-	        up_write(&current->mm->mmap_sem);
+	        mm_unlock_exclusive(current->mm);
 
 		if (error != N_TXTADDR(ex)) {
 			send_sig(SIGKILL, current, 0);
 			return error;
 		}
 
-	        down_write(&current->mm->mmap_sem);
+	        mm_lock_exclusive(current->mm);
  		error = do_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
 				PROT_READ | PROT_WRITE | PROT_EXEC,
 				MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
 				fd_offset + ex.a_text);
-	        up_write(&current->mm->mmap_sem);
+	        mm_unlock_exclusive(current->mm);
 		if (error != N_DATADDR(ex)) {
 			send_sig(SIGKILL, current, 0);
 			return error;
@@ -369,12 +369,12 @@
 	start_addr =  ex.a_entry & 0xfffff000;
 
 	/* Now use mmap to map the library into memory. */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap(file, start_addr, ex.a_text + ex.a_data,
 			PROT_READ | PROT_WRITE | PROT_EXEC,
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
 			N_TXTOFF(ex));
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	retval = error;
 	if (error != start_addr)
 		goto out;
diff -uNr linux-2.4.10-pre12/arch/sparc64/kernel/sys_sparc.c linux-mmsem/arch/sparc64/kernel/sys_sparc.c
--- linux-2.4.10-pre12/arch/sparc64/kernel/sys_sparc.c	Tue Sep 18 08:46:06 2001
+++ linux-mmsem/arch/sparc64/kernel/sys_sparc.c	Wed Sep 19 12:57:09 2001
@@ -292,9 +292,9 @@
 			goto out_putf;
 	}
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap(file, addr, len, prot, flags, off);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 out_putf:
 	if (file)
@@ -310,9 +310,9 @@
 	if (len > -PAGE_OFFSET ||
 	    (addr < PAGE_OFFSET && addr + len > -PAGE_OFFSET))
 		return -EINVAL;
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_munmap(current->mm, addr, len);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return ret;
 }
 
@@ -332,7 +332,7 @@
 		goto out;
 	if (addr < PAGE_OFFSET && addr + old_len > -PAGE_OFFSET)
 		goto out;
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if (flags & MREMAP_FIXED) {
 		if (new_addr < PAGE_OFFSET &&
 		    new_addr + new_len > -PAGE_OFFSET)
@@ -363,7 +363,7 @@
 	}
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
 out_sem:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 out:
 	return ret;       
 }
diff -uNr linux-2.4.10-pre12/arch/sparc64/kernel/sys_sparc32.c linux-mmsem/arch/sparc64/kernel/sys_sparc32.c
--- linux-2.4.10-pre12/arch/sparc64/kernel/sys_sparc32.c	Tue Sep 18 08:46:06 2001
+++ linux-mmsem/arch/sparc64/kernel/sys_sparc32.c	Wed Sep 19 12:57:09 2001
@@ -4141,7 +4141,7 @@
 		goto out;
 	if (addr > 0xf0000000UL - old_len)
 		goto out;
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if (flags & MREMAP_FIXED) {
 		if (new_addr > 0xf0000000UL - new_len)
 			goto out_sem;
@@ -4171,7 +4171,7 @@
 	}
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
 out_sem:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 out:
 	return ret;       
 }
diff -uNr linux-2.4.10-pre12/arch/sparc64/kernel/sys_sunos32.c linux-mmsem/arch/sparc64/kernel/sys_sunos32.c
--- linux-2.4.10-pre12/arch/sparc64/kernel/sys_sunos32.c	Tue Sep 18 08:46:08 2001
+++ linux-mmsem/arch/sparc64/kernel/sys_sunos32.c	Wed Sep 19 12:57:09 2001
@@ -100,12 +100,12 @@
 	flags &= ~_MAP_NEW;
 
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	retval = do_mmap(file,
 			 (unsigned long) addr, (unsigned long) len,
 			 (unsigned long) prot, (unsigned long) flags,
 			 (unsigned long) off);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if(!ret_type)
 		retval = ((retval < 0xf0000000) ? 0 : retval);
 out_putf:
@@ -126,7 +126,7 @@
 	unsigned long rlim;
 	unsigned long newbrk, oldbrk, brk = (unsigned long) baddr;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if (brk < current->mm->end_code)
 		goto out;
 	newbrk = PAGE_ALIGN(brk);
@@ -170,7 +170,7 @@
 	do_brk(oldbrk, newbrk-oldbrk);
 	retval = 0;
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return retval;
 }
 
diff -uNr linux-2.4.10-pre12/arch/sparc64/mm/fault.c linux-mmsem/arch/sparc64/mm/fault.c
--- linux-2.4.10-pre12/arch/sparc64/mm/fault.c	Wed Sep 19 10:39:08 2001
+++ linux-mmsem/arch/sparc64/mm/fault.c	Wed Sep 19 12:57:09 2001
@@ -306,7 +306,7 @@
 		address &= 0xffffffff;
 	}
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	vma = find_vma(mm, address);
 	if (!vma)
 		goto bad_area;
@@ -378,7 +378,7 @@
 		goto out_of_memory;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	goto fault_done;
 
 	/*
@@ -387,7 +387,7 @@
 	 */
 bad_area:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 handle_kernel_fault:
 	do_kernel_fault(regs, si_code, fault_code, insn, address);
@@ -400,7 +400,7 @@
  */
 out_of_memory:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	printk("VM: killing process %s\n", current->comm);
 	if (!(regs->tstate & TSTATE_PRIV))
 		do_exit(SIGKILL);
@@ -412,7 +412,7 @@
 
 do_sigbus:
 	insn = get_fault_insn(regs, insn);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/*
 	 * Send a sigbus, regardless of whether we were in kernel
diff -uNr linux-2.4.10-pre12/arch/sparc64/solaris/misc.c linux-mmsem/arch/sparc64/solaris/misc.c
--- linux-2.4.10-pre12/arch/sparc64/solaris/misc.c	Wed Sep 19 10:39:08 2001
+++ linux-mmsem/arch/sparc64/solaris/misc.c	Wed Sep 19 12:57:09 2001
@@ -92,12 +92,12 @@
 	ret_type = flags & _MAP_NEW;
 	flags &= ~_MAP_NEW;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);
 	retval = do_mmap(file,
 			 (unsigned long) addr, (unsigned long) len,
 			 (unsigned long) prot, (unsigned long) flags, off);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if(!ret_type)
 		retval = ((retval < 0xf0000000) ? 0 : retval);
 	                        
diff -uNr linux-2.4.10-pre12/drivers/char/mem.c linux-mmsem/drivers/char/mem.c
--- linux-2.4.10-pre12/drivers/char/mem.c	Wed Sep 19 10:39:10 2001
+++ linux-mmsem/drivers/char/mem.c	Wed Sep 19 12:58:59 2001
@@ -350,7 +350,7 @@
 
 	mm = current->mm;
 	/* Oops, this was forgotten before. -ben */
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 
 	/* For private mappings, just map in zero pages. */
 	for (vma = find_vma(mm, addr); vma; vma = vma->vm_next) {
@@ -374,7 +374,7 @@
 			goto out_up;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	
 	/* The shared case is hard. Let's do the conventional zeroing. */ 
 	do {
@@ -389,7 +389,7 @@
 
 	return size;
 out_up:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return size;
 }
 
diff -uNr linux-2.4.10-pre12/drivers/sgi/char/graphics.c linux-mmsem/drivers/sgi/char/graphics.c
--- linux-2.4.10-pre12/drivers/sgi/char/graphics.c	Wed Sep 19 10:39:17 2001
+++ linux-mmsem/drivers/sgi/char/graphics.c	Wed Sep 19 12:59:13 2001
@@ -152,11 +152,11 @@
 		 * sgi_graphics_mmap
 		 */
 		disable_gconsole ();
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
 		r = do_mmap (file, (unsigned long)vaddr,
 			     cards[board].g_regs_size, PROT_READ|PROT_WRITE,
 			     MAP_FIXED|MAP_PRIVATE, 0);
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 		if (r)
 			return r;
 	}
diff -uNr linux-2.4.10-pre12/drivers/sgi/char/shmiq.c linux-mmsem/drivers/sgi/char/shmiq.c
--- linux-2.4.10-pre12/drivers/sgi/char/shmiq.c	Wed Sep 19 10:39:17 2001
+++ linux-mmsem/drivers/sgi/char/shmiq.c	Wed Sep 19 12:59:13 2001
@@ -285,11 +285,11 @@
 			s = req.arg * sizeof (struct shmqevent) +
 			    sizeof (struct sharedMemoryInputQueue);
 			v = sys_munmap (vaddr, s);
-			down_write(&current->mm->mmap_sem);
+			mm_lock_exclusive(current->mm);
 			do_munmap(current->mm, vaddr, s);
 			do_mmap(filp, vaddr, s, PROT_READ | PROT_WRITE,
 			        MAP_PRIVATE|MAP_FIXED, 0);
-			up_write(&current->mm->mmap_sem);
+			mm_unlock_exclusive(current->mm);
 			shmiqs[minor].events = req.arg;
 			shmiqs[minor].mapped = 1;
 
diff -uNr linux-2.4.10-pre12/fs/binfmt_aout.c linux-mmsem/fs/binfmt_aout.c
--- linux-2.4.10-pre12/fs/binfmt_aout.c	Wed Sep 19 10:39:20 2001
+++ linux-mmsem/fs/binfmt_aout.c	Wed Sep 19 11:42:59 2001
@@ -377,24 +377,24 @@
 			goto beyond_if;
 		}
 
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
 		error = do_mmap(bprm->file, N_TXTADDR(ex), ex.a_text,
 			PROT_READ | PROT_EXEC,
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
 			fd_offset);
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 
 		if (error != N_TXTADDR(ex)) {
 			send_sig(SIGKILL, current, 0);
 			return error;
 		}
 
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
  		error = do_mmap(bprm->file, N_DATADDR(ex), ex.a_data,
 				PROT_READ | PROT_WRITE | PROT_EXEC,
 				MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE,
 				fd_offset + ex.a_text);
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 		if (error != N_DATADDR(ex)) {
 			send_sig(SIGKILL, current, 0);
 			return error;
@@ -476,12 +476,12 @@
 		goto out;
 	}
 	/* Now use mmap to map the library into memory. */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap(file, start_addr, ex.a_text + ex.a_data,
 			PROT_READ | PROT_WRITE | PROT_EXEC,
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
 			N_TXTOFF(ex));
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	retval = error;
 	if (error != start_addr)
 		goto out;
diff -uNr linux-2.4.10-pre12/fs/binfmt_elf.c linux-mmsem/fs/binfmt_elf.c
--- linux-2.4.10-pre12/fs/binfmt_elf.c	Wed Sep 19 10:39:20 2001
+++ linux-mmsem/fs/binfmt_elf.c	Wed Sep 19 11:43:48 2001
@@ -224,11 +224,11 @@
 {
 	unsigned long map_addr;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	map_addr = do_mmap(filep, ELF_PAGESTART(addr),
 			   eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr), prot, type,
 			   eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr));
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return(map_addr);
 }
 
@@ -743,10 +743,10 @@
 		   Since we do not have the power to recompile these, we
 		   emulate the SVr4 behavior.  Sigh.  */
 		/* N.B. Shouldn't the size here be PAGE_SIZE?? */
-		down_write(&current->mm->mmap_sem);
+		mm_lock_exclusive(current->mm);
 		error = do_mmap(NULL, 0, 4096, PROT_READ | PROT_EXEC,
 				MAP_FIXED | MAP_PRIVATE, 0);
-		up_write(&current->mm->mmap_sem);
+		mm_unlock_exclusive(current->mm);
 	}
 
 #ifdef ELF_PLAT_INIT
@@ -827,7 +827,7 @@
 	while (elf_phdata->p_type != PT_LOAD) elf_phdata++;
 
 	/* Now use mmap to map the library into memory. */
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	error = do_mmap(file,
 			ELF_PAGESTART(elf_phdata->p_vaddr),
 			(elf_phdata->p_filesz +
@@ -836,7 +836,7 @@
 			MAP_FIXED | MAP_PRIVATE | MAP_DENYWRITE,
 			(elf_phdata->p_offset -
 			 ELF_PAGEOFFSET(elf_phdata->p_vaddr)));
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	if (error != ELF_PAGESTART(elf_phdata->p_vaddr))
 		goto out_free_ph;
 
diff -uNr linux-2.4.10-pre12/fs/exec.c linux-mmsem/fs/exec.c
--- linux-2.4.10-pre12/fs/exec.c	Wed Sep 19 10:39:20 2001
+++ linux-mmsem/fs/exec.c	Wed Sep 19 11:44:38 2001
@@ -307,7 +307,7 @@
 	if (!mpnt) 
 		return -ENOMEM; 
 	
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	{
 		mpnt->vm_mm = current->mm;
 		mpnt->vm_start = PAGE_MASK & (unsigned long) bprm->p;
@@ -330,7 +330,7 @@
 		}
 		stack_base += PAGE_SIZE;
 	}
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	
 	return 0;
 }
@@ -969,9 +969,9 @@
 	if (do_truncate(file->f_dentry, 0) != 0)
 		goto close_fail;
 
-	down_read(&current->mm->mmap_sem);
+	mm_lock_shared(current->mm);
 	retval = binfmt->core_dump(signr, regs, file);
-	up_read(&current->mm->mmap_sem);
+	mm_unlock_shared(current->mm);
 
 close_fail:
 	filp_close(file, NULL);
diff -uNr linux-2.4.10-pre12/fs/proc/array.c linux-mmsem/fs/proc/array.c
--- linux-2.4.10-pre12/fs/proc/array.c	Tue Sep 18 08:45:09 2001
+++ linux-mmsem/fs/proc/array.c	Wed Sep 19 11:47:34 2001
@@ -181,7 +181,7 @@
 	unsigned long data = 0, stack = 0;
 	unsigned long exec = 0, lib = 0;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	for (vma = mm->mmap; vma; vma = vma->vm_next) {
 		unsigned long len = (vma->vm_end - vma->vm_start) >> 10;
 		if (!vma->vm_file) {
@@ -212,7 +212,7 @@
 		mm->rss << (PAGE_SHIFT-10),
 		data - stack, stack,
 		exec - lib, lib);
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	return buffer;
 }
 
@@ -317,7 +317,7 @@
 	task_unlock(task);
 	if (mm) {
 		struct vm_area_struct *vma;
-		down_read(&mm->mmap_sem);
+		mm_lock_shared(mm);
 		vma = mm->mmap;
 		while (vma) {
 			vsize += vma->vm_end - vma->vm_start;
@@ -325,7 +325,7 @@
 		}
 		eip = KSTK_EIP(task);
 		esp = KSTK_ESP(task);
-		up_read(&mm->mmap_sem);
+		mm_unlock_shared(mm);
 	}
 
 	wchan = get_wchan(task);
@@ -479,7 +479,7 @@
 	task_unlock(task);
 	if (mm) {
 		struct vm_area_struct * vma;
-		down_read(&mm->mmap_sem);
+		mm_lock_shared(mm);
 		vma = mm->mmap;
 		while (vma) {
 			pgd_t *pgd = pgd_offset(mm, vma->vm_start);
@@ -500,7 +500,7 @@
 				drs += pages;
 			vma = vma->vm_next;
 		}
-		up_read(&mm->mmap_sem);
+		mm_unlock_shared(mm);
 		mmput(mm);
 	}
 	return sprintf(buffer,"%d %d %d %d %d %d %d\n",
@@ -577,7 +577,7 @@
 	column = *ppos & (MAPS_LINE_LENGTH-1);
 
 	/* quickly go to line lineno */
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	for (map = mm->mmap, i = 0; map && (i < lineno); map = map->vm_next, i++)
 		continue;
 
@@ -658,7 +658,7 @@
 		if (volatile_task)
 			break;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 
 	/* encode f_pos */
 	*ppos = (lineno << MAPS_LINE_SHIFT) + column;
diff -uNr linux-2.4.10-pre12/fs/proc/base.c linux-mmsem/fs/proc/base.c
--- linux-2.4.10-pre12/fs/proc/base.c	Tue Sep 18 08:45:09 2001
+++ linux-mmsem/fs/proc/base.c	Wed Sep 19 11:47:49 2001
@@ -64,7 +64,7 @@
 	task_unlock(task);
 	if (!mm)
 		goto out;
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	vma = mm->mmap;
 	while (vma) {
 		if ((vma->vm_flags & VM_EXECUTABLE) && 
@@ -76,7 +76,7 @@
 		}
 		vma = vma->vm_next;
 	}
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	mmput(mm);
 out:
 	return result;
diff -uNr linux-2.4.10-pre12/include/linux/sched.h linux-mmsem/include/linux/sched.h
--- linux-2.4.10-pre12/include/linux/sched.h	Wed Sep 19 10:39:23 2001
+++ linux-mmsem/include/linux/sched.h	Wed Sep 19 13:07:34 2001
@@ -209,7 +209,9 @@
 	atomic_t mm_users;			/* How many users with user space? */
 	atomic_t mm_count;			/* How many references to "struct mm_struct" (users count as 1) */
 	int map_count;				/* number of VMAs */
-	struct rw_semaphore mmap_sem;
+	spinlock_t mmsem_lock;			/* protects access to mmsem stuff */
+	int mmsem_activity;			/* 0 inactive, +n active readers, -1 active writer */
+	struct list_head mmsem_waiters;
 	spinlock_t page_table_lock;		/* Protects task page tables and mm->rss */
 
 	struct list_head mmlist;		/* List of all active mm's.  These are globally strung
@@ -233,15 +235,70 @@
 
 extern int mmlist_nr;
 
-#define INIT_MM(name) \
-{			 				\
-	mm_rb:		RB_ROOT,			\
-	pgd:		swapper_pg_dir, 		\
-	mm_users:	ATOMIC_INIT(2), 		\
-	mm_count:	ATOMIC_INIT(1), 		\
-	mmap_sem:	__RWSEM_INITIALIZER(name.mmap_sem), \
-	page_table_lock: SPIN_LOCK_UNLOCKED, 		\
-	mmlist:		LIST_HEAD_INIT(name.mmlist),	\
+#define INIT_MM(name)						\
+{								\
+	mm_rb:		RB_ROOT,				\
+	pgd:		swapper_pg_dir,				\
+	mm_users:	ATOMIC_INIT(2),				\
+	mm_count:	ATOMIC_INIT(1),				\
+	mmsem_lock:	SPIN_LOCK_UNLOCKED,			\
+	mmsem_activity:	0,					\
+	mmsem_waiters:	LIST_HEAD_INIT(name.mmsem_waiters),	\
+	page_table_lock: SPIN_LOCK_UNLOCKED,			\
+	mmlist:		LIST_HEAD_INIT(name.mmlist),		\
+}
+
+extern void __mm_lock_wait(struct mm_struct *mm, int bias);
+extern void __mm_lock_wake(struct mm_struct *mm);
+
+static inline void mm_lock_shared(struct mm_struct *mm)
+{
+	spin_lock(&mm->mmsem_lock);
+	if (mm->mmsem_activity>=0 && list_empty(&mm->mmsem_waiters)) {
+		mm->mmsem_activity++;
+		spin_unlock(&mm->mmsem_lock);
+	}
+	else
+		__mm_lock_wait(mm,1);
+}
+
+static inline void mm_lock_shared_recursive(struct mm_struct *mm)
+{
+	spin_lock(&mm->mmsem_lock);
+	if (mm->mmsem_activity>=0) {
+		mm->mmsem_activity++;
+		spin_unlock(&mm->mmsem_lock);
+	}
+	else
+		__mm_lock_wait(mm,1);
+}
+
+static inline void mm_unlock_shared(struct mm_struct *mm)
+{
+	spin_lock(&mm->mmsem_lock);
+	if (!--mm->mmsem_activity && !list_empty(&mm->mmsem_waiters))
+		__mm_lock_wake(mm);
+	spin_unlock(&mm->mmsem_lock);
+}
+
+static inline void mm_lock_exclusive(struct mm_struct *mm)
+{
+	spin_lock(&mm->mmsem_lock);
+	if (mm->mmsem_activity==0) {
+		mm->mmsem_activity--;
+		spin_unlock(&mm->mmsem_lock);
+	}
+	else
+		__mm_lock_wait(mm,-1);
+}
+
+static inline void mm_unlock_exclusive(struct mm_struct *mm)
+{
+	spin_lock(&mm->mmsem_lock);
+	mm->mmsem_activity++;
+	if (!list_empty(&mm->mmsem_waiters))
+		__mm_lock_wake(mm);
+	spin_unlock(&mm->mmsem_lock);
 }
 
 struct signal_struct {
diff -uNr linux-2.4.10-pre12/ipc/shm.c linux-mmsem/ipc/shm.c
--- linux-2.4.10-pre12/ipc/shm.c	Wed Sep 19 10:39:23 2001
+++ linux-mmsem/ipc/shm.c	Wed Sep 19 12:32:57 2001
@@ -619,9 +619,9 @@
 	shp->shm_nattch++;
 	shm_unlock(shmid);
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	user_addr = (void *) do_mmap (file, addr, file->f_dentry->d_inode->i_size, prot, flags, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 
 	down (&shm_ids.sem);
 	if(!(shp = shm_lock(shmid)))
@@ -650,14 +650,14 @@
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *shmd, *shmdnext;
 
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 	for (shmd = mm->mmap; shmd; shmd = shmdnext) {
 		shmdnext = shmd->vm_next;
 		if (shmd->vm_ops == &shm_vm_ops
 		    && shmd->vm_start - (shmd->vm_pgoff << PAGE_SHIFT) == (ulong) shmaddr)
 			do_munmap(mm, shmd->vm_start, shmd->vm_end - shmd->vm_start);
 	}
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 	return 0;
 }
 
diff -uNr linux-2.4.10-pre12/kernel/acct.c linux-mmsem/kernel/acct.c
--- linux-2.4.10-pre12/kernel/acct.c	Tue Sep 18 08:45:11 2001
+++ linux-mmsem/kernel/acct.c	Wed Sep 19 11:38:59 2001
@@ -315,13 +315,13 @@
 	vsize = 0;
 	if (current->mm) {
 		struct vm_area_struct *vma;
-		down_read(&current->mm->mmap_sem);
+		mm_lock_shared(current->mm);
 		vma = current->mm->mmap;
 		while (vma) {
 			vsize += vma->vm_end - vma->vm_start;
 			vma = vma->vm_next;
 		}
-		up_read(&current->mm->mmap_sem);
+		mm_unlock_shared(current->mm);
 	}
 	vsize = vsize / 1024;
 	ac.ac_mem = encode_comp_t(vsize);
diff -uNr linux-2.4.10-pre12/kernel/fork.c linux-mmsem/kernel/fork.c
--- linux-2.4.10-pre12/kernel/fork.c	Wed Sep 19 10:39:23 2001
+++ linux-mmsem/kernel/fork.c	Wed Sep 19 11:41:48 2001
@@ -216,7 +216,9 @@
 {
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
-	init_rwsem(&mm->mmap_sem);
+	spin_lock_init(&mm->mmsem_lock);
+	mm->mmsem_activity = 0;
+	INIT_LIST_HEAD(&mm->mmsem_waiters);
 	mm->page_table_lock = SPIN_LOCK_UNLOCKED;
 	mm->pgd = pgd_alloc(mm);
 	if (mm->pgd)
@@ -333,9 +335,9 @@
 	if (!mm_init(mm))
 		goto fail_nomem;
 
-	down_write(&oldmm->mmap_sem);
+	mm_lock_exclusive(oldmm);
 	retval = dup_mmap(mm);
-	up_write(&oldmm->mmap_sem);
+	mm_unlock_exclusive(oldmm);
 
 	if (retval)
 		goto free_pt;
diff -uNr linux-2.4.10-pre12/kernel/ksyms.c linux-mmsem/kernel/ksyms.c
--- linux-2.4.10-pre12/kernel/ksyms.c	Wed Sep 19 10:39:23 2001
+++ linux-mmsem/kernel/ksyms.c	Wed Sep 19 14:13:42 2001
@@ -87,6 +87,8 @@
 EXPORT_SYMBOL(exit_files);
 EXPORT_SYMBOL(exit_fs);
 EXPORT_SYMBOL(exit_sighand);
+EXPORT_SYMBOL(__mm_lock_wait);
+EXPORT_SYMBOL(__mm_lock_wake);
 
 /* internal kernel memory management */
 EXPORT_SYMBOL(_alloc_pages);
diff -uNr linux-2.4.10-pre12/kernel/ptrace.c linux-mmsem/kernel/ptrace.c
--- linux-2.4.10-pre12/kernel/ptrace.c	Wed Sep 19 10:39:23 2001
+++ linux-mmsem/kernel/ptrace.c	Wed Sep 19 11:40:34 2001
@@ -208,13 +208,13 @@
 	if (!mm)
 		return 0;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 	vma = find_extend_vma(mm, addr);
 	copied = 0;
 	if (vma)
 		copied = access_mm(mm, vma, addr, buf, len, write);
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	mmput(mm);
 	return copied;
 }
diff -uNr linux-2.4.10-pre12/mm/filemap.c linux-mmsem/mm/filemap.c
--- linux-2.4.10-pre12/mm/filemap.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/filemap.c	Wed Sep 19 11:33:14 2001
@@ -1949,7 +1949,7 @@
 	struct vm_area_struct * vma;
 	int unmapped_error, error = -EINVAL;
 
-	down_read(&current->mm->mmap_sem);
+	mm_lock_shared(current->mm);
 	if (start & ~PAGE_MASK)
 		goto out;
 	len = (len + ~PAGE_MASK) & PAGE_MASK;
@@ -1995,7 +1995,7 @@
 		vma = vma->vm_next;
 	}
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_unlock_shared(current->mm);
 	return error;
 }
 
@@ -2298,7 +2298,7 @@
 	int unmapped_error = 0;
 	int error = -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 
 	if (start & ~PAGE_MASK)
 		goto out;
@@ -2349,7 +2349,7 @@
 	}
 
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return error;
 }
 
@@ -2451,7 +2451,7 @@
 	int unmapped_error = 0;
 	long error = -EINVAL;
 
-	down_read(&current->mm->mmap_sem);
+	mm_lock_shared(current->mm);
 
 	if (start & ~PAGE_CACHE_MASK)
 		goto out;
@@ -2503,7 +2503,7 @@
 	}
 
 out:
-	up_read(&current->mm->mmap_sem);
+	mm_unlock_shared(current->mm);
 	return error;
 }
 
diff -uNr linux-2.4.10-pre12/mm/memory.c linux-mmsem/mm/memory.c
--- linux-2.4.10-pre12/mm/memory.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/memory.c	Wed Sep 19 11:33:28 2001
@@ -464,7 +464,7 @@
 	if (err)
 		return err;
 
-	down_read(&mm->mmap_sem);
+	mm_lock_shared(mm);
 
 	err = -EFAULT;
 	iobuf->locked = 0;
@@ -522,12 +522,12 @@
 		ptr += PAGE_SIZE;
 	}
 
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	dprintk ("map_user_kiobuf: end OK\n");
 	return 0;
 
  out_unlock:
-	up_read(&mm->mmap_sem);
+	mm_unlock_shared(mm);
 	unmap_kiobuf(iobuf);
 	dprintk ("map_user_kiobuf: end %d\n", err);
 	return err;
diff -uNr linux-2.4.10-pre12/mm/mlock.c linux-mmsem/mm/mlock.c
--- linux-2.4.10-pre12/mm/mlock.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/mlock.c	Wed Sep 19 11:35:27 2001
@@ -198,7 +198,7 @@
 	unsigned long lock_limit;
 	int error = -ENOMEM;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
 
@@ -219,7 +219,7 @@
 
 	error = do_mlock(start, len, 1);
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return error;
 }
 
@@ -227,11 +227,11 @@
 {
 	int ret;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
 	start &= PAGE_MASK;
 	ret = do_mlock(start, len, 0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return ret;
 }
 
@@ -268,7 +268,7 @@
 	unsigned long lock_limit;
 	int ret = -EINVAL;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
 		goto out;
 
@@ -286,7 +286,7 @@
 
 	ret = do_mlockall(flags);
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return ret;
 }
 
@@ -294,8 +294,8 @@
 {
 	int ret;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_mlockall(0);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return ret;
 }
diff -uNr linux-2.4.10-pre12/mm/mmap.c linux-mmsem/mm/mmap.c
--- linux-2.4.10-pre12/mm/mmap.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/mmap.c	Wed Sep 19 13:24:42 2001
@@ -149,7 +149,7 @@
 	unsigned long newbrk, oldbrk;
 	struct mm_struct *mm = current->mm;
 
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 
 	if (brk < mm->end_code)
 		goto out;
@@ -185,7 +185,7 @@
 	mm->brk = brk;
 out:
 	retval = mm->brk;
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 	return retval;
 }
 
@@ -995,9 +995,9 @@
 	int ret;
 	struct mm_struct *mm = current->mm;
 
-	down_write(&mm->mmap_sem);
+	mm_lock_exclusive(mm);
 	ret = do_munmap(mm, addr, len);
-	up_write(&mm->mmap_sem);
+	mm_unlock_exclusive(mm);
 	return ret;
 }
 
@@ -1172,4 +1172,86 @@
 		BUG();
 	vma_link(mm, vma, prev, rb_link, rb_parent);
 	validate_mm(mm);
+}
+
+
+struct mm_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+	unsigned int		flags;
+#define MM_WAITING_FOR_READ	0x00000001
+#define MM_WAITING_FOR_WRITE	0x00000002
+};
+
+/*
+ * handle the lock being released whilst there are processes blocked on it that can now run
+ * - if we come here, then:
+ *   - the 'active count' _reached_ zero
+ *   - the 'waiting count' is non-zero
+ * - the spinlock must be held by the caller
+ * - woken process blocks are discarded from the list after having flags zeroised
+ */
+void __mm_lock_wake(struct mm_struct *mm)
+{
+	struct mm_waiter *waiter;
+	int woken;
+
+	waiter = list_entry(mm->mmsem_waiters.next,struct mm_waiter,list);
+
+	/* try to grant a single write lock if there's a writer at the front of the queue
+	 * - we leave the 'waiting count' incremented to signify potential contention
+	 */
+	if (waiter->flags & MM_WAITING_FOR_WRITE) {
+		mm->mmsem_activity = -1;
+		list_del(&waiter->list);
+		waiter->flags = 0;
+		wake_up_process(waiter->task);
+		return;
+	}
+
+	/* grant an infinite number of read locks to the readers at the front of the queue */
+	woken = 0;
+	do {
+		list_del(&waiter->list);
+		waiter->flags = 0;
+		wake_up_process(waiter->task);
+		woken++;
+		if (list_empty(&mm->mmsem_waiters))
+			break;
+		waiter = list_entry(mm->mmsem_waiters.next,struct mm_waiter,list);
+	} while (waiter->flags&MM_WAITING_FOR_READ);
+
+	mm->mmsem_activity += woken;
+}
+
+/*
+ * wait for a lock on the mm_struct
+ * - must be entered with the mmsem_lock spinlock held
+ */
+void __mm_lock_wait(struct mm_struct *mm, int bias)
+{
+	struct mm_waiter waiter;
+	struct task_struct *tsk;
+
+	tsk = current;
+	set_task_state(tsk,TASK_UNINTERRUPTIBLE);
+
+	/* add to the waitqueue, recording whether we want a read or a write lock */
+	waiter.task = tsk;
+	waiter.flags = (bias < 0) ? MM_WAITING_FOR_WRITE : MM_WAITING_FOR_READ;
+
+	list_add_tail(&waiter.list,&mm->mmsem_waiters);
+
+	/* we don't need to touch the mm_struct anymore */
+	spin_unlock(&mm->mmsem_lock);
+
+	/* wait to be given the lock */
+	for (;;) {
+		if (!waiter.flags)
+			break;
+		schedule();
+		set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+	}
+
+	tsk->state = TASK_RUNNING;
 }
diff -uNr linux-2.4.10-pre12/mm/mprotect.c linux-mmsem/mm/mprotect.c
--- linux-2.4.10-pre12/mm/mprotect.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/mprotect.c	Wed Sep 19 11:36:18 2001
@@ -281,7 +281,7 @@
 	if (end == start)
 		return 0;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 
 	vma = find_vma_prev(current->mm, start, &prev);
 	error = -EFAULT;
@@ -332,6 +332,6 @@
 		prev->vm_mm->map_count--;
 	}
 out:
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return error;
 }
diff -uNr linux-2.4.10-pre12/mm/mremap.c linux-mmsem/mm/mremap.c
--- linux-2.4.10-pre12/mm/mremap.c	Wed Sep 19 10:39:24 2001
+++ linux-mmsem/mm/mremap.c	Wed Sep 19 11:36:35 2001
@@ -346,8 +346,8 @@
 {
 	unsigned long ret;
 
-	down_write(&current->mm->mmap_sem);
+	mm_lock_exclusive(current->mm);
 	ret = do_mremap(addr, old_len, new_len, flags, new_addr);
-	up_write(&current->mm->mmap_sem);
+	mm_unlock_exclusive(current->mm);
 	return ret;
 }

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 16:49             ` Linus Torvalds
  2001-09-19  9:51               ` David Howells
@ 2001-09-19 14:08               ` Manfred Spraul
  2001-09-19 14:51               ` David Howells
                                 ` (2 subsequent siblings)
  4 siblings, 0 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-19 14:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Andrea Arcangeli, Ulrich.Weigand, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 876 bytes --]

Linus Torvalds wrote:
> 
> If the choice is between a hack to do strange and incomprehensible things
> for a special case, and just making the semaphores do the same thing
> rw-spinlocks do and make the problem go away naturally, I'll take #2 any
> day. The patches already exist, after all.
>

I've attached a recursive semaphore patch against 2.4.10-pre11 - but it
makes mmap io unusable:

Testcase:
* file, mmapped, 300 MB, 128 MB RAM
* 2 threads: touch random pages 
* third thread: calls mprotect(0xFFFF00000,0x1000, PAGE_READ)

Result:
mprotect hangs forever (minutes) with recursive semaphores.

With fair semaphores, mprotect returns after ~80 milliseconds with 5
worker threads, after ~380 milliseconds with 20 worker threads (slow IDE
disk).
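
The starvation can be modeled without threads. Below is a toy userspace sketch (invented names, not the kernel implementation) of the two grant policies, using the rwsem-spinlock convention of +n for active readers and -1 for an active writer; the "recursive" policy is the one the attached patch switches to, admitting readers regardless of queued writers:

```c
#include <stdbool.h>

/* Toy model of the rwsem grant policies (hypothetical userspace code).
 * activity: +n active readers, -1 an active writer, 0 idle. */
struct toy_sem {
	int activity;
	bool writer_waiting;
};

/* "recursive" admits a reader whenever no writer is *active*;
 * "fair" additionally refuses while a writer is *queued*. */
static bool try_read(struct toy_sem *s, bool recursive)
{
	if (s->activity >= 0 && (recursive || !s->writer_waiting)) {
		s->activity++;
		return true;
	}
	return false;			/* caller would block */
}

static bool try_write(struct toy_sem *s)
{
	if (s->activity == 0) {
		s->activity = -1;
		return true;
	}
	s->writer_waiting = true;	/* caller queues and blocks */
	return false;
}
```

Under the recursive policy a steady stream of page-fault readers keeps activity above zero indefinitely, so the queued mprotect() writer never runs; under the fair policy new readers block once a writer queues, activity drains to zero, and the writer proceeds.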

One alternative to David's patch would be moving the locking into the
coredump handlers - would you prefer that?

--
	Manfred

[-- Attachment #2: patch-recursive --]
[-- Type: text/plain, Size: 835 bytes --]

--- 2.4/lib/rwsem-spinlock.c	Sat Apr 28 10:37:27 2001
+++ build-2.4/lib/rwsem-spinlock.c	Wed Sep 19 15:03:28 2001
@@ -115,7 +115,7 @@
 
 	spin_lock(&sem->wait_lock);
 
-	if (sem->activity>=0 && list_empty(&sem->wait_list)) {
+	if (sem->activity>=0) {
 		/* granted */
 		sem->activity++;
 		spin_unlock(&sem->wait_lock);
--- 2.4/arch/i386/config.in	Wed Sep 19 14:36:35 2001
+++ build-2.4/arch/i386/config.in	Wed Sep 19 14:48:06 2001
@@ -59,8 +59,8 @@
    define_bool CONFIG_X86_XADD y
    define_bool CONFIG_X86_BSWAP y
    define_bool CONFIG_X86_POPAD_OK y
-   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK n
-   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM y
+   define_bool CONFIG_RWSEM_GENERIC_SPINLOCK y
+   define_bool CONFIG_RWSEM_XCHGADD_ALGORITHM n
 fi
 if [ "$CONFIG_M486" = "y" ]; then
    define_int  CONFIG_X86_L1_CACHE_SHIFT 4

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-19  9:51               ` David Howells
@ 2001-09-19 12:49                 ` Andrea Arcangeli
  0 siblings, 0 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-19 12:49 UTC (permalink / raw)
  To: David Howells
  Cc: Linus Torvalds, Manfred Spraul, Ulrich.Weigand, linux-kernel

On Wed, Sep 19, 2001 at 10:51:57AM +0100, David Howells wrote:
> 
> Looking through do_page_fault(), I noticed there's a race in stack expansion
> because expand_stack() expects the caller to have the mm->mmap_sem write-locked.
> 
> I've attached a patch that might fix it appropriately. Alternatively, it may
> be worth applying Andrea's 00_silent-stack-overflow-10 patch which fixes this
> and something else too.

Yep, it's here:

	ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.10pre11aa1/00_silent-stack-overflow-10

I also added the documentation on the locking on top of expand_stack.

My patch also enforces a gap of one page (sysctl configurable with page
granularity) between a growsdown vma and its previous vma, so that we
can more easily trap stack overflows into the heap. (That part isn't
related to the race fix and was a bit controversial, but since it's
quite useful too I didn't split it out :)
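
The guard-gap check described above reduces to a small predicate. This is a userspace illustration with invented names; the real patch operates on vm_area_structs:

```c
#include <stdbool.h>

#define TOY_PAGE_SIZE 4096UL

/* Refuse to grow a growsdown vma down to within `gap_pages` pages of
 * the end of the previous vma, so a runaway stack takes a fault in the
 * guard gap instead of silently running into the heap.
 * (Illustrative sketch only, not the kernel code.) */
static bool stack_growth_allowed(unsigned long prev_vma_end,
				 unsigned long new_stack_start,
				 unsigned long gap_pages)
{
	return new_stack_start >= prev_vma_end + gap_pages * TOY_PAGE_SIZE;
}
```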

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 16:49             ` Linus Torvalds
@ 2001-09-19  9:51               ` David Howells
  2001-09-19 12:49                 ` Andrea Arcangeli
  2001-09-19 14:08               ` Manfred Spraul
                                 ` (3 subsequent siblings)
  4 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-19  9:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Manfred Spraul, Andrea Arcangeli, Ulrich.Weigand,
	linux-kernel


Looking through do_page_fault(), I noticed there's a race in stack expansion
because expand_stack() expects the caller to have the mm->mmap_sem write-locked.

I've attached a patch that might fix it appropriately. Alternatively, it may
be worth applying Andrea's 00_silent-stack-overflow-10 patch which fixes this
and something else too.

David


diff -uNr linux-2.4.10-pre12/include/linux/mm.h linux-rwsem/include/linux/mm.h
--- linux-2.4.10-pre12/include/linux/mm.h	Wed Sep 19 10:39:23 2001
+++ linux-rwsem/include/linux/mm.h	Wed Sep 19 10:40:48 2001
@@ -586,11 +586,11 @@
 	 * before relocating the vma range ourself.
 	 */
 	address &= PAGE_MASK;
+	spin_lock(&vma->vm_mm->page_table_lock);
 	grow = (vma->vm_start - address) >> PAGE_SHIFT;
 	if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
 	    ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > current->rlim[RLIMIT_AS].rlim_cur)
-		return -ENOMEM;
-	spin_lock(&vma->vm_mm->page_table_lock);
+		goto nomem;
 	vma->vm_start = address;
 	vma->vm_pgoff -= grow;
 	vma->vm_mm->total_vm += grow;
@@ -598,6 +598,9 @@
 		vma->vm_mm->locked_vm += grow;
 	spin_unlock(&vma->vm_mm->page_table_lock);
 	return 0;
+ nomem:
+	spin_unlock(&vma->vm_mm->page_table_lock);
+	return -ENOMEM;
 }
 
 /* Look up the first VMA which satisfies  addr < vm_end,  NULL if none. */

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 14:13           ` David Howells
  2001-09-18 14:49             ` Alan Cox
  2001-09-18 15:11             ` David Howells
@ 2001-09-18 16:49             ` Linus Torvalds
  2001-09-19  9:51               ` David Howells
                                 ` (4 more replies)
  2 siblings, 5 replies; 49+ messages in thread
From: Linus Torvalds @ 2001-09-18 16:49 UTC (permalink / raw)
  To: David Howells
  Cc: Manfred Spraul, Andrea Arcangeli, Ulrich.Weigand, linux-kernel


On Tue, 18 Sep 2001, David Howells wrote:
>
> Okay, preliminary as-yet-untested patch to cure core dumping of the need
> to hold the mm semaphore:
>
> 	- kernel/fork.c: function to partially copy an mm_struct and attach it
> 			 to the task_struct in place of the old.

Oh, please no.

If the choice is between a hack to do strange and incomprehensible things
for a special case, and just making the semaphores do the same thing
rw-spinlocks do and make the problem go away naturally, I'll take #2 any
day. The patches already exist, after all.

		Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 15:26               ` David Howells
@ 2001-09-18 15:46                 ` Alan Cox
  0 siblings, 0 replies; 49+ messages in thread
From: Alan Cox @ 2001-09-18 15:46 UTC (permalink / raw)
  To: David Howells
  Cc: Alan Cox, Manfred Spraul, Andrea Arcangeli, Linus Torvalds,
	dhowells, Ulrich.Weigand, linux-kernel

> > If you want code for this, it's in the older -ac tree. Linus decided it
> > wasn't justified so it went out
> 
> Arjan said there was such a beast in the -ac stuff, but I guess this explains
> why I couldn't find it... Do you have any idea which -ac's?

It went in with the threaded core dump patch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 14:49             ` Alan Cox
@ 2001-09-18 15:26               ` David Howells
  2001-09-18 15:46                 ` Alan Cox
  0 siblings, 1 reply; 49+ messages in thread
From: David Howells @ 2001-09-18 15:26 UTC (permalink / raw)
  To: Alan Cox
  Cc: Manfred Spraul, Andrea Arcangeli, Linus Torvalds, dhowells,
	Ulrich.Weigand, linux-kernel


> > Okay, preliminary as-yet-untested patch to cure core dumping of the need
> > to hold the mm semaphore:
>
> If you want code for this, it's in the older -ac tree. Linus decided it
> wasn't justified so it went out

Arjan said there was such a beast in the -ac stuff, but I guess this explains
why I couldn't find it... Do you have any idea which -ac's?

Oh well, I'd already done my patch anyway. Does it look okay to you?

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 14:13           ` David Howells
  2001-09-18 14:49             ` Alan Cox
@ 2001-09-18 15:11             ` David Howells
  2001-09-18 16:49             ` Linus Torvalds
  2 siblings, 0 replies; 49+ messages in thread
From: David Howells @ 2001-09-18 15:11 UTC (permalink / raw)
  To: David Howells
  Cc: Manfred Spraul, Andrea Arcangeli, Linus Torvalds, Ulrich.Weigand,
	linux-kernel


Okay, here's a tested patch to cure core dumping of the need to hold the mm semaphore:

	- kernel/fork.c: function to partially copy an mm_struct and attach it
			 to the task_struct in place of the old.

	- include/linux/mm.h: declaration for above function

	- fs/exec.c: have do_coredump() call this function and not get the
		     read lock around the binfmt coredumper

It works, and you can core dump without oopsing.
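
The idea reduces to a small sketch (hypothetical userspace code, not the kernel functions): when the mm is shared, the dumper swaps in a private copy with a single reference, so no other thread can mutate the vma list while the core is written.

```c
#include <string.h>

/* Toy stand-in for struct mm_struct: a user count plus some map state. */
struct toy_mm {
	int users;
	int map_count;
};

/* If the dumper is the sole user, dump in place; otherwise duplicate
 * the mm into `copy` (single user) and drop one reference on the old
 * one.  Mirrors the shape of copy_mm_for_coredump() below, minus the
 * real allocation, dup_mmap() and context work. */
static struct toy_mm *snapshot_for_dump(struct toy_mm *mm, struct toy_mm *copy)
{
	if (mm->users == 1)
		return mm;		/* nobody else can change the vma list */
	memcpy(copy, mm, sizeof(*copy));
	copy->users = 1;
	mm->users--;			/* the dumping task drops the shared mm */
	return copy;
}
```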

David

diff -uNr -x TAGS linux-2.4.10-pre11/fs/exec.c linux-rwsem/fs/exec.c
--- linux-2.4.10-pre11/fs/exec.c	Tue Sep 18 13:57:06 2001
+++ linux-rwsem/fs/exec.c	Tue Sep 18 15:01:56 2001
@@ -947,6 +947,14 @@
 	if (current->rlim[RLIMIT_CORE].rlim_cur < binfmt->min_coredump)
 		goto fail;
 
+	/* make sure the attached VM has a single ref (this process) to make
+	 * sure only do_exit() will change the VMA list, so we don't have to
+	 * lock the mm->sem around the binfmt coredumper
+	 */
+	retval = copy_mm_for_coredump(current);
+	if (retval<0)
+		goto fail;
+
 	memcpy(corename,"core.", 5);
 	corename[4] = '\0';
  	if (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)
@@ -969,9 +977,7 @@
 	if (do_truncate(file->f_dentry, 0) != 0)
 		goto close_fail;
 
-	down_read(&current->mm->mmap_sem);
 	retval = binfmt->core_dump(signr, regs, file);
-	up_read(&current->mm->mmap_sem);
 
 close_fail:
 	filp_close(file, NULL);
diff -uNr -x TAGS linux-2.4.10-pre11/include/linux/mm.h linux-rwsem/include/linux/mm.h
--- linux-2.4.10-pre11/include/linux/mm.h	Tue Sep 18 13:57:09 2001
+++ linux-rwsem/include/linux/mm.h	Tue Sep 18 15:27:48 2001
@@ -615,6 +615,7 @@
 }
 
 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
+extern int copy_mm_for_coredump(struct task_struct *tsk);
 
 #endif /* __KERNEL__ */
 
diff -uNr -x TAGS linux-2.4.10-pre11/kernel/fork.c linux-rwsem/kernel/fork.c
--- linux-2.4.10-pre11/kernel/fork.c	Tue Sep 18 13:57:10 2001
+++ linux-rwsem/kernel/fork.c	Tue Sep 18 15:28:50 2001
@@ -359,6 +359,55 @@
 	return retval;
 }
 
+int copy_mm_for_coredump(struct task_struct * tsk)
+{
+	struct mm_struct *mm, *old_mm;
+	int retval;
+
+	/* don't bother copying if there's only one user anyway */
+	if (atomic_read(&tsk->mm->mm_users)==1)
+		return 0;
+
+	old_mm = tsk->mm;
+
+	retval = -ENOMEM;
+	mm = allocate_mm();
+	if (!mm)
+		goto fail_nomem;
+
+	/* Copy the current MM stuff.. */
+	memcpy(mm, tsk->mm, sizeof(*mm));
+	if (!mm_init(mm))
+		goto fail_nomem;
+
+	down_write(&tsk->mm->mmap_sem);
+	retval = dup_mmap(mm);
+	up_write(&tsk->mm->mmap_sem);
+
+	if (retval)
+		goto free_pt;
+
+	/* no LDT now */
+	mm->context.segments = NULL;
+
+	if (init_new_context(tsk,mm))
+		goto free_pt;
+
+	/* swap to new MM */
+	task_lock(tsk);
+	tsk->mm = mm;
+	tsk->active_mm = mm;
+	task_unlock(tsk);
+	mmput(old_mm);
+
+	return 0;
+
+free_pt:
+	mmput(mm);
+fail_nomem:
+	return retval;
+}
+
 static inline struct fs_struct *__copy_fs_struct(struct fs_struct *old)
 {
 	struct fs_struct *fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 14:13           ` David Howells
@ 2001-09-18 14:49             ` Alan Cox
  2001-09-18 15:26               ` David Howells
  2001-09-18 15:11             ` David Howells
  2001-09-18 16:49             ` Linus Torvalds
  2 siblings, 1 reply; 49+ messages in thread
From: Alan Cox @ 2001-09-18 14:49 UTC (permalink / raw)
  To: David Howells
  Cc: Manfred Spraul, Andrea Arcangeli, Linus Torvalds, dhowells,
	Ulrich.Weigand, linux-kernel

> Okay preliminary as-yet-untested patch to cure coredumping of the need
> to hold the mm semaphore:

If you want code for this, it's in the older -ac tree. Linus decided it
wasn't justified, so it went out.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18 12:53         ` Manfred Spraul
@ 2001-09-18 14:13           ` David Howells
  2001-09-18 14:49             ` Alan Cox
                               ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: David Howells @ 2001-09-18 14:13 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: Andrea Arcangeli, Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel


Okay preliminary as-yet-untested patch to cure coredumping of the need
to hold the mm semaphore:

	- kernel/fork.c: function to partially copy an mm_struct and attach it
			 to the task_struct in place of the old.

	- include/linux/mm.h: declaration for above function

	- fs/exec.c: have do_coredump() call this function and not get the
		     read lock around the binfmt coredumper

David

diff -uNr linux-2.4.10-pre11/fs/exec.c linux-rwsem/fs/exec.c
--- linux-2.4.10-pre11/fs/exec.c	Tue Sep 18 13:57:06 2001
+++ linux-rwsem/fs/exec.c	Tue Sep 18 15:01:56 2001
@@ -947,6 +947,14 @@
 	if (current->rlim[RLIMIT_CORE].rlim_cur < binfmt->min_coredump)
 		goto fail;
 
+	/* make sure the attached VM has a single ref (this process) to make
+	 * sure only do_exit() will change the VMA list, so we don't have to
+	 * lock the mm->sem around the binfmt coredumper
+	 */
+	retval = copy_mm_for_coredump(current);
+	if (retval<0)
+		goto fail;
+
 	memcpy(corename,"core.", 5);
 	corename[4] = '\0';
  	if (core_uses_pid || atomic_read(&current->mm->mm_users) != 1)
@@ -969,9 +977,7 @@
 	if (do_truncate(file->f_dentry, 0) != 0)
 		goto close_fail;
 
-	down_read(&current->mm->mmap_sem);
 	retval = binfmt->core_dump(signr, regs, file);
-	up_read(&current->mm->mmap_sem);
 
 close_fail:
 	filp_close(file, NULL);
diff -uNr linux-2.4.10-pre11/include/linux/mm.h linux-rwsem/include/linux/mm.h
--- linux-2.4.10-pre11/include/linux/mm.h	Tue Sep 18 13:57:09 2001
+++ linux-rwsem/include/linux/mm.h	Tue Sep 18 14:38:44 2001
@@ -615,6 +615,7 @@
 }
 
 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
+extern int copy_mm_for_coredump(struct task_struct *tsk);
 
 #endif /* __KERNEL__ */
 
diff -uNr linux-2.4.10-pre11/kernel/fork.c linux-rwsem/kernel/fork.c
--- linux-2.4.10-pre11/kernel/fork.c	Tue Sep 18 13:57:10 2001
+++ linux-rwsem/kernel/fork.c	Tue Sep 18 14:49:18 2001
@@ -359,6 +359,55 @@
 	return retval;
 }
 
+int copy_mm_for_coredump(struct task_struct * tsk)
+{
+	struct mm_struct *mm, *old_mm;
+	int retval;
+
+	/* don't bother copying if there's only one user anyway */
+	if (atomic_read(&tsk->mm->mm_users)==1)
+		return 0;
+
+	old_mm = tsk->mm;
+
+	retval = -ENOMEM;
+	mm = allocate_mm();
+	if (!mm)
+		goto fail_nomem;
+
+	/* Copy the current MM stuff.. */
+	memcpy(mm, tsk->mm, sizeof(*mm));
+	if (!mm_init(mm))
+		goto fail_nomem;
+
+	down_write(&tsk->mm->mmap_sem);
+	retval = dup_mmap(mm);
+	up_write(&tsk->mm->mmap_sem);
+
+	if (retval)
+		goto free_pt;
+
+	/* no LDT now */
+	mm->context.segments = NULL;
+
+	if (init_new_context(tsk,mm))
+		goto free_pt;
+
+	/* swap to new MM */
+	task_lock(tsk);
+	tsk->mm = mm;
+	tsk->active_mm = mm;
+	task_unlock(tsk);
+	mmput(old_mm);
+
+	return 0;
+
+free_pt:
+	mmput(mm);
+fail_nomem:
+	return retval;
+}
+
 static inline struct fs_struct *__copy_fs_struct(struct fs_struct *old)
 {
 	struct fs_struct *fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
@ 2001-09-18 13:22 Ulrich Weigand
  0 siblings, 0 replies; 49+ messages in thread
From: Ulrich Weigand @ 2001-09-18 13:22 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Andrea Arcangeli, Linus Torvalds, dhowells, linux-kernel

Manfred Spraul wrote:

>+   if (retval > count) BUG();
>+   if (copy_to_user(buf, kbuf, retval)) {
>+        retval = -EFAULT;
>+   } else {
>+        *ppos = (lineno << MAPS_LINE_SHIFT) + loff;
>    }
>    up_read(&mm->mmap_sem);

The copy_to_user is still done with the lock held ...  I guess you just
forgot to move the up_read() up before the copy_to_user(), right?


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: Ulrich.Weigand@de.ibm.com


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:55       ` Andrea Arcangeli
                           ` (3 preceding siblings ...)
  2001-09-18  9:49         ` Arjan van de Ven
@ 2001-09-18 12:53         ` Manfred Spraul
  2001-09-18 14:13           ` David Howells
  4 siblings, 1 reply; 49+ messages in thread
From: Manfred Spraul @ 2001-09-18 12:53 UTC (permalink / raw)
  To: Andrea Arcangeli, Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 558 bytes --]

Attached is a rewritten proc_pid_read_maps. Patch against 2.4.9-ac9, but
should apply against 2.4.9 as well.

As a side-effect, it's more efficient since it tries to return multiple
lines in each call, without the volatile_task hack.

Code that's older than 2.4.3 can't cause recursions - that was before
the rw-mmap_sem change.

* ptrace doesn't cause any recursions, it always copies into temporary
kernel buffers.

* the multithreaded coredump patch added down_read() around
binfmt->core_dump(). I'll think about the simplest way to fix that.

--
	Manfred

[-- Attachment #2: patch-array --]
[-- Type: text/plain, Size: 5951 bytes --]

--- 2.4/fs/proc/array.c	Thu Sep  6 20:51:37 2001
+++ build-2.4/fs/proc/array.c	Tue Sep 18 14:25:17 2001
@@ -537,136 +537,142 @@
 #define MAPS_LINE_FORMAT8	  "%016lx-%016lx %s %016lx %s %lu"
 #define MAPS_LINE_MAX8	73 /* sum of 16  1  16  1 4 1 16 1 5 1 10 1 */
 
-#define MAPS_LINE_MAX	MAPS_LINE_MAX8
+#define MAPS_LINE_FORMAT	(sizeof(void*) == 4 ? MAPS_LINE_FORMAT4 : MAPS_LINE_FORMAT8)
+#define MAPS_LINE_MAX	(sizeof(void*) == 4 ?  MAPS_LINE_MAX4 :  MAPS_LINE_MAX8)
 
+int proc_pid_maps_get_line (char *buf, struct vm_area_struct *map)
+{
+	/* produce the next line */
+	char *line;
+	char str[5];
+	int flags;
+	kdev_t dev;
+	unsigned long ino;
+	int len;
+
+	flags = map->vm_flags;
+
+	str[0] = flags & VM_READ ? 'r' : '-';
+	str[1] = flags & VM_WRITE ? 'w' : '-';
+	str[2] = flags & VM_EXEC ? 'x' : '-';
+	str[3] = flags & VM_MAYSHARE ? 's' : 'p';
+	str[4] = 0;
+
+	dev = 0;
+	ino = 0;
+	if (map->vm_file != NULL) {
+		dev = map->vm_file->f_dentry->d_inode->i_dev;
+		ino = map->vm_file->f_dentry->d_inode->i_ino;
+		line = d_path(map->vm_file->f_dentry,
+			      map->vm_file->f_vfsmnt,
+			      buf, PAGE_SIZE);
+		buf[PAGE_SIZE-1] = '\n';
+		line -= MAPS_LINE_MAX;
+		if(line < buf)
+			line = buf;
+	} else
+		line = buf;
+
+	len = sprintf(line,
+		      MAPS_LINE_FORMAT,
+		      map->vm_start, map->vm_end, str, map->vm_pgoff << PAGE_SHIFT,
+		      kdevname(dev), ino);
+
+	if(map->vm_file) {
+		int i;
+		for(i = len; i < MAPS_LINE_MAX; i++)
+			line[i] = ' ';
+		len = buf + PAGE_SIZE - line;
+		memmove(buf, line, len);
+	} else
+		line[len++] = '\n';
+	return len;
+}
 
 ssize_t proc_pid_read_maps (struct task_struct *task, struct file * file, char * buf,
 			  size_t count, loff_t *ppos)
 {
 	struct mm_struct *mm;
 	struct vm_area_struct * map, * next;
-	char * destptr = buf, * buffer;
-	loff_t lineno;
-	ssize_t column, i;
-	int volatile_task;
+	char *tmp, *kbuf;
 	long retval;
+	int off, lineno, loff;
 
+	/* reject calls with out of range parameters immediately */
+	retval = 0;
+	if (*ppos > LONG_MAX)
+		goto out;
+	if (count == 0)
+		goto out;
+	off = (long)*ppos;
 	/*
 	 * We might sleep getting the page, so get it first.
 	 */
 	retval = -ENOMEM;
-	buffer = (char*)__get_free_page(GFP_KERNEL);
-	if (!buffer)
+	kbuf = (char*)__get_free_page(GFP_KERNEL);
+	if (!kbuf)
 		goto out;
 
-	if (count == 0)
-		goto getlen_out;
+	tmp = (char*)__get_free_page(GFP_KERNEL);
+	if (!tmp)
+		goto out_free1;
+
 	task_lock(task);
 	mm = task->mm;
 	if (mm)
 		atomic_inc(&mm->mm_users);
 	task_unlock(task);
+	retval = 0;
 	if (!mm)
-		goto getlen_out;
-
-	/* Check whether the mmaps could change if we sleep */
-	volatile_task = (task != current || atomic_read(&mm->mm_users) > 2);
-
-	/* decode f_pos */
-	lineno = *ppos >> MAPS_LINE_SHIFT;
-	column = *ppos & (MAPS_LINE_LENGTH-1);
+		goto out_free2;
 
-	/* quickly go to line lineno */
 	down_read(&mm->mmap_sem);
-	for (map = mm->mmap, i = 0; map && (i < lineno); map = map->vm_next, i++)
-		continue;
-
-	for ( ; map ; map = next ) {
-		/* produce the next line */
-		char *line;
-		char str[5], *cp = str;
-		int flags;
-		kdev_t dev;
-		unsigned long ino;
-		int maxlen = (sizeof(void*) == 4) ?
-			MAPS_LINE_MAX4 :  MAPS_LINE_MAX8;
+	map = mm->mmap;
+	lineno = 0;
+	loff = 0;
+	if (count > PAGE_SIZE)
+		count = PAGE_SIZE;
+	while (map) {
 		int len;
-
-		/*
-		 * Get the next vma now (but it won't be used if we sleep).
-		 */
-		next = map->vm_next;
-		flags = map->vm_flags;
-
-		*cp++ = flags & VM_READ ? 'r' : '-';
-		*cp++ = flags & VM_WRITE ? 'w' : '-';
-		*cp++ = flags & VM_EXEC ? 'x' : '-';
-		*cp++ = flags & VM_MAYSHARE ? 's' : 'p';
-		*cp++ = 0;
-
-		dev = 0;
-		ino = 0;
-		if (map->vm_file != NULL) {
-			dev = map->vm_file->f_dentry->d_inode->i_dev;
-			ino = map->vm_file->f_dentry->d_inode->i_ino;
-			line = d_path(map->vm_file->f_dentry,
-				      map->vm_file->f_vfsmnt,
-				      buffer, PAGE_SIZE);
-			buffer[PAGE_SIZE-1] = '\n';
-			line -= maxlen;
-			if(line < buffer)
-				line = buffer;
-		} else
-			line = buffer;
-
-		len = sprintf(line,
-			      sizeof(void*) == 4 ? MAPS_LINE_FORMAT4 : MAPS_LINE_FORMAT8,
-			      map->vm_start, map->vm_end, str, map->vm_pgoff << PAGE_SHIFT,
-			      kdevname(dev), ino);
-
-		if(map->vm_file) {
-			for(i = len; i < maxlen; i++)
-				line[i] = ' ';
-			len = buffer + PAGE_SIZE - line;
-		} else
-			line[len++] = '\n';
-		if (column >= len) {
-			column = 0; /* continue with next line at column 0 */
-			lineno++;
-			continue; /* we haven't slept */
+		if (off > MAPS_LINE_LENGTH) {
+			off -= MAPS_LINE_LENGTH;
+			goto next;
 		}
-
-		i = len-column;
-		if (i > count)
-			i = count;
-		copy_to_user(destptr, line+column, i); /* may have slept */
-		destptr += i;
-		count   -= i;
-		column  += i;
-		if (column >= len) {
-			column = 0; /* next time: next line at column 0 */
-			lineno++;
+		len = proc_pid_maps_get_line(tmp, map);
+		len -= off;
+		if (len > 0) {
+			if (retval+len > count) {
+				/* only partial line transfer possible */
+				len = count - retval;
+				/* save the offset where the next read
+				 * must start */
+				loff = len+off;
+			}
+			memcpy(kbuf+retval, tmp+off, len);
+			retval += len;
 		}
-
-		/* done? */
-		if (count == 0)
-			break;
-
-		/* By writing to user space, we might have slept.
-		 * Stop the loop, to avoid a race condition.
-		 */
-		if (volatile_task)
+		off = 0;
+next:
+		if (!loff)
+			lineno++;
+		if (retval >= count)
 			break;
+		if (loff) BUG();
+		map = map->vm_next;
+	}
+	if (retval > count) BUG();
+	if (copy_to_user(buf, kbuf, retval)) {
+		retval = -EFAULT;
+	} else {
+		*ppos = (lineno << MAPS_LINE_SHIFT) + loff;
 	}
 	up_read(&mm->mmap_sem);
-
-	/* encode f_pos */
-	*ppos = (lineno << MAPS_LINE_SHIFT) + column;
 	mmput(mm);
 
-getlen_out:
-	retval = destptr - buf;
-	free_page((unsigned long)buffer);
+out_free2:
+	free_page((unsigned long)tmp);
+out_free1:
+	free_page((unsigned long)kbuf);
 out:
 	return retval;
 }

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:55       ` Andrea Arcangeli
                           ` (2 preceding siblings ...)
  2001-09-18  9:37         ` Manfred Spraul
@ 2001-09-18  9:49         ` Arjan van de Ven
  2001-09-18 12:53         ` Manfred Spraul
  4 siblings, 0 replies; 49+ messages in thread
From: Arjan van de Ven @ 2001-09-18  9:49 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: linux-kernel

Andrea Arcangeli wrote:
> 
> On Tue, Sep 18, 2001 at 09:31:40AM +0200, Manfred Spraul wrote:
> > From: "Andrea Arcangeli" <andrea@suse.de>
> > > > The mmap semaphore is a read-write semaphore, and it _is_ permissible to
> > > > call "copy_to_user()" and friends while holding the read lock.
> > > >
> > > > The bug appears to be in the implementation of the write semaphore -
> > > > down_write() doesn't understand that blocked writes must not block new
> > > > readers, exactly because of this situation.
> > >
> > > Exactly, same reason for which we need the same property from the rw
> > > spinlocks (to be allowed to read_lock without clearing irqs). Thanks
> > so
> > > much for reminding me about this! Unfortunately my rwsemaphores are
> > > blocking readers at the first down_write (for the better fairness
> > > property issuse, but I obviously forgotten that doing so I would
> > > introduce such a deadlock).
> >
> > i386 has a fair rwsemaphore, too - probably other archs must be modified
> > as well.
> 
> yes, actually my patch was against the rwsem patch in -aa, and in -aa
> I'm using the generic semaphores for all archs in the tree so it fixes
> the race for all them. The mainline semaphores are slightly different.

> if that's the very only place that could be a viable option but OTOH I
> like to be allowed to use recursion on the read locks as with the
> spinlocks. I think another option would be to have recursion allowed on
> the default read locks and then make a down_read_fair that will block
> if there's a down_write under us. We can very cleanly implement this,
> the same can be done cleanly also for the spinlocks: read_lock_fair. One
> can even mix the read_lock/read_lock_fair or the
> down_read/down_read_fair together. For example assuming we use the
> recursive semaphore fix in proc_pid_read_maps the down_read over there
> could be converted to a down_read_fair (but that's just an exercise, if
> the page fault isn't fair it isn't worth making proc_pid_read_maps
> fair either).

Be careful: if another user can grab your semaphore for read for a short
time (eg for "top" or similar usage), he can construct several threads
that do this in a busy loop; the end result is that this evil user is
capable of blocking out writers FOREVER if semaphores are unfair; nice
DoS....

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:55       ` Andrea Arcangeli
  2001-09-18  8:18         ` David Howells
  2001-09-18  9:32         ` David Howells
@ 2001-09-18  9:37         ` Manfred Spraul
  2001-09-18  9:49         ` Arjan van de Ven
  2001-09-18 12:53         ` Manfred Spraul
  4 siblings, 0 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-18  9:37 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel

> > IMHO modifying proc_pid_read_maps() is far simpler - I'm not aware
of
> > another recursive mmap_sem user.
>
> if that's the very only place that could be a viable option but OTOH I
> like to be allowed to use recursion on the read locks as with the
> spinlocks.

But shouldn't that change wait until 2.5? Especially since another huge
mm change was just merged?
proc_pid_read_maps contains further bugs - afaics it could skip lines
if PAGE_SIZE > 4096 and a file path is nearly 4096 bytes long.
I'll post a patch to proc_pid_read_maps - modifying the rw semaphore
behaviour just asks for trouble.

--
    Manfred


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:55       ` Andrea Arcangeli
  2001-09-18  8:18         ` David Howells
@ 2001-09-18  9:32         ` David Howells
  2001-09-18  9:37         ` Manfred Spraul
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 49+ messages in thread
From: David Howells @ 2001-09-18  9:32 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Manfred Spraul, Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel


Linus Torvalds <linux-kernel@vger.kernel.org> wrote:
> The mmap semaphore is a read-write semaphore, and it _is_ permissible to
> call "copy_to_user()" and friends while holding the read lock.
>
> The bug appears to be in the implementation of the write semaphore -
> down_write() doesn't understand that blocked writes must not block new
> readers, exactly because of this situation. 
>
> The situation wrt read-write spinlocks is exactly the same, btw, except
> there we have "readers can have interrupts enabled even if interrupts
> also take read locks" instead of having user-level faults.
>
> Why do we want to explicitly allow this behaviour wrt mmap_sem? Because
> some things are inherently racy without it (ie threaded processes that
> read or write the address space - coredumping, ptrace etc).


Hmmm... I don't think this is possible with XADD based semaphores as they
stand (my version or Andrea's).

With the current XADD based stuff, you can't distinguish between one
writer running plus a queue of sleeping locks, and one reader running plus
a queue of sleeping locks, without counting the sleepers:

	Sem(sleepers)	Proc 1	Proc 2	Proc 3	Proc 4	Proc 5
	========	======	======	======	======	======
	00000000(0)
			-->down_read()
			<--down_read()
	00000001(0)
				-->down_write()
				-->down_write_failed()
				[schedule]
	FFFF0001(1)
					-->down_write()
					-->down_write_failed()
					[schedule]
	FFFE0001(2)
						-->down_write()
						-->down_write_failed()
						[schedule]
	FFFD0001(3)
							-->down_read_unfair()
	FFFC0002(3)
							is the active proc
							R or W?

	Sem		Proc 1	Proc 2	Proc 3	Proc 4
	========	======	======	======	======
	00000000(0)
			-->down_write()
			<--down_write()
	FFFF0001(0)
				-->down_write()
				-->down_write_failed()
				[schedule]
	FFFE0001(1)
					-->down_write()
					-->down_write_failed()
					[schedule]
	FFFD0001(2)
						-->down_read_unfair()
	FFFC0002(2)
						is the active proc R or W?

In fact, it's worse than that: you can't tell the difference between two
active readers plus a queue of sleepers, and one active writer, one failed
read or write attempt as yet unprocessed, and a queue of sleepers:

	Sem(sleepers)	Proc 1	Proc 2	Proc 3	Proc 4	Proc 5
	==============	======	======	======	======	======
	00000000(0)
			-->down_read()
			<--down_read()
	00000001(0)
				-->down_read()
				<--down_read()
	00000002(0)
					-->down_write()
	FFFF0003(0)
					-->down_write_failed()
					[schedule]
	FFFF0002(1)
						-->down_write()
	FFFE0003(1)
							-->down_read_unfair()
	FFFE0004(1)
							since the LSW>2 does
							this mean there are 2+
							readers active?

	Sem(sleepers)	Proc 1	Proc 2	Proc 3	Proc 4	Proc 5
	==============	======	======	======	======	======
	00000000(0)
			-->down_write()
			<--down_write()
	FFFF0001(0)
				-->down_write()
	FFFE0002(0)
				-->down_write_failed()
				[schedule]
	FFFE0001(1)
					-->down_read()
	FFFE0002(1)
						-->down_read()
	FFFE0003(1)
							-->down_read_unfair()
	FFFE0004(1)
							since the LSW>2 does
							this mean there are 2+
							readers active?

I think it might well be too hard to do unfair reads with the XADD based
stuff. The problem is that you can't compensate for the effect on the counter
of failed attempts to get read or write locks, even when you've got the
semaphore spinlock (the queue length is of no help).

I think that this problem can only be solved by going to the spinlock version,
and maintaining a flag to say what sort of lock is currently active.

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:55       ` Andrea Arcangeli
@ 2001-09-18  8:18         ` David Howells
  2001-09-18  9:32         ` David Howells
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 49+ messages in thread
From: David Howells @ 2001-09-18  8:18 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Manfred Spraul, Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel


> > i386 has a fair rwsemaphore, too - probably other archs must be modified
> > as well.
> 
> yes, actually my patch was against the rwsem patch in -aa, and in -aa
> I'm using the generic semaphores for all archs in the tree so it fixes
> the race for all them. The mainline semaphores are slightly different.

Wasn't there a problem with unfair rw-semaphores? I can't remember exactly
now...

> > IMHO modifying proc_pid_read_maps() is far simpler - I'm not aware of
> > another recursive mmap_sem user.
> 
> if that's the very only place that could be a viable option but OTOH I
> like to be allowed to use recursion on the read locks as with the
> spinlocks. I think another option would be to have recursion allowed on
> the default read locks and then make a down_read_fair that will block
> if there's a down_write under us. We can very cleanly implement this,
> the same can be done cleanly also for the spinlocks: read_lock_fair. One
> can even mix the read_lock/read_lock_fair or the
> down_read/down_read_fair together. For example assuming we use the
> recursive semaphore fix in proc_pid_read_maps the down_read over there
> could be converted to a down_read_fair (but that's just an exercise, if
> the page fault isn't fair it isn't worth making proc_pid_read_maps
> fair either).

If this were to be done, I'd prefer to keep down_read() as being fair and add
a down_read_unfair(). This'd have the least impact on the current behaviour,
and I suspect we actually want fairness most of the time.

Of course, I'd personally prefer to avoid recursive semaphore situations where
possible too... it sounds far too much like trouble waiting to happen, but we
can't have everything.

David

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  7:31     ` Manfred Spraul
@ 2001-09-18  7:55       ` Andrea Arcangeli
  2001-09-18  8:18         ` David Howells
                           ` (4 more replies)
  0 siblings, 5 replies; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  7:55 UTC (permalink / raw)
  To: Manfred Spraul; +Cc: Linus Torvalds, dhowells, Ulrich.Weigand, linux-kernel

On Tue, Sep 18, 2001 at 09:31:40AM +0200, Manfred Spraul wrote:
> From: "Andrea Arcangeli" <andrea@suse.de>
> > > The mmap semaphore is a read-write semaphore, and it _is_ permissible to
> > > call "copy_to_user()" and friends while holding the read lock.
> > >
> > > The bug appears to be in the implementation of the write semaphore -
> > > down_write() doesn't understand that blocked writes must not block new
> > > readers, exactly because of this situation.
> >
> > Exactly, same reason for which we need the same property from the rw
> > spinlocks (to be allowed to read_lock without clearing irqs). Thanks so
> > much for reminding me about this! Unfortunately my rwsemaphores are
> > blocking readers at the first down_write (for the better fairness
> > property issue, but I obviously forgot that doing so I would
> > introduce such a deadlock).
> 
> i386 has a fair rwsemaphore, too - probably other archs must be modified
> as well.

yes, actually my patch was against the rwsem patch in -aa, and in -aa
I'm using the generic semaphores for all archs in the tree so it fixes
the race for all them. The mainline semaphores are slightly different.

> > The fix is a few liner for my
> > implementation, here it is:
> >
> 
> Obviously your patch fixes the race, but we could starve down_write() if
> there are many page faults.

Yes.

> IMHO modifying proc_pid_read_maps() is far simpler - I'm not aware of
> another recursive mmap_sem user.

if that's the very only place that could be a viable option but OTOH I
like to be allowed to use recursion on the read locks as with the
spinlocks. I think another option would be to have recursion allowed on
the default read locks and then make a down_read_fair that will block
if there's a down_write under us. We can very cleanly implement this,
the same can be done cleanly also for the spinlocks: read_lock_fair. One
can even mix the read_lock/read_lock_fair or the
down_read/down_read_fair together. For example assuming we use the
recursive semaphore fix in proc_pid_read_maps the down_read over there
could be converted to a down_read_fair (but that's just an exercise, if
the page fault isn't fair it isn't worth making proc_pid_read_maps
fair either).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-18  0:01   ` Andrea Arcangeli
@ 2001-09-18  7:31     ` Manfred Spraul
  2001-09-18  7:55       ` Andrea Arcangeli
  0 siblings, 1 reply; 49+ messages in thread
From: Manfred Spraul @ 2001-09-18  7:31 UTC (permalink / raw)
  To: Andrea Arcangeli, Linus Torvalds; +Cc: dhowells, Ulrich.Weigand, linux-kernel

From: "Andrea Arcangeli" <andrea@suse.de>
> > The mmap semaphore is a read-write semaphore, and it _is_ permissible to
> > call "copy_to_user()" and friends while holding the read lock.
> >
> > The bug appears to be in the implementation of the write semaphore -
> > down_write() doesn't understand that blocked writes must not block new
> > readers, exactly because of this situation.
>
> Exactly, same reason for which we need the same property from the rw
> spinlocks (to be allowed to read_lock without clearing irqs). Thanks
so
> much for reminding me about this! Unfortunately my rwsemaphores are
> blocking readers at the first down_write (for the better fairness
> property issuse, but I obviously forgotten that doing so I would
> introduce such a deadlock).

i386 has a fair rwsemaphore, too - probably other archs must be modified
as well.

> The fix is a few liner for my
> implementation, here it is:
>

Obviously your patch fixes the race, but we could starve down_write() if
there are many page faults.
Which multithreaded apps rely on mmap for file io? innd, perhaps samba
if mmap is enabled (I'm not sure what the default is, or whether samba is
multithreaded).

If you compile a kernel for 80386, then i386 uses the generic
semaphores.
Could someone with innd compile his kernel for i386, apply Andrea's
patch and check that the performance doesn't break down?

IMHO modifying proc_pid_read_maps() is far simpler - I'm not aware of
another recursive mmap_sem user.
--
    Manfred



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
       [not found] ` <200109172339.f8HNd5W13244@penguin.transmeta.com>
@ 2001-09-18  0:01   ` Andrea Arcangeli
  2001-09-18  7:31     ` Manfred Spraul
  0 siblings, 1 reply; 49+ messages in thread
From: Andrea Arcangeli @ 2001-09-18  0:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: dhowells, Ulrich.Weigand, manfred, linux-kernel

On Mon, Sep 17, 2001 at 04:39:05PM -0700, Linus Torvalds wrote:
> [ David, Andrea - can you check this out? ]
> 
> In article <001701c13fc2$cda19a90$010411ac@local>,
> Manfred Spraul <manfred@colorfullife.com> wrote:
> >> What happens is that proc_pid_read_maps grabs the mmap_sem as a
> >> reader, and *while it holds the lock*, does a copy_to_user.  This can
> >> of course page-fault, and the handler will also grab the mmap_sem
> >> (if it is the same task).
> >
> >Ok, that's a bug.
> >You must not call copy_to_user with the mmap semaphore acquired - linux
> >semaphores are not recursive.
> 
> No, that's not the bug.

agreed.

> The mmap semaphore is a read-write semaphore, and it _is_ permissible to
> call "copy_to_user()" and friends while holding the read lock.
> 
> The bug appears to be in the implementation of the write semaphore -
> down_write() doesn't understand that blocked writes must not block new
> readers, exactly because of this situation. 

Exactly, same reason for which we need the same property from the rw
spinlocks (to be allowed to read_lock without clearing irqs). Thanks so
much for reminding me about this! Unfortunately my rwsemaphores are
blocking readers at the first down_write (for the better fairness
property issue, but I obviously forgot that doing so I would
introduce such a deadlock). The fix is a few liner for my
implementation, here it is:

--- 2.4.10pre10aa2/lib/rwsem_spinlock.c.~1~	Mon Sep 17 19:17:24 2001
+++ 2.4.10pre10aa2/lib/rwsem_spinlock.c	Tue Sep 18 01:59:06 2001
@@ -73,11 +73,13 @@
 
 void down_read(struct rw_semaphore *sem)
 {
+	int count;
 	CHECK_MAGIC(sem->__magic);
 
 	spin_lock(&sem->lock);
+	count = sem->count;
 	sem->count += RWSEM_READ_BIAS;
-	if (__builtin_expect(sem->count, 0) < 0)
+	if (__builtin_expect(count < 0 && !(count & RWSEM_READ_MASK), 0))
 		rwsem_down_failed(sem, RWSEM_READ_BLOCKING_BIAS);
 	spin_unlock(&sem->lock);
 }

it will be applied to next -aa. For the mainline semaphores I assume
David will take care of that.

For the record, I'm using spinlock based rwsemaphores. Last time I
checked my asm semaphores I found a small race in up_write, I didn't
checked if the mainlines semaphores were affected too but I just
preferred to stay safe with the spinlock in the meantime (in the
microbenchmark the spinlock based rwsems weren't that much slower
[and my optimized version is much faster than the mainline spinlock
based rwsem] so using asm it's not a noticeable improvement in the macro
real life benchmarks and the robustness of the spinlock is quite
invaluable, even more now that it allowed me to do a bugfix without
panicking in doing those changes).  I think I will return to the asm
rwsem only after proofing my implementation with math or after writing
an automated simulation that checks their correctness in all possible
race combinations (assuming they're mutex and with a variable number of
threads).

Andrea

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
  2001-09-17 21:50 Manfred Spraul
@ 2001-09-17 23:39 ` Linus Torvalds
       [not found] ` <200109172339.f8HNd5W13244@penguin.transmeta.com>
  1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2001-09-17 23:39 UTC (permalink / raw)
  To: linux-kernel

[ David, Andrea - can you check this out? ]

In article <001701c13fc2$cda19a90$010411ac@local>,
Manfred Spraul <manfred@colorfullife.com> wrote:
>> What happens is that proc_pid_read_maps grabs the mmap_sem as a
>> reader, and *while it holds the lock*, does a copy_to_user.  This can
>> of course page-fault, and the handler will also grab the mmap_sem
>> (if it is the same task).
>
>Ok, that's a bug.
>You must not call copy_to_user with the mmap semaphore acquired - linux
>semaphores are not recursive.

No, that's not the bug.

The mmap semaphore is a read-write semaphore, and it _is_ permissible to
call "copy_to_user()" and friends while holding the read lock.

The bug appears to be in the implementation of the write semaphore -
down_write() doesn't understand that blocked writers must not block new
readers, exactly because of this situation.

The situation wrt read-write spinlocks is exactly the same, btw, except
there we have "readers can have interrupts enabled even if interrupts
also take read locks" instead of having user-level faults.

Why do we want to explicitly allow this behaviour wrt mmap_sem? Because
some things are inherently racy without it (ie threaded processes that
read or write the address space - coredumping, ptrace etc).

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Deadlock on the mm->mmap_sem
@ 2001-09-17 21:50 Manfred Spraul
  2001-09-17 23:39 ` Linus Torvalds
       [not found] ` <200109172339.f8HNd5W13244@penguin.transmeta.com>
  0 siblings, 2 replies; 49+ messages in thread
From: Manfred Spraul @ 2001-09-17 21:50 UTC (permalink / raw)
  To: "Ulrich Weigand"; +Cc: linux-kernel

> What happens is that proc_pid_read_maps grabs the mmap_sem as a
> reader, and *while it holds the lock*, does a copy_to_user.  This can
> of course page-fault, and the handler will also grab the mmap_sem
> (if it is the same task).

Ok, that's a bug.
You must not call copy_to_user with the mmap semaphore acquired - linux
semaphores are not recursive.

> Any ideas how to fix this?  Should proc_pid_read_maps just drop the
> lock before copy_to_user?

Yes, and preferably switch to multi-line copies - a full-page temporary
buffer is already allocated; transferring data on a line-by-line basis
is way too much overhead (and the current volatile_task is an ugly hack).

--
    Manfred

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2001-09-22 21:06 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2001-09-17 20:57 Deadlock on the mm->mmap_sem Ulrich Weigand
2001-09-17 21:50 Manfred Spraul
2001-09-17 23:39 ` Linus Torvalds
     [not found] ` <200109172339.f8HNd5W13244@penguin.transmeta.com>
2001-09-18  0:01   ` Andrea Arcangeli
2001-09-18  7:31     ` Manfred Spraul
2001-09-18  7:55       ` Andrea Arcangeli
2001-09-18  8:18         ` David Howells
2001-09-18  9:32         ` David Howells
2001-09-18  9:37         ` Manfred Spraul
2001-09-18  9:49         ` Arjan van de Ven
2001-09-18 12:53         ` Manfred Spraul
2001-09-18 14:13           ` David Howells
2001-09-18 14:49             ` Alan Cox
2001-09-18 15:26               ` David Howells
2001-09-18 15:46                 ` Alan Cox
2001-09-18 15:11             ` David Howells
2001-09-18 16:49             ` Linus Torvalds
2001-09-19  9:51               ` David Howells
2001-09-19 12:49                 ` Andrea Arcangeli
2001-09-19 14:08               ` Manfred Spraul
2001-09-19 14:51               ` David Howells
2001-09-19 15:18                 ` Manfred Spraul
2001-09-19 14:53               ` David Howells
2001-09-19 18:03                 ` Andrea Arcangeli
2001-09-19 18:16                   ` Benjamin LaHaise
2001-09-19 18:27                     ` David Howells
2001-09-19 18:48                       ` Andrea Arcangeli
2001-09-19 18:45                     ` Andrea Arcangeli
2001-09-19 21:14                       ` Benjamin LaHaise
2001-09-19 22:07                         ` Andrea Arcangeli
2001-09-19 18:19                   ` Manfred Spraul
2001-09-20  2:07                     ` Andrea Arcangeli
2001-09-20  4:37                       ` Andrea Arcangeli
2001-09-20  7:05                       ` David Howells
2001-09-20  7:19                         ` Andrea Arcangeli
2001-09-20  8:01                           ` David Howells
2001-09-20  8:09                             ` Andrea Arcangeli
2001-09-19 18:26                   ` David Howells
2001-09-19 18:47                     ` Andrea Arcangeli
2001-09-19 23:25                       ` David Howells
2001-09-19 23:34                         ` Andrea Arcangeli
2001-09-19 23:46                           ` Andrea Arcangeli
2001-09-19 14:58               ` David Howells
2001-09-18 13:22 Ulrich Weigand
     [not found] <masp0008@stud.uni-sb.de>
2001-09-20 10:57 ` Studierende der Universitaet des Saarlandes
2001-09-20 12:40   ` David Howells
2001-09-20 18:24   ` Andrea Arcangeli
2001-09-20 21:43     ` Manfred Spraul
2001-09-22 21:06     ` Manfred Spraul

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).