* NUMA API for Linux
From: Andi Kleen @ 2004-04-06 13:33 UTC
  To: linux-kernel; +Cc: akpm


The following patches add support for configurable NUMA memory policy
for user processes. It is based on the proposal from the last kernel
summit, with feedback from various people.

This NUMA API doesn't attempt to implement page migration or anything
else complicated: all it does is police the allocation when a page
is first allocated or when a page is reallocated after swapping. Currently
only shared memory and anonymous memory are supported; policy for
file based mappings is not implemented yet (although they get implicitly
policied by the default process policy).

It adds three new system calls: mbind to change the policy of a VMA,
set_mempolicy to change the policy of a process, and get_mempolicy to
retrieve the memory policy. User tools (numactl, libnuma, test programs,
manpages) can be found at ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz

For details on the system calls see the manpages:
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html
Most user programs should actually not use the system calls directly,
but use the higher level functions in libnuma 
(http://www.firstfloor.org/~andi/numa.html) or the command line tools
(http://www.firstfloor.org/~andi/numactl.html).

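To give a flavor of the user side, here is a sketch against the 0.6
tools (the exact option spelling may differ between versions; ./myapp is
a placeholder). Running a program with its memory interleaved over
nodes 0 and 1:

	numactl --interleave=0,1 ./myapp

The rough libnuma equivalent, from inside the program itself:

	#include <numa.h>

	if (numa_available() < 0)
		exit(1);	/* kernel lacks the NUMA API */
	/* interleave all future memory over all available nodes */
	numa_set_interleave_mask(&numa_all_nodes);
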
The system calls allow user programs and administrators to set various NUMA memory
policies for putting memory on specific nodes. Here is a short description
of the policies, copied from the kernel patch:

 * NUMA policy allows the user to give hints in which node(s) memory should
 * be allocated.
 *
 * Support four policies per VMA and per process:
 *
 * The VMA policy has priority over the process policy for a page fault.
 *
 * interleave     Allocate memory interleaved over a set of nodes,
 *                with normal fallback if it fails.
 *                For VMA based allocations this interleaves based on the
 *                offset into the backing object or offset into the mapping
 *                for anonymous memory. For process policy a process counter
 *                is used.
 * bind           Only allocate memory on a specific set of nodes,
 *                no fallback.
 * preferred      Try a specific node first before normal fallback.
 *                As a special case node -1 here means do the allocation
 *                on the local CPU. This is normally identical to default,
 *                but useful to set in a VMA when you have a non default
 *                process policy.
 * default        Allocate on the local node first, or when on a VMA
 *                use the process policy. This is what Linux always did
 *                in a NUMA aware kernel and still does by, ahem, default.
 *
 * The process policy is applied for most non interrupt memory allocations
 * in that process' context. Interrupts ignore the policies and always
 * try to allocate on the local CPU. The VMA policy is only applied for memory
 * allocations for a VMA in the VM.
 *
 * Currently there are a few corner cases in swapping where the policy
 * is not applied, but the majority should be handled. When process policy
 * is used it is not remembered over swap outs/swap ins.
 *
 * Only the highest zone in the zone hierarchy gets policied. Allocations
 * requesting a lower zone just use default policy. This implies that
 * on systems with highmem, kernel lowmem allocations don't get policied.
 * Same with GFP_DMA allocations.
 *
 * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
 * all users and remembered even when nobody has memory mapped.

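As a concrete example of a process policy, a program that wants its
future anonymous allocations interleaved over nodes 0 and 1 could call
the system call directly (a hedged sketch using the syscall stubs
shipped with numactl; MPOL_INTERLEAVE and the nodemask layout are as
defined in this patch set):

	unsigned long nodes = (1UL << 0) | (1UL << 1);

	if (set_mempolicy(MPOL_INTERLEAVE, &nodes, sizeof(nodes) * 8) < 0)
		perror("set_mempolicy");
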
The following patches implement all this. 

I think these patches are ready for merging in -mm*.

-Andi


* [PATCH] NUMA API for Linux 1/ Core NUMA API code
From: Andi Kleen @ 2004-04-06 13:34 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

This is the core NUMA API code, including NUMA policy aware
wrappers for get_free_pages() and the new alloc_page_vma(). On non-NUMA
kernels these are defined away.

The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.

Adds a policy field to the VMA (vm_policy) and to the process
(mempolicy). The process also has a field for interleaving. VMA
interleaving uses the offset into the VMA, but that's not possible
for process allocations.

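The intended in-kernel call pattern (a sketch; the actual conversions of
the VM call sites come in later patches of this series) is to pass the
faulting VMA and address so the allocator can consult the right policy:

	struct page *page;

	page = alloc_page_vma(GFP_HIGHUSER, vma, address);
	if (!page)
		goto oom;	/* fail exactly as plain alloc_page() would */
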
diff -u linux-2.6.5-numa/include/linux/gfp.h-o linux-2.6.5-numa/include/linux/gfp.h
--- linux-2.6.5-numa/include/linux/gfp.h-o	2004-03-21 21:11:55.000000000 +0100
+++ linux-2.6.5-numa/include/linux/gfp.h	2004-04-06 13:36:12.000000000 +0200
@@ -4,6 +4,8 @@
 #include <linux/mmzone.h>
 #include <linux/stddef.h>
 #include <linux/linkage.h>
+#include <linux/config.h>
+
 /*
  * GFP bitmasks..
  */
@@ -72,10 +74,29 @@
 	return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
 }
 
+extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+	if (unlikely(order >= MAX_ORDER))
+		return NULL;
+
+	return alloc_pages_current(gfp_mask, order);
+}
+extern struct page *__alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long off);
+
+extern struct page *alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long addr);
+#else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page(gfp_mask) \
-		alloc_pages_node(numa_node_id(), gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define __alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#endif
+#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
 extern unsigned long FASTCALL(__get_free_pages(unsigned int gfp_mask, unsigned int order));
 extern unsigned long FASTCALL(get_zeroed_page(unsigned int gfp_mask));
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -12,6 +12,7 @@
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/fs.h>
+#include <linux/mempolicy.h>
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -47,6 +48,9 @@
  *
  * This structure is exactly 64 bytes on ia32.  Please think very, very hard
  * before adding anything to it.
+ * [Now 4 bytes more on 32bit NUMA machines. Sorry. -AK.
+ * But if you want to recover the 4 bytes just remove vm_next. It is redundant 
+ * with vm_rb. Will be a lot of editing work though. vm_rb.color is redundant too.] 
  */
 struct vm_area_struct {
 	struct mm_struct * vm_mm;	/* The address space we belong to. */
@@ -77,6 +81,10 @@
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
+
+#ifdef CONFIG_NUMA
+	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
 };
 
 /*
@@ -148,6 +156,8 @@
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	struct mempolicy *(*get_policy)(struct vm_area_struct *vma, unsigned long addr);
 };
 
 /* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -435,6 +445,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
 			unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
@@ -633,6 +645,11 @@
 	return vma;
 }
 
+static inline unsigned long vma_pages(struct vm_area_struct *vma)
+{
+	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+}
+
 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
 
 extern unsigned int nr_used_zone_pages(void);
diff -u linux-2.6.5-numa/include/linux/sched.h-o linux-2.6.5-numa/include/linux/sched.h
--- linux-2.6.5-numa/include/linux/sched.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/sched.h	2004-04-06 13:36:12.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
+#include <linux/mempolicy.h>
 
 struct exec_domain;
 
@@ -493,6 +494,9 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+
+  	struct mempolicy *mempolicy;
+  	short il_next;		/* could be shared with used_math */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -u linux-2.6.5-numa/kernel/sys.c-o linux-2.6.5-numa/kernel/sys.c
--- linux-2.6.5-numa/kernel/sys.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/kernel/sys.c	2004-04-06 13:36:12.000000000 +0200
@@ -260,6 +260,9 @@
 cond_syscall(sys_shmget)
 cond_syscall(sys_shmdt)
 cond_syscall(sys_shmctl)
+cond_syscall(sys_mbind)
+cond_syscall(sys_get_mempolicy)
+cond_syscall(sys_set_mempolicy)
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read)
diff -u linux-2.6.5-numa/mm/Makefile-o linux-2.6.5-numa/mm/Makefile
--- linux-2.6.5-numa/mm/Makefile-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/Makefile	2004-04-06 13:36:12.000000000 +0200
@@ -12,3 +12,4 @@
 			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_NUMA) 	+= policy.o


* NUMA API for Linux 2/ Add x86-64 support
From: Andi Kleen @ 2004-04-06 13:35 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API system calls on x86-64

diff -u linux-2.6.5-numa/include/asm-x86_64/unistd.h-o linux-2.6.5-numa/include/asm-x86_64/unistd.h
--- linux-2.6.5-numa/include/asm-x86_64/unistd.h-o	2004-03-17 12:17:59.000000000 +0100
+++ linux-2.6.5-numa/include/asm-x86_64/unistd.h	2004-04-06 13:36:12.000000000 +0200
@@ -532,10 +532,14 @@
 __SYSCALL(__NR_utimes, sys_utimes)
 #define __NR_vserver		236
 __SYSCALL(__NR_vserver, sys_ni_syscall)
+#define __NR_mbind 237
+__SYSCALL(__NR_mbind, sys_mbind)
+#define __NR_set_mempolicy 238
+__SYSCALL(__NR_set_mempolicy, sys_set_mempolicy)
+#define __NR_get_mempolicy 239
+__SYSCALL(__NR_get_mempolicy, sys_get_mempolicy)
 
-/* 237,238,239 reserved for NUMA API */
-
-#define __NR_syscall_max __NR_vserver
+#define __NR_syscall_max __NR_get_mempolicy
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */


* [PATCH] NUMA API for Linux 3/ Add i386 support
From: Andi Kleen @ 2004-04-06 13:35 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API system calls for i386

diff -u linux-2.6.5-numa/arch/i386/kernel/entry.S-o linux-2.6.5-numa/arch/i386/kernel/entry.S
--- linux-2.6.5-numa/arch/i386/kernel/entry.S-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/i386/kernel/entry.S	2004-04-06 15:06:46.000000000 +0200
@@ -882,5 +882,8 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
-
+	.long sys_mbind
+	.long sys_get_mempolicy
+	.long sys_set_mempolicy
+	
 syscall_table_size=(.-sys_call_table)
diff -u linux-2.6.5-numa/include/asm-i386/unistd.h-o linux-2.6.5-numa/include/asm-i386/unistd.h
--- linux-2.6.5-numa/include/asm-i386/unistd.h-o	2004-04-06 13:12:19.000000000 +0200
+++ linux-2.6.5-numa/include/asm-i386/unistd.h	2004-04-06 15:07:21.000000000 +0200
@@ -279,8 +279,11 @@
 #define __NR_utimes		271
 #define __NR_fadvise64_64	272
 #define __NR_vserver		273
+#define __NR_mbind		274
+#define __NR_get_mempolicy	275
+#define __NR_set_mempolicy	276
 
-#define NR_syscalls 274
+#define NR_syscalls 277
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 


* [PATCH] NUMA API for Linux 4/ Add IA64 support
From: Andi Kleen @ 2004-04-06 13:36 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm


Add NUMA API system calls on IA64 and one bug fix required for it.

diff -u linux-2.6.5-numa/arch/ia64/kernel/acpi.c-o linux-2.6.5-numa/arch/ia64/kernel/acpi.c
--- linux-2.6.5-numa/arch/ia64/kernel/acpi.c-o	2004-04-06 13:12:00.000000000 +0200
+++ linux-2.6.5-numa/arch/ia64/kernel/acpi.c	2004-04-06 13:36:12.000000000 +0200
@@ -455,6 +455,7 @@
 	for (i = 0; i < MAX_PXM_DOMAINS; i++) {
 		if (pxm_bit_test(i)) {
 			pxm_to_nid_map[i] = numnodes;
+			node_set_online(numnodes);
 			nid_to_pxm_map[numnodes++] = i;
 		}
 	}
diff -u linux-2.6.5-numa/arch/ia64/kernel/entry.S-o linux-2.6.5-numa/arch/ia64/kernel/entry.S
--- linux-2.6.5-numa/arch/ia64/kernel/entry.S-o	2004-03-21 21:12:05.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/kernel/entry.S	2004-04-06 13:36:12.000000000 +0200
@@ -1501,9 +1501,9 @@
 	data8 sys_clock_nanosleep
 	data8 sys_fstatfs64
 	data8 sys_statfs64
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall			// 1260
-	data8 sys_ni_syscall
+	data8 sys_mbind
+	data8 sys_get_mempolicy			// 1260
+	data8 sys_set_mempolicy
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
diff -u linux-2.6.5-numa/include/asm-ia64/unistd.h-o linux-2.6.5-numa/include/asm-ia64/unistd.h
--- linux-2.6.5-numa/include/asm-ia64/unistd.h-o	2004-04-06 13:12:19.000000000 +0200
+++ linux-2.6.5-numa/include/asm-ia64/unistd.h	2004-04-06 13:36:12.000000000 +0200
@@ -248,9 +248,9 @@
 #define __NR_clock_nanosleep		1256
 #define __NR_fstatfs64			1257
 #define __NR_statfs64			1258
-#define __NR_reserved1			1259	/* reserved for NUMA interface */
-#define __NR_reserved2			1260	/* reserved for NUMA interface */
-#define __NR_reserved3			1261	/* reserved for NUMA interface */
+#define __NR_mbind			1259
+#define __NR_get_mempolicy		1260
+#define __NR_set_mempolicy		1261
 
 #ifdef __KERNEL__
 


* [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
From: Andi Kleen @ 2004-04-06 13:37 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm


NUMA API adds a policy to each VMA. During VMA creation, merging and
splitting these policies must be handled properly. This patch adds
the necessary calls for that.

It is a nop when CONFIG_NUMA is not defined.

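The rule all the touched paths follow: two adjacent VMAs may only be
merged when their policies compare equal, and the policy reference must
be dropped whenever a VMA is freed. Schematically (a condensed sketch of
the pattern in the hunks below, not a literal excerpt):

	if (prev->vm_end == addr &&
	    mpol_equal(vma_policy(prev), policy) &&
	    is_mergeable_vma(prev, file, vm_flags)) {
		/* ... merge into prev ... */
	}

	mpol_free(vma_policy(vma));	/* on every VMA teardown path */
	kmem_cache_free(vm_area_cachep, vma);
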
diff -u linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c-o linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c
--- linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c	2004-04-06 13:36:12.000000000 +0200
@@ -104,6 +104,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		down_write(&current->mm->mmap_sem);
 		{
 			insert_vm_struct(current->mm, vma);
@@ -184,6 +185,7 @@
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = 0;
+		mpol_set_vma_default(mpnt);
 		insert_vm_struct(current->mm, mpnt);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	}
diff -u linux-2.6.5-numa/arch/ia64/kernel/perfmon.c-o linux-2.6.5-numa/arch/ia64/kernel/perfmon.c
--- linux-2.6.5-numa/arch/ia64/kernel/perfmon.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/kernel/perfmon.c	2004-04-06 13:36:12.000000000 +0200
@@ -2273,6 +2273,7 @@
 	vma->vm_ops	     = &pfm_vm_ops;
 	vma->vm_pgoff	     = 0;
 	vma->vm_file	     = NULL;
+	mpol_set_vma_default(vma);
 	vma->vm_private_data = ctx;	/* information needed by the pfm_vm_close() function */
 
 	/*
diff -u linux-2.6.5-numa/arch/ia64/mm/init.c-o linux-2.6.5-numa/arch/ia64/mm/init.c
--- linux-2.6.5-numa/arch/ia64/mm/init.c-o	2004-04-06 13:12:00.000000000 +0200
+++ linux-2.6.5-numa/arch/ia64/mm/init.c	2004-04-06 13:36:12.000000000 +0200
@@ -131,6 +131,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		insert_vm_struct(current->mm, vma);
 	}
 
@@ -143,6 +144,7 @@
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
 			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
+			mpol_set_vma_default(vma);
 			insert_vm_struct(current->mm, vma);
 		}
 	}
diff -u linux-2.6.5-numa/arch/m68k/atari/stram.c-o linux-2.6.5-numa/arch/m68k/atari/stram.c
--- linux-2.6.5-numa/arch/m68k/atari/stram.c-o	2004-03-21 21:12:07.000000000 +0100
+++ linux-2.6.5-numa/arch/m68k/atari/stram.c	2004-04-06 13:36:12.000000000 +0200
@@ -752,7 +752,7 @@
 			/* Get a page for the entry, using the existing
 			   swap cache page if there is one.  Otherwise,
 			   get a clean page and read the swap into it. */
-			page = read_swap_cache_async(entry);
+			page = read_swap_cache_async(entry, NULL, 0);
 			if (!page) {
 				swap_free(entry);
 				return -ENOMEM;
diff -u linux-2.6.5-numa/arch/s390/kernel/compat_exec.c-o linux-2.6.5-numa/arch/s390/kernel/compat_exec.c
--- linux-2.6.5-numa/arch/s390/kernel/compat_exec.c-o	2004-03-21 21:12:11.000000000 +0100
+++ linux-2.6.5-numa/arch/s390/kernel/compat_exec.c	2004-04-06 13:36:12.000000000 +0200
@@ -71,6 +71,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c-o linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c
--- linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c-o	2004-04-06 13:12:04.000000000 +0200
+++ linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c	2004-04-06 13:36:12.000000000 +0200
@@ -360,6 +360,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/fs/exec.c-o linux-2.6.5-numa/fs/exec.c
--- linux-2.6.5-numa/fs/exec.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/fs/exec.c	2004-04-06 13:36:12.000000000 +0200
@@ -430,6 +430,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/kernel/exit.c-o linux-2.6.5-numa/kernel/exit.c
--- linux-2.6.5-numa/kernel/exit.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/kernel/exit.c	2004-04-06 13:36:12.000000000 +0200
@@ -779,6 +779,7 @@
 	exit_namespace(tsk);
 	exit_itimers(tsk);
 	exit_thread();
+	mpol_free(tsk->mempolicy);
 
 	if (tsk->leader)
 		disassociate_ctty(1);
diff -u linux-2.6.5-numa/kernel/fork.c-o linux-2.6.5-numa/kernel/fork.c
--- linux-2.6.5-numa/kernel/fork.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/kernel/fork.c	2004-04-06 13:36:12.000000000 +0200
@@ -268,6 +268,7 @@
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge = 0;
+	struct mempolicy *pol;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -309,6 +310,11 @@
 		if (!tmp)
 			goto fail_nomem;
 		*tmp = *mpnt;
+		pol = mpol_copy(vma_policy(mpnt));
+		retval = PTR_ERR(pol);
+		if (IS_ERR(pol))
+			goto fail_nomem_policy;
+		vma_set_policy(tmp, pol);	
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
@@ -355,6 +361,8 @@
 	flush_tlb_mm(current->mm);
 	up_write(&oldmm->mmap_sem);
 	return retval;
+fail_nomem_policy: 
+	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
 	retval = -ENOMEM;
 fail:
@@ -946,9 +954,16 @@
 	p->security = NULL;
 	p->io_context = NULL;
 
+	p->mempolicy = mpol_copy(p->mempolicy);
+	if (IS_ERR(p->mempolicy)) { 
+		retval = PTR_ERR(p->mempolicy);
+		p->mempolicy = NULL;
+		goto bad_fork_cleanup;
+	}	
+	
 	retval = -ENOMEM;
 	if ((retval = security_task_alloc(p)))
-		goto bad_fork_cleanup;
+		goto bad_fork_cleanup_policy;
 	/* copy all the process information */
 	if ((retval = copy_semundo(clone_flags, p)))
 		goto bad_fork_cleanup_security;
@@ -1088,6 +1103,8 @@
 	exit_sem(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+bad_fork_cleanup_policy:
+	mpol_free(p->mempolicy);
 bad_fork_cleanup:
 	if (p->pid > 0)
 		free_pidmap(p->pid);
diff -u linux-2.6.5-numa/mm/mmap.c-o linux-2.6.5-numa/mm/mmap.c
--- linux-2.6.5-numa/mm/mmap.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/mmap.c	2004-04-06 13:36:12.000000000 +0200
@@ -388,7 +388,8 @@
 static int vma_merge(struct mm_struct *mm, struct vm_area_struct *prev,
 			struct rb_node *rb_parent, unsigned long addr, 
 			unsigned long end, unsigned long vm_flags,
-			struct file *file, unsigned long pgoff)
+		     	struct file *file, unsigned long pgoff,
+		        struct mempolicy *policy) 
 {
 	spinlock_t *lock = &mm->page_table_lock;
 	struct inode *inode = file ? file->f_dentry->d_inode : NULL;
@@ -412,6 +413,7 @@
 	 * Can it merge with the predecessor?
 	 */
 	if (prev->vm_end == addr &&
+ 		        mpol_equal(vma_policy(prev), policy) && 
 			is_mergeable_vma(prev, file, vm_flags) &&
 			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
 		struct vm_area_struct *next;
@@ -430,6 +432,7 @@
 		 */
 		next = prev->vm_next;
 		if (next && prev->vm_end == next->vm_start &&
+		    		vma_mpol_equal(prev, next) &&
 				can_vma_merge_before(next, vm_flags, file,
 					pgoff, (end - addr) >> PAGE_SHIFT)) {
 			prev->vm_end = next->vm_end;
@@ -442,6 +445,7 @@
 				fput(file);
 
 			mm->map_count--;
+			mpol_free(vma_policy(next));
 			kmem_cache_free(vm_area_cachep, next);
 			return 1;
 		}
@@ -457,6 +461,8 @@
 	prev = prev->vm_next;
 	if (prev) {
  merge_next:
+ 		if (!mpol_equal(policy, vma_policy(prev)))
+  			return 0;
 		if (!can_vma_merge_before(prev, vm_flags, file,
 				pgoff, (end - addr) >> PAGE_SHIFT))
 			return 0;
@@ -633,7 +639,7 @@
 	/* Can we just expand an old anonymous mapping? */
 	if (!file && !(vm_flags & VM_SHARED) && rb_parent)
 		if (vma_merge(mm, prev, rb_parent, addr, addr + len,
-					vm_flags, NULL, 0))
+					vm_flags, NULL, pgoff, NULL))
 			goto out;
 
 	/*
@@ -656,6 +662,7 @@
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
 	vma->vm_next = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	if (file) {
@@ -695,7 +702,9 @@
 	addr = vma->vm_start;
 
 	if (!file || !rb_parent || !vma_merge(mm, prev, rb_parent, addr,
-				addr + len, vma->vm_flags, file, pgoff)) {
+					      vma->vm_end, 
+					      vma->vm_flags, file, pgoff,
+					      vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -705,6 +714,7 @@
 				atomic_inc(&inode->i_writecount);
 			fput(file);
 		}
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1120,6 +1130,7 @@
 
 	remove_shared_vm_struct(area);
 
+	mpol_free(vma_policy(area));
 	if (area->vm_ops && area->vm_ops->close)
 		area->vm_ops->close(area);
 	if (area->vm_file)
@@ -1202,6 +1213,7 @@
 int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
+	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	struct address_space *mapping = NULL;
 
@@ -1224,6 +1236,13 @@
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
+	pol = mpol_copy(vma_policy(vma)); 
+	if (IS_ERR(pol)) { 
+		kmem_cache_free(vm_area_cachep, new); 
+		return PTR_ERR(pol);
+	} 
+	vma_set_policy(new, pol);
+
 	if (new->vm_file)
 		get_file(new->vm_file);
 
@@ -1393,7 +1412,7 @@
 
 	/* Can we just expand an old anonymous mapping? */
 	if (rb_parent && vma_merge(mm, prev, rb_parent, addr, addr + len,
-					flags, NULL, 0))
+					flags, NULL, 0, NULL))
 		goto out;
 
 	/*
@@ -1414,6 +1433,7 @@
 	vma->vm_pgoff = 0;
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -1474,6 +1494,7 @@
 		}
 		if (vma->vm_file)
 			fput(vma->vm_file);
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		vma = next;
 	}
diff -u linux-2.6.5-numa/mm/mprotect.c-o linux-2.6.5-numa/mm/mprotect.c
--- linux-2.6.5-numa/mm/mprotect.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/mprotect.c	2004-04-06 13:36:12.000000000 +0200
@@ -124,6 +124,8 @@
 		return 0;
 	if (vma->vm_file || (vma->vm_flags & VM_SHARED))
 		return 0;
+	if (!vma_mpol_equal(vma, prev))
+		return 0;
 
 	/*
 	 * If the whole area changes to the protection of the previous one
@@ -135,6 +137,7 @@
 		__vma_unlink(mm, vma, prev);
 		spin_unlock(&mm->page_table_lock);
 
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		mm->map_count--;
 		return 1;
@@ -317,12 +320,14 @@
 
 	if (next && prev->vm_end == next->vm_start &&
 			can_vma_merge(next, prev->vm_flags) &&
+	    	vma_mpol_equal(prev, next) &&
 			!prev->vm_file && !(prev->vm_flags & VM_SHARED)) {
 		spin_lock(&prev->vm_mm->page_table_lock);
 		prev->vm_end = next->vm_end;
 		__vma_unlink(prev->vm_mm, next, prev);
 		spin_unlock(&prev->vm_mm->page_table_lock);
 
+		mpol_free(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
 		prev->vm_mm->map_count--;
 	}
diff -u linux-2.6.5-numa/mm/mremap.c-o linux-2.6.5-numa/mm/mremap.c
--- linux-2.6.5-numa/mm/mremap.c-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/mremap.c	2004-04-06 13:36:12.000000000 +0200
@@ -199,6 +199,7 @@
 	next = find_vma_prev(mm, new_addr, &prev);
 	if (next) {
 		if (prev && prev->vm_end == new_addr &&
+		    mpol_equal(vma_policy(prev), vma_policy(next)) &&
 		    can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
 					!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
@@ -208,6 +209,7 @@
 			if (next != prev->vm_next)
 				BUG();
 			if (prev->vm_end == next->vm_start &&
+			                vma_mpol_equal(next, prev) && 
 					can_vma_merge(next, prev->vm_flags)) {
 				spin_lock(&mm->page_table_lock);
 				prev->vm_end = next->vm_end;
@@ -216,10 +218,12 @@
 				if (vma == next)
 					vma = prev;
 				mm->map_count--;
+				mpol_free(vma_policy(next));
 				kmem_cache_free(vm_area_cachep, next);
 			}
 		} else if (next->vm_start == new_addr + new_len &&
 			  	can_vma_merge(next, vma->vm_flags) &&
+			        vma_mpol_equal(next, vma) &&
 				!vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			next->vm_start = new_addr;
@@ -229,6 +233,7 @@
 	} else {
 		prev = find_vma(mm, new_addr-1);
 		if (prev && prev->vm_end == new_addr &&
+		    vma_mpol_equal(prev, vma) &&
 		    can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
 				!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
@@ -250,7 +255,12 @@
 		unsigned long vm_locked = vma->vm_flags & VM_LOCKED;
 
 		if (allocated_vma) {
+			struct mempolicy *pol;
 			*new_vma = *vma;
+			pol = mpol_copy(vma_policy(new_vma));
+			if (IS_ERR(pol))
+				goto out_vma;
+			vma_set_policy(new_vma, pol);	
 			INIT_LIST_HEAD(&new_vma->shared);
 			new_vma->vm_start = new_addr;
 			new_vma->vm_end = new_addr+new_len;
@@ -291,6 +301,7 @@
 		}
 		return new_addr;
 	}
+ out_vma:
 	if (allocated_vma)
 		kmem_cache_free(vm_area_cachep, new_vma);
  out:


* [PATCH] NUMA API for Linux 6/ Add shared memory support
From: Andi Kleen @ 2004-04-06 13:37 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API support to tmpfs and hugetlbfs. Shared memory
is a bit of a special case for NUMA policy. Normally policy is associated
with VMAs or with processes, but for a shared memory segment you really
want to share the policy. The core NUMA API has code for that;
this patch adds the necessary changes to tmpfs and hugetlbfs.

First it changes the custom swapping code in tmpfs to follow the policy
set via VMAs.

It is also useful to have a "backing store" of policy that saves
the policy even when nobody has the shared memory segment mapped. This
allows command line tools to pre-configure policy, which is then 
later used by programs.

Note that hugetlbfs needs more changes: it also has to be switched
to lazy allocation, otherwise the prefaulting prevents mbind() from
working.

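For illustration, with the shared policy in place a tool can bind a
SysV shared memory segment to a node before any consumer touches it (a
hedged userland sketch; shmid and length are placeholders):

	void *p = shmat(shmid, NULL, 0);
	unsigned long nodes = 1UL << 1;		/* node 1 only */

	if (mbind(p, length, MPOL_BIND, &nodes, sizeof(nodes) * 8, 0) < 0)
		perror("mbind");
	/* the policy now stays with the segment, not just this mapping */
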
diff -u linux-2.6.5-numa/fs/hugetlbfs/inode.c-o linux-2.6.5-numa/fs/hugetlbfs/inode.c
--- linux-2.6.5-numa/fs/hugetlbfs/inode.c-o	2004-04-06 13:12:17.000000000 +0200
+++ linux-2.6.5-numa/fs/hugetlbfs/inode.c	2004-04-06 13:36:12.000000000 +0200
@@ -375,6 +375,7 @@
 
 	inode = new_inode(sb);
 	if (inode) {
+		struct hugetlbfs_inode_info *info;
 		inode->i_mode = mode;
 		inode->i_uid = uid;
 		inode->i_gid = gid;
@@ -383,6 +384,8 @@
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		info = HUGETLBFS_I(inode);
+		mpol_shared_policy_init(&info->policy);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -510,6 +513,32 @@
 	}
 }
 
+static kmem_cache_t *hugetlbfs_inode_cachep;
+
+static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
+{
+	struct hugetlbfs_inode_info *p = kmem_cache_alloc(hugetlbfs_inode_cachep,
+							  SLAB_KERNEL);
+	if (!p)
+		return NULL;
+	return &p->vfs_inode;
+}
+
+static void init_once(void *foo, kmem_cache_t *cachep, unsigned long flags)
+{
+	struct hugetlbfs_inode_info *ei = (struct hugetlbfs_inode_info *) foo;
+
+	if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+	    SLAB_CTOR_CONSTRUCTOR)
+		inode_init_once(&ei->vfs_inode);
+}
+
+static void hugetlbfs_destroy_inode(struct inode *inode)
+{
+	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);  
+	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
 static struct address_space_operations hugetlbfs_aops = {
 	.readpage	= hugetlbfs_readpage,
 	.prepare_write	= hugetlbfs_prepare_write,
@@ -541,6 +570,8 @@
 };
 
 static struct super_operations hugetlbfs_ops = {
+	.alloc_inode    = hugetlbfs_alloc_inode,
+	.destroy_inode  = hugetlbfs_destroy_inode,
 	.statfs		= hugetlbfs_statfs,
 	.drop_inode	= hugetlbfs_drop_inode,
 	.put_super	= hugetlbfs_put_super,
@@ -755,9 +786,16 @@
 	int error;
 	struct vfsmount *vfsmount;
 
+	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
+				     sizeof(struct hugetlbfs_inode_info),
+				     0, SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT,
+				     init_once, NULL);
+	if (hugetlbfs_inode_cachep == NULL)
+		return -ENOMEM;
+
 	error = register_filesystem(&hugetlbfs_fs_type);
 	if (error)
-		return error;
+		goto out;
 
 	vfsmount = kern_mount(&hugetlbfs_fs_type);
 
@@ -767,11 +805,16 @@
 	}
 
 	error = PTR_ERR(vfsmount);
+
+ out:
+	if (error)
+		kmem_cache_destroy(hugetlbfs_inode_cachep);		
 	return error;
 }
 
 static void __exit exit_hugetlbfs_fs(void)
 {
+	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
 }
 
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -435,6 +445,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
 			unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
diff -u linux-2.6.5-numa/include/linux/shmem_fs.h-o linux-2.6.5-numa/include/linux/shmem_fs.h
--- linux-2.6.5-numa/include/linux/shmem_fs.h-o	2004-03-21 21:11:55.000000000 +0100
+++ linux-2.6.5-numa/include/linux/shmem_fs.h	2004-04-06 13:36:12.000000000 +0200
@@ -2,6 +2,7 @@
 #define __SHMEM_FS_H
 
 #include <linux/swap.h>
+#include <linux/mempolicy.h>
 
 /* inode in-kernel data */
 
@@ -15,6 +16,7 @@
 	unsigned long		alloced;    /* data pages allocated to file */
 	unsigned long		swapped;    /* subtotal assigned to swap */
 	unsigned long		flags;
+	struct shared_policy     policy;
 	struct list_head	list;
 	struct inode		vfs_inode;
 };
diff -u linux-2.6.5-numa/ipc/shm.c-o linux-2.6.5-numa/ipc/shm.c
--- linux-2.6.5-numa/ipc/shm.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/ipc/shm.c	2004-04-06 13:36:12.000000000 +0200
@@ -163,6 +163,8 @@
 	.open	= shm_open,	/* callback for a new vm-area open */
 	.close	= shm_close,	/* callback for when the vm-area is released */
 	.nopage	= shmem_nopage,
+	.set_policy = shmem_set_policy,
+	.get_policy = shmem_get_policy,
 };
 
 static int newseg (key_t key, int shmflg, size_t size)
diff -u linux-2.6.5-numa/mm/shmem.c-o linux-2.6.5-numa/mm/shmem.c
--- linux-2.6.5-numa/mm/shmem.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/shmem.c	2004-04-06 13:36:12.000000000 +0200
@@ -8,6 +8,7 @@
  *		 2002 Red Hat Inc.
  * Copyright (C) 2002-2003 Hugh Dickins.
  * Copyright (C) 2002-2003 VERITAS Software Corporation.
+ * Copyright (C) 2004 Andi Kleen, SuSE Labs
  *
  * This file is released under the GPL.
  */
@@ -37,8 +38,10 @@
 #include <linux/vfs.h>
 #include <linux/blkdev.h>
 #include <linux/security.h>
+#include <linux/swapops.h>
 #include <asm/uaccess.h>
 #include <asm/div64.h>
+#include <asm/pgtable.h>
 
 /* This magic number is used in glibc for posix shared memory */
 #define TMPFS_MAGIC	0x01021994
@@ -758,6 +761,72 @@
 	return WRITEPAGE_ACTIVATE;	/* Return with the page locked */
 }
 
+#ifdef CONFIG_NUMA
+static struct page *shmem_swapin_async(struct shared_policy *p,
+				       swp_entry_t entry, unsigned long idx)
+{
+	struct page *page;
+	struct vm_area_struct pvma; 
+	/* Create a pseudo vma that just contains the policy */
+	memset(&pvma, 0, sizeof(struct vm_area_struct));
+	pvma.vm_end = PAGE_SIZE;
+	pvma.vm_pgoff = idx;
+	pvma.vm_policy = mpol_shared_policy_lookup(p, idx); 
+	page = read_swap_cache_async(entry, &pvma, 0);
+	mpol_free(pvma.vm_policy);
+	return page; 
+} 
+
+struct page *shmem_swapin(struct shmem_inode_info *info, swp_entry_t entry, 
+			  unsigned long idx)
+{
+	struct shared_policy *p = &info->policy;
+	int i, num;
+	struct page *page;
+	unsigned long offset;
+
+	num = valid_swaphandles(entry, &offset);
+	for (i = 0; i < num; offset++, i++) {
+		page = shmem_swapin_async(p, swp_entry(swp_type(entry), offset), idx);
+		if (!page)
+			break;
+		page_cache_release(page);
+	}
+	lru_add_drain();	/* Push any new pages onto the LRU now */
+	return shmem_swapin_async(p, entry, idx); 
+}
+
+static struct page *
+shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info, 
+		 unsigned long idx)
+{
+	struct vm_area_struct pvma;
+	struct page *page;
+
+	memset(&pvma, 0, sizeof(struct vm_area_struct)); 
+	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx); 
+	pvma.vm_pgoff = idx;
+	pvma.vm_end = PAGE_SIZE;
+	page = alloc_page_vma(gfp, &pvma, 0); 
+	mpol_free(pvma.vm_policy);
+	return page; 
+} 
+#else
+static inline struct page *
+shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+{ 
+	swapin_readahead(entry, 0, NULL);
+	return read_swap_cache_async(entry, NULL, 0);
+} 
+
+static inline struct page *
+shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
+				 unsigned long idx)
+{
+	return alloc_page(gfp);  
+}
+#endif
+
 /*
  * shmem_getpage - either get the page from swap or allocate a new one
  *
@@ -815,8 +884,7 @@
 			if (majmin == VM_FAULT_MINOR && type)
 				inc_page_state(pgmajfault);
 			majmin = VM_FAULT_MAJOR;
-			swapin_readahead(swap);
-			swappage = read_swap_cache_async(swap);
+			swappage = shmem_swapin(info, swap, idx); 
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -921,7 +989,9 @@
 
 		if (!filepage) {
 			spin_unlock(&info->lock);
-			filepage = page_cache_alloc(mapping);
+			filepage = shmem_alloc_page(mapping_gfp_mask(mapping), 
+						    info,
+						    idx); 
 			if (!filepage) {
 				shmem_free_block(inode);
 				error = -ENOMEM;
@@ -1046,6 +1116,19 @@
 	return 0;
 }
 
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+{
+	struct inode *i = vma->vm_file->f_dentry->d_inode;
+	return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+}
+
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *i = vma->vm_file->f_dentry->d_inode;
+	unsigned long idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);	
+} 
+
 void shmem_lock(struct file *file, int lock)
 {
 	struct inode *inode = file->f_dentry->d_inode;
@@ -1094,6 +1177,7 @@
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
+		mpol_shared_policy_init(&info->policy);
 		info->flags = VM_ACCOUNT;
 		switch (mode & S_IFMT) {
 		default:
@@ -1789,6 +1873,7 @@
 
 static void shmem_destroy_inode(struct inode *inode)
 {
+	mpol_free_shared_policy(&SHMEM_I(inode)->policy);  
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
@@ -1873,6 +1958,8 @@
 static struct vm_operations_struct shmem_vm_ops = {
 	.nopage		= shmem_nopage,
 	.populate	= shmem_populate,
+	.set_policy     = shmem_set_policy,
+	.get_policy     = shmem_get_policy,
 };
 
 static struct super_block *shmem_get_sb(struct file_system_type *fs_type,
diff -u linux-2.6.5-numa/include/linux/hugetlb.h-o linux-2.6.5-numa/include/linux/hugetlb.h
--- linux-2.6.5-numa/include/linux/hugetlb.h-o	2004-04-06 13:12:21.000000000 +0200
+++ linux-2.6.5-numa/include/linux/hugetlb.h	2004-04-06 13:36:12.000000000 +0200
@@ -3,6 +3,8 @@
 
 #ifdef CONFIG_HUGETLB_PAGE
 
+#include <linux/mempolicy.h>
+
 struct ctl_table;
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -103,6 +105,17 @@
 	spinlock_t	stat_lock;
 };
 
+
+struct hugetlbfs_inode_info { 
+	struct shared_policy policy;
+	struct inode vfs_inode;
+}; 
+
+static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
+{
+	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
+}
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
diff -u linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o	2004-04-06 13:11:59.000000000 +0200
+++ linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c	2004-04-06 13:36:12.000000000 +0200
@@ -547,6 +640,13 @@
 	return NULL;
 }
 
+static int hugetlb_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+{
+	struct inode *inode = vma->vm_file->f_dentry->d_inode;
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, new);
+}
+
 struct vm_operations_struct hugetlb_vm_ops = {
 	.nopage = hugetlb_nopage,
+	.set_policy = hugetlb_set_policy,	
 };


* [PATCH] NUMA API for Linux 7/ Add statistics
From: Andi Kleen @ 2004-04-06 13:38 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA hit/miss statistics to page allocation and display them
in sysfs.

This is not 100% required for NUMA API, but without this it is very
difficult to make sure NUMA API works properly.

The overhead is quite low because all counters are per CPU, and the
accounting only happens when CONFIG_NUMA is defined.

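The counters are exported through one numastat file per node; reading it
looks roughly like this (field names as added below, the values are
purely illustrative):

	$ cat /sys/devices/system/node/node0/numastat
	numa_hit 48592
	numa_miss 0
	numa_foreign 0
	interleave_hit 1024
	local_node 48290
	other_node 302
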
diff -u linux-2.6.5-numa/include/linux/mmzone.h-o linux-2.6.5-numa/include/linux/mmzone.h
--- linux-2.6.5-numa/include/linux/mmzone.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mmzone.h	2004-04-06 13:36:12.000000000 +0200
@@ -52,6 +52,14 @@
 
 struct per_cpu_pageset {
 	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+#ifdef CONFIG_NUMA
+	unsigned long numa_hit;		/* allocated in intended node */
+	unsigned long numa_miss;	/* allocated in non intended node */
+	unsigned long numa_foreign;	/* was intended here, hit elsewhere */
+	unsigned long interleave_hit; 	/* interleaver preferred this zone */
+	unsigned long local_node;	/* allocation from local node */
+	unsigned long other_node;	/* allocation from other node */
+#endif
 } ____cacheline_aligned_in_smp;
 
 /*
diff -u linux-2.6.5-numa/mm/page_alloc.c-o linux-2.6.5-numa/mm/page_alloc.c
--- linux-2.6.5-numa/mm/page_alloc.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/page_alloc.c	2004-04-06 13:49:54.000000000 +0200
@@ -447,6 +447,31 @@
 }
 #endif /* CONFIG_PM */
 
+static void zone_statistics(struct zonelist *zonelist, struct zone *z) 
+{ 
+#ifdef CONFIG_NUMA
+	unsigned long flags;
+	int cpu; 
+	pg_data_t *pg = z->zone_pgdat,
+		*orig = zonelist->zones[0]->zone_pgdat;
+	struct per_cpu_pageset *p;
+	local_irq_save(flags); 
+	cpu = smp_processor_id();
+	p = &z->pageset[cpu];
+	if (pg == orig) {
+		z->pageset[cpu].numa_hit++;
+	} else { 
+		p->numa_miss++;
+		zonelist->zones[0]->pageset[cpu].numa_foreign++;
+	}
+	if (pg == NODE_DATA(numa_node_id()))
+		p->local_node++;
+	else
+		p->other_node++;	
+	local_irq_restore(flags);
+#endif
+} 
+
 /*
  * Free a 0-order page
  */
@@ -582,8 +607,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) { 
+					zone_statistics(zonelist, z); 
 		       		goto got_pg;
+			}
 		}
 		min += z->pages_low * sysctl_lower_zone_protection;
 	}
@@ -607,8 +634,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		min += local_min * sysctl_lower_zone_protection;
 	}
@@ -622,8 +651,10 @@
 			struct zone *z = zones[i];
 
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		goto nopage;
 	}
@@ -650,8 +681,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		min += z->pages_low * sysctl_lower_zone_protection;
 	}
diff -u linux-2.6.5-numa/drivers/base/node.c-o linux-2.6.5-numa/drivers/base/node.c
--- linux-2.6.5-numa/drivers/base/node.c-o	2004-03-17 12:17:46.000000000 +0100
+++ linux-2.6.5-numa/drivers/base/node.c	2004-04-06 13:36:12.000000000 +0200
@@ -30,13 +30,20 @@
 
 static SYSDEV_ATTR(cpumap,S_IRUGO,node_read_cpumap,NULL);
 
+/* Can be overwritten by architecture specific code. */
+int __attribute__((weak)) hugetlb_report_node_meminfo(int node, char *buf)
+{
+	return 0;
+}
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct sys_device * dev, char * buf)
 {
+	int n;
 	int nid = dev->id;
 	struct sysinfo i;
 	si_meminfo_node(&i, nid);
-	return sprintf(buf, "\n"
+	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
@@ -51,10 +58,52 @@
 		       nid, K(i.freehigh),
 		       nid, K(i.totalram-i.totalhigh),
 		       nid, K(i.freeram-i.freehigh));
+	n += hugetlb_report_node_meminfo(nid, buf + n);
+	return n;
 }
+
 #undef K 
 static SYSDEV_ATTR(meminfo,S_IRUGO,node_read_meminfo,NULL);
 
+static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
+{ 
+	unsigned long numa_hit, numa_miss, interleave_hit, numa_foreign;
+	unsigned long local_node, other_node;
+	int i, cpu;
+	pg_data_t *pg = NODE_DATA(dev->id);
+	numa_hit = 0; 
+	numa_miss = 0; 
+	interleave_hit = 0; 
+	numa_foreign = 0; 
+	local_node = 0;
+	other_node = 0;
+	for (i = 0; i < MAX_NR_ZONES; i++) { 
+		struct zone *z = &pg->node_zones[i]; 
+		for (cpu = 0; cpu < NR_CPUS; cpu++) { 
+			struct per_cpu_pageset *ps = &z->pageset[cpu]; 
+			numa_hit += ps->numa_hit; 
+			numa_miss += ps->numa_miss;
+			numa_foreign += ps->numa_foreign;
+			interleave_hit += ps->interleave_hit;
+			local_node += ps->local_node;
+			other_node += ps->other_node;
+		} 
+	} 
+	return sprintf(buf, 
+		       "numa_hit %lu\n"
+		       "numa_miss %lu\n"
+		       "numa_foreign %lu\n"
+		       "interleave_hit %lu\n"
+		       "local_node %lu\n"
+		       "other_node %lu\n", 
+		       numa_hit,
+		       numa_miss,
+		       numa_foreign,
+		       interleave_hit,
+		       local_node, 
+		       other_node); 
+} 
+static SYSDEV_ATTR(numastat,S_IRUGO,node_read_numastat,NULL);
 
 /*
  * register_node - Setup a driverfs device for a node.
@@ -74,6 +123,7 @@
 	if (!error){
 		sysdev_create_file(&node->sysdev, &attr_cpumap);
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
+		sysdev_create_file(&node->sysdev, &attr_numastat); 
 	}
 	return error;
 }


* [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory
From: Andi Kleen @ 2004-04-06 13:39 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm


Change the core VM to use alloc_page_vma() instead of alloc_page().

Change the swap readahead to follow the policy of the VMA.


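The new calling convention, in short (a sketch of the converted call
sites below):

	/* page is placed according to vma's policy at address */
	page = read_swap_cache_async(entry, vma, address);

	/* no VMA context available: process/default policy applies */
	page = read_swap_cache_async(entry, NULL, 0);
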
diff -u linux-2.6.5-numa/include/linux/swap.h-o linux-2.6.5-numa/include/linux/swap.h
--- linux-2.6.5-numa/include/linux/swap.h-o	2004-03-21 21:11:54.000000000 +0100
+++ linux-2.6.5-numa/include/linux/swap.h	2004-04-06 13:36:12.000000000 +0200
@@ -152,7 +152,7 @@
 extern void out_of_memory(void);
 
 /* linux/mm/memory.c */
-extern void swapin_readahead(swp_entry_t);
+extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
@@ -216,7 +216,8 @@
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page * lookup_swap_cache(swp_entry_t);
-extern struct page * read_swap_cache_async(swp_entry_t);
+extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma, 
+					   unsigned long addr);
 
 /* linux/mm/swapfile.c */
 extern int total_swap_pages;
@@ -257,7 +258,7 @@
 #define free_swap_and_cache(swp)		/*NOTHING*/
 #define swap_duplicate(swp)			/*NOTHING*/
 #define swap_free(swp)				/*NOTHING*/
-#define read_swap_cache_async(swp)		NULL
+#define read_swap_cache_async(swp,vma,addr)	NULL
 #define lookup_swap_cache(swp)			NULL
 #define valid_swaphandles(swp, off)		0
 #define can_share_swap_page(p)			0
diff -u linux-2.6.5-numa/mm/memory.c-o linux-2.6.5-numa/mm/memory.c
--- linux-2.6.5-numa/mm/memory.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/memory.c	2004-04-06 13:36:12.000000000 +0200
@@ -1056,7 +1056,7 @@
 	pte_chain = pte_chain_alloc(GFP_KERNEL);
 	if (!pte_chain)
 		goto no_pte_chain;
-	new_page = alloc_page(GFP_HIGHUSER);
+	new_page = alloc_page_vma(GFP_HIGHUSER,vma,address);
 	if (!new_page)
 		goto no_new_page;
 	copy_cow_page(old_page,new_page,address);
@@ -1210,9 +1210,17 @@
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time.  We also make sure to queue
  * the 'original' request together with the readahead ones...  
+ * 
+ * This has been extended to use the NUMA policies from the mm triggering
+ * the readahead.
+ * 
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
-void swapin_readahead(swp_entry_t entry)
+void swapin_readahead(swp_entry_t entry, unsigned long addr,struct vm_area_struct *vma) 
 {
+#ifdef CONFIG_NUMA
+	struct vm_area_struct *next_vma = vma ? vma->vm_next : NULL;
+#endif
 	int i, num;
 	struct page *new_page;
 	unsigned long offset;
@@ -1224,10 +1232,31 @@
 	for (i = 0; i < num; offset++, i++) {
 		/* Ok, do the async read-ahead now */
 		new_page = read_swap_cache_async(swp_entry(swp_type(entry),
-						offset));
+							   offset), vma, addr); 
 		if (!new_page)
 			break;
 		page_cache_release(new_page);
+#ifdef CONFIG_NUMA
+		/* 
+		 * Find the next applicable VMA for the NUMA policy.
+		 */
+		addr += PAGE_SIZE;
+		if (addr == 0) 
+			vma = NULL;
+		if (vma) { 
+			if (addr >= vma->vm_end) { 
+				vma = next_vma;
+				next_vma = vma ? vma->vm_next : NULL;
+			}
+			if (vma && addr < vma->vm_start) 
+				vma = NULL; 
+		} else { 
+			if (next_vma && addr >= next_vma->vm_start) { 
+				vma = next_vma;
+				next_vma = vma->vm_next;
+			}
+		} 
+#endif
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }
@@ -1250,8 +1279,8 @@
 	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
-		swapin_readahead(entry);
-		page = read_swap_cache_async(entry);
+ 		swapin_readahead(entry, address, vma);
+ 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte while
@@ -1356,7 +1385,7 @@
 		pte_unmap(page_table);
 		spin_unlock(&mm->page_table_lock);
 
-		page = alloc_page(GFP_HIGHUSER);
+		page = alloc_page_vma(GFP_HIGHUSER,vma,addr);
 		if (!page)
 			goto no_mem;
 		clear_user_highpage(page, addr);
@@ -1448,7 +1477,7 @@
 	 * Should we do an early C-O-W break?
 	 */
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page * page = alloc_page(GFP_HIGHUSER);
+		struct page * page = alloc_page_vma(GFP_HIGHUSER,vma,address);
 		if (!page)
 			goto oom;
 		copy_user_highpage(page, new_page, address);
diff -u linux-2.6.5-numa/mm/swap_state.c-o linux-2.6.5-numa/mm/swap_state.c
--- linux-2.6.5-numa/mm/swap_state.c-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/swap_state.c	2004-04-06 13:36:13.000000000 +0200
@@ -331,7 +331,8 @@
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
  */
-struct page * read_swap_cache_async(swp_entry_t entry)
+struct page * 
+read_swap_cache_async(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
@@ -351,7 +352,7 @@
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page(GFP_HIGHUSER);
+			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
diff -u linux-2.6.5-numa/mm/swapfile.c-o linux-2.6.5-numa/mm/swapfile.c
--- linux-2.6.5-numa/mm/swapfile.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/swapfile.c	2004-04-06 13:36:13.000000000 +0200
@@ -607,7 +607,7 @@
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
-		page = read_swap_cache_async(entry);
+		page = read_swap_cache_async(entry, NULL, 0);
 		if (!page) {
 			/*
 			 * Either swap_duplicate() failed because entry


* [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support
From: Andi Kleen @ 2004-04-06 13:40 UTC
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA policy support to i386/x86-64 hugetlbfs and switch it 
over to lazy allocation instead of prefaulting.

The NUMA policy support applies the current policy to the huge page
allocation.

It also switches hugetlbfs to lazy allocation, because otherwise
mbind() cannot work after mmap(): the memory would already be allocated.
This doesn't do any prereservation; when a process runs out of
huge pages it will get a SIGBUS.

There are currently various proposals on linux-kernel to add preallocation
for this; once one of these patches turns out to be good it would be 
best to replace this patch with it (and port the mpol_* changes over).

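The ordering problem that lazy allocation solves, seen from userspace
(a hedged sketch; nodes, len and hugetlb_fd are placeholders):

	p = mmap(NULL, len, PROT_READ|PROT_WRITE, MAP_SHARED, hugetlb_fd, 0);
	/* with prefaulting the huge pages would already be allocated
	   here, so the mbind() below would come too late */
	mbind(p, len, MPOL_BIND, &nodes, sizeof(nodes) * 8, 0);
	memset(p, 0, len);	/* faults now honor the policy */
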
diff -u linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o	2004-04-06 13:11:59.000000000 +0200
+++ linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c	2004-04-06 13:36:12.000000000 +0200
@@ -15,14 +15,17 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/mempolicy.h>
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-static long    htlbpagemem;
+/* AK: this should be all moved into the pgdat */
+
+static long    htlbpagemem[MAX_NUMNODES];
 int     htlbpage_max;
-static long    htlbzone_pages;
+static long    htlbzone_pages[MAX_NUMNODES];
 
 static struct list_head hugepage_freelists[MAX_NUMNODES];
 static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -33,14 +36,15 @@
 		&hugepage_freelists[page_zone(page)->zone_pgdat->node_id]);
 }
 
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	int nid = numa_node_id();
+	int nid = mpol_first_node(vma, addr); 
 	struct page *page = NULL;
 
 	if (list_empty(&hugepage_freelists[nid])) {
 		for (nid = 0; nid < MAX_NUMNODES; ++nid)
-			if (!list_empty(&hugepage_freelists[nid]))
+			if (mpol_node_valid(nid, vma, addr) && 
+			    !list_empty(&hugepage_freelists[nid]))
 				break;
 	}
 	if (nid >= 0 && nid < MAX_NUMNODES && !list_empty(&hugepage_freelists[nid])) {
@@ -61,18 +65,18 @@
 
 static void free_huge_page(struct page *page);
 
-static struct page *alloc_hugetlb_page(void)
+static struct page *alloc_hugetlb_page(struct vm_area_struct *vma, unsigned long addr)
 {
 	int i;
 	struct page *page;
 
 	spin_lock(&htlbpage_lock);
-	page = dequeue_huge_page();
+	page = dequeue_huge_page(vma, addr);
 	if (!page) {
 		spin_unlock(&htlbpage_lock);
 		return NULL;
 	}
-	htlbpagemem--;
+	htlbpagemem[page_zone(page)->zone_pgdat->node_id]--;
 	spin_unlock(&htlbpage_lock);
 	set_page_count(page, 1);
 	page->lru.prev = (void *)free_huge_page;
@@ -284,7 +288,7 @@
 
 	spin_lock(&htlbpage_lock);
 	enqueue_huge_page(page);
-	htlbpagemem++;
+	htlbpagemem[page_zone(page)->zone_pgdat->node_id]++;
 	spin_unlock(&htlbpage_lock);
 }
 
@@ -329,41 +333,49 @@
 	spin_unlock(&mm->page_table_lock);
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+/* page_table_lock held on entry. */
+static int 
+hugetlb_alloc_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+			       unsigned long addr, int write_access)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	int ret;
+	pte_t *pte;
+	struct page *page = NULL;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 
+	pte = huge_pte_alloc(mm, addr); 
 		if (!pte) {
-			ret = -ENOMEM;
+		ret = VM_FAULT_OOM;
 			goto out;
 		}
-		if (!pte_none(*pte))
-			continue;
+
+		/* Handle race */
+		if (!pte_none(*pte)) { 
+			ret = VM_FAULT_MINOR;
+			goto flush; 
+		}
 
 		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 		page = find_get_page(mapping, idx);
 		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
+		/* Should do this at prefault time, but that gets us into
+		   trouble with freeing right now. */
+		ret = hugetlb_get_quota(mapping);
+		if (ret) {
+			ret = VM_FAULT_OOM;
 				goto out;
 			}
-			page = alloc_hugetlb_page();
+		
+			page = alloc_hugetlb_page(vma, addr);
 			if (!page) {
 				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
+			
+			/* Instead of OOMing here could just transparently use
+			   small pages. */
+			
+				ret = VM_FAULT_OOM;
 				goto out;
 			}
 			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
@@ -371,23 +383,64 @@
 			if (ret) {
 				hugetlb_put_quota(mapping);
 				free_huge_page(page);
+				ret = VM_FAULT_SIGBUS;
 				goto out;
 			}
-		}
+		ret = VM_FAULT_MAJOR; 
+	} else
+		ret = VM_FAULT_MINOR;
+		
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
+
+ flush:
+	/* Don't need to flush other CPUs. They will just do a page
+	   fault and flush it lazily. */
+	__flush_tlb_one(addr);
+	
+ out:
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }
 
+int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+		       unsigned long address, int write_access)
+{ 
+	pmd_t *pmd;
+	pgd_t *pgd;
+
+	if (write_access && !(vma->vm_flags & VM_WRITE))
+		return VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);	
+	pgd = pgd_offset(mm, address); 
+	if (pgd_none(*pgd)) 
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	pmd = pmd_offset(pgd, address);
+	if (pmd_none(*pmd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	BUG_ON(!pmd_large(*pmd)); 
+
+	/* must have been a race. Flush the TLB. NX not supported yet. */ 
+
+	__flush_tlb_one(address); 
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+} 
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static void update_and_free_page(struct page *page)
 {
 	int j;
 	struct page *map;
 
 	map = page;
-	htlbzone_pages--;
+	htlbzone_pages[page_zone(page)->zone_pgdat->node_id]--;
 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
 		map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
@@ -404,6 +457,7 @@
 	struct list_head *p;
 	struct page *page, *map;
 
+	page = NULL;
 	map = NULL;
 	spin_lock(&htlbpage_lock);
 	/* all lowmem is on node 0 */
@@ -411,7 +465,7 @@
 		if (map) {
 			list_del(&map->list);
 			update_and_free_page(map);
-			htlbpagemem--;
+			htlbpagemem[page_zone(map)->zone_pgdat->node_id]--;
 			map = NULL;
 			if (++count == 0)
 				break;
@@ -423,49 +477,61 @@
 	if (map) {
 		list_del(&map->list);
 		update_and_free_page(map);
-		htlbpagemem--;
+		htlbpagemem[page_zone(map)->zone_pgdat->node_id]--;
 		count++;
 	}
 	spin_unlock(&htlbpage_lock);
 	return count;
 }
 
+static long all_huge_pages(void)
+{ 
+	long pages = 0;
+	int i;
+	for (i = 0; i < numnodes; i++) 
+		pages += htlbzone_pages[i];
+	return pages;
+} 
+
 static int set_hugetlb_mem_size(int count)
 {
 	int lcount;
 	struct page *page;
-
 	if (count < 0)
 		lcount = count;
-	else
-		lcount = count - htlbzone_pages;
+	else { 
+		lcount = count - all_huge_pages();
+	}
 
 	if (lcount == 0)
-		return (int)htlbzone_pages;
+		return (int)all_huge_pages();
 	if (lcount > 0) {	/* Increase the mem size. */
 		while (lcount--) {
+			int node;
 			page = alloc_fresh_huge_page();
 			if (page == NULL)
 				break;
 			spin_lock(&htlbpage_lock);
 			enqueue_huge_page(page);
-			htlbpagemem++;
-			htlbzone_pages++;
+			node = page_zone(page)->zone_pgdat->node_id;
+			htlbpagemem[node]++;
+			htlbzone_pages[node]++;
 			spin_unlock(&htlbpage_lock);
 		}
-		return (int) htlbzone_pages;
+		goto out;
 	}
 	/* Shrink the memory size. */
 	lcount = try_to_free_low(lcount);
 	while (lcount++) {
-		page = alloc_hugetlb_page();
+		page = alloc_hugetlb_page(NULL, 0);
 		if (page == NULL)
 			break;
 		spin_lock(&htlbpage_lock);
 		update_and_free_page(page);
 		spin_unlock(&htlbpage_lock);
 	}
-	return (int) htlbzone_pages;
+ out:
+	return (int)all_huge_pages();
 }
 
 int hugetlb_sysctl_handler(ctl_table *table, int write,
@@ -498,33 +564,60 @@
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
 
 	for (i = 0; i < htlbpage_max; ++i) {
+		int nid; 
 		page = alloc_fresh_huge_page();
 		if (!page)
 			break;
 		spin_lock(&htlbpage_lock);
 		enqueue_huge_page(page);
+		nid = page_zone(page)->zone_pgdat->node_id;
+		htlbpagemem[nid]++;
+		htlbzone_pages[nid]++;
 		spin_unlock(&htlbpage_lock);
 	}
-	htlbpage_max = htlbpagemem = htlbzone_pages = i;
-	printk("Total HugeTLB memory allocated, %ld\n", htlbpagemem);
+	htlbpage_max = i;
+	printk("Initial HugeTLB pages allocated: %d\n", i);
 	return 0;
 }
 module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int i;
+	long pages = 0, mem = 0;
+	for (i = 0; i < numnodes; i++) {
+		pages += htlbzone_pages[i];
+		mem += htlbpagemem[i];
+	}
+
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
 			"Hugepagesize:    %5lu kB\n",
-			htlbzone_pages,
-			htlbpagemem,
+			pages,
+			mem,
 			HPAGE_SIZE/1024);
 }
 
+int hugetlb_report_node_meminfo(int node, char *buf)
+{
+	return sprintf(buf,
+			"HugePages_Total: %5lu\n"
+			"HugePages_Free:  %5lu\n"
+			"Hugepagesize:    %5lu kB\n",
+			htlbzone_pages[node],
+			htlbpagemem[node],
+			HPAGE_SIZE/1024);
+}
+
+/* Not accurate with policy */
 int is_hugepage_mem_enough(size_t size)
 {
-	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
+	long pm = 0;
+	int i;
+	for (i = 0; i < numnodes; i++)
+		pm += htlbpagemem[i];
+	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= pm;
 }
 
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -643,6 +660,9 @@
 extern int remap_page_range(struct vm_area_struct *vma, unsigned long from,
 		unsigned long to, unsigned long size, pgprot_t prot);
 
+extern int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+			      unsigned long address, int write_access);
+
 #ifndef CONFIG_DEBUG_PAGEALLOC
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable)
diff -u linux-2.6.5-numa/mm/memory.c-o linux-2.6.5-numa/mm/memory.c
--- linux-2.6.5-numa/mm/memory.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/memory.c	2004-04-06 13:36:12.000000000 +0200
@@ -1604,6 +1633,15 @@
 	return VM_FAULT_MINOR;
 }
 
+
+/* Can be overwritten by the architecture */
+int __attribute__((weak)) arch_hugetlb_fault(struct mm_struct *mm, 
+					     struct vm_area_struct *vma, 
+					     unsigned long address, int write_access)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  */
@@ -1619,7 +1657,7 @@
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return arch_hugetlb_fault(mm, vma, address, write_access);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd


* [PATCH] NUMA API for Linux 10/ Bitmap bugfix
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (8 preceding siblings ...)
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support Andi Kleen
@ 2004-04-06 13:40 ` Andi Kleen
  2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
  2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-06 13:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


Bugfix to prevent gcc 3.2 from miscompiling bitmap_copy() in bitmap.h: computing the memcpy() length in a local variable first works around the compiler bug.

diff -u linux-2.6.5-numa/include/linux/bitmap.h-o linux-2.6.5-numa/include/linux/bitmap.h
--- linux-2.6.5-numa/include/linux/bitmap.h-o	2004-03-17 12:17:59.000000000 +0100
+++ linux-2.6.5-numa/include/linux/bitmap.h	2004-04-06 13:36:12.000000000 +0200
@@ -29,7 +29,8 @@
 static inline void bitmap_copy(unsigned long *dst,
 			const unsigned long *src, int bits)
 {
-	memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
+	int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
+	memcpy(dst, src, len);
 }
 
 void bitmap_shift_right(unsigned long *dst,


* Re: [PATCH] NUMA API for Linux 3/ Add i386 support
  2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
@ 2004-04-06 23:23   ` Andrew Morton
  0 siblings, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2004-04-06 23:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Manfred Spraul

Andi Kleen <ak@suse.de> wrote:
>
> @@ -279,8 +279,11 @@
>  #define __NR_utimes		271
>  #define __NR_fadvise64_64	272
>  #define __NR_vserver		273
> +#define __NR_mbind		273
> +#define __NR_get_mempolicy	274
> +#define __NR_set_mempolicy	275

hm, those are all wrong.  I fixed it up.

Manfred, I'm going to bump the mq syscall numbers.  numa API has been
around a bit longer and I suspect more people are relying on the syscall
numbers not changing.   Whatever they were ;)


 
diff -puN arch/i386/kernel/entry.S~numa-api-i386 arch/i386/kernel/entry.S
--- 25/arch/i386/kernel/entry.S~numa-api-i386	Tue Apr  6 16:19:40 2004
+++ 25-akpm/arch/i386/kernel/entry.S	Tue Apr  6 16:20:40 2004
@@ -908,9 +908,9 @@ ENTRY(sys_call_table)
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
-	.long sys_ni_syscall	/* sys_mbind */
-	.long sys_ni_syscall	/* 275 sys_get_mempolicy */
-	.long sys_ni_syscall	/* sys_set_mempolicy */
+	.long sys_mbind
+	.long sys_get_mempolicy	/* 275 */
+	.long sys_set_mempolicy
 	.long sys_mq_open
 	.long sys_mq_unlink
 	.long sys_mq_timedsend
diff -puN include/asm-i386/unistd.h~numa-api-i386 include/asm-i386/unistd.h

_



* Re: NUMA API for Linux
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (9 preceding siblings ...)
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 10/ Bitmap bugfix Andi Kleen
@ 2004-04-06 23:35 ` Paul Jackson
  2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 60+ messages in thread
From: Paul Jackson @ 2004-04-06 23:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Andi,

What kernel version (-mm? patches?) is this NUMA patchset
based on (what was used for the diff)?

Thanks.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: NUMA API for Linux
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (10 preceding siblings ...)
  2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
@ 2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 60+ messages in thread
From: Pavel Machek @ 2004-04-08 20:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Hi!

> This NUMA API doesn't not attempt to implement page migration or anything
> else complicated: all it does is to police the allocation when a page 
> is first allocation or when a page is reallocated after swapping. Currently
> only support for shared memory and anonymous memory is there; policy for 
> file based mappings is not implemented yet (although they get implicitely
> policied by the default process policy)
> 
> It adds three new system calls: mbind to change the policy of a VMA,
> set_mempolicy to change the policy of a process, get_mempolicy to retrieve
> memory policy. User tools (numactl, libnuma, test programs, manpages) can be 

set_mempolicy is a pretty ugly name. Why is prctl inadequate?
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         



* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
@ 2004-05-05 16:05   ` Paul Jackson
  2004-05-05 16:39     ` Andi Kleen
  0 siblings, 1 reply; 60+ messages in thread
From: Paul Jackson @ 2004-05-05 16:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, linux-kernel, akpm

This patch doesn't build for ia64 sn2_defconfig.

The build fails with link complaints of:

arch/ia64/kernel/built-in.o(.text+0x10862): In function `pfm_smpl_buffer_alloc':
: undefined reference to `mpol_set_vma_default'
arch/ia64/mm/built-in.o(.text+0x412): In function `ia64_init_addr_space':
: undefined reference to `mpol_set_vma_default'
arch/ia64/mm/built-in.o(.text+0x522): In function `ia64_init_addr_space':
: undefined reference to `mpol_set_vma_default'
arch/ia64/ia32/built-in.o(.text+0x1f432): In function `ia64_elf32_init':
: undefined reference to `mpol_set_vma_default'
arch/ia64/ia32/built-in.o(.text+0x1f982): In function `ia32_setup_arg_pages':
: undefined reference to `mpol_set_vma_default'
kernel/built-in.o(.text+0x162b2): In function `do_exit':
: undefined reference to `mpol_free'
make: *** [.tmp_vmlinux1] Error 1

Presumably this is because the following mpol_set_vma_default() and
mpol_free() macro calls are added, but the mempolicy.h header providing
the defines for these macros is not included:

arch/ia64/ia32/binfmt_elf32.c:          mpol_set_vma_default(vma);
arch/ia64/ia32/binfmt_elf32.c:          mpol_set_vma_default(mpnt);
arch/ia64/kernel/perfmon.c:     mpol_set_vma_default(vma);
arch/ia64/mm/init.c:            mpol_set_vma_default(vma);
arch/ia64/mm/init.c:                    mpol_set_vma_default(vma);
kernel/exit.c:  mpol_free(tsk->mempolicy);

Looks like you should do something equivalent to adding:

  #include <linux/mempolicy.h>

to the files:

  arch/ia64/ia32/binfmt_elf32.c
  arch/ia64/kernel/perfmon.c
  arch/ia64/mm/init.c
  kernel/exit.c

The following, based off the numa-api-vma-policy-hooks patch in Andrew's
latest 2.6.6-rc3-mm2, includes these additional includes, and builds
successfully:

================================ snip ================================

From: Andi Kleen <ak@suse.de>

NUMA API adds a policy to each VMA.  During VMA creation, merging, and
splitting these policies must be handled properly.  This patch adds the calls
to do this.

It is a nop when CONFIG_NUMA is not defined.
DESC
numa-api-vma-policy-hooks fix
EDESC

mm/mmap.c: In function `copy_vma':
mm/mmap.c:1531: structure has no member named `vm_policy'


Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/ia32/binfmt_elf32.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/ia32/binfmt_elf32.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/ia32/binfmt_elf32.c	2004-05-05 08:48:27.000000000 -0700
@@ -13,6 +13,7 @@
 
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 #include <asm/param.h>
@@ -104,6 +105,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		down_write(&current->mm->mmap_sem);
 		{
 			insert_vm_struct(current->mm, vma);
@@ -190,6 +192,7 @@
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = 0;
+		mpol_set_vma_default(mpnt);
 		insert_vm_struct(current->mm, mpnt);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	}
Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/kernel/perfmon.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/kernel/perfmon.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/kernel/perfmon.c	2004-05-05 08:48:38.000000000 -0700
@@ -29,6 +29,7 @@
 #include <linux/init.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/sysctl.h>
 #include <linux/list.h>
 #include <linux/file.h>
@@ -2308,6 +2309,7 @@
 	vma->vm_ops	     = NULL;
 	vma->vm_pgoff	     = 0;
 	vma->vm_file	     = NULL;
+	mpol_set_vma_default(vma);
 	vma->vm_private_data = NULL; 
 
 	/*
Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/mm/init.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/mm/init.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/mm/init.c	2004-05-05 08:48:48.000000000 -0700
@@ -12,6 +12,7 @@
 #include <linux/efi.h>
 #include <linux/elf.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
 #include <linux/personality.h>
@@ -132,6 +133,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		insert_vm_struct(current->mm, vma);
 	}
 
@@ -144,6 +146,7 @@
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
 			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
+			mpol_set_vma_default(vma);
 			insert_vm_struct(current->mm, vma);
 		}
 	}
Index: 2.6.6-rc3-mm2-bitmapv5/arch/m68k/atari/stram.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/m68k/atari/stram.c	2004-05-05 07:38:12.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/m68k/atari/stram.c	2004-05-05 07:40:14.000000000 -0700
@@ -752,7 +752,7 @@
 			/* Get a page for the entry, using the existing
 			   swap cache page if there is one.  Otherwise,
 			   get a clean page and read the swap into it. */
-			page = read_swap_cache_async(entry);
+			page = read_swap_cache_async(entry, NULL, 0);
 			if (!page) {
 				swap_free(entry);
 				return -ENOMEM;
Index: 2.6.6-rc3-mm2-bitmapv5/arch/s390/kernel/compat_exec.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/s390/kernel/compat_exec.c	2004-05-05 07:38:14.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/s390/kernel/compat_exec.c	2004-05-05 08:45:17.000000000 -0700
@@ -72,6 +72,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/arch/x86_64/ia32/ia32_binfmt.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/x86_64/ia32/ia32_binfmt.c	2004-05-05 07:38:17.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/x86_64/ia32/ia32_binfmt.c	2004-05-05 08:45:17.000000000 -0700
@@ -365,6 +365,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/fs/exec.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/fs/exec.c	2004-05-05 07:40:13.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/fs/exec.c	2004-05-05 08:45:18.000000000 -0700
@@ -427,6 +427,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/kernel/exit.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/kernel/exit.c	2004-05-05 07:38:41.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/kernel/exit.c	2004-05-05 08:49:02.000000000 -0700
@@ -6,6 +6,7 @@
 
 #include <linux/config.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/smp_lock.h>
@@ -790,6 +791,7 @@
 	__exit_fs(tsk);
 	exit_namespace(tsk);
 	exit_thread();
+	mpol_free(tsk->mempolicy);
 
 	if (tsk->signal->leader)
 		disassociate_ctty(1);
Index: 2.6.6-rc3-mm2-bitmapv5/kernel/fork.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/kernel/fork.c	2004-05-05 07:40:13.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/kernel/fork.c	2004-05-05 08:45:18.000000000 -0700
@@ -270,6 +270,7 @@
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge = 0;
+	struct mempolicy *pol;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -311,6 +312,11 @@
 		if (!tmp)
 			goto fail_nomem;
 		*tmp = *mpnt;
+		pol = mpol_copy(vma_policy(mpnt));
+		retval = PTR_ERR(pol);
+		if (IS_ERR(pol))
+			goto fail_nomem_policy;
+		vma_set_policy(tmp, pol);
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
@@ -357,6 +363,8 @@
 	flush_tlb_mm(current->mm);
 	up_write(&oldmm->mmap_sem);
 	return retval;
+fail_nomem_policy:
+	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
 	retval = -ENOMEM;
 fail:
@@ -963,10 +971,16 @@
 	p->security = NULL;
 	p->io_context = NULL;
 	p->audit_context = NULL;
+ 	p->mempolicy = mpol_copy(p->mempolicy);
+ 	if (IS_ERR(p->mempolicy)) {
+ 		retval = PTR_ERR(p->mempolicy);
+ 		p->mempolicy = NULL;
+ 		goto bad_fork_cleanup;
+ 	}
 
 	retval = -ENOMEM;
 	if ((retval = security_task_alloc(p)))
-		goto bad_fork_cleanup;
+		goto bad_fork_cleanup_policy;
 	if ((retval = audit_alloc(p)))
 		goto bad_fork_cleanup_security;
 	/* copy all the process information */
@@ -1112,6 +1126,8 @@
 	audit_free(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+bad_fork_cleanup_policy:
+	mpol_free(p->mempolicy);
 bad_fork_cleanup:
 	if (p->pid > 0)
 		free_pidmap(p->pid);
Index: 2.6.6-rc3-mm2-bitmapv5/mm/mmap.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/mm/mmap.c	2004-05-05 07:40:14.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/mm/mmap.c	2004-05-05 08:45:18.000000000 -0700
@@ -387,7 +387,8 @@
 			struct vm_area_struct *prev,
 			struct rb_node *rb_parent, unsigned long addr, 
 			unsigned long end, unsigned long vm_flags,
-			struct file *file, unsigned long pgoff)
+		     	struct file *file, unsigned long pgoff,
+		        struct mempolicy *policy)
 {
 	spinlock_t *lock = &mm->page_table_lock;
 	struct inode *inode = file ? file->f_dentry->d_inode : NULL;
@@ -411,6 +412,7 @@
 	 * Can it merge with the predecessor?
 	 */
 	if (prev->vm_end == addr &&
+  		        mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
 		struct vm_area_struct *next;
 		int need_up = 0;
@@ -428,6 +430,7 @@
 		 */
 		next = prev->vm_next;
 		if (next && prev->vm_end == next->vm_start &&
+		    		vma_mpol_equal(prev, next) &&
 				can_vma_merge_before(next, vm_flags, file,
 					pgoff, (end - addr) >> PAGE_SHIFT)) {
 			prev->vm_end = next->vm_end;
@@ -440,6 +443,7 @@
 				fput(file);
 
 			mm->map_count--;
+			mpol_free(vma_policy(next));
 			kmem_cache_free(vm_area_cachep, next);
 			return prev;
 		}
@@ -455,6 +459,8 @@
 	prev = prev->vm_next;
 	if (prev) {
  merge_next:
+ 		if (!mpol_equal(policy, vma_policy(prev)))
+  			return 0;
 		if (!can_vma_merge_before(prev, vm_flags, file,
 				pgoff, (end - addr) >> PAGE_SHIFT))
 			return NULL;
@@ -631,7 +637,7 @@
 	/* Can we just expand an old anonymous mapping? */
 	if (!file && !(vm_flags & VM_SHARED) && rb_parent)
 		if (vma_merge(mm, prev, rb_parent, addr, addr + len,
-					vm_flags, NULL, 0))
+					vm_flags, NULL, pgoff, NULL))
 			goto out;
 
 	/*
@@ -654,6 +660,7 @@
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
 	vma->vm_next = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	if (file) {
@@ -693,7 +700,9 @@
 	addr = vma->vm_start;
 
 	if (!file || !rb_parent || !vma_merge(mm, prev, rb_parent, addr,
-				addr + len, vma->vm_flags, file, pgoff)) {
+					      vma->vm_end,
+					      vma->vm_flags, file, pgoff,
+					      vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -703,6 +712,7 @@
 				atomic_inc(&inode->i_writecount);
 			fput(file);
 		}
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1118,6 +1128,7 @@
 
 	remove_shared_vm_struct(area);
 
+	mpol_free(vma_policy(area));
 	if (area->vm_ops && area->vm_ops->close)
 		area->vm_ops->close(area);
 	if (area->vm_file)
@@ -1200,6 +1211,7 @@
 int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
+	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	struct address_space *mapping = NULL;
 
@@ -1222,6 +1234,13 @@
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
+	pol = mpol_copy(vma_policy(vma));
+	if (IS_ERR(pol)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return PTR_ERR(pol);
+	}
+	vma_set_policy(new, pol);
+
 	if (new->vm_file)
 		get_file(new->vm_file);
 
@@ -1391,7 +1410,7 @@
 
 	/* Can we just expand an old anonymous mapping? */
 	if (rb_parent && vma_merge(mm, prev, rb_parent, addr, addr + len,
-					flags, NULL, 0))
+					flags, NULL, 0, NULL))
 		goto out;
 
 	/*
@@ -1412,6 +1431,7 @@
 	vma->vm_pgoff = 0;
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -1472,6 +1492,7 @@
 		}
 		if (vma->vm_file)
 			fput(vma->vm_file);
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		vma = next;
 	}
@@ -1508,7 +1529,7 @@
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, rb_parent, addr, addr + len,
-			vma->vm_flags, vma->vm_file, pgoff);
+			vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
Index: 2.6.6-rc3-mm2-bitmapv5/mm/mprotect.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/mm/mprotect.c	2004-05-05 07:38:42.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/mm/mprotect.c	2004-05-05 08:45:18.000000000 -0700
@@ -125,6 +125,8 @@
 		return 0;
 	if (vma->vm_file || (vma->vm_flags & VM_SHARED))
 		return 0;
+	if (!vma_mpol_equal(vma, prev))
+		return 0;
 
 	/*
 	 * If the whole area changes to the protection of the previous one
@@ -136,6 +138,7 @@
 		__vma_unlink(mm, vma, prev);
 		spin_unlock(&mm->page_table_lock);
 
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		mm->map_count--;
 		return 1;
@@ -318,12 +321,14 @@
 
 	if (next && prev->vm_end == next->vm_start &&
 			can_vma_merge(next, prev->vm_flags) &&
+	    	vma_mpol_equal(prev, next) &&
 			!prev->vm_file && !(prev->vm_flags & VM_SHARED)) {
 		spin_lock(&prev->vm_mm->page_table_lock);
 		prev->vm_end = next->vm_end;
 		__vma_unlink(prev->vm_mm, next, prev);
 		spin_unlock(&prev->vm_mm->page_table_lock);
 
+		mpol_free(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
 		prev->vm_mm->map_count--;
 	}


-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:05   ` Paul Jackson
@ 2004-05-05 16:39     ` Andi Kleen
  2004-05-05 16:47       ` Paul Jackson
  0 siblings, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2004-05-05 16:39 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andi Kleen, linux-kernel, akpm

> Looks like you should do something equivalent to adding:
> 
>   #include <linux/mempolicy.h>
> 
> to the files:
> 
>   arch/ia64/ia32/binfmt_elf32.c
>   arch/ia64/kernel/perfmon.c
>   arch/ia64/mm/init.c
>   kernel/exit.c
> 
> The following, based off the numa-api-vma-policy-hooks patch in Andrew's
> latest 2.6.6-rc3-mm2, includes these additional includes, and builds
> successfully:

This is not needed, because mempolicy.h is included in mm.h, which
is included at some point by all of these.
Perhaps you missed a patch? (several of the patches depended on each other) 

-Andi


* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:39     ` Andi Kleen
@ 2004-05-05 16:47       ` Paul Jackson
  2004-05-06  6:00         ` Andi Kleen
  0 siblings, 1 reply; 60+ messages in thread
From: Paul Jackson @ 2004-05-05 16:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, linux-kernel, akpm, hch

> Perhaps you missed a patch? (several of the patches depended on each other) 

No - perhaps Christoph Hellwig removed the include.

See the patch small-numa-api-fixups.patch in 2.6.6-rc3-mm2:

========================= snip =========================
From: Christoph Hellwig <hch@lst.de>

- don't include mempolicy.h in sched.h and mm.h when a forward declaration
  is enough.  Andi argued against that in the past, but I'd really hate to add
  another header to two of the includes used in basically every driver when we
  can include it in the six files actually needing it instead (that number is
  for my ppc32 system; maybe other arches need more includes in their
  directories)

- make NUMA API fields in task_struct conditional on CONFIG_NUMA; this gives
  us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.

... details omitted ...
========================= snip =========================

Christoph did add the needed #include's of mempolicy.h in the
generic files that needed it, but not in the ia64 files, more
or less.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373


* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:47       ` Paul Jackson
@ 2004-05-06  6:00         ` Andi Kleen
  0 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-05-06  6:00 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andi Kleen, linux-kernel, akpm, hch

On Wed, May 05, 2004 at 09:47:48AM -0700, Paul Jackson wrote:
> > Perhaps you missed a patch? (several of the patches depended on each other) 
> 
> No - perhaps Christoph Hellwig removed the include.

Hmm, ok. I must have missed that going in. Your patch looks correct
as an additional fix. 

Thanks,

-Andi


* Re: NUMA API for Linux
  2004-04-15 10:39     ` Andi Kleen
  2004-04-15 11:48       ` Robin Holt
@ 2004-04-15 19:44       ` Matthew Dobson
  1 sibling, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-15 19:44 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Thu, 2004-04-15 at 03:39, Andi Kleen wrote:
> On Wed, 14 Apr 2004 17:38:37 -0700
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
> 
> 
> > 1) Redefine the value of some of the MPOL_* flags
> 
> I don't want to merge the flags and the mode argument. It's ugly.

Well, if you're against adding a pid argument, this change is useless
anyway.


> > 2) Rename check_* to mpol_check_*
> 
> I really don't understand why you insist on renaming all my functions? 
> I like the current naming, thank you.

I don't want to rename them *all*! ;)  There's a million functions named
check_foo(), so why not a few more?  :)


> > 3) Remove get_nodes().  This should be done in the same manner as
> > sys_sched_setaffinity().  We shouldn't care about unused high bits.
> 
> I disagree on that. This would break programs that are first tested
> on a small machine and then later run on a big machine (common case)

I don't follow this argument...  How does this code work on a small
machine, but break on a big one:

DECLARE_BITMAP(nodes, MAX_NUMNODES);
bitmap_zero(nodes, MAX_NUMNODES);
set_bit(4, nodes);
mbind(startaddr, 8 * PAGE_SIZE, MPOL_PREFERRED, nodes, MAX_NUMNODES,
flags);

??

Userspace should be declaring an array of unsigned longs based on the
size of the machine they're running on.  If userspace programs are just
declaring arbitrarily sized arrays and hoping they'll work, they're
being stupid.  If they're declaring oversized arrays and not zeroing the
array before passing it in, then they're being lazy.  Either way, we're
going to throw the bits away, so do we care if they're being lazy or
stupid?  If the bits we care about are sane, then we do what userspace
tells us.  If the bits we care about aren't, we throw an error. 
Besides, we don't return an error when they pass us too little data, so why
do we return an error when they pass too much?

I remember one of your concerns was checking that all the nodes with set
bits need to be online.  I think this is wrong.  We need to be checking
node_online_map at fault time, not at binding time.  Sooner or later
memory hotplug will go into the kernel, and then you (or someone) will
have to rewrite how this is handled.  I'd rather do it right
the first time.
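
To sketch what I mean (a toy model with made-up names, nothing like the
kernel code): mask the requested nodes against whatever is online at
each allocation, so hotplug is handled naturally instead of being
rejected up front at mbind() time.

#include <stdio.h>

static unsigned long node_online_map = 0x5;	/* nodes 0 and 2 online */

/* called on every allocation, not once at bind time */
static unsigned long effective_nodes(unsigned long policy_nodes)
{
	return policy_nodes & node_online_map;
}

int main(void)
{
	unsigned long policy = 0x7;	/* user asked for nodes 0-2 */

	printf("effective now:  %#lx\n", effective_nodes(policy));
	node_online_map |= 0x2;		/* node 1 hot-added later */
	printf("after hotplug:  %#lx\n", effective_nodes(policy));
	return 0;
}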


> > 4) Create mpol_check_flags() to, well, check the flags.  As the number
> > of flags and modes grows, it will be easier to do this check in its own
> > function.
> > 5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
> > a size_t, add __user to the declaration of 'nmask', change 'maxnode' to
> 
> unsigned long is the standard for system calls.  Check some others. 

Ok.  I see this set of system calls as sort of the illegitimate
love-child of mlock()/mprotect() and sched_set/getaffinity(). 
sys_mlock() was using a size_t, so I copied that...


> > 'nmask_len', and condense 'flags' and 'mode' into 'flags'.  The
> > motivation here is to make this syscall similar to
> > sys_sched_setaffinity().  These calls are basically the memory
> > equivalent of set/getaffinity, and should look & behave that way.  Also,
> > dropping an argument leaves an opening for a pid argument, which I
> > believe would be good.  We should allow processes (with appropriate
> > permissions, of course) to mbind other processes.
> 
> Messing with another process's VM is a recipe for disaster. There
> used to be tons of exploitable races in /proc/pid/mem; I don't want to repeat that.
> Adding a pid to set_mempolicy would be a bit easier, but it would require
> adding a lock to the task struct for this. Currently it is nice and lockless
> because it relies on the fact that only the current process can change
> its own policy. I prefer to keep it lockless, because that keeps the memory
> allocation fast paths faster.

We're already grabbing the mmap_sem for writing when we modify the vma's
in sys_mbind.  I haven't looked at what kind of locking would be
necessary if we were modifying a *different* processes vma's, but I
figured that would at least be a good start.  If it's significantly more
complicated than that, you're probably right that it's not worth the
effort.


> > 6) Change how end is calculated as follows:
> > 	end = PAGE_ALIGN(start+len);
> > 	start &= PAGE_MASK;
> > Basically, this allows users to pass in a non-page aligned 'start', and
> > makes sure we mbind all pages from the page containing 'start' to the
> > page containing 'start'+'len'.
> 
> mprotect() does the EINVAL check on unalignment. I think it's better
> to follow mprotect here.

Ok.  I'm usually a fan of not failing if we don't *have* to, and an
unaligned start address is easily fixable.  On the other hand, if other
syscalls throw errors for it, then it's fine with me.


BTW, any response to the idea of combining sys_mbind() &
sys_set_mempolicy()?

-Matt



* Re: NUMA API for Linux
  2004-04-15 11:48       ` Robin Holt
@ 2004-04-15 18:32         ` Matthew Dobson
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-15 18:32 UTC (permalink / raw)
  To: Robin Holt; +Cc: Andi Kleen, LKML, Andrew Morton, Martin J. Bligh

On Thu, 2004-04-15 at 04:48, Robin Holt wrote:
> On Thu, Apr 15, 2004 at 12:39:15PM +0200, Andi Kleen wrote:
> > On Wed, 14 Apr 2004 17:38:37 -0700
> > Matthew Dobson <colpatch@us.ibm.com> wrote:
> > > 2) Rename check_* to mpol_check_*
> > 
> > I really don't understand why you insist on renaming all my functions? 
> > I like the current naming, thank you.
> > 
> 
> I like the mpol_ flavors because there is no namespace collision
> on these.  When I look at cscope or any other code analysis tool,
> I know that the mpol_... functions belong together.  Makes
> investigating other people's code easier.
> 
> Just my 2 cents,
> Robin Holt

That was my motivation.  It's not particularly important, especially
since the functions are all declared static.  I just thought they were a
little more readable when prefixed by mpol_.

-Matt



* Re: NUMA API for Linux
  2004-04-15 10:39     ` Andi Kleen
@ 2004-04-15 11:48       ` Robin Holt
  2004-04-15 18:32         ` Matthew Dobson
  2004-04-15 19:44       ` Matthew Dobson
  1 sibling, 1 reply; 60+ messages in thread
From: Robin Holt @ 2004-04-15 11:48 UTC (permalink / raw)
  To: Andi Kleen; +Cc: colpatch, linux-kernel, akpm, mbligh

On Thu, Apr 15, 2004 at 12:39:15PM +0200, Andi Kleen wrote:
> On Wed, 14 Apr 2004 17:38:37 -0700
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> > 2) Rename check_* to mpol_check_*
> 
> I really don't understand why you insist on renaming all my functions? 
> I like the current naming, thank you.
> 

I like the mpol_ flavors because there is no namespace collision
on these.  When I look at cscope or any other code analysis tool,
I know that the mpol_... functions belong together.  Makes
investigating other people's code easier.

Just my 2 cents,
Robin Holt


* Re: NUMA API for Linux
  2004-04-15  0:38   ` Matthew Dobson
@ 2004-04-15 10:39     ` Andi Kleen
  2004-04-15 11:48       ` Robin Holt
  2004-04-15 19:44       ` Matthew Dobson
  0 siblings, 2 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-15 10:39 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, akpm, mbligh

On Wed, 14 Apr 2004 17:38:37 -0700
Matthew Dobson <colpatch@us.ibm.com> wrote:



> 1) Redefine the value of some of the MPOL_* flags

I don't want to merge the flags and the mode argument. It's ugly.

> 2) Rename check_* to mpol_check_*

I really don't understand why you insist on renaming all my functions? 
I like the current naming, thank you.

> 3) Remove get_nodes().  This should be done in the same manner as
> sys_sched_setaffinity().  We shouldn't care about unused high bits.

I disagree on that. This would break programs that are first tested
on a small machine and then later run on a big machine (common case)

> 4) Create mpol_check_flags() to, well, check the flags.  As the number
> of flags and modes grows, it will be easier to do this check in its own
> function.
> 5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
> a size_t, add __user to the declaration of 'nmask', change 'maxnode' to

unsigned long is the standard for system calls.  Check some others. 

> 'nmask_len', and condense 'flags' and 'mode' into 'flags'.  The
> motivation here is to make this syscall similar to
> sys_sched_setaffinity().  These calls are basically the memory
> equivalent of set/getaffinity, and should look & behave that way.  Also,
> dropping an argument leaves an opening for a pid argument, which I
> believe would be good.  We should allow processes (with appropriate
> permissions, of course) to mbind other processes.

Messing with another process's VM is a recipe for disaster. There
used to be tons of exploitable races in /proc/pid/mem; I don't want to repeat that.
Adding a pid to set_mempolicy would be a bit easier, but it would require
adding a lock to the task struct for this. Currently it is nice and lockless
because it relies on the fact that only the current process can change
its own policy. I prefer to keep it lockless, because that keeps the memory
allocation fast paths faster.
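
A toy userspace analogue of that invariant (illustrative names only,
not the kernel code): the policy pointer is written only by its owning
thread, so that thread can read it on the allocation fast path without
any lock.  A pid argument would introduce a foreign writer and force
both sides to take a lock.

#include <stdio.h>
#include <stdlib.h>

struct mempolicy { int mode; };

/* one per thread, standing in for current->mempolicy */
static __thread struct mempolicy *cur_policy;

static void set_own_policy(int mode)	/* roughly sys_set_mempolicy */
{
	struct mempolicy *new = malloc(sizeof(*new));
	if (!new)
		return;
	new->mode = mode;
	free(cur_policy);	/* safe: no other thread ever touches it */
	cur_policy = new;
}

static int policy_fast_path(void)	/* roughly the allocator hot path */
{
	/* lockless read, valid only because of the owner-writes rule */
	return cur_policy ? cur_policy->mode : 0;
}

int main(void)
{
	set_own_policy(3);
	printf("mode seen on fast path: %d\n", policy_fast_path());
	return 0;
}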

> 6) Change how end is calculated as follows:
> 	end = PAGE_ALIGN(start+len);
> 	start &= PAGE_MASK;
> Basically, this allows users to pass in a non-page aligned 'start', and
> makes sure we mbind all pages from the page containing 'start' to the
> page containing 'start'+'len'.

mprotect() does the EINVAL check on unalignment. I think it's better
to follow mprotect here.

-Andi


* Re: NUMA API for Linux
  2004-04-07 21:27 ` Andi Kleen
  2004-04-07 21:41   ` Matthew Dobson
@ 2004-04-15  0:38   ` Matthew Dobson
  2004-04-15 10:39     ` Andi Kleen
  1 sibling, 1 reply; 60+ messages in thread
From: Matthew Dobson @ 2004-04-15  0:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

Andi,
	I'm sure you're sick of me commenting on your patches without "showing
you the money".  I've attached a patch with some of the changes I think
would be beneficial.  Feel free to let me know which changes you think
are crap and which you think are not.

Changes include:

1) Redefine the value of some of the MPOL_* flags
2) Rename check_* to mpol_check_*
3) Remove get_nodes().  This should be done in the same manner as
sys_sched_setaffinity().  We shouldn't care about unused high bits.
4) Create mpol_check_flags() to, well, check the flags.  As the number
of flags and modes grows, it will be easier to do this check in its own
function.
5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
a size_t, add __user to the declaration of 'nmask', change 'maxnode' to
'nmask_len', and condense 'flags' and 'mode' into 'flags'.  The
motivation here is to make this syscall similar to
sys_sched_setaffinity().  These calls are basically the memory
equivalent of set/getaffinity, and should look & behave that way.  Also,
dropping an argument leaves an opening for a pid argument, which I
believe would be good.  We should allow processes (with appropriate
permissions, of course) to mbind other processes.
6) Change how end is calculated as follows:
	end = PAGE_ALIGN(start+len);
	start &= PAGE_MASK;
Basically, this allows users to pass in a non-page aligned 'start', and
makes sure we mbind all pages from the page containing 'start' to the
page containing 'start'+'len'.
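
A quick worked example of that arithmetic, assuming 4kB pages (macros
written out so it compiles standalone):

#include <stdio.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x)	(((x) + PAGE_SIZE - 1) & PAGE_MASK)

int main(void)
{
	unsigned long start = 0x12345, len = 0x2000, end;

	end = PAGE_ALIGN(start + len);	/* 0x14345 rounds up to 0x15000 */
	start &= PAGE_MASK;		/* 0x12345 rounds down to 0x12000 */
	/* covers pages 0x12000, 0x13000, 0x14000: every page that any
	   byte of the original [0x12345, 0x14345) range lands in */
	printf("mbind range [%#lx, %#lx)\n", start, end);
	return 0;
}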

This patch also shows that sys_mbind() and sys_set_mempolicy() have more
commonalities than differences.  I believe these two syscalls should be
combined into one with the call signature of sys_mbind().  If the user
passes a start address and length of 0 (or maybe even a flag?), we bind
the whole process, otherwise we bind just a region.  This would shrink
the patch even more than the measly 3 lines the current patch saves, and
save a syscall.
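
If it helps, this is the shape I have in mind (hypothetical stand-ins;
nothing like this was posted):

#include <stddef.h>
#include <stdio.h>

static long bind_region(unsigned long start, size_t len, int flags)
{
	printf("bind [%#lx, +%#zx) flags=%#x\n", start, len, flags);
	return 0;
}

static long bind_whole_process(int flags)
{
	printf("whole-process policy, flags=%#x\n", flags);
	return 0;
}

/* one entry point: len == 0 selects the whole-process path */
static long mbind_combined(unsigned long start, size_t len, int flags)
{
	return len ? bind_region(start, len, flags)
		   : bind_whole_process(flags);
}

int main(void)
{
	mbind_combined(0, 0, 0x1);		/* like set_mempolicy */
	mbind_combined(0x10000, 8192, 0x1);	/* like mbind */
	return 0;
}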

[mcd@arrakis source]$ diffstat ~/linux/patches/265-mm4/mcd_mods.patch
 include/linux/mempolicy.h |   12 ++--
 mm/mempolicy.c            |  119
++++++++++++++++++++++------------------------
 2 files changed, 64 insertions(+), 67 deletions(-)

-Matt

[-- Attachment #2: mcd_mods.patch --]
[-- Type: text/x-patch, Size: 7294 bytes --]

diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.5-mm4/include/linux/mempolicy.h linux-2.6.5-mcd_numa_api/include/linux/mempolicy.h
--- linux-2.6.5-mm4/include/linux/mempolicy.h	2004-04-12 15:07:18.000000000 -0700
+++ linux-2.6.5-mcd_numa_api/include/linux/mempolicy.h	2004-04-14 17:13:22.000000000 -0700
@@ -8,20 +8,22 @@
  * Copyright 2003,2004 Andi Kleen SuSE Labs
  */
 
-/* Policies */
+/* Policies aka 'modes' */
 #define MPOL_DEFAULT	0
 #define MPOL_PREFERRED	1
 #define MPOL_BIND	2
 #define MPOL_INTERLEAVE	3
 
-#define MPOL_MAX MPOL_INTERLEAVE
+#define MPOL_MAX	(MPOL_INTERLEAVE)
+/* Reserve low 4 bits for policies, ie: 16 possible 'modes' */
+#define MPOL_MODE_MASK	(0xf)
 
 /* Flags for get_mem_policy */
-#define MPOL_F_NODE	(1<<0)	/* return next IL mode instead of node mask */
-#define MPOL_F_ADDR	(1<<1)	/* look up vma using address */
+#define MPOL_F_NODE	(1<<4)	/* return next IL mode instead of node mask */
+#define MPOL_F_ADDR	(1<<5)	/* look up vma using address */
 
 /* Flags for mbind */
-#define MPOL_MF_STRICT	(1<<0)	/* Verify existing pages in the mapping */
+#define MPOL_MF_STRICT	(1<<6)	/* Verify existing pages in the mapping */
 
 #ifdef __KERNEL__
 
diff -Nurp --exclude-from=/home/mcd/.dontdiff linux-2.6.5-mm4/mm/mempolicy.c linux-2.6.5-mcd_numa_api/mm/mempolicy.c
--- linux-2.6.5-mm4/mm/mempolicy.c	2004-04-12 15:42:30.000000000 -0700
+++ linux-2.6.5-mcd_numa_api/mm/mempolicy.c	2004-04-14 17:22:05.000000000 -0700
@@ -88,13 +88,11 @@ static struct mempolicy default_policy =
 };
 
 /* Check if all specified nodes are online */
-static int check_online(unsigned long *nodes)
+static int mpol_check_online(unsigned long *nodes)
 {
 	DECLARE_BITMAP(offline, MAX_NUMNODES);
 
 	bitmap_copy(offline, node_online_map, MAX_NUMNODES);
-	if (bitmap_empty(offline, MAX_NUMNODES))
-		set_bit(0, offline);
 	bitmap_complement(offline, MAX_NUMNODES);
 	bitmap_and(offline, offline, nodes, MAX_NUMNODES);
 	if (!bitmap_empty(offline, MAX_NUMNODES))
@@ -103,7 +101,7 @@ static int check_online(unsigned long *n
 }
 
 /* Do sanity checking on a policy */
-static int check_policy(int mode, unsigned long *nodes)
+static int mpol_check_policy(int mode, unsigned long *nodes)
 {
 	int empty = bitmap_empty(nodes, MAX_NUMNODES);
 
@@ -120,46 +118,25 @@ static int check_policy(int mode, unsign
 			return -EINVAL;
 		break;
 	}
-	return check_online(nodes);
+	return mpol_check_online(nodes);
 }
 
-/* Copy a node mask from user space. */
-static int get_nodes(unsigned long *nodes, unsigned long *nmask,
-		     unsigned long maxnode, int mode)
-{
-	unsigned long k;
-	unsigned long nlongs;
-	unsigned long endmask;
-
-	--maxnode;
-	nlongs = BITS_TO_LONGS(maxnode);
-	if ((maxnode % BITS_PER_LONG) == 0)
-		endmask = ~0UL;
-	else
-		endmask = (1UL << (maxnode % BITS_PER_LONG)) - 1;
-
-	/* When the user specified more nodes than supported just check
-	   if the non supported part is all zero. */
-	if (nmask && nlongs > BITS_TO_LONGS(MAX_NUMNODES)) {
-		for (k = BITS_TO_LONGS(MAX_NUMNODES); k < nlongs; k++) {
-			unsigned long t;
-			if (get_user(t,  nmask + k))
-				return -EFAULT;
-			if (k == nlongs - 1) {
-				if (t & endmask)
-					return -EINVAL;
-			} else if (t)
-				return -EINVAL;
-		}
-		nlongs = BITS_TO_LONGS(MAX_NUMNODES);
-		endmask = ~0UL;
-	}
+/*
+ * Do sanity checking on flags argument to sys_mbind.
+ * Return 'mode' bits if sane, 0 if bad flags.
+ */
+static int mpol_check_flags(int flags)
+{
+	int mode = flags & MPOL_MODE_MASK;
+	flags &= ~MPOL_MODE_MASK;
 
-	bitmap_zero(nodes, MAX_NUMNODES);
-	if (nmask && copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
-		return -EFAULT;
-	nodes[nlongs-1] &= endmask;
-	return check_policy(mode, nodes);
+	if (flags & ~MPOL_MF_STRICT)
+		return 0;
+
+	if (mode > MPOL_MAX)
+		return 0;
+
+	return mode;
 }
 
 /* Generate a custom zonelist for the BIND policy. */
@@ -259,7 +236,7 @@ verify_pages(unsigned long addr, unsigne
 
 /* Step 1: check the range */
 static struct vm_area_struct *
-check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
+mpol_check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 	    unsigned long *nodes, unsigned long flags)
 {
 	int err;
@@ -334,32 +311,39 @@ static int mbind_range(struct vm_area_st
 }
 
 /* Change policy for a memory range */
-asmlinkage long sys_mbind(unsigned long start, unsigned long len,
-			  unsigned long mode,
-			  unsigned long *nmask, unsigned long maxnode,
-			  unsigned flags)
+asmlinkage long sys_mbind(unsigned long start, size_t len,
+			  unsigned long __user *nmask, unsigned int nmask_len,
+			  int flags)
 {
 	struct vm_area_struct *vma;
 	struct mm_struct *mm = current->mm;
 	struct mempolicy *new;
 	unsigned long end;
 	DECLARE_BITMAP(nodes, MAX_NUMNODES);
-	int err;
+	int err, mode = 0;
 
-	if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
-		return -EINVAL;
-	if (start & ~PAGE_MASK)
+	/* Make sure user passed us sane 'flags', and separate the 'mode' */
+	mode = mpol_check_flags(flags);
+	if (mode == 0)
 		return -EINVAL;
+	flags &= ~MPOL_MODE_MASK;
 	if (mode == MPOL_DEFAULT)
 		flags &= ~MPOL_MF_STRICT;
-	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
-	end = start + len;
+
+	/* Ensure start and end are on page boundaries */
+	end = PAGE_ALIGN(start + len);
+	start &= PAGE_MASK;
 	if (end < start)
 		return -EINVAL;
 	if (end == start)
 		return 0;
 
-	err = get_nodes(nodes, nmask, maxnode, mode);
+	/* Copy user's bitmask of nodes */
+	if (nmask_len < sizeof(*nodes))
+		return -EINVAL;
+	if (copy_from_user(nodes, nmask, sizeof(*nodes)))
+		return -EFAULT;
+	err = mpol_check_policy(mode, nodes);
 	if (err)
 		return err;
 
@@ -367,11 +351,9 @@ asmlinkage long sys_mbind(unsigned long 
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
-	PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
-			mode,nodes[0]);
-
+	PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n", start, end, mode, nodes[0]);
 	down_write(&mm->mmap_sem);
-	vma = check_range(mm, start, end, nodes, flags);
+	vma = mpol_check_range(mm, start, end, nodes, flags);
 	err = PTR_ERR(vma);
 	if (!IS_ERR(vma))
 		err = mbind_range(vma, start, end, new);
@@ -381,21 +363,34 @@ asmlinkage long sys_mbind(unsigned long 
 }
 
 /* Set the process memory policy */
-asmlinkage long sys_set_mempolicy(int mode, unsigned long *nmask,
-				   unsigned long maxnode)
+asmlinkage long sys_set_mempolicy(unsigned long __user *nmask,
+				  unsigned int nmask_len, int flags)
 {
-	int err;
 	struct mempolicy *new;
 	DECLARE_BITMAP(nodes, MAX_NUMNODES);
+	int err, mode = 0;
 
-	if (mode > MPOL_MAX)
+	/* Make sure user passed us sane 'flags', and separate the 'mode' */
+	mode = mpol_check_flags(flags);
+	if (mode == 0)
 		return -EINVAL;
-	err = get_nodes(nodes, nmask, maxnode, mode);
+	flags &= ~MPOL_MODE_MASK;
+	if (mode == MPOL_DEFAULT)
+		flags &= ~MPOL_MF_STRICT;
+
+	/* Copy user's bitmask of nodes */
+	if (nmask_len < sizeof(*nodes))
+		return -EINVAL;
+	if (copy_from_user(nodes, nmask, sizeof(*nodes)))
+		return -EFAULT;
+	err = mpol_check_policy(mode, nodes);
 	if (err)
 		return err;
+
 	new = new_policy(mode, nodes);
 	if (IS_ERR(new))
 		return PTR_ERR(new);
+
 	mpol_free(current->mempolicy);
 	current->mempolicy = new;
 	if (new && new->policy == MPOL_INTERLEAVE)


* Re: NUMA API for Linux
  2004-04-09  5:29         ` Martin J. Bligh
@ 2004-04-09 18:44           ` Matthew Dobson
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-09 18:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andi Kleen, LKML, Andrew Morton

Sounds good to me.  I named it page_nodenum() to match
page_zonenum(), which I named to differentiate it from page_zone().  I
have no attachment to the names whatsoever, though.

-Matt

On Thu, 2004-04-08 at 22:29, Martin J. Bligh wrote:
> > Instead of looking up a page's node number by
> > page_zone(p)->zone_pgdat->node_id, you can get the same information much
> > more efficiently by doing some bit-twiddling on page->flags.  Use
> > page_nodenum(struct page *) from include/linux/mm.h.
> 
> Never noticed that before - I'd prefer we renamed this to page_to_nid 
> before anyone starts using it ... fits with the naming convention of 
> everything else (pfn_to_nid, etc). Nobody uses it right now - I grepped 
> the whole tree.
> 
> M.
> 
> diff -aurpN -X /home/fletch/.diff.exclude virgin/include/linux/mm.h name_nids/include/linux/mm.h
> --- virgin/include/linux/mm.h	Wed Mar 17 07:33:09 2004
> +++ name_nids/include/linux/mm.h	Thu Apr  8 22:27:24 2004
> @@ -340,7 +340,7 @@ static inline unsigned long page_zonenum
>  {
>  	return (page->flags >> NODEZONE_SHIFT) & (~(~0UL << ZONES_SHIFT));
>  }
> -static inline unsigned long page_nodenum(struct page *page)
> +static inline unsigned long page_to_nid(struct page *page)
>  {
>  	return (page->flags >> (NODEZONE_SHIFT + ZONES_SHIFT));
>  }
> 
> 
> 
> 
> 
> 



* Re: NUMA API for Linux
       [not found]       ` <1ISQC-7Cv-5@gated-at.bofh.it>
@ 2004-04-09  5:39         ` Andi Kleen
  0 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-09  5:39 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel

Matthew Dobson <colpatch@us.ibm.com> writes:
>
> Instead of looking up a page's node number by
> page_zone(p)->zone_pgdat->node_id, you can get the same information much
> more efficiently by doing some bit-twiddling on page->flags.  Use
> page_nodenum(struct page *) from include/linux/mm.h.

That wasn't there when I wrote the code. I will do that change, thanks.
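
For reference, here is a standalone demo of the packing being
discussed; the shift values are made up (the real ones depend on the
kernel config), but the recovery expressions match
page_nodenum()/page_zonenum() in mm.h.

#include <stdio.h>

#define ZONES_SHIFT	2				/* illustrative */
#define NODEZONE_SHIFT	(sizeof(unsigned long) * 8 - 8)	/* illustrative */

/* build the top bits of page->flags from a node id and zone number */
static unsigned long pack(unsigned long nid, unsigned long zone)
{
	return ((nid << ZONES_SHIFT) | zone) << NODEZONE_SHIFT;
}

static unsigned long page_to_nid(unsigned long flags)
{
	return flags >> (NODEZONE_SHIFT + ZONES_SHIFT);
}

static unsigned long page_zonenum(unsigned long flags)
{
	return (flags >> NODEZONE_SHIFT) & ~(~0UL << ZONES_SHIFT);
}

int main(void)
{
	unsigned long flags = pack(3, 1);	/* node 3, zone 1 */

	printf("nid=%lu zone=%lu\n", page_to_nid(flags), page_zonenum(flags));
	return 0;
}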

> So it looks like you *are* sharing policies, at least for VMA's in the
> range of a single mbind() call?  This is a good start! ;)  Looking
> further ahead, I'm a bit confused.  It seems despite *sharing* VMA's
> belonging to a single mbind() call, you *copy* VMA's during dup_mmap(),
> copy_process(), split_vma(), and move_vma().  So in the majority of
> cases you duplicate policies instead of sharing them, but you *do* share
> them in some instances?  Why the inconsistency?

Locking: policies get locked by their VM.

(the sharing you're advocating would require adding a new lock to 
mempolicy) 

>> +/* Change policy for a memory range */
>> +asmlinkage long sys_mbind(unsigned long start, unsigned long len,
>> +			  unsigned long mode,
>> +			  unsigned long *nmask, unsigned long maxnode,
>> +			  unsigned flags)
>
> What would you think about giving sys_mbind() a pid argument, like
> sys_sched_setaffinity()?  It would make it much easier for sysadmins to
> use the API if they didn't have to rewrite applications to make these
> calls on their own.  There's already a plethora of arguments, so one
> more might be overkill....  Just a thought.

It already has 6 arguments - 7 are not allowed. Also, playing with a different
process' VM remotely is just a recipe for disaster IMHO (remember
all the security holes that used to be in /proc/pid/mem).

> Why not condense both the sys_mbind() & sys_set_mempolicy() into a
> single call?  The functionality of these calls (and even their code) is
> very similar.  The main difference is there is no need for looking up
> VMA's and doing locking in the sys_set_mempolicy() case.  You overload
> the output side (sys_get_mempolicy()) to handle both whole process and
> memory range options, but you don't do the same on the input
> (sys_mbind() and sys_set_mempolicy()).  Saving one syscall and having
> them behave more symmetrically would be a nice addition...

I think it's cleaner to have them separated. 

>> +/* Retrieve NUMA policy */
>> +asmlinkage long sys_get_mempolicy(int *policy,
>> +				  unsigned long *nmask, unsigned long maxnode,
>> +				  unsigned long addr, unsigned long flags)	
>
> I had a thought...  Shouldn't all your user pointers be marked as such
> with __user?  Ie:

I figure that the people who use the tools that need such ugly annotations
will add them themselves.

>
>> +{
>> +	int err, pval;
>> +	struct mm_struct *mm = current->mm;
>> +	struct vm_area_struct *vma = NULL; 	
>> +	struct mempolicy *pol = current->mempolicy;
>> +
>> +	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>> +		return -EINVAL;
>> +	if (nmask != NULL && maxnode < numnodes)
>> +		return -EINVAL;
>
> Did you mean: if (nmask == NULL || maxnode < numnodes) ?

No, the original test is correct.

> I think that BUG_ON should be BUG_ON(nid < 0), since it *shouldn't* be
> possible to get through that loop with nid >= MAX_NUMNODES.  The only
> line that could set nid >= MAX_NUMNODES is nid = find_next_bit(...); and
> the very next line ensures that if nid is in fact >= MAX_NUMNODES it
> will be set to -1.  It actually looks pretty redundant altogether,
> but...

Yes, it could be removed.

-Andi


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-09  1:09       ` Matthew Dobson
@ 2004-04-09  5:29         ` Martin J. Bligh
  2004-04-09 18:44           ` Matthew Dobson
  0 siblings, 1 reply; 60+ messages in thread
From: Martin J. Bligh @ 2004-04-09  5:29 UTC (permalink / raw)
  To: colpatch, Andi Kleen; +Cc: LKML, Andrew Morton

> Instead of looking up a page's node number by
> page_zone(p)->zone_pgdat->node_id, you can get the same information much
> more efficiently by doing some bit-twiddling on page->flags.  Use
> page_nodenum(struct page *) from include/linux/mm.h.

Never noticed that before - I'd prefer we renamed this to page_to_nid 
before anyone starts using it ... fits with the naming convention of 
everything else (pfn_to_nid, etc). Nobody uses it right now - I grepped 
the whole tree.

M.

diff -aurpN -X /home/fletch/.diff.exclude virgin/include/linux/mm.h name_nids/include/linux/mm.h
--- virgin/include/linux/mm.h	Wed Mar 17 07:33:09 2004
+++ name_nids/include/linux/mm.h	Thu Apr  8 22:27:24 2004
@@ -340,7 +340,7 @@ static inline unsigned long page_zonenum
 {
 	return (page->flags >> NODEZONE_SHIFT) & (~(~0UL << ZONES_SHIFT));
 }
-static inline unsigned long page_nodenum(struct page *page)
+static inline unsigned long page_to_nid(struct page *page)
 {
 	return (page->flags >> (NODEZONE_SHIFT + ZONES_SHIFT));
 }

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 19:25               ` Andrew Morton
@ 2004-04-09  2:41                 ` Wim Coekaerts
  0 siblings, 0 replies; 60+ messages in thread
From: Wim Coekaerts @ 2004-04-09  2:41 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hugh Dickins, ak, mbligh, colpatch, linux-kernel

> ie: there are some (oracle) workloads where the kernel craps out due to
> lowmem vma exhaustion.  If they're now using remap_file_pages() for this then
> it may not be a problem any more.  Ingo would know better than I.

remap_file_pages() from now on, yes.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:45     ` Andi Kleen
  2004-04-07 22:19       ` Matthew Dobson
  2004-04-08  0:58       ` Matthew Dobson
@ 2004-04-09  1:09       ` Matthew Dobson
  2004-04-09  5:29         ` Martin J. Bligh
  2 siblings, 1 reply; 60+ messages in thread
From: Matthew Dobson @ 2004-04-09  1:09 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:

<snip>

> diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
> --- linux-2.6.5-mc2-numa/mm/mempolicy.c-o	2004-04-07 12:07:41.000000000 +0200
> +++ linux-2.6.5-mc2-numa/mm/mempolicy.c	2004-04-07 13:07:02.000000000 +0200

<snip more>

> +/* Ensure all existing pages follow the policy. */
> +static int
> +verify_pages(unsigned long addr, unsigned long end, unsigned long *nodes)
> +{
> +	while (addr < end) {
> +		struct page *p;
> +		pte_t *pte;
> +		pmd_t *pmd;
> +		pgd_t *pgd = pgd_offset_k(addr);
> +		if (pgd_none(*pgd)) {
> +			addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
> +			continue;
> +		}
> +		pmd = pmd_offset(pgd, addr);
> +		if (pmd_none(*pmd)) {
> +			addr = (addr + PMD_SIZE) & PMD_MASK;
> +			continue;
> +		}
> +		p = NULL;
> +		pte = pte_offset_map(pmd, addr);
> +		if (pte_present(*pte))
> +			p = pte_page(*pte);
> +		pte_unmap(pte);
> +		if (p) {
> +			unsigned nid = page_zone(p)->zone_pgdat->node_id;
> +			if (!test_bit(nid, nodes))
> +				return -EIO;
> +		}
> +		addr += PAGE_SIZE;
> +	}
> +	return 0;
> +}

Instead of looking up a page's node number by
page_zone(p)->zone_pgdat->node_id, you can get the same information much
more efficiently by doing some bit-twiddling on page->flags.  Use
page_nodenum(struct page *) from include/linux/mm.h.
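
For reference, the helper being recommended, with the page->flags layout
it implies (the function body is from the include/linux/mm.h hunk quoted
elsewhere in this thread; the diagram is an illustrative sketch):

	/*
	 * page->flags packs the node and zone ids into its top bits:
	 *
	 *   | node id | zone id | ordinary flag bits |
	 *
	 * with the zone id starting at bit NODEZONE_SHIFT and the node id
	 * above it, so a single shift recovers the node:
	 */
	static inline unsigned long page_nodenum(struct page *page)
	{
		return (page->flags >> (NODEZONE_SHIFT + ZONES_SHIFT));
	}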


> +/* Apply policy to a single VMA */
> +static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
> +{
> +	int err = 0;
> +	struct mempolicy *old = vma->vm_policy;
> +
> +	PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
> +		 vma->vm_start, vma->vm_end, vma->vm_pgoff,
> +		 vma->vm_ops, vma->vm_file,
> +		 vma->vm_ops ? vma->vm_ops->set_policy : NULL);
> +
> +	if (vma->vm_file)
> +		down(&vma->vm_file->f_mapping->i_shared_sem);
> +	if (vma->vm_ops && vma->vm_ops->set_policy)
> +		err = vma->vm_ops->set_policy(vma, new);
> +	if (!err) {
> +		mpol_get(new);
> +		vma->vm_policy = new;
> +		mpol_free(old);
> +	}
> +	if (vma->vm_file)
> +		up(&vma->vm_file->f_mapping->i_shared_sem);
> +	return err;
> +}

So it looks like you *are* sharing policies, at least for VMA's in the
range of a single mbind() call?  This is a good start! ;)  Looking
further ahead, I'm a bit confused.  It seems that despite *sharing* a
policy across VMA's belonging to a single mbind() call, you *copy*
policies during dup_mmap(), copy_process(), split_vma(), and move_vma().
So in the majority of cases you duplicate policies instead of sharing
them, but you *do* share them in some instances?  Why the inconsistency?


> +/* Change policy for a memory range */
> +asmlinkage long sys_mbind(unsigned long start, unsigned long len,
> +			  unsigned long mode,
> +			  unsigned long *nmask, unsigned long maxnode,
> +			  unsigned flags)

What would you think about giving sys_mbind() a pid argument, like
sys_sched_setaffinity()?  It would make it much easier for sysadmins to
use the API if they didn't have to rewrite applications to make these
calls on their own.  There's already a plethora of arguments, so one
more might be overkill....  Just a thought.
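
For illustration, the hypothetical signature would look like this (not
in any posted patch; as Andi notes in his reply, it would have 7
arguments, which the syscall convention does not allow):

	asmlinkage long sys_mbind(pid_t pid, unsigned long start,
				  unsigned long len, unsigned long mode,
				  unsigned long *nmask, unsigned long maxnode,
				  unsigned flags);	/* 7 arguments */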


> +{
> +	struct vm_area_struct *vma;
> +	struct mm_struct *mm = current->mm;
> +	struct mempolicy *new;
> +	unsigned long end;
> +	DECLARE_BITMAP(nodes, MAX_NUMNODES);
> +	int err;
> +
> +	if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
> +		return -EINVAL;
> +	if (start & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (mode == MPOL_DEFAULT)
> +		flags &= ~MPOL_MF_STRICT;
> +	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> +	end = start + len;
> +	if (end < start)
> +		return -EINVAL;
> +	if (end == start)
> +		return 0;
> +
> +	err = get_nodes(nodes, nmask, maxnode, mode);
> +	if (err)
> +		return err;
> +
> +	new = new_policy(mode, nodes);
> +	if (IS_ERR(new))
> +		return PTR_ERR(new);
> +
> +	PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
> +			mode,nodes[0]);
> +
> +	down_write(&mm->mmap_sem);
> +	vma = check_range(mm, start, end, nodes, flags);
> +	err = PTR_ERR(vma);
> +	if (!IS_ERR(vma))
> +		err = mbind_range(vma, start, end, new);
> +	up_write(&mm->mmap_sem);
> +	mpol_free(new);
> +	return err;
> +}
> +
> +/* Set the process memory policy */
> +asmlinkage long sys_set_mempolicy(int mode, unsigned long *nmask,
> +				   unsigned long maxnode)
> +{
> +	int err;
> +	struct mempolicy *new;
> +	DECLARE_BITMAP(nodes, MAX_NUMNODES);
> +
> +	if (mode > MPOL_MAX)
> +		return -EINVAL;
> +	err = get_nodes(nodes, nmask, maxnode, mode);
> +	if (err)
> +		return err;
> +	new = new_policy(mode, nodes);
> +	if (IS_ERR(new))
> +		return PTR_ERR(new);
> +	mpol_free(current->mempolicy);
> +	current->mempolicy = new;
> +	if (new && new->policy == MPOL_INTERLEAVE)
> +		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
> +	return 0;
> +}

Why not condense both the sys_mbind() & sys_set_mempolicy() into a
single call?  The functionality of these calls (and even their code) is
very similar.  The main difference is there is no need for looking up
VMA's and doing locking in the sys_set_mempolicy() case.  You overload
the output side (sys_get_mempolicy()) to handle both whole process and
memory range options, but you don't do the same on the input
(sys_mbind() and sys_set_mempolicy()).  Saving one syscall and having
them behave more symmetrically would be a nice addition...


> +/* Fill a zone bitmap for a policy */
> +static void get_zonemask(struct mempolicy *p, unsigned long *nodes)
> +{
> +	int i;
> +	bitmap_clear(nodes, MAX_NUMNODES);
> +	switch (p->policy) {
> +	case MPOL_BIND:
> +		for (i = 0; p->v.zonelist->zones[i]; i++)
> +			__set_bit(p->v.zonelist->zones[i]->zone_pgdat->node_id, nodes);
> +		break;
> +	case MPOL_DEFAULT:
> +		break;
> +	case MPOL_INTERLEAVE:
> +		bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
> +		break;
> +	case MPOL_PREFERRED:
> +		/* or use current node instead of online map? */
> +		if (p->v.preferred_node < 0)
> +			bitmap_copy(nodes, node_online_map, MAX_NUMNODES);
> +		else	
> +			__set_bit(p->v.preferred_node, nodes);
> +		break;
> +	default:
> +		BUG();
> +	}	
> +}

Shouldn't this be called get_nodemask()?  You're setting a bit in a
bitmask called 'nodes' for each node the policy is using, so you're
getting a nodemask, not a zonemask...


> +static int lookup_node(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct page *p;
> +	int err;
> +	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> +	if (err >= 0) {
> +		err = page_zone(p)->zone_pgdat->node_id;
> +		put_page(p);
> +	}	
> +	return err;
> +}

Again, you can save some pointer dereferences if you call
page_nodenum(p) instead of looking it up through the pgdat.


> +/* Retrieve NUMA policy */
> +asmlinkage long sys_get_mempolicy(int *policy,
> +				  unsigned long *nmask, unsigned long maxnode,
> +				  unsigned long addr, unsigned long flags)	

I had a thought...  Shouldn't all your user pointers be marked as such
with __user?  Ie:
asmlinkage long sys_get_mempolicy(int __user *policy, 
			unsigned long __user *nmask,
			unsigned long maxnode, unsigned long addr,
			unsigned long flags)	

This would apply to the other 2 syscalls as well.


> +{
> +	int err, pval;
> +	struct mm_struct *mm = current->mm;
> +	struct vm_area_struct *vma = NULL; 	
> +	struct mempolicy *pol = current->mempolicy;
> +
> +	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> +		return -EINVAL;
> +	if (nmask != NULL && maxnode < numnodes)
> +		return -EINVAL;

Did you mean: if (nmask == NULL || maxnode < numnodes) ?


<snip>

> +/* Do static interleaving for a VMA with known offset. */
> +static unsigned
> +offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off)
> +{
> +	unsigned target = (unsigned)off % (unsigned)numnodes;
> +	int c;
> +	int nid = -1;
> +	c = 0;
> +	do {
> +		nid = find_next_bit(pol->v.nodes, MAX_NUMNODES, nid+1);
> +		if (nid >= MAX_NUMNODES) {
> +			nid = -1; 		
> +			continue;
> +		}
> +		c++;
> +	} while (c <= target);
> +	BUG_ON(nid >= MAX_NUMNODES);
> +	return nid;
> +}

I think that BUG_ON should be BUG_ON(nid < 0), since it *shouldn't* be
possible to get through that loop with nid >= MAX_NUMNODES.  The only
line that could set nid >= MAX_NUMNODES is nid = find_next_bit(...); and
the very next line ensures that if nid is in fact >= MAX_NUMNODES it
will be set to -1.  It actually looks pretty redundant altogether,
but...
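
As a sketch, the suggested change to the loop tail (the surrounding code
is from the patch above; only the BUG_ON differs):

	} while (c <= target);
	/* the loop can only exit once c has been incremented past target,
	   which requires find_next_bit() to have returned a valid bit; so
	   nid == -1 here would mean an empty interleave mask */
	BUG_ON(nid < 0);
	return nid;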

More comments to follow...

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 19:48     ` Hugh Dickins
@ 2004-04-08 19:57       ` Rajesh Venkatasubramanian
  0 siblings, 0 replies; 60+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-04-08 19:57 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: mbligh, akpm, andrea, linux-kernel



On Thu, 8 Apr 2004, Hugh Dickins wrote:

> On Thu, 8 Apr 2004, Rajesh Venkatasubramanian wrote:
> >
> > I guess using vm_private_data for nonlinear is not a problem because
> > we use list i_mmap_nonlinear for nonlinear vmas.
> >
> > As you have found out vm_private_data is only used if vm_file != NULL
> > or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert to the
> > i_mmap{_shared} list for such special cases and use prio_tree for
> > others. I may be missing something. Please teach me.
>
> Sorry, I don't understand what you're proposing here, and why?
> Oh, to save 4 bytes of vma by making the special cases use a list,
> no need for vm_set_head, and vm_private_data for the driver itself;
> but regular vmas use prio_tree, with vm_set_head in vm_private_data.

Yeah. You are right.

> Hmm, right now it's getting too much for me: can we keep it simplish
> for now, and come back to this later?

Yeah. This complicates the code further. That's why I didn't touch
it for now. If things settle down and we really worry about the sizeof
vm_area_struct in the future, then we can remove the 8 bytes used
by prio_tree.

Rajesh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 19:20   ` Rajesh Venkatasubramanian
  2004-04-08 19:48     ` Hugh Dickins
@ 2004-04-08 19:52     ` Andrea Arcangeli
  1 sibling, 0 replies; 60+ messages in thread
From: Andrea Arcangeli @ 2004-04-08 19:52 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: Hugh Dickins, mbligh, akpm, linux-kernel

On Thu, Apr 08, 2004 at 03:20:22PM -0400, Rajesh Venkatasubramanian wrote:
> 8 extra bytes for prio_tree. If anon_vma is merged, then I can easily
> point my finger at 12 more bytes added by anon_vma and be happy :)

those 12 more bytes pay off by providing better swapping performance,
handling mremap transparently and making the code scale better, even
allowing us to _trivially_ move the page_table_lock into the vma without
any downside (I mean performance downside; a per-vma lock will cost one
more bit of info in every vma).

There's no way to remove those 12 bytes without falling in the downsides
of anonmm.

I don't see a real need to shrink the size of the prio_tree right now,
though if something is useless and can be removed, that's fine with me.
What I'm trying to say is that pointing the finger at the 12 bytes of
anon_vma doesn't sound like a good argument, since there's no way to
remove them without falling back into the anonmm downsides.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 19:20   ` Rajesh Venkatasubramanian
@ 2004-04-08 19:48     ` Hugh Dickins
  2004-04-08 19:57       ` Rajesh Venkatasubramanian
  2004-04-08 19:52     ` Andrea Arcangeli
  1 sibling, 1 reply; 60+ messages in thread
From: Hugh Dickins @ 2004-04-08 19:48 UTC (permalink / raw)
  To: Rajesh Venkatasubramanian; +Cc: mbligh, akpm, andrea, linux-kernel

On Thu, 8 Apr 2004, Rajesh Venkatasubramanian wrote:
> 
> I guess using vm_private_data for nonlinear is not a problem because
> we use list i_mmap_nonlinear for nonlinear vmas.
> 
> As you have found out vm_private_data is only used if vm_file != NULL
> or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert to the
> i_mmap{_shared} list for such special cases and use prio_tree for
> others. I may be missing something. Please teach me.

Sorry, I don't understand what you're proposing here, and why?
Oh, to save 4 bytes of vma by making the special cases use a list,
no need for vm_set_head, and vm_private_data for the driver itself;
but regular vmas use prio_tree, with vm_set_head in vm_private_data.
Hmm, right now it's getting too much for me: can we keep it simplish
for now, and come back to this later?

Hugh


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 16:15             ` Hugh Dickins
  2004-04-08 17:05               ` Martin J. Bligh
@ 2004-04-08 19:25               ` Andrew Morton
  2004-04-09  2:41                 ` Wim Coekaerts
  1 sibling, 1 reply; 60+ messages in thread
From: Andrew Morton @ 2004-04-08 19:25 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: ak, mbligh, colpatch, linux-kernel

Hugh Dickins <hugh@veritas.com> wrote:
>
> On Wed, 7 Apr 2004, Andrew Morton wrote:
> > 
> > Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
> > to pull those 4 bytes back somehow.
> 
> How significant is this vma size issue?

For some workloads/machines it will simply cause an
approximately-proportional reduction in the size of the workload which we
can handle.

ie: there are some (oracle) workloads where the kernel craps out due to
lowmem vma exhaustion.  If they're now using remap_file_pages() for this then
it may not be a problem any more.  Ingo would know better than I.

> anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> 8 for prio_tree, 12 for anon_vma linkage in vma,
> sometimes another 12 for the anon_vma head itself.
> 
> anonmm objrmap adds just the 8 bytes for prio_tree,
> remaining overhead 28 bytes per mm.
> 
> Seems hard on Andi to begrudge him 4.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
       [not found] ` <1IMik-2is-37@gated-at.bofh.it>
@ 2004-04-08 19:20   ` Rajesh Venkatasubramanian
  2004-04-08 19:48     ` Hugh Dickins
  2004-04-08 19:52     ` Andrea Arcangeli
  0 siblings, 2 replies; 60+ messages in thread
From: Rajesh Venkatasubramanian @ 2004-04-08 19:20 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: mbligh, akpm, andrea, linux-kernel



On Thu, 8 Apr 2004, Hugh Dickins wrote:

> On Thu, 8 Apr 2004, Martin J. Bligh wrote:
> > > On Wed, 7 Apr 2004, Andrew Morton wrote:
> > >>
> > >> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
> > >> to pull those 4 bytes back somehow.
> > >
> > > How significant is this vma size issue?
> > >
> > > anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> > > 8 for prio_tree, 12 for anon_vma linkage in vma,
> > > sometimes another 12 for the anon_vma head itself.
> >
> > Ewwww. Isn't some of that shared most of the time though?
>
> The anon_vma head may well be shared with other vmas of the fork group.
> But the anon_vma linkage is a list_head and a pointer within the vma.
>
> prio_tree is already using a union as much as it can (and a pointer
> where a list_head would simplify the code); Rajesh was thinking of
> reusing vm_private_data for one pointer, but I've gone and used it
> for nonlinear swapout.

I guess using vm_private_data for nonlinear is not a problem because
we use list i_mmap_nonlinear for nonlinear vmas.

As you have found out vm_private_data is only used if vm_file != NULL
or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert to the
i_mmap{_shared} list for such special cases and use prio_tree for
others. I may be missing something. Please teach me.

If anonmm is merged then I plan to seriously consider removing that
8 extra bytes for prio_tree. If anon_vma is merged, then I can easily
point my finger at 12 more bytes added by anon_vma and be happy :)

I still think removing the 8 extra bytes used by prio_tree from
vm_area_struct is possible.

> > > anonmm objrmap adds just the 8 bytes for prio_tree,
> > > remaining overhead 28 bytes per mm.
> >
> > 28 bytes per *mm* is nothing, and I still think the prio_tree is
> > completely unnecessary. Nobody has ever demonstrated a real benchmark
> > that needs it, as far as I recall.
>
> I'm sure an Ingobench will shortly follow that observation.

Yeap. If Andrew hadn't written his rmap-test.c and Ingo hadn't written his
test-mmap3.c, I wouldn't even have considered developing prio_tree.

Thanks,
Rajesh

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08  1:31         ` Andi Kleen
@ 2004-04-08 18:36           ` Matthew Dobson
  0 siblings, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-08 18:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Wed, 2004-04-07 at 18:31, Andi Kleen wrote:
> On Wed, 07 Apr 2004 17:58:23 -0700
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
> 
> > Is there a reason you don't have a case for MPOL_PREFERRED?  You have a
> > comment about it in the function, but you don't check the nodemask isn't
> > empty...
> 
> Empty preferred is a special case. It means DEFAULT.  This is useful
> when you have a process policy != DEFAULT, but want to set a specific
> VMA to default. Normally default in a VMA would mean use process policy.

Ok.. That makes sense.


> > In this function, why do we care what bits the user set past
> > MAX_NUMNODES?  Why shouldn't we just silently ignore the bits like we do
> > in sys_sched_setaffinity?  If a user tries to hand us an 8k bitmask, my
> > opinion is we should just grab as much as we care about (MAX_NUMNODES
> > bits rounded up to the nearest UL).
> 
> This is to catch uninitialized bits. Otherwise it could work on a kernel
> with small MAX_NUMNODES, and then suddenly fail on a kernel with bigger
> MAX_NUMNODES when a node isn't online.

I am of the opinion that we should allow currently offline nodes in the
user's mask.  Those nodes may come online later on, and we should
respect the user's request to allocate from those nodes if possible. 
Just like in sched_setaffinity() we take in the user's mask, and when we
actually use the mask to make a decision, we check it against
cpu_online_map.  Just because a node isn't online at the time of the
mbind() call doesn't mean it won't be soon.  Besides, we should be
checking against node_online_map anyway, because nodes could go away. 
Well, maybe not right now, but in the near future.  Hot-pluggable memory
is a reality, even if we don't support it just yet.
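
A sketch of what a use-time check could look like, mirroring how
sched_setaffinity() defers to cpu_online_map (illustrative only, not
from the posted patches; it reuses the node_online_map bitmap and
numa_node_id() that appear elsewhere in this thread):

	/* validate at allocation time, not at mbind() time */
	nid = find_first_bit(pol->v.nodes, MAX_NUMNODES);
	if (nid >= MAX_NUMNODES || !test_bit(nid, node_online_map))
		nid = numa_node_id();	/* node offline or gone: use local */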


> > This seems a bit strange to me.  Instead of just allocating a whole
> > struct zonelist, you're allocating part of one?  I guess it's safe,
> > since the array is meant to be NULL terminated, but we should put a note
> > in any code using these zonelists that they *aren't* regular zonelists,
> > they will be smaller, and dereferencing arbitrary array elements in the
> > struct could be dangerous.  I think we'd be better off creating a
> > kmem_cache_t for these and using *whole* zonelist structures. 
> > Allocating part of a well-defined structure makes me a bit nervous...
> 
> And that after all the whining about sharing policies? ;-) (a BIND policy will
> always carry a zonelist). As far as I can see all existing zonelist code
> just walks it until NULL.
> 
> I would not be opposed to always using a full one, but it would use considerably
> more memory in many cases.

I'm not whining about sharing policies because of the space usage,
although that is a small side issue.  I'm whining about sharing policies
because it just makes sense.  You've got a data structure that is always
dynamically allocated and referenced by pointers, that has no instance
specific data in it, and that *already has* an atomic reference counter
in it.  And you decided not to share this data structure?!  In my
opinion, it's harder and more code to *not* share it...  Instead of
copying the structure in mpol_copy(), just atomic_inc(policy->refcnt)
and we're pretty much done.  You already do an atomic_dec_and_test() in
mpol_free()...
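
As a sketch, the shared version could collapse to little more than this
(hypothetical; it deliberately ignores the locking question Andi raises
in his reply):

	/* share instead of duplicate: just take another reference */
	static inline struct mempolicy *mpol_copy(struct mempolicy *pol)
	{
		if (pol)
			atomic_inc(&pol->refcnt);
		return pol;
	}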

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 17:05               ` Martin J. Bligh
@ 2004-04-08 18:16                 ` Hugh Dickins
  0 siblings, 0 replies; 60+ messages in thread
From: Hugh Dickins @ 2004-04-08 18:16 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andrew Morton, Andi Kleen, colpatch, linux-kernel

On Thu, 8 Apr 2004, Martin J. Bligh wrote:
> > On Wed, 7 Apr 2004, Andrew Morton wrote:
> >> 
> >> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
> >> to pull those 4 bytes back somehow.
> > 
> > How significant is this vma size issue?
> > 
> > anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> > 8 for prio_tree, 12 for anon_vma linkage in vma,
> > sometimes another 12 for the anon_vma head itself.
> 
> Ewwww. Isn't some of that shared most of the time though?

The anon_vma head may well be shared with other vmas of the fork group.
But the anon_vma linkage is a list_head and a pointer within the vma.

prio_tree is already using a union as much as it can (and a pointer
where a list_head would simplify the code); Rajesh was thinking of
reusing vm_private_data for one pointer, but I've gone and used it
for nonlinear swapout.

> > anonmm objrmap adds just the 8 bytes for prio_tree,
> > remaining overhead 28 bytes per mm.
> 
> 28 bytes per *mm* is nothing, and I still think the prio_tree is 
> completely unnecessary. Nobody has ever demonstrated a real benchmark
> that needs it, as far as I recall.

I'm sure an Ingobench will shortly follow that observation.

Hugh


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08 16:15             ` Hugh Dickins
@ 2004-04-08 17:05               ` Martin J. Bligh
  2004-04-08 18:16                 ` Hugh Dickins
  2004-04-08 19:25               ` Andrew Morton
  1 sibling, 1 reply; 60+ messages in thread
From: Martin J. Bligh @ 2004-04-08 17:05 UTC (permalink / raw)
  To: Hugh Dickins, Andrew Morton; +Cc: Andi Kleen, colpatch, linux-kernel

> On Wed, 7 Apr 2004, Andrew Morton wrote:
>> 
>> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
>> to pull those 4 bytes back somehow.
> 
> How significant is this vma size issue?
> 
> anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> 8 for prio_tree, 12 for anon_vma linkage in vma,
> sometimes another 12 for the anon_vma head itself.

Ewwww. Isn't some of that shared most of the time though?

> anonmm objrmap adds just the 8 bytes for prio_tree,
> remaining overhead 28 bytes per mm.

28 bytes per *mm* is nothing, and I still think the prio_tree is 
completely unnecessary. Nobody has ever demonstrated a real benchmark
that needs it, as far as I recall.

> Seems hard on Andi to begrudge him 4.

I don't care about the 4 bytes much (other than that the current 64 happens
to be a nice size). I just don't see the point in making copies of the 
binding structure all the time ;-) Refcounts aren't that hard ... didn't
Greg do a kref just recently? ...

M.



^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 23:56           ` Andrew Morton
  2004-04-08  0:14             ` Andi Kleen
@ 2004-04-08 16:15             ` Hugh Dickins
  2004-04-08 17:05               ` Martin J. Bligh
  2004-04-08 19:25               ` Andrew Morton
  1 sibling, 2 replies; 60+ messages in thread
From: Hugh Dickins @ 2004-04-08 16:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Andi Kleen, mbligh, colpatch, linux-kernel

On Wed, 7 Apr 2004, Andrew Morton wrote:
> 
> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
> to pull those 4 bytes back somehow.

How significant is this vma size issue?

anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
8 for prio_tree, 12 for anon_vma linkage in vma,
sometimes another 12 for the anon_vma head itself.

anonmm objrmap adds just the 8 bytes for prio_tree,
remaining overhead 28 bytes per mm.

Seems hard on Andi to begrudge him 4.

Hugh


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08  0:58       ` Matthew Dobson
@ 2004-04-08  1:31         ` Andi Kleen
  2004-04-08 18:36           ` Matthew Dobson
  0 siblings, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2004-04-08  1:31 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, akpm, mbligh

On Wed, 07 Apr 2004 17:58:23 -0700
Matthew Dobson <colpatch@us.ibm.com> wrote:


> Is there a reason you don't have a case for MPOL_PREFERRED?  You have a
> comment about it in the function, but you don't check the nodemask isn't
> empty...

Empty preferred is a special case. It means DEFAULT.  This is useful
when you have a process policy != DEFAULT, but want to set a specific
VMA to default. Normally default in a VMA would mean use process policy.
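
For illustration, resetting one VMA to default from user space might
look like this (a sketch, assuming a user-space mbind() wrapper such as
the one in the numactl/libnuma package; addr and len are whatever range
the VMA covers):

	unsigned long empty = 0;
	/* empty nodemask + MPOL_PREFERRED: this VMA reverts to default
	   behaviour even under a non-default process policy */
	mbind(addr, len, MPOL_PREFERRED, &empty, 8, 0);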


> In this function, why do we care what bits the user set past
> MAX_NUMNODES?  Why shouldn't we just silently ignore the bits like we do
> in sys_sched_setaffinity?  If a user tries to hand us an 8k bitmask, my
> opinion is we should just grab as much as we care about (MAX_NUMNODES
> bits rounded up to the nearest UL).

This is to catch uninitialized bits. Otherwise it could work on a kernel
with small MAX_NUMNODES, and then suddenly fail on a kernel with bigger
MAX_NUMNODES when a node isn't online.
 

> This seems a bit strange to me.  Instead of just allocating a whole
> struct zonelist, you're allocating part of one?  I guess it's safe,
> since the array is meant to be NULL terminated, but we should put a note
> in any code using these zonelists that they *aren't* regular zonelists,
> they will be smaller, and dereferencing arbitrary array elements in the
> struct could be dangerous.  I think we'd be better off creating a
> kmem_cache_t for these and using *whole* zonelist structures. 
> Allocating part of a well-defined structure makes me a bit nervous...

And that after all the whining about sharing policies? ;-) (a BIND policy will
always carry a zonelist). As far as I can see all existing zonelist code
just walks it until NULL.

I would not be opposed to always using a full one, but it would use considerably
more memory in many cases.


> I'm guessing this is why you aren't checking MPOL_PREFERRED in
> check_policy()?  So the user can call mbind() with MPOL_PREFERRED and an
> empty nodes bitmap and get the default behavior you mentioned in the
> comments?

Yep.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:45     ` Andi Kleen
  2004-04-07 22:19       ` Matthew Dobson
@ 2004-04-08  0:58       ` Matthew Dobson
  2004-04-08  1:31         ` Andi Kleen
  2004-04-09  1:09       ` Matthew Dobson
  2 siblings, 1 reply; 60+ messages in thread
From: Matthew Dobson @ 2004-04-08  0:58 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:

> diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
> --- linux-2.6.5-mc2-numa/mm/mempolicy.c-o	2004-04-07 12:07:41.000000000 +0200
> +++ linux-2.6.5-mc2-numa/mm/mempolicy.c	2004-04-07 13:07:02.000000000 +0200

<snip>

> +/* Do sanity checking on a policy */
> +static int check_policy(int mode, unsigned long *nodes)
> +{
> +	int empty = bitmap_empty(nodes, MAX_NUMNODES);
> +	switch (mode) {
> +	case MPOL_DEFAULT:
> +		if (!empty)
> +			return -EINVAL;
> +		break;
> +	case MPOL_BIND:
> +	case MPOL_INTERLEAVE:
> +		/* Preferred will only use the first bit, but allow
> +		   more for now. */
> +		if (empty)
> +			return -EINVAL;
> +		break;
> +	}
> +	return check_online(nodes);
> +}

Is there a reason you don't have a case for MPOL_PREFERRED?  You have a
comment about it in the function, but you don't check the nodemask isn't
empty...

> +/* Copy a node mask from user space. */
> +static int get_nodes(unsigned long *nodes, unsigned long *nmask,
> +		     unsigned long maxnode, int mode)
> +{
> +	unsigned long k;
> +	unsigned long nlongs;
> +	unsigned long endmask;
> +
> +	--maxnode;
> +	nlongs = BITS_TO_LONGS(maxnode);
> +	if ((maxnode % BITS_PER_LONG) == 0)
> +		endmask = ~0UL;
> +	else
> +		endmask = (1UL << (maxnode % BITS_PER_LONG)) - 1;
> +
> +	/* When the user specified more nodes than supported just check
> +	   if the non supported part is all zero. */
> +	if (nmask && nlongs > BITS_TO_LONGS(MAX_NUMNODES)) {
> +		for (k = BITS_TO_LONGS(MAX_NUMNODES); k < nlongs; k++) {
> +			unsigned long t;
> +			if (get_user(t,  nmask + k))
> +				return -EFAULT;
> +			if (k == nlongs - 1) {
> +				if (t & endmask)
> +					return -EINVAL;
> +			} else if (t)
> +				return -EINVAL;
> +		}
> +		nlongs = BITS_TO_LONGS(MAX_NUMNODES);
> +		endmask = ~0UL;
> +	}
> +
> +	bitmap_clear(nodes, MAX_NUMNODES);
> +	if (nmask && copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
> +		return -EFAULT;
> +	nodes[nlongs-1] &= endmask;
> +	return check_policy(mode, nodes);
> +}

In this function, why do we care what bits the user set past
MAX_NUMNODES?  Why shouldn't we just silently ignore the bits like we do
in sys_sched_setaffinity?  If a user tries to hand us an 8k bitmask, my
opinion is we should just grab as much as we care about (MAX_NUMNODES
bits rounded up to the nearest UL).

> +/* Generate a custom zonelist for the BIND policy. */
> +static struct zonelist *bind_zonelist(unsigned long *nodes)
> +{
> +	struct zonelist *zl;
> +	int num, max, nd;
> +
> +	max = 1 + MAX_NR_ZONES * bitmap_weight(nodes, MAX_NUMNODES);
> +	zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
> +	if (!zl)
> +		return NULL;
> +	num = 0;
> +	for (nd = find_first_bit(nodes, MAX_NUMNODES);
> +	     nd < MAX_NUMNODES;
> +	     nd = find_next_bit(nodes, MAX_NUMNODES, 1+nd)) {
> +		int k;
> +		for (k = MAX_NR_ZONES-1; k >= 0; k--) {
> +			struct zone *z = &NODE_DATA(nd)->node_zones[k];
> +			if (!z->present_pages)
> +				continue;
> +			zl->zones[num++] = z;
> +			if (k > policy_zone)
> +				policy_zone = k;
> +		}
> +	}
> +	BUG_ON(num >= max);
> +	zl->zones[num] = NULL;
> +	return zl;
> +}

This seems a bit strange to me.  Instead of just allocating a whole
struct zonelist, you're allocating part of one?  I guess it's safe,
since the array is meant to be NULL terminated, but we should put a note
in any code using these zonelists that they *aren't* regular zonelists,
they will be smaller, and dereferencing arbitrary array elements in the
struct could be dangerous.  I think we'd be better off creating a
kmem_cache_t for these and using *whole* zonelist structures. 
Allocating part of a well-defined structure makes me a bit nervous...

> +/* Create a new policy */
> +static struct mempolicy *new_policy(int mode, unsigned long *nodes)
> +{
> +	struct mempolicy *policy;
> +	PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes[0]);
> +	if (mode == MPOL_DEFAULT)
> +		return NULL;
> +	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
> +	if (!policy)
> +		return ERR_PTR(-ENOMEM);
> +	atomic_set(&policy->refcnt, 1);
> +	switch (mode) {
> +	case MPOL_INTERLEAVE:
> +		bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
> +		break;
> +	case MPOL_PREFERRED:
> +		policy->v.preferred_node = find_first_bit(nodes, MAX_NUMNODES);
> +		if (policy->v.preferred_node >= MAX_NUMNODES)
> +			policy->v.preferred_node = -1;
> +		break;
> +	case MPOL_BIND:
> +		policy->v.zonelist = bind_zonelist(nodes);
> +		if (policy->v.zonelist == NULL) {
> +			kmem_cache_free(policy_cache, policy);
> +			return ERR_PTR(-ENOMEM);
> +		}
> +		break;
> +	}
> +	policy->policy = mode;
> +	return policy;
> +}

I'm guessing this is why you aren't checking MPOL_PREFERRED in
check_policy()?  So the user can call mbind() with MPOL_PREFERRED and an
empty nodes bitmap and get the default behavior you mentioned in the
comments?

I've got to get some dinner now...  I'll keep reading and send more
comments as I come up with them.

Cheers!

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08  0:26               ` Andrea Arcangeli
@ 2004-04-08  0:51                 ` Andi Kleen
  0 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-08  0:51 UTC (permalink / raw)
  To: Andrea Arcangeli; +Cc: akpm, mbligh, colpatch, linux-kernel

On Thu, 8 Apr 2004 02:26:26 +0200
Andrea Arcangeli <andrea@suse.de> wrote:

> On Thu, Apr 08, 2004 at 02:14:48AM +0200, Andi Kleen wrote:
> > Eliminate the RB color field or use rb_next() instead of vm_next. First 
> > alternative is cheaper.
> 
> with eliminate I assume you mean to reuse a bit in the vma for that
> (like vma->vm_flags); the color bit info is needed somewhere to make the
> rebalancing in mmap quick while still guaranteeing the max height <= 2 *
> min height.

Yes. Put it either into flags or into the low order bits of a pointer.
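
A sketch of the pointer variant (hypothetical, not in any posted patch):
rb_node pointers are at least 4-byte aligned, so bit 0 of the parent
pointer is free to hold the color.

	struct rb_node {
		unsigned long rb_parent_color;	/* parent pointer | color */
		struct rb_node *rb_right;
		struct rb_node *rb_left;
	};
	#define RB_RED		0
	#define RB_BLACK	1
	#define rb_parent(r)	((struct rb_node *)((r)->rb_parent_color & ~1UL))
	#define rb_color(r)	((r)->rb_parent_color & 1)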

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-08  0:14             ` Andi Kleen
@ 2004-04-08  0:26               ` Andrea Arcangeli
  2004-04-08  0:51                 ` Andi Kleen
  0 siblings, 1 reply; 60+ messages in thread
From: Andrea Arcangeli @ 2004-04-08  0:26 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, mbligh, colpatch, linux-kernel

On Thu, Apr 08, 2004 at 02:14:48AM +0200, Andi Kleen wrote:
> Eliminate the RB color field or use rb_next() instead of vm_next. First 
> alternative is cheaper.

with eliminate I assume you mean to reuse a bit in the vma for that
(like vma->vm_flags); the color bit info is needed somewhere to make the
rebalancing in mmap quick while still guaranteeing the max height <= 2 *
min height.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 23:35         ` Andi Kleen
  2004-04-07 23:56           ` Andrew Morton
@ 2004-04-08  0:22           ` Andrea Arcangeli
  1 sibling, 0 replies; 60+ messages in thread
From: Andrea Arcangeli @ 2004-04-08  0:22 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Andrew Morton, mbligh, colpatch, linux-kernel

On Thu, Apr 08, 2004 at 01:35:22AM +0200, Andi Kleen wrote:
> NUMA API adds a new pointer, but all sharing in the world couldn't fix that.

that's fine. As we already discussed from my part I only care that the
4bytes for the pointer go away if you compile with CONFIG_NUMA=n, it
doesn't really matter much but it'd like if it would be optimized for
the long term.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 23:56           ` Andrew Morton
@ 2004-04-08  0:14             ` Andi Kleen
  2004-04-08  0:26               ` Andrea Arcangeli
  2004-04-08 16:15             ` Hugh Dickins
  1 sibling, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2004-04-08  0:14 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, colpatch, linux-kernel

On Wed, 7 Apr 2004 16:56:39 -0700
Andrew Morton <akpm@osdl.org> wrote:
 
> If you _do_ use the feature, what is the overhead?  12 bytes for each and
> every vma?  Or just for the vma's which have a non-default policy?

Just for VMAs with a non-default policy. The default policy is always NULL.

> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
> to pull those 4 bytes back somehow.

Eliminate the RB color field or use rb_next() instead of vm_next. First 
alternative is cheaper.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 23:35         ` Andi Kleen
@ 2004-04-07 23:56           ` Andrew Morton
  2004-04-08  0:14             ` Andi Kleen
  2004-04-08 16:15             ` Hugh Dickins
  2004-04-08  0:22           ` Andrea Arcangeli
  1 sibling, 2 replies; 60+ messages in thread
From: Andrew Morton @ 2004-04-07 23:56 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mbligh, colpatch, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> On Wed, 7 Apr 2004 15:52:25 -0700
> Andrew Morton <akpm@osdl.org> wrote:
> 
> > Andi Kleen <ak@suse.de> wrote:
> > >
> > > We can discuss changes when someone shows numbers that additional 
> > > optimizations are needed. I haven't seen such numbers and I'm not convinced
> > > sharing is even a good idea from a design standpoint.  For the first version 
> > I just aimed to get something working with straightforward code.
> > > 
> > > To put it all in perspective: a policy is 12 bytes on a 32bit machine
> > > (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> > > (with MAX_NUMNODES <= 64)
> > 
> > sizeof(vm_area_struct) is a very sensitive thing on ia32.  If you expect
> > that anyone is likely to actually use the numa API on 32-bit, sharing
> > will be important.
> 
> I don't really believe that.

You better.  VMA space exhaustion is one of the reasons for introducing
remap_file_pages().  It's an oracle-killer.  Like everything else ;)

> If it were that way someone would have already
> done all the obvious space optimizations left on the table...
> (like using rb_next or merging the rb color into flags)

Nope, we're slack.

> NUMA API adds a new pointer, but all sharing in the world couldn't fix that.

> When you set a policy != default you will also pay the 12 or 16 bytes of
> overhead for the object for each "policy region".

OK, that's not so bad.  So if you don't use the feature the overhead is 4
bytes/vma.

If you _do_ use the feature, what is the overhead?  12 bytes for each and
every vma?  Or just for the vma's which have a non-default policy?

Your patch takes the CONFIG_NUMA vma from 64 bytes to 68.  It would be nice
to pull those 4 bytes back somehow.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:52       ` Andrew Morton
  2004-04-07 23:09         ` Martin J. Bligh
@ 2004-04-07 23:35         ` Andi Kleen
  2004-04-07 23:56           ` Andrew Morton
  2004-04-08  0:22           ` Andrea Arcangeli
  1 sibling, 2 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 23:35 UTC (permalink / raw)
  To: Andrew Morton; +Cc: mbligh, colpatch, linux-kernel

On Wed, 7 Apr 2004 15:52:25 -0700
Andrew Morton <akpm@osdl.org> wrote:

> Andi Kleen <ak@suse.de> wrote:
> >
> > We can discuss changes when someone shows numbers that additional 
> > optimizations are needed. I haven't seen such numbers and I'm not convinced
> > sharing is even a good idea from a design standpoint.  For the first version 
> > I just aimed to get something working with straight forward code.
> > 
> > To put it all in perspective: a policy is 12 bytes on a 32bit machine
> > (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> > (with MAX_NUMNODES <= 64)
> 
> sizeof(vm_area_struct) is a very sensitive thing on ia32.  If you expect
> that anyone is likely to actually use the numa API on 32-bit, sharing
> will be important.

I don't really believe that. If it were that way someone would have already
done all the obvious space optimizations left on the table...
(like using rb_next or merging the rb color into flags)

NUMA API adds a new pointer, but all sharing in the world couldn't fix that.

When you set a policy != default you will also pay the 12 or 16 bytes of
overhead for the object for each "policy region".

> It should be useful for SMT, yes?

Nope. Only for real NUMA.

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:52       ` Andrew Morton
@ 2004-04-07 23:09         ` Martin J. Bligh
  2004-04-07 23:35         ` Andi Kleen
  1 sibling, 0 replies; 60+ messages in thread
From: Martin J. Bligh @ 2004-04-07 23:09 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen; +Cc: colpatch, linux-kernel

> sizeof(vm_area_struct) is a very sensitive thing on ia32.  If you expect
> that anyone is likely to actually use the numa API on 32-bit, sharing

Me please ;-)

M.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:38     ` Andi Kleen
@ 2004-04-07 22:52       ` Andrew Morton
  2004-04-07 23:09         ` Martin J. Bligh
  2004-04-07 23:35         ` Andi Kleen
  0 siblings, 2 replies; 60+ messages in thread
From: Andrew Morton @ 2004-04-07 22:52 UTC (permalink / raw)
  To: Andi Kleen; +Cc: mbligh, colpatch, linux-kernel

Andi Kleen <ak@suse.de> wrote:
>
> We can discuss changes when someone shows numbers that additional 
> optimizations are needed. I haven't seen such numbers and I'm not convinced
> sharing is even a good idea from a design standpoint.  For the first version 
> I just aimed to get something working with straight forward code.
> 
> To put it all in perspective: a policy is 12 bytes on a 32bit machine
> (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> (with MAX_NUMNODES <= 64)

sizeof(vm_area_struct) is a very sensitive thing on ia32.  If you expect
that anyone is likely to actually use the numa API on 32-bit, sharing
will be important.

It should be useful for SMT, yes?


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:16   ` Andi Kleen
  2004-04-07 22:34     ` Andrew Morton
@ 2004-04-07 22:39     ` Martin J. Bligh
  2004-04-07 22:33       ` Andi Kleen
  1 sibling, 1 reply; 60+ messages in thread
From: Martin J. Bligh @ 2004-04-07 22:39 UTC (permalink / raw)
  To: Andi Kleen, Andrew Morton; +Cc: colpatch, linux-kernel

>> ppc64+CONFIG_NUMA compiles OK.
> 
> ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
it would be for these boxes anyway (afaik they are pretty uniform).

They actually are fairly keen on doing NUMA for HPC stuff - it makes
a significant performance improvement ...

M.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:51 ` Andrew Morton
  2004-04-07 22:16   ` Andi Kleen
@ 2004-04-07 22:38   ` Martin J. Bligh
  2004-04-07 22:38     ` Andi Kleen
  1 sibling, 1 reply; 60+ messages in thread
From: Martin J. Bligh @ 2004-04-07 22:38 UTC (permalink / raw)
  To: Andrew Morton, colpatch; +Cc: linux-kernel, ak

--On Wednesday, April 07, 2004 14:51:30 -0700 Andrew Morton <akpm@osdl.org> wrote:

> Matthew Dobson <colpatch@us.ibm.com> wrote:
>> 
>> Just from the patches you posted, I would really disagree that these are
>> ready for merging into -mm.
> 
> I have them all merged up here.  I made a number of small changes -
> additional CONFIG_NUMA ifdefs, whitespace improvements, and removal of an
> unneeded arch_hugetlb_fault() implementation.  The core patch created two
> copies of the same file in mempolicy.h; I added a compile fix in mmap.c
> and a few other things.

I think there are some design issues that still aren't resolved - we've
been over this a bit before, but I still don't think they're fixed.
It seems you're still making a copy of the binding structure for every
VMA, which seems ... extravagant. Can we share them? IIRC, the only 
justification was the striping ... and I thought we agreed that was
better fixed by using the mod of the offset as a decider?

Maybe I'm just misreading your code, in which case, feel free to spit at me ;-)

M.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:38   ` Martin J. Bligh
@ 2004-04-07 22:38     ` Andi Kleen
  2004-04-07 22:52       ` Andrew Morton
  0 siblings, 1 reply; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 22:38 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: akpm, colpatch, linux-kernel

On Wed, 07 Apr 2004 15:38:24 -0700
"Martin J. Bligh" <mbligh@aracnet.com> wrote:

> I think there are some design issues that still aren't resolved - we've
> been over this a bit before, but I still don't think they're fixed.
> It seems you're still making a copy of the binding structure for every
> VMA, which seems ... extravagant. Can we share them? IIRC, the only 

Sharing is only an optimization that adds more code and potential 
for more bugs (hash tables, locking etc.). 

We can discuss changes when someone shows numbers that additional 
optimizations are needed. I haven't seen such numbers and I'm not convinced
sharing is even a good idea from a design standpoint.  For the first version 
I just aimed to get something working with straightforward code.

To put it all in perspective: a policy is 12 bytes on a 32bit machine
(assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
(with MAX_NUMNODES <= 64)
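
Those numbers are consistent with a layout like the following (a sketch
reconstructed from the fields visible in the patches in this thread; the
exact type of the mode field is a guess made so the arithmetic works):

	struct mempolicy {
		atomic_t refcnt;	/* 4 bytes */
		short policy;		/* MPOL_* mode, padded to 4 */
		union {
			struct zonelist *zonelist;		/* bind */
			short preferred_node;			/* preferred */
			DECLARE_BITMAP(nodes, MAX_NUMNODES);	/* interleave */
		} v;	/* 4 bytes (MAX_NUMNODES <= 32, 32bit), 8 on 64bit */
	};	/* 4 + 4 + 4 = 12, or 4 + 4 + 8 = 16 */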

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:16   ` Andi Kleen
@ 2004-04-07 22:34     ` Andrew Morton
  2004-04-07 22:39     ` Martin J. Bligh
  1 sibling, 0 replies; 60+ messages in thread
From: Andrew Morton @ 2004-04-07 22:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: colpatch, linux-kernel, mbligh

Andi Kleen <ak@suse.de> wrote:
>
> What was the problem in mmap.c ? I compiled in various combinations (with
> and without NUMA on i386 and x86-64) and it worked.



mm/mmap.c: In function `copy_vma':
mm/mmap.c:1531: structure has no member named `vm_policy'

--- 25/mm/mmap.c~numa-api-vma-policy-hooks-fix	Wed Apr  7 12:28:53 2004
+++ 25-akpm/mm/mmap.c	Wed Apr  7 12:29:09 2004
@@ -1528,7 +1528,7 @@ struct vm_area_struct *copy_vma(struct v
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, rb_parent, addr, addr + len,
-			vma->vm_flags, vma->vm_file, pgoff, vma->vm_policy);
+			vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
 	if (!new_vma) {
 		new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 		if (new_vma) {

_

> And why was arch_hugetlb_fault() unneeded?

Well we do:

	if (condition-which-evaluates-to-constant-zero-on-CONFIG_HUGETLB=n)
		arch_hugetlb_fault();

so the compiler will never emit a call to the stub function anyway.
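
Concretely, that might look like this (a sketch; is_vm_hugetlb_page()
evaluates to constant 0 when CONFIG_HUGETLB_PAGE=n, and the argument
list of arch_hugetlb_fault(), which comes from the patches in this
thread, is guessed here):

	if (is_vm_hugetlb_page(vma))	/* constant 0 without hugetlb */
		return arch_hugetlb_fault(mm, vma, address, write_access);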

But it turns out the stub is needed for now, because !X86 doesn't implement
arch_hugetlb_fault().  So I put it back.

> > It builds OK for NUMAQ, although NUMAQ does have a problem:
> > 
> > drivers/built-in.o: In function `acpi_pci_root_add':
> > drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'
> > 
> > ppc64+CONFIG_NUMA compiles OK.
> 
> ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
> it would be for these boxes anyway (afaik they are pretty uniform).

Well, it's best that the kernel still actually compiles...


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 22:39     ` Martin J. Bligh
@ 2004-04-07 22:33       ` Andi Kleen
  0 siblings, 0 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 22:33 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: akpm, colpatch, linux-kernel

On Wed, 07 Apr 2004 15:39:17 -0700
"Martin J. Bligh" <mbligh@aracnet.com> wrote:

> >> ppc64+CONFIG_NUMA compiles OK.
> > 
> > ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
> > it would be for these boxes anyways (afaik they are pretty uniform) 
> 
> They actually are fairly keen on doing NUMA for HPC stuff - it makes
> a significant performance improvement ...

Ok. Should be straightforward to add the system calls for them if they're
interested.

The hugetlb pages may need some work though, but that's not essential for
the beginning.

The new calls are all emulation-clean too; you can just hand them through.
[I didn't add that yet, but it would be pretty simple.]

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:45     ` Andi Kleen
@ 2004-04-07 22:19       ` Matthew Dobson
  2004-04-08  0:58       ` Matthew Dobson
  2004-04-09  1:09       ` Matthew Dobson
  2 siblings, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-07 22:19 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:
> On Wed, 07 Apr 2004 14:41:02 -0700
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
> > On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> > > On Wed, 07 Apr 2004 14:24:19 -0700
> > > Matthew Dobson <colpatch@us.ibm.com> wrote:
> > > 
> > > > 	I must be missing something here, but did you not include mempolicy.h
> > > > and policy.c in these patches?  I can't seem to find them anywhere?!? 
> > > > It's really hard to evaluate your patches if the core of them is
> > > > missing!
> > > 
> > > It was in the core patch and also in the last patch I sent Andrew.
> > > See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
> > 
> > Ok.. I'll check that link, but what you posted didn't have the files
> > (mempolicy.h & policy.c) in the patch:
> 
> Indeed. Must have gone missing. Here are the files for reference.
> 
> The full current broken out patchkit is in 
> ftp.suse.com:/pub/people/ak/numa/2.6.5mc2/

Server isn't taking connections right now.  At least for me... :(

Your patch still looks broken.  It includes some files twice:

> diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
> --- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o	2004-04-07 12:07:18.000000000 +0200
> +++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h	2004-04-07 12:07:13.000000000 +0200
> @@ -0,0 +1,219 @@

<snip>

> diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
> --- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o	2004-04-07 12:07:18.000000000 +0200
> +++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h	2004-04-07 12:07:13.000000000 +0200
> @@ -0,0 +1,219 @@

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:51 ` Andrew Morton
@ 2004-04-07 22:16   ` Andi Kleen
  2004-04-07 22:34     ` Andrew Morton
  2004-04-07 22:39     ` Martin J. Bligh
  2004-04-07 22:38   ` Martin J. Bligh
  1 sibling, 2 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 22:16 UTC (permalink / raw)
  To: Andrew Morton; +Cc: colpatch, linux-kernel, mbligh

On Wed, 7 Apr 2004 14:51:30 -0700
Andrew Morton <akpm@osdl.org> wrote:

> Matthew Dobson <colpatch@us.ibm.com> wrote:
> >
> > Just from the patches you posted, I would really disagree that these are
> > ready for merging into -mm.
> 
> I have them all merged up here.  I made a number of small changes -
> additional CONFIG_NUMA ifdefs, whitespace improvements, and removal of an
> unneeded arch_hugetlb_fault() implementation.  The core patch created two
> copies of the same file in mempolicy.h; I added a compile fix in mmap.c
> and a few other things.

Sorry about the bad patches. I will try to be more careful in the future.

What was the problem in mmap.c ? I compiled in various combinations (with
and without NUMA on i386 and x86-64) and it worked.

And why was arch_hugetlb_fault() unneeded?

> It builds OK for NUMAQ, although NUMAQ does have a problem:
> 
> drivers/built-in.o: In function `acpi_pci_root_add':
> drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'
> 
> ppc64+CONFIG_NUMA compiles OK.

ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
it would be for these boxes anyway (afaik they are pretty uniform).

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:24 Matthew Dobson
  2004-04-07 21:27 ` Andi Kleen
  2004-04-07 21:35 ` Matthew Dobson
@ 2004-04-07 21:51 ` Andrew Morton
  2004-04-07 22:16   ` Andi Kleen
  2004-04-07 22:38   ` Martin J. Bligh
  2 siblings, 2 replies; 60+ messages in thread
From: Andrew Morton @ 2004-04-07 21:51 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, ak, mbligh

Matthew Dobson <colpatch@us.ibm.com> wrote:
>
> Just from the patches you posted, I would really disagree that these are
> ready for merging into -mm.

I have them all merged up here.  I made a number of small changes -
additional CONFIG_NUMA ifdefs, whitespace improvements, and removal of an
unneeded arch_hugetlb_fault() implementation.  The core patch created two
copies of the same file in mempolicy.h; I added a compile fix in mmap.c
and a few other things.

It builds OK for NUMAQ, although NUMAQ does have a problem:

drivers/built-in.o: In function `acpi_pci_root_add':
drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'

ppc64+CONFIG_NUMA compiles OK.


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:41   ` Matthew Dobson
@ 2004-04-07 21:45     ` Andi Kleen
  2004-04-07 22:19       ` Matthew Dobson
                         ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 21:45 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, akpm, mbligh

On Wed, 07 Apr 2004 14:41:02 -0700
Matthew Dobson <colpatch@us.ibm.com> wrote:

> On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> > On Wed, 07 Apr 2004 14:24:19 -0700
> > Matthew Dobson <colpatch@us.ibm.com> wrote:
> > 
> > > 	I must be missing something here, but did you not include mempolicy.h
> > > and policy.c in these patches?  I can't seem to find them anywhere?!? 
> > > It's really hard to evaluate your patches if the core of them is
> > > missing!
> > 
> > It was in the core patch and also in the last patch I sent Andrew.
> > See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
> 
> Ok.. I'll check that link, but what you posted didn't have the files
> (mempolicy.h & policy.c) in the patch:

Indeed. Must have gone missing. Here are the files for reference.

The full current broken out patchkit is in 
ftp.suse.com:/pub/people/ak/numa/2.6.5mc2/

Core NUMA API code

This is the core NUMA API code. This includes NUMA policy aware 
wrappers for get_free_pages and alloc_page_vma(). On non NUMA kernels
these are defined away.

The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.

Adds a vm_policy field to the VMA and to the process. The process
also has field for interleaving. VMA interleaving uses the offset
into the VMA, but that's not possible for process allocations.
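
For illustration only (not part of the patch), roughly what the new
interface looks like from user space. This is a sketch: it assumes the
syscall numbers are visible as __NR_set_mempolicy/__NR_mbind through
<sys/syscall.h> on a patched tree; real programs would normally go
through a library wrapper rather than raw syscall(2).

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>

    #define MPOL_BIND        2      /* values from mempolicy.h */
    #define MPOL_INTERLEAVE  3

    int main(void)
    {
            unsigned long nodes = 0x3;      /* nodes 0 and 1 */
            size_t len = 16 * getpagesize();
            char *p;

            /* Interleave all further allocations of this process
               over nodes 0 and 1. */
            if (syscall(__NR_set_mempolicy, MPOL_INTERLEAVE,
                        &nodes, sizeof(nodes) * 8) < 0)
                    perror("set_mempolicy");

            p = mmap(NULL, len, PROT_READ|PROT_WRITE,
                     MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;

            /* Bind just this one mapping to node 0, no fallback. */
            nodes = 0x1;
            if (syscall(__NR_mbind, p, len, MPOL_BIND,
                        &nodes, sizeof(nodes) * 8, 0) < 0)
                    perror("mbind");

            memset(p, 0, len);      /* faults now allocate on node 0 */
            return 0;
    }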

diff -u linux-2.6.5-mc2-numa/include/linux/gfp.h-o linux-2.6.5-mc2-numa/include/linux/gfp.h
--- linux-2.6.5-mc2-numa/include/linux/gfp.h-o	2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/gfp.h	2004-04-07 11:45:42.000000000 +0200
@@ -4,6 +4,8 @@
 #include <linux/mmzone.h>
 #include <linux/stddef.h>
 #include <linux/linkage.h>
+#include <linux/config.h>
+
 /*
  * GFP bitmasks..
  */
@@ -73,10 +75,29 @@
 	return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
 }
 
+extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+	if (unlikely(order >= MAX_ORDER))
+		return NULL;
+
+	return alloc_pages_current(gfp_mask, order);
+}
+extern struct page *__alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long off);
+
+extern struct page *alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long addr);
+#else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page(gfp_mask) \
-		alloc_pages_node(numa_node_id(), gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define __alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#endif
+#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
 extern unsigned long FASTCALL(__get_free_pages(unsigned int gfp_mask, unsigned int order));
 extern unsigned long FASTCALL(get_zeroed_page(unsigned int gfp_mask));
diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
--- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o	2004-04-07 12:07:18.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h	2004-04-07 12:07:13.000000000 +0200
@@ -0,0 +1,219 @@
+#ifndef _LINUX_MEMPOLICY_H
+#define _LINUX_MEMPOLICY_H 1
+
+#include <linux/errno.h>
+
+/*
+ * NUMA memory policies for Linux.
+ * Copyright 2003,2004 Andi Kleen SuSE Labs
+ */
+
+/* Policies */
+#define MPOL_DEFAULT     0
+#define MPOL_PREFERRED    1
+#define MPOL_BIND        2
+#define MPOL_INTERLEAVE  3
+
+#define MPOL_MAX MPOL_INTERLEAVE
+
+/* Flags for get_mem_policy */
+#define MPOL_F_NODE   (1<<0)  /* return next IL mode instead of node mask */
+#define MPOL_F_ADDR     (1<<1)  /* look up vma using address */
+
+/* Flags for mbind */
+#define MPOL_MF_STRICT  (1<<0)  /* Verify existing pages in the mapping */
+
+#ifdef __KERNEL__
+
+#include <linux/config.h>
+#include <linux/mmzone.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <asm/semaphore.h>
+
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+
+/*
+ * Describe a memory policy.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * For VMA related allocations the VMA policy is preferred, otherwise
+ * the process policy is used. Interrupts ignore the memory policy
+ * of the current process.
+ *
+ * Locking policy for interleave:
+ * In process context there is no locking because only the process accesses
+ * its own state. All vma manipulation is somewhat protected by a down_read on
+ * mmap_sem. For allocating in the interleave policy the page_table_lock
+ * must also be acquired to protect il_next.
+ *
+ * Freeing policy:
+ * When policy is MPOL_BIND v.zonelist is kmalloc'ed and must be kfree'd.
+ * All other policies don't have any external state. mpol_free() handles this.
+ *
+ * Copying policy objects:
+ * For MPOL_BIND the zonelist must be always duplicated. mpol_clone() does this.
+ */
+struct mempolicy {
+	atomic_t   refcnt;
+	short policy; 	/* See MPOL_* above */
+	union {
+		struct zonelist  *zonelist;	/* bind */
+		short 		 preferred_node; /* preferred */
+		DECLARE_BITMAP(nodes, MAX_NUMNODES); /* interleave */
+		/* undefined for default */
+	} v;
+};
+
+/* A NULL mempolicy pointer is a synonym for &default_policy. */
+extern struct mempolicy default_policy;
+
+/*
+ * Support for managing mempolicy data objects (clone, copy, destroy)
+ * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
+ */
+
+extern void __mpol_free(struct mempolicy *pol);
+static inline void mpol_free(struct mempolicy *pol)
+{
+	if (pol)
+		__mpol_free(pol);
+}
+
+extern struct mempolicy *__mpol_copy(struct mempolicy *pol);
+static inline struct mempolicy *mpol_copy(struct mempolicy *pol)
+{
+	if (pol)
+		pol = __mpol_copy(pol);
+	return pol;
+}
+
+#define vma_policy(vma) ((vma)->vm_policy)
+#define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+	if (pol)
+		atomic_inc(&pol->refcnt);
+}
+
+extern int __mpol_equal(struct mempolicy *a, struct mempolicy *b);
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+	if (a == b)
+		return 1;
+	return __mpol_equal(a, b);
+}
+#define vma_mpol_equal(a,b) mpol_equal(vma_policy(a), vma_policy(b))
+
+/* Could later add inheritance of the process policy here. */
+
+#define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
+
+/*
+ * Hugetlb policy. i386 hugetlb so far works with node numbers
+ * instead of zone lists, so give it special interfaces for now.
+ */
+extern int mpol_first_node(struct vm_area_struct *vma, unsigned long addr);
+extern int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr);
+
+/*
+ * Tree of shared policies for a shared memory region.
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments whose size
+ * in bytes would not fit in an unsigned long.
+ */
+
+struct sp_node {
+	struct rb_node nd;
+	unsigned long start, end;
+	struct mempolicy *policy;
+};
+
+struct shared_policy {
+	struct rb_root root;
+	struct semaphore sem;
+};
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+	info->root = RB_ROOT;
+	init_MUTEX(&info->sem);
+}
+
+int mpol_set_shared_policy(struct shared_policy *info,
+				  struct vm_area_struct *vma,
+				  struct mempolicy *new);
+void mpol_free_shared_policy(struct shared_policy *p);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
+					    unsigned long idx);
+
+#else
+
+struct mempolicy {};
+
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+	return 1;
+}
+#define vma_mpol_equal(a,b) 1
+
+#define mpol_set_vma_default(vma) do {} while(0)
+
+static inline void mpol_free(struct mempolicy *p)
+{
+}
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+}
+
+static inline struct mempolicy *mpol_copy(struct mempolicy *old)
+{
+	return NULL;
+}
+
+static inline int mpol_first_node(struct vm_area_struct *vma, unsigned long a)
+{
+	return numa_node_id();
+}
+
+static inline int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long a)
+{
+	return 1;
+}
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+				      struct vm_area_struct *vma,
+				      struct mempolicy *new)
+{
+	return -EINVAL;
+}
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+	return NULL;
+}
+
+#define vma_policy(vma) NULL
+#define vma_set_policy(vma, pol) do {} while(0)
+
+#endif /* CONFIG_NUMA */
+#endif /* __KERNEL__ */
+
+#endif
diff -u linux-2.6.5-mc2-numa/include/linux/mm.h-o linux-2.6.5-mc2-numa/include/linux/mm.h
--- linux-2.6.5-mc2-numa/include/linux/mm.h-o	2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/mm.h	2004-04-07 11:45:42.000000000 +0200
@@ -12,6 +12,7 @@
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/fs.h>
+#include <linux/mempolicy.h>
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -47,6 +48,9 @@
  *
  * This structure is exactly 64 bytes on ia32.  Please think very, very hard
  * before adding anything to it.
+ * [Now 4 bytes more on 32bit NUMA machines. Sorry. -AK.
+ * But if you want to recover the 4 bytes just remove vm_next. It is redundant
+ * with vm_rb. Will be a lot of editing work though. vm_rb.color is redundant too.] 
  */
 struct vm_area_struct {
 	struct mm_struct * vm_mm;	/* The address space we belong to. */
@@ -77,6 +81,10 @@
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
+
+#ifdef CONFIG_NUMA
+	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
 };
 
 /*
@@ -148,6 +156,8 @@
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	struct mempolicy *(*get_policy)(struct vm_area_struct *vma, unsigned long addr);
 };
 
 /* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -434,6 +444,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
 			unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
@@ -636,6 +648,11 @@
 	return vma;
 }
 
+static inline unsigned long vma_pages(struct vm_area_struct *vma)
+{
+	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+}
+
 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
 
 extern unsigned int nr_used_zone_pages(void);
diff -u linux-2.6.5-mc2-numa/include/linux/sched.h-o linux-2.6.5-mc2-numa/include/linux/sched.h
--- linux-2.6.5-mc2-numa/include/linux/sched.h-o	2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/sched.h	2004-04-07 11:45:42.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
+#include <linux/mempolicy.h>
 
 struct exec_domain;
 
@@ -501,6 +502,9 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+
+  	struct mempolicy *mempolicy;
+  	short il_next;		/* could be shared with used_math */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -u linux-2.6.5-mc2-numa/kernel/sys.c-o linux-2.6.5-mc2-numa/kernel/sys.c
--- linux-2.6.5-mc2-numa/kernel/sys.c-o	2004-04-07 11:42:20.000000000 +0200
+++ linux-2.6.5-mc2-numa/kernel/sys.c	2004-04-07 11:48:39.000000000 +0200
@@ -266,6 +266,9 @@
 cond_syscall(sys_mq_timedreceive)
 cond_syscall(sys_mq_notify)
 cond_syscall(sys_mq_getsetattr)
+cond_syscall(sys_mbind)
+cond_syscall(sys_get_mempolicy)
+cond_syscall(sys_set_mempolicy)
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read)
diff -u linux-2.6.5-mc2-numa/mm/Makefile-o linux-2.6.5-mc2-numa/mm/Makefile
--- linux-2.6.5-mc2-numa/mm/Makefile-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-mc2-numa/mm/Makefile	2004-04-07 12:07:49.000000000 +0200
@@ -12,3 +12,4 @@
 			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_NUMA) 	+= mempolicy.o
diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
--- linux-2.6.5-mc2-numa/mm/mempolicy.c-o	2004-04-07 12:07:41.000000000 +0200
+++ linux-2.6.5-mc2-numa/mm/mempolicy.c	2004-04-07 13:07:02.000000000 +0200
@@ -0,0 +1,981 @@
+/*
+ * Simple NUMA memory policy for the Linux kernel.
+ *
+ * Copyright 2003,2004 Andi Kleen, SuSE Labs.
+ * Subject to the GNU Public License, version 2.
+ *
+ * NUMA policy allows the user to give hints in which node(s) memory should
+ * be allocated.
+ *
+ * Support four policies per VMA and per process:
+ *
+ * The VMA policy has priority over the process policy for a page fault.
+ *
+ * interleave     Allocate memory interleaved over a set of nodes,
+ *                with normal fallback if it fails.
+ *                For VMA based allocations this interleaves based on the
+ *                offset into the backing object or offset into the mapping
+ *                for anonymous memory. For process policy a process counter
+ *                is used.
+ * bind           Only allocate memory on a specific set of nodes,
+ *                no fallback.
+ * preferred      Try a specific node first before normal fallback.
+ *                As a special case node -1 here means do the allocation
+ *                on the local CPU. This is normally identical to default,
+ *                but useful to set in a VMA when you have a non default
+ *                process policy.
+ * default        Allocate on the local node first, or when on a VMA
+ *                use the process policy. This is what Linux always did
+ *                in a NUMA aware kernel and still does by, ahem, default.
+ *
+ * The process policy is applied for most non interrupt memory allocations
+ * in that process' context. Interrupts ignore the policies and always
+ * try to allocate on the local CPU. The VMA policy is only applied for memory
+ * allocations for a VMA in the VM.
+ *
+ * Currently there are a few corner cases in swapping where the policy
+ * is not applied, but the majority should be handled. When process policy
+ * is used it is not remembered over swap outs/swap ins.
+ *
+ * Only the highest zone in the zone hierarchy gets policied. Allocations
+ * requesting a lower zone just use default policy. This implies that
+ * on systems with highmem, kernel lowmem allocations don't get policied.
+ * Same with GFP_DMA allocations.
+ *
+ * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
+ * all users and remembered even when nobody has memory mapped.
+ */
+
+/* Notebook:
+   fix mmap readahead to honour policy and enable policy for any page cache
+   object
+   statistics for bigpages
+   global policy for page cache? currently it uses process policy. Requires
+   first item above.
+   handle mremap for shared memory (currently ignored for the policy)
+   grows down?
+   make bind policy root only? It can trigger oom much faster and the
+   kernel is not always grateful with that.
+   could replace all the switch()es with a mempolicy_ops structure.
+*/
+
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/init.h>
+#include <asm/uaccess.h>
+
+static kmem_cache_t *policy_cache;
+static kmem_cache_t *sn_cache;
+
+#define round_up(x,y) (((x) + (y) - 1) & ~((y)-1))
+#define PDprintk(fmt...)
+
+/* Highest zone. A specific allocation for a zone below that is not
+   policied. */
+static int policy_zone;
+
+static struct mempolicy default_policy = {
+	.refcnt = ATOMIC_INIT(1), /* never free it */
+	.policy = MPOL_DEFAULT,
+};
+
+/* Check if all specified nodes are online */
+static int check_online(unsigned long *nodes)
+{
+	DECLARE_BITMAP(offline, MAX_NUMNODES);
+	bitmap_copy(offline, node_online_map, MAX_NUMNODES);
+	if (bitmap_empty(offline, MAX_NUMNODES))
+		set_bit(0, offline);
+	bitmap_complement(offline, MAX_NUMNODES);
+	bitmap_and(offline, offline, nodes, MAX_NUMNODES);
+	if (!bitmap_empty(offline, MAX_NUMNODES))
+		return -EINVAL;
+	return 0;
+}
+
+/* Do sanity checking on a policy */
+static int check_policy(int mode, unsigned long *nodes)
+{
+	int empty = bitmap_empty(nodes, MAX_NUMNODES);
+	switch (mode) {
+	case MPOL_DEFAULT:
+		if (!empty)
+			return -EINVAL;
+		break;
+	case MPOL_BIND:
+	case MPOL_INTERLEAVE:
+		/* Preferred will only use the first bit, but allow
+		   more for now. */
+		if (empty)
+			return -EINVAL;
+		break;
+	}
+	return check_online(nodes);
+}
+
+/* Copy a node mask from user space. */
+static int get_nodes(unsigned long *nodes, unsigned long *nmask,
+		     unsigned long maxnode, int mode)
+{
+	unsigned long k;
+	unsigned long nlongs;
+	unsigned long endmask;
+
+	--maxnode;
+	nlongs = BITS_TO_LONGS(maxnode);
+	if ((maxnode % BITS_PER_LONG) == 0)
+		endmask = ~0UL;
+	else
+		endmask = (1UL << (maxnode % BITS_PER_LONG)) - 1;
+
+	/* When the user specifies more nodes than supported, just check
+	   that the unsupported part is all zero. */
+	if (nmask && nlongs > BITS_TO_LONGS(MAX_NUMNODES)) {
+		for (k = BITS_TO_LONGS(MAX_NUMNODES); k < nlongs; k++) {
+			unsigned long t;
+			if (get_user(t,  nmask + k))
+				return -EFAULT;
+			if (k == nlongs - 1) {
+				if (t & endmask)
+					return -EINVAL;
+			} else if (t)
+				return -EINVAL;
+		}
+		nlongs = BITS_TO_LONGS(MAX_NUMNODES);
+		endmask = ~0UL;
+	}
+
+	bitmap_clear(nodes, MAX_NUMNODES);
+	if (nmask && copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
+		return -EFAULT;
+	nodes[nlongs-1] &= endmask;
+	return check_policy(mode, nodes);
+}
+
+/* Generate a custom zonelist for the BIND policy. */
+static struct zonelist *bind_zonelist(unsigned long *nodes)
+{
+	struct zonelist *zl;
+	int num, max, nd;
+
+	max = 1 + MAX_NR_ZONES * bitmap_weight(nodes, MAX_NUMNODES);
+	zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
+	if (!zl)
+		return NULL;
+	num = 0;
+	for (nd = find_first_bit(nodes, MAX_NUMNODES);
+	     nd < MAX_NUMNODES;
+	     nd = find_next_bit(nodes, MAX_NUMNODES, 1+nd)) {
+		int k;
+		for (k = MAX_NR_ZONES-1; k >= 0; k--) {
+			struct zone *z = &NODE_DATA(nd)->node_zones[k];
+			if (!z->present_pages)
+				continue;
+			zl->zones[num++] = z;
+			if (k > policy_zone)
+				policy_zone = k;
+		}
+	}
+	BUG_ON(num >= max);
+	zl->zones[num] = NULL;
+	return zl;
+}
+
+/* Create a new policy */
+static struct mempolicy *new_policy(int mode, unsigned long *nodes)
+{
+	struct mempolicy *policy;
+	PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes[0]);
+	if (mode == MPOL_DEFAULT)
+		return NULL;
+	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	if (!policy)
+		return ERR_PTR(-ENOMEM);
+	atomic_set(&policy->refcnt, 1);
+	switch (mode) {
+	case MPOL_INTERLEAVE:
+		bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
+		break;
+	case MPOL_PREFERRED:
+		policy->v.preferred_node = find_first_bit(nodes, MAX_NUMNODES);
+		if (policy->v.preferred_node >= MAX_NUMNODES)
+			policy->v.preferred_node = -1;
+		break;
+	case MPOL_BIND:
+		policy->v.zonelist = bind_zonelist(nodes);
+		if (policy->v.zonelist == NULL) {
+			kmem_cache_free(policy_cache, policy);
+			return ERR_PTR(-ENOMEM);
+		}
+		break;
+	}
+	policy->policy = mode;
+	return policy;
+}
+
+/* Ensure all existing pages follow the policy. */
+static int
+verify_pages(unsigned long addr, unsigned long end, unsigned long *nodes)
+{
+	while (addr < end) {
+		struct page *p;
+		pte_t *pte;
+		pmd_t *pmd;
+		pgd_t *pgd = pgd_offset_k(addr);
+		if (pgd_none(*pgd)) {
+			addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
+			continue;
+		}
+		pmd = pmd_offset(pgd, addr);
+		if (pmd_none(*pmd)) {
+			addr = (addr + PMD_SIZE) & PMD_MASK;
+			continue;
+		}
+		p = NULL;
+		pte = pte_offset_map(pmd, addr);
+		if (pte_present(*pte))
+			p = pte_page(*pte);
+		pte_unmap(pte);
+		if (p) {
+			unsigned nid = page_zone(p)->zone_pgdat->node_id;
+			if (!test_bit(nid, nodes))
+				return -EIO;
+		}
+		addr += PAGE_SIZE;
+	}
+	return 0;
+}
+
+/* Step 1: check the range */
+static struct vm_area_struct *
+check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
+	    unsigned long *nodes, unsigned long flags)
+{
+	int err;
+	struct vm_area_struct *first, *vma, *prev;
+
+	first = find_vma(mm, start);
+	if (!first)
+		return ERR_PTR(-EFAULT);
+	prev = NULL;
+	for (vma = first; vma->vm_start < end; vma = vma->vm_next) {
+		if (!vma->vm_next && vma->vm_end < end)
+			return ERR_PTR(-EFAULT);
+		if (prev && prev->vm_end < vma->vm_start)
+			return ERR_PTR(-EFAULT);
+		if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
+			err = verify_pages(vma->vm_start, vma->vm_end, nodes);
+			if (err) {
+				first = ERR_PTR(err);
+				break;
+			}
+		}
+		prev = vma;
+	}
+	return first;
+}
+
+/* Apply policy to a single VMA */
+static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+{
+	int err = 0;
+	struct mempolicy *old = vma->vm_policy;
+
+	PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
+		 vma->vm_start, vma->vm_end, vma->vm_pgoff,
+		 vma->vm_ops, vma->vm_file,
+		 vma->vm_ops ? vma->vm_ops->set_policy : NULL);
+
+	if (vma->vm_file)
+		down(&vma->vm_file->f_mapping->i_shared_sem);
+	if (vma->vm_ops && vma->vm_ops->set_policy)
+		err = vma->vm_ops->set_policy(vma, new);
+	if (!err) {
+		mpol_get(new);
+		vma->vm_policy = new;
+		mpol_free(old);
+	}
+	if (vma->vm_file)
+		up(&vma->vm_file->f_mapping->i_shared_sem);
+	return err;
+}
+
+/* Step 2: apply policy to a range and do splits. */
+static int mbind_range(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, struct mempolicy *new)
+{
+	struct vm_area_struct *next;
+	int err;
+
+	err = 0;
+	for (; vma->vm_start < end; vma = next) {
+		next = vma->vm_next;
+		if (vma->vm_start < start)
+			err = split_vma(vma->vm_mm, vma, start, 1);
+		if (!err && vma->vm_end > end)
+			err = split_vma(vma->vm_mm, vma, end, 0);
+		if (!err)
+			err = policy_vma(vma, new);
+		if (err)
+			break;
+	}
+	return err;
+}
+
+/* Change policy for a memory range */
+asmlinkage long sys_mbind(unsigned long start, unsigned long len,
+			  unsigned long mode,
+			  unsigned long *nmask, unsigned long maxnode,
+			  unsigned flags)
+{
+	struct vm_area_struct *vma;
+	struct mm_struct *mm = current->mm;
+	struct mempolicy *new;
+	unsigned long end;
+	DECLARE_BITMAP(nodes, MAX_NUMNODES);
+	int err;
+
+	if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+		return -EINVAL;
+	if (start & ~PAGE_MASK)
+		return -EINVAL;
+	if (mode == MPOL_DEFAULT)
+		flags &= ~MPOL_MF_STRICT;
+	len = (len + PAGE_SIZE - 1) & PAGE_MASK;
+	end = start + len;
+	if (end < start)
+		return -EINVAL;
+	if (end == start)
+		return 0;
+
+	err = get_nodes(nodes, nmask, maxnode, mode);
+	if (err)
+		return err;
+
+	new = new_policy(mode, nodes);
+	if (IS_ERR(new))
+		return PTR_ERR(new);
+
+	PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
+			mode,nodes[0]);
+
+	down_write(&mm->mmap_sem);
+	vma = check_range(mm, start, end, nodes, flags);
+	err = PTR_ERR(vma);
+	if (!IS_ERR(vma))
+		err = mbind_range(vma, start, end, new);
+	up_write(&mm->mmap_sem);
+	mpol_free(new);
+	return err;
+}
+
+/* Set the process memory policy */
+asmlinkage long sys_set_mempolicy(int mode, unsigned long *nmask,
+				   unsigned long maxnode)
+{
+	int err;
+	struct mempolicy *new;
+	DECLARE_BITMAP(nodes, MAX_NUMNODES);
+
+	if (mode > MPOL_MAX)
+		return -EINVAL;
+	err = get_nodes(nodes, nmask, maxnode, mode);
+	if (err)
+		return err;
+	new = new_policy(mode, nodes);
+	if (IS_ERR(new))
+		return PTR_ERR(new);
+	mpol_free(current->mempolicy);
+	current->mempolicy = new;
+	if (new && new->policy == MPOL_INTERLEAVE)
+		current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+	return 0;
+}
+
+/* Fill a zone bitmap for a policy */
+static void get_zonemask(struct mempolicy *p, unsigned long *nodes)
+{
+	int i;
+	bitmap_clear(nodes, MAX_NUMNODES);
+	switch (p->policy) {
+	case MPOL_BIND:
+		for (i = 0; p->v.zonelist->zones[i]; i++)
+			__set_bit(p->v.zonelist->zones[i]->zone_pgdat->node_id, nodes);
+		break;
+	case MPOL_DEFAULT:
+		break;
+	case MPOL_INTERLEAVE:
+		bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
+		break;
+	case MPOL_PREFERRED:
+		/* or use current node instead of online map? */
+		if (p->v.preferred_node < 0)
+			bitmap_copy(nodes, node_online_map, MAX_NUMNODES);
+		else	
+			__set_bit(p->v.preferred_node, nodes);
+		break;
+	default:
+		BUG();
+	}	
+}
+
+static int lookup_node(struct mm_struct *mm, unsigned long addr)
+{
+	struct page *p;
+	int err;
+	err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+	if (err >= 0) {
+		err = page_zone(p)->zone_pgdat->node_id;
+		put_page(p);
+	}	
+	return err;
+}
+
+/* Copy a kernel node mask to user space */
+static int copy_nodes_to_user(unsigned long *user_mask, unsigned long maxnode,
+			      unsigned long *nodes)
+{
+	unsigned long copy = round_up(maxnode-1, BITS_PER_LONG) / 8;
+	if (copy > sizeof(nodes)) {
+		if (copy > PAGE_SIZE)
+			return -EINVAL;
+		if (clear_user((char*)user_mask + sizeof(nodes), copy - sizeof(nodes)))
+			return -EFAULT;
+		copy = sizeof(nodes);
+	}
+	return copy_to_user(user_mask, nodes, copy) ? -EFAULT : 0;
+}
+
+/* Retrieve NUMA policy */
+asmlinkage long sys_get_mempolicy(int *policy,
+				  unsigned long *nmask, unsigned long maxnode,
+				  unsigned long addr, unsigned long flags)	
+{
+	int err, pval;
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma = NULL; 	
+	struct mempolicy *pol = current->mempolicy;
+
+	if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
+		return -EINVAL;
+	if (nmask != NULL && maxnode < numnodes)
+		return -EINVAL;
+	if (flags & MPOL_F_ADDR) {
+		down_read(&mm->mmap_sem);
+		vma = find_vma_intersection(mm, addr, addr+1);
+		if (!vma) {
+			up_read(&mm->mmap_sem);
+			return -EFAULT;
+		}
+		if (vma->vm_ops && vma->vm_ops->get_policy)
+			pol = vma->vm_ops->get_policy(vma, addr);
+		else
+			pol = vma->vm_policy;
+	} else if (addr)
+		return -EINVAL;
+		
+	if (!pol)
+		pol = &default_policy;
+		
+	if (flags & MPOL_F_NODE) {
+		if (flags & MPOL_F_ADDR) {
+			err = lookup_node(mm, addr);
+			if (err < 0)
+				goto out;
+			pval = err;	
+		} else if (pol == current->mempolicy && pol->policy == MPOL_INTERLEAVE)
+			pval = current->il_next;
+		else {
+			err = -EINVAL;
+			goto out;
+		}	
+	} else
+		pval = pol->policy;
+
+	err = -EFAULT;
+	if (policy && put_user(pval, policy))
+		goto out;
+
+	err = 0;
+	if (nmask) {
+		DECLARE_BITMAP(nodes, MAX_NUMNODES);
+		get_zonemask(pol, nodes);
+		err = copy_nodes_to_user(nmask, maxnode, nodes);
+	}	
+
+ out:
+	if (vma)
+		up_read(&current->mm->mmap_sem);
+	return err;
+}
+
+/* Return effective policy for a VMA */
+static struct mempolicy *
+get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = current->mempolicy;
+	if (vma) {
+		if (vma->vm_ops && vma->vm_ops->get_policy)
+		        pol = vma->vm_ops->get_policy(vma, addr);
+		else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
+			pol = vma->vm_policy;
+	}
+	if (!pol)
+		pol = &default_policy;
+	return pol;
+}
+
+/* Return a zonelist representing a mempolicy */
+static struct zonelist *zonelist_policy(unsigned gfp, struct mempolicy *policy)
+{
+	int nd;
+	switch (policy->policy) {
+	case MPOL_PREFERRED:
+		nd = policy->v.preferred_node;
+		if (nd < 0)
+			nd = numa_node_id();
+		break;
+	case MPOL_BIND:
+		/* Lower zones don't get a policy applied */
+		if ((gfp & GFP_ZONEMASK) >= policy_zone)
+			return policy->v.zonelist;
+		/*FALL THROUGH*/
+	case MPOL_INTERLEAVE: /* should not happen */
+	case MPOL_DEFAULT:
+		nd = numa_node_id();
+		break;
+	default:
+		nd = 0;
+		BUG();
+	}
+	return NODE_DATA(nd)->node_zonelists + (gfp & GFP_ZONEMASK);
+}
+
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+	unsigned nid, next; 	
+	struct task_struct *me = current;
+	nid = me->il_next;
+	BUG_ON(nid >= MAX_NUMNODES);
+	next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+	if (next >= MAX_NUMNODES)
+		next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+	me->il_next = next;
+	return nid;
+}
+
+/* Do static interleaving for a VMA with known offset. */
+static unsigned
+offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off)
+{
+	unsigned target = (unsigned)off % (unsigned)numnodes;
+	int c;
+	int nid = -1;
+	c = 0;
+	do {
+		nid = find_next_bit(pol->v.nodes, MAX_NUMNODES, nid+1);
+		if (nid >= MAX_NUMNODES) {
+			nid = -1; 		
+			continue;
+		}
+		c++;
+	} while (c <= target);
+	BUG_ON(nid >= MAX_NUMNODES);
+	return nid;
+}
+
+/* Allocate a page in interleaved policy for a VMA. Use the offset
+   into the VMA as the key. This has its own path because it needs to do special accounting. */
+static struct page *alloc_page_interleave(unsigned gfp, unsigned nid)
+{
+	struct zonelist *zl;
+	struct page *page;
+	BUG_ON(!test_bit(nid, node_online_map));
+	zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
+	page = __alloc_pages(gfp, 0, zl);
+	if (page && page_zone(page) == zl->zones[0]) {
+		zl->zones[0]->pageset[get_cpu()].interleave_hit++;
+		put_cpu();
+	}
+	return page;
+}
+
+/**
+ * 	alloc_page_vma	- Allocate a page for a VMA.
+ *
+ * 	@gfp:
+ *      %GFP_USER    user allocation.
+ *      %GFP_KERNEL  kernel allocations,
+ *      %GFP_HIGHMEM highmem/user allocations,
+ *      %GFP_FS      allocation should not call back into a file system.
+ *      %GFP_ATOMIC  don't sleep.
+ *
+ * 	@vma:  Pointer to VMA or NULL if not available.
+ *	@addr: Virtual Address of the allocation. Must be inside the VMA.
+ *
+ * 	This function allocates a page from the kernel page pool and applies
+ *	a NUMA policy associated with the VMA or the current process.
+ *	When VMA is not NULL caller must hold down_read on the mmap_sem of the
+ *	mm_struct of the VMA to prevent it from going away. Should be used for
+ *	all allocations for pages that will be mapped into
+ * 	user space. Returns NULL when no page can be allocated.
+ *
+ *	Should be called with the mmap_sem of the VMA held.
+ */
+struct page *
+alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = get_vma_policy(vma, addr);
+	if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+		unsigned nid;
+		if (vma) { 	
+			unsigned long off;
+			BUG_ON(addr >= vma->vm_end);
+			BUG_ON(addr < vma->vm_start);
+			off = vma->vm_pgoff;
+			off += (addr - vma->vm_start) >> PAGE_SHIFT;
+			nid = offset_il_node(pol, vma, off);		
+		} else {
+			/* fall back to process interleaving */
+			nid = interleave_nodes(pol);
+		}
+		return alloc_page_interleave(gfp, nid);
+	}
+	return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+}
+
+/**
+ * 	alloc_pages_current - Allocate pages.
+ *
+ *	@gfp:  %GFP_USER   user allocation,
+ *      %GFP_KERNEL kernel allocation,
+ *      %GFP_HIGHMEM highmem allocation,
+ *      %GFP_FS     don't call back into a file system.
+ *      %GFP_ATOMIC don't sleep.
+ *	@order: Power of two of allocation size in pages. 0 is a single page.
+ *
+ *	Allocate pages from the kernel page pool.  When not in
+ *	interrupt context the current process' NUMA policy is applied.
+ *	Returns NULL when no page can be allocated.
+ */
+struct page *alloc_pages_current(unsigned gfp, unsigned order)
+{ 	
+	struct mempolicy *pol = current->mempolicy;
+	if (!pol || in_interrupt())
+		pol = &default_policy;
+	if (pol->policy == MPOL_INTERLEAVE && order == 0)
+		return alloc_page_interleave(gfp, interleave_nodes(pol));
+	return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+}
+
+EXPORT_SYMBOL(alloc_pages_current);
+
+/* Slow path of a mempolicy copy */
+struct mempolicy *__mpol_copy(struct mempolicy *old)
+{	
+	struct mempolicy *new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+	if (!new)
+		return ERR_PTR(-ENOMEM);
+	*new = *old;		
+	atomic_set(&new->refcnt, 1);
+	if (new->policy == MPOL_BIND) {
+		int sz = ksize(old->v.zonelist);
+		new->v.zonelist = kmalloc(sz, SLAB_KERNEL);
+		if (!new->v.zonelist) {
+			kmem_cache_free(policy_cache, new);
+			return ERR_PTR(-ENOMEM);
+		}
+		memcpy(new->v.zonelist, old->v.zonelist, sz);
+	}
+	return new;
+}
+
+/* Slow path of a mempolicy comparison */
+int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+	if (!a || !b)
+		return 0;
+	if (a->policy != b->policy)
+		return 0;
+	switch (a->policy) {
+	case MPOL_DEFAULT:
+		return 1;
+	case MPOL_INTERLEAVE:
+		return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
+	case MPOL_PREFERRED:
+		return a->v.preferred_node == b->v.preferred_node;
+	case MPOL_BIND: {
+		int i;
+		for (i = 0; a->v.zonelist->zones[i]; i++)
+			if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+				return 0; 		
+		return b->v.zonelist->zones[i] == NULL;
+	}
+	default:
+		BUG();
+		return 0;
+	}
+}
+
+/* Slow path of the mpol destructor. */
+void __mpol_free(struct mempolicy *p)
+{
+	if (!atomic_dec_and_test(&p->refcnt))
+		return;
+	if (p->policy == MPOL_BIND)
+		kfree(p->v.zonelist);
+	p->policy = MPOL_DEFAULT;
+	kmem_cache_free(policy_cache, p);
+}
+
+/*
+ * Hugetlb policy. Same as above, just works with node numbers instead of
+ * zonelists.
+ */
+
+/* Find first node suitable for an allocation */
+int mpol_first_node(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = get_vma_policy(vma, addr);
+	switch (pol->policy) {
+	case MPOL_DEFAULT:
+		return numa_node_id();
+	case MPOL_BIND:
+		return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
+	case MPOL_INTERLEAVE:
+		return interleave_nodes(pol);
+	case MPOL_PREFERRED:
+		return pol->v.preferred_node >= 0 ? pol->v.preferred_node:numa_node_id();
+	}
+	BUG();
+	return 0;
+}
+
+/* Find secondary valid nodes for an allocation */
+int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol = get_vma_policy(vma, addr);
+	switch (pol->policy) {
+	case MPOL_PREFERRED:
+	case MPOL_DEFAULT:
+	case MPOL_INTERLEAVE:
+		return 1;
+	case MPOL_BIND: {
+		struct zone **z;
+		for (z = pol->v.zonelist->zones; *z; z++)
+			if ((*z)->zone_pgdat->node_id == nid)
+				return 1;
+		return 0;
+	}
+	default:
+		BUG();
+		return 0;
+	}
+}
+
+/*
+ * Shared memory backing store policy support.
+ *
+ * Remember policies even when nobody has shared memory mapped.
+ * The policies are kept in Red-Black tree linked from the inode.
+ * They are protected by the sp->sem semaphore, which should be held
+ * for any accesses to the tree.
+ */
+
+/* lookup first element intersecting start-end */
+/* Caller holds sp->sem */
+static struct sp_node *
+sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
+{
+	struct rb_node *n = sp->root.rb_node;
+	while (n) {
+		struct sp_node *p = rb_entry(n, struct sp_node, nd);
+		if (start >= p->end) {
+			n = n->rb_right;
+		} else if (end < p->start) {
+			n = n->rb_left;
+		} else {
+			break;
+		}
+	}
+	if (!n)
+		return NULL;
+	for (;;) {
+		struct sp_node *w = NULL;
+		struct rb_node *prev = rb_prev(n);
+		if (!prev)
+			break;
+		w = rb_entry(prev, struct sp_node, nd);
+		if (w->end <= start)
+			break;
+		n = prev;
+	}
+	return rb_entry(n, struct sp_node, nd);
+}
+
+/* Insert a new shared policy into the list. */
+/* Caller holds sp->sem */
+static void sp_insert(struct shared_policy *sp, struct sp_node *new)
+{
+	struct rb_node **p = &sp->root.rb_node;
+	struct rb_node *parent = NULL;
+	struct sp_node *nd;
+	while (*p) {
+		parent = *p;
+		nd = rb_entry(parent, struct sp_node, nd);
+		if (new->start < nd->start)
+			p = &(*p)->rb_left;
+		else if (new->end > nd->end)
+			p = &(*p)->rb_right;
+		else
+			BUG();
+	}
+	rb_link_node(&new->nd, parent, p);
+	rb_insert_color(&new->nd, &sp->root);
+	PDprintk("inserting %lx-%lx: %d\n", new->start, new->end,
+		 new->policy ? new->policy->policy : 0);
+}
+
+/* Find shared policy intersecting idx */
+struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+	struct mempolicy *pol = NULL;
+	struct sp_node *sn;
+	down(&sp->sem);
+	sn = sp_lookup(sp, idx, idx+1);
+	if (sn) {
+		mpol_get(sn->policy);
+		pol = sn->policy;
+	}
+	up(&sp->sem);
+	return pol;
+}
+
+static void sp_delete(struct shared_policy *sp, struct sp_node *n)
+{
+	PDprintk("deleting %lx-%lx\n", n->start, n->end);
+	rb_erase(&n->nd, &sp->root);
+	mpol_free(n->policy);
+	kmem_cache_free(sn_cache, n);
+}
+
+struct sp_node *
+sp_alloc(unsigned long start, unsigned long end, struct mempolicy *pol)
+{
+	struct sp_node *n = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+	if (!n)
+		return NULL;
+	n->start = start;
+	n->end = end;
+	mpol_get(pol);
+	n->policy = pol;
+	return n;
+}
+
+/* Replace a policy range. */
+static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
+				 unsigned long end, struct sp_node *new)
+{
+	struct sp_node *n, *new2;
+
+	down(&sp->sem);
+	n = sp_lookup(sp, start, end);
+	/* Take care of old policies in the same range. */
+	while (n && n->start < end) {
+		struct rb_node *next = rb_next(&n->nd);
+		if (n->start >= start) {
+			if (n->end <= end)
+				sp_delete(sp, n);
+			else
+				n->start = end;
+		} else {
+			/* Old policy spanning whole new range. */
+			if (n->end > end) {
+				new2 = sp_alloc(end, n->end, n->policy);
+				if (!new2) {
+					up(&sp->sem);
+					return -ENOMEM;
+				}
+				n->end = end;
+				sp_insert(sp, new2);
+			}
+			/* Old crossing beginning, but not end (easy) */
+			if (n->start < start && n->end > start)
+				n->end = start;
+		}
+		if (!next)
+			break;
+		n = rb_entry(next, struct sp_node, nd);
+	}
+	if (new)
+		sp_insert(sp, new);
+	up(&sp->sem);
+	return 0;
+}
+
+int mpol_set_shared_policy(struct shared_policy *info, struct vm_area_struct *vma,
+			   struct mempolicy *npol)
+{
+	int err;
+	struct sp_node *new = NULL;
+	unsigned long sz = vma_pages(vma);
+
+	PDprintk("set_shared_policy %lx sz %lu %d %lx\n",
+		 vma->vm_pgoff,
+		 sz, npol? npol->policy : -1,
+		npol ? npol->v.nodes[0] : -1);
+		
+	if (npol) {
+		new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+		if (!new)
+			return -ENOMEM;
+	}
+	err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+	if (err && new)
+		kmem_cache_free(sn_cache, new);
+	return err;
+}
+
+/* Free a backing policy store on inode delete. */
+void mpol_free_shared_policy(struct shared_policy *p)
+{
+	struct sp_node *n;
+	struct rb_node *next;
+	down(&p->sem);
+	next = rb_first(&p->root);
+	while (next) {
+		n = rb_entry(next, struct sp_node, nd);		
+		next = rb_next(&n->nd);
+		rb_erase(&n->nd, &p->root);
+		mpol_free(n->policy);
+		kmem_cache_free(sn_cache, n);		
+	}
+	up(&p->sem);
+}
+
+static __init int numa_policy_init(void)
+{
+	policy_cache = kmem_cache_create("numa_policy",
+					 sizeof(struct mempolicy),
+					 0, 0, NULL, NULL);
+
+	sn_cache = kmem_cache_create("shared_policy_node",
+				     sizeof(struct sp_node),
+				     0, 0, NULL, NULL);
+
+	if (!policy_cache || !sn_cache)
+		panic("Cannot create NUMA policy cache");
+	return 0;
+}
+__initcall(numa_policy_init);
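
For illustration only (not part of the patch): the static interleave
above maps a VMA page offset to a node as a pure function of the offset
and the node mask, so a page always comes from the same node no matter
how often it is faulted in. Here is a stand-alone user-space mirror of
the offset_il_node() arithmetic, assuming 4 online nodes and an
interleave mask covering nodes 0, 2 and 3:

    #include <stdio.h>

    #define NODEMASK  0x0d   /* nodes 0, 2, 3 */
    #define NUMNODES  4      /* stands in for the kernel's numnodes */

    /* find_next_bit equivalent over a small mask */
    static int find_next_set(unsigned mask, int from)
    {
            int i;
            for (i = from; i < NUMNODES; i++)
                    if (mask & (1u << i))
                            return i;
            return NUMNODES;        /* "not found" */
    }

    static int offset_to_node(unsigned long off)
    {
            unsigned target = off % NUMNODES;
            int nid = -1, c = 0;

            do {    /* walk set bits, wrapping, until the target'th hit */
                    nid = find_next_set(NODEMASK, nid + 1);
                    if (nid >= NUMNODES) {
                            nid = -1;
                            continue;
                    }
                    c++;
            } while (c <= target);
            return nid;
    }

    int main(void)
    {
            unsigned long off;
            for (off = 0; off < 8; off++)
                    printf("page offset %lu -> node %d\n",
                           off, offset_to_node(off));
            return 0;
    }

This prints nodes 0,2,3,0 repeating: because the target is taken modulo
the node count rather than the weight of the mask, node 0 is hit twice
per cycle with this particular mask.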

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:27 ` Andi Kleen
@ 2004-04-07 21:41   ` Matthew Dobson
  2004-04-07 21:45     ` Andi Kleen
  2004-04-15  0:38   ` Matthew Dobson
  1 sibling, 1 reply; 60+ messages in thread
From: Matthew Dobson @ 2004-04-07 21:41 UTC (permalink / raw)
  To: Andi Kleen; +Cc: LKML, Andrew Morton, Martin J. Bligh

On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> On Wed, 07 Apr 2004 14:24:19 -0700
> Matthew Dobson <colpatch@us.ibm.com> wrote:
> 
> > 	I must be missing something here, but did you not include mempolicy.h
> > and policy.c in these patches?  I can't seem to find them anywhere?!? 
> > It's really hard to evaluate your patches if the core of them is
> > missing!
> 
> It was in the core patch and also in the last patch I sent Andrew.
> See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches

Ok.. I'll check that link, but what you posted didn't have the files
(mempolicy.h & policy.c) in the patch:

[mcd@arrakis numa_api]$ diffstat numa_api-01-core.patch
 include/linux/gfp.h   |   25 +++++++++++++++++++++++--
 include/linux/mm.h    |   17 +++++++++++++++++
 include/linux/sched.h |    4 ++++
 kernel/sys.c          |    3 +++
 mm/Makefile           |    1 +
 5 files changed, 48 insertions(+), 2 deletions(-)

Maybe it got lost somewhere between your mailer and mine?  The patch you
posted to LKML yesterday was clearly done without the -N option to diff:

diff -u linux-2.6.5-numa/kernel/sys.c-o linux-2.6.5-numa/kernel/sys.c


> > 
> > Andrew already mentioned your mistake on the i386 syscalls which needs
> > to be fixed.
> 
> That's already fixed

Good.

 
> > Also, this snippet of code is in 2 of your patches (#1 and #6) causing
> > rejects:
> > 
> > @@ -435,6 +445,8 @@
> >  
> >  struct page *shmem_nopage(struct vm_area_struct * vma,
> >                          unsigned long address, int *type);
> > +int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy
> > *new);
> > +struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned
> > long addr);
> >  struct file *shmem_file_setup(char * name, loff_t size, unsigned long
> > flags);
> >  void shmem_lock(struct file * file, int lock);
> >  int shmem_zero_setup(struct vm_area_struct *);
> 
> 
> It didn't reject for me. 

I don't know why.  The same code addition is in both the 'core' patch
and the 'shm' patch.  Adding it twice causes patch to throw a reject.


> > Just from the patches you posted, I would really disagree that these are
> > ready for merging into -mm.
> 
> Why so? 
> 
> -Andi


Well, if for no other reason than all the code isn't posted!

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:24 Matthew Dobson
  2004-04-07 21:27 ` Andi Kleen
@ 2004-04-07 21:35 ` Matthew Dobson
  2004-04-07 21:51 ` Andrew Morton
  2 siblings, 0 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-07 21:35 UTC (permalink / raw)
  To: LKML; +Cc: Andi Kleen, Andrew Morton, Martin J. Bligh

Replying to myself, but...

On Wed, 2004-04-07 at 14:24, Matthew Dobson wrote:
> Andi,
> 	I must be missing something here, but did you not include mempolicy.h
> and policy.c in these patches?  I can't seem to find them anywhere?!? 
> It's really hard to evaluate your patches if the core of them is
> missing!

Running make -j24 bzImage
In file included from arch/i386/kernel/asm-offsets.c:7:
include/linux/sched.h:32: linux/mempolicy.h: No such file or directory
make[1]: *** [arch/i386/kernel/asm-offsets.s] Error 1
make: *** [arch/i386/kernel/asm-offsets.s] Error 2

I'm guessing you just forgot the -N option to diff.  You might want to
add the -p option when you rediff and repost because it makes your
patches an order of magnitude easier to read when you can tell what
function the patch is modifying.
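
Something like this (directory names purely illustrative) would avoid
both problems:

    diff -urNp linux-2.6.5-vanilla linux-2.6.5-numa > numa-api.patch

-u for unified output, -r to recurse, -N so newly added files such as
mempolicy.h end up in the patch, -p to show the enclosing function in
each hunk header.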

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: NUMA API for Linux
  2004-04-07 21:24 Matthew Dobson
@ 2004-04-07 21:27 ` Andi Kleen
  2004-04-07 21:41   ` Matthew Dobson
  2004-04-15  0:38   ` Matthew Dobson
  2004-04-07 21:35 ` Matthew Dobson
  2004-04-07 21:51 ` Andrew Morton
  2 siblings, 2 replies; 60+ messages in thread
From: Andi Kleen @ 2004-04-07 21:27 UTC (permalink / raw)
  To: colpatch; +Cc: linux-kernel, akpm, mbligh

On Wed, 07 Apr 2004 14:24:19 -0700
Matthew Dobson <colpatch@us.ibm.com> wrote:

> 	I must be missing something here, but did you not include mempolicy.h
> and policy.c in these patches?  I can't seem to find them anywhere?!? 
> It's really hard to evaluate your patches if the core of them is
> missing!

It was in the core patch and also in the last patch I sent Andrew.
See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches

> 
> Andrew already mentioned your mistake on the i386 syscalls which needs
> to be fixed.

That's already fixed
 
> Also, this snippet of code is in 2 of your patches (#1 and #6) causing
> rejects:
> 
> @@ -435,6 +445,8 @@
>  
>  struct page *shmem_nopage(struct vm_area_struct * vma,
>                          unsigned long address, int *type);
> +int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy
> *new);
> +struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned
> long addr);
>  struct file *shmem_file_setup(char * name, loff_t size, unsigned long
> flags);
>  void shmem_lock(struct file * file, int lock);
>  int shmem_zero_setup(struct vm_area_struct *);


It didn't reject for me. 

> Just from the patches you posted, I would really disagree that these are
> ready for merging into -mm.

Why so? 

-Andi

^ permalink raw reply	[flat|nested] 60+ messages in thread

* NUMA API for Linux
@ 2004-04-07 21:24 Matthew Dobson
  2004-04-07 21:27 ` Andi Kleen
                   ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Matthew Dobson @ 2004-04-07 21:24 UTC (permalink / raw)
  To: LKML, Andi Kleen, Andrew Morton, Martin J. Bligh

Andi,
	I must be missing something here, but did you not include mempolicy.h
and policy.c in these patches?  I can't seem to find them anywhere?!? 
It's really hard to evaluate your patches if the core of them is
missing!

Andrew already mentioned your mistake on the i386 syscalls which needs
to be fixed.

Also, this snippet of code is in 2 of your patches (#1 and #6) causing
rejects:

@@ -435,6 +445,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
                         unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy
*new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned
long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long
flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);



Just from the patches you posted, I would really disagree that these are
ready for merging into -mm.

-Matt


^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2004-05-06  6:01 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-04-06 13:33 NUMA API for Linux Andi Kleen
2004-04-06 13:34 ` [PATCH] NUMA API for Linux 1/ Core NUMA API code Andi Kleen
2004-04-06 13:35 ` NUMA API for Linux 2/ Add x86-64 support Andi Kleen
2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
2004-04-06 23:23   ` Andrew Morton
2004-04-06 13:36 ` [PATCH] NUMA API for Linux 4/ Add IA64 support Andi Kleen
2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
2004-05-05 16:05   ` Paul Jackson
2004-05-05 16:39     ` Andi Kleen
2004-05-05 16:47       ` Paul Jackson
2004-05-06  6:00         ` Andi Kleen
2004-04-06 13:37 ` [PATCH] NUMA API for Linux 6/ Add shared memory support Andi Kleen
2004-04-06 13:38 ` [PATCH] NUMA API for Linux 7/ Add statistics Andi Kleen
2004-04-06 13:39 ` [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory Andi Kleen
2004-04-06 13:40 ` [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support Andi Kleen
2004-04-06 13:40 ` [PATCH] NUMA API for Linux 10/ Bitmap bugfix Andi Kleen
2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
2004-04-08 20:12 ` Pavel Machek
2004-04-07 21:24 Matthew Dobson
2004-04-07 21:27 ` Andi Kleen
2004-04-07 21:41   ` Matthew Dobson
2004-04-07 21:45     ` Andi Kleen
2004-04-07 22:19       ` Matthew Dobson
2004-04-08  0:58       ` Matthew Dobson
2004-04-08  1:31         ` Andi Kleen
2004-04-08 18:36           ` Matthew Dobson
2004-04-09  1:09       ` Matthew Dobson
2004-04-09  5:29         ` Martin J. Bligh
2004-04-09 18:44           ` Matthew Dobson
2004-04-15  0:38   ` Matthew Dobson
2004-04-15 10:39     ` Andi Kleen
2004-04-15 11:48       ` Robin Holt
2004-04-15 18:32         ` Matthew Dobson
2004-04-15 19:44       ` Matthew Dobson
2004-04-07 21:35 ` Matthew Dobson
2004-04-07 21:51 ` Andrew Morton
2004-04-07 22:16   ` Andi Kleen
2004-04-07 22:34     ` Andrew Morton
2004-04-07 22:39     ` Martin J. Bligh
2004-04-07 22:33       ` Andi Kleen
2004-04-07 22:38   ` Martin J. Bligh
2004-04-07 22:38     ` Andi Kleen
2004-04-07 22:52       ` Andrew Morton
2004-04-07 23:09         ` Martin J. Bligh
2004-04-07 23:35         ` Andi Kleen
2004-04-07 23:56           ` Andrew Morton
2004-04-08  0:14             ` Andi Kleen
2004-04-08  0:26               ` Andrea Arcangeli
2004-04-08  0:51                 ` Andi Kleen
2004-04-08 16:15             ` Hugh Dickins
2004-04-08 17:05               ` Martin J. Bligh
2004-04-08 18:16                 ` Hugh Dickins
2004-04-08 19:25               ` Andrew Morton
2004-04-09  2:41                 ` Wim Coekaerts
2004-04-08  0:22           ` Andrea Arcangeli
     [not found] <1IL3l-1dP-35@gated-at.bofh.it>
     [not found] ` <1IMik-2is-37@gated-at.bofh.it>
2004-04-08 19:20   ` Rajesh Venkatasubramanian
2004-04-08 19:48     ` Hugh Dickins
2004-04-08 19:57       ` Rajesh Venkatasubramanian
2004-04-08 19:52     ` Andrea Arcangeli
     [not found] <1IsMQ-3vi-35@gated-at.bofh.it>
     [not found] ` <1IsMS-3vi-45@gated-at.bofh.it>
     [not found]   ` <1It5U-3J1-21@gated-at.bofh.it>
     [not found]     ` <1ItfE-3PL-3@gated-at.bofh.it>
     [not found]       ` <1ISQC-7Cv-5@gated-at.bofh.it>
2004-04-09  5:39         ` Andi Kleen
