* NUMA API for Linux
@ 2004-04-06 13:33 Andi Kleen
  2004-04-06 13:34 ` [PATCH] NUMA API for Linux 1/ Core NUMA API code Andi Kleen
                   ` (11 more replies)
  0 siblings, 12 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:33 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm


The following patches add support for a configurable NUMA memory policy
for user processes. It is based on the proposal from the last kernel summit,
with feedback from various people.

This NUMA API doesn't attempt to implement page migration or anything
else complicated: all it does is police the allocation when a page
is first allocated or when a page is reallocated after swapping. Currently
only shared memory and anonymous memory are supported; policy for
file based mappings is not implemented yet (although they are implicitly
policied by the default process policy).

It adds three new system calls: mbind to change the policy of a VMA,
set_mempolicy to change the policy of a process, and get_mempolicy to
retrieve a memory policy. User tools (numactl, libnuma, test programs,
manpages) can be found at ftp://ftp.suse.com/pub/people/ak/numa/numactl-0.6.tar.gz

For details on the system calls see the manpages
http://www.firstfloor.org/~andi/mbind.html
http://www.firstfloor.org/~andi/set_mempolicy.html
http://www.firstfloor.org/~andi/get_mempolicy.html
Most user programs should not use the system calls directly,
but use the higher level functions in libnuma
(http://www.firstfloor.org/~andi/numa.html) or the command line tools
(http://www.firstfloor.org/~andi/numactl.html).
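
For illustration, here is a minimal sketch of invoking set_mempolicy
directly on x86-64. The syscall number is taken from the x86-64 patch in
this series; the MPOL_INTERLEAVE value and the maxnode convention are
assumptions here - the manpages above are authoritative:

#include <unistd.h>
#include <sys/syscall.h>

#define __NR_set_mempolicy 238	/* x86-64 number from patch 2 */
#define MPOL_INTERLEAVE 3	/* assumed value; see linux/mempolicy.h */

int main(void)
{
	/* Interleave all future allocations of this process over
	   nodes 0 and 1. The mask is a bitmap of allowed nodes and
	   the last argument is the number of bits in the mask. */
	unsigned long mask = (1UL << 0) | (1UL << 1);

	return syscall(__NR_set_mempolicy, MPOL_INTERLEAVE, &mask, 2) ? 1 : 0;
}

In practice most programs would call the equivalent libnuma wrapper
instead of the raw syscall.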

The system calls allow user programs and administrators to set various
NUMA memory policies for putting memory on specific nodes. Here is a short
description of the policies, copied from the kernel patch:

 * NUMA policy allows the user to give hints in which node(s) memory should
 * be allocated.
 *
 * Support four policies per VMA and per process:
 *
 * The VMA policy has priority over the process policy for a page fault.
 *
 * interleave     Allocate memory interleaved over a set of nodes,
 *                with normal fallback if it fails.
 *                For VMA based allocations this interleaves based on the
 *                offset into the backing object or offset into the mapping
 *                for anonymous memory. For process policy a per-process
 *                counter is used.
 * bind           Only allocate memory on a specific set of nodes,
 *                no fallback.
 * preferred      Try a specific node first before normal fallback.
 *                As a special case node -1 here means do the allocation
 *                on the local CPU. This is normally identical to default,
 *                but useful to set in a VMA when you have a non default
 *                process policy.
 * default        Allocate on the local node first, or when on a VMA
 *                use the process policy. This is what Linux always did
 *                in a NUMA aware kernel and still does by, ahem, default.
 *
 * The process policy is applied for most non interrupt memory allocations
 * in that process' context. Interrupts ignore the policies and always
 * try to allocate on the local CPU. The VMA policy is only applied for memory
 * allocations for a VMA in the VM.
 *
 * Currently there are a few corner cases in swapping where the policy
 * is not applied, but the majority should be handled. When process policy
 * is used it is not remembered over swap outs/swap ins.
 *
 * Only the highest zone in the zone hierarchy gets policied. Allocations
 * requesting a lower zone just use default policy. This implies that
 * on systems with highmem, kernel lowmem allocations don't get policied.
 * Same with GFP_DMA allocations.
 *
 * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
 * all users and remembered even when nobody has memory mapped.
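
To make the offset-based interleaving above concrete, here is an
illustrative sketch (not code from the patches; the real logic lives in
mm/policy.c and uses a bitmap nodemask, which this excerpt does not show):

/* Illustration only: the page offset into the object deterministically
   picks the node, so repeated faults of one page land on the same node. */
static int interleave_nid(unsigned long pgoff, const int *nodes, int nnodes)
{
	return nodes[pgoff % nnodes];
}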

The following patches implement all this. 

I think these patches are ready for merging in -mm*.

-Andi


* [PATCH] NUMA API for Linux 1/ Core NUMA API code
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
@ 2004-04-06 13:34 ` Andi Kleen
  2004-04-06 13:35 ` NUMA API for Linux 2/ Add x86-64 support Andi Kleen
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:34 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

This is the core NUMA API code. It includes NUMA-policy-aware
wrappers for get_free_pages and alloc_page_vma(). On non-NUMA kernels
these are defined away.

The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.

Adds a vm_policy field to the VMA and a mempolicy field to the process.
The process also has a field for interleaving (il_next). VMA interleaving
uses the offset into the VMA, but that's not possible for process
allocations.
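
Complementing the offset-based sketch in the introduction, the process
counter could behave like this (an assumption about mm/policy.c, which is
not quoted in this mail; il_next is the task_struct field added below):

/* Illustration only: round-robin over the allowed nodes. */
static int interleave_next(short *il_next, const int *nodes, int nnodes)
{
	int nid = nodes[*il_next % nnodes];
	(*il_next)++;		/* advance cursor for the next allocation */
	return nid;
}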

diff -u linux-2.6.5-numa/include/linux/gfp.h-o linux-2.6.5-numa/include/linux/gfp.h
--- linux-2.6.5-numa/include/linux/gfp.h-o	2004-03-21 21:11:55.000000000 +0100
+++ linux-2.6.5-numa/include/linux/gfp.h	2004-04-06 13:36:12.000000000 +0200
@@ -4,6 +4,8 @@
 #include <linux/mmzone.h>
 #include <linux/stddef.h>
 #include <linux/linkage.h>
+#include <linux/config.h>
+
 /*
  * GFP bitmasks..
  */
@@ -72,10 +74,29 @@
 	return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
 }
 
+extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+	if (unlikely(order >= MAX_ORDER))
+		return NULL;
+
+	return alloc_pages_current(gfp_mask, order);
+}
+extern struct page *__alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long off);
+
+extern struct page *alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma, 
+				   unsigned long addr);
+#else
 #define alloc_pages(gfp_mask, order) \
 		alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page(gfp_mask) \
-		alloc_pages_node(numa_node_id(), gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define __alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#endif
+#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 
 extern unsigned long FASTCALL(__get_free_pages(unsigned int gfp_mask, unsigned int order));
 extern unsigned long FASTCALL(get_zeroed_page(unsigned int gfp_mask));
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -12,6 +12,7 @@
 #include <linux/mmzone.h>
 #include <linux/rbtree.h>
 #include <linux/fs.h>
+#include <linux/mempolicy.h>
 
 #ifndef CONFIG_DISCONTIGMEM          /* Don't use mapnrs, do it properly */
 extern unsigned long max_mapnr;
@@ -47,6 +48,9 @@
  *
  * This structure is exactly 64 bytes on ia32.  Please think very, very hard
  * before adding anything to it.
+ * [Now 4 bytes more on 32bit NUMA machines. Sorry. -AK.
+ * But if you want to recover the 4 bytes just remove vm_next. It is redundant
+ * with vm_rb. Will be a lot of editing work though. vm_rb.color is redundant too.] 
  */
 struct vm_area_struct {
 	struct mm_struct * vm_mm;	/* The address space we belong to. */
@@ -77,6 +81,10 @@
 					   units, *not* PAGE_CACHE_SIZE */
 	struct file * vm_file;		/* File we map to (can be NULL). */
 	void * vm_private_data;		/* was vm_pte (shared mem) */
+
+#ifdef CONFIG_NUMA
+	struct mempolicy *vm_policy;	/* NUMA policy for the VMA */
+#endif
 };
 
 /*
@@ -148,6 +156,8 @@
 	void (*close)(struct vm_area_struct * area);
 	struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
 	int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+	int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+	struct mempolicy *(*get_policy)(struct vm_area_struct *vma, unsigned long addr);
 };
 
 /* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -435,6 +445,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
 			unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
@@ -633,6 +645,11 @@
 	return vma;
 }
 
+static inline unsigned long vma_pages(struct vm_area_struct *vma)
+{
+	return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+}
+
 extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
 
 extern unsigned int nr_used_zone_pages(void);
diff -u linux-2.6.5-numa/include/linux/sched.h-o linux-2.6.5-numa/include/linux/sched.h
--- linux-2.6.5-numa/include/linux/sched.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/sched.h	2004-04-06 13:36:12.000000000 +0200
@@ -29,6 +29,7 @@
 #include <linux/completion.h>
 #include <linux/pid.h>
 #include <linux/percpu.h>
+#include <linux/mempolicy.h>
 
 struct exec_domain;
 
@@ -493,6 +494,9 @@
 
 	unsigned long ptrace_message;
 	siginfo_t *last_siginfo; /* For ptrace use.  */
+
+  	struct mempolicy *mempolicy;
+  	short il_next;		/* could be shared with used_math */
 };
 
 static inline pid_t process_group(struct task_struct *tsk)
diff -u linux-2.6.5-numa/kernel/sys.c-o linux-2.6.5-numa/kernel/sys.c
--- linux-2.6.5-numa/kernel/sys.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/kernel/sys.c	2004-04-06 13:36:12.000000000 +0200
@@ -260,6 +260,9 @@
 cond_syscall(sys_shmget)
 cond_syscall(sys_shmdt)
 cond_syscall(sys_shmctl)
+cond_syscall(sys_mbind)
+cond_syscall(sys_get_mempolicy)
+cond_syscall(sys_set_mempolicy)
 
 /* arch-specific weak syscall entries */
 cond_syscall(sys_pciconfig_read)
diff -u linux-2.6.5-numa/mm/Makefile-o linux-2.6.5-numa/mm/Makefile
--- linux-2.6.5-numa/mm/Makefile-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/Makefile	2004-04-06 13:36:12.000000000 +0200
@@ -12,3 +12,4 @@
 			   slab.o swap.o truncate.o vmscan.o $(mmu-y)
 
 obj-$(CONFIG_SWAP)	+= page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_NUMA) 	+= policy.o


* NUMA API for Linux 2/ Add x86-64 support
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
  2004-04-06 13:34 ` [PATCH] NUMA API for Linux 1/ Core NUMA API code Andi Kleen
@ 2004-04-06 13:35 ` Andi Kleen
  2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API system calls on x86-64

diff -u linux-2.6.5-numa/include/asm-x86_64/unistd.h-o linux-2.6.5-numa/include/asm-x86_64/unistd.h
--- linux-2.6.5-numa/include/asm-x86_64/unistd.h-o	2004-03-17 12:17:59.000000000 +0100
+++ linux-2.6.5-numa/include/asm-x86_64/unistd.h	2004-04-06 13:36:12.000000000 +0200
@@ -532,10 +532,14 @@
 __SYSCALL(__NR_utimes, sys_utimes)
 #define __NR_vserver		236
 __SYSCALL(__NR_vserver, sys_ni_syscall)
+#define __NR_mbind 237
+__SYSCALL(__NR_mbind, sys_mbind)
+#define __NR_set_mempolicy 238
+__SYSCALL(__NR_set_mempolicy, sys_set_mempolicy)
+#define __NR_get_mempolicy 239
+__SYSCALL(__NR_get_mempolicy, sys_get_mempolicy)
 
-/* 237,238,239 reserved for NUMA API */
-
-#define __NR_syscall_max __NR_vserver
+#define __NR_syscall_max __NR_get_mempolicy
 #ifndef __NO_STUBS
 
 /* user-visible error numbers are in the range -1 - -4095 */


* [PATCH] NUMA API for Linux 3/ Add i386 support
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
  2004-04-06 13:34 ` [PATCH] NUMA API for Linux 1/ Core NUMA API code Andi Kleen
  2004-04-06 13:35 ` NUMA API for Linux 2/ Add x86-64 support Andi Kleen
@ 2004-04-06 13:35 ` Andi Kleen
  2004-04-06 23:23   ` Andrew Morton
  2004-04-06 13:36 ` [PATCH] NUMA API for Linux 4/ Add IA64 support Andi Kleen
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API system calls for i386

diff -u linux-2.6.5-numa/arch/i386/kernel/entry.S-o linux-2.6.5-numa/arch/i386/kernel/entry.S
--- linux-2.6.5-numa/arch/i386/kernel/entry.S-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/i386/kernel/entry.S	2004-04-06 15:06:46.000000000 +0200
@@ -882,5 +882,8 @@
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
-
+	.long sys_mbind
+	.long sys_get_mempolicy
+	.long sys_set_mempolicy
+	
 syscall_table_size=(.-sys_call_table)
diff -u linux-2.6.5-numa/include/asm-i386/unistd.h-o linux-2.6.5-numa/include/asm-i386/unistd.h
--- linux-2.6.5-numa/include/asm-i386/unistd.h-o	2004-04-06 13:12:19.000000000 +0200
+++ linux-2.6.5-numa/include/asm-i386/unistd.h	2004-04-06 15:07:21.000000000 +0200
@@ -279,8 +279,11 @@
 #define __NR_utimes		271
 #define __NR_fadvise64_64	272
 #define __NR_vserver		273
+#define __NR_mbind		274
+#define __NR_get_mempolicy	275
+#define __NR_set_mempolicy	276
 
-#define NR_syscalls 274
+#define NR_syscalls 277
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 


* [PATCH] NUMA API for Linux 4/ Add IA64 support
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (2 preceding siblings ...)
  2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
@ 2004-04-06 13:36 ` Andi Kleen
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
                   ` (7 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:36 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


Add NUMA API system calls on IA64 and one bug fix required for it.

diff -u linux-2.6.5-numa/arch/ia64/kernel/acpi.c-o linux-2.6.5-numa/arch/ia64/kernel/acpi.c
--- linux-2.6.5-numa/arch/ia64/kernel/acpi.c-o	2004-04-06 13:12:00.000000000 +0200
+++ linux-2.6.5-numa/arch/ia64/kernel/acpi.c	2004-04-06 13:36:12.000000000 +0200
@@ -455,6 +455,7 @@
 	for (i = 0; i < MAX_PXM_DOMAINS; i++) {
 		if (pxm_bit_test(i)) {
 			pxm_to_nid_map[i] = numnodes;
+			node_set_online(numnodes);
 			nid_to_pxm_map[numnodes++] = i;
 		}
 	}
diff -u linux-2.6.5-numa/arch/ia64/kernel/entry.S-o linux-2.6.5-numa/arch/ia64/kernel/entry.S
--- linux-2.6.5-numa/arch/ia64/kernel/entry.S-o	2004-03-21 21:12:05.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/kernel/entry.S	2004-04-06 13:36:12.000000000 +0200
@@ -1501,9 +1501,9 @@
 	data8 sys_clock_nanosleep
 	data8 sys_fstatfs64
 	data8 sys_statfs64
-	data8 sys_ni_syscall
-	data8 sys_ni_syscall			// 1260
-	data8 sys_ni_syscall
+	data8 sys_mbind
+	data8 sys_get_mempolicy			// 1260
+	data8 sys_set_mempolicy
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
 	data8 sys_ni_syscall
diff -u linux-2.6.5-numa/include/asm-ia64/unistd.h-o linux-2.6.5-numa/include/asm-ia64/unistd.h
--- linux-2.6.5-numa/include/asm-ia64/unistd.h-o	2004-04-06 13:12:19.000000000 +0200
+++ linux-2.6.5-numa/include/asm-ia64/unistd.h	2004-04-06 13:36:12.000000000 +0200
@@ -248,9 +248,9 @@
 #define __NR_clock_nanosleep		1256
 #define __NR_fstatfs64			1257
 #define __NR_statfs64			1258
-#define __NR_reserved1			1259	/* reserved for NUMA interface */
-#define __NR_reserved2			1260	/* reserved for NUMA interface */
-#define __NR_reserved3			1261	/* reserved for NUMA interface */
+#define __NR_mbind			1259
+#define __NR_get_mempolicy		1260
+#define __NR_set_mempolicy		1261
 
 #ifdef __KERNEL__
 


* [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (3 preceding siblings ...)
  2004-04-06 13:36 ` [PATCH] NUMA API for Linux 4/ Add IA64 support Andi Kleen
@ 2004-04-06 13:37 ` Andi Kleen
  2004-05-05 16:05   ` Paul Jackson
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 6/ Add shared memory support Andi Kleen
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


NUMA API adds a policy to each VMA. During VMA creation, merging and
splitting, these policies must be handled properly. This patch adds
the necessary calls for that.

It is a no-op when CONFIG_NUMA is not defined.
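
For reference, a sketch of the non-NUMA stubs this depends on. These are
assumed shapes - the real definitions come with the mempolicy.h header
from patch 1, which is not quoted in this mail - but they show how every
hook used below compiles away to nothing:

#ifndef CONFIG_NUMA
#define vma_policy(vma)			NULL
#define vma_set_policy(vma, pol)	do { } while (0)
#define mpol_set_vma_default(vma)	do { } while (0)
#define mpol_copy(pol)			NULL	/* IS_ERR(NULL) is false */
#define mpol_free(pol)			do { } while (0)
#define mpol_equal(a, b)		1
#define vma_mpol_equal(a, b)		1
#endif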

diff -u linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c-o linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c
--- linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/ia32/binfmt_elf32.c	2004-04-06 13:36:12.000000000 +0200
@@ -104,6 +104,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		down_write(&current->mm->mmap_sem);
 		{
 			insert_vm_struct(current->mm, vma);
@@ -184,6 +185,7 @@
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = 0;
+		mpol_set_vma_default(mpnt);
 		insert_vm_struct(current->mm, mpnt);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	}
diff -u linux-2.6.5-numa/arch/ia64/kernel/perfmon.c-o linux-2.6.5-numa/arch/ia64/kernel/perfmon.c
--- linux-2.6.5-numa/arch/ia64/kernel/perfmon.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/arch/ia64/kernel/perfmon.c	2004-04-06 13:36:12.000000000 +0200
@@ -2273,6 +2273,7 @@
 	vma->vm_ops	     = &pfm_vm_ops;
 	vma->vm_pgoff	     = 0;
 	vma->vm_file	     = NULL;
+	mpol_set_vma_default(vma);
 	vma->vm_private_data = ctx;	/* information needed by the pfm_vm_close() function */
 
 	/*
diff -u linux-2.6.5-numa/arch/ia64/mm/init.c-o linux-2.6.5-numa/arch/ia64/mm/init.c
--- linux-2.6.5-numa/arch/ia64/mm/init.c-o	2004-04-06 13:12:00.000000000 +0200
+++ linux-2.6.5-numa/arch/ia64/mm/init.c	2004-04-06 13:36:12.000000000 +0200
@@ -131,6 +131,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		insert_vm_struct(current->mm, vma);
 	}
 
@@ -143,6 +144,7 @@
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
 			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
+			mpol_set_vma_default(vma);
 			insert_vm_struct(current->mm, vma);
 		}
 	}
diff -u linux-2.6.5-numa/arch/m68k/atari/stram.c-o linux-2.6.5-numa/arch/m68k/atari/stram.c
--- linux-2.6.5-numa/arch/m68k/atari/stram.c-o	2004-03-21 21:12:07.000000000 +0100
+++ linux-2.6.5-numa/arch/m68k/atari/stram.c	2004-04-06 13:36:12.000000000 +0200
@@ -752,7 +752,7 @@
 			/* Get a page for the entry, using the existing
 			   swap cache page if there is one.  Otherwise,
 			   get a clean page and read the swap into it. */
-			page = read_swap_cache_async(entry);
+			page = read_swap_cache_async(entry, NULL, 0);
 			if (!page) {
 				swap_free(entry);
 				return -ENOMEM;
diff -u linux-2.6.5-numa/arch/s390/kernel/compat_exec.c-o linux-2.6.5-numa/arch/s390/kernel/compat_exec.c
--- linux-2.6.5-numa/arch/s390/kernel/compat_exec.c-o	2004-03-21 21:12:11.000000000 +0100
+++ linux-2.6.5-numa/arch/s390/kernel/compat_exec.c	2004-04-06 13:36:12.000000000 +0200
@@ -71,6 +71,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c-o linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c
--- linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c-o	2004-04-06 13:12:04.000000000 +0200
+++ linux-2.6.5-numa/arch/x86_64/ia32/ia32_binfmt.c	2004-04-06 13:36:12.000000000 +0200
@@ -360,6 +360,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/fs/exec.c-o linux-2.6.5-numa/fs/exec.c
--- linux-2.6.5-numa/fs/exec.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/fs/exec.c	2004-04-06 13:36:12.000000000 +0200
@@ -430,6 +430,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
diff -u linux-2.6.5-numa/kernel/exit.c-o linux-2.6.5-numa/kernel/exit.c
--- linux-2.6.5-numa/kernel/exit.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/kernel/exit.c	2004-04-06 13:36:12.000000000 +0200
@@ -779,6 +779,7 @@
 	exit_namespace(tsk);
 	exit_itimers(tsk);
 	exit_thread();
+	mpol_free(tsk->mempolicy);
 
 	if (tsk->leader)
 		disassociate_ctty(1);
diff -u linux-2.6.5-numa/kernel/fork.c-o linux-2.6.5-numa/kernel/fork.c
--- linux-2.6.5-numa/kernel/fork.c-o	1970-01-01 01:12:51.000000000 +0100
+++ linux-2.6.5-numa/kernel/fork.c	2004-04-06 13:36:12.000000000 +0200
@@ -268,6 +268,7 @@
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge = 0;
+	struct mempolicy *pol;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -309,6 +310,11 @@
 		if (!tmp)
 			goto fail_nomem;
 		*tmp = *mpnt;
+		pol = mpol_copy(vma_policy(mpnt));
+		retval = PTR_ERR(pol);
+		if (IS_ERR(pol))
+			goto fail_nomem_policy;
+		vma_set_policy(tmp, pol);	
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
@@ -355,6 +361,8 @@
 	flush_tlb_mm(current->mm);
 	up_write(&oldmm->mmap_sem);
 	return retval;
+fail_nomem_policy: 
+	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
 	retval = -ENOMEM;
 fail:
@@ -946,9 +954,16 @@
 	p->security = NULL;
 	p->io_context = NULL;
 
+	p->mempolicy = mpol_copy(p->mempolicy);
+	if (IS_ERR(p->mempolicy)) { 
+		retval = PTR_ERR(p->mempolicy);
+		p->mempolicy = NULL;
+		goto bad_fork_cleanup;
+	}	
+	
 	retval = -ENOMEM;
 	if ((retval = security_task_alloc(p)))
-		goto bad_fork_cleanup;
+		goto bad_fork_cleanup_policy;
 	/* copy all the process information */
 	if ((retval = copy_semundo(clone_flags, p)))
 		goto bad_fork_cleanup_security;
@@ -1088,6 +1103,8 @@
 	exit_sem(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+bad_fork_cleanup_policy:
+	mpol_free(p->mempolicy);
 bad_fork_cleanup:
 	if (p->pid > 0)
 		free_pidmap(p->pid);
diff -u linux-2.6.5-numa/mm/mmap.c-o linux-2.6.5-numa/mm/mmap.c
--- linux-2.6.5-numa/mm/mmap.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/mmap.c	2004-04-06 13:36:12.000000000 +0200
@@ -388,7 +388,8 @@
 static int vma_merge(struct mm_struct *mm, struct vm_area_struct *prev,
 			struct rb_node *rb_parent, unsigned long addr, 
 			unsigned long end, unsigned long vm_flags,
-			struct file *file, unsigned long pgoff)
+		     	struct file *file, unsigned long pgoff,
+		        struct mempolicy *policy) 
 {
 	spinlock_t *lock = &mm->page_table_lock;
 	struct inode *inode = file ? file->f_dentry->d_inode : NULL;
@@ -412,6 +413,7 @@
 	 * Can it merge with the predecessor?
 	 */
 	if (prev->vm_end == addr &&
+ 		        mpol_equal(vma_policy(prev), policy) && 
 			is_mergeable_vma(prev, file, vm_flags) &&
 			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
 		struct vm_area_struct *next;
@@ -430,6 +432,7 @@
 		 */
 		next = prev->vm_next;
 		if (next && prev->vm_end == next->vm_start &&
+		    		vma_mpol_equal(prev, next) &&
 				can_vma_merge_before(next, vm_flags, file,
 					pgoff, (end - addr) >> PAGE_SHIFT)) {
 			prev->vm_end = next->vm_end;
@@ -442,6 +445,7 @@
 				fput(file);
 
 			mm->map_count--;
+			mpol_free(vma_policy(next));
 			kmem_cache_free(vm_area_cachep, next);
 			return 1;
 		}
@@ -457,6 +461,8 @@
 	prev = prev->vm_next;
 	if (prev) {
  merge_next:
+ 		if (!mpol_equal(policy, vma_policy(prev)))
+  			return 0;
 		if (!can_vma_merge_before(prev, vm_flags, file,
 				pgoff, (end - addr) >> PAGE_SHIFT))
 			return 0;
@@ -633,7 +639,7 @@
 	/* Can we just expand an old anonymous mapping? */
 	if (!file && !(vm_flags & VM_SHARED) && rb_parent)
 		if (vma_merge(mm, prev, rb_parent, addr, addr + len,
-					vm_flags, NULL, 0))
+					vm_flags, NULL, pgoff, NULL))
 			goto out;
 
 	/*
@@ -656,6 +662,7 @@
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
 	vma->vm_next = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	if (file) {
@@ -695,7 +702,9 @@
 	addr = vma->vm_start;
 
 	if (!file || !rb_parent || !vma_merge(mm, prev, rb_parent, addr,
-				addr + len, vma->vm_flags, file, pgoff)) {
+					      vma->vm_end, 
+					      vma->vm_flags, file, pgoff,
+					      vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -705,6 +714,7 @@
 				atomic_inc(&inode->i_writecount);
 			fput(file);
 		}
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1120,6 +1130,7 @@
 
 	remove_shared_vm_struct(area);
 
+	mpol_free(vma_policy(area));
 	if (area->vm_ops && area->vm_ops->close)
 		area->vm_ops->close(area);
 	if (area->vm_file)
@@ -1202,6 +1213,7 @@
 int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
+	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	struct address_space *mapping = NULL;
 
@@ -1224,6 +1236,13 @@
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
+	pol = mpol_copy(vma_policy(vma)); 
+	if (IS_ERR(pol)) { 
+		kmem_cache_free(vm_area_cachep, new); 
+		return PTR_ERR(pol);
+	} 
+	vma_set_policy(new, pol);
+
 	if (new->vm_file)
 		get_file(new->vm_file);
 
@@ -1393,7 +1412,7 @@
 
 	/* Can we just expand an old anonymous mapping? */
 	if (rb_parent && vma_merge(mm, prev, rb_parent, addr, addr + len,
-					flags, NULL, 0))
+					flags, NULL, 0, NULL))
 		goto out;
 
 	/*
@@ -1414,6 +1433,7 @@
 	vma->vm_pgoff = 0;
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -1474,6 +1494,7 @@
 		}
 		if (vma->vm_file)
 			fput(vma->vm_file);
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		vma = next;
 	}
diff -u linux-2.6.5-numa/mm/mprotect.c-o linux-2.6.5-numa/mm/mprotect.c
--- linux-2.6.5-numa/mm/mprotect.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/mprotect.c	2004-04-06 13:36:12.000000000 +0200
@@ -124,6 +124,8 @@
 		return 0;
 	if (vma->vm_file || (vma->vm_flags & VM_SHARED))
 		return 0;
+	if (!vma_mpol_equal(vma, prev))
+		return 0;
 
 	/*
 	 * If the whole area changes to the protection of the previous one
@@ -135,6 +137,7 @@
 		__vma_unlink(mm, vma, prev);
 		spin_unlock(&mm->page_table_lock);
 
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		mm->map_count--;
 		return 1;
@@ -317,12 +320,14 @@
 
 	if (next && prev->vm_end == next->vm_start &&
 			can_vma_merge(next, prev->vm_flags) &&
+	    	vma_mpol_equal(prev, next) &&
 			!prev->vm_file && !(prev->vm_flags & VM_SHARED)) {
 		spin_lock(&prev->vm_mm->page_table_lock);
 		prev->vm_end = next->vm_end;
 		__vma_unlink(prev->vm_mm, next, prev);
 		spin_unlock(&prev->vm_mm->page_table_lock);
 
+		mpol_free(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
 		prev->vm_mm->map_count--;
 	}
diff -u linux-2.6.5-numa/mm/mremap.c-o linux-2.6.5-numa/mm/mremap.c
--- linux-2.6.5-numa/mm/mremap.c-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/mremap.c	2004-04-06 13:36:12.000000000 +0200
@@ -199,6 +199,7 @@
 	next = find_vma_prev(mm, new_addr, &prev);
 	if (next) {
 		if (prev && prev->vm_end == new_addr &&
+		    mpol_equal(vma_policy(prev), vma_policy(next)) &&
 		    can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
 					!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
@@ -208,6 +209,7 @@
 			if (next != prev->vm_next)
 				BUG();
 			if (prev->vm_end == next->vm_start &&
+			                vma_mpol_equal(next, prev) && 
 					can_vma_merge(next, prev->vm_flags)) {
 				spin_lock(&mm->page_table_lock);
 				prev->vm_end = next->vm_end;
@@ -216,10 +218,12 @@
 				if (vma == next)
 					vma = prev;
 				mm->map_count--;
+				mpol_free(vma_policy(next));
 				kmem_cache_free(vm_area_cachep, next);
 			}
 		} else if (next->vm_start == new_addr + new_len &&
 			  	can_vma_merge(next, vma->vm_flags) &&
+			        vma_mpol_equal(next, vma) &&
 				!vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
 			next->vm_start = new_addr;
@@ -229,6 +233,7 @@
 	} else {
 		prev = find_vma(mm, new_addr-1);
 		if (prev && prev->vm_end == new_addr &&
+		    vma_mpol_equal(prev, vma) &&
 		    can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
 				!(vma->vm_flags & VM_SHARED)) {
 			spin_lock(&mm->page_table_lock);
@@ -250,7 +255,12 @@
 		unsigned long vm_locked = vma->vm_flags & VM_LOCKED;
 
 		if (allocated_vma) {
+			struct mempolicy *pol;
 			*new_vma = *vma;
+			pol = mpol_copy(vma_policy(new_vma));
+			if (IS_ERR(pol))
+				goto out_vma;
+			vma_set_policy(new_vma, pol);	
 			INIT_LIST_HEAD(&new_vma->shared);
 			new_vma->vm_start = new_addr;
 			new_vma->vm_end = new_addr+new_len;
@@ -291,6 +301,7 @@
 		}
 		return new_addr;
 	}
+ out_vma:
 	if (allocated_vma)
 		kmem_cache_free(vm_area_cachep, new_vma);
  out:


* [PATCH] NUMA API for Linux 6/ Add shared memory support
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (4 preceding siblings ...)
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
@ 2004-04-06 13:37 ` Andi Kleen
  2004-04-06 13:38 ` [PATCH] NUMA API for Linux 7/ Add statistics Andi Kleen
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:37 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA API support to tmpfs and hugetlbfs. Shared memory
is a bit of a special case for NUMA policy. Normally policy is associated
with VMAs or with processes, but for a shared memory segment you really
want to share the policy. The core NUMA API has code for that;
this patch adds the necessary changes to tmpfs and hugetlbfs.

First it changes the custom swapping code in tmpfs to follow the policy
set via VMAs.

It is also useful to have a "backing store" of policy that preserves
the policy even when nobody has the shared memory segment mapped. This
allows command line tools to preconfigure policy, which is then
later used by programs.
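
As an illustration, a hedged sketch of preconfiguring such a policy from
user space with mbind() on a SysV segment (x86-64 syscall number from
patch 2; the MPOL_BIND value and the maxnode convention are assumptions):

#include <unistd.h>
#include <sys/syscall.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define __NR_mbind 237		/* x86-64 number from patch 2 */
#define MPOL_BIND 2		/* assumed value; see linux/mempolicy.h */

/* Bind a fresh SysV shm segment to node 0 before anything faults it
   in. Because shmem keeps the policy in the shared_policy backing
   store, later attachers see the same policy. */
int bind_segment_to_node0(size_t len)
{
	int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
	void *p;
	unsigned long mask = 1UL << 0;	/* node 0 only */

	if (id < 0 || (p = shmat(id, NULL, 0)) == (void *)-1)
		return -1;
	return syscall(__NR_mbind, p, len, MPOL_BIND, &mask, 2, 0);
}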

Note that hugetlbfs needs more changes - it also has to be switched to
lazy allocation, otherwise the prefaulting prevents mbind() from
working. A later patch in this series does that.

diff -u linux-2.6.5-numa/fs/hugetlbfs/inode.c-o linux-2.6.5-numa/fs/hugetlbfs/inode.c
--- linux-2.6.5-numa/fs/hugetlbfs/inode.c-o	2004-04-06 13:12:17.000000000 +0200
+++ linux-2.6.5-numa/fs/hugetlbfs/inode.c	2004-04-06 13:36:12.000000000 +0200
@@ -375,6 +375,7 @@
 
 	inode = new_inode(sb);
 	if (inode) {
+		struct hugetlbfs_inode_info *info;
 		inode->i_mode = mode;
 		inode->i_uid = uid;
 		inode->i_gid = gid;
@@ -383,6 +384,8 @@
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
+		info = HUGETLBFS_I(inode);
+		mpol_shared_policy_init(&info->policy);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
@@ -510,6 +513,32 @@
 	}
 }
 
+static kmem_cache_t *hugetlbfs_inode_cachep;
+
+static struct inode *hugetlbfs_alloc_inode(struct super_block *sb)
+{
+	struct hugetlbfs_inode_info *p = kmem_cache_alloc(hugetlbfs_inode_cachep,
+							  SLAB_KERNEL);
+	if (!p)
+		return NULL;
+	return &p->vfs_inode;
+}
+
+static void init_once(void *foo, kmem_cache_t *cachep, unsigned long flags)
+{
+	struct hugetlbfs_inode_info *ei = (struct hugetlbfs_inode_info *) foo;
+
+	if ((flags & (SLAB_CTOR_VERIFY|SLAB_CTOR_CONSTRUCTOR)) ==
+	    SLAB_CTOR_CONSTRUCTOR)
+		inode_init_once(&ei->vfs_inode);
+}
+
+static void hugetlbfs_destroy_inode(struct inode *inode)
+{
+	mpol_free_shared_policy(&HUGETLBFS_I(inode)->policy);  
+	kmem_cache_free(hugetlbfs_inode_cachep, HUGETLBFS_I(inode));
+}
+
 static struct address_space_operations hugetlbfs_aops = {
 	.readpage	= hugetlbfs_readpage,
 	.prepare_write	= hugetlbfs_prepare_write,
@@ -541,6 +570,8 @@
 };
 
 static struct super_operations hugetlbfs_ops = {
+	.alloc_inode    = hugetlbfs_alloc_inode,
+	.destroy_inode  = hugetlbfs_destroy_inode,
 	.statfs		= hugetlbfs_statfs,
 	.drop_inode	= hugetlbfs_drop_inode,
 	.put_super	= hugetlbfs_put_super,
@@ -755,9 +786,16 @@
 	int error;
 	struct vfsmount *vfsmount;
 
+	hugetlbfs_inode_cachep = kmem_cache_create("hugetlbfs_inode_cache",
+				     sizeof(struct hugetlbfs_inode_info),
+				     0, SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT,
+				     init_once, NULL);
+	if (hugetlbfs_inode_cachep == NULL)
+		return -ENOMEM;
+
 	error = register_filesystem(&hugetlbfs_fs_type);
 	if (error)
-		return error;
+		goto out;
 
 	vfsmount = kern_mount(&hugetlbfs_fs_type);
 
@@ -767,11 +805,16 @@
 	}
 
 	error = PTR_ERR(vfsmount);
+
+ out:
+	if (error)
+		kmem_cache_destroy(hugetlbfs_inode_cachep);		
 	return error;
 }
 
 static void __exit exit_hugetlbfs_fs(void)
 {
+	kmem_cache_destroy(hugetlbfs_inode_cachep);
 	unregister_filesystem(&hugetlbfs_fs_type);
 }
 
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -435,6 +445,8 @@
 
 struct page *shmem_nopage(struct vm_area_struct * vma,
 			unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
 struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
 void shmem_lock(struct file * file, int lock);
 int shmem_zero_setup(struct vm_area_struct *);
diff -u linux-2.6.5-numa/include/linux/shmem_fs.h-o linux-2.6.5-numa/include/linux/shmem_fs.h
--- linux-2.6.5-numa/include/linux/shmem_fs.h-o	2004-03-21 21:11:55.000000000 +0100
+++ linux-2.6.5-numa/include/linux/shmem_fs.h	2004-04-06 13:36:12.000000000 +0200
@@ -2,6 +2,7 @@
 #define __SHMEM_FS_H
 
 #include <linux/swap.h>
+#include <linux/mempolicy.h>
 
 /* inode in-kernel data */
 
@@ -15,6 +16,7 @@
 	unsigned long		alloced;    /* data pages allocated to file */
 	unsigned long		swapped;    /* subtotal assigned to swap */
 	unsigned long		flags;
+	struct shared_policy     policy;
 	struct list_head	list;
 	struct inode		vfs_inode;
 };
diff -u linux-2.6.5-numa/ipc/shm.c-o linux-2.6.5-numa/ipc/shm.c
--- linux-2.6.5-numa/ipc/shm.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/ipc/shm.c	2004-04-06 13:36:12.000000000 +0200
@@ -163,6 +163,8 @@
 	.open	= shm_open,	/* callback for a new vm-area open */
 	.close	= shm_close,	/* callback for when the vm-area is released */
 	.nopage	= shmem_nopage,
+	.set_policy = shmem_set_policy,
+	.get_policy = shmem_get_policy,
 };
 
 static int newseg (key_t key, int shmflg, size_t size)
diff -u linux-2.6.5-numa/mm/shmem.c-o linux-2.6.5-numa/mm/shmem.c
--- linux-2.6.5-numa/mm/shmem.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/shmem.c	2004-04-06 13:36:12.000000000 +0200
@@ -8,6 +8,7 @@
  *		 2002 Red Hat Inc.
  * Copyright (C) 2002-2003 Hugh Dickins.
  * Copyright (C) 2002-2003 VERITAS Software Corporation.
+ * Copyright (C) 2004 Andi Kleen, SuSE Labs
  *
  * This file is released under the GPL.
  */
@@ -37,8 +38,10 @@
 #include <linux/vfs.h>
 #include <linux/blkdev.h>
 #include <linux/security.h>
+#include <linux/swapops.h>
 #include <asm/uaccess.h>
 #include <asm/div64.h>
+#include <asm/pgtable.h>
 
 /* This magic number is used in glibc for posix shared memory */
 #define TMPFS_MAGIC	0x01021994
@@ -758,6 +761,72 @@
 	return WRITEPAGE_ACTIVATE;	/* Return with the page locked */
 }
 
+#ifdef CONFIG_NUMA
+static struct page *shmem_swapin_async(struct shared_policy *p,
+				       swp_entry_t entry, unsigned long idx)
+{
+	struct page *page;
+	struct vm_area_struct pvma; 
+	/* Create a pseudo vma that just contains the policy */
+	memset(&pvma, 0, sizeof(struct vm_area_struct));
+	pvma.vm_end = PAGE_SIZE;
+	pvma.vm_pgoff = idx;
+	pvma.vm_policy = mpol_shared_policy_lookup(p, idx); 
+	page = read_swap_cache_async(entry, &pvma, 0);
+	mpol_free(pvma.vm_policy);
+	return page; 
+} 
+
+struct page *shmem_swapin(struct shmem_inode_info *info, swp_entry_t entry, 
+			  unsigned long idx)
+{
+	struct shared_policy *p = &info->policy;
+	int i, num;
+	struct page *page;
+	unsigned long offset;
+
+	num = valid_swaphandles(entry, &offset);
+	for (i = 0; i < num; offset++, i++) {
+		page = shmem_swapin_async(p, swp_entry(swp_type(entry), offset), idx);
+		if (!page)
+			break;
+		page_cache_release(page);
+	}
+	lru_add_drain();	/* Push any new pages onto the LRU now */
+	return shmem_swapin_async(p, entry, idx); 
+}
+
+static struct page *
+shmem_alloc_page(unsigned long gfp, struct shmem_inode_info *info, 
+		 unsigned long idx)
+{
+	struct vm_area_struct pvma;
+	struct page *page;
+
+	memset(&pvma, 0, sizeof(struct vm_area_struct)); 
+	pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, idx); 
+	pvma.vm_pgoff = idx;
+	pvma.vm_end = PAGE_SIZE;
+	page = alloc_page_vma(gfp, &pvma, 0); 
+	mpol_free(pvma.vm_policy);
+	return page; 
+} 
+#else
+static inline struct page *
+shmem_swapin(struct shmem_inode_info *info,swp_entry_t entry,unsigned long idx)
+{ 
+	swapin_readahead(entry, 0, NULL);
+	return read_swap_cache_async(entry, NULL, 0);
+} 
+
+static inline struct page *
+shmem_alloc_page(unsigned long gfp,struct shmem_inode_info *info,
+				 unsigned long idx)
+{
+	return alloc_page(gfp);  
+}
+#endif
+
 /*
  * shmem_getpage - either get the page from swap or allocate a new one
  *
@@ -815,8 +884,7 @@
 			if (majmin == VM_FAULT_MINOR && type)
 				inc_page_state(pgmajfault);
 			majmin = VM_FAULT_MAJOR;
-			swapin_readahead(swap);
-			swappage = read_swap_cache_async(swap);
+			swappage = shmem_swapin(info, swap, idx); 
 			if (!swappage) {
 				spin_lock(&info->lock);
 				entry = shmem_swp_alloc(info, idx, sgp);
@@ -921,7 +989,9 @@
 
 		if (!filepage) {
 			spin_unlock(&info->lock);
-			filepage = page_cache_alloc(mapping);
+			filepage = shmem_alloc_page(mapping_gfp_mask(mapping), 
+						    info,
+						    idx); 
 			if (!filepage) {
 				shmem_free_block(inode);
 				error = -ENOMEM;
@@ -1046,6 +1116,19 @@
 	return 0;
 }
 
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+{
+	struct inode *i = vma->vm_file->f_dentry->d_inode;
+	return mpol_set_shared_policy(&SHMEM_I(i)->policy, vma, new);
+}
+
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+	struct inode *i = vma->vm_file->f_dentry->d_inode;
+	unsigned long idx = ((addr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	return mpol_shared_policy_lookup(&SHMEM_I(i)->policy, idx);	
+} 
+
 void shmem_lock(struct file *file, int lock)
 {
 	struct inode *inode = file->f_dentry->d_inode;
@@ -1094,6 +1177,7 @@
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
+		mpol_shared_policy_init(&info->policy);
 		info->flags = VM_ACCOUNT;
 		switch (mode & S_IFMT) {
 		default:
@@ -1789,6 +1873,7 @@
 
 static void shmem_destroy_inode(struct inode *inode)
 {
+	mpol_free_shared_policy(&SHMEM_I(inode)->policy);  
 	kmem_cache_free(shmem_inode_cachep, SHMEM_I(inode));
 }
 
@@ -1873,6 +1958,8 @@
 static struct vm_operations_struct shmem_vm_ops = {
 	.nopage		= shmem_nopage,
 	.populate	= shmem_populate,
+	.set_policy     = shmem_set_policy,
+	.get_policy     = shmem_get_policy,
 };
 
 static struct super_block *shmem_get_sb(struct file_system_type *fs_type,
diff -u linux-2.6.5-numa/include/linux/hugetlb.h-o linux-2.6.5-numa/include/linux/hugetlb.h
--- linux-2.6.5-numa/include/linux/hugetlb.h-o	2004-04-06 13:12:21.000000000 +0200
+++ linux-2.6.5-numa/include/linux/hugetlb.h	2004-04-06 13:36:12.000000000 +0200
@@ -3,6 +3,8 @@
 
 #ifdef CONFIG_HUGETLB_PAGE
 
+#include <linux/mempolicy.h>
+
 struct ctl_table;
 
 static inline int is_vm_hugetlb_page(struct vm_area_struct *vma)
@@ -103,6 +105,17 @@
 	spinlock_t	stat_lock;
 };
 
+
+struct hugetlbfs_inode_info { 
+	struct shared_policy policy;
+	struct inode vfs_inode;
+}; 
+
+static inline struct hugetlbfs_inode_info *HUGETLBFS_I(struct inode *inode)
+{
+	return container_of(inode, struct hugetlbfs_inode_info, vfs_inode);
+}
+
 static inline struct hugetlbfs_sb_info *HUGETLBFS_SB(struct super_block *sb)
 {
 	return sb->s_fs_info;
diff -u linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o	2004-04-06 13:11:59.000000000 +0200
+++ linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c	2004-04-06 13:36:12.000000000 +0200
@@ -547,6 +640,13 @@
 	return NULL;
 }
 
+static int hugetlb_set_policy(struct vm_area_struct *vma, struct mempolicy *new)
+{
+	struct inode *inode = vma->vm_file->f_dentry->d_inode;
+	return mpol_set_shared_policy(&HUGETLBFS_I(inode)->policy, vma, new);
+}
+
 struct vm_operations_struct hugetlb_vm_ops = {
 	.nopage = hugetlb_nopage,
+	.set_policy = hugetlb_set_policy,	
 };


* [PATCH] NUMA API for Linux 7/ Add statistics
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (5 preceding siblings ...)
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 6/ Add shared memory support Andi Kleen
@ 2004-04-06 13:38 ` Andi Kleen
  2004-04-06 13:39 ` [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory Andi Kleen
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:38 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA hit/miss statistics to page allocation and display them
in sysfs.

This is not 100% required for the NUMA API, but without it it is very
difficult to make sure the NUMA API works properly.

The overhead is quite low because all counters are per CPU and are
only maintained when CONFIG_NUMA is defined.
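
Each node then gets a numastat file in sysfs; assuming the usual node
device location, reading it looks like this (the counts below are made
up for illustration, the field names come from the patch):

$ cat /sys/devices/system/node/node0/numastat
numa_hit 48421
numa_miss 102
numa_foreign 87
interleave_hit 1290
local_node 48300
other_node 223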

diff -u linux-2.6.5-numa/include/linux/mmzone.h-o linux-2.6.5-numa/include/linux/mmzone.h
--- linux-2.6.5-numa/include/linux/mmzone.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mmzone.h	2004-04-06 13:36:12.000000000 +0200
@@ -52,6 +52,14 @@
 
 struct per_cpu_pageset {
 	struct per_cpu_pages pcp[2];	/* 0: hot.  1: cold */
+#ifdef CONFIG_NUMA
+	unsigned long numa_hit;		/* allocated in intended node */
+	unsigned long numa_miss;	/* allocated in non intended node */
+	unsigned long numa_foreign;	/* was intended here, hit elsewhere */
+	unsigned long interleave_hit; 	/* interleaver preferred this zone */
+	unsigned long local_node;	/* allocation from local node */
+	unsigned long other_node;	/* allocation from other node */
+#endif
 } ____cacheline_aligned_in_smp;
 
 /*
diff -u linux-2.6.5-numa/mm/page_alloc.c-o linux-2.6.5-numa/mm/page_alloc.c
--- linux-2.6.5-numa/mm/page_alloc.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/page_alloc.c	2004-04-06 13:49:54.000000000 +0200
@@ -447,6 +447,31 @@
 }
 #endif /* CONFIG_PM */
 
+static void zone_statistics(struct zonelist *zonelist, struct zone *z) 
+{ 
+#ifdef CONFIG_NUMA
+	unsigned long flags;
+	int cpu; 
+	pg_data_t *pg = z->zone_pgdat,
+		*orig = zonelist->zones[0]->zone_pgdat;
+	struct per_cpu_pageset *p;
+	local_irq_save(flags); 
+	cpu = smp_processor_id();
+	p = &z->pageset[cpu];
+	if (pg == orig) {
+		z->pageset[cpu].numa_hit++;
+	} else { 
+		p->numa_miss++;
+		zonelist->zones[0]->pageset[cpu].numa_foreign++;
+	}
+	if (pg == NODE_DATA(numa_node_id()))
+		p->local_node++;
+	else
+		p->other_node++;	
+	local_irq_restore(flags);
+#endif
+} 
+
 /*
  * Free a 0-order page
  */
@@ -582,8 +607,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) { 
+					zone_statistics(zonelist, z); 
 		       		goto got_pg;
+			}
 		}
 		min += z->pages_low * sysctl_lower_zone_protection;
 	}
@@ -607,8 +634,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		min += local_min * sysctl_lower_zone_protection;
 	}
@@ -622,8 +651,10 @@
 			struct zone *z = zones[i];
 
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		goto nopage;
 	}
@@ -650,8 +681,10 @@
 		if (z->free_pages >= min ||
 				(!wait && z->free_pages >= z->pages_high)) {
 			page = buffered_rmqueue(z, order, cold);
-			if (page)
+			if (page) {
+				zone_statistics(zonelist, z); 
 				goto got_pg;
+			}
 		}
 		min += z->pages_low * sysctl_lower_zone_protection;
 	}
diff -u linux-2.6.5-numa/drivers/base/node.c-o linux-2.6.5-numa/drivers/base/node.c
--- linux-2.6.5-numa/drivers/base/node.c-o	2004-03-17 12:17:46.000000000 +0100
+++ linux-2.6.5-numa/drivers/base/node.c	2004-04-06 13:36:12.000000000 +0200
@@ -30,13 +30,20 @@
 
 static SYSDEV_ATTR(cpumap,S_IRUGO,node_read_cpumap,NULL);
 
+/* Can be overwritten by architecture specific code. */
+int __attribute__((weak)) hugetlb_report_node_meminfo(int node, char *buf)
+{
+	return 0;
+}
+
 #define K(x) ((x) << (PAGE_SHIFT - 10))
 static ssize_t node_read_meminfo(struct sys_device * dev, char * buf)
 {
+	int n;
 	int nid = dev->id;
 	struct sysinfo i;
 	si_meminfo_node(&i, nid);
-	return sprintf(buf, "\n"
+	n = sprintf(buf, "\n"
 		       "Node %d MemTotal:     %8lu kB\n"
 		       "Node %d MemFree:      %8lu kB\n"
 		       "Node %d MemUsed:      %8lu kB\n"
@@ -51,10 +58,52 @@
 		       nid, K(i.freehigh),
 		       nid, K(i.totalram-i.totalhigh),
 		       nid, K(i.freeram-i.freehigh));
+	n += hugetlb_report_node_meminfo(nid, buf + n);
+	return n;
 }
+
 #undef K 
 static SYSDEV_ATTR(meminfo,S_IRUGO,node_read_meminfo,NULL);
 
+static ssize_t node_read_numastat(struct sys_device * dev, char * buf)
+{ 
+	unsigned long numa_hit, numa_miss, interleave_hit, numa_foreign;
+	unsigned long local_node, other_node;
+	int i, cpu;
+	pg_data_t *pg = NODE_DATA(dev->id);
+	numa_hit = 0; 
+	numa_miss = 0; 
+	interleave_hit = 0; 
+	numa_foreign = 0; 
+	local_node = 0;
+	other_node = 0;
+	for (i = 0; i < MAX_NR_ZONES; i++) { 
+		struct zone *z = &pg->node_zones[i]; 
+		for (cpu = 0; cpu < NR_CPUS; cpu++) { 
+			struct per_cpu_pageset *ps = &z->pageset[cpu]; 
+			numa_hit += ps->numa_hit; 
+			numa_miss += ps->numa_miss;
+			numa_foreign += ps->numa_foreign;
+			interleave_hit += ps->interleave_hit;
+			local_node += ps->local_node;
+			other_node += ps->other_node;
+		} 
+	} 
+	return sprintf(buf, 
+		       "numa_hit %lu\n"
+		       "numa_miss %lu\n"
+		       "numa_foreign %lu\n"
+		       "interleave_hit %lu\n"
+		       "local_node %lu\n"
+		       "other_node %lu\n", 
+		       numa_hit,
+		       numa_miss,
+		       numa_foreign,
+		       interleave_hit,
+		       local_node, 
+		       other_node); 
+} 
+static SYSDEV_ATTR(numastat,S_IRUGO,node_read_numastat,NULL);
 
 /*
  * register_node - Setup a driverfs device for a node.
@@ -74,6 +123,7 @@
 	if (!error){
 		sysdev_create_file(&node->sysdev, &attr_cpumap);
 		sysdev_create_file(&node->sysdev, &attr_meminfo);
+		sysdev_create_file(&node->sysdev, &attr_numastat); 
 	}
 	return error;
 }


* [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (6 preceding siblings ...)
  2004-04-06 13:38 ` [PATCH] NUMA API for Linux 7/ Add statistics Andi Kleen
@ 2004-04-06 13:39 ` Andi Kleen
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support Andi Kleen
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:39 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


Change the core VM to use alloc_page_vma() instead of alloc_page().

Change the swap readahead to follow the policy of the VMA.
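
The conversions below follow a simple rule: any allocation that will be
mapped into user space at a known address gains a (vma, address) pair so
the policy layer can pick the node; callers without that context (for
example the swapoff path in mm/swapfile.c) pass (NULL, 0), which leaves
them under the default/process policy.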


diff -u linux-2.6.5-numa/include/linux/swap.h-o linux-2.6.5-numa/include/linux/swap.h
--- linux-2.6.5-numa/include/linux/swap.h-o	2004-03-21 21:11:54.000000000 +0100
+++ linux-2.6.5-numa/include/linux/swap.h	2004-04-06 13:36:12.000000000 +0200
@@ -152,7 +152,7 @@
 extern void out_of_memory(void);
 
 /* linux/mm/memory.c */
-extern void swapin_readahead(swp_entry_t);
+extern void swapin_readahead(swp_entry_t, unsigned long, struct vm_area_struct *);
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalram_pages;
@@ -216,7 +216,8 @@
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
 extern struct page * lookup_swap_cache(swp_entry_t);
-extern struct page * read_swap_cache_async(swp_entry_t);
+extern struct page * read_swap_cache_async(swp_entry_t, struct vm_area_struct *vma, 
+					   unsigned long addr);
 
 /* linux/mm/swapfile.c */
 extern int total_swap_pages;
@@ -257,7 +258,7 @@
 #define free_swap_and_cache(swp)		/*NOTHING*/
 #define swap_duplicate(swp)			/*NOTHING*/
 #define swap_free(swp)				/*NOTHING*/
-#define read_swap_cache_async(swp)		NULL
+#define read_swap_cache_async(swp,vma,addr)	NULL
 #define lookup_swap_cache(swp)			NULL
 #define valid_swaphandles(swp, off)		0
 #define can_share_swap_page(p)			0
diff -u linux-2.6.5-numa/mm/memory.c-o linux-2.6.5-numa/mm/memory.c
--- linux-2.6.5-numa/mm/memory.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/memory.c	2004-04-06 13:36:12.000000000 +0200
@@ -1056,7 +1056,7 @@
 	pte_chain = pte_chain_alloc(GFP_KERNEL);
 	if (!pte_chain)
 		goto no_pte_chain;
-	new_page = alloc_page(GFP_HIGHUSER);
+	new_page = alloc_page_vma(GFP_HIGHUSER,vma,address);
 	if (!new_page)
 		goto no_new_page;
 	copy_cow_page(old_page,new_page,address);
@@ -1210,9 +1210,17 @@
  * (1 << page_cluster) entries in the swap area. This method is chosen
  * because it doesn't cost us any seek time.  We also make sure to queue
  * the 'original' request together with the readahead ones...  
+ * 
+ * This has been extended to use the NUMA policies from the mm triggering
+ * the readahead.
+ * 
+ * Caller must hold down_read on the vma->vm_mm if vma is not NULL.
  */
-void swapin_readahead(swp_entry_t entry)
+void swapin_readahead(swp_entry_t entry, unsigned long addr,struct vm_area_struct *vma) 
 {
+#ifdef CONFIG_NUMA
+	struct vm_area_struct *next_vma = vma ? vma->vm_next : NULL;
+#endif
 	int i, num;
 	struct page *new_page;
 	unsigned long offset;
@@ -1224,10 +1232,31 @@
 	for (i = 0; i < num; offset++, i++) {
 		/* Ok, do the async read-ahead now */
 		new_page = read_swap_cache_async(swp_entry(swp_type(entry),
-						offset));
+							   offset), vma, addr); 
 		if (!new_page)
 			break;
 		page_cache_release(new_page);
+#ifdef CONFIG_NUMA
+		/* 
+		 * Find the next applicable VMA for the NUMA policy.
+		 */
+		addr += PAGE_SIZE;
+		if (addr == 0) 
+			vma = NULL;
+		if (vma) { 
+			if (addr >= vma->vm_end) { 
+				vma = next_vma;
+				next_vma = vma ? vma->vm_next : NULL;
+			}
+			if (vma && addr < vma->vm_start) 
+				vma = NULL; 
+		} else { 
+			if (next_vma && addr >= next_vma->vm_start) { 
+				vma = next_vma;
+				next_vma = vma->vm_next;
+			}
+		} 
+#endif
 	}
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }
@@ -1250,8 +1279,8 @@
 	spin_unlock(&mm->page_table_lock);
 	page = lookup_swap_cache(entry);
 	if (!page) {
-		swapin_readahead(entry);
-		page = read_swap_cache_async(entry);
+ 		swapin_readahead(entry, address, vma);
+ 		page = read_swap_cache_async(entry, vma, address);
 		if (!page) {
 			/*
 			 * Back out if somebody else faulted in this pte while
@@ -1356,7 +1385,7 @@
 		pte_unmap(page_table);
 		spin_unlock(&mm->page_table_lock);
 
-		page = alloc_page(GFP_HIGHUSER);
+		page = alloc_page_vma(GFP_HIGHUSER,vma,addr);
 		if (!page)
 			goto no_mem;
 		clear_user_highpage(page, addr);
@@ -1448,7 +1477,7 @@
 	 * Should we do an early C-O-W break?
 	 */
 	if (write_access && !(vma->vm_flags & VM_SHARED)) {
-		struct page * page = alloc_page(GFP_HIGHUSER);
+		struct page * page = alloc_page_vma(GFP_HIGHUSER,vma,address);
 		if (!page)
 			goto oom;
 		copy_user_highpage(page, new_page, address);
diff -u linux-2.6.5-numa/mm/swap_state.c-o linux-2.6.5-numa/mm/swap_state.c
--- linux-2.6.5-numa/mm/swap_state.c-o	2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-numa/mm/swap_state.c	2004-04-06 13:36:13.000000000 +0200
@@ -331,7 +331,8 @@
  * A failure return means that either the page allocation failed or that
  * the swap entry is no longer in use.
  */
-struct page * read_swap_cache_async(swp_entry_t entry)
+struct page * 
+read_swap_cache_async(swp_entry_t entry, struct vm_area_struct *vma, unsigned long addr)
 {
 	struct page *found_page, *new_page = NULL;
 	int err;
@@ -351,7 +352,7 @@
 		 * Get a new page to read into from swap.
 		 */
 		if (!new_page) {
-			new_page = alloc_page(GFP_HIGHUSER);
+			new_page = alloc_page_vma(GFP_HIGHUSER, vma, addr);
 			if (!new_page)
 				break;		/* Out of memory */
 		}
diff -u linux-2.6.5-numa/mm/swapfile.c-o linux-2.6.5-numa/mm/swapfile.c
--- linux-2.6.5-numa/mm/swapfile.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/swapfile.c	2004-04-06 13:36:13.000000000 +0200
@@ -607,7 +607,7 @@
 		 */
 		swap_map = &si->swap_map[i];
 		entry = swp_entry(type, i);
-		page = read_swap_cache_async(entry);
+		page = read_swap_cache_async(entry, NULL, 0);
 		if (!page) {
 			/*
 			 * Either swap_duplicate() failed because entry


* [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (7 preceding siblings ...)
  2004-04-06 13:39 ` [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory Andi Kleen
@ 2004-04-06 13:40 ` Andi Kleen
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 10/ Bitmap bugfix Andi Kleen
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Add NUMA policy support to i386/x86-64 hugetlbfs and switch it 
over to lazy allocation instead of prefaulting.

The NUMA policy support applies the current process/VMA policy to
huge page allocation.

It also switches hugetlbfs to lazy allocation, because otherwise
mbind() cannot work after mmap: the memory has already been allocated.
This doesn't do any prereservation; when a process runs out of
huge pages it will get a SIGBUS.

There are currently various proposals on linux-kernel to add preallocation
for this; once one of those patches turns out to be good it would be
best to replace this one with it (and port the mpol_* changes over).
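
The hugetlbfs changes below rely on two helpers from the core API,
mpol_first_node() and mpol_node_valid(). Their assumed semantics (the
real definitions are part of patch 1's mempolicy code, not quoted here):

/* Assumed behaviour, illustration only:
 * mpol_first_node(vma, addr)      - best node to try first for this
 *                                   fault; numa_node_id() for default
 * mpol_node_valid(nid, vma, addr) - nonzero if the policy permits nid;
 *                                   filters the fallback scan below
 */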

diff -u linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c
--- linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c-o	2004-04-06 13:11:59.000000000 +0200
+++ linux-2.6.5-numa/arch/i386/mm/hugetlbpage.c	2004-04-06 13:36:12.000000000 +0200
@@ -15,14 +15,17 @@
 #include <linux/module.h>
 #include <linux/err.h>
 #include <linux/sysctl.h>
+#include <linux/mempolicy.h>
 #include <asm/mman.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
 #include <asm/tlbflush.h>
 
-static long    htlbpagemem;
+/* AK: this should be all moved into the pgdat */
+
+static long    htlbpagemem[MAX_NUMNODES];
 int     htlbpage_max;
-static long    htlbzone_pages;
+static long    htlbzone_pages[MAX_NUMNODES];
 
 static struct list_head hugepage_freelists[MAX_NUMNODES];
 static spinlock_t htlbpage_lock = SPIN_LOCK_UNLOCKED;
@@ -33,14 +36,15 @@
 		&hugepage_freelists[page_zone(page)->zone_pgdat->node_id]);
 }
 
-static struct page *dequeue_huge_page(void)
+static struct page *dequeue_huge_page(struct vm_area_struct *vma, unsigned long addr)
 {
-	int nid = numa_node_id();
+	int nid = mpol_first_node(vma, addr); 
 	struct page *page = NULL;
 
 	if (list_empty(&hugepage_freelists[nid])) {
 		for (nid = 0; nid < MAX_NUMNODES; ++nid)
-			if (!list_empty(&hugepage_freelists[nid]))
+			if (mpol_node_valid(nid, vma, addr) && 
+			    !list_empty(&hugepage_freelists[nid]))
 				break;
 	}
 	if (nid >= 0 && nid < MAX_NUMNODES && !list_empty(&hugepage_freelists[nid])) {
@@ -61,18 +65,18 @@
 
 static void free_huge_page(struct page *page);
 
-static struct page *alloc_hugetlb_page(void)
+static struct page *alloc_hugetlb_page(struct vm_area_struct *vma, unsigned long addr)
 {
 	int i;
 	struct page *page;
 
 	spin_lock(&htlbpage_lock);
-	page = dequeue_huge_page();
+	page = dequeue_huge_page(vma, addr);
 	if (!page) {
 		spin_unlock(&htlbpage_lock);
 		return NULL;
 	}
-	htlbpagemem--;
+	htlbpagemem[page_zone(page)->zone_pgdat->node_id]--;
 	spin_unlock(&htlbpage_lock);
 	set_page_count(page, 1);
 	page->lru.prev = (void *)free_huge_page;
@@ -284,7 +288,7 @@
 
 	spin_lock(&htlbpage_lock);
 	enqueue_huge_page(page);
-	htlbpagemem++;
+	htlbpagemem[page_zone(page)->zone_pgdat->node_id]++;
 	spin_unlock(&htlbpage_lock);
 }
 
@@ -329,41 +333,49 @@
 	spin_unlock(&mm->page_table_lock);
 }
 
-int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+/* page_table_lock held on entry. */
+static int 
+hugetlb_alloc_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+			       unsigned long addr, int write_access)
 {
-	struct mm_struct *mm = current->mm;
-	unsigned long addr;
-	int ret = 0;
-
-	BUG_ON(vma->vm_start & ~HPAGE_MASK);
-	BUG_ON(vma->vm_end & ~HPAGE_MASK);
-
-	spin_lock(&mm->page_table_lock);
-	for (addr = vma->vm_start; addr < vma->vm_end; addr += HPAGE_SIZE) {
 		unsigned long idx;
-		pte_t *pte = huge_pte_alloc(mm, addr);
-		struct page *page;
+	int ret;
+	pte_t *pte;
+	struct page *page = NULL;
+	struct address_space *mapping = vma->vm_file->f_mapping;
 
+	pte = huge_pte_alloc(mm, addr); 
 		if (!pte) {
-			ret = -ENOMEM;
+		ret = VM_FAULT_OOM;
 			goto out;
 		}
-		if (!pte_none(*pte))
-			continue;
+
+		/* Handle race */
+		if (!pte_none(*pte)) { 
+			ret = VM_FAULT_MINOR;
+			goto flush; 
+		}
 
 		idx = ((addr - vma->vm_start) >> HPAGE_SHIFT)
 			+ (vma->vm_pgoff >> (HPAGE_SHIFT - PAGE_SHIFT));
 		page = find_get_page(mapping, idx);
 		if (!page) {
-			/* charge the fs quota first */
-			if (hugetlb_get_quota(mapping)) {
-				ret = -ENOMEM;
+		/* Should do this at prefault time, but that gets us into
+		   trouble with freeing right now. */
+		ret = hugetlb_get_quota(mapping);
+		if (ret) {
+			ret = VM_FAULT_OOM;
 				goto out;
 			}
-			page = alloc_hugetlb_page();
+		
+			page = alloc_hugetlb_page(vma, addr);
 			if (!page) {
 				hugetlb_put_quota(mapping);
-				ret = -ENOMEM;
+			
+			/* Instead of OOMing here could just transparently use
+			   small pages. */
+			
+				ret = VM_FAULT_OOM;
 				goto out;
 			}
 			ret = add_to_page_cache(page, mapping, idx, GFP_ATOMIC);
@@ -371,23 +383,64 @@
 			if (ret) {
 				hugetlb_put_quota(mapping);
 				free_huge_page(page);
+				ret = VM_FAULT_SIGBUS;
 				goto out;
 			}
-		}
+		ret = VM_FAULT_MAJOR; 
+	} else
+		ret = VM_FAULT_MINOR;
+		
 		set_huge_pte(mm, vma, page, pte, vma->vm_flags & VM_WRITE);
-	}
-out:
+
+ flush:
+	/* Don't need to flush other CPUs. They will just do a page
+	   fault and flush it lazily. */
+	__flush_tlb_one(addr);
+	
+ out:
 	spin_unlock(&mm->page_table_lock);
 	return ret;
 }
 
+int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+		       unsigned long address, int write_access)
+{ 
+	pmd_t *pmd;
+	pgd_t *pgd;
+
+	if (write_access && !(vma->vm_flags & VM_WRITE))
+		return VM_FAULT_SIGBUS;
+
+	spin_lock(&mm->page_table_lock);	
+	pgd = pgd_offset(mm, address); 
+	if (pgd_none(*pgd)) 
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	pmd = pmd_offset(pgd, address);
+	if (pmd_none(*pmd))
+		return hugetlb_alloc_fault(mm, vma, address, write_access); 
+
+	BUG_ON(!pmd_large(*pmd)); 
+
+	/* must have been a race. Flush the TLB. NX not supported yet. */ 
+
+	__flush_tlb_one(address); 
+	spin_unlock(&mm->page_table_lock);
+	return VM_FAULT_MINOR;
+} 
+
+int hugetlb_prefault(struct address_space *mapping, struct vm_area_struct *vma)
+{
+	return 0;
+}
+
 static void update_and_free_page(struct page *page)
 {
 	int j;
 	struct page *map;
 
 	map = page;
-	htlbzone_pages--;
+	htlbzone_pages[page_zone(page)->zone_pgdat->node_id]--;
 	for (j = 0; j < (HPAGE_SIZE / PAGE_SIZE); j++) {
 		map->flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
 				1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
@@ -404,6 +457,7 @@
 	struct list_head *p;
 	struct page *page, *map;
 
+	page = NULL;
 	map = NULL;
 	spin_lock(&htlbpage_lock);
 	/* all lowmem is on node 0 */
@@ -411,7 +465,7 @@
 		if (map) {
 			list_del(&map->list);
 			update_and_free_page(map);
-			htlbpagemem--;
+			htlbpagemem[page_zone(map)->zone_pgdat->node_id]--;
 			map = NULL;
 			if (++count == 0)
 				break;
@@ -423,49 +477,61 @@
 	if (map) {
 		list_del(&map->list);
 		update_and_free_page(map);
-		htlbpagemem--;
+		htlbpagemem[page_zone(map)->zone_pgdat->node_id]--;
 		count++;
 	}
 	spin_unlock(&htlbpage_lock);
 	return count;
 }
 
+static long all_huge_pages(void)
+{ 
+	long pages = 0;
+	int i;
+	for (i = 0; i < numnodes; i++) 
+		pages += htlbzone_pages[i];
+	return pages;
+} 
+
 static int set_hugetlb_mem_size(int count)
 {
 	int lcount;
 	struct page *page;
-
 	if (count < 0)
 		lcount = count;
-	else
-		lcount = count - htlbzone_pages;
+	else { 
+		lcount = count - all_huge_pages();
+	}
 
 	if (lcount == 0)
-		return (int)htlbzone_pages;
+		return (int)all_huge_pages();
 	if (lcount > 0) {	/* Increase the mem size. */
 		while (lcount--) {
+			int node;
 			page = alloc_fresh_huge_page();
 			if (page == NULL)
 				break;
 			spin_lock(&htlbpage_lock);
 			enqueue_huge_page(page);
-			htlbpagemem++;
-			htlbzone_pages++;
+			node = page_zone(page)->zone_pgdat->node_id;
+			htlbpagemem[node]++;
+			htlbzone_pages[node]++;
 			spin_unlock(&htlbpage_lock);
 		}
-		return (int) htlbzone_pages;
+		goto out;
 	}
 	/* Shrink the memory size. */
 	lcount = try_to_free_low(lcount);
 	while (lcount++) {
-		page = alloc_hugetlb_page();
+		page = alloc_hugetlb_page(NULL, 0);
 		if (page == NULL)
 			break;
 		spin_lock(&htlbpage_lock);
 		update_and_free_page(page);
 		spin_unlock(&htlbpage_lock);
 	}
-	return (int) htlbzone_pages;
+ out:
+	return (int)all_huge_pages();
 }
 
 int hugetlb_sysctl_handler(ctl_table *table, int write,
@@ -498,33 +564,60 @@
 		INIT_LIST_HEAD(&hugepage_freelists[i]);
 
 	for (i = 0; i < htlbpage_max; ++i) {
+		int nid; 
 		page = alloc_fresh_huge_page();
 		if (!page)
 			break;
 		spin_lock(&htlbpage_lock);
 		enqueue_huge_page(page);
+		nid = page_zone(page)->zone_pgdat->node_id;
+		htlbpagemem[nid]++;
+		htlbzone_pages[nid]++;
 		spin_unlock(&htlbpage_lock);
 	}
-	htlbpage_max = htlbpagemem = htlbzone_pages = i;
-	printk("Total HugeTLB memory allocated, %ld\n", htlbpagemem);
+	htlbpage_max = i;
+	printk("Initial HugeTLB pages allocated: %d\n", i);
 	return 0;
 }
 module_init(hugetlb_init);
 
 int hugetlb_report_meminfo(char *buf)
 {
+	int i;
+	long pages = 0, mem = 0;
+	for (i = 0; i < numnodes; i++) {
+		pages += htlbzone_pages[i];
+		mem += htlbpagemem[i];
+	}
+
 	return sprintf(buf,
 			"HugePages_Total: %5lu\n"
 			"HugePages_Free:  %5lu\n"
 			"Hugepagesize:    %5lu kB\n",
-			htlbzone_pages,
-			htlbpagemem,
+			pages,
+			mem,
 			HPAGE_SIZE/1024);
 }
 
+int hugetlb_report_node_meminfo(int node, char *buf)
+{
+	return sprintf(buf,
+			"HugePages_Total: %5lu\n"
+			"HugePages_Free:  %5lu\n"
+			"Hugepagesize:    %5lu kB\n",
+			htlbzone_pages[node],
+			htlbpagemem[node],
+			HPAGE_SIZE/1024);
+}
+
+/* Not accurate with policy */
 int is_hugepage_mem_enough(size_t size)
 {
-	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= htlbpagemem;
+	long pm = 0;
+	int i;
+	for (i = 0; i < numnodes; i++)
+		pm += htlbpagemem[i];
+	return (size + ~HPAGE_MASK)/HPAGE_SIZE <= pm;
 }
 
 /* Return the number pages of memory we physically have, in PAGE_SIZE units. */
diff -u linux-2.6.5-numa/include/linux/mm.h-o linux-2.6.5-numa/include/linux/mm.h
--- linux-2.6.5-numa/include/linux/mm.h-o	2004-04-06 13:12:23.000000000 +0200
+++ linux-2.6.5-numa/include/linux/mm.h	2004-04-06 13:36:12.000000000 +0200
@@ -643,6 +660,9 @@
 extern int remap_page_range(struct vm_area_struct *vma, unsigned long from,
 		unsigned long to, unsigned long size, pgprot_t prot);
 
+extern int arch_hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma, 
+			      unsigned long address, int write_access);
+
 #ifndef CONFIG_DEBUG_PAGEALLOC
 static inline void
 kernel_map_pages(struct page *page, int numpages, int enable)
diff -u linux-2.6.5-numa/mm/memory.c-o linux-2.6.5-numa/mm/memory.c
--- linux-2.6.5-numa/mm/memory.c-o	2004-04-06 13:12:24.000000000 +0200
+++ linux-2.6.5-numa/mm/memory.c	2004-04-06 13:36:12.000000000 +0200
@@ -1604,6 +1633,15 @@
 	return VM_FAULT_MINOR;
 }
 
+
+/* Can be overridden by the architecture */
+int __attribute__((weak)) arch_hugetlb_fault(struct mm_struct *mm, 
+					     struct vm_area_struct *vma, 
+					     unsigned long address, int write_access)
+{
+	return VM_FAULT_SIGBUS;
+}
+
 /*
  * By the time we get here, we already hold the mm semaphore
  */
@@ -1619,7 +1657,7 @@
 	inc_page_state(pgfault);
 
 	if (is_vm_hugetlb_page(vma))
-		return VM_FAULT_SIGBUS;	/* mapping truncation does this. */
+		return arch_hugetlb_fault(mm, vma, address, write_access);
 
 	/*
 	 * We need the page table lock to synchronize with kswapd

* [PATCH] NUMA API for Linux 10/ Bitmap bugfix
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (8 preceding siblings ...)
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support Andi Kleen
@ 2004-04-06 13:40 ` Andi Kleen
  2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
  2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-04-06 13:40 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm


Bugfix to prevent a miscompilation of bitmap.h with gcc 3.2.

diff -u linux-2.6.5-numa/include/linux/bitmap.h-o linux-2.6.5-numa/include/linux/bitmap.h
--- linux-2.6.5-numa/include/linux/bitmap.h-o	2004-03-17 12:17:59.000000000 +0100
+++ linux-2.6.5-numa/include/linux/bitmap.h	2004-04-06 13:36:12.000000000 +0200
@@ -29,7 +29,8 @@
 static inline void bitmap_copy(unsigned long *dst,
 			const unsigned long *src, int bits)
 {
-	memcpy(dst, src, BITS_TO_LONGS(bits)*sizeof(unsigned long));
+	int len = BITS_TO_LONGS(bits)*sizeof(unsigned long);
+	memcpy(dst, src, len);
 }
 
 void bitmap_shift_right(unsigned long *dst,

* Re: [PATCH] NUMA API for Linux 3/ Add i386 support
  2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
@ 2004-04-06 23:23   ` Andrew Morton
  0 siblings, 0 replies; 18+ messages in thread
From: Andrew Morton @ 2004-04-06 23:23 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Manfred Spraul

Andi Kleen <ak@suse.de> wrote:
>
> @@ -279,8 +279,11 @@
>  #define __NR_utimes		271
>  #define __NR_fadvise64_64	272
>  #define __NR_vserver		273
> +#define __NR_mbind		273
> +#define __NR_get_mempolicy	274
> +#define __NR_set_mempolicy	275

hm, those are all wrong: __NR_mbind reuses 273, which is already taken
by __NR_vserver.  I fixed it up.

Manfred, I'm going to bump the mq syscall numbers.  The numa API has been
around a bit longer and I suspect more people are relying on its syscall
numbers not changing.   Whatever they were ;)


 
diff -puN arch/i386/kernel/entry.S~numa-api-i386 arch/i386/kernel/entry.S
--- 25/arch/i386/kernel/entry.S~numa-api-i386	Tue Apr  6 16:19:40 2004
+++ 25-akpm/arch/i386/kernel/entry.S	Tue Apr  6 16:20:40 2004
@@ -908,9 +908,9 @@ ENTRY(sys_call_table)
 	.long sys_utimes
  	.long sys_fadvise64_64
 	.long sys_ni_syscall	/* sys_vserver */
-	.long sys_ni_syscall	/* sys_mbind */
-	.long sys_ni_syscall	/* 275 sys_get_mempolicy */
-	.long sys_ni_syscall	/* sys_set_mempolicy */
+	.long sys_mbind
+	.long sys_get_mempolicy	/* 275 */
+	.long sys_set_mempolicy
 	.long sys_mq_open
 	.long sys_mq_unlink
 	.long sys_mq_timedsend
diff -puN include/asm-i386/unistd.h~numa-api-i386 include/asm-i386/unistd.h

_


* Re: NUMA API for Linux
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (9 preceding siblings ...)
  2004-04-06 13:40 ` [PATCH] NUMA API for Linux 10/ Bitmap bugfix Andi Kleen
@ 2004-04-06 23:35 ` Paul Jackson
  2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 18+ messages in thread
From: Paul Jackson @ 2004-04-06 23:35 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Andi,

What kernel version (-mm? patches?) is this NUMA patchset
based on (what was used for the diff)?

Thanks.

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

* Re: NUMA API for Linux
  2004-04-06 13:33 NUMA API for Linux Andi Kleen
                   ` (10 preceding siblings ...)
  2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
@ 2004-04-08 20:12 ` Pavel Machek
  11 siblings, 0 replies; 18+ messages in thread
From: Pavel Machek @ 2004-04-08 20:12 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, akpm

Hi!

> This NUMA API doesn't not attempt to implement page migration or anything
> else complicated: all it does is to police the allocation when a page 
> is first allocation or when a page is reallocated after swapping. Currently
> only support for shared memory and anonymous memory is there; policy for 
> file based mappings is not implemented yet (although they get implicitely
> policied by the default process policy)
> 
> It adds three new system calls: mbind to change the policy of a VMA,
> set_mempolicy to change the policy of a process, get_mempolicy to retrieve
> memory policy. User tools (numactl, libnuma, test programs, manpages) can be 

set_mempolicy is a pretty ugly name. Why is prctl inadequate?
-- 
64 bytes from 195.113.31.123: icmp_seq=28 ttl=51 time=448769.1 ms         


* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
@ 2004-05-05 16:05   ` Paul Jackson
  2004-05-05 16:39     ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Jackson @ 2004-05-05 16:05 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, linux-kernel, akpm

This patch doesn't build for ia64 sn2_defconfig.

The build fails with link complaints of:

arch/ia64/kernel/built-in.o(.text+0x10862): In function `pfm_smpl_buffer_alloc':
: undefined reference to `mpol_set_vma_default'
arch/ia64/mm/built-in.o(.text+0x412): In function `ia64_init_addr_space':
: undefined reference to `mpol_set_vma_default'
arch/ia64/mm/built-in.o(.text+0x522): In function `ia64_init_addr_space':
: undefined reference to `mpol_set_vma_default'
arch/ia64/ia32/built-in.o(.text+0x1f432): In function `ia64_elf32_init':
: undefined reference to `mpol_set_vma_default'
arch/ia64/ia32/built-in.o(.text+0x1f982): In function `ia32_setup_arg_pages':
: undefined reference to `mpol_set_vma_default'
kernel/built-in.o(.text+0x162b2): In function `do_exit':
: undefined reference to `mpol_free'
make: *** [.tmp_vmlinux1] Error 1

Presumably this is because the following mpol_set_vma_default() and
mpol_free() macro calls are added, but the mempolicy.h header providing
the definitions for these macros is not included (a sketch of the
failure mode follows the list below):

arch/ia64/ia32/binfmt_elf32.c:          mpol_set_vma_default(vma);
arch/ia64/ia32/binfmt_elf32.c:          mpol_set_vma_default(mpnt);
arch/ia64/kernel/perfmon.c:     mpol_set_vma_default(vma);
arch/ia64/mm/init.c:            mpol_set_vma_default(vma);
arch/ia64/mm/init.c:                    mpol_set_vma_default(vma);
kernel/exit.c:  mpol_free(tsk->mempolicy);
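
The failure mode is presumably an implicit declaration.  As a hedged
sketch (not the actual header; the real definitions may differ), the
hook is roughly:

	/* <linux/mempolicy.h>, sketch only */
	#ifdef CONFIG_NUMA
	#define mpol_set_vma_default(vma)	((vma)->vm_policy = NULL)
	#else
	#define mpol_set_vma_default(vma)	do { } while (0)
	#endif

Without the #include neither definition is visible, so C's
implicit-declaration rule turns each call into a reference to an
external function that no object file defines -- hence errors at link
time rather than compile time.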

Looks like you should do something equivalent to adding:

  #include <linux/mempolicy.h>

to the files:

  arch/ia64/ia32/binfmt_elf32.c
  arch/ia64/kernel/perfmon.c
  arch/ia64/mm/init.c
  kernel/exit.c

The following, based off the numa-api-vma-policy-hooks patch in Andrew's
latest 2.6.6-rc3-mm2, includes these additional includes, and builds
successfully:

================================ snip ================================

From: Andi Kleen <ak@suse.de>

NUMA API adds a policy to each VMA.  During VMA creation, merging and
splitting, these policies must be handled properly.  This patch adds the
calls for that.

It is a no-op when CONFIG_NUMA is not defined.
DESC
numa-api-vma-policy-hooks fix
EDESC

mm/mmap.c: In function `copy_vma':
mm/mmap.c:1531: structure has no member named `vm_policy'


Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/ia32/binfmt_elf32.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/ia32/binfmt_elf32.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/ia32/binfmt_elf32.c	2004-05-05 08:48:27.000000000 -0700
@@ -13,6 +13,7 @@
 
 #include <linux/types.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/security.h>
 
 #include <asm/param.h>
@@ -104,6 +105,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		down_write(&current->mm->mmap_sem);
 		{
 			insert_vm_struct(current->mm, vma);
@@ -190,6 +192,7 @@
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
 		mpnt->vm_private_data = 0;
+		mpol_set_vma_default(mpnt);
 		insert_vm_struct(current->mm, mpnt);
 		current->mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
 	}
Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/kernel/perfmon.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/kernel/perfmon.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/kernel/perfmon.c	2004-05-05 08:48:38.000000000 -0700
@@ -29,6 +29,7 @@
 #include <linux/init.h>
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/sysctl.h>
 #include <linux/list.h>
 #include <linux/file.h>
@@ -2308,6 +2309,7 @@
 	vma->vm_ops	     = NULL;
 	vma->vm_pgoff	     = 0;
 	vma->vm_file	     = NULL;
+	mpol_set_vma_default(vma);
 	vma->vm_private_data = NULL; 
 
 	/*
Index: 2.6.6-rc3-mm2-bitmapv5/arch/ia64/mm/init.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/ia64/mm/init.c	2004-05-05 07:38:11.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/ia64/mm/init.c	2004-05-05 08:48:48.000000000 -0700
@@ -12,6 +12,7 @@
 #include <linux/efi.h>
 #include <linux/elf.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/mmzone.h>
 #include <linux/module.h>
 #include <linux/personality.h>
@@ -132,6 +133,7 @@
 		vma->vm_pgoff = 0;
 		vma->vm_file = NULL;
 		vma->vm_private_data = NULL;
+		mpol_set_vma_default(vma);
 		insert_vm_struct(current->mm, vma);
 	}
 
@@ -144,6 +146,7 @@
 			vma->vm_end = PAGE_SIZE;
 			vma->vm_page_prot = __pgprot(pgprot_val(PAGE_READONLY) | _PAGE_MA_NAT);
 			vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
+			mpol_set_vma_default(vma);
 			insert_vm_struct(current->mm, vma);
 		}
 	}
Index: 2.6.6-rc3-mm2-bitmapv5/arch/m68k/atari/stram.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/m68k/atari/stram.c	2004-05-05 07:38:12.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/m68k/atari/stram.c	2004-05-05 07:40:14.000000000 -0700
@@ -752,7 +752,7 @@
 			/* Get a page for the entry, using the existing
 			   swap cache page if there is one.  Otherwise,
 			   get a clean page and read the swap into it. */
-			page = read_swap_cache_async(entry);
+			page = read_swap_cache_async(entry, NULL, 0);
 			if (!page) {
 				swap_free(entry);
 				return -ENOMEM;
Index: 2.6.6-rc3-mm2-bitmapv5/arch/s390/kernel/compat_exec.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/s390/kernel/compat_exec.c	2004-05-05 07:38:14.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/s390/kernel/compat_exec.c	2004-05-05 08:45:17.000000000 -0700
@@ -72,6 +72,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/arch/x86_64/ia32/ia32_binfmt.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/arch/x86_64/ia32/ia32_binfmt.c	2004-05-05 07:38:17.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/arch/x86_64/ia32/ia32_binfmt.c	2004-05-05 08:45:17.000000000 -0700
@@ -365,6 +365,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/fs/exec.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/fs/exec.c	2004-05-05 07:40:13.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/fs/exec.c	2004-05-05 08:45:18.000000000 -0700
@@ -427,6 +427,7 @@
 		mpnt->vm_ops = NULL;
 		mpnt->vm_pgoff = 0;
 		mpnt->vm_file = NULL;
+		mpol_set_vma_default(mpnt);
 		INIT_LIST_HEAD(&mpnt->shared);
 		mpnt->vm_private_data = (void *) 0;
 		insert_vm_struct(mm, mpnt);
Index: 2.6.6-rc3-mm2-bitmapv5/kernel/exit.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/kernel/exit.c	2004-05-05 07:38:41.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/kernel/exit.c	2004-05-05 08:49:02.000000000 -0700
@@ -6,6 +6,7 @@
 
 #include <linux/config.h>
 #include <linux/mm.h>
+#include <linux/mempolicy.h>
 #include <linux/slab.h>
 #include <linux/interrupt.h>
 #include <linux/smp_lock.h>
@@ -790,6 +791,7 @@
 	__exit_fs(tsk);
 	exit_namespace(tsk);
 	exit_thread();
+	mpol_free(tsk->mempolicy);
 
 	if (tsk->signal->leader)
 		disassociate_ctty(1);
Index: 2.6.6-rc3-mm2-bitmapv5/kernel/fork.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/kernel/fork.c	2004-05-05 07:40:13.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/kernel/fork.c	2004-05-05 08:45:18.000000000 -0700
@@ -270,6 +270,7 @@
 	struct rb_node **rb_link, *rb_parent;
 	int retval;
 	unsigned long charge = 0;
+	struct mempolicy *pol;
 
 	down_write(&oldmm->mmap_sem);
 	flush_cache_mm(current->mm);
@@ -311,6 +312,11 @@
 		if (!tmp)
 			goto fail_nomem;
 		*tmp = *mpnt;
+		pol = mpol_copy(vma_policy(mpnt));
+		retval = PTR_ERR(pol);
+		if (IS_ERR(pol))
+			goto fail_nomem_policy;
+		vma_set_policy(tmp, pol);
 		tmp->vm_flags &= ~VM_LOCKED;
 		tmp->vm_mm = mm;
 		tmp->vm_next = NULL;
@@ -357,6 +363,8 @@
 	flush_tlb_mm(current->mm);
 	up_write(&oldmm->mmap_sem);
 	return retval;
+fail_nomem_policy:
+	kmem_cache_free(vm_area_cachep, tmp);
 fail_nomem:
 	retval = -ENOMEM;
 fail:
@@ -963,10 +971,16 @@
 	p->security = NULL;
 	p->io_context = NULL;
 	p->audit_context = NULL;
+ 	p->mempolicy = mpol_copy(p->mempolicy);
+ 	if (IS_ERR(p->mempolicy)) {
+ 		retval = PTR_ERR(p->mempolicy);
+ 		p->mempolicy = NULL;
+ 		goto bad_fork_cleanup;
+ 	}
 
 	retval = -ENOMEM;
 	if ((retval = security_task_alloc(p)))
-		goto bad_fork_cleanup;
+		goto bad_fork_cleanup_policy;
 	if ((retval = audit_alloc(p)))
 		goto bad_fork_cleanup_security;
 	/* copy all the process information */
@@ -1112,6 +1126,8 @@
 	audit_free(p);
 bad_fork_cleanup_security:
 	security_task_free(p);
+bad_fork_cleanup_policy:
+	mpol_free(p->mempolicy);
 bad_fork_cleanup:
 	if (p->pid > 0)
 		free_pidmap(p->pid);
Index: 2.6.6-rc3-mm2-bitmapv5/mm/mmap.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/mm/mmap.c	2004-05-05 07:40:14.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/mm/mmap.c	2004-05-05 08:45:18.000000000 -0700
@@ -387,7 +387,8 @@
 			struct vm_area_struct *prev,
 			struct rb_node *rb_parent, unsigned long addr, 
 			unsigned long end, unsigned long vm_flags,
-			struct file *file, unsigned long pgoff)
+		     	struct file *file, unsigned long pgoff,
+		        struct mempolicy *policy)
 {
 	spinlock_t *lock = &mm->page_table_lock;
 	struct inode *inode = file ? file->f_dentry->d_inode : NULL;
@@ -411,6 +412,7 @@
 	 * Can it merge with the predecessor?
 	 */
 	if (prev->vm_end == addr &&
+  		        mpol_equal(vma_policy(prev), policy) &&
 			can_vma_merge_after(prev, vm_flags, file, pgoff)) {
 		struct vm_area_struct *next;
 		int need_up = 0;
@@ -428,6 +430,7 @@
 		 */
 		next = prev->vm_next;
 		if (next && prev->vm_end == next->vm_start &&
+		    		vma_mpol_equal(prev, next) &&
 				can_vma_merge_before(next, vm_flags, file,
 					pgoff, (end - addr) >> PAGE_SHIFT)) {
 			prev->vm_end = next->vm_end;
@@ -440,6 +443,7 @@
 				fput(file);
 
 			mm->map_count--;
+			mpol_free(vma_policy(next));
 			kmem_cache_free(vm_area_cachep, next);
 			return prev;
 		}
@@ -455,6 +459,8 @@
 	prev = prev->vm_next;
 	if (prev) {
  merge_next:
+ 		if (!mpol_equal(policy, vma_policy(prev)))
+			return NULL;
 		if (!can_vma_merge_before(prev, vm_flags, file,
 				pgoff, (end - addr) >> PAGE_SHIFT))
 			return NULL;
@@ -631,7 +637,7 @@
 	/* Can we just expand an old anonymous mapping? */
 	if (!file && !(vm_flags & VM_SHARED) && rb_parent)
 		if (vma_merge(mm, prev, rb_parent, addr, addr + len,
-					vm_flags, NULL, 0))
+					vm_flags, NULL, pgoff, NULL))
 			goto out;
 
 	/*
@@ -654,6 +660,7 @@
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
 	vma->vm_next = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	if (file) {
@@ -693,7 +700,9 @@
 	addr = vma->vm_start;
 
 	if (!file || !rb_parent || !vma_merge(mm, prev, rb_parent, addr,
-				addr + len, vma->vm_flags, file, pgoff)) {
+					      vma->vm_end,
+					      vma->vm_flags, file, pgoff,
+					      vma_policy(vma))) {
 		vma_link(mm, vma, prev, rb_link, rb_parent);
 		if (correct_wcount)
 			atomic_inc(&inode->i_writecount);
@@ -703,6 +712,7 @@
 				atomic_inc(&inode->i_writecount);
 			fput(file);
 		}
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 	}
 out:	
@@ -1118,6 +1128,7 @@
 
 	remove_shared_vm_struct(area);
 
+	mpol_free(vma_policy(area));
 	if (area->vm_ops && area->vm_ops->close)
 		area->vm_ops->close(area);
 	if (area->vm_file)
@@ -1200,6 +1211,7 @@
 int split_vma(struct mm_struct * mm, struct vm_area_struct * vma,
 	      unsigned long addr, int new_below)
 {
+	struct mempolicy *pol;
 	struct vm_area_struct *new;
 	struct address_space *mapping = NULL;
 
@@ -1222,6 +1234,13 @@
 		new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
 	}
 
+	pol = mpol_copy(vma_policy(vma));
+	if (IS_ERR(pol)) {
+		kmem_cache_free(vm_area_cachep, new);
+		return PTR_ERR(pol);
+	}
+	vma_set_policy(new, pol);
+
 	if (new->vm_file)
 		get_file(new->vm_file);
 
@@ -1391,7 +1410,7 @@
 
 	/* Can we just expand an old anonymous mapping? */
 	if (rb_parent && vma_merge(mm, prev, rb_parent, addr, addr + len,
-					flags, NULL, 0))
+					flags, NULL, 0, NULL))
 		goto out;
 
 	/*
@@ -1412,6 +1431,7 @@
 	vma->vm_pgoff = 0;
 	vma->vm_file = NULL;
 	vma->vm_private_data = NULL;
+	mpol_set_vma_default(vma);
 	INIT_LIST_HEAD(&vma->shared);
 
 	vma_link(mm, vma, prev, rb_link, rb_parent);
@@ -1472,6 +1492,7 @@
 		}
 		if (vma->vm_file)
 			fput(vma->vm_file);
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		vma = next;
 	}
@@ -1508,7 +1529,7 @@
 
 	find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
 	new_vma = vma_merge(mm, prev, rb_parent, addr, addr + len,
-			vma->vm_flags, vma->vm_file, pgoff);
+			vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
 	if (new_vma) {
 		/*
 		 * Source vma may have been merged into new_vma
Index: 2.6.6-rc3-mm2-bitmapv5/mm/mprotect.c
===================================================================
--- 2.6.6-rc3-mm2-bitmapv5.orig/mm/mprotect.c	2004-05-05 07:38:42.000000000 -0700
+++ 2.6.6-rc3-mm2-bitmapv5/mm/mprotect.c	2004-05-05 08:45:18.000000000 -0700
@@ -125,6 +125,8 @@
 		return 0;
 	if (vma->vm_file || (vma->vm_flags & VM_SHARED))
 		return 0;
+	if (!vma_mpol_equal(vma, prev))
+		return 0;
 
 	/*
 	 * If the whole area changes to the protection of the previous one
@@ -136,6 +138,7 @@
 		__vma_unlink(mm, vma, prev);
 		spin_unlock(&mm->page_table_lock);
 
+		mpol_free(vma_policy(vma));
 		kmem_cache_free(vm_area_cachep, vma);
 		mm->map_count--;
 		return 1;
@@ -318,12 +321,14 @@
 
 	if (next && prev->vm_end == next->vm_start &&
 			can_vma_merge(next, prev->vm_flags) &&
+	    	vma_mpol_equal(prev, next) &&
 			!prev->vm_file && !(prev->vm_flags & VM_SHARED)) {
 		spin_lock(&prev->vm_mm->page_table_lock);
 		prev->vm_end = next->vm_end;
 		__vma_unlink(prev->vm_mm, next, prev);
 		spin_unlock(&prev->vm_mm->page_table_lock);
 
+		mpol_free(vma_policy(next));
 		kmem_cache_free(vm_area_cachep, next);
 		prev->vm_mm->map_count--;
 	}


-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:05   ` Paul Jackson
@ 2004-05-05 16:39     ` Andi Kleen
  2004-05-05 16:47       ` Paul Jackson
  0 siblings, 1 reply; 18+ messages in thread
From: Andi Kleen @ 2004-05-05 16:39 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andi Kleen, linux-kernel, akpm

> Looks like you should do something equivalent to adding:
> 
>   #include <linux/mempolicy.h>
> 
> to the files:
> 
>   arch/ia64/ia32/binfmt_elf32.c
>   arch/ia64/kernel/perfmon.c
>   arch/ia64/mm/init.c
>   kernel/exit.c
> 
> The following, based off the numa-api-vma-policy-hooks patch in Andrew's
> latest 2.6.6-rc3-mm2, includes these additional includes, and builds
> successfully:

This is not needed, because mempolicy.h is included in mm.h, which
is in turn included at some point by all of these files.
Perhaps you missed a patch?  (Several of the patches depend on each other.)

-Andi

* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:39     ` Andi Kleen
@ 2004-05-05 16:47       ` Paul Jackson
  2004-05-06  6:00         ` Andi Kleen
  0 siblings, 1 reply; 18+ messages in thread
From: Paul Jackson @ 2004-05-05 16:47 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, linux-kernel, akpm, hch

> Perhaps you missed a patch? (several of the patches depended on each other) 

No - perhaps Christoph Hellwig removed the include.

See the patch small-numa-api-fixups.patch in 2.6.6-rc3-mm2:

========================= snip =========================
From: Christoph Hellwig <hch@lst.de>

- don't include mempolicy.h in sched.h and mm.h when a forward declaration
  is enough.  Andi argued against that in the past, but I'd really hate to add
  another header to two of the includes used in basically every driver when we
  can include it in the six files actually needing it instead (that number is
  for my ppc32 system, maybe other arches need more includes in their
  directories)

- make numa api fields in task_struct conditional on CONFIG_NUMA, this gives
  us a few ugly ifdefs but avoids wasting memory on non-NUMA systems.

... details omitted ...
========================= snip =========================

Christoph did add the needed #includes of mempolicy.h to the
generic files that needed them, but more or less missed the
ia64 files (see the sketch below for the forward-declaration
idiom he describes).
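
For reference, a hedged sketch of the forward-declaration idiom from
Christoph's description (simplified, not the actual kernel headers):

	/* in mm.h: only a pointer is stored, so naming the type is enough */
	struct mempolicy;

	struct vm_area_struct {
		/* ... */
		struct mempolicy *vm_policy;	/* pointer member only */
	};

	/* only the few .c files that dereference struct mempolicy need */
	#include <linux/mempolicy.h>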

-- 
                          I won't rest till it's the best ...
                          Programmer, Linux Scalability
                          Paul Jackson <pj@sgi.com> 1.650.933.1373

* Re: [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy
  2004-05-05 16:47       ` Paul Jackson
@ 2004-05-06  6:00         ` Andi Kleen
  0 siblings, 0 replies; 18+ messages in thread
From: Andi Kleen @ 2004-05-06  6:00 UTC (permalink / raw)
  To: Paul Jackson; +Cc: Andi Kleen, linux-kernel, akpm, hch

On Wed, May 05, 2004 at 09:47:48AM -0700, Paul Jackson wrote:
> > Perhaps you missed a patch? (several of the patches depended on each other) 
> 
> No - perhaps Christoph Hellwig removed the include.

Hmm, ok. I must have missed that going in. Your patch looks correct
as an additional fix. 

Thanks,

-Andi

end of thread

Thread overview: 18+ messages
2004-04-06 13:33 NUMA API for Linux Andi Kleen
2004-04-06 13:34 ` [PATCH] NUMA API for Linux 1/ Core NUMA API code Andi Kleen
2004-04-06 13:35 ` NUMA API for Linux 2/ Add x86-64 support Andi Kleen
2004-04-06 13:35 ` [PATCH] NUMA API for Linux 3/ Add i386 support Andi Kleen
2004-04-06 23:23   ` Andrew Morton
2004-04-06 13:36 ` [PATCH] NUMA API for Linux 4/ Add IA64 support Andi Kleen
2004-04-06 13:37 ` [PATCH] NUMA API for Linux 5/ Add VMA hooks for policy Andi Kleen
2004-05-05 16:05   ` Paul Jackson
2004-05-05 16:39     ` Andi Kleen
2004-05-05 16:47       ` Paul Jackson
2004-05-06  6:00         ` Andi Kleen
2004-04-06 13:37 ` [PATCH] NUMA API for Linux 6/ Add shared memory support Andi Kleen
2004-04-06 13:38 ` [PATCH] NUMA API for Linux 7/ Add statistics Andi Kleen
2004-04-06 13:39 ` [PATCH] NUMA API for Linux 8/ Add policy support to anonymous memory Andi Kleen
2004-04-06 13:40 ` [PATCH] NUMA API for Linux 9/ Add simple lazy i386/x86-64 hugetlbfs policy support Andi Kleen
2004-04-06 13:40 ` [PATCH] NUMA API for Linux 10/ Bitmap bugfix Andi Kleen
2004-04-06 23:35 ` NUMA API for Linux Paul Jackson
2004-04-08 20:12 ` Pavel Machek
