* [PATCH] Use MPOL_INTERLEAVE for tmpfs files
@ 2004-11-02  1:07 ` Brent Casavant
  0 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-02  1:07 UTC (permalink / raw)
  To: linux-kernel, linux-mm; +Cc: hugh, ak

Hello,

This patch causes memory allocation for tmpfs files to be distributed
evenly across NUMA machines.  In most circumstances today, tmpfs files
will be allocated on the same node as the task writing to the file.
In many cases, particularly when large files are created, or a large
number of files are created by a single task, this leads to a severe
imbalance in free memory amongst nodes.  This patch corrects that
situation.

Memory imbalances such as this can degrade performance in various ways.
The two most typical situations are:

1. An HPC application with tightly synchronized threads has one or more
   of its tasks running on a node with little free memory.  These
   tasks are forced to allocate and/or access memory from remote
   nodes.  Since access to remote memory incurs a performance penalty,
   these threads run slower than those on other nodes.  The threads
   on other nodes end up waiting for the slower threads (i.e. the
   whole application runs only as fast as the slowest thread), thereby
   incurring a significant performance penalty for the application.
2. Many separate tasks access a given file.  Since the memory is remote
   for almost all processors, this causes a slowdown for those remote
   tasks.  A small performance gain can be had by distributing the
   file's memory allocation across the system.  This scenario also
   concentrates memory traffic on the node hosting the allocation,
   unfairly penalizing the memory bandwidth of the host node when such
   traffic could be spread more evenly across the machine.

This patch builds upon the NUMA memory policy code.  It affects only
the default placement of tmpfs file allocations, and in particular
does not affect /dev/zero memory allocations, which are implemented
in the same code.  /dev/zero mappings are typically of interest only
to a particular process, and thus benefit more from local allocations.
In any case, the NUMA API can still be used to lay out allocations more
precisely if so desired; only the default placement of these allocations
is changed.
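
For illustration, a minimal userspace sketch of how the existing NUMA API
can request interleaving for a single tmpfs file once it is mapped (the
file path, the two-node mask, and the use of libnuma's <numaif.h> are
assumptions here; error handling is omitted; link with -lnuma):

#include <numaif.h>		/* mbind(), MPOL_INTERLEAVE */
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int main(void)
{
	size_t len = 64 << 20;			/* 64MB file on a tmpfs mount */
	unsigned long nodemask = 0x3;		/* nodes 0 and 1, as an example */
	void *p;
	int fd;

	fd = open("/dev/shm/data", O_CREAT | O_RDWR, 0600);
	ftruncate(fd, len);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* Ask for round-robin placement before the pages are touched. */
	mbind(p, len, MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8, 0);

	memset(p, 0, len);			/* pages now spread across nodes */

	munmap(p, len);
	close(fd);
	return 0;
}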

System V shared memory allocations are also unaffected, though implemented
by the same code.  While in theory these allocations could suffer some
of the same problems as tmpfs files (i.e. a need to distribute large
allocations machine-wide, particularly if the allocation is long-lived),
SGI has not seen this to be a significant problem in practice.  As
addressing shm would add to the complexity of the patch with little
perceived value, changes thereto have not been implemented.

The following output (gleaned from /sys/devices/system/node/*/meminfo)
illustrates the effect of this patch.  The tmpfs file was created via
the command 'dd if=/dev/zero of=3G bs=1073741824 count=3' within a
tmpfs directory.

Without patch
=============
Before creating 3GB tmpfs file
------------------------------
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   4002480   3787040    215440
   1   4014080   3942368     71712
   2   3882992   3749376    133616
   3   3883008   3824288     58720
   4   3882992   3825824     57168
   5   3883008   3825440     57568
   6   3883008   3304576    578432
   7   3882992   3829408     53584
   8    933888    873696     60192
   9    933888    887424     46464
  10    933872    890720     43152
  11    933888    891328     42560
  12    933888    890464     43424
  13    933888    880608     53280
  14    933872    889408     44464
  15    915696    872672     43024
 ToT  38767440  37164640   1602800

After creating 3GB tmpfs file
-----------------------------
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   4002480    633792   3368688
   1   4014080   3938944     75136
   2   3882992   3742624    140368
   3   3883008   3824288     58720
   4   3882992   3825792     57200
   5   3883008   3825440     57568
   6   3883008   3304640    578368
   7   3882992   3829408     53584
   8    933888    873696     60192
   9    933888    887552     46336
  10    933872    890720     43152
  11    933888    891264     42624
  12    933888    890464     43424
  13    933888    880672     53216
  14    933872    889408     44464
  15    915696    872672     43024
 ToT  38767440  34001376   4766064

With patch
==========
Before creating 3GB tmpfs file
------------------------------
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   4002496   3263584    738912
   1   4014080   3934656     79424
   2   3882992   3735584    147408
   3   3883008   3815648     67360
   4   3882992   3820640     62352
   5   3883008   3816992     66016
   6   3883008   3803904     79104
   7   3882992   3820000     62992
   8    933888    872160     61728
   9    933888    881888     52000
  10    933872    881376     52496
  11    933888    883072     50816
  12    933888    884000     49888
  13    933888    878144     55744
  14    933872    875008     58864
  15    915696    864256     51440
 ToT  38767456  37030912   1736544

After creating 3GB tmpfs file
-----------------------------
 Nid  MemTotal   MemFree   MemUsed      (in kB)
   0   4002496   3066144    936352
   1   4014080   3738176    275904
   2   3882992   3536800    346192
   3   3883008   3619232    263776
   4   3882992   3624224    258768
   5   3883008   3620576    262432
   6   3883008   3607488    275520
   7   3882992   3623584    259408
   8    933888    675744    258144
   9    933888    685472    248416
  10    933872    684960    248912
  11    933888    686656    247232
  12    933888    687584    246304
  13    933888    681728    252160
  14    933872    678656    255216
  15    915696    667840    247856
 ToT  38767456  33884864   4882592

Note that in all cases shown above, there are additional sources of
memory imbalances.  These do not lie within tmpfs, and will be addressed
in due course.

And now, for your viewing pleasure...

Signed-off-by: Brent Casavant <bcasavan@sgi.com>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c	2004-11-01 13:47:00.000000000 -0600
+++ linux/mm/mempolicy.c	2004-11-01 15:04:30.000000000 -0600
@@ -1027,6 +1027,28 @@
 	return 0;
 }

+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave)
+{
+	info->root = RB_ROOT;
+	init_MUTEX(&info->sem);
+
+	if (unlikely(interleave)) {
+		struct mempolicy *newpol;
+
+		/* Falls back to MPOL_DEFAULT on any error */
+		newpol = mpol_new(MPOL_INTERLEAVE, nodes_addr(node_online_map));
+		if (likely(!IS_ERR(newpol))) {
+			/* Create pseudo-vma that contains just the policy */
+			struct vm_area_struct pvma;
+
+			memset(&pvma, 0, sizeof(struct vm_area_struct));
+			/* Policy covers entire file */
+			pvma.vm_end = ~0UL;
+			mpol_set_shared_policy(info, &pvma, newpol);
+		}
+	}
+}
+
 int mpol_set_shared_policy(struct shared_policy *info,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c	2004-10-18 16:55:07.000000000 -0500
+++ linux/fs/hugetlbfs/inode.c	2004-11-01 14:18:36.000000000 -0600
@@ -384,7 +384,7 @@
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy);
+		mpol_shared_policy_init(&info->policy, 0);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
Index: linux/include/linux/mempolicy.h
===================================================================
--- linux.orig/include/linux/mempolicy.h	2004-10-18 16:53:10.000000000 -0500
+++ linux/include/linux/mempolicy.h	2004-11-01 14:20:54.000000000 -0600
@@ -137,11 +137,7 @@
 	struct semaphore sem;
 };

-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
-	info->root = RB_ROOT;
-	init_MUTEX(&info->sem);
-}
+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave);

 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -198,7 +194,8 @@
 	return -EINVAL;
 }

-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+					unsigned interleave)
 {
 }

Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2004-11-01 13:47:00.000000000 -0600
+++ linux/mm/shmem.c	2004-11-01 14:21:43.000000000 -0600
@@ -1236,7 +1236,7 @@
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
- 		mpol_shared_policy_init(&info->policy);
+ 		mpol_shared_policy_init(&info->policy, sbinfo ? 1 : 0);
 		INIT_LIST_HEAD(&info->swaplist);

 		switch (mode & S_IFMT) {

-- 
Brent Casavant             bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer  http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.     44.8562N 93.1355W 860F  eyed and bushy-haired.

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02  1:07 ` Brent Casavant
@ 2004-11-02  1:43   ` Dave Hansen
  -1 siblings, 0 replies; 49+ messages in thread
From: Dave Hansen @ 2004-11-02  1:43 UTC (permalink / raw)
  To: Brent Casavant; +Cc: linux-kernel, linux-mm, hugh, ak

Brent Casavant wrote:
> This patch causes memory allocation for tmpfs files to be distributed
> evenly across NUMA machines.  In most circumstances today, tmpfs files
> will be allocated on the same node as the task writing to the file.
> In many cases, particularly when large files are created, or a large
> number of files are created by a single task, this leads to a severe
> imbalance in free memory amongst nodes.  This patch corrects that
> situation.

Why don't you just use the NUMA API in your application for this?  Won't 
this hurt any application that uses tmpfs and never leaves a node in its 
lifetime, like a short gcc run?

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02  1:07 ` Brent Casavant
@ 2004-11-02  9:13   ` Andi Kleen
  -1 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2004-11-02  9:13 UTC (permalink / raw)
  To: Brent Casavant; +Cc: linux-kernel, linux-mm, hugh, ak

> And now, for your viewing pleasure...

Patch is fine except that I would add a sysctl to enable/disable this.

I can see that some people would like to not have the interleave policy
(e.g. when you use tmpfs as a large memory extender on 32-bit NUMA,
then you probably want local affinity).

Best if you name it /proc/sys/vm/numa-tmpfs-rr

Default can be set to one for now.

-Andi
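
For concreteness, a rough sketch of how such a knob might be wired into
kernel/sysctl.c's vm_table (the VM_TMPFS_RR ctl_name, the variable name,
and the default of 1 are assumptions, not code from a posted patch):

int sysctl_numa_tmpfs_rr = 1;	/* 1 = interleave tmpfs file pages */

static ctl_table vm_table[] = {
	/* ... existing vm entries ... */
	{
		.ctl_name	= VM_TMPFS_RR,	/* hypothetical new value */
		.procname	= "numa-tmpfs-rr",
		.data		= &sysctl_numa_tmpfs_rr,
		.maxlen		= sizeof(sysctl_numa_tmpfs_rr),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};

mm/shmem.c would then test sysctl_numa_tmpfs_rr in addition to sbinfo when
calling mpol_shared_policy_init().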

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02  1:07 ` Brent Casavant
@ 2004-11-02 15:46   ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-02 15:46 UTC (permalink / raw)
  To: Brent Casavant, linux-kernel, linux-mm; +Cc: hugh, ak

> This patch causes memory allocation for tmpfs files to be distributed
> evenly across NUMA machines.  In most circumstances today, tmpfs files
> will be allocated on the same node as the task writing to the file.
> In many cases, particularly when large files are created, or a large
> number of files are created by a single task, this leads to a severe
> imbalance in free memory amongst nodes.  This patch corrects that
> situation.

Yeah, but it also ruins your locality of reference (in a NUMA sense). 
Not convinced that's a good idea. You're guaranteeing universally consistent
worst-case performance for everyone. And you're only looking at a situation
where there's one allocator on the system, and that's imbalanced.

You WANT your data to be local. That's the whole idea.

M.

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 15:46   ` Martin J. Bligh
@ 2004-11-02 15:55     ` Andi Kleen
  -1 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2004-11-02 15:55 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Brent Casavant, linux-kernel, linux-mm, hugh, ak

On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
> > This patch causes memory allocation for tmpfs files to be distributed
> > evenly across NUMA machines.  In most circumstances today, tmpfs files
> > will be allocated on the same node as the task writing to the file.
> > In many cases, particularly when large files are created, or a large
> > number of files are created by a single task, this leads to a severe
> > imbalance in free memory amongst nodes.  This patch corrects that
> > situation.
> 
> Yeah, but it also ruins your locality of reference (in a NUMA sense). 
> Not convinced that's a good idea. You're guaranteeing universally consistent
> worst-case performance for everyone. And you're only looking at a situation
> where there's one allocator on the system, and that's imbalanced.
> 
> You WANT your data to be local. That's the whole idea.

I think it depends on how you use tmpfs. When you use it for read/write
it's a good idea because you likely don't care about a bit of additional
latency and it's better to not fill up your local nodes with temporary
files.

If you use it with mmap then you likely want local policy.

But that's a bit ugly to distinguish, which is why I suggested the sysctl.

-Andi

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 15:55     ` Andi Kleen
@ 2004-11-02 16:55       ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-02 16:55 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Brent Casavant, linux-kernel, linux-mm, hugh

--Andi Kleen <ak@suse.de> wrote (on Tuesday, November 02, 2004 16:55:07 +0100):

> On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
>> > This patch causes memory allocation for tmpfs files to be distributed
>> > evenly across NUMA machines.  In most circumstances today, tmpfs files
>> > will be allocated on the same node as the task writing to the file.
>> > In many cases, particularly when large files are created, or a large
>> > number of files are created by a single task, this leads to a severe
>> > imbalance in free memory amongst nodes.  This patch corrects that
>> > situation.
>> 
>> Yeah, but it also ruins your locality of reference (in a NUMA sense). 
>> Not convinced that's a good idea. You're guaranteeing universally consistent
>> worst-case performance for everyone. And you're only looking at a situation
>> where there's one allocator on the system, and that's imbalanced.
>> 
>> You WANT your data to be local. That's the whole idea.
> 
> I think it depends on how you use tmpfs. When you use it for read/write
> it's a good idea because you likely don't care about a bit of additional
> latency and it's better to not fill up your local nodes with temporary
> files.
> 
> If you use it with mmap then you likely want local policy.
> 
> But that's a bit ugly to distinguish, which is why I suggested the sysctl.

As long as it defaults to off, I guess I don't really care. Though I'm still
wholly unconvinced it makes much sense. I think we're just papering over the
underlying problem - that we don't do good balancing between nodes under
mem pressure.

M.


* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 16:55       ` Martin J. Bligh
@ 2004-11-02 22:17         ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-02 22:17 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andi Kleen, linux-kernel, linux-mm, hugh

On Tue, 2 Nov 2004, Martin J. Bligh wrote:

> --Andi Kleen <ak@suse.de> wrote (on Tuesday, November 02, 2004 16:55:07 +0100):
>
> > On Tue, Nov 02, 2004 at 07:46:59AM -0800, Martin J. Bligh wrote:
> >> > This patch causes memory allocation for tmpfs files to be distributed
> >> > evenly across NUMA machines.  In most circumstances today, tmpfs files
> >> > will be allocated on the same node as the task writing to the file.
> >> > In many cases, particularly when large files are created, or a large
> >> > number of files are created by a single task, this leads to a severe
> >> > imbalance in free memory amongst nodes.  This patch corrects that
> >> > situation.
> >>
> >> Yeah, but it also ruins your locality of reference (in a NUMA sense).
> >> Not convinced that's a good idea. You're guaranteeing universally consistent
> >> worst-case performance for everyone. And you're only looking at a situation
> >> where there's one allocator on the system, and that's imbalanced.
> >>
> >> You WANT your data to be local. That's the whole idea.
> >
> > I think it depends on how you use tmpfs. When you use it for read/write
> > it's a good idea because you likely don't care about a bit of additional
> > latency and it's better to not fill up your local nodes with temporary
> > files.
> >
> > If you use it with mmap then you likely want local policy.

Yeah, what Andi said. :)

I fully agree with Martin's statement "You WANT your data to be local."
However, it can become non-local in two different manners, each of
which wants tmpfs to behave differently.  (The next two paragraphs are
simply to summarize, no position is being taken.)

The manner I'm concerned about is when a long-lived file (long-lived
meaning at least for the duration of the run of a large multithreaded app)
is placed in memory as an accidental artifact of the CPU which happened
to create the file.  Imagine, for instance, an application startup
script copying a large file into a tmpfs filesystem before spawning
the actual computation application itself.  There is a large class of
HPC applications which are tightly synchronized, and which require nearly
all of the system memory (i.e. almost all the memory on each node).
In this case the "victim" node will be forced to go off-node for the
bulk of its memory accesses, destroying locality, and slowing down
the entire application.  This problem is alleviated if the tmpfs file
is fairly well distributed across the nodes.

But I do understand the opposite situation.  If a lightly-threaded
application wants to access the file data, or the application does
not require large amounts of additional memory (i.e. nearly all the
memory on each node) getting the file itself allocated close to the
processor is more beneficial.  In this case the distribution of tmpfs
pages is non-ideal (though I'm not sure it's worst-case).

> > But that's a bit ugly to distinguish, which is why I suggested the sysctl.
>
> As long as it defaults to off, I guess I don't really care. Though I'm still
> wholly unconvinced it makes much sense. I think we're just papering over the
> underlying problem - that we don't do good balancing between nodes under
> mem pressure.

It's a tough situation, as shown above.  The HPC workload I mentioned
would much prefer the tmpfs file to be distributed.  A non-HPC workload
would prefer the tmpfs files be local.  Short of a sysctl I'm not sure
how the system could make an intelligent decision about what to do under
memory pressure -- it simply isn't knowledge the kernel can have.

I've got a new patch including Andi's suggested sysctl ready to go.
But I've seen one vote for defaulting to on, and one for defaulting
to off.  I'd vote for on, but then again I'm biased. :)  Who arbitrates
this one, Hugh Dickins (it's a tmpfs change, after all)?  From SGI's
perspective we can live with it either way; we'll simply need to
document our recommendations for our customers.

As soon as I know whether this should default on or off, I'll post
the new patch, including the sysctl.

Thanks,
Brent

-- 
Brent Casavant             bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer  http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.     44.8562N 93.1355W 860F  eyed and bushy-haired.

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 22:17         ` Brent Casavant
@ 2004-11-02 22:51           ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-02 22:51 UTC (permalink / raw)
  To: Brent Casavant; +Cc: Andi Kleen, linux-kernel, linux-mm, hugh

> I fully agree with Martin's statement "You WANT your data to be local."
> However, it can become non-local in two different manners, each of
> which wants tmpfs to behave differently.  (The next two paragraphs are
> simply to summarize, no position is being taken.)
> 
> The manner I'm concerned about is when a long-lived file (long-lived
> meaning at least for the duration of the run of a large multithreaded app)
> is placed in memory as an accidental artifact of the CPU which happened
> to create the file.

Agreed, I see that point - if it's a globally accessed file that's
created by one CPU, you want it spread around. However ... how the hell
is the VM meant to distinguish that? The correct way is for the application
to tell us that - ie use the NUMA API.

> Imagine, for instance, an application startup
> script copying a large file into a tmpfs filesystem before spawning
> the actual computation application itself.  There is a large class of
> HPC applications which are tightly synchronized, and which require nearly
> all of the system memory (i.e. almost all the memory on each node).
> In this case the "victim" node will be forced to go off-node for the
> bulk of its memory accesses, destroying locality, and slowing down
> the entire application.  This problem is alleviated if the tmpfs file
> is fairly well distributed across the nodes.

There are definitely situations where you'd want both, I'd agree.

> But I do understand the opposite situation.  If a lightly-threaded
> application wants to access the file data, or the application does
> not require large amounts of additional memory (i.e. nearly all the
> memory on each node) getting the file itself allocated close to the
> processor is more beneficial.  In this case the distribution of tmpfs
> pages is non-ideal (though I'm not sure it's worst-case).

Well, it's not quite worst case  ... you'll only get (n-1)/n of your pages
off node (where n is number of nodes). I guess we could deliberately
force ALL of them off-node just to make sure ;-)

>> > But that's a bit ugly to distinguish, which is why I suggested the sysctl.
>> 
>> As long as it defaults to off, I guess I don't really care. Though I'm still
>> wholly unconvinced it makes much sense. I think we're just papering over the
>> underlying problem - that we don't do good balancing between nodes under
>> mem pressure.
> 
> It's a tough situation, as shown above.  The HPC workload I mentioned
> would much prefer the tmpfs file to be distributed.  A non-HPC workload
> would prefer the tmpfs files be local.  Short of a sysctl I'm not sure
> how the system could make an intelligent decision about what to do under
> memory pressure -- it simply isn't knowledge the kernel can have.

It is if you tell it from the app ;-) But otherwise yes, I'd agree.
 
> I've got a new patch including Andi's suggested sysctl ready to go.
> But I've seen one vote for defaulting to on, and one for defaulting
> to off.  I'd vote for on, but then again I'm biased. :)  Who arbitrates
> this one, Hugh Dickins (it's a tmpfs change, after all)?  From SGI's
> perspective we can live with it either way; we'll simply need to
> document our recommendations for our customers.

I guess Hugh or Andrew. But I'm very reluctant to see the current default
changed to benefit one particular set of workloads ... unless there's
overwhelming evidence that those workloads are massively predominant.
A per-arch thing might make sense I suppose if you're convinced that
all that ia64 NUMA machines will ever run is HPC apps.

> As soon as I know whether this should default on or off, I'll post
> the new patch, including the sysctl.

Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
personally, but maybe others wouldn't. Hugh, is that nuts?

M.
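
For what it's worth, a rough sketch of what the mount-option variant could
look like (the "interleave" option name, the new shmem_sb_info field, and
the exact hook points in mm/shmem.c are assumptions, not a tested patch):

/* Hypothetical usage:  mount -t tmpfs -o interleave tmpfs /mnt/scratch */

/* new field in struct shmem_sb_info: */
	unsigned int interleave;	/* spread this mount's pages over nodes */

/* in the tmpfs mount-option parsing loop: */
	if (!strcmp(this_char, "interleave")) {
		sbinfo->interleave = 1;
		continue;
	}

/* and at inode creation, in place of the unconditional "sbinfo ? 1 : 0": */
	mpol_shared_policy_init(&info->policy,
				sbinfo && sbinfo->interleave ? 1 : 0);

That would keep the decision per-mount rather than system-wide, which also
fits the read/write vs. mmap distinction Andi raised.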


* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 22:51           ` Martin J. Bligh
@ 2004-11-03  1:12             ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-03  1:12 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Andi Kleen, linux-kernel, linux-mm, hugh

On Tue, 2 Nov 2004, Martin J. Bligh wrote:

> > The manner I'm concerned about is when a long-lived file (long-lived
> > meaning at least for the duration of the run of a large multithreaded app)
> > is placed in memory as an accidental artifact of the CPU which happened
> > to create the file.
>
> Agreed, I see that point - if it's a globally accessed file that's
> created by one CPU, you want it spread around. However ... how the hell
> is the VM meant to distinguish that? The correct way is for the application
> to tell us that - ie use the NUMA API.

Right.  The application can already tell us that without this patch,
but only if the file is mapped.  Using writes there doesn't appear to
be any way to control this behavior (unless I overlooked something).
So it is impossible to use normal system utilities (i.e. cp, dd, tar)
and get appropriate placement.

This change gives us a method to get a different default (i.e. write)
behavior.

> > It's a tough situation, as shown above.  The HPC workload I mentioned
> > would much prefer the tmpfs file to be distributed.  A non-HPC workload
> > would prefer the tmpfs files be local.  Short of a sysctl I'm not sure
> > how the system could make an intelligent decision about what to do under
> > memory pressure -- it simply isn't knowledge the kernel can have.
>
> It is if you tell it from the app ;-) But otherwise yes, I'd agree.

Yep.  Also, bear in mind that many applications/utilities are not going
to be very savvy in regard to tmpfs placement, and really shouldn't be.
I wouldn't expect a compiler to have special code to lay out the generated
objects in a friendly manner.  Currently, all it takes is one unaware
application/utility to mess things up for us.

With local-by-default placement, as today, every single application and
utility needs to cooperate in order to be successful.

Which, I guess, might be something to consider (thinking out loud here)...

An application which uses the NUMA API obviously cares about placement.
If it causes data to be placed locally when it would otherwise be placed
globally, it will not cause a problem for a separate application/utility
which doesn't care (much?) about locality.

However, if an application/utility which does not care (much?) about
locality fails to use the NUMA API and causes data to be placed locally,
it may very well cause a problem for a separate application which does
care about locality.

Thus, to me, a default of spreading the allocation globally makes
sense.  The burden of adding code to "do the right thing" falls to
the applications which care.  No other program requires changes and
as such everything will "just work".

But, once again, I fess up to a bias. :)

> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> personally, but maybe others wouldn't. Hugh, is that nuts?

I kind of like that -- it shouldn't be too hard to stuff into the tmpfs
superblock.  But I agree, Hugh knows better than I do.

Brent

-- 
Brent Casavant             bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer  http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.     44.8562N 93.1355W 860F  eyed and bushy-haired.

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-03  1:12             ` Brent Casavant
  (?)
@ 2004-11-03  1:30             ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-03  1:30 UTC (permalink / raw)
  To: Brent Casavant; +Cc: Andi Kleen, linux-kernel, linux-mm, hugh

--On Tuesday, November 02, 2004 19:12:10 -0600 Brent Casavant <bcasavan@sgi.com> wrote:

> On Tue, 2 Nov 2004, Martin J. Bligh wrote:
> 
>> > The manner I'm concerned about is when a long-lived file (long-lived
>> > meaning at least for the duration of the run of a large multithreaded app)
>> > is placed in memory as an accidental artifact of the CPU which happened
>> > to create the file.
>> 
>> Agreed, I see that point - if it's a globally accessed file that's
>> created by one CPU, you want it spread around. However ... how the hell
>> is the VM meant to distinguish that? The correct way is for the application
>> to tell us that - ie use the NUMA API.
> 
> Right.  The application can already tell us that without this patch,
> but only if the file is mapped.  Using writes there doesn't appear to
> be any way to control this behavior (unless I overlooked something).
> So it is impossible to use normal system utilities (i.e. cp, dd, tar)
> and get appropriate placement.

Ah yes, I see your point.
 
> This change gives us a method to get a different default (i.e write)
> behavior.

OK. Might still be more useful to have it per filehandle, but we'd need
a way to send info on an existing filehandle.
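
For reference, a minimal userspace sketch of the mmap-based control referred
to above: one way an application can interleave an already-created tmpfs file
today is to map it and apply a policy to the mapping with mbind(2).  This
assumes the <numaif.h> interface shipped with the numactl package; the 0xf
node mask and the lack of error handling are illustrative only.

#include <numaif.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Interleave the pages of an existing tmpfs file.  This only works
 * through a mapping; tools that go through plain write(2) never get
 * this kind of control. */
static void *interleave_mapping(const char *path, size_t len)
{
	unsigned long nodemask = 0xf;	/* nodes 0-3, example only */
	void *p;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return NULL;
	/* Pages faulted in through this shared mapping follow the
	 * interleave policy applied here. */
	mbind(p, len, MPOL_INTERLEAVE, &nodemask, sizeof(nodemask) * 8, 0);
	return p;
}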

> Yep.  Also, bear in mind that many applications/utilities are not going
> to be very savvy in regard to tmpfs placement, and really shouldn't be.
> I wouldn't expect a compiler to have special code to lay out the generated
> objects in a friendly manner.  Currently, all it takes is one unaware
> application/utility to mess things up for us.

Well, on certain extreme workloads, yes. But not for most people.
 
> With local-by-default placement, as today, every single application and
> utility needs to cooperate in order to be successful.
> 
> Which, I guess, might be something to consider (thinking out loud here)...

I don't like apps having to co-operate any more than you do, as they
obviously won't. However, remember that most workloads won't hit this,
and there are a couple of other approaches:

1) your global switch is making more sense to me now.
2) We can balance nodes under mem pressure (we don't currently)

Number 2 fixes a variety of evils, and we've been talking about fixing
that for a while (see the previous one on pagecache ... do we have to
fix each case?).

What scares me is that what all these discussions are pointing towards
is "we want global striping for all allocations, because Linux is too 
crap to deal with cross-node balancing pressure". I don't want to see 
us go that way ;-). Some things are more obviously global than others
(eg shmem is more likely to be shared) ... whether we think the usage
of tmpfs is local or global is very much up to debate though. 

There are a few other indicators I could see us using as hints from the 
OS that might help - the size of the file vs the amount of mem in the
node, for one. How many processes have the file open, and which nodes
they're on for another. Neither of which are flawless, or even terribly
simple, but we're making heuristic guesses.

> An application which uses the NUMA API obviously cares about placement.
> If it causes data to be placed locally when it would otherwise be placed
> globally, it will not cause a problem for a separate application/utility
> which doesn't care (much?) about locality.

99.9% of apps won't be NUMA aware, nor should they have to be. Standard
code should just work out of the box ... I'm not going to take the viewpoint
that these are only specialist servers running one or two apps - they'll
run all sorts of stuff, esp with the advent of AMD in the marketplace,
and hopefully Intel will get their ass into gear and do local mem soon
as well.

> However, if an application/utility which does not care (much?) about
> locality fails to use the NUMA API and causes data to be placed locally,
> it may very well cause a problem for a separate application which does
> care about locality.
> 
> Thus, to me, a default of spreading the allocation globally makes
> sense.  The burden of adding code to "do the right thing" falls to
> the applications which care.  No other program requires changes and
> as such everything will "just work".
> 
> But, once again, I fess up to a bias. :)

Yeah ;-) I'm strongly against taking the road of "NUMA performance will
suck unless your app is NUMA aware" (ie default to non-local alloc).
OTOH, I don't like us crapping out under imbalance either. However,
I think there are other ways to solve this (see above).
 
>> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
>> personally, but maybe others wouldn't. Hugh, is that nuts?
> 
> I kind of like that -- it shouldn't be too hard to stuff into the tmpfs
> superblock.  But I agree, Hugh knows better than I do.

OK, that'd be under the sysadmin's control fairly easily at least. 

M>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-02 22:51           ` Martin J. Bligh
@ 2004-11-03  8:44             ` Hugh Dickins
  -1 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2004-11-03  8:44 UTC (permalink / raw)
  To: Martin J. Bligh; +Cc: Brent Casavant, Andi Kleen, linux-kernel, linux-mm

On Tue, 2 Nov 2004, Martin J. Bligh wrote:
> 
> Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> personally, but maybe others wouldn't. Hugh, is that nuts?

Only nuts if I am, I was going to suggest the same: the sysctl idea seems
very inadequate; a mount option at least allows the possibility of having
different tmpfs files allocated with different policies at the same time.

But I'm not usually qualified to comment on NUMA matters, and my tmpfs
maintenance shouldn't be allowed to get in the way of progress.  Plus
I've barely been attending in recent days: back to normality tomorrow.

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-03  8:44             ` Hugh Dickins
@ 2004-11-03  9:01               ` Andi Kleen
  -1 siblings, 0 replies; 49+ messages in thread
From: Andi Kleen @ 2004-11-03  9:01 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin J. Bligh, Brent Casavant, Andi Kleen, linux-kernel, linux-mm

On Wed, Nov 03, 2004 at 08:44:32AM +0000, Hugh Dickins wrote:
> On Tue, 2 Nov 2004, Martin J. Bligh wrote:
> > 
> > Another way might be a tmpfs mount option ... I'd prefer that to a sysctl
> > personally, but maybe others wouldn't. Hugh, is that nuts?
> 
> Only nuts if I am, I was going to suggest the same: the sysctl idea seems
> very inadequate; a mount option at least allows the possibility of having
> different tmpfs files allocated with different policies at the same time.
> 
> But I'm not usually qualified to comment on NUMA matters, and my tmpfs
> maintenance shouldn't be allowed to get in the way of progress.  Plus
> I've barely been attending in recent days: back to normality tomorrow.

If you want to go more fine-grained then you can always use numactl
or even libnuma in the application.  For a quick policy decision a sysctl 
is fine imho.
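
Roughly, the numactl suggestion amounts to setting a task-wide interleave
policy with set_mempolicy(2) before the workload touches the file.  A small
sketch, assuming <numaif.h> from the numactl package; the node mask is
purely illustrative:

#include <numaif.h>

/* Ask for MPOL_INTERLEAVE for this task's future allocations, roughly
 * what 'numactl --interleave=all <cmd>' arranges before exec.
 * The 0xff mask (nodes 0-7) is an example, not a recommendation. */
static int interleave_task(void)
{
	unsigned long nodemask = 0xff;

	return set_mempolicy(MPOL_INTERLEAVE, &nodemask,
			     sizeof(nodemask) * 8);
}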

-Andi

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-03  9:01               ` Andi Kleen
@ 2004-11-03 16:32                 ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-03 16:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Hugh Dickins, Martin J. Bligh, linux-kernel, linux-mm

On Wed, 3 Nov 2004, Andi Kleen wrote:

> If you want to go more fine-grained then you can always use numactl
> or even libnuma in the application.  For a quick policy decision a sysctl
> is fine imho.

OK, so I'm not seeing a definitive stance by the interested parties
either way.  So since the code's already done, I'm posting the sysctl
method, and defaulting to on.  I assume that if we later decide that
a mount option was correct after all, that it's no big deal to axe the
sysctl?

The sysctl code in this patch is based on work originally done by
Andi.  It has been changed a bit, mostly to make it appear only
in CONFIG_NUMA && CONFIG_TMPFS kernels.
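
(With the patch as posted, the knob should then show up as a plain integer
in procfs, presumably /proc/sys/kernel/numa-tmpfs-rr given where the entry
lands in the sysctl table, so something like
'echo 0 > /proc/sys/kernel/numa-tmpfs-rr' would switch back to local
allocation at runtime.)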

Signed-off-by: Brent Casavant <bcasavan@sgi.com>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/mempolicy.c	2004-11-03 10:26:30.000000000 -0600
@@ -1027,6 +1027,28 @@
 	return 0;
 }

+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave)
+{
+	info->root = RB_ROOT;
+	init_MUTEX(&info->sem);
+
+	if (unlikely(interleave)) {
+		struct mempolicy *newpol;
+
+		/* Falls back to MPOL_DEFAULT on any error */
+		newpol = mpol_new(MPOL_INTERLEAVE, nodes_addr(node_online_map));
+		if (likely(!IS_ERR(newpol))) {
+			/* Create pseudo-vma that contains just the policy */
+			struct vm_area_struct pvma;
+
+			memset(&pvma, 0, sizeof(struct vm_area_struct));
+			/* Policy covers entire file */
+			pvma.vm_end = ~0UL;
+			mpol_set_shared_policy(info, &pvma, newpol);
+		}
+	}
+}
+
 int mpol_set_shared_policy(struct shared_policy *info,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/fs/hugetlbfs/inode.c	2004-11-03 10:26:30.000000000 -0600
@@ -384,7 +384,7 @@
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy);
+		mpol_shared_policy_init(&info->policy, 0);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
Index: linux/include/linux/mempolicy.h
===================================================================
--- linux.orig/include/linux/mempolicy.h	2004-11-03 10:24:16.000000000 -0600
+++ linux/include/linux/mempolicy.h	2004-11-03 10:26:30.000000000 -0600
@@ -137,11 +137,7 @@
 	struct semaphore sem;
 };

-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
-	info->root = RB_ROOT;
-	init_MUTEX(&info->sem);
-}
+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave);

 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -198,7 +194,8 @@
 	return -EINVAL;
 }

-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+					unsigned interleave)
 {
 }

Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/shmem.c	2004-11-03 10:26:30.000000000 -0600
@@ -72,6 +72,12 @@
 /* Keep swapped page count in private field of indirect struct page */
 #define nr_swapped		private

+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+int sysctl_tmpfs_rr = 1;
+#else
+#define sysctl_tmpfs_rr (0)
+#endif
+
 /* Flag allocation requirements to shmem_getpage and shmem_swp_alloc */
 enum sgp_type {
 	SGP_QUICK,	/* don't try more than file page cache lookup */
@@ -1236,7 +1242,7 @@
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
- 		mpol_shared_policy_init(&info->policy);
+ 		mpol_shared_policy_init(&info->policy, sbinfo ? sysctl_tmpfs_rr : 0);
 		INIT_LIST_HEAD(&info->swaplist);

 		switch (mode & S_IFMT) {
Index: linux/kernel/sysctl.c
===================================================================
--- linux.orig/kernel/sysctl.c	2004-11-03 10:24:20.000000000 -0600
+++ linux/kernel/sysctl.c	2004-11-03 10:26:30.000000000 -0600
@@ -74,6 +74,10 @@
 				  void __user *, size_t *, loff_t *);
 #endif

+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+extern int sysctl_tmpfs_rr;
+#endif
+
 /* this is needed for the proc_dointvec_minmax for [fs_]overflow UID and GID */
 static int maxolduid = 65535;
 static int minolduid;
@@ -622,6 +626,16 @@
 		.maxlen         = sizeof (int),
 		.mode           = 0644,
 		.proc_handler   = &proc_unknown_nmi_panic,
+	},
+#endif
+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
+	{
+		.ctl_name	= KERN_NUMA_TMPFS_RR,
+		.procname	= "numa-tmpfs-rr",
+		.data		= &sysctl_tmpfs_rr,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
 	},
 #endif
 	{ .ctl_name = 0 }
Index: linux/include/linux/sysctl.h
===================================================================
--- linux.orig/include/linux/sysctl.h	2004-11-03 10:26:20.000000000 -0600
+++ linux/include/linux/sysctl.h	2004-11-03 10:26:41.000000000 -0600
@@ -134,6 +134,7 @@
 	KERN_SPARC_SCONS_PWROFF=64, /* int: serial console power-off halt */
 	KERN_HZ_TIMER=65,	/* int: hz timer on or off */
 	KERN_UNKNOWN_NMI_PANIC=66, /* int: unknown nmi panic flag */
+	KERN_NUMA_TMPFS_RR=67,	/* int: NUMA interleave tmpfs allocations */
 };

-- 
Brent Casavant             bcasavan@sgi.com        Forget bright-eyed and
Operating System Engineer  http://www.sgi.com/     bushy-tailed; I'm red-
Silicon Graphics, Inc.     44.8562N 93.1355W 860F  eyed and bushy-haired.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-03 16:32                 ` Brent Casavant
@ 2004-11-03 21:00                   ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-03 21:00 UTC (permalink / raw)
  To: Brent Casavant, Andi Kleen, colpatch; +Cc: Hugh Dickins, linux-kernel, linux-mm



--On Wednesday, November 03, 2004 10:32:56 -0600 Brent Casavant <bcasavan@sgi.com> wrote:

> On Wed, 3 Nov 2004, Andi Kleen wrote:
> 
>> If you want to go more fine-grained then you can always use numactl
>> or even libnuma in the application.  For a quick policy decision a sysctl
>> is fine imho.
> 
> OK, so I'm not seeing a definitive stance by the interested parties
> either way.  So since the code's already done, I'm posting the sysctl
> method, and defaulting to on.  I assume that if we later decide that
> a mount option was correct after all, that it's no big deal to axe the
> sysctl?

Matt has volunteered to write the mount option for this, so let's hold
off for a couple of days until that's done.

M


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-03 21:00                   ` Martin J. Bligh
@ 2004-11-08 19:58                     ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-08 19:58 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Andi Kleen, colpatch, Hugh Dickins, linux-kernel, linux-mm

On Wed, 3 Nov 2004, Martin J. Bligh wrote:

> Matt has volunteered to write the mount option for this, so let's hold
> off for a couple of days until that's done.

I had the time to do this myself.  Updated patch attached below.

Description:

This patch creates a tmpfs mount option and adds code which causes
tmpfs allocations to be interleaved across the nodes of a NUMA machine.
This is useful in situations where a large tmpfs file would otherwise
consume most of the memory on a single node, forcing tasks running on
that node to perform off-node allocations for other (i.e. non-tmpfs)
memory needs.

Tightly synchronized HPC applications with large (on the order of
a single node's total RAM) per-task memory requirements are an example
of a type of application which benefits from this change.  Other
types of workloads may prefer local tmpfs allocations, thus a mount
option is provided.

The new mount option is "interleave=", and defaults to 0.  Any non-zero
setting turns on interleaving.  Interleaving behavior can be changed
via a remount operation.
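
(For example, with this patch applied, a mount like
'mount -t tmpfs -o size=1g,interleave=1 tmpfs /mnt/scratch' should enable
interleaving for that filesystem, and
'mount -o remount,interleave=0 /mnt/scratch' should switch it back to the
default local placement; the size and mount point are illustrative only.)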

Signed-off-by: Brent Casavant <bcasavan@sgi.com>

Index: linux/mm/mempolicy.c
===================================================================
--- linux.orig/mm/mempolicy.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/mempolicy.c	2004-11-03 10:26:30.000000000 -0600
@@ -1027,6 +1027,28 @@
 	return 0;
 }

+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave)
+{
+	info->root = RB_ROOT;
+	init_MUTEX(&info->sem);
+
+	if (unlikely(interleave)) {
+		struct mempolicy *newpol;
+
+		/* Falls back to MPOL_DEFAULT on any error */
+		newpol = mpol_new(MPOL_INTERLEAVE, nodes_addr(node_online_map));
+		if (likely(!IS_ERR(newpol))) {
+			/* Create pseudo-vma that contains just the policy */
+			struct vm_area_struct pvma;
+
+			memset(&pvma, 0, sizeof(struct vm_area_struct));
+			/* Policy covers entire file */
+			pvma.vm_end = ~0UL;
+			mpol_set_shared_policy(info, &pvma, newpol);
+		}
+	}
+}
+
 int mpol_set_shared_policy(struct shared_policy *info,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
Index: linux/fs/hugetlbfs/inode.c
===================================================================
--- linux.orig/fs/hugetlbfs/inode.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/fs/hugetlbfs/inode.c	2004-11-03 10:26:30.000000000 -0600
@@ -384,7 +384,7 @@
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy);
+		mpol_shared_policy_init(&info->policy, 0);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
Index: linux/include/linux/mempolicy.h
===================================================================
--- linux.orig/include/linux/mempolicy.h	2004-11-03 10:24:16.000000000 -0600
+++ linux/include/linux/mempolicy.h	2004-11-03 10:26:30.000000000 -0600
@@ -137,11 +137,7 @@
 	struct semaphore sem;
 };

-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
-	info->root = RB_ROOT;
-	init_MUTEX(&info->sem);
-}
+void mpol_shared_policy_init(struct shared_policy *info, unsigned interleave);

 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
@@ -198,7 +194,8 @@
 	return -EINVAL;
 }

-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+					unsigned interleave)
 {
 }

Index: linux/mm/shmem.c
===================================================================
--- linux.orig/mm/shmem.c	2004-11-03 10:24:16.000000000 -0600
+++ linux/mm/shmem.c	2004-11-05 17:41:46.000000000 -0600
@@ -66,6 +66,9 @@
 #define SHMEM_PAGEIN	 VM_READ
 #define SHMEM_TRUNCATE	 VM_WRITE

+/* sbinfo flags */
+#define SHMEM_INTERLEAVE 0x1	/* Interleave tmpfs allocations */
+
 /* Pretend that each entry is of this size in directory's i_size */
 #define BOGO_DIRENT_SIZE 20

@@ -1236,7 +1239,8 @@
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
- 		mpol_shared_policy_init(&info->policy);
+ 		mpol_shared_policy_init(&info->policy,
+			sbinfo && (sbinfo->flags & SHMEM_INTERLEAVE) ? 1 : 0);
 		INIT_LIST_HEAD(&info->swaplist);

 		switch (mode & S_IFMT) {
@@ -1784,7 +1788,9 @@
 #endif
 };

-static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes)
+static int shmem_parse_options(char *options, int *mode, uid_t *uid,
+				gid_t *gid, unsigned long *blocks,
+				unsigned long *inodes, int *interleave)
 {
 	char *this_char, *value, *rest;

@@ -1838,6 +1844,12 @@
 			*gid = simple_strtoul(value,&rest,0);
 			if (*rest)
 				goto bad_val;
+		} else if (!strcmp(this_char,"interleave")) {
+			if (!interleave)
+				continue;
+			*interleave = (0 != simple_strtoul(value,&rest,0));
+			if (*rest)
+				goto bad_val;
 		} else {
 			printk(KERN_ERR "tmpfs: Bad mount option %s\n",
 			       this_char);
@@ -1858,12 +1870,14 @@
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 	unsigned long max_blocks = 0;
 	unsigned long max_inodes = 0;
+	int interleave = 0;

 	if (sbinfo) {
 		max_blocks = sbinfo->max_blocks;
 		max_inodes = sbinfo->max_inodes;
 	}
-	if (shmem_parse_options(data, NULL, NULL, NULL, &max_blocks, &max_inodes))
+	if (shmem_parse_options(data, NULL, NULL, NULL, &max_blocks,
+				&max_inodes, &interleave))
 		return -EINVAL;
 	/* Keep it simple: disallow limited <-> unlimited remount */
 	if ((max_blocks || max_inodes) == !sbinfo)
@@ -1871,6 +1885,12 @@
 	/* But allow the pointless unlimited -> unlimited remount */
 	if (!sbinfo)
 		return 0;
+	/* Allow change of interleaving */
+	if (interleave)
+		sbinfo->flags |= SHMEM_INTERLEAVE;
+	else
+		sbinfo->flags &= ~SHMEM_INTERLEAVE;
+
 	return shmem_set_size(sbinfo, max_blocks, max_inodes);
 }
 #endif
@@ -1900,6 +1920,7 @@
 #ifdef CONFIG_TMPFS
 	unsigned long blocks = 0;
 	unsigned long inodes = 0;
+	int interleave = 0;

 	/*
 	 * Per default we only allow half of the physical ram per
@@ -1912,12 +1933,12 @@
 		if (inodes > blocks)
 			inodes = blocks;

-		if (shmem_parse_options(data, &mode,
-					&uid, &gid, &blocks, &inodes))
+		if (shmem_parse_options(data, &mode, &uid, &gid, &blocks,
+					&inodes, &interleave))
 			return -EINVAL;
 	}

-	if (blocks || inodes) {
+	if (blocks || inodes || interleave) {
 		struct shmem_sb_info *sbinfo;
 		sbinfo = kmalloc(sizeof(struct shmem_sb_info), GFP_KERNEL);
 		if (!sbinfo)
@@ -1928,6 +1949,9 @@
 		sbinfo->free_blocks = blocks;
 		sbinfo->max_inodes = inodes;
 		sbinfo->free_inodes = inodes;
+		sbinfo->flags = 0;
+		if (interleave)
+			sbinfo->flags |= SHMEM_INTERLEAVE;
 	}
 	sb->s_xattr = shmem_xattr_handlers;
 #endif
Index: linux/include/linux/shmem_fs.h
===================================================================
--- linux.orig/include/linux/shmem_fs.h	2004-10-18 16:54:55.000000000 -0500
+++ linux/include/linux/shmem_fs.h	2004-11-05 17:35:25.000000000 -0600
@@ -26,6 +26,7 @@
 	unsigned long free_blocks;  /* How many are left for allocation */
 	unsigned long max_inodes;   /* How many inodes are allowed */
 	unsigned long free_inodes;  /* How many are left for allocation */
+	unsigned long flags;
 	spinlock_t    stat_lock;
 };


-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-08 19:58                     ` Brent Casavant
@ 2004-11-08 20:57                       ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-08 20:57 UTC (permalink / raw)
  To: Brent Casavant; +Cc: Andi Kleen, colpatch, Hugh Dickins, linux-kernel, linux-mm

>> Matt has volunteered to write the mount option for this, so let's hold
>> off for a couple of days until that's done.
> 
> I had the time to do this myself.  Updated patch attached below.
> 
> Description:
> 
> This patch creates a tmpfs mount option and adds code which causes
> tmpfs allocations to be interleaved across the nodes of a NUMA machine.
> This is useful in situations where a large tmpfs file would otherwise
> consume most of the memory on a single node, forcing tasks running on
> that node to perform off-node allocations for other (i.e. non-tmpfs)
> memory needs.
> 
> Tightly synchronized HPC applications with large (on the order of
> a single node's total RAM) per-task memory requirements are an example
> of a type of application which benefits from this change.  Other
> types of workloads may prefer local tmpfs allocations, thus a mount
> option is provided.
> 
> The new mount option is "interleave=", and defaults to 0.  Any non-zero
> setting turns on interleaving.  Interleaving behavior can be changed
> via a remount operation.

Looks good to me, though I don't really know that area of code very well.
Hopefully Hugh or someone can give it a better review.

Thanks for doing that,

M.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-08 19:58                     ` Brent Casavant
@ 2004-11-09 19:04                       ` Hugh Dickins
  -1 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2004-11-09 19:04 UTC (permalink / raw)
  To: Brent Casavant
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

On Mon, 8 Nov 2004, Brent Casavant wrote:
> On Wed, 3 Nov 2004, Martin J. Bligh wrote:
> 
> > Matt has volunteered to write the mount option for this, so let's hold
> > off for a couple of days until that's done.
> 
> I had the time to do this myself.  Updated patch attached below.

Looks pretty good to me.

Doesn't quite play right with what was my "NULL sbinfo" convention.
Given this mpol patch of yours, and Adam's devfs patch, it's becoming
clear that my "NULL sbinfo" was unhelpful, making life harder for both
of you to add things into the tmpfs superblock - unchanged since 2.4.0,
as soon as I mess with it, people come up with valid new uses for it.

Not to say that your patch or Adam's will go further (I've no objection
to the way Adam is using tmpfs, but no opinion on the future of devfs),
but they're two hints that I should rework that to get out of people's
way.  I'll do a patch for that, then another something like yours on
top, for you to go back and check.

I think the option should be "mpol=interleave" rather than just
"interleave", who knows what baroque mpols we might want to support
there in future?

I'm irritated to realize that we can't change the default for SysV
shared memory or /dev/zero this way, because that mount is internal.
But neither you nor Andi were wanting that, so okay, never mind;
could use his sysctl instead if the need emerges.

At one time (August) you were worried about MPOL_INTERLEAVE
overloading node 0 on small files - is that still a worry?
Perhaps you skirt that issue in recommending this option
for use with giant files.

There are quite a lot of mpol patches flying around, aren't there?
From Ray Bryant and from Steve Longerbeam.  Would this tmpfs patch
make (adaptable) sense if we went either or both of those ways - or
have they been knocked on the head?  I don't mean in the details
(I think one of them does away with the pseudo-vma stuff - great!),
but in basic design - would your mount option mesh together well
with them, or would it be adding a further layer of confusion?

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-09 19:04                       ` Hugh Dickins
@ 2004-11-09 20:09                         ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-09 20:09 UTC (permalink / raw)
  To: Hugh Dickins, Brent Casavant
  Cc: Andi Kleen, Adam J. Richter, colpatch, linux-kernel, linux-mm

> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Sounds sensible.
 
> I'm irritated to realize that we can't change the default for SysV
> shared memory or /dev/zero this way, because that mount is internal.

Boggle. shmem I can perfectly understand, and have been intending to
change for a while. But why /dev/zero ? Presumably you'd always want
that local?

Thanks,

M.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-09 20:09                         ` Martin J. Bligh
@ 2004-11-09 21:08                           ` Hugh Dickins
  -1 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2004-11-09 21:08 UTC (permalink / raw)
  To: Martin J. Bligh
  Cc: Brent Casavant, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

On Tue, 9 Nov 2004, Martin J. Bligh wrote:
>  
> > I'm irritated to realize that we can't change the default for SysV
> > shared memory or /dev/zero this way, because that mount is internal.
> 
> Boggle. shmem I can perfectly understand, and have been intending to
> change for a while. But why /dev/zero ? Presumably you'd always want
> that local?

I was meaning the mmap shared writable of /dev/zero, to get memory
shared between parent and child and descendants, a restricted form
of shared memory.  I was thinking of them running on different cpus,
you're suggesting they'd at least be on the same node.  I dare say,
I don't know.  I'm not desperate to be able to set some other mpol
default for all of them (and each object can be set in the established
way), just would have been happier if the possibility of doing so came
for free with the mount option work.
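
For concreteness, the construct described above is the classic one below:
a MAP_SHARED mapping of /dev/zero stays shared across fork(), giving a
restricted form of shared memory between parent and descendants.  A minimal
sketch, with error handling omitted:

#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/* Shared region backed by /dev/zero: children created by fork() after
 * this call see the parent's stores and vice versa. */
static void *dev_zero_shared(size_t len)
{
	int fd = open("/dev/zero", O_RDWR);
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);

	close(fd);	/* the mapping keeps the underlying object alive */
	return p;	/* MAP_FAILED on error */
}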

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-09 21:08                           ` Hugh Dickins
@ 2004-11-09 22:07                             ` Martin J. Bligh
  -1 siblings, 0 replies; 49+ messages in thread
From: Martin J. Bligh @ 2004-11-09 22:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Brent Casavant, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

--On Tuesday, November 09, 2004 21:08:11 +0000 Hugh Dickins <hugh@veritas.com> wrote:

> On Tue, 9 Nov 2004, Martin J. Bligh wrote:
>>  
>> > I'm irritated to realize that we can't change the default for SysV
>> > shared memory or /dev/zero this way, because that mount is internal.
>> 
>> Boggle. shmem I can perfectly understand, and have been intending to
>> change for a while. But why /dev/zero ? Presumably you'd always want
>> that local?
> 
> I was meaning the mmap shared writable of /dev/zero, to get memory
> shared between parent and child and descendants, a restricted form
> of shared memory.  I was thinking of them running on different cpus,
> you're suggesting they'd at least be on the same node.  I dare say,
> I don't know.  I'm not desperate to be able to set some other mpol
> default for all of them (and each object can be set in the established
> way), just would have been happier if the possibility of doing so came
> for free with the mount option work.

Oh yeah ... the anon mem allocator trick. Mmmm. Not sure that should
have a different default than normal alloced memory, but either way,
what you're suggesting makes a whole lot more sense to me now than
just straight /dev/zero ;-)

Thanks for the explanation.

M.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-09 19:04                       ` Hugh Dickins
@ 2004-11-10  2:41                         ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-10  2:41 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

Argh.  I fatfingered my mail client and deleted my response rather
than send it this morning.  Sorry for the delay.

On Tue, 9 Nov 2004, Hugh Dickins wrote:

> Doesn't quite play right with what was my "NULL sbinfo" convention.

Howso?  I thought it played quite nicely with it.  We've been using
NULL sbinfo as an indicator that an inode is from tmpfs rather than
from SysV or /dev/zero.  Or at least that's the way my brain was
wrapped around it.

> Given this mpol patch of yours, and Adam's devfs patch, it's becoming
> clear that my "NULL sbinfo" was unhelpful, making life harder for both
> of you to add things into the tmpfs superblock - unchanged since 2.4.0,
> as soon as I mess with it, people come up with valid new uses for it.

I haven't seen that other patch, but in this case I didn't see a problem.
The NULL sbinfo scheme worked perfectly for me, with very little hassle.

> Not to say that your patch or Adam's will go further (I've no objection
> to the way Adam is using tmpfs, but no opinion on the future of devfs),
> but they're two hints that I should rework that to get out of people's
> way.  I'll do a patch for that, then another something like yours on
> top, for you to go back and check.

Is this something imminent, or on the "someday" queue?  Just asking
because I'd like to avoid doing additional work that might get thrown
away soon.

> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Makes sense to me.  I'll be happy to do it, pending your answer to
my preceding question.

> I'm irritated to realize that we can't change the default for SysV
> shared memory or /dev/zero this way, because that mount is internal.

Well, the only thing preventing this is that I stuck the flag into
sbinfo, since it's a filesystem-wide setting.  I don't see any reason
we couldn't add a new flag in the inode info flag field instead.  I
think there would also be some work to set pvma.vm_end more precisely
(in mpol_shared_policy_init()) in the SysV case.
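
A rough sketch of the kind of change being discussed (illustrative only,
not a posted patch): give mpol_shared_policy_init() the object size, so
the pseudo-vma covers exactly the SysV segment rather than the whole
offset range.  The extra "size" argument and its callers are assumptions.

	void mpol_shared_policy_init(struct shared_policy *info, int policy,
				     unsigned long size)
	{
		info->root = RB_ROOT;
		spin_lock_init(&info->lock);

		if (policy != MPOL_DEFAULT) {
			struct mempolicy *newpol;

			/* Falls back to MPOL_DEFAULT on any error */
			newpol = mpol_new(policy, nodes_addr(node_online_map));
			if (!IS_ERR(newpol)) {
				/* Pseudo-vma covering just this object */
				struct vm_area_struct pvma;

				memset(&pvma, 0, sizeof(struct vm_area_struct));
				/* e.g. the segment size, not ~0UL */
				pvma.vm_end = size;
				mpol_set_shared_policy(info, &pvma, newpol);
			}
		}
	}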

> At one time (August) you were worried about MPOL_INTERLEAVE
> overloading node 0 on small files - is that still a worry?
> Perhaps you skirt that issue in recommending this option
> for use with giant files.

Yeah, there's still a bit of concern about that, but it's dwarfed
in comparison.  Taking care of that would be relatively easy,
adding something like the inode number (or other inode-constant
"randomness") in as a offset to pvma.vm_pgoff in shmem_alloc_page()
(or maybe a bit higher up the call-chain, I'd have to look closer).
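
To illustrate the concern (a user-space toy, not the kernel change itself):
with plain interleaving, page 0 of every file maps to node 0, so a pile
of small files all lands there; biasing the interleave index with a
per-inode constant such as the inode number staggers the starting node.
The 4-node count and the inode numbers are arbitrary assumptions.

	#include <stdio.h>

	#define NNODES 4

	/* node chosen for a page: interleave index modulo node count */
	static int interleave_node(unsigned long pgoff, unsigned long bias)
	{
		return (pgoff + bias) % NNODES;
	}

	int main(void)
	{
		unsigned long ino;

		for (ino = 100; ino < 104; ino++)
			printf("inode %lu: first page on node %d "
			       "(unbiased: node %d)\n",
			       ino, interleave_node(0, ino),
			       interleave_node(0, 0));
		return 0;
	}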

> There are quite a lot of mpol patches flying around, aren't there?

Yep.  SGI solved some of these problems for our own 2.4.x kernel
distributions, but now we want to get things settled out in the
mainline kernel.  So far I think we're hitting distinct chunks
of code.

> From Ray Bryant and from Steve Longerbeam.  Would this tmpfs patch
> make (adaptable) sense if we went either or both of those ways - or
> have they been knocked on the head?  I don't mean in the details
> (I think one of them does away with the pseudo-vma stuff - great!),
> but in basic design - would your mount option mesh together well
> with them, or would it be adding a further layer of confusion?

I see what you mean.  I believe Ray's work is addressing the buffer
cache in general.  I'll try to touch base with him again soon (I
admit to losing track of what he's been doing).

Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-10  2:41                         ` Brent Casavant
@ 2004-11-10 14:20                           ` Hugh Dickins
  -1 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2004-11-10 14:20 UTC (permalink / raw)
  To: Brent Casavant
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

On Tue, 9 Nov 2004, Brent Casavant wrote:
> On Tue, 9 Nov 2004, Hugh Dickins wrote:
> 
> > Doesn't quite play right with what was my "NULL sbinfo" convention.
> 
> Howso?  I thought it played quite nicely with it.  We've been using
> NULL sbinfo as an indicator that an inode is from tmpfs rather than
> from SysV or /dev/zero.  Or at least that's the way my brain was
> wrapped around it.

That was the case you cared about, but remember I extended yours so
that tmpfs mounts could also suppress limiting, and get NULL sbinfo.

> The NULL sbinfo scheme worked perfectly for me, with very little hassle.

Yes, it would have worked just right for the important cases.

> > but they're two hints that I should rework that to get out of people's
> > way.  I'll do a patch for that, then another something like yours on
> > top, for you to go back and check.
> 
> Is this something imminent, or on the "someday" queue?  Just asking
> because I'd like to avoid doing additional work that might get thrown
> away soon.

I understand your concern ;)  I'm working on it, today or tomorrow.

> > I'm irritated to realize that we can't change the default for SysV
> > shared memory or /dev/zero this way, because that mount is internal.
> 
> Well, the only thing preventing this is that I stuck the flag into
> sbinfo, since it's a filesystem-wide setting.  I don't see any reason
> we couldn't add a new flag in the inode info flag field instead.  I
> think there would also be some work to set pvma.vm_end more precisely
> (in mpol_shared_policy_init()) in the SysV case.

It's not a matter of where to store the info, it's that we don't have
a user interface for remounting something that's not mounted anywhere.

Hugh


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-09 19:04                       ` Hugh Dickins
@ 2004-11-11 19:48                         ` Hugh Dickins
  -1 siblings, 0 replies; 49+ messages in thread
From: Hugh Dickins @ 2004-11-11 19:48 UTC (permalink / raw)
  To: Brent Casavant
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1344 bytes --]

On Tue, 9 Nov 2004, Hugh Dickins wrote:
> 
> Not to say that your patch or Adam's will go further (I've no objection
> to the way Adam is using tmpfs, but no opinion on the future of devfs),
> but they're two hints that I should rework that to get out of people's
> way.  I'll do a patch for that, then another something like yours on
> top, for you to go back and check.
> 
> I think the option should be "mpol=interleave" rather than just
> "interleave", who knows what baroque mpols we might want to support
> there in future?

Okay, two pattachments.

The first (against 2.6.10-rc1-mm5) being my reversion of NULL sbinfo
in shmem.c, to make it easier for others to add things into sbinfo
without having to worry about NULL cases.  So that goes back to
allocating an sbinfo even for the internal mount: I've rounded up to
L1_CACHE_BYTES to avoid false sharing, but even so, please test it out
on your 512-way to make sure I haven't screwed up the scalability we
got before - thanks.  If you find it okay, I'll send to akpm soonish.

The second (against the first) being my take on your patch, with
mpol=interleave, and minor alterations which may irritate you so much
you'll revert them immediately! (mainly, using MPOL_INTERLEAVE and
MPOL_DEFAULT within shmem.c rather than defining separate flags).
Only slightly tested at this end.

Hugh

[-- Attachment #2: revert NULL sbinfo --]
[-- Type: TEXT/PLAIN, Size: 8851 bytes --]

--- 2.6.10-rc1-mm5/Documentation/filesystems/tmpfs.txt	2004-10-18 22:55:53.000000000 +0100
+++ tmpfs1/Documentation/filesystems/tmpfs.txt	2004-11-11 19:24:14.438121120 +0000
@@ -71,8 +71,8 @@ can be changed on remount.  The size par
 to limit this tmpfs instance to that percentage of your physical RAM:
 the default, when neither size nor nr_blocks is specified, is size=50%
 
-If both nr_blocks (or size) and nr_inodes are set to 0, neither blocks
-nor inodes will be limited in that instance.  It is generally unwise to
+If nr_blocks=0 (or size=0), blocks will not be limited in that instance;
+if nr_inodes=0, inodes will not be limited.  It is generally unwise to
 mount with such options, since it allows any user with write access to
 use up all the memory on the machine; but enhances the scalability of
 that instance in a system with many cpus making intensive use of it.
@@ -97,4 +97,4 @@ RAM/SWAP in 10240 inodes and it is only 
 Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
 Updated:
-   Hugh Dickins <hugh@veritas.com>, 01 September 2004
+   Hugh Dickins <hugh@veritas.com>, 10 November 2004
--- 2.6.10-rc1-mm5/mm/shmem.c	2004-11-11 12:40:13.000000000 +0000
+++ tmpfs1/mm/shmem.c	2004-11-11 19:24:14.441120664 +0000
@@ -194,7 +194,7 @@ static spinlock_t shmem_swaplist_lock = 
 static void shmem_free_blocks(struct inode *inode, long pages)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-	if (sbinfo) {
+	if (sbinfo->max_blocks) {
 		spin_lock(&sbinfo->stat_lock);
 		sbinfo->free_blocks += pages;
 		inode->i_blocks -= pages*BLOCKS_PER_PAGE;
@@ -357,7 +357,7 @@ static swp_entry_t *shmem_swp_alloc(stru
 		 * page (and perhaps indirect index pages) yet to allocate:
 		 * a waste to allocate index if we cannot allocate data.
 		 */
-		if (sbinfo) {
+		if (sbinfo->max_blocks) {
 			spin_lock(&sbinfo->stat_lock);
 			if (sbinfo->free_blocks <= 1) {
 				spin_unlock(&sbinfo->stat_lock);
@@ -678,8 +678,8 @@ static void shmem_delete_inode(struct in
 			spin_unlock(&shmem_swaplist_lock);
 		}
 	}
-	if (sbinfo) {
-		BUG_ON(inode->i_blocks);
+	BUG_ON(inode->i_blocks);
+	if (sbinfo->max_inodes) {
 		spin_lock(&sbinfo->stat_lock);
 		sbinfo->free_inodes++;
 		spin_unlock(&sbinfo->stat_lock);
@@ -1081,7 +1081,7 @@ repeat:
 	} else {
 		shmem_swp_unmap(entry);
 		sbinfo = SHMEM_SB(inode->i_sb);
-		if (sbinfo) {
+		if (sbinfo->max_blocks) {
 			spin_lock(&sbinfo->stat_lock);
 			if (sbinfo->free_blocks == 0 ||
 			    shmem_acct_block(info->flags)) {
@@ -1269,7 +1269,7 @@ shmem_get_inode(struct super_block *sb, 
 	struct shmem_inode_info *info;
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 
-	if (sbinfo) {
+	if (sbinfo->max_inodes) {
 		spin_lock(&sbinfo->stat_lock);
 		if (!sbinfo->free_inodes) {
 			spin_unlock(&sbinfo->stat_lock);
@@ -1319,31 +1319,6 @@ shmem_get_inode(struct super_block *sb, 
 }
 
 #ifdef CONFIG_TMPFS
-
-static int shmem_set_size(struct shmem_sb_info *sbinfo,
-			  unsigned long max_blocks, unsigned long max_inodes)
-{
-	int error;
-	unsigned long blocks, inodes;
-
-	spin_lock(&sbinfo->stat_lock);
-	blocks = sbinfo->max_blocks - sbinfo->free_blocks;
-	inodes = sbinfo->max_inodes - sbinfo->free_inodes;
-	error = -EINVAL;
-	if (max_blocks < blocks)
-		goto out;
-	if (max_inodes < inodes)
-		goto out;
-	error = 0;
-	sbinfo->max_blocks  = max_blocks;
-	sbinfo->free_blocks = max_blocks - blocks;
-	sbinfo->max_inodes  = max_inodes;
-	sbinfo->free_inodes = max_inodes - inodes;
-out:
-	spin_unlock(&sbinfo->stat_lock);
-	return error;
-}
-
 static struct inode_operations shmem_symlink_inode_operations;
 static struct inode_operations shmem_symlink_inline_operations;
 
@@ -1598,15 +1573,17 @@ static int shmem_statfs(struct super_blo
 	buf->f_type = TMPFS_MAGIC;
 	buf->f_bsize = PAGE_CACHE_SIZE;
 	buf->f_namelen = NAME_MAX;
-	if (sbinfo) {
-		spin_lock(&sbinfo->stat_lock);
+	spin_lock(&sbinfo->stat_lock);
+	if (sbinfo->max_blocks) {
 		buf->f_blocks = sbinfo->max_blocks;
 		buf->f_bavail = buf->f_bfree = sbinfo->free_blocks;
+	}
+	if (sbinfo->max_inodes) {
 		buf->f_files = sbinfo->max_inodes;
 		buf->f_ffree = sbinfo->free_inodes;
-		spin_unlock(&sbinfo->stat_lock);
 	}
 	/* else leave those fields 0 like simple_statfs */
+	spin_unlock(&sbinfo->stat_lock);
 	return 0;
 }
 
@@ -1663,7 +1640,7 @@ static int shmem_link(struct dentry *old
 	 * but each new link needs a new dentry, pinning lowmem, and
 	 * tmpfs dentries cannot be pruned until they are unlinked.
 	 */
-	if (sbinfo) {
+	if (sbinfo->max_inodes) {
 		spin_lock(&sbinfo->stat_lock);
 		if (!sbinfo->free_inodes) {
 			spin_unlock(&sbinfo->stat_lock);
@@ -1688,7 +1665,7 @@ static int shmem_unlink(struct inode *di
 
 	if (inode->i_nlink > 1 && !S_ISDIR(inode->i_mode)) {
 		struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
-		if (sbinfo) {
+		if (sbinfo->max_inodes) {
 			spin_lock(&sbinfo->stat_lock);
 			sbinfo->free_inodes++;
 			spin_unlock(&sbinfo->stat_lock);
@@ -1912,22 +1889,42 @@ bad_val:
 static int shmem_remount_fs(struct super_block *sb, int *flags, char *data)
 {
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
-	unsigned long max_blocks = 0;
-	unsigned long max_inodes = 0;
+	unsigned long max_blocks = sbinfo->max_blocks;
+	unsigned long max_inodes = sbinfo->max_inodes;
+	unsigned long blocks;
+	unsigned long inodes;
+	int error = -EINVAL;
 
-	if (sbinfo) {
-		max_blocks = sbinfo->max_blocks;
-		max_inodes = sbinfo->max_inodes;
-	}
-	if (shmem_parse_options(data, NULL, NULL, NULL, &max_blocks, &max_inodes))
-		return -EINVAL;
-	/* Keep it simple: disallow limited <-> unlimited remount */
-	if ((max_blocks || max_inodes) == !sbinfo)
-		return -EINVAL;
-	/* But allow the pointless unlimited -> unlimited remount */
-	if (!sbinfo)
-		return 0;
-	return shmem_set_size(sbinfo, max_blocks, max_inodes);
+	if (shmem_parse_options(data, NULL, NULL, NULL,
+				&max_blocks, &max_inodes))
+		return error;
+
+	spin_lock(&sbinfo->stat_lock);
+	blocks = sbinfo->max_blocks - sbinfo->free_blocks;
+	inodes = sbinfo->max_inodes - sbinfo->free_inodes;
+	if (max_blocks < blocks)
+		goto out;
+	if (max_inodes < inodes)
+		goto out;
+	/*
+	 * Those tests also disallow limited->unlimited while any are in
+	 * use, so i_blocks will always be zero when max_blocks is zero;
+	 * but we must separately disallow unlimited->limited, because
+	 * in that case we have no record of how much is already in use.
+	 */
+	if (max_blocks && !sbinfo->max_blocks)
+		goto out;
+	if (max_inodes && !sbinfo->max_inodes)
+		goto out;
+
+	error = 0;
+	sbinfo->max_blocks  = max_blocks;
+	sbinfo->free_blocks = max_blocks - blocks;
+	sbinfo->max_inodes  = max_inodes;
+	sbinfo->free_inodes = max_inodes - inodes;
+out:
+	spin_unlock(&sbinfo->stat_lock);
+	return error;
 }
 #endif
 
@@ -1952,11 +1949,11 @@ static int shmem_fill_super(struct super
 	uid_t uid = current->fsuid;
 	gid_t gid = current->fsgid;
 	int err = -ENOMEM;
-
-#ifdef CONFIG_TMPFS
+	struct shmem_sb_info *sbinfo;
 	unsigned long blocks = 0;
 	unsigned long inodes = 0;
 
+#ifdef CONFIG_TMPFS
 	/*
 	 * Per default we only allow half of the physical ram per
 	 * tmpfs instance, limiting inodes to one per page of lowmem;
@@ -1967,34 +1964,34 @@ static int shmem_fill_super(struct super
 		inodes = totalram_pages - totalhigh_pages;
 		if (inodes > blocks)
 			inodes = blocks;
-
-		if (shmem_parse_options(data, &mode,
-					&uid, &gid, &blocks, &inodes))
+		if (shmem_parse_options(data, &mode, &uid, &gid,
+					&blocks, &inodes))
 			return -EINVAL;
 	}
-
-	if (blocks || inodes) {
-		struct shmem_sb_info *sbinfo;
-		sbinfo = kmalloc(sizeof(struct shmem_sb_info), GFP_KERNEL);
-		if (!sbinfo)
-			return -ENOMEM;
-		sb->s_fs_info = sbinfo;
-		spin_lock_init(&sbinfo->stat_lock);
-		sbinfo->max_blocks = blocks;
-		sbinfo->free_blocks = blocks;
-		sbinfo->max_inodes = inodes;
-		sbinfo->free_inodes = inodes;
-	}
-	sb->s_xattr = shmem_xattr_handlers;
 #else
 	sb->s_flags |= MS_NOUSER;
 #endif
 
+	/* Round up to L1_CACHE_BYTES to resist false sharing */
+	sbinfo = kmalloc(max((int)sizeof(struct shmem_sb_info),
+				L1_CACHE_BYTES), GFP_KERNEL);
+	if (!sbinfo)
+		return -ENOMEM;
+
+	spin_lock_init(&sbinfo->stat_lock);
+	sbinfo->max_blocks = blocks;
+	sbinfo->free_blocks = blocks;
+	sbinfo->max_inodes = inodes;
+	sbinfo->free_inodes = inodes;
+
+	sb->s_fs_info = sbinfo;
 	sb->s_maxbytes = SHMEM_MAX_BYTES;
 	sb->s_blocksize = PAGE_CACHE_SIZE;
 	sb->s_blocksize_bits = PAGE_CACHE_SHIFT;
 	sb->s_magic = TMPFS_MAGIC;
 	sb->s_op = &shmem_ops;
+	sb->s_xattr = shmem_xattr_handlers;
+
 	inode = shmem_get_inode(sb, S_IFDIR | mode, 0);
 	if (!inode)
 		goto failed;

[-- Attachment #3: mpol=interleave --]
[-- Type: TEXT/PLAIN, Size: 6237 bytes --]

--- tmpfs1/Documentation/filesystems/tmpfs.txt	2004-11-11 19:24:14.438121120 +0000
+++ tmpfs2/Documentation/filesystems/tmpfs.txt	2004-11-11 19:24:32.173424944 +0000
@@ -78,6 +78,12 @@ use up all the memory on the machine; bu
 that instance in a system with many cpus making intensive use of it.
 
 
+tmpfs has a mount option to set the NUMA memory allocation policy for
+all files in that instance:
+mpol=interleave		prefers to allocate memory from each node in turn
+mpol=default		prefers to allocate memory from the local node
+
+
 To specify the initial root directory you can use the following mount
 options:
 
--- tmpfs1/fs/hugetlbfs/inode.c	2004-11-11 12:39:59.000000000 +0000
+++ tmpfs2/fs/hugetlbfs/inode.c	2004-11-11 19:24:32.174424792 +0000
@@ -400,7 +400,7 @@ static struct inode *hugetlbfs_get_inode
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
 		info = HUGETLBFS_I(inode);
-		mpol_shared_policy_init(&info->policy);
+		mpol_shared_policy_init(&info->policy, MPOL_DEFAULT);
 		switch (mode & S_IFMT) {
 		default:
 			init_special_inode(inode, mode, dev);
--- tmpfs1/include/linux/mempolicy.h	2004-11-11 12:40:09.000000000 +0000
+++ tmpfs2/include/linux/mempolicy.h	2004-11-11 19:24:32.175424640 +0000
@@ -137,12 +137,7 @@ struct shared_policy {
 	spinlock_t lock;
 };
 
-static inline void mpol_shared_policy_init(struct shared_policy *info)
-{
-	info->root = RB_ROOT;
-	spin_lock_init(&info->lock);
-}
-
+void mpol_shared_policy_init(struct shared_policy *info, int policy);
 int mpol_set_shared_policy(struct shared_policy *info,
 				struct vm_area_struct *vma,
 				struct mempolicy *new);
@@ -198,7 +193,8 @@ static inline int mpol_set_shared_policy
 	return -EINVAL;
 }
 
-static inline void mpol_shared_policy_init(struct shared_policy *info)
+static inline void mpol_shared_policy_init(struct shared_policy *info,
+					int policy)
 {
 }
 
--- tmpfs1/include/linux/shmem_fs.h	2004-10-18 22:56:50.000000000 +0100
+++ tmpfs2/include/linux/shmem_fs.h	2004-11-11 19:24:32.175424640 +0000
@@ -26,6 +26,7 @@ struct shmem_sb_info {
 	unsigned long free_blocks;  /* How many are left for allocation */
 	unsigned long max_inodes;   /* How many inodes are allowed */
 	unsigned long free_inodes;  /* How many are left for allocation */
+	int policy;		    /* Default NUMA memory alloc policy */
 	spinlock_t    stat_lock;
 };
 
--- tmpfs1/mm/mempolicy.c	2004-11-11 12:40:12.000000000 +0000
+++ tmpfs2/mm/mempolicy.c	2004-11-11 19:24:32.176424488 +0000
@@ -1060,6 +1060,28 @@ restart:
 	return 0;
 }
 
+void mpol_shared_policy_init(struct shared_policy *info, int policy)
+{
+	info->root = RB_ROOT;
+	spin_lock_init(&info->lock);
+
+	if (policy != MPOL_DEFAULT) {
+		struct mempolicy *newpol;
+
+		/* Falls back to MPOL_DEFAULT on any error */
+		newpol = mpol_new(policy, nodes_addr(node_online_map));
+		if (!IS_ERR(newpol)) {
+			/* Create pseudo-vma that contains just the policy */
+			struct vm_area_struct pvma;
+
+			memset(&pvma, 0, sizeof(struct vm_area_struct));
+			/* Policy covers entire file */
+			pvma.vm_end = ~0UL;
+			mpol_set_shared_policy(info, &pvma, newpol);
+		}
+	}
+}
+
 int mpol_set_shared_policy(struct shared_policy *info,
 			struct vm_area_struct *vma, struct mempolicy *npol)
 {
--- tmpfs1/mm/shmem.c	2004-11-11 19:24:14.441120664 +0000
+++ tmpfs2/mm/shmem.c	2004-11-11 19:24:32.178424184 +0000
@@ -1292,7 +1292,7 @@ shmem_get_inode(struct super_block *sb, 
 		info = SHMEM_I(inode);
 		memset(info, 0, (char *)inode - (char *)info);
 		spin_lock_init(&info->lock);
- 		mpol_shared_policy_init(&info->policy);
+ 		mpol_shared_policy_init(&info->policy, sbinfo->policy);
 		INIT_LIST_HEAD(&info->swaplist);
 
 		switch (mode & S_IFMT) {
@@ -1817,7 +1817,7 @@ static struct inode_operations shmem_sym
 #endif
 };
 
-static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes)
+static int shmem_parse_options(char *options, int *mode, uid_t *uid, gid_t *gid, unsigned long *blocks, unsigned long *inodes, int *policy)
 {
 	char *this_char, *value, *rest;
 
@@ -1871,6 +1871,13 @@ static int shmem_parse_options(char *opt
 			*gid = simple_strtoul(value,&rest,0);
 			if (*rest)
 				goto bad_val;
+		} else if (!strcmp(this_char,"mpol")) {
+			if (!strcmp(value,"interleave"))
+				*policy = MPOL_INTERLEAVE;
+			else if (!strcmp(value,"default"))
+				*policy = MPOL_DEFAULT;
+			else
+				goto bad_val;
 		} else {
 			printk(KERN_ERR "tmpfs: Bad mount option %s\n",
 			       this_char);
@@ -1891,12 +1898,13 @@ static int shmem_remount_fs(struct super
 	struct shmem_sb_info *sbinfo = SHMEM_SB(sb);
 	unsigned long max_blocks = sbinfo->max_blocks;
 	unsigned long max_inodes = sbinfo->max_inodes;
+	int policy = sbinfo->policy;
 	unsigned long blocks;
 	unsigned long inodes;
 	int error = -EINVAL;
 
 	if (shmem_parse_options(data, NULL, NULL, NULL,
-				&max_blocks, &max_inodes))
+				&max_blocks, &max_inodes, &policy))
 		return error;
 
 	spin_lock(&sbinfo->stat_lock);
@@ -1922,6 +1930,7 @@ static int shmem_remount_fs(struct super
 	sbinfo->free_blocks = max_blocks - blocks;
 	sbinfo->max_inodes  = max_inodes;
 	sbinfo->free_inodes = max_inodes - inodes;
+	sbinfo->policy = policy;
 out:
 	spin_unlock(&sbinfo->stat_lock);
 	return error;
@@ -1952,6 +1961,7 @@ static int shmem_fill_super(struct super
 	struct shmem_sb_info *sbinfo;
 	unsigned long blocks = 0;
 	unsigned long inodes = 0;
+	int policy = MPOL_DEFAULT;
 
 #ifdef CONFIG_TMPFS
 	/*
@@ -1965,7 +1975,7 @@ static int shmem_fill_super(struct super
 		if (inodes > blocks)
 			inodes = blocks;
 		if (shmem_parse_options(data, &mode, &uid, &gid,
-					&blocks, &inodes))
+					&blocks, &inodes, &policy))
 			return -EINVAL;
 	}
 #else
@@ -1983,6 +1993,7 @@ static int shmem_fill_super(struct super
 	sbinfo->free_blocks = blocks;
 	sbinfo->max_inodes = inodes;
 	sbinfo->free_inodes = inodes;
+	sbinfo->policy = policy;
 
 	sb->s_fs_info = sbinfo;
 	sb->s_maxbytes = SHMEM_MAX_BYTES;
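
As a usage sketch once the second patch is applied: mounting a tmpfs
instance with mpol=interleave makes interleaved allocation the default
for every file created in it.  The mount point below is an arbitrary
assumption, and the same can of course be done with mount(8) from the
command line.

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/* roughly: mount -t tmpfs -o mpol=interleave tmpfs /mnt/itlv */
		if (mount("tmpfs", "/mnt/itlv", "tmpfs", 0, "mpol=interleave")) {
			perror("mount");
			return 1;
		}
		printf("tmpfs mounted with mpol=interleave\n");
		return 0;
	}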

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-11 19:48                         ` Hugh Dickins
@ 2004-11-11 23:10                           ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-11 23:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

On Thu, 11 Nov 2004, Hugh Dickins wrote:

> The first (against 2.6.10-rc1-mm5) being my reversion of NULL sbinfo
> in shmem.c, to make it easier for others to add things into sbinfo
> without having to worry about NULL cases.  So that goes back to
> allocating an sbinfo even for the internal mount: I've rounded up to
> L1_CACHE_BYTES to avoid false sharing, but even so, please test it out
> on your 512-way to make sure I haven't screwed up the scalability we
> got before - thanks.  If you find it okay, I'll send to akpm soonish.

I won't be able to get a 512P run in until Monday, due to test machine
availability.  However, runs at 32P and 64P indicate nothing disastrous.
Results seem to be in line with the numbers we were getting when doing
the NULL sbinfo work.

So, thus far a preliminary "Looks good".

> The second (against the first) being my take on your patch, with
> mpol=interleave, and minor alterations which may irritate you so much
> you'll revert them immediately! (mainly, using MPOL_INTERLEAVE and
> MPOL_DEFAULT within shmem.c rather than defining separate flags).
> Only slightly tested at this end.

Seems to work just fine, and I rather like how this was made a bit more
general.  Thumbs up!

Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Use MPOL_INTERLEAVE for tmpfs files
  2004-11-11 19:48                         ` Hugh Dickins
@ 2004-11-15 22:07                           ` Brent Casavant
  -1 siblings, 0 replies; 49+ messages in thread
From: Brent Casavant @ 2004-11-15 22:07 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Martin J. Bligh, Andi Kleen, Adam J. Richter, colpatch,
	linux-kernel, linux-mm

On Thu, 11 Nov 2004, Hugh Dickins wrote:

> The first (against 2.6.10-rc1-mm5) being my reversion of NULL sbinfo
> in shmem.c, to make it easier for others to add things into sbinfo
> without having to worry about NULL cases.  So that goes back to
> allocating an sbinfo even for the internal mount: I've rounded up to
> L1_CACHE_BYTES to avoid false sharing, but even so, please test it out
> on your 512-way to make sure I haven't screwed up the scalability we
> got before - thanks.  If you find it okay, I'll send to akpm soonish.

OK, tried it at 508P (the last 4P had a hardware problem).  The
performance results are all in line with what we'd accomplished
with NULL sbinfo, so the patch is good by me.

Brent

-- 
Brent Casavant                          If you had nothing to fear,
bcasavan@sgi.com                        how then could you be brave?
Silicon Graphics, Inc.                    -- Queen Dama, Source Wars

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread

Thread overview: 49+ messages
2004-11-02  1:07 [PATCH] Use MPOL_INTERLEAVE for tmpfs files Brent Casavant
2004-11-02  1:43 ` Dave Hansen
2004-11-02  9:13 ` Andi Kleen
2004-11-02 15:46 ` Martin J. Bligh
2004-11-02 15:55   ` Andi Kleen
2004-11-02 16:55     ` Martin J. Bligh
2004-11-02 22:17       ` Brent Casavant
2004-11-02 22:51         ` Martin J. Bligh
2004-11-03  1:12           ` Brent Casavant
2004-11-03  1:30             ` Martin J. Bligh
2004-11-03  8:44           ` Hugh Dickins
2004-11-03  9:01             ` Andi Kleen
2004-11-03 16:32               ` Brent Casavant
2004-11-03 21:00                 ` Martin J. Bligh
2004-11-08 19:58                   ` Brent Casavant
2004-11-08 20:57                     ` Martin J. Bligh
2004-11-09 19:04                     ` Hugh Dickins
2004-11-09 20:09                       ` Martin J. Bligh
2004-11-09 21:08                         ` Hugh Dickins
2004-11-09 22:07                           ` Martin J. Bligh
2004-11-10  2:41                       ` Brent Casavant
2004-11-10 14:20                         ` Hugh Dickins
2004-11-11 19:48                       ` Hugh Dickins
2004-11-11 23:10                         ` Brent Casavant
2004-11-15 22:07                         ` Brent Casavant
