* [RFC PATCH 0/4]: affinity-on-next-touch
@ 2009-05-11  8:27 Stefan Lankes
  2009-05-11  8:48 ` Dieter an Mey
  2009-05-11 13:22 ` Andi Kleen
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-05-11  8:27 UTC (permalink / raw)
  To: linux-kernel

Hello,

I wrote a patch that adds support for the adaptive data distribution
strategy "affinity-on-next-touch" for NUMA architectures. The patch is at an
early stage and I am interested in your comments.

The basic idea of "affinity-on-next-touch" is this: via some runtime
mechanism, a user-level process activates "affinity-on-next-touch" for a
certain region of its virtual address space. Afterwards, each page in this
region is migrated, on its next access, to the node of the thread that
touches it. Noordergraaf and van der Pas [1] have proposed extending the
OpenMP standard to support this strategy. Since version 9, the
"affinity-on-next-touch" mechanism has been available in Solaris, where it
is triggered via the madvise system call. Löf and Holmgren [2] and Terboven
et al. [3] have described their encouraging experiences with this
implementation.

Linux does not yet support "affinity-on-next-touch". Terboven et al. [3]
have presented a user-level implementation of this strategy for Linux. To
realize "affinity-on-next-touch" in user space, they protect a specific
memory area against read and write accesses and install a signal handler to
catch the resulting access violations. When a thread accesses a page in the
protected memory area, the signal handler migrates the page to the node of
the faulting thread, clears the page protection and resumes the interrupted
thread.
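
For illustration, here is a minimal user-space sketch of this mechanism (my
own reconstruction of the approach, not the code of Terboven et al.; it
assumes libnuma is available, that the protected area was readable and
writable before, and it ignores error handling and signal-safety concerns):

   /* Minimal user-space "affinity-on-next-touch" sketch; link with -lnuma. */
   #define _GNU_SOURCE
   #include <signal.h>
   #include <sched.h>       /* sched_getcpu() */
   #include <stdint.h>
   #include <unistd.h>      /* sysconf() */
   #include <sys/mman.h>    /* mprotect() */
   #include <numa.h>        /* numa_node_of_cpu() */
   #include <numaif.h>      /* move_pages(), MPOL_MF_MOVE */

   static long pagesize;

   static void on_fault(int sig, siginfo_t *si, void *ctx)
   {
      void *page = (void *)((uintptr_t)si->si_addr & ~(pagesize - 1));
      int node = numa_node_of_cpu(sched_getcpu());
      int status;

      /* Restore the permissions so the faulting access can be replayed, */
      mprotect(page, pagesize, PROT_READ | PROT_WRITE);
      /* then migrate the page to the node of the accessing thread. */
      move_pages(0, 1, &page, &node, &status, MPOL_MF_MOVE);
   }

   /* Arm "next-touch" for [addr, addr+len): every first touch of a page
    * raises one SIGSEGV, which the handler above turns into a migration. */
   static void next_touch(void *addr, size_t len)
   {
      struct sigaction sa;

      pagesize = sysconf(_SC_PAGESIZE);
      sa.sa_sigaction = on_fault;
      sa.sa_flags = SA_SIGINFO;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGSEGV, &sa, NULL);
      mprotect(addr, len, PROT_NONE);
   }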

Unfortunately, the overhead of this solution is very high. For instance,
distributing 512 MByte via "affinity-on-next-touch" takes the user-level
solution 2518 ms on our dual-socket, quad-core Opteron 2376 system with the
current kernel (2.6.30-rc4). I evaluated this overhead with the following
OpenMP code:

   /* array: int array of SIZE elements; pagesize: sysconf(_SC_PAGESIZE) */
   madvise(array, sizeof(int) * SIZE, MADV_ACCESS_LWP);
   start = omp_get_wtime();
   #pragma omp parallel for
   for (j = 0; j < SIZE; j += pagesize/sizeof(int))
      array[j]++;                       /* touch one int per page */
   end = omp_get_wtime();
   printf("time: %lf ms\n", (end - start) * 1000.0);

The benchmark uses 8 threads and each thread is bound to one core.
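
For illustration, one common way to bind each OpenMP thread to its own core
(an assumption; the binding code is not shown in the original post) is:

   /* Build with -fopenmp; pins OpenMP thread i to core i. */
   #define _GNU_SOURCE
   #include <sched.h>
   #include <omp.h>

   void bind_threads(void)
   {
      #pragma omp parallel
      {
         cpu_set_t set;

         CPU_ZERO(&set);
         CPU_SET(omp_get_thread_num(), &set);
         sched_setaffinity(0, sizeof(set), &set);   /* 0 = calling thread */
      }
   }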

I divided my patch into the following four parts:

[Patch 1/4]: Extends the madvise system call with a new advice value,
MADV_ACCESS_LWP (the same name as used in Solaris). The specified memory
area then uses "affinity-on-next-touch": madvise_access_lwp protects the
memory area against read and write access. To avoid unnecessary list
operations, the patch changes the permissions only in the page table entries
and does not update the list of VMAs. Besides this, madvise also sets the
new "untouched" bit in the page struct.

[Patch 2/4]: Via the new "untouched" bit in the page struct, the pte fault
handler detects that the page the thread tried to access uses
"affinity-on-next-touch". The kernel then reads the original permissions
from the vm_area_struct, restores them in the page tables and migrates the
page to the current node. To accelerate page migration, the patch avoids
unnecessary calls to migrate_prep().

[Patch 3/4]: If the "untouched" bit is set, mprotect is not permitted to
change the permissions in the page table entry. With
"affinity-on-next-touch", the access permissions are instead set by the pte
fault handler.

[Patch 4/4]: Adds some counters to detect migration errors and publishes
them via /proc/vmstat. Besides this, the Kconfig file is extended with the
parameter CONFIG_AFFINITY_ON_NEXT_TOUCH.

With this patch, the kernel reduces the overhead of page distribution via
"affinity-on-next-touch" from 2518 ms (user-level approach) to 366 ms.
Currently, I am evaluating the performance of the patch with some other
benchmarks and test applications (stream benchmark, Jacobi solver, PDE
solver, ...).

I am very interested in your comments!

Stefan

[1] Noordergraaf, L., van der Pas, R.: Performance Experiences on Sun's
WildFire Prototype. In: Proceedings of the 1999 ACM/IEEE Conference on
Supercomputing, Portland, Oregon, USA (November 1999)

[2] Löf, H., Holmgren, S.: affinity-on-next-touch: Increasing the
Performance of an Industrial PDE Solver on a cc-NUMA System. In: Proceedings
of the 19th Annual International Conference on Supercomputing, Cambridge,
Massachusetts, USA (June 2005) 387–392

[3] Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.: Data
and Thread Affinity in OpenMP Programs. In: Proceedings of the 2008 Workshop
on Memory Access on Future Processors: A Solved Problem?, ACM International
Conference on Computing Frontiers, Ischia, Italy (May 2008) 377–384





* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11  8:27 [RFC PATCH 0/4]: affinity-on-next-touch Stefan Lankes
@ 2009-05-11  8:48 ` Dieter an Mey
  2009-05-11 13:22 ` Andi Kleen
  1 sibling, 0 replies; 44+ messages in thread
From: Dieter an Mey @ 2009-05-11  8:48 UTC (permalink / raw)
  To: Stefan Lankes; +Cc: linux-kernel


Hello,

I am supporting Stefan's activity from the parallel programmer's 
perspective and I would be happy to provide further input, if needed.

best regards
Dieter

Stefan Lankes wrote:
> <snip>

-- 
Dipl.-Math. Dieter an Mey, HPC Team Lead
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Phone: + 49 241 80 24377 - Fax/UMS: + 49 241 80 624377
mailto:anmey@rz.rwth-aachen.de http://www.rz.rwth-aachen.de




* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11  8:27 [RFC PATCH 0/4]: affinity-on-next-touch Stefan Lankes
  2009-05-11  8:48 ` Dieter an Mey
@ 2009-05-11 13:22 ` Andi Kleen
  2009-05-11 13:32   ` Brice Goglin
                     ` (2 more replies)
  1 sibling, 3 replies; 44+ messages in thread
From: Andi Kleen @ 2009-05-11 13:22 UTC (permalink / raw)
  To: Stefan Lankes; +Cc: linux-kernel, Lee.Schermerhorn, linux-numa

Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
>
> [Patch 1/4]: Extend the system call madvise with a new parameter
> MADV_ACCESS_LWP (the same as used in Solaris). The specified memory area

Linux does NUMA memory policies in mbind(), not madvise().
Also, if there's a new NUMA policy, it should be in the standard
Linux NUMA memory policy framework, not invented as a new one.

[I find it amazing that you apparently did so much work
without being familiar with the existing Linux NUMA policies.]

Your patches seem to have a lot of overlap with 
Lee Schermerhorn's old migrate memory on cpu migration patches.
I don't know the status of those.

> [Patch 4/4]: This part of the patch adds some counters to detect migration
> errors and publishes these counters via /proc/vmstat. Besides this, the
> Kconfig file is extend with the parameter CONFIG_AFFINITY_ON_NEXT_TOUCH.
>
> With this patch, the kernel reduces the overhead of page distribution via
> "affinity-on-next-touch" from 2518ms to 366ms compared to the user-level

The interesting part is less how much faster it is compared to a user
space implementation, but how much this migrate-on-touch approach
helps in general compared to already existing policies. Some hard
numbers on that would be appreciated.

Note that for the OpenMP case old kernels sometimes had trouble because
the threads tended not to be scheduled on their final target CPU
during the first time slice, so the memory was often first-touched
on the wrong node. Later kernels avoided that by moving the threads
more aggressively early on.

This nearly sounds like a workaround for that (I hope it's more
than that).

If you present any benchmarks, make sure the kernel you're benchmarking
against does not have this issue.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11 13:22 ` Andi Kleen
@ 2009-05-11 13:32   ` Brice Goglin
  2009-05-11 14:54     ` Stefan Lankes
  2009-06-11 18:45   ` Stefan Lankes
  2 siblings, 0 replies; 44+ messages in thread
From: Brice Goglin @ 2009-05-11 13:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Stefan Lankes, linux-kernel, Lee.Schermerhorn, linux-numa

Andi Kleen wrote:
> Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
>   
>> [Patch 1/4]: Extend the system call madvise with a new parameter
>> MADV_ACCESS_LWP (the same as used in Solaris). The specified memory area
>>     
>
> Linux does NUMA memory policies in mbind(), not madvise()
> Also if there's a new NUMA policy it should be in the standard
> Linux NUMA memory policy frame work, not inventing a new one
>   

Marking a buffer as "migrate-on-next-touch" is very different from
setting a NUMA policy. Migrate-on-next-touch is a temporary flag that is
cleared on the next touch. It's cleared per page, not per area or
whatever. So marking a VMA as "migrate-on-next-touch" doesn't make much
sense, since some pages could already have been migrated (and brought
back to their usual state) while others are still marked as
migrate-on-next-touch.

Brice



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11 13:22 ` Andi Kleen
@ 2009-05-11 14:54     ` Stefan Lankes
  2009-06-11 18:45   ` Stefan Lankes
  2 siblings, 0 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-05-11 14:54 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: linux-kernel, Lee.Schermerhorn, linux-numa, brice.goglin,
	'Terboven, Christian', anmey, 'Boris Bierbaum'

> From: Andi Kleen [mailto:andi@firstfloor.org]
> 
> Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
> >
> > [Patch 1/4]: Extend the system call madvise with a new parameter
> > MADV_ACCESS_LWP (the same as used in Solaris). The specified memory
> area
> 
> Linux does NUMA memory policies in mbind(), not madvise()
> Also if there's a new NUMA policy it should be in the standard
> Linux NUMA memory policy frame work, not inventing a new one

By default, mbind only has an effect on new allocations. I think that this
is different from what we need for applications with dynamic memory access
patterns: the app gives the kernel a hint that the access pattern has
changed, and the kernel then has to redistribute the pages that are already
allocated.

> > [Patch 4/4]: This part of the patch adds some counters to detect
> migration
> > errors and publishes these counters via /proc/vmstat. Besides this,
> the
> > Kconfig file is extend with the parameter
> CONFIG_AFFINITY_ON_NEXT_TOUCH.
> >
> > With this patch, the kernel reduces the overhead of page distribution
> via
> > "affinity-on-next-touch" from 2518ms to 366ms compared to the user-
> level
> 
> The interesting part is less how much faster it is compared to an user
> space implementation, but how much this migrate on touch approach
> helps in general compared to already existing policies. Some hard
> numbers on that would appreciated.
> 
> Note that for the OpenMP case old kernels sometimes had trouble because
> the threads tended to be not scheduled to the final target CPU
> on the first time slice so the memory was often first-touched
> on the wrong node. Later kernels avoided that by more aggressively
> moving the threads early.
> 

"affinity-on-next-touch" is not a data distribution strategy for
applications with a static access pattern. If the access pattern changed,
you could initialize the "affinity-on-next-touch" mechanism and afterwards
the kernel redistributes the pages. 

For instance, Norden's PDE solvers using adaptive mesh refinements (AMR) [1]
is an application with a dynamic access pattern. We use this example to
evaluate the performance of our patch. We ran this solver on our
quad-socket, dual-core Opteron 875 (2.2GHz) system running CentOS 5.2. The
code was already optimized for NUMA architectures. Before the arrays are
initialized, the threads are bound to one core. In our test case, the solver
needs 5318s. If we use our kernel extension, the solver needs 4489s. 

Currently, we are testing some other apps.

Stefan

[1] Norden, M., Löf, H., Rantakokko, J., Holmgren, S.: Geographical Locality
and Dynamic Data Migration for OpenMP Implementations of Adaptive PDE
Solvers. In: Proceedings of the 2nd International Workshop on OpenMP
(IWOMP), Reims, France (June 2006) 382–393




* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11 14:54     ` Stefan Lankes
@ 2009-05-11 16:37     ` Andi Kleen
  2009-05-11 17:22       ` Stefan Lankes
  -1 siblings, 1 reply; 44+ messages in thread
From: Andi Kleen @ 2009-05-11 16:37 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, Lee.Schermerhorn, linux-numa, brice.goglin,
	'Terboven, Christian', anmey, 'Boris Bierbaum'

On Mon, May 11, 2009 at 04:54:40PM +0200, Stefan Lankes wrote:
> > From: Andi Kleen [mailto:andi@firstfloor.org]
> > 
> > Stefan Lankes <lankes@lfbs.rwth-aachen.de> writes:
> > >
> > > [Patch 1/4]: Extend the system call madvise with a new parameter
> > > MADV_ACCESS_LWP (the same as used in Solaris). The specified memory
> > area
> > 
> > Linux does NUMA memory policies in mbind(), not madvise()
> > Also if there's a new NUMA policy it should be in the standard
> > Linux NUMA memory policy frame work, not inventing a new one
> 
> By default, mbind only has an effect on new allocations. I think that this

Nope, it affects existing pages too, it can even move pages
if you ask for it.

> is different from what we need for applications with dynamic memory access
> patterns. The app gives the kernel a hint that the access pattern has been
> changed and the kernel has to redistribute the pages which are already
> allocated.

MF_MOVE
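
For reference, a sketch of what this flag does (my example, not from this
thread; the target node and the MPOL_BIND policy are arbitrary choices):
mbind() with MPOL_MF_MOVE also migrates pages that are already allocated in
the given range:

   #include <stddef.h>
   #include <numaif.h>      /* mbind(), MPOL_BIND, MPOL_MF_MOVE; -lnuma */

   /* Move the already-allocated pages of [addr, addr+len) to node 1. */
   static long move_existing_pages(void *addr, size_t len)
   {
      unsigned long nodemask = 1UL << 1;

      return mbind(addr, len, MPOL_BIND, &nodemask,
                   sizeof(nodemask) * 8, MPOL_MF_MOVE);
   }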


> For instance, Norden's PDE solvers using adaptive mesh refinements (AMR) [1]
> is an application with a dynamic access pattern. We use this example to
> evaluate the performance of our patch. We ran this solver on our
> quad-socket, dual-core Opteron 875 (2.2GHz) system running CentOS 5.2. The
> code was already optimized for NUMA architectures. Before the arrays are
> initialized, the threads are bound to one core. In our test case, the solver
> needs 5318s. If we use our kernel extension, the solver needs 4489s. 

Okay that sounds like good numbers. 

> Currently, we are testing some other apps.

Please keep the list updated.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11 16:37     ` Andi Kleen
@ 2009-05-11 17:22       ` Stefan Lankes
  0 siblings, 0 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-05-11 17:22 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: linux-kernel, Lee.Schermerhorn, linux-numa, brice.goglin,
	'Terboven, Christian',
	anmey, Boris Bierbaum



> -----Original Message-----
> From: Andi Kleen [mailto:andi@firstfloor.org]
> Sent: Monday, May 11, 2009 6:37 PM
> To: Stefan Lankes
> Cc: 'Andi Kleen'; linux-kernel@vger.kernel.org;
> Lee.Schermerhorn@hp.com; linux-numa@vger.kernel.org;
> brice.goglin@inria.fr; 'Terboven, Christian'; anmey@rz.rwth-aachen.de;
> 'Boris Bierbaum'
> Subject: Re: [RFC PATCH 0/4]: affinity-on-next-touch
> 
> > By default, mbind only has an effect on new allocations. I think that
> this
> 
> Nope, it affects existing pages too, it can even move pages
> if you ask for it.
> 

I know about this possibility, but I thought that "affinity-on-next-touch"
fits better with madvise. Brice has already given the technical reasons for
preferring madvise.

> > For instance, Norden's PDE solvers using adaptive mesh refinements
> (AMR) [1]
> > is an application with a dynamic access pattern. We use this example
> to
> > evaluate the performance of our patch. We ran this solver on our
> > quad-socket, dual-core Opteron 875 (2.2GHz) system running CentOS
> 5.2. The
> > code was already optimized for NUMA architectures. Before the arrays
> are
> > initialized, the threads are bound to one core. In our test case, the
> solver
> > needs 5318s. If we use our kernel extension, the solver needs 4489s.
> 
> Okay that sounds like good numbers.
> 
> > Currently, we are testing some other apps.
> 
> Please keep the list updated.
> 

I will do it.

Stefan



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-05-11 13:22 ` Andi Kleen
  2009-05-11 13:32   ` Brice Goglin
  2009-05-11 14:54     ` Stefan Lankes
@ 2009-06-11 18:45   ` Stefan Lankes
  2009-06-12 10:32     ` Andi Kleen
  2009-06-16  2:21     ` Lee Schermerhorn
  2 siblings, 2 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-06-11 18:45 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum,
	'Brice Goglin'

> Your patches seem to have a lot of overlap with
> Lee Schermerhorn's old migrate memory on cpu migration patches.
> I don't know the status of those.

I analyzed Lee Schermerhorn's migrate-memory-on-cpu-migration patches
(http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
Schermerhorn adds similar functionality to the kernel. He calls the
"affinity-on-next-touch" functionality "migrate_on_fault" and uses the
normal NUMA memory policies in his patches. Therefore, his solution fits
better into the Linux kernel. I tested his patches with our test
applications and got nearly the same performance results.

I only found patches for kernel 2.6.25-rc2-mm1. Is anyone developing these
patches further?

Stefan





* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-11 18:45   ` Stefan Lankes
@ 2009-06-12 10:32     ` Andi Kleen
  2009-06-12 11:46       ` Stefan Lankes
  2009-06-16  2:25       ` Lee Schermerhorn
  2009-06-16  2:21     ` Lee Schermerhorn
  1 sibling, 2 replies; 44+ messages in thread
From: Andi Kleen @ 2009-06-12 10:32 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum,
	'Brice Goglin'

On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
> 
> I analyze Lee Schermerhorn's migrate memory on cpu migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> Schermerhorn add similar functionalities to the kernel. He called the
> "affinity-on-next-touch" functionality "migrate_on_fault" and uses in his
> patches the normal NUMA memory policies. Therefore, his solution fits better
> to the Linux kernel. I tested his patches with our test applications and got
> nearly the same performance results. 

That's great to know.

I didn't think he had a per process setting though, did he?

> I found only patches for the kernel 2.6.25-rc2-mm1. Does someone develop
> these patches further?

Not to my knowledge. Maybe Lee will pick them up again now that there
are more use cases.

If he doesn't have time maybe you could update them?

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.


* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 10:32     ` Andi Kleen
@ 2009-06-12 11:46       ` Stefan Lankes
  2009-06-12 12:30         ` Brice Goglin
  2009-06-16  2:39         ` Lee Schermerhorn
  2009-06-16  2:25       ` Lee Schermerhorn
  1 sibling, 2 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-06-12 11:46 UTC (permalink / raw)
  To: 'Andi Kleen'
  Cc: linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum,
	'Brice Goglin'


> > I analyze Lee Schermerhorn's migrate memory on cpu migration patches
> > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
> Lee
> > Schermerhorn add similar functionalities to the kernel. He called the
> > "affinity-on-next-touch" functionality "migrate_on_fault" and uses in
> his
> > patches the normal NUMA memory policies. Therefore, his solution fits
> better
> > to the Linux kernel. I tested his patches with our test applications
> and got
> > nearly the same performance results.
> 
> That's great to know.
> 
> I didn't think he had a per process setting though, did he?

He enables support for migration-on-fault via cpusets (echo 1 >
/dev/cpuset/migrate_on_fault).
Afterwards, every process in that cpuset can initiate migration-on-fault via
mbind(..., MPOL_MF_MOVE|MPOL_MF_LAZY), as sketched below.
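
A sketch of that usage (my illustration; it requires a kernel and headers
carrying Lee's patches, since MPOL_MF_LAZY is not in mainline, and the
preferred-node policy mirrors the memtoy example later in this thread):

   #include <stddef.h>
   #include <numaif.h>      /* with Lee's patches: defines MPOL_MF_LAZY */

   /* Unmap the pages of [addr, addr+len) now; each page is then migrated
    * toward the preferred node when it is next touched. */
   static long lazy_migrate(void *addr, size_t len, int node)
   {
      unsigned long nodemask = 1UL << node;

      return mbind(addr, len, MPOL_PREFERRED, &nodemask,
                   sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_LAZY);
   }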

> > I found only patches for the kernel 2.6.25-rc2-mm1. Does someone
> develop
> > these patches further?
> 
> Not to much knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
> 
> If he doesn't have time maybe you could update them?
 
We are planning to work in this area. I think that I could update these
patches; at the least, I am able to support Lee.

Stefan



* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 11:46       ` Stefan Lankes
@ 2009-06-12 12:30         ` Brice Goglin
  2009-06-12 13:21           ` Stefan Lankes
  2009-06-12 13:48           ` Stefan Lankes
  2009-06-16  2:39         ` Lee Schermerhorn
  1 sibling, 2 replies; 44+ messages in thread
From: Brice Goglin @ 2009-06-12 12:30 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

Stefan Lankes wrote:
> He enables the support of migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, every process could initiate migration-on-fault via mbind(...,
> MPOL_MF_MOVE|MPOL_MF_LAZY).

So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
to generate page-faults on next-touch? (instead of your madvise)
Is it migrating the whole memory area? Or only single pages?

Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
stored in the mempolicy? If so, couldn't another fault later cause
another migration?
Or is MPOL_MF_LAZY filtered out of the policy once the protection of all
PTEs has been changed?
I don't see why we need a new mempolicy here. If we are migrating single
pages, migrate-on-next-touch looks like a page-attribute to me. There
should be nothing to store in a mempolicy/VMA/whatever.

Brice



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 12:30         ` Brice Goglin
@ 2009-06-12 13:21           ` Stefan Lankes
  2009-06-12 13:48           ` Stefan Lankes
  1 sibling, 0 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-06-12 13:21 UTC (permalink / raw)
  To: 'Brice Goglin'
  Cc: 'Andi Kleen',
	linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

> So mbind(MPOL_MF_LAZY) is taking care of changing page protection so as
> to generate page-faults on next-touch? (instead of your madvise)
> Is it migrating the whole memory area? Or only single pages?

mbind removes the pte references. Page migration occurs when a task
accesses one of these unmapped pages. Therefore, Lee's solution migrates
single pages, not the whole area at once.

You can find further information on slides 19-23 of
http://mirror.linux.org.au/pub/linux.conf.au/2007/video/talks/197.pdf.

> Then, what's happening with MPOL_MF_LAZY in the kernel? Is it actually
> stored in the mempolicy? If so, couldn't another fault later cause
> another migration?
> Or is MPOL_MF_LAZY filtered out of the policy once the protection of
> all
> PTE has been changed?
>
> I don't see why we need a new mempolicy here. If we are migrating
> single
> pages, migrate-on-next-touch looks like a page-attribute to me. There
> should be nothing to store in a mempolicy/VMA/whatever.
> 

MPOL_MF_LAZY is used as a flag and does not specify a new policy.
Therefore, MPOL_MF_LAZY isn't stored in a VMA. The flag is only used to
tell the system call mbind that it has to unmap these pages.

Stefan



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 12:30         ` Brice Goglin
  2009-06-12 13:21           ` Stefan Lankes
@ 2009-06-12 13:48           ` Stefan Lankes
  1 sibling, 0 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-06-12 13:48 UTC (permalink / raw)
  To: 'Brice Goglin'
  Cc: 'Andi Kleen',
	linux-kernel, Lee.Schermerhorn, linux-numa, Boris Bierbaum

> >
> 
> MPOL_MF_LAZY is used as flag and does not specify a new policy.
> Therefore, MPOL_MF_LAZY isn't stored in a VMA. The flag is only used to
> detect that the system call mbind has to unmap these pages.
> 

In one of my previous e-mails, I wrote that Lee uses the normal NUMA memory
policies. This statement was unclear: Lee's patches define no new memory
policy; he only adds a new flag. Using mbind looks smarter to me than using
madvise.

Stefan



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-11 18:45   ` Stefan Lankes
  2009-06-12 10:32     ` Andi Kleen
@ 2009-06-16  2:21     ` Lee Schermerhorn
  1 sibling, 0 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-16  2:21 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Thu, 2009-06-11 at 20:45 +0200, Stefan Lankes wrote:
> > Your patches seem to have a lot of overlap with
> > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > I don't know the status of those.
> 
> I analyze Lee Schermerhorn's migrate memory on cpu migration patches
> (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> Schermerhorn add similar functionalities to the kernel. He called the
> "affinity-on-next-touch" functionality "migrate_on_fault" and uses in his
> patches the normal NUMA memory policies. Therefore, his solution fits better
> to the Linux kernel. I tested his patches with our test applications and got
> nearly the same performance results. 
> 
> I found only patches for the kernel 2.6.25-rc2-mm1. Does someone develop
> these patches further?

Sorry for the delay.  I was offline for a long weekend. 

Regarding the patches:  I was rebasing them every few mmotm releases
until I ran into trouble with the memory controller handling of page
migration conflicting with migrating in the fault path and haven't had
time to investigate a solution.

Here's the problem I have:

when migrating a page with memory controller configured, the migration
code [mem_cgroup_prepare_migration()] tentatively charges the page
against the control group.  Then, when migration completes, it calls
mem_cgroup_end_migration() to commit [or cancel?] the charge.  Migration
on fault operates on an anon page in the page cache that has zero pte
references [page_mapcount(page) == 0] in do_swap_page().  do_swap_page()
does a mem_cgroup_try_charge_swapin() that also tentatively charges the
page.  I don't try to migrate the page unless this succeeds.  No sense
in doing all that work if the cgroup can't afford the page.

But, this ends up with nested "tentative charges" against the page when
I call down into the migration code via migrate_misplaced_page() and I
was having problems getting the ref counting correct.   It would bug out
under load.

What I want to do is see if the page migration code can "atomically
transfer" the page charge [including any tentative charge from
do_swap_page()] down in migrate_page_copy(), the way all other page
state is copied.  Haven't had time to see whether this is feasible.

Regards,
Lee
 



* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 10:32     ` Andi Kleen
  2009-06-12 11:46       ` Stefan Lankes
@ 2009-06-16  2:25       ` Lee Schermerhorn
  2009-06-20  7:24         ` Brice Goglin
  1 sibling, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-16  2:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Stefan Lankes, linux-kernel, linux-numa, Boris Bierbaum,
	'Brice Goglin'

On Fri, 2009-06-12 at 12:32 +0200, Andi Kleen wrote:
> On Thu, Jun 11, 2009 at 08:45:29PM +0200, Stefan Lankes wrote:
> > > Your patches seem to have a lot of overlap with
> > > Lee Schermerhorn's old migrate memory on cpu migration patches.
> > > I don't know the status of those.
> > 
> > I analyze Lee Schermerhorn's migrate memory on cpu migration patches
> > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that Lee
> > Schermerhorn add similar functionalities to the kernel. He called the
> > "affinity-on-next-touch" functionality "migrate_on_fault" and uses in his
> > patches the normal NUMA memory policies. Therefore, his solution fits better
> > to the Linux kernel. I tested his patches with our test applications and got
> > nearly the same performance results. 
> 
> That's great to know.
> 
> I didn't think he had a per process setting though, did he?

Hi, Andi.

My patches don't have per process enablement.  Rather, I chose to use
per cpuset enablement.  I view cpusets as sort of "numa control groups"
and thought this was an appropriate level at which to control this sort
of behavior--analogous to memory_spread_{page|slab}.  That probably
needs to be discussed more widely, tho'.

> 
> > I found only patches for the kernel 2.6.25-rc2-mm1. Does someone develop
> > these patches further?
> 
> Not to much knowledge. Maybe Lee will pick them up again now that there
> are more use cases.
> 
> If he doesn't have time maybe you could update them?

As I mentioned earlier, I need to sort out the interaction with the
memory controller.  It was changing too fast for me to keep up in the
time I could devote to it.

Lee



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-12 11:46       ` Stefan Lankes
  2009-06-12 12:30         ` Brice Goglin
@ 2009-06-16  2:39         ` Lee Schermerhorn
  2009-06-16 13:58           ` Stefan Lankes
  1 sibling, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-16  2:39 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Fri, 2009-06-12 at 13:46 +0200, Stefan Lankes wrote:
> > > I analyze Lee Schermerhorn's migrate memory on cpu migration patches
> > > (http://free.linux.hp.com/~lts/Patches/PageMigration/). I think that
> > Lee
> > > Schermerhorn add similar functionalities to the kernel. He called the
> > > "affinity-on-next-touch" functionality "migrate_on_fault" and uses in
> > his
> > > patches the normal NUMA memory policies. Therefore, his solution fits
> > better
> > > to the Linux kernel. I tested his patches with our test applications
> > and got
> > > nearly the same performance results.
> > 
> > That's great to know.
> > 
> > I didn't think he had a per process setting though, did he?
> 
> He enables the support of migration-on-fault via cpusets (echo 1 >
> /dev/cpuset/migrate_on_fault).
> Afterwards, every process could initiate migration-on-fault via mbind(...,
> MPOL_MF_MOVE|MPOL_MF_LAZY).

I should have read through the entire thread before responding Andi's
mail.

> 
> > > I found only patches for the kernel 2.6.25-rc2-mm1. Does someone
> > develop
> > > these patches further?
> > 
> > Not to much knowledge. Maybe Lee will pick them up again now that there
> > are more use cases.
> > 
> > If he doesn't have time maybe you could update them?
>  
> We are planning to work in this area. I think that I could update these
> patches. 
> At least, I am able to support Lee.

I would like to get these patches working with latest mmotm to test on
some newer hardware where I think they will help more.  And I would
welcome your support.  However, I think we'll need to get Balbir and
Kamezawa-san involved to sort out the interaction with memory control
group.

I can send you the more recent rebase that I've done.  This is getting
pretty old now:  2.6.28-rc4-mmotm-081110-081117.  I'll try to rebase to
the most recent mmotm [that boots on my platforms], at least so that we
can build and boot with migrate-on-fault disabled, within the next
couple of weeks.  

Regards,
Lee



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16  2:39         ` Lee Schermerhorn
@ 2009-06-16 13:58           ` Stefan Lankes
  2009-06-16 14:59             ` Lee Schermerhorn
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-16 13:58 UTC (permalink / raw)
  To: 'Lee Schermerhorn'
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

> 
> I would like to get these patches working with latest mmotm to test on
> some newer hardware where I think they will help more.  And I would
> welcome your support.  However, I think we'll need to get Balbir and
> Kamezawa-san involved to sort out the interaction with memory control
> group.
> 
> I can send you the more recent rebase that I've done.  This is getting
> pretty old now:  2.6.28-rc4-mmotm-081110-081117.  I'll try to rebase to
> the most recent mmotm [that boots on my platforms], at least so that we
> can build and boot with migrate-on-fault disabled, within the next
> couple of weeks.
> 

Sounds good! Send me your last version and I will try to reproduce your
problems. Afterwards, we can try to solve them.

Regards,

Stefan



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16 13:58           ` Stefan Lankes
@ 2009-06-16 14:59             ` Lee Schermerhorn
  2009-06-17  1:22               ` KAMEZAWA Hiroyuki
  2009-06-17  7:45               ` Stefan Lankes
  0 siblings, 2 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-16 14:59 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> > 
> > I would like to get these patches working with latest mmotm to test on
> > some newer hardware where I think they will help more.  And I would
> > welcome your support.  However, I think we'll need to get Balbir and
> > Kamezawa-san involved to sort out the interaction with memory control
> > group.
> > 
> > I can send you the more recent rebase that I've done.  This is getting
> > pretty old now:  2.6.28-rc4-mmotm-081110-081117.  I'll try to rebase to
> > the most recent mmotm [that boots on my platforms], at least so that we
> > can build and boot with migrate-on-fault disabled, within the next
> > couple of weeks.
> > 
> 
> Sounds good! Send me your last version and I will try to reconstruct your
> problems. Afterwards, we could try to solve these problems.
> 

Stefan:

I've placed the last rebased version in :

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/

As I recall, this version DOES bug out because of reference count
problems due to interaction with the memory controller.

As I said, I'll try to rebase to latest mmotm real soon.  I need to test
whether it will boot on my test systems first.

Lee





* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16 14:59             ` Lee Schermerhorn
@ 2009-06-17  1:22               ` KAMEZAWA Hiroyuki
  2009-06-17 12:02                 ` Lee Schermerhorn
  2009-06-17  7:45               ` Stefan Lankes
  1 sibling, 1 reply; 44+ messages in thread
From: KAMEZAWA Hiroyuki @ 2009-06-17  1:22 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Tue, 16 Jun 2009 10:59:55 -0400
Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:

> On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> > > 
> > > I would like to get these patches working with latest mmotm to test on
> > > some newer hardware where I think they will help more.  And I would
> > > welcome your support.  However, I think we'll need to get Balbir and
> > > Kamezawa-san involved to sort out the interaction with memory control
> > > group.
> > > 
> > > I can send you the more recent rebase that I've done.  This is getting
> > > pretty old now:  2.6.28-rc4-mmotm-081110-081117.  I'll try to rebase to
> > > the most recent mmotm [that boots on my platforms], at least so that we
> > > can build and boot with migrate-on-fault disabled, within the next
> > > couple of weeks.
> > > 
> > 
> > Sounds good! Send me your last version and I will try to reconstruct your
> > problems. Afterwards, we could try to solve these problems.
> > 
> 
> Stefan:
> 
> I've placed the last rebased version in :
> 
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
> 
> As I recall, this version DOES bug out because of reference count
> problems due to interaction with the memory controller.
> 
Please report it precisely if memcg has a bug.
An example test is welcome.

Thanks,
-Kame



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16 14:59             ` Lee Schermerhorn
  2009-06-17  1:22               ` KAMEZAWA Hiroyuki
@ 2009-06-17  7:45               ` Stefan Lankes
  2009-06-18  4:37                 ` Lee Schermerhorn
  1 sibling, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-17  7:45 UTC (permalink / raw)
  To: 'Lee Schermerhorn'
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

 
> I've placed the last rebased version in :
> 
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> 081110/
> 

OK! I will try to reproduce the problem.

Stefan 



* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-17  1:22               ` KAMEZAWA Hiroyuki
@ 2009-06-17 12:02                 ` Lee Schermerhorn
  0 siblings, 0 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-17 12:02 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin'

On Wed, 2009-06-17 at 10:22 +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 16 Jun 2009 10:59:55 -0400
> Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote:
> 
> > On Tue, 2009-06-16 at 15:58 +0200, Stefan Lankes wrote:
> > > > 
> > > > I would like to get these patches working with latest mmotm to test on
> > > > some newer hardware where I think they will help more.  And I would
> > > > welcome your support.  However, I think we'll need to get Balbir and
> > > > Kamezawa-san involved to sort out the interaction with memory control
> > > > group.
> > > > 
> > > > I can send you the more recent rebase that I've done.  This is getting
> > > > pretty old now:  2.6.28-rc4-mmotm-081110-081117.  I'll try to rebase to
> > > > the most recent mmotm [that boots on my platforms], at least so that we
> > > > can build and boot with migrate-on-fault disabled, within the next
> > > > couple of weeks.
> > > > 
> > > 
> > > Sounds good! Send me your last version and I will try to reconstruct your
> > > problems. Afterwards, we could try to solve these problems.
> > > 
> > 
> > Stefan:
> > 
> > I've placed the last rebased version in :
> > 
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-081110/
> > 
> > As I recall, this version DOES bug out because of reference count
> > problems due to interaction with the memory controller.
> > 
> please report in precise if memcg has bug.
> An example of test is welcome.

Not a memcg bug.  Just an implementation choice [2 phase migration
handling:  start/end calls] that is problematic for "lazy" page
migration--i.e., "migration-on-touch" in the fault path.  I'd be
interested in your opinion on the feasibility of transferring the
"charge" against the page--including the "try charge" from
do_swap_page()--in migrate_page_copy() along with other page state.

I'll try to rebase my lazy migration series to recent mmotm [as time
permits] in the "near future" and gather more info on the problem I was
having.

Regards,
Lee



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-17  7:45               ` Stefan Lankes
@ 2009-06-18  4:37                 ` Lee Schermerhorn
  2009-06-18 19:04                   ` Lee Schermerhorn
  2009-06-22 12:34                   ` Brice Goglin
  0 siblings, 2 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-18  4:37 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > I've placed the last rebased version in :
> > 
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > 081110/
> > 
> 
> OK! I will try to reconstruct the problem.

Stefan:

Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
[along with my shared policy series atop which they sit in my tree].
Patches reside in:

http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/


I did a quick test.  I'm afraid the patches have suffered some "bit rot"
vis a vis mainline/mmotm over the past several months.  Two possibly
related issues:

1) lazy migration doesn't seem to work. Looks like
mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
pages so, of course, migrate on fault won't work.  I suspect the
reference count handling has changed since I last tried this.  [Note one
of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
flag definitions in mempolicy.h and I may have botched the resolution
thereof.]

2) When the pages get freed on exit/unmap, they are still PageLocked()
and free_pages_check()/bad_page() bugs out with bad page state.

Note:  This is independent of memcg--i.e., happens whether or not memcg
configured.

To test this, I created a test cpuset with all nodes/mems/cpus and
enabled migrate_on_fault therein.  I then ran an interactive "memtoy"
session there [shown below].  Memtoy is a program I use for ad hoc
testing of various mm features.  You can find the latest version [almost
always] at:

http://free.linux.hp.com/~lts/Tools/memtoy-latest.tar.gz

You'll need the numactl-devel package to build--an older one with the V1
api, I think.  I need to upgrade it to latest libnuma.  

The same directory [Tools] contains a tarball of simple cpuset scripts
to make, query, modify, "enter" and run commands in cpusets.  There may
be other versions of such scripts around.  If you don't already have
any, feel free to grab them.

Since you've expressed interest in this [as has Kamezawa-san], I'll try
to pay some attention to debugging the patches in my copious spare time.
And, I'd be very interested in anything you discover in your
investigations.

Regards,
Lee

Memtoy-0.19c [for latest MPOL_MF flags defs]:

!!! lines are my annotations:

memtoy pid:  4222
memtoy>mems
mems allowed = 0-3
mems policy = 0-3
memtoy>cpus
cpu affinity mask/ids:  0-7
memtoy>anon a1 8p
memtoy>map a1
memtoy>mbind a1 pref 1
memtoy>touch a1 w
memtoy:  touched 8 pages in  0.000 secs
memtoy>where a1
a 0x00007f51ae757000 0x000000008000 0x000000000000  rw- default a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
           0:    1   1   1   1   1   1   1   1
memtoy>mbind a1 pref+move 2
memtoy:  migration of a1 [8 pages] took  0.000secs.

memtoy>where a1
a 0x00007f51ae757000 0x000000008000 0x000000000000  rw- default a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
           0:    2   2   2   2   2   2   2   2

!!! direct migration [still] works!  Try lazy:

memtoy>mbind a1 pref+move+lazy 3
memtoy:  unmap of a1 [8 pages] took  0.000secs.
memtoy>where a1

!!! "where" command uses get_mempolicy() w/ MPOL_ADDR|MPOL_NODE flags to
fetch page location.  Will call get_user_pages() and refault pages.
Should migrate to node 3, but:

a 0x00007f51ae757000 0x000000008000 0x000000000000  rw- default a1
page offset    +00 +01 +02 +03 +04 +05 +06 +07
           0:    2   2   2   2   2   2   2   2
!!! didn't move 
memtoy>exit


On console I see, for each of 8 pages of segment a1:

BUG: Bad page state in process memtoy  pfn:67515f
page:ffffea001699ccc8 flags:0a0000000010001d count:0 mapcount:0
mapping:(null) index:7f51ae75e
Pid: 4222, comm: memtoy Not tainted 2.6.30-mmotm-090612-1220+spol+lpm #6
Call Trace:
 [<ffffffff810a787a>] bad_page+0xaa/0x130
 [<ffffffff810a8719>] free_hot_cold_page+0x199/0x1d0
 [<ffffffff810a8774>] __pagevec_free+0x24/0x30
 [<ffffffff810ac96a>] release_pages+0x1ca/0x210
 [<ffffffff810c8b7d>] free_pages_and_swap_cache+0x8d/0xb0
 [<ffffffff810c0505>] exit_mmap+0x145/0x160
 [<ffffffff81044177>] mmput+0x47/0xa0
 [<ffffffff81048854>] exit_mm+0xf4/0x130
 [<ffffffff81049c58>] do_exit+0x188/0x810
 [<ffffffff81337194>] ? do_page_fault+0x184/0x310
 [<ffffffff8104a31e>] do_group_exit+0x3e/0xa0
 [<ffffffff8104a392>] sys_exit_group+0x12/0x20
 [<ffffffff8100bd2b>] system_call_fastpath+0x16/0x1b


Page flags 0x10001d:  locked, referenced, uptodate, dirty, swapbacked.
'locked' is bad state.




* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-18  4:37                 ` Lee Schermerhorn
@ 2009-06-18 19:04                   ` Lee Schermerhorn
  2009-06-19 15:26                     ` Lee Schermerhorn
  2009-06-22 12:34                   ` Brice Goglin
  1 sibling, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-18 19:04 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > I've placed the last rebased version in :
> > > 
> > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > 081110/
> > > 
> > 
> > OK! I will try to reconstruct the problem.
> 
> Stefan:
> 
> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series atop which they sit in my tree].
> Patches reside in:
> 
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> 

I have updated the migrate-on-fault tarball in the above location to fix
part of the problems I was seeing.  See below.

> 
> I did a quick test.  I'm afraid the patches have suffered some "bit rot"
> vis a vis mainline/mmotm over the past several months.  Two possibly
> related issues:
> 
> 1) lazy migration doesn't seem to work. Looks like
> mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> pages so, of course, migrate on fault won't work.  I suspect the
> reference count handling has changed since I last tried this.  [Note one
> of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> flag definitions in mempolicy.h and I may have botched the resolution
> thereof.]
> 
> 2) When the pages get freed on exit/unmap, they are still PageLocked()
> and free_pages_check()/bad_page() bugs out with bad page state.
> 
> Note:  This is independent of memcg--i.e., happens whether or not memcg
> configured.
> 
<snip>

OK.  Found time to look at this.  Turns out I hadn't tested since
trylock_page() was introduced.  I did a one-for-one replacement of the
old API [TestSetPageLocked()], not noticing that the sense of the return
was inverted.  Thus, I was bailing out of the migrate_pages_unmap_only()
loop with the page locked, thinking someone else had locked it and would
take care of it.  Since the page wasn't unmapped from the page table[s],
of course it wouldn't migrate on fault--wouldn't even fault!

Fixed this.

Now:  lazy migration works w/ or w/o memcg configured, but NOT with the
swap resource controller configured.  I'll look at that as time permits.

Lee



* RE: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-18 19:04                   ` Lee Schermerhorn
@ 2009-06-19 15:26                     ` Lee Schermerhorn
  2009-06-19 15:41                       ` Balbir Singh
  2009-06-19 21:19                       ` Stefan Lankes
  0 siblings, 2 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-19 15:26 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro

On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > I've placed the last rebased version in :
> > > > 
> > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > 081110/
> > > > 
> > > 
> > > OK! I will try to reconstruct the problem.
> > 
> > Stefan:
> > 
> > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > [along with my shared policy series atop which they sit in my tree].
> > Patches reside in:
> > 
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> > 
> 
> I have updated the migrate-on-fault tarball in the above location to fix
> part of the problems I was seeing.  See below.
> 
> > 
> > I did a quick test.  I'm afraid the patches have suffered some "bit rot"
> > vis a vis mainline/mmotm over the past several months.  Two possibly
> > related issues:
> > 
> > 1) lazy migration doesn't seem to work. Looks like
> > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > pages so, of course, migrate on fault won't work.  I suspect the
> > reference count handling has changed since I last tried this.  [Note one
> > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > flag definitions in mempolicy.h and I may have botched the resolution
> > thereof.]
> > 
> > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > and free_pages_check()/bad_page() bugs out with bad page state.
> > 
> > Note:  This is independent of memcg--i.e., happens whether or not memcg
> > configured.
> > 
> <snip>
> 
> OK.  Found time to look at this.  Turns out I hadn't tested since
> trylock_page() was introduced.  I did a one-for-one replacement of the
> old API [TestSetPageLocked()], not noticing that the sense of the return
> was inverted.  Thus, I was bailing out of the migrate_pages_unmap_only()
> loop with the page locked, thinking someone else had locked it and would
> take care of it.  Since the page wasn't unmapped from the page table[s],
> of course it wouldn't migrate on fault--wouldn't even fault!
> 
> Fixed this.
> 
> Now:  lazy migration works w/ or w/o memcg configured, but NOT with the
> swap resource controller configured.  I'll look at that as time permits.

Update:  I now can't reproduce the lazy migration failure with the swap
resource controller configured.  Perhaps I had booted the wrong kernel
for the test reported above.  Now the updated patch series mentioned
above seems to be working with both memory and swap resource controllers
configured for simple memtoy driven lazy migration.

Lee  


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-19 15:26                     ` Lee Schermerhorn
@ 2009-06-19 15:41                       ` Balbir Singh
  2009-06-19 15:59                         ` Lee Schermerhorn
  2009-06-19 21:19                       ` Stefan Lankes
  1 sibling, 1 reply; 44+ messages in thread
From: Balbir Singh @ 2009-06-19 15:41 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

* Lee Schermerhorn <Lee.Schermerhorn@hp.com> [2009-06-19 11:26:53]:

> On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> > On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > > I've placed the last rebased version in :
> > > > > 
> > > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > > 081110/
> > > > > 
> > > > 
> > > > OK! I will try to reconstruct the problem.
> > > 
> > > Stefan:
> > > 
> > > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > > [along with my shared policy series atop which they sit in my tree].
> > > Patches reside in:
> > > 
> > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> > > 
> > 
> > I have updated the migrate-on-fault tarball in the above location to fix
> > part of the problems I was seeing.  See below.
> > 
> > > 
> > > I did a quick test.  I'm afraid the patches have suffered some "bit rot"
> > > vis a vis mainline/mmotm over the past several months.  Two possibly
> > > related issues:
> > > 
> > > 1) lazy migration doesn't seem to work. Looks like
> > > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > > pages so, of course, migrate on fault won't work.  I suspect the
> > > reference count handling has changed since I last tried this.  [Note one
> > > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > > flag definitions in mempolicy.h and I may have botched the resolution
> > > thereof.]
> > > 
> > > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > > and free_pages_check()/bad_page() bugs out with bad page state.
> > > 
> > > Note:  This is independent of memcg--i.e., happens whether or not memcg
> > > configured.
> > > 
> > <snip>
> > 
> > OK.  Found time to look at this.  Turns out I hadn't tested since
> > trylock_page() was introduced.  I did a one-for-one replacement of the
> > old API [TestSetPageLocked()], not noticing that the sense of the return
> > was inverted.  Thus, I was bailing out of the migrate_pages_unmap_only()
> > loop with the page locked, thinking someone else had locked it and would
> > take care of it.  Since the page wasn't unmapped from the page table[s],
> > of course it wouldn't migrate on fault--wouldn't even fault!
> > 
> > Fixed this.
> > 
> > Now:  lazy migration works w/ or w/o memcg configured, but NOT with the
> > swap resource controller configured.  I'll look at that as time permits.
> 
> Update:  I now can't reproduce the lazy migration failure with the swap
> resource controller configured.  Perhaps I had booted the wrong kernel
> for the test reported above.  Now the updated patch series mentioned
> above seems to be working with both memory and swap resource controllers
> configured for simple memtoy driven lazy migration.

Excellent, I presume that you are using the latest mmotm or mainline.
We've had some swap cache leakage fixes go in, but those are not as
serious (they can potentially cause OOM in a cgroup when the leak
occurs).


-- 
	Balbir

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-19 15:41                       ` Balbir Singh
@ 2009-06-19 15:59                         ` Lee Schermerhorn
  0 siblings, 0 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-19 15:59 UTC (permalink / raw)
  To: balbir
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, KOSAKI Motohiro

On Fri, 2009-06-19 at 21:11 +0530, Balbir Singh wrote:
> * Lee Schermerhorn <Lee.Schermerhorn@hp.com> [2009-06-19 11:26:53]:
> 
> > On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
> > > On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
> > > > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> > > > > > I've placed the last rebased version in :
> > > > > > 
> > > > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> > > > > > 081110/
> > > > > > 
> > > > > 
> > > > > OK! I will try to reconstruct the problem.
> > > > 
> > > > Stefan:
> > > > 
> > > > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > > > [along with my shared policy series atop which they sit in my tree].
> > > > Patches reside in:
> > > > 
> > > > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> > > > 
> > > 
> > > I have updated the migrate-on-fault tarball in the above location to fix
> > > part of the problems I was seeing.  See below.
> > > 
> > > > 
> > > > I did a quick test.  I'm afraid the patches have suffered some "bit rot"
> > > > vis a vis mainline/mmotm over the past several months.  Two possibly
> > > > related issues:
> > > > 
> > > > 1) lazy migration doesn't seem to work. Looks like
> > > > mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
> > > > pages so, of course, migrate on fault won't work.  I suspect the
> > > > reference count handling has changed since I last tried this.  [Note one
> > > > of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
> > > > flag definitions in mempolicy.h and I may have botched the resolution
> > > > thereof.]
> > > > 
> > > > 2) When the pages get freed on exit/unmap, they are still PageLocked()
> > > > and free_pages_check()/bad_page() bugs out with bad page state.
> > > > 
> > > > Note:  This is independent of memcg--i.e., happens whether or not memcg
> > > > configured.
> > > > 
> > > <snip>
> > > 
> > > OK.  Found time to look at this.  Turns out I hadn't tested since
> > > trylock_page() was introduced.  I did a one-for-one replacement of the
> > > old API [TestSetPageLocked()], not noticing that the sense of the return
> > > was inverted.  Thus, I was bailing out of the migrate_pages_unmap_only()
> > > loop with the page locked, thinking someone else had locked it and would
> > > take care of it.  Since the page wasn't unmapped from the page table[s],
> > > of course it wouldn't migrate on fault--wouldn't even fault!
> > > 
> > > Fixed this.
> > > 
> > > Now:  lazy migration works w/ or w/o memcg configured, but NOT with the
> > > swap resource controller configured.  I'll look at that as time permits.
> > 
> > Update:  I now can't reproduce the lazy migration failure with the swap
> > resource controller configured.  Perhaps I had booted the wrong kernel
> > for the test reported above.  Now the updated patch series mentioned
> > above seems to be working with both memory and swap resource controllers
> > configured for simple memtoy driven lazy migration.
> 
> Excellent, I presume that you are using the latest mmotm or mainline.
> We've had some swap cache leakage fix go in, but those are not as
> serious (they can potentially cause OOM in a cgroup when the leak
> occurs).

Yes, I'm using the 12jun mmotm atop 2.6.30.   I use the mmotm timestamp
in my kernel versions to show the base I'm using.  E.g., see the URL
above.

Lee


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-19 15:26                     ` Lee Schermerhorn
  2009-06-19 15:41                       ` Balbir Singh
@ 2009-06-19 21:19                       ` Stefan Lankes
  1 sibling, 0 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-06-19 21:19 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, 'Brice Goglin',
	KAMEZAWA Hiroyuki, Balbir Singh, KOSAKI Motohiro



Lee Schermerhorn wrote:
> On Thu, 2009-06-18 at 15:04 -0400, Lee Schermerhorn wrote:
>> On Thu, 2009-06-18 at 00:37 -0400, Lee Schermerhorn wrote:
>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>>>> I've placed the last rebased version in :
>>>>>
>>>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
>>>>> 081110/
>>>>>
>>>> OK! I will try to reconstruct the problem.
>>> Stefan:
>>>
>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>>> [along with my shared policy series atop which they sit in my tree].
>>> Patches reside in:
>>>
>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>>
>> I have updated the migrate-on-fault tarball in the above location to fix
>> part of the problems I was seeing.  See below.
>>
>>> I did a quick test.  I'm afraid the patches have suffered some "bit rot"
>>> vis a vis mainline/mmotm over the past several months.  Two possibly
>>> related issues:
>>>
>>> 1) lazy migration doesn't seem to work. Looks like
>>> mbind(<some-policy>+MPOL_MF_MOVE+MPOL_MF_LAZY) is not unmapping the
>>> pages so, of course, migrate on fault won't work.  I suspect the
>>> reference count handling has changed since I last tried this.  [Note one
>>> of the patch conflicts was in the MPOL_MF_LAZY addition to the mbind
>>> flag definitions in mempolicy.h and I may have botched the resolution
>>> thereof.]
>>>
>>> 2) When the pages get freed on exit/unmap, they are still PageLocked()
>>> and free_pages_check()/bad_page() bugs out with bad page state.
>>>
>>> Note:  This is independent of memcg--i.e., happens whether or not memcg
>>> configured.
>>>
>> <snip>
>>
>> OK.  Found time to look at this.  Turns out I hadn't tested since
>> trylock_page() was introduced.  I did a one-for-one replacement of the
>> old API [TestSetPageLocked()], not noticing that the sense of the return
>> was inverted.  Thus, I was bailing out of the migrate_pages_unmap_only()
>> loop with the page locked, thinking someone else had locked it and would
>> take care of it.  Since the page wasn't unmapped from the page table[s],
>> of course it wouldn't migrate on fault--wouldn't even fault!
>>
>> Fixed this.
>>
>> Now:  lazy migration works w/ or w/o memcg configured, but NOT with the
>> swap resource controller configured.  I'll look at that as time permits.
> 
> Update:  I now can't reproduce the lazy migration failure with the swap
> resource controller configured.  Perhaps I had booted the wrong kernel
> for the test reported above.  Now the updated patch series mentioned
> above seems to be working with both memory and swap resource controllers
> configured for simple memtoy driven lazy migration.
> 

The current version of your patch works fine on my system. I tested the 
patches with our test applications and got very good performance results!

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-16  2:25       ` Lee Schermerhorn
@ 2009-06-20  7:24         ` Brice Goglin
  2009-06-22 13:49           ` Lee Schermerhorn
  0 siblings, 1 reply; 44+ messages in thread
From: Brice Goglin @ 2009-06-20  7:24 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Andi Kleen, Stefan Lankes, linux-kernel, linux-numa,
	Boris Bierbaum, 'Brice Goglin'

Lee Schermerhorn wrote:
> My patches don't have per process enablement.  Rather, I chose to use
> per cpuset enablement.  I view cpusets as sort of "numa control groups"
> and thought this was an appropriate level at which to control this sort
> of behavior--analogous to memory_spread_{page|slab}.  That probably
> needs to be discussed more widely, tho'.
>   

Could you explain why you actually want to enable/disable
migrate-on-fault on a cpuset (or process) basis? Why would an
administrator want to disable it? Aren't the existing cpuset memory
restriction abilities enough?

Brice


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-18  4:37                 ` Lee Schermerhorn
  2009-06-18 19:04                   ` Lee Schermerhorn
@ 2009-06-22 12:34                   ` Brice Goglin
  2009-06-22 14:24                     ` Lee Schermerhorn
  2009-06-22 14:32                     ` Stefan Lankes
  1 sibling, 2 replies; 44+ messages in thread
From: Brice Goglin @ 2009-06-22 12:34 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>   
>>> I've placed the last rebased version in :
>>>
>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
>>> 081110/
>>>
>>>       
>> OK! I will try to reconstruct the problem.
>>     
>
> Stefan:
>
> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> [along with my shared policy series atop which they sit in my tree].
> Patches reside in:
>
> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>
>   

I gave this patchset a try and indeed it seems to work fine, thanks a
lot. But the migration performance isn't very good. I am seeing about
540MB/s when doing mbind+touch_all_pages on large buffers on a
quad-barcelona machine. move_pages gets 640MB/s there. And my own
next-touch implementation was near 800MB/s in the past.

I wonder if there is a more general migration performance degradation in
the latest Linus git. move_pages performance was supposed to increase by
15% (to more than 700MB/s) thanks to commit dfa33d45, but I don't seem to
see the improvement with git or mmotm. Also, migrate_pages performance
seems to have decreased, but that regression might be older than 2.6.30.
I need to find some time to git bisect all this; otherwise it's hard to
compare the performance of your migrate-on-fault with other, older
implementations :)

When do you plan to actually submit all your patches for inclusion?

thanks,
Brice


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-20  7:24         ` Brice Goglin
@ 2009-06-22 13:49           ` Lee Schermerhorn
  0 siblings, 0 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 13:49 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Andi Kleen, Stefan Lankes, linux-kernel, linux-numa, Boris Bierbaum

On Sat, 2009-06-20 at 09:24 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > My patches don't have per process enablement.  Rather, I chose to use
> > per cpuset enablement.  I view cpusets as sort of "numa control groups"
> > and thought this was an appropriate level at which to control this sort
> > of behavior--analogous to memory_spread_{page|slab}.  That probably
> > needs to be discussed more widely, tho'.
> >   
> 
> Could you explain why you actually want to enable/disable
> migrate-on-fault on a cpuset (or process) basis? Why would an
> administrator want to disable it? Aren't the existing cpuset memory
> restriction abilities enough?
> 
> Brice
> 

Hello, Brice:

There are a couple of aspects to this question, I think.

1) Why enable/disable at all?  Why not always enabled?

When I try out some new behavior such as migrate-on-fault, I start with
the assumption [right or wrong] that not all users will want this
behavior.  For migrate-on-fault, one probably won't run into it all that
often unless the MPOL_MF_LAZY flag is used to forcibly unmap regions.
However, with swap read-ahead, one could end up with anon pages in the
swap cache with no pte references, and could experience unexpected
migrations.  I've learned that some folks really don't like
surprises :).  Now, when you consider the "automigration" feature
["auto" here means "self" more than "automatic"], I think it's more
important to be able to enable/disable it.  I've not seen any
performance degradation when using it, but I feared that for some
workloads, thrashing could cause such degradation.  Page migration isn't
free.

Also, because Linux runs on such a wide range of platforms, I don't want
to burden smaller, embedded systems with the additional code, so I also
try to make the feature source configurable.  I know we worry about the
proliferation of config options, but it's easier to remove one after the
fact, I think, than to retrofit it.

2) Why a per-cpuset control?

I consider cpusets to be "numa control groups".  They constrain
resources on a numa node [and related cpus] granularity, and control
numa related behavior, such as migration when changing cpusets,
spreading page cache and slab pages over nodes in the cpuset, ...  In
fact, I think it would have been appropriate to call the cpuset control
group the "numa control group" when cgroups were introduced, but it's
too late for that now.

Finally, and not a reason to include the controls in the mainline, it's
REALLY useful during development.  One can boot a test kernel, and only
enable the feature in a test cpuset, limiting the damage of, e.g., a
reference counting bug or such.  It's also useful for measuring the
overhead of the patches absent any actual page migrations.  However, if
this feature ever makes it to mainline, the community will have its say
on whether these controls should be included and how.

Hope this helps,
Lee




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 12:34                   ` Brice Goglin
@ 2009-06-22 14:24                     ` Lee Schermerhorn
  2009-06-22 15:28                       ` Brice Goglin
  2009-06-22 14:32                     ` Stefan Lankes
  1 sibling, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 14:24 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 14:34 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >   
> >>> I've placed the last rebased version in :
> >>>
> >>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.28-rc4-mmotm-
> >>> 081110/
> >>>
> >>>       
> >> OK! I will try to reconstruct the problem.
> >>     
> >
> > Stefan:
> >
> > Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> > [along with my shared policy series atop which they sit in my tree].
> > Patches reside in:
> >
> > http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >
> >   
> 
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-barcelona machines. move_pages gets 640MB/s there. And my own
> next-touch implementation were near 800MB/s in the past.

Interesting.  Do you have any idea where the differences come from?  Are
you comparing them on the same kernel versions?  I don't know the
details of your implementation, but one possible area is the check for
"misplacement".  When migrate-on-fault is enabled, I check all pages
with page_mapcount() == 0 for misplacement in the [swap page] fault
path.  That, and other filtering to eliminate unnecessary migrations,
could cause extra overhead.

Aside:  currently, my implementation could migrate a page, only to find
that it will be replaced by a new page due to copy-on-write.  I have on
my list to check write access and whether we can reuse the swap page and
avoid the migration if we're going to COW later anyway.  This could
improve performance for write accesses, if the snoop traffic doesn't
overshadow any such improvement.
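
Related note, for anyone comparing these implementations: you can check
where a page actually lives from userspace via the documented move_pages(2)
query mode--with nodes == NULL it moves nothing and just reports each
page's current node (a sketch; link with -lnuma):

#include <numaif.h>

/* Return the NUMA node currently holding the page at addr, or -1.
 * With nodes == NULL, move_pages() only queries; it migrates nothing. */
static int page_node(void *addr)
{
  void *pages[1] = { addr };
  int status[1] = { -1 };

  if (move_pages(0, 1, pages, NULL, status, 0) != 0)
    return -1;
  return status[0];   /* node id, or -errno for that page */
}

Calling such a helper before and after touching a lazily-unmapped page
shows whether migrate-on-fault actually moved it.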

> 
> I wonder if there is a more general migration performance degradation in
> latest Linus git. move_pages performance was supposed to increase by 15%
> (more than 700MB/s) thanks to commit dfa33d45 but I don't seem to see
> the improvement with git or mmotm. Also migrate_pages seems to have
> decreased but it might be older than 2.6.30. I need to find some time to
> git bisect all this, otherwise it's hard to compare the performance of
> your migrate-on-fault with other older implementations :)

Confession:  I've not measured migration performance directly.  Rather,
I've only observed how applications/benchmarks perform with
migrate-on-fault+automigration enabled.  On the platforms available to
me back when I was actively working on this, I did see improvements in
real and user time due to improved locality, especially under heavy load
when interconnect bandwidth is at a premium.  Of course, system time
increased because of the migration overheads.

> 
> When do you plan to actually submit all your patches for inclusion?

I had/have no immediate plans.  I held off on these series while other
mm features--reclaim scalability, memory control groups, ...--seemed
higher priority, and the churn in mm made it difficult to keep these
patches up to date.  Now that the patches seem to be working again, I
plan to test them on newer platforms with more "interesting" numa
topologies.  If they work well there, and with your interest and
cooperation, perhaps we can try again with some variant or combination
of our approaches.

Regards,
Lee


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 12:34                   ` Brice Goglin
  2009-06-22 14:24                     ` Lee Schermerhorn
@ 2009-06-22 14:32                     ` Stefan Lankes
  2009-06-22 14:56                       ` Lee Schermerhorn
  1 sibling, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-22 14:32 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Lee Schermerhorn, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro



Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>   
>>
>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>> [along with my shared policy series atop which they sit in my tree].
>> Patches reside in:
>>
>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>
>>   
> 
> I gave this patchset a try and indeed it seems to work fine, thanks a
> lot. But the migration performance isn't very good. I am seeing about
> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> quad-barcelona machines. move_pages gets 640MB/s there. And my own
> next-touch implementation were near 800MB/s in the past.

I used a modified stream benchmark to evaluate the performance of Lee's
and my versions of the next-touch implementation. In this low-level
benchmark, Lee's patch performs better than mine. I think that Brice and I
use the same technique to realize affinity-on-next-touch. Do you use a
different kernel version to evaluate the performance?

Stefan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 14:32                     ` Stefan Lankes
@ 2009-06-22 14:56                       ` Lee Schermerhorn
  2009-06-22 15:42                         ` Stefan Lankes
  0 siblings, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 14:56 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: Brice Goglin, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
> 
> Brice Goglin wrote:
> > Lee Schermerhorn wrote:
> >> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>   
> >>
> >> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >> [along with my shared policy series atop which they sit in my tree].
> >> Patches reside in:
> >>
> >> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >>
> >>   
> > 
> > I gave this patchset a try and indeed it seems to work fine, thanks a
> > lot. But the migration performance isn't very good. I am seeing about
> > 540MB/s when doing mbind+touch_all_pages on large buffers on a
> > quad-barcelona machines. move_pages gets 640MB/s there. And my own
> > next-touch implementation were near 800MB/s in the past.
> 
> I used a modified stream benchmark to evaluate the performance of Lee's 
> and my version of the next-touch implementation. In this low-level 
> benchmark is Lee's patch better than my patch. I think that Brice and I 
> use the same technique to realize affinity-on-next-touch. Do you use 
> another kernel version to evaluate the performance?

Hi, Stefan:

I also used a [modified!] stream benchmark to test my patches.  One of
the modifications was to dump the time it takes for one pass over the
data arrays to a specific file descriptor, if that file descriptor was
open at start time--e.g., via something like "4>stream_times".  Then, I
increased the number of iterations to something large so that I could
run other tests during the stream run.  I plotted the "time per
iteration" vs iteration number and could see that after any transient
load, the stream benchmark returned to a good [not sure if maximal]
locality state.  The time per iteration was comparable to that of
hand-affinitized threads.  Without automigration and hand
affinitization, any transient load would scramble the location of the
threads relative to the data regions they were operating on due to load
balancing.  The more nodes you have, the less likely you'll end up in a
good state.
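
In outline, the timing hook looks something like this (a sketch with
assumed details, not the exact code):

#include <stdio.h>
#include <fcntl.h>
#include <sys/time.h>

static FILE *times_fp;   /* per-iteration times go here, if requested */

/* If the shell opened fd 4 for us (e.g. "./stream 4>stream_times"),
 * attach a stdio stream to it; otherwise stay silent. */
static void times_open(void)
{
  if (fcntl(4, F_GETFD) != -1)
    times_fp = fdopen(4, "w");
}

static double wtime(void)
{
  struct timeval tv;
  gettimeofday(&tv, NULL);
  return tv.tv_sec + tv.tv_usec / 1e6;
}

/* In the benchmark's iteration loop:
 *
 *   double t0 = wtime();
 *   ... one pass over the data arrays ...
 *   if (times_fp) {
 *     fprintf(times_fp, "%d %.6f\n", iter, wtime() - t0);
 *     fflush(times_fp);
 *   }
 */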

I was using a parallel kernel make [-j <2*nr_cpus>] as the load.  In
addition to the stream returning to good locality, I noticed that the
kernel build completed much faster in the presence of the stream load
with automigration enabled.  I reported these results in a presentation
at LCA'07.  Slides and video [yuck! :)] are available online at the
LCA'07 site.

Regards,
Lee


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 14:24                     ` Lee Schermerhorn
@ 2009-06-22 15:28                       ` Brice Goglin
  2009-06-22 16:55                         ` Lee Schermerhorn
  0 siblings, 1 reply; 44+ messages in thread
From: Brice Goglin @ 2009-06-22 15:28 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
>> I gave this patchset a try and indeed it seems to work fine, thanks a
>> lot. But the migration performance isn't very good. I am seeing about
>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
>> quad-barcelona machines. move_pages gets 640MB/s there. And my own
>> next-touch implementation were near 800MB/s in the past.
>>     
>
> Interesting.  Do you have any idea where the differences come from?  Are
> you comparing them on the same kernel versions?  I don't know the
> details of your implementation, but one possible area is the check for
> "misplacement".  When migrate-on-fault is enabled, I check all pages
> with page_mapcount() == 0 for misplacement in the [swap page] fault
> path.  That, and other filtering to eliminate unnecessary migrations
> could cause extra overhead.
>   

(I'll actually talk about this at the Linux Symposium) I used 2.6.27
initially, with some 2.6.29 patches to fix the throughput of move_pages
for large buffers. So move_pages was getting about 600MB/s there. Then
my own (hacky) next-touch implementation was getting about 800MB/s. The
main difference with your code is that mine only modifies the current
process's PTE without touching the other processes if the page is shared.
So my code basically only supports private pages; it duplicates/migrates
them on next-touch. I thought it was faster than move_pages because I
didn't support shared-page migration. But I found out later that
move_pages could be further improved, up to about 750MB/s (it will be in
2.6.31).

So now, I'd expect both the next-touch migration and move_pages to have
similar migration throughput, about 750-800MB/s on my quad-barcelona
machine. Right now, I'm seeing less than that for both, so there might
be a deeper problem. Actually, looking at COW performance when the new
page is allocated on a remote numa node, I also see the throughput much
lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s). Maybe a
regression in the low-level page copy routine?

Brice


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 14:56                       ` Lee Schermerhorn
@ 2009-06-22 15:42                         ` Stefan Lankes
  2009-06-22 16:38                           ` Lee Schermerhorn
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-22 15:42 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Brice Goglin, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro



Lee Schermerhorn wrote:
> On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
>> Brice Goglin wrote:
>>> Lee Schermerhorn wrote:
>>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
>>>>   
>>>>
>>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
>>>> [along with my shared policy series atop which they sit in my tree].
>>>> Patches reside in:
>>>>
>>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
>>>>
>>>>   
>>> I gave this patchset a try and indeed it seems to work fine, thanks a
>>> lot. But the migration performance isn't very good. I am seeing about
>>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
>>> quad-barcelona machines. move_pages gets 640MB/s there. And my own
>>> next-touch implementation were near 800MB/s in the past.
>> I used a modified stream benchmark to evaluate the performance of Lee's 
>> and my version of the next-touch implementation. In this low-level 
>> benchmark is Lee's patch better than my patch. I think that Brice and I 
>> use the same technique to realize affinity-on-next-touch. Do you use 
>> another kernel version to evaluate the performance?
> 
> Hi, Stefan:
> 
> I also used a [modified!] stream benchmark to test my patches.  One of
> the modifications was to dump the time it takes for one pass over the
> data arrays to a specific file description, if that file description was
> open at start time--e.g., via something like "4>stream_times".  Then, I
> increased the number of iterations to something large so that I could
> run other tests during the stream run.  I plotted the "time per
> iteration" vs iteration number and could see that after any transient
> load, the stream benchmark returned to a good [not sure if maximal]
> locality state.  The time per interation was comparable to hand
> affinitized of the threads.  Without automigration and hand
> affinitization, any transient load would scramble the location of the
> threads relative to the data region they were operating on due to load
> balancing.  The more nodes you have, the less likely you'll end up in a
> good state.
> 
> I was using a parallel kernel make [-j <2*nr_cpus>] as the load.  In
> addition to the stream returning to good locality, I noticed that the
> kernel build completed much faster in the presence of the stream load
> with automigration enabled.  I reported these results in a presentation
> at LCA'07.  Slides and video [yuck! :)] are available on line at the
> LCA'07 site.

I think that you use migration-on-fault in the context of automigration.
Brice and I use affinity-on-next-touch/migration-on-fault in another
context. If the access pattern of an application changes, we want to
redistribute the pages in a "nearly" ideal manner. Sometimes it is
difficult to determine the ideal page distribution. In such cases,
affinity-on-next-touch could be an attractive solution. In our test
applications, we insert the system call at certain points to enable
affinity-on-next-touch and redistribute the pages. Assuming that the
next-touching thread then uses these pages very often, we improve the
performance of our test applications.

Regards,

Stefan


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 15:42                         ` Stefan Lankes
@ 2009-06-22 16:38                           ` Lee Schermerhorn
  0 siblings, 0 replies; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 16:38 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: Brice Goglin, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 17:42 +0200, Stefan Lankes wrote:
> 
> Lee Schermerhorn wrote:
> > On Mon, 2009-06-22 at 16:32 +0200, Stefan Lankes wrote:
> >> Brice Goglin wrote:
> >>> Lee Schermerhorn wrote:
> >>>> On Wed, 2009-06-17 at 09:45 +0200, Stefan Lankes wrote:
> >>>>   
> >>>>
> >>>> Today I rebased the migrate on fault patches to 2.6.30-mmotm-090612...
> >>>> [along with my shared policy series atop which they sit in my tree].
> >>>> Patches reside in:
> >>>>
> >>>> http://free.linux.hp.com/~lts/Patches/PageMigration/2.6.30-mmotm-090612-1220/
> >>>>
> >>>>   
> >>> I gave this patchset a try and indeed it seems to work fine, thanks a
> >>> lot. But the migration performance isn't very good. I am seeing about
> >>> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >>> quad-barcelona machines. move_pages gets 640MB/s there. And my own
> >>> next-touch implementation were near 800MB/s in the past.
> >> I used a modified stream benchmark to evaluate the performance of Lee's 
> >> and my version of the next-touch implementation. In this low-level 
> >> benchmark is Lee's patch better than my patch. I think that Brice and I 
> >> use the same technique to realize affinity-on-next-touch. Do you use 
> >> another kernel version to evaluate the performance?
> > 
> > Hi, Stefan:
> > 
> > I also used a [modified!] stream benchmark to test my patches.  One of
> > the modifications was to dump the time it takes for one pass over the
> > data arrays to a specific file description, if that file description was
> > open at start time--e.g., via something like "4>stream_times".  Then, I
> > increased the number of iterations to something large so that I could
> > run other tests during the stream run.  I plotted the "time per
> > iteration" vs iteration number and could see that after any transient
> > load, the stream benchmark returned to a good [not sure if maximal]
> > locality state.  The time per interation was comparable to hand
> > affinitized of the threads.  Without automigration and hand
> > affinitization, any transient load would scramble the location of the
> > threads relative to the data region they were operating on due to load
> > balancing.  The more nodes you have, the less likely you'll end up in a
> > good state.
> > 
> > I was using a parallel kernel make [-j <2*nr_cpus>] as the load.  In
> > addition to the stream returning to good locality, I noticed that the
> > kernel build completed much faster in the presence of the stream load
> > with automigration enabled.  I reported these results in a presentation
> > at LCA'07.  Slides and video [yuck! :)] are available on line at the
> > LCA'07 site.
> 
> I think that you use migration-on-fault in the context of automigration. 
> Brice and I use affinity-on-next-touch/migration-on-fault in another 
> context. If the access pattern of an application changed, we want to 
> redistribute the pages in "nearly" ideal matter. Sometimes it is 
> difficult to determine the ideal page distribution. In such cases, 
> affinity-on-next-touch could be an attractive solution. In our test 
> applications, we add at some certain points the system call to use 
> affinity-on-next-touch and redistribute the pages. Assumed that the next 
> thread use these pages very often, we improve the performance of our 
> test applications.

I understand.  That's one of the motivations for MPOL_MF_LAZY and the
MPOL_NOOP policy mode.  It simply unmaps [removes pte refs from] the
pages, priming them for migration on next touch, if they are "misplaced"
relative to the task touching them.  It's useful for testing, and that's
my personal primary use case, but I did envision it for use in
applications that know they're entering a new computation phase with
different access patterns.
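
In user code the idiom is then roughly (a sketch against the patched
headers; MPOL_NOOP and MPOL_MF_LAZY do not exist in vanilla kernels, and
the fallback values below are the ones reported later in this thread):

#include <stdio.h>
#include <numaif.h>

#ifndef MPOL_NOOP
#define MPOL_NOOP     4       /* don't change the policy */
#endif
#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY  (1<<3)  /* modifies _MOVE: lazy migrate on fault */
#endif

/* At a phase change: leave the policy as-is, but unmap the region so
 * each page migrates to whichever node touches it next. */
static void prime_next_touch(void *buf, unsigned long len)
{
  if (mbind(buf, len, MPOL_NOOP, NULL, 0, MPOL_MF_MOVE | MPOL_MF_LAZY) < 0)
    perror("mbind");
}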

Lee



^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 15:28                       ` Brice Goglin
@ 2009-06-22 16:55                         ` Lee Schermerhorn
  2009-06-22 17:06                           ` Brice Goglin
  0 siblings, 1 reply; 44+ messages in thread
From: Lee Schermerhorn @ 2009-06-22 16:55 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

On Mon, 2009-06-22 at 17:28 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> >> I gave this patchset a try and indeed it seems to work fine, thanks a
> >> lot. But the migration performance isn't very good. I am seeing about
> >> 540MB/s when doing mbind+touch_all_pages on large buffers on a
> >> quad-barcelona machines. move_pages gets 640MB/s there. And my own
> >> next-touch implementation were near 800MB/s in the past.
> >>     
> >
> > Interesting.  Do you have any idea where the differences come from?  Are
> > you comparing them on the same kernel versions?  I don't know the
> > details of your implementation, but one possible area is the check for
> > "misplacement".  When migrate-on-fault is enabled, I check all pages
> > with page_mapcount() == 0 for misplacement in the [swap page] fault
> > path.  That, and other filtering to eliminate unnecessary migrations
> > could cause extra overhead.
> >   
> 
> (I'll actually talk about this at the Linux Symposium) I used 2.6.27
> initially, with some 2.6.29 patches to fix the throughput of move_pages
> for large buffers. So move_pages was getting about 600MB/s there. Then
> my own (hacky) next-touch implementation was getting about 800MB/s. The
> main difference with your code is that mine only modifies the current
> process PTE without touching the other processes if the page is shared.

The primary difference should be at unmap time, right?  In the fault
path, I only update the pte of the faulting task.  That's why I require
the [anon] pages to be in the swap cache [or something similar].  I
don't want to be fixing up other tasks' page tables in the context of
the faulting task's fault handler.  If, later, another task touches the
page, it will take a minor fault and find the [possibly migrated] page
in the cache.  Hmmm, I guess all tasks WILL incur the minor fault if
they touch the page after the unmap.  That could be part of the
difference if you compare on the same kernel version.

> So my code basically only supports private pages, it duplicates/migrates
> them on next-touch. I thought it was faster than move_pages because I
> didn't support shared-page migration. But, I found out later that
> move_pages could be further improved up to about 750MB/s (it will be in
> 2.6.31).
> 
> So now, I'd expect both the next-touch migration and move_pages to have
> similar migration throughput, about 750-800MB/s on my quad-barcelona
> machine. Right now, I'm seeing less than that for both, so there might
> be a problem deeper. 

Try booting with cgroup_disable=memory on the command line, if you have
the memory resource controller configured in.  See what that does to
your measurements.

> Actually, looking at COW performance when the new
> page is allocated on a remote numa node, I also see the throughput much
> lower in 2.6.29+ (about 720MB/s) than in 2.6.27 (about 850MB/s). Maybe a
> regression in the low-level page copy routine?

??? I would expect low-level page copying to be highly optimized per
arch, and also fairly stable.  Based on recent experience, I'd more
likely suspect the mm housekeeping overheads--e.g., global and per-memcg
lru management, ...  We've seen a lot of new code in this area in the
past few releases.

Lee


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 16:55                         ` Lee Schermerhorn
@ 2009-06-22 17:06                           ` Brice Goglin
  2009-06-22 17:59                             ` Stefan Lankes
  0 siblings, 1 reply; 44+ messages in thread
From: Brice Goglin @ 2009-06-22 17:06 UTC (permalink / raw)
  To: Lee Schermerhorn
  Cc: Stefan Lankes, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

Lee Schermerhorn wrote:
> The primary difference should be at unmap time, right?  In the fault
> path, I only update the pte of the faulting task.  That's why I require
> the [anon] pages to be in the swap cache [or something similar].  I
> don't want to be fixing up other tasks' page tables in the context of
> the faulting task's fault handler.  If, later, another task touches the
> page, it will take a minor fault and find the [possibly migrated] page
> in the cache.  Hmmm, I guess all tasks WILL incur the minor fault if
> they touch the page after the unmap.  That could be part of the
> difference if you compare on the same kernel version.
>   

Agreed.

> Try booting with cgroup_disable=memory on the command line, if you have
> the memory resource controller configured in.  See what that does to
> your measurements.
>   

It doesn't seem to help. I'll try to bisect and find where the
performance dropped.

> ??? I would expect low level page copying to be highly optimized per
> arch, and also fairly stable.

I just did a quick copy_page benchmark and didn't see any performance
difference between 2.6.27 and mmotm.

thanks,
Brice


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 17:06                           ` Brice Goglin
@ 2009-06-22 17:59                             ` Stefan Lankes
  2009-06-22 19:10                               ` Brice Goglin
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-22 17:59 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Lee Schermerhorn, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro



Brice Goglin wrote:
> Lee Schermerhorn wrote:
>> The primary difference should be at unmap time, right?  In the fault
>> path, I only update the pte of the faulting task.  That's why I require
>> the [anon] pages to be in the swap cache [or something similar].  I
>> don't want to be fixing up other tasks' page tables in the context of
>> the faulting task's fault handler.  If, later, another task touches the
>> page, it will take a minor fault and find the [possibly migrated] page
>> in the cache.  Hmmm, I guess all tasks WILL incur the minor fault if
>> they touch the page after the unmap.  That could be part of the
>> difference if you compare on the same kernel version.
>>   
> 
> Agreed.
> 
>> Try booting with cgroup_disable=memory on the command line, if you have
>> the memory resource controller configured in.  See what that does to
>> your measurements.
>>   
> 
> It doesn't seem to help. I'll try to bisect and find where the
> performance dropped.
> 

I am not able to reproduce any performance degradation on my system.
Could you send me your low-level benchmark?

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 17:59                             ` Stefan Lankes
@ 2009-06-22 19:10                               ` Brice Goglin
  2009-06-22 20:16                                 ` Stefan Lankes
  0 siblings, 1 reply; 44+ messages in thread
From: Brice Goglin @ 2009-06-22 19:10 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: Lee Schermerhorn, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

Stefan Lankes wrote:
> I am not able to reconstruct any performance drawbacks on my system.
> Could you send me your low-level benchmark?

It's attached. As you may see, it's fairly trivial. It just does several
iterations of mbind+touch_all_pages for different power-of-two buffer
sizes. Just replace mbind with madvise in the inner loop if you want to
try it with your affinity-on-next-touch implementation.

Which kernels are you using when comparing your next-touch
implementation with Lee's patchset?

Brice


[-- Attachment #2: next-touch-mof-cost.c --]
[-- Type: text/x-csrc, Size: 2044 bytes --]

#define _GNU_SOURCE 1
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <stdio.h>
#include <stdlib.h>
#include <numa.h>
#include <numaif.h>
#include <errno.h>
#include <sched.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY (1<<3)
#endif

#define TOTALPAGES 262144

int nbpages, loop;
int pagesize;

int main(int argc, char **argv) {
  char *buffer;
  int i, err;
  unsigned long nodemask;
  int maxnode;
  struct timeval tv1, tv2;
  unsigned long us;
  cpu_set_t cset;

  /* put the thread on node 0 */
  CPU_ZERO(&cset);
  CPU_SET(0, &cset);
  err = sched_setaffinity(0, sizeof(cset), &cset);
  if (err < 0) {
    perror("sched_setaffinity");
    exit(-1);
  }

  pagesize = getpagesize();
  maxnode = numa_max_node();

  fprintf(stdout, "# Nb_pages\tCost(ns)\n");
  for(nbpages=2 ; nbpages<=TOTALPAGES ; nbpages*=2) {
    /* average over several sub-buffers, capping the iteration count */
    int loops = TOTALPAGES/nbpages;
    if (loops > 128) loops = 128;

    buffer = mmap(NULL, TOTALPAGES*pagesize, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) {
      perror("mmap");
      exit(-1);
    }

    /* bind to node 1 and prefault */
    nodemask = 1<<1;
    err = mbind(buffer, TOTALPAGES*pagesize, MPOL_BIND, &nodemask, maxnode+2, MPOL_MF_MOVE);
    if (err < 0) {
      perror("mbind");
      exit(-1);
    }
    for(i=0 ; i<TOTALPAGES ; i++)
      *(int*)(buffer+i*pagesize) = 0;

    gettimeofday(&tv1, NULL);

    for(loop=0 ; loop<loops ; loop++) {
      /* mark subbuf as next-touch and touch it from node 0 */
      char *subbuf = buffer + loop*nbpages*pagesize;
      err = mbind(subbuf, nbpages*pagesize, MPOL_PREFERRED, NULL, 0, MPOL_MF_MOVE|MPOL_MF_LAZY);
      if (err < 0) {
        perror("mbind");
        exit(-1);
      }
      for(i=0;i<nbpages;i++)
        *(int*)(subbuf + i*pagesize) = 42;
    }
    gettimeofday(&tv2, NULL);

    /* report the per-iteration cost in nanoseconds */
    us = (tv2.tv_sec - tv1.tv_sec) * 1000000 + (tv2.tv_usec - tv1.tv_usec);
    fprintf(stdout, "%d\t%lu\n", nbpages, us * 1000/loops);
    fflush(stdout);

    munmap(buffer, TOTALPAGES*pagesize);
  }

  return 0;
}
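
(For reference: the benchmark needs the libnuma development headers and
links with -lnuma, e.g. "gcc -O2 next-touch-mof-cost.c -o next-touch-mof-cost
-lnuma". On a vanilla kernel the MPOL_MF_LAZY mbind() call should simply
fail with EINVAL; exercising it requires the migrate-on-fault patches.)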

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 19:10                               ` Brice Goglin
@ 2009-06-22 20:16                                 ` Stefan Lankes
  2009-06-22 20:34                                   ` Brice Goglin
  0 siblings, 1 reply; 44+ messages in thread
From: Stefan Lankes @ 2009-06-22 20:16 UTC (permalink / raw)
  To: Brice Goglin
  Cc: Lee Schermerhorn, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro



Brice Goglin wrote:
> Stefan Lankes wrote:
>> I am not able to reconstruct any performance drawbacks on my system.
>> Could you send me your low-level benchmark?
> 
> It's attached. As you may see, it's fairly trivial. It just does several
> iterations of mbind+touch_all_pages for different power-of-two buffer
> sizes. Just replace mbind with madvise in the inner loop if you want to
> try with your affinit-on-next-touch.

I use MPOL_NOOP instead of MPOL_PREFERRED. On my system, MPOL_NOOP is
defined as 4 (in include/linux/mempolicy.h).

By the way, do you also add Lee's "shared policy" patches? These patches 
add MPOL_MF_SHARED, which is specified as 3. Afterwards, you have to 
define MPOL_MF_LAZY as 4.

I got following performance results with MPOL_NOOP:

# Nb_pages      Cost(ns)
2       44539
4       44695
8       53937
16      61625
32      87757
64      135070
128     233812
256     428539
512     870476
1024    1695859
2048    3280695
4096    6450328
8192    12719187
16384   25377750
32768   50431375
65536   101970000
131072  216200500
262144  511706000

I got following performance results with MPOL_PREFERRED:

# Nb_pages      Cost(ns)
2       50742
4       58656
8       79929
16      117171
32      195304
64      354851
128     744835
256     1354476
512     2759570
1024    5433304
2048    10173390
4096    20178453
8192    36452343
16384   71077375
32768   141738000
65536   281460250
131072  576971000
262144  1231694000

> Which kernels are you using when comparing your next-touch
> implementation with Lee's patchset?
> 

The current mmotm tree.

Regards,

Stefan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
  2009-06-22 20:16                                 ` Stefan Lankes
@ 2009-06-22 20:34                                   ` Brice Goglin
  0 siblings, 0 replies; 44+ messages in thread
From: Brice Goglin @ 2009-06-22 20:34 UTC (permalink / raw)
  To: Stefan Lankes
  Cc: Lee Schermerhorn, 'Andi Kleen',
	linux-kernel, linux-numa, Boris Bierbaum, KAMEZAWA Hiroyuki,
	Balbir Singh, KOSAKI Motohiro

Stefan Lankes wrote:
> By the way, do you also add Lee's "shared policy" patches? These
> patches add MPOL_MF_SHARED, which is specified as 3. Afterwards, you
> have to define MPOL_MF_LAZY as 4.

Yes, I applied shared-policy-* since migrate-on-fault doesn't apply
without them :)

But I have the following in include/linux/mempolicy.h after applying all
patches:
#define MPOL_MF_LAZY   (1<<3)   /* Modifies '_MOVE: lazy migrate on fault */
#define MPOL_F_SHARED  (1 << 0) /* identify shared policies */
Where did you get your F_SHARED=3 and MF_LAZY=4?

> I got following performance results with MPOL_NOOP:
>
> # Nb_pages      Cost(ns)
> 32768   50431375
> 65536   101970000
> 131072  216200500
> 262144  511706000

Is there any migration here? Don't you just have unmap and fault without
migration? In my test program, the initialization does MPOL_BIND. So the
following MPOL_NOOP should just do nothing since the page is already
correctly placed with regard to the previous MPOL_BIND. I feel like 2us
per page is too low for a migration, yet also very high for just an
unmap and fault-in.
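
For scale: at 262144 pages, 511706000 ns per pass works out to roughly
1950 ns per 4 KiB page, i.e. more than 2 GB/s if those pages were really
being copied--several times the ~540 MB/s migration throughput reported
earlier in this thread. That supports the suspicion that the MPOL_NOOP
case is not migrating at all.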

> I got following performance results with MPOL_PREFERRED:
>
> # Nb_pages      Cost(ns)
> 32768   141738000

That's about 60% faster than on my machine (quad-barcelona 8347HE
1.9GHz). What machine are you running on?

Brice


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH 0/4]: affinity-on-next-touch
@ 2009-05-11 14:31 Samuel Thibault
  0 siblings, 0 replies; 44+ messages in thread
From: Samuel Thibault @ 2009-05-11 14:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Stefan Lankes

Andi writes:
> > With this patch, the kernel reduces the overhead of page distribution via
> > "affinity-on-next-touch" from 2518ms to 366ms compared to the user-level
> 
> The interesting part is less how much faster it is compared to an user
> space implementation, but how much this migrate on touch approach
> helps in general compared to already existing policies. Some hard
> numbers on that would appreciated.

That is described in the papers that Stefan mentioned.  The problem is
that quite often it is very hard or even impossible to know which data
should be migrated where, because you have, for instance, a sparse
matrix which gets accessed by threads according to intermediate results.

> Note that for the OpenMP case old kernels sometimes had trouble because
> the threads tended to be not scheduled to the final target CPU
> on the first time slice so the memory was often first-touched
> on the wrong node.

It's not only that kind of issue: in a lot of applications, memory
is initialized sequentially by the main thread (not only zeroing, but
also reading files, etc.).  Setting "next-touch" right after sequential
initialization would just work fine.

Samuel

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2009-06-22 20:34 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-05-11  8:27 [RFC PATCH 0/4]: affinity-on-next-touch Stefan Lankes
2009-05-11  8:48 ` Dieter an Mey
2009-05-11 13:22 ` Andi Kleen
2009-05-11 13:32   ` Brice Goglin
2009-05-11 14:54   ` Stefan Lankes
2009-05-11 14:54     ` Stefan Lankes
2009-05-11 16:37     ` Andi Kleen
2009-05-11 17:22       ` Stefan Lankes
2009-06-11 18:45   ` Stefan Lankes
2009-06-12 10:32     ` Andi Kleen
2009-06-12 11:46       ` Stefan Lankes
2009-06-12 12:30         ` Brice Goglin
2009-06-12 13:21           ` Stefan Lankes
2009-06-12 13:48           ` Stefan Lankes
2009-06-16  2:39         ` Lee Schermerhorn
2009-06-16 13:58           ` Stefan Lankes
2009-06-16 14:59             ` Lee Schermerhorn
2009-06-17  1:22               ` KAMEZAWA Hiroyuki
2009-06-17 12:02                 ` Lee Schermerhorn
2009-06-17  7:45               ` Stefan Lankes
2009-06-18  4:37                 ` Lee Schermerhorn
2009-06-18 19:04                   ` Lee Schermerhorn
2009-06-19 15:26                     ` Lee Schermerhorn
2009-06-19 15:41                       ` Balbir Singh
2009-06-19 15:59                         ` Lee Schermerhorn
2009-06-19 21:19                       ` Stefan Lankes
2009-06-22 12:34                   ` Brice Goglin
2009-06-22 14:24                     ` Lee Schermerhorn
2009-06-22 15:28                       ` Brice Goglin
2009-06-22 16:55                         ` Lee Schermerhorn
2009-06-22 17:06                           ` Brice Goglin
2009-06-22 17:59                             ` Stefan Lankes
2009-06-22 19:10                               ` Brice Goglin
2009-06-22 20:16                                 ` Stefan Lankes
2009-06-22 20:34                                   ` Brice Goglin
2009-06-22 14:32                     ` Stefan Lankes
2009-06-22 14:56                       ` Lee Schermerhorn
2009-06-22 15:42                         ` Stefan Lankes
2009-06-22 16:38                           ` Lee Schermerhorn
2009-06-16  2:25       ` Lee Schermerhorn
2009-06-20  7:24         ` Brice Goglin
2009-06-22 13:49           ` Lee Schermerhorn
2009-06-16  2:21     ` Lee Schermerhorn
2009-05-11 14:31 Samuel Thibault
