* [RFC PATCH 0/4]: affinity-on-next-touch
@ 2009-05-11  8:27 Stefan Lankes
  2009-05-11  8:48 ` Dieter an Mey
  2009-05-11 13:22 ` Andi Kleen
  0 siblings, 2 replies; 44+ messages in thread
From: Stefan Lankes @ 2009-05-11  8:27 UTC (permalink / raw)
  To: linux-kernel

Hello,

I wrote a patch to support the adaptive data distribution strategy
"affinity-on-next-touch" for NUMA architectures. The patch is in an early
state and I am interested in your comments.

The basic idea of "affinity-on-next-touch" is this: via some runtime
mechanism, a user-level process activates "affinity-on-next-touch" for a
certain region of its virtual address space. Afterwards, each page in this
region is migrated to the node from which it is next accessed. Noordergraaf
and van der Pas [1] have proposed extending the OpenMP standard to support
this strategy. Since version 9, the "affinity-on-next-touch" mechanism has
been available in Solaris, where it is triggered via the madvise system
call. Löf and Holmgren [2] and Terboven et al. [3] have described their
encouraging experiences with this implementation.

Linux does not yet support "affinity-on-next-touch". Terboven et al. [3]
have presented a user-level implementation of this strategy for Linux. To
realize "affinity-on-next-touch" in user space, they protect a specific
memory area against read and write accesses and install a signal handler to
catch the resulting access violations. If a thread accesses a page in the
protected memory area, the signal handler migrates the page to the node of
the thread that caused the access violation. Afterwards, the signal handler
clears the page protection and the interrupted thread resumes.
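
For illustration, a minimal sketch of such a user-level scheme could look
like the code below (my own reconstruction, not the actual code of [3]). It
assumes libnuma for move_pages() and numa_node_of_cpu(), and it omits thread
safety, rounding of the region to page boundaries, and error handling:

   #define _GNU_SOURCE
   #include <sched.h>       /* sched_getcpu() */
   #include <signal.h>
   #include <stdint.h>
   #include <sys/mman.h>
   #include <unistd.h>
   #include <numa.h>        /* numa_node_of_cpu() */
   #include <numaif.h>      /* move_pages() */

   static long pagesize;

   /* SIGSEGV handler: migrate the faulting page to the node of the
      faulting thread, then unprotect it so the thread can continue. */
   static void touch_handler(int sig, siginfo_t *si, void *ctx)
   {
      void *page = (void *)((uintptr_t)si->si_addr &
                            ~(uintptr_t)(pagesize - 1));
      int node = numa_node_of_cpu(sched_getcpu());
      int status;

      move_pages(0, 1, &page, &node, &status, MPOL_MF_MOVE);
      mprotect(page, pagesize, PROT_READ | PROT_WRITE);
   }

   /* Arm "next-touch" for [addr, addr+len): every first access faults. */
   static void next_touch(void *addr, size_t len)
   {
      struct sigaction sa = {
         .sa_sigaction = touch_handler,
         .sa_flags = SA_SIGINFO,
      };

      pagesize = sysconf(_SC_PAGESIZE);
      sigaction(SIGSEGV, &sa, NULL);
      mprotect(addr, len, PROT_NONE);
   }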

Unfortunately, the overhead of this solution is very high. For instance,
distributing 512 MByte via "affinity-on-next-touch" takes 2518 ms with the
user-level solution on our dual-socket, quad-core Opteron 2376 system
running the current kernel (2.6.30-rc4). I measured this overhead with the
following OpenMP code:

   /* Activate "affinity-on-next-touch" for the array; MADV_ACCESS_LWP
      is the new madvise flag introduced by patch 1. */
   madvise(array, sizeof(int) * SIZE, MADV_ACCESS_LWP);
   start = omp_get_wtime();
   /* Touch one int per page, so every page is migrated exactly once. */
   #pragma omp parallel for
   for (j = 0; j < SIZE; j += pagesize / sizeof(int))
      array[j]++;
   end = omp_get_wtime();
   printf("time: %lf ms\n", (end - start) * 1000.0);

The benchmark uses 8 threads and each thread is bound to one core.
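
One possible way to establish such a binding (shown here only as a sketch
using GNU extensions) is to pin each OpenMP thread to the core matching its
thread number:

   #define _GNU_SOURCE
   #include <omp.h>
   #include <pthread.h>
   #include <sched.h>

   /* Pin each OpenMP thread to one core (core number = thread number). */
   static void bind_threads(void)
   {
      #pragma omp parallel
      {
         cpu_set_t set;

         CPU_ZERO(&set);
         CPU_SET(omp_get_thread_num(), &set);
         pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
      }
   }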

I have divided the patch into the following four parts:

[Patch 1/4]: Extends the madvise system call with a new advice value,
MADV_ACCESS_LWP (the same name as used in Solaris). The specified memory
area then uses "affinity-on-next-touch": madvise_access_lwp protects the
area against read and write accesses. To avoid unnecessary list operations,
the patch changes the permissions only in the page table entries and does
not update the list of VMAs. Besides this, madvise also sets the new
"untouched" bit in the struct page.
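
From user space, the new advice value is then used like any other madvise
flag. A usage sketch (the numeric value below is only a placeholder for
illustration; the real value comes from the patched headers):

   #include <stdio.h>
   #include <sys/mman.h>

   #ifndef MADV_ACCESS_LWP
   #define MADV_ACCESS_LWP 12      /* placeholder value, illustration only */
   #endif

   if (madvise(array, sizeof(int) * SIZE, MADV_ACCESS_LWP) != 0)
      perror("madvise(MADV_ACCESS_LWP)");   /* e.g. unpatched kernel */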

[Patch 2/4]: Via the new "untouched" bit in the struct page, the pte fault
handler detects that the page the thread tried to access uses
"affinity-on-next-touch". The kernel then reads the original permissions
from the vm_area_struct, restores them in the page table, and migrates the
page to the current node. To accelerate page migration, the patch avoids
unnecessary calls to migrate_prep().
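
In rough pseudo-kernel C, the idea is the following (a hypothetical sketch,
not the patch itself; page_untouched(), clear_page_untouched() and
migrate_page_to_node() stand in for the actual helpers):

   /* Called from the pte fault path for a protected "next-touch" page. */
   static int touch_fault(struct vm_area_struct *vma, unsigned long addr,
                          pte_t *ptep, struct page *page)
   {
      if (!page_untouched(page))   /* bit set by madvise(MADV_ACCESS_LWP) */
         return 0;                 /* not ours: normal fault handling */

      clear_page_untouched(page);
      /* Restore the original permissions recorded in the VMA ... */
      set_pte_at(vma->vm_mm, addr, ptep,
                 pte_modify(*ptep, vma->vm_page_prot));
      /* ... and migrate the page to the node of the faulting CPU. */
      migrate_page_to_node(page, numa_node_id());  /* hypothetical helper */
      return 1;
   }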

[Patch 3/4]: If the "untouched" bit is set, mprotect is not permitted to
change the permissions in the page table entry; with
"affinity-on-next-touch", the access permissions are set by the pte fault
handler instead.

[Patch 4/4]: This part of the patch adds some counters to detect migration
errors and publishes them via /proc/vmstat. Besides this, the Kconfig file
is extended with the new option CONFIG_AFFINITY_ON_NEXT_TOUCH.
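
The exact counter names are defined by the patch; checking them could look
like this simple filter over /proc/vmstat (a sketch; it assumes the new
counter names contain "migrate"):

   #include <stdio.h>
   #include <string.h>

   /* Print all migration-related lines of /proc/vmstat. */
   static void show_migration_counters(void)
   {
      FILE *f = fopen("/proc/vmstat", "r");
      char line[128];

      if (!f)
         return;
      while (fgets(line, sizeof(line), f))
         if (strstr(line, "migrate"))
            fputs(line, stdout);
      fclose(f);
   }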

With this patch, the kernel reduces the overhead of page distribution via
"affinity-on-next-touch" from 2518 ms (user-level approach) to 366 ms.
Currently, I am evaluating the performance of the patch with some other
benchmarks and test applications (stream benchmark, Jacobi solver, PDE
solver, ...).

I am very interested in your comments!

Stefan

[1] Noordergraaf, L., van der Pas, R.: Performance Experiences on Sun's
WildFire Prototype. In: Proceedings of the 1999 ACM/IEEE Conference on
Supercomputing, Portland, Oregon, USA (November 1999)

[2] Löf, H., Holmgren, S.: affinity-on-next-touch: Increasing the
Performance of an Industrial PDE Solver on a cc-NUMA System. In: Proceedings
of the 19th Annual International Conference on Supercomputing, Cambridge,
Massachusetts, USA (June 2005) 387–392

[3] Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.: Data
and Thread Affinity in OpenMP Programs. In: Proceedings of the 2008 Workshop
on Memory Access on Future Processors: A Solved Problem?, ACM International
Conference on Computing Frontiers, Ischia, Italy (May 2008) 377–384




* Re: [RFC PATCH 0/4]: affinity-on-next-touch
@ 2009-05-11 14:31 Samuel Thibault
  0 siblings, 0 replies; 44+ messages in thread
From: Samuel Thibault @ 2009-05-11 14:31 UTC (permalink / raw)
  To: Andi Kleen; +Cc: linux-kernel, Stefan Lankes

Andi writes:
> > With this patch, the kernel reduces the overhead of page distribution via
> > "affinity-on-next-touch" from 2518 ms (user-level approach) to 366 ms.
> 
> The interesting part is less how much faster it is compared to a user
> space implementation, but how much this migrate-on-touch approach
> helps in general compared to the already existing policies. Some hard
> numbers on that would be appreciated.

That is described in the papers that Stefan mentioned.  The problem is
that quite often it is very hard or even impossible to know in advance
which data should be migrated where, because, for instance, a sparse
matrix gets accessed by threads according to intermediate results.

> Note that for the OpenMP case old kernels sometimes had trouble because
> the threads tended not to be scheduled on the final target CPU
> during the first time slice, so the memory was often first-touched
> on the wrong node.

It's not only that kind of issue: in a lot of applications, memory is
initialized sequentially by the main thread (not only zeroing, but also
reading files, etc.).  Setting "next-touch" right after the sequential
initialization would then just work fine.
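
Concretely, the pattern is something like this (a sketch; read_data() and
process() are placeholders):

   /* Sequential initialization: all pages land on the master's node. */
   for (i = 0; i < SIZE; i++)
      array[i] = read_data(i);

   /* Arm "affinity-on-next-touch" once the data is in place. */
   madvise(array, sizeof(*array) * SIZE, MADV_ACCESS_LWP);

   /* Parallel phase: each page migrates to the node that actually uses it. */
   #pragma omp parallel for
   for (i = 0; i < SIZE; i++)
      process(array[i]);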

Samuel


Thread overview: 44+ messages
2009-05-11  8:27 [RFC PATCH 0/4]: affinity-on-next-touch Stefan Lankes
2009-05-11  8:48 ` Dieter an Mey
2009-05-11 13:22 ` Andi Kleen
2009-05-11 13:32   ` Brice Goglin
2009-05-11 14:54   ` Stefan Lankes
2009-05-11 14:54     ` Stefan Lankes
2009-05-11 16:37     ` Andi Kleen
2009-05-11 17:22       ` Stefan Lankes
2009-06-11 18:45   ` Stefan Lankes
2009-06-12 10:32     ` Andi Kleen
2009-06-12 11:46       ` Stefan Lankes
2009-06-12 12:30         ` Brice Goglin
2009-06-12 13:21           ` Stefan Lankes
2009-06-12 13:48           ` Stefan Lankes
2009-06-16  2:39         ` Lee Schermerhorn
2009-06-16 13:58           ` Stefan Lankes
2009-06-16 14:59             ` Lee Schermerhorn
2009-06-17  1:22               ` KAMEZAWA Hiroyuki
2009-06-17 12:02                 ` Lee Schermerhorn
2009-06-17  7:45               ` Stefan Lankes
2009-06-18  4:37                 ` Lee Schermerhorn
2009-06-18 19:04                   ` Lee Schermerhorn
2009-06-19 15:26                     ` Lee Schermerhorn
2009-06-19 15:41                       ` Balbir Singh
2009-06-19 15:59                         ` Lee Schermerhorn
2009-06-19 21:19                       ` Stefan Lankes
2009-06-22 12:34                   ` Brice Goglin
2009-06-22 14:24                     ` Lee Schermerhorn
2009-06-22 15:28                       ` Brice Goglin
2009-06-22 16:55                         ` Lee Schermerhorn
2009-06-22 17:06                           ` Brice Goglin
2009-06-22 17:59                             ` Stefan Lankes
2009-06-22 19:10                               ` Brice Goglin
2009-06-22 20:16                                 ` Stefan Lankes
2009-06-22 20:34                                   ` Brice Goglin
2009-06-22 14:32                     ` Stefan Lankes
2009-06-22 14:56                       ` Lee Schermerhorn
2009-06-22 15:42                         ` Stefan Lankes
2009-06-22 16:38                           ` Lee Schermerhorn
2009-06-16  2:25       ` Lee Schermerhorn
2009-06-20  7:24         ` Brice Goglin
2009-06-22 13:49           ` Lee Schermerhorn
2009-06-16  2:21     ` Lee Schermerhorn
2009-05-11 14:31 Samuel Thibault
