Hello,

I am supporting Stefan's activity from the parallel programmer's
perspective and I would be happy to provide further input, if needed.

Best regards
Dieter

Stefan Lankes wrote:
> Hello,
>
> I wrote a patch to support the adaptive data distribution strategy
> "affinity-on-next-touch" for NUMA architectures. The patch is in an
> early state and I am interested in your comments.
>
> The basic idea of "affinity-on-next-touch" is this: via some runtime
> mechanism, a user-level process activates "affinity-on-next-touch" for
> a certain region of its virtual memory space. Afterwards, each page in
> this region is migrated to the node which next tries to access it.
> Noordergraaf and van der Pas [1] have proposed extending the OpenMP
> standard to support this strategy. Since version 9, the
> "affinity-on-next-touch" mechanism has been available in Solaris, where
> it is triggered via the madvise system call. Löf and Holmgren [2] and
> Terboven et al. [3] have described their encouraging experiences with
> this implementation.
>
> Linux does not yet support "affinity-on-next-touch". Terboven et
> al. [3] have presented a user-level implementation of this strategy for
> Linux. To realize "affinity-on-next-touch" in user space, they protect
> a specific memory area from read and write accesses and install a
> signal handler to catch the resulting access violations. If a thread
> accesses a page in the protected memory area, the signal handler
> migrates the page to the node on which the access violation occurred.
> Afterwards, the signal handler clears the page protection and the
> interrupted thread is resumed.
>
> Unfortunately, the overhead of this solution is very high. For
> instance, distributing 512 MByte via "affinity-on-next-touch" takes the
> user-level solution 2518 ms on our dual-socket, quad-core Opteron 2376
> system with the current kernel (2.6.30-rc4).
> I evaluated this overhead with the following OpenMP code:
>
>   madvise(array, sizeof(int) * SIZE, MADV_ACCESS_LWP);
>   start = omp_get_wtime();
>   #pragma omp parallel for
>   for (j = 0; j < SIZE; j += pagesize / sizeof(int))
>       array[j]++;
>   end = omp_get_wtime();
>   printf("time: %lf ms\n", (end - start) * 1000.0);
>
> The benchmark uses 8 threads and each thread is bound to one core.
>
> I divide my patch into the following 4 parts:
>
> [Patch 1/4]: Extend the madvise system call with a new parameter,
> MADV_ACCESS_LWP (the same as used in Solaris). The specified memory
> area then uses "affinity-on-next-touch": madvise_access_lwp protects
> the memory area from read and write access. To avoid unnecessary list
> operations, the patch changes the permissions only in the page table
> entries and does not update the list of VMAs. Besides this, madvise
> also sets the new "untouched" bit in the page structure.
>
> [Patch 2/4]: The pte fault handler detects, via the new "untouched"
> bit in the page structure, that the page which the thread tried to
> access uses "affinity-on-next-touch". The kernel then reads the
> original permissions from the vm_area_struct, restores them in the
> page tables and migrates the page to the current node. To accelerate
> page migration, the patch avoids unnecessary calls to migrate_prep().
>
> [Patch 3/4]: If the "untouched" bit is set, mprotect is not permitted
> to change the permissions in the page table entry. With
> "affinity-on-next-touch", the access permissions are set by the pte
> fault handler instead.
>
> [Patch 4/4]: This part of the patch adds some counters to detect
> migration errors and publishes them via /proc/vmstat. Besides this,
> the Kconfig file is extended with the parameter
> CONFIG_AFFINITY_ON_NEXT_TOUCH.
>
> With this patch, the kernel reduces the overhead of page distribution
> via "affinity-on-next-touch" from 2518 ms (user-level approach) to
> 366 ms.
> Currently, I'm evaluating the performance of the patch with some other
> benchmarks and test applications (stream benchmark, Jacobi solver, PDE
> solver, ...).
>
> I am very interested in your comments!
>
> Stefan
>
> [1] Noordergraaf, L., van der Pas, R.: Performance Experiences on
> Sun's Wildfire Prototype. In: Proceedings of the 1999 ACM/IEEE
> Conference on Supercomputing, Portland, Oregon, USA (November 1999)
>
> [2] Löf, H., Holmgren, S.: affinity-on-next-touch: Increasing the
> Performance of an Industrial PDE Solver on a cc-NUMA System. In:
> Proceedings of the 19th Annual International Conference on
> Supercomputing, Cambridge, Massachusetts, USA (June 2005) 387–392
>
> [3] Terboven, C., an Mey, D., Schmidl, D., Jin, H., Reichstein, T.:
> Data and Thread Affinity in OpenMP Programs. In: Proceedings of the
> 2008 Workshop on Memory Access on future Processors: A solved
> problem?, ACM International Conference on Computing Frontiers, Ischia,
> Italy (May 2008) 377–384

--
Dipl.-Math. Dieter an Mey, HPC Team Lead
RWTH Aachen University, Center for Computing and Communication
Rechen- und Kommunikationszentrum der RWTH Aachen
Seffenter Weg 23, D 52074 Aachen (Germany)
Phone: +49 241 80 24377 - Fax/UMS: +49 241 80 624377
mailto:anmey@rz.rwth-aachen.de
http://www.rz.rwth-aachen.de