On Tue, 2010-03-23 at 10:22 -0400, Andrew Morton wrote:
> (switched to email.  Please respond via emailed reply-to-all, not via the
> bugzilla web interface).
>
> On Tue, 23 Mar 2010 16:13:25 GMT bugzilla-daemon@bugzilla.kernel.org wrote:
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=15618
> >
> >            Summary: 2.6.18->2.6.32->2.6.33 huge regression in performance
> >            Product: Process Management
> >            Version: 2.5
> >     Kernel Version: 2.6.32
> >           Platform: All
> >         OS/Version: Linux
> >               Tree: Mainline
> >             Status: NEW
> >           Severity: high
> >           Priority: P1
> >          Component: Other
> >         AssignedTo: process_other@kernel-bugs.osdl.org
> >         ReportedBy: ant.starikov@gmail.com
> >         Regression: No
> >
> >
> > We have benchmarked some multithreaded code here on 16-core/4-way opteron 8356
> > host on number of kernels (see below) and found strange results.
> > Up to 8 threads we didn't see any noticeable differences in performance, but
> > starting from 9 threads performance diverges substantially. I provide here
> > results for 14 threads
>
> lolz.  Catastrophic meltdown.  Thanks for doing all that work - at a
> guess I'd say it's mmap_sem.  Perhaps with some assist from the CPU
> scheduler.
>
> If you change the config to set CONFIG_RWSEM_GENERIC_SPINLOCK=n,
> CONFIG_RWSEM_XCHGADD_ALGORITHM=y does it help?
>
> Anyway, there's a testcase in bugzilla and it looks like we got us some
> work to do.

I had an "opportunity" to investigate page fault behavior on 2.6.18+
[RHEL5.4] on an 8-socket Istanbul system earlier this year.  When I saw
this mail, I collected up the data I had from that adventure and ran
additional tests on 2.6.33 and 2.6.34-rc1.

I have attached plots showing "per node" and "system wide" page fault
scalability.  The per node plot [#1] shows the page fault rate of 1 to 6
[nr_cores_per_socket] tasks [processes] and threads faulting in a fixed
GB/task at the same time on a single socket.  The system wide plot [#3]
shows 1 to 48 [nr_sockets * nr_cores_per_socket] tasks and threads, again
faulting in a fixed GB/task...  For the latter test, I load one core per
socket at a time, then add the 2nd core per socket, ...

In all cases, the individual tasks/threads are fork()ed/pthread_create()d
by a parent bound to the cpu where they'll run, to obtain node-local
kernel data structures.  The tests run with SCHED_FIFO.  I plot both
"faults per wall clock second" (the aggregate rate) and "faults per cpu
second" (the normalized rate).

The per node scalability doesn't look all that different across the 3
releases, especially the faults per cpu second curves.  However, in the
system wide multi-threaded tests, 2.6.33 is an anomaly compared to both
2.6.18+ and 2.6.34-rc1.  The 2.6.18+ and 2.6.34-rc1 multi-threaded tests
show a lot of noise and, of course, a much lower fault rate relative to
the multi-task tests.  I aborted the 2.6.33 system wide multi-threaded
test at 32 threads because it was just taking too long.

Unfortunately, with this many curves, the legends obscure much of the
plot.  So, rather than bloat this message any more, I've packaged up the
raw data along with plots with and without legends and placed the tarball
here:

	http://free.linux.hp.com/~lts/Pft/

That directory also contains the source for the version of the pft test
used, along with the scripts used to run the tests and plot the results.
Note that some manual editing of the "plot annotations" in the raw data
was required to generate several different plots from the same data.
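For anyone who doesn't want to dig through the tarball, the core of the
per-thread setup described above looks roughly like the sketch below.
This is NOT the actual pft source; it's a simplified illustration
assuming a Linux/glibc environment, and the cpu number and GB/task value
are purely illustrative.  The parent binds itself to the worker's cpu
before pthread_create() so the child's kernel data structures come from
the local node, the worker switches itself to SCHED_FIFO, and then it
touches one byte per page of a fixed-size anonymous mapping to take the
write faults:

	/* Illustrative sketch only -- not the pft test itself. */
	#define _GNU_SOURCE
	#include <pthread.h>
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define FAULT_BYTES	(1UL << 30)	/* 1GB/task -- illustrative */

	static void *fault_in(void *arg)
	{
		struct sched_param sp = { .sched_priority = 1 };
		long pagesize = sysconf(_SC_PAGESIZE);
		unsigned long off;
		char *mem;
		int err;

		/* run the worker SCHED_FIFO, as in the tests above */
		err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp);
		if (err)
			fprintf(stderr, "pthread_setschedparam: %s\n",
				strerror(err));

		mem = mmap(NULL, FAULT_BYTES, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mem == MAP_FAILED) {
			perror("mmap");
			return NULL;
		}

		/* touch one byte per page to take the write faults */
		for (off = 0; off < FAULT_BYTES; off += pagesize)
			mem[off] = 1;

		munmap(mem, FAULT_BYTES);
		return NULL;
	}

	int main(void)
	{
		cpu_set_t mask;
		pthread_t tid;
		int cpu = 0;	/* illustrative target cpu */

		/* bind the parent to the cpu where the worker will run */
		CPU_ZERO(&mask);
		CPU_SET(cpu, &mask);
		if (sched_setaffinity(0, sizeof(mask), &mask))
			perror("sched_setaffinity");

		/* the new thread inherits the parent's affinity mask */
		if (pthread_create(&tid, NULL, fault_in, NULL)) {
			perror("pthread_create");
			return 1;
		}
		pthread_join(tid, NULL);
		return 0;
	}

The real test, of course, runs many such workers in parallel, times the
fault-in phase, and reports the aggregate and per-cpu-second rates.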
The pft test is a highly, uh, "evolved" version of pft.c that Christoph
Lameter pointed me at a few years ago.  This version requires a patched
libnuma with the v2 API.  The required patch to the numactl-2.0.3 package
is included in the test tarball.  [I've contacted Cliff about getting the
patch into 2.0.4.]

Lee