From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S272260AbTG3Vup (ORCPT ); Wed, 30 Jul 2003 17:50:45 -0400 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S272264AbTG3Vup (ORCPT ); Wed, 30 Jul 2003 17:50:45 -0400 Received: from w197.z066088144.sjc-ca.dsl.cnc.net ([66.88.144.197]:49384 "EHLO kali.zeta-soft.com") by vger.kernel.org with ESMTP id S272260AbTG3Vuj (ORCPT ); Wed, 30 Jul 2003 17:50:39 -0400 From: "Scott L. Burson" MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Date: Wed, 30 Jul 2003 14:50:39 -0700 (PDT) To: linux-kernel@vger.kernel.org CC: Mathieu.Malaterre@creatis.insa-lyon.fr CC: hw.support@amd.com Subject: Spinlock performance on Athlon MP (2.4) X-Mailer: VM 6.43 under 20.4 "Emerald" XEmacs Lucid Message-ID: <16168.12542.850631.294911@kali.zeta-soft.com> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org Hi, In March of last year I briefly corresponded with this list about a performance problem that shows up on the dual Athlon MP running a 2.4 kernel (apparently any 2.4 kernel, though I happen to be using 2.4.18). I don't have a solution, but I do have some more information. First, and probably the reason you haven't heard more complaints about the problem, its severity is evidently dependent on the size of main memory. At 512MB it doesn't seem to be much of a problem (right, Mathieu?). At 2.5GB, which is what I have, it can be quite serious. For instance, if I start two `find' processes at the roots of different filesystems, the system can spend (according to `top') 95% - 98% of its time in the kernel. It even gets worse than that, but `top' stops updating -- in fact, the system can seem completely frozen, but it does recover eventually. Stopping or killing one of the `find' processes brings it back fairly quickly, though it can take a while to accomplish that. The dual Pentium, as you are probably aware, has no trace of the problem. What I think is going on (after doing some profiling, some lockmeter measurements, and other stuff I'll describe below) is simply that the spinlock handoff time is much longer on the Athlon than on the Pentium. So any spinlock contention hurts much worse on the Athlon. The kernel spinlock loop includes the Pentium 4 PAUSE instruction (aka "REP NOP"). The Intel Pentium instruction set reference manual has this to say about PAUSE: When executing a spin-wait loop, a Pentium 4 or Intel Xeon processor suffers a severe performance penalty when exiting the loop because it detects a possible memory order violation. The PAUSE instruction provides a hint to the processor that the code sequence is a spin-wait loop. The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. I don't know enough about cache coherency protocols to know what a memory order violation is or what it might cost in nanoseconds, but clearly the Pentium had a bad enough problem that Intel decided to fix it. I speculate that the Athlon has the same problem. I urge AMD to do the following: (1) Figure out whether this is in fact the problem. (2) Figure out whether a different instruction sequence for the spinlock inner loop would work better. (The existing sequence, for those at AMD who may not have the kernel source handy, is simply CMPB 0, [lock address]; PAUSE; JLE [address of CMPB].) (3) Make sure the Hammer doesn't have the same problem -- or if it does, fix it! I assure you it will show up in the field. Meanwhile, supposing that (a) this is the problem and (b) there's no alternative instruction sequence that cures it, the question remains what to do about it. One of the first things I found when profiling was that the worst of the problem seemed to be in the slab allocator. I made some changes to `mm/slab.c' in 2.4.18 intended to reduce spinlock contention therein. I believe they were successful (though I should do some before-and-after runs with lockmeter to verify this), and I think they ameliorate the problem somewhat. However, it appears that there are other sources of contention that show up once that one is fixed (the next one appears, according to the profiler, to be somewhere in `shrink_cache' in `mm/vmscan.c'). (Sorry I never released my patch to `slab.c'; since it wasn't a complete cure, and I didn't understand why not, and didn't have any more time to try to figure it out, I didn't send it out. But it is stable; I've been using it for over a year.) I could continue to try to remove sources of contention in 2.4, but of course another interesting question is how much it would help simply to switch to 2.6. I don't know that the 2.6 beta is quite to the point where I'm ready to try it; and anyway 2.6 is probably still a few months from general use. If it's thought that 2.6 reduces contention generally, it might be worth a shot; but on the other hand it probably would still be worthwhile to fix this in 2.4, if I can. Comments? -- Scott