* Major regression on hackbench with SLUB @ 2007-12-07 17:50 Steven Rostedt 2007-12-07 17:55 ` Ingo Molnar ` (2 more replies) 0 siblings, 3 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-07 17:50 UTC (permalink / raw) To: LKML Cc: Christoph Lameter, Ingo Molnar, Linus Torvalds, Peter Zijlstra, Christoph Hellwig Hi, I've been doing some benchmarks for RT task migrations and I stumbled over this regression. I first thought it was a regression with CFS and started doing a git bisect. But I later found that the "good" kernel also failed. Then I started looking at the config options and discovered that SLUB has become the default. When I put SLAB back, the regression went away, even on the latest git tree. So I ran the hackbench benchmarks with the latest git commit (f194d132e4971111f85c18c96067acffb13cee6d at the time of this writing). And put up the results on my web page. http://people.redhat.com/srostedt/slab/ I ran Ingo's cfs-debug-info.sh and have the results of that on the web page. That script records lots of good information to see what kind of machine I ran this on. This is a 64-way box (yes, lots of CPUs!). The hackbench code and the script I used to run and record the results are also present. I ran "hackbench 90" 20 times, 10 times as SCHED_OTHER and 10 times as SCHED_FIFO (chrt -f 10 hackbench 90). The graph shows both runs (min, max and average). The versions that start with "rt:" (although the graph somehow truncated it to just "t:") are the SCHED_FIFO runs. (click on the graph to see a bigger version) The regression is that hackbench slows down tremendously. It goes from running just under 2 seconds to running around 14 seconds. Hackbench seems to show this regression the most. In my tests I didn't see much change with kernel builds and such, but the focus was on scheduling, not memory management. I'll run my kernel tests next for both SLAB and SLUB and see if there's any difference there.
But this regression might be a reason to not have SLUB as default for now. At least until we find out why this is affected so badly. -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt @ 2007-12-07 17:55 ` Ingo Molnar 2007-12-07 17:59 ` Linus Torvalds 2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 2 siblings, 0 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-07 17:55 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Christoph Lameter, Linus Torvalds, Peter Zijlstra, Christoph Hellwig * Steven Rostedt <rostedt@goodmis.org> wrote: > The regression is that hackbench slows down tremendously. It goes from > running just under 2 seconds to running around 14 seconds. ouch ... it might make sense to check the SLUB patches from -mm ontop of vanilla as well - while that might be of little comfort as far as 2.6.24 goes, it at least gives us the info whether this is something that has been fixed in the latest SLUB code. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt 2007-12-07 17:55 ` Ingo Molnar @ 2007-12-07 17:59 ` Linus Torvalds 2007-12-07 18:36 ` Linus Torvalds 2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 2 siblings, 1 reply; 93+ messages in thread From: Linus Torvalds @ 2007-12-07 17:59 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig On Fri, 7 Dec 2007, Steven Rostedt wrote: > > The regression is that hackbench slows down tremendously. It goes from > running just under 2 seconds to running around 14 seconds. Can you do one run with oprofile, and see exactly where the cost is? It should hopefully be pretty darn obvious, considering your timing. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 17:59 ` Linus Torvalds @ 2007-12-07 18:36 ` Linus Torvalds 2007-12-07 18:58 ` Steven Rostedt ` (3 more replies) 0 siblings, 4 replies; 93+ messages in thread From: Linus Torvalds @ 2007-12-07 18:36 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig On Fri, 7 Dec 2007, Linus Torvalds wrote:

> Can you do one run with oprofile, and see exactly where the cost is? It
> should hopefully be pretty darn obvious, considering your timing.

So I checked hackbench on my machine with oprofile, and while I'm not seeing anything that says "ten times slower", I do see it hitting the slow path all the time, and __slab_alloc() is 15% of the profile. With __slab_free() being another 8-9%. I assume that with many CPUs, those get horrendously worse, with cache thrashing. The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but that may not be the one that causes problems in a 64-cpu setup, so it would be good to have that verified. Anyway, it's unclear whether it goes into the slow-path because the freelist is just always empty, or whether it hits the

... || !node_match(c, node)

case which can trigger on NUMA. That's another potential explanation of why you'd see such a *huge* slowdown (ie if the whole node-match thing doesn't work out, slub just never gets the fast-case at all). That said, the number of places that actually pass a specific node to slub is very limited, so I suspect it's not the node matching. But just disabling that test in slab_alloc() might be one thing to test. [ The whole node match thing is likely totally bogus. I suspect we pay *more* in trying to match nodes than we'd ever likely pay in just returning the wrong node for an allocation, but that's neither here nor there ] But yeah, I'm not entirely sure SLUB is working out. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
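The fast-path/slow-path decision under discussion can be sketched roughly like this. This is a hedged, simplified model, not the actual 2.6.24 mm/slub.c code; the struct layout and the helper `fast_path_ok` are illustrative stand-ins:

```c
#include <stddef.h>

/* Illustrative stand-ins for the real SLUB per-cpu state; field
 * names and layout are simplified, not copied from mm/slub.c. */
struct kmem_cache_cpu {
    void **freelist;   /* objects available to the lockless fast path */
    int    node;       /* NUMA node the current cpu slab came from */
};

/* -1 plays the role of "no preferred node" (NUMA_NO_NODE). */
static int node_match(struct kmem_cache_cpu *c, int node)
{
    return node == -1 || c->node == node;
}

/* Returns 1 if the allocation can stay on the fast path, 0 if it
 * must fall into the slow path (__slab_alloc in the real code). */
static int fast_path_ok(struct kmem_cache_cpu *c, int node)
{
    return c->freelist != NULL && node_match(c, node);
}
```

Either failing condition drops the allocation into the contended slow path, which is why a freelist that is always empty, or a node that never matches, would show up so dramatically in the profiles.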
* Re: Major regression on hackbench with SLUB 2007-12-07 18:36 ` Linus Torvalds @ 2007-12-07 18:58 ` Steven Rostedt 2007-12-08 22:16 ` Steven Rostedt ` (2 subsequent siblings) 3 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-07 18:58 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig [ Note: I'm currently having problems with my email: http://www.domaindirect.com/network.html I'm in the process of setting up my own personal email server. ] > > Can you do one run with oprofile, and see exactly where the cost is? It > > should hopefully be pretty darn obvious, considering your timing. I kicked off my kernel tests which will take a few hours. I run "make -j256" (4*nr_CPUS) 10 times at SCHED_OTHER and 10 times at SCHED_FIFO (chrt -f 10) for both SLUB and SLAB. This is automated, so I need to wait for it to finish before I can continue other tests. > The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but > that may not be the one that causes problems in a 64-cpu setup, so it > would be good to have that verified. I'll run oprofile as soon as the kernel build test is done. > case which can trigger on NUMA. That's another potential explanation of > why you'd see such a *huge* slowdown (ie if the whole node-match thing > doesn't work out, slub just never gets the fast-case at all). That said, > the number of places that actually pass a specific node to slub is very > limited, so I suspect it's not the node matching. But just disabling that > test in slab_alloc() might be one thing to test. I'll try that after the oprofile. Thanks, -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 18:36 ` Linus Torvalds 2007-12-07 18:58 ` Steven Rostedt @ 2007-12-08 22:16 ` Steven Rostedt 2007-12-10 7:38 ` Björn Steinbrink 2007-12-09 0:28 ` Steven Rostedt 2007-12-13 22:11 ` Christoph Lameter 3 siblings, 1 reply; 93+ messages in thread From: Steven Rostedt @ 2007-12-08 22:16 UTC (permalink / raw) To: Linus Torvalds Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig Hi Linus,

> On Fri, 7 Dec 2007, Linus Torvalds wrote:
> >
> > Can you do one run with oprofile, and see exactly where the cost is? It
> > should hopefully be pretty darn obvious, considering your timing.

The results are here:

http://people.redhat.com/srostedt/slub/results/slab.op
http://people.redhat.com/srostedt/slub/results/slub.op

> Anyway, it's unclear whether the reason it goes into the slow-path because
> the freelist is just always empty, or whether it hits the
>
> ... || !node_match(c, node)

I did two things. First I tried making node_match always return true, the second was just commenting out the above check. They both had pretty much the exact same results. It slowed down by a factor of two! Instead of taking 15 seconds to complete, both took 30 seconds to complete. Here's the oprofiles of those runs. node_match return 1: http://people.redhat.com/srostedt/slub/results/slub-nonode.op comment out node_match check: http://people.redhat.com/srostedt/slub/results/slub-nochecknode.op

> [ The whole node match thing is likely totally bogus. I suspect we pay
> *more* in trying to match nodes than we'd ever likely pay in just
> returning the wrong node for an allocation, but that's neither here nor
> there ]

I don't think it's bogus, given the result that removing it slows things down by a factor of two. I'm wondering if there's not some other cache thrashing happening somewhere else in the slub code. I'll start adding some stats in the code to see where the congestion might lie. I'll look into this on Monday.
Seems that both SLAB and SLUB run kernel compiles the same. Here's the results of my compile test. (make -j256 and chrt -f 10 make -j256) http://people.redhat.com/srostedt/slub/ -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-08 22:16 ` Steven Rostedt @ 2007-12-10 7:38 ` Björn Steinbrink 2007-12-10 14:00 ` Steven Rostedt 0 siblings, 1 reply; 93+ messages in thread From: Björn Steinbrink @ 2007-12-10 7:38 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig On 2007.12.08 17:16:24 -0500, Steven Rostedt wrote:

> Hi Linus,
>
> > On Fri, 7 Dec 2007, Linus Torvalds wrote:
> > >
> > > Can you do one run with oprofile, and see exactly where the cost is? It
> > > should hopefully be pretty darn obvious, considering your timing.
>
> The results are here:
>
> http://people.redhat.com/srostedt/slub/results/slab.op
> http://people.redhat.com/srostedt/slub/results/slub.op

Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc a lot. I wonder if !node_match triggers too often. We always start with the per cpu slab; if that one is on the wrong node, you'll always hit that "another_slab" path. After searching for way too long (given that I have no clue about that stuff anyway and just read the code out of curiosity), I noticed that the cpu_to_node stuff on x86_64 seems to be initialized to 0xff (arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1], which, AFAICT, shows that the per cpu slab setup is done _before_ cpu_to_node is correctly setup. That would lead to the per cpu slabs all having node == 0xff, which looks pretty bad. Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first time, so this might be total bull ;-) Björn [1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-10/msg04648.html ^ permalink raw reply [flat|nested] 93+ messages in thread
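Björn's init-ordering theory can be illustrated with a small model. All names here are hypothetical; the only assumption taken from the thread is that the cpu-to-node map starts out pre-filled with 0xff (as he observed in arch/x86/mm/numa_64.c) and is corrected later:

```c
#define NR_CPUS 4
#define NUMA_NO_NODE (-1)

/* Pre-filled with a bogus value until NUMA discovery runs,
 * mirroring the 0xff initializer Björn points at. */
static int cpu_to_node_map[NR_CPUS] = { 0xff, 0xff, 0xff, 0xff };

struct cpu_slab { int node; };
static struct cpu_slab cpu_slabs[NR_CPUS];

/* If the per cpu slabs are set up before the map is corrected,
 * every cpu slab gets stamped with node 0xff. */
static void early_slab_init(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        cpu_slabs[cpu].node = cpu_to_node_map[cpu];
}

/* Any allocation that asks for a real node then never matches,
 * so such allocations never take the fast path. */
static int node_match(struct cpu_slab *c, int node)
{
    return node == NUMA_NO_NODE || c->node == node;
}
```

The hazard is purely one of ordering: the same `early_slab_init()` produces correct state if it runs after the map has been populated.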
* Re: Major regression on hackbench with SLUB 2007-12-10 7:38 ` Björn Steinbrink @ 2007-12-10 14:00 ` Steven Rostedt 0 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-10 14:00 UTC (permalink / raw) To: Björn Steinbrink Cc: Linus Torvalds, LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig On Mon, 10 Dec 2007, Björn Steinbrink wrote:

> > The results are here:
> >
> > http://people.redhat.com/srostedt/slub/results/slab.op
> > http://people.redhat.com/srostedt/slub/results/slub.op
>
> Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc
> a lot. I wonder if !node_match triggers too often. We always start with
> the per cpu slab, if that one is on the wrong node, you'll always hit
> that "another_slab" path.

Well, I commented out the node_match part and it got 100% worse. It took 30 seconds to complete.

> After searching for way too long (given that I have no clue about that
> stuff anyway and just read the code out of curiosity), I noticed that
> the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
> (arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
> which, AFAICT, shows that the per cpu slab setup is done _before_
> cpu_to_node is correctly setup. That would lead to the per cpu slabs all
> having node == 0xff, which looks pretty bad.

I didn't check to see if the internal set up of the node is correct though. I can put in some debug to see what I get.

> Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
> time, so this might be total bull ;-)

Well, I'm a newbie on NUMA stuff too. I just got lucky enough to be able to reserve this box ;-) -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 18:36 ` Linus Torvalds 2007-12-07 18:58 ` Steven Rostedt 2007-12-08 22:16 ` Steven Rostedt @ 2007-12-09 0:28 ` Steven Rostedt 2007-12-13 22:11 ` Christoph Lameter 3 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-09 0:28 UTC (permalink / raw) To: Ingo Molnar Cc: LKML, Linus Torvalds, Christoph Lameter, Peter Zijlstra, Christoph Hellwig Ingo, Due to email problems, I never got your reply on this thread (nor did I receive Linus's first reply), but I saw your reply when reading this thread on lkml.org. Anyway, I booted 2.6.24-rc4-mm1 with the same config, accepting the defaults for the new config options, just as I did with the git version I checked. And it gave me an average of 12.5 second runs. Here's the oprofile for it: http://people.redhat.com/srostedt/slub/results/slub-mm.op -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB 2007-12-07 18:36 ` Linus Torvalds ` (2 preceding siblings ...) 2007-12-09 0:28 ` Steven Rostedt @ 2007-12-13 22:11 ` Christoph Lameter 3 siblings, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-13 22:11 UTC (permalink / raw) To: Linus Torvalds Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Christoph Hellwig On Fri, 7 Dec 2007, Linus Torvalds wrote: > The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but > that may not be the one that causes problems in a 64-cpu setup, so it > would be good to have that verified. Hmmmm.. This would indicate lock contention on a slab. > [ The whole node match thing is likely totally bogus. I suspect we pay > *more* in trying to match nodes than we'd ever likely pay in just > returning the wrong node for an allocation, but that's neither here nor > there ] Node match is necessary in order to make the allocator able to get memory for the node that the caller requested. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt 2007-12-07 17:55 ` Ingo Molnar 2007-12-07 17:59 ` Linus Torvalds @ 2007-12-11 14:33 ` Ingo Molnar 2007-12-13 22:22 ` Christoph Lameter 2 siblings, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2007-12-11 14:33 UTC (permalink / raw) To: Steven Rostedt Cc: LKML, Christoph Lameter, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig

* Steven Rostedt <rostedt@goodmis.org> wrote:

> Hackbench seems to show this regression the most. In my tests I didn't
> see much change with kernel builds and such, but the focus was on
> scheduling not memory management. I'll run my kernel tests next for
> both SLAB and SLUB and see if there's any difference there.

i just ran various benchmarks on an 8-way (8x 700 MHz Xeon, 4GB RAM):

AVG           v2.6.24.slab  v2.6.24.slub  [ smaller is better ]
---------------------------------------------------------------
mmap:              1052.66       1049.33  (  0%)
ctx-2:                4.32          4.30  (  0%)
select:              41.95         43.69  (  4%)
proc-exec:          394.45        391.92  (  0%)
hackbench-10:         1.12          2.99  (166%)
hackbench-20:         2.04          6.67  (226%)
hackbench-50:         5.03         17.50  (247%)

and hackbench overhead stands out, by a huge margin. Other stuff is within measurement noise. Neither SLUB nor SLAB debugging was turned on, all other debugging options were off too. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar @ 2007-12-13 22:22 ` Christoph Lameter 2007-12-14 4:24 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-13 22:22 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig On Tue, 11 Dec 2007, Ingo Molnar wrote:

> hackbench-10: 1.12 2.99 (166%)
> hackbench-20: 2.04 6.67 (226%)
> hackbench-50: 5.03 17.50 (247%)
>
> and hackbench overhead stands out, by a huge margin. Other stuff is
> within measurement noise. Neither SLUB nor SLAB debugging was turned on,
> all other debugging options were off too.

I just came back from vacation. The non-linear growth of the regression indicates lock contention somewhere. It must be something special that is triggered by hackbench. We have had a regression like that in 2.6.24 before, due to order-1 allocs in the network layer. The .24 modifications for SLUB introduced page allocator pass-through for allocations > PAGE_SIZE/2. Order-1 allocs could serialize on zone locks. ^ permalink raw reply [flat|nested] 93+ messages in thread
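The pass-through behavior Christoph mentions can be sketched as a size check at allocation time. This is a rough model with illustrative constants and function names, not the actual mm/slub.c logic:

```c
#include <stddef.h>

#define PAGE_SIZE 4096u

/* Sketch: requests larger than half a page skip the slab caches
 * and go straight to the page allocator. */
static int page_alloc_passthrough(size_t size)
{
    return size > PAGE_SIZE / 2;
}

/* Page order such a request needs: order 0 = 1 page, order 1 = 2
 * pages, and so on.  Order >= 1 requests are the ones that could
 * serialize on the zone lock, per the regression described above. */
static unsigned int alloc_order(size_t size)
{
    unsigned int order = 0;

    while (((size_t)PAGE_SIZE << order) < size)
        order++;
    return order;
}
```

So an 8K allocation from the network layer becomes an order-1 page allocator call instead of a per-cpu slab operation, moving the cost from a local freelist to a shared zone lock.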
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-13 22:22 ` Christoph Lameter @ 2007-12-14 4:24 ` Christoph Lameter 2007-12-14 6:15 ` Steven Rostedt 2007-12-21 12:09 ` Ingo Molnar 0 siblings, 2 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-14 4:24 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig Hmmmm... Some tests here on an 8p 8G machine:

SLAB
N=10 Time: 0.341
N=20 Time: 0.605
N=50 Time: 1.487

SLUB
N=10 Time: 0.675
N=20 Time: 1.434
N=50 Time: 3.996

So it's a factor of 2 for untuned SLUB. Looking at hackbench: this is a test that allocates objects that are then consumed by N cpus that then return them. The allocating processor can allocate from the per cpu slab (the freelist is copied to a per cpu structure when it is activated), avoiding touching the page struct. However, the freeing processors must update the freelist of the slab page directly. So they all contend for the cacheline of the page struct. Since we free directly, there is the chance of lots of contention between the N cpus that free the objects. This is particularly high in a synthetic benchmark like this. However, if the object to be freed is still in a cpu slab then the freeing action will reduce the taking of the listlock. So we can actually decrease the overhead by increasing the slab page size. In an extreme case (boot with slub_min_order=9 to get huge-page-sized slabs) SLUB can win against SLAB:

N=10 Time: 0.338 Minimally faster
N=20 Time: 0.560 10% faster
N=50 Time: 1.353 15% faster

Not sure how to get that mainstream yet, but there certainly is a way to get there. Need to get an idea how to reduce listlock contention in the remote free case. ^ permalink raw reply [flat|nested] 93+ messages in thread
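The asymmetry Christoph describes, lockless allocation from the cpu slab versus locked frees back to the slab page, can be modeled roughly as follows. This is a single-threaded sketch with made-up names; the acquisition counter merely stands in for the slab lock that N freeing cpus would really be contending on:

```c
#include <stddef.h>

struct slab_page {
    unsigned long lock_acquisitions;  /* times the slab lock was taken */
    void *freelist;                   /* objects freed back to the page */
};

struct cpu_slab {
    void *freelist;  /* private to the owning cpu: popped without locking */
};

static void slab_lock(struct slab_page *p)   { p->lock_acquisitions++; }
static void slab_unlock(struct slab_page *p) { (void)p; }

/* The allocating cpu just pops its private list -- no shared state. */
static void *local_alloc(struct cpu_slab *c)
{
    void *obj = c->freelist;

    if (obj)
        c->freelist = *(void **)obj;  /* next pointer lives in the object */
    return obj;
}

/* A remote free must push onto the page freelist under the slab lock,
 * so every freeing cpu hits the same lock and page-struct cacheline. */
static void remote_free(struct slab_page *p, void *obj)
{
    slab_lock(p);
    *(void **)obj = p->freelist;
    p->freelist = obj;
    slab_unlock(p);
}
```

With many cpus calling `remote_free` against the same page, the lock in that path is exactly the serialization point the thread is discussing, while the allocator's pops remain free of shared-state traffic.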
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-14 4:24 ` Christoph Lameter @ 2007-12-14 6:15 ` Steven Rostedt 2007-12-21 12:09 ` Ingo Molnar 0 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-14 6:15 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig Christoph, Welcome back from your vacation! :-) On Thu, 13 Dec 2007, Christoph Lameter wrote:

> In an extreme case (boot with slub_min_order=9 to get huge page sized
> slabs) SLUB can win against SLAB:
>
> N=10 Time: 0.338 Minimally faster
> N=20 Time: 0.560 10% faster
> N=50 Time: 1.353 15% faster

I booted back to slub with slub_min_order=9 on the 64-way and you are indeed right about this.

# tests/hackbench 90
Time: 2.109
# chrt -f 10 tests/hackbench 90
Time: 3.775
# tests/hackbench 90
Time: 1.583
# chrt -f 10 tests/hackbench 90
Time: 1.833
# tests/hackbench 90
Time: 1.601
# chrt -f 10 tests/hackbench 90
Time: 3.078
# tests/hackbench 90
Time: 1.321
# chrt -f 10 tests/hackbench 90
Time: 1.933

A flux between 1.321 and 3.775 (*) (the rt runs were higher). But still, good job at narrowing this down. Now the hard part. Figuring out a way that will be nice to all users. -- Steve (*) The previous runs (without the command line option) were 13 seconds or so. ^ permalink raw reply [flat|nested] 93+ messages in thread
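The arithmetic behind the slub_min_order=9 effect can be sketched as follows (4K pages assumed, per-slab metadata ignored, so this is an upper-bound estimate): the larger the slab page, the more allocations are satisfied before the cpu slab is given up and the shared locks come into play.

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Objects that fit in one slab of the given page order
 * (metadata overhead ignored for the sake of the estimate). */
static unsigned long objs_per_slab(unsigned int order, size_t object_size)
{
    return (PAGE_SIZE << order) / object_size;
}
```

For 128-byte objects an order-0 slab holds 32 objects while an order-9 (2MB) slab holds 16384, so the contended slab hand-off happens roughly 1/512th as often; for 8-byte objects even an order-0 slab already holds 512.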
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-14 4:24 ` Christoph Lameter 2007-12-14 6:15 ` Steven Rostedt @ 2007-12-21 12:09 ` Ingo Molnar 2007-12-21 12:26 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 12:09 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

* Christoph Lameter <clameter@sgi.com> wrote:

> Hmmmm... Some tests here on an 8p 8G machine:

> In an extreme case (boot with slub_min_order=9 to get huge page sized
> slabs) SLUB can win against SLAB:
>
> N=10 Time: 0.338 Minimally faster
> N=20 Time: 0.560 10% faster
> N=50 Time: 1.353 15% faster

what's up with this regression? There's been absolutely no activity about it in the last 8 days: upstream still regresses, -mm still regresses and there are no patches posted for testing. being able to utilize order-0 pages was supposed to be one of the big plusses of SLUB, so booting with _2MB_ sized slabs cannot seriously be the "fix", right? Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 12:09 ` Ingo Molnar @ 2007-12-21 12:26 ` Ingo Molnar 2007-12-21 16:26 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 12:26 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

* Ingo Molnar <mingo@elte.hu> wrote:

> * Christoph Lameter <clameter@sgi.com> wrote:
>
> > Hmmmm... Some tests here on an 8p 8G machine:
>
> > In an extreme case (boot with slub_min_order=9 to get huge page sized
> > slabs) SLUB can win against SLAB:
> >
> > N=10 Time: 0.338 Minimally faster
> > N=20 Time: 0.560 10% faster
> > N=50 Time: 1.353 15% faster
>
> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.
>
> being able to utilize order-0 pages was supposed to be one of the big
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously
> the "fix", right?

and this is not the only regression: http://lkml.org/lkml/2007/10/4/290 _6%_ TPC-C regression. That's _a lot_ in TPC-C terms. and just like in this case there were very clear profiles posted. I proffer, reading back the whole thread, that if you fix hackbench you have fixed TPC-C as well. So i believe you should either send some sensible fixes _NOW_, or admit that the "no queues" NIH nonsense of SLUB doesn't work and do an edible, incremental patchset against SLAB to bring in the debuggability features of SLUB without killing SLAB's performance. (And fix the NUMA alien cache problem along the lines suggested before - perhaps initially by making 'noaliencache' the default bootup option.) And we obviously must revert the default in 2.6.24 to SLAB as well.
Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 12:26 ` Ingo Molnar @ 2007-12-21 16:26 ` Christoph Lameter 2007-12-21 16:33 ` Ingo Molnar 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 16:26 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Ingo Molnar wrote:

> and this is not the only regression:
>
> http://lkml.org/lkml/2007/10/4/290
>
> _6%_ TPC-C regression. That's _a lot_ in TPC-C terms.
>
> and just like in this case there were very clear profiles posted. I
> proffer, reading back the whole thread, that if you fix hackbench you
> have fixed TPC-C as well.

There are patches pending to address these issues. AFAICT Intel is testing if the regression is still there. There is no way for me to verify what is going on there, and there is the constant difficulty of getting detailed information about what is going on at Intel. Every couple of months I get a result from that test. It's a really crappy situation where a lot of confusing information is passed around.

> So i believe you should either send some sensible fixes _NOW_, or admit
> that the "no queues" NIH nonsense of SLUB doesn't work and do an edible,
> incremental patchset against SLAB to bring in the debuggability features
> of SLUB without killing SLAB's performance. (And fix the NUMA alien
> cache problem along the lines suggested before - perhaps initially by
> making 'noaliencache' the default bootup option.) And we obviously must
> revert the default in 2.6.24 to SLAB as well.

The fixes to the alien caches that you proposed do not work, since we would still have to check for the need to put the object into the alien cache. The checks for the locality of each object are the main cause of trouble when freeing in SLAB. SLUB is generally performance-wise superior to SLAB, as demonstrated by my measurements that you reviewed too.
In addition SLUB has many improvements over SLAB. It's not only the runtime debuggability, which would be difficult to add to SLAB since there would be numerous runtime checks in critical code paths that would cause regressions. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 16:26 ` Christoph Lameter @ 2007-12-21 16:33 ` Ingo Molnar 2007-12-21 21:56 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 16:33 UTC (permalink / raw) To: Christoph Lameter Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki * Christoph Lameter <clameter@sgi.com> wrote: > On Fri, 21 Dec 2007, Ingo Molnar wrote: > > > and this is not the only regression: > > > > http://lkml.org/lkml/2007/10/4/290 > > > > _6%_ TPC-C regression. That's _a lot_ in TPC-C terms. > > > > and just like in this case there were very clear profiles posted. I > > proffer, reading back the whole thread, that if you fix hackbench ^^^^^^^^^^^^^^^^^^^^ > > you have fixed TPC-C as well. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > > There are patches pending to address these issues. AFAICT Intel is > testing if the regression is still there. There is no way for me to > verify what is going on there and there is the constant difficulty of > getting detailed information about what is going on at Intel. Every > couple of month I get a result from that test. Its a really crappy > situation where a lot of confusing information is passed around. of course there is a way to find out, and that's why i mailed you: fix the hackbench regression and i'm quite sure you'll improve the TPC-C numbers as well. It shows the same kind of overhead in the profile and takes just a few seconds to run. Are your pending SLUB patches in 2.6.24-rc5-mm1 already? Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 16:33 ` Ingo Molnar @ 2007-12-21 21:56 ` Christoph Lameter 0 siblings, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 21:56 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Ingo Molnar wrote:

> > There are patches pending to address these issues. AFAICT Intel is
> > testing if the regression is still there. There is no way for me to
> > verify what is going on there and there is the constant difficulty of
> > getting detailed information about what is going on at Intel. Every
> > couple of month I get a result from that test. Its a really crappy
> > situation where a lot of confusing information is passed around.
>
> of course there is a way to find out, and that's why i mailed you: fix
> the hackbench regression and i'm quite sure you'll improve the TPC-C
> numbers as well. It shows the same kind of overhead in the profile and
> takes just a few seconds to run. Are your pending SLUB patches in
> 2.6.24-rc5-mm1 already?

The tests that I wrote emulate the test behavior that was described to me. The fixes in 2.6.24-rc5-mm1 improved those numbers. See http://lkml.org/lkml/2007/10/27/245 which I quoted earlier to you. However, I have no TPC-C setup here, and from what I hear it takes weeks to run and requires a large support team for tuning. You can find the slab test suite for that at http://git.kernel.org/?p=linux/kernel/git/christoph/vm.git;a=shortlog;h=tests AFAICT the fixes in 2.6.24-rc5-mm1 result in double the alloc performance (fastpath) of SLAB. There are fixes that are not merged yet (the cpu alloc patchset) that seem to make that a factor of 3, because we can use the segment register to avoid per cpu array lookups in the fast path. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 12:09 ` Ingo Molnar 2007-12-21 12:26 ` Ingo Molnar @ 2007-12-21 16:18 ` Christoph Lameter 2007-12-21 16:44 ` Linus Torvalds 2007-12-21 17:44 ` Pekka Enberg 2 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 16:18 UTC (permalink / raw) To: Ingo Molnar Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Ingo Molnar wrote:

> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.

I added a test that does this 1 alloc N free behavior to the in-kernel slab benchmark (cpu 0 continually allocs, cpus 1-7 continuously free) (all order 0). The test makes the regression worse because it runs the scenario with raw slab allocs/frees in the kernel context without the use of the scheduler:

SLAB
1 alloc N free(8): 0=2115 1=549 2=551 3=550 4=550 5=550 6=551 7=549 Average=746
1 alloc N free(16): 0=2159 1=573 2=574 3=573 4=574 5=573 6=574 7=573 Average=772
1 alloc N free(32): 0=2168 1=577 2=576 3=577 4=579 5=578 6=577 7=576 Average=776
1 alloc N free(64): 0=2276 1=555 2=554 3=554 4=555 5=556 6=555 7=554 Average=770
1 alloc N free(128): 0=2635 1=549 2=548 3=548 4=550 5=549 6=548 7=549 Average=809
1 alloc N free(256): 0=3261 1=543 2=542 3=542 4=544 5=542 6=542 7=542 Average=882
1 alloc N free(512): 0=4642 1=571 2=572 3=570 4=573 5=571 6=572 7=572 Average=1080
1 alloc N free(1024): 0=7116 1=582 2=581 3=583 4=581 5=582 6=582 7=583 Average=1399
1 alloc N free(2048): 0=11612 1=667 2=667 3=667 4=668 5=667 6=668 7=667 Average=2035
1 alloc N free(4096): 0=20135 1=673 2=674 3=674 4=675 5=674 6=675 7=674 Average=3107

SLUB 1 alloc N free test
========================
1 alloc N free(8): 0=1223 1=645 2=417 3=643 4=538 5=612 6=557 7=564 Average=650
1 alloc N free(16): 0=3828 1=3140 2=3116 3=3166 4=3122 5=3172 6=3129 7=3172 Average=3231
1 alloc N free(32): 0=4116 1=3207 2=3180 3=3210 4=3183 5=3202 6=3182 7=3214 Average=3312
1 alloc N free(64): 0=4116 1=3017 2=3005 3=3013 4=3006 5=3016 6=3004 7=3020 Average=3149
1 alloc N free(128): 0=5282 1=3013 2=3010 3=3014 4=3009 5=3010 6=3005 7=3013 Average=3295
1 alloc N free(256): 0=5271 1=2750 2=2751 3=2751 4=2751 5=2751 6=2752 7=2751 Average=3066
1 alloc N free(512): 0=5304 1=2293 2=2295 3=2295 4=2295 5=2294 6=2295 7=2294 Average=2671
1 alloc N free(1024): 0=6017 1=2044 2=2043 3=2043 4=2043 5=2045 6=2044 7=2044 Average=2541
1 alloc N free(2048): 0=7162 1=2086 2=2084 3=2086 4=2086 5=2085 6=2086 7=2086 Average=2720
1 alloc N free(4096): 0=7133 1=1795 2=1796 3=1796 4=1797 5=1795 6=1796 7=1795 Average=2463

So we have 3 different regimes here (order 0):

1. SLUB wins at size 8 because the cpu slab is never given up, given that we have 512 8-byte objects in a 4K slab (the same scenario that we can produce with slub_min_order=9).

2. At the high end (>= 2048) the additional overhead of SLAB when allocating objects makes SLUB again performance-wise better or similar.

3. There is an intermediate range where the cpu slab has to be changed (and thus list_lock has to be used) and where multiple cpus that free objects can simultaneously hit the slab lock of the same slab page. That is the cause of the regression.

The regression is contained because:

1. The contention only exists when concurrently freeing and allocating from the same slab page. SLUB has provisions for keeping a single allocator and a single freer in the clear by moving the freelist pointer when a slab page becomes a cpu slab. Concurrent allocs and frees are possible without cacheline contention. The problem only comes about when multiple free operations occur simultaneously on the same slab page.

2. The number of objects in a slab is limited. For the 128-byte size of interest here we have 32 objects in a slab. The maximum theoretical contention on the slab lock is through 32 cpus, if they are all freeing simultaneously.

3. The tests are an artificial scenario. Typically one would have additional processing in both the alloc and the free paths, which would reduce the contention. If one did not have complex processing on alloc then one would typically allocate the object on the remote processor (warming up that processor's cache) instead of allocating objects while freeing them on remote processors.

4. The contention is reduced the larger the object size becomes, since this reduces the number of objects per slab page. Thus the number of processors taking the lock simultaneously is also reduced.

5. The contention is also reduced through normal operations that lead to the creation of partially populated slabs. If we cannot allocate a large number of objects from the same slab (typical if the system has been running for a while) then the contention is constrained.

We could address this issue by:

1. Detecting the contention while taking the slab lock and then deferring the free in these special situations (e.g. by using RCU to free these objects).

2. Not allocating from the same slab if we detect contention.

But given all the boundaries on the contention, I would think that it is not worth addressing.

> being able to utilize order-0 pages was supposed to be one of the big
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously
> the "fix", right?

Booting with 2MB-sized slabs is certainly not the mainstream fix, although I expect that some SGI machines may run with that as the default, given that they have a couple of terabytes of memory and the 2MB configurations also reduce TLB pressure significantly. The use of 2MB slabs is not motivated by the contention that we see here, but rather by the improved scaling and nice interaction with the huge page management framework provided by the antifrag patches. Order-0 allocations were never the emphasis of SLUB.
The emphasis was on being able to use higher order allocs in order to make slab operations scale higher. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 16:18 ` Christoph Lameter @ 2007-12-21 16:44 ` Linus Torvalds 2007-12-21 22:11 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Linus Torvalds @ 2007-12-21 16:44 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Christoph Lameter wrote:
>
> So we have 3 different regimes here (order 0):
[ ... ]
> The regression is contained because:
[ ... ]

Christoph, *NONE* of these arguments matter at all. The only thing that matters is the simple fact that real-life benchmarks show that SLUB is worse. It doesn't matter one whit that you can say that it's better under some circumstance that either isn't triggered, or when setting unrealistic and unusable parameters (ie big internal slab orders).

> We could address this issue by:
[...]
> But given all the boundaries for the contention I would think that it is
> not worth addressing.

.. and this seems to be the single biggest reason to just say "revert SLUB entirely". If you aren't even motivated to fix the problems that have been reported, then SLUB isn't even a _potential_ solution, it's purely a problem, and since I am not IN THE LEAST interested in having three different allocators around, SLUB is going to get axed.

In other words, here's the low-down:

a) either SLUB can replace SLAB, or SLUB is going away

   This isn't open for discussion, Christoph. This was the rule back when it got merged in the first place.

b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being one, and while hackbench may be borderline, it's certainly less borderline than others), then SLUB cannot replace SLAB.
   See (a)

c) If you aren't even interested in trying to fix it and are ignoring the reports, there is not even a _potential_ for SLUB for getting over these problems, and I should disable it and we should get over it as soon as possible, and this whole discussion is pretty much ended.

See? It really is that simple. Either you say "Hell yes, I'll fix it", or SLUB goes away. There simply is no other choice. When you say "not worth addressing", that to me is a clear and unambiguous "let's remove SLUB".

The main reason for SLUB in the first place was SGI concerns. You seem to think that _only_ SGI concerns matter. Wrong. If SLUB remains an SGI-only thing and you don't fix it for other users, then SLUB gets reverted from the standard kernel.

That's all. And it's not really worth discussing. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 16:44 ` Linus Torvalds @ 2007-12-21 22:11 ` Christoph Lameter 2007-12-21 22:16 ` Peter Zijlstra 2007-12-21 22:49 ` Linus Torvalds 0 siblings, 2 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 22:11 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Linus Torvalds wrote:

> If you aren't even motivated to fix the problems that have been reported,
> then SLUB isn't even a _potential_ solution, it's purely a problem, and
> since I am not IN THE LEAST interested in having three different
> allocators around, SLUB is going to get axed.

Not motivated? I have analyzed the problem in detail and when it comes down to it there is not much impact that I can see in real life applications. I have always responded to the regressions reported, also via TPC-C. There is still no answer to my post at http://lkml.org/lkml/2007/10/27/245

SLAB has similar issues when freeing to remote nodes under NUMA. It's even worse since the lock is not per slab page but per slab cache. The alien caches that are then used have one lock per node.

> In other words, here's the low-down:
>
> a) either SLUB can replace SLAB, or SLUB is going away
>
>    This isn't open for discussion, Christoph. This was the rule back when
>    it got merged in the first place.

It obviously can replace it and has replaced it for a while now.

> b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being
>    one, and while hackbench may be borderline, it's certainly less
>    borderline than others), then SLUB cannot replace SLAB.

Looks like I need to find someone to get that going here so that I am able to test under it.
> c) If you aren't even interested in trying to fix it and are ignoring
>    the reports, there is not even a _potential_ for SLUB for getting over
>    these problems, and I should disable it and we should get over it as
>    soon as possible, and this whole discussion is pretty much ended.

I have looked at this issue under various circumstances for the last week. Nothing really convinced me that this is so serious. I could not figure out if it is really worth doing anything about, but if you put it that way then of course it is.

> The main reason for SLUB in the first place was SGI concerns. You seem to
> think that _only_ SGI concerns matter. Wrong. If SLUB remains an SGI-only
> thing and you don't fix it for other users, then SLUB gets reverted from
> the standard kernel.

The reason for SLUB was dysfunctional behavior of SLAB in many areas. It was not just SGI's concerns that drove it.

> That's all. And it's not really worth discussing.

Well, still, SLUB really is superior to SLAB in many ways:

- SLUB is performance wise much faster than SLAB. This can be more than a factor of 10 (in the case of concurrent allocations / frees on multiple processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB.

- Remote freeing of objects in NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. It requires a recompile of the kernel, and the resulting output is difficult to interpret. SLUB can apply debugging options to a subset of the slab caches in order to allow the system to work with maximum speed. This is necessary to detect difficult-to-reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem gets worse the more processors and NUMA nodes are in the system.

- SLAB requires a pass through all slab caches every 2 seconds to expire objects. This is a problem both for realtime and MPI jobs that cannot take such a processor outage.

- SLAB does not have a sophisticated slabinfo tool to report the state of slab objects on the system. SLUB's can provide details of object use.

- SLAB requires the update of two words for freeing and allocation. SLUB can do that by updating a single word, which makes it possible to avoid enabling and disabling interrupts if the processor supports an atomic instruction for that purpose. This is important for realtime kernels, where special measures may have to be implemented.

- SLAB requires memory to be set aside for queues (processors times number of slabs times queue size). SLUB requires none of that.

- SLUB merges slab caches with similar characteristics to reduce the memory footprint even further.

- SLAB performs object level NUMA management, which adds considerable allocator complexity. SLUB manages NUMA on the level of slab pages.

- SLUB allows remote node defragmentation to avoid the buildup of large partial lists on a single node.

- SLUB can actively reduce the fragmentation of slabs through slab cache specific callbacks (not merged yet).

- SLUB has resiliency features that allow it to isolate a problem object and continue after diagnostics have been performed.

- SLUB creates rarely used DMA caches on demand instead of creating them all at bootup (as SLAB does). ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:11 ` Christoph Lameter @ 2007-12-21 22:16 ` Peter Zijlstra 2007-12-21 22:17 ` Peter Zijlstra 2007-12-21 22:49 ` Linus Torvalds 1 sibling, 1 reply; 93+ messages in thread From: Peter Zijlstra @ 2007-12-21 22:16 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote: > On Fri, 21 Dec 2007, Linus Torvalds wrote: > > > If you aren't even motivated to fix the problems that have been reported, > > then SLUB isn't even a _potential_ solution, it's purely a problem, and > > since I am not IN THE LEAST interested in having three different > > allocators around, SLUB is going to get axed. > > Not motivated? I have analyzed the problem in detail and when it comes > down to it there is not much impact that I can see in real life > applications. I have always responded to the regression reported also via > TPC-C. But you are dismissing the hackbench regression, which is not a small one. It runs an astonishing 10x slower. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:16 ` Peter Zijlstra @ 2007-12-21 22:17 ` Peter Zijlstra 2007-12-21 22:27 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Peter Zijlstra @ 2007-12-21 22:17 UTC (permalink / raw) To: Christoph Lameter Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Fri, 2007-12-21 at 23:16 +0100, Peter Zijlstra wrote: > On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote: > > On Fri, 21 Dec 2007, Linus Torvalds wrote: > > > > > If you aren't even motivated to fix the problems that have been reported, > > > then SLUB isn't even a _potential_ solution, it's purely a problem, and > > > since I am not IN THE LEAST interested in having three different > > > allocators around, SLUB is going to get axed. > > > > Not motivated? I have analyzed the problem in detail and when it comes > > down to it there is not much impact that I can see in real life > > applications. I have always responded to the regression reported also via > > TPC-C. > > But you are dismissing the hackbench regression, which is not a small > one. It runs an astonishing 10x slower. > BTW, does /proc/slabinfo exist again? I thought we set that as a requirement for SLUB to be the default and a full replacement for SLAB. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:17 ` Peter Zijlstra @ 2007-12-21 22:27 ` Christoph Lameter 2007-12-21 22:54 ` Ingo Molnar 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 22:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Peter Zijlstra wrote:

> BTW, does /proc/slabinfo exist again? I thought we set that as a
> requirement for SLUB to be the default and a full replacement for SLAB.

Well, the information that would be exposed in /proc/slabinfo would only be faked-up stuff from SLUB. The queues that are described in /proc/slabinfo do not exist, f.e. (tunables do not make sense), and there are more details available via /sys/kernel/slab. There is a discussion on linux-mm with some of the people working on these tools to make them use the information that SLUB provides. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:27 ` Christoph Lameter @ 2007-12-21 22:54 ` Ingo Molnar 2007-12-21 23:20 ` David Miller ` (3 more replies) 0 siblings, 4 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 22:54 UTC (permalink / raw) To: Christoph Lameter Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Christoph Lameter <clameter@sgi.com> wrote: > On Fri, 21 Dec 2007, Peter Zijlstra wrote: > > > BTW, does /proc/slabinfo exist again? I thought we set that as a > > requirement for SLUB to be the default and a full replacement for > > SLAB. > > Well the information that would be exposed in /proc/slabinfo would > only be faked up stuff from SLUB. The queues that are described in > /proc/slabinfo do not exist f.e. (tunables do not make sense) and > there are more details available via /sys/kernel/slab. Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop relies on it, people use it every day to monitor memory consumption. I'm really getting worried that you are apparently incapable of grasping such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or whatever else in some of the fields? People use it to know how many objects are allocated and sure SLUB knows that count, sheesh. How on earth can you come up with a lame excuse like that? You dont like the 'SLAB' portion of the name perhaps? Is it NIH again? Really, if your behavior is representative of how our SLAB allocator will be maintained in the future then i'm very, very worried :-( You ignore and downplay clear-cut regressions, you insult and attack testers, you are incredibly stupid about user ABIs (or pretend to be so) and you distort and mislead all the way. What will you be able to do in the much less clear-cut cases?? Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:54 ` Ingo Molnar @ 2007-12-21 23:20 ` David Miller 2007-12-22 9:45 ` Ingo Molnar 2007-12-21 23:27 ` Andrew Morton ` (2 subsequent siblings) 3 siblings, 1 reply; 93+ messages in thread From: David Miller @ 2007-12-21 23:20 UTC (permalink / raw) To: mingo Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, akpm, hch, rjw From: Ingo Molnar <mingo@elte.hu> Date: Fri, 21 Dec 2007 23:54:13 +0100 > Really, if your behavior is representative of how our SLAB allocator > will be maintained in the future then i'm very, very worried :-( Actually, to the contrary, I actually think Christoph responds to every problem I've ever reported to him about his per-cpu counters work and SLUB much better than most people who call themselves "maintainers" around here. And I say that without any reservations. He doesn't deserve the ad-hominem attacks he is getting today, because he does resolve every problem reported to him. The guy wrote test cases, he analyzed every problem, he wrote test patches, and he doesn't stop doing any of that until the issue really is reported as resolved by the testers. I'll take Christoph as the implementor and maintainer of anything, any day of the week. He rocks. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 23:20 ` David Miller @ 2007-12-22 9:45 ` Ingo Molnar 2007-12-24 19:24 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2007-12-22 9:45 UTC (permalink / raw) To: David Miller Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, akpm, hch, rjw * David Miller <davem@davemloft.net> wrote: > From: Ingo Molnar <mingo@elte.hu> > Date: Fri, 21 Dec 2007 23:54:13 +0100 > > > Really, if your behavior is representative of how our SLAB allocator > > will be maintained in the future then i'm very, very worried :-( > > Actually, to the contrary, I actually think Christoph responds to > every problem I've ever reported to him about his per-cpu counters > work and SLUB much better than most people who call themselves > "maintainers" around here. > > And I say that without any reservations. > > He doesn't deserve the ad-hominem attacks he is getting today, because > he does resolve every problem reported to him. > > The guy wrote test cases, he analyzed every problem, he wrote test > patches, and he doesn't stop doing any of that until the issue really > is reported as resolved by the testers. > > I'll take Christoph as the implementor and maintainer of anything, any > day of the week. He rocks. well, maybe i got unlucky, this hackbench thing being the first time i'm exposed to a major SLUB regression. The hackbench problem was dead easy to reproduce, i (and others) offered immediate testing of whatever test patches, it also matched the profiles of the TPC-C regression but still i was only offered explanations about why this workload does not matter and how others suck because they are unable to give immediate test feedback from millions-of-dollars test equipment that is barely able to run our devel kernels. The regression is fixed now and i'm a happy camper! Christoph, i'd like to apologize for all overly harsh words i said. 
(and i said quite a few :-/ ) Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 9:45 ` Ingo Molnar @ 2007-12-24 19:24 ` Christoph Lameter 0 siblings, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-24 19:24 UTC (permalink / raw) To: Ingo Molnar Cc: David Miller, a.p.zijlstra, torvalds, rostedt, linux-kernel, akpm, hch, rjw On Sat, 22 Dec 2007, Ingo Molnar wrote: > Christoph, i'd like to apologize for all overly harsh words i said. (and > i said quite a few :-/ ) Looks like a bad interaction due to overload, vacation, flu epidemic hitting me and family and also Christmas coming up so that I could not do my usual workload. Need to find some way to cut back on the stuff that I am involved with it seems. Thanks for the SLUB patches. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:54 ` Ingo Molnar 2007-12-21 23:20 ` David Miller @ 2007-12-21 23:27 ` Andrew Morton 2007-12-21 23:51 ` Christoph Lameter 2007-12-22 0:28 ` Andi Kleen 3 siblings, 0 replies; 93+ messages in thread From: Andrew Morton @ 2007-12-21 23:27 UTC (permalink / raw) To: Ingo Molnar Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, hch, rjw

On Fri, 21 Dec 2007 23:54:13 +0100 Ingo Molnar <mingo@elte.hu> wrote:

> * Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Fri, 21 Dec 2007, Peter Zijlstra wrote:
> >
> > > BTW, does /proc/slabinfo exist again? I thought we set that as a
> > > requirement for SLUB to be the default and a full replacement for
> > > SLAB.
> >
> > Well the information that would be exposed in /proc/slabinfo would
> > only be faked up stuff from SLUB. The queues that are described in
> > /proc/slabinfo do not exist f.e. (tunables do not make sense) and
> > there are more details available via /sys/kernel/slab.
>
> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it.

That would be really bad. /proc/slabinfo intimately exposes implementation-specific internals and strictly speaking should never have been introduced. Obviously that isn't a _practical_ position, but there are no easy solutions here. We don't want to have to drag a pile of be-compatible-with-slab gunk along with us for ever.

> slabtop
> relies on it, people use it every day to monitor memory consumption.

slabtop needs to be changed asap. Possibly we could look at adding some interim make-it-look-like-slab hack into slub.c for a couple of releases, but no longer. If we'd updated slabtop six months ago this issue wouldn't have arisen now. But we didn't do that. We're quite bad at this sort of thing. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:54 ` Ingo Molnar 2007-12-21 23:20 ` David Miller 2007-12-21 23:27 ` Andrew Morton @ 2007-12-21 23:51 ` Christoph Lameter 2007-12-22 0:28 ` Andi Kleen 3 siblings, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 23:51 UTC (permalink / raw) To: Ingo Molnar Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> I'm really getting worried that you are apparently incapable of grasping
> such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or
> whatever else in some of the fields? People use it to know how many
> objects are allocated and sure SLUB knows that count, sheesh. How on
> earth can you come up with a lame excuse like that? You dont like the
> 'SLAB' portion of the name perhaps? Is it NIH again?

NIH? I wrote major portions of SLAB. I would be hating my own product. Could you get the facts straight at some point? This is getting weird.

> Really, if your behavior is representative of how our SLAB allocator
> will be maintained in the future then i'm very, very worried :-( You
> ignore and downplay clear-cut regressions, you insult and attack
> testers, you are incredibly stupid about user ABIs (or pretend to be so)
> and you distort and mislead all the way. What will you be able to do in
> the much less clear-cut cases??

I analyzed the issue and argued that the issue that one test showed in SLUB is a really special case, and then you conclude that I ignore all regressions? I have addressed and responded to all reports of regressions that came to me. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:54 ` Ingo Molnar ` (2 preceding siblings ...) 2007-12-21 23:51 ` Christoph Lameter @ 2007-12-22 0:28 ` Andi Kleen 2007-12-22 0:33 ` Chuck Ebbert ` (2 more replies) 3 siblings, 3 replies; 93+ messages in thread From: Andi Kleen @ 2007-12-22 0:28 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki Ingo Molnar <mingo@elte.hu> writes: > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop > relies on it, people use it every day to monitor memory consumption. It's definitely not a stable ABI. slabtop tends to exit without any error message on any slabinfo version number increase and I've seen that happen several times in not so old kernels. Requiring just another slabtop update isn't really a big deal. Also it's not that it's a critical functionality like udev. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 0:28 ` Andi Kleen @ 2007-12-22 0:33 ` Chuck Ebbert 2007-12-22 2:19 ` Matt Mackall 2007-12-22 10:03 ` Ingo Molnar 2 siblings, 0 replies; 93+ messages in thread From: Chuck Ebbert @ 2007-12-22 0:33 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On 12/21/2007 07:28 PM, Andi Kleen wrote: > Ingo Molnar <mingo@elte.hu> writes: > >> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop >> relies on it, people use it every day to monitor memory consumption. > > It's definitely not a stable ABI. slabtop tends to exit without any > error message on any slabinfo version number increase and I've seen > that happen several times in not so old kernels. > > Requiring just another slabtop update isn't really a big deal. > > Also it's not that it's a critical functionality like udev. > There's also Documentation/vm/slabinfo.c, which really belongs in some kind of 'kernel utilities' package. Since it's so intimately tied to kernel version I don't think it belongs in util-linux. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 0:28 ` Andi Kleen 2007-12-22 0:33 ` Chuck Ebbert @ 2007-12-22 2:19 ` Matt Mackall 2007-12-22 2:44 ` Andi Kleen 2007-12-22 10:03 ` Ingo Molnar 2 siblings, 1 reply; 93+ messages in thread From: Matt Mackall @ 2007-12-22 2:19 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote: > Ingo Molnar <mingo@elte.hu> writes: > > > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop > > relies on it, people use it every day to monitor memory consumption. > > It's definitely not a stable ABI. slabtop tends to exit without any > error message on any slabinfo version number increase and I've seen > that happen several times in not so old kernels. > > Requiring just another slabtop update isn't really a big deal. Might as well mention that SLOB also doesn't provide this functionality, nor would it make any sense for it to try (because there are no slab-like objects to report about). -- Mathematics is the supreme nostalgia of our time. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 2:19 ` Matt Mackall @ 2007-12-22 2:44 ` Andi Kleen 0 siblings, 0 replies; 93+ messages in thread From: Andi Kleen @ 2007-12-22 2:44 UTC (permalink / raw) To: Matt Mackall Cc: Andi Kleen, Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

On Fri, Dec 21, 2007 at 08:19:42PM -0600, Matt Mackall wrote:
>
> On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote:
> > Ingo Molnar <mingo@elte.hu> writes:
> >
> > > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop
> > > relies on it, people use it every day to monitor memory consumption.
> >
> > It's definitely not a stable ABI. slabtop tends to exit without any
> > error message on any slabinfo version number increase and I've seen
> > that happen several times in not so old kernels.
> >
> > Requiring just another slabtop update isn't really a big deal.
>
> Might as well mention that SLOB also doesn't provide this functionality,
> nor would it make any sense for it to try (because there are no
> slab-like objects to report about).

% slabinfo
Name       Objects Objsize   Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
...
anon_vma      1551      24   81.9K       20/18/2   128 0  90  45

Seems certainly like information that slabtop could report, even if it's not exactly the same. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 0:28 ` Andi Kleen 2007-12-22 0:33 ` Chuck Ebbert 2007-12-22 2:19 ` Matt Mackall @ 2007-12-22 10:03 ` Ingo Molnar 2007-12-22 12:37 ` Andi Kleen 2007-12-22 18:01 ` Linus Torvalds 2 siblings, 2 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-22 10:03 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Andi Kleen <andi@firstfloor.org> wrote: > Ingo Molnar <mingo@elte.hu> writes: > > > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. > > slabtop relies on it, people use it every day to monitor memory > > consumption. > > It's definitely not a stable ABI. slabtop tends to exit without any > error message on any slabinfo version number increase and I've seen > that happen several times in not so old kernels. so because we sucked in the past we can continue to suck? ;) Why are we still arguing about this? We kernel developers are foxes amongst the hens and if a compatibility issue comes up we have to act _doubly_ conservatively. You think it's reasonable to not offer /proc/slabinfo? You can fairly assume that a user considers it absolutely unreasonable. If other kernel developers tell you: "no, it's fine", then it's as if other foxes told you "no, it's fine to eat that hen, we do it all the time too!" ;-) > Requiring just another slabtop update isn't really a big deal. certainly. But consider this from the user's perspective who tries one of our devel kernels. He suspects a memory leak. Runs slabtop and gets: $ slabtop fopen /proc/slabinfo: No such file or directory and would fairly conclude: "ok, this new Linux kernel looks quite apparently unfinished, i'm outta here". We do this way too frequently and many silly details like this _do_ mount up. > Also it's not that it's a critical functionality like udev. 
Sure, we can argue about details that not all fields in /proc/slabinfo are relevant, and that slabtop should be a bit more careful, etc., but we've got what we've got because _we_ built the current code, so we might as well accept the consequences it brings. Most of the basic output of slabtop:

Active / Total Objects (% used)    : 648754 / 747076 (86.8%)
Active / Total Slabs (% used)      : 39394 / 39394 (100.0%)
Active / Total Caches (% used)     : 103 / 143 (72.0%)
Active / Total Size (% used)       : 138884.36K / 151075.96K (91.9%)
Minimum / Average / Maximum Object : 0.01K / 0.20K / 128.00K

  OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
261928 239808  91%    0.13K   9032       29     36128K dentry_cache
222048 174144  78%    0.05K   3084       72     12336K buffer_head
187232 178929  95%    0.48K  23404        8     93616K ext3_inode_cache
 24416  17908  73%    0.27K   1744       14      6976K radix_tree_node

could be offered on SLUB too. 'top' isnt critical functionality either like udev, and the ABI does not only cover 'critical' functionality. A utility suddenly not working gives Linux a pretty amateurish feeling. Should we tell users/admins: "Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob. If you want your stuff to continue working, use Windows next time around or what. Sheesh, what do these people want!' ;-) the rule is very simple: unless you have really, really, really, REALLY good reasons, just dont break stuff. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 10:03 ` Ingo Molnar @ 2007-12-22 12:37 ` Andi Kleen 2007-12-22 12:51 ` Pekka Enberg 2007-12-22 13:01 ` Alexey Dobriyan 2007-12-22 18:01 ` Linus Torvalds 1 sibling, 2 replies; 93+ messages in thread From: Andi Kleen @ 2007-12-22 12:37 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > could be offered on SLUB too. Sure (I argued that in an earlier mail in fact), but it doesn't make sense to fake it into the old slabinfo format. > > 'top' isnt critical functionality either like udev, and the ABI does not > only cover 'critical' functionality. A utility suddenly not working > gives Linux a pretty amateurish feeling. Should we tell users/admins: > "Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob. > If you want your stuff to continue working, use Windows next time around > or what. Sheesh, what do these people want!' ;-) While I sympathise with this point, in practice non-critical ABIs change regularly (and sometimes even critical ones like sched_yield ...) > > the rule is very simple: unless you have really, really, really, REALLY > good reasons, just dont break stuff. Removing an interface that exposes lots of internal details when you rewrite the subsystem and those internal details all change seems like a good reason to me. Besides, the original interface was broken anyway. The version number was one of the interface's worst ideas ever imho and slabtop's handling of it quite dumb. The better way would have been to always add new fields and never remove old ones. Hopefully the new /sys based interface will be better. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 12:37 ` Andi Kleen @ 2007-12-22 12:51 ` Pekka Enberg 2007-12-22 12:54 ` Andi Kleen 2007-12-22 13:01 ` Alexey Dobriyan 1 sibling, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2007-12-22 12:51 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki Hi, On Dec 22, 2007 2:37 PM, Andi Kleen <andi@firstfloor.org> wrote: > Removing an interface that exposes lots of internal details when you > rewrite the subsystem and those internal details all change seems > like a good reason to me. Yeah but then again removing an interface that has been around for ever is a real PITA for users. Besides, emulating /proc/slabinfo ain't so bad: http://lkml.org/lkml/2007/12/22/58 Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 12:51 ` Pekka Enberg @ 2007-12-22 12:54 ` Andi Kleen 2007-12-22 13:27 ` Pekka Enberg 0 siblings, 1 reply; 93+ messages in thread From: Andi Kleen @ 2007-12-22 12:54 UTC (permalink / raw) To: Pekka Enberg Cc: Andi Kleen, Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > Yeah but then again removing an interface that has been around for > ever Version 2.1 hasn't been around forever and at least slabtop does not work over version number changes in my experience. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 12:54 ` Andi Kleen @ 2007-12-22 13:27 ` Pekka Enberg 0 siblings, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2007-12-22 13:27 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki Hi Andi, On Dec 22, 2007 2:54 PM, Andi Kleen <andi@firstfloor.org> wrote: > > Yeah but then again removing an interface that has been around for > > ever > > Version 2.1 hasn't been around forever and at least slabtop does not > work over version number changes in my experience. True but the default user-visible ABI (the one without statistics) is unchanged since version 2.0: http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=7ee832030b3a5d3a87f8f85b1fc773be15e98608 http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=e79fbf181ba738ca410f2c457ced220b738e8856 But anyway, I don't care too much either way so I'll just let Andrew sort this out, as usual. Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 12:37 ` Andi Kleen 2007-12-22 12:51 ` Pekka Enberg @ 2007-12-22 13:01 ` Alexey Dobriyan 1 sibling, 0 replies; 93+ messages in thread From: Alexey Dobriyan @ 2007-12-22 13:01 UTC (permalink / raw) To: Andi Kleen Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > > the rule is very simple: unless you have really, really, really, REALLY > > good reasons, just dont break stuff. Could we please drop "kset: move /sys/slab to /sys/kernel/slab" from -mm? Old slabinfo complains that SLUB_DEBUG is not on and refuses to do anything. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 10:03 ` Ingo Molnar 2007-12-22 12:37 ` Andi Kleen @ 2007-12-22 18:01 ` Linus Torvalds 2007-12-22 19:09 ` Ingo Molnar ` (2 more replies) 1 sibling, 3 replies; 93+ messages in thread From: Linus Torvalds @ 2007-12-22 18:01 UTC (permalink / raw) To: Ingo Molnar Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, 22 Dec 2007, Ingo Molnar wrote: > > > > It's definitely not a stable ABI. slabtop tends to exit without any > > error message on any slabinfo version number increase and I've seen > > that happen several times in not so old kernels. > > so because we sucked in the past we can continue to suck? ;) Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and the fact that there hasn't been a lot of complaints about it going away does seem to imply that it wasn't a very important ABI. I'm the first to stand up for backwards compatibility, but I also try to be realistic, and so far nobody has really complained about the fact that /proc/slabinfo went away on any grounds but on the general principle of "it was a user-visible ABI". We've changed user-visible ABI's in the past in the hope that they weren't actually used. Sometimes it worked, sometimes it didn't. In this case, I think it still has the potential of working. That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I as a developer tend to want from there, but it's really rather less convenient than I'd like. It is really quite hard, for example, to get any kind of "who actually uses the most *memory*" information out of there. 
You have to use something like this:

for i in *
do
	order=$(cat "$i/order")
	slabs=$(cat "$i/slabs")
	object_size=$(cat "$i/object_size")
	objects=$(cat "$i/objects")
	truesize=$(( $slabs<<($order+12) ))
	size=$(( $object_size*$objects ))
	percent=$(( $truesize/100 ))
	if [ $percent -gt 0 ]; then
		percent=$(( $size / $percent ))
	fi
	mb=$(( $size >> 10 ))
	printf "%10d MB %3d %s\n" $mb $percent $i
done | sort -n | tail

which works fine, but while it's actually _much_ more readable than doing the same thing with /proc/slabinfo was, I worry about the lack of atomicity in getting the statistics. I dunno. I do think /sys/slab/ is a better interface than /proc/slabinfo was. I just wonder if it could be better still. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
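The loop above boils down to two formulas: the memory backing a cache is slabs << (order + 12), i.e. slabs times 2^order pages of 4096 bytes, and the memory held by live objects is object_size * objects. A standalone sketch of that arithmetic with hypothetical sample values (these numbers are purely illustrative, not read from any real /sys/slab):

```shell
# Hypothetical values for one cache, in the shape the script reads them
order=1          # pages per slab = 2^order
slabs=100        # slabs currently allocated
object_size=192  # bytes per object
objects=4000     # live objects

# total memory backing the cache: slabs * (4096 << order) bytes
truesize=$(( slabs << (order + 12) ))
# memory actually occupied by live objects
size=$(( object_size * objects ))
# utilization percentage, guarded against division by zero
percent=$(( truesize / 100 ))
if [ $percent -gt 0 ]; then
	percent=$(( size / percent ))
fi
echo "$(( size >> 10 )) KB used, ${percent}% of $(( truesize >> 10 )) KB backing"
```

Note that `size >> 10` converts bytes to kilobytes, even though the original script's printf labels the result "MB"; the sketch labels it KB.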
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 18:01 ` Linus Torvalds @ 2007-12-22 19:09 ` Ingo Molnar 2007-12-22 19:29 ` Ingo Molnar 2007-12-22 22:52 ` Jason L Tibbitts III 2007-12-22 19:25 ` Theodore Tso 2007-12-22 19:46 ` slabtop replacement was " Andi Kleen 2 siblings, 2 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-22 19:09 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Linus Torvalds <torvalds@linux-foundation.org> wrote: > > > It's definitely not a stable ABI. slabtop tends to exit without > > > any error message on any slabinfo version number increase and I've > > > seen that happen several times in not so old kernels. > > > > so because we sucked in the past we can continue to suck? ;) > > Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, > and the fact that there hasn't been a lot of complaints about it going > away does seem to imply that it wasn't a very important ABI. > > I'm the first to stand up for backwards compatibility, but I also try > to be realistic, and so far nobody has really complained about the > fact that /proc/slabinfo went away on any grounds but on the general > principle of "it was a user-visible ABI". > > We've changed user-visible ABI's in the past in the hope that they > weren't actually used. Sometimes it worked, sometimes it didn't. In > this case, I think it still has the potential of working. > > That said, I'm not a *huge* fan of /sys/slab/ either. I can get the > info I as a developer tend to want from there, but it's really rather > less convenient than I'd like. It is really quite hard, for example, > to get any kind of "who actually uses the most *memory*" information > out of there. 
Yes, i agree that me calling it an 'ABI' stretches the term - we dont want to put ourselves into a forced compatibility corner in every case where /proc does something really stupid. But /proc/slabinfo is being used, slabtop is installed and deployed, so why break it unnecessarily? It's not like we couldnt remove /proc/slabinfo later on, via the normal route of feature removal. I think Pekka's patch that adds /proc/slabinfo is simple and reasonable enough for 2.6.24. We can get rid of /proc/slabinfo but then it should i think be done via the normal route of Documentation/feature-removal-schedule.txt. Was there any particular problem with /proc/slabinfo, at least as far as the read-only access that slabtop does goes? I dont think there was. So why should we go out on a limb _not_ providing it when there's a patch available, etc. I just googled a bit and people _have_ asked about slabtop availability, and the impression was that it would be offered by the time SLUB becomes the default. (and this was mentioned by others in this discussion) I'd also not rely on the fact that only a few people are complaining. Most people, even 2.6.24-rc early adopters, still use SLAB because early adopters typically use their .23 .config and do a 'make oldconfig' - which picks up SLAB. So SLUB use will become more widespread only once 2.6.24 is out and is packaged in distros. Distros will likely pick SLUB if there's no performance worries and if it's the default. Fedora rawhide already uses SLUB. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 19:09 ` Ingo Molnar @ 2007-12-22 19:29 ` Ingo Molnar 2007-12-22 22:52 ` Jason L Tibbitts III 1 sibling, 0 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-22 19:29 UTC (permalink / raw) To: Linus Torvalds Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Ingo Molnar <mingo@elte.hu> wrote: > I'd also not rely on the fact that only a few people are complaining. > Most people, even 2.6.24-rc early adopters, still use SLAB because > early adopters typically use their .23 .config and do a 'make > oldconfig' - which picks up SLAB. So SLUB use will become more > widespread only once 2.6.24 is out and is packaged in distros. Distros > will likely pick SLUB if there's no performance worries and if it's > the default. Fedora rawhide already uses SLUB. here's some silly statistics about allocator coverage on lkml, based on configs reported to lkml in the past 4 months:

$ for N in SLAB SLUB SLOB; do printf "%s: " $N; grep ^CONFIG_$N=y linux-kernel | wc -l; done
SLAB: 70
SLUB: 77
SLOB: 4

so SLUB and SLAB are utilized about equally amongst people who reported configs to lkml. But people who use SLUB enabled it intentionally - and they are thus much less likely to complain about this choice of theirs. Reporting: "I just enabled SLUB instead of SLAB, and hey guys, it does not have SLABinfo" has a foolish ring to it, doesnt it? I'd rather ask carefully, like this person did: http://kerneltrap.org/mailarchive/linux-kernel/2007/10/12/335765 This is one of the reasons why i think the whole SLAB->SLUB renaming was bad - we should have done SLAB2 or SLAB^2 instead, so that people expect _at least as good_ behavior (performance, features, etc.) from SLUB as from SLAB. Instead of "something different". anyway ... i think we still generally suck _alot_ at providing near-transparent kernel upgrades to users.
(kernel upgrades are still a pain and risk, and often just due to poor developer choices on our side.) It's nowhere near as bad as the 2.4->2.6 transition was (in fact it shouldnt even be mentioned in the same sentence), and it's getting better gradually, but i think we should just by default be 10 times more careful about these things - whenever it is borderline technically sane to do so. We induce enough unintentional breakage of user-space, we shouldnt compound it with intentional breakages. The kernel community still has _a lot_ of user and distro trust to win back. So being seen as over-cautious wont harm. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 19:09 ` Ingo Molnar 2007-12-22 19:29 ` Ingo Molnar @ 2007-12-22 22:52 ` Jason L Tibbitts III 2007-12-24 3:59 ` Alessandro Suardi 1 sibling, 1 reply; 93+ messages in thread From: Jason L Tibbitts III @ 2007-12-22 22:52 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki >>>>> "IM" == Ingo Molnar <mingo@elte.hu> writes: IM> Distros will likely pick SLUB if there's no performance worries IM> and if it's the default. Fedora rawhide already uses SLUB. Actually, it seems to me that not only does Fedora rawhide use SLUB, but Fedora 8 and 7 use it as well. They don't have /proc/slabinfo and they all seem to have CONFIG_SLUB=y:

> grep -r CONFIG_SLUB=y kernel
kernel/devel/config-generic:CONFIG_SLUB=y
kernel/F-7/configs/config-generic:CONFIG_SLUB=y
kernel/F-8/config-generic:CONFIG_SLUB=y

- J< ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 22:52 ` Jason L Tibbitts III @ 2007-12-24 3:59 ` Alessandro Suardi 0 siblings, 0 replies; 93+ messages in thread From: Alessandro Suardi @ 2007-12-24 3:59 UTC (permalink / raw) To: Jason L Tibbitts III Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On 22 Dec 2007 16:52:56 -0600, Jason L Tibbitts III <tibbs@math.uh.edu> wrote: > >>>>> "IM" == Ingo Molnar <mingo@elte.hu> writes: > > IM> Distros will likely pick SLUB if there's no performance worries > IM> and if it's the default. Fedora rawhide already uses SLUB. > > Actually, it seems to me that not only does Fedora rawhide use SLUB, > but Fedora 8 and 7 use it as well. They don't have /proc/slabinfo and > they all seem to have CONFIG_SLUB=y: > > > grep -r CONFIG_SLUB=y kernel > kernel/devel/config-generic:CONFIG_SLUB=y > kernel/F-7/configs/config-generic:CONFIG_SLUB=y > kernel/F-8/config-generic:CONFIG_SLUB=y Also true for at least recent versions of FC6 (which my bittorrent machine is still on), which currently run 2.6.22-14 based kernels. And no, I didn't notice - that box usually hits triple-digit uptime before I reboot; in fact, it's still running a 2.6.22-9 based kernel. --alessandro "Damaged people are dangerous. They know they can survive." (Anna Barton/Juliette Binoche, 'Damage') ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 18:01 ` Linus Torvalds 2007-12-22 19:09 ` Ingo Molnar @ 2007-12-22 19:25 ` Theodore Tso 2007-12-22 21:00 ` Linus Torvalds 2007-12-22 19:46 ` slabtop replacement was " Andi Kleen 2 siblings, 1 reply; 93+ messages in thread From: Theodore Tso @ 2007-12-22 19:25 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, Dec 22, 2007 at 10:01:37AM -0800, Linus Torvalds wrote: > Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and > the fact that there hasn't been a lot of complaints about it going away > does seem to imply that it wasn't a very important ABI. > > I'm the first to stand up for backwards compatibility, but I also try to > be realistic, and so far nobody has really complained about the fact that > /proc/slabinfo went away on any grounds but on the general principle of > "it was a user-visible ABI". That's probably because the people who are most likely to be using /proc/slabinfo tend to people people using kernels in production environments, and they probably haven't started playing with SLUB yet. So the fact we haven't heard the yelling doesn't necessarily mean that people won't notice. I know of a number of debugging scripts that periodically poll /proc/slabinfo to find out what is using memory and to track trends and predict potential problems with memory usage and balance. Indeed, I've left several such scripts behind at various customer installations after debugging memory usage problems, and some system administrators very much like being able to collect that information easily. So I can say for sure there at there are scripts out there that depend on /proc/slabinfo, but they generally tend to be at sites where people won't be running bleeding edge kernels. > That said, I'm not a *huge* fan of /sys/slab/ either. 
> I can get the info I as a developer tend to want from there, but it's
> really rather less convenient than I'd like. It is really quite hard,
> for example, to get any kind of "who actually uses the most *memory*"
> information out of there.

I have a general problem with things in /sys/slab, and that's just because they are *ugly*. So yes, you can write ugly shell scripts like this to get out information:

> You have to use something like this:
>
> for i in *
> do
> 	order=$(cat "$i/order")
> 	slabs=$(cat "$i/slabs")
> 	object_size=$(cat "$i/object_size")
> 	objects=$(cat "$i/objects")
> 	truesize=$(( $slabs<<($order+12) ))
> 	size=$(( $object_size*$objects ))
> 	percent=$(( $truesize/100 ))
> 	if [ $percent -gt 0 ]; then
> 		percent=$(( $size / $percent ))
> 	fi
> 	mb=$(( $size >> 10 ))
> 	printf "%10d MB %3d %s\n" $mb $percent $i
> done | sort -n | tail

But sometimes when trying to eyeball what is going on, it's a lot nicer just to use "cat /proc/slabinfo". Another problem with using /sys/slab is that it is downright *ugly*. Why is it, for example, that /sys/slab/dentry is a symlink to ../slab/:a-0000160? Because of this, Linus's script reports dentry twice:

10942 MB  86 buffer_head
10942 MB  86 journal_head
10943 MB  86 :a-0000056
21923 MB  86 dentry
21924 MB  86 :a-0000160
78437 MB  94 ext3_inode_cache

Once as "dentry" and once as ":a-0000160". A similar situation happens with journal_head and ":a-0000056". I'm sure this could be filtered out with the right shell or perl script, but now we're getting even farther from /proc/slabinfo. Further, if you look at Documentation/sysfs-rules.txt (does anyone ever read it or bother to update it?), /sys/slab isn't documented so it falls in the category of "everything else is a kernel-level implementation detail which is not guaranteed to be stable". Nice.... Also, looking at sysfs-rules.txt, it talks an awful lot about reading symlinks, and implying that actually trying to *dereference* the symlinks might not actually work.
So I'm pretty sure Linus's "cd /sys/slab; for i in *" violates the Documentation/sysfs-rules.txt guidelines even if /sys/slab were documented. In general, /sys has some real issues that need some careful scrutiny. Documentation/sysfs-rules.txt is *not* enough. For example, various official websites are telling people that the way to enable or disable power-saving is:

echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

(Reference: http://www.lesswatts.org/tips/wireless.php) Aside from the fact that "iwconfig wlan0 power on" is easier to type and simpler to understand, I'm pretty sure that the above violates sysfs-rules.txt. From looking at sysfs-rules.txt, the fact that you have to read symlinks as the only valid and guaranteed way for things to work means that you have to write a perl script to safely manipulate things in /sys. People are ignoring sysfs-rules.txt, and giving advice in websites contrary to sysfs-rules.txt because it's too hard to follow. > I worry about the lack of atomicity in getting the statistics. And that's another problem with trying to use /sys when trying to get statistics in bulk... Very often the tables in /proc/* were not only more convenient, but allowed for atomic access. - Ted ^ permalink raw reply [flat|nested] 93+ messages in thread
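The double counting Ted describes (dentry showing up again as :a-0000160) can be filtered by resolving every directory entry to its symlink target and counting each target only once. A sketch of that idea, run against a scratch directory that mimics the aliasing rather than the real /sys/slab (the cache names below are only examples):

```shell
# Build a scratch directory mimicking /sys/slab's alias symlinks
dir=$(mktemp -d)
mkdir "$dir/:a-0000160"
ln -s ":a-0000160" "$dir/dentry"        # alias pointing at the merged cache
mkdir "$dir/ext3_inode_cache"

# Count each underlying cache once, whatever it is named
seen=""
count=0
for i in "$dir"/*; do
	target=$(readlink -f "$i")      # aliases resolve to the same target
	case " $seen " in
		*" $target "*) continue ;;  # already counted under another name
	esac
	seen="$seen $target"
	count=$(( count + 1 ))
done
echo "$count distinct caches"
rm -rf "$dir"
```

Here the three entries collapse to two distinct caches, since dentry resolves to :a-0000160. The same resolve-then-deduplicate step could be dropped into Linus's loop before the per-cache arithmetic.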
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 19:25 ` Theodore Tso @ 2007-12-22 21:00 ` Linus Torvalds 2007-12-22 21:50 ` Al Viro 2007-12-22 22:10 ` Willy Tarreau 0 siblings, 2 replies; 93+ messages in thread From: Linus Torvalds @ 2007-12-22 21:00 UTC (permalink / raw) To: Theodore Tso Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, 22 Dec 2007, Theodore Tso wrote: > > I have a general problem with things in /sys/slab, and that's just > because they are *ugly*. So yes, you can write ugly shell scripts > like this to get out information: [ script deleted ] > But sometimes when trying to eyeball what is going on, it's a lot > nicer just to use "cat /proc/slabinfo". .. and I call BS on this claim. /proc/slabinfo was (and is) totally pointless for "trying to eyeball what's going on". The output is totally unreadable, and useless. You end up with exactly the same script as above, except it reads as

cat /proc/slabinfo | (read headerline
while read name active num objsize objsperslab pagesperslab rest
do
	realsize=$(( nul * objsize ))
	size=$(( active * objsize ))
	.. exact same rest of loop ..
done | sort -n | ..

so no, "cat /proc/slabinfo" was almost never practical on its own. The *one* advantage it does have is that you can forward it to others. That's a big advantage. But no, it wasn't ever readable for eyeballing, because it doesn't even give you a memory usage thing (just "number of objects and object size" as separate numbers). But the "everything in one file" indubitably did make it a lot easier for just attaching it to bug-reports. > Another problem with using /sys/slab is that it is downright *ugly*. > Why is it for example, that /sys/slab/dentry is a symlink to > ../slab/:a-0000160? That's the only really ugly thing there.
Otherwise, it's pretty nice, but having a million files makes for problems when trying to send somebody else the full info. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
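The per-line arithmetic Linus sketches against /proc/slabinfo can be shown on a single made-up line in the slabinfo 2.x field order he reads (name, active_objs, num_objs, objsize, objperslab, pagesperslab); the numbers below are illustrative, not captured from a real kernel:

```shell
# One illustrative /proc/slabinfo-style data line (header line omitted)
line="dentry_cache 239808 261928 136 29 1"

set -- $line                      # split on whitespace into $1..$6
name=$1; active=$2; num=$3; objsize=$4

size=$(( active * objsize ))      # bytes held by live objects
total=$(( num * objsize ))        # bytes backing all allocated objects
printf "%s: %d KB used of %d KB\n" "$name" $(( size >> 10 )) $(( total >> 10 ))
```

This also shows Linus's point about eyeballing: the raw line gives object counts and object size as separate numbers, and the memory figure people actually want only appears after multiplying them.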
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 21:00 ` Linus Torvalds @ 2007-12-22 21:50 ` Al Viro 2007-12-22 23:29 ` Al Viro 2007-12-22 22:10 ` Willy Tarreau 1 sibling, 1 reply; 93+ messages in thread From: Al Viro @ 2007-12-22 21:50 UTC (permalink / raw) To: Linus Torvalds Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote: > > Another problem with using /sys/slab is that it is downright *ugly*. > > Why is it for example, that /sys/slab/dentry is a symlink to > > ../slab/:a-0000160? > > That's the only really ugly thing there. Otherwise, it's pretty nice, but > having a million files makes for problems when trying to send somebody > else the full info. *raised brows* I would say that there's that really ugly thing with embedding kobject into a struct with the lifetime rules of its own and then having that struct kfree()d while kobject might still have references, but hey, maybe it's just me and my lack of appreciation of the glory that is sysfs. Al, too tired of ranting about sysfs being a major architecture mistake and a recurring source of turds like that one... ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 21:50 ` Al Viro @ 2007-12-22 23:29 ` Al Viro 0 siblings, 0 replies; 93+ messages in thread From: Al Viro @ 2007-12-22 23:29 UTC (permalink / raw) To: Linus Torvalds Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, Dec 22, 2007 at 09:50:09PM +0000, Al Viro wrote: > On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote: > > > > Another problem with using /sys/slab is that it is downright *ugly*. > > > Why is it for example, that /sys/slab/dentry is a symlink to > > > ../slab/:a-0000160? > > > > That's the only really ugly thing there. Otherwise, it's pretty nice, but > > having a million files makes for problems when trying to send somebody > > else the full info. > > *raised brows* > > I would say that there's that really ugly thing with embedding kobject > into a struct with the lifetime rules of its own and then having that > struct kfree()d while kobject might still have references, but hey, > maybe it's just me and my lack of appreciation of the glory that is > sysfs. > > Al, too tired of ranting about sysfs being a major architecture > mistake and a recurring source of turds like that one... BTW, the fact that presence of that kobject is conditional makes life even uglier - we have to either kfree() or drop a reference to kobject leaving actual kfree() to its ->release(). While we are at it, when do we remove the symlinks? That got to add even more fun for the lifetime rules... Sigh... How does one set a script that would mail a warning upon git pull that introduces any instances of keyword from given set? I hadn't noticed that slub was using sysfs when it went into the tree back in May ;-/ ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 21:00 ` Linus Torvalds 2007-12-22 21:50 ` Al Viro @ 2007-12-22 22:10 ` Willy Tarreau 2007-12-23 0:59 ` Steven Rostedt 1 sibling, 1 reply; 93+ messages in thread From: Willy Tarreau @ 2007-12-22 22:10 UTC (permalink / raw) To: Linus Torvalds Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote: > > > On Sat, 22 Dec 2007, Theodore Tso wrote: > > > > I have a general problem with things in /sys/slab, and that's just > > because they are *ugly*. So yes, you can write ugly shell scripts > > like this to get out information: > > [ script deleted ] > > > But sometimes when trying to eyeball what is going on, it's a lot > > nicer just to use "cat /proc/slabinfo". > > .. and I call BS on this claim. > > /proc/slabinfo was (and is) totally pointless for "trying to eyeball > what's going on". The output is totally unreadable, and useless. You end > up with exactly the same script as above, except it reads as > > cat /proc/slabinfo | (read headerline > while read name active num objsize objsperslab pagesperslab rest > do > realsize=$(( nul * objsize )) > size=$(( active * objsize )) > .. exact same rest of loop .. > done | sort -n | .. > > so no, "cat /proc/slabinfo" was almost never practical on its own. > > The *one* advantage it does have is that you can forward it to others. > That's a big advantage. But no, it wasn't ever readable for eyeballing, > because it doesn't even give you a memory usage thing (just "number of > objects and object size" as separate numbers). I don't agree with you Linus. I'm one of those used to quickly take a look at its contents when I don't know where all my memory has gone. 
I know by experience that if I find 6-digit values in dentry_cache or inode_cache, all I was looking for is there, and that's enough for a quick diag. It doesn't give anything accurate, but it's very useful to quickly diag out some problems on production systems. It was this week that I noticed for the first time that slabinfo did not exist anymore, and did not know what replaced it. Lacking time, I finally gave up and *supposed* that the memory was eaten by the usual suspects. I can understand that it has to go away for technical reasons, but Ted is right, please don't believe that nobody uses it just because you got no complaint. While people are not likely to perform all computations in scripts, at least they're used to find some quickly identifiable patterns there. Willy ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 22:10 ` Willy Tarreau @ 2007-12-23 0:59 ` Steven Rostedt 2007-12-23 5:12 ` Willy Tarreau 0 siblings, 1 reply; 93+ messages in thread From: Steven Rostedt @ 2007-12-23 0:59 UTC (permalink / raw) To: Willy Tarreau Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki [/me sneaks away from the family] On Sat, 22 Dec 2007, Willy Tarreau wrote: > On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote: > > On Sat, 22 Dec 2007, Theodore Tso wrote: > > > But sometimes when trying to eyeball what is going on, it's a lot > > > nicer just to use "cat /proc/slabinfo". > > > > .. and I call BS on this claim. > > [...] > > I can understand that it has to go away for technical reasons, but Ted > is right, please don't believe that nobody uses it just because you got > no complaint. While people are not likely to perform all computations > in scripts, at least they're used to find some quickly identifiable > patterns there. > I know when I'm looking for memory leaks, I've asked customers to give snapshots of slabinfo at periodic times (once a day even, matters how bad the leak is). This has been helpful in seeing if something did indeed leak. If you have a slab cache that constantly grows, and never shrinks, that's a good indication that something might be leaking. Not always, since there can be legitimate reasons for that, but sometimes it helps. But I still scratch my head when ever I need to touch sysfs. [/me runs back to the family] -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-23 0:59 ` Steven Rostedt @ 2007-12-23 5:12 ` Willy Tarreau 2007-12-23 14:15 ` Andi Kleen 0 siblings, 1 reply; 93+ messages in thread From: Willy Tarreau @ 2007-12-23 5:12 UTC (permalink / raw) To: Steven Rostedt Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sat, Dec 22, 2007 at 07:59:47PM -0500, Steven Rostedt wrote: [...] > But I still scratch my head when ever I need to touch sysfs. Same here. In fact, I've always considered that procfs was for humans while sysfs was for tools. sysfs reminds me too much of the unexploitable /devices in Solaris. With the proper tools, I think we can do a lot with it, but it's not as intuitive to find the proper tools as it was to do "ls" followed by "cat" in /proc. Willy ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-23 5:12 ` Willy Tarreau @ 2007-12-23 14:15 ` Andi Kleen 2007-12-24 3:45 ` Theodore Tso 0 siblings, 1 reply; 93+ messages in thread From: Andi Kleen @ 2007-12-23 14:15 UTC (permalink / raw) To: Willy Tarreau Cc: Steven Rostedt, Linus Torvalds, Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > Same here. In fact, I've always considered that procfs was for > humans while sysfs was for tools. sysfs reminds me too much the > unexploitable /devices in Solaris. With the proper tools, I think > we can do a lot with it, but it's not as intuitive to find the > proper tools as it was to do "ls" followed by "cat" in /proc. find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done tends to work reasonably well for a quick overview, but yes cat was nicer for humans. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
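A slightly more defensive variant of Andi's loop can be sketched as follows. The miniature file tree (names and values included) is a made-up stand-in built just for the example; on a real system the root would be a directory under /sys, and `$(cat ...)` replaces the bash-only `$(<$i)` for portability.

```shell
# Build a tiny fake tree so the loop can run anywhere; the cache name
# and attribute values are invented for illustration.
root=$(mktemp -d)
mkdir -p "$root/slab/dentry"
echo 30 > "$root/slab/dentry/objs_per_slab"
echo 132 > "$root/slab/dentry/object_size"

# Dump "relative-path: value" for every readable file, skipping files
# we cannot read instead of aborting the whole loop.
dump=$(find "$root" -type f | sort | while read -r i; do
    [ -r "$i" ] && echo "${i#$root/}: $(cat "$i")"
done)
echo "$dump"
rm -rf "$root"
```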
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-23 14:15 ` Andi Kleen @ 2007-12-24 3:45 ` Theodore Tso 2007-12-24 19:21 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Theodore Tso @ 2007-12-24 3:45 UTC (permalink / raw) To: Andi Kleen Cc: Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sun, Dec 23, 2007 at 03:15:00PM +0100, Andi Kleen wrote: > > Same here. In fact, I've always considered that procfs was for > > humans while sysfs was for tools. sysfs reminds me too much the > > unexploitable /devices in Solaris. With the proper tools, I think > > we can do a lot with it, but it's not as intuitive to find the > > proper tools as it was to do "ls" followed by "cat" in /proc. > > find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done > > tends to work reasonably well for a quick overview, but yes > cat was nicer for humans. Until you start to wonder what the heck :a-0000136 is: /sys/slab/:a-0000136/objs_per_slab: 30 Sigh... - Ted ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-24 3:45 ` Theodore Tso @ 2007-12-24 19:21 ` Christoph Lameter 2007-12-24 23:37 ` Theodore Tso 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-24 19:21 UTC (permalink / raw) To: Theodore Tso Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Sun, 23 Dec 2007, Theodore Tso wrote: > > tends to work reasonably well for a quick overview, but yes > > cat was nicer for humans. > > Until you start to wonder what the heck :a-0000136 is: > > /sys/slab/:a-0000136/objs_per_slab: 30 > > Sigh... That is why there is a slabinfo tool that does all the nice formatting. Do a gcc -o slabinfo Documentation/vm/slabinfo.c Then run slabinfo instead of doing cat /proc/slabinfo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-24 19:21 ` Christoph Lameter @ 2007-12-24 23:37 ` Theodore Tso 2007-12-25 3:34 ` Andi Kleen 2007-12-26 21:31 ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter 0 siblings, 2 replies; 93+ messages in thread From: Theodore Tso @ 2007-12-24 23:37 UTC (permalink / raw) To: Christoph Lameter Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Mon, Dec 24, 2007 at 11:21:14AM -0800, Christoph Lameter wrote: > > That is why there is a slabinfo tool that does all the nice formatting. > > Do a > > gcc -o slabinfo Documentation/vm/slabinfo.c > > Then run slabinfo instead of doing cat /proc/slabinfo So two questions: why isn't -f the default? And is /sys/slab guaranteed to be a stable and permanent interface if the SLAB implementation ever gets ripped out? If so, maybe this should go into util-linux-ng? I am a bit concerned about the lack of atomicity of /sys/slab, but this is a heck of a lot better than many kernel drivers or subsystems which use /sys and then completely punt on any kind of userspace utility, forcing users to type crazy things like echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level - Ted ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-24 23:37 ` Theodore Tso @ 2007-12-25 3:34 ` Andi Kleen 2008-01-01 12:47 ` Theodore Tso 2007-12-26 21:31 ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter 1 sibling, 1 reply; 93+ messages in thread From: Andi Kleen @ 2007-12-25 3:34 UTC (permalink / raw) To: Theodore Tso, Christoph Lameter, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > And is /sys/slab > guaranteed to be a stable and permanent interface if the SLAB Debugging feature and "stable and permanent" just don't fit together. It's like requesting stable and permanent sysrq output. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-25 3:34 ` Andi Kleen @ 2008-01-01 12:47 ` Theodore Tso 2008-01-01 15:19 ` Andi Kleen 2008-01-01 16:23 ` [patch] slub: provide /proc/slabinfo Ingo Molnar 0 siblings, 2 replies; 93+ messages in thread From: Theodore Tso @ 2008-01-01 12:47 UTC (permalink / raw) To: Andi Kleen Cc: Christoph Lameter, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Tue, Dec 25, 2007 at 04:34:18AM +0100, Andi Kleen wrote: > > And is /sys/slab > > guaranteed to be a stable and permanent interface if the SLAB > > Debugging feature and "stable and permanent" just don't > fit together. It's like requesting stable and permanent > sysrq output. Yeah, unfortunately, querying to find out how much memory is being used by which parts of the system is something which I think needs to be more than just a debugging feature. One could argue that "vmstat", "iostat", and "sar" are debugging features as well, but other people would consider them "critical programs to get information necessary to monitor the health of their system". Perhaps /proc/slabinfo should be that relatively stable interface, if /sys/slab can't be guaranteed to be stable. But there *are* people who will want to monitor information like this in an ongoing fashion on production systems. One of the major downsides of /sys seems to be that it's very hard to keep it stable, so maybe it's not the right tool for this. /proc/slabinfo has the advantage that it's a simple text file, and its format is relatively well-understood, and doesn't a patch to provide /proc/slabinfo for SLUB already exist? - Ted ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2008-01-01 12:47 ` Theodore Tso @ 2008-01-01 15:19 ` Andi Kleen 2008-01-01 16:23 ` [patch] slub: provide /proc/slabinfo Ingo Molnar 1 sibling, 0 replies; 93+ messages in thread From: Andi Kleen @ 2008-01-01 15:19 UTC (permalink / raw) To: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki > Yeah, unfortunately, querying to find out how much memory is being > used by which parts of the system is something which I think needs to > be more than just a debugging feature. One could argue that "vmstat", slabinfo is about querying kernel internals and requiring stable and permanent interfaces to that would imply freezing the kernel internals. People who want to have information about slabs and other similar kernel internal allocations will always have to deal with version dependencies. No way around that. > Perhaps /proc/slabinfo should be that relatively stable interface, if That would require at least never changing any slab caches in any subsystem. Obviously not an option. -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* [patch] slub: provide /proc/slabinfo 2008-01-01 12:47 ` Theodore Tso 2008-01-01 15:19 ` Andi Kleen @ 2008-01-01 16:23 ` Ingo Molnar 2008-01-05 0:27 ` Arjan van de Ven 1 sibling, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2008-01-01 16:23 UTC (permalink / raw) To: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau, Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Theodore Tso <tytso@MIT.EDU> wrote: > [...] doesn't a patch to provide /proc/slabinfo for SLUB already > exist? yes, the complete (and tested) patch is below. It has been through a few thousand random-bootup tests on x86 32-bit and 64-bit so it's v2.6.24 material i think. Ingo -----------------> Subject: slub: provide /proc/slabinfo From: Pekka J Enberg <penberg@cs.helsinki.fi> This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work. [ mingo@elte.hu: build fix. ] Cc: Andi Kleen <andi@firstfloor.org> Cc: Christoph Lameter <clameter@sgi.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/slub_def.h | 2 mm/slub.c | 105 +++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 94 insertions(+), 13 deletions(-) Index: linux/include/linux/slub_def.h =================================================================== --- linux.orig/include/linux/slub_def.h +++ linux/include/linux/slub_def.h @@ -200,4 +200,6 @@ static __always_inline void *kmalloc_nod } #endif +extern const struct seq_operations slabinfo_op; + #endif /* _LINUX_SLUB_DEF_H */ Index: linux/mm/slub.c =================================================================== --- linux.orig/mm/slub.c +++ linux/mm/slub.c @@ -3076,6 +3076,19 @@ void *__kmalloc_node_track_caller(size_t return slab_alloc(s, gfpflags, node, caller); } +static unsigned long count_partial(struct kmem_cache_node *n) +{ + unsigned long flags; + unsigned long x = 0; + struct 
page *page; + + spin_lock_irqsave(&n->list_lock, flags); + list_for_each_entry(page, &n->partial, lru) + x += page->inuse; + spin_unlock_irqrestore(&n->list_lock, flags); + return x; +} + #if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG) static int validate_slab(struct kmem_cache *s, struct page *page, unsigned long *map) @@ -3458,19 +3471,6 @@ static int list_locations(struct kmem_ca return n; } -static unsigned long count_partial(struct kmem_cache_node *n) -{ - unsigned long flags; - unsigned long x = 0; - struct page *page; - - spin_lock_irqsave(&n->list_lock, flags); - list_for_each_entry(page, &n->partial, lru) - x += page->inuse; - spin_unlock_irqrestore(&n->list_lock, flags); - return x; -} - enum slab_stat_type { SL_FULL, SL_PARTIAL, @@ -4123,3 +4123,82 @@ static int __init slab_sysfs_init(void) __initcall(slab_sysfs_init); #endif + +/* + * The /proc/slabinfo ABI + */ +#ifdef CONFIG_PROC_FS + +static void print_slabinfo_header(struct seq_file *m) +{ + seq_puts(m, "slabinfo - version: 2.1\n"); + seq_puts(m, "# name <active_objs> <num_objs> <objsize> " + "<objperslab> <pagesperslab>"); + seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>"); + seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>"); + seq_putc(m, '\n'); +} + +static void *s_start(struct seq_file *m, loff_t *pos) +{ + loff_t n = *pos; + + down_read(&slub_lock); + if (!n) + print_slabinfo_header(m); + + return seq_list_start(&slab_caches, *pos); +} + +static void *s_next(struct seq_file *m, void *p, loff_t *pos) +{ + return seq_list_next(p, &slab_caches, pos); +} + +static void s_stop(struct seq_file *m, void *p) +{ + up_read(&slub_lock); +} + +static int s_show(struct seq_file *m, void *p) +{ + unsigned long nr_partials = 0; + unsigned long nr_slabs = 0; + unsigned long nr_inuse = 0; + unsigned long nr_objs; + struct kmem_cache *s; + int node; + + s = list_entry(p, struct kmem_cache, list); + + for_each_online_node(node) { + struct kmem_cache_node *n = get_node(s, 
node); + + if (!n) + continue; + + nr_partials += n->nr_partial; + nr_slabs += atomic_long_read(&n->nr_slabs); + nr_inuse += count_partial(n); + } + + nr_objs = nr_slabs * s->objects; + nr_inuse += (nr_slabs - nr_partials) * s->objects; + + seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse, + nr_objs, s->size, s->objects, (1 << s->order)); + seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0); + seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs, + 0UL); + seq_putc(m, '\n'); + return 0; +} + +const struct seq_operations slabinfo_op = { + .start = s_start, + .next = s_next, + .stop = s_stop, + .show = s_show, +}; + +#endif /* CONFIG_PROC_FS */ ^ permalink raw reply [flat|nested] 93+ messages in thread
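For reference, a data line in the slabinfo 2.1 layout that the s_show() above emits can be pulled apart with standard tools. The line below and all of its numbers are invented for the example (they are chosen to be internally consistent: 20 slabs of 32 objects each), but the column order follows the seq_printf format in the patch: name, active_objs, num_objs, objsize, objperslab, pagesperslab.

```shell
# One illustrative data line in the slabinfo 2.1 layout; the cache
# name and all numbers are made up for the example.
line='kmalloc-128        600    640    128   32    1 : tunables    0    0    0 : slabdata     20     20      0'

# Columns before the first ':' are name, active_objs, num_objs,
# objsize, objperslab, pagesperslab; pick out two of them.
active=$(echo "$line" | awk '{print $2}')
objsize=$(echo "$line" | awk '{print $4}')
echo "active=$active objsize=$objsize"
```

This fixed column layout is what lets tools like slabtop keep working unchanged on top of the new /proc/slabinfo.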
* Re: [patch] slub: provide /proc/slabinfo 2008-01-01 16:23 ` [patch] slub: provide /proc/slabinfo Ingo Molnar @ 2008-01-05 0:27 ` Arjan van de Ven 2008-01-05 0:49 ` Linus Torvalds 0 siblings, 1 reply; 93+ messages in thread From: Arjan van de Ven @ 2008-01-05 0:27 UTC (permalink / raw) To: Ingo Molnar Cc: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau, Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Tue, 1 Jan 2008 17:23:28 +0100 Ingo Molnar <mingo@elte.hu> wrote: > > * Theodore Tso <tytso@MIT.EDU> wrote: > > > [...] doesn't a patch to provide /proc/slabinfo for SLUB already > > exist? > > yes, the complete (and tested) patch is below. It has been through a > few thousand random-bootup tests on x86 32-bit and 64-bit so it's > v2.6.24 material i think. > +static void *s_start(struct seq_file *m, loff_t *pos) > +{ > + loff_t n = *pos; > + > + down_read(&slub_lock); > + if (!n) > + print_slabinfo_header(m); > + > + return seq_list_start(&slab_caches, *pos); > +} > + this is the part I'm not very thrilled about... at least on first sight it looks like a user can now hold the read sem over system calls, and for as long as it wants. Either that's no problem (but then we wouldn't need the lock in the first place) or it is a problem that then wants to be fixed ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: [patch] slub: provide /proc/slabinfo 2008-01-05 0:27 ` Arjan van de Ven @ 2008-01-05 0:49 ` Linus Torvalds 0 siblings, 0 replies; 93+ messages in thread From: Linus Torvalds @ 2008-01-05 0:49 UTC (permalink / raw) To: Arjan van de Ven Cc: Ingo Molnar, Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau, Steven Rostedt, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Fri, 4 Jan 2008, Arjan van de Ven wrote: > > this is the part I'm not very thrilled about... at least on first sight > it looks like a user can now hold the read sem over system calls, and > for as long as it wants. No, you misunderstand how seq-files work. The start/stop sequence is done only over one single kernel buffer instance, not over the whole open/close (or even the copy to user space). It's expressly designed so that you can hold locks (including spinlocks etc), and will not do any blocking ops (well - your *callbacks* can do blocking ops, but the seq_file stuff itself won't) between start/stop. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-24 23:37 ` Theodore Tso 2007-12-25 3:34 ` Andi Kleen @ 2007-12-26 21:31 ` Christoph Lameter 2007-12-26 22:16 ` Al Viro 1 sibling, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-26 21:31 UTC (permalink / raw) To: Theodore Tso Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Mon, 24 Dec 2007, Theodore Tso wrote: > So two questions: why isn't -f the default? And is /sys/slab Because it gives misleading output. It displays the name of the first of multiple slabs that share the same storage structures. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-26 21:31 ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter @ 2007-12-26 22:16 ` Al Viro 2007-12-27 5:51 ` Christoph Lameter 2007-12-28 9:00 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 0 siblings, 2 replies; 93+ messages in thread From: Al Viro @ 2007-12-26 22:16 UTC (permalink / raw) To: Christoph Lameter Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Wed, Dec 26, 2007 at 01:31:35PM -0800, Christoph Lameter wrote: > On Mon, 24 Dec 2007, Theodore Tso wrote: > > > So two questions: why isn't -f the default? And is /sys/slab > > Because it gives misleading output. It displays the name of the first > of multiple slabs that share the same storage structures. Erm... Let me spell it out: current lifetime rules are completely broken. As it is, a create/destroy/create cache sequence will do kobject_put() on a kfree'd object. Even without people playing with holding sysfs files open or doing IO on those. a) you have a kobject embedded into a struct with lifetime rules of its own. When its refcount hits zero you kfree() the sucker, even if you still have references to the embedded kobject. b) your symlinks stick around. Even when the cache is long gone you still have a sysfs symlink with its embedded kobject as a target. They are eventually removed when a cache with the same name gets created. _Then_ you get the target kobject dropped - when the memory it used to be in had been freed for hell knows how long and reused by something that would not appreciate slub.c code suddenly deciding to decrement some word in that memory. c) you leak references to these kobjects; kobject_del() only removes it from the tree, undoing the effect of kobject_add(), and you still need kobject_put() to deal with the last reference.
^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-26 22:16 ` Al Viro @ 2007-12-27 5:51 ` Christoph Lameter 2007-12-27 20:28 ` SLUB sysfs support Christoph Lameter 2007-12-28 9:00 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 1 sibling, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-27 5:51 UTC (permalink / raw) To: Al Viro Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Wed, 26 Dec 2007, Al Viro wrote: > Erm... Let me spell it out: current lifetime rules are completely broken. > As it is, create/destroy/create cache sequence will do kobject_put() on > kfree'd object. Even without people playing with holding sysfs files > open or doing IO on those. Urgh. Help! Isn't there some sanity with sysfs? > a) you have kobject embedded into struct with the lifetime rules of its > own. When its refcount hits zero you kfree() the sucker, even if you > still have references to embedded kobject. So allocate it separately? > b) your symlinks stick around. Even when cache is long gone you still > have a sysfs symlink with its embedded kobject as a target. They are > eventually removed when cache with the same name gets created. _Then_ > you get the target kobject dropped - when the memory it used to be in > had been freed for hell knows how long and reused by something that would > not appreciate slub.c code suddenly deciding to decrement some word in > that memory. Hmmmm.. That would also be addressed by having a separately allocated kobject? > c) you leak references to these kobject; kobject_del() only removes it > from the tree undoing the effect of kobject_add() and you still need > kobject_put() to deal with the last reference. Patches would be appreciated.... I had a hard time getting the sysfs support to work right. ^ permalink raw reply [flat|nested] 93+ messages in thread
* SLUB sysfs support 2007-12-27 5:51 ` Christoph Lameter @ 2007-12-27 20:28 ` Christoph Lameter 2007-12-27 22:59 ` Al Viro 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-27 20:28 UTC (permalink / raw) To: Al Viro Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki Hmmm.. If I separately allocate the kobject then I can no longer get to the kmem_cache structure from the kobject. I need to add a second kobject_del to sysfs_slab_remove() to make sysfs completely forget about the object? Probably should track down any remaining symlinks at that point and nuke them too. Isn't there some way to convince sysfs to remove the symlinks if the target vanishes? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 20:28 ` SLUB sysfs support Christoph Lameter @ 2007-12-27 22:59 ` Al Viro 2007-12-27 23:22 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Al Viro @ 2007-12-27 22:59 UTC (permalink / raw) To: Christoph Lameter Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote: > Hmmm.. If I separately allocate the kobject then I can no longer get to > the kmem_cache structure from the kobject. > > I need to add a second kobject_del to sysfs_slab_remove() to make sysfs > completely forget about the object? > > Probably should track down any remaining symlinks at that point and nuke > them too. Isnt there some way to convince sysfs to remove the symlinks > if the target vanishes? Don't bother with separate allocation. a) remove symlink when slab goes away b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff c) have ->release() of these kobjects do kfree() ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 22:59 ` Al Viro @ 2007-12-27 23:22 ` Christoph Lameter 2007-12-27 23:53 ` Christoph Lameter 2007-12-27 23:54 ` Al Viro 1 sibling, 2 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-27 23:22 UTC (permalink / raw) To: Al Viro Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Thu, 27 Dec 2007, Al Viro wrote: > On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote: > > Hmmm.. If I separately allocate the kobject then I can no longer get to > > the kmem_cache structure from the kobject. > > > > I need to add a second kobject_del to sysfs_slab_remove() to make sysfs > > completely forget about the object? > > > > Probably should track down any remaining symlinks at that point and nuke > > them too. Isnt there some way to convince sysfs to remove the symlinks > > if the target vanishes? > > Don't bother with separate allocation. > > a) remove symlink when slab goes away Ok. Need to think about how to code that. > b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff Hmmmm.... Okay. Patch follows but it's strange to do a kobject_put after a kobject_del(). > c) have ->release() of these kobjects do kfree() In slab_ktype?
--- mm/slub.c | 15 +++++++++++++-- 1 file changed, 13 insertions(+), 2 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-12-27 15:09:38.000000000 -0800 +++ linux-2.6/mm/slub.c 2007-12-27 15:21:12.000000000 -0800 @@ -247,7 +247,10 @@ static void sysfs_slab_remove(struct kme static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; } static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p) { return 0; } -static inline void sysfs_slab_remove(struct kmem_cache *s) {} +static inline void sysfs_slab_remove(struct kmem_cache *s) +{ + kfree(s); +} #endif /******************************************************************** @@ -2322,7 +2325,6 @@ void kmem_cache_destroy(struct kmem_cach if (kmem_cache_close(s)) WARN_ON(1); sysfs_slab_remove(s); - kfree(s); } else up_write(&slub_lock); } @@ -3940,6 +3942,13 @@ static ssize_t slab_attr_store(struct ko return err; } +static void slab_release(struct kobject *kobj) +{ + struct kmem_cache *s = to_slab(kobj); + + kfree(s); +} + static struct sysfs_ops slab_sysfs_ops = { .show = slab_attr_show, .store = slab_attr_store, @@ -3947,6 +3956,7 @@ static struct sysfs_ops slab_sysfs_ops = static struct kobj_type slab_ktype = { .sysfs_ops = &slab_sysfs_ops, + .release = slab_release }; static int uevent_filter(struct kset *kset, struct kobject *kobj) @@ -4048,6 +4058,7 @@ static void sysfs_slab_remove(struct kme { kobject_uevent(&s->kobj, KOBJ_REMOVE); kobject_del(&s->kobj); + kobject_put(&s->kobj); } /* ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 23:22 ` Christoph Lameter @ 2007-12-27 23:53 ` Christoph Lameter 2007-12-27 23:58 ` Al Viro 2007-12-27 23:54 ` Al Viro 1 sibling, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-27 23:53 UTC (permalink / raw) To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Thu, 27 Dec 2007, Christoph Lameter wrote: > > a) remove symlink when slab goes away > > Ok. Need to think about how to code that. How do I iterate over all symlinks in /sys/kernel/slab/*? I remember trying to do it before and not being able to find a sysfs method for that. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 23:53 ` Christoph Lameter @ 2007-12-27 23:58 ` Al Viro 2007-12-28 0:02 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Al Viro @ 2007-12-27 23:58 UTC (permalink / raw) To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote: > On Thu, 27 Dec 2007, Christoph Lameter wrote: > > > > a) remove symlink when slab goes away > > > > Ok. Need to think about how to code that. > > How do I iterate over all symlinks in /sys/kernel/slab/*? > > I remember trying to do it before and not being able to find a sysfs > method for that. What the hell do you need that for? You have an obvious moment when /sys/kernel/slab/cache_name -> <whatever> should be gone - it's when you remove the slab called cache_name. And at that point you don't need to iterate over anything - you have the exact name, for fsck sake... Why would you want these symlinks to stick around for longer than that? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 23:58 ` Al Viro @ 2007-12-28 0:02 ` Christoph Lameter 2007-12-28 0:45 ` Al Viro 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-28 0:02 UTC (permalink / raw) To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Thu, 27 Dec 2007, Al Viro wrote: > On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote: > > On Thu, 27 Dec 2007, Christoph Lameter wrote: > > > > > > a) remove symlink when slab goes away > > > > > > Ok. Need to think about how to code that. > > > > How do I iterate over all symlinks in /sys/kernel/slab/*? > > > > I remember trying to do it before and not being able to find a sysfs > > method for that. > > What the hell do you need that for? You have an obvious moment when > /sys/kernel/slab/cache_name -> <whatever> should be gone - it's when > you remove the slab called cache_name. And at that point you don't > need to iterate over anything - you have the exact name, for fsck sake... > > Why would you want these symlinks to stick around for longer than that? /sys/kernel/slab/cache_name is a real directory but then there are the aliases in /sys/kernel/slab/alias_name pointing to that directory that also have to be removed. So I need to scan for symlinks in /sys/kernel/slab/* pointing to /sys/kernel/slab/cache_name. ^ permalink raw reply [flat|nested] 93+ messages in thread
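The alias scan Christoph describes can be sketched over a mock directory tree. The cache and alias names below are made up for illustration (the `:a-0000136` style echoes the merged-cache names seen earlier in the thread), and on a real system the directory would be /sys/kernel/slab.

```shell
# Mock of the layout under discussion: one real cache directory plus
# alias symlinks pointing at it; all names are invented.
root=$(mktemp -d)
mkdir "$root/:a-0000136"
ln -s ":a-0000136" "$root/foo"
ln -s ":a-0000136" "$root/bar"
mkdir "$root/other_cache"

# Find every alias of a given cache: the symlinks whose target is the
# cache's directory name.
cache=":a-0000136"
aliases=$(cd "$root" && for l in *; do
    [ -L "$l" ] && [ "$(readlink "$l")" = "$cache" ] && echo "$l"
done | sort)
echo "$aliases"
rm -rf "$root"
```

This is the userspace view of the problem; doing the equivalent scan from inside the kernel is exactly what Christoph could not find a sysfs method for.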
* Re: SLUB sysfs support 2007-12-28 0:02 ` Christoph Lameter @ 2007-12-28 0:45 ` Al Viro 2007-12-28 2:19 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Al Viro @ 2007-12-28 0:45 UTC (permalink / raw) To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Thu, Dec 27, 2007 at 04:02:56PM -0800, Christoph Lameter wrote: > > Why would you want these symlinks to stick around for longer than that? > > /sys/kernel/slab/cache_name is a real directory but then there are the > aliases in /sys/kernel/slab/alias_name pointing to that directory that > also have to be removed. > > So I need to scan for symlinks in /sys/kernel/slab/* pointing to > /sys/kernel/slab/cache_name. Oh, lovely. So we can have module A do kmem_cache_create(), calling cache "foo". Then module B does (without any knowledge about A) completely unrelated kmem_cache_create(), calling the sucker "bar". mm/slub decides that they are mergable. Then we get rmmod A... and no way to find out if that's foo or bar going away - kmem_cache_destroy() doesn't have enough information to figure that out. So we have to keep both slab/foo and slab/bar around until both are gone or until somebody kind enough creates a cache called foo. Better yet, on some systems you get things like slab/nfsd4_delegations sticking around long after nfsd is gone just because slub.c decides that it's sharable, but in the next kernel version the thing it's shared with gets different object size and behaviour suddenly changes - now it's gone as soon as nfsd is gone. Brilliant interface, that... ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-28 0:45 ` Al Viro @ 2007-12-28 2:19 ` Christoph Lameter 2007-12-28 3:26 ` Al Viro 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-28 2:19 UTC (permalink / raw) To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Fri, 28 Dec 2007, Al Viro wrote: > Oh, lovely. So we can have module A do kmem_cache_create(), calling > cache "foo". Then module B does (without any knowlegde about A) > completely unrelated kmem_cache_create(), calling the sucker "bar". > mm/slub decides that they are mergable. Then we get rmmod A... and > no way to find out if that's foo or bar going away - kmem_cache_destroy() > doesn't have enough information to figure that out. So we have to keep > both slab/foo and slab/bar around until both are gone or until somebody > kind enough creates a cache called foo. Better yet, on some systems you > get things like slab/nfsd4_delegations sticking around long after nfsd is > gone just because slub.c decides that it's sharable, but in the next > kernel version the thing it's shared with gets different object size and > behaviour suddenly changes - now it's gone as soon as nfsd is gone. > Brilliant interface, that... Just enable slab debugging and all of that goes away and you get the usual error messages. Otherwise slub tries to compress the memory use as much as possible. The advantage here is that you can add an arbitrary number of kmem_cache_create() calls without increasing memory use significantly. It's cheap to create kmem_caches. Thus you can create arbitrary caches without worrying about memory use and have meaningful variables on which you do kmem_cache_alloc instead of kmalloc/kfree. kmalloc/kfree waste space since they round up to powers of two. And kmalloc/kfree have to calculate the cache at runtime whereas kmem_cache_alloc/free does the lookup once. The amount of per cpu stuff we keep around that is idle is reduced by 50%.
This increases cpu cache usage and reduces the memory footprint. The per cpu structures have to hold per cpu reserves and it is not good to waste too much memory there. nfsd4_delegations? What is this about? How do I scan for the symlinks in sysfs? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-28 2:19 ` Christoph Lameter @ 2007-12-28 3:26 ` Al Viro 2008-01-02 20:25 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Al Viro @ 2007-12-28 3:26 UTC (permalink / raw) To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote: > nfsd4_delegations? What is this about? The random lifetimes of user-visible files you create in sysfs. > How do I scan for the symlinks in sysfs? At which point are you going to do that? AFAICS, the fundamental problem is that you * have indistinguishable aliases, so kmem_cache_destroy() can't tell which one is going away, no matter what * have per-alias objects in sysfs As a result, you have a user-visible mess in that directory in sysfs. And I don't see how you would deal with that - at the level of "the contents of the directory change in so-and-so way when such-and-such operation is done", not at the level of implementation details. BTW, I'm rather sceptical about free use of slabs; keep in mind that their names have to be unique with your sysfs layout, so... ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-28 3:26 ` Al Viro @ 2008-01-02 20:25 ` Christoph Lameter 0 siblings, 0 replies; 93+ messages in thread From: Christoph Lameter @ 2008-01-02 20:25 UTC (permalink / raw) To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki On Fri, 28 Dec 2007, Al Viro wrote: > On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote: > > nfsd4_delegations? What is this about? > > The random lifetimes of user-visible files you create in sysfs. Well these are symlinks. > > How do I scan for the symlinks in sysfs? > > At which point are you going to do that? AFAICS, the fundamental problem > is that you > * have aliases indistinguishable, so kmem_cache_destroy() can't tell > which one is going away, no matter what > * have per-alias objects in sysfs > As the result, you have a user-visible mess in that directory in sysfs. > And I don't see how you would deal with that - on the "the contents of > directory changes in so-and-so way when such-and-such operation is > done", not the implementation details one. If the alias count of a kmem_cache structure reaches zero then all aliases can be taken down. At that point we know that no aliases exist anymore. > BTW, I'm rather sceptical about free use of slabs; keep in mind that their > names have to be unique with your sysfs layout, so... What do you mean by free use of slabs? Creation of new slab caches to avoid the rounding up to order 2 size? ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: SLUB sysfs support 2007-12-27 23:22 ` Christoph Lameter 2007-12-27 23:53 ` Christoph Lameter @ 2007-12-27 23:54 ` Al Viro 1 sibling, 0 replies; 93+ messages in thread From: Al Viro @ 2007-12-27 23:54 UTC (permalink / raw) To: Christoph Lameter Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Thu, Dec 27, 2007 at 03:22:28PM -0800, Christoph Lameter wrote: > > a) remove symlink when slab goes away > > Ok. Need to think about how to code that. Huh? Just call it from kmem_cache_destroy(); what business does that symlink have being around after that point? > > b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff > > Hmmmm.... Okay. Patch follows but its strange to do a kobject_put after a > kobject_del(). kobject_del() undoes the effect of kobject_add(). And then you are left with the refcount you had before kobject_add(), i.e. from kobject_init(). Think of it that way: kobject refcount controls the lifetime of structure the sucker's embedded into. Depending on kobject methods, you might be unable to do cleanup of non-kobject parts of that animal until after kobject_del(). IOW, you _can't_ have kobject_del() dropping the (hopefully) final reference to kobject - only the caller knows what might have to be done in between. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-26 22:16 ` Al Viro 2007-12-27 5:51 ` Christoph Lameter @ 2007-12-28 9:00 ` Ingo Molnar 2007-12-28 14:52 ` Steven Rostedt 1 sibling, 1 reply; 93+ messages in thread From: Ingo Molnar @ 2007-12-28 9:00 UTC (permalink / raw) To: Al Viro Cc: Christoph Lameter, Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki * Al Viro <viro@ZenIV.linux.org.uk> wrote: > > > So two questions: why isn't -f the default? And is /sys/slab > > > > Because it gives misleading output. It displays the name of the > > first of multiple slabs that share the same storage structures. > > Erm... Let me spell it out: current lifetime rules are completely > broken. As it is, a create/destroy/create cache sequence will do > kobject_put() on a kfree'd object. Even without people playing with > holding sysfs files open or doing IO on those. > > a) you have a kobject embedded into a struct with lifetime rules of > its own. When its refcount hits zero you kfree() the sucker, even if > you still have references to the embedded kobject. > > b) your symlinks stick around. Even when the cache is long gone you still > have a sysfs symlink with its embedded kobject as a target. They are > eventually removed when a cache with the same name gets created. _Then_ > you get the target kobject dropped - when the memory it used to be in > had been freed for hell knows how long and reused by something that > would not appreciate slub.c code suddenly deciding to decrement some > word in that memory. > > c) you leak references to these kobjects; kobject_del() only removes it > from the tree undoing the effect of kobject_add() and you still need > kobject_put() to deal with the last reference. as a sidenote: bugs like this seem to be recurring. 
People implement sysfs bindings (without being sysfs internals experts - and why should they be) - and create hard to debug problems. We've seen that with the scheduler's recent sysfs changes too. shouldnt the sysfs code be designed in a way to not allow such bugs? The primary usecase of sysfs is by people who do _not_ deal with it on a daily basis. So if they pick APIs that look obvious and create hard to debug problems (and userspace incompatibilities) that's a primary failure of sysfs, not a failure of those who utilize it. At a minimum there should be some _strong_ debugging facility that transparently detects and reports such bugs as they occur. CONFIG_DEBUG_KOBJECT is totally unusable right now, it spams the syslog (so no distro ever enables it - i disable it in random bootups as well because it takes _ages_ to even get to a boot prompt) and never finds any of these hard-to-find-but-easy-to-explain bugs. or if sysfs/kobjects should be scrapped and rewritten, do you have any insight into what kind of abstraction could/should replace it? Should we go back to procfs and get rid of kobjects altogether? (as it's slowly turning into a /proc problem of itself, with worse compatibility and sneakier bugs.) Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-28 9:00 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar @ 2007-12-28 14:52 ` Steven Rostedt 0 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-28 14:52 UTC (permalink / raw) To: Ingo Molnar Cc: Al Viro, Christoph Lameter, Theodore Tso, Andi Kleen, Willy Tarreau, Linus Torvalds, Peter Zijlstra, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On Fri, 28 Dec 2007, Ingo Molnar wrote: > > or if sysfs/kobjects should be scrapped and rewritten, do you have any > insight into what kind of abstraction could/should replace it? Should we > go back to procfs and get rid of kobjects altogether? (as it's slowly > turning into a /proc problem of itself, with worse compatibility and > sneakier bugs.) > The few times I've used kobjects/sysfs, I would always need to "relearn" how to use it. It's not trivial at all, and even when I finally get it working, I'm not sure it's working correctly. By contrast, something like debugfs took me a few minutes to use. Now whenever I need to do something that userspace uses, I simply choose debugfs. I thought the switch to sysfs was to keep procfs cleaner and use it only for processes. Why did it need something as complex as the kobject structure? Was it so that udev could integrate with it? If sysfs were as simple as debugfs, I think a lot of these bugs would not have occurred. -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 18:01 ` Linus Torvalds 2007-12-22 19:09 ` Ingo Molnar 2007-12-22 19:25 ` Theodore Tso @ 2007-12-22 19:46 ` Andi Kleen 2007-12-22 23:28 ` Karol Swietlicki 2 siblings, 1 reply; 93+ messages in thread From: Andi Kleen @ 2007-12-22 19:46 UTC (permalink / raw) To: Linus Torvalds Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki As a followup to the thread regarding slabinfo and Andrew's earlier query about updating slabtop: watch -n1 slabinfo -S seems to be a reasonable replacement for slabtop. The only missing functionality is the hotkeys: you have to restart it to get a different sort order (and read the slabinfo source to find out the correct option), but other than that it works. So I would propose to just retire slabtop and replace it with a simple shell script doing that. A manpage for slabinfo would be useful though. Anybody volunteering to write one? -Andi ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 19:46 ` slabtop replacement was " Andi Kleen @ 2007-12-22 23:28 ` Karol Swietlicki 2007-12-29 18:08 ` Karol Swietlicki 0 siblings, 1 reply; 93+ messages in thread From: Karol Swietlicki @ 2007-12-22 23:28 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Ingo Molnar, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On 22/12/2007, Andi Kleen <andi@firstfloor.org> wrote: > A manpage for slabinfo would be useful though. Anybody > volunteering to write one? > > -Andi That would be me. I'm a newbie and have never written a man page before, so it will take a few days, but I'm bored and out of ideas for any new code for the time being. Should be fun. Karol Swietlicki ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers) 2007-12-22 23:28 ` Karol Swietlicki @ 2007-12-29 18:08 ` Karol Swietlicki 0 siblings, 0 replies; 93+ messages in thread From: Karol Swietlicki @ 2007-12-29 18:08 UTC (permalink / raw) To: Andi Kleen Cc: Linus Torvalds, Ingo Molnar, Christoph Lameter, Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki On 23/12/2007, Karol Swietlicki <magotari@gmail.com> wrote: > On 22/12/2007, Andi Kleen <andi@firstfloor.org> wrote: > > A manpage for slabinfo would be useful though. Anybody > > volunteering to write one? > > > > -Andi > > That would be me. > I'm a newbie and never wrote a man page before, so it will take a few > days, but I'm bored and out of ideas for any new code for the moment > being. Should be fun. > > Karol Swietlicki I'm dropping out from this one. I worked for a good while, and I really see no way of improving what the usage information (slabinfo -h) already provides. It's a good description, and I'm starting to think that adding a man page is overkill. If I am misunderstanding something and what is desired is in fact usage information in the man page format, I can provide such a page, no problem. At least now I know how to write man pages. Karol Swietlicki ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:11 ` Christoph Lameter 2007-12-21 22:16 ` Peter Zijlstra @ 2007-12-21 22:49 ` Linus Torvalds 1 sibling, 0 replies; 93+ messages in thread From: Linus Torvalds @ 2007-12-21 22:49 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Christoph Lameter wrote: > > It obviously can replace it and has replaced it for awhile now. No. If there are 6% performance regressions on TPC-C, then it CAN NOT replace it! > Well still SLUB is really superior to SLAB in many ways.... > > - SLUB is performance wise much faster than SLAB. No. Christoph, statements like this are *exactly* why I don't think SLUB can make it. You're closing your eyes to real performance *regression* reports, and then you claim that SLUB is faster, because you find your own microbenchmarks that show so for specific loads. But those specific loads are apparently never the issue. So stop saying that SLUB is "much faster", as long as hackbench shows that that is simply NOT THE CASE. It doesn't matter one whit if SLUB is lots faster, if it's faster for cases that never matter. So far, I don't think we've *ever* seen any actual benchmarks that involve any kind of real use where SLUB is really faster, and we have some major examples of SLUB being disastrously slower! Your special-case kmalloc performance tests don't matter, when real user programs show the exact opposite effect. And the fact that you dismiss those real user programs just because you have your own test harness is why I'm ready to throw in the towel on SLUB. I don't mind performance regressions, but I *do* mind performance regressions when the main author then says "look, a helicopter!" and points to some totally different thing and claims that the performance regression doesn't even exist! 
Because those kinds of performance regressions never get fixed, because they are ignored. Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 12:09 ` Ingo Molnar 2007-12-21 12:26 ` Ingo Molnar 2007-12-21 16:18 ` Christoph Lameter @ 2007-12-21 17:44 ` Pekka Enberg 2007-12-21 22:22 ` Christoph Lameter 2 siblings, 1 reply; 93+ messages in thread From: Pekka Enberg @ 2007-12-21 17:44 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki Hi, Christoph Lameter <clameter@sgi.com> wrote: > > In an extreme case (boot with slub_min_order=9 to get huge page sized > > slabs) SLUB can win against SLAB: > > > > N=10 Time: 0.338 Minimally faster > > N=20 Time: 0.560 10% faster > > N=50 Time: 1.353 15% faster On Dec 21, 2007 2:09 PM, Ingo Molnar <mingo@elte.hu> wrote: > what's up with this regression? There's been absolutely no activity > about it in the last 8 days: upstream still regresses, -mm still > regresses and there are no patches posted for testing. Christoph, did you see Steven's oprofile results for the hackbench runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting add_partial like crazy? Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 17:44 ` Pekka Enberg @ 2007-12-21 22:22 ` Christoph Lameter 2007-12-21 22:37 ` Christoph Lameter 0 siblings, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 22:22 UTC (permalink / raw) To: Pekka Enberg Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Pekka Enberg wrote: > Christoph, did you see Steven's oprofile results for the hackbench > runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting > add_partial like crazy? Hmmm... SLUB is cycling through partial slabs. If it gets fed slabs with a single free object from the free list again and again then this is the behavior that one would see. The worst case scenario would be: 1. Processor 0 gets a slab with one free entry from the partial list. 2. Processor 0 allocates the object and deactivates the slab since it is full. 3. Processor 1 frees the object and finds that it was not on the partial list since there were no free objects. Call put_partial() 4. Processor 0 gets the slab with one free entry from the partial list. If we were to enforce a minimum partial list size (which is already done through MIN_PARTIAL; maybe just increase it) then we may be able to avoid this particular case. Also we could make sure that the slab is not put at the beginning of the partial list on free, in order to increase the time that the slab spends on the partial list. That way it will gather more objects before it is picked up. Hmmmm... This is a bit different from what I got to here. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:22 ` Christoph Lameter @ 2007-12-21 22:37 ` Christoph Lameter 2007-12-21 22:51 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 22:37 UTC (permalink / raw) To: Pekka Enberg Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki Actually thanks for pointing that out Pekka.... Here is a patch that takes the regression almost completely away (Too much jetlag combined with flu seems to have impaired my thinking I should have seen this earlier). I still need to run my slab performance tests on this. But hackbench improves. SLUB: Improve hackbench speed Increase the mininum number of partial slabs to keep around and put partial slabs to the end of the partial queue so that they can add more objects. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/slub.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800 +++ linux-2.6/mm/slub.c 2007-12-21 14:30:33.656441994 -0800 @@ -172,7 +172,7 @@ static inline void ClearSlabDebug(struct * Mininum number of partial slabs. These will be left on the partial * lists even if they are empty. kmem_cache_shrink may reclaim them. */ -#define MIN_PARTIAL 2 +#define MIN_PARTIAL 5 /* * Maximum number of desirable partial slabs. @@ -1616,7 +1616,7 @@ checks_ok: * then add it. */ if (unlikely(!prior)) - add_partial(get_node(s, page_to_nid(page)), page); + add_partial_tail(get_node(s, page_to_nid(page)), page); out_unlock: slab_unlock(page); ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:37 ` Christoph Lameter @ 2007-12-21 22:51 ` Linus Torvalds 2007-12-21 23:17 ` Ingo Molnar 2007-12-21 22:52 ` Steven Rostedt 2007-12-21 22:56 ` Ingo Molnar 2 siblings, 1 reply; 93+ messages in thread From: Linus Torvalds @ 2007-12-21 22:51 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Christoph Lameter wrote: > > Actually thanks for pointing that out Pekka.... Here is a patch that takes > the regression almost completely away Now *this* is what I wanted to see - rather than argue against other peoples performance regression reports, an actual patch that acknowledges the problem. Thanks. And now, can the people who made the problem reports and complained about SLUB please test the patch - the ball is now in your court! Linus ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:51 ` Linus Torvalds @ 2007-12-21 23:17 ` Ingo Molnar 2007-12-21 23:40 ` Pekka Enberg 2007-12-21 23:54 ` Christoph Lameter 0 siblings, 2 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 23:17 UTC (permalink / raw) To: Linus Torvalds Cc: Christoph Lameter, Pekka Enberg, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki * Linus Torvalds <torvalds@linux-foundation.org> wrote: > And now, can the people who made the problem reports and complained > about SLUB please test the patch - the ball is now in your court! yep, and i ran a quick comparison test on a 2-core box with 3 kernels: [ best of 5 runs in a row which had a relative jitter of less than 10% ] MIN v2.6.24.slab v2.6.24.slub v2.6.24.slub.fix ---------------------------------------------------------- mmap: 429.00 402.00 ( -6%) 385.00 (-10%) select: 11.38 10.46 ( -8%) 11.41 ( 0%) proc-exec: 121.52 116.77 ( -3%) 120.77 ( 0%) proc-fork: 106.84 106.19 ( 0%) 107.92 ( 1%) syscall-open: 3.09 3.13 ( 1%) 3.25 ( 4%) hackbench-50: 2.85 3.47 ( 21%) 2.88 ( 1%) and the regression seems to be largely fixed! Not only is the hackbench one fixed, but mmap shows an above-noise improvement as well. Acked-by: Ingo Molnar <mingo@elte.hu> And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson ;-) Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 23:17 ` Ingo Molnar @ 2007-12-21 23:40 ` Pekka Enberg 2007-12-21 23:54 ` Christoph Lameter 1 sibling, 0 replies; 93+ messages in thread From: Pekka Enberg @ 2007-12-21 23:40 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Christoph Lameter, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki Hi, On Dec 22, 2007 1:17 AM, Ingo Molnar <mingo@elte.hu> wrote: > yep, and i ran a quick comparison test on a 2-core box with 3 kernels: > > [ best of 5 runs in a row which had a relative jitter of less than 10% ] > > MIN v2.6.24.slab v2.6.24.slub v2.6.24.slub.fix > ---------------------------------------------------------- > mmap: 429.00 402.00 ( -6%) 385.00 (-10%) > select: 11.38 10.46 ( -8%) 11.41 ( 0%) > proc-exec: 121.52 116.77 ( -3%) 120.77 ( 0%) > proc-fork: 106.84 106.19 ( 0%) 107.92 ( 1%) > syscall-open: 3.09 3.13 ( 1%) 3.25 ( 4%) > hackbench-50: 2.85 3.47 ( 21%) 2.88 ( 1%) > > and the regression seems to be largely fixed! Not only is the hackbench > one fixed, but mmap shows an above-noise improvement as well. > > Acked-by: Ingo Molnar <mingo@elte.hu> I thought it might be a bug but couldn't figure it out. The patch looks good to me too, Christoph. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> On Dec 22, 2007 1:17 AM, Ingo Molnar <mingo@elte.hu> wrote: > And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson > ;-) Heh, thanks, but grep me from MAINTAINERS some day, Ingo :-). Pekka ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 23:17 ` Ingo Molnar 2007-12-21 23:40 ` Pekka Enberg @ 2007-12-21 23:54 ` Christoph Lameter 2007-12-22 0:02 ` Chuck Ebbert 1 sibling, 1 reply; 93+ messages in thread From: Christoph Lameter @ 2007-12-21 23:54 UTC (permalink / raw) To: Ingo Molnar Cc: Linus Torvalds, Pekka Enberg, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Sat, 22 Dec 2007, Ingo Molnar wrote: > and the regression seems to be largely fixed! Not only is the hackbench > one fixed, but mmap shows an above-noise improvement as well. > > Acked-by: Ingo Molnar <mingo@elte.hu> Well lets use the version attached to this patch. It turns out that it is not important to increase the number of partial slabs. Just putting the slab onto the tail of the list does it. > And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson > ;-) He is already the co-maintainer of the slab allocators. See the MAINTAINERS file. SLUB: Improve hackbench speed Increase the mininum number of partial slabs to keep around and put partial slabs to the end of the partial queue so that they can add more objects. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/slub.c | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) Index: linux-2.6/mm/slub.c =================================================================== --- linux-2.6.orig/mm/slub.c 2007-12-21 14:30:11.096324462 -0800 +++ linux-2.6/mm/slub.c 2007-12-21 15:10:17.611022381 -0800 @@ -1611,12 +1611,12 @@ checks_ok: goto slab_empty; /* - * Objects left in the slab. If it - * was not on the partial list before - * then add it. + * Objects left in the slab. If it was not on the partial list before + * then add it. Add it to the end since there is only a single object + * which would make slab_alloc inefficient. 
*/ if (unlikely(!prior)) - add_partial(get_node(s, page_to_nid(page)), page); + add_partial_tail(get_node(s, page_to_nid(page)), page); out_unlock: slab_unlock(page); ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 23:54 ` Christoph Lameter @ 2007-12-22 0:02 ` Chuck Ebbert 0 siblings, 0 replies; 93+ messages in thread From: Chuck Ebbert @ 2007-12-22 0:02 UTC (permalink / raw) To: Christoph Lameter Cc: Ingo Molnar, Linus Torvalds, Pekka Enberg, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On 12/21/2007 06:54 PM, Christoph Lameter wrote: > > SLUB: Improve hackbench speed > > Increase the mininum number of partial slabs to keep around and put > partial slabs to the end of the partial queue so that they can add > more objects. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > You forgot to update the commit message. Should be: Put partial slabs to the end of the partial queue so that they can add more objects. ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:37 ` Christoph Lameter 2007-12-21 22:51 ` Linus Torvalds @ 2007-12-21 22:52 ` Steven Rostedt 2007-12-21 22:56 ` Ingo Molnar 2 siblings, 0 replies; 93+ messages in thread From: Steven Rostedt @ 2007-12-21 22:52 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Ingo Molnar, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki On Fri, 21 Dec 2007, Christoph Lameter wrote: > Actually thanks for pointing that out Pekka.... Here is a patch that takes > the regression almost completely away (Too much jetlag combined with flu > seems to have impaired my thinking I should have seen this earlier). I > still need to run my slab performance tests on this. But hackbench > improves. > > > SLUB: Improve hackbench speed > > Increase the mininum number of partial slabs to keep around and put > partial slabs to the end of the partial queue so that they can add > more objects. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > FYI, I'm currently on holiday till Jan 2nd, and I also lost access to the machine I did the original benchmarks on. But I'll try to reserve it again after the New Year, and I will run the benchmarks including this patch and report my findings. Thanks, -- Steve ^ permalink raw reply [flat|nested] 93+ messages in thread
* Re: Major regression on hackbench with SLUB (more numbers) 2007-12-21 22:37 ` Christoph Lameter 2007-12-21 22:51 ` Linus Torvalds 2007-12-21 22:52 ` Steven Rostedt @ 2007-12-21 22:56 ` Ingo Molnar 2 siblings, 0 replies; 93+ messages in thread From: Ingo Molnar @ 2007-12-21 22:56 UTC (permalink / raw) To: Christoph Lameter Cc: Pekka Enberg, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki * Christoph Lameter <clameter@sgi.com> wrote: > if (unlikely(!prior)) > - add_partial(get_node(s, page_to_nid(page)), page); > + add_partial_tail(get_node(s, page_to_nid(page)), page); FYI, this gives me: mm/slub.c: In function 'kfree': mm/slub.c:2604: warning: passing argument 3 of 'slab_free' discards qualifiers from pointer target type looks harmless though at first sight. Ingo ^ permalink raw reply [flat|nested] 93+ messages in thread
end of thread, other threads:[~2008-01-05 0:50 UTC | newest] Thread overview: 93+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt 2007-12-07 17:55 ` Ingo Molnar 2007-12-07 17:59 ` Linus Torvalds 2007-12-07 18:36 ` Linus Torvalds 2007-12-07 18:58 ` Steven Rostedt 2007-12-08 22:16 ` Steven Rostedt 2007-12-10 7:38 ` Björn Steinbrink 2007-12-10 14:00 ` Steven Rostedt 2007-12-09 0:28 ` Steven Rostedt 2007-12-13 22:11 ` Christoph Lameter 2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 2007-12-13 22:22 ` Christoph Lameter 2007-12-14 4:24 ` Christoph Lameter 2007-12-14 6:15 ` Steven Rostedt 2007-12-21 12:09 ` Ingo Molnar 2007-12-21 12:26 ` Ingo Molnar 2007-12-21 16:26 ` Christoph Lameter 2007-12-21 16:33 ` Ingo Molnar 2007-12-21 21:56 ` Christoph Lameter 2007-12-21 16:18 ` Christoph Lameter 2007-12-21 16:44 ` Linus Torvalds 2007-12-21 22:11 ` Christoph Lameter 2007-12-21 22:16 ` Peter Zijlstra 2007-12-21 22:17 ` Peter Zijlstra 2007-12-21 22:27 ` Christoph Lameter 2007-12-21 22:54 ` Ingo Molnar 2007-12-21 23:20 ` David Miller 2007-12-22 9:45 ` Ingo Molnar 2007-12-24 19:24 ` Christoph Lameter 2007-12-21 23:27 ` Andrew Morton 2007-12-21 23:51 ` Christoph Lameter 2007-12-22 0:28 ` Andi Kleen 2007-12-22 0:33 ` Chuck Ebbert 2007-12-22 2:19 ` Matt Mackall 2007-12-22 2:44 ` Andi Kleen 2007-12-22 10:03 ` Ingo Molnar 2007-12-22 12:37 ` Andi Kleen 2007-12-22 12:51 ` Pekka Enberg 2007-12-22 12:54 ` Andi Kleen 2007-12-22 13:27 ` Pekka Enberg 2007-12-22 13:01 ` Alexey Dobriyan 2007-12-22 18:01 ` Linus Torvalds 2007-12-22 19:09 ` Ingo Molnar 2007-12-22 19:29 ` Ingo Molnar 2007-12-22 22:52 ` Jason L Tibbitts III 2007-12-24 3:59 ` Alessandro Suardi 2007-12-22 19:25 ` Theodore Tso 2007-12-22 21:00 ` Linus Torvalds 2007-12-22 21:50 ` Al Viro 2007-12-22 23:29 ` Al Viro 2007-12-22 22:10 ` Willy Tarreau 2007-12-23 0:59 ` Steven Rostedt 
2007-12-23 5:12 ` Willy Tarreau 2007-12-23 14:15 ` Andi Kleen 2007-12-24 3:45 ` Theodore Tso 2007-12-24 19:21 ` Christoph Lameter 2007-12-24 23:37 ` Theodore Tso 2007-12-25 3:34 ` Andi Kleen 2008-01-01 12:47 ` Theodore Tso 2008-01-01 15:19 ` Andi Kleen 2008-01-01 16:23 ` [patch] slub: provide /proc/slabinfo Ingo Molnar 2008-01-05 0:27 ` Arjan van de Ven 2008-01-05 0:49 ` Linus Torvalds 2007-12-26 21:31 ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter 2007-12-26 22:16 ` Al Viro 2007-12-27 5:51 ` Christoph Lameter 2007-12-27 20:28 ` SLUB sysfs support Christoph Lameter 2007-12-27 22:59 ` Al Viro 2007-12-27 23:22 ` Christoph Lameter 2007-12-27 23:53 ` Christoph Lameter 2007-12-27 23:58 ` Al Viro 2007-12-28 0:02 ` Christoph Lameter 2007-12-28 0:45 ` Al Viro 2007-12-28 2:19 ` Christoph Lameter 2007-12-28 3:26 ` Al Viro 2008-01-02 20:25 ` Christoph Lameter 2007-12-27 23:54 ` Al Viro 2007-12-28 9:00 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar 2007-12-28 14:52 ` Steven Rostedt 2007-12-22 19:46 ` slabtop replacement was " Andi Kleen 2007-12-22 23:28 ` Karol Swietlicki 2007-12-29 18:08 ` Karol Swietlicki 2007-12-21 22:49 ` Linus Torvalds 2007-12-21 17:44 ` Pekka Enberg 2007-12-21 22:22 ` Christoph Lameter 2007-12-21 22:37 ` Christoph Lameter 2007-12-21 22:51 ` Linus Torvalds 2007-12-21 23:17 ` Ingo Molnar 2007-12-21 23:40 ` Pekka Enberg 2007-12-21 23:54 ` Christoph Lameter 2007-12-22 0:02 ` Chuck Ebbert 2007-12-21 22:52 ` Steven Rostedt 2007-12-21 22:56 ` Ingo Molnar