linux-kernel.vger.kernel.org archive mirror
* Major regression on hackbench with SLUB
@ 2007-12-07 17:50 Steven Rostedt
  2007-12-07 17:55 ` Ingo Molnar
                   ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-07 17:50 UTC (permalink / raw)
  To: LKML
  Cc: Christoph Lameter, Ingo Molnar, Linus Torvalds, Peter Zijlstra,
	Christoph Hellwig

Hi,

I've been doing some benchmarks for RT task migrations and I stumbled
over this regression. I first thought it was a regression with CFS and
started doing a git bisect. But I later found that the "good" kernel
also failed. Then I started looking at the config options and discovered
that SLUB has become the default.

When I put SLAB back, the regression went away, even on the latest git
tree.

So I ran the hackbench benchmarks with the latest git commit
(f194d132e4971111f85c18c96067acffb13cee6d at the time of this writing).
And put up the results on my web page.

http://people.redhat.com/srostedt/slab/

I ran Ingo's cfs-debug-info.sh and have the results of that on the web
page. That script records lots of good information to see what kind of
machine I ran this on. This is a 64-way box (yes, lots of CPUs!).

The hackbench code and the script I used to run and record the results
is also present. I ran "hackbench 90" 20 times, 10 times as SCHED_OTHER
and 10 times as SCHED_FIFO (chrt -f 10 hackbench 90).  The graph shows
both runs (min, max and average). The versions that start with
"rt:" (although the graph somehow truncated it to just "t:") are the
SCHED_FIFO runs. (click on the graph to see a bigger version)

The regression is that hackbench slows down tremendously. It goes from
running just under 2 seconds to running around 14 seconds.

Hackbench seems to show this regression the most. In my tests I didn't
see much change with kernel builds and such, but the focus was on
scheduling not memory management. I'll run my kernel tests next for both
SLAB and SLUB and see if there's any difference there.

But this regression might be a reason to not have SLUB as default for
now. At least until we find out why this is affected so badly.

-- Steve



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB
  2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt
@ 2007-12-07 17:55 ` Ingo Molnar
  2007-12-07 17:59 ` Linus Torvalds
  2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
  2 siblings, 0 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-07 17:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Christoph Lameter, Linus Torvalds, Peter Zijlstra,
	Christoph Hellwig


* Steven Rostedt <rostedt@goodmis.org> wrote:

> The regression is that hackbench slows down tremendously. It goes from 
> running just under 2 seconds to running around 14 seconds.

ouch ...

it might make sense to check the SLUB patches from -mm ontop of vanilla 
as well - while that might be of little comfort as far as 2.6.24 goes, 
it at least gives us the info whether this is something that has been 
fixed in the latest SLUB code.

	Ingo


* Re: Major regression on hackbench with SLUB
  2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt
  2007-12-07 17:55 ` Ingo Molnar
@ 2007-12-07 17:59 ` Linus Torvalds
  2007-12-07 18:36   ` Linus Torvalds
  2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
  2 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2007-12-07 17:59 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig



On Fri, 7 Dec 2007, Steven Rostedt wrote:
> 
> The regression is that hackbench slows down tremendously. It goes from
> running just under 2 seconds to running around 14 seconds.

Can you do one run with oprofile, and see exactly where the cost is? It 
should hopefully be pretty darn obvious, considering your timing.

			Linus


* Re: Major regression on hackbench with SLUB
  2007-12-07 17:59 ` Linus Torvalds
@ 2007-12-07 18:36   ` Linus Torvalds
  2007-12-07 18:58     ` Steven Rostedt
                       ` (3 more replies)
  0 siblings, 4 replies; 93+ messages in thread
From: Linus Torvalds @ 2007-12-07 18:36 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig



On Fri, 7 Dec 2007, Linus Torvalds wrote:
> 
> Can you do one run with oprofile, and see exactly where the cost is? It 
> should hopefully be pretty darn obvious, considering your timing.

So I checked hackbench on my machine with oprofile, and while I'm not 
seeing anything that says "ten times slower", I do see it hitting the slow 
path all the time, and __slab_alloc() is 15% of the profile.

With __slab_free() being another 8-9%.

I assume that with many CPUs, those get horrendously worse, with cache
thrashing.

The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but 
that may not be the one that causes problems in a 64-cpu setup, so it 
would be good to have that verified.

Anyway, it's unclear whether the reason it goes into the slow path is
that the freelist is just always empty, or whether it hits the

	... || !node_match(c, node)

case which can trigger on NUMA. That's another potential explanation of 
why you'd see such a *huge* slowdown (ie if the whole node-match thing 
doesn't work out, slub just never gets the fast-case at all). That said, 
the number of places that actually pass a specific node to slub is very 
limited, so I suspect it's not the node matching. But just disabling that 
test in slab_alloc() might be one thing to test.

[ The whole node match thing is likely totally bogus. I suspect we pay 
  *more* in trying to match nodes than we'd ever likely pay in just 
  returning the wrong node for an allocation, but that's neither here nor 
  there ]

But yeah, I'm not entirely sure SLUB is working out.

		Linus


* Re: Major regression on hackbench with SLUB
  2007-12-07 18:36   ` Linus Torvalds
@ 2007-12-07 18:58     ` Steven Rostedt
  2007-12-08 22:16     ` Steven Rostedt
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-07 18:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig


[ Note: I'm currently having problems with my email:
  http://www.domaindirect.com/network.html
  I'm in the process of setting up my own personal email server. ]


> > Can you do one run with oprofile, and see exactly where the cost is? It
> > should hopefully be pretty darn obvious, considering your timing.

I kicked off my kernel tests which will take a few hours.
I run "make -j256" (4*nr_CPUS) 10 times at SCHED_OTHER and 10 times
at SCHED_FIFO (chrt -f 10) for both SLUB and SLAB.

This is automated, so I need to wait for it to finish before I can
continue other tests.

> The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but
> that may not be the one that causes problems in a 64-cpu setup, so it
> would be good to have that verified.

I'll run oprofile as soon as the kernel build test is done.

> case which can trigger on NUMA. That's another potential explanation of
> why you'd see such a *huge* slowdown (ie if the whole node-match thing
> doesn't work out, slub just never gets the fast-case at all). That said,
> the number of places that actually pass a specific node to slub is very
> limited, so I suspect it's not the node matching. But just disabling that
> test in slab_alloc() might be one thing to test.

I'll try that after the oprofile.

Thanks,

-- Steve



* Re: Major regression on hackbench with SLUB
  2007-12-07 18:36   ` Linus Torvalds
  2007-12-07 18:58     ` Steven Rostedt
@ 2007-12-08 22:16     ` Steven Rostedt
  2007-12-10  7:38       ` Björn Steinbrink
  2007-12-09  0:28     ` Steven Rostedt
  2007-12-13 22:11     ` Christoph Lameter
  3 siblings, 1 reply; 93+ messages in thread
From: Steven Rostedt @ 2007-12-08 22:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Christoph Lameter, Ingo Molnar, Peter Zijlstra, Christoph Hellwig


Hi Linus,

> On Fri, 7 Dec 2007, Linus Torvalds wrote:
> >
> > Can you do one run with oprofile, and see exactly where the cost is? It
> > should hopefully be pretty darn obvious, considering your timing.

The results are here:

http://people.redhat.com/srostedt/slub/results/slab.op
http://people.redhat.com/srostedt/slub/results/slub.op


>
> Anyway, it's unclear whether the reason it goes into the slow-path because
> the freelist is just always empty, or whether it hits the
>
> 	... || !node_match(c, node)

I did two things. First I tried making node_match always return true;
the second was just commenting out the above check. They both had pretty
much the exact same results: it slowed down by a factor of two! Instead
of taking 15 seconds to complete, both took 30 seconds to complete.

Here's the oprofiles of those runs.

  node_match return 1:
http://people.redhat.com/srostedt/slub/results/slub-nonode.op

  comment out node_match check:
http://people.redhat.com/srostedt/slub/results/slub-nochecknode.op


>
> [ The whole node match thing is likely totally bogus. I suspect we pay
>   *more* in trying to match nodes than we'd ever likely pay in just
>   returning the wrong node for an allocation, but that's neither here nor
>   there ]

I don't think it's bogus, given the result that removing it makes things
twice as slow.


I'm wondering if there's not some other cache thrashing happening
somewhere else in the slub code. I'll start adding some stats in the code
to see where the congestion might lie. I'll look into this on Monday.

Seems that both SLAB and SLUB run kernel compiles the same. Here's the
results of my compile test. (make -j256 and chrt -f 10 make -j256)

http://people.redhat.com/srostedt/slub/

-- Steve




* Re: Major regression on hackbench with SLUB
  2007-12-07 18:36   ` Linus Torvalds
  2007-12-07 18:58     ` Steven Rostedt
  2007-12-08 22:16     ` Steven Rostedt
@ 2007-12-09  0:28     ` Steven Rostedt
  2007-12-13 22:11     ` Christoph Lameter
  3 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-09  0:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: LKML, Linus Torvalds, Christoph Lameter, Peter Zijlstra,
	Christoph Hellwig


Ingo,

Due to email problems, I never got your reply on this thread (nor did I
receive Linus's first reply). But I saw your reply when reading this
thread on lkml.org.

Anyway, I booted 2.6.24-rc4-mm1 with the same config as the git version
I checked, accepting the defaults for the new config options. It gave me
an average of 12.5 second runs.

Here's the oprofile for it:

http://people.redhat.com/srostedt/slub/results/slub-mm.op

-- Steve





* Re: Major regression on hackbench with SLUB
  2007-12-08 22:16     ` Steven Rostedt
@ 2007-12-10  7:38       ` Björn Steinbrink
  2007-12-10 14:00         ` Steven Rostedt
  0 siblings, 1 reply; 93+ messages in thread
From: Björn Steinbrink @ 2007-12-10  7:38 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, LKML, Christoph Lameter, Ingo Molnar,
	Peter Zijlstra, Christoph Hellwig

On 2007.12.08 17:16:24 -0500, Steven Rostedt wrote:
> 
> Hi Linus,
> 
> > On Fri, 7 Dec 2007, Linus Torvalds wrote:
> > >
> > > Can you do one run with oprofile, and see exactly where the cost is? It
> > > should hopefully be pretty darn obvious, considering your timing.
> 
> The results are here:
> 
> http://people.redhat.com/srostedt/slub/results/slab.op
> http://people.redhat.com/srostedt/slub/results/slub.op

Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc a
lot. I wonder if !node_match triggers too often. We always start with
the per cpu slab; if that one is on the wrong node, you'll always hit
that "another_slab" path.

After searching for way too long (given that I have no clue about that
stuff anyway and just read the code out of curiosity), I noticed that
the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
(arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
which, AFAICT, shows that the per cpu slab setup is done _before_
cpu_to_node is correctly set up. That would lead to the per cpu slabs
all having node == 0xff, which looks pretty bad.

Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
time, so this might be total bull ;-)

Björn

[1] http://linux.derkeiler.com/Mailing-Lists/Kernel/2007-10/msg04648.html


* Re: Major regression on hackbench with SLUB
  2007-12-10  7:38       ` Björn Steinbrink
@ 2007-12-10 14:00         ` Steven Rostedt
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-10 14:00 UTC (permalink / raw)
  To: Björn Steinbrink
  Cc: Linus Torvalds, LKML, Christoph Lameter, Ingo Molnar,
	Peter Zijlstra, Christoph Hellwig


On Mon, 10 Dec 2007, Björn Steinbrink wrote:
> >
> > The results are here:
> >
> > http://people.redhat.com/srostedt/slub/results/slab.op
> > http://people.redhat.com/srostedt/slub/results/slub.op
>
> Hm, you seem to be hitting the "another_slab" stuff in __slab_alloc
> alot. I wonder if !node_match triggers too often. We always start with
> the per cpu slab, if that one is on the wrong node, you'll always hit
> that "another_slab" path.

Well, I commented out the node_match part and it got 100% worse. It took
30 seconds to complete.

>
> After searching for way too long (given that I have no clue about that
> stuff anyway and just read the code out of curiousness), I noticed that
> the the cpu_to_node stuff on x86_64 seems to be initialized to 0xff
> (arch/x86/mm/numa_64.c), and Google brought me this dmesg output [1],
> which, AFAICT, shows that the per cpu slab setup is done _before_
> cpu_to_node is correctly setup. That would lead to the per cpu slabs all
> having node == 0xff, which looks pretty bad.

I didn't check to see if the internal setup of the node is correct,
though. I can put in some debug to see what I get.

>
> Disclaimer: I read the slub/numa/$WHATEVER_I_SAW_THERE for the first
> time, so this might be total bull ;-)

Well I'm a newbie on NUMA stuff too. I just got lucky enough to be able to
reserve this box ;-)

-- Steve



* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt
  2007-12-07 17:55 ` Ingo Molnar
  2007-12-07 17:59 ` Linus Torvalds
@ 2007-12-11 14:33 ` Ingo Molnar
  2007-12-13 22:22   ` Christoph Lameter
  2 siblings, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2007-12-11 14:33 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: LKML, Christoph Lameter, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig


* Steven Rostedt <rostedt@goodmis.org> wrote:

> Hackbench seems to show this regression the most. In my tests I didn't 
> see much change with kernel builds and such, but the focus was on 
> scheduling not memory management. I'll run my kernel tests next for 
> both SLAB and SLUB and see if there's any difference there.

i just ran various benchmarks on an 8-way (8x 700 MHz Xeon, 4GB RAM):

      AVG      v2.6.24.slab     v2.6.24.slub         [ smaller is better ]
   -----------------------------------------
            mmap:   1052.66          1049.33   (  0%)
           ctx-2:      4.32             4.30   (  0%)
          select:     41.95            43.69   (  4%)
       proc-exec:    394.45           391.92   (  0%)
    hackbench-10:      1.12             2.99   (166%)
    hackbench-20:      2.04             6.67   (226%)
    hackbench-50:      5.03            17.50   (247%)

and hackbench overhead stands out, by a huge margin. Other stuff is 
within measurement noise. Neither SLUB nor SLAB debugging was turned on, 
all other debugging options were off too.

	Ingo


* Re: Major regression on hackbench with SLUB
  2007-12-07 18:36   ` Linus Torvalds
                       ` (2 preceding siblings ...)
  2007-12-09  0:28     ` Steven Rostedt
@ 2007-12-13 22:11     ` Christoph Lameter
  3 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-13 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Steven Rostedt, LKML, Ingo Molnar, Peter Zijlstra, Christoph Hellwig

On Fri, 7 Dec 2007, Linus Torvalds wrote:

> The biggest cost of __slab_alloc() in my profile is the "slab_lock()", but 
> that may not be the one that causes problems in a 64-cpu setup, so it 
> would be good to have that verified.

Hmmmm.. This would indicate lock contention on a slab.

> [ The whole node match thing is likely totally bogus. I suspect we pay 
>   *more* in trying to match nodes than we'd ever likely pay in just 
>   returning the wrong node for an allocation, but that's neither here nor 
>   there ]

Node match is necessary in order to make the allocator able to get memory 
for the node that the caller requested.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
@ 2007-12-13 22:22   ` Christoph Lameter
  2007-12-14  4:24     ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-13 22:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig

On Tue, 11 Dec 2007, Ingo Molnar wrote:

>     hackbench-10:      1.12             2.99   (166%)
>     hackbench-20:      2.04             6.67   (226%)
>     hackbench-50:      5.03            17.50   (247%)
> 
> and hackbench overhead stands out, by a huge margin. Other stuff is 
> within measurement noise. Neither SLUB nor SLAB debugging was turned on, 
> all other debugging options were off too.

I just came back from vacation. The non-linear growth of the regression
indicates lock contention somewhere. Something special must be getting
triggered by hackbench.

We have had a regression like that in 2.6.24 before, due to order-1
allocs in the network layer. The .24 modifications for SLUB introduced
page allocator pass-through for allocations > PAGE_SIZE/2. Order-1
allocs could serialize on zone locks.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-13 22:22   ` Christoph Lameter
@ 2007-12-14  4:24     ` Christoph Lameter
  2007-12-14  6:15       ` Steven Rostedt
  2007-12-21 12:09       ` Ingo Molnar
  0 siblings, 2 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-14  4:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig

Hmmmm... Some tests here on an 8p 8G machine:

SLAB

N=10 Time: 0.341
N=20 Time: 0.605
N=50 Time: 1.487

SLUB 

N=10 Time: 0.675
N=20 Time: 1.434
N=50 Time: 3.996

So it's a factor of 2 for untuned SLUB. Looking at hackbench: this is a 
test that allocates objects that are then consumed by N cpus that then 
return them. The allocating processor can allocate from the per cpu slab 
(the freelist is copied to a per cpu structure when it is activated),
avoiding touching the page struct. However, the freeing processors 
must update the freelist of the slab page directly, so they all contend 
for the cacheline of the page struct.

Since we free directly, there is the chance of lots of contention 
between the N cpus that free the objects. This is particularly high in a 
synthetic benchmark like this.

However, if the object to be freed is still in a cpu slab then the freeing 
action will reduce the taking of the listlock. So we can actually 
decrease the overhead by increasing the slab page size.

In an extreme case (boot with slub_min_order=9 to get huge page sized 
slabs) SLUB can win against SLAB:

N=10 Time: 0.338	Minimally faster
N=20 Time: 0.560	10% faster
N=50 Time: 1.353	15% faster

Not sure how to get that mainstream yet but there certainly is a way to 
get there. Need to get an idea how to reduce listlock contention in the 
remote free case.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-14  4:24     ` Christoph Lameter
@ 2007-12-14  6:15       ` Steven Rostedt
  2007-12-21 12:09       ` Ingo Molnar
  1 sibling, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-14  6:15 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, LKML, Andrew Morton, Linus Torvalds, Peter Zijlstra,
	Christoph Hellwig


Christoph,

Welcome back from your vacation! :-)

On Thu, 13 Dec 2007, Christoph Lameter wrote:
>
> In an extreme case (boot with slub_min_order=9 to get huge page sized
> slabs) SLUB can win against SLAB:
>
> N=10 Time: 0.338	Minimally faster
> N=20 Time: 0.560	10% faster
> N=50 Time: 1.353	15% faster
>

I booted back to slub with slub_min_order=9 on the 64way and you are
indeed right about this.

# tests/hackbench 90
Time: 2.109
# chrt -f 10 tests/hackbench 90
Time: 3.775
# tests/hackbench 90
Time: 1.583
# chrt -f 10 tests/hackbench 90
Time: 1.833
# tests/hackbench 90
Time: 1.601
# chrt -f 10 tests/hackbench 90
Time: 3.078
# tests/hackbench 90
Time: 1.321
# chrt -f 10 tests/hackbench 90
Time: 1.933



The times fluctuate between 1.321 and 3.775 seconds (*) (the RT runs
were higher). But still, good job at narrowing this down. Now the hard
part: figuring out a way that will be nice to all users.

-- Steve

(*) The previous runs (without the command line option) were 13 seconds or
so.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-14  4:24     ` Christoph Lameter
  2007-12-14  6:15       ` Steven Rostedt
@ 2007-12-21 12:09       ` Ingo Molnar
  2007-12-21 12:26         ` Ingo Molnar
                           ` (2 more replies)
  1 sibling, 3 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 12:09 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki


* Christoph Lameter <clameter@sgi.com> wrote:

> Hmmmm... Some tests here on an 8p 8G machine:

> In an extreme case (boot with slub_min_order=9 to get huge page sized 
> slabs) SLUB can win against SLAB:
> 
> N=10 Time: 0.338	Minimally faster
> N=20 Time: 0.560	10% faster
> N=50 Time: 1.353	15% faster

what's up with this regression? There's been absolutely no activity 
about it in the last 8 days: upstream still regresses, -mm still 
regresses and there are no patches posted for testing.

being able to utilize order-0 pages was supposed to be one of the big 
plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously 
the "fix", right?

	Ingo


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 12:09       ` Ingo Molnar
@ 2007-12-21 12:26         ` Ingo Molnar
  2007-12-21 16:26           ` Christoph Lameter
  2007-12-21 16:18         ` Christoph Lameter
  2007-12-21 17:44         ` Pekka Enberg
  2 siblings, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 12:26 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki


* Ingo Molnar <mingo@elte.hu> wrote:

> * Christoph Lameter <clameter@sgi.com> wrote:
> 
> > Hmmmm... Some tests here on an 8p 8G machine:
> 
> > In an extreme case (boot with slub_min_order=9 to get huge page sized 
> > slabs) SLUB can win against SLAB:
> > 
> > N=10 Time: 0.338	Minimally faster
> > N=20 Time: 0.560	10% faster
> > N=50 Time: 1.353	15% faster
> 
> what's up with this regression? There's been absolutely no activity 
> about it in the last 8 days: upstream still regresses, -mm still 
> regresses and there are no patches posted for testing.
> 
> being able to utilize order-0 pages was supposed to be one of the big 
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously 
> the "fix", right?

and this is not the only regression:

    http://lkml.org/lkml/2007/10/4/290

_6%_ TPC-C regression. That's _a lot_ in TPC-C terms.

and just like in this case there were very clear profiles posted. I 
proffer, reading back the whole thread, that if you fix hackbench you 
have fixed TPC-C as well.

So i believe you should either send some sensible fixes _NOW_, or admit 
that the "no queues" NIH nonsense of SLUB doesn't work and do a 
digestible, incremental patchset against SLAB to bring in the 
debuggability features of SLUB without killing SLAB's performance. (And 
fix the NUMA alien cache problem along the lines suggested before - 
perhaps initially by making 'noaliencache' the default bootup option.) 
And we obviously must revert the default in 2.6.24 to SLAB as well.

	Ingo


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 12:09       ` Ingo Molnar
  2007-12-21 12:26         ` Ingo Molnar
@ 2007-12-21 16:18         ` Christoph Lameter
  2007-12-21 16:44           ` Linus Torvalds
  2007-12-21 17:44         ` Pekka Enberg
  2 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 16:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> what's up with this regression? There's been absolutely no activity 
> about it in the last 8 days: upstream still regresses, -mm still 
> regresses and there are no patches posted for testing.

I added a test that does this 1-alloc-N-free behavior to the in-kernel 
slab benchmark (cpu 0 continually allocs, cpus 1-7 continuously free) 
(all order 0). The test makes the regression worse because it runs the 
scenario with raw slab allocs/frees in kernel context, without the use 
of the scheduler:

SLAB
1 alloc N free(8): 0=2115 1=549 2=551 3=550 4=550 5=550 6=551 7=549 Average=746
1 alloc N free(16): 0=2159 1=573 2=574 3=573 4=574 5=573 6=574 7=573 Average=772
1 alloc N free(32): 0=2168 1=577 2=576 3=577 4=579 5=578 6=577 7=576 Average=776
1 alloc N free(64): 0=2276 1=555 2=554 3=554 4=555 5=556 6=555 7=554 Average=770
1 alloc N free(128): 0=2635 1=549 2=548 3=548 4=550 5=549 6=548 7=549 Average=809
1 alloc N free(256): 0=3261 1=543 2=542 3=542 4=544 5=542 6=542 7=542 Average=882
1 alloc N free(512): 0=4642 1=571 2=572 3=570 4=573 5=571 6=572 7=572 Average=1080
1 alloc N free(1024): 0=7116 1=582 2=581 3=583 4=581 5=582 6=582 7=583 Average=1399
1 alloc N free(2048): 0=11612 1=667 2=667 3=667 4=668 5=667 6=668 7=667 Average=2035
1 alloc N free(4096): 0=20135 1=673 2=674 3=674 4=675 5=674 6=675 7=674 Average=3107

SLUB
1 alloc N free test
===================
1 alloc N free(8): 0=1223 1=645 2=417 3=643 4=538 5=612 6=557 7=564 Average=650
1 alloc N free(16): 0=3828 1=3140 2=3116 3=3166 4=3122 5=3172 6=3129 7=3172 Average=3231
1 alloc N free(32): 0=4116 1=3207 2=3180 3=3210 4=3183 5=3202 6=3182 7=3214 Average=3312
1 alloc N free(64): 0=4116 1=3017 2=3005 3=3013 4=3006 5=3016 6=3004 7=3020 Average=3149
1 alloc N free(128): 0=5282 1=3013 2=3010 3=3014 4=3009 5=3010 6=3005 7=3013 Average=3295
1 alloc N free(256): 0=5271 1=2750 2=2751 3=2751 4=2751 5=2751 6=2752 7=2751 Average=3066
1 alloc N free(512): 0=5304 1=2293 2=2295 3=2295 4=2295 5=2294 6=2295 7=2294 Average=2671
1 alloc N free(1024): 0=6017 1=2044 2=2043 3=2043 4=2043 5=2045 6=2044 7=2044 Average=2541
1 alloc N free(2048): 0=7162 1=2086 2=2084 3=2086 4=2086 5=2085 6=2086 7=2086 Average=2720
1 alloc N free(4096): 0=7133 1=1795 2=1796 3=1796 4=1797 5=1795 6=1796 7=1795 Average=2463

So we have 3 different regimes here (order 0):

1. SLUB wins in size 8 because the cpu slab is never given up, given that 
   we have 512 8-byte objects in a 4K slab (the same scenario that we
   can produce with slub_min_order=9).

2. At the high end (>= 2048) the additional overhead of SLAB when 
   allocating objects makes SLUB again performance wise better or similar.

3. There is an intermediate range where the cpu slab has to be changed
   (and thus list_lock has to be used) and where multiple cpus that free
   objects can simultaneously hit the slab lock of the same slab page. That 
   is the cause of the regression.

The regression is contained because:

1. The contention only exists when concurrently freeing and allocating
   from the same slab page. SLUB has provisions for keeping a single 
   allocator and a single freer in the clear by moving the freelist 
   pointer when a slab page becomes a cpu slab. Concurrent allocs and 
   frees are possible without cacheline contention. The problem only comes
   about when multiple free operations occur simultaneously on the same
   slab page.
 
2. The number of objects in a slab is limited. For the 128 byte size of
   interest here we have 32 objects in a slab. The maximum theoretical 
   contention on the slab lock is through 32 cpus if they are all freeing 
   simultaneously.

3. The tests are an artificial scenario. Typically one would have 
   additional processing in both the alloc and the free paths, which would
   reduce the contention. And without complex processing on alloc, one 
   would typically allocate the object on the remote processor (warming 
   up its cache) instead of allocating objects while freeing them on 
   remote processors.

4. The contention is reduced the larger the object size becomes since
   this will reduce the number of object per slab page. Thus the number
   of processors taking the lock simultaneously is also reduced.

5. The contention is also reduced through normal operations that will
   lead to the creation of partially populated slabs. If we cannot 
   allocate a large number of objects from the same slab (typical if
   the system has been running for a while) then the contention is 
   constrained.

We could address this issue by:

1. Detecting the contention while taking the slab lock and then deferring 
   the free in these special situations (e.g. by using RCU to free the 
   objects).

2. Not allocating from the same slab if we detect contention.

But given all the boundaries for the contention I would think that it is 
not worth addressing. 

> being able to utilize order-0 pages was supposed to be one of the big 
> plusses of SLUB, so booting with _2MB_ sized slabs cannot be seriously 
> the "fix", right?

Booting with 2MB-sized slabs is certainly not the mainstream fix, although 
I expect that some SGI machines may run with that as the default, given 
that they have a couple of terabytes of memory, and the 2MB 
configurations also reduce TLB pressure significantly. The use of 2MB 
slabs is not motivated by the contention that we see here but rather by 
the improved scaling and nice interaction with the huge page management 
framework provided by the antifrag patches.

Order-0 allocations were never the emphasis of SLUB. The emphasis was on 
being able to use higher-order allocs in order to make slab operations 
scale better.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 12:26         ` Ingo Molnar
@ 2007-12-21 16:26           ` Christoph Lameter
  2007-12-21 16:33             ` Ingo Molnar
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 16:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> and this is not the only regression:
> 
>     http://lkml.org/lkml/2007/10/4/290
> 
> _6%_ TPC-C regression. That's _a lot_ in TPC-C terms.
> 
> and just like in this case there were very clear profiles posted. I 
> proffer, reading back the whole thread, that if you fix hackbench you 
> have fixed TPC-C as well.

There are patches pending to address these issues. AFAICT Intel is 
testing whether the regression is still there. There is no way for me to 
verify what is going on there, and there is the constant difficulty of 
getting detailed information about what is going on at Intel. Every 
couple of months I get a result from that test. It's a really crappy 
situation where a lot of confusing information is passed around.

> So i believe you should either send some sensible fixes _NOW_, or admit 
> that the "no queues" NIH nonsense of SLUB doesn't work and do an edible, 
> incremental patchset against SLAB to bring in the debuggability features 
> of SLUB without killing SLAB's performance. (And fix the NUMA alien 
> cache problem along the lines suggested before - perhaps initially by 
> making 'noaliencache' the default bootup option.) And we obviously must 
> revert the default in 2.6.24 to SLAB as well.

The alien cache fixes that you proposed do not work, since we would still 
have to check whether the object needs to be put into an alien cache. 
These per-object locality checks are the main cause of trouble when 
freeing in SLAB.

SLUB is generally superior to SLAB performance-wise, as demonstrated by 
my measurements, which you reviewed too. In addition, SLUB has many 
improvements over SLAB. It's not only the runtime debuggability, which 
would be difficult to add to SLAB since the numerous runtime checks 
needed in critical code paths would cause regressions.



* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 16:26           ` Christoph Lameter
@ 2007-12-21 16:33             ` Ingo Molnar
  2007-12-21 21:56               ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 16:33 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki


* Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 21 Dec 2007, Ingo Molnar wrote:
> 
> > and this is not the only regression:
> > 
> >     http://lkml.org/lkml/2007/10/4/290
> > 
> > _6%_ TPC-C regression. That's _a lot_ in TPC-C terms.
> > 
> > and just like in this case there were very clear profiles posted. I 
> > proffer, reading back the whole thread, that if you fix hackbench 
                                                 ^^^^^^^^^^^^^^^^^^^^
> > you have fixed TPC-C as well.
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 
> There are patches pending to address these issues. AFAICT Intel is 
> testing whether the regression is still there. There is no way for me to 
> verify what is going on there, and there is the constant difficulty of 
> getting detailed information about what is going on at Intel. Every 
> couple of months I get a result from that test. It's a really crappy 
> situation where a lot of confusing information is passed around.

of course there is a way to find out, and that's why i mailed you: fix 
the hackbench regression and i'm quite sure you'll improve the TPC-C 
numbers as well. It shows the same kind of overhead in the profile and 
takes just a few seconds to run. Are your pending SLUB patches in 
2.6.24-rc5-mm1 already?

	Ingo


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 16:18         ` Christoph Lameter
@ 2007-12-21 16:44           ` Linus Torvalds
  2007-12-21 22:11             ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2007-12-21 16:44 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Rafael J. Wysocki



On Fri, 21 Dec 2007, Christoph Lameter wrote:
> 
> So we have 3 different regimes here (order 0):
[ ... ]
> The regression is contained because:
[ ... ]

Christoph, *NONE* of these arguments matter at all.

The only thing that matters is the simple fact that real-life benchmarks 
show that SLUB is worse. It doesn't matter one whit that you can say that 
it's better under some circumstance that either isn't triggered, or when 
setting unrealistic and unusable parameters (i.e. big internal slab orders).

> We could address this issue by:
[...]
> But given all the boundaries for the contention I would think that it is 
> not worth addressing. 

.. and this seems to be the single biggest reason to just say "revert SLUB 
entirely".

If you aren't even motivated to fix the problems that have been reported, 
then SLUB isn't even a _potential_ solution, it's purely a problem, and 
since I am not IN THE LEAST interested in having three different 
allocators around, SLUB is going to get axed.

In other words, here's the low-down:

 a) either SLUB can replace SLAB, or SLUB is going away

    This isn't open for discussion, Christoph. This was the rule back when 
    it got merged in the first place.

 b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being 
    one, and while hackbench may be borderline, it's certainly less 
    borderline than others), then SLUB cannot replace SLAB.

    See (a)

 c) If you aren't even interested in trying to fix it and are ignoring 
    the reports, there is not even a _potential_ for SLUB for getting over 
    these problems, and I should disable it and we should get over it as 
    soon as possible, and this whole discussion is pretty much ended.

See? It really is that simple. Either you say "Hell yes, I'll fix it", or 
SLUB goes away. There simply is no other choice.

When you say "not worth addressing", that to me is a clear and unambiguous 
"let's remove SLUB".

The main reason for SLUB in the first place was SGI concerns. You seem to 
think that _only_ SGI concerns matter. Wrong. If SLUB remains an SGI-only 
thing and you don't fix it for other users, then SLUB gets reverted from 
the standard kernel.

That's all. And it's not really worth discussing. 

			Linus




* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 12:09       ` Ingo Molnar
  2007-12-21 12:26         ` Ingo Molnar
  2007-12-21 16:18         ` Christoph Lameter
@ 2007-12-21 17:44         ` Pekka Enberg
  2007-12-21 22:22           ` Christoph Lameter
  2 siblings, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2007-12-21 17:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Steven Rostedt, LKML, Andrew Morton,
	Linus Torvalds, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki

Hi,

Christoph Lameter <clameter@sgi.com> wrote:
> > In an extreme case (boot with slub_min_order=9 to get huge page sized
> > slabs) SLUB can win against SLAB:
> >
> > N=10 Time: 0.338      Minimally faster
> > N=20 Time: 0.560      10% faster
> > N=50 Time: 1.353      15% faster

On Dec 21, 2007 2:09 PM, Ingo Molnar <mingo@elte.hu> wrote:
> what's up with this regression? There's been absolutely no activity
> about it in the last 8 days: upstream still regresses, -mm still
> regresses and there are no patches posted for testing.

Christoph, did you see Steven's oprofile results for the hackbench
runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting
add_partial like crazy?

                                Pekka


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 16:33             ` Ingo Molnar
@ 2007-12-21 21:56               ` Christoph Lameter
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 21:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> > There are patches pending to address these issues. AFAICT Intel is 
> > testing whether the regression is still there. There is no way for me to 
> > verify what is going on there, and there is the constant difficulty of 
> > getting detailed information about what is going on at Intel. Every 
> > couple of months I get a result from that test. It's a really crappy 
> > situation where a lot of confusing information is passed around.
> 
> of course there is a way to find out, and that's why i mailed you: fix 
> the hackbench regression and i'm quite sure you'll improve the TPC-C 
> numbers as well. It shows the same kind of overhead in the profile and 
> takes just a few seconds to run. Are your pending SLUB patches in 
> 2.6.24-rc5-mm1 already?

The tests that I wrote emulate the test behavior that was described to 
me.

The fixes in 2.6.24-rc5-mm1 improved those numbers. See 
http://lkml.org/lkml/2007/10/27/245 which I quoted earlier to you. 
However, I have no TPC-C setup here and from what I hear it takes weeks to 
run and requires a large support team for tuning.

You can find the slab test suite for that at

http://git.kernel.org/?p=linux/kernel/git/christoph/vm.git;a=shortlog;h=tests

AFAICT the fixes in 2.6.24-rc5-mm1 result in double the alloc performance 
(fastpath) of SLAB.

There are fixes that are not merged yet (the cpu alloc patchset) that 
seem to raise that factor to 3, because we can use the segment register 
to avoid per-cpu array lookups in the fast path.


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 16:44           ` Linus Torvalds
@ 2007-12-21 22:11             ` Christoph Lameter
  2007-12-21 22:16               ` Peter Zijlstra
  2007-12-21 22:49               ` Linus Torvalds
  0 siblings, 2 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 22:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Linus Torvalds wrote:

> If you aren't even motivated to fix the problems that have been reported, 
> then SLUB isn't even a _potential_ solution, it's purely a problem, and 
> since I am not IN THE LEAST interested in having three different 
> allocators around, SLUB is going to get axed.

Not motivated? I have analyzed the problem in detail, and when it comes 
down to it there is not much impact that I can see in real-life 
applications. I have also always responded to the regression reported 
via TPC-C. There is still no answer to my post at 
http://lkml.org/lkml/2007/10/27/245

SLAB has similar issues when freeing to remote nodes under NUMA. It's 
even worse since the lock is not per slab page but per slab cache. The 
alien caches that are then used have one lock per node.

> In other words, here's the low-down:
> 
>  a) either SLUB can replace SLAB, or SLUB is going away
> 
>     This isn't open for discussion, Christoph. This was the rule back when 
>     it got merged in the first place.

It obviously can replace it and has replaced it for a while now.

>  b) if SLUB performs worse than SLAB on real loads (TPC-C certainly being 
>     one, and while hackbench may be borderline, it's certainly less 
>     borderline than others), then SLUB cannot replace SLAB.

Looks like I need to find someone to get that going here so that I am 
able to test under it.

>  c) If you aren't even interested in trying to fix it and are ignoring 
>     the reports, there is not even a _potential_ for SLUB for getting over 
>     these problems, and I should disable it and we should get over it as 
>     soon as possible, and this whole discussion is pretty much ended.

I have looked at this issue under various circumstances for the last 
week. Nothing really convinced me that this is so serious. I could not 
figure out whether it is really worth worrying about, but if you put it 
that way then of course it is.
 
> The main reason for SLUB in the first place was SGI concerns. You seem to 
> think that _only_ SGI concerns matter. Wrong. If SLUB remains an SGI-only 
> thing and you don't fix it for other users, then SLUB gets reverted from 
> the standard kernel.

The reason for SLUB was dysfunctional behavior of SLAB in many areas. It 
was not just SGI's concerns that drove it.

> That's all. And it's not really worth discussing. 

Well, SLUB is still really superior to SLAB in many ways....

- SLUB is performance-wise much faster than SLAB. This can be more than a
  factor of 10 (case of concurrent allocations / frees on multiple
  processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB

- Remote freeing of objects on NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. It requires a recompile of the kernel
  and the resulting output is difficult to interpret. SLUB can apply
  debugging options to a subset of the slab caches, allowing the rest
  of the system to run at maximum speed. This is necessary to detect
  difficult-to-reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem
  gets worse the more processors and NUMA nodes are in the system.

- SLAB requires a pass through all slab caches every 2 seconds to
  expire objects. This is a problem both for realtime and MPI jobs
  that cannot take such a processor outage.

- SLAB does not have a sophisticated slabinfo tool to report the
  state of slab objects on the system. SLUB's slabinfo tool can also
  provide details of object use.

- SLAB requires the update of two words for freeing and allocation.
  SLUB can do it by updating a single word, which makes it possible to
  avoid enabling and disabling interrupts if the processor supports an
  atomic instruction for that purpose. This is important for realtime
  kernels, where special measures may otherwise have to be implemented.

- SLAB requires memory to be set aside for queues (number of processors
  times number of slabs times queue size). SLUB requires none of that.

- SLUB merges slab caches with similar characteristics to
  reduce the memory footprint even further.

- SLAB performs object-level NUMA management, which creates considerable
  allocator complexity. SLUB manages NUMA at the level of slab pages.

- SLUB allows remote node defragmentation to avoid the buildup
  of large partial lists on a single node.

- SLUB can actively reduce the fragmentation of slabs through
  slab cache specific callbacks (not merged yet)

- SLUB has resiliency features that allow it to isolate a problem
  object and continue after diagnostics have been performed.

- SLUB creates rarely used DMA caches on demand instead of creating
  them all at bootup (as SLAB does).


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:11             ` Christoph Lameter
@ 2007-12-21 22:16               ` Peter Zijlstra
  2007-12-21 22:17                 ` Peter Zijlstra
  2007-12-21 22:49               ` Linus Torvalds
  1 sibling, 1 reply; 93+ messages in thread
From: Peter Zijlstra @ 2007-12-21 22:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki


On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote:
> On Fri, 21 Dec 2007, Linus Torvalds wrote:
> 
> > If you aren't even motivated to fix the problems that have been reported, 
> > then SLUB isn't even a _potential_ solution, it's purely a problem, and 
> > since I am not IN THE LEAST interested in having three different 
> > allocators around, SLUB is going to get axed.
> 
> Not motivated? I have analyzed the problem in detail, and when it comes 
> down to it there is not much impact that I can see in real-life 
> applications. I have also always responded to the regression reported 
> via TPC-C. 

But you are dismissing the hackbench regression, which is not a small
one. It runs an astonishing 10x slower.





* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:16               ` Peter Zijlstra
@ 2007-12-21 22:17                 ` Peter Zijlstra
  2007-12-21 22:27                   ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Peter Zijlstra @ 2007-12-21 22:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki


On Fri, 2007-12-21 at 23:16 +0100, Peter Zijlstra wrote:
> On Fri, 2007-12-21 at 14:11 -0800, Christoph Lameter wrote:
> > On Fri, 21 Dec 2007, Linus Torvalds wrote:
> > 
> > > If you aren't even motivated to fix the problems that have been reported, 
> > > then SLUB isn't even a _potential_ solution, it's purely a problem, and 
> > > since I am not IN THE LEAST interested in having three different 
> > > allocators around, SLUB is going to get axed.
> > 
> > Not motivated? I have analyzed the problem in detail, and when it comes 
> > down to it there is not much impact that I can see in real-life 
> > applications. I have also always responded to the regression reported 
> > via TPC-C. 
> 
> But you are dismissing the hackbench regression, which is not a small
> one. It runs an astonishing 10x slower.
> 

BTW, does /proc/slabinfo exist again? I thought we set that as a
requirement for SLUB to be the default and a full replacement for SLAB.




* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 17:44         ` Pekka Enberg
@ 2007-12-21 22:22           ` Christoph Lameter
  2007-12-21 22:37             ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 22:22 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Pekka Enberg wrote:

> Christoph, did you see Steven's oprofile results for the hackbench
> runs (http://lkml.org/lkml/2007/12/8/198)? Any ideas why we're hitting
> add_partial like crazy?

Hmmm... SLUB is cycling through partial slabs. If it gets fed slabs with 
a single free object from the partial list again and again, then this is 
the behavior that one would see.

The worst case scenario would be:

1. Processor 0 gets a slab with one free entry from the partial list.

2. Processor 0 allocates the object and deactivates the slab since it is
   full.

3. Processor 1 frees the object and finds that the slab is not on the
   partial list (it had no free objects), so it calls put_partial().

4. Processor 0 gets the slab with one free entry from the partial list
   again.

If we made the partial list keep a minimum size (which is already done 
through MIN_PARTIAL; maybe just increase it) then we may be able to avoid 
this particular case. Also we could make sure that the slab is not put at 
the beginning of the partial list on free, in order to increase the time 
that the slab spends on the partial list. That way it will gather more 
objects before it is picked up.

Hmmmm... This is a bit different from what I got to here.








* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:17                 ` Peter Zijlstra
@ 2007-12-21 22:27                   ` Christoph Lameter
  2007-12-21 22:54                     ` Ingo Molnar
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 22:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Peter Zijlstra wrote:

> BTW, does /proc/slabinfo exist again? I thought we set that as a
> requirement for SLUB to be the default and a full replacement for SLAB.

Well, the information that would be exposed in /proc/slabinfo would only 
be faked-up stuff from SLUB. The queues that are described in 
/proc/slabinfo do not exist, f.e. (so the tunables make no sense), and 
there are more details available via /sys/kernel/slab.

There is a discussion on linux-mm with some of the people working on 
these tools, to make them use the information that SLUB provides.





* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:22           ` Christoph Lameter
@ 2007-12-21 22:37             ` Christoph Lameter
  2007-12-21 22:51               ` Linus Torvalds
                                 ` (2 more replies)
  0 siblings, 3 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 22:37 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki

Actually, thanks for pointing that out, Pekka.... Here is a patch that 
takes the regression almost completely away. (Too much jetlag combined 
with flu seems to have impaired my thinking; I should have seen this 
earlier.) I still need to run my slab performance tests on this. But 
hackbench improves.


SLUB: Improve hackbench speed

Increase the minimum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c	2007-12-21 14:30:33.656441994 -0800
@@ -172,7 +172,7 @@ static inline void ClearSlabDebug(struct
  * Mininum number of partial slabs. These will be left on the partial
  * lists even if they are empty. kmem_cache_shrink may reclaim them.
  */
-#define MIN_PARTIAL 2
+#define MIN_PARTIAL 5
 
 /*
  * Maximum number of desirable partial slabs.
@@ -1616,7 +1616,7 @@ checks_ok:
 	 * then add it.
 	 */
 	if (unlikely(!prior))
-		add_partial(get_node(s, page_to_nid(page)), page);
+		add_partial_tail(get_node(s, page_to_nid(page)), page);
 
 out_unlock:
 	slab_unlock(page);


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:11             ` Christoph Lameter
  2007-12-21 22:16               ` Peter Zijlstra
@ 2007-12-21 22:49               ` Linus Torvalds
  1 sibling, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2007-12-21 22:49 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Steven Rostedt, LKML, Andrew Morton, Peter Zijlstra,
	Christoph Hellwig, Rafael J. Wysocki



On Fri, 21 Dec 2007, Christoph Lameter wrote:
> 
> It obviously can replace it and has replaced it for awhile now.

No. If there are 6% performance regressions on TPC-C, then it CAN NOT 
replace it!

> Well still SLUB is really superior to SLAB in many ways....
> 
> - SLUB is performance-wise much faster than SLAB.

No.

Christoph, statements like this are *exactly* why I don't think SLUB can 
make it.

You're closing your eyes to real performance *regression* reports, and 
then you claim that SLUB is faster, because you find your own 
microbenchmarks that show so for specific loads.

But those specific loads are apparently never the issue.

So stop saying that SLUB is "much faster", as long as hackbench shows that 
that is simply NOT THE CASE.

It doesn't matter one whit if SLUB is lots faster, if it's faster for 
cases that never matter. So far, I don't think we've *ever* seen any 
actual benchmarks that involve any kind of real use where SLUB is 
really faster, and we have some major examples of SLUB being disastrously 
slower!

Your special-case kmalloc performance tests don't matter, when real user 
programs show the exact opposite effect.

And the fact that you dismiss those real user programs just because you 
have your own test harness is why I'm ready to throw in the towel on SLUB. 

I don't mind performance regressions, but I *do* mind performance 
regressions when the main author then says "look, a helicopter!" and 
points to some totally different thing and claims that the performance 
regression doesn't even exist!

Because those kinds of performance regressions never get fixed, because 
they are ignored.

		Linus


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:37             ` Christoph Lameter
@ 2007-12-21 22:51               ` Linus Torvalds
  2007-12-21 23:17                 ` Ingo Molnar
  2007-12-21 22:52               ` Steven Rostedt
  2007-12-21 22:56               ` Ingo Molnar
  2 siblings, 1 reply; 93+ messages in thread
From: Linus Torvalds @ 2007-12-21 22:51 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Ingo Molnar, Steven Rostedt, LKML, Andrew Morton,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki



On Fri, 21 Dec 2007, Christoph Lameter wrote:
>
> Actually thanks for pointing that out Pekka.... Here is a patch that takes 
> the regression almost completely away

Now *this* is what I wanted to see - rather than argue against other 
peoples performance regression reports, an actual patch that acknowledges 
the problem.

Thanks.

And now, can the people who made the problem reports and complained about 
SLUB please test the patch - the ball is now in your court!

		Linus


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:37             ` Christoph Lameter
  2007-12-21 22:51               ` Linus Torvalds
@ 2007-12-21 22:52               ` Steven Rostedt
  2007-12-21 22:56               ` Ingo Molnar
  2 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-21 22:52 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Ingo Molnar, LKML, Andrew Morton, Linus Torvalds,
	Peter Zijlstra, Christoph Hellwig, Rafael J. Wysocki


On Fri, 21 Dec 2007, Christoph Lameter wrote:

> Actually, thanks for pointing that out, Pekka.... Here is a patch that
> takes the regression almost completely away. (Too much jetlag combined
> with flu seems to have impaired my thinking; I should have seen this
> earlier.) I still need to run my slab performance tests on this. But
> hackbench improves.
>
>
> SLUB: Improve hackbench speed
>
> Increase the minimum number of partial slabs to keep around and put
> partial slabs to the end of the partial queue so that they can add
> more objects.
>
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
>

FYI, I'm currently on holiday till Jan 2nd, and I also lost access to the
machine I did the original benchmarks on. But I'll try to reserve it again
after the New Year, and I will run the benchmarks including this patch and
report my findings.

Thanks,

-- Steve



* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:27                   ` Christoph Lameter
@ 2007-12-21 22:54                     ` Ingo Molnar
  2007-12-21 23:20                       ` David Miller
                                         ` (3 more replies)
  0 siblings, 4 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 22:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


* Christoph Lameter <clameter@sgi.com> wrote:

> On Fri, 21 Dec 2007, Peter Zijlstra wrote:
> 
> > BTW, does /proc/slabinfo exist again? I thought we set that as a 
> > requirement for SLUB to be the default and a full replacement for 
> > SLAB.
> 
> Well, the information that would be exposed in /proc/slabinfo would 
> only be faked-up stuff from SLUB. The queues that are described in 
> /proc/slabinfo do not exist, f.e. (so the tunables make no sense), and 
> there are more details available via /sys/kernel/slab.

Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop 
relies on it, people use it every day to monitor memory consumption.

I'm really getting worried that you are apparently incapable of grasping 
such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or 
whatever else in some of the fields? People use it to know how many 
objects are allocated and sure SLUB knows that count, sheesh. How on 
earth can you come up with a lame excuse like that? You don't like the 
'SLAB' portion of the name perhaps? Is it NIH again?

Really, if your behavior is representative of how our SLAB allocator 
will be maintained in the future then i'm very, very worried :-( You 
ignore and downplay clear-cut regressions, you insult and attack 
testers, you are incredibly stupid about user ABIs (or pretend to be so) 
and you distort and mislead all the way. What will you be able to do in 
the much less clear-cut cases??

	Ingo


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:37             ` Christoph Lameter
  2007-12-21 22:51               ` Linus Torvalds
  2007-12-21 22:52               ` Steven Rostedt
@ 2007-12-21 22:56               ` Ingo Molnar
  2 siblings, 0 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 22:56 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Pekka Enberg, Steven Rostedt, LKML, Andrew Morton,
	Linus Torvalds, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki


* Christoph Lameter <clameter@sgi.com> wrote:

>  	if (unlikely(!prior))
> -		add_partial(get_node(s, page_to_nid(page)), page);
> +		add_partial_tail(get_node(s, page_to_nid(page)), page);

FYI, this gives me:

 mm/slub.c: In function 'kfree':
 mm/slub.c:2604: warning: passing argument 3 of 'slab_free' discards qualifiers from pointer target type

looks harmless though at first sight.

	Ingo


* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:51               ` Linus Torvalds
@ 2007-12-21 23:17                 ` Ingo Molnar
  2007-12-21 23:40                   ` Pekka Enberg
  2007-12-21 23:54                   ` Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-21 23:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Lameter, Pekka Enberg, Steven Rostedt, LKML,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And now, can the people who made the problem reports and complained 
> about SLUB please test the patch - the ball is now in your court!

yep, and i ran a quick comparison test on a 2-core box with 3 kernels:

  [ best of 5 runs in a row which had a relative jitter of less than 10% ]

     MIN      v2.6.24.slab     v2.6.24.slub v2.6.24.slub.fix
  ----------------------------------------------------------
           mmap:    429.00    402.00 ( -6%)    385.00 (-10%)
         select:     11.38     10.46 ( -8%)     11.41 (  0%)
      proc-exec:    121.52    116.77 ( -3%)    120.77 (  0%)
      proc-fork:    106.84    106.19 (  0%)    107.92 (  1%)
   syscall-open:      3.09      3.13 (  1%)      3.25 (  4%)
   hackbench-50:      2.85      3.47 ( 21%)      2.88 (  1%)

and the regression seems to be largely fixed! Not only is the hackbench 
one fixed, but mmap shows an above-noise improvement as well.

Acked-by: Ingo Molnar <mingo@elte.hu>

And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson 
;-)

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:54                     ` Ingo Molnar
@ 2007-12-21 23:20                       ` David Miller
  2007-12-22  9:45                         ` Ingo Molnar
  2007-12-21 23:27                       ` Andrew Morton
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 93+ messages in thread
From: David Miller @ 2007-12-21 23:20 UTC (permalink / raw)
  To: mingo
  Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, akpm, hch, rjw

From: Ingo Molnar <mingo@elte.hu>
Date: Fri, 21 Dec 2007 23:54:13 +0100

> Really, if your behavior is representative of how our SLAB allocator 
> will be maintained in the future then i'm very, very worried :-(

Actually, to the contrary, I actually think Christoph responds to
every problem I've ever reported to him about his per-cpu counters
work and SLUB much better than most people who call themselves
"maintainers" around here.

And I say that without any reservations.

He doesn't deserve the ad-hominem attacks he is getting today, because
he does resolve every problem reported to him.

The guy wrote test cases, he analyzed every problem, he wrote test
patches, and he doesn't stop doing any of that until the issue really
is reported as resolved by the testers.

I'll take Christoph as the implementor and maintainer of anything, any
day of the week.  He rocks.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:54                     ` Ingo Molnar
  2007-12-21 23:20                       ` David Miller
@ 2007-12-21 23:27                       ` Andrew Morton
  2007-12-21 23:51                       ` Christoph Lameter
  2007-12-22  0:28                       ` Andi Kleen
  3 siblings, 0 replies; 93+ messages in thread
From: Andrew Morton @ 2007-12-21 23:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, hch, rjw

On Fri, 21 Dec 2007 23:54:13 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> * Christoph Lameter <clameter@sgi.com> wrote:
> 
> > On Fri, 21 Dec 2007, Peter Zijlstra wrote:
> > 
> > > BTW, does /proc/slabinfo exist again? I thought we set that as a 
> > > requirement for SLUB to be the default and a full replacement for 
> > > SLAB.
> > 
> > Well the information that would be exposed in /proc/slabinfo would 
> > only be faked up stuff from SLUB. The queues that are described in 
> > /proc/slabinfo do not exist f.e. (tunables do not make sense) and 
> > there are more details available via /sys/kernel/slab.
> 
> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it.

That would be really bad.

/proc/slabinfo intimately exposes implementation-specific internals and
strictly speaking should never have been introduced.  Obviously that isn't
a _practical_ position, but there are no easy solutions here.

We don't want to have to drag a pile of be-compatible-with-slab gunk along
with us for ever.

> slabtop 
> relies on it, people use it every day to monitor memory consumption.

slabtop needs to be changed asap.

Possibly we could look at adding some interim make-it-look-like-slab hack
into slub.c for a couple of releases but no longer.

If we'd updated slabtop six months ago this issue wouldn't have arisen now.

But we didn't do that.   We're quite bad at this sort of thing.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 23:17                 ` Ingo Molnar
@ 2007-12-21 23:40                   ` Pekka Enberg
  2007-12-21 23:54                   ` Christoph Lameter
  1 sibling, 0 replies; 93+ messages in thread
From: Pekka Enberg @ 2007-12-21 23:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Christoph Lameter, Steven Rostedt, LKML,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki

Hi,

On Dec 22, 2007 1:17 AM, Ingo Molnar <mingo@elte.hu> wrote:
> yep, and i ran a quick comparison test on a 2-core box with 3 kernels:
>
>   [ best of 5 runs in a row which had a relative jitter of less than 10% ]
>
>      MIN      v2.6.24.slab     v2.6.24.slub v2.6.24.slub.fix
>   ----------------------------------------------------------
>            mmap:    429.00    402.00 ( -6%)    385.00 (-10%)
>          select:     11.38     10.46 ( -8%)     11.41 (  0%)
>       proc-exec:    121.52    116.77 ( -3%)    120.77 (  0%)
>       proc-fork:    106.84    106.19 (  0%)    107.92 (  1%)
>    syscall-open:      3.09      3.13 (  1%)      3.25 (  4%)
>    hackbench-50:      2.85      3.47 ( 21%)      2.88 (  1%)
>
> and the regression seems to be largely fixed! Not only is the hackbench
> one fixed, but mmap shows an above-noise improvement as well.
>
> Acked-by: Ingo Molnar <mingo@elte.hu>

I thought it might be a bug but couldn't figure it out. The patch
looks good to me too, Christoph.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

On Dec 22, 2007 1:17 AM, Ingo Molnar <mingo@elte.hu> wrote:
> And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson
> ;-)

Heh, thanks, but grep me from MAINTAINERS some day, Ingo :-).

                                Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:54                     ` Ingo Molnar
  2007-12-21 23:20                       ` David Miller
  2007-12-21 23:27                       ` Andrew Morton
@ 2007-12-21 23:51                       ` Christoph Lameter
  2007-12-22  0:28                       ` Andi Kleen
  3 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 23:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Linus Torvalds, Steven Rostedt, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

On Fri, 21 Dec 2007, Ingo Molnar wrote:

> I'm really getting worried that you are apparently incapable of grasping 
> such _SIMPLE_ concepts. Who the heck cares whether you put in zeros or 
> whatever else in some of the fields? People use it to know how many 
> objects are allocated and sure SLUB knows that count, sheesh. How on 
> earth can you come up with a lame excuse like that? You dont like the 
> 'SLAB' portion of the name perhaps? Is it NIH again?

NIH? I wrote major portions of SLAB. I would be hating my own product.
Could you get the facts straight at some point? This is getting weird.

> Really, if your behavior is representative of how our SLAB allocator 
> will be maintained in the future then i'm very, very worried :-( You 
> ignore and downplay clear-cut regressions, you insult and attack 
> testers, you are incredibly stupid about user ABIs (or pretend to be so) 
> and you distort and mislead all the way. What will you be able to do in 
> the much less clear-cut cases??

I analyzed the issue and argued that the problem that one test showed in 
SLUB is really a special case, and from that you conclude that I ignore all 
regressions? I have addressed and responded to all reports of regressions 
that came to me.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 23:17                 ` Ingo Molnar
  2007-12-21 23:40                   ` Pekka Enberg
@ 2007-12-21 23:54                   ` Christoph Lameter
  2007-12-22  0:02                     ` Chuck Ebbert
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-21 23:54 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Pekka Enberg, Steven Rostedt, LKML,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki

On Sat, 22 Dec 2007, Ingo Molnar wrote:

> and the regression seems to be largely fixed! Not only is the hackbench 
> one fixed, but mmap shows an above-noise improvement as well.
> 
> Acked-by: Ingo Molnar <mingo@elte.hu>

Well lets use the version attached to this patch. It turns out that it is 
not important to increase the number of partial slabs. Just putting the 
slab onto the tail of the list does it.
 
> And i hereby nominate Pekka as SLUB/SLAB co-maintainer and spokesperson 
> ;-)

He is already the co-maintainer of the slab allocators. See the 
MAINTAINERS file.



SLUB: Improve hackbench speed

Increase the minimum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.

Signed-off-by: Christoph Lameter <clameter@sgi.com>

---
 mm/slub.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-12-21 14:30:11.096324462 -0800
+++ linux-2.6/mm/slub.c	2007-12-21 15:10:17.611022381 -0800
@@ -1611,12 +1611,12 @@ checks_ok:
 		goto slab_empty;
 
 	/*
-	 * Objects left in the slab. If it
-	 * was not on the partial list before
-	 * then add it.
+	 * Objects left in the slab. If it was not on the partial list before
+	 * then add it. Add it to the end since there is only a single object
+	 * which would make slab_alloc inefficient.
 	 */
 	if (unlikely(!prior))
-		add_partial(get_node(s, page_to_nid(page)), page);
+		add_partial_tail(get_node(s, page_to_nid(page)), page);
 
 out_unlock:
 	slab_unlock(page);

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 23:54                   ` Christoph Lameter
@ 2007-12-22  0:02                     ` Chuck Ebbert
  0 siblings, 0 replies; 93+ messages in thread
From: Chuck Ebbert @ 2007-12-22  0:02 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Ingo Molnar, Linus Torvalds, Pekka Enberg, Steven Rostedt, LKML,
	Andrew Morton, Peter Zijlstra, Christoph Hellwig,
	Rafael J. Wysocki

On 12/21/2007 06:54 PM, Christoph Lameter wrote:
> 
> SLUB: Improve hackbench speed
> 
> Increase the minimum number of partial slabs to keep around and put
> partial slabs to the end of the partial queue so that they can add
> more objects.
> 
> Signed-off-by: Christoph Lameter <clameter@sgi.com>
> 


You forgot to update the commit message.

Should be:


Put partial slabs to the end of the partial queue so that they can add
more objects.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 22:54                     ` Ingo Molnar
                                         ` (2 preceding siblings ...)
  2007-12-21 23:51                       ` Christoph Lameter
@ 2007-12-22  0:28                       ` Andi Kleen
  2007-12-22  0:33                         ` Chuck Ebbert
                                           ` (2 more replies)
  3 siblings, 3 replies; 93+ messages in thread
From: Andi Kleen @ 2007-12-22  0:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

Ingo Molnar <mingo@elte.hu> writes:

> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop 
> relies on it, people use it every day to monitor memory consumption.

It's definitely not a stable ABI. slabtop tends to exit without any
error message on any slabinfo version number increase and I've seen
that happen several times in not so old kernels.

Requiring just another slabtop update isn't really a big deal.

Also it's not that it's a critical functionality like udev.

-Andi


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22  0:28                       ` Andi Kleen
@ 2007-12-22  0:33                         ` Chuck Ebbert
  2007-12-22  2:19                         ` Matt Mackall
  2007-12-22 10:03                         ` Ingo Molnar
  2 siblings, 0 replies; 93+ messages in thread
From: Chuck Ebbert @ 2007-12-22  0:33 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

On 12/21/2007 07:28 PM, Andi Kleen wrote:
> Ingo Molnar <mingo@elte.hu> writes:
> 
>> Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop 
>> relies on it, people use it every day to monitor memory consumption.
> 
> It's definitely not a stable ABI. slabtop tends to exit without any
> error message on any slabinfo version number increase and I've seen
> that happen several times in not so old kernels.
> 
> Requiring just another slabtop update isn't really a big deal.
> 
> Also it's not that it's a critical functionality like udev.
> 

There's also Documentation/vm/slabinfo.c, which really belongs in
some kind of 'kernel utilities' package. Since it's so intimately
tied to kernel version I don't think it belongs in util-linux.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22  0:28                       ` Andi Kleen
  2007-12-22  0:33                         ` Chuck Ebbert
@ 2007-12-22  2:19                         ` Matt Mackall
  2007-12-22  2:44                           ` Andi Kleen
  2007-12-22 10:03                         ` Ingo Molnar
  2 siblings, 1 reply; 93+ messages in thread
From: Matt Mackall @ 2007-12-22  2:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki


On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote:
> Ingo Molnar <mingo@elte.hu> writes:
> 
> > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop 
> > relies on it, people use it every day to monitor memory consumption.
> 
> It's definitely not a stable ABI. slabtop tends to exit without any
> error message on any slabinfo version number increase and I've seen
> that happen several times in not so old kernels.
> 
> Requiring just another slabtop update isn't really a big deal.

Might as well mention that SLOB also doesn't provide this functionality,
nor would it make any sense for it to try (because there are no
slab-like objects to report about).

-- 
Mathematics is the supreme nostalgia of our time.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22  2:19                         ` Matt Mackall
@ 2007-12-22  2:44                           ` Andi Kleen
  0 siblings, 0 replies; 93+ messages in thread
From: Andi Kleen @ 2007-12-22  2:44 UTC (permalink / raw)
  To: Matt Mackall
  Cc: Andi Kleen, Ingo Molnar, Christoph Lameter, Peter Zijlstra,
	Linus Torvalds, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Fri, Dec 21, 2007 at 08:19:42PM -0600, Matt Mackall wrote:
> 
> On Sat, 2007-12-22 at 01:28 +0100, Andi Kleen wrote:
> > Ingo Molnar <mingo@elte.hu> writes:
> > 
> > > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. slabtop 
> > > relies on it, people use it every day to monitor memory consumption.
> > 
> > It's definitely not a stable ABI. slabtop tends to exit without any
> > error message on any slabinfo version number increase and I've seen
> > that happen several times in not so old kernels.
> > 
> > Requiring just another slabtop update isn't really a big deal.
> 
> Might as well mention that SLOB also doesn't provide this functionality,
> nor would it make any sense for it to try (because there are no
> slab-like objects to report about).

% slabinfo
Name                   Objects Objsize    Space Slabs/Part/Cpu  O/S O %Fr %Ef Flg
...
anon_vma                  1551      24    81.9K        20/18/2  128 0  90  45

Seems certainly like information that slabtop could report, even if it's
not exactly the same.

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-21 23:20                       ` David Miller
@ 2007-12-22  9:45                         ` Ingo Molnar
  2007-12-24 19:24                           ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2007-12-22  9:45 UTC (permalink / raw)
  To: David Miller
  Cc: clameter, a.p.zijlstra, torvalds, rostedt, linux-kernel, akpm, hch, rjw


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Fri, 21 Dec 2007 23:54:13 +0100
> 
> > Really, if your behavior is representative of how our SLAB allocator 
> > will be maintained in the future then i'm very, very worried :-(
> 
> Actually, to the contrary, I actually think Christoph responds to 
> every problem I've ever reported to him about his per-cpu counters 
> work and SLUB much better than most people who call themselves 
> "maintainers" around here.
> 
> And I say that without any reservations.
> 
> He doesn't deserve the ad-hominem attacks he is getting today, because 
> he does resolve every problem reported to him.
>
> The guy wrote test cases, he analyzed every problem, he wrote test 
> patches, and he doesn't stop doing any of that until the issue really 
> is reported as resolved by the testers.
> 
> I'll take Christoph as the implementor and maintainer of anything, any 
> day of the week.  He rocks.

well, maybe i got unlucky, this hackbench thing being the first time i'm 
exposed to a major SLUB regression. The hackbench problem was dead easy 
to reproduce, i (and others) offered immediate testing of whatever test 
patches, it also matched the profiles of the TPC-C regression but still 
i was only offered explanations about why this workload does not matter 
and how others suck because they are unable to give immediate test 
feedback from millions-of-dollars test equipment that is barely able to 
run our devel kernels. The regression is fixed now and i'm a happy 
camper!

Christoph, i'd like to apologize for all overly harsh words i said. (and 
i said quite a few :-/ )

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22  0:28                       ` Andi Kleen
  2007-12-22  0:33                         ` Chuck Ebbert
  2007-12-22  2:19                         ` Matt Mackall
@ 2007-12-22 10:03                         ` Ingo Molnar
  2007-12-22 12:37                           ` Andi Kleen
  2007-12-22 18:01                           ` Linus Torvalds
  2 siblings, 2 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-22 10:03 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki


* Andi Kleen <andi@firstfloor.org> wrote:

> Ingo Molnar <mingo@elte.hu> writes:
> 
> > Christoph, /proc/slabinfo is an _ABI_. You HAVE to provide it. 
> > slabtop relies on it, people use it every day to monitor memory 
> > consumption.
> 
> It's definitely not a stable ABI. slabtop tends to exit without any 
> error message on any slabinfo version number increase and I've seen 
> that happen several times in not so old kernels.

so because we sucked in the past we can continue to suck? ;)

Why are we still arguing about this? We kernel developers are foxes 
amongst the hens and if a compatibility issue comes up we have to act 
_doubly_ conservatively.

You think it's reasonable to not offer /proc/slabinfo? You can fairly 
assume that a user considers it absolutely unreasonable. If other kernel 
developers tell you: "no, it's fine", then it's as if other foxes told 
you "no, it's fine to eat that hen, we do it all the time too!" ;-)

> Requiring just another slabtop update isn't really a big deal.

certainly. But consider this from the user's perspective who tries one 
of our devel kernels. He suspects a memory leak. Runs slabtop and 
gets:

 $ slabtop
 fopen /proc/slabinfo: No such file or directory

and would fairly conclude: "ok, this new Linux kernel looks quite 
apparently unfinished, i'm outta here".

We do this way too frequently and many silly details like this _do_ 
mount up.

> Also it's not that it's a critical functionality like udev.

Sure, we can argue about details that not all fields in /proc/slabinfo 
are relevant, and that slabtop should be a bit more careful, etc., but 
we've got what we've got because _we_ built the current code, so we 
might as well accept the consequences it brings. Most of the basic 
output of slabtop:

 Active / Total Objects (% used)    : 648754 / 747076 (86.8%)
 Active / Total Slabs (% used)      : 39394 / 39394 (100.0%)
 Active / Total Caches (% used)     : 103 / 143 (72.0%)
 Active / Total Size (% used)       : 138884.36K / 151075.96K (91.9%)
 Minimum / Average / Maximum Object : 0.01K / 0.20K / 128.00K

   OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
 261928 239808  91%    0.13K   9032       29     36128K dentry_cache
 222048 174144  78%    0.05K   3084       72     12336K buffer_head
 187232 178929  95%    0.48K  23404        8     93616K ext3_inode_cache
  24416  17908  73%    0.27K   1744       14      6976K radix_tree_node

could be offered on SLUB too.

'top' isnt critical functionality either like udev, and the ABI does not 
only cover 'critical' functionality. A utility suddenly not working 
gives Linux a pretty amateurish feeling. Should we tell users/admins: 
"Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob. 
If you want your stuff to continue working, use Windows next time around 
or what. Sheesh, what do these people want!' ;-)

the rule is very simple: unless you have really, really, really, REALLY 
good reasons, just dont break stuff.

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 10:03                         ` Ingo Molnar
@ 2007-12-22 12:37                           ` Andi Kleen
  2007-12-22 12:51                             ` Pekka Enberg
  2007-12-22 13:01                             ` Alexey Dobriyan
  2007-12-22 18:01                           ` Linus Torvalds
  1 sibling, 2 replies; 93+ messages in thread
From: Andi Kleen @ 2007-12-22 12:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

> could be offered on SLUB too.

Sure (I argued that in a earlier mail in fact), but it doesn't make
sense to fake into the old slabinfo format.

> 
> 'top' isnt critical functionality either like udev, and the ABI does not 
> only cover 'critical' functionality. A utility suddenly not working 
> gives Linux a pretty amateurish feeling. Should we tell users/admins: 
> "Hehe, gotcha! Didnt you know /proc/slabinfo was not an ABI? Poor sob. 
> If you want your stuff to continue working, use Windows next time around 
> or what. Sheesh, what do these people want!' ;-)

While I sympathise with this point, in practice non-critical ABIs change
regularly (and sometimes even critical ones like sched_yield ...) 

> 
> the rule is very simple: unless you have really, really, really, REALLY 
> good reasons, just dont break stuff.

Removing an interface that exposes lots of internal details when you
rewrite the subsystem and those internal details all change seems
like a good reason to me.

Besides, the original interface was broken anyway. The version number was
one of the interface's worst ideas ever imho, and slabtop's handling of it
was quite dumb. The better way would have been to always add new fields
and never remove old ones. Hopefully the new /sys based interface
will be better.

-Andi


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 12:37                           ` Andi Kleen
@ 2007-12-22 12:51                             ` Pekka Enberg
  2007-12-22 12:54                               ` Andi Kleen
  2007-12-22 13:01                             ` Alexey Dobriyan
  1 sibling, 1 reply; 93+ messages in thread
From: Pekka Enberg @ 2007-12-22 12:51 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

Hi,

On Dec 22, 2007 2:37 PM, Andi Kleen <andi@firstfloor.org> wrote:
> Removing an interface that exposes lots of internal details when you
> rewrite the subsystem and those internal details all change seems
> like a good reason to me.

Yeah but then again removing an interface that has been around for
ever is a real PITA for users. Besides, emulating /proc/slabinfo ain't
so bad:

http://lkml.org/lkml/2007/12/22/58

                                Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 12:51                             ` Pekka Enberg
@ 2007-12-22 12:54                               ` Andi Kleen
  2007-12-22 13:27                                 ` Pekka Enberg
  0 siblings, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2007-12-22 12:54 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: Andi Kleen, Ingo Molnar, Christoph Lameter, Peter Zijlstra,
	Linus Torvalds, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

> Yeah but then again removing an interface that has been around for
> ever 

Version 2.1 hasn't been around forever and at least slabtop does not
work over version number changes in my experience.

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 12:37                           ` Andi Kleen
  2007-12-22 12:51                             ` Pekka Enberg
@ 2007-12-22 13:01                             ` Alexey Dobriyan
  1 sibling, 0 replies; 93+ messages in thread
From: Alexey Dobriyan @ 2007-12-22 13:01 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

> > the rule is very simple: unless you have really, really, really, REALLY 
> > good reasons, just dont break stuff.

Could we please drop "kset: move /sys/slab to /sys/kernel/slab" from -mm?
Old slabinfo complains that SLUB_DEBUG is not on and refuses to do anything.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 12:54                               ` Andi Kleen
@ 2007-12-22 13:27                                 ` Pekka Enberg
  0 siblings, 0 replies; 93+ messages in thread
From: Pekka Enberg @ 2007-12-22 13:27 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Ingo Molnar, Christoph Lameter, Peter Zijlstra, Linus Torvalds,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

Hi Andi,

On Dec 22, 2007 2:54 PM, Andi Kleen <andi@firstfloor.org> wrote:
> > Yeah but then again removing an interface that has been around for
> > ever
>
> Version 2.1 hasn't been around forever and at least slabtop does not
> work over version number changes in my experience.

True but the default user-visible ABI (the one without statistics) is
unchanged since version 2.0:

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=7ee832030b3a5d3a87f8f85b1fc773be15e98608

http://git.kernel.org/?p=linux/kernel/git/torvalds/old-2.6-bkcvs.git;a=commitdiff;h=e79fbf181ba738ca410f2c457ced220b738e8856

But anyway, I don't care too much either way so I'll just let Andrew
sort this out, as usual.

                                 Pekka

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 10:03                         ` Ingo Molnar
  2007-12-22 12:37                           ` Andi Kleen
@ 2007-12-22 18:01                           ` Linus Torvalds
  2007-12-22 19:09                             ` Ingo Molnar
                                               ` (2 more replies)
  1 sibling, 3 replies; 93+ messages in thread
From: Linus Torvalds @ 2007-12-22 18:01 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt,
	LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki



On Sat, 22 Dec 2007, Ingo Molnar wrote:
> > 
> > It's definitely not a stable ABI. slabtop tends to exit without any 
> > error message on any slabinfo version number increase and I've seen 
> > that happen several times in not so old kernels.
> 
> so because we sucked in the past we can continue to suck? ;)

Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and 
the fact that there hasn't been a lot of complaints about it going away 
does seem to imply that it wasn't a very important ABI.

I'm the first to stand up for backwards compatibility, but I also try to 
be realistic, and so far nobody has really complained about the fact that 
/proc/slabinfo went away on any grounds but on the general principle of 
"it was a user-visible ABI".

We've changed user-visible ABI's in the past in the hope that they weren't 
actually used. Sometimes it worked, sometimes it didn't. In this case, I 
think it still has the potential of working.

That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I 
as a developer tend to want from there, but it's really rather less 
convenient than I'd like. It is really quite hard, for example, to get any 
kind of "who actually uses the most *memory*" information out of there. 

You have to use something like this:

	for i in *
	do
		order=$(cat "$i/order")
		slabs=$(cat "$i/slabs")
		object_size=$(cat "$i/object_size")
		objects=$(cat "$i/objects")
		truesize=$(( $slabs<<($order+12) ))
		size=$(( $object_size*$objects ))
		percent=$(( $truesize/100 ))
		if [ $percent -gt 0 ]; then
			percent=$(( $size / $percent ))
		fi
		mb=$(( $size >> 20 ))
		printf "%10d MB %3d %s\n" $mb $percent $i
	done | sort -n | tail

which works fine, but while it's actually _much_ more readable than doing 
the same thing with /proc/slabinfo was, I worry about the lack of 
atomicity in getting the statistics.

I dunno. I do think /sys/slab/ is a better interface than /proc/slabinfo 
was. I just wonder if it could be better still.

			Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 18:01                           ` Linus Torvalds
@ 2007-12-22 19:09                             ` Ingo Molnar
  2007-12-22 19:29                               ` Ingo Molnar
  2007-12-22 22:52                               ` Jason L Tibbitts III
  2007-12-22 19:25                             ` Theodore Tso
  2007-12-22 19:46                             ` slabtop replacement was " Andi Kleen
  2 siblings, 2 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-22 19:09 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt,
	LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > > It's definitely not a stable ABI. slabtop tends to exit without 
> > > any error message on any slabinfo version number increase and I've 
> > > seen that happen several times in not so old kernels.
> > 
> > so because we sucked in the past we can continue to suck? ;)
> 
> Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, 
> and the fact that there hasn't been a lot of complaints about it going 
> away does seem to imply that it wasn't a very important ABI.
> 
> I'm the first to stand up for backwards compatibility, but I also try 
> to be realistic, and so far nobody has really complained about the 
> fact that /proc/slabinfo went away on any grounds but on the general 
> principle of "it was a user-visible ABI".
> 
> We've changed user-visible ABI's in the past in the hope that they 
> weren't actually used. Sometimes it worked, sometimes it didn't. In 
> this case, I think it still has the potential of working.
> 
> That said, I'm not a *huge* fan of /sys/slab/ either. I can get the 
> info I as a developer tend to want from there, but it's really rather 
> less convenient than I'd like. It is really quite hard, for example, 
> to get any kind of "who actually uses the most *memory*" information 
> out of there.

Yes, i agree that me calling it an 'ABI' stretches the term - we dont 
want to put ourselves into a forced compatibility corner in every case 
where /proc does something really stupid. But /proc/slabinfo is being 
used, slabtop is installed and deployed, so why break it unnecessarily? 
It's not like we couldnt remove /proc/slabinfo later on, via the normal 
route of feature removal. I think Pekka's patch that adds /proc/slabinfo 
is simple and reasonable enough for 2.6.24.

We can get rid of /proc/slabinfo but then it should i think be done via 
the normal route of Documentation/feature-removal-schedule.txt. Was 
there any particular problem with /proc/slabinfo, at least as far as the 
read-only access that slabtop does goes? I dont think there was. So why 
should we go out of our way _not_ to provide it when there's a patch 
available, etc. I just googled a bit and people _have_ asked about 
slabtop availability, and the impression was that it would be offered by 
the time SLUB becomes the default. (and this was mentioned by others in 
this discussion)

I'd also not rely on the fact that only a few people are complaining. 
Most people, even 2.6.24-rc early adopters, still use SLAB because early 
adopters typically use their .23 .config and do a 'make oldconfig' - 
which picks up SLAB. So SLUB use will become more widespread only once 
2.6.24 is out and is packaged in distros. Distros will likely pick SLUB 
if there's no performance worries and if it's the default. Fedora 
rawhide already uses SLUB.

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 18:01                           ` Linus Torvalds
  2007-12-22 19:09                             ` Ingo Molnar
@ 2007-12-22 19:25                             ` Theodore Tso
  2007-12-22 21:00                               ` Linus Torvalds
  2007-12-22 19:46                             ` slabtop replacement was " Andi Kleen
  2 siblings, 1 reply; 93+ messages in thread
From: Theodore Tso @ 2007-12-22 19:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

On Sat, Dec 22, 2007 at 10:01:37AM -0800, Linus Torvalds wrote:
> Well, I do have to admit that I'm not a huge fan of /proc/slabinfo, and 
> the fact that there hasn't been a lot of complaints about it going away 
> does seem to imply that it wasn't a very important ABI.
> 
> I'm the first to stand up for backwards compatibility, but I also try to 
> be realistic, and so far nobody has really complained about the fact that 
> /proc/slabinfo went away on any grounds but on the general principle of 
> "it was a user-visible ABI".

That's probably because the people who are most likely to be using
/proc/slabinfo tend to be people using kernels in production
environments, and they probably haven't started playing with SLUB yet.
So the fact we haven't heard the yelling doesn't necessarily mean that
people won't notice.

I know of a number of debugging scripts that periodically poll
/proc/slabinfo to find out what is using memory and to track trends
and predict potential problems with memory usage and balance.  Indeed,
I've left several such scripts behind at various customer
installations after debugging memory usage problems, and some system
administrators very much like being able to collect that information
easily.  So I can say for sure that there are scripts out there
that depend on /proc/slabinfo, but they generally tend to be at sites
where people won't be running bleeding edge kernels.

> That said, I'm not a *huge* fan of /sys/slab/ either. I can get the info I 
> as a developer tend to want from there, but it's really rather less 
> convenient than I'd like. It is really quite hard, for example, to get any 
> kind of "who actually uses the most *memory*" information out of there. 

I have a general problem with things in /sys/slab, and that's just
because they are *ugly*.  So yes, you can write ugly shell scripts
like this to get out information:

> You have to use something like this:
> 
> 	for i in *
> 	do
> 		order=$(cat "$i/order")
> 		slabs=$(cat "$i/slabs")
> 		object_size=$(cat "$i/object_size")
> 		objects=$(cat "$i/objects")
> 		truesize=$(( $slabs<<($order+12) ))
> 		size=$(( $object_size*$objects ))
> 		percent=$(( $truesize/100 ))
> 		if [ $percent -gt 0 ]; then
> 			percent=$(( $size / $percent ))
> 		fi
> 		kb=$(( $size >> 10 ))
> 		printf "%10d KB %3d %s\n" $kb $percent $i
> 	done | sort -n | tail

But sometimes when trying to eyeball what is going on, it's a lot
nicer just to use "cat /proc/slabinfo".  

Another problem with using /sys/slab is that it is downright *ugly*.
Why is it for example, that /sys/slab/dentry is a symlink to
../slab/:a-0000160?  Because of this, Linus's script reports dentry twice:

     10942 KB  86 buffer_head
     10942 KB  86 journal_head
     10943 KB  86 :a-0000056
     21923 KB  86 dentry
     21924 KB  86 :a-0000160
     78437 KB  94 ext3_inode_cache

Once as "dentry" and once as ":a-0000160".  A similar situation
happens with journal_head and ":a-0000056".  I'm sure this could be
filtered out with the right shell or perl script, but now we're
getting even farther from /proc/slabinfo.  

Further, if you look at Documentation/sysfs-rules.txt (does anyone
ever read it or bother to update it?), /sys/slab isn't documented so
it falls in the category of "everything else is a kernel-level
implementation detail which is not guaranteed to be stable".  Nice....

Also, looking at sysfs-rules.txt, it talks an awful lot about reading
symlinks, implying that actually trying to *dereference* the
symlinks might not work.  So I'm pretty sure Linus's "cd
/sys/slab; for i in *" violates the Documentation/sysfs-rules.txt
guidelines even if /sys/slab were documented.

In general, /sys has some real issues that need some careful scrutiny.
Documentation/sysfs-rules.txt is *not* enough.  For example, various
official websites are telling people that the way to enable or disable
power-saving is:

	echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

(Reference: http://www.lesswatts.org/tips/wireless.php)

Aside from the fact that "iwconfig wlan0 power on" is easier to type
and simpler to understand, I'm pretty sure that the above violates
sysfs-rules.txt.  From looking at sysfs-rules.txt, the fact that you
have to read symlinks as the only valid and guaranteed way for things
to work means that you have to write a perl script to safely
manipulate things in /sys.  People are ignoring sysfs-rules.txt, and
giving advice on websites contrary to sysfs-rules.txt because it's too
hard to follow.

> I worry about the lack of atomicity in getting the statistics.

And that's another problem with trying to use /sys when trying to get
statistics in bulk...  Very often the tables in /proc/* were not only
more convenient, but allowed for atomic access.

						- Ted

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 19:09                             ` Ingo Molnar
@ 2007-12-22 19:29                               ` Ingo Molnar
  2007-12-22 22:52                               ` Jason L Tibbitts III
  1 sibling, 0 replies; 93+ messages in thread
From: Ingo Molnar @ 2007-12-22 19:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andi Kleen, Christoph Lameter, Peter Zijlstra, Steven Rostedt,
	LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


* Ingo Molnar <mingo@elte.hu> wrote:

> I'd also not rely on the fact that only a few people are complaining. 
> Most people, even 2.6.24-rc early adopters, still use SLAB because 
> early adopters typically use their .23 .config and do a 'make 
> oldconfig' - which picks up SLAB. So SLUB use will become more 
> widespread only once 2.6.24 is out and is packaged in distros. Distros 
> will likely pick SLUB if there's no performance worries and if it's 
> the default. Fedora rawhide already uses SLUB.

here's some silly statistics about allocator coverage on lkml, based on 
configs reported to lkml in the past 4 months:

  $ for N in SLAB SLUB SLOB; do printf "%s: " $N; grep ^CONFIG_$N=y linux-kernel | wc -l; done
  SLAB: 70
  SLUB: 77
  SLOB: 4

so SLUB and SLAB are utilized about equally amongst people who reported 
configs to lkml.

But people who use SLUB enabled it intentionally - and they are thus 
much less likely to complain about this choice of theirs.

Reporting:

 "I just enabled SLUB instead of SLAB, and hey guys, it does not have 
  SLABinfo"

has a foolish ring to it, doesnt it? I'd rather ask carefully, like this 
person did:

  http://kerneltrap.org/mailarchive/linux-kernel/2007/10/12/335765

This is one of the reasons why i think the whole SLAB->SLUB renaming was 
bad - we should have done SLAB2 or SLAB^2 instead, so that people expect 
_at least as good_ behavior (performance, features, etc.) from SLUB as 
from SLAB. Instead of "something different".

anyway ... i think we still generally suck _a lot_ at providing 
near-transparent kernel upgrades to users. (kernel upgrades are still a 
pain and risk, and often just due to poor developer choices on our 
side.)

It's nowhere near as bad as the 2.4->2.6 transition was (in fact it 
shouldnt even be mentioned in the same sentence), and it's getting 
better gradually, but i think we should just by default be 10 times more 
careful about these things - whenever it is borderline technically sane 
to do so. We induce enough unintentional breakage of user-space, we 
shouldnt compound it with intentional breakages. The kernel community 
still has _a lot_ of user and distro trust to win back. So being seen as 
over-cautious wont harm.

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 18:01                           ` Linus Torvalds
  2007-12-22 19:09                             ` Ingo Molnar
  2007-12-22 19:25                             ` Theodore Tso
@ 2007-12-22 19:46                             ` Andi Kleen
  2007-12-22 23:28                               ` Karol Swietlicki
  2 siblings, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2007-12-22 19:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki


As a followup to the thread regarding slabinfo and Andrew's 
earlier query about updating slabtop.
 
watch -n1 slabinfo -S 

seems to be a reasonable replacement for slabtop. The only
missing functionality is the hotkeys: you have to restart it
to get a different sort order (and read the slabinfo source
to find out the correct option), but other than that it works.

So I would propose to just retire slabtop and replace 
it with a simple shell script doing that.
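A minimal sketch of such a wrapper ("slabinfo -S" is taken from the
one-liner above; the SLAB_CMD/INTERVAL/ONCE variables are hypothetical
hooks added here so the loop can be exercised non-interactively, not
options of any existing tool):

```shell
#!/bin/sh
# slabwatch: poor man's slabtop, re-running "slabinfo -S" periodically.
# watch(1) handles the redraw in the one-liner above; this fallback
# just prints a timestamp header between refreshes.
slabwatch() {
	while :; do
		date
		${SLAB_CMD:-slabinfo -S}
		[ -n "$ONCE" ] && return 0   # test hook: single pass
		sleep "${INTERVAL:-1}"
	done
}
```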

A manpage for slabinfo would be useful though. Anybody 
volunteering to write one? 

-Andi



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 19:25                             ` Theodore Tso
@ 2007-12-22 21:00                               ` Linus Torvalds
  2007-12-22 21:50                                 ` Al Viro
  2007-12-22 22:10                                 ` Willy Tarreau
  0 siblings, 2 replies; 93+ messages in thread
From: Linus Torvalds @ 2007-12-22 21:00 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Ingo Molnar, Andi Kleen, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki



On Sat, 22 Dec 2007, Theodore Tso wrote:
> 
> I have a general problem with things in /sys/slab, and that's just
> because they are *ugly*.  So yes, you can write ugly shell scripts
> like this to get out information:

[ script deleted ]

> But sometimes when trying to eyeball what is going on, it's a lot
> nicer just to use "cat /proc/slabinfo".  

.. and I call BS on this claim.

/proc/slabinfo was (and is) totally pointless for "trying to eyeball 
what's going on". The output is totally unreadable, and useless. You end 
up with exactly the same script as above, except it reads as

	cat /proc/slabinfo | (read headerline
		while read name active num objsize objsperslab pagesperslab rest
		do
			realsize=$(( num * objsize ))
			size=$(( active * objsize ))
			.. exact same rest of loop ..
		done | sort -n | ..

so no, "cat /proc/slabinfo" was almost never practical on its own.

The *one* advantage it does have is that you can forward it to others. 
That's a big advantage. But no, it wasn't ever readable for eyeballing, 
because it doesn't even give you a memory usage thing (just "number of 
objects and object size" as separate numbers).

But the "everything in one file" indubitably did make it a lot easier for 
just attaching it to bug-reports.
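For comparison, a completed version of that loop, hedged: it assumes the
"slabinfo - version: 2.1" column order (name, active_objs, num_objs,
objsize, objperslab, pagesperslab) and reads slabinfo-format text on
stdin, so it can be pointed at a saved copy from a bug report just as
well as at the live file:

```shell
#!/bin/sh
# Rank caches by total object memory (num_objs * objsize), largest last.
# Skips the two header lines and any comment line.
slabinfo_top() {
	awk 'NR > 2 && $1 !~ /^#/ {
		printf "%10d KB %s\n", ($3 * $4) / 1024, $1
	}' | sort -n | tail
}
# e.g.: slabinfo_top < /proc/slabinfo
```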

> Another problem with using /sys/slab is that it is downright *ugly*.
> Why is it for example, that /sys/slab/dentry is a symlink to
> ../slab/:a-0000160?

That's the only really ugly thing there. Otherwise, it's pretty nice, but 
having a million files makes for problems when trying to send somebody 
else the full info. 

		Linus

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 21:00                               ` Linus Torvalds
@ 2007-12-22 21:50                                 ` Al Viro
  2007-12-22 23:29                                   ` Al Viro
  2007-12-22 22:10                                 ` Willy Tarreau
  1 sibling, 1 reply; 93+ messages in thread
From: Al Viro @ 2007-12-22 21:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter,
	Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:

> > Another problem with using /sys/slab is that it is downright *ugly*.
> > Why is it for example, that /sys/slab/dentry is a symlink to
> > ../slab/:a-0000160?
> 
> That's the only really ugly thing there. Otherwise, it's pretty nice, but 
> having a million files makes for problems when trying to send somebody 
> else the full info. 

*raised brows*

I would say that there's that really ugly thing with embedding kobject
into a struct with the lifetime rules of its own and then having that
struct kfree()d while kobject might still have references, but hey,
maybe it's just me and my lack of appreciation of the glory that is
sysfs.

	Al, too tired of ranting about sysfs being a major architecture
mistake and a recurring source of turds like that one...

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 21:00                               ` Linus Torvalds
  2007-12-22 21:50                                 ` Al Viro
@ 2007-12-22 22:10                                 ` Willy Tarreau
  2007-12-23  0:59                                   ` Steven Rostedt
  1 sibling, 1 reply; 93+ messages in thread
From: Willy Tarreau @ 2007-12-22 22:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter,
	Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
> 
> 
> On Sat, 22 Dec 2007, Theodore Tso wrote:
> > 
> > I have a general problem with things in /sys/slab, and that's just
> > because they are *ugly*.  So yes, you can write ugly shell scripts
> > like this to get out information:
> 
> [ script deleted ]
> 
> > But sometimes when trying to eyeball what is going on, it's a lot
> > nicer just to use "cat /proc/slabinfo".  
> 
> .. and I call BS on this claim.
> 
> /proc/slabinfo was (and is) totally pointless for "trying to eyeball 
> what's going on". The output is totally unreadable, and useless. You end 
> up with exactly the same script as above, except it reads as
> 
> 	cat /proc/slabinfo | (read headerline
> 		while read name active num objsize objsperslab pagesperslab rest
> 		do
> 			realsize=$(( num * objsize ))
> 			size=$(( active * objsize ))
> 			.. exact same rest of loop ..
> 		done | sort -n | ..
> 
> so no, "cat /proc/slabinfo" was almost never practical on its own.
> 
> The *one* advantage it does have is that you can forward it to others. 
> That's a big advantage. But no, it wasn't ever readable for eyeballing, 
> because it doesn't even give you a memory usage thing (just "number of 
> objects and object size" as separate numbers).

I don't agree with you, Linus. I'm one of those used to quickly taking
a look at its contents when I don't know where all my memory has gone.
I know from experience that if I find 6-digit values in dentry_cache or
inode_cache, all I was looking for is there, and that's enough for a
quick diagnosis. It doesn't give anything accurate, but it's very useful
for quickly diagnosing some problems on production systems.

It was this week that I noticed for the first time that slabinfo did
not exist anymore, and did not know what replaced it. Lacking time,
I finally gave up and *supposed* that the memory was eaten by the
usual suspects.

I can understand that it has to go away for technical reasons, but Ted
is right, please don't believe that nobody uses it just because you got
no complaint. While people are not likely to perform all computations
in scripts, at least they're used to finding some quickly identifiable
patterns there.

Willy


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 19:09                             ` Ingo Molnar
  2007-12-22 19:29                               ` Ingo Molnar
@ 2007-12-22 22:52                               ` Jason L Tibbitts III
  2007-12-24  3:59                                 ` Alessandro Suardi
  1 sibling, 1 reply; 93+ messages in thread
From: Jason L Tibbitts III @ 2007-12-22 22:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Linus Torvalds, Andi Kleen, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

>>>>> "IM" == Ingo Molnar <mingo@elte.hu> writes:

IM> Distros will likely pick SLUB if there's no performance worries
IM> and if it's the default. Fedora rawhide already uses SLUB.

Actually, it seems to me that not only does Fedora rawhide use SLUB,
but Fedora 8 and 7 use it as well.  They don't have /proc/slabinfo and
they all seem to have CONFIG_SLUB=y:

> grep -r CONFIG_SLUB=y kernel
kernel/devel/config-generic:CONFIG_SLUB=y
kernel/F-7/configs/config-generic:CONFIG_SLUB=y
kernel/F-8/config-generic:CONFIG_SLUB=y

 - J<

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 19:46                             ` slabtop replacement was " Andi Kleen
@ 2007-12-22 23:28                               ` Karol Swietlicki
  2007-12-29 18:08                                 ` Karol Swietlicki
  0 siblings, 1 reply; 93+ messages in thread
From: Karol Swietlicki @ 2007-12-22 23:28 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Ingo Molnar, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

On 22/12/2007, Andi Kleen <andi@firstfloor.org> wrote:
> A manpage for slabinfo would be useful though. Anybody
> volunteering to write one?
>
> -Andi

That would be me.
I'm a newbie and have never written a man page before, so it will take
a few days, but I'm bored and out of ideas for any new code for the
time being. Should be fun.

Karol Swietlicki

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 21:50                                 ` Al Viro
@ 2007-12-22 23:29                                   ` Al Viro
  0 siblings, 0 replies; 93+ messages in thread
From: Al Viro @ 2007-12-22 23:29 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Theodore Tso, Ingo Molnar, Andi Kleen, Christoph Lameter,
	Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sat, Dec 22, 2007 at 09:50:09PM +0000, Al Viro wrote:
> On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
> 
> > > Another problem with using /sys/slab is that it is downright *ugly*.
> > > Why is it for example, that /sys/slab/dentry is a symlink to
> > > ../slab/:a-0000160?
> > 
> > That's the only really ugly thing there. Otherwise, it's pretty nice, but 
> > having a million files makes for problems when trying to send somebody 
> > else the full info. 
> 
> *raised brows*
> 
> I would say that there's that really ugly thing with embedding kobject
> into a struct with the lifetime rules of its own and then having that
> struct kfree()d while kobject might still have references, but hey,
> maybe it's just me and my lack of appreciation of the glory that is
> sysfs.
> 
> 	Al, too tired of ranting about sysfs being a major architecture
> mistake and a recurring source of turds like that one...

BTW, the fact that presence of that kobject is conditional makes life even
uglier - we have to either kfree() or drop a reference to kobject leaving
actual kfree() to its ->release().   While we are at it, when do we remove
the symlinks?  That got to add even more fun for the lifetime rules...

Sigh...  How does one set up a script that would mail a warning upon a git
pull that introduces any instances of a keyword from a given set?  I hadn't
noticed that slub was using sysfs when it went into the tree back in May ;-/

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 22:10                                 ` Willy Tarreau
@ 2007-12-23  0:59                                   ` Steven Rostedt
  2007-12-23  5:12                                     ` Willy Tarreau
  0 siblings, 1 reply; 93+ messages in thread
From: Steven Rostedt @ 2007-12-23  0:59 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Andi Kleen,
	Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki


[/me sneaks away from the family]

On Sat, 22 Dec 2007, Willy Tarreau wrote:

> On Sat, Dec 22, 2007 at 01:00:09PM -0800, Linus Torvalds wrote:
> > On Sat, 22 Dec 2007, Theodore Tso wrote:
> > > But sometimes when trying to eyeball what is going on, it's a lot
> > > nicer just to use "cat /proc/slabinfo".
> >
> > .. and I call BS on this claim.
> >

[...]

>
> I can understand that it has to go away for technical reasons, but Ted
> is right, please don't believe that nobody uses it just because you got
> no complaint. While people are not likely to perform all computations
> in scripts, at least they're used to finding some quickly identifiable
> patterns there.
>

I know when I'm looking for memory leaks, I've asked customers to give
snapshots of slabinfo at periodic times (even once a day, depending on how
bad the leak is). This has been helpful in seeing if something did indeed
the leak is). This has been helpful in seeing if something did indeed
leak.

If you have a slab cache that constantly grows, and never shrinks, that's
a good indication that something might be leaking. Not always, since there
can be legitimate reasons for that, but sometimes it helps.

But I still scratch my head whenever I need to touch sysfs.

[/me runs back to the family]

-- Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-23  0:59                                   ` Steven Rostedt
@ 2007-12-23  5:12                                     ` Willy Tarreau
  2007-12-23 14:15                                       ` Andi Kleen
  0 siblings, 1 reply; 93+ messages in thread
From: Willy Tarreau @ 2007-12-23  5:12 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Linus Torvalds, Theodore Tso, Ingo Molnar, Andi Kleen,
	Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sat, Dec 22, 2007 at 07:59:47PM -0500, Steven Rostedt wrote:
[...]
> But I still scratch my head whenever I need to touch sysfs.

Same here. In fact, I've always considered that procfs was for
humans while sysfs was for tools. sysfs reminds me too much of the
unexploitable /devices in Solaris. With the proper tools, I think
we can do a lot with it, but it's not as intuitive to find the
proper tools as it was to do "ls" followed by "cat" in /proc.

Willy


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-23  5:12                                     ` Willy Tarreau
@ 2007-12-23 14:15                                       ` Andi Kleen
  2007-12-24  3:45                                         ` Theodore Tso
  0 siblings, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2007-12-23 14:15 UTC (permalink / raw)
  To: Willy Tarreau
  Cc: Steven Rostedt, Linus Torvalds, Theodore Tso, Ingo Molnar,
	Andi Kleen, Christoph Lameter, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

> Same here. In fact, I've always considered that procfs was for
> humans while sysfs was for tools. sysfs reminds me too much of the
> unexploitable /devices in Solaris. With the proper tools, I think
> we can do a lot with it, but it's not as intuitive to find the
> proper tools as it was to do "ls" followed by "cat" in /proc.

find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done

tends to work reasonably well for a quick overview, but yes
cat was nicer for humans.

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-23 14:15                                       ` Andi Kleen
@ 2007-12-24  3:45                                         ` Theodore Tso
  2007-12-24 19:21                                           ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Theodore Tso @ 2007-12-24  3:45 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Willy Tarreau, Steven Rostedt, Linus Torvalds, Ingo Molnar,
	Christoph Lameter, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sun, Dec 23, 2007 at 03:15:00PM +0100, Andi Kleen wrote:
> > Same here. In fact, I've always considered that procfs was for
> > humans while sysfs was for tools. sysfs reminds me too much of the
> > unexploitable /devices in Solaris. With the proper tools, I think
> > we can do a lot with it, but it's not as intuitive to find the
> > proper tools as it was to do "ls" followed by "cat" in /proc.
> 
> find /sys/... -type f | while read i ; do echo "$i: $(<$i)" ; done
> 
> tends to work reasonably well for a quick overview, but yes
> cat was nicer for humans.

Until you start to wonder what the heck :a-0000136 is:

/sys/slab/:a-0000136/objs_per_slab: 30

Sigh...

					- Ted


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 22:52                               ` Jason L Tibbitts III
@ 2007-12-24  3:59                                 ` Alessandro Suardi
  0 siblings, 0 replies; 93+ messages in thread
From: Alessandro Suardi @ 2007-12-24  3:59 UTC (permalink / raw)
  To: Jason L Tibbitts III
  Cc: Ingo Molnar, Linus Torvalds, Andi Kleen, Christoph Lameter,
	Peter Zijlstra, Steven Rostedt, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On 22 Dec 2007 16:52:56 -0600, Jason L Tibbitts III <tibbs@math.uh.edu> wrote:
> >>>>> "IM" == Ingo Molnar <mingo@elte.hu> writes:
>
> IM> Distros will likely pick SLUB if there's no performance worries
> IM> and if it's the default. Fedora rawhide already uses SLUB.
>
> Actually, it seems to me that not only does Fedora rawhide use SLUB,
> but Fedora 8 and 7 use it as well.  They don't have /proc/slabinfo and
> they all seem to have CONFIG_SLUB=y:
>
> > grep -r CONFIG_SLUB=y kernel
> kernel/devel/config-generic:CONFIG_SLUB=y
> kernel/F-7/configs/config-generic:CONFIG_SLUB=y
> kernel/F-8/config-generic:CONFIG_SLUB=y

Also true for at least recent versions of FC6 (which my bittorrent
 machine is still on), which currently run 2.6.22-14 based kernels.
And no, I didn't notice - that box usually hits triple-digit uptime
 before I reboot; in fact, it's still running a 2.6.22-9 based kernel.

--alessandro

 "Damaged people are dangerous. They know they can survive."

   (Anna Barton/Juliette Binoche, 'Damage')

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-24  3:45                                         ` Theodore Tso
@ 2007-12-24 19:21                                           ` Christoph Lameter
  2007-12-24 23:37                                             ` Theodore Tso
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-24 19:21 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Sun, 23 Dec 2007, Theodore Tso wrote:

> > tends to work reasonably well for a quick overview, but yes
> > cat was nicer for humans.
> 
> Until you start to wonder what the heck :a-0000136 is:
> 
> /sys/slab/:a-0000136/objs_per_slab: 30
> 
> Sigh...

That is why there is a slabinfo tool that does all the nice formatting.

Do a

gcc -o slabinfo Documentation/vm/slabinfo.c 

Then run slabinfo instead of doing cat /proc/slabinfo



^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22  9:45                         ` Ingo Molnar
@ 2007-12-24 19:24                           ` Christoph Lameter
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-24 19:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: David Miller, a.p.zijlstra, torvalds, rostedt, linux-kernel,
	akpm, hch, rjw

On Sat, 22 Dec 2007, Ingo Molnar wrote:

> Christoph, i'd like to apologize for all overly harsh words i said. (and 
> i said quite a few :-/ )

Looks like a bad interaction due to overload, vacation, a flu epidemic 
hitting me and my family, and also Christmas coming up, so that I could 
not do my usual workload. Need to find some way to cut back on the stuff 
that I am involved with, it seems. Thanks for the SLUB patches.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-24 19:21                                           ` Christoph Lameter
@ 2007-12-24 23:37                                             ` Theodore Tso
  2007-12-25  3:34                                               ` Andi Kleen
  2007-12-26 21:31                                               ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter
  0 siblings, 2 replies; 93+ messages in thread
From: Theodore Tso @ 2007-12-24 23:37 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Mon, Dec 24, 2007 at 11:21:14AM -0800, Christoph Lameter wrote:
> 
> That is why there is a slabinfo tool that does all the nice formatting.
> 
> Do a
> 
> gcc -o slabinfo Documentation/vm/slabinfo.c 
> 
> Then run slabinfo instead of doing cat /proc/slabinfo

So two questions: why isn't -f the default?  And is /sys/slab
guaranteed to be a stable and permanent interface if the SLAB
implementation ever gets ripped out?  If so, maybe this should go into
util-linux-ng?  I am a bit concerned about the lack of atomicity of
/sys/slab, but this is a heck of a lot better than many kernel drivers
or subsystems which use /sys and then completely punt on any kind of
userspace utility, forcing users to type crazy things like 

  echo 5 > /sys/bus/pci/drivers/iwl4965/*/power_level

						- Ted

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-24 23:37                                             ` Theodore Tso
@ 2007-12-25  3:34                                               ` Andi Kleen
  2008-01-01 12:47                                                 ` Theodore Tso
  2007-12-26 21:31                                               ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter
  1 sibling, 1 reply; 93+ messages in thread
From: Andi Kleen @ 2007-12-25  3:34 UTC (permalink / raw)
  To: Theodore Tso, Christoph Lameter, Andi Kleen, Willy Tarreau,
	Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

> And is /sys/slab
> guaranteed to be a stable and permanent interface if the SLAB

Debugging feature and "stable and permanent" just don't 
fit together. It's like requesting stable and permanent
sysrq output.

-Andi

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-24 23:37                                             ` Theodore Tso
  2007-12-25  3:34                                               ` Andi Kleen
@ 2007-12-26 21:31                                               ` Christoph Lameter
  2007-12-26 22:16                                                 ` Al Viro
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-26 21:31 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Andi Kleen, Willy Tarreau, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Mon, 24 Dec 2007, Theodore Tso wrote:

> So two questions: why isn't -f the default?  And is /sys/slab

Because it gives misleading output. It displays the name of the first 
of multiple slabs that share the same storage structures.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-26 21:31                                               ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter
@ 2007-12-26 22:16                                                 ` Al Viro
  2007-12-27  5:51                                                   ` Christoph Lameter
  2007-12-28  9:00                                                   ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
  0 siblings, 2 replies; 93+ messages in thread
From: Al Viro @ 2007-12-26 22:16 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Wed, Dec 26, 2007 at 01:31:35PM -0800, Christoph Lameter wrote:
> On Mon, 24 Dec 2007, Theodore Tso wrote:
> 
> > So two questions: why isn't -f the default?  And is /sys/slab
> 
> Because it gives misleading output. It displays the name of the first 
> of multiple slabs that share the same storage structures.

Erm...  Let me spell it out: current lifetime rules are completely broken.
As it is, create/destroy/create cache sequence will do kobject_put() on
kfree'd object.  Even without people playing with holding sysfs files
open or doing IO on those.

a) you have kobject embedded into struct with the lifetime rules of its
own.  When its refcount hits zero you kfree() the sucker, even if you
still have references to embedded kobject.

b) your symlinks stick around.  Even when cache is long gone you still
have a sysfs symlink with its embedded kobject as a target.  They are
eventually removed when cache with the same name gets created.  _Then_
you get the target kobject dropped - when the memory it used to be in
had been freed for hell knows how long and reused by something that would
not appreciate slub.c code suddenly deciding to decrement some word in
that memory.

c) you leak references to these kobject; kobject_del() only removes it
from the tree undoing the effect of kobject_add() and you still need
kobject_put() to deal with the last reference.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-26 22:16                                                 ` Al Viro
@ 2007-12-27  5:51                                                   ` Christoph Lameter
  2007-12-27 20:28                                                     ` SLUB sysfs support Christoph Lameter
  2007-12-28  9:00                                                   ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-27  5:51 UTC (permalink / raw)
  To: Al Viro
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Wed, 26 Dec 2007, Al Viro wrote:

> Erm...  Let me spell it out: current lifetime rules are completely broken.
> As it is, create/destroy/create cache sequence will do kobject_put() on
> kfree'd object.  Even without people playing with holding sysfs files
> open or doing IO on those.

Urgh. Help! Isn't there some sanity with sysfs? 

> a) you have kobject embedded into struct with the lifetime rules of its
> own.  When its refcount hits zero you kfree() the sucker, even if you
> still have references to embedded kobject.

So allocate it separately?

> b) your symlinks stick around.  Even when cache is long gone you still
> have a sysfs symlink with its embedded kobject as a target.  They are
> eventually removed when cache with the same name gets created.  _Then_
> you get the target kobject dropped - when the memory it used to be in
> had been freed for hell knows how long and reused by something that would
> not appreciate slub.c code suddenly deciding to decrement some word in
> that memory.

Hmmmm.. That would also be addressed by having a separately allocated 
kobject?
 
> c) you leak references to these kobject; kobject_del() only removes it
> from the tree undoing the effect of kobject_add() and you still need
> kobject_put() to deal with the last reference.

Patches would be appreciated... I had a hard time getting the sysfs 
support to work right.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* SLUB sysfs support
  2007-12-27  5:51                                                   ` Christoph Lameter
@ 2007-12-27 20:28                                                     ` Christoph Lameter
  2007-12-27 22:59                                                       ` Al Viro
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-27 20:28 UTC (permalink / raw)
  To: Al Viro
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

Hmmm.. If I separately allocate the kobject then I can no longer get to 
the kmem_cache structure from the kobject. 

I need to add a second kobject_del to sysfs_slab_remove() to make sysfs 
completely forget about the object?

Probably should track down any remaining symlinks at that point and nuke 
them too. Isn't there some way to convince sysfs to remove the symlinks 
if the target vanishes?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 20:28                                                     ` SLUB sysfs support Christoph Lameter
@ 2007-12-27 22:59                                                       ` Al Viro
  2007-12-27 23:22                                                         ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Al Viro @ 2007-12-27 22:59 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote:
> Hmmm.. If I separately allocate the kobject then I can no longer get to 
> the kmem_cache structure from the kobject. 
> 
> I need to add a second kobject_del to sysfs_slab_remove() to make sysfs 
> completely forget about the object?
> 
> Probably should track down any remaining symlinks at that point and nuke 
> them too. Isn't there some way to convince sysfs to remove the symlinks 
> if the target vanishes?

Don't bother with separate allocation.

a) remove symlink when slab goes away
b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
c) have ->release() of these kobjects do kfree()

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 22:59                                                       ` Al Viro
@ 2007-12-27 23:22                                                         ` Christoph Lameter
  2007-12-27 23:53                                                           ` Christoph Lameter
  2007-12-27 23:54                                                           ` Al Viro
  0 siblings, 2 replies; 93+ messages in thread
From: Christoph Lameter @ 2007-12-27 23:22 UTC (permalink / raw)
  To: Al Viro
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Thu, 27 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 12:28:14PM -0800, Christoph Lameter wrote:
> > Hmmm.. If I separately allocate the kobject then I can no longer get to 
> > the kmem_cache structure from the kobject. 
> > 
> > I need to add a second kobject_del to sysfs_slab_remove() to make sysfs 
> > completely forget about the object?
> > 
> > Probably should track down any remaining symlinks at that point and nuke 
> > them too. Isn't there some way to convince sysfs to remove the symlinks 
> > if the target vanishes?
> 
> Don't bother with separate allocation.
> 
> a) remove symlink when slab goes away

Ok. Need to think about how to code that.

> b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff

Hmmmm.... Okay. Patch follows, but it's strange to do a kobject_put after a 
kobject_del().

> c) have ->release() of these kobjects do kfree()

In slab_ktype?

---
 mm/slub.c |   15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2007-12-27 15:09:38.000000000 -0800
+++ linux-2.6/mm/slub.c	2007-12-27 15:21:12.000000000 -0800
@@ -247,7 +247,10 @@ static void sysfs_slab_remove(struct kme
 static inline int sysfs_slab_add(struct kmem_cache *s) { return 0; }
 static inline int sysfs_slab_alias(struct kmem_cache *s, const char *p)
 							{ return 0; }
-static inline void sysfs_slab_remove(struct kmem_cache *s) {}
+static inline void sysfs_slab_remove(struct kmem_cache *s)
+{
+	kfree(s);
+}
 #endif
 
 /********************************************************************
@@ -2322,7 +2325,6 @@ void kmem_cache_destroy(struct kmem_cach
 		if (kmem_cache_close(s))
 			WARN_ON(1);
 		sysfs_slab_remove(s);
-		kfree(s);
 	} else
 		up_write(&slub_lock);
 }
@@ -3940,6 +3942,13 @@ static ssize_t slab_attr_store(struct ko
 	return err;
 }
 
+static void slab_release(struct kobject *kobj)
+{
+	struct kmem_cache *s = to_slab(kobj);
+
+	kfree(s);
+}
+
 static struct sysfs_ops slab_sysfs_ops = {
 	.show = slab_attr_show,
 	.store = slab_attr_store,
@@ -3947,6 +3956,7 @@ static struct sysfs_ops slab_sysfs_ops =
 
 static struct kobj_type slab_ktype = {
 	.sysfs_ops = &slab_sysfs_ops,
+	.release = slab_release
 };
 
 static int uevent_filter(struct kset *kset, struct kobject *kobj)
@@ -4048,6 +4058,7 @@ static void sysfs_slab_remove(struct kme
 {
 	kobject_uevent(&s->kobj, KOBJ_REMOVE);
 	kobject_del(&s->kobj);
+	kobject_put(&s->kobj);
 }
 
 /*


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 23:22                                                         ` Christoph Lameter
@ 2007-12-27 23:53                                                           ` Christoph Lameter
  2007-12-27 23:58                                                             ` Al Viro
  2007-12-27 23:54                                                           ` Al Viro
  1 sibling, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-27 23:53 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Thu, 27 Dec 2007, Christoph Lameter wrote:

> > a) remove symlink when slab goes away
> 
> Ok. Need to think about how to code that.

How do I iterate over all symlinks in /sys/kernel/slab/*?

I remember trying to do it before and not being able to find a sysfs 
method for that.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 23:22                                                         ` Christoph Lameter
  2007-12-27 23:53                                                           ` Christoph Lameter
@ 2007-12-27 23:54                                                           ` Al Viro
  1 sibling, 0 replies; 93+ messages in thread
From: Al Viro @ 2007-12-27 23:54 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Theodore Tso, Andi Kleen, Willy Tarreau, Steven Rostedt,
	Linus Torvalds, Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Thu, Dec 27, 2007 at 03:22:28PM -0800, Christoph Lameter wrote:
> > a) remove symlink when slab goes away
> 
> Ok. Need to think about how to code that.

Huh?  Just call it from kmem_cache_destroy(); what business does that symlink
have being around after that point?

> > b) instead of kfree() in slab removal do kobject_put() if you have sysfs stuff
> 
> Hmmmm.... Okay. Patch follows, but it's strange to do a kobject_put after a 
> kobject_del().

kobject_del() undoes the effect of kobject_add().  And then you are left
with the refcount you had before kobject_add(), i.e. from kobject_init().

Think of it that way: kobject refcount controls the lifetime of structure
the sucker's embedded into.  Depending on kobject methods, you might
be unable to do cleanup of non-kobject parts of that animal until after
kobject_del().  IOW, you _can't_ have kobject_del() dropping the (hopefully)
final reference to kobject - only the caller knows what might have to be
done in between.

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 23:53                                                           ` Christoph Lameter
@ 2007-12-27 23:58                                                             ` Al Viro
  2007-12-28  0:02                                                               ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Al Viro @ 2007-12-27 23:58 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote:
> On Thu, 27 Dec 2007, Christoph Lameter wrote:
> 
> > > a) remove symlink when slab goes away
> > 
> > Ok. Need to think about how to code that.
> 
> How do I iterate over all symlinks in /sys/kernel/slab/*?
> 
> I remember trying to do it before and not being able to find a sysfs 
> method for that.

What the hell do you need that for?  You have an obvious moment when
/sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
you remove the slab called cache_name.  And at that point you don't
need to iterate over anything - you have the exact name, for fsck sake...

Why would you want these symlinks to stick around for longer than that?

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-27 23:58                                                             ` Al Viro
@ 2007-12-28  0:02                                                               ` Christoph Lameter
  2007-12-28  0:45                                                                 ` Al Viro
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-28  0:02 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Thu, 27 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 03:53:43PM -0800, Christoph Lameter wrote:
> > On Thu, 27 Dec 2007, Christoph Lameter wrote:
> > 
> > > > a) remove symlink when slab goes away
> > > 
> > > Ok. Need to think about how to code that.
> > 
> > How do I iterate over all symlinks in /sys/kernel/slab/*?
> > 
> > I remember trying to do it before and not being able to find a sysfs 
> > method for that.
> 
> What the hell do you need that for?  You have an obvious moment when
> /sys/kernel/slab/cache_name -> <whatever> should be gone - it's when
> you remove the slab called cache_name.  And at that point you don't
> need to iterate over anything - you have the exact name, for fsck sake...
> 
> Why would you want these symlinks to stick around for longer than that?

/sys/kernel/slab/cache_name is a real directory but then there are the 
aliases in /sys/kernel/slab/alias_name pointing to that directory that 
also have to be removed.

So I need to scan for symlinks in /sys/kernel/slab/* pointing to 
/sys/kernel/slab/cache_name.


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-28  0:02                                                               ` Christoph Lameter
@ 2007-12-28  0:45                                                                 ` Al Viro
  2007-12-28  2:19                                                                   ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Al Viro @ 2007-12-28  0:45 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Thu, Dec 27, 2007 at 04:02:56PM -0800, Christoph Lameter wrote:
> > Why would you want these symlinks to stick around for longer than that?
> 
> /sys/kernel/slab/cache_name is a real directory but then there are the 
> aliases in /sys/kernel/slab/alias_name pointing to that directory that 
> also have to be removed.
> 
> So I need to scan for symlinks in /sys/kernel/slab/* pointing to 
> /sys/kernel/slab/cache_name.

Oh, lovely.  So we can have module A do kmem_cache_create(), calling
cache "foo".  Then module B does (without any knowledge about A)
completely unrelated kmem_cache_create(), calling the sucker "bar".
mm/slub decides that they are mergable.  Then we get rmmod A... and
no way to find out if that's foo or bar going away - kmem_cache_destroy()
doesn't have enough information to figure that out.  So we have to keep
both slab/foo and slab/bar around until both are gone or until somebody
is kind enough to create a cache called foo.  Better yet, on some systems you
get things like slab/nfsd4_delegations sticking around long after nfsd is
gone just because slub.c decides that it's sharable, but in the next
kernel version the thing it's shared with gets different object size and
behaviour suddenly changes - now it's gone as soon as nfsd is gone.
Brilliant interface, that...

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-28  0:45                                                                 ` Al Viro
@ 2007-12-28  2:19                                                                   ` Christoph Lameter
  2007-12-28  3:26                                                                     ` Al Viro
  0 siblings, 1 reply; 93+ messages in thread
From: Christoph Lameter @ 2007-12-28  2:19 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Fri, 28 Dec 2007, Al Viro wrote:

> Oh, lovely.  So we can have module A do kmem_cache_create(), calling
> cache "foo".  Then module B does (without any knowledge about A)
> completely unrelated kmem_cache_create(), calling the sucker "bar".
> mm/slub decides that they are mergable.  Then we get rmmod A... and
> no way to find out if that's foo or bar going away - kmem_cache_destroy()
> doesn't have enough information to figure that out.  So we have to keep
> both slab/foo and slab/bar around until both are gone or until somebody
> is kind enough to create a cache called foo.  Better yet, on some systems you
> get things like slab/nfsd4_delegations sticking around long after nfsd is
> gone just because slub.c decides that it's sharable, but in the next
> kernel version the thing it's shared with gets different object size and
> behaviour suddenly changes - now it's gone as soon as nfsd is gone.
> Brilliant interface, that...

Just enable slab debugging and all of that goes away and you get the usual 
error messages. Otherwise SLUB tries to compress memory use as much as 
possible. The advantage here is that you can add an arbitrary number of 
kmem_cache_create() calls without increasing memory use significantly. It's 
cheap to create kmem_caches. Thus you can freely create caches without 
worrying about memory use, and allocate from meaningful, dedicated caches 
via kmem_cache_alloc() instead of kmalloc()/kfree(). kmalloc()/kfree() 
wastes space since it rounds sizes up to powers of two, and it has to look 
up the cache at runtime, whereas kmem_cache_alloc()/free() does the lookup 
once.

The amount of idle per-cpu data we keep around is reduced by 50%, improving 
CPU cache usage and reducing the memory footprint. Per-cpu structures have 
to hold per-cpu reserves, and it is not good to waste too much memory 
there.

nfsd4_delegations? What is this about?

How do I scan for the symlinks in sysfs?


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: SLUB sysfs support
  2007-12-28  2:19                                                                   ` Christoph Lameter
@ 2007-12-28  3:26                                                                     ` Al Viro
  2008-01-02 20:25                                                                       ` Christoph Lameter
  0 siblings, 1 reply; 93+ messages in thread
From: Al Viro @ 2007-12-28  3:26 UTC (permalink / raw)
  To: Christoph Lameter; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote:
> nfsd4_delegations? What is this about?

The random lifetimes of user-visible files you create in sysfs.

> How do I scan for the symlinks in sysfs?

At which point are you going to do that?  AFAICS, the fundamental problem
is that you
	* have aliases indistinguishable, so kmem_cache_destroy() can't tell
which one is going away, no matter what
	* have per-alias objects in sysfs
As a result, you have a user-visible mess in that directory in sysfs.
And I don't see how you would deal with that - at the level of "the
contents of the directory change in so-and-so way when such-and-such
operation is done", not at the implementation-details level.

BTW, I'm rather sceptical about free use of slabs; keep in mind that their
names have to be unique under your sysfs layout, so...

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-26 22:16                                                 ` Al Viro
  2007-12-27  5:51                                                   ` Christoph Lameter
@ 2007-12-28  9:00                                                   ` Ingo Molnar
  2007-12-28 14:52                                                     ` Steven Rostedt
  1 sibling, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2007-12-28  9:00 UTC (permalink / raw)
  To: Al Viro
  Cc: Christoph Lameter, Theodore Tso, Andi Kleen, Willy Tarreau,
	Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


* Al Viro <viro@ZenIV.linux.org.uk> wrote:

> > > So two questions: why isn't -f the default?  And is /sys/slab
> > 
> > Because it gives misleading output. It displays the name of the 
> > first of multiple slabs that share the same storage structures.
> 
> Erm...  Let me spell it out: current lifetime rules are completely 
> broken. As it is, create/destroy/create cache sequence will do 
> kobject_put() on kfree'd object.  Even without people playing with 
> holding sysfs files open or doing IO on those.
> 
> a) you have kobject embedded into struct with the lifetime rules of 
> its own.  When its refcount hits zero you kfree() the sucker, even if 
> you still have references to embedded kobject.
> 
> b) your symlinks stick around.  Even when cache is long gone you still 
> have a sysfs symlink with its embedded kobject as a target.  They are 
> eventually removed when cache with the same name gets created.  _Then_ 
> you get the target kobject dropped - when the memory it used to be in 
> had been freed for hell knows how long and reused by something that 
> would not appreciate slub.c code suddenly deciding to decrement some 
> word in that memory.
> 
> c) you leak references to these kobject; kobject_del() only removes it 
> from the tree undoing the effect of kobject_add() and you still need 
> kobject_put() to deal with the last reference.

as a sidenote: bugs like this seem to be recurring. People implement 
sysfs bindings (without being sysfs internals experts - and why should 
they be) - and create hard to debug problems. We've seen that with the 
scheduler's recent sysfs changes too.

shouldn't the sysfs code be designed in a way that does not allow such bugs? The 
primary usecase of sysfs is by people who do _not_ deal with it on a 
daily basis. So if they pick APIs that look obvious and create hard to 
debug problems (and userspace incompatibilities) that's a primary 
failure of sysfs, not a failure of those who utilize it.

At a minimum there should be some _strong_ debugging facility that 
transparently detects and reports such bugs as they occur. 
CONFIG_DEBUG_KOBJECT is totally unusable right now, it spams the syslog 
(so no distro ever enables it - i disable it in random bootups as well 
because it takes _ages_ to even get to a boot prompt) and never finds 
any of these hard-to-find-but-easy-to-explain bugs.

or if sysfs/kobjects should be scrapped and rewritten, do you have any 
insight into what kind of abstraction could/should replace it? Should we 
go back to procfs and get rid of kobjects altogether? (as it's slowly 
turning into a /proc problem of itself, with worse compatibility and 
sneakier bugs.)

	Ingo

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-28  9:00                                                   ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
@ 2007-12-28 14:52                                                     ` Steven Rostedt
  0 siblings, 0 replies; 93+ messages in thread
From: Steven Rostedt @ 2007-12-28 14:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Al Viro, Christoph Lameter, Theodore Tso, Andi Kleen,
	Willy Tarreau, Linus Torvalds, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


On Fri, 28 Dec 2007, Ingo Molnar wrote:
>
> or if sysfs/kobjects should be scrapped and rewritten, do you have any
> insight into what kind of abstraction could/should replace it? Should we
> go back to procfs and get rid of kobjects altogether? (as it's slowly
> turning into a /proc problem of itself, with worse compatibility and
> sneakier bugs.)
>

The few times I've used kobjects/sysfs, I would always need to "relearn"
how to use it. It's not trivial at all, and even when I finally get it
working, I'm not sure it's working correctly.

Although something like debugfs took me a few minutes to use. Now whenever
I need to expose something to userspace, I simply choose debugfs.

I thought the switch to sysfs was to keep procfs cleaner and only used for
processes. Why did it need something so complex as the kobject structure?
Was it so that udev could integrate with it? If sysfs was as simple as
debugfs, I think a lot of these bugs would not have occurred.

-- Steve


^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: slabtop replacement was Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-22 23:28                               ` Karol Swietlicki
@ 2007-12-29 18:08                                 ` Karol Swietlicki
  0 siblings, 0 replies; 93+ messages in thread
From: Karol Swietlicki @ 2007-12-29 18:08 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Linus Torvalds, Ingo Molnar, Christoph Lameter, Peter Zijlstra,
	Steven Rostedt, LKML, Andrew Morton, Christoph Hellwig,
	Rafael J. Wysocki

On 23/12/2007, Karol Swietlicki <magotari@gmail.com> wrote:
> On 22/12/2007, Andi Kleen <andi@firstfloor.org> wrote:
> > A manpage for slabinfo would be useful though. Anybody
> > volunteering to write one?
> >
> > -Andi
>
> That would be me.
> I'm a newbie and never wrote a man page before, so it will take a few
> days, but I'm bored and out of ideas for any new code for the time
> being. Should be fun.
>
> Karol Swietlicki

I'm dropping out of this one.
I worked on it for a good while, and I really see no way of improving on
what the usage information (slabinfo -h) already provides. It's a good
description, and I'm starting to think that adding a man page is
overkill. If I am misunderstanding something and what is desired is in
fact the usage information in man page format, I can provide such a
page, no problem.

At least now I know how to write man pages.

Karol Swietlicki

^ permalink raw reply	[flat|nested] 93+ messages in thread

* Re: Major regression on hackbench with SLUB (more numbers)
  2007-12-25  3:34                                               ` Andi Kleen
@ 2008-01-01 12:47                                                 ` Theodore Tso
  2008-01-01 15:19                                                   ` Andi Kleen
  2008-01-01 16:23                                                   ` [patch] slub: provide /proc/slabinfo Ingo Molnar
  0 siblings, 2 replies; 93+ messages in thread
From: Theodore Tso @ 2008-01-01 12:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Christoph Lameter, Willy Tarreau, Steven Rostedt, Linus Torvalds,
	Ingo Molnar, Peter Zijlstra, LKML, Andrew Morton,
	Christoph Hellwig, Rafael J. Wysocki

On Tue, Dec 25, 2007 at 04:34:18AM +0100, Andi Kleen wrote:
> > And is /sys/slab
> > guaranteed to be a stable and permanent interface if the SLAB
> 
> Debugging feature and "stable and permanent" just don't 
> fit together. It's like requesting stable and permanent
> sysrq output.

Yeah, unfortunately, querying to find out how much memory is being
used by which parts of the system is something which I think needs to
be more than just a debugging feature.  One could argue that "vmstat",
"iostat", and "sar" are debugging features as well, but other people
would consider them "critical programs to get information necessary to
monitor the health of their system".

Perhaps /proc/slabinfo should be that relatively stable interface, if
/sys/slab can't be guaranteed to be stable.  But there *are* people
who will want to monitor information like this on an ongoing basis
on production systems.  One of the major downsides of /sys seems to be
that it's very hard to keep it stable, so maybe it's not the right
tool for this.  /proc/slabinfo has the advantage that it's a simple
text file and its format is relatively well understood.  And doesn't a
patch to provide /proc/slabinfo for SLUB already exist?

							- Ted


* Re: Major regression on hackbench with SLUB (more numbers)
  2008-01-01 12:47                                                 ` Theodore Tso
@ 2008-01-01 15:19                                                   ` Andi Kleen
  2008-01-01 16:23                                                   ` [patch] slub: provide /proc/slabinfo Ingo Molnar
  1 sibling, 0 replies; 93+ messages in thread
From: Andi Kleen @ 2008-01-01 15:19 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau,
	Steven Rostedt, Linus Torvalds, Ingo Molnar, Peter Zijlstra,
	LKML, Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

> Yeah, unfortunately, querying to find out how much memory is being
> used by which parts of the system is something which I think needs to
> be more than just a debugging feature.  One could argue that "vmstat",

slabinfo is about querying kernel internals, and requiring stable
and permanent interfaces to that would imply freezing the kernel
internals.

People who want to have information about slabs and other similar
kernel internal allocations will always have to deal with version
dependencies.

No way around that.

> Perhaps /proc/slabinfo should be that relatively stable interface, if

That would require at least never changing any slab caches in any 
subsystem. Obviously not an option. 

-Andi


* [patch] slub: provide /proc/slabinfo
  2008-01-01 12:47                                                 ` Theodore Tso
  2008-01-01 15:19                                                   ` Andi Kleen
@ 2008-01-01 16:23                                                   ` Ingo Molnar
  2008-01-05  0:27                                                     ` Arjan van de Ven
  1 sibling, 1 reply; 93+ messages in thread
From: Ingo Molnar @ 2008-01-01 16:23 UTC (permalink / raw)
  To: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau,
	Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki


* Theodore Tso <tytso@MIT.EDU> wrote:

> [...] doesn't a patch to provide /proc/slabinfo for SLUB already 
> exist?

yes, the complete (and tested) patch is below. It has been through a few 
thousand random-bootup tests on x86 32-bit and 64-bit so it's v2.6.24 
material i think.

	Ingo

----------------->
Subject: slub: provide /proc/slabinfo
From: Pekka J Enberg <penberg@cs.helsinki.fi>

This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.

[ mingo@elte.hu: build fix. ]

Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/slub_def.h |    2 
 mm/slub.c                |  105 +++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 94 insertions(+), 13 deletions(-)

Index: linux/include/linux/slub_def.h
===================================================================
--- linux.orig/include/linux/slub_def.h
+++ linux/include/linux/slub_def.h
@@ -200,4 +200,6 @@ static __always_inline void *kmalloc_nod
 }
 #endif
 
+extern const struct seq_operations slabinfo_op;
+
 #endif /* _LINUX_SLUB_DEF_H */
Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -3076,6 +3076,19 @@ void *__kmalloc_node_track_caller(size_t
 	return slab_alloc(s, gfpflags, node, caller);
 }
 
+static unsigned long count_partial(struct kmem_cache_node *n)
+{
+	unsigned long flags;
+	unsigned long x = 0;
+	struct page *page;
+
+	spin_lock_irqsave(&n->list_lock, flags);
+	list_for_each_entry(page, &n->partial, lru)
+		x += page->inuse;
+	spin_unlock_irqrestore(&n->list_lock, flags);
+	return x;
+}
+
 #if defined(CONFIG_SYSFS) && defined(CONFIG_SLUB_DEBUG)
 static int validate_slab(struct kmem_cache *s, struct page *page,
 						unsigned long *map)
@@ -3458,19 +3471,6 @@ static int list_locations(struct kmem_ca
 	return n;
 }
 
-static unsigned long count_partial(struct kmem_cache_node *n)
-{
-	unsigned long flags;
-	unsigned long x = 0;
-	struct page *page;
-
-	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, lru)
-		x += page->inuse;
-	spin_unlock_irqrestore(&n->list_lock, flags);
-	return x;
-}
-
 enum slab_stat_type {
 	SL_FULL,
 	SL_PARTIAL,
@@ -4123,3 +4123,82 @@ static int __init slab_sysfs_init(void)
 
 __initcall(slab_sysfs_init);
 #endif
+
+/*
+ * The /proc/slabinfo ABI
+ */
+#ifdef CONFIG_PROC_FS
+
+static void print_slabinfo_header(struct seq_file *m)
+{
+	seq_puts(m, "slabinfo - version: 2.1\n");
+	seq_puts(m, "# name            <active_objs> <num_objs> <objsize> "
+		 "<objperslab> <pagesperslab>");
+	seq_puts(m, " : tunables <limit> <batchcount> <sharedfactor>");
+	seq_puts(m, " : slabdata <active_slabs> <num_slabs> <sharedavail>");
+	seq_putc(m, '\n');
+}
+
+static void *s_start(struct seq_file *m, loff_t *pos)
+{
+	loff_t n = *pos;
+
+	down_read(&slub_lock);
+	if (!n)
+		print_slabinfo_header(m);
+
+	return seq_list_start(&slab_caches, *pos);
+}
+
+static void *s_next(struct seq_file *m, void *p, loff_t *pos)
+{
+	return seq_list_next(p, &slab_caches, pos);
+}
+
+static void s_stop(struct seq_file *m, void *p)
+{
+	up_read(&slub_lock);
+}
+
+static int s_show(struct seq_file *m, void *p)
+{
+	unsigned long nr_partials = 0;
+	unsigned long nr_slabs = 0;
+	unsigned long nr_inuse = 0;
+	unsigned long nr_objs;
+	struct kmem_cache *s;
+	int node;
+
+	s = list_entry(p, struct kmem_cache, list);
+
+	for_each_online_node(node) {
+		struct kmem_cache_node *n = get_node(s, node);
+
+		if (!n)
+			continue;
+
+		nr_partials += n->nr_partial;
+		nr_slabs += atomic_long_read(&n->nr_slabs);
+		nr_inuse += count_partial(n);
+	}
+
+	nr_objs = nr_slabs * s->objects;
+	nr_inuse += (nr_slabs - nr_partials) * s->objects;
+
+	seq_printf(m, "%-17s %6lu %6lu %6u %4u %4d", s->name, nr_inuse,
+		   nr_objs, s->size, s->objects, (1 << s->order));
+	seq_printf(m, " : tunables %4u %4u %4u", 0, 0, 0);
+	seq_printf(m, " : slabdata %6lu %6lu %6lu", nr_slabs, nr_slabs,
+		   0UL);
+	seq_putc(m, '\n');
+	return 0;
+}
+
+const struct seq_operations slabinfo_op = {
+	.start = s_start,
+	.next = s_next,
+	.stop = s_stop,
+	.show = s_show,
+};
+
+#endif /* CONFIG_PROC_FS */


* Re: SLUB sysfs support
  2007-12-28  3:26                                                                     ` Al Viro
@ 2008-01-02 20:25                                                                       ` Christoph Lameter
  0 siblings, 0 replies; 93+ messages in thread
From: Christoph Lameter @ 2008-01-02 20:25 UTC (permalink / raw)
  To: Al Viro; +Cc: Steven Rostedt, Peter Zijlstra, LKML, Rafael J. Wysocki

On Fri, 28 Dec 2007, Al Viro wrote:

> On Thu, Dec 27, 2007 at 06:19:46PM -0800, Christoph Lameter wrote:
> > nfsd4_delegations? What is this about?
> 
> The random lifetimes of user-visible files you create in sysfs.

Well these are symlinks.

> > How do I scan for the symlinks in sysfs?
> 
> At which point are you going to do that?  AFAICS, the fundamental problem
> is that you
> 	* have aliases indistinguishable, so kmem_cache_destroy() can't tell
> which one is going away, no matter what
> 	* have per-alias objects in sysfs
> As the result, you have a user-visible mess in that directory in sysfs.
> And I don't see how you would deal with that - on the "the contents of
> directory changes in so-and-so way when such-and-such operation is
> done", not the implementation details one.

If the alias count of a kmem_cache structure reaches zero then all aliases 
can be taken down. At that point we know that no aliases exist anymore.

> BTW, I'm rather sceptical about free use of slabs; keep in mind that their
> names have to be unique with your sysfs layout, so...

What do you mean by free use of slabs? Creation of new slab caches to 
avoid the rounding up to order 2 size?



* Re: [patch] slub: provide /proc/slabinfo
  2008-01-01 16:23                                                   ` [patch] slub: provide /proc/slabinfo Ingo Molnar
@ 2008-01-05  0:27                                                     ` Arjan van de Ven
  2008-01-05  0:49                                                       ` Linus Torvalds
  0 siblings, 1 reply; 93+ messages in thread
From: Arjan van de Ven @ 2008-01-05  0:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Theodore Tso, Andi Kleen, Christoph Lameter, Willy Tarreau,
	Steven Rostedt, Linus Torvalds, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki

On Tue, 1 Jan 2008 17:23:28 +0100
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Theodore Tso <tytso@MIT.EDU> wrote:
> 
> > [...] doesn't a patch to provide /proc/slabinfo for SLUB already 
> > exist?
> 
> yes, the complete (and tested) patch is below. It has been through a
> few thousand random-bootup tests on x86 32-bit and 64-bit so it's
> v2.6.24 material i think.


> +static void *s_start(struct seq_file *m, loff_t *pos)
> +{
> +	loff_t n = *pos;
> +
> +	down_read(&slub_lock);
> +	if (!n)
> +		print_slabinfo_header(m);
> +
> +	return seq_list_start(&slab_caches, *pos);
> +}
> +


this is the part I'm not very thrilled about... at least on first sight
it looks like a user can now hold the read sem over system calls, and for
as long as it wants.  Either that's no problem (but then we wouldn't need
the lock in the first place), or it is a problem that then wants to be
fixed.


* Re: [patch] slub: provide /proc/slabinfo
  2008-01-05  0:27                                                     ` Arjan van de Ven
@ 2008-01-05  0:49                                                       ` Linus Torvalds
  0 siblings, 0 replies; 93+ messages in thread
From: Linus Torvalds @ 2008-01-05  0:49 UTC (permalink / raw)
  To: Arjan van de Ven
  Cc: Ingo Molnar, Theodore Tso, Andi Kleen, Christoph Lameter,
	Willy Tarreau, Steven Rostedt, Peter Zijlstra, LKML,
	Andrew Morton, Christoph Hellwig, Rafael J. Wysocki



On Fri, 4 Jan 2008, Arjan van de Ven wrote:
> 
> this is the part I'm not very thrilled about... at least on first sight 
> it looks like a user can now hold the read sem over system calls, and 
> for as long as it wants.

No, you misunderstand how seq-files work.

The start/stop sequence is done only over one single kernel buffer 
instance, not over the whole open/close (or even the copy to user space).

It's expressly designed so that you can hold locks (including spinlocks 
etc), and will not do any blocking ops (well - your *callbacks* can do 
blocking ops, but the seq_file stuff itself won't) between start/stop.

		Linus



Thread overview: 93+ messages
2007-12-07 17:50 Major regression on hackbench with SLUB Steven Rostedt
2007-12-07 17:55 ` Ingo Molnar
2007-12-07 17:59 ` Linus Torvalds
2007-12-07 18:36   ` Linus Torvalds
2007-12-07 18:58     ` Steven Rostedt
2007-12-08 22:16     ` Steven Rostedt
2007-12-10  7:38       ` Björn Steinbrink
2007-12-10 14:00         ` Steven Rostedt
2007-12-09  0:28     ` Steven Rostedt
2007-12-13 22:11     ` Christoph Lameter
2007-12-11 14:33 ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
2007-12-13 22:22   ` Christoph Lameter
2007-12-14  4:24     ` Christoph Lameter
2007-12-14  6:15       ` Steven Rostedt
2007-12-21 12:09       ` Ingo Molnar
2007-12-21 12:26         ` Ingo Molnar
2007-12-21 16:26           ` Christoph Lameter
2007-12-21 16:33             ` Ingo Molnar
2007-12-21 21:56               ` Christoph Lameter
2007-12-21 16:18         ` Christoph Lameter
2007-12-21 16:44           ` Linus Torvalds
2007-12-21 22:11             ` Christoph Lameter
2007-12-21 22:16               ` Peter Zijlstra
2007-12-21 22:17                 ` Peter Zijlstra
2007-12-21 22:27                   ` Christoph Lameter
2007-12-21 22:54                     ` Ingo Molnar
2007-12-21 23:20                       ` David Miller
2007-12-22  9:45                         ` Ingo Molnar
2007-12-24 19:24                           ` Christoph Lameter
2007-12-21 23:27                       ` Andrew Morton
2007-12-21 23:51                       ` Christoph Lameter
2007-12-22  0:28                       ` Andi Kleen
2007-12-22  0:33                         ` Chuck Ebbert
2007-12-22  2:19                         ` Matt Mackall
2007-12-22  2:44                           ` Andi Kleen
2007-12-22 10:03                         ` Ingo Molnar
2007-12-22 12:37                           ` Andi Kleen
2007-12-22 12:51                             ` Pekka Enberg
2007-12-22 12:54                               ` Andi Kleen
2007-12-22 13:27                                 ` Pekka Enberg
2007-12-22 13:01                             ` Alexey Dobriyan
2007-12-22 18:01                           ` Linus Torvalds
2007-12-22 19:09                             ` Ingo Molnar
2007-12-22 19:29                               ` Ingo Molnar
2007-12-22 22:52                               ` Jason L Tibbitts III
2007-12-24  3:59                                 ` Alessandro Suardi
2007-12-22 19:25                             ` Theodore Tso
2007-12-22 21:00                               ` Linus Torvalds
2007-12-22 21:50                                 ` Al Viro
2007-12-22 23:29                                   ` Al Viro
2007-12-22 22:10                                 ` Willy Tarreau
2007-12-23  0:59                                   ` Steven Rostedt
2007-12-23  5:12                                     ` Willy Tarreau
2007-12-23 14:15                                       ` Andi Kleen
2007-12-24  3:45                                         ` Theodore Tso
2007-12-24 19:21                                           ` Christoph Lameter
2007-12-24 23:37                                             ` Theodore Tso
2007-12-25  3:34                                               ` Andi Kleen
2008-01-01 12:47                                                 ` Theodore Tso
2008-01-01 15:19                                                   ` Andi Kleen
2008-01-01 16:23                                                   ` [patch] slub: provide /proc/slabinfo Ingo Molnar
2008-01-05  0:27                                                     ` Arjan van de Ven
2008-01-05  0:49                                                       ` Linus Torvalds
2007-12-26 21:31                                               ` Major regression on hackbench with SLUB (more numbers) Christoph Lameter
2007-12-26 22:16                                                 ` Al Viro
2007-12-27  5:51                                                   ` Christoph Lameter
2007-12-27 20:28                                                     ` SLUB sysfs support Christoph Lameter
2007-12-27 22:59                                                       ` Al Viro
2007-12-27 23:22                                                         ` Christoph Lameter
2007-12-27 23:53                                                           ` Christoph Lameter
2007-12-27 23:58                                                             ` Al Viro
2007-12-28  0:02                                                               ` Christoph Lameter
2007-12-28  0:45                                                                 ` Al Viro
2007-12-28  2:19                                                                   ` Christoph Lameter
2007-12-28  3:26                                                                     ` Al Viro
2008-01-02 20:25                                                                       ` Christoph Lameter
2007-12-27 23:54                                                           ` Al Viro
2007-12-28  9:00                                                   ` Major regression on hackbench with SLUB (more numbers) Ingo Molnar
2007-12-28 14:52                                                     ` Steven Rostedt
2007-12-22 19:46                             ` slabtop replacement was " Andi Kleen
2007-12-22 23:28                               ` Karol Swietlicki
2007-12-29 18:08                                 ` Karol Swietlicki
2007-12-21 22:49               ` Linus Torvalds
2007-12-21 17:44         ` Pekka Enberg
2007-12-21 22:22           ` Christoph Lameter
2007-12-21 22:37             ` Christoph Lameter
2007-12-21 22:51               ` Linus Torvalds
2007-12-21 23:17                 ` Ingo Molnar
2007-12-21 23:40                   ` Pekka Enberg
2007-12-21 23:54                   ` Christoph Lameter
2007-12-22  0:02                     ` Chuck Ebbert
2007-12-21 22:52               ` Steven Rostedt
2007-12-21 22:56               ` Ingo Molnar
