Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator

From: David Rientjes <rientjes@google.com>
To: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Nick Piggin <npiggin@suse.de>,
	Christoph Lameter <cl@linux-foundation.org>,
	Christoph Lameter <cl@linux.com>,
	linux-mm@kvack.org, LKML <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Zhang Yanmin <yanmin_zhang@linux.intel.com>,
	Matthew Wilcox <willy@linux.intel.com>,
	Matt Mackall <mpm@selenic.com>
Subject: Re: [RFC V2 SLEB 00/14] The Enhanced(hopefully) Slab Allocator
Date: Tue, 25 May 2010 12:57:06 -0700 (PDT)
Message-ID: <alpine.DEB.2.00.1005251247090.20631@chino.kir.corp.google.com>
In-Reply-To: <AANLkTin5PNELUXc6oCHadVyX-YcAEalRSppjz4GMyIBh@mail.gmail.com>

On Tue, 25 May 2010, Pekka Enberg wrote:

> > The code may be much cleaner and simpler than slab, but nobody (to date)
> > has addressed the significant netperf TCP_RR regression that slub has, for
> > example. I worked on a patchset to do that for a while but it wasn't
> > popular because it added some increments to the fastpath for tracking
> > data.
> 
> Yes and IIRC I asked you to resend the series because while I care a
> lot about performance regressions, I simply don't have the time or the
> hardware to reproduce and fix the weird cases you're seeing.
> 

My patchset still never attained parity with slab even though it improved 
slub's performance for that specific benchmark on my 16-core machine with 
64G of memory:

	# threads	SLAB		SLUB		SLUB+patchset
	16		69892		71592		69505
	32		126490		95373		119731
	48		138050		113072		125014
	64		169240		149043		158919
	80		192294		172035		179679
	96		197779		187849		192154
	112		217283		204962		209988
	128		229848		217547		223507
	144		238550		232369		234565
	160		250333		239871		244789
	176		256878		242712		248971
	192		261611		243182		255596
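
To put the gap in relative terms, here is a minimal sketch that expresses
SLUB and SLUB+patchset as a signed percentage of the SLAB figure.  The
rel_pct helper is a name made up for this illustration, and only three
rows of the table above are copied in:

```shell
#!/bin/bash

# rel_pct BASE VALUE: print how far VALUE is from BASE, as a signed
# percentage of BASE with one decimal place.  (Hypothetical helper.)
rel_pct() {
	awk -v base="$1" -v value="$2" \
		'BEGIN { printf "%+.1f\n", 100 * (value - base) / base }'
}

# Rows copied from the table above: threads, SLAB, SLUB, SLUB+patchset.
while read -r threads slab slub patched; do
	echo "threads=$threads" \
	     "slub=$(rel_pct "$slab" "$slub")%" \
	     "patchset=$(rel_pct "$slab" "$patched")%"
done <<EOF
16 69892 71592 69505
32 126490 95373 119731
192 261611 243182 255596
EOF
```

At 32 threads, for instance, this puts unpatched SLUB about 24.6% below
SLAB and the patched kernel about 5.3% below it.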

CONFIG_SLUB_STATS demonstrates that the kmalloc-256 and kmalloc-2048 caches
are performing quite poorly without the changes:

	cache		ALLOC_FASTPATH	ALLOC_SLOWPATH
	kmalloc-256	98125871	31585955
	kmalloc-2048	77243698	52347453

	cache		FREE_FASTPATH	FREE_SLOWPATH
	kmalloc-256	173624		129538000
	kmalloc-2048	90520		129500630
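
The counters are easier to compare as a share of operations that missed
the fastpath.  A small sketch of that arithmetic, using only the numbers
in the two tables above (slowpath_pct is a hypothetical helper name):

```shell
#!/bin/bash

# slowpath_pct FAST SLOW: print the percentage of operations that took
# the slowpath, to one decimal place.  (Hypothetical helper.)
slowpath_pct() {
	awk -v fast="$1" -v slow="$2" \
		'BEGIN { printf "%.1f\n", 100 * slow / (fast + slow) }'
}

# Arguments are the FASTPATH and SLOWPATH columns from the tables above.
echo "kmalloc-256  alloc slowpath: $(slowpath_pct 98125871 31585955)%"
echo "kmalloc-2048 alloc slowpath: $(slowpath_pct 77243698 52347453)%"
echo "kmalloc-256  free  slowpath: $(slowpath_pct 173624 129538000)%"
echo "kmalloc-2048 free  slowpath: $(slowpath_pct 90520 129500630)%"
```

On these figures roughly 24% of kmalloc-256 allocations and 40% of
kmalloc-2048 allocations take the slowpath, and nearly every free
(about 99.9%) does for both caches.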

When you have these types of results, it's obvious why slub is failing to 
achieve the same performance as slab.  With the slub fastpath percpu work 
that has been done recently, it might be possible to resurrect my patchset 
and get more positive feedback because the penalty won't be as 
significant, but the point is that slub still fails to achieve the same 
results that slab can with heavy networking loads.  Thus, I think any 
discussion about removing slab is premature until it's no longer shown to 
be a clear winner in comparison to its replacement, whether that is slub, 
slqb, sleb, or another allocator.  I agree that slub is clearly better in 
terms of maintainability, but we simply can't use it because of its 
performance for networking loads.

If you want to duplicate these results on machines with a larger number of 
cores, just download netperf, run with CONFIG_SLUB on both netserver and 
netperf machines, and use this script:

#!/bin/bash

TIME=60				# seconds
HOSTNAME=<hostname>		# netserver

NR_CPUS=$(grep -c ^processor /proc/cpuinfo)
echo NR_CPUS=$NR_CPUS

# Start $1 concurrent TCP_RR clients and wait for all of them to finish.
run_netperf() {
	for i in $(seq 1 "$1"); do
		netperf -H "$HOSTNAME" -t TCP_RR -l "$TIME" &
	done
	wait
}

ITERATIONS=0
while [ $ITERATIONS -lt 12 ]; do
	RATE=0
	ITERATIONS=$((ITERATIONS + 1))
	THREADS=$((NR_CPUS * ITERATIONS))
	# Drop lines containing letters (headers); field 6 is the per-client rate.
	RESULTS=$(run_netperf "$THREADS" | grep -v '[a-zA-Z]' | awk '{ print $6 }')

	# Sum the per-client rates, truncating each one to an integer.
	for j in $RESULTS; do
		RATE=$((RATE + ${j%%.*}))
	done
	echo threads=$THREADS rate=$RATE
done
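
To collect the fastpath/slowpath counters quoted earlier after a run,
something like the following sketch may help.  It assumes a kernel built
with CONFIG_SLUB_STATS=y that exposes the counters as files under
/sys/kernel/slab/<cache>/, with the total as the first field of each
file; the SLAB_SYSFS variable and the read_stat helper are names made up
for this example:

```shell
#!/bin/bash

# Assumption: per-cache statistics live here when CONFIG_SLUB_STATS=y.
SLAB_SYSFS=${SLAB_SYSFS:-/sys/kernel/slab}

# read_stat CACHE STAT: print the first field (assumed to be the total)
# of the given statistics file, or 0 if the file cannot be read.
read_stat() {
	local f="$SLAB_SYSFS/$1/$2"
	if [ -r "$f" ]; then
		awk '{ print $1; exit }' "$f"
	else
		echo 0
	fi
}

for cache in kmalloc-256 kmalloc-2048; do
	for op in alloc free; do
		fast=$(read_stat "$cache" "${op}_fastpath")
		slow=$(read_stat "$cache" "${op}_slowpath")
		echo "$cache $op fastpath=$fast slowpath=$slow"
	done
done
```

A missing file is reported as 0 rather than treated as an error, so the
loop also runs on kernels without the statistics, where every counter
simply prints as 0.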