From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jesper Dangaard Brouer
Subject: Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)
Date: Mon, 7 Sep 2015 23:17:15 +0200
Message-ID: <20150907231715.0a375b40@redhat.com>
References: <20150903005115.GA27804@redhat.com>
	<20150903060247.GV1933@devil.localdomain>
	<20150904032607.GX1933@devil.localdomain>
	<20150907113026.5bb28ca3@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: Dave Chinner, Mike Snitzer, Christoph Lameter, Pekka Enberg,
	Andrew Morton, David Rientjes, Joonsoo Kim, "dm-devel@redhat.com",
	Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
	Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm,
	"netdev@vger.kernel.org", brouer@redhat.com
To: Linus Torvalds
Return-path:
In-Reply-To:
Sender: owner-linux-mm@kvack.org
List-Id: netdev.vger.kernel.org

On Mon, 7 Sep 2015 13:22:13 -0700 Linus Torvalds wrote:

> On Mon, Sep 7, 2015 at 2:30 AM, Jesper Dangaard Brouer wrote:
> >
> > The slub allocator has a faster "fastpath", if your workload is
> > fast-reusing within the same per-cpu page-slab, but once the workload
> > increases you hit the slowpath, and then slab catches up. Slub looks
> > great in micro-benchmarking.
> >
> > And with "slab_nomerge" I get even higher performance:
>
> I think those two are related.
>
> Not merging means that effectively the percpu caches end up being
> bigger (simply because there are more of them), and so it captures
> more of the fastpath cases.

Yes, that was also my theory, as manually tuning the percpu sizes gave
me almost the same boost.

> Obviously the percpu queue size is an easy tunable too, but there are
> real downsides to that too.

The easy fix is to introduce a subsystem-specific percpu cache that is
large enough for our use-case. That seems to be a trend. I'm hoping to
come up with something smarter that every subsystem can benefit from,
e.g. some heuristic that can dynamically adjust SLUB according to the
usage pattern.

I can imagine something as simple as a counter for every slowpath
call, which is only valid as long as the jiffies count matches (reset
to zero, and store the new jiffies count). (But I have not thought
this through...)
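To illustrate the idea, here is a rough, hypothetical sketch (not code
from any patch set; the names are made up): a per-cpu counter that is
only trusted while its stored jiffies value matches the current one.

#include <linux/jiffies.h>
#include <linux/percpu.h>

/*
 * Hypothetical sketch only.  One counter per CPU; it is only
 * considered valid while the stored jiffies value matches the
 * current jiffies, otherwise it is reset before use.
 */
struct slowpath_hint {
	unsigned long last_jiffies;	/* jiffy in which 'count' is valid */
	unsigned int  count;		/* slowpath calls seen in that jiffy */
};

static DEFINE_PER_CPU(struct slowpath_hint, slowpath_hint);

/*
 * Called from the allocator slowpath (preemption is already disabled
 * there).  Returns true when the slowpath was hit more than
 * 'threshold' times within a single jiffy, which could be used as a
 * signal that the percpu cache is too small for the current workload.
 */
static bool slowpath_pressure(unsigned int threshold)
{
	struct slowpath_hint *hint = this_cpu_ptr(&slowpath_hint);

	if (hint->last_jiffies != jiffies) {
		hint->last_jiffies = jiffies;	/* store new jiffies cnt */
		hint->count = 0;		/* reset to zero */
	}
	return ++hint->count > threshold;
}

How the return value should then feed back into the percpu cache
sizing is exactly the part I have not thought through yet.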
> I suspect your IP forwarding case isn't so different from some of the
> microbenchmarks, it just has more outstanding work..

Yes, I will admit that my testing is very close to micro-benchmarking,
and it is specifically designed to pressure the system to its
limits[1]. Especially the minimum frame size is evil and unrealistic,
but the real purpose is preparing the stack for increasing speeds like
100Gbit/s.

> And yes, the slow path (ie not hitting in the percpu cache) of SLUB
> could hopefully be optimizable too, although maybe the bulk patches
> are the way to go (and unrelated to this thread - at least part of
> your bulk patches actually got merged last Friday - they were part of
> Andrew's patch-bomb).

Cool. Yes, it is only part of the bulk patches. The real performance
boosters are not in yet (but I need to make them work correctly with
memory debugging enabled before they can get merged). At least the
main API is in, which allows me to implement the use-cases more easily
in other subsystems :-)

[1] http://netoptimizer.blogspot.dk/2014/09/packet-per-sec-measurements-for.html

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer