Date: Fri, 4 Sep 2015 13:26:07 +1000
From: Dave Chinner
Subject: Re: slab-nomerge (was Re: [git pull] device mapper changes for 4.3)
To: Linus Torvalds
Cc: Mike Snitzer, Christoph Lameter, Pekka Enberg, Andrew Morton,
    David Rientjes, Joonsoo Kim, "dm-devel@redhat.com",
    Alasdair G Kergon, Joe Thornber, Mikulas Patocka, Vivek Goyal,
    Sami Tolvanen, Viresh Kumar, Heinz Mauelshagen, linux-mm

On Thu, Sep 03, 2015 at 08:02:40AM -0700, Linus Torvalds wrote:
> On Wed, Sep 2, 2015 at 11:02 PM, Dave Chinner wrote:
> > On Wed, Sep 02, 2015 at 06:21:02PM -0700, Linus Torvalds wrote:
> > Right, it's not xyzzy-specific where 'xyzzy' is a subsystem. The
> > flag application is actually *object specific*. That is, it's the
> > use of the individual objects that determines whether it should
> > be merged or not.
>
> Yes.
>
> I do agree that something like SLAB_NO_MERGE can make sense on an
> actual object-specific level, if you have very specific allocation
> pattern knowledge and can show that the merging actually hurts.

There are generic cases where it hurts, so no justification should
be needed for those cases...

> > e.g. Slab fragmentation levels are affected more than anything by
> > mixing objects with different life times in the same slab. i.e. if
> > we free all the short lived objects from a page but there is one
> > long lived object on the page, then that page is pinned and we
> > free no memory. Do that to enough pages in the slab, and we end
> > up with a badly fragmented slab.
>
> The thing is, *if* you can show that kind of behavior for a
> particular slab, and have numbers for it, then mark that slab as
> no-merge, and document why you did it.

The double standard is the problem here. No notification, proof,
discussion or review was needed to turn on slab merging for
everyone, but you're setting a very high bar to clear if anyone
wants to turn it off for their own caches.

> And quite frankly, I don't actually think you have the numbers to
> show that theoretical bad behavior.

I don't keep numbers close at hand. I've been dealing with these
problems for ten years, so I just know which workloads demonstrate
this "theoretical bad behaviour" in specific slabs, and I test them
when relevant. I'll do a couple of quick "merging is better"
verification tests this afternoon, but other than that I don't have
time in the next couple of weeks...

But speaking of workloads: internal inode cache slab fragmentation
is simple to reproduce on any filesystem. XFS just happens to be the
only one that actively manages it, as a result of long-term
developer awareness of the problem.
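To make "object specific" concrete before I go on: the opt-out being
discussed above would be a one-flag change at cache creation time. A
sketch only - SLAB_NO_MERGE is the flag being floated in this thread,
not something in mainline, and the cache name, object type and ctor
are made up for illustration:

	/*
	 * Sketch: a cache whose objects are known to be long-lived
	 * opts out of merging at creation, so its backing pages are
	 * never shared with short-lived objects from other caches.
	 * SLAB_NO_MERGE is hypothetical; the rest is the normal
	 * kmem_cache_create() API.
	 */
	foo_inode_cache = kmem_cache_create("foo_inode",
				sizeof(struct foo_inode), 0,
				SLAB_RECLAIM_ACCOUNT | SLAB_NO_MERGE,
				foo_inode_init_once);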
I first tripped over it in early 2005 with SpecSFS, and then with
other similar NFS benchmarks like filebench. That's where Christoph
Lameter was introduced to the problem, too:

https://lwn.net/Articles/371892/

	"The problem is that sparse use of objects in slab caches can
	cause large amounts of memory to become unusable. The first
	ideas to address this were developed in 2005 by various
	people."

FYI, with appropriate manual "drop slab" hacks during the benchmark
(a sketch of the modern equivalent follows below), we could get
20-25% higher throughput from the NFS server, because dropping the
entire slab cache before the measurement phase meant we avoided the
slab fragmentation issue and had ~50% more free memory to use for
the page cache during the measurement period...

Similar problems have been reported over the years by users with
backup programs or scripts that used find, rsync and/or 'cp -R' on
large filesystems. It used to be easy to cause these sorts of
problems in the XFS inode cache. There are quite a few other
workloads, but it is easy to reproduce inode slab fragmentation with
find, bulkstat and cp. Basically, all you need to do is populate the
inode cache, randomise the LRU order, then trigger combined inode
cache and memory demand. It's that simple.

The biggest problem with using a workload like this to "prove" that
slab merging degrades behaviour is that we don't know which slabs
have been merged. Hence it's extremely hard to write a workload
definition that demonstrates it. Indeed, change a kernel config
option and structures change size, so the slab gets merged with
different objects and the workload that triggers the problem has to
change, too. It doesn't even need to be a kernel with a different
config - merely loading a different set of modules because the
hardware or software configuration differs will change which slabs
are merged. IOWs, what produces a problem on one kernel on one
machine will not reproduce the same problem on a different kernel
or machine. Numbers are a crapshoot here, especially as the cause
of the problem is trivially easy to understand.

Linus, you always say that at some point you've just got to step
back, read the code and understand the underlying issue being dealt
with, because some things are way too complex to reproduce reliably.
This is one of those cases. It's obvious that slab merging does not
fix or prevent internal slab cache fragmentation; it only minimises
the impact of fragmentation by amortising it across multiple similar
slabs. Really, that is the best we can do with passive slab caches,
where you can't control freeing patterns.

However, we also have actively managed slab caches, and they can and
do work to prevent fragmentation and clear it quickly when it
happens. Merging these actively managed slabs with other passive
slabs is just a bad idea, because the passive slab objects can only
reduce the effectiveness of the active management algorithms. We
don't need numbers to understand this - it's clear and obvious from
an algorithmic point of view.

> In contrast, there really *are*
> numbers to show the advantages of merging.

I have never denied that. Please listen to what I'm saying.

> So the fragmentation argument has been shown to generally be in
> favor of merging, _not_ in favor of that "no-merge" behavior.

Yes, and all the numbers and research I've seen have been on passive
slab cache behaviour.
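(As an aside, since the "drop slab" hack keeps coming up: whatever
form it took back then, the modern equivalent is a single write to
/proc/sys/vm/drop_caches - writing 2 asks the kernel to reclaim slab
objects such as dentries and inodes. A minimal userspace sketch:)

	#include <fcntl.h>
	#include <unistd.h>

	/*
	 * Drop reclaimable slab objects before the measurement
	 * phase - the same effect the old benchmark hack had.
	 * Needs root.
	 */
	static void drop_slab_caches(void)
	{
		int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

		if (fd >= 0) {
			write(fd, "2", 1); /* 2 == free slab objects */
			close(fd);
		}
	}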
I *agree* that passive slab caches should be merged, but I don't
recall anyone documenting the behavioural distinction between active
and passive slabs before now, even though it's been something I've
had in my head for several years. Actively managed slabs behave very
differently from passive slabs, so what holds true for passive slabs
is not necessarily true for actively managed ones.

Really, we don't need some stupidly high bar to jump over here.
Whether merging should be allowed can be answered with a simple
question: "Does the slab have a shrinker, or does it back a
mempool?" If the answer is yes, then using SLAB_SHRINKER or
SLAB_MEMPOOL to trigger the no-merge case doesn't need any more
justification from subsystem maintainers at all.

Cheers,

Dave.
--
Dave Chinner
dchinner@redhat.com
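P.S. To make the proposal concrete: the mechanism is tiny. Assuming
SLAB_SHRINKER and SLAB_MEMPOOL existed as cache creation flags (they
don't - they're the proposal above), the merge-time check would just
grow one more case. A sketch against the current slab_unmergeable()
logic in mm/slab_common.c:

	/* Hypothetical flags, set by subsystems at cache creation. */
	#define SLAB_ACTIVELY_MANAGED	(SLAB_SHRINKER | SLAB_MEMPOOL)

	int slab_unmergeable(struct kmem_cache *s)
	{
		if (slab_nomerge || (s->flags & SLAB_NEVER_MERGE))
			return 1;

		/*
		 * Caches with a shrinker or backing a mempool are
		 * actively managed: never merge them with passive
		 * caches, so the management algorithms keep control
		 * of object placement and freeing patterns.
		 */
		if (s->flags & SLAB_ACTIVELY_MANAGED)
			return 1;

		if (s->ctor)
			return 1;

		/* May have been made unmergeable during bootstrap. */
		if (s->refcount < 0)
			return 1;

		return 0;
	}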