Date: Tue, 20 Aug 2013 17:41:39 -0400 (EDT)
From: Mikulas Patocka
To: device-mapper development
cc: Frank Mayhar, linux-kernel@vger.kernel.org
Subject: Re: [dm-devel] dm: Make MIN_IOS, et al, tunable via sysctl.

On Mon, 19 Aug 2013, Mike Snitzer wrote:

> On Fri, Aug 16 2013 at 6:55pm -0400,
> Frank Mayhar wrote:
>
> > The device mapper and some of its modules allocate memory pools at
> > various points when setting up a device.  In some cases these pools are
> > fairly large; for example, the multipath module allocates a 256-entry
> > pool and dm itself allocates three of that size.  In a
> > memory-constrained environment where we're creating a lot of these
> > devices, the memory use can quickly become significant.  Unfortunately,
> > there's currently no way to change the size of the pools other than by
> > changing a constant and rebuilding the kernel.
> >
> > This patch fixes that by changing the hardcoded MIN_IOS (and certain
> > other) #defines in dm-crypt, dm-io, dm-mpath, dm-snap and dm itself to
> > sysctl-modifiable values.  This lets us change the size of these pools
> > on the fly, so we can reduce the size of the pools and reduce memory
> > pressure.
>
> These memory reserves are a long-standing issue with DM (made worse when
> request-based mpath was introduced).  Two years ago I assembled a patch
> series that took one approach to trying to fix it:
> http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/series.html
>
> But in the end I wasn't convinced sharing the memory reserve would allow
> for 100s of mpath devices to make forward progress if memory is
> depleted.
>
> All said, I think adding the ability to control the size of the memory
> reserves is reasonable.  It allows informed admins to establish lower
> reserves (based on the awareness that rq-based mpath doesn't need to
> support really large IOs, etc) without compromising the ability to make
> forward progress.
>
> But, as mentioned in my previous mail, I'd like to see this implemented
> in terms of module_param_named().
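For reference, a module_param_named()-based knob could look something like
the untested sketch below.  The parameter name "reserved_ios", the variable
"dm_foo_reserved_ios" and the cache "_foo_cache" are invented here purely
for illustration; they are not taken from Frank's patch.

#include <linux/module.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* illustrative names only, not the real dm identifiers */
static unsigned dm_foo_reserved_ios = 256;
module_param_named(reserved_ios, dm_foo_reserved_ios, uint, S_IRUGO | S_IWUSR);
MODULE_PARM_DESC(reserved_ios, "Number of entries reserved in the foo mempool");

static struct kmem_cache *_foo_cache;

static mempool_t *create_foo_pool(void)
{
	/* read the knob at pool-creation time instead of a MIN_IOS constant */
	return mempool_create_slab_pool(dm_foo_reserved_ios, _foo_cache);
}

Such a parameter shows up under /sys/module/<module>/parameters/reserved_ios
and can also be set on the modprobe command line; since it is only read when
a pool is created, changing it at runtime affects devices created afterwards.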
> > We tested performance of dm-mpath with smaller MIN_IOS sizes for both dm
> > and dm-mpath, from a value of 32 all the way down to zero.
>
> Bio-based can safely be reduced, as this older (uncommitted) patch did:
> http://people.redhat.com/msnitzer/patches/upstream/dm-rq-based-mempool-sharing/0000-dm-lower-bio-based-reservation.patch
>
> > Bearing in mind that the underlying devices were network-based, we saw
> > essentially no performance degradation; if there was any, it was down
> > in the noise.  One might wonder why these sizes are the way they are;
> > I investigated and they've been unchanged since at least 2006.
>
> Performance isn't the concern.  The concern is: does DM allow for
> forward progress if the system's memory is completely exhausted?

There is one possible deadlock that was introduced in commit
d89d87965dcbe6fe4f96a2a7e8421b3a75f634d1 in 2.6.22-rc1. Unfortunately,
no one found that bug at the time, and now it seems hard to revert.

The problem is this:

* We send bio1 to device dm-1; device mapper splits it into bio2 and bio3
  and sends both of them to device dm-2.  These two bios are added to
  current->bio_list.

* bio2 is popped off current->bio_list, a mempool entry from device dm-2's
  mempool is allocated, and bio4 is created and sent to device dm-3.  bio4
  is added to the end of current->bio_list.

* bio3 is popped off current->bio_list and a mempool entry from device
  dm-2's mempool is allocated.  Suppose the mempool is exhausted, so we
  wait until some existing work (bio2) finishes and returns an entry to
  the mempool.

So: bio3's request routine waits until bio2 finishes and refills the
mempool.  bio2 is waiting for bio4 to finish.  bio4 is in
current->bio_list and is waiting until bio3's request routine finishes.
Deadlock.

In practice it is not so serious, because in mempool_alloc there is:
	/*
	 * FIXME: this should be io_schedule().  The timeout is there as a
	 * workaround for some DM problems in 2.6.18.
	 */
	io_schedule_timeout(5*HZ);
- so it waits for 5 seconds and retries.  If there is something in the
system that is able to free memory, it resumes.

> This is why request-based has such an extensive reserve, because it
> needs to account for cloning the largest possible request that comes in
> (with multiple bios).

Mikulas
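P.S. Below is a toy, single-threaded model of the cycle above, for anyone
who wants to see its shape outside the kernel.  This is not kernel code;
every name in it is made up, and unlike the real mempool_alloc it never
falls back to the underlying allocator and nothing else can free memory,
which is exactly why it spins forever where the real code usually recovers
after a few 5-second retries.

#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define MAX_WORK 8
static const char *work_list[MAX_WORK];	/* stands in for current->bio_list */
static int head, tail;
static int pool = 1;			/* last free entry in dm-2's "mempool" */

static void queue_work(const char *bio)
{
	work_list[tail++] = bio;
}

static void handle(const char *bio)
{
	if (!strcmp(bio, "bio2")) {
		pool--;			/* bio2 takes the last mempool entry */
		queue_work("bio4");	/* and defers bio4 to the list tail;
					   the entry is returned only when
					   bio4 completes */
	} else if (!strcmp(bio, "bio3")) {
		while (pool == 0) {	/* pool exhausted: retry, like
					   mempool_alloc's io_schedule_timeout */
			printf("bio3: pool empty, retrying\n");
			sleep(5);
		}
		pool--;
	} else if (!strcmp(bio, "bio4")) {
		pool++;			/* completing bio4 would refill the pool */
	}
}

int main(void)
{
	/* dm-1 has split bio1 into bio2 and bio3, both already on the list */
	queue_work("bio2");
	queue_work("bio3");

	/* a single dispatch context drains the list in FIFO order; it gets
	   stuck in bio3's handler before it ever reaches bio4, so the pool
	   is never refilled */
	while (head < tail)
		handle(work_list[head++]);

	puts("never reached");
	return 0;
}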