From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Snitzer
Subject: Re: block: fix blk_queue_split() resource exhaustion
Date: Fri, 8 Jul 2016 09:05:51 -0400
Message-ID: <20160708130550.GB9196@redhat.com>
References: <1466583730-28595-1-git-send-email-lars.ellenberg@linbit.com>
 <871t36ggcr.fsf@notabene.neil.brown.name>
 <20160707081616.GH13335@soda.linbit>
 <87vb0hf6fb.fsf@notabene.neil.brown.name>
 <20160708125225.GV13335@soda.linbit>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <20160708125225.GV13335@soda.linbit>
Sender: linux-raid-owner@vger.kernel.org
To: Lars Ellenberg
Cc: Ming Lei, NeilBrown, linux-block, Jens Axboe,
 "open list:SOFTWARE RAID (Multiple Disks) SUPPORT",
 Linux Kernel Mailing List, "Martin K. Petersen", Peter Zijlstra,
 Jiri Kosina, "open list:BCACHE (BLOCK LAYER CACHE)", Zheng Liu,
 Keith Busch, Takashi Iwai, "open list:DEVICE-MAPPER (LVM)",
 Ingo Molnar, "Kirill A. Shutemov", Shaohua Li, Kent Overstreet,
 Alasdair Kergon, Roland Kammerer
List-Id: linux-raid.ids

On Fri, Jul 08 2016 at 8:52am -0400,
Lars Ellenberg wrote:

> On Fri, Jul 08, 2016 at 07:08:32PM +0800, Ming Lei wrote:
> > > So after processing a particular bio, we should then process all the
> > > 'child' bios - bios send to underlying devices.  Then the 'sibling'
> > > bios, that were split off, and then any remaining parents and ancestors.
> >
> > IMHO, that is just what the oneline patch is doing, isn't it?
> >
> > | diff --git a/block/blk-core.c b/block/blk-core.c
> > | index 2475b1c7..a5623f6 100644
> > | --- a/block/blk-core.c
> > | +++ b/block/blk-core.c
> > | @@ -2048,7 +2048,7 @@ blk_qc_t generic_make_request(struct bio *bio)
> > |  	 * should be added at the tail
> > |  	 */
> > |  	if (current->bio_list) {
> > | -		bio_list_add(current->bio_list, bio);
> > | +		bio_list_add_head(current->bio_list, bio);
> > |  		goto out;
> > |  	}
>
> Almost, but not quite.
>
> As explained earlier, this will re-order.
> It will still process bios in "deepest level first" order,
> but it will process "sibling" bios in reverse submission order.
>
> Think "very large bio" submitted to a stripe set
> with small stripe width/stripe unit size.
>
> So I'd expect this to be a performance hit in some scenarios,
> unless the stack at some deeper level does back-merging in its elevator.
> (If some driver is not able to merge stuff because of "reverse submission
> order" this can easily mean saturating IOPS of the physical device with
> small requests, throttling bandwidth to a minimum.)
>
> That's why I mentioned it as "potential easy fix for the deadlock",
> but did not suggest it as the proper way to fix this.
>
> If however the powers that be decide that this was a non-issue,
> we could use it this way.

No, we cannot do this.  blk-mq doesn't have any of the more elaborate
IO scheduling that request_fn request_queues have.  We should not be
knowingly mangling the order of IO with the thought that some other
layer will fix it up.

I think it best for you to rebase your work (against jens' for-4.8/core)
into a single coherent patch and resubmit for 4.8 inclusion.  I really
don't see a huge benefit to keeping neilb's suggestion split out -- but
if you or others do, that's fine.

The only concern I have relative to DM is: DM doesn't use
blk_queue_split, so will it need to open-code setting
recursion/remainder in order to ensure forward progress?  neilb seemed
to think the rework in generic_make_request would "just work" for the
dm-snapshot deadlock case though, so maybe this isn't a valid concern...
unfortunately we don't have a quick reproducer for that dm-snapshot
issue, so it'll take a bit to prove.
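[Editor's note: the sibling-reordering effect Lars describes can be sketched
with a toy model (plain Python, not kernel code) of the one-element
current->bio_list that generic_make_request drains.  The names `drain` and
`frag0..fragN` are hypothetical; the model ignores recursion depth and only
shows what happens when a driver splits one large bio into fragments and
re-submits them while the list is being drained.]

```python
from collections import deque

def drain(sibling_count, add_to_head):
    """Toy model of current->bio_list in generic_make_request.

    A driver's make_request_fn splits one large bio into
    `sibling_count` fragments and re-submits each one; while
    current->bio_list is set, re-submission only queues the fragment.
    `add_to_head` mimics the proposed bio_list_add_head() one-liner,
    False mimics the existing bio_list_add() (tail add).
    """
    bio_list = deque()
    for i in range(sibling_count):           # driver submits frag0, frag1, ...
        if add_to_head:
            bio_list.appendleft(f"frag{i}")  # bio_list_add_head()
        else:
            bio_list.append(f"frag{i}")      # bio_list_add()
    processed = []
    while bio_list:                          # the drain loop pops from the head
        processed.append(bio_list.popleft())
    return processed

print(drain(3, add_to_head=False))  # ['frag0', 'frag1', 'frag2']
print(drain(3, add_to_head=True))   # ['frag2', 'frag1', 'frag0']
```

With head insertion the fragments reach the underlying device in reverse
submission order, which is exactly the sequentiality loss Lars expects to
hurt merging on devices without an elevator to repair it.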
Mike