From mboxrd@z Thu Jan  1 00:00:00 1970
Subject: Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
From: Hannes Reinecke <hare@suse.de>
To: Ming Lei
Cc: Jens Axboe, linux-block@vger.kernel.org, Hannes Reinecke, Keith Busch,
 linux-nvme@lists.infradead.org, Sagi Grimberg, Dongli Zhang, James Smart,
 Bart Van Assche, linux-scsi@vger.kernel.org, "Martin K. Petersen",
 Christoph Hellwig, "James E. J. Bottomley", jianchao wang
Date: Wed, 24 Apr 2019 07:55:02 +0200
Message-ID: <00ca949c-ba39-0b51-1a69-3852009ed911@suse.de>
In-Reply-To: <20190424014535.GD634@ming.t460p>
References: <20190417034410.31957-1-ming.lei@redhat.com>
 <20190417034410.31957-7-ming.lei@redhat.com>
 <63c09210-c20e-6b4a-56fa-b9eac2807cc1@suse.de>
 <20190417125943.GB5007@ming.t460p>
 <20190422033057.GC21375@ming.t460p>
 <9d93cb74-0683-bb42-2545-b4bb2bfbb831@suse.de>
 <20190423133015.GA31728@ming.t460p>
 <20190424011242.GB634@ming.t460p>

On 4/24/19 3:45 AM, Ming Lei wrote:
> On Wed, Apr 24, 2019 at 09:12:42AM +0800, Ming Lei wrote:
>> On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
>>> On 4/23/19 3:30 PM, Ming Lei wrote:
>>>> Hi Hannes,
>>>>
>>>> Thanks for your response.
>>>>
>>>> On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
>>>>> On 4/22/19 5:30 AM, Ming Lei wrote:
>>> [ .. ]
>>>>>>
>>>>>> Hi Hannes,
>>>>>>
>>>>>> Could you please let us know if you have a better idea for this
>>>>>> issue? Otherwise, I think we need to move on, since it is a real
>>>>>> issue and users want it fixed.
>>>>>>
>>>>> Okay. Having looked over the problem and possible alternatives, it
>>>>> does indeed look like a viable solution.
>>>>> I do agree that it's a sensible design to have an additional holding
>>>>> area for hardware context elements, given that they might be
>>>>> reassigned during blk_mq_realloc_hw_ctxs().
>>>>>
>>>>> However, I would rename the 'dead' elements to 'unused' (ie
>>>>> unused_hctx_list etc).
>>>>
>>>> OK, 'unused' does look like the better name.
>>>>
>>>>> And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make
>>>>> things more consistent.
>>>>
>>>> No, that is wrong.
>>>>
>>>> The request queue's refcount is often still held while
>>>> blk_cleanup_queue() is running, and blk_mq_exit_queue() is called
>>>> directly by blk_cleanup_queue(). One invariant is that most APIs have
>>>> to keep working as long as the request queue is live from the kobject
>>>> point of view. If q->queue_hw_ctx were freed in blk_mq_exit_queue(),
>>>> it would be quite easy to cause a use-after-free.
>>>>
>>> Ah. Thought as much.
>>> But then in most cases the ->queue_hw_ctx pointer is immaterial, as
>>> we're accessing things via the hctx pointer, which remains valid.
>>>
>>>>> The problem with the current patch is that in blk_mq_release() we
>>>>> iterate the 'dead_hctx_list' and free up everything in there, but
>>>>> then blindly call 'kfree(q->queue_hw_ctx)' without checking if there
>>>>> are any context pointers left.
>>>>
>>>> If the request queue is dead, it is safe to assume that there are no
>>>> remaining references to the request queue or to q->queue_hw_ctx.
>>>> Otherwise, it must be a bug somewhere.
>>>>
>>> Precisely.
>>> What I'm trying to achieve with this is to protect against such issues,
>>> which are quite easy to introduce given the complexity of the code ...
>>
>> But releasing q->queue_hw_ctx in blk_cleanup_queue() can easily cause a
>> use-after-free even though the request queue's refcount is held. We
>> can't simply do that.
>>
>> If someone is still trying to use q->queue_hw_ctx[] after the request
>> queue is dead, the bug is in the caller of the block layer API, not in
>> the block layer.
>>
>> What the patchset is trying to fix is the race in the block layer, not
>> in users of the block layer or in drivers. So far, I don't see such a
>> driver issue.
>>
>> Just think of q->queue_hw_ctx as the request queue's own resource, and
>> you will see it is pretty reasonable to free q->queue_hw_ctx in the
>> queue's release handler.
>>
>>>
>>>>> When moving the call to blk_mq_exit_queue() we could do a simple
>>>>> WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that
>>>>> everything got deallocated properly.
>>>>
>>>> At that time, a hctx instance might still be active, but that is fine
>>>> given that each hctx is covered by its own kobject. What we need to do
>>>> is to make sure that there are no references to q->queue_hw_ctx or to
>>>> the request queue.
>>>>
>>> My point being here:
>>>
>>> void blk_mq_release(struct request_queue *q)
>>> {
>>> 	struct blk_mq_hw_ctx *hctx, *next;
>>>
>>> 	cancel_delayed_work_sync(&q->requeue_work);
>>>
>>> 	/* all hctx are in .dead_hctx_list now */
>>> 	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
>>> 		list_del_init(&hctx->hctx_list);
>>> 		kobject_put(&hctx->kobj);
>>> 	}
>>>
>>> 	kfree(q->queue_hw_ctx);
>>>
>>> 	/*
>>> 	 * release .mq_kobj and sw queue's kobject now because
>>> 	 * both share lifetime with request queue.
>>> 	 */
>>> 	blk_mq_sysfs_deinit(q);
>>> }
>>>
>>> This assumes that _all_ hctx pointers are being removed from
>>> q->queue_hw_ctx and are moved to the 'dead' list.
>>> If for some reason this is not the case, we'll be leaking hctx
>>> pointers here.
>>
>> IMO, there are no such reasons. When blk_mq_release() is called, every
>> hctx of this request queue has been "exited" via blk_mq_exit_hctx(),
>> either from blk_mq_update_nr_hw_queues() or from blk_cleanup_queue().
>>
>> If there are hctxs not moved to the 'dead' (or 'unused') list here, it
>> is simply a bug in blk-mq. However, I don't see such a case now.
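The lifecycle the two of us are converging on above — every hctx gets parked on the queue's 'dead'/'unused' list when it is exited, and the queue's release handler later drains that list and frees the array — can be illustrated with a small userspace sketch. The list helpers and the `exit_hctx()`/`release_queue()` names below are simplified stand-ins for the kernel's <linux/list.h> and blk-mq code, not the actual implementation:

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-ins for the kernel's <linux/list.h> helpers. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/* Toy hctx: only the fields the lifetime discussion needs.
 * hctx_list is deliberately the first member so a plain cast
 * stands in for container_of(). */
struct hctx {
	struct list_head hctx_list;
	int exited;
};

/* blk_mq_exit_hctx() analogue: park the hctx on the queue's unused list. */
static void exit_hctx(struct hctx *h, struct list_head *unused_list)
{
	h->exited = 1;
	list_add(&h->hctx_list, unused_list);
}

/* blk_mq_release() analogue: drain the unused list and free each entry.
 * Returns how many hctxs were freed. */
static int release_queue(struct list_head *unused_list)
{
	int freed = 0;

	while (!list_empty(unused_list)) {
		struct hctx *h = (struct hctx *)unused_list->next;

		list_del_init(&h->hctx_list);
		free(h);
		freed++;
	}
	return freed;
}
```

In this shape the `kfree(q->queue_hw_ctx)` analogue would simply be freeing the pointer array in the same release path, after the loop has drained the list — which is the point being argued above: both the array and the parked hctxs are the queue's own resources, so both die in the release handler.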
>
> Or we can add the following check in blk_mq_release() for capturing
> the impossible blk-mq bug:
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 2ca4395f794d..f0d08087b8f6 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -2366,6 +2366,8 @@ blk_mq_alloc_hctx(struct request_queue *q,
>  	hctx->queue = q;
>  	hctx->flags = set->flags & ~BLK_MQ_F_TAG_SHARED;
>  
> +	INIT_LIST_HEAD(&hctx->hctx_list);
> +
>  	/*
>  	 * Allocate space for all possible cpus to avoid allocation at
>  	 * runtime
> @@ -2680,9 +2682,14 @@ static int blk_mq_alloc_ctxs(struct request_queue *q)
>  void blk_mq_release(struct request_queue *q)
>  {
>  	struct blk_mq_hw_ctx *hctx, *next;
> +	int i;
>  
>  	cancel_delayed_work_sync(&q->requeue_work);
>  
> +	/* warn if live hctx is found, that shouldn't happen */
> +	queue_for_each_hw_ctx(q, hctx, i)
> +		WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list));
> +
>  	/* all hctx are in .dead_hctx_list now */
>  	list_for_each_entry_safe(hctx, next, &q->dead_hctx_list, hctx_list) {
>  		list_del_init(&hctx->hctx_list);
>

Precisely; that would have been my other choice for fixing it, but I
didn't take it as I was trying to avoid the loop.

But if you agree then I'm fine with it, too. Please add this hunk to the
patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		Teamlead Storage & Networking
hare@suse.de			+49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr.
5, 90409 Nürnberg
GF: Felix Imendörffer, Mary Higgins, Sri Rasiah
HRB 21284 (AG Nürnberg)
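The WARN_ON_ONCE() in the hunk quoted in the thread above leans on a list.h invariant: INIT_LIST_HEAD() and list_del_init() both leave a node pointing at itself, so list_empty() applied to the node itself reports whether it is currently parked on any list. That is why a freshly-initialised, never-exited (i.e. still live) hctx trips the warning. A standalone sketch of that invariant, with the helpers re-implemented here purely for illustration:

```c
#include <assert.h>

/* Re-implementation of the relevant <linux/list.h> helpers. */
struct list_head { struct list_head *next, *prev; };

static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next; n->prev = h;
	h->next->prev = n; h->next = n;
}

static void list_del_init(struct list_head *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	INIT_LIST_HEAD(n);	/* node links back to itself again */
}

static int list_empty(const struct list_head *h) { return h->next == h; }

/*
 * Returns 1 iff the node is currently parked on some list -- the exact
 * property the proposed WARN_ON_ONCE(hctx && list_empty(&hctx->hctx_list))
 * inverts to spot a still-live hctx.
 */
static int parked(const struct list_head *node)
{
	return !list_empty(node);
}
```

Because list_del_init() (rather than list_del()) is used when draining the dead/unused list, the check stays safe to evaluate at any point in the hctx lifetime: initialised, parked, or already removed.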