From: Ming Lei <ming.lei@redhat.com>
To: Hannes Reinecke <hare@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Hannes Reinecke <hare@suse.com>,
	Keith Busch <keith.busch@intel.com>,
	linux-nvme@lists.infradead.org, Sagi Grimberg <sagi@grimberg.me>,
	Dongli Zhang <dongli.zhang@oracle.com>,
	James Smart <james.smart@broadcom.com>,
	Bart Van Assche <bart.vanassche@wdc.com>,
	linux-scsi@vger.kernel.org,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	Christoph Hellwig <hch@lst.de>,
	"James E . J . Bottomley" <jejb@linux.vnet.ibm.com>,
	jianchao wang <jianchao.w.wang@oracle.com>
Subject: Re: [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed
Date: Wed, 24 Apr 2019 09:12:43 +0800	[thread overview]
Message-ID: <20190424011242.GB634@ming.t460p> (raw)
In-Reply-To: <b7e1ad43-9ea1-8ecf-6e1f-97343c45b82f@suse.de>

On Tue, Apr 23, 2019 at 04:07:49PM +0200, Hannes Reinecke wrote:
> On 4/23/19 3:30 PM, Ming Lei wrote:
> > Hi Hannes,
> > 
> > Thanks for your response.
> > 
> > On Tue, Apr 23, 2019 at 01:19:33PM +0200, Hannes Reinecke wrote:
> > > On 4/22/19 5:30 AM, Ming Lei wrote:
> [ .. ]
> > > > 
> > > > Hi Hannes,
> > > > 
> > > > Could you please let us know if you have a better idea for this issue?
> > > > Otherwise, I think we need to move on since it is a real issue, and users
> > > > want it fixed.
> > > > 
> > > Okay. Having looked over the problem and possible alternatives, it looks
> > > indeed like a viable solution.
> > > I do agree that it's a sensible design to have an additional holding area
> > > for hardware context elements, given that they might be reassigned during
> > > blk_mq_realloc_hw_ctxs().
> > > 
> > > However, I would rename the 'dead' elements to 'unused' (ie unused_hctx_list
> > > etc).
> > 
> > OK, the name 'unused' does look better.
> > 
> > > 
> > > And I would deallocate q->queue_hw_ctx in blk_mq_exit_queue() to make things
> > > more consistent.
> > 
> > No, that is wrong.
> > 
> > The request queue's refcount is often still held while blk_cleanup_queue() is
> > running, and blk_mq_exit_queue() is called directly by blk_cleanup_queue(). One
> > invariant is that we have to allow most APIs to run correctly as long as the
> > request queue is live from the kobject's point of view. If q->queue_hw_ctx is
> > freed in blk_mq_exit_queue(), it is quite easy to cause a use-after-free.
> > 
> Ah. Thought as much.
> But then in most cases the ->queue_hw_ctx pointer is immaterial as we're
> accessing things via the hctx pointer, which remains valid.
> 
> > > Problem with the current patch is that in blk_mq_release() we iterate
> > > the 'dead_hctx_list' and free up everything in there, but then blindly call
> > > 'kfree(q->queue_hw_ctx)' without checking if there are any context pointers
> > > left.
> > 
> > If the request queue is dead, it is safe to assume that there isn't any
> > reference left to the request queue or q->queue_hw_ctx. Otherwise, it must
> > be a bug somewhere.
> > 
> Precisely.
> What I'm trying to achieve with this is to protect against such issues,
> which are quite easy to introduce given the complexity of the code ...

But releasing q->queue_hw_ctx in blk_cleanup_queue() can easily cause a
use-after-free even while the request queue's refcount is still held. We
can't simply do that.

If someone is still trying to use q->queue_hw_ctx[] after the request
queue is dead, the bug is in the caller of the block layer API, not in
the block layer.

What the patchset is trying to fix is the race inside the block layer,
not in users of the block layer or in drivers. So far, I don't see any
such driver issue.

Just think of q->queue_hw_ctx as the request queue's own resource, and
you will see it is pretty reasonable to free q->queue_hw_ctx in the
queue's release handler.

> 
> > > When moving the call to blk_mq_exit_queue() we could do a simple
> > > WARN_ON(q->queue_hw_ctx) in blk_mq_release() to ensure that everything got
> > > deallocated properly.
> > 
> > At that time, hctx instances might still be active, but that is fine given
> > each hctx is covered by its own kobject. What we need to do is to make sure
> > that there are no references left to q->queue_hw_ctx or the request queue.
> > 
> My point being here:
> void blk_mq_release(struct request_queue *q)
> {
>         struct blk_mq_hw_ctx *hctx, *next;
> 
>         cancel_delayed_work_sync(&q->requeue_work);
> 
>         /* all hctx are in .dead_hctx_list now */
>         list_for_each_entry_safe(hctx, next, &q->dead_hctx_list,
>                                  hctx_list) {
>                 list_del_init(&hctx->hctx_list);
>                 kobject_put(&hctx->kobj);
>         }
> 
>         kfree(q->queue_hw_ctx);
> 
>         /*
>          * release .mq_kobj and sw queue's kobject now because
>          * both share lifetime with request queue.
>          */
>         blk_mq_sysfs_deinit(q);
> }
> 
> This assumes that _all_ hctx pointers are being removed from
> q->queue_hw_ctx, and are moved to the 'dead' list.
> If for some reason this is not the case we'll be leaking hctx pointers here.

IMO, there is no such reason. When blk_mq_release() is called, every
hctx of this request queue has already been exited via blk_mq_exit_hctx(),
either from blk_mq_update_nr_hw_queues() or from blk_cleanup_queue().

If any hctx has not been moved to the 'dead' (or 'unused') list by then,
that is simply a bug in blk-mq. However, I don't see such a case now.

> And as the reference counting _really_ is tricky it would be a good idea to
> have some safety checking here (ie if all hctx pointers from ->queue_hw_ctx
> are indeed NULL) before calling kfree().
> 
> This is what I was attempting by moving the kfree() to blk_mq_exit_queue(),
> as then we could validate that we've moved all hctx pointers to the dead
> list by simply checking if ->queue_hw_ctx is NULL.

As I mentioned, that would break the current code: either NULLing the
hctx pointers or kfree()ing them early is wrong, given that we allow
APIs to run as long as the queue's refcount is held.

> 
> But just blindly calling 'kfree()' here is dangerous IMO.

How can that be dangerous? As I mentioned, if the queue's refcount isn't
held, it is simply a bug in the user to try to use 'q' or
'q->queue_hw_ctx'. That is the basic usage pattern of refcounting,
isn't it?


Thanks,
Ming

Thread overview: 50+ messages

2019-04-17  3:44 [PATCH V6 0/9] blk-mq: fix races related with freeing queue Ming Lei
2019-04-17  3:44 ` [PATCH V6 1/9] blk-mq: grab .q_usage_counter when queuing request from plug code path Ming Lei
2019-04-17  3:44 ` [PATCH V6 2/9] blk-mq: move cancel of requeue_work into blk_mq_release Ming Lei
2019-04-17 12:00   ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 3/9] blk-mq: free hw queue's resource in hctx's release handler Ming Lei
2019-04-17 12:02   ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 4/9] blk-mq: move all hctx alloction & initialization into __blk_mq_alloc_and_init_hctx Ming Lei
2019-04-17 12:03   ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 5/9] blk-mq: split blk_mq_alloc_and_init_hctx into two parts Ming Lei
2019-04-17  3:44 ` [PATCH V6 6/9] blk-mq: always free hctx after request queue is freed Ming Lei
2019-04-17 12:08   ` Hannes Reinecke
2019-04-17 12:59     ` Ming Lei
2019-04-22  3:30       ` Ming Lei
2019-04-23 11:19         ` Hannes Reinecke
2019-04-23 13:30           ` Ming Lei
2019-04-23 14:07             ` Hannes Reinecke
2019-04-24  1:12               ` Ming Lei [this message]
2019-04-24  1:45                 ` Ming Lei
2019-04-24  5:55                   ` Hannes Reinecke
2019-04-17  3:44 ` [PATCH V6 7/9] blk-mq: move cancel of hctx->run_work into blk_mq_hw_sysfs_release Ming Lei
2019-04-17  3:44 ` [PATCH V6 8/9] block: don't drain in-progress dispatch in blk_cleanup_queue() Ming Lei
2019-04-17  3:44 ` [PATCH V6 9/9] nvme: hold request queue's refcount in ns's whole lifetime Ming Lei
2019-04-17 12:10   ` Hannes Reinecke
2019-04-17 15:55   ` Keith Busch
2019-04-17 17:22 ` [PATCH V6 0/9] blk-mq: fix races related with freeing queue James Smart