All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ming Lei <ming.lei@redhat.com>
To: Bart Van Assche <bvanassche@acm.org>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Daniel Wagner <dwagner@suse.de>,
	Khazhismel Kumykov <khazhy@google.com>,
	Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	Hannes Reinecke <hare@suse.de>,
	Johannes Thumshirn <johannes.thumshirn@wdc.com>,
	John Garry <john.garry@huawei.com>
Subject: Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
Date: Thu, 22 Apr 2021 11:15:55 +0800	[thread overview]
Message-ID: <YIDqa6YkNoD5OiKN@T590> (raw)
In-Reply-To: <20210421000235.2028-4-bvanassche@acm.org>

On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
> When multiple request queues share a tag set and when switching the I/O
> scheduler for one of the request queues associated with that tag set, the
> following race can happen:
> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>   a pointer to a scheduler request to the local variable 'rq'.
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
> 
> Fix this race as follows:
> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
>   The performance impact of the assignments added to the hot path is minimal
>   (about 1% according to one particular test).
> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
>   semaphore. Which mechanism is used depends on whether or not it is allowed
>   to sleep and also on whether or not the callback function may sleep.
> * Wait for all preexisting readers to finish before freeing scheduler
>   requests.
> 
> Another race is as follows:
> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
>   request queue than the queue for which scheduler tags are being freed.
> * bt_iter() examines rq->q after *rq has been freed.
> 
> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
> lock and by calling synchronize_rcu() before freeing scheduler tags.
> 
> Multiple users have reported use-after-free complaints similar to the
> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):
> 
> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
> Read of size 8 at addr ffff88803b335240 by task fio/21412
> 
> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
> Call Trace:
>  dump_stack+0x86/0xca
>  print_address_description+0x71/0x239
>  kasan_report.cold.5+0x242/0x301
>  __asan_load8+0x54/0x90
>  bt_iter+0x86/0xf0
>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>  blk_mq_in_flight+0x96/0xb0
>  part_in_flight+0x40/0x140
>  part_round_stats+0x18e/0x370
>  blk_account_io_start+0x3d7/0x670
>  blk_mq_bio_to_request+0x19c/0x3a0
>  blk_mq_make_request+0x7a9/0xcb0
>  generic_make_request+0x41d/0x960
>  submit_bio+0x9b/0x250
>  do_blockdev_direct_IO+0x435c/0x4c70
>  __blockdev_direct_IO+0x79/0x88
>  ext4_direct_IO+0x46c/0xc00
>  generic_file_direct_write+0x119/0x210
>  __generic_file_write_iter+0x11c/0x280
>  ext4_file_write_iter+0x1b8/0x6f0
>  aio_write+0x204/0x310
>  io_submit_one+0x9d3/0xe80
>  __x64_sys_io_submit+0x115/0x340
>  do_syscall_64+0x71/0x210
> 
> Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Martin K. Petersen <martin.petersen@oracle.com>
> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Hannes Reinecke <hare@suse.de>
> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Khazhy Kumykov <khazhy@google.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>  block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
>  block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
>  block/blk-mq-tag.h |  4 +++-
>  block/blk-mq.c     | 21 +++++++++++++++----
>  block/blk-mq.h     |  1 +
>  block/blk.h        |  2 ++
>  block/elevator.c   |  1 +
>  7 files changed, 102 insertions(+), 12 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 9bcdae93f6d4..ca7f833e25a8 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
>  }
>  EXPORT_SYMBOL(blk_dump_rq_flags);
>  
> +/**
> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
> + * @set: Tag set to wait on.
> + *
> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
> + * accessing hctx->tags->rqs[]. New readers may start while this function is
> + * in progress or after this function has returned. Use this function to make
> + * sure that hctx->tags->rqs[] changes have become globally visible.
> + *
> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
> + * finish accessing requests associated with other request queues than 'q'.
> + */
> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
> +{
> +	struct blk_mq_tags *tags;
> +	int i;
> +
> +	if (set->tags) {
> +		for (i = 0; i < set->nr_hw_queues; i++) {
> +			tags = set->tags[i];
> +			if (!tags)
> +				continue;
> +			down_write(&tags->iter_rwsem);
> +			up_write(&tags->iter_rwsem);
> +		}
> +	}
> +	synchronize_rcu();
> +}
> +
>  /**
>   * blk_sync_queue - cancel any pending callbacks on a queue
>   * @q: the queue
> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
>  	 * it is safe to free requests now.
>  	 */
>  	mutex_lock(&q->sysfs_lock);
> -	if (q->elevator)
> +	if (q->elevator) {
> +		blk_mq_wait_for_tag_iter(q->tag_set);
>  		blk_mq_sched_free_requests(q);
> +	}
>  	mutex_unlock(&q->sysfs_lock);
>  
>  	percpu_ref_exit(&q->q_usage_counter);
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index d8eaa38a1bd1..39d5c9190a6b 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  
>  	if (!reserved)
>  		bitnr += tags->nr_reserved_tags;
> -	rq = tags->rqs[bitnr];
> -
> +	rcu_read_lock();
> +	/*
> +	 * The request 'rq' points at is protected by an RCU read lock until
> +	 * its queue pointer has been verified and by q_usage_count while the
> +	 * callback function is being invoked. See also the
> +	 * percpu_ref_tryget() and blk_queue_exit() calls in
> +	 * blk_mq_queue_tag_busy_iter().
> +	 */
> +	rq = rcu_dereference(tags->rqs[bitnr]);
>  	/*
>  	 * We can hit rq == NULL here, because the tagging functions
>  	 * test and set the bit before assigning ->rqs[].
>  	 */
> -	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
> +		rcu_read_unlock();
>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
> +	}
> +	rcu_read_unlock();
>  	return true;
>  }
>  
> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
>  	unsigned int flags;
>  };
>  
> +/* Include reserved tags. */
>  #define BT_TAG_ITER_RESERVED		(1 << 0)
> +/* Only include started requests. */
>  #define BT_TAG_ITER_STARTED		(1 << 1)
> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
>  #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
> +/* The callback function may sleep. */
> +#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
>  
> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
> +			   void *data)
>  {
>  	struct bt_tags_iter_data *iter_data = data;
>  	struct blk_mq_tags *tags = iter_data->tags;
> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>  		rq = tags->static_rqs[bitnr];
>  	else
> -		rq = tags->rqs[bitnr];
> +		rq = rcu_dereference_check(tags->rqs[bitnr],
> +					   lockdep_is_held(&tags->iter_rwsem));
>  	if (!rq)
>  		return true;
>  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>  	return iter_data->fn(rq, iter_data->data, reserved);
>  }
>  
> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> +{
> +	struct bt_tags_iter_data *iter_data = data;
> +	struct blk_mq_tags *tags = iter_data->tags;
> +	bool res;
> +
> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
> +		down_read(&tags->iter_rwsem);
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		up_read(&tags->iter_rwsem);
> +	} else {
> +		rcu_read_lock();
> +		res = __bt_tags_iter(bitmap, bitnr, data);
> +		rcu_read_unlock();
> +	}
> +
> +	return res;
> +}

Holding one rwsem or rcu read lock won't avoid the issue completely
because request may be completed remotely in iter_data->fn(), such as
nbd_clear_req(), nvme_cancel_request(), complete_all_cmds_iter(),
mtip_no_dev_cleanup(), because blk_mq_complete_request() may complete
request in softirq, remote IPI, even wq, and the request is still
referenced in these contexts after bt_tags_iter() returns.


Thanks,
Ming


  parent reply	other threads:[~2021-04-22  3:16 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-21  0:02 [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 1/5] blk-mq: Move the elevator_exit() definition Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 2/5] blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests Bart Van Assche
2021-04-22  2:25   ` Ming Lei
2021-04-22  4:01     ` Bart Van Assche
2021-04-22  7:23       ` Ming Lei
2021-04-22  3:15   ` Ming Lei [this message]
2021-04-22  3:54     ` Bart Van Assche
2021-04-22  7:13       ` Ming Lei
2021-04-22 15:51         ` Bart Van Assche
2021-04-23  3:52           ` Ming Lei
2021-04-23 17:52             ` Bart Van Assche
2021-04-25  0:09               ` Ming Lei
2021-04-25 21:01                 ` Bart Van Assche
2021-04-26  0:55                   ` Ming Lei
2021-04-26 16:29                 ` Bart Van Assche
2021-04-27  0:11                   ` Ming Lei
2021-04-21  0:02 ` [PATCH v7 4/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 5/5] blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags Bart Van Assche
2021-04-21 14:40 ` [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YIDqa6YkNoD5OiKN@T590 \
    --to=ming.lei@redhat.com \
    --cc=axboe@kernel.dk \
    --cc=bvanassche@acm.org \
    --cc=dwagner@suse.de \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=johannes.thumshirn@wdc.com \
    --cc=john.garry@huawei.com \
    --cc=khazhy@google.com \
    --cc=linux-block@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.