All of lore.kernel.org
 help / color / mirror / Atom feed
From: Bart Van Assche <bvanassche@acm.org>
To: Ming Lei <ming.lei@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>,
	linux-block@vger.kernel.org, Christoph Hellwig <hch@lst.de>,
	Daniel Wagner <dwagner@suse.de>,
	Khazhismel Kumykov <khazhy@google.com>,
	Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>,
	"Martin K . Petersen" <martin.petersen@oracle.com>,
	Hannes Reinecke <hare@suse.de>,
	Johannes Thumshirn <johannes.thumshirn@wdc.com>,
	John Garry <john.garry@huawei.com>
Subject: Re: [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests
Date: Wed, 21 Apr 2021 21:01:50 -0700	[thread overview]
Message-ID: <67442b28-3b08-1704-fff3-6f0e6a23465e@acm.org> (raw)
In-Reply-To: <YIDegAPz+17BA/9x@T590>

On 4/21/21 7:25 PM, Ming Lei wrote:
> On Tue, Apr 20, 2021 at 05:02:33PM -0700, Bart Van Assche wrote:
>> When multiple request queues share a tag set and when switching the I/O
>> scheduler for one of the request queues associated with that tag set, the
>> following race can happen:
>> * blk_mq_tagset_busy_iter() calls bt_tags_iter() and bt_tags_iter() assigns
>>   a pointer to a scheduler request to the local variable 'rq'.
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_tagset_busy_iter() dereferences 'rq' and triggers a use-after-free.
>>
>> Fix this race as follows:
>> * Use rcu_assign_pointer() and rcu_dereference() to access hctx->tags->rqs[].
>>   The performance impact of the assignments added to the hot path is minimal
>>   (about 1% according to one particular test).
>> * Protect hctx->tags->rqs[] reads with an RCU read-side lock or with a
>>   semaphore. Which mechanism is used depends on whether or not it is allowed
>>   to sleep and also on whether or not the callback function may sleep.
>> * Wait for all preexisting readers to finish before freeing scheduler
>>   requests.
>>
>> Another race is as follows:
>> * blk_mq_sched_free_requests() is called to free hctx->sched_tags.
>> * blk_mq_queue_tag_busy_iter() iterates over the same tag set but for another
>>   request queue than the queue for which scheduler tags are being freed.
>> * bt_iter() examines rq->q after *rq has been freed.
>>
>> Fix this race by protecting the rq->q read in bt_iter() with an RCU read
>> lock and by calling synchronize_rcu() before freeing scheduler tags.
>>
>> Multiple users have reported use-after-free complaints similar to the
>> following (from https://lore.kernel.org/linux-block/1545261885.185366.488.camel@acm.org/ ):
>>
>> BUG: KASAN: use-after-free in bt_iter+0x86/0xf0
>> Read of size 8 at addr ffff88803b335240 by task fio/21412
>>
>> CPU: 0 PID: 21412 Comm: fio Tainted: G        W         4.20.0-rc6-dbg+ #3
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
>> Call Trace:
>>  dump_stack+0x86/0xca
>>  print_address_description+0x71/0x239
>>  kasan_report.cold.5+0x242/0x301
>>  __asan_load8+0x54/0x90
>>  bt_iter+0x86/0xf0
>>  blk_mq_queue_tag_busy_iter+0x373/0x5e0
>>  blk_mq_in_flight+0x96/0xb0
>>  part_in_flight+0x40/0x140
>>  part_round_stats+0x18e/0x370
>>  blk_account_io_start+0x3d7/0x670
>>  blk_mq_bio_to_request+0x19c/0x3a0
>>  blk_mq_make_request+0x7a9/0xcb0
>>  generic_make_request+0x41d/0x960
>>  submit_bio+0x9b/0x250
>>  do_blockdev_direct_IO+0x435c/0x4c70
>>  __blockdev_direct_IO+0x79/0x88
>>  ext4_direct_IO+0x46c/0xc00
>>  generic_file_direct_write+0x119/0x210
>>  __generic_file_write_iter+0x11c/0x280
>>  ext4_file_write_iter+0x1b8/0x6f0
>>  aio_write+0x204/0x310
>>  io_submit_one+0x9d3/0xe80
>>  __x64_sys_io_submit+0x115/0x340
>>  do_syscall_64+0x71/0x210
>>
>> Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
>> Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> Cc: Christoph Hellwig <hch@lst.de>
>> Cc: Martin K. Petersen <martin.petersen@oracle.com>
>> Cc: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
>> Cc: Ming Lei <ming.lei@redhat.com>
>> Cc: Hannes Reinecke <hare@suse.de>
>> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
>> Cc: John Garry <john.garry@huawei.com>
>> Cc: Khazhy Kumykov <khazhy@google.com>
>> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
>> ---
>>  block/blk-core.c   | 34 ++++++++++++++++++++++++++++++-
>>  block/blk-mq-tag.c | 51 ++++++++++++++++++++++++++++++++++++++++------
>>  block/blk-mq-tag.h |  4 +++-
>>  block/blk-mq.c     | 21 +++++++++++++++----
>>  block/blk-mq.h     |  1 +
>>  block/blk.h        |  2 ++
>>  block/elevator.c   |  1 +
>>  7 files changed, 102 insertions(+), 12 deletions(-)
>>
>> diff --git a/block/blk-core.c b/block/blk-core.c
>> index 9bcdae93f6d4..ca7f833e25a8 100644
>> --- a/block/blk-core.c
>> +++ b/block/blk-core.c
>> @@ -279,6 +279,36 @@ void blk_dump_rq_flags(struct request *rq, char *msg)
>>  }
>>  EXPORT_SYMBOL(blk_dump_rq_flags);
>>  
>> +/**
>> + * blk_mq_wait_for_tag_iter - wait for preexisting tag iteration functions to finish
>> + * @set: Tag set to wait on.
>> + *
>> + * Waits for preexisting calls of blk_mq_all_tag_iter(),
>> + * blk_mq_tagset_busy_iter() and also for their atomic variants to finish
>> + * accessing hctx->tags->rqs[]. New readers may start while this function is
>> + * in progress or after this function has returned. Use this function to make
>> + * sure that hctx->tags->rqs[] changes have become globally visible.
>> + *
>> + * Waits for preexisting blk_mq_queue_tag_busy_iter(q, fn, priv) calls to
>> + * finish accessing requests associated with other request queues than 'q'.
>> + */
>> +void blk_mq_wait_for_tag_iter(struct blk_mq_tag_set *set)
>> +{
>> +	struct blk_mq_tags *tags;
>> +	int i;
>> +
>> +	if (set->tags) {
>> +		for (i = 0; i < set->nr_hw_queues; i++) {
>> +			tags = set->tags[i];
>> +			if (!tags)
>> +				continue;
>> +			down_write(&tags->iter_rwsem);
>> +			up_write(&tags->iter_rwsem);
>> +		}
>> +	}
>> +	synchronize_rcu();
>> +}
>> +
>>  /**
>>   * blk_sync_queue - cancel any pending callbacks on a queue
>>   * @q: the queue
>> @@ -412,8 +442,10 @@ void blk_cleanup_queue(struct request_queue *q)
>>  	 * it is safe to free requests now.
>>  	 */
>>  	mutex_lock(&q->sysfs_lock);
>> -	if (q->elevator)
>> +	if (q->elevator) {
>> +		blk_mq_wait_for_tag_iter(q->tag_set);
>>  		blk_mq_sched_free_requests(q);
>> +	}
>>  	mutex_unlock(&q->sysfs_lock);
>>  
>>  	percpu_ref_exit(&q->q_usage_counter);
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d8eaa38a1bd1..39d5c9190a6b 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -209,14 +209,24 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  
>>  	if (!reserved)
>>  		bitnr += tags->nr_reserved_tags;
>> -	rq = tags->rqs[bitnr];
>> -
>> +	rcu_read_lock();
>> +	/*
>> +	 * The request 'rq' points at is protected by an RCU read lock until
>> +	 * its queue pointer has been verified and by q_usage_count while the
>> +	 * callback function is being invoked. See also the
>> +	 * percpu_ref_tryget() and blk_queue_exit() calls in
>> +	 * blk_mq_queue_tag_busy_iter().
>> +	 */
>> +	rq = rcu_dereference(tags->rqs[bitnr]);
>>  	/*
>>  	 * We can hit rq == NULL here, because the tagging functions
>>  	 * test and set the bit before assigning ->rqs[].
>>  	 */
>> -	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx)
>> +	if (rq && rq->q == hctx->queue && rq->mq_hctx == hctx) {
>> +		rcu_read_unlock();
>>  		return iter_data->fn(hctx, rq, iter_data->data, reserved);
>> +	}
>> +	rcu_read_unlock();
>>  	return true;
>>  }
>>  
>> @@ -254,11 +264,17 @@ struct bt_tags_iter_data {
>>  	unsigned int flags;
>>  };
>>  
>> +/* Include reserved tags. */
>>  #define BT_TAG_ITER_RESERVED		(1 << 0)
>> +/* Only include started requests. */
>>  #define BT_TAG_ITER_STARTED		(1 << 1)
>> +/* Iterate over tags->static_rqs[] instead of tags->rqs[]. */
>>  #define BT_TAG_ITER_STATIC_RQS		(1 << 2)
>> +/* The callback function may sleep. */
>> +#define BT_TAG_ITER_MAY_SLEEP		(1 << 3)
>>  
>> -static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +static bool __bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr,
>> +			   void *data)
>>  {
>>  	struct bt_tags_iter_data *iter_data = data;
>>  	struct blk_mq_tags *tags = iter_data->tags;
>> @@ -275,7 +291,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	if (iter_data->flags & BT_TAG_ITER_STATIC_RQS)
>>  		rq = tags->static_rqs[bitnr];
>>  	else
>> -		rq = tags->rqs[bitnr];
>> +		rq = rcu_dereference_check(tags->rqs[bitnr],
>> +					   lockdep_is_held(&tags->iter_rwsem));
>>  	if (!rq)
>>  		return true;
>>  	if ((iter_data->flags & BT_TAG_ITER_STARTED) &&
>> @@ -284,6 +301,25 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>>  	return iter_data->fn(rq, iter_data->data, reserved);
>>  }
>>  
>> +static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> +{
>> +	struct bt_tags_iter_data *iter_data = data;
>> +	struct blk_mq_tags *tags = iter_data->tags;
>> +	bool res;
>> +
>> +	if (iter_data->flags & BT_TAG_ITER_MAY_SLEEP) {
>> +		down_read(&tags->iter_rwsem);
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		up_read(&tags->iter_rwsem);
>> +	} else {
>> +		rcu_read_lock();
>> +		res = __bt_tags_iter(bitmap, bitnr, data);
>> +		rcu_read_unlock();
>> +	}
>> +
>> +	return res;
>> +}
>> +
>>  /**
>>   * bt_tags_for_each - iterate over the requests in a tag map
>>   * @tags:	Tag map to iterate over.
>> @@ -357,10 +393,12 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>>  {
>>  	int i;
>>  
>> +	might_sleep();
>> +
>>  	for (i = 0; i < tagset->nr_hw_queues; i++) {
>>  		if (tagset->tags && tagset->tags[i])
>>  			__blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>> -					      BT_TAG_ITER_STARTED);
>> +				BT_TAG_ITER_STARTED | BT_TAG_ITER_MAY_SLEEP);
>>  	}
>>  }
>>  EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
>> @@ -544,6 +582,7 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
>>  
>>  	tags->nr_tags = total_tags;
>>  	tags->nr_reserved_tags = reserved_tags;
>> +	init_rwsem(&tags->iter_rwsem);
>>  
>>  	if (blk_mq_is_sbitmap_shared(flags))
>>  		return tags;
>> diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
>> index 0290c308ece9..d1d73d7cc7df 100644
>> --- a/block/blk-mq-tag.h
>> +++ b/block/blk-mq-tag.h
>> @@ -17,9 +17,11 @@ struct blk_mq_tags {
>>  	struct sbitmap_queue __bitmap_tags;
>>  	struct sbitmap_queue __breserved_tags;
>>  
>> -	struct request **rqs;
>> +	struct request __rcu **rqs;
>>  	struct request **static_rqs;
>>  	struct list_head page_list;
>> +
>> +	struct rw_semaphore iter_rwsem;
>>  };
>>  
>>  extern struct blk_mq_tags *blk_mq_init_tags(unsigned int nr_tags,
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 79c01b1f885c..8b59f6b4ec8e 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -496,8 +496,10 @@ static void __blk_mq_free_request(struct request *rq)
>>  	blk_crypto_free_request(rq);
>>  	blk_pm_mark_last_busy(rq);
>>  	rq->mq_hctx = NULL;
>> -	if (rq->tag != BLK_MQ_NO_TAG)
>> +	if (rq->tag != BLK_MQ_NO_TAG) {
>>  		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
>> +		rcu_assign_pointer(hctx->tags->rqs[rq->tag], NULL);
> 
> After the tag is released, it can be re-allocated from another context
> immediately. And the above rcu_assign_pointer(NULL) could be run after
> the re-allocation and new ->rqs[rq->tag] assignment in submission path,
> is it one issue?

How about swapping the rcu_assign_pointer() and blk_mq_put_tag() calls?
That should be safe since the tag iteration functions check whether or
not hctx->tags->rqs[] is NULL.

There are already barriers in sbitmap_queue_clear() so I don't think
that any new barriers would need to be added.

Thanks,

Bart.


  reply	other threads:[~2021-04-22  4:01 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-21  0:02 [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 1/5] blk-mq: Move the elevator_exit() definition Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 2/5] blk-mq: Introduce atomic variants of blk_mq_(all_tag|tagset_busy)_iter Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 3/5] blk-mq: Fix races between iterating over requests and freeing requests Bart Van Assche
2021-04-22  2:25   ` Ming Lei
2021-04-22  4:01     ` Bart Van Assche [this message]
2021-04-22  7:23       ` Ming Lei
2021-04-22  3:15   ` Ming Lei
2021-04-22  3:54     ` Bart Van Assche
2021-04-22  7:13       ` Ming Lei
2021-04-22 15:51         ` Bart Van Assche
2021-04-23  3:52           ` Ming Lei
2021-04-23 17:52             ` Bart Van Assche
2021-04-25  0:09               ` Ming Lei
2021-04-25 21:01                 ` Bart Van Assche
2021-04-26  0:55                   ` Ming Lei
2021-04-26 16:29                 ` Bart Van Assche
2021-04-27  0:11                   ` Ming Lei
2021-04-21  0:02 ` [PATCH v7 4/5] blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list Bart Van Assche
2021-04-21  0:02 ` [PATCH v7 5/5] blk-mq: Fix races between blk_mq_update_nr_hw_queues() and iterating over tags Bart Van Assche
2021-04-21 14:40 ` [PATCH v7 0/5] blk-mq: Fix a race between iterating over requests and freeing requests Jens Axboe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=67442b28-3b08-1704-fff3-6f0e6a23465e@acm.org \
    --to=bvanassche@acm.org \
    --cc=axboe@kernel.dk \
    --cc=dwagner@suse.de \
    --cc=hare@suse.de \
    --cc=hch@lst.de \
    --cc=johannes.thumshirn@wdc.com \
    --cc=john.garry@huawei.com \
    --cc=khazhy@google.com \
    --cc=linux-block@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=ming.lei@redhat.com \
    --cc=shinichiro.kawasaki@wdc.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.