All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH for-5.18 0/3] block: throttle related fixes
@ 2022-03-18 13:01 Ming Lei
  2022-03-18 13:01 ` [PATCH 1/3] block: avoid use-after-free on throttle data Ming Lei
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Ming Lei @ 2022-03-18 13:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Tejun Heo, Yu Kuai, Christoph Hellwig, Ming Lei

Hi,

The 1st and 2nd patches fix use-after-free on request queue and
throttle data.

The 3rd patch speeds up throttling after disk is deleted, so long
wait time is avoided.


Ming Lei (2):
  block: avoid use-after-free on throttle data
  block: let blkcg_gq grab request queue's refcnt

Yu Kuai (1):
  block: cancel all throttled bios in del_gendisk()

 block/blk-cgroup.c   |  5 +++++
 block/blk-throttle.c | 48 ++++++++++++++++++++++++++++++++++++++++++--
 block/blk-throttle.h |  3 +++
 block/genhd.c        |  3 +++
 4 files changed, 57 insertions(+), 2 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH 1/3] block: avoid use-after-free on throttle data
  2022-03-18 13:01 [PATCH for-5.18 0/3] block: throttle related fixes Ming Lei
@ 2022-03-18 13:01 ` Ming Lei
  2022-03-18 13:01 ` [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt Ming Lei
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-03-18 13:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Tejun Heo, Yu Kuai, Christoph Hellwig, Ming Lei

In throtl_pending_timer_fn(), request queue is retrieved from throttle
data. And tg's pending timer is deleted synchronously when releasing the
associated blkg, at that time, throttle data may have been freed since
commit 1059699f87eb ("block: move blkcg initialization/destroy into disk
allocation/release handler") moves freeing q->td to disk_release() from
blk_release_queue(). So use-after-free on q->td in throtl_pending_timer_fn
can be triggered.

Fixes the issue by:

- do nothing in case that disk is released, when there isn't any bio to
  dispatch

- retrieve request queue from blkg instead of throttle data for
non top-level pending timer.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-throttle.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index a3b3ebc72dd4..880701d0106f 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -1137,12 +1137,22 @@ static void throtl_pending_timer_fn(struct timer_list *t)
 	struct throtl_service_queue *sq = from_timer(sq, t, pending_timer);
 	struct throtl_grp *tg = sq_to_tg(sq);
 	struct throtl_data *td = sq_to_td(sq);
-	struct request_queue *q = td->queue;
 	struct throtl_service_queue *parent_sq;
+	struct request_queue *q;
 	bool dispatched;
 	int ret;
 
+	/* throtl_data may be gone, so figure out request queue by blkg */
+	if (tg)
+		q = tg->pd.blkg->q;
+	else
+		q = td->queue;
+
 	spin_lock_irq(&q->queue_lock);
+
+	if (!q->root_blkg)
+		goto out_unlock;
+
 	if (throtl_can_upgrade(td, NULL))
 		throtl_upgrade_state(td);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-18 13:01 [PATCH for-5.18 0/3] block: throttle related fixes Ming Lei
  2022-03-18 13:01 ` [PATCH 1/3] block: avoid use-after-free on throttle data Ming Lei
@ 2022-03-18 13:01 ` Ming Lei
  2022-03-22  9:33   ` Christoph Hellwig
  2022-04-20  1:46   ` Williams, Dan J
  2022-03-18 13:01 ` [PATCH 3/3] block: cancel all throttled bios in del_gendisk() Ming Lei
  2022-03-18 15:58 ` [PATCH for-5.18 0/3] block: throttle related fixes Jens Axboe
  3 siblings, 2 replies; 13+ messages in thread
From: Ming Lei @ 2022-03-18 13:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Tejun Heo, Yu Kuai, Christoph Hellwig, Ming Lei

In the whole lifetime of blkcg_gq instance, ->q will be referred, such
as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
may touch the request queue via &tg->service_queue.pending_timer which
is handled by throtl_pending_timer_fn(), so it is reasonable to grab
request queue's refcnt by blkcg_gq instance.

Previously blkcg_exit_queue() is called from blk_release_queue, and it
is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
move blkcg initialization/destroy into disk allocation/release handler")
is merged to for-5.18/block, it becomes simple to fix the issue by simply
grabbing request queue's refcnt.

Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index fa063c6c0338..d53b0d69dd73 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -82,6 +82,8 @@ static void blkg_free(struct blkcg_gq *blkg)
 		if (blkg->pd[i])
 			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
 
+	if (blkg->q)
+		blk_put_queue(blkg->q);
 	free_percpu(blkg->iostat_cpu);
 	percpu_ref_exit(&blkg->refcnt);
 	kfree(blkg);
@@ -167,6 +169,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
 	if (!blkg->iostat_cpu)
 		goto err_free;
 
+	if (!blk_get_queue(q))
+		goto err_free;
+
 	blkg->q = q;
 	INIT_LIST_HEAD(&blkg->q_node);
 	spin_lock_init(&blkg->async_bio_lock);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/3] block: cancel all throttled bios in del_gendisk()
  2022-03-18 13:01 [PATCH for-5.18 0/3] block: throttle related fixes Ming Lei
  2022-03-18 13:01 ` [PATCH 1/3] block: avoid use-after-free on throttle data Ming Lei
  2022-03-18 13:01 ` [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt Ming Lei
@ 2022-03-18 13:01 ` Ming Lei
  2022-03-18 15:58 ` [PATCH for-5.18 0/3] block: throttle related fixes Jens Axboe
  3 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-03-18 13:01 UTC (permalink / raw)
  To: Jens Axboe; +Cc: linux-block, Tejun Heo, Yu Kuai, Christoph Hellwig, Ming Lei

From: Yu Kuai <yukuai3@huawei.com>

Throttled bios can't be issued after del_gendisk() is done, thus
it's better to cancel them immediately rather than waiting for
throttle is done.

For example, if user thread is throttled with low bps while it's
issuing large io, and the device is deleted. The user thread will
wait for a long time for io to return.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-throttle.c | 36 +++++++++++++++++++++++++++++++++++-
 block/blk-throttle.h |  3 +++
 block/genhd.c        |  3 +++
 3 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 880701d0106f..469c483719be 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -874,7 +874,8 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
 	       bio != throtl_peek_queued(&tg->service_queue.queued[rw]));
 
 	/* If tg->bps = -1, then BW is unlimited */
-	if (bps_limit == U64_MAX && iops_limit == UINT_MAX) {
+	if ((bps_limit == U64_MAX && iops_limit == UINT_MAX) ||
+	    tg->flags & THROTL_TG_CANCELING) {
 		if (wait)
 			*wait = 0;
 		return true;
@@ -1776,6 +1777,39 @@ static bool throtl_hierarchy_can_upgrade(struct throtl_grp *tg)
 	return false;
 }
 
+void blk_throtl_cancel_bios(struct request_queue *q)
+{
+	struct cgroup_subsys_state *pos_css;
+	struct blkcg_gq *blkg;
+
+	spin_lock_irq(&q->queue_lock);
+	/*
+	 * queue_lock is held, rcu lock is not needed here technically.
+	 * However, rcu lock is still held to emphasize that following
+	 * path need RCU protection and to prevent warning from lockdep.
+	 */
+	rcu_read_lock();
+	blkg_for_each_descendant_post(blkg, pos_css, q->root_blkg) {
+		struct throtl_grp *tg = blkg_to_tg(blkg);
+		struct throtl_service_queue *sq = &tg->service_queue;
+
+		/*
+		 * Set the flag to make sure throtl_pending_timer_fn() won't
+		 * stop until all throttled bios are dispatched.
+		 */
+		blkg_to_tg(blkg)->flags |= THROTL_TG_CANCELING;
+		/*
+		 * Update disptime after setting the above flag to make sure
+		 * throtl_select_dispatch() won't exit without dispatching.
+		 */
+		tg_update_disptime(tg);
+
+		throtl_schedule_pending_timer(sq, jiffies + 1);
+	}
+	rcu_read_unlock();
+	spin_unlock_irq(&q->queue_lock);
+}
+
 static bool throtl_can_upgrade(struct throtl_data *td,
 	struct throtl_grp *this_tg)
 {
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index b23a9f3abb82..c1b602996127 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -56,6 +56,7 @@ enum tg_state_flags {
 	THROTL_TG_PENDING	= 1 << 0,	/* on parent's pending tree */
 	THROTL_TG_WAS_EMPTY	= 1 << 1,	/* bio_lists[] became non-empty */
 	THROTL_TG_HAS_IOPS_LIMIT = 1 << 2,	/* tg has iops limit */
+	THROTL_TG_CANCELING	= 1 << 3,	/* starts to cancel bio */
 };
 
 enum {
@@ -162,11 +163,13 @@ static inline int blk_throtl_init(struct request_queue *q) { return 0; }
 static inline void blk_throtl_exit(struct request_queue *q) { }
 static inline void blk_throtl_register_queue(struct request_queue *q) { }
 static inline bool blk_throtl_bio(struct bio *bio) { return false; }
+static inline void blk_throtl_cancel_bios(struct request_queue *q) { }
 #else /* CONFIG_BLK_DEV_THROTTLING */
 int blk_throtl_init(struct request_queue *q);
 void blk_throtl_exit(struct request_queue *q);
 void blk_throtl_register_queue(struct request_queue *q);
 bool __blk_throtl_bio(struct bio *bio);
+void blk_throtl_cancel_bios(struct request_queue *q);
 static inline bool blk_throtl_bio(struct bio *bio)
 {
 	struct throtl_grp *tg = blkg_to_tg(bio->bi_blkg);
diff --git a/block/genhd.c b/block/genhd.c
index 56f66c6fee94..8244f09e058d 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -25,6 +25,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/badblocks.h>
 #include <linux/part_stat.h>
+#include "blk-throttle.h"
 
 #include "blk.h"
 #include "blk-mq-sched.h"
@@ -627,6 +628,8 @@ void del_gendisk(struct gendisk *disk)
 
 	blk_mq_freeze_queue_wait(q);
 
+	blk_throtl_cancel_bios(disk->queue);
+
 	blk_sync_queue(q);
 	blk_flush_integrity();
 	/*
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH for-5.18 0/3] block: throttle related fixes
  2022-03-18 13:01 [PATCH for-5.18 0/3] block: throttle related fixes Ming Lei
                   ` (2 preceding siblings ...)
  2022-03-18 13:01 ` [PATCH 3/3] block: cancel all throttled bios in del_gendisk() Ming Lei
@ 2022-03-18 15:58 ` Jens Axboe
  3 siblings, 0 replies; 13+ messages in thread
From: Jens Axboe @ 2022-03-18 15:58 UTC (permalink / raw)
  To: Ming Lei; +Cc: Tejun Heo, linux-block, Christoph Hellwig, Yu Kuai

On Fri, 18 Mar 2022 21:01:41 +0800, Ming Lei wrote:
> The 1st and 2nd patches fix use-after-free on request queue and
> throttle data.
> 
> The 3rd patch speeds up throttling after disk is deleted, so long
> wait time is avoided.
> 
> 
> [...]

Applied, thanks!

[1/3] block: avoid use-after-free on throttle data
      commit: ee37eddbfa9e0401f13a01691cf4bbbacd2d16c9
[2/3] block: let blkcg_gq grab request queue's refcnt
      commit: 0a9a25ca78437b39e691bcc3dc8240455b803d8d
[3/3] block: cancel all throttled bios in del_gendisk()
      commit: 8f9e7b65f833cb9a4b2e2f54a049d74df394d906

Best regards,
-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-18 13:01 ` [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt Ming Lei
@ 2022-03-22  9:33   ` Christoph Hellwig
  2022-03-22 10:23     ` Ming Lei
  2022-04-20  1:46   ` Williams, Dan J
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-03-22  9:33 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Tejun Heo, Yu Kuai, Christoph Hellwig

On Fri, Mar 18, 2022 at 09:01:43PM +0800, Ming Lei wrote:
> In the whole lifetime of blkcg_gq instance, ->q will be referred, such
> as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
> may touch the request queue via &tg->service_queue.pending_timer which
> is handled by throtl_pending_timer_fn(), so it is reasonable to grab
> request queue's refcnt by blkcg_gq instance.
> 
> Previously blkcg_exit_queue() is called from blk_release_queue, and it
> is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
> move blkcg initialization/destroy into disk allocation/release handler")
> is merged to for-5.18/block, it becomes simple to fix the issue by simply
> grabbing request queue's refcnt.
> 
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-cgroup.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fa063c6c0338..d53b0d69dd73 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -82,6 +82,8 @@ static void blkg_free(struct blkcg_gq *blkg)
>  		if (blkg->pd[i])
>  			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>  
> +	if (blkg->q)
> +		blk_put_queue(blkg->q);

blkg_free can be called from RCU context, while blk_put_queue must
not be called from RCU context.  This causes regular splats when running
xfstests like:

[  693.342774] 
[  693.343585] =============================
[  693.345338] [ BUG: Invalid wait context ]
[  693.347109] 5.17.0-rc7+ #1276 Not tainted
[  693.348870] -----------------------------
[  693.350668] 056/104807 is trying to lock:
[  693.352463] ffff88810716f240 (&q->debugfs_mutex){+.+.}-{3:3}, at: blk_trace_shutdown+0x170
[  693.356286] other info that might help us debug this:
[  693.359555] context-{2:2}
[  693.361215] 4 locks held by 056/104807:
[  693.363553]  #0: ffff88810b4ca378 (&sig->cred_guard_mutex){+.+.}-{3:3}, at: bprm_execve+00
[  693.367984]  #1: ffff88810b4ca410 (&sig->exec_update_lock){++++}-{3:3}, at: begin_new_exe0
[  693.372036]  #2: ffff888100936968 (&mm->mmap_lock#2){++++}-{3:3}, at: exit_mmap+0x59/0x250
[  693.375654]  #3: ffffffff83db8360 (rcu_callback){....}-{0:0}, at: rcu_core+0x30e/0x910
[  693.378330] stack backtrace:
[  693.379262] CPU: 0 PID: 104807 Comm: 056 Not tainted 5.17.0-rc7+ #1276
[  693.381271] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/204
[  693.384177] Call Trace:
[  693.384954]  <IRQ>
[  693.385576]  dump_stack_lvl+0x45/0x59
[  693.386485]  __lock_acquire.cold+0x2a2/0x2be
[  693.387533]  ? mark_held_locks+0x49/0x70
[  693.388488]  lock_acquire+0xc9/0x2f0
[  693.389419]  ? blk_trace_shutdown+0x17/0x90
[  693.390525]  ? __slab_free+0x296/0x440
[  693.391451]  ? lockdep_hardirqs_on+0x79/0x100
[  693.392496]  __mutex_lock+0x71/0x8e0
[  693.393359]  ? blk_trace_shutdown+0x17/0x90
[  693.394368]  ? blk_trace_shutdown+0x17/0x90
[  693.395201]  ? lock_is_held_type+0xd7/0x130
[  693.396093]  ? rcu_read_lock_sched_held+0x3a/0x70
[  693.397179]  blk_trace_shutdown+0x17/0x90
[  693.398337]  blk_release_queue+0x85/0x100
[  693.399460]  kobject_put+0x63/0xc0
[  693.400397]  ? rcu_core+0x30e/0x910
[  693.401311]  blkg_free.part.0+0x3c/0x60
[  693.402425]  rcu_core+0x346/0x910
[  693.403430]  ? rcu_core+0x30e/0x910
[  693.404414]  __do_softirq+0x16e/0x4fe
[  693.405457]  __irq_exit_rcu+0xd1/0x120
[  693.406501]  irq_exit_rcu+0x5/0x20
[  693.407464]  sysvec_apic_timer_interrupt+0xa2/0xd0
[  693.408796]  </IRQ>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-22  9:33   ` Christoph Hellwig
@ 2022-03-22 10:23     ` Ming Lei
  2022-03-22 16:45       ` Tejun Heo
  0 siblings, 1 reply; 13+ messages in thread
From: Ming Lei @ 2022-03-22 10:23 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jens Axboe, linux-block, Tejun Heo, Yu Kuai

On Tue, Mar 22, 2022 at 10:33:22AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 18, 2022 at 09:01:43PM +0800, Ming Lei wrote:
> > In the whole lifetime of blkcg_gq instance, ->q will be referred, such
> > as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
> > may touch the request queue via &tg->service_queue.pending_timer which
> > is handled by throtl_pending_timer_fn(), so it is reasonable to grab
> > request queue's refcnt by blkcg_gq instance.
> > 
> > Previously blkcg_exit_queue() is called from blk_release_queue, and it
> > is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
> > move blkcg initialization/destroy into disk allocation/release handler")
> > is merged to for-5.18/block, it becomes simple to fix the issue by simply
> > grabbing request queue's refcnt.
> > 
> > Reported-by: Christoph Hellwig <hch@lst.de>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-cgroup.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index fa063c6c0338..d53b0d69dd73 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -82,6 +82,8 @@ static void blkg_free(struct blkcg_gq *blkg)
> >  		if (blkg->pd[i])
> >  			blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
> >  
> > +	if (blkg->q)
> > +		blk_put_queue(blkg->q);
> 
> blkg_free can be called from RCU context, while blk_put_queue must
> not be called from RCU context.  This causes regular splats when running
> xfstests like:

Thanks for the report.

One solution is to delay 'blk_put_queue(blkg->q)' and 'kfree(blkg)'
into one work function by reusing blkg->async_bio_work as release_work.

I will prepare one patch for addressing the issue.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-22 10:23     ` Ming Lei
@ 2022-03-22 16:45       ` Tejun Heo
  2022-03-23  0:32         ` Ming Lei
  0 siblings, 1 reply; 13+ messages in thread
From: Tejun Heo @ 2022-03-22 16:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai

On Tue, Mar 22, 2022 at 06:23:34PM +0800, Ming Lei wrote:
> One solution is to delay 'blk_put_queue(blkg->q)' and 'kfree(blkg)'
> into one work function by reusing blkg->async_bio_work as release_work.
> 
> I will prepare one patch for addressing the issue.

Ah, so, this is the report. Can you please include the backtrace and
reference to the patch you posted?

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-22 16:45       ` Tejun Heo
@ 2022-03-23  0:32         ` Ming Lei
  0 siblings, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-03-23  0:32 UTC (permalink / raw)
  To: Tejun Heo; +Cc: Christoph Hellwig, Jens Axboe, linux-block, Yu Kuai

On Tue, Mar 22, 2022 at 06:45:24AM -1000, Tejun Heo wrote:
> On Tue, Mar 22, 2022 at 06:23:34PM +0800, Ming Lei wrote:
> > One solution is to delay 'blk_put_queue(blkg->q)' and 'kfree(blkg)'
> > into one work function by reusing blkg->async_bio_work as release_work.
> > 
> > I will prepare one patch for addressing the issue.
> 
> Ah, so, this is the report. Can you please include the backtrace and
> reference to the patch you posted?

OK.

Also I guess it has answered your question in another thread.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-03-18 13:01 ` [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt Ming Lei
  2022-03-22  9:33   ` Christoph Hellwig
@ 2022-04-20  1:46   ` Williams, Dan J
  2022-04-20  2:01     ` yukuai (C)
  1 sibling, 1 reply; 13+ messages in thread
From: Williams, Dan J @ 2022-04-20  1:46 UTC (permalink / raw)
  To: ming.lei, axboe; +Cc: hch, yukuai3, tj, linux-block

On Fri, 2022-03-18 at 21:01 +0800, Ming Lei wrote:
> In the whole lifetime of blkcg_gq instance, ->q will be referred, such
> as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
> may touch the request queue via &tg->service_queue.pending_timer which
> is handled by throtl_pending_timer_fn(), so it is reasonable to grab
> request queue's refcnt by blkcg_gq instance.
> 
> Previously blkcg_exit_queue() is called from blk_release_queue, and it
> is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
> move blkcg initialization/destroy into disk allocation/release handler")
> is merged to for-5.18/block, it becomes simple to fix the issue by simply
> grabbing request queue's refcnt.

This patch, upstream as commit 0a9a25ca7843 ("block: let blkcg_gq grab
request queue's refcnt") causes the nvdimm unit tests to spam messages
like:

[   51.439133] debugfs: Directory 'pmem2' with parent 'block' already present!
[   52.095679] debugfs: Directory 'pmem3' with parent 'block' already present!
[   52.505613] block device autoloading is deprecated and will be removed.
[   52.791693] debugfs: Directory 'pmem2' with parent 'block' already present!
[   53.240314] debugfs: Directory 'pmem3' with parent 'block' already present!
[   53.373472] debugfs: Directory 'pmem3' with parent 'block' already present!
[   53.688876] nd_pmem btt2.0: No existing arenas
[   53.773097] debugfs: Directory 'pmem2s' with parent 'block' already present!
[   53.884493] debugfs: Directory 'pmem2s' with parent 'block' already present!
[   54.042946] debugfs: Directory 'pmem2s' with parent 'block' already present!
[   54.195954] debugfs: Directory 'pmem2s' with parent 'block' already present!
[   54.383989] debugfs: Directory 'pmem2' with parent 'block' already present!
[   54.577206] nd_pmem btt3.0: No existing arenas

...on a test doing block device setup/teardown.

It is still reproducible in v5.18-rc3 and I bisected manually applying
commit d578c770c852 ("block: avoid calling blkg_free() in atomic
context") to avoid the other identified regression with this change.

I'll take a deeper look tomorrow. It could be this is triggering a
latent bug in the pmem driver, but wanted to send this in case someone
saw an immediate ordering problem caused by this change besides the one
fixed by d578c770c852.

> 
> Reported-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-cgroup.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index fa063c6c0338..d53b0d69dd73 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -82,6 +82,8 @@ static void blkg_free(struct blkcg_gq *blkg)
>                 if (blkg->pd[i])
>                         blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>  
> +       if (blkg->q)
> +               blk_put_queue(blkg->q);
>         free_percpu(blkg->iostat_cpu);
>         percpu_ref_exit(&blkg->refcnt);
>         kfree(blkg);
> @@ -167,6 +169,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>         if (!blkg->iostat_cpu)
>                 goto err_free;
>  
> +       if (!blk_get_queue(q))
> +               goto err_free;
> +
>         blkg->q = q;
>         INIT_LIST_HEAD(&blkg->q_node);
>         spin_lock_init(&blkg->async_bio_lock);


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-04-20  1:46   ` Williams, Dan J
@ 2022-04-20  2:01     ` yukuai (C)
  2022-04-20  2:20       ` Ming Lei
  2022-04-20  3:40       ` Dan Williams
  0 siblings, 2 replies; 13+ messages in thread
From: yukuai (C) @ 2022-04-20  2:01 UTC (permalink / raw)
  To: Williams, Dan J, ming.lei, axboe; +Cc: hch, tj, linux-block

在 2022/04/20 9:46, Williams, Dan J 写道:
> On Fri, 2022-03-18 at 21:01 +0800, Ming Lei wrote:
>> In the whole lifetime of blkcg_gq instance, ->q will be referred, such
>> as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
>> may touch the request queue via &tg->service_queue.pending_timer which
>> is handled by throtl_pending_timer_fn(), so it is reasonable to grab
>> request queue's refcnt by blkcg_gq instance.
>>
>> Previously blkcg_exit_queue() is called from blk_release_queue, and it
>> is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
>> move blkcg initialization/destroy into disk allocation/release handler")
>> is merged to for-5.18/block, it becomes simple to fix the issue by simply
>> grabbing request queue's refcnt.
> 
> This patch, upstream as commit 0a9a25ca7843 ("block: let blkcg_gq grab
> request queue's refcnt") causes the nvdimm unit tests to spam messages
> like:
> 
> [   51.439133] debugfs: Directory 'pmem2' with parent 'block' already present!
> [   52.095679] debugfs: Directory 'pmem3' with parent 'block' already present!
> [   52.505613] block device autoloading is deprecated and will be removed.
> [   52.791693] debugfs: Directory 'pmem2' with parent 'block' already present!
> [   53.240314] debugfs: Directory 'pmem3' with parent 'block' already present!
> [   53.373472] debugfs: Directory 'pmem3' with parent 'block' already present!
> [   53.688876] nd_pmem btt2.0: No existing arenas
> [   53.773097] debugfs: Directory 'pmem2s' with parent 'block' already present!
> [   53.884493] debugfs: Directory 'pmem2s' with parent 'block' already present!
> [   54.042946] debugfs: Directory 'pmem2s' with parent 'block' already present!
> [   54.195954] debugfs: Directory 'pmem2s' with parent 'block' already present!
> [   54.383989] debugfs: Directory 'pmem2' with parent 'block' already present!
> [   54.577206] nd_pmem btt3.0: No existing arenas
> 
Hi,

I saw the same warning in our test, and I posted a patch to fix the
problem:

https://patchwork.kernel.org/project/linux-block/patch/20220415035607.1836495-1-yukuai3@huawei.com/

The root cause is not relate to the above commit, see details in
the commit message.

Thanks,
Kuai
> ...on a test doing block device setup/teardown.
> 
> It is still reproducible in v5.18-rc3 and I bisected manually applying
> commit d578c770c852 ("block: avoid calling blkg_free() in atomic
> context") to avoid the other identified regression with this change.
> 
> I'll take a deeper look tomorrow. It could be this is triggering a
> latent bug in the pmem driver, but wanted to send this in case someone
> saw an immediate ordering problem caused by this change besides the one
> fixed by d578c770c852.
> 
>>
>> Reported-by: Christoph Hellwig <hch@lst.de>
>> Signed-off-by: Ming Lei <ming.lei@redhat.com>
>> ---
>>   block/blk-cgroup.c | 5 +++++
>>   1 file changed, 5 insertions(+)
>>
>> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
>> index fa063c6c0338..d53b0d69dd73 100644
>> --- a/block/blk-cgroup.c
>> +++ b/block/blk-cgroup.c
>> @@ -82,6 +82,8 @@ static void blkg_free(struct blkcg_gq *blkg)
>>                  if (blkg->pd[i])
>>                          blkcg_policy[i]->pd_free_fn(blkg->pd[i]);
>>   
>> +       if (blkg->q)
>> +               blk_put_queue(blkg->q);
>>          free_percpu(blkg->iostat_cpu);
>>          percpu_ref_exit(&blkg->refcnt);
>>          kfree(blkg);
>> @@ -167,6 +169,9 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q,
>>          if (!blkg->iostat_cpu)
>>                  goto err_free;
>>   
>> +       if (!blk_get_queue(q))
>> +               goto err_free;
>> +
>>          blkg->q = q;
>>          INIT_LIST_HEAD(&blkg->q_node);
>>          spin_lock_init(&blkg->async_bio_lock);
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-04-20  2:01     ` yukuai (C)
@ 2022-04-20  2:20       ` Ming Lei
  2022-04-20  3:40       ` Dan Williams
  1 sibling, 0 replies; 13+ messages in thread
From: Ming Lei @ 2022-04-20  2:20 UTC (permalink / raw)
  To: yukuai (C); +Cc: Williams, Dan J, axboe, hch, tj, linux-block

On Wed, Apr 20, 2022 at 10:01:36AM +0800, yukuai (C) wrote:
> 在 2022/04/20 9:46, Williams, Dan J 写道:
> > On Fri, 2022-03-18 at 21:01 +0800, Ming Lei wrote:
> > > In the whole lifetime of blkcg_gq instance, ->q will be referred, such
> > > as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
> > > may touch the request queue via &tg->service_queue.pending_timer which
> > > is handled by throtl_pending_timer_fn(), so it is reasonable to grab
> > > request queue's refcnt by blkcg_gq instance.
> > > 
> > > Previously blkcg_exit_queue() is called from blk_release_queue, and it
> > > is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
> > > move blkcg initialization/destroy into disk allocation/release handler")
> > > is merged to for-5.18/block, it becomes simple to fix the issue by simply
> > > grabbing request queue's refcnt.
> > 
> > This patch, upstream as commit 0a9a25ca7843 ("block: let blkcg_gq grab
> > request queue's refcnt") causes the nvdimm unit tests to spam messages
> > like:
> > 
> > [   51.439133] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   52.095679] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   52.505613] block device autoloading is deprecated and will be removed.
> > [   52.791693] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   53.240314] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   53.373472] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   53.688876] nd_pmem btt2.0: No existing arenas
> > [   53.773097] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   53.884493] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.042946] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.195954] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.383989] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   54.577206] nd_pmem btt3.0: No existing arenas
> > 
> Hi,
> 
> I saw the same warning in our test, and I posted a patch to fix the
> problem:
> 
> https://patchwork.kernel.org/project/linux-block/patch/20220415035607.1836495-1-yukuai3@huawei.com/
> 
> The root cause is not relate to the above commit, see details in
> the commit message.

Yeah, I am pretty sure the issue of 'debugfs: Directory 'XXXXX' with parent 'block' already
present!' is one block layer long-term issue.

Commit 0a9a25ca7843 ("block: let blkcg_gq grab request queue's refcnt") justs causes queue
released a bit late, then the issue becomes easier to trigger.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt
  2022-04-20  2:01     ` yukuai (C)
  2022-04-20  2:20       ` Ming Lei
@ 2022-04-20  3:40       ` Dan Williams
  1 sibling, 0 replies; 13+ messages in thread
From: Dan Williams @ 2022-04-20  3:40 UTC (permalink / raw)
  To: yukuai (C); +Cc: ming.lei, axboe, hch, tj, linux-block

On Tue, Apr 19, 2022 at 7:02 PM yukuai (C) <yukuai3@huawei.com> wrote:
>
> 在 2022/04/20 9:46, Williams, Dan J 写道:
> > On Fri, 2022-03-18 at 21:01 +0800, Ming Lei wrote:
> >> In the whole lifetime of blkcg_gq instance, ->q will be referred, such
> >> as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
> >> may touch the request queue via &tg->service_queue.pending_timer which
> >> is handled by throtl_pending_timer_fn(), so it is reasonable to grab
> >> request queue's refcnt by blkcg_gq instance.
> >>
> >> Previously blkcg_exit_queue() is called from blk_release_queue, and it
> >> is hard to avoid the use-after-free. But recently commit 1059699f87eb ("block:
> >> move blkcg initialization/destroy into disk allocation/release handler")
> >> is merged to for-5.18/block, it becomes simple to fix the issue by simply
> >> grabbing request queue's refcnt.
> >
> > This patch, upstream as commit 0a9a25ca7843 ("block: let blkcg_gq grab
> > request queue's refcnt") causes the nvdimm unit tests to spam messages
> > like:
> >
> > [   51.439133] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   52.095679] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   52.505613] block device autoloading is deprecated and will be removed.
> > [   52.791693] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   53.240314] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   53.373472] debugfs: Directory 'pmem3' with parent 'block' already present!
> > [   53.688876] nd_pmem btt2.0: No existing arenas
> > [   53.773097] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   53.884493] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.042946] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.195954] debugfs: Directory 'pmem2s' with parent 'block' already present!
> > [   54.383989] debugfs: Directory 'pmem2' with parent 'block' already present!
> > [   54.577206] nd_pmem btt3.0: No existing arenas
> >
> Hi,
>
> I saw the same warning in our test, and I posted a patch to fix the
> problem:
>
> https://patchwork.kernel.org/project/linux-block/patch/20220415035607.1836495-1-yukuai3@huawei.com/
>
> The root cause is not relate to the above commit, see details in
> the commit message.

Nice. I offered a late tested-by on that thread.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-04-20  3:41 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-03-18 13:01 [PATCH for-5.18 0/3] block: throttle related fixes Ming Lei
2022-03-18 13:01 ` [PATCH 1/3] block: avoid use-after-free on throttle data Ming Lei
2022-03-18 13:01 ` [PATCH 2/3] block: let blkcg_gq grab request queue's refcnt Ming Lei
2022-03-22  9:33   ` Christoph Hellwig
2022-03-22 10:23     ` Ming Lei
2022-03-22 16:45       ` Tejun Heo
2022-03-23  0:32         ` Ming Lei
2022-04-20  1:46   ` Williams, Dan J
2022-04-20  2:01     ` yukuai (C)
2022-04-20  2:20       ` Ming Lei
2022-04-20  3:40       ` Dan Williams
2022-03-18 13:01 ` [PATCH 3/3] block: cancel all throttled bios in del_gendisk() Ming Lei
2022-03-18 15:58 ` [PATCH for-5.18 0/3] block: throttle related fixes Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.