linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH-block v3 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path
@ 2022-12-13 18:44 Waiman Long
  2022-12-13 18:44 ` [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
  2022-12-13 18:44 ` [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
  0 siblings, 2 replies; 9+ messages in thread
From: Waiman Long @ 2022-12-13 18:44 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long

 v3:
  - Drop v2 patch 2 as it may not be needed.
  - Replace css_tryget() with percpu_ref_is_zero() in patch 1 as
    suggested by Tejun.
  - Expand comment on patch 2 to elaborate the reason for this patch.

 v2:
  - Remove unnecessary rcu_read_{lock|unlock} from
    cgroup_rstat_css_cpu_flush() in patch 3.

It was found that blkcg_destroy_blkgs() may be called with all blkcg
references gone. This may potentially cause user-after-free and so should
be fixed. The second patch flushes rstat when blkcg_destroy_blkgs().

Waiman Long (2):
  bdi, blk-cgroup: Fix potential UAF of blkcg
  blk-cgroup: Flush stats at blkgs destruction path

 block/blk-cgroup.c     | 22 ++++++++++++++++++++++
 include/linux/cgroup.h |  1 +
 kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
 mm/backing-dev.c       |  8 ++++++--
 4 files changed, 47 insertions(+), 2 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-13 18:44 [PATCH-block v3 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
@ 2022-12-13 18:44 ` Waiman Long
  2022-12-13 19:29   ` Tejun Heo
  2022-12-13 18:44 ` [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-12-13 18:44 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long, Yi Zhang

Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
writeback has finished") delayed call to blkcg_destroy_blkgs() to
cgwb_release_workfn(). However, it is done after a css_put() of blkcg
which may be the final put that causes the blkcg to be freed as RCU
read lock isn't held.

Another place where blkcg_destroy_blkgs() can be called indirectly via
blkcg_unpin_online() is from the offline_css() function called from
css_killed_work_fn(). Over there, the potentially final css_put() call
is issued after offline_css().

By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
failure, the following stack trace was produced in a test system on
bootup.

[   34.254240] RIP: 0010:blkcg_destroy_blkgs+0x16a/0x1a0
      :
[   34.339943] Call Trace:
[   34.344510]  blkcg_unpin_online+0x38/0x60
[   34.348523]  cgwb_release_workfn+0x6a/0x200
[   34.352708]  process_one_work+0x1e5/0x3b0
[   34.360758]  worker_thread+0x50/0x3a0
[   34.368447]  kthread+0xd9/0x100
[   34.376386]  ret_from_fork+0x22/0x30

This confirms that a potential UAF situation can really happen in
cgwb_release_workfn().

Fix that by delaying the css_put() until after the blkcg_unpin_online()
call. Also use css_tryget() in blkcg_destroy_blkgs() and issue a warning
if css_tryget() fails.

The reproducing system can no longer produce a warning with this patch.
All the runnable block/0* tests including block/027 were run successfully
without failure.

Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
Suggested-by: Michal Koutný <mkoutny@suse.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
---
 block/blk-cgroup.c | 7 +++++++
 mm/backing-dev.c   | 8 ++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1bb939d3b793..ca28306aa1b1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1084,6 +1084,13 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  */
 static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+	/*
+	 * blkcg_destroy_blkgs() shouldn't be called with all the blkcg
+	 * references gone.
+	 */
+	if (WARN_ON_ONCE(percpu_ref_is_zero(&blkcg->css.refcnt)))
+		return;
+
 	might_sleep();
 
 	spin_lock_irq(&blkcg->lock);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c30419a5e119..36f75b072325 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -390,11 +390,15 @@ static void cgwb_release_workfn(struct work_struct *work)
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
-	css_put(wb->blkcg_css);
 	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
-	/* triggers blkg destruction if no online users left */
+	/*
+	 * Triggers blkg destruction if no online users left
+	 * The final blkcg css_put() has to be done after blkcg_unpin_online()
+	 * to avoid use-after-free.
+	 */
 	blkcg_unpin_online(wb->blkcg_css);
+	css_put(wb->blkcg_css);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2022-12-13 18:44 [PATCH-block v3 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
  2022-12-13 18:44 ` [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
@ 2022-12-13 18:44 ` Waiman Long
  2022-12-13 19:30   ` Tejun Heo
  1 sibling, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-12-13 18:44 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long

As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which is
called when the blkcg reference count reaches 0. This circular dependency
will prevent blkcg from being freed until some other events cause
cgroup_rstat_flush() to be called to flush out the pending blkcg stats.

To prevent this delayed blkcg removal, add a new cgroup_rstat_css_flush()
function to flush stats for a given css and cpu and call it at the blkgs
destruction path, blkcg_destroy_blkgs(), whenever there are still some
pending stats to be flushed. This will ensure that blkcg reference
count can reach 0 ASAP.

Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c     | 15 +++++++++++++++
 include/linux/cgroup.h |  1 +
 kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
 3 files changed, 34 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ca28306aa1b1..ddd27a714d3e 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1084,6 +1084,8 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  */
 static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+	int cpu;
+
 	/*
 	 * blkcg_destroy_blkgs() shouldn't be called with all the blkcg
 	 * references gone.
@@ -1093,6 +1095,19 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 
 	might_sleep();
 
+	/*
+	 * Flush all the non-empty percpu lockless lists so as to release
+	 * the blkg references held by those lists which, in turn, may
+	 * allow the blkgs to be freed and release their references to
+	 * blkcg speeding up its freeing.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		if (!llist_empty(lhead))
+			cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
+	}
+
 	spin_lock_irq(&blkcg->lock);
 
 	while (!hlist_empty(&blkcg->blkg_list)) {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 528bd44b59e2..6c4e66b3fa84 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -766,6 +766,7 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
+void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu);
 
 /*
  * Basic resource stats.
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 793ecff29038..2e44be44351f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -281,6 +281,24 @@ void cgroup_rstat_flush_release(void)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+/**
+ * cgroup_rstat_css_cpu_flush - flush stats for the given css and cpu
+ * @css: target css to be flush
+ * @cpu: the cpu that holds the stats to be flush
+ *
+ * A lightweight rstat flush operation for a given css and cpu.
+ * Only the cpu_lock is being held for mutual exclusion, the cgroup_rstat_lock
+ * isn't used.
+ */
+void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+
+	raw_spin_lock_irq(cpu_lock);
+	css->ss->css_rstat_flush(css, cpu);
+	raw_spin_unlock_irq(cpu_lock);
+}
+
 int cgroup_rstat_init(struct cgroup *cgrp)
 {
 	int cpu;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-13 18:44 ` [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
@ 2022-12-13 19:29   ` Tejun Heo
  2022-12-13 19:53     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2022-12-13 19:29 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook),
	Yi Zhang

On Tue, Dec 13, 2022 at 01:44:45PM -0500, Waiman Long wrote:
> Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
> writeback has finished") delayed call to blkcg_destroy_blkgs() to
> cgwb_release_workfn(). However, it is done after a css_put() of blkcg
> which may be the final put that causes the blkcg to be freed as RCU
> read lock isn't held.
> 
> Another place where blkcg_destroy_blkgs() can be called indirectly via
> blkcg_unpin_online() is from the offline_css() function called from
> css_killed_work_fn(). Over there, the potentially final css_put() call
> is issued after offline_css().
> 
> By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
> failure, the following stack trace was produced in a test system on
> bootup.

This doesn't agree with the code anymore. Otherwise

Acked-by: Tejun Heo <tj@kernel.org>

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2022-12-13 18:44 ` [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
@ 2022-12-13 19:30   ` Tejun Heo
  2022-12-14  1:58     ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Tejun Heo @ 2022-12-13 19:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook)

On Tue, Dec 13, 2022 at 01:44:46PM -0500, Waiman Long wrote:
> +	/*
> +	 * Flush all the non-empty percpu lockless lists so as to release
> +	 * the blkg references held by those lists which, in turn, may
> +	 * allow the blkgs to be freed and release their references to
> +	 * blkcg speeding up its freeing.
> +	 */

Can you mention the possible deadlock explicitly? This sounds more like an
optimization.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-13 19:29   ` Tejun Heo
@ 2022-12-13 19:53     ` Waiman Long
  2022-12-14 16:54       ` Jens Axboe
  0 siblings, 1 reply; 9+ messages in thread
From: Waiman Long @ 2022-12-13 19:53 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook),
	Yi Zhang


On 12/13/22 14:29, Tejun Heo wrote:
> On Tue, Dec 13, 2022 at 01:44:45PM -0500, Waiman Long wrote:
>> Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
>> writeback has finished") delayed call to blkcg_destroy_blkgs() to
>> cgwb_release_workfn(). However, it is done after a css_put() of blkcg
>> which may be the final put that causes the blkcg to be freed as RCU
>> read lock isn't held.
>>
>> Another place where blkcg_destroy_blkgs() can be called indirectly via
>> blkcg_unpin_online() is from the offline_css() function called from
>> css_killed_work_fn(). Over there, the potentially final css_put() call
>> is issued after offline_css().
>>
>> By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
>> failure, the following stack trace was produced in a test system on
>> bootup.
> This doesn't agree with the code anymore. Otherwise
>
> Acked-by: Tejun Heo <tj@kernel.org>

Sorry, I overlooked the commit log in my update. I will update it if I 
need another version, or Jens can make the following edit:

css_tryget() -> percpu_ref_is_zero().

Thanks,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2022-12-13 19:30   ` Tejun Heo
@ 2022-12-14  1:58     ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-12-14  1:58 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Jens Axboe, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook)


On 12/13/22 14:30, Tejun Heo wrote:
> On Tue, Dec 13, 2022 at 01:44:46PM -0500, Waiman Long wrote:
>> +	/*
>> +	 * Flush all the non-empty percpu lockless lists so as to release
>> +	 * the blkg references held by those lists which, in turn, may
>> +	 * allow the blkgs to be freed and release their references to
>> +	 * blkcg speeding up its freeing.
>> +	 */
> Can you mention the possible deadlock explicitly? This sounds more like an
> optimization.

I am mostly thinking about the optimization aspect. Yes, deadlock in the 
sense that both blkgs and blkcg remained offline but not freed is 
possible because of the references hold in those lockless list. It is a 
problem if blkcg is the only controller in a cgroup. For cgroup that has 
both the blkcg and memory controllers, it shouldn't be a problem as the 
cgroup_rstat_flush() call in the release of memory cgroup will clear up 
blkcg too. Right, I will update the comment to mention that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-13 19:53     ` Waiman Long
@ 2022-12-14 16:54       ` Jens Axboe
  2022-12-14 16:55         ` Waiman Long
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Axboe @ 2022-12-14 16:54 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo
  Cc: Josef Bacik, Zefan Li, Johannes Weiner, Andrew Morton, cgroups,
	linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Yi Zhang

On 12/13/22 12:53 PM, Waiman Long wrote:
> 
> On 12/13/22 14:29, Tejun Heo wrote:
>> On Tue, Dec 13, 2022 at 01:44:45PM -0500, Waiman Long wrote:
>>> Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
>>> writeback has finished") delayed call to blkcg_destroy_blkgs() to
>>> cgwb_release_workfn(). However, it is done after a css_put() of blkcg
>>> which may be the final put that causes the blkcg to be freed as RCU
>>> read lock isn't held.
>>>
>>> Another place where blkcg_destroy_blkgs() can be called indirectly via
>>> blkcg_unpin_online() is from the offline_css() function called from
>>> css_killed_work_fn(). Over there, the potentially final css_put() call
>>> is issued after offline_css().
>>>
>>> By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
>>> failure, the following stack trace was produced in a test system on
>>> bootup.
>> This doesn't agree with the code anymore. Otherwise
>>
>> Acked-by: Tejun Heo <tj@kernel.org>
> 
> Sorry, I overlooked the commit log in my update. I will update it if I need another version, or Jens can make the following edit:
> 
> css_tryget() -> percpu_ref_is_zero().

Since the other one also needs an edit, would be great if you could
just send out a v4.

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-14 16:54       ` Jens Axboe
@ 2022-12-14 16:55         ` Waiman Long
  0 siblings, 0 replies; 9+ messages in thread
From: Waiman Long @ 2022-12-14 16:55 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo
  Cc: Josef Bacik, Zefan Li, Johannes Weiner, Andrew Morton, cgroups,
	linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Yi Zhang


On 12/14/22 11:54, Jens Axboe wrote:
> On 12/13/22 12:53 PM, Waiman Long wrote:
>> On 12/13/22 14:29, Tejun Heo wrote:
>>> On Tue, Dec 13, 2022 at 01:44:45PM -0500, Waiman Long wrote:
>>>> Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
>>>> writeback has finished") delayed call to blkcg_destroy_blkgs() to
>>>> cgwb_release_workfn(). However, it is done after a css_put() of blkcg
>>>> which may be the final put that causes the blkcg to be freed as RCU
>>>> read lock isn't held.
>>>>
>>>> Another place where blkcg_destroy_blkgs() can be called indirectly via
>>>> blkcg_unpin_online() is from the offline_css() function called from
>>>> css_killed_work_fn(). Over there, the potentially final css_put() call
>>>> is issued after offline_css().
>>>>
>>>> By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
>>>> failure, the following stack trace was produced in a test system on
>>>> bootup.
>>> This doesn't agree with the code anymore. Otherwise
>>>
>>> Acked-by: Tejun Heo <tj@kernel.org>
>> Sorry, I overlooked the commit log in my update. I will update it if I need another version, or Jens can make the following edit:
>>
>> css_tryget() -> percpu_ref_is_zero().
> Since the other one also needs an edit, would be great if you could
> just send out a v4.
>
Sure, will do that.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-12-14 16:56 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-13 18:44 [PATCH-block v3 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
2022-12-13 18:44 ` [PATCH-block v3 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
2022-12-13 19:29   ` Tejun Heo
2022-12-13 19:53     ` Waiman Long
2022-12-14 16:54       ` Jens Axboe
2022-12-14 16:55         ` Waiman Long
2022-12-13 18:44 ` [PATCH-block v3 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
2022-12-13 19:30   ` Tejun Heo
2022-12-14  1:58     ` Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).