linux-block.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path
@ 2022-12-15  3:31 Waiman Long
  2022-12-15  3:31 ` [PATCH v4 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Waiman Long @ 2022-12-15  3:31 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long

 v4:
  - Update comment and commit logs for both patches.

 v3:
  - Drop v2 patch 2 as it may not be needed.
  - Replace css_tryget() with percpu_ref_is_zero() in patch 1 as
    suggested by Tejun.
  - Expand comment on patch 2 to elaborate the reason for this patch.

 v2:
  - Remove unnecessary rcu_read_{lock|unlock} from
    cgroup_rstat_css_cpu_flush() in patch 3.

It was found that blkcg_destroy_blkgs() may be called with all blkcg
references gone. This may potentially cause user-after-free and so should
be fixed. The second patch flushes rstat when calling blkcg_destroy_blkgs().

Waiman Long (2):
  bdi, blk-cgroup: Fix potential UAF of blkcg
  blk-cgroup: Flush stats at blkgs destruction path

 block/blk-cgroup.c     | 23 +++++++++++++++++++++++
 include/linux/cgroup.h |  1 +
 kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
 mm/backing-dev.c       |  8 ++++++--
 4 files changed, 48 insertions(+), 2 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH v4 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg
  2022-12-15  3:31 [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
@ 2022-12-15  3:31 ` Waiman Long
  2022-12-15  3:31 ` [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
  2023-02-02  3:26 ` [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat " Waiman Long
  2 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2022-12-15  3:31 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long, Yi Zhang

Commit 59b57717fff8 ("blkcg: delay blkg destruction until after
writeback has finished") delayed call to blkcg_destroy_blkgs() to
cgwb_release_workfn(). However, it is done after a css_put() of blkcg
which may be the final put that causes the blkcg to be freed as RCU
read lock isn't held.

Another place where blkcg_destroy_blkgs() can be called indirectly via
blkcg_unpin_online() is from the offline_css() function called from
css_killed_work_fn(). Over there, the potentially final css_put() call
is issued after offline_css().

By adding a css_tryget() into blkcg_destroy_blkgs() and warning its
failure, the following stack trace was produced in a test system on
bootup.

[   34.254240] RIP: 0010:blkcg_destroy_blkgs+0x16a/0x1a0
      :
[   34.339943] Call Trace:
[   34.344510]  blkcg_unpin_online+0x38/0x60
[   34.348523]  cgwb_release_workfn+0x6a/0x200
[   34.352708]  process_one_work+0x1e5/0x3b0
[   34.360758]  worker_thread+0x50/0x3a0
[   34.368447]  kthread+0xd9/0x100
[   34.376386]  ret_from_fork+0x22/0x30

This confirms that a potential UAF situation can really happen in
cgwb_release_workfn().

Fix that by delaying the css_put() until after the blkcg_unpin_online()
call. Also use percpu_ref_is_zero() in blkcg_destroy_blkgs() and issue
a warning if reference count is zero.

The reproducing system can no longer produce a warning with this patch.
All the runnable block/0* tests including block/027 were run successfully
without failure.

Fixes: 59b57717fff8 ("blkcg: delay blkg destruction until after writeback has finished")
Suggested-by: Michal Koutný <mkoutny@suse.com>
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by:  Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c | 7 +++++++
 mm/backing-dev.c   | 8 ++++++--
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 1bb939d3b793..ca28306aa1b1 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1084,6 +1084,13 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  */
 static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+	/*
+	 * blkcg_destroy_blkgs() shouldn't be called with all the blkcg
+	 * references gone.
+	 */
+	if (WARN_ON_ONCE(percpu_ref_is_zero(&blkcg->css.refcnt)))
+		return;
+
 	might_sleep();
 
 	spin_lock_irq(&blkcg->lock);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c30419a5e119..36f75b072325 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -390,11 +390,15 @@ static void cgwb_release_workfn(struct work_struct *work)
 	wb_shutdown(wb);
 
 	css_put(wb->memcg_css);
-	css_put(wb->blkcg_css);
 	mutex_unlock(&wb->bdi->cgwb_release_mutex);
 
-	/* triggers blkg destruction if no online users left */
+	/*
+	 * Triggers blkg destruction if no online users left
+	 * The final blkcg css_put() has to be done after blkcg_unpin_online()
+	 * to avoid use-after-free.
+	 */
 	blkcg_unpin_online(wb->blkcg_css);
+	css_put(wb->blkcg_css);
 
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2022-12-15  3:31 [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
  2022-12-15  3:31 ` [PATCH v4 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
@ 2022-12-15  3:31 ` Waiman Long
  2023-02-02  4:15   ` Ming Lei
  2023-02-02  3:26 ` [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat " Waiman Long
  2 siblings, 1 reply; 6+ messages in thread
From: Waiman Long @ 2022-12-15  3:31 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook),
	Waiman Long

As noted by Michal, the blkg_iostat_set's in the lockless list
hold reference to blkg's to protect against their removal. Those
blkg's hold reference to blkcg. When a cgroup is being destroyed,
cgroup_rstat_flush() is only called at css_release_work_fn() which
is called when the blkcg reference count reaches 0. This circular
dependency will prevent blkcg and some blkgs from being freed after
they are made offline.

It is less a problem if the cgroup to be destroyed also has other
controllers like memory that will call cgroup_rstat_flush() which will
clean up the reference count. If block is the only controller that uses
rstat, these offline blkcg and blkgs may never be freed leaking more
and more memory over time.

To prevent this potential memory leak, a new cgroup_rstat_css_cpu_flush()
function is added to flush stats for a given css and cpu. This new
function will be called at blkgs destruction path, blkcg_destroy_blkgs(),
whenever there are still pending stats to be flushed. This will release
the references to blkgs allowing them to be freed and indirectly allow
the freeing of blkcg.

Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Tejun Heo <tj@kernel.org>
---
 block/blk-cgroup.c     | 16 ++++++++++++++++
 include/linux/cgroup.h |  1 +
 kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
 3 files changed, 35 insertions(+)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ca28306aa1b1..a2a1081d9d1d 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1084,6 +1084,8 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
  */
 static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 {
+	int cpu;
+
 	/*
 	 * blkcg_destroy_blkgs() shouldn't be called with all the blkcg
 	 * references gone.
@@ -1093,6 +1095,20 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
 
 	might_sleep();
 
+	/*
+	 * Flush all the non-empty percpu lockless lists to release the
+	 * blkg references held by those lists which, in turn, will
+	 * allow those blkgs to be freed and release their references to
+	 * blkcg. Otherwise, they may not be freed at all becase of this
+	 * circular dependency resulting in memory leak.
+	 */
+	for_each_possible_cpu(cpu) {
+		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
+
+		if (!llist_empty(lhead))
+			cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
+	}
+
 	spin_lock_irq(&blkcg->lock);
 
 	while (!hlist_empty(&blkcg->blkg_list)) {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 528bd44b59e2..6c4e66b3fa84 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -766,6 +766,7 @@ void cgroup_rstat_flush(struct cgroup *cgrp);
 void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
+void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu);
 
 /*
  * Basic resource stats.
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 793ecff29038..2e44be44351f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -281,6 +281,24 @@ void cgroup_rstat_flush_release(void)
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
+/**
+ * cgroup_rstat_css_cpu_flush - flush stats for the given css and cpu
+ * @css: target css to be flush
+ * @cpu: the cpu that holds the stats to be flush
+ *
+ * A lightweight rstat flush operation for a given css and cpu.
+ * Only the cpu_lock is being held for mutual exclusion, the cgroup_rstat_lock
+ * isn't used.
+ */
+void cgroup_rstat_css_cpu_flush(struct cgroup_subsys_state *css, int cpu)
+{
+	raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
+
+	raw_spin_lock_irq(cpu_lock);
+	css->ss->css_rstat_flush(css, cpu);
+	raw_spin_unlock_irq(cpu_lock);
+}
+
 int cgroup_rstat_init(struct cgroup *cgrp)
 {
 	int cpu;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path
  2022-12-15  3:31 [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
  2022-12-15  3:31 ` [PATCH v4 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
  2022-12-15  3:31 ` [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
@ 2023-02-02  3:26 ` Waiman Long
  2 siblings, 0 replies; 6+ messages in thread
From: Waiman Long @ 2023-02-02  3:26 UTC (permalink / raw)
  To: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton
  Cc: cgroups, linux-block, linux-kernel, linux-mm, Michal Koutný,
	Dennis Zhou (Facebook)


On 12/14/22 22:31, Waiman Long wrote:
>   v4:
>    - Update comment and commit logs for both patches.
>
>   v3:
>    - Drop v2 patch 2 as it may not be needed.
>    - Replace css_tryget() with percpu_ref_is_zero() in patch 1 as
>      suggested by Tejun.
>    - Expand comment on patch 2 to elaborate the reason for this patch.
>
>   v2:
>    - Remove unnecessary rcu_read_{lock|unlock} from
>      cgroup_rstat_css_cpu_flush() in patch 3.
>
> It was found that blkcg_destroy_blkgs() may be called with all blkcg
> references gone. This may potentially cause user-after-free and so should
> be fixed. The second patch flushes rstat when calling blkcg_destroy_blkgs().
>
> Waiman Long (2):
>    bdi, blk-cgroup: Fix potential UAF of blkcg
>    blk-cgroup: Flush stats at blkgs destruction path
>
>   block/blk-cgroup.c     | 23 +++++++++++++++++++++++
>   include/linux/cgroup.h |  1 +
>   kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
>   mm/backing-dev.c       |  8 ++++++--
>   4 files changed, 48 insertions(+), 2 deletions(-)
>
Ping!

Any comments on these patches.

Thanks,
Longman


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2022-12-15  3:31 ` [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
@ 2023-02-02  4:15   ` Ming Lei
  2023-02-02 22:35     ` Tejun Heo
  0 siblings, 1 reply; 6+ messages in thread
From: Ming Lei @ 2023-02-02  4:15 UTC (permalink / raw)
  To: Waiman Long
  Cc: Jens Axboe, Tejun Heo, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook),
	ming.lei

On Wed, Dec 14, 2022 at 10:31:32PM -0500, Waiman Long wrote:
> As noted by Michal, the blkg_iostat_set's in the lockless list
> hold reference to blkg's to protect against their removal. Those
> blkg's hold reference to blkcg. When a cgroup is being destroyed,
> cgroup_rstat_flush() is only called at css_release_work_fn() which
> is called when the blkcg reference count reaches 0. This circular
> dependency will prevent blkcg and some blkgs from being freed after
> they are made offline.
> 
> It is less a problem if the cgroup to be destroyed also has other
> controllers like memory that will call cgroup_rstat_flush() which will
> clean up the reference count. If block is the only controller that uses
> rstat, these offline blkcg and blkgs may never be freed leaking more
> and more memory over time.
> 
> To prevent this potential memory leak, a new cgroup_rstat_css_cpu_flush()
> function is added to flush stats for a given css and cpu. This new
> function will be called at blkgs destruction path, blkcg_destroy_blkgs(),
> whenever there are still pending stats to be flushed. This will release
> the references to blkgs allowing them to be freed and indirectly allow
> the freeing of blkcg.
> 
> Fixes: 3b8cc6298724 ("blk-cgroup: Optimize blkcg_rstat_flush()")
> Signed-off-by: Waiman Long <longman@redhat.com>
> Acked-by: Tejun Heo <tj@kernel.org>
> ---
>  block/blk-cgroup.c     | 16 ++++++++++++++++
>  include/linux/cgroup.h |  1 +
>  kernel/cgroup/rstat.c  | 18 ++++++++++++++++++
>  3 files changed, 35 insertions(+)
> 
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index ca28306aa1b1..a2a1081d9d1d 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1084,6 +1084,8 @@ struct list_head *blkcg_get_cgwb_list(struct cgroup_subsys_state *css)
>   */
>  static void blkcg_destroy_blkgs(struct blkcg *blkcg)
>  {
> +	int cpu;
> +
>  	/*
>  	 * blkcg_destroy_blkgs() shouldn't be called with all the blkcg
>  	 * references gone.
> @@ -1093,6 +1095,20 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
>  
>  	might_sleep();
>  
> +	/*
> +	 * Flush all the non-empty percpu lockless lists to release the
> +	 * blkg references held by those lists which, in turn, will
> +	 * allow those blkgs to be freed and release their references to
> +	 * blkcg. Otherwise, they may not be freed at all becase of this
> +	 * circular dependency resulting in memory leak.
> +	 */
> +	for_each_possible_cpu(cpu) {
> +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> +
> +		if (!llist_empty(lhead))
> +			cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
> +	}

I guess it is possible for new iostat_cpu to be added just after the
llist_empty() check.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path
  2023-02-02  4:15   ` Ming Lei
@ 2023-02-02 22:35     ` Tejun Heo
  0 siblings, 0 replies; 6+ messages in thread
From: Tejun Heo @ 2023-02-02 22:35 UTC (permalink / raw)
  To: Ming Lei
  Cc: Waiman Long, Jens Axboe, Josef Bacik, Zefan Li, Johannes Weiner,
	Andrew Morton, cgroups, linux-block, linux-kernel, linux-mm,
	Michal Koutný, Dennis Zhou (Facebook)

On Thu, Feb 02, 2023 at 12:15:52PM +0800, Ming Lei wrote:
> > @@ -1093,6 +1095,20 @@ static void blkcg_destroy_blkgs(struct blkcg *blkcg)
> >  
> >  	might_sleep();
> >  
> > +	/*
> > +	 * Flush all the non-empty percpu lockless lists to release the
> > +	 * blkg references held by those lists which, in turn, will
> > +	 * allow those blkgs to be freed and release their references to
> > +	 * blkcg. Otherwise, they may not be freed at all becase of this
> > +	 * circular dependency resulting in memory leak.
> > +	 */
> > +	for_each_possible_cpu(cpu) {
> > +		struct llist_head *lhead = per_cpu_ptr(blkcg->lhead, cpu);
> > +
> > +		if (!llist_empty(lhead))
> > +			cgroup_rstat_css_cpu_flush(&blkcg->css, cpu);
> > +	}
> 
> I guess it is possible for new iostat_cpu to be added just after the
> llist_empty() check.

Ah, good point. Maybe:

* Move flush below the blkg kill loop.

* In blk_cgroup_bio_start(), use percpu_ref_tryget() to decide whether to
  add to llist or not.

-- 
tejun

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-02-02 22:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-12-15  3:31 [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat at blkgs destruction path Waiman Long
2022-12-15  3:31 ` [PATCH v4 1/2] bdi, blk-cgroup: Fix potential UAF of blkcg Waiman Long
2022-12-15  3:31 ` [PATCH v4 2/2] blk-cgroup: Flush stats at blkgs destruction path Waiman Long
2023-02-02  4:15   ` Ming Lei
2023-02-02 22:35     ` Tejun Heo
2023-02-02  3:26 ` [PATCH v4 0/2] blk-cgroup: Fix potential UAF & flush rstat " Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).