rcu.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
@ 2023-03-30  9:21 Ziwei Dai
  0 siblings, 0 replies; 4+ messages in thread
From: Ziwei Dai @ 2023-03-30  9:21 UTC (permalink / raw)
  To: paulmck, frederic, quic_neeraju, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, joel, rcu
  Cc: shuang.wang, yifan.xin, ke.wang, xuewen.yan, zhiguo.niu,
	ziwei.dai, zhaoyang.huang

From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>

In kfree_rcu_monitor(), new free business at krcp is attached to any free
channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
business is handled after the rcu grace period. But if there is any none-free
channel at krwp already, that means there is an on-going rcu work,
which will cause the kvfree_call_rcu()-triggered free business is done
before the wanted rcu grace period ends.

This commit ignores krwp which has non-free channel at kfree_rcu_monitor(),
to fix the issue that kvfree_call_rcu() loses effectiveness.

Below is the css_set obj "from_cset" use-after-free issue caused by
kvfree_call_rcu() losing effectiveness.
Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
Core 0 references "from_cset"'s member, which causes crash.

Core 0					Core 1				       	Core 2
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// <css_set ptr> is the "from_cset" mentioned on core 1
 |<css_set ptr> = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

			Core 1:
			cgroup_attach_task()
			|cgroup_migrate()
			 |cgroup_migrate_execute()
			  |css_set_move_task(task, from_cset, to_cset, true)
			  |cgroup_move_task(task, to_cset)
			   |rcu_assign_pointer(.., to_cset)
			   |...
			|cgroup_migrate_finish()
			 |put_css_set_locked(from_cset)
			  |from_cset->refcount return 0
			  |kfree_rcu(cset, rcu_head) <--- means to free from_cset after new gp
			   |add_ptr_to_bulk_krc_lock()
			   |schedule_delayed_work(&krcp->monitor_work, ..)

			kfree_rcu_monitor()
			|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
			|queue_rcu_work(system_wq, &krwp->rcu_work)
			 |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
			 |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp

								// There is a perious call_rcu(.., rcu_work_rcufn)
								// gp end, rcu_work_rcufn() is called.
								rcu_work_rcufn()
								|__queue_work(.., rwork->wq, &rwork->work);
								Core 2:
								// or there is a pending kfree_rcu_work() work called.
								|kfree_rcu_work()
								|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
								|The "from_cset" mentioned on core 1 is freed before new gp end.
Core 0:
// the task is schedule in after many ms.
 |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because <css_set ptr>="from_cset" is freed.

Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>

:#	modified:   tree.c
---
 kernel/rcu/tree.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8e880c0..f6451a8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-				(READ_ONCE(krcp->head) && !krwp->head_free)) {
-
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free.  Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (!list_empty(&krwp->bulk_head_free[0]) ||
+			!list_empty(&krwp->bulk_head_free[1]) ||
+				krwp->head_free)
+			continue;
+		if (!list_empty(&krcp->bulk_head[0]) ||
+			!list_empty(&krcp->bulk_head[1]) ||
+			READ_ONCE(krcp->head)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
@ 2023-03-30  9:27 Ziwei Dai
  0 siblings, 0 replies; 4+ messages in thread
From: Ziwei Dai @ 2023-03-30  9:27 UTC (permalink / raw)
  To: paulmck, frederic, quic_neeraju, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, joel, rcu
  Cc: linux-kernel, shuang.wang, yifan.xin, ke.wang, xuewen.yan,
	zhiguo.niu, ziwei.dai, zhaoyang.huang

From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>

In kfree_rcu_monitor(), new free business at krcp is attached to any free
channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
business is handled after the rcu grace period. But if there is any none-free
channel at krwp already, that means there is an on-going rcu work,
which will cause the kvfree_call_rcu()-triggered free business is done
before the wanted rcu grace period ends.

This commit ignores krwp which has non-free channel at kfree_rcu_monitor(),
to fix the issue that kvfree_call_rcu() loses effectiveness.

Below is the css_set obj "from_cset" use-after-free issue caused by
kvfree_call_rcu() losing effectiveness.
Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
Core 0 references "from_cset"'s member, which causes crash.

Core 0					Core 1				       	Core 2
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// <css_set ptr> is the "from_cset" mentioned on core 1
 |<css_set ptr> = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

			Core 1:
			cgroup_attach_task()
			|cgroup_migrate()
			 |cgroup_migrate_execute()
			  |css_set_move_task(task, from_cset, to_cset, true)
			  |cgroup_move_task(task, to_cset)
			   |rcu_assign_pointer(.., to_cset)
			   |...
			|cgroup_migrate_finish()
			 |put_css_set_locked(from_cset)
			  |from_cset->refcount return 0
			  |kfree_rcu(cset, rcu_head) <--- means to free from_cset after new gp
			   |add_ptr_to_bulk_krc_lock()
			   |schedule_delayed_work(&krcp->monitor_work, ..)

			kfree_rcu_monitor()
			|krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
			|queue_rcu_work(system_wq, &krwp->rcu_work)
			 |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
			 |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp

								// There is a perious call_rcu(.., rcu_work_rcufn)
								// gp end, rcu_work_rcufn() is called.
								rcu_work_rcufn()
								|__queue_work(.., rwork->wq, &rwork->work);
								Core 2:
								// or there is a pending kfree_rcu_work() work called.
								|kfree_rcu_work()
								|krwp->bulk_head_free[0] bulk is freed before new gp end!!!
								|The "from_cset" mentioned on core 1 is freed before new gp end.
Core 0:
// the task is schedule in after many ms.
 |<css_set ptr>->subsys[(subsys_id) <--- caused kernel crash, because <css_set ptr>="from_cset" is freed.

Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>

:#	modified:   tree.c
---
 kernel/rcu/tree.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8e880c0..f6451a8 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
 	for (i = 0; i < KFREE_N_BATCHES; i++) {
 		struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
 
-		// Try to detach bulk_head or head and attach it over any
-		// available corresponding free channel. It can be that
-		// a previous RCU batch is in progress, it means that
-		// immediately to queue another one is not possible so
-		// in that case the monitor work is rearmed.
-		if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-			(!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-				(READ_ONCE(krcp->head) && !krwp->head_free)) {
-
+		// Try to detach bulk_head or head and attach it, only when
+		// all channels are free.  Any channel is not free means at krwp
+		// there is on-going rcu work to handle krwp's free business.
+		if (!list_empty(&krwp->bulk_head_free[0]) ||
+			!list_empty(&krwp->bulk_head_free[1]) ||
+				krwp->head_free)
+			continue;
+		if (!list_empty(&krcp->bulk_head[0]) ||
+			!list_empty(&krcp->bulk_head[1]) ||
+			READ_ONCE(krcp->head)) {
 			// Channel 1 corresponds to the SLAB-pointer bulk path.
 			// Channel 2 corresponds to vmalloc-pointer bulk path.
 			for (j = 0; j < FREE_N_CHANNELS; j++) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
  2023-03-30  7:31 Ziwei Dai
@ 2023-03-30  8:52 ` Uladzislau Rezki
  0 siblings, 0 replies; 4+ messages in thread
From: Uladzislau Rezki @ 2023-03-30  8:52 UTC (permalink / raw)
  To: Ziwei Dai
  Cc: paulmck, frederic, quic_neeraju, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, joel, rcu, shuang.wang,
	yifan.xin, ke.wang, xuewen.yan, zhiguo.niu

On Thu, Mar 30, 2023 at 03:31:37PM +0800, Ziwei Dai wrote:
> From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>
> 
> In kfree_rcu_monitor(), new free business at krcp is attached to any free
> channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
> business is handled after the rcu grace period. But if there is any none-free
> channel at krwp already, that means there is an on-going rcu work,
> which will cause the kvfree_call_rcu()-triggered free business is done
> before the wanted rcu grace period ends.
> 
Right. There have to be a "Fixes:" tag.

> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8e880c0..4fe3c53 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
>         for (i = 0; i < KFREE_N_BATCHES; i++) {
>                 struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);
> 
> -               // Try to detach bulk_head or head and attach it over any
> -               // available corresponding free channel. It can be that
> -               // a previous RCU batch is in progress, it means that
> -               // immediately to queue another one is not possible so
> -               // in that case the monitor work is rearmed.
> -               if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
> -                       (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
> -                               (READ_ONCE(krcp->head) && !krwp->head_free)) {
> -
> +               // Try to detach bulk_head or head and attach it, only when
> +               // all channels are free.  Any channel is not free means at krwp
> +               // there is on-going rcu work to handle krwp's free business.
> +               if (!list_empty(&krwp->bulk_head_free[0]) ||
> +                       !list_empty(&krwp->bulk_head_free[1]) ||
> +                               !krwp->head_free)
> +                       continue;
> +               if (!list_empty(&krcp->bulk_head[0]) ||
> +                       !list_empty(&krcp->bulk_head[1]) ||
> +                       READ_ONCE(krcp->head)) {
>                         // Channel 1 corresponds to the SLAB-pointer bulk path.
>                         // Channel 2 corresponds to vmalloc-pointer bulk path.
>                         for (j = 0; j < FREE_N_CHANNELS; j++) {
> --
> 1.9.1
> 
Give me some time to have a look at your change closer. It seems it can
trigger a rcu_work even though krcp is empty. But it is from a first
glance.

Thanks!

--
Uladzislau Rezki

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period.
@ 2023-03-30  7:31 Ziwei Dai
  2023-03-30  8:52 ` Uladzislau Rezki
  0 siblings, 1 reply; 4+ messages in thread
From: Ziwei Dai @ 2023-03-30  7:31 UTC (permalink / raw)
  To: paulmck, frederic, quic_neeraju, josh, rostedt,
	mathieu.desnoyers, jiangshanlai, joel, rcu
  Cc: shuang.wang, yifan.xin, ke.wang, xuewen.yan, zhiguo.niu

From: 代子为 (Ziwei Dai) <ziwei.dai@ziwei-lenovo.spreadtrum.com>

In kfree_rcu_monitor(), new free business at krcp is attached to any free
channel at krwp. kfree_rcu_monitor() is responsible to make sure new free
business is handled after the rcu grace period. But if there is any none-free
channel at krwp already, that means there is an on-going rcu work,
which will cause the kvfree_call_rcu()-triggered free business is done
before the wanted rcu grace period ends.

This commit ignore krwp which has non-free channel at kfree_rcu_monitor(),
to fix the issue that kvfree_call_rcu() loses effectiveness.

Below is the css_set obj "from_cset" use-after-free issue caused by
kvfree_call_rcu() losing effectiveness.
Core 0 calls rcu_read_lock(), then use "from_cset", then hard irq comes.
Core 1 calls kfree_rcu(cset, rcu_head), willing to free "from_cset" after new gp.
Core 2 frees "from_cset" after current gp end. "from_cset" is reallocated.
Core 0 references "from_cset"'s member, which causes crash.

Core 0                                  Core 1                                  Core 2
count_memcg_event_mm()
|rcu_read_lock()  <---
|mem_cgroup_from_task()
 |// <css_set ptr> is the "from_cset" mentioned on core 1
 |<css_set ptr> = rcu_dereference((task)->cgroups)
 |// Hard irq comes, current task is scheduled out.

                        Core 1:
                        cgroup_attach_task()
                        |cgroup_migrate()
                         |cgroup_migrate_execute()
                          |css_set_move_task(task, from_cset, to_cset, true)
                          |cgroup_move_task(task, to_cset)
                           |rcu_assign_pointer(.., to_cset)
                           |...
                        |cgroup_migrate_finish()
                         |put_css_set_locked(from_cset)
                          |from_cset->refcount return 0
                          |kfree_rcu(cset, rcu_head) <--- means to free from_cset after new gp
                           |add_ptr_to_bulk_krc_lock()
                           |schedule_delayed_work(&krcp->monitor_work, ..)

                        kfree_rcu_monitor()
                        |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
                        |queue_rcu_work(system_wq, &krwp->rcu_work)
                         |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
                         |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request a new gp

                                                                // There is a perious call_rcu(.., rcu_work_rcufn)
                                                                // gp end, rcu_work_rcufn() is called.
                                                                rcu_work_rcufn()
                                                                |__queue_work(.., rwork->wq, &rwork->work);
                                                                Core 3:
                                                                // or there is a pending kfree_rcu_work() work called.
                                                                |kfree_rcu_work()
                                                                |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
                                                                |The "from_cset" mentioned on core 1 is freed before new gp end.
Core 0:
// the task is schedule in after many ms.
 |<css_set_ptr>->subsys[(subsys_id) <--- caused kernel crash, because <css_set_ptr>="from_cset" is freed.

Signed-off-by: Ziwei Dai <ziwei.dai@unisoc.com>
---
 kernel/rcu/tree.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8e880c0..4fe3c53 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3107,15 +3107,16 @@ static void kfree_rcu_monitor(struct work_struct *work)
        for (i = 0; i < KFREE_N_BATCHES; i++) {
                struct kfree_rcu_cpu_work *krwp = &(krcp->krw_arr[i]);

-               // Try to detach bulk_head or head and attach it over any
-               // available corresponding free channel. It can be that
-               // a previous RCU batch is in progress, it means that
-               // immediately to queue another one is not possible so
-               // in that case the monitor work is rearmed.
-               if ((!list_empty(&krcp->bulk_head[0]) && list_empty(&krwp->bulk_head_free[0])) ||
-                       (!list_empty(&krcp->bulk_head[1]) && list_empty(&krwp->bulk_head_free[1])) ||
-                               (READ_ONCE(krcp->head) && !krwp->head_free)) {
-
+               // Try to detach bulk_head or head and attach it, only when
+               // all channels are free.  Any channel is not free means at krwp
+               // there is on-going rcu work to handle krwp's free business.
+               if (!list_empty(&krwp->bulk_head_free[0]) ||
+                       !list_empty(&krwp->bulk_head_free[1]) ||
+                               !krwp->head_free)
+                       continue;
+               if (!list_empty(&krcp->bulk_head[0]) ||
+                       !list_empty(&krcp->bulk_head[1]) ||
+                       READ_ONCE(krcp->head)) {
                        // Channel 1 corresponds to the SLAB-pointer bulk path.
                        // Channel 2 corresponds to vmalloc-pointer bulk path.
                        for (j = 0; j < FREE_N_CHANNELS; j++) {
--
1.9.1

________________________________
 This email (including its attachments) is intended only for the person or entity to which it is addressed and may contain information that is privileged, confidential or otherwise protected from disclosure. Unauthorized use, dissemination, distribution or copying of this email or the information herein or taking any action in reliance on the contents of this email or the information herein, by anyone other than the intended recipient, or an employee or agent responsible for delivering the message to the intended recipient, is strictly prohibited. If you are not the intended recipient, please do not read, copy, use or disclose any part of this e-mail to others. Please notify the sender immediately and permanently delete this e-mail and any attachments if you received it in error. Internet communications cannot be guaranteed to be timely, secure, error-free or virus-free. The sender does not accept liability for any errors or omissions.
本邮件及其附件具有保密性质,受法律保护不得泄露,仅发送给本邮件所指特定收件人。严禁非经授权使用、宣传、发布或复制本邮件或其内容。若非该特定收件人,请勿阅读、复制、 使用或披露本邮件的任何内容。若误收本邮件,请从系统中永久性删除本邮件及所有附件,并以回复邮件的方式即刻告知发件人。无法保证互联网通信及时、安全、无误或防毒。发件人对任何错漏均不承担责任。

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2023-03-30  9:27 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-30  9:21 [PATCH] rcu: Make sure new krcp free business is handled after the wanted rcu grace period Ziwei Dai
  -- strict thread matches above, loose matches on Subject: below --
2023-03-30  9:27 Ziwei Dai
2023-03-30  7:31 Ziwei Dai
2023-03-30  8:52 ` Uladzislau Rezki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).