* [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
@ 2023-01-04 17:47 Rahul Rameshbabu
  2023-01-04 20:12 ` Maxim Mikityanskiy
  0 siblings, 1 reply; 6+ messages in thread
From: Rahul Rameshbabu @ 2023-01-04 17:47 UTC (permalink / raw)
  To: netdev
  Cc: Tariq Toukan, Gal Pressman, Saeed Mahameed, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Rahul Rameshbabu, Eric Dumazet,
	Maxim Mikityanskiy

When the netdev qdisc is updated correctly with the new qdisc before
destroying the old qdisc, the netdev should not be activated till cleanup
is completed. When htb_destroy_class_offload called htb_graft_helper, the
netdev may be activated before cleanup is completed. The new netdev qdisc
may be used prematurely by queues before cleanup is done. Call
dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
prevent premature netdev activation.

Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Acked-by: Saeed Mahameed <saeedm@nvidia.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
---
 net/sched/sch_htb.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index 2238edece1a4..f62334ef016a 100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
 
 	WARN_ON(!q);
 	dev_queue = htb_offload_get_queue(cl);
-	old = htb_graft_helper(dev_queue, NULL);
-	if (destroying)
+	if (destroying) {
+		old = dev_graft_qdisc(dev_queue, NULL);
 		/* Before HTB is destroyed, the kernel grafts noop_qdisc to
 		 * all queues.
 		 */
 		WARN_ON(!(old->flags & TCQ_F_BUILTIN));
-	else
+	} else {
+		old = htb_graft_helper(dev_queue, NULL);
 		WARN_ON(old != q);
+	}
 
 	if (cl->parent) {
 		_bstats_update(&cl->parent->bstats_bias,
-- 
2.36.2



* Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
  2023-01-04 17:47 [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload Rahul Rameshbabu
@ 2023-01-04 20:12 ` Maxim Mikityanskiy
  2023-01-05  6:03   ` Rahul Rameshbabu
  0 siblings, 1 reply; 6+ messages in thread
From: Maxim Mikityanskiy @ 2023-01-04 20:12 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: netdev, Tariq Toukan, Gal Pressman, Saeed Mahameed,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Eric Dumazet

On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>
> When the netdev qdisc is updated correctly with the new qdisc before
> destroying the old qdisc, the netdev should not be activated till cleanup
> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> netdev may be activated before cleanup is completed.

Oh, so that's what was happening! Now I get the full picture:

1. The user does RTM_DELQDISC.
2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
NULL, but keeps dev_queue->qdisc_sleeping.
3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
where new is NULL, for each queue.
4. Then we get into htb_destroy_class_offload, and it's important
whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
(after Eric's patch).
5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
activates the netdev, attach_default_qdiscs will be called, and
dev_queue->qdisc will no longer be NULL for the rest of the queues,
hence the WARN_ON triggering.

Nice catch indeed, premature activation of the netdev wasn't intended.
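
For reference, htb_graft_helper goes roughly like this (paraphrased
from sch_htb.c from memory and slightly abridged, so treat it as a
sketch rather than the exact source):

  static struct Qdisc *htb_graft_helper(struct netdev_queue *dev_queue,
                                        struct Qdisc *new_q)
  {
          struct net_device *dev = dev_queue->dev;
          struct Qdisc *old_q;

          /* Deactivate the netdev, swap the qdisc on this queue... */
          if (dev->flags & IFF_UP)
                  dev_deactivate(dev);
          old_q = dev_graft_qdisc(dev_queue, new_q);
          /* ...and re-activate the netdev right away. When called from
           * htb_destroy_class_offload with destroying == true, this is
           * the premature activation described in step 5 above.
           */
          if (dev->flags & IFF_UP)
                  dev_activate(dev);

          return old_q;
  }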

> The new netdev qdisc
> may be used prematurely by queues before cleanup is done. Call
> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> prevent premature netdev activation.
>
> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> ---
>  net/sched/sch_htb.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> index 2238edece1a4..f62334ef016a 100644
> --- a/net/sched/sch_htb.c
> +++ b/net/sched/sch_htb.c
> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>
>         WARN_ON(!q);
>         dev_queue = htb_offload_get_queue(cl);
> -       old = htb_graft_helper(dev_queue, NULL);
> -       if (destroying)
> +       if (destroying) {
> +               old = dev_graft_qdisc(dev_queue, NULL);
>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>                  * all queues.
>                  */
>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));

Now regarding this WARN_ON, I have concerns about its correctness.

Can the user replace the root qdisc from HTB to something else with a
single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
called with new != NULL? If that is possible, then:

1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
support the attach callback) or the old one left from HTB (old == q,
if the new qdisc supports the attach callback). WARN_ON should
trigger.

2. We shouldn't even call dev_graft_qdisc in this case (if destroying
is true). Likewise, we shouldn't try to revert it on errors or call
qdisc_put on it.
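
For context, the relevant part of qdisc_graft goes roughly like this
(paraphrased from sch_api.c from memory and heavily abridged, so don't
take it as the exact source):

  if (dev->flags & IFF_UP)
          dev_deactivate(dev);
  ...
  for (i = 0; i < num_q; i++) {
          ...
          old = dev_graft_qdisc(dev_queue, new); /* new == NULL on delete */
          ...
  }
  ...
  rcu_assign_pointer(dev->qdisc, new ? : &noop_qdisc);
  notify_and_destroy(...); /* -> htb_destroy -> htb_destroy_class_offload */
  ...
  if (dev->flags & IFF_UP)
          dev_activate(dev);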

Could you please try to reproduce this scenario of triggering WARN_ON?
I remember testing it, and something actually prevented me from doing
a replacement, but maybe I just missed something back then.

> -       else
> +       } else {
> +               old = htb_graft_helper(dev_queue, NULL);
>                 WARN_ON(old != q);
> +       }
>
>         if (cl->parent) {
>                 _bstats_update(&cl->parent->bstats_bias,
> --
> 2.36.2
>


* Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
  2023-01-04 20:12 ` Maxim Mikityanskiy
@ 2023-01-05  6:03   ` Rahul Rameshbabu
  2023-01-06 12:03     ` Maxim Mikityanskiy
  0 siblings, 1 reply; 6+ messages in thread
From: Rahul Rameshbabu @ 2023-01-05  6:03 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: netdev, Tariq Toukan, Gal Pressman, Saeed Mahameed,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Eric Dumazet

Maxim Mikityanskiy <maxtram95@gmail.com> writes:

> On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>>
>> When the netdev qdisc is updated correctly with the new qdisc before
>> destroying the old qdisc, the netdev should not be activated till cleanup
>> is completed. When htb_destroy_class_offload called htb_graft_helper, the
>> netdev may be activated before cleanup is completed.
>
> Oh, so that's what was happening! Now I get the full picture:
>
> 1. The user does RTM_DELQDISC.
> 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> NULL, but keeps dev_queue->qdisc_sleeping.
> 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> where new is NULL, for each queue.
> 4. Then we get into htb_destroy_class_offload, and it's important
> whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> (after Eric's patch).
> 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> activates the netdev, attach_default_qdiscs will be called, and
> dev_queue->qdisc will no longer be NULL for the rest of the queues,
> hence the WARN_ON triggering.
>
> Nice catch indeed, premature activation of the netdev wasn't intended.
>
>> The new netdev qdisc
>> may be used prematurely by queues before cleanup is done. Call
>> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
>> prevent premature netdev activation.
>>
>> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
>> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
>> Cc: Eric Dumazet <edumazet@google.com>
>> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
>> ---
>>  net/sched/sch_htb.c | 8 +++++---
>>  1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> index 2238edece1a4..f62334ef016a 100644
>> --- a/net/sched/sch_htb.c
>> +++ b/net/sched/sch_htb.c
>> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>>
>>         WARN_ON(!q);
>>         dev_queue = htb_offload_get_queue(cl);
>> -       old = htb_graft_helper(dev_queue, NULL);
>> -       if (destroying)
>> +       if (destroying) {
>> +               old = dev_graft_qdisc(dev_queue, NULL);
>>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>>                  * all queues.
>>                  */
>>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
>
> Now regarding this WARN_ON, I have concerns about its correctness.
>
> Can the user replace the root qdisc from HTB to something else with a
> single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> called with new != NULL? If that is possible, then:
>
> 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> support the attach callback) or the old one left from HTB (old == q,
> if the new qdisc supports the attach callback). WARN_ON should
> trigger.
>
> 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> is true). Likewise, we shouldn't try to revert it on errors or call
> qdisc_put on it.
>
> Could you please try to reproduce this scenario of triggering WARN_ON?
> I remember testing it, and something actually prevented me from doing
> a replacement, but maybe I just missed something back then.
>

Reproduction steps

  ip link set dev eth2 up
  ip link set dev eth2 up
  ip addr add 194.237.173.123/16 dev eth2
  tc qdisc add dev eth2 clsact
  tc qdisc add dev eth2 root handle 1: htb default 1 offload
  tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
  tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit 
  tc qdisc replace dev eth2 root pfifo

The warning is indeed triggered because the new root is pfifo rather
than noop_qdisc. I agree with both points you brought up in the patch.
When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
was worried about the case when new was defined as a new qdisc rather
than defaulting to noop_qdisc but assumed there were some guarantees for
htb.

I believe the correct fix for a robust implementation of
htb_destroy_class_offload would be to not depend on functions that
retrieve the top level qdisc. However, I see a number of functions, not
just offload related ones, in the htb implementation that seem to depend
on the assumption that the old qdisc can safely be accessed with helpers
such as htb_graft_helper. One such example is htb_change_class. The
trivial solution I see is to change qdisc_graft to first do a
rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
know your thoughts on this.
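
Roughly the ordering I have in mind (just an untested sketch with the
arguments elided, not an actual patch):

  /* in qdisc_graft(), root qdisc case */
  rcu_assign_pointer(dev->qdisc, &noop_qdisc);
  notify_and_destroy(...);        /* old qdisc (e.g. HTB) torn down here */
  if (new)
          rcu_assign_pointer(dev->qdisc, new);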

  [  384.474535] ------------[ cut here ]------------
  [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
  [  384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
  [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
  [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
  [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
  [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
  [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
  [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
  [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
  [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
  [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
  [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
  [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
  [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [  384.505699] Call Trace:
  [  384.506231]  <TASK>
  [  384.506691]  ? tcf_block_put+0x74/0xa0
  [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
  [  384.508131]  ? hrtimer_cancel+0x11/0x40
  [  384.508832]  ? rtnl_is_locked+0x11/0x20
  [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
  [  384.510293]  qdisc_destroy+0x3b/0xd0
  [  384.510943]  qdisc_graft+0x40b/0x590
  [  384.511600]  tc_modify_qdisc+0x577/0x870
  [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
  [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
  [  384.513806]  netlink_rcv_skb+0x54/0x100
  [  384.514495]  netlink_unicast+0x1f6/0x2c0
  [  384.515190]  netlink_sendmsg+0x237/0x490
  [  384.515890]  sock_sendmsg+0x33/0x40
  [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
  [  384.517265]  ___sys_sendmsg+0x72/0xb0
  [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
  [  384.518631]  __sys_sendmsg+0x51/0x90
  [  384.519289]  do_syscall_64+0x3d/0x90
  [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
  [  384.520794] RIP: 0033:0x7f7a2eaccc17
  [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
  [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
  [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
  [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
  [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
  [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
  [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
  [  384.531221]  </TASK>
  [  384.531689] ---[ end trace 0000000000000000 ]---

>> -       else
>> +       } else {
>> +               old = htb_graft_helper(dev_queue, NULL);
>>                 WARN_ON(old != q);
>> +       }
>>
>>         if (cl->parent) {
>>                 _bstats_update(&cl->parent->bstats_bias,
>> --
>> 2.36.2
>>


* Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
  2023-01-05  6:03   ` Rahul Rameshbabu
@ 2023-01-06 12:03     ` Maxim Mikityanskiy
  2023-01-06 23:36       ` Rahul Rameshbabu
  0 siblings, 1 reply; 6+ messages in thread
From: Maxim Mikityanskiy @ 2023-01-06 12:03 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: netdev, Tariq Toukan, Gal Pressman, Saeed Mahameed,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Eric Dumazet

On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>
> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
>
> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >>
> >> When the netdev qdisc is updated correctly with the new qdisc before
> >> destroying the old qdisc, the netdev should not be activated till cleanup
> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> >> netdev may be activated before cleanup is completed.
> >
> > Oh, so that's what was happening! Now I get the full picture:
> >
> > 1. The user does RTM_DELQDISC.
> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> > NULL, but keeps dev_queue->qdisc_sleeping.
> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> > where new is NULL, for each queue.
> > 4. Then we get into htb_destroy_class_offload, and it's important
> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> > (after Eric's patch).
> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> > activates the netdev, attach_default_qdiscs will be called, and
> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
> > hence the WARN_ON triggering.
> >
> > Nice catch indeed, premature activation of the netdev wasn't intended.
> >
> >> The new netdev qdisc
> >> may be used prematurely by queues before cleanup is done. Call
> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> >> prevent premature netdev activation.
> >>
> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> >> Cc: Eric Dumazet <edumazet@google.com>
> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> >> ---
> >>  net/sched/sch_htb.c | 8 +++++---
> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> >> index 2238edece1a4..f62334ef016a 100644
> >> --- a/net/sched/sch_htb.c
> >> +++ b/net/sched/sch_htb.c
> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
> >>
> >>         WARN_ON(!q);
> >>         dev_queue = htb_offload_get_queue(cl);
> >> -       old = htb_graft_helper(dev_queue, NULL);
> >> -       if (destroying)
> >> +       if (destroying) {
> >> +               old = dev_graft_qdisc(dev_queue, NULL);
> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
> >>                  * all queues.
> >>                  */
> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
> >
> > Now regarding this WARN_ON, I have concerns about its correctness.
> >
> > Can the user replace the root qdisc from HTB to something else with a
> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> > called with new != NULL? If that is possible, then:
> >
> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> > support the attach callback) or the old one left from HTB (old == q,
> > if the new qdisc supports the attach callback). WARN_ON should
> > trigger.
> >
> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> > is true). Likewise, we shouldn't try to revert it on errors or call
> > qdisc_put on it.
> >
> > Could you please try to reproduce this scenario of triggering WARN_ON?
> > I remember testing it, and something actually prevented me from doing
> > a replacement, but maybe I just missed something back then.
> >
>
> Reproduction steps
>
>   ip link set dev eth2 up
>   ip link set dev eth2 up
>   ip addr add 194.237.173.123/16 dev eth2
>   tc qdisc add dev eth2 clsact
>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
>   tc qdisc replace dev eth2 root pfifo
>
> The warning is indeed triggered because the new root is pfifo rather
> than noop_qdisc. I agree with both points you brought up in the patch.
> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
> was worried about the case when new was defined as a new qdisc rather
> than defaulting to noop_qdisc but assumed there were some guarantees for
> htb.

OK, so that means my WARN_ON relies on false guarantees.

> However, I see a number of functions, not
> just offload related ones, in the htb implementation that seem to depend
> on the assumption that the old qdisc can safely be accessed with helpers
> such as htb_graft_helper. One such example is htb_change_class.

htb_graft_helper is only used by the offload, you can see that all
usages are guarded by if (q->offload).

> The
> trivial solution I see is to change qdisc_graft to first do a
> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
> know your thoughts on this.

I don't think it's a good idea to introduce such a change to generic
code just to fix HTB offload. It would basically remove the ability to
replace the qdisc atomically, and there will be periods of time when
all packets are dropped.

> I believe the correct fix for a robust implementation of
> htb_destroy_class_offload would be to not depend on functions that
> retrieve the top level qdisc.

This is actually the case. The source of truth is internal data
structures of HTB, specifically cl->leaf.q points to the queue qdisc,
and cl->leaf.offload_queue points to the queue itself (the latter is
very useful when the qdisc is noop_qdisc, and we can't retrieve
dev_queue from the qdisc itself). Whenever we take a qdisc from the
netdev_queue itself, it should be only to check consistency with the
source of truth (please tell me if you spotted some places where it's
used for something else).
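
(For reference, the two fields I mean live in struct htb_class, roughly
like this; abridged and written from memory, so the exact layout may
differ:)

  struct htb_class {
          ...
          union {
                  struct {
                          struct Qdisc            *q;             /* leaf qdisc, source of truth */
                          struct netdev_queue     *offload_queue; /* the tx queue itself */
                          ...
                  } leaf;
                  ...
          };
  };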

What I suggest to do to fix this particular bug:

1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
invalid. Rephrase the comment to explain why you WARN_ON(old != q)
only when !destroying.

2. Don't graft anything on destroy, the kernel has already replaced
the qdisc with the right one (noop_qdisc on delete, the new qdisc on
replace).

3. qdisc_put down below shouldn't be called when destroying, because
we didn't graft anything ourselves. We should still call
qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
the same qdisc). And we should still graft old back on errors, but not
when destroying (we ignore all errors on destroy and just go on, we
can't fail).
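
In code, the idea is something like this (just a sketch of what I mean,
not a tested patch):

  if (!destroying) {
          old = htb_graft_helper(dev_queue, NULL);
          WARN_ON(old != q);
  }

  /* ... the existing TC_HTB_LEAF_DEL* offload command stays as is ... */

  if (!destroying) {
          if (!err)
                  qdisc_put(old);  /* drop the dev_queue->qdisc reference */
          else
                  htb_graft_helper(dev_queue, old); /* revert graft on error */
  }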

>
>   [  384.474535] ------------[ cut here ]------------
>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
>   [  384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   [  384.505699] Call Trace:
>   [  384.506231]  <TASK>
>   [  384.506691]  ? tcf_block_put+0x74/0xa0
>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
>   [  384.510293]  qdisc_destroy+0x3b/0xd0
>   [  384.510943]  qdisc_graft+0x40b/0x590
>   [  384.511600]  tc_modify_qdisc+0x577/0x870
>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
>   [  384.513806]  netlink_rcv_skb+0x54/0x100
>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
>   [  384.515190]  netlink_sendmsg+0x237/0x490
>   [  384.515890]  sock_sendmsg+0x33/0x40
>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
>   [  384.518631]  __sys_sendmsg+0x51/0x90
>   [  384.519289]  do_syscall_64+0x3d/0x90
>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
>   [  384.531221]  </TASK>
>   [  384.531689] ---[ end trace 0000000000000000 ]---
>
> >> -       else
> >> +       } else {
> >> +               old = htb_graft_helper(dev_queue, NULL);
> >>                 WARN_ON(old != q);
> >> +       }
> >>
> >>         if (cl->parent) {
> >>                 _bstats_update(&cl->parent->bstats_bias,
> >> --
> >> 2.36.2
> >>


* Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
  2023-01-06 12:03     ` Maxim Mikityanskiy
@ 2023-01-06 23:36       ` Rahul Rameshbabu
  2023-01-10 17:41         ` Maxim Mikityanskiy
  0 siblings, 1 reply; 6+ messages in thread
From: Rahul Rameshbabu @ 2023-01-06 23:36 UTC (permalink / raw)
  To: Maxim Mikityanskiy
  Cc: netdev, Tariq Toukan, Gal Pressman, Saeed Mahameed,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Eric Dumazet

Hi Maxim,

I am working on an updated patch based on your comments. I have further
comments below.

-- Rahul Rameshbabu

Maxim Mikityanskiy <maxtram95@gmail.com> writes:

> On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>>
>> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
>>
>> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
>> >>
>> >> When the netdev qdisc is updated correctly with the new qdisc before
>> >> destroying the old qdisc, the netdev should not be activated till cleanup
>> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
>> >> netdev may be activated before cleanup is completed.
>> >
>> > Oh, so that's what was happening! Now I get the full picture:
>> >
>> > 1. The user does RTM_DELQDISC.
>> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
>> > NULL, but keeps dev_queue->qdisc_sleeping.
>> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
>> > where new is NULL, for each queue.
>> > 4. Then we get into htb_destroy_class_offload, and it's important
>> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
>> > (after Eric's patch).
>> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
>> > activates the netdev, attach_default_qdiscs will be called, and
>> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
>> > hence the WARN_ON triggering.
>> >
>> > Nice catch indeed, premature activation of the netdev wasn't intended.
>> >
>> >> The new netdev qdisc
>> >> may be used prematurely by queues before cleanup is done. Call
>> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
>> >> prevent premature netdev activation.
>> >>
>> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
>> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
>> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
>> >> Cc: Eric Dumazet <edumazet@google.com>
>> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
>> >> ---
>> >>  net/sched/sch_htb.c | 8 +++++---
>> >>  1 file changed, 5 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
>> >> index 2238edece1a4..f62334ef016a 100644
>> >> --- a/net/sched/sch_htb.c
>> >> +++ b/net/sched/sch_htb.c
>> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
>> >>
>> >>         WARN_ON(!q);
>> >>         dev_queue = htb_offload_get_queue(cl);
>> >> -       old = htb_graft_helper(dev_queue, NULL);
>> >> -       if (destroying)
>> >> +       if (destroying) {
>> >> +               old = dev_graft_qdisc(dev_queue, NULL);
>> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
>> >>                  * all queues.
>> >>                  */
>> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
>> >
>> > Now regarding this WARN_ON, I have concerns about its correctness.
>> >
>> > Can the user replace the root qdisc from HTB to something else with a
>> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
>> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
>> > called with new != NULL? If that is possible, then:
>> >
>> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
>> > support the attach callback) or the old one left from HTB (old == q,
>> > if the new qdisc supports the attach callback). WARN_ON should
>> > trigger.
>> >
>> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
>> > is true). Likewise, we shouldn't try to revert it on errors or call
>> > qdisc_put on it.
>> >
>> > Could you please try to reproduce this scenario of triggering WARN_ON?
>> > I remember testing it, and something actually prevented me from doing
>> > a replacement, but maybe I just missed something back then.
>> >
>>
>> Reproduction steps
>>
>>   ip link set dev eth2 up
>>   ip link set dev eth2 up
>>   ip addr add 194.237.173.123/16 dev eth2
>>   tc qdisc add dev eth2 clsact
>>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
>>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
>>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
>>   tc qdisc replace dev eth2 root pfifo
>>
>> The warning is indeed triggered because the new root is pfifo rather
>> than noop_qdisc. I agree with both points you brought up in the patch.
>> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
>> was worried about the case when new was defined as a new qdisc rather
> >> than defaulting to noop_qdisc but assumed there were some guarantees for
>> htb.
>
> OK, so that means my WARN_ON relies on false guarantees.
>
>> However, I see a number of functions, not
>> just offload related ones, in the htb implementation that seem to depend
>> on the assumption that the old qdisc can safely be accessed with helpers
>> such as htb_graft_helper. One such example is htb_change_class.
>
> htb_graft_helper is only used by the offload, you can see that all
> usages are guarded by if (q->offload).
>
>> The
>> trivial solution I see is to change qdisc_graft to first do a
>> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
>> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
>> know your thoughts on this.
>
> I don't think it's a good idea to introduce such a change to generic
> code just to fix HTB offload. It would basically remove the ability to
> replace the qdisc atomically, and there will be periods of time when
> all packets are dropped.
>
>> I believe the correct fix for a robust implementation of
>> htb_destroy_class_offload would be to not depend on functions that
>> retrieve the top level qdisc.
>
> This is actually the case. The source of truth is internal data
> structures of HTB, specifically cl->leaf.q points to the queue qdisc,
> and cl->leaf.offload_queue points to the queue itself (the latter is
> very useful when the qdisc is noop_qdisc, and we can't retrieve
> dev_queue from the qdisc itself). Whenever we take a qdisc from the
> netdev_queue itself, it should be only to check consistency with the
> source of truth (please tell me if you spotted some places where it's
> used for something else).

Yeah, I did catch the qdisc taken from the netdev_queue used in such a
way but only in one place. This is in htb_change_class in what appears
to be an offload context (I misread in my earlier comment). It's just
the call to _bstats_update though. Should it be changed to
parent->leaf.q (since it should be the source of truth for the htb
structure in this context) in the event the WARN_ON is encountered
(shouldn't happen normally hence the WARN_ON to ensure this guarantee)?

  dev_queue = htb_offload_get_queue(parent);
  old_q = htb_graft_helper(dev_queue, NULL);
  WARN_ON(old_q != parent->leaf.q);
  offload_opt = (struct tc_htb_qopt_offload) {
    .command = TC_HTB_LEAF_TO_INNER,
    .classid = cl->common.classid,
    .parent_classid =
      TC_H_MIN(parent->common.classid),
    .rate = max_t(u64, hopt->rate.rate, rate64),
    .ceil = max_t(u64, hopt->ceil.rate, ceil64),
    .extack = extack,
  };
  err = htb_offload(dev, &offload_opt);
  if (err) {
    pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
            err);
    htb_graft_helper(dev_queue, old_q);
    goto err_kill_estimator;
  }
  _bstats_update(&parent->bstats_bias,
            u64_stats_read(&old_q->bstats.bytes),
            u64_stats_read(&old_q->bstats.packets));

>
> What I suggest to do to fix this particular bug:
>
> 1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
> invalid. Rephrase the comment to explain why you WARN_ON(old != q)
> only when !destroying.
>
> 2. Don't graft anything on destroy, the kernel has already replaced
> the qdisc with the right one (noop_qdisc on delete, the new qdisc on
> replace).
>
> 3. qdisc_put down below shouldn't be called when destroying, because
> we didn't graft anything ourselves. We should still call
> qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
> cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
> the same qdisc). And we should still graft old back on errors, but not
> when destroying (we ignore all errors on destroy and just go on, we
> can't fail).

Ack.

>
>>
>>   [  384.474535] ------------[ cut here ]------------
>>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack
>> xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
>> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi
>> rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
>>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
>>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
>>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
>>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
>>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
>>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
>>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
>>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
>>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
>>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
>>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
>>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
>>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>   [  384.505699] Call Trace:
>>   [  384.506231]  <TASK>
>>   [  384.506691]  ? tcf_block_put+0x74/0xa0
>>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
>>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
>>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
>>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
>>   [  384.510293]  qdisc_destroy+0x3b/0xd0
>>   [  384.510943]  qdisc_graft+0x40b/0x590
>>   [  384.511600]  tc_modify_qdisc+0x577/0x870
>>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
>>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
>>   [  384.513806]  netlink_rcv_skb+0x54/0x100
>>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
>>   [  384.515190]  netlink_sendmsg+0x237/0x490
>>   [  384.515890]  sock_sendmsg+0x33/0x40
>>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
>>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
>>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
>>   [  384.518631]  __sys_sendmsg+0x51/0x90
>>   [  384.519289]  do_syscall_64+0x3d/0x90
>>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
>>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
>>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
>>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
>>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
>>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
>>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
>>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
>>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
>>   [  384.531221]  </TASK>
>>   [  384.531689] ---[ end trace 0000000000000000 ]---
>>
>> >> -       else
>> >> +       } else {
>> >> +               old = htb_graft_helper(dev_queue, NULL);
>> >>                 WARN_ON(old != q);
>> >> +       }
>> >>
>> >>         if (cl->parent) {
>> >>                 _bstats_update(&cl->parent->bstats_bias,
>> >> --
>> >> 2.36.2
>> >>


* Re: [PATCH net] sch_htb: Fix prematurely activating netdev during htb_destroy_class_offload
  2023-01-06 23:36       ` Rahul Rameshbabu
@ 2023-01-10 17:41         ` Maxim Mikityanskiy
  0 siblings, 0 replies; 6+ messages in thread
From: Maxim Mikityanskiy @ 2023-01-10 17:41 UTC (permalink / raw)
  To: Rahul Rameshbabu
  Cc: netdev, Tariq Toukan, Gal Pressman, Saeed Mahameed,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Eric Dumazet

On Fri, Jan 06, 2023 at 03:36:37PM -0800, Rahul Rameshbabu wrote:
> Hi Maxim,
> 
> I am working on an updated patch based on your comments. I have further
> comments below.
> 
> -- Rahul Rameshbabu
> 
> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
> 
> > On Thu, 5 Jan 2023 at 08:03, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >>
> >> Maxim Mikityanskiy <maxtram95@gmail.com> writes:
> >>
> >> > On Wed, 4 Jan 2023 at 19:53, Rahul Rameshbabu <rrameshbabu@nvidia.com> wrote:
> >> >>
> >> >> When the netdev qdisc is updated correctly with the new qdisc before
> >> >> destroying the old qdisc, the netdev should not be activated till cleanup
> >> >> is completed. When htb_destroy_class_offload called htb_graft_helper, the
> >> >> netdev may be activated before cleanup is completed.
> >> >
> >> > Oh, so that's what was happening! Now I get the full picture:
> >> >
> >> > 1. The user does RTM_DELQDISC.
> >> > 2. qdisc_graft calls dev_deactivate, which sets dev_queue->qdisc to
> >> > NULL, but keeps dev_queue->qdisc_sleeping.
> >> > 3. The loop in qdisc_graft calls dev_graft_qdisc(dev_queue, new),
> >> > where new is NULL, for each queue.
> >> > 4. Then we get into htb_destroy_class_offload, and it's important
> >> > whether dev->qdisc is still HTB (before Eric's patch) or noop_qdisc
> >> > (after Eric's patch).
> >> > 5. If dev->qdisc is noop_qdisc, and htb_graft_helper accidentally
> >> > activates the netdev, attach_default_qdiscs will be called, and
> >> > dev_queue->qdisc will no longer be NULL for the rest of the queues,
> >> > hence the WARN_ON triggering.
> >> >
> >> > Nice catch indeed, premature activation of the netdev wasn't intended.
> >> >
> >> >> The new netdev qdisc
> >> >> may be used prematurely by queues before cleanup is done. Call
> >> >> dev_graft_qdisc in place of htb_graft_helper when destroying the htb to
> >> >> prevent premature netdev activation.
> >> >>
> >> >> Fixes: d03b195b5aa0 ("sch_htb: Hierarchical QoS hardware offload")
> >> >> Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
> >> >> Acked-by: Saeed Mahameed <saeedm@nvidia.com>
> >> >> Cc: Eric Dumazet <edumazet@google.com>
> >> >> Cc: Maxim Mikityanskiy <maxtram95@gmail.com>
> >> >> ---
> >> >>  net/sched/sch_htb.c | 8 +++++---
> >> >>  1 file changed, 5 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
> >> >> index 2238edece1a4..f62334ef016a 100644
> >> >> --- a/net/sched/sch_htb.c
> >> >> +++ b/net/sched/sch_htb.c
> >> >> @@ -1557,14 +1557,16 @@ static int htb_destroy_class_offload(struct Qdisc *sch, struct htb_class *cl,
> >> >>
> >> >>         WARN_ON(!q);
> >> >>         dev_queue = htb_offload_get_queue(cl);
> >> >> -       old = htb_graft_helper(dev_queue, NULL);
> >> >> -       if (destroying)
> >> >> +       if (destroying) {
> >> >> +               old = dev_graft_qdisc(dev_queue, NULL);
> >> >>                 /* Before HTB is destroyed, the kernel grafts noop_qdisc to
> >> >>                  * all queues.
> >> >>                  */
> >> >>                 WARN_ON(!(old->flags & TCQ_F_BUILTIN));
> >> >
> >> > Now regarding this WARN_ON, I have concerns about its correctness.
> >> >
> >> > Can the user replace the root qdisc from HTB to something else with a
> >> > single command? I.e. instead of `tc qdisc del dev eth2 root handle 1:`
> >> > do `tc qdisc replace ...` or whatever that causes qdisc_graft to be
> >> > called with new != NULL? If that is possible, then:
> >> >
> >> > 1. `old` won't be noop_qdisc, but rather the new qdisc (if it doesn't
> >> > support the attach callback) or the old one left from HTB (old == q,
> >> > if the new qdisc supports the attach callback). WARN_ON should
> >> > trigger.
> >> >
> >> > 2. We shouldn't even call dev_graft_qdisc in this case (if destroying
> >> > is true). Likewise, we shouldn't try to revert it on errors or call
> >> > qdisc_put on it.
> >> >
> >> > Could you please try to reproduce this scenario of triggering WARN_ON?
> >> > I remember testing it, and something actually prevented me from doing
> >> > a replacement, but maybe I just missed something back then.
> >> >
> >>
> >> Reproduction steps
> >>
> >>   ip link set dev eth2 up
> >>   ip link set dev eth2 up
> >>   ip addr add 194.237.173.123/16 dev eth2
> >>   tc qdisc add dev eth2 clsact
> >>   tc qdisc add dev eth2 root handle 1: htb default 1 offload
> >>   tc class add dev eth2 classid 1: parent root htb rate 18000mbit ceil 22500.0mbit burst 450000kbit cburst 450000kbit
> >>   tc class add dev eth2 classid 1:3 parent 1: htb rate 3596mbit burst 89900kbit cburst 89900kbit
> >>   tc qdisc replace dev eth2 root pfifo
> >>
> >> The warning is indeed triggered because the new root is pfifo rather
> >> than noop_qdisc. I agree with both points you brought up in the patch.
> >> When I saw the ternary in qdisc_graft for the rcu_assign_pointer call, I
> >> was worried about the case when new was defined as a new qdisc rather
> >> than defaulting to noop_qdisc but assumed there were some guarantees for
> >> htb.
> >
> > OK, so that means my WARN_ON relies on false guarantees.
> >
> >> However, I see a number of functions, not
> >> just offload related ones, in the htb implementation that seem to depend
> >> on the assumption that the old qdisc can safely be accessed with helpers
> >> such as htb_graft_helper. One such example is htb_change_class.
> >
> > htb_graft_helper is only used by the offload, you can see that all
> > usages are guarded by if (q->offload).
> >
> >> The
> >> trivial solution I see is to change qdisc_graft to first do a
> >> rcu_assign_pointer with noop_qdisc, call notify_and_destroy, and only
> >> afterwards call rcu_assign_pointer with the new qdisc if defined. Let me
> >> know your thoughts on this.
> >
> > I don't think it's a good idea to introduce such a change to generic
> > code just to fix HTB offload. It would basically remove the ability to
> > replace the qdisc atomically, and there will be periods of time when
> > all packets are dropped.
> >
> >> I believe the correct fix for a robust implementation of
> >> htb_destroy_class_offload would be to not depend on functions that
> >> retrieve the top level qdisc.
> >
> > This is actually the case. The source of truth is internal data
> > structures of HTB, specifically cl->leaf.q points to the queue qdisc,
> > and cl->leaf.offload_queue points to the queue itself (the latter is
> > very useful when the qdisc is noop_qdisc, and we can't retrieve
> > dev_queue from the qdisc itself). Whenever we take a qdisc from the
> > netdev_queue itself, it should be only to check consistency with the
> > source of truth (please tell me if you spotted some places where it's
> > used for something else).
> 
> Yeah, I did catch the qdisc taken from the netdev_queue used in such a
> way but only in one place. This is in htb_change_class in what appears
> to be an offload context (I misread in my earlier comment). It's just
> the call to _bstats_update though. Should it be changed to
> parent->leaf.q (since it should be the source of truth for the htb
> structure in this context) in the event the WARN_ON is encountered
> (shouldn't happen normally hence the WARN_ON to ensure this guarantee)?

I'm inclined to keep it as is. My argument above is more about the
integrity of the data structures while they are being modified in the
control flow. When we modify the tree, the internal data structure is
the source of truth, but we still need the same qdiscs to be assigned
to dev_queues for the datapath to use them, so we derive/copy
dev_queue->qdisc from the internal data structure.
(dev_queue->qdisc_sleeping only serves the purpose of updating
dev_queue->qdisc.)

Stats, on the other hand, are a datapath entity. They are produced in
the datapath, which only has access to dev_queue->qdisc. After detaching
the qdisc from the datapath (with htb_graft_helper), we read out the
accumulated stats and store them. To me, the current code makes sense.
Moreover, it's robust in the sense that when we read the stats, the
qdisc is no longer attached to the dev_queue, so the stats stand still
at this point.

That's how I see it. TL;DR: we don't use it here for the structure of
the tree; we just read some data produced by the datapath, from the
place where the datapath wrote it.

>   dev_queue = htb_offload_get_queue(parent);
>   old_q = htb_graft_helper(dev_queue, NULL);
>   WARN_ON(old_q != parent->leaf.q);
>   offload_opt = (struct tc_htb_qopt_offload) {
>     .command = TC_HTB_LEAF_TO_INNER,
>     .classid = cl->common.classid,
>     .parent_classid =
>       TC_H_MIN(parent->common.classid),
>     .rate = max_t(u64, hopt->rate.rate, rate64),
>     .ceil = max_t(u64, hopt->ceil.rate, ceil64),
>     .extack = extack,
>   };
>   err = htb_offload(dev, &offload_opt);
>   if (err) {
>     pr_err("htb: TC_HTB_LEAF_TO_INNER failed with err = %d\n",
>             err);
>     htb_graft_helper(dev_queue, old_q);
>     goto err_kill_estimator;
>   }
>   _bstats_update(&parent->bstats_bias,
>             u64_stats_read(&old_q->bstats.bytes),
>             u64_stats_read(&old_q->bstats.packets));
> 
> >
> > What I suggest to do to fix this particular bug:
> >
> > 1. Remove the WARN_ON check for noop_qdisc on destroy, it's simply
> > invalid. Rephrase the comment to explain why you WARN_ON(old != q)
> > only when !destroying.
> >
> > 2. Don't graft anything on destroy, the kernel has already replaced
> > the qdisc with the right one (noop_qdisc on delete, the new qdisc on
> > replace).
> >
> > 3. qdisc_put down below shouldn't be called when destroying, because
> > we didn't graft anything ourselves. We should still call
> > qdisc_put(old) when err == 0 to drop the ref (we hold one ref for
> > cl->leaf.q and the other ref for dev_queue->qdisc, which is normally
> > the same qdisc). And we should still graft old back on errors, but not
> > when destroying (we ignore all errors on destroy and just go on, we
> > can't fail).
> 
> Ack.
> 
> >
> >>
> >>   [  384.474535] ------------[ cut here ]------------
> >>   [  384.476685] WARNING: CPU: 2 PID: 1038 at net/sched/sch_htb.c:1561 htb_destroy_class_offload+0x179/0x430 [sch_htb]
> >>   [ 384.481217] Modules linked in: sch_htb sch_ingress xt_conntrack
> >> xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat
> >> br_netfilter overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi
> >> rdma_cm iw_cm ib_umad ib_ipoib ib_cm mlx5_ib ib_uverbs ib_core fuse mlx5_core
> >>   [  384.487081] CPU: 2 PID: 1038 Comm: tc Not tainted 6.1.0-rc2_for_upstream_min_debug_2022_10_24_15_44 #1
> >>   [  384.488414] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >>   [  384.489987] RIP: 0010:htb_destroy_class_offload+0x179/0x430 [sch_htb]
> >>   [  384.490937] Code: 2b 04 25 28 00 00 00 0f 85 cb 02 00 00 48 83 c4 48 44 89 f0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 f6 45 10 01 0f 85 26 ff ff ff <0f> 0b e9 1f ff ff ff 4d 3b 7e 40 0f 84 d9 fe ff ff 0f 0b e9 d2 fe
> >>   [  384.493495] RSP: 0018:ffff88815162b840 EFLAGS: 00010246
> >>   [  384.494358] RAX: 000000000000002a RBX: ffff88810e040000 RCX: 0000000021800002
> >>   [  384.495461] RDX: 0000000021800000 RSI: 0000000000000246 RDI: ffff88810e0404c0
> >>   [  384.496581] RBP: ffff888151ea0c00 R08: 0000000100006174 R09: ffffffff82897070
> >>   [  384.497684] R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000001
> >>   [  384.498923] R13: ffff88810b189200 R14: ffff88810b189a00 R15: ffff888110060a00
> >>   [  384.500044] FS:  00007f7a2e7a3800(0000) GS:ffff88852cc80000(0000) knlGS:0000000000000000
> >>   [  384.501390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>   [  384.502339] CR2: 0000000000487598 CR3: 0000000151f41003 CR4: 0000000000370ea0
> >>   [  384.503458] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>   [  384.504581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>   [  384.505699] Call Trace:
> >>   [  384.506231]  <TASK>
> >>   [  384.506691]  ? tcf_block_put+0x74/0xa0
> >>   [  384.507365]  htb_destroy+0x142/0x3c0 [sch_htb]
> >>   [  384.508131]  ? hrtimer_cancel+0x11/0x40
> >>   [  384.508832]  ? rtnl_is_locked+0x11/0x20
> >>   [  384.509522]  ? htb_reset+0xe3/0x1a0 [sch_htb]
> >>   [  384.510293]  qdisc_destroy+0x3b/0xd0
> >>   [  384.510943]  qdisc_graft+0x40b/0x590
> >>   [  384.511600]  tc_modify_qdisc+0x577/0x870
> >>   [  384.512309]  rtnetlink_rcv_msg+0x2a2/0x390
> >>   [  384.513031]  ? rtnl_calcit.isra.0+0x120/0x120
> >>   [  384.513806]  netlink_rcv_skb+0x54/0x100
> >>   [  384.514495]  netlink_unicast+0x1f6/0x2c0
> >>   [  384.515190]  netlink_sendmsg+0x237/0x490
> >>   [  384.515890]  sock_sendmsg+0x33/0x40
> >>   [  384.516556]  ____sys_sendmsg+0x1d1/0x1f0
> >>   [  384.517265]  ___sys_sendmsg+0x72/0xb0
> >>   [  384.517942]  ? ___sys_recvmsg+0x7c/0xb0
> >>   [  384.518631]  __sys_sendmsg+0x51/0x90
> >>   [  384.519289]  do_syscall_64+0x3d/0x90
> >>   [  384.519943]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> >>   [  384.520794] RIP: 0033:0x7f7a2eaccc17
> >>   [  384.521449] Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
> >>   [  384.524306] RSP: 002b:00007ffd8c62fe78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
> >>   [  384.525568] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7a2eaccc17
> >>   [  384.526703] RDX: 0000000000000000 RSI: 00007ffd8c62fee0 RDI: 0000000000000003
> >>   [  384.527828] RBP: 0000000063b66264 R08: 0000000000000001 R09: 00007f7a2eb8da40
> >>   [  384.528962] R10: 0000000000405aeb R11: 0000000000000246 R12: 0000000000000001
> >>   [  384.530097] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000485400
> >>   [  384.531221]  </TASK>
> >>   [  384.531689] ---[ end trace 0000000000000000 ]---
> >>
> >> >> -       else
> >> >> +       } else {
> >> >> +               old = htb_graft_helper(dev_queue, NULL);
> >> >>                 WARN_ON(old != q);
> >> >> +       }
> >> >>
> >> >>         if (cl->parent) {
> >> >>                 _bstats_update(&cl->parent->bstats_bias,
> >> >> --
> >> >> 2.36.2
> >> >>
