From: Vlad Buslov <vladbu@nvidia.com>
To: Peilin Ye <yepeilin.cs@gmail.com>, Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jakub Kicinski <kuba@kernel.org>,
	Pedro Tammela <pctammela@mojatatu.com>,
	"David S. Miller" <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>,
	Paolo Abeni <pabeni@redhat.com>,
	Cong Wang <xiyou.wangcong@gmail.com>,
	Jiri Pirko <jiri@resnulli.us>,
	Peilin Ye <peilin.ye@bytedance.com>,
	Daniel Borkmann <daniel@iogearbox.net>,
	"John Fastabend" <john.fastabend@gmail.com>,
	Hillf Danton <hdanton@sina.com>, <netdev@vger.kernel.org>,
	Cong Wang <cong.wang@bytedance.com>
Subject: Re: [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting
Date: Mon, 29 May 2023 15:58:50 +0300
Message-ID: <87fs7fxov6.fsf@nvidia.com>
In-Reply-To: <87jzwrxrz8.fsf@nvidia.com>

On Mon 29 May 2023 at 14:50, Vlad Buslov <vladbu@nvidia.com> wrote:
> On Sun 28 May 2023 at 14:54, Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>> On Sat, May 27, 2023 at 4:23 AM Peilin Ye <yepeilin.cs@gmail.com> wrote:
>>>
>>> Hi Jakub and all,
>>>
>>> On Fri, May 26, 2023 at 07:33:24PM -0700, Jakub Kicinski wrote:
>>> > On Fri, 26 May 2023 16:09:51 -0700 Peilin Ye wrote:
>>> > > Thanks a lot, I'll get right on it.
>>> >
>>> > Any insights? Is it just a live-lock inherent to the retry scheme
>>> > or we actually forget to release the lock/refcnt?
>>>
>>> I think it's just a thread holding the RTNL mutex for too long (replaying
>>> too many times).  We could replay an arbitrary number of times in
>>> tc_{modify,get}_qdisc() if the user keeps sending RTNL-unlocked filter
>>> requests for the old Qdisc.
>
> After looking very carefully at the code I think I know what the issue
> might be:
>
>    Task 1 graft Qdisc   Task 2 new filter
>            +                    +
>            |                    |
>            v                    v
>         rtnl_lock()       take  q->refcnt
>            +                    +
>            |                    |
>            v                    v
> Spin while q->refcnt!=1   Block on rtnl_lock() indefinitely due to -EAGAIN
>
> This will cause a real deadlock with the proposed patch. I'll try to
> come up with a better approach. Sorry for not seeing it earlier.
>

Followup: I considered two approaches for preventing the deadlock:

- Refactor cls_api to always obtain the lock before taking a reference
  to the Qdisc. I started implementing a PoC that moves the rtnl_lock()
  call in tc_new_tfilter() before __tcf_qdisc_find(), but decided it is
  not feasible because cls_api will still try to obtain rtnl_lock when
  offloading a filter to a device with a non-unlocked driver, or after
  releasing the lock when loading a classifier module.

- Account for such cls_api behavior in sch_api by dropping and
  re-taking the lock before replaying (see the sketch below). This
  actually seems quite straightforward, since the 'replay' functionality
  we are reusing here is designed for similar behavior - it releases the
  rtnl lock before loading a sch module, takes the lock again and safely
  replays the function by re-obtaining all the necessary data.
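
  For reference, a simplified sketch of the caller-side flow I have in
  mind - this is roughly how tc_modify_qdisc() already replays when
  qdisc_create() drops rtnl to load a module, so the names below match
  existing sch_api code, but it is only a sketch, not the exact code:

	replay:
		/* Re-parse attributes and re-lookup dev/qdisc, since any of
		 * them may have changed while the lock was dropped.
		 */
		...
		err = qdisc_graft(...);
		if (err == -EAGAIN) {
			/* qdisc_graft() dropped and re-took rtnl_lock() before
			 * returning, giving the concurrent filter request that
			 * holds q->refcnt a chance to finish.
			 */
			goto replay;
		}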

If livelock with concurrent filter insertions is an issue, then it can
be remedied by setting a new Qdisc->flags bit
"DELETED-REJECT-NEW-FILTERS" and checking for it together with
QDISC_CLASS_OPS_DOIT_UNLOCKED, in order to force any concurrent filter
insertion that arrives after the flag is set to synchronize on the rtnl
lock.
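
Something along these lines in the tc_new_tfilter() "can we stay
unlocked?" decision (TCQ_F_DELETED_REJECT_NEW_FILTERS is only a
placeholder name for the proposed bit - it does not exist today - and
the exact location of the check is up for discussion):

	if (rtnl_held ||
	    (q && !(q->ops->cl_ops->flags & QDISC_CLASS_OPS_DOIT_UNLOCKED)) ||
	    (q && (q->flags & TCQ_F_DELETED_REJECT_NEW_FILTERS))) {
		/* Fall back to the rtnl-locked path so the filter request
		 * synchronizes with the ongoing qdisc replacement.
		 */
		rtnl_held = true;
		rtnl_lock();
	}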

Thoughts?

>>>
>>> I tested the new reproducer Pedro posted, on:
>>>
>>> 1. All 6 v5 patches, FWIW, which caused a hang similar to the one Pedro reported
>>>
>>> 2. First 5 v5 patches, plus patch 6 in v1 (no replaying), did not trigger
>>>    any issues (in about 30 minutes).
>>>
>>> 3. All 6 v5 patches, plus this diff:
>>>
>>> diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
>>> index 286b7c58f5b9..988718ba5abe 100644
>>> --- a/net/sched/sch_api.c
>>> +++ b/net/sched/sch_api.c
>>> @@ -1090,8 +1090,11 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
>>>                          * RTNL-unlocked filter request(s).  This is the counterpart of that
>>>                          * qdisc_refcount_inc_nz() call in __tcf_qdisc_find().
>>>                          */
>>> -                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping))
>>> +                       if (!qdisc_refcount_dec_if_one(dev_queue->qdisc_sleeping)) {
>>> +                               rtnl_unlock();
>>> +                               rtnl_lock();
>>>                                 return -EAGAIN;
>>> +                       }
>>>                 }
>>>
>>>                 if (dev->flags & IFF_UP)
>>>
>>>    Did not trigger any issues (in about 30 minutes) either.
>>>
>>> What would you suggest?
>>
>>
>> I am more worried that it is a whack-a-mole situation. We fixed the
>> first reproducer with essentially patches 1-4, but we opened a new one
>> which the second reproducer catches. One thing the current reproducer
>> does is create a lot of rtnl contention in the beginning by creating
>> all those devices; after that it just creates/deletes qdiscs and does
>> flower updates, where such contention is reduced. I.e. it may just
>> take longer for the mole to pop up.
>>
>> Why don't we push the V1 patch in and then worry about getting clever
>> with EAGAIN afterwards? Can you test the V1 version with the repro
>> Pedro posted? It shouldn't have these issues. Also, it would be
>> interesting to see how the performance of parallel flower updates is
>> affected.
>
> This, or at least push the first 4 patches of this series. They target
> other, older commits and fix straightforward issues with the API.


Thread overview: 45+ messages
2023-05-24  1:16 [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Peilin Ye
2023-05-24  1:17 ` [PATCH v5 net 1/6] net/sched: sch_ingress: Only create under TC_H_INGRESS Peilin Ye
2023-05-24 15:37   ` Pedro Tammela
2023-05-24 15:57   ` Jamal Hadi Salim
2023-05-24  1:18 ` [PATCH v5 net 2/6] net/sched: sch_clsact: Only create under TC_H_CLSACT Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24 15:58     ` Jamal Hadi Salim
2023-05-24  1:19 ` [PATCH v5 net 3/6] net/sched: Reserve TC_H_INGRESS (TC_H_CLSACT) for ingress (clsact) Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:19 ` [PATCH v5 net 4/6] net/sched: Prohibit regrafting ingress or clsact Qdiscs Peilin Ye
2023-05-24 15:38   ` Pedro Tammela
2023-05-24  1:20 ` [PATCH v5 net 5/6] net/sched: Refactor qdisc_graft() for ingress and " Peilin Ye
2023-05-24  1:20 ` [PATCH v5 net 6/6] net/sched: qdisc_destroy() old ingress and clsact Qdiscs before grafting Peilin Ye
2023-05-24 15:39   ` Pedro Tammela
2023-05-24 16:09     ` Jamal Hadi Salim
2023-05-25  9:25       ` Paolo Abeni
2023-05-26 12:19         ` Jamal Hadi Salim
2023-05-26 12:20     ` Jamal Hadi Salim
2023-05-26 19:47       ` Jamal Hadi Salim
2023-05-26 20:21         ` Pedro Tammela
2023-05-26 23:09           ` Peilin Ye
2023-05-27  2:33             ` Jakub Kicinski
2023-05-27  8:23               ` Peilin Ye
2023-05-28 18:54                 ` Jamal Hadi Salim
2023-05-29 11:50                   ` Vlad Buslov
2023-05-29 12:58                     ` Vlad Buslov [this message]
2023-05-30  1:03                       ` Jakub Kicinski
2023-05-30  9:11                       ` Peilin Ye
2023-05-30 12:18                         ` Vlad Buslov
2023-05-31  0:29                           ` Peilin Ye
2023-06-01  3:57                           ` Peilin Ye
2023-06-01  6:20                             ` Vlad Buslov
2023-06-07  0:57                               ` Peilin Ye
2023-06-07  8:18                                 ` Vlad Buslov
2023-06-08  1:08                                   ` Peilin Ye
2023-06-08  7:48                                     ` Vlad Buslov
2023-06-11  3:25                                       ` Peilin Ye
2023-06-08  0:39                               ` Peilin Ye
2023-06-08  9:17                                 ` Vlad Buslov
2023-06-10  0:20                                   ` Peilin Ye
2023-06-01 13:03                             ` Pedro Tammela
2023-06-07  4:25                               ` Peilin Ye
2023-05-29 13:55                     ` Jamal Hadi Salim
2023-05-29 19:14                       ` Peilin Ye
2023-05-25 17:16 ` [PATCH v5 net 0/6] net/sched: Fixes for sch_ingress and sch_clsact Vlad Buslov
