All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vishwanath Pai <vpai@akamai.com>
To: Cong Wang <xiyou.wangcong@gmail.com>,
	Yunsheng Lin <linyunsheng@huawei.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>,
	Jiri Pirko <jiri@resnulli.us>, David Miller <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>,
	Linux Kernel Network Developers <netdev@vger.kernel.org>,
	LKML <linux-kernel@vger.kernel.org>,
	linuxarm@huawei.com, John Fastabend <john.fastabend@gmail.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	"Hunt, Joshua" <johunt@akamai.com>
Subject: Re: Re: [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc
Date: Wed, 28 Oct 2020 13:46:10 -0400	[thread overview]
Message-ID: <05ff05ff-884e-a3b9-2186-3ba0e3e88f28@akamai.com> (raw)
In-Reply-To: <CAM_iQpU_tbRNO=Lznz_d6YjXmenYhowEfBoOiJgEmo9x8bEevw@mail.gmail.com>

On 9/17/20 3:26 PM, Cong Wang wrote:
 > On Fri, Sep 11, 2020 at 1:13 AM Yunsheng Lin <linyunsheng@huawei.com> 
wrote:
 >>
 >> On 2020/9/11 4:07, Cong Wang wrote:
 >>> On Tue, Sep 8, 2020 at 4:06 AM Yunsheng Lin 
<linyunsheng@huawei.com> wrote:
 >>>>
 >>>> Currently there is concurrent reset and enqueue operation for the
 >>>> same lockless qdisc when there is no lock to synchronize the
 >>>> q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
 >>>> qdisc_deactivate() called by dev_deactivate_queue(), which may cause
 >>>> out-of-bounds access for priv->ring[] in hns3 driver if user has
 >>>> requested a smaller queue num when __dev_xmit_skb() still enqueue a
 >>>> skb with a larger queue_mapping after the corresponding qdisc is
 >>>> reset, and call hns3_nic_net_xmit() with that skb later.
 >>>>
 >>>> Reused the existing synchronize_net() in dev_deactivate_many() to
 >>>> make sure skb with larger queue_mapping enqueued to old qdisc(which
 >>>> is saved in dev_queue->qdisc_sleeping) will always be reset when
 >>>> dev_reset_queue() is called.
 >>>>
 >>>> Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking")
 >>>> Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
 >>>> ---
 >>>> ChangeLog V2:
 >>>>         Reuse existing synchronize_net().
 >>>> ---
 >>>>  net/sched/sch_generic.c | 48 
+++++++++++++++++++++++++++++++++---------------
 >>>>  1 file changed, 33 insertions(+), 15 deletions(-)
 >>>>
 >>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
 >>>> index 265a61d..54c4172 100644
 >>>> --- a/net/sched/sch_generic.c
 >>>> +++ b/net/sched/sch_generic.c
 >>>> @@ -1131,24 +1131,10 @@ EXPORT_SYMBOL(dev_activate);
 >>>>
 >>>>  static void qdisc_deactivate(struct Qdisc *qdisc)
 >>>>  {
 >>>> -       bool nolock = qdisc->flags & TCQ_F_NOLOCK;
 >>>> -
 >>>>         if (qdisc->flags & TCQ_F_BUILTIN)
 >>>>                 return;
 >>>> -       if (test_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state))
 >>>> -               return;
 >>>> -
 >>>> -       if (nolock)
 >>>> - spin_lock_bh(&qdisc->seqlock);
 >>>> -       spin_lock_bh(qdisc_lock(qdisc));
 >>>>
 >>>>         set_bit(__QDISC_STATE_DEACTIVATED, &qdisc->state);
 >>>> -
 >>>> -       qdisc_reset(qdisc);
 >>>> -
 >>>> -       spin_unlock_bh(qdisc_lock(qdisc));
 >>>> -       if (nolock)
 >>>> - spin_unlock_bh(&qdisc->seqlock);
 >>>>  }
 >>>>
 >>>>  static void dev_deactivate_queue(struct net_device *dev,
 >>>> @@ -1165,6 +1151,30 @@ static void dev_deactivate_queue(struct 
net_device *dev,
 >>>>         }
 >>>>  }
 >>>>
 >>>> +static void dev_reset_queue(struct net_device *dev,
 >>>> +                           struct netdev_queue *dev_queue,
 >>>> +                           void *_unused)
 >>>> +{
 >>>> +       struct Qdisc *qdisc;
 >>>> +       bool nolock;
 >>>> +
 >>>> +       qdisc = dev_queue->qdisc_sleeping;
 >>>> +       if (!qdisc)
 >>>> +               return;
 >>>> +
 >>>> +       nolock = qdisc->flags & TCQ_F_NOLOCK;
 >>>> +
 >>>> +       if (nolock)
 >>>> + spin_lock_bh(&qdisc->seqlock);
 >>>> +       spin_lock_bh(qdisc_lock(qdisc));
 >>>
 >>>
 >>> I think you do not need this lock for lockless one.
 >>
 >> It seems so.
 >> Maybe another patch to remove qdisc_lock(qdisc) for lockless
 >> qdisc?
 >
 > Yeah, but not sure if we still want this lockless qdisc any more,
 > it brings more troubles than gains.
 >
 >>
 >>
 >>>
 >>>> +
 >>>> +       qdisc_reset(qdisc);
 >>>> +
 >>>> +       spin_unlock_bh(qdisc_lock(qdisc));
 >>>> +       if (nolock)
 >>>> + spin_unlock_bh(&qdisc->seqlock);
 >>>> +}
 >>>> +
 >>>>  static bool some_qdisc_is_busy(struct net_device *dev)
 >>>>  {
 >>>>         unsigned int i;
 >>>> @@ -1213,12 +1223,20 @@ void dev_deactivate_many(struct list_head 
*head)
 >>>>                 dev_watchdog_down(dev);
 >>>>         }
 >>>>
 >>>> -       /* Wait for outstanding qdisc-less dev_queue_xmit calls.
 >>>> +       /* Wait for outstanding qdisc-less dev_queue_xmit calls or
 >>>> +        * outstanding qdisc enqueuing calls.
 >>>>          * This is avoided if all devices are in dismantle phase :
 >>>>          * Caller will call synchronize_net() for us
 >>>>          */
 >>>>         synchronize_net();
 >>>>
 >>>> +       list_for_each_entry(dev, head, close_list) {
 >>>> +               netdev_for_each_tx_queue(dev, dev_reset_queue, NULL);
 >>>> +
 >>>> +               if (dev_ingress_queue(dev))
 >>>> +                       dev_reset_queue(dev, 
dev_ingress_queue(dev), NULL);
 >>>> +       }
 >>>> +
 >>>>         /* Wait for outstanding qdisc_run calls. */
 >>>>         list_for_each_entry(dev, head, close_list) {
 >>>>                 while (some_qdisc_is_busy(dev)) {
 >>>
 >>> Do you want to reset before waiting for TX action?
 >>>
 >>> I think it is safer to do it after, at least prior to commit 759ae57f1b
 >>> we did after.
 >>
 >> The reference to the txq->qdisc is always protected by RCU, so the 
synchronize_net()
 >> should be enought to ensure there is no skb enqueued to the old 
qdisc that is saved
 >> in the dev_queue->qdisc_sleeping, because __dev_queue_xmit can only 
see the new qdisc
 >> after synchronize_net(), which is noop_qdisc, and noop_qdisc will 
make sure any skb
 >> enqueued to it will be dropped and freed, right?
 >
 > Hmm? In net_tx_action(), we do not hold RCU read lock, and we do not
 > reference qdisc via txq->qdisc but via sd->output_queue.
 >
 >
 >>
 >> If we do any additional reset that is not related to qdisc in 
dev_reset_queue(), we
 >> can move it after some_qdisc_is_busy() checking.
 >
 > I am not suggesting to do an additional reset, I am suggesting to move
 > your reset after the busy waiting.
 >
 > Thanks.

Re-sending this, looks like my previous email did not get delivered
somehow.

We noticed some problems when testing the latest 5.4 LTS kernel and
traced it back to this commit using git bisect. When running our tests
the machine stops responding to all traffic and the only way to recover
is a reboot. I do not see a stack trace on the console.

This can be reproduced using the packetdrill test below, it should be
run a few times or in a loop. You should hit this issue within a few
tries but sometimes might take up to 15-20 tries.

*** TEST BEGIN ***

0 `echo 4 > /proc/sys/net/ipv4/tcp_min_tso_segs`

0.400 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.400 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

// set maxseg to 1000 to work with both ipv4 and ipv6
0.500 setsockopt(3, SOL_TCP, TCP_MAXSEG, [1000], 4) = 0
0.500 bind(3, ..., ...) = 0
0.500 listen(3, 1) = 0

// Establish connection
0.600 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 5>
0.600 > S. 0:0(0) ack 1 <...>

0.800 < . 1:1(0) ack 1 win 320
0.800 accept(3, ..., ...) = 4

// Send 4 data segments.
+0 write(4, ..., 4000) = 4000
+0 > P. 1:4001(4000) ack 1

// Receive a SACK
+.1 < . 1:1(0) ack 1 win 320 <sack 1001:2001,nop,nop>

+.3 %{ print "TCP CA state: ",tcpi_ca_state  }%

*** TEST END ***

I can reproduce the issue easily on v5.4.68, and after reverting this 
commit it
does not happen anymore.

Thanks,
Vishwanath


  parent reply	other threads:[~2020-10-28 22:25 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-09-08 11:02 [PATCH v2 net] net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc Yunsheng Lin
2020-09-10 19:39 ` David Miller
2020-09-10 20:07 ` Cong Wang
2020-09-11  8:13   ` Yunsheng Lin
2020-09-11  8:25     ` Yunsheng Lin
2020-09-17 19:26     ` Cong Wang
     [not found]       ` <CAP12E-+3DY-dgzVercKc-NYGPExWO1NjTOr1Gf3tPLKvp6O6+g@mail.gmail.com>
2020-10-28 15:37         ` Pai, Vishwanath
2020-10-28 17:47           ` Cong Wang
2020-10-28 20:04             ` Vishwanath Pai
2020-10-29  2:37               ` Yunsheng Lin
2020-10-29  4:50                 ` Vishwanath Pai
2020-10-29 10:24                   ` Yunsheng Lin
2020-10-29 17:20                     ` Vishwanath Pai
2020-11-02  9:08                       ` Yunsheng Lin
2020-11-02 18:23                         ` Vishwanath Pai
2020-10-28 17:46       ` Vishwanath Pai [this message]
2020-10-29  2:52       ` Yunsheng Lin
2020-10-29 19:05         ` Cong Wang
2020-10-30  7:37           ` Yunsheng Lin
2020-11-02 16:55             ` Cong Wang
2020-11-03  7:24               ` Yunsheng Lin
2020-11-05  6:04                 ` Cong Wang
2020-11-05  6:16                   ` Cong Wang
2020-11-05  6:32                     ` Yunsheng Lin
2020-11-05  6:22                   ` Yunsheng Lin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=05ff05ff-884e-a3b9-2186-3ba0e3e88f28@akamai.com \
    --to=vpai@akamai.com \
    --cc=davem@davemloft.net \
    --cc=eric.dumazet@gmail.com \
    --cc=jhs@mojatatu.com \
    --cc=jiri@resnulli.us \
    --cc=john.fastabend@gmail.com \
    --cc=johunt@akamai.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=linyunsheng@huawei.com \
    --cc=netdev@vger.kernel.org \
    --cc=xiyou.wangcong@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.