netdev.vger.kernel.org archive mirror
From: Eric Dumazet <eric.dumazet@gmail.com>
To: Ben Greear <greearb@candelatech.com>,
	Eric Dumazet <eric.dumazet@gmail.com>,
	netdev <netdev@vger.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?
Date: Wed, 29 Sep 2021 16:21:25 -0700
Message-ID: <b537053d-498d-928b-8ca0-e9daf5909128@gmail.com>
In-Reply-To: <88bc8a03-da44-fc15-f032-fe5cb592958b@candelatech.com>



On 9/29/21 12:07 PM, Ben Greear wrote:
> On 9/28/21 4:25 PM, Eric Dumazet wrote:
>>
>>
>> On 9/28/21 3:00 PM, Ben Greear wrote:
>>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>>> some of my hackings have caused this issue.
>>>>>>>
>>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>>> in the meantime...
>>>>>>>
>>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>>
>>>>>>
>>>>>> pktgen must not be used in a mode where a single skb
>>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>>
>>>>>> qdiscs of all sorts assume skb->next/prev can be used as
>>>>>> anchors in their lists.
>>>>>>
>>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>>
>>>>>> Please double check your clone_skb pktgen setup.
>>>>>>
>>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>>
>>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>>
>>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>>> within a minute.  I'll let it bake overnight....
>>>>
>>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>>
>>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>>> if we just haven't actually done this exact test in a while...
>>>
>>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>>> We did not try any kernels older than 5.4.
>>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>>> We cannot reproduce it with user-space UDP traffic on any kernel/qdisc combination.
>>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
>>>
>>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>>
>>
>>
>> qdiscs are protected by a qdisc spinlock.
>>
>> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
>>
>> I guess your local changes to pktgen might be to blame.
>>
>> pfifo is much simpler than fq-codel; it uses fewer fields from the skb.
> 
> I looked through my pktgen, and the skb creation and setup code looks pretty
> similar to upstream pktgen.
> 
> I also added this debugging code:
> 
> [greearb@ben-dt4 linux-5.15.dev.y]$ git diff
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index bb0cd6d3d2c2..56e22106e19d 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
>         len = 0;
>         i = 0;
>         do {
> +               if (!flow->head) {
> +                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
> +                              idx, maxbacklog, threshold, max_packets, len, i);
> +                       BUG_ON(1);
> +               }
>                 skb = dequeue_head(flow);
>                 len += qdisc_pkt_len(skb);
>                 mem += get_codel_cb(skb)->mem_usage;
> 
> The printout I see when this hits is:
> 
> 
> fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
> kernel BUG at net/sched/sch_fq_codel.c:171!
> .....
> 
> So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?
> 
> Any suggestions for what kinds of issues in pktgen could cause this?

Modifications to skbs after they were queued to the qdisc.

qdisc_pkt_len(skb) uses skb->cb[] storage. Make sure your pktgen does not use it.



Thread overview: 16+ messages
2021-09-27 23:30 5.15-rc3+ crash in fq-codel? Ben Greear
2021-09-27 23:49 ` Eric Dumazet
2021-09-28  0:04   ` Ben Greear
2021-09-28  0:16     ` Ben Greear
2021-09-28 22:00       ` Ben Greear
2021-09-28 23:25         ` Eric Dumazet
2021-09-29 19:07           ` Ben Greear
2021-09-29 23:21             ` Eric Dumazet [this message]
2021-09-29 23:28               ` Eric Dumazet
2021-09-29 23:42                 ` Eric Dumazet
2021-09-29 23:48                   ` Ben Greear
2021-09-30  0:04                     ` Ben Greear
2021-09-30  0:29                       ` Eric Dumazet
2021-09-30  0:40                         ` Eric Dumazet
2021-09-30  1:36                           ` Ben Greear
2021-09-30 16:44                             ` Ben Greear
