netdev.vger.kernel.org archive mirror
From: Ben Greear <greearb@candelatech.com>
To: Eric Dumazet <eric.dumazet@gmail.com>, netdev <netdev@vger.kernel.org>
Subject: Re: 5.15-rc3+ crash in fq-codel?
Date: Wed, 29 Sep 2021 12:07:04 -0700
Message-ID: <88bc8a03-da44-fc15-f032-fe5cb592958b@candelatech.com>
In-Reply-To: <7f1d67f1-3a2c-2e74-bb86-c02a56370526@gmail.com>

On 9/28/21 4:25 PM, Eric Dumazet wrote:
> 
> 
> On 9/28/21 3:00 PM, Ben Greear wrote:
>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>> some of my hackings have caused this issue.
>>>>>>
>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>> in the meantime...
>>>>>>
>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>
>>>>>
>>>>> pktgen must not be used in a mode where a single skb
>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>
>>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>>> anchor in their list.
>>>>>
>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>
>>>>> Please double check your clone_skb pktgen setup.
>>>>>
>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>
>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>
>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>> within a minute.  I'll let it bake overnight....
>>>
>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>
>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>> if we just haven't actually done this exact test in a while...
>>
>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>> We did not try any kernels older than 5.4.
>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>> We cannot reproduce with user-space UDP traffic on any kernel/qdisc combination.
>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb).
>>
>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>
> 
> 
> qdisc are protected by a qdisc spinlock.
> 
> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
> 
> I guess your local changes to pktgen might be to blame.
> 
> pfifo is much simpler than fq-codel; it uses fewer fields from skb.
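
For reference, here is a minimal userspace sketch of the failure mode Eric
described earlier in the thread: a per-flow FIFO that uses the packet's own
next pointer as its list anchor loses packets if the same packet is queued
twice. The names struct pkt, struct flow, flow_add() and flow_walk() are
hypothetical stand-ins, not kernel APIs, and this is not a claim about what
the pktgen setup in this report actually does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for struct sk_buff and a per-flow FIFO. */
struct pkt {
	struct pkt *next;
	unsigned int len;
};

struct flow {
	struct pkt *head;
	struct pkt *tail;
	unsigned int qlen;	/* packets the accounting believes are queued */
};

/* Append p at the tail; p->next is overwritten, like skb->next would be. */
static void flow_add(struct flow *f, struct pkt *p)
{
	assert(p != NULL);
	if (f->head == NULL)
		f->head = p;
	else
		f->tail->next = p;
	f->tail = p;
	p->next = NULL;
	f->qlen++;
}

/* Count the packets actually reachable from head. */
static unsigned int flow_walk(const struct flow *f)
{
	unsigned int n = 0;

	for (const struct pkt *p = f->head; p != NULL; p = p->next)
		n++;
	return n;
}

/*
 * Queue A, B, then A again. Returns the number of reachable packets and
 * stores the accounting counter in *qlen.
 */
static unsigned int double_enqueue_demo(unsigned int *qlen)
{
	struct pkt a = { NULL, 100 }, b = { NULL, 100 };
	struct flow f = { NULL, NULL, 0 };

	flow_add(&f, &a);
	flow_add(&f, &b);
	flow_add(&f, &a);	/* same packet queued a second time */

	*qlen = f.qlen;
	return flow_walk(&f);
}
```

In this model the second flow_add() of A resets A's next pointer to NULL, so B
is no longer reachable from head: the counter says 3 packets while the list
holds 1.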

I looked through my pktgen, and the skb creation and setup code looks pretty
similar to upstream pktgen.

I also added this debugging code:

[greearb@ben-dt4 linux-5.15.dev.y]$ git diff
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index bb0cd6d3d2c2..56e22106e19d 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
         len = 0;
         i = 0;
         do {
+               if (!flow->head) {
+                       pr_err("fq-codel-drop: idx: %u maxbacklog: %u  threshold: %u max_packets: %u len: %u i: %u\n",
+                              idx, maxbacklog, threshold, max_packets, len, i);
+                       BUG();
+               }
                 skb = dequeue_head(flow);
                 len += qdisc_pkt_len(skb);
                 mem += get_codel_cb(skb)->mem_usage;

The printout I see when this hits is:


fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
kernel BUG at net/sched/sch_fq_codel.c:171!
.....

So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?
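
A rough userspace model of the drop loop, replaying the numbers from the
printout, shows the mismatch. The assumptions here are mine: all packets are
1514 bytes (inferred from len / i = 93868 / 62), and the names struct pkt,
struct flow, drop_from_flow() and replay_reported_state() are hypothetical.
The real fq_codel_drop() also does memory accounting, stats updates and
freeing, all of which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sk_buff and struct fq_codel_flow. */
struct pkt {
	struct pkt *next;
	unsigned int len;
};

struct flow {
	struct pkt *head;
	struct pkt *tail;
};

struct drop_result {
	unsigned int dropped;	/* packets removed so far ("i" in the printout) */
	unsigned int bytes;	/* bytes removed so far ("len" in the printout) */
	int list_ran_dry;	/* list emptied before either loop limit was hit */
};

/*
 * Drop from the head until half the claimed backlog is gone or max_packets
 * is reached, but stop cleanly instead of dereferencing a NULL head.
 */
static struct drop_result drop_from_flow(struct flow *f, unsigned int backlog,
					 unsigned int max_packets)
{
	struct drop_result r = { 0, 0, 0 };
	unsigned int threshold = backlog >> 1;

	assert(f != NULL);
	do {
		struct pkt *p = f->head;

		if (p == NULL) {	/* counter and list disagree */
			r.list_ran_dry = 1;
			break;
		}
		f->head = p->next;
		p->next = NULL;
		r.bytes += p->len;
	} while (++r.dropped < max_packets && r.bytes < threshold);

	return r;
}

/* Build a flow with 62 packets but claim the backlog from the printout. */
static struct drop_result replay_reported_state(void)
{
	enum { QUEUED = 62 };
	static struct pkt pkts[QUEUED];
	struct flow f = { NULL, NULL };

	for (int i = 0; i < QUEUED; i++) {
		pkts[i].len = 1514;	/* assumed uniform packet size */
		pkts[i].next = NULL;
		if (f.head == NULL)
			f.head = &pkts[i];
		else
			f.tail->next = &pkts[i];
		f.tail = &pkts[i];
	}
	return drop_from_flow(&f, 7756222u, 64);
}
```

Under the 1514-byte assumption, the backlog counter corresponds to
7756222 / 1514 = 5123 packets, while only 62 were actually on the list.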

Any suggestions for what kinds of issues in pktgen could cause this?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

