All of lore.kernel.org
 help / color / mirror / Atom feed
* 5.15-rc3+ crash in fq-codel?
@ 2021-09-27 23:30 Ben Greear
  2021-09-27 23:49 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-27 23:30 UTC (permalink / raw)
  To: netdev

Hello,

In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
pktgen so I cannot easily run this test on stock kernels, and there is some chance
some of my hackings have caused this issue.

But, in case others have seen similar, please let me know.  I shall go digging
in the meantime...

Looks to me like 'skb' is NULL in line 120 below.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./net/sched/sch_fq_codel.ko...done.
"/home/greearb/kernel/2.6/linux-5.15.x64/vmlinux" is not a core dump: file format not recognized
(gdb) l *(fq_codel_enqueue+0x24b)
0x76b is in fq_codel_enqueue (/home/greearb/git/linux-5.15.dev.y/net/sched/sch_fq_codel.c:120).
115	/* remove one skb from head of slot queue */
116	static inline struct sk_buff *dequeue_head(struct fq_codel_flow *flow)
117	{
118		struct sk_buff *skb = flow->head;
119	
120		flow->head = skb->next;
121		skb_mark_not_on_list(skb);
122		return skb;
123	}
124	
(gdb)


BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 3 PID: 2077 Comm: kpktgend_3 Not tainted 5.15.0-rc3+ #2
Hardware name: Default string Default string/SKYBAY, BIOS 5.12 02/19/2019
RIP: 0010:fq_codel_enqueue+0x24b/0x380 [sch_fq_codel]
Code: e0 02 48 89 44 24 08 49 c1 e0 06 4c 03 83 50 01 00 00 45 31 f6 45 31 c9 31 c9 89 74 24 10 eb 04 39 fa 73 33 49 8b 00 83 c13
RSP: 0018:ffffc9000030fd10 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff88810a78f600 RCX: 0000000000000032
RDX: 00000000000121ca RSI: ffff88812d716900 RDI: 00000000003b26f5
RBP: ffffc9000030fd78 R08: ffff8881311dd340 R09: 00000000000121ca
R10: 000000000000034d R11: 0000000001680900 R12: ffffc9000030fde0
R13: 000000000001b900 R14: 000000000001b900 R15: 0000000000000040
FS:  0000000000000000(0000) GS:ffff888265cc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000000260f003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  dev_qdisc_enqueue+0x35/0x90
  __dev_queue_xmit+0x647/0xb70
  macvlan_start_xmit+0x4a/0x110 [macvlan]
  pktgen_thread_worker+0x19fe/0x20ed [pktgen]
  ? wait_woken+0x60/0x60
  ? pktgen_rem_all_ifs+0x70/0x70 [pktgen]
  kthread+0x11e/0x150
  ? set_kthread_struct+0x40/0x40
  ret_from_fork+0x1f/0x30


Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-27 23:30 5.15-rc3+ crash in fq-codel? Ben Greear
@ 2021-09-27 23:49 ` Eric Dumazet
  2021-09-28  0:04   ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-27 23:49 UTC (permalink / raw)
  To: Ben Greear, netdev



On 9/27/21 4:30 PM, Ben Greear wrote:
> Hello,
> 
> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
> pktgen so I cannot easily run this test on stock kernels, and there is some chance
> some of my hackings have caused this issue.
> 
> But, in case others have seen similar, please let me know.  I shall go digging
> in the meantime...
> 
> Looks to me like 'skb' is NULL in line 120 below.


pktgen must not be used in a mode where a single skb
is cloned and reused, if packet needs to be stored in a qdisc.

qdisc of all sorts assume skb->next/prev can be used as
anchor in their list.

If the same skb is queued multiple times, lists are corrupted.

Please double check your clone_skb pktgen setup.

I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.

> 
> For help, type "help".
> Type "apropos word" to search for commands related to "word"...
> Reading symbols from ./net/sched/sch_fq_codel.ko...done.
> "/home/greearb/kernel/2.6/linux-5.15.x64/vmlinux" is not a core dump: file format not recognized
> (gdb) l *(fq_codel_enqueue+0x24b)
> 0x76b is in fq_codel_enqueue (/home/greearb/git/linux-5.15.dev.y/net/sched/sch_fq_codel.c:120).
> 115    /* remove one skb from head of slot queue */
> 116    static inline struct sk_buff *dequeue_head(struct fq_codel_flow *flow)
> 117    {
> 118        struct sk_buff *skb = flow->head;
> 119   
> 120        flow->head = skb->next;
> 121        skb_mark_not_on_list(skb);
> 122        return skb;
> 123    }
> 124   
> (gdb)
> 
> 
> BUG: kernel NULL pointer dereference, address: 0000000000000000
> #PF: supervisor read access in kernel mode
> #PF: error_code(0x0000) - not-present page
> PGD 0 P4D 0
> Oops: 0000 [#1] PREEMPT SMP
> CPU: 3 PID: 2077 Comm: kpktgend_3 Not tainted 5.15.0-rc3+ #2
> Hardware name: Default string Default string/SKYBAY, BIOS 5.12 02/19/2019
> RIP: 0010:fq_codel_enqueue+0x24b/0x380 [sch_fq_codel]
> Code: e0 02 48 89 44 24 08 49 c1 e0 06 4c 03 83 50 01 00 00 45 31 f6 45 31 c9 31 c9 89 74 24 10 eb 04 39 fa 73 33 49 8b 00 83 c13
> RSP: 0018:ffffc9000030fd10 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: ffff88810a78f600 RCX: 0000000000000032
> RDX: 00000000000121ca RSI: ffff88812d716900 RDI: 00000000003b26f5
> RBP: ffffc9000030fd78 R08: ffff8881311dd340 R09: 00000000000121ca
> R10: 000000000000034d R11: 0000000001680900 R12: ffffc9000030fde0
> R13: 000000000001b900 R14: 000000000001b900 R15: 0000000000000040
> FS:  0000000000000000(0000) GS:ffff888265cc0000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000000000 CR3: 000000000260f003 CR4: 00000000003706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  dev_qdisc_enqueue+0x35/0x90
>  __dev_queue_xmit+0x647/0xb70
>  macvlan_start_xmit+0x4a/0x110 [macvlan]
>  pktgen_thread_worker+0x19fe/0x20ed [pktgen]
>  ? wait_woken+0x60/0x60
>  ? pktgen_rem_all_ifs+0x70/0x70 [pktgen]
>  kthread+0x11e/0x150
>  ? set_kthread_struct+0x40/0x40
>  ret_from_fork+0x1f/0x30
> 
> 
> Thanks,
> Ben
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-27 23:49 ` Eric Dumazet
@ 2021-09-28  0:04   ` Ben Greear
  2021-09-28  0:16     ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-28  0:04 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/27/21 4:49 PM, Eric Dumazet wrote:
> 
> 
> On 9/27/21 4:30 PM, Ben Greear wrote:
>> Hello,
>>
>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>> some of my hackings have caused this issue.
>>
>> But, in case others have seen similar, please let me know.  I shall go digging
>> in the meantime...
>>
>> Looks to me like 'skb' is NULL in line 120 below.
> 
> 
> pktgen must not be used in a mode where a single skb
> is cloned and reused, if packet needs to be stored in a qdisc.
> 
> qdisc of all sorts assume skb->next/prev can be used as
> anchor in their list.
> 
> If the same skb is queued multiple times, lists are corrupted.
> 
> Please double check your clone_skb pktgen setup.
> 
> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.

My pktgen config was not using any duplicated queueing in this case.

I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
within a minute.  I'll let it bake overnight....

Thanks,
Ben

> 
>>
>> For help, type "help".
>> Type "apropos word" to search for commands related to "word"...
>> Reading symbols from ./net/sched/sch_fq_codel.ko...done.
>> "/home/greearb/kernel/2.6/linux-5.15.x64/vmlinux" is not a core dump: file format not recognized
>> (gdb) l *(fq_codel_enqueue+0x24b)
>> 0x76b is in fq_codel_enqueue (/home/greearb/git/linux-5.15.dev.y/net/sched/sch_fq_codel.c:120).
>> 115    /* remove one skb from head of slot queue */
>> 116    static inline struct sk_buff *dequeue_head(struct fq_codel_flow *flow)
>> 117    {
>> 118        struct sk_buff *skb = flow->head;
>> 119
>> 120        flow->head = skb->next;
>> 121        skb_mark_not_on_list(skb);
>> 122        return skb;
>> 123    }
>> 124
>> (gdb)
>>
>>
>> BUG: kernel NULL pointer dereference, address: 0000000000000000
>> #PF: supervisor read access in kernel mode
>> #PF: error_code(0x0000) - not-present page
>> PGD 0 P4D 0
>> Oops: 0000 [#1] PREEMPT SMP
>> CPU: 3 PID: 2077 Comm: kpktgend_3 Not tainted 5.15.0-rc3+ #2
>> Hardware name: Default string Default string/SKYBAY, BIOS 5.12 02/19/2019
>> RIP: 0010:fq_codel_enqueue+0x24b/0x380 [sch_fq_codel]
>> Code: e0 02 48 89 44 24 08 49 c1 e0 06 4c 03 83 50 01 00 00 45 31 f6 45 31 c9 31 c9 89 74 24 10 eb 04 39 fa 73 33 49 8b 00 83 c13
>> RSP: 0018:ffffc9000030fd10 EFLAGS: 00010202
>> RAX: 0000000000000000 RBX: ffff88810a78f600 RCX: 0000000000000032
>> RDX: 00000000000121ca RSI: ffff88812d716900 RDI: 00000000003b26f5
>> RBP: ffffc9000030fd78 R08: ffff8881311dd340 R09: 00000000000121ca
>> R10: 000000000000034d R11: 0000000001680900 R12: ffffc9000030fde0
>> R13: 000000000001b900 R14: 000000000001b900 R15: 0000000000000040
>> FS:  0000000000000000(0000) GS:ffff888265cc0000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000000000000 CR3: 000000000260f003 CR4: 00000000003706e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>   dev_qdisc_enqueue+0x35/0x90
>>   __dev_queue_xmit+0x647/0xb70
>>   macvlan_start_xmit+0x4a/0x110 [macvlan]
>>   pktgen_thread_worker+0x19fe/0x20ed [pktgen]
>>   ? wait_woken+0x60/0x60
>>   ? pktgen_rem_all_ifs+0x70/0x70 [pktgen]
>>   kthread+0x11e/0x150
>>   ? set_kthread_struct+0x40/0x40
>>   ret_from_fork+0x1f/0x30
>>
>>
>> Thanks,
>> Ben
>>
> 


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-28  0:04   ` Ben Greear
@ 2021-09-28  0:16     ` Ben Greear
  2021-09-28 22:00       ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-28  0:16 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/27/21 5:04 PM, Ben Greear wrote:
> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>
>>
>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>> Hello,
>>>
>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>> some of my hackings have caused this issue.
>>>
>>> But, in case others have seen similar, please let me know.  I shall go digging
>>> in the meantime...
>>>
>>> Looks to me like 'skb' is NULL in line 120 below.
>>
>>
>> pktgen must not be used in a mode where a single skb
>> is cloned and reused, if packet needs to be stored in a qdisc.
>>
>> qdisc of all sorts assume skb->next/prev can be used as
>> anchor in their list.
>>
>> If the same skb is queued multiple times, lists are corrupted.
>>
>> Please double check your clone_skb pktgen setup.
>>
>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
> 
> My pktgen config was not using any duplicated queueing in this case.
> 
> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
> within a minute.  I'll let it bake overnight....

Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
this problem (next most recent kernel we might have run similar test on would be 5.13-ish).

I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
if we just haven't actually done this exact test in a while...

Thanks,
Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-28  0:16     ` Ben Greear
@ 2021-09-28 22:00       ` Ben Greear
  2021-09-28 23:25         ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-28 22:00 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/27/21 5:16 PM, Ben Greear wrote:
> On 9/27/21 5:04 PM, Ben Greear wrote:
>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>> Hello,
>>>>
>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>> some of my hackings have caused this issue.
>>>>
>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>> in the meantime...
>>>>
>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>
>>>
>>> pktgen must not be used in a mode where a single skb
>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>
>>> qdisc of all sorts assume skb->next/prev can be used as
>>> anchor in their list.
>>>
>>> If the same skb is queued multiple times, lists are corrupted.
>>>
>>> Please double check your clone_skb pktgen setup.
>>>
>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>
>> My pktgen config was not using any duplicated queueing in this case.
>>
>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>> within a minute.  I'll let it bake overnight....
> 
> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
> 
> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
> if we just haven't actually done this exact test in a while...

We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
We did not try any kernels older than 5.4.
We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
We cannot produce with user-space UDP traffic on any kernel/qdisc combination.
Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)

While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
Any chance that it makes assumptions that would be incorrect with pktgen running multiple
threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?

Thanks,
Ben


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-28 22:00       ` Ben Greear
@ 2021-09-28 23:25         ` Eric Dumazet
  2021-09-29 19:07           ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-28 23:25 UTC (permalink / raw)
  To: Ben Greear, Eric Dumazet, netdev



On 9/28/21 3:00 PM, Ben Greear wrote:
> On 9/27/21 5:16 PM, Ben Greear wrote:
>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>> Hello,
>>>>>
>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>> some of my hackings have caused this issue.
>>>>>
>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>> in the meantime...
>>>>>
>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>
>>>>
>>>> pktgen must not be used in a mode where a single skb
>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>
>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>> anchor in their list.
>>>>
>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>
>>>> Please double check your clone_skb pktgen setup.
>>>>
>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>
>>> My pktgen config was not using any duplicated queueing in this case.
>>>
>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>> within a minute.  I'll let it bake overnight....
>>
>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>
>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>> if we just haven't actually done this exact test in a while...
> 
> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
> We did not try any kernels older than 5.4.
> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
> We cannot produce with user-space UDP traffic on any kernel/qdisc combination.
> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
> 
> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
> 


qdisc are protected by a qdisc spinlock.

fq-codel does not have to lock anything in its enqueue() and dequeue() methods.

I guess your local changes to pktgen might be to blame.

pfifo is much simpler than fq-codel, it uses less fields from skb.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-28 23:25         ` Eric Dumazet
@ 2021-09-29 19:07           ` Ben Greear
  2021-09-29 23:21             ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-29 19:07 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/28/21 4:25 PM, Eric Dumazet wrote:
> 
> 
> On 9/28/21 3:00 PM, Ben Greear wrote:
>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>> Hello,
>>>>>>
>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>> some of my hackings have caused this issue.
>>>>>>
>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>> in the meantime...
>>>>>>
>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>
>>>>>
>>>>> pktgen must not be used in a mode where a single skb
>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>
>>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>>> anchor in their list.
>>>>>
>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>
>>>>> Please double check your clone_skb pktgen setup.
>>>>>
>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>
>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>
>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>> within a minute.  I'll let it bake overnight....
>>>
>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>
>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>> if we just haven't actually done this exact test in a while...
>>
>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>> We did not try any kernels older than 5.4.
>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>> We cannot produce with user-space UDP traffic on any kernel/qdisc combination.
>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
>>
>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>
> 
> 
> qdisc are protected by a qdisc spinlock.
> 
> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
> 
> I guess your local changes to pktgen might be to blame.
> 
> pfifo is much simpler than fq-codel, it uses less fields from skb.

I looked through my pktgen, and the skb creation and setup code looks pretty
similar to upstream pktgen.

I also added this debugging code:

[greearb@ben-dt4 linux-5.15.dev.y]$ git diff
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index bb0cd6d3d2c2..56e22106e19d 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
         len = 0;
         i = 0;
         do {
+               if (!flow->head) {
+                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
+                              idx, maxbacklog, threshold, max_packets, len, i);
+                       BUG_ON(1);
+               }
                 skb = dequeue_head(flow);
                 len += qdisc_pkt_len(skb);
                 mem += get_codel_cb(skb)->mem_usage;

The printout I see when this hits is:


fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
kernel BUG at net/sched/sch_fq_codel.c:171!
.....

So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?

Any suggestions for what kinds of issues in pktgen could cause this?

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-29 19:07           ` Ben Greear
@ 2021-09-29 23:21             ` Eric Dumazet
  2021-09-29 23:28               ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-29 23:21 UTC (permalink / raw)
  To: Ben Greear, Eric Dumazet, netdev



On 9/29/21 12:07 PM, Ben Greear wrote:
> On 9/28/21 4:25 PM, Eric Dumazet wrote:
>>
>>
>> On 9/28/21 3:00 PM, Ben Greear wrote:
>>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>>> some of my hackings have caused this issue.
>>>>>>>
>>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>>> in the meantime...
>>>>>>>
>>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>>
>>>>>>
>>>>>> pktgen must not be used in a mode where a single skb
>>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>>
>>>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>>>> anchor in their list.
>>>>>>
>>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>>
>>>>>> Please double check your clone_skb pktgen setup.
>>>>>>
>>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>>
>>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>>
>>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>>> within a minute.  I'll let it bake overnight....
>>>>
>>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>>
>>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>>> if we just haven't actually done this exact test in a while...
>>>
>>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>>> We did not try any kernels older than 5.4.
>>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>>> We cannot produce with user-space UDP traffic on any kernel/qdisc combination.
>>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
>>>
>>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>>
>>
>>
>> qdisc are protected by a qdisc spinlock.
>>
>> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
>>
>> I guess your local changes to pktgen might be to blame.
>>
>> pfifo is much simpler than fq-codel, it uses less fields from skb.
> 
> I looked through my pktgen, and the skb creation and setup code looks pretty
> similar to upstream pktgen.
> 
> I also added this debugging code:
> 
> [greearb@ben-dt4 linux-5.15.dev.y]$ git diff
> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
> index bb0cd6d3d2c2..56e22106e19d 100644
> --- a/net/sched/sch_fq_codel.c
> +++ b/net/sched/sch_fq_codel.c
> @@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
>         len = 0;
>         i = 0;
>         do {
> +               if (!flow->head) {
> +                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
> +                              idx, maxbacklog, threshold, max_packets, len, i);
> +                       BUG_ON(1);
> +               }
>                 skb = dequeue_head(flow);
>                 len += qdisc_pkt_len(skb);
>                 mem += get_codel_cb(skb)->mem_usage;
> 
> The printout I see when this hits is:
> 
> 
> fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
> kernel BUG at net/sched/sch_fq_codel.c:171!
> .....
> 
> So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?
> 
> Any suggestions for what kinds of issues in pktgen could cause this?

Modifications to skbs after they were queued to the qdisc.

qdisc_pkt_len(skb) uses skb->cb[] storage. Make sure to not use it.



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-29 23:21             ` Eric Dumazet
@ 2021-09-29 23:28               ` Eric Dumazet
  2021-09-29 23:42                 ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-29 23:28 UTC (permalink / raw)
  To: Ben Greear, netdev



On 9/29/21 4:21 PM, Eric Dumazet wrote:
> 
> 
> On 9/29/21 12:07 PM, Ben Greear wrote:
>> On 9/28/21 4:25 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/28/21 3:00 PM, Ben Greear wrote:
>>>> On 9/27/21 5:16 PM, Ben Greear wrote:
>>>>> On 9/27/21 5:04 PM, Ben Greear wrote:
>>>>>> On 9/27/21 4:49 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 9/27/21 4:30 PM, Ben Greear wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> In a hacked upon kernel, I'm getting crashes in fq-codel when doing bi-directional
>>>>>>>> pktgen traffic on top of mac-vlans.  Unfortunately for me, I've made big changes to
>>>>>>>> pktgen so I cannot easily run this test on stock kernels, and there is some chance
>>>>>>>> some of my hackings have caused this issue.
>>>>>>>>
>>>>>>>> But, in case others have seen similar, please let me know.  I shall go digging
>>>>>>>> in the meantime...
>>>>>>>>
>>>>>>>> Looks to me like 'skb' is NULL in line 120 below.
>>>>>>>
>>>>>>>
>>>>>>> pktgen must not be used in a mode where a single skb
>>>>>>> is cloned and reused, if packet needs to be stored in a qdisc.
>>>>>>>
>>>>>>> qdisc of all sorts assume skb->next/prev can be used as
>>>>>>> anchor in their list.
>>>>>>>
>>>>>>> If the same skb is queued multiple times, lists are corrupted.
>>>>>>>
>>>>>>> Please double check your clone_skb pktgen setup.
>>>>>>>
>>>>>>> I thought we had IFF_TX_SKB_SHARING for this, and that macvlan was properly clearing this bit.
>>>>>>
>>>>>> My pktgen config was not using any duplicated queueing in this case.
>>>>>>
>>>>>> I changed to pfifo fast and so far it is stable for ~10 minutes, where before it would crash
>>>>>> within a minute.  I'll let it bake overnight....
>>>>>
>>>>> Still running stable.  I also notice we have been using fq-codel for a while and haven't noticed
>>>>> this problem (next most recent kernel we might have run similar test on would be 5.13-ish).
>>>>>
>>>>> I'll duplicate this test on our older kernels tomorrow to see if it looks like a regression or
>>>>> if we just haven't actually done this exact test in a while...
>>>>
>>>> We can reproduce this crash as far back as 5.4 using fq-codel, with our pktgen driving mac-vlans.
>>>> We did not try any kernels older than 5.4.
>>>> We cannot reproduce with pfifo on 5.15-rc3 on an overnight run.
>>>> We cannot produce with user-space UDP traffic on any kernel/qdisc combination.
>>>> Our pktgen is configured for multi-skb of 0 (no multiple submits of the same skb)
>>>>
>>>> While looking briefly at fq-codel, I didn't notice any locking in the code that crashed.
>>>> Any chance that it makes assumptions that would be incorrect with pktgen running multiple
>>>> threads (one thread per mac-vlan) on top of a single qdisc belonging to the underlying NIC?
>>>>
>>>
>>>
>>> qdisc are protected by a qdisc spinlock.
>>>
>>> fq-codel does not have to lock anything in its enqueue() and dequeue() methods.
>>>
>>> I guess your local changes to pktgen might be to blame.
>>>
>>> pfifo is much simpler than fq-codel, it uses less fields from skb.
>>
>> I looked through my pktgen, and the skb creation and setup code looks pretty
>> similar to upstream pktgen.
>>
>> I also added this debugging code:
>>
>> [greearb@ben-dt4 linux-5.15.dev.y]$ git diff
>> diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
>> index bb0cd6d3d2c2..56e22106e19d 100644
>> --- a/net/sched/sch_fq_codel.c
>> +++ b/net/sched/sch_fq_codel.c
>> @@ -165,6 +165,11 @@ static unsigned int fq_codel_drop(struct Qdisc *sch, unsigned int max_packets,
>>         len = 0;
>>         i = 0;
>>         do {
>> +               if (!flow->head) {
>> +                       pr_err("fq-codel-drop: idx: %d maxbacklog: %d  threshold: %d max_packets: %d len: %d i: %d\n",
>> +                              idx, maxbacklog, threshold, max_packets, len, i);
>> +                       BUG_ON(1);
>> +               }
>>                 skb = dequeue_head(flow);
>>                 len += qdisc_pkt_len(skb);
>>                 mem += get_codel_cb(skb)->mem_usage;
>>
>> The printout I see when this hits is:
>>
>>
>> fq-codel-drop: idx: 955 maxbacklog: 7756222  threshold: 3878111 max_packets: 64 len: 93868 i: 62
>> kernel BUG at net/sched/sch_fq_codel.c:171!
>> .....
>>
>> So, I guess this means that the backlog byte counter is out of sync with the packet queue somehow?
>>
>> Any suggestions for what kinds of issues in pktgen could cause this?
> 
> Modifications to skbs after they were queued to the qdisc.
> 
> qdisc_pkt_len(skb) uses skb->cb[] storage. Make sure to not use it.
> 
> 

Actually the bug seems to be in pktgen, vs NET_XMIT_CN

You probably would hit the same issues with other qdisc also using NET_XMIT_CN

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-29 23:28               ` Eric Dumazet
@ 2021-09-29 23:42                 ` Eric Dumazet
  2021-09-29 23:48                   ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-29 23:42 UTC (permalink / raw)
  To: Ben Greear, netdev



On 9/29/21 4:28 PM, Eric Dumazet wrote:
> 

> 
> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
> 
> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
> 

I would try the following patch :

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
        case NET_XMIT_DROP:
        case NET_XMIT_CN:
                /* skb has been consumed */
+               pkt_dev->last_ok = 1;
                pkt_dev->errors++;
                break;
        default: /* Drivers are not supposed to return other values! */

^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-29 23:42                 ` Eric Dumazet
@ 2021-09-29 23:48                   ` Ben Greear
  2021-09-30  0:04                     ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-29 23:48 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/29/21 4:42 PM, Eric Dumazet wrote:
> 
> 
> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>
> 
>>
>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>
>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>
> 
> I would try the following patch :
> 
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>          case NET_XMIT_DROP:
>          case NET_XMIT_CN:
>                  /* skb has been consumed */
> +               pkt_dev->last_ok = 1;
>                  pkt_dev->errors++;
>                  break;
>          default: /* Drivers are not supposed to return other values! */
> 

Thanks, that makes sense to me.  I'll test that out tomorrow...

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-29 23:48                   ` Ben Greear
@ 2021-09-30  0:04                     ` Ben Greear
  2021-09-30  0:29                       ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-30  0:04 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/29/21 4:48 PM, Ben Greear wrote:
> On 9/29/21 4:42 PM, Eric Dumazet wrote:
>>
>>
>> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>>
>>
>>>
>>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>>
>>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>>
>>
>> I would try the following patch :
>>
>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
>> --- a/net/core/pktgen.c
>> +++ b/net/core/pktgen.c
>> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>          case NET_XMIT_DROP:
>>          case NET_XMIT_CN:
>>                  /* skb has been consumed */
>> +               pkt_dev->last_ok = 1;
>>                  pkt_dev->errors++;
>>                  break;
>>          default: /* Drivers are not supposed to return other values! */

While patching my variant of pktgen, I took a look at the 'default' case.  I think
it should probably go above NET_XMIT_DROP and fallthrough into the consumed pkt path?

Although, probably not a big deal since only bugs elsewhere would hit that path, and
we don't really know if skb would be consumed in that case or not.

Thanks,
Ben



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-30  0:04                     ` Ben Greear
@ 2021-09-30  0:29                       ` Eric Dumazet
  2021-09-30  0:40                         ` Eric Dumazet
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-30  0:29 UTC (permalink / raw)
  To: Ben Greear, Eric Dumazet, netdev



On 9/29/21 5:04 PM, Ben Greear wrote:
> On 9/29/21 4:48 PM, Ben Greear wrote:
>> On 9/29/21 4:42 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>>>
>>>
>>>>
>>>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>>>
>>>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>>>
>>>
>>> I would try the following patch :
>>>
>>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
>>> --- a/net/core/pktgen.c
>>> +++ b/net/core/pktgen.c
>>> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>>          case NET_XMIT_DROP:
>>>          case NET_XMIT_CN:
>>>                  /* skb has been consumed */
>>> +               pkt_dev->last_ok = 1;
>>>                  pkt_dev->errors++;
>>>                  break;
>>>          default: /* Drivers are not supposed to return other values! */
> 
> While patching my variant of pktgen, I took a look at the 'default' case.  I think
> it should probably go above NET_XMIT_DROP and fallthrough into the consumed pkt path?
> 
> Although, probably not a big deal since only bugs elsewhere would hit that path, and
> we don't really know if skb would be consumed in that case or not.
> 

This is probably dead code after commit

commit f466dba1832f05006cf6caa9be41fb98d11cb848    pktgen: ndo_start_xmit can return NET_XMIT_xxx values

So this does not really matter anymore.




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-30  0:29                       ` Eric Dumazet
@ 2021-09-30  0:40                         ` Eric Dumazet
  2021-09-30  1:36                           ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Eric Dumazet @ 2021-09-30  0:40 UTC (permalink / raw)
  To: Ben Greear, netdev



On 9/29/21 5:29 PM, Eric Dumazet wrote:
> 
> 
> On 9/29/21 5:04 PM, Ben Greear wrote:
>> On 9/29/21 4:48 PM, Ben Greear wrote:
>>> On 9/29/21 4:42 PM, Eric Dumazet wrote:
>>>>
>>>>
>>>> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>>>>
>>>>
>>>>>
>>>>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>>>>
>>>>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>>>>
>>>>
>>>> I would try the following patch :
>>>>
>>>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>>>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
>>>> --- a/net/core/pktgen.c
>>>> +++ b/net/core/pktgen.c
>>>> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>>>          case NET_XMIT_DROP:
>>>>          case NET_XMIT_CN:
>>>>                  /* skb has been consumed */
>>>> +               pkt_dev->last_ok = 1;
>>>>                  pkt_dev->errors++;
>>>>                  break;
>>>>          default: /* Drivers are not supposed to return other values! */
>>
>> While patching my variant of pktgen, I took a look at the 'default' case.  I think
>> it should probably go above NET_XMIT_DROP and fallthrough into the consumed pkt path?
>>
>> Although, probably not a big deal since only bugs elsewhere would hit that path, and
>> we don't really know if skb would be consumed in that case or not.
>>
> 
> This is probably dead code after commit
> 
> commit f466dba1832f05006cf6caa9be41fb98d11cb848    pktgen: ndo_start_xmit can return NET_XMIT_xxx values
> 
> So this does not really matter anymore.
> 
> 

Alternative would be the following patch.
NET_XMIT_CN means the packet has been queued for transmit,
but that we might have dropped prior packets.

Probably not a big deal to make the difference in pktgen.

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..5c612cbc74c790f64aff5ce602843378284c7119 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3557,6 +3557,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 
        switch (ret) {
        case NETDEV_TX_OK:
+       case NET_XMIT_CN:
                pkt_dev->last_ok = 1;
                pkt_dev->sofar++;
                pkt_dev->seq_num++;
@@ -3565,8 +3566,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
                        goto xmit_more;
                break;
        case NET_XMIT_DROP:
-       case NET_XMIT_CN:
                /* skb has been consumed */
+               pkt_dev->last_ok = 1;
                pkt_dev->errors++;
                break;
        default: /* Drivers are not supposed to return other values! */


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-30  0:40                         ` Eric Dumazet
@ 2021-09-30  1:36                           ` Ben Greear
  2021-09-30 16:44                             ` Ben Greear
  0 siblings, 1 reply; 16+ messages in thread
From: Ben Greear @ 2021-09-30  1:36 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/29/21 5:40 PM, Eric Dumazet wrote:
> 
> 
> On 9/29/21 5:29 PM, Eric Dumazet wrote:
>>
>>
>> On 9/29/21 5:04 PM, Ben Greear wrote:
>>> On 9/29/21 4:48 PM, Ben Greear wrote:
>>>> On 9/29/21 4:42 PM, Eric Dumazet wrote:
>>>>>
>>>>>
>>>>> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>>>>>
>>>>>
>>>>>>
>>>>>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>>>>>
>>>>>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>>>>>
>>>>>
>>>>> I would try the following patch :
>>>>>
>>>>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>>>>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
>>>>> --- a/net/core/pktgen.c
>>>>> +++ b/net/core/pktgen.c
>>>>> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>>>>           case NET_XMIT_DROP:
>>>>>           case NET_XMIT_CN:
>>>>>                   /* skb has been consumed */
>>>>> +               pkt_dev->last_ok = 1;
>>>>>                   pkt_dev->errors++;
>>>>>                   break;
>>>>>           default: /* Drivers are not supposed to return other values! */
>>>
>>> While patching my variant of pktgen, I took a look at the 'default' case.  I think
>>> it should probably go above NET_XMIT_DROP and fallthrough into the consumed pkt path?
>>>
>>> Although, probably not a big deal since only bugs elsewhere would hit that path, and
>>> we don't really know if skb would be consumed in that case or not.
>>>
>>
>> This is probably dead code after commit
>>
>> commit f466dba1832f05006cf6caa9be41fb98d11cb848    pktgen: ndo_start_xmit can return NET_XMIT_xxx values
>>
>> So this does not really matter anymore.
>>
>>
> 
> Alternative would be the following patch.
> NET_XMIT_CN means the packet has been queued for transmit,
> but that we might have dropped prior packets.
> 
> Probably not a big deal to make the difference in pktgen.
> 
> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..5c612cbc74c790f64aff5ce602843378284c7119 100644
> --- a/net/core/pktgen.c
> +++ b/net/core/pktgen.c
> @@ -3557,6 +3557,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>   
>          switch (ret) {
>          case NETDEV_TX_OK:
> +       case NET_XMIT_CN:
>                  pkt_dev->last_ok = 1;
>                  pkt_dev->sofar++;
>                  pkt_dev->seq_num++;
> @@ -3565,8 +3566,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>                          goto xmit_more;
>                  break;
>          case NET_XMIT_DROP:
> -       case NET_XMIT_CN:
>                  /* skb has been consumed */
> +               pkt_dev->last_ok = 1;
>                  pkt_dev->errors++;
>                  break;
>          default: /* Drivers are not supposed to return other values! */
> 

Yes, I like that the XMIT_CN then means to increment the seq_num, though for my own purposes,
I wouldn't want to increment the sofar++ in that case (and maybe not do other logic in that case),
since we know at least something dropped.

For fq-codel, seems that XMIT_CN could mean that the attempted packet actually was queued
for xmit, but at least some other packets were purged.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: 5.15-rc3+ crash in fq-codel?
  2021-09-30  1:36                           ` Ben Greear
@ 2021-09-30 16:44                             ` Ben Greear
  0 siblings, 0 replies; 16+ messages in thread
From: Ben Greear @ 2021-09-30 16:44 UTC (permalink / raw)
  To: Eric Dumazet, netdev

On 9/29/21 6:36 PM, Ben Greear wrote:
> On 9/29/21 5:40 PM, Eric Dumazet wrote:
>>
>>
>> On 9/29/21 5:29 PM, Eric Dumazet wrote:
>>>
>>>
>>> On 9/29/21 5:04 PM, Ben Greear wrote:
>>>> On 9/29/21 4:48 PM, Ben Greear wrote:
>>>>> On 9/29/21 4:42 PM, Eric Dumazet wrote:
>>>>>>
>>>>>>
>>>>>> On 9/29/21 4:28 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Actually the bug seems to be in pktgen, vs NET_XMIT_CN
>>>>>>>
>>>>>>> You probably would hit the same issues with other qdisc also using NET_XMIT_CN
>>>>>>>
>>>>>>
>>>>>> I would try the following patch :
>>>>>>
>>>>>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>>>>>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..0a2d9534f8d08d1da5dfc68c631f3a07f95c6f77 100644
>>>>>> --- a/net/core/pktgen.c
>>>>>> +++ b/net/core/pktgen.c
>>>>>> @@ -3567,6 +3567,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>>>>>           case NET_XMIT_DROP:
>>>>>>           case NET_XMIT_CN:
>>>>>>                   /* skb has been consumed */
>>>>>> +               pkt_dev->last_ok = 1;
>>>>>>                   pkt_dev->errors++;
>>>>>>                   break;
>>>>>>           default: /* Drivers are not supposed to return other values! */
>>>>
>>>> While patching my variant of pktgen, I took a look at the 'default' case.  I think
>>>> it should probably go above NET_XMIT_DROP and fallthrough into the consumed pkt path?
>>>>
>>>> Although, probably not a big deal since only bugs elsewhere would hit that path, and
>>>> we don't really know if skb would be consumed in that case or not.
>>>>
>>>
>>> This is probably dead code after commit
>>>
>>> commit f466dba1832f05006cf6caa9be41fb98d11cb848    pktgen: ndo_start_xmit can return NET_XMIT_xxx values
>>>
>>> So this does not really matter anymore.
>>>
>>>
>>
>> Alternative would be the following patch.
>> NET_XMIT_CN means the packet has been queued for transmit,
>> but that we might have dropped prior packets.
>>
>> Probably not a big deal to make the difference in pktgen.
>>
>> diff --git a/net/core/pktgen.c b/net/core/pktgen.c
>> index a3d74e2704c42e3bec1aa502b911c1b952a56cf1..5c612cbc74c790f64aff5ce602843378284c7119 100644
>> --- a/net/core/pktgen.c
>> +++ b/net/core/pktgen.c
>> @@ -3557,6 +3557,7 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>          switch (ret) {
>>          case NETDEV_TX_OK:
>> +       case NET_XMIT_CN:
>>                  pkt_dev->last_ok = 1;
>>                  pkt_dev->sofar++;
>>                  pkt_dev->seq_num++;
>> @@ -3565,8 +3566,8 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
>>                          goto xmit_more;
>>                  break;
>>          case NET_XMIT_DROP:
>> -       case NET_XMIT_CN:
>>                  /* skb has been consumed */
>> +               pkt_dev->last_ok = 1;
>>                  pkt_dev->errors++;
>>                  break;
>>          default: /* Drivers are not supposed to return other values! */
>>
> 
> Yes, I like that the XMIT_CN then means to increment the seq_num, though for my own purposes,
> I wouldn't want to increment the sofar++ in that case (and maybe not do other logic in that case),
> since we know at least something dropped.
> 
> For fq-codel, seems that XMIT_CN could mean that the attempted packet actually was queued
> for xmit, but at least some other packets were purged.
> 
> Thanks,
> Ben
> 

This does fix the crash for me (my patch in my tree is slightly different, but same idea).

Thanks,
Ben

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-09-30 16:44 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-27 23:30 5.15-rc3+ crash in fq-codel? Ben Greear
2021-09-27 23:49 ` Eric Dumazet
2021-09-28  0:04   ` Ben Greear
2021-09-28  0:16     ` Ben Greear
2021-09-28 22:00       ` Ben Greear
2021-09-28 23:25         ` Eric Dumazet
2021-09-29 19:07           ` Ben Greear
2021-09-29 23:21             ` Eric Dumazet
2021-09-29 23:28               ` Eric Dumazet
2021-09-29 23:42                 ` Eric Dumazet
2021-09-29 23:48                   ` Ben Greear
2021-09-30  0:04                     ` Ben Greear
2021-09-30  0:29                       ` Eric Dumazet
2021-09-30  0:40                         ` Eric Dumazet
2021-09-30  1:36                           ` Ben Greear
2021-09-30 16:44                             ` Ben Greear

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.