All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
@ 2019-02-25 14:49 Sheng Lan
  2019-02-25 16:01 ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Sheng Lan @ 2019-02-25 14:49 UTC (permalink / raw)
  To: davem, stephen
  Cc: netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji, liuzhiqiang26

From: Sheng Lan <lansheng@huawei.com>
Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec

It can be reproduced by following steps:
1. virtio_net NIC is configured with gso/tso on
2. configure nginx as http server with an index file bigger than 1M bytes
3. use tc netem to produce duplicate packets and delay:
   tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
4. continually curl the nginx http server to get index file on client
5. BUG_ON is seen quickly

[10258690.371129] kernel BUG at net/core/skbuff.c:4028!
[10258690.371748] invalid opcode: 0000 [#1] SMP PTI
[10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
[10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
[10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
[10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
[10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
[10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
[10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
[10258690.372094] Call Trace:
[10258690.372094]  <IRQ>
[10258690.372094]  skb_to_sgvec+0x11/0x40
[10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
[10258690.372094]  dev_hard_start_xmit+0x9b/0x200
[10258690.372094]  sch_direct_xmit+0xff/0x260
[10258690.372094]  __qdisc_run+0x15e/0x4e0
[10258690.372094]  net_tx_action+0x137/0x210
[10258690.372094]  __do_softirq+0xd6/0x2a9
[10258690.372094]  irq_exit+0xde/0xf0
[10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
[10258690.372094]  apic_timer_interrupt+0xf/0x20
[10258690.372094]  </IRQ>

In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
linear data size and nonlinear data size, thus BUG_ON triggered. The
bad skb's nonlinear data size is less than skb->data_len, because the
skb is cloned and a part of related cloned skb's nonlinear data is
split off.

Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
some time in qdisc. Due to the delay time, the original skb will be pushed
again later in __tcp_push_pending_frames when tcp receives new packets.
In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
the original skb will be fragmented and the skb's nonlinear data will be
split off. The length of the skb cloned by netem will not be updated.
When we use virtio_net NIC, the duplicated cloned skb will be filled into
a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.

Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
the duplicated skb's nonlinear data is independent.

Signed-off-by: Sheng Lan <lansheng@huawei.com>
Reported-by: Qin Ji <jiqin.ji@huawei.com>

Fixes: 0afb51e7 ("netem: reinsert for duplication")
---
 net/sched/sch_netem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec..76d9740 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -479,7 +479,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	 * qdisc tree, since parent queuer expects that only one
 	 * skb will be queued.
 	 */
-	if (count > 1 && (skb2 = skb_clone(skb, GFP_ATOMIC)) != NULL) {
+	if (count > 1 && (skb2 = skb_copy(skb, GFP_ATOMIC)) != NULL) {
 		struct Qdisc *rootq = qdisc_root(sch);
 		u32 dupsave = q->duplicate; /* prevent duplicating a dup... */

-- 


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-25 14:49 [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec Sheng Lan
@ 2019-02-25 16:01 ` Stephen Hemminger
  2019-02-26 13:02   ` Sheng Lan
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2019-02-25 16:01 UTC (permalink / raw)
  To: Sheng Lan
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji, liuzhiqiang26

On Mon, 25 Feb 2019 22:49:39 +0800
Sheng Lan <lansheng@huawei.com> wrote:

> From: Sheng Lan <lansheng@huawei.com>
> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
> 
> It can be reproduced by following steps:
> 1. virtio_net NIC is configured with gso/tso on
> 2. configure nginx as http server with an index file bigger than 1M bytes
> 3. use tc netem to produce duplicate packets and delay:
>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
> 4. continually curl the nginx http server to get index file on client
> 5. BUG_ON is seen quickly
> 
> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
> [10258690.372094] Call Trace:
> [10258690.372094]  <IRQ>
> [10258690.372094]  skb_to_sgvec+0x11/0x40
> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
> [10258690.372094]  sch_direct_xmit+0xff/0x260
> [10258690.372094]  __qdisc_run+0x15e/0x4e0
> [10258690.372094]  net_tx_action+0x137/0x210
> [10258690.372094]  __do_softirq+0xd6/0x2a9
> [10258690.372094]  irq_exit+0xde/0xf0
> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
> [10258690.372094]  apic_timer_interrupt+0xf/0x20
> [10258690.372094]  </IRQ>
> 
> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
> linear data size and nonlinear data size, thus BUG_ON triggered. The
> bad skb's nonlinear data size is less than skb->data_len, because the
> skb is cloned and a part of related cloned skb's nonlinear data is
> split off.
> 
> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
> some time in qdisc. Due to the delay time, the original skb will be pushed
> again later in __tcp_push_pending_frames when tcp receives new packets.
> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
> the original skb will be fragmented and the skb's nonlinear data will be
> split off. The length of the skb cloned by netem will not be updated.
> When we use virtio_net NIC, the duplicated cloned skb will be filled into
> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
> 
> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
> the duplicated skb's nonlinear data is independent.
> 
> Signed-off-by: Sheng Lan <lansheng@huawei.com>
> Reported-by: Qin Ji <jiqin.ji@huawei.com>
> 
> Fixes: 0afb51e7 ("netem: reinsert for duplication")

This sounds like a bug in the other layers (either TCP or Virtio net)
not handling a cloned skb properly.



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-25 16:01 ` Stephen Hemminger
@ 2019-02-26 13:02   ` Sheng Lan
  2019-02-26 15:52     ` Eric Dumazet
  2019-02-26 15:59     ` Stephen Hemminger
  0 siblings, 2 replies; 8+ messages in thread
From: Sheng Lan @ 2019-02-26 13:02 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing




> On Mon, 25 Feb 2019 22:49:39 +0800
> Sheng Lan <lansheng@huawei.com> wrote:
> 
>> From: Sheng Lan <lansheng@huawei.com>
>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>
>> It can be reproduced by following steps:
>> 1. virtio_net NIC is configured with gso/tso on
>> 2. configure nginx as http server with an index file bigger than 1M bytes
>> 3. use tc netem to produce duplicate packets and delay:
>>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>> 4. continually curl the nginx http server to get index file on client
>> 5. BUG_ON is seen quickly
>>
>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>> [10258690.372094] Call Trace:
>> [10258690.372094]  <IRQ>
>> [10258690.372094]  skb_to_sgvec+0x11/0x40
>> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
>> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
>> [10258690.372094]  sch_direct_xmit+0xff/0x260
>> [10258690.372094]  __qdisc_run+0x15e/0x4e0
>> [10258690.372094]  net_tx_action+0x137/0x210
>> [10258690.372094]  __do_softirq+0xd6/0x2a9
>> [10258690.372094]  irq_exit+0xde/0xf0
>> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
>> [10258690.372094]  apic_timer_interrupt+0xf/0x20
>> [10258690.372094]  </IRQ>
>>
>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>> bad skb's nonlinear data size is less than skb->data_len, because the
>> skb is cloned and a part of related cloned skb's nonlinear data is
>> split off.
>>
>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>> some time in qdisc. Due to the delay time, the original skb will be pushed
>> again later in __tcp_push_pending_frames when tcp receives new packets.
>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>> the original skb will be fragmented and the skb's nonlinear data will be
>> split off. The length of the skb cloned by netem will not be updated.
>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>
>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>> the duplicated skb's nonlinear data is independent.
>>
>> Signed-off-by: Sheng Lan <lansheng@huawei.com>
>> Reported-by: Qin Ji <jiqin.ji@huawei.com>
>>
>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
> 
> This sounds like a bug in the other layers (either TCP or Virtio net)
> not handling a cloned skb properly.
> 

I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.

TCP layer transmit the index file to client, the original skb1 size is large:
...
tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
tso_fragment              (it needs to be fragmented by limit value)
skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
...
netem_enqueue             (netem construct a duplicate packet of skb1 by skb_clone)
skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)


TCP layer receives new packets and trys to retransmit the skb1:
tcp_rcv_established
__tcp_push_pending_frames
tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)
tso_fragment              (limit value is less than skb1->data_len, skb1 will be fragmented again)
skb_split                 (the second time split, skb1 is cloned now and share nonlinear data with skb2.
                           skb1->data_len == 1448, frags[0] == 1448, a part of shared nonlinear data has been split off)

Now we can see:
skb1->data_len == 1448
skb2->data_len == 2896    (2896 is wrong value)
frags[0] == 1448


The route of skb2:
netem_enqueue             (netem construct a duplicate packet, skb2 = skb_clone(skb1), put skb2 into queue)
waiting 30ms              (delayed packet)
...
netem_dequeue             (skb2->data_len == 2896, frags[0] == 1448)
sch_direct_xmit
dev_hard_start_xmit
xmit_one
start_xmit                [virtio_net driver]
skb_to_sgvec              (BUG_ON here )


The key is that skb be split by skb_split in tso_fragment again after netem clone it and share nonlinear data with another skb.
Processing of TCP seems OK, which push and fragment delayed packets in write queue. And virtio_net is the trigger of the BUG_ON.
So I replaced skb_clone with skb_copy in netem_enqueue, and the method worked. Currently, I have no better idea to fix it,
would you give me some inspiring advice ? If I am wrong, please correct me.

Thanks


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-26 13:02   ` Sheng Lan
@ 2019-02-26 15:52     ` Eric Dumazet
  2019-02-27 11:26       ` Sheng Lan
  2019-02-26 15:59     ` Stephen Hemminger
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2019-02-26 15:52 UTC (permalink / raw)
  To: Sheng Lan, Stephen Hemminger
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing



On 02/26/2019 05:02 AM, Sheng Lan wrote:
> 
> 
> 
>> On Mon, 25 Feb 2019 22:49:39 +0800
>> Sheng Lan <lansheng@huawei.com> wrote:
>>
>>> From: Sheng Lan <lansheng@huawei.com>
>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>
>>> It can be reproduced by following steps:
>>> 1. virtio_net NIC is configured with gso/tso on
>>> 2. configure nginx as http server with an index file bigger than 1M bytes
>>> 3. use tc netem to produce duplicate packets and delay:
>>>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>> 4. continually curl the nginx http server to get index file on client
>>> 5. BUG_ON is seen quickly
>>>
>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>> [10258690.372094] Call Trace:
>>> [10258690.372094]  <IRQ>
>>> [10258690.372094]  skb_to_sgvec+0x11/0x40
>>> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
>>> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
>>> [10258690.372094]  sch_direct_xmit+0xff/0x260
>>> [10258690.372094]  __qdisc_run+0x15e/0x4e0
>>> [10258690.372094]  net_tx_action+0x137/0x210
>>> [10258690.372094]  __do_softirq+0xd6/0x2a9
>>> [10258690.372094]  irq_exit+0xde/0xf0
>>> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
>>> [10258690.372094]  apic_timer_interrupt+0xf/0x20
>>> [10258690.372094]  </IRQ>
>>>
>>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>>> bad skb's nonlinear data size is less than skb->data_len, because the
>>> skb is cloned and a part of related cloned skb's nonlinear data is
>>> split off.
>>>
>>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>>> some time in qdisc. Due to the delay time, the original skb will be pushed
>>> again later in __tcp_push_pending_frames when tcp receives new packets.
>>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>>> the original skb will be fragmented and the skb's nonlinear data will be
>>> split off. The length of the skb cloned by netem will not be updated.
>>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>>
>>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>>> the duplicated skb's nonlinear data is independent.
>>>
>>> Signed-off-by: Sheng Lan <lansheng@huawei.com>
>>> Reported-by: Qin Ji <jiqin.ji@huawei.com>
>>>
>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>
>> This sounds like a bug in the other layers (either TCP or Virtio net)
>> not handling a cloned skb properly.
>>
> 
> I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
> Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.
> 
> TCP layer transmit the index file to client, the original skb1 size is large:
> ...
> tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
> tso_fragment              (it needs to be fragmented by limit value)
> skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
> ...
> netem_enqueue             (netem construct a duplicate packet of skb1 by skb_clone)
> skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
> waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)
> 
> 
> TCP layer receives new packets and trys to retransmit the skb1:
> tcp_rcv_established
> __tcp_push_pending_frames
> tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)

tcp_write_xmit() only deals with packet in the write queue,
they never were sent. They can not be any clone of them by definition, since
skbs in the TCP write queue are private to TCP stack,

Once a packet is sent, the master skb is moved to the rtx rb-tree,
while the clone is going through lower stacks.

When/if a retransmit is due, we always make sure there is no clone on it,
look at the various calls to skb_unclone()


> tso_fragment              (limit value is less than skb1->data_len, skb1 will be fragmented again)
> skb_split                 (the second time split, skb1 is cloned now and share nonlinear data with skb2.
>                            skb1->data_len == 1448, frags[0] == 1448, a part of shared nonlinear data has been split off)
> 
> Now we can see:
> skb1->data_len == 1448
> skb2->data_len == 2896    (2896 is wrong value)
> frags[0] == 1448
> 
> 
> The route of skb2:
> netem_enqueue             (netem construct a duplicate packet, skb2 = skb_clone(skb1), put skb2 into queue)
> waiting 30ms              (delayed packet)
> ...
> netem_dequeue             (skb2->data_len == 2896, frags[0] == 1448)
> sch_direct_xmit
> dev_hard_start_xmit
> xmit_one
> start_xmit                [virtio_net driver]
> skb_to_sgvec              (BUG_ON here )
> 
> 
> The key is that skb be split by skb_split in tso_fragment again after netem clone it and share nonlinear data with another skb.
> Processing of TCP seems OK, which push and fragment delayed packets in write queue. And virtio_net is the trigger of the BUG_ON.
> So I replaced skb_clone with skb_copy in netem_enqueue, and the method worked. Currently, I have no better idea to fix it,
> would you give me some inspiring advice ? If I am wrong, please correct me.
> 
> Thanks
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-26 13:02   ` Sheng Lan
  2019-02-26 15:52     ` Eric Dumazet
@ 2019-02-26 15:59     ` Stephen Hemminger
  2019-02-26 16:08       ` Eric Dumazet
  1 sibling, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2019-02-26 15:59 UTC (permalink / raw)
  To: Sheng Lan
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing

On Tue, 26 Feb 2019 21:02:10 +0800
Sheng Lan <lansheng@huawei.com> wrote:

> > On Mon, 25 Feb 2019 22:49:39 +0800
> > Sheng Lan <lansheng@huawei.com> wrote:
> >   
> >> From: Sheng Lan <lansheng@huawei.com>
> >> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
> >>
> >> It can be reproduced by following steps:
> >> 1. virtio_net NIC is configured with gso/tso on
> >> 2. configure nginx as http server with an index file bigger than 1M bytes
> >> 3. use tc netem to produce duplicate packets and delay:
> >>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
> >> 4. continually curl the nginx http server to get index file on client
> >> 5. BUG_ON is seen quickly
> >>
> >> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
> >> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
> >> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
> >> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
> >> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
> >> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
> >> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
> >> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
> >> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
> >> [10258690.372094] Call Trace:
> >> [10258690.372094]  <IRQ>
> >> [10258690.372094]  skb_to_sgvec+0x11/0x40
> >> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
> >> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
> >> [10258690.372094]  sch_direct_xmit+0xff/0x260
> >> [10258690.372094]  __qdisc_run+0x15e/0x4e0
> >> [10258690.372094]  net_tx_action+0x137/0x210
> >> [10258690.372094]  __do_softirq+0xd6/0x2a9
> >> [10258690.372094]  irq_exit+0xde/0xf0
> >> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
> >> [10258690.372094]  apic_timer_interrupt+0xf/0x20
> >> [10258690.372094]  </IRQ>
> >>
> >> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
> >> linear data size and nonlinear data size, thus BUG_ON triggered. The
> >> bad skb's nonlinear data size is less than skb->data_len, because the
> >> skb is cloned and a part of related cloned skb's nonlinear data is
> >> split off.
> >>
> >> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
> >> some time in qdisc. Due to the delay time, the original skb will be pushed
> >> again later in __tcp_push_pending_frames when tcp receives new packets.
> >> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
> >> the original skb will be fragmented and the skb's nonlinear data will be
> >> split off. The length of the skb cloned by netem will not be updated.
> >> When we use virtio_net NIC, the duplicated cloned skb will be filled into
> >> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
> >>
> >> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
> >> the duplicated skb's nonlinear data is independent.
> >>
> >> Signed-off-by: Sheng Lan <lansheng@huawei.com>
> >> Reported-by: Qin Ji <jiqin.ji@huawei.com>
> >>
> >> Fixes: 0afb51e7 ("netem: reinsert for duplication")  
> > 
> > This sounds like a bug in the other layers (either TCP or Virtio net)
> > not handling a cloned skb properly.
> >   
> 
> I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
> Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.
> 
> TCP layer transmit the index file to client, the original skb1 size is large:
> ...
> tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
> tso_fragment              (it needs to be fragmented by limit value)
> skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
> ...
> netem_enqueue             (netem construct a duplicate packet of skb1 by skb_clone)
> skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
> waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)
> 
> 
> TCP layer receives new packets and trys to retransmit the skb1:
> tcp_rcv_established
> __tcp_push_pending_frames
> tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)
> tso_fragment              (limit value is less than skb1->data_len, skb1 will be fragmented again)


Maybe the fix is to stop TSO fragment from overwriting by doing something like:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 730bc44dbad9..5fe91d0224f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1856,7 +1856,7 @@ static int tso_fragment(struct sock *sk, enum tcp_queue tcp_queue,
 	u8 flags;
 
 	/* All of a TSO frame must be composed of paged data.  */
-	if (skb->len != skb->data_len)
+	if (skb->len != skb->data_len || skb_cloned(skb))
 		return tcp_fragment(sk, tcp_queue, skb, len, mss_now, gfp);
 
 	buff = sk_stream_alloc_skb(sk, 0, gfp, true);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-26 15:59     ` Stephen Hemminger
@ 2019-02-26 16:08       ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2019-02-26 16:08 UTC (permalink / raw)
  To: Stephen Hemminger, Sheng Lan
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing




On 02/26/2019 07:59 AM, Stephen Hemminger wrote:
> 
> 
> Maybe the fix is to stop TSO fragment from overwriting by doing something like:
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 730bc44dbad9..5fe91d0224f6 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1856,7 +1856,7 @@ static int tso_fragment(struct sock *sk, enum tcp_queue tcp_queue,
>  	u8 flags;
>  
>  	/* All of a TSO frame must be composed of paged data.  */
> -	if (skb->len != skb->data_len)
> +	if (skb->len != skb->data_len || skb_cloned(skb))
>  		return tcp_fragment(sk, tcp_queue, skb, len, mss_now, gfp);
>  
>  	buff = sk_stream_alloc_skb(sk, 0, gfp, true);
> 


tso_fragment() is only called with packets that were not yet transmit.




^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-26 15:52     ` Eric Dumazet
@ 2019-02-27 11:26       ` Sheng Lan
  2019-02-27 15:53         ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Sheng Lan @ 2019-02-27 11:26 UTC (permalink / raw)
  To: Eric Dumazet, Stephen Hemminger
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing



在 2019/2/26 23:52, Eric Dumazet 写道:
> 
> 
> On 02/26/2019 05:02 AM, Sheng Lan wrote:
>>
>>
>>
>>> On Mon, 25 Feb 2019 22:49:39 +0800
>>> Sheng Lan <lansheng@huawei.com> wrote:
>>>
>>>> From: Sheng Lan <lansheng@huawei.com>
>>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>>
>>>> It can be reproduced by following steps:
>>>> 1. virtio_net NIC is configured with gso/tso on
>>>> 2. configure nginx as http server with an index file bigger than 1M bytes
>>>> 3. use tc netem to produce duplicate packets and delay:
>>>>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>>> 4. continually curl the nginx http server to get index file on client
>>>> 5. BUG_ON is seen quickly
>>>>
>>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
>>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>>> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>>> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>>> [10258690.372094] Call Trace:
>>>> [10258690.372094]  <IRQ>
>>>> [10258690.372094]  skb_to_sgvec+0x11/0x40
>>>> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
>>>> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
>>>> [10258690.372094]  sch_direct_xmit+0xff/0x260
>>>> [10258690.372094]  __qdisc_run+0x15e/0x4e0
>>>> [10258690.372094]  net_tx_action+0x137/0x210
>>>> [10258690.372094]  __do_softirq+0xd6/0x2a9
>>>> [10258690.372094]  irq_exit+0xde/0xf0
>>>> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
>>>> [10258690.372094]  apic_timer_interrupt+0xf/0x20
>>>> [10258690.372094]  </IRQ>
>>>>
>>>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>>>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>>>> bad skb's nonlinear data size is less than skb->data_len, because the
>>>> skb is cloned and a part of related cloned skb's nonlinear data is
>>>> split off.
>>>>
>>>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>>>> some time in qdisc. Due to the delay time, the original skb will be pushed
>>>> again later in __tcp_push_pending_frames when tcp receives new packets.
>>>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>>>> the original skb will be fragmented and the skb's nonlinear data will be
>>>> split off. The length of the skb cloned by netem will not be updated.
>>>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>>>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>>>
>>>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>>>> the duplicated skb's nonlinear data is independent.
>>>>
>>>> Signed-off-by: Sheng Lan <lansheng@huawei.com>
>>>> Reported-by: Qin Ji <jiqin.ji@huawei.com>
>>>>
>>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>>
>>> This sounds like a bug in the other layers (either TCP or Virtio net)
>>> not handling a cloned skb properly.
>>>
>>
>> I have traced the route of skb by printk, let me take an example to describe the problem to make it clearly:
>> Mss value equals to 1448. Limit value is the split size when tcp do tso_fragment, is depending on the size of the sending congestion window and mss value.
>>
>> TCP layer transmit the index file to client, the original skb1 size is large:
>> ...
>> tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
>> tso_fragment              (it needs to be fragmented by limit value)
>> skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
>> ...
>> netem_enqueue             (netem construct a duplicate packet of skb1 by skb_clone)
>> skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
>> waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)
>>
>>
>> TCP layer receives new packets and trys to retransmit the skb1:
>> tcp_rcv_established
>> __tcp_push_pending_frames
>> tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, cause the limit decreased to 1*mss == 1448)
> 
> tcp_write_xmit() only deals with packet in the write queue,
> they never were sent. They can not be any clone of them by definition, since
> skbs in the TCP write queue are private to TCP stack,
> 
> Once a packet is sent, the master skb is moved to the rtx rb-tree,
> while the clone is going through lower stacks.
> 
> When/if a retransmit is due, we always make sure there is no clone on it,
> look at the various calls to skb_unclone()

I traced again and found that the skb was not sent, master skb was still in write queue,
because the function tcp_transmit_skb() returns 1 (NET_XMIT_DROP), thus it can be retransmit.
I found the error value NET_XMIT_DROP returns from netem_enqueue(), when the length of qdisc queue
is greater than queue limit value.

In netem_enqueue() the skb is cloned before returning the NET_XMIT_DROP error value,
thus the master skb is still in write queue and be cloned in netem_enqueue(). This may cause the master
skb be retransmit and fragmented again while it is cloned.

I think there are potential risks that tso_fragment() will get a cloned skb if skb is cloned by lower layer.
I try to fix it by moving returning error value statment to the front of the skb_clone() in netem_enqueue(), and it works.
And netem_enqueue() constructs corrupt packets statment returns NET_XMIT_DROP too. To fix this completely should I move the
constructing corrupt statment to the front of the skb_clone() ?
Please correct me if I am wrong, and I need your advice.

Thanks

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec..615a341 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -474,6 +474,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	if (q->latency || q->jitter || q->rate)
 		skb_orphan_partial(skb);

+	if (unlikely(sch->q.qlen >= sch->limit))
+		return qdisc_drop_all(skb, sch, to_free);
+
 	/*
 	 * If we need to duplicate packet, then re-insert at top of the
 	 * qdisc tree, since parent queuer expects that only one
@@ -521,9 +524,6 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 			1<<(prandom_u32() % 8);
 	}

-	if (unlikely(sch->q.qlen >= sch->limit))
-		return qdisc_drop_all(skb, sch, to_free);
-
 	qdisc_qstats_backlog_inc(sch, skb);

 	cb = netem_skb_cb(skb);
-- 


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
  2019-02-27 11:26       ` Sheng Lan
@ 2019-02-27 15:53         ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2019-02-27 15:53 UTC (permalink / raw)
  To: Sheng Lan, Eric Dumazet, Stephen Hemminger
  Cc: davem, netdev, netem, xuhanbing, zhengshaoyu, jiqin.ji,
	liuzhiqiang26, yuehaibing



On 02/27/2019 03:26 AM, Sheng Lan wrote:
> 

> I traced again and found that the skb was not sent, master skb was still in write queue,
> because the function tcp_transmit_skb() returns 1 (NET_XMIT_DROP), thus it can be retransmit.
> I found the error value NET_XMIT_DROP returns from netem_enqueue(), when the length of qdisc queue
> is greater than queue limit value.
> 
> In netem_enqueue() the skb is cloned before returning the NET_XMIT_DROP error value,
> thus the master skb is still in write queue and be cloned in netem_enqueue(). This may cause the master
> skb be retransmit and fragmented again while it is cloned.
> 
> I think there are potential risks that tso_fragment() will get a cloned skb if skb is cloned by lower layer.
> I try to fix it by moving returning error value statment to the front of the skb_clone() in netem_enqueue(), and it works.
> And netem_enqueue() constructs corrupt packets statment returns NET_XMIT_DROP too. To fix this completely should I move the
> constructing corrupt statment to the front of the skb_clone() ?
> Please correct me if I am wrong, and I need your advice.
> 

Hi

Choices are either :

1) netem sends a proper return value if a packet has been queued.

2) TCP (and probably other protocols) no longer trust lower stack
   and always move the master skb in rtx queue,
   even if the transmit of the (first) clone failed.

I prefer 1), since netem is not used in the fast path, generally...

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec7214449c631c38eaab5e4a51644cfa0e5..f6ea2d44dffe328a2fd1a468e209aac0bfaccd49 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -447,6 +447,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
        int nb = 0;
        int count = 1;
        int rc = NET_XMIT_SUCCESS;
+       int rc_drop = NET_XMIT_DROP;
 
        /* Do not fool qdisc_drop_all() */
        skb->prev = NULL;
@@ -486,6 +487,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                q->duplicate = 0;
                rootq->enqueue(skb2, rootq, to_free);
                q->duplicate = dupsave;
+               rc_drop = NET_XMIT_SUCCESS;
        }
 
        /*
@@ -498,7 +500,7 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                if (skb_is_gso(skb)) {
                        segs = netem_segment(skb, sch, to_free);
                        if (!segs)
-                               return NET_XMIT_DROP;
+                               return rc_drop;
                } else {
                        segs = skb;
                }
@@ -521,9 +523,10 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                        1<<(prandom_u32() % 8);
        }
 
-       if (unlikely(sch->q.qlen >= sch->limit))
-               return qdisc_drop_all(skb, sch, to_free);
-
+       if (unlikely(sch->q.qlen >= sch->limit)) {
+               qdisc_drop_all(skb, sch, to_free);
+               return rc_drop;
+       }
        qdisc_qstats_backlog_inc(sch, skb);
 
        cb = netem_skb_cb(skb);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-02-27 15:53 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-02-25 14:49 [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec Sheng Lan
2019-02-25 16:01 ` Stephen Hemminger
2019-02-26 13:02   ` Sheng Lan
2019-02-26 15:52     ` Eric Dumazet
2019-02-27 11:26       ` Sheng Lan
2019-02-27 15:53         ` Eric Dumazet
2019-02-26 15:59     ` Stephen Hemminger
2019-02-26 16:08       ` Eric Dumazet

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.