linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* WARNING in sk_stream_kill_queues (5)
@ 2020-11-30  8:40 syzbot
  2020-12-03 15:58 ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: syzbot @ 2020-11-30  8:40 UTC (permalink / raw)
  To: akpm, davem, dvyukov, elver, glider, jannh, kuba, linux-kernel,
	netdev, sfr, syzkaller-bugs

Hello,

syzbot found the following issue on:

HEAD commit:    6147c83f Add linux-next specific files for 20201126
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
compiler:       gcc (GCC) 10.1.0-syz 20200507
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000

The issue was bisected to:

commit 145cd60fb481328faafba76842aa0fd242e2b163
Author: Alexander Potapenko <glider@google.com>
Date:   Tue Nov 24 05:38:44 2020 +0000

    mm, kfence: insert KFENCE hooks for SLUB

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")

------------[ cut here ]------------
WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
Modules linked in:
CPU: 0 PID: 11307 Comm: syz-executor673 Not tainted 5.10.0-rc5-next-20201126-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
Code: 00 00 00 fc ff df 48 c1 ea 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e 63 01 00 00 8b ab 20 02 00 00 e9 60 ff ff ff e8 ad 24 7b fa <0f> 0b eb 97 e8 a4 24 7b fa 0f 0b eb a0 e8 9b 24 7b fa 0f 0b e9 a5
RSP: 0018:ffffc9000979f978 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 00000000fffffe80 RCX: ffffffff86f5877a
RDX: ffff88801ebb5040 RSI: ffffffff86f587e3 RDI: 0000000000000005
RBP: 0000000000000180 R08: 0000000000000001 R09: ffffffff8ebd9817
R10: 0000000000000000 R11: 0000000000000001 R12: ffff8880182f3ce0
R13: ffffffff8fb178c0 R14: ffff8880182f3ae8 R15: ffff8880182f3c70
FS:  0000000000000000(0000) GS:ffff8880b9e00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004c7cd8 CR3: 000000000b08e000 CR4: 00000000001506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
 __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
 tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
 inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
 __sock_release+0xcd/0x280 net/socket.c:596
 sock_close+0x18/0x20 net/socket.c:1255
 __fput+0x283/0x920 fs/file_table.c:280
 task_work_run+0xdd/0x190 kernel/task_work.c:140
 exit_task_work include/linux/task_work.h:30 [inline]
 do_exit+0xb89/0x29e0 kernel/exit.c:823
 do_group_exit+0x125/0x310 kernel/exit.c:920
 get_signal+0x3ec/0x2010 kernel/signal.c:2770
 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:144 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
 exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
 syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x44ea59
Code: Unable to access opcode bytes at RIP 0x44ea2f.
RSP: 002b:00007fd1200f3d98 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: 0000000000012255 RBX: 00000000006e6a18 RCX: 000000000044ea59
RDX: 00000001000001bd RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006e6a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e6a1c
R13: 3030303030303030 R14: 3030303030303d65 R15: 2b74d0dd4a6f722c


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-11-30  8:40 WARNING in sk_stream_kill_queues (5) syzbot
@ 2020-12-03 15:58 ` Marco Elver
  2020-12-03 16:27   ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-03 15:58 UTC (permalink / raw)
  To: netdev
  Cc: akpm, davem, dvyukov, glider, jannh, kuba, linux-kernel, sfr,
	syzkaller-bugs, edumazet, willemb, syzbot

On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:    6147c83f Add linux-next specific files for 20201126
> git tree:       linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> compiler:       gcc (GCC) 10.1.0-syz 20200507
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> 
> The issue was bisected to:
> 
> commit 145cd60fb481328faafba76842aa0fd242e2b163
> Author: Alexander Potapenko <glider@google.com>
> Date:   Tue Nov 24 05:38:44 2020 +0000
> 
>     mm, kfence: insert KFENCE hooks for SLUB
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
> Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> 
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
[...]
> Call Trace:
>  inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
>  __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
>  tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
>  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
>  __sock_release+0xcd/0x280 net/socket.c:596
>  sock_close+0x18/0x20 net/socket.c:1255
>  __fput+0x283/0x920 fs/file_table.c:280
>  task_work_run+0xdd/0x190 kernel/task_work.c:140
>  exit_task_work include/linux/task_work.h:30 [inline]
>  do_exit+0xb89/0x29e0 kernel/exit.c:823
>  do_group_exit+0x125/0x310 kernel/exit.c:920
>  get_signal+0x3ec/0x2010 kernel/signal.c:2770
>  arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
>  handle_signal_work kernel/entry/common.c:144 [inline]
>  exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
>  exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
>  syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

I've been debugging this and I think enabling KFENCE uncovered that some
code is assuming that the following is always true:

	ksize(kmalloc(S)) == ksize(kmalloc(S))

but I don't think this assumption can be made (with or without KFENCE).

With KFENCE, we actually end up testing no code assumes this, because
KFENCE's ksize() always returns the exact size S.

I have narrowed it down to sk_wmem_queued becoming <0 in
sk_wmem_free_skb().

The skb passed to sk_wmem_free_skb() and whose truesize causes
sk_wmem_queued to become negative is always allocated in:

 | kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
 | __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
 | alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
 | sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
 | tso_fragment net/ipv4/tcp_output.c:2124 [inline]
 | tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
 | __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
 | tcp_push_pending_frames include/net/tcp.h:1864 [inline]
 | tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
 | tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
 | tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
 | sk_backlog_rcv include/net/sock.h:1011 [inline]
 | __release_sock+0x134/0x3a0 net/core/sock.c:2523
 | release_sock+0x54/0x1b0 net/core/sock.c:3053
 | sk_wait_data+0x177/0x450 net/core/sock.c:2565
 | tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
 | inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
 | sock_recvmsg_nosec net/socket.c:885 [inline]
 | sock_recvmsg net/socket.c:903 [inline]
 | sock_recvmsg net/socket.c:899 [inline]
 | ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
 | ___sys_recvmsg+0x127/0x200 net/socket.c:2605
 | __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
 | do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 | entry_SYSCALL_64_after_hwframe+0x44/0xa9

I used the code below to add some warnings that helped narrow it down.

Does any of this help explain the problem?

Thanks,
-- Marco

------ >8 ------


diff --git a/include/net/sock.h b/include/net/sock.h
index e8d958ef3ea0..ef4837f3aba4 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -35,6 +35,7 @@
 #ifndef _SOCK_H
 #define _SOCK_H
 
+#include <linux/kfence.h>
 #include <linux/hardirq.h>
 #include <linux/kernel.h>
 #include <linux/list.h>
@@ -1534,7 +1535,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
 DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 {
+	bool bad = false;
+
 	sk_wmem_queued_add(sk, -skb->truesize);
+
+	if (WARN_ON(READ_ONCE(sk->sk_wmem_queued) == -384)) {
+		pr_info("wmem_queued=%d truesize=%u\n", sk->sk_wmem_queued, skb->truesize);
+		bad = true;
+	}
+
 	sk_mem_uncharge(sk, skb->truesize);
 	if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
 	    !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
@@ -1544,6 +1553,9 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 		return;
 	}
 	__kfree_skb(skb);
+
+	if (bad)
+		(void)READ_ONCE(skb->truesize); /* UAF to let KASAN show where it was allocated */
 }
 
 static inline void sock_release_ownership(struct sock *sk)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index ffe3dcc0ebea..f365495819ee 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -208,6 +208,19 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	size = SKB_DATA_ALIGN(size);
 	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
+
+	if (is_kfence_address(data))
+		pr_info("kfence's ksize: %zu\n", ksize(data));
+	/*
+	 *				BUG BUG
+	 * Hypothesis: The problem is that some code assumes that:
+	 *
+	 *	ksize(kmalloc(S)) == ksize(kmalloc(S))
+	 *
+	 * Note: If we force no KFENCE allocation for @data above, the warnings
+	 * disappear. KFENCE's ksize() always returns the exact size S.
+	 */
+
 	if (!data)
 		goto nodata;
 	/* kmalloc(size) might give us more room than requested.

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 15:58 ` Marco Elver
@ 2020-12-03 16:27   ` Eric Dumazet
  2020-12-03 16:34     ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2020-12-03 16:27 UTC (permalink / raw)
  To: Marco Elver
  Cc: netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <elver@google.com> wrote:
>
> On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    6147c83f Add linux-next specific files for 20201126
> > git tree:       linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> >
> > The issue was bisected to:
> >
> > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > Author: Alexander Potapenko <glider@google.com>
> > Date:   Tue Nov 24 05:38:44 2020 +0000
> >
> >     mm, kfence: insert KFENCE hooks for SLUB
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
> > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> >
> > ------------[ cut here ]------------
> > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> [...]
> > Call Trace:
> >  inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> >  __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> >  tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> >  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> >  __sock_release+0xcd/0x280 net/socket.c:596
> >  sock_close+0x18/0x20 net/socket.c:1255
> >  __fput+0x283/0x920 fs/file_table.c:280
> >  task_work_run+0xdd/0x190 kernel/task_work.c:140
> >  exit_task_work include/linux/task_work.h:30 [inline]
> >  do_exit+0xb89/0x29e0 kernel/exit.c:823
> >  do_group_exit+0x125/0x310 kernel/exit.c:920
> >  get_signal+0x3ec/0x2010 kernel/signal.c:2770
> >  arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> >  handle_signal_work kernel/entry/common.c:144 [inline]
> >  exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> >  exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> >  syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> I've been debugging this and I think enabling KFENCE uncovered that some
> code is assuming that the following is always true:
>
>         ksize(kmalloc(S)) == ksize(kmalloc(S))
>


I do not think we make this assumption.

Each skb tracks the 'truesize' which is populated from __alloc_skb()
using ksize(allocated head) .

So if ksize() decides to give us random data, it should be still fine,
because we use ksize(buff) only once at alloc skb time, and record the
value in skb->truesize
 (only the socket buffer accounting would be off)


> but I don't think this assumption can be made (with or without KFENCE).
>
> With KFENCE, we actually end up testing no code assumes this, because
> KFENCE's ksize() always returns the exact size S.
>
> I have narrowed it down to sk_wmem_queued becoming <0 in
> sk_wmem_free_skb().
>
> The skb passed to sk_wmem_free_skb() and whose truesize causes
> sk_wmem_queued to become negative is always allocated in:
>
>  | kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
>  | __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
>  | alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
>  | sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
>  | tso_fragment net/ipv4/tcp_output.c:2124 [inline]
>  | tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
>  | __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
>  | tcp_push_pending_frames include/net/tcp.h:1864 [inline]
>  | tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
>  | tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
>  | tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
>  | sk_backlog_rcv include/net/sock.h:1011 [inline]
>  | __release_sock+0x134/0x3a0 net/core/sock.c:2523
>  | release_sock+0x54/0x1b0 net/core/sock.c:3053
>  | sk_wait_data+0x177/0x450 net/core/sock.c:2565
>  | tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
>  | inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
>  | sock_recvmsg_nosec net/socket.c:885 [inline]
>  | sock_recvmsg net/socket.c:903 [inline]
>  | sock_recvmsg net/socket.c:899 [inline]
>  | ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
>  | ___sys_recvmsg+0x127/0x200 net/socket.c:2605
>  | __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
>  | do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  | entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> I used the code below to add some warnings that helped narrow it down.
>
> Does any of this help explain the problem?

Not yet :)


tso_fragment() transfers some payload from one skb to another, and
properly shifts this amount from src skb to dst skb (@buff) :

sk_wmem_queued_add(sk, buff->truesize);
buff->truesize += nlen;
skb->truesize -= nlen;

>
> Thanks,
> -- Marco
>
> ------ >8 ------
>
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index e8d958ef3ea0..ef4837f3aba4 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -35,6 +35,7 @@
>  #ifndef _SOCK_H
>  #define _SOCK_H
>
> +#include <linux/kfence.h>
>  #include <linux/hardirq.h>
>  #include <linux/kernel.h>
>  #include <linux/list.h>
> @@ -1534,7 +1535,15 @@ static inline void sk_mem_uncharge(struct sock *sk, int size)
>  DECLARE_STATIC_KEY_FALSE(tcp_tx_skb_cache_key);
>  static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
>  {
> +       bool bad = false;
> +
>         sk_wmem_queued_add(sk, -skb->truesize);
> +
> +       if (WARN_ON(READ_ONCE(sk->sk_wmem_queued) == -384)) {
> +               pr_info("wmem_queued=%d truesize=%u\n", sk->sk_wmem_queued, skb->truesize);
> +               bad = true;
> +       }
> +
>         sk_mem_uncharge(sk, skb->truesize);
>         if (static_branch_unlikely(&tcp_tx_skb_cache_key) &&
>             !sk->sk_tx_skb_cache && !skb_cloned(skb)) {
> @@ -1544,6 +1553,9 @@ static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
>                 return;
>         }
>         __kfree_skb(skb);
> +
> +       if (bad)
> +               (void)READ_ONCE(skb->truesize); /* UAF to let KASAN show where it was allocated */
>  }
>
>  static inline void sock_release_ownership(struct sock *sk)
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index ffe3dcc0ebea..f365495819ee 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -208,6 +208,19 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>         size = SKB_DATA_ALIGN(size);
>         size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>         data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> +
> +       if (is_kfence_address(data))
> +               pr_info("kfence's ksize: %zu\n", ksize(data));
> +       /*
> +        *                              BUG BUG
> +        * Hypothesis: The problem is that some code assumes that:
> +        *
> +        *      ksize(kmalloc(S)) == ksize(kmalloc(S))
> +        *
> +        * Note: If we force no KFENCE allocation for @data above, the warnings
> +        * disappear. KFENCE's ksize() always returns the exact size S.
> +        */
> +
>         if (!data)
>                 goto nodata;
>         /* kmalloc(size) might give us more room than requested.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 16:27   ` Eric Dumazet
@ 2020-12-03 16:34     ` Marco Elver
  2020-12-03 16:42       ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-03 16:34 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <elver@google.com> wrote:
> >
> > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:    6147c83f Add linux-next specific files for 20201126
> > > git tree:       linux-next
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > >
> > > The issue was bisected to:
> > >
> > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > Author: Alexander Potapenko <glider@google.com>
> > > Date:   Tue Nov 24 05:38:44 2020 +0000
> > >
> > >     mm, kfence: insert KFENCE hooks for SLUB
> > >
> > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
> > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > >
> > > ------------[ cut here ]------------
> > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > [...]
> > > Call Trace:
> > >  inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > >  __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > >  tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > >  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > >  __sock_release+0xcd/0x280 net/socket.c:596
> > >  sock_close+0x18/0x20 net/socket.c:1255
> > >  __fput+0x283/0x920 fs/file_table.c:280
> > >  task_work_run+0xdd/0x190 kernel/task_work.c:140
> > >  exit_task_work include/linux/task_work.h:30 [inline]
> > >  do_exit+0xb89/0x29e0 kernel/exit.c:823
> > >  do_group_exit+0x125/0x310 kernel/exit.c:920
> > >  get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > >  arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > >  handle_signal_work kernel/entry/common.c:144 [inline]
> > >  exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > >  exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > >  syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > I've been debugging this and I think enabling KFENCE uncovered that some
> > code is assuming that the following is always true:
> >
> >         ksize(kmalloc(S)) == ksize(kmalloc(S))
> >
>
>
> I do not think we make this assumption.
>
> Each skb tracks the 'truesize' which is populated from __alloc_skb()
> using ksize(allocated head) .
>
> So if ksize() decides to give us random data, it should be still fine,
> because we use ksize(buff) only once at alloc skb time, and record the
> value in skb->truesize
>  (only the socket buffer accounting would be off)

Good, thanks for clarifying. So something else must be off then.

> > but I don't think this assumption can be made (with or without KFENCE).
> >
> > With KFENCE, we actually end up testing no code assumes this, because
> > KFENCE's ksize() always returns the exact size S.
> >
> > I have narrowed it down to sk_wmem_queued becoming <0 in
> > sk_wmem_free_skb().
> >
> > The skb passed to sk_wmem_free_skb() and whose truesize causes
> > sk_wmem_queued to become negative is always allocated in:
> >
> >  | kmem_cache_alloc_node+0x140/0x400 mm/slub.c:2939
> >  | __alloc_skb+0x6d/0x710 net/core/skbuff.c:198
> >  | alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> >  | sk_stream_alloc_skb+0x109/0xc30 net/ipv4/tcp.c:888
> >  | tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> >  | tcp_write_xmit+0x1dbf/0x5ce0 net/ipv4/tcp_output.c:2674
> >  | __tcp_push_pending_frames+0xaa/0x390 net/ipv4/tcp_output.c:2866
> >  | tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> >  | tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> >  | tcp_rcv_established+0x8c9/0x1eb0 net/ipv4/tcp_input.c:5869
> >  | tcp_v4_do_rcv+0x5d1/0x870 net/ipv4/tcp_ipv4.c:1668
> >  | sk_backlog_rcv include/net/sock.h:1011 [inline]
> >  | __release_sock+0x134/0x3a0 net/core/sock.c:2523
> >  | release_sock+0x54/0x1b0 net/core/sock.c:3053
> >  | sk_wait_data+0x177/0x450 net/core/sock.c:2565
> >  | tcp_recvmsg+0x17ea/0x2aa0 net/ipv4/tcp.c:2181
> >  | inet_recvmsg+0x11b/0x5d0 net/ipv4/af_inet.c:848
> >  | sock_recvmsg_nosec net/socket.c:885 [inline]
> >  | sock_recvmsg net/socket.c:903 [inline]
> >  | sock_recvmsg net/socket.c:899 [inline]
> >  | ____sys_recvmsg+0x2c4/0x600 net/socket.c:2563
> >  | ___sys_recvmsg+0x127/0x200 net/socket.c:2605
> >  | __sys_recvmsg+0xe2/0x1a0 net/socket.c:2641
> >  | do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >  | entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > I used the code below to add some warnings that helped narrow it down.
> >
> > Does any of this help explain the problem?
>
> Not yet :)

Damn...

> tso_fragment() transfers some payload from one skb to another, and
> properly shifts this amount from src skb to dst skb (@buff) :
>
> sk_wmem_queued_add(sk, buff->truesize);
> buff->truesize += nlen;
> skb->truesize -= nlen;

Any other stacktraces that might help?

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 16:34     ` Marco Elver
@ 2020-12-03 16:42       ` Eric Dumazet
  2020-12-03 17:41         ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2020-12-03 16:42 UTC (permalink / raw)
  To: Marco Elver
  Cc: netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 3, 2020 at 5:34 PM Marco Elver <elver@google.com> wrote:
>
> On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <edumazet@google.com> wrote:
> > On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <elver@google.com> wrote:
> > >
> > > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:    6147c83f Add linux-next specific files for 20201126
> > > > git tree:       linux-next
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > > >
> > > > The issue was bisected to:
> > > >
> > > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > > Author: Alexander Potapenko <glider@google.com>
> > > > Date:   Tue Nov 24 05:38:44 2020 +0000
> > > >
> > > >     mm, kfence: insert KFENCE hooks for SLUB
> > > >
> > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > > final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
> > > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > > >
> > > > ------------[ cut here ]------------
> > > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > > [...]
> > > > Call Trace:
> > > >  inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > > >  __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > > >  tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > > >  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > > >  __sock_release+0xcd/0x280 net/socket.c:596
> > > >  sock_close+0x18/0x20 net/socket.c:1255
> > > >  __fput+0x283/0x920 fs/file_table.c:280
> > > >  task_work_run+0xdd/0x190 kernel/task_work.c:140
> > > >  exit_task_work include/linux/task_work.h:30 [inline]
> > > >  do_exit+0xb89/0x29e0 kernel/exit.c:823
> > > >  do_group_exit+0x125/0x310 kernel/exit.c:920
> > > >  get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > > >  arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > > >  handle_signal_work kernel/entry/common.c:144 [inline]
> > > >  exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > > >  exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > > >  syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >
> > > I've been debugging this and I think enabling KFENCE uncovered that some
> > > code is assuming that the following is always true:
> > >
> > >         ksize(kmalloc(S)) == ksize(kmalloc(S))
> > >
> >
> >
> > I do not think we make this assumption.
> >
> > Each skb tracks the 'truesize' which is populated from __alloc_skb()
> > using ksize(allocated head) .
> >
> > So if ksize() decides to give us random data, it should be still fine,
> > because we use ksize(buff) only once at alloc skb time, and record the
> > value in skb->truesize
> >  (only the socket buffer accounting would be off)
>
> Good, thanks for clarifying. So something else must be off then.

Actually we might have the following assumption :

buff = kmalloc(size, GFP...)
if (buff)
   ASSERT(ksize(buff) >= size)

So obviously ksize() should not be completely random ;)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 16:42       ` Eric Dumazet
@ 2020-12-03 17:41         ` Marco Elver
  2020-12-03 18:01           ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-03 17:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 03, 2020 at 05:42PM +0100, Eric Dumazet wrote:
> On Thu, Dec 3, 2020 at 5:34 PM Marco Elver <elver@google.com> wrote:
> >
> > On Thu, 3 Dec 2020 at 17:27, Eric Dumazet <edumazet@google.com> wrote:
> > > On Thu, Dec 3, 2020 at 4:58 PM Marco Elver <elver@google.com> wrote:
> > > >
> > > > On Mon, Nov 30, 2020 at 12:40AM -0800, syzbot wrote:
> > > > > Hello,
> > > > >
> > > > > syzbot found the following issue on:
> > > > >
> > > > > HEAD commit:    6147c83f Add linux-next specific files for 20201126
> > > > > git tree:       linux-next
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=117c9679500000
> > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=9b91566da897c24f
> > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=7b99aafdcc2eedea6178
> > > > > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=103bf743500000
> > > > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=167c60c9500000
> > > > >
> > > > > The issue was bisected to:
> > > > >
> > > > > commit 145cd60fb481328faafba76842aa0fd242e2b163
> > > > > Author: Alexander Potapenko <glider@google.com>
> > > > > Date:   Tue Nov 24 05:38:44 2020 +0000
> > > > >
> > > > >     mm, kfence: insert KFENCE hooks for SLUB
> > > > >
> > > > > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13abe5b3500000
> > > > > final oops:     https://syzkaller.appspot.com/x/report.txt?x=106be5b3500000
> > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=17abe5b3500000
> > > > >
> > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > > Reported-by: syzbot+7b99aafdcc2eedea6178@syzkaller.appspotmail.com
> > > > > Fixes: 145cd60fb481 ("mm, kfence: insert KFENCE hooks for SLUB")
> > > > >
> > > > > ------------[ cut here ]------------
> > > > > WARNING: CPU: 0 PID: 11307 at net/core/stream.c:207 sk_stream_kill_queues+0x3c3/0x530 net/core/stream.c:207
> > > > [...]
> > > > > Call Trace:
> > > > >  inet_csk_destroy_sock+0x1a5/0x490 net/ipv4/inet_connection_sock.c:885
> > > > >  __tcp_close+0xd3e/0x1170 net/ipv4/tcp.c:2585
> > > > >  tcp_close+0x29/0xc0 net/ipv4/tcp.c:2597
> > > > >  inet_release+0x12e/0x280 net/ipv4/af_inet.c:431
> > > > >  __sock_release+0xcd/0x280 net/socket.c:596
> > > > >  sock_close+0x18/0x20 net/socket.c:1255
> > > > >  __fput+0x283/0x920 fs/file_table.c:280
> > > > >  task_work_run+0xdd/0x190 kernel/task_work.c:140
> > > > >  exit_task_work include/linux/task_work.h:30 [inline]
> > > > >  do_exit+0xb89/0x29e0 kernel/exit.c:823
> > > > >  do_group_exit+0x125/0x310 kernel/exit.c:920
> > > > >  get_signal+0x3ec/0x2010 kernel/signal.c:2770
> > > > >  arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
> > > > >  handle_signal_work kernel/entry/common.c:144 [inline]
> > > > >  exit_to_user_mode_loop kernel/entry/common.c:168 [inline]
> > > > >  exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:198
> > > > >  syscall_exit_to_user_mode+0x36/0x260 kernel/entry/common.c:275
> > > > >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >
> > > > I've been debugging this and I think enabling KFENCE uncovered that some
> > > > code is assuming that the following is always true:
> > > >
> > > >         ksize(kmalloc(S)) == ksize(kmalloc(S))
> > > >
> > >
> > >
> > > I do not think we make this assumption.
> > >
> > > Each skb tracks the 'truesize' which is populated from __alloc_skb()
> > > using ksize(allocated head) .
> > >
> > > So if ksize() decides to give us random data, it should be still fine,
> > > because we use ksize(buff) only once at alloc skb time, and record the
> > > value in skb->truesize
> > >  (only the socket buffer accounting would be off)
> >
> > Good, thanks for clarifying. So something else must be off then.
> 
> Actually we might have the following assumption :
> 
> buff = kmalloc(size, GFP...)
> if (buff)
>    ASSERT(ksize(buff) >= size)
> 
> So obviously ksize() should not be completely random ;)

One more experiment -- simply adding

--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 */
 	size = SKB_DATA_ALIGN(size);
 	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	size = 1 << kmalloc_index(size); /* HACK */
 	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);


also got rid of the warnings. Something must be off with some value that
is computed in terms of ksize(). If not, I don't have any explanation
for why the above hides the problem.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 17:41         ` Marco Elver
@ 2020-12-03 18:01           ` Eric Dumazet
  2020-12-07 15:30             ` Marco Elver
                               ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Eric Dumazet @ 2020-12-03 18:01 UTC (permalink / raw)
  To: Marco Elver, Eric Dumazet
  Cc: netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot



On 12/3/20 6:41 PM, Marco Elver wrote:

> One more experiment -- simply adding
> 
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
>  	 */
>  	size = SKB_DATA_ALIGN(size);
>  	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> +	size = 1 << kmalloc_index(size); /* HACK */
>  	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> 
> 
> also got rid of the warnings. Something must be off with some value that
> is computed in terms of ksize(). If not, I don't have any explanation
> for why the above hides the problem.

Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
hae some kind of assumptions, I will double check this.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 18:01           ` Eric Dumazet
@ 2020-12-07 15:30             ` Marco Elver
  2020-12-07 16:28             ` Marco Elver
  2020-12-08 19:06             ` Marco Elver
  2 siblings, 0 replies; 18+ messages in thread
From: Marco Elver @ 2020-12-07 15:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 03, 2020 at 07:01PM +0100, Eric Dumazet wrote:
> On 12/3/20 6:41 PM, Marco Elver wrote:
> 
> > One more experiment -- simply adding
> > 
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  	 */
> >  	size = SKB_DATA_ALIGN(size);
> >  	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	size = 1 << kmalloc_index(size); /* HACK */
> >  	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > 
> > 
> > also got rid of the warnings. Something must be off with some value that
> > is computed in terms of ksize(). If not, I don't have any explanation
> > for why the above hides the problem.
> 
> Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> hae some kind of assumptions, I will double check this.

I looked at some of these macros and am wondering why SKB_TRUESIZE()
uses SKB_DATA_ALIGN(sizeof(struct sk_buff)). Because I don't understand
how the memcaches that allocate sk_buff are aligned or somehow always
return SKB_DATA_ALIGN(sizeof(struct sk_buff)) sized objects -- a simple
BUG_ON(ksize(skb) != SKB_DATA_ALIGN(sizeof(struct sk_buff))) triggers.

Alas, doing something like:

--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -235,7 +235,7 @@
 
 /* return minimum truesize of one skb containing X bytes of data */
 #define SKB_TRUESIZE(X) ((X) +						\
-			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
+			 sizeof(struct sk_buff) +			\
 			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))

does not fix the problem.

Still trying to debug, because I don't want this to block the SLUB
enablement of KFENCE, even if it turns out it's not KFENCE. :-/

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 18:01           ` Eric Dumazet
  2020-12-07 15:30             ` Marco Elver
@ 2020-12-07 16:28             ` Marco Elver
  2020-12-08 19:06             ` Marco Elver
  2 siblings, 0 replies; 18+ messages in thread
From: Marco Elver @ 2020-12-07 16:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 03, 2020 at 07:01PM +0100, Eric Dumazet wrote:
> On 12/3/20 6:41 PM, Marco Elver wrote:
> 
> > One more experiment -- simply adding
> > 
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >  	 */
> >  	size = SKB_DATA_ALIGN(size);
> >  	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +	size = 1 << kmalloc_index(size); /* HACK */
> >  	data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > 
> > 
> > also got rid of the warnings. Something must be off with some value that
> > is computed in terms of ksize(). If not, I don't have any explanation
> > for why the above hides the problem.
> 
> Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> hae some kind of assumptions, I will double check this.


Some more data; removing all uses of ksize() fixes the warnings:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -214,7 +214,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
|  	 * Put skb_shared_info exactly at the end of allocated zone,
|  	 * to allow max possible filling before reallocation.
|  	 */
| -	size = SKB_WITH_OVERHEAD(ksize(data));
| +	size = SKB_WITH_OVERHEAD(size);
|  	prefetchw(data + size);
|  
|  	/*
| @@ -1628,7 +1628,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
|  			       gfp_mask, NUMA_NO_NODE, NULL);
|  	if (!data)
|  		goto nodata;
| -	size = SKB_WITH_OVERHEAD(ksize(data));
| +	size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
^^ Reverting *only* this to 'ksize(data)' triggers the warning.
|  	/* Copy only real data... and, alas, header. This should be
|  	 * optimized for the cases when header is void.
| @@ -5901,7 +5901,7 @@ static int pskb_carve_inside_header(struct sk_buff *skb, const u32 off,
|  	if (!data)
|  		return -ENOMEM;
|  
| -	size = SKB_WITH_OVERHEAD(ksize(data));
| +	size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|  
|  	/* Copy real data, and all frags */
|  	skb_copy_from_linear_data_offset(skb, off, data, new_hlen);
| @@ -6025,7 +6025,7 @@ static int pskb_carve_inside_nonlinear(struct sk_buff *skb, const u32 off,
|  	if (!data)
|  		return -ENOMEM;
|  
| -	size = SKB_WITH_OVERHEAD(ksize(data));
| +	size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|  
|  	memcpy((struct skb_shared_info *)(data + size),
|  	       skb_shinfo(skb), offsetof(struct skb_shared_info, frags[0]));


Conversely, only doing this also fixes the warnings:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -1628,7 +1628,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
|  			       gfp_mask, NUMA_NO_NODE, NULL);
|  	if (!data)
|  		goto nodata;
| -	size = SKB_WITH_OVERHEAD(ksize(data));
| +	size = SKB_WITH_OVERHEAD(size + SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
|  
|  	/* Copy only real data... and, alas, header. This should be
|  	 * optimized for the cases when header is void.

But not sure if any of this is helpful, since in the end what we want is
to make a bunch of subtractions reach precisely 0, and any deviation
somewhere might, by chance, achieve that.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-03 18:01           ` Eric Dumazet
  2020-12-07 15:30             ` Marco Elver
  2020-12-07 16:28             ` Marco Elver
@ 2020-12-08 19:06             ` Marco Elver
  2020-12-09 12:47               ` Marco Elver
  2 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-08 19:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On 12/3/20 6:41 PM, Marco Elver wrote:
>
> > One more experiment -- simply adding
> >
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> >        */
> >       size = SKB_DATA_ALIGN(size);
> >       size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +     size = 1 << kmalloc_index(size); /* HACK */
> >       data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> >
> >
> > also got rid of the warnings. Something must be off with some value that
> > is computed in terms of ksize(). If not, I don't have any explanation
> > for why the above hides the problem.
>
> Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> hae some kind of assumptions, I will double check this.

If I force kfence to return 4K sized allocations for everything, the
warnings remain. That might suggest that it's not due to a missed
ALIGN.

Is it possible that copies or moves are a problem? E.g. we copy
something from kfence -> non-kfence object (or vice-versa), and
ksize() no longer matches, then things go wrong?

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-08 19:06             ` Marco Elver
@ 2020-12-09 12:47               ` Marco Elver
  2020-12-10 16:51                 ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-09 12:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On 12/3/20 6:41 PM, Marco Elver wrote:
> >
> > > One more experiment -- simply adding
> > >
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > >        */
> > >       size = SKB_DATA_ALIGN(size);
> > >       size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > +     size = 1 << kmalloc_index(size); /* HACK */
> > >       data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > >
> > >
> > > also got rid of the warnings. Something must be off with some value that
> > > is computed in terms of ksize(). If not, I don't have any explanation
> > > for why the above hides the problem.
> >
> > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > hae some kind of assumptions, I will double check this.
> 
> If I force kfence to return 4K sized allocations for everything, the
> warnings remain. That might suggest that it's not due to a missed
> ALIGN.
> 
> Is it possible that copies or moves are a problem? E.g. we copy
> something from kfence -> non-kfence object (or vice-versa), and
> ksize() no longer matches, then things go wrong?

I was able to narrow it down to allocations of size 640. I also narrowed
it down to 5 allocations that go through kfence that start triggering
the issue. I have attached the list of those 5 allocations with
allocation + free stacks. I'll try to go through them, maybe I get
lucky eventually. :-)

Thanks,
-- Marco

[-- Attachment #2: suspect-allocations.log --]
[-- Type: text/plain, Size: 18154 bytes --]

kfence-#0 [0xffff888436814000-0xffff88843681427f, size=640, cache=kmalloc-1k] allocated by task 5298:
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
 alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
 sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
 tcp_fragment+0x124/0xac0 net/ipv4/tcp_output.c:1569
 __tcp_retransmit_skb+0x5e3/0x1770 net/ipv4/tcp_output.c:3183
 tcp_retransmit_skb+0x2a/0x200 net/ipv4/tcp_output.c:3257
 tcp_retransmit_timer+0x958/0x1a60 net/ipv4/tcp_timer.c:527
 tcp_write_timer_handler+0x4a6/0x5d0 net/ipv4/tcp_timer.c:610
 tcp_write_timer+0x86/0x270 net/ipv4/tcp_timer.c:630
 call_timer_fn+0x145/0x510 kernel/time/timer.c:1417
 expire_timers kernel/time/timer.c:1462 [inline]
 __run_timers.part.0+0x519/0x680 kernel/time/timer.c:1731
 __run_timers kernel/time/timer.c:1712 [inline]
 run_timer_softirq+0x6f/0x110 kernel/time/timer.c:1744
 __do_softirq+0x132/0x40b kernel/softirq.c:298
 asm_call_irq_on_stack+0xf/0x20
 __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
 run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
 do_softirq_own_stack+0x58/0x70 arch/x86/kernel/irq_64.c:77
 invoke_softirq kernel/softirq.c:393 [inline]
 __irq_exit_rcu kernel/softirq.c:423 [inline]
 irq_exit_rcu+0xcd/0x110 kernel/softirq.c:435
 sysvec_apic_timer_interrupt+0x38/0xd0 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
 native_restore_fl arch/x86/include/asm/irqflags.h:41 [inline]
 arch_local_irq_restore arch/x86/include/asm/irqflags.h:84 [inline]
 __raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:160 [inline]
 _raw_spin_unlock_irqrestore+0x20/0x40 kernel/locking/spinlock.c:191
 spin_unlock_irqrestore include/linux/spinlock.h:409 [inline]
 free_debug_processing+0x1fc/0x2e0 mm/slub.c:1255
 __slab_free+0x130/0x5b0 mm/slub.c:2991
 do_slab_free mm/slub.c:3145 [inline]
 slab_free mm/slub.c:3158 [inline]
 kfree+0x532/0x580 mm/slub.c:4156
 tomoyo_realpath_from_path+0x12b/0x3d0 security/tomoyo/realpath.c:291
 tomoyo_get_realpath security/tomoyo/file.c:151 [inline]
 tomoyo_path_number_perm+0x11d/0x420 security/tomoyo/file.c:723
 tomoyo_path_chmod+0x21/0x30 security/tomoyo/tomoyo.c:342
 security_path_chmod+0x87/0xc0 security/security.c:1152
 chmod_common+0xbd/0x280 fs/open.c:578
 vfs_fchmod fs/open.c:598 [inline]
 __do_sys_fchmod fs/open.c:607 [inline]
 __se_sys_fchmod fs/open.c:601 [inline]
 __x64_sys_fchmod+0xaa/0x100 fs/open.c:601
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

freed by task 12099:
 skb_free_head net/core/skbuff.c:595 [inline]
 skb_release_data+0x499/0x4e0 net/core/skbuff.c:615
 skb_release_all net/core/skbuff.c:669 [inline]
 __kfree_skb+0x34/0x50 net/core/skbuff.c:683
 sk_wmem_free_skb include/net/sock.h:1546 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1856 [inline]
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3251 [inline]
 tcp_ack+0x124a/0x3450 net/ipv4/tcp_input.c:3795
 tcp_rcv_established+0x367/0x10b0 net/ipv4/tcp_input.c:5858
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 sk_backlog_rcv include/net/sock.h:1010 [inline]
 __release_sock+0xd7/0x260 net/core/sock.c:2523
 release_sock+0x40/0x120 net/core/sock.c:3053
 sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
 tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
 inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
 sock_recvmsg_nosec net/socket.c:885 [inline]
 sock_recvmsg net/socket.c:903 [inline]
 sock_recvmsg net/socket.c:899 [inline]
 ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
 ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
 __sys_recvmsg+0x8b/0x130 net/socket.c:2641
 __do_sys_recvmsg net/socket.c:2651 [inline]
 __se_sys_recvmsg net/socket.c:2648 [inline]
 __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
---------------------------------
kfence-#1 [0xffff888436816000-0xffff88843681627f, size=640, cache=kmalloc-1k] allocated by task 29:
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
 alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
 sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
 tcp_fragment+0x124/0xac0 net/ipv4/tcp_output.c:1569
 __tcp_retransmit_skb+0x5e3/0x1770 net/ipv4/tcp_output.c:3183
 tcp_retransmit_skb+0x2a/0x200 net/ipv4/tcp_output.c:3257
 tcp_xmit_retransmit_queue.part.0+0x389/0x6f0 net/ipv4/tcp_output.c:3339
 tcp_xmit_retransmit_queue+0x36/0x40 net/ipv4/tcp_output.c:3293
 tcp_xmit_recovery net/ipv4/tcp_input.c:3652 [inline]
 tcp_xmit_recovery+0x64/0xe0 net/ipv4/tcp_input.c:3638
 tcp_ack+0x1a60/0x3450 net/ipv4/tcp_input.c:3825
 tcp_rcv_established+0x367/0x10b0 net/ipv4/tcp_input.c:5858
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 tcp_v4_rcv+0x1e29/0x20c0 net/ipv4/tcp_ipv4.c:2050
 ip_protocol_deliver_rcu+0x31/0x4f0 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x111/0x150 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_local_deliver+0x244/0x250 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:447 [inline]
 ip_rcv_finish+0x14a/0x1d0 net/ipv4/ip_input.c:428
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_rcv+0x1c4/0x1d0 net/ipv4/ip_input.c:539
 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5316
 __netif_receive_skb+0x29/0xe0 net/core/dev.c:5430
 process_backlog+0x169/0x350 net/core/dev.c:6320
 napi_poll net/core/dev.c:6764 [inline]
 net_rx_action+0x326/0xa30 net/core/dev.c:6834
 __do_softirq+0x132/0x40b kernel/softirq.c:298
 run_ksoftirqd kernel/softirq.c:653 [inline]
 run_ksoftirqd+0x21/0x40 kernel/softirq.c:645
 smpboot_thread_fn+0x3e6/0x560 kernel/smpboot.c:165
 kthread+0x24f/0x280 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

freed by task 13615:
 skb_free_head net/core/skbuff.c:595 [inline]
 skb_release_data+0x499/0x4e0 net/core/skbuff.c:615
 skb_release_all net/core/skbuff.c:669 [inline]
 __kfree_skb+0x34/0x50 net/core/skbuff.c:683
 sk_wmem_free_skb include/net/sock.h:1546 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1856 [inline]
 tcp_clean_rtx_queue net/ipv4/tcp_input.c:3251 [inline]
 tcp_ack+0x124a/0x3450 net/ipv4/tcp_input.c:3795
 tcp_rcv_established+0x367/0x10b0 net/ipv4/tcp_input.c:5858
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 sk_backlog_rcv include/net/sock.h:1010 [inline]
 __release_sock+0xd7/0x260 net/core/sock.c:2523
 release_sock+0x40/0x120 net/core/sock.c:3053
 sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
 tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
 inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
 sock_recvmsg_nosec net/socket.c:885 [inline]
 sock_recvmsg net/socket.c:903 [inline]
 sock_recvmsg net/socket.c:899 [inline]
 ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
 ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
 __sys_recvmsg+0x8b/0x130 net/socket.c:2641
 __do_sys_recvmsg net/socket.c:2651 [inline]
 __se_sys_recvmsg net/socket.c:2648 [inline]
 __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
---------------------------------
kfence-#2 [0xffff888436818c00-0xffff888436818e7f, size=640, cache=kmalloc-1k] allocated by task 0:
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
 alloc_skb include/linux/skbuff.h:1094 [inline]
 __tcp_send_ack.part.0+0x47/0x3c0 net/ipv4/tcp_output.c:3945
 __tcp_send_ack net/ipv4/tcp_output.c:3977 [inline]
 tcp_send_ack+0x47/0x50 net/ipv4/tcp_output.c:3977
 __tcp_ack_snd_check+0xb2/0x530 net/ipv4/tcp_input.c:5400
 tcp_ack_snd_check net/ipv4/tcp_input.c:5445 [inline]
 tcp_rcv_established+0x5c2/0x10b0 net/ipv4/tcp_input.c:5870
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 tcp_v4_rcv+0x1e29/0x20c0 net/ipv4/tcp_ipv4.c:2050
 ip_protocol_deliver_rcu+0x31/0x4f0 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x111/0x150 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_local_deliver+0x244/0x250 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:447 [inline]
 ip_rcv_finish+0x14a/0x1d0 net/ipv4/ip_input.c:428
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_rcv+0x1c4/0x1d0 net/ipv4/ip_input.c:539
 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5316
 __netif_receive_skb+0x29/0xe0 net/core/dev.c:5430
 process_backlog+0x169/0x350 net/core/dev.c:6320
 napi_poll net/core/dev.c:6764 [inline]
 net_rx_action+0x326/0xa30 net/core/dev.c:6834
 __do_softirq+0x132/0x40b kernel/softirq.c:298
 asm_call_irq_on_stack+0xf/0x20
 __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
 run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
 do_softirq_own_stack+0x58/0x70 arch/x86/kernel/irq_64.c:77
 invoke_softirq kernel/softirq.c:393 [inline]
 __irq_exit_rcu kernel/softirq.c:423 [inline]
 irq_exit_rcu+0xcd/0x110 kernel/softirq.c:435
 sysvec_apic_timer_interrupt+0x38/0xd0 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
 native_safe_halt arch/x86/include/asm/irqflags.h:60 [inline]
 arch_safe_halt arch/x86/include/asm/irqflags.h:103 [inline]
 default_idle+0xe/0x10 arch/x86/kernel/process.c:688
 default_idle_call+0x32/0x50 kernel/sched/idle.c:98
 cpuidle_idle_call kernel/sched/idle.c:168 [inline]
 do_idle+0x207/0x270 kernel/sched/idle.c:273
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:370
 secondary_startup_64_no_verify+0xb0/0xbb

freed by task 0:
 skb_free_head net/core/skbuff.c:595 [inline]
 skb_release_data+0x499/0x4e0 net/core/skbuff.c:615
 skb_release_all net/core/skbuff.c:669 [inline]
 __kfree_skb+0x34/0x50 net/core/skbuff.c:683
 tcp_data_queue+0x1801/0x2560 net/ipv4/tcp_input.c:4927
 tcp_rcv_established+0x52c/0x10b0 net/ipv4/tcp_input.c:5867
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 tcp_v4_rcv+0x1e29/0x20c0 net/ipv4/tcp_ipv4.c:2050
 ip_protocol_deliver_rcu+0x31/0x4f0 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x111/0x150 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_local_deliver+0x244/0x250 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:447 [inline]
 ip_rcv_finish+0x14a/0x1d0 net/ipv4/ip_input.c:428
 NF_HOOK include/linux/netfilter.h:301 [inline]
 NF_HOOK include/linux/netfilter.h:295 [inline]
 ip_rcv+0x1c4/0x1d0 net/ipv4/ip_input.c:539
 __netif_receive_skb_one_core+0xb3/0xe0 net/core/dev.c:5316
 __netif_receive_skb+0x29/0xe0 net/core/dev.c:5430
 process_backlog+0x169/0x350 net/core/dev.c:6320
 napi_poll net/core/dev.c:6764 [inline]
 net_rx_action+0x326/0xa30 net/core/dev.c:6834
 __do_softirq+0x132/0x40b kernel/softirq.c:298
 asm_call_irq_on_stack+0xf/0x20
 __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
 run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
 do_softirq_own_stack+0x58/0x70 arch/x86/kernel/irq_64.c:77
 invoke_softirq kernel/softirq.c:393 [inline]
 __irq_exit_rcu kernel/softirq.c:423 [inline]
 irq_exit_rcu+0xcd/0x110 kernel/softirq.c:435
 sysvec_apic_timer_interrupt+0x38/0xd0 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
 native_safe_halt arch/x86/include/asm/irqflags.h:60 [inline]
 arch_safe_halt arch/x86/include/asm/irqflags.h:103 [inline]
 default_idle+0xe/0x10 arch/x86/kernel/process.c:688
 default_idle_call+0x32/0x50 kernel/sched/idle.c:98
 cpuidle_idle_call kernel/sched/idle.c:168 [inline]
 do_idle+0x207/0x270 kernel/sched/idle.c:273
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:370
 secondary_startup_64_no_verify+0xb0/0xbb
---------------------------------
kfence-#3 [0xffff88843681ac00-0xffff88843681ae7f, size=640, cache=kmalloc-1k] allocated by task 17012:
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
 alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
 sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
 tso_fragment net/ipv4/tcp_output.c:2124 [inline]
 tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
 __tcp_push_pending_frames+0x68/0x1f0 net/ipv4/tcp_output.c:2866
 tcp_push_pending_frames include/net/tcp.h:1864 [inline]
 tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
 tcp_rcv_established+0x57c/0x10b0 net/ipv4/tcp_input.c:5869
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 sk_backlog_rcv include/net/sock.h:1010 [inline]
 __release_sock+0xd7/0x260 net/core/sock.c:2523
 release_sock+0x40/0x120 net/core/sock.c:3053
 sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
 tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
 inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
 sock_recvmsg_nosec net/socket.c:885 [inline]
 sock_recvmsg net/socket.c:903 [inline]
 sock_recvmsg net/socket.c:899 [inline]
 ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
 ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
 __sys_recvmsg+0x8b/0x130 net/socket.c:2641
 __do_sys_recvmsg net/socket.c:2651 [inline]
 __se_sys_recvmsg net/socket.c:2648 [inline]
 __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

freed by task 17012:
 skb_free_head net/core/skbuff.c:595 [inline]
 skb_release_data+0x499/0x4e0 net/core/skbuff.c:615
 skb_release_all net/core/skbuff.c:669 [inline]
 __kfree_skb+0x34/0x50 net/core/skbuff.c:683
 tcp_drop net/ipv4/tcp_input.c:4618 [inline]
 tcp_prune_ofo_queue+0x14b/0x3e0 net/ipv4/tcp_input.c:5250
 tcp_prune_queue net/ipv4/tcp_input.c:5307 [inline]
 tcp_try_rmem_schedule+0x9a2/0xbc0 net/ipv4/tcp_input.c:4680
 tcp_data_queue_ofo net/ipv4/tcp_input.c:4701 [inline]
 tcp_data_queue+0x2dd/0x2560 net/ipv4/tcp_input.c:5015
 tcp_rcv_established+0x52c/0x10b0 net/ipv4/tcp_input.c:5867
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 sk_backlog_rcv include/net/sock.h:1010 [inline]
 __release_sock+0xd7/0x260 net/core/sock.c:2523
 release_sock+0x40/0x120 net/core/sock.c:3053
 sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
 tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
 inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
 sock_recvmsg_nosec net/socket.c:885 [inline]
 sock_recvmsg net/socket.c:903 [inline]
 sock_recvmsg net/socket.c:899 [inline]
 ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
 ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
 __sys_recvmsg+0x8b/0x130 net/socket.c:2641
 __do_sys_recvmsg net/socket.c:2651 [inline]
 __se_sys_recvmsg net/socket.c:2648 [inline]
 __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
---------------------------------
kfence-#4 [0xffff88843681c000-0xffff88843681c27f, size=640, cache=kmalloc-1k] allocated by task 0:
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
 alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
 sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
 tso_fragment net/ipv4/tcp_output.c:2124 [inline]
 tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
 tcp_send_loss_probe+0x337/0x4c0 net/ipv4/tcp_output.c:2804
 tcp_write_timer_handler+0x4d4/0x5d0 net/ipv4/tcp_timer.c:606
 tcp_write_timer+0x86/0x270 net/ipv4/tcp_timer.c:630
 call_timer_fn+0x145/0x510 kernel/time/timer.c:1417
 expire_timers kernel/time/timer.c:1462 [inline]
 __run_timers.part.0+0x519/0x680 kernel/time/timer.c:1731
 __run_timers kernel/time/timer.c:1712 [inline]
 run_timer_softirq+0x6f/0x110 kernel/time/timer.c:1744
 __do_softirq+0x132/0x40b kernel/softirq.c:298
 asm_call_irq_on_stack+0xf/0x20
 __run_on_irqstack arch/x86/include/asm/irq_stack.h:26 [inline]
 run_on_irqstack_cond arch/x86/include/asm/irq_stack.h:77 [inline]
 do_softirq_own_stack+0x58/0x70 arch/x86/kernel/irq_64.c:77
 invoke_softirq kernel/softirq.c:393 [inline]
 __irq_exit_rcu kernel/softirq.c:423 [inline]
 irq_exit_rcu+0xcd/0x110 kernel/softirq.c:435
 sysvec_apic_timer_interrupt+0x38/0xd0 arch/x86/kernel/apic/apic.c:1096
 asm_sysvec_apic_timer_interrupt+0x12/0x20 arch/x86/include/asm/idtentry.h:628
 native_safe_halt arch/x86/include/asm/irqflags.h:60 [inline]
 arch_safe_halt arch/x86/include/asm/irqflags.h:103 [inline]
 default_idle+0xe/0x10 arch/x86/kernel/process.c:688
 default_idle_call+0x32/0x50 kernel/sched/idle.c:98
 cpuidle_idle_call kernel/sched/idle.c:168 [inline]
 do_idle+0x207/0x270 kernel/sched/idle.c:273
 cpu_startup_entry+0x14/0x20 kernel/sched/idle.c:370
 secondary_startup_64_no_verify+0xb0/0xbb

freed by task 17200:
 skb_free_head net/core/skbuff.c:595 [inline]
 skb_release_data+0x499/0x4e0 net/core/skbuff.c:615
 skb_release_all net/core/skbuff.c:669 [inline]
 __kfree_skb+0x34/0x50 net/core/skbuff.c:683
 sk_wmem_free_skb include/net/sock.h:1546 [inline]
 tcp_rtx_queue_unlink_and_free include/net/tcp.h:1856 [inline]
 tcp_shifted_skb+0x4f8/0x960 net/ipv4/tcp_input.c:1464
 tcp_shift_skb_data net/ipv4/tcp_input.c:1607 [inline]
 tcp_sacktag_walk+0x7e0/0xc40 net/ipv4/tcp_input.c:1670
 tcp_sacktag_write_queue+0xd5e/0x1b50 net/ipv4/tcp_input.c:1931
 tcp_ack+0x1fcd/0x3450 net/ipv4/tcp_input.c:3758
 tcp_rcv_established+0x367/0x10b0 net/ipv4/tcp_input.c:5858
 tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
 sk_backlog_rcv include/net/sock.h:1010 [inline]
 __release_sock+0xd7/0x260 net/core/sock.c:2523
 release_sock+0x40/0x120 net/core/sock.c:3053
 sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
 tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
 inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
 sock_recvmsg_nosec net/socket.c:885 [inline]
 sock_recvmsg net/socket.c:903 [inline]
 sock_recvmsg net/socket.c:899 [inline]
 ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
 ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
 __sys_recvmsg+0x8b/0x130 net/socket.c:2641
 __do_sys_recvmsg net/socket.c:2651 [inline]
 __se_sys_recvmsg net/socket.c:2648 [inline]
 __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
 do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-09 12:47               ` Marco Elver
@ 2020-12-10 16:51                 ` Marco Elver
  2020-12-10 17:14                   ` Eric Dumazet
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-10 16:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Wed, Dec 09, 2020 at 01:47PM +0100, Marco Elver wrote:
> On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> > On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > On 12/3/20 6:41 PM, Marco Elver wrote:
> > >
> > > > One more experiment -- simply adding
> > > >
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > >        */
> > > >       size = SKB_DATA_ALIGN(size);
> > > >       size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > +     size = 1 << kmalloc_index(size); /* HACK */
> > > >       data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > >
> > > >
> > > > also got rid of the warnings. Something must be off with some value that
> > > > is computed in terms of ksize(). If not, I don't have any explanation
> > > > for why the above hides the problem.
> > >
> > > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > > hae some kind of assumptions, I will double check this.
> > 
> > If I force kfence to return 4K sized allocations for everything, the
> > warnings remain. That might suggest that it's not due to a missed
> > ALIGN.
> > 
> > Is it possible that copies or moves are a problem? E.g. we copy
> > something from kfence -> non-kfence object (or vice-versa), and
> > ksize() no longer matches, then things go wrong?
> 
> I was able to narrow it down to allocations of size 640. I also narrowed
> it down to 5 allocations that go through kfence that start triggering
> the issue. I have attached the list of those 5 allocations with
> allocation + free stacks. I'll try to go through them, maybe I get
> lucky eventually. :-)

[...]

> kfence-#3 [0xffff88843681ac00-0xffff88843681ae7f, size=640, cache=kmalloc-1k] allocated by task 17012:
>  __kmalloc_reserve net/core/skbuff.c:142 [inline]
>  __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
>  alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
>  sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
>  tso_fragment net/ipv4/tcp_output.c:2124 [inline]
>  tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
>  __tcp_push_pending_frames+0x68/0x1f0 net/ipv4/tcp_output.c:2866
>  tcp_push_pending_frames include/net/tcp.h:1864 [inline]
>  tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
>  tcp_rcv_established+0x57c/0x10b0 net/ipv4/tcp_input.c:5869
>  tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
>  sk_backlog_rcv include/net/sock.h:1010 [inline]
>  __release_sock+0xd7/0x260 net/core/sock.c:2523
>  release_sock+0x40/0x120 net/core/sock.c:3053
>  sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
>  tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
>  inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
>  sock_recvmsg_nosec net/socket.c:885 [inline]
>  sock_recvmsg net/socket.c:903 [inline]
>  sock_recvmsg net/socket.c:899 [inline]
>  ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
>  ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
>  __sys_recvmsg+0x8b/0x130 net/socket.c:2641
>  __do_sys_recvmsg net/socket.c:2651 [inline]
>  __se_sys_recvmsg net/socket.c:2648 [inline]
>  __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
>  do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So I started putting gdb to work, and whenever I see an allocation
exactly like the above that goes through tso_fragment() a warning
immediately follows.

Long story short, I somehow synthesized this patch that appears to fix
things, but I can't explain why exactly:

| --- a/net/core/skbuff.c
| +++ b/net/core/skbuff.c
| @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
|  
|  	skb_metadata_clear(skb);
|  
| -	/* It is not generally safe to change skb->truesize.
| -	 * For the moment, we really care of rx path, or
| -	 * when skb is orphaned (not attached to a socket).
| -	 */
| -	if (!skb->sk || skb->destructor == sock_edemux)
| -		skb->truesize += size - osize;
| -
|  	return 0;
|  
|  nofrags:

Now, here are the breadcrumbs I followed:


1. 	Breakpoint on kfence_ksize() -- first allocation that matches the above:

	| #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
	| #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
	| #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
	| #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
	| #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
	| #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
	| #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
	|     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
	| #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
	|     at net/ipv4/tcp_output.c:2674
	| #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
	|     at ./include/net/sock.h:918
	| #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
	| #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
	| #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
	| #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
	| [...]	

	Set watchpoint on skb->truesize:

	| (gdb) frame 3
	| #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
	| 217             size = SKB_WITH_OVERHEAD(ksize(data));
	| (gdb) p &skb->truesize
	| $5 = (unsigned int *) 0xffff888117f55f90
	| (gdb) awatch *0xffff888117f55f90
	| Hardware access (read/write) watchpoint 6: *0xffff888117f55f90

2. 	Some time later, we see that the skb with kfence-allocated data
	is cloned:

	| Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
	|
	| Value = 1570
	| 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
	| 1002            C(truesize);
	| (gdb) bt
	| #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
	| #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
	| #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
	|     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
	| #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
	|     sk=<optimized out>) at ./include/linux/tcp.h:439
	| #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
	|     at net/ipv4/tcp_output.c:2688
	| #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
	|     at ./include/net/sock.h:918
	| #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
	| #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
	| #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
	| #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
	| #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
	[...]


3. 	The original skb (that was cloned) has its truesize adjusted
	after a pskb_expand_head():

	| Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
	| 
	| Old value = 1570
	| New value = 1954

	^^ the difference between the old and the new value is exactly
	384, which is also the final underflow of the sk_wmem_queued
	that triggers the warning. Presumably if the original allocation
	had been through kmalloc-1k and not KFENCE, the difference here
	would have been 0, since ksize() of the original allocation in
	step (1) would have been 1024, and not 640 (difference of 384).

	| 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
	|     at net/core/skbuff.c:1687
	| 1687                    skb->truesize += size - osize;
	| (gdb) bt
	| #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
	|     at net/core/skbuff.c:1687
	| #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
	| #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
	| #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
	| #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
	|     to=<optimized out>) at net/ipv4/tcp_input.c:1497
	| #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
	|     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
	| #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
	|     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
	|     at net/ipv4/tcp_input.c:1670
	| #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
	|     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
	| #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
	| #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
	| #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
	[...]


Any of this make sense?

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-10 16:51                 ` Marco Elver
@ 2020-12-10 17:14                   ` Eric Dumazet
  2020-12-10 19:01                     ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2020-12-10 17:14 UTC (permalink / raw)
  To: Marco Elver
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
>
> On Wed, Dec 09, 2020 at 01:47PM +0100, Marco Elver wrote:
> > On Tue, Dec 08, 2020 at 08:06PM +0100, Marco Elver wrote:
> > > On Thu, 3 Dec 2020 at 19:01, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > On 12/3/20 6:41 PM, Marco Elver wrote:
> > > >
> > > > > One more experiment -- simply adding
> > > > >
> > > > > --- a/net/core/skbuff.c
> > > > > +++ b/net/core/skbuff.c
> > > > > @@ -207,7 +207,21 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
> > > > >        */
> > > > >       size = SKB_DATA_ALIGN(size);
> > > > >       size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > > +     size = 1 << kmalloc_index(size); /* HACK */
> > > > >       data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
> > > > >
> > > > >
> > > > > also got rid of the warnings. Something must be off with some value that
> > > > > is computed in terms of ksize(). If not, I don't have any explanation
> > > > > for why the above hides the problem.
> > > >
> > > > Maybe the implementations of various macros (SKB_DATA_ALIGN and friends)
> > > > hae some kind of assumptions, I will double check this.
> > >
> > > If I force kfence to return 4K sized allocations for everything, the
> > > warnings remain. That might suggest that it's not due to a missed
> > > ALIGN.
> > >
> > > Is it possible that copies or moves are a problem? E.g. we copy
> > > something from kfence -> non-kfence object (or vice-versa), and
> > > ksize() no longer matches, then things go wrong?
> >
> > I was able to narrow it down to allocations of size 640. I also narrowed
> > it down to 5 allocations that go through kfence that start triggering
> > the issue. I have attached the list of those 5 allocations with
> > allocation + free stacks. I'll try to go through them, maybe I get
> > lucky eventually. :-)
>
> [...]
>
> > kfence-#3 [0xffff88843681ac00-0xffff88843681ae7f, size=640, cache=kmalloc-1k] allocated by task 17012:
> >  __kmalloc_reserve net/core/skbuff.c:142 [inline]
> >  __alloc_skb+0xb8/0x3f0 net/core/skbuff.c:210
> >  alloc_skb_fclone include/linux/skbuff.h:1144 [inline]
> >  sk_stream_alloc_skb+0xd3/0x650 net/ipv4/tcp.c:888
> >  tso_fragment net/ipv4/tcp_output.c:2124 [inline]
> >  tcp_write_xmit+0x1366/0x3510 net/ipv4/tcp_output.c:2674
> >  __tcp_push_pending_frames+0x68/0x1f0 net/ipv4/tcp_output.c:2866
> >  tcp_push_pending_frames include/net/tcp.h:1864 [inline]
> >  tcp_data_snd_check net/ipv4/tcp_input.c:5374 [inline]
> >  tcp_rcv_established+0x57c/0x10b0 net/ipv4/tcp_input.c:5869
> >  tcp_v4_do_rcv+0x361/0x4c0 net/ipv4/tcp_ipv4.c:1668
> >  sk_backlog_rcv include/net/sock.h:1010 [inline]
> >  __release_sock+0xd7/0x260 net/core/sock.c:2523
> >  release_sock+0x40/0x120 net/core/sock.c:3053
> >  sk_wait_data+0x127/0x2b0 net/core/sock.c:2565
> >  tcp_recvmsg+0x1106/0x1b60 net/ipv4/tcp.c:2181
> >  inet_recvmsg+0xb1/0x270 net/ipv4/af_inet.c:848
> >  sock_recvmsg_nosec net/socket.c:885 [inline]
> >  sock_recvmsg net/socket.c:903 [inline]
> >  sock_recvmsg net/socket.c:899 [inline]
> >  ____sys_recvmsg+0x2fd/0x3a0 net/socket.c:2563
> >  ___sys_recvmsg+0xd9/0x1b0 net/socket.c:2605
> >  __sys_recvmsg+0x8b/0x130 net/socket.c:2641
> >  __do_sys_recvmsg net/socket.c:2651 [inline]
> >  __se_sys_recvmsg net/socket.c:2648 [inline]
> >  __x64_sys_recvmsg+0x43/0x50 net/socket.c:2648
> >  do_syscall_64+0x34/0x80 arch/x86/entry/common.c:46
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> So I started putting gdb to work, and whenever I see an allocation
> exactly like the above that goes through tso_fragment() a warning
> immediately follows.
>
> Long story short, I somehow synthesized this patch that appears to fix
> things, but I can't explain why exactly:
>
> | --- a/net/core/skbuff.c
> | +++ b/net/core/skbuff.c
> | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> |
> |       skb_metadata_clear(skb);
> |
> | -     /* It is not generally safe to change skb->truesize.
> | -      * For the moment, we really care of rx path, or
> | -      * when skb is orphaned (not attached to a socket).
> | -      */
> | -     if (!skb->sk || skb->destructor == sock_edemux)
> | -             skb->truesize += size - osize;
> | -
> |       return 0;
> |
> |  nofrags:
>
> Now, here are the breadcrumbs I followed:
>
>
> 1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:
>
>         | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
>         | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
>         | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
>         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
>         | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
>         | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
>         | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
>         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
>         | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
>         |     at net/ipv4/tcp_output.c:2674
>         | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
>         |     at ./include/net/sock.h:918
>         | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
>         | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
>         | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
>         | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
>         | [...]
>
>         Set watchpoint on skb->truesize:
>
>         | (gdb) frame 3
>         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
>         | 217             size = SKB_WITH_OVERHEAD(ksize(data));
>         | (gdb) p &skb->truesize
>         | $5 = (unsigned int *) 0xffff888117f55f90
>         | (gdb) awatch *0xffff888117f55f90
>         | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
>
> 2.      Some time later, we see that the skb with kfence-allocated data
>         is cloned:
>
>         | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
>         |
>         | Value = 1570
>         | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
>         | 1002            C(truesize);
>         | (gdb) bt
>         | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
>         | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
>         | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
>         |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
>         | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
>         |     sk=<optimized out>) at ./include/linux/tcp.h:439
>         | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
>         |     at net/ipv4/tcp_output.c:2688
>         | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
>         |     at ./include/net/sock.h:918
>         | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
>         | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
>         | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
>         | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
>         | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
>         [...]
>
>
> 3.      The original skb (that was cloned) has its truesize adjusted
>         after a pskb_expand_head():
>
>         | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
>         |
>         | Old value = 1570
>         | New value = 1954
>
>         ^^ the difference between the old and the new value is exactly
>         384, which is also the final underflow of the sk_wmem_queued
>         that triggers the warning. Presumably if the original allocation
>         had been through kmalloc-1k and not KFENCE, the difference here
>         would have been 0, since ksize() of the original allocation in
>         step (1) would have been 1024, and not 640 (difference of 384).
>
>         | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
>         |     at net/core/skbuff.c:1687
>         | 1687                    skb->truesize += size - osize;
>         | (gdb) bt
>         | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
>         |     at net/core/skbuff.c:1687
>         | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
>         | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
>         | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
>         | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
>         |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
>         | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
>         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
>         | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
>         |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
>         |     at net/ipv4/tcp_input.c:1670
>         | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
>         |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
>         | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
>         | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
>         | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
>         [...]
>
>
> Any of this make sense?


Very nice debugging !

I guess we could fix this in skb_prepare_for_shift(), eventually
caring for the truesize manipulation
(or reverting the change done in pskb_expand_head(), since only kfence
is having this issue.

(All TCP skbs in output path have the same allocation size for skb->head)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
  */
 static int skb_prepare_for_shift(struct sk_buff *skb)
 {
-       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+       unsigned int ret = 0, save;
+
+       if (skb_cloned(skb)) {
+               save = skb->truesize;
+               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
+               skb->truesize = save;
+       }
+       return ret;
 }

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-10 17:14                   ` Eric Dumazet
@ 2020-12-10 19:01                     ` Marco Elver
  2020-12-14 10:09                       ` Marco Elver
  0 siblings, 1 reply; 18+ messages in thread
From: Marco Elver @ 2020-12-10 19:01 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
[...]
> > So I started putting gdb to work, and whenever I see an allocation
> > exactly like the above that goes through tso_fragment() a warning
> > immediately follows.
> >
> > Long story short, I somehow synthesized this patch that appears to fix
> > things, but I can't explain why exactly:
> >
> > | --- a/net/core/skbuff.c
> > | +++ b/net/core/skbuff.c
> > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > |
> > |       skb_metadata_clear(skb);
> > |
> > | -     /* It is not generally safe to change skb->truesize.
> > | -      * For the moment, we really care of rx path, or
> > | -      * when skb is orphaned (not attached to a socket).
> > | -      */
> > | -     if (!skb->sk || skb->destructor == sock_edemux)
> > | -             skb->truesize += size - osize;
> > | -
> > |       return 0;
> > |
> > |  nofrags:
> >
> > Now, here are the breadcrumbs I followed:
> >
> >
> > 1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:
> >
> >         | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> >         | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> >         | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> >         | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> >         | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> >         | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> >         | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> >         |     at net/ipv4/tcp_output.c:2674
> >         | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> >         |     at ./include/net/sock.h:918
> >         | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> >         | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> >         | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> >         | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> >         | [...]
> >
> >         Set watchpoint on skb->truesize:
> >
> >         | (gdb) frame 3
> >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> >         | 217             size = SKB_WITH_OVERHEAD(ksize(data));
> >         | (gdb) p &skb->truesize
> >         | $5 = (unsigned int *) 0xffff888117f55f90
> >         | (gdb) awatch *0xffff888117f55f90
> >         | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> >
> > 2.      Some time later, we see that the skb with kfence-allocated data
> >         is cloned:
> >
> >         | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> >         |
> >         | Value = 1570
> >         | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> >         | 1002            C(truesize);
> >         | (gdb) bt
> >         | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> >         | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> >         | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> >         |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> >         | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> >         |     sk=<optimized out>) at ./include/linux/tcp.h:439
> >         | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> >         |     at net/ipv4/tcp_output.c:2688
> >         | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> >         |     at ./include/net/sock.h:918
> >         | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> >         | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> >         | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> >         | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> >         | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> >         [...]
> >
> >
> > 3.      The original skb (that was cloned) has its truesize adjusted
> >         after a pskb_expand_head():
> >
> >         | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> >         |
> >         | Old value = 1570
> >         | New value = 1954
> >
> >         ^^ the difference between the old and the new value is exactly
> >         384, which is also the final underflow of the sk_wmem_queued
> >         that triggers the warning. Presumably if the original allocation
> >         had been through kmalloc-1k and not KFENCE, the difference here
> >         would have been 0, since ksize() of the original allocation in
> >         step (1) would have been 1024, and not 640 (difference of 384).
> >
> >         | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> >         |     at net/core/skbuff.c:1687
> >         | 1687                    skb->truesize += size - osize;
> >         | (gdb) bt
> >         | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> >         |     at net/core/skbuff.c:1687
> >         | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> >         | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> >         | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> >         | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> >         |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
> >         | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> >         | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> >         |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> >         |     at net/ipv4/tcp_input.c:1670
> >         | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> >         |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> >         | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> >         | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> >         | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> >         [...]
> >
> >
> > Any of this make sense?
>
> Very nice debugging !
>
> I guess we could fix this in skb_prepare_for_shift(), eventually
> caring for the truesize manipulation
> (or reverting the change done in pskb_expand_head(), since only kfence
> is having this issue.

Phew, good to hear I finally got lucky. :-)

Either option is fine, as long as it avoids this problem in future.
Hopefully it can be fixed for 5.11.

> (All TCP skbs in output path have the same allocation size for skb->head)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
>   */
>  static int skb_prepare_for_shift(struct sk_buff *skb)
>  {
> -       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> +       unsigned int ret = 0, save;
> +
> +       if (skb_cloned(skb)) {
> +               save = skb->truesize;
> +               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> +               skb->truesize = save;
> +       }
> +       return ret;
>  }

FWIW,

    Tested-by: Marco Elver <elver@google.com>

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-10 19:01                     ` Marco Elver
@ 2020-12-14 10:09                       ` Marco Elver
  2020-12-14 10:47                         ` Eric Dumazet
  2021-02-03 10:25                         ` Marco Elver
  0 siblings, 2 replies; 18+ messages in thread
From: Marco Elver @ 2020-12-14 10:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Thu, 10 Dec 2020 at 20:01, Marco Elver <elver@google.com> wrote:
> On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <edumazet@google.com> wrote:
> > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
> [...]
> > > So I started putting gdb to work, and whenever I see an allocation
> > > exactly like the above that goes through tso_fragment() a warning
> > > immediately follows.
> > >
> > > Long story short, I somehow synthesized this patch that appears to fix
> > > things, but I can't explain why exactly:
> > >
> > > | --- a/net/core/skbuff.c
> > > | +++ b/net/core/skbuff.c
> > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > |
> > > |       skb_metadata_clear(skb);
> > > |
> > > | -     /* It is not generally safe to change skb->truesize.
> > > | -      * For the moment, we really care of rx path, or
> > > | -      * when skb is orphaned (not attached to a socket).
> > > | -      */
> > > | -     if (!skb->sk || skb->destructor == sock_edemux)
> > > | -             skb->truesize += size - osize;
> > > | -
> > > |       return 0;
> > > |
> > > |  nofrags:
> > >
> > > Now, here are the breadcrumbs I followed:
> > >
> > >
> > > 1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > >
> > >         | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > >         | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > >         | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > >         | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > >         | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > >         | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > >         | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > >         |     at net/ipv4/tcp_output.c:2674
> > >         | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > >         |     at ./include/net/sock.h:918
> > >         | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > >         | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > >         | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > >         | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > >         | [...]
> > >
> > >         Set watchpoint on skb->truesize:
> > >
> > >         | (gdb) frame 3
> > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > >         | 217             size = SKB_WITH_OVERHEAD(ksize(data));
> > >         | (gdb) p &skb->truesize
> > >         | $5 = (unsigned int *) 0xffff888117f55f90
> > >         | (gdb) awatch *0xffff888117f55f90
> > >         | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > >
> > > 2.      Some time later, we see that the skb with kfence-allocated data
> > >         is cloned:
> > >
> > >         | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > >         |
> > >         | Value = 1570
> > >         | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > >         | 1002            C(truesize);
> > >         | (gdb) bt
> > >         | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > >         | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > >         | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > >         |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > >         | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > >         |     sk=<optimized out>) at ./include/linux/tcp.h:439
> > >         | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > >         |     at net/ipv4/tcp_output.c:2688
> > >         | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > >         |     at ./include/net/sock.h:918
> > >         | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > >         | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > >         | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > >         | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > >         | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > >         [...]
> > >
> > >
> > > 3.      The original skb (that was cloned) has its truesize adjusted
> > >         after a pskb_expand_head():
> > >
> > >         | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > >         |
> > >         | Old value = 1570
> > >         | New value = 1954
> > >
> > >         ^^ the difference between the old and the new value is exactly
> > >         384, which is also the final underflow of the sk_wmem_queued
> > >         that triggers the warning. Presumably if the original allocation
> > >         had been through kmalloc-1k and not KFENCE, the difference here
> > >         would have been 0, since ksize() of the original allocation in
> > >         step (1) would have been 1024, and not 640 (difference of 384).
> > >
> > >         | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > >         |     at net/core/skbuff.c:1687
> > >         | 1687                    skb->truesize += size - osize;
> > >         | (gdb) bt
> > >         | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > >         |     at net/core/skbuff.c:1687
> > >         | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > >         | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > >         | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > >         | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > >         |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > >         | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > >         | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > >         |     at net/ipv4/tcp_input.c:1670
> > >         | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > >         | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > >         | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > >         | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > >         [...]
> > >
> > >
> > > Any of this make sense?
> >
> > Very nice debugging !
> >
> > I guess we could fix this in skb_prepare_for_shift(), eventually
> > caring for the truesize manipulation
> > (or reverting the change done in pskb_expand_head(), since only kfence
> > is having this issue.
>
> Phew, good to hear I finally got lucky. :-)
>
> Either option is fine, as long as it avoids this problem in future.
> Hopefully it can be fixed for 5.11.
>
> > (All TCP skbs in output path have the same allocation size for skb->head)
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> >   */
> >  static int skb_prepare_for_shift(struct sk_buff *skb)
> >  {
> > -       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > +       unsigned int ret = 0, save;
> > +
> > +       if (skb_cloned(skb)) {
> > +               save = skb->truesize;
> > +               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > +               skb->truesize = save;
> > +       }
> > +       return ret;
> >  }
>
> FWIW,
>
>     Tested-by: Marco Elver <elver@google.com>

Has this patch, or similar, already been sent?

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-14 10:09                       ` Marco Elver
@ 2020-12-14 10:47                         ` Eric Dumazet
  2021-01-11  8:55                           ` Marco Elver
  2021-02-03 10:25                         ` Marco Elver
  1 sibling, 1 reply; 18+ messages in thread
From: Eric Dumazet @ 2020-12-14 10:47 UTC (permalink / raw)
  To: Marco Elver
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Mon, Dec 14, 2020 at 11:09 AM Marco Elver <elver@google.com> wrote:
>
> On Thu, 10 Dec 2020 at 20:01, Marco Elver <elver@google.com> wrote:
> > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <edumazet@google.com> wrote:
> > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
> > [...]
> > > > So I started putting gdb to work, and whenever I see an allocation
> > > > exactly like the above that goes through tso_fragment() a warning
> > > > immediately follows.
> > > >
> > > > Long story short, I somehow synthesized this patch that appears to fix
> > > > things, but I can't explain why exactly:
> > > >
> > > > | --- a/net/core/skbuff.c
> > > > | +++ b/net/core/skbuff.c
> > > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > |
> > > > |       skb_metadata_clear(skb);
> > > > |
> > > > | -     /* It is not generally safe to change skb->truesize.
> > > > | -      * For the moment, we really care of rx path, or
> > > > | -      * when skb is orphaned (not attached to a socket).
> > > > | -      */
> > > > | -     if (!skb->sk || skb->destructor == sock_edemux)
> > > > | -             skb->truesize += size - osize;
> > > > | -
> > > > |       return 0;
> > > > |
> > > > |  nofrags:
> > > >
> > > > Now, here are the breadcrumbs I followed:
> > > >
> > > >
> > > > 1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > > >
> > > >         | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > > >         | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > > >         | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > >         | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > > >         | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > >         | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > > >         | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > > >         |     at net/ipv4/tcp_output.c:2674
> > > >         | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > >         |     at ./include/net/sock.h:918
> > > >         | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > >         | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > >         | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > >         | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > > >         | [...]
> > > >
> > > >         Set watchpoint on skb->truesize:
> > > >
> > > >         | (gdb) frame 3
> > > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > >         | 217             size = SKB_WITH_OVERHEAD(ksize(data));
> > > >         | (gdb) p &skb->truesize
> > > >         | $5 = (unsigned int *) 0xffff888117f55f90
> > > >         | (gdb) awatch *0xffff888117f55f90
> > > >         | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >
> > > > 2.      Some time later, we see that the skb with kfence-allocated data
> > > >         is cloned:
> > > >
> > > >         | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >         |
> > > >         | Value = 1570
> > > >         | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > >         | 1002            C(truesize);
> > > >         | (gdb) bt
> > > >         | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > >         | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > > >         | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > > >         |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > > >         | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > > >         |     sk=<optimized out>) at ./include/linux/tcp.h:439
> > > >         | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > > >         |     at net/ipv4/tcp_output.c:2688
> > > >         | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > >         |     at ./include/net/sock.h:918
> > > >         | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > >         | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > >         | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > >         | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > > >         | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > > >         [...]
> > > >
> > > >
> > > > 3.      The original skb (that was cloned) has its truesize adjusted
> > > >         after a pskb_expand_head():
> > > >
> > > >         | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >         |
> > > >         | Old value = 1570
> > > >         | New value = 1954
> > > >
> > > >         ^^ the difference between the old and the new value is exactly
> > > >         384, which is also the final underflow of the sk_wmem_queued
> > > >         that triggers the warning. Presumably if the original allocation
> > > >         had been through kmalloc-1k and not KFENCE, the difference here
> > > >         would have been 0, since ksize() of the original allocation in
> > > >         step (1) would have been 1024, and not 640 (difference of 384).
> > > >
> > > >         | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > >         |     at net/core/skbuff.c:1687
> > > >         | 1687                    skb->truesize += size - osize;
> > > >         | (gdb) bt
> > > >         | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > >         |     at net/core/skbuff.c:1687
> > > >         | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > > >         | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > > >         | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > > >         | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > > >         |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > > >         | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > > >         | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > > >         |     at net/ipv4/tcp_input.c:1670
> > > >         | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > > >         | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > > >         | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > > >         | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > > >         [...]
> > > >
> > > >
> > > > Any of this make sense?
> > >
> > > Very nice debugging !
> > >
> > > I guess we could fix this in skb_prepare_for_shift(), eventually
> > > caring for the truesize manipulation
> > > (or reverting the change done in pskb_expand_head(), since only kfence
> > > is having this issue.
> >
> > Phew, good to hear I finally got lucky. :-)
> >
> > Either option is fine, as long as it avoids this problem in future.
> > Hopefully it can be fixed for 5.11.
> >
> > > (All TCP skbs in output path have the same allocation size for skb->head)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > >   */
> > >  static int skb_prepare_for_shift(struct sk_buff *skb)
> > >  {
> > > -       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > +       unsigned int ret = 0, save;
> > > +
> > > +       if (skb_cloned(skb)) {
> > > +               save = skb->truesize;
> > > +               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > +               skb->truesize = save;
> > > +       }
> > > +       return ret;
> > >  }
> >
> > FWIW,
> >
> >     Tested-by: Marco Elver <elver@google.com>
>
> Has this patch, or similar, already been sent?


Not yet, we have few weeks left before 5.11 is released ;)

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-14 10:47                         ` Eric Dumazet
@ 2021-01-11  8:55                           ` Marco Elver
  0 siblings, 0 replies; 18+ messages in thread
From: Marco Elver @ 2021-01-11  8:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Mon, 14 Dec 2020 at 11:47, Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Dec 14, 2020 at 11:09 AM Marco Elver <elver@google.com> wrote:
> > On Thu, 10 Dec 2020 at 20:01, Marco Elver <elver@google.com> wrote:
> > > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <edumazet@google.com> wrote:
> > > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
[...]
> > > Either option is fine, as long as it avoids this problem in future.
> > > Hopefully it can be fixed for 5.11.
> > >
> > > > (All TCP skbs in output path have the same allocation size for skb->head)
> > > >
> > > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > > 100644
> > > > --- a/net/core/skbuff.c
> > > > +++ b/net/core/skbuff.c
> > > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > > >   */
> > > >  static int skb_prepare_for_shift(struct sk_buff *skb)
> > > >  {
> > > > -       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > > +       unsigned int ret = 0, save;
> > > > +
> > > > +       if (skb_cloned(skb)) {
> > > > +               save = skb->truesize;
> > > > +               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > > +               skb->truesize = save;
> > > > +       }
> > > > +       return ret;
> > > >  }
> > >
> > > FWIW,
> > >
> > >     Tested-by: Marco Elver <elver@google.com>
> >
> > Has this patch, or similar, already been sent?
>
>
> Not yet, we have few weeks left before 5.11 is released ;)

Ping. Though KFENCE has been dropped from 5.11, just a reminder so we
get this fixed for 5.12.

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: WARNING in sk_stream_kill_queues (5)
  2020-12-14 10:09                       ` Marco Elver
  2020-12-14 10:47                         ` Eric Dumazet
@ 2021-02-03 10:25                         ` Marco Elver
  1 sibling, 0 replies; 18+ messages in thread
From: Marco Elver @ 2021-02-03 10:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Eric Dumazet, netdev, Andrew Morton, David Miller, Dmitry Vyukov,
	Alexander Potapenko, Jann Horn, Jakub Kicinski, LKML,
	Stephen Rothwell, syzkaller-bugs, Willem de Bruijn, syzbot

On Mon, 14 Dec 2020 at 11:09, Marco Elver <elver@google.com> wrote:
> On Thu, 10 Dec 2020 at 20:01, Marco Elver <elver@google.com> wrote:
> > On Thu, 10 Dec 2020 at 18:14, Eric Dumazet <edumazet@google.com> wrote:
> > > On Thu, Dec 10, 2020 at 5:51 PM Marco Elver <elver@google.com> wrote:
> > [...]
> > > > So I started putting gdb to work, and whenever I see an allocation
> > > > exactly like the above that goes through tso_fragment() a warning
> > > > immediately follows.
> > > >
> > > > Long story short, I somehow synthesized this patch that appears to fix
> > > > things, but I can't explain why exactly:
> > > >
> > > > | --- a/net/core/skbuff.c
> > > > | +++ b/net/core/skbuff.c
> > > > | @@ -1679,13 +1679,6 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
> > > > |
> > > > |       skb_metadata_clear(skb);
> > > > |
> > > > | -     /* It is not generally safe to change skb->truesize.
> > > > | -      * For the moment, we really care of rx path, or
> > > > | -      * when skb is orphaned (not attached to a socket).
> > > > | -      */
> > > > | -     if (!skb->sk || skb->destructor == sock_edemux)
> > > > | -             skb->truesize += size - osize;
> > > > | -
> > > > |       return 0;
> > > > |
> > > > |  nofrags:
> > > >
> > > > Now, here are the breadcrumbs I followed:
> > > >
> > > >
> > > > 1.      Breakpoint on kfence_ksize() -- first allocation that matches the above:
> > > >
> > > >         | #0  __kfence_ksize (s=18446612700164612096) at mm/kfence/core.c:726
> > > >         | #1  0xffffffff816fbf30 in kfence_ksize (addr=0xffff888436856000) at mm/kfence/core.c:737
> > > >         | #2  0xffffffff816217cf in ksize (objp=0xffff888436856000) at mm/slab_common.c:1178
> > > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > >         | #4  0xffffffff84d0ba73 in alloc_skb_fclone (priority=<optimized out>, size=<optimized out>) at ./include/linux/skbuff.h:1144
> > > >         | #5  sk_stream_alloc_skb (sk=0xffff8881176cc000, size=0, gfp=2592, force_schedule=232) at net/ipv4/tcp.c:888
> > > >         | #6  0xffffffff84d41c36 in tso_fragment (gfp=<optimized out>, mss_now=<optimized out>, len=<optimized out>,
> > > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_output.c:2124
> > > >         | #7  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=21950, nonagle=3096, push_one=-1996874776, gfp=0)
> > > >         |     at net/ipv4/tcp_output.c:2674
> > > >         | #8  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > >         |     at ./include/net/sock.h:918
> > > >         | #9  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > >         | #10 tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > >         | #11 tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > >         | #12 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f52ea0) at net/ipv4/tcp_ipv4.c:1668
> > > >         | [...]
> > > >
> > > >         Set watchpoint on skb->truesize:
> > > >
> > > >         | (gdb) frame 3
> > > >         | #3  0xffffffff84896911 in __alloc_skb (size=914710528, gfp_mask=2592, flags=0, node=-1) at net/core/skbuff.c:217
> > > >         | 217             size = SKB_WITH_OVERHEAD(ksize(data));
> > > >         | (gdb) p &skb->truesize
> > > >         | $5 = (unsigned int *) 0xffff888117f55f90
> > > >         | (gdb) awatch *0xffff888117f55f90
> > > >         | Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >
> > > > 2.      Some time later, we see that the skb with kfence-allocated data
> > > >         is cloned:
> > > >
> > > >         | Thread 7 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >         |
> > > >         | Value = 1570
> > > >         | 0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > >         | 1002            C(truesize);
> > > >         | (gdb) bt
> > > >         | #0  0xffffffff84886947 in __skb_clone (n=0xffff888117f55fa0, skb=0xffff888117f55ec0) at net/core/skbuff.c:1002
> > > >         | #1  0xffffffff8488bfb9 in skb_clone (skb=0xffff888117f55ec0, gfp_mask=2592) at net/core/skbuff.c:1454
> > > >         | #2  0xffffffff84d3cd1c in __tcp_transmit_skb (sk=0xffff8881176cc000, skb=0xffff888117f55ec0, clone_it=0, gfp_mask=2592,
> > > >         |     rcv_nxt=0) at net/ipv4/tcp_output.c:1267
> > > >         | #3  0xffffffff84d4125b in tcp_transmit_skb (gfp_mask=<optimized out>, clone_it=<optimized out>, skb=<optimized out>,
> > > >         |     sk=<optimized out>) at ./include/linux/tcp.h:439
> > > >         | #4  tcp_write_xmit (sk=0xffff8881176cc000, mss_now=392485600, nonagle=1326, push_one=-1996875104, gfp=0)
> > > >         |     at net/ipv4/tcp_output.c:2688
> > > >         | #5  0xffffffff84d43e48 in __tcp_push_pending_frames (sk=0xffff8881176cc000, cur_mss=337, nonagle=0)
> > > >         |     at ./include/net/sock.h:918
> > > >         | #6  0xffffffff84d3259c in tcp_push_pending_frames (sk=<optimized out>) at ./include/net/tcp.h:1864
> > > >         | #7  tcp_data_snd_check (sk=<optimized out>) at net/ipv4/tcp_input.c:5374
> > > >         | #8  tcp_rcv_established (sk=0xffff8881176cc000, skb=0x0 <fixed_percpu_data>) at net/ipv4/tcp_input.c:5869
> > > >         | #9  0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f57820) at net/ipv4/tcp_ipv4.c:1668
> > > >         | #10 0xffffffff8487bf67 in sk_backlog_rcv (skb=<optimized out>, sk=<optimized out>) at ./include/net/sock.h:1010
> > > >         [...]
> > > >
> > > >
> > > > 3.      The original skb (that was cloned) has its truesize adjusted
> > > >         after a pskb_expand_head():
> > > >
> > > >         | Thread 2 hit Hardware access (read/write) watchpoint 6: *0xffff888117f55f90
> > > >         |
> > > >         | Old value = 1570
> > > >         | New value = 1954
> > > >
> > > >         ^^ the difference between the old and the new value is exactly
> > > >         384, which is also the final underflow of the sk_wmem_queued
> > > >         that triggers the warning. Presumably if the original allocation
> > > >         had been through kmalloc-1k and not KFENCE, the difference here
> > > >         would have been 0, since ksize() of the original allocation in
> > > >         step (1) would have been 1024, and not 640 (difference of 384).
> > > >
> > > >         | 0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > >         |     at net/core/skbuff.c:1687
> > > >         | 1687                    skb->truesize += size - osize;
> > > >         | (gdb) bt
> > > >         | #0  0xffffffff8488d84b in pskb_expand_head (skb=0xffff888117f55ec0, nhead=401956752, ntail=1954, gfp_mask=2298092192)
> > > >         |     at net/core/skbuff.c:1687
> > > >         | #1  0xffffffff8488de01 in skb_prepare_for_shift (skb=<optimized out>) at ./arch/x86/include/asm/atomic.h:29
> > > >         | #2  skb_prepare_for_shift (skb=0xffff888117f55ec0) at net/core/skbuff.c:3276
> > > >         | #3  0xffffffff848936b1 in skb_shift (tgt=0xffff888117f549c0, skb=0xffff888117f55ec0, shiftlen=674) at net/core/skbuff.c:3351
> > > >         | #4  0xffffffff84d264de in tcp_skb_shift (shiftlen=<optimized out>, pcount=<optimized out>, from=<optimized out>,
> > > >         |     to=<optimized out>) at net/ipv4/tcp_input.c:1497
> > > >         | #5  tcp_shift_skb_data (dup_sack=<optimized out>, end_seq=<optimized out>, start_seq=<optimized out>, state=<optimized out>,
> > > >         |     skb=<optimized out>, sk=<optimized out>) at net/ipv4/tcp_input.c:1605
> > > >         | #6  tcp_sacktag_walk (skb=0xffff888117f55ec0, sk=0xffff8881176cc000, next_dup=0x894,
> > > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>, start_seq=0, end_seq=401956752, dup_sack_in=false)
> > > >         |     at net/ipv4/tcp_input.c:1670
> > > >         | #7  0xffffffff84d276de in tcp_sacktag_write_queue (sk=0xffff888117f55f90, ack_skb=0x1888117f55f90, prior_snd_una=2196,
> > > >         |     state=0xffffffff88fa1aa0 <watchpoints+192>) at net/ipv4/tcp_input.c:1931
> > > >         | #8  0xffffffff84d2ca1d in tcp_ack (sk=0xffff8881176cc000, skb=0x1888117f55f90, flag=16643) at net/ipv4/tcp_input.c:3758
> > > >         | #9  0xffffffff84d32387 in tcp_rcv_established (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_input.c:5858
> > > >         | #10 0xffffffff84d56731 in tcp_v4_do_rcv (sk=0xffff8881176cc000, skb=0xffff888117f54020) at net/ipv4/tcp_ipv4.c:1668
> > > >         [...]
> > > >
> > > >
> > > > Any of this make sense?
> > >
> > > Very nice debugging !
> > >
> > > I guess we could fix this in skb_prepare_for_shift(), eventually
> > > caring for the truesize manipulation
> > > (or reverting the change done in pskb_expand_head(), since only kfence
> > > is having this issue.
> >
> > Phew, good to hear I finally got lucky. :-)
> >
> > Either option is fine, as long as it avoids this problem in future.
> > Hopefully it can be fixed for 5.11.
> >
> > > (All TCP skbs in output path have the same allocation size for skb->head)
> > >
> > > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > > index e578544b2cc7110ec2f6bcf4c29d93e4b4b1ad14..798b51eeeaa4fbed65d41d9eab207dbbf438dab3
> > > 100644
> > > --- a/net/core/skbuff.c
> > > +++ b/net/core/skbuff.c
> > > @@ -3270,7 +3270,14 @@ EXPORT_SYMBOL(skb_split);
> > >   */
> > >  static int skb_prepare_for_shift(struct sk_buff *skb)
> > >  {
> > > -       return skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > +       unsigned int ret = 0, save;
> > > +
> > > +       if (skb_cloned(skb)) {
> > > +               save = skb->truesize;
> > > +               ret = pskb_expand_head(skb, 0, 0, GFP_ATOMIC);
> > > +               skb->truesize = save;
> > > +       }
> > > +       return ret;
> > >  }
> >
> > FWIW,
> >
> >     Tested-by: Marco Elver <elver@google.com>
>
> Has this patch, or similar, already been sent?

This is now in net-next:
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=097b9146c0e2

Thanks,
-- Marco

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2021-02-03 10:27 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-30  8:40 WARNING in sk_stream_kill_queues (5) syzbot
2020-12-03 15:58 ` Marco Elver
2020-12-03 16:27   ` Eric Dumazet
2020-12-03 16:34     ` Marco Elver
2020-12-03 16:42       ` Eric Dumazet
2020-12-03 17:41         ` Marco Elver
2020-12-03 18:01           ` Eric Dumazet
2020-12-07 15:30             ` Marco Elver
2020-12-07 16:28             ` Marco Elver
2020-12-08 19:06             ` Marco Elver
2020-12-09 12:47               ` Marco Elver
2020-12-10 16:51                 ` Marco Elver
2020-12-10 17:14                   ` Eric Dumazet
2020-12-10 19:01                     ` Marco Elver
2020-12-14 10:09                       ` Marco Elver
2020-12-14 10:47                         ` Eric Dumazet
2021-01-11  8:55                           ` Marco Elver
2021-02-03 10:25                         ` Marco Elver

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).