All of lore.kernel.org
 help / color / mirror / Atom feed
* BUG: corrupted list in freeary
@ 2018-11-27 15:52 syzbot
  2018-11-27 16:07 ` Eric W. Biederman
       [not found] ` <621f7c52-de52-eb35-bf61-e839adee7ec9@colorfullife.com>
  0 siblings, 2 replies; 8+ messages in thread
From: syzbot @ 2018-11-27 15:52 UTC (permalink / raw)
  To: akpm, arnd, dave, ebiederm, linux-kernel, linux, manfred, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+c92d3646e35bc5d1a909@syzkaller.appspotmail.com

input: syz1 as /devices/virtual/input/input670
input: syz1 as /devices/virtual/input/input671
list_del corruption. prev->next should be ffff8881dae2cdb8, but was  
0000000000100000
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:53!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 6194 Comm: syz-executor5 Not tainted 4.20.0-rc3+ #348
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:__list_del_entry_valid.cold.1+0x48/0x4a lib/list_debug.c:51
Code: d0 60 88 e8 b2 31 d2 fd 0f 0b 48 89 de 48 c7 c7 00 d2 60 88 e8 a1 31  
d2 fd 0f 0b 48 89 de 48 c7 c7 a0 d1 60 88 e8 90 31 d2 fd <0f> 0b 48 89 d9  
48 c7 c7 60 d2 60 88 e8 7f 31 d2 fd 0f 0b 48 89 f1
RSP: 0018:ffff8881848fee80 EFLAGS: 00010286
RAX: 0000000000000054 RBX: ffff8881dae2cdb8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8165eaf5 RDI: 0000000000000005
RBP: ffff8881848fee98 R08: ffff8881848f26c0 R09: 0000000000000006
R10: 0000000000000000 R11: ffff8881848f26c0 R12: ffff8881c3173a00
R13: ffff8881be118118 R14: ffff8881848ff280 R15: dffffc0000000000
FS:  00000000020b2940(0000) GS:ffff8881dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000625208 CR3: 00000001c10d3000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_del_entry include/linux/list.h:117 [inline]
  list_del include/linux/list.h:125 [inline]
  unlink_queue ipc/sem.c:786 [inline]
  freeary+0xbd1/0x1a40 ipc/sem.c:1160
  free_ipcs+0x9f/0x1c0 ipc/namespace.c:112
  sem_exit_ns+0x20/0x40 ipc/sem.c:237
  free_ipc_ns ipc/namespace.c:120 [inline]
  put_ipc_ns+0x66/0x180 ipc/namespace.c:152
  free_nsproxy+0xcf/0x220 kernel/nsproxy.c:180
  switch_task_namespaces+0xb3/0xd0 kernel/nsproxy.c:229
  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
  do_exit+0x1ad1/0x26d0 kernel/exit.c:866
  do_group_exit+0x177/0x440 kernel/exit.c:970
  get_signal+0x8b0/0x1980 kernel/signal.c:2517
  do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
  exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
  do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x410fa0
Code: 24 48 8b 4c 24 28 48 89 4c 24 08 88 44 24 10 c6 44 24 11 01 0f b6 44  
24 50 88 44 24 12 e8 18 15 01 00 48 8b 44 24 18 48 85 c0 <74> 59 48 89 44  
24 30 48 8b 48 18 48 8b 54 24 48 48 01 d1 48 89 88
RSP: 002b:00007fff19ab9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 0000000000410fa0
RDX: 0000000000000038 RSI: 0000000000728020 RDI: 00000000000000f9
RBP: 0000000000000003 R08: 0000000000006000 R09: 0000000000004000
R10: 00007fff19ab95e0 R11: 0000000000000246 R12: 0000000000000002
R13: 000000000008f53b R14: 00000000000002f8 R15: 0000000000000005
Modules linked in:
---[ end trace 9573c75ddafce7e3 ]---
RIP: 0010:__list_del_entry_valid.cold.1+0x48/0x4a lib/list_debug.c:51
Code: d0 60 88 e8 b2 31 d2 fd 0f 0b 48 89 de 48 c7 c7 00 d2 60 88 e8 a1 31  
d2 fd 0f 0b 48 89 de 48 c7 c7 a0 d1 60 88 e8 90 31 d2 fd <0f> 0b 48 89 d9  
48 c7 c7 60 d2 60 88 e8 7f 31 d2 fd 0f 0b 48 89 f1
RSP: 0018:ffff8881848fee80 EFLAGS: 00010286
RAX: 0000000000000054 RBX: ffff8881dae2cdb8 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff8165eaf5 RDI: 0000000000000005
RBP: ffff8881848fee98 R08: ffff8881848f26c0 R09: 0000000000000006
R10: 0000000000000000 R11: ffff8881848f26c0 R12: ffff8881c3173a00
R13: ffff8881be118118 R14: ffff8881848ff280 R15: dffffc0000000000
FS:  00000000020b2940(0000) GS:ffff8881dae00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000625208 CR3: 00000001c10d3000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-11-27 15:52 BUG: corrupted list in freeary syzbot
@ 2018-11-27 16:07 ` Eric W. Biederman
  2018-11-27 16:22   ` Sean Christopherson
       [not found] ` <621f7c52-de52-eb35-bf61-e839adee7ec9@colorfullife.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2018-11-27 16:07 UTC (permalink / raw)
  To: syzbot; +Cc: akpm, arnd, dave, linux-kernel, linux, manfred, syzkaller-bugs



syzbot <syzbot+c92d3646e35bc5d1a909@syzkaller.appspotmail.com> writes:

> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
> dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)

I don't have this commit.  I don't have a clue where it find this
commit.  I can't possibly look into this let alone debug this.

That is not a useful but report.  Please fix your bot.

Eric



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-11-27 16:07 ` Eric W. Biederman
@ 2018-11-27 16:22   ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2018-11-27 16:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: syzbot, akpm, arnd, dave, linux-kernel, linux, manfred, syzkaller-bugs

On Tue, Nov 27, 2018 at 10:07:53AM -0600, Eric W. Biederman wrote:
> 
> 
> syzbot <syzbot+c92d3646e35bc5d1a909@syzkaller.appspotmail.com> writes:
> 
> > Hello,
> >
> > syzbot found the following crash on:
> >
> > HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> > git tree:       upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
> > dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
> > compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> 
> I don't have this commit.  I don't have a clue where it find this
> commit.  I can't possibly look into this let alone debug this.

It's in v4.20-rc4.

> 
> That is not a useful but report.  Please fix your bot.
> 
> Eric
> 
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
       [not found] ` <621f7c52-de52-eb35-bf61-e839adee7ec9@colorfullife.com>
@ 2018-11-30 17:58   ` Dmitry Vyukov
  2018-11-30 18:06     ` Dmitry Vyukov
  2018-12-01 20:22     ` Manfred Spraul
  0 siblings, 2 replies; 8+ messages in thread
From: Dmitry Vyukov @ 2018-11-30 17:58 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: syzbot, Andrew Morton, Arnd Bergmann, Davidlohr Bueso,
	Eric W. Biederman, LKML, linux, syzkaller-bugs

On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
<manfred@colorfullife.com> wrote:
> Hello together,
>
> On 11/27/18 4:52 PM, syzbot wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
>
> From the console output:
>
> 20:36:14 executing program 4:
> semget$private(0x12000000, 0x39d0, 0x0)
>
>
> I don't understand the 0x12000000.
>
> What does that mean? What is the actual syscall?

Hi Manfred,

The syscall is semget with the first argument 0x12000000.


>
> Is 0x39d0 the number of semaphores in the array, i.e. create ~13.000
> semaphores?

If the second argument of 0x39d0 relates to creation of 0x39d0
semaphores, then yes.



> kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
> dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+c92d3646e35bc5d1a909@syzkaller.appspotmail.com
>
> input: syz1 as /devices/virtual/input/input670
> input: syz1 as /devices/virtual/input/input671
> list_del corruption. prev->next should be ffff8881dae2cdb8, but was
> 0000000000100000
> ------------[ cut here ]------------
> kernel BUG at lib/list_debug.c:53!
> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 6194 Comm: syz-executor5 Not tainted 4.20.0-rc3+ #348
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:__list_del_entry_valid.cold.1+0x48/0x4a lib/list_debug.c:51
> Code: d0 60 88 e8 b2 31 d2 fd 0f 0b 48 89 de 48 c7 c7 00 d2 60 88 e8 a1 31
> d2 fd 0f 0b 48 89 de 48 c7 c7 a0 d1 60 88 e8 90 31 d2 fd <0f> 0b 48 89 d9 48
> c7 c7 60 d2 60 88 e8 7f 31 d2 fd 0f 0b 48 89 f1
> RSP: 0018:ffff8881848fee80 EFLAGS: 00010286
> RAX: 0000000000000054 RBX: ffff8881dae2cdb8 RCX: 0000000000000000
> RDX: 0000000000000000 RSI: ffffffff8165eaf5 RDI: 0000000000000005
> RBP: ffff8881848fee98 R08: ffff8881848f26c0 R09: 0000000000000006
> R10: 0000000000000000 R11: ffff8881848f26c0 R12: ffff8881c3173a00
> R13: ffff8881be118118 R14: ffff8881848ff280 R15: dffffc0000000000
> FS:  00000000020b2940(0000) GS:ffff8881dae00000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000000625208 CR3: 00000001c10d3000 CR4: 00000000001406f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  __list_del_entry include/linux/list.h:117 [inline]
>  list_del include/linux/list.h:125 [inline]
>  unlink_queue ipc/sem.c:786 [inline]
>
> Unlink_queue means transfer all waiting threads to the wake-q.
>
> There are 2*(1+<semaphores in the array>) linked lists in an array.
>
> And this fails, because one linked list contains 0x100000 instead of a real
> pointer.
>
> I could not find any semop() in the log --> all lists must be empty.
>
> Actually, the lists were initialized in newary(), and then never touched.
>
>  freeary+0xbd1/0x1a40 ipc/sem.c:1160
>
> Free a semaphore array
>
>  free_ipcs+0x9f/0x1c0 ipc/namespace.c:112
>  sem_exit_ns+0x20/0x40 ipc/sem.c:237
>  free_ipc_ns ipc/namespace.c:120 [inline]
>
> Free all ipc ids in the name space
>
>  put_ipc_ns+0x66/0x180 ipc/namespace.c:152
>  free_nsproxy+0xcf/0x220 kernel/nsproxy.c:180
>
> Free the name space
>
>  switch_task_namespaces+0xb3/0xd0 kernel/nsproxy.c:229
>  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
>  do_exit+0x1ad1/0x26d0 kernel/exit.c:866
>  do_group_exit+0x177/0x440 kernel/exit.c:970
>  get_signal+0x8b0/0x1980 kernel/signal.c:2517
>  do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
>  exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
>  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>  do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x410fa0
>
> This is time code 604.599748 in the console output:
>
> [  604.599748] RIP: 0033:0x410fa0
>
>
> Questions:
>
> 1) What is this?
>
> [  600.924691]  entry_SYSCALL_64_after_hwframe+0x49/0xbe^M
> [  600.929872] RIP: 0033:0x7f3e597d0120^M
> [  600.933576] Code: Bad RIP value.^M
> [  600.936920] RSP: 002b:00007ffc2d83e008 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000002^M
> [  600.944608] RAX: ffffffffffffffda RBX: 000055ca2995b436 RCX:
> 00007f3e597d0120^M
> [  600.951856] RDX: 00007ffc2d83e244 RSI: 0000000000080000 RDI:
> 00007ffc2d83e220^M
> [  600.959107] RBP: 000055ca2995b1e0 R08: 0000000000000000 R09:
> 000055ca2995b099^M
> [  600.966355] R10: 0000000000000000 R11: 0000000000000246 R12:
> 0000000000000001^M
> [  600.973628] R13: 000055ca2995b090 R14: 000055ca2995b190 R15:
> 00007ffc2d83e220^M
>
> Isn't this a kernel stack overrun?
>
> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
> used up everything.

I don't exact answer, that's just the kernel output that we captured
from console.

FWIW with KASAN stacks are 16K:
https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10


But it looks like the OOM may have been caused by a page fault, so the
IP/SP are legitimately in user-space, e.g. in the same log:

[  443.339635]  __do_page_fault+0x5e8/0xe60
[  443.343683]  ? trace_hardirqs_off+0xb8/0x310
[  443.348079]  do_page_fault+0xf2/0x7e0
[  443.351867]  ? vmalloc_sync_all+0x30/0x30
[  443.356002]  ? error_entry+0x70/0xd0
[  443.359702]  ? trace_hardirqs_off_caller+0xbb/0x310
[  443.364702]  ? trace_hardirqs_on_caller+0xc0/0x310
[  443.369620]  ? syscall_return_slowpath+0x5e0/0x5e0
[  443.374537]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  443.379390]  ? trace_hardirqs_on_caller+0x310/0x310
[  443.384397]  ? __bpf_trace_preemptirq_template+0x30/0x30
[  443.389837]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
[  443.395366]  ? prepare_exit_to_usermode+0x291/0x3b0
[  443.400374]  ? page_fault+0x8/0x30
[  443.403901]  ? trace_hardirqs_off_thunk+0x1a/0x1c
[  443.408729]  ? page_fault+0x8/0x30
[  443.412255]  page_fault+0x1e/0x30
[  443.415693] RIP: 0033:0x43d8e8
[  443.418884] Code: Bad RIP value.
[  443.422229] RSP: 002b:000000c43a053490 EFLAGS: 00010206




> - How/where are namespaces used by the bot?
>
> I.e. what triggered the namespace exit?

It creates whole set of namespaces per test process:
https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L1523

but then each test can also execute unshare as well.

Namespace exits usually happen when process terminates.
In the reported stack it seems to be the case:

 exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
 do_exit+0x1ad1/0x26d0 kernel/exit.c:866
 do_group_exit+0x177/0x440 kernel/exit.c:970
 get_signal+0x8b0/0x1980 kernel/signal.c:2517
 do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
 exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162


> - There are ~370 calls to semget(), most with large (>10.000) semaphores in
> the arrays.
>
> Starting from [442.544635], the OOM killer starts to kill processes.
>
> Is this as intended?

Well, generally everything except for kernel crashes is expected.

We actually sandbox it with memcg quite aggressively:
https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
But it seems to manage to either break the limits, or cause some
massive memory leaks. The nature of that is yet unknown.


> - Which stress tests are enabled? By chance, I found:
>
> [  433.304586] FAULT_INJECTION: forcing a failure.^M
> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
> times 0^M
> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
> #348^M
> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute
> Engine, BIOS Google 01/01/2011^M
>
> I need some more background, then I can review the code.

What exactly do you mean by "Which stress tests"?
Fault injection is enabled. Also random workload from userspace.


> Right now, I would put it into my "unknown syzcaller finding" folder.
>
> --
>
>     Manfred

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-11-30 17:58   ` Dmitry Vyukov
@ 2018-11-30 18:06     ` Dmitry Vyukov
  2018-12-01 20:22     ` Manfred Spraul
  1 sibling, 0 replies; 8+ messages in thread
From: Dmitry Vyukov @ 2018-11-30 18:06 UTC (permalink / raw)
  To: Manfred Spraul
  Cc: syzbot, Andrew Morton, Arnd Bergmann, Davidlohr Bueso,
	Eric W. Biederman, LKML, linux, syzkaller-bugs

On Fri, Nov 30, 2018 at 5:58 PM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> <manfred@colorfullife.com> wrote:
>> Hello together,
>>
>> On 11/27/18 4:52 PM, syzbot wrote:
>>
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
>>
>> From the console output:
>>
>> 20:36:14 executing program 4:
>> semget$private(0x12000000, 0x39d0, 0x0)
>>
>>
>> I don't understand the 0x12000000.
>>
>> What does that mean? What is the actual syscall?
>
> Hi Manfred,
>
> The syscall is semget with the first argument 0x12000000.
>
>
>>
>> Is 0x39d0 the number of semaphores in the array, i.e. create ~13.000
>> semaphores?
>
> If the second argument of 0x39d0 relates to creation of 0x39d0
> semaphores, then yes.
>
>
>
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=73e2bc0cb6463446
>> dashboard link: https://syzkaller.appspot.com/bug?extid=c92d3646e35bc5d1a909
>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+c92d3646e35bc5d1a909@syzkaller.appspotmail.com
>>
>> input: syz1 as /devices/virtual/input/input670
>> input: syz1 as /devices/virtual/input/input671
>> list_del corruption. prev->next should be ffff8881dae2cdb8, but was
>> 0000000000100000
>> ------------[ cut here ]------------
>> kernel BUG at lib/list_debug.c:53!
>> invalid opcode: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 0 PID: 6194 Comm: syz-executor5 Not tainted 4.20.0-rc3+ #348
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
>> Google 01/01/2011
>> RIP: 0010:__list_del_entry_valid.cold.1+0x48/0x4a lib/list_debug.c:51
>> Code: d0 60 88 e8 b2 31 d2 fd 0f 0b 48 89 de 48 c7 c7 00 d2 60 88 e8 a1 31
>> d2 fd 0f 0b 48 89 de 48 c7 c7 a0 d1 60 88 e8 90 31 d2 fd <0f> 0b 48 89 d9 48
>> c7 c7 60 d2 60 88 e8 7f 31 d2 fd 0f 0b 48 89 f1
>> RSP: 0018:ffff8881848fee80 EFLAGS: 00010286
>> RAX: 0000000000000054 RBX: ffff8881dae2cdb8 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: ffffffff8165eaf5 RDI: 0000000000000005
>> RBP: ffff8881848fee98 R08: ffff8881848f26c0 R09: 0000000000000006
>> R10: 0000000000000000 R11: ffff8881848f26c0 R12: ffff8881c3173a00
>> R13: ffff8881be118118 R14: ffff8881848ff280 R15: dffffc0000000000
>> FS:  00000000020b2940(0000) GS:ffff8881dae00000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000000625208 CR3: 00000001c10d3000 CR4: 00000000001406f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>>  __list_del_entry include/linux/list.h:117 [inline]
>>  list_del include/linux/list.h:125 [inline]
>>  unlink_queue ipc/sem.c:786 [inline]
>>
>> Unlink_queue means transfer all waiting threads to the wake-q.
>>
>> There are 2*(1+<semaphores in the array>) linked lists in an array.
>>
>> And this fails, because one linked list contains 0x100000 instead of a real
>> pointer.
>>
>> I could not find any semop() in the log --> all lists must be empty.
>>
>> Actually, the lists were initialized in newary(), and then never touched.
>>
>>  freeary+0xbd1/0x1a40 ipc/sem.c:1160
>>
>> Free a semaphore array
>>
>>  free_ipcs+0x9f/0x1c0 ipc/namespace.c:112
>>  sem_exit_ns+0x20/0x40 ipc/sem.c:237
>>  free_ipc_ns ipc/namespace.c:120 [inline]
>>
>> Free all ipc ids in the name space
>>
>>  put_ipc_ns+0x66/0x180 ipc/namespace.c:152
>>  free_nsproxy+0xcf/0x220 kernel/nsproxy.c:180
>>
>> Free the name space
>>
>>  switch_task_namespaces+0xb3/0xd0 kernel/nsproxy.c:229
>>  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
>>  do_exit+0x1ad1/0x26d0 kernel/exit.c:866
>>  do_group_exit+0x177/0x440 kernel/exit.c:970
>>  get_signal+0x8b0/0x1980 kernel/signal.c:2517
>>  do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
>>  exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
>>  prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>>  syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>>  do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> RIP: 0033:0x410fa0
>>
>> This is time code 604.599748 in the console output:
>>
>> [  604.599748] RIP: 0033:0x410fa0
>>
>>
>> Questions:
>>
>> 1) What is this?
>>
>> [  600.924691]  entry_SYSCALL_64_after_hwframe+0x49/0xbe^M
>> [  600.929872] RIP: 0033:0x7f3e597d0120^M
>> [  600.933576] Code: Bad RIP value.^M
>> [  600.936920] RSP: 002b:00007ffc2d83e008 EFLAGS: 00000246 ORIG_RAX:
>> 0000000000000002^M
>> [  600.944608] RAX: ffffffffffffffda RBX: 000055ca2995b436 RCX:
>> 00007f3e597d0120^M
>> [  600.951856] RDX: 00007ffc2d83e244 RSI: 0000000000080000 RDI:
>> 00007ffc2d83e220^M
>> [  600.959107] RBP: 000055ca2995b1e0 R08: 0000000000000000 R09:
>> 000055ca2995b099^M
>> [  600.966355] R10: 0000000000000000 R11: 0000000000000246 R12:
>> 0000000000000001^M
>> [  600.973628] R13: 000055ca2995b090 R14: 000055ca2995b190 R15:
>> 00007ffc2d83e220^M
>>
>> Isn't this a kernel stack overrun?
>>
>> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
>> used up everything.
>
> I don't exact answer, that's just the kernel output that we captured
> from console.
>
> FWIW with KASAN stacks are 16K:
> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
>
>
> But it looks like the OOM may have been caused by a page fault, so the
> IP/SP are legitimately in user-space, e.g. in the same log:
>
> [  443.339635]  __do_page_fault+0x5e8/0xe60
> [  443.343683]  ? trace_hardirqs_off+0xb8/0x310
> [  443.348079]  do_page_fault+0xf2/0x7e0
> [  443.351867]  ? vmalloc_sync_all+0x30/0x30
> [  443.356002]  ? error_entry+0x70/0xd0
> [  443.359702]  ? trace_hardirqs_off_caller+0xbb/0x310
> [  443.364702]  ? trace_hardirqs_on_caller+0xc0/0x310
> [  443.369620]  ? syscall_return_slowpath+0x5e0/0x5e0
> [  443.374537]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [  443.379390]  ? trace_hardirqs_on_caller+0x310/0x310
> [  443.384397]  ? __bpf_trace_preemptirq_template+0x30/0x30
> [  443.389837]  ? __sanitizer_cov_trace_const_cmp4+0x16/0x20
> [  443.395366]  ? prepare_exit_to_usermode+0x291/0x3b0
> [  443.400374]  ? page_fault+0x8/0x30
> [  443.403901]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [  443.408729]  ? page_fault+0x8/0x30
> [  443.412255]  page_fault+0x1e/0x30
> [  443.415693] RIP: 0033:0x43d8e8
> [  443.418884] Code: Bad RIP value.
> [  443.422229] RSP: 002b:000000c43a053490 EFLAGS: 00010206
>
>
>
>
>> - How/where are namespaces used by the bot?
>>
>> I.e. what triggered the namespace exit?
>
> It creates whole set of namespaces per test process:
> https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L1523
>
> but then each test can also execute unshare as well.
>
> Namespace exits usually happen when process terminates.
> In the reported stack it seems to be the case:
>
>  exit_task_namespaces+0x17/0x20 kernel/nsproxy.c:234
>  do_exit+0x1ad1/0x26d0 kernel/exit.c:866
>  do_group_exit+0x177/0x440 kernel/exit.c:970
>  get_signal+0x8b0/0x1980 kernel/signal.c:2517
>  do_signal+0x9c/0x21c0 arch/x86/kernel/signal.c:816
>  exit_to_usermode_loop+0x2e5/0x380 arch/x86/entry/common.c:162
>
>
>> - There are ~370 calls to semget(), most with large (>10.000) semaphores in
>> the arrays.
>>
>> Starting from [442.544635], the OOM killer starts to kill processes.
>>
>> Is this as intended?
>
> Well, generally everything except for kernel crashes is expected.
>
> We actually sandbox it with memcg quite aggressively:
> https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> But it seems to manage to either break the limits, or cause some
> massive memory leaks. The nature of that is yet unknown.
>
>
>> - Which stress tests are enabled? By chance, I found:
>>
>> [  433.304586] FAULT_INJECTION: forcing a failure.^M
>> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
>> times 0^M
>> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
>> #348^M
>> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute
>> Engine, BIOS Google 01/01/2011^M
>>
>> I need some more background, then I can review the code.
>
> What exactly do you mean by "Which stress tests"?
> Fault injection is enabled. Also random workload from userspace.
>
>
>> Right now, I would put it into my "unknown syzcaller finding" folder.


Since it happened only once, it can well be a previous silent memory
corruption. But we also seen hundreds of subtle data races that ever
fired few times or just once, but turned out to be very real bugs.
Hard to say.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-11-30 17:58   ` Dmitry Vyukov
  2018-11-30 18:06     ` Dmitry Vyukov
@ 2018-12-01 20:22     ` Manfred Spraul
  2018-12-03 14:53       ` Dmitry Vyukov
  1 sibling, 1 reply; 8+ messages in thread
From: Manfred Spraul @ 2018-12-01 20:22 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: syzbot, Andrew Morton, Arnd Bergmann, Davidlohr Bueso,
	Eric W. Biederman, LKML, linux, syzkaller-bugs

Hi Dmitry,

On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> <manfred@colorfullife.com> wrote:
>> Hello together,
>>
>> On 11/27/18 4:52 PM, syzbot wrote:
>>
>> Hello,
>>
>> syzbot found the following crash on:
>>
>> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
>> git tree:       upstream
>> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
[...]
>> Isn't this a kernel stack overrun?
>>
>> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
>> used up everything.
> I don't exact answer, that's just the kernel output that we captured
> from console.
>
> FWIW with KASAN stacks are 16K:
> https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
Ok, thanks. And stack overrun detection is enabled as well -> a real 
stack overrun is unlikely.
> Well, generally everything except for kernel crashes is expected.
>
> We actually sandbox it with memcg quite aggressively:
> https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> But it seems to manage to either break the limits, or cause some
> massive memory leaks. The nature of that is yet unknown.

Is it possible to start from that side?

Are there other syzcaller runs where the OOM killer triggers that much?

>
>> - Which stress tests are enabled? By chance, I found:
>>
>> [  433.304586] FAULT_INJECTION: forcing a failure.^M
>> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
>> times 0^M
>> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
>> #348^M
>> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute
>> Engine, BIOS Google 01/01/2011^M
>>
>> I need some more background, then I can review the code.
> What exactly do you mean by "Which stress tests"?
> Fault injection is enabled. Also random workload from userspace.
>
>
>> Right now, I would put it into my "unknown syzcaller finding" folder.

One more idea: Are there further syzcaller runs that end up with 
0x010000 in a pointer?

 From what I see, the sysv sem code that is used is trivial, I don't see 
that it could cause the observed behavior.


--

     Manfred


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-12-01 20:22     ` Manfred Spraul
@ 2018-12-03 14:53       ` Dmitry Vyukov
  2019-03-26  8:47         ` Dmitry Vyukov
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Vyukov @ 2018-12-03 14:53 UTC (permalink / raw)
  To: manfred
  Cc: syzbot, Andrew Morton, Arnd Bergmann, Davidlohr Bueso,
	Eric W. Biederman, LKML, linux, syzkaller-bugs

On Sat, Dec 1, 2018 at 9:22 PM Manfred Spraul <manfred@colorfullife.com> wrote:
>
> Hi Dmitry,
>
> On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> > On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> > <manfred@colorfullife.com> wrote:
> >> Hello together,
> >>
> >> On 11/27/18 4:52 PM, syzbot wrote:
> >>
> >> Hello,
> >>
> >> syzbot found the following crash on:
> >>
> >> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> >> git tree:       upstream
> >> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> [...]
> >> Isn't this a kernel stack overrun?
> >>
> >> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
> >> used up everything.
> > I don't exact answer, that's just the kernel output that we captured
> > from console.
> >
> > FWIW with KASAN stacks are 16K:
> > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
> Ok, thanks. And stack overrun detection is enabled as well -> a real
> stack overrun is unlikely.
> > Well, generally everything except for kernel crashes is expected.
> >
> > We actually sandbox it with memcg quite aggressively:
> > https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> > But it seems to manage to either break the limits, or cause some
> > massive memory leaks. The nature of that is yet unknown.
>
> Is it possible to start from that side?
>
> Are there other syzcaller runs where the OOM killer triggers that much?

Lots of them:

https://groups.google.com/forum/#!searchin/syzkaller-upstream-moderation/lowmem_reserve
https://groups.google.com/forum/#!searchin/syzkaller-bugs/lowmem_reserve

But nobody got any hook on the reasons.


> >> - Which stress tests are enabled? By chance, I found:
> >>
> >> [  433.304586] FAULT_INJECTION: forcing a failure.^M
> >> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
> >> times 0^M
> >> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
> >> #348^M
> >> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute
> >> Engine, BIOS Google 01/01/2011^M
> >>
> >> I need some more background, then I can review the code.
> > What exactly do you mean by "Which stress tests"?
> > Fault injection is enabled. Also random workload from userspace.
> >
> >
> >> Right now, I would put it into my "unknown syzcaller finding" folder.
>
> One more idea: Are there further syzcaller runs that end up with
> 0x010000 in a pointer?

Hard to say. syzbot triggered millions of crashes. I can't say that I
remember this as distinctive pattern that come up before.

>  From what I see, the sysv sem code that is used is trivial, I don't see
> that it could cause the observed behavior.

I propose that we postpone further investigation of this until we have
a reproducer, or this happens more than once, or we gather some other
information.
Half of bugs are simple, so even for a crash happened once it makes
sense to spend 10 minutes looking at the code in case the root cause
is easy to spot. And hundreds of bugs were fixed this way. But I
assume you already did this.
The thing is that there are 100+ known bugs in kernel that lead to
memory corruptions:
https://syzkaller.appspot.com/#upstream-open
We try to catch them reliably with KASAN, but KASAN does not give 100%
guarantee. So if just one instance of a known bug gets unnoticed,
leads to a memory corruption, then later it can lead to an
unexplainable one-off crash like this. At this point higher ROI will
probably be from spending more time on hundreds of other known bugs
that have reproducers, happened lots of times, or just simpler. Once
we get rid of most of them, hopefully such unexplainable crashes will
go down too.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: BUG: corrupted list in freeary
  2018-12-03 14:53       ` Dmitry Vyukov
@ 2019-03-26  8:47         ` Dmitry Vyukov
  0 siblings, 0 replies; 8+ messages in thread
From: Dmitry Vyukov @ 2019-03-26  8:47 UTC (permalink / raw)
  To: manfred
  Cc: syzbot, Andrew Morton, Arnd Bergmann, Davidlohr Bueso,
	Eric W. Biederman, LKML, linux, syzkaller-bugs

On Mon, Dec 3, 2018 at 3:53 PM Dmitry Vyukov <dvyukov@google.com> wrote:
>
> On Sat, Dec 1, 2018 at 9:22 PM Manfred Spraul <manfred@colorfullife.com> wrote:
> >
> > Hi Dmitry,
> >
> > On 11/30/18 6:58 PM, Dmitry Vyukov wrote:
> > > On Thu, Nov 29, 2018 at 9:13 AM, Manfred Spraul
> > > <manfred@colorfullife.com> wrote:
> > >> Hello together,
> > >>
> > >> On 11/27/18 4:52 PM, syzbot wrote:
> > >>
> > >> Hello,
> > >>
> > >> syzbot found the following crash on:
> > >>
> > >> HEAD commit:    e195ca6cb6f2 Merge branch 'for-linus' of git://git.kernel...
> > >> git tree:       upstream
> > >> console output: https://syzkaller.appspot.com/x/log.txt?x=10d3e6a3400000
> > [...]
> > >> Isn't this a kernel stack overrun?
> > >>
> > >> RSP: 0x..83e008. Assuming 8 kB kernel stack, and 8 kB alignment, we have
> > >> used up everything.
> > > I don't exact answer, that's just the kernel output that we captured
> > > from console.
> > >
> > > FWIW with KASAN stacks are 16K:
> > > https://elixir.bootlin.com/linux/latest/source/arch/x86/include/asm/page_64_types.h#L10
> > Ok, thanks. And stack overrun detection is enabled as well -> a real
> > stack overrun is unlikely.
> > > Well, generally everything except for kernel crashes is expected.
> > >
> > > We actually sandbox it with memcg quite aggressively:
> > > https://github.com/google/syzkaller/blob/master/executor/common_linux.h#L2159
> > > But it seems to manage to either break the limits, or cause some
> > > massive memory leaks. The nature of that is yet unknown.
> >
> > Is it possible to start from that side?
> >
> > Are there other syzcaller runs where the OOM killer triggers that much?
>
> Lots of them:
>
> https://groups.google.com/forum/#!searchin/syzkaller-upstream-moderation/lowmem_reserve
> https://groups.google.com/forum/#!searchin/syzkaller-bugs/lowmem_reserve
>
> But nobody got any hook on the reasons.
>
>
> > >> - Which stress tests are enabled? By chance, I found:
> > >>
> > >> [  433.304586] FAULT_INJECTION: forcing a failure.^M
> > >> [  433.304586] name fail_page_alloc, interval 1, probability 0, space 0,
> > >> times 0^M
> > >> [  433.316471] CPU: 1 PID: 19653 Comm: syz-executor4 Not tainted 4.20.0-rc3+
> > >> #348^M
> > >> [  433.323841] Hardware name: Google Google Compute Engine/Google Compute
> > >> Engine, BIOS Google 01/01/2011^M
> > >>
> > >> I need some more background, then I can review the code.
> > > What exactly do you mean by "Which stress tests"?
> > > Fault injection is enabled. Also random workload from userspace.
> > >
> > >
> > >> Right now, I would put it into my "unknown syzcaller finding" folder.
> >
> > One more idea: Are there further syzcaller runs that end up with
> > 0x010000 in a pointer?
>
> Hard to say. syzbot triggered millions of crashes. I can't say that I
> remember this as distinctive pattern that come up before.
>
> >  From what I see, the sysv sem code that is used is trivial, I don't see
> > that it could cause the observed behavior.
>
> I propose that we postpone further investigation of this until we have
> a reproducer, or this happens more than once, or we gather some other
> information.
> Half of bugs are simple, so even for a crash happened once it makes
> sense to spend 10 minutes looking at the code in case the root cause
> is easy to spot. And hundreds of bugs were fixed this way. But I
> assume you already did this.
> The thing is that there are 100+ known bugs in kernel that lead to
> memory corruptions:
> https://syzkaller.appspot.com/#upstream-open
> We try to catch them reliably with KASAN, but KASAN does not give 100%
> guarantee. So if just one instance of a known bug gets unnoticed,
> leads to a memory corruption, then later it can lead to an
> unexplainable one-off crash like this. At this point higher ROI will
> probably be from spending more time on hundreds of other known bugs
> that have reproducers, happened lots of times, or just simpler. Once
> we get rid of most of them, hopefully such unexplainable crashes will
> go down too.

The working hypothesis for this bug is as follows.

semget provokes OOMs. OOMs now cause kernel stack overflow/corruption
in wb_workfn. So semget is kinda red herring.
Since we now sandbox tests processes with sem sysctl and friends, I
think we can close this report.

#syz invalid

Though the kernel memory corruption on OOMs is still there.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2019-03-26  8:47 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-27 15:52 BUG: corrupted list in freeary syzbot
2018-11-27 16:07 ` Eric W. Biederman
2018-11-27 16:22   ` Sean Christopherson
     [not found] ` <621f7c52-de52-eb35-bf61-e839adee7ec9@colorfullife.com>
2018-11-30 17:58   ` Dmitry Vyukov
2018-11-30 18:06     ` Dmitry Vyukov
2018-12-01 20:22     ` Manfred Spraul
2018-12-03 14:53       ` Dmitry Vyukov
2019-03-26  8:47         ` Dmitry Vyukov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.