All of lore.kernel.org
 help / color / mirror / Atom feed
* kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
@ 2024-03-25  7:45 Xiubo Li
  2024-03-25 10:14 ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Xiubo Li @ 2024-03-25  7:45 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, Ceph Development, linux-fsdevel

Hi guys,

We are hitting the same crash frequently recently with the latest kernel 
when testing kceph, and the call trace will be something likes:

[ 1580.034891] usercopy: Kernel memory exposure attempt detected from 
SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
[ 1580.045866] ------------[ cut here ]------------^M
[ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
^M
Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4 
Oops: (null)^M
due to oops @ 0xffffffff8138cabd^M
CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
RIP: 0010:usercopy_abort+0x6d/0x80^M
Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f 
45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49 
c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000) 
knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
Call Trace:^M
  <TASK>^M
  ? kdb_main_loop+0x32c/0xa10^M
  ? kdb_stub+0x216/0x420^M
more>

You can see more detail in ceph tracker 
https://tracker.ceph.com/issues/64471.

I have seen someone has reported a similar issue one year ago but that 
has been fixed already, please see 
https://lore.kernel.org/linux-mm/CANn89iLaWZhrfyn8NBzdN1zQC0d47WC4_jvpwKQPoHwyCVueVQ@mail.gmail.com/T/.

Is it a known issue ?

Thanks

- Xiubo




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
  2024-03-25  7:45 kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort Xiubo Li
@ 2024-03-25 10:14 ` David Hildenbrand
  2024-03-25 12:06   ` Xiubo Li
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-03-25 10:14 UTC (permalink / raw)
  To: Xiubo Li, linux-mm; +Cc: linux-kernel, Ceph Development, linux-fsdevel

On 25.03.24 08:45, Xiubo Li wrote:
> Hi guys,
> 
> We are hitting the same crash frequently recently with the latest kernel
> when testing kceph, and the call trace will be something likes:
> 
> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
> [ 1580.045866] ------------[ cut here ]------------^M
> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
> ^M
> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
> Oops: (null)^M
> due to oops @ 0xffffffff8138cabd^M
> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
> RIP: 0010:usercopy_abort+0x6d/0x80^M
> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
> knlGS:0000000000000000^M
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
> Call Trace:^M
>    <TASK>^M
>    ? kdb_main_loop+0x32c/0xa10^M
>    ? kdb_stub+0x216/0x420^M
> more>
> 
> You can see more detail in ceph tracker
> https://tracker.ceph.com/issues/64471.

Where is the full backtrace? Above contains only the backtrace of kdb.

That link also contains:

Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5 
Oops: (null)^M
due to oops @ 0xfffffffface3a1d2^M
CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted 
5.14.0-421.el9.x86_64 #1^M
Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
RIP: 0010:usercopy_abort+0x74/0x76^M
Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7 c6 
60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b 0f 
b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000) 
knlGS:0000000000000000^M
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
Call Trace:^M
  <TASK>^M
  ? show_trace_log_lvl+0x1c4/0x2df^M
more>


Don't we have more information about the calltrace somewhere? (or a 
reproducer?)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
  2024-03-25 10:14 ` David Hildenbrand
@ 2024-03-25 12:06   ` Xiubo Li
  2024-03-25 16:43     ` David Hildenbrand
  0 siblings, 1 reply; 7+ messages in thread
From: Xiubo Li @ 2024-03-25 12:06 UTC (permalink / raw)
  To: David Hildenbrand, linux-mm; +Cc: linux-kernel, Ceph Development, linux-fsdevel


On 3/25/24 18:14, David Hildenbrand wrote:
> On 25.03.24 08:45, Xiubo Li wrote:
>> Hi guys,
>>
>> We are hitting the same crash frequently recently with the latest kernel
>> when testing kceph, and the call trace will be something likes:
>>
>> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
>> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
>> [ 1580.045866] ------------[ cut here ]------------^M
>> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
>> ^M
>> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
>> Oops: (null)^M
>> due to oops @ 0xffffffff8138cabd^M
>> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
>> RIP: 0010:usercopy_abort+0x6d/0x80^M
>> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
>> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
>> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
>> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
>> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
>> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
>> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
>> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
>> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
>> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
>> knlGS:0000000000000000^M
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
>> Call Trace:^M
>>    <TASK>^M
>>    ? kdb_main_loop+0x32c/0xa10^M
>>    ? kdb_stub+0x216/0x420^M
>> more>
>>
>> You can see more detail in ceph tracker
>> https://tracker.ceph.com/issues/64471.
>
> Where is the full backtrace? Above contains only the backtrace of kdb.
>
Hi David,

The bad news is that there is no more backtrace. All the failures we hit 
are similar with the following logs:


> That link also contains:
>
> Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5 
> Oops: (null)^M
> due to oops @ 0xfffffffface3a1d2^M
> CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted 
> 5.14.0-421.el9.x86_64 #1^M
> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
> RIP: 0010:usercopy_abort+0x74/0x76^M
> Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7 
> c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b 
> 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
> RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
> RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
> RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
> RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
> R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
> R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
> FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000) 
> knlGS:0000000000000000^M
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> Call Trace:^M
>  <TASK>^M
>  ? show_trace_log_lvl+0x1c4/0x2df^M
> more>
>
>
> Don't we have more information about the calltrace somewhere? (or a 
> reproducer?)

There is no reproducer and each time the failure test cases are 
different. So it seems randomly.

Thanks

- Xiubo




^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
  2024-03-25 12:06   ` Xiubo Li
@ 2024-03-25 16:43     ` David Hildenbrand
  2024-03-25 19:37       ` Ilya Dryomov
  0 siblings, 1 reply; 7+ messages in thread
From: David Hildenbrand @ 2024-03-25 16:43 UTC (permalink / raw)
  To: Xiubo Li, linux-mm
  Cc: linux-kernel, Ceph Development, linux-fsdevel, Vlastimil Babka

On 25.03.24 13:06, Xiubo Li wrote:
> 
> On 3/25/24 18:14, David Hildenbrand wrote:
>> On 25.03.24 08:45, Xiubo Li wrote:
>>> Hi guys,
>>>
>>> We are hitting the same crash frequently recently with the latest kernel
>>> when testing kceph, and the call trace will be something likes:
>>>
>>> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
>>> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
>>> [ 1580.045866] ------------[ cut here ]------------^M
>>> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
>>> ^M
>>> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
>>> Oops: (null)^M
>>> due to oops @ 0xffffffff8138cabd^M
>>> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
>>> RIP: 0010:usercopy_abort+0x6d/0x80^M
>>> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
>>> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
>>> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
>>> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
>>> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
>>> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
>>> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
>>> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
>>> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
>>> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
>>> knlGS:0000000000000000^M
>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
>>> Call Trace:^M
>>>     <TASK>^M
>>>     ? kdb_main_loop+0x32c/0xa10^M
>>>     ? kdb_stub+0x216/0x420^M
>>> more>
>>>
>>> You can see more detail in ceph tracker
>>> https://tracker.ceph.com/issues/64471.
>>
>> Where is the full backtrace? Above contains only the backtrace of kdb.
>>
> Hi David,
> 
> The bad news is that there is no more backtrace. All the failures we hit
> are similar with the following logs:
> 

That's unfortunate :/

"exposure" in the message means we are in copy_to_user().

SLUB object 'kmalloc-192' means that we come from __check_heap_object() 
... we have 192 bytes, but the length we want to access is 499712 ... 
488 KiB.

So we ended  up somehow in

__copy_to_user()->check_object_size()->__check_object_size()->
check_heap_object()->__check_heap_object()->usercopy_abort()


... but the big question is which code tried to copy way too much memory 
out of a slab folio to user space.

> 
>> That link also contains:
>>
>> Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5
>> Oops: (null)^M
>> due to oops @ 0xfffffffface3a1d2^M
>> CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted
>> 5.14.0-421.el9.x86_64 #1^M
>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
>> RIP: 0010:usercopy_abort+0x74/0x76^M
>> Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7
>> c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b
>> 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
>> RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
>> RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
>> RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
>> RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
>> R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
>> R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
>> FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000)
>> knlGS:0000000000000000^M
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>> CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
>> Call Trace:^M
>>   <TASK>^M
>>   ? show_trace_log_lvl+0x1c4/0x2df^M


... are we stuck in show_trace_log_lvl(), probably deadlocked not being 
able to print the actuall callstack? If so, that's nasty.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
  2024-03-25 16:43     ` David Hildenbrand
@ 2024-03-25 19:37       ` Ilya Dryomov
  2024-03-26  1:17         ` Xiubo Li
  0 siblings, 1 reply; 7+ messages in thread
From: Ilya Dryomov @ 2024-03-25 19:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Xiubo Li, linux-mm, linux-kernel, Ceph Development,
	linux-fsdevel, Vlastimil Babka

On Mon, Mar 25, 2024 at 6:39 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.03.24 13:06, Xiubo Li wrote:
> >
> > On 3/25/24 18:14, David Hildenbrand wrote:
> >> On 25.03.24 08:45, Xiubo Li wrote:
> >>> Hi guys,
> >>>
> >>> We are hitting the same crash frequently recently with the latest kernel
> >>> when testing kceph, and the call trace will be something likes:
> >>>
> >>> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
> >>> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
> >>> [ 1580.045866] ------------[ cut here ]------------^M
> >>> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
> >>> ^M
> >>> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
> >>> Oops: (null)^M
> >>> due to oops @ 0xffffffff8138cabd^M
> >>> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
> >>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
> >>> RIP: 0010:usercopy_abort+0x6d/0x80^M
> >>> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
> >>> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
> >>> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
> >>> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
> >>> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
> >>> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
> >>> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
> >>> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
> >>> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
> >>> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
> >>> knlGS:0000000000000000^M
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >>> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
> >>> Call Trace:^M
> >>>     <TASK>^M
> >>>     ? kdb_main_loop+0x32c/0xa10^M
> >>>     ? kdb_stub+0x216/0x420^M
> >>> more>
> >>>
> >>> You can see more detail in ceph tracker
> >>> https://tracker.ceph.com/issues/64471.
> >>
> >> Where is the full backtrace? Above contains only the backtrace of kdb.
> >>
> > Hi David,
> >
> > The bad news is that there is no more backtrace. All the failures we hit
> > are similar with the following logs:
> >
>
> That's unfortunate :/
>
> "exposure" in the message means we are in copy_to_user().
>
> SLUB object 'kmalloc-192' means that we come from __check_heap_object()
> ... we have 192 bytes, but the length we want to access is 499712 ...
> 488 KiB.
>
> So we ended  up somehow in
>
> __copy_to_user()->check_object_size()->__check_object_size()->
> check_heap_object()->__check_heap_object()->usercopy_abort()
>
>
> ... but the big question is which code tried to copy way too much memory
> out of a slab folio to user space.
>
> >
> >> That link also contains:
> >>
> >> Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5
> >> Oops: (null)^M
> >> due to oops @ 0xfffffffface3a1d2^M
> >> CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted
> >> 5.14.0-421.el9.x86_64 #1^M
> >> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
> >> RIP: 0010:usercopy_abort+0x74/0x76^M
> >> Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7
> >> c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b
> >> 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
> >> RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
> >> RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
> >> RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
> >> RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
> >> R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
> >> R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
> >> FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000)
> >> knlGS:0000000000000000^M
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
> >> CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
> >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
> >> Call Trace:^M
> >>   <TASK>^M
> >>   ? show_trace_log_lvl+0x1c4/0x2df^M
>
>
> ... are we stuck in show_trace_log_lvl(), probably deadlocked not being
> able to print the actuall callstack? If so, that's nasty.

Hi David,

I don't think so.  This appears to be a cut-and-paste from what is
essentially a non-interactive serial console.  Stack trace entries
prefixed with ? aren't exact and kdb prompt

    more>

is there in all cases which is what hides the rest of the stack.

There are four ways to get the entire stack trace here:

a) try to attach to the serial console and interact with kdb -- this
   is very much hit or miss due to general IPMI/BMC unreliability and
   the fact that it would be already attached to for logging
b) disable kdb by passing "kdb: false" in the job definition -- this
   should result in /sys/module/kgdboc/parameters/kgdboc cleared after
   booting into the kernel under test (or just hack teuthology to not
   pass "kdb: true" which it does by default if "-k <kernel>" is given
   when scheduling)
c) if b) fails, rebuild the kernel with kdb disabled in Kconfig
d) configure kdump and grab a vmcore -- these is no teuthology support
   for this, so it would be challenging but would provide the most data
   to chew on

Xiubo, I'd recommend going with b), but take your pick ;)

Thanks,

                Ilya

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
  2024-03-25 19:37       ` Ilya Dryomov
@ 2024-03-26  1:17         ` Xiubo Li
  0 siblings, 0 replies; 7+ messages in thread
From: Xiubo Li @ 2024-03-26  1:17 UTC (permalink / raw)
  To: Ilya Dryomov, David Hildenbrand
  Cc: linux-mm, linux-kernel, Ceph Development, linux-fsdevel, Vlastimil Babka


On 3/26/24 03:37, Ilya Dryomov wrote:
> On Mon, Mar 25, 2024 at 6:39 PM David Hildenbrand <david@redhat.com> wrote:
>> On 25.03.24 13:06, Xiubo Li wrote:
>>> On 3/25/24 18:14, David Hildenbrand wrote:
>>>> On 25.03.24 08:45, Xiubo Li wrote:
>>>>> Hi guys,
>>>>>
>>>>> We are hitting the same crash frequently recently with the latest kernel
>>>>> when testing kceph, and the call trace will be something likes:
>>>>>
>>>>> [ 1580.034891] usercopy: Kernel memory exposure attempt detected from
>>>>> SLUB object 'kmalloc-192' (offset 82, size 499712)!^M
>>>>> [ 1580.045866] ------------[ cut here ]------------^M
>>>>> [ 1580.050551] kernel BUG at mm/usercopy.c:102!^M
>>>>> ^M
>>>>> Entering kdb (current=0xffff8881211f5500, pid 172901) on processor 4
>>>>> Oops: (null)^M
>>>>> due to oops @ 0xffffffff8138cabd^M
>>>>> CPU: 4 PID: 172901 Comm: fsstress Tainted: G S 6.6.0-g623393c9d50c #1^M
>>>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 1.0c 09/07/2015^M
>>>>> RIP: 0010:usercopy_abort+0x6d/0x80^M
>>>>> Code: 4c 0f 44 d0 41 53 48 c7 c0 1c e9 13 82 48 c7 c6 71 62 13 82 48 0f
>>>>> 45 f0 48 89 f9 48 c7 c7 f0 6b 1b 82 4c 89 d2 e8 63 2b df ff <0f> 0b 49
>>>>> c7 c1 44 c8 14 82 4d 89 cb 4d 89 c8 eb a5 66 90 f3 0f 1e^M
>>>>> RSP: 0018:ffffc90006dfba88 EFLAGS: 00010246^M
>>>>> RAX: 000000000000006a RBX: 000000000007a000 RCX: 0000000000000000^M
>>>>> RDX: 0000000000000000 RSI: ffff88885fd1d880 RDI: ffff88885fd1d880^M
>>>>> RBP: 000000000007a000 R08: 0000000000000000 R09: c0000000ffffdfff^M
>>>>> R10: 0000000000000001 R11: ffffc90006dfb930 R12: 0000000000000001^M
>>>>> R13: ffff8882b7bbed12 R14: ffff88827a375830 R15: ffff8882b7b44d12^M
>>>>> FS:  00007fb24c859500(0000) GS:ffff88885fd00000(0000)
>>>>> knlGS:0000000000000000^M
>>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>>>> CR2: 000055c2bcf9eb00 CR3: 000000028956c005 CR4: 00000000001706e0^M
>>>>> Call Trace:^M
>>>>>      <TASK>^M
>>>>>      ? kdb_main_loop+0x32c/0xa10^M
>>>>>      ? kdb_stub+0x216/0x420^M
>>>>> more>
>>>>>
>>>>> You can see more detail in ceph tracker
>>>>> https://tracker.ceph.com/issues/64471.
>>>> Where is the full backtrace? Above contains only the backtrace of kdb.
>>>>
>>> Hi David,
>>>
>>> The bad news is that there is no more backtrace. All the failures we hit
>>> are similar with the following logs:
>>>
>> That's unfortunate :/
>>
>> "exposure" in the message means we are in copy_to_user().
>>
>> SLUB object 'kmalloc-192' means that we come from __check_heap_object()
>> ... we have 192 bytes, but the length we want to access is 499712 ...
>> 488 KiB.
>>
>> So we ended  up somehow in
>>
>> __copy_to_user()->check_object_size()->__check_object_size()->
>> check_heap_object()->__check_heap_object()->usercopy_abort()
>>
>>
>> ... but the big question is which code tried to copy way too much memory
>> out of a slab folio to user space.
>>
>>>> That link also contains:
>>>>
>>>> Entering kdb (current=0xffff9115d14fb980, pid 61925) on processor 5
>>>> Oops: (null)^M
>>>> due to oops @ 0xfffffffface3a1d2^M
>>>> CPU: 5 PID: 61925 Comm: ld Kdump: loaded Not tainted
>>>> 5.14.0-421.el9.x86_64 #1^M
>>>> Hardware name: Supermicro SYS-5018R-WR/X10SRW-F, BIOS 2.0 12/17/2015^M
>>>> RIP: 0010:usercopy_abort+0x74/0x76^M
>>>> Code: 14 74 ad 51 48 0f 44 d6 49 c7 c3 cb 9f 73 ad 4c 89 d1 57 48 c7
>>>> c6 60 83 75 ad 48 c7 c7 00 83 75 ad 49 0f 44 f3 e8 1b 3b ff ff <0f> 0b
>>>> 0f b6 d3 4d 89 e0 48 89 e9 31 f6 48 c7 c7 7f 83 75 ad e8 73^M
>>>> RSP: 0018:ffffbb97c16af8d0 EFLAGS: 00010246^M
>>>> RAX: 0000000000000072 RBX: 0000000000000112 RCX: 0000000000000000^M
>>>> RDX: 0000000000000000 RSI: ffff911d1fd60840 RDI: ffff911d1fd60840^M
>>>> RBP: 0000000000004000 R08: 80000000ffff84b4 R09: 0000000000ffff0a^M
>>>> R10: 0000000000000004 R11: 0000000000000076 R12: ffff9115c0be8b00^M
>>>> R13: 0000000000000001 R14: ffff911665df9f68 R15: ffff9115d16be112^M
>>>> FS:  00007ff20442eb80(0000) GS:ffff911d1fd40000(0000)
>>>> knlGS:0000000000000000^M
>>>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>>> CR2: 00007ff20446142d CR3: 00000001215ec003 CR4: 00000000003706e0^M
>>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000^M
>>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400^M
>>>> Call Trace:^M
>>>>    <TASK>^M
>>>>    ? show_trace_log_lvl+0x1c4/0x2df^M
>>
>> ... are we stuck in show_trace_log_lvl(), probably deadlocked not being
>> able to print the actuall callstack? If so, that's nasty.
> Hi David,
>
> I don't think so.  This appears to be a cut-and-paste from what is
> essentially a non-interactive serial console.  Stack trace entries
> prefixed with ? aren't exact and kdb prompt
>
>      more>
>
> is there in all cases which is what hides the rest of the stack.
>
> There are four ways to get the entire stack trace here:
>
> a) try to attach to the serial console and interact with kdb -- this
>     is very much hit or miss due to general IPMI/BMC unreliability and
>     the fact that it would be already attached to for logging
> b) disable kdb by passing "kdb: false" in the job definition -- this
>     should result in /sys/module/kgdboc/parameters/kgdboc cleared after
>     booting into the kernel under test (or just hack teuthology to not
>     pass "kdb: true" which it does by default if "-k <kernel>" is given
>     when scheduling)
> c) if b) fails, rebuild the kernel with kdb disabled in Kconfig
> d) configure kdump and grab a vmcore -- these is no teuthology support
>     for this, so it would be challenging but would provide the most data
>     to chew on
>
> Xiubo, I'd recommend going with b), but take your pick ;)

Thanks Ilya

Let's try the b) first and to see could we find something new.

Thanks

- Xiubo


> Thanks,
>
>                  Ilya
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* re: kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort
@ 2024-03-25  8:15 Dawei Li
  0 siblings, 0 replies; 7+ messages in thread
From: Dawei Li @ 2024-03-25  8:15 UTC (permalink / raw)
  To: xiubli; +Cc: linux-kernel, dawei.li

Hi,

Sorry for the top-posting cuz I miss you original mail.

About issue on [1], maybe I am the trouble maker here.

Can you please revert commit below to see if it can fix it?
commit 328c801335d5f7edf2a3c9c331ddf8978f21e2a7
cpumask: create dedicated kmem cache for cpumask var

[1] https://lore.kernel.org/all/e119b3e2-09a0-47a7-945c-98a1f03633ef@redhat.com/

Sorry for the mess.

Thanks,

	Dawei

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2024-03-26  1:17 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-25  7:45 kernel BUG at mm/usercopy.c:102 -- pc : usercopy_abort Xiubo Li
2024-03-25 10:14 ` David Hildenbrand
2024-03-25 12:06   ` Xiubo Li
2024-03-25 16:43     ` David Hildenbrand
2024-03-25 19:37       ` Ilya Dryomov
2024-03-26  1:17         ` Xiubo Li
2024-03-25  8:15 Dawei Li

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.