* "cat /proc/sys/kernel/random/entropy_avail" sometimes causes page allocation failure
@ 2021-02-01 11:26 Steven Noonan
  2021-02-01 12:20 ` Vlastimil Babka
From: Steven Noonan @ 2021-02-01 11:26 UTC (permalink / raw)
  To: linux-mm



(Please CC me on replies, I'm not subscribed to the list.)

I started seeing this problem several months ago, but after some Google searches I concluded that someone had already root caused and fixed it, and I just needed to wait for a newer kernel to come along:

https://www.spinics.net/lists/linux-mm/msg226311.html

But it turns out this has not yet been fixed, even in 5.10.10. Also, while it used to be logged as an order:5 allocation, it's now an order:6 allocation for some reason. It only happens rarely (anywhere from once per day to once per week), but it's kind of scary to check the kernel log and see this:

[569537.153097] cat: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[569537.153104] CPU: 1 PID: 3153417 Comm: cat Tainted: P S        IO      5.10.10-1-hsw #1
[569537.153105] Hardware name: ASUSTeK COMPUTER INC. P10S-M WS Series/P10S-M WS Series, BIOS 4503 07/23/2019
[569537.153106] Call Trace:
[569537.153111]  dump_stack+0x58/0x6b
[569537.153113]  warn_alloc.cold+0x78/0xdc
[569537.153116]  ? __alloc_pages_direct_compact+0x15f/0x170
[569537.153117]  __alloc_pages_slowpath.constprop.0+0xc4e/0xc70
[569537.153120]  __alloc_pages_nodemask+0x2de/0x310
[569537.153121]  kmalloc_order+0x25/0x80
[569537.153122]  kmalloc_order_trace+0x17/0x80
[569537.153125]  proc_sys_call_handler+0xb8/0x250
[569537.153127]  new_sync_read+0x10f/0x190
[569537.153128]  vfs_read+0x147/0x1a0
[569537.153130]  ksys_read+0x5f/0xe0
[569537.153131]  do_syscall_64+0x2d/0x70
[569537.153134]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[569537.153135] RIP: 0033:0x7fcc21ef92b2
[569537.153137] Code: c0 e9 b2 fe ff ff 50 48 8d 3d d2 2a 0a 00 e8 e5 ec 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[569537.153138] RSP: 002b:00007ffcfbcc5c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[569537.153140] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcc21ef92b2
[569537.153140] RDX: 0000000000020000 RSI: 00007fcc21ff8000 RDI: 0000000000000003
[569537.153141] RBP: 00007fcc21ff8000 R08: 00007fcc21ff7010 R09: 0000000000000000
[569537.153142] R10: 0000000000000022 R11: 0000000000000246 R12: 0000556e99b70c00
[569537.153142] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
[569537.153144] Mem-Info:
[569537.153146] active_anon:50924 inactive_anon:1715988 isolated_anon:0
                 active_file:163427 inactive_file:104041 isolated_file:0
                 unevictable:3247 dirty:5825 writeback:0
                 slab_reclaimable:39566 slab_unreclaimable:1760175
                 mapped:49406 shmem:17850 pagetables:10611 bounce:0
                 free:1060578 free_pcp:0 free_cma:0
[569537.153149] Node 0 active_anon:203696kB inactive_anon:6863952kB active_file:653708kB inactive_file:416164kB unevictable:12988kB isolated(anon):0kB isolated(file):0kB mapped:197624kB dirty:23300kB writeback:0kB shmem:71400kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:18112kB all_unreclaimable? no
[569537.153151] Node 0 DMA free:11800kB min:24kB low:36kB high:48kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[569537.153153] lowmem_reserve[]: 0 1810 64139 64139 64139
[569537.153156] Node 0 DMA32 free:252476kB min:3208kB low:5076kB high:6944kB reserved_highatomic:0KB active_anon:756kB inactive_anon:124844kB active_file:2120kB inactive_file:180kB unevictable:0kB writepending:0kB present:1937436kB managed:1871900kB mlocked:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[569537.153158] lowmem_reserve[]: 0 0 62329 62329 62329
[569537.153160] Node 0 Normal free:3978036kB min:109404kB low:173232kB high:237060kB reserved_highatomic:2048KB active_anon:202940kB inactive_anon:6739108kB active_file:651588kB inactive_file:415984kB unevictable:12988kB writepending:23076kB present:65011712kB managed:63831896kB mlocked:11176kB pagetables:42436kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[569537.153162] lowmem_reserve[]: 0 0 0 0 0
[569537.153164] Node 0 DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB
[569537.153171] Node 0 DMA32: 3738*4kB (UME) 2341*8kB (UME) 1719*16kB (UME) 943*32kB (UME) 443*64kB (UME) 19*128kB (UME) 108*256kB (UME) 169*512kB (UM) 8*1024kB (UM) 4*2048kB (UM) 0*4096kB = 252704kB
[569537.153179] Node 0 Normal: 81071*4kB (UMEH) 112101*8kB (UMEH) 101108*16kB (UMEH) 34297*32kB (UMEH) 660*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3978564kB
[569537.153185] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
[569537.153186] Node 0 hugepages_total=64 hugepages_free=51 hugepages_surp=64 hugepages_size=2048kB
[569537.153186] 298118 total pagecache pages
[569537.153197] 14242 pages in swap cache
[569537.153198] Swap cache stats: add 165298, delete 151079, find 10932118/10944726
[569537.153199] Free swap  = 100129528kB
[569537.153199] Total swap = 100663288kB
[569537.153200] 16741283 pages RAM
[569537.153201] 0 pages HighMem/MovableOnly
[569537.153201] 311360 pages reserved
[569537.153202] 0 pages hwpoisoned

I used sysdig to figure out what 'cat' was trying to read when it failed, via:

# sysdig -p "%12user.name %6proc.pid %12proc.name %fd.name %evt.args" 'evt.type=read and evt.failed=true and proc.name=cat'

The relevant one was this:

nobody       315341 cat          /proc/sys/kernel/random/entropy_avail res=-12(ENOMEM) data=

I believe this is a munin script gathering a bunch of data for graphs: munin-node runs as "nobody", uses shell scripts for its data gathering, and one of its plugins tracks entropy_avail, so that would seem to be the guilty party.

Any idea why it's trying to do an order:6 allocation for just a simple read of entropy_avail? Based on the stack trace, I'm guessing this isn't owned by the "random" driver maintainers; it seems to be a more general issue with memory allocations while reading from procfs.



* Re: "cat /proc/sys/kernel/random/entropy_avail" sometimes causes page allocation failure
  2021-02-01 11:26 "cat /proc/sys/kernel/random/entropy_avail" sometimes causes page allocation failure Steven Noonan
@ 2021-02-01 12:20 ` Vlastimil Babka
  2021-02-01 13:52   ` Matthew Wilcox
From: Vlastimil Babka @ 2021-02-01 12:20 UTC (permalink / raw)
  To: Steven Noonan, linux-mm; +Cc: Christoph Hellwig, Josef Bacik, Matthew Wilcox

On 2/1/21 12:26 PM, Steven Noonan wrote:
> (Please CC me on replies, I'm not subscribed to the list.)
> 
> I started seeing this problem several months ago, but after some Google searches I concluded that someone had already root caused and fixed it, and I just needed to wait for a newer kernel to come along:
> 
> https://www.spinics.net/lists/linux-mm/msg226311.html

Yeah, that thread mentions Josef's series [1], but I don't see that it made it
into git log. What happened to it? Meanwhile Matthew refactored it with
"4bd6a7353ee1 ("sysctl: Convert to iter interfaces")" which left the kzalloc()
but changed count to count+1, which would explain the change from order-5 to
order-6. In my quick test, strace cat tells me it uses a 128k-sized read, which is
order-5.
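
To make the size math concrete, here is a rough userspace sketch (not kernel
code; the numbers are just the ones from this report): cat's 128k read means
count == 131072, so kzalloc(count + 1) has to find 131073 contiguous bytes,
which gets rounded up to the next power of two, 262144 bytes == 64 pages ==
order 6. Without the +1 it fits in exactly 32 pages == order 5.

#include <stdio.h>

int main(void)
{
	unsigned long count = 128 * 1024;	/* cat's read size */
	unsigned long need = count + 1;		/* kzalloc(count + 1, ...) */
	unsigned long size = 4096;		/* one page */
	int order = 0;

	/* round up to a power-of-two number of pages, as the large-kmalloc
	 * path effectively does */
	while (size < need) {
		size <<= 1;
		order++;
	}

	printf("count=%lu -> allocation size=%lu (order %d)\n",
	       count, size, order);
	return 0;
}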

So minimally there should be kvzalloc(), but it's still somewhat unfortunate to
do that for reading a few bytes.
Also the check for KMALLOC_MAX_SIZE happens before the +1. Meh. Matthew?
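
As an aside, the allocation simply tracks the userspace read size, so a reader
that uses a small buffer instead of cat's 128k one never goes near the
high-order path. A minimal, purely illustrative sketch (file name and buffer
size are just examples):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[64];
	ssize_t n;
	int fd = open("/proc/sys/kernel/random/entropy_avail", O_RDONLY);

	if (fd < 0)
		return 1;
	n = read(fd, buf, sizeof(buf) - 1);	/* count == 63, only a tiny kzalloc */
	if (n > 0) {
		buf[n] = '\0';
		fputs(buf, stdout);
	}
	close(fd);
	return 0;
}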

[1]
https://lore.kernel.org/linux-fsdevel/20200813210411.905010-2-josef@toxicpanda.com/

> But it turns out this has not yet been fixed, even in 5.10.10. Also, while it used to be logged as an order:5 allocation it's now an order:6 allocation for some reason. It only happens rarely (anywhere from once per day to once per week), but it's kind of scary to check the kernel log and see this:
> 
> [569537.153097] cat: page allocation failure: order:6, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
> [569537.153104] CPU: 1 PID: 3153417 Comm: cat Tainted: P S        IO      5.10.10-1-hsw #1
> [569537.153105] Hardware name: ASUSTeK COMPUTER INC. P10S-M WS Series/P10S-M WS Series, BIOS 4503 07/23/2019
> [569537.153106] Call Trace:
> [569537.153111]  dump_stack+0x58/0x6b
> [569537.153113]  warn_alloc.cold+0x78/0xdc
> [569537.153116]  ? __alloc_pages_direct_compact+0x15f/0x170
> [569537.153117]  __alloc_pages_slowpath.constprop.0+0xc4e/0xc70
> [569537.153120]  __alloc_pages_nodemask+0x2de/0x310
> [569537.153121]  kmalloc_order+0x25/0x80
> [569537.153122]  kmalloc_order_trace+0x17/0x80
> [569537.153125]  proc_sys_call_handler+0xb8/0x250
> [569537.153127]  new_sync_read+0x10f/0x190
> [569537.153128]  vfs_read+0x147/0x1a0
> [569537.153130]  ksys_read+0x5f/0xe0
> [569537.153131]  do_syscall_64+0x2d/0x70
> [569537.153134]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [569537.153135] RIP: 0033:0x7fcc21ef92b2
> [569537.153137] Code: c0 e9 b2 fe ff ff 50 48 8d 3d d2 2a 0a 00 e8 e5 ec 01 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
> [569537.153138] RSP: 002b:00007ffcfbcc5c88 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
> [569537.153140] RAX: ffffffffffffffda RBX: 0000000000020000 RCX: 00007fcc21ef92b2
> [569537.153140] RDX: 0000000000020000 RSI: 00007fcc21ff8000 RDI: 0000000000000003
> [569537.153141] RBP: 00007fcc21ff8000 R08: 00007fcc21ff7010 R09: 0000000000000000
> [569537.153142] R10: 0000000000000022 R11: 0000000000000246 R12: 0000556e99b70c00
> [569537.153142] R13: 0000000000000003 R14: 0000000000020000 R15: 0000000000020000
> [569537.153144] Mem-Info:
> [569537.153146] active_anon:50924 inactive_anon:1715988 isolated_anon:0
>                  active_file:163427 inactive_file:104041 isolated_file:0
>                  unevictable:3247 dirty:5825 writeback:0
>                  slab_reclaimable:39566 slab_unreclaimable:1760175
>                  mapped:49406 shmem:17850 pagetables:10611 bounce:0
>                  free:1060578 free_pcp:0 free_cma:0
> [569537.153149] Node 0 active_anon:203696kB inactive_anon:6863952kB active_file:653708kB inactive_file:416164kB unevictable:12988kB isolated(anon):0kB isolated(file):0kB mapped:197624kB dirty:23300kB writeback:0kB shmem:71400kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB kernel_stack:18112kB all_unreclaimable? no
> [569537.153151] Node 0 DMA free:11800kB min:24kB low:36kB high:48kB reserved_highatomic:0KB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [569537.153153] lowmem_reserve[]: 0 1810 64139 64139 64139
> [569537.153156] Node 0 DMA32 free:252476kB min:3208kB low:5076kB high:6944kB reserved_highatomic:0KB active_anon:756kB inactive_anon:124844kB active_file:2120kB inactive_file:180kB unevictable:0kB writepending:0kB present:1937436kB managed:1871900kB mlocked:0kB pagetables:8kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [569537.153158] lowmem_reserve[]: 0 0 62329 62329 62329
> [569537.153160] Node 0 Normal free:3978036kB min:109404kB low:173232kB high:237060kB reserved_highatomic:2048KB active_anon:202940kB inactive_anon:6739108kB active_file:651588kB inactive_file:415984kB unevictable:12988kB writepending:23076kB present:65011712kB managed:63831896kB mlocked:11176kB pagetables:42436kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
> [569537.153162] lowmem_reserve[]: 0 0 0 0 0
> [569537.153164] Node 0 DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 2*128kB (U) 0*256kB 0*512kB 1*1024kB (U) 1*2048kB (M) 2*4096kB (M) = 11800kB
> [569537.153171] Node 0 DMA32: 3738*4kB (UME) 2341*8kB (UME) 1719*16kB (UME) 943*32kB (UME) 443*64kB (UME) 19*128kB (UME) 108*256kB (UME) 169*512kB (UM) 8*1024kB (UM) 4*2048kB (UM) 0*4096kB = 252704kB
> [569537.153179] Node 0 Normal: 81071*4kB (UMEH) 112101*8kB (UMEH) 101108*16kB (UMEH) 34297*32kB (UMEH) 660*64kB (UME) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3978564kB
> [569537.153185] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [569537.153186] Node 0 hugepages_total=64 hugepages_free=51 hugepages_surp=64 hugepages_size=2048kB
> [569537.153186] 298118 total pagecache pages
> [569537.153197] 14242 pages in swap cache
> [569537.153198] Swap cache stats: add 165298, delete 151079, find 10932118/10944726
> [569537.153199] Free swap  = 100129528kB
> [569537.153199] Total swap = 100663288kB
> [569537.153200] 16741283 pages RAM
> [569537.153201] 0 pages HighMem/MovableOnly
> [569537.153201] 311360 pages reserved
> [569537.153202] 0 pages hwpoisoned
> 
> I used sysdig to figure out what 'cat' was trying to read when it failed, via:
> 
> # sysdig -p "%12user.name %6proc.pid %12proc.name %fd.name %evt.args" 'evt.type=read and evt.failed=true and proc.name=cat'
> 
> The relevant one was this:
> 
> nobody       315341 cat          /proc/sys/kernel/random/entropy_avail res=-12(ENOMEM) data=
> 
> Which I believe is a munin script gathering a bunch of data for graphs (munin-node runs as "nobody", uses shell scripts for its data gathering, and one of the plugins tracks entropy_avail, so that would seem to be the guilty party).
> 
> Any idea why it's trying to do an order:6 allocation for just a simple read of entropy_avail? Based on the stack trace I'm guessing this isn't owned by the "random" driver maintainers, it seems to be a more general issue with memory allocations while reading from procfs.
> 




* Re: "cat /proc/sys/kernel/random/entropy_avail" sometimes causes page allocation failure
  2021-02-01 12:20 ` Vlastimil Babka
@ 2021-02-01 13:52   ` Matthew Wilcox
  2021-02-09 11:17     ` Vlastimil Babka
From: Matthew Wilcox @ 2021-02-01 13:52 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Steven Noonan, linux-mm, Christoph Hellwig, Josef Bacik

On Mon, Feb 01, 2021 at 01:20:04PM +0100, Vlastimil Babka wrote:
> On 2/1/21 12:26 PM, Steven Noonan wrote:
> > (Please CC me on replies, I'm not subscribed to the list.)
> > 
> > I started seeing this problem several months ago, but after some Google searches I concluded that someone had already root caused and fixed it, and I just needed to wait for a newer kernel to come along:
> > 
> > https://www.spinics.net/lists/linux-mm/msg226311.html
> 
> Yeah, that thread mentions Josef's series [1], but I don't see that it made it
> into git log. What happened to it? Meanwhile Matthew refactored it with
> "4bd6a7353ee1 ("sysctl: Convert to iter interfaces")" which left the kzalloc()
> but changed count to count+1, which would explain the change from order-5 to
> order-6. In my quick test, strace cat tells me it uses a 128k-sized read, which is
> order-5.
> 
> So minimally there should be kvzalloc(), but it's still somewhat unfortunate to
> do that for reading a few bytes.
> Also the check for KMALLOC_MAX_SIZE happens before the +1. Meh. Matthew?

That's why the check is >= KMALLOC_MAX_SIZE.

Switching to kvzalloc() means we'll only allocate one extra page, not twice
as many pages.  We can't know how large the proc file is going to want the
buffer to be, but perhaps we can do this ...

@@ -569,17 +569,16 @@ static ssize_t proc_sys_call_handler(struct kiocb *iocb, struct iov_iter *iter,
 
        /* don't even try if the size is too large */
        error = -ENOMEM;
-       if (count >= KMALLOC_MAX_SIZE)
+       if (count > KMALLOC_MAX_SIZE)
                goto out;
-       kbuf = kzalloc(count + 1, GFP_KERNEL);
+       kbuf = kvzalloc(count, GFP_KERNEL);
        if (!kbuf)
                goto out;
 
        if (write) {
                error = -EFAULT;
-               if (!copy_from_iter_full(kbuf, count, iter))
+               if (!copy_from_iter_full(kbuf, count - 1, iter))
                        goto out_free_buf;
-               kbuf[count] = '\0';
        }
 
        error = BPF_CGROUP_RUN_PROG_SYSCTL(head, table, write, &kbuf, &count,
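
For readability, the hunk with this diff applied would read roughly as below
(reconstructed from the context lines above, not taken from a posted patch).
The write path now copies only count - 1 bytes, presumably relying on
kvzalloc() having zeroed the buffer so the final byte is already the '\0'
terminator, and kvzalloc() can fall back to vmalloc() when that many
physically contiguous pages aren't available:

	/* don't even try if the size is too large */
	error = -ENOMEM;
	if (count > KMALLOC_MAX_SIZE)
		goto out;
	kbuf = kvzalloc(count, GFP_KERNEL);
	if (!kbuf)
		goto out;

	if (write) {
		error = -EFAULT;
		if (!copy_from_iter_full(kbuf, count - 1, iter))
			goto out_free_buf;
		/* last byte stays '\0' from kvzalloc()'s zeroing */
	}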





* Re: "cat /proc/sys/kernel/random/entropy_avail" sometimes causes page allocation failure
  2021-02-01 13:52   ` Matthew Wilcox
@ 2021-02-09 11:17     ` Vlastimil Babka
From: Vlastimil Babka @ 2021-02-09 11:17 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Steven Noonan, linux-mm, Christoph Hellwig, Josef Bacik

On 2/1/21 2:52 PM, Matthew Wilcox wrote:
> On Mon, Feb 01, 2021 at 01:20:04PM +0100, Vlastimil Babka wrote:
>> On 2/1/21 12:26 PM, Steven Noonan wrote:
>> > (Please CC me on replies, I'm not subscribed to the list.)
>> > 
>> > I started seeing this problem several months ago, but after some Google searches I concluded that someone had already root caused and fixed it, and I just needed to wait for a newer kernel to come along:
>> > 
>> > https://www.spinics.net/lists/linux-mm/msg226311.html
>> 
>> Yeah, that thread mentions Josef's series [1], but I don't see that it made it
>> into git log. What happened to it? Meanwhile Matthew refactored it with
>> "4bd6a7353ee1 ("sysctl: Convert to iter interfaces")" which left the kzalloc()
>> but changed count to count+1, which would explain the change from order-5 to
>> order-6. In my quick test, strace cat tells me it uses a 128k-sized read, which is
>> order-5.
>> 
>> So minimally there should be kvzalloc(), but it's still somewhat unfortunate to
>> do that for reading a few bytes.
>> Also the check for KMALLOC_MAX_SIZE happens before the +1. Meh. Matthew?
> 
> That's why the check is >= KMALLOC_MAX_SIZE.
> 
> Switching to kvzalloc() means we'll only allocate one extra page, not twice
> as many pages.  We can't know how large the proc file is going to want the
> buffer to be, but perhaps we can do this ...

Looks ok to me, but I'm not that familiar with iov_iter stuff. Care to send a
proper patch?

> [... quoted patch from the previous message snipped ...]
