linux-kernel.vger.kernel.org archive mirror
* Re: System freezes after OOM
       [not found]   ` <9be09452-de7f-d8be-fd5d-4a80d1cd1ba3@redhat.com>
@ 2016-07-11 15:43     ` Mikulas Patocka
  2016-07-12  6:49       ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-11 15:43 UTC (permalink / raw)
  To: Ondrej Kozina
  Cc: Jerome Marchand, Stanislav Kozina, Michal Hocko, linux-mm, linux-kernel



On Mon, 11 Jul 2016, Ondrej Kozina wrote:

> On 07/11/2016 01:55 PM, Jerome Marchand wrote:
> > On 07/11/2016 01:03 PM, Stanislav Kozina wrote:
> > > Hi Jerome,
> > > 
> > > On upstream mailing lists there have been reports of systems freezing
> > > due to OOM. Ondra (on CC) managed to reproduce this in-house; he'd like
> > > someone with mm skills to look at the problem, since he doesn't
> > > understand why OOM comes into play when >90% of the 2GB swap is still free.
> > > 
> > > Could you please take a look? It's following this email on upstream:
> > > https://lkml.org/lkml/2016/5/5/356
> > > 
> > > Thanks!
> > > -Stanislav
> > 
> > Hi Ondrej,
> > 
> > I can see [1] that there are several atomic memory allocation failures
> > before the OOM kill, several of them in the memory reclaim path, which
> > prevents it from freeing memory.
> > Normally the Linux mm tries to keep enough memory free at all times to
> > satisfy atomic allocations (cf. /proc/sys/vm/min_free_kbytes). Have you
> > tried to increase that value?
> > It would be useful to understand why the reserve for atomic allocations
> > runs out. There might be a burst of atomic allocations that deplete the
> > reserve. What kind of workload is that?
> > 
> > Jerome
> > 
> > [1]:
> > https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg
> > 
> 
> Hi Jerome,
> 
> first, let me thank you for looking into it! The workload is nothing
> special. I started a gcc build of a C++ project in 3-4 threads so that I'd
> use up all physical memory to trigger it. I can build some simple utility to
> allocate memory in predefined chunks in a loop if it'd be of any help. It was
> really quite simple to trigger this.
> 
> As for the /proc/sys/vm/min_free_kbytes value, let me try it...
> 
> Thanks Ondra
> 
> PS: Adding Mikulas on CC (dm-crypt upstream) in case he has anything to
> add.

That allocation warning in wb_start_writeback was already silenced by the 
commit 78ebc2f7146156f488083c9e5a7ded9d5c38c58b. The warning in 
drivers/virtio/virtio_ring.c:alloc_indirect could be silenced as well (the 
driver falls back in case of allocation failure, so the failure can't 
result in loss of functionality).
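
For illustration, silencing such a warning usually just means passing
__GFP_NOWARN at the allocation site - a hedged sketch, not the actual
patch (alloc_indirect() allocates the indirect descriptor table in
drivers/virtio/virtio_ring.c):

	/* sketch: suppress the failure splat; on NULL the caller simply
	 * falls back to direct descriptors, so nothing is lost */
	desc = kmalloc(total_sg * sizeof(struct vring_desc), gfp | __GFP_NOWARN);
	if (!desc)
		return NULL;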


The general problem is that the memory allocator does 16 retries to 
allocate a page and then triggers the OOM killer (and it doesn't take into 
account how much swap space is free or how many dirty pages were really 
swapped out while it waited).

So, it could prematurely trigger the OOM killer on any slow swapping
device (including dm-crypt). Michal Hocko reworked the OOM killer in the
patch 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the
flaw that it triggers OOM if there is plenty of free swap space.

Michal, would you accept a change to the OOM killer, to prevent it from
triggering when there is free swap space?

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-11 15:43     ` System freezes after OOM Mikulas Patocka
@ 2016-07-12  6:49       ` Michal Hocko
  2016-07-12 23:44         ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-12  6:49 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
[...]
> The general problem is that the memory allocator does 16 retries to 
> allocate a page and then triggers the OOM killer (and it doesn't take into 
> account how much swap space is free or how many dirty pages were really 
> swapped out while it waited).

Well, that is not exactly how it works. We retry as long as there is
reclaim progress (at least one page freed) and back off only once the
reclaimable memory can no longer exceed the watermark; the estimate is
scaled down over 16 retries. The overall size of free swap is not really
that important if we cannot swap out, like here, due to complete memory
reserves depletion:
https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
[   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
[   90.491283] lowmem_reserve[]: 0 977 977 977
[   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes

Look at the amount of free memory. It is completely depleted. So it
smells like a process which has access to memory reserves has consumed
all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
context which went off the leash.
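
For reference, this is roughly the backoff heuristic that commit
0a0337e0d1d1 introduced - a simplified sketch of should_reclaim_retry()
from mm/page_alloc.c, with the zone iteration and exact arguments
elided:

	if (no_progress_loops > MAX_RECLAIM_RETRIES)	/* 16 */
		return false;	/* no more retries -> OOM killer */

	/* estimate of what reclaim could still free, scaled down as
	 * unsuccessful retries accumulate */
	available = zone_reclaimable_pages(zone) +
		    zone_page_state_snapshot(zone, NR_FREE_PAGES);
	available -= DIV_ROUND_UP(no_progress_loops * available,
				  MAX_RECLAIM_RETRIES);

	/* keep retrying while that estimate can exceed the watermark */
	return __zone_watermark_ok(zone, order, min_wmark_pages(zone),
				   classzone_idx, alloc_flags, available);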

> So, it could prematurely trigger the OOM killer on any slow swapping
> device (including dm-crypt). Michal Hocko reworked the OOM killer in the
> patch 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the
> flaw that it triggers OOM if there is plenty of free swap space.
> 
> Michal, would you accept a change to the OOM killer, to prevent it from
> triggering when there is free swap space?

No, this doesn't sound like a proper solution. The current decision
logic, as explained above, relies on the feedback from the reclaim.
Free swap space doesn't really mean we can make forward progress.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-12  6:49       ` Michal Hocko
@ 2016-07-12 23:44         ` Mikulas Patocka
  2016-07-13  8:35           ` Jerome Marchand
                             ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-12 23:44 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

The problem of swapping to dm-crypt is this.

The free memory goes low and kswapd decides that some page should be
swapped out. However, when you swap to an encrypted device, writeback of
each page requires another page to hold the encrypted data. dm-crypt
uses mempools for all its structures and pages, so that it can make
forward progress even if there is no memory free. However, the mempool
code first allocates from the general memory allocator and resorts to
the mempool only if the memory is below the limit.

So every attempt to swap out some page allocates another page.

As long as swapping is in progress, the free memory is below the limit 
(because the swapping activity itself consumes any memory over the limit). 
And that triggered the OOM killer prematurely.
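
For reference, the allocation strategy described above looks roughly
like this - a simplified sketch of mempool_alloc() from mm/mempool.c
(waitqueue and retry details elided):

	gfp_temp = gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

	/* 1) first try the general allocator, staying above watermarks */
	element = pool->alloc(gfp_temp, pool->pool_data);
	if (element)
		return element;

	/* 2) fall back to the pre-allocated reserve */
	spin_lock_irqsave(&pool->lock, flags);
	if (pool->curr_nr) {
		element = pool->elements[--pool->curr_nr];
		spin_unlock_irqrestore(&pool->lock, flags);
		return element;
	}
	spin_unlock_irqrestore(&pool->lock, flags);

	/* 3) otherwise sleep until an element is freed back to the pool */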


On Tue, 12 Jul 2016, Michal Hocko wrote:

> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
> [...]
> > The general problem is that the memory allocator does 16 retries to 
> > allocate a page and then triggers the OOM killer (and it doesn't take into 
> > account how much swap space is free or how many dirty pages were really 
> > swapped out while it waited).
> 
> Well, that is not exactly how it works. We retry as long as there is
> reclaim progress (at least one page freed) and back off only once the
> reclaimable memory can no longer exceed the watermark; the estimate is
> scaled down over 16 retries. The overall size of free swap is not really
> that important if we cannot swap out, like here, due to complete memory
> reserves depletion:
> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
> [   90.491283] lowmem_reserve[]: 0 977 977 977
> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
> 
> Look at the amount of free memory. It is completely depleted. So it
> smells like a process which has access to memory reserves has consumed
> all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
> context which went off the leash.

It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
exhausted reserved memory. With this commit, mempool allocations drop 
__GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
they can bypass all limits).
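
The gist of that change, paraphrased (not the verbatim diff of
f9054c70d28bc214b2857cf8db8269f4f45a5e23):

	/* before: the first-try allocation never touched the reserves */
	gfp_temp = gfp_mask | __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;

	/* after: __GFP_NOMEMALLOC is kept only while the pool still has
	 * elements to fall back on; with an empty pool the allocation
	 * may dig into the memory reserves */
	if (pool->curr_nr)
		gfp_temp |= __GFP_NOMEMALLOC;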

But swapping should proceed even if there is no memory free. There is a 
comment "TODO: this could cause a theoretical memory reclaim deadlock in 
the swap out path." in the function add_to_swap - but apart from that, 
swap should proceed even with no available memory, as long as all the 
drivers in the block layer use mempools.

> > So, it could prematurely trigger the OOM killer on any slow swapping
> > device (including dm-crypt). Michal Hocko reworked the OOM killer in the
> > patch 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the
> > flaw that it triggers OOM if there is plenty of free swap space.
> > 
> > Michal, would you accept a change to the OOM killer, to prevent it from
> > triggering when there is free swap space?
> 
> > No, this doesn't sound like a proper solution. The current decision
> > logic, as explained above, relies on the feedback from the reclaim.
> > Free swap space doesn't really mean we can make forward progress.

I'm interested - why would you need to trigger the OOM killer if there is 
free swap space?

The only possibility is that all the memory is filled with unswappable
kernel pages - but that condition could be detected if there is an
unusually low number of anonymous and cache pages. Besides that - in
what situation is triggering the OOM killer with free swap desired?

> -- 
> Michal Hocko
> SUSE Labs
> 

The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
and the following stacktrace was obtained when swapping to dm-crypt.

We can see that dm-crypt does a mempool allocation. But the mempool 
allocation somehow falls into throttle_vm_writeout. There, it waits for 
0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
processing requests at an unusually slow rate of 10 requests per second 
and it results in the machine being stuck (it would probably recover if 
we waited for an extreme amount of time).
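
For context, the 0.1 second wait comes from this loop - a simplified
sketch of throttle_vm_writeout() from mm/page-writeback.c:

	for (;;) {
		global_dirty_limits(&background_thresh, &dirty_thresh);
		/* boost the threshold a bit for page allocators */
		dirty_thresh += dirty_thresh / 10;

		if (global_page_state(NR_UNSTABLE_NFS) +
		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;
		congestion_wait(BLK_RW_ASYNC, HZ/10);	/* the 0.1 s sleep */
	}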

[  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
[  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
[  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
[  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
[  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
[  345.352536] Call Trace:
[  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
[  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
[  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
[  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
[  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
[  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
[  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
[  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
[  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
[  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
[  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
[  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
[  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
[  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
[  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
[  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
[  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
[  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
[  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
[  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
[  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
[  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
[  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
[  345.352536]  [<ffffffff810cc312>] process_one_work+0x242/0x700
[  345.352536]  [<ffffffff810cc28a>] ? process_one_work+0x1ba/0x700
[  345.352536]  [<ffffffff810cc81e>] worker_thread+0x4e/0x490
[  345.352536]  [<ffffffff810cc7d0>] ? process_one_work+0x700/0x700
[  345.352536]  [<ffffffff810d3c01>] kthread+0x101/0x120
[  345.352536]  [<ffffffff8110b9f5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
[  345.352536]  [<ffffffff818db1af>] ret_from_fork+0x1f/0x40
[  345.352536]  [<ffffffff810d3b00>] ? kthread_create_on_node+0x250/0x250

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-12 23:44         ` Mikulas Patocka
@ 2016-07-13  8:35           ` Jerome Marchand
  2016-07-13 11:14             ` Michal Hocko
  2016-07-13 11:10           ` Michal Hocko
  2016-07-13 13:19           ` Tetsuo Handa
  2 siblings, 1 reply; 60+ messages in thread
From: Jerome Marchand @ 2016-07-13  8:35 UTC (permalink / raw)
  To: Mikulas Patocka, Michal Hocko, Ondrej Kozina
  Cc: Stanislav Kozina, linux-mm, linux-kernel


[-- Attachment #1.1: Type: text/plain, Size: 8827 bytes --]

On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> The problem of swapping to dm-crypt is this.
> 
> The free memory goes low and kswapd decides that some page should be
> swapped out. However, when you swap to an encrypted device, writeback of
> each page requires another page to hold the encrypted data. dm-crypt
> uses mempools for all its structures and pages, so that it can make
> forward progress even if there is no memory free. However, the mempool
> code first allocates from the general memory allocator and resorts to
> the mempool only if the memory is below the limit.
> 
> So every attempt to swap out some page allocates another page.
> 
> As long as swapping is in progress, the free memory is below the limit 
> (because the swapping activity itself consumes any memory over the limit). 
> And that triggered the OOM killer prematurely.

There is a quite recent sysctl vm knob that I believe can help in this
case: watermark_scale_factor. If you increase this value, kswapd will
start paging out earlier, when there might still be enough free memory.

Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?

Jerome

> 
> 
> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
>> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
>> [...]
>>> The general problem is that the memory allocator does 16 retries to 
>>> allocate a page and then triggers the OOM killer (and it doesn't take into 
>>> account how much swap space is free or how many dirty pages were really 
>>> swapped out while it waited).
>>
>> Well, that is not exactly how it works. We retry as long as there is
>> reclaim progress (at least one page freed) and back off only once the
>> reclaimable memory can no longer exceed the watermark; the estimate is
>> scaled down over 16 retries. The overall size of free swap is not really
>> that important if we cannot swap out, like here, due to complete memory
>> reserves depletion:
>> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
>> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
>> [   90.491283] lowmem_reserve[]: 0 977 977 977
>> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
>>
>> Look at the amount of free memory. It is completely depleted. So it
>> smells like a process which has access to memory reserves has consumed
>> all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
>> context which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).
> 
> But swapping should proceed even if there is no memory free. There is a 
> comment "TODO: this could cause a theoretical memory reclaim deadlock in 
> the swap out path." in the function add_to_swap - but apart from that, 
> swap should proceed even with no available memory, as long as all the 
> drivers in the block layer use mempools.
> 
>>> So, it could prematurely trigger the OOM killer on any slow swapping
>>> device (including dm-crypt). Michal Hocko reworked the OOM killer in the
>>> patch 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the
>>> flaw that it triggers OOM if there is plenty of free swap space.
>>>
>>> Michal, would you accept a change to the OOM killer, to prevent it from
>>> triggering when there is free swap space?
>>
>> No, this doesn't sound like a proper solution. The current decision
>> logic, as explained above, relies on the feedback from the reclaim.
>> Free swap space doesn't really mean we can make forward progress.
> 
> I'm interested - why would you need to trigger the OOM killer if there is 
> free swap space?
> 
> The only possibility is that all the memory is filled with unswappable
> kernel pages - but that condition could be detected if there is an
> unusually low number of anonymous and cache pages. Besides that - in
> what situation is triggering the OOM killer with free swap desired?
> 
>> -- 
>> Michal Hocko
>> SUSE Labs
>>
> 
> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
> and the following stacktrace was obtained when swapping to dm-crypt.
> 
> We can see that dm-crypt does a mempool allocation. But the mempool 
> allocation somehow falls into throttle_vm_writeout. There, it waits for 
> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
> processing requests at an unusually slow rate of 10 requests per second 
> and it results in the machine being stuck (it would probably recover if 
> we waited for an extreme amount of time).
> 
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> [  345.352536]  [<ffffffff810cc312>] process_one_work+0x242/0x700
> [  345.352536]  [<ffffffff810cc28a>] ? process_one_work+0x1ba/0x700
> [  345.352536]  [<ffffffff810cc81e>] worker_thread+0x4e/0x490
> [  345.352536]  [<ffffffff810cc7d0>] ? process_one_work+0x700/0x700
> [  345.352536]  [<ffffffff810d3c01>] kthread+0x101/0x120
> [  345.352536]  [<ffffffff8110b9f5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
> [  345.352536]  [<ffffffff818db1af>] ret_from_fork+0x1f/0x40
> [  345.352536]  [<ffffffff810d3b00>] ? kthread_create_on_node+0x250/0x250
> 



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-12 23:44         ` Mikulas Patocka
  2016-07-13  8:35           ` Jerome Marchand
@ 2016-07-13 11:10           ` Michal Hocko
  2016-07-13 12:50             ` Michal Hocko
  2016-07-13 15:02             ` Mikulas Patocka
  2016-07-13 13:19           ` Tetsuo Handa
  2 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-13 11:10 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> The problem of swapping to dm-crypt is this.
> 
> The free memory goes low and kswapd decides that some page should be
> swapped out. However, when you swap to an encrypted device, writeback of
> each page requires another page to hold the encrypted data. dm-crypt
> uses mempools for all its structures and pages, so that it can make
> forward progress even if there is no memory free. However, the mempool
> code first allocates from the general memory allocator and resorts to
> the mempool only if the memory is below the limit.

OK, thanks for the clarification. I guess the core part happens in
crypt_alloc_buffer, right?

> So every attempt to swap out some page allocates another page.
> 
> As long as swapping is in progress, the free memory is below the limit 
> (because the swapping activity itself consumes any memory over the limit). 
> And that triggered the OOM killer prematurely.

I am not sure I understand the last part. Are you saying that we trigger
OOM because the initiated swapout will not be able to finish the IO and
thus release the page in time?

The oom detection waits for an ongoing writeout if there is no reclaim
progress and at least half of the reclaimable memory is either dirty or
under writeback. Pages under swapout are marked as under writeback
AFAIR. The writeout path (dm-crypt worker in this case) should be able
to allocate memory from the mempool, hand it over to the crypt layer
and finish the IO. Is it possible this might take a lot of time?
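
For reference, the check I mean looks roughly like this - a simplified
sketch from should_reclaim_retry() in mm/page_alloc.c:

	/* no reclaim progress and at least half of the reclaimable
	 * memory dirty or under writeback: wait for an IO to complete
	 * instead of declaring OOM */
	if (!did_some_progress) {
		writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
		dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);

		if (2 * (writeback + dirty) > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;	/* retry the allocation */
		}
	}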

> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
> > On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
> > [...]
> > > The general problem is that the memory allocator does 16 retries to 
> > > allocate a page and then triggers the OOM killer (and it doesn't take into 
> > > account how much swap space is free or how many dirty pages were really 
> > > swapped out while it waited).
> > 
> > Well, that is not exactly how it works. We retry as long as there is
> > reclaim progress (at least one page freed) and back off only once the
> > reclaimable memory can no longer exceed the watermark; the estimate is
> > scaled down over 16 retries. The overall size of free swap is not really
> > that important if we cannot swap out, like here, due to complete memory
> > reserves depletion:
> > https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
> > [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
> > [   90.491283] lowmem_reserve[]: 0 977 977 977
> > [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
> > 
> > Look at the amount of free memory. It is completely depleted. So it
> > smells like a process which has access to memory reserves has consumed
> > all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
> > context which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).

Hmm, but the patch allows access to the memory reserves only when the
pool is empty. And even then the caller would have to request access to
the reserves explicitly, either by __GFP_MEMALLOC or PF_MEMALLOC. That
doesn't seem to be the case for dm-crypt, though. Or do you suspect
that some other mempool user might be doing so?

> But swapping should proceed even if there is no memory free. There is a 
> comment "TODO: this could cause a theoretical memory reclaim deadlock in 
> the swap out path." in the function add_to_swap - but apart from that, 
> swap should proceed even with no available memory, as long as all the 
> drivers in the block layer use mempools.
> 
> > > So, it could prematurely trigger the OOM killer on any slow swapping
> > > device (including dm-crypt). Michal Hocko reworked the OOM killer in the
> > > patch 0a0337e0d1d134465778a16f5cbea95086e8e9e0, but it still has the
> > > flaw that it triggers OOM if there is plenty of free swap space.
> > > 
> > > Michal, would you accept a change to the OOM killer, to prevent it from
> > > triggering when there is free swap space?
> > 
> > No, this doesn't sound like a proper solution. The current decision
> > logic, as explained above, relies on the feedback from the reclaim.
> > Free swap space doesn't really mean we can make forward progress.
> 
> I'm interested - why would you need to trigger the OOM killer if there is 
> free swap space?

Let me clarify. If there is swappable memory then we shouldn't normally
trigger the OOM killer, of course. And that should be the case with the
current implementation. We just rely on the swapout making some progress
and back off only if that is not the case after several attempts, with
throttling based on the writeback counters. Checking the available swap
space doesn't guarantee forward progress, though. If the swapout is
stuck for some reason then it is safer to trigger the OOM killer rather
than wait or thrash forever (or for an excessive amount of time).

Now, I can see that the retry logic might need some tuning for complex
setups like dm-crypt swap partitions because the progress might be much
slower there. But I would like to understand what the worst-case
estimate is for the swapout path, with all the roadblocks on the way in
this setup, before we can think of a proper retry logic tuning.

> The only possibility is that all the memory is filled with unswappable
> kernel pages - but that condition could be detected if there is an
> unusually low number of anonymous and cache pages. Besides that - in
> what situation is triggering the OOM killer with free swap desired?

I hope the above has explained that.

> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
> and the following stacktrace was obtained when swapping to dm-crypt.
> 
> We can see that dm-crypt does a mempool allocation. But the mempool 
> allocation somehow falls into throttle_vm_writeout. There, it waits for 
> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
> processing requests at an unusually slow rate of 10 requests per second 
> and it results in the machine being stuck (it would probably recover if 
> we waited for an extreme amount of time).

Hmm, that throttling has basically been there forever. I do not see
what would have changed recently, but I haven't looked too closely, to
be honest.

I agree that throttling a flusher (which this worker definitely is)
doesn't look like the correct thing to do. We have PF_LESS_THROTTLE for
this kind of thing. So maybe the right thing to do is to use this flag
for the dm_crypt worker:

diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 4f3cb3554944..0b806810efab 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
 static void kcryptd_crypt(struct work_struct *work)
 {
 	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
+	unsigned int pflags = current->flags;
 
+	current->flags |= PF_LESS_THROTTLE;
 	if (bio_data_dir(io->base_bio) == READ)
 		kcryptd_crypt_read_convert(io);
 	else
 		kcryptd_crypt_write_convert(io);
+	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
 }
 
 static void kcryptd_queue_crypt(struct dm_crypt_io *io)

> 
> [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> [  345.352536] Call Trace:
> [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> [  345.352536]  [<ffffffff810cc312>] process_one_work+0x242/0x700
> [  345.352536]  [<ffffffff810cc28a>] ? process_one_work+0x1ba/0x700
> [  345.352536]  [<ffffffff810cc81e>] worker_thread+0x4e/0x490
> [  345.352536]  [<ffffffff810cc7d0>] ? process_one_work+0x700/0x700
> [  345.352536]  [<ffffffff810d3c01>] kthread+0x101/0x120
> [  345.352536]  [<ffffffff8110b9f5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
> [  345.352536]  [<ffffffff818db1af>] ret_from_fork+0x1f/0x40
> [  345.352536]  [<ffffffff810d3b00>] ? kthread_create_on_node+0x250/0x250

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13  8:35           ` Jerome Marchand
@ 2016-07-13 11:14             ` Michal Hocko
  2016-07-13 14:21               ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-13 11:14 UTC (permalink / raw)
  To: Jerome Marchand
  Cc: Mikulas Patocka, Ondrej Kozina, Stanislav Kozina, linux-mm, linux-kernel

On Wed 13-07-16 10:35:01, Jerome Marchand wrote:
> On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> > The problem of swapping to dm-crypt is this.
> > 
> > The free memory goes low and kswapd decides that some page should be
> > swapped out. However, when you swap to an encrypted device, writeback of
> > each page requires another page to hold the encrypted data. dm-crypt
> > uses mempools for all its structures and pages, so that it can make
> > forward progress even if there is no memory free. However, the mempool
> > code first allocates from the general memory allocator and resorts to
> > the mempool only if the memory is below the limit.
> > 
> > So every attempt to swap out some page allocates another page.
> > 
> > As long as swapping is in progress, the free memory is below the limit 
> > (because the swapping activity itself consumes any memory over the limit). 
> > And that triggered the OOM killer prematurely.
> 
> There is a quite recent sysctl vm knob that I believe can help in this
> case: watermark_scale_factor. If you increase this value, kswapd will
> start paging out earlier, when there might still be enough free memory.
> 
> Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?

I suspect this would just change the timing, or hide the real problem.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 11:10           ` Michal Hocko
@ 2016-07-13 12:50             ` Michal Hocko
  2016-07-13 13:44               ` Milan Broz
  2016-07-13 15:02             ` Mikulas Patocka
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-13 12:50 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

On Wed 13-07-16 13:10:06, Michal Hocko wrote:
> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
[...]
> > As long as swapping is in progress, the free memory is below the limit 
> > (because the swapping activity itself consumes any memory over the limit). 
> > And that triggered the OOM killer prematurely.
> 
> I am not sure I understand the last part. Are you saying that we trigger
> OOM because the initiated swapout will not be able to finish the IO and
> thus release the page in time?
> 
> The oom detection waits for an ongoing writeout if there is no reclaim
> progress and at least half of the reclaimable memory is either dirty or
> under writeback. Pages under swapout are marked as under writeback
> AFAIR. The writeout path (dm-crypt worker in this case) should be able
> to allocate memory from the mempool, hand it over to the crypt layer
> and finish the IO. Is it possible this might take a lot of time?

I am not familiar with the crypto API but from what I understood from
crypt_convert the encryption is done asynchronously. Then I got lost in
the indirection. Who is completing the request and from what kind of
context? Is it possible it wouldn't be runnable for a long time?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-12 23:44         ` Mikulas Patocka
  2016-07-13  8:35           ` Jerome Marchand
  2016-07-13 11:10           ` Michal Hocko
@ 2016-07-13 13:19           ` Tetsuo Handa
  2016-07-13 13:39             ` Michal Hocko
  2016-07-14  0:01             ` David Rientjes
  2 siblings, 2 replies; 60+ messages in thread
From: Tetsuo Handa @ 2016-07-13 13:19 UTC (permalink / raw)
  To: Mikulas Patocka, Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

> On Tue, 12 Jul 2016, Michal Hocko wrote:
> 
>> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
>> [...]
>>> The general problem is that the memory allocator does 16 retries to 
>>> allocate a page and then triggers the OOM killer (and it doesn't take into 
>>> account how much swap space is free or how many dirty pages were really 
>>> swapped out while it waited).
>>
>> Well, that is not exactly how it works. We retry as long as there is
>> reclaim progress (at least one page freed) and back off only once the
>> reclaimable memory can no longer exceed the watermark; the estimate is
>> scaled down over 16 retries. The overall size of free swap is not really
>> that important if we cannot swap out, like here, due to complete memory
>> reserves depletion:
>> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
>> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
>> [   90.491283] lowmem_reserve[]: 0 977 977 977
>> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
>>
>> Look at the amount of free memory. It is completely depleted. So it
>> smells like a process which has access to memory reserves has consumed
>> all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
>> context which went off the leash.
> 
> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> exhausted reserved memory. With this commit, mempool allocations drop 
> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> they can bypass all limits).

I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
__GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
It says

    If an oom killed thread calls mempool_alloc(), it is possible that it'll
    loop forever if there are no elements on the freelist since
    __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
    oom conditions.

but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
memory reserves via the change below, can't we? The purpose of allowing
ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure the current
allocation request does not loop forever inside the page allocator,
isn't it? Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC)
requests to use ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6903b69..e4e3700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3439,14 +3439,14 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
+	if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
+		alloc_flags |= ALLOC_NO_WATERMARKS;
+	else if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
 		if (gfp_mask & __GFP_MEMALLOC)
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
+		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
 			alloc_flags |= ALLOC_NO_WATERMARKS;
 	}
 #ifdef CONFIG_CMA

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 13:19           ` Tetsuo Handa
@ 2016-07-13 13:39             ` Michal Hocko
  2016-07-13 14:18               ` Mikulas Patocka
  2016-07-14  0:01             ` David Rientjes
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-13 13:39 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mikulas Patocka, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, David Rientjes

[CC David]

On Wed 13-07-16 22:19:23, Tetsuo Handa wrote:
> >> On Mon 11-07-16 11:43:02, Mikulas Patocka wrote:
> >> [...]
> >>> The general problem is that the memory allocator does 16 retries to 
> >>> allocate a page and then triggers the OOM killer (and it doesn't take into 
> >>> account how much swap space is free or how many dirty pages were really 
> >>> swapped out while it waited).
> >>
> >> Well, that is not exactly how it works. We retry as long as there is
> >> reclaim progress (at least one page freed) and back off only once the
> >> reclaimable memory can no longer exceed the watermark; the estimate is
> >> scaled down over 16 retries. The overall size of free swap is not really
> >> that important if we cannot swap out, like here, due to complete memory
> >> reserves depletion:
> >> https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/vmlog-1462458369-00000/sample-00011/dmesg:
> >> [   90.491276] Node 0 DMA free:0kB min:60kB low:72kB high:84kB active_anon:4096kB inactive_anon:4636kB active_file:212kB inactive_file:280kB unevictable:488kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15908kB mlocked:488kB dirty:276kB writeback:4636kB mapped:476kB shmem:12kB slab_reclaimable:204kB slab_unreclaimable:4700kB kernel_stack:48kB pagetables:120kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:61132 all_unreclaimable? yes
> >> [   90.491283] lowmem_reserve[]: 0 977 977 977
> >> [   90.491286] Node 0 DMA32 free:0kB min:3828kB low:4824kB high:5820kB active_anon:423820kB inactive_anon:424916kB active_file:17996kB inactive_file:21800kB unevictable:20724kB isolated(anon):384kB isolated(file):0kB present:1032184kB managed:1001260kB mlocked:20724kB dirty:25236kB writeback:49972kB mapped:23076kB shmem:1364kB slab_reclaimable:13796kB slab_unreclaimable:43008kB kernel_stack:2816kB pagetables:7320kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:5635400 all_unreclaimable? yes
> >>
> >> Look at the amount of free memory. It is completely depleted. So it
> >> smells like a process which has access to memory reserves has consumed
> >> all of it. I suspect a __GFP_MEMALLOC or PF_MEMALLOC user in softirq
> >> context which went off the leash.
> > 
> > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> > to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> > exhausted reserved memory. With this commit, mempool allocations drop 
> > __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> > they can bypass all limits).
> 
> I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> It says
> 
>     If an oom killed thread calls mempool_alloc(), it is possible that it'll
>     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
>     oom conditions.

I haven't studied the patch very deeply so I might be missing something
but from a quick look the patch does exactly what the above says.

mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
in the pool and so we have no fallback for the default __GFP_NORETRY
request.

> but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> memory reserves via the change below, can't we?

Well, I do not see all the potential side effects of such a change but
I believe it shouldn't be really necessary because we should eventually
allow ALLOC_NO_WATERMARKS even from mempool_alloc.

> The purpose of allowing
> ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure the current
> allocation request does not loop forever inside the page allocator,
> isn't it? Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC)
> requests to use ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 6903b69..e4e3700 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3439,14 +3439,14 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
> -	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> +	if (!in_interrupt() && unlikely(test_thread_flag(TIF_MEMDIE)))
> +		alloc_flags |= ALLOC_NO_WATERMARKS;
> +	else if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
>  		if (gfp_mask & __GFP_MEMALLOC)
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (!in_interrupt() &&
> -				((current->flags & PF_MEMALLOC) ||
> -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> +		else if (!in_interrupt() && (current->flags & PF_MEMALLOC))
>  			alloc_flags |= ALLOC_NO_WATERMARKS;
>  	}
>  #ifdef CONFIG_CMA
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 12:50             ` Michal Hocko
@ 2016-07-13 13:44               ` Milan Broz
  2016-07-13 15:21                 ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Milan Broz @ 2016-07-13 13:44 UTC (permalink / raw)
  To: Michal Hocko, Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, device-mapper development

On 07/13/2016 02:50 PM, Michal Hocko wrote:
> On Wed 13-07-16 13:10:06, Michal Hocko wrote:
>> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> [...]
>>> As long as swapping is in progress, the free memory is below the limit 
>>> (because the swapping activity itself consumes any memory over the limit). 
>>> And that triggered the OOM killer prematurely.
>>
>> I am not sure I understand the last part. Are you saying that we trigger
>> OOM because the initiated swapout will not be able to finish the IO and
>> thus release the page in time?
>>
>> The oom detection waits for an ongoing writeout if there is no reclaim
>> progress and at least half of the reclaimable memory is either dirty or
>> under writeback. Pages under swapout are marked as under writeback
>> AFAIR. The writeout path (dm-crypt worker in this case) should be able
>> to allocate memory from the mempool, hand it over to the crypt layer
>> and finish the IO. Is it possible this might take a lot of time?
> 
> I am not familiar with the crypto API but from what I understood from
> crypt_convert the encryption is done asynchronously. Then I got lost in
> the indirection. Who is completing the request and from what kind of
> context? Is it possible it wouldn't be runnable for a long time?

If you mean crypt_convert in dm-crypt, then it can do asynchronous
completion, but usually (with AES-NI and software implementations) it
runs the operation completely synchronously.
Asynchronous processing is quite rare, usually only on some specific hardware
crypto accelerators.

Once the encryption is finished, the cloned bio is sent to the block
layer for processing.
(There is also some magic with sorting writes but Mikulas knows this better.)

Milan
p.s. I added a cc to dm-devel; some dmcrypt people read only this list.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 13:39             ` Michal Hocko
@ 2016-07-13 14:18               ` Mikulas Patocka
  2016-07-13 14:56                 ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-13 14:18 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tetsuo Handa, Ondrej Kozina, Jerome Marchand, Stanislav Kozina,
	linux-mm, linux-kernel, David Rientjes



On Wed, 13 Jul 2016, Michal Hocko wrote:

> [CC David]
> 
> > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > they never exhausted reserved memory. With this commit, mempool 
> > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > process has PF_MEMALLOC, they can bypass all limits).
> > 
> > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > __GFP_NOMEMALLOC if there are free elements") is doing the correct thing.
> > It says
> > 
> >     If an oom killed thread calls mempool_alloc(), it is possible that it'll
> >     loop forever if there are no elements on the freelist since
> >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> >     oom conditions.
> 
> I haven't studied the patch very deeply so I might be missing something
> but from a quick look the patch does exactly what the above says.
> 
> mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> in the pool and so we have no fallback for the default __GFP_NORETRY
> request.

The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
to submit the swapping bio to the block driver. The device mapper driver 
uses mempools for all its I/O processing.
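
Roughly, the context being described - a simplified sketch of the
direct reclaim path (cf. __perform_reclaim() in mm/page_alloc.c):

	current->flags |= PF_MEMALLOC;
	progress = try_to_free_pages(zonelist, order, gfp_mask, nodemask);
	/* -> shrink_zone() -> pageout() -> swap_writepage()
	 *    -> generic_make_request() -> dm-crypt -> mempool_alloc(),
	 * all running with PF_MEMALLOC set */
	current->flags &= ~PF_MEMALLOC;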

Prior to the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc 
never exhausted the reserved memory - it tried to allocate first with 
__GFP_NOMEMALLOC (thus preventing the allocator from allocating below the 
limits), then it tried to allocate from the mempool reserve, and if the 
mempool was exhausted, it waited until some structures were returned to the 
mempool.

After the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, __GFP_NOMEMALLOC 
is not used if the mempool is exhausted - and so repeated use of 
mempool_alloc (together with PF_MEMALLOC that is implicitly set) can 
exhaust all available memory.

The patch f9054c70d28bc214b2857cf8db8269f4f45a5e23 allows more parallelism 
(mempool_alloc waits less and proceeds more often), but the downside is 
that it exhausts all the memory. Bisection showed that those dm-crypt 
swapping failures were caused by that patch.

I think f9054c70d28bc214b2857cf8db8269f4f45a5e23 should be reverted - but 
first, we need to find out why swapping fails if all the memory is 
exhausted - that is a separate bug that should be addressed first.

> > but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> > memory reserves via the change below, can't we?

There are no mempool_alloc(__GFP_NOMEMALLOC) requests - mempool users don't 
use this flag.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 11:14             ` Michal Hocko
@ 2016-07-13 14:21               ` Mikulas Patocka
  0 siblings, 0 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-13 14:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jerome Marchand, Ondrej Kozina, Stanislav Kozina, linux-mm, linux-kernel



On Wed, 13 Jul 2016, Michal Hocko wrote:

> On Wed 13-07-16 10:35:01, Jerome Marchand wrote:
> > On 07/13/2016 01:44 AM, Mikulas Patocka wrote:
> > > The problem of swapping to dm-crypt is this.
> > > 
> > > The free memory goes low and kswapd decides that some page should be
> > > swapped out. However, when you swap to an encrypted device, writeback of
> > > each page requires another page to hold the encrypted data. dm-crypt
> > > uses mempools for all its structures and pages, so that it can make
> > > forward progress even if there is no memory free. However, the mempool
> > > code first allocates from the general memory allocator and resorts to
> > > the mempool only if the memory is below the limit.
> > > 
> > > So every attempt to swap out some page allocates another page.
> > > 
> > > As long as swapping is in progress, the free memory is below the limit 
> > > (because the swapping activity itself consumes any memory over the limit). 
> > > And that triggered the OOM killer prematurely.
> > 
> > There is a quite recent sysctl vm knob that I believe can help in this
> > case: watermark_scale_factor. If you increase this value, kswapd will
> > start paging out earlier, when there might still be enough free memory.
> > 
> > Ondrej, have you tried to increase /proc/sys/vm/watermark_scale_factor?
> 
> I suspect this would just change the timing or the real problem gets
> hidden.

I agree - tweaking some limits would just change the probability of the 
bug without addressing the root cause.

We shouldn't tweak anything and just stick to Ondrej's scenario where he 
reproduced the bug.

Mikulas

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 14:18               ` Mikulas Patocka
@ 2016-07-13 14:56                 ` Michal Hocko
  2016-07-13 15:11                   ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-13 14:56 UTC (permalink / raw)
  To: Mikulas Patocka, David Rientjes
  Cc: Tetsuo Handa, Ondrej Kozina, Jerome Marchand, Stanislav Kozina,
	linux-mm, linux-kernel

On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Jul 2016, Michal Hocko wrote:
> 
> > [CC David]
> > 
> > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > > they never exhausted reserved memory. With this commit, mempool 
> > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > > process has PF_MEMALLOC, they can bypass all limits).
> > > 
> > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > > __GFP_NOMEMALLOC if there are free elements") is doing correct thing. 
> > > It says
> > > 
> > >     If an oom killed thread calls mempool_alloc(), it is possible that 
> > > it'll
> > >     loop forever if there are no elements on the freelist since
> > >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > >     oom conditions.
> > 
> > I haven't studied the patch very deeply so I might be missing something
> > but from a quick look the patch does exactly what the above says.
> > 
> > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > in the pool and so we have no fallback for the default __GFP_NORETRY
> > request.
> 
> The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
> to submit the swapping bio to the block driver. The device mapper driver 
> uses mempools for all its I/O processing.

OK, this is the part I have missed. I didn't realize that the swapout
path, which is indeed PF_MEMALLOC, can get down to blk code which uses
mempools. A quick code traversal shows that at least
	make_request_fn = blk_queue_bio
	blk_queue_bio
	  get_request
	    __get_request

might do that. And in that case I agree that the above mentioned patch
has unintentional side effects and should be re-evaluated. David, what
do you think? An obvious fixup would be considering TIF_MEMDIE in
mempool_alloc explicitly.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 11:10           ` Michal Hocko
  2016-07-13 12:50             ` Michal Hocko
@ 2016-07-13 15:02             ` Mikulas Patocka
  2016-07-14 10:51               ` [dm-devel] " Ondrej Kozina
  2016-07-14 12:51               ` Michal Hocko
  1 sibling, 2 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-13 15:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel



On Wed, 13 Jul 2016, Michal Hocko wrote:

> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> > The problem of swapping to dm-crypt is this.
> > 
> > The free memory goes low, kswapd decides that some page should be swapped 
> > out. However, when you swap to an encrypted device, writeback of each page 
> > requires another page to hold the encrypted data. dm-crypt uses mempools 
> > for all its structures and pages, so that it can make forward progress 
> > even if there is no memory free. However, the mempool code first allocates 
> > from general memory allocator and resorts to the mempool only if the 
> > memory is below limit.
> 
> OK, thanks for the clarification. I guess the core part happens in
> crypt_alloc_buffer, right?
> 
> > So every attempt to swap out some page allocates another page.
> > 
> > As long as swapping is in progress, the free memory is below the limit 
> > (because the swapping activity itself consumes any memory over the limit). 
> > And that triggered the OOM killer prematurely.
> 
> I am not sure I understand the last part. Are you saying that we trigger
> OOM because the initiated swapout will not be able to finish the IO and thus
> release the page in time?

On kernel 4.6 - premature OOM is triggered just because the free memory 
stays below the limit for some time.

On kernel 4.7-rc6 (that contains your OOM patch 
0a0337e0d1d134465778a16f5cbea95086e8e9e0), OOM is not triggered, but the 
machine slows down to a crawl, because the allocator applies throttling to 
the process that is doing the encryption.

> The oom detection check waits for an ongoing writeout if there is no
> reclaim progress and at least half of the reclaimable memory is either
> dirty or under writeback. Pages under swapout are marked as under
> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
> be able to allocate memory from the mempool, hand it over to the crypt
> layer and finish the IO. Is it possible this might take a lot of time?

See the backtrace below - the dm-crypt worker is making progress, but the 
memory allocator deliberately stalls it in throttle_vm_writeout.

> > On Tue, 12 Jul 2016, Michal Hocko wrote:
> > > 
> > > Look at the amount of free memory. It is completely depleted. So it
> > > smells like a process which has access to memory reserves has consumed
> > > all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
> > > context user which went off the leash.
> > 
> > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior 
> > to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never 
> > exhausted reserved memory. With this commit, mempool allocations drop 
> > __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC, 
> > they can bypass all limits).
> 
> Hmm, but the patch allows access to the memory reserves only when the
> pool is empty. And even then the caller would have to request access to
> reserves explicitly either by __GFP_NOMEMALLOC or PF_MEMALLOC. That

PF_MEMALLOC is set when you enter the block driver when swapping. So, some 
of the mempool allocations are done with PF_MEMALLOC.

> doesn't seem to be the case for the dm-crypt, though. Or do you suspect
> that some other mempool user might be doing so? 

Bisection showed that that patch triggered the dm-crypt swapping problems.

Without the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
1. allocates memory up to __GFP_NOMEMALLOC limit
2. allocates memory from the mempool reserve
3. waits, until some objects are returned to the mempool

With the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc 
1. allocates memory up to __GFP_NOMEMALLOC limit
2. allocates memory from the mempool reserve
3. allocates all remaining memory until total exhaustion
4. waits, until some objects are returned to the mempool

> > > No this doesn't sound like a proper solution. The current decision
> > > logic, as explained above relies on the feedback from the reclaim. A
> > > free swap space doesn't really mean we can make a forward progress.
> > 
> > I'm interested - why would you need to trigger the OOM killer if there is 
> > free swap space?
> 
> Let me clarify. If there is swappable memory then we shouldn't trigger
> the OOM killer normally of course. And that should be the case with the
> current implementation. We just rely on the swapout making some progress

And what exactly does "making some progress" mean? How do you measure it?

> and back off only if that is not the case after several attempts with a
> throttling based on the writeback counters. Checking the available swap
> space doesn't guarantee forward progress, though. If the swapout is
> stuck for some reason then it should be safer to trigger the OOM rather
> than wait or thrash forever (or for an excessive amount of time).
>
> Now, I can see that the retry logic might need some tuning for complex
> setups like dm-crypt swap partitions because the progress might be much
> slower there. But I would like to understand what is the worst estimate
> for the swapout path with all the roadblocks on the way for this setup
> before we can think of a proper retry logic tuning.

For example, you could increment a percpu counter each time writeback of a 
page is finished. If the counter stays the same for some pre-determined 
period of time, writeback is stuck and you could trigger OOM prematurely.
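
A rough sketch of that idea (untested; all names here are hypothetical):

	/* bumped from the writeback completion path */
	DEFINE_PER_CPU(unsigned long, writeback_done);

	static inline void note_writeback_done(void)
	{
		this_cpu_inc(writeback_done);
	}

	/* called periodically from the OOM detection retry logic */
	static bool writeback_is_stuck(unsigned long *last_sum)
	{
		unsigned long sum = 0;
		int cpu;

		for_each_possible_cpu(cpu)
			sum += per_cpu(writeback_done, cpu);

		if (sum != *last_sum) {
			*last_sum = sum;
			return false;	/* some page finished writeback */
		}
		return true;		/* no progress since the last check */
	}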

But the memory management code doesn't do that. So what does it really do, and 
what is the intention behind it?

Another question is - do we really want to try to recover in case of stuck 
writeback? If the swap device dies so that it stops processing I/Os, the 
system is dead anyway - there is no point in trying to recover it by 
killing processes.

The question is whether these safeguards against stuck writeback are really 
doing more harm than good. Do you have some real use case where you get 
stuck writeback and where you need to recover by OOM killing?

This is not the first time I've seen premature OOM. A long time ago, I saw a 
case when the admin set /proc/sys/vm/swappiness to a low value (because he 
was running some scientific calculations on the machine and he preferred 
memory being allocated to those calculations rather than to the cache) - 
and the result was premature OOM killing while the machine had plenty of 
free swap space.

> > The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
> > and the following stacktrace was obtained when swapping to dm-crypt.
> > 
> > We can see that dm-crypt does a mempool allocation. But the mempool 
> > allocation somehow falls into throttle_vm_writeout. There, it waits for 
> > 0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
> > processing requests at an unusually slow rate of 10 requests per second 
> > and it results in the machine being stuck (it would probably recover if 
> > we waited for an extreme amount of time).
> 
> Hmm, that throttling has been there basically forever. I do not see what
> would have changed that recently, but I haven't looked too closely to be
> honest.
> 
> I agree that throttling a flusher (which this worker definitely is)
> doesn't look like a correct thing to do. We have PF_LESS_THROTTLE for
> this kind of things. So maybe the right thing to do is to use this flag
> for the dm_crypt worker:
> 
> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> index 4f3cb3554944..0b806810efab 100644
> --- a/drivers/md/dm-crypt.c
> +++ b/drivers/md/dm-crypt.c
> @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>  static void kcryptd_crypt(struct work_struct *work)
>  {
>  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> +	unsigned int pflags = current->flags;
>  
> +	current->flags |= PF_LESS_THROTTLE;
>  	if (bio_data_dir(io->base_bio) == READ)
>  		kcryptd_crypt_read_convert(io);
>  	else
>  		kcryptd_crypt_write_convert(io);
> +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
>  }
>  
>  static void kcryptd_queue_crypt(struct dm_crypt_io *io)

^^^ That fixes just one specific case - but there may be other threads 
doing mempool allocations in the device mapper subsystem - and you would 
need to mark all of them.

I would try the patch below - generally, allocations from the mempool 
subsystem should not wait in the memory allocator at all. I don't know if 
there are other cases when these allocations can sleep. I'm interested if 
it fixes Ondrej's case - or if it uncovers some other sleeping.

An alternate possibility would be to drop the flag __GFP_DIRECT_RECLAIM in 
mempool_alloc - so that mempool allocations never sleep in the allocator.
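
An untested sketch of one way to do that - keep __GFP_DIRECT_RECLAIM in 
gfp_mask so that waiting on the pool itself still works, but never pass the 
flag down to the page allocator (assuming the retry path in mempool_alloc 
that re-derives gfp_temp from gfp_mask):

	/* in mempool_alloc(), wherever gfp_temp is (re)derived from gfp_mask */
	gfp_temp = gfp_mask & ~__GFP_DIRECT_RECLAIM;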

---
 mm/page-writeback.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-4.7-rc7/mm/page-writeback.c
===================================================================
--- linux-4.7-rc7.orig/mm/page-writeback.c	2016-07-12 20:57:53.000000000 +0200
+++ linux-4.7-rc7/mm/page-writeback.c	2016-07-12 20:59:41.000000000 +0200
@@ -1945,6 +1945,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 
+	/*
+	 * If we came here from mempool_alloc, we don't want to wait 0.1s.
+	 * We want to fail as soon as possible, so that the allocation is tried
+	 * from mempool reserve.
+	 */
+	if (unlikely(gfp_mask & __GFP_NORETRY))
+		return;
+
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);

> > 
> > [  345.352536] kworker/u4:0    D ffff88003df7f438 10488     6      2 0x00000000
> > [  345.352536] Workqueue: kcryptd kcryptd_crypt [dm_crypt]
> > [  345.352536]  ffff88003df7f438 ffff88003e5d0380 ffff88003e5d0380 ffff88003e5d8e80
> > [  345.352536]  ffff88003dfb3240 ffff88003df73240 ffff88003df80000 ffff88003df7f470
> > [  345.352536]  ffff88003e5d0380 ffff88003e5d0380 ffff88003df7f828 ffff88003df7f450
> > [  345.352536] Call Trace:
> > [  345.352536]  [<ffffffff818d466c>] schedule+0x3c/0x90
> > [  345.352536]  [<ffffffff818d96a8>] schedule_timeout+0x1d8/0x360
> > [  345.352536]  [<ffffffff81135e40>] ? detach_if_pending+0x1c0/0x1c0
> > [  345.352536]  [<ffffffff811407c3>] ? ktime_get+0xb3/0x150
> > [  345.352536]  [<ffffffff811958cf>] ? __delayacct_blkio_start+0x1f/0x30
> > [  345.352536]  [<ffffffff818d39e4>] io_schedule_timeout+0xa4/0x110
> > [  345.352536]  [<ffffffff8121d886>] congestion_wait+0x86/0x1f0
> > [  345.352536]  [<ffffffff810fdf40>] ? prepare_to_wait_event+0xf0/0xf0
> > [  345.352536]  [<ffffffff812061d4>] throttle_vm_writeout+0x44/0xd0
> > [  345.352536]  [<ffffffff81211533>] shrink_zone_memcg+0x613/0x720
> > [  345.352536]  [<ffffffff81211720>] shrink_zone+0xe0/0x300
> > [  345.352536]  [<ffffffff81211aed>] do_try_to_free_pages+0x1ad/0x450
> > [  345.352536]  [<ffffffff81211e7f>] try_to_free_pages+0xef/0x300
> > [  345.352536]  [<ffffffff811fef19>] __alloc_pages_nodemask+0x879/0x1210
> > [  345.352536]  [<ffffffff810e8080>] ? sched_clock_cpu+0x90/0xc0
> > [  345.352536]  [<ffffffff8125a8d1>] alloc_pages_current+0xa1/0x1f0
> > [  345.352536]  [<ffffffff81265ef5>] ? new_slab+0x3f5/0x6a0
> > [  345.352536]  [<ffffffff81265dd7>] new_slab+0x2d7/0x6a0
> > [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> > [  345.352536]  [<ffffffff812678cb>] ___slab_alloc+0x3fb/0x5c0
> > [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> > [  345.352536]  [<ffffffff810e7f87>] ? sched_clock_local+0x17/0x80
> > [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> > [  345.352536]  [<ffffffff81267ae1>] __slab_alloc+0x51/0x90
> > [  345.352536]  [<ffffffff811f71bd>] ? mempool_alloc_slab+0x1d/0x30
> > [  345.352536]  [<ffffffff81267d9b>] kmem_cache_alloc+0x27b/0x310
> > [  345.352536]  [<ffffffff811f71bd>] mempool_alloc_slab+0x1d/0x30
> > [  345.352536]  [<ffffffff811f6f11>] mempool_alloc+0x91/0x230
> > [  345.352536]  [<ffffffff8141a02d>] bio_alloc_bioset+0xbd/0x260
> > [  345.352536]  [<ffffffffc02f1a54>] kcryptd_crypt+0x114/0x3b0 [dm_crypt]
> > [  345.352536]  [<ffffffff810cc312>] process_one_work+0x242/0x700
> > [  345.352536]  [<ffffffff810cc28a>] ? process_one_work+0x1ba/0x700
> > [  345.352536]  [<ffffffff810cc81e>] worker_thread+0x4e/0x490
> > [  345.352536]  [<ffffffff810cc7d0>] ? process_one_work+0x700/0x700
> > [  345.352536]  [<ffffffff810d3c01>] kthread+0x101/0x120
> > [  345.352536]  [<ffffffff8110b9f5>] ? trace_hardirqs_on_caller+0xf5/0x1b0
> > [  345.352536]  [<ffffffff818db1af>] ret_from_fork+0x1f/0x40
> > [  345.352536]  [<ffffffff810d3b00>] ? kthread_create_on_node+0x250/0x250
> 
> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 14:56                 ` Michal Hocko
@ 2016-07-13 15:11                   ` Mikulas Patocka
  2016-07-13 23:53                     ` David Rientjes
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-13 15:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel



On Wed, 13 Jul 2016, Michal Hocko wrote:

> On Wed 13-07-16 10:18:35, Mikulas Patocka wrote:
> > 
> > 
> > On Wed, 13 Jul 2016, Michal Hocko wrote:
> > 
> > > [CC David]
> > > 
> > > > > It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. 
> > > > > Prior to this commit, mempool allocations set __GFP_NOMEMALLOC, so 
> > > > > they never exhausted reserved memory. With this commit, mempool 
> > > > > allocations drop __GFP_NOMEMALLOC, so they can dig deeper (if the 
> > > > > process has PF_MEMALLOC, they can bypass all limits).
> > > > 
> > > > I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set 
> > > > __GFP_NOMEMALLOC if there are free elements") is doing correct thing. 
> > > > It says
> > > > 
> > > >     If an oom killed thread calls mempool_alloc(), it is possible that 
> > > > it'll
> > > >     loop forever if there are no elements on the freelist since
> > > >     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
> > > >     oom conditions.
> > > 
> > > I haven't studied the patch very deeply so I might be missing something
> > > but from a quick look the patch does exactly what the above says.
> > > 
> > > mempool_alloc used to inhibit ALLOC_NO_WATERMARKS by default. David has
> > > only changed that to allow ALLOC_NO_WATERMARKS if there are no objects
> > > in the pool and so we have no fallback for the default __GFP_NORETRY
> > > request.
> > 
> > The swapper core sets the flag PF_MEMALLOC and calls generic_make_request 
> > to submit the swapping bio to the block driver. The device mapper driver 
> > uses mempools for all its I/O processing.
> 
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
> 	make_request_fn = blk_queue_bio
> 	blk_queue_bio
> 	  get_request
> 	    __get_request
> 
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
tries to fix?

Do you have a stacktrace where it deadlocked, or was just a theoretical 
consideration?

Mempool users generally (except for some flawed cases like fs_bio_set) do 
not require memory to proceed. So if you just loop in mempool_alloc, the 
processes that exhausted the mempool reserve will eventually return objects 
to the mempool and you should proceed.

If you can't proceed, it is a bug in the code that uses the mempool.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 13:44               ` Milan Broz
@ 2016-07-13 15:21                 ` Mikulas Patocka
  2016-07-14  9:09                   ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-13 15:21 UTC (permalink / raw)
  To: Milan Broz
  Cc: Michal Hocko, Ondrej Kozina, Jerome Marchand, Stanislav Kozina,
	linux-mm, linux-kernel, device-mapper development



On Wed, 13 Jul 2016, Milan Broz wrote:

> On 07/13/2016 02:50 PM, Michal Hocko wrote:
> > On Wed 13-07-16 13:10:06, Michal Hocko wrote:
> >> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> > [...]
> >>> As long as swapping is in progress, the free memory is below the limit 
> >>> (because the swapping activity itself consumes any memory over the limit). 
> >>> And that triggered the OOM killer prematurely.
> >>
> >> I am not sure I understand the last part. Are you saying that we trigger
> >> OOM because the initiated swapout will not be able to finish the IO and thus
> >> release the page in time?
> >>
> >> The oom detection check waits for an ongoing writeout if there is no
> >> reclaim progress and at least half of the reclaimable memory is either
> >> dirty or under writeback. Pages under swapout are marked as under
> >> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
> >> be able to allocate memory from the mempool, hand it over to the crypt
> >> layer and finish the IO. Is it possible this might take a lot of time?
> > 
> > I am not familiar with the crypto API but from what I understood from
> > crypt_convert the encryption is done asynchronously. Then I got lost in
> > the indirection. Who is completing the request and from what kind of
> > context? Is it possible it wouldn't be runnable for a long time?
> 
> If you mean crypt_convert in dm-crypt, then it can do asynchronous completion
> but usually (with AES-NI and sw implementations) it runs the operation completely
> synchronously.
> Asynchronous processing is quite rare, usually only on some specific hardware
> crypto accelerators.
> 
> Once the encryption is finished, the cloned bio is sent to the block
> layer for processing.
> (There is also some magic with sorting writes but Mikulas knows this better.)

dm-crypt receives requests in crypt_map, then it distributes write 
requests to multiple encryption threads. Encryption is usually done 
synchronously; asynchronous completion is used only with some PCI 
cards that accelerate encryption. When encryption finishes, the encrypted 
pages are submitted to a thread dmcrypt_write that sorts the requests 
using an rbtree and submits them.
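
Roughly, for writes (function names from my memory of drivers/md/dm-crypt.c, 
details omitted):

	crypt_map				/* receives the bio */
	  kcryptd_queue_crypt			/* hands off to a kcryptd worker */
	    kcryptd_crypt_write_convert
	      crypt_alloc_buffer		/* mempool-backed ciphertext pages */
	      crypt_convert			/* usually synchronous encryption */
	      kcryptd_crypt_write_io_submit	/* inserts the clone into the rbtree */
	dmcrypt_write				/* dedicated thread: submits bios in
						   sector order via generic_make_request */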

The block layer has a deficiency in that it cannot merge adjacent requests 
submitted by different threads.

If we submitted requests directly from the encryption threads, the lack of 
merging seriously degraded performance.

Mikulas

> Milan
> p.s. I added cc to dm-devel, some dmcrypt people reads only this list.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 15:11                   ` Mikulas Patocka
@ 2016-07-13 23:53                     ` David Rientjes
  2016-07-14 11:01                       ` Tetsuo Handa
                                         ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-13 23:53 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Wed, 13 Jul 2016, Mikulas Patocka wrote:

> What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> tries to fix?
> 

It prevents the whole system from livelocking due to an oom killed process 
stalling forever waiting for mempool_alloc() to return.  No other threads 
may be oom killed while waiting for it to exit.

> Do you have a stacktrace where it deadlocked, or was just a theoretical 
> consideration?
> 

schedule
schedule_timeout
io_schedule_timeout
mempool_alloc
__split_and_process_bio
dm_request
generic_make_request
submit_bio
mpage_readpages
ext4_readpages
__do_page_cache_readahead
ra_submit
filemap_fault
handle_mm_fault
__do_page_fault
do_page_fault
page_fault

> Mempool users generally (except for some flawed cases like fs_bio_set) do 
> not require memory to proceed. So if you just loop in mempool_alloc, the 
> processes that exhausted the mempool reserve will eventually return objects 
> to the mempool and you should proceed.
> 

That's obviously not the case if we have hundreds of machines timing out 
after two hours waiting for that fault to succeed.  The mempool interface 
cannot require that users return elements to the pool synchronously with all 
allocators so that we can happily loop forever; the only requirement on 
the interface is that mempool_alloc() must succeed.  If the context of the 
thread doing mempool_alloc() allows access to memory reserves, this will 
always be allowed by the page allocator.  This is not a mempool problem.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 13:19           ` Tetsuo Handa
  2016-07-13 13:39             ` Michal Hocko
@ 2016-07-14  0:01             ` David Rientjes
  1 sibling, 0 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-14  0:01 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: Mikulas Patocka, Michal Hocko, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Wed, 13 Jul 2016, Tetsuo Handa wrote:

> I wonder whether commit f9054c70d28bc214 ("mm, mempool: only set
> __GFP_NOMEMALLOC if there are free elements") is doing correct thing.
> It says
> 
>     If an oom killed thread calls mempool_alloc(), it is possible that it'll
>     loop forever if there are no elements on the freelist since
>     __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
>     oom conditions.
> 
> but we can allow mempool_alloc(__GFP_NOMEMALLOC) requests to access
> memory reserves via below change, can't we? The purpose of allowing
> ALLOC_NO_WATERMARKS via TIF_MEMDIE is to make sure current allocation
> request does not loop forever inside the page allocator, isn't it?

This would defeat the purpose of __GFP_NOMEMALLOC for oom killed threads, 
so you'd need to demonstrate that isn't a problem for the current users 
and then change the semantics of the gfp flag.

> Why do we need to allow mempool_alloc(__GFP_NOMEMALLOC) requests to use
> ALLOC_NO_WATERMARKS when TIF_MEMDIE is not set?
> 

mempool_alloc(__GFP_NOMEMALLOC) is forbidden.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 15:21                 ` Mikulas Patocka
@ 2016-07-14  9:09                   ` Michal Hocko
  2016-07-14  9:46                     ` Milan Broz
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-14  9:09 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Milan Broz, Ondrej Kozina, Jerome Marchand, Stanislav Kozina,
	linux-mm, linux-kernel, device-mapper development

On Wed 13-07-16 11:21:41, Mikulas Patocka wrote:
> 
> 
> On Wed, 13 Jul 2016, Milan Broz wrote:
> 
> > On 07/13/2016 02:50 PM, Michal Hocko wrote:
> > > On Wed 13-07-16 13:10:06, Michal Hocko wrote:
> > >> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
> > > [...]
> > >>> As long as swapping is in progress, the free memory is below the limit 
> > >>> (because the swapping activity itself consumes any memory over the limit). 
> > >>> And that triggered the OOM killer prematurely.
> > >>
> > >> I am not sure I understand the last part. Are you saying that we trigger
> > >> OOM because the initiated swapout will not be able to finish the IO and thus
> > >> release the page in time?
> > >>
> > >> The oom detection check waits for an ongoing writeout if there is no
> > >> reclaim progress and at least half of the reclaimable memory is either
> > >> dirty or under writeback. Pages under swapout are marked as under
> > >> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
> > >> be able to allocate memory from the mempool, hand it over to the crypt
> > >> layer and finish the IO. Is it possible this might take a lot of time?
> > > 
> > > I am not familiar with the crypto API but from what I understood from
> > > crypt_convert the encryption is done asynchronously. Then I got lost in
> > > the indirection. Who is completing the request and from what kind of
> > > context? Is it possible it wouldn't be runnable for a long time?
> > 
> > If you mean crypt_convert in dm-crypt, then it can do asynchronous completion
> > > but usually (with AES-NI and sw implementations) it runs the operation completely
> > synchronously.
> > Asynchronous processing is quite rare, usually only on some specific hardware
> > crypto accelerators.
> > 
> > Once the encryption is finished, the cloned bio is sent to the block
> > layer for processing.
> > (There is also some magic with sorting writes but Mikulas knows this better.)
> 
> > dm-crypt receives requests in crypt_map, then it distributes write 
> > requests to multiple encryption threads. Encryption is usually done 
> > synchronously; asynchronous completion is used only with some PCI 
> > cards that accelerate encryption. When encryption finishes, the encrypted 
> > pages are submitted to a thread dmcrypt_write that sorts the requests 
> > using an rbtree and submits them.

OK. I was worried that the async context would depend on WQ and a lack
of workers could lead to long stalls. Dedicated kernel threads seem
sufficient.

Thanks for the clarification.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14  9:09                   ` Michal Hocko
@ 2016-07-14  9:46                     ` Milan Broz
  0 siblings, 0 replies; 60+ messages in thread
From: Milan Broz @ 2016-07-14  9:46 UTC (permalink / raw)
  To: Michal Hocko, Mikulas Patocka
  Cc: Milan Broz, Ondrej Kozina, Jerome Marchand, Stanislav Kozina,
	linux-mm, linux-kernel, device-mapper development

On 07/14/2016 11:09 AM, Michal Hocko wrote:
> On Wed 13-07-16 11:21:41, Mikulas Patocka wrote:
>>
>>
>> On Wed, 13 Jul 2016, Milan Broz wrote:
>>
>>> On 07/13/2016 02:50 PM, Michal Hocko wrote:
>>>> On Wed 13-07-16 13:10:06, Michal Hocko wrote:
>>>>> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
>>>> [...]
>>>>>> As long as swapping is in progress, the free memory is below the limit 
>>>>>> (because the swapping activity itself consumes any memory over the limit). 
>>>>>> And that triggered the OOM killer prematurely.
>>>>>
>>>>> I am not sure I understand the last part. Are you saying that we trigger
>>>>> OOM because the initiated swapout will not be able to finish the IO and thus
>>>>> release the page in time?
>>>>>
>>>>> The oom detection check waits for an ongoing writeout if there is no
>>>>> reclaim progress and at least half of the reclaimable memory is either
>>>>> dirty or under writeback. Pages under swapout are marked as under
>>>>> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
>>>>> be able to allocate memory from the mempool, hand it over to the crypt
>>>>> layer and finish the IO. Is it possible this might take a lot of time?
>>>>
>>>> I am not familiar with the crypto API but from what I understood from
>>>> crypt_convert the encryption is done asynchronously. Then I got lost in
>>>> the indirection. Who is completing the request and from what kind of
>>>> context? Is it possible it wouldn't be runnable for a long time?
>>>
>>> If you mean crypt_convert in dm-crypt, then it can do asynchronous completion
>>> but usually (with AES-NI and sw implementations) it runs the operation completely
>>> synchronously.
>>> Asynchronous processing is quite rare, usually only on some specific hardware
>>> crypto accelerators.
>>>
>>> Once the encryption is finished, the cloned bio is sent to the block
>>> layer for processing.
>>> (There is also some magic with sorting writes but Mikulas knows this better.)
>>
>> dm-crypt receives requests in crypt_map, then it distributes write 
>> requests to multiple encryption threads. Encryption is usually done 
>> synchronously; asynchronous completion is used only with some PCI 
>> cards that accelerate encryption. When encryption finishes, the encrypted 
>> pages are submitted to a thread dmcrypt_write that sorts the requests 
>> using an rbtree and submits them.
> 
> OK. I was worried that the async context would depend on WQ and a lack
> of workers could lead to long stalls. Dedicated kernel threads seem
> sufficient.

Just for the record - if there is a suspicion that some crypto operation
causes the problem, dmcrypt can use a null cipher. This degrades
encryption/decryption to a plain memcpy inside the crypto API but leaves all
the workqueues and tooling around it the same.
(I added it to cryptsetup to make it easy to configure; in fact it was
intended to test dmcrypt's non-crypto overhead.)
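
For example, something like this should set up a plain mapping with the null
cipher (syntax from memory, it may differ between cryptsetup versions):

	# no real encryption - for performance/debugging tests only
	cryptsetup open --type plain --cipher null /dev/sdX test_null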

Anyway, thanks for looking into this!
Milan

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [dm-devel] System freezes after OOM
  2016-07-13 15:02             ` Mikulas Patocka
@ 2016-07-14 10:51               ` Ondrej Kozina
  2016-07-14 12:51               ` Michal Hocko
  1 sibling, 0 replies; 60+ messages in thread
From: Ondrej Kozina @ 2016-07-14 10:51 UTC (permalink / raw)
  To: Mikulas Patocka, Michal Hocko
  Cc: Stanislav Kozina, Jerome Marchand, linux-kernel, linux-mm, dm-devel

On 07/13/2016 05:02 PM, Mikulas Patocka wrote:
>
>
> On Wed, 13 Jul 2016, Michal Hocko wrote:
>
>> On Tue 12-07-16 19:44:11, Mikulas Patocka wrote:
>>> The problem of swapping to dm-crypt is this.
>>>
>>> The free memory goes low, kswapd decides that some page should be swapped
>>> out. However, when you swap to an encrypted device, writeback of each page
>>> requires another page to hold the encrypted data. dm-crypt uses mempools
>>> for all its structures and pages, so that it can make forward progress
>>> even if there is no memory free. However, the mempool code first allocates
>>> from general memory allocator and resorts to the mempool only if the
>>> memory is below limit.
>>
>> OK, thanks for the clarification. I guess the core part happens in
>> crypt_alloc_buffer, right?
>>
>>> So every attempt to swap out some page allocates another page.
>>>
>>> As long as swapping is in progress, the free memory is below the limit
>>> (because the swapping activity itself consumes any memory over the limit).
>>> And that triggered the OOM killer prematurely.
>>
>> I am not sure I understand the last part. Are you saying that we trigger
>> OOM because the initiated swapout will not be able to finish the IO and thus
>> release the page in time?
>
> On kernel 4.6 - premature OOM is triggered just because the free memory
> stays below the limit for some time.
>
> On kernel 4.7-rc6 (that contains your OOM patch
> 0a0337e0d1d134465778a16f5cbea95086e8e9e0), OOM is not triggered, but the
> machine slows down to a crawl, because the allocator applies throttling to
> the process that is doing the encryption.
>
>> The oom detection check waits for an ongoing writeout if there is no
>> reclaim progress and at least half of the reclaimable memory is either
>> dirty or under writeback. Pages under swapout are marked as under
>> writeback AFAIR. The writeout path (dm-crypt worker in this case) should
>> be able to allocate memory from the mempool, hand it over to the crypt
>> layer and finish the IO. Is it possible this might take a lot of time?
>
> See the backtrace below - the dm-crypt worker is making progress, but the
> memory allocator deliberately stalls it in throttle_vm_writeout.
>
>>> On Tue, 12 Jul 2016, Michal Hocko wrote:
>>>>
>>>> Look at the amount of free memory. It is completely depleted. So it
>>>> smells like a process which has access to memory reserves has consumed
>>>> all of it. I suspect a __GFP_MEMALLOC resp. PF_MEMALLOC from softirq
>>>> context user which went off the leash.
>>>
>>> It is caused by the commit f9054c70d28bc214b2857cf8db8269f4f45a5e23. Prior
>>> to this commit, mempool allocations set __GFP_NOMEMALLOC, so they never
>>> exhausted reserved memory. With this commit, mempool allocations drop
>>> __GFP_NOMEMALLOC, so they can dig deeper (if the process has PF_MEMALLOC,
>>> they can bypass all limits).
>>
>> Hmm, but the patch allows access to the memory reserves only when the
>> pool is empty. And even then the caller would have to request access to
>> reserves explicitly either by __GFP_NOMEMALLOC or PF_MEMALLOC. That
>
> PF_MEMALLOC is set when you enter the block driver when swapping. So, some
> of the mempool allocations are done with PF_MEMALLOC.
>
>> doesn't seem to be the case for the dm-crypt, though. Or do you suspect
>> that some other mempool user might be doing so?
>
> Bisection showed that that patch triggered the dm-crypt swapping problems.
>
> Without the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
> 1. allocates memory up to __GFP_NOMEMALLOC limit
> 2. allocates memory from the mempool reserve
> 3. waits, until some objects are returned to the mempool
>
> With the patch f9054c70d28bc214b2857cf8db8269f4f45a5e23, mempool_alloc
> 1. allocates memory up to __GFP_NOMEMALLOC limit
> 2. allocates memory from the mempool reserve
> 3. allocates all remaining memory until total exhaustion
> 4. waits, until some objects are returned to the mempool
>
>>>> No this doesn't sound like a proper solution. The current decision
>>>> logic, as explained above relies on the feedback from the reclaim. A
>>>> free swap space doesn't really mean we can make a forward progress.
>>>
>>> I'm interested - why would you need to trigger the OOM killer if there is
>>> free swap space?
>>
>> Let me clarify. If there is swappable memory then we shouldn't trigger
>> the OOM killer normally of course. And that should be the case with the
>> current implementation. We just rely on the swapout making some progress
>
> And what exactly does "making some progress" mean? How do you measure it?
>
>> and back off only if that is not the case after several attempts with a
>> throttling based on the writeback counters. Checking the available swap
>> space doesn't guarantee forward progress, though. If the swapout is
>> stuck for some reason then it should be safer to trigger the OOM rather
>> than wait or thrash forever (or for an excessive amount of time).
>>
>> Now, I can see that the retry logic might need some tuning for complex
>> setups like dm-crypt swap partitions because the progress might be much
>> slower there. But I would like to understand what is the worst estimate
>> for the swapout path with all the roadblocks on the way for this setup
>> before we can think of a proper retry logic tuning.
>
> For example, you could increment a percpu counter each time writeback of a
> page is finished. If the counter stays the same for some pre-determined
> period of time, writeback is stuck and you could trigger OOM prematurely.
>
> But the memory management code doesn't do that. So what does it really do, and
> what is the intention behind it?
>
> Another question is - do we really want to try to recover in case of stuck
> writeback? If the swap device dies so that it stops processing I/Os, the
> system is dead anyway - there is no point in trying to recover it by
> killing processes.
>
> The question is whether these safeguards against stuck writeback are really
> doing more harm than good. Do you have some real use case where you get
> stuck writeback and where you need to recover by OOM killing?
>
> This is not the first time I've seen premature OOM. A long time ago, I saw a
> case when the admin set /proc/sys/vm/swappiness to a low value (because he
> was running some scientific calculations on the machine and he preferred
> memory being allocated to those calculations rather than to the cache) -
> and the result was premature OOM killing while the machine had plenty of
> free swap space.
>
>>> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck
>>> and the following stacktrace was obtained when swapping to dm-crypt.
>>>
>>> We can see that dm-crypt does a mempool allocation. But the mempool
>>> allocation somehow falls into throttle_vm_writeout. There, it waits for
>>> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up
>>> processing requests at an unusually slow rate of 10 requests per second
>>> and it results in the machine being stuck (it would probably recover if
>>> we waited for an extreme amount of time).
>>
>> Hmm, that throttling has been there basically forever. I do not see what
>> would have changed that recently, but I haven't looked too closely to be
>> honest.
>>
>> I agree that throttling a flusher (which this worker definitely is)
>> doesn't look like a correct thing to do. We have PF_LESS_THROTTLE for
>> this kind of things. So maybe the right thing to do is to use this flag
>> for the dm_crypt worker:
>>
>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>> index 4f3cb3554944..0b806810efab 100644
>> --- a/drivers/md/dm-crypt.c
>> +++ b/drivers/md/dm-crypt.c
>> @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>>  static void kcryptd_crypt(struct work_struct *work)
>>  {
>>  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
>> +	unsigned int pflags = current->flags;
>>
>> +	current->flags |= PF_LESS_THROTTLE;
>>  	if (bio_data_dir(io->base_bio) == READ)
>>  		kcryptd_crypt_read_convert(io);
>>  	else
>>  		kcryptd_crypt_write_convert(io);
>> +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
>>  }
>>
>>  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
>
> ^^^ That fixes just one specific case - but there may be other threads
> doing mempool allocations in the device mapper subsystem - and you would
> need to mark all of them.
>
> I would try the patch below - generally, allocations from the mempool
> subsystem should not wait in the memory allocator at all. I don't know if
> there are other cases when these allocations can sleep. I'm interested if
> it fixes Ondrej's case - or if it uncovers some other sleeping.
>
> An alternate possibility would be to drop the flag __GFP_DIRECT_RECLAIM in
> mempool_alloc - so that mempool allocations never sleep in the allocator.

Good news (I hope). With Mikulas's patch below I'm able to run the test 
without the utility getting oom-killed. Nor does the system livelock for 
dozens of minutes as it did with pure 4.7.0-rc6. Here's the syslog:

https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/0/4.7.0-rc7+.log

Just for the record, the test utility allocates memory - more than the 
physical RAM installed, but much less than the total RAM including swap.

As you can see, the swap fills slowly as expected. I'd not say it's an ideal 
fix, since the system is not very responsive during the test, but I'd still 
call it progress.

Regards O.

>
> ---
>  mm/page-writeback.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
>
> Index: linux-4.7-rc7/mm/page-writeback.c
> ===================================================================
> --- linux-4.7-rc7.orig/mm/page-writeback.c	2016-07-12 20:57:53.000000000 +0200
> +++ linux-4.7-rc7/mm/page-writeback.c	2016-07-12 20:59:41.000000000 +0200
> @@ -1945,6 +1945,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>
> +	/*
> +	 * If we came here from mempool_alloc, we don't want to wait 0.1s.
> +	 * We want to fail as soon as possible, so that the allocation is tried
> +	 * from mempool reserve.
> +	 */
> +	if (unlikely(gfp_mask & __GFP_NORETRY))
> +		return;
> +
>          for ( ; ; ) {
>  		global_dirty_limits(&background_thresh, &dirty_thresh);
>  		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
>

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 23:53                     ` David Rientjes
@ 2016-07-14 11:01                       ` Tetsuo Handa
  2016-07-14 12:29                         ` Mikulas Patocka
  2016-07-14 20:26                         ` David Rientjes
  2016-07-14 12:27                       ` Mikulas Patocka
  2016-07-14 15:29                       ` Michal Hocko
  2 siblings, 2 replies; 60+ messages in thread
From: Tetsuo Handa @ 2016-07-14 11:01 UTC (permalink / raw)
  To: rientjes, mpatocka
  Cc: mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel

Michal Hocko wrote:
> OK, this is the part I have missed. I didn't realize that the swapout
> path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> mempools. A quick code traversal shows that at least
> 	make_request_fn = blk_queue_bio
> 	blk_queue_bio
> 	  get_request
> 	    __get_request
> 
> might do that. And in that case I agree that the above mentioned patch
> has unintentional side effects and should be re-evaluated. David, what
> do you think? An obvious fixup would be considering TIF_MEMDIE in
> mempool_alloc explicitly.

TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one thread,
there is no guarantee that TIF_MEMDIE is set on the thread which is looping
inside mempool_alloc(). And since __GFP_NORETRY is used (regardless of
f9054c70d28bc214), out_of_memory() is not called via __alloc_pages_may_oom().
This means that the thread which is looping inside mempool_alloc() can't
get TIF_MEMDIE unless TIF_MEMDIE is set by the OOM killer.

Maybe set __GFP_NOMEMALLOC by default at mempool_alloc() and remove it
at mempool_alloc() when fatal_signal_pending() is true? But that behavior
can OOM-kill somebody else when current was not OOM-killed. Sigh...
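
Something like this untested sketch (which has exactly the problem described
above):

	/* in mempool_alloc(), before calling pool->alloc() */
	gfp_temp |= __GFP_NOMEMALLOC;
	if (unlikely(fatal_signal_pending(current)))
		gfp_temp &= ~__GFP_NOMEMALLOC;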

David Rientjes wrote:
> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> 
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > tries to fix?
> > 
> 
> It prevents the whole system from livelocking due to an oom killed process 
> stalling forever waiting for mempool_alloc() to return.  No other threads 
> may be oom killed while waiting for it to exit.

Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y case.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 23:53                     ` David Rientjes
  2016-07-14 11:01                       ` Tetsuo Handa
@ 2016-07-14 12:27                       ` Mikulas Patocka
  2016-07-14 20:22                         ` David Rientjes
  2016-07-14 15:29                       ` Michal Hocko
  2 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-14 12:27 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel



On Wed, 13 Jul 2016, David Rientjes wrote:

> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> 
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > tries to fix?
> > 
> 
> It prevents the whole system from livelocking due to an oom killed process 
> stalling forever waiting for mempool_alloc() to return.  No other threads 
> may be oom killed while waiting for it to exit.
> 
> > Do you have a stacktrace where it deadlocked, or was just a theoretical 
> > consideration?
> > 
> 
> schedule
> schedule_timeout
> io_schedule_timeout
> mempool_alloc
> __split_and_process_bio
> dm_request
> generic_make_request
> submit_bio
> mpage_readpages
> ext4_readpages
> __do_page_cache_readahead
> ra_submit
> filemap_fault
> handle_mm_fault
> __do_page_fault
> do_page_fault
> page_fault

Device mapper should be able to proceed if there is no available memory. 
If it doesn't proceed, there is a bug in it.

I'd like to ask - what device mapper targets did you use in this case? Are 
there some other deadlocked processes? (show sysrq-t, sysrq-w when this 
happened)

Did the machine lock up completely with that stacktrace, or was it just 
slowed down?

> > Mempool users generally (except for some flawed cases like fs_bio_set) do 
> > not require memory to proceed. So if you just loop in mempool_alloc, the 
> > processes that exhausted the mempool reserve will eventually return objects 
> > to the mempool and you should proceed.
> > 
> 
> That's obviously not the case if we have hundreds of machines timing out 
> after two hours waiting for that fault to succeed.  The mempool interface 
> cannot require that users return elements to the pool synchronously with all 
> allocators so that we can happily loop forever; the only requirement on 

Mempool users must return objects to the mempool.

> the interface is that mempool_alloc() must succeed.  If the context of the 
> thread doing mempool_alloc() allows access to memory reserves, this will 
> always be allowed by the page allocator.  This is not a mempool problem.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 11:01                       ` Tetsuo Handa
@ 2016-07-14 12:29                         ` Mikulas Patocka
  2016-07-14 20:26                         ` David Rientjes
  1 sibling, 0 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-14 12:29 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: rientjes, mhocko, okozina, jmarchan, skozina, linux-mm,
	linux-kernel, dm-devel



On Thu, 14 Jul 2016, Tetsuo Handa wrote:

> Michal Hocko wrote:
> > OK, this is the part I have missed. I didn't realize that the swapout
> > path, which is indeed PF_MEMALLOC, can get down to blk code which uses
> > mempools. A quick code travers shows that at least
> > 	make_request_fn = blk_queue_bio
> > 	blk_queue_bio
> > 	  get_request
> > 	    __get_request
> > 
> > might do that. And in that case I agree that the above mentioned patch
> > has unintentional side effects and should be re-evaluated. David, what
> > do you think? An obvious fixup would be considering TIF_MEMDIE in
> > mempool_alloc explicitly.
> 
> TIF_MEMDIE is racy. Since the OOM killer sets TIF_MEMDIE on only one thread,
> there is no guarantee that TIF_MEMDIE is set on the thread which is looping
> inside mempool_alloc().

If the device mapper subsystem is not returning objects to the mempool, it 
should be investigated as a bug in the device mapper.

There is no need to add workarounds to mempool_alloc to work around that 
bug.

Mikulas

> And since __GFP_NORETRY is used (regardless of
> f9054c70d28bc214), out_of_memory() is not called via __alloc_pages_may_oom().
> This means that the thread which is looping inside mempool_alloc() can't
> get TIF_MEMDIE unless TIF_MEMDIE is set by the OOM killer.
> 
> Maybe set __GFP_NOMEMALLOC by default at mempool_alloc() and remove it
> at mempool_alloc() when fatal_signal_pending() is true? But that behavior
> can OOM-kill somebody else when current was not OOM-killed. Sigh...
> 
> David Rientjes wrote:
> > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> > 
> > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > > tries to fix?
> > > 
> > 
> > It prevents the whole system from livelocking due to an oom killed process 
> > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > may be oom killed while waiting for it to exit.
> 
> Is that concern still valid? We have the OOM reaper for the CONFIG_MMU=y case.
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 15:02             ` Mikulas Patocka
  2016-07-14 10:51               ` [dm-devel] " Ondrej Kozina
@ 2016-07-14 12:51               ` Michal Hocko
  2016-07-14 14:00                 ` Mikulas Patocka
  2016-07-14 14:08                 ` Ondrej Kozina
  1 sibling, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 12:51 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel

On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
> On Wed, 13 Jul 2016, Michal Hocko wrote:
[...]

We are discussing several topics together, so let's focus on this
particular thing for now.

> > > The kernel 4.7-rc almost deadlocks in another way. The machine got stuck 
> > > and the following stacktrace was obtained when swapping to dm-crypt.
> > > 
> > > We can see that dm-crypt does a mempool allocation. But the mempool 
> > > allocation somehow falls into throttle_vm_writeout. There, it waits for 
> > > 0.1 seconds. So, as a result, the dm-crypt worker thread ends up 
> > > processing requests at an unusually slow rate of 10 requests per second 
> > > and it results in the machine being stuck (it would probably recover if 
> > > we waited for an extreme amount of time).
> > 
> > Hmm, that throttling has been there basically forever. I do not see what
> > would have changed that recently, but I haven't looked too closely to be
> > honest.
> > 
> > I agree that throttling a flusher (which this worker definitely is)
> > doesn't look like a correct thing to do. We have PF_LESS_THROTTLE for
> > this kind of things. So maybe the right thing to do is to use this flag
> > for the dm_crypt worker:
> > 
> > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > index 4f3cb3554944..0b806810efab 100644
> > --- a/drivers/md/dm-crypt.c
> > +++ b/drivers/md/dm-crypt.c
> > @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
> >  static void kcryptd_crypt(struct work_struct *work)
> >  {
> >  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> > +	unsigned int pflags = current->flags;
> >  
> > +	current->flags |= PF_LESS_THROTTLE;
> >  	if (bio_data_dir(io->base_bio) == READ)
> >  		kcryptd_crypt_read_convert(io);
> >  	else
> >  		kcryptd_crypt_write_convert(io);
> > +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> >  }
> >  
> >  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> 
> ^^^ That fixes just one specific case - but there may be other threads 
> doing mempool allocations in the device mapper subsystem - and you would 
> need to mark all of them.

Now that I am thinking about it some more - are there any mempool users
which would actually want to be throttled? I would expect mempool users
to be necessary to push IO through, and throttling them sounds like a bad
decision in the first place, but there might be other mempool users which
could cause issues. Anyway, how about setting PF_LESS_THROTTLE
unconditionally inside mempool_alloc? Something like the following:

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..e21fb632983f 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
  */
 void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 {
-	void *element;
+	unsigned int pflags = current->flags;
+	void *element = NULL;
 	unsigned long flags;
 	wait_queue_t wait;
 	gfp_t gfp_temp;
@@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
+	/*
+	 * Make sure that the allocation doesn't get throttled during the
+	 * reclaim
+	 */
+	if (gfpflags_allow_blocking(gfp_mask))
+		current->flags |= PF_LESS_THROTTLE;
 repeat_alloc:
 	if (likely(pool->curr_nr)) {
 		/*
@@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
-		return element;
+		goto out;
 
 	spin_lock_irqsave(&pool->lock, flags);
 	if (likely(pool->curr_nr)) {
@@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 		 * for debugging.
 		 */
 		kmemleak_update_trace(element);
-		return element;
+		goto out;
 	}
 
 	/*
@@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
-		return NULL;
+		goto out;
 	}
 
 	/* Let's wait for someone else to return an element to @pool */
@@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	finish_wait(&pool->wait, &wait);
 	goto repeat_alloc;
+out:
+	if (gfpflags_allow_blocking(gfp_mask))
+		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
+	return element;
 }
 EXPORT_SYMBOL(mempool_alloc);
 

> I would try the patch below - generally, allocations from the mempool 
> subsystem should not wait in the memory allocator at all. I don't know if 
> there are other cases when these allocations can sleep. I'm interested if 
> it fixes Ondrej's case - or if it uncovers some other sleeping.

__GFP_NORETRY is used outside of the mempool allocator as well, and
throttling those allocations sounds like a proper thing to do. The
primary point of throttle_vm_writeout is to slow down reclaim so that it
doesn't generate an excessive amount of dirty pages. It used to be a
bigger deal in the past, when we initiated regular IO from direct
reclaim, but we can still generate swap IO these days. So I would rather
come up with a more robust solution.
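
(For context, the function in question is roughly of this shape -- a
paraphrase of the 4.7-era mm/page-writeback.c, with details elided, so
take the exact statistics and checks with a grain of salt:)

void throttle_vm_writeout(gfp_t gfp_mask)
{
	unsigned long background_thresh;
	unsigned long dirty_thresh;

	for ( ; ; ) {
		global_dirty_limits(&background_thresh, &dirty_thresh);
		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);

		/* page allocators get a ~10% higher dirty threshold */
		dirty_thresh += dirty_thresh / 10;

		if (global_page_state(NR_UNSTABLE_NFS) +
		    global_page_state(NR_WRITEBACK) <= dirty_thresh)
			break;

		/* the hardcoded 0.1s sleep discussed in this thread */
		congestion_wait(BLK_RW_ASYNC, HZ/10);
		...
	}
}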

> An alternate possibility would be to drop the flag __GFP_DIRECT_RECLAIM in 
> mempool_alloc - so that mempool allocations never sleep in the allocator.

Hmm, that would mean that the retry loop would rely completely on
kswapd making forward progress. But note that kswapd might trigger IO
and get stuck waiting for the FS. A GFP_NOIO request might be hopelessly
inefficient on its own, but at least we try to reclaim something, which
sounds better to me than looping and relying only on kswapd. I do not
see other potential side effects of such a change, but my gut feeling
tells me this is not quite right. It works around a problem that is at a
different layer.
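
For reference, against the 4.7-era mempool_alloc() that alternative
would roughly amount to deleting the second-round upgrade back to the
caller's blocking mask -- an untested sketch, not a patch from this
thread:

	/* gfp_temp already has __GFP_DIRECT_RECLAIM|__GFP_IO stripped;
	 * the sketch is simply to delete this retry-with-full-mask block: */
	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
		spin_unlock_irqrestore(&pool->lock, flags);
		gfp_temp = gfp_mask;
		goto repeat_alloc;
	}

With that block gone, pool->alloc() can never sleep inside the page
allocator, and a blocking caller falls through to waiting on pool->wait
for a returned element instead.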
 
> ---
>  mm/page-writeback.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> Index: linux-4.7-rc7/mm/page-writeback.c
> ===================================================================
> --- linux-4.7-rc7.orig/mm/page-writeback.c	2016-07-12 20:57:53.000000000 +0200
> +++ linux-4.7-rc7/mm/page-writeback.c	2016-07-12 20:59:41.000000000 +0200
> @@ -1945,6 +1945,14 @@ void throttle_vm_writeout(gfp_t gfp_mask
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  
> +	/*
> +	 * If we came here from mempool_alloc, we don't want to wait 0.1s.
> +	 * We want to fail as soon as possible, so that the allocation is tried
> +	 * from mempool reserve.
> +	 */
> +	if (unlikely(gfp_mask & __GFP_NORETRY))
> +		return;
> +
>          for ( ; ; ) {
>  		global_dirty_limits(&background_thresh, &dirty_thresh);
>  		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> 
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 12:51               ` Michal Hocko
@ 2016-07-14 14:00                 ` Mikulas Patocka
  2016-07-14 14:59                   ` Michal Hocko
  2016-07-14 14:08                 ` Ondrej Kozina
  1 sibling, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-14 14:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel



On Thu, 14 Jul 2016, Michal Hocko wrote:

> On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:

> > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > > index 4f3cb3554944..0b806810efab 100644
> > > --- a/drivers/md/dm-crypt.c
> > > +++ b/drivers/md/dm-crypt.c
> > > @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
> > >  static void kcryptd_crypt(struct work_struct *work)
> > >  {
> > >  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> > > +	unsigned int pflags = current->flags;
> > >  
> > > +	current->flags |= PF_LESS_THROTTLE;
> > >  	if (bio_data_dir(io->base_bio) == READ)
> > >  		kcryptd_crypt_read_convert(io);
> > >  	else
> > >  		kcryptd_crypt_write_convert(io);
> > > +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > >  }
> > >  
> > >  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> > 
> > ^^^ That fixes just one specific case - but there may be other threads 
> > doing mempool allocations in the device mapper subsystem - and you would 
> > need to mark all of them.
> 
> Now that I am thinking about it some more. Are there any mempool users
> which would actually want to be throttled? I would expect mempool users
> are necessary to push IO through, and throttling them sounds like a bad
> decision in the first place but there might be other mempool users which
> could cause issues. Anyway how about setting PF_LESS_THROTTLE
> unconditionally inside mempool_alloc? Something like the following:
>
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..e21fb632983f 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
>   */
>  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  {
> -	void *element;
> +	unsigned int pflags = current->flags;
> +	void *element = NULL;
>  	unsigned long flags;
>  	wait_queue_t wait;
>  	gfp_t gfp_temp;
> @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
> +	/*
> +	 * Make sure that the allocation doesn't get throttled during the
> +	 * reclaim
> +	 */
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		current->flags |= PF_LESS_THROTTLE;
>  repeat_alloc:
>  	if (likely(pool->curr_nr)) {
>  		/*
> @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> -		return element;
> +		goto out;
>  
>  	spin_lock_irqsave(&pool->lock, flags);
>  	if (likely(pool->curr_nr)) {
> @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  		 * for debugging.
>  		 */
>  		kmemleak_update_trace(element);
> -		return element;
> +		goto out;
>  	}
>  
>  	/*
> @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
> -		return NULL;
> +		goto out;
>  	}
>  
>  	/* Let's wait for someone else to return an element to @pool */
> @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	finish_wait(&pool->wait, &wait);
>  	goto repeat_alloc;
> +out:
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> +	return element;
>  }
>  EXPORT_SYMBOL(mempool_alloc);
>  

But it needs other changes to honor the PF_LESS_THROTTLE flag:

static int current_may_throttle(void)
{
        return !(current->flags & PF_LESS_THROTTLE) ||
                current->backing_dev_info == NULL ||
                bdi_write_congested(current->backing_dev_info);
}
--- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
true if one of the other conditions is met.

shrink_zone_memcg calls throttle_vm_writeout without checking 
PF_LESS_THROTTLE at all.
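
One way to make the flag authoritative there would be, for example (a
sketch -- essentially the change Ondrej tries later in this thread):

static int current_may_throttle(void)
{
	if (current->flags & PF_LESS_THROTTLE)
		return 0;	/* flusher-like task: never throttle it */

	return current->backing_dev_info == NULL ||
	       bdi_write_congested(current->backing_dev_info);
}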

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 12:51               ` Michal Hocko
  2016-07-14 14:00                 ` Mikulas Patocka
@ 2016-07-14 14:08                 ` Ondrej Kozina
  2016-07-14 15:31                   ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Ondrej Kozina @ 2016-07-14 14:08 UTC (permalink / raw)
  To: Michal Hocko, Mikulas Patocka
  Cc: Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel, dm-devel

On 07/14/2016 02:51 PM, Michal Hocko wrote:
> On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
>> On Wed, 13 Jul 2016, Michal Hocko wrote:
> [...]
>
> We are discussing several topics together so let's focus on this
> particlar thing for now
>
>>>> The kernel 4.7-rc almost deadlocks in another way. The machine got stuck
>>>> and the following stacktrace was obtained when swapping to dm-crypt.
>>>>
>>>> We can see that dm-crypt does a mempool allocation. But the mempool
>>>> allocation somehow falls into throttle_vm_writeout. There, it waits for
>>>> 0.1 seconds. So, as a result, the dm-crypt worker thread ends up
>>>> processing requests at an unusually slow rate of 10 requests per second
>>>> and it results in the machine being stuck (it would probably recover if
>>>> we waited for an extreme amount of time).
>>>
>>> Hmm, that throttling is there since ever basically. I do not see what
>>> would have changed that recently, but I haven't looked too close to be
>>> honest.
>>>
>>> I agree that throttling a flusher (which this worker definitely is)
>>> doesn't look like a correct thing to do. We have PF_LESS_THROTTLE for
>>> this kind of thing. So maybe the right thing to do is to use this flag
>>> for the dm_crypt worker:
>>>
>>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>>> index 4f3cb3554944..0b806810efab 100644
>>> --- a/drivers/md/dm-crypt.c
>>> +++ b/drivers/md/dm-crypt.c
>>> @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>>>  static void kcryptd_crypt(struct work_struct *work)
>>>  {
>>>  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
>>> +	unsigned int pflags = current->flags;
>>>
>>> +	current->flags |= PF_LESS_THROTTLE;
>>>  	if (bio_data_dir(io->base_bio) == READ)
>>>  		kcryptd_crypt_read_convert(io);
>>>  	else
>>>  		kcryptd_crypt_write_convert(io);
>>> +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
>>>  }
>>>
>>>  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
>>
>> ^^^ That fixes just one specific case - but there may be other threads
>> doing mempool allocations in the device mapper subsystem - and you would
>> need to mark all of them.
>
> Now that I am thinking about it some more. Are there any mempool users
> which would actually want to be throttled? I would expect mempool users
> are necessary to push IO through, and throttling them sounds like a bad
> decision in the first place but there might be other mempool users which
> could cause issues. Anyway how about setting PF_LESS_THROTTLE
> unconditionally inside mempool_alloc? Something like the following:
>
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..e21fb632983f 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
>   */
>  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  {
> -	void *element;
> +	unsigned int pflags = current->flags;
> +	void *element = NULL;
>  	unsigned long flags;
>  	wait_queue_t wait;
>  	gfp_t gfp_temp;
> @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>
> +	/*
> +	 * Make sure that the allocation doesn't get throttled during the
> +	 * reclaim
> +	 */
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		current->flags |= PF_LESS_THROTTLE;
>  repeat_alloc:
>  	if (likely(pool->curr_nr)) {
>  		/*
> @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> -		return element;
> +		goto out;
>
>  	spin_lock_irqsave(&pool->lock, flags);
>  	if (likely(pool->curr_nr)) {
> @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  		 * for debugging.
>  		 */
>  		kmemleak_update_trace(element);
> -		return element;
> +		goto out;
>  	}
>
>  	/*
> @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
> -		return NULL;
> +		goto out;
>  	}
>
>  	/* Let's wait for someone else to return an element to @pool */
> @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>
>  	finish_wait(&pool->wait, &wait);
>  	goto repeat_alloc;
> +out:
> +	if (gfpflags_allow_blocking(gfp_mask))
> +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> +	return element;
>  }
>  EXPORT_SYMBOL(mempool_alloc);
>
>

As Mikulas pointed out, this doesn't work. The system froze as well with 
the patch above. Will try to tweak the patch with Mikulas's suggestion...

O.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 14:00                 ` Mikulas Patocka
@ 2016-07-14 14:59                   ` Michal Hocko
  2016-07-14 15:25                     ` Ondrej Kozina
  2016-07-14 17:35                     ` Mikulas Patocka
  0 siblings, 2 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 14:59 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> 
> 
> On Thu, 14 Jul 2016, Michal Hocko wrote:
> 
> > On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
> 
> > > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > > > index 4f3cb3554944..0b806810efab 100644
> > > > --- a/drivers/md/dm-crypt.c
> > > > +++ b/drivers/md/dm-crypt.c
> > > > @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
> > > >  static void kcryptd_crypt(struct work_struct *work)
> > > >  {
> > > >  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> > > > +	unsigned int pflags = current->flags;
> > > >  
> > > > +	current->flags |= PF_LESS_THROTTLE;
> > > >  	if (bio_data_dir(io->base_bio) == READ)
> > > >  		kcryptd_crypt_read_convert(io);
> > > >  	else
> > > >  		kcryptd_crypt_write_convert(io);
> > > > +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > > >  }
> > > >  
> > > >  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> > > 
> > > ^^^ That fixes just one specific case - but there may be other threads 
> > > doing mempool allocations in the device mapper subsystem - and you would 
> > > need to mark all of them.
> > 
> > Now that I am thinking about it some more. Are there any mempool users
> > which would actually want to be throttled? I would expect mempool users
> > are necessary to push IO through, and throttling them sounds like a bad
> > decision in the first place but there might be other mempool users which
> > could cause issues. Anyway how about setting PF_LESS_THROTTLE
> > unconditionally inside mempool_alloc? Something like the following:
> >
> > diff --git a/mm/mempool.c b/mm/mempool.c
> > index 8f65464da5de..e21fb632983f 100644
> > --- a/mm/mempool.c
> > +++ b/mm/mempool.c
> > @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
> >   */
> >  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  {
> > -	void *element;
> > +	unsigned int pflags = current->flags;
> > +	void *element = NULL;
> >  	unsigned long flags;
> >  	wait_queue_t wait;
> >  	gfp_t gfp_temp;
> > @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> >  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> >  
> > +	/*
> > +	 * Make sure that the allocation doesn't get throttled during the
> > +	 * reclaim
> > +	 */
> > +	if (gfpflags_allow_blocking(gfp_mask))
> > +		current->flags |= PF_LESS_THROTTLE;
> >  repeat_alloc:
> >  	if (likely(pool->curr_nr)) {
> >  		/*
> > @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> >  	element = pool->alloc(gfp_temp, pool->pool_data);
> >  	if (likely(element != NULL))
> > -		return element;
> > +		goto out;
> >  
> >  	spin_lock_irqsave(&pool->lock, flags);
> >  	if (likely(pool->curr_nr)) {
> > @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  		 * for debugging.
> >  		 */
> >  		kmemleak_update_trace(element);
> > -		return element;
> > +		goto out;
> >  	}
> >  
> >  	/*
> > @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
> >  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
> >  		spin_unlock_irqrestore(&pool->lock, flags);
> > -		return NULL;
> > +		goto out;
> >  	}
> >  
> >  	/* Let's wait for someone else to return an element to @pool */
> > @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> >  	finish_wait(&pool->wait, &wait);
> >  	goto repeat_alloc;
> > +out:
> > +	if (gfpflags_allow_blocking(gfp_mask))
> > +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > +	return element;
> >  }
> >  EXPORT_SYMBOL(mempool_alloc);
> >  
> 
> But it needs other changes to honor the PF_LESS_THROTTLE flag:
> 
> static int current_may_throttle(void)
> {
>         return !(current->flags & PF_LESS_THROTTLE) ||
>                 current->backing_dev_info == NULL ||
>                 bdi_write_congested(current->backing_dev_info);
> }
> --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> true if one of the other conditions is met.

That is true but doesn't that mean that the device is congested and
waiting a bit is the right thing to do?

> shrink_zone_memcg calls throttle_vm_writeout without checking 
> PF_LESS_THROTTLE at all.

Yes it doesn't call it because it relies on
global_dirty_limits()->domain_dirty_limits() to DTRT. It will give the
caller with PF_LESS_THROTTLE some boost wrt. all other writers.
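
(For reference, the boost in question is roughly this -- paraphrased
from that era's domain_dirty_limits(); the exact arithmetic may differ:)

	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
		/* flusher-like and RT tasks get ~25% higher limits */
		bg_thresh += bg_thresh / 4;
		thresh += thresh / 4;
	}
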
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 14:59                   ` Michal Hocko
@ 2016-07-14 15:25                     ` Ondrej Kozina
  2016-07-14 17:35                     ` Mikulas Patocka
  1 sibling, 0 replies; 60+ messages in thread
From: Ondrej Kozina @ 2016-07-14 15:25 UTC (permalink / raw)
  To: Michal Hocko, Mikulas Patocka
  Cc: Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel, dm-devel

On 07/14/2016 04:59 PM, Michal Hocko wrote:
> On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
>>
>>
>> On Thu, 14 Jul 2016, Michal Hocko wrote:
>>
>>> On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
>>
>>>>> diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
>>>>> index 4f3cb3554944..0b806810efab 100644
>>>>> --- a/drivers/md/dm-crypt.c
>>>>> +++ b/drivers/md/dm-crypt.c
>>>>> @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
>>>>>  static void kcryptd_crypt(struct work_struct *work)
>>>>>  {
>>>>>  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
>>>>> +	unsigned int pflags = current->flags;
>>>>>
>>>>> +	current->flags |= PF_LESS_THROTTLE;
>>>>>  	if (bio_data_dir(io->base_bio) == READ)
>>>>>  		kcryptd_crypt_read_convert(io);
>>>>>  	else
>>>>>  		kcryptd_crypt_write_convert(io);
>>>>> +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
>>>>>  }
>>>>>
>>>>>  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
>>>>
>>>> ^^^ That fixes just one specific case - but there may be other threads
>>>> doing mempool allocations in the device mapper subsystem - and you would
>>>> need to mark all of them.
>>>
>>> Now that I am thinking about it some more. Are there any mempool users
>>> which would actually want to be throttled? I would expect mempool users
>>> are necessary to push IO through, and throttling them sounds like a bad
>>> decision in the first place but there might be other mempool users which
>>> could cause issues. Anyway how about setting PF_LESS_THROTTLE
>>> unconditionally inside mempool_alloc? Something like the following:
>>>
>>> diff --git a/mm/mempool.c b/mm/mempool.c
>>> index 8f65464da5de..e21fb632983f 100644
>>> --- a/mm/mempool.c
>>> +++ b/mm/mempool.c
>>> @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
>>>   */
>>>  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>  {
>>> -	void *element;
>>> +	unsigned int pflags = current->flags;
>>> +	void *element = NULL;
>>>  	unsigned long flags;
>>>  	wait_queue_t wait;
>>>  	gfp_t gfp_temp;
>>> @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>
>>>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>>>
>>> +	/*
>>> +	 * Make sure that the allocation doesn't get throttled during the
>>> +	 * reclaim
>>> +	 */
>>> +	if (gfpflags_allow_blocking(gfp_mask))
>>> +		current->flags |= PF_LESS_THROTTLE;
>>>  repeat_alloc:
>>>  	if (likely(pool->curr_nr)) {
>>>  		/*
>>> @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>
>>>  	element = pool->alloc(gfp_temp, pool->pool_data);
>>>  	if (likely(element != NULL))
>>> -		return element;
>>> +		goto out;
>>>
>>>  	spin_lock_irqsave(&pool->lock, flags);
>>>  	if (likely(pool->curr_nr)) {
>>> @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>  		 * for debugging.
>>>  		 */
>>>  		kmemleak_update_trace(element);
>>> -		return element;
>>> +		goto out;
>>>  	}
>>>
>>>  	/*
>>> @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
>>>  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
>>>  		spin_unlock_irqrestore(&pool->lock, flags);
>>> -		return NULL;
>>> +		goto out;
>>>  	}
>>>
>>>  	/* Let's wait for someone else to return an element to @pool */
>>> @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>>>
>>>  	finish_wait(&pool->wait, &wait);
>>>  	goto repeat_alloc;
>>> +out:
>>> +	if (gfpflags_allow_blocking(gfp_mask))
>>> +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
>>> +	return element;
>>>  }
>>>  EXPORT_SYMBOL(mempool_alloc);
>>>
>>
>> But it needs other changes to honor the PF_LESS_THROTTLE flag:
>>
>> static int current_may_throttle(void)
>> {
>>         return !(current->flags & PF_LESS_THROTTLE) ||
>>                 current->backing_dev_info == NULL ||
>>                 bdi_write_congested(current->backing_dev_info);
>> }
>> --- if you set PF_LESS_THROTTLE, current_may_throttle may still return
>> true if one of the other conditions is met.
>
> That is true but doesn't that mean that the device is congested and
> waiting a bit is the right thing to do?
>
>> shrink_zone_memcg calls throttle_vm_writeout without checking
>> PF_LESS_THROTTLE at all.
>
> Yes it doesn't call it because it relies on
> global_dirty_limits()->domain_dirty_limits() to DTRT. It will give the
> caller with PF_LESS_THROTTLE some boost wrt. all other writers.
>

Not sure it'll help, but I had to apply the following patch on top of
your original one. Without it, it didn't work.

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e248194..1616192 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1940,11 +1940,23 @@ bool wb_over_bg_thresh(struct bdi_writeback *wb)
         return false;
  }

+static int current_may_throttle(void)
+{
+       if (current->flags & PF_LESS_THROTTLE)
+               return 0;
+
+       return  current->backing_dev_info == NULL ||
+               bdi_write_congested(current->backing_dev_info);
+}
+
  void throttle_vm_writeout(gfp_t gfp_mask)
  {
         unsigned long background_thresh;
         unsigned long dirty_thresh;

+       if (!current_may_throttle())
+               return;
+
          for ( ; ; ) {
                 global_dirty_limits(&background_thresh, &dirty_thresh);
                 dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);

Regards Ondra

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-13 23:53                     ` David Rientjes
  2016-07-14 11:01                       ` Tetsuo Handa
  2016-07-14 12:27                       ` Mikulas Patocka
@ 2016-07-14 15:29                       ` Michal Hocko
  2016-07-14 20:38                         ` David Rientjes
  2 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 15:29 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Wed 13-07-16 16:53:28, David Rientjes wrote:
> On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> 
> > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > tries to fix?
> > 
> 
> It prevents the whole system from livelocking due to an oom killed process 
> stalling forever waiting for mempool_alloc() to return.  No other threads 
> may be oom killed while waiting for it to exit.

But it is true that the patch has an unintended side effect for any
mempool allocation from the reclaim path (aka PF_MEMALLOC context). So
do you think we should rework your additional patch to be explicit about
TIF_MEMDIE? Something like the following (not even compile tested, for
illustration). Tetsuo has properly pointed out that this doesn't work
for multithreaded processes reliably, but put that aside for now as that
needs a fix on a different layer. I believe we can fix that quite
easily after recent/planned changes.
---
diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..ea26d75c8adf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
+	/*
+	 * Make sure that the OOM victim will get access to memory reserves
+	 * properly if there are no objects in the pool to prevent from
+	 * livelocks.
+	 */
+	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
+		gfp_temp &= ~__GFP_NOMEMALLOC;
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 14:08                 ` Ondrej Kozina
@ 2016-07-14 15:31                   ` Michal Hocko
  2016-07-14 17:07                     ` Ondrej Kozina
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 15:31 UTC (permalink / raw)
  To: Ondrej Kozina
  Cc: Mikulas Patocka, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
[...]
> As Mikulas pointed out, this doesn't work. The system froze as well with the
> patch above. Will try to tweak the patch with Mikulas's suggestion...

Thank you for testing! Do you happen to have traces of the frozen
processes? Does the flusher still get throttled because the bias it
gets is not sufficient? Or does it get throttled at a different place?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 15:31                   ` Michal Hocko
@ 2016-07-14 17:07                     ` Ondrej Kozina
  2016-07-14 17:36                       ` Michal Hocko
  2016-07-15 11:42                       ` Tetsuo Handa
  0 siblings, 2 replies; 60+ messages in thread
From: Ondrej Kozina @ 2016-07-14 17:07 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mikulas Patocka, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On 07/14/2016 05:31 PM, Michal Hocko wrote:
> On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
> [...]
>> As Mikulas pointed out, this doesn't work. The system froze as well with the
>> patch above. Will try to tweak the patch with Mikulas's suggestion...
>
> Thank you for testing! Do you happen to have traces of the frozen
> processes? Does the flusher still get throttled because the bias it
> gets is not sufficient? Or does it get throttled at a different place?
>

Sure. Here it is (including sysrq+t and sysrq+w output): 
https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log

In the directory with the log there's also the patch the kernel was
compiled with.

Ondra

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 14:59                   ` Michal Hocko
  2016-07-14 15:25                     ` Ondrej Kozina
@ 2016-07-14 17:35                     ` Mikulas Patocka
  2016-07-15  8:35                       ` Michal Hocko
  1 sibling, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-14 17:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel



On Thu, 14 Jul 2016, Michal Hocko wrote:

> On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > 
> > 
> > On Thu, 14 Jul 2016, Michal Hocko wrote:
> > 
> > > On Wed 13-07-16 11:02:15, Mikulas Patocka wrote:
> > 
> > > > > diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
> > > > > index 4f3cb3554944..0b806810efab 100644
> > > > > --- a/drivers/md/dm-crypt.c
> > > > > +++ b/drivers/md/dm-crypt.c
> > > > > @@ -1392,11 +1392,14 @@ static void kcryptd_async_done(struct crypto_async_request *async_req,
> > > > >  static void kcryptd_crypt(struct work_struct *work)
> > > > >  {
> > > > >  	struct dm_crypt_io *io = container_of(work, struct dm_crypt_io, work);
> > > > > +	unsigned int pflags = current->flags;
> > > > >  
> > > > > +	current->flags |= PF_LESS_THROTTLE;
> > > > >  	if (bio_data_dir(io->base_bio) == READ)
> > > > >  		kcryptd_crypt_read_convert(io);
> > > > >  	else
> > > > >  		kcryptd_crypt_write_convert(io);
> > > > > +	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > > > >  }
> > > > >  
> > > > >  static void kcryptd_queue_crypt(struct dm_crypt_io *io)
> > > > 
> > > > ^^^ That fixes just one specific case - but there may be other threads 
> > > > doing mempool allocations in the device mapper subsystem - and you would 
> > > > need to mark all of them.
> > > 
> > > Now that I am thinking about it some more. Are there any mempool users
> > > which would actually want to be throttled? I would expect mempool users
> > > are necessary to push IO through, and throttling them sounds like a bad
> > > decision in the first place but there might be other mempool users which
> > > could cause issues. Anyway how about setting PF_LESS_THROTTLE
> > > unconditionally inside mempool_alloc? Something like the following:
> > >
> > > diff --git a/mm/mempool.c b/mm/mempool.c
> > > index 8f65464da5de..e21fb632983f 100644
> > > --- a/mm/mempool.c
> > > +++ b/mm/mempool.c
> > > @@ -310,7 +310,8 @@ EXPORT_SYMBOL(mempool_resize);
> > >   */
> > >  void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  {
> > > -	void *element;
> > > +	unsigned int pflags = current->flags;
> > > +	void *element = NULL;
> > >  	unsigned long flags;
> > >  	wait_queue_t wait;
> > >  	gfp_t gfp_temp;
> > > @@ -327,6 +328,12 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  
> > >  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > >  
> > > +	/*
> > > +	 * Make sure that the allocation doesn't get throttled during the
> > > +	 * reclaim
> > > +	 */
> > > +	if (gfpflags_allow_blocking(gfp_mask))
> > > +		current->flags |= PF_LESS_THROTTLE;
> > >  repeat_alloc:
> > >  	if (likely(pool->curr_nr)) {
> > >  		/*
> > > @@ -339,7 +346,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  
> > >  	element = pool->alloc(gfp_temp, pool->pool_data);
> > >  	if (likely(element != NULL))
> > > -		return element;
> > > +		goto out;
> > >  
> > >  	spin_lock_irqsave(&pool->lock, flags);
> > >  	if (likely(pool->curr_nr)) {
> > > @@ -352,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  		 * for debugging.
> > >  		 */
> > >  		kmemleak_update_trace(element);
> > > -		return element;
> > > +		goto out;
> > >  	}
> > >  
> > >  	/*
> > > @@ -369,7 +376,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  	/* We must not sleep if !__GFP_DIRECT_RECLAIM */
> > >  	if (!(gfp_mask & __GFP_DIRECT_RECLAIM)) {
> > >  		spin_unlock_irqrestore(&pool->lock, flags);
> > > -		return NULL;
> > > +		goto out;
> > >  	}
> > >  
> > >  	/* Let's wait for someone else to return an element to @pool */
> > > @@ -386,6 +393,10 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  
> > >  	finish_wait(&pool->wait, &wait);
> > >  	goto repeat_alloc;
> > > +out:
> > > +	if (gfpflags_allow_blocking(gfp_mask))
> > > +		tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);
> > > +	return element;
> > >  }
> > >  EXPORT_SYMBOL(mempool_alloc);
> > >  
> > 
> > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > 
> > static int current_may_throttle(void)
> > {
> >         return !(current->flags & PF_LESS_THROTTLE) ||
> >                 current->backing_dev_info == NULL ||
> >                 bdi_write_congested(current->backing_dev_info);
> > }
> > --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> > true if one of the other conditions is met.
> 
> That is true but doesn't that mean that the device is congested and
> waiting a bit is the right thing to do?

You shouldn't really throttle mempool allocations at all. It's better to 
fail the allocation quickly and allocate from a mempool reserve than to 
wait 0.1 seconds in the reclaim path.

dm-crypt can do approximately 100MB/s. That means that it processes 25k
swap pages per second. If you wait in mempool_alloc, the allocation would 
be satisfied in 0.00004s. If you wait in the allocator's throttle 
function, you waste 0.1s.
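
(Back-of-the-envelope: 100 MB/s over 4 KB pages is 100 * 1024 / 4 = 25600,
i.e. roughly 25k pages per second, so an element comes back to the pool
about every 1/25600 s ~= 0.00004 s -- some 2500 times sooner than the
hardcoded 0.1 s sleep.)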


It is also questionable if those 0.1 second sleeps are reasonable at all. 
SSDs with 100k IOPS are common - they can drain the request queue in much 
less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
should be replaced with sleeps until the device stops being congested.
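
A sketch of that direction, using an API that already exists in this era
(assuming wait_iff_congested() keeps its current semantics; note that
throttle_vm_writeout does not currently have the zone at hand, so this
part would need plumbing):

	/* instead of the unconditional 0.1s nap ... */
	congestion_wait(BLK_RW_ASYNC, HZ/10);

	/* ... sleep only if the zone is actually marked congested, and
	 * wake up as soon as the congestion flag is cleared: */
	wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);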

Mikulas

> > shrink_zone_memcg calls throttle_vm_writeout without checking 
> > PF_LESS_THROTTLE at all.
> 
> Yes it doesn't call it because it relies on
> global_dirty_limits()->domain_dirty_limits() to DTRT. It will give the
> caller with PF_LESS_THROTTLE some boost wrt. all other writers.
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 17:07                     ` Ondrej Kozina
@ 2016-07-14 17:36                       ` Michal Hocko
  2016-07-14 17:39                         ` Michal Hocko
  2016-07-15 11:42                       ` Tetsuo Handa
  1 sibling, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 17:36 UTC (permalink / raw)
  To: Ondrej Kozina
  Cc: Mikulas Patocka, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Thu 14-07-16 19:07:52, Ondrej Kozina wrote:
> On 07/14/2016 05:31 PM, Michal Hocko wrote:
> > On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
> > [...]
> > > As Mikulas pointed out, this doesn't work. The system froze as well with the
> > > patch above. Will try to tweak the patch with Mikulas's suggestion...
> > 
> > Thank you for testing! Do you happen to have traces of the frozen
> > processes? Does the flusher still get throttled because the bias it
> > gets is not sufficient? Or does it get throttled at a different place?
> > 
> 
> Sure. Here it is (including sysrq+t and sysrq+w output): https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log

Thanks a lot! This is helpful.
[  162.716376] active_anon:107874 inactive_anon:108176 isolated_anon:64
[  162.716376]  active_file:1086 inactive_file:1103 isolated_file:0
[  162.716376]  unevictable:0 dirty:0 writeback:69824 unstable:0
[  162.716376]  slab_reclaimable:3119 slab_unreclaimable:24124
[  162.716376]  mapped:2165 shmem:57 pagetables:1509 bounce:0
[  162.716376]  free:701 free_pcp:0 free_cma:0

No surprise that PF_LESS_THROTTLE didn't help. It gives some bias but
considering how many pages are under writeback it cannot possibly
prevent us from sleeping in throttle_vm_writeout. I suppose adding
the following on top of the memalloc patch helps, right?
It is an alternative to what you were suggesting in the other email, but
it doesn't affect current_may_throttle paths which I would rather not
touch.
---
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7fbb2d008078..a37661f1a11b 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 
+	if (current->flags & PF_LESS_THROTTLE)
+		return;
+
         for ( ; ; ) {
 		global_dirty_limits(&background_thresh, &dirty_thresh);
 		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 17:36                       ` Michal Hocko
@ 2016-07-14 17:39                         ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-14 17:39 UTC (permalink / raw)
  To: Ondrej Kozina
  Cc: Mikulas Patocka, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Thu 14-07-16 19:36:59, Michal Hocko wrote:
> On Thu 14-07-16 19:07:52, Ondrej Kozina wrote:
> > On 07/14/2016 05:31 PM, Michal Hocko wrote:
> > > On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
> > > [...]
> > > > As Mikulas pointed out, this doesn't work. The system froze as well with the
> > > > patch above. Will try to tweak the patch with Mikulas's suggestion...
> > > 
> > > Thank you for testing! Do you happen to have traces of the frozen
> > > processes? Does the flusher still get throttled because the bias it
> > > gets is not sufficient? Or does it get throttled at a different place?
> > > 
> > 
> > Sure. Here it is (including sysrq+t and sysrq+w output): https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log
> 
> Thanks a lot! This is helpful.
> [  162.716376] active_anon:107874 inactive_anon:108176 isolated_anon:64
> [  162.716376]  active_file:1086 inactive_file:1103 isolated_file:0
> [  162.716376]  unevictable:0 dirty:0 writeback:69824 unstable:0
> [  162.716376]  slab_reclaimable:3119 slab_unreclaimable:24124
> [  162.716376]  mapped:2165 shmem:57 pagetables:1509 bounce:0
> [  162.716376]  free:701 free_pcp:0 free_cma:0
> 
> No surprise that PF_LESS_THROTTLE didn't help. It gives some bias but
> considering how many pages are under writeback it cannot possibly
> prevent us from sleeping in throttle_vm_writeout. I suppose adding
> the following on top of the memalloc patch helps, right?
> It is an alternative to what you were suggesting in the other email, but
> it doesn't affect current_may_throttle paths which I would rather not
> touch.

Just read the other email and your patch properly; this is basically an
equivalent thing. I will think more about potential issues this might
cause and send a proper patch for wider review.

> ---
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 7fbb2d008078..a37661f1a11b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1971,6 +1971,9 @@ void throttle_vm_writeout(gfp_t gfp_mask)
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  
> +	if (current->flags & PF_LESS_THROTTLE)
> +		return;
> +
>          for ( ; ; ) {
>  		global_dirty_limits(&background_thresh, &dirty_thresh);
>  		dirty_thresh = hard_dirty_limit(&global_wb_domain, dirty_thresh);
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 12:27                       ` Mikulas Patocka
@ 2016-07-14 20:22                         ` David Rientjes
  2016-07-15 11:21                           ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: David Rientjes @ 2016-07-14 20:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel

On Thu, 14 Jul 2016, Mikulas Patocka wrote:

> > schedule
> > schedule_timeout
> > io_schedule_timeout
> > mempool_alloc
> > __split_and_process_bio
> > dm_request
> > generic_make_request
> > submit_bio
> > mpage_readpages
> > ext4_readpages
> > __do_page_cache_readahead
> > ra_submit
> > filemap_fault
> > handle_mm_fault
> > __do_page_fault
> > do_page_fault
> > page_fault
> 
> Device mapper should be able to proceed if there is no available memory. 
> If it doesn't proceed, there is a bug in it.
> 

The above stack trace, with pre-f9054c70d28b behavior, has nothing to do
with the device mapper.  It is simply calling into mempool_alloc(), and
no elements are being returned to the mempool that would allow it to
return.

Recall that in the above situation, the whole system is oom; nothing can 
allocate memory.  The oom killer has selected the above process to be oom 
killed, so all other processes on the system trying to allocate memory 
will stall in the page allocator waiting for this process to exit.

The natural response to this situation is to allow access to memory 
reserves, if possible, so that mempool_alloc() may return.  There is no 
guarantee that _anything_ can return memory to the mempool, especially in 
a system oom condition where nothing can make forward progress.  Insisting
that it should be guaranteed is not helpful.

> I'd like to ask - what device mapper targets did you use in this case? Are 
> there some other deadlocked processes? (show sysrq-t, sysrq-w when this 
> happened)
> 

Every process on the system is deadlocked because they cannot get memory 
through the page allocator until the above process exits.  That is how the 
oom killer works: select a process, kill it, give it access to memory 
reserves so it may exit and free its memory, and wait.
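
(The mechanism referred to, roughly -- a paraphrase of mark_oom_victim()
from mm/oom_kill.c of this era, with the rest of the body elided:)

void mark_oom_victim(struct task_struct *tsk)
{
	/* TIF_MEMDIE is what later grants ALLOC_NO_WATERMARKS, i.e.
	 * access to the memory reserves, in the allocator slowpath */
	if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
		return;
	...
}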

> Did the machine lock up completely with that stacktrace, or was it just 
> slowed down?
> 

Hundreds of machines locked up and rebooted after a two-hour watchdog
timeout.

> > That's obviously not the case if we have hundreds of machines timing out 
> > after two hours waiting for that fault to succeed.  The mempool interface 
> > cannot require that users return elements to the pool synchronous with all 
> > allocators so that we can happily loop forever, the only requirement on 
> 
> Mempool users must return objects to the mempool.
> 

Not possible when the system is livelocked.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 11:01                       ` Tetsuo Handa
  2016-07-14 12:29                         ` Mikulas Patocka
@ 2016-07-14 20:26                         ` David Rientjes
  2016-07-14 21:40                           ` Tetsuo Handa
  2016-07-15 11:25                           ` Mikulas Patocka
  1 sibling, 2 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-14 20:26 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mpatocka, mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel

On Thu, 14 Jul 2016, Tetsuo Handa wrote:

> David Rientjes wrote:
> > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> > 
> > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > > tries to fix?
> > > 
> > 
> > It prevents the whole system from livelocking due to an oom killed process 
> > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > may be oom killed while waiting for it to exit.
> 
> Is that concern still valid? We have the OOM reaper for CONFIG_MMU=y case.
> 

Umm, show me an explicit guarantee where the oom reaper will free memory 
such that other threads may return memory to this process's mempool so it 
can make forward progress in mempool_alloc() without needing to utilize
memory reserves.  First, it might be helpful to show that the oom reaper 
is ever guaranteed to free any memory for a selected oom victim.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 15:29                       ` Michal Hocko
@ 2016-07-14 20:38                         ` David Rientjes
  2016-07-15  7:22                           ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: David Rientjes @ 2016-07-14 20:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Thu, 14 Jul 2016, Michal Hocko wrote:

> > It prevents the whole system from livelocking due to an oom killed process 
> > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > may be oom killed while waiting for it to exit.
> 
> But it is true that the patch has an unintended side effect for any mempool
> allocation from the reclaim path (aka PF_MEMALLOC context).

If a PF_MEMALLOC context is consuming too much of the memory reserves, then I'd
argue that is a problem independent of using mempool_alloc() since 
mempool_alloc() can evolve directly into a call to the page allocator.  
How does such a process guarantee that it cannot deplete memory reserves 
with a simple call to the page allocator?  Since nothing in the page 
allocator is preventing complete depletion of reserves (it simply uses 
ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be 
responsible.
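
(The page allocator logic referred to is roughly the following -- a
paraphrase of gfp_to_alloc_flags() from this era, with details elided:)

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
			 ((current->flags & PF_MEMALLOC) ||
			  unlikely(test_thread_flag(TIF_MEMDIE))))
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}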

> So do you
> think we should rework your additional patch to be explicit about
> TIF_MEMDIE?

Not sure which additional patch you're referring to, the only patch that I 
proposed was commit f9054c70d28b, which stopped hundreds of machines from
timing out.

> Something like the following (not even compile tested for
> illustration). Tetsuo has properly pointed out that this doesn't work
> for multithreaded processes reliably, but put that aside for now as that
> needs a fix on a different layer. I believe we can fix that quite
> easily after recent/planned changes.
> ---
> diff --git a/mm/mempool.c b/mm/mempool.c
> index 8f65464da5de..ea26d75c8adf 100644
> --- a/mm/mempool.c
> +++ b/mm/mempool.c
> @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  
>  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
>  
> +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
>  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
>  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
>  
>  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
>  
>  repeat_alloc:
> -	if (likely(pool->curr_nr)) {
> -		/*
> -		 * Don't allocate from emergency reserves if there are
> -		 * elements available.  This check is racy, but it will
> -		 * be rechecked each loop.
> -		 */
> -		gfp_temp |= __GFP_NOMEMALLOC;
> -	}
> +	/*
> +	 * Make sure that the OOM victim will get access to memory reserves
> +	 * properly if there are no objects in the pool to prevent from
> +	 * livelocks.
> +	 */
> +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> +		gfp_temp &= ~__GFP_NOMEMALLOC;
>  
>  	element = pool->alloc(gfp_temp, pool->pool_data);
>  	if (likely(element != NULL))
> @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
>  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
>  	 * alloc failed with that and @pool was empty, retry immediately.
>  	 */
> -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
>  		spin_unlock_irqrestore(&pool->lock, flags);
>  		gfp_temp = gfp_mask;
>  		goto repeat_alloc;

This is bogus and quite obviously leads to oom livelock: if a process is 
holding a mutex and does mempool_alloc(), since __GFP_WAIT is allowed in 
process context for mempool allocation, it can stall here in an oom 
condition if there are no elements available on the mempool freelist.  If 
the oom victim contends the same mutex, the system livelocks and the same 
bug arises because the holder of the mutex loops forever.  This is the 
exact behavior that f9054c70d28b also fixes.

These aren't hypothetical situations; the patch stopped hundreds of machines
from regularly timing out.  The fundamental reason is that mempool_alloc()
must not loop forever in process context: that is needed when the 
allocator is either an oom victim itself or the oom victim is blocked by 
an allocator.  mempool_alloc() must guarantee forward progress in such a 
context.
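
For reference, the loop in question is the tail of mempool_alloc()
(4.7-era code; the comments are added here):

	/* Let's wait for someone else to return an element to @pool */
	init_wait(&wait);
	prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);

	spin_unlock_irqrestore(&pool->lock, flags);

	/* sleep for up to 5s, then simply retry -- forever, in process
	 * context, until an element shows up in the pool */
	io_schedule_timeout(5 * HZ);

	finish_wait(&pool->wait, &wait);
	goto repeat_alloc;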

The end result is that when in PF_MEMALLOC context, allocators must be 
responsible and not deplete all memory reserves.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 20:26                         ` David Rientjes
@ 2016-07-14 21:40                           ` Tetsuo Handa
  2016-07-14 22:04                             ` David Rientjes
  2016-07-15 11:25                           ` Mikulas Patocka
  1 sibling, 1 reply; 60+ messages in thread
From: Tetsuo Handa @ 2016-07-14 21:40 UTC (permalink / raw)
  To: rientjes
  Cc: mpatocka, mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel

David Rientjes wrote:
> On Thu, 14 Jul 2016, Tetsuo Handa wrote:
> 
> > David Rientjes wrote:
> > > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> > > 
> > > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > > > tries to fix?
> > > > 
> > > 
> > > It prevents the whole system from livelocking due to an oom killed process 
> > > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > > may be oom killed while waiting for it to exit.
> > 
> > Is that concern still valid? We have the OOM reaper for CONFIG_MMU=y case.
> > 
> 
> Umm, show me an explicit guarantee where the oom reaper will free memory 
> such that other threads may return memory to this process's mempool so it 
> can make forward progress in mempool_alloc() without needing to utilize
> memory reserves.  First, it might be helpful to show that the oom reaper 
> is ever guaranteed to free any memory for a selected oom victim.
> 

Whether the OOM reaper will free some memory no longer matters. Instead,
whether the OOM reaper will let the OOM killer select the next OOM victim
matters.

Are you aware that the OOM reaper will let the OOM killer select the next
OOM victim (currently by clearing TIF_MEMDIE)? Clearing TIF_MEMDIE in 4.6
occurred only when OOM reaping succeeded. But we are going to change the
OOM reaper to always clear TIF_MEMDIE in 4.8 (or presumably change the OOM
killer not to depend on TIF_MEMDIE) so that the OOM reaper guarantees that
the OOM killer always selects the next OOM victim.
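
(A sketch of the mechanism Tetsuo describes -- the names follow the
4.6-era oom reaper and the body is paraphrased, not quoted:)

static void oom_reap_task(struct task_struct *tsk)
{
	int attempts = 0;

	/* retry unmapping the victim's private memory a few times */
	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
		schedule_timeout_idle(HZ/10);

	/*
	 * Once reaping is done, stop hiding the victim from the OOM
	 * killer (in 4.6 by clearing TIF_MEMDIE on success only), so
	 * that another victim can be selected if needed.
	 */
	...
}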

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 21:40                           ` Tetsuo Handa
@ 2016-07-14 22:04                             ` David Rientjes
  0 siblings, 0 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-14 22:04 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: mpatocka, mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel

On Fri, 15 Jul 2016, Tetsuo Handa wrote:

> Whether the OOM reaper will free some memory no longer matters. Instead,
> whether the OOM reaper will let the OOM killer select the next OOM victim
> matters.
> 
> Are you aware that the OOM reaper will let the OOM killer select the next
> OOM victim (currently by clearing TIF_MEMDIE)? Clearing TIF_MEMDIE in 4.6
> occurred only when OOM reaping succeeded. But we are going to change the
> OOM reaper to always clear TIF_MEMDIE in 4.8 (or presumably change the OOM
> killer not to depend on TIF_MEMDIE) so that the OOM reaper guarantees that
> the OOM killer always selects the next OOM victim.
> 

That's cute, I'll have to look into those patches.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 20:38                         ` David Rientjes
@ 2016-07-15  7:22                           ` Michal Hocko
  2016-07-15  8:23                             ` Michal Hocko
                                               ` (2 more replies)
  0 siblings, 3 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-15  7:22 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Thu 14-07-16 13:38:42, David Rientjes wrote:
> On Thu, 14 Jul 2016, Michal Hocko wrote:
> 
> > > It prevents the whole system from livelocking due to an oom killed process 
> > > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > > may be oom killed while waiting for it to exit.
> > 
> > But it is true that the patch has an unintended side effect for any mempool
> > allocation from the reclaim path (aka PF_MEMALLOC context).
> 
> If a PF_MEMALLOC context is consuming too much of the memory reserves, then I'd
> argue that is a problem independent of using mempool_alloc() since 
> mempool_alloc() can evolve directly into a call to the page allocator.  
> How does such a process guarantee that it cannot deplete memory reserves 
> with a simple call to the page allocator?  Since nothing in the page 
> allocator is preventing complete depletion of reserves (it simply uses 
> ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be 
> responsible.

Well, the reclaim throttles the allocation request if there are too many
pages under writeback, and that should slow down the allocation rate and
give the writeback some time to complete. But yes, you are right: there
is nothing to prevent memory depletion, and it is really hard to come
up with something with a no-fail semantic.

Or do you have an idea how to throttle without knowing how much memory
will actually be consumed on the writeout path?

> > So do you
> > think we should rework your additional patch to be explicit about
> > TIF_MEMDIE?
> 
> Not sure which additional patch you're referring to, the only patch that I 
> proposed was commit f9054c70d28b which solved hundreds of machines from 
> timing out.

I would like to separate TIF_MEMDIE as access to memory reserves from
the oom victim selection semantic. And let me repeat: your proposed patch
has undesirable side effects, so we should think about a way to deal
with those cases. It might work for your setups, but it shouldn't break
others at the same time. An OOM situation is quite unlikely compared to
simple memory depletion by writing to swap...
 
> > Something like the following (not even compile tested for
> > illustration). Tetsuo has properly pointed out that this doesn't work
> > for multithreaded processes reliably, but put that aside for now as that
> > needs a fix on a different layer. I believe we can fix that quite
> > easily after recent/planned changes.
> > ---
> > diff --git a/mm/mempool.c b/mm/mempool.c
> > index 8f65464da5de..ea26d75c8adf 100644
> > --- a/mm/mempool.c
> > +++ b/mm/mempool.c
> > @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  
> >  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> >  
> > +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> >  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
> >  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
> >  
> >  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> >  
> >  repeat_alloc:
> > -	if (likely(pool->curr_nr)) {
> > -		/*
> > -		 * Don't allocate from emergency reserves if there are
> > -		 * elements available.  This check is racy, but it will
> > -		 * be rechecked each loop.
> > -		 */
> > -		gfp_temp |= __GFP_NOMEMALLOC;
> > -	}
> > +	/*
> > +	 * Make sure that the OOM victim will get access to memory reserves
> > +	 * properly if there are no objects in the pool to prevent from
> > +	 * livelocks.
> > +	 */
> > +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> > +		gfp_temp &= ~__GFP_NOMEMALLOC;
> >  
> >  	element = pool->alloc(gfp_temp, pool->pool_data);
> >  	if (likely(element != NULL))
> > @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> >  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
> >  	 * alloc failed with that and @pool was empty, retry immediately.
> >  	 */
> > -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> > +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
> >  		spin_unlock_irqrestore(&pool->lock, flags);
> >  		gfp_temp = gfp_mask;
> >  		goto repeat_alloc;
> 
> This is bogus and quite obviously leads to oom livelock: if a process is 
> holding a mutex and does mempool_alloc(), since __GFP_WAIT is allowed in 
> process context for mempool allocation, it can stall here in an oom 
> condition if there are no elements available on the mempool freelist.  If 
> the oom victim contends the same mutex, the system livelocks and the same 
> bug arises because the holder of the mutex loops forever.  This is the 
> exact behavior that f9054c70d28b also fixes.

Just to make sure I understand properly:
Task A				Task B			Task C
current->flags = PF_MEMALLOC
mutex_lock(&foo)		mutex_lock(&foo)	out_of_memory
mempool_alloc()						  select_bad_process = Task B
  alloc_pages(__GFP_NOMEMALLOC)


That would be really unfortunate but it doesn't really differ much from
other oom deadlocks when the victim is stuck behind an allocating task.
This is a generic problem and our answer for that is the oom reaper
which will tear down the address space of the victim asynchronously.
Sure there is no guarantee it will free enough to get us unstuck because
we are freeing only private unlocked memory but we rather fall back to
another oom victim if the situation prevails even after the unmapping
pass. So we shouldn't be stuck forever.

That being said, should we handle the mempool allocations the same way as
any other oom deadlock due to locks?
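
To make the scenario above concrete, the problematic pattern looks roughly
like this (a minimal sketch; the driver, lock and pool names are made up
for illustration and do not come from any real driver):

static void my_driver_writeout(struct my_dev *dev)
{
	struct bio *bio;

	mutex_lock(&dev->lock);		/* Task B, the oom victim, blocks here */
	/*
	 * Task A took the mutex first and now sleeps below under
	 * PF_MEMALLOC with __GFP_NOMEMALLOC in effect, waiting for a
	 * pool element that only another task could return:
	 */
	bio = mempool_alloc(dev->bio_pool, GFP_NOIO);
	/* ... initialize and submit the bio, later mempool_free() it ... */
	mutex_unlock(&dev->lock);
}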

> These aren't hypothetical situations, the patch kept hundreds of machines 
> from regularly timing out.  The fundamental reason is that mempool_alloc() 
> must not loop forever in process context: that is needed when the 
> allocator is either an oom victim itself or the oom victim is blocked by 
> an allocator.  mempool_alloc() must guarantee forward progress in such a 
> context.
> 
> The end result is that when in PF_MEMALLOC context, allocators must be 
> responsible and not deplete all memory reserves.

How do you propose to guarantee that? You might have a really complex IO
setup, and mempools have been the answer for guaranteeing forward progress
for ages.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15  7:22                           ` Michal Hocko
@ 2016-07-15  8:23                             ` Michal Hocko
  2016-07-15 12:00                             ` Mikulas Patocka
  2016-07-15 21:47                             ` David Rientjes
  2 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-15  8:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

Let me paste the patch with the full changelog and the explanation so
that we can reason about it more easily. If I am making some false
assumptions then please point them out.
--- 
From ed46e3f7f5a6e896331eeadc9d09e2796acb3d01 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 14 Jul 2016 19:31:54 +0200
Subject: [PATCH] mempool: do not consume memory reserves from the reclaim path

There has been a report about the OOM killer being invoked when swapping
out to a dm-crypt device. The primary reason seems to be that the swapout
IO managed to completely deplete the memory reserves. Mikulas was able
to bisect and explain the issue by pointing to f9054c70d28b
("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements").

The reason is that the swapout path is not throttled properly because
the dm layer needs to allocate from the generic_make_request path,
which means it allocates from the PF_MEMALLOC context. The dm layer uses
mempool_alloc in order to guarantee forward progress, which used to
inhibit access to memory reserves when using the page allocator. This
has changed with f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements"), which dropped the __GFP_NOMEMALLOC
protection when the memory pool is depleted.

If we are running out of memory and the only way to free memory is to
perform swapout, we just keep consuming memory reserves rather than
throttling the mempool allocations and allowing the pending IO to
complete, up to the moment when the memory is depleted completely and
there is no way forward but to invoke the OOM killer. This is less than
optimal.

The original intention of f9054c70d28b was to help with OOM situations
where the oom victim depends on a mempool allocation to make forward
progress. We can handle that case in a different way, though. We can
check whether the current task has access to memory reserves as an
OOM victim (TIF_MEMDIE) and drop the __GFP_NOMEMALLOC protection if the
pool is empty.

David Rientjes was objecting that such an approach wouldn't help if the
oom victim was blocked on a lock held by a process doing mempool_alloc. This
is very similar to other oom deadlock situations and we have the oom_reaper
to deal with them, so it is reasonable to rely on the same mechanism
rather than inventing a different one which has negative side effects.

Fixes: f9054c70d28b ("mm, mempool: only set __GFP_NOMEMALLOC if there are free elements")
Bisected-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/mempool.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/mempool.c b/mm/mempool.c
index 8f65464da5de..ea26d75c8adf 100644
--- a/mm/mempool.c
+++ b/mm/mempool.c
@@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 
 	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
 
+	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
 	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
 	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
 
 	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
 
 repeat_alloc:
-	if (likely(pool->curr_nr)) {
-		/*
-		 * Don't allocate from emergency reserves if there are
-		 * elements available.  This check is racy, but it will
-		 * be rechecked each loop.
-		 */
-		gfp_temp |= __GFP_NOMEMALLOC;
-	}
+	/*
+	 * Make sure that the OOM victim will get access to memory reserves
+	 * properly if there are no objects in the pool to prevent from
+	 * livelocks.
+	 */
+	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
+		gfp_temp &= ~__GFP_NOMEMALLOC;
 
 	element = pool->alloc(gfp_temp, pool->pool_data);
 	if (likely(element != NULL))
@@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
 	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
 	 * alloc failed with that and @pool was empty, retry immediately.
 	 */
-	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
+	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		gfp_temp = gfp_mask;
 		goto repeat_alloc;
-- 
2.8.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 17:35                     ` Mikulas Patocka
@ 2016-07-15  8:35                       ` Michal Hocko
  2016-07-15 12:11                         ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-15  8:35 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Thu 14-07-16 13:35:35, Mikulas Patocka wrote:
> On Thu, 14 Jul 2016, Michal Hocko wrote:
> > On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > > 
> > > static int current_may_throttle(void)
> > > {
> > >         return !(current->flags & PF_LESS_THROTTLE) ||
> > >                 current->backing_dev_info == NULL ||
> > >                 bdi_write_congested(current->backing_dev_info);
> > > }
> > > --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> > > true if one of the other conditions is met.
> > 
> > That is true but doesn't that mean that the device is congested and
> > waiting a bit is the right thing to do?
> 
> You shouldn't really throttle mempool allocations at all. It's better to 
> fail the allocation quickly and allocate from a mempool reserve than to 
> wait 0.1 seconds in the reclaim path.

Well, but we do that already, no? The first allocation request is NOWAIT
and then we try to consume an object from the pool. We are re-adding
__GFP_DIRECT_RECLAIM in case both fail. The point of throttling is to
prevent from scanning through LRUs too quickly while we know that the
bdi is congested.
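
Condensed, the strategy described above looks roughly like this (simplified
from mm/mempool.c; locking, the __GFP_NOMEMALLOC handling and the wait-queue
bookkeeping are omitted):

	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
repeat_alloc:
	element = pool->alloc(gfp_temp, pool->pool_data);
	if (element)
		return element;
	if (pool->curr_nr)		/* fall back to a pre-allocated element */
		return remove_element(pool);
	if (gfp_temp != gfp_mask) {	/* retry, now allowing direct reclaim */
		gfp_temp = gfp_mask;
		goto repeat_alloc;
	}
	io_schedule_timeout(5*HZ);	/* wait for mempool_free() to refill the pool */
	goto repeat_alloc;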

> dm-crypt can do approximately 100MB/s. That means that it processes 25k 
> swap pages per second. If you wait in mempool_alloc, the allocation would 
> be satisfied in 0.00004s. If you wait in the allocator's throttle 
> function, you waste 0.1s.
> 
> 
> It is also questionable if those 0.1 second sleeps are reasonable at all. 
> SSDs with 100k IOPS are common - they can drain the request queue in much 
> less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
> should be replaced with sleeps until the device stops being congested.

Well if we do not do throttle_vm_writeout then the only remaining
writeout throttling for PF_LESS_THROTTLE is wait_iff_congested for
the direct reclaim and that should wake up if the device stops being
congested AFAIU.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 20:22                         ` David Rientjes
@ 2016-07-15 11:21                           ` Mikulas Patocka
  2016-07-15 21:25                             ` David Rientjes
  2016-07-18 15:14                             ` Johannes Weiner
  0 siblings, 2 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 11:21 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel



On Thu, 14 Jul 2016, David Rientjes wrote:

> There is no guarantee that _anything_ can return memory to the mempool,

You misunderstand mempools if you make such claims.

There is in fact a guarantee that objects will be returned to the mempool.
In the past I reviewed device mapper thoroughly to make sure that it can
make forward progress even if there is no available memory.
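
The guarantee comes from the free side: once the pool has been drained
below its reserved minimum, every freed object refills the pool and wakes
any waiters instead of going back to the allocator. Simplified from
mm/mempool.c:

void mempool_free(void *element, mempool_t *pool)
{
	unsigned long flags;

	if (pool->curr_nr < pool->min_nr) {
		spin_lock_irqsave(&pool->lock, flags);
		if (pool->curr_nr < pool->min_nr) {
			add_element(pool, element);	/* refill the reserve */
			spin_unlock_irqrestore(&pool->lock, flags);
			wake_up(&pool->wait);	/* unblock mempool_alloc() */
			return;
		}
		spin_unlock_irqrestore(&pool->lock, flags);
	}
	pool->free(element, pool->pool_data);	/* pool is full, really free it */
}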

I don't know what I should tell you if you keep on repeating the same
false claim over and over again. Should I explain mempool operations to
you in detail? Or will you find it out on your own?

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 20:26                         ` David Rientjes
  2016-07-14 21:40                           ` Tetsuo Handa
@ 2016-07-15 11:25                           ` Mikulas Patocka
  2016-07-15 21:21                             ` David Rientjes
  1 sibling, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 11:25 UTC (permalink / raw)
  To: David Rientjes
  Cc: Tetsuo Handa, mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel



On Thu, 14 Jul 2016, David Rientjes wrote:

> On Thu, 14 Jul 2016, Tetsuo Handa wrote:
> 
> > David Rientjes wrote:
> > > On Wed, 13 Jul 2016, Mikulas Patocka wrote:
> > > 
> > > > What are the real problems that f9054c70d28bc214b2857cf8db8269f4f45a5e23 
> > > > tries to fix?
> > > > 
> > > 
> > > It prevents the whole system from livelocking due to an oom killed process 
> > > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > > may be oom killed while waiting for it to exit.
> > 
> > Is that concern still valid? We have the OOM reaper for CONFIG_MMU=y case.
> > 
> 
> Umm, show me an explicit guarantee where the oom reaper will free memory 
> such that other threads may return memory to this process's mempool so it 
> can make forward progress in mempool_alloc() without the need of utilizing 
> memory reserves.  First, it might be helpful to show that the oom reaper 
> is ever guaranteed to free any memory for a selected oom victim.

The function mempool_alloc sleeps with "io_schedule_timeout(5*HZ);"

So, if the oom reaper frees some memory into the page allocator, the 
process that is stuck in mempool_alloc will sleep for up to 5 seconds, then 
it will retry the allocation with "element = pool->alloc(gfp_temp, 
pool->pool_data)" (that will allocate from the page allocator) and succeed.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-14 17:07                     ` Ondrej Kozina
  2016-07-14 17:36                       ` Michal Hocko
@ 2016-07-15 11:42                       ` Tetsuo Handa
  1 sibling, 0 replies; 60+ messages in thread
From: Tetsuo Handa @ 2016-07-15 11:42 UTC (permalink / raw)
  To: Ondrej Kozina, Michal Hocko
  Cc: Mikulas Patocka, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On 2016/07/15 2:07, Ondrej Kozina wrote:
> On 07/14/2016 05:31 PM, Michal Hocko wrote:
>> On Thu 14-07-16 16:08:28, Ondrej Kozina wrote:
>> [...]
>>> As Mikulas pointed out, this doesn't work. The system froze as well with the
>>> patch above. Will try to tweak the patch with Mikulas's suggestion...
>>
>> Thank you for testing! Do you happen to have traces of the frozen
>> processes? Does the flusher still get throttled because the bias it
>> gets is not sufficient? Or does it get throttled at a different place?
>>
> 
> Sure. Here it is (including sysrq+t and sysrq+w output): https://okozina.fedorapeople.org/bugs/swap_on_dmcrypt/4.7.0-rc7+/1/4.7.0-rc7+.log
> 

Oh, this resembles another dm-crypt lockup problem reported last month.
( http://lkml.kernel.org/r/20160616212641.GA3308@sig21.net )

In Johannes's case, there are so many pending kcryptd_crypt work requests and
mempool_alloc() is waiting at throttle_vm_writeout() or shrink_inactive_list().

[ 2378.279029] kswapd0         D ffff88003744f538     0   766      2 0x00000000
[ 2378.286167]  ffff88003744f538 00ff88011b5ccd80 ffff88011b5d62d8 ffff88011ae58000
[ 2378.293628]  ffff880037450000 ffff880037450000 00000001000984f2 ffff88003744f570
[ 2378.301168]  ffff88011b5ccd80 ffff880037450000 ffff88003744f550 ffffffff81845cec
[ 2378.308674] Call Trace:
[ 2378.311154]  [<ffffffff81845cec>] schedule+0x8b/0xa3
[ 2378.316153]  [<ffffffff81849b5b>] schedule_timeout+0x20b/0x285
[ 2378.322028]  [<ffffffff810e6da6>] ? init_timer_key+0x112/0x112
[ 2378.327931]  [<ffffffff81845070>] io_schedule_timeout+0xa0/0x102
[ 2378.333960]  [<ffffffff81845070>] ? io_schedule_timeout+0xa0/0x102
[ 2378.340166]  [<ffffffff81162c2b>] mempool_alloc+0x123/0x154
[ 2378.345781]  [<ffffffff810bdd00>] ? wait_woken+0x72/0x72
[ 2378.351148]  [<ffffffff8133fdc1>] bio_alloc_bioset+0xe8/0x1d7
[ 2378.356910]  [<ffffffff816342ea>] alloc_tio+0x2d/0x47
[ 2378.361996]  [<ffffffff8163587e>] __split_and_process_bio+0x310/0x3a3
[ 2378.368470]  [<ffffffff81635e15>] dm_make_request+0xb5/0xe2
[ 2378.374078]  [<ffffffff81347ae7>] generic_make_request+0xcc/0x180
[ 2378.380206]  [<ffffffff81347c98>] submit_bio+0xfd/0x145
[ 2378.385482]  [<ffffffff81198948>] __swap_writepage+0x202/0x225
[ 2378.391349]  [<ffffffff810a5eeb>] ? preempt_count_sub+0xf0/0x100
[ 2378.397398]  [<ffffffff8184a5f7>] ? _raw_spin_unlock+0x31/0x44
[ 2378.403273]  [<ffffffff8119a903>] ? page_swapcount+0x45/0x4c
[ 2378.408984]  [<ffffffff811989a5>] swap_writepage+0x3a/0x3e
[ 2378.414530]  [<ffffffff811727ef>] pageout.isra.16+0x160/0x2a7
[ 2378.420320]  [<ffffffff81173a8f>] shrink_page_list+0x5a0/0x8c4
[ 2378.426197]  [<ffffffff81174489>] shrink_inactive_list+0x29e/0x4a1
[ 2378.432434]  [<ffffffff81174e8b>] shrink_zone_memcg+0x4c1/0x661
[ 2378.438406]  [<ffffffff81175107>] shrink_zone+0xdc/0x1e5
[ 2378.443742]  [<ffffffff81175107>] ? shrink_zone+0xdc/0x1e5
[ 2378.449238]  [<ffffffff8117628f>] kswapd+0x6df/0x814
[ 2378.454222]  [<ffffffff81175bb0>] ? mem_cgroup_shrink_node_zone+0x209/0x209
[ 2378.461196]  [<ffffffff8109f208>] kthread+0xff/0x107
[ 2378.466182]  [<ffffffff8184b1f2>] ret_from_fork+0x22/0x50
[ 2378.471631]  [<ffffffff8109f109>] ? kthread_create_on_node+0x1ea/0x1ea

[ 2378.769494] kworker/u8:4    D ffff8800c5dc3508     0  1592      2 0x00000000
[ 2378.776582] Workqueue: kcryptd kcryptd_crypt
[ 2378.780887]  ffff8800c5dc3508 00ff88011b7ccd80 ffff88011b7d62d8 ffff88011ae5a900
[ 2378.788399]  ffff88011a605200 ffff8800c5dc4000 00000001000983f7 ffff8800c5dc3540
[ 2378.795930]  ffff88011b7ccd80 0000000000000000 ffff8800c5dc3520 ffffffff81845cec
[ 2378.803408] Call Trace:
[ 2378.805879]  [<ffffffff81845cec>] schedule+0x8b/0xa3
[ 2378.810908]  [<ffffffff81849b5b>] schedule_timeout+0x20b/0x285
[ 2378.816783]  [<ffffffff810e6da6>] ? init_timer_key+0x112/0x112
[ 2378.822677]  [<ffffffff81845070>] io_schedule_timeout+0xa0/0x102
[ 2378.828716]  [<ffffffff81845070>] ? io_schedule_timeout+0xa0/0x102
[ 2378.834956]  [<ffffffff8117d5c0>] congestion_wait+0x84/0x160
[ 2378.840658]  [<ffffffff810bdd00>] ? wait_woken+0x72/0x72
[ 2378.845997]  [<ffffffff8116c32f>] throttle_vm_writeout+0x88/0xab
[ 2378.852036]  [<ffffffff81174fff>] shrink_zone_memcg+0x635/0x661
[ 2378.857982]  [<ffffffff81175107>] shrink_zone+0xdc/0x1e5
[ 2378.863309]  [<ffffffff81175107>] ? shrink_zone+0xdc/0x1e5
[ 2378.868832]  [<ffffffff811753b5>] do_try_to_free_pages+0x1a5/0x2c3
[ 2378.875028]  [<ffffffff811755f6>] try_to_free_pages+0x123/0x21f
[ 2378.880972]  [<ffffffff81168216>] __alloc_pages_nodemask+0x4c9/0x978
[ 2378.887385]  [<ffffffff8138027a>] ? debug_smp_processor_id+0x17/0x19
[ 2378.893782]  [<ffffffff8119fb2a>] new_slab+0xbc/0x3bb
[ 2378.898868]  [<ffffffff811a1acd>] ___slab_alloc.constprop.22+0x2fb/0x37b
[ 2378.905634]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2378.911659]  [<ffffffff8101f5ba>] ? sched_clock+0x9/0xd
[ 2378.916909]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2378.922325]  [<ffffffff810c6438>] ? __lock_acquire.isra.16+0x55e/0xb4c
[ 2378.928877]  [<ffffffff8101f5ba>] ? sched_clock+0x9/0xd
[ 2378.934138]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2378.939555]  [<ffffffff810c6438>] ? __lock_acquire.isra.16+0x55e/0xb4c
[ 2378.946125]  [<ffffffff811a1ba4>] __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2378.953289]  [<ffffffff811a1ba4>] ? __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2378.960630]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2378.966706]  [<ffffffff811a1c78>] kmem_cache_alloc+0xa0/0x1d6
[ 2378.972503]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2378.978567]  [<ffffffff81162a88>] mempool_alloc_slab+0x15/0x17
[ 2378.984426]  [<ffffffff81162b7a>] mempool_alloc+0x72/0x154
[ 2378.989930]  [<ffffffff810c4b45>] ? lockdep_init_map+0xc9/0x5a3
[ 2378.995866]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2379.001300]  [<ffffffff8133fdc1>] bio_alloc_bioset+0xe8/0x1d7
[ 2379.007107]  [<ffffffff81643127>] kcryptd_crypt+0x1ab/0x325
[ 2379.012704]  [<ffffffff810998fd>] ? process_one_work+0x1ad/0x4e2
[ 2379.018753]  [<ffffffff810999d3>] process_one_work+0x283/0x4e2
[ 2379.024629]  [<ffffffff810c502d>] ? put_lock_stats.isra.9+0xe/0x20
[ 2379.030851]  [<ffffffff8109a860>] worker_thread+0x285/0x370
[ 2379.036423]  [<ffffffff8109a5db>] ? rescuer_thread+0x2d1/0x2d1
[ 2379.042309]  [<ffffffff8109f208>] kthread+0xff/0x107
[ 2379.047310]  [<ffffffff8184b1f2>] ret_from_fork+0x22/0x50
[ 2379.052726]  [<ffffffff8109f109>] ? kthread_create_on_node+0x1ea/0x1ea

[ 2379.059328] kworker/u8:6    D ffff8800c5ec3508     0  1594      2 0x00000000
[ 2379.066468] Workqueue: kcryptd kcryptd_crypt
[ 2379.070808]  ffff8800c5ec3508 00ff88011b7ccd80 ffff88011b7d62d8 ffff88011ae5a900
[ 2379.078296]  ffff88003749a900 ffff8800c5ec4000 0000000100098467 ffff8800c5ec3540
[ 2379.085836]  ffff88011b7ccd80 0000000000000000 ffff8800c5ec3520 ffffffff81845cec
[ 2379.093315] Call Trace:
[ 2379.095776]  [<ffffffff81845cec>] schedule+0x8b/0xa3
[ 2379.100785]  [<ffffffff81849b5b>] schedule_timeout+0x20b/0x285
[ 2379.106627]  [<ffffffff810e6da6>] ? init_timer_key+0x112/0x112
[ 2379.112494]  [<ffffffff81845070>] io_schedule_timeout+0xa0/0x102
[ 2379.118524]  [<ffffffff81845070>] ? io_schedule_timeout+0xa0/0x102
[ 2379.124740]  [<ffffffff8117d5c0>] congestion_wait+0x84/0x160
[ 2379.130432]  [<ffffffff810bdd00>] ? wait_woken+0x72/0x72
[ 2379.135771]  [<ffffffff8116c32f>] throttle_vm_writeout+0x88/0xab
[ 2379.141839]  [<ffffffff81174fff>] shrink_zone_memcg+0x635/0x661
[ 2379.147810]  [<ffffffff81175107>] shrink_zone+0xdc/0x1e5
[ 2379.153155]  [<ffffffff81175107>] ? shrink_zone+0xdc/0x1e5
[ 2379.158651]  [<ffffffff811753b5>] do_try_to_free_pages+0x1a5/0x2c3
[ 2379.164881]  [<ffffffff811755f6>] try_to_free_pages+0x123/0x21f
[ 2379.170861]  [<ffffffff81168216>] __alloc_pages_nodemask+0x4c9/0x978
[ 2379.177292]  [<ffffffff811a1776>] ? get_partial_node.isra.19+0x353/0x3af
[ 2379.184026]  [<ffffffff8119fb2a>] new_slab+0xbc/0x3bb
[ 2379.189103]  [<ffffffff811a1acd>] ___slab_alloc.constprop.22+0x2fb/0x37b
[ 2379.195843]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2379.201895]  [<ffffffff81049da2>] ? glue_xts_crypt_128bit+0x1a6/0x1d8
[ 2379.208357]  [<ffffffff8101f5ba>] ? sched_clock+0x9/0xd
[ 2379.213610]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2379.219050]  [<ffffffff810c6438>] ? __lock_acquire.isra.16+0x55e/0xb4c
[ 2379.225596]  [<ffffffff811a1ba4>] __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2379.232769]  [<ffffffff811a1ba4>] ? __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2379.240143]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2379.246177]  [<ffffffff811a1c78>] kmem_cache_alloc+0xa0/0x1d6
[ 2379.251957]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2379.258024]  [<ffffffff81162a88>] mempool_alloc_slab+0x15/0x17
[ 2379.263907]  [<ffffffff81162b7a>] mempool_alloc+0x72/0x154
[ 2379.269403]  [<ffffffff810c4b45>] ? lockdep_init_map+0xc9/0x5a3
[ 2379.275354]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2379.280754]  [<ffffffff8133fdc1>] bio_alloc_bioset+0xe8/0x1d7
[ 2379.286535]  [<ffffffff81643127>] kcryptd_crypt+0x1ab/0x325
[ 2379.292143]  [<ffffffff810998fd>] ? process_one_work+0x1ad/0x4e2
[ 2379.298208]  [<ffffffff810999d3>] process_one_work+0x283/0x4e2
[ 2379.304117]  [<ffffffff810c502d>] ? put_lock_stats.isra.9+0xe/0x20
[ 2379.310341]  [<ffffffff8109a860>] worker_thread+0x285/0x370
[ 2379.315946]  [<ffffffff8109a5db>] ? rescuer_thread+0x2d1/0x2d1
[ 2379.321840]  [<ffffffff8109f208>] kthread+0xff/0x107
[ 2379.326825]  [<ffffffff8184b1f2>] ret_from_fork+0x22/0x50
[ 2379.332299]  [<ffffffff8109f109>] ? kthread_create_on_node+0x1ea/0x1ea

[ 2385.193584] kworker/u8:1    D ffff880022e634b8     0  2342      2 0x00000000
[ 2385.200692] Workqueue: kcryptd kcryptd_crypt
[ 2385.205023]  ffff880022e634b8 00ff88011b3ccd80 ffff88011b3d62d8 ffff88011ae45200
[ 2385.212554]  ffff88011a472900 ffff880022e64000 0000000100098b0a ffff880022e634f0
[ 2385.220052]  ffff88011b3ccd80 ffff8800c5b19350 ffff880022e634d0 ffffffff81845cec
[ 2385.227547] Call Trace:
[ 2385.230002]  [<ffffffff81845cec>] schedule+0x8b/0xa3
[ 2385.235001]  [<ffffffff81849b5b>] schedule_timeout+0x20b/0x285
[ 2385.240893]  [<ffffffff810e6da6>] ? init_timer_key+0x112/0x112
[ 2385.246787]  [<ffffffff81849c33>] schedule_timeout_uninterruptible+0x1e/0x20
[ 2385.253858]  [<ffffffff81849c33>] ? schedule_timeout_uninterruptible+0x1e/0x20
[ 2385.261140]  [<ffffffff8117d72e>] wait_iff_congested+0x92/0x1b4
[ 2385.267083]  [<ffffffff810bdd00>] ? wait_woken+0x72/0x72
[ 2385.272448]  [<ffffffff811745c7>] shrink_inactive_list+0x3dc/0x4a1
[ 2385.278662]  [<ffffffff81174e8b>] shrink_zone_memcg+0x4c1/0x661
[ 2385.284643]  [<ffffffff81175107>] shrink_zone+0xdc/0x1e5
[ 2385.290006]  [<ffffffff81175107>] ? shrink_zone+0xdc/0x1e5
[ 2385.295518]  [<ffffffff811753b5>] do_try_to_free_pages+0x1a5/0x2c3
[ 2385.301739]  [<ffffffff811755f6>] try_to_free_pages+0x123/0x21f
[ 2385.307710]  [<ffffffff81168216>] __alloc_pages_nodemask+0x4c9/0x978
[ 2385.314097]  [<ffffffff8138027a>] ? debug_smp_processor_id+0x17/0x19
[ 2385.320509]  [<ffffffff8119fb2a>] new_slab+0xbc/0x3bb
[ 2385.325598]  [<ffffffff811a1acd>] ___slab_alloc.constprop.22+0x2fb/0x37b
[ 2385.332320]  [<ffffffff8138027a>] ? debug_smp_processor_id+0x17/0x19
[ 2385.338726]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2385.344784]  [<ffffffff8101f5ba>] ? sched_clock+0x9/0xd
[ 2385.350052]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2385.355494]  [<ffffffff810c6438>] ? __lock_acquire.isra.16+0x55e/0xb4c
[ 2385.362063]  [<ffffffff811a1ba4>] __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2385.369247]  [<ffffffff811a1ba4>] ? __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2385.376630]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2385.382689]  [<ffffffff811a1c78>] kmem_cache_alloc+0xa0/0x1d6
[ 2385.388493]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2385.394544]  [<ffffffff81162a88>] mempool_alloc_slab+0x15/0x17
[ 2385.400410]  [<ffffffff81162b7a>] mempool_alloc+0x72/0x154
[ 2385.405915]  [<ffffffff810c4b45>] ? lockdep_init_map+0xc9/0x5a3
[ 2385.411875]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2385.417311]  [<ffffffff8133fdc1>] bio_alloc_bioset+0xe8/0x1d7
[ 2385.423082]  [<ffffffff81643127>] kcryptd_crypt+0x1ab/0x325
[ 2385.428704]  [<ffffffff810998fd>] ? process_one_work+0x1ad/0x4e2
[ 2385.434771]  [<ffffffff810999d3>] process_one_work+0x283/0x4e2
[ 2385.440664]  [<ffffffff810c502d>] ? put_lock_stats.isra.9+0xe/0x20
[ 2385.446904]  [<ffffffff8109a860>] worker_thread+0x285/0x370
[ 2385.452510]  [<ffffffff8109a5db>] ? rescuer_thread+0x2d1/0x2d1
[ 2385.458385]  [<ffffffff8109f208>] kthread+0xff/0x107
[ 2385.463379]  [<ffffffff8184b1f2>] ret_from_fork+0x22/0x50
[ 2385.468776]  [<ffffffff8109f109>] ? kthread_create_on_node+0x1ea/0x1ea

[ 2386.089621] kworker/u8:0    D ffff88010cd434b8     0 15543      2 0x00000000
[ 2386.096770] Workqueue: kcryptd kcryptd_crypt
[ 2386.101060]  ffff88010cd434b8 00ff88011b1ccd80 ffff88011b1d62d8 ffffffff81e1d540
[ 2386.108598]  ffff8800c00ca900 ffff88010cd44000 00000001000982fe ffff88010cd434f0
[ 2386.116102]  ffff88011b1ccd80 ffff8800c5b19350 ffff88010cd434d0 ffffffff81845cec
[ 2386.123651] Call Trace:
[ 2386.126132]  [<ffffffff81845cec>] schedule+0x8b/0xa3
[ 2386.131167]  [<ffffffff81849b5b>] schedule_timeout+0x20b/0x285
[ 2386.137017]  [<ffffffff810e6da6>] ? init_timer_key+0x112/0x112
[ 2386.142902]  [<ffffffff81849c33>] schedule_timeout_uninterruptible+0x1e/0x20
[ 2386.149999]  [<ffffffff81849c33>] ? schedule_timeout_uninterruptible+0x1e/0x20
[ 2386.157271]  [<ffffffff8117d72e>] wait_iff_congested+0x92/0x1b4
[ 2386.163258]  [<ffffffff810bdd00>] ? wait_woken+0x72/0x72
[ 2386.168622]  [<ffffffff811745c7>] shrink_inactive_list+0x3dc/0x4a1
[ 2386.174862]  [<ffffffff81174e8b>] shrink_zone_memcg+0x4c1/0x661
[ 2386.180834]  [<ffffffff81175107>] shrink_zone+0xdc/0x1e5
[ 2386.186154]  [<ffffffff81175107>] ? shrink_zone+0xdc/0x1e5
[ 2386.191691]  [<ffffffff811753b5>] do_try_to_free_pages+0x1a5/0x2c3
[ 2386.197931]  [<ffffffff811755f6>] try_to_free_pages+0x123/0x21f
[ 2386.203893]  [<ffffffff81168216>] __alloc_pages_nodemask+0x4c9/0x978
[ 2386.210314]  [<ffffffff811a1776>] ? get_partial_node.isra.19+0x353/0x3af
[ 2386.217057]  [<ffffffff8119fb97>] new_slab+0x129/0x3bb
[ 2386.222246]  [<ffffffff811a1acd>] ___slab_alloc.constprop.22+0x2fb/0x37b
[ 2386.228979]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2386.235039]  [<ffffffff81049da2>] ? glue_xts_crypt_128bit+0x1a6/0x1d8
[ 2386.241529]  [<ffffffff8101f5ba>] ? sched_clock+0x9/0xd
[ 2386.246790]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2386.252248]  [<ffffffff810c6438>] ? __lock_acquire.isra.16+0x55e/0xb4c
[ 2386.258863]  [<ffffffff811a1ba4>] __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2386.266027]  [<ffffffff811a1ba4>] ? __slab_alloc.isra.17.constprop.21+0x57/0x8b
[ 2386.273358]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2386.279383]  [<ffffffff811a1c78>] kmem_cache_alloc+0xa0/0x1d6
[ 2386.285196]  [<ffffffff81162a88>] ? mempool_alloc_slab+0x15/0x17
[ 2386.291246]  [<ffffffff81162a88>] mempool_alloc_slab+0x15/0x17
[ 2386.297146]  [<ffffffff81162b7a>] mempool_alloc+0x72/0x154
[ 2386.302669]  [<ffffffff810c4b45>] ? lockdep_init_map+0xc9/0x5a3
[ 2386.308622]  [<ffffffff810ae420>] ? local_clock+0x20/0x22
[ 2386.314073]  [<ffffffff8133fdc1>] bio_alloc_bioset+0xe8/0x1d7
[ 2386.319843]  [<ffffffff81643127>] kcryptd_crypt+0x1ab/0x325
[ 2386.325443]  [<ffffffff810998fd>] ? process_one_work+0x1ad/0x4e2
[ 2386.331482]  [<ffffffff810999d3>] process_one_work+0x283/0x4e2
[ 2386.337331]  [<ffffffff810c502d>] ? put_lock_stats.isra.9+0xe/0x20
[ 2386.343528]  [<ffffffff8109a860>] worker_thread+0x285/0x370
[ 2386.349143]  [<ffffffff8109a5db>] ? rescuer_thread+0x2d1/0x2d1
[ 2386.355055]  [<ffffffff8109f208>] kthread+0xff/0x107
[ 2386.360048]  [<ffffffff8184b1f2>] ret_from_fork+0x22/0x50
[ 2386.365471]  [<ffffffff8109f109>] ? kthread_create_on_node+0x1ea/0x1ea

[ 2419.047134] workqueue kcryptd: flags=0x2a
[ 2419.051178]   pwq 8: cpus=0-3 flags=0x4 nice=0 active=4/4
[ 2419.056687]     in-flight: 1592:kcryptd_crypt, 1594:kcryptd_crypt, 2342:kcryptd_crypt, 15543:kcryptd_crypt
[ 2419.066479]     delayed: kcryptd_crypt, kcryptd_crypt, (...snipped...) kcryptd_crypt, (...too long to finish...)

Why can't we stop queuing so many kcryptd_crypt work requests?
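
Purely as an illustration of one possible answer (dm-crypt does not do this
here; the semaphore and its limit are made up), the submission side could
bound the backlog with a counting semaphore:

static struct semaphore crypt_inflight =
	__SEMAPHORE_INITIALIZER(crypt_inflight, 64);

	/* the submitter sleeps once 64 requests are pending */
	down(&crypt_inflight);
	queue_work(cc->crypt_queue, &io->work);

	/* ... and in kcryptd_crypt(), once a request completes: */
	up(&crypt_inflight);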

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15  7:22                           ` Michal Hocko
  2016-07-15  8:23                             ` Michal Hocko
@ 2016-07-15 12:00                             ` Mikulas Patocka
  2016-07-15 21:47                             ` David Rientjes
  2 siblings, 0 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 12:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Rientjes, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel



On Fri, 15 Jul 2016, Michal Hocko wrote:

> On Thu 14-07-16 13:38:42, David Rientjes wrote:
> > On Thu, 14 Jul 2016, Michal Hocko wrote:
> > 
> > > > It prevents the whole system from livelocking due to an oom killed process 
> > > > stalling forever waiting for mempool_alloc() to return.  No other threads 
> > > > may be oom killed while waiting for it to exit.
> > > 
> > > But it is true that the patch has unintended side effect for any mempool
> > > allocation from the reclaim path (aka PF_MEMALLOC context).
> > 
> > If PF_MEMALLOC context is allocating too much memory reserves, then I'd 
> > argue that is a problem independent of using mempool_alloc() since 
> > mempool_alloc() can evolve directly into a call to the page allocator.  
> > How does such a process guarantee that it cannot deplete memory reserves 
> > with a simple call to the page allocator?  Since nothing in the page 
> > allocator is preventing complete depletion of reserves (it simply uses 
> > ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be 
> > responsible.

Well-written drivers should use mempools and not allocate pages directly 
from the allocator. These drivers can proceed even if there is no memory 
available.

Badly written drivers (loop block device; swapping to NFS) are prone to 
deadlock anyway - there is no easy way to fix it.
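
For reference, the mempool pattern alluded to above reserves a minimum
number of objects at setup time, so forward progress never depends on the
page allocator. A generic sketch (MIN_POOL_IOS and io_cache are
placeholders, not actual dm code):

	pool = mempool_create_slab_pool(MIN_POOL_IOS, io_cache);
	if (!pool)
		return -ENOMEM;

	/* later, on the I/O path, this can always make progress as long
	 * as previously allocated objects keep being mempool_free()d: */
	io = mempool_alloc(pool, GFP_NOIO);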

> Well, the reclaim throttles the allocation request if there are too many
> pages under writeback and that should slow down the allocation rate and
> give the writeback some time to complete. But yes you are right there is
> nothing to prevent from memory depletion and it is really hard to come
> up with something with no-fail semantics.
> 
> Or do you have an idea how to throttle without knowing how much memory
> will actually be consumed on the writeout path?
> 
> > > So do you
> > > think we should rework your additional patch to be explicit about
> > > TIF_MEMDIE?
> > 
> > Not sure which additional patch you're referring to; the only patch that I 
> > proposed was commit f9054c70d28b, which saved hundreds of machines from 
> > timing out.
> 
> I would like to separate TIF_MEMDIE as access to memory reserves from the
> oom victim selection semantic. And let me repeat: your proposed patch
> has undesirable side effects, so we should think about a way to deal
> with those cases. It might work for your setups but it shouldn't break
> others at the same time. An OOM situation is quite unlikely compared to
> simple memory depletion by writing to swap...
>  
> > > Something like the following (not even compile tested for
> > > illustration). Tetsuo has properly pointed out that this doesn't work
> > > for multithreaded processes reliable but put that aside for now as that
> > > needs a fix on a different layer. I believe we can fix that quite
> > > easily after recent/planned changes.
> > > ---
> > > diff --git a/mm/mempool.c b/mm/mempool.c
> > > index 8f65464da5de..ea26d75c8adf 100644
> > > --- a/mm/mempool.c
> > > +++ b/mm/mempool.c
> > > @@ -322,20 +322,20 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  
> > >  	might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);
> > >  
> > > +	gfp_mask |= __GFP_NOMEMALLOC;   /* don't allocate emergency reserves */
> > >  	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
> > >  	gfp_mask |= __GFP_NOWARN;	/* failures are OK */
> > >  
> > >  	gfp_temp = gfp_mask & ~(__GFP_DIRECT_RECLAIM|__GFP_IO);
> > >  
> > >  repeat_alloc:
> > > -	if (likely(pool->curr_nr)) {
> > > -		/*
> > > -		 * Don't allocate from emergency reserves if there are
> > > -		 * elements available.  This check is racy, but it will
> > > -		 * be rechecked each loop.
> > > -		 */
> > > -		gfp_temp |= __GFP_NOMEMALLOC;
> > > -	}
> > > +	/*
> > > +	 * Make sure that the OOM victim will get access to memory reserves
> > > +	 * properly if there are no objects in the pool to prevent from
> > > +	 * livelocks.
> > > +	 */
> > > +	if (!likely(pool->curr_nr) && test_thread_flag(TIF_MEMDIE))
> > > +		gfp_temp &= ~__GFP_NOMEMALLOC;
> > >  
> > >  	element = pool->alloc(gfp_temp, pool->pool_data);
> > >  	if (likely(element != NULL))
> > > @@ -359,7 +359,7 @@ void *mempool_alloc(mempool_t *pool, gfp_t gfp_mask)
> > >  	 * We use gfp mask w/o direct reclaim or IO for the first round.  If
> > >  	 * alloc failed with that and @pool was empty, retry immediately.
> > >  	 */
> > > -	if ((gfp_temp & ~__GFP_NOMEMALLOC) != gfp_mask) {
> > > +	if ((gfp_temp & __GFP_DIRECT_RECLAIM) != (gfp_mask & __GFP_DIRECT_RECLAIM)) {
> > >  		spin_unlock_irqrestore(&pool->lock, flags);
> > >  		gfp_temp = gfp_mask;
> > >  		goto repeat_alloc;
> > 
> > This is bogus and quite obviously leads to oom livelock: if a process is 
> > holding a mutex and does mempool_alloc(), since __GFP_WAIT is allowed in 
> > process context for mempool allocation, it can stall here in an oom 
> > condition if there are no elements available on the mempool freelist.  If 
> > the oom victim contends the same mutex, the system livelocks and the same 
> > bug arises because the holder of the mutex loops forever.  This is the 
> > exact behavior that f9054c70d28b also fixes.
> 
> Just to make sure I understand properly:
> Task A				Task B			Task C
> current->flags = PF_MEMALLOC
> mutex_lock(&foo)		mutex_lock(&foo)	out_of_memory
> mempool_alloc()						  select_bad_process = Task B
>   alloc_pages(__GFP_NOMEMALLOC)
> 
> 
> That would be really unfortunate but it doesn't really differ much from
> other oom deadlocks when the victim is stuck behind an allocating task.
> This is a generic problem and our answer for that is the oom reaper
> which will tear down the address space of the victim asynchronously.
> Sure there is no guarantee it will free enough to get us unstuck because
> we are freeing only private unlocked memory but we rather fall back to
> another oom victim if the situation prevails even after the unmapping
> pass. So we shouldn't be stuck forever.
> 
> That being said, should we handle the mempool allocations the same way as
> any other oom deadlock due to locks?

I think that the mempool deadlock is really a non-existent issue.

The mempools for device mapper and most block device drivers guarantee
forward progress even if there is no memory available. They don't deadlock
as long as the underlying block device is processing requests. There is no
reason why they should deadlock when the OOM killer is activated.

> > These aren't hypothetical situations, the patch kept hundreds of machines 
> > from regularly timing out.  The fundamental reason is that mempool_alloc() 
> > must not loop forever in process context: that is needed when the 
> > allocator is either an oom victim itself or the oom victim is blocked by 
> > an allocator.  mempool_alloc() must guarantee forward progress in such a 
> > context.
> > 
> > The end result is that when in PF_MEMALLOC context, allocators must be 
> > responsible and not deplete all memory reserves.
> 
> How do you propose to guarantee that? You might have a really complex IO
> setup, and mempools have been the answer for guaranteeing forward progress
> for ages.
> 
> -- 
> Michal Hocko
> SUSE Labs

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15  8:35                       ` Michal Hocko
@ 2016-07-15 12:11                         ` Mikulas Patocka
  2016-07-15 12:22                           ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 12:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel



On Fri, 15 Jul 2016, Michal Hocko wrote:

> On Thu 14-07-16 13:35:35, Mikulas Patocka wrote:
> > On Thu, 14 Jul 2016, Michal Hocko wrote:
> > > On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > > > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > > > 
> > > > static int current_may_throttle(void)
> > > > {
> > > >         return !(current->flags & PF_LESS_THROTTLE) ||
> > > >                 current->backing_dev_info == NULL ||
> > > >                 bdi_write_congested(current->backing_dev_info);
> > > > }
> > > > --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> > > > true if one of the other conditions is met.
> > > 
> > > That is true but doesn't that mean that the device is congested and
> > > waiting a bit is the right thing to do?
> > 
> > You shouldn't really throttle mempool allocations at all. It's better to 
> > fail the allocation quickly and allocate from a mempool reserve than to 
> > wait 0.1 seconds in the reclaim path.
> 
> Well, but we do that already, no? The first allocation request is NOWAIT

The stacktraces showed that the kcryptd process was throttled when it 
tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
allocation, but unfortunately, this flag doesn't prevent the allocator 
from throttling.

I say that the process doing mempool allocation shouldn't ever be 
throttled. Maybe add __GFP_NOTHROTTLE?

> and then we try to consume an object from the pool. We are re-adding
> __GFP_DIRECT_RECLAIM in case both fail. The point of throttling is to
> prevent from scanning through LRUs too quickly while we know that the
> bdi is congested.

> > dm-crypt can do approximately 100MB/s. That means that it processes 25k 
> > swap pages per second. If you wait in mempool_alloc, the allocation would 
> > be satisfied in 0.00004s. If you wait in the allocator's throttle 
> > function, you waste 0.1s.
> > 
> > 
> > It is also questionable if those 0.1 second sleeps are reasonable at all. 
> > SSDs with 100k IOPS are common - they can drain the request queue in much 
> > less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
> > should be replaced with sleeps until the device stops being congested.
> 
> Well if we do not do throttle_vm_writeout then the only remaining
> writeout throttling for PF_LESS_THROTTLE is wait_iff_congested for
> the direct reclaim and that should wake up if the device stops being
> congested AFAIU.

I mean - the proper thing is to use an active wakeup for the throttling,
rather than retrying every 0.1 second. Polling for some condition is
generally a bad idea.

If there are too many pages under writeback, you should sleep on a wait 
queue. When the number of pages under writeback drops, wake up the wait 
queue.
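
A minimal sketch of that wakeup-driven scheme (illustrative only;
writeback_wait and writeback_above_limit() are made-up names, not existing
kernel symbols):

static DECLARE_WAIT_QUEUE_HEAD(writeback_wait);

	/* reclaim side: sleep until woken instead of polling every 0.1s */
	wait_event(writeback_wait, !writeback_above_limit(zone));

	/* writeback completion side: */
	if (!writeback_above_limit(zone))
		wake_up(&writeback_wait);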

Mikulas

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 12:11                         ` Mikulas Patocka
@ 2016-07-15 12:22                           ` Michal Hocko
  2016-07-15 17:02                             ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-15 12:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Fri 15-07-16 08:11:22, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jul 2016, Michal Hocko wrote:
> 
> > On Thu 14-07-16 13:35:35, Mikulas Patocka wrote:
> > > On Thu, 14 Jul 2016, Michal Hocko wrote:
> > > > On Thu 14-07-16 10:00:16, Mikulas Patocka wrote:
> > > > > But it needs other changes to honor the PF_LESS_THROTTLE flag:
> > > > > 
> > > > > static int current_may_throttle(void)
> > > > > {
> > > > >         return !(current->flags & PF_LESS_THROTTLE) ||
> > > > >                 current->backing_dev_info == NULL ||
> > > > >                 bdi_write_congested(current->backing_dev_info);
> > > > > }
> > > > > --- if you set PF_LESS_THROTTLE, current_may_throttle may still return 
> > > > > true if one of the other conditions is met.
> > > > 
> > > > That is true but doesn't that mean that the device is congested and
> > > > waiting a bit is the right thing to do?
> > > 
> > > You shouldn't really throttle mempool allocations at all. It's better to 
> > > fail the allocation quickly and allocate from a mempool reserve than to 
> > > wait 0.1 seconds in the reclaim path.
> > 
> > Well, but we do that already, no? The first allocation request is NOWAIT
> 
> The stacktraces showed that the kcryptd process was throttled when it 
> tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
> allocation, but unfortunately, this flag doesn't prevent the allocator 
> from throttling.

Yes and in fact it shouldn't prevent any throttling. The flag merely
says that the allocation should give up rather than retry
reclaim/compaction again and again.

> I say that the process doing mempool allocation shouldn't ever be 
> throttled. Maybe add __GFP_NOTHROTTLE?

A specific gfp flag would be an option but we are slowly running out of
bit space there and I am not yet convinced PF_LESS_THROTTLE is
unsuitable.
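
For context, the PF_LESS_THROTTLE route would mean marking the allocating
task around the writeout path, roughly like this (a sketch; the save/restore
keeps the bit intact if it was already set):

	unsigned int pflags = current->flags & PF_LESS_THROTTLE;

	current->flags |= PF_LESS_THROTTLE;
	/* ... perform the mempool-backed writeout ... */
	current->flags = (current->flags & ~PF_LESS_THROTTLE) | pflags;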

> > and then we try to consume an object from the pool. We are re-adding
> > __GFP_DIRECT_RECLAIM in case both fail. The point of throttling is to
> > prevent from scanning through LRUs too quickly while we know that the
> > bdi is congested.
> 
> > > dm-crypt can do approximately 100MB/s. That means that it processes 25k 
> > > swap pages per second. If you wait in mempool_alloc, the allocation would 
> > > be satisfied in 0.00004s. If you wait in the allocator's throttle 
> > > function, you waste 0.1s.
> > > 
> > > 
> > > It is also questionable if those 0.1 second sleeps are reasonable at all. 
> > > SSDs with 100k IOPS are common - they can drain the request queue in much 
> > > less time than 0.1 second. I think those hardcoded 0.1 second sleeps 
> > > should be replaced with sleeps until the device stops being congested.
> > 
> > Well if we do not do throttle_vm_writeout then the only remaining
> > writeout throttling for PF_LESS_THROTTLE is wait_iff_congested for
> > the direct reclaim and that should wake up if the device stops being
> > congested AFAIU.
> 
> I mean - the proper thing is to use an active wakeup for the throttling,
> rather than retrying every 0.1 second. Polling for some condition is
> generally a bad idea.
> 
> If there are too many pages under writeback, you should sleep on a wait 
> queue. When the number of pages under writeback drops, wake up the wait 
> queue.

I might be missing something, but isn't this exactly what happens in
wait_iff_congested? If the bdi doesn't see the congestion, it wakes up
the reclaim context even before the timeout. Or are we talking past each
other?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 12:22                           ` Michal Hocko
@ 2016-07-15 17:02                             ` Mikulas Patocka
  2016-07-18  7:22                               ` Michal Hocko
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 17:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel



On Fri, 15 Jul 2016, Michal Hocko wrote:

> On Fri 15-07-16 08:11:22, Mikulas Patocka wrote:
> > 
> > The stacktraces showed that the kcryptd process was throttled when it 
> > tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
> > allocation, but unfortunately, this flag doesn't prevent the allocator 
> > from throttling.
> 
> Yes and in fact it shouldn't prevent any throttling. The flag merely
> says that the allocation should give up rather than retry
> reclaim/compaction again and again.
> 
> > I say that the process doing mempool allocation shouldn't ever be 
> > throttled. Maybe add __GFP_NOTHROTTLE?
> 
> A specific gfp flag would be an option but we are slowly running out of
> bit space there and I am not yet convinced PF_LESS_THROTTLE is
> unsuitable.

PF_LESS_THROTTLE will make it throttle less, but it doesn't eliminate 
throttling entirely. So, maybe add PF_NO_THROTTLE? But PF_* flags are also 
almost exhausted.

> I might be missing something, but isn't this exactly what happens in
> wait_iff_congested? If the bdi doesn't see the congestion, it wakes up
> the reclaim context even before the timeout. Or are we talking past each
> other?

OK, I see that there is a wait queue in congestion_wait. I didn't notice it 
before.

Mikulas

> -- 
> Michal Hocko
> SUSE Labs
> 

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 11:25                           ` Mikulas Patocka
@ 2016-07-15 21:21                             ` David Rientjes
  0 siblings, 0 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-15 21:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Tetsuo Handa, mhocko, okozina, jmarchan, skozina, linux-mm, linux-kernel

On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> > Umm, show me an explicit guarantee where the oom reaper will free memory 
> > such that other threads may return memory to this process's mempool so it 
> > can make forward progress in mempool_alloc() without the need of utilizing 
> > memory reserves.  First, it might be helpful to show that the oom reaper 
> > is ever guaranteed to free any memory for a selected oom victim.
> 
> The function mempool_alloc sleeps with "io_schedule_timeout(5*HZ);"
> 
> So, if the oom reaper frees some memory into the page allocator, the 
> process that is stuck in mempool_alloc will sleep for up to 5 seconds, then 
> it will retry the allocation with "element = pool->alloc(gfp_temp, 
> pool->pool_data)" (that will allocate from the page allocator) and succeed.
> 

No, the state of the 4.7 oom killer does not explicitly guarantee any 
memory freeing of the victim and there is no guarantee that elements will 
be returned to the mempool.  If you're talking about patches that you're 
proposing for the 4.8 merge window, please post them with complete 
changelogs.  Thanks.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 11:21                           ` Mikulas Patocka
@ 2016-07-15 21:25                             ` David Rientjes
  2016-07-15 21:39                               ` Mikulas Patocka
  2016-07-18 15:14                             ` Johannes Weiner
  1 sibling, 1 reply; 60+ messages in thread
From: David Rientjes @ 2016-07-15 21:25 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel

On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> > There is no guarantee that _anything_ can return memory to the mempool,
> 
> You misunderstand mempools if you make such claims.
> 
> There is in fact a guarantee that objects will be returned to the mempool.
> In the past I reviewed device mapper thoroughly to make sure that it can
> make forward progress even if there is no available memory.
> 
> I don't know what I should tell you if you keep on repeating the same
> false claim over and over again. Should I explain mempool operations to
> you in detail? Or will you find it out on your own?
> 

If you are talking about patches you're proposing for 4.8 or any guarantee 
of memory freeing that the oom killer/reaper will provide in 4.8, that's 
fine.  However, the state of the 4.7 kernel is the same as it was when I 
fixed this issue, which timed out hundreds of our machines, and the 
claimed guarantee is contradicted by that evidence.  Our machines time out 
after two hours with 
the oom victim looping forever in mempool_alloc(), so if there was a 
guarantee that elements would be returned in a completely livelocked 
kernel in 4.7 or earlier kernels, that would not have been the case.  I 
frankly don't care about your patch reviewing of dm mempool usage when 
dm_request() livelocked our kernel.

Feel free to formally propose patches either for 4.7 or 4.8.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 21:25                             ` David Rientjes
@ 2016-07-15 21:39                               ` Mikulas Patocka
  2016-07-15 21:58                                 ` David Rientjes
  0 siblings, 1 reply; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 21:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel



On Fri, 15 Jul 2016, David Rientjes wrote:

> On Fri, 15 Jul 2016, Mikulas Patocka wrote:
> 
> > > There is no guarantee that _anything_ can return memory to the mempool,
> > 
> > You misunderstand mempools if you make such claims.
> > 
> > There is in fact a guarantee that objects will be returned to the mempool.
> > In the past I reviewed device mapper thoroughly to make sure that it can
> > make forward progress even if there is no available memory.
> > 
> > I don't know what I should tell you if you keep on repeating the same
> > false claim over and over again. Should I explain mempool operations to
> > you in detail? Or will you find it out on your own?
> > 
> 
> If you are talking about patches you're proposing for 4.8 or any guarantee 
> of memory freeing that the oom killer/reaper will provide in 4.8, that's 
> fine.  However, the state of the 4.7 kernel is the same as it was when I 
> fixed this issue, which timed out hundreds of our machines, and the 
> claimed guarantee is contradicted by that evidence.  Our machines time out 
> after two hours with 
> the oom victim looping forever in mempool_alloc(), so if there was a 

And what about the oom reaper? It should have freed all of the victim's
pages even if the victim is looping in mempool_alloc. Why didn't the oom
reaper free up memory?

> guarantee that elements would be returned in a completely livelocked 
> kernel in 4.7 or earlier kernels, that would not have been the case.  I 

And what kind of targets do you use in device mapper in the configuration 
that livelocked? Do you use some custom Google-developed drivers?

Please describe the whole stack of block I/O devices when this livelock 
happened.

Most device mapper drivers can really make forward progress when they are 
out of memory, so I'm interested in what kind of configuration you have.

> frankly don't care about your patch reviewing of dm mempool usage when 
> dm_request() livelocked our kernel.

If it livelocked, it is a bug in some underlying block driver, not a bug 
in mempool_alloc.

> Feel free to formally propose patches either for 4.7 or 4.8.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15  7:22                           ` Michal Hocko
  2016-07-15  8:23                             ` Michal Hocko
  2016-07-15 12:00                             ` Mikulas Patocka
@ 2016-07-15 21:47                             ` David Rientjes
  2016-07-18  7:39                               ` Michal Hocko
  2 siblings, 1 reply; 60+ messages in thread
From: David Rientjes @ 2016-07-15 21:47 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Fri, 15 Jul 2016, Michal Hocko wrote:

> > If PF_MEMALLOC context is allocating too much memory reserves, then I'd 
> > argue that is a problem independent of using mempool_alloc() since 
> > mempool_alloc() can evolve directly into a call to the page allocator.  
> > How does such a process guarantee that it cannot deplete memory reserves 
> > with a simple call to the page allocator?  Since nothing in the page 
> > allocator is preventing complete depletion of reserves (it simply uses 
> > ALLOC_NO_WATERMARKS), the caller in a PF_MEMALLOC context must be 
> > responsible.
> 
> Well, the reclaim throttles the allocation request if there are too many
> pages under writeback and that should slow down the allocation rate and
> give the writeback some time to complete. But yes you are right there is
> nothing to prevent from memory depletion and it is really hard to come
> up with something with no fail semantic.
> 

If the reclaimer is allocating memory, it can fully deplete memory 
reserves with ALLOC_NO_WATERMARKS without any direct reclaim itself and 
we're relying on kswapd entirely if nothing else is reclaiming in parallel 
(and depleting memory reserves itself in parallel).  It's a difficult 
problem because memory reserves can be very small and concurrent 
PF_MEMALLOC allocation contexts can lead to quick depletion.  I don't 
think it's a throttling problem itself; it's more one of scalability.

> I would like separate TIF_MEMDIE as an access to memory reserves from
> oom selection selection semantic. And let me repeat your proposed patch
> has a undesirable side effects so we should think about a way to deal
> with those cases. It might work for your setups but it shouldn't break
> others at the same time. OOM situation is quite unlikely compared to
> simple memory depletion by writing to a swap...
>  

I haven't proposed any patch, so I'm not sure what the reference is to.  There are 
two fundamental ways to go about it: (1) ensure mempool_alloc() can make 
forward progress (whether that's by way of gfp flags or access to memory 
reserves, which may depend on the process context such as PF_MEMALLOC) or 
(2) rely on an implementation detail of mempools to never access memory 
reserves, although it is shown to not livelock systems on 4.7 and earlier 
kernels, and instead rely on users of the same mempool to return elements 
to the freelist in all contexts, including oom contexts.  The mempool 
implementation itself shouldn't need any oom awareness, that should be a 
page allocator issue.

If the mempool user can guarantee that elements will be returned to the 
freelist in all contexts, we could relax the restriction that mempool 
users cannot use __GFP_NOMEMALLOC and leave it up to them to prevent 
access to memory reserves but only in situations where forward progress 
can be guaranteed.  That's a simple change and doesn't change mempool or 
page allocator behavior for everyone, but rather only for those that 
opt in.  I think this is the way the dm folks should proceed, but let's 
not encode any special restriction on access to memory reserves as an 
implementation detail of mempools, specifically for processes that have 
PF_MEMALLOC set.
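
The opt-in itself would be a one-line change at each call site that can
make that guarantee, assuming mempool_alloc() were changed to honor a
caller-supplied __GFP_NOMEMALLOC instead of managing the flag internally
(a sketch, not an existing interface):

	/*
	 * This pool's elements are always returned, even under oom, so
	 * never dip into the emergency reserves:
	 */
	io = mempool_alloc(cc->io_pool, GFP_NOIO | __GFP_NOMEMALLOC);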

> Just to make sure I understand properly:
> Task A				Task B			Task C
> current->flags = PF_MEMALLOC
> mutex_lock(&foo)		mutex_lock(&foo)	out_of_memory
> mempool_alloc()						  select_bad_process = Task B
>   alloc_pages(__GFP_NOMEMALLOC)
> 

Not sure who is grabbing foo first with this, I assume Task A and Task B 
are contending.  If that's the case, then yes, this is the dm_request() oom 
livelock that went unresolved for two hours on our machines and timed 
them all out.  This is a swapless environment that heavily oversubscribes 
the machine, so not everybody's use case, but it needs to be resolved.

> That would be really unfortunate but it doesn't really differ much from
> other oom deadlocks when the victim is stuck behind an allocating task.

I'm well aware of many of the system oom and memcg oom livelocks from 
experience, unfortunately :)

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 21:39                               ` Mikulas Patocka
@ 2016-07-15 21:58                                 ` David Rientjes
  2016-07-15 23:53                                   ` Mikulas Patocka
  0 siblings, 1 reply; 60+ messages in thread
From: David Rientjes @ 2016-07-15 21:58 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel

On Fri, 15 Jul 2016, Mikulas Patocka wrote:

> And what about the oom reaper? It should have freed all of the victim's
> pages even if the victim is looping in mempool_alloc. Why didn't the oom
> reaper free up memory?
> 

Is that possible with mlocked or shared memory?  Nope.  The oom killer 
does not have the benefit of selecting the process to kill that will 
likely free or reap the most memory; the choice is configurable by the 
user.
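
For reference, the selection heuristic is shifted by userspace (trimmed
from my recollection of 4.7-era oom_badness()):

	/* the score is rss + swap + page tables... */
	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
		 atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
	/* ...shifted by the user-set oom_score_adj */
	adj = (long)p->signal->oom_score_adj;
	adj *= totalpages / 1000;
	points += adj;

so the process that would free or reap the most memory is not necessarily
the one picked.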

> > guarantee that elements would be returned in a completely livelocked 
> > kernel in 4.7 or earlier kernels, that would not have been the case.  I 
> 
> And what kind of device mapper targets do you use in the configuration 
> that livelocked? Do you use some custom Google-developed drivers?
> 
> Please describe the whole stack of block I/O devices when this livelock 
> happened.
> 
> Most device mapper drivers can really make forward progress when they are 
> out of memory, so I'm interested in what kind of configuration you have.
> 

Kworkers are processing writeback; ext4_writepages() relies on kmem that 
is itself reclaiming memory through kmem_getpages(), and they are waiting 
on the oom victim to exit, so they endlessly loop in the page allocator 
themselves.  The same happens with __alloc_skb(), so we can intermittently 
lose network access to hundreds of machines.  No custom drivers are 
required for this to happen; the stack trace of the livelock victim has 
already been posted, and this can happen to anything in filemap_fault() 
when the task has TIF_MEMDIE set.

> > frankly don't care about your patch reviewing of dm mempool usage when 
> > dm_request() livelocked our kernel.
> 
> If it livelocked, it is a bug in some underlying block driver, not a bug 
> in mempool_alloc.
> 

Lol, the interface is quite clear and can be modified to allow mempool 
users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can 
guarantee elements will be returned to the freelist in all situations, 
including system oom situations.  We may revert that ourselves if our 
machines time out once we use a post-4.7 kernel and report that as 
necessary.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 21:58                                 ` David Rientjes
@ 2016-07-15 23:53                                   ` Mikulas Patocka
  0 siblings, 0 replies; 60+ messages in thread
From: Mikulas Patocka @ 2016-07-15 23:53 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel, dm-devel



On Fri, 15 Jul 2016, David Rientjes wrote:

> Kworkers are processing writeback; ext4_writepages() relies on kmem that 

ext4_writepages() is above device mapper, not below, so how could it block 
device mapper progress?

Do you use device mapper on top of a loop block device? Writing to loop 
is prone to deadlock anyway; you should avoid that in production code.

> is itself reclaiming memory through kmem_getpages(), and they are waiting 
> on the oom victim to exit, so they endlessly loop in the page allocator 
> themselves.  The same happens with __alloc_skb(), so we can intermittently 
> lose network access to hundreds of machines.  No custom drivers are 
> required for this to happen; the stack trace of the livelock victim has 
> already been posted, and this can happen to anything in filemap_fault() 
> when the task has TIF_MEMDIE set.

Again - filemap_fault() is above device mapper, not below (unless you use 
loop).

> > > frankly don't care about your patch reviewing of dm mempool usage when 
> > > dm_request() livelocked our kernel.
> > 
> > If it livelocked, it is a bug in some underlying block driver, not a bug 
> > in mempool_alloc.
> > 
> 
> Lol, the interface is quite clear and can be modified to allow mempool 
> users to set __GFP_NOMEMALLOC on their mempool_alloc() request if they can 
> guarantee elements will be returned to the freelist in all situations, 

You still didn't post the configuration of your block stack, so I have no 
clue why entries are not returned to the mempool.

> including system oom situations.  We may revert that ourselves if our 
> machines time out once we use a post-4.7 kernel and report that as 
> necessary.

Mikulas

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 17:02                             ` Mikulas Patocka
@ 2016-07-18  7:22                               ` Michal Hocko
  0 siblings, 0 replies; 60+ messages in thread
From: Michal Hocko @ 2016-07-18  7:22 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: Ondrej Kozina, Jerome Marchand, Stanislav Kozina, linux-mm,
	linux-kernel, dm-devel

On Fri 15-07-16 13:02:17, Mikulas Patocka wrote:
> 
> 
> On Fri, 15 Jul 2016, Michal Hocko wrote:
> 
> > On Fri 15-07-16 08:11:22, Mikulas Patocka wrote:
> > > 
> > > The stacktraces showed that the kcryptd process was throttled when it 
> > > tried to do mempool allocation. Mempool adds the __GFP_NORETRY flag to the 
> > > allocation, but unfortunately, this flag doesn't prevent the allocator 
> > > from throttling.
> > 
> > Yes and in fact it shouldn't prevent any throttling. The flag merely
> > says that the allocation should give up rather than retry
> > reclaim/compaction again and again.
> > 
> > > I say that the process doing mempool allocation shouldn't ever be 
> > > throttled. Maybe add __GFP_NOTHROTTLE?
> > 
> > A specific gfp flag would be an option but we are slowly running out of
> > bit space there and I am not yet convinced PF_LESS_THROTTLE is
> > unsuitable.
> 
> PF_LESS_THROTTLE will make it throttle less, but it doesn't eliminate 
> throttling entirely. So, maybe add PF_NO_THROTTLE? But PF_* flags are also 
> almost exhausted.

I am not really sure we can make anybody so special as to not throttle at
all. Throttling only upon seeing a congested backing device sounds like a
reasonable compromise. Besides that, it seems that we do not really need
to eliminate wait_iff_congested() for dm to work properly again, AFAIU. I
plan to repost both patches today after some more internal review. If we
need more changes I would suggest making them in separate patches.
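
Roughly what I have in mind (a sketch of the idea only, not the patches I
will post):

	/* mm/mempool.c: flag the task for the duration of the allocation */
	unsigned int pflags = current->flags;

	current->flags |= PF_LESS_THROTTLE;
	element = pool->alloc(gfp_temp, pool->pool_data);
	tsk_restore_flags(current, pflags, PF_LESS_THROTTLE);

mm/vmscan.c already honors the flag via current_may_throttle(), which
only throttles a PF_LESS_THROTTLE task when its backing device is
congested.
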
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 21:47                             ` David Rientjes
@ 2016-07-18  7:39                               ` Michal Hocko
  2016-07-18 21:03                                 ` David Rientjes
  0 siblings, 1 reply; 60+ messages in thread
From: Michal Hocko @ 2016-07-18  7:39 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Fri 15-07-16 14:47:30, David Rientjes wrote:
> On Fri, 15 Jul 2016, Michal Hocko wrote:
[...]
> > And let me repeat: your proposed patch
> > has undesirable side effects, so we should think about a way to deal
> > with those cases. It might work for your setups but it shouldn't break
> > others at the same time. OOM situations are quite unlikely compared to
> > simple memory depletion by writing to swap...
> >  
> 
> I haven't proposed any patch, not sure what the reference is to.

I was talking about f9054c70d28b ("mm, mempool: only set
__GFP_NOMEMALLOC if there are free elements"). Do you at least recognize
that it has caused a regression which is more likely to hit than the OOM
lockup you are referring to, and which might be very specific to your
particular workload? I would really like to move on here and come up
with a fix which can handle dm-crypt swapout gracefully and also deal
with the typical case when the OOM victim is inside mempool_alloc(),
which should help your usecase as well (at least the writeout path).

> There are 
> two fundamental ways to go about it: (1) ensure mempool_alloc() can make 
> forward progress (whether that's by way of gfp flags or access to memory 
> reserves, which may depend on the process context such as PF_MEMALLOC) or 
> (2) rely on an implementation detail of mempools never accessing memory 
> reserves, although that is shown to livelock systems on 4.7 and earlier 
> kernels, and instead rely on users of the same mempool to return elements 
> to the freelist in all contexts, including oom contexts.  The mempool 
> implementation itself shouldn't need any oom awareness; that should be a 
> page allocator issue.

OK, I agree that we have a certain layer violation here. __GFP_NOMEMALLOC at
the mempool level is kind of a hack (like the whole existence of the
flag, TBH). So if you believe that the OOM part should be handled at the
page allocator level, then that has already been proposed
http://lkml.kernel.org/r/2d5e1f84-e886-7b98-cb11-170d7104fd13@I-love.SAKURA.ne.jp
and was not welcomed because it might have other side effects, as _all_
__GFP_NOMEMALLOC users would be affected.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-15 11:21                           ` Mikulas Patocka
  2016-07-15 21:25                             ` David Rientjes
@ 2016-07-18 15:14                             ` Johannes Weiner
  1 sibling, 0 replies; 60+ messages in thread
From: Johannes Weiner @ 2016-07-18 15:14 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: David Rientjes, Michal Hocko, Tetsuo Handa, Ondrej Kozina,
	Jerome Marchand, Stanislav Kozina, linux-mm, linux-kernel,
	dm-devel, Dave Chinner

CC Dave Chinner, who I recall had strong opinions on the mempool model

The context is commit f9054c7 ("mm, mempool: only set __GFP_NOMEMALLOC
if there are free elements"), which gives MEMALLOC/TIF_MEMDIE mempool
allocations access to the system emergency reserves when there is no
reserved object currently residing in the mempool.
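
The change boils down to this (my paraphrase of the commit, not the
literal diff):

	/* mm/mempool.c, mempool_alloc() */
-	gfp_mask |= __GFP_NOMEMALLOC;	/* reserves always off limits */
+	if (likely(pool->curr_nr))	/* only while free elements remain */
+		gfp_mask |= __GFP_NOMEMALLOC;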

On Fri, Jul 15, 2016 at 07:21:59AM -0400, Mikulas Patocka wrote:
> On Thu, 14 Jul 2016, David Rientjes wrote:
> 
> > There is no guarantee that _anything_ can return memory to the mempool,
> 
> You misunderstand mempools if you make such claims.

Uhm, fully agreed.

The point of mempools is that they have their own reserves, separate
from the system reserves, to make forward progress in OOM situations.

All mempool object holders promise to make forward progress, and when
memory is depleted, the mempool allocations serialize against each
other. In this case, every allocation has to wait for in-flight IO to
finish to pass the reserved object on to the next IO. That's how the
mempool model is designed. The commit in question breaks this by not
waiting for outstanding object holders and instead quickly depletes
the system reserves. That's a mempool causing a memory deadlock...
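
In code terms, the contract looks like this (a sketch; MIN_IOS and
struct my_io are made-up names):

	mempool_t *pool = mempool_create_kmalloc_pool(MIN_IOS,
						      sizeof(struct my_io));

	/* submission path: with GFP_NOIO this never fails; if need be it
	 * sleeps until an in-flight io returns its element */
	struct my_io *io = mempool_alloc(pool, GFP_NOIO);

	/* completion path: must always run and give the element back */
	mempool_free(io, pool);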

David observed systems hanging 2+h inside mempool allocations. But
where would an object holder get stuck? It can't be taking a lock
that the waiting mempool_alloc() is holding, obviously. It also can't
be waiting for another allocation: it makes no sense to use mempools
to guarantee forward progress but then have the whole sequence rely
on an unguaranteed allocation succeeding after the mempool ones. So
how could a system-wide OOM situation cause a mempool holder to hang?

These hangs are fishy, but it seems reasonable to assume that somebody
is breaking the mempool contract somewhere. The solution cannot be
to abandon the mempool model. f9054c7 should be reverted.

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: System freezes after OOM
  2016-07-18  7:39                               ` Michal Hocko
@ 2016-07-18 21:03                                 ` David Rientjes
  0 siblings, 0 replies; 60+ messages in thread
From: David Rientjes @ 2016-07-18 21:03 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mikulas Patocka, Tetsuo Handa, Ondrej Kozina, Jerome Marchand,
	Stanislav Kozina, linux-mm, linux-kernel

On Mon, 18 Jul 2016, Michal Hocko wrote:

> > There are 
> > two fundamental ways to go about it: (1) ensure mempool_alloc() can make 
> > forward progress (whether that's by way of gfp flags or access to memory 
> > reserves, which may depend on the process context such as PF_MEMALLOC) or 
> > (2) rely on an implementation detail of mempools never accessing memory 
> > reserves, although that is shown to livelock systems on 4.7 and earlier 
> > kernels, and instead rely on users of the same mempool to return elements 
> > to the freelist in all contexts, including oom contexts.  The mempool 
> > implementation itself shouldn't need any oom awareness; that should be a 
> > page allocator issue.
> 
> OK, I agree that we have a certain layer violation here. __GFP_NOMEMALLOC at
> the mempool level is kind of a hack (like the whole existence of the
> flag, TBH). So if you believe that the OOM part should be handled at the
> page allocator level, then that has already been proposed
> http://lkml.kernel.org/r/2d5e1f84-e886-7b98-cb11-170d7104fd13@I-love.SAKURA.ne.jp
> and was not welcomed because it might have other side effects, as _all_
> __GFP_NOMEMALLOC users would be affected.
> 

__GFP_NOMEMALLOC is opt-in and, in this context, works around PF_MEMALLOC 
to prevent depletion of the reserves, so it seems trivial to allow 
mempool_alloc(__GFP_NOMEMALLOC) in contexts where it's needed and leave it 
to the user.
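
E.g. (a sketch; pool here is any mempool whose user can guarantee element
return even in oom situations):

	element = mempool_alloc(pool, GFP_NOIO | __GFP_NOMEMALLOC);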

^ permalink raw reply	[flat|nested] 60+ messages in thread

end of thread, other threads:[~2016-07-18 21:04 UTC | newest]

Thread overview: 60+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <57837CEE.1010609@redhat.com>
     [not found] ` <f80dc690-7e71-26b2-59a2-5a1557d26713@redhat.com>
     [not found]   ` <9be09452-de7f-d8be-fd5d-4a80d1cd1ba3@redhat.com>
2016-07-11 15:43     ` System freezes after OOM Mikulas Patocka
2016-07-12  6:49       ` Michal Hocko
2016-07-12 23:44         ` Mikulas Patocka
2016-07-13  8:35           ` Jerome Marchand
2016-07-13 11:14             ` Michal Hocko
2016-07-13 14:21               ` Mikulas Patocka
2016-07-13 11:10           ` Michal Hocko
2016-07-13 12:50             ` Michal Hocko
2016-07-13 13:44               ` Milan Broz
2016-07-13 15:21                 ` Mikulas Patocka
2016-07-14  9:09                   ` Michal Hocko
2016-07-14  9:46                     ` Milan Broz
2016-07-13 15:02             ` Mikulas Patocka
2016-07-14 10:51               ` [dm-devel] " Ondrej Kozina
2016-07-14 12:51               ` Michal Hocko
2016-07-14 14:00                 ` Mikulas Patocka
2016-07-14 14:59                   ` Michal Hocko
2016-07-14 15:25                     ` Ondrej Kozina
2016-07-14 17:35                     ` Mikulas Patocka
2016-07-15  8:35                       ` Michal Hocko
2016-07-15 12:11                         ` Mikulas Patocka
2016-07-15 12:22                           ` Michal Hocko
2016-07-15 17:02                             ` Mikulas Patocka
2016-07-18  7:22                               ` Michal Hocko
2016-07-14 14:08                 ` Ondrej Kozina
2016-07-14 15:31                   ` Michal Hocko
2016-07-14 17:07                     ` Ondrej Kozina
2016-07-14 17:36                       ` Michal Hocko
2016-07-14 17:39                         ` Michal Hocko
2016-07-15 11:42                       ` Tetsuo Handa
2016-07-13 13:19           ` Tetsuo Handa
2016-07-13 13:39             ` Michal Hocko
2016-07-13 14:18               ` Mikulas Patocka
2016-07-13 14:56                 ` Michal Hocko
2016-07-13 15:11                   ` Mikulas Patocka
2016-07-13 23:53                     ` David Rientjes
2016-07-14 11:01                       ` Tetsuo Handa
2016-07-14 12:29                         ` Mikulas Patocka
2016-07-14 20:26                         ` David Rientjes
2016-07-14 21:40                           ` Tetsuo Handa
2016-07-14 22:04                             ` David Rientjes
2016-07-15 11:25                           ` Mikulas Patocka
2016-07-15 21:21                             ` David Rientjes
2016-07-14 12:27                       ` Mikulas Patocka
2016-07-14 20:22                         ` David Rientjes
2016-07-15 11:21                           ` Mikulas Patocka
2016-07-15 21:25                             ` David Rientjes
2016-07-15 21:39                               ` Mikulas Patocka
2016-07-15 21:58                                 ` David Rientjes
2016-07-15 23:53                                   ` Mikulas Patocka
2016-07-18 15:14                             ` Johannes Weiner
2016-07-14 15:29                       ` Michal Hocko
2016-07-14 20:38                         ` David Rientjes
2016-07-15  7:22                           ` Michal Hocko
2016-07-15  8:23                             ` Michal Hocko
2016-07-15 12:00                             ` Mikulas Patocka
2016-07-15 21:47                             ` David Rientjes
2016-07-18  7:39                               ` Michal Hocko
2016-07-18 21:03                                 ` David Rientjes
2016-07-14  0:01             ` David Rientjes
