* zram OOM behavior
@ 2012-09-28 17:32 Luigi Semenzato
  2012-10-03 13:30 ` Konrad Rzeszutek Wilk
  2012-10-15 14:44 ` Minchan Kim
  0 siblings, 2 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-09-28 17:32 UTC (permalink / raw)
  To: linux-mm

Greetings,

We are experimenting with zram in Chrome OS.  It works quite well
until the system runs out of memory, at which point it seems to hang,
but we suspect it is thrashing.

Before the (apparent) hang, the OOM killer gets rid of a few
processes, but then the other processes gradually stop responding,
until the entire system becomes unresponsive.

I am wondering if anybody has run into this.  Thanks!

Luigi

P.S.  For those who wish to know more:

1. We use the min_filelist_kbytes patch
(http://lwn.net/Articles/412313/) (I am not sure if it made it into
the standard kernel) and set min_filelist_kbytes to 50 MB.  (This may
not matter, as it's unlikely to make things worse.)

2. We swap only to compressed ram.  The setup is very simple:

  echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize ||
      logger -t "$UPSTART_JOB" "failed to set zram size"
  mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed"
  swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed"

For ZRAM_SIZE_KB, we typically use 1.5 times the size of RAM (which is
2 or 4 GB).  The compression factor is about 3:1.  The hangs happen for
quite a wide range of zram sizes.


* Re: zram OOM behavior
  2012-09-28 17:32 zram OOM behavior Luigi Semenzato
@ 2012-10-03 13:30 ` Konrad Rzeszutek Wilk
       [not found]   ` <CAA25o9SwO209DD6CUx-LzhMt9XU6niGJ-fBPmgwfcrUvf0BPWA@mail.gmail.com>
  2012-10-15 14:44 ` Minchan Kim
  1 sibling, 1 reply; 67+ messages in thread
From: Konrad Rzeszutek Wilk @ 2012-10-03 13:30 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: linux-mm

On Fri, Sep 28, 2012 at 1:32 PM, Luigi Semenzato <semenzato@google.com> wrote:
> Greetings,
>
> We are experimenting with zram in Chrome OS.  It works quite well
> until the system runs out of memory, at which point it seems to hang,
> but we suspect it is thrashing.

Or spinning in some sad loop. Does the kernel have the CONFIG_DETECT_*
options to figure out what is happening? Can you invoke the Alt-SysRQ
when it is hung?
>
> Before the (apparent) hang, the OOM killer gets rid of a few
> processes, but then the other processes gradually stop responding,
> until the entire system becomes unresponsive.

Does the OOM give you an idea what the memory state is? Can you
actually provide the dmesg?



* Re: zram OOM behavior
       [not found]   ` <CAA25o9SwO209DD6CUx-LzhMt9XU6niGJ-fBPmgwfcrUvf0BPWA@mail.gmail.com>
@ 2012-10-12 23:30     ` Luigi Semenzato
  0 siblings, 0 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-12 23:30 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, linux-mm

I fixed the "hang with compressed swap" problem but I cannot claim I
understand the code very well, before or after the fix.  However, the
fix seems to make sense, unless I am misinterpreting something.

In vmscan.c there are a few places where the amount of reclaimable
memory is computed, in the presence or absence of swap.  For instance
here:

unsigned long zone_reclaimable_pages(struct zone *zone)
{
        int nr;

        nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                zone_page_state(zone, NR_INACTIVE_FILE);

        if (nr_swap_pages > 0)
                nr += zone_page_state(zone, NR_ACTIVE_ANON) +
                        zone_page_state(zone, NR_INACTIVE_ANON);

        return nr;
}

But this code seems to assume that if there is any swap space left,
then there is infinite swap space left.  If there is only a little
swap space left, only that many ANON pages may be swapped out.  So I
replaced part of the above with

anon = zone_page_state(zone, NR_ACTIVE_ANON) +
       zone_page_state(zone, NR_INACTIVE_ANON);

if (total_swap_pages > 0)
        nr += min(anon, nr_swap_pages);

and, as I mentioned, did something equivalent in a couple of other
places.  This fixes the hangs.  I think the hangs happened because the
page allocator thought that there was reclaimable memory and kept
trying to reclaim it unsuccessfully.
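
For reference, here is a minimal sketch of the clamped computation
folded back into zone_reclaimable_pages() (illustrative only; it
assumes nr_swap_pages counts the remaining free swap slots, and it
ignores the fact that the counter is global rather than per-zone):

unsigned long zone_reclaimable_pages(struct zone *zone)
{
        unsigned long nr, anon;

        /* File-backed pages are reclaimable regardless of swap. */
        nr = zone_page_state(zone, NR_ACTIVE_FILE) +
                zone_page_state(zone, NR_INACTIVE_FILE);

        /* Anon pages are reclaimable only up to the free swap slots. */
        anon = zone_page_state(zone, NR_ACTIVE_ANON) +
                zone_page_state(zone, NR_INACTIVE_ANON);

        if (total_swap_pages > 0)
                nr += min_t(unsigned long, anon, nr_swap_pages);

        return nr;
}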

But it's still hard to believe that the original code could be *that*
wrong, so what am I missing?

Or is it possible that there isn't enough interest in improving
low-memory and out-of-memory behavior?  This is rather important on
consumer devices, such as Chromebooks.

Of course the zram module is not your standard swap device (it
allocates memory to free more memory).

My colleague Mandeep Baines submitted a patch a year or two ago that
prevents thrashing in the absence of swap.  Even without swap, the
system can still thrash, because it evicts executable pages, which are
file-backed.  His patch is just a few lines: it stops the mm from
evicting the last X megabytes of FILE memory, where X = 50 works well
for us.  Thrashing is nasty, and his patch fixes it, yet it is not
included in ToT.
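
The mechanism is roughly the following (a sketch, assuming the patch
hooks into get_scan_count() in vmscan.c; the names and exact placement
in the Chrome OS patch may differ):

/* Sysctl knob; e.g. 50000 keeps ~50 MB of page cache resident. */
int min_filelist_kbytes;

static bool file_lru_is_low(struct zone *zone)
{
        /* Convert the kbytes threshold to pages. */
        unsigned long floor = min_filelist_kbytes >> (PAGE_SHIFT - 10);
        unsigned long file = zone_page_state(zone, NR_ACTIVE_FILE) +
                zone_page_state(zone, NR_INACTIVE_FILE);

        return file < floor;
}

get_scan_count() then sets the file scan targets to zero whenever
file_lru_is_low() is true, so reclaim leaves the last ~50 MB of page
cache (including executable text) alone.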

Thank you for any elucidation!




On Wed, Oct 3, 2012 at 8:33 AM, Luigi Semenzato <semenzato@google.com> wrote:
> On Wed, Oct 3, 2012 at 6:30 AM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote:
>> On Fri, Sep 28, 2012 at 1:32 PM, Luigi Semenzato <semenzato@google.com> wrote:
>>> Greetings,
>>>
>>> We are experimenting with zram in Chrome OS.  It works quite well
>>> until the system runs out of memory, at which point it seems to hang,
>>> but we suspect it is thrashing.
>>
>> Or spinning in some sad loop. Does the kernel have the CONFIG_DETECT_*
>> options to figure out what is happening?
>
> Don't think so, but will check and enable it.
>
> Can you invoke the Alt-SysRQ
>> when it is hung?
>
> I don't think we have that enabled, but I will check.
>
>>>
>>> Before the (apparent) hang, the OOM killer gets rid of a few
>>> processes, but then the other processes gradually stop responding,
>>> until the entire system becomes unresponsive.
>>
>> Does the OOM give you an idea what the memory state is?
>> Can you
>> actually provide the dmesg?
>
> I may be able to do that, through the serial line.
>
> Thanks, I will reply-all when I have more info.  Didn't want to spam
> the list for now.
>

* Re: zram OOM behavior
  2012-09-28 17:32 zram OOM behavior Luigi Semenzato
  2012-10-03 13:30 ` Konrad Rzeszutek Wilk
@ 2012-10-15 14:44 ` Minchan Kim
  2012-10-15 18:54   ` Luigi Semenzato
  1 sibling, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-10-15 14:44 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: linux-mm

Hello,

On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote:
> Greetings,
> 
> We are experimenting with zram in Chrome OS.  It works quite well
> until the system runs out of memory, at which point it seems to hang,
> but we suspect it is thrashing.
> 
> Before the (apparent) hang, the OOM killer gets rid of a few
> processes, but then the other processes gradually stop responding,
> until the entire system becomes unresponsive.

Why do you think it's a zram problem?  If you use a regular swap device
as storage instead of zram, does the problem disappear?

Could you do sysrq+t,m several times and post the output while the hang
happens?  /proc/vmstat could be helpful, too.

> 
> I am wondering if anybody has run into this.  Thanks!
> 
> Luigi
> 
> P.S.  For those who wish to know more:
> 
> 1. We use the min_filelist_kbytes patch
> (http://lwn.net/Articles/412313/)  (I am not sure if it made it into
> the standard kernel) and set min_filelist_kbytes to 50Mb.  (This may
> not matter, as it's unlikely to make things worse.)

One problem I see with this patch is that it might prevent
zone->pages_scanned from increasing when swap is full or the anon pages
are very few, even though there are lots of file-backed pages.  That
means OOM can't occur and the page allocator could loop forever.
Please look at zone_reclaimable().
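
(For readers following along: zone_reclaimable() in kernels of that era
was roughly

static bool zone_reclaimable(struct zone *zone)
{
        return zone->pages_scanned < zone_reclaimable_pages(zone) * 6;
}

so a zone is considered unreclaimable only once pages_scanned has grown
past six times the estimate of reclaimable pages.)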

Have you ever tested it without the above patch?

> 
> 2. We swap only to compressed ram.  The setup is very simple:
> 
>  echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize ||
>       logger -t "$UPSTART_JOB" "failed to set zram size"
>   mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed"
>   swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed"
> 
> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or
> 4 Gb).  The compression factor is about 3:1.  The hangs happen for
> quite a wide range of zram sizes.
> 

-- 
Kind Regards,
Minchan Kim


* Re: zram OOM behavior
  2012-10-15 14:44 ` Minchan Kim
@ 2012-10-15 18:54   ` Luigi Semenzato
  2012-10-16  6:18     ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-15 18:54 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm

On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote:
> Hello,
>
> On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote:
>> Greetings,
>>
>> We are experimenting with zram in Chrome OS.  It works quite well
>> until the system runs out of memory, at which point it seems to hang,
>> but we suspect it is thrashing.
>>
>> Before the (apparent) hang, the OOM killer gets rid of a few
>> processes, but then the other processes gradually stop responding,
>> until the entire system becomes unresponsive.
>
> Why do you think it's zram problem? If you use swap device as storage
> instead of zram, does the problem disappear?

I haven't tried with a swap device, but that is a good suggestion.

I didn't want to swap to disk (too slow compared to zram, so it's not
the same experiment any more), but I could preallocate a RAM disk and
swap to that.

> Could you do sysrq+t,m several time and post it while hang happens?
> /proc/vmstat could be helpful, too.

The stack traces look mostly like this:

[ 2058.069020]  [<810681c4>] handle_edge_irq+0x8f/0xb1
[ 2058.069028]  <IRQ>  [<810037ed>] ? do_IRQ+0x3f/0x98
[ 2058.069044]  [<813b7eb0>] ? common_interrupt+0x30/0x38
[ 2058.069058]  [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108
[ 2058.069072]  [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3
[ 2058.069085]  [<813b70d5>] ? _raw_spin_lock+0xd/0xf
[ 2058.069097]  [<810b418c>] ? put_super+0x15/0x29
[ 2058.069108]  [<810b41ba>] ? drop_super+0x1a/0x1d
[ 2058.069119]  [<810b4d04>] ? prune_super+0x106/0x110
[ 2058.069132]  [<81093647>] ? shrink_slab+0x7f/0x22f
[ 2058.069144]  [<81095943>] ? try_to_free_pages+0x1b7/0x2e6
[ 2058.069158]  [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5
[ 2058.069173]  [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf
[ 2058.069185]  [<810a9d50>] ? swapin_readahead+0x61/0x8d
[ 2058.069198]  [<8109fea0>] ? handle_pte_fault+0x310/0x5fb
[ 2058.069208]  [<8100223a>] ? do_signal+0x470/0x4fe
[ 2058.069220]  [<810a02cc>] ? handle_mm_fault+0xae/0xbd
[ 2058.069233]  [<8101d0f9>] ? do_page_fault+0x265/0x284
[ 2058.069247]  [<81192b32>] ? copy_to_user+0x3e/0x49
[ 2058.069257]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
[ 2058.069270]  [<81009279>] ? init_fpu+0x73/0x81
[ 2058.069280]  [<8100275e>] ? math_state_restore+0x1f/0xa0
[ 2058.069290]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
[ 2058.069303]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
[ 2058.069315]  [<813b7737>] ? error_code+0x67/0x6c

The bottom part of the stack varies, but most processes are spending a
lot of time in prune_super().  There is a pretty high number of
mounted file systems, and do_try_to_free_pages() keeps calling
shrink_slab() even when there is nothing to reclaim there.

In addition, do_try_to_free_pages() keeps returning 1 because
all_unreclaimable() at the end is always false.  The allocator thinks
that zone 1 has freeable pages (zones 0 and 2 do not).  That prevents
the allocator from ooming.

I went into some more depth, but didn't quite untangle all that goes
on.  In any case, this explains why I came up with the theory that
somehow mm is too optimistic about how many pages are freeable.  Then I
found what looks like a smoking gun in vmscan.c:

if (nr_swap_pages > 0)
    nr += zone_page_state(zone, NR_ACTIVE_ANON) +
            zone_page_state(zone, NR_INACTIVE_ANON);

which seems to ignore that not all ANON pages are freeable if swap
space is limited.

Pretty much all processes hang while trying to allocate memory.  Those
that don't allocate memory keep running fine.

vmstat 1 shows a large amount of swapping activity, which drops to 0
when the processes hang.

/proc/meminfo and /proc/vmstat are at the bottom.

>
>>
>> I am wondering if anybody has run into this.  Thanks!
>>
>> Luigi
>>
>> P.S.  For those who wish to know more:
>>
>> 1. We use the min_filelist_kbytes patch
>> (http://lwn.net/Articles/412313/)  (I am not sure if it made it into
>> the standard kernel) and set min_filelist_kbytes to 50Mb.  (This may
>> not matter, as it's unlikely to make things worse.)
>
> One of the problem I look at this patch is it might prevent
> increasing of zone->pages_scanned when the swap if full or anon pages
> are very small although there are lots of file-backed pages.
> It means OOM can't occur and page allocator could loop forever.
> Please look at zone_reclaimable.

Yes---I think you are right.  It didn't matter to us because we don't
use swap.  The problem looks fixable.

> Have you ever test it without above patch?

Good suggestion.  I just did.  Almost all text pages are evicted, and
then the system thrashes so badly that the hang detector kicks in
after a couple of minutes and panics.

Thank you for the very helpful suggestions!


MemTotal:        2002292 kB
MemFree:           15148 kB
Buffers:             260 kB
Cached:           169952 kB
SwapCached:       149448 kB
Active:           722608 kB
Inactive:         290824 kB
Active(anon):     682680 kB
Inactive(anon):   230888 kB
Active(file):      39928 kB
Inactive(file):    59936 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:         74504 kB
HighFree:              0 kB
LowTotal:        1927788 kB
LowFree:           15148 kB
SwapTotal:       2933044 kB
SwapFree:          47968 kB
Dirty:                 0 kB
Writeback:            56 kB
AnonPages:        695180 kB
Mapped:            73276 kB
Shmem:             70276 kB
Slab:              19596 kB
SReclaimable:       9152 kB
SUnreclaim:        10444 kB
KernelStack:        1448 kB
PageTables:         9964 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3934188 kB
Committed_AS:    4371740 kB
VmallocTotal:     122880 kB
VmallocUsed:       22268 kB
VmallocChunk:     100340 kB
DirectMap4k:       34808 kB
DirectMap2M:     1927168 kB


nr_free_pages 3776
nr_inactive_anon 58243
nr_active_anon 172106
nr_inactive_file 14984
nr_active_file 9982
nr_unevictable 0
nr_mlock 0
nr_anon_pages 174840
nr_mapped 18387
nr_file_pages 80762
nr_dirty 0
nr_writeback 13
nr_slab_reclaimable 2290
nr_slab_unreclaimable 2611
nr_page_table_pages 2471
nr_kernel_stack 180
nr_unstable 0
nr_bounce 0
nr_vmscan_write 679247
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 416
nr_isolated_file 0
nr_shmem 17637
nr_dirtied 7630
nr_written 686863
nr_anon_transparent_hugepages 0
nr_dirty_threshold 151452
nr_dirty_background_threshold 2524
pgpgin 284189
pgpgout 2748940
pswpin 5602
pswpout 679271
pgalloc_dma 9976
pgalloc_normal 1426651
pgalloc_high 34659
pgalloc_movable 0
pgfree 1475099
pgactivate 58092
pgdeactivate 745734
pgfault 1489876
pgmajfault 1098
pgrefill_dma 8557
pgrefill_normal 742123
pgrefill_high 4088
pgrefill_movable 0
pgsteal_kswapd_dma 199
pgsteal_kswapd_normal 48387
pgsteal_kswapd_high 2443
pgsteal_kswapd_movable 0
pgsteal_direct_dma 7688
pgsteal_direct_normal 652670
pgsteal_direct_high 6242
pgsteal_direct_movable 0
pgscan_kswapd_dma 268
pgscan_kswapd_normal 105036
pgscan_kswapd_high 8395
pgscan_kswapd_movable 0
pgscan_direct_dma 185240
pgscan_direct_normal 23961886
pgscan_direct_high 584047
pgscan_direct_movable 0
pginodesteal 123
slabs_scanned 10368
kswapd_inodesteal 1
kswapd_low_wmark_hit_quickly 15
kswapd_high_wmark_hit_quickly 8
kswapd_skip_congestion_wait 639
pageoutrun 582
allocstall 14514
pgrotated 1
unevictable_pgs_culled 0
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1
unevictable_pgs_mlocked 1
unevictable_pgs_munlocked 1
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0


* Re: zram OOM behavior
  2012-10-15 18:54   ` Luigi Semenzato
@ 2012-10-16  6:18     ` Minchan Kim
  2012-10-16 17:36       ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-10-16  6:18 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: linux-mm

On Mon, Oct 15, 2012 at 11:54:36AM -0700, Luigi Semenzato wrote:
> On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote:
> > Hello,
> >
> > On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote:
> >> Greetings,
> >>
> >> We are experimenting with zram in Chrome OS.  It works quite well
> >> until the system runs out of memory, at which point it seems to hang,
> >> but we suspect it is thrashing.
> >>
> >> Before the (apparent) hang, the OOM killer gets rid of a few
> >> processes, but then the other processes gradually stop responding,
> >> until the entire system becomes unresponsive.
> >
> > Why do you think it's zram problem? If you use swap device as storage
> > instead of zram, does the problem disappear?
> 
> I haven't tried with a swap device, but that is a good suggestion.
> 
> I didn't want to swap to disk (too slow compared to zram, so it's not
> the same experiment any more), but I could preallocate a RAM disk and
> swap to that.

Good idea.

> 
> > Could you do sysrq+t,m several time and post it while hang happens?
> > /proc/vmstat could be helpful, too.
> 
> The stack traces look mostly like this:
> 
> [ 2058.069020]  [<810681c4>] handle_edge_irq+0x8f/0xb1
> [ 2058.069028]  <IRQ>  [<810037ed>] ? do_IRQ+0x3f/0x98
> [ 2058.069044]  [<813b7eb0>] ? common_interrupt+0x30/0x38
> [ 2058.069058]  [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108
> [ 2058.069072]  [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3
> [ 2058.069085]  [<813b70d5>] ? _raw_spin_lock+0xd/0xf
> [ 2058.069097]  [<810b418c>] ? put_super+0x15/0x29
> [ 2058.069108]  [<810b41ba>] ? drop_super+0x1a/0x1d
> [ 2058.069119]  [<810b4d04>] ? prune_super+0x106/0x110
> [ 2058.069132]  [<81093647>] ? shrink_slab+0x7f/0x22f
> [ 2058.069144]  [<81095943>] ? try_to_free_pages+0x1b7/0x2e6
> [ 2058.069158]  [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5
> [ 2058.069173]  [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf
> [ 2058.069185]  [<810a9d50>] ? swapin_readahead+0x61/0x8d
> [ 2058.069198]  [<8109fea0>] ? handle_pte_fault+0x310/0x5fb
> [ 2058.069208]  [<8100223a>] ? do_signal+0x470/0x4fe
> [ 2058.069220]  [<810a02cc>] ? handle_mm_fault+0xae/0xbd
> [ 2058.069233]  [<8101d0f9>] ? do_page_fault+0x265/0x284
> [ 2058.069247]  [<81192b32>] ? copy_to_user+0x3e/0x49
> [ 2058.069257]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
> [ 2058.069270]  [<81009279>] ? init_fpu+0x73/0x81
> [ 2058.069280]  [<8100275e>] ? math_state_restore+0x1f/0xa0
> [ 2058.069290]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
> [ 2058.069303]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
> [ 2058.069315]  [<813b7737>] ? error_code+0x67/0x6c
> 
> The bottom part of the stack varies, but most processes are spending a
> lot of time in prune_super().  There is a pretty high number of
> mounted file systems, and do_try_to_free_pages() keeps calling
> shrink_slab() even when there is nothing to reclaim there.

Good catch.  We could check the number of reclaimable slab objects in a
zone before diving into shrink_slab() and abort if there is nothing to
reclaim.
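
Something along these lines, for illustration only (not a real patch;
the threshold and the exact hook point would need real thought):

/* Skip the shrinkers when there is essentially no reclaimable slab. */
static bool slab_worth_shrinking(struct zone *zone)
{
        /* Arbitrary illustrative floor of a few pages. */
        return zone_page_state(zone, NR_SLAB_RECLAIMABLE) > 32;
}

do_try_to_free_pages() could test this before each shrink_slab() call,
so direct reclaimers stop walking every mounted superblock's shrinker
when the slab caches are already nearly empty.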

> 
> In addition, do_try_to_free_pages() keeps returning 1 because
> all_unreclaimable() at the end is always false.  The allocator thinks
> that zone 1 has freeable pages (zones 0 and 2 do not).  That prevents
> the allocator from ooming.

That's a problem caused by your custom "min_filelist_kbytes" patch.

> 
> I went in some more depth, but didn't quite untangle all that goes on.
>  In any case, this explains why I came up with the theory that somehow
> mm is too optimistic about how many pages are freeable.  Then I found
> what looks like a smoking gun in vmscan.c:
> 
> if (nr_swap_pages > 0)
>     nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>             zone_page_state(zone, NR_INACTIVE_ANON);
> 
> which seems to ignore that not all ANON pages are freeable if swap
> space is limited.

It's just a check for whether swap is enabled or not, NOT how many
empty slots are left in swap.  I understand your concern, but it's not
directly related to your problem.  If you changed it, you might solve
the problem by triggering OOM earlier, but IMHO that's not the right
fix, and it would break the LRU and slab reclaim balancing logic.

> 
> Pretty much all processes hang while trying to allocate memory.  Those
> that don't allocate memory keep running fine.
> 
> vmstat 1 shows a large amount of swapping activity, which drops to 0
> when the processes hang.
> 
> /proc/meminfo and /proc/vmstat are at the bottom.
> 
> >
> >>
> >> I am wondering if anybody has run into this.  Thanks!
> >>
> >> Luigi
> >>
> >> P.S.  For those who wish to know more:
> >>
> >> 1. We use the min_filelist_kbytes patch
> >> (http://lwn.net/Articles/412313/)  (I am not sure if it made it into
> >> the standard kernel) and set min_filelist_kbytes to 50Mb.  (This may
> >> not matter, as it's unlikely to make things worse.)
> >
> > One of the problem I look at this patch is it might prevent
> > increasing of zone->pages_scanned when the swap if full or anon pages
> > are very small although there are lots of file-backed pages.
> > It means OOM can't occur and page allocator could loop forever.
> > Please look at zone_reclaimable.
> 
> Yes---I think you are right.  It didn't matter to us because we don't
> use swap.  The problem looks fixable.

No swap?  You mentioned you used zram as swap?  Which is right?  Your
wording started to confuse me.
If you don't use swap, it's more error prone, because get_scan_count()
means your reclaim logic never reclaims anonymous memory, and your
min_filelist_kbytes patch means it never reclaims file memory once file
memory is smaller than 50M.  The VM can then reclaim neither the anon
nor the file LRU pages, so every process that tries to allocate will
loop forever.

You mean you didn't use swap before but are starting to use it these
days?  If so, please resend the min_filelist_kbytes patch with the fix
to linux-mm.

> 
> > Have you ever test it without above patch?
> 
> Good suggestion.  I just did.  Almost all text pages are evicted, and
> then the system thrashes so badly that the hang detector kicks in
> after a couple of minutes and panics.

I guess the culprit is your min_filelist_kbytes patch.
If you think it's a really good feature, please resend it and let's make
it better than it is now.  I think the motivation is good for embedded. :)

> 
> Thank you for the very helpful suggestions!

Thanks for the interesting problem!


-- 
Kind Regards,
Minchan Kim


* Re: zram OOM behavior
  2012-10-16  6:18     ` Minchan Kim
@ 2012-10-16 17:36       ` Luigi Semenzato
  2012-10-19 17:49         ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-16 17:36 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer

On Mon, Oct 15, 2012 at 11:18 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Oct 15, 2012 at 11:54:36AM -0700, Luigi Semenzato wrote:
>> On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > Hello,
>> >
>> > On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote:
>> >> Greetings,
>> >>
>> >> We are experimenting with zram in Chrome OS.  It works quite well
>> >> until the system runs out of memory, at which point it seems to hang,
>> >> but we suspect it is thrashing.
>> >>
>> >> Before the (apparent) hang, the OOM killer gets rid of a few
>> >> processes, but then the other processes gradually stop responding,
>> >> until the entire system becomes unresponsive.
>> >
>> > Why do you think it's zram problem? If you use swap device as storage
>> > instead of zram, does the problem disappear?
>>
>> I haven't tried with a swap device, but that is a good suggestion.
>>
>> I didn't want to swap to disk (too slow compared to zram, so it's not
>> the same experiment any more), but I could preallocate a RAM disk and
>> swap to that.
>
> Good idea.
>
>>
>> > Could you do sysrq+t,m several time and post it while hang happens?
>> > /proc/vmstat could be helpful, too.
>>
>> The stack traces look mostly like this:
>>
>> [ 2058.069020]  [<810681c4>] handle_edge_irq+0x8f/0xb1
>> [ 2058.069028]  <IRQ>  [<810037ed>] ? do_IRQ+0x3f/0x98
>> [ 2058.069044]  [<813b7eb0>] ? common_interrupt+0x30/0x38
>> [ 2058.069058]  [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108
>> [ 2058.069072]  [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3
>> [ 2058.069085]  [<813b70d5>] ? _raw_spin_lock+0xd/0xf
>> [ 2058.069097]  [<810b418c>] ? put_super+0x15/0x29
>> [ 2058.069108]  [<810b41ba>] ? drop_super+0x1a/0x1d
>> [ 2058.069119]  [<810b4d04>] ? prune_super+0x106/0x110
>> [ 2058.069132]  [<81093647>] ? shrink_slab+0x7f/0x22f
>> [ 2058.069144]  [<81095943>] ? try_to_free_pages+0x1b7/0x2e6
>> [ 2058.069158]  [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5
>> [ 2058.069173]  [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf
>> [ 2058.069185]  [<810a9d50>] ? swapin_readahead+0x61/0x8d
>> [ 2058.069198]  [<8109fea0>] ? handle_pte_fault+0x310/0x5fb
>> [ 2058.069208]  [<8100223a>] ? do_signal+0x470/0x4fe
>> [ 2058.069220]  [<810a02cc>] ? handle_mm_fault+0xae/0xbd
>> [ 2058.069233]  [<8101d0f9>] ? do_page_fault+0x265/0x284
>> [ 2058.069247]  [<81192b32>] ? copy_to_user+0x3e/0x49
>> [ 2058.069257]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
>> [ 2058.069270]  [<81009279>] ? init_fpu+0x73/0x81
>> [ 2058.069280]  [<8100275e>] ? math_state_restore+0x1f/0xa0
>> [ 2058.069290]  [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
>> [ 2058.069303]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
>> [ 2058.069315]  [<813b7737>] ? error_code+0x67/0x6c
>>
>> The bottom part of the stack varies, but most processes are spending a
>> lot of time in prune_super().  There is a pretty high number of
>> mounted file systems, and do_try_to_free_pages() keeps calling
>> shrink_slab() even when there is nothing to reclaim there.
>
> Good catch. We can check the number of reclaimable slab in a zone before
> diving into shrink_slab and abort it.
>
>>
>> In addition, do_try_to_free_pages() keeps returning 1 because
>> all_unreclaimable() at the end is always false.  The allocator thinks
>> that zone 1 has freeable pages (zones 0 and 2 do not).  That prevents
>> the allocator from ooming.
>
> It's a problem of your custom patch "min_filelist_kbytes".
>
>>
>> I went in some more depth, but didn't quite untangle all that goes on.
>>  In any case, this explains why I came up with the theory that somehow
>> mm is too optimistic about how many pages are freeable.  Then I found
>> what looks like a smoking gun in vmscan.c:
>>
>> if (nr_swap_pages > 0)
>>     nr += zone_page_state(zone, NR_ACTIVE_ANON) +
>>             zone_page_state(zone, NR_INACTIVE_ANON);
>>
>> which seems to ignore that not all ANON pages are freeable if swap
>> space is limited.
>
> It's a just check for whether swap is enable or not, NOT how many we have
> empty slot in swap. I understand your concern but it's not related to your
> problem directly. If you could change it, you might solve the problem by
> early OOM but it's not right fix, IMHO and break LRU and SLAB reclaim balancing
> logic.

Yes, I was afraid of some consequence of that kind.

However, I still don't understand that computation.
"zone_reclaimable_pages" suggests we're computing how many anonymous
pages can be reclaimed.  If there is zero swap, no anonymous pages can
be reclaimed.  If there is very little swap left, very few anonymous
pages can be reclaimed.  So that confuses me.  But don't worry,
because many other things confuse me too!

>
>>
>> Pretty much all processes hang while trying to allocate memory.  Those
>> that don't allocate memory keep running fine.
>>
>> vmstat 1 shows a large amount of swapping activity, which drops to 0
>> when the processes hang.
>>
>> /proc/meminfo and /proc/vmstat are at the bottom.
>>
>> >
>> >>
>> >> I am wondering if anybody has run into this.  Thanks!
>> >>
>> >> Luigi
>> >>
>> >> P.S.  For those who wish to know more:
>> >>
>> >> 1. We use the min_filelist_kbytes patch
>> >> (http://lwn.net/Articles/412313/)  (I am not sure if it made it into
>> >> the standard kernel) and set min_filelist_kbytes to 50Mb.  (This may
>> >> not matter, as it's unlikely to make things worse.)
>> >
>> > One of the problem I look at this patch is it might prevent
>> > increasing of zone->pages_scanned when the swap if full or anon pages
>> > are very small although there are lots of file-backed pages.
>> > It means OOM can't occur and page allocator could loop forever.
>> > Please look at zone_reclaimable.
>>
>> Yes---I think you are right.  It didn't matter to us because we don't
>> use swap.  The problem looks fixable.
>
> No use swap? You mentioned you used zram as swap?
> Which is right? I started to confuse your word.

I apologize for the confusion.  We don't use swap now in Chrome OS.  I
am investigating the possibility of using zram, if I can get it to
work.

We are not likely to consider swap to disk because the resulting jank
for interactive loads is too high and difficult to control, and we may
do a better job by managing memory at a higher level (basically in the
Chrome app).

> If you don't use swap, it's more error prone because get_scan_count makes
> your reclaim logic never get reclaim anonymous memory and your min_filelist_kbytes
> patch makes reclaim logic never get reclaim file memory if file memory is smaller
> than 50M. It means VM never reclaim both anon and file LRU pages so all of processes
> try to allocate will be loop forever.

Actually, our patch seems to work fine in our systems, which are
commercially available.  (I'll be happy to send you any data that you
may find interesting).  Without the patch, the system can thrash badly
when we allocate memory aggressively (for instance, by loading many
browser tabs in parallel).

So, if we ignore zram for the moment, the min_filelist_kbytes patch
prevents the last 50 MB of file memory from being evicted.  It has no
impact on anon memory.  For that memory, we take the same code path as
before.  It may be suboptimal because it doesn't try to reclaim
inactive file memory in the last 50 MB, but that doesn't seem to
matter.

>
> You mean you didn't use it but start to use it these days?
> If so, please resend min_filelist_kbytes patch with the fix to linux-mm.
>
>>
>> > Have you ever test it without above patch?
>>
>> Good suggestion.  I just did.  Almost all text pages are evicted, and
>> then the system thrashes so badly that the hang detector kicks in
>> after a couple of minutes and panics.
>
> I guess culprit is your min_filelist_kbytes patch.

That could be, but I still need some way of preventing file pages from
thrashing.  Without that patch, the system thrashes when low on memory,
with or without zram, and with or without other changes related to
nr_swap_pages.

> If you think it's really good feature, please resend it and let's makes it better
> than now. I think motivation is good for embedded. :)

Yes!  Thanks, I'll try to do that.

>
>>
>> Thank you for the very helpful suggestions!
>
> Thanks for the interesting problem!
>
>>
>>
>> >
>> >>
>> >> 2. We swap only to compressed ram.  The setup is very simple:
>> >>
>> >>  echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize ||
>> >>       logger -t "$UPSTART_JOB" "failed to set zram size"
>> >>   mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed"
>> >>   swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed"
>> >>
>> >> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or
>> >> 4 Gb).  The compression factor is about 3:1.  The hangs happen for
>> >> quite a wide range of zram sizes.
>> >>
>> >> --
>> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >> the body to majordomo@kvack.org.  For more info on Linux MM,
>> >> see: http://www.linux-mm.org/ .
>> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> >
>> > --
>> > Kind Regards,
>> > Minchan Kim
>>
>>
>> MemTotal:        2002292 kB
>> MemFree:           15148 kB
>> Buffers:             260 kB
>> Cached:           169952 kB
>> SwapCached:       149448 kB
>> Active:           722608 kB
>> Inactive:         290824 kB
>> Active(anon):     682680 kB
>> Inactive(anon):   230888 kB
>> Active(file):      39928 kB
>> Inactive(file):    59936 kB
>> Unevictable:           0 kB
>> Mlocked:               0 kB
>> HighTotal:         74504 kB
>> HighFree:              0 kB
>> LowTotal:        1927788 kB
>> LowFree:           15148 kB
>> SwapTotal:       2933044 kB
>> SwapFree:          47968 kB
>> Dirty:                 0 kB
>> Writeback:            56 kB
>> AnonPages:        695180 kB
>> Mapped:            73276 kB
>> Shmem:             70276 kB
>> Slab:              19596 kB
>> SReclaimable:       9152 kB
>> SUnreclaim:        10444 kB
>> KernelStack:        1448 kB
>> PageTables:         9964 kB
>> NFS_Unstable:          0 kB
>> Bounce:                0 kB
>> WritebackTmp:          0 kB
>> CommitLimit:     3934188 kB
>> Committed_AS:    4371740 kB
>> VmallocTotal:     122880 kB
>> VmallocUsed:       22268 kB
>> VmallocChunk:     100340 kB
>> DirectMap4k:       34808 kB
>> DirectMap2M:     1927168 kB
>>
>>
>> nr_free_pages 3776
>> nr_inactive_anon 58243
>> nr_active_anon 172106
>> nr_inactive_file 14984
>> nr_active_file 9982
>> nr_unevictable 0
>> nr_mlock 0
>> nr_anon_pages 174840
>> nr_mapped 18387
>> nr_file_pages 80762
>> nr_dirty 0
>> nr_writeback 13
>> nr_slab_reclaimable 2290
>> nr_slab_unreclaimable 2611
>> nr_page_table_pages 2471
>> nr_kernel_stack 180
>> nr_unstable 0
>> nr_bounce 0
>> nr_vmscan_write 679247
>> nr_vmscan_immediate_reclaim 0
>> nr_writeback_temp 0
>> nr_isolated_anon 416
>> nr_isolated_file 0
>> nr_shmem 17637
>> nr_dirtied 7630
>> nr_written 686863
>> nr_anon_transparent_hugepages 0
>> nr_dirty_threshold 151452
>> nr_dirty_background_threshold 2524
>> pgpgin 284189
>> pgpgout 2748940
>> pswpin 5602
>> pswpout 679271
>> pgalloc_dma 9976
>> pgalloc_normal 1426651
>> pgalloc_high 34659
>> pgalloc_movable 0
>> pgfree 1475099
>> pgactivate 58092
>> pgdeactivate 745734
>> pgfault 1489876
>> pgmajfault 1098
>> pgrefill_dma 8557
>> pgrefill_normal 742123
>> pgrefill_high 4088
>> pgrefill_movable 0
>> pgsteal_kswapd_dma 199
>> pgsteal_kswapd_normal 48387
>> pgsteal_kswapd_high 2443
>> pgsteal_kswapd_movable 0
>> pgsteal_direct_dma 7688
>> pgsteal_direct_normal 652670
>> pgsteal_direct_high 6242
>> pgsteal_direct_movable 0
>> pgscan_kswapd_dma 268
>> pgscan_kswapd_normal 105036
>> pgscan_kswapd_high 8395
>> pgscan_kswapd_movable 0
>> pgscan_direct_dma 185240
>> pgscan_direct_normal 23961886
>> pgscan_direct_high 584047
>> pgscan_direct_movable 0
>> pginodesteal 123
>> slabs_scanned 10368
>> kswapd_inodesteal 1
>> kswapd_low_wmark_hit_quickly 15
>> kswapd_high_wmark_hit_quickly 8
>> kswapd_skip_congestion_wait 639
>> pageoutrun 582
>> allocstall 14514
>> pgrotated 1
>> unevictable_pgs_culled 0
>> unevictable_pgs_scanned 0
>> unevictable_pgs_rescued 1
>> unevictable_pgs_mlocked 1
>> unevictable_pgs_munlocked 1
>> unevictable_pgs_cleared 0
>> unevictable_pgs_stranded 0
>> unevictable_pgs_mlockfreed 0
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> --
> Kind Regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-16 17:36       ` Luigi Semenzato
@ 2012-10-19 17:49         ` Luigi Semenzato
  2012-10-22 23:53           ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-19 17:49 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer

I found the source, and maybe the cause, of the problem I am
experiencing when running out of memory with zram enabled.  It may be
a known problem.  The OOM killer doesn't find any killable process
because select_bad_process() keeps returning -1 here:

    /*
     * This task already has access to memory reserves and is
     * being killed. Don't allow any other task access to the
     * memory reserve.
     *
     * Note: this may have a chance of deadlock if it gets
     * blocked waiting for another task which itself is waiting
     * for memory. Is there a better alternative?
     */
    if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
        if (unlikely(frozen(p)))
            __thaw_task(p);
        if (!force_kill)
            return ERR_PTR(-1UL);
    }

select_bad_process() is called by out_of_memory() in __alloc_pages_may_oom().

If this is the problem, I'd love to hear about solutions!

<BEGIN SHAMELESS PLUG>
if we can get this to work, it will help keep the cost of laptops down!
http://www.google.com/intl/en/chrome/devices/
<END SHAMELESS PLUG>

P.S. Chromebooks are sweet things for kernel debugging because they
boot so quickly (5-10s depending on the model).


* Re: zram OOM behavior
  2012-10-19 17:49         ` Luigi Semenzato
@ 2012-10-22 23:53           ` Minchan Kim
  2012-10-23  0:40             ` Luigi Semenzato
  2012-10-23  6:03             ` David Rientjes
  0 siblings, 2 replies; 67+ messages in thread
From: Minchan Kim @ 2012-10-22 23:53 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: linux-mm, Dan Magenheimer, David Rientjes, KOSAKI Motohiro

Hi, 

Sorry for the late response.  I was traveling at the time and am still
suffering from a training course I never wanted. :(

On Fri, Oct 19, 2012 at 10:49:22AM -0700, Luigi Semenzato wrote:
> I found the source, and maybe the cause, of the problem I am
> experiencing when running out of memory with zram enabled.  It may be
> a known problem.  The OOM killer doesn't find any killable process
> because select_bad_process() keeps returning -1 here:
> 
>     /*
>      * This task already has access to memory reserves and is
>      * being killed. Don't allow any other task access to the
>      * memory reserve.
>      *
>      * Note: this may have a chance of deadlock if it gets
>      * blocked waiting for another task which itself is waiting
>      * for memory. Is there a better alternative?
>      */
>     if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
>         if (unlikely(frozen(p)))
>             __thaw_task(p);
>         if (!force_kill)
>             return ERR_PTR(-1UL);
>     }
> 
> select_bad_process() is called by out_of_memory() in __alloc_page_may_oom().

I think it's not a zram problem but a general problem of the OOM killer.
The intention of the code above is to prevent a shortage of the emergency
memory pool, to avoid deadlock.  If we have already killed a task and
that task is in the middle of exiting, the OOM killer will wait for it to
exit.  But the problem here is that the killed task might be waiting on a
mutex held by another task which is itself stuck allocating memory and
can't use the emergency memory pool. :(
That's another deadlock, too.  AFAIK, it's a known problem and I'm not
sure the OOM guys have a good idea.  Cc'ed them.
I think one solution is that if some time (e.g. 3 sec) has passed since
we killed a task and we are still looping in the code above, we could
allow another task to access the emergency memory pool.  That may still
deadlock by burning through the memory pool, but otherwise we keep
suffering from the current deadlock.
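
A very rough sketch of that idea (not a real patch; p->memdie_start is
a hypothetical field that would record when TIF_MEMDIE was granted):

    /* p->memdie_start: hypothetical jiffies timestamp, set when the
     * task was given TIF_MEMDIE. */
    if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
        if (unlikely(frozen(p)))
            __thaw_task(p);
        /* Keep deferring only while the victim has had < 3s to exit. */
        if (!force_kill &&
            time_before(jiffies, p->memdie_start + 3 * HZ))
            return ERR_PTR(-1UL);
        /* Otherwise fall through and consider other victims. */
    }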

> 
> If this is the problem, I'd love to hear about solutions!
> 
> <BEGIN SHAMELESS PLUG>
> if we can get this to work, it will help keep the cost of laptops down!
> http://www.google.com/intl/en/chrome/devices/
> <END SHAMELESS PLUG>
> 
> P.S. Chromebooks are sweet things for kernel debugging because they
> boot so quickly (5-10s depending on the model).

But I think mainline kernel doesn't boot on that. :(


-- 
Kind regards,
Minchan Kim


* Re: zram OOM behavior
  2012-10-22 23:53           ` Minchan Kim
@ 2012-10-23  0:40             ` Luigi Semenzato
  2012-10-23  6:03             ` David Rientjes
  1 sibling, 0 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-23  0:40 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer, David Rientjes, KOSAKI Motohiro

On Mon, Oct 22, 2012 at 4:53 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi,
>
> Sorry for late response.

No problem at all.

> I was traveling at that time and still suffer from
> training course I never want. :(

I am sorry you have to take training courses you do not want, and I sympathize.

> On Fri, Oct 19, 2012 at 10:49:22AM -0700, Luigi Semenzato wrote:
>> I found the source, and maybe the cause, of the problem I am
>> experiencing when running out of memory with zram enabled.  It may be
>> a known problem.  The OOM killer doesn't find any killable process
>> because select_bad_process() keeps returning -1 here:
>>
>>     /*
>>      * This task already has access to memory reserves and is
>>      * being killed. Don't allow any other task access to the
>>      * memory reserve.
>>      *
>>      * Note: this may have a chance of deadlock if it gets
>>      * blocked waiting for another task which itself is waiting
>>      * for memory. Is there a better alternative?
>>      */
>>     if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
>>         if (unlikely(frozen(p)))
>>             __thaw_task(p);
>>         if (!force_kill)
>>             return ERR_PTR(-1UL);
>>     }
>>
>> select_bad_process() is called by out_of_memory() in __alloc_page_may_oom().
>
> I think it's not a zram problem but general problem of OOM killer.
> Above code's intention is to prevent shortage of ememgency memory pool for avoding
> deadlock. If we already killed any task and the task are in the middle of exiting,
> OOM killer will wait for him to be exited. But the problem in here is that
> killed task might wait any mutex which are held to another task which are
> stuck for the memory allocation and can't use emergency memory pool. :(
> It's a another deadlock, too. AFAIK, it's known problem and I'm not sure
> OOM guys have a good idea. Cc'ed them.
> I think one of solution is that if it takes some seconed(ex, 3 sec) after we already
> kill some task but still looping with above code, we can allow accessing of
> ememgency memory pool for another task. It may happen deadlock due to burn out memory
> pool but otherwise, we still suffer from deadlock.

Next thing, I will check what the killed task is waiting for.  It may
be that there are a few frequent cases that are solvable.

Ideally we should not reach this situation.  We use a low-memory
notification mechanism (based on some code from you, in fact, many
thanks) to discard Chrome tabs (which we reload transparently).  But
if memory is allocated very aggressively, the notification may arrive
too late.

>> If this is the problem, I'd love to hear about solutions!
>>
>> P.S. Chromebooks are sweet things for kernel debugging because they
>> boot so quickly (5-10s depending on the model).
>
> But I think mainline kernel doesn't boot on that. :(

Probably not.  Very sorry for mentioning this, then.

Thank you and I will keep you updated with any progress.

Luigi


* Re: zram OOM behavior
  2012-10-22 23:53           ` Minchan Kim
  2012-10-23  0:40             ` Luigi Semenzato
@ 2012-10-23  6:03             ` David Rientjes
  2012-10-29 18:26               ` Luigi Semenzato
  1 sibling, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-10-23  6:03 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Tue, 23 Oct 2012, Minchan Kim wrote:

> > I found the source, and maybe the cause, of the problem I am
> > experiencing when running out of memory with zram enabled.  It may be
> > a known problem.  The OOM killer doesn't find any killable process
> > because select_bad_process() keeps returning -1 here:
> > 
> >     /*
> >      * This task already has access to memory reserves and is
> >      * being killed. Don't allow any other task access to the
> >      * memory reserve.
> >      *
> >      * Note: this may have a chance of deadlock if it gets
> >      * blocked waiting for another task which itself is waiting
> >      * for memory. Is there a better alternative?
> >      */
> >     if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
> >         if (unlikely(frozen(p)))
> >             __thaw_task(p);
> >         if (!force_kill)
> >             return ERR_PTR(-1UL);
> >     }
> > 
> > select_bad_process() is called by out_of_memory() in __alloc_page_may_oom().
> 
> I think it's not a zram problem but general problem of OOM killer.
> Above code's intention is to prevent shortage of ememgency memory pool for avoding
> deadlock. If we already killed any task and the task are in the middle of exiting,
> OOM killer will wait for him to be exited. But the problem in here is that
> killed task might wait any mutex which are held to another task which are
> stuck for the memory allocation and can't use emergency memory pool. :(

Yeah, there's always a problem if an oom killed process cannot exit 
because it's waiting for some other eligible process.  This doesn't 
normally happen for anything sharing the same mm, though, because we try 
to kill anything sharing the same mm when we select a process for oom kill 
and if those killed threads happen to call into the oom killer they 
silently get TIF_MEMDIE so they may exit as well.  This addressed earlier 
problems we had with things waiting on mm->mmap_sem in the exit path.
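
(For reference, the "silently get TIF_MEMDIE" path is roughly this
check near the top of out_of_memory() in mm/oom_kill.c of that era:

    if (fatal_signal_pending(current)) {
        set_thread_flag(TIF_MEMDIE);
        return;
    }

so a thread that has already been sent SIGKILL is granted access to the
memory reserves instead of being selected again.)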

If the oom killed process cannot exit because it's waiting on another 
eligible process that does not share the mm, then we'll potentially 
livelock unless you do echo f > /proc/sysrq-trigger manually or turn on 
/proc/sys/vm/oom_kill_allocating_task.

> I think one solution is that if some time (e.g. 3 sec) has passed after we have
> already killed some task but we are still looping in the code above, we can allow
> another task to access the emergency memory pool. That may cause a deadlock by
> burning out the memory pool, but otherwise we still suffer from the deadlock anyway.
> 

The problem there is that if the time limit expires (we used 10 seconds 
internally before; we don't do it at all anymore) and there are no more 
eligible threads, you unnecessarily panic, or you open yourself up to a 
complete depletion of memory reserves where not even the oom killer can 
help.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-23  6:03             ` David Rientjes
@ 2012-10-29 18:26               ` Luigi Semenzato
  2012-10-29 19:00                 ` David Rientjes
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-29 18:26 UTC (permalink / raw)
  To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

I managed to get the stack trace for the process that refuses to die.
I am not sure it's due to the deadlock described in earlier messages.
I will investigate further.

[96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
[96283.704405]  c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a
0000578f f67cfd20
[96283.704427]  d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000
c107fe04 00200202
[96283.704449]  c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202
f5bdf1b0 f5bdf1b8
[96283.704471] Call Trace:
[96283.704484]  [<81037be5>] ? queue_work_on+0x2d/0x39
[96283.704497]  [<8117ddb1>] ? put_io_context+0x52/0x6a
[96283.704510]  [<813b68f6>] schedule+0x56/0x58
[96283.704520]  [<81028525>] do_exit+0x63e/0x640
[96283.704530]  [<81028752>] do_group_exit+0x63/0x86
[96283.704541]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
[96283.704554]  [<81001e01>] do_signal+0x37/0x4fe
[96283.704564]  [<8103e31d>] ? update_rmtp+0x67/0x67
[96283.704585]  [<8105622a>] ? clockevents_program_event+0xea/0x108
[96283.704599]  [<81050d92>] ? timekeeping_get_ns+0x11/0x55
[96283.704610]  [<8105a758>] ? sys_futex+0xcb/0xdb
[96283.704620]  [<810024a7>] do_notify_resume+0x26/0x65
[96283.704632]  [<813b7305>] work_notifysig+0xa/0x11
[96283.704644]  [<813b0000>] ? coretemp_cpu_callback+0x88/0x179

On Mon, Oct 22, 2012 at 11:03 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 23 Oct 2012, Minchan Kim wrote:
>
>> > I found the source, and maybe the cause, of the problem I am
>> > experiencing when running out of memory with zram enabled.  It may be
>> > a known problem.  The OOM killer doesn't find any killable process
>> > because select_bad_process() keeps returning -1 here:
>> >
>> >     /*
>> >      * This task already has access to memory reserves and is
>> >      * being killed. Don't allow any other task access to the
>> >      * memory reserve.
>> >      *
>> >      * Note: this may have a chance of deadlock if it gets
>> >      * blocked waiting for another task which itself is waiting
>> >      * for memory. Is there a better alternative?
>> >      */
>> >     if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
>> >         if (unlikely(frozen(p)))
>> >             __thaw_task(p);
>> >         if (!force_kill)
>> >             return ERR_PTR(-1UL);
>> >     }
>> >
>> > select_bad_process() is called by out_of_memory() in __alloc_page_may_oom().
>>
>> I think it's not a zram problem but general problem of OOM killer.
>> Above code's intention is to prevent shortage of ememgency memory pool for avoding
>> deadlock. If we already killed any task and the task are in the middle of exiting,
>> OOM killer will wait for him to be exited. But the problem in here is that
>> killed task might wait any mutex which are held to another task which are
>> stuck for the memory allocation and can't use emergency memory pool. :(
>
> Yeah, there's always a problem if an oom killed process cannot exit
> because it's waiting for some other eligible process.  This doesn't
> normally happen for anything sharing the same mm, though, because we try
> to kill anything sharing the same mm when we select a process for oom kill
> and if those killed threads happen to call into the oom killer they
> silently get TIF_MEMDIE so they may exit as well.  This addressed earlier
> problems we had with things waiting on mm->mmap_sem in the exit path.
>
> If the oom killed process cannot exit because it's waiting on another
> eligible process that does not share the mm, then we'll potentially
> livelock unless you do echo f > /proc/sysrq-trigger manually or turn on
> /proc/sys/vm/oom_kill_allocating_task.
>
>> I think one of solution is that if it takes some seconed(ex, 3 sec) after we already
>> kill some task but still looping with above code, we can allow accessing of
>> ememgency memory pool for another task. It may happen deadlock due to burn out memory
>> pool but otherwise, we still suffer from deadlock.
>>
>
> The problem there is that if the time limit expires (we used 10 seconds
> before internally, we don't do it at all anymore) and there are no more
> eligible threads that you unnecessarily panic, or open yourself up to a
> complete depletion of memory reserves whereas not even the oom killer can
> help.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 18:26               ` Luigi Semenzato
@ 2012-10-29 19:00                 ` David Rientjes
  2012-10-29 22:36                   ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-10-29 19:00 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, 29 Oct 2012, Luigi Semenzato wrote:

> I managed to get the stack trace for the process that refuses to die.
> I am not sure it's due to the deadlock described in earlier messages.
> I will investigate further.
> 
> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
> [96283.704405]  c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a
> 0000578f f67cfd20
> [96283.704427]  d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000
> c107fe04 00200202
> [96283.704449]  c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202
> f5bdf1b0 f5bdf1b8
> [96283.704471] Call Trace:
> [96283.704484]  [<81037be5>] ? queue_work_on+0x2d/0x39
> [96283.704497]  [<8117ddb1>] ? put_io_context+0x52/0x6a
> [96283.704510]  [<813b68f6>] schedule+0x56/0x58
> [96283.704520]  [<81028525>] do_exit+0x63e/0x640

Could you find out where this happens to be in the function?  If you 
enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and 
find out with l *do_exit+0x63e.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 19:00                 ` David Rientjes
@ 2012-10-29 22:36                   ` Luigi Semenzato
  2012-10-29 22:52                     ` David Rientjes
  2012-10-30  0:18                     ` Minchan Kim
  0 siblings, 2 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-29 22:36 UTC (permalink / raw)
  To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>
>> I managed to get the stack trace for the process that refuses to die.
>> I am not sure it's due to the deadlock described in earlier messages.
>> I will investigate further.
>>
>> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
>> [96283.704405]  c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a
>> 0000578f f67cfd20
>> [96283.704427]  d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000
>> c107fe04 00200202
>> [96283.704449]  c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202
>> f5bdf1b0 f5bdf1b8
>> [96283.704471] Call Trace:
>> [96283.704484]  [<81037be5>] ? queue_work_on+0x2d/0x39
>> [96283.704497]  [<8117ddb1>] ? put_io_context+0x52/0x6a
>> [96283.704510]  [<813b68f6>] schedule+0x56/0x58
>> [96283.704520]  [<81028525>] do_exit+0x63e/0x640
>
> Could you find out where this happens to be in the function?  If you
> enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and
> find out with l *do_exit+0x63e.

It looks like it's the final call to schedule() in do_exit():

   0x81028520 <+1593>: call   0x813b68a0 <schedule>
   0x81028525 <+1598>: ud2a

(gdb) l *do_exit+0x63e
0x81028525 is in do_exit
(/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
1064
1065 /* causes final put_task_struct in finish_task_switch(). */
1066 tsk->state = TASK_DEAD;
1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
1068 schedule();
1069 BUG();
1070 /* Avoid "noreturn function does return".  */
1071 for (;;)
1072 cpu_relax(); /* For when BUG is null */
1073 }

Here's a theory: the thread exits fine, but the next scheduled thread
tries to allocate memory before or during finish_task_switch(), so the
dead thread is never cleaned up completely and is still considered
alive by the OOM killer.

Unfortunately I haven't found a code path that supports this theory...
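
For context, the relevant tail of finish_task_switch() looks roughly like
this (a condensed paraphrase of the 3.4-era scheduler code, not a verbatim
excerpt):

static void finish_task_switch(struct rq *rq, struct task_struct *prev)
{
	long prev_state = prev->state;	/* sampled before prev can go away */

	finish_lock_switch(rq, prev);
	...
	if (unlikely(prev_state == TASK_DEAD)) {
		/*
		 * The final put_task_struct() for a TASK_DEAD task happens
		 * here, run by the *next* task on behalf of the one that
		 * just scheduled away for the last time.
		 */
		put_task_struct(prev);
	}
}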

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 22:36                   ` Luigi Semenzato
@ 2012-10-29 22:52                     ` David Rientjes
  2012-10-29 23:23                       ` Luigi Semenzato
  2012-10-30  0:18                     ` Minchan Kim
  1 sibling, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-10-29 22:52 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, 29 Oct 2012, Luigi Semenzato wrote:

> It looks like it's the final call to schedule() in do_exit():
> 
>    0x81028520 <+1593>: call   0x813b68a0 <schedule>
>    0x81028525 <+1598>: ud2a
> 
> (gdb) l *do_exit+0x63e
> 0x81028525 is in do_exit
> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
> 1064
> 1065 /* causes final put_task_struct in finish_task_switch(). */
> 1066 tsk->state = TASK_DEAD;
> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
> 1068 schedule();
> 1069 BUG();
> 1070 /* Avoid "noreturn function does return".  */
> 1071 for (;;)
> 1072 cpu_relax(); /* For when BUG is null */
> 1073 }
> 

You're using an older kernel since the code you quoted from the oom killer 
hasn't had the per-memcg oom kill rewrite.  There's logic that is called 
from select_bad_process() that should exclude this thread from being 
considered and deferred on, since it has a non-zero task->exit_state, i.e. in 
oom_scan_process_thread():

	if (task->exit_state)
		return OOM_SCAN_CONTINUE;

And that's called from both the global oom killer and memcg oom killer.  
So I'm thinking you're either running on an older kernel or there is no 
oom condition at the time this is captured.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 22:52                     ` David Rientjes
@ 2012-10-29 23:23                       ` Luigi Semenzato
  2012-10-29 23:34                         ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-29 23:23 UTC (permalink / raw)
  To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, Oct 29, 2012 at 3:52 PM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>
>> It looks like it's the final call to schedule() in do_exit():
>>
>>    0x81028520 <+1593>: call   0x813b68a0 <schedule>
>>    0x81028525 <+1598>: ud2a
>>
>> (gdb) l *do_exit+0x63e
>> 0x81028525 is in do_exit
>> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
>> 1064
>> 1065 /* causes final put_task_struct in finish_task_switch(). */
>> 1066 tsk->state = TASK_DEAD;
>> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
>> 1068 schedule();
>> 1069 BUG();
>> 1070 /* Avoid "noreturn function does return".  */
>> 1071 for (;;)
>> 1072 cpu_relax(); /* For when BUG is null */
>> 1073 }
>>
>
> You're using an older kernel since the code you quoted from the oom killer
> hasn't had the per-memcg oom kill rewrite.  There's logic that is called
> from select_bad_process() that should exclude this thread from being
> considered and deferred since it has a non-zero task->exit_thread, i.e. in
> oom_scan_process_thread():
>
>         if (task->exit_state)
>                 return OOM_SCAN_CONTINUE;
>
> And that's called from both the global oom killer and memcg oom killer.
> So I'm thinking you're either running on an older kernel or there is no
> oom condition at the time this is captured.

Very sorry, I neglected to mention that we're on kernel 3.4.0.

We are in an OOM-kill situation:

./arch/x86/include/asm/thread_info.h:91:#define TIF_MEMDIE 20

Bit 20 in the threadinfo flags is set:

> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104

So your suggestion would be to apply OOM-related patches from a later kernel?

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 23:23                       ` Luigi Semenzato
@ 2012-10-29 23:34                         ` Luigi Semenzato
  0 siblings, 0 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-29 23:34 UTC (permalink / raw)
  To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, Oct 29, 2012 at 4:23 PM, Luigi Semenzato <semenzato@google.com> wrote:
> On Mon, Oct 29, 2012 at 3:52 PM, David Rientjes <rientjes@google.com> wrote:
>> On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>>
>>> It looks like it's the final call to schedule() in do_exit():
>>>
>>>    0x81028520 <+1593>: call   0x813b68a0 <schedule>
>>>    0x81028525 <+1598>: ud2a
>>>
>>> (gdb) l *do_exit+0x63e
>>> 0x81028525 is in do_exit
>>> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
>>> 1064
>>> 1065 /* causes final put_task_struct in finish_task_switch(). */
>>> 1066 tsk->state = TASK_DEAD;
>>> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
>>> 1068 schedule();
>>> 1069 BUG();
>>> 1070 /* Avoid "noreturn function does return".  */
>>> 1071 for (;;)
>>> 1072 cpu_relax(); /* For when BUG is null */
>>> 1073 }
>>>
>>
>> You're using an older kernel since the code you quoted from the oom killer
>> hasn't had the per-memcg oom kill rewrite.  There's logic that is called
>> from select_bad_process() that should exclude this thread from being
>> considered and deferred since it has a non-zero task->exit_thread, i.e. in
>> oom_scan_process_thread():
>>
>>         if (task->exit_state)
>>                 return OOM_SCAN_CONTINUE;
>>
>> And that's called from both the global oom killer and memcg oom killer.
>> So I'm thinking you're either running on an older kernel or there is no
>> oom condition at the time this is captured.


> Very sorry, I never said that we're on kernel 3.4.0.
>
> We are in a OOM-kill situation:
>
> ./arch/x86/include/asm/thread_info.h:91:#define TIF_MEMDIE 20
>
> Bit 20 in the threadinfo flags is set:
>
>> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
>
> So your suggestion would be to apply OOM-related patches from a later kernel?
>
> Thanks!

Actually, I am not sure that the 3.6 OOM code is sufficiently
different to avoid this situation.  3.4 already has a test for
task->exit_state, which in my case must be failing even though
TIF_MEMDIE is set and the process has finished do_exit:

do_each_thread(g, p) {
  unsigned int points;

  if (p->exit_state)
    continue;
...

In fact, those changes look mostly cosmetic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-29 22:36                   ` Luigi Semenzato
  2012-10-29 22:52                     ` David Rientjes
@ 2012-10-30  0:18                     ` Minchan Kim
  2012-10-30  0:45                       ` Luigi Semenzato
  1 sibling, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-10-30  0:18 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, Oct 29, 2012 at 03:36:38PM -0700, Luigi Semenzato wrote:
> On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote:
> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
> >
> >> I managed to get the stack trace for the process that refuses to die.
> >> I am not sure it's due to the deadlock described in earlier messages.
> >> I will investigate further.
> >>
> >> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
> >> [96283.704405]  c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a
> >> 0000578f f67cfd20
> >> [96283.704427]  d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000
> >> c107fe04 00200202
> >> [96283.704449]  c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202
> >> f5bdf1b0 f5bdf1b8
> >> [96283.704471] Call Trace:
> >> [96283.704484]  [<81037be5>] ? queue_work_on+0x2d/0x39
> >> [96283.704497]  [<8117ddb1>] ? put_io_context+0x52/0x6a
> >> [96283.704510]  [<813b68f6>] schedule+0x56/0x58
> >> [96283.704520]  [<81028525>] do_exit+0x63e/0x640
> >
> > Could you find out where this happens to be in the function?  If you
> > enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and
> > find out with l *do_exit+0x63e.
> 
> It looks like it's the final call to schedule() in do_exit():
> 
>    0x81028520 <+1593>: call   0x813b68a0 <schedule>
>    0x81028525 <+1598>: ud2a
> 
> (gdb) l *do_exit+0x63e
> 0x81028525 is in do_exit
> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
> 1064
> 1065 /* causes final put_task_struct in finish_task_switch(). */
> 1066 tsk->state = TASK_DEAD;
> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
> 1068 schedule();
> 1069 BUG();
> 1070 /* Avoid "noreturn function does return".  */
> 1071 for (;;)
> 1072 cpu_relax(); /* For when BUG is null */
> 1073 }
> 
> Here's a theory: the thread exits fine, but the next scheduled thread
> tries to allocate memory before or during finish_task_switch(), so the
> dead thread is never cleaned up completely and is still considered
> alive by the OOM killer.

If the next thread tries to allocate memory, it will enter the direct reclaim path,
and there are some scheduling points in there, so the exiting thread should get
destroyed. :( In your previous mail, you said many processes are stuck at
shrink_slab, which already includes cond_resched. I can't see any problem.
Hmm, could you post the entire debug log after you capture sysrq+t several times
when the hang happens?

> 
> Unfortunately I haven't found a code path that supports this theory...
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30  0:18                     ` Minchan Kim
@ 2012-10-30  0:45                       ` Luigi Semenzato
  2012-10-30  5:41                         ` David Rientjes
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-30  0:45 UTC (permalink / raw)
  To: Minchan Kim; +Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, Oct 29, 2012 at 5:18 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Mon, Oct 29, 2012 at 03:36:38PM -0700, Luigi Semenzato wrote:
>> On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote:
>> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>> >
>> >> I managed to get the stack trace for the process that refuses to die.
>> >> I am not sure it's due to the deadlock described in earlier messages.
>> >> I will investigate further.
>> >>
>> >> [96283.704390] chrome          x 815ecd20     0 16573   1112 0x00100104
>> >> [96283.704405]  c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a
>> >> 0000578f f67cfd20
>> >> [96283.704427]  d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000
>> >> c107fe04 00200202
>> >> [96283.704449]  c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202
>> >> f5bdf1b0 f5bdf1b8
>> >> [96283.704471] Call Trace:
>> >> [96283.704484]  [<81037be5>] ? queue_work_on+0x2d/0x39
>> >> [96283.704497]  [<8117ddb1>] ? put_io_context+0x52/0x6a
>> >> [96283.704510]  [<813b68f6>] schedule+0x56/0x58
>> >> [96283.704520]  [<81028525>] do_exit+0x63e/0x640
>> >
>> > Could you find out where this happens to be in the function?  If you
>> > enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and
>> > find out with l *do_exit+0x63e.
>>
>> It looks like it's the final call to schedule() in do_exit():
>>
>>    0x81028520 <+1593>: call   0x813b68a0 <schedule>
>>    0x81028525 <+1598>: ud2a
>>
>> (gdb) l *do_exit+0x63e
>> 0x81028525 is in do_exit
>> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069).
>> 1064
>> 1065 /* causes final put_task_struct in finish_task_switch(). */
>> 1066 tsk->state = TASK_DEAD;
>> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */
>> 1068 schedule();
>> 1069 BUG();
>> 1070 /* Avoid "noreturn function does return".  */
>> 1071 for (;;)
>> 1072 cpu_relax(); /* For when BUG is null */
>> 1073 }
>>
>> Here's a theory: the thread exits fine, but the next scheduled thread
>> tries to allocate memory before or during finish_task_switch(), so the
>> dead thread is never cleaned up completely and is still considered
>> alive by the OOM killer.
>
> If next thread tries to allocate memory, he will enter direct reclaim path
> and there are some scheduling points in there so exit thread should be
> destroyed. :( In your previous mail, you said many processes are stuck at
> shrink_slab which already includes cond_resched. I can't see any problem.
> Hmm, Could you post entire debug log after you capture sysrq+t several time
> when hang happens?

Thank you so much for your continued assistance.

I have been using preserved memory to get the log, and sysrq+T
overflows the buffer (there are a few dozen processes).  To get the
trace for the process with TIF_MEMDIE set, I had to modify the sysrq+T
code so that it prints only that process.
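
(Purely as an illustration of what that modification amounts to, and not the
actual change, a filtered dump could look something like this:)

	struct task_struct *g, *p;

	rcu_read_lock();
	do_each_thread(g, p) {
		/* only dump the task(s) the OOM killer is waiting on */
		if (test_tsk_thread_flag(p, TIF_MEMDIE))
			sched_show_task(p);
	} while_each_thread(g, p);
	rcu_read_unlock();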

To get a full trace of all processes I will have to open the device
and attach a debug header, so it will take some time.  What are we
looking for, though?  I see many processes running in shrink_slab(),
but they are not "stuck" there, they are just spending a lot of time
in there.

However, now there is something that worries me more.  The trace of
the thread with TIF_MEMDIE set shows that it has executed most of
do_exit() and appears to be waiting to be reaped.  From my reading of
the code, this implies that task->exit_state should be non-zero, which
means that select_bad_process should have skipped that thread, which
means that we cannot be in the deadlock situation, and my experiments
are not consistent.

I will add better instrumentation and report later.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30  0:45                       ` Luigi Semenzato
@ 2012-10-30  5:41                         ` David Rientjes
  2012-10-30 19:12                           ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-10-30  5:41 UTC (permalink / raw)
  To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Mon, 29 Oct 2012, Luigi Semenzato wrote:

> However, now there is something that worries me more.  The trace of
> the thread with TIF_MEMDIE set shows that it has executed most of
> do_exit() and appears to be waiting to be reaped.  From my reading of
> the code, this implies that task->exit_state should be non-zero, which
> means that select_bad_process should have skipped that thread, which
> means that we cannot be in the deadlock situation, and my experiments
> are not consistent.
> 

Yeah, this is what I was referring to earlier, select_bad_process() will 
not consider the thread for which you posted a stack trace for oom kill, 
so it's not deferring because of it.  There are either other thread(s) 
that have been oom killed and have not yet release their memory or the oom 
killer is never being called.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30  5:41                         ` David Rientjes
@ 2012-10-30 19:12                           ` Luigi Semenzato
  2012-10-30 20:30                             ` Luigi Semenzato
  2012-10-31  0:57                             ` Minchan Kim
  0 siblings, 2 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-30 19:12 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
> On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>
>> However, now there is something that worries me more.  The trace of
>> the thread with TIF_MEMDIE set shows that it has executed most of
>> do_exit() and appears to be waiting to be reaped.  From my reading of
>> the code, this implies that task->exit_state should be non-zero, which
>> means that select_bad_process should have skipped that thread, which
>> means that we cannot be in the deadlock situation, and my experiments
>> are not consistent.
>>
>
> Yeah, this is what I was referring to earlier, select_bad_process() will
> not consider the thread for which you posted a stack trace for oom kill,
> so it's not deferring because of it.  There are either other thread(s)
> that have been oom killed and have not yet release their memory or the oom
> killer is never being called.

Thanks.  I now have better information on what's happening.

The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
set).  It's another process that's exiting for some other reason.

select_bad_process() checks for thread->exit_state at the beginning,
and skips processes that are exiting.  But later it checks for
p->flags & PF_EXITING, and can return -1 in that case (and it does for
me).
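
Putting the two checks side by side (a condensed paraphrase of the 3.4-era
select_bad_process(), assembled from the fragments quoted elsewhere in this
thread, not a verbatim excerpt):

	do_each_thread(g, p) {
		if (p->exit_state)	/* set late, in exit_notify() */
			continue;
		...
		if (p->flags & PF_EXITING) {	/* set early, in exit_signals() */
			if (p == current) {
				chosen = p;
				*ppoints = 1000;
			} else if (!force_kill) {
				/* defer instead of killing anything else */
				if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
					return ERR_PTR(-1UL);
			}
		}
	} while_each_thread(g, p);

Any task caught in the window between those two flags keeps the whole OOM
killer deferring on it.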

It turns out that do_exit() does a lot of things between setting the
thread->flags PF_EXITING bit (in exit_signals()) and setting
thread->exit_state to non-zero (in exit_notify()).  Some of those
things apparently need memory.  I caught one process responsible for
the ERR_PTR(-1UL) return while it was doing this:

[  191.859358] VC manager      R running      0  2388   1108 0x00000104
[  191.859377] err_ptr_count = 45623
[  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
0000002c f67cfd20
[  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
e1302400 e130264c
[  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
e0611b0c 810b430e
[  191.859450] Call Trace:
[  191.859465]  [<81191c34>] ? __delay+0xe/0x10
[  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
[  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
[  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
[  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
[  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
[  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
[  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
[  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
[  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
[  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
[  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
[  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
[  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
[  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
[  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
[  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
[  191.859672]  [<813b7887>] error_code+0x67/0x6c
[  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
[  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
[  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
[  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
[  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
[  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
[  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
[  191.859760]  [<81028082>] do_exit+0x19b/0x640
[  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
[  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
[  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
[  191.859803]  [<81028752>] do_group_exit+0x63/0x86
[  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
[  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
[  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
[  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
[  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
[  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
[  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
[  191.859893] Kernel panic - not syncing: too many ERR_PTR

I don't know why mm_release() would page fault, but it looks like it does.

So the OOM killer will not kill other processes because it thinks a
process is exiting, which will free up memory.  But the exiting
process needs memory to continue exiting --> deadlock.  Sounds
plausible?

OK, now someone is going to fix this, right? :-)

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 19:12                           ` Luigi Semenzato
@ 2012-10-30 20:30                             ` Luigi Semenzato
  2012-10-30 22:32                               ` Luigi Semenzato
                                                 ` (2 more replies)
  2012-10-31  0:57                             ` Minchan Kim
  1 sibling, 3 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-30 20:30 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote:

> OK, now someone is going to fix this, right? :-)

Actually, there is a very simple fix:

@@ -355,14 +364,6 @@ static struct task_struct
*select_bad_process(unsigned int *ppoints,
                        if (p == current) {
                                chosen = p;
                                *ppoints = 1000;
-                       } else if (!force_kill) {
-                               /*
-                                * If this task is not being ptraced on exit,
-                                * then wait for it to finish before killing
-                                * some other task unnecessarily.
-                                */
-                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
-                                       return ERR_PTR(-1UL);
                        }
                }

I'd rather kill some other task unnecessarily than hang!  My load
works fine with this change.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 20:30                             ` Luigi Semenzato
@ 2012-10-30 22:32                               ` Luigi Semenzato
  2012-10-31 18:42                                 ` David Rientjes
  2012-10-30 22:37                               ` Sonny Rao
  2012-10-31  4:46                               ` David Rientjes
  2 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-30 22:32 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, Oct 30, 2012 at 1:30 PM, Luigi Semenzato <semenzato@google.com> wrote:
> On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote:
>
>> OK, now someone is going to fix this, right? :-)
>
> Actually, there is a very simple fix:
>
> @@ -355,14 +364,6 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
>                         if (p == current) {
>                                 chosen = p;
>                                 *ppoints = 1000;
> -                       } else if (!force_kill) {
> -                               /*
> -                                * If this task is not being ptraced on exit,
> -                                * then wait for it to finish before killing
> -                                * some other task unnecessarily.
> -                                */
> -                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
> -                                       return ERR_PTR(-1UL);
>                         }
>                 }
>
> I'd rather kill some other task unnecessarily than hang!  My load
> works fine with this change.

For completeness, I would like to report that the page fault in
mm_release looks legitimate.  The fault happens near here:

if (unlikely(tsk->robust_list)) {
    exit_robust_list(tsk);
    tsk->robust_list = NULL;
}

and robust_list is a userspace structure.
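
To make the "userspace structure" point concrete, here is a tiny standalone
program (an illustration, assuming a glibc with pthread_mutexattr_setrobust();
build with -pthread) that leaves a robust list for the kernel to walk at exit:

#include <pthread.h>
#include <stdio.h>

int main(void)
{
	pthread_mutex_t m;
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
	pthread_mutex_init(&m, &attr);

	/*
	 * Locking a robust mutex links it into this thread's robust list,
	 * which glibc registers with the kernel via set_robust_list().
	 * The list lives in ordinary user pages, so exit_robust_list()
	 * has to read them when the task dies; if they have been swapped
	 * out, that read faults, which is the allocation seen above in
	 * mm_release().
	 */
	pthread_mutex_lock(&m);
	printf("exiting while holding a robust mutex\n");
	return 0;
}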

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 20:30                             ` Luigi Semenzato
  2012-10-30 22:32                               ` Luigi Semenzato
@ 2012-10-30 22:37                               ` Sonny Rao
  2012-10-31  4:46                               ` David Rientjes
  2 siblings, 0 replies; 67+ messages in thread
From: Sonny Rao @ 2012-10-30 22:37 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro

On Tue, Oct 30, 2012 at 1:30 PM, Luigi Semenzato <semenzato@google.com> wrote:
>
> On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote:
>
> > OK, now someone is going to fix this, right? :-)
>
> Actually, there is a very simple fix:
>
> @@ -355,14 +364,6 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
>                         if (p == current) {
>                                 chosen = p;
>                                 *ppoints = 1000;
> -                       } else if (!force_kill) {
> -                               /*
> -                                * If this task is not being ptraced on exit,
> -                                * then wait for it to finish before killing
> -                                * some other task unnecessarily.
> -                                */
> -                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
> -                                       return ERR_PTR(-1UL);
>                         }
>                 }
>
> I'd rather kill some other task unnecessarily than hang!  My load
> works fine with this change.

It also appears that we didn't kill any unnecessary tasks either.

It's just a deadlock: exiting process A encounters a page fault, has to
allocate some memory, and goes to sleep; process B, which is running the
OOM killer, blocks on the exiting process.  So process A blocks forever on
memory while process B blocks on A, and therefore no memory is released.

IMO, the fact that we already skip this deferral when the process is being
ptraced on exit also seems to justify that skipping it is a valid thing to
do in all cases.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 19:12                           ` Luigi Semenzato
  2012-10-30 20:30                             ` Luigi Semenzato
@ 2012-10-31  0:57                             ` Minchan Kim
  2012-10-31  1:06                               ` Luigi Semenzato
  2012-10-31 18:54                               ` David Rientjes
  1 sibling, 2 replies; 67+ messages in thread
From: Minchan Kim @ 2012-10-31  0:57 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

Hi Luigi,

On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
> >
> >> However, now there is something that worries me more.  The trace of
> >> the thread with TIF_MEMDIE set shows that it has executed most of
> >> do_exit() and appears to be waiting to be reaped.  From my reading of
> >> the code, this implies that task->exit_state should be non-zero, which
> >> means that select_bad_process should have skipped that thread, which
> >> means that we cannot be in the deadlock situation, and my experiments
> >> are not consistent.
> >>
> >
> > Yeah, this is what I was referring to earlier, select_bad_process() will
> > not consider the thread for which you posted a stack trace for oom kill,
> > so it's not deferring because of it.  There are either other thread(s)
> > that have been oom killed and have not yet release their memory or the oom
> > killer is never being called.
> 
> Thanks.  I now have better information on what's happening.
> 
> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
> set).  It's another process that's exiting for some other reason.
> 
> select_bad_process() checks for thread->exit_state at the beginning,
> and skips processes that are exiting.  But later it checks for
> p->flags & PF_EXITING, and can return -1 in that case (and it does for
> me).
> 
> It turns out that do_exit() does a lot of things between setting the
> thread->flags PF_EXITING bit (in exit_signals()) and setting
> thread->exit_state to non-zero (in exit_notify()).  Some of those
> things apparently need memory.  I caught one process responsible for
> the PTR_ERR(-1) while it was doing this:
> 
> [  191.859358] VC manager      R running      0  2388   1108 0x00000104
> [  191.859377] err_ptr_count = 45623
> [  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
> 0000002c f67cfd20
> [  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
> e1302400 e130264c
> [  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
> e0611b0c 810b430e
> [  191.859450] Call Trace:
> [  191.859465]  [<81191c34>] ? __delay+0xe/0x10
> [  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
> [  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
> [  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
> [  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
> [  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
> [  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
> [  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
> [  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
> [  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
> [  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
> [  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
> [  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
> [  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
> [  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
> [  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
> [  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
> [  191.859672]  [<813b7887>] error_code+0x67/0x6c
> [  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
> [  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
> [  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
> [  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
> [  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
> [  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
> [  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
> [  191.859760]  [<81028082>] do_exit+0x19b/0x640
> [  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
> [  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
> [  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
> [  191.859803]  [<81028752>] do_group_exit+0x63/0x86
> [  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
> [  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
> [  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
> [  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
> [  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
> [  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
> [  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
> [  191.859893] Kernel panic - not syncing: too many ERR_PTR
> 
> I don't know why mm_release() would page fault, but it looks like it does.
> 
> So the OOM killer will not kill other processes because it thinks a
> process is exiting, which will free up memory.  But the exiting
> process needs memory to continue exiting --> deadlock.  Sounds
> plausible?

It sounds right in your kernel, but the principal problem is the min_filelist_kbytes patch.
If a normally-exiting process needs a page in the exit path and there are no free
pages left, it ends up going down the OOM path after trying to reclaim memory several times.
Then,
In select_bad_process,

        if (task->flags & PF_EXITING) {
               if (task == current)             <== true
                        return OOM_SCAN_SELECT;
In oom_kill_process,

        if (p->flags & PF_EXITING)
                set_tsk_thread_flag(p, TIF_MEMDIE);

So eventually the normally-exiting process would get a free page.

But in your kernel it seems that doesn't happen, because I guess did_some_progress in
__alloc_pages_direct_reclaim is never 0. The reason it is never 0 is that
do_try_to_free_pages's all_unreclaimable check can't do its job because of your
min_filelist_kbytes patch, which makes __alloc_pages_slowpath loop forever.

Sounds plausible?
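
The decision point being described looks roughly like this (a condensed
paraphrase of the 3.4 __alloc_pages_slowpath(), not a verbatim excerpt):

rebalance:
	page = __alloc_pages_direct_reclaim(gfp_mask, order, ...,
					    &did_some_progress);
	if (page)
		goto got_pg;

	/*
	 * The OOM killer is only considered when reclaim made *no*
	 * progress.  If min_filelist_kbytes keeps a pool of file pages
	 * that shrinking can always nibble at, did_some_progress stays
	 * non-zero and this branch is never taken.
	 */
	if (!did_some_progress) {
		if ((gfp_mask & __GFP_FS) && !(gfp_mask & __GFP_NORETRY)) {
			page = __alloc_pages_may_oom(gfp_mask, order, ...);
			if (page)
				goto got_pg;
			goto restart;
		}
	}

	/* otherwise back off and retry, potentially forever */
	pages_reclaimed += did_some_progress;
	if (should_alloc_retry(gfp_mask, order, did_some_progress, pages_reclaimed))
		goto rebalance;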

> 
> OK, now someone is going to fix this, right? :-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  0:57                             ` Minchan Kim
@ 2012-10-31  1:06                               ` Luigi Semenzato
  2012-10-31  1:27                                 ` Minchan Kim
  2012-10-31 18:54                               ` David Rientjes
  1 sibling, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31  1:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mandeep Baines

On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Luigi,
>
> On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
>> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
>> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>> >
>> >> However, now there is something that worries me more.  The trace of
>> >> the thread with TIF_MEMDIE set shows that it has executed most of
>> >> do_exit() and appears to be waiting to be reaped.  From my reading of
>> >> the code, this implies that task->exit_state should be non-zero, which
>> >> means that select_bad_process should have skipped that thread, which
>> >> means that we cannot be in the deadlock situation, and my experiments
>> >> are not consistent.
>> >>
>> >
>> > Yeah, this is what I was referring to earlier, select_bad_process() will
>> > not consider the thread for which you posted a stack trace for oom kill,
>> > so it's not deferring because of it.  There are either other thread(s)
>> > that have been oom killed and have not yet release their memory or the oom
>> > killer is never being called.
>>
>> Thanks.  I now have better information on what's happening.
>>
>> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
>> set).  It's another process that's exiting for some other reason.
>>
>> select_bad_process() checks for thread->exit_state at the beginning,
>> and skips processes that are exiting.  But later it checks for
>> p->flags & PF_EXITING, and can return -1 in that case (and it does for
>> me).
>>
>> It turns out that do_exit() does a lot of things between setting the
>> thread->flags PF_EXITING bit (in exit_signals()) and setting
>> thread->exit_state to non-zero (in exit_notify()).  Some of those
>> things apparently need memory.  I caught one process responsible for
>> the PTR_ERR(-1) while it was doing this:
>>
>> [  191.859358] VC manager      R running      0  2388   1108 0x00000104
>> [  191.859377] err_ptr_count = 45623
>> [  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
>> 0000002c f67cfd20
>> [  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
>> e1302400 e130264c
>> [  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
>> e0611b0c 810b430e
>> [  191.859450] Call Trace:
>> [  191.859465]  [<81191c34>] ? __delay+0xe/0x10
>> [  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
>> [  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
>> [  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
>> [  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
>> [  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
>> [  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
>> [  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
>> [  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
>> [  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
>> [  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
>> [  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
>> [  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
>> [  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
>> [  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
>> [  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
>> [  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
>> [  191.859672]  [<813b7887>] error_code+0x67/0x6c
>> [  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
>> [  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
>> [  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
>> [  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
>> [  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
>> [  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
>> [  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
>> [  191.859760]  [<81028082>] do_exit+0x19b/0x640
>> [  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
>> [  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
>> [  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
>> [  191.859803]  [<81028752>] do_group_exit+0x63/0x86
>> [  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
>> [  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
>> [  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
>> [  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
>> [  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
>> [  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
>> [  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
>> [  191.859893] Kernel panic - not syncing: too many ERR_PTR
>>
>> I don't know why mm_release() would page fault, but it looks like it does.
>>
>> So the OOM killer will not kill other processes because it thinks a
>> process is exiting, which will free up memory.  But the exiting
>> process needs memory to continue exiting --> deadlock.  Sounds
>> plausible?
>
> It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> If normal exited process in exit path requires a page and there is no free page
> any more, it ends up going to OOM path after try to reclaim memory several time.
> Then,
> In select_bad_process,
>
>         if (task->flags & PF_EXITING) {
>                if (task == current)             <== true
>                         return OOM_SCAN_SELECT;
> In oom_kill_process,
>
>         if (p->flags & PF_EXITING)
>                 set_tsk_thread_flag(p, TIF_MEMDIE);
>
> At last, normal exited process would get a free page.
>
> But in your kernel, it seems not because I guess did_some_progress in
> __alloc_pages_direct_reclaim is never 0. The why it is never 0 is
> do_try_to_free_pages's all_unreclaimable can't do his role by your
> min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever.
>
> Sounds plausible?

Thank you Minchan, it does sound plausible, but I have little
experience with this and it will take some work to confirm.

I looked at the patch pretty carefully once, and I had the impression
its effect could be fully analyzed by logical reasoning. I will check
this again tomorrow, perhaps I can run some experiments.  I am adding
Mandeep who wrote the patch.

However, we have worse problems if we don't use that patch.  Without
the patch, and either with or without compressed swap, the same load
causes horrible thrashing, with the system appearing to hang for
minutes.  If we don't use that patch, do you have any suggestions on
how to improve the code-thrashing situation?

Thanks again!

>>
>> OK, now someone is going to fix this, right? :-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> --
> Kind regards,
> Minchan Kim
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  1:06                               ` Luigi Semenzato
@ 2012-10-31  1:27                                 ` Minchan Kim
  2012-10-31  3:49                                   ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-10-31  1:27 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mandeep Baines

On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote:
> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Hi Luigi,
> >
> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
> >> >
> >> >> However, now there is something that worries me more.  The trace of
> >> >> the thread with TIF_MEMDIE set shows that it has executed most of
> >> >> do_exit() and appears to be waiting to be reaped.  From my reading of
> >> >> the code, this implies that task->exit_state should be non-zero, which
> >> >> means that select_bad_process should have skipped that thread, which
> >> >> means that we cannot be in the deadlock situation, and my experiments
> >> >> are not consistent.
> >> >>
> >> >
> >> > Yeah, this is what I was referring to earlier, select_bad_process() will
> >> > not consider the thread for which you posted a stack trace for oom kill,
> >> > so it's not deferring because of it.  There are either other thread(s)
> >> > that have been oom killed and have not yet release their memory or the oom
> >> > killer is never being called.
> >>
> >> Thanks.  I now have better information on what's happening.
> >>
> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
> >> set).  It's another process that's exiting for some other reason.
> >>
> >> select_bad_process() checks for thread->exit_state at the beginning,
> >> and skips processes that are exiting.  But later it checks for
> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for
> >> me).
> >>
> >> It turns out that do_exit() does a lot of things between setting the
> >> thread->flags PF_EXITING bit (in exit_signals()) and setting
> >> thread->exit_state to non-zero (in exit_notify()).  Some of those
> >> things apparently need memory.  I caught one process responsible for
> >> the PTR_ERR(-1) while it was doing this:
> >>
> >> [  191.859358] VC manager      R running      0  2388   1108 0x00000104
> >> [  191.859377] err_ptr_count = 45623
> >> [  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
> >> 0000002c f67cfd20
> >> [  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
> >> e1302400 e130264c
> >> [  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
> >> e0611b0c 810b430e
> >> [  191.859450] Call Trace:
> >> [  191.859465]  [<81191c34>] ? __delay+0xe/0x10
> >> [  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
> >> [  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
> >> [  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
> >> [  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
> >> [  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
> >> [  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
> >> [  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
> >> [  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
> >> [  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
> >> [  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
> >> [  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
> >> [  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
> >> [  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
> >> [  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
> >> [  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
> >> [  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
> >> [  191.859672]  [<813b7887>] error_code+0x67/0x6c
> >> [  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
> >> [  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
> >> [  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
> >> [  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
> >> [  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
> >> [  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
> >> [  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
> >> [  191.859760]  [<81028082>] do_exit+0x19b/0x640
> >> [  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
> >> [  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
> >> [  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
> >> [  191.859803]  [<81028752>] do_group_exit+0x63/0x86
> >> [  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
> >> [  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
> >> [  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
> >> [  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
> >> [  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
> >> [  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
> >> [  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
> >> [  191.859893] Kernel panic - not syncing: too many ERR_PTR
> >>
> >> I don't know why mm_release() would page fault, but it looks like it does.
> >>
> >> So the OOM killer will not kill other processes because it thinks a
> >> process is exiting, which will free up memory.  But the exiting
> >> process needs memory to continue exiting --> deadlock.  Sounds
> >> plausible?
> >
> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> > If normal exited process in exit path requires a page and there is no free page
> > any more, it ends up going to OOM path after try to reclaim memory several time.
> > Then,
> > In select_bad_process,
> >
> >         if (task->flags & PF_EXITING) {
> >                if (task == current)             <== true
> >                         return OOM_SCAN_SELECT;
> > In oom_kill_process,
> >
> >         if (p->flags & PF_EXITING)
> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
> >
> > At last, normal exited process would get a free page.
> >
> > But in your kernel, it seems not because I guess did_some_progress in
> > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is
> > do_try_to_free_pages's all_unreclaimable can't do his role by your
> > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever.
> >
> > Sounds plausible?
> 
> Thank you Minchan, it does sound plausible, but I have little
> experience with this and it will take some work to confirm.

No problem :)

> 
> I looked at the patch pretty carefully once, and I had the impression
> its effect could be fully analyzed by logical reasoning. I will check
> this again tomorrow, perhaps I can run some experiments.  I am adding
> Mandeep who wrote the patch.
> 
> However, we have worse problems if we don't use that patch.  Without
> the patch, and either with or without compressed swap, the same load
> causes horrible thrashing, with the system appearing to hang for
> minutes.  If we don't use that patch, do you have any suggestion on
> how to improve the code thrash situation?

As I said, the motivation of the patch is good for embedded systems, but
the patch's implementation is kinda buggy. I will have a look and post
something if I'm lucky enough to find the time.

BTW, a question.

How did you find the proper value for min_filelist_kbytes?
Just by experimenting with several trials?

Thanks.

> 
> Thanks again!
> 
> >>
> >> OK, now someone is going to fix this, right? :-)
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> > --
> > Kind regards,
> > Minchan Kim
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  1:27                                 ` Minchan Kim
@ 2012-10-31  3:49                                   ` Luigi Semenzato
  2012-10-31  7:24                                     ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31  3:49 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mandeep Baines

On Tue, Oct 30, 2012 at 6:27 PM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote:
>> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > Hi Luigi,
>> >
>> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
>> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
>> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
>> >> >
>> >> >> However, now there is something that worries me more.  The trace of
>> >> >> the thread with TIF_MEMDIE set shows that it has executed most of
>> >> >> do_exit() and appears to be waiting to be reaped.  From my reading of
>> >> >> the code, this implies that task->exit_state should be non-zero, which
>> >> >> means that select_bad_process should have skipped that thread, which
>> >> >> means that we cannot be in the deadlock situation, and my experiments
>> >> >> are not consistent.
>> >> >>
>> >> >
>> >> > Yeah, this is what I was referring to earlier, select_bad_process() will
>> >> > not consider the thread for which you posted a stack trace for oom kill,
>> >> > so it's not deferring because of it.  There are either other thread(s)
>> >> > that have been oom killed and have not yet release their memory or the oom
>> >> > killer is never being called.
>> >>
>> >> Thanks.  I now have better information on what's happening.
>> >>
>> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
>> >> set).  It's another process that's exiting for some other reason.
>> >>
>> >> select_bad_process() checks for thread->exit_state at the beginning,
>> >> and skips processes that are exiting.  But later it checks for
>> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for
>> >> me).
>> >>
>> >> It turns out that do_exit() does a lot of things between setting the
>> >> thread->flags PF_EXITING bit (in exit_signals()) and setting
>> >> thread->exit_state to non-zero (in exit_notify()).  Some of those
>> >> things apparently need memory.  I caught one process responsible for
>> >> the PTR_ERR(-1) while it was doing this:
>> >>
>> >> [  191.859358] VC manager      R running      0  2388   1108 0x00000104
>> >> [  191.859377] err_ptr_count = 45623
>> >> [  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
>> >> 0000002c f67cfd20
>> >> [  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
>> >> e1302400 e130264c
>> >> [  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
>> >> e0611b0c 810b430e
>> >> [  191.859450] Call Trace:
>> >> [  191.859465]  [<81191c34>] ? __delay+0xe/0x10
>> >> [  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
>> >> [  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
>> >> [  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
>> >> [  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
>> >> [  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
>> >> [  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
>> >> [  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
>> >> [  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
>> >> [  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
>> >> [  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
>> >> [  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
>> >> [  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
>> >> [  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
>> >> [  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
>> >> [  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
>> >> [  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
>> >> [  191.859672]  [<813b7887>] error_code+0x67/0x6c
>> >> [  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
>> >> [  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
>> >> [  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
>> >> [  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
>> >> [  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
>> >> [  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
>> >> [  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
>> >> [  191.859760]  [<81028082>] do_exit+0x19b/0x640
>> >> [  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
>> >> [  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
>> >> [  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
>> >> [  191.859803]  [<81028752>] do_group_exit+0x63/0x86
>> >> [  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
>> >> [  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
>> >> [  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
>> >> [  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
>> >> [  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
>> >> [  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
>> >> [  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
>> >> [  191.859893] Kernel panic - not syncing: too many ERR_PTR
>> >>
>> >> I don't know why mm_release() would page fault, but it looks like it does.
>> >>
>> >> So the OOM killer will not kill other processes because it thinks a
>> >> process is exiting, which will free up memory.  But the exiting
>> >> process needs memory to continue exiting --> deadlock.  Sounds
>> >> plausible?
>> >
>> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
>> > If normal exited process in exit path requires a page and there is no free page
>> > any more, it ends up going to OOM path after try to reclaim memory several time.
>> > Then,
>> > In select_bad_process,
>> >
>> >         if (task->flags & PF_EXITING) {
>> >                if (task == current)             <== true
>> >                         return OOM_SCAN_SELECT;
>> > In oom_kill_process,
>> >
>> >         if (p->flags & PF_EXITING)
>> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
>> >
>> > At last, normal exited process would get a free page.
>> >
>> > But in your kernel, it seems not because I guess did_some_progress in
>> > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is
>> > do_try_to_free_pages's all_unreclaimable can't do his role by your
>> > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever.
>> >
>> > Sounds plausible?
>>
>> Thank you Minchan, it does sound plausible, but I have little
>> experience with this and it will take some work to confirm.
>
> No problem :)
>
>>
>> I looked at the patch pretty carefully once, and I had the impression
>> its effect could be fully analyzed by logical reasoning. I will check
>> this again tomorrow, perhaps I can run some experiments.  I am adding
>> Mandeep who wrote the patch.
>>
>> However, we have worse problems if we don't use that patch.  Without
>> the patch, and either with or without compressed swap, the same load
>> causes horrible thrashing, with the system appearing to hang for
>> minutes.  If we don't use that patch, do you have any suggestion on
>> how to improve the code thrash situation?
>
> As I said, the motivation of the patch is good for embedded system but
> patch's implementation is kinda buggy. I will have a look and post if
> I'm luck to get a time.
>
> BTW, a question.
>
> How do you find proper value for min_filelist_kbytes?
> Just experiment with several trial?
>
> Thanks.

Yes.  Mandeep can give more detail, but, as I understand this, the
value we use (50 Mb) was based on experimentation.  It helps that at
the moment we run Chrome OS on a relatively uniform set of devices,
with either 2 or 4 GB of RAM, no swap, binaries stored on SSD (for
backing store of text pages), and the same load (the Chrome browser).

>>
>> Thanks again!
>>
>> >>
>> >> OK, now someone is going to fix this, right? :-)
>> >>
>> >> --
>> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> >> the body to majordomo@kvack.org.  For more info on Linux MM,
>> >> see: http://www.linux-mm.org/ .
>> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>> >
>> > --
>> > Kind regards,
>> > Minchan Kim
>> >
>> > --
>> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> > the body to majordomo@kvack.org.  For more info on Linux MM,
>> > see: http://www.linux-mm.org/ .
>> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> --
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 20:30                             ` Luigi Semenzato
  2012-10-30 22:32                               ` Luigi Semenzato
  2012-10-30 22:37                               ` Sonny Rao
@ 2012-10-31  4:46                               ` David Rientjes
  2012-10-31  6:14                                 ` Luigi Semenzato
  2 siblings, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-10-31  4:46 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, 30 Oct 2012, Luigi Semenzato wrote:

> Actually, there is a very simple fix:
> 
> @@ -355,14 +364,6 @@ static struct task_struct
> *select_bad_process(unsigned int *ppoints,
>                         if (p == current) {
>                                 chosen = p;
>                                 *ppoints = 1000;
> -                       } else if (!force_kill) {
> -                               /*
> -                                * If this task is not being ptraced on exit,
> -                                * then wait for it to finish before killing
> -                                * some other task unnecessarily.
> -                                */
> -                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
> -                                       return ERR_PTR(-1UL);
>                         }
>                 }
> 
> I'd rather kill some other task unnecessarily than hang!  My load
> works fine with this change.
> 

That's not an acceptable "fix" at all; it will lead to unnecessarily 
killing processes when others are in the exit path, i.e. every oom kill 
would kill two or three or more processes instead of just one.

Could you please try this on 3.6 since all the code you're quoting is from 
old kernels?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  4:46                               ` David Rientjes
@ 2012-10-31  6:14                                 ` Luigi Semenzato
  2012-10-31  6:28                                   ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31  6:14 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, Oct 30, 2012 at 9:46 PM, David Rientjes <rientjes@google.com> wrote:
> On Tue, 30 Oct 2012, Luigi Semenzato wrote:
>
>> Actually, there is a very simple fix:
>>
>> @@ -355,14 +364,6 @@ static struct task_struct
>> *select_bad_process(unsigned int *ppoints,
>>                         if (p == current) {
>>                                 chosen = p;
>>                                 *ppoints = 1000;
>> -                       } else if (!force_kill) {
>> -                               /*
>> -                                * If this task is not being ptraced on exit,
>> -                                * then wait for it to finish before killing
>> -                                * some other task unnecessarily.
>> -                                */
>> -                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
>> -                                       return ERR_PTR(-1UL);
>>                         }
>>                 }
>>
>> I'd rather kill some other task unnecessarily than hang!  My load
>> works fine with this change.
>>
>
> That's not an acceptable "fix" at all, it will lead to unnecessarily
> killing processes when others are in the exit path, i.e. every oom kill
> would kill two or three or more processes instead of just one.

I am sorry, I didn't mean to suggest that this is the right fix for
everybody.  It seems to work for us.  A real fix would be much harder,
I think.  Certainly it would be for me.

We don't rely on OOM-killing for memory management (we tried to, but
it has drawbacks).  But OOM kills can still happen, so we have to deal
with them.  We can deal with multiple processes being killed, but not
with a hang.  I might be tempted to say that this should be true for
everybody, but I can imagine systems that work by allowing only one
process to die, and perhaps the load on those systems is such that
they don't experience this deadlock often, or ever (even though I
would be nervous about it).

> Could you please try this on 3.6 since all the code you're quoting is from
> old kernels?

I will see if I can do it, but we're shipping 3.4 and I am not sure
about the status of our 3.6 tree.  I will also visually inspect the
relevant 3.6 code and see if the possibility of deadlock is still
there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  6:14                                 ` Luigi Semenzato
@ 2012-10-31  6:28                                   ` Luigi Semenzato
  2012-10-31 18:45                                     ` David Rientjes
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31  6:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, Oct 30, 2012 at 11:14 PM, Luigi Semenzato <semenzato@google.com> wrote:
> On Tue, Oct 30, 2012 at 9:46 PM, David Rientjes <rientjes@google.com> wrote:
>> On Tue, 30 Oct 2012, Luigi Semenzato wrote:
>>
>>> Actually, there is a very simple fix:
>>>
>>> @@ -355,14 +364,6 @@ static struct task_struct
>>> *select_bad_process(unsigned int *ppoints,
>>>                         if (p == current) {
>>>                                 chosen = p;
>>>                                 *ppoints = 1000;
>>> -                       } else if (!force_kill) {
>>> -                               /*
>>> -                                * If this task is not being ptraced on exit,
>>> -                                * then wait for it to finish before killing
>>> -                                * some other task unnecessarily.
>>> -                                */
>>> -                               if (!(p->group_leader->ptrace & PT_TRACE_EXIT))
>>> -                                       return ERR_PTR(-1UL);
>>>                         }
>>>                 }
>>>
>>> I'd rather kill some other task unnecessarily than hang!  My load
>>> works fine with this change.
>>>
>>
>> That's not an acceptable "fix" at all, it will lead to unnecessarily
>> killing processes when others are in the exit path, i.e. every oom kill
>> would kill two or three or more processes instead of just one.
>
> I am sorry, I didn't mean to suggest that this is the right fix for
> everybody.  It seems to work for us.  A real fix would be much harder,
> I think.  Certainly it would be for me.
>
> We don't rely on OOM-killing for memory management (we tried to, but
> it has drawbacks).  But OOM kills can still happen, so we have to deal
> with them.  We can deal with multiple processes being killed, but not
> with a hang.  I might be tempted to say that this should be true for
> everybody, but I can imagine systems that work by allowing only one
> process to die, and perhaps the load on those systems is such that
> they don't experience this deadlock often, or ever (even though I
> would be nervous about it).

To make it clear, I am suggesting that this "fix" might work as a
temporary workaround until a better fix is available.

>> Could you please try this on 3.6 since all the code you're quoting is from
>> old kernels?
>
> I will see if I can do it, but we're shipping 3.4 and I am not sure
> about the status of our 3.6 tree.  I will also visually inspect the
> relevant 3.6 code and see if the possibility of deadlock is still
> there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  3:49                                   ` Luigi Semenzato
@ 2012-10-31  7:24                                     ` Minchan Kim
  2012-10-31 16:07                                       ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-10-31  7:24 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mandeep Baines

On Tue, Oct 30, 2012 at 08:49:26PM -0700, Luigi Semenzato wrote:
> On Tue, Oct 30, 2012 at 6:27 PM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote:
> >> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote:
> >> > Hi Luigi,
> >> >
> >> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote:
> >> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote:
> >> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote:
> >> >> >
> >> >> >> However, now there is something that worries me more.  The trace of
> >> >> >> the thread with TIF_MEMDIE set shows that it has executed most of
> >> >> >> do_exit() and appears to be waiting to be reaped.  From my reading of
> >> >> >> the code, this implies that task->exit_state should be non-zero, which
> >> >> >> means that select_bad_process should have skipped that thread, which
> >> >> >> means that we cannot be in the deadlock situation, and my experiments
> >> >> >> are not consistent.
> >> >> >>
> >> >> >
> >> >> > Yeah, this is what I was referring to earlier, select_bad_process() will
> >> >> > not consider the thread for which you posted a stack trace for oom kill,
> >> >> > so it's not deferring because of it.  There are either other thread(s)
> >> >> > that have been oom killed and have not yet release their memory or the oom
> >> >> > killer is never being called.
> >> >>
> >> >> Thanks.  I now have better information on what's happening.
> >> >>
> >> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE
> >> >> set).  It's another process that's exiting for some other reason.
> >> >>
> >> >> select_bad_process() checks for thread->exit_state at the beginning,
> >> >> and skips processes that are exiting.  But later it checks for
> >> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for
> >> >> me).
> >> >>
> >> >> It turns out that do_exit() does a lot of things between setting the
> >> >> thread->flags PF_EXITING bit (in exit_signals()) and setting
> >> >> thread->exit_state to non-zero (in exit_notify()).  Some of those
> >> >> things apparently need memory.  I caught one process responsible for
> >> >> the PTR_ERR(-1) while it was doing this:
> >> >>
> >> >> [  191.859358] VC manager      R running      0  2388   1108 0x00000104
> >> >> [  191.859377] err_ptr_count = 45623
> >> >> [  191.859384]  e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3
> >> >> 0000002c f67cfd20
> >> >> [  191.859407]  f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001
> >> >> e1302400 e130264c
> >> >> [  191.859428]  e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400
> >> >> e0611b0c 810b430e
> >> >> [  191.859450] Call Trace:
> >> >> [  191.859465]  [<81191c34>] ? __delay+0xe/0x10
> >> >> [  191.859478]  [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3
> >> >> [  191.859491]  [<813b71d5>] ? _raw_spin_unlock+0xd/0xf
> >> >> [  191.859504]  [<810b42f1>] ? put_super+0x26/0x29
> >> >> [  191.859515]  [<810b430e>] ? drop_super+0x1a/0x1d
> >> >> [  191.859527]  [<8104512d>] __cond_resched+0x1b/0x2b
> >> >> [  191.859537]  [<813b67a7>] _cond_resched+0x18/0x21
> >> >> [  191.859549]  [<81093940>] shrink_slab+0x224/0x22f
> >> >> [  191.859562]  [<81095a96>] try_to_free_pages+0x1b7/0x2e6
> >> >> [  191.859574]  [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f
> >> >> [  191.859588]  [<810a9dbe>] read_swap_cache_async+0x4a/0xcf
> >> >> [  191.859600]  [<810a9ea4>] swapin_readahead+0x61/0x8d
> >> >> [  191.859612]  [<8109fff4>] handle_pte_fault+0x310/0x5fb
> >> >> [  191.859624]  [<810a0420>] handle_mm_fault+0xae/0xbd
> >> >> [  191.859637]  [<8101d0f9>] do_page_fault+0x265/0x284
> >> >> [  191.859648]  [<8104aa17>] ? dequeue_entity+0x236/0x252
> >> >> [  191.859660]  [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
> >> >> [  191.859672]  [<813b7887>] error_code+0x67/0x6c
> >> >> [  191.859683]  [<81191d21>] ? __get_user_4+0x11/0x17
> >> >> [  191.859695]  [<81059f28>] ? exit_robust_list+0x30/0x105
> >> >> [  191.859707]  [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10
> >> >> [  191.859718]  [<810446d5>] ? finish_task_switch+0x53/0x89
> >> >> [  191.859730]  [<8102351d>] mm_release+0x1d/0xc3
> >> >> [  191.859740]  [<81026ce9>] exit_mm+0x1d/0xe9
> >> >> [  191.859750]  [<81032b87>] ? exit_signals+0x57/0x10a
> >> >> [  191.859760]  [<81028082>] do_exit+0x19b/0x640
> >> >> [  191.859770]  [<81058598>] ? futex_wait_queue_me+0xaa/0xbe
> >> >> [  191.859781]  [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c
> >> >> [  191.859793]  [<81030beb>] ? recalc_sigpending+0x17/0x3e
> >> >> [  191.859803]  [<81028752>] do_group_exit+0x63/0x86
> >> >> [  191.859813]  [<81032b19>] get_signal_to_deliver+0x434/0x44b
> >> >> [  191.859825]  [<81001e01>] do_signal+0x37/0x4fe
> >> >> [  191.859837]  [<81048eed>] ? set_next_entity+0x36/0x9d
> >> >> [  191.859850]  [<81050d8e>] ? timekeeping_get_ns+0x11/0x55
> >> >> [  191.859861]  [<8105a754>] ? sys_futex+0xcb/0xdb
> >> >> [  191.859871]  [<810024a7>] do_notify_resume+0x26/0x65
> >> >> [  191.859883]  [<813b73a5>] work_notifysig+0xa/0x11
> >> >> [  191.859893] Kernel panic - not syncing: too many ERR_PTR
> >> >>
> >> >> I don't know why mm_release() would page fault, but it looks like it does.
> >> >>
> >> >> So the OOM killer will not kill other processes because it thinks a
> >> >> process is exiting, which will free up memory.  But the exiting
> >> >> process needs memory to continue exiting --> deadlock.  Sounds
> >> >> plausible?
> >> >
> >> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> >> > If normal exited process in exit path requires a page and there is no free page
> >> > any more, it ends up going to OOM path after try to reclaim memory several time.
> >> > Then,
> >> > In select_bad_process,
> >> >
> >> >         if (task->flags & PF_EXITING) {
> >> >                if (task == current)             <== true
> >> >                         return OOM_SCAN_SELECT;
> >> > In oom_kill_process,
> >> >
> >> >         if (p->flags & PF_EXITING)
> >> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
> >> >
> >> > At last, normal exited process would get a free page.
> >> >
> >> > But in your kernel, it seems not because I guess did_some_progress in
> >> > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is
> >> > do_try_to_free_pages's all_unreclaimable can't do his role by your
> >> > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever.
> >> >
> >> > Sounds plausible?
> >>
> >> Thank you Minchan, it does sound plausible, but I have little
> >> experience with this and it will take some work to confirm.
> >
> > No problem :)
> >
> >>
> >> I looked at the patch pretty carefully once, and I had the impression
> >> its effect could be fully analyzed by logical reasoning. I will check
> >> this again tomorrow, perhaps I can run some experiments.  I am adding
> >> Mandeep who wrote the patch.
> >>
> >> However, we have worse problems if we don't use that patch.  Without
> >> the patch, and either with or without compressed swap, the same load
> >> causes horrible thrashing, with the system appearing to hang for
> >> minutes.  If we don't use that patch, do you have any suggestion on
> >> how to improve the code thrash situation?
> >
> > As I said, the motivation of the patch is good for embedded system but
> > patch's implementation is kinda buggy. I will have a look and post if
> > I'm luck to get a time.
> >
> > BTW, a question.
> >
> > How do you find proper value for min_filelist_kbytes?
> > Just experiment with several trial?
> >
> > Thanks.
> 
> Yes.  Mandeep can give more detail, but, as I understand this, the
> value we use (50 Mb) was based on experimentation.  It helps that at
> the moment we run Chrome OS on a relatively uniform set of devices,
> with either 2 or 4 GB of RAM, no swap, binaries stored on SSD (for
> backing store of text pages), and the same load (the Chrome browser).
> 

AFAIR, I recommended mem_notify instead of the hacky patch when Mandeep
submitted it at the beginning. Was there any problem with it?
AFAIK, mem_notify had a problem in that it notified too late, so OOM kills
still happened. Recently, Anton has been working on a new low memory
notifier; it should solve the same problem, so it may be what you need.
https://patchwork.kernel.org/patch/1625251/

Of course, there are further steps before it can be merged, but I think
you can help us with some experiments and add your voice so that it meets
Chrome OS's goals.

Thanks.

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  7:24                                     ` Minchan Kim
@ 2012-10-31 16:07                                       ` Luigi Semenzato
  2012-10-31 17:49                                         ` Mandeep Singh Baines
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31 16:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mandeep Baines

On Wed, Oct 31, 2012 at 12:24 AM, Minchan Kim <minchan@kernel.org> wrote:

> AFAIRC, I recommended mem_notify instead of hacky patch when Mandeep submitted
> at the beginning. Does it have any problem?

When we introduced min_filelist_kbytes, the Chrome browser was not
prepared to take action on low-memory notifications, so we could not
use that approach.  We still needed some way to prevent the system from
thrashing.

A couple of years later we added a "tab discard" feature to Chrome,
which could be used to release memory in Chrome after saving the DOM
state of a tab.  At that time I noticed a similar patch from you,
which I took and slightly modified for our purposes.  I was not aware
of Anton's earlier patch then.  The basic idea of my patch is the same
as yours, but I estimate "easily reclaimable memory" differently.

I wasn't sure my patch would be of interest here, so I never posted it.

Going back to the min_filelist_kbytes patch, it doesn't seem such a bad
idea to have a mechanism that prevents text-page thrashing.  It would
be useful if the system kept working even when nobody is paying
attention to low-memory notifications.  The hacky patch sets a
threshold below which text pages are not evicted, to maintain a
reasonably-sized working set in memory.  Perhaps this threshold should
be set dynamically, based on the rate of page faults due to instruction
fetches?
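
Very roughly, I imagine something like the sketch below (the names,
numbers, and step sizes are invented, and it is not a patch I have
tried): raise the floor while executable pages are faulting in heavily,
and lower it again when the fault rate drops.

static unsigned long file_floor_kb = 50 * 1024;	/* protected file LRU floor */

/* called periodically with the recent rate of executable page faults */
static void adjust_file_floor(unsigned long exec_faults_per_sec)
{
	/* thresholds and step size are made up for illustration */
	if (exec_faults_per_sec > 100 && file_floor_kb < 200 * 1024)
		file_floor_kb += 4096;
	else if (exec_faults_per_sec < 10 && file_floor_kb > 10 * 1024)
		file_floor_kb -= 4096;
}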

> AFAIK, mem_notify had a problem to notify too late so OOM kill still happens.
> Recently, Anton have been tried new low memory notifier and It should solve
> same problem and then it's thing you need.
> https://patchwork.kernel.org/patch/1625251/

Yes, part of the problem is that all these mechanisms are based on
heuristics.  Chrome tab discard is conceptually very similar to OOM
kill.  When Chrome gets a low-memory notification, it discards a tab
and then waits for about 1s before checking if it should discard more
tabs.  If other processes are allocating aggressively (for instance
after issuing commands that load multiple tabs in parallel), they will
use up memory faster than the tab discarder is releasing it.  So it's
essential to have a functioning fall-back mechanism in the kernel.

> Of course, there are further steps to merge it but I think you can help us
> with some experiments and input your voice to meet Chrome OS's goal.

I will look at Anton's notifier and see if it would meet our needs.  Thanks!

>
> Thanks.
>
> --
> Kind regards,
> Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31 16:07                                       ` Luigi Semenzato
@ 2012-10-31 17:49                                         ` Mandeep Singh Baines
  0 siblings, 0 replies; 67+ messages in thread
From: Mandeep Singh Baines @ 2012-10-31 17:49 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, David Rientjes, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

Luigi Semenzato (semenzato@google.com) wrote:
> On Wed, Oct 31, 2012 at 12:24 AM, Minchan Kim <minchan@kernel.org> wrote:
> 
> > AFAIRC, I recommended mem_notify instead of hacky patch when Mandeep submitted
> > at the beginning. Does it have any problem?
> 
> When we introduced min_filelist_kbytes, the Chrome browser was not
> prepared to take actions on low-memory notifications, so we could not
> use that approach.  We still needed somehow to prevent the system from
> thrashing.
> 
> A couple of years later we added a "tab discard" feature to Chrome,
> which could be used to release memory in Chrome after saving the DOM
> state of a tab.  At that time I noticed a similar patch from you,
> which I took and slightly modified for our purposes.  I was not aware
> of Anton's earlier patch then.  The basic idea of my patch is the same
> as yours, but I estimate "easily reclaimable memory" differently.
> 
> I wasn't sure my patch would be of interest here, so I never posted it.
> 
> Going back to the min_filelist_kbytes patch, it doesn't seem that it's
> such a bad idea to have a mechanism that prevents text page thrash.
> It would be useful if the system kept working even if nobody is paying
> attention to low-memory notifications.  The hacky patch sets a
> threshold under which text pages are not evicted, to maintain a
> reasonably-sized working set in memory.  Perhaps this threshold should
> be set dynamically based on the rate of page faults due to instruction
> fetches?
> 

An alternative approach I was considering was to just limit the rate at
which you scan each of the LRU lists. Limit the rate to one complete
scan of the list every scan_period. This would prevent thrashing of
file and anon pages and would require no tuning. You could set scan_period
to one of the scheduler periods.
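
A rough sketch of the idea (illustrative only, not a tested patch; the
struct and function names are made up):

struct lru_scan_limit {
	unsigned long last_full_scan;	/* jiffies of the last complete pass */
};

/*
 * Allow at most one complete pass over an LRU list per scan_period.
 * Returns false when the caller should skip scanning this list for now,
 * which is what prevents the thrashing.
 */
static bool lru_scan_allowed(struct lru_scan_limit *limit,
			     unsigned long scan_period)
{
	if (time_after(jiffies, limit->last_full_scan + scan_period)) {
		limit->last_full_scan = jiffies;
		return true;
	}
	return false;
}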

Regards,
Mandeep

> > AFAIK, mem_notify had a problem to notify too late so OOM kill still happens.
> > Recently, Anton have been tried new low memory notifier and It should solve
> > same problem and then it's thing you need.
> > https://patchwork.kernel.org/patch/1625251/
> 
> Yes, part of the problem is that all these mechanisms are based on
> heuristics.  Chrome tab discard is conceptually very similar to OOM
> kill.  When Chrome gets a low-memory notification, it discards a tab
> and then waits for about 1s before checking if it should discard more
> tabs.  If other processes are allocating aggressively (for instance
> after issuing commands that load multiple tabs in parallel), they will
> use up memory faster than the tab discarder is releasing it.  So it's
> essential to have a functioning fall-back mechanism in the kernel.
> 
> > Of course, there are further steps to merge it but I think you can help us
> > with some experiments and input your voice to meet Chrome OS's goal.
> 
> I will look at Anton's notifier and see if it would meet our needs.  Thanks!
> 
> >
> > Thanks.
> >
> > --
> > Kind regards,
> > Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-30 22:32                               ` Luigi Semenzato
@ 2012-10-31 18:42                                 ` David Rientjes
  0 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2012-10-31 18:42 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, 30 Oct 2012, Luigi Semenzato wrote:

> For completeness, I would like to report that the page fault in
> mm_release looks legitimate.  The fault happens near here:
> 
> if (unlikely(tsk->robust_list)) {
>     exit_robust_list(tsk);
>     tsk->robust_list = NULL;
> }
> 
> and robust_list is a userspace structure.
> 

Is this the only place where the hang occurs, i.e. several threads are 
in the exit path with PF_EXITING and that causes the oom killer to defer 
killing a process?  If that's the case, then a simple tsk->robust_list 
check would be sufficient to avoid deferring incorrectly.
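
Something along these lines, against the 3.6 code (an untested sketch,
just to illustrate the idea):

	if (task->flags & PF_EXITING && !force_kill) {
		/*
		 * Only defer for an exiting task if it has no robust
		 * futex list left to walk, i.e. it should not need to
		 * fault again on the way out.
		 */
		if (!task->robust_list &&
		    !(task->group_leader->ptrace & PT_TRACE_EXIT))
			return OOM_SCAN_ABORT;
	}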

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  6:28                                   ` Luigi Semenzato
@ 2012-10-31 18:45                                     ` David Rientjes
  0 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2012-10-31 18:45 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Tue, 30 Oct 2012, Luigi Semenzato wrote:

> To make it clear, I am suggesting that this "fix" might work as a
> temporary workaround until a better fix is available.
> 

A temporary workaround is to do a kill -9 of the hung process since even 
the 3.4 oom killer will automatically give it access to memory reserves.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31  0:57                             ` Minchan Kim
  2012-10-31  1:06                               ` Luigi Semenzato
@ 2012-10-31 18:54                               ` David Rientjes
  2012-10-31 21:40                                 ` Luigi Semenzato
                                                   ` (2 more replies)
  1 sibling, 3 replies; 67+ messages in thread
From: David Rientjes @ 2012-10-31 18:54 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Wed, 31 Oct 2012, Minchan Kim wrote:

> It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> If normal exited process in exit path requires a page and there is no free page
> any more, it ends up going to OOM path after try to reclaim memory several time.
> Then,
> In select_bad_process,
> 
>         if (task->flags & PF_EXITING) {
>                if (task == current)             <== true
>                         return OOM_SCAN_SELECT;
> In oom_kill_process,
> 
>         if (p->flags & PF_EXITING)
>                 set_tsk_thread_flag(p, TIF_MEMDIE);
> 
> At last, normal exited process would get a free page.
> 

select_bad_process() won't actually select the process for oom kill, 
though, if there are PF_EXITING threads other than current.  So if 
multiple threads are page faulting on tsk->robust_list, then no thread 
ends up getting killed.  The temporary workaround would be to do a kill -9 
so that the logic in out_of_memory() could immediately give such threads 
access to memory reserves so the page fault will succeed.  The real fix 
would be to audit all possible cases between setting 
tsk->flags |= PF_EXITING and setting tsk->mm = NULL that could cause a 
memory allocation, and make exemptions for them in oom_scan_process_thread().

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31 18:54                               ` David Rientjes
@ 2012-10-31 21:40                                 ` Luigi Semenzato
  2012-11-01  2:11                                 ` Minchan Kim
  2012-11-01  2:43                                 ` Minchan Kim
  2 siblings, 0 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-10-31 21:40 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

Thanks so much for your help.  There are two issues: one is what we
(Chrome OS) should do, the other is what should be done for ToT Linux.

The fix(es) you propose are harder to understand than mine, and put
additional special conditions in code that is already rife with them.
My fix, instead, removes one such special condition.  It can, in
principle, cause processes to be OOM-killed unnecessarily, but how
likely is that to happen?  We don't actually see it happen, and it
matters little to us if it does.

I would be more than happy to try one of your fixes, but I am not likely
to implement one myself.

On Wed, Oct 31, 2012 at 11:54 AM, David Rientjes <rientjes@google.com> wrote:
> On Wed, 31 Oct 2012, Minchan Kim wrote:
>
>> It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
>> If normal exited process in exit path requires a page and there is no free page
>> any more, it ends up going to OOM path after try to reclaim memory several time.
>> Then,
>> In select_bad_process,
>>
>>         if (task->flags & PF_EXITING) {
>>                if (task == current)             <== true
>>                         return OOM_SCAN_SELECT;
>> In oom_kill_process,
>>
>>         if (p->flags & PF_EXITING)
>>                 set_tsk_thread_flag(p, TIF_MEMDIE);
>>
>> At last, normal exited process would get a free page.
>>
>
> select_bad_process() won't actually select the process for oom kill,
> though, if there are other PF_EXITING threads other than current.  So if
> multiple threads are page faulting on tsk->robust_list, then no thread
> ends up getting killed.  The temporary workaround would be to do a kill -9
> so that the logic in out_of_memory() could immediately give such threads
> access to memory reserves so the page fault will succeed.

When we discover the thread in such a state, it's already in do_exit()
and waiting for the page fault to complete.  Will it wait forever, or
time out and retry?  Is it acceptable, and sufficient, to change
task->exit_code on the fly?  If not, what else?  It is quite difficult
to analyze that code.

>  The real fix
> would be to audit all possible cases in between setting
> tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory
> allocation and make exemptions for them in oom_scan_process_thread().

I think I probably slightly disagree with this.  It's an extra step in
the direction of unmaintainability.  Wouldn't it be better to disallow
a thread from making allocations in that section, fix all the places
where it does, and panic to catch missed occurrences or new ones?

Otherwise the OOM module will have to know additional details about
what threads are doing, or threads will have to maintain that state
(task->exiting_but_may_still_allocate = 1).  Isn't there already too
much of this stuff going on?
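
For instance, a check along these lines in the allocator slow path
(hand-wavy and untested; the condition would certainly need refinement)
would at least flag the spots that need fixing:

	/* catch allocations made after PF_EXITING but before mm is released */
	WARN_ONCE((current->flags & PF_EXITING) && current->mm,
		  "allocation while exiting, before mm release\n");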

Thanks again!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31 18:54                               ` David Rientjes
  2012-10-31 21:40                                 ` Luigi Semenzato
@ 2012-11-01  2:11                                 ` Minchan Kim
  2012-11-01  4:38                                   ` David Rientjes
  2012-11-01  2:43                                 ` Minchan Kim
  2 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-01  2:11 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Wed, Oct 31, 2012 at 11:54:07AM -0700, David Rientjes wrote:
> On Wed, 31 Oct 2012, Minchan Kim wrote:
> 
> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> > If normal exited process in exit path requires a page and there is no free page
> > any more, it ends up going to OOM path after try to reclaim memory several time.
> > Then,
> > In select_bad_process,
> > 
> >         if (task->flags & PF_EXITING) {
> >                if (task == current)             <== true
> >                         return OOM_SCAN_SELECT;
> > In oom_kill_process,
> > 
> >         if (p->flags & PF_EXITING)
> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
> > 
> > At last, normal exited process would get a free page.
> > 
> 
> select_bad_process() won't actually select the process for oom kill, 
> though, if there are other PF_EXITING threads other than current.  So if 
> multiple threads are page faulting on tsk->robust_list, then no thread 
> ends up getting killed.  The temporary workaround would be to do a kill -9 

If multiple threads are page faulting and trying to allocate memory, then
they should go down the OOM path and reach the following code.

        if (task->flags & PF_EXITING) {
               if (task == current)
                        return OOM_SCAN_SELECT;

So the thread can access the reserved memory pool and the page fault will succeed.

> so that the logic in out_of_memory() could immediately give such threads 
> access to memory reserves so the page fault will succeed.  The real fix 
> would be to audit all possible cases in between setting 
> tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory 
> allocation and make exemptions for them in oom_scan_process_thread().
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-10-31 18:54                               ` David Rientjes
  2012-10-31 21:40                                 ` Luigi Semenzato
  2012-11-01  2:11                                 ` Minchan Kim
@ 2012-11-01  2:43                                 ` Minchan Kim
  2012-11-01  4:48                                   ` David Rientjes
  2 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-01  2:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
	Sonny Rao, Mel Gorman

On Wed, Oct 31, 2012 at 11:54:07AM -0700, David Rientjes wrote:
> On Wed, 31 Oct 2012, Minchan Kim wrote:
> 
> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch.
> > If normal exited process in exit path requires a page and there is no free page
> > any more, it ends up going to OOM path after try to reclaim memory several time.
> > Then,
> > In select_bad_process,
> > 
> >         if (task->flags & PF_EXITING) {
> >                if (task == current)             <== true
> >                         return OOM_SCAN_SELECT;
> > In oom_kill_process,
> > 
> >         if (p->flags & PF_EXITING)
> >                 set_tsk_thread_flag(p, TIF_MEMDIE);
> > 
> > At last, normal exited process would get a free page.
> > 
> 
> select_bad_process() won't actually select the process for oom kill, 
> though, if there are other PF_EXITING threads other than current.  So if 
> multiple threads are page faulting on tsk->robust_list, then no thread 
> ends up getting killed.  The temporary workaround would be to do a kill -9 
> so that the logic in out_of_memory() could immediately give such threads 
> access to memory reserves so the page fault will succeed.  The real fix 

That's no longer true.
3.6 includes the following code in try_to_free_pages():

        /*   
         * Do not enter reclaim if fatal signal is pending. 1 is returned so
         * that the page allocator does not consider triggering OOM
         */
        if (fatal_signal_pending(current))
                return 1;

So the hung task never goes down the OOM path and could loop forever.

> would be to audit all possible cases in between setting 
> tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory 
> allocation and make exemptions for them in oom_scan_process_thread().
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Kind regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-01  2:11                                 ` Minchan Kim
@ 2012-11-01  4:38                                   ` David Rientjes
  2012-11-01  5:18                                     ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-11-01  4:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Minchan Kim wrote:

> If mutiple threads are page faulting and try to allocate memory, then they
> should go to oom path and they will reach following code.
> 
>         if (task->flags & PF_EXITING) {
>                if (task == current)
>                         return OOM_SCAN_SELECT;
> 

No, OOM_SCAN_SELECT does not return immediately and kill that process; it 
only prefers to kill that process first iff the oom killer isn't deferred 
because it finds TIF_MEMDIE threads or PF_EXITING threads other than 
current.  So if multiple processes are in the exit path with PF_EXITING 
and require additional memory, then the oom killer may defer without 
killing anything.  That's what I suspect is happening in this case.
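
For reference, this is roughly the shape of the 3.6 select_bad_process()
loop (paraphrased from memory with locking details elided, so check the
actual source): a single OOM_SCAN_ABORT from any thread bails out of the
whole scan, even if current was already marked OOM_SCAN_SELECT.

	do_each_thread(g, p) {
		switch (oom_scan_process_thread(p, totalpages, nodemask,
						force_kill)) {
		case OOM_SCAN_SELECT:
			chosen = p;
			chosen_points = ULONG_MAX;
			/* fall through */
		case OOM_SCAN_CONTINUE:
			continue;
		case OOM_SCAN_ABORT:
			return ERR_PTR(-1UL);	/* defer: kill nothing */
		case OOM_SCAN_OK:
			break;
		};
		/* ... otherwise score p with oom_badness() ... */
	} while_each_thread(g, p);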

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-01  2:43                                 ` Minchan Kim
@ 2012-11-01  4:48                                   ` David Rientjes
  2012-11-01  5:26                                     ` Minchan Kim
                                                       ` (2 more replies)
  0 siblings, 3 replies; 67+ messages in thread
From: David Rientjes @ 2012-11-01  4:48 UTC (permalink / raw)
  To: Minchan Kim, Mel Gorman
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Minchan Kim wrote:

> It's not true any more.
> 3.6 includes following code in try_to_free_pages
> 
>         /*   
>          * Do not enter reclaim if fatal signal is pending. 1 is returned so
>          * that the page allocator does not consider triggering OOM
>          */
>         if (fatal_signal_pending(current))
>                 return 1;
> 
> So the hunged task never go to the OOM path and could be looping forever.
> 

Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
storage").  Thanks for adding Mel to the cc.

The oom killer specifically has logic for this condition: when calling 
out_of_memory() the first thing it does is

	if (fatal_signal_pending(current))
		set_thread_flag(TIF_MEMDIE);

to allow it access to memory reserves so that it may exit if it's having 
trouble.  But that ends up never happening because of the above code that 
Minchan has identified.

So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
as well, or revert that early return entirely; there's no justification 
given for it either in the comment or in the commit log.  I'd rather remove 
it and allow the oom killer to trigger and grant access to memory reserves 
itself if necessary.
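
The first option would be something like this (untested):

	/*
	 * A fatally signalled task is about to skip direct reclaim; give
	 * it access to memory reserves so it can make progress and exit
	 * instead of looping in the allocator.
	 */
	if (fatal_signal_pending(current)) {
		set_thread_flag(TIF_MEMDIE);
		return 1;
	}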

Mel, how does commit 5515061d22f0 deal with threads looping forever if 
they need memory in the exit path since the oom killer never gets called?

That aside, it doesn't seem like this is the issue that Luigi is reporting 
since his patch that avoids deferring the oom killer presumably fixes the 
issue for him.  So it turns out the oom killer must be getting called.

Luigi, can you try this instead?  It applies to the latest git but should 
be easily modified to apply to any 3.x kernel you're running.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
 
-	if (task->flags & PF_EXITING) {
+	if (task->flags & PF_EXITING && !force_kill) {
 		/*
-		 * If task is current and is in the process of releasing memory,
-		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
-		 * access memory reserves.  Otherwise, it may stall forever.
-		 *
-		 * The iteration isn't broken here, however, in case other
-		 * threads are found to have already been oom killed.
+		 * If this task is not being ptraced on exit, then wait for it
+		 * to finish before killing some other task unnecessarily.
 		 */
-		if (task == current)
-			return OOM_SCAN_SELECT;
-		else if (!force_kill) {
-			/*
-			 * If this task is not being ptraced on exit, then wait
-			 * for it to finish before killing some other task
-			 * unnecessarily.
-			 */
-			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-				return OOM_SCAN_ABORT;
-		}
+		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
+			return OOM_SCAN_ABORT;
 	}
 	return OOM_SCAN_OK;
 }
@@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 
 	/*
-	 * If current has a pending SIGKILL, then automatically select it.  The
-	 * goal is to allow it to allocate so that it may quickly exit and free
-	 * its memory.
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current)) {
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}


* Re: zram OOM behavior
  2012-11-01  4:38                                   ` David Rientjes
@ 2012-11-01  5:18                                     ` Minchan Kim
  0 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2012-11-01  5:18 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Wed, Oct 31, 2012 at 09:38:47PM -0700, David Rientjes wrote:
> On Thu, 1 Nov 2012, Minchan Kim wrote:
> 
> > If multiple threads are page faulting and trying to allocate memory, then they
> > should go down the OOM path and reach the following code.
> > 
> >         if (task->flags & PF_EXITING) {
> >                if (task == current)
> >                         return OOM_SCAN_SELECT;
> > 
> 
> No, OOM_SCAN_SELECT does not return immediately and kill that process; it 
> only prefers to kill that process first iff the oom killer isn't deferred 
> because it finds TIF_MEMDIE threads or other PF_EXITING threads other than 
> current.  So if multiple processes are in the exit path with PF_EXITING 
> and require additional memory then the oom killer may defer without 
> killing anything.  That's what I suspect is happening in this case.

Indeed.
Thanks for correcting me, David.


-- 
Kind regards,
Minchan Kim


* Re: zram OOM behavior
  2012-11-01  4:48                                   ` David Rientjes
@ 2012-11-01  5:26                                     ` Minchan Kim
  2012-11-01  8:28                                     ` Mel Gorman
  2012-11-01 17:50                                     ` Luigi Semenzato
  2 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2012-11-01  5:26 UTC (permalink / raw)
  To: David Rientjes
  Cc: Mel Gorman, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> On Thu, 1 Nov 2012, Minchan Kim wrote:
> 
> > It's not true any more.
> > 3.6 includes following code in try_to_free_pages
> > 
> >         /*   
> >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> >          * that the page allocator does not consider triggering OOM
> >          */
> >         if (fatal_signal_pending(current))
> >                 return 1;
> > 
> > So the hung task never reaches the OOM path and could be looping forever.
> > 
> 
> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> storage").  Thanks for adding Mel to the cc.
> 
> The oom killer specifically has logic for this condition: when calling 
> out_of_memory() the first thing it does is
> 
> 	if (fatal_signal_pending(current))
> 		set_thread_flag(TIF_MEMDIE);
> 
> to allow it access to memory reserves so that it may exit if it's having 
> trouble.  But that ends up never happening because of the above code that 
> Minchan has identified.
> 
> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> as well or revert that early return entirely; there's no justification 
> given for it in the comment nor in the commit log.  I'd rather remove it 
> and allow the oom killer to trigger and grant access to memory reserves 
> itself if necessary.
> 
> Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> they need memory in the exit path since the oom killer never gets called?
> 
> That aside, it doesn't seem like this is the issue that Luigi is reporting 
> since his patch that avoids deferring the oom killer presumably fixes the 
> issue for him.  So it turns out the oom killer must be getting called.

Exactly.

> 
> Luigi, can you try this instead?  It applies to the latest git but should 
> be easily modified to apply to any 3.x kernel you're running.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>  	if (!task->mm)
>  		return OOM_SCAN_CONTINUE;
>  
> -	if (task->flags & PF_EXITING) {
> +	if (task->flags & PF_EXITING && !force_kill) {
>  		/*
> -		 * If task is current and is in the process of releasing memory,
> -		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
> -		 * access memory reserves.  Otherwise, it may stall forever.
> -		 *
> -		 * The iteration isn't broken here, however, in case other
> -		 * threads are found to have already been oom killed.
> +		 * If this task is not being ptraced on exit, then wait for it
> +		 * to finish before killing some other task unnecessarily.
>  		 */
> -		if (task == current)
> -			return OOM_SCAN_SELECT;
> -		else if (!force_kill) {
> -			/*
> -			 * If this task is not being ptraced on exit, then wait
> -			 * for it to finish before killing some other task
> -			 * unnecessarily.
> -			 */
> -			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> -				return OOM_SCAN_ABORT;
> -		}
> +		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> +			return OOM_SCAN_ABORT;
>  	}
>  	return OOM_SCAN_OK;
>  }
> @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>  		return;
>  
>  	/*
> -	 * If current has a pending SIGKILL, then automatically select it.  The
> -	 * goal is to allow it to allocate so that it may quickly exit and free
> -	 * its memory.
> +	 * If current has a pending SIGKILL or is exiting, then automatically
> +	 * select it.  The goal is to allow it to allocate so that it may
> +	 * quickly exit and free its memory.
>  	 */
> -	if (fatal_signal_pending(current)) {
> +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
>  		set_thread_flag(TIF_MEMDIE);
>  		return;
>  	}

Looks good to me.

-- 
Kind regards,
Minchan Kim


* Re: zram OOM behavior
  2012-11-01  4:48                                   ` David Rientjes
  2012-11-01  5:26                                     ` Minchan Kim
@ 2012-11-01  8:28                                     ` Mel Gorman
  2012-11-01 15:57                                       ` Luigi Semenzato
  2012-11-01 17:50                                     ` Luigi Semenzato
  2 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-01  8:28 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> On Thu, 1 Nov 2012, Minchan Kim wrote:
> 
> > It's not true any more.
> > 3.6 includes following code in try_to_free_pages
> > 
> >         /*   
> >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> >          * that the page allocator does not consider triggering OOM
> >          */
> >         if (fatal_signal_pending(current))
> >                 return 1;
> > 
> > So the hung task never reaches the OOM path and could be looping forever.
> > 
> 
> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> storage").  Thanks for adding Mel to the cc.
> 

Indeed, thanks.

> The oom killer specifically has logic for this condition: when calling 
> out_of_memory() the first thing it does is
> 
> 	if (fatal_signal_pending(current))
> 		set_thread_flag(TIF_MEMDIE);
> 
> to allow it access to memory reserves so that it may exit if it's having 
> trouble.  But that ends up never happening because of the above code that 
> Minchan has identified.
> 
> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> as well or revert that early return entirely; there's no justification 
> given for it in the comment nor in the commit log. 

The check for fatal signal is in the wrong place. The reason it was added
is because a throttled process sleeps in an interruptible sleep.  If a user
forcibly kills a throttled process, it should not result in an OOM kill.

> I'd rather remove it 
> and allow the oom killer to trigger and grant access to memory reserves 
> itself if necessary.
> 
> Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> they need memory in the exit path since the oom killer never gets called?
> 

It doesn't. How about this?

---8<---
mm: vmscan: Check for fatal signals iff the process was throttled

commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage") introduced a check for
fatal signals after a process gets throttled for network storage. The
intention was that if a process was throttled and got killed that it
should not trigger the OOM killer. As pointed out by Minchan Kim and
David Rientjes, this check is in the wrong place and too broad. If a
system is in an OOM situation and a process is exiting, it can loop in
__alloc_pages_slowpath(), calling direct reclaim repeatedly. As the
fatal signal is pending it returns 1 as if it is making forward progress
and can effectively deadlock.

This patch moves the fatal_signal_pending() check after throttling to
throttle_direct_reclaim() where it belongs.

If this patch passes review it should be considered a -stable candidate
for 3.6.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
 1 file changed, 27 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2b7edfa..ca9e37f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
  * Throttle direct reclaimers if backing storage is backed by the network
  * and the PFMEMALLOC reserve for the preferred node is getting dangerously
  * depleted. kswapd will continue to make progress and wake the processes
- * when the low watermark is reached
+ * when the low watermark is reached.
+ *
+ * Returns true if a fatal signal was delivered during throttling. If this
+ * happens, the page allocator should not consider triggering the OOM killer.
  */
-static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
+static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 					nodemask_t *nodemask)
 {
 	struct zone *zone;
@@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	 * processes to block on log_wait_commit().
 	 */
 	if (current->flags & PF_KTHREAD)
-		return;
+		goto out;
+
+	/*
+	 * If a fatal signal is pending, this process should not throttle.
+	 * It should return quickly so it can exit and free its memory
+	 */
+	if (fatal_signal_pending(current))
+		goto out;
 
 	/* Check if the pfmemalloc reserves are ok */
 	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
 	pgdat = zone->zone_pgdat;
 	if (pfmemalloc_watermark_ok(pgdat))
-		return;
+		goto out;
 
 	/* Account for the throttling */
 	count_vm_event(PGSCAN_DIRECT_THROTTLE);
@@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
 	if (!(gfp_mask & __GFP_FS)) {
 		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
 			pfmemalloc_watermark_ok(pgdat), HZ);
-		return;
+
+		goto check_pending;
 	}
 
 	/* Throttle until kswapd wakes the process */
 	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
 		pfmemalloc_watermark_ok(pgdat));
+
+check_pending:
+	if (fatal_signal_pending(current))
+		return true;
+
+out:
+	return false;
 }
 
 unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
@@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.gfp_mask = sc.gfp_mask,
 	};
 
-	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
-
 	/*
-	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
-	 * that the page allocator does not consider triggering OOM
+	 * Do not enter reclaim if fatal signal was delivered while throttled.
+	 * 1 is returned so that the page allocator does not OOM kill at this
+	 * point.
 	 */
-	if (fatal_signal_pending(current))
+	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
 		return 1;
 
 	trace_mm_vmscan_direct_reclaim_begin(order,


* Re: zram OOM behavior
  2012-11-01  8:28                                     ` Mel Gorman
@ 2012-11-01 15:57                                       ` Luigi Semenzato
  2012-11-01 15:58                                         ` Luigi Semenzato
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-11-01 15:57 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Thu, Nov 1, 2012 at 1:28 AM, Mel Gorman <mgorman@suse.de> wrote:
> On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
>> On Thu, 1 Nov 2012, Minchan Kim wrote:
>>
>> > It's not true any more.
>> > 3.6 includes following code in try_to_free_pages
>> >
>> >         /*
>> >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
>> >          * that the page allocator does not consider triggering OOM
>> >          */
>> >         if (fatal_signal_pending(current))
>> >                 return 1;
>> >
>> > So the hung task never reaches the OOM path and could be looping forever.
>> >
>>
>> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct
>> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network
>> storage").  Thanks for adding Mel to the cc.
>>
>
> Indeed, thanks.
>
>> The oom killer specifically has logic for this condition: when calling
>> out_of_memory() the first thing it does is
>>
>>       if (fatal_signal_pending(current))
>>               set_thread_flag(TIF_MEMDIE);
>>
>> to allow it access to memory reserves so that it may exit if it's having
>> trouble.  But that ends up never happening because of the above code that
>> Minchan has identified.
>>
>> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages()
>> as well or revert that early return entirely; there's no justification
>> given for it in the comment nor in the commit log.
>
> The check for fatal signal is in the wrong place. The reason it was added
>> is because a throttled process sleeps in an interruptible sleep.  If a user
>> forcibly kills a throttled process, it should not result in an OOM kill.
>
>> I'd rather remove it
>> and allow the oom killer to trigger and grant access to memory reserves
>> itself if necessary.
>>
>> Mel, how does commit 5515061d22f0 deal with threads looping forever if
>> they need memory in the exit path since the oom killer never gets called?
>>
>
> It doesn't. How about this?
>
> ---8<---
> mm: vmscan: Check for fatal signals iff the process was throttled
>
> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in an OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath(), calling direct reclaim repeatedly. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
>
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs.
>
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b7edfa..ca9e37f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this
> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>                                         nodemask_t *nodemask)
>  {
>         struct zone *zone;
> @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>          * processes to block on log_wait_commit().
>          */
>         if (current->flags & PF_KTHREAD)
> -               return;
> +               goto out;
> +
> +       /*
> +        * If a fatal signal is pending, this process should not throttle.
> +        * It should return quickly so it can exit and free its memory
> +        */
> +       if (fatal_signal_pending(current))
> +               goto out;
>
>         /* Check if the pfmemalloc reserves are ok */
>         first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>         pgdat = zone->zone_pgdat;
>         if (pfmemalloc_watermark_ok(pgdat))
> -               return;
> +               goto out;
>
>         /* Account for the throttling */
>         count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>         if (!(gfp_mask & __GFP_FS)) {
>                 wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>                         pfmemalloc_watermark_ok(pgdat), HZ);
> -               return;
> +
> +               goto check_pending;
>         }
>
>         /* Throttle until kswapd wakes the process */
>         wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>                 pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +       if (fatal_signal_pending(current))
> +               return true;
> +
> +out:
> +       return false;
>  }
>
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>                 .gfp_mask = sc.gfp_mask,
>         };
>
> -       throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>         /*
> -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -        * that the page allocator does not consider triggering OOM
> +        * Do not enter reclaim if fatal signal was delivered while throttled.
> +        * 1 is returned so that the page allocator does not OOM kill at this
> +        * point.
>          */
> -       if (fatal_signal_pending(current))
> +       if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>                 return 1;
>
>         trace_mm_vmscan_direct_reclaim_begin(order,


* Re: zram OOM behavior
  2012-11-01 15:57                                       ` Luigi Semenzato
@ 2012-11-01 15:58                                         ` Luigi Semenzato
  2012-11-01 21:48                                           ` David Rientjes
  0 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-11-01 15:58 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

(Sorry, slip of finger.)

On Thu, Nov 1, 2012 at 8:57 AM, Luigi Semenzato <semenzato@google.com> wrote:
> On Thu, Nov 1, 2012 at 1:28 AM, Mel Gorman <mgorman@suse.de> wrote:
>> On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
>>> On Thu, 1 Nov 2012, Minchan Kim wrote:
>>>
>>> > It's not true any more.
>>> > 3.6 includes following code in try_to_free_pages
>>> >
>>> >         /*
>>> >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
>>> >          * that the page allocator does not consider triggering OOM
>>> >          */
>>> >         if (fatal_signal_pending(current))
>>> >                 return 1;
>>> >
>>> > So the hung task never reaches the OOM path and could be looping forever.
>>> >
>>>
>>> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct
>>> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network
>>> storage").  Thanks for adding Mel to the cc.
>>>
>>
>> Indeed, thanks.
>>
>>> The oom killer specifically has logic for this condition: when calling
>>> out_of_memory() the first thing it does is
>>>
>>>       if (fatal_signal_pending(current))
>>>               set_thread_flag(TIF_MEMDIE);
>>>
>>> to allow it access to memory reserves so that it may exit if it's having
>>> trouble.  But that ends up never happening because of the above code that
>>> Minchan has identified.
>>>
>>> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages()
>>> as well or revert that early return entirely; there's no justification
>>> given for it in the comment nor in the commit log.
>>
>> The check for fatal signal is in the wrong place. The reason it was added
>> is because a throttled process sleeps in an interruptible sleep.  If a user
>> forcibly kills a throttled process, it should not result in an OOM kill.
>>
>>> I'd rather remove it
>>> and allow the oom killer to trigger and grant access to memory reserves
>>> itself if necessary.
>>>
>>> Mel, how does commit 5515061d22f0 deal with threads looping forever if
>>> they need memory in the exit path since the oom killer never gets called?
>>>
>>
>> It doesn't. How about this?
>>
>> ---8<---
>> mm: vmscan: Check for fatal signals iff the process was throttled
>>
>> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
>> are low and swap is backed by network storage") introduced a check for
>> fatal signals after a process gets throttled for network storage. The
>> intention was that if a process was throttled and got killed that it
>> should not trigger the OOM killer. As pointed out by Minchan Kim and
>> David Rientjes, this check is in the wrong place and too broad. If a
>> system is in an OOM situation and a process is exiting, it can loop in
>> __alloc_pages_slowpath(), calling direct reclaim repeatedly. As the
>> fatal signal is pending it returns 1 as if it is making forward progress
>> and can effectively deadlock.
>>
>> This patch moves the fatal_signal_pending() check after throttling to
>> throttle_direct_reclaim() where it belongs.
>>
>> If this patch passes review it should be considered a -stable candidate
>> for 3.6.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>>  1 file changed, 27 insertions(+), 10 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 2b7edfa..ca9e37f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>>   * Throttle direct reclaimers if backing storage is backed by the network
>>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>>   * depleted. kswapd will continue to make progress and wake the processes
>> - * when the low watermark is reached
>> + * when the low watermark is reached.
>> + *
>> + * Returns true if a fatal signal was delivered during throttling. If this
>> + * happens, the page allocator should not consider triggering the OOM killer.
>>   */
>> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>>                                         nodemask_t *nodemask)
>>  {
>>         struct zone *zone;
>> @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>>          * processes to block on log_wait_commit().
>>          */
>>         if (current->flags & PF_KTHREAD)
>> -               return;
>> +               goto out;
>> +
>> +       /*
>> +        * If a fatal signal is pending, this process should not throttle.
>> +        * It should return quickly so it can exit and free its memory
>> +        */
>> +       if (fatal_signal_pending(current))
>> +               goto out;
>>
>>         /* Check if the pfmemalloc reserves are ok */
>>         first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>>         pgdat = zone->zone_pgdat;
>>         if (pfmemalloc_watermark_ok(pgdat))
>> -               return;
>> +               goto out;
>>
>>         /* Account for the throttling */
>>         count_vm_event(PGSCAN_DIRECT_THROTTLE);
>> @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>>         if (!(gfp_mask & __GFP_FS)) {
>>                 wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>>                         pfmemalloc_watermark_ok(pgdat), HZ);
>> -               return;
>> +
>> +               goto check_pending;
>>         }
>>
>>         /* Throttle until kswapd wakes the process */
>>         wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>>                 pfmemalloc_watermark_ok(pgdat));
>> +
>> +check_pending:
>> +       if (fatal_signal_pending(current))
>> +               return true;
>> +
>> +out:
>> +       return false;
>>  }
>>
>>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>> @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>>                 .gfp_mask = sc.gfp_mask,
>>         };
>>
>> -       throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>> -
>>         /*
>> -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
>> -        * that the page allocator does not consider triggering OOM
>> +        * Do not enter reclaim if fatal signal was delivered while throttled.
>> +        * 1 is returned so that the page allocator does not OOM kill at this
>> +        * point.
>>          */
>> -       if (fatal_signal_pending(current))
>> +       if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>>                 return 1;
>>
>>         trace_mm_vmscan_direct_reclaim_begin(order,


So which one should I try first, David's change or Mel's?

Does Mel's change take into account the fact that the exiting process
is already deep into do_exit() (exit_mm() to be precise) when it tries
to allocate?


* Re: zram OOM behavior
  2012-11-01  4:48                                   ` David Rientjes
  2012-11-01  5:26                                     ` Minchan Kim
  2012-11-01  8:28                                     ` Mel Gorman
@ 2012-11-01 17:50                                     ` Luigi Semenzato
  2012-11-01 21:50                                       ` David Rientjes
  2 siblings, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-11-01 17:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Wed, Oct 31, 2012 at 9:48 PM, David Rientjes <rientjes@google.com> wrote:
> On Thu, 1 Nov 2012, Minchan Kim wrote:
>
>> It's not true any more.
>> 3.6 includes following code in try_to_free_pages
>>
>>         /*
>>          * Do not enter reclaim if fatal signal is pending. 1 is returned so
>>          * that the page allocator does not consider triggering OOM
>>          */
>>         if (fatal_signal_pending(current))
>>                 return 1;
>>
>> So the hung task never reaches the OOM path and could be looping forever.
>>
>
> Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct
> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network
> storage").  Thanks for adding Mel to the cc.
>
> The oom killer specifically has logic for this condition: when calling
> out_of_memory() the first thing it does is
>
>         if (fatal_signal_pending(current))
>                 set_thread_flag(TIF_MEMDIE);
>
> to allow it access to memory reserves so that it may exit if it's having
> trouble.  But that ends up never happening because of the above code that
> Minchan has identified.
>
> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages()
> as well or revert that early return entirely; there's no justification
> given for it in the comment nor in the commit log.  I'd rather remove it
> and allow the oom killer to trigger and grant access to memory reserves
> itself if necessary.
>
> Mel, how does commit 5515061d22f0 deal with threads looping forever if
> they need memory in the exit path since the oom killer never gets called?
>
> That aside, it doesn't seem like this is the issue that Luigi is reporting
> since his patch that avoids deferring the oom killer presumably fixes the
> issue for him.  So it turns out the oom killer must be getting called.
>
> Luigi, can you try this instead?  It applies to the latest git but should
> be easily modified to apply to any 3.x kernel you're running.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
>         if (!task->mm)
>                 return OOM_SCAN_CONTINUE;
>
> -       if (task->flags & PF_EXITING) {
> +       if (task->flags & PF_EXITING && !force_kill) {
>                 /*
> -                * If task is current and is in the process of releasing memory,
> -                * allow the "kill" to set TIF_MEMDIE, which will allow it to
> -                * access memory reserves.  Otherwise, it may stall forever.
> -                *
> -                * The iteration isn't broken here, however, in case other
> -                * threads are found to have already been oom killed.
> +                * If this task is not being ptraced on exit, then wait for it
> +                * to finish before killing some other task unnecessarily.
>                  */
> -               if (task == current)
> -                       return OOM_SCAN_SELECT;
> -               else if (!force_kill) {
> -                       /*
> -                        * If this task is not being ptraced on exit, then wait
> -                        * for it to finish before killing some other task
> -                        * unnecessarily.
> -                        */
> -                       if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> -                               return OOM_SCAN_ABORT;
> -               }
> +               if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
> +                       return OOM_SCAN_ABORT;
>         }
>         return OOM_SCAN_OK;
>  }
> @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>                 return;
>
>         /*
> -        * If current has a pending SIGKILL, then automatically select it.  The
> -        * goal is to allow it to allocate so that it may quickly exit and free
> -        * its memory.
> +        * If current has a pending SIGKILL or is exiting, then automatically
> +        * select it.  The goal is to allow it to allocate so that it may
> +        * quickly exit and free its memory.
>          */
> -       if (fatal_signal_pending(current)) {
> +       if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
>                 set_thread_flag(TIF_MEMDIE);
>                 return;
>         }

I tested this change with my load and it appears to also prevent the deadlocks.

I have a question though.  I thought only one process was allowed to
be in TIF_MEMDIE state, but I don't see anything that prevents this
code (before or after the change) from setting the flag in multiple
processes.  Is this a problem?

Thanks!


* Re: zram OOM behavior
  2012-11-01 15:58                                         ` Luigi Semenzato
@ 2012-11-01 21:48                                           ` David Rientjes
  0 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2012-11-01 21:48 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Mel Gorman, Minchan Kim, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Luigi Semenzato wrote:

> So which one should I try first, David's change or Mel's?
> 
> Does Mel's change take into account the fact that the exiting process
> is already deep into do_exit() (exit_mm() to be precise) when it tries
> to allocate?
> 

Mel's patch is addressing a separate issue: you've already proven 
that your problem does call the oom killer, which wouldn't occur prior to 
Mel's patch if your thread had SIGKILL pending.  His patch would allow my 
suggested workaround of killing the hung task to end the livelock, though 
that shouldn't be needed after my patch.


* Re: zram OOM behavior
  2012-11-01 17:50                                     ` Luigi Semenzato
@ 2012-11-01 21:50                                       ` David Rientjes
  2012-11-01 21:58                                         ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
  2012-11-01 22:04                                         ` zram OOM behavior Luigi Semenzato
  0 siblings, 2 replies; 67+ messages in thread
From: David Rientjes @ 2012-11-01 21:50 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Luigi Semenzato wrote:

> > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> >                 return;
> >
> >         /*
> > -        * If current has a pending SIGKILL, then automatically select it.  The
> > -        * goal is to allow it to allocate so that it may quickly exit and free
> > -        * its memory.
> > +        * If current has a pending SIGKILL or is exiting, then automatically
> > +        * select it.  The goal is to allow it to allocate so that it may
> > +        * quickly exit and free its memory.
> >          */
> > -       if (fatal_signal_pending(current)) {
> > +       if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
> >                 set_thread_flag(TIF_MEMDIE);
> >                 return;
> >         }
> 
> I tested this change with my load and it appears to also prevent the deadlocks.
> 
> I have a question though.  I thought only one process was allowed to
> be in TIF_MEMDIE state, but I don't see anything that prevents this
> code (before or after the change) from setting the flag in multiple
> processes.  Is this a problem?
> 

The code you've quoted above, prior to being changed by the patch, allows 
any thread with a fatal signal to have access to memory reserves, so it's 
certainly not only one thread with TIF_MEMDIE set at a time (the oom 
killer is not the only thing that can kill a thread).  The goal of that 
code is to ensure anything that has been killed can allocate successfully 
wherever it happens to be running so that it can handle the signal, exit, 
and free its memory.  My patch extends that to all threads in the exit 
path that happen to require memory in order to exit, to prevent a 
livelock.
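
As a rough illustration of what the flag buys a dying task -- approximately
what ALLOC_NO_WATERMARKS amounts to in the allocator, heavily simplified and
not the real code:

#include <stdbool.h>
#include <stdio.h>

struct zone { long free_pages; long watermark_min; };

/* Normal tasks must stay above the min watermark; a TIF_MEMDIE task is
 * allowed to dig into the reserve below it. */
static bool can_allocate(const struct zone *z, bool tif_memdie)
{
	if (tif_memdie)
		return z->free_pages > 0;
	return z->free_pages > z->watermark_min;
}

int main(void)
{
	struct zone z = { .free_pages = 900, .watermark_min = 1024 };	/* below min */

	printf("normal task:      %s\n",
	       can_allocate(&z, false) ? "allocates" : "must reclaim or OOM");
	printf("TIF_MEMDIE task:  %s\n",
	       can_allocate(&z, true) ? "allocates from the reserve" : "fails");
	return 0;
}

The bet, of course, is that anything given this access is about to exit and
give the memory back.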


* [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 21:50                                       ` David Rientjes
@ 2012-11-01 21:58                                         ` David Rientjes
  2012-11-01 22:43                                           ` Andrew Morton
  2012-11-01 22:04                                         ` zram OOM behavior Luigi Semenzato
  1 sibling, 1 reply; 67+ messages in thread
From: David Rientjes @ 2012-11-01 21:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

Exiting threads, those with PF_EXITING set, can pagefault and require 
memory before they can make forward progress.  This happens, for instance, 
when a process must fault task->robust_list, a userspace structure, before 
detaching its memory.

These threads also aren't guaranteed to get access to memory reserves 
unless oom killed or killed from userspace.  The oom killer won't grant 
memory reserves if other threads are also exiting other than current and 
stalling at the same point.  This prevents needlessly killing processes 
when others are already exiting.

Instead of special casing all the possible situations between PF_EXITING 
getting set and a thread detaching its mm where it may allocate memory, 
which probably wouldn't get updated when a change is made to the exit 
path, the solution is to give all exiting threads access to memory 
reserves if they call the oom killer.  This allows them to quickly 
allocate, detach its mm, and free the memory it represents.

Acked-by: Minchan Kim <minchan@kernel.org>
Tested-by: Luigi Semenzato <semenzato@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
 This is old code and has only recently been reported as causing an issue, 
 so deferring to 3.8 seems appropriate.

 mm/oom_kill.c |   31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e0f3e..7e9e911 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;
 
-	if (task->flags & PF_EXITING) {
+	if (task->flags & PF_EXITING && !force_kill) {
 		/*
-		 * If task is current and is in the process of releasing memory,
-		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
-		 * access memory reserves.  Otherwise, it may stall forever.
-		 *
-		 * The iteration isn't broken here, however, in case other
-		 * threads are found to have already been oom killed.
+		 * If this task is not being ptraced on exit, then wait for it
+		 * to finish before killing some other task unnecessarily.
 		 */
-		if (task == current)
-			return OOM_SCAN_SELECT;
-		else if (!force_kill) {
-			/*
-			 * If this task is not being ptraced on exit, then wait
-			 * for it to finish before killing some other task
-			 * unnecessarily.
-			 */
-			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-				return OOM_SCAN_ABORT;
-		}
+		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
+			return OOM_SCAN_ABORT;
 	}
 	return OOM_SCAN_OK;
 }
@@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;
 
 	/*
-	 * If current has a pending SIGKILL, then automatically select it.  The
-	 * goal is to allow it to allocate so that it may quickly exit and free
-	 * its memory.
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current)) {
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}


* Re: zram OOM behavior
  2012-11-01 21:50                                       ` David Rientjes
  2012-11-01 21:58                                         ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
@ 2012-11-01 22:04                                         ` Luigi Semenzato
  2012-11-01 22:25                                           ` David Rientjes
  1 sibling, 1 reply; 67+ messages in thread
From: Luigi Semenzato @ 2012-11-01 22:04 UTC (permalink / raw)
  To: David Rientjes
  Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Thu, Nov 1, 2012 at 2:50 PM, David Rientjes <rientjes@google.com> wrote:
> On Thu, 1 Nov 2012, Luigi Semenzato wrote:
>
>> > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>> >                 return;
>> >
>> >         /*
>> > -        * If current has a pending SIGKILL, then automatically select it.  The
>> > -        * goal is to allow it to allocate so that it may quickly exit and free
>> > -        * its memory.
>> > +        * If current has a pending SIGKILL or is exiting, then automatically
>> > +        * select it.  The goal is to allow it to allocate so that it may
>> > +        * quickly exit and free its memory.
>> >          */
>> > -       if (fatal_signal_pending(current)) {
>> > +       if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
>> >                 set_thread_flag(TIF_MEMDIE);
>> >                 return;
>> >         }
>>
>> I tested this change with my load and it appears to also prevent the deadlocks.
>>
>> I have a question though.  I thought only one process was allowed to
>> be in TIF_MEMDIE state, but I don't see anything that prevents this
>> code (before or after the change) from setting the flag in multiple
>> processes.  Is this a problem?
>>
>
> The code you've quoted above, prior to being changed by the patch, allows
> any thread with a fatal signal to have access to memory reserves, so it's
> certainly not only one thread with TIF_MEMDIE set at a time (the oom
> killer is not the only thing that can kill a thread).  The goal of that
> code is to ensure anything that has been killed can allocate successfully
> wherever it happens to be running so that it can handle the signal, exit,
> and free its memory.  My patch is extending that for all threads that are
> in the exit path that happen to require memory to exit to prevent a
> livelock.

I see.  But then I am wondering: if there is no limit to the number of
threads that can access the reserved memory, then is it possible that
that memory will be exhausted?  Is the size of the reserved memory
based on heuristics then?


* Re: zram OOM behavior
  2012-11-01 22:04                                         ` zram OOM behavior Luigi Semenzato
@ 2012-11-01 22:25                                           ` David Rientjes
  0 siblings, 0 replies; 67+ messages in thread
From: David Rientjes @ 2012-11-01 22:25 UTC (permalink / raw)
  To: Luigi Semenzato
  Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Luigi Semenzato wrote:

> I see.  But then I am wondering: if there is no limit to the number of
> threads that can access the reserved memory, then is it possible that
> that memory will be exhausted?  Is the size of the reserved memory
> based on heuristics then?
> 

We assume that processes with access to memory reserves will eventually 
exit and free their memory; that has always been the case.
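
For what it's worth, the reserve being discussed is bounded by the zones'
min watermark, which is derived from min_free_kbytes; as far as I recall the
default comes from a square-root heuristic in init_per_zone_wmark_min(),
roughly like the sketch below (treat the constants and the lowmem figure as
approximations; the value is also tunable via /proc/sys/vm/min_free_kbytes):

/* build: cc wmark.c -lm */
#include <math.h>
#include <stdio.h>

/* Approximate reconstruction of the default min_free_kbytes heuristic:
 * sqrt(lowmem_kbytes * 16), clamped to [128, 65536] kB.  Not the kernel
 * source. */
static unsigned long default_min_free_kbytes(unsigned long lowmem_kbytes)
{
	unsigned long kb = (unsigned long)sqrt((double)lowmem_kbytes * 16.0);

	if (kb < 128)
		kb = 128;
	if (kb > 65536)
		kb = 65536;
	return kb;
}

int main(void)
{
	unsigned long sizes_mb[] = { 1024, 2048, 4096 };

	for (int i = 0; i < 3; i++) {
		unsigned long kb = default_min_free_kbytes(sizes_mb[i] * 1024);

		printf("%4lu MB lowmem -> min_free_kbytes ~ %5lu kB (~%.1f MB)\n",
		       sizes_mb[i], kb, kb / 1024.0);
	}
	return 0;
}

So the pool is only a few megabytes on typical machines, which is why it
only works if its users exit quickly.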


* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 21:58                                         ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
@ 2012-11-01 22:43                                           ` Andrew Morton
  2012-11-01 23:05                                             ` David Rientjes
  2012-11-01 23:06                                             ` Luigi Semenzato
  0 siblings, 2 replies; 67+ messages in thread
From: Andrew Morton @ 2012-11-01 22:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012 14:58:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Exiting threads, those with PF_EXITING set, can pagefault and require 
> memory before they can make forward progress.  This happens, for instance, 
> when a process must fault task->robust_list, a userspace structure, before 
> detaching its memory.
> 
> These threads also aren't guaranteed to get access to memory reserves 
> unless oom killed or killed from userspace.  The oom killer won't grant 
> memory reserves if other threads are also exiting other than current and 
> stalling at the same point.  This prevents needlessly killing processes 
> when others are already exiting.
> 
> Instead of special casing all the possible situations between PF_EXITING 
> getting set and a thread detaching its mm where it may allocate memory, 
> which probably wouldn't get updated when a change is made to the exit 
> path, the solution is to give all exiting threads access to memory 
> reserves if they call the oom killer.  This allows them to quickly 
> allocate, detach its mm, and free the memory it represents.

Seems very sensible.

> Acked-by: Minchan Kim <minchan@kernel.org>
> Tested-by: Luigi Semenzato <semenzato@google.com>

What did Luigi actually test?  Was there some reproducible bad behavior
which this patch fixes?




* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 22:43                                           ` Andrew Morton
@ 2012-11-01 23:05                                             ` David Rientjes
  2012-11-01 23:06                                             ` Luigi Semenzato
  1 sibling, 0 replies; 67+ messages in thread
From: David Rientjes @ 2012-11-01 23:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Andrew Morton wrote:

> > Exiting threads, those with PF_EXITING set, can pagefault and require 
> > memory before they can make forward progress.  This happens, for instance, 
> > when a process must fault task->robust_list, a userspace structure, before 
> > detaching its memory.
> > 
> > These threads also aren't guaranteed to get access to memory reserves 
> > unless oom killed or killed from userspace.  The oom killer won't grant 
> > memory reserves if other threads are also exiting other than current and 
> > stalling at the same point.  This prevents needlessly killing processes 
> > when others are already exiting.
> > 
> > Instead of special casing all the possible situations between PF_EXITING 
> > getting set and a thread detaching its mm where it may allocate memory, 
> > which probably wouldn't get updated when a change is made to the exit 
> > path, the solution is to give all exiting threads access to memory 
> > reserves if they call the oom killer.  This allows them to quickly 
> > allocate, detach its mm, and free the memory it represents.
> 
> Seems very sensible.
> 
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > Tested-by: Luigi Semenzato <semenzato@google.com>
> 
> What did Luigi actually test?  Was there some reproducible bad behavior
> which this patch fixes?
> 

Yeah, it's briefly described in the first paragraph.  He had an oom 
condition where threads were faulting on task->robust_list and repeatedly 
called the oom killer but it would defer killing a thread because it saw 
other PF_EXITING threads.  This can happen anytime we need to allocate 
memory after setting PF_EXITING and before detaching our mm; if there are 
other threads in the same state then the oom killer won't do anything 
unless one of them happens to be killed from userspace.

So instead of only deferring for PF_EXITING and !task->robust_list, it's 
better to just give them access to memory reserves to prevent a potential 
livelock so that any other faults that may be introduced in the future in 
the exit path don't cause the same problem (and hopefully we don't allow 
too many of those!).
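
For anyone trying to picture where in the exit path this bites, a toy
timeline (the function names only echo the kernel's do_exit() -> exit_mm()
-> mm_release() flow; none of this is the real implementation):

#include <stdbool.h>
#include <stdio.h>

static bool can_allocate;	/* false: the system is out of memory */

/* Touching userspace (the robust futex list, clear_child_tid, ...) can
 * fault, and servicing that fault can require allocating a page. */
static bool touch_user_memory(const char *what)
{
	printf("fault on %s -> needs a page: %s\n",
	       what, can_allocate ? "ok" : "stuck (needs reserves or an OOM kill)");
	return can_allocate;
}

int main(void)
{
	printf("do_exit(): PF_EXITING is set\n");

	/* exit_mm() -> mm_release() walks the userspace robust_list before
	 * the mm is dropped, so the task still needs memory *after* it has
	 * been marked as exiting. */
	if (!touch_user_memory("task->robust_list"))
		return 1;	/* this is the window where the livelock happens */

	printf("mmput(): the task's memory is finally released\n");
	return 0;
}

Until the oom killer treats PF_EXITING like a pending SIGKILL, nothing
breaks the task out of that window.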


* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 22:43                                           ` Andrew Morton
  2012-11-01 23:05                                             ` David Rientjes
@ 2012-11-01 23:06                                             ` Luigi Semenzato
  1 sibling, 0 replies; 67+ messages in thread
From: Luigi Semenzato @ 2012-11-01 23:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Rientjes, Minchan Kim, Mel Gorman, linux-mm,
	Dan Magenheimer, KOSAKI Motohiro, Sonny Rao

On Thu, Nov 1, 2012 at 3:43 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 1 Nov 2012 14:58:18 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
>> Exiting threads, those with PF_EXITING set, can pagefault and require
>> memory before they can make forward progress.  This happens, for instance,
>> when a process must fault task->robust_list, a userspace structure, before
>> detaching its memory.
>>
>> These threads also aren't guaranteed to get access to memory reserves
>> unless oom killed or killed from userspace.  The oom killer won't grant
>> memory reserves if other threads are also exiting other than current and
>> stalling at the same point.  This prevents needlessly killing processes
>> when others are already exiting.
>>
>> Instead of special casing all the possible situations between PF_EXITING
>> getting set and a thread detaching its mm where it may allocate memory,
>> which probably wouldn't get updated when a change is made to the exit
>> path, the solution is to give all exiting threads access to memory
>> reserves if they call the oom killer.  This allows them to quickly
>> allocate, detach its mm, and free the memory it represents.
>
> Seems very sensible.
>
>> Acked-by: Minchan Kim <minchan@kernel.org>
>> Tested-by: Luigi Semenzato <semenzato@google.com>
>
> What did Luigi actually test?  Was there some reproducible bad behavior
> which this patch fixes?

Yes.  I have a load that reliably reproduces the problem (in 3.4), and
it goes away with this change.


* Re: zram OOM behavior
  2012-11-12 14:06                 ` Mel Gorman
@ 2012-11-13 13:31                   ` Minchan Kim
  0 siblings, 0 replies; 67+ messages in thread
From: Minchan Kim @ 2012-11-13 13:31 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 12, 2012 at 02:06:31PM +0000, Mel Gorman wrote:
> On Mon, Nov 12, 2012 at 10:32:18PM +0900, Minchan Kim wrote:
> > Sorry for the late reply.
> > I'm still on a training course until this week, so my responses may be delayed, too.
> > 
> > > > > > > <SNIP>
> > > > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > > > it will be able to use the pfmemalloc reserves.
> > > > > > > 
> > > > > > > > If it can't make forward progress with direct reclaim, it can end up in the OOM path, but
> > > > > > > > out_of_memory() checks whether current has a fatal signal pending and allows it to access the
> > > > > > > > reserved memory pool for a quick exit, returning without selecting another victim.
> > > > > > > 
> > > > > > > While this is true, what advantage is there to having a killed process
> > > > > > > potentially reclaiming memory it does not need to?
> > > > > > 
> > > > > > A killed process needs some memory in order to terminate. I think it's not a good idea for it
> > > > > > to use the reserved memory pool unconditionally just because it is throttled and killed.
> > > > > > The reserved memory pool is a strictly limited emergency resource, so using it should be a
> > > > > > last resort after the process fails to reclaim.
> > > > > > 
> > > > > 
> > > > > Part of that reclaim can be the process reclaiming its own pages and
> > > > > putting them in swap just so it can exit shortly afterwards. If it was
> > > > > throttled in this path, it implies that swap-over-NFS is enabled where
> > > > 
> > > > Could we make sure it's only the case for swap-over-NFS?
> > > 
> > > The PFMEMALLOC reserves being consumed to the point of throttline is only
> > > expected in the case of swap-over-network -- check the pgscan_direct_throttle
> > > counter to be sure. So it's already the case that this throttling logic and
> > > its signal handling is mostly a swap-over-NFS thing. It is possible that
> > > a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> > > could force a situation where a process gets throttled but I'm not aware
> > > of a case where this happens todays.
> > 
> > I saw some custom drviers in embedded side have used GFP_ATOMIC easily to protect
> > avoiding deadlock.
> 
> They must be getting a lot of allocation failures in that case.

It depends on the workload, and I haven't received any reports from them.

> 
> > Of course, it's not a good behavior but it lives with us.
> > Even, we can't fix it because we don't have any source. :(
> > 
> > > 
> > > > I think it can happen if the system has very slow thumb card.
> > > > 
> > > 
> > > How? They shouldn't be stuck in throttling in this case. They should be
> > > blocked on IO, congestion wait, dirty throttling etc.
> > 
> > Some block driver(ex, mmc) uses a thread model with PF_MEMALLOC so I think
> > they can be stucked by the throttling logic.
> > 
> 
> If they are using PF_MEMALLOC + GFP_ATOMIC, there is a strong chance
> that they'll actually deadlock their system if there is a storm of
> allocations. Such drivers are fundamentally broken in a dangerous way.
> None of that is fixed by forcing an exiting process to enter direct reclaim.

Agreed.

> 
> > > 
> > > > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > > > allocate network buffers. It's potentially unnecessary work because the
> > > > 
> > > > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> > > 
> > > In very low-memory situations - yes. We can be at the min watermark but
> > > still need to allocate a page for a network buffer to swap out the anon page.
> > > 
> > > > Yes. In this case, you're right. I would be better to use reserve pool for
> > > > just exiting instead of swap out over network. But how can you make sure that
> > > > we have only anonymous page when we try to reclaim? 
> > > > If there are some file-backed pages, we can avoid swapout at that time.
> > > > Maybe we need some check.
> > > > 
> > > 
> > > That would be a fairly invasive set of checks for a corner case. if
> > > swap-over-nfs + critically low + about to OOM + file pages available then
> > > only reclaim files.
> > > 
> > > It's getting off track as to why we're having this discussion in the first
> > > place -- looping due to improper handling of fatal signal pending.
> > 
> > If some user tune /proc/sys/vm/swappiness, we could have many page cache pages
> > when swap-over-NFS happens.
> 
> That's a BIG if. swappiness could be anything and it'll depend on the
> workload anyway.

Yes, but that doesn't mean we can ignore such a case.

> 
> > My point is that why do we should use emergency memory pool although we have
> > reclaimalble memory?
> > 
> 
> Because as I have already pointed out, the use of swap-over-nfs itself
> creates more allocation pressure if it is used in the reclaim path. The
> emergency memory pool is used *anyway* unless there are clean file pages
> that can be discarded. But that's a big "if". The safer path is to try
> and exit and if *that* fails *then* enter direct reclaim.

Okay. Let's look at your code again from the point of view of side effects other than the OOM deadlock problem.

1. pfmemalloc_watermark_ok == false, but the process receives SIGKILL
   before calling throttle_direct_reclaim.

In this case, it enters the direct reclaim path and may swap out anon pages.
That is exactly what you are concerned about (i.e., it creates more allocation pressure).
Is that okay?

2. pfmemalloc_watermark_ok == false, but the process receives SIGKILL
   while throttling.

In this case, it skips direct reclaim on the first pass and retries the allocation.
If another process frees some memory or is killed, it can get a free page and
return. Yes, that is better than an unnecessary swap-out and an OOM kill.
Otherwise, it calls direct compaction again and then enters the direct reclaim path.
It ends up consuming the emergency memory pool to swap out anonymous pages, or
gets OOM killed. Again, that is exactly what you are concerned about.

So your patch's effect depends on the timing of other processes releasing memory.
Is that right?
If that is your intention, I don't oppose it any more, because apparently it
has a benefit over what I suggested. But please write the description more clearly.
The previous description below focused only on the OOM deadlock problem and didn't
explain the patch's side effects which I mentioned above.

[
mm: vmscan: Check for fatal signals iff the process was throttled

commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
are low and swap is backed by network storage") introduced a check for
fatal signals after a process gets throttled for network storage. The
intention was that if a process was throttled and got killed that it
should not trigger the OOM killer. As pointed out by Minchan Kim and
David Rientjes, this check is in the wrong place and too broad. If a
system is in an OOM situation and a process is exiting, it can loop in
__alloc_pages_slowpath(), calling direct reclaim in a loop. As the
fatal signal is pending it returns 1 as if it is making forward progress
and can effectively deadlock.

This patch moves the fatal_signal_pending() check after throttling to
throttle_direct_reclaim() where it belongs.

If this patch passes review it should be considered a -stable candidate
for 3.6.
]
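
As an aid to following the two numbered scenarios above, this is roughly the
shape of the 3.6-era allocation slow path they walk through; it is a heavily
simplified sketch with illustrative helper names, not the real mm/page_alloc.c:

	/* Heavily simplified sketch of the slow path discussed above. */
	static struct page *slowpath_sketch(gfp_t gfp_mask, unsigned int order)
	{
		unsigned long progress;
		struct page *page;

	retry:
		/* wake kswapd, relax watermarks, try compaction, ... */

		/*
		 * Direct reclaim.  A task killed before being throttled goes
		 * through reclaim here and may swap out its own anon pages
		 * (scenario 1).  With the patch, a task killed while
		 * throttled has try_to_free_pages() claim progress without
		 * reclaiming, so it skips real reclaim on that pass
		 * (scenario 2).
		 */
		progress = sketch_try_to_free_pages(gfp_mask, order);
		page = sketch_get_page_from_freelist(gfp_mask, order);
		if (page)
			return page;

		/* no page and no progress may end up in the OOM killer */
		if (!progress && sketch_may_oom(gfp_mask))
			sketch_out_of_memory();

		/* otherwise should_alloc_retry() decides whether to loop */
		if (sketch_should_retry(gfp_mask, order, progress))
			goto retry;

		return NULL;
	}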

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-12 13:32               ` Minchan Kim
@ 2012-11-12 14:06                 ` Mel Gorman
  2012-11-13 13:31                   ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-12 14:06 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 12, 2012 at 10:32:18PM +0900, Minchan Kim wrote:
> Sorry for the late reply.
> I'm still on a training course until this week, so my responses will be delayed, too.
> 
> > > > > > <SNIP>
> > > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > > it will be able to use the pfmemalloc reserves.
> > > > > > 
> > > > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > > > for quick exit and return without killing other victim selection.
> > > > > > 
> > > > > > While this is true, what advantage is there to having a killed process
> > > > > > potentially reclaiming memory it does not need to?
> > > > > 
> > > > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > > > pool should be last resort after he fail to reclaim.
> > > > > 
> > > > 
> > > > Part of that reclaim can be the process reclaiming its own pages and
> > > > putting them in swap just so it can exit shortly afterwards. If it was
> > > > throttled in this path, it implies that swap-over-NFS is enabled where
> > > 
> > > Could we make sure it's only the case for swap-over-NFS?
> > 
> > The PFMEMALLOC reserves being consumed to the point of throttline is only
> > expected in the case of swap-over-network -- check the pgscan_direct_throttle
> > counter to be sure. So it's already the case that this throttling logic and
> > its signal handling is mostly a swap-over-NFS thing. It is possible that
> > a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> > could force a situation where a process gets throttled but I'm not aware
> > of a case where this happens todays.
> 
> I have seen some custom drivers on the embedded side use GFP_ATOMIC freely to
> avoid deadlock.

They must be getting a lot of allocation failures in that case.

> Of course, it's not good behavior, but it lives with us.
> We can't even fix it, because we don't have the source. :(
> 
> > 
> > > I think it can happen if the system has very slow thumb card.
> > > 
> > 
> > How? They shouldn't be stuck in throttling in this case. They should be
> > blocked on IO, congestion wait, dirty throttling etc.
> 
> Some block drivers (e.g. mmc) use a thread model with PF_MEMALLOC, so I think
> they can get stuck in the throttling logic.
> 

If they are using PF_MEMALLOC + GFP_ATOMIC, there is a strong chance
that they'll actually deadlock their system if there is a storm of
allocations. Such drivers are fundamentally broken in a dangerous way.
None of that is fixed by forcing an exiting process to enter direct reclaim.

> > 
> > > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > > allocate network buffers. It's potentially unnecessary work because the
> > > 
> > > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> > 
> > In very low-memory situations - yes. We can be at the min watermark but
> > still need to allocate a page for a network buffer to swap out the anon page.
> > 
> > > Yes. In this case, you're right. I would be better to use reserve pool for
> > > just exiting instead of swap out over network. But how can you make sure that
> > > we have only anonymous page when we try to reclaim? 
> > > If there are some file-backed pages, we can avoid swapout at that time.
> > > Maybe we need some check.
> > > 
> > 
> > That would be a fairly invasive set of checks for a corner case. if
> > swap-over-nfs + critically low + about to OOM + file pages available then
> > only reclaim files.
> > 
> > It's getting off track as to why we're having this discussion in the first
> > place -- looping due to improper handling of fatal signal pending.
> 
> If some user tunes /proc/sys/vm/swappiness, we could have many page cache pages
> when swap-over-NFS happens.

That's a BIG if. swappiness could be anything and it'll depend on the
workload anyway.

> My point is: why should we use the emergency memory pool when we still have
> reclaimable memory?
> 

Because as I have already pointed out, the use of swap-over-nfs itself
creates more allocation pressure if it is used in the reclaim path. The
emergency memory pool is used *anyway* unless there are clean file pages
that can be discarded. But that's a big "if". The safer path is to try
and exit and if *that* fails *then* enter direct reclaim.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-09  9:50             ` Mel Gorman
@ 2012-11-12 13:32               ` Minchan Kim
  2012-11-12 14:06                 ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-12 13:32 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

Sorry for the late reply.
I'm still on a training course until this week, so my responses will be delayed, too.

On Fri, Nov 09, 2012 at 09:50:24AM +0000, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 07:17:20PM +0900, Minchan Kim wrote:
> > On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> > > On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > > > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > > > <SNIP>
> > > > > > > In the first version it would never try to enter direct reclaim if a
> > > > > > > fatal signal was pending but always claim that forward progress was
> > > > > > > being made.
> > > > > > 
> > > > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > > > such fatal signal checking trick.
> > > > > > 
> > > > > > How about this?
> > > > > > 
> > > > > 
> > > > > Both will work as expected but....
> > > > > 
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index 10090c8..881619e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > > > >  
> > > > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > > > >  
> > > > > > -       /*
> > > > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > > > -        * that the page allocator does not consider triggering OOM
> > > > > > -        */
> > > > > > -       if (fatal_signal_pending(current))
> > > > > > -               return 1;
> > > > > > -
> > > > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > > > >                                 sc.may_writepage,
> > > > > >                                 gfp_mask);
> > > > > >  
> > > > > > In this case, after throttling, current will try to do direct reclaim and
> > > > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > > > 
> > > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > > it will be able to use the pfmemalloc reserves.
> > > > > 
> > > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > > for quick exit and return without killing other victim selection.
> > > > > 
> > > > > While this is true, what advantage is there to having a killed process
> > > > > potentially reclaiming memory it does not need to?
> > > > 
> > > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > > pool should be last resort after he fail to reclaim.
> > > > 
> > > 
> > > Part of that reclaim can be the process reclaiming its own pages and
> > > putting them in swap just so it can exit shortly afterwards. If it was
> > > throttled in this path, it implies that swap-over-NFS is enabled where
> > 
> > Could we make sure it's only the case for swap-over-NFS?
> 
> The PFMEMALLOC reserves being consumed to the point of throttling is only
> expected in the case of swap-over-network -- check the pgscan_direct_throttle
> counter to be sure. So it's already the case that this throttling logic and
> its signal handling is mostly a swap-over-NFS thing. It is possible that
> a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
> could force a situation where a process gets throttled but I'm not aware
> of a case where this happens today.

I have seen some custom drivers on the embedded side use GFP_ATOMIC freely to
avoid deadlock. Of course, it's not good behavior, but it lives with us.
We can't even fix it, because we don't have the source. :(

> 
> > I think it can happen if the system has very slow thumb card.
> > 
> 
> How? They shouldn't be stuck in throttling in this case. They should be
> blocked on IO, congestion wait, dirty throttling etc.

Some block drivers (e.g. mmc) use a thread model with PF_MEMALLOC, so I think
they can get stuck in the throttling logic.

> 
> > > such reclaim in fact might require the pfmemalloc reserves to be used to
> > > allocate network buffers. It's potentially unnecessary work because the
> > 
> > You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
> 
> In very low-memory situations - yes. We can be at the min watermark but
> still need to allocate a page for a network buffer to swap out the anon page.
> 
> > Yes. In this case, you're right. I would be better to use reserve pool for
> > just exiting instead of swap out over network. But how can you make sure that
> > we have only anonymous page when we try to reclaim? 
> > If there are some file-backed pages, we can avoid swapout at that time.
> > Maybe we need some check.
> > 
> 
> That would be a fairly invasive set of checks for a corner case. if
> swap-over-nfs + critically low + about to OOM + file pages available then
> only reclaim files.
> 
> It's getting off track as to why we're having this discussion in the first
> place -- looping due to improper handling of fatal signal pending.

If some user tunes /proc/sys/vm/swappiness, we could have many page cache pages
when swap-over-NFS happens.
My point is: why should we use the emergency memory pool when we still have
reclaimable memory?

> 
> > > same reserves could have been used to just exit the process.
> > > 
> > > I'll go your way if you insist because it's not like getting throttled
> > > and killed before exit is a common situation and it should work either
> > > way.
> > 
> > I don't want to insist on. Just want to know what's the problem and find
> > better solution. :) 
> > 
> 
> In that case, I'm going to send the patch to Andrew on Monday and avoid
> direct reclaim when a fatal signal is pending in the swap-over-network
> case. Are you ok with that?

Sorry, but I don't think your patch is the best approach.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-06 10:17           ` Minchan Kim
@ 2012-11-09  9:50             ` Mel Gorman
  2012-11-12 13:32               ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-09  9:50 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 07:17:20PM +0900, Minchan Kim wrote:
> On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> > On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > > <SNIP>
> > > > > > In the first version it would never try to enter direct reclaim if a
> > > > > > fatal signal was pending but always claim that forward progress was
> > > > > > being made.
> > > > > 
> > > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > > such fatal signal checking trick.
> > > > > 
> > > > > How about this?
> > > > > 
> > > > 
> > > > Both will work as expected but....
> > > > 
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index 10090c8..881619e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > > >  
> > > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > > >  
> > > > > -       /*
> > > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > > -        * that the page allocator does not consider triggering OOM
> > > > > -        */
> > > > > -       if (fatal_signal_pending(current))
> > > > > -               return 1;
> > > > > -
> > > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > > >                                 sc.may_writepage,
> > > > >                                 gfp_mask);
> > > > >  
> > > > > In this case, after throttling, current will try to do direct reclaim and
> > > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > > 
> > > > It may be completely unnecessary to reclaim memory if the process that was
> > > > throttled and killed just exits quickly. As the fatal signal is pending
> > > > it will be able to use the pfmemalloc reserves.
> > > > 
> > > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > > for quick exit and return without killing other victim selection.
> > > > 
> > > > While this is true, what advantage is there to having a killed process
> > > > potentially reclaiming memory it does not need to?
> > > 
> > > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > > to use reserved memory pool unconditionally although he is throtlled and killed.
> > > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > > pool should be last resort after he fail to reclaim.
> > > 
> > 
> > Part of that reclaim can be the process reclaiming its own pages and
> > putting them in swap just so it can exit shortly afterwards. If it was
> > throttled in this path, it implies that swap-over-NFS is enabled where
> 
> Could we make sure it's only the case for swap-over-NFS?

The PFMEMALLOC reserves being consumed to the point of throttling is only
expected in the case of swap-over-network -- check the pgscan_direct_throttle
counter to be sure. So it's already the case that this throttling logic and
its signal handling is mostly a swap-over-NFS thing. It is possible that
a badly behaving driver using GFP_ATOMIC to allocate long-lived buffers
could force a situation where a process gets throttled but I'm not aware
of a case where this happens today.
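
(For anyone who wants to check this on their own system: the counter is
exported through /proc/vmstat, so a trivial reader like the sketch below,
or a simple grep, is enough.)

	/* Userspace sketch: print the direct-reclaim throttle counter. */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "pgscan_direct_throttle", 22))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}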

> I think it can happen if the system has very slow thumb card.
> 

How? They shouldn't be stuck in throttling in this case. They should be
blocked on IO, congestion wait, dirty throttling etc.

> > such reclaim in fact might require the pfmemalloc reserves to be used to
> > allocate network buffers. It's potentially unnecessary work because the
> 
> You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?

In very low-memory situations - yes. We can be at the min watermark but
still need to allocate a page for a network buffer to swap out the anon page.

> Yes. In this case, you're right. It would be better to use the reserve pool for
> just exiting instead of swapping out over the network. But how can you make sure that
> we have only anonymous pages when we try to reclaim?
> If there are some file-backed pages, we can avoid swapout at that time.
> Maybe we need some check.
> 

That would be a fairly invasive set of checks for a corner case. if
swap-over-nfs + critically low + about to OOM + file pages available then
only reclaim files.

It's getting off track as to why we're having this discussion in the first
place -- looping due to improper handling of fatal signal pending.

> > same reserves could have been used to just exit the process.
> > 
> > I'll go your way if you insist because it's not like getting throttled
> > and killed before exit is a common situation and it should work either
> > way.
> 
> I don't want to insist on it. I just want to understand what the problem is and find
> a better solution. :)
> 

In that case, I'm going to send the patch to Andrew on Monday and avoid
direct reclaim when a fatal signal is pending in the swap-over-network
case. Are you ok with that?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-06  8:58         ` Mel Gorman
@ 2012-11-06 10:17           ` Minchan Kim
  2012-11-09  9:50             ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-06 10:17 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 08:58:22AM +0000, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> > On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > > <SNIP>
> > > > > In the first version it would never try to enter direct reclaim if a
> > > > > fatal signal was pending but always claim that forward progress was
> > > > > being made.
> > > > 
> > > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > > I have Cced you and this patch fixes it but my question is why we need 
> > > > such fatal signal checking trick.
> > > > 
> > > > How about this?
> > > > 
> > > 
> > > Both will work as expected but....
> > > 
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index 10090c8..881619e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > > >  
> > > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > > >  
> > > > -       /*
> > > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > -        * that the page allocator does not consider triggering OOM
> > > > -        */
> > > > -       if (fatal_signal_pending(current))
> > > > -               return 1;
> > > > -
> > > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > > >                                 sc.may_writepage,
> > > >                                 gfp_mask);
> > > >  
> > > > In this case, after throttling, current will try to do direct reclaim and
> > > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > > 
> > > It may be completely unnecessary to reclaim memory if the process that was
> > > throttled and killed just exits quickly. As the fatal signal is pending
> > > it will be able to use the pfmemalloc reserves.
> > > 
> > > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > > for quick exit and return without killing other victim selection.
> > > 
> > > While this is true, what advantage is there to having a killed process
> > > potentially reclaiming memory it does not need to?
> > 
> > Killed process needs a memory for him to be terminated. I think it's not a good idea for him
> > to use reserved memory pool unconditionally although he is throtlled and killed.
> > Because reserved memory pool is very stricted resource for emergency so using reserved memory
> > pool should be last resort after he fail to reclaim.
> > 
> 
> Part of that reclaim can be the process reclaiming its own pages and
> putting them in swap just so it can exit shortly afterwards. If it was
> throttled in this path, it implies that swap-over-NFS is enabled where

Could we make sure it's only the case for swap-over-NFS?
I think it can happen if the system has very slow thumb card.

> such reclaim in fact might require the pfmemalloc reserves to be used to
> allocate network buffers. It's potentially unnecessary work because the

You mean we need pfmemalloc reserve to swap out anon pages by swap-over-NFS?
Yes. In this case, you're right. It would be better to use the reserve pool for
just exiting instead of swapping out over the network. But how can you make sure that
we have only anonymous pages when we try to reclaim?
If there are some file-backed pages, we can avoid swapout at that time.
Maybe we need some check.
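
(Purely as an illustration of the kind of check being suggested here, and
assuming 3.x-era vmstat helpers, it could look something like the sketch
below; it is not an actual patch.)

	/*
	 * Illustrative only: "is there any page cache left to reclaim, or
	 * would reclaim have to swap out anon pages?"
	 */
	static bool only_anon_pages_left(void)
	{
		unsigned long nr_file = global_page_state(NR_ACTIVE_FILE) +
					global_page_state(NR_INACTIVE_FILE);

		/* effectively no file pages left worth reclaiming */
		return nr_file < SWAP_CLUSTER_MAX;
	}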

> same reserves could have been used to just exit the process.
> 
> I'll go your way if you insist because it's not like getting throttled
> and killed before exit is a common situation and it should work either
> way.

I don't want to insist on it. I just want to understand what the problem is and find
a better solution. :)

P.S.) I'm in a situation where it's very hard to sit down in front of a computer
for a long time due to a really, really demanding training course. :(
In short, I have to go dance.
Please feel free to send the patch without expecting that I will send one soon.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-06  0:25       ` Minchan Kim
@ 2012-11-06  8:58         ` Mel Gorman
  2012-11-06 10:17           ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-06  8:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Tue, Nov 06, 2012 at 09:25:50AM +0900, Minchan Kim wrote:
> On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> > On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > > <SNIP>
> > > > In the first version it would never try to enter direct reclaim if a
> > > > fatal signal was pending but always claim that forward progress was
> > > > being made.
> > > 
> > > Surely we need fix for preventing deadlock with OOM kill and that's why
> > > I have Cced you and this patch fixes it but my question is why we need 
> > > such fatal signal checking trick.
> > > 
> > > How about this?
> > > 
> > 
> > Both will work as expected but....
> > 
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 10090c8..881619e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> > >  
> > >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> > >  
> > > -       /*
> > > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > -        * that the page allocator does not consider triggering OOM
> > > -        */
> > > -       if (fatal_signal_pending(current))
> > > -               return 1;
> > > -
> > >         trace_mm_vmscan_direct_reclaim_begin(order,
> > >                                 sc.may_writepage,
> > >                                 gfp_mask);
> > >  
> > > In this case, after throttling, current will try to do direct reclaim and
> > > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> > 
> > It may be completely unnecessary to reclaim memory if the process that was
> > throttled and killed just exits quickly. As the fatal signal is pending
> > it will be able to use the pfmemalloc reserves.
> > 
> > > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > > out_of_memory checks signal check of current and allow to access reserved memory pool
> > > for quick exit and return without killing other victim selection.
> > 
> > While this is true, what advantage is there to having a killed process
> > potentially reclaiming memory it does not need to?
> 
> A killed process needs memory in order to be terminated. I think it's not a good idea for him
> to use the reserved memory pool unconditionally even though he is throttled and killed.
> Because the reserved memory pool is a strictly limited resource for emergencies, using the reserved memory
> pool should be a last resort after he fails to reclaim.
> 

Part of that reclaim can be the process reclaiming its own pages and
putting them in swap just so it can exit shortly afterwards. If it was
throttled in this path, it implies that swap-over-NFS is enabled where
such reclaim in fact might require the pfmemalloc reserves to be used to
allocate network buffers. It's potentially unnecessary work because the
same reserves could have been used to just exit the process.

I'll go your way if you insist because it's not like getting throttled
and killed before exit is a common situation and it should work either
way.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-05 14:46     ` Mel Gorman
@ 2012-11-06  0:25       ` Minchan Kim
  2012-11-06  8:58         ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-06  0:25 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Mon, Nov 05, 2012 at 02:46:14PM +0000, Mel Gorman wrote:
> On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > > <SNIP>
> > > In the first version it would never try to enter direct reclaim if a
> > > fatal signal was pending but always claim that forward progress was
> > > being made.
> > 
> > Surely we need fix for preventing deadlock with OOM kill and that's why
> > I have Cced you and this patch fixes it but my question is why we need 
> > such fatal signal checking trick.
> > 
> > How about this?
> > 
> 
> Both will work as expected but....
> 
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 10090c8..881619e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> >  
> >         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> >  
> > -       /*
> > -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > -        * that the page allocator does not consider triggering OOM
> > -        */
> > -       if (fatal_signal_pending(current))
> > -               return 1;
> > -
> >         trace_mm_vmscan_direct_reclaim_begin(order,
> >                                 sc.may_writepage,
> >                                 gfp_mask);
> >  
> > In this case, after throttling, current will try to do direct reclaim and
> > if he makes forward progress, he will get a memory and exit if he receive KILL signal.
> 
> It may be completely unnecessary to reclaim memory if the process that was
> throttled and killed just exits quickly. As the fatal signal is pending
> it will be able to use the pfmemalloc reserves.
> 
> > If he can't make forward progress with direct reclaim, he can ends up OOM path but
> > out_of_memory checks signal check of current and allow to access reserved memory pool
> > for quick exit and return without killing other victim selection.
> 
> While this is true, what advantage is there to having a killed process
> potentially reclaiming memory it does not need to?

A killed process needs memory in order to be terminated. I think it's not a good idea for him
to use the reserved memory pool unconditionally even though he is throttled and killed.
Because the reserved memory pool is a strictly limited resource for emergencies, using the reserved memory
pool should be a last resort after he fails to reclaim.

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-02 22:36   ` Minchan Kim
@ 2012-11-05 14:46     ` Mel Gorman
  2012-11-06  0:25       ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-05 14:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Sat, Nov 03, 2012 at 07:36:31AM +0900, Minchan Kim wrote:
> > <SNIP>
> > In the first version it would never try to enter direct reclaim if a
> > fatal signal was pending but always claim that forward progress was
> > being made.
> 
> Surely we need a fix to prevent the deadlock with the OOM kill, and that's why
> I have Cced you, and this patch fixes it, but my question is why we need
> such a fatal-signal-checking trick.
> 
> How about this?
> 

Both will work as expected but....

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 10090c8..881619e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  
>         throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
>  
> -       /*
> -        * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -        * that the page allocator does not consider triggering OOM
> -        */
> -       if (fatal_signal_pending(current))
> -               return 1;
> -
>         trace_mm_vmscan_direct_reclaim_begin(order,
>                                 sc.may_writepage,
>                                 gfp_mask);
>  
> In this case, after throttling, current will try to do direct reclaim and
> if he makes forward progress, he will get memory and will exit if he has received a KILL signal.

It may be completely unnecessary to reclaim memory if the process that was
throttled and killed just exits quickly. As the fatal signal is pending
it will be able to use the pfmemalloc reserves.

> If he can't make forward progress with direct reclaim, he can end up in the OOM path, but
> out_of_memory checks whether current has a pending signal and allows it to access the reserved memory pool
> for a quick exit, and returns without selecting another victim to kill.

While this is true, what advantage is there to having a killed process
potentially reclaiming memory it does not need to?

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-02  8:30 ` Mel Gorman
@ 2012-11-02 22:36   ` Minchan Kim
  2012-11-05 14:46     ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-02 22:36 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Fri, Nov 02, 2012 at 08:30:57AM +0000, Mel Gorman wrote:
> On Fri, Nov 02, 2012 at 03:39:58PM +0900, Minchan Kim wrote:
> > Hi Mel,
> > 
> > On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> > > On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > > > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > > > 
> > > > > It's not true any more.
> > > > > 3.6 includes following code in try_to_free_pages
> > > > > 
> > > > >         /*   
> > > > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > > >          * that the page allocator does not consider triggering OOM
> > > > >          */
> > > > >         if (fatal_signal_pending(current))
> > > > >                 return 1;
> > > > > 
> > > > > So the hunged task never go to the OOM path and could be looping forever.
> > > > > 
> > > > 
> > > > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > > > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > > > storage").  Thanks for adding Mel to the cc.
> > > > 
> > > 
> > > Indeed, thanks.
> > > 
> > > > The oom killer specifically has logic for this condition: when calling 
> > > > out_of_memory() the first thing it does is
> > > > 
> > > > 	if (fatal_signal_pending(current))
> > > > 		set_thread_flag(TIF_MEMDIE);
> > > > 
> > > > to allow it access to memory reserves so that it may exit if it's having 
> > > > trouble.  But that ends up never happening because of the above code that 
> > > > Minchan has identified.
> > > > 
> > > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > > > as well or revert that early return entirely; there's no justification 
> > > > given for it in the comment nor in the commit log. 
> > > 
> > > The check for fatal signal is in the wrong place. The reason it was added
> > > is because a throttled process sleeps in an interruptible sleep.  If a user
> > > user forcibly kills a throttled process, it should not result in an OOM kill.
> > > 
> > > > I'd rather remove it 
> > > > and allow the oom killer to trigger and grant access to memory reserves 
> > > > itself if necessary.
> > > > 
> > > > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > > > they need memory in the exit path since the oom killer never gets called?
> > > > 
> > > 
> > > It doesn't. How about this?
> > > 
> > > ---8<---
> > > mm: vmscan: Check for fatal signals iff the process was throttled
> > > 
> > > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > > are low and swap is backed by network storage") introduced a check for
> > > fatal signals after a process gets throttled for network storage. The
> > > intention was that if a process was throttled and got killed that it
> > > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > > David Rientjes, this check is in the wrong place and too broad. If a
> > > system is in am OOM situation and a process is exiting, it can loop in
> > > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > > fatal signal is pending it returns 1 as if it is making forward progress
> > > and can effectively deadlock.
> > > 
> > > This patch moves the fatal_signal_pending() check after throttling to
> > > throttle_direct_reclaim() where it belongs.
> > 
> > I'm not sure how below patch achieve your goal which is to prevent
> > unnecessary OOM kill if throttled process is killed by user during
> > throttling. If I misunderstood your goal, please correct me and
> > write down it in description for making it more clear.
> > 
> > If user kills throttled process, throttle_direct_reclaim returns true by
> > this patch so try_to_free_pages returns 1. It means it doesn't call OOM
> > in first path of reclaim but shortly it will try to reclaim again
> > by should_alloc_retry.
> 
> Yes and it returned without calling direct reclaim.
> 
> > And since this second path, throttle_direct_reclaim
> > will continue to return false so that it could end up calling OOM kill.
> > 
> 
> Yes except the second time it has not been throttled and it entered direct
> reclaim. If it fails to make any progress it will return 0 but if this
> happens, it potentially really is an OOM situation. If it manages to
> reclaim, it'll be returning a positive number, is making forward
> progress and should successfully exit without triggering OOM.
> 
> Note that throttle_direct_reclaim also now checks fatal_signal_pending
> before deciding to throttle at all.
> 
> > Is it a your intention? If so, what's different with old version?
> > This patch just delay OOM kill so what's benefit does it has?
> > 
> 
> In the first version it would never try to enter direct reclaim if a
> fatal signal was pending but always claim that forward progress was
> being made.

Surely we need a fix to prevent the deadlock with the OOM kill, and that's why
I have Cced you, and this patch fixes it, but my question is why we need
such a fatal-signal-checking trick.

How about this?

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 10090c8..881619e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2306,13 +2306,6 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
        throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
 
-       /*
-        * Do not enter reclaim if fatal signal is pending. 1 is returned so
-        * that the page allocator does not consider triggering OOM
-        */
-       if (fatal_signal_pending(current))
-               return 1;
-
        trace_mm_vmscan_direct_reclaim_begin(order,
                                sc.may_writepage,
                                gfp_mask);
 
In this case, after throttling, current will try to do direct reclaim and
if he makes forward progress, he will get memory and will exit if he has received a KILL signal.
If he can't make forward progress with direct reclaim, he can end up in the OOM path, but
out_of_memory checks whether current has a pending signal and allows it to access the reserved memory pool
for a quick exit, and returns without selecting another victim to kill.
Is it a problem for your case?

> 
> -- 
> Mel Gorman
> SUSE Labs

-- 
Kind Regards,
Minchan Kim

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
  2012-11-02  6:39 Minchan Kim
@ 2012-11-02  8:30 ` Mel Gorman
  2012-11-02 22:36   ` Minchan Kim
  0 siblings, 1 reply; 67+ messages in thread
From: Mel Gorman @ 2012-11-02  8:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

On Fri, Nov 02, 2012 at 03:39:58PM +0900, Minchan Kim wrote:
> Hi Mel,
> 
> On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> > On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > > 
> > > > It's not true any more.
> > > > 3.6 includes following code in try_to_free_pages
> > > > 
> > > >         /*   
> > > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > > >          * that the page allocator does not consider triggering OOM
> > > >          */
> > > >         if (fatal_signal_pending(current))
> > > >                 return 1;
> > > > 
> > > > So the hunged task never go to the OOM path and could be looping forever.
> > > > 
> > > 
> > > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > > storage").  Thanks for adding Mel to the cc.
> > > 
> > 
> > Indeed, thanks.
> > 
> > > The oom killer specifically has logic for this condition: when calling 
> > > out_of_memory() the first thing it does is
> > > 
> > > 	if (fatal_signal_pending(current))
> > > 		set_thread_flag(TIF_MEMDIE);
> > > 
> > > to allow it access to memory reserves so that it may exit if it's having 
> > > trouble.  But that ends up never happening because of the above code that 
> > > Minchan has identified.
> > > 
> > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > > as well or revert that early return entirely; there's no justification 
> > > given for it in the comment nor in the commit log. 
> > 
> > The check for fatal signal is in the wrong place. The reason it was added
> > is because a throttled process sleeps in an interruptible sleep.  If a user
> > user forcibly kills a throttled process, it should not result in an OOM kill.
> > 
> > > I'd rather remove it 
> > > and allow the oom killer to trigger and grant access to memory reserves 
> > > itself if necessary.
> > > 
> > > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > > they need memory in the exit path since the oom killer never gets called?
> > > 
> > 
> > It doesn't. How about this?
> > 
> > ---8<---
> > mm: vmscan: Check for fatal signals iff the process was throttled
> > 
> > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> > are low and swap is backed by network storage") introduced a check for
> > fatal signals after a process gets throttled for network storage. The
> > intention was that if a process was throttled and got killed that it
> > should not trigger the OOM killer. As pointed out by Minchan Kim and
> > David Rientjes, this check is in the wrong place and too broad. If a
> > system is in am OOM situation and a process is exiting, it can loop in
> > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the
> > fatal signal is pending it returns 1 as if it is making forward progress
> > and can effectively deadlock.
> > 
> > This patch moves the fatal_signal_pending() check after throttling to
> > throttle_direct_reclaim() where it belongs.
> 
> I'm not sure how the patch below achieves your goal, which is to prevent
> an unnecessary OOM kill if the throttled process is killed by the user during
> throttling. If I misunderstood your goal, please correct me and
> write it down in the description to make it clearer.
> 
> If the user kills the throttled process, throttle_direct_reclaim returns true with
> this patch, so try_to_free_pages returns 1. It means it doesn't call OOM
> on the first reclaim pass, but shortly it will try to reclaim again
> via should_alloc_retry.

Yes and it returned without calling direct reclaim.

> And from this second pass on, throttle_direct_reclaim
> will continue to return false, so that it could end up calling the OOM killer.
> 

Yes except the second time it has not been throttled and it entered direct
reclaim. If it fails to make any progress it will return 0 but if this
happens, it potentially really is an OOM situation. If it manages to
reclaim, it'll be returning a positive number, is making forward
progress and should successfully exit without triggering OOM.

Note that throttle_direct_reclaim also now checks fatal_signal_pending
before deciding to throttle at all.

> Is that your intention? If so, what's different from the old version?
> This patch just delays the OOM kill, so what benefit does it have?
> 

In the first version it would never try to enter direct reclaim if a
fatal signal was pending but always claim that forward progress was
being made.

-- 
Mel Gorman
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 67+ messages in thread

* Re: zram OOM behavior
@ 2012-11-02  6:39 Minchan Kim
  2012-11-02  8:30 ` Mel Gorman
  0 siblings, 1 reply; 67+ messages in thread
From: Minchan Kim @ 2012-11-02  6:39 UTC (permalink / raw)
  To: Mel Gorman
  Cc: David Rientjes, Luigi Semenzato, linux-mm, Dan Magenheimer,
	KOSAKI Motohiro, Sonny Rao

Hi Mel,

On Thu, Nov 01, 2012 at 08:28:14AM +0000, Mel Gorman wrote:
> On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote:
> > On Thu, 1 Nov 2012, Minchan Kim wrote:
> > 
> > > It's not true any more.
> > > 3.6 includes following code in try_to_free_pages
> > > 
> > >         /*   
> > >          * Do not enter reclaim if fatal signal is pending. 1 is returned so
> > >          * that the page allocator does not consider triggering OOM
> > >          */
> > >         if (fatal_signal_pending(current))
> > >                 return 1;
> > > 
> > > So the hunged task never go to the OOM path and could be looping forever.
> > > 
> > 
> > Ah, interesting.  This is from commit 5515061d22f0 ("mm: throttle direct 
> > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network 
> > storage").  Thanks for adding Mel to the cc.
> > 
> 
> Indeed, thanks.
> 
> > The oom killer specifically has logic for this condition: when calling 
> > out_of_memory() the first thing it does is
> > 
> > 	if (fatal_signal_pending(current))
> > 		set_thread_flag(TIF_MEMDIE);
> > 
> > to allow it access to memory reserves so that it may exit if it's having 
> > trouble.  But that ends up never happening because of the above code that 
> > Minchan has identified.
> > 
> > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() 
> > as well or revert that early return entirely; there's no justification 
> > given for it in the comment nor in the commit log. 
> 
> The check for fatal signal is in the wrong place. The reason it was added
> is because a throttled process sleeps in an interruptible sleep.  If a user
> forcibly kills a throttled process, it should not result in an OOM kill.
> 
> > I'd rather remove it 
> > and allow the oom killer to trigger and grant access to memory reserves 
> > itself if necessary.
> > 
> > Mel, how does commit 5515061d22f0 deal with threads looping forever if 
> > they need memory in the exit path since the oom killer never gets called?
> > 
> 
> It doesn't. How about this?
> 
> ---8<---
> mm: vmscan: Check for fatal signals iff the process was throttled
> 
> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves
> are low and swap is backed by network storage") introduced a check for
> fatal signals after a process gets throttled for network storage. The
> intention was that if a process was throttled and got killed that it
> should not trigger the OOM killer. As pointed out by Minchan Kim and
> David Rientjes, this check is in the wrong place and too broad. If a
> system is in an OOM situation and a process is exiting, it can loop in
> __alloc_pages_slowpath(), calling direct reclaim in a loop. As the
> fatal signal is pending it returns 1 as if it is making forward progress
> and can effectively deadlock.
> 
> This patch moves the fatal_signal_pending() check after throttling to
> throttle_direct_reclaim() where it belongs.

I'm not sure how the patch below achieves your goal, which is to prevent
an unnecessary OOM kill if the throttled process is killed by the user during
throttling. If I misunderstood your goal, please correct me and
write it down in the description to make it clearer.

If the user kills the throttled process, throttle_direct_reclaim returns true with
this patch, so try_to_free_pages returns 1. It means it doesn't call OOM
on the first reclaim pass, but shortly it will try to reclaim again
via should_alloc_retry. And from this second pass on, throttle_direct_reclaim
will continue to return false, so that it could end up calling the OOM killer.

Is that your intention? If so, what's different from the old version?
This patch just delays the OOM kill, so what benefit does it have?


> 
> If this patch passes review it should be considered a -stable candidate
> for 3.6.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/vmscan.c |   37 +++++++++++++++++++++++++++----------
>  1 file changed, 27 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2b7edfa..ca9e37f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat)
>   * Throttle direct reclaimers if backing storage is backed by the network
>   * and the PFMEMALLOC reserve for the preferred node is getting dangerously
>   * depleted. kswapd will continue to make progress and wake the processes
> - * when the low watermark is reached
> + * when the low watermark is reached.
> + *
> + * Returns true if a fatal signal was delivered during throttling. If this
> + * happens, the page allocator should not consider triggering the OOM killer.
>   */
> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  					nodemask_t *nodemask)
>  {
>  	struct zone *zone;
> @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	 * processes to block on log_wait_commit().
>  	 */
>  	if (current->flags & PF_KTHREAD)
> -		return;
> +		goto out;
> +
> +	/*
> +	 * If a fatal signal is pending, this process should not throttle.
> +	 * It should return quickly so it can exit and free its memory
> +	 */
> +	if (fatal_signal_pending(current))
> +		goto out;
>  
>  	/* Check if the pfmemalloc reserves are ok */
>  	first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone);
>  	pgdat = zone->zone_pgdat;
>  	if (pfmemalloc_watermark_ok(pgdat))
> -		return;
> +		goto out;
>  
>  	/* Account for the throttling */
>  	count_vm_event(PGSCAN_DIRECT_THROTTLE);
> @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist,
>  	if (!(gfp_mask & __GFP_FS)) {
>  		wait_event_interruptible_timeout(pgdat->pfmemalloc_wait,
>  			pfmemalloc_watermark_ok(pgdat), HZ);
> -		return;
> +
> +		goto check_pending;
>  	}
>  
>  	/* Throttle until kswapd wakes the process */
>  	wait_event_killable(zone->zone_pgdat->pfmemalloc_wait,
>  		pfmemalloc_watermark_ok(pgdat));
> +
> +check_pending:
> +	if (fatal_signal_pending(current))
> +		return true;
> +
> +out:
> +	return false;
>  }
>  
>  unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
> @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  		.gfp_mask = sc.gfp_mask,
>  	};
>  
> -	throttle_direct_reclaim(gfp_mask, zonelist, nodemask);
> -
>  	/*
> -	 * Do not enter reclaim if fatal signal is pending. 1 is returned so
> -	 * that the page allocator does not consider triggering OOM
> +	 * Do not enter reclaim if fatal signal was delivered while throttled.
> +	 * 1 is returned so that the page allocator does not OOM kill at this
> +	 * point.
>  	 */
> -	if (fatal_signal_pending(current))
> +	if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask))
>  		return 1;
>  
>  	trace_mm_vmscan_direct_reclaim_begin(order,
> 

-- 
Kind regards,
Minchan Kim


Thread overview: 67+ messages
2012-09-28 17:32 zram OOM behavior Luigi Semenzato
2012-10-03 13:30 ` Konrad Rzeszutek Wilk
     [not found]   ` <CAA25o9SwO209DD6CUx-LzhMt9XU6niGJ-fBPmgwfcrUvf0BPWA@mail.gmail.com>
2012-10-12 23:30     ` Luigi Semenzato
2012-10-15 14:44 ` Minchan Kim
2012-10-15 18:54   ` Luigi Semenzato
2012-10-16  6:18     ` Minchan Kim
2012-10-16 17:36       ` Luigi Semenzato
2012-10-19 17:49         ` Luigi Semenzato
2012-10-22 23:53           ` Minchan Kim
2012-10-23  0:40             ` Luigi Semenzato
2012-10-23  6:03             ` David Rientjes
2012-10-29 18:26               ` Luigi Semenzato
2012-10-29 19:00                 ` David Rientjes
2012-10-29 22:36                   ` Luigi Semenzato
2012-10-29 22:52                     ` David Rientjes
2012-10-29 23:23                       ` Luigi Semenzato
2012-10-29 23:34                         ` Luigi Semenzato
2012-10-30  0:18                     ` Minchan Kim
2012-10-30  0:45                       ` Luigi Semenzato
2012-10-30  5:41                         ` David Rientjes
2012-10-30 19:12                           ` Luigi Semenzato
2012-10-30 20:30                             ` Luigi Semenzato
2012-10-30 22:32                               ` Luigi Semenzato
2012-10-31 18:42                                 ` David Rientjes
2012-10-30 22:37                               ` Sonny Rao
2012-10-31  4:46                               ` David Rientjes
2012-10-31  6:14                                 ` Luigi Semenzato
2012-10-31  6:28                                   ` Luigi Semenzato
2012-10-31 18:45                                     ` David Rientjes
2012-10-31  0:57                             ` Minchan Kim
2012-10-31  1:06                               ` Luigi Semenzato
2012-10-31  1:27                                 ` Minchan Kim
2012-10-31  3:49                                   ` Luigi Semenzato
2012-10-31  7:24                                     ` Minchan Kim
2012-10-31 16:07                                       ` Luigi Semenzato
2012-10-31 17:49                                         ` Mandeep Singh Baines
2012-10-31 18:54                               ` David Rientjes
2012-10-31 21:40                                 ` Luigi Semenzato
2012-11-01  2:11                                 ` Minchan Kim
2012-11-01  4:38                                   ` David Rientjes
2012-11-01  5:18                                     ` Minchan Kim
2012-11-01  2:43                                 ` Minchan Kim
2012-11-01  4:48                                   ` David Rientjes
2012-11-01  5:26                                     ` Minchan Kim
2012-11-01  8:28                                     ` Mel Gorman
2012-11-01 15:57                                       ` Luigi Semenzato
2012-11-01 15:58                                         ` Luigi Semenzato
2012-11-01 21:48                                           ` David Rientjes
2012-11-01 17:50                                     ` Luigi Semenzato
2012-11-01 21:50                                       ` David Rientjes
2012-11-01 21:58                                         ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
2012-11-01 22:43                                           ` Andrew Morton
2012-11-01 23:05                                             ` David Rientjes
2012-11-01 23:06                                             ` Luigi Semenzato
2012-11-01 22:04                                         ` zram OOM behavior Luigi Semenzato
2012-11-01 22:25                                           ` David Rientjes
2012-11-02  6:39 Minchan Kim
2012-11-02  8:30 ` Mel Gorman
2012-11-02 22:36   ` Minchan Kim
2012-11-05 14:46     ` Mel Gorman
2012-11-06  0:25       ` Minchan Kim
2012-11-06  8:58         ` Mel Gorman
2012-11-06 10:17           ` Minchan Kim
2012-11-09  9:50             ` Mel Gorman
2012-11-12 13:32               ` Minchan Kim
2012-11-12 14:06                 ` Mel Gorman
2012-11-13 13:31                   ` Minchan Kim
