* zram OOM behavior @ 2012-09-28 17:32 Luigi Semenzato 2012-10-03 13:30 ` Konrad Rzeszutek Wilk 2012-10-15 14:44 ` Minchan Kim 0 siblings, 2 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-09-28 17:32 UTC (permalink / raw) To: linux-mm

Greetings,

We are experimenting with zram in Chrome OS. It works quite well until the system runs out of memory, at which point it seems to hang, but we suspect it is thrashing.

Before the (apparent) hang, the OOM killer gets rid of a few processes, but then the other processes gradually stop responding, until the entire system becomes unresponsive.

I am wondering if anybody has run into this. Thanks!

Luigi

P.S. For those who wish to know more:

1. We use the min_filelist_kbytes patch (http://lwn.net/Articles/412313/) (I am not sure if it made it into the standard kernel) and set min_filelist_kbytes to 50 MB. (This may not matter, as it's unlikely to make things worse.)

2. We swap only to compressed RAM. The setup is very simple:

echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize ||
    logger -t "$UPSTART_JOB" "failed to set zram size"
mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed"
swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed"

For ZRAM_SIZE_KB, we typically use 1.5 times the size of RAM (which is 2 or 4 GB). The compression factor is about 3:1. The hangs happen for quite a wide range of zram sizes.

-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-09-28 17:32 zram OOM behavior Luigi Semenzato @ 2012-10-03 13:30 ` Konrad Rzeszutek Wilk [not found] ` <CAA25o9SwO209DD6CUx-LzhMt9XU6niGJ-fBPmgwfcrUvf0BPWA@mail.gmail.com> 2012-10-15 14:44 ` Minchan Kim 1 sibling, 1 reply; 56+ messages in thread From: Konrad Rzeszutek Wilk @ 2012-10-03 13:30 UTC (permalink / raw) To: Luigi Semenzato; +Cc: linux-mm On Fri, Sep 28, 2012 at 1:32 PM, Luigi Semenzato <semenzato@google.com> wrote: > Greetings, > > We are experimenting with zram in Chrome OS. It works quite well > until the system runs out of memory, at which point it seems to hang, > but we suspect it is thrashing. Or spinning in some sad loop. Does the kernel have the CONFIG_DETECT_* options to figure out what is happening? Can you invoke the Alt-SysRQ when it is hung? > > Before the (apparent) hang, the OOM killer gets rid of a few > processes, but then the other processes gradually stop responding, > until the entire system becomes unresponsive. Does the OOM give you an idea what the memory state is? Can you actually provide the dmesg? > > I am wondering if anybody has run into this. Thanks! > > Luigi > > P.S. For those who wish to know more: > > 1. We use the min_filelist_kbytes patch > (http://lwn.net/Articles/412313/) (I am not sure if it made it into > the standard kernel) and set min_filelist_kbytes to 50Mb. (This may > not matter, as it's unlikely to make things worse.) > > 2. We swap only to compressed ram. The setup is very simple: > > echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || > logger -t "$UPSTART_JOB" "failed to set zram size" > mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" > swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" > > For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or > 4 Gb). The compression factor is about 3:1. The hangs happen for > quite a wide range of zram sizes. 
* Re: zram OOM behavior [not found] ` <CAA25o9SwO209DD6CUx-LzhMt9XU6niGJ-fBPmgwfcrUvf0BPWA@mail.gmail.com> @ 2012-10-12 23:30 ` Luigi Semenzato 0 siblings, 0 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-12 23:30 UTC (permalink / raw) To: Konrad Rzeszutek Wilk, linux-mm

I fixed the "hang with compressed swap" problem, but I cannot claim I understand the code very well, before or after the fix. However, the fix seems to make sense, unless I am misinterpreting something.

In mm/vmscan.c there are a few places where the amount of reclaimable memory is computed, in the presence or absence of swap. For instance here:

unsigned long zone_reclaimable_pages(struct zone *zone)
{
	int nr;

	nr = zone_page_state(zone, NR_ACTIVE_FILE) +
	     zone_page_state(zone, NR_INACTIVE_FILE);

	if (nr_swap_pages > 0)
		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
		      zone_page_state(zone, NR_INACTIVE_ANON);

	return nr;
}

But this code seems to assume that if there is any swap space left, then there is infinite swap space left. If there is only a little swap space left, only that many ANON pages may be swapped out. So I replaced part of the above with

	anon = zone_page_state(zone, NR_ACTIVE_ANON) +
	       zone_page_state(zone, NR_INACTIVE_ANON);
	if (total_swap_pages > 0)
		nr += min(anon, nr_swap_pages);

and, as I mentioned, did something equivalent in a couple of other places. This fixes the hangs.

I think the hangs happened because the page allocator thought that there was reclaimable memory and kept trying to reclaim it, unsuccessfully. But it's still hard to believe that the original code could be *that* wrong, so what am I missing? Or is it possible that there isn't enough interest in improving low-memory and out-of-memory behavior? This is rather important on consumer devices, such as Chromebooks. Of course the zram module is not your standard swap device (it allocates memory to free more memory).
My colleague Mandeep Baines submitted a patch a year or two ago that prevents thrashing in the absence of swap. The system can still thrash because it evicts executable pages, which are file-backed. His patch is just a few lines. It stops the mm from evicting the last X megabytes of FILE memory, where X = 50 works well for us. Thrashing is nasty, and his patch fixes it, yet it is not included in ToT. Thank you for any elucidation! On Wed, Oct 3, 2012 at 8:33 AM, Luigi Semenzato <semenzato@google.com> wrote: > On Wed, Oct 3, 2012 at 6:30 AM, Konrad Rzeszutek Wilk <konrad@kernel.org> wrote: >> On Fri, Sep 28, 2012 at 1:32 PM, Luigi Semenzato <semenzato@google.com> wrote: >>> Greetings, >>> >>> We are experimenting with zram in Chrome OS. It works quite well >>> until the system runs out of memory, at which point it seems to hang, >>> but we suspect it is thrashing. >> >> Or spinning in some sad loop. Does the kernel have the CONFIG_DETECT_* >> options to figure out what is happening? > > Don't think so, but will check and enable it. > > Can you invoke the Alt-SysRQ >> when it is hung? > > I don't think we have that enabled, but I will check. > >>> >>> Before the (apparent) hang, the OOM killer gets rid of a few >>> processes, but then the other processes gradually stop responding, >>> until the entire system becomes unresponsive. >> >> Does the OOM give you an idea what the memory state is? >> Can you >> actually provide the dmesg? > > I may be able to do that, through the serial line. > > Thanks, I will reply-all when I have more info. Didn't want to spam > the list for now. > >> >>> >>> I am wondering if anybody has run into this. Thanks! >>> >>> Luigi >>> >>> P.S. For those who wish to know more: >>> >>> 1. We use the min_filelist_kbytes patch >>> (http://lwn.net/Articles/412313/) (I am not sure if it made it into >>> the standard kernel) and set min_filelist_kbytes to 50Mb. (This may >>> not matter, as it's unlikely to make things worse.) >>> >>> 2. 
We swap only to compressed ram. The setup is very simple: >>> >>> echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || >>> logger -t "$UPSTART_JOB" "failed to set zram size" >>> mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" >>> swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" >>> >>> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or >>> 4 Gb). The compression factor is about 3:1. The hangs happen for >>> quite a wide range of zram sizes.
* Re: zram OOM behavior 2012-09-28 17:32 zram OOM behavior Luigi Semenzato 2012-10-03 13:30 ` Konrad Rzeszutek Wilk @ 2012-10-15 14:44 ` Minchan Kim 2012-10-15 18:54 ` Luigi Semenzato 1 sibling, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-10-15 14:44 UTC (permalink / raw) To: Luigi Semenzato; +Cc: linux-mm Hello, On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote: > Greetings, > > We are experimenting with zram in Chrome OS. It works quite well > until the system runs out of memory, at which point it seems to hang, > but we suspect it is thrashing. > > Before the (apparent) hang, the OOM killer gets rid of a few > processes, but then the other processes gradually stop responding, > until the entire system becomes unresponsive. Why do you think it's zram problem? If you use swap device as storage instead of zram, does the problem disappear? Could you do sysrq+t,m several time and post it while hang happens? /proc/vmstat could be helpful, too. > > I am wondering if anybody has run into this. Thanks! > > Luigi > > P.S. For those who wish to know more: > > 1. We use the min_filelist_kbytes patch > (http://lwn.net/Articles/412313/) (I am not sure if it made it into > the standard kernel) and set min_filelist_kbytes to 50Mb. (This may > not matter, as it's unlikely to make things worse.) One of the problem I look at this patch is it might prevent increasing of zone->pages_scanned when the swap if full or anon pages are very small although there are lots of file-backed pages. It means OOM can't occur and page allocator could loop forever. Please look at zone_reclaimable. Have you ever test it without above patch? > > 2. We swap only to compressed ram. 
The setup is very simple: > > echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || > logger -t "$UPSTART_JOB" "failed to set zram size" > mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" > swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" > > For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or > 4 Gb). The compression factor is about 3:1. The hangs happen for > quite a wide range of zram sizes. -- Kind Regards, Minchan Kim
* Re: zram OOM behavior 2012-10-15 14:44 ` Minchan Kim @ 2012-10-15 18:54 ` Luigi Semenzato 2012-10-16 6:18 ` Minchan Kim 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-15 18:54 UTC (permalink / raw) To: Minchan Kim; +Cc: linux-mm

On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote:
> Hello,
>
> On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote:
>> Greetings,
>>
>> We are experimenting with zram in Chrome OS. It works quite well
>> until the system runs out of memory, at which point it seems to hang,
>> but we suspect it is thrashing.
>>
>> Before the (apparent) hang, the OOM killer gets rid of a few
>> processes, but then the other processes gradually stop responding,
>> until the entire system becomes unresponsive.
>
> Why do you think it's zram problem? If you use swap device as storage
> instead of zram, does the problem disappear?

I haven't tried with a swap device, but that is a good suggestion.

I didn't want to swap to disk (too slow compared to zram, so it's not the same experiment any more), but I could preallocate a RAM disk and swap to that.

> Could you do sysrq+t,m several time and post it while hang happens?
> /proc/vmstat could be helpful, too.

The stack traces look mostly like this:

[ 2058.069020] [<810681c4>] handle_edge_irq+0x8f/0xb1
[ 2058.069028] <IRQ> [<810037ed>] ? do_IRQ+0x3f/0x98
[ 2058.069044] [<813b7eb0>] ? common_interrupt+0x30/0x38
[ 2058.069058] [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108
[ 2058.069072] [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3
[ 2058.069085] [<813b70d5>] ? _raw_spin_lock+0xd/0xf
[ 2058.069097] [<810b418c>] ? put_super+0x15/0x29
[ 2058.069108] [<810b41ba>] ? drop_super+0x1a/0x1d
[ 2058.069119] [<810b4d04>] ? prune_super+0x106/0x110
[ 2058.069132] [<81093647>] ? shrink_slab+0x7f/0x22f
[ 2058.069144] [<81095943>] ? try_to_free_pages+0x1b7/0x2e6
[ 2058.069158] [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5
[ 2058.069173] [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf
[ 2058.069185] [<810a9d50>] ? swapin_readahead+0x61/0x8d
[ 2058.069198] [<8109fea0>] ? handle_pte_fault+0x310/0x5fb
[ 2058.069208] [<8100223a>] ? do_signal+0x470/0x4fe
[ 2058.069220] [<810a02cc>] ? handle_mm_fault+0xae/0xbd
[ 2058.069233] [<8101d0f9>] ? do_page_fault+0x265/0x284
[ 2058.069247] [<81192b32>] ? copy_to_user+0x3e/0x49
[ 2058.069257] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
[ 2058.069270] [<81009279>] ? init_fpu+0x73/0x81
[ 2058.069280] [<8100275e>] ? math_state_restore+0x1f/0xa0
[ 2058.069290] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26
[ 2058.069303] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa
[ 2058.069315] [<813b7737>] ? error_code+0x67/0x6c

The bottom part of the stack varies, but most processes are spending a lot of time in prune_super(). There is a pretty high number of mounted file systems, and do_try_to_free_pages() keeps calling shrink_slab() even when there is nothing to reclaim there.

In addition, do_try_to_free_pages() keeps returning 1 because all_unreclaimable() at the end is always false. The allocator thinks that zone 1 has freeable pages (zones 0 and 2 do not). That prevents the allocator from ooming.

I went in some more depth, but didn't quite untangle all that goes on. In any case, this explains why I came up with the theory that somehow mm is too optimistic about how many pages are freeable. Then I found what looks like a smoking gun in vmscan.c:

	if (nr_swap_pages > 0)
		nr += zone_page_state(zone, NR_ACTIVE_ANON) +
		      zone_page_state(zone, NR_INACTIVE_ANON);

which seems to ignore that not all ANON pages are freeable if swap space is limited.

Pretty much all processes hang while trying to allocate memory. Those that don't allocate memory keep running fine.

vmstat 1 shows a large amount of swapping activity, which drops to 0 when the processes hang.

/proc/meminfo and /proc/vmstat are at the bottom.

>
>> I am wondering if anybody has run into this. Thanks!
>>
>> Luigi
>>
>> P.S.
For those who wish to know more: >> >> 1. We use the min_filelist_kbytes patch >> (http://lwn.net/Articles/412313/) (I am not sure if it made it into >> the standard kernel) and set min_filelist_kbytes to 50Mb. (This may >> not matter, as it's unlikely to make things worse.) > > One of the problem I look at this patch is it might prevent > increasing of zone->pages_scanned when the swap if full or anon pages > are very small although there are lots of file-backed pages. > It means OOM can't occur and page allocator could loop forever. > Please look at zone_reclaimable. Yes---I think you are right. It didn't matter to us because we don't use swap. The problem looks fixable. > Have you ever test it without above patch? Good suggestion. I just did. Almost all text pages are evicted, and then the system thrashes so badly that the hang detector kicks in after a couple of minutes and panics. Thank you for the very helpful suggestions! > >> >> 2. We swap only to compressed ram. The setup is very simple: >> >> echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || >> logger -t "$UPSTART_JOB" "failed to set zram size" >> mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" >> swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" >> >> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or >> 4 Gb). The compression factor is about 3:1. The hangs happen for >> quite a wide range of zram sizes. >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . 
>
> --
> Kind Regards,
> Minchan Kim


MemTotal:        2002292 kB
MemFree:           15148 kB
Buffers:             260 kB
Cached:           169952 kB
SwapCached:       149448 kB
Active:           722608 kB
Inactive:         290824 kB
Active(anon):     682680 kB
Inactive(anon):   230888 kB
Active(file):      39928 kB
Inactive(file):    59936 kB
Unevictable:           0 kB
Mlocked:               0 kB
HighTotal:         74504 kB
HighFree:              0 kB
LowTotal:        1927788 kB
LowFree:           15148 kB
SwapTotal:       2933044 kB
SwapFree:          47968 kB
Dirty:                 0 kB
Writeback:            56 kB
AnonPages:        695180 kB
Mapped:            73276 kB
Shmem:             70276 kB
Slab:              19596 kB
SReclaimable:       9152 kB
SUnreclaim:        10444 kB
KernelStack:        1448 kB
PageTables:         9964 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3934188 kB
Committed_AS:    4371740 kB
VmallocTotal:     122880 kB
VmallocUsed:       22268 kB
VmallocChunk:     100340 kB
DirectMap4k:       34808 kB
DirectMap2M:     1927168 kB


nr_free_pages 3776
nr_inactive_anon 58243
nr_active_anon 172106
nr_inactive_file 14984
nr_active_file 9982
nr_unevictable 0
nr_mlock 0
nr_anon_pages 174840
nr_mapped 18387
nr_file_pages 80762
nr_dirty 0
nr_writeback 13
nr_slab_reclaimable 2290
nr_slab_unreclaimable 2611
nr_page_table_pages 2471
nr_kernel_stack 180
nr_unstable 0
nr_bounce 0
nr_vmscan_write 679247
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 416
nr_isolated_file 0
nr_shmem 17637
nr_dirtied 7630
nr_written 686863
nr_anon_transparent_hugepages 0
nr_dirty_threshold 151452
nr_dirty_background_threshold 2524
pgpgin 284189
pgpgout 2748940
pswpin 5602
pswpout 679271
pgalloc_dma 9976
pgalloc_normal 1426651
pgalloc_high 34659
pgalloc_movable 0
pgfree 1475099
pgactivate 58092
pgdeactivate 745734
pgfault 1489876
pgmajfault 1098
pgrefill_dma 8557
pgrefill_normal 742123
pgrefill_high 4088
pgrefill_movable 0
pgsteal_kswapd_dma 199
pgsteal_kswapd_normal 48387
pgsteal_kswapd_high 2443
pgsteal_kswapd_movable 0
pgsteal_direct_dma 7688
pgsteal_direct_normal 652670
pgsteal_direct_high 6242
pgsteal_direct_movable 0
pgscan_kswapd_dma 268
pgscan_kswapd_normal 105036
pgscan_kswapd_high 8395
pgscan_kswapd_movable 0
pgscan_direct_dma 185240
pgscan_direct_normal 23961886
pgscan_direct_high 584047
pgscan_direct_movable 0
pginodesteal 123
slabs_scanned 10368
kswapd_inodesteal 1
kswapd_low_wmark_hit_quickly 15
kswapd_high_wmark_hit_quickly 8
kswapd_skip_congestion_wait 639
pageoutrun 582
allocstall 14514
pgrotated 1
unevictable_pgs_culled 0
unevictable_pgs_scanned 0
unevictable_pgs_rescued 1
unevictable_pgs_mlocked 1
unevictable_pgs_munlocked 1
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0
* Re: zram OOM behavior 2012-10-15 18:54 ` Luigi Semenzato @ 2012-10-16 6:18 ` Minchan Kim 2012-10-16 17:36 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-10-16 6:18 UTC (permalink / raw) To: Luigi Semenzato; +Cc: linux-mm On Mon, Oct 15, 2012 at 11:54:36AM -0700, Luigi Semenzato wrote: > On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote: > > Hello, > > > > On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote: > >> Greetings, > >> > >> We are experimenting with zram in Chrome OS. It works quite well > >> until the system runs out of memory, at which point it seems to hang, > >> but we suspect it is thrashing. > >> > >> Before the (apparent) hang, the OOM killer gets rid of a few > >> processes, but then the other processes gradually stop responding, > >> until the entire system becomes unresponsive. > > > > Why do you think it's zram problem? If you use swap device as storage > > instead of zram, does the problem disappear? > > I haven't tried with a swap device, but that is a good suggestion. > > I didn't want to swap to disk (too slow compared to zram, so it's not > the same experiment any more), but I could preallocate a RAM disk and > swap to that. Good idea. > > > Could you do sysrq+t,m several time and post it while hang happens? > > /proc/vmstat could be helpful, too. > > The stack traces look mostly like this: > > [ 2058.069020] [<810681c4>] handle_edge_irq+0x8f/0xb1 > [ 2058.069028] <IRQ> [<810037ed>] ? do_IRQ+0x3f/0x98 > [ 2058.069044] [<813b7eb0>] ? common_interrupt+0x30/0x38 > [ 2058.069058] [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108 > [ 2058.069072] [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3 > [ 2058.069085] [<813b70d5>] ? _raw_spin_lock+0xd/0xf > [ 2058.069097] [<810b418c>] ? put_super+0x15/0x29 > [ 2058.069108] [<810b41ba>] ? drop_super+0x1a/0x1d > [ 2058.069119] [<810b4d04>] ? prune_super+0x106/0x110 > [ 2058.069132] [<81093647>] ? 
shrink_slab+0x7f/0x22f > [ 2058.069144] [<81095943>] ? try_to_free_pages+0x1b7/0x2e6 > [ 2058.069158] [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5 > [ 2058.069173] [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf > [ 2058.069185] [<810a9d50>] ? swapin_readahead+0x61/0x8d > [ 2058.069198] [<8109fea0>] ? handle_pte_fault+0x310/0x5fb > [ 2058.069208] [<8100223a>] ? do_signal+0x470/0x4fe > [ 2058.069220] [<810a02cc>] ? handle_mm_fault+0xae/0xbd > [ 2058.069233] [<8101d0f9>] ? do_page_fault+0x265/0x284 > [ 2058.069247] [<81192b32>] ? copy_to_user+0x3e/0x49 > [ 2058.069257] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26 > [ 2058.069270] [<81009279>] ? init_fpu+0x73/0x81 > [ 2058.069280] [<8100275e>] ? math_state_restore+0x1f/0xa0 > [ 2058.069290] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26 > [ 2058.069303] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa > [ 2058.069315] [<813b7737>] ? error_code+0x67/0x6c > > The bottom part of the stack varies, but most processes are spending a > lot of time in prune_super(). There is a pretty high number of > mounted file systems, and do_try_to_free_pages() keeps calling > shrink_slab() even when there is nothing to reclaim there. Good catch. We can check the number of reclaimable slab in a zone before diving into shrink_slab and abort it. > > In addition, do_try_to_free_pages() keeps returning 1 because > all_unreclaimable() at the end is always false. The allocator thinks > that zone 1 has freeable pages (zones 0 and 2 do not). That prevents > the allocator from ooming. It's a problem of your custom patch "min_filelist_kbytes". > > I went in some more depth, but didn't quite untangle all that goes on. > In any case, this explains why I came up with the theory that somehow > mm is too optimistic about how many pages are freeable. 
Then I found > what looks like a smoking gun in vmscan.c: > > if (nr_swap_pages > 0) > nr += zone_page_state(zone, NR_ACTIVE_ANON) + > zone_page_state(zone, NR_INACTIVE_ANON); > > which seems to ignore that not all ANON pages are freeable if swap > space is limited. It's a just check for whether swap is enable or not, NOT how many we have empty slot in swap. I understand your concern but it's not related to your problem directly. If you could change it, you might solve the problem by early OOM but it's not right fix, IMHO and break LRU and SLAB reclaim balancing logic. > > Pretty much all processes hang while trying to allocate memory. Those > that don't allocate memory keep running fine. > > vmstat 1 shows a large amount of swapping activity, which drops to 0 > when the processes hang. > > /proc/meminfo and /proc/vmstat are at the bottom. > > > > >> > >> I am wondering if anybody has run into this. Thanks! > >> > >> Luigi > >> > >> P.S. For those who wish to know more: > >> > >> 1. We use the min_filelist_kbytes patch > >> (http://lwn.net/Articles/412313/) (I am not sure if it made it into > >> the standard kernel) and set min_filelist_kbytes to 50Mb. (This may > >> not matter, as it's unlikely to make things worse.) > > > > One of the problem I look at this patch is it might prevent > > increasing of zone->pages_scanned when the swap if full or anon pages > > are very small although there are lots of file-backed pages. > > It means OOM can't occur and page allocator could loop forever. > > Please look at zone_reclaimable. > > Yes---I think you are right. It didn't matter to us because we don't > use swap. The problem looks fixable. No use swap? You mentioned you used zram as swap? Which is right? I started to confuse your word. 
If you don't use swap, it's more error prone because get_scan_count makes your reclaim logic never get reclaim anonymous memory and your min_filelist_kbytes patch makes reclaim logic never get reclaim file memory if file memory is smaller than 50M. It means VM never reclaim both anon and file LRU pages so all of processes try to allocate will be loop forever. You mean you didn't use it but start to use it these days? If so, please resend min_filelist_kbytes patch with the fix to linux-mm. > > > Have you ever test it without above patch? > > Good suggestion. I just did. Almost all text pages are evicted, and > then the system thrashes so badly that the hang detector kicks in > after a couple of minutes and panics. I guess culprit is your min_filelist_kbytes patch. If you think it's really good feature, please resend it and let's makes it better than now. I think motivation is good for embedded. :) > > Thank you for the very helpful suggestions! Thanks for the interesting problem! > > > > > >> > >> 2. We swap only to compressed ram. The setup is very simple: > >> > >> echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || > >> logger -t "$UPSTART_JOB" "failed to set zram size" > >> mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" > >> swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" > >> > >> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or > >> 4 Gb). The compression factor is about 3:1. The hangs happen for > >> quite a wide range of zram sizes. > >> > >> -- > >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> the body to majordomo@kvack.org. For more info on Linux MM, > >> see: http://www.linux-mm.org/ . 
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > > > -- > > Kind Regards, > > Minchan Kim > > > MemTotal: 2002292 kB > MemFree: 15148 kB > Buffers: 260 kB > Cached: 169952 kB > SwapCached: 149448 kB > Active: 722608 kB > Inactive: 290824 kB > Active(anon): 682680 kB > Inactive(anon): 230888 kB > Active(file): 39928 kB > Inactive(file): 59936 kB > Unevictable: 0 kB > Mlocked: 0 kB > HighTotal: 74504 kB > HighFree: 0 kB > LowTotal: 1927788 kB > LowFree: 15148 kB > SwapTotal: 2933044 kB > SwapFree: 47968 kB > Dirty: 0 kB > Writeback: 56 kB > AnonPages: 695180 kB > Mapped: 73276 kB > Shmem: 70276 kB > Slab: 19596 kB > SReclaimable: 9152 kB > SUnreclaim: 10444 kB > KernelStack: 1448 kB > PageTables: 9964 kB > NFS_Unstable: 0 kB > Bounce: 0 kB > WritebackTmp: 0 kB > CommitLimit: 3934188 kB > Committed_AS: 4371740 kB > VmallocTotal: 122880 kB > VmallocUsed: 22268 kB > VmallocChunk: 100340 kB > DirectMap4k: 34808 kB > DirectMap2M: 1927168 kB > > > nr_free_pages 3776 > nr_inactive_anon 58243 > nr_active_anon 172106 > nr_inactive_file 14984 > nr_active_file 9982 > nr_unevictable 0 > nr_mlock 0 > nr_anon_pages 174840 > nr_mapped 18387 > nr_file_pages 80762 > nr_dirty 0 > nr_writeback 13 > nr_slab_reclaimable 2290 > nr_slab_unreclaimable 2611 > nr_page_table_pages 2471 > nr_kernel_stack 180 > nr_unstable 0 > nr_bounce 0 > nr_vmscan_write 679247 > nr_vmscan_immediate_reclaim 0 > nr_writeback_temp 0 > nr_isolated_anon 416 > nr_isolated_file 0 > nr_shmem 17637 > nr_dirtied 7630 > nr_written 686863 > nr_anon_transparent_hugepages 0 > nr_dirty_threshold 151452 > nr_dirty_background_threshold 2524 > pgpgin 284189 > pgpgout 2748940 > pswpin 5602 > pswpout 679271 > pgalloc_dma 9976 > pgalloc_normal 1426651 > pgalloc_high 34659 > pgalloc_movable 0 > pgfree 1475099 > pgactivate 58092 > pgdeactivate 745734 > pgfault 1489876 > pgmajfault 1098 > pgrefill_dma 8557 > pgrefill_normal 742123 > pgrefill_high 4088 > pgrefill_movable 0 > pgsteal_kswapd_dma 199 > 
pgsteal_kswapd_normal 48387 > pgsteal_kswapd_high 2443 > pgsteal_kswapd_movable 0 > pgsteal_direct_dma 7688 > pgsteal_direct_normal 652670 > pgsteal_direct_high 6242 > pgsteal_direct_movable 0 > pgscan_kswapd_dma 268 > pgscan_kswapd_normal 105036 > pgscan_kswapd_high 8395 > pgscan_kswapd_movable 0 > pgscan_direct_dma 185240 > pgscan_direct_normal 23961886 > pgscan_direct_high 584047 > pgscan_direct_movable 0 > pginodesteal 123 > slabs_scanned 10368 > kswapd_inodesteal 1 > kswapd_low_wmark_hit_quickly 15 > kswapd_high_wmark_hit_quickly 8 > kswapd_skip_congestion_wait 639 > pageoutrun 582 > allocstall 14514 > pgrotated 1 > unevictable_pgs_culled 0 > unevictable_pgs_scanned 0 > unevictable_pgs_rescued 1 > unevictable_pgs_mlocked 1 > unevictable_pgs_munlocked 1 > unevictable_pgs_cleared 0 > unevictable_pgs_stranded 0 > unevictable_pgs_mlockfreed 0 -- Kind Regards, Minchan Kim
* Re: zram OOM behavior 2012-10-16 6:18 ` Minchan Kim @ 2012-10-16 17:36 ` Luigi Semenzato 2012-10-19 17:49 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-16 17:36 UTC (permalink / raw) To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer On Mon, Oct 15, 2012 at 11:18 PM, Minchan Kim <minchan@kernel.org> wrote: > On Mon, Oct 15, 2012 at 11:54:36AM -0700, Luigi Semenzato wrote: >> On Mon, Oct 15, 2012 at 7:44 AM, Minchan Kim <minchan@kernel.org> wrote: >> > Hello, >> > >> > On Fri, Sep 28, 2012 at 10:32:20AM -0700, Luigi Semenzato wrote: >> >> Greetings, >> >> >> >> We are experimenting with zram in Chrome OS. It works quite well >> >> until the system runs out of memory, at which point it seems to hang, >> >> but we suspect it is thrashing. >> >> >> >> Before the (apparent) hang, the OOM killer gets rid of a few >> >> processes, but then the other processes gradually stop responding, >> >> until the entire system becomes unresponsive. >> > >> > Why do you think it's zram problem? If you use swap device as storage >> > instead of zram, does the problem disappear? >> >> I haven't tried with a swap device, but that is a good suggestion. >> >> I didn't want to swap to disk (too slow compared to zram, so it's not >> the same experiment any more), but I could preallocate a RAM disk and >> swap to that. > > Good idea. > >> >> > Could you do sysrq+t,m several time and post it while hang happens? >> > /proc/vmstat could be helpful, too. >> >> The stack traces look mostly like this: >> >> [ 2058.069020] [<810681c4>] handle_edge_irq+0x8f/0xb1 >> [ 2058.069028] <IRQ> [<810037ed>] ? do_IRQ+0x3f/0x98 >> [ 2058.069044] [<813b7eb0>] ? common_interrupt+0x30/0x38 >> [ 2058.069058] [<8108007b>] ? ftrace_raw_event_rpm_internal+0xf/0x108 >> [ 2058.069072] [<81196c1a>] ? do_raw_spin_lock+0x93/0xf3 >> [ 2058.069085] [<813b70d5>] ? _raw_spin_lock+0xd/0xf >> [ 2058.069097] [<810b418c>] ? put_super+0x15/0x29 >> [ 2058.069108] [<810b41ba>] ? 
drop_super+0x1a/0x1d >> [ 2058.069119] [<810b4d04>] ? prune_super+0x106/0x110 >> [ 2058.069132] [<81093647>] ? shrink_slab+0x7f/0x22f >> [ 2058.069144] [<81095943>] ? try_to_free_pages+0x1b7/0x2e6 >> [ 2058.069158] [<8108de27>] ? __alloc_pages_nodemask+0x412/0x5d5 >> [ 2058.069173] [<810a9c6a>] ? read_swap_cache_async+0x4a/0xcf >> [ 2058.069185] [<810a9d50>] ? swapin_readahead+0x61/0x8d >> [ 2058.069198] [<8109fea0>] ? handle_pte_fault+0x310/0x5fb >> [ 2058.069208] [<8100223a>] ? do_signal+0x470/0x4fe >> [ 2058.069220] [<810a02cc>] ? handle_mm_fault+0xae/0xbd >> [ 2058.069233] [<8101d0f9>] ? do_page_fault+0x265/0x284 >> [ 2058.069247] [<81192b32>] ? copy_to_user+0x3e/0x49 >> [ 2058.069257] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26 >> [ 2058.069270] [<81009279>] ? init_fpu+0x73/0x81 >> [ 2058.069280] [<8100275e>] ? math_state_restore+0x1f/0xa0 >> [ 2058.069290] [<8100306d>] ? do_spurious_interrupt_bug+0x26/0x26 >> [ 2058.069303] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa >> [ 2058.069315] [<813b7737>] ? error_code+0x67/0x6c >> >> The bottom part of the stack varies, but most processes are spending a >> lot of time in prune_super(). There is a pretty high number of >> mounted file systems, and do_try_to_free_pages() keeps calling >> shrink_slab() even when there is nothing to reclaim there. > > Good catch. We can check the number of reclaimable slab in a zone before > diving into shrink_slab and abort it. > >> >> In addition, do_try_to_free_pages() keeps returning 1 because >> all_unreclaimable() at the end is always false. The allocator thinks >> that zone 1 has freeable pages (zones 0 and 2 do not). That prevents >> the allocator from ooming. > > It's a problem of your custom patch "min_filelist_kbytes". > >> >> I went in some more depth, but didn't quite untangle all that goes on. >> In any case, this explains why I came up with the theory that somehow >> mm is too optimistic about how many pages are freeable. 
Then I found >> what looks like a smoking gun in vmscan.c: >> >> if (nr_swap_pages > 0) >> nr += zone_page_state(zone, NR_ACTIVE_ANON) + >> zone_page_state(zone, NR_INACTIVE_ANON); >> >> which seems to ignore that not all ANON pages are freeable if swap >> space is limited. > > It's a just check for whether swap is enable or not, NOT how many we have > empty slot in swap. I understand your concern but it's not related to your > problem directly. If you could change it, you might solve the problem by > early OOM but it's not right fix, IMHO and break LRU and SLAB reclaim balancing > logic. Yes, I was afraid of some consequence of that kind. However, I still don't understand that computation. "zone_reclaimable_pages" suggests we're computing how many anonymous pages can be reclaimed. If there is zero swap, no anonymous pages can be reclaimed. If there is very little swap left, very few anonymous pages can be reclaimed. So that confuses me. But don't worry, because many other things confuse me too! > >> >> Pretty much all processes hang while trying to allocate memory. Those >> that don't allocate memory keep running fine. >> >> vmstat 1 shows a large amount of swapping activity, which drops to 0 >> when the processes hang. >> >> /proc/meminfo and /proc/vmstat are at the bottom. >> >> > >> >> >> >> I am wondering if anybody has run into this. Thanks! >> >> >> >> Luigi >> >> >> >> P.S. For those who wish to know more: >> >> >> >> 1. We use the min_filelist_kbytes patch >> >> (http://lwn.net/Articles/412313/) (I am not sure if it made it into >> >> the standard kernel) and set min_filelist_kbytes to 50Mb. (This may >> >> not matter, as it's unlikely to make things worse.) >> > >> > One of the problem I look at this patch is it might prevent >> > increasing of zone->pages_scanned when the swap if full or anon pages >> > are very small although there are lots of file-backed pages. >> > It means OOM can't occur and page allocator could loop forever. 
>> > Please look at zone_reclaimable. >> >> Yes---I think you are right. It didn't matter to us because we don't >> use swap. The problem looks fixable. > > No use swap? You mentioned you used zram as swap? > Which is right? I started to confuse your word. I apologize for the confusion. We don't use swap now in Chrome OS. I am investigating the possibility of using zram, if I can get it to work. We are not likely to consider swap to disk because the resulting jank for interactive loads is too high and difficult to control, and we may do a better job by managing memory at a higher level (basically in the Chrome app). > If you don't use swap, it's more error prone because get_scan_count makes > your reclaim logic never get reclaim anonymous memory and your min_filelist_kbytes > patch makes reclaim logic never get reclaim file memory if file memory is smaller > than 50M. It means VM never reclaim both anon and file LRU pages so all of processes > try to allocate will be loop forever. Actually, our patch seems to work fine in our systems, which are commercially available. (I'll be happy to send you any data that you may find interesting). Without the patch, the system can thrash badly when we allocate memory aggressively (for instance, by loading many browser tabs in parallel). So, if we ignore zram for the moment, the min_filelist_kbytes patch prevents the last 50 Mb of file memory from being evicted. It has no impact on anon memory. For that memory, we take same code path as before. It may be suboptimal because it doesn't try to reclaim inactive file memory in the last 50 Mb, but that doesn't seem to matter. > > You mean you didn't use it but start to use it these days? > If so, please resend min_filelist_kbytes patch with the fix to linux-mm. > >> >> > Have you ever test it without above patch? >> >> Good suggestion. I just did. 
Almost all text pages are evicted, and >> then the system thrashes so badly that the hang detector kicks in >> after a couple of minutes and panics. > > I guess culprit is your min_filelist_kbytes patch. That could be, but I still need some way of preventing file pages thrash. Without that patch, the system thrashes when low on memory, with or without zram, and with or without other changes related to nr_swap_pages. > If you think it's really good feature, please resend it and let's makes it better > than now. I think motivation is good for embedded. :) Yes! Thanks, I'll try to do that. > >> >> Thank you for the very helpful suggestions! > > Thanks for the interesting problem! > >> >> >> > >> >> >> >> 2. We swap only to compressed ram. The setup is very simple: >> >> >> >> echo ${ZRAM_SIZE_KB}000 >/sys/block/zram0/disksize || >> >> logger -t "$UPSTART_JOB" "failed to set zram size" >> >> mkswap /dev/zram0 || logger -t "$UPSTART_JOB" "mkswap /dev/zram0 failed" >> >> swapon /dev/zram0 || logger -t "$UPSTART_JOB" "swapon /dev/zram0 failed" >> >> >> >> For ZRAM_SIZE_KB, we typically use 1.5 the size of RAM (which is 2 or >> >> 4 Gb). The compression factor is about 3:1. The hangs happen for >> >> quite a wide range of zram sizes. >> >> >> >> -- >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> >> the body to majordomo@kvack.org. For more info on Linux MM, >> >> see: http://www.linux-mm.org/ . 
>> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >> > >> > -- >> > Kind Regards, >> > Minchan Kim >> >> >> MemTotal: 2002292 kB >> MemFree: 15148 kB >> Buffers: 260 kB >> Cached: 169952 kB >> SwapCached: 149448 kB >> Active: 722608 kB >> Inactive: 290824 kB >> Active(anon): 682680 kB >> Inactive(anon): 230888 kB >> Active(file): 39928 kB >> Inactive(file): 59936 kB >> Unevictable: 0 kB >> Mlocked: 0 kB >> HighTotal: 74504 kB >> HighFree: 0 kB >> LowTotal: 1927788 kB >> LowFree: 15148 kB >> SwapTotal: 2933044 kB >> SwapFree: 47968 kB >> Dirty: 0 kB >> Writeback: 56 kB >> AnonPages: 695180 kB >> Mapped: 73276 kB >> Shmem: 70276 kB >> Slab: 19596 kB >> SReclaimable: 9152 kB >> SUnreclaim: 10444 kB >> KernelStack: 1448 kB >> PageTables: 9964 kB >> NFS_Unstable: 0 kB >> Bounce: 0 kB >> WritebackTmp: 0 kB >> CommitLimit: 3934188 kB >> Committed_AS: 4371740 kB >> VmallocTotal: 122880 kB >> VmallocUsed: 22268 kB >> VmallocChunk: 100340 kB >> DirectMap4k: 34808 kB >> DirectMap2M: 1927168 kB >> >> >> nr_free_pages 3776 >> nr_inactive_anon 58243 >> nr_active_anon 172106 >> nr_inactive_file 14984 >> nr_active_file 9982 >> nr_unevictable 0 >> nr_mlock 0 >> nr_anon_pages 174840 >> nr_mapped 18387 >> nr_file_pages 80762 >> nr_dirty 0 >> nr_writeback 13 >> nr_slab_reclaimable 2290 >> nr_slab_unreclaimable 2611 >> nr_page_table_pages 2471 >> nr_kernel_stack 180 >> nr_unstable 0 >> nr_bounce 0 >> nr_vmscan_write 679247 >> nr_vmscan_immediate_reclaim 0 >> nr_writeback_temp 0 >> nr_isolated_anon 416 >> nr_isolated_file 0 >> nr_shmem 17637 >> nr_dirtied 7630 >> nr_written 686863 >> nr_anon_transparent_hugepages 0 >> nr_dirty_threshold 151452 >> nr_dirty_background_threshold 2524 >> pgpgin 284189 >> pgpgout 2748940 >> pswpin 5602 >> pswpout 679271 >> pgalloc_dma 9976 >> pgalloc_normal 1426651 >> pgalloc_high 34659 >> pgalloc_movable 0 >> pgfree 1475099 >> pgactivate 58092 >> pgdeactivate 745734 >> pgfault 1489876 >> pgmajfault 1098 >> pgrefill_dma 8557 >> 
pgrefill_normal 742123 >> pgrefill_high 4088 >> pgrefill_movable 0 >> pgsteal_kswapd_dma 199 >> pgsteal_kswapd_normal 48387 >> pgsteal_kswapd_high 2443 >> pgsteal_kswapd_movable 0 >> pgsteal_direct_dma 7688 >> pgsteal_direct_normal 652670 >> pgsteal_direct_high 6242 >> pgsteal_direct_movable 0 >> pgscan_kswapd_dma 268 >> pgscan_kswapd_normal 105036 >> pgscan_kswapd_high 8395 >> pgscan_kswapd_movable 0 >> pgscan_direct_dma 185240 >> pgscan_direct_normal 23961886 >> pgscan_direct_high 584047 >> pgscan_direct_movable 0 >> pginodesteal 123 >> slabs_scanned 10368 >> kswapd_inodesteal 1 >> kswapd_low_wmark_hit_quickly 15 >> kswapd_high_wmark_hit_quickly 8 >> kswapd_skip_congestion_wait 639 >> pageoutrun 582 >> allocstall 14514 >> pgrotated 1 >> unevictable_pgs_culled 0 >> unevictable_pgs_scanned 0 >> unevictable_pgs_rescued 1 >> unevictable_pgs_mlocked 1 >> unevictable_pgs_munlocked 1 >> unevictable_pgs_cleared 0 >> unevictable_pgs_stranded 0 >> unevictable_pgs_mlockfreed 0 >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > Kind Regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-16 17:36 ` Luigi Semenzato @ 2012-10-19 17:49 ` Luigi Semenzato 2012-10-22 23:53 ` Minchan Kim 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-19 17:49 UTC (permalink / raw) To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer I found the source, and maybe the cause, of the problem I am experiencing when running out of memory with zram enabled. It may be a known problem. The OOM killer doesn't find any killable process because select_bad_process() keeps returning -1 here: /* * This task already has access to memory reserves and is * being killed. Don't allow any other task access to the * memory reserve. * * Note: this may have a chance of deadlock if it gets * blocked waiting for another task which itself is waiting * for memory. Is there a better alternative? */ if (test_tsk_thread_flag(p, TIF_MEMDIE)) { if (unlikely(frozen(p))) __thaw_task(p); if (!force_kill) return ERR_PTR(-1UL); } select_bad_process() is called by out_of_memory() in __alloc_page_may_oom(). If this is the problem, I'd love to hear about solutions! <BEGIN SHAMELESS PLUG> if we can get this to work, it will help keep the cost of laptops down! http://www.google.com/intl/en/chrome/devices/ <END SHAMELESS PLUG> P.S. Chromebooks are sweet things for kernel debugging because they boot so quickly (5-10s depending on the model). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-19 17:49 ` Luigi Semenzato @ 2012-10-22 23:53 ` Minchan Kim 2012-10-23 0:40 ` Luigi Semenzato 2012-10-23 6:03 ` David Rientjes 0 siblings, 2 replies; 56+ messages in thread From: Minchan Kim @ 2012-10-22 23:53 UTC (permalink / raw) To: Luigi Semenzato Cc: linux-mm, Dan Magenheimer, David Rientjes, KOSAKI Motohiro Hi, Sorry for the late response. I was traveling at that time and am still suffering through a training course I never wanted. :( On Fri, Oct 19, 2012 at 10:49:22AM -0700, Luigi Semenzato wrote: > I found the source, and maybe the cause, of the problem I am > experiencing when running out of memory with zram enabled. It may be > a known problem. The OOM killer doesn't find any killable process > because select_bad_process() keeps returning -1 here: > > /* > * This task already has access to memory reserves and is > * being killed. Don't allow any other task access to the > * memory reserve. > * > * Note: this may have a chance of deadlock if it gets > * blocked waiting for another task which itself is waiting > * for memory. Is there a better alternative? > */ > if (test_tsk_thread_flag(p, TIF_MEMDIE)) { > if (unlikely(frozen(p))) > __thaw_task(p); > if (!force_kill) > return ERR_PTR(-1UL); > } > > select_bad_process() is called by out_of_memory() in __alloc_page_may_oom(). I think it's not a zram problem but a general problem of the OOM killer. The intention of the code above is to prevent shortage of the emergency memory pool, to avoid deadlock. If we have already killed a task and it is in the middle of exiting, the OOM killer waits for it to exit. But the problem here is that the killed task might be waiting on a mutex held by another task that is itself stuck in memory allocation and can't use the emergency memory pool. :( It's another deadlock, too. AFAIK, it's a known problem and I'm not sure the OOM guys have a good idea. Cc'ed them.
I think one solution is: if some seconds (e.g., 3 sec) have passed since we killed a task but we are still looping in the above code, we can allow another task to access the emergency memory pool. That may deadlock by burning out the memory pool, but otherwise we still suffer from the other deadlock. > > If this is the problem, I'd love to hear about solutions! > > <BEGIN SHAMELESS PLUG> > if we can get this to work, it will help keep the cost of laptops down! > http://www.google.com/intl/en/chrome/devices/ > <END SHAMELESS PLUG> > > P.S. Chromebooks are sweet things for kernel debugging because they > boot so quickly (5-10s depending on the model). But I think the mainline kernel doesn't boot on those. :( -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-22 23:53 ` Minchan Kim @ 2012-10-23 0:40 ` Luigi Semenzato 2012-10-23 6:03 ` David Rientjes 1 sibling, 0 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-23 0:40 UTC (permalink / raw) To: Minchan Kim; +Cc: linux-mm, Dan Magenheimer, David Rientjes, KOSAKI Motohiro On Mon, Oct 22, 2012 at 4:53 PM, Minchan Kim <minchan@kernel.org> wrote: > Hi, > > Sorry for late response. No problem at all. > I was traveling at that time and still suffer from > training course I never want. :( I am sorry you have to take training courses you do not want, and I sympathize. > On Fri, Oct 19, 2012 at 10:49:22AM -0700, Luigi Semenzato wrote: >> I found the source, and maybe the cause, of the problem I am >> experiencing when running out of memory with zram enabled. It may be >> a known problem. The OOM killer doesn't find any killable process >> because select_bad_process() keeps returning -1 here: >> >> /* >> * This task already has access to memory reserves and is >> * being killed. Don't allow any other task access to the >> * memory reserve. >> * >> * Note: this may have a chance of deadlock if it gets >> * blocked waiting for another task which itself is waiting >> * for memory. Is there a better alternative? >> */ >> if (test_tsk_thread_flag(p, TIF_MEMDIE)) { >> if (unlikely(frozen(p))) >> __thaw_task(p); >> if (!force_kill) >> return ERR_PTR(-1UL); >> } >> >> select_bad_process() is called by out_of_memory() in __alloc_page_may_oom(). > > I think it's not a zram problem but general problem of OOM killer. > Above code's intention is to prevent shortage of ememgency memory pool for avoding > deadlock. If we already killed any task and the task are in the middle of exiting, > OOM killer will wait for him to be exited. But the problem in here is that > killed task might wait any mutex which are held to another task which are > stuck for the memory allocation and can't use emergency memory pool. :( > It's a another deadlock, too. 
AFAIK, it's known problem and I'm not sure > OOM guys have a good idea. Cc'ed them. > I think one of solution is that if it takes some seconed(ex, 3 sec) after we already > kill some task but still looping with above code, we can allow accessing of > ememgency memory pool for another task. It may happen deadlock due to burn out memory > pool but otherwise, we still suffer from deadlock. Next thing, I will check what the killed task is waiting for. It may be that there are a few frequent cases that are solvable. Ideally we should not reach this situation. We use a low-memory notification mechanism (based on some code from you, in fact, many thanks) to discard Chrome tabs (which we reload transparently). But if memory is allocated very aggressively, the notification may arrive too late. >> If this is the problem, I'd love to hear about solutions! >> >> P.S. Chromebooks are sweet things for kernel debugging because they >> boot so quickly (5-10s depending on the model). > > But I think mainline kernel doesn't boot on that. :( Probably not. Very sorry for mentioning this, then. Thank you and I will keep you updated with any progress. Luigi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-22 23:53 ` Minchan Kim 2012-10-23 0:40 ` Luigi Semenzato @ 2012-10-23 6:03 ` David Rientjes 2012-10-29 18:26 ` Luigi Semenzato 1 sibling, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-10-23 6:03 UTC (permalink / raw) To: Minchan Kim; +Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Tue, 23 Oct 2012, Minchan Kim wrote: > > I found the source, and maybe the cause, of the problem I am > > experiencing when running out of memory with zram enabled. It may be > > a known problem. The OOM killer doesn't find any killable process > > because select_bad_process() keeps returning -1 here: > > > > /* > > * This task already has access to memory reserves and is > > * being killed. Don't allow any other task access to the > > * memory reserve. > > * > > * Note: this may have a chance of deadlock if it gets > > * blocked waiting for another task which itself is waiting > > * for memory. Is there a better alternative? > > */ > > if (test_tsk_thread_flag(p, TIF_MEMDIE)) { > > if (unlikely(frozen(p))) > > __thaw_task(p); > > if (!force_kill) > > return ERR_PTR(-1UL); > > } > > > > select_bad_process() is called by out_of_memory() in __alloc_page_may_oom(). > > I think it's not a zram problem but general problem of OOM killer. > Above code's intention is to prevent shortage of ememgency memory pool for avoding > deadlock. If we already killed any task and the task are in the middle of exiting, > OOM killer will wait for him to be exited. But the problem in here is that > killed task might wait any mutex which are held to another task which are > stuck for the memory allocation and can't use emergency memory pool. :( Yeah, there's always a problem if an oom killed process cannot exit because it's waiting for some other eligible process. 
This doesn't normally happen for anything sharing the same mm, though, because we try to kill anything sharing the same mm when we select a process for oom kill and if those killed threads happen to call into the oom killer they silently get TIF_MEMDIE so they may exit as well. This addressed earlier problems we had with things waiting on mm->mmap_sem in the exit path. If the oom killed process cannot exit because it's waiting on another eligible process that does not share the mm, then we'll potentially livelock unless you do echo f > /proc/sysrq-trigger manually or turn on /proc/sys/vm/oom_kill_allocating_task. > I think one of solution is that if it takes some seconed(ex, 3 sec) after we already > kill some task but still looping with above code, we can allow accessing of > ememgency memory pool for another task. It may happen deadlock due to burn out memory > pool but otherwise, we still suffer from deadlock. > The problem there is that if the time limit expires (we used 10 seconds before internally, we don't do it at all anymore) and there are no more eligible threads that you unnecessarily panic, or open yourself up to a complete depletion of memory reserves whereas not even the oom killer can help. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-23 6:03 ` David Rientjes @ 2012-10-29 18:26 ` Luigi Semenzato 2012-10-29 19:00 ` David Rientjes 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-29 18:26 UTC (permalink / raw) To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro I managed to get the stack trace for the process that refuses to die. I am not sure it's due to the deadlock described in earlier messages. I will investigate further. [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 [96283.704405] c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a 0000578f f67cfd20 [96283.704427] d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000 c107fe04 00200202 [96283.704449] c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202 f5bdf1b0 f5bdf1b8 [96283.704471] Call Trace: [96283.704484] [<81037be5>] ? queue_work_on+0x2d/0x39 [96283.704497] [<8117ddb1>] ? put_io_context+0x52/0x6a [96283.704510] [<813b68f6>] schedule+0x56/0x58 [96283.704520] [<81028525>] do_exit+0x63e/0x640 [96283.704530] [<81028752>] do_group_exit+0x63/0x86 [96283.704541] [<81032b19>] get_signal_to_deliver+0x434/0x44b [96283.704554] [<81001e01>] do_signal+0x37/0x4fe [96283.704564] [<8103e31d>] ? update_rmtp+0x67/0x67 [96283.704585] [<8105622a>] ? clockevents_program_event+0xea/0x108 [96283.704599] [<81050d92>] ? timekeeping_get_ns+0x11/0x55 [96283.704610] [<8105a758>] ? sys_futex+0xcb/0xdb [96283.704620] [<810024a7>] do_notify_resume+0x26/0x65 [96283.704632] [<813b7305>] work_notifysig+0xa/0x11 [96283.704644] [<813b0000>] ? coretemp_cpu_callback+0x88/0x179 On Mon, Oct 22, 2012 at 11:03 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 23 Oct 2012, Minchan Kim wrote: > >> > I found the source, and maybe the cause, of the problem I am >> > experiencing when running out of memory with zram enabled. It may be >> > a known problem. 
The OOM killer doesn't find any killable process >> > because select_bad_process() keeps returning -1 here: >> > >> > /* >> > * This task already has access to memory reserves and is >> > * being killed. Don't allow any other task access to the >> > * memory reserve. >> > * >> > * Note: this may have a chance of deadlock if it gets >> > * blocked waiting for another task which itself is waiting >> > * for memory. Is there a better alternative? >> > */ >> > if (test_tsk_thread_flag(p, TIF_MEMDIE)) { >> > if (unlikely(frozen(p))) >> > __thaw_task(p); >> > if (!force_kill) >> > return ERR_PTR(-1UL); >> > } >> > >> > select_bad_process() is called by out_of_memory() in __alloc_page_may_oom(). >> >> I think it's not a zram problem but general problem of OOM killer. >> Above code's intention is to prevent shortage of ememgency memory pool for avoding >> deadlock. If we already killed any task and the task are in the middle of exiting, >> OOM killer will wait for him to be exited. But the problem in here is that >> killed task might wait any mutex which are held to another task which are >> stuck for the memory allocation and can't use emergency memory pool. :( > > Yeah, there's always a problem if an oom killed process cannot exit > because it's waiting for some other eligible process. This doesn't > normally happen for anything sharing the same mm, though, because we try > to kill anything sharing the same mm when we select a process for oom kill > and if those killed threads happen to call into the oom killer they > silently get TIF_MEMDIE so they may exit as well. This addressed earlier > problems we had with things waiting on mm->mmap_sem in the exit path. > > If the oom killed process cannot exit because it's waiting on another > eligible process that does not share the mm, then we'll potentially > livelock unless you do echo f > /proc/sysrq-trigger manually or turn on > /proc/sys/vm/oom_kill_allocating_task. 
> >> I think one of solution is that if it takes some seconed(ex, 3 sec) after we already >> kill some task but still looping with above code, we can allow accessing of >> ememgency memory pool for another task. It may happen deadlock due to burn out memory >> pool but otherwise, we still suffer from deadlock. >> > > The problem there is that if the time limit expires (we used 10 seconds > before internally, we don't do it at all anymore) and there are no more > eligible threads that you unnecessarily panic, or open yourself up to a > complete depletion of memory reserves whereas not even the oom killer can > help. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-29 18:26 ` Luigi Semenzato @ 2012-10-29 19:00 ` David Rientjes 2012-10-29 22:36 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-10-29 19:00 UTC (permalink / raw) To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, 29 Oct 2012, Luigi Semenzato wrote: > I managed to get the stack trace for the process that refuses to die. > I am not sure it's due to the deadlock described in earlier messages. > I will investigate further. > > [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 > [96283.704405] c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a > 0000578f f67cfd20 > [96283.704427] d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000 > c107fe04 00200202 > [96283.704449] c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202 > f5bdf1b0 f5bdf1b8 > [96283.704471] Call Trace: > [96283.704484] [<81037be5>] ? queue_work_on+0x2d/0x39 > [96283.704497] [<8117ddb1>] ? put_io_context+0x52/0x6a > [96283.704510] [<813b68f6>] schedule+0x56/0x58 > [96283.704520] [<81028525>] do_exit+0x63e/0x640 Could you find out where this happens to be in the function? If you enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and find out with l *do_exit+0x63e. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-29 19:00 ` David Rientjes @ 2012-10-29 22:36 ` Luigi Semenzato 2012-10-29 22:52 ` David Rientjes 2012-10-30 0:18 ` Minchan Kim 0 siblings, 2 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-29 22:36 UTC (permalink / raw) To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote: > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > >> I managed to get the stack trace for the process that refuses to die. >> I am not sure it's due to the deadlock described in earlier messages. >> I will investigate further. >> >> [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 >> [96283.704405] c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a >> 0000578f f67cfd20 >> [96283.704427] d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000 >> c107fe04 00200202 >> [96283.704449] c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202 >> f5bdf1b0 f5bdf1b8 >> [96283.704471] Call Trace: >> [96283.704484] [<81037be5>] ? queue_work_on+0x2d/0x39 >> [96283.704497] [<8117ddb1>] ? put_io_context+0x52/0x6a >> [96283.704510] [<813b68f6>] schedule+0x56/0x58 >> [96283.704520] [<81028525>] do_exit+0x63e/0x640 > > Could you find out where this happens to be in the function? If you > enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and > find out with l *do_exit+0x63e. It looks like it's the final call to schedule() in do_exit(): 0x81028520 <+1593>: call 0x813b68a0 <schedule> 0x81028525 <+1598>: ud2a (gdb) l *do_exit+0x63e 0x81028525 is in do_exit (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). 1064 1065 /* causes final put_task_struct in finish_task_switch(). */ 1066 tsk->state = TASK_DEAD; 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ 1068 schedule(); 1069 BUG(); 1070 /* Avoid "noreturn function does return". 
*/ 1071 for (;;) 1072 cpu_relax(); /* For when BUG is null */ 1073 } Here's a theory: the thread exits fine, but the next scheduled thread tries to allocate memory before or during finish_task_switch(), so the dead thread is never cleaned up completely and is still considered alive by the OOM killer. Unfortunately I haven't found a code path that supports this theory... -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-29 22:36 ` Luigi Semenzato @ 2012-10-29 22:52 ` David Rientjes 2012-10-29 23:23 ` Luigi Semenzato 2012-10-30 0:18 ` Minchan Kim 1 sibling, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-10-29 22:52 UTC (permalink / raw) To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, 29 Oct 2012, Luigi Semenzato wrote: > It looks like it's the final call to schedule() in do_exit(): > > 0x81028520 <+1593>: call 0x813b68a0 <schedule> > 0x81028525 <+1598>: ud2a > > (gdb) l *do_exit+0x63e > 0x81028525 is in do_exit > (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). > 1064 > 1065 /* causes final put_task_struct in finish_task_switch(). */ > 1066 tsk->state = TASK_DEAD; > 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ > 1068 schedule(); > 1069 BUG(); > 1070 /* Avoid "noreturn function does return". */ > 1071 for (;;) > 1072 cpu_relax(); /* For when BUG is null */ > 1073 } > You're using an older kernel since the code you quoted from the oom killer hasn't had the per-memcg oom kill rewrite. There's logic that is called from select_bad_process() that should exclude this thread from being considered and deferred since it has a non-zero task->exit_state, i.e. in oom_scan_process_thread(): if (task->exit_state) return OOM_SCAN_CONTINUE; And that's called from both the global oom killer and memcg oom killer. So I'm thinking you're either running on an older kernel or there is no oom condition at the time this is captured. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-29 22:52 ` David Rientjes @ 2012-10-29 23:23 ` Luigi Semenzato 2012-10-29 23:34 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-29 23:23 UTC (permalink / raw) To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, Oct 29, 2012 at 3:52 PM, David Rientjes <rientjes@google.com> wrote: > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > >> It looks like it's the final call to schedule() in do_exit(): >> >> 0x81028520 <+1593>: call 0x813b68a0 <schedule> >> 0x81028525 <+1598>: ud2a >> >> (gdb) l *do_exit+0x63e >> 0x81028525 is in do_exit >> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). >> 1064 >> 1065 /* causes final put_task_struct in finish_task_switch(). */ >> 1066 tsk->state = TASK_DEAD; >> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ >> 1068 schedule(); >> 1069 BUG(); >> 1070 /* Avoid "noreturn function does return". */ >> 1071 for (;;) >> 1072 cpu_relax(); /* For when BUG is null */ >> 1073 } >> > > You're using an older kernel since the code you quoted from the oom killer > hasn't had the per-memcg oom kill rewrite. There's logic that is called > from select_bad_process() that should exclude this thread from being > considered and deferred since it has a non-zero task->exit_thread, i.e. in > oom_scan_process_thread(): > > if (task->exit_state) > return OOM_SCAN_CONTINUE; > > And that's called from both the global oom killer and memcg oom killer. > So I'm thinking you're either running on an older kernel or there is no > oom condition at the time this is captured. Very sorry, I never said that we're on kernel 3.4.0. We are in a OOM-kill situation: ./arch/x86/include/asm/thread_info.h:91:#define TIF_MEMDIE 20 Bit 20 in the threadinfo flags is set: > [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 So your suggestion would be to apply OOM-related patches from a later kernel? Thanks! 
* Re: zram OOM behavior 2012-10-29 23:23 ` Luigi Semenzato @ 2012-10-29 23:34 ` Luigi Semenzato 0 siblings, 0 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-29 23:34 UTC (permalink / raw) To: David Rientjes; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, Oct 29, 2012 at 4:23 PM, Luigi Semenzato <semenzato@google.com> wrote: > On Mon, Oct 29, 2012 at 3:52 PM, David Rientjes <rientjes@google.com> wrote: >> On Mon, 29 Oct 2012, Luigi Semenzato wrote: >> >>> It looks like it's the final call to schedule() in do_exit(): >>> >>> 0x81028520 <+1593>: call 0x813b68a0 <schedule> >>> 0x81028525 <+1598>: ud2a >>> >>> (gdb) l *do_exit+0x63e >>> 0x81028525 is in do_exit >>> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). >>> 1064 >>> 1065 /* causes final put_task_struct in finish_task_switch(). */ >>> 1066 tsk->state = TASK_DEAD; >>> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ >>> 1068 schedule(); >>> 1069 BUG(); >>> 1070 /* Avoid "noreturn function does return". */ >>> 1071 for (;;) >>> 1072 cpu_relax(); /* For when BUG is null */ >>> 1073 } >>> >> >> You're using an older kernel since the code you quoted from the oom killer >> hasn't had the per-memcg oom kill rewrite. There's logic that is called >> from select_bad_process() that should exclude this thread from being >> considered and deferred since it has a non-zero task->exit_thread, i.e. in >> oom_scan_process_thread(): >> >> if (task->exit_state) >> return OOM_SCAN_CONTINUE; >> >> And that's called from both the global oom killer and memcg oom killer. >> So I'm thinking you're either running on an older kernel or there is no >> oom condition at the time this is captured. > Very sorry, I never said that we're on kernel 3.4.0. 
> > We are in a OOM-kill situation: > > ./arch/x86/include/asm/thread_info.h:91:#define TIF_MEMDIE 20 > > Bit 20 in the threadinfo flags is set: > >> [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 > > So your suggestion would be to apply OOM-related patches from a later kernel? > > Thanks! Actually, I am not sure that the 3.6 OOM code is sufficiently different to avoid this situation. 3.4 already has a test for task->exit_state, which in my case must be failing even though TIF_MEMDIE is set and the process has finished do_exit: do_each_thread(g, p) { unsigned int points; if (p->exit_state) continue; ... In fact, those changes look mostly cosmetic. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-29 22:36 ` Luigi Semenzato 2012-10-29 22:52 ` David Rientjes @ 2012-10-30 0:18 ` Minchan Kim 2012-10-30 0:45 ` Luigi Semenzato 1 sibling, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-10-30 0:18 UTC (permalink / raw) To: Luigi Semenzato Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, Oct 29, 2012 at 03:36:38PM -0700, Luigi Semenzato wrote: > On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote: > > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > > > >> I managed to get the stack trace for the process that refuses to die. > >> I am not sure it's due to the deadlock described in earlier messages. > >> I will investigate further. > >> > >> [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 > >> [96283.704405] c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a > >> 0000578f f67cfd20 > >> [96283.704427] d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000 > >> c107fe04 00200202 > >> [96283.704449] c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202 > >> f5bdf1b0 f5bdf1b8 > >> [96283.704471] Call Trace: > >> [96283.704484] [<81037be5>] ? queue_work_on+0x2d/0x39 > >> [96283.704497] [<8117ddb1>] ? put_io_context+0x52/0x6a > >> [96283.704510] [<813b68f6>] schedule+0x56/0x58 > >> [96283.704520] [<81028525>] do_exit+0x63e/0x640 > > > > Could you find out where this happens to be in the function? If you > > enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and > > find out with l *do_exit+0x63e. > > It looks like it's the final call to schedule() in do_exit(): > > 0x81028520 <+1593>: call 0x813b68a0 <schedule> > 0x81028525 <+1598>: ud2a > > (gdb) l *do_exit+0x63e > 0x81028525 is in do_exit > (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). > 1064 > 1065 /* causes final put_task_struct in finish_task_switch(). 
*/ > 1066 tsk->state = TASK_DEAD; > 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ > 1068 schedule(); > 1069 BUG(); > 1070 /* Avoid "noreturn function does return". */ > 1071 for (;;) > 1072 cpu_relax(); /* For when BUG is null */ > 1073 } > > Here's a theory: the thread exits fine, but the next scheduled thread > tries to allocate memory before or during finish_task_switch(), so the > dead thread is never cleaned up completely and is still considered > alive by the OOM killer. If next thread tries to allocate memory, he will enter direct reclaim path and there are some scheduling points in there so exit thread should be destroyed. :( In your previous mail, you said many processes are stuck at shrink_slab which already includes cond_resched. I can't see any problem. Hmm, Could you post entire debug log after you capture sysrq+t several time when hang happens? > > Unfortunately I haven't found a code path that supports this theory... > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 0:18 ` Minchan Kim @ 2012-10-30 0:45 ` Luigi Semenzato 2012-10-30 5:41 ` David Rientjes 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-30 0:45 UTC (permalink / raw) To: Minchan Kim; +Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, Oct 29, 2012 at 5:18 PM, Minchan Kim <minchan@kernel.org> wrote: > On Mon, Oct 29, 2012 at 03:36:38PM -0700, Luigi Semenzato wrote: >> On Mon, Oct 29, 2012 at 12:00 PM, David Rientjes <rientjes@google.com> wrote: >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote: >> > >> >> I managed to get the stack trace for the process that refuses to die. >> >> I am not sure it's due to the deadlock described in earlier messages. >> >> I will investigate further. >> >> >> >> [96283.704390] chrome x 815ecd20 0 16573 1112 0x00100104 >> >> [96283.704405] c107fe34 00200046 f57ae000 815ecd20 815ecd20 ec0b645a >> >> 0000578f f67cfd20 >> >> [96283.704427] d0a9a9a0 c107fdf8 81037be5 f5bdf1e8 f6021800 00000000 >> >> c107fe04 00200202 >> >> [96283.704449] c107fe0c 00200202 f5bdf1b0 c107fe24 8117ddb1 00200202 >> >> f5bdf1b0 f5bdf1b8 >> >> [96283.704471] Call Trace: >> >> [96283.704484] [<81037be5>] ? queue_work_on+0x2d/0x39 >> >> [96283.704497] [<8117ddb1>] ? put_io_context+0x52/0x6a >> >> [96283.704510] [<813b68f6>] schedule+0x56/0x58 >> >> [96283.704520] [<81028525>] do_exit+0x63e/0x640 >> > >> > Could you find out where this happens to be in the function? If you >> > enable CONFIG_DEBUG_INFO, you should be able to use gdb on vmlinux and >> > find out with l *do_exit+0x63e. >> >> It looks like it's the final call to schedule() in do_exit(): >> >> 0x81028520 <+1593>: call 0x813b68a0 <schedule> >> 0x81028525 <+1598>: ud2a >> >> (gdb) l *do_exit+0x63e >> 0x81028525 is in do_exit >> (/home/semenzato/trunk/src/third_party/kernel/files/kernel/exit.c:1069). >> 1064 >> 1065 /* causes final put_task_struct in finish_task_switch(). 
*/ >> 1066 tsk->state = TASK_DEAD; >> 1067 tsk->flags |= PF_NOFREEZE; /* tell freezer to ignore us */ >> 1068 schedule(); >> 1069 BUG(); >> 1070 /* Avoid "noreturn function does return". */ >> 1071 for (;;) >> 1072 cpu_relax(); /* For when BUG is null */ >> 1073 } >> >> Here's a theory: the thread exits fine, but the next scheduled thread >> tries to allocate memory before or during finish_task_switch(), so the >> dead thread is never cleaned up completely and is still considered >> alive by the OOM killer. > > If next thread tries to allocate memory, he will enter direct reclaim path > and there are some scheduling points in there so exit thread should be > destroyed. :( In your previous mail, you said many processes are stuck at > shrink_slab which already includes cond_resched. I can't see any problem. > Hmm, Could you post entire debug log after you capture sysrq+t several time > when hang happens? Thank you so much for your continued assistance. I have been using preserved memory to get the log, and sysrq+T overflows the buffer (there are a few dozen processes). To get the trace for the process with TIF_MEMDIE set, I had to modify the sysrq+T code so that it prints only that process. To get a full trace of all processes I will have to open the device and attach a debug header, so it will take some time. What are we looking for, though? I see many processes running in shrink_slab(), but they are not "stuck" there, they are just spending a lot of time in there. However, now there is something that worries me more. The trace of the thread with TIF_MEMDIE set shows that it has executed most of do_exit() and appears to be waiting to be reaped. From my reading of the code, this implies that task->exit_state should be non-zero, which means that select_bad_process should have skipped that thread, which means that we cannot be in the deadlock situation, and my experiments are not consistent. I will add better instrumentation and report later. 
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 0:45 ` Luigi Semenzato @ 2012-10-30 5:41 ` David Rientjes 2012-10-30 19:12 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-10-30 5:41 UTC (permalink / raw) To: Luigi Semenzato; +Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Mon, 29 Oct 2012, Luigi Semenzato wrote: > However, now there is something that worries me more. The trace of > the thread with TIF_MEMDIE set shows that it has executed most of > do_exit() and appears to be waiting to be reaped. From my reading of > the code, this implies that task->exit_state should be non-zero, which > means that select_bad_process should have skipped that thread, which > means that we cannot be in the deadlock situation, and my experiments > are not consistent. > Yeah, this is what I was referring to earlier, select_bad_process() will not consider the thread for which you posted a stack trace for oom kill, so it's not deferring because of it. There are either other thread(s) that have been oom killed and have not yet release their memory or the oom killer is never being called. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 5:41 ` David Rientjes @ 2012-10-30 19:12 ` Luigi Semenzato 2012-10-30 20:30 ` Luigi Semenzato 2012-10-31 0:57 ` Minchan Kim 0 siblings, 2 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-30 19:12 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > >> However, now there is something that worries me more. The trace of >> the thread with TIF_MEMDIE set shows that it has executed most of >> do_exit() and appears to be waiting to be reaped. From my reading of >> the code, this implies that task->exit_state should be non-zero, which >> means that select_bad_process should have skipped that thread, which >> means that we cannot be in the deadlock situation, and my experiments >> are not consistent. >> > > Yeah, this is what I was referring to earlier, select_bad_process() will > not consider the thread for which you posted a stack trace for oom kill, > so it's not deferring because of it. There are either other thread(s) > that have been oom killed and have not yet release their memory or the oom > killer is never being called. Thanks. I now have better information on what's happening. The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE set). It's another process that's exiting for some other reason. select_bad_process() checks for thread->exit_state at the beginning, and skips processes that are exiting. But later it checks for p->flags & PF_EXITING, and can return -1 in that case (and it does for me). It turns out that do_exit() does a lot of things between setting the thread->flags PF_EXITING bit (in exit_signals()) and setting thread->exit_state to non-zero (in exit_notify()). Some of those things apparently need memory. 
I caught one process responsible for the PTR_ERR(-1) while it was doing this: [ 191.859358] VC manager R running 0 2388 1108 0x00000104 [ 191.859377] err_ptr_count = 45623 [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 0000002c f67cfd20 [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 e1302400 e130264c [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 e0611b0c 810b430e [ 191.859450] Call Trace: [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252 [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa [ 191.859672] [<813b7887>] error_code+0x67/0x6c [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17 [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 [ 191.859718] [<810446d5>] ? finish_task_switch+0x53/0x89 [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a [ 191.859760] [<81028082>] do_exit+0x19b/0x640 [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c [ 191.859793] [<81030beb>] ? 
recalc_sigpending+0x17/0x3e [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 [ 191.859893] Kernel panic - not syncing: too many ERR_PTR I don't know why mm_release() would page fault, but it looks like it does. So the OOM killer will not kill other processes because it thinks a process is exiting, which will free up memory. But the exiting process needs memory to continue exiting --> deadlock. Sounds plausible? OK, now someone is going to fix this, right? :-) -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 19:12 ` Luigi Semenzato @ 2012-10-30 20:30 ` Luigi Semenzato 2012-10-30 22:32 ` Luigi Semenzato ` (2 more replies) 2012-10-31 0:57 ` Minchan Kim 1 sibling, 3 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-30 20:30 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote: > OK, now someone is going to fix this, right? :-) Actually, there is a very simple fix: @@ -355,14 +364,6 @@ static struct task_struct *select_bad_process(unsigned int *ppoints, if (p == current) { chosen = p; *ppoints = 1000; - } else if (!force_kill) { - /* - * If this task is not being ptraced on exit, - * then wait for it to finish before killing - * some other task unnecessarily. - */ - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) - return ERR_PTR(-1UL); } } I'd rather kill some other task unnecessarily than hang! My load works fine with this change. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 20:30 ` Luigi Semenzato @ 2012-10-30 22:32 ` Luigi Semenzato 2012-10-31 18:42 ` David Rientjes 2012-10-30 22:37 ` Sonny Rao 2012-10-31 4:46 ` David Rientjes 2 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-30 22:32 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, Oct 30, 2012 at 1:30 PM, Luigi Semenzato <semenzato@google.com> wrote: > On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote: > >> OK, now someone is going to fix this, right? :-) > > Actually, there is a very simple fix: > > @@ -355,14 +364,6 @@ static struct task_struct > *select_bad_process(unsigned int *ppoints, > if (p == current) { > chosen = p; > *ppoints = 1000; > - } else if (!force_kill) { > - /* > - * If this task is not being ptraced on exit, > - * then wait for it to finish before killing > - * some other task unnecessarily. > - */ > - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) > - return ERR_PTR(-1UL); > } > } > > I'd rather kill some other task unnecessarily than hang! My load > works fine with this change. For completeness, I would like to report that the page fault in mm_release looks legitimate. The fault happens near here: if (unlikely(tsk->robust_list)) { exit_robust_list(tsk); tsk->robust_list = NULL; } and robust_list is a userspace structure. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 22:32 ` Luigi Semenzato @ 2012-10-31 18:42 ` David Rientjes 0 siblings, 0 replies; 56+ messages in thread From: David Rientjes @ 2012-10-31 18:42 UTC (permalink / raw) To: Luigi Semenzato Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, 30 Oct 2012, Luigi Semenzato wrote: > For completeness, I would like to report that the page fault in > mm_release looks legitimate. The fault happens near here: > > if (unlikely(tsk->robust_list)) { > exit_robust_list(tsk); > tsk->robust_list = NULL; > } > > and robust_list is a userspace structure. > This is the only place where the hang occurs when there are several threads in the exit path with PF_EXITING and it causes the oom killer to defer killing a process? If that's the case, then a simple tsk->robust_list check would be sufficient to avoid deferring incorrectly. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 20:30 ` Luigi Semenzato 2012-10-30 22:32 ` Luigi Semenzato @ 2012-10-30 22:37 ` Sonny Rao 2012-10-31 4:46 ` David Rientjes 2 siblings, 0 replies; 56+ messages in thread From: Sonny Rao @ 2012-10-30 22:37 UTC (permalink / raw) To: Luigi Semenzato Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro On Tue, Oct 30, 2012 at 1:30 PM, Luigi Semenzato <semenzato@google.com> wrote: > > On Tue, Oct 30, 2012 at 12:12 PM, Luigi Semenzato <semenzato@google.com> wrote: > > > OK, now someone is going to fix this, right? :-) > > Actually, there is a very simple fix: > > @@ -355,14 +364,6 @@ static struct task_struct > *select_bad_process(unsigned int *ppoints, > if (p == current) { > chosen = p; > *ppoints = 1000; > - } else if (!force_kill) { > - /* > - * If this task is not being ptraced on exit, > - * then wait for it to finish before killing > - * some other task unnecessarily. > - */ > - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) > - return ERR_PTR(-1UL); > } > } > > I'd rather kill some other task unnecessarily than hang! My load > works fine with this change. It also appears that we didn't kill any unnecessary tasks. It's just a deadlock: exiting process A encounters a page fault, has to allocate some memory, and goes to sleep; process B, which is running the OOM killer, blocks on the exiting process; so process A blocks forever on memory while process B blocks on A, and therefore no memory is released. IMO, the fact that we don't defer when the process is being ptraced on exit also seems to justify that skipping the deferral is a valid thing to do in all cases. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 20:30 ` Luigi Semenzato 2012-10-30 22:32 ` Luigi Semenzato 2012-10-30 22:37 ` Sonny Rao @ 2012-10-31 4:46 ` David Rientjes 2012-10-31 6:14 ` Luigi Semenzato 2 siblings, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-10-31 4:46 UTC (permalink / raw) To: Luigi Semenzato Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, 30 Oct 2012, Luigi Semenzato wrote: > Actually, there is a very simple fix: > > @@ -355,14 +364,6 @@ static struct task_struct > *select_bad_process(unsigned int *ppoints, > if (p == current) { > chosen = p; > *ppoints = 1000; > - } else if (!force_kill) { > - /* > - * If this task is not being ptraced on exit, > - * then wait for it to finish before killing > - * some other task unnecessarily. > - */ > - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) > - return ERR_PTR(-1UL); > } > } > > I'd rather kill some other task unnecessarily than hang! My load > works fine with this change. > That's not an acceptable "fix" at all, it will lead to unnecessarily killing processes when others are in the exit path, i.e. every oom kill would kill two or three or more processes instead of just one. Could you please try this on 3.6 since all the code you're quoting is from old kernels? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 4:46 ` David Rientjes @ 2012-10-31 6:14 ` Luigi Semenzato 2012-10-31 6:28 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 6:14 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, Oct 30, 2012 at 9:46 PM, David Rientjes <rientjes@google.com> wrote: > On Tue, 30 Oct 2012, Luigi Semenzato wrote: > >> Actually, there is a very simple fix: >> >> @@ -355,14 +364,6 @@ static struct task_struct >> *select_bad_process(unsigned int *ppoints, >> if (p == current) { >> chosen = p; >> *ppoints = 1000; >> - } else if (!force_kill) { >> - /* >> - * If this task is not being ptraced on exit, >> - * then wait for it to finish before killing >> - * some other task unnecessarily. >> - */ >> - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) >> - return ERR_PTR(-1UL); >> } >> } >> >> I'd rather kill some other task unnecessarily than hang! My load >> works fine with this change. >> > > That's not an acceptable "fix" at all, it will lead to unnecessarily > killing processes when others are in the exit path, i.e. every oom kill > would kill two or three or more processes instead of just one. I am sorry, I didn't mean to suggest that this is the right fix for everybody. It seems to work for us. A real fix would be much harder, I think. Certainly it would be for me. We don't rely on OOM-killing for memory management (we tried to, but it has drawbacks). But OOM kills can still happen, so we have to deal with them. We can deal with multiple processes being killed, but not with a hang. I might be tempted to say that this should be true for everybody, but I can imagine systems that work by allowing only one process to die, and perhaps the load on those systems is such that they don't experience this deadlock often, or ever (even though I would be nervous about it). 
> Could you please try this on 3.6 since all the code you're quoting is from > old kernels? I will see if I can do it, but we're shipping 3.4 and I am not sure about the status of our 3.6 tree. I will also visually inspect the relevant 3.6 code and see if the possibility of deadlock is still there. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 6:14 ` Luigi Semenzato @ 2012-10-31 6:28 ` Luigi Semenzato 2012-10-31 18:45 ` David Rientjes 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 6:28 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, Oct 30, 2012 at 11:14 PM, Luigi Semenzato <semenzato@google.com> wrote: > On Tue, Oct 30, 2012 at 9:46 PM, David Rientjes <rientjes@google.com> wrote: >> On Tue, 30 Oct 2012, Luigi Semenzato wrote: >> >>> Actually, there is a very simple fix: >>> >>> @@ -355,14 +364,6 @@ static struct task_struct >>> *select_bad_process(unsigned int *ppoints, >>> if (p == current) { >>> chosen = p; >>> *ppoints = 1000; >>> - } else if (!force_kill) { >>> - /* >>> - * If this task is not being ptraced on exit, >>> - * then wait for it to finish before killing >>> - * some other task unnecessarily. >>> - */ >>> - if (!(p->group_leader->ptrace & PT_TRACE_EXIT)) >>> - return ERR_PTR(-1UL); >>> } >>> } >>> >>> I'd rather kill some other task unnecessarily than hang! My load >>> works fine with this change. >>> >> >> That's not an acceptable "fix" at all, it will lead to unnecessarily >> killing processes when others are in the exit path, i.e. every oom kill >> would kill two or three or more processes instead of just one. > > I am sorry, I didn't mean to suggest that this is the right fix for > everybody. It seems to work for us. A real fix would be much harder, > I think. Certainly it would be for me. > > We don't rely on OOM-killing for memory management (we tried to, but > it has drawbacks). But OOM kills can still happen, so we have to deal > with them. We can deal with multiple processes being killed, but not > with a hang. 
I might be tempted to say that this should be true for > everybody, but I can imagine systems that work by allowing only one > process to die, and perhaps the load on those systems is such that > they don't experience this deadlock often, or ever (even though I > would be nervous about it). To make it clear, I am suggesting that this "fix" might work as a temporary workaround until a better fix is available. >> Could you please try this on 3.6 since all the code you're quoting is from >> old kernels? > > I will see if I can do it, but we're shipping 3.4 and I am not sure > about the status of our 3.6 tree. I will also visually inspect the > relevant 3.6 code and see if the possibility of deadlock is still > there. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 6:28 ` Luigi Semenzato @ 2012-10-31 18:45 ` David Rientjes 0 siblings, 0 replies; 56+ messages in thread From: David Rientjes @ 2012-10-31 18:45 UTC (permalink / raw) To: Luigi Semenzato Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Tue, 30 Oct 2012, Luigi Semenzato wrote: > To make it clear, I am suggesting that this "fix" might work as a > temporary workaround until a better fix is available. > A temporary workaround is to do a kill -9 of the hung process since even the 3.4 oom killer will automatically give it access to memory reserves. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-30 19:12 ` Luigi Semenzato 2012-10-30 20:30 ` Luigi Semenzato @ 2012-10-31 0:57 ` Minchan Kim 2012-10-31 1:06 ` Luigi Semenzato 2012-10-31 18:54 ` David Rientjes 1 sibling, 2 replies; 56+ messages in thread From: Minchan Kim @ 2012-10-31 0:57 UTC (permalink / raw) To: Luigi Semenzato Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao Hi Luigi, On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote: > On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: > > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > > > >> However, now there is something that worries me more. The trace of > >> the thread with TIF_MEMDIE set shows that it has executed most of > >> do_exit() and appears to be waiting to be reaped. From my reading of > >> the code, this implies that task->exit_state should be non-zero, which > >> means that select_bad_process should have skipped that thread, which > >> means that we cannot be in the deadlock situation, and my experiments > >> are not consistent. > >> > > > > Yeah, this is what I was referring to earlier, select_bad_process() will > > not consider the thread for which you posted a stack trace for oom kill, > > so it's not deferring because of it. There are either other thread(s) > > that have been oom killed and have not yet release their memory or the oom > > killer is never being called. > > Thanks. I now have better information on what's happening. > > The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE > set). It's another process that's exiting for some other reason. > > select_bad_process() checks for thread->exit_state at the beginning, > and skips processes that are exiting. But later it checks for > p->flags & PF_EXITING, and can return -1 in that case (and it does for > me). 
> > It turns out that do_exit() does a lot of things between setting the > thread->flags PF_EXITING bit (in exit_signals()) and setting > thread->exit_state to non-zero (in exit_notify()). Some of those > things apparently need memory. I caught one process responsible for > the PTR_ERR(-1) while it was doing this: > > [ 191.859358] VC manager R running 0 2388 1108 0x00000104 > [ 191.859377] err_ptr_count = 45623 > [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 > 0000002c f67cfd20 > [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 > e1302400 e130264c > [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 > e0611b0c 810b430e > [ 191.859450] Call Trace: > [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 > [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 > [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf > [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 > [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d > [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b > [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 > [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f > [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 > [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f > [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf > [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d > [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb > [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd > [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 > [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252 > [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa > [ 191.859672] [<813b7887>] error_code+0x67/0x6c > [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17 > [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 > [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 > [ 191.859718] [<810446d5>] ? 
finish_task_switch+0x53/0x89 > [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 > [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 > [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a > [ 191.859760] [<81028082>] do_exit+0x19b/0x640 > [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe > [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c > [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e > [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 > [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b > [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe > [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d > [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 > [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb > [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 > [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 > [ 191.859893] Kernel panic - not syncing: too many ERR_PTR > > I don't know why mm_release() would page fault, but it looks like it does. > > So the OOM killer will not kill other processes because it thinks a > process is exiting, which will free up memory. But the exiting > process needs memory to continue exiting --> deadlock. Sounds > plausible? It sounds right in your kernel, but the principal problem is the min_filelist_kbytes patch. If a normally-exiting process needs a page in its exit path and there are no free pages left, it ends up in the OOM path after trying to reclaim memory several times. Then, In select_bad_process, if (task->flags & PF_EXITING) { if (task == current) <== true return OOM_SCAN_SELECT; In oom_kill_process, if (p->flags & PF_EXITING) set_tsk_thread_flag(p, TIF_MEMDIE); So in the end, the normally-exiting process should get a free page. But in your kernel it seems not to, because I guess did_some_progress in __alloc_pages_direct_reclaim is never 0. The reason it is never 0 is that do_try_to_free_pages's all_unreclaimable check cannot do its job because of your min_filelist_kbytes patch. That makes __alloc_pages_slowpath loop forever. Sounds plausible? 
> > OK, now someone is going to fix this, right? :-) > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 0:57 ` Minchan Kim @ 2012-10-31 1:06 ` Luigi Semenzato 2012-10-31 1:27 ` Minchan Kim 2012-10-31 18:54 ` David Rientjes 1 sibling, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 1:06 UTC (permalink / raw) To: Minchan Kim Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mandeep Baines On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote: > Hi Luigi, > > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote: >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote: >> > >> >> However, now there is something that worries me more. The trace of >> >> the thread with TIF_MEMDIE set shows that it has executed most of >> >> do_exit() and appears to be waiting to be reaped. From my reading of >> >> the code, this implies that task->exit_state should be non-zero, which >> >> means that select_bad_process should have skipped that thread, which >> >> means that we cannot be in the deadlock situation, and my experiments >> >> are not consistent. >> >> >> > >> > Yeah, this is what I was referring to earlier, select_bad_process() will >> > not consider the thread for which you posted a stack trace for oom kill, >> > so it's not deferring because of it. There are either other thread(s) >> > that have been oom killed and have not yet release their memory or the oom >> > killer is never being called. >> >> Thanks. I now have better information on what's happening. >> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE >> set). It's another process that's exiting for some other reason. >> >> select_bad_process() checks for thread->exit_state at the beginning, >> and skips processes that are exiting. But later it checks for >> p->flags & PF_EXITING, and can return -1 in that case (and it does for >> me). 
>> >> It turns out that do_exit() does a lot of things between setting the >> thread->flags PF_EXITING bit (in exit_signals()) and setting >> thread->exit_state to non-zero (in exit_notify()). Some of those >> things apparently need memory. I caught one process responsible for >> the PTR_ERR(-1) while it was doing this: >> >> [ 191.859358] VC manager R running 0 2388 1108 0x00000104 >> [ 191.859377] err_ptr_count = 45623 >> [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 >> 0000002c f67cfd20 >> [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 >> e1302400 e130264c >> [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 >> e0611b0c 810b430e >> [ 191.859450] Call Trace: >> [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 >> [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 >> [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf >> [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 >> [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d >> [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b >> [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 >> [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f >> [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 >> [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f >> [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf >> [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d >> [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb >> [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd >> [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 >> [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252 >> [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa >> [ 191.859672] [<813b7887>] error_code+0x67/0x6c >> [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17 >> [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 >> [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 >> [ 191.859718] [<810446d5>] ? 
finish_task_switch+0x53/0x89 >> [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 >> [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 >> [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a >> [ 191.859760] [<81028082>] do_exit+0x19b/0x640 >> [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe >> [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c >> [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e >> [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 >> [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b >> [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe >> [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d >> [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 >> [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb >> [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 >> [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 >> [ 191.859893] Kernel panic - not syncing: too many ERR_PTR >> >> I don't know why mm_release() would page fault, but it looks like it does. >> >> So the OOM killer will not kill other processes because it thinks a >> process is exiting, which will free up memory. But the exiting >> process needs memory to continue exiting --> deadlock. Sounds >> plausible? > > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > If normal exited process in exit path requires a page and there is no free page > any more, it ends up going to OOM path after try to reclaim memory several time. > Then, > In select_bad_process, > > if (task->flags & PF_EXITING) { > if (task == current) <== true > return OOM_SCAN_SELECT; > In oom_kill_process, > > if (p->flags & PF_EXITING) > set_tsk_thread_flag(p, TIF_MEMDIE); > > At last, normal exited process would get a free page. > > But in your kernel, it seems not because I guess did_some_progress in > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is > do_try_to_free_pages's all_unreclaimable can't do his role by your > min_filelist_kbytes. 
It makes __alloc_pages_slowpath's looping forever. > > Sounds plausible? Thank you Minchan, it does sound plausible, but I have little experience with this and it will take some work to confirm. I looked at the patch pretty carefully once, and I had the impression its effect could be fully analyzed by logical reasoning. I will check this again tomorrow, perhaps I can run some experiments. I am adding Mandeep who wrote the patch. However, we have worse problems if we don't use that patch. Without the patch, and either with or without compressed swap, the same load causes horrible thrashing, with the system appearing to hang for minutes. If we don't use that patch, do you have any suggestion on how to improve the thrashing situation? Thanks again! >> >> OK, now someone is going to fix this, right? :-) >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > Kind regards, > Minchan Kim > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 1:06 ` Luigi Semenzato @ 2012-10-31 1:27 ` Minchan Kim 2012-10-31 3:49 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-10-31 1:27 UTC (permalink / raw) To: Luigi Semenzato Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mandeep Baines On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote: > On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote: > > Hi Luigi, > > > > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote: > >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: > >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > >> > > >> >> However, now there is something that worries me more. The trace of > >> >> the thread with TIF_MEMDIE set shows that it has executed most of > >> >> do_exit() and appears to be waiting to be reaped. From my reading of > >> >> the code, this implies that task->exit_state should be non-zero, which > >> >> means that select_bad_process should have skipped that thread, which > >> >> means that we cannot be in the deadlock situation, and my experiments > >> >> are not consistent. > >> >> > >> > > >> > Yeah, this is what I was referring to earlier, select_bad_process() will > >> > not consider the thread for which you posted a stack trace for oom kill, > >> > so it's not deferring because of it. There are either other thread(s) > >> > that have been oom killed and have not yet release their memory or the oom > >> > killer is never being called. > >> > >> Thanks. I now have better information on what's happening. > >> > >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE > >> set). It's another process that's exiting for some other reason. > >> > >> select_bad_process() checks for thread->exit_state at the beginning, > >> and skips processes that are exiting. 
But later it checks for > >> p->flags & PF_EXITING, and can return -1 in that case (and it does for > >> me). > >> > >> It turns out that do_exit() does a lot of things between setting the > >> thread->flags PF_EXITING bit (in exit_signals()) and setting > >> thread->exit_state to non-zero (in exit_notify()). Some of those > >> things apparently need memory. I caught one process responsible for > >> the PTR_ERR(-1) while it was doing this: > >> > >> [ 191.859358] VC manager R running 0 2388 1108 0x00000104 > >> [ 191.859377] err_ptr_count = 45623 > >> [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 > >> 0000002c f67cfd20 > >> [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 > >> e1302400 e130264c > >> [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 > >> e0611b0c 810b430e > >> [ 191.859450] Call Trace: > >> [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 > >> [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 > >> [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf > >> [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 > >> [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d > >> [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b > >> [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 > >> [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f > >> [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 > >> [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f > >> [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf > >> [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d > >> [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb > >> [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd > >> [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 > >> [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252 > >> [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa > >> [ 191.859672] [<813b7887>] error_code+0x67/0x6c > >> [ 191.859683] [<81191d21>] ? 
__get_user_4+0x11/0x17 > >> [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 > >> [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 > >> [ 191.859718] [<810446d5>] ? finish_task_switch+0x53/0x89 > >> [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 > >> [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 > >> [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a > >> [ 191.859760] [<81028082>] do_exit+0x19b/0x640 > >> [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe > >> [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c > >> [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e > >> [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 > >> [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b > >> [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe > >> [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d > >> [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 > >> [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb > >> [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 > >> [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 > >> [ 191.859893] Kernel panic - not syncing: too many ERR_PTR > >> > >> I don't know why mm_release() would page fault, but it looks like it does. > >> > >> So the OOM killer will not kill other processes because it thinks a > >> process is exiting, which will free up memory. But the exiting > >> process needs memory to continue exiting --> deadlock. Sounds > >> plausible? > > > > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > > If normal exited process in exit path requires a page and there is no free page > > any more, it ends up going to OOM path after try to reclaim memory several time. 
> > Then, > > In select_bad_process, > > > > if (task->flags & PF_EXITING) { > > if (task == current) <== true > > return OOM_SCAN_SELECT; > > In oom_kill_process, > > > > if (p->flags & PF_EXITING) > > set_tsk_thread_flag(p, TIF_MEMDIE); > > > > At last, normal exited process would get a free page. > > > > But in your kernel, it seems not because I guess did_some_progress in > > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is > > do_try_to_free_pages's all_unreclaimable can't do his role by your > > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever. > > > > Sounds plausible? > > Thank you Minchan, it does sound plausible, but I have little > experience with this and it will take some work to confirm. No problem :) > > I looked at the patch pretty carefully once, and I had the impression > its effect could be fully analyzed by logical reasoning. I will check > this again tomorrow, perhaps I can run some experiments. I am adding > Mandeep who wrote the patch. > > However, we have worse problems if we don't use that patch. Without > the patch, and either with or without compressed swap, the same load > causes horrible thrashing, with the system appearing to hang for > minutes. If we don't use that patch, do you have any suggestion on > how to improve the code thrash situation? As I said, the motivation of the patch is good for embedded systems, but the patch's implementation is kinda buggy. I will have a look and post something if I'm lucky enough to get the time. BTW, a question: how did you find the proper value for min_filelist_kbytes? Just experimentation over several trials? Thanks. > > Thanks again! > > >> > >> OK, now someone is going to fix this, right? :-) > >> > >> -- > >> To unsubscribe, send a message with 'unsubscribe linux-mm' in > >> the body to majordomo@kvack.org. For more info on Linux MM, > >> see: http://www.linux-mm.org/ . 
> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > > > -- > > Kind regards, > > Minchan Kim > > > > -- > > To unsubscribe, send a message with 'unsubscribe linux-mm' in > > the body to majordomo@kvack.org. For more info on Linux MM, > > see: http://www.linux-mm.org/ . > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 1:27 ` Minchan Kim @ 2012-10-31 3:49 ` Luigi Semenzato 2012-10-31 7:24 ` Minchan Kim 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 3:49 UTC (permalink / raw) To: Minchan Kim Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mandeep Baines On Tue, Oct 30, 2012 at 6:27 PM, Minchan Kim <minchan@kernel.org> wrote: > On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote: >> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote: >> > Hi Luigi, >> > >> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote: >> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: >> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote: >> >> > >> >> >> However, now there is something that worries me more. The trace of >> >> >> the thread with TIF_MEMDIE set shows that it has executed most of >> >> >> do_exit() and appears to be waiting to be reaped. From my reading of >> >> >> the code, this implies that task->exit_state should be non-zero, which >> >> >> means that select_bad_process should have skipped that thread, which >> >> >> means that we cannot be in the deadlock situation, and my experiments >> >> >> are not consistent. >> >> >> >> >> > >> >> > Yeah, this is what I was referring to earlier, select_bad_process() will >> >> > not consider the thread for which you posted a stack trace for oom kill, >> >> > so it's not deferring because of it. There are either other thread(s) >> >> > that have been oom killed and have not yet release their memory or the oom >> >> > killer is never being called. >> >> >> >> Thanks. I now have better information on what's happening. >> >> >> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE >> >> set). It's another process that's exiting for some other reason. 
>> >> >> >> select_bad_process() checks for thread->exit_state at the beginning, >> >> and skips processes that are exiting. But later it checks for >> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for >> >> me). >> >> >> >> It turns out that do_exit() does a lot of things between setting the >> >> thread->flags PF_EXITING bit (in exit_signals()) and setting >> >> thread->exit_state to non-zero (in exit_notify()). Some of those >> >> things apparently need memory. I caught one process responsible for >> >> the PTR_ERR(-1) while it was doing this: >> >> >> >> [ 191.859358] VC manager R running 0 2388 1108 0x00000104 >> >> [ 191.859377] err_ptr_count = 45623 >> >> [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 >> >> 0000002c f67cfd20 >> >> [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 >> >> e1302400 e130264c >> >> [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 >> >> e0611b0c 810b430e >> >> [ 191.859450] Call Trace: >> >> [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 >> >> [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 >> >> [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf >> >> [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 >> >> [ 191.859515] [<810b430e>] ? drop_super+0x1a/0x1d >> >> [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b >> >> [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 >> >> [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f >> >> [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 >> >> [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f >> >> [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf >> >> [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d >> >> [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb >> >> [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd >> >> [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 >> >> [ 191.859648] [<8104aa17>] ? 
dequeue_entity+0x236/0x252 >> >> [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa >> >> [ 191.859672] [<813b7887>] error_code+0x67/0x6c >> >> [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17 >> >> [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 >> >> [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 >> >> [ 191.859718] [<810446d5>] ? finish_task_switch+0x53/0x89 >> >> [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 >> >> [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 >> >> [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a >> >> [ 191.859760] [<81028082>] do_exit+0x19b/0x640 >> >> [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe >> >> [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c >> >> [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e >> >> [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 >> >> [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b >> >> [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe >> >> [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d >> >> [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 >> >> [ 191.859861] [<8105a754>] ? sys_futex+0xcb/0xdb >> >> [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 >> >> [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 >> >> [ 191.859893] Kernel panic - not syncing: too many ERR_PTR >> >> >> >> I don't know why mm_release() would page fault, but it looks like it does. >> >> >> >> So the OOM killer will not kill other processes because it thinks a >> >> process is exiting, which will free up memory. But the exiting >> >> process needs memory to continue exiting --> deadlock. Sounds >> >> plausible? >> > >> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. >> > If normal exited process in exit path requires a page and there is no free page >> > any more, it ends up going to OOM path after try to reclaim memory several time. 
>> > Then, >> > In select_bad_process, >> > >> > if (task->flags & PF_EXITING) { >> > if (task == current) <== true >> > return OOM_SCAN_SELECT; >> > In oom_kill_process, >> > >> > if (p->flags & PF_EXITING) >> > set_tsk_thread_flag(p, TIF_MEMDIE); >> > >> > At last, normal exited process would get a free page. >> > >> > But in your kernel, it seems not because I guess did_some_progress in >> > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is >> > do_try_to_free_pages's all_unreclaimable can't do his role by your >> > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever. >> > >> > Sounds plausible? >> >> Thank you Minchan, it does sound plausible, but I have little >> experience with this and it will take some work to confirm. > > No problem :) > >> >> I looked at the patch pretty carefully once, and I had the impression >> its effect could be fully analyzed by logical reasoning. I will check >> this again tomorrow, perhaps I can run some experiments. I am adding >> Mandeep who wrote the patch. >> >> However, we have worse problems if we don't use that patch. Without >> the patch, and either with or without compressed swap, the same load >> causes horrible thrashing, with the system appearing to hang for >> minutes. If we don't use that patch, do you have any suggestion on >> how to improve the code thrash situation? > > As I said, the motivation of the patch is good for embedded system but > patch's implementation is kinda buggy. I will have a look and post if > I'm luck to get a time. > > BTW, a question. > > How do you find proper value for min_filelist_kbytes? > Just experiment with several trial? > > Thanks. Yes. Mandeep can give more detail, but, as I understand this, the value we use (50 Mb) was based on experimentation. 
It helps that at the moment we run Chrome OS on a relatively uniform set of devices, with either 2 or 4 GB of RAM, no swap, binaries stored on SSD (for backing store of text pages), and the same load (the Chrome browser). >> >> Thanks again! >> >> >> >> >> OK, now someone is going to fix this, right? :-) >> >> >> >> -- >> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> >> the body to majordomo@kvack.org. For more info on Linux MM, >> >> see: http://www.linux-mm.org/ . >> >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >> > >> > -- >> > Kind regards, >> > Minchan Kim >> > >> > -- >> > To unsubscribe, send a message with 'unsubscribe linux-mm' in >> > the body to majordomo@kvack.org. For more info on Linux MM, >> > see: http://www.linux-mm.org/ . >> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> >> >> -- >> To unsubscribe, send a message with 'unsubscribe linux-mm' in >> the body to majordomo@kvack.org. For more info on Linux MM, >> see: http://www.linux-mm.org/ . >> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
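To make the numbers above concrete, here is a hedged sketch of how an experimentally chosen floor might be pinned at boot, in the same upstart-job style as the zram setup quoted at the start of the thread. The /proc/sys/vm/min_filelist_kbytes path is what the patch's sysctl would presumably expose; the path, the writability guard, and the script itself are assumptions for illustration, not a confirmed Chrome OS script.

```shell
#!/bin/sh
# Hedged sketch: pin the file-LRU floor at boot (values from this thread).
# The knob only exists on kernels carrying the min_filelist_kbytes patch,
# so guard the write rather than failing on stock kernels.
MIN_FILELIST_KB=$((50 * 1024))   # the ~50 MB value chosen by experiment
KNOB=/proc/sys/vm/min_filelist_kbytes
if [ -w "$KNOB" ]; then
    echo "$MIN_FILELIST_KB" > "$KNOB" ||
        logger -t "$UPSTART_JOB" "failed to set min_filelist_kbytes"
fi
echo "min_filelist_kbytes=$MIN_FILELIST_KB"
```

On the 2 GB and 4 GB devices mentioned, the same 50 MB figure was used; a different device class would presumably need its own round of experiments.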
* Re: zram OOM behavior 2012-10-31 3:49 ` Luigi Semenzato @ 2012-10-31 7:24 ` Minchan Kim 2012-10-31 16:07 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-10-31 7:24 UTC (permalink / raw) To: Luigi Semenzato Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mandeep Baines On Tue, Oct 30, 2012 at 08:49:26PM -0700, Luigi Semenzato wrote: > On Tue, Oct 30, 2012 at 6:27 PM, Minchan Kim <minchan@kernel.org> wrote: > > On Tue, Oct 30, 2012 at 06:06:56PM -0700, Luigi Semenzato wrote: > >> On Tue, Oct 30, 2012 at 5:57 PM, Minchan Kim <minchan@kernel.org> wrote: > >> > Hi Luigi, > >> > > >> > On Tue, Oct 30, 2012 at 12:12:02PM -0700, Luigi Semenzato wrote: > >> >> On Mon, Oct 29, 2012 at 10:41 PM, David Rientjes <rientjes@google.com> wrote: > >> >> > On Mon, 29 Oct 2012, Luigi Semenzato wrote: > >> >> > > >> >> >> However, now there is something that worries me more. The trace of > >> >> >> the thread with TIF_MEMDIE set shows that it has executed most of > >> >> >> do_exit() and appears to be waiting to be reaped. From my reading of > >> >> >> the code, this implies that task->exit_state should be non-zero, which > >> >> >> means that select_bad_process should have skipped that thread, which > >> >> >> means that we cannot be in the deadlock situation, and my experiments > >> >> >> are not consistent. > >> >> >> > >> >> > > >> >> > Yeah, this is what I was referring to earlier, select_bad_process() will > >> >> > not consider the thread for which you posted a stack trace for oom kill, > >> >> > so it's not deferring because of it. There are either other thread(s) > >> >> > that have been oom killed and have not yet release their memory or the oom > >> >> > killer is never being called. > >> >> > >> >> Thanks. I now have better information on what's happening. > >> >> > >> >> The "culprit" is not the OOM-killed process (the one with TIF_MEMDIE > >> >> set). 
It's another process that's exiting for some other reason. > >> >> > >> >> select_bad_process() checks for thread->exit_state at the beginning, > >> >> and skips processes that are exiting. But later it checks for > >> >> p->flags & PF_EXITING, and can return -1 in that case (and it does for > >> >> me). > >> >> > >> >> It turns out that do_exit() does a lot of things between setting the > >> >> thread->flags PF_EXITING bit (in exit_signals()) and setting > >> >> thread->exit_state to non-zero (in exit_notify()). Some of those > >> >> things apparently need memory. I caught one process responsible for > >> >> the PTR_ERR(-1) while it was doing this: > >> >> > >> >> [ 191.859358] VC manager R running 0 2388 1108 0x00000104 > >> >> [ 191.859377] err_ptr_count = 45623 > >> >> [ 191.859384] e0611b1c 00200086 f5608000 815ecd20 815ecd20 a0a9ebc3 > >> >> 0000002c f67cfd20 > >> >> [ 191.859407] f430a060 81191c34 e0611aec 81196d79 4168ef20 00000001 > >> >> e1302400 e130264c > >> >> [ 191.859428] e1302400 e0611af4 813b71d5 e0611b00 810b42f1 e1302400 > >> >> e0611b0c 810b430e > >> >> [ 191.859450] Call Trace: > >> >> [ 191.859465] [<81191c34>] ? __delay+0xe/0x10 > >> >> [ 191.859478] [<81196d79>] ? do_raw_spin_lock+0xa2/0xf3 > >> >> [ 191.859491] [<813b71d5>] ? _raw_spin_unlock+0xd/0xf > >> >> [ 191.859504] [<810b42f1>] ? put_super+0x26/0x29 > >> >> [ 191.859515] [<810b430e>] ? 
drop_super+0x1a/0x1d > >> >> [ 191.859527] [<8104512d>] __cond_resched+0x1b/0x2b > >> >> [ 191.859537] [<813b67a7>] _cond_resched+0x18/0x21 > >> >> [ 191.859549] [<81093940>] shrink_slab+0x224/0x22f > >> >> [ 191.859562] [<81095a96>] try_to_free_pages+0x1b7/0x2e6 > >> >> [ 191.859574] [<8108df2a>] __alloc_pages_nodemask+0x40a/0x61f > >> >> [ 191.859588] [<810a9dbe>] read_swap_cache_async+0x4a/0xcf > >> >> [ 191.859600] [<810a9ea4>] swapin_readahead+0x61/0x8d > >> >> [ 191.859612] [<8109fff4>] handle_pte_fault+0x310/0x5fb > >> >> [ 191.859624] [<810a0420>] handle_mm_fault+0xae/0xbd > >> >> [ 191.859637] [<8101d0f9>] do_page_fault+0x265/0x284 > >> >> [ 191.859648] [<8104aa17>] ? dequeue_entity+0x236/0x252 > >> >> [ 191.859660] [<8101ce94>] ? vmalloc_sync_all+0xa/0xa > >> >> [ 191.859672] [<813b7887>] error_code+0x67/0x6c > >> >> [ 191.859683] [<81191d21>] ? __get_user_4+0x11/0x17 > >> >> [ 191.859695] [<81059f28>] ? exit_robust_list+0x30/0x105 > >> >> [ 191.859707] [<813b71b0>] ? _raw_spin_unlock_irq+0xd/0x10 > >> >> [ 191.859718] [<810446d5>] ? finish_task_switch+0x53/0x89 > >> >> [ 191.859730] [<8102351d>] mm_release+0x1d/0xc3 > >> >> [ 191.859740] [<81026ce9>] exit_mm+0x1d/0xe9 > >> >> [ 191.859750] [<81032b87>] ? exit_signals+0x57/0x10a > >> >> [ 191.859760] [<81028082>] do_exit+0x19b/0x640 > >> >> [ 191.859770] [<81058598>] ? futex_wait_queue_me+0xaa/0xbe > >> >> [ 191.859781] [<81030bbf>] ? recalc_sigpending_tsk+0x51/0x5c > >> >> [ 191.859793] [<81030beb>] ? recalc_sigpending+0x17/0x3e > >> >> [ 191.859803] [<81028752>] do_group_exit+0x63/0x86 > >> >> [ 191.859813] [<81032b19>] get_signal_to_deliver+0x434/0x44b > >> >> [ 191.859825] [<81001e01>] do_signal+0x37/0x4fe > >> >> [ 191.859837] [<81048eed>] ? set_next_entity+0x36/0x9d > >> >> [ 191.859850] [<81050d8e>] ? timekeeping_get_ns+0x11/0x55 > >> >> [ 191.859861] [<8105a754>] ? 
sys_futex+0xcb/0xdb > >> >> [ 191.859871] [<810024a7>] do_notify_resume+0x26/0x65 > >> >> [ 191.859883] [<813b73a5>] work_notifysig+0xa/0x11 > >> >> [ 191.859893] Kernel panic - not syncing: too many ERR_PTR > >> >> > >> >> I don't know why mm_release() would page fault, but it looks like it does. > >> >> > >> >> So the OOM killer will not kill other processes because it thinks a > >> >> process is exiting, which will free up memory. But the exiting > >> >> process needs memory to continue exiting --> deadlock. Sounds > >> >> plausible? > >> > > >> > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > >> > If normal exited process in exit path requires a page and there is no free page > >> > any more, it ends up going to OOM path after try to reclaim memory several time. > >> > Then, > >> > In select_bad_process, > >> > > >> > if (task->flags & PF_EXITING) { > >> > if (task == current) <== true > >> > return OOM_SCAN_SELECT; > >> > In oom_kill_process, > >> > > >> > if (p->flags & PF_EXITING) > >> > set_tsk_thread_flag(p, TIF_MEMDIE); > >> > > >> > At last, normal exited process would get a free page. > >> > > >> > But in your kernel, it seems not because I guess did_some_progress in > >> > __alloc_pages_direct_reclaim is never 0. The why it is never 0 is > >> > do_try_to_free_pages's all_unreclaimable can't do his role by your > >> > min_filelist_kbytes. It makes __alloc_pages_slowpath's looping forever. > >> > > >> > Sounds plausible? > >> > >> Thank you Minchan, it does sound plausible, but I have little > >> experience with this and it will take some work to confirm. > > > > No problem :) > > > >> > >> I looked at the patch pretty carefully once, and I had the impression > >> its effect could be fully analyzed by logical reasoning. I will check > >> this again tomorrow, perhaps I can run some experiments. I am adding > >> Mandeep who wrote the patch. > >> > >> However, we have worse problems if we don't use that patch. 
Without > >> the patch, and either with or without compressed swap, the same load > >> causes horrible thrashing, with the system appearing to hang for > >> minutes. If we don't use that patch, do you have any suggestion on > >> how to improve the code thrash situation? > > > > As I said, the motivation of the patch is good for embedded system but > > patch's implementation is kinda buggy. I will have a look and post if > > I'm luck to get a time. > > > > BTW, a question. > > > > How do you find proper value for min_filelist_kbytes? > > Just experiment with several trial? > > > > Thanks. > > Yes. Mandeep can give more detail, but, as I understand this, the > value we use (50 Mb) was based on experimentation. It helps that at > the moment we run Chrome OS on a relatively uniform set of devices, > with either 2 or 4 GB of RAM, no swap, binaries stored on SSD (for > backing store of text pages), and the same load (the Chrome browser). > AFAIRC, I recommended mem_notify instead of hacky patch when Mandeep submitted at the beginning. Does it have any problem? AFAIK, mem_notify had a problem to notify too late so OOM kill still happens. Recently, Anton have been tried new low memory notifier and It should solve same problem and then it's thing you need. https://patchwork.kernel.org/patch/1625251/ Of course, there are further steps to merge it but I think you can help us with some experiments and input your voice to meet Chrome OS's goal. Thanks. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 7:24 ` Minchan Kim @ 2012-10-31 16:07 ` Luigi Semenzato 2012-10-31 17:49 ` Mandeep Singh Baines 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 16:07 UTC (permalink / raw) To: Minchan Kim Cc: David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mandeep Baines On Wed, Oct 31, 2012 at 12:24 AM, Minchan Kim <minchan@kernel.org> wrote: > AFAIRC, I recommended mem_notify instead of hacky patch when Mandeep submitted > at the beginning. Does it have any problem? When we introduced min_filelist_kbytes, the Chrome browser was not prepared to take actions on low-memory notifications, so we could not use that approach. We still needed somehow to prevent the system from thrashing. A couple of years later we added a "tab discard" feature to Chrome, which could be used to release memory in Chrome after saving the DOM state of a tab. At that time I noticed a similar patch from you, which I took and slightly modified for our purposes. I was not aware of Anton's earlier patch then. The basic idea of my patch is the same as yours, but I estimate "easily reclaimable memory" differently. I wasn't sure my patch would be of interest here, so I never posted it. Going back to the min_filelist_kbytes patch, it doesn't seem that it's such a bad idea to have a mechanism that prevents text page thrash. It would be useful if the system kept working even if nobody is paying attention to low-memory notifications. The hacky patch sets a threshold under which text pages are not evicted, to maintain a reasonably-sized working set in memory. Perhaps this threshold should be set dynamically based on the rate of page faults due to instruction fetches? > AFAIK, mem_notify had a problem to notify too late so OOM kill still happens. > Recently, Anton have been tried new low memory notifier and It should solve > same problem and then it's thing you need. 
> https://patchwork.kernel.org/patch/1625251/ Yes, part of the problem is that all these mechanisms are based on heuristics. Chrome tab discard is conceptually very similar to OOM kill. When Chrome gets a low-memory notification, it discards a tab and then waits for about 1s before checking if it should discard more tabs. If other processes are allocating aggressively (for instance after issuing commands that load multiple tabs in parallel), they will use up memory faster than the tab discarder is releasing it. So it's essential to have a functioning fall-back mechanism in the kernel. > Of course, there are further steps to merge it but I think you can help us > with some experiments and input your voice to meet Chrome OS's goal. I will look at Anton's notifier and see if it would meet our needs. Thanks! > > Thanks. > > -- > Kind regards, > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 16:07 ` Luigi Semenzato @ 2012-10-31 17:49 ` Mandeep Singh Baines 0 siblings, 0 replies; 56+ messages in thread From: Mandeep Singh Baines @ 2012-10-31 17:49 UTC (permalink / raw) To: Luigi Semenzato Cc: Minchan Kim, David Rientjes, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao Luigi Semenzato (semenzato@google.com) wrote: > On Wed, Oct 31, 2012 at 12:24 AM, Minchan Kim <minchan@kernel.org> wrote: > > > AFAIRC, I recommended mem_notify instead of hacky patch when Mandeep submitted > > at the beginning. Does it have any problem? > > When we introduced min_filelist_kbytes, the Chrome browser was not > prepared to take actions on low-memory notifications, so we could not > use that approach. We still needed somehow to prevent the system from > thrashing. > > A couple of years later we added a "tab discard" feature to Chrome, > which could be used to release memory in Chrome after saving the DOM > state of a tab. At that time I noticed a similar patch from you, > which I took and slightly modified for our purposes. I was not aware > of Anton's earlier patch then. The basic idea of my patch is the same > as yours, but I estimate "easily reclaimable memory" differently. > > I wasn't sure my patch would be of interest here, so I never posted it. > > Going back to the min_filelist_kbytes patch, it doesn't seem that it's > such a bad idea to have a mechanism that prevents text page thrash. > It would be useful if the system kept working even if nobody is paying > attention to low-memory notifications. The hacky patch sets a > threshold under which text pages are not evicted, to maintain a > reasonably-sized working set in memory. Perhaps this threshold should > be set dynamically based on the rate of page faults due to instruction > fetches? > An alternative approach I was considering was to just limit the rate at which you scan each of the LRU lists. Limit the rate to one complete scan of the list every scan_period. 
This would prevent thrashing of file and anon pages and would require no tuning. You could set scan_period to one of the scheduler periods. Regards, Mandeep > > AFAIK, mem_notify had a problem to notify too late so OOM kill still happens. > > Recently, Anton have been tried new low memory notifier and It should solve > > same problem and then it's thing you need. > > https://patchwork.kernel.org/patch/1625251/ > > Yes, part of the problem is that all these mechanisms are based on > heuristics. Chrome tab discard is conceptually very similar to OOM > kill. When Chrome gets a low-memory notification, it discards a tab > and then waits for about 1s before checking if it should discard more > tabs. If other processes are allocating aggressively (for instance > after issuing commands that load multiple tabs in parallel), they will > use up memory faster than the tab discarder is releasing it. So it's > essential to have a functioning fall-back mechanism in the kernel. > > > Of course, there are further steps to merge it but I think you can help us > > with some experiments and input your voice to meet Chrome OS's goal. > > I will look at Anton's notifier and see if it would meet our needs. Thanks! > > > > > Thanks. > > > > -- > > Kind regards, > > Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 0:57 ` Minchan Kim 2012-10-31 1:06 ` Luigi Semenzato @ 2012-10-31 18:54 ` David Rientjes 2012-10-31 21:40 ` Luigi Semenzato ` (2 more replies) 1 sibling, 3 replies; 56+ messages in thread From: David Rientjes @ 2012-10-31 18:54 UTC (permalink / raw) To: Minchan Kim Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, 31 Oct 2012, Minchan Kim wrote: > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > If normal exited process in exit path requires a page and there is no free page > any more, it ends up going to OOM path after try to reclaim memory several time. > Then, > In select_bad_process, > > if (task->flags & PF_EXITING) { > if (task == current) <== true > return OOM_SCAN_SELECT; > In oom_kill_process, > > if (p->flags & PF_EXITING) > set_tsk_thread_flag(p, TIF_MEMDIE); > > At last, normal exited process would get a free page. > select_bad_process() won't actually select the process for oom kill, though, if there are other PF_EXITING threads other than current. So if multiple threads are page faulting on tsk->robust_list, then no thread ends up getting killed. The temporary workaround would be to do a kill -9 so that the logic in out_of_memory() could immediately give such threads access to memory reserves so the page fault will succeed. The real fix would be to audit all possible cases in between setting tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory allocation and make exemptions for them in oom_scan_process_thread(). -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 18:54 ` David Rientjes @ 2012-10-31 21:40 ` Luigi Semenzato 2012-11-01 2:11 ` Minchan Kim 2012-11-01 2:43 ` Minchan Kim 2 siblings, 0 replies; 56+ messages in thread From: Luigi Semenzato @ 2012-10-31 21:40 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao Thanks so much for your help. There are two issues: one is what we (Chrome OS) should do, the other is what should be done for ToT linux. The fix(es) you propose are harder to understand than mine, and put additional special conditions in code that is already rife with them. My fix, instead, removes one such special condition. It can, in principle, cause processes to be OOM-killed unnecessarily, but what's the likelihood that it will happen? We don't actually see it happen, and it matters little to us if it happens. I would be more than happy to try one of your fixes, but not likely to implement it. On Wed, Oct 31, 2012 at 11:54 AM, David Rientjes <rientjes@google.com> wrote: > On Wed, 31 Oct 2012, Minchan Kim wrote: > >> It sounds right in your kernel but principal problem is min_filelist_kbytes patch. >> If normal exited process in exit path requires a page and there is no free page >> any more, it ends up going to OOM path after try to reclaim memory several time. >> Then, >> In select_bad_process, >> >> if (task->flags & PF_EXITING) { >> if (task == current) <== true >> return OOM_SCAN_SELECT; >> In oom_kill_process, >> >> if (p->flags & PF_EXITING) >> set_tsk_thread_flag(p, TIF_MEMDIE); >> >> At last, normal exited process would get a free page. >> > > select_bad_process() won't actually select the process for oom kill, > though, if there are other PF_EXITING threads other than current. So if > multiple threads are page faulting on tsk->robust_list, then no thread > ends up getting killed. 
The temporary workaround would be to do a kill -9 > so that the logic in out_of_memory() could immediately give such threads > access to memory reserves so the page fault will succeed. When we discover the thread in such state, it's already in do_exit() and it's waiting for the page fault to complete. Will it wait forever, or timeout and retry? Is it acceptable, and sufficient, to change task->exit_code on the fly? If not, what else? It is quite difficult to analyze that code. > The real fix > would be to audit all possible cases in between setting > tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory > allocation and make exemptions for them in oom_scan_process_thread(). I think I probably slightly disagree with this. It's an extra step in the direction of unmaintainability. Wouldn't it be better to disallow a thread from making allocations in that section, fix all the places where it does, and panic to catch missed occurrences or new ones? Otherwise the OOM module will have to know additional details about what threads are doing, or threads will have to maintain that state (task->exiting_but_may_still_allocate = 1). Isn't there already too much of this stuff going on? Thanks again! -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 18:54 ` David Rientjes 2012-10-31 21:40 ` Luigi Semenzato @ 2012-11-01 2:11 ` Minchan Kim 2012-11-01 4:38 ` David Rientjes 2012-11-01 2:43 ` Minchan Kim 2 siblings, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-11-01 2:11 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, Oct 31, 2012 at 11:54:07AM -0700, David Rientjes wrote: > On Wed, 31 Oct 2012, Minchan Kim wrote: > > > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > > If normal exited process in exit path requires a page and there is no free page > > any more, it ends up going to OOM path after try to reclaim memory several time. > > Then, > > In select_bad_process, > > > > if (task->flags & PF_EXITING) { > > if (task == current) <== true > > return OOM_SCAN_SELECT; > > In oom_kill_process, > > > > if (p->flags & PF_EXITING) > > set_tsk_thread_flag(p, TIF_MEMDIE); > > > > At last, normal exited process would get a free page. > > > > select_bad_process() won't actually select the process for oom kill, > though, if there are other PF_EXITING threads other than current. So if > multiple threads are page faulting on tsk->robust_list, then no thread > ends up getting killed. The temporary workaround would be to do a kill -9 If multiple threads are page faulting and try to allocate memory, then they should go to the OOM path and they will reach the following code. if (task->flags & PF_EXITING) { if (task == current) return OOM_SCAN_SELECT; So, the thread can access the reserved memory pool and the page fault will succeed. > so that the logic in out_of_memory() could immediately give such threads > access to memory reserves so the page fault will succeed. The real fix > would be to audit all possible cases in between setting > tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory > allocation and make exemptions for them in oom_scan_process_thread().
-- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-11-01 2:11 ` Minchan Kim @ 2012-11-01 4:38 ` David Rientjes 2012-11-01 5:18 ` Minchan Kim 0 siblings, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-11-01 4:38 UTC (permalink / raw) To: Minchan Kim Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Thu, 1 Nov 2012, Minchan Kim wrote: > If multiple threads are page faulting and try to allocate memory, then they > should go to the OOM path and they will reach the following code. > > if (task->flags & PF_EXITING) { > if (task == current) > return OOM_SCAN_SELECT; > No, OOM_SCAN_SELECT does not return immediately and kill that process; it only prefers to kill that process first iff the oom killer isn't deferred because it finds TIF_MEMDIE threads or other PF_EXITING threads other than current. So if multiple processes are in the exit path with PF_EXITING and require additional memory then the oom killer may defer without killing anything. That's what I suspect is happening in this case. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-11-01 4:38 ` David Rientjes @ 2012-11-01 5:18 ` Minchan Kim 0 siblings, 0 replies; 56+ messages in thread From: Minchan Kim @ 2012-11-01 5:18 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, Oct 31, 2012 at 09:38:47PM -0700, David Rientjes wrote: > On Thu, 1 Nov 2012, Minchan Kim wrote: > > > If mutiple threads are page faulting and try to allocate memory, then they > > should go to oom path and they will reach following code. > > > > if (task->flags & PF_EXITING) { > > if (task == current) > > return OOM_SCAN_SELECT; > > > > No, OOM_SCAN_SELECT does not return immediately and kill that process; it > only prefers to kill that process first iff the oom killer isn't deferred > because it finds TIF_MEMDIE threads or other PF_EXITING threads other than > current. So if multiple processes are in the exit path with PF_EXITING > and require additional memory then the oom killed may defer without > killing anything. That's what I suspect is happening in this case. Indeed. Thanks for correcting me, David. > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-10-31 18:54 ` David Rientjes 2012-10-31 21:40 ` Luigi Semenzato 2012-11-01 2:11 ` Minchan Kim @ 2012-11-01 2:43 ` Minchan Kim 2012-11-01 4:48 ` David Rientjes 2 siblings, 1 reply; 56+ messages in thread From: Minchan Kim @ 2012-11-01 2:43 UTC (permalink / raw) To: David Rientjes Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao, Mel Gorman On Wed, Oct 31, 2012 at 11:54:07AM -0700, David Rientjes wrote: > On Wed, 31 Oct 2012, Minchan Kim wrote: > > > It sounds right in your kernel but principal problem is min_filelist_kbytes patch. > > If normal exited process in exit path requires a page and there is no free page > > any more, it ends up going to OOM path after try to reclaim memory several time. > > Then, > > In select_bad_process, > > > > if (task->flags & PF_EXITING) { > > if (task == current) <== true > > return OOM_SCAN_SELECT; > > In oom_kill_process, > > > > if (p->flags & PF_EXITING) > > set_tsk_thread_flag(p, TIF_MEMDIE); > > > > At last, normal exited process would get a free page. > > > > select_bad_process() won't actually select the process for oom kill, > though, if there are other PF_EXITING threads other than current. So if > multiple threads are page faulting on tsk->robust_list, then no thread > ends up getting killed. The temporary workaround would be to do a kill -9 > so that the logic in out_of_memory() could immediately give such threads > access to memory reserves so the page fault will succeed. The real fix It's not true any more. 3.6 includes the following code in try_to_free_pages: /* * Do not enter reclaim if fatal signal is pending. 1 is returned so * that the page allocator does not consider triggering OOM */ if (fatal_signal_pending(current)) return 1; So the hung task never goes to the OOM path and could loop forever.
> would be to audit all possible cases in between setting > tsk->flags |= PF_EXITING and tsk->mm = NULL that could cause a memory > allocation and make exemptions for them in oom_scan_process_thread(). -- Kind regards, Minchan Kim ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-11-01 2:43 ` Minchan Kim @ 2012-11-01 4:48 ` David Rientjes 2012-11-01 5:26 ` Minchan Kim ` (2 more replies) 0 siblings, 3 replies; 56+ messages in thread From: David Rientjes @ 2012-11-01 4:48 UTC (permalink / raw) To: Minchan Kim, Mel Gorman Cc: Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Thu, 1 Nov 2012, Minchan Kim wrote: > It's not true any more. > 3.6 includes following code in try_to_free_pages > > /* > * Do not enter reclaim if fatal signal is pending. 1 is returned so > * that the page allocator does not consider triggering OOM > */ > if (fatal_signal_pending(current)) > return 1; > > So the hunged task never go to the OOM path and could be looping forever. > Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage"). Thanks for adding Mel to the cc. The oom killer specifically has logic for this condition: when calling out_of_memory() the first thing it does is if (fatal_signal_pending(current)) set_thread_flag(TIF_MEMDIE); to allow it access to memory reserves so that it may exit if it's having trouble. But that ends up never happening because of the above code that Minchan has identified. So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() as well or revert that early return entirely; there's no justification given for it in the comment nor in the commit log. I'd rather remove it and allow the oom killer to trigger and grant access to memory reserves itself if necessary. Mel, how does commit 5515061d22f0 deal with threads looping forever if they need memory in the exit path since the oom killer never gets called? That aside, it doesn't seem like this is the issue that Luigi is reporting since his patch that avoids deferring the oom killer presumably fixes the issue for him. So it turns out the oom killer must be getting called. Luigi, can you try this instead? 
It applies to the latest git but should be easily modified to apply to any 3.x kernel you're running. --- diff --git a/mm/oom_kill.c b/mm/oom_kill.c --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, if (!task->mm) return OOM_SCAN_CONTINUE; - if (task->flags & PF_EXITING) { + if (task->flags & PF_EXITING && !force_kill) { /* - * If task is current and is in the process of releasing memory, - * allow the "kill" to set TIF_MEMDIE, which will allow it to - * access memory reserves. Otherwise, it may stall forever. - * - * The iteration isn't broken here, however, in case other - * threads are found to have already been oom killed. + * If this task is not being ptraced on exit, then wait for it + * to finish before killing some other task unnecessarily. */ - if (task == current) - return OOM_SCAN_SELECT; - else if (!force_kill) { - /* - * If this task is not being ptraced on exit, then wait - * for it to finish before killing some other task - * unnecessarily. - */ - if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) - return OOM_SCAN_ABORT; - } + if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) + return OOM_SCAN_ABORT; } return OOM_SCAN_OK; } @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, return; /* - * If current has a pending SIGKILL, then automatically select it. The - * goal is to allow it to allocate so that it may quickly exit and free - * its memory. + * If current has a pending SIGKILL or is exiting, then automatically + * select it. The goal is to allow it to allocate so that it may + * quickly exit and free its memory. */ - if (fatal_signal_pending(current)) { + if (fatal_signal_pending(current) || current->flags & PF_EXITING) { set_thread_flag(TIF_MEMDIE); return; } -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: zram OOM behavior 2012-11-01 4:48 ` David Rientjes @ 2012-11-01 5:26 ` Minchan Kim 2012-11-01 8:28 ` Mel Gorman 2012-11-01 17:50 ` Luigi Semenzato 2 siblings, 0 replies; 56+ messages in thread From: Minchan Kim @ 2012-11-01 5:26 UTC (permalink / raw) To: David Rientjes Cc: Mel Gorman, Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote: > On Thu, 1 Nov 2012, Minchan Kim wrote: > > > It's not true any more. > > 3.6 includes following code in try_to_free_pages > > > > /* > > * Do not enter reclaim if fatal signal is pending. 1 is returned so > > * that the page allocator does not consider triggering OOM > > */ > > if (fatal_signal_pending(current)) > > return 1; > > > > So the hunged task never go to the OOM path and could be looping forever. > > > > Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network > storage"). Thanks for adding Mel to the cc. > > The oom killer specifically has logic for this condition: when calling > out_of_memory() the first thing it does is > > if (fatal_signal_pending(current)) > set_thread_flag(TIF_MEMDIE); > > to allow it access to memory reserves so that it may exit if it's having > trouble. But that ends up never happening because of the above code that > Minchan has identified. > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() > as well or revert that early return entirely; there's no justification > given for it in the comment nor in the commit log. I'd rather remove it > and allow the oom killer to trigger and grant access to memory reserves > itself if necessary. > > Mel, how does commit 5515061d22f0 deal with threads looping forever if > they need memory in the exit path since the oom killer never gets called? 
> > That aside, it doesn't seem like this is the issue that Luigi is reporting > since his patch that avoids deferring the oom killer presumably fixes the > issue for him. So it turns out the oom killer must be getting called. Exactly. > > Luigi, can you try this instead? It applies to the latest git but should > be easily modified to apply to any 3.x kernel you're running. > --- > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > if (!task->mm) > return OOM_SCAN_CONTINUE; > > - if (task->flags & PF_EXITING) { > + if (task->flags & PF_EXITING && !force_kill) { > /* > - * If task is current and is in the process of releasing memory, > - * allow the "kill" to set TIF_MEMDIE, which will allow it to > - * access memory reserves. Otherwise, it may stall forever. > - * > - * The iteration isn't broken here, however, in case other > - * threads are found to have already been oom killed. > + * If this task is not being ptraced on exit, then wait for it > + * to finish before killing some other task unnecessarily. > */ > - if (task == current) > - return OOM_SCAN_SELECT; > - else if (!force_kill) { > - /* > - * If this task is not being ptraced on exit, then wait > - * for it to finish before killing some other task > - * unnecessarily. > - */ > - if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) > - return OOM_SCAN_ABORT; > - } > + if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) > + return OOM_SCAN_ABORT; > } > return OOM_SCAN_OK; > } > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > return; > > /* > - * If current has a pending SIGKILL, then automatically select it. The > - * goal is to allow it to allocate so that it may quickly exit and free > - * its memory. > + * If current has a pending SIGKILL or is exiting, then automatically > + * select it. 
The goal is to allow it to allocate so that it may > + * quickly exit and free its memory. > */ > - if (fatal_signal_pending(current)) { > + if (fatal_signal_pending(current) || current->flags & PF_EXITING) { > set_thread_flag(TIF_MEMDIE); > return; > } Looks good to me. -- Kind regards, Minchan Kim -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 56+ messages in thread
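The livelock Minchan identifies can be sketched as a self-contained C model (hypothetical and heavily simplified; this is not actual kernel code, and the names merely echo the kernel functions discussed above): with the 3.6-era early return in place, a killed task's direct-reclaim calls report "progress" forever, the OOM path that would grant it TIF_MEMDIE is never reached, and the allocation loop cannot terminate.

```c
#include <assert.h>
#include <stdbool.h>

static bool fatal_signal_pending_flag = true; /* task has been SIGKILLed */
static bool tif_memdie = false;               /* access to memory reserves */

/* Models the 3.6-era try_to_free_pages() entry check: a pending fatal
 * signal makes it pretend progress was made, so the allocator never
 * considers invoking the OOM killer. */
static unsigned long try_to_free_pages(void)
{
    if (fatal_signal_pending_flag)
        return 1;   /* "progress": the OOM path is never reached */
    return 0;       /* no progress: out_of_memory() could run */
}

/* Models __alloc_pages_slowpath() for a task whose allocation can only
 * succeed from the reserves (i.e. once TIF_MEMDIE is set). */
static bool alloc_page(int max_retries)
{
    for (int i = 0; i < max_retries; i++) {
        if (tif_memdie)
            return true;        /* reserves would satisfy the fault */
        if (try_to_free_pages() > 0)
            continue;           /* "progress" => retry, skip OOM path */
        tif_memdie = true;      /* out_of_memory() sets this for a task
                                 * with a fatal signal pending */
    }
    return false;               /* livelock: retries exhausted */
}
```

Under this model a killed task loops indefinitely in `alloc_page()`, which is the "could be looping forever" behavior described in the quoted text.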
* Re: zram OOM behavior 2012-11-01 4:48 ` David Rientjes 2012-11-01 5:26 ` Minchan Kim @ 2012-11-01 8:28 ` Mel Gorman 2012-11-01 15:57 ` Luigi Semenzato 2012-11-01 17:50 ` Luigi Semenzato 2 siblings, 1 reply; 56+ messages in thread From: Mel Gorman @ 2012-11-01 8:28 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, Luigi Semenzato, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote: > On Thu, 1 Nov 2012, Minchan Kim wrote: > > > It's not true any more. > > 3.6 includes following code in try_to_free_pages > > > > /* > > * Do not enter reclaim if fatal signal is pending. 1 is returned so > > * that the page allocator does not consider triggering OOM > > */ > > if (fatal_signal_pending(current)) > > return 1; > > > > So the hunged task never go to the OOM path and could be looping forever. > > > > Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network > storage"). Thanks for adding Mel to the cc. > Indeed, thanks. > The oom killer specifically has logic for this condition: when calling > out_of_memory() the first thing it does is > > if (fatal_signal_pending(current)) > set_thread_flag(TIF_MEMDIE); > > to allow it access to memory reserves so that it may exit if it's having > trouble. But that ends up never happening because of the above code that > Minchan has identified. > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() > as well or revert that early return entirely; there's no justification > given for it in the comment nor in the commit log. The check for fatal signal is in the wrong place. The reason it was added is that a throttled process sleeps in an interruptible sleep. If a user forcibly kills a throttled process, it should not result in an OOM kill. 
> I'd rather remove it > and allow the oom killer to trigger and grant access to memory reserves > itself if necessary. > > Mel, how does commit 5515061d22f0 deal with threads looping forever if > they need memory in the exit path since the oom killer never gets called? > It doesn't. How about this? ---8<--- mm: vmscan: Check for fatal signals iff the process was throttled commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves are low and swap is backed by network storage") introduced a check for fatal signals after a process gets throttled for network storage. The intention was that if a process was throttled and got killed, it should not trigger the OOM killer. As pointed out by Minchan Kim and David Rientjes, this check is in the wrong place and too broad. If a system is in an OOM situation and a process is exiting, it can loop in __alloc_pages_slowpath(), calling direct reclaim repeatedly. As the fatal signal is pending it returns 1 as if it is making forward progress and can effectively deadlock. This patch moves the fatal_signal_pending() check after throttling to throttle_direct_reclaim() where it belongs. If this patch passes review it should be considered a -stable candidate for 3.6. Signed-off-by: Mel Gorman <mgorman@suse.de> --- mm/vmscan.c | 37 +++++++++++++++++++++++++++---------- 1 file changed, 27 insertions(+), 10 deletions(-) diff --git a/mm/vmscan.c b/mm/vmscan.c index 2b7edfa..ca9e37f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) * Throttle direct reclaimers if backing storage is backed by the network * and the PFMEMALLOC reserve for the preferred node is getting dangerously * depleted. kswapd will continue to make progress and wake the processes - * when the low watermark is reached + * when the low watermark is reached. + * + * Returns true if a fatal signal was delivered during throttling. 
If this + * happens, the page allocator should not consider triggering the OOM killer. */ -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, nodemask_t *nodemask) { struct zone *zone; @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, * processes to block on log_wait_commit(). */ if (current->flags & PF_KTHREAD) - return; + goto out; + + /* + * If a fatal signal is pending, this process should not throttle. + * It should return quickly so it can exit and free its memory + */ + if (fatal_signal_pending(current)) + goto out; /* Check if the pfmemalloc reserves are ok */ first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone); pgdat = zone->zone_pgdat; if (pfmemalloc_watermark_ok(pgdat)) - return; + goto out; /* Account for the throttling */ count_vm_event(PGSCAN_DIRECT_THROTTLE); @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, if (!(gfp_mask & __GFP_FS)) { wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, pfmemalloc_watermark_ok(pgdat), HZ); - return; + + goto check_pending; } /* Throttle until kswapd wakes the process */ wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, pfmemalloc_watermark_ok(pgdat)); + +check_pending: + if (fatal_signal_pending(current)) + return true; + +out: + return false; } unsigned long try_to_free_pages(struct zonelist *zonelist, int order, @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, .gfp_mask = sc.gfp_mask, }; - throttle_direct_reclaim(gfp_mask, zonelist, nodemask); - /* - * Do not enter reclaim if fatal signal is pending. 1 is returned so - * that the page allocator does not consider triggering OOM + * Do not enter reclaim if fatal signal was delivered while throttled. + * 1 is returned so that the page allocator does not OOM kill at this + * point. 
*/ - if (fatal_signal_pending(current)) + if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask)) return 1; trace_mm_vmscan_direct_reclaim_begin(order, 
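What Mel's patch changes can be seen in a small standalone C model (hypothetical and simplified from the patch above; the kernel details — wait queues, watermark checks, GFP flags — are stubbed out as booleans): a fatal signal now suppresses reclaim only when it arrived while the task was actually throttled, while a task that enters with SIGKILL already pending skips the throttle sleep and proceeds, keeping the OOM path reachable.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the control flow in the reworked
 * throttle_direct_reclaim(); returns true iff a fatal signal was
 * delivered during throttling. */
static bool throttle_direct_reclaim(bool is_kthread,
                                    bool signal_already_pending,
                                    bool pfmemalloc_reserves_ok,
                                    bool signal_delivered_while_waiting)
{
    if (is_kthread)
        return false;   /* kernel threads are never throttled */
    if (signal_already_pending)
        return false;   /* a killed task must not sleep here: it should
                         * return quickly, exit, and free its memory */
    if (pfmemalloc_reserves_ok)
        return false;   /* nothing to throttle on */

    /* ...otherwise the task sleeps on pfmemalloc_wait until kswapd
     * wakes it or a fatal signal arrives... */

    return signal_delivered_while_waiting;
}

/* Models the new try_to_free_pages() entry: report "progress" (skip
 * reclaim, no OOM kill) only when the throttle sleep was interrupted. */
static unsigned long try_to_free_pages(bool throttled_and_killed)
{
    if (throttled_and_killed)
        return 1;
    return 0; /* proceed with real reclaim; the OOM path stays reachable */
}
```

The narrowing is the point: the old code returned 1 for any pending fatal signal, which is what let an exiting task loop forever; the new code returns 1 only for the case the check was originally written for.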
* Re: zram OOM behavior 2012-11-01 8:28 ` Mel Gorman @ 2012-11-01 15:57 ` Luigi Semenzato 2012-11-01 15:58 ` Luigi Semenzato 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-11-01 15:57 UTC (permalink / raw) To: Mel Gorman Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Thu, Nov 1, 2012 at 1:28 AM, Mel Gorman <mgorman@suse.de> wrote: > On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote: >> On Thu, 1 Nov 2012, Minchan Kim wrote: >> >> > It's not true any more. >> > 3.6 includes following code in try_to_free_pages >> > >> > /* >> > * Do not enter reclaim if fatal signal is pending. 1 is returned so >> > * that the page allocator does not consider triggering OOM >> > */ >> > if (fatal_signal_pending(current)) >> > return 1; >> > >> > So the hunged task never go to the OOM path and could be looping forever. >> > >> >> Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct >> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network >> storage"). Thanks for adding Mel to the cc. >> > > Indeed, thanks. > >> The oom killer specifically has logic for this condition: when calling >> out_of_memory() the first thing it does is >> >> if (fatal_signal_pending(current)) >> set_thread_flag(TIF_MEMDIE); >> >> to allow it access to memory reserves so that it may exit if it's having >> trouble. But that ends up never happening because of the above code that >> Minchan has identified. >> >> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() >> as well or revert that early return entirely; there's no justification >> given for it in the comment nor in the commit log. > > The check for fatal signal is in the wrong place. The reason it was added > is because a throttled process sleeps in an interruptible sleep. If a user > user forcibly kills a throttled process, it should not result in an OOM kill. 
> >> I'd rather remove it >> and allow the oom killer to trigger and grant access to memory reserves >> itself if necessary. >> >> Mel, how does commit 5515061d22f0 deal with threads looping forever if >> they need memory in the exit path since the oom killer never gets called? >> > > It doesn't. How about this? > > ---8<--- > mm: vmscan: Check for fatal signals iff the process was throttled > > commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves > are low and swap is backed by network storage") introduced a check for > fatal signals after a process gets throttled for network storage. The > intention was that if a process was throttled and got killed that it > should not trigger the OOM killer. As pointed out by Minchan Kim and > David Rientjes, this check is in the wrong place and too broad. If a > system is in am OOM situation and a process is exiting, it can loop in > __alloc_pages_slowpath() and calling direct reclaim in a loop. As the > fatal signal is pending it returns 1 as if it is making forward progress > and can effectively deadlock. > > This patch moves the fatal_signal_pending() check after throttling to > throttle_direct_reclaim() where it belongs. > > If this patch passes review it should be considered a -stable candidate > for 3.6. > > Signed-off-by: Mel Gorman <mgorman@suse.de> > --- > mm/vmscan.c | 37 +++++++++++++++++++++++++++---------- > 1 file changed, 27 insertions(+), 10 deletions(-) > > diff --git a/mm/vmscan.c b/mm/vmscan.c > index 2b7edfa..ca9e37f 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) > * Throttle direct reclaimers if backing storage is backed by the network > * and the PFMEMALLOC reserve for the preferred node is getting dangerously > * depleted. kswapd will continue to make progress and wake the processes > - * when the low watermark is reached > + * when the low watermark is reached. 
> + * > + * Returns true if a fatal signal was delivered during throttling. If this > + * happens, the page allocator should not consider triggering the OOM killer. > */ > -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, > +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, > nodemask_t *nodemask) > { > struct zone *zone; > @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, > * processes to block on log_wait_commit(). > */ > if (current->flags & PF_KTHREAD) > - return; > + goto out; > + > + /* > + * If a fatal signal is pending, this process should not throttle. > + * It should return quickly so it can exit and free its memory > + */ > + if (fatal_signal_pending(current)) > + goto out; > > /* Check if the pfmemalloc reserves are ok */ > first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone); > pgdat = zone->zone_pgdat; > if (pfmemalloc_watermark_ok(pgdat)) > - return; > + goto out; > > /* Account for the throttling */ > count_vm_event(PGSCAN_DIRECT_THROTTLE); > @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, > if (!(gfp_mask & __GFP_FS)) { > wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, > pfmemalloc_watermark_ok(pgdat), HZ); > - return; > + > + goto check_pending; > } > > /* Throttle until kswapd wakes the process */ > wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, > pfmemalloc_watermark_ok(pgdat)); > + > +check_pending: > + if (fatal_signal_pending(current)) > + return true; > + > +out: > + return false; > } > > unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, > .gfp_mask = sc.gfp_mask, > }; > > - throttle_direct_reclaim(gfp_mask, zonelist, nodemask); > - > /* > - * Do not enter reclaim if fatal signal is pending. 
1 is returned so > - * that the page allocator does not consider triggering OOM > + * Do not enter reclaim if fatal signal was delivered while throttled. > + * 1 is returned so that the page allocator does not OOM kill at this > + * point. > */ > - if (fatal_signal_pending(current)) > + if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask)) > return 1; > > trace_mm_vmscan_direct_reclaim_begin(order, 
* Re: zram OOM behavior 2012-11-01 15:57 ` Luigi Semenzato @ 2012-11-01 15:58 ` Luigi Semenzato 2012-11-01 21:48 ` David Rientjes 0 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-11-01 15:58 UTC (permalink / raw) To: Mel Gorman Cc: David Rientjes, Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao (Sorry, slip of finger.) On Thu, Nov 1, 2012 at 8:57 AM, Luigi Semenzato <semenzato@google.com> wrote: > On Thu, Nov 1, 2012 at 1:28 AM, Mel Gorman <mgorman@suse.de> wrote: >> On Wed, Oct 31, 2012 at 09:48:57PM -0700, David Rientjes wrote: >>> On Thu, 1 Nov 2012, Minchan Kim wrote: >>> >>> > It's not true any more. >>> > 3.6 includes following code in try_to_free_pages >>> > >>> > /* >>> > * Do not enter reclaim if fatal signal is pending. 1 is returned so >>> > * that the page allocator does not consider triggering OOM >>> > */ >>> > if (fatal_signal_pending(current)) >>> > return 1; >>> > >>> > So the hunged task never go to the OOM path and could be looping forever. >>> > >>> >>> Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct >>> reclaimers if PF_MEMALLOC reserves are low and swap is backed by network >>> storage"). Thanks for adding Mel to the cc. >>> >> >> Indeed, thanks. >> >>> The oom killer specifically has logic for this condition: when calling >>> out_of_memory() the first thing it does is >>> >>> if (fatal_signal_pending(current)) >>> set_thread_flag(TIF_MEMDIE); >>> >>> to allow it access to memory reserves so that it may exit if it's having >>> trouble. But that ends up never happening because of the above code that >>> Minchan has identified. >>> >>> So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() >>> as well or revert that early return entirely; there's no justification >>> given for it in the comment nor in the commit log. >> >> The check for fatal signal is in the wrong place. 
The reason it was added >> is because a throttled process sleeps in an interruptible sleep. If a user >> user forcibly kills a throttled process, it should not result in an OOM kill. >> >>> I'd rather remove it >>> and allow the oom killer to trigger and grant access to memory reserves >>> itself if necessary. >>> >>> Mel, how does commit 5515061d22f0 deal with threads looping forever if >>> they need memory in the exit path since the oom killer never gets called? >>> >> >> It doesn't. How about this? >> >> ---8<--- >> mm: vmscan: Check for fatal signals iff the process was throttled >> >> commit 5515061d22f0 ("mm: throttle direct reclaimers if PF_MEMALLOC reserves >> are low and swap is backed by network storage") introduced a check for >> fatal signals after a process gets throttled for network storage. The >> intention was that if a process was throttled and got killed that it >> should not trigger the OOM killer. As pointed out by Minchan Kim and >> David Rientjes, this check is in the wrong place and too broad. If a >> system is in am OOM situation and a process is exiting, it can loop in >> __alloc_pages_slowpath() and calling direct reclaim in a loop. As the >> fatal signal is pending it returns 1 as if it is making forward progress >> and can effectively deadlock. >> >> This patch moves the fatal_signal_pending() check after throttling to >> throttle_direct_reclaim() where it belongs. >> >> If this patch passes review it should be considered a -stable candidate >> for 3.6. 
>> >> Signed-off-by: Mel Gorman <mgorman@suse.de> >> --- >> mm/vmscan.c | 37 +++++++++++++++++++++++++++---------- >> 1 file changed, 27 insertions(+), 10 deletions(-) >> >> diff --git a/mm/vmscan.c b/mm/vmscan.c >> index 2b7edfa..ca9e37f 100644 >> --- a/mm/vmscan.c >> +++ b/mm/vmscan.c >> @@ -2238,9 +2238,12 @@ static bool pfmemalloc_watermark_ok(pg_data_t *pgdat) >> * Throttle direct reclaimers if backing storage is backed by the network >> * and the PFMEMALLOC reserve for the preferred node is getting dangerously >> * depleted. kswapd will continue to make progress and wake the processes >> - * when the low watermark is reached >> + * when the low watermark is reached. >> + * >> + * Returns true if a fatal signal was delivered during throttling. If this >> + * happens, the page allocator should not consider triggering the OOM killer. >> */ >> -static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, >> +static bool throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, >> nodemask_t *nodemask) >> { >> struct zone *zone; >> @@ -2255,13 +2258,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, >> * processes to block on log_wait_commit(). >> */ >> if (current->flags & PF_KTHREAD) >> - return; >> + goto out; >> + >> + /* >> + * If a fatal signal is pending, this process should not throttle. 
>> + * It should return quickly so it can exit and free its memory >> + */ >> + if (fatal_signal_pending(current)) >> + goto out; >> >> /* Check if the pfmemalloc reserves are ok */ >> first_zones_zonelist(zonelist, high_zoneidx, NULL, &zone); >> pgdat = zone->zone_pgdat; >> if (pfmemalloc_watermark_ok(pgdat)) >> - return; >> + goto out; >> >> /* Account for the throttling */ >> count_vm_event(PGSCAN_DIRECT_THROTTLE); >> @@ -2277,12 +2287,20 @@ static void throttle_direct_reclaim(gfp_t gfp_mask, struct zonelist *zonelist, >> if (!(gfp_mask & __GFP_FS)) { >> wait_event_interruptible_timeout(pgdat->pfmemalloc_wait, >> pfmemalloc_watermark_ok(pgdat), HZ); >> - return; >> + >> + goto check_pending; >> } >> >> /* Throttle until kswapd wakes the process */ >> wait_event_killable(zone->zone_pgdat->pfmemalloc_wait, >> pfmemalloc_watermark_ok(pgdat)); >> + >> +check_pending: >> + if (fatal_signal_pending(current)) >> + return true; >> + >> +out: >> + return false; >> } >> >> unsigned long try_to_free_pages(struct zonelist *zonelist, int order, >> @@ -2304,13 +2322,12 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order, >> .gfp_mask = sc.gfp_mask, >> }; >> >> - throttle_direct_reclaim(gfp_mask, zonelist, nodemask); >> - >> /* >> - * Do not enter reclaim if fatal signal is pending. 1 is returned so >> - * that the page allocator does not consider triggering OOM >> + * Do not enter reclaim if fatal signal was delivered while throttled. >> + * 1 is returned so that the page allocator does not OOM kill at this >> + * point. >> */ >> - if (fatal_signal_pending(current)) >> + if (throttle_direct_reclaim(gfp_mask, zonelist, nodemask)) >> return 1; >> >> trace_mm_vmscan_direct_reclaim_begin(order, So which one should I try first, David's change or Mel's? Does Mel's change take into account the fact that the exiting process is already deep into do_exit() (exit_mm() to be precise) when it tries to allocate? 
* Re: zram OOM behavior 2012-11-01 15:58 ` Luigi Semenzato @ 2012-11-01 21:48 ` David Rientjes 0 siblings, 0 replies; 56+ messages in thread From: David Rientjes @ 2012-11-01 21:48 UTC (permalink / raw) To: Luigi Semenzato Cc: Mel Gorman, Minchan Kim, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Thu, 1 Nov 2012, Luigi Semenzato wrote: > So which one should I try first, David's change or Mel's? > > Does Mel's change take into account the fact that the exiting process > is already deep into do_exit() (exit_mm() to be precise) when it tries > to allocate? > Mel's patch is addressing a separate issue since you've already proven that your problem is calling the oom killer which wouldn't occur if your thread had SIGKILL prior to Mel's patch. It would allow my suggested workaround of killing the hung task to end the livelock, though, but that shouldn't be needed after my patch. 
* Re: zram OOM behavior 2012-11-01 4:48 ` David Rientjes 2012-11-01 5:26 ` Minchan Kim 2012-11-01 8:28 ` Mel Gorman @ 2012-11-01 17:50 ` Luigi Semenzato 2012-11-01 21:50 ` David Rientjes 2 siblings, 1 reply; 56+ messages in thread From: Luigi Semenzato @ 2012-11-01 17:50 UTC (permalink / raw) To: David Rientjes Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Wed, Oct 31, 2012 at 9:48 PM, David Rientjes <rientjes@google.com> wrote: > On Thu, 1 Nov 2012, Minchan Kim wrote: > >> It's not true any more. >> 3.6 includes following code in try_to_free_pages >> >> /* >> * Do not enter reclaim if fatal signal is pending. 1 is returned so >> * that the page allocator does not consider triggering OOM >> */ >> if (fatal_signal_pending(current)) >> return 1; >> >> So the hunged task never go to the OOM path and could be looping forever. >> > > Ah, interesting. This is from commit 5515061d22f0 ("mm: throttle direct > reclaimers if PF_MEMALLOC reserves are low and swap is backed by network > storage"). Thanks for adding Mel to the cc. > > The oom killer specifically has logic for this condition: when calling > out_of_memory() the first thing it does is > > if (fatal_signal_pending(current)) > set_thread_flag(TIF_MEMDIE); > > to allow it access to memory reserves so that it may exit if it's having > trouble. But that ends up never happening because of the above code that > Minchan has identified. > > So we either need to do set_thread_flag(TIF_MEMDIE) in try_to_free_pages() > as well or revert that early return entirely; there's no justification > given for it in the comment nor in the commit log. I'd rather remove it > and allow the oom killer to trigger and grant access to memory reserves > itself if necessary. > > Mel, how does commit 5515061d22f0 deal with threads looping forever if > they need memory in the exit path since the oom killer never gets called? 
> > That aside, it doesn't seem like this is the issue that Luigi is reporting > since his patch that avoids deferring the oom killer presumably fixes the > issue for him. So it turns out the oom killer must be getting called. > > Luigi, can you try this instead? It applies to the latest git but should > be easily modified to apply to any 3.x kernel you're running. > --- > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task, > if (!task->mm) > return OOM_SCAN_CONTINUE; > > - if (task->flags & PF_EXITING) { > + if (task->flags & PF_EXITING && !force_kill) { > /* > - * If task is current and is in the process of releasing memory, > - * allow the "kill" to set TIF_MEMDIE, which will allow it to > - * access memory reserves. Otherwise, it may stall forever. > - * > - * The iteration isn't broken here, however, in case other > - * threads are found to have already been oom killed. > + * If this task is not being ptraced on exit, then wait for it > + * to finish before killing some other task unnecessarily. > */ > - if (task == current) > - return OOM_SCAN_SELECT; > - else if (!force_kill) { > - /* > - * If this task is not being ptraced on exit, then wait > - * for it to finish before killing some other task > - * unnecessarily. > - */ > - if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) > - return OOM_SCAN_ABORT; > - } > + if (!(task->group_leader->ptrace & PT_TRACE_EXIT)) > + return OOM_SCAN_ABORT; > } > return OOM_SCAN_OK; > } > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > return; > > /* > - * If current has a pending SIGKILL, then automatically select it. The > - * goal is to allow it to allocate so that it may quickly exit and free > - * its memory. > + * If current has a pending SIGKILL or is exiting, then automatically > + * select it. 
The goal is to allow it to allocate so that it may > + * quickly exit and free its memory. > */ > - if (fatal_signal_pending(current)) { > + if (fatal_signal_pending(current) || current->flags & PF_EXITING) { > set_thread_flag(TIF_MEMDIE); > return; > } I tested this change with my load and it appears to also prevent the deadlocks. I have a question though. I thought only one process was allowed to be in TIF_MEMDIE state, but I don't see anything that prevents this code (before or after the change) from setting the flag in multiple processes. Is this a problem? Thanks! 
* Re: zram OOM behavior 2012-11-01 17:50 ` Luigi Semenzato @ 2012-11-01 21:50 ` David Rientjes 2012-11-01 21:58 ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes 2012-11-01 22:04 ` zram OOM behavior Luigi Semenzato 0 siblings, 2 replies; 56+ messages in thread From: David Rientjes @ 2012-11-01 21:50 UTC (permalink / raw) To: Luigi Semenzato Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao On Thu, 1 Nov 2012, Luigi Semenzato wrote: > > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, > > return; > > > > /* > > - * If current has a pending SIGKILL, then automatically select it. The > > - * goal is to allow it to allocate so that it may quickly exit and free > > - * its memory. > > + * If current has a pending SIGKILL or is exiting, then automatically > > + * select it. The goal is to allow it to allocate so that it may > > + * quickly exit and free its memory. > > */ > > - if (fatal_signal_pending(current)) { > > + if (fatal_signal_pending(current) || current->flags & PF_EXITING) { > > set_thread_flag(TIF_MEMDIE); > > return; > > } > > I tested this change with my load and it appears to also prevent the deadlocks. > > I have a question though. I thought only one process was allowed to > be in TIF_MEMDIE state, but I don't see anything that prevents this > code (before or after the change) from setting the flag in multiple > processes. Is this a problem? > The code you've quoted above, prior to being changed by the patch, allows any thread with a fatal signal to have access to memory reserves, so it's certainly not only one thread with TIF_MEMDIE set at a time (the oom killer is not the only thing that can kill a thread). The goal of that code is to ensure anything that has been killed can allocate successfully wherever it happens to be running so that it can handle the signal, exit, and free its memory. 
My patch extends that to all threads in the exit path that happen to require memory in order to exit, preventing a livelock. 
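As background on what TIF_MEMDIE buys a task, here is a hypothetical sketch of the allocator side (loosely modeled on the 3.x-era gfp_to_alloc_flags(); the names and flag values are illustrative, not the real kernel definitions): a thread carrying the flag is allowed to allocate below the min watermark, i.e. from the reserved pages, so it can make progress and exit. Under this reading the flag is a per-thread permission bit rather than a global token, which is consistent with several threads holding it at once.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values -- not the real kernel constants. */
#define ALLOC_WMARK_MIN      0x01
#define ALLOC_NO_WATERMARKS  0x04

/* Hypothetical model of how the page allocator consumes TIF_MEMDIE:
 * a dying thread (outside interrupt context) may ignore the zone
 * watermarks and dip into the reserved pages to finish exiting. */
static int gfp_to_alloc_flags(bool tif_memdie, bool in_interrupt)
{
    int alloc_flags = ALLOC_WMARK_MIN;

    if (tif_memdie && !in_interrupt)
        alloc_flags |= ALLOC_NO_WATERMARKS;

    return alloc_flags;
}
```

The practical cost of multiple TIF_MEMDIE holders, then, is pressure on a finite reserve pool rather than a correctness violation.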
* [patch] mm, oom: allow exiting threads to have access to memory reserves 2012-11-01 21:50 ` David Rientjes @ 2012-11-01 21:58 ` David Rientjes 2012-11-01 22:43 ` Andrew Morton 2012-11-01 22:04 ` zram OOM behavior Luigi Semenzato 1 sibling, 1 reply; 56+ messages in thread From: David Rientjes @ 2012-11-01 21:58 UTC (permalink / raw) To: Andrew Morton Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer, KOSAKI Motohiro, Sonny Rao Exiting threads, those with PF_EXITING set, can pagefault and require memory before they can make forward progress. This happens, for instance, when a process must fault task->robust_list, a userspace structure, before detaching its memory. These threads also aren't guaranteed to get access to memory reserves unless oom killed or killed from userspace. The oom killer won't grant memory reserves if other threads are also exiting other than current and stalling at the same point. This prevents needlessly killing processes when others are already exiting. Instead of special casing all the possible situations between PF_EXITING getting set and a thread detaching its mm where it may allocate memory, which probably wouldn't get updated when a change is made to the exit path, the solution is to give all exiting threads access to memory reserves if they call the oom killer. This allows them to quickly allocate, detach their mm, and free the memory it represents. Acked-by: Minchan Kim <minchan@kernel.org> Tested-by: Luigi Semenzato <semenzato@google.com> Signed-off-by: David Rientjes <rientjes@google.com> --- This is old code and has only recently been reported as causing an issue, so deferring to 3.8 seems appropriate. 
 mm/oom_kill.c | 31 +++++++++----------------------
 1 file changed, 9 insertions(+), 22 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e0f3e..7e9e911 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -310,26 +310,13 @@ enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
 	if (!task->mm)
 		return OOM_SCAN_CONTINUE;

-	if (task->flags & PF_EXITING) {
+	if (task->flags & PF_EXITING && !force_kill) {
 		/*
-		 * If task is current and is in the process of releasing memory,
-		 * allow the "kill" to set TIF_MEMDIE, which will allow it to
-		 * access memory reserves.  Otherwise, it may stall forever.
-		 *
-		 * The iteration isn't broken here, however, in case other
-		 * threads are found to have already been oom killed.
+		 * If this task is not being ptraced on exit, then wait for it
+		 * to finish before killing some other task unnecessarily.
 		 */
-		if (task == current)
-			return OOM_SCAN_SELECT;
-		else if (!force_kill) {
-			/*
-			 * If this task is not being ptraced on exit, then wait
-			 * for it to finish before killing some other task
-			 * unnecessarily.
-			 */
-			if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
-				return OOM_SCAN_ABORT;
-		}
+		if (!(task->group_leader->ptrace & PT_TRACE_EXIT))
+			return OOM_SCAN_ABORT;
 	}
 	return OOM_SCAN_OK;
 }
@@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
 		return;

 	/*
-	 * If current has a pending SIGKILL, then automatically select it.  The
-	 * goal is to allow it to allocate so that it may quickly exit and free
-	 * its memory.
+	 * If current has a pending SIGKILL or is exiting, then automatically
+	 * select it.  The goal is to allow it to allocate so that it may
+	 * quickly exit and free its memory.
 	 */
-	if (fatal_signal_pending(current)) {
+	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
 		set_thread_flag(TIF_MEMDIE);
 		return;
 	}
* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 21:58             ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
@ 2012-11-01 22:43               ` Andrew Morton
  2012-11-01 23:05                 ` David Rientjes
  2012-11-01 23:06                 ` Luigi Semenzato
  0 siblings, 2 replies; 56+ messages in thread
From: Andrew Morton @ 2012-11-01 22:43 UTC (permalink / raw)
To: David Rientjes
Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
    KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012 14:58:18 -0700 (PDT)
David Rientjes <rientjes@google.com> wrote:

> Exiting threads, those with PF_EXITING set, can pagefault and require
> memory before they can make forward progress.  This happens, for instance,
> when a process must fault task->robust_list, a userspace structure, before
> detaching its memory.
>
> These threads also aren't guaranteed to get access to memory reserves
> unless oom killed or killed from userspace.  The oom killer won't grant
> memory reserves if other threads are also exiting other than current and
> stalling at the same point.  This prevents needlessly killing processes
> when others are already exiting.
>
> Instead of special casing all the possible situations between PF_EXITING
> getting set and a thread detaching its mm where it may allocate memory,
> which probably wouldn't get updated when a change is made to the exit
> path, the solution is to give all exiting threads access to memory
> reserves if they call the oom killer.  This allows them to quickly
> allocate, detach its mm, and free the memory it represents.

Seems very sensible.

> Acked-by: Minchan Kim <minchan@kernel.org>
> Tested-by: Luigi Semenzato <semenzato@google.com>

What did Luigi actually test?  Was there some reproducible bad behavior
which this patch fixes?
* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 22:43               ` Andrew Morton
@ 2012-11-01 23:05                 ` David Rientjes
  1 sibling, 0 replies; 56+ messages in thread
From: David Rientjes @ 2012-11-01 23:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Luigi Semenzato, Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
    KOSAKI Motohiro, Sonny Rao

On Thu, 1 Nov 2012, Andrew Morton wrote:

> > Exiting threads, those with PF_EXITING set, can pagefault and require
> > memory before they can make forward progress.  This happens, for
> > instance, when a process must fault task->robust_list, a userspace
> > structure, before detaching its memory.
> >
> > These threads also aren't guaranteed to get access to memory reserves
> > unless oom killed or killed from userspace.  The oom killer won't grant
> > memory reserves if other threads are also exiting other than current
> > and stalling at the same point.  This prevents needlessly killing
> > processes when others are already exiting.
> >
> > Instead of special casing all the possible situations between
> > PF_EXITING getting set and a thread detaching its mm where it may
> > allocate memory, which probably wouldn't get updated when a change is
> > made to the exit path, the solution is to give all exiting threads
> > access to memory reserves if they call the oom killer.  This allows
> > them to quickly allocate, detach its mm, and free the memory it
> > represents.
>
> Seems very sensible.
>
> > Acked-by: Minchan Kim <minchan@kernel.org>
> > Tested-by: Luigi Semenzato <semenzato@google.com>
>
> What did Luigi actually test?  Was there some reproducible bad behavior
> which this patch fixes?
>

Yeah, it's briefly described in the first paragraph.  He had an oom
condition where threads were faulting on task->robust_list and repeatedly
called the oom killer but it would defer killing a thread because it saw
other PF_EXITING threads.
This can happen anytime we need to allocate memory after setting PF_EXITING
and before detaching our mm; if there are other threads in the same state,
then the oom killer won't do anything unless one of them happens to be
killed from userspace.

So instead of only deferring for PF_EXITING and !task->robust_list, it's
better to just give them access to memory reserves to prevent a potential
livelock, so that any other faults that may be introduced in the future in
the exit path don't cause the same problem (and hopefully we don't allow
too many of those!).

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [patch] mm, oom: allow exiting threads to have access to memory reserves
  2012-11-01 22:43               ` Andrew Morton
  2012-11-01 23:05                 ` David Rientjes
@ 2012-11-01 23:06                 ` Luigi Semenzato
  1 sibling, 0 replies; 56+ messages in thread
From: Luigi Semenzato @ 2012-11-01 23:06 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer,
    KOSAKI Motohiro, Sonny Rao

On Thu, Nov 1, 2012 at 3:43 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> On Thu, 1 Nov 2012 14:58:18 -0700 (PDT)
> David Rientjes <rientjes@google.com> wrote:
>
>> Exiting threads, those with PF_EXITING set, can pagefault and require
>> memory before they can make forward progress.  This happens, for
>> instance, when a process must fault task->robust_list, a userspace
>> structure, before detaching its memory.
>>
>> These threads also aren't guaranteed to get access to memory reserves
>> unless oom killed or killed from userspace.  The oom killer won't grant
>> memory reserves if other threads are also exiting other than current and
>> stalling at the same point.  This prevents needlessly killing processes
>> when others are already exiting.
>>
>> Instead of special casing all the possible situations between PF_EXITING
>> getting set and a thread detaching its mm where it may allocate memory,
>> which probably wouldn't get updated when a change is made to the exit
>> path, the solution is to give all exiting threads access to memory
>> reserves if they call the oom killer.  This allows them to quickly
>> allocate, detach its mm, and free the memory it represents.
>
> Seems very sensible.
>
>> Acked-by: Minchan Kim <minchan@kernel.org>
>> Tested-by: Luigi Semenzato <semenzato@google.com>
>
> What did Luigi actually test?  Was there some reproducible bad behavior
> which this patch fixes?

Yes.  I have a load that reliably reproduces the problem (in 3.4), and it
goes away with this change.
^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: zram OOM behavior
  2012-11-01 21:50           ` David Rientjes
  2012-11-01 21:58             ` [patch] mm, oom: allow exiting threads to have access to memory reserves David Rientjes
@ 2012-11-01 22:04             ` Luigi Semenzato
  2012-11-01 22:25               ` David Rientjes
  1 sibling, 1 reply; 56+ messages in thread
From: Luigi Semenzato @ 2012-11-01 22:04 UTC (permalink / raw)
To: David Rientjes
Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
    Sonny Rao

On Thu, Nov 1, 2012 at 2:50 PM, David Rientjes <rientjes@google.com> wrote:
> On Thu, 1 Nov 2012, Luigi Semenzato wrote:
>
>> > @@ -706,11 +693,11 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
>> >  		return;
>> >
>> >  	/*
>> > -	 * If current has a pending SIGKILL, then automatically select it.  The
>> > -	 * goal is to allow it to allocate so that it may quickly exit and free
>> > -	 * its memory.
>> > +	 * If current has a pending SIGKILL or is exiting, then automatically
>> > +	 * select it.  The goal is to allow it to allocate so that it may
>> > +	 * quickly exit and free its memory.
>> >  	 */
>> > -	if (fatal_signal_pending(current)) {
>> > +	if (fatal_signal_pending(current) || current->flags & PF_EXITING) {
>> >  		set_thread_flag(TIF_MEMDIE);
>> >  		return;
>> >  	}
>>
>> I tested this change with my load and it appears to also prevent the
>> deadlocks.
>>
>> I have a question though.  I thought only one process was allowed to be
>> in TIF_MEMDIE state, but I don't see anything that prevents this code
>> (before or after the change) from setting the flag in multiple
>> processes.  Is this a problem?
>>
>
> The code you've quoted above, prior to being changed by the patch, allows
> any thread with a fatal signal to have access to memory reserves, so it's
> certainly not only one thread with TIF_MEMDIE set at a time (the oom
> killer is not the only thing that can kill a thread).
> The goal of that code is to ensure anything that has been killed can
> allocate successfully wherever it happens to be running so that it can
> handle the signal, exit, and free its memory.  My patch is extending that
> for all threads that are in the exit path that happen to require memory
> to exit, to prevent a livelock.

I see.  But then I am wondering: if there is no limit to the number of
threads that can access the reserved memory, then is it possible that
that memory will be exhausted?  Is the size of the reserved memory based
on heuristics then?

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: zram OOM behavior
  2012-11-01 22:04             ` zram OOM behavior Luigi Semenzato
@ 2012-11-01 22:25               ` David Rientjes
  0 siblings, 0 replies; 56+ messages in thread
From: David Rientjes @ 2012-11-01 22:25 UTC (permalink / raw)
To: Luigi Semenzato
Cc: Minchan Kim, Mel Gorman, linux-mm, Dan Magenheimer, KOSAKI Motohiro,
    Sonny Rao

On Thu, 1 Nov 2012, Luigi Semenzato wrote:

> I see.  But then I am wondering: if there is no limit to the number of
> threads that can access the reserved memory, then is it possible that
> that memory will be exhausted?  Is the size of the reserved memory based
> on heuristics then?
>

We assume that processes with access to memory reserves will eventually
exit and free their memory; that has always been the case.

^ permalink raw reply	[flat|nested] 56+ messages in thread
end of thread, other threads: [~2012-11-01 23:06 UTC | newest]