> On Aug 2, 2019, at 7:41 AM, Michal Hocko wrote:
>
> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>>
>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko wrote:
>>>
>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>>>> Hey folks,
>>>> I’ve come across an issue that affects most of 4.19, 4.20 and 5.2 linux-stable kernels that has only been fixed in 5.3-rc1.
>>>> It was introduced by
>>>>
>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path
>>>
>>> This commit shouldn't really change the OOM behavior for your particular
>>> test case. It would have changed MAP_POPULATE behavior but your usage is
>>> triggering the standard page fault path. The only difference with
>>> 29ef680 is that the OOM killer is invoked during the charge path rather
>>> than on the way out of the page fault.
>>>
>>> Anyway, I tried to run your test case in a loop and leaker always ends
>>> up being killed as expected with 5.2. See the below oom report. There
>>> must be something else going on. How much swap do you have on your
>>> system?
>>
>> I do not have swap defined.
>
> OK, I have retested with swap disabled and again everything seems to be
> working as expected. The oom happens earlier because I do not have to
> wait for the swap to get full.
>

In my tests (with the script provided), it loops only 11 iterations before hanging and emitting the soft lockup message.

> Which fs do you use to write the file that you mmap?

/dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

Part of the soft lockup trace shows that it is going through __xfs_filemap_fault():

[ 561.452933] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [leaker:3261]
[ 561.459904] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt gpio_ich iTCO_vendor_support dcdbas ipmi_ssif intel_powerclamp coretemp kvm_intel ses ipmi_si kvm enclosure scsi_transport_sas ipmi_devintf irqbypass pcspkr lpc_ich sg joydev ipmi_msghandler wmi acpi_power_meter acpi_cpufreq xfs libcrc32c ata_generic sd_mod pata_acpi ata_piix libata megaraid_sas crc32c_intel serio_raw bnx2 bonding
[ 561.495979] CPU: 4 PID: 3261 Comm: leaker Tainted: G I L 5.3.0-rc2+ #10
[ 561.503704] Hardware name: Dell Inc. PowerEdge R710/0YDJK3, BIOS 6.4.0 07/23/2013
[ 561.511168] RIP: 0010:lruvec_lru_size+0x49/0xf0
[ 561.515687] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 48 63 d0 4c 89 e9 48 03 8b 88 00 00 00 48 8b 14 d5 60 a9 92 94 4c 03 34 11 <48> c7 c6 80 7c bf 94 89 c7 e8 89 d3 59 00 3b 05 27 eb ff 00 72 d1
[ 561.534418] RSP: 0018:ffffb5f886a3f640 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[ 561.541968] RAX: 0000000000000002 RBX: ffff96fca3bba400 RCX: 00003ef5d82059f0
[ 561.549085] RDX: ffff9702a7a40000 RSI: 0000000000000010 RDI: ffffffff94bf7c80
[ 561.556202] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff94ae1c00
[ 561.563318] R10: ffff96fcc7802520 R11: 0000000000000000 R12: 0000000000000004
[ 561.570435] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
[ 561.577553] FS: 00007f5522602740(0000) GS:ffff9702a7a80000(0000) knlGS:0000000000000000
[ 561.585623] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 561.591352] CR2: 00007fba755f95b0 CR3: 0000000c646dc000 CR4: 00000000000006e0
[ 561.598468] Call Trace:
[ 561.600907]  shrink_node_memcg+0xc8/0x790
[ 561.604905]  ? shrink_slab+0x245/0x280
[ 561.608644]  ? mem_cgroup_iter+0x10a/0x2c0
[ 561.612728]  shrink_node+0xcd/0x490
[ 561.616208]  do_try_to_free_pages+0xda/0x3a0
[ 561.620466]  ? mem_cgroup_select_victim_node+0x43/0x2f0
[ 561.625678]  try_to_free_mem_cgroup_pages+0xe7/0x1c0
[ 561.630629]  try_charge+0x246/0x7a0
[ 561.634107]  mem_cgroup_try_charge+0x6b/0x1e0
[ 561.638453]  ? mem_cgroup_commit_charge+0x5a/0x110
[ 561.643231]  __add_to_page_cache_locked+0x195/0x330
[ 561.648100]  ? scan_shadow_nodes+0x30/0x30
[ 561.652184]  add_to_page_cache_lru+0x39/0xa0
[ 561.656442]  iomap_readpages_actor+0xf2/0x230
[ 561.660787]  iomap_apply+0xa3/0x130
[ 561.664266]  iomap_readpages+0x97/0x180
[ 561.668091]  ? iomap_migrate_page+0xe0/0xe0
[ 561.672266]  read_pages+0x57/0x180
[ 561.675657]  __do_page_cache_readahead+0x1ac/0x1c0
[ 561.680436]  ondemand_readahead+0x168/0x2a0
[ 561.684606]  filemap_fault+0x30d/0x830
[ 561.688343]  ? flush_tlb_func_common.isra.8+0x147/0x230
[ 561.693554]  ? __mod_lruvec_state+0x40/0xe0
[ 561.697726]  ? alloc_set_pte+0x4e6/0x5b0
[ 561.701669]  __xfs_filemap_fault+0x61/0x190 [xfs]
[ 561.706361]  __do_fault+0x38/0xb0
[ 561.709666]  __handle_mm_fault+0xbee/0xe90
[ 561.713750]  handle_mm_fault+0xe2/0x200
[ 561.717574]  __do_page_fault+0x224/0x490
[ 561.721485]  do_page_fault+0x31/0x120
[ 561.725137]  page_fault+0x3e/0x50
[ 561.728439] RIP: 0033:0x400c5a
[ 561.731483] Code: 45 c0 48 89 c6 bf 77 0e 40 00 b8 00 00 00 00 e8 3c fb ff ff c7 45 dc 00 00 00 00 eb 36 8b 45 dc 48 63 d0 48 8b 45 c0 48 01 d0 <0f> b6 00 0f be c0 01 45 e8 8b 45 dc 25 ff 0f 00 00 85 c0 75 10 8b
[ 561.750214] RSP: 002b:00007fffba1d9450 EFLAGS: 00010206
[ 561.755426] RAX: 00007f550346b000 RBX: 0000000000000000 RCX: 000000000000001a
[ 561.762542] RDX: 0000000001c4c000 RSI: 000000007fffffe5 RDI: 0000000000000000
[ 561.769659] RBP: 00007fffba1da4a0 R08: 0000000000000000 R09: 00007f552206c20d
[ 561.776775] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
[ 561.783892] R13: 00007fffba1da580 R14: 0000000000000000 R15: 0000000000000000

If I switch the backing file to an ext4 filesystem (separate hard drive), it OOMs.

If I switch the file used to /dev/zero, it OOMs:
…
Todal sum was 0. Loop count is 11
Buffer is @ 0x7f2b66c00000
./test-script-devzero.sh: line 16: 3561 Killed ./leaker -p 10240 -c 100000

> Or could you try to
> simplify your test even further? E.g. does everything work as expected
> when doing anonymous mmap rather than file backed one?

It also OOMs with MAP_ANON.

Hope that helps.
Masoud

> --
> Michal Hocko
> SUSE Labs