From: Masoud Sharbiani <msharbiani@apple.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: gregkh@linuxfoundation.org, hannes@cmpxchg.org,
	vdavydov.dev@gmail.com, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
Date: Fri, 02 Aug 2019 11:00:55 -0700
Message-ID: <5DE6F4AE-F3F9-4C52-9DFC-E066D9DD5EDC@apple.com>
In-Reply-To: <20190802144110.GL6461@dhcp22.suse.cz>

> On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>> 
>> 
>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>>>> Hey folks,
>>>> I’ve come across an issue that affects most 4.19, 4.20, and 5.2 linux-stable kernels and was only fixed in 5.3-rc1.
>>>> It was introduced by
>>>> 
>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>>> 
>>> This commit shouldn't really change the OOM behavior for your particular
>>> test case. It would have changed MAP_POPULATE behavior but your usage is
>>> triggering the standard page fault path. The only difference with
>>> 29ef680 is that the OOM killer is invoked during the charge path rather
>>> than on the way out of the page fault.
>>> 
>>> Anyway, I tried to run your test case in a loop and leaker always ends
>>> up being killed as expected with 5.2. See the below oom report. There
>>> must be something else going on. How much swap do you have on your
>>> system?
>> 
>> I do not have swap defined. 
> 
> OK, I have retested with swap disabled and again everything seems to be
> working as expected. The oom happens earlier because I do not have to
> wait for the swap to get full.
> 

In my tests (with the script provided), it loops only 11 iterations before hanging and printing the soft lockup message.
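
For reference, the core of the leaker is roughly the following (a simplified
sketch, not the actual program; the file name and sizes are placeholders, and
the real binary takes the -p/-c options visible in the script output below):

/*
 * Minimal sketch: mmap a file-backed region inside the memory-limited
 * cgroup and read every byte, so each newly touched page goes through
 * the page fault -> page cache -> mem_cgroup_try_charge path seen in
 * the trace below.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 10240UL * 4096;		/* placeholder size */
	int fd = open("leak.dat", O_RDONLY);	/* placeholder file name */
	unsigned char *buf;
	long sum = 0;
	size_t i;

	if (fd < 0) { perror("open"); return 1; }
	buf = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (buf == MAP_FAILED) { perror("mmap"); return 1; }
	printf("Buffer is @ %p\n", (void *)buf);

	for (i = 0; i < len; i++)		/* fault in every page */
		sum += buf[i];
	printf("Total sum was %ld\n", sum);

	munmap(buf, len);
	close(fd);
	return 0;
}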


> Which fs do you use to write the file that you mmap?

/dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)

The soft lockup backtrace shows the fault path going through __xfs_filemap_fault():

[  561.452933] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [leaker:3261]
[  561.459904] Modules linked in: dm_mirror dm_region_hash dm_log dm_mod iTCO_wdt gpio_ich iTCO_vendor_support dcdbas ipmi_ssif intel_powerclamp coretemp kvm_intel ses ipmi_si kvm enclosure scsi_transport_sas ipmi_devintf irqbypass pcspkr lpc_ich sg joydev ipmi_msghandler wmi acpi_power_meter acpi_cpufreq xfs libcrc32c ata_generic sd_mod pata_acpi ata_piix libata megaraid_sas crc32c_intel serio_raw bnx2 bonding
[  561.495979] CPU: 4 PID: 3261 Comm: leaker Tainted: G          I  L    5.3.0-rc2+ #10
[  561.503704] Hardware name: Dell Inc. PowerEdge R710/0YDJK3, BIOS 6.4.0 07/23/2013
[  561.511168] RIP: 0010:lruvec_lru_size+0x49/0xf0
[  561.515687] Code: 41 89 ed b8 ff ff ff ff 45 31 f6 49 c1 e5 03 eb 19 48 63 d0 4c 89 e9 48 03 8b 88 00 00 00 48 8b 14 d5 60 a9 92 94 4c 03 34 11 <48> c7 c6 80 7c bf 94 89 c7 e8 89 d3 59 00 3b 05 27 eb ff 00 72 d1
[  561.534418] RSP: 0018:ffffb5f886a3f640 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  561.541968] RAX: 0000000000000002 RBX: ffff96fca3bba400 RCX: 00003ef5d82059f0
[  561.549085] RDX: ffff9702a7a40000 RSI: 0000000000000010 RDI: ffffffff94bf7c80
[  561.556202] RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffff94ae1c00
[  561.563318] R10: ffff96fcc7802520 R11: 0000000000000000 R12: 0000000000000004
[  561.570435] R13: 0000000000000008 R14: 0000000000000000 R15: 0000000000000000
[  561.577553] FS:  00007f5522602740(0000) GS:ffff9702a7a80000(0000) knlGS:0000000000000000
[  561.585623] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  561.591352] CR2: 00007fba755f95b0 CR3: 0000000c646dc000 CR4: 00000000000006e0
[  561.598468] Call Trace:
[  561.600907]  shrink_node_memcg+0xc8/0x790
[  561.604905]  ? shrink_slab+0x245/0x280
[  561.608644]  ? mem_cgroup_iter+0x10a/0x2c0
[  561.612728]  shrink_node+0xcd/0x490
[  561.616208]  do_try_to_free_pages+0xda/0x3a0
[  561.620466]  ? mem_cgroup_select_victim_node+0x43/0x2f0
[  561.625678]  try_to_free_mem_cgroup_pages+0xe7/0x1c0
[  561.630629]  try_charge+0x246/0x7a0
[  561.634107]  mem_cgroup_try_charge+0x6b/0x1e0
[  561.638453]  ? mem_cgroup_commit_charge+0x5a/0x110
[  561.643231]  __add_to_page_cache_locked+0x195/0x330
[  561.648100]  ? scan_shadow_nodes+0x30/0x30
[  561.652184]  add_to_page_cache_lru+0x39/0xa0
[  561.656442]  iomap_readpages_actor+0xf2/0x230
[  561.660787]  iomap_apply+0xa3/0x130
[  561.664266]  iomap_readpages+0x97/0x180
[  561.668091]  ? iomap_migrate_page+0xe0/0xe0
[  561.672266]  read_pages+0x57/0x180
[  561.675657]  __do_page_cache_readahead+0x1ac/0x1c0
[  561.680436]  ondemand_readahead+0x168/0x2a0
[  561.684606]  filemap_fault+0x30d/0x830
[  561.688343]  ? flush_tlb_func_common.isra.8+0x147/0x230
[  561.693554]  ? __mod_lruvec_state+0x40/0xe0
[  561.697726]  ? alloc_set_pte+0x4e6/0x5b0
[  561.701669]  __xfs_filemap_fault+0x61/0x190 [xfs]
[  561.706361]  __do_fault+0x38/0xb0
[  561.709666]  __handle_mm_fault+0xbee/0xe90
[  561.713750]  handle_mm_fault+0xe2/0x200
[  561.717574]  __do_page_fault+0x224/0x490
[  561.721485]  do_page_fault+0x31/0x120
[  561.725137]  page_fault+0x3e/0x50
[  561.728439] RIP: 0033:0x400c5a
[  561.731483] Code: 45 c0 48 89 c6 bf 77 0e 40 00 b8 00 00 00 00 e8 3c fb ff ff c7 45 dc 00 00 00 00 eb 36 8b 45 dc 48 63 d0 48 8b 45 c0 48 01 d0 <0f> b6 00 0f be c0 01 45 e8 8b 45 dc 25 ff 0f 00 00 85 c0 75 10 8b
[  561.750214] RSP: 002b:00007fffba1d9450 EFLAGS: 00010206
[  561.755426] RAX: 00007f550346b000 RBX: 0000000000000000 RCX: 000000000000001a
[  561.762542] RDX: 0000000001c4c000 RSI: 000000007fffffe5 RDI: 0000000000000000
[  561.769659] RBP: 00007fffba1da4a0 R08: 0000000000000000 R09: 00007f552206c20d
[  561.776775] R10: 0000000000000002 R11: 0000000000000246 R12: 0000000000400850
[  561.783892] R13: 00007fffba1da580 R14: 0000000000000000 R15: 0000000000000000


If I switch the backing file to an ext4 filesystem (on a separate hard drive), it OOMs.


If I instead map /dev/zero, it OOMs:
…
Todal sum was 0. Loop count is 11
Buffer is @ 0x7f2b66c00000
./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000


> Or could you try to
> simplify your test even further? E.g. does everything work as expected
> when doing anonymous mmap rather than file backed one?

It also OOMs with MAP_ANON. 
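(That is, replacing the open()+mmap() in the sketch above with an anonymous
mapping, and writing to each page so it actually gets charged — again a
sketch with placeholder sizes:)

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	for (i = 0; i < len; i += 4096)
		buf[i] = 1;	/* dirty each page to force a memcg charge */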

Hope that helps.
Masoud


> -- 
> Michal Hocko
> SUSE Labs



Thread overview: 31+ messages
2019-08-01 18:04 Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Masoud Sharbiani
2019-08-01 18:19 ` Greg KH
2019-08-01 22:26   ` Masoud Sharbiani
2019-08-02  1:08   ` Masoud Sharbiani
2019-08-02  8:08     ` Hillf Danton
2019-08-02  8:18     ` Michal Hocko
2019-08-02  7:40 ` Michal Hocko
2019-08-02 14:18   ` Masoud Sharbiani
2019-08-02 14:41     ` Michal Hocko
2019-08-02 18:00       ` Masoud Sharbiani [this message]
2019-08-02 19:14         ` Michal Hocko
2019-08-02 23:28           ` Masoud Sharbiani
2019-08-03  2:36             ` Tetsuo Handa
2019-08-03 15:51               ` Tetsuo Handa
2019-08-03 17:41                 ` Masoud Sharbiani
2019-08-03 18:24                   ` Masoud Sharbiani
2019-08-05  8:42                 ` Michal Hocko
2019-08-05 11:36                   ` Tetsuo Handa
2019-08-05 11:44                     ` Michal Hocko
2019-08-05 14:00                       ` Tetsuo Handa
2019-08-05 14:26                         ` Michal Hocko
2019-08-06 10:26                           ` Tetsuo Handa
2019-08-06 10:50                             ` Michal Hocko
2019-08-06 12:48                               ` [PATCH v3] memcg, oom: don't require __GFP_FS when invoking memcg OOM killer Tetsuo Handa
2019-08-05  8:18             ` Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1 Michal Hocko
2019-08-02 12:10 Hillf Danton
2019-08-02 13:40 ` Michal Hocko
2019-08-03  5:45 Hillf Danton
