Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.

From: Masoud Sharbiani <msharbiani@apple.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>,
	hannes@cmpxchg.org, vdavydov.dev@gmail.com, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: Possible mem cgroup bug in kernels between 4.18.0 and 5.3-rc1.
Date: Fri, 02 Aug 2019 16:28:25 -0700
Message-ID: <A06C5313-B021-4ADA-9897-CE260A9011CC@apple.com>
In-Reply-To: <20190802191430.GO6461@dhcp22.suse.cz>

> On Aug 2, 2019, at 12:14 PM, Michal Hocko <mhocko@kernel.org> wrote:
> 
> On Fri 02-08-19 11:00:55, Masoud Sharbiani wrote:
>> 
>> 
>>> On Aug 2, 2019, at 7:41 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>> 
>>> On Fri 02-08-19 07:18:17, Masoud Sharbiani wrote:
>>>> 
>>>> 
>>>>> On Aug 2, 2019, at 12:40 AM, Michal Hocko <mhocko@kernel.org> wrote:
>>>>> 
>>>>> On Thu 01-08-19 11:04:14, Masoud Sharbiani wrote:
>>>>>> Hey folks,
>>>>>> I’ve come across an issue that affects most 4.19, 4.20 and 5.2 linux-stable kernels and has only been fixed in 5.3-rc1.
>>>>>> It was introduced by
>>>>>> 
>>>>>> 29ef680 memcg, oom: move out_of_memory back to the charge path 
>>>>> 
>>>>> This commit shouldn't really change the OOM behavior for your particular
>>>>> test case. It would have changed MAP_POPULATE behavior but your usage is
>>>>> triggering the standard page fault path. The only difference with
>>>>> 29ef680 is that the OOM killer is invoked during the charge path rather
>>>>> than on the way out of the page fault.
>>>>> 
>>>>> Anyway, I tried to run your test case in a loop and leaker always ends
>>>>> up being killed as expected with 5.2. See the below oom report. There
>>>>> must be something else going on. How much swap do you have on your
>>>>> system?
>>>> 
>>>> I do not have swap defined. 
>>> 
>>> OK, I have retested with swap disabled and again everything seems to be
>>> working as expected. The oom happens earlier because I do not have to
>>> wait for the swap to get full.
>>> 
>> 
>> In my tests (with the script provided), it loops only 11 iterations before hanging and printing the soft lockup message.
>> 
>> 
>>> Which fs do you use to write the file that you mmap?
>> 
>> /dev/sda3 on / type xfs (rw,relatime,seclabel,attr2,inode64,logbufs=8,logbsize=32k,noquota)
>> 
>> Part of the soft lockup path actually specifies that it is going through __xfs_filemap_fault():
> 
> Right, I have just missed that.
> 
> [...]
> 
>> If I switch the backing file to an ext4 filesystem (separate hard drive), it OOMs.
>> 
>> 
>> If I switch the file used to /dev/zero, it OOMs: 
>> …
>> Todal sum was 0. Loop count is 11
>> Buffer is @ 0x7f2b66c00000
>> ./test-script-devzero.sh: line 16:  3561 Killed                  ./leaker -p 10240 -c 100000
>> 
>> 
>>> Or could you try to
>>> simplify your test even further? E.g. does everything work as expected
>>> when doing anonymous mmap rather than file backed one?
>> 
>> It also OOMs with MAP_ANON. 
>> 
>> Hope that helps.
> 
> It helps to focus more on the xfs reclaim path. Just to be sure, is
> there any difference if you use cgroup v2? I do not expect there to be,
> but I want to rule out any v1 artifacts.

I was unable to test with cgroup v2. I created the new control group, but the attempt to move a running process into it fails with ‘Device or resource busy’ (EBUSY).
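
For reference, a minimal sketch of the step that fails, assuming the usual cgroup v2 interface of writing a PID to the group's cgroup.procs file. The helper name move_pid_to_cgroup and any cgroup path passed to it are hypothetical and not part of the original test script. It only reports the symbolic errno so the failure can be quoted precisely:

```python
import errno
import os


def move_pid_to_cgroup(cgroup_dir, pid):
    """Write pid to <cgroup_dir>/cgroup.procs.

    Returns None on success, or the symbolic errno name (e.g. "EBUSY")
    if the open or the write fails. Hypothetical helper for illustration.
    """
    procs_path = os.path.join(cgroup_dir, "cgroup.procs")
    try:
        # One PID per write; the error (if any) surfaces on write or on close.
        with open(procs_path, "w") as f:
            f.write(f"{pid}\n")
    except OSError as e:
        return errno.errorcode.get(e.errno, str(e.errno))
    return None
```

Calling it with the directory of the newly created group and the PID of the running process should return "EBUSY" for the failure described above, and None once the move succeeds.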

Masoud

> -- 
> Michal Hocko
> SUSE Labs
