Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

From: David Hildenbrand <david@redhat.com>
To: Mike Kravetz <mike.kravetz@oracle.com>,
	Mina Almasry <almasrymina@google.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Michal Privoznik <mprivozn@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Michal Hocko <mhocko@kernel.org>,
	Muchun Song <songmuchun@bytedance.com>,
	"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
	Tejun Heo <tj@kernel.org>, KVM <kvm@vger.kernel.org>
Subject: Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
Date: Thu, 15 Oct 2020 10:57:25 +0200
Message-ID: <075968b6-9e2e-b625-8dc1-a7e5ed0bfd71@redhat.com>
In-Reply-To: <32ea3107-b1bc-f39e-3cf8-f6ef427235ef@redhat.com>

On 15.10.20 09:56, David Hildenbrand wrote:
> On 14.10.20 20:31, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>> On 14.10.20 19:56, Mina Almasry wrote:
>>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>>>> Hi everybody,
>>>>>>
>>>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>>>> with hugetlbfs and reported that this results in [1]
>>>>>>
>>>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>>>
>>>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>>>
>>>>>>
>>>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>>>> to discard pages that are reported as free by a VM. The reporting
>>>>>> granularity is in pageblock granularity. So when the guest reports
>>>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>>>
>>>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>>>> (and on v5.7.X from F32).
>>>>>>
>>>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>>>
>>>>>> Any ideas?
>>>>
>>>> Hi David,
>>>>
>>>> I may be able to dig in and take a look. How do I reproduce this
>>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>>> hugetlb region?
>>>>
>>>
>>> Hi Mina,
>>>
>>> thanks for having a look. I started poking around myself but,
>>> being new to cgroup code, I even failed to understand why that code gets
>>> triggered though the hugetlb controller isn't even enabled.
>>>
>>> I assume you at least have to make sure that there is
>>> a page populated (MAP_POPULATE, or read/write it). But I am not
>>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>>> sufficient, or if it will require a sequence of
>>> populate+discard(punch) (or multi-threading).
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
>> May need to reproduce via QEMU as below.
> 
> Not sure if relevant, but QEMU should be using
> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
> 
> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of
> memory of the md (e.g., > 90%).
> 

I just tried to reproduce by doing random accesses + random
fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.
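
For reference, the kind of standalone test described above could look
roughly like the sketch below. This is not the program that was actually
run; the helper name punch_hole_stress(), its parameters, the memfd name
and the fixed seed are all made up for illustration. It creates a memfd,
maps it shared, and then randomly either writes to a page or punches
that page out again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values match the Linux UAPI. */
#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif
#ifndef MFD_HUGE_2MB
#define MFD_HUGE_2MB (21U << 26)
#endif

/*
 * Hypothetical helper (not from the original mail): randomly touch and
 * punch pages of a freshly created memfd.
 * Returns 0 on success, -1 if any system call fails.
 */
int punch_hole_stress(unsigned int memfd_flags, size_t page_size,
                      size_t nr_pages, unsigned int iterations)
{
    size_t len = page_size * nr_pages;
    char *mem;
    int ret = -1;
    int fd;

    fd = memfd_create("punch-hole-test", memfd_flags);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0)
        goto out_close;

    mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mem == MAP_FAILED)
        goto out_close;

    srand(42); /* fixed seed so that runs are repeatable */
    for (unsigned int i = 0; i < iterations; i++) {
        size_t page = (size_t)rand() % nr_pages;
        off_t off = (off_t)(page * page_size);

        if (rand() % 2) {
            /* Random access: writing populates the page. */
            mem[off] = (char)i;
        } else if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             off, (off_t)page_size) < 0) {
            goto out_unmap;
        }
    }
    ret = 0;

out_unmap:
    munmap(mem, len);
out_close:
    close(fd);
    return ret;
}
```

To mimic the QEMU setup, one would call it as
punch_hole_stress(MFD_HUGETLB | MFD_HUGE_2MB, 2UL << 20, 8, 100000) with
enough huge pages reserved (e.g., via /proc/sys/vm/nr_hugepages);
without a sufficient pool, the mmap() fails or a later access can raise
SIGBUS. With flags 0 and the base page size it exercises plain shmem
instead.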

So it could be that
1. KVM is involved and is messing this up
2. Multi-threading is involved

However, I am also able to reproduce with only a single VCPU (there is
still the QEMU main thread, but it limits the chance for races).

Even KVM spits fire after a while, which could be a side effect of
allocations failing:

error: kvm run failed Bad address
RAX=0000000000000000 RBX=ffff8c12c9c217c0 RCX=ffff8c12fb1b8fc0 RDX=0000000000000007
RSI=ffff8c12c9c217c0 RDI=ffff8c12c9c217c8 RBP=000000000000000d RSP=ffffb3964040fa68
R8 =0000000000000008 R9 =ffff8c12c9c20000 R10=ffff8c12fffd5000 R11=00000000000303c0
R12=ffff8c12c9c217c0 R13=0000000000000008 R14=0000000000000001 R15=fffff31d44270800
RIP=ffffffffaf33ba0f RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 00007f8fabc87040 00000000 00000000
GS =0000 ffff8c12fbc00000 00000000 00000000
LDT=0000 fffffe0000000000 00000000 00000000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=0000560e10895398 CR3=00000001073b2000 CR4=00350ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=0f 0b eb e2 90 0f 1f 44 00 00 53 48 89 fb 31 c0 48 8d 7f 08 <48> c7 47 f8 00 00 00 00 48 89 d9 48 c7 c2 44 d3 52

-- 
Thanks,

David / dhildenb