linux-mm.kvack.org archive mirror
* cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
@ 2020-10-14 15:22 David Hildenbrand
  2020-10-14 16:15 ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-14 15:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Privoznik, Michael S. Tsirkin, Michal Hocko, Muchun Song,
	Aneesh Kumar K.V, Tejun Heo, Mina Almasry

Hi everybody,

Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
with hugetlbfs and reported that this results in [1]

1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5

2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)


QEMU with free page reporting uses fallocate(FALLOC_FL_PUNCH_HOLE)
to discard pages that are reported as free by a VM. Reporting is
done at pageblock granularity, so when the guest reports a free 2M
chunk, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
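
For reference, per reported 2M chunk this boils down to roughly the
following (a minimal sketch, error handling omitted; the hugetlbfs mount
point and file name are just examples, not what QEMU actually uses):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const size_t huge = 2UL * 1024 * 1024;  /* one 2 MiB huge page */
        /* assumption: hugetlbfs is mounted at /dev/hugepages */
        int fd = open("/dev/hugepages/guest-ram", O_CREAT | O_RDWR, 0600);
        char *p;

        ftruncate(fd, huge);
        p = mmap(NULL, huge, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        memset(p, 0, huge);     /* populate the huge page, as the guest would */

        /* the guest reported the 2M chunk as free -> discard it again */
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0, huge);
        return 0;
}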

I was also able to reproduce (also with virtio-mem, which similarly
uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
(and on v5.7.X from F32).

Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
is broken with cgroups. I did *not* try without cgroups yet.

Any ideas?


Here is report #1:

[  315.251417] ------------[ cut here ]------------
[  315.251424] WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
[  315.251425] Modules linked in: xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp rfcomm tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep hwmon_vid sunrpc squashfs vfat fat loop snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec edac_mce_amd snd_hda_core btusb btrtl btbcm snd_hwdep snd_seq btintel kvm_amd snd_seq_device bluetooth kvm snd_pcm ecdh_generic sp5100_tco irqbypass rfkill snd_timer rapl ecc pcspkr wmi_bmof joydev i2c_piix4 k10temp snd
[  315.251454]  soundcore acpi_cpufreq ip_tables xfs libcrc32c dm_crypt igb hid_logitech_hidpp video dca amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel mxm_wmi drm ghash_clmulni_intel ccp nvme nvme_core wmi pinctrl_amd hid_logitech_dj fuse
[  315.251466] CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
[  315.251467] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
[  315.251469] RIP: 0010:page_counter_uncharge+0x4b/0x50
[  315.251471] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89
[  315.251472] RSP: 0018:ffffb60f01ed3b20 EFLAGS: 00010286
[  315.251473] RAX: fffffffffffd0600 RBX: fffffffffffd0600 RCX: ffff8de8272e3200
[  315.251473] RDX: 000000000000028e RSI: fffffffffffd0600 RDI: ffff8de838452e40
[  315.251474] RBP: ffff8de838452e40 R08: ffff8de838452e40 R09: ffff8de837f86c80
[  315.251475] R10: ffffb60f01ed3b58 R11: 0000000000000001 R12: 0000000000051c00
[  315.251475] R13: fffffffffffae400 R14: ffff8de8272e3240 R15: 0000000000000571
[  315.251476] FS:  00007f9c2edfd700(0000) GS:ffff8de83ebc0000(0000) knlGS:0000000000000000
[  315.251477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  315.251478] CR2: 00007f2a76787e78 CR3: 0000000fcbb1c000 CR4: 0000000000350ee0
[  315.251479] Call Trace:
[  315.251485]  hugetlb_cgroup_uncharge_file_region+0x4b/0x80
[  315.251487]  region_del+0x1d3/0x300
[  315.251489]  hugetlb_unreserve_pages+0x39/0xb0
[  315.251492]  remove_inode_hugepages+0x1a8/0x3d0
[  315.251495]  ? tlb_finish_mmu+0x7a/0x1d0
[  315.251497]  hugetlbfs_fallocate+0x3c4/0x5c0
[  315.251519]  ? kvm_arch_vcpu_ioctl_run+0x614/0x1700 [kvm]
[  315.251522]  ? file_has_perm+0xa2/0xb0
[  315.251524]  ? inode_security+0xc/0x60
[  315.251525]  ? selinux_file_permission+0x4e/0x120
[  315.251527]  vfs_fallocate+0x146/0x290
[  315.251529]  __x64_sys_fallocate+0x3e/0x70
[  315.251531]  do_syscall_64+0x33/0x40
[  315.251533]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  315.251535] RIP: 0033:0x7f9d3fb5641f
[  315.251536] Code: 89 7c 24 08 48 89 4c 24 18 e8 5d fc f8 ff 4c 8b 54 24 18 48 8b 54 24 10 41 89 c0 8b 74 24 0c 8b 7c 24 08 b8 1d 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 89 44 24 08 e8 8d fc f8 ff 8b 44
[  315.251537] RSP: 002b:00007f9c2edfc470 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[  315.251538] RAX: ffffffffffffffda RBX: 0000000000001000 RCX: 00007f9d3fb5641f
[  315.251539] RDX: 00000000ae200000 RSI: 0000000000000003 RDI: 000000000000000c
[  315.251539] RBP: 0000557389d6736c R08: 0000000000000000 R09: 000000000000000c
[  315.251540] R10: 0000000000200000 R11: 0000000000000293 R12: 0000000000200000
[  315.251540] R13: 00000000ffffffff R14: 00000000ae200000 R15: 00007f9cde000000
[  315.251542] ---[ end trace 4c88c62ccb1349c9 ]---



Here is report #2:

[  400.920702] ------------[ cut here ]------------
[  400.920711] WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
[  400.920712] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_nat_tftp nf_conntrack_tftp rfcomm tun bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep hwmon_vid sunrpc squashfs vfat fat loop btusb btrtl btbcm btintel edac_mce_amd bluetooth snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd snd_hda_core snd_hwdep kvm snd_seq ecdh_generic snd_seq_device rfkill irqbypass snd_pcm ecc joydev sp5100_tco rapl pcspkr
[  400.920743]  wmi_bmof i2c_piix4 k10temp snd_timer snd soundcore acpi_cpufreq ip_tables xfs libcrc32c dm_crypt igb hid_logitech_hidpp video dca amdgpu iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm mxm_wmi ghash_clmulni_intel ccp nvme nvme_core wmi pinctrl_amd hid_logitech_dj fuse
[  400.920759] CPU: 13 PID: 2438 Comm: qemu-system-x86 Not tainted 5.9.0 #137
[  400.920760] Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
[  400.920763] RIP: 0010:page_counter_uncharge+0x4b/0x50
[  400.920765] Code: 0f c1 45 00 4c 29 e0 48 89 ef 48 89 c3 48 89 c6 e8 2a fe ff ff 48 85 db 78 10 48 8b 6d 28 48 85 ed 75 d8 5b 5d 41 5c 41 5d c3 <0f> 0b eb ec 90 0f 1f 44 00 00 48 8b 17 48 39 d6 72 41 41 54 49 89
[  400.920766] RSP: 0018:ffffb89e01f5fa20 EFLAGS: 00010286
[  400.920767] RAX: fffffffffff01200 RBX: fffffffffff01200 RCX: 0000000080400000
[  400.920768] RDX: 0000000000000800 RSI: fffffffffff01200 RDI: ffff910b78452e40
[  400.920769] RBP: ffff910b78452e40 R08: ffff910b78452e40 R09: ffff910b70b2a700
[  400.920769] R10: 0000000000000001 R11: ffff910b5e079300 R12: 0000000000100000
[  400.920770] R13: fffffffffff00000 R14: ffff910b76185908 R15: 0000000000000000
[  400.920771] FS:  0000000000000000(0000) GS:ffff910b7ed40000(0000) knlGS:0000000000000000
[  400.920772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  400.920773] CR2: 00007f90b2d898bc CR3: 000000056ca0e000 CR4: 0000000000350ee0
[  400.920774] Call Trace:
[  400.920780]  hugetlb_cgroup_uncharge_file_region+0x4b/0x80
[  400.920783]  region_del+0x11b/0x300
[  400.920786]  hugetlb_unreserve_pages+0x39/0xb0
[  400.920788]  remove_inode_hugepages+0x3c2/0x3d0
[  400.920792]  hugetlbfs_evict_inode+0x1a/0x40
[  400.920795]  evict+0xd1/0x1a0
[  400.920797]  __dentry_kill+0xe4/0x180
[  400.920799]  __fput+0xec/0x240
[  400.920802]  task_work_run+0x65/0xa0
[  400.920804]  do_exit+0x34c/0xad0
[  400.920806]  do_group_exit+0x33/0xa0
[  400.920808]  get_signal+0x179/0x8d0
[  400.920811]  arch_do_signal+0x30/0x700
[  400.920832]  ? kvm_vcpu_ioctl+0x29f/0x590 [kvm]
[  400.920835]  exit_to_user_mode_prepare+0xf7/0x160
[  400.920838]  syscall_exit_to_user_mode+0x31/0x1b0
[  400.920841]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  400.920843] RIP: 0033:0x7f90b3008e92
[  400.920843] Code: Bad RIP value.
[  400.920844] RSP: 002b:00007f8d45ffa770 EFLAGS: 00000282 ORIG_RAX: 00000000000000ca
[  400.920845] RAX: fffffffffffffe00 RBX: 0000000000000014 RCX: 00007f90b3008e92
[  400.920846] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000055efd2951db8
[  400.920846] RBP: 000055efd2951d90 R08: 0000000000000000 R09: 000055efd18f29a0
[  400.920847] R10: 0000000000000000 R11: 0000000000000282 R12: 0000000000000000
[  400.920848] R13: 000055efd190ff60 R14: 000055efd2951db8 R15: 00007f8d45ffa7a0
[  400.920850] ---[ end trace bd4d1b0930afe999 ]---



[1] https://www.redhat.com/archives/libvir-list/2020-October/msg00872.html

-- 
Thanks,

David / dhildenb


* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 15:22 cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5 David Hildenbrand
@ 2020-10-14 16:15 ` David Hildenbrand
  2020-10-14 17:56   ` Mina Almasry
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-14 16:15 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: Michal Privoznik, Michael S. Tsirkin, Michal Hocko, Muchun Song,
	Aneesh Kumar K.V, Tejun Heo, Mina Almasry

On 14.10.20 17:22, David Hildenbrand wrote:
> Hi everybody,
> 
> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> with hugetlbfs and reported that this results in [1]
> 
> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
> 
> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
> 
> 
> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> to discard pages that are reported as free by a VM. The reporting
> granularity is in pageblock granularity. So when the guest reports
> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
> 
> I was also able to reproduce (also with virtio-mem, which similarly
> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> (and on v5.7.X from F32).
> 
> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> is broken with cgroups. I did *not* try without cgroups yet.
> 
> Any ideas?

Just tried without the hugetlb controller, seems to work just fine.

I'd like to note that
- The controller was not activated
- I had to compile the hugetlb controller out to make it work.

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 16:15 ` David Hildenbrand
@ 2020-10-14 17:56   ` Mina Almasry
  2020-10-14 18:18     ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: Mina Almasry @ 2020-10-14 17:56 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 14.10.20 17:22, David Hildenbrand wrote:
> > Hi everybody,
> >
> > Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
> > with hugetlbfs and reported that this results in [1]
> >
> > 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
> >
> > 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
> >
> >
> > QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
> > to discard pages that are reported as free by a VM. The reporting
> > granularity is in pageblock granularity. So when the guest reports
> > 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
> >
> > I was also able to reproduce (also with virtio-mem, which similarly
> > uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
> > (and on v5.7.X from F32).
> >
> > Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
> > is broken with cgroups. I did *not* try without cgroups yet.
> >
> > Any ideas?

Hi David,

I may be able to dig in and take a look. How do I reproduce this
though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
hugetlb region?

>
> Just tried without the hugetlb controller, seems to work just fine.
>
> I'd like to note that
> - The controller was not activated
> - I had to compile the hugetlb controller out to make it work.
>
> --
> Thanks,
>
> David / dhildenb
>



* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 17:56   ` Mina Almasry
@ 2020-10-14 18:18     ` David Hildenbrand
  2020-10-14 18:31       ` Mike Kravetz
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-14 18:18 UTC (permalink / raw)
  To: Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 14.10.20 19:56, Mina Almasry wrote:
> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 14.10.20 17:22, David Hildenbrand wrote:
>>> Hi everybody,
>>>
>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>> with hugetlbfs and reported that this results in [1]
>>>
>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>
>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>
>>>
>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>> to discard pages that are reported as free by a VM. The reporting
>>> granularity is in pageblock granularity. So when the guest reports
>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>
>>> I was also able to reproduce (also with virtio-mem, which similarly
>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>> (and on v5.7.X from F32).
>>>
>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>
>>> Any ideas?
> 
> Hi David,
> 
> I may be able to dig in and take a look. How do I reproduce this
> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
> hugetlb region?
> 

Hi Mina,

thanks for having a look. I started poking around myself but,
being new to cgroup code, I even failed to understand why that code gets
triggered though the hugetlb controller isn't even enabled.

I assume you at least have to make sure that there is
a page populated (MAP_POPULATE, or read/write it). But I am not
sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
sufficient, or if it will require a sequence of
populate+discard(punch) (or multi-threading).
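
Something like the following is roughly what I have in mind (an untested
sketch, not a known reproducer; it assumes at least 512 free 2MB huge
pages, the hugetlb controller compiled in, and building with -pthread):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/memfd.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_SZ (2UL * 1024 * 1024)
#define NR_HUGE 512                     /* 1 GiB of 2 MiB huge pages */

static int fd;
static char *mem;

static void *faulter(void *unused)      /* populate random huge pages */
{
        for (int i = 0; i < 100000; i++)
                mem[(rand() % NR_HUGE) * HUGE_SZ] = 1;
        return NULL;
}

static void *puncher(void *unused)      /* discard random huge pages again */
{
        for (int i = 0; i < 100000; i++)
                fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          (rand() % NR_HUGE) * HUGE_SZ, HUGE_SZ);
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        fd = memfd_create("guest-ram", MFD_HUGETLB | MFD_HUGE_2MB);
        ftruncate(fd, NR_HUGE * HUGE_SZ);
        mem = mmap(NULL, NR_HUGE * HUGE_SZ, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);

        pthread_create(&t1, NULL, faulter, NULL);
        pthread_create(&t2, NULL, puncher, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
}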

What definitely makes it trigger is via QEMU

qemu-system-x86_64 \
-machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-cpu host,migratable=on \
-m 4096 \
-object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296 \
-overcommit mem-lock=off \
-smp 4,sockets=1,dies=1,cores=2,threads=2 \
-nodefaults \
-nographic \
-device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
-blockdev '{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
-chardev stdio,nosignal,id=serial \
-device isa-serial,chardev=serial \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on


However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
(>= v5.7) inside your guest image.

Fedora rawhide qcow2 should do: https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2


-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 18:18     ` David Hildenbrand
@ 2020-10-14 18:31       ` Mike Kravetz
  2020-10-15  7:56         ` David Hildenbrand
  2020-10-15 23:14         ` Mike Kravetz
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2020-10-14 18:31 UTC (permalink / raw)
  To: David Hildenbrand, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 10/14/20 11:18 AM, David Hildenbrand wrote:
> On 14.10.20 19:56, Mina Almasry wrote:
>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>
>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>> Hi everybody,
>>>>
>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>> with hugetlbfs and reported that this results in [1]
>>>>
>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>
>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>
>>>>
>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>> to discard pages that are reported as free by a VM. The reporting
>>>> granularity is in pageblock granularity. So when the guest reports
>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>
>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>> (and on v5.7.X from F32).
>>>>
>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>
>>>> Any ideas?
>>
>> Hi David,
>>
>> I may be able to dig in and take a look. How do I reproduce this
>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>> hugetlb region?
>>
> 
> Hi Mina,
> 
> thanks for having a look. I started poking around myself but,
> being new to cgroup code, I even failed to understand why that code gets
> triggered though the hugetlb controller isn't even enabled.
> 
> I assume you at least have to make sure that there is
> a page populated (MMAP_POPULATE, or read/write it). But I am not
> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
> sufficient, or if it will require a sequence of
> populate+discard(punch) (or multi-threading).

FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
with (and without) hugetlb controller enabled and did not see this issue.

May need to reproduce via QEMU as below.
-- 
Mike Kravetz

> What definitely makes it trigger is via QEMU
> 
> qemu-system-x86_64 \
> -machine pc-i440fx-4.0,accel=kvm,usb=off,dump-guest-core=off,memory-backend=pc.ram \
> -cpu host,migratable=on \
> -m 4096 \
> -object memory-backend-memfd,id=pc.ram,hugetlb=yes,hugetlbsize=2097152,size=4294967296 \
> -overcommit mem-lock=off \
> -smp 4,sockets=1,dies=1,cores=2,threads=2 \
> -nodefaults \
> -nographic \
> -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 \
> -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
> -blockdev '{"driver":"file","filename":"../Fedora-Cloud-Base-32-1.6.x86_64.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
> -blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
> -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,device_id=drive-scsi0-0-0-0,drive=libvirt-1-format,id=scsi0-0-0-0,bootindex=1 \
> -chardev stdio,nosignal,id=serial \
> -device isa-serial,chardev=serial \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7,free-page-reporting=on
> 
> 
> However, you need a recent QEMU (>= v5.1 IIRC) and a recent kernel
> (>= v5.7) inside your guest image.
> 
> Fedora rawhide qcow2 should do: https://dl.fedoraproject.org/pub/fedora/linux/development/rawhide/Cloud/x86_64/images/Fedora-Cloud-Base-Rawhide-20201004.n.1.x86_64.qcow2
> 



* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 18:31       ` Mike Kravetz
@ 2020-10-15  7:56         ` David Hildenbrand
  2020-10-15  8:57           ` David Hildenbrand
  2020-10-15 23:14         ` Mike Kravetz
  1 sibling, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-15  7:56 UTC (permalink / raw)
  To: Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 14.10.20 20:31, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>> On 14.10.20 19:56, Mina Almasry wrote:
>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>>> Hi everybody,
>>>>>
>>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>>> with hugetlbfs and reported that this results in [1]
>>>>>
>>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>>
>>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>>
>>>>>
>>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>>> to discard pages that are reported as free by a VM. The reporting
>>>>> granularity is in pageblock granularity. So when the guest reports
>>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>>
>>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>>> (and on v5.7.X from F32).
>>>>>
>>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>>
>>>>> Any ideas?
>>>
>>> Hi David,
>>>
>>> I may be able to dig in and take a look. How do I reproduce this
>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>> hugetlb region?
>>>
>>
>> Hi Mina,
>>
>> thanks for having a look. I started poking around myself but,
>> being new to cgroup code, I even failed to understand why that code gets
>> triggered though the hugetlb controller isn't even enabled.
>>
>> I assume you at least have to make sure that there is
>> a page populated (MMAP_POPULATE, or read/write it). But I am not
>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>> sufficient, or if it will require a sequence of
>> populate+discard(punch) (or multi-threading).
> 
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
> 
> May need to reproduce via QEMU as below.

Not sure if relevant, but QEMU should be using
memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.

Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant amount of the
memfd's memory (e.g., > 90%).

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-15  7:56         ` David Hildenbrand
@ 2020-10-15  8:57           ` David Hildenbrand
  2020-10-15  9:01             ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-15  8:57 UTC (permalink / raw)
  To: Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo, KVM

On 15.10.20 09:56, David Hildenbrand wrote:
> On 14.10.20 20:31, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>> On 14.10.20 19:56, Mina Almasry wrote:
>>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>
>>>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>>>> Hi everybody,
>>>>>>
>>>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>>>> with hugetlbfs and reported that this results in [1]
>>>>>>
>>>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>>>
>>>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>>>
>>>>>>
>>>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>>>> to discard pages that are reported as free by a VM. The reporting
>>>>>> granularity is in pageblock granularity. So when the guest reports
>>>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>>>
>>>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>>>> (and on v5.7.X from F32).
>>>>>>
>>>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>>>
>>>>>> Any ideas?
>>>>
>>>> Hi David,
>>>>
>>>> I may be able to dig in and take a look. How do I reproduce this
>>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>>> hugetlb region?
>>>>
>>>
>>> Hi Mina,
>>>
>>> thanks for having a look. I started poking around myself but,
>>> being new to cgroup code, I even failed to understand why that code gets
>>> triggered though the hugetlb controller isn't even enabled.
>>>
>>> I assume you at least have to make sure that there is
>>> a page populated (MMAP_POPULATE, or read/write it). But I am not
>>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>>> sufficient, or if it will require a sequence of
>>> populate+discard(punch) (or multi-threading).
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
>> May need to reproduce via QEMU as below.
> 
> Not sure if relevant, but QEMU should be using
> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
> 
> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant of memory of
> the md (e.g., > 90%).
> 

I just tried to reproduce by doing random accesses + random fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.

So could be
1. KVM is involved messing this up
2. Multi-threading is involved

However, I am also able to reproduce with only a single VCPU (there is still the QEMU main thread, but it limits the chance for races).

Even KVM spits fire after a while, which could be a side effect of allocations failing:

error: kvm run failed Bad address
RAX=0000000000000000 RBX=ffff8c12c9c217c0 RCX=ffff8c12fb1b8fc0 RDX=0000000000000007
RSI=ffff8c12c9c217c0 RDI=ffff8c12c9c217c8 RBP=000000000000000d RSP=ffffb3964040fa68
R8 =0000000000000008 R9 =ffff8c12c9c20000 R10=ffff8c12fffd5000 R11=00000000000303c0
R12=ffff8c12c9c217c0 R13=0000000000000008 R14=0000000000000001 R15=fffff31d44270800
RIP=ffffffffaf33ba0f RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 00000000 00000000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0018 0000000000000000 ffffffff 00c09300 DPL=0 DS   [-WA]
DS =0000 0000000000000000 00000000 00000000
FS =0000 00007f8fabc87040 00000000 00000000
GS =0000 ffff8c12fbc00000 00000000 00000000
LDT=0000 fffffe0000000000 00000000 00000000
TR =0040 fffffe0000003000 00004087 00008b00 DPL=0 TSS64-busy
GDT=     fffffe0000001000 0000007f
IDT=     fffffe0000000000 00000fff
CR0=80050033 CR2=0000560e10895398 CR3=00000001073b2000 CR4=00350ef0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000 
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000d01
Code=0f 0b eb e2 90 0f 1f 44 00 00 53 48 89 fb 31 c0 48 8d 7f 08 <48> c7 47 f8 00 00 00 00 48 89 d9 48 c7 c2 44 d3 52

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-15  8:57           ` David Hildenbrand
@ 2020-10-15  9:01             ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-10-15  9:01 UTC (permalink / raw)
  To: Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo, KVM

On 15.10.20 10:57, David Hildenbrand wrote:
> On 15.10.20 09:56, David Hildenbrand wrote:
>> On 14.10.20 20:31, Mike Kravetz wrote:
>>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>>> On 14.10.20 19:56, Mina Almasry wrote:
>>>>> On Wed, Oct 14, 2020 at 9:15 AM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 14.10.20 17:22, David Hildenbrand wrote:
>>>>>>> Hi everybody,
>>>>>>>
>>>>>>> Michal Privoznik played with "free page reporting" in QEMU/virtio-balloon
>>>>>>> with hugetlbfs and reported that this results in [1]
>>>>>>>
>>>>>>> 1. WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
>>>>>>>
>>>>>>> 2. Any hugetlbfs allocations failing. (I assume because some accounting is wrong)
>>>>>>>
>>>>>>>
>>>>>>> QEMU with free page hinting uses fallocate(FALLOC_FL_PUNCH_HOLE)
>>>>>>> to discard pages that are reported as free by a VM. The reporting
>>>>>>> granularity is in pageblock granularity. So when the guest reports
>>>>>>> 2M chunks, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge page in QEMU.
>>>>>>>
>>>>>>> I was also able to reproduce (also with virtio-mem, which similarly
>>>>>>> uses fallocate(FALLOC_FL_PUNCH_HOLE)) on latest v5.9
>>>>>>> (and on v5.7.X from F32).
>>>>>>>
>>>>>>> Looks like something with fallocate(FALLOC_FL_PUNCH_HOLE) accounting
>>>>>>> is broken with cgroups. I did *not* try without cgroups yet.
>>>>>>>
>>>>>>> Any ideas?
>>>>>
>>>>> Hi David,
>>>>>
>>>>> I may be able to dig in and take a look. How do I reproduce this
>>>>> though? I just fallocate(FALLOC_FL_PUNCH_HOLE) one 2MB page in a
>>>>> hugetlb region?
>>>>>
>>>>
>>>> Hi Mina,
>>>>
>>>> thanks for having a look. I started poking around myself but,
>>>> being new to cgroup code, I even failed to understand why that code gets
>>>> triggered though the hugetlb controller isn't even enabled.
>>>>
>>>> I assume you at least have to make sure that there is
>>>> a page populated (MMAP_POPULATE, or read/write it). But I am not
>>>> sure yet if a single fallocate(FALLOC_FL_PUNCH_HOLE) is
>>>> sufficient, or if it will require a sequence of
>>>> populate+discard(punch) (or multi-threading).
>>>
>>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>>> with (and without) hugetlb controller enabled and did not see this issue.
>>>
>>> May need to reproduce via QEMU as below.
>>
>> Not sure if relevant, but QEMU should be using
>> memfd_create(MFD_HUGETLB|MFD_HUGE_2MB) to obtain a hugetlbfs file.
>>
>> Also, QEMU fallocate(FALLOC_FL_PUNCH_HOLE)'s a significant of memory of
>> the md (e.g., > 90%).
>>
> 
> I just tried to reproduce by doing random accesses + random fallocate(FALLOC_FL_PUNCH_HOLE) within a file - without success.
> 
> So could be
> 1. KVM is involved messing this up
> 2. Multi-threading is involved
> 

Able to reproduce with TCG under QEMU, so not a KVM issue.

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-14 18:31       ` Mike Kravetz
  2020-10-15  7:56         ` David Hildenbrand
@ 2020-10-15 23:14         ` Mike Kravetz
  2020-10-20 13:38           ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2020-10-15 23:14 UTC (permalink / raw)
  To: David Hildenbrand, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 10/14/20 11:31 AM, Mike Kravetz wrote:
> On 10/14/20 11:18 AM, David Hildenbrand wrote:
> 
> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
> with (and without) hugetlb controller enabled and did not see this issue.
> 

I took a closer look after running just the fallocate_stress test
in libhugetlbfs.  Here are the cgroup counter values:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 0
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 601882624
hugetlb.2MB.rsvd.usage_in_bytes 392167424
hugetlb.2MB.usage_in_bytes 0

We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value
is not correct in that it should be zero.   No huge page reservations
remain after the test.
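
(For reference, 392167424 bytes is exactly 187 2MB reservations' worth of
charge left behind on the rsvd counter.)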

HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB

To try and better understand the reservation cgroup controller, I added
a few printks to the code.  While running fallocate_stress with the
printks, I can consistently hit the WARN_ON_ONCE() due to the counter
going negative.  Here are the cgroup counter values after such a run:

hugetlb.2MB.failcnt 0
hugetlb.2MB.limit_in_bytes 9223372036854771712
hugetlb.2MB.max_usage_in_bytes 209715200
hugetlb.2MB.rsvd.failcnt 3
hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
hugetlb.2MB.rsvd.max_usage_in_bytes 251658240
hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504
hugetlb.2MB.usage_in_bytes 0
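
(18446744073487253504 is the unsigned representation of -222298112, i.e.
the rsvd counter has underflowed by exactly 106 2MB reservations.)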

Again, no reserved pages after the test.

HugePages_Total:    1024
HugePages_Free:     1024
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:         2097152 kB

I have some basic hugetlb hole punch functionality tests.  Running
these on the kernel with added printk's does not cause any issues.
In order to reproduce, I need to run fallocate_stress test which
will cause hole punch to race with page fault.  Best guess at this
time is that some of the error/race detection reservation back out
code is not properly dealing with cgroup accounting.

I'll take a look at this as well.
-- 
Mike Kravetz



* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-15 23:14         ` Mike Kravetz
@ 2020-10-20 13:38           ` David Hildenbrand
  2020-10-21  3:35             ` Mike Kravetz
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-20 13:38 UTC (permalink / raw)
  To: Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 16.10.20 01:14, Mike Kravetz wrote:
> On 10/14/20 11:31 AM, Mike Kravetz wrote:
>> On 10/14/20 11:18 AM, David Hildenbrand wrote:
>>
>> FWIW - I ran libhugetlbfs tests which do a bunch of hole punching
>> with (and without) hugetlb controller enabled and did not see this issue.
>>
> 
> I took a closer look after running just the fallocate_stress test
> in libhugetlbfs.  Here are the cgroup counter values:
> 
> hugetlb.2MB.failcnt 0
> hugetlb.2MB.limit_in_bytes 9223372036854771712
> hugetlb.2MB.max_usage_in_bytes 209715200
> hugetlb.2MB.rsvd.failcnt 0
> hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
> hugetlb.2MB.rsvd.max_usage_in_bytes 601882624
> hugetlb.2MB.rsvd.usage_in_bytes 392167424
> hugetlb.2MB.usage_in_bytes 0
> 
> We did not hit the WARN_ON_ONCE(), but the 'rsvd.usage_in_bytes' value
> is not correct in that it should be zero.   No huge page reservations
> remain after the test.
> 
> HugePages_Total:    1024
> HugePages_Free:     1024
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         2097152 kB
> 
> To try and better understand the reservation cgroup controller, I addded
> a few printks to the code.  While running fallocate_stress with the
> printks, I can consistently hit the WARN_ON_ONCE() due to the counter
> going negative.  Here are the cgroup counter values after such a run:
> 
> hugetlb.2MB.failcnt 0
> hugetlb.2MB.limit_in_bytes 9223372036854771712
> hugetlb.2MB.max_usage_in_bytes 209715200
> hugetlb.2MB.rsvd.failcnt 3
> hugetlb.2MB.rsvd.limit_in_bytes 9223372036854771712
> hugetlb.2MB.rsvd.max_usage_in_bytes 251658240
> hugetlb.2MB.rsvd.usage_in_bytes 18446744073487253504
> hugetlb.2MB.usage_in_bytes 0
> 
> Again, no reserved pages after the test.
> 
> HugePages_Total:    1024
> HugePages_Free:     1024
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> Hugetlb:         2097152 kB
> 
> I have some basic hugetlb hole punch functionality tests.  Running
> these on the kernel with added printk's does not cause any issues.
> In order to reproduce, I need to run fallocate_stress test which
> will cause hole punch to race with page fault.  Best guess at this
> time is that some of the error/race detection reservation back out
> code is not properly dealing with cgroup accounting.
> 
> I'll take a look at this as well.
> 

I'm bisecting the warning right now. Looks like it was introduced in v5.7.

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-20 13:38           ` David Hildenbrand
@ 2020-10-21  3:35             ` Mike Kravetz
  2020-10-21 12:42               ` David Hildenbrand
  2020-10-21 12:57               ` Michal Privoznik
  0 siblings, 2 replies; 18+ messages in thread
From: Mike Kravetz @ 2020-10-21  3:35 UTC (permalink / raw)
  To: David Hildenbrand, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 10/20/20 6:38 AM, David Hildenbrand wrote:
> 
> I'm bisecting the warning right now. Looks like it was introduced in v5.7.

I found the following bugs in the cgroup reservation accounting.  The ones
in region_del are pretty obvious as the number of pages to uncharge would
always be zero.  The one on alloc_huge_page needs racing code to expose.

With these fixes, my testing is showing consistent/correct results for
hugetlb reservation cgroup accounting.

It would be good if Mina (at least) would look these over.  Would also
be interesting to know if these fixes address the bug seen with the qemu
use case.

I'm still doing more testing and code inspection to look for other issues.

From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Tue, 20 Oct 2020 20:21:42 -0700
Subject: [PATCH] hugetlb_cgroup: fix reservation accounting

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..c92366313780 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
 		}
 
 		if (f <= rg->from) {	/* Trim beginning of region */
-			del += t - rg->from;
-			rg->from = t;
-
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    t - rg->from);
-		} else {		/* Trim end of region */
-			del += rg->to - f;
-			rg->to = f;
 
+			del += t - rg->from;
+			rg->from = t;
+		} else {		/* Trim end of region */
 			hugetlb_cgroup_uncharge_file_region(resv, rg,
 							    rg->to - f);
+
+			del += rg->to - f;
+			rg->to = f;
 		}
 	}
 
@@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 
 		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 		hugetlb_acct_memory(h, -rsv_adjust);
+		if (deferred_reserve)
+			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
+					pages_per_huge_page(h), page);
 	}
 	return page;
 
-- 
2.25.4




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21  3:35             ` Mike Kravetz
@ 2020-10-21 12:42               ` David Hildenbrand
  2020-10-21 12:57               ` Michal Privoznik
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-10-21 12:42 UTC (permalink / raw)
  To: Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michal Privoznik, Michael S. Tsirkin,
	Michal Hocko, Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 21.10.20 05:35, Mike Kravetz wrote:
> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>
>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
> 

So bisecting nailed it down to one of

353b2de42e84 mm/hugetlb.c: clean code by removing unnecessary initialization
a9b3f867404b hugetlb: support file_region coalescing again
08cf9faf7558 hugetlb_cgroup: support noreserve mappings
075a61d07a8e hugetlb_cgroup: add accounting for shared mappings
0db9d74ed884 hugetlb: disable region_add file_region coalescing
e9fe92ae0cd2 hugetlb_cgroup: add reservation accounting for private mappings
9808895e1a44 mm/hugetlb_cgroup: fix hugetlb_cgroup migration
1adc4d419aa2 hugetlb_cgroup: add interface for charge/uncharge hugetlb
reservations
cdc2fcfea79b hugetlb_cgroup: add hugetlb_cgroup reservation counter

So it seems to be broken right from the beginning of
charge/uncharge/reservations. Not a surprise when looking at your fixes :)


> I found the following bugs in the cgroup reservation accounting.  The ones
> in region_del are pretty obvious as the number of pages to uncharge would
> always be zero.  The one on alloc_huge_page needs racing code to expose.
> 
> With these fixes, my testing is showing consistent/correct results for
> hugetlb reservation cgroup accounting.
> 
> It would be good if Mina (at least) would look these over.  Would also
> be interesting to know if these fixes address the bug seen with the qemu
> use case.

I strongly suspect it will. Compiling now, will reply in half an hour or
so with the result.

> 
> I'm still doing more testing and code inspection to look for other issues.

When sending, can you make sure to credit Michal P.? Thanks!

Reported-by: Michal Privoznik <mprivozn@redhat.com>

> 
> From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 20 Oct 2020 20:21:42 -0700
> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..c92366313780 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>  		}
>  
>  		if (f <= rg->from) {	/* Trim beginning of region */
> -			del += t - rg->from;
> -			rg->from = t;
> -
>  			hugetlb_cgroup_uncharge_file_region(resv, rg,
>  							    t - rg->from);
> -		} else {		/* Trim end of region */
> -			del += rg->to - f;
> -			rg->to = f;
>  
> +			del += t - rg->from;
> +			rg->from = t;
> +		} else {		/* Trim end of region */
>  			hugetlb_cgroup_uncharge_file_region(resv, rg,
>  							    rg->to - f);
> +
> +			del += rg->to - f;
> +			rg->to = f;

Those two look very correct to me.

You could keep computing "del" before the uncharge, similar to the /*
Remove entire region */ case.
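
E.g. something like this (untested, just to illustrate the idea):

		if (f <= rg->from) {	/* Trim beginning of region */
			del += t - rg->from;
			hugetlb_cgroup_uncharge_file_region(resv, rg,
							    t - rg->from);
			rg->from = t;
		} else {		/* Trim end of region */
			del += rg->to - f;
			hugetlb_cgroup_uncharge_file_region(resv, rg,
							    rg->to - f);
			rg->to = f;
		}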

>  		}
>  	}
>  
> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>  
>  		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>  		hugetlb_acct_memory(h, -rsv_adjust);
> +		if (deferred_reserve)
> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
> +					pages_per_huge_page(h), page);

That looks correct to me as well.

>  	}
>  	return page;
>  
> 

Thanks for debugging!

-- 
Thanks,

David / dhildenb




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21  3:35             ` Mike Kravetz
  2020-10-21 12:42               ` David Hildenbrand
@ 2020-10-21 12:57               ` Michal Privoznik
  2020-10-21 13:11                 ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Privoznik @ 2020-10-21 12:57 UTC (permalink / raw)
  To: Mike Kravetz, David Hildenbrand, Mina Almasry
  Cc: linux-kernel, linux-mm, Michael S. Tsirkin, Michal Hocko,
	Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 10/21/20 5:35 AM, Mike Kravetz wrote:
> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>
>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
> 
> I found the following bugs in the cgroup reservation accounting.  The ones
> in region_del are pretty obvious as the number of pages to uncharge would
> always be zero.  The one on alloc_huge_page needs racing code to expose.
> 
> With these fixes, my testing is showing consistent/correct results for
> hugetlb reservation cgroup accounting.
> 
> It would be good if Mina (at least) would look these over.  Would also
> be interesting to know if these fixes address the bug seen with the qemu
> use case.
> 
> I'm still doing more testing and code inspection to look for other issues.
> 
>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Tue, 20 Oct 2020 20:21:42 -0700
> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>   mm/hugetlb.c | 15 +++++++++------
>   1 file changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..c92366313780 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>   		}
>   
>   		if (f <= rg->from) {	/* Trim beginning of region */
> -			del += t - rg->from;
> -			rg->from = t;
> -
>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>   							    t - rg->from);
> -		} else {		/* Trim end of region */
> -			del += rg->to - f;
> -			rg->to = f;
>   
> +			del += t - rg->from;
> +			rg->from = t;
> +		} else {		/* Trim end of region */
>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>   							    rg->to - f);
> +
> +			del += rg->to - f;
> +			rg->to = f;
>   		}
>   	}
>   
> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>   
>   		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>   		hugetlb_acct_memory(h, -rsv_adjust);
> +		if (deferred_reserve)
> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
> +					pages_per_huge_page(h), page);
>   	}
>   	return page;
>   
> 

I've applied, rebuilt and tested, but unfortunately I still hit the problem:
[ 6472.719047] ------------[ cut here ]------------
[ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 page_counter_uncharge+0x33/0x40
[ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
[ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 5.9.1-gentoo-x86_64 #1
[ 6472.719057] Hardware name: System manufacturer System Product Name/PRIME X570-PRO, BIOS 1005 08/01/2019
[ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
[ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
[ 6472.719061] RSP: 0018:ffffc90000b77b40 EFLAGS: 00010286
[ 6472.719061] RAX: fffffffffffe9200 RBX: ffff888fb3b97b40 RCX: fffffffffffe9200
[ 6472.719062] RDX: 0000000000000221 RSI: fffffffffffe9200 RDI: ffff888fd8451dd0
[ 6472.719062] RBP: ffff888fb6990420 R08: 0000000000044200 R09: fffffffffffbbe00
[ 6472.719062] R10: ffff888fb3b97b40 R11: 000000000000000a R12: 0000000000000001
[ 6472.719063] R13: 00000000000005df R14: 00000000000005de R15: ffff888fb3b97b40
[ 6472.719063] FS:  00007fbd175fe700(0000) GS:ffff888fde980000(0000) knlGS:0000000000000000
[ 6472.719064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6472.719064] CR2: 00007fbd825101f0 CR3: 0000000fb5e41000 CR4: 0000000000350ee0
[ 6472.719065] Call Trace:
[ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
[ 6472.719069]  region_del+0x1ae/0x270
[ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
[ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
[ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
[ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
[ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
[ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
[ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
[ 6472.719084]  vfs_fallocate+0x122/0x260
[ 6472.719085]  __x64_sys_fallocate+0x39/0x60
[ 6472.719086]  do_syscall_64+0x2d/0x40
[ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 6472.719089] RIP: 0033:0x7fbe3cefcde7
[ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
[ 6472.719090] RSP: 002b:00007fbd175fc7a0 EFLAGS: 00000293 ORIG_RAX: 000000000000011d
[ 6472.719090] RAX: ffffffffffffffda RBX: 00000000bbe00000 RCX: 00007fbe3cefcde7
[ 6472.719091] RDX: 00000000bbc00000 RSI: 0000000000000003 RDI: 000000000000001d
[ 6472.719091] RBP: 00007fbd175fc800 R08: 0000000000000000 R09: 0000000000000000
[ 6472.719091] R10: 0000000000200000 R11: 0000000000000293 R12: 00007ffeea066d2e
[ 6472.719092] R13: 00007ffeea066d2f R14: 00007fbd175fe700 R15: 00007fbd175fcdc0
[ 6472.719092] ---[ end trace c97dc6281a861980 ]---


Sorry,
Michal




* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21 12:57               ` Michal Privoznik
@ 2020-10-21 13:11                 ` David Hildenbrand
  2020-10-21 13:34                   ` David Hildenbrand
  0 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2020-10-21 13:11 UTC (permalink / raw)
  To: Michal Privoznik, Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michael S. Tsirkin, Michal Hocko,
	Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 21.10.20 14:57, Michal Privoznik wrote:
> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>
>>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>>
>> I found the following bugs in the cgroup reservation accounting.  The ones
>> in region_del are pretty obvious as the number of pages to uncharge would
>> always be zero.  The one on alloc_huge_page needs racing code to expose.
>>
>> With these fixes, my testing is showing consistent/correct results for
>> hugetlb reservation cgroup accounting.
>>
>> It would be good if Mina (at least) would look these over.  Would also
>> be interesting to know if these fixes address the bug seen with the qemu
>> use case.
>>
>> I'm still doing more testing and code inspection to look for other issues.
>>
>>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@oracle.com>
>> Date: Tue, 20 Oct 2020 20:21:42 -0700
>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> ---
>>   mm/hugetlb.c | 15 +++++++++------
>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 67fc6383995b..c92366313780 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>>   		}
>>   
>>   		if (f <= rg->from) {	/* Trim beginning of region */
>> -			del += t - rg->from;
>> -			rg->from = t;
>> -
>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>   							    t - rg->from);
>> -		} else {		/* Trim end of region */
>> -			del += rg->to - f;
>> -			rg->to = f;
>>   
>> +			del += t - rg->from;
>> +			rg->from = t;
>> +		} else {		/* Trim end of region */
>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>   							    rg->to - f);
>> +
>> +			del += rg->to - f;
>> +			rg->to = f;
>>   		}
>>   	}
>>   
>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>>   
>>   		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>>   		hugetlb_acct_memory(h, -rsv_adjust);
>> +		if (deferred_reserve)
>> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>> +					pages_per_huge_page(h), page);
>>   	}
>>   	return page;
>>   
>>
> 
> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
> [ 6472.719047] ------------[ cut here ]------------
> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
> page_counter_uncharge+0x33/0x40
> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
> 5.9.1-gentoo-x86_64 #1
> [ 6472.719057] Hardware name: System manufacturer System Product 
> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
> [ 6472.719061] RSP: 0018:ffffc90000b77b40 EFLAGS: 00010286
> [ 6472.719061] RAX: fffffffffffe9200 RBX: ffff888fb3b97b40 RCX: 
> fffffffffffe9200
> [ 6472.719062] RDX: 0000000000000221 RSI: fffffffffffe9200 RDI: 
> ffff888fd8451dd0
> [ 6472.719062] RBP: ffff888fb6990420 R08: 0000000000044200 R09: 
> fffffffffffbbe00
> [ 6472.719062] R10: ffff888fb3b97b40 R11: 000000000000000a R12: 
> 0000000000000001
> [ 6472.719063] R13: 00000000000005df R14: 00000000000005de R15: 
> ffff888fb3b97b40
> [ 6472.719063] FS:  00007fbd175fe700(0000) GS:ffff888fde980000(0000) 
> knlGS:0000000000000000
> [ 6472.719064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 6472.719064] CR2: 00007fbd825101f0 CR3: 0000000fb5e41000 CR4: 
> 0000000000350ee0
> [ 6472.719065] Call Trace:
> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
> [ 6472.719069]  region_del+0x1ae/0x270
> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
> [ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
> [ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
> [ 6472.719084]  vfs_fallocate+0x122/0x260
> [ 6472.719085]  __x64_sys_fallocate+0x39/0x60
> [ 6472.719086]  do_syscall_64+0x2d/0x40
> [ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [ 6472.719089] RIP: 0033:0x7fbe3cefcde7
> [ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 
> 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 
> 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
> [ 6472.719090] RSP: 002b:00007fbd175fc7a0 EFLAGS: 00000293 ORIG_RAX: 
> 000000000000011d
> [ 6472.719090] RAX: ffffffffffffffda RBX: 00000000bbe00000 RCX: 
> 00007fbe3cefcde7
> [ 6472.719091] RDX: 00000000bbc00000 RSI: 0000000000000003 RDI: 
> 000000000000001d
> [ 6472.719091] RBP: 00007fbd175fc800 R08: 0000000000000000 R09: 
> 0000000000000000
> [ 6472.719091] R10: 0000000000200000 R11: 0000000000000293 R12: 
> 00007ffeea066d2e
> [ 6472.719092] R13: 00007ffeea066d2f R14: 00007fbd175fe700 R15: 
> 00007fbd175fcdc0
> [ 6472.719092] ---[ end trace c97dc6281a861980 ]---

Agreed, same over here. :(

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21 13:11                 ` David Hildenbrand
@ 2020-10-21 13:34                   ` David Hildenbrand
  2020-10-21 13:38                     ` David Hildenbrand
  2020-10-21 16:58                     ` Mike Kravetz
  0 siblings, 2 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-10-21 13:34 UTC (permalink / raw)
  To: Michal Privoznik, Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michael S. Tsirkin, Michal Hocko,
	Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 21.10.20 15:11, David Hildenbrand wrote:
> On 21.10.20 14:57, Michal Privoznik wrote:
>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>
>>>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>>>
>>> I found the following bugs in the cgroup reservation accounting.  The ones
>>> in region_del are pretty obvious as the number of pages to uncharge would
>>> always be zero.  The one on alloc_huge_page needs racing code to expose.
>>>
>>> With these fixes, my testing is showing consistent/correct results for
>>> hugetlb reservation cgroup accounting.
>>>
>>> It would be good if Mina (at least) would look these over.  Would also
>>> be interesting to know if these fixes address the bug seen with the qemu
>>> use case.
>>>
>>> I'm still doing more testing and code inspection to look for other issues.
>>>
>>>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>> Date: Tue, 20 Oct 2020 20:21:42 -0700
>>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> ---
>>>   mm/hugetlb.c | 15 +++++++++------
>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 67fc6383995b..c92366313780 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>>>   		}
>>>   
>>>   		if (f <= rg->from) {	/* Trim beginning of region */
>>> -			del += t - rg->from;
>>> -			rg->from = t;
>>> -
>>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>>   							    t - rg->from);
>>> -		} else {		/* Trim end of region */
>>> -			del += rg->to - f;
>>> -			rg->to = f;
>>>   
>>> +			del += t - rg->from;
>>> +			rg->from = t;
>>> +		} else {		/* Trim end of region */
>>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>>   							    rg->to - f);
>>> +
>>> +			del += rg->to - f;
>>> +			rg->to = f;
>>>   		}
>>>   	}
>>>   
>>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>>>   
>>>   		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>>>   		hugetlb_acct_memory(h, -rsv_adjust);
>>> +		if (deferred_reserve)
>>> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>>> +					pages_per_huge_page(h), page);
>>>   	}
>>>   	return page;
>>>   
>>>
>>
>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>> [ 6472.719047] ------------[ cut here ]------------
>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
>> page_counter_uncharge+0x33/0x40
>> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
>> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
>> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
>> 5.9.1-gentoo-x86_64 #1
>> [ 6472.719057] Hardware name: System manufacturer System Product 
>> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
>> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
>> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
>> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
>> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
>> [ 6472.719061] RSP: 0018:ffffc90000b77b40 EFLAGS: 00010286
>> [ 6472.719061] RAX: fffffffffffe9200 RBX: ffff888fb3b97b40 RCX: 
>> fffffffffffe9200
>> [ 6472.719062] RDX: 0000000000000221 RSI: fffffffffffe9200 RDI: 
>> ffff888fd8451dd0
>> [ 6472.719062] RBP: ffff888fb6990420 R08: 0000000000044200 R09: 
>> fffffffffffbbe00
>> [ 6472.719062] R10: ffff888fb3b97b40 R11: 000000000000000a R12: 
>> 0000000000000001
>> [ 6472.719063] R13: 00000000000005df R14: 00000000000005de R15: 
>> ffff888fb3b97b40
>> [ 6472.719063] FS:  00007fbd175fe700(0000) GS:ffff888fde980000(0000) 
>> knlGS:0000000000000000
>> [ 6472.719064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [ 6472.719064] CR2: 00007fbd825101f0 CR3: 0000000fb5e41000 CR4: 
>> 0000000000350ee0
>> [ 6472.719065] Call Trace:
>> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
>> [ 6472.719069]  region_del+0x1ae/0x270
>> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
>> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
>> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
>> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
>> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
>> [ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
>> [ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
>> [ 6472.719084]  vfs_fallocate+0x122/0x260
>> [ 6472.719085]  __x64_sys_fallocate+0x39/0x60
>> [ 6472.719086]  do_syscall_64+0x2d/0x40
>> [ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [ 6472.719089] RIP: 0033:0x7fbe3cefcde7
>> [ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 
>> 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 
>> 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
>> [ 6472.719090] RSP: 002b:00007fbd175fc7a0 EFLAGS: 00000293 ORIG_RAX: 
>> 000000000000011d
>> [ 6472.719090] RAX: ffffffffffffffda RBX: 00000000bbe00000 RCX: 
>> 00007fbe3cefcde7
>> [ 6472.719091] RDX: 00000000bbc00000 RSI: 0000000000000003 RDI: 
>> 000000000000001d
>> [ 6472.719091] RBP: 00007fbd175fc800 R08: 0000000000000000 R09: 
>> 0000000000000000
>> [ 6472.719091] R10: 0000000000200000 R11: 0000000000000293 R12: 
>> 00007ffeea066d2e
>> [ 6472.719092] R13: 00007ffeea066d2f R14: 00007fbd175fe700 R15: 
>> 00007fbd175fcdc0
>> [ 6472.719092] ---[ end trace c97dc6281a861980 ]---
> 
> Agreed, same over here. :(
> 

I *think* the following on top makes it fly

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 67fc6383995b..5cf7f6a6c1a6 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)

                        del += t - f;

+                       hugetlb_cgroup_uncharge_file_region(
+                               resv, rg, t - f);
+
                        /* New entry for end of split region */
                        nrg->from = t;
                        nrg->to = rg->to;
@@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
                        /* Original entry is trimmed */
                        rg->to = f;

-                       hugetlb_cgroup_uncharge_file_region(
-                               resv, rg, nrg->to - nrg->from);
-
                        list_add(&nrg->link, &rg->link);
                        nrg = NULL;
                        break;
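
To spell out the bug the change above addresses: in the "must split region"
case, the unpatched code uncharged nrg->to - nrg->from - the size of the tail
region being kept - instead of t - f, the range actually punched out. With
QEMU punching single huge pages out of a large file, that uncharges far too
much, drives the cgroup reservation counter negative, and trips the
page_counter warning. Since the wrapped diff is a bit hard to read, the split
case with the change applied would look roughly like this (simplified sketch
for illustration, not the literal upstream code):

	if (f > rg->from && t < rg->to) {	/* Must split region */
		/* (nrg was pre-allocated earlier in the function) */
		del += t - f;

		/* Uncharge exactly the punched range [f, t). */
		hugetlb_cgroup_uncharge_file_region(resv, rg, t - f);

		/* New entry for end of split region */
		nrg->from = t;
		nrg->to = rg->to;
		copy_hugetlb_cgroup_uncharge_info(nrg, rg);

		/* Original entry is trimmed */
		rg->to = f;

		list_add(&nrg->link, &rg->link);
		nrg = NULL;
		break;
	}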


-- 
Thanks,

David / dhildenb



^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21 13:34                   ` David Hildenbrand
@ 2020-10-21 13:38                     ` David Hildenbrand
  2020-10-21 16:58                     ` Mike Kravetz
  1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-10-21 13:38 UTC (permalink / raw)
  To: Michal Privoznik, Mike Kravetz, Mina Almasry
  Cc: linux-kernel, linux-mm, Michael S. Tsirkin, Michal Hocko,
	Muchun Song, Aneesh Kumar K.V, Tejun Heo

On 21.10.20 15:34, David Hildenbrand wrote:
> On 21.10.20 15:11, David Hildenbrand wrote:
>> On 21.10.20 14:57, Michal Privoznik wrote:
>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>>
>>>>> I'm bisecting the warning right now. Looks like it was introduced in v5.7.
>>>>
>>>> I found the following bugs in the cgroup reservation accounting.  The ones
>>>> in region_del are pretty obvious as the number of pages to uncharge would
>>>> always be zero.  The one on alloc_huge_page needs racing code to expose.
>>>>
>>>> With these fixes, my testing is showing consistent/correct results for
>>>> hugetlb reservation cgroup accounting.
>>>>
>>>> It would be good if Mina (at least) would look these over.  Would also
>>>> be interesting to know if these fixes address the bug seen with the qemu
>>>> use case.
>>>>
>>>> I'm still doing more testing and code inspection to look for other issues.
>>>>
>>>>  From 861bcd7d0443f18a5fed3c3ddc5f1c71e78c4ef4 Mon Sep 17 00:00:00 2001
>>>> From: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Date: Tue, 20 Oct 2020 20:21:42 -0700
>>>> Subject: [PATCH] hugetlb_cgroup: fix reservation accounting
>>>>
>>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>>> ---
>>>>   mm/hugetlb.c | 15 +++++++++------
>>>>   1 file changed, 9 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 67fc6383995b..c92366313780 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -685,17 +685,17 @@ static long region_del(struct resv_map *resv, long f, long t)
>>>>   		}
>>>>   
>>>>   		if (f <= rg->from) {	/* Trim beginning of region */
>>>> -			del += t - rg->from;
>>>> -			rg->from = t;
>>>> -
>>>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>>>   							    t - rg->from);
>>>> -		} else {		/* Trim end of region */
>>>> -			del += rg->to - f;
>>>> -			rg->to = f;
>>>>   
>>>> +			del += t - rg->from;
>>>> +			rg->from = t;
>>>> +		} else {		/* Trim end of region */
>>>>   			hugetlb_cgroup_uncharge_file_region(resv, rg,
>>>>   							    rg->to - f);
>>>> +
>>>> +			del += rg->to - f;
>>>> +			rg->to = f;
>>>>   		}
>>>>   	}
>>>>   
>>>> @@ -2454,6 +2454,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>>>>   
>>>>   		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
>>>>   		hugetlb_acct_memory(h, -rsv_adjust);
>>>> +		if (deferred_reserve)
>>>> +			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>>>> +					pages_per_huge_page(h), page);
>>>>   	}
>>>>   	return page;
>>>>   
>>>>
>>>
>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>> [ 6472.719047] ------------[ cut here ]------------
>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
>>> page_counter_uncharge+0x33/0x40
>>> [ 6472.719052] Modules linked in: kvm_amd amdgpu kvm btusb sp5100_tco 
>>> btrtl watchdog k10temp btbcm btintel mfd_core gpu_sched ttm
>>> [ 6472.719057] CPU: 6 PID: 11773 Comm: CPU 3/KVM Not tainted 
>>> 5.9.1-gentoo-x86_64 #1
>>> [ 6472.719057] Hardware name: System manufacturer System Product 
>>> Name/PRIME X570-PRO, BIOS 1005 08/01/2019
>>> [ 6472.719059] RIP: 0010:page_counter_uncharge+0x33/0x40
>>> [ 6472.719060] Code: 48 85 ff 74 24 4c 89 c8 f0 48 0f c1 07 4c 29 c0 48 
>>> 89 c1 48 89 c6 e8 7c fe ff ff 48 85 c9 78 0a 48 8b 7f 28 48 85 ff 75 dc 
>>> c3 <0f> 0b eb f2 66 0f 1f 84 00 00 00 00 00 48 8b 17 48 39 d6 72 41 41
>>> [ 6472.719061] RSP: 0018:ffffc90000b77b40 EFLAGS: 00010286
>>> [ 6472.719061] RAX: fffffffffffe9200 RBX: ffff888fb3b97b40 RCX: 
>>> fffffffffffe9200
>>> [ 6472.719062] RDX: 0000000000000221 RSI: fffffffffffe9200 RDI: 
>>> ffff888fd8451dd0
>>> [ 6472.719062] RBP: ffff888fb6990420 R08: 0000000000044200 R09: 
>>> fffffffffffbbe00
>>> [ 6472.719062] R10: ffff888fb3b97b40 R11: 000000000000000a R12: 
>>> 0000000000000001
>>> [ 6472.719063] R13: 00000000000005df R14: 00000000000005de R15: 
>>> ffff888fb3b97b40
>>> [ 6472.719063] FS:  00007fbd175fe700(0000) GS:ffff888fde980000(0000) 
>>> knlGS:0000000000000000
>>> [ 6472.719064] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [ 6472.719064] CR2: 00007fbd825101f0 CR3: 0000000fb5e41000 CR4: 
>>> 0000000000350ee0
>>> [ 6472.719065] Call Trace:
>>> [ 6472.719067]  hugetlb_cgroup_uncharge_file_region+0x46/0x70
>>> [ 6472.719069]  region_del+0x1ae/0x270
>>> [ 6472.719070]  hugetlb_unreserve_pages+0x32/0xa0
>>> [ 6472.719072]  remove_inode_hugepages+0x19d/0x3a0
>>> [ 6472.719079]  ? writeback_registers+0x45/0x60 [kvm]
>>> [ 6472.719080]  hugetlbfs_fallocate+0x3f2/0x4a0
>>> [ 6472.719081]  ? __mod_lruvec_state+0x1d/0x40
>>> [ 6472.719081]  ? __mod_memcg_lruvec_state+0x1b/0xe0
>>> [ 6472.719083]  ? __seccomp_filter+0x75/0x6a0
>>> [ 6472.719084]  vfs_fallocate+0x122/0x260
>>> [ 6472.719085]  __x64_sys_fallocate+0x39/0x60
>>> [ 6472.719086]  do_syscall_64+0x2d/0x40
>>> [ 6472.719088]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>> [ 6472.719089] RIP: 0033:0x7fbe3cefcde7
>>> [ 6472.719089] Code: 89 7c 24 08 48 89 4c 24 18 e8 45 fc f8 ff 41 89 c0 
>>> 4c 8b 54 24 18 48 8b 54 24 10 b8 1d 01 00 00 8b 74 24 0c 8b 7c 24 08 0f 
>>> 05 <48> 3d 00 f0 ff ff 77 41 44 89 c7 89 44 24 08 e8 75 fc f8 ff 8b 44
>>> [ 6472.719090] RSP: 002b:00007fbd175fc7a0 EFLAGS: 00000293 ORIG_RAX: 
>>> 000000000000011d
>>> [ 6472.719090] RAX: ffffffffffffffda RBX: 00000000bbe00000 RCX: 
>>> 00007fbe3cefcde7
>>> [ 6472.719091] RDX: 00000000bbc00000 RSI: 0000000000000003 RDI: 
>>> 000000000000001d
>>> [ 6472.719091] RBP: 00007fbd175fc800 R08: 0000000000000000 R09: 
>>> 0000000000000000
>>> [ 6472.719091] R10: 0000000000200000 R11: 0000000000000293 R12: 
>>> 00007ffeea066d2e
>>> [ 6472.719092] R13: 00007ffeea066d2f R14: 00007fbd175fe700 R15: 
>>> 00007fbd175fcdc0
>>> [ 6472.719092] ---[ end trace c97dc6281a861980 ]---
>>
>> Agreed, same over here. :(
>>
> 
> I *think* the following on top makes it fly
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..5cf7f6a6c1a6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)
> 
>                         del += t - f;
> 
> +                       hugetlb_cgroup_uncharge_file_region(
> +                               resv, rg, t - f);
> +
>                         /* New entry for end of split region */
>                         nrg->from = t;
>                         nrg->to = rg->to;
> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
>                         /* Original entry is trimmed */
>                         rg->to = f;
> 
> -                       hugetlb_cgroup_uncharge_file_region(
> -                               resv, rg, nrg->to - nrg->from);
> -
>                         list_add(&nrg->link, &rg->link);
>                         nrg = NULL;
>                         break;
> 
> 

(sorry for the nasty line wrapping - Thunderbird update keeps breaking
helpful plugins - like Toggle Word Wrap in this case ... argh)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21 13:34                   ` David Hildenbrand
  2020-10-21 13:38                     ` David Hildenbrand
@ 2020-10-21 16:58                     ` Mike Kravetz
  2020-10-21 17:30                       ` David Hildenbrand
  1 sibling, 1 reply; 18+ messages in thread
From: Mike Kravetz @ 2020-10-21 16:58 UTC (permalink / raw)
  To: David Hildenbrand, Michal Privoznik, Mina Almasry
  Cc: linux-kernel, linux-mm, Michael S. Tsirkin, Michal Hocko,
	Muchun Song, Aneesh Kumar K.V, Tejun Heo

> On 21.10.20 15:11, David Hildenbrand wrote:
>> On 21.10.20 14:57, Michal Privoznik wrote:
>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>
>>>> It would be good if Mina (at least) would look these over.  Would also
>>>> be interesting to know if these fixes address the bug seen with the qemu
>>>> use case.
>>>>
>>>> I'm still doing more testing and code inspection to look for other issues.
>>>>
...
...
>>>
>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>> [ 6472.719047] ------------[ cut here ]------------
>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57 
...
...
>>
>> Agreed, same over here. :(
>>
> 
> I *think* the following on top makes it fly
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 67fc6383995b..5cf7f6a6c1a6 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)
> 
>                         del += t - f;
> 
> +                       hugetlb_cgroup_uncharge_file_region(
> +                               resv, rg, t - f);
> +
>                         /* New entry for end of split region */
>                         nrg->from = t;
>                         nrg->to = rg->to;
> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
>                         /* Original entry is trimmed */
>                         rg->to = f;
> 
> -                       hugetlb_cgroup_uncharge_file_region(
> -                               resv, rg, nrg->to - nrg->from);
> -
>                         list_add(&nrg->link, &rg->link);
>                         nrg = NULL;
>                         break;
> 
> 

Thanks, yes that certainly does look like a bug in that code.

Does that resolve the issue with qemu?

I want to do a little more testing/research before sending a patch later
today.
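
For reference, a minimal userspace sketch along the lines below should
exercise that split path (only a sketch: it assumes a 2 MB hugetlbfs mount at
/dev/hugepages, free huge pages, and the task running in a non-root cgroup
with the hugetlb controller enabled; adjust paths and sizes as needed):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPAGE (2UL * 1024 * 1024)

int main(void)
{
	int fd = open("/dev/hugepages/repro", O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("open"); return 1; }

	/* Size the file and reserve 8 huge pages via a shared mapping. */
	if (ftruncate(fd, 8 * HPAGE)) { perror("ftruncate"); return 1; }
	char *p = mmap(NULL, 8 * HPAGE, PROT_READ | PROT_WRITE,
		       MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	/* Fault the pages in, then drop the mapping; the reservations
	 * stay attached to the inode. */
	for (unsigned long i = 0; i < 8 * HPAGE; i += HPAGE)
		p[i] = 1;
	munmap(p, 8 * HPAGE);

	/* Punch out pages 2..5 - strictly inside the reserved range -
	 * which forces the split in region_del(). */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      2 * HPAGE, 4 * HPAGE))
		perror("fallocate");

	close(fd);
	unlink("/dev/hugepages/repro");
	return 0;
}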
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5
  2020-10-21 16:58                     ` Mike Kravetz
@ 2020-10-21 17:30                       ` David Hildenbrand
  0 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2020-10-21 17:30 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: David Hildenbrand, Michal Privoznik, Mina Almasry, linux-kernel,
	linux-mm, Michael S. Tsirkin, Michal Hocko, Muchun Song,
	Aneesh Kumar K.V, Tejun Heo


> On 21.10.2020 at 18:58, Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> 
>>> On 21.10.20 15:11, David Hildenbrand wrote:
>>> On 21.10.20 14:57, Michal Privoznik wrote:
>>>> On 10/21/20 5:35 AM, Mike Kravetz wrote:
>>>>> On 10/20/20 6:38 AM, David Hildenbrand wrote:
>>>>> It would be good if Mina (at least) would look these over.  Would also
>>>>> be interesting to know if these fixes address the bug seen with the qemu
>>>>> use case.
>>>>> I'm still doing more testing and code inspection to look for other issues.
> ...
> ...
>>>> I've applied, rebuilt and tested, but unfortunately I still hit the problem:
>>>> [ 6472.719047] ------------[ cut here ]------------
>>>> [ 6472.719052] WARNING: CPU: 6 PID: 11773 at mm/page_counter.c:57
> ...
> ...
>>> Agreed, same over here. :(
>> 
>> I *think* the following on top makes it fly
>> 
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 67fc6383995b..5cf7f6a6c1a6 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -656,6 +656,9 @@ static long region_del(struct resv_map *resv, long f, long t)
>> 
>>                        del += t - f;
>> 
>> +                       hugetlb_cgroup_uncharge_file_region(
>> +                               resv, rg, t - f);
>> +
>>                        /* New entry for end of split region */
>>                        nrg->from = t;
>>                        nrg->to = rg->to;
>> @@ -667,9 +670,6 @@ static long region_del(struct resv_map *resv, long f, long t)
>>                        /* Original entry is trimmed */
>>                        rg->to = f;
>> 
>> -                       hugetlb_cgroup_uncharge_file_region(
>> -                               resv, rg, nrg->to - nrg->from);
>> -
>>                        list_add(&nrg->link, &rg->link);
>>                        nrg = NULL;
>>                        break;
> 
> Thanks, yes that certainly does look like a bug in that code.
> 
> Does that resolve the issue with qemu?

I was no longer able to reproduce the issue, so I guess we found all of them!

Thanks!

> 
> I want to do a little more testing/research before sending a patch later
> today.
> -- 
> Mike Kravetz



^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-10-21 17:30 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-14 15:22 cgroup and FALLOC_FL_PUNCH_HOLE: WARNING: CPU: 13 PID: 2438 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x5 David Hildenbrand
2020-10-14 16:15 ` David Hildenbrand
2020-10-14 17:56   ` Mina Almasry
2020-10-14 18:18     ` David Hildenbrand
2020-10-14 18:31       ` Mike Kravetz
2020-10-15  7:56         ` David Hildenbrand
2020-10-15  8:57           ` David Hildenbrand
2020-10-15  9:01             ` David Hildenbrand
2020-10-15 23:14         ` Mike Kravetz
2020-10-20 13:38           ` David Hildenbrand
2020-10-21  3:35             ` Mike Kravetz
2020-10-21 12:42               ` David Hildenbrand
2020-10-21 12:57               ` Michal Privoznik
2020-10-21 13:11                 ` David Hildenbrand
2020-10-21 13:34                   ` David Hildenbrand
2020-10-21 13:38                     ` David Hildenbrand
2020-10-21 16:58                     ` Mike Kravetz
2020-10-21 17:30                       ` David Hildenbrand
