The patch titled Subject: ocfs2: correctly use ocfs2_find_next_zero_bit() has been added to the -mm mm-nonmm-unstable branch. Its filename is ocfs2-correctly-use-ocfs2_find_next_zero_bit.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/ocfs2-correctly-use-ocfs2_find_next_zero_bit.patch This patch will later appear in the mm-nonmm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Joseph Qi <joseph.qi@linux.alibaba.com> Subject: ocfs2: correctly use ocfs2_find_next_zero_bit() Date: Thu, 14 Mar 2024 10:17:13 +0800 If no bits are zero, ocfs2_find_next_zero_bit() will return the max size, so checking the return value against -1 is meaningless. Correct this usage and clean up the code. Link: https://lkml.kernel.org/r/20240314021713.240796-1-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- fs/ocfs2/localalloc.c | 19 ++++++------------- fs/ocfs2/reservations.c | 2 +- fs/ocfs2/suballoc.c | 6 ++---- 3 files changed, 9 insertions(+), 18 deletions(-) --- a/fs/ocfs2/localalloc.c~ocfs2-correctly-use-ocfs2_find_next_zero_bit +++ a/fs/ocfs2/localalloc.c @@ -863,14 +863,8 @@ static int ocfs2_local_alloc_find_clear_ numfound = bitoff = startoff = 0; left = le32_to_cpu(alloc->id1.bitmap1.i_total); - while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) != -1) { - if (bitoff == left) { - /* mlog(0, "bitoff (%d) == left", bitoff); */ - break; - } - /* mlog(0, "Found a zero: bitoff = %d, startoff = %d, " - "numfound = %d\n", bitoff, startoff, numfound);*/ - + while ((bitoff = ocfs2_find_next_zero_bit(bitmap, left, startoff)) < + left) { /* Ok, we found a zero bit... is it contig.
or do we * start over?*/ if (bitoff == startoff) { @@ -976,9 +970,9 @@ static int ocfs2_sync_local_to_main(stru start = count = 0; left = le32_to_cpu(alloc->id1.bitmap1.i_total); - while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) - != -1) { - if ((bit_off < left) && (bit_off == start)) { + while ((bit_off = ocfs2_find_next_zero_bit(bitmap, left, start)) < + left) { + if (bit_off == start) { count++; start++; continue; @@ -1002,8 +996,7 @@ static int ocfs2_sync_local_to_main(stru goto bail; } } - if (bit_off >= left) - break; + count = 1; start = bit_off + 1; } --- a/fs/ocfs2/reservations.c~ocfs2-correctly-use-ocfs2_find_next_zero_bit +++ a/fs/ocfs2/reservations.c @@ -414,7 +414,7 @@ static int ocfs2_resmap_find_free_bits(s start = search_start; while ((offset = ocfs2_find_next_zero_bit(bitmap, resmap->m_bitmap_len, - start)) != -1) { + start)) < resmap->m_bitmap_len) { /* Search reached end of the region */ if (offset >= (search_start + search_len)) break; --- a/fs/ocfs2/suballoc.c~ocfs2-correctly-use-ocfs2_find_next_zero_bit +++ a/fs/ocfs2/suballoc.c @@ -1290,10 +1290,8 @@ static int ocfs2_block_group_find_clear_ found = start = best_offset = best_size = 0; bitmap = bg->bg_bitmap; - while((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) != -1) { - if (offset == total_bits) - break; - + while ((offset = ocfs2_find_next_zero_bit(bitmap, total_bits, start)) < + total_bits) { if (!ocfs2_test_bg_bit_allocatable(bg_bh, offset)) { /* We found a zero, but we can't use it as it * hasn't been put to disk yet! */ _ Patches currently in -mm which might be from joseph.qi@linux.alibaba.com are ocfs2-correctly-use-ocfs2_find_next_zero_bit.patch
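A minimal userspace sketch of the return convention the changelog relies on; the helper below is a stand-in with find_next_zero_bit() semantics, not the ocfs2 code itself. When no zero bit exists at or after the start offset, the function returns the size argument rather than -1, so a "!= -1" loop guard can never become false and "< size" is the correct check:

#include <stdio.h>

#define BITS_PER_LONG (8 * sizeof(unsigned long))

/* Stand-in with find_next_zero_bit() semantics: returns the index of the
 * next zero bit at or after 'start', or 'size' if there is none. */
static unsigned long toy_find_next_zero_bit(const unsigned long *bitmap,
                                            unsigned long size,
                                            unsigned long start)
{
    unsigned long bit;

    for (bit = start; bit < size; bit++)
        if (!(bitmap[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG))))
            return bit;
    return size;    /* no zero bit found -- never -1 */
}

int main(void)
{
    unsigned long bitmap[1] = { ~0UL };    /* all bits set: no zero bits */
    unsigned long size = 64, start = 0, bit;

    /* The corrected loop shape used by the patch: */
    while ((bit = toy_find_next_zero_bit(bitmap, size, start)) < size)
        start = bit + 1;    /* consume the zero bit */

    /* A "!= -1" guard here would never become false. */
    printf("loop ended with bit = %lu (== size)\n", bit);
    return 0;
}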
The patch titled Subject: selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg has been added to the -mm mm-unstable branch. Its filename is selftests-mm-virtual_address_range-switch-to-ksft_exit_fail_msg.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/selftests-mm-virtual_address_range-switch-to-ksft_exit_fail_msg.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Dev Jain <dev.jain@arm.com> Subject: selftests/mm: virtual_address_range: Switch to ksft_exit_fail_msg Date: Thu, 14 Mar 2024 17:52:50 +0530 mmap() must not succeed in validate_lower_address_hint(), for if it does, it is a bug in mmap() itself. Reflect this behaviour with ksft_exit_fail_msg(). While at it, do some formatting changes. Link: https://lkml.kernel.org/r/20240314122250.68534-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- tools/testing/selftests/mm/virtual_address_range.c | 12 ++++------- 1 file changed, 5 insertions(+), 7 deletions(-) --- a/tools/testing/selftests/mm/virtual_address_range.c~selftests-mm-virtual_address_range-switch-to-ksft_exit_fail_msg +++ a/tools/testing/selftests/mm/virtual_address_range.c @@ -85,7 +85,7 @@ static int validate_lower_address_hint(v char *ptr; ptr = mmap((void *) (1UL << 45), MAP_CHUNK_SIZE, PROT_READ | - PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (ptr == MAP_FAILED) return 0; @@ -105,13 +105,11 @@ int main(int argc, char *argv[]) for (i = 0; i < NR_CHUNKS_LOW; i++) { ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, - MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (ptr[i] == MAP_FAILED) { - if (validate_lower_address_hint()) { - ksft_test_result_skip("Memory constraint not fulfilled\n"); - ksft_finished(); - } + if (validate_lower_address_hint()) + ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n"); break; } @@ -127,7 +125,7 @@ int main(int argc, char *argv[]) for (i = 0; i < NR_CHUNKS_HIGH; i++) { hint = hind_addr(); hptr[i] = mmap(hint, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, - MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); + MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); if (hptr[i] == MAP_FAILED) break; _ Patches currently in -mm which might be from dev.jain@arm.com are selftests-mm-virtual_address_range-switch-to-ksft_exit_fail_msg.patch
The patch titled Subject: selftests: mm: restore settings from only parent process has been added to the -mm mm-hotfixes-unstable branch. Its filename is selftests-mm-restore-settings-from-only-parent-process.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/selftests-mm-restore-settings-from-only-parent-process.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Muhammad Usama Anjum <usama.anjum@collabora.com> Subject: selftests: mm: restore settings from only parent process Date: Thu, 14 Mar 2024 14:40:45 +0500 The atexit() handler is called from the parent process as well as from forked child processes. Hence the child restores the settings at exit while the parent is still executing. Fix this by checking the pid of the process running the atexit() handler and only restoring the nr_hugepages setting from the parent process. Link: https://lkml.kernel.org/r/20240314094045.157149-1-usama.anjum@collabora.com Fixes: c23ea61726d5 ("selftests/mm: protection_keys: save/restore nr_hugepages settings") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Tested-by: Joey Gouly <joey.gouly@arm.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- tools/testing/selftests/mm/protection_keys.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) --- a/tools/testing/selftests/mm/protection_keys.c~selftests-mm-restore-settings-from-only-parent-process +++ a/tools/testing/selftests/mm/protection_keys.c @@ -1745,9 +1745,12 @@ void pkey_setup_shadow(void) shadow_pkey_reg = __read_pkey_reg(); } +pid_t parent_pid; + void restore_settings_atexit(void) { - cat_into_file(buf, "/proc/sys/vm/nr_hugepages"); + if (parent_pid == getpid()) + cat_into_file(buf, "/proc/sys/vm/nr_hugepages"); } void save_settings(void) @@ -1773,6 +1776,7 @@ void save_settings(void) exit(__LINE__); } + parent_pid = getpid(); atexit(restore_settings_atexit); close(fd); } _ Patches currently in -mm which might be from usama.anjum@collabora.com are selftests-mm-restore-settings-from-only-parent-process.patch
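The failure mode and the fix pattern, reduced to a standalone userspace sketch; the names mirror the patch, but the bodies are illustrative rather than the selftest's own. atexit() handlers registered before fork() are inherited by the child, so they also run on the child's exit() unless guarded by a pid check:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t parent_pid;

static void restore_settings_atexit(void)
{
    /* Without this check the inherited handler also runs on the child's
     * exit(), restoring settings while the parent is still mid-test. */
    if (getpid() == parent_pid)
        printf("pid %d (parent): restoring settings\n", getpid());
    else
        printf("pid %d (child): skipping restore\n", getpid());
}

int main(void)
{
    pid_t child;

    parent_pid = getpid();
    atexit(restore_settings_atexit);

    child = fork();
    if (child == 0)
        exit(0);            /* child: handler runs, the guard skips the restore */

    waitpid(child, NULL, 0);
    return 0;               /* parent: handler runs and restores */
}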
The patch titled Subject: tools/Makefile: remove cgroup target has been added to the -mm mm-hotfixes-unstable branch. Its filename is tools-makefile-remove-cgroup-target.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/tools-makefile-remove-cgroup-target.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Cong Liu <liucong2@kylinos.cn> Subject: tools/Makefile: remove cgroup target Date: Fri, 15 Mar 2024 09:22:48 +0800 The tools/cgroup directory no longer contains a Makefile. This patch updates the top-level tools/Makefile to remove references to building and installing cgroup components. This change reflects the current structure of the tools directory and fixes the build failure when building tools in the top-level directory. linux/tools$ make cgroup DESCEND cgroup make[1]: *** No targets specified and no makefile found. Stop. make: *** [Makefile:73: cgroup] Error 2 Link: https://lkml.kernel.org/r/20240315012249.439639-1-liucong2@kylinos.cn Signed-off-by: Cong Liu <liucong2@kylinos.cn> Acked-by: Stanislav Fomichev <sdf@google.com> Reviewed-by: Dmitry Rokosov <ddrokosov@salutedevices.com> Cc: Cong Liu <liucong2@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- tools/Makefile | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) --- a/tools/Makefile~tools-makefile-remove-cgroup-target +++ a/tools/Makefile @@ -11,7 +11,6 @@ help: @echo '' @echo ' acpi - ACPI tools' @echo ' bpf - misc BPF tools' - @echo ' cgroup - cgroup tools' @echo ' counter - counter tools' @echo ' cpupower - a tool for all things x86 CPU power' @echo ' debugging - tools for debugging' @@ -69,7 +68,7 @@ acpi: FORCE cpupower: FORCE $(call descend,power/$@) -cgroup counter firewire hv guest bootconfig spi usb virtio mm bpf iio gpio objtool leds wmi pci firmware debugging tracing: FORCE +counter firewire hv guest bootconfig spi usb virtio mm bpf iio gpio objtool leds wmi pci firmware debugging tracing: FORCE $(call descend,$@) bpf/%: FORCE @@ -116,7 +115,7 @@ freefall: FORCE kvm_stat: FORCE $(call descend,kvm/$@) -all: acpi cgroup counter cpupower gpio hv firewire \ +all: acpi counter cpupower gpio hv firewire \ perf selftests bootconfig spi turbostat usb \ virtio mm bpf x86_energy_perf_policy \ tmon freefall iio objtool kvm_stat wmi \ @@ -128,7 +127,7 @@ acpi_install: cpupower_install: $(call descend,power/$(@:_install=),install) -cgroup_install counter_install firewire_install gpio_install hv_install iio_install perf_install bootconfig_install spi_install usb_install virtio_install mm_install bpf_install objtool_install wmi_install pci_install debugging_install tracing_install: +counter_install firewire_install gpio_install hv_install iio_install perf_install bootconfig_install spi_install usb_install virtio_install mm_install bpf_install objtool_install wmi_install pci_install 
debugging_install tracing_install: $(call descend,$(@:_install=),install) selftests_install: @@ -155,7 +154,7 @@ freefall_install: kvm_stat_install: $(call descend,kvm/$(@:_install=),install) -install: acpi_install cgroup_install counter_install cpupower_install gpio_install \ +install: acpi_install counter_install cpupower_install gpio_install \ hv_install firewire_install iio_install \ perf_install selftests_install turbostat_install usb_install \ virtio_install mm_install bpf_install x86_energy_perf_policy_install \ @@ -169,7 +168,7 @@ acpi_clean: cpupower_clean: $(call descend,power/cpupower,clean) -cgroup_clean counter_clean hv_clean firewire_clean bootconfig_clean spi_clean usb_clean virtio_clean mm_clean wmi_clean bpf_clean iio_clean gpio_clean objtool_clean leds_clean pci_clean firmware_clean debugging_clean tracing_clean: +counter_clean hv_clean firewire_clean bootconfig_clean spi_clean usb_clean virtio_clean mm_clean wmi_clean bpf_clean iio_clean gpio_clean objtool_clean leds_clean pci_clean firmware_clean debugging_clean tracing_clean: $(call descend,$(@:_clean=),clean) libapi_clean: @@ -209,7 +208,7 @@ freefall_clean: build_clean: $(call descend,build,clean) -clean: acpi_clean cgroup_clean counter_clean cpupower_clean hv_clean firewire_clean \ +clean: acpi_clean counter_clean cpupower_clean hv_clean firewire_clean \ perf_clean selftests_clean turbostat_clean bootconfig_clean spi_clean usb_clean virtio_clean \ mm_clean bpf_clean iio_clean x86_energy_perf_policy_clean tmon_clean \ freefall_clean build_clean libbpf_clean libsubcmd_clean \ _ Patches currently in -mm which might be from liucong2@kylinos.cn are tools-makefile-remove-cgroup-target.patch
The patch titled Subject: mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) has been added to the -mm mm-unstable branch. Its filename is mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: David Hildenbrand <david@redhat.com> Subject: mm/madvise: don't perform madvise VMA walk for MADV_POPULATE_(READ|WRITE) Date: Thu, 14 Mar 2024 17:13:00 +0100 We changed faultin_page_range() to no longer consume a VMA, because faultin_page_range() might internally release the mm lock to lookup the VMA again -- required to cleanly handle VM_FAULT_RETRY. But independent of that, __get_user_pages() will always lookup the VMA itself. Now that we let __get_user_pages() just handle VMA checks in a way that is suitable for MADV_POPULATE_(READ|WRITE), the VMA walk in madvise() is just overhead. So let's just call madvise_populate() on the full range instead. There is one change in behavior: madvise_walk_vmas() would skip any VMA holes, and if everything succeeded, it would return -ENOMEM after processing all VMAs. However, for MADV_POPULATE_(READ|WRITE) it's unlikely for the caller to notice any difference: -ENOMEM might either indicate that there were VMA holes or that populating page tables failed because there was not enough memory. So it's unlikely that user space will notice the difference, and that special handling likely only makes sense for some other madvise() actions. Further, we'd already fail with -ENOMEM early in the past if looking up the VMA after dropping the MM lock failed because of concurrent VMA modifications. So let's just keep it simple and avoid the madvise VMA walk, and consistently fail early if we find a VMA hole. Link: https://lkml.kernel.org/r/20240314161300.382526-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Cc: Darrick J. 
Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/madvise.c | 26 ++++++++++++-------------- 1 file changed, 12 insertions(+), 14 deletions(-) --- a/mm/madvise.c~mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite +++ a/mm/madvise.c @@ -901,26 +901,19 @@ static long madvise_dontneed_free(struct return -EINVAL; } -static long madvise_populate(struct vm_area_struct *vma, - struct vm_area_struct **prev, - unsigned long start, unsigned long end, - int behavior) +static long madvise_populate(struct mm_struct *mm, unsigned long start, + unsigned long end, int behavior) { const bool write = behavior == MADV_POPULATE_WRITE; - struct mm_struct *mm = vma->vm_mm; int locked = 1; long pages; - *prev = vma; - while (start < end) { /* Populate (prefault) page tables readable/writable. */ pages = faultin_page_range(mm, start, end, write, &locked); if (!locked) { mmap_read_lock(mm); locked = 1; - *prev = NULL; - vma = NULL; } if (pages < 0) { switch (pages) { @@ -1021,9 +1014,6 @@ static int madvise_vma_behavior(struct v case MADV_DONTNEED: case MADV_DONTNEED_LOCKED: return madvise_dontneed_free(vma, prev, start, end, behavior); - case MADV_POPULATE_READ: - case MADV_POPULATE_WRITE: - return madvise_populate(vma, prev, start, end, behavior); case MADV_NORMAL: new_flags = new_flags & ~VM_RAND_READ & ~VM_SEQ_READ; break; @@ -1425,8 +1415,16 @@ int do_madvise(struct mm_struct *mm, uns end = start + len; blk_start_plug(&plug); - error = madvise_walk_vmas(mm, start, end, behavior, - madvise_vma_behavior); + switch (behavior) { + case MADV_POPULATE_READ: + case MADV_POPULATE_WRITE: + error = madvise_populate(mm, start, end, behavior); + break; + default: + error = madvise_walk_vmas(mm, start, end, behavior, + madvise_vma_behavior); + break; + } blk_finish_plug(&plug); if (write) mmap_write_unlock(mm); _ Patches currently in -mm which might be from david@redhat.com are mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly.patch mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite.patch
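A small userspace sketch of the -ENOMEM semantics discussed above, assuming a kernel that provides MADV_POPULATE_READ (5.14+): a range containing a VMA hole makes the call fail with ENOMEM, the same errno a caller would see if populating page tables ran out of memory, which is why failing early on holes is expected to go unnoticed by user space.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22    /* uapi value, in case libc headers are older */
#endif

int main(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = 4 * page;
    char *p;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    /* Punch a hole in the middle so the range is not fully covered by VMAs. */
    munmap(p + page, page);

    if (madvise(p, len, MADV_POPULATE_READ) != 0)
        printf("MADV_POPULATE_READ: %s\n", strerror(errno));    /* expect ENOMEM */

    munmap(p, len);
    return 0;
}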
The patch titled Subject: mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly has been added to the -mm mm-unstable branch. Its filename is mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: David Hildenbrand <david@redhat.com> Subject: mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly Date: Thu, 14 Mar 2024 17:12:59 +0100 Patch series "mm/madvise: make MADV_POPULATE_(READ|WRITE) handle VM_FAULT_RETRY properly". Derrick reports that in some cases where pread() would fail with -EIO and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ / MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT. It all boils down to missing VM_FAULT_RETRY handling. Let's try to handle that in a better way, similar to how ordinary GUP handles it. Details in patch #1. In short, move special MADV_POPULATE_(READ|WRITE) VMA handling into __get_user_pages(), and make faultin_page_range() call __get_user_pages_locked(), which handles VM_FAULT_RETRY. Further, avoid the now-useless madvise VMA walk, because __get_user_pages() will perform the VMA lookup either way. I briefly played with handling the FOLL_MADV_POPULATE checks in __get_user_pages() a bit differently, integrating them with existing handling, but it ended up looking worse. So I decided to keep it simple. Likely, we need better selftests, but the reproducer from Darrick might be a bit hard to convert into a simple selftest. Note that using mlock() in Darricks reproducer results in a similar endless retry. Likely, that is not what we want, and we should handle VM_FAULT_RETRY in populate_vma_page_range() / __mm_populate() as well. However, similarly using __get_user_pages_locked() might be more complicated, because of the advanced VMA handling in populate_vma_page_range(). Further, most populate_vma_page_range() callers simply ignore the return values, so it's unclear in which cases we expect to just silently fail, or where we'd want to retry+fail or endlessly retry instead. This patch (of 2): Darrick reports that in some cases where pread() would fail with -EIO and mmap()+access would generate a SIGBUS signal, MADV_POPULATE_READ / MADV_POPULATE_WRITE will keep retrying forever and not fail with -EFAULT. While the madvise() call can be interrupted by a signal, this is not the desired behavior. MADV_POPULATE_READ / MADV_POPULATE_WRITE should behave like page faults in that case: fail and not retry forever. A reproducer can be found at [1]. 
The reason is that __get_user_pages(), as called by faultin_vma_page_range(), will not handle VM_FAULT_RETRY in a proper way: it will simply return 0 when VM_FAULT_RETRY happened, making madvise_populate()->faultin_vma_page_range() retry again and again, never setting FOLL_TRIED->FAULT_FLAG_TRIED for __get_user_pages(). __get_user_pages_locked() does what we want, but duplicating that logic in faultin_vma_page_range() feels wrong. So let's use __get_user_pages_locked() instead, that will detect VM_FAULT_RETRY and set FOLL_TRIED when retrying, making the fault handler return VM_FAULT_SIGBUS (VM_FAULT_ERROR) at some point, propagating -EFAULT from faultin_page() to __get_user_pages(), all the way to madvise_populate(). But, there is an issue: __get_user_pages_locked() will end up re-taking the MM lock and then __get_user_pages() will do another VMA lookup. In the meantime, the VMA layout could have changed and we'd fail with different error codes than we'd want to. As __get_user_pages() will currently do a new VMA lookup either way, let it do the VMA handling in a different way, controlled by a new FOLL_MADV_POPULATE flag, effectively moving these checks from madvise_populate() + faultin_page_range() in there. With this change, Darricks reproducer properly fails with -EFAULT, as documented for MADV_POPULATE_READ / MADV_POPULATE_WRITE. [1] https://lore.kernel.org/all/20240313171936.GN1927156@frogsfrogsfrogs/ Link: https://lkml.kernel.org/r/20240314161300.382526-1-david@redhat.com Link: https://lkml.kernel.org/r/20240314161300.382526-2-david@redhat.com Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables") Signed-off-by: David Hildenbrand <david@redhat.com> Reported-by: Darrick J. Wong <djwong@kernel.org> Closes: https://lore.kernel.org/all/20240311223815.GW1927156@frogsfrogsfrogs/ Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/gup.c | 54 ++++++++++++++++++++++++++++-------------------- mm/internal.h | 10 +++++--- mm/madvise.c | 17 +-------------- 3 files changed, 40 insertions(+), 41 deletions(-) --- a/mm/gup.c~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly +++ a/mm/gup.c @@ -1206,6 +1206,22 @@ static long __get_user_pages(struct mm_s /* first iteration or cross vma bound */ if (!vma || start >= vma->vm_end) { + /* + * MADV_POPULATE_(READ|WRITE) wants to handle VMA + * lookups+error reporting differently. + */ + if (gup_flags & FOLL_MADV_POPULATE) { + vma = vma_lookup(mm, start); + if (!vma) { + ret = -ENOMEM; + goto out; + } + if (check_vma_flags(vma, gup_flags)) { + ret = -EINVAL; + goto out; + } + goto retry; + } vma = gup_vma_lookup(mm, start); if (!vma && in_gate_area(mm, start)) { ret = get_gate_page(mm, start & PAGE_MASK, @@ -1683,35 +1699,35 @@ long populate_vma_page_range(struct vm_a } /* - * faultin_vma_page_range() - populate (prefault) page tables inside the - * given VMA range readable/writable + * faultin_page_range() - populate (prefault) page tables inside the + * given range readable/writable * * This takes care of mlocking the pages, too, if VM_LOCKED is set. 
* - * @vma: target vma + * @mm: the mm to populate page tables in * @start: start address * @end: end address * @write: whether to prefault readable or writable * @locked: whether the mmap_lock is still held * - * Returns either number of processed pages in the vma, or a negative error - * code on error (see __get_user_pages()). + * Returns either number of processed pages in the MM, or a negative error + * code on error (see __get_user_pages()). Note that this function reports + * errors related to VMAs, such as incompatible mappings, as expected by + * MADV_POPULATE_(READ|WRITE). * - * vma->vm_mm->mmap_lock must be held. The range must be page-aligned and - * covered by the VMA. If it's released, *@locked will be set to 0. + * The range must be page-aligned. + * + * mm->mmap_lock must be held. If it's released, *@locked will be set to 0. */ -long faultin_vma_page_range(struct vm_area_struct *vma, unsigned long start, - unsigned long end, bool write, int *locked) +long faultin_page_range(struct mm_struct *mm, unsigned long start, + unsigned long end, bool write, int *locked) { - struct mm_struct *mm = vma->vm_mm; unsigned long nr_pages = (end - start) / PAGE_SIZE; int gup_flags; long ret; VM_BUG_ON(!PAGE_ALIGNED(start)); VM_BUG_ON(!PAGE_ALIGNED(end)); - VM_BUG_ON_VMA(start < vma->vm_start, vma); - VM_BUG_ON_VMA(end > vma->vm_end, vma); mmap_assert_locked(mm); /* @@ -1723,19 +1739,13 @@ long faultin_vma_page_range(struct vm_ar * a poisoned page. * !FOLL_FORCE: Require proper access permissions. */ - gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE; + gup_flags = FOLL_TOUCH | FOLL_HWPOISON | FOLL_UNLOCKABLE | + FOLL_MADV_POPULATE; if (write) gup_flags |= FOLL_WRITE; - /* - * We want to report -EINVAL instead of -EFAULT for any permission - * problems or incompatible mappings. 
- */ - if (check_vma_flags(vma, gup_flags)) - return -EINVAL; - - ret = __get_user_pages(mm, start, nr_pages, gup_flags, - NULL, locked); + ret = __get_user_pages_locked(mm, start, nr_pages, NULL, locked, + gup_flags); lru_add_drain(); return ret; } --- a/mm/internal.h~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly +++ a/mm/internal.h @@ -686,9 +686,8 @@ struct anon_vma *folio_anon_vma(struct f void unmap_mapping_folio(struct folio *folio); extern long populate_vma_page_range(struct vm_area_struct *vma, unsigned long start, unsigned long end, int *locked); -extern long faultin_vma_page_range(struct vm_area_struct *vma, - unsigned long start, unsigned long end, - bool write, int *locked); +extern long faultin_page_range(struct mm_struct *mm, unsigned long start, + unsigned long end, bool write, int *locked); extern bool mlock_future_ok(struct mm_struct *mm, unsigned long flags, unsigned long bytes); @@ -1127,10 +1126,13 @@ enum { FOLL_FAST_ONLY = 1 << 20, /* allow unlocking the mmap lock */ FOLL_UNLOCKABLE = 1 << 21, + /* VMA lookup+checks compatible with MADV_POPULATE_(READ|WRITE) */ + FOLL_MADV_POPULATE = 1 << 22, }; #define INTERNAL_GUP_FLAGS (FOLL_TOUCH | FOLL_TRIED | FOLL_REMOTE | FOLL_PIN | \ - FOLL_FAST_ONLY | FOLL_UNLOCKABLE) + FOLL_FAST_ONLY | FOLL_UNLOCKABLE | \ + FOLL_MADV_POPULATE) /* * Indicates for which pages that are write-protected in the page table, --- a/mm/madvise.c~mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly +++ a/mm/madvise.c @@ -908,27 +908,14 @@ static long madvise_populate(struct vm_a { const bool write = behavior == MADV_POPULATE_WRITE; struct mm_struct *mm = vma->vm_mm; - unsigned long tmp_end; int locked = 1; long pages; *prev = vma; while (start < end) { - /* - * We might have temporarily dropped the lock. For example, - * our VMA might have been split. - */ - if (!vma || start >= vma->vm_end) { - vma = vma_lookup(mm, start); - if (!vma) - return -ENOMEM; - } - - tmp_end = min_t(unsigned long, end, vma->vm_end); /* Populate (prefault) page tables readable/writable. */ - pages = faultin_vma_page_range(vma, start, tmp_end, write, - &locked); + pages = faultin_page_range(mm, start, end, write, &locked); if (!locked) { mmap_read_lock(mm); locked = 1; @@ -949,7 +936,7 @@ static long madvise_populate(struct vm_a pr_warn_once("%s: unhandled return value: %ld\n", __func__, pages); fallthrough; - case -ENOMEM: + case -ENOMEM: /* No VMA or out of memory. */ return -ENOMEM; } } _ Patches currently in -mm which might be from david@redhat.com are mm-madvise-make-madv_populate_readwrite-handle-vm_fault_retry-properly.patch mm-madvise-dont-perform-madvise-vma-walk-for-madv_populate_readwrite.patch
The patch titled Subject: mm: cachestat: fix two shmem bugs has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-cachestat-fix-two-shmem-bugs.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-cachestat-fix-two-shmem-bugs.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Johannes Weiner <hannes@cmpxchg.org> Subject: mm: cachestat: fix two shmem bugs Date: Fri, 15 Mar 2024 05:55:56 -0400 When cachestat on shmem races with swapping and invalidation, there are two possible bugs: 1) A swapin error can have resulted in a poisoned swap entry in the shmem inode's xarray. Calling get_shadow_from_swap_cache() on it will result in an out-of-bounds access to swapper_spaces[]. Validate the entry with non_swap_entry() before going further. 2) When we find a valid swap entry in the shmem's inode, the shadow entry in the swapcache might not exist yet: swap IO is still in progress and we're before __remove_mapping; swapin, invalidation, or swapoff have removed the shadow from swapcache after we saw the shmem swap entry. This will send a NULL to workingset_test_recent(). The latter purely operates on pointer bits, so it won't crash - node 0, memcg ID 0, eviction timestamp 0, etc. are all valid inputs - but it's a bogus test. In theory that could result in a false "recently evicted" count. Such a false positive wouldn't be the end of the world. But for code clarity and (future) robustness, be explicit about this case. Bail on get_shadow_from_swap_cache() returning NULL. Link: https://lkml.kernel.org/r/20240315095556.GC581298@cmpxchg.org Fixes: cf264e1329fb ("cachestat: implement cachestat syscall") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Chengming Zhou <chengming.zhou@linux.dev> [Bug #1] Reported-by: Jann Horn <jannh@google.com> [Bug #2] Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> [v6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/filemap.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) --- a/mm/filemap.c~mm-cachestat-fix-two-shmem-bugs +++ a/mm/filemap.c @@ -4197,7 +4197,23 @@ static void filemap_cachestat(struct add /* shmem file - in swap cache */ swp_entry_t swp = radix_to_swp_entry(folio); + /* swapin error results in poisoned entry */ + if (non_swap_entry(swp)) + goto resched; + + /* + * Getting a swap entry from the shmem + * inode means we beat + * shmem_unuse(). rcu_read_lock() + * ensures swapoff waits for us before + * freeing the swapper space. However, + * we can race with swapping and + * invalidation, so there might not be + * a shadow in the swapcache (yet). 
+ */ shadow = get_shadow_from_swap_cache(swp); + if (!shadow) + goto resched; } #endif if (workingset_test_recent(shadow, true, &workingset)) _ Patches currently in -mm which might be from hannes@cmpxchg.org are mm-cachestat-fix-two-shmem-bugs.patch
The patch titled Subject: mm: increase folio batch size has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-increase-folio-batch-size.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-increase-folio-batch-size.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: "Matthew Wilcox (Oracle)" <willy@infradead.org> Subject: mm: increase folio batch size Date: Fri, 15 Mar 2024 14:08:21 +0000 On a 104 thread, 2 socket Skylake system, Intel report a 4.7% performance reduction with will-it-scale page_fault2. This was due to reducing the size of the batch from 32 to 15. Increasing the folio batch size from 15 to 31 gives a performance increase of 12.5% relative to the original, or 17.2% relative to the reduced performance commit. The penalty of this commit is an additional 128 bytes of stack usage. Six folio_batches are also allocated from percpu memory in cpu_fbatches so that will be an additional 768 bytes of percpu memory (per CPU). Tim Chen originally submitted a patch like this in 2020: https://lore.kernel.org/linux-mm/d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com/ Link: https://lkml.kernel.org/r/20240315140823.2478146-1-willy@infradead.org Fixes: 99fbb6bfc16f ("mm: make folios_put() the basis of release_pages()") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Tested-by: Yujie Liu <yujie.liu@intel.com> Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/pagevec.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) --- a/include/linux/pagevec.h~mm-increase-folio-batch-size +++ a/include/linux/pagevec.h @@ -11,8 +11,8 @@ #include <linux/types.h> -/* 15 pointers + header align the folio_batch structure to a power of two */ -#define PAGEVEC_SIZE 15 +/* 31 pointers + header align the folio_batch structure to a power of two */ +#define PAGEVEC_SIZE 31 struct folio; _ Patches currently in -mm which might be from willy@infradead.org are mm-increase-folio-batch-size.patch
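For the sizing behind the comment: assuming 8-byte folio pointers and a batch header (the count plus the percpu-drained flag) that pads out to 8 bytes on a 64-bit build, 31 pointers give 8 + 31*8 = 256 bytes and the previous 15 gave 8 + 15*8 = 128 bytes, both powers of two. The 128-byte difference is the additional stack usage per on-stack folio_batch quoted above, and multiplied by the six batches in cpu_fbatches it accounts for the 768 bytes of additional percpu memory.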
The patch titled Subject: mm,page_owner: fix recursion has been added to the -mm mm-hotfixes-unstable branch. Its filename is mmpage_owner-fix-recursion.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mmpage_owner-fix-recursion.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Oscar Salvador <osalvador@suse.de> Subject: mm,page_owner: fix recursion Date: Fri, 15 Mar 2024 23:26:10 +0100 Prior to 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") the only place where page_owner could potentially go into recursion due to its need to allocate more memory was in save_stack(), which ends up calling into stackdepot code with the possibility of allocating memory. We made sure to guard against that by signaling that the current task was already in page_owner code, so in case a recursion attempt was made, we could catch that and return dummy_handle. After the above commit, a new place in page_owner code was introduced where we could allocate memory, meaning we could go into recursion should we take that path. Make sure to signal that we are in page_owner in that codepath as well. Move the guard code into two helpers {un}set_current_in_page_owner() and use them in the two functions that might allocate memory, prior to the allocating calls. Link: https://lkml.kernel.org/r/20240315222610.6870-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/page_owner.c | 33 +++++++++++++++++++++++---------- 1 file changed, 23 insertions(+), 10 deletions(-) --- a/mm/page_owner.c~mmpage_owner-fix-recursion +++ a/mm/page_owner.c @@ -54,6 +54,22 @@ static depot_stack_handle_t early_handle static void init_early_allocated_pages(void); +static inline void set_current_in_page_owner(void) +{ + /* + * Avoid recursion. + * + * We might need to allocate more memory from page_owner code, so make + * sure to signal it in order to avoid recursion. + */ + current->in_page_owner = 1; +} + +static inline void unset_current_in_page_owner(void) +{ + current->in_page_owner = 0; +} + static int __init early_page_owner_param(char *buf) { int ret = kstrtobool(buf, &page_owner_enabled); @@ -137,23 +153,16 @@ static noinline depot_stack_handle_t sav depot_stack_handle_t handle; unsigned int nr_entries; - /* - * Avoid recursion.
- * - * Sometimes page metadata allocation tracking requires more - * memory to be allocated: - * - when new stack trace is saved to stack depot - */ if (current->in_page_owner) return dummy_handle; - current->in_page_owner = 1; + set_current_in_page_owner(); nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 2); handle = stack_depot_save(entries, nr_entries, flags); if (!handle) handle = failure_handle; + unset_current_in_page_owner(); - current->in_page_owner = 0; return handle; } @@ -168,9 +177,13 @@ static void add_stack_record_to_list(str gfp_mask &= (GFP_ATOMIC | GFP_KERNEL); gfp_mask |= __GFP_NOWARN; + set_current_in_page_owner(); stack = kmalloc(sizeof(*stack), gfp_mask); - if (!stack) + if (!stack) { + unset_current_in_page_owner(); return; + } + unset_current_in_page_owner(); stack->stack_record = stack_record; stack->next = NULL; _ Patches currently in -mm which might be from osalvador@suse.de are mmpage_owner-fix-refcount-imbalance.patch mmpage_owner-fix-recursion.patch
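The guard being factored out, shown as a generic userspace sketch; the thread-local flag and the allocator wrapper are illustrative stand-ins, not the kernel's current->in_page_owner machinery. The idea is to set a per-task flag before any call that may allocate and have the tracking hook bail out when it finds the flag already set:

#include <stdbool.h>
#include <stdlib.h>

static _Thread_local bool in_tracker;    /* stand-in for current->in_page_owner */

static void *tracked_malloc(size_t size);

/* Hook that records metadata about every allocation and, to do so,
 * needs to allocate memory itself. */
static void track_allocation(void)
{
    void *meta;

    if (in_tracker)              /* recursion attempt: bail out */
        return;

    in_tracker = true;
    meta = tracked_malloc(32);   /* would recurse forever without the guard */
    free(meta);
    in_tracker = false;
}

static void *tracked_malloc(size_t size)
{
    void *p = malloc(size);

    track_allocation();
    return p;
}

int main(void)
{
    free(tracked_malloc(64));
    return 0;
}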
The patch titled Subject: mm: memcg: add NULL check to obj_cgroup_put() has been added to the -mm mm-unstable branch. Its filename is mm-memcg-add-null-check-to-obj_cgroup_put.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-memcg-add-null-check-to-obj_cgroup_put.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Yosry Ahmed <yosryahmed@google.com> Subject: mm: memcg: add NULL check to obj_cgroup_put() Date: Sat, 16 Mar 2024 01:58:03 +0000 9 out of 16 callers perform a NULL check before calling obj_cgroup_put(). Move the NULL check into the function, similar to mem_cgroup_put(). The unlikely() NULL check in current_objcg_update() was left alone to avoid dropping the unlikely() annotation as this is a fast path. Link: https://lkml.kernel.org/r/20240316015803.2777252-1-yosryahmed@google.com Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/memcontrol.h | 3 ++- kernel/bpf/memalloc.c | 6 ++---- mm/memcontrol.c | 18 ++++++------------ mm/zswap.c | 3 +-- 4 files changed, 11 insertions(+), 19 deletions(-) --- a/include/linux/memcontrol.h~mm-memcg-add-null-check-to-obj_cgroup_put +++ a/include/linux/memcontrol.h @@ -818,7 +818,8 @@ static inline void obj_cgroup_get_many(s static inline void obj_cgroup_put(struct obj_cgroup *objcg) { - percpu_ref_put(&objcg->refcnt); + if (objcg) + percpu_ref_put(&objcg->refcnt); } static inline bool mem_cgroup_tryget(struct mem_cgroup *memcg) --- a/kernel/bpf/memalloc.c~mm-memcg-add-null-check-to-obj_cgroup_put +++ a/kernel/bpf/memalloc.c @@ -759,8 +759,7 @@ void bpf_mem_alloc_destroy(struct bpf_me rcu_in_progress += atomic_read(&c->call_rcu_ttrace_in_progress); rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } - if (ma->objcg) - obj_cgroup_put(ma->objcg); + obj_cgroup_put(ma->objcg); destroy_mem_alloc(ma, rcu_in_progress); } if (ma->caches) { @@ -776,8 +775,7 @@ void bpf_mem_alloc_destroy(struct bpf_me rcu_in_progress += atomic_read(&c->call_rcu_in_progress); } } - if (ma->objcg) - obj_cgroup_put(ma->objcg); + obj_cgroup_put(ma->objcg); destroy_mem_alloc(ma, rcu_in_progress); } } --- a/mm/memcontrol.c~mm-memcg-add-null-check-to-obj_cgroup_put +++ a/mm/memcontrol.c @@ -2369,8 +2369,7 @@ static void drain_local_stock(struct wor clear_bit(FLUSHING_CACHED_CHARGE, &stock->flags); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - if (old) - obj_cgroup_put(old); + obj_cgroup_put(old); } /* @@ -3145,8 +3144,7 @@ static struct obj_cgroup *current_objcg_ if (old) { old = (struct obj_cgroup *) ((unsigned long)old & ~CURRENT_OBJCG_UPDATE_FLAG); - if (old) - obj_cgroup_put(old); +
obj_cgroup_put(old); old = NULL; } @@ -3418,8 +3416,7 @@ void mod_objcg_state(struct obj_cgroup * mod_objcg_mlstate(objcg, pgdat, idx, nr); local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - if (old) - obj_cgroup_put(old); + obj_cgroup_put(old); } static bool consume_obj_stock(struct obj_cgroup *objcg, unsigned int nr_bytes) @@ -3546,8 +3543,7 @@ static void refill_obj_stock(struct obj_ } local_unlock_irqrestore(&memcg_stock.stock_lock, flags); - if (old) - obj_cgroup_put(old); + obj_cgroup_put(old); if (nr_pages) obj_cgroup_uncharge_pages(objcg, nr_pages); @@ -5468,8 +5464,7 @@ static void __mem_cgroup_free(struct mem { int node; - if (memcg->orig_objcg) - obj_cgroup_put(memcg->orig_objcg); + obj_cgroup_put(memcg->orig_objcg); for_each_node(node) free_mem_cgroup_per_node_info(memcg, node); @@ -6620,8 +6615,7 @@ static void mem_cgroup_exit(struct task_ objcg = (struct obj_cgroup *) ((unsigned long)objcg & ~CURRENT_OBJCG_UPDATE_FLAG); - if (objcg) - obj_cgroup_put(objcg); + obj_cgroup_put(objcg); /* * Some kernel allocations can happen after this point, --- a/mm/zswap.c~mm-memcg-add-null-check-to-obj_cgroup_put +++ a/mm/zswap.c @@ -1593,8 +1593,7 @@ put_pool: freepage: zswap_entry_cache_free(entry); reject: - if (objcg) - obj_cgroup_put(objcg); + obj_cgroup_put(objcg); check_old: /* * If the zswap store fails or zswap is disabled, we must invalidate the _ Patches currently in -mm which might be from yosryahmed@google.com are mm-memcg-add-null-check-to-obj_cgroup_put.patch
The patch titled Subject: mm: remove guard around pgd_offset_k() macro has been added to the -mm mm-unstable branch. Its filename is mm-remove-guard-around-pgd_offset_k-macro.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-remove-guard-around-pgd_offset_k-macro.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Christophe Leroy <christophe.leroy@csgroup.eu> Subject: mm: remove guard around pgd_offset_k() macro Date: Sat, 16 Mar 2024 12:52:41 +0100 The last architecture redefining pgd_offset_k() was IA64 and it was removed by commit cf8e8658100d ("arch: Remove Itanium (IA-64) architecture") There is no need anymore to guard generic version of pgd_offset_k() with #ifndef pgd_offset_k Link: https://lkml.kernel.org/r/59d3f47d5615d18cca1986f269be2fcb3df34556.1710589838.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/pgtable.h | 2 -- 1 file changed, 2 deletions(-) --- a/include/linux/pgtable.h~mm-remove-guard-around-pgd_offset_k-macro +++ a/include/linux/pgtable.h @@ -149,9 +149,7 @@ static inline pgd_t *pgd_offset_pgd(pgd_ * a shortcut which implies the use of the kernel's pgd, instead * of a process's */ -#ifndef pgd_offset_k #define pgd_offset_k(address) pgd_offset(&init_mm, (address)) -#endif /* * In many cases it is known that a virtual address is mapped at PMD or PTE _ Patches currently in -mm which might be from christophe.leroy@csgroup.eu are mm-remove-guard-around-pgd_offset_k-macro.patch
The patch titled Subject: mailmap: update entry for Leonard Crestez has been added to the -mm mm-hotfixes-unstable branch. Its filename is mailmap-update-entry-for-leonard-crestez.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mailmap-update-entry-for-leonard-crestez.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Leonard Crestez <cdleonard@gmail.com> Subject: mailmap: update entry for Leonard Crestez Date: Sat, 16 Mar 2024 20:28:37 +0200 Put my personal email first because NXP employment ended some time ago. Also add my old intel email address. Link: https://lkml.kernel.org/r/f568faa0-2380-4e93-a312-b80c1e367645@gmail.com Signed-off-by: Leonard Crestez <cdleonard@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- .mailmap | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) --- a/.mailmap~mailmap-update-entry-for-leonard-crestez +++ a/.mailmap @@ -340,7 +340,8 @@ Lee Jones <lee@kernel.org> <joneslee@goo Lee Jones <lee@kernel.org> <lee.jones@canonical.com> Lee Jones <lee@kernel.org> <lee.jones@linaro.org> Lee Jones <lee@kernel.org> <lee@ubuntu.com> -Leonard Crestez <leonard.crestez@nxp.com> Leonard Crestez <cdleonard@gmail.com> +Leonard Crestez <cdleonard@gmail.com> <leonard.crestez@nxp.com> +Leonard Crestez <cdleonard@gmail.com> <leonard.crestez@intel.com> Leonardo Bras <leobras.c@gmail.com> <leonardo@linux.ibm.com> Leonard Göhrs <l.goehrs@pengutronix.de> Leonid I Ananiev <leonid.i.ananiev@intel.com> _ Patches currently in -mm which might be from cdleonard@gmail.com are mailmap-update-entry-for-leonard-crestez.patch
The patch titled Subject: init: open /initrd.image with O_LARGEFILE has been added to the -mm mm-hotfixes-unstable branch. Its filename is init-open-initrdimage-with-o_largefile.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/init-open-initrdimage-with-o_largefile.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: John Sperbeck <jsperbeck@google.com> Subject: init: open /initrd.image with O_LARGEFILE Date: Sun, 17 Mar 2024 15:15:22 -0700 If initrd data is larger than 2Gb, we'll eventually fail to write to the /initrd.image file when we hit that limit, unless O_LARGEFILE is set. Link: https://lkml.kernel.org/r/20240317221522.896040-1-jsperbeck@google.com Signed-off-by: John Sperbeck <jsperbeck@google.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- init/initramfs.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/init/initramfs.c~init-open-initrdimage-with-o_largefile +++ a/init/initramfs.c @@ -682,7 +682,7 @@ static void __init populate_initrd_image printk(KERN_INFO "rootfs image is not initramfs (%s); looks like an initrd\n", err); - file = filp_open("/initrd.image", O_WRONLY | O_CREAT, 0700); + file = filp_open("/initrd.image", O_WRONLY|O_CREAT|O_LARGEFILE, 0700); if (IS_ERR(file)) return; _ Patches currently in -mm which might be from jsperbeck@google.com are init-open-initrdimage-with-o_largefile.patch
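An aside on where the 2 Gb limit comes from, based on the VFS generic write checks: a file opened without O_LARGEFILE is capped at MAX_NON_LFS (2^31 - 1 bytes), so once the decompressed initrd data reaches that boundary the writes into /initrd.image fail with -EFBIG; opening with O_LARGEFILE lifts the cap.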
The patch titled Subject: selftests/mm: Fix build with _FORTIFY_SOURCE has been added to the -mm mm-hotfixes-unstable branch. Its filename is selftests-mm-fix-build-with-_fortify_source.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/selftests-mm-fix-build-with-_fortify_source.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Vitaly Chikunov <vt@altlinux.org> Subject: selftests/mm: Fix build with _FORTIFY_SOURCE Date: Mon, 18 Mar 2024 05:34:44 +0300 Add missing flags argument to open(2) call with O_CREAT. Some tests fail to compile if _FORTIFY_SOURCE is defined (to any valid value) (together with -O), resulting in similar error messages such as: In file included from /usr/include/fcntl.h:342, from gup_test.c:1: In function 'open', inlined from 'main' at gup_test.c:206:10: /usr/include/bits/fcntl2.h:50:11: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments 50 | __open_missing_mode (); | ^~~~~~~~~~~~~~~~~~~~~~ _FORTIFY_SOURCE is enabled by default in some distributions, so the tests are not built by default and are skipped. open(2) man-page warns about missing flags argument: "if it is not supplied, some arbitrary bytes from the stack will be applied as the file mode." 
Link: https://lkml.kernel.org/r/20240318023445.3192922-1-vt@altlinux.org Fixes: aeb85ed4f41a ("tools/testing/selftests/vm/gup_benchmark.c: allow user specified file") Fixes: fbe37501b252 ("mm: huge_memory: debugfs for file-backed THP split") Fixes: c942f5bd17b3 ("selftests: soft-dirty: add test for mprotect") Signed-off-by: Vitaly Chikunov <vt@altlinux.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Peter Xu <peterx@redhat.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- tools/testing/selftests/mm/gup_test.c | 2 +- tools/testing/selftests/mm/soft-dirty.c | 2 +- tools/testing/selftests/mm/split_huge_page_test.c | 2 +- 3 files changed, 3 insertions(+), 3 deletions(-) --- a/tools/testing/selftests/mm/gup_test.c~selftests-mm-fix-build-with-_fortify_source +++ a/tools/testing/selftests/mm/gup_test.c @@ -203,7 +203,7 @@ int main(int argc, char **argv) ksft_print_header(); ksft_set_plan(nthreads); - filed = open(file, O_RDWR|O_CREAT); + filed = open(file, O_RDWR|O_CREAT, 0664); if (filed < 0) ksft_exit_fail_msg("Unable to open %s: %s\n", file, strerror(errno)); --- a/tools/testing/selftests/mm/soft-dirty.c~selftests-mm-fix-build-with-_fortify_source +++ a/tools/testing/selftests/mm/soft-dirty.c @@ -137,7 +137,7 @@ static void test_mprotect(int pagemap_fd if (!map) ksft_exit_fail_msg("anon mmap failed\n"); } else { - test_fd = open(fname, O_RDWR | O_CREAT); + test_fd = open(fname, O_RDWR | O_CREAT, 0664); if (test_fd < 0) { ksft_test_result_skip("Test %s open() file failed\n", __func__); return; --- a/tools/testing/selftests/mm/split_huge_page_test.c~selftests-mm-fix-build-with-_fortify_source +++ a/tools/testing/selftests/mm/split_huge_page_test.c @@ -223,7 +223,7 @@ void split_file_backed_thp(void) ksft_exit_fail_msg("Fail to create file-backed THP split testing file\n"); } - fd = open(testfile, O_CREAT|O_WRONLY); + fd = open(testfile, O_CREAT|O_WRONLY, 0664); if (fd == -1) { ksft_perror("Cannot open testing file"); goto cleanup; _ Patches currently in -mm which might be from vt@altlinux.org are selftests-mm-fix-build-with-_fortify_source.patch
The patch titled Subject: mm/mm_init.c: remove arch_reserved_kernel_pages() has been added to the -mm mm-unstable branch. Its filename is mm-mm_initc-remove-arch_reserved_kernel_pages.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mm_initc-remove-arch_reserved_kernel_pages.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: mm/mm_init.c: remove arch_reserved_kernel_pages() Date: Mon, 18 Mar 2024 22:21:38 +0800 Since the current calculation in calc_nr_kernel_pages() already takes kernel reserved memory into account, there is no need for arch_reserved_kernel_pages() any more. Link: https://lkml.kernel.org/r/20240318142138.783350-7-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/powerpc/include/asm/mmu.h | 4 ---- arch/powerpc/kernel/fadump.c | 5 ----- include/linux/mm.h | 3 --- mm/mm_init.c | 12 ------------ 4 files changed, 24 deletions(-) --- a/arch/powerpc/include/asm/mmu.h~mm-mm_initc-remove-arch_reserved_kernel_pages +++ a/arch/powerpc/include/asm/mmu.h @@ -406,9 +406,5 @@ extern void *abatron_pteptrs[2]; #include <asm/nohash/mmu.h> #endif -#if defined(CONFIG_FA_DUMP) || defined(CONFIG_PRESERVE_FA_DUMP) -#define __HAVE_ARCH_RESERVED_KERNEL_PAGES -#endif - #endif /* __KERNEL__ */ #endif /* _ASM_POWERPC_MMU_H_ */ --- a/arch/powerpc/kernel/fadump.c~mm-mm_initc-remove-arch_reserved_kernel_pages +++ a/arch/powerpc/kernel/fadump.c @@ -1735,8 +1735,3 @@ static void __init fadump_reserve_crash_ memblock_reserve(mstart, msize); } } - -unsigned long __init arch_reserved_kernel_pages(void) -{ - return memblock_reserved_size() / PAGE_SIZE; -} --- a/include/linux/mm.h~mm-mm_initc-remove-arch_reserved_kernel_pages +++ a/include/linux/mm.h @@ -3221,9 +3221,6 @@ static inline void show_mem(void) extern long si_mem_available(void); extern void si_meminfo(struct sysinfo * val); extern void si_meminfo_node(struct sysinfo *val, int nid); -#ifdef __HAVE_ARCH_RESERVED_KERNEL_PAGES -extern unsigned long arch_reserved_kernel_pages(void); -#endif extern __printf(3, 4) void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...); --- a/mm/mm_init.c~mm-mm_initc-remove-arch_reserved_kernel_pages +++ a/mm/mm_init.c @@ -2383,17 +2383,6 @@ void __init page_alloc_init_late(void) page_alloc_sysctl_init(); } -#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES -/* - * Returns the number of pages that arch has reserved but - * is not known to alloc_large_system_hash(). - */ -static unsigned long __init arch_reserved_kernel_pages(void) -{ - return 0; -} -#endif - /* * Adaptive scale is meant to reduce sizes of hash tables on large memory * machines.
As memory size is increased the scale is also increased but at @@ -2436,7 +2425,6 @@ void *__init alloc_large_system_hash(con if (!numentries) { /* round applicable memory size up to nearest megabyte */ numentries = nr_kernel_pages; - numentries -= arch_reserved_kernel_pages(); /* It isn't necessary when PAGE_SIZE >= 1MB */ if (PAGE_SIZE < SZ_1M) _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
The patch titled Subject: mm/mm_init.c: remove unneeded calc_memmap_size() has been added to the -mm mm-unstable branch. Its filename is mm-mm_initc-remove-unneeded-calc_memmap_size.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mm_initc-remove-unneeded-calc_memmap_size.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: mm/mm_init.c: remove unneeded calc_memmap_size() Date: Mon, 18 Mar 2024 22:21:37 +0800 Nobody calls calc_memmap_size() now. Link: https://lkml.kernel.org/r/20240318142138.783350-6-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/mm_init.c | 20 -------------------- 1 file changed, 20 deletions(-) --- a/mm/mm_init.c~mm-mm_initc-remove-unneeded-calc_memmap_size +++ a/mm/mm_init.c @@ -1331,26 +1331,6 @@ static void __init calculate_node_totalp pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages); } -static unsigned long __init calc_memmap_size(unsigned long spanned_pages, - unsigned long present_pages) -{ - unsigned long pages = spanned_pages; - - /* - * Provide a more accurate estimation if there are holes within - * the zone and SPARSEMEM is in use. If there are holes within the - * zone, each populated memory region may cost us one or two extra - * memmap pages due to alignment because memmap pages for each - * populated regions may not be naturally aligned on page boundary. - * So the (present_pages >> 4) heuristic is a tradeoff for that. - */ - if (spanned_pages > present_pages + (present_pages >> 4) && - IS_ENABLED(CONFIG_SPARSEMEM)) - pages = present_pages; - - return PAGE_ALIGN(pages * sizeof(struct page)) >> PAGE_SHIFT; -} - #ifdef CONFIG_TRANSPARENT_HUGEPAGE static void pgdat_init_split_queue(struct pglist_data *pgdat) { _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
The patch titled Subject: mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core() has been added to the -mm mm-unstable branch. Its filename is mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core() Date: Mon, 18 Mar 2024 22:21:36 +0800 Currently, in free_area_init_core(), when zone fields are initialized, a rough value is assigned to zone->managed_pages. That value is calculated as (zone->present_pages - memmap_pages). The same value is also added to nr_all_pages and nr_kernel_pages, which represent all free pages of the system (including HIGHMEM memory and low memory only, respectively). Both of them are later used in alloc_large_system_hash(). However, the rough calculation and setting of zone->managed_pages is meaningless because a) memmap pages are allocated in units of node in sparse_init() or alloc_node_mem_map(pgdat); the simple (zone->present_pages - memmap_pages) is too rough to make sense per zone; b) zone->managed_pages will be zeroed out and reset with the actual value in mem_init() via memblock_free_all(), and before that reset no buddy allocation request is issued. Hence, remove the meaningless and complicated calculation of (zone->present_pages - memmap_pages) and directly set zone->managed_pages to zone->present_pages; it will be adjusted in mem_init(). Also remove the assignment of nr_all_pages and nr_kernel_pages in free_area_init_core(). Instead, call the newly added calc_nr_kernel_pages() to count up all free but not reserved memory in memblock and assign the result to nr_all_pages and nr_kernel_pages. That counting excludes memmap_pages and other kernel-used data, which is more accurate and simpler than the old way, and also covers the case powerpc handled via arch_reserved_kernel_pages().
Link: https://lkml.kernel.org/r/20240318142138.783350-5-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/mm_init.c | 38 ++++++-------------------------------- 1 file changed, 6 insertions(+), 32 deletions(-) --- a/mm/mm_init.c~mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core +++ a/mm/mm_init.c @@ -1584,41 +1584,14 @@ static void __init free_area_init_core(s for (j = 0; j < MAX_NR_ZONES; j++) { struct zone *zone = pgdat->node_zones + j; - unsigned long size, freesize, memmap_pages; - - size = zone->spanned_pages; - freesize = zone->present_pages; - - /* - * Adjust freesize so that it accounts for how much memory - * is used by this zone for memmap. This affects the watermark - * and per-cpu initialisations - */ - memmap_pages = calc_memmap_size(size, freesize); - if (!is_highmem_idx(j)) { - if (freesize >= memmap_pages) { - freesize -= memmap_pages; - if (memmap_pages) - pr_debug(" %s zone: %lu pages used for memmap\n", - zone_names[j], memmap_pages); - } else - pr_warn(" %s zone: %lu memmap pages exceeds freesize %lu\n", - zone_names[j], memmap_pages, freesize); - } - - if (!is_highmem_idx(j)) - nr_kernel_pages += freesize; - /* Charge for highmem memmap if there are enough kernel pages */ - else if (nr_kernel_pages > memmap_pages * 2) - nr_kernel_pages -= memmap_pages; - nr_all_pages += freesize; + unsigned long size = zone->spanned_pages; /* - * Set an approximate value for lowmem here, it will be adjusted - * when the bootmem allocator frees pages into the buddy system. - * And all highmem pages will be managed by the buddy system. + * Set the zone->managed_pages as zone->present_pages roughly, it + * be zeroed out and reset when memblock allocator frees pages into + * buddy system. */ - zone_init_internals(zone, j, nid, freesize); + zone_init_internals(zone, j, nid, zone->present_pages); if (!size) continue; @@ -1915,6 +1888,7 @@ void __init free_area_init(unsigned long check_for_memory(pgdat); } + calc_nr_kernel_pages(); memmap_init(); /* disable hash distribution for systems with a single node */ _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
The patch titled Subject: mm/mm_init.c: add new function calc_nr_kernel_pages() has been added to the -mm mm-unstable branch. Its filename is mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: mm/mm_init.c: add new function calc_nr_kernel_pages() Date: Mon, 18 Mar 2024 22:21:35 +0800 This is a preparation to calculate nr_kernel_pages and nr_all_pages, both of which will be used later in alloc_large_system_hash(). nr_all_pages counts up all free but not reserved memory in memblock allocator, including HIGHMEM memory. While nr_kernel_pages counts up all free but not reserved low memory in memblock allocator, excluding HIGHMEM memory. Link: https://lkml.kernel.org/r/20240318142138.783350-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/mm_init.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) --- a/mm/mm_init.c~mm-mm_initc-add-new-function-calc_nr_kernel_pages +++ a/mm/mm_init.c @@ -1264,6 +1264,30 @@ static void __init reset_memoryless_node pr_debug("On node %d totalpages: 0\n", pgdat->node_id); } +static void __init calc_nr_kernel_pages(void) +{ + unsigned long start_pfn, end_pfn; + phys_addr_t start_addr, end_addr; + u64 u; +#ifdef CONFIG_HIGHMEM + unsigned long high_zone_low = arch_zone_lowest_possible_pfn[ZONE_HIGHMEM]; +#endif + + for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) { + start_pfn = PFN_UP(start_addr); + end_pfn = PFN_DOWN(end_addr); + + if (start_pfn < end_pfn) { + nr_all_pages += end_pfn - start_pfn; +#ifdef CONFIG_HIGHMEM + start_pfn = clamp(start_pfn, 0, high_zone_low); + end_pfn = clamp(end_pfn, 0, high_zone_low); +#endif + nr_kernel_pages += end_pfn - start_pfn; + } + } +} + static void __init calculate_node_totalpages(struct pglist_data *pgdat, unsigned long node_start_pfn, unsigned long node_end_pfn) _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
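To make the page-frame rounding and HIGHMEM handling in calc_nr_kernel_pages() above concrete, here is a small user-space model of the same counting logic; it is only a sketch, and the free ranges and HIGHMEM boundary below are invented example values. Partial pages at the edges of a free range are dropped by rounding the start up and the end down, and for nr_kernel_pages the range is additionally clamped below the HIGHMEM start PFN:

/* calc_nr_kernel_pages_demo.c - user-space sketch, not kernel code */
#include <stdio.h>

#define PAGE_SHIFT	12UL
#define PAGE_SIZE	(1UL << PAGE_SHIFT)

/* same rounding as the kernel's PFN_UP()/PFN_DOWN() helpers */
#define PFN_UP(x)	(((x) + PAGE_SIZE - 1) >> PAGE_SHIFT)
#define PFN_DOWN(x)	((x) >> PAGE_SHIFT)

struct range { unsigned long start, end; };	/* [start, end) in bytes */

int main(void)
{
	/* pretend these are the free memblock ranges */
	struct range free_ranges[] = {
		{ 0x00100000UL, 0x08000000UL },		/* 1 MiB .. 128 MiB */
		{ 0x10000800UL, 0x40000000UL },		/* unaligned start  */
	};
	unsigned long highmem_start_pfn = 0x38000000UL >> PAGE_SHIFT; /* fake */
	unsigned long nr_all_pages = 0, nr_kernel_pages = 0;
	unsigned int i;

	for (i = 0; i < sizeof(free_ranges) / sizeof(free_ranges[0]); i++) {
		unsigned long start_pfn = PFN_UP(free_ranges[i].start);
		unsigned long end_pfn = PFN_DOWN(free_ranges[i].end);

		if (start_pfn >= end_pfn)
			continue;
		nr_all_pages += end_pfn - start_pfn;

		/* lowmem counter: clamp the range below the HIGHMEM start */
		if (start_pfn > highmem_start_pfn)
			start_pfn = highmem_start_pfn;
		if (end_pfn > highmem_start_pfn)
			end_pfn = highmem_start_pfn;
		nr_kernel_pages += end_pfn - start_pfn;
	}
	printf("nr_all_pages=%lu nr_kernel_pages=%lu\n",
	       nr_all_pages, nr_kernel_pages);
	return 0;
}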
The patch titled Subject: x86: remove memblock_find_dma_reserve() has been added to the -mm mm-unstable branch. Its filename is x86-remove-memblock_find_dma_reserve.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/x86-remove-memblock_find_dma_reserve.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: x86: remove memblock_find_dma_reserve() Date: Mon, 18 Mar 2024 22:21:34 +0800 This is not needed any more. Link: https://lkml.kernel.org/r/20240318142138.783350-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/x86/include/asm/pgtable.h | 1 arch/x86/kernel/setup.c | 2 - arch/x86/mm/init.c | 45 ------------------------------- 3 files changed, 48 deletions(-) --- a/arch/x86/include/asm/pgtable.h~x86-remove-memblock_find_dma_reserve +++ a/arch/x86/include/asm/pgtable.h @@ -1200,7 +1200,6 @@ static inline int pgd_none(pgd_t pgd) extern int direct_gbpages; void init_mem_mapping(void); void early_alloc_pgt_buf(void); -extern void memblock_find_dma_reserve(void); void __init poking_init(void); unsigned long init_memory_mapping(unsigned long start, unsigned long end, pgprot_t prot); --- a/arch/x86/kernel/setup.c~x86-remove-memblock_find_dma_reserve +++ a/arch/x86/kernel/setup.c @@ -1106,8 +1106,6 @@ void __init setup_arch(char **cmdline_p) */ arch_reserve_crashkernel(); - memblock_find_dma_reserve(); - if (!early_xdbc_setup_hardware()) early_xdbc_register_console(); --- a/arch/x86/mm/init.c~x86-remove-memblock_find_dma_reserve +++ a/arch/x86/mm/init.c @@ -990,51 +990,6 @@ void __init free_initrd_mem(unsigned lon } #endif -/* - * Calculate the precise size of the DMA zone (first 16 MB of RAM), - * and pass it to the MM layer - to help it set zone watermarks more - * accurately. - * - * Done on 64-bit systems only for the time being, although 32-bit systems - * might benefit from this as well. 
- */ -void __init memblock_find_dma_reserve(void) -{ -#ifdef CONFIG_X86_64 - u64 nr_pages = 0, nr_free_pages = 0; - unsigned long start_pfn, end_pfn; - phys_addr_t start_addr, end_addr; - int i; - u64 u; - - /* - * Iterate over all memory ranges (free and reserved ones alike), - * to calculate the total number of pages in the first 16 MB of RAM: - */ - nr_pages = 0; - for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, NULL) { - start_pfn = min(start_pfn, MAX_DMA_PFN); - end_pfn = min(end_pfn, MAX_DMA_PFN); - - nr_pages += end_pfn - start_pfn; - } - - /* - * Iterate over free memory ranges to calculate the number of free - * pages in the DMA zone, while not counting potential partial - * pages at the beginning or the end of the range: - */ - nr_free_pages = 0; - for_each_free_mem_range(u, NUMA_NO_NODE, MEMBLOCK_NONE, &start_addr, &end_addr, NULL) { - start_pfn = min_t(unsigned long, PFN_UP(start_addr), MAX_DMA_PFN); - end_pfn = min_t(unsigned long, PFN_DOWN(end_addr), MAX_DMA_PFN); - - if (start_pfn < end_pfn) - nr_free_pages += end_pfn - start_pfn; - } -#endif -} - void __init zone_sizes_init(void) { unsigned long max_zone_pfns[MAX_NR_ZONES]; _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
The patch titled Subject: mm/mm_init.c: remove the useless dma_reserve has been added to the -mm mm-unstable branch. Its filename is mm-mm_initc-remove-the-useless-dma_reserve.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-mm_initc-remove-the-useless-dma_reserve.patch This patch will later appear in the mm-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Baoquan He <bhe@redhat.com> Subject: mm/mm_init.c: remove the useless dma_reserve Date: Mon, 18 Mar 2024 22:21:33 +0800 Patch series "mm/mm_init.c: refactor free_area_init_core()". In free_area_init_core(), the code that calculates zone->managed_pages and that subtracts dma_reserve from the DMA zone is very confusing. From the git history, the code calculating zone->managed_pages was originally written for zone->present_pages, and the early rough assignment was meant to optimize the zone's pcp and watermark setup. Later, managed_pages was introduced into struct zone to represent the number of pages managed by the buddy allocator. Nowadays, zone->managed_pages is zeroed out and reset in mem_init() when memblock_free_all() is called, and the pcp and watermark setup that relies on the actual zone->managed_pages happens after mem_init(). So there is no need to rush an early calculation of zone->managed_pages: just set it to zone->present_pages and adjust it in mem_init(). Also add a new function calc_nr_kernel_pages() to count up free but not reserved pages in memblock, and assign the result to nr_all_pages and nr_kernel_pages after memmap pages are allocated. This patch (of 6): The variable dma_reserve and its usage were introduced in commit 0e0b864e069c ("[PATCH] Account for memmap and optionally the kernel image as holes"). Its original purpose was to account for the reserved pages in the DMA zone so that the DMA zone's watermark calculation is more accurate on x86. However, these days zone->managed_pages accounts for all pages available to the buddy allocator, and zone->present_pages accounts for all present physical pages in a zone. More importantly, on x86 the calculation and setting of zone->managed_pages is only temporary: every zone's managed_pages is zeroed out and reset to the actual value according to how many pages are handed to the buddy allocator in mem_init(). Before mem_init(), no buddy allocation is requested, and the zone's pcp and watermark setup all happen after mem_init(). So there is no need to worry about the accuracy of the DMA zone's setting during free_area_init(). Hence, remove dma_reserve and its handling in free_area_init_core() because it is useless and causes confusion.
Link: https://lkml.kernel.org/r/20240318142138.783350-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240318142138.783350-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- arch/x86/mm/init.c | 2 -- include/linux/mm.h | 1 - mm/mm_init.c | 23 ----------------------- 3 files changed, 26 deletions(-) --- a/arch/x86/mm/init.c~mm-mm_initc-remove-the-useless-dma_reserve +++ a/arch/x86/mm/init.c @@ -1032,8 +1032,6 @@ void __init memblock_find_dma_reserve(vo if (start_pfn < end_pfn) nr_free_pages += end_pfn - start_pfn; } - - set_dma_reserve(nr_pages - nr_free_pages); #endif } --- a/include/linux/mm.h~mm-mm_initc-remove-the-useless-dma_reserve +++ a/include/linux/mm.h @@ -3210,7 +3210,6 @@ static inline int early_pfn_to_nid(unsig extern int __meminit early_pfn_to_nid(unsigned long pfn); #endif -extern void set_dma_reserve(unsigned long new_dma_reserve); extern void mem_init(void); extern void __init mmap_init(void); --- a/mm/mm_init.c~mm-mm_initc-remove-the-useless-dma_reserve +++ a/mm/mm_init.c @@ -226,7 +226,6 @@ static unsigned long required_movablecor static unsigned long nr_kernel_pages __initdata; static unsigned long nr_all_pages __initdata; -static unsigned long dma_reserve __initdata; static bool deferred_struct_pages __meminitdata; @@ -1583,12 +1582,6 @@ static void __init free_area_init_core(s zone_names[j], memmap_pages, freesize); } - /* Account for reserved pages */ - if (j == 0 && freesize > dma_reserve) { - freesize -= dma_reserve; - pr_debug(" %s zone: %lu pages reserved\n", zone_names[0], dma_reserve); - } - if (!is_highmem_idx(j)) nr_kernel_pages += freesize; /* Charge for highmem memmap if there are enough kernel pages */ @@ -2547,22 +2540,6 @@ void *__init alloc_large_system_hash(con return table; } -/** - * set_dma_reserve - set the specified number of pages reserved in the first zone - * @new_dma_reserve: The number of pages to mark reserved - * - * The per-cpu batchsize and zone watermarks are determined by managed_pages. - * In the DMA zone, a significant percentage may be consumed by kernel image - * and other unfreeable allocations which can skew the watermarks badly. This - * function may optionally be used to account for unfreeable pages in the - * first zone (e.g., ZONE_DMA). The effect will be lower watermarks and - * smaller per-cpu batchsize. - */ -void __init set_dma_reserve(unsigned long new_dma_reserve) -{ - dma_reserve = new_dma_reserve; -} - void __init memblock_free_pages(struct page *page, unsigned long pfn, unsigned int order) { _ Patches currently in -mm which might be from bhe@redhat.com are mm-mm_initc-remove-the-useless-dma_reserve.patch x86-remove-memblock_find_dma_reserve.patch mm-mm_initc-add-new-function-calc_nr_kernel_pages.patch mm-mm_initc-remove-meaningless-calculation-of-zone-managed_pages-in-free_area_init_core.patch mm-mm_initc-remove-unneeded-calc_memmap_size.patch mm-mm_initc-remove-arch_reserved_kernel_pages.patch
The patch titled Subject: mm/memory: fix missing pte marker for !page on pte zaps has been added to the -mm mm-hotfixes-unstable branch. Its filename is mm-memory-fix-missing-pte-marker-for-page-on-pte-zaps.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-memory-fix-missing-pte-marker-for-page-on-pte-zaps.patch This patch will later appear in the mm-hotfixes-unstable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Peter Xu <peterx@redhat.com> Subject: mm/memory: fix missing pte marker for !page on pte zaps Date: Wed, 13 Mar 2024 17:31:07 -0400 Commit 0cf18e839f64 of the large folio zap work broke uffd-wp. Now mm's uffd unit test "wp-unpopulated" will trigger this WARN_ON_ONCE(). The WARN_ON_ONCE() asserts that a VMA cannot be registered with userfaultfd-wp if it contains a !normal page, but that is actually possible. One example is an anonymous VMA: register it with uffd-wp, and any read will install the zero page; zapping it afterwards triggers the warning. What's more, removing that WARN_ON_ONCE may not be enough either, because we should also not rely on "whether it's a normal page" to decide whether a pte marker is needed. For example, one can register wr-protect over some DAX regions to track writes when UFFD_FEATURE_WP_ASYNC is enabled, in which case a devmap entry can have page==NULL but we may want to keep the marker around. Link: https://lkml.kernel.org/r/20240313213107.235067-1-peterx@redhat.com Fixes: 0cf18e839f64 ("mm/memory: handle !page case in zap_present_pte() separately") Signed-off-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/memory.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) --- a/mm/memory.c~mm-memory-fix-missing-pte-marker-for-page-on-pte-zaps +++ a/mm/memory.c @@ -1536,7 +1536,9 @@ static inline int zap_present_ptes(struc ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm); arch_check_zapped_pte(vma, ptent); tlb_remove_tlb_entry(tlb, pte, addr); - VM_WARN_ON_ONCE(userfaultfd_wp(vma)); + if (userfaultfd_pte_wp(vma, ptent)) + zap_install_uffd_wp_if_needed(vma, addr, pte, 1, + details, ptent); ksm_might_unmap_zero_page(mm, ptent); return 1; } _ Patches currently in -mm which might be from peterx@redhat.com are mm-memory-fix-missing-pte-marker-for-page-on-pte-zaps.patch
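The zero-page scenario from the changelog above can be reproduced from user space roughly as follows. This is only a sketch, not the mm selftest itself: it assumes a kernel with CONFIG_USERFAULTFD and uffd-wp support for anonymous memory, and creating the userfaultfd may require privileges or vm.unprivileged_userfaultfd=1. Register an anonymous VMA in write-protect mode, fault it in with a read so the zero page is installed, then zap it with MADV_DONTNEED; that is the path on which the problematic WARN_ON_ONCE() fires.

/* uffd_wp_zero_page_zap.c - rough reproducer sketch (see assumptions above) */
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 4096;			/* assumes 4 KiB pages */
	long uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	struct uffdio_api api = { .api = UFFD_API };
	struct uffdio_register reg;
	volatile char *p;

	if (uffd < 0) { perror("userfaultfd"); return 1; }
	if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) { perror("mmap"); return 1; }

	memset(&reg, 0, sizeof(reg));
	reg.range.start = (unsigned long)p;
	reg.range.len = len;
	reg.mode = UFFDIO_REGISTER_MODE_WP;	/* write-protect tracking only */
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

	(void)p[0];				/* read fault installs the zero page */
	madvise((void *)p, len, MADV_DONTNEED);	/* zap the pte */
	return 0;
}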
The pull request you sent on Thu, 14 Mar 2024 09:43:19 -0700: > git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm tags/mm-nonmm-stable-2024-03-14-09-36 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/e5eb28f6d1afebed4bb7d740a797d0390bd3a357 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
The pull request you sent on Wed, 13 Mar 2024 20:05:32 -0700: > https://lkml.kernel.org/r/20240229101721.58569685@canb.auug.org.au Thanks. has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/902861e34c401696ed9ad17a54c8790e7e8e3069 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
The patch titled Subject: mm,page_owner: fix refcount imbalance has been added to the -mm mm-stable branch. Its filename is mmpage_owner-fix-refcount-imbalance.patch This patch will shortly appear at https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mmpage_owner-fix-refcount-imbalance.patch This patch will later appear in the mm-stable branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Before you just go and hit "reply", please: a) Consider who else should be cc'ed b) Prefer to cc a suitable mailing list as well c) Ideally: find the original patch on the mailing list and do a reply-to-all to that, adding suitable additional cc's *** Remember to use Documentation/process/submit-checklist.rst when testing your code *** The -mm tree is included into linux-next via the mm-everything branch at git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm and is updated there every 2-3 working days ------------------------------------------------------ From: Oscar Salvador <osalvador@suse.de> Subject: mm,page_owner: fix refcount imbalance Date: Thu, 14 Mar 2024 15:47:53 +0100 The current code does not contemplate scenarios where the allocation and the free operations on the same pages do not handle the same number of pages at once. One example is alloc_pages_exact(), where we allocate a page of a high enough order to satisfy the size request, but free the remainder right away. In that example, we increment the stack_record refcount only once, but we decrease it as many times as the number of unused pages we have to free. This leads to a warning because of the refcount imbalance. Fix this by recording the number of base pages every stack_record holds, and only let the last refcount decrement succeed once the number of base pages reaches 0, which means all the pages have been freed. As a bonus, show the aggregate of stack_count + base_pages, as this gives a much better picture of the memory usage.
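To see the imbalance with concrete numbers (these are invented for illustration, and the snippet below is a toy user-space model, not the kernel code): a hypothetical 10 KiB alloc_pages_exact() request is served from an order-2 block of 4 base pages; the refcount is incremented once for that allocation, but the unused tail page is freed immediately and the 3-page buffer is freed later, so two free operations decrement the counter against a single increment. Tracking the per-record number of base pages keeps the record alive until all of them are gone:

/* page_owner_refcount_demo.c - toy model of the accounting problem and fix */
#include <stdio.h>

int main(void)
{
	int old_refcount = 1;	/* old scheme: a bare refcount            */
	int new_refcount = 1;	/* fixed scheme ...                        */
	long nr_base_pages = 0;	/* ... plus a per-record base-page count   */

	/* allocation of the order-2 block: both schemes increment once */
	old_refcount++;
	new_refcount++;
	nr_base_pages += 4;

	/* free #1: the unused tail page is returned right away */
	old_refcount--;					/* unconditional   */
	nr_base_pages -= 1;
	if (!(new_refcount == 2 && nr_base_pages))	/* pages remain:   */
		new_refcount--;				/* keep the record */

	/* free #2: the caller eventually frees the 3-page buffer */
	old_refcount--;
	nr_base_pages -= 3;
	if (!(new_refcount == 2 && nr_base_pages))
		new_refcount--;

	printf("old scheme:   refcount %d (dropped below its baseline of 1)\n",
	       old_refcount);
	printf("fixed scheme: refcount %d, base pages left %ld\n",
	       new_refcount, nr_base_pages);
	return 0;
}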
Link: https://lkml.kernel.org/r/20240314144753.16276-1-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Fixes: 217b2119b9e2 ("mm,page_owner: implement the tracking of the stacks count") Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- include/linux/stackdepot.h | 3 + mm/page_owner.c | 57 ++++++++++++++++++++++++++++------- 2 files changed, 50 insertions(+), 10 deletions(-) --- a/include/linux/stackdepot.h~mmpage_owner-fix-refcount-imbalance +++ a/include/linux/stackdepot.h @@ -57,6 +57,9 @@ struct stack_record { u32 size; /* Number of stored frames */ union handle_parts handle; /* Constant after initialization */ refcount_t count; +#ifdef CONFIG_PAGE_OWNER + unsigned long nr_base_pages; +#endif union { unsigned long entries[CONFIG_STACKDEPOT_MAX_FRAMES]; /* Frames */ struct { --- a/mm/page_owner.c~mmpage_owner-fix-refcount-imbalance +++ a/mm/page_owner.c @@ -107,10 +107,14 @@ static __init void init_page_owner(void) /* Initialize dummy and failure stacks and link them to stack_list */ dummy_stack.stack_record = __stack_depot_get_stack_record(dummy_handle); failure_stack.stack_record = __stack_depot_get_stack_record(failure_handle); - if (dummy_stack.stack_record) + if (dummy_stack.stack_record) { + dummy_stack.stack_record->nr_base_pages = 0; refcount_set(&dummy_stack.stack_record->count, 1); - if (failure_stack.stack_record) + } + if (failure_stack.stack_record) { + failure_stack.stack_record->nr_base_pages = 0; refcount_set(&failure_stack.stack_record->count, 1); + } dummy_stack.next = &failure_stack; stack_list = &dummy_stack; } @@ -183,9 +187,11 @@ static void add_stack_record_to_list(str spin_unlock_irqrestore(&stack_list_lock, flags); } -static void inc_stack_record_count(depot_stack_handle_t handle, gfp_t gfp_mask) +static void inc_stack_record_count(depot_stack_handle_t handle, gfp_t gfp_mask, + unsigned long nr_base_pages) { struct stack_record *stack_record = __stack_depot_get_stack_record(handle); + unsigned long curr_nr_pages; if (!stack_record) return; @@ -200,19 +206,47 @@ static void inc_stack_record_count(depot if (refcount_read(&stack_record->count) == REFCOUNT_SATURATED) { int old = REFCOUNT_SATURATED; - if (atomic_try_cmpxchg_relaxed(&stack_record->count.refs, &old, 1)) + if (atomic_try_cmpxchg_relaxed(&stack_record->count.refs, &old, 1)) { /* Add the new stack_record to our list */ add_stack_record_to_list(stack_record, gfp_mask); + smp_store_release(&stack_record->nr_base_pages, + nr_base_pages); + goto inc; + } } + + curr_nr_pages = smp_load_acquire(&stack_record->nr_base_pages); + smp_store_release(&stack_record->nr_base_pages, + curr_nr_pages + nr_base_pages); +inc: refcount_inc(&stack_record->count); } -static void dec_stack_record_count(depot_stack_handle_t handle) +static void dec_stack_record_count(depot_stack_handle_t handle, + unsigned long nr_base_pages) { struct stack_record *stack_record = __stack_depot_get_stack_record(handle); + unsigned long curr_nr_pages; + + if (!stack_record) + return; + + curr_nr_pages = smp_load_acquire(&stack_record->nr_base_pages); + smp_store_release(&stack_record->nr_base_pages, + curr_nr_pages - nr_base_pages); + curr_nr_pages = smp_load_acquire(&stack_record->nr_base_pages); + + /* + * If this stack_record is going to reach a refcount == 1, which means + * free, only do it if all the base pages it 
allocated were freed. + * E.g: scenarios like THP splitting, or alloc_pages_exact() can have + * an alloc/free operation with different amount of pages + */ + if (refcount_read(&stack_record->count) == 2 && + curr_nr_pages) + return; - if (stack_record) - refcount_dec(&stack_record->count); + refcount_dec(&stack_record->count); } void __reset_page_owner(struct page *page, unsigned short order) @@ -250,7 +284,7 @@ void __reset_page_owner(struct page *pag * the machinery is not ready yet, we cannot decrement * their refcount either. */ - dec_stack_record_count(alloc_handle); + dec_stack_record_count(alloc_handle, 1UL << order); } static inline void __set_page_owner_handle(struct page_ext *page_ext, @@ -292,7 +326,7 @@ noinline void __set_page_owner(struct pa return; __set_page_owner_handle(page_ext, handle, order, gfp_mask); page_ext_put(page_ext); - inc_stack_record_count(handle, gfp_mask); + inc_stack_record_count(handle, gfp_mask, 1UL << order); } void __set_page_owner_migrate_reason(struct page *page, int reason) @@ -856,6 +890,7 @@ static int stack_print(struct seq_file * struct stack *stack = v; unsigned long *entries; unsigned long nr_entries; + unsigned long nr_base_pages; struct stack_record *stack_record = stack->stack_record; if (!stack->stack_record) @@ -863,6 +898,7 @@ static int stack_print(struct seq_file * nr_entries = stack_record->size; entries = stack_record->entries; + nr_base_pages = stack_record->nr_base_pages; stack_count = refcount_read(&stack_record->count) - 1; if (stack_count < 1 || stack_count < page_owner_stack_threshold) @@ -870,7 +906,8 @@ static int stack_print(struct seq_file * for (i = 0; i < nr_entries; i++) seq_printf(m, " %pS\n", (void *)entries[i]); - seq_printf(m, "stack_count: %d\n\n", stack_count); + seq_printf(m, "stack_count: %d curr_nr_base_pages: %lu\n\n", + stack_count, nr_base_pages); return 0; } _ Patches currently in -mm which might be from osalvador@suse.de are mmpage_owner-fix-refcount-imbalance.patch
Linus, please merge the non-MM updates for this cycle. I'm seeing conflicts in Documentation/process/changes.rst, arch/riscv/include/asm/ftrace.h and fs/ocfs2/super.c. I don't have a record of Stephen hitting these, but all are simple. Thanks. The following changes since commit b401b621758e46812da61fa58a67c3fd8d91de0d: Linux 6.8-rc5 (2024-02-18 12:56:25 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm tags/mm-nonmm-stable-2024-03-14-09-36 for you to fetch changes up to 269cdf353b5bdd15f1a079671b0f889113865f20: nilfs2: prevent kernel bug at submit_bh_wbc() (2024-03-14 09:17:30 -0700) ---------------------------------------------------------------- - Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min heap optimizations". - Kuan-Wei Chiu has also sped up the library sorting code in the series "lib/sort: Optimize the number of swaps and comparisons". - Alexey Gladkov has added the ability for code running within an IPC namespace to alter its IPC and MQ limits. The series is "Allow to change ipc/mq sysctls inside ipc namespace". - Geert Uytterhoeven has contributed some dhrystone maintenance work in the series "lib: dhry: miscellaneous cleanups". - Ryusuke Konishi continues nilfs2 maintenance work in the series "nilfs2: eliminate kmap and kmap_atomic calls" "nilfs2: fix kernel bug at submit_bh_wbc()" - Nathan Chancellor has updated our build tools requirements in the series "Bump the minimum supported version of LLVM to 13.0.1". - Muhammad Usama Anjum continues with the selftests maintenance work in the series "selftests/mm: Improve run_vmtests.sh". - Oleg Nesterov has done some maintenance work against the signal code in the series "get_signal: minor cleanups and fix". Plus the usual shower of singleton patches in various parts of the tree. Please see the individual changelogs for details. 
---------------------------------------------------------------- Ahelenia Ziemiańska (1): Normalise "name (ad@dr)" MODULE_AUTHORs to "name <ad@dr>" Alexey Dobriyan (1): smp: make __smp_processor_id() 0-argument macro Alexey Gladkov (3): sysctl: allow change system v ipc sysctls inside ipc namespace docs: add information about ipc sysctls limitations sysctl: allow to change limits for posix messages queues Andy Shevchenko (1): dyndbg: replace kstrdup() + strchr() with kstrdup_and_replace() Baoquan He (1): panic: suppress gnu_printf warning Chengming Zhou (1): ocfs2: remove SLAB_MEM_SPREAD flag usage Feng Tang (1): panic: add option to dump blocked tasks in panic_print Geert Uytterhoeven (4): lib: dhry: remove unneeded <linux/mutex.h> lib: dhry: use ktime_ms_delta() helper lib: dhry: add missing closing parenthesis init: remove obsolete arch_call_rest_init() wrapper Jan Kara (1): fat: fix uninitialized field in nostale filehandles Kemeng Shi (1): flex_proportions: remove unused fprop_local_single Kuan-Wei Chiu (4): lib min_heap: optimize number of calls to min_heapify() lib min_heap: optimize number of comparisons in min_heapify() lib/sort: optimize heapsort for equal elements in sift-down path lib/sort: optimize heapsort with double-pop variation Li zeming (1): user_namespace: remove unnecessary NULL values from kbuf Matthew Wilcox (Oracle) (1): bounds: support non-power-of-two CONFIG_NR_CPUS Muhammad Usama Anjum (5): selftests/mm: hugetlb_reparenting_test: do not unmount selftests/mm: run_vmtests: remove sudo and conform to tap selftests/mm: save and restore nr_hugepages value selftests/mm: protection_keys: save/restore nr_hugepages settings selftests/mm: run_vmtests.sh: add missing tests Nathan Chancellor (13): arch and include: update LLVM Phabricator links treewide: update LLVM Bugzilla links kbuild: raise the minimum supported version of LLVM to 13.0.1 Makefile: drop warn-stack-size plugin opt x86: drop stack-alignment plugin opt ARM: remove Thumb2 __builtin_thread_pointer workaround for Clang arm64: Kconfig: clean up tautological LLVM version checks powerpc: Kconfig: remove tautology in CONFIG_COMPAT riscv: remove MCOUNT_NAME workaround riscv: Kconfig: remove version dependency from CONFIG_CLANG_SUPPORTS_DYNAMIC_FTRACE fortify: drop Clang version check for 12.0.1 or newer lib/Kconfig.debug: update Clang version check in CONFIG_KCOV compiler-clang.h: update __diag_clang() macros for minimum version bump Oleg Nesterov (4): ptrace_attach: shift send(SIGSTOP) into ptrace_set_stopped() get_signal: don't abuse ksig->info.si_signo and ksig->sig get_signal: hide_si_addr_tag_bits: fix the usage of uninitialized ksig get_signal: don't initialize ksig->info if SIGNAL_GROUP_EXIT/group_exec_task Peng Hao (1): buildid: use kmap_local_page() Pierre Gondois (3): list: add hlist_count_nodes() binder: use of hlist_count_nodes() bcache: use of hlist_count_nodes() Randy Dunlap (1): lib/win_minmax: fix header comments Ricardo B. 
Marliere (2): const_structs.checkpatch: add bus_type const_structs.checkpatch: add device_type Roman Smirnov (1): assoc_array: fix the return value in assoc_array_insert_mid_shortcut() Ryusuke Konishi (18): nilfs2: convert recovery logic to use kmap_local nilfs2: convert segment buffer to use kmap_local nilfs2: convert nilfs_copy_buffer() to use kmap_local nilfs2: convert metadata file common code to use kmap_local nilfs2: convert sufile to use kmap_local nilfs2: convert persistent object allocator to use kmap_local nilfs2: convert DAT to use kmap_local nilfs2: move nilfs_bmap_write call out of nilfs_write_inode_common nilfs2: do not acquire rwsem in nilfs_bmap_write() nilfs2: convert ifile to use kmap_local nilfs2: localize highmem mapping for checkpoint creation within cpfile nilfs2: localize highmem mapping for checkpoint finalization within cpfile nilfs2: localize highmem mapping for checkpoint reading within cpfile nilfs2: remove nilfs_cpfile_{get,put}_checkpoint() nilfs2: convert cpfile to use kmap_local nilfs2: MAINTAINERS: drop unreachable project mirror site nilfs2: fix failure to detect DAT corruption in btree and direct mappings nilfs2: prevent kernel bug at submit_bh_wbc() Su Yue (1): ocfs2: enable ocfs2_listxattr for special files Thomas Weißschuh (1): watchdog/core: remove sysctl handlers from public header Thorsten Blum (1): nilfs2: use div64_ul() instead of do_div() Uwe Kleine-König (1): mul_u64_u64_div_u64: increase precision by conditionally swapping a and b Wei Yang (1): list: leverage list_is_head() for list_entry_is_head() Wen Yang (1): selftests: add eventfd selftests Yongzhen Zhang (1): ocfs2: spelling fix yang.zhang (1): kexec: copy only happens before uchunk goes to zero Documentation/admin-guide/kernel-parameters.txt | 1 + Documentation/admin-guide/sysctl/kernel.rst | 15 +- Documentation/process/changes.rst | 2 +- MAINTAINERS | 1 - Makefile | 8 - arch/arm/include/asm/current.h | 8 +- arch/arm64/Kconfig | 9 +- arch/powerpc/Kconfig | 1 - arch/powerpc/Makefile | 4 +- arch/powerpc/kvm/book3s_hv_nested.c | 2 +- arch/riscv/Kconfig | 4 +- arch/riscv/include/asm/ftrace.h | 14 +- arch/riscv/kernel/mcount.S | 10 +- arch/s390/include/asm/ftrace.h | 2 +- arch/sparc/kernel/chmc.c | 2 +- arch/sparc/kernel/ds.c | 2 +- arch/x86/Makefile | 6 - arch/x86/power/Makefile | 2 +- crypto/blake2b_generic.c | 2 +- drivers/android/binder.c | 4 +- drivers/block/sunvdc.c | 2 +- drivers/char/hw_random/n2-drv.c | 2 +- drivers/char/tpm/st33zp24/i2c.c | 2 +- drivers/char/tpm/st33zp24/spi.c | 2 +- drivers/char/tpm/st33zp24/st33zp24.c | 2 +- drivers/char/tpm/tpm-interface.c | 2 +- drivers/char/tpm/tpm_atmel.c | 2 +- drivers/char/tpm/tpm_i2c_nuvoton.c | 2 +- drivers/char/tpm/tpm_nsc.c | 2 +- drivers/char/tpm/tpm_tis.c | 2 +- drivers/char/tpm/tpm_tis_core.c | 2 +- drivers/char/tpm/tpm_vtpm_proxy.c | 2 +- drivers/crypto/n2_core.c | 2 +- drivers/firmware/efi/libstub/Makefile | 2 +- drivers/gpu/drm/amd/amdgpu/sdma_v4_4_2.c | 2 +- drivers/hwmon/dell-smm-hwmon.c | 2 +- drivers/hwmon/ultra45_env.c | 2 +- drivers/i2c/muxes/i2c-mux-mlxcpld.c | 2 +- drivers/leds/leds-sunfire.c | 2 +- drivers/md/bcache/sysfs.c | 8 +- drivers/media/common/siano/smscoreapi.c | 2 +- drivers/media/common/siano/smsdvb-main.c | 2 +- drivers/media/dvb-frontends/cx24117.c | 2 +- drivers/media/test-drivers/vicodec/codec-fwht.c | 2 +- drivers/media/usb/siano/smsusb.c | 2 +- drivers/net/ethernet/broadcom/tg3.c | 2 +- drivers/net/ethernet/sun/cassini.c | 2 +- drivers/net/ethernet/sun/niu.c | 2 +- drivers/net/ethernet/sun/sunhme.c | 2 +- 
drivers/net/ethernet/sun/sunvnet.c | 2 +- drivers/net/ethernet/sun/sunvnet_common.c | 2 +- drivers/net/ppp/pptp.c | 2 +- drivers/platform/x86/compal-laptop.c | 2 +- drivers/platform/x86/intel/oaktrail.c | 2 +- drivers/platform/x86/mlx-platform.c | 2 +- drivers/regulator/Kconfig | 2 +- drivers/s390/net/fsm.c | 2 +- drivers/sbus/char/openprom.c | 2 +- drivers/scsi/esp_scsi.c | 2 +- drivers/scsi/jazz_esp.c | 2 +- drivers/scsi/mesh.c | 2 +- drivers/scsi/qlogicpti.c | 2 +- drivers/scsi/sun3x_esp.c | 2 +- drivers/scsi/sun_esp.c | 2 +- drivers/video/fbdev/hgafb.c | 2 +- fs/fat/nfs.c | 6 + fs/nilfs2/alloc.c | 91 +++--- fs/nilfs2/bmap.c | 3 - fs/nilfs2/btree.c | 9 +- fs/nilfs2/cpfile.c | 321 ++++++++++++++------- fs/nilfs2/cpfile.h | 10 +- fs/nilfs2/dat.c | 40 +-- fs/nilfs2/direct.c | 9 +- fs/nilfs2/ifile.c | 21 +- fs/nilfs2/ifile.h | 10 +- fs/nilfs2/inode.c | 46 ++- fs/nilfs2/ioctl.c | 4 +- fs/nilfs2/mdt.c | 4 +- fs/nilfs2/nilfs.h | 3 +- fs/nilfs2/page.c | 8 +- fs/nilfs2/recovery.c | 4 +- fs/nilfs2/segbuf.c | 4 +- fs/nilfs2/segment.c | 121 +++----- fs/nilfs2/sufile.c | 88 +++--- fs/nilfs2/super.c | 33 +-- fs/nilfs2/the_nilfs.c | 2 +- fs/ocfs2/dlmfs/dlmfs.c | 2 +- fs/ocfs2/dlmglue.c | 2 +- fs/ocfs2/file.c | 1 + fs/ocfs2/super.c | 7 +- include/asm-generic/vmlinux.lds.h | 2 +- include/linux/compiler-clang.h | 10 +- include/linux/flex_proportions.h | 32 -- include/linux/list.h | 17 +- include/linux/min_heap.h | 44 +-- include/linux/nmi.h | 7 - include/linux/smp.h | 2 +- include/linux/start_kernel.h | 2 - include/linux/win_minmax.h | 4 +- init/main.c | 9 +- ipc/ipc_sysctl.c | 37 ++- ipc/mq_sysctl.c | 36 +++ kernel/bounds.c | 2 +- kernel/kexec_core.c | 44 +-- kernel/panic.c | 9 + kernel/ptrace.c | 13 +- kernel/signal.c | 28 +- kernel/user_namespace.c | 2 +- kernel/watchdog.c | 22 +- lib/Kconfig.debug | 4 +- lib/Kconfig.kasan | 2 +- lib/assoc_array.c | 2 +- lib/buildid.c | 4 +- lib/dhry_1.c | 2 +- lib/dhry_run.c | 1 - lib/dynamic_debug.c | 7 +- lib/flex_proportions.c | 77 ----- lib/math/div64.c | 15 + lib/raid6/Makefile | 2 +- lib/sort.c | 20 +- lib/stackinit_kunit.c | 2 +- mm/slab_common.c | 2 +- net/bridge/br_multicast.c | 2 +- net/ipv4/gre_demux.c | 2 +- net/ipv6/ip6_gre.c | 2 +- net/iucv/iucv.c | 2 +- net/mpls/mpls_gso.c | 2 +- scripts/const_structs.checkpatch | 2 + scripts/min-tool-version.sh | 2 +- scripts/recordmcount.pl | 2 +- security/Kconfig | 2 - tools/objtool/noreturns.h | 1 - .../selftests/filesystems/eventfd/.gitignore | 2 + .../testing/selftests/filesystems/eventfd/Makefile | 7 + .../selftests/filesystems/eventfd/eventfd_test.c | 186 ++++++++++++ tools/testing/selftests/mm/Makefile | 5 + .../selftests/mm/charge_reserved_hugetlb.sh | 4 + .../selftests/mm/hugetlb_reparenting_test.sh | 9 +- tools/testing/selftests/mm/on-fault-limit.c | 36 ++- tools/testing/selftests/mm/protection_keys.c | 34 +++ tools/testing/selftests/mm/run_vmtests.sh | 17 +- 141 files changed, 1055 insertions(+), 770 deletions(-) create mode 100644 tools/testing/selftests/filesystems/eventfd/.gitignore create mode 100644 tools/testing/selftests/filesystems/eventfd/Makefile create mode 100644 tools/testing/selftests/filesystems/eventfd/eventfd_test.c