* incoming
@ 2021-11-05 20:34 Andrew Morton
  2021-11-05 20:34 ` [patch 001/262] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
                   ` (261 more replies)
  0 siblings, 262 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

262 patches, based on 8bb7eca972ad531c9b149c0a51ab43a417385813

Subsystems affected by this patch series:

  scripts
  ocfs2
  vfs
  mm/slab-generic
  mm/slab
  mm/slub
  mm/kconfig
  mm/dax
  mm/kasan
  mm/debug
  mm/pagecache
  mm/gup
  mm/swap
  mm/memcg
  mm/pagemap
  mm/mprotect
  mm/mremap
  mm/iomap
  mm/tracing
  mm/vmalloc
  mm/pagealloc
  mm/memory-failure
  mm/hugetlb
  mm/userfaultfd
  mm/vmscan
  mm/tools
  mm/memblock
  mm/oom-kill
  mm/hugetlbfs
  mm/migration
  mm/thp
  mm/readahead
  mm/nommu
  mm/ksm
  mm/vmstat
  mm/madvise
  mm/memory-hotplug
  mm/rmap
  mm/zsmalloc
  mm/highmem
  mm/zram
  mm/cleanups
  mm/kfence
  mm/damon

Subsystem: scripts

    Colin Ian King <colin.king@canonical.com>:
      scripts/spelling.txt: add more spellings to spelling.txt

    Sven Eckelmann <sven@narfation.org>:
      scripts/spelling.txt: fix "mistake" version of "synchronization"

    weidonghui <weidonghui@allwinnertech.com>:
      scripts/decodecode: fix faulting instruction no print when opps.file is DOS format

Subsystem: ocfs2

    Chenyuan Mi <cymi20@fudan.edu.cn>:
      ocfs2: fix handle refcount leak in two exception handling paths

    Valentin Vidic <vvidic@valentin-vidic.from.hr>:
      ocfs2: cleanup journal init and shutdown

    Colin Ian King <colin.king@canonical.com>:
      ocfs2/dlm: remove redundant assignment of variable ret

    Jan Kara <jack@suse.cz>:
    Patch series "ocfs2: Truncate data corruption fix":
      ocfs2: fix data corruption on truncate
      ocfs2: do not zero pages beyond i_size

Subsystem: vfs

    Arnd Bergmann <arnd@arndb.de>:
      fs/posix_acl.c: avoid -Wempty-body warning

    Jia He <justin.he@arm.com>:
      d_path: fix Kernel doc validator complaining

Subsystem: mm/slab-generic

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm: move kvmalloc-related functions to slab.h

Subsystem: mm/slab

    Shi Lei <shi_lei@massclouds.com>:
      mm/slab.c: remove useless lines in enable_cpucache()

Subsystem: mm/slub

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      slub: add back check for free nonslab objects

    Vlastimil Babka <vbabka@suse.cz>:
      mm, slub: change percpu partial accounting from objects to pages
      mm/slub: increase default cpu partial list sizes

    Hyeonggon Yoo <42.hyeyoo@gmail.com>:
      mm, slub: use prefetchw instead of prefetch

Subsystem: mm/kconfig

    Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
      mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT

Subsystem: mm/dax

    Christoph Hellwig <hch@lst.de>:
      mm: don't include <linux/dax.h> in <linux/mempolicy.h>

Subsystem: mm/kasan

    Marco Elver <elver@google.com>:
    Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot slabs when holding raw_spin_lock", v2:
      lib/stackdepot: include gfp.h
      lib/stackdepot: remove unused function argument
      lib/stackdepot: introduce __stack_depot_save()
      kasan: common: provide can_alloc in kasan_save_stack()
      kasan: generic: introduce kasan_record_aux_stack_noalloc()
      workqueue, kasan: avoid alloc_pages() when recording stack

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      kasan: fix tag for large allocations when using CONFIG_SLAB

    Peter Collingbourne <pcc@google.com>:
      kasan: test: add memcpy test that avoids out-of-bounds write

Subsystem: mm/debug

    Peter Xu <peterx@redhat.com>:
    Patch series "mm/smaps: Fixes and optimizations on shmem swap handling":
      mm/smaps: fix shmem pte hole swap calculation
      mm/smaps: use vma->vm_pgoff directly when counting partial swap
      mm/smaps: simplify shmem handling of pte holes

    Guo Ren <guoren@linux.alibaba.com>:
      mm: debug_vm_pgtable: don't use __P000 directly

    Kees Cook <keescook@chromium.org>:
      kasan: test: bypass __alloc_size checks
    Patch series "Add __alloc_size()", v3:
      rapidio: avoid bogus __alloc_size warning
      Compiler Attributes: add __alloc_size() for better bounds checking
      slab: clean up function prototypes
      slab: add __alloc_size attributes for better bounds checking
      mm/kvmalloc: add __alloc_size attributes for better bounds checking
      mm/vmalloc: add __alloc_size attributes for better bounds checking
      mm/page_alloc: add __alloc_size attributes for better bounds checking
      percpu: add __alloc_size attributes for better bounds checking

    Yinan Zhang <zhangyinan2019@email.szu.edu.cn>:
      mm/page_ext.c: fix a comment

Subsystem: mm/pagecache

    David Howells <dhowells@redhat.com>:
      mm: stop filemap_read() from grabbing a superfluous page

    Christoph Hellwig <hch@lst.de>:
    Patch series "simplify bdi unregistation":
      mm: export bdi_unregister
      mtd: call bdi_unregister explicitly
      fs: explicitly unregister per-superblock BDIs
      mm: don't automatically unregister bdis
      mm: simplify bdi refcounting

    Jens Axboe <axboe@kernel.dk>:
      mm: don't read i_size of inode unless we need it

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm/filemap.c: remove bogus VM_BUG_ON

    Jens Axboe <axboe@kernel.dk>:
      mm: move more expensive part of XA setup out of mapping check

Subsystem: mm/gup

    John Hubbard <jhubbard@nvidia.com>:
      mm/gup: further simplify __gup_device_huge()

Subsystem: mm/swap

    Xu Wang <vulab@iscas.ac.cn>:
      mm/swapfile: remove needless request_queue NULL pointer check

    Rafael Aquini <aquini@redhat.com>:
      mm/swapfile: fix an integer overflow in swap_show()

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm: optimise put_pages_list()

Subsystem: mm/memcg

    Peter Xu <peterx@redhat.com>:
      mm/memcg: drop swp_entry_t* in mc_handle_file_pte()

    Shakeel Butt <shakeelb@google.com>:
      memcg: flush stats only if updated
      memcg: unify memcg stat flushing

    Waiman Long <longman@redhat.com>:
      mm/memcg: remove obsolete memcg_free_kmem()

    Len Baker <len.baker@gmx.com>:
      mm/list_lru.c: prefer struct_size over open coded arithmetic

    Shakeel Butt <shakeelb@google.com>:
      memcg, kmem: further deprecate kmem.limit_in_bytes

    Muchun Song <songmuchun@bytedance.com>:
      mm: list_lru: remove holding lru lock
      mm: list_lru: fix the return value of list_lru_count_one()
      mm: memcontrol: remove kmemcg_id reparenting
      mm: memcontrol: remove the kmem states
      mm: list_lru: only add memcg-aware lrus to the global lru list

    Vasily Averin <vvs@virtuozzo.com>:
    Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3:
      mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks

    Michal Hocko <mhocko@suse.com>:
      mm, oom: do not trigger out_of_memory from the #PF

    Vasily Averin <vvs@virtuozzo.com>:
      memcg: prohibit unconditional exceeding the limit of dying tasks

Subsystem: mm/pagemap

    Peng Liu <liupeng256@huawei.com>:
      mm/mmap.c: fix a data race of mm->total_vm

    Rolf Eike Beer <eb@emlix.com>:
      mm: use __pfn_to_section() instead of open coding it

    Amit Daniel Kachhap <amit.kachhap@arm.com>:
      mm/memory.c: avoid unnecessary kernel/user pointer conversion

    Nadav Amit <namit@vmware.com>:
      mm/memory.c: use correct VMA flags when freeing page-tables

    Peter Xu <peterx@redhat.com>:
    Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4:
      mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte
      mm: clear vmf->pte after pte_unmap_same() returns
      mm: drop first_index/last_index in zap_details
      mm: add zap_skip_check_mapping() helper

    Qi Zheng <zhengqi.arch@bytedance.com>:
    Patch series "Do some code cleanups related to mm", v3:
      mm: introduce pmd_install() helper
      mm: remove redundant smp_wmb()

    Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>:
      Documentation: update pagemap with shmem exceptions

    Nicholas Piggin <npiggin@gmail.com>:
    Patch series "shoot lazy tlbs", v4:
      lazy tlb: introduce lazy mm refcount helper functions
      lazy tlb: allow lazy tlb mm refcounting to be configurable
      lazy tlb: shoot lazies, a non-refcounting lazy tlb option
      powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

    Lukas Bulwahn <lukas.bulwahn@gmail.com>:
      memory: remove unused CONFIG_MEM_BLOCK_SIZE

Subsystem: mm/mprotect

    Liu Song <liu.song11@zte.com.cn>:
      mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey()

Subsystem: mm/mremap

    Dmitry Safonov <dima@arista.com>:
      mm/mremap: don't account pages in vma_to_resize()

Subsystem: mm/iomap

    Lucas De Marchi <lucas.demarchi@intel.com>:
      include/linux/io-mapping.h: remove fallback for writecombine

Subsystem: mm/tracing

    Gang Li <ligang.bdlg@bytedance.com>:
      mm: mmap_lock: remove redundant newline  in TP_printk
      mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN

Subsystem: mm/vmalloc

    Vasily Averin <vvs@virtuozzo.com>:
      mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()

    Peter Zijlstra <peterz@infradead.org>:
      mm/vmalloc: don't allow VM_NO_GUARD on vmap()

    Eric Dumazet <edumazet@google.com>:
      mm/vmalloc: make show_numa_info() aware of hugepage mappings
      mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo

    "Uladzislau Rezki (Sony)" <urezki@gmail.com>:
      mm/vmalloc: do not adjust the search size for alignment overhead
      mm/vmalloc: check various alignments when debugging

    Vasily Averin <vvs@virtuozzo.com>:
      vmalloc: back off when the current task is OOM-killed

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      vmalloc: choose a better start address in vm_area_register_early()
      arm64: support page mapping percpu first chunk allocator
      kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC

    Michal Hocko <mhocko@suse.com>:
      mm/vmalloc: be more explicit about supported gfp flags

    Chen Wandun <chenwandun@huawei.com>:
      mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

    Changcheng Deng <deng.changcheng@zte.com.cn>:
      lib/test_vmalloc.c: use swap() to make code cleaner

Subsystem: mm/pagealloc

    Eric Dumazet <edumazet@google.com>:
      mm/large system hash: avoid possible NULL deref in alloc_large_system_hash

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanups and fixup for page_alloc", v2:
      mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()
      mm/page_alloc.c: simplify the code by using macro K()
      mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()
      mm/page_alloc.c: use helper function zone_spans_pfn()
      mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid]

    Bharata B Rao <bharata@amd.com>:
    Patch series "Fix NUMA nodes fallback list ordering":
      mm/page_alloc: print node fallback order

    Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>:
      mm/page_alloc: use accumulated load when building node fallback list

    Geert Uytterhoeven <geert+renesas@glider.be>:
    Patch series "Fix NUMA without SMP":
      mm: move node_reclaim_distance to fix NUMA without SMP
      mm: move fold_vm_numa_events() to fix NUMA without SMP

    Eric Dumazet <edumazet@google.com>:
      mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()

    Feng Tang <feng.tang@intel.com>:
      mm/page_alloc: detect allocation forbidden by cpuset and bail out early

    Liangcai Fan <liangcaifan19@gmail.com>:
      mm/page_alloc.c: show watermark_boost of zone in zoneinfo

    Christophe Leroy <christophe.leroy@csgroup.eu>:
      mm: create a new system state and fix core_kernel_text()
      mm: make generic arch_is_kernel_initmem_freed() do what it says
      powerpc: use generic version of arch_is_kernel_initmem_freed()
      s390: use generic version of arch_is_kernel_initmem_freed()

    Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
      mm: page_alloc: use migrate_disable() in drain_local_pages_wq()

    Wang ShaoBo <bobo.shaobowang@huawei.com>:
      mm/page_alloc: use clamp() to simplify code

Subsystem: mm/memory-failure

    Marco Elver <elver@google.com>:
      mm: fix data race in PagePoisoned()

    Rikard Falkeborn <rikard.falkeborn@gmail.com>:
      mm/memory_failure: constify static mm_walk_ops

    Yang Shi <shy828301@gmail.com>:
    Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5:
      mm: filemap: coding style cleanup for filemap_map_pmd()
      mm: hwpoison: refactor refcount check handling
      mm: shmem: don't truncate page if memory failure happens
      mm: hwpoison: handle non-anonymous THP correctly

Subsystem: mm/hugetlb

    Peter Xu <peterx@redhat.com>:
      mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h

    Mike Kravetz <mike.kravetz@oracle.com>:
    Patch series "hugetlb: add demote/split page functionality", v4:
      hugetlb: add demote hugetlb page sysfs interfaces
      mm/cma: add cma_pages_valid to determine if pages are in CMA
      hugetlb: be sure to free demoted CMA pages to CMA
      hugetlb: add demote bool to gigantic page routines
      hugetlb: add hugetlb demote page support

    Liangcai Fan <liangcaifan19@gmail.com>:
      mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged

    Mina Almasry <almasrymina@google.com>:
      mm, hugepages: add mremap() support for hugepage backed vma
      mm, hugepages: add hugetlb vma mremap() test

    Baolin Wang <baolin.wang@linux.alibaba.com>:
      hugetlb: support node specified when using cma for gigantic hugepages

    Ran Jianping <ran.jianping@zte.com.cn>:
      mm: remove duplicate include in hugepage-mremap.c

    Baolin Wang <baolin.wang@linux.alibaba.com>:
    Patch series "Some cleanups and improvements for hugetlb":
      hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro
      hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments
      hugetlb: remove redundant validation in has_same_uncharge_info()
      hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range()

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page

Subsystem: mm/userfaultfd

    Axel Rasmussen <axelrasmussen@google.com>:
    Patch series "Small userfaultfd selftest fixups", v2:
      userfaultfd/selftests: don't rely on GNU extensions for random numbers
      userfaultfd/selftests: fix feature support detection
      userfaultfd/selftests: fix calculation of expected ioctls

Subsystem: mm/vmscan

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/page_isolation: fix potential missing call to unset_migratetype_isolate()
      mm/page_isolation: guard against possible putback unisolated page

    Kai Song <songkai01@inspur.com>:
      mm/vmscan.c: fix -Wunused-but-set-variable warning

    Mel Gorman <mgorman@techsingularity.net>:
    Patch series "Remove dependency on congestion_wait in mm/", v5. Patch series:
      mm/vmscan: throttle reclaim until some writeback completes if congested
      mm/vmscan: throttle reclaim and compaction when too may pages are isolated
      mm/vmscan: throttle reclaim when no progress is being made
      mm/writeback: throttle based on page writeback instead of congestion
      mm/page_alloc: remove the throttling logic from the page allocator
      mm/vmscan: centralise timeout values for reclaim_throttle
      mm/vmscan: increase the timeout if page reclaim is not making progress
      mm/vmscan: delay waking of tasks throttled on NOPROGRESS

    Yuanzheng Song <songyuanzheng@huawei.com>:
      mm/vmpressure: fix data-race with memcg->socket_pressure

Subsystem: mm/tools

    Zhenliang Wei <weizhenliang@huawei.com>:
      tools/vm/page_owner_sort.c: count and sort by mem

    Naoya Horiguchi <naoya.horiguchi@nec.com>:
    Patch series "tools/vm/page-types.c: a few improvements":
      tools/vm/page-types.c: make walk_file() aware of address range option
      tools/vm/page-types.c: move show_file() to summary output
      tools/vm/page-types.c: print file offset in hexadecimal

Subsystem: mm/memblock

    Mike Rapoport <rppt@linux.ibm.com>:
    Patch series "memblock: cleanup memblock_free interface", v2:
      arch_numa: simplify numa_distance allocation
      xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
      memblock: drop memblock_free_early_nid() and memblock_free_early()
      memblock: stop aliasing __memblock_free_late with memblock_free_late
      memblock: rename memblock_free to memblock_phys_free
      memblock: use memblock_free for freeing virtual pointers

Subsystem: mm/oom-kill

    Sultan Alsawaf <sultan@kerneltoast.com>:
      mm: mark the OOM reaper thread as freezable

Subsystem: mm/hugetlbfs

    Zhenguo Yao <yaozhenguo1@gmail.com>:
      hugetlbfs: extend the definition of hugepages parameter to support node allocation

Subsystem: mm/migration

    John Hubbard <jhubbard@nvidia.com>:
      mm/migrate: de-duplicate migrate_reason strings

    Yang Shi <shy828301@gmail.com>:
      mm: migrate: make demotion knob depend on migration

Subsystem: mm/thp

    "George G. Davis" <davis.george@siemens.com>:
      selftests/vm/transhuge-stress: fix ram size thinko

    Rongwei Wang <rongwei.wang@linux.alibaba.com>:
    Patch series "fix two bugs for file THP":
      mm, thp: lock filemap when truncating page cache
      mm, thp: fix incorrect unmap behavior for private pages

Subsystem: mm/readahead

    Lin Feng <linf@wangsu.com>:
      mm/readahead.c: fix incorrect comments for get_init_ra_size

Subsystem: mm/nommu

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      mm: nommu: kill arch_get_unmapped_area()

Subsystem: mm/ksm

    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>:
      selftest/vm: fix ksm selftest to run with different NUMA topologies

    Pedro Demarchi Gomes <pedrodemargomes@gmail.com>:
      selftests: vm: add KSM huge pages merging time test

Subsystem: mm/vmstat

    Liu Shixin <liushixin2@huawei.com>:
      mm/vmstat: annotate data race for zone->free_area[order].nr_free

    Lin Feng <linf@wangsu.com>:
      mm: vmstat.c: make extfrag_index show more pretty

Subsystem: mm/madvise

    David Hildenbrand <david@redhat.com>:
      selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers

Subsystem: mm/memory-hotplug

    Tang Yizhou <tangyizhou@huawei.com>:
      mm/memory_hotplug: add static qualifier for online_policy_to_str()

    David Hildenbrand <david@redhat.com>:
    Patch series "memory-hotplug.rst: document the "auto-movable" online policy":
      memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node"
      memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path
      memory-hotplug.rst: document the "auto-movable" online policy
    Patch series "mm/memory_hotplug: Kconfig and 32 bit cleanups":
      mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG
      mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
      mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit
      mm/memory_hotplug: remove HIGHMEM leftovers
      mm/memory_hotplug: remove stale function declarations
      x86: remove memory hotplug support on X86_32
    Patch series "mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK", v2:
      mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()
      memblock: improve MEMBLOCK_HOTPLUG documentation
      memblock: allow to specify flags with memblock_add_node()
      memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
      mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED

Subsystem: mm/rmap

    Alistair Popple <apopple@nvidia.com>:
      mm/rmap.c: avoid double faults migrating device private pages

Subsystem: mm/zsmalloc

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()

Subsystem: mm/highmem

    Ira Weiny <ira.weiny@intel.com>:
      mm/highmem: remove deprecated kmap_atomic

Subsystem: mm/zram

    Jaewon Kim <jaewon31.kim@samsung.com>:
      zram_drv: allow reclaim on bio_alloc

    Dan Carpenter <dan.carpenter@oracle.com>:
      zram: off by one in read_block_state()

    Brian Geffon <bgeffon@google.com>:
      zram: introduce an aged idle interface

Subsystem: mm/cleanups

    Stephen Kitt <steve@sk2.org>:
      mm: remove HARDENED_USERCOPY_FALLBACK

    Mianhan Liu <liumh1@shanghaitech.edu.cn>:
      include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      stacktrace: move filter_irq_stacks() to kernel/stacktrace.c
      kfence: count unexpectedly skipped allocations
      kfence: move saving stack trace of allocations into __kfence_alloc()
      kfence: limit currently covered allocations when pool nearly full
      kfence: add note to documentation about skipping covered allocations
      kfence: test: use kunit_skip() to skip tests
      kfence: shorten critical sections of alloc/free
      kfence: always use static branches to guard kfence_alloc()
      kfence: default to dynamic branch instead of static keys mode

Subsystem: mm/damon

    Geert Uytterhoeven <geert@linux-m68k.org>:
      mm/damon: grammar s/works/work/

    SeongJae Park <sjpark@amazon.de>:
      Documentation/vm: move user guides to admin-guide/mm/

    SeongJae Park <sj@kernel.org>:
      MAINTAINERS: update SeongJae's email address

    SeongJae Park <sjpark@amazon.de>:
      docs/vm/damon: remove broken reference
      include/linux/damon.h: fix kernel-doc comments for 'damon_callback'

    SeongJae Park <sj@kernel.org>:
      mm/damon/core: print kdamond start log in debug mode only

    Changbin Du <changbin.du@gmail.com>:
      mm/damon: remove unnecessary do_exit() from kdamond
      mm/damon: needn't hold kdamond_lock to print pid of kdamond

    Colin Ian King <colin.king@canonical.com>:
      mm/damon/core: nullify pointer ctx->kdamond with a NULL

    SeongJae Park <sj@kernel.org>:
    Patch series "Implement Data Access Monitoring-based Memory Operation Schemes":
      mm/damon/core: account age of target regions
      mm/damon/core: implement DAMON-based Operation Schemes (DAMOS)
      mm/damon/vaddr: support DAMON-based Operation Schemes
      mm/damon/dbgfs: support DAMON-based Operation Schemes
      mm/damon/schemes: implement statistics feature
      selftests/damon: add 'schemes' debugfs tests
      Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes
    Patch series "DAMON: Support Physical Memory Address Space Monitoring::
      mm/damon/dbgfs: allow users to set initial monitoring target regions
      mm/damon/dbgfs-test: add a unit test case for 'init_regions'
      Docs/admin-guide/mm/damon: document 'init_regions' feature
      mm/damon/vaddr: separate commonly usable functions
      mm/damon: implement primitives for physical address space monitoring
      mm/damon/dbgfs: support physical memory monitoring
      Docs/DAMON: document physical memory monitoring support

    Rikard Falkeborn <rikard.falkeborn@gmail.com>:
      mm/damon/vaddr: constify static mm_walk_ops

    Rongwei Wang <rongwei.wang@linux.alibaba.com>:
      mm/damon/dbgfs: remove unnecessary variables

    SeongJae Park <sj@kernel.org>:
      mm/damon/paddr: support the pageout scheme
      mm/damon/schemes: implement size quota for schemes application speed control
      mm/damon/schemes: skip already charged targets and regions
      mm/damon/schemes: implement time quota
      mm/damon/dbgfs: support quotas of schemes
      mm/damon/selftests: support schemes quotas
      mm/damon/schemes: prioritize regions within the quotas
      mm/damon/vaddr,paddr: support pageout prioritization
      mm/damon/dbgfs: support prioritization weights
      tools/selftests/damon: update for regions prioritization of schemes
      mm/damon/schemes: activate schemes based on a watermarks mechanism
      mm/damon/dbgfs: support watermarks
      selftests/damon: support watermarks
      mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
      Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM

    Xin Hao <xhao@linux.alibaba.com>:
    Patch series "mm/damon: Fix some small bugs", v4:
      mm/damon: remove unnecessary variable initialization
      mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on

    SeongJae Park <sj@kernel.org>:
    Patch series "Fix trivial nits in Documentation/admin-guide/mm":
      Docs/admin-guide/mm/damon/start: fix wrong example commands
      Docs/admin-guide/mm/damon/start: fix a wrong link
      Docs/admin-guide/mm/damon/start: simplify the content
      Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions

    Changbin Du <changbin.du@gmail.com>:
      mm/damon: simplify stop mechanism

    Colin Ian King <colin.i.king@googlemail.com>:
      mm/damon: fix a few spelling mistakes in comments and a pr_debug message

    Changbin Du <changbin.du@gmail.com>:
      mm/damon: remove return value from before_terminate callback

 a/Documentation/admin-guide/blockdev/zram.rst                  |    8 
 a/Documentation/admin-guide/cgroup-v1/memory.rst               |   11 
 a/Documentation/admin-guide/kernel-parameters.txt              |   14 
 a/Documentation/admin-guide/mm/damon/index.rst                 |    1 
 a/Documentation/admin-guide/mm/damon/reclaim.rst               |  235 +++
 a/Documentation/admin-guide/mm/damon/start.rst                 |  140 +
 a/Documentation/admin-guide/mm/damon/usage.rst                 |  117 +
 a/Documentation/admin-guide/mm/hugetlbpage.rst                 |   42 
 a/Documentation/admin-guide/mm/memory-hotplug.rst              |  147 +-
 a/Documentation/admin-guide/mm/pagemap.rst                     |   75 -
 a/Documentation/core-api/memory-hotplug.rst                    |    3 
 a/Documentation/dev-tools/kfence.rst                           |   23 
 a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst |    4 
 a/Documentation/vm/damon/design.rst                            |   29 
 a/Documentation/vm/damon/faq.rst                               |    5 
 a/Documentation/vm/damon/index.rst                             |    1 
 a/Documentation/vm/page_owner.rst                              |   23 
 a/MAINTAINERS                                                  |    2 
 a/Makefile                                                     |   15 
 a/arch/Kconfig                                                 |   28 
 a/arch/alpha/kernel/core_irongate.c                            |    6 
 a/arch/arc/mm/init.c                                           |    6 
 a/arch/arm/mach-hisi/platmcpm.c                                |    2 
 a/arch/arm/mach-rpc/ecard.c                                    |    2 
 a/arch/arm/mm/init.c                                           |    2 
 a/arch/arm64/Kconfig                                           |    4 
 a/arch/arm64/mm/kasan_init.c                                   |   16 
 a/arch/arm64/mm/mmu.c                                          |    4 
 a/arch/ia64/mm/contig.c                                        |    2 
 a/arch/ia64/mm/init.c                                          |    2 
 a/arch/m68k/mm/mcfmmu.c                                        |    3 
 a/arch/m68k/mm/motorola.c                                      |    6 
 a/arch/mips/loongson64/init.c                                  |    4 
 a/arch/mips/mm/init.c                                          |    6 
 a/arch/mips/sgi-ip27/ip27-memory.c                             |    3 
 a/arch/mips/sgi-ip30/ip30-setup.c                              |    6 
 a/arch/powerpc/Kconfig                                         |    1 
 a/arch/powerpc/configs/skiroot_defconfig                       |    1 
 a/arch/powerpc/include/asm/machdep.h                           |    2 
 a/arch/powerpc/include/asm/sections.h                          |   13 
 a/arch/powerpc/kernel/dt_cpu_ftrs.c                            |    8 
 a/arch/powerpc/kernel/paca.c                                   |    8 
 a/arch/powerpc/kernel/setup-common.c                           |    4 
 a/arch/powerpc/kernel/setup_64.c                               |    6 
 a/arch/powerpc/kernel/smp.c                                    |    2 
 a/arch/powerpc/mm/book3s64/radix_tlb.c                         |    4 
 a/arch/powerpc/mm/hugetlbpage.c                                |    9 
 a/arch/powerpc/platforms/powernv/pci-ioda.c                    |    4 
 a/arch/powerpc/platforms/powernv/setup.c                       |    4 
 a/arch/powerpc/platforms/pseries/setup.c                       |    2 
 a/arch/powerpc/platforms/pseries/svm.c                         |    9 
 a/arch/riscv/kernel/setup.c                                    |   10 
 a/arch/s390/include/asm/sections.h                             |   12 
 a/arch/s390/kernel/setup.c                                     |   11 
 a/arch/s390/kernel/smp.c                                       |    6 
 a/arch/s390/kernel/uv.c                                        |    2 
 a/arch/s390/mm/init.c                                          |    3 
 a/arch/s390/mm/kasan_init.c                                    |    2 
 a/arch/sh/boards/mach-ap325rxa/setup.c                         |    2 
 a/arch/sh/boards/mach-ecovec24/setup.c                         |    4 
 a/arch/sh/boards/mach-kfr2r09/setup.c                          |    2 
 a/arch/sh/boards/mach-migor/setup.c                            |    2 
 a/arch/sh/boards/mach-se/7724/setup.c                          |    4 
 a/arch/sparc/kernel/smp_64.c                                   |    4 
 a/arch/um/kernel/mem.c                                         |    4 
 a/arch/x86/Kconfig                                             |    6 
 a/arch/x86/kernel/setup.c                                      |    4 
 a/arch/x86/kernel/setup_percpu.c                               |    2 
 a/arch/x86/mm/init.c                                           |    2 
 a/arch/x86/mm/init_32.c                                        |   31 
 a/arch/x86/mm/kasan_init_64.c                                  |    4 
 a/arch/x86/mm/numa.c                                           |    2 
 a/arch/x86/mm/numa_emulation.c                                 |    2 
 a/arch/x86/xen/mmu_pv.c                                        |    8 
 a/arch/x86/xen/p2m.c                                           |    4 
 a/arch/x86/xen/setup.c                                         |    6 
 a/drivers/base/Makefile                                        |    2 
 a/drivers/base/arch_numa.c                                     |   96 +
 a/drivers/base/node.c                                          |    9 
 a/drivers/block/zram/zram_drv.c                                |   66 
 a/drivers/firmware/efi/memmap.c                                |    2 
 a/drivers/hwmon/occ/p9_sbe.c                                   |    1 
 a/drivers/macintosh/smu.c                                      |    2 
 a/drivers/mmc/core/mmc_test.c                                  |    1 
 a/drivers/mtd/mtdcore.c                                        |    1 
 a/drivers/of/kexec.c                                           |    4 
 a/drivers/of/of_reserved_mem.c                                 |    5 
 a/drivers/rapidio/devices/rio_mport_cdev.c                     |    9 
 a/drivers/s390/char/sclp_early.c                               |    4 
 a/drivers/usb/early/xhci-dbc.c                                 |   10 
 a/drivers/virtio/Kconfig                                       |    2 
 a/drivers/xen/swiotlb-xen.c                                    |    4 
 a/fs/d_path.c                                                  |    8 
 a/fs/exec.c                                                    |    4 
 a/fs/ocfs2/alloc.c                                             |   21 
 a/fs/ocfs2/dlm/dlmrecovery.c                                   |    1 
 a/fs/ocfs2/file.c                                              |    8 
 a/fs/ocfs2/inode.c                                             |    4 
 a/fs/ocfs2/journal.c                                           |   28 
 a/fs/ocfs2/journal.h                                           |    3 
 a/fs/ocfs2/super.c                                             |   40 
 a/fs/open.c                                                    |   16 
 a/fs/posix_acl.c                                               |    3 
 a/fs/proc/task_mmu.c                                           |   28 
 a/fs/super.c                                                   |    3 
 a/include/asm-generic/sections.h                               |   14 
 a/include/linux/backing-dev-defs.h                             |    3 
 a/include/linux/backing-dev.h                                  |    1 
 a/include/linux/cma.h                                          |    1 
 a/include/linux/compiler-gcc.h                                 |    8 
 a/include/linux/compiler_attributes.h                          |   10 
 a/include/linux/compiler_types.h                               |   12 
 a/include/linux/cpuset.h                                       |   17 
 a/include/linux/damon.h                                        |  258 +++
 a/include/linux/fs.h                                           |    1 
 a/include/linux/gfp.h                                          |    8 
 a/include/linux/highmem.h                                      |   28 
 a/include/linux/hugetlb.h                                      |   36 
 a/include/linux/io-mapping.h                                   |    6 
 a/include/linux/kasan.h                                        |    8 
 a/include/linux/kernel.h                                       |    1 
 a/include/linux/kfence.h                                       |   21 
 a/include/linux/memblock.h                                     |   48 
 a/include/linux/memcontrol.h                                   |    9 
 a/include/linux/memory.h                                       |   26 
 a/include/linux/memory_hotplug.h                               |    3 
 a/include/linux/mempolicy.h                                    |    5 
 a/include/linux/migrate.h                                      |   23 
 a/include/linux/migrate_mode.h                                 |   13 
 a/include/linux/mm.h                                           |   57 
 a/include/linux/mm_types.h                                     |    2 
 a/include/linux/mmzone.h                                       |   41 
 a/include/linux/node.h                                         |    4 
 a/include/linux/page-flags.h                                   |    2 
 a/include/linux/percpu.h                                       |    6 
 a/include/linux/sched/mm.h                                     |   25 
 a/include/linux/slab.h                                         |  181 +-
 a/include/linux/slub_def.h                                     |   13 
 a/include/linux/stackdepot.h                                   |    8 
 a/include/linux/stacktrace.h                                   |    1 
 a/include/linux/swap.h                                         |    1 
 a/include/linux/vmalloc.h                                      |   24 
 a/include/trace/events/mmap_lock.h                             |   50 
 a/include/trace/events/vmscan.h                                |   42 
 a/include/trace/events/writeback.h                             |    7 
 a/init/Kconfig                                                 |    2 
 a/init/initramfs.c                                             |    4 
 a/init/main.c                                                  |    6 
 a/kernel/cgroup/cpuset.c                                       |   23 
 a/kernel/cpu.c                                                 |    2 
 a/kernel/dma/swiotlb.c                                         |    6 
 a/kernel/exit.c                                                |    2 
 a/kernel/extable.c                                             |    2 
 a/kernel/fork.c                                                |   51 
 a/kernel/kexec_file.c                                          |    5 
 a/kernel/kthread.c                                             |   21 
 a/kernel/locking/lockdep.c                                     |   15 
 a/kernel/printk/printk.c                                       |    4 
 a/kernel/sched/core.c                                          |   37 
 a/kernel/sched/sched.h                                         |    4 
 a/kernel/sched/topology.c                                      |    1 
 a/kernel/stacktrace.c                                          |   30 
 a/kernel/tsacct.c                                              |    2 
 a/kernel/workqueue.c                                           |    2 
 a/lib/Kconfig.debug                                            |    2 
 a/lib/Kconfig.kfence                                           |   26 
 a/lib/bootconfig.c                                             |    2 
 a/lib/cpumask.c                                                |    6 
 a/lib/stackdepot.c                                             |   76 -
 a/lib/test_kasan.c                                             |   26 
 a/lib/test_kasan_module.c                                      |    2 
 a/lib/test_vmalloc.c                                           |    6 
 a/mm/Kconfig                                                   |   10 
 a/mm/backing-dev.c                                             |   65 
 a/mm/cma.c                                                     |   26 
 a/mm/compaction.c                                              |   12 
 a/mm/damon/Kconfig                                             |   24 
 a/mm/damon/Makefile                                            |    4 
 a/mm/damon/core.c                                              |  500 ++++++-
 a/mm/damon/dbgfs-test.h                                        |   56 
 a/mm/damon/dbgfs.c                                             |  486 +++++-
 a/mm/damon/paddr.c                                             |  275 +++
 a/mm/damon/prmtv-common.c                                      |  133 +
 a/mm/damon/prmtv-common.h                                      |   20 
 a/mm/damon/reclaim.c                                           |  356 ++++
 a/mm/damon/vaddr-test.h                                        |    2 
 a/mm/damon/vaddr.c                                             |  167 +-
 a/mm/debug.c                                                   |   20 
 a/mm/debug_vm_pgtable.c                                        |    7 
 a/mm/filemap.c                                                 |   78 -
 a/mm/gup.c                                                     |    5 
 a/mm/highmem.c                                                 |    6 
 a/mm/hugetlb.c                                                 |  713 +++++++++-
 a/mm/hugetlb_cgroup.c                                          |    3 
 a/mm/internal.h                                                |   26 
 a/mm/kasan/common.c                                            |    8 
 a/mm/kasan/generic.c                                           |   16 
 a/mm/kasan/kasan.h                                             |    2 
 a/mm/kasan/shadow.c                                            |    5 
 a/mm/kfence/core.c                                             |  214 ++-
 a/mm/kfence/kfence.h                                           |    2 
 a/mm/kfence/kfence_test.c                                      |   14 
 a/mm/khugepaged.c                                              |   10 
 a/mm/list_lru.c                                                |   58 
 a/mm/memblock.c                                                |   35 
 a/mm/memcontrol.c                                              |  217 +--
 a/mm/memory-failure.c                                          |  117 +
 a/mm/memory.c                                                  |  166 +-
 a/mm/memory_hotplug.c                                          |   57 
 a/mm/mempolicy.c                                               |  143 +-
 a/mm/migrate.c                                                 |   61 
 a/mm/mmap.c                                                    |    2 
 a/mm/mprotect.c                                                |    5 
 a/mm/mremap.c                                                  |   86 -
 a/mm/nommu.c                                                   |    6 
 a/mm/oom_kill.c                                                |   27 
 a/mm/page-writeback.c                                          |   13 
 a/mm/page_alloc.c                                              |  119 -
 a/mm/page_ext.c                                                |    2 
 a/mm/page_isolation.c                                          |   29 
 a/mm/percpu.c                                                  |   24 
 a/mm/readahead.c                                               |    2 
 a/mm/rmap.c                                                    |    8 
 a/mm/shmem.c                                                   |   44 
 a/mm/slab.c                                                    |   16 
 a/mm/slab_common.c                                             |    8 
 a/mm/slub.c                                                    |  117 -
 a/mm/sparse-vmemmap.c                                          |    2 
 a/mm/sparse.c                                                  |    6 
 a/mm/swap.c                                                    |   23 
 a/mm/swapfile.c                                                |    6 
 a/mm/userfaultfd.c                                             |    8 
 a/mm/vmalloc.c                                                 |  107 +
 a/mm/vmpressure.c                                              |    2 
 a/mm/vmscan.c                                                  |  194 ++
 a/mm/vmstat.c                                                  |   76 -
 a/mm/zsmalloc.c                                                |    7 
 a/net/ipv4/tcp.c                                               |    1 
 a/net/ipv4/udp.c                                               |    1 
 a/net/netfilter/ipvs/ip_vs_ctl.c                               |    1 
 a/net/openvswitch/meter.c                                      |    1 
 a/net/sctp/protocol.c                                          |    1 
 a/scripts/checkpatch.pl                                        |    3 
 a/scripts/decodecode                                           |    2 
 a/scripts/spelling.txt                                         |   18 
 a/security/Kconfig                                             |   14 
 a/tools/testing/selftests/damon/debugfs_attrs.sh               |   25 
 a/tools/testing/selftests/memory-hotplug/config                |    1 
 a/tools/testing/selftests/vm/.gitignore                        |    1 
 a/tools/testing/selftests/vm/Makefile                          |    1 
 a/tools/testing/selftests/vm/hugepage-mremap.c                 |  161 ++
 a/tools/testing/selftests/vm/ksm_tests.c                       |  154 ++
 a/tools/testing/selftests/vm/madv_populate.c                   |   15 
 a/tools/testing/selftests/vm/run_vmtests.sh                    |   11 
 a/tools/testing/selftests/vm/transhuge-stress.c                |    2 
 a/tools/testing/selftests/vm/userfaultfd.c                     |  157 +-
 a/tools/vm/page-types.c                                        |   38 
 a/tools/vm/page_owner_sort.c                                   |   94 +
 b/Documentation/admin-guide/mm/index.rst                       |    2 
 b/Documentation/vm/index.rst                                   |   26 
 260 files changed, 6448 insertions(+), 2327 deletions(-)


* [patch 001/262] scripts/spelling.txt: add more spellings to spelling.txt
  2021-11-05 20:34 incoming Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 002/262] scripts/spelling.txt: fix "mistake" version of "synchronization" Andrew Morton
                   ` (260 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, colin.king, linux-mm, mm-commits, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: scripts/spelling.txt: add more spellings to spelling.txt

Add some of the more common spelling mistakes and typos that I've found
while fixing up spelling mistakes in the kernel over the past few months.

Link: https://lkml.kernel.org/r/20210907072941.7033-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/scripts/spelling.txt~scripts-spellingtxt-add-more-spellings-to-spellingtxt
+++ a/scripts/spelling.txt
@@ -178,6 +178,7 @@ assum||assume
 assumtpion||assumption
 asuming||assuming
 asycronous||asynchronous
+asychronous||asynchronous
 asynchnous||asynchronous
 asynchromous||asynchronous
 asymetric||asymmetric
@@ -241,6 +242,7 @@ beter||better
 betweeen||between
 bianries||binaries
 bitmast||bitmask
+bitwiedh||bitwidth
 boardcast||broadcast
 borad||board
 boundry||boundary
@@ -265,7 +267,10 @@ calucate||calculate
 calulate||calculate
 cancelation||cancellation
 cancle||cancel
+cant||can't
+cant'||can't
 canot||cannot
+cann't||can't
 capabilites||capabilities
 capabilties||capabilities
 capabilty||capability
@@ -501,6 +506,7 @@ disble||disable
 disgest||digest
 disired||desired
 dispalying||displaying
+dissable||disable
 diplay||display
 directon||direction
 direcly||directly
@@ -595,6 +601,7 @@ exceded||exceeded
 exceds||exceeds
 exceeed||exceed
 excellant||excellent
+exchnage||exchange
 execeeded||exceeded
 execeeds||exceeds
 exeed||exceed
@@ -938,6 +945,7 @@ migrateable||migratable
 milliseonds||milliseconds
 minium||minimum
 minimam||minimum
+minimun||minimum
 miniumum||minimum
 minumum||minimum
 misalinged||misaligned
@@ -956,6 +964,7 @@ mmnemonic||mnemonic
 mnay||many
 modfiy||modify
 modifer||modifier
+modul||module
 modulues||modules
 momery||memory
 memomry||memory
@@ -1154,6 +1163,7 @@ programable||programmable
 programers||programmers
 programm||program
 programms||programs
+progres||progress
 progresss||progress
 prohibitted||prohibited
 prohibitting||prohibiting
@@ -1328,6 +1338,7 @@ servive||service
 setts||sets
 settting||setting
 shapshot||snapshot
+shoft||shift
 shotdown||shutdown
 shoud||should
 shouldnt||shouldn't
@@ -1439,6 +1450,7 @@ syfs||sysfs
 symetric||symmetric
 synax||syntax
 synchonized||synchronized
+synchronization||synchronization
 synchronuously||synchronously
 syncronize||synchronize
 syncronized||synchronized
@@ -1521,6 +1533,7 @@ unexpexted||unexpected
 unfortunatelly||unfortunately
 unifiy||unify
 uniterrupted||uninterrupted
+uninterruptable||uninterruptible
 unintialized||uninitialized
 unitialized||uninitialized
 unkmown||unknown
@@ -1553,6 +1566,7 @@ unuseful||useless
 unvalid||invalid
 upate||update
 upsupported||unsupported
+useable||usable
 usefule||useful
 usefull||useful
 usege||usage
@@ -1574,6 +1588,7 @@ varient||variant
 vaule||value
 verbse||verbose
 veify||verify
+verfication||verification
 veriosn||version
 verisons||versions
 verison||version
@@ -1586,6 +1601,7 @@ visiters||visitors
 vitual||virtual
 vunerable||vulnerable
 wakeus||wakeups
+was't||wasn't
 wathdog||watchdog
 wating||waiting
 wiat||wait
_

* [patch 002/262] scripts/spelling.txt: fix "mistake" version of "synchronization"
  2021-11-05 20:34 incoming Andrew Morton
  2021-11-05 20:34 ` [patch 001/262] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 003/262] scripts/decodecode: fix faulting instruction no print when opps.file is DOS format Andrew Morton
                   ` (259 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, colin.king, linux-mm, mm-commits, sven, torvalds

From: Sven Eckelmann <sven@narfation.org>
Subject: scripts/spelling.txt: fix "mistake" version of "synchronization"

If both "mistake" version and "correction" version are the same, a warning
message is created by checkpatch which is impossible to fix.  But it was
noticed that Colan Ian King created a commit e6c0a0889b80 ("ALSA: aloop:
Fix spelling mistake "synchronization" -> "synchronization"") which
suggests that this spelling mistake was fixed by replacing the word
"synchronization" with itself.  But the actual diff shows that the mistake
in the code was "sychronization".  It is rather likely that the "mistake"
in spelling.txt should have been the latter.
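
For reference, entries whose "mistake" and "correction" halves are
identical can be spotted with a quick awk pass over scripts/spelling.txt
(an illustrative one-liner, not something this patch adds):

  # Print dictionary entries whose left and right halves are the same,
  # i.e. entries checkpatch can warn about but no one can ever "fix".
  awk -F'[|][|]' 'NF == 2 && $1 == $2 { print FNR ": " $0 }' scripts/spelling.txt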

Link: https://lkml.kernel.org/r/20210926065529.6880-1-sven@narfation.org
Fixes: 2e74c9433ba8 ("scripts/spelling.txt: add more spellings to spelling.txt")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/spelling.txt~scripts-spellingtxt-fix-mistake-version-of-synchronization
+++ a/scripts/spelling.txt
@@ -1450,7 +1450,7 @@ syfs||sysfs
 symetric||symmetric
 synax||syntax
 synchonized||synchronized
-synchronization||synchronization
+sychronization||synchronization
 synchronuously||synchronously
 syncronize||synchronize
 syncronized||synchronized
_

* [patch 003/262] scripts/decodecode: fix faulting instruction no print when opps.file is DOS format
  2021-11-05 20:34 incoming Andrew Morton
  2021-11-05 20:34 ` [patch 001/262] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
  2021-11-05 20:34 ` [patch 002/262] scripts/spelling.txt: fix "mistake" version of "synchronization" Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 004/262] ocfs2: fix handle refcount leak in two exception handling paths Andrew Morton
                   ` (258 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, bp, linux-mm, maz, mm-commits, rabin, torvalds, weidonghui, will

From: weidonghui <weidonghui@allwinnertech.com>
Subject: scripts/decodecode: fix faulting instruction no print when opps.file is DOS format

If oops.file is in DOS format, the faulting instruction cannot be printed:
/ # ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
/ # ./scripts/decodecode < oops.file
[ 0.734345] Code: d0002881 912f9c21 94067e68 d2800001 (b900003f)
aarch64-linux-gnu-strip: '/tmp/tmp.5Y9eybnnSi.o': No such file
aarch64-linux-gnu-objdump: '/tmp/tmp.5Y9eybnnSi.o': No such file
All code
========
   0:   d0002881        adrp    x1, 0x512000
   4:   912f9c21        add     x1, x1, #0xbe7
   8:   94067e68        bl      0x19f9a8
   c:   d2800001        mov     x1, #0x0                        // #0
  10:   b900003f        str     wzr, [x1]

Code starting with the faulting instruction
===========================================

Background: the compilation environment is Ubuntu and the test
environment is Windows, so most logs are generated on Windows.  Such logs
inevitably contain CR (carriage return) characters, which break
decodecode when it is run in the Ubuntu environment.
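
As a workaround on the reading side (assuming the same arm64 setup as the
example above), the carriage returns can also be stripped before the log
is piped into decodecode; a minimal sketch:

  # Convert a DOS-format oops capture to Unix line endings on the fly.
  sed -e 's/\r$//' oops.file | ./scripts/decodecode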

The repaired effect is as follows:

/ # ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu-
/ # ./scripts/decodecode < oops.file
[ 0.734345] Code: d0002881 912f9c21 94067e68 d2800001 (b900003f)
All code
========
   0:   d0002881        adrp    x1, 0x512000
   4:   912f9c21        add     x1, x1, #0xbe7
   8:   94067e68        bl      0x19f9a8
   c:   d2800001        mov     x1, #0x0                        // #0
  10:*  b900003f        str     wzr, [x1]               <-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:   b900003f        str     wzr, [x1]

Link: https://lkml.kernel.org/r/20211008064712.926-1-weidonghui@allwinnertech.com
Signed-off-by: weidonghui <weidonghui@allwinnertech.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Marc Zyngier <maz@misterjones.org>
Cc: Will Deacon <will@kernel.org>
Cc: Rabin Vincent <rabin@rab.in>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/decodecode |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/decodecode~scripts-decodecode-fix-faulting-instruction-no-print-when-oppsfile-is-dos-format
+++ a/scripts/decodecode
@@ -126,7 +126,7 @@ if [ $marker -ne 0 ]; then
 fi
 echo Code starting with the faulting instruction  > $T.aa
 echo =========================================== >> $T.aa
-code=`echo $code | sed -e 's/ [<(]/ /;s/[>)] / /;s/ /,0x/g; s/[>)]$//'`
+code=`echo $code | sed -e 's/\r//;s/ [<(]/ /;s/[>)] / /;s/ /,0x/g; s/[>)]$//'`
 echo -n "	.$type 0x" > $T.s
 echo $code >> $T.s
 disas $T 0
_

* [patch 004/262] ocfs2: fix handle refcount leak in two exception handling paths
  2021-11-05 20:34 incoming Andrew Morton
                   ` (2 preceding siblings ...)
  2021-11-05 20:34 ` [patch 003/262] scripts/decodecode: fix faulting instruction no print when opps.file is DOS format Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 005/262] ocfs2: cleanup journal init and shutdown Andrew Morton
                   ` (257 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, cymi20, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, tanxin.ctf, torvalds,
	wen.gang.wang, xiyuyang19

From: Chenyuan Mi <cymi20@fudan.edu.cn>
Subject: ocfs2: fix handle refcount leak in two exception handling paths

Two exception handling paths in ocfs2_replay_truncate_records() forget to
decrease the refcount of the handle obtained from ocfs2_start_trans(),
causing a refcount leak.

Fix this by calling ocfs2_commit_trans() to release the handle in both
paths.

Link: https://lkml.kernel.org/r/20210908102055.10168-1-cymi20@fudan.edu.cn
Signed-off-by: Chenyuan Mi <cymi20@fudan.edu.cn>
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Wengang Wang <wen.gang.wang@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/fs/ocfs2/alloc.c~ocfs2-fix-handle-refcount-leak-in-two-exception-handling-paths
+++ a/fs/ocfs2/alloc.c
@@ -5940,6 +5940,7 @@ static int ocfs2_replay_truncate_records
 		status = ocfs2_journal_access_di(handle, INODE_CACHE(tl_inode), tl_bh,
 						 OCFS2_JOURNAL_ACCESS_WRITE);
 		if (status < 0) {
+			ocfs2_commit_trans(osb, handle);
 			mlog_errno(status);
 			goto bail;
 		}
@@ -5964,6 +5965,7 @@ static int ocfs2_replay_truncate_records
 						     data_alloc_bh, start_blk,
 						     num_clusters);
 			if (status < 0) {
+				ocfs2_commit_trans(osb, handle);
 				mlog_errno(status);
 				goto bail;
 			}
_

* [patch 005/262] ocfs2: cleanup journal init and shutdown
  2021-11-05 20:34 incoming Andrew Morton
                   ` (3 preceding siblings ...)
  2021-11-05 20:34 ` [patch 004/262] ocfs2: fix handle refcount leak in two exception handling paths Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 006/262] ocfs2/dlm: remove redundant assignment of variable ret Andrew Morton
                   ` (256 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, torvalds, vvidic

From: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Subject: ocfs2: cleanup journal init and shutdown

Allocate and free struct ocfs2_journal in ocfs2_journal_init and
ocfs2_journal_shutdown.  Init and release of the system inodes reference
the journal, so reorder the calls to make sure they work correctly.

Link: https://lkml.kernel.org/r/20211009145006.3478-1-vvidic@valentin-vidic.from.hr
Signed-off-by: Valentin Vidic <vvidic@valentin-vidic.from.hr>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/inode.c   |    4 ++--
 fs/ocfs2/journal.c |   28 ++++++++++++++++++++++------
 fs/ocfs2/journal.h |    3 +--
 fs/ocfs2/super.c   |   40 +++-------------------------------------
 4 files changed, 28 insertions(+), 47 deletions(-)

--- a/fs/ocfs2/inode.c~ocfs2-cleanup-journal-init-and-shutdown
+++ a/fs/ocfs2/inode.c
@@ -125,7 +125,6 @@ struct inode *ocfs2_iget(struct ocfs2_su
 	struct inode *inode = NULL;
 	struct super_block *sb = osb->sb;
 	struct ocfs2_find_inode_args args;
-	journal_t *journal = OCFS2_SB(sb)->journal->j_journal;
 
 	trace_ocfs2_iget_begin((unsigned long long)blkno, flags,
 			       sysfile_type);
@@ -172,10 +171,11 @@ struct inode *ocfs2_iget(struct ocfs2_su
 	 * part of the transaction - the inode could have been reclaimed and
 	 * now it is reread from disk.
 	 */
-	if (journal) {
+	if (osb->journal) {
 		transaction_t *transaction;
 		tid_t tid;
 		struct ocfs2_inode_info *oi = OCFS2_I(inode);
+		journal_t *journal = osb->journal->j_journal;
 
 		read_lock(&journal->j_state_lock);
 		if (journal->j_running_transaction)
--- a/fs/ocfs2/journal.c~ocfs2-cleanup-journal-init-and-shutdown
+++ a/fs/ocfs2/journal.c
@@ -810,19 +810,34 @@ void ocfs2_set_journal_params(struct ocf
 	write_unlock(&journal->j_state_lock);
 }
 
-int ocfs2_journal_init(struct ocfs2_journal *journal, int *dirty)
+int ocfs2_journal_init(struct ocfs2_super *osb, int *dirty)
 {
 	int status = -1;
 	struct inode *inode = NULL; /* the journal inode */
 	journal_t *j_journal = NULL;
+	struct ocfs2_journal *journal = NULL;
 	struct ocfs2_dinode *di = NULL;
 	struct buffer_head *bh = NULL;
-	struct ocfs2_super *osb;
 	int inode_lock = 0;
 
-	BUG_ON(!journal);
-
-	osb = journal->j_osb;
+	/* initialize our journal structure */
+	journal = kzalloc(sizeof(struct ocfs2_journal), GFP_KERNEL);
+	if (!journal) {
+		mlog(ML_ERROR, "unable to alloc journal\n");
+		status = -ENOMEM;
+		goto done;
+	}
+	osb->journal = journal;
+	journal->j_osb = osb;
+
+	atomic_set(&journal->j_num_trans, 0);
+	init_rwsem(&journal->j_trans_barrier);
+	init_waitqueue_head(&journal->j_checkpointed);
+	spin_lock_init(&journal->j_lock);
+	journal->j_trans_id = 1UL;
+	INIT_LIST_HEAD(&journal->j_la_cleanups);
+	INIT_WORK(&journal->j_recovery_work, ocfs2_complete_recovery);
+	journal->j_state = OCFS2_JOURNAL_FREE;
 
 	/* already have the inode for our journal */
 	inode = ocfs2_get_system_file_inode(osb, JOURNAL_SYSTEM_INODE,
@@ -1028,9 +1043,10 @@ void ocfs2_journal_shutdown(struct ocfs2
 
 	journal->j_state = OCFS2_JOURNAL_FREE;
 
-//	up_write(&journal->j_trans_barrier);
 done:
 	iput(inode);
+	kfree(journal);
+	osb->journal = NULL;
 }
 
 static void ocfs2_clear_journal_error(struct super_block *sb,
--- a/fs/ocfs2/journal.h~ocfs2-cleanup-journal-init-and-shutdown
+++ a/fs/ocfs2/journal.h
@@ -167,8 +167,7 @@ int ocfs2_compute_replay_slots(struct oc
  *  ocfs2_start_checkpoint - Kick the commit thread to do a checkpoint.
  */
 void   ocfs2_set_journal_params(struct ocfs2_super *osb);
-int    ocfs2_journal_init(struct ocfs2_journal *journal,
-			  int *dirty);
+int    ocfs2_journal_init(struct ocfs2_super *osb, int *dirty);
 void   ocfs2_journal_shutdown(struct ocfs2_super *osb);
 int    ocfs2_journal_wipe(struct ocfs2_journal *journal,
 			  int full);
--- a/fs/ocfs2/super.c~ocfs2-cleanup-journal-init-and-shutdown
+++ a/fs/ocfs2/super.c
@@ -1894,8 +1894,6 @@ static void ocfs2_dismount_volume(struct
 	/* This will disable recovery and flush any recovery work. */
 	ocfs2_recovery_exit(osb);
 
-	ocfs2_journal_shutdown(osb);
-
 	ocfs2_sync_blockdev(sb);
 
 	ocfs2_purge_refcount_trees(osb);
@@ -1918,6 +1916,8 @@ static void ocfs2_dismount_volume(struct
 
 	ocfs2_release_system_inodes(osb);
 
+	ocfs2_journal_shutdown(osb);
+
 	/*
 	 * If we're dismounting due to mount error, mount.ocfs2 will clean
 	 * up heartbeat.  If we're a local mount, there is no heartbeat.
@@ -2016,7 +2016,6 @@ static int ocfs2_initialize_super(struct
 	int i, cbits, bbits;
 	struct ocfs2_dinode *di = (struct ocfs2_dinode *)bh->b_data;
 	struct inode *inode = NULL;
-	struct ocfs2_journal *journal;
 	struct ocfs2_super *osb;
 	u64 total_blocks;
 
@@ -2197,33 +2196,6 @@ static int ocfs2_initialize_super(struct
 
 	get_random_bytes(&osb->s_next_generation, sizeof(u32));
 
-	/* FIXME
-	 * This should be done in ocfs2_journal_init(), but unknown
-	 * ordering issues will cause the filesystem to crash.
-	 * If anyone wants to figure out what part of the code
-	 * refers to osb->journal before ocfs2_journal_init() is run,
-	 * be my guest.
-	 */
-	/* initialize our journal structure */
-
-	journal = kzalloc(sizeof(struct ocfs2_journal), GFP_KERNEL);
-	if (!journal) {
-		mlog(ML_ERROR, "unable to alloc journal\n");
-		status = -ENOMEM;
-		goto bail;
-	}
-	osb->journal = journal;
-	journal->j_osb = osb;
-
-	atomic_set(&journal->j_num_trans, 0);
-	init_rwsem(&journal->j_trans_barrier);
-	init_waitqueue_head(&journal->j_checkpointed);
-	spin_lock_init(&journal->j_lock);
-	journal->j_trans_id = (unsigned long) 1;
-	INIT_LIST_HEAD(&journal->j_la_cleanups);
-	INIT_WORK(&journal->j_recovery_work, ocfs2_complete_recovery);
-	journal->j_state = OCFS2_JOURNAL_FREE;
-
 	INIT_WORK(&osb->dquot_drop_work, ocfs2_drop_dquot_refs);
 	init_llist_head(&osb->dquot_drop_list);
 
@@ -2404,7 +2376,7 @@ static int ocfs2_check_volume(struct ocf
 						  * ourselves. */
 
 	/* Init our journal object. */
-	status = ocfs2_journal_init(osb->journal, &dirty);
+	status = ocfs2_journal_init(osb, &dirty);
 	if (status < 0) {
 		mlog(ML_ERROR, "Could not initialize journal!\n");
 		goto finally;
@@ -2513,12 +2485,6 @@ static void ocfs2_delete_osb(struct ocfs
 
 	kfree(osb->osb_orphan_wipes);
 	kfree(osb->slot_recovery_generations);
-	/* FIXME
-	 * This belongs in journal shutdown, but because we have to
-	 * allocate osb->journal at the start of ocfs2_initialize_osb(),
-	 * we free it here.
-	 */
-	kfree(osb->journal);
 	kfree(osb->local_alloc_copy);
 	kfree(osb->uuid_str);
 	kfree(osb->vol_label);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 006/262] ocfs2/dlm: remove redundant assignment of variable ret
  2021-11-05 20:34 incoming Andrew Morton
                   ` (4 preceding siblings ...)
  2021-11-05 20:34 ` [patch 005/262] ocfs2: cleanup journal init and shutdown Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 007/262] ocfs2: fix data corruption on truncate Andrew Morton
                   ` (255 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, colin.king, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: ocfs2/dlm: remove redundant assignment of variable ret

The variable ret is being assigned a value that is never read; it is
updated later on with a different value.  The assignment is redundant and
can be removed.

Addresses-Coverity: ("Unused value")
Link: https://lkml.kernel.org/r/20211007233452.30815-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/dlm/dlmrecovery.c |    1 -
 1 file changed, 1 deletion(-)

--- a/fs/ocfs2/dlm/dlmrecovery.c~ocfs2-dlm-remove-redundant-assignment-of-variable-ret
+++ a/fs/ocfs2/dlm/dlmrecovery.c
@@ -2698,7 +2698,6 @@ static int dlm_send_begin_reco_message(s
 			continue;
 		}
 retry:
-		ret = -EINVAL;
 		mlog(0, "attempting to send begin reco msg to %d\n",
 			  nodenum);
 		ret = o2net_send_message(DLM_BEGIN_RECO_MSG, dlm->key,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 007/262] ocfs2: fix data corruption on truncate
  2021-11-05 20:34 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-11-05 20:34 ` [patch 006/262] ocfs2/dlm: remove redundant assignment of variable ret Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:34 ` [patch 008/262] ocfs2: do not zero pages beyond i_size Andrew Morton
                   ` (254 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jack, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, stable, torvalds

From: Jan Kara <jack@suse.cz>
Subject: ocfs2: fix data corruption on truncate

Patch series "ocfs2: Truncate data corruption fix".

As further testing has shown, commit 5314454ea3f ("ocfs2: fix data
corruption after conversion from inline format") didn't fix all the data
corruption issues the customer started observing after 6dbf7bb55598 ("fs:
Don't invalidate page buffers in block_write_full_page()").  This time I
have tracked them down to two bugs in the ocfs2 truncation code.

One bug (truncating the page cache before clearing the tail cluster and
setting i_size) could cause data corruption even before 6dbf7bb55598, but
before that commit it needed a race with a page fault; after 6dbf7bb55598
it became pretty deterministic.

Another bug (zeroing pages beyond the old i_size) used to be a harmless
inefficiency before commit 6dbf7bb55598.  But after commit 6dbf7bb55598,
in combination with the first bug, it resulted in deterministic data
corruption.

Although fixing only the first problem is needed to stop data corruption,
I've fixed both issues to make the code more robust.


This patch (of 2):

ocfs2_truncate_file() unmapped and invalidated page cache pages before
zeroing the partial tail cluster and setting i_size.  Thus some pages
could be left (and likely were left if the cluster zeroing happened) in
the page cache beyond i_size after truncate finished, letting the user
possibly see stale data once the file was extended again.  Also the tail
cluster zeroing was not guaranteed to finish before truncate finished,
causing possible stale data exposure.  The problem became particularly
easy to hit after commit 6dbf7bb55598 ("fs: Don't invalidate page buffers
in block_write_full_page()") stopped invalidating pages beyond i_size
from the page writeback path.

Fix these problems by unmapping and invalidating pages in the page cache
after the i_size is reduced and tail cluster is zeroed out.

Link: https://lkml.kernel.org/r/20211025150008.29002-1-jack@suse.cz
Link: https://lkml.kernel.org/r/20211025151332.11301-1-jack@suse.cz
Fixes: ccd979bdbce9 ("[PATCH] OCFS2: The Second Oracle Cluster Filesystem")
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/file.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/fs/ocfs2/file.c~ocfs2-fix-data-corruption-on-truncate
+++ a/fs/ocfs2/file.c
@@ -476,10 +476,11 @@ int ocfs2_truncate_file(struct inode *in
 	 * greater than page size, so we have to truncate them
 	 * anyway.
 	 */
-	unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
-	truncate_inode_pages(inode->i_mapping, new_i_size);
 
 	if (OCFS2_I(inode)->ip_dyn_features & OCFS2_INLINE_DATA_FL) {
+		unmap_mapping_range(inode->i_mapping,
+				    new_i_size + PAGE_SIZE - 1, 0, 1);
+		truncate_inode_pages(inode->i_mapping, new_i_size);
 		status = ocfs2_truncate_inline(inode, di_bh, new_i_size,
 					       i_size_read(inode), 1);
 		if (status)
@@ -498,6 +499,9 @@ int ocfs2_truncate_file(struct inode *in
 		goto bail_unlock_sem;
 	}
 
+	unmap_mapping_range(inode->i_mapping, new_i_size + PAGE_SIZE - 1, 0, 1);
+	truncate_inode_pages(inode->i_mapping, new_i_size);
+
 	status = ocfs2_commit_truncate(osb, inode, di_bh);
 	if (status < 0) {
 		mlog_errno(status);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 008/262] ocfs2: do not zero pages beyond i_size
  2021-11-05 20:34 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-11-05 20:34 ` [patch 007/262] ocfs2: fix data corruption on truncate Andrew Morton
@ 2021-11-05 20:34 ` Andrew Morton
  2021-11-05 20:35 ` [patch 009/262] fs/posix_acl.c: avoid -Wempty-body warning Andrew Morton
                   ` (253 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:34 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jack, jlbec, joseph.qi, junxiao.bi,
	linux-mm, mark, mm-commits, piaojun, torvalds

From: Jan Kara <jack@suse.cz>
Subject: ocfs2: do not zero pages beyond i_size

ocfs2_zero_range_for_truncate() can try to zero pages beyond current inode
size despite the fact that underlying blocks should be already zeroed out
and writeback will skip writing such pages anyway.  Avoid the pointless
work.

Link: https://lkml.kernel.org/r/20211025151332.11301-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c |   19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

--- a/fs/ocfs2/alloc.c~ocfs2-do-not-zero-pages-beyond-i_size
+++ a/fs/ocfs2/alloc.c
@@ -6923,13 +6923,12 @@ static int ocfs2_grab_eof_pages(struct i
 }
 
 /*
- * Zero the area past i_size but still within an allocated
- * cluster. This avoids exposing nonzero data on subsequent file
- * extends.
+ * Zero partial cluster for a hole punch or truncate. This avoids exposing
+ * nonzero data on subsequent file extends.
  *
  * We need to call this before i_size is updated on the inode because
  * otherwise block_write_full_page() will skip writeout of pages past
- * i_size. The new_i_size parameter is passed for this reason.
+ * i_size.
  */
 int ocfs2_zero_range_for_truncate(struct inode *inode, handle_t *handle,
 				  u64 range_start, u64 range_end)
@@ -6947,6 +6946,15 @@ int ocfs2_zero_range_for_truncate(struct
 	if (!ocfs2_sparse_alloc(OCFS2_SB(sb)))
 		return 0;
 
+	/*
+	 * Avoid zeroing pages fully beyond current i_size. It is pointless as
+	 * underlying blocks of those pages should be already zeroed out and
+	 * page writeback will skip them anyway.
+	 */
+	range_end = min_t(u64, range_end, i_size_read(inode));
+	if (range_start >= range_end)
+		return 0;
+
 	pages = kcalloc(ocfs2_pages_per_cluster(sb),
 			sizeof(struct page *), GFP_NOFS);
 	if (pages == NULL) {
@@ -6955,9 +6963,6 @@ int ocfs2_zero_range_for_truncate(struct
 		goto out;
 	}
 
-	if (range_start == range_end)
-		goto out;
-
 	ret = ocfs2_extent_map_get_blocks(inode,
 					  range_start >> sb->s_blocksize_bits,
 					  &phys, NULL, &ext_flags);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 009/262] fs/posix_acl.c: avoid -Wempty-body warning
  2021-11-05 20:34 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-11-05 20:34 ` [patch 008/262] ocfs2: do not zero pages beyond i_size Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 010/262] d_path: fix Kernel doc validator complaining Andrew Morton
                   ` (252 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, arnd, christian.brauner, jamorris, linux-mm, mm-commits,
	mszeredi, serge, torvalds, viro

From: Arnd Bergmann <arnd@arndb.de>
Subject: fs/posix_acl.c: avoid -Wempty-body warning

The fallthrough comment for an ignored cmpxchg() return value produces a
harmless warning with 'make W=1':

fs/posix_acl.c: In function 'get_acl':
fs/posix_acl.c:127:36: error: suggest braces around empty body in an 'if' statement [-Werror=empty-body]
  127 |                 /* fall through */ ;
      |                                    ^

Simplify it as a step towards a clean W=1 build.  As all architectures
define cmpxchg() as a statement expression these days, it is no longer
necessary to evaluate its return code, and the if() can just be dropped.

Link: https://lkml.kernel.org/r/20210927102410.1863853-1-arnd@kernel.org
Link: https://lore.kernel.org/all/20210322132103.qiun2rjilnlgztxe@wittgenstein/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/posix_acl.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/fs/posix_acl.c~posix-acl-avoid-wempty-body-warning
+++ a/fs/posix_acl.c
@@ -134,8 +134,7 @@ struct posix_acl *get_acl(struct inode *
 	 * to just call ->get_acl to fetch the ACL ourself.  (This is going to
 	 * be an unlikely race.)
 	 */
-	if (cmpxchg(p, ACL_NOT_CACHED, sentinel) != ACL_NOT_CACHED)
-		/* fall through */ ;
+	cmpxchg(p, ACL_NOT_CACHED, sentinel);
 
 	/*
 	 * Normally, the ACL returned by ->get_acl will be cached.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 010/262] d_path: fix Kernel doc validator complaining
  2021-11-05 20:34 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-11-05 20:35 ` [patch 009/262] fs/posix_acl.c: avoid -Wempty-body warning Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 011/262] mm: move kvmalloc-related functions to slab.h Andrew Morton
                   ` (251 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, justin.he, linux-mm, mm-commits,
	rdunlap, torvalds, viro

From: Jia He <justin.he@arm.com>
Subject: d_path: fix Kernel doc validator complaining

Kernel doc validator complains:
  Function parameter or member 'p' not described in 'prepend_name'
  Excess function parameter 'buffer' description in 'prepend_name'

Link: https://lkml.kernel.org/r/20211011005614.26189-1-justin.he@arm.com
Fixes: ad08ae586586 ("d_path: introduce struct prepend_buffer")
Signed-off-by: Jia He <justin.he@arm.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/d_path.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/fs/d_path.c~d_path-fix-kernel-doc-validator-complaining
+++ a/fs/d_path.c
@@ -77,9 +77,8 @@ static bool prepend(struct prepend_buffe
 
 /**
  * prepend_name - prepend a pathname in front of current buffer pointer
- * @buffer: buffer pointer
- * @buflen: allocated length of the buffer
- * @name:   name string and length qstr structure
+ * @p: prepend buffer which contains buffer pointer and allocated length
+ * @name: name string and length qstr structure
  *
  * With RCU path tracing, it may race with d_move(). Use READ_ONCE() to
  * make sure that either the old or the new name pointer and length are
@@ -141,8 +140,7 @@ static int __prepend_path(const struct d
  * prepend_path - Prepend path string to a buffer
  * @path: the dentry/vfsmount to report
  * @root: root vfsmnt/dentry
- * @buffer: pointer to the end of the buffer
- * @buflen: pointer to buffer length
+ * @p: prepend buffer which contains buffer pointer and allocated length
  *
  * The function will first try to write out the pathname without taking any
  * lock other than the RCU read lock to make sure that dentries won't go away.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 011/262] mm: move kvmalloc-related functions to slab.h
  2021-11-05 20:34 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-11-05 20:35 ` [patch 010/262] d_path: fix Kernel doc validator complaining Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 012/262] mm/slab.c: remove useless lines in enable_cpucache() Andrew Morton
                   ` (250 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, linux-mm, mm-commits, penberg,
	rientjes, torvalds, vbabka, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: move kvmalloc-related functions to slab.h

Not all files in the kernel should include mm.h.  Migrating callers from
kmalloc to kvmalloc is easier if the kvmalloc functions are in slab.h.
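
As an illustration only (example_alloc() is a hypothetical caller, not
part of this patch): after the move, a file that already includes
<linux/slab.h> for kmalloc() can use the kv* helpers without also
pulling in <linux/mm.h>:

#include <linux/slab.h>

/* hypothetical caller, shown only to illustrate the header change */
static int example_alloc(size_t nr_entries)
{
	u32 *buf;

	/* kvmalloc_array() falls back to vmalloc for large allocations */
	buf = kvmalloc_array(nr_entries, sizeof(*buf), GFP_KERNEL);
	if (!buf)
		return -ENOMEM;

	/* ... use buf ... */

	kvfree(buf);
	return 0;
}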

[akpm@linux-foundation.org: move the new kvrealloc() also]
[akpm@linux-foundation.org: drivers/hwmon/occ/p9_sbe.c needs slab.h]
Link: https://lkml.kernel.org/r/20210622215757.3525604-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/hwmon/occ/p9_sbe.c |    1 +
 drivers/of/kexec.c         |    1 +
 include/linux/mm.h         |   34 ----------------------------------
 include/linux/slab.h       |   34 ++++++++++++++++++++++++++++++++++
 4 files changed, 36 insertions(+), 34 deletions(-)

--- a/drivers/hwmon/occ/p9_sbe.c~mm-move-kvmalloc-related-functions-to-slabh
+++ a/drivers/hwmon/occ/p9_sbe.c
@@ -3,6 +3,7 @@
 
 #include <linux/device.h>
 #include <linux/errno.h>
+#include <linux/slab.h>
 #include <linux/fsi-occ.h>
 #include <linux/module.h>
 #include <linux/platform_device.h>
--- a/drivers/of/kexec.c~mm-move-kvmalloc-related-functions-to-slabh
+++ a/drivers/of/kexec.c
@@ -16,6 +16,7 @@
 #include <linux/of.h>
 #include <linux/of_fdt.h>
 #include <linux/random.h>
+#include <linux/slab.h>
 #include <linux/types.h>
 
 #define RNG_SEED_SIZE		128
--- a/include/linux/mm.h~mm-move-kvmalloc-related-functions-to-slabh
+++ a/include/linux/mm.h
@@ -799,40 +799,6 @@ static inline int is_vmalloc_or_module_a
 }
 #endif
 
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
-static inline void *kvmalloc(size_t size, gfp_t flags)
-{
-	return kvmalloc_node(size, flags, NUMA_NO_NODE);
-}
-static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
-{
-	return kvmalloc_node(size, flags | __GFP_ZERO, node);
-}
-static inline void *kvzalloc(size_t size, gfp_t flags)
-{
-	return kvmalloc(size, flags | __GFP_ZERO);
-}
-
-static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
-{
-	size_t bytes;
-
-	if (unlikely(check_mul_overflow(n, size, &bytes)))
-		return NULL;
-
-	return kvmalloc(bytes, flags);
-}
-
-static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
-{
-	return kvmalloc_array(n, size, flags | __GFP_ZERO);
-}
-
-extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize,
-		gfp_t flags);
-extern void kvfree(const void *addr);
-extern void kvfree_sensitive(const void *addr, size_t len);
-
 static inline int head_compound_mapcount(struct page *head)
 {
 	return atomic_read(compound_mapcount_ptr(head)) + 1;
--- a/include/linux/slab.h~mm-move-kvmalloc-related-functions-to-slabh
+++ a/include/linux/slab.h
@@ -732,6 +732,40 @@ static inline void *kzalloc_node(size_t
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
 
+extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
+static inline void *kvmalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc_node(size, flags, NUMA_NO_NODE);
+}
+static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
+{
+	return kvmalloc_node(size, flags | __GFP_ZERO, node);
+}
+static inline void *kvzalloc(size_t size, gfp_t flags)
+{
+	return kvmalloc(size, flags | __GFP_ZERO);
+}
+
+static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+{
+	size_t bytes;
+
+	if (unlikely(check_mul_overflow(n, size, &bytes)))
+		return NULL;
+
+	return kvmalloc(bytes, flags);
+}
+
+static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
+{
+	return kvmalloc_array(n, size, flags | __GFP_ZERO);
+}
+
+extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize,
+		gfp_t flags);
+extern void kvfree(const void *addr);
+extern void kvfree_sensitive(const void *addr, size_t len);
+
 unsigned int kmem_cache_size(struct kmem_cache *s);
 void __init kmem_cache_init_late(void);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 012/262] mm/slab.c: remove useless lines in enable_cpucache()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-11-05 20:35 ` [patch 011/262] mm: move kvmalloc-related functions to slab.h Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 013/262] slub: add back check for free nonslab objects Andrew Morton
                   ` (249 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, linux-mm, mm-commits, penberg,
	rientjes, shi_lei, torvalds, vbabka

From: Shi Lei <shi_lei@massclouds.com>
Subject: mm/slab.c: remove useless lines in enable_cpucache()

These lines are useless, so remove them.

Link: https://lkml.kernel.org/r/20210930034845.2539-1-shi_lei@massclouds.com
Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Shi Lei <shi_lei@massclouds.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/mm/slab.c~mm-remove-useless-lines-in-enable_cpucache
+++ a/mm/slab.c
@@ -3900,8 +3900,6 @@ static int enable_cpucache(struct kmem_c
 	if (err)
 		goto end;
 
-	if (limit && shared && batchcount)
-		goto skip_setup;
 	/*
 	 * The head array serves three purposes:
 	 * - create a LIFO ordering, i.e. return objects that are cache-warm
@@ -3944,7 +3942,6 @@ static int enable_cpucache(struct kmem_c
 		limit = 32;
 #endif
 	batchcount = (limit + 1) / 2;
-skip_setup:
 	err = do_tune_cpucache(cachep, limit, batchcount, shared, gfp);
 end:
 	if (err)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 013/262] slub: add back check for free nonslab objects
  2021-11-05 20:34 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-11-05 20:35 ` [patch 012/262] mm/slab.c: remove useless lines in enable_cpucache() Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 014/262] mm, slub: change percpu partial accounting from objects to pages Andrew Morton
                   ` (248 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, linux-mm, mm-commits, penberg,
	rientjes, shakeelb, torvalds, vbabka, wangkefeng.wang, willy

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: slub: add back check for free nonslab objects

After commit f227f0faf63b ("slub: fix unreclaimable slab stat for bulk
free"), the check for freeing a nonslab page was replaced by
VM_BUG_ON_PAGE, which only checks with CONFIG_DEBUG_VM enabled, but this
config may impact performance, so it is only meant for debugging.

Commit 0937502af7c9 ("slub: Add check for kfree() of non slab objects.")
added this check, which should be present in any config to catch invalid
frees; such frees could even indicate real problems, e.g. memory
corruption, use-after-free and double free.  So replace VM_BUG_ON_PAGE
with WARN_ON_ONCE and print the object address to help debug the issue.

Link: https://lkml.kernel.org/r/20210930070214.61499-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rienjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/slub.c~slub-add-back-check-for-free-nonslab-objects
+++ a/mm/slub.c
@@ -3522,7 +3522,9 @@ static inline void free_nonslab_page(str
 {
 	unsigned int order = compound_order(page);
 
-	VM_BUG_ON_PAGE(!PageCompound(page), page);
+	if (WARN_ON_ONCE(!PageCompound(page)))
+		pr_warn_once("object pointer: 0x%p\n", object);
+
 	kfree_hook(object);
 	mod_lruvec_page_state(page, NR_SLAB_UNRECLAIMABLE_B, -(PAGE_SIZE << order));
 	__free_pages(page, order);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 014/262] mm, slub: change percpu partial accounting from objects to pages
  2021-11-05 20:34 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-11-05 20:35 ` [patch 013/262] slub: add back check for free nonslab objects Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 015/262] mm/slub: increase default cpu partial list sizes Andrew Morton
                   ` (247 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, cl, guro, iamjoonsoo.kim, jannh, linux-mm, mm-commits,
	penberg, rientjes, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: change percpu partial accounting from objects to pages

With CONFIG_SLUB_CPU_PARTIAL enabled, SLUB keeps a percpu list of partial
slabs that can be promoted to cpu slab when the previous one is depleted,
without accessing the shared partial list.  A slab can be added to this
list by 1) refill of an empty list from get_partial_node() - once we
really have to access the shared partial list, we acquire multiple slabs
to amortize the cost of locking, and 2) first free to a previously full
slab - instead of putting the slab on a shared partial list, we can more
cheaply freeze it and put it on the per-cpu list.

To control how large a percpu partial list can grow for a kmem cache,
set_cpu_partial() calculates a target number of free objects on each cpu's
percpu partial list, and this can be also set by the sysfs file
cpu_partial.

However, the tracking of the actual number of objects is imprecise, in
order to limit overhead from cpu X freeing an object to a slab on the percpu
partial list of cpu Y.  Basically, the percpu partial slabs form a single
linked list, and when we add a new slab to the list with current head
"oldpage", we set in the struct page of the slab we're adding:

page->pages = oldpage->pages + 1; // this is precise
page->pobjects = oldpage->pobjects + (page->objects - page->inuse);
page->next = oldpage;

Thus the real number of free objects in the slab (objects - inuse) is only
determined at the moment of adding the slab to the percpu partial list,
and further freeing doesn't update the pobjects counter nor propagate it
to the current list head.  As Jann reports [1], this can easily lead to
large inaccuracies, where the target number of objects (up to 30 by
default) can translate to the same number of (empty) slab pages on the
list.  In case 2) above, we put a slab with 1 free object on the list,
thus only increase page->pobjects by 1, even if there are subsequent frees
on the same slab.  Jann has noticed this in practice and so did we [2]
when investigating significant increase of kmemcg usage after switching
from SLAB to SLUB.

While this is no longer a problem in kmemcg context thanks to the
accounting rewrite in 5.9, the memory waste is still not ideal and it's
questionable whether it makes sense to perform free-object-count-based
control when object counts can easily become so inaccurate.  So this
patch converts the accounting to be based on number of pages only (which
is precise) and removes the page->pobjects field completely.  This is also
ultimately simpler.

To retain the existing set_cpu_partial() heuristic, first calculate the
target number of objects as previously, but then convert it to target
number of pages by assuming the pages will be half-filled on average. 
This assumption might obviously also be inaccurate in practice, but cannot
degrade to actual number of pages being equal to the target number of
objects.
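
To make the conversion concrete (the numbers here are only an example,
not taken from the patch): with the default target of 30 objects and a
cache whose slabs hold 32 objects each, slub_set_cpu_partial() computes
DIV_ROUND_UP(30 * 2, 32) = 2, i.e. at most 2 pages on that cache's percpu
partial list, consistent with the "up to 2 pages" figure mentioned below.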

We could also skip the intermediate step with target number of objects and
rewrite the heuristic in terms of pages.  However we still have the sysfs
file cpu_partial which uses number of objects and could break existing
users if it suddenly becomes number of pages, so this patch doesn't do
that.

In practice, after this patch the heuristics limit the size of percpu
partial list up to 2 pages.  In case of a reported regression (which would
mean some workload has benefited from the previous imprecise object based
counting), we can tune the heuristics to get a better compromise within
the new scheme, while still avoid the unexpectedly long percpu partial
lists.

[1] https://lore.kernel.org/linux-mm/CAG48ez2Qx5K1Cab-m8BdSibp6wLTip6ro4=-umR7BLsEgjEYzA@mail.gmail.com/
[2] https://lore.kernel.org/all/2f0f46e8-2535-410a-1859-e9cfa4e57c18@suse.cz/

==========
Evaluation
==========

Mel was kind enough to run v1 through mmtests machinery for netperf
(localhost) and hackbench and, for most significant results see below.  So
there are some apparent regressions, especially with hackbench, which I
think ultimately boils down to having shorter percpu partial lists on
average and some benchmarks benefiting from longer ones.  Monitoring slab
usage also indicated less memory usage by slab.  Based on that, the
following patch will bump the defaults to allow longer percpu partial
lists than after this patch.

However the goal is certainly not such that we would limit the percpu
partial lists to 30 pages just because previously a specific alloc/free
pattern could lead to the limit of 30 objects translate to a limit to 30
pages - that would make little sense.  This is a correctness patch, and if
a workload benefits from larger lists, the sysfs tuning knobs are still
there to allow that.

Netperf

2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
per socket), 384GB RAM
TCP-RR:
hmean before 127045.79 after 121092.94 (-4.69%, worse)
stddev before  2634.37 after   1254.08
UDP-RR:
hmean before 166985.45 after 160668.94 ( -3.78%, worse)
stddev before 4059.69 after 1943.63

2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
per socket), 512GB RAM
TCP-RR:
hmean before 84173.25 after 76914.72 ( -8.62%, worse)
UDP-RR:
hmean before 93571.12 after 96428.69 ( 3.05%, better)
stddev before 23118.54 after 16828.14

2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
per socket), 64GB RAM
TCP-RR:
hmean before 49984.92 after 48922.27 ( -2.13%, worse)
stddev before 6248.15 after 4740.51
UDP-RR:
hmean before 61854.31 after 68761.81 ( 11.17%, better)
stddev before 4093.54 after 5898.91

other machines - within 2%

Hackbench

(results before and after the patch, negative % means worse)

2-socket AMD EPYC 7713 (64 cores, 128 threads per core), 256GB RAM
hackbench-process-sockets
Amean 	1 	0.5380	0.5583	( -3.78%)
Amean 	4 	0.7510	0.8150	( -8.52%)
Amean 	7 	0.7930	0.9533	( -20.22%)
Amean 	12 	0.7853	1.1313	( -44.06%)
Amean 	21 	1.1520	1.4993	( -30.15%)
Amean 	30 	1.6223	1.9237	( -18.57%)
Amean 	48 	2.6767	2.9903	( -11.72%)
Amean 	79 	4.0257	5.1150	( -27.06%)
Amean 	110	5.5193	7.4720	( -35.38%)
Amean 	141	7.2207	9.9840	( -38.27%)
Amean 	172	8.4770	12.1963	( -43.88%)
Amean 	203	9.6473	14.3137	( -48.37%)
Amean 	234	11.3960	18.7917	( -64.90%)
Amean 	265	13.9627	22.4607	( -60.86%)
Amean 	296	14.9163	26.0483	( -74.63%)

hackbench-thread-sockets
Amean 	1 	0.5597	0.5877	( -5.00%)
Amean 	4 	0.7913	0.8960	( -13.23%)
Amean 	7 	0.8190	1.0017	( -22.30%)
Amean 	12 	0.9560	1.1727	( -22.66%)
Amean 	21 	1.7587	1.5660	( 10.96%)
Amean 	30 	2.4477	1.9807	( 19.08%)
Amean 	48 	3.4573	3.0630	( 11.41%)
Amean 	79 	4.7903	5.1733	( -8.00%)
Amean 	110	6.1370	7.4220	( -20.94%)
Amean 	141	7.5777	9.2617	( -22.22%)
Amean 	172	9.2280	11.0907	( -20.18%)
Amean 	203	10.2793	13.3470	( -29.84%)
Amean 	234	11.2410	17.1070	( -52.18%)
Amean 	265	12.5970	23.3323	( -85.22%)
Amean 	296	17.1540	24.2857	( -41.57%)

2-socket Intel(R) Xeon(R) Gold 5218R CPU @ 2.10GHz (20 cores, 40 threads
per socket), 384GB RAM
hackbench-process-sockets
Amean 	1 	0.5760	0.4793	( 16.78%)
Amean 	4 	0.9430	0.9707	( -2.93%)
Amean 	7 	1.5517	1.8843	( -21.44%)
Amean 	12 	2.4903	2.7267	( -9.49%)
Amean 	21 	3.9560	4.2877	( -8.38%)
Amean 	30 	5.4613	5.8343	( -6.83%)
Amean 	48 	8.5337	9.2937	( -8.91%)
Amean 	79 	14.0670	15.2630	( -8.50%)
Amean 	110	19.2253	21.2467	( -10.51%)
Amean 	141	23.7557	25.8550	( -8.84%)
Amean 	172	28.4407	29.7603	( -4.64%)
Amean 	203	33.3407	33.9927	( -1.96%)
Amean 	234	38.3633	39.1150	( -1.96%)
Amean 	265	43.4420	43.8470	( -0.93%)
Amean 	296	48.3680	48.9300	( -1.16%)

hackbench-thread-sockets
Amean 	1 	0.6080	0.6493	( -6.80%)
Amean 	4 	1.0000	1.0513	( -5.13%)
Amean 	7 	1.6607	2.0260	( -22.00%)
Amean 	12 	2.7637	2.9273	( -5.92%)
Amean 	21 	5.0613	4.5153	( 10.79%)
Amean 	30 	6.3340	6.1140	( 3.47%)
Amean 	48 	9.0567	9.5577	( -5.53%)
Amean 	79 	14.5657	15.7983	( -8.46%)
Amean 	110	19.6213	21.6333	( -10.25%)
Amean 	141	24.1563	26.2697	( -8.75%)
Amean 	172	28.9687	30.2187	( -4.32%)
Amean 	203	33.9763	34.6970	( -2.12%)
Amean 	234	38.8647	39.3207	( -1.17%)
Amean 	265	44.0813	44.1507	( -0.16%)
Amean 	296	49.2040	49.4330	( -0.47%)

2-socket Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz (20 cores, 40 threads
per socket), 512GB RAM
hackbench-process-sockets
Amean 	1 	0.5027	0.5017	( 0.20%)
Amean 	4 	1.1053	1.2033	( -8.87%)
Amean 	7 	1.8760	2.1820	( -16.31%)
Amean 	12 	2.9053	3.1810	( -9.49%)
Amean 	21 	4.6777	4.9920	( -6.72%)
Amean 	30 	6.5180	6.7827	( -4.06%)
Amean 	48 	10.0710	10.5227	( -4.48%)
Amean 	79 	16.4250	17.5053	( -6.58%)
Amean 	110	22.6203	24.4617	( -8.14%)
Amean 	141	28.0967	31.0363	( -10.46%)
Amean 	172	34.4030	36.9233	( -7.33%)
Amean 	203	40.5933	43.0850	( -6.14%)
Amean 	234	46.6477	48.7220	( -4.45%)
Amean 	265	53.0530	53.9597	( -1.71%)
Amean 	296	59.2760	59.9213	( -1.09%)

hackbench-thread-sockets
Amean 	1 	0.5363	0.5330	( 0.62%)
Amean 	4 	1.1647	1.2157	( -4.38%)
Amean 	7 	1.9237	2.2833	( -18.70%)
Amean 	12 	2.9943	3.3110	( -10.58%)
Amean 	21 	4.9987	5.1880	( -3.79%)
Amean 	30 	6.7583	7.0043	( -3.64%)
Amean 	48 	10.4547	10.8353	( -3.64%)
Amean 	79 	16.6707	17.6790	( -6.05%)
Amean 	110	22.8207	24.4403	( -7.10%)
Amean 	141	28.7090	31.0533	( -8.17%)
Amean 	172	34.9387	36.8260	( -5.40%)
Amean 	203	41.1567	43.0450	( -4.59%)
Amean 	234	47.3790	48.5307	( -2.43%)
Amean 	265	53.9543	54.6987	( -1.38%)
Amean 	296	60.0820	60.2163	( -0.22%)

1-socket Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz (4 cores, 8 threads),
32 GB RAM
hackbench-process-sockets
Amean 	1 	1.4760	1.5773	( -6.87%)
Amean 	3 	3.9370	4.0910	( -3.91%)
Amean 	5 	6.6797	6.9357	( -3.83%)
Amean 	7 	9.3367	9.7150	( -4.05%)
Amean 	12	15.7627	16.1400	( -2.39%)
Amean 	18	23.5360	23.6890	( -0.65%)
Amean 	24	31.0663	31.3137	( -0.80%)
Amean 	30	38.7283	39.0037	( -0.71%)
Amean 	32	41.3417	41.6097	( -0.65%)

hackbench-thread-sockets
Amean 	1 	1.5250	1.6043	( -5.20%)
Amean 	3 	4.0897	4.2603	( -4.17%)
Amean 	5 	6.7760	7.0933	( -4.68%)
Amean 	7 	9.4817	9.9157	( -4.58%)
Amean 	12	15.9610	16.3937	( -2.71%)
Amean 	18	23.9543	24.3417	( -1.62%)
Amean 	24	31.4400	31.7217	( -0.90%)
Amean 	30	39.2457	39.5467	( -0.77%)
Amean 	32	41.8267	42.1230	( -0.71%)

2-socket Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz (12 cores, 24 threads
per socket), 64GB RAM
hackbench-process-sockets
Amean 	1 	1.0347	1.0880	( -5.15%)
Amean 	4 	1.7267	1.8527	( -7.30%)
Amean 	7 	2.6707	2.8110	( -5.25%)
Amean 	12 	4.1617	4.3383	( -4.25%)
Amean 	21 	7.0070	7.2600	( -3.61%)
Amean 	30 	9.9187	10.2397	( -3.24%)
Amean 	48 	15.6710	16.3923	( -4.60%)
Amean 	79 	24.7743	26.1247	( -5.45%)
Amean 	110	34.3000	35.9307	( -4.75%)
Amean 	141	44.2043	44.8010	( -1.35%)
Amean 	172	54.2430	54.7260	( -0.89%)
Amean 	192	60.6557	60.9777	( -0.53%)

hackbench-thread-sockets
Amean 	1 	1.0610	1.1353	( -7.01%)
Amean 	4 	1.7543	1.9140	( -9.10%)
Amean 	7 	2.7840	2.9573	( -6.23%)
Amean 	12 	4.3813	4.4937	( -2.56%)
Amean 	21 	7.3460	7.5350	( -2.57%)
Amean 	30 	10.2313	10.5190	( -2.81%)
Amean 	48 	15.9700	16.5940	( -3.91%)
Amean 	79 	25.3973	26.6637	( -4.99%)
Amean 	110	35.1087	36.4797	( -3.91%)
Amean 	141	45.8220	46.3053	( -1.05%)
Amean 	172	55.4917	55.7320	( -0.43%)
Amean 	192	62.7490	62.5410	( 0.33%)

Link: https://lkml.kernel.org/r/20211012134651.11258-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Jann Horn <jannh@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm_types.h |    2 
 include/linux/slub_def.h |   13 -----
 mm/slub.c                |   89 ++++++++++++++++++++++++-------------
 3 files changed, 61 insertions(+), 43 deletions(-)

--- a/include/linux/mm_types.h~mm-slub-change-percpu-partial-accounting-from-objects-to-pages
+++ a/include/linux/mm_types.h
@@ -124,10 +124,8 @@ struct page {
 					struct page *next;
 #ifdef CONFIG_64BIT
 					int pages;	/* Nr of pages left */
-					int pobjects;	/* Approximate count */
 #else
 					short int pages;
-					short int pobjects;
 #endif
 				};
 			};
--- a/include/linux/slub_def.h~mm-slub-change-percpu-partial-accounting-from-objects-to-pages
+++ a/include/linux/slub_def.h
@@ -99,6 +99,8 @@ struct kmem_cache {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	/* Number of per cpu partial objects to keep around */
 	unsigned int cpu_partial;
+	/* Number of per cpu partial pages to keep around */
+	unsigned int cpu_partial_pages;
 #endif
 	struct kmem_cache_order_objects oo;
 
@@ -141,17 +143,6 @@ struct kmem_cache {
 	struct kmem_cache_node *node[MAX_NUMNODES];
 };
 
-#ifdef CONFIG_SLUB_CPU_PARTIAL
-#define slub_cpu_partial(s)		((s)->cpu_partial)
-#define slub_set_cpu_partial(s, n)		\
-({						\
-	slub_cpu_partial(s) = (n);		\
-})
-#else
-#define slub_cpu_partial(s)		(0)
-#define slub_set_cpu_partial(s, n)
-#endif /* CONFIG_SLUB_CPU_PARTIAL */
-
 #ifdef CONFIG_SYSFS
 #define SLAB_SUPPORTS_SYSFS
 void sysfs_slab_unlink(struct kmem_cache *);
--- a/mm/slub.c~mm-slub-change-percpu-partial-accounting-from-objects-to-pages
+++ a/mm/slub.c
@@ -414,6 +414,29 @@ static inline unsigned int oo_objects(st
 	return x.x & OO_MASK;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+static void slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
+{
+	unsigned int nr_pages;
+
+	s->cpu_partial = nr_objects;
+
+	/*
+	 * We take the number of objects but actually limit the number of
+	 * pages on the per cpu partial list, in order to limit excessive
+	 * growth of the list. For simplicity we assume that the pages will
+	 * be half-full.
+	 */
+	nr_pages = DIV_ROUND_UP(nr_objects * 2, oo_objects(s->oo));
+	s->cpu_partial_pages = nr_pages;
+}
+#else
+static inline void
+slub_set_cpu_partial(struct kmem_cache *s, unsigned int nr_objects)
+{
+}
+#endif /* CONFIG_SLUB_CPU_PARTIAL */
+
 /*
  * Per slab locking using the pagelock
  */
@@ -2052,7 +2075,7 @@ static inline void remove_partial(struct
  */
 static inline void *acquire_slab(struct kmem_cache *s,
 		struct kmem_cache_node *n, struct page *page,
-		int mode, int *objects)
+		int mode)
 {
 	void *freelist;
 	unsigned long counters;
@@ -2068,7 +2091,6 @@ static inline void *acquire_slab(struct
 	freelist = page->freelist;
 	counters = page->counters;
 	new.counters = counters;
-	*objects = new.objects - new.inuse;
 	if (mode) {
 		new.inuse = page->objects;
 		new.freelist = NULL;
@@ -2106,9 +2128,8 @@ static void *get_partial_node(struct kme
 {
 	struct page *page, *page2;
 	void *object = NULL;
-	unsigned int available = 0;
 	unsigned long flags;
-	int objects;
+	unsigned int partial_pages = 0;
 
 	/*
 	 * Racy check. If we mistakenly see no partial slabs then we
@@ -2126,11 +2147,10 @@ static void *get_partial_node(struct kme
 		if (!pfmemalloc_match(page, gfpflags))
 			continue;
 
-		t = acquire_slab(s, n, page, object == NULL, &objects);
+		t = acquire_slab(s, n, page, object == NULL);
 		if (!t)
 			break;
 
-		available += objects;
 		if (!object) {
 			*ret_page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
@@ -2138,10 +2158,15 @@ static void *get_partial_node(struct kme
 		} else {
 			put_cpu_partial(s, page, 0);
 			stat(s, CPU_PARTIAL_NODE);
+			partial_pages++;
 		}
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 		if (!kmem_cache_has_cpu_partial(s)
-			|| available > slub_cpu_partial(s) / 2)
+			|| partial_pages > s->cpu_partial_pages / 2)
 			break;
+#else
+		break;
+#endif
 
 	}
 	spin_unlock_irqrestore(&n->list_lock, flags);
@@ -2546,14 +2571,13 @@ static void put_cpu_partial(struct kmem_
 	struct page *page_to_unfreeze = NULL;
 	unsigned long flags;
 	int pages = 0;
-	int pobjects = 0;
 
 	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	oldpage = this_cpu_read(s->cpu_slab->partial);
 
 	if (oldpage) {
-		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+		if (drain && oldpage->pages >= s->cpu_partial_pages) {
 			/*
 			 * Partial array is full. Move the existing set to the
 			 * per node partial list. Postpone the actual unfreezing
@@ -2562,16 +2586,13 @@ static void put_cpu_partial(struct kmem_
 			page_to_unfreeze = oldpage;
 			oldpage = NULL;
 		} else {
-			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
 		}
 	}
 
 	pages++;
-	pobjects += page->objects - page->inuse;
 
 	page->pages = pages;
-	page->pobjects = pobjects;
 	page->next = oldpage;
 
 	this_cpu_write(s->cpu_slab->partial, page);
@@ -3991,6 +4012,8 @@ static void set_min_partial(struct kmem_
 static void set_cpu_partial(struct kmem_cache *s)
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
+	unsigned int nr_objects;
+
 	/*
 	 * cpu_partial determined the maximum number of objects kept in the
 	 * per cpu partial lists of a processor.
@@ -4000,24 +4023,22 @@ static void set_cpu_partial(struct kmem_
 	 * filled up again with minimal effort. The slab will never hit the
 	 * per node partial lists and therefore no locking will be required.
 	 *
-	 * This setting also determines
-	 *
-	 * A) The number of objects from per cpu partial slabs dumped to the
-	 *    per node list when we reach the limit.
-	 * B) The number of objects in cpu partial slabs to extract from the
-	 *    per node list when we run out of per cpu objects. We only fetch
-	 *    50% to keep some capacity around for frees.
+	 * For backwards compatibility reasons, this is determined as number
+	 * of objects, even though we now limit maximum number of pages, see
+	 * slub_set_cpu_partial()
 	 */
 	if (!kmem_cache_has_cpu_partial(s))
-		slub_set_cpu_partial(s, 0);
+		nr_objects = 0;
 	else if (s->size >= PAGE_SIZE)
-		slub_set_cpu_partial(s, 2);
+		nr_objects = 2;
 	else if (s->size >= 1024)
-		slub_set_cpu_partial(s, 6);
+		nr_objects = 6;
 	else if (s->size >= 256)
-		slub_set_cpu_partial(s, 13);
+		nr_objects = 13;
 	else
-		slub_set_cpu_partial(s, 30);
+		nr_objects = 30;
+
+	slub_set_cpu_partial(s, nr_objects);
 #endif
 }
 
@@ -5392,7 +5413,12 @@ SLAB_ATTR(min_partial);
 
 static ssize_t cpu_partial_show(struct kmem_cache *s, char *buf)
 {
-	return sysfs_emit(buf, "%u\n", slub_cpu_partial(s));
+	unsigned int nr_partial = 0;
+#ifdef CONFIG_SLUB_CPU_PARTIAL
+	nr_partial = s->cpu_partial;
+#endif
+
+	return sysfs_emit(buf, "%u\n", nr_partial);
 }
 
 static ssize_t cpu_partial_store(struct kmem_cache *s, const char *buf,
@@ -5463,12 +5489,12 @@ static ssize_t slabs_cpu_partial_show(st
 
 		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
 
-		if (page) {
+		if (page)
 			pages += page->pages;
-			objects += page->pobjects;
-		}
 	}
 
+	/* Approximate half-full pages , see slub_set_cpu_partial() */
+	objects = (pages * oo_objects(s->oo)) / 2;
 	len += sysfs_emit_at(buf, len, "%d(%d)", objects, pages);
 
 #ifdef CONFIG_SMP
@@ -5476,9 +5502,12 @@ static ssize_t slabs_cpu_partial_show(st
 		struct page *page;
 
 		page = slub_percpu_partial(per_cpu_ptr(s->cpu_slab, cpu));
-		if (page)
+		if (page) {
+			pages = READ_ONCE(page->pages);
+			objects = (pages * oo_objects(s->oo)) / 2;
 			len += sysfs_emit_at(buf, len, " C%d=%d(%d)",
-					     cpu, page->pobjects, page->pages);
+					     cpu, objects, pages);
+		}
 	}
 #endif
 	len += sysfs_emit_at(buf, len, "\n");
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 015/262] mm/slub: increase default cpu partial list sizes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-11-05 20:35 ` [patch 014/262] mm, slub: change percpu partial accounting from objects to pages Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 016/262] mm, slub: use prefetchw instead of prefetch Andrew Morton
                   ` (246 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, cl, guro, iamjoonsoo.kim, jannh, linux-mm, mm-commits,
	penberg, rientjes, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm/slub: increase default cpu partial list sizes

The defaults are determined based on object size and can go up to 30 for
objects smaller than 256 bytes.  Before the previous patch changed the
accounting, this could have made the cpu partial list contain up to 30 pages.
After that patch, only up to 2 pages with default allocation order.

Very short lists limit the usefulness of the whole concept of cpu partial
lists, so this patch aims at a more reasonable default under the new
accounting.  The defaults are quadrupled, except for object size >=
PAGE_SIZE where it's doubled.  This makes the lists grow up to 10 pages in
practice.

A quick test of booting a kernel under virtme with 4GB RAM and 8 vcpus
shows the following slab memory usage after boot:

Before previous patch (using page->pobjects):
Slab:              36732 kB
SReclaimable:      14836 kB
SUnreclaim:        21896 kB

After previous patch (using page->pages):
Slab:              34720 kB
SReclaimable:      13716 kB
SUnreclaim:        21004 kB

After this patch (using page->pages, higher defaults):
Slab:              35252 kB
SReclaimable:      13944 kB
SUnreclaim:        21308 kB

In the same setup, I also ran 5 times:
hackbench -l 16000 -g 16

Differences in time were in the noise; we can compare slub stats as given
by slabinfo -r skbuff_head_cache (the other cache heavily used by
hackbench, kmalloc-cg-512, looks similar).  Negligible stats are left out
for brevity.

Before previous patch (using page->pobjects):

Objects: 1408, Memory Total:  401408 Used :  304128

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             469952498  5946606  91   1
Slowpath             42053573 506059465   8  98
Page Alloc              41093    41044   0   0
Add partial                18 21229327   0   4
Remove partial       20039522    36051   3   0
Cpu partial list      4686640 24767229   0   4
RemoteObj/SlabFrozen       16 124027841   0  24
Total                512006071 512006071
Flushes       18

Slab Deactivation             Occurrences %
-------------------------------------------------
Slab empty                       4993    0%
Deactivation bypass           24767229   99%
Refilled from foreign frees   21972674   88%

After previous patch (using page->pages):

Objects: 480, Memory Total:  131072 Used :  103680

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             473016294  5405653  92   1
Slowpath             38989777 506600418   7  98
Page Alloc              32717    32701   0   0
Add partial                 3 22749164   0   4
Remove partial       11371127    32474   2   0
Cpu partial list     11686226 23090059   2   4
RemoteObj/SlabFrozen        2 67541803   0  13
Total                512006071 512006071
Flushes        3

Slab Deactivation             Occurrences %
-------------------------------------------------
Slab empty                        227    0%
Deactivation bypass           23090059   99%
Refilled from foreign frees   27585695  119%

After this patch (using page->pages, higher defaults):

Objects: 896, Memory Total:  229376 Used :  193536

Slab Perf Counter       Alloc     Free %Al %Fr
--------------------------------------------------
Fastpath             473799295  4980278  92   0
Slowpath             38206776 507025793   7  99
Page Alloc              32295    32267   0   0
Add partial                11 23291143   0   4
Remove partial        5815764    31278   1   0
Cpu partial list     18119280 23967320   3   4
RemoteObj/SlabFrozen       10 76974794   0  15
Total                512006071 512006071
Flushes       11

Slab Deactivation             Occurrences %
-------------------------------------------------
Slab empty                        989    0%
Deactivation bypass           23967320   99%
Refilled from foreign frees   32358473  135%

As expected, memory usage dropped significantly with the change of
accounting; increasing the defaults increased it again, but not as much.
The number of page allocations/frees dropped significantly with the new
accounting, but didn't increase with the higher defaults.  Interestingly,
the number of fastpath allocations increased, as did allocations from the
cpu partial list, even though it's shorter.

Link: https://lkml.kernel.org/r/20211012134651.11258-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/slub.c~mm-slub-increase-default-cpu-partial-list-sizes
+++ a/mm/slub.c
@@ -4030,13 +4030,13 @@ static void set_cpu_partial(struct kmem_
 	if (!kmem_cache_has_cpu_partial(s))
 		nr_objects = 0;
 	else if (s->size >= PAGE_SIZE)
-		nr_objects = 2;
-	else if (s->size >= 1024)
 		nr_objects = 6;
+	else if (s->size >= 1024)
+		nr_objects = 24;
 	else if (s->size >= 256)
-		nr_objects = 13;
+		nr_objects = 52;
 	else
-		nr_objects = 30;
+		nr_objects = 120;
 
 	slub_set_cpu_partial(s, nr_objects);
 #endif
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 016/262] mm, slub: use prefetchw instead of prefetch
  2021-11-05 20:34 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-11-05 20:35 ` [patch 015/262] mm/slub: increase default cpu partial list sizes Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 017/262] mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT Andrew Morton
                   ` (245 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: 42.hyeyoo, akpm, cl, iamjoonsoo.kim, linux-mm, mm-commits,
	penberg, rientjes, torvalds, vbabka

From: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Subject: mm, slub: use prefetchw instead of prefetch

commit 0ad9500e16fe ("slub: prefetch next freelist pointer in
slab_alloc()") introduced prefetch_freepointer() because, when other cpu(s)
free objects into a page that the current cpu owns, the freelist link is
hot on the cpu(s) which freed the objects and possibly very cold on the
current cpu.

But if the freelist link chain is hot on the cpu(s) which freed the
objects, it's better to invalidate that chain, because those cpus are not
going to access it again within a short time.

So use prefetchw instead of prefetch.  On supported architectures like x86
and arm, prefetchw invalidates other cached copies of a cache line when
prefetching it.
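
For reference, a minimal user-space sketch of the same read-vs-write
prefetch distinction, using the GCC/Clang builtin rather than the kernel's
prefetch()/prefetchw() macros (a toolchain assumption, not part of the
patch):

	/* read prefetch: the line may be brought in shared */
	static inline void prefetch_ro(const void *p)
	{
		__builtin_prefetch(p, 0, 3);
	}

	/* write prefetch: request the line for ownership, so other cpus'
	 * copies get invalidated, as described above
	 */
	static inline void prefetch_rw(const void *p)
	{
		__builtin_prefetch(p, 1, 3);
	}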

Before:

Time: 91.677

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1462938.07 msec cpu-clock                 #   15.908 CPUs utilized
          18072550      context-switches          #   12.354 K/sec
           1018814      cpu-migrations            #  696.416 /sec
            104558      page-faults               #   71.471 /sec
     1580035699271      cycles                    #    1.080 GHz                      (54.51%)
     2003670016013      instructions              #    1.27  insn per cycle           (54.31%)
        5702204863      branch-misses                                                 (54.28%)
      643368500985      cache-references          #  439.778 M/sec                    (54.26%)
       18475582235      cache-misses              #    2.872 % of all cache refs      (54.28%)
      642206796636      L1-dcache-loads           #  438.984 M/sec                    (46.87%)
       18215813147      L1-dcache-load-misses     #    2.84% of all L1-dcache accesses  (46.83%)
      653842996501      dTLB-loads                #  446.938 M/sec                    (46.63%)
        3227179675      dTLB-load-misses          #    0.49% of all dTLB cache accesses  (46.85%)
      537531951350      iTLB-loads                #  367.433 M/sec                    (54.33%)
         114750630      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.37%)
      630135543177      L1-icache-loads           #  430.733 M/sec                    (46.80%)
       22923237620      L1-icache-load-misses     #    3.64% of all L1-icache accesses  (46.76%)

      91.964452802 seconds time elapsed

      43.416742000 seconds user
    1422.441123000 seconds sys

After:

Time: 90.220

 Performance counter stats for 'hackbench -g 100 -l 10000':
        1437418.48 msec cpu-clock                 #   15.880 CPUs utilized
          17694068      context-switches          #   12.310 K/sec
            958257      cpu-migrations            #  666.651 /sec
            100604      page-faults               #   69.989 /sec
     1583259429428      cycles                    #    1.101 GHz                      (54.57%)
     2004002484935      instructions              #    1.27  insn per cycle           (54.37%)
        5594202389      branch-misses                                                 (54.36%)
      643113574524      cache-references          #  447.409 M/sec                    (54.39%)
       18233791870      cache-misses              #    2.835 % of all cache refs      (54.37%)
      640205852062      L1-dcache-loads           #  445.386 M/sec                    (46.75%)
       17968160377      L1-dcache-load-misses     #    2.81% of all L1-dcache accesses  (46.79%)
      651747432274      dTLB-loads                #  453.415 M/sec                    (46.59%)
        3127124271      dTLB-load-misses          #    0.48% of all dTLB cache accesses  (46.75%)
      535395273064      iTLB-loads                #  372.470 M/sec                    (54.38%)
         113500056      iTLB-load-misses          #    0.02% of all iTLB cache accesses  (54.35%)
      628871845924      L1-icache-loads           #  437.501 M/sec                    (46.80%)
       22585641203      L1-icache-load-misses     #    3.59% of all L1-icache accesses  (46.79%)

      90.514819303 seconds time elapsed

      43.877656000 seconds user
    1397.176001000 seconds sys

Link: https://lkml.org/lkml/2021/10/8/598
Link: https://lkml.kernel.org/r/20211011144331.70084-1-42.hyeyoo@gmail.com
Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-use-prefetchw-instead-of-prefetch
+++ a/mm/slub.c
@@ -354,7 +354,7 @@ static inline void *get_freepointer(stru
 
 static void prefetch_freepointer(const struct kmem_cache *s, void *object)
 {
-	prefetch(object + s->offset);
+	prefetchw(object + s->offset);
 }
 
 static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 017/262] mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT
  2021-11-05 20:34 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-11-05 20:35 ` [patch 016/262] mm, slub: use prefetchw instead of prefetch Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 018/262] mm: don't include <linux/dax.h> in <linux/mempolicy.h> Andrew Morton
                   ` (244 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, bigeasy, david, linux-mm, mgorman, mm-commits, peterz,
	tglx, torvalds, vbabka

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT

TRANSPARENT_HUGEPAGE:
There are potential non-deterministic delays to an RT thread if a critical
memory region is not THP-aligned and a non-RT buffer is located in the same
hugepage-aligned region. It's also possible for an unrelated thread to migrate
pages belonging to an RT task incurring unexpected page faults due to memory
defragmentation even if khugepaged is disabled.

Regular HUGEPAGEs are not affected by this and can be used.

NUMA_BALANCING:
There is a non-deterministic delay when marking PTEs PROT_NONE to gather
NUMA fault samples, increased page faults in regions even if they are
mlocked, and non-deterministic delays when migrating pages.

[Mel Gorman worded 99% of the commit description].

Link: https://lore.kernel.org/all/20200304091159.GN3818@techsingularity.net/
Link: https://lore.kernel.org/all/20211026165100.ahz5bkx44lrrw5pt@linutronix.de/
Link: https://lkml.kernel.org/r/20211028143327.hfbxjze7palrpfgp@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/Kconfig |    2 +-
 mm/Kconfig   |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/init/Kconfig~mm-disable-numa_balancing_default_enabled-and-transparent_hugepage-on-preempt_rt
+++ a/init/Kconfig
@@ -901,7 +901,7 @@ config NUMA_BALANCING
 	bool "Memory placement aware NUMA scheduler"
 	depends on ARCH_SUPPORTS_NUMA_BALANCING
 	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
-	depends on SMP && NUMA && MIGRATION
+	depends on SMP && NUMA && MIGRATION && !PREEMPT_RT
 	help
 	  This option adds support for automatic NUMA aware memory/task placement.
 	  The mechanism is quite primitive and is based on migrating memory when
--- a/mm/Kconfig~mm-disable-numa_balancing_default_enabled-and-transparent_hugepage-on-preempt_rt
+++ a/mm/Kconfig
@@ -371,7 +371,7 @@ config NOMMU_INITIAL_TRIM_EXCESS
 
 config TRANSPARENT_HUGEPAGE
 	bool "Transparent Hugepage Support"
-	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
+	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
 	select COMPACTION
 	select XARRAY_MULTI
 	help
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 018/262] mm: don't include <linux/dax.h> in <linux/mempolicy.h>
  2021-11-05 20:34 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-11-05 20:35 ` [patch 017/262] mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 019/262] lib/stackdepot: include gfp.h Andrew Morton
                   ` (243 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, dan.j.williams, hch, linux-mm, mm-commits, naoya.horiguchi,
	torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: mm: don't include <linux/dax.h> in <linux/mempolicy.h>

Not required at all, and having this causes a huge kernel rebuild as soon
as something in dax.h changes.

Link: https://lkml.kernel.org/r/20210921082253.1859794-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mempolicy.h |    1 -
 mm/memory-failure.c       |    1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/mempolicy.h~mm-dont-include-linux-daxh-in-linux-mempolicyh
+++ a/include/linux/mempolicy.h
@@ -8,7 +8,6 @@
 
 #include <linux/sched.h>
 #include <linux/mmzone.h>
-#include <linux/dax.h>
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
--- a/mm/memory-failure.c~mm-dont-include-linux-daxh-in-linux-mempolicyh
+++ a/mm/memory-failure.c
@@ -39,6 +39,7 @@
 #include <linux/kernel-page-flags.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/task.h>
+#include <linux/dax.h>
 #include <linux/ksm.h>
 #include <linux/rmap.h>
 #include <linux/export.h>
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 019/262] lib/stackdepot: include gfp.h
  2021-11-05 20:34 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-11-05 20:35 ` [patch 018/262] mm: don't include <linux/dax.h> in <linux/mempolicy.h> Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 020/262] lib/stackdepot: remove unused function argument Andrew Morton
                   ` (242 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: lib/stackdepot: include gfp.h

Patch series "stackdepot, kasan, workqueue: Avoid expanding stackdepot slabs when holding raw_spin_lock", v2.

Shuah Khan reported [1]:

 | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
 | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
 | it tries to allocate memory attempting to acquire spinlock in page
 | allocation code while holding workqueue pool raw_spinlock.
 |
 | There are several instances of this problem when block layer tries
 | to __queue_work(). Call trace from one of these instances is below:
 |
 |     kblockd_mod_delayed_work_on()
 |       mod_delayed_work_on()
 |         __queue_delayed_work()
 |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
 |             insert_work()
 |               kasan_record_aux_stack()
 |                 kasan_save_stack()
 |                   stack_depot_save()
 |                     alloc_pages()
 |                       __alloc_pages()
 |                         get_page_from_freelist()
 |                           rm_queue()
 |                             rm_queue_pcplist()
 |                               local_lock_irqsave(&pagesets.lock, flags);
 |                               [ BUG: Invalid wait context triggered ]

PROVE_RAW_LOCK_NESTING is pointing out that (on RT kernels) the locking
rules are being violated.  More generally, memory is being allocated from
a non-preemptive context (a raw_spin_lock'd critical section) where it is
not allowed.

To properly fix this, we must prevent stackdepot from replenishing its
"stack slab" pool if memory allocations cannot be done in the current
context: it's a bug to use either GFP_ATOMIC or GFP_NOWAIT in certain
non-preemptive contexts, including raw_spin_locks (see gfp.h and
ab00db216c9c7).

The only downside is that saving a stack trace may fail if: stackdepot
runs out of space AND the same stack trace has not been recorded before. 
I expect this to be unlikely, and a simple experiment (boot the kernel)
didn't result in any failure to record stack trace from insert_work().

The series includes a few minor fixes to stackdepot that I noticed in
preparing the series.  It then introduces __stack_depot_save(), which
exposes the option to force stackdepot to not allocate any memory. 
Finally, KASAN is changed to use the new stackdepot interface and provide
kasan_record_aux_stack_noalloc(), which is then used by workqueue code.

[1] https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org


This patch (of 6):

<linux/stackdepot.h> refers to gfp_t, but doesn't include gfp.h.

Fix it by including <linux/gfp.h>.

Link: https://lkml.kernel.org/r/20210913112609.2651084-1-elver@google.com
Link: https://lkml.kernel.org/r/20210913112609.2651084-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/stackdepot.h |    2 ++
 1 file changed, 2 insertions(+)

--- a/include/linux/stackdepot.h~lib-stackdepot-include-gfph
+++ a/include/linux/stackdepot.h
@@ -11,6 +11,8 @@
 #ifndef _LINUX_STACKDEPOT_H
 #define _LINUX_STACKDEPOT_H
 
+#include <linux/gfp.h>
+
 typedef u32 depot_stack_handle_t;
 
 depot_stack_handle_t stack_depot_save(unsigned long *entries,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 020/262] lib/stackdepot: remove unused function argument
  2021-11-05 20:34 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2021-11-05 20:35 ` [patch 019/262] lib/stackdepot: include gfp.h Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 021/262] lib/stackdepot: introduce __stack_depot_save() Andrew Morton
                   ` (241 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: lib/stackdepot: remove unused function argument

alloc_flags in depot_alloc_stack() is no longer used; remove it.

Link: https://lkml.kernel.org/r/20210913112609.2651084-3-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/stackdepot.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/lib/stackdepot.c~lib-stackdepot-remove-unused-function-argument
+++ a/lib/stackdepot.c
@@ -102,8 +102,8 @@ static bool init_stack_slab(void **preal
 }
 
 /* Allocation of a new stack in raw storage */
-static struct stack_record *depot_alloc_stack(unsigned long *entries, int size,
-		u32 hash, void **prealloc, gfp_t alloc_flags)
+static struct stack_record *
+depot_alloc_stack(unsigned long *entries, int size, u32 hash, void **prealloc)
 {
 	struct stack_record *stack;
 	size_t required_size = struct_size(stack, entries, size);
@@ -309,9 +309,8 @@ depot_stack_handle_t stack_depot_save(un
 
 	found = find_stack(*bucket, entries, nr_entries, hash);
 	if (!found) {
-		struct stack_record *new =
-			depot_alloc_stack(entries, nr_entries,
-					  hash, &prealloc, alloc_flags);
+		struct stack_record *new = depot_alloc_stack(entries, nr_entries, hash, &prealloc);
+
 		if (new) {
 			new->next = *bucket;
 			/*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 021/262] lib/stackdepot: introduce __stack_depot_save()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2021-11-05 20:35 ` [patch 020/262] lib/stackdepot: remove unused function argument Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 022/262] kasan: common: provide can_alloc in kasan_save_stack() Andrew Morton
                   ` (240 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: lib/stackdepot: introduce __stack_depot_save()

Add __stack_depot_save(), which provides more fine-grained control over
stackdepot's memory allocation behaviour, in case stackdepot runs out of
"stack slabs".

Normally stackdepot uses alloc_pages() in case it runs out of space;
passing can_alloc==false to __stack_depot_save() prohibits this, at the
cost of more likely failure to record a stack trace.
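
A rough usage sketch (illustrative, not from the patch) for a caller that
may hold a raw_spin_lock and therefore must pass can_alloc == false:

	#include <linux/kernel.h>
	#include <linux/stackdepot.h>
	#include <linux/stacktrace.h>

	static depot_stack_handle_t record_stack_noalloc(void)
	{
		unsigned long entries[16];
		unsigned int nr;

		nr = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
		/* may return 0 if the pool is full and the trace is new */
		return __stack_depot_save(entries, nr, GFP_NOWAIT, false);
	}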

Link: https://lkml.kernel.org/r/20210913112609.2651084-4-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/stackdepot.h |    4 +++
 lib/stackdepot.c           |   43 ++++++++++++++++++++++++++++++-----
 2 files changed, 41 insertions(+), 6 deletions(-)

--- a/include/linux/stackdepot.h~lib-stackdepot-introduce-__stack_depot_save
+++ a/include/linux/stackdepot.h
@@ -15,6 +15,10 @@
 
 typedef u32 depot_stack_handle_t;
 
+depot_stack_handle_t __stack_depot_save(unsigned long *entries,
+					unsigned int nr_entries,
+					gfp_t gfp_flags, bool can_alloc);
+
 depot_stack_handle_t stack_depot_save(unsigned long *entries,
 				      unsigned int nr_entries, gfp_t gfp_flags);
 
--- a/lib/stackdepot.c~lib-stackdepot-introduce-__stack_depot_save
+++ a/lib/stackdepot.c
@@ -248,17 +248,28 @@ unsigned int stack_depot_fetch(depot_sta
 EXPORT_SYMBOL_GPL(stack_depot_fetch);
 
 /**
- * stack_depot_save - Save a stack trace from an array
+ * __stack_depot_save - Save a stack trace from an array
  *
  * @entries:		Pointer to storage array
  * @nr_entries:		Size of the storage array
  * @alloc_flags:	Allocation gfp flags
+ * @can_alloc:		Allocate stack slabs (increased chance of failure if false)
+ *
+ * Saves a stack trace from @entries array of size @nr_entries. If @can_alloc is
+ * %true, is allowed to replenish the stack slab pool in case no space is left
+ * (allocates using GFP flags of @alloc_flags). If @can_alloc is %false, avoids
+ * any allocations and will fail if no space is left to store the stack trace.
+ *
+ * Context: Any context, but setting @can_alloc to %false is required if
+ *          alloc_pages() cannot be used from the current context. Currently
+ *          this is the case from contexts where neither %GFP_ATOMIC nor
+ *          %GFP_NOWAIT can be used (NMI, raw_spin_lock).
  *
- * Return: The handle of the stack struct stored in depot
+ * Return: The handle of the stack struct stored in depot, 0 on failure.
  */
-depot_stack_handle_t stack_depot_save(unsigned long *entries,
-				      unsigned int nr_entries,
-				      gfp_t alloc_flags)
+depot_stack_handle_t __stack_depot_save(unsigned long *entries,
+					unsigned int nr_entries,
+					gfp_t alloc_flags, bool can_alloc)
 {
 	struct stack_record *found = NULL, **bucket;
 	depot_stack_handle_t retval = 0;
@@ -291,7 +302,7 @@ depot_stack_handle_t stack_depot_save(un
 	 * The smp_load_acquire() here pairs with smp_store_release() to
 	 * |next_slab_inited| in depot_alloc_stack() and init_stack_slab().
 	 */
-	if (unlikely(!smp_load_acquire(&next_slab_inited))) {
+	if (unlikely(can_alloc && !smp_load_acquire(&next_slab_inited))) {
 		/*
 		 * Zero out zone modifiers, as we don't have specific zone
 		 * requirements. Keep the flags related to allocation in atomic
@@ -339,6 +350,26 @@ exit:
 fast_exit:
 	return retval;
 }
+EXPORT_SYMBOL_GPL(__stack_depot_save);
+
+/**
+ * stack_depot_save - Save a stack trace from an array
+ *
+ * @entries:		Pointer to storage array
+ * @nr_entries:		Size of the storage array
+ * @alloc_flags:	Allocation gfp flags
+ *
+ * Context: Contexts where allocations via alloc_pages() are allowed.
+ *          See __stack_depot_save() for more details.
+ *
+ * Return: The handle of the stack struct stored in depot, 0 on failure.
+ */
+depot_stack_handle_t stack_depot_save(unsigned long *entries,
+				      unsigned int nr_entries,
+				      gfp_t alloc_flags)
+{
+	return __stack_depot_save(entries, nr_entries, alloc_flags, true);
+}
 EXPORT_SYMBOL_GPL(stack_depot_save);
 
 static inline int in_irqentry_text(unsigned long ptr)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 022/262] kasan: common: provide can_alloc in kasan_save_stack()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2021-11-05 20:35 ` [patch 021/262] lib/stackdepot: introduce __stack_depot_save() Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 023/262] kasan: generic: introduce kasan_record_aux_stack_noalloc() Andrew Morton
                   ` (239 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: kasan: common: provide can_alloc in kasan_save_stack()

Add another argument, can_alloc, to kasan_save_stack() which is passed
as-is to __stack_depot_save().

No functional change intended.

Link: https://lkml.kernel.org/r/20210913112609.2651084-5-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kasan/common.c  |    6 +++---
 mm/kasan/generic.c |    2 +-
 mm/kasan/kasan.h   |    2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

--- a/mm/kasan/common.c~kasan-common-provide-can_alloc-in-kasan_save_stack
+++ a/mm/kasan/common.c
@@ -30,20 +30,20 @@
 #include "kasan.h"
 #include "../slab.h"
 
-depot_stack_handle_t kasan_save_stack(gfp_t flags)
+depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc)
 {
 	unsigned long entries[KASAN_STACK_DEPTH];
 	unsigned int nr_entries;
 
 	nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
 	nr_entries = filter_irq_stacks(entries, nr_entries);
-	return stack_depot_save(entries, nr_entries, flags);
+	return __stack_depot_save(entries, nr_entries, flags, can_alloc);
 }
 
 void kasan_set_track(struct kasan_track *track, gfp_t flags)
 {
 	track->pid = current->pid;
-	track->stack = kasan_save_stack(flags);
+	track->stack = kasan_save_stack(flags, true);
 }
 
 #if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)
--- a/mm/kasan/generic.c~kasan-common-provide-can_alloc-in-kasan_save_stack
+++ a/mm/kasan/generic.c
@@ -345,7 +345,7 @@ void kasan_record_aux_stack(void *addr)
 		return;
 
 	alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
-	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT);
+	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, true);
 }
 
 void kasan_set_free_info(struct kmem_cache *cache,
--- a/mm/kasan/kasan.h~kasan-common-provide-can_alloc-in-kasan_save_stack
+++ a/mm/kasan/kasan.h
@@ -251,7 +251,7 @@ void kasan_report_invalid_free(void *obj
 
 struct page *kasan_addr_to_page(const void *addr);
 
-depot_stack_handle_t kasan_save_stack(gfp_t flags);
+depot_stack_handle_t kasan_save_stack(gfp_t flags, bool can_alloc);
 void kasan_set_track(struct kasan_track *track, gfp_t flags);
 void kasan_set_free_info(struct kmem_cache *cache, void *object, u8 tag);
 struct kasan_track *kasan_get_free_track(struct kmem_cache *cache,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 023/262] kasan: generic: introduce kasan_record_aux_stack_noalloc()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2021-11-05 20:35 ` [patch 022/262] kasan: common: provide can_alloc in kasan_save_stack() Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 024/262] workqueue, kasan: avoid alloc_pages() when recording stack Andrew Morton
                   ` (238 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: kasan: generic: introduce kasan_record_aux_stack_noalloc()

Introduce a variant of kasan_record_aux_stack() that does not do any
memory allocation through stackdepot.  This will permit using it in
contexts that cannot allocate any memory.

Link: https://lkml.kernel.org/r/20210913112609.2651084-6-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kasan.h |    2 ++
 mm/kasan/generic.c    |   14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)

--- a/include/linux/kasan.h~kasan-generic-introduce-kasan_record_aux_stack_noalloc
+++ a/include/linux/kasan.h
@@ -370,12 +370,14 @@ static inline void kasan_unpoison_task_s
 void kasan_cache_shrink(struct kmem_cache *cache);
 void kasan_cache_shutdown(struct kmem_cache *cache);
 void kasan_record_aux_stack(void *ptr);
+void kasan_record_aux_stack_noalloc(void *ptr);
 
 #else /* CONFIG_KASAN_GENERIC */
 
 static inline void kasan_cache_shrink(struct kmem_cache *cache) {}
 static inline void kasan_cache_shutdown(struct kmem_cache *cache) {}
 static inline void kasan_record_aux_stack(void *ptr) {}
+static inline void kasan_record_aux_stack_noalloc(void *ptr) {}
 
 #endif /* CONFIG_KASAN_GENERIC */
 
--- a/mm/kasan/generic.c~kasan-generic-introduce-kasan_record_aux_stack_noalloc
+++ a/mm/kasan/generic.c
@@ -328,7 +328,7 @@ DEFINE_ASAN_SET_SHADOW(f3);
 DEFINE_ASAN_SET_SHADOW(f5);
 DEFINE_ASAN_SET_SHADOW(f8);
 
-void kasan_record_aux_stack(void *addr)
+static void __kasan_record_aux_stack(void *addr, bool can_alloc)
 {
 	struct page *page = kasan_addr_to_page(addr);
 	struct kmem_cache *cache;
@@ -345,7 +345,17 @@ void kasan_record_aux_stack(void *addr)
 		return;
 
 	alloc_meta->aux_stack[1] = alloc_meta->aux_stack[0];
-	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, true);
+	alloc_meta->aux_stack[0] = kasan_save_stack(GFP_NOWAIT, can_alloc);
+}
+
+void kasan_record_aux_stack(void *addr)
+{
+	return __kasan_record_aux_stack(addr, true);
+}
+
+void kasan_record_aux_stack_noalloc(void *addr)
+{
+	return __kasan_record_aux_stack(addr, false);
 }
 
 void kasan_set_free_info(struct kmem_cache *cache,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 024/262] workqueue, kasan: avoid alloc_pages() when recording stack
  2021-11-05 20:34 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2021-11-05 20:35 ` [patch 023/262] kasan: generic: introduce kasan_record_aux_stack_noalloc() Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 025/262] kasan: fix tag for large allocations when using CONFIG_SLAB Andrew Morton
                   ` (237 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, bigeasy, dvyukov, elver, glider, gustavoars,
	jiangshanlai, linux-mm, mm-commits, ryabinin.a.a, skhan,
	tarasmadan, tglx, tj, torvalds, vinmenon, vjitta, walter-zh.wu

From: Marco Elver <elver@google.com>
Subject: workqueue, kasan: avoid alloc_pages() when recording stack

Shuah Khan reported:

 | When CONFIG_PROVE_RAW_LOCK_NESTING=y and CONFIG_KASAN are enabled,
 | kasan_record_aux_stack() runs into "BUG: Invalid wait context" when
 | it tries to allocate memory attempting to acquire spinlock in page
 | allocation code while holding workqueue pool raw_spinlock.
 |
 | There are several instances of this problem when block layer tries
 | to __queue_work(). Call trace from one of these instances is below:
 |
 |     kblockd_mod_delayed_work_on()
 |       mod_delayed_work_on()
 |         __queue_delayed_work()
 |           __queue_work() (rcu_read_lock, raw_spin_lock pool->lock held)
 |             insert_work()
 |               kasan_record_aux_stack()
 |                 kasan_save_stack()
 |                   stack_depot_save()
 |                     alloc_pages()
 |                       __alloc_pages()
 |                         get_page_from_freelist()
 |                           rm_queue()
 |                             rm_queue_pcplist()
 |                               local_lock_irqsave(&pagesets.lock, flags);
 |                               [ BUG: Invalid wait context triggered ]

The default kasan_record_aux_stack() calls stack_depot_save() with
GFP_NOWAIT, which in turn can then call alloc_pages(GFP_NOWAIT, ...).  In
general, however, it is not even possible to use either GFP_ATOMIC or
GFP_NOWAIT in certain non-preemptive contexts, including raw_spin_locks
(see gfp.h and ab00db216c9c7).

Fix it by instructing stackdepot to not expand stack storage via
alloc_pages() in case it runs out by using
kasan_record_aux_stack_noalloc().

While there is an increased risk of failing to insert the stack trace,
this is typically unlikely, especially if the same insertion had already
succeeded previously (stack depot hit).  For frequent calls from the same
location, it therefore becomes extremely unlikely that
kasan_record_aux_stack_noalloc() fails.

Link: https://lkml.kernel.org/r/20210902200134.25603-1-skhan@linuxfoundation.org
Link: https://lkml.kernel.org/r/20210913112609.2651084-7-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Tested-by: Shuah Khan <skhan@linuxfoundation.org>
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Taras Madan <tarasmadan@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vijayanand Jitta <vjitta@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Walter Wu <walter-zh.wu@mediatek.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/workqueue.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/workqueue.c~workqueue-kasan-avoid-alloc_pages-when-recording-stack
+++ a/kernel/workqueue.c
@@ -1350,7 +1350,7 @@ static void insert_work(struct pool_work
 	struct worker_pool *pool = pwq->pool;
 
 	/* record the work call stack in order to print it in KASAN reports */
-	kasan_record_aux_stack(work);
+	kasan_record_aux_stack_noalloc(work);
 
 	/* we own @work, set data and link */
 	set_work_pwq(work, pwq, extra_flags);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 025/262] kasan: fix tag for large allocations when using CONFIG_SLAB
  2021-11-05 20:34 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2021-11-05 20:35 ` [patch 024/262] workqueue, kasan: avoid alloc_pages() when recording stack Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 026/262] kasan: test: add memcpy test that avoids out-of-bounds write Andrew Morton
                   ` (236 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, dvyukov, elver, glider, linux-mm, mm-commits,
	ryabinin.a.a, torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: kasan: fix tag for large allocations when using CONFIG_SLAB

If an object is allocated on a tail page of a multi-page slab, kasan will
get the wrong tag because page->s_mem is NULL for tail pages.  I'm not
quite sure what the user-visible effect of this might be.
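
For illustration (a sketch of the failure mode, not code from the patch;
cache/object and the helpers are the ones used in assign_tag() below):

	/* the object may sit on a tail page of a multi-page slab */
	struct page *tail = virt_to_page(object);       /* tail: s_mem == NULL */
	struct page *head = virt_to_head_page(object);  /* head: slab metadata */
	u8 tag = (u8)obj_to_index(cache, head, (void *)object);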

Link: https://lkml.kernel.org/r/20211001024105.3217339-1-willy@infradead.org
Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kasan/common.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/kasan/common.c~kasan-fix-tag-for-large-allocations-when-using-config_slab
+++ a/mm/kasan/common.c
@@ -298,7 +298,7 @@ static inline u8 assign_tag(struct kmem_
 	/* For caches that either have a constructor or SLAB_TYPESAFE_BY_RCU: */
 #ifdef CONFIG_SLAB
 	/* For SLAB assign tags based on the object index in the freelist. */
-	return (u8)obj_to_index(cache, virt_to_page(object), (void *)object);
+	return (u8)obj_to_index(cache, virt_to_head_page(object), (void *)object);
 #else
 	/*
 	 * For SLUB assign a random tag during slab creation, otherwise reuse
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 026/262] kasan: test: add memcpy test that avoids out-of-bounds write
  2021-11-05 20:34 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2021-11-05 20:35 ` [patch 025/262] kasan: fix tag for large allocations when using CONFIG_SLAB Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:35 ` [patch 027/262] mm/smaps: fix shmem pte hole swap calculation Andrew Morton
                   ` (235 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: akpm, andreyknvl, catalin.marinas, elver, eugenis, glider,
	linux-mm, mark.rutland, mm-commits, pcc, robin.murphy, torvalds,
	will

From: Peter Collingbourne <pcc@google.com>
Subject: kasan: test: add memcpy test that avoids out-of-bounds write

With HW tag-based KASAN, error checks are performed implicitly by the load
and store instructions in the memcpy implementation.  A failed check
results in tag checks being disabled and execution will keep going.  As a
result, under HW tag-based KASAN, prior to commit 1b0668be62cf ("kasan:
test: disable kmalloc_memmove_invalid_size for HW_TAGS"), this memcpy
would end up corrupting memory until it hits an inaccessible page and
causes a kernel panic.

This is a pre-existing issue that was revealed by commit 285133040e6c
("arm64: Import latest memcpy()/memmove() implementation") which changed
the memcpy implementation from using signed comparisons (incorrectly,
resulting in the memcpy being terminated early for negative sizes) to
using unsigned comparisons.

It is unclear how this could be handled by memcpy itself in a reasonable
way.  One possibility would be to add an exception handler that would
force memcpy to return if a tag check fault is detected -- this would make
the behavior roughly similar to generic and SW tag-based KASAN.  However,
this wouldn't solve the problem for asynchronous mode and also makes
memcpy behavior inconsistent with manually copying data.

This test was added as a part of a series that taught KASAN to detect
negative sizes in memory operations, see commit 8cceeff48f23 ("kasan:
detect negative size in memory operation function").  Therefore we should
keep testing for negative sizes with generic and SW tag-based KASAN.  But
there is some value in testing small memcpy overflows, so let's add
another test with memcpy that does not destabilize the kernel by
performing out-of-bounds writes, and run it in all modes.

Link: https://linux-review.googlesource.com/id/I048d1e6a9aff766c4a53f989fb0c83de68923882
Link: https://lkml.kernel.org/r/20210910211356.3603758-1-pcc@google.com
Signed-off-by: Peter Collingbourne <pcc@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_kasan.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/lib/test_kasan.c~kasan-test-add-memcpy-test-that-avoids-out-of-bounds-write
+++ a/lib/test_kasan.c
@@ -493,7 +493,7 @@ static void kmalloc_oob_in_memset(struct
 	kfree(ptr);
 }
 
-static void kmalloc_memmove_invalid_size(struct kunit *test)
+static void kmalloc_memmove_negative_size(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 64;
@@ -515,6 +515,21 @@ static void kmalloc_memmove_invalid_size
 	kfree(ptr);
 }
 
+static void kmalloc_memmove_invalid_size(struct kunit *test)
+{
+	char *ptr;
+	size_t size = 64;
+	volatile size_t invalid_size = size;
+
+	ptr = kmalloc(size, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+
+	memset((char *)ptr, 0, 64);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		memmove((char *)ptr, (char *)ptr + 4, invalid_size));
+	kfree(ptr);
+}
+
 static void kmalloc_uaf(struct kunit *test)
 {
 	char *ptr;
@@ -1129,6 +1144,7 @@ static struct kunit_case kasan_kunit_tes
 	KUNIT_CASE(kmalloc_oob_memset_4),
 	KUNIT_CASE(kmalloc_oob_memset_8),
 	KUNIT_CASE(kmalloc_oob_memset_16),
+	KUNIT_CASE(kmalloc_memmove_negative_size),
 	KUNIT_CASE(kmalloc_memmove_invalid_size),
 	KUNIT_CASE(kmalloc_uaf),
 	KUNIT_CASE(kmalloc_uaf_memset),
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 027/262] mm/smaps: fix shmem pte hole swap calculation
  2021-11-05 20:34 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2021-11-05 20:35 ` [patch 026/262] kasan: test: add memcpy test that avoids out-of-bounds write Andrew Morton
@ 2021-11-05 20:35 ` Andrew Morton
  2021-11-05 20:36 ` [patch 028/262] mm/smaps: use vma->vm_pgoff directly when counting partial swap Andrew Morton
                   ` (234 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:35 UTC (permalink / raw)
  To: aarcange, akpm, hughd, linux-mm, mm-commits, peterx, torvalds,
	vbabka, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm/smaps: fix shmem pte hole swap calculation

Patch series "mm/smaps: Fixes and optimizations on shmem swap handling".


This patch (of 3):

The shmem swap calculation on privately writable mappings is using the
wrong parameters, as spotted by Vlastimil.  Fix them.  The bug was
introduced in commit 48131e03ca4e, when shmem_swap_usage was reworked into
shmem_partial_swap_usage.

Test program:

==================

#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define SIZE_2M (2UL << 20)

int main(void)
{
    char *buffer, *p;
    int i, fd;

    fd = memfd_create("test", 0);
    assert(fd > 0);

    /* isize==2M*3, fill in pages, swap them out */
    ftruncate(fd, SIZE_2M * 3);
    buffer = mmap(NULL, SIZE_2M * 3, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(buffer != MAP_FAILED);
    for (i = 0, p = buffer; i < SIZE_2M * 3 / 4096; i++) {
        *p = 1;
        p += 4096;
    }
    madvise(buffer, SIZE_2M * 3, MADV_PAGEOUT);
    munmap(buffer, SIZE_2M * 3);

    /*
     * Remap with private+writable mappings on part of the inode (<= 2M*3),
     * while the size must also be >= 2M*2 to make sure there's a none pmd so
     * smaps_pte_hole will be triggered.
     */
    buffer = mmap(NULL, SIZE_2M * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    printf("pid=%d, buffer=%p\n", getpid(), buffer);

    /* Check /proc/$PID/smaps_rollup, should see 4MB swap */
    sleep(1000000);
}
==================

Before the patch, smaps_rollup shows <4MB of swap, and the exact number
varies depending on the alignment of the buffer returned by mmap().
After this patch, it shows 4MB.
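
For completeness, a small checker (illustrative only, not part of the
patch) that prints the Swap: line from smaps_rollup for a given pid:

	#include <stdio.h>
	#include <string.h>

	int main(int argc, char **argv)
	{
		char path[64], line[256];
		FILE *f;

		if (argc < 2)
			return 1;
		snprintf(path, sizeof(path), "/proc/%s/smaps_rollup", argv[1]);
		f = fopen(path, "r");
		if (!f)
			return 1;
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "Swap:", 5))
				fputs(line, stdout);  /* expect ~4096 kB after the fix */
		fclose(f);
		return 0;
	}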

Link: https://lkml.kernel.org/r/20210917164756.8586-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210917164756.8586-2-peterx@redhat.com
Fixes: 48131e03ca4e ("mm, proc: reduce cost of /proc/pid/smaps for unpopulated shmem mappings")
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/fs/proc/task_mmu.c~mm-smaps-fix-shmem-pte-hole-swap-calculation
+++ a/fs/proc/task_mmu.c
@@ -478,9 +478,11 @@ static int smaps_pte_hole(unsigned long
 			  __always_unused int depth, struct mm_walk *walk)
 {
 	struct mem_size_stats *mss = walk->private;
+	struct vm_area_struct *vma = walk->vma;
 
-	mss->swap += shmem_partial_swap_usage(
-			walk->vma->vm_file->f_mapping, addr, end);
+	mss->swap += shmem_partial_swap_usage(walk->vma->vm_file->f_mapping,
+					      linear_page_index(vma, addr),
+					      linear_page_index(vma, end));
 
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 028/262] mm/smaps: use vma->vm_pgoff directly when counting partial swap
  2021-11-05 20:34 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2021-11-05 20:35 ` [patch 027/262] mm/smaps: fix shmem pte hole swap calculation Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 029/262] mm/smaps: simplify shmem handling of pte holes Andrew Morton
                   ` (233 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: aarcange, akpm, hughd, linux-mm, mm-commits, peterx, torvalds,
	vbabka, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm/smaps: use vma->vm_pgoff directly when counting partial swap

As it's trying to cover the whole vma anyway, use the vm_pgoff value
directly together with vma_pages() rather than linear_page_index().
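
The equivalence being relied on (a sketch; it holds for non-hugetlb vmas,
where linear_page_index() is a plain offset calculation):

	/*
	 * linear_page_index(vma, vma->vm_start)
	 *   == ((vma->vm_start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff
	 *   == vma->vm_pgoff
	 *
	 * linear_page_index(vma, vma->vm_end)
	 *   == ((vma->vm_end - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff
	 *   == vma->vm_pgoff + vma_pages(vma)
	 */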

Link: https://lkml.kernel.org/r/20210917164756.8586-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/mm/shmem.c~mm-smaps-use-vma-vm_pgoff-directly-when-counting-partial-swap
+++ a/mm/shmem.c
@@ -856,9 +856,8 @@ unsigned long shmem_swap_usage(struct vm
 		return swapped << PAGE_SHIFT;
 
 	/* Here comes the more involved part */
-	return shmem_partial_swap_usage(mapping,
-			linear_page_index(vma, vma->vm_start),
-			linear_page_index(vma, vma->vm_end));
+	return shmem_partial_swap_usage(mapping, vma->vm_pgoff,
+					vma->vm_pgoff + vma_pages(vma));
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 029/262] mm/smaps: simplify shmem handling of pte holes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2021-11-05 20:36 ` [patch 028/262] mm/smaps: use vma->vm_pgoff directly when counting partial swap Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 030/262] mm: debug_vm_pgtable: don't use __P000 directly Andrew Morton
                   ` (232 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: aarcange, akpm, hughd, linux-mm, mm-commits, peterx, torvalds,
	vbabka, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm/smaps: simplify shmem handling of pte holes

Firstly, the check_shmem_swap variable is not actually necessary, because
it is always set together with the pte_hole hook; checking the hook
directly works just as well.

Meanwhile, the check within smaps_pte_entry() is not easy to follow.  E.g.,
the pte_none() check is not needed, as "!pte_present && !is_swap_pte" means
the same thing.  While at it, use the pte_hole() helper rather than
duplicating the page cache lookup.

Still keep the CONFIG_SHMEM part so the code can be optimized to a nop for
!SHMEM.

There will be a very slight functional change in smaps_pte_entry(): for
!SHMEM we'll now return early for pte_none (before checking page==NULL),
but that's even nicer.

Link: https://lkml.kernel.org/r/20210917164756.8586-4-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c |   22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

--- a/fs/proc/task_mmu.c~mm-smaps-simplify-shmem-handling-of-pte-holes
+++ a/fs/proc/task_mmu.c
@@ -397,7 +397,6 @@ struct mem_size_stats {
 	u64 pss_shmem;
 	u64 pss_locked;
 	u64 swap_pss;
-	bool check_shmem_swap;
 };
 
 static void smaps_page_accumulate(struct mem_size_stats *mss,
@@ -490,6 +489,16 @@ static int smaps_pte_hole(unsigned long
 #define smaps_pte_hole		NULL
 #endif /* CONFIG_SHMEM */
 
+static void smaps_pte_hole_lookup(unsigned long addr, struct mm_walk *walk)
+{
+#ifdef CONFIG_SHMEM
+	if (walk->ops->pte_hole) {
+		/* depth is not used */
+		smaps_pte_hole(addr, addr + PAGE_SIZE, 0, walk);
+	}
+#endif
+}
+
 static void smaps_pte_entry(pte_t *pte, unsigned long addr,
 		struct mm_walk *walk)
 {
@@ -518,12 +527,8 @@ static void smaps_pte_entry(pte_t *pte,
 			}
 		} else if (is_pfn_swap_entry(swpent))
 			page = pfn_swap_entry_to_page(swpent);
-	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
-							&& pte_none(*pte))) {
-		page = xa_load(&vma->vm_file->f_mapping->i_pages,
-						linear_page_index(vma, addr));
-		if (xa_is_value(page))
-			mss->swap += PAGE_SIZE;
+	} else {
+		smaps_pte_hole_lookup(addr, walk);
 		return;
 	}
 
@@ -737,8 +742,6 @@ static void smap_gather_stats(struct vm_
 		return;
 
 #ifdef CONFIG_SHMEM
-	/* In case of smaps_rollup, reset the value from previous vma */
-	mss->check_shmem_swap = false;
 	if (vma->vm_file && shmem_mapping(vma->vm_file->f_mapping)) {
 		/*
 		 * For shared or readonly shmem mappings we know that all
@@ -756,7 +759,6 @@ static void smap_gather_stats(struct vm_
 					!(vma->vm_flags & VM_WRITE))) {
 			mss->swap += shmem_swapped;
 		} else {
-			mss->check_shmem_swap = true;
 			ops = &smaps_shmem_walk_ops;
 		}
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 030/262] mm: debug_vm_pgtable: don't use __P000 directly
  2021-11-05 20:34 incoming Andrew Morton
                   ` (28 preceding siblings ...)
  2021-11-05 20:36 ` [patch 029/262] mm/smaps: simplify shmem handling of pte holes Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 031/262] kasan: test: bypass __alloc_size checks Andrew Morton
                   ` (231 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, anshuman.khandual, christophe.leroy, gerald.schaefer,
	gshan, guoren, linux-mm, mm-commits, torvalds

From: Guo Ren <guoren@linux.alibaba.com>
Subject: mm: debug_vm_pgtable: don't use __P000 directly

The __Pxxx/__Sxxx macros are only meant for initializing protection_map[];
all other users in the kernel should go through the protection_map array
itself.

That's because a lot of architectures re-initialize the protection_map[]
array, e.g. x86 (mem_encrypt), m68k (motorola), mips, arm and sparc.

So using __P000 directly is not rigorous.
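
For background (an illustration, not part of the patch): protection_map[]
is indexed by the VM_READ|VM_WRITE|VM_EXEC|VM_SHARED bits of vm_flags, so
entry 0 is the private PROT_NONE protection __P000 seeds and entry 8 the
shared one __S000 seeds:

	/* the array entries survive any later per-architecture
	 * re-initialization; the __P000/__S000 literals do not reflect it
	 */
	pgprot_t prot_none_priv   = protection_map[0];	/* was __P000 */
	pgprot_t prot_none_shared = protection_map[8];	/* was __S000 */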

Link: https://lkml.kernel.org/r/20210924060821.1138281-1-guoren@kernel.org
Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug_vm_pgtable.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/mm/debug_vm_pgtable.c~mm-debug_vm_pgtable-dont-use-__p000-directly
+++ a/mm/debug_vm_pgtable.c
@@ -1104,13 +1104,14 @@ static int __init init_args(struct pgtab
 	/*
 	 * Initialize the debugging data.
 	 *
-	 * __P000 (or even __S000) will help create page table entries with
-	 * PROT_NONE permission as required for pxx_protnone_tests().
+	 * protection_map[0] (or even protection_map[8]) will help create
+	 * page table entries with PROT_NONE permission as required for
+	 * pxx_protnone_tests().
 	 */
 	memset(args, 0, sizeof(*args));
 	args->vaddr              = get_random_vaddr();
 	args->page_prot          = vm_get_page_prot(VMFLAGS);
-	args->page_prot_none     = __P000;
+	args->page_prot_none     = protection_map[0];
 	args->is_contiguous_page = false;
 	args->pud_pfn            = ULONG_MAX;
 	args->pmd_pfn            = ULONG_MAX;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 031/262] kasan: test: bypass __alloc_size checks
  2021-11-05 20:34 incoming Andrew Morton
                   ` (29 preceding siblings ...)
  2021-11-05 20:36 ` [patch 030/262] mm: debug_vm_pgtable: don't use __P000 directly Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 032/262] rapidio: avoid bogus __alloc_size warning Andrew Morton
                   ` (230 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, andreyknvl, dvyukov, glider, keescook, linux-mm,
	mm-commits, ryabinin.a.a, torvalds

From: Kees Cook <keescook@chromium.org>
Subject: kasan: test: bypass __alloc_size checks

Intentional overflows, as performed by the KASAN tests, are detected at
compile time[1] (instead of only at run-time) with the addition of
__alloc_size.  Fix this by preventing the compiler from trusting the size
value after the kmalloc() calls.

[1] https://lore.kernel.org/lkml/20211005184717.65c6d8eb39350395e387b71f@linux-foundation.org
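
The hiding is done with OPTIMIZER_HIDE_VAR(); roughly (a sketch of the
common GCC/Clang definition, see include/linux/compiler.h for the
authoritative one), it is an empty asm that pretends to produce the
variable, so the compiler can no longer track the constant it saw at the
kmalloc() call site:

	/* the value is laundered through a register the compiler cannot
	 * see through, severing the link to the known allocation size
	 */
	#define OPTIMIZER_HIDE_VAR(var) \
		__asm__ ("" : "=r" (var) : "0" (var))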

Link: https://lkml.kernel.org/r/20211006181544.1670992-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_kasan.c        |    8 +++++++-
 lib/test_kasan_module.c |    2 ++
 2 files changed, 9 insertions(+), 1 deletion(-)

--- a/lib/test_kasan.c~kasan-test-bypass-__alloc_size-checks
+++ a/lib/test_kasan.c
@@ -440,6 +440,7 @@ static void kmalloc_oob_memset_2(struct
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	OPTIMIZER_HIDE_VAR(size);
 	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 1, 0, 2));
 	kfree(ptr);
 }
@@ -452,6 +453,7 @@ static void kmalloc_oob_memset_4(struct
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	OPTIMIZER_HIDE_VAR(size);
 	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 3, 0, 4));
 	kfree(ptr);
 }
@@ -464,6 +466,7 @@ static void kmalloc_oob_memset_8(struct
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	OPTIMIZER_HIDE_VAR(size);
 	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 7, 0, 8));
 	kfree(ptr);
 }
@@ -476,6 +479,7 @@ static void kmalloc_oob_memset_16(struct
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	OPTIMIZER_HIDE_VAR(size);
 	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + size - 15, 0, 16));
 	kfree(ptr);
 }
@@ -488,6 +492,7 @@ static void kmalloc_oob_in_memset(struct
 	ptr = kmalloc(size, GFP_KERNEL);
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	OPTIMIZER_HIDE_VAR(size);
 	KUNIT_EXPECT_KASAN_FAIL(test,
 				memset(ptr, 0, size + KASAN_GRANULE_SIZE));
 	kfree(ptr);
@@ -497,7 +502,7 @@ static void kmalloc_memmove_negative_siz
 {
 	char *ptr;
 	size_t size = 64;
-	volatile size_t invalid_size = -2;
+	size_t invalid_size = -2;
 
 	/*
 	 * Hardware tag-based mode doesn't check memmove for negative size.
@@ -510,6 +515,7 @@ static void kmalloc_memmove_negative_siz
 	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	memset((char *)ptr, 0, 64);
+	OPTIMIZER_HIDE_VAR(invalid_size);
 	KUNIT_EXPECT_KASAN_FAIL(test,
 		memmove((char *)ptr, (char *)ptr + 4, invalid_size));
 	kfree(ptr);
--- a/lib/test_kasan_module.c~kasan-test-bypass-__alloc_size-checks
+++ a/lib/test_kasan_module.c
@@ -35,6 +35,8 @@ static noinline void __init copy_user_te
 		return;
 	}
 
+	OPTIMIZER_HIDE_VAR(size);
+
 	pr_info("out-of-bounds in copy_from_user()\n");
 	unused = copy_from_user(kmem, usermem, size + 1);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 032/262] rapidio: avoid bogus __alloc_size warning
  2021-11-05 20:34 incoming Andrew Morton
                   ` (30 preceding siblings ...)
  2021-11-05 20:36 ` [patch 031/262] kasan: test: bypass __alloc_size checks Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 033/262] Compiler Attributes: add __alloc_size() for better bounds checking Andrew Morton
                   ` (229 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: rapidio: avoid bogus __alloc_size warning

Patch series "Add __alloc_size()", v3.

GCC and Clang both use the "alloc_size" attribute to assist with bounds
checking around the use of allocation functions.  Add the attribute,
adjust the Makefile to silence needless warnings, and add the hints to the
allocators where possible.  These changes have been in use for a while now
in GrapheneOS.


This patch (of 8):

After adding __alloc_size attributes to the allocators, GCC 9.3 (but not
later versions) may incorrectly evaluate the arguments to
check_copy_size(), apparently confused by the size returned from
array_size().  Instead, perform the calculation once, which both makes the
code more readable and avoids the bug in GCC.

   In file included from arch/x86/include/asm/preempt.h:7,
                    from include/linux/preempt.h:78,
                    from include/linux/spinlock.h:55,
                    from include/linux/mm_types.h:9,
                    from include/linux/buildid.h:5,
                    from include/linux/module.h:14,
                    from drivers/rapidio/devices/rio_mport_cdev.c:13:
   In function 'check_copy_size',
       inlined from 'copy_from_user' at include/linux/uaccess.h:191:6,
       inlined from 'rio_mport_transfer_ioctl' at drivers/rapidio/devices/rio_mport_cdev.c:983:6:
   include/linux/thread_info.h:213:4: error: call to '__bad_copy_to' declared with attribute error: copy destination size is too small
     213 |    __bad_copy_to();
         |    ^~~~~~~~~~~~~~~

But the allocation size and the copy size are identical:

	transfer = vmalloc(array_size(sizeof(*transfer), transaction.count));
	if (!transfer)
		return -ENOMEM;

	if (unlikely(copy_from_user(transfer,
				    (void __user *)(uintptr_t)transaction.block,
				    array_size(sizeof(*transfer), transaction.count)))) {
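
(For context, a sketch rather than anything from the patch: array_size()
in include/linux/overflow.h saturates to SIZE_MAX instead of wrapping, so
evaluating it twice is functionally harmless - SIZE_MAX appears to be the
value GCC 9.3 mishandles here.)

	/* array_size(a, b) is roughly: */
	size_t bytes;

	if (check_mul_overflow(sizeof(*transfer), (size_t)transaction.count, &bytes))
		bytes = SIZE_MAX;	/* saturate; vmalloc(SIZE_MAX) then simply fails */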

Link: https://lkml.kernel.org/r/20210930222704.2631604-1-keescook@chromium.org
Link: https://lkml.kernel.org/r/20210930222704.2631604-2-keescook@chromium.org
Link: https://lore.kernel.org/linux-mm/202109091134.FHnRmRxu-lkp@intel.com/
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/rapidio/devices/rio_mport_cdev.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/drivers/rapidio/devices/rio_mport_cdev.c~rapidio-avoid-bogus-__alloc_size-warning
+++ a/drivers/rapidio/devices/rio_mport_cdev.c
@@ -965,6 +965,7 @@ static int rio_mport_transfer_ioctl(stru
 	struct rio_transfer_io *transfer;
 	enum dma_data_direction dir;
 	int i, ret = 0;
+	size_t size;
 
 	if (unlikely(copy_from_user(&transaction, arg, sizeof(transaction))))
 		return -EFAULT;
@@ -976,13 +977,14 @@ static int rio_mport_transfer_ioctl(stru
 	     priv->md->properties.transfer_mode) == 0)
 		return -ENODEV;
 
-	transfer = vmalloc(array_size(sizeof(*transfer), transaction.count));
+	size = array_size(sizeof(*transfer), transaction.count);
+	transfer = vmalloc(size);
 	if (!transfer)
 		return -ENOMEM;
 
 	if (unlikely(copy_from_user(transfer,
 				    (void __user *)(uintptr_t)transaction.block,
-				    array_size(sizeof(*transfer), transaction.count)))) {
+				    size))) {
 		ret = -EFAULT;
 		goto out_free;
 	}
@@ -994,8 +996,7 @@ static int rio_mport_transfer_ioctl(stru
 			transaction.sync, dir, &transfer[i]);
 
 	if (unlikely(copy_to_user((void __user *)(uintptr_t)transaction.block,
-				  transfer,
-				  array_size(sizeof(*transfer), transaction.count))))
+				  transfer, size)))
 		ret = -EFAULT;
 
 out_free:
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 033/262] Compiler Attributes: add __alloc_size() for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (31 preceding siblings ...)
  2021-11-05 20:36 ` [patch 032/262] rapidio: avoid bogus __alloc_size warning Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 034/262] slab: clean up function prototypes Andrew Morton
                   ` (228 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: Compiler Attributes: add __alloc_size() for better bounds checking

GCC and Clang can use the "alloc_size" attribute to better inform the
results of __builtin_object_size() (for compile-time constant values). 
Clang can additionally use alloc_size to inform the results of
__builtin_dynamic_object_size() (for run-time values).
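
A sketch of the mechanism (the allocator name here is hypothetical, not
from this series):

	void *my_buf_alloc(size_t n) __alloc_size(1);	/* annotated allocator */

	char *p = my_buf_alloc(16);
	/*
	 * __builtin_object_size(p, 1) now folds to 16 at this call site, so a
	 * fortified memcpy(p, src, 32) can be rejected at compile time; Clang
	 * can do the same for non-constant sizes via
	 * __builtin_dynamic_object_size().
	 */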

Because GCC sees the frequent use of struct_size() as an allocator size
argument, and notices it can return SIZE_MAX (the overflow indication), it
complains about these call sites overflowing (since SIZE_MAX is greater
than the default -Walloc-size-larger-than=PTRDIFF_MAX).  This isn't
helpful since we already know a SIZE_MAX will be caught at run-time (this
was an intentional design).  To deal with this, we must disable this check
as it is both a false positive and redundant.  (Clang does not have this
warning option.)

Unfortunately, just passing -Wno-alloc-size-larger-than is not sufficient
to make the __alloc_size attribute behave correctly under older GCC
versions.  The attribute itself must be disabled in those situations too,
as there appears to be no way to reliably silence the SIZE_MAX constant
expression cases for GCC versions less than 9.1:

In file included from ./include/linux/resource_ext.h:11,
                 from ./include/linux/pci.h:40,
                 from drivers/net/ethernet/intel/ixgbe/ixgbe.h:9,
                 from drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c:4:
In function 'kmalloc_node',
    inlined from 'ixgbe_alloc_q_vector' at ./include/linux/slab.h:743:9:
./include/linux/slab.h:618:9: error: argument 1 value '18446744073709551615' exceeds maximum object size 9223372036854775807 [-Werror=alloc-size-larger-than=]
  return __kmalloc_node(size, flags, node);
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
./include/linux/slab.h: In function 'ixgbe_alloc_q_vector':
./include/linux/slab.h:455:7: note: in a call to allocation function '__kmalloc_node' declared here
 void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_slab_alignment __malloc;
       ^~~~~~~~~~~~~~

Specifically:
-Wno-alloc-size-larger-than is not correctly handled by GCC < 9.1
  https://godbolt.org/z/hqsfG7q84 (doesn't disable)
  https://godbolt.org/z/P9jdrPTYh (doesn't admit to not knowing about option)
  https://godbolt.org/z/465TPMWKb (only warns when other warnings appear)

-Walloc-size-larger-than=18446744073709551615 is not handled by GCC < 8.2
  https://godbolt.org/z/73hh1EPxz (ignores numeric value)

Since anything marked with __alloc_size would also qualify for marking
with __malloc, just include __malloc along with it to avoid redundant
markings. (Suggested by Linus Torvalds.)

Finally, make sure checkpatch.pl doesn't get confused about finding the
__alloc_size attribute on functions. (Thanks to Joe Perches.)
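
The pattern added to checkpatch's $Attribute list matches both the one-
and two-argument forms, e.g. (illustrative prototypes only):

	void *f(size_t size) __alloc_size(1);
	void *g(size_t n, size_t size) __alloc_size(1, 2);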

Link: https://lkml.kernel.org/r/20210930222704.2631604-3-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Makefile                            |   15 +++++++++++++++
 include/linux/compiler-gcc.h        |    8 ++++++++
 include/linux/compiler_attributes.h |   10 ++++++++++
 include/linux/compiler_types.h      |   12 ++++++++++++
 scripts/checkpatch.pl               |    3 ++-
 5 files changed, 47 insertions(+), 1 deletion(-)

--- a/include/linux/compiler_attributes.h~compiler-attributes-add-__alloc_size-for-better-bounds-checking
+++ a/include/linux/compiler_attributes.h
@@ -34,6 +34,15 @@
 #define __aligned_largest               __attribute__((__aligned__))
 
 /*
+ * Note: do not use this directly. Instead, use __alloc_size() since it is conditionally
+ * available and includes other attributes.
+ *
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-alloc_005fsize-function-attribute
+ * clang: https://clang.llvm.org/docs/AttributeReference.html#alloc-size
+ */
+#define __alloc_size__(x, ...)		__attribute__((__alloc_size__(x, ## __VA_ARGS__)))
+
+/*
  * Note: users of __always_inline currently do not write "inline" themselves,
  * which seems to be required by gcc to apply the attribute according
  * to its docs (and also "warning: always_inline function might not be
@@ -153,6 +162,7 @@
 
 /*
  *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-malloc-function-attribute
+ * clang: https://clang.llvm.org/docs/AttributeReference.html#malloc
  */
 #define __malloc                        __attribute__((__malloc__))
 
--- a/include/linux/compiler-gcc.h~compiler-attributes-add-__alloc_size-for-better-bounds-checking
+++ a/include/linux/compiler-gcc.h
@@ -144,3 +144,11 @@
 #else
 #define __diag_GCC_8(s)
 #endif
+
+/*
+ * Prior to 9.1, -Wno-alloc-size-larger-than (and therefore the "alloc_size"
+ * attribute) do not work, and must be disabled.
+ */
+#if GCC_VERSION < 90100
+#undef __alloc_size__
+#endif
--- a/include/linux/compiler_types.h~compiler-attributes-add-__alloc_size-for-better-bounds-checking
+++ a/include/linux/compiler_types.h
@@ -250,6 +250,18 @@ struct ftrace_likely_data {
 # define __cficanonical
 #endif
 
+/*
+ * Any place that could be marked with the "alloc_size" attribute is also
+ * a place to be marked with the "malloc" attribute. Do this as part of the
+ * __alloc_size macro to avoid redundant attributes and to avoid missing a
+ * __malloc marking.
+ */
+#ifdef __alloc_size__
+# define __alloc_size(x, ...)	__alloc_size__(x, ## __VA_ARGS__) __malloc
+#else
+# define __alloc_size(x, ...)	__malloc
+#endif
+
 #ifndef asm_volatile_goto
 #define asm_volatile_goto(x...) asm goto(x)
 #endif
--- a/Makefile~compiler-attributes-add-__alloc_size-for-better-bounds-checking
+++ a/Makefile
@@ -1008,6 +1008,21 @@ ifdef CONFIG_CC_IS_GCC
 KBUILD_CFLAGS += -Wno-maybe-uninitialized
 endif
 
+ifdef CONFIG_CC_IS_GCC
+# The allocators already balk at large sizes, so silence the compiler
+# warnings for bounds checks involving those possible values. While
+# -Wno-alloc-size-larger-than would normally be used here, earlier versions
+# of gcc (<9.1) weirdly don't handle the option correctly when _other_
+# warnings are produced (?!). Using -Walloc-size-larger-than=SIZE_MAX
+# doesn't work (as it is documented to), silently resolving to "0" prior to
+# version 9.1 (and producing an error more recently). Numeric values larger
+# than PTRDIFF_MAX also don't work prior to version 9.1, which are silently
+# ignored, continuing to default to PTRDIFF_MAX. So, left with no other
+# choice, we must perform a versioned check to disable this warning.
+# https://lore.kernel.org/lkml/20210824115859.187f272f@canb.auug.org.au
+KBUILD_CFLAGS += $(call cc-ifversion, -ge, 0901, -Wno-alloc-size-larger-than)
+endif
+
 # disable invalid "can't wrap" optimizations for signed / pointers
 KBUILD_CFLAGS	+= -fno-strict-overflow
 
--- a/scripts/checkpatch.pl~compiler-attributes-add-__alloc_size-for-better-bounds-checking
+++ a/scripts/checkpatch.pl
@@ -489,7 +489,8 @@ our $Attribute	= qr{
 			____cacheline_aligned|
 			____cacheline_aligned_in_smp|
 			____cacheline_internodealigned_in_smp|
-			__weak
+			__weak|
+			__alloc_size\s*\(\s*\d+\s*(?:,\s*\d+\s*)?\)
 		  }x;
 our $Modifier;
 our $Inline	= qr{inline|__always_inline|noinline|__inline|__inline__};
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 034/262] slab: clean up function prototypes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (32 preceding siblings ...)
  2021-11-05 20:36 ` [patch 033/262] Compiler Attributes: add __alloc_size() for better bounds checking Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 035/262] slab: add __alloc_size attributes for better bounds checking Andrew Morton
                   ` (227 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: slab: clean up function prototypes

Based on feedback from Joe Perches and Linus Torvalds, regularize the slab
function prototypes before making attribute changes.

Link: https://lkml.kernel.org/r/20210930222704.2631604-4-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Daniel Micay <danielmicay@gmail.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h |   68 ++++++++++++++++++++---------------------
 1 file changed, 34 insertions(+), 34 deletions(-)

--- a/include/linux/slab.h~slab-clean-up-function-prototypes
+++ a/include/linux/slab.h
@@ -152,8 +152,8 @@ struct kmem_cache *kmem_cache_create_use
 			slab_flags_t flags,
 			unsigned int useroffset, unsigned int usersize,
 			void (*ctor)(void *));
-void kmem_cache_destroy(struct kmem_cache *);
-int kmem_cache_shrink(struct kmem_cache *);
+void kmem_cache_destroy(struct kmem_cache *s);
+int kmem_cache_shrink(struct kmem_cache *s);
 
 /*
  * Please use this macro to create slab caches. Simply specify the
@@ -181,11 +181,11 @@ int kmem_cache_shrink(struct kmem_cache
 /*
  * Common kmalloc functions provided by all allocators
  */
-void * __must_check krealloc(const void *, size_t, gfp_t);
-void kfree(const void *);
-void kfree_sensitive(const void *);
-size_t __ksize(const void *);
-size_t ksize(const void *);
+void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags);
+void kfree(const void *objp);
+void kfree_sensitive(const void *objp);
+size_t __ksize(const void *objp);
+size_t ksize(const void *objp);
 #ifdef CONFIG_PRINTK
 bool kmem_valid_obj(void *object);
 void kmem_dump_obj(void *object);
@@ -426,8 +426,8 @@ static __always_inline unsigned int __km
 #endif /* !CONFIG_SLOB */
 
 void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
-void *kmem_cache_alloc(struct kmem_cache *, gfp_t flags) __assume_slab_alignment __malloc;
-void kmem_cache_free(struct kmem_cache *, void *);
+void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;
+void kmem_cache_free(struct kmem_cache *s, void *objp);
 
 /*
  * Bulk allocation and freeing operations. These are accelerated in an
@@ -436,8 +436,8 @@ void kmem_cache_free(struct kmem_cache *
  *
  * Note that interrupts must be enabled when calling these functions.
  */
-void kmem_cache_free_bulk(struct kmem_cache *, size_t, void **);
-int kmem_cache_alloc_bulk(struct kmem_cache *, gfp_t, size_t, void **);
+void kmem_cache_free_bulk(struct kmem_cache *s, size_t size, void **p);
+int kmem_cache_alloc_bulk(struct kmem_cache *s, gfp_t flags, size_t size, void **p);
 
 /*
  * Caller must not use kfree_bulk() on memory not originally allocated
@@ -450,7 +450,8 @@ static __always_inline void kfree_bulk(s
 
 #ifdef CONFIG_NUMA
 void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
-void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node) __assume_slab_alignment __malloc;
+void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
+									 __malloc;
 #else
 static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
@@ -464,25 +465,24 @@ static __always_inline void *kmem_cache_
 #endif
 
 #ifdef CONFIG_TRACING
-extern void *kmem_cache_alloc_trace(struct kmem_cache *, gfp_t, size_t) __assume_slab_alignment __malloc;
+extern void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
+				   __assume_slab_alignment __malloc;
 
 #ifdef CONFIG_NUMA
-extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
-					   gfp_t gfpflags,
-					   int node, size_t size) __assume_slab_alignment __malloc;
+extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+					 int node, size_t size) __assume_slab_alignment __malloc;
 #else
-static __always_inline void *
-kmem_cache_alloc_node_trace(struct kmem_cache *s,
-			      gfp_t gfpflags,
-			      int node, size_t size)
+static __always_inline void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
+							 gfp_t gfpflags, int node,
+							 size_t size)
 {
 	return kmem_cache_alloc_trace(s, gfpflags, size);
 }
 #endif /* CONFIG_NUMA */
 
 #else /* CONFIG_TRACING */
-static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s,
-		gfp_t flags, size_t size)
+static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t flags,
+						    size_t size)
 {
 	void *ret = kmem_cache_alloc(s, flags);
 
@@ -490,10 +490,8 @@ static __always_inline void *kmem_cache_
 	return ret;
 }
 
-static __always_inline void *
-kmem_cache_alloc_node_trace(struct kmem_cache *s,
-			      gfp_t gfpflags,
-			      int node, size_t size)
+static __always_inline void *kmem_cache_alloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
+							 int node, size_t size)
 {
 	void *ret = kmem_cache_alloc_node(s, gfpflags, node);
 
@@ -502,13 +500,14 @@ kmem_cache_alloc_node_trace(struct kmem_
 }
 #endif /* CONFIG_TRACING */
 
-extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
+extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment
+									 __malloc;
 
 #ifdef CONFIG_TRACING
-extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment __malloc;
+extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
+				__assume_page_alignment __malloc;
 #else
-static __always_inline void *
-kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
+static __always_inline void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
 {
 	return kmalloc_order(size, flags, order);
 }
@@ -638,8 +637,8 @@ static inline void *kmalloc_array(size_t
  * @new_size: new size of a single member of the array
  * @flags: the type of memory to allocate (see kmalloc)
  */
-static __must_check inline void *
-krealloc_array(void *p, size_t new_n, size_t new_size, gfp_t flags)
+static inline void * __must_check krealloc_array(void *p, size_t new_n, size_t new_size,
+						 gfp_t flags)
 {
 	size_t bytes;
 
@@ -668,7 +667,7 @@ static inline void *kcalloc(size_t n, si
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-extern void *__kmalloc_track_caller(size_t, gfp_t, unsigned long);
+extern void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
 
@@ -691,7 +690,8 @@ static inline void *kcalloc_node(size_t
 
 
 #ifdef CONFIG_NUMA
-extern void *__kmalloc_node_track_caller(size_t, gfp_t, int, unsigned long);
+extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
+					 unsigned long caller);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
 			_RET_IP_)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 035/262] slab: add __alloc_size attributes for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2021-11-05 20:36 ` [patch 034/262] slab: clean up function prototypes Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 036/262] mm/kvmalloc: " Andrew Morton
                   ` (226 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: slab: add __alloc_size attributes for better bounds checking

As already done in GrapheneOS, add the __alloc_size attribute for regular
kmalloc interfaces, to provide additional hinting for better bounds
checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.
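
A sketch of the class of defect this lets the toolchain flag (illustrative
only; the exact diagnostic depends on the compiler and on
CONFIG_FORTIFY_SOURCE):

	char *buf = kmalloc(8, GFP_KERNEL);	/* size visible via __alloc_size(1) */

	if (!buf)
		return -ENOMEM;
	memset(buf, 0, 16);	/* can now be flagged at compile time, not just by KASAN */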

Link: https://lkml.kernel.org/r/20210930222704.2631604-5-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h |   61 ++++++++++++++++++++++-------------------
 1 file changed, 33 insertions(+), 28 deletions(-)

--- a/include/linux/slab.h~slab-add-__alloc_size-attributes-for-better-bounds-checking
+++ a/include/linux/slab.h
@@ -181,7 +181,7 @@ int kmem_cache_shrink(struct kmem_cache
 /*
  * Common kmalloc functions provided by all allocators
  */
-void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags);
+void * __must_check krealloc(const void *objp, size_t new_size, gfp_t flags) __alloc_size(2);
 void kfree(const void *objp);
 void kfree_sensitive(const void *objp);
 size_t __ksize(const void *objp);
@@ -425,7 +425,7 @@ static __always_inline unsigned int __km
 #define kmalloc_index(s) __kmalloc_index(s, true)
 #endif /* !CONFIG_SLOB */
 
-void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __malloc;
+void *__kmalloc(size_t size, gfp_t flags) __assume_kmalloc_alignment __alloc_size(1);
 void *kmem_cache_alloc(struct kmem_cache *s, gfp_t flags) __assume_slab_alignment __malloc;
 void kmem_cache_free(struct kmem_cache *s, void *objp);
 
@@ -449,11 +449,12 @@ static __always_inline void kfree_bulk(s
 }
 
 #ifdef CONFIG_NUMA
-void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment __malloc;
+void *__kmalloc_node(size_t size, gfp_t flags, int node) __assume_kmalloc_alignment
+							 __alloc_size(1);
 void *kmem_cache_alloc_node(struct kmem_cache *s, gfp_t flags, int node) __assume_slab_alignment
 									 __malloc;
 #else
-static __always_inline void *__kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *__kmalloc_node(size_t size, gfp_t flags, int node)
 {
 	return __kmalloc(size, flags);
 }
@@ -466,23 +467,23 @@ static __always_inline void *kmem_cache_
 
 #ifdef CONFIG_TRACING
 extern void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t flags, size_t size)
-				   __assume_slab_alignment __malloc;
+				   __assume_slab_alignment __alloc_size(3);
 
 #ifdef CONFIG_NUMA
 extern void *kmem_cache_alloc_node_trace(struct kmem_cache *s, gfp_t gfpflags,
-					 int node, size_t size) __assume_slab_alignment __malloc;
+					 int node, size_t size) __assume_slab_alignment
+								__alloc_size(4);
 #else
-static __always_inline void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
-							 gfp_t gfpflags, int node,
-							 size_t size)
+static __always_inline __alloc_size(4) void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
+						 gfp_t gfpflags, int node, size_t size)
 {
 	return kmem_cache_alloc_trace(s, gfpflags, size);
 }
 #endif /* CONFIG_NUMA */
 
 #else /* CONFIG_TRACING */
-static __always_inline void *kmem_cache_alloc_trace(struct kmem_cache *s, gfp_t flags,
-						    size_t size)
+static __always_inline __alloc_size(3) void *kmem_cache_alloc_trace(struct kmem_cache *s,
+								    gfp_t flags, size_t size)
 {
 	void *ret = kmem_cache_alloc(s, flags);
 
@@ -501,19 +502,20 @@ static __always_inline void *kmem_cache_
 #endif /* CONFIG_TRACING */
 
 extern void *kmalloc_order(size_t size, gfp_t flags, unsigned int order) __assume_page_alignment
-									 __malloc;
+									 __alloc_size(1);
 
 #ifdef CONFIG_TRACING
 extern void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
-				__assume_page_alignment __malloc;
+				__assume_page_alignment __alloc_size(1);
 #else
-static __always_inline void *kmalloc_order_trace(size_t size, gfp_t flags, unsigned int order)
+static __always_inline __alloc_size(1) void *kmalloc_order_trace(size_t size, gfp_t flags,
+								 unsigned int order)
 {
 	return kmalloc_order(size, flags, order);
 }
 #endif
 
-static __always_inline void *kmalloc_large(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *kmalloc_large(size_t size, gfp_t flags)
 {
 	unsigned int order = get_order(size);
 	return kmalloc_order_trace(size, flags, order);
@@ -573,7 +575,7 @@ static __always_inline void *kmalloc_lar
  *	Try really hard to succeed the allocation but fail
  *	eventually.
  */
-static __always_inline void *kmalloc(size_t size, gfp_t flags)
+static __always_inline __alloc_size(1) void *kmalloc(size_t size, gfp_t flags)
 {
 	if (__builtin_constant_p(size)) {
 #ifndef CONFIG_SLOB
@@ -595,7 +597,7 @@ static __always_inline void *kmalloc(siz
 	return __kmalloc(size, flags);
 }
 
-static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)
+static __always_inline __alloc_size(1) void *kmalloc_node(size_t size, gfp_t flags, int node)
 {
 #ifndef CONFIG_SLOB
 	if (__builtin_constant_p(size) &&
@@ -619,7 +621,7 @@ static __always_inline void *kmalloc_nod
  * @size: element size.
  * @flags: the type of memory to allocate (see kmalloc).
  */
-static inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *kmalloc_array(size_t n, size_t size, gfp_t flags)
 {
 	size_t bytes;
 
@@ -637,8 +639,10 @@ static inline void *kmalloc_array(size_t
  * @new_size: new size of a single member of the array
  * @flags: the type of memory to allocate (see kmalloc)
  */
-static inline void * __must_check krealloc_array(void *p, size_t new_n, size_t new_size,
-						 gfp_t flags)
+static inline __alloc_size(2, 3) void * __must_check krealloc_array(void *p,
+								    size_t new_n,
+								    size_t new_size,
+								    gfp_t flags)
 {
 	size_t bytes;
 
@@ -654,7 +658,7 @@ static inline void * __must_check kreall
  * @size: element size.
  * @flags: the type of memory to allocate (see kmalloc).
  */
-static inline void *kcalloc(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *kcalloc(size_t n, size_t size, gfp_t flags)
 {
 	return kmalloc_array(n, size, flags | __GFP_ZERO);
 }
@@ -667,12 +671,13 @@ static inline void *kcalloc(size_t n, si
  * allocator where we care about the real place the memory allocation
  * request comes from.
  */
-extern void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller);
+extern void *__kmalloc_track_caller(size_t size, gfp_t flags, unsigned long caller)
+				   __alloc_size(1);
 #define kmalloc_track_caller(size, flags) \
 	__kmalloc_track_caller(size, flags, _RET_IP_)
 
-static inline void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
-				       int node)
+static inline __alloc_size(1, 2) void *kmalloc_array_node(size_t n, size_t size, gfp_t flags,
+							  int node)
 {
 	size_t bytes;
 
@@ -683,7 +688,7 @@ static inline void *kmalloc_array_node(s
 	return __kmalloc_node(bytes, flags, node);
 }
 
-static inline void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
+static inline __alloc_size(1, 2) void *kcalloc_node(size_t n, size_t size, gfp_t flags, int node)
 {
 	return kmalloc_array_node(n, size, flags | __GFP_ZERO, node);
 }
@@ -691,7 +696,7 @@ static inline void *kcalloc_node(size_t
 
 #ifdef CONFIG_NUMA
 extern void *__kmalloc_node_track_caller(size_t size, gfp_t flags, int node,
-					 unsigned long caller);
+					 unsigned long caller) __alloc_size(1);
 #define kmalloc_node_track_caller(size, flags, node) \
 	__kmalloc_node_track_caller(size, flags, node, \
 			_RET_IP_)
@@ -716,7 +721,7 @@ static inline void *kmem_cache_zalloc(st
  * @size: how many bytes of memory are required.
  * @flags: the type of memory to allocate (see kmalloc).
  */
-static inline void *kzalloc(size_t size, gfp_t flags)
+static inline __alloc_size(1) void *kzalloc(size_t size, gfp_t flags)
 {
 	return kmalloc(size, flags | __GFP_ZERO);
 }
@@ -727,7 +732,7 @@ static inline void *kzalloc(size_t size,
  * @flags: the type of memory to allocate (see kmalloc).
  * @node: memory node from which to allocate
  */
-static inline void *kzalloc_node(size_t size, gfp_t flags, int node)
+static inline __alloc_size(1) void *kzalloc_node(size_t size, gfp_t flags, int node)
 {
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 036/262] mm/kvmalloc: add __alloc_size attributes for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2021-11-05 20:36 ` [patch 035/262] slab: add __alloc_size attributes for better bounds checking Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 037/262] mm/vmalloc: " Andrew Morton
                   ` (225 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: mm/kvmalloc: add __alloc_size attributes for better bounds checking

As already done in GrapheneOS, add the __alloc_size attribute for regular
kvmalloc interfaces, to provide additional hinting for better bounds
checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.

Link: https://lkml.kernel.org/r/20210930222704.2631604-6-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h |   16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/slab.h~mm-kvmalloc-add-__alloc_size-attributes-for-better-bounds-checking
+++ a/include/linux/slab.h
@@ -737,21 +737,21 @@ static inline __alloc_size(1) void *kzal
 	return kmalloc_node(size, flags | __GFP_ZERO, node);
 }
 
-extern void *kvmalloc_node(size_t size, gfp_t flags, int node);
-static inline void *kvmalloc(size_t size, gfp_t flags)
+extern void *kvmalloc_node(size_t size, gfp_t flags, int node) __alloc_size(1);
+static inline __alloc_size(1) void *kvmalloc(size_t size, gfp_t flags)
 {
 	return kvmalloc_node(size, flags, NUMA_NO_NODE);
 }
-static inline void *kvzalloc_node(size_t size, gfp_t flags, int node)
+static inline __alloc_size(1) void *kvzalloc_node(size_t size, gfp_t flags, int node)
 {
 	return kvmalloc_node(size, flags | __GFP_ZERO, node);
 }
-static inline void *kvzalloc(size_t size, gfp_t flags)
+static inline __alloc_size(1) void *kvzalloc(size_t size, gfp_t flags)
 {
 	return kvmalloc(size, flags | __GFP_ZERO);
 }
 
-static inline void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *kvmalloc_array(size_t n, size_t size, gfp_t flags)
 {
 	size_t bytes;
 
@@ -761,13 +761,13 @@ static inline void *kvmalloc_array(size_
 	return kvmalloc(bytes, flags);
 }
 
-static inline void *kvcalloc(size_t n, size_t size, gfp_t flags)
+static inline __alloc_size(1, 2) void *kvcalloc(size_t n, size_t size, gfp_t flags)
 {
 	return kvmalloc_array(n, size, flags | __GFP_ZERO);
 }
 
-extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize,
-		gfp_t flags);
+extern void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
+		      __alloc_size(3);
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 037/262] mm/vmalloc: add __alloc_size attributes for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2021-11-05 20:36 ` [patch 036/262] mm/kvmalloc: " Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 038/262] mm/page_alloc: " Andrew Morton
                   ` (224 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: mm/vmalloc: add __alloc_size attributes for better bounds checking

As already done in GrapheneOS, add the __alloc_size attribute for
appropriate vmalloc allocator interfaces, to provide additional hinting
for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.

Link: https://lkml.kernel.org/r/20210930222704.2631604-7-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-add-__alloc_size-attributes-for-better-bounds-checking
+++ a/include/linux/vmalloc.h
@@ -136,21 +136,21 @@ static inline void vmalloc_init(void)
 static inline unsigned long vmalloc_nr_pages(void) { return 0; }
 #endif
 
-extern void *vmalloc(unsigned long size);
-extern void *vzalloc(unsigned long size);
-extern void *vmalloc_user(unsigned long size);
-extern void *vmalloc_node(unsigned long size, int node);
-extern void *vzalloc_node(unsigned long size, int node);
-extern void *vmalloc_32(unsigned long size);
-extern void *vmalloc_32_user(unsigned long size);
-extern void *__vmalloc(unsigned long size, gfp_t gfp_mask);
+extern void *vmalloc(unsigned long size) __alloc_size(1);
+extern void *vzalloc(unsigned long size) __alloc_size(1);
+extern void *vmalloc_user(unsigned long size) __alloc_size(1);
+extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
+extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
+extern void *vmalloc_32(unsigned long size) __alloc_size(1);
+extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
+extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 extern void *__vmalloc_node_range(unsigned long size, unsigned long align,
 			unsigned long start, unsigned long end, gfp_t gfp_mask,
 			pgprot_t prot, unsigned long vm_flags, int node,
-			const void *caller);
+			const void *caller) __alloc_size(1);
 void *__vmalloc_node(unsigned long size, unsigned long align, gfp_t gfp_mask,
-		int node, const void *caller);
-void *vmalloc_no_huge(unsigned long size);
+		int node, const void *caller) __alloc_size(1);
+void *vmalloc_no_huge(unsigned long size) __alloc_size(1);
 
 extern void vfree(const void *addr);
 extern void vfree_atomic(const void *addr);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 038/262] mm/page_alloc: add __alloc_size attributes for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2021-11-05 20:36 ` [patch 037/262] mm/vmalloc: " Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 039/262] percpu: " Andrew Morton
                   ` (223 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: mm/page_alloc: add __alloc_size attributes for better bounds checking

As already done in GrapheneOS, add the __alloc_size attribute for
appropriate page allocator interfaces, to provide additional hinting for
better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.

Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/gfp.h~mm-page_alloc-add-__alloc_size-attributes-for-better-bounds-checking
+++ a/include/linux/gfp.h
@@ -608,9 +608,9 @@ static inline struct page *alloc_pages(g
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
 
-void *alloc_pages_exact(size_t size, gfp_t gfp_mask);
+void *alloc_pages_exact(size_t size, gfp_t gfp_mask) __alloc_size(1);
 void free_pages_exact(void *virt, size_t size);
-void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
+__meminit void *alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask) __alloc_size(1);
 
 #define __get_free_page(gfp_mask) \
 		__get_free_pages((gfp_mask), 0)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 039/262] percpu: add __alloc_size attributes for better bounds checking
  2021-11-05 20:34 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2021-11-05 20:36 ` [patch 038/262] mm/page_alloc: " Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 040/262] mm/page_ext.c: fix a comment Andrew Morton
                   ` (222 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, alex.bou9, apw, cl, danielmicay, dennis, dwaipayanray1,
	gustavoars, iamjoonsoo.kim, ira.weiny, jhubbard, jingxiangfeng,
	joe, jrdr.linux, keescook, linux-mm, lkp, lukas.bulwahn,
	mm-commits, mporter, nathan, ndesaulniers, ojeda, penberg,
	rdunlap, rientjes, tj, torvalds, vbabka

From: Kees Cook <keescook@chromium.org>
Subject: percpu: add __alloc_size attributes for better bounds checking

As already done in GrapheneOS, add the __alloc_size attribute for
appropriate percpu allocator interfaces, to provide additional hinting for
better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.

Note that due to the implementation of the percpu API, this is unlikely to
ever actually provide compile-time checking beyond very simple non-SMP
builds.  But, since they are technically allocators, mark them as such.
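
A sketch of why (illustrative, not from the patch): the __percpu cookie is
only ever dereferenced through the per-CPU accessors, and on SMP builds
their offset arithmetic hides the allocation size from the compiler's
object-size tracking:

	int cpu = raw_smp_processor_id();
	unsigned long __percpu *ctrs = __alloc_percpu(4 * sizeof(unsigned long),
						      __alignof__(unsigned long));
	unsigned long *p = per_cpu_ptr(ctrs, cpu);	/* per-CPU offset address math */

	p[7] = 0;	/* out of bounds, but the size hint is lost behind the accessor */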

Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Co-developed-by: Daniel Micay <danielmicay@gmail.com>
Signed-off-by: Daniel Micay <danielmicay@gmail.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Joe Perches <joe@perches.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Gustavo A. R. Silva <gustavoars@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jing Xiangfeng <jingxiangfeng@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/percpu.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/percpu.h~percpu-add-__alloc_size-attributes-for-better-bounds-checking
+++ a/include/linux/percpu.h
@@ -123,7 +123,7 @@ extern int __init pcpu_page_first_chunk(
 				pcpu_fc_populate_pte_fn_t populate_pte_fn);
 #endif
 
-extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align);
+extern void __percpu *__alloc_reserved_percpu(size_t size, size_t align) __alloc_size(1);
 extern bool __is_kernel_percpu_address(unsigned long addr, unsigned long *can_addr);
 extern bool is_kernel_percpu_address(unsigned long addr);
 
@@ -131,8 +131,8 @@ extern bool is_kernel_percpu_address(uns
 extern void __init setup_per_cpu_areas(void);
 #endif
 
-extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
-extern void __percpu *__alloc_percpu(size_t size, size_t align);
+extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp) __alloc_size(1);
+extern void __percpu *__alloc_percpu(size_t size, size_t align) __alloc_size(1);
 extern void free_percpu(void __percpu *__pdata);
 extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 040/262] mm/page_ext.c: fix a comment
  2021-11-05 20:34 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2021-11-05 20:36 ` [patch 039/262] percpu: " Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 041/262] mm: stop filemap_read() from grabbing a superfluous page Andrew Morton
                   ` (221 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, vbabka, zhangyinan2019

From: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Subject: mm/page_ext.c: fix a comment

The preceding conditional in mm/page_ext.c is #ifndef CONFIG_SPARSEMEM,
so the comment on the matching #else should read CONFIG_SPARSEMEM rather
than CONFIG_FLATMEM.

Link: https://lkml.kernel.org/r/20211008140312.6492-1-zhangyinan2019@email.szu.edu.cn
Signed-off-by: Yinan Zhang <zhangyinan2019@email.szu.edu.cn>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_ext.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_ext.c~mm-fix-a-comment
+++ a/mm/page_ext.c
@@ -201,7 +201,7 @@ fail:
 	panic("Out of memory");
 }
 
-#else /* CONFIG_FLATMEM */
+#else /* CONFIG_SPARSEMEM */
 
 struct page_ext *lookup_page_ext(const struct page *page)
 {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 041/262] mm: stop filemap_read() from grabbing a superfluous page
  2021-11-05 20:34 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2021-11-05 20:36 ` [patch 040/262] mm/page_ext.c: fix a comment Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 042/262] mm: export bdi_unregister Andrew Morton
                   ` (220 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, dhowells, jlayton, kent.overstreet, linux-mm, mm-commits,
	torvalds, willy

From: David Howells <dhowells@redhat.com>
Subject: mm: stop filemap_read() from grabbing a superfluous page

Under some circumstances, filemap_read() will allocate sufficient pages to
read to the end of the file, call readahead/readpages on them and copy the
data over - and then it will allocate another page at the EOF, call
readpage on that, and ignore the result.  This is unnecessary and a waste
of time and resources.

filemap_read() *does* check for this, but only after it has already done
the allocation and I/O.  Fix this by also checking before calling
filemap_get_pages().

Link: https://lkml.kernel.org/r/163472463105.3126792.7056099385135786492.stgit@warthog.procyon.org.uk
Link: https://lore.kernel.org/r/160588481358.3465195.16552616179674485179.stgit@warthog.procyon.org.uk/
Link: https://lore.kernel.org/r/163456863216.2614702.6384850026368833133.stgit@warthog.procyon.org.uk/
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/filemap.c~mm-stop-filemap_read-from-grabbing-a-superfluous-page
+++ a/mm/filemap.c
@@ -2625,6 +2625,9 @@ ssize_t filemap_read(struct kiocb *iocb,
 		if ((iocb->ki_flags & IOCB_WAITQ) && already_read)
 			iocb->ki_flags |= IOCB_NOWAIT;
 
+		if (unlikely(iocb->ki_pos >= i_size_read(inode)))
+			break;
+
 		error = filemap_get_pages(iocb, iter, &pvec);
 		if (error < 0)
 			break;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 042/262] mm: export bdi_unregister
  2021-11-05 20:34 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2021-11-05 20:36 ` [patch 041/262] mm: stop filemap_read() from grabbing a superfluous page Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 043/262] mtd: call bdi_unregister explicitly Andrew Morton
                   ` (219 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, miquel.raynal, mm-commits, richard,
	torvalds, vigneshr

From: Christoph Hellwig <hch@lst.de>
Subject: mm: export bdi_unregister

Patch series "simplify bdi unregistation".

This series simplifies the BDI code to get rid of the magic
auto-unregister feature that hid a recent block layer refcounting bug.


This patch (of 5):

To wind down the magic auto-unregister semantics we'll need to push this
into modular code.

Link: https://lkml.kernel.org/r/20211021124441.668816-1-hch@lst.de
Link: https://lkml.kernel.org/r/20211021124441.668816-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/backing-dev.c |    1 +
 1 file changed, 1 insertion(+)

--- a/mm/backing-dev.c~mm-export-bdi_unregister
+++ a/mm/backing-dev.c
@@ -958,6 +958,7 @@ void bdi_unregister(struct backing_dev_i
 		bdi->owner = NULL;
 	}
 }
+EXPORT_SYMBOL(bdi_unregister);
 
 static void release_bdi(struct kref *ref)
 {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 043/262] mtd: call bdi_unregister explicitly
  2021-11-05 20:34 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2021-11-05 20:36 ` [patch 042/262] mm: export bdi_unregister Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:36 ` [patch 044/262] fs: explicitly unregister per-superblock BDIs Andrew Morton
                   ` (218 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, miquel.raynal, mm-commits, richard,
	torvalds, vigneshr

From: Christoph Hellwig <hch@lst.de>
Subject: mtd: call bdi_unregister explicitly

Call bdi_unregister explicitly instead of relying on the automatic
unregistration.

Link: https://lkml.kernel.org/r/20211021124441.668816-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/mtd/mtdcore.c |    1 +
 1 file changed, 1 insertion(+)

--- a/drivers/mtd/mtdcore.c~mtd-call-bdi_unregister-explicitly
+++ a/drivers/mtd/mtdcore.c
@@ -2409,6 +2409,7 @@ static void __exit cleanup_mtd(void)
 	if (proc_mtd)
 		remove_proc_entry("mtd", NULL);
 	class_unregister(&mtd_class);
+	bdi_unregister(mtd_bdi);
 	bdi_put(mtd_bdi);
 	idr_destroy(&mtd_idr);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 044/262] fs: explicitly unregister per-superblock BDIs
  2021-11-05 20:34 incoming Andrew Morton
                   ` (42 preceding siblings ...)
  2021-11-05 20:36 ` [patch 043/262] mtd: call bdi_unregister explicitly Andrew Morton
@ 2021-11-05 20:36 ` Andrew Morton
  2021-11-05 20:37 ` [patch 045/262] mm: don't automatically unregister bdis Andrew Morton
                   ` (217 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:36 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, miquel.raynal, mm-commits, richard,
	torvalds, vigneshr

From: Christoph Hellwig <hch@lst.de>
Subject: fs: explicitly unregister per-superblock BDIs

Add a new SB_I_ flag to mark superblocks that have an ephemeral bdi
associated with them, and unregister that bdi when the superblock is shut
down.

Link: https://lkml.kernel.org/r/20211021124441.668816-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/super.c         |    3 +++
 include/linux/fs.h |    1 +
 2 files changed, 4 insertions(+)

--- a/fs/super.c~fs-explicitly-unregister-per-superblock-bdis
+++ a/fs/super.c
@@ -476,6 +476,8 @@ void generic_shutdown_super(struct super
 	spin_unlock(&sb_lock);
 	up_write(&sb->s_umount);
 	if (sb->s_bdi != &noop_backing_dev_info) {
+		if (sb->s_iflags & SB_I_PERSB_BDI)
+			bdi_unregister(sb->s_bdi);
 		bdi_put(sb->s_bdi);
 		sb->s_bdi = &noop_backing_dev_info;
 	}
@@ -1562,6 +1564,7 @@ int super_setup_bdi_name(struct super_bl
 	}
 	WARN_ON(sb->s_bdi != &noop_backing_dev_info);
 	sb->s_bdi = bdi;
+	sb->s_iflags |= SB_I_PERSB_BDI;
 
 	return 0;
 }
--- a/include/linux/fs.h~fs-explicitly-unregister-per-superblock-bdis
+++ a/include/linux/fs.h
@@ -1443,6 +1443,7 @@ extern int send_sigurg(struct fown_struc
 #define SB_I_UNTRUSTED_MOUNTER		0x00000040
 
 #define SB_I_SKIP_SYNC	0x00000100	/* Skip superblock at global sync */
+#define SB_I_PERSB_BDI	0x00000200	/* has a per-sb bdi */
 
 /* Possible states of 'frozen' field */
 enum {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 045/262] mm: don't automatically unregister bdis
  2021-11-05 20:34 incoming Andrew Morton
                   ` (43 preceding siblings ...)
  2021-11-05 20:36 ` [patch 044/262] fs: explicitly unregister per-superblock BDIs Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 046/262] mm: simplify bdi refcounting Andrew Morton
                   ` (216 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, miquel.raynal, mm-commits, richard,
	torvalds, vigneshr

From: Christoph Hellwig <hch@lst.de>
Subject: mm: don't automatically unregister bdis

All BDI users now unregister explicitly, so the automatic unregistration
on the final put can be replaced with a warning.

Link: https://lkml.kernel.org/r/20211021124441.668816-5-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/backing-dev.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/backing-dev.c~mm-dont-automatically-unregister-bdis
+++ a/mm/backing-dev.c
@@ -965,8 +965,7 @@ static void release_bdi(struct kref *ref
 	struct backing_dev_info *bdi =
 			container_of(ref, struct backing_dev_info, refcnt);
 
-	if (test_bit(WB_registered, &bdi->wb.state))
-		bdi_unregister(bdi);
+	WARN_ON_ONCE(test_bit(WB_registered, &bdi->wb.state));
 	WARN_ON_ONCE(bdi->dev);
 	wb_exit(&bdi->wb);
 	kfree(bdi);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 046/262] mm: simplify bdi refcounting
  2021-11-05 20:34 incoming Andrew Morton
                   ` (44 preceding siblings ...)
  2021-11-05 20:37 ` [patch 045/262] mm: don't automatically unregister bdis Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 047/262] mm: don't read i_size of inode unless we need it Andrew Morton
                   ` (215 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, hch, jack, linux-mm, miquel.raynal, mm-commits, richard,
	torvalds, vigneshr

From: Christoph Hellwig <hch@lst.de>
Subject: mm: simplify bdi refcounting

Move grabbing and releasing the bdi refcount out of the common
wb_init/wb_exit helpers and into the code that is only used for the
non-default, memcg-driven bdi_writeback structures.

[hch@lst.de: add comment]
  Link: https://lkml.kernel.org/r/20211027074207.GA12793@lst.de
[akpm@linux-foundation.org: fix typo]
Link: https://lkml.kernel.org/r/20211021124441.668816-6-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Vignesh Raghavendra <vigneshr@ti.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/backing-dev-defs.h |    3 +++
 mm/backing-dev.c                 |   13 +++++--------
 2 files changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/backing-dev-defs.h~mm-simplify-bdi-refcounting
+++ a/include/linux/backing-dev-defs.h
@@ -103,6 +103,9 @@ struct wb_completion {
  * change as blkcg is disabled and enabled higher up in the hierarchy, a wb
  * is tested for blkcg after lookup and removed from index on mismatch so
  * that a new wb for the combination can be created.
+ *
+ * Each bdi_writeback that is not embedded into the backing_dev_info must hold
+ * a reference to the parent backing_dev_info.  See cgwb_create() for details.
  */
 struct bdi_writeback {
 	struct backing_dev_info *bdi;	/* our parent bdi */
--- a/mm/backing-dev.c~mm-simplify-bdi-refcounting
+++ a/mm/backing-dev.c
@@ -291,8 +291,6 @@ static int wb_init(struct bdi_writeback
 
 	memset(wb, 0, sizeof(*wb));
 
-	if (wb != &bdi->wb)
-		bdi_get(bdi);
 	wb->bdi = bdi;
 	wb->last_old_flush = jiffies;
 	INIT_LIST_HEAD(&wb->b_dirty);
@@ -316,7 +314,7 @@ static int wb_init(struct bdi_writeback
 
 	err = fprop_local_init_percpu(&wb->completions, gfp);
 	if (err)
-		goto out_put_bdi;
+		return err;
 
 	for (i = 0; i < NR_WB_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&wb->stat[i], 0, gfp);
@@ -330,9 +328,6 @@ out_destroy_stat:
 	while (i--)
 		percpu_counter_destroy(&wb->stat[i]);
 	fprop_local_destroy_percpu(&wb->completions);
-out_put_bdi:
-	if (wb != &bdi->wb)
-		bdi_put(bdi);
 	return err;
 }
 
@@ -373,8 +368,6 @@ static void wb_exit(struct bdi_writeback
 		percpu_counter_destroy(&wb->stat[i]);
 
 	fprop_local_destroy_percpu(&wb->completions);
-	if (wb != &wb->bdi->wb)
-		bdi_put(wb->bdi);
 }
 
 #ifdef CONFIG_CGROUP_WRITEBACK
@@ -397,6 +390,7 @@ static void cgwb_release_workfn(struct w
 	struct bdi_writeback *wb = container_of(work, struct bdi_writeback,
 						release_work);
 	struct blkcg *blkcg = css_to_blkcg(wb->blkcg_css);
+	struct backing_dev_info *bdi = wb->bdi;
 
 	mutex_lock(&wb->bdi->cgwb_release_mutex);
 	wb_shutdown(wb);
@@ -416,6 +410,7 @@ static void cgwb_release_workfn(struct w
 
 	percpu_ref_exit(&wb->refcnt);
 	wb_exit(wb);
+	bdi_put(bdi);
 	WARN_ON_ONCE(!list_empty(&wb->b_attached));
 	kfree_rcu(wb, rcu);
 }
@@ -497,6 +492,7 @@ static int cgwb_create(struct backing_de
 	INIT_LIST_HEAD(&wb->b_attached);
 	INIT_WORK(&wb->release_work, cgwb_release_workfn);
 	set_bit(WB_registered, &wb->state);
+	bdi_get(bdi);
 
 	/*
 	 * The root wb determines the registered state of the whole bdi and
@@ -528,6 +524,7 @@ static int cgwb_create(struct backing_de
 	goto out_put;
 
 err_fprop_exit:
+	bdi_put(bdi);
 	fprop_local_destroy_percpu(&wb->memcg_completions);
 err_ref_exit:
 	percpu_ref_exit(&wb->refcnt);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 047/262] mm: don't read i_size of inode unless we need it
  2021-11-05 20:34 incoming Andrew Morton
                   ` (45 preceding siblings ...)
  2021-11-05 20:37 ` [patch 046/262] mm: simplify bdi refcounting Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 048/262] mm/filemap.c: remove bogus VM_BUG_ON Andrew Morton
                   ` (214 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, asml.silence, axboe, clm, david, jack, josef, linux-mm,
	mm-commits, torvalds

From: Jens Axboe <axboe@kernel.dk>
Subject: mm: don't read i_size of inode unless we need it

We always go through i_size_read(), and we rarely end up needing it. Push
the read down to where we need to check it, which avoids it for most
cases.

It looks like we can even remove this check entirely, which might be
worth pursuing. But at least this takes it out of the hot path.

Link: https://lkml.kernel.org/r/6b67981f-57d4-c80e-bc07-6020aa601381@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Chris Mason <clm@fb.com>
Cc: Josef Bacik <josef@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Pavel Begunkov <asml.silence@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/filemap.c~mm-dont-read-i_size-of-inode-unless-we-need-it
+++ a/mm/filemap.c
@@ -2740,9 +2740,7 @@ generic_file_read_iter(struct kiocb *ioc
 		struct file *file = iocb->ki_filp;
 		struct address_space *mapping = file->f_mapping;
 		struct inode *inode = mapping->host;
-		loff_t size;
 
-		size = i_size_read(inode);
 		if (iocb->ki_flags & IOCB_NOWAIT) {
 			if (filemap_range_needs_writeback(mapping, iocb->ki_pos,
 						iocb->ki_pos + count - 1))
@@ -2774,8 +2772,9 @@ generic_file_read_iter(struct kiocb *ioc
 		 * the rest of the read.  Buffered reads will not work for
 		 * DAX files, so don't bother trying.
 		 */
-		if (retval < 0 || !count || iocb->ki_pos >= size ||
-		    IS_DAX(inode))
+		if (retval < 0 || !count || IS_DAX(inode))
+			return retval;
+		if (iocb->ki_pos >= i_size_read(inode))
 			return retval;
 	}
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 048/262] mm/filemap.c: remove bogus VM_BUG_ON
  2021-11-05 20:34 incoming Andrew Morton
                   ` (46 preceding siblings ...)
  2021-11-05 20:37 ` [patch 047/262] mm: don't read i_size of inode unless we need it Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 049/262] mm: move more expensive part of XA setup out of mapping check Andrew Morton
                   ` (213 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, hughd, linux-mm, mm-commits, stable,
	syzbot+c87be4f669d920c76330, torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/filemap.c: remove bogus VM_BUG_ON

It is not safe to check page->index without holding the page lock.  It can
be changed if the page is moved between the swap cache and the page cache
for a shmem file, for example.  There is a VM_BUG_ON below which checks
that page->index is correct after taking the page lock.

Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/filemap.c~mm-remove-bogus-vm_bug_on
+++ a/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct addres
 		if (!xa_is_value(page)) {
 			if (page->index < start)
 				goto put;
-			VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
 			if (page->index + thp_nr_pages(page) - 1 > end)
 				goto put;
 			if (!trylock_page(page))
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 049/262] mm: move more expensive part of XA setup out of mapping check
  2021-11-05 20:34 incoming Andrew Morton
                   ` (47 preceding siblings ...)
  2021-11-05 20:37 ` [patch 048/262] mm/filemap.c: remove bogus VM_BUG_ON Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 050/262] mm/gup: further simplify __gup_device_huge() Andrew Morton
                   ` (212 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, axboe, linux-mm, mm-commits, torvalds, willy

From: Jens Axboe <axboe@kernel.dk>
Subject: mm: move more expensive part of XA setup out of mapping check

The fast path here does not need any writeback, yet we spend time setting
up the xarray lookup data upfront.  Move the part that actually needs to
iterate the address space mapping into a separate helper, saving ~30% of
the time here.

Link: https://lkml.kernel.org/r/49f67983-b802-8929-edab-d807f745c9ca@kernel.dk
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |   43 +++++++++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 18 deletions(-)

--- a/mm/filemap.c~mm-move-more-expensive-part-of-xa-setup-out-of-mapping-check
+++ a/mm/filemap.c
@@ -639,6 +639,30 @@ static bool mapping_needs_writeback(stru
 	return mapping->nrpages;
 }
 
+static bool filemap_range_has_writeback(struct address_space *mapping,
+					loff_t start_byte, loff_t end_byte)
+{
+	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
+	pgoff_t max = end_byte >> PAGE_SHIFT;
+	struct page *page;
+
+	if (end_byte < start_byte)
+		return false;
+
+	rcu_read_lock();
+	xas_for_each(&xas, page, max) {
+		if (xas_retry(&xas, page))
+			continue;
+		if (xa_is_value(page))
+			continue;
+		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
+			break;
+	}
+	rcu_read_unlock();
+	return page != NULL;
+
+}
+
 /**
  * filemap_range_needs_writeback - check if range potentially needs writeback
  * @mapping:           address space within which to check
@@ -656,29 +680,12 @@ static bool mapping_needs_writeback(stru
 bool filemap_range_needs_writeback(struct address_space *mapping,
 				   loff_t start_byte, loff_t end_byte)
 {
-	XA_STATE(xas, &mapping->i_pages, start_byte >> PAGE_SHIFT);
-	pgoff_t max = end_byte >> PAGE_SHIFT;
-	struct page *page;
-
 	if (!mapping_needs_writeback(mapping))
 		return false;
 	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY) &&
 	    !mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
 		return false;
-	if (end_byte < start_byte)
-		return false;
-
-	rcu_read_lock();
-	xas_for_each(&xas, page, max) {
-		if (xas_retry(&xas, page))
-			continue;
-		if (xa_is_value(page))
-			continue;
-		if (PageDirty(page) || PageLocked(page) || PageWriteback(page))
-			break;
-	}
-	rcu_read_unlock();
-	return page != NULL;
+	return filemap_range_has_writeback(mapping, start_byte, end_byte);
 }
 EXPORT_SYMBOL_GPL(filemap_range_needs_writeback);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 050/262] mm/gup: further simplify __gup_device_huge()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2021-11-05 20:37 ` [patch 049/262] mm: move more expensive part of XA setup out of mapping check Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 051/262] mm/swapfile: remove needless request_queue NULL pointer check Andrew Morton
                   ` (211 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, imbrenda, jack, jhubbard, kirill.shutemov, linmiaohe,
	linux-mm, mm-commits, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: further simplify __gup_device_huge()

commit 6401c4eb57f9 ("mm: gup: fix potential pgmap refcnt leak in
__gup_device_huge()") simplified the return paths, but didn't go quite far
enough, as discussed in [1].

Remove the "ret" variable entirely, because there is enough information
already available to provide the return value.

[1] https://lore.kernel.org/r/CAHk-=wgQTRX=5SkCmS+zfmpqubGHGJvXX_HgnPG8JSpHKHBMeg@mail.gmail.com

Link: https://lkml.kernel.org/r/20210904004224.86391-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/mm/gup.c~mm-gup-further-simplify-__gup_device_huge
+++ a/mm/gup.c
@@ -2228,7 +2228,6 @@ static int __gup_device_huge(unsigned lo
 {
 	int nr_start = *nr;
 	struct dev_pagemap *pgmap = NULL;
-	int ret = 1;
 
 	do {
 		struct page *page = pfn_to_page(pfn);
@@ -2236,14 +2235,12 @@ static int __gup_device_huge(unsigned lo
 		pgmap = get_dev_pagemap(pfn, pgmap);
 		if (unlikely(!pgmap)) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
-			ret = 0;
 			break;
 		}
 		SetPageReferenced(page);
 		pages[*nr] = page;
 		if (unlikely(!try_grab_page(page, flags))) {
 			undo_dev_pagemap(nr, nr_start, flags, pages);
-			ret = 0;
 			break;
 		}
 		(*nr)++;
@@ -2251,7 +2248,7 @@ static int __gup_device_huge(unsigned lo
 	} while (addr += PAGE_SIZE, addr != end);
 
 	put_dev_pagemap(pgmap);
-	return ret;
+	return addr == end;
 }
 
 static int __gup_device_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 051/262] mm/swapfile: remove needless request_queue NULL pointer check
  2021-11-05 20:34 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2021-11-05 20:37 ` [patch 050/262] mm/gup: further simplify __gup_device_huge() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 052/262] mm/swapfile: fix an integer overflow in swap_show() Andrew Morton
                   ` (210 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, david, linux-mm, mm-commits, torvalds, vulab

From: Xu Wang <vulab@iscas.ac.cn>
Subject: mm/swapfile: remove needless request_queue NULL pointer check

The request_queue pointer returned from bdev_get_queue() shall never be
NULL, so the NULL check is unnecessary; just remove it.

Link: https://lkml.kernel.org/r/20210917082111.33923-1-vulab@iscas.ac.cn
Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/swapfile.c~mm-swapfile-remove-needless-request_queue-null-pointer-check
+++ a/mm/swapfile.c
@@ -3118,7 +3118,7 @@ static bool swap_discardable(struct swap
 {
 	struct request_queue *q = bdev_get_queue(si->bdev);
 
-	if (!q || !blk_queue_discard(q))
+	if (!blk_queue_discard(q))
 		return false;
 
 	return true;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 052/262] mm/swapfile: fix an integer overflow in swap_show()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2021-11-05 20:37 ` [patch 051/262] mm/swapfile: remove needless request_queue NULL pointer check Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 053/262] mm: optimise put_pages_list() Andrew Morton
                   ` (209 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, aquini, hughd, linux-mm, mm-commits, torvalds

From: Rafael Aquini <aquini@redhat.com>
Subject: mm/swapfile: fix an integer overflow in swap_show()

This one is just a minor nuisance for people going through /proc/swaps if
any of their swap areas is bigger than or equal to 1073741824 pages (4TB).

The seq_printf() format string casts the pages-to-KB conversion as an
unsigned int, and that will overflow in the aforementioned case.

Although it is almost unthinkable that someone would actually set up such
a big single swap area, there is a ticket recently filed against RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=2008812

Given that all the other code sites that use format strings for the same
swap pages-to-KB conversion do cast it as an unsigned long, this patch
just follows suit.
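
To make the overflow concrete, here is a minimal userspace sketch for a
64-bit build (illustrative only, not the kernel code; the shift stands in
for the pages-to-KB conversion with 4K pages):

#include <stdio.h>

int main(void)
{
        /* hypothetical swap area of exactly 1073741824 pages (4TB) */
        unsigned long pages = 1UL << 30;
        unsigned long kb = pages << 2;          /* pages-to-KB with 4K pages */

        printf("as uint:  %u\n", (unsigned int)kb);     /* wraps to 0 */
        printf("as ulong: %lu\n", kb);                  /* prints 4294967296 */
        return 0;
}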

Link: https://lkml.kernel.org/r/20211006184011.2579054-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/swapfile.c~mm-swapfile-fix-an-integer-overflow-in-swap_show
+++ a/mm/swapfile.c
@@ -2763,7 +2763,7 @@ static int swap_show(struct seq_file *sw
 	struct swap_info_struct *si = v;
 	struct file *file;
 	int len;
-	unsigned int bytes, inuse;
+	unsigned long bytes, inuse;
 
 	if (si == SEQ_START_TOKEN) {
 		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n");
@@ -2775,7 +2775,7 @@ static int swap_show(struct seq_file *sw
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
-	seq_printf(swap, "%*s%s\t%u\t%s%u\t%s%d\n",
+	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
 			len < 40 ? 40 - len : 1, " ",
 			S_ISBLK(file_inode(file)->i_mode) ?
 				"partition" : "file\t",
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 053/262] mm: optimise put_pages_list()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2021-11-05 20:37 ` [patch 052/262] mm/swapfile: fix an integer overflow in swap_show() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 054/262] mm/memcg: drop swp_entry_t* in mc_handle_file_pte() Andrew Morton
                   ` (208 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, anthony.yznaga, linux-mm, mgorman, mm-commits, torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: optimise put_pages_list()

Instead of calling put_page() one page at a time, pop pages off the list
if their refcount does not drop to zero and pass the remainder to
free_unref_page_list().  This should be a speed improvement, but I have no
measurements to support that.  Current callers do not care about
performance, but I hope to add some which do.
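
The shape of the new loop, as a self-contained userspace sketch (the list
and refcount types below are simplified stand-ins, not the kernel API):
entries whose reference does not drop to zero are unlinked and left to
their remaining owner, everything else stays on the list for one batched
free at the end.

#include <stdlib.h>

struct node {
        int refcount;
        struct node *next;
};

/* drop one reference; true when we dropped the last one */
static int put_testzero(struct node *n)
{
        return --n->refcount == 0;
}

static void free_list(struct node *head)
{
        while (head) {
                struct node *next = head->next;

                free(head);
                head = next;
        }
}

static void put_list(struct node **head)
{
        struct node **link = head;

        while (*link) {
                if (!put_testzero(*link))
                        *link = (*link)->next;  /* still referenced: unlink and skip */
                else
                        link = &(*link)->next;  /* ours to free: keep it listed */
        }
        free_list(*head);                       /* one batched free for the rest */
        *head = NULL;
}

int main(void)
{
        struct node *head = NULL;
        int i;

        for (i = 0; i < 4; i++) {
                struct node *n = malloc(sizeof(*n));

                n->refcount = (i == 2) ? 2 : 1; /* one node stays owned elsewhere */
                n->next = head;
                head = n;
        }
        /* the i == 2 node is unlinked; its other owner would free it later */
        put_list(&head);
        return 0;
}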

Link: https://lkml.kernel.org/r/20211007192138.561673-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |   23 ++++++++++++++++-------
 1 file changed, 16 insertions(+), 7 deletions(-)

--- a/mm/swap.c~mm-optimise-put_pages_list
+++ a/mm/swap.c
@@ -134,18 +134,27 @@ EXPORT_SYMBOL(__put_page);
  * put_pages_list() - release a list of pages
  * @pages: list of pages threaded on page->lru
  *
- * Release a list of pages which are strung together on page.lru.  Currently
- * used by read_cache_pages() and related error recovery code.
+ * Release a list of pages which are strung together on page.lru.
  */
 void put_pages_list(struct list_head *pages)
 {
-	while (!list_empty(pages)) {
-		struct page *victim;
+	struct page *page, *next;
 
-		victim = lru_to_page(pages);
-		list_del(&victim->lru);
-		put_page(victim);
+	list_for_each_entry_safe(page, next, pages, lru) {
+		if (!put_page_testzero(page)) {
+			list_del(&page->lru);
+			continue;
+		}
+		if (PageHead(page)) {
+			list_del(&page->lru);
+			__put_compound_page(page);
+			continue;
+		}
+		/* Cannot be PageLRU because it's passed to us using the lru */
+		__ClearPageWaiters(page);
 	}
+
+	free_unref_page_list(pages);
 }
 EXPORT_SYMBOL(put_pages_list);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 054/262] mm/memcg: drop swp_entry_t* in mc_handle_file_pte()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2021-11-05 20:37 ` [patch 053/262] mm: optimise put_pages_list() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 055/262] memcg: flush stats only if updated Andrew Morton
                   ` (207 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, david, hannes, linux-mm, mhocko, mm-commits, peterx,
	songmuchun, torvalds, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm/memcg: drop swp_entry_t* in mc_handle_file_pte()

After the rework of commit f5df8635c5a3 ("mm: use find_get_incore_page in
memcontrol", 2020-10-13), the swp_entry_t* argument of mc_handle_file_pte()
is unused, so drop it.

Link: https://lkml.kernel.org/r/20210916193014.80129-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-drop-swp_entry_t-in-mc_handle_file_pte
+++ a/mm/memcontrol.c
@@ -5545,7 +5545,7 @@ static struct page *mc_handle_swap_pte(s
 #endif
 
 static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
-			unsigned long addr, pte_t ptent, swp_entry_t *entry)
+			unsigned long addr, pte_t ptent)
 {
 	if (!vma->vm_file) /* anonymous vma */
 		return NULL;
@@ -5718,7 +5718,7 @@ static enum mc_target_type get_mctgt_typ
 	else if (is_swap_pte(ptent))
 		page = mc_handle_swap_pte(vma, ptent, &ent);
 	else if (pte_none(ptent))
-		page = mc_handle_file_pte(vma, addr, ptent, &ent);
+		page = mc_handle_file_pte(vma, addr, ptent);
 
 	if (!page && !ent.val)
 		return ret;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 055/262] memcg: flush stats only if updated
  2021-11-05 20:34 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2021-11-05 20:37 ` [patch 054/262] mm/memcg: drop swp_entry_t* in mc_handle_file_pte() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 056/262] memcg: unify memcg stat flushing Andrew Morton
                   ` (206 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mhocko, mkoutny, mm-commits, shakeelb, torvalds

From: Shakeel Butt <shakeelb@google.com>
Subject: memcg: flush stats only if updated

At the moment, the kernel flushes the memcg stats on every refault and
also on every reclaim iteration.  Although rstat maintains a per-cpu
update tree, on each flush the kernel still has to go through the rstat
update tree of every CPU to check if there is anything to flush.  This
patch adds tracking on the stats update side to make the flush side
smarter, skipping the flush if there has been no update.

The stats update codepath is very sensitive performance wise for many
workloads and benchmarks.  So, we can not follow what commit aa48e47e3906
("memcg: infrastructure to flush memcg stats") did, which was triggering
an async flush through queue_work(); that caused a lot of performance
regression reports and got reverted by commit 1f828223b799 ("memcg: flush
lruvec stats in the refault").

In this patch we keep the stats update codepath very minimal and let the
stats reader side flush the stats only when the updates are over a
specific threshold.  For now the threshold is (nr_cpus * MEMCG_CHARGE_BATCH).

To evaluate the impact of this patch, an 8 GiB tmpfs file was created on
a system with swap-on-zram and pushed to swap through the
memory.force_empty interface.  Reading the whole file back triggers the
memcg stat flush in the refault code path.  With this patch, we observed
a 63% reduction in the read time of the 8 GiB file.
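
A minimal userspace sketch of the idea (the names loosely mirror the
patch, but the constants and the single-threaded driver are assumptions,
not the kernel implementation):

#include <stdatomic.h>
#include <stdio.h>

#define CHARGE_BATCH    64      /* stand-in for MEMCG_CHARGE_BATCH */
#define NR_CPUS         8       /* stand-in for num_online_cpus() */

static _Thread_local unsigned int stats_updates;        /* "per-cpu" counter */
static atomic_int stats_flush_threshold;

static void stat_updated(void)
{
        /* update side stays cheap: an increment, occasionally bump a global */
        if (++stats_updates % CHARGE_BATCH == 0)
                atomic_fetch_add(&stats_flush_threshold, 1);
}

static void maybe_flush(void)
{
        /* reader side flushes only once enough updates have piled up */
        if (atomic_load(&stats_flush_threshold) > NR_CPUS) {
                printf("flushing stats\n");
                atomic_store(&stats_flush_threshold, 0);
        }
}

int main(void)
{
        int i;

        for (i = 0; i < 1000; i++)
                stat_updated();
        maybe_flush();
        return 0;
}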

Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Reviewed-by: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   78 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 55 insertions(+), 23 deletions(-)

--- a/mm/memcontrol.c~memcg-flush-stats-only-if-updated
+++ a/mm/memcontrol.c
@@ -103,11 +103,6 @@ static bool do_memsw_account(void)
 	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap;
 }
 
-/* memcg and lruvec stats flushing */
-static void flush_memcg_stats_dwork(struct work_struct *w);
-static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
-
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
@@ -635,6 +630,56 @@ mem_cgroup_largest_soft_limit_node(struc
 	return mz;
 }
 
+/*
+ * memcg and lruvec stats flushing
+ *
+ * Many codepaths leading to stats update or read are performance sensitive and
+ * adding stats flushing in such codepaths is not desirable. So, to optimize the
+ * flushing the kernel does:
+ *
+ * 1) Periodically and asynchronously flush the stats every 2 seconds to not let
+ *    rstat update tree grow unbounded.
+ *
+ * 2) Flush the stats synchronously on reader side only when there are more than
+ *    (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization
+ *    will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but
+ *    only for 2 seconds due to (1).
+ */
+static void flush_memcg_stats_dwork(struct work_struct *w);
+static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
+static DEFINE_SPINLOCK(stats_flush_lock);
+static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
+
+static inline void memcg_rstat_updated(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	if (!(__this_cpu_inc_return(stats_updates) % MEMCG_CHARGE_BATCH))
+		atomic_inc(&stats_flush_threshold);
+}
+
+static void __mem_cgroup_flush_stats(void)
+{
+	if (!spin_trylock(&stats_flush_lock))
+		return;
+
+	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+	atomic_set(&stats_flush_threshold, 0);
+	spin_unlock(&stats_flush_lock);
+}
+
+void mem_cgroup_flush_stats(void)
+{
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+		__mem_cgroup_flush_stats();
+}
+
+static void flush_memcg_stats_dwork(struct work_struct *w)
+{
+	mem_cgroup_flush_stats();
+	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
+}
+
 /**
  * __mod_memcg_state - update cgroup memory statistics
  * @memcg: the memory cgroup
@@ -647,7 +692,7 @@ void __mod_memcg_state(struct mem_cgroup
 		return;
 
 	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	memcg_rstat_updated(memcg);
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
@@ -675,10 +720,12 @@ void __mod_memcg_lruvec_state(struct lru
 	memcg = pn->memcg;
 
 	/* Update memcg */
-	__mod_memcg_state(memcg, idx, val);
+	__this_cpu_add(memcg->vmstats_percpu->state[idx], val);
 
 	/* Update lruvec */
 	__this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
+
+	memcg_rstat_updated(memcg);
 }
 
 /**
@@ -780,7 +827,7 @@ void __count_memcg_events(struct mem_cgr
 		return;
 
 	__this_cpu_add(memcg->vmstats_percpu->events[idx], count);
-	cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+	memcg_rstat_updated(memcg);
 }
 
 static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
@@ -5341,21 +5388,6 @@ static void mem_cgroup_css_reset(struct
 	memcg_wb_domain_size_changed(memcg);
 }
 
-void mem_cgroup_flush_stats(void)
-{
-	if (!spin_trylock(&stats_flush_lock))
-		return;
-
-	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
-	spin_unlock(&stats_flush_lock);
-}
-
-static void flush_memcg_stats_dwork(struct work_struct *w)
-{
-	mem_cgroup_flush_stats();
-	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, 2UL*HZ);
-}
-
 static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 056/262] memcg: unify memcg stat flushing
  2021-11-05 20:34 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2021-11-05 20:37 ` [patch 055/262] memcg: flush stats only if updated Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 057/262] mm/memcg: remove obsolete memcg_free_kmem() Andrew Morton
                   ` (205 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mhocko, mkoutny, mm-commits, shakeelb, torvalds

From: Shakeel Butt <shakeelb@google.com>
Subject: memcg: unify memcg stat flushing

The memcg stats can be flushed in multiple contexts, potentially in
parallel too.  For example, multiple parallel user space readers of memcg
stats will contend with each other on the rstat locks.  There is no need
for that: we just need one flusher and everyone else can benefit.  In
addition, after commit aa48e47e3906 ("memcg: infrastructure to flush memcg
stats") the kernel periodically flushes the memcg stats from the root, so
the other flushers will potentially have much less work to do.
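
The "one flusher is enough" idea as a small pthread sketch, under the
assumption that contending readers may simply skip the work (an
illustration, not the kernel code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;

static void flush_stats(void)
{
        if (pthread_mutex_trylock(&flush_lock) != 0)
                return;         /* somebody else is already flushing for us */
        /* ... fold the per-cpu deltas into the global stats here ... */
        printf("flushed on behalf of everyone\n");
        pthread_mutex_unlock(&flush_lock);
}

static void *reader(void *arg)
{
        (void)arg;
        flush_stats();          /* every reader asks; at most one flushes at a time */
        return NULL;
}

int main(void)
{
        pthread_t t[4];
        int i;

        for (i = 0; i < 4; i++)
                pthread_create(&t[i], NULL, reader, NULL);
        for (i = 0; i < 4; i++)
                pthread_join(t[i], NULL);
        return 0;
}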

Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Michal Koutný" <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

--- a/mm/memcontrol.c~memcg-unify-memcg-stat-flushing
+++ a/mm/memcontrol.c
@@ -660,12 +660,14 @@ static inline void memcg_rstat_updated(s
 
 static void __mem_cgroup_flush_stats(void)
 {
-	if (!spin_trylock(&stats_flush_lock))
+	unsigned long flag;
+
+	if (!spin_trylock_irqsave(&stats_flush_lock, flag))
 		return;
 
 	cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
 	atomic_set(&stats_flush_threshold, 0);
-	spin_unlock(&stats_flush_lock);
+	spin_unlock_irqrestore(&stats_flush_lock, flag);
 }
 
 void mem_cgroup_flush_stats(void)
@@ -1461,7 +1463,7 @@ static char *memory_stat_format(struct m
 	 *
 	 * Current memory state:
 	 */
-	cgroup_rstat_flush(memcg->css.cgroup);
+	mem_cgroup_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -3565,8 +3567,7 @@ static unsigned long mem_cgroup_usage(st
 	unsigned long val;
 
 	if (mem_cgroup_is_root(memcg)) {
-		/* mem_cgroup_threshold() calls here from irqsafe context */
-		cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
+		mem_cgroup_flush_stats();
 		val = memcg_page_state(memcg, NR_FILE_PAGES) +
 			memcg_page_state(memcg, NR_ANON_MAPPED);
 		if (swap)
@@ -3947,7 +3948,7 @@ static int memcg_numa_stat_show(struct s
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	mem_cgroup_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4019,7 +4020,7 @@ static int memcg_stat_show(struct seq_fi
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	mem_cgroup_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4522,7 +4523,7 @@ void mem_cgroup_wb_stats(struct bdi_writ
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	cgroup_rstat_flush_irqsafe(memcg->css.cgroup);
+	mem_cgroup_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6405,7 +6406,7 @@ static int memory_numa_stat_show(struct
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	cgroup_rstat_flush(memcg->css.cgroup);
+	mem_cgroup_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 057/262] mm/memcg: remove obsolete memcg_free_kmem()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2021-11-05 20:37 ` [patch 056/262] memcg: unify memcg stat flushing Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 058/262] mm/list_lru.c: prefer struct_size over open coded arithmetic Andrew Morton
                   ` (204 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, atomlin, guro, hannes, linux-mm, longman, mhocko,
	mm-commits, shakeelb, songmuchun, torvalds, vbabka, vdavydov.dev

From: Waiman Long <longman@redhat.com>
Subject: mm/memcg: remove obsolete memcg_free_kmem()

Since commit d648bcc7fe65 ("mm: kmem: make memcg_kmem_enabled()
irreversible"), the only thing memcg_free_kmem() does is call
memcg_offline_kmem() when the memcg is still online, which can happen when
online_css() fails due to -ENOMEM.  However, the name memcg_free_kmem() is
confusing, and it is clearer and more straightforward to call
memcg_offline_kmem() directly from mem_cgroup_css_free().

Link: https://lkml.kernel.org/r/20211005202450.11775-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Suggested-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   14 +++-----------
 1 file changed, 3 insertions(+), 11 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-remove-obsolete-memcg_free_kmem
+++ a/mm/memcontrol.c
@@ -3704,13 +3704,6 @@ static void memcg_offline_kmem(struct me
 
 	memcg_free_cache_id(kmemcg_id);
 }
-
-static void memcg_free_kmem(struct mem_cgroup *memcg)
-{
-	/* css_alloc() failed, offlining didn't happen */
-	if (unlikely(memcg->kmem_state == KMEM_ONLINE))
-		memcg_offline_kmem(memcg);
-}
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -3719,9 +3712,6 @@ static int memcg_online_kmem(struct mem_
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
 {
 }
-static void memcg_free_kmem(struct mem_cgroup *memcg)
-{
-}
 #endif /* CONFIG_MEMCG_KMEM */
 
 static int memcg_update_kmem_max(struct mem_cgroup *memcg,
@@ -5356,7 +5346,9 @@ static void mem_cgroup_css_free(struct c
 	cancel_work_sync(&memcg->high_work);
 	mem_cgroup_remove_from_trees(memcg);
 	free_shrinker_info(memcg);
-	memcg_free_kmem(memcg);
+
+	/* Need to offline kmem if online_css() fails */
+	memcg_offline_kmem(memcg);
 	mem_cgroup_free(memcg);
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 058/262] mm/list_lru.c: prefer struct_size over open coded arithmetic
  2021-11-05 20:34 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2021-11-05 20:37 ` [patch 057/262] mm/memcg: remove obsolete memcg_free_kmem() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 059/262] memcg, kmem: further deprecate kmem.limit_in_bytes Andrew Morton
                   ` (203 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, gustavoars, keescook, len.baker, linux-mm, mm-commits, torvalds

From: Len Baker <len.baker@gmx.com>
Subject: mm/list_lru.c: prefer struct_size over open coded arithmetic

As noted in the "Deprecated Interfaces, Language Features, Attributes, and
Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing.  This could lead
to values wrapping around and a smaller allocation being made than the
caller was expecting.  Using those allocations could lead to linear
overflows of heap memory and other misbehaviors.

So, use the struct_size() helper to do the arithmetic instead of the
open-coded "size + count * size" argument in the kvmalloc() calls.

Also, take the opportunity to refactor the memcpy() call to use the
flex_array_size() helper.

This code was detected with the help of Coccinelle and audited and fixed
manually.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#open-coded-arithmetic-in-allocator-arguments
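
To see why the open-coded arithmetic is risky, here is a small userspace
sketch (the struct is a hypothetical stand-in loosely modeled on
struct list_lru_memcg; in the kernel, struct_size() from
<linux/overflow.h> saturates instead of wrapping):

#include <stdio.h>
#include <stdint.h>

struct lrus_like {
        int nr;
        void *lru[];            /* flexible array member */
};

int main(void)
{
        /* a deliberately huge element count makes the multiplication wrap */
        size_t count = SIZE_MAX / sizeof(void *) + 1;
        size_t open_coded = sizeof(struct lrus_like) + count * sizeof(void *);

        /* wraps to a tiny size, so an allocation would "succeed" undersized */
        printf("open-coded size: %zu\n", open_coded);
        return 0;
}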

Link: https://lkml.kernel.org/r/20211017105929.9284-1-len.baker@gmx.com
Signed-off-by: Len Baker <len.baker@gmx.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/list_lru.c~mm-list_lruc-prefer-struct_size-over-open-coded-arithmetic
+++ a/mm/list_lru.c
@@ -354,8 +354,7 @@ static int memcg_init_list_lru_node(stru
 	struct list_lru_memcg *memcg_lrus;
 	int size = memcg_nr_cache_ids;
 
-	memcg_lrus = kvmalloc(sizeof(*memcg_lrus) +
-			      size * sizeof(void *), GFP_KERNEL);
+	memcg_lrus = kvmalloc(struct_size(memcg_lrus, lru, size), GFP_KERNEL);
 	if (!memcg_lrus)
 		return -ENOMEM;
 
@@ -389,7 +388,7 @@ static int memcg_update_list_lru_node(st
 
 	old = rcu_dereference_protected(nlru->memcg_lrus,
 					lockdep_is_held(&list_lrus_mutex));
-	new = kvmalloc(sizeof(*new) + new_size * sizeof(void *), GFP_KERNEL);
+	new = kvmalloc(struct_size(new, lru, new_size), GFP_KERNEL);
 	if (!new)
 		return -ENOMEM;
 
@@ -398,7 +397,7 @@ static int memcg_update_list_lru_node(st
 		return -ENOMEM;
 	}
 
-	memcpy(&new->lru, &old->lru, old_size * sizeof(void *));
+	memcpy(&new->lru, &old->lru, flex_array_size(new, lru, old_size));
 
 	/*
 	 * The locking below allows readers that hold nlru->lock avoid taking
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 059/262] memcg, kmem: further deprecate kmem.limit_in_bytes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2021-11-05 20:37 ` [patch 058/262] mm/list_lru.c: prefer struct_size over open coded arithmetic Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 060/262] mm: list_lru: remove holding lru lock Andrew Morton
                   ` (202 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, arnd, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, vvs

From: Shakeel Butt <shakeelb@google.com>
Subject: memcg, kmem: further deprecate kmem.limit_in_bytes

The deprecation process of kmem.limit_in_bytes started with commit
0158115f702 ("memcg, kmem: deprecate kmem.limit_in_bytes"), which also
explains the motivation behind the deprecation in detail.  To summarize,
it is the unexpected behavior on hitting the kmem limit.  This patch moves
the deprecation process to the next stage by disallowing setting the kmem
limit at all.  In the future we might remove the kmem.limit_in_bytes file
completely.

[akpm@linux-foundation.org: s/ENOTSUPP/EOPNOTSUPP/]
[arnd@arndb.de: mark cancel_charge() inline]
  Link: https://lkml.kernel.org/r/20211022070542.679839-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20211019153408.2916808-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/cgroup-v1/memory.rst |   11 ----
 mm/memcontrol.c                                |   39 +--------------
 2 files changed, 7 insertions(+), 43 deletions(-)

--- a/Documentation/admin-guide/cgroup-v1/memory.rst~memcg-kmem-further-deprecate-kmemlimit_in_bytes
+++ a/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -87,10 +87,8 @@ Brief summary of control files.
  memory.oom_control		     set/show oom controls.
  memory.numa_stat		     show the number of memory usage per numa
 				     node
- memory.kmem.limit_in_bytes          set/show hard limit for kernel memory
-                                     This knob is deprecated and shouldn't be
-                                     used. It is planned that this be removed in
-                                     the foreseeable future.
+ memory.kmem.limit_in_bytes          This knob is deprecated and writing to
+                                     it will return -ENOTSUPP.
  memory.kmem.usage_in_bytes          show current kernel memory allocation
  memory.kmem.failcnt                 show the number of kernel memory usage
 				     hits limits
@@ -518,11 +516,6 @@ will be charged as a new owner of it.
   charged file caches. Some out-of-use page caches may keep charged until
   memory pressure happens. If you want to avoid that, force_empty will be useful.
 
-  Also, note that when memory.kmem.limit_in_bytes is set the charges due to
-  kernel pages will still be seen. This is not considered a failure and the
-  write will still return success. In this case, it is expected that
-  memory.kmem.usage_in_bytes == memory.usage_in_bytes.
-
 5.2 stat file
 -------------
 
--- a/mm/memcontrol.c~memcg-kmem-further-deprecate-kmemlimit_in_bytes
+++ a/mm/memcontrol.c
@@ -2771,8 +2771,7 @@ static inline int try_charge(struct mem_
 	return try_charge_memcg(memcg, gfp_mask, nr_pages);
 }
 
-#if defined(CONFIG_MEMCG_KMEM) || defined(CONFIG_MMU)
-static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
+static inline void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
 {
 	if (mem_cgroup_is_root(memcg))
 		return;
@@ -2781,7 +2780,6 @@ static void cancel_charge(struct mem_cgr
 	if (do_memsw_account())
 		page_counter_uncharge(&memcg->memsw, nr_pages);
 }
-#endif
 
 static void commit_charge(struct page *page, struct mem_cgroup *memcg)
 {
@@ -3000,7 +2998,6 @@ static void obj_cgroup_uncharge_pages(st
 static int obj_cgroup_charge_pages(struct obj_cgroup *objcg, gfp_t gfp,
 				   unsigned int nr_pages)
 {
-	struct page_counter *counter;
 	struct mem_cgroup *memcg;
 	int ret;
 
@@ -3010,21 +3007,8 @@ static int obj_cgroup_charge_pages(struc
 	if (ret)
 		goto out;
 
-	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
-	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
-
-		/*
-		 * Enforce __GFP_NOFAIL allocation because callers are not
-		 * prepared to see failures and likely do not have any failure
-		 * handling code.
-		 */
-		if (gfp & __GFP_NOFAIL) {
-			page_counter_charge(&memcg->kmem, nr_pages);
-			goto out;
-		}
-		cancel_charge(memcg, nr_pages);
-		ret = -ENOMEM;
-	}
+	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		page_counter_charge(&memcg->kmem, nr_pages);
 out:
 	css_put(&memcg->css);
 
@@ -3714,17 +3698,6 @@ static void memcg_offline_kmem(struct me
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-static int memcg_update_kmem_max(struct mem_cgroup *memcg,
-				 unsigned long max)
-{
-	int ret;
-
-	mutex_lock(&memcg_max_mutex);
-	ret = page_counter_set_max(&memcg->kmem, max);
-	mutex_unlock(&memcg_max_mutex);
-	return ret;
-}
-
 static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
 {
 	int ret;
@@ -3790,10 +3763,8 @@ static ssize_t mem_cgroup_write(struct k
 			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
 			break;
 		case _KMEM:
-			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
-				     "Please report your usecase to linux-mm@kvack.org if you "
-				     "depend on this functionality.\n");
-			ret = memcg_update_kmem_max(memcg, nr_pages);
+			/* kmem.limit_in_bytes is deprecated. */
+			ret = -EOPNOTSUPP;
 			break;
 		case _TCP:
 			ret = memcg_update_tcp_max(memcg, nr_pages);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 060/262] mm: list_lru: remove holding lru lock
  2021-11-05 20:34 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2021-11-05 20:37 ` [patch 059/262] memcg, kmem: further deprecate kmem.limit_in_bytes Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 061/262] mm: list_lru: fix the return value of list_lru_count_one() Andrew Morton
                   ` (201 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: list_lru: remove holding lru lock

Since commit e5bc3af7734f ("rcu: Consolidate PREEMPT and !PREEMPT
synchronize_rcu()"), a spinlock critical section can serve as an RCU
read-side critical section, which already allows readers that hold
nlru->lock to avoid taking the RCU read lock.  So just remove the
now-redundant locking around the rcu_assign_pointer() update.

Link: https://lkml.kernel.org/r/20211025124534.56345-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |   11 -----------
 1 file changed, 11 deletions(-)

--- a/mm/list_lru.c~mm-list_lru-remove-holding-lru-lock
+++ a/mm/list_lru.c
@@ -398,18 +398,7 @@ static int memcg_update_list_lru_node(st
 	}
 
 	memcpy(&new->lru, &old->lru, flex_array_size(new, lru, old_size));
-
-	/*
-	 * The locking below allows readers that hold nlru->lock avoid taking
-	 * rcu_read_lock (see list_lru_from_memcg_idx).
-	 *
-	 * Since list_lru_{add,del} may be called under an IRQ-safe lock,
-	 * we have to use IRQ-safe primitives here to avoid deadlock.
-	 */
-	spin_lock_irq(&nlru->lock);
 	rcu_assign_pointer(nlru->memcg_lrus, new);
-	spin_unlock_irq(&nlru->lock);
-
 	kvfree_rcu(old, rcu);
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 061/262] mm: list_lru: fix the return value of list_lru_count_one()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2021-11-05 20:37 ` [patch 060/262] mm: list_lru: remove holding lru lock Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 062/262] mm: memcontrol: remove kmemcg_id reparenting Andrew Morton
                   ` (200 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: list_lru: fix the return value of list_lru_count_one()

Since commit 2788cf0c401c ("memcg: reparent list_lrus and free kmemcg_id
on css offline"), ->nr_items can be negative during memory cgroup
reparenting.  In this case, list_lru_count_one() will return an unusual
and huge value, which can surprise users.  At least for now it hasn't
affected any users, but it is better to let list_lru_count_one() return
zero when ->nr_items is negative.
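
The "unusual and huge value" is simply a transiently negative counter read
back through an unsigned type; a tiny userspace illustration with assumed
values (not the kernel code):

#include <stdio.h>

int main(void)
{
        long nr_items = -5;     /* can go transiently negative while reparenting */

        /* returned through an unsigned long, the caller sees a huge count */
        printf("unclamped: %lu\n", (unsigned long)nr_items);
        /* clamping negatives to zero, as the patch does, avoids the surprise */
        printf("clamped:   %ld\n", nr_items < 0 ? 0 : nr_items);
        return 0;
}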

Link: https://lkml.kernel.org/r/20211025124910.56433-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/list_lru.c~mm-list_lru-fix-the-return-value-of-list_lru_count_one
+++ a/mm/list_lru.c
@@ -176,13 +176,16 @@ unsigned long list_lru_count_one(struct
 {
 	struct list_lru_node *nlru = &lru->node[nid];
 	struct list_lru_one *l;
-	unsigned long count;
+	long count;
 
 	rcu_read_lock();
 	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
 	count = READ_ONCE(l->nr_items);
 	rcu_read_unlock();
 
+	if (unlikely(count < 0))
+		count = 0;
+
 	return count;
 }
 EXPORT_SYMBOL_GPL(list_lru_count_one);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 062/262] mm: memcontrol: remove kmemcg_id reparenting
  2021-11-05 20:34 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2021-11-05 20:37 ` [patch 061/262] mm: list_lru: fix the return value of list_lru_count_one() Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 063/262] mm: memcontrol: remove the kmem states Andrew Morton
                   ` (199 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memcontrol: remove kmemcg_id reparenting

Since slab objects and kmem pages are charged to the object cgroup instead
of the memory cgroup, memcg_reparent_objcgs() will reparent this cgroup and
all its descendants to its parent cgroup.  This already makes further
list_lru_add() calls add elements to the parent's list.  So it is
unnecessary to change the kmemcg_id of an offline cgroup to its parent's
id; it just wastes CPU cycles.  Just remove that redundant code.

Link: https://lkml.kernel.org/r/20211025125102.56533-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-remove-kmemcg_id-reparenting
+++ a/mm/memcontrol.c
@@ -3650,8 +3650,7 @@ static int memcg_online_kmem(struct mem_
 
 static void memcg_offline_kmem(struct mem_cgroup *memcg)
 {
-	struct cgroup_subsys_state *css;
-	struct mem_cgroup *parent, *child;
+	struct mem_cgroup *parent;
 	int kmemcg_id;
 
 	if (memcg->kmem_state != KMEM_ONLINE)
@@ -3669,21 +3668,11 @@ static void memcg_offline_kmem(struct me
 	BUG_ON(kmemcg_id < 0);
 
 	/*
-	 * Change kmemcg_id of this cgroup and all its descendants to the
-	 * parent's id, and then move all entries from this cgroup's list_lrus
-	 * to ones of the parent. After we have finished, all list_lrus
-	 * corresponding to this cgroup are guaranteed to remain empty. The
-	 * ordering is imposed by list_lru_node->lock taken by
+	 * After we have finished memcg_reparent_objcgs(), all list_lrus
+	 * corresponding to this cgroup are guaranteed to remain empty.
+	 * The ordering is imposed by list_lru_node->lock taken by
 	 * memcg_drain_all_list_lrus().
 	 */
-	rcu_read_lock(); /* can be called from css_free w/o cgroup_mutex */
-	css_for_each_descendant_pre(css, &memcg->css) {
-		child = mem_cgroup_from_css(css);
-		BUG_ON(child->kmemcg_id != kmemcg_id);
-		child->kmemcg_id = parent->kmemcg_id;
-	}
-	rcu_read_unlock();
-
 	memcg_drain_all_list_lrus(kmemcg_id, parent);
 
 	memcg_free_cache_id(kmemcg_id);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 063/262] mm: memcontrol: remove the kmem states
  2021-11-05 20:34 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2021-11-05 20:37 ` [patch 062/262] mm: memcontrol: remove kmemcg_id reparenting Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:37 ` [patch 064/262] mm: list_lru: only add memcg-aware lrus to the global lru list Andrew Morton
                   ` (198 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memcontrol: remove the kmem states

Now the kmem state is only used to indicate whether kmem is offline. 
However, we can just as well set ->kmemcg_id to -1 to indicate that, so
the kmem state can be removed to simplify the code.

Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    7 -------
 mm/memcontrol.c            |    7 ++-----
 2 files changed, 2 insertions(+), 12 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcontrol-remove-the-kmem-states
+++ a/include/linux/memcontrol.h
@@ -180,12 +180,6 @@ struct mem_cgroup_thresholds {
 	struct mem_cgroup_threshold_ary *spare;
 };
 
-enum memcg_kmem_state {
-	KMEM_NONE,
-	KMEM_ALLOCATED,
-	KMEM_ONLINE,
-};
-
 #if defined(CONFIG_SMP)
 struct memcg_padding {
 	char x[0];
@@ -318,7 +312,6 @@ struct mem_cgroup {
 
 #ifdef CONFIG_MEMCG_KMEM
 	int kmemcg_id;
-	enum memcg_kmem_state kmem_state;
 	struct obj_cgroup __rcu *objcg;
 	struct list_head objcg_list; /* list of inherited objcgs */
 #endif
--- a/mm/memcontrol.c~mm-memcontrol-remove-the-kmem-states
+++ a/mm/memcontrol.c
@@ -3626,7 +3626,6 @@ static int memcg_online_kmem(struct mem_
 		return 0;
 
 	BUG_ON(memcg->kmemcg_id >= 0);
-	BUG_ON(memcg->kmem_state);
 
 	memcg_id = memcg_alloc_cache_id();
 	if (memcg_id < 0)
@@ -3643,7 +3642,6 @@ static int memcg_online_kmem(struct mem_
 	static_branch_enable(&memcg_kmem_enabled_key);
 
 	memcg->kmemcg_id = memcg_id;
-	memcg->kmem_state = KMEM_ONLINE;
 
 	return 0;
 }
@@ -3653,11 +3651,9 @@ static void memcg_offline_kmem(struct me
 	struct mem_cgroup *parent;
 	int kmemcg_id;
 
-	if (memcg->kmem_state != KMEM_ONLINE)
+	if (memcg->kmemcg_id == -1)
 		return;
 
-	memcg->kmem_state = KMEM_ALLOCATED;
-
 	parent = parent_mem_cgroup(memcg);
 	if (!parent)
 		parent = root_mem_cgroup;
@@ -3676,6 +3672,7 @@ static void memcg_offline_kmem(struct me
 	memcg_drain_all_list_lrus(kmemcg_id, parent);
 
 	memcg_free_cache_id(kmemcg_id);
+	memcg->kmemcg_id = -1;
 }
 #else
 static int memcg_online_kmem(struct mem_cgroup *memcg)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 064/262] mm: list_lru: only add memcg-aware lrus to the global lru list
  2021-11-05 20:34 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2021-11-05 20:37 ` [patch 063/262] mm: memcontrol: remove the kmem states Andrew Morton
@ 2021-11-05 20:37 ` Andrew Morton
  2021-11-05 20:38 ` [patch 065/262] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks Andrew Morton
                   ` (197 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:37 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: list_lru: only add memcg-aware lrus to the global lru list

Non-memcg-aware lrus are always skipped when traversing the global lru
list, which is not efficient.  Instead, only add memcg-aware lrus to the
global lru list to make the traversal more efficient.

Link: https://lkml.kernel.org/r/20211025124353.55781-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/list_lru.c |   35 ++++++++++++++++-------------------
 1 file changed, 16 insertions(+), 19 deletions(-)

--- a/mm/list_lru.c~mm-list_lru-only-add-memcg-aware-lrus-to-the-global-lru-list
+++ a/mm/list_lru.c
@@ -15,18 +15,29 @@
 #include "slab.h"
 
 #ifdef CONFIG_MEMCG_KMEM
-static LIST_HEAD(list_lrus);
+static LIST_HEAD(memcg_list_lrus);
 static DEFINE_MUTEX(list_lrus_mutex);
 
+static inline bool list_lru_memcg_aware(struct list_lru *lru)
+{
+	return lru->memcg_aware;
+}
+
 static void list_lru_register(struct list_lru *lru)
 {
+	if (!list_lru_memcg_aware(lru))
+		return;
+
 	mutex_lock(&list_lrus_mutex);
-	list_add(&lru->list, &list_lrus);
+	list_add(&lru->list, &memcg_list_lrus);
 	mutex_unlock(&list_lrus_mutex);
 }
 
 static void list_lru_unregister(struct list_lru *lru)
 {
+	if (!list_lru_memcg_aware(lru))
+		return;
+
 	mutex_lock(&list_lrus_mutex);
 	list_del(&lru->list);
 	mutex_unlock(&list_lrus_mutex);
@@ -37,11 +48,6 @@ static int lru_shrinker_id(struct list_l
 	return lru->shrinker_id;
 }
 
-static inline bool list_lru_memcg_aware(struct list_lru *lru)
-{
-	return lru->memcg_aware;
-}
-
 static inline struct list_lru_one *
 list_lru_from_memcg_idx(struct list_lru_node *nlru, int idx)
 {
@@ -457,9 +463,6 @@ static int memcg_update_list_lru(struct
 {
 	int i;
 
-	if (!list_lru_memcg_aware(lru))
-		return 0;
-
 	for_each_node(i) {
 		if (memcg_update_list_lru_node(&lru->node[i],
 					       old_size, new_size))
@@ -482,9 +485,6 @@ static void memcg_cancel_update_list_lru
 {
 	int i;
 
-	if (!list_lru_memcg_aware(lru))
-		return;
-
 	for_each_node(i)
 		memcg_cancel_update_list_lru_node(&lru->node[i],
 						  old_size, new_size);
@@ -497,7 +497,7 @@ int memcg_update_all_list_lrus(int new_s
 	int old_size = memcg_nr_cache_ids;
 
 	mutex_lock(&list_lrus_mutex);
-	list_for_each_entry(lru, &list_lrus, list) {
+	list_for_each_entry(lru, &memcg_list_lrus, list) {
 		ret = memcg_update_list_lru(lru, old_size, new_size);
 		if (ret)
 			goto fail;
@@ -506,7 +506,7 @@ out:
 	mutex_unlock(&list_lrus_mutex);
 	return ret;
 fail:
-	list_for_each_entry_continue_reverse(lru, &list_lrus, list)
+	list_for_each_entry_continue_reverse(lru, &memcg_list_lrus, list)
 		memcg_cancel_update_list_lru(lru, old_size, new_size);
 	goto out;
 }
@@ -543,9 +543,6 @@ static void memcg_drain_list_lru(struct
 {
 	int i;
 
-	if (!list_lru_memcg_aware(lru))
-		return;
-
 	for_each_node(i)
 		memcg_drain_list_lru_node(lru, i, src_idx, dst_memcg);
 }
@@ -555,7 +552,7 @@ void memcg_drain_all_list_lrus(int src_i
 	struct list_lru *lru;
 
 	mutex_lock(&list_lrus_mutex);
-	list_for_each_entry(lru, &list_lrus, list)
+	list_for_each_entry(lru, &memcg_list_lrus, list)
 		memcg_drain_list_lru(lru, src_idx, dst_memcg);
 	mutex_unlock(&list_lrus_mutex);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 065/262] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
  2021-11-05 20:34 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2021-11-05 20:37 ` [patch 064/262] mm: list_lru: only add memcg-aware lrus to the global lru list Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 066/262] mm, oom: do not trigger out_of_memory from the #PF Andrew Morton
                   ` (196 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mgorman, mhocko, mm-commits,
	penguin-kernel, shakeelb, stable, torvalds, urezki, vbabka,
	vdavydov.dev, vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks

Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit.  This can be misused to trigger a global OOM from inside a
memcg-limited container.  On the other hand, if memcg fails an allocation
called from inside the #PF handler, it triggers a global OOM from inside
pagefault_out_of_memory().

To prevent these problems this patchset:
a) removes execution of out_of_memory() from pagefault_out_of_memory(),
   because nobody can explain why it is necessary;
b) allows memcg to fail allocations of dying/killed tasks.


This patch (of 3):

Any allocation failure during the #PF path will return with VM_FAULT_OOM,
which in turn results in pagefault_out_of_memory(), which in turn executes
out_of_memory() and can kill a random task.
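
For illustration, the path in question looks roughly like this (a simplified
sketch of an arch page fault handler, not verbatim code from any particular
architecture):

	fault = handle_mm_fault(vma, address, flags, regs);
	if (fault & VM_FAULT_OOM) {
		/* user fault: pick (and possibly kill) a victim, then retry */
		pagefault_out_of_memory();
		return;
	}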

An allocation might fail when the current task is the oom victim and there
are no memory reserves left.  The OOM killer is already handled at the
page allocator level for the global OOM and at the charging level for the
memcg one.  Both have much more information about the scope of
allocation/charge request.  This means that either the OOM killer has been
invoked properly and didn't lead to the allocation success or it has been
skipped because it couldn't have been invoked.  In both cases triggering
it from here is pointless and even harmful.

It makes much more sense to let the killed task die rather than to wake up
an eternally hungry oom-killer and send him to choose a fatter victim for
breakfast.

Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/oom_kill.c~mm-oom-pagefault_out_of_memory-dont-force-global-oom-for-dying-tasks
+++ a/mm/oom_kill.c
@@ -1137,6 +1137,9 @@ void pagefault_out_of_memory(void)
 	if (mem_cgroup_oom_synchronize(true))
 		return;
 
+	if (fatal_signal_pending(current))
+		return;
+
 	if (!mutex_trylock(&oom_lock))
 		return;
 	out_of_memory(&oc);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 066/262] mm, oom: do not trigger out_of_memory from the #PF
  2021-11-05 20:34 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2021-11-05 20:38 ` [patch 065/262] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 067/262] memcg: prohibit unconditional exceeding the limit of dying tasks Andrew Morton
                   ` (195 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mgorman, mhocko, mm-commits,
	penguin-kernel, shakeelb, stable, torvalds, urezki, vbabka,
	vdavydov.dev, vvs

From: Michal Hocko <mhocko@suse.com>
Subject: mm, oom: do not trigger out_of_memory from the #PF

Any allocation failure during the #PF path will return with VM_FAULT_OOM,
which in turn results in pagefault_out_of_memory().  This can happen for two
different reasons: a) memcg is out of memory and we rely on
mem_cgroup_oom_synchronize() to perform the memcg OOM handling, or b) a
normal allocation fails.

The latter is quite problematic because allocation paths already trigger
out_of_memory() and the page allocator tries really hard not to fail
allocations.  Anyway, if the OOM killer has already been invoked there is
no reason to invoke it again from the #PF path, especially when the OOM
condition might be gone by that time and we have no way to find out other
than by allocating again.

Moreover, if the allocation failed and the OOM killer hasn't been invoked,
then we are unlikely to do the right thing from the #PF context because we
have already lost the allocation context and restrictions, and therefore
might oom-kill a task from a different NUMA domain.

This all suggests that there is no legitimate reason to trigger
out_of_memory() from pagefault_out_of_memory(), so drop it.  Just to be sure
that no #PF path returns with VM_FAULT_OOM without an allocation having been
attempted, print a warning that this is happening before we restart the #PF.

[VvS: a #PF allocation can hit the limit of the cgroup v1 kmem controller. 
This is a local problem related to memcg; however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into a
real disaster.  This has been broken since kmem accounting was introduced
for cgroup v1 (3.8).  There was no kmem-specific reclaim for the separate
limit, so the only way to handle the kmem hard limit was to return ENOMEM. 
In upstream the problem will be fixed by removing the outdated kmem limit;
however, stable and LTS kernels cannot do that and are still affected. 
This patch fixes the problem and should be backported into stable/LTS.]

Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

--- a/mm/oom_kill.c~mm-oom-do-not-trigger-out_of_memory-from-the-pf
+++ a/mm/oom_kill.c
@@ -1120,19 +1120,15 @@ bool out_of_memory(struct oom_control *o
 }
 
 /*
- * The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * The pagefault handler calls here because some allocation has failed. We have
+ * to take care of the memcg OOM here because this is the only safe context without
+ * any locks held but let the oom killer triggered from the allocation context care
+ * about the global OOM.
  */
 void pagefault_out_of_memory(void)
 {
-	struct oom_control oc = {
-		.zonelist = NULL,
-		.nodemask = NULL,
-		.memcg = NULL,
-		.gfp_mask = 0,
-		.order = 0,
-	};
+	static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (mem_cgroup_oom_synchronize(true))
 		return;
@@ -1140,10 +1136,8 @@ void pagefault_out_of_memory(void)
 	if (fatal_signal_pending(current))
 		return;
 
-	if (!mutex_trylock(&oom_lock))
-		return;
-	out_of_memory(&oc);
-	mutex_unlock(&oom_lock);
+	if (__ratelimit(&pfoom_rs))
+		pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
 }
 
 SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 067/262] memcg: prohibit unconditional exceeding the limit of dying tasks
  2021-11-05 20:34 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2021-11-05 20:38 ` [patch 066/262] mm, oom: do not trigger out_of_memory from the #PF Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 068/262] mm/mmap.c: fix a data race of mm->total_vm Andrew Morton
                   ` (194 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mgorman, mhocko, mm-commits,
	penguin-kernel, shakeelb, stable, torvalds, urezki, vbabka,
	vdavydov.dev, vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: memcg: prohibit unconditional exceeding the limit of dying tasks

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit.  It is assumed that the amount of memory charged by those tasks is
bounded and that most of the memory will be released while the task is
exiting.  This resembles the heuristic for the global OOM situation, where
tasks get access to memory reserves.  There is no global memory shortage at
the memcg level, so the memcg heuristic is more relaxed.

The above assumption is overly optimistic though.  E.g.  vmalloc can scale
to really large requests and the heuristic would allow that.  We used to
have an early break in the vmalloc allocator for killed tasks but this has
been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the
current task is killed"").  There are likely other similar code paths
which do not check for fatal signals in an allocation&charge loop.  Also
there are some kernel objects charged to a memcg which are not bound to a
process life time.

It has been observed that it is not really hard to trigger these bypasses
and cause global OOM situation.

One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves).  This is
certainly possible but it is not really clear how much of an excess is
desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.

This patch addresses the problem by removing the heuristic altogether.  A
bypass is only allowed for requests which either cannot fail or where the
failure is not desirable while the excess should still be limited (e.g.
atomic requests).  Implementation-wise, a killed or dying task fails to
charge if it has passed the OOM killer stage.  That should give all forms
of reclaim a chance to restore the limit before the failure (ENOMEM) and
tell the caller to back off.
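
Sketched against the hunk below (simplified, other exit paths omitted), the
retry loop in try_charge_memcg() becomes:

	/* Avoid endless loop for tasks bypassed by the oom killer */
	if (passed_oom && task_is_dying())
		goto nomem;

	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
				    get_order(nr_pages * PAGE_SIZE));
	if (oom_status == OOM_SUCCESS) {
		passed_oom = true;	/* the oom killer has had its chance */
		nr_retries = MAX_RECLAIM_RETRIES;
		goto retry;
	}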

In addition, this patch renames the should_force_charge() helper to
task_is_dying() because its use is no longer associated with forced
charging.

This patch depends on pagefault_out_of_memory() to not trigger
out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
and cause a global OOM killer.

Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   27 ++++++++-------------------
 1 file changed, 8 insertions(+), 19 deletions(-)

--- a/mm/memcontrol.c~memcg-prohibit-unconditional-exceeding-the-limit-of-dying-tasks
+++ a/mm/memcontrol.c
@@ -234,7 +234,7 @@ enum res_type {
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-static inline bool should_force_charge(void)
+static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
 		(current->flags & PF_EXITING);
@@ -1624,7 +1624,7 @@ static bool mem_cgroup_out_of_memory(str
 	 * A few threads which were not waiting at mutex_lock_killable() can
 	 * fail to bail out. Therefore, check again after holding oom_lock.
 	 */
-	ret = should_force_charge() || out_of_memory(&oc);
+	ret = task_is_dying() || out_of_memory(&oc);
 
 unlock:
 	mutex_unlock(&oom_lock);
@@ -2579,6 +2579,7 @@ static int try_charge_memcg(struct mem_c
 	struct page_counter *counter;
 	enum oom_status oom_status;
 	unsigned long nr_reclaimed;
+	bool passed_oom = false;
 	bool may_swap = true;
 	bool drained = false;
 	unsigned long pflags;
@@ -2614,15 +2615,6 @@ retry:
 		goto force;
 
 	/*
-	 * Unlike in global OOM situations, memcg is not in a physical
-	 * memory shortage.  Allow dying and OOM-killed tasks to
-	 * bypass the last charges so that they can exit quickly and
-	 * free their memory.
-	 */
-	if (unlikely(should_force_charge()))
-		goto force;
-
-	/*
 	 * Prevent unbounded recursion when reclaim operations need to
 	 * allocate memory. This might exceed the limits temporarily,
 	 * but we prefer facilitating memory reclaim and getting back
@@ -2679,8 +2671,9 @@ retry:
 	if (gfp_mask & __GFP_RETRY_MAYFAIL)
 		goto nomem;
 
-	if (fatal_signal_pending(current))
-		goto force;
+	/* Avoid endless loop for tasks bypassed by the oom killer */
+	if (passed_oom && task_is_dying())
+		goto nomem;
 
 	/*
 	 * keep retrying as long as the memcg oom killer is able to make
@@ -2689,14 +2682,10 @@ retry:
 	 */
 	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
-	switch (oom_status) {
-	case OOM_SUCCESS:
+	if (oom_status == OOM_SUCCESS) {
+		passed_oom = true;
 		nr_retries = MAX_RECLAIM_RETRIES;
 		goto retry;
-	case OOM_FAILED:
-		goto force;
-	default:
-		goto nomem;
 	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 068/262] mm/mmap.c: fix a data race of mm->total_vm
  2021-11-05 20:34 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2021-11-05 20:38 ` [patch 067/262] memcg: prohibit unconditional exceeding the limit of dying tasks Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 069/262] mm: use __pfn_to_section() instead of open coding it Andrew Morton
                   ` (193 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, linux-mm, liupeng256, mm-commits, torvalds

From: Peng Liu <liupeng256@huawei.com>
Subject: mm/mmap.c: fix a data race of mm->total_vm

Variable mm->total_vm could be accessed concurrently during mmaping and
system accounting as noticed by KCSAN,

BUG: KCSAN: data-race in __acct_update_integrals / mmap_region

read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
 mmap_region+0x6dc/0x1400
 do_mmap+0x794/0xca0
 vm_mmap_pgoff+0xdf/0x150
 ksys_mmap_pgoff+0xe1/0x380
 do_syscall_64+0x37/0x50
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
 __acct_update_integrals+0x187/0x1d0
 acct_account_cputime+0x3c/0x40
 update_process_times+0x5c/0x150
 tick_sched_timer+0x184/0x210
 __run_hrtimer+0x119/0x3b0
 hrtimer_interrupt+0x350/0xaa0
 __sysvec_apic_timer_interrupt+0x7b/0x220
 asm_call_irq_on_stack+0x12/0x20
 sysvec_apic_timer_interrupt+0x4d/0x80
 asm_sysvec_apic_timer_interrupt+0x12/0x20
 smp_call_function_single+0x192/0x2b0
 perf_install_in_context+0x29b/0x4a0
 __se_sys_perf_event_open+0x1a98/0x2550
 __x64_sys_perf_event_open+0x63/0x70
 do_syscall_64+0x37/0x50
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014

vm_stat_account(), which is called by mmap_region(), increases total_vm,
while __acct_update_integrals() may read total_vm at the same time.  This
is a data race which can lead to undefined behaviour.  To avoid the
potentially bad reads/writes, annotate both accesses with
READ_ONCE()/WRITE_ONCE().
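
As a minimal illustration of the pattern (the actual hunks are below), the
two racing accesses are simply annotated so the compiler cannot tear or fuse
them:

	/* writer, called with mmap_lock held */
	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm) + npages);

	/* lockless reader in the accounting path */
	tsk->acct_vm_mem1 += delta * READ_ONCE(tsk->mm->total_vm) >> 10;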

Link: https://lkml.kernel.org/r/20210913105550.1569419-1-liupeng256@huawei.com
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/tsacct.c |    2 +-
 mm/mmap.c       |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/tsacct.c~mm-mmapc-fix-a-data-race-of-mm-total_vm
+++ a/kernel/tsacct.c
@@ -137,7 +137,7 @@ static void __acct_update_integrals(stru
 	 * the rest of the math is done in xacct_add_tsk.
 	 */
 	tsk->acct_rss_mem1 += delta * get_mm_rss(tsk->mm) >> 10;
-	tsk->acct_vm_mem1 += delta * tsk->mm->total_vm >> 10;
+	tsk->acct_vm_mem1 += delta * READ_ONCE(tsk->mm->total_vm) >> 10;
 }
 
 /**
--- a/mm/mmap.c~mm-mmapc-fix-a-data-race-of-mm-total_vm
+++ a/mm/mmap.c
@@ -3332,7 +3332,7 @@ bool may_expand_vm(struct mm_struct *mm,
 
 void vm_stat_account(struct mm_struct *mm, vm_flags_t flags, long npages)
 {
-	mm->total_vm += npages;
+	WRITE_ONCE(mm->total_vm, READ_ONCE(mm->total_vm)+npages);
 
 	if (is_exec_mapping(flags))
 		mm->exec_vm += npages;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 069/262] mm: use __pfn_to_section() instead of open coding it
  2021-11-05 20:34 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2021-11-05 20:38 ` [patch 068/262] mm/mmap.c: fix a data race of mm->total_vm Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 070/262] mm/memory.c: avoid unnecessary kernel/user pointer conversion Andrew Morton
                   ` (192 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, eb, linux-mm, mm-commits, torvalds

From: Rolf Eike Beer <eb@emlix.com>
Subject: mm: use __pfn_to_section() instead of open coding it

It is defined in the same file just a few lines above.

Link: https://lkml.kernel.org/r/4598487.Rc0NezkW7i@mobilepool36.emlix.com
Signed-off-by: Rolf Eike Beer <eb@emlix.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/mmzone.h~mm-use-__pfn_to_section-instead-of-open-coding-it
+++ a/include/linux/mmzone.h
@@ -1481,7 +1481,7 @@ static inline int pfn_valid(unsigned lon
 
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	ms = __nr_to_section(pfn_to_section_nr(pfn));
+	ms = __pfn_to_section(pfn);
 	if (!valid_section(ms))
 		return 0;
 	/*
@@ -1496,7 +1496,7 @@ static inline int pfn_in_present_section
 {
 	if (pfn_to_section_nr(pfn) >= NR_MEM_SECTIONS)
 		return 0;
-	return present_section(__nr_to_section(pfn_to_section_nr(pfn)));
+	return present_section(__pfn_to_section(pfn));
 }
 
 static inline unsigned long next_present_section_nr(unsigned long section_nr)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 070/262] mm/memory.c: avoid unnecessary kernel/user pointer conversion
  2021-11-05 20:34 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2021-11-05 20:38 ` [patch 069/262] mm: use __pfn_to_section() instead of open coding it Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables Andrew Morton
                   ` (191 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, amit.kachhap, kirill.shutemov, linux-mm, mm-commits,
	torvalds, Vincenzo.Frascino

From: Amit Daniel Kachhap <amit.kachhap@arm.com>
Subject: mm/memory.c: avoid unnecessary kernel/user pointer conversion

Annotating a pointer from __user to kernel and then back again might
confuse sparse.  In copy_huge_page_from_user() this can be avoided by
dropping the intermediate variable and using the __user pointer directly.
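
As a hypothetical illustration of the sparse-unfriendly pattern (names
shortened, not the exact kernel code):

	const void __user *usr_src;		/* user pointer passed in */
	void *src = (void *)usr_src;		/* cast drops __user */
	...
	rc = copy_from_user(page_kaddr,
			(const void __user *)(src + i * PAGE_SIZE), PAGE_SIZE);

	/* after the cleanup, the __user pointer is used directly: */
	rc = copy_from_user(page_kaddr, usr_src + i * PAGE_SIZE, PAGE_SIZE);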

Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com
Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/memory.c~mm-memory-avoid-unnecessary-kernel-user-pointer-conversion
+++ a/mm/memory.c
@@ -5421,7 +5421,6 @@ long copy_huge_page_from_user(struct pag
 				unsigned int pages_per_huge_page,
 				bool allow_pagefault)
 {
-	void *src = (void *)usr_src;
 	void *page_kaddr;
 	unsigned long i, rc = 0;
 	unsigned long ret_val = pages_per_huge_page * PAGE_SIZE;
@@ -5434,8 +5433,7 @@ long copy_huge_page_from_user(struct pag
 		else
 			page_kaddr = kmap_atomic(subpage);
 		rc = copy_from_user(page_kaddr,
-				(const void __user *)(src + i * PAGE_SIZE),
-				PAGE_SIZE);
+				usr_src + i * PAGE_SIZE, PAGE_SIZE);
 		if (allow_pagefault)
 			kunmap(subpage);
 		else
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables
  2021-11-05 20:34 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2021-11-05 20:38 ` [patch 070/262] mm/memory.c: avoid unnecessary kernel/user pointer conversion Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:57   ` Nadav Amit
  2021-11-05 20:38 ` [patch 072/262] mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte Andrew Morton
                   ` (190 subsequent siblings)
  261 siblings, 1 reply; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: aarcange, akpm, andrew.cooper3, dave.hansen, linux-mm, luto,
	mm-commits, namit, npiggin, peterz, tglx, torvalds, will, yuzhao

From: Nadav Amit <namit@vmware.com>
Subject: mm/memory.c: use correct VMA flags when freeing page-tables

Consistent use of the mmu_gather interface requires a call to
tlb_start_vma() and tlb_end_vma() for each VMA.  free_pgtables() does not
follow this pattern.

Certain architectures need tlb_start_vma() to be called in order for
tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
tlb->vma_huge), which are later used for the proper TLB flush to be
issued.  Since tlb_start_vma() is not called, this can lead to the wrong
VMA flags being used when the flush is performed.

Specifically, the munmap syscall would call unmap_region(), which unmaps
the VMAs and then frees the page-tables.  A flush is needed after the
page-tables are removed to prevent page-walk caches from holding stale
entries, but this flush would use the VMA flags of the last VMA that was
flushed.  This does not appear to be right.

Use tlb_start_vma() and tlb_end_vma() to prevent this from happening. 
This might lead to unnecessary calls to flush_cache_range() on certain
architectures.  If needed, a new flag can be added to mmu_gather to indicate
that the flush is not needed.
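
For reference, the consistent mmu_gather pattern is roughly (a sketch; what
happens between start and end varies by caller):

	tlb_start_vma(tlb, vma);	/* record VMA properties (exec/huge) */
	/* ... unmap pages and/or free page tables for this VMA ... */
	tlb_end_vma(tlb, vma);		/* flush using the recorded flags */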

Link: https://lkml.kernel.org/r/20211021122322.592822-1-namit@vmware.com
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nick Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/memory.c~mm-use-correct-vma-flags-when-freeing-page-tables
+++ a/mm/memory.c
@@ -412,6 +412,8 @@ void free_pgtables(struct mmu_gather *tl
 		unlink_anon_vmas(vma);
 		unlink_file_vma(vma);
 
+		tlb_start_vma(tlb, vma);
+
 		if (is_vm_hugetlb_page(vma)) {
 			hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
@@ -429,6 +431,8 @@ void free_pgtables(struct mmu_gather *tl
 			free_pgd_range(tlb, addr, vma->vm_end,
 				floor, next ? next->vm_start : ceiling);
 		}
+
+		tlb_end_vma(tlb, vma);
 		vma = next;
 	}
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 072/262] mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte
  2021-11-05 20:34 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2021-11-05 20:38 ` [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 073/262] mm: clear vmf->pte after pte_unmap_same() returns Andrew Morton
                   ` (189 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: aarcange, akpm, apopple, axelrasmussen, david, hughd, jglisse,
	kirill, liam.howlett, linmiaohe, linux-mm, mm-commits, peterx,
	rppt, shy828301, torvalds, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte

Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.

IMHO all of them are very nice cleanups to existing code already; they're
all small and self-contained.  They'll be needed by the upcoming uffd-wp
series.


This patch (of 4):

It was done conditionally before, as there's one shmem special case where
we use SetPageDirty() instead.  However, that's not necessary and it
should be easier and cleaner to do it unconditionally in
mfill_atomic_install_pte().

The most recent discussion about this is here, where Hugh explained the
history of SetPageDirty() and why it's possible that it's not required at
all:

https://lore.kernel.org/lkml/alpine.LSU.2.11.2104121657050.1097@eggly.anvils/

Currently mfill_atomic_install_pte() has three callers:

        1. shmem_mfill_atomic_pte
        2. mcopy_atomic_pte
        3. mcontinue_atomic_pte

After the change: case (1) should have its SetPageDirty replaced by the
dirty bit on the pte (so we finally unify them), case (2) should have no
functional change at all as it has page_in_cache==false, and case (3) may
add a dirty bit to the pte.  However, since case (3) is UFFDIO_CONTINUE for
shmem, the page is all but certain to be dirty anyway, because
UFFDIO_CONTINUE normally requires another process to modify the page cache
and kick the faulted thread, so it should not make a real difference either.

This should make it much easier to follow which cases set the dirty bit for
uffd, as we now simply set it for all uffd-related ioctls.  Meanwhile, there
is no special handling of SetPageDirty() when it is not needed.

Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/shmem.c       |    1 -
 mm/userfaultfd.c |    3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

--- a/mm/shmem.c~mm-shmem-unconditionally-set-pte-dirty-in-mfill_atomic_install_pte
+++ a/mm/shmem.c
@@ -2423,7 +2423,6 @@ int shmem_mfill_atomic_pte(struct mm_str
 	shmem_recalc_inode(inode);
 	spin_unlock_irq(&info->lock);
 
-	SetPageDirty(page);
 	unlock_page(page);
 	return 0;
 out_delete_from_cache:
--- a/mm/userfaultfd.c~mm-shmem-unconditionally-set-pte-dirty-in-mfill_atomic_install_pte
+++ a/mm/userfaultfd.c
@@ -69,10 +69,9 @@ int mfill_atomic_install_pte(struct mm_s
 	pgoff_t offset, max_off;
 
 	_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
+	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable || !page_in_cache)
-		_dst_pte = pte_mkdirty(_dst_pte);
 	if (writable) {
 		if (wp_copy)
 			_dst_pte = pte_mkuffd_wp(_dst_pte);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 073/262] mm: clear vmf->pte after pte_unmap_same() returns
  2021-11-05 20:34 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2021-11-05 20:38 ` [patch 072/262] mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 074/262] mm: drop first_index/last_index in zap_details Andrew Morton
                   ` (188 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: aarcange, akpm, apopple, axelrasmussen, david, hughd, jglisse,
	kirill, liam.howlett, linmiaohe, linux-mm, mm-commits, peterx,
	rppt, shy828301, torvalds, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm: clear vmf->pte after pte_unmap_same() returns

pte_unmap_same() will always unmap the pte pointer.  After the unmap,
vmf->pte is no longer valid, so we should clear it.

This was safe only because no one accesses vmf->pte after pte_unmap_same()
returns, since the only caller of pte_unmap_same() (so far) is
do_swap_page(), where vmf->pte will in most cases be overwritten very soon.

Pass vmf directly into pte_unmap_same(), which also lets us avoid the long
parameter list, a nice cleanup.

Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/mm/memory.c~mm-clear-vmf-pte-after-pte_unmap_same-returns
+++ a/mm/memory.c
@@ -2728,19 +2728,19 @@ EXPORT_SYMBOL_GPL(apply_to_existing_page
  * proceeding (but do_wp_page is only called after already making such a check;
  * and do_anonymous_page can safely check later on).
  */
-static inline int pte_unmap_same(struct mm_struct *mm, pmd_t *pmd,
-				pte_t *page_table, pte_t orig_pte)
+static inline int pte_unmap_same(struct vm_fault *vmf)
 {
 	int same = 1;
 #if defined(CONFIG_SMP) || defined(CONFIG_PREEMPTION)
 	if (sizeof(pte_t) > sizeof(unsigned long)) {
-		spinlock_t *ptl = pte_lockptr(mm, pmd);
+		spinlock_t *ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
 		spin_lock(ptl);
-		same = pte_same(*page_table, orig_pte);
+		same = pte_same(*vmf->pte, vmf->orig_pte);
 		spin_unlock(ptl);
 	}
 #endif
-	pte_unmap(page_table);
+	pte_unmap(vmf->pte);
+	vmf->pte = NULL;
 	return same;
 }
 
@@ -3492,7 +3492,7 @@ vm_fault_t do_swap_page(struct vm_fault
 	vm_fault_t ret = 0;
 	void *shadow = NULL;
 
-	if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte))
+	if (!pte_unmap_same(vmf))
 		goto out;
 
 	entry = pte_to_swp_entry(vmf->orig_pte);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 074/262] mm: drop first_index/last_index in zap_details
  2021-11-05 20:34 incoming Andrew Morton
                   ` (72 preceding siblings ...)
  2021-11-05 20:38 ` [patch 073/262] mm: clear vmf->pte after pte_unmap_same() returns Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 075/262] mm: add zap_skip_check_mapping() helper Andrew Morton
                   ` (187 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: aarcange, akpm, apopple, axelrasmussen, david, hughd, jglisse,
	kirill, liam.howlett, linmiaohe, linux-mm, mm-commits, peterx,
	rppt, shy828301, torvalds, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm: drop first_index/last_index in zap_details

The first_index/last_index parameters in zap_details are actually only
used in unmap_mapping_range_tree().  Meanwhile, this function is only
called by unmap_mapping_pages() once.  Instead of passing these two
variables through the whole stack of page zapping code, remove them from
zap_details and let them simply be parameters of
unmap_mapping_range_tree(), which is inlined.

Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Liam Howlett <liam.howlett@oracle.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    2 --
 mm/memory.c        |   31 ++++++++++++++++++-------------
 2 files changed, 18 insertions(+), 15 deletions(-)

--- a/include/linux/mm.h~mm-drop-first_index-last_index-in-zap_details
+++ a/include/linux/mm.h
@@ -1688,8 +1688,6 @@ extern void user_shm_unlock(size_t, stru
  */
 struct zap_details {
 	struct address_space *check_mapping;	/* Check page->mapping if set */
-	pgoff_t	first_index;			/* Lowest page->index to unmap */
-	pgoff_t last_index;			/* Highest page->index to unmap */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
--- a/mm/memory.c~mm-drop-first_index-last_index-in-zap_details
+++ a/mm/memory.c
@@ -3325,20 +3325,20 @@ static void unmap_mapping_range_vma(stru
 }
 
 static inline void unmap_mapping_range_tree(struct rb_root_cached *root,
+					    pgoff_t first_index,
+					    pgoff_t last_index,
 					    struct zap_details *details)
 {
 	struct vm_area_struct *vma;
 	pgoff_t vba, vea, zba, zea;
 
-	vma_interval_tree_foreach(vma, root,
-			details->first_index, details->last_index) {
-
+	vma_interval_tree_foreach(vma, root, first_index, last_index) {
 		vba = vma->vm_pgoff;
 		vea = vba + vma_pages(vma) - 1;
-		zba = details->first_index;
+		zba = first_index;
 		if (zba < vba)
 			zba = vba;
-		zea = details->last_index;
+		zea = last_index;
 		if (zea > vea)
 			zea = vea;
 
@@ -3364,18 +3364,22 @@ void unmap_mapping_page(struct page *pag
 {
 	struct address_space *mapping = page->mapping;
 	struct zap_details details = { };
+	pgoff_t	first_index;
+	pgoff_t	last_index;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageTail(page));
 
+	first_index = page->index;
+	last_index = page->index + thp_nr_pages(page) - 1;
+
 	details.check_mapping = mapping;
-	details.first_index = page->index;
-	details.last_index = page->index + thp_nr_pages(page) - 1;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 
@@ -3395,16 +3399,17 @@ void unmap_mapping_pages(struct address_
 		pgoff_t nr, bool even_cows)
 {
 	struct zap_details details = { };
+	pgoff_t	first_index = start;
+	pgoff_t	last_index = start + nr - 1;
 
 	details.check_mapping = even_cows ? NULL : mapping;
-	details.first_index = start;
-	details.last_index = start + nr - 1;
-	if (details.last_index < details.first_index)
-		details.last_index = ULONG_MAX;
+	if (last_index < first_index)
+		last_index = ULONG_MAX;
 
 	i_mmap_lock_write(mapping);
 	if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
-		unmap_mapping_range_tree(&mapping->i_mmap, &details);
+		unmap_mapping_range_tree(&mapping->i_mmap, first_index,
+					 last_index, &details);
 	i_mmap_unlock_write(mapping);
 }
 EXPORT_SYMBOL_GPL(unmap_mapping_pages);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 075/262] mm: add zap_skip_check_mapping() helper
  2021-11-05 20:34 incoming Andrew Morton
                   ` (73 preceding siblings ...)
  2021-11-05 20:38 ` [patch 074/262] mm: drop first_index/last_index in zap_details Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 076/262] mm: introduce pmd_install() helper Andrew Morton
                   ` (186 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: aarcange, akpm, apopple, axelrasmussen, david, hughd, jglisse,
	kirill, liam.howlett, linmiaohe, linux-mm, mm-commits, peterx,
	rppt, shy828301, torvalds, willy

From: Peter Xu <peterx@redhat.com>
Subject: mm: add zap_skip_check_mapping() helper

Use the helper for the checks.  Rename "check_mapping" into "zap_mapping"
because "check_mapping" looks like a bool but in fact it stores the
mapping itself.  When it's set, we check the mapping (it must be
non-NULL).  When it's cleared we skip the check, which works like the old
way.

Move the duplicated comments to the helper too.
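
The two settings then read as follows (a short sketch of the intended usage;
the real helper is in the hunk below):

	/* only zap pages belonging to @mapping, keep private (COWed) pages */
	details.zap_mapping = mapping;

	/* even_cows: mapping check disabled, zap everything in the range */
	details.zap_mapping = NULL;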

Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Alistair Popple <apopple@nvidia.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |   16 +++++++++++++++-
 mm/memory.c        |   29 ++++++-----------------------
 2 files changed, 21 insertions(+), 24 deletions(-)

--- a/include/linux/mm.h~mm-add-zap_skip_check_mapping-helper
+++ a/include/linux/mm.h
@@ -1687,10 +1687,24 @@ extern void user_shm_unlock(size_t, stru
  * Parameter block passed down to zap_pte_range in exceptional cases.
  */
 struct zap_details {
-	struct address_space *check_mapping;	/* Check page->mapping if set */
+	struct address_space *zap_mapping;	/* Check page->mapping if set */
 	struct page *single_page;		/* Locked page to be unmapped */
 };
 
+/*
+ * We set details->zap_mappings when we want to unmap shared but keep private
+ * pages. Return true if skip zapping this page, false otherwise.
+ */
+static inline bool
+zap_skip_check_mapping(struct zap_details *details, struct page *page)
+{
+	if (!details || !page)
+		return false;
+
+	return details->zap_mapping &&
+	    (details->zap_mapping != page_rmapping(page));
+}
+
 struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
--- a/mm/memory.c~mm-add-zap_skip_check_mapping-helper
+++ a/mm/memory.c
@@ -1337,16 +1337,8 @@ again:
 			struct page *page;
 
 			page = vm_normal_page(vma, addr, ptent);
-			if (unlikely(details) && page) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping &&
-				    details->check_mapping != page_rmapping(page))
-					continue;
-			}
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			ptent = ptep_get_and_clear_full(mm, addr, pte,
 							tlb->fullmm);
 			tlb_remove_tlb_entry(tlb, pte, addr);
@@ -1379,17 +1371,8 @@ again:
 		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
-			if (unlikely(details && details->check_mapping)) {
-				/*
-				 * unmap_shared_mapping_pages() wants to
-				 * invalidate cache without truncating:
-				 * unmap shared but keep private pages.
-				 */
-				if (details->check_mapping !=
-				    page_rmapping(page))
-					continue;
-			}
-
+			if (unlikely(zap_skip_check_mapping(details, page)))
+				continue;
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
 
@@ -3373,7 +3356,7 @@ void unmap_mapping_page(struct page *pag
 	first_index = page->index;
 	last_index = page->index + thp_nr_pages(page) - 1;
 
-	details.check_mapping = mapping;
+	details.zap_mapping = mapping;
 	details.single_page = page;
 
 	i_mmap_lock_write(mapping);
@@ -3402,7 +3385,7 @@ void unmap_mapping_pages(struct address_
 	pgoff_t	first_index = start;
 	pgoff_t	last_index = start + nr - 1;
 
-	details.check_mapping = even_cows ? NULL : mapping;
+	details.zap_mapping = even_cows ? NULL : mapping;
 	if (last_index < first_index)
 		last_index = ULONG_MAX;
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 076/262] mm: introduce pmd_install() helper
  2021-11-05 20:34 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2021-11-05 20:38 ` [patch 075/262] mm: add zap_skip_check_mapping() helper Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 077/262] mm: remove redundant smp_wmb() Andrew Morton
                   ` (185 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, david, hannes, kirill.shutemov, linux-mm, mhocko,
	mika.penttila, mm-commits, songmuchun, tglx, torvalds, vbabka,
	vdavydov.dev, zhengqi.arch

From: Qi Zheng <zhengqi.arch@bytedance.com>
Subject: mm: introduce pmd_install() helper

Patch series "Do some code cleanups related to mm", v3.


This patch (of 2):

Currently we have the same few lines repeated three times in the code. 
Deduplicate them with the newly introduced pmd_install() helper.

Link: https://lkml.kernel.org/r/20210901102722.47686-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20210901102722.47686-2-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Mika Penttila <mika.penttila@nextfour.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c  |   11 ++---------
 mm/internal.h |    1 +
 mm/memory.c   |   34 ++++++++++++++++------------------
 3 files changed, 19 insertions(+), 27 deletions(-)

--- a/mm/filemap.c~mm-introduce-pmd_install-helper
+++ a/mm/filemap.c
@@ -3211,15 +3211,8 @@ static bool filemap_map_pmd(struct vm_fa
 	    }
 	}
 
-	if (pmd_none(*vmf->pmd)) {
-		vmf->ptl = pmd_lock(mm, vmf->pmd);
-		if (likely(pmd_none(*vmf->pmd))) {
-			mm_inc_nr_ptes(mm);
-			pmd_populate(mm, vmf->pmd, vmf->prealloc_pte);
-			vmf->prealloc_pte = NULL;
-		}
-		spin_unlock(vmf->ptl);
-	}
+	if (pmd_none(*vmf->pmd))
+		pmd_install(mm, vmf->pmd, &vmf->prealloc_pte);
 
 	/* See comment in handle_pte_fault() */
 	if (pmd_devmap_trans_unstable(vmf->pmd)) {
--- a/mm/internal.h~mm-introduce-pmd_install-helper
+++ a/mm/internal.h
@@ -38,6 +38,7 @@ vm_fault_t do_swap_page(struct vm_fault
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
 		unsigned long floor, unsigned long ceiling);
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
 
 static inline bool can_madv_lru_vma(struct vm_area_struct *vma)
 {
--- a/mm/memory.c~mm-introduce-pmd_install-helper
+++ a/mm/memory.c
@@ -437,9 +437,20 @@ void free_pgtables(struct mmu_gather *tl
 	}
 }
 
+void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte)
+{
+	spinlock_t *ptl = pmd_lock(mm, pmd);
+
+	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		mm_inc_nr_ptes(mm);
+		pmd_populate(mm, pmd, *pte);
+		*pte = NULL;
+	}
+	spin_unlock(ptl);
+}
+
 int __pte_alloc(struct mm_struct *mm, pmd_t *pmd)
 {
-	spinlock_t *ptl;
 	pgtable_t new = pte_alloc_one(mm);
 	if (!new)
 		return -ENOMEM;
@@ -459,13 +470,7 @@ int __pte_alloc(struct mm_struct *mm, pm
 	 */
 	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 
-	ptl = pmd_lock(mm, pmd);
-	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
-		mm_inc_nr_ptes(mm);
-		pmd_populate(mm, pmd, new);
-		new = NULL;
-	}
-	spin_unlock(ptl);
+	pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
 	return 0;
@@ -4028,17 +4033,10 @@ vm_fault_t finish_fault(struct vm_fault
 				return ret;
 		}
 
-		if (vmf->prealloc_pte) {
-			vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
-			if (likely(pmd_none(*vmf->pmd))) {
-				mm_inc_nr_ptes(vma->vm_mm);
-				pmd_populate(vma->vm_mm, vmf->pmd, vmf->prealloc_pte);
-				vmf->prealloc_pte = NULL;
-			}
-			spin_unlock(vmf->ptl);
-		} else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd))) {
+		if (vmf->prealloc_pte)
+			pmd_install(vma->vm_mm, vmf->pmd, &vmf->prealloc_pte);
+		else if (unlikely(pte_alloc(vma->vm_mm, vmf->pmd)))
 			return VM_FAULT_OOM;
-		}
 	}
 
 	/* See comment in handle_pte_fault() */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 077/262] mm: remove redundant smp_wmb()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2021-11-05 20:38 ` [patch 076/262] mm: introduce pmd_install() helper Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 078/262] Documentation: update pagemap with shmem exceptions Andrew Morton
                   ` (184 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, david, hannes, kirill.shutemov, linux-mm, mhocko,
	mika.penttila, mm-commits, songmuchun, tglx, torvalds, vbabka,
	vdavydov.dev, zhengqi.arch

From: Qi Zheng <zhengqi.arch@bytedance.com>
Subject: mm: remove redundant smp_wmb()

The smp_wmb() in __pte_alloc() is used to ensure that all pte setup is
visible before the pte is made visible to other CPUs by being put into page
tables.  We only need this when the pte is actually populated, so move it to
pmd_install().  __pte_alloc_kernel(), __p4d_alloc(), __pud_alloc() and
__pmd_alloc() are similar cases.

We can also defer the smp_wmb() to the place where the pmd entry is really
populated by the preallocated pte.  There are two kinds of users of the
preallocated pte: one is filemap & finish_fault(), the other is THP.  The
former does not need another smp_wmb() because the smp_wmb() has already
been done by pmd_install().  Fortunately, the latter also does not need
another smp_wmb() because there is already an smp_wmb() before populating
the new pte when THP uses a preallocated pte to split a huge pmd.
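
The ordering being relied upon is the usual publish/observe pattern (a
sketch; the initialisation step is only named loosely):

	/* publishing CPU */
	new = pte_alloc_one(mm);	/* allocate and initialise pte page */
	...
	smp_wmb();			/* order the init before the publish */
	pmd_populate(mm, pmd, new);	/* make it reachable to lockless walkers */

	/*
	 * An observing CPU walking page tables locklessly performs
	 * data-dependent loads (pmd -> pte page), so no smp_rmb() is needed
	 * except on alpha; see the alpha page table accessors.
	 */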

Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mika Penttila <mika.penttila@nextfour.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c         |   52 ++++++++++++++++++------------------------
 mm/sparse-vmemmap.c |    2 -
 2 files changed, 24 insertions(+), 30 deletions(-)

--- a/mm/memory.c~mm-remove-redundant-smp_wmb
+++ a/mm/memory.c
@@ -443,6 +443,20 @@ void pmd_install(struct mm_struct *mm, p
 
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
 		mm_inc_nr_ptes(mm);
+		/*
+		 * Ensure all pte setup (eg. pte page lock and page clearing) are
+		 * visible before the pte is made visible to other CPUs by being
+		 * put into page tables.
+		 *
+		 * The other side of the story is the pointer chasing in the page
+		 * table walking code (when walking the page table without locking;
+		 * ie. most of the time). Fortunately, these data accesses consist
+		 * of a chain of data-dependent loads, meaning most CPUs (alpha
+		 * being the notable exception) will already guarantee loads are
+		 * seen in-order. See the alpha page table accessors for the
+		 * smp_rmb() barriers in page table walking code.
+		 */
+		smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
 		pmd_populate(mm, pmd, *pte);
 		*pte = NULL;
 	}
@@ -455,21 +469,6 @@ int __pte_alloc(struct mm_struct *mm, pm
 	if (!new)
 		return -ENOMEM;
 
-	/*
-	 * Ensure all pte setup (eg. pte page lock and page clearing) are
-	 * visible before the pte is made visible to other CPUs by being
-	 * put into page tables.
-	 *
-	 * The other side of the story is the pointer chasing in the page
-	 * table walking code (when walking the page table without locking;
-	 * ie. most of the time). Fortunately, these data accesses consist
-	 * of a chain of data-dependent loads, meaning most CPUs (alpha
-	 * being the notable exception) will already guarantee loads are
-	 * seen in-order. See the alpha page table accessors for the
-	 * smp_rmb() barriers in page table walking code.
-	 */
-	smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
-
 	pmd_install(mm, pmd, &new);
 	if (new)
 		pte_free(mm, new);
@@ -482,10 +481,9 @@ int __pte_alloc_kernel(pmd_t *pmd)
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&init_mm.page_table_lock);
 	if (likely(pmd_none(*pmd))) {	/* Has another populated it ? */
+		smp_wmb(); /* See comment in pmd_install() */
 		pmd_populate_kernel(&init_mm, pmd, new);
 		new = NULL;
 	}
@@ -3849,7 +3847,6 @@ static vm_fault_t __do_fault(struct vm_f
 		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	ret = vma->vm_ops->fault(vmf);
@@ -3920,7 +3917,6 @@ vm_fault_t do_set_pmd(struct vm_fault *v
 		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
@@ -4145,7 +4141,6 @@ static vm_fault_t do_fault_around(struct
 		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
-		smp_wmb(); /* See comment in __pte_alloc() */
 	}
 
 	return vmf->vma->vm_ops->map_pages(vmf, start_pgoff, end_pgoff);
@@ -4819,13 +4814,13 @@ int __p4d_alloc(struct mm_struct *mm, pg
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&mm->page_table_lock);
-	if (pgd_present(*pgd))		/* Another has populated it */
+	if (pgd_present(*pgd)) {	/* Another has populated it */
 		p4d_free(mm, new);
-	else
+	} else {
+		smp_wmb(); /* See comment in pmd_install() */
 		pgd_populate(mm, pgd, new);
+	}
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
@@ -4842,11 +4837,10 @@ int __pud_alloc(struct mm_struct *mm, p4
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	spin_lock(&mm->page_table_lock);
 	if (!p4d_present(*p4d)) {
 		mm_inc_nr_puds(mm);
+		smp_wmb(); /* See comment in pmd_install() */
 		p4d_populate(mm, p4d, new);
 	} else	/* Another has populated it */
 		pud_free(mm, new);
@@ -4867,14 +4861,14 @@ int __pmd_alloc(struct mm_struct *mm, pu
 	if (!new)
 		return -ENOMEM;
 
-	smp_wmb(); /* See comment in __pte_alloc */
-
 	ptl = pud_lock(mm, pud);
 	if (!pud_present(*pud)) {
 		mm_inc_nr_pmds(mm);
+		smp_wmb(); /* See comment in pmd_install() */
 		pud_populate(mm, pud, new);
-	} else	/* Another has populated it */
+	} else {	/* Another has populated it */
 		pmd_free(mm, new);
+	}
 	spin_unlock(ptl);
 	return 0;
 }
--- a/mm/sparse-vmemmap.c~mm-remove-redundant-smp_wmb
+++ a/mm/sparse-vmemmap.c
@@ -76,7 +76,7 @@ static int split_vmemmap_huge_pmd(pmd_t
 		set_pte_at(&init_mm, addr, pte, entry);
 	}
 
-	/* Make pte visible before pmd. See comment in __pte_alloc(). */
+	/* Make pte visible before pmd. See comment in pmd_install(). */
 	smp_wmb();
 	pmd_populate_kernel(&init_mm, pmd, pgtable);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 078/262] Documentation: update pagemap with shmem exceptions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2021-11-05 20:38 ` [patch 077/262] mm: remove redundant smp_wmb() Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 079/262] lazy tlb: introduce lazy mm refcount helper functions Andrew Morton
                   ` (183 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, carl.waldspurger, corbet, david, florian.schmidt,
	ivan.teterevkov, jonathan.davies, linux-mm, mm-commits, peterx,
	tiberiu.georgescu, torvalds

From: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Subject: Documentation: update pagemap with shmem exceptions

This patch follows the discussions on previous documentation patch threads
[1][2].  It presents the exception case of shared memory management from
the pagemap's point of view.  It briefly describes what is missing, why it
is missing and alternatives to the pagemap for page info retrieval in user
space.

In short, the kernel does not keep track of PTEs for swapped out shared
pages within the processes that reference them.  Thus, the
/proc/pid/pagemap tool cannot print the swap destination of the shared
memory pages, instead setting the pagemap entry to zero for both
non-allocated and swapped out pages.  This can create confusion for users
who need information on swapped out pages.

The reasons why maintaining the PTEs of all swapped-out shared pages
across all processes, while keeping comparable performance, is neither a
trivial task nor a desirable change have been discussed extensively
[1][3][4][5].  There are also arguments for why this arguably missing
information should eventually be exposed to the user, either by a future
pagemap patch or by an alternative tool.

[1]: https://marc.info/?m=162878395426774
[2]: https://lore.kernel.org/lkml/20210920164931.175411-1-tiberiu.georgescu@nutanix.com/
[3]: https://lore.kernel.org/lkml/20210730160826.63785-1-tiberiu.georgescu@nutanix.com/
[4]: https://lore.kernel.org/lkml/20210807032521.7591-1-peterx@redhat.com/
[5]: https://lore.kernel.org/lkml/20210715201651.212134-1-peterx@redhat.com/


Mention the current missing information in the pagemap and alternatives on
how to retrieve it, in case someone stumbles upon unexpected behaviour.
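
For reference, a minimal user-space sketch of the two alternatives the
documentation below mentions (lseek() with SEEK_DATA/SEEK_HOLE, and
mincore()).  The file argument is whatever backs the mapping, e.g. an
entry under /proc/<pid>/map_files/ for anonymous shared memory; reading
those entries typically requires appropriate privileges, and the path is
only an example:

  /* shmem-pages.c: build with `cc -O2 shmem-pages.c`, run with a file */
  #define _GNU_SOURCE             /* for SEEK_DATA / SEEK_HOLE */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(int argc, char **argv)
  {
          int fd;
          off_t len, pos;
          long psize = sysconf(_SC_PAGESIZE);

          if (argc != 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0) {
                  perror("open");
                  return 1;
          }
          len = lseek(fd, 0, SEEK_END);

          /* lseek(): accessed ranges (present or swapped) vs. holes */
          for (pos = 0; pos < len; ) {
                  off_t data = lseek(fd, pos, SEEK_DATA);
                  off_t hole;

                  if (data < 0)
                          break;          /* only holes remain */
                  hole = lseek(fd, data, SEEK_HOLE);
                  printf("data: %jd-%jd\n", (intmax_t)data, (intmax_t)hole);
                  pos = hole;
          }

          /* mincore(): in memory (incl. swap cache) vs. not resident */
          {
                  void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
                  size_t npages = (len + psize - 1) / psize;
                  unsigned char *vec = malloc(npages);
                  size_t i;

                  if (map == MAP_FAILED || !vec || mincore(map, len, vec))
                          return 1;
                  for (i = 0; i < npages; i++)
                          printf("page %zu: %s\n", i, (vec[i] & 1) ?
                                 "in memory" : "swapped out or not allocated");
          }
          return 0;
  }
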

Link: https://lkml.kernel.org/r/20210923064618.157046-1-tiberiu.georgescu@nutanix.com
Link: https://lkml.kernel.org/r/20210923064618.157046-2-tiberiu.georgescu@nutanix.com
Signed-off-by: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Reviewed-by: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Reviewed-by: Florian Schmidt <florian.schmidt@nutanix.com>
Reviewed-by: Carl Waldspurger <carl.waldspurger@nutanix.com>
Reviewed-by: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/pagemap.rst |   22 +++++++++++++++++++++
 1 file changed, 22 insertions(+)

--- a/Documentation/admin-guide/mm/pagemap.rst~documentation-update-pagemap-with-shmem-exceptions
+++ a/Documentation/admin-guide/mm/pagemap.rst
@@ -196,6 +196,28 @@ you can go through every map in the proc
 in kpagecount, and tally up the number of pages that are only referenced
 once.
 
+Exceptions for Shared Memory
+============================
+
+Page table entries for shared pages are cleared when the pages are zapped or
+swapped out. This makes swapped out pages indistinguishable from never-allocated
+ones.
+
+In kernel space, the swap location can still be retrieved from the page cache.
+However, values stored only on the normal PTE get lost irretrievably when the
+page is swapped out (i.e. SOFT_DIRTY).
+
+In user space, whether the page is present, swapped or none can be deduced with
+the help of lseek and/or mincore system calls.
+
+lseek() can differentiate between accessed pages (present or swapped out) and
+holes (none/non-allocated) by specifying the SEEK_DATA flag on the file where
+the pages are backed. For anonymous shared pages, the file can be found in
+``/proc/pid/map_files/``.
+
+mincore() can differentiate between pages in memory (present, including swap
+cache) and out of memory (swapped out or none/non-allocated).
+
 Other notes
 ===========
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 079/262] lazy tlb: introduce lazy mm refcount helper functions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2021-11-05 20:38 ` [patch 078/262] Documentation: update pagemap with shmem exceptions Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable Andrew Morton
                   ` (182 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, anton, benh, linux-mm, luto, mm-commits, npiggin, paulus,
	rdunlap, torvalds

From: Nicholas Piggin <npiggin@gmail.com>
Subject: lazy tlb: introduce lazy mm refcount helper functions

Patch series "shoot lazy tlbs", v4.

On a 16-socket 192-core POWER8 system, a context switching benchmark with
as many software threads as CPUs (so each switch will go in and out of
idle), upstream can achieve a rate of about 1 million context switches per
second.  After this series it goes up to 118 million.


This patch (of 4):

Add explicit _lazy_tlb annotated functions for lazy mm refcounting.  This
makes lazy mm references more obvious, and allows explicit refcounting to
be removed if it is not used.

If a kernel thread's current lazy tlb mm happens to be the one it wants to
use, then kthread_use_mm() cleverly transfers the mm refcount from the
lazy tlb mm reference to the returned reference.  If the lazy tlb mm
reference is no longer identical to a normal reference, this trick does
not work, so that is changed to be explicit about the two references.

[npiggin@gmail.com: fix a refcounting bug in kthread_use_mm]
  Link: https://lkml.kernel.org/r/1623125298.bx63h3mopj.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210605014216.446867-1-npiggin@gmail.com
Link: https://lkml.kernel.org/r/20210605014216.446867-2-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/mach-rpc/ecard.c            |    2 +-
 arch/powerpc/kernel/smp.c            |    2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c |    4 ++--
 fs/exec.c                            |    4 ++--
 include/linux/sched/mm.h             |   11 +++++++++++
 kernel/cpu.c                         |    2 +-
 kernel/exit.c                        |    2 +-
 kernel/kthread.c                     |   21 +++++++++++++--------
 kernel/sched/core.c                  |   15 ++++++++-------
 9 files changed, 40 insertions(+), 23 deletions(-)

--- a/arch/arm/mach-rpc/ecard.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/arch/arm/mach-rpc/ecard.c
@@ -253,7 +253,7 @@ static int ecard_init_mm(void)
 	current->mm = mm;
 	current->active_mm = mm;
 	activate_mm(active_mm, mm);
-	mmdrop(active_mm);
+	mmdrop_lazy_tlb(active_mm);
 	ecard_init_pgtables(mm);
 	return 0;
 }
--- a/arch/powerpc/kernel/smp.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/arch/powerpc/kernel/smp.c
@@ -1582,7 +1582,7 @@ void start_secondary(void *unused)
 	if (IS_ENABLED(CONFIG_PPC32))
 		setup_kup();
 
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	current->active_mm = &init_mm;
 
 	smp_store_cpu_info(cpu);
--- a/arch/powerpc/mm/book3s64/radix_tlb.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -786,10 +786,10 @@ void exit_lazy_flush_tlb(struct mm_struc
 	if (current->active_mm == mm) {
 		WARN_ON_ONCE(current->mm != NULL);
 		/* Is a kernel thread and is using mm as the lazy tlb */
-		mmgrab(&init_mm);
+		mmgrab_lazy_tlb(&init_mm);
 		current->active_mm = &init_mm;
 		switch_mm_irqs_off(mm, &init_mm, current);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
 
 	/*
--- a/fs/exec.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/fs/exec.c
@@ -1028,9 +1028,9 @@ static int exec_mmap(struct mm_struct *m
 		setmax_mm_hiwater_rss(&tsk->signal->maxrss, old_mm);
 		mm_update_next_owner(old_mm);
 		mmput(old_mm);
-		return 0;
+	} else {
+		mmdrop_lazy_tlb(active_mm);
 	}
-	mmdrop(active_mm);
 	return 0;
 }
 
--- a/include/linux/sched/mm.h~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/include/linux/sched/mm.h
@@ -49,6 +49,17 @@ static inline void mmdrop(struct mm_stru
 		__mmdrop(mm);
 }
 
+/* Helpers for lazy TLB mm refcounting */
+static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
+{
+	mmgrab(mm);
+}
+
+static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
+{
+	mmdrop(mm);
+}
+
 /**
  * mmget() - Pin the address space associated with a &struct mm_struct.
  * @mm: The address space to pin.
--- a/kernel/cpu.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/kernel/cpu.c
@@ -613,7 +613,7 @@ static int finish_cpu(unsigned int cpu)
 	 */
 	if (mm != &init_mm)
 		idle->active_mm = &init_mm;
-	mmdrop(mm);
+	mmdrop_lazy_tlb(mm);
 	return 0;
 }
 
--- a/kernel/exit.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/kernel/exit.c
@@ -475,7 +475,7 @@ static void exit_mm(void)
 		__set_current_state(TASK_RUNNING);
 		mmap_read_lock(mm);
 	}
-	mmgrab(mm);
+	mmgrab_lazy_tlb(mm);
 	BUG_ON(mm != current->active_mm);
 	/* more a memory barrier than a real lock */
 	task_lock(current);
--- a/kernel/kthread.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/kernel/kthread.c
@@ -1350,14 +1350,19 @@ void kthread_use_mm(struct mm_struct *mm
 	WARN_ON_ONCE(!(tsk->flags & PF_KTHREAD));
 	WARN_ON_ONCE(tsk->mm);
 
+	/*
+	 * It's possible that tsk->active_mm == mm here, but we must
+	 * still mmgrab(mm) and mmdrop_lazy_tlb(active_mm), because lazy
+	 * mm may not have its own refcount (see mmgrab/drop_lazy_tlb()).
+	 */
+	mmgrab(mm);
+
 	task_lock(tsk);
 	/* Hold off tlb flush IPIs while switching mm's */
 	local_irq_disable();
 	active_mm = tsk->active_mm;
-	if (active_mm != mm) {
-		mmgrab(mm);
+	if (active_mm != mm)
 		tsk->active_mm = mm;
-	}
 	tsk->mm = mm;
 	membarrier_update_current_mm(mm);
 	switch_mm_irqs_off(active_mm, mm, tsk);
@@ -1374,12 +1379,9 @@ void kthread_use_mm(struct mm_struct *mm
 	 * memory barrier after storing to tsk->mm, before accessing
 	 * user-space memory. A full memory barrier for membarrier
 	 * {PRIVATE,GLOBAL}_EXPEDITED is implicitly provided by
-	 * mmdrop(), or explicitly with smp_mb().
+	 * mmdrop_lazy_tlb().
 	 */
-	if (active_mm != mm)
-		mmdrop(active_mm);
-	else
-		smp_mb();
+	mmdrop_lazy_tlb(active_mm);
 
 	to_kthread(tsk)->oldfs = force_uaccess_begin();
 }
@@ -1411,10 +1413,13 @@ void kthread_unuse_mm(struct mm_struct *
 	local_irq_disable();
 	tsk->mm = NULL;
 	membarrier_update_current_mm(NULL);
+	mmgrab_lazy_tlb(mm);
 	/* active_mm is still 'mm' */
 	enter_lazy_tlb(mm, tsk);
 	local_irq_enable();
 	task_unlock(tsk);
+
+	mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(kthread_unuse_mm);
 
--- a/kernel/sched/core.c~lazy-tlb-introduce-lazy-mm-refcount-helper-functions
+++ a/kernel/sched/core.c
@@ -4831,13 +4831,14 @@ static struct rq *finish_task_switch(str
 	 * rq->curr, before returning to userspace, so provide them here:
 	 *
 	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
-	 *   provided by mmdrop(),
+	 *   provided by mmdrop_lazy_tlb(),
 	 * - a sync_core for SYNC_CORE.
 	 */
 	if (mm) {
 		membarrier_mm_sync_core_before_usermode(mm);
-		mmdrop(mm);
+		mmdrop_lazy_tlb(mm);
 	}
+
 	if (unlikely(prev_state == TASK_DEAD)) {
 		if (prev->sched_class->task_dead)
 			prev->sched_class->task_dead(prev);
@@ -4900,9 +4901,9 @@ context_switch(struct rq *rq, struct tas
 
 	/*
 	 * kernel -> kernel   lazy + transfer active
-	 *   user -> kernel   lazy + mmgrab() active
+	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active
 	 *
-	 * kernel ->   user   switch + mmdrop() active
+	 * kernel ->   user   switch + mmdrop_lazy_tlb() active
 	 *   user ->   user   switch
 	 */
 	if (!next->mm) {                                // to kernel
@@ -4910,7 +4911,7 @@ context_switch(struct rq *rq, struct tas
 
 		next->active_mm = prev->active_mm;
 		if (prev->mm)                           // from user
-			mmgrab(prev->active_mm);
+			mmgrab_lazy_tlb(prev->active_mm);
 		else
 			prev->active_mm = NULL;
 	} else {                                        // to user
@@ -4926,7 +4927,7 @@ context_switch(struct rq *rq, struct tas
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop() in finish_task_switch(). */
+			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
 			rq->prev_mm = prev->active_mm;
 			prev->active_mm = NULL;
 		}
@@ -9442,7 +9443,7 @@ void __init sched_init(void)
 	/*
 	 * The boot idle thread does lazy MMU switching as well:
 	 */
-	mmgrab(&init_mm);
+	mmgrab_lazy_tlb(&init_mm);
 	enter_lazy_tlb(&init_mm, current);
 
 	/*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-11-05 20:34 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2021-11-05 20:38 ` [patch 079/262] lazy tlb: introduce lazy mm refcount helper functions Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-06  4:29   ` Andy Lutomirski
  2021-11-05 20:38 ` [patch 081/262] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Andrew Morton
                   ` (181 subsequent siblings)
  261 siblings, 1 reply; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, anton, benh, linux-mm, luto, mm-commits, npiggin, paulus,
	rdunlap, torvalds

From: Nicholas Piggin <npiggin@gmail.com>
Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable

Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
when it is context switched.  This can be disabled by architectures that
don't require this refcounting if they clean up lazy tlb mms when the last
refcount is dropped.  Currently this is always enabled, which is what
existing code does, so the patch is effectively a no-op.

Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

[akpm@linux-foundation.org: fix comment]
[npiggin@gmail.com: update comments]
  Link: https://lkml.kernel.org/r/1623121605.j47gdpccep.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210605014216.446867-3-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig             |   14 ++++++++++++++
 include/linux/sched/mm.h |   14 ++++++++++++--
 kernel/sched/core.c      |   22 ++++++++++++++++++----
 kernel/sched/sched.h     |    4 +++-
 4 files changed, 47 insertions(+), 7 deletions(-)

--- a/arch/Kconfig~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/arch/Kconfig
@@ -428,6 +428,20 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	  irqs disabled over activate_mm. Architectures that do IPI based TLB
 	  shootdowns should enable this.
 
+# Use normal mm refcounting for MMU_LAZY_TLB kernel thread references.
+# MMU_LAZY_TLB_REFCOUNT=n can improve the scalability of context switching
+# to/from kernel threads when the same mm is running on a lot of CPUs (a large
+# multi-threaded application), by reducing contention on the mm refcount.
+#
+# This can be disabled if the architecture ensures no CPUs are using an mm as a
+# "lazy tlb" beyond its final refcount (i.e., by the time __mmdrop frees the mm
+# or its kernel page tables). This could be arranged by arch_exit_mmap(), or
+# final exit(2) TLB flush, for example. arch code must also ensure the
+# _lazy_tlb variants of mmgrab/mmdrop are used when dropping the lazy reference
+# to a kthread ->active_mm (non-arch code has been converted already).
+config MMU_LAZY_TLB_REFCOUNT
+	def_bool y
+
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
 
--- a/include/linux/sched/mm.h~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/include/linux/sched/mm.h
@@ -52,12 +52,22 @@ static inline void mmdrop(struct mm_stru
 /* Helpers for lazy TLB mm refcounting */
 static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
 {
-	mmgrab(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
+		mmgrab(mm);
 }
 
 static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
 {
-	mmdrop(mm);
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT)) {
+		mmdrop(mm);
+	} else {
+		/*
+		 * mmdrop_lazy_tlb must provide a full memory barrier, see the
+		 * membarrier comment in finish_task_switch which relies on
+		 * this.
+		 */
+		smp_mb();
+	}
 }
 
 /**
--- a/kernel/sched/core.c~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/kernel/sched/core.c
@@ -4772,7 +4772,7 @@ static struct rq *finish_task_switch(str
 	__releases(rq->lock)
 {
 	struct rq *rq = this_rq();
-	struct mm_struct *mm = rq->prev_mm;
+	struct mm_struct *mm = NULL;
 	long prev_state;
 
 	/*
@@ -4791,7 +4791,10 @@ static struct rq *finish_task_switch(str
 		      current->comm, current->pid, preempt_count()))
 		preempt_count_set(FORK_PREEMPT_COUNT);
 
-	rq->prev_mm = NULL;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	mm = rq->prev_lazy_mm;
+	rq->prev_lazy_mm = NULL;
+#endif
 
 	/*
 	 * A task struct has one reference for the use as "current".
@@ -4927,9 +4930,20 @@ context_switch(struct rq *rq, struct tas
 		switch_mm_irqs_off(prev->active_mm, next->mm, next);
 
 		if (!prev->mm) {                        // from kernel
-			/* will mmdrop_lazy_tlb() in finish_task_switch(). */
-			rq->prev_mm = prev->active_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+			/* Will mmdrop_lazy_tlb() in finish_task_switch(). */
+			rq->prev_lazy_mm = prev->active_mm;
 			prev->active_mm = NULL;
+#else
+			/*
+			 * Without MMU_LAZY_TLB_REFCOUNT there is no lazy
+			 * tracking (because no rq->prev_lazy_mm) in
+			 * finish_task_switch, so no mmdrop_lazy_tlb(), so no
+			 * memory barrier for membarrier (see the membarrier
+			 * comment in finish_task_switch()).  Do it here.
+			 */
+			smp_mb();
+#endif
 		}
 	}
 
--- a/kernel/sched/sched.h~lazy-tlb-allow-lazy-tlb-mm-refcounting-to-be-configurable
+++ a/kernel/sched/sched.h
@@ -977,7 +977,9 @@ struct rq {
 	struct task_struct	*idle;
 	struct task_struct	*stop;
 	unsigned long		next_balance;
-	struct mm_struct	*prev_mm;
+#ifdef CONFIG_MMU_LAZY_TLB_REFCOUNT
+	struct mm_struct	*prev_lazy_mm;
+#endif
 
 	unsigned int		clock_update_flags;
 	u64			clock;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 081/262] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
  2021-11-05 20:34 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2021-11-05 20:38 ` [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:38 ` [patch 082/262] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Andrew Morton
                   ` (180 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, anton, benh, linux-mm, luto, mm-commits, npiggin, paulus,
	rdunlap, torvalds

From: Nicholas Piggin <npiggin@gmail.com>
Subject: lazy tlb: shoot lazies, a non-refcounting lazy tlb option

On big systems, the mm refcount can become highly contended when doing a
lot of context switching with threaded applications (particularly
switching between the idle thread and an application thread).

Abandoning lazy tlb slows switching down quite a bit in the important
user->idle->user cases, so instead implement a non-refcounted scheme that
causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down any
remaining lazy ones.

Shootdown IPIs are of some concern, but they have not been observed to be a
big problem with this scheme (the powerpc implementation generated 314
additional interrupts on a 144 CPU system during a kernel compile).  There
are a number of strategies that could be employed to reduce IPIs if they
turn out to be a problem for some workload.

[npiggin@gmail.com: update comments]
  Link: https://lkml.kernel.org/r/1623121901.mszkmmum0n.astroid@bobo.none
Link: https://lkml.kernel.org/r/20210605014216.446867-4-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig  |   14 +++++++++++++
 kernel/fork.c |   51 ++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 65 insertions(+)

--- a/arch/Kconfig~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
+++ a/arch/Kconfig
@@ -441,6 +441,20 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 # to a kthread ->active_mm (non-arch code has been converted already).
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
+	depends on !MMU_LAZY_TLB_SHOOTDOWN
+
+# This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
+# mm as a lazy tlb beyond its last reference count, by shooting down these
+# users before the mm is deallocated. __mmdrop() first IPIs all CPUs that may
+# be using the mm as a lazy tlb, so that they may switch themselves to using
+# init_mm for their active mm. mm_cpumask(mm) is used to determine which CPUs
+# may be using mm as a lazy tlb mm.
+#
+# To implement this, an arch must ensure mm_cpumask(mm) contains at least all
+# possible CPUs in which the mm is lazy, and it must meet the requirements for
+# MMU_LAZY_TLB_REFCOUNT=n (see above).
+config MMU_LAZY_TLB_SHOOTDOWN
+	bool
 
 config ARCH_HAVE_NMI_SAFE_CMPXCHG
 	bool
--- a/kernel/fork.c~lazy-tlb-shoot-lazies-a-non-refcounting-lazy-tlb-option
+++ a/kernel/fork.c
@@ -686,6 +686,53 @@ static void check_mm(struct mm_struct *m
 #define allocate_mm()	(kmem_cache_alloc(mm_cachep, GFP_KERNEL))
 #define free_mm(mm)	(kmem_cache_free(mm_cachep, (mm)))
 
+static void do_shoot_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	if (current->active_mm == mm) {
+		WARN_ON_ONCE(current->mm);
+		current->active_mm = &init_mm;
+		switch_mm(mm, &init_mm, current);
+	}
+}
+
+static void do_check_lazy_tlb(void *arg)
+{
+	struct mm_struct *mm = arg;
+
+	WARN_ON_ONCE(current->active_mm == mm);
+}
+
+static void shoot_lazy_tlbs(struct mm_struct *mm)
+{
+	if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_SHOOTDOWN)) {
+		/*
+		 * IPI overheads have not found to be expensive, but they could
+		 * be reduced in a number of possible ways, for example (in
+		 * roughly increasing order of complexity):
+		 * - A batch of mms requiring IPIs could be gathered and freed
+		 *   at once.
+		 * - CPUs could store their active mm somewhere that can be
+		 *   remotely checked without a lock, to filter out
+		 *   false-positives in the cpumask.
+		 * - After mm_users or mm_count reaches zero, switching away
+		 *   from the mm could clear mm_cpumask to reduce some IPIs
+		 *   (some batching or delaying would help).
+		 * - A delayed freeing and RCU-like quiescing sequence based on
+		 *   mm switching to avoid IPIs completely.
+		 */
+		on_each_cpu_mask(mm_cpumask(mm), do_shoot_lazy_tlb, (void *)mm, 1);
+		if (IS_ENABLED(CONFIG_DEBUG_VM))
+			on_each_cpu(do_check_lazy_tlb, (void *)mm, 1);
+	} else {
+		/*
+		 * In this case, lazy tlb mms are refounted and would not reach
+		 * __mmdrop until all CPUs have switched away and mmdrop()ed.
+		 */
+	}
+}
+
 /*
  * Called when the last reference to the mm
  * is dropped: either by a lazy thread or by
@@ -695,6 +742,10 @@ void __mmdrop(struct mm_struct *mm)
 {
 	BUG_ON(mm == &init_mm);
 	WARN_ON_ONCE(mm == current->mm);
+
+	/* Ensure no CPUs are using this as their lazy tlb mm */
+	shoot_lazy_tlbs(mm);
+
 	WARN_ON_ONCE(mm == current->active_mm);
 	mm_free_pgd(mm);
 	destroy_context(mm);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 082/262] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
  2021-11-05 20:34 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2021-11-05 20:38 ` [patch 081/262] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Andrew Morton
@ 2021-11-05 20:38 ` Andrew Morton
  2021-11-05 20:39 ` [patch 083/262] memory: remove unused CONFIG_MEM_BLOCK_SIZE Andrew Morton
                   ` (179 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:38 UTC (permalink / raw)
  To: akpm, anton, benh, linux-mm, luto, mm-commits, npiggin, paulus,
	rdunlap, torvalds

From: Nicholas Piggin <npiggin@gmail.com>
Subject: powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN

On a 16-socket 192-core POWER8 system, a context switching benchmark
with as many software threads as CPUs (so each switch will go in and
out of idle), upstream can achieve a rate of about 1 million context
switches per second. After this patch it goes up to 118 million.

No real data for real-world workloads unfortunately.  I think it's
always been a "known" cacheline; it just showed up badly on
will-it-scale tests recently when Anton was doing a sweep of
low-hanging scalability issues on big systems.

We have some very big systems running certain in-memory databases that
get into very high contention on mutexes, which pushes context switch
rates right up while idle times stay pretty high.  That produces a lot
of parallel context switching between user and idle threads, so we might
be getting a bit of this contention there.

It's not something at the top of profiles though.  And on
multi-threaded workloads like this, the normal refcounting of the user
mm still has fundamental contention.  It's tricky to get the change
tested on these workloads (machine time is very limited and I can't
drive the software).

I suspect it could also show in things that do high net or disk IO
rates (enough to need a lot of cores), and do some user processing
steps along the way.  You'd potentially get a lot of idle switching.


This infrastructure could be beneficial to other architectures.  The
cacheline is going to bounce in the same situations on other archs, so
I would say yes.  Rik at one stage had some patches to try to avoid it
for x86 some years ago; I don't know what happened to those.

The way powerpc has to maintain mm_cpumask for its TLB flushing makes
it relatively easy to do this shootdown, and we decided the additional
IPIs were less of a concern than the bouncing.  Others have different
concerns, but I tried to make it generic and add comments explaining
what other archs can do, or possibly different ways it might be
achieved.

Link: https://lkml.kernel.org/r/20210605014216.446867-5-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/Kconfig |    1 +
 1 file changed, 1 insertion(+)

--- a/arch/powerpc/Kconfig~powerpc-64s-enable-mmu_lazy_tlb_shootdown
+++ a/arch/powerpc/Kconfig
@@ -249,6 +249,7 @@ config PPC
 	select IRQ_FORCED_THREADING
 	select MMU_GATHER_PAGE_SIZE
 	select MMU_GATHER_RCU_TABLE_FREE
+	select MMU_LAZY_TLB_SHOOTDOWN		if PPC_BOOK3S_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE		if PPC64 || NOT_COHERENT_CACHE
 	select NEED_SG_DMA_LENGTH
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 083/262] memory: remove unused CONFIG_MEM_BLOCK_SIZE
  2021-11-05 20:34 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2021-11-05 20:38 ` [patch 082/262] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 084/262] mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() Andrew Morton
                   ` (178 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, dave.hansen, david, linux-mm, lukas.bulwahn, mhocko,
	mm-commits, torvalds

From: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: memory: remove unused CONFIG_MEM_BLOCK_SIZE

Commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove
functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
utilized anywhere.

It is a good practice to keep the CONFIG_* defines exclusively for the
Kbuild system.  So, drop this unused definition.

This issue was noticed due to running ./scripts/checkkconfigsymbols.py.

Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memory.h |    1 -
 1 file changed, 1 deletion(-)

--- a/include/linux/memory.h~memory-remove-unused-config_mem_block_size
+++ a/include/linux/memory.h
@@ -140,7 +140,6 @@ typedef int (*walk_memory_blocks_func_t)
 extern int walk_memory_blocks(unsigned long start, unsigned long size,
 			      void *arg, walk_memory_blocks_func_t func);
 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
-#define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 
 extern int memory_group_register_static(int nid, unsigned long max_pages);
 extern int memory_group_register_dynamic(int nid, unsigned long unit_pages);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 084/262] mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2021-11-05 20:39 ` [patch 083/262] memory: remove unused CONFIG_MEM_BLOCK_SIZE Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 085/262] mm/mremap: don't account pages in vma_to_resize() Andrew Morton
                   ` (177 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, linux-mm, liu.song11, mm-commits, torvalds

From: Liu Song <liu.song11@zte.com.cn>
Subject: mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey()

After adjustment, the repeated assignment of "prev" is avoided, and the
readability of the code is improved.

Link: https://lkml.kernel.org/r/20211012152444.4127-1-fishland@aliyun.com
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mprotect.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/mprotect.c~mm-mprotectc-avoid-repeated-assignment-in-do_mprotect_pkey
+++ a/mm/mprotect.c
@@ -563,7 +563,7 @@ static int do_mprotect_pkey(unsigned lon
 	error = -ENOMEM;
 	if (!vma)
 		goto out;
-	prev = vma->vm_prev;
+
 	if (unlikely(grows & PROT_GROWSDOWN)) {
 		if (vma->vm_start >= end)
 			goto out;
@@ -581,8 +581,11 @@ static int do_mprotect_pkey(unsigned lon
 				goto out;
 		}
 	}
+
 	if (start > vma->vm_start)
 		prev = vma;
+	else
+		prev = vma->vm_prev;
 
 	for (nstart = start ; ; ) {
 		unsigned long mask_off_old_flags;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 085/262] mm/mremap: don't account pages in vma_to_resize()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2021-11-05 20:39 ` [patch 084/262] mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 086/262] include/linux/io-mapping.h: remove fallback for writecombine Andrew Morton
                   ` (176 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, bgeffon, catalin.marinas, chenwandun, dan.carpenter,
	dan.j.williams, dave.jiang, dima, hughd, jgg, jhubbard,
	kirill.shutemov, linux-mm, linux, luto, mike.kravetz, minchan,
	mingo, mm-commits, rcampbell, tglx, torvalds, tsbogend, vbabka,
	viro, vishal.l.verma, wangkefeng.wang, weiyongjun1, will

From: Dmitry Safonov <dima@arista.com>
Subject: mm/mremap: don't account pages in vma_to_resize()

All this vm_unacct_memory(charged) dance seems to complicate life
without a good reason.  Furthermore, it does not seem to always be done
right on error paths in mremap_to().  And worse than that: this `charged'
difference is sometimes double-accounted for growing MREMAP_DONTUNMAP
mremap()s in move_vma():

	if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT))

Let's not do this.  Account memory in the mremap() fast path for growing
VMAs or in move_vma() for actually moving things.  Do it the same simple
way as vm_stat_account() does, with the difference that
security_vm_enough_memory_mm() is called before copying/adjusting the VMA.

Originally noticed by Chen Wandun:
https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com

Link: https://lkml.kernel.org/r/20210721131320.522061-1-dima@arista.com
Fixes: e346b3813067 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()")
Signed-off-by: Dmitry Safonov <dima@arista.com>
Acked-by: Brian Geffon <bgeffon@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chen Wandun <chenwandun@huawei.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mremap.c |   50 ++++++++++++++++++++++----------------------------
 1 file changed, 22 insertions(+), 28 deletions(-)

--- a/mm/mremap.c~mm-mremap-dont-account-pages-in-vma_to_resize
+++ a/mm/mremap.c
@@ -565,6 +565,7 @@ static unsigned long move_vma(struct vm_
 		bool *locked, unsigned long flags,
 		struct vm_userfaultfd_ctx *uf, struct list_head *uf_unmap)
 {
+	long to_account = new_len - old_len;
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *new_vma;
 	unsigned long vm_flags = vma->vm_flags;
@@ -583,6 +584,9 @@ static unsigned long move_vma(struct vm_
 	if (mm->map_count >= sysctl_max_map_count - 3)
 		return -ENOMEM;
 
+	if (unlikely(flags & MREMAP_DONTUNMAP))
+		to_account = new_len;
+
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		if (vma->vm_start != old_addr)
 			err = vma->vm_ops->may_split(vma, old_addr);
@@ -604,8 +608,8 @@ static unsigned long move_vma(struct vm_
 	if (err)
 		return err;
 
-	if (unlikely(flags & MREMAP_DONTUNMAP && vm_flags & VM_ACCOUNT)) {
-		if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT))
+	if (vm_flags & VM_ACCOUNT) {
+		if (security_vm_enough_memory_mm(mm, to_account >> PAGE_SHIFT))
 			return -ENOMEM;
 	}
 
@@ -613,8 +617,8 @@ static unsigned long move_vma(struct vm_
 	new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff,
 			   &need_rmap_locks);
 	if (!new_vma) {
-		if (unlikely(flags & MREMAP_DONTUNMAP && vm_flags & VM_ACCOUNT))
-			vm_unacct_memory(new_len >> PAGE_SHIFT);
+		if (vm_flags & VM_ACCOUNT)
+			vm_unacct_memory(to_account >> PAGE_SHIFT);
 		return -ENOMEM;
 	}
 
@@ -708,8 +712,7 @@ static unsigned long move_vma(struct vm_
 }
 
 static struct vm_area_struct *vma_to_resize(unsigned long addr,
-	unsigned long old_len, unsigned long new_len, unsigned long flags,
-	unsigned long *p)
+	unsigned long old_len, unsigned long new_len, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
@@ -768,13 +771,6 @@ static struct vm_area_struct *vma_to_res
 				(new_len - old_len) >> PAGE_SHIFT))
 		return ERR_PTR(-ENOMEM);
 
-	if (vma->vm_flags & VM_ACCOUNT) {
-		unsigned long charged = (new_len - old_len) >> PAGE_SHIFT;
-		if (security_vm_enough_memory_mm(mm, charged))
-			return ERR_PTR(-ENOMEM);
-		*p = charged;
-	}
-
 	return vma;
 }
 
@@ -787,7 +783,6 @@ static unsigned long mremap_to(unsigned
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long ret = -EINVAL;
-	unsigned long charged = 0;
 	unsigned long map_flags = 0;
 
 	if (offset_in_page(new_addr))
@@ -830,7 +825,7 @@ static unsigned long mremap_to(unsigned
 		old_len = new_len;
 	}
 
-	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
+	vma = vma_to_resize(addr, old_len, new_len, flags);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
 		goto out;
@@ -853,7 +848,7 @@ static unsigned long mremap_to(unsigned
 				((addr - vma->vm_start) >> PAGE_SHIFT),
 				map_flags);
 	if (IS_ERR_VALUE(ret))
-		goto out1;
+		goto out;
 
 	/* We got a new mapping */
 	if (!(flags & MREMAP_FIXED))
@@ -862,12 +857,6 @@ static unsigned long mremap_to(unsigned
 	ret = move_vma(vma, addr, old_len, new_len, new_addr, locked, flags, uf,
 		       uf_unmap);
 
-	if (!(offset_in_page(ret)))
-		goto out;
-
-out1:
-	vm_unacct_memory(charged);
-
 out:
 	return ret;
 }
@@ -899,7 +888,6 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long ret = -EINVAL;
-	unsigned long charged = 0;
 	bool locked = false;
 	bool downgraded = false;
 	struct vm_userfaultfd_ctx uf = NULL_VM_UFFD_CTX;
@@ -981,7 +969,7 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	/*
 	 * Ok, we need to grow..
 	 */
-	vma = vma_to_resize(addr, old_len, new_len, flags, &charged);
+	vma = vma_to_resize(addr, old_len, new_len, flags);
 	if (IS_ERR(vma)) {
 		ret = PTR_ERR(vma);
 		goto out;
@@ -992,10 +980,18 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 	if (old_len == vma->vm_end - addr) {
 		/* can we just expand the current mapping? */
 		if (vma_expandable(vma, new_len - old_len)) {
-			int pages = (new_len - old_len) >> PAGE_SHIFT;
+			long pages = (new_len - old_len) >> PAGE_SHIFT;
+
+			if (vma->vm_flags & VM_ACCOUNT) {
+				if (security_vm_enough_memory_mm(mm, pages)) {
+					ret = -ENOMEM;
+					goto out;
+				}
+			}
 
 			if (vma_adjust(vma, vma->vm_start, addr + new_len,
 				       vma->vm_pgoff, NULL)) {
+				vm_unacct_memory(pages);
 				ret = -ENOMEM;
 				goto out;
 			}
@@ -1034,10 +1030,8 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 			       &locked, flags, &uf, &uf_unmap);
 	}
 out:
-	if (offset_in_page(ret)) {
-		vm_unacct_memory(charged);
+	if (offset_in_page(ret))
 		locked = false;
-	}
 	if (downgraded)
 		mmap_read_unlock(current->mm);
 	else
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 086/262] include/linux/io-mapping.h: remove fallback for writecombine
  2021-11-05 20:34 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2021-11-05 20:39 ` [patch 085/262] mm/mremap: don't account pages in vma_to_resize() Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 087/262] mm: mmap_lock: remove redundant newline in TP_printk Andrew Morton
                   ` (175 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, chris, daniel.vetter, joonas.lahtinen, linux-mm,
	lucas.demarchi, mm-commits, peterz, torvalds

From: Lucas De Marchi <lucas.demarchi@intel.com>
Subject: include/linux/io-mapping.h: remove fallback for writecombine

The fallback was introduced in commit 80c33624e472 ("io-mapping: Fixup for
different names of writecombine") to fix the build on microblaze.

5 years later, it seems all archs now provide a pgprot_writecombine(), so
just remove the other possible fallbacks.  For microblaze,
pgprot_writecombine() is available since commit 97ccedd793ac ("microblaze:
Provide pgprot_device/writecombine macros for nommu").

This is build-tested on microblaze with a hack to always build
mm/io-mapping.o and without DIYing on an x86-only macro (_PAGE_CACHE_MASK).

Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/io-mapping.h |    6 ------
 1 file changed, 6 deletions(-)

--- a/include/linux/io-mapping.h~io-mapping-remove-fallback-for-writecombine
+++ a/include/linux/io-mapping.h
@@ -132,13 +132,7 @@ io_mapping_init_wc(struct io_mapping *io
 
 	iomap->base = base;
 	iomap->size = size;
-#if defined(pgprot_noncached_wc) /* archs can't agree on a name ... */
-	iomap->prot = pgprot_noncached_wc(PAGE_KERNEL);
-#elif defined(pgprot_writecombine)
 	iomap->prot = pgprot_writecombine(PAGE_KERNEL);
-#else
-	iomap->prot = pgprot_noncached(PAGE_KERNEL);
-#endif
 
 	return iomap;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 087/262] mm: mmap_lock: remove redundant newline  in TP_printk
  2021-11-05 20:34 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2021-11-05 20:39 ` [patch 086/262] include/linux/io-mapping.h: remove fallback for writecombine Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 088/262] mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN Andrew Morton
                   ` (174 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, axelrasmussen, ligang.bdlg, linux-mm, mingo, mm-commits,
	rostedt, torvalds, vbabka

From: Gang Li <ligang.bdlg@bytedance.com>
Subject: mm: mmap_lock: remove redundant newline  in TP_printk

Ftrace core adds a newline automatically when printing, so emitting one
in TP_printk() creates a blank line.

Link: https://lkml.kernel.org/r/20211009071105.69544-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/mmap_lock.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/trace/events/mmap_lock.h~mm-mmap_lock-remove-redundant-newline-in-tp_printk
+++ a/include/trace/events/mmap_lock.h
@@ -32,7 +32,7 @@ TRACE_EVENT_FN(mmap_lock_start_locking,
 	),
 
 	TP_printk(
-		"mm=%p memcg_path=%s write=%s\n",
+		"mm=%p memcg_path=%s write=%s",
 		__entry->mm,
 		__get_str(memcg_path),
 		__entry->write ? "true" : "false"
@@ -63,7 +63,7 @@ TRACE_EVENT_FN(mmap_lock_acquire_returne
 	),
 
 	TP_printk(
-		"mm=%p memcg_path=%s write=%s success=%s\n",
+		"mm=%p memcg_path=%s write=%s success=%s",
 		__entry->mm,
 		__get_str(memcg_path),
 		__entry->write ? "true" : "false",
@@ -92,7 +92,7 @@ TRACE_EVENT_FN(mmap_lock_released,
 	),
 
 	TP_printk(
-		"mm=%p memcg_path=%s write=%s\n",
+		"mm=%p memcg_path=%s write=%s",
 		__entry->mm,
 		__get_str(memcg_path),
 		__entry->write ? "true" : "false"
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 088/262] mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN
  2021-11-05 20:34 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2021-11-05 20:39 ` [patch 087/262] mm: mmap_lock: remove redundant newline in TP_printk Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 089/262] mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node() Andrew Morton
                   ` (173 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, axelrasmussen, ligang.bdlg, linux-mm, mingo, mm-commits,
	rostedt, torvalds, vbabka

From: Gang Li <ligang.bdlg@bytedance.com>
Subject: mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN

By using DECLARE_EVENT_CLASS and DEFINE_EVENT_FN, we can eliminate a lot
of duplicated code.

Link: https://lkml.kernel.org/r/20211009071243.70286-1-ligang.bdlg@bytedance.com
Signed-off-by: Gang Li <ligang.bdlg@bytedance.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/mmap_lock.h |   44 +++++++----------------------
 1 file changed, 12 insertions(+), 32 deletions(-)

--- a/include/trace/events/mmap_lock.h~mm-mmap_lock-use-declare_event_class-and-define_event_fn
+++ a/include/trace/events/mmap_lock.h
@@ -13,7 +13,7 @@ struct mm_struct;
 extern int trace_mmap_lock_reg(void);
 extern void trace_mmap_lock_unreg(void);
 
-TRACE_EVENT_FN(mmap_lock_start_locking,
+DECLARE_EVENT_CLASS(mmap_lock,
 
 	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write),
 
@@ -36,11 +36,19 @@ TRACE_EVENT_FN(mmap_lock_start_locking,
 		__entry->mm,
 		__get_str(memcg_path),
 		__entry->write ? "true" : "false"
-	),
-
-	trace_mmap_lock_reg, trace_mmap_lock_unreg
+	)
 );
 
+#define DEFINE_MMAP_LOCK_EVENT(name)                                    \
+	DEFINE_EVENT_FN(mmap_lock, name,                                \
+		TP_PROTO(struct mm_struct *mm, const char *memcg_path,  \
+			bool write),                                    \
+		TP_ARGS(mm, memcg_path, write),                         \
+		trace_mmap_lock_reg, trace_mmap_lock_unreg)
+
+DEFINE_MMAP_LOCK_EVENT(mmap_lock_start_locking);
+DEFINE_MMAP_LOCK_EVENT(mmap_lock_released);
+
 TRACE_EVENT_FN(mmap_lock_acquire_returned,
 
 	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
@@ -71,34 +79,6 @@ TRACE_EVENT_FN(mmap_lock_acquire_returne
 	),
 
 	trace_mmap_lock_reg, trace_mmap_lock_unreg
-);
-
-TRACE_EVENT_FN(mmap_lock_released,
-
-	TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write),
-
-	TP_ARGS(mm, memcg_path, write),
-
-	TP_STRUCT__entry(
-		__field(struct mm_struct *, mm)
-		__string(memcg_path, memcg_path)
-		__field(bool, write)
-	),
-
-	TP_fast_assign(
-		__entry->mm = mm;
-		__assign_str(memcg_path, memcg_path);
-		__entry->write = write;
-	),
-
-	TP_printk(
-		"mm=%p memcg_path=%s write=%s",
-		__entry->mm,
-		__get_str(memcg_path),
-		__entry->write ? "true" : "false"
-	),
-
-	trace_mmap_lock_reg, trace_mmap_lock_unreg
 );
 
 #endif /* _TRACE_MMAP_LOCK_H */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 089/262] mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2021-11-05 20:39 ` [patch 088/262] mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 090/262] mm/vmalloc: don't allow VM_NO_GUARD on vmap() Andrew Morton
                   ` (172 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, songmuchun, torvalds, urezki, vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()

Commit f255935b9767 ("mm: cleanup the gfp_mask handling in
__vmalloc_area_node") added __GFP_NOWARN to gfp_mask unconditionally
however it disabled all output inside warn_alloc() call.  This patch saves
original gfp_mask and provides it to all warn_alloc() calls.

Link: https://lkml.kernel.org/r/f4f3187b-9684-e426-565d-827c2a9bbb0e@virtuozzo.com
Fixes: f255935b9767 ("mm: cleanup the gfp_mask handling in __vmalloc_area_node")
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-repair-warn_allocs-in-__vmalloc_area_node
+++ a/mm/vmalloc.c
@@ -2887,6 +2887,7 @@ static void *__vmalloc_area_node(struct
 				 int node)
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
+	const gfp_t orig_gfp_mask = gfp_mask;
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long size = get_vm_area_size(area);
 	unsigned long array_size;
@@ -2907,7 +2908,7 @@ static void *__vmalloc_area_node(struct
 	}
 
 	if (!area->pages) {
-		warn_alloc(gfp_mask, NULL,
+		warn_alloc(orig_gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to allocated page array size %lu",
 			nr_small_pages * PAGE_SIZE, array_size);
 		free_vm_area(area);
@@ -2927,7 +2928,7 @@ static void *__vmalloc_area_node(struct
 	 * allocation request, free them via __vfree() if any.
 	 */
 	if (area->nr_pages != nr_small_pages) {
-		warn_alloc(gfp_mask, NULL,
+		warn_alloc(orig_gfp_mask, NULL,
 			"vmalloc error: size %lu, page order %u, failed to allocate pages",
 			area->nr_pages * PAGE_SIZE, page_order);
 		goto fail;
@@ -2935,7 +2936,7 @@ static void *__vmalloc_area_node(struct
 
 	if (vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift) < 0) {
-		warn_alloc(gfp_mask, NULL,
+		warn_alloc(orig_gfp_mask, NULL,
 			"vmalloc error: size %lu, failed to map pages",
 			area->nr_pages * PAGE_SIZE);
 		goto fail;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 090/262] mm/vmalloc: don't allow VM_NO_GUARD on vmap()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2021-11-05 20:39 ` [patch 089/262] mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node() Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 091/262] mm/vmalloc: make show_numa_info() aware of hugepage mappings Andrew Morton
                   ` (171 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, andreyknvl, david, hch, keescook, linux-mm, mgorman,
	mm-commits, peterz, torvalds, urezki, will

From: Peter Zijlstra <peterz@infradead.org>
Subject: mm/vmalloc: don't allow VM_NO_GUARD on vmap()

The vmalloc guard pages are added on top of each allocation, thereby
isolating any two allocations from one another.  The top guard of the
lower allocation is the bottom guard of the higher allocation, etc.

Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
isolating separate allocations.

There are only two in-tree users of this flag, neither of which uses it
through the exported interface.  Ensure it stays this way.

Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |    2 +-
 mm/vmalloc.c            |    7 +++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

--- a/include/linux/vmalloc.h~mm-vmalloc-dont-allow-vm_no_guard-on-vmap
+++ a/include/linux/vmalloc.h
@@ -22,7 +22,7 @@ struct notifier_block;		/* in notifier.h
 #define VM_USERMAP		0x00000008	/* suitable for remap_vmalloc_range */
 #define VM_DMA_COHERENT		0x00000010	/* dma_alloc_coherent */
 #define VM_UNINITIALIZED	0x00000020	/* vm_struct is not fully initialized */
-#define VM_NO_GUARD		0x00000040      /* don't add guard page */
+#define VM_NO_GUARD		0x00000040      /* ***DANGEROUS*** don't add guard page */
 #define VM_KASAN		0x00000080      /* has allocated kasan shadow memory */
 #define VM_FLUSH_RESET_PERMS	0x00000100	/* reset direct map and flush TLB on unmap, can't be freed in atomic context */
 #define VM_MAP_PUT_PAGES	0x00000200	/* put pages and free array in vfree */
--- a/mm/vmalloc.c~mm-vmalloc-dont-allow-vm_no_guard-on-vmap
+++ a/mm/vmalloc.c
@@ -2743,6 +2743,13 @@ void *vmap(struct page **pages, unsigned
 
 	might_sleep();
 
+	/*
+	 * Your top guard is someone else's bottom guard. Not having a top
+	 * guard compromises someone else's mappings too.
+	 */
+	if (WARN_ON_ONCE(flags & VM_NO_GUARD))
+		flags &= ~VM_NO_GUARD;
+
 	if (count > totalram_pages())
 		return NULL;
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 091/262] mm/vmalloc: make show_numa_info() aware of hugepage mappings
  2021-11-05 20:34 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2021-11-05 20:39 ` [patch 090/262] mm/vmalloc: don't allow VM_NO_GUARD on vmap() Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 092/262] mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo Andrew Morton
                   ` (170 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, edumazet, linux-mm, mm-commits, torvalds, urezki

From: Eric Dumazet <edumazet@google.com>
Subject: mm/vmalloc: make show_numa_info() aware of hugepage mappings

show_numa_info() can be made slightly faster by skipping over hugepages
directly.

Link: https://lkml.kernel.org/r/20211001172725.105824-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-make-show_numa_info-aware-of-hugepage-mappings
+++ a/mm/vmalloc.c
@@ -3864,6 +3864,7 @@ static void show_numa_info(struct seq_fi
 {
 	if (IS_ENABLED(CONFIG_NUMA)) {
 		unsigned int nr, *counters = m->private;
+		unsigned int step = 1U << vm_area_page_order(v);
 
 		if (!counters)
 			return;
@@ -3875,9 +3876,8 @@ static void show_numa_info(struct seq_fi
 
 		memset(counters, 0, nr_node_ids * sizeof(unsigned int));
 
-		for (nr = 0; nr < v->nr_pages; nr++)
-			counters[page_to_nid(v->pages[nr])]++;
-
+		for (nr = 0; nr < v->nr_pages; nr += step)
+			counters[page_to_nid(v->pages[nr])] += step;
 		for_each_node_state(nr, N_HIGH_MEMORY)
 			if (counters[nr])
 				seq_printf(m, " N%u=%u", nr, counters[nr]);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 092/262] mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo
  2021-11-05 20:34 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2021-11-05 20:39 ` [patch 091/262] mm/vmalloc: make show_numa_info() aware of hugepage mappings Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 093/262] mm/vmalloc: do not adjust the search size for alignment overhead Andrew Morton
                   ` (169 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, edumazet, linux-mm, lpf.vector, mm-commits, torvalds, urezki

From: Eric Dumazet <edumazet@google.com>
Subject: mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo

If the last va found in vmap_area_list does not have a vm pointer,
vmallocinfo's s_show() returns 0 and show_purge_info() is never called,
even though it should be.

Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com
Fixes: dd3b8353bae7 ("mm/vmalloc: do not keep unpurged areas in the busy tree")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Pengfei Li <lpf.vector@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/vmalloc.c~mm-vmalloc-make-sure-to-dump-unpurged-areas-in-proc-vmallocinfo
+++ a/mm/vmalloc.c
@@ -3913,7 +3913,7 @@ static int s_show(struct seq_file *m, vo
 			(void *)va->va_start, (void *)va->va_end,
 			va->va_end - va->va_start);
 
-		return 0;
+		goto final;
 	}
 
 	v = va->vm;
@@ -3954,6 +3954,7 @@ static int s_show(struct seq_file *m, vo
 	/*
 	 * As a final step, dump "unpurged" areas.
 	 */
+final:
 	if (list_is_last(&va->list, &vmap_area_list))
 		show_purge_info(m);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 093/262] mm/vmalloc: do not adjust the search size for alignment overhead
  2021-11-05 20:34 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2021-11-05 20:39 ` [patch 092/262] mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 094/262] mm/vmalloc: check various alignments when debugging Andrew Morton
                   ` (168 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, david, hch, hdanton, linux-mm, mgorman, mhocko, mm-commits,
	npiggin, oleksiy.avramchenko, pifang, rostedt, torvalds, urezki,
	willy

From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Subject: mm/vmalloc: do not adjust the search size for alignment overhead

We used to include the alignment overhead in the search length, which
guarantees that a found area will definitely fit after the alignment that
the user specifies has been applied.  On the other hand, we do not
guarantee that an area has the lowest address if the alignment is >=
PAGE_SIZE.

It means that when a user specifies a special alignment together with a
range that corresponds exactly to the requested size, the allocation will
fail.  This is what happens to KASAN: it wants a free block that exactly
matches the specified range when onlining memory banks:

[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory82/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory83/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory85/state
[root@vm-0 fedora]# echo online > /sys/devices/system/memory/memory84/state
[  223.858115] vmap allocation for size 16777216 failed: use vmalloc=<size> to increase size
[  223.859415] bash: vmalloc: allocation failure: 16777216 bytes, mode:0x6000c0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
[  223.860992] CPU: 4 PID: 1644 Comm: bash Kdump: loaded Not tainted 4.18.0-339.el8.x86_64+debug #1
[  223.862149] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[  223.863580] Call Trace:
[  223.863946]  dump_stack+0x8e/0xd0
[  223.864420]  warn_alloc.cold.90+0x8a/0x1b2
[  223.864990]  ? zone_watermark_ok_safe+0x300/0x300
[  223.865626]  ? slab_free_freelist_hook+0x85/0x1a0
[  223.866264]  ? __get_vm_area_node+0x240/0x2c0
[  223.866858]  ? kfree+0xdd/0x570
[  223.867309]  ? kmem_cache_alloc_node_trace+0x157/0x230
[  223.868028]  ? notifier_call_chain+0x90/0x160
[  223.868625]  __vmalloc_node_range+0x465/0x840
[  223.869230]  ? mark_held_locks+0xb7/0x120

Fix it by making sure that find_vmap_lowest_match() returns the lowest
start address for any given alignment value, i.e.  for alignments bigger
than PAGE_SIZE the algorithm rolls back toward parent nodes, checking the
right sub-trees, if the leftmost free block did not fit due to the
alignment overhead.
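
A minimal user-space sketch (numbers and helper names are illustrative,
not kernel code) of the effect of dropping the alignment overhead from the
search length:

	#include <stdbool.h>

	/* A free block of exactly 16 MiB whose start is already 2 MiB
	 * aligned satisfies a 16 MiB request under the new scheme, but
	 * was rejected by the old "size + align - 1" search length.
	 */
	static bool old_fits(unsigned long block, unsigned long size,
			     unsigned long align)
	{
		return block >= size + align - 1;	/* 16M >= 16M + 2M - 1: false */
	}

	static bool new_fits(unsigned long block, unsigned long size,
			     unsigned long align)
	{
		return block >= size;			/* 16M >= 16M: true */
	}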

Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com
Fixes: 68ad4a330433 ("mm/vmalloc.c: keep track of free blocks for vmap allocation")
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reported-by: Ping Fang <pifang@redhat.com>
Tested-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |   22 +++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-do-not-adjust-the-search-size-for-alignment-overhead
+++ a/mm/vmalloc.c
@@ -1195,18 +1195,14 @@ find_vmap_lowest_match(unsigned long siz
 {
 	struct vmap_area *va;
 	struct rb_node *node;
-	unsigned long length;
 
 	/* Start from the root. */
 	node = free_vmap_area_root.rb_node;
 
-	/* Adjust the search size for alignment overhead. */
-	length = size + align - 1;
-
 	while (node) {
 		va = rb_entry(node, struct vmap_area, rb_node);
 
-		if (get_subtree_max_size(node->rb_left) >= length &&
+		if (get_subtree_max_size(node->rb_left) >= size &&
 				vstart < va->va_start) {
 			node = node->rb_left;
 		} else {
@@ -1216,9 +1212,9 @@ find_vmap_lowest_match(unsigned long siz
 			/*
 			 * Does not make sense to go deeper towards the right
 			 * sub-tree if it does not have a free block that is
-			 * equal or bigger to the requested search length.
+			 * equal or bigger to the requested search size.
 			 */
-			if (get_subtree_max_size(node->rb_right) >= length) {
+			if (get_subtree_max_size(node->rb_right) >= size) {
 				node = node->rb_right;
 				continue;
 			}
@@ -1226,15 +1222,23 @@ find_vmap_lowest_match(unsigned long siz
 			/*
 			 * OK. We roll back and find the first right sub-tree,
 			 * that will satisfy the search criteria. It can happen
-			 * only once due to "vstart" restriction.
+			 * due to "vstart" restriction or an alignment overhead
+			 * that is bigger then PAGE_SIZE.
 			 */
 			while ((node = rb_parent(node))) {
 				va = rb_entry(node, struct vmap_area, rb_node);
 				if (is_within_this_va(va, size, align, vstart))
 					return va;
 
-				if (get_subtree_max_size(node->rb_right) >= length &&
+				if (get_subtree_max_size(node->rb_right) >= size &&
 						vstart <= va->va_start) {
+					/*
+					 * Shift the vstart forward. Please note, we update it with
+					 * parent's start address adding "1" because we do not want
+					 * to enter same sub-tree after it has already been checked
+					 * and no suitable free block found there.
+					 */
+					vstart = va->va_start + 1;
 					node = node->rb_right;
 					break;
 				}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 094/262] mm/vmalloc: check various alignments when debugging
  2021-11-05 20:34 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2021-11-05 20:39 ` [patch 093/262] mm/vmalloc: do not adjust the search size for alignment overhead Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 095/262] vmalloc: back off when the current task is OOM-killed Andrew Morton
                   ` (167 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, david, hch, hdanton, linux-mm, mgorman, mhocko, mm-commits,
	npiggin, oleksiy.avramchenko, pifang, rostedt, torvalds, urezki,
	willy

From: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Subject: mm/vmalloc: check various alignments when debugging

Previously we did not guarantee a free block with the lowest start address
for allocations with alignment >= PAGE_SIZE, because the alignment
overhead was included in the search length like below:

     length = size + align - 1;

which made sure that a bigger block would fit after applying the alignment
adjustment.  Now there is no such limitation, i.e.  any alignment that the
user wants to apply will result in the lowest address of the returned free
area.

Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Ping Fang <pifang@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-check-various-alignments-when-debugging
+++ a/mm/vmalloc.c
@@ -1269,7 +1269,7 @@ find_vmap_lowest_linear_match(unsigned l
 }
 
 static void
-find_vmap_lowest_match_check(unsigned long size)
+find_vmap_lowest_match_check(unsigned long size, unsigned long align)
 {
 	struct vmap_area *va_1, *va_2;
 	unsigned long vstart;
@@ -1278,8 +1278,8 @@ find_vmap_lowest_match_check(unsigned lo
 	get_random_bytes(&rnd, sizeof(rnd));
 	vstart = VMALLOC_START + rnd;
 
-	va_1 = find_vmap_lowest_match(size, 1, vstart);
-	va_2 = find_vmap_lowest_linear_match(size, 1, vstart);
+	va_1 = find_vmap_lowest_match(size, align, vstart);
+	va_2 = find_vmap_lowest_linear_match(size, align, vstart);
 
 	if (va_1 != va_2)
 		pr_emerg("not lowest: t: 0x%p, l: 0x%p, v: 0x%lx\n",
@@ -1458,7 +1458,7 @@ __alloc_vmap_area(unsigned long size, un
 		return vend;
 
 #if DEBUG_AUGMENT_LOWEST_MATCH_CHECK
-	find_vmap_lowest_match_check(size);
+	find_vmap_lowest_match_check(size, align);
 #endif
 
 	return nva_start_addr;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 095/262] vmalloc: back off when the current task is OOM-killed
  2021-11-05 20:34 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2021-11-05 20:39 ` [patch 094/262] mm/vmalloc: check various alignments when debugging Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 096/262] vmalloc: choose a better start address in vm_area_register_early() Andrew Morton
                   ` (166 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, hannes, linux-mm, mhocko, mm-commits, penguin-kernel,
	torvalds, urezki, vdavydov.dev, vvs

From: Vasily Averin <vvs@virtuozzo.com>
Subject: vmalloc: back off when the current task is OOM-killed

Huge vmalloc allocations on a heavily loaded node can lead to a global
memory shortage.  The task calling vmalloc can have the worst badness and
be selected by the OOM-killer, but a pending fatal signal does not
interrupt the allocation cycle.  Vmalloc repeats the page allocations
again and again, exacerbating the crisis and consuming the memory freed up
by other killed tasks.

After successful completion of the allocation procedure the fatal signal
will be processed and the task will finally be destroyed.  However, this
may not release the consumed memory, since the allocated object may have a
lifetime unrelated to the completed task.  In the worst case this can lead
to the host panicking with "Out of memory and no killable processes...".

This patch allows the OOM-killer to break the vmalloc cycle, making OOM
more effective and avoiding a host panic.  It does not check the OOM
condition directly; instead it breaks the page allocation cycle when a
fatal signal has been received.

This may expose some hidden problems, when a caller does not handle
vmalloc failures, or when a rollback after a failed vmalloc makes vmalloc
calls of its own.  However, all of these scenarios are incorrect: vmalloc
does not guarantee successful allocation, it has never been called with
__GFP_NOFAIL, and therefore it either should not be used for any rollbacks
or such errors should be handled correctly without leading to critical
failures.
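
A minimal sketch of the caller side (names are hypothetical): a vmalloc()
failure, which can now also happen because of a pending fatal signal, must
be handled rather than assumed away:

	#include <linux/errno.h>
	#include <linux/types.h>
	#include <linux/vmalloc.h>

	/* Hypothetical caller: back off on failure instead of assuming
	 * that vmalloc() cannot fail.
	 */
	static int build_table(size_t entries, u32 **table)
	{
		u32 *t = vmalloc(entries * sizeof(*t));

		if (!t)
			return -ENOMEM;

		*table = t;
		return 0;
	}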

Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/vmalloc.c~vmalloc-back-off-when-the-current-task-is-oom-killed
+++ a/mm/vmalloc.c
@@ -2871,6 +2871,9 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	/* High-order pages or fallback path if "bulk" fails. */
 
 	while (nr_allocated < nr_pages) {
+		if (fatal_signal_pending(current))
+			break;
+
 		if (nid == NUMA_NO_NODE)
 			page = alloc_pages(gfp, order);
 		else
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 096/262] vmalloc: choose a better start address in vm_area_register_early()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2021-11-05 20:39 ` [patch 095/262] vmalloc: back off when the current task is OOM-killed Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 097/262] arm64: support page mapping percpu first chunk allocator Andrew Morton
                   ` (165 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, andreyknvl, catalin.marinas, dvyukov, elver, gregkh,
	linux-mm, mm-commits, ryabinin.a.a, torvalds, wangkefeng.wang,
	will

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: vmalloc: choose a better start address in vm_area_register_early()

The percpu embedded first chunk allocator is the first option, but it can
fail on ARM64, e.g.,

  "percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000"
  "percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000"
  "percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000"

then we hit "WARNING: CPU: 15 PID: 461 at vmalloc.c:3087
pcpu_get_vm_areas+0x488/0x838" and the system cannot boot successfully.

Let's implement the page mapping percpu first chunk allocator as a
fallback to the embedding allocator to increase the robustness of the
system.

Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and KASAN_VMALLOC
are enabled.

Tested on ARM64 qemu with cmdline "percpu_alloc=page".


This patch (of 3):

There are some fixed locations in the vmalloc area that are reserved on
ARM (see iotable_init()) and ARM64 (see map_kernel()).  However,
pcpu_page_first_chunk() calls vm_area_register_early() and chooses
VMALLOC_START as the start address of the vmap area, which can conflict
with the addresses above and then trigger a BUG_ON in vm_area_add_early().

Let's choose a suitable start address by traversing the vmlist.

Link: https://lkml.kernel.org/r/20210910053354.26721-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210910053354.26721-2-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

--- a/mm/vmalloc.c~vmalloc-choose-a-better-start-address-in-vm_area_register_early
+++ a/mm/vmalloc.c
@@ -2276,15 +2276,21 @@ void __init vm_area_add_early(struct vm_
  */
 void __init vm_area_register_early(struct vm_struct *vm, size_t align)
 {
-	static size_t vm_init_off __initdata;
-	unsigned long addr;
+	unsigned long addr = ALIGN(VMALLOC_START, align);
+	struct vm_struct *cur, **p;
 
-	addr = ALIGN(VMALLOC_START + vm_init_off, align);
-	vm_init_off = PFN_ALIGN(addr + vm->size) - VMALLOC_START;
+	BUG_ON(vmap_initialized);
 
-	vm->addr = (void *)addr;
+	for (p = &vmlist; (cur = *p) != NULL; p = &cur->next) {
+		if ((unsigned long)cur->addr - addr >= vm->size)
+			break;
+		addr = ALIGN((unsigned long)cur->addr + cur->size, align);
+	}
 
-	vm_area_add_early(vm);
+	BUG_ON(addr > VMALLOC_END - vm->size);
+	vm->addr = (void *)addr;
+	vm->next = *p;
+	*p = vm;
 }
 
 static void vmap_init_free_space(void)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 097/262] arm64: support page mapping percpu first chunk allocator
  2021-11-05 20:34 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2021-11-05 20:39 ` [patch 096/262] vmalloc: choose a better start address in vm_area_register_early() Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 098/262] kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC Andrew Morton
                   ` (164 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, andreyknvl, catalin.marinas, dvyukov, elver, gregkh,
	linux-mm, mm-commits, ryabinin.a.a, torvalds, wangkefeng.wang,
	will

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: arm64: support page mapping percpu first chunk allocator

The percpu embedded first chunk allocator is the first option, but it can
fail on ARM64, e.g.,
  "percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000"
  "percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000"
  "percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000"

then we hit "WARNING: CPU: 15 PID: 461 at vmalloc.c:3087
pcpu_get_vm_areas+0x488/0x838" and the system cannot boot successfully.

Let's implement the page mapping percpu first chunk allocator as a
fallback to the embedding allocator to increase the robustness of the
system.

Link: https://lkml.kernel.org/r/20210910053354.26721-3-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Marco Elver <elver@google.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig       |    4 +
 drivers/base/arch_numa.c |   82 ++++++++++++++++++++++++++++++++-----
 2 files changed, 76 insertions(+), 10 deletions(-)

--- a/arch/arm64/Kconfig~arm64-support-page-mapping-percpu-first-chunk-allocator
+++ a/arch/arm64/Kconfig
@@ -1042,6 +1042,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
 	def_bool y
 	depends on NUMA
 
+config NEED_PER_CPU_PAGE_FIRST_CHUNK
+	def_bool y
+	depends on NUMA
+
 source "kernel/Kconfig.hz"
 
 config ARCH_SPARSEMEM_ENABLE
--- a/drivers/base/arch_numa.c~arm64-support-page-mapping-percpu-first-chunk-allocator
+++ a/drivers/base/arch_numa.c
@@ -14,6 +14,7 @@
 #include <linux/of.h>
 
 #include <asm/sections.h>
+#include <asm/pgalloc.h>
 
 struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
 EXPORT_SYMBOL(node_data);
@@ -168,22 +169,83 @@ static void __init pcpu_fc_free(void *pt
 	memblock_free_early(__pa(ptr), size);
 }
 
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+static void __init pcpu_populate_pte(unsigned long addr)
+{
+	pgd_t *pgd = pgd_offset_k(addr);
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	p4d = p4d_offset(pgd, addr);
+	if (p4d_none(*p4d)) {
+		pud_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		p4d_populate(&init_mm, p4d, new);
+	}
+
+	pud = pud_offset(p4d, addr);
+	if (pud_none(*pud)) {
+		pmd_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		pud_populate(&init_mm, pud, new);
+	}
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd)) {
+		pte_t *new;
+
+		new = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
+		if (!new)
+			goto err_alloc;
+		pmd_populate_kernel(&init_mm, pmd, new);
+	}
+
+	return;
+
+err_alloc:
+	panic("%s: Failed to allocate %lu bytes align=%lx from=%lx\n",
+	      __func__, PAGE_SIZE, PAGE_SIZE, PAGE_SIZE);
+}
+#endif
+
 void __init setup_per_cpu_areas(void)
 {
 	unsigned long delta;
 	unsigned int cpu;
-	int rc;
+	int rc = -EINVAL;
 
-	/*
-	 * Always reserve area for module percpu variables.  That's
-	 * what the legacy allocator did.
-	 */
-	rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
-				    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
-				    pcpu_cpu_distance,
-				    pcpu_fc_alloc, pcpu_fc_free);
+	if (pcpu_chosen_fc != PCPU_FC_PAGE) {
+		/*
+		 * Always reserve area for module percpu variables.  That's
+		 * what the legacy allocator did.
+		 */
+		rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
+					    PERCPU_DYNAMIC_RESERVE, PAGE_SIZE,
+					    pcpu_cpu_distance,
+					    pcpu_fc_alloc, pcpu_fc_free);
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+		if (rc < 0)
+			pr_warn("PERCPU: %s allocator failed (%d), falling back to page size\n",
+				   pcpu_fc_names[pcpu_chosen_fc], rc);
+#endif
+	}
+
+#ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
+	if (rc < 0)
+		rc = pcpu_page_first_chunk(PERCPU_MODULE_RESERVE,
+					   pcpu_fc_alloc,
+					   pcpu_fc_free,
+					   pcpu_populate_pte);
+#endif
 	if (rc < 0)
-		panic("Failed to initialize percpu areas.");
+		panic("Failed to initialize percpu areas (err=%d).", rc);
 
 	delta = (unsigned long)pcpu_base_addr - (unsigned long)__per_cpu_start;
 	for_each_possible_cpu(cpu)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 098/262] kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC
  2021-11-05 20:34 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2021-11-05 20:39 ` [patch 097/262] arm64: support page mapping percpu first chunk allocator Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags Andrew Morton
                   ` (163 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, andreyknvl, catalin.marinas, dvyukov, elver, gregkh,
	linux-mm, mm-commits, ryabinin.a.a, torvalds, wangkefeng.wang,
	will

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC

With KASAN_VMALLOC and NEED_PER_CPU_PAGE_FIRST_CHUNK enabled, the boot crashes:

Unable to handle kernel paging request at virtual address ffff7000028f2000
...
swapper pgtable: 64k pages, 48-bit VAs, pgdp=0000000042440000
[ffff7000028f2000] pgd=000000063e7c0003, p4d=000000063e7c0003, pud=000000063e7c0003, pmd=000000063e7b0003, pte=0000000000000000
Internal error: Oops: 96000007 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 5.13.0-rc4-00003-gc6e6e28f3f30-dirty #62
Hardware name: linux,dummy-virt (DT)
pstate: 200000c5 (nzCv daIF -PAN -UAO -TCO BTYPE=--)
pc : kasan_check_range+0x90/0x1a0
lr : memcpy+0x88/0xf4
sp : ffff80001378fe20
...
Call trace:
 kasan_check_range+0x90/0x1a0
 pcpu_page_first_chunk+0x3f0/0x568
 setup_per_cpu_areas+0xb8/0x184
 start_kernel+0x8c/0x328

The vm area used in vm_area_register_early() has no KASAN shadow memory.
Let's add a new kasan_populate_early_vm_area_shadow() function to populate
the vm area's shadow memory and fix the issue.

[wangkefeng.wang@huawei.com: fix redefinition of 'kasan_populate_early_vm_area_shadow']
  Link: https://lkml.kernel.org/r/20211011123211.3936196-1-wangkefeng.wang@huawei.com
Link: https://lkml.kernel.org/r/20210910053354.26721-4-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Marco Elver <elver@google.com>		[KASAN]
Acked-by: Andrey Konovalov <andreyknvl@gmail.com>	[KASAN]
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/kasan_init.c |   16 ++++++++++++++++
 include/linux/kasan.h      |    6 ++++++
 mm/kasan/shadow.c          |    5 +++++
 mm/vmalloc.c               |    1 +
 4 files changed, 28 insertions(+)

--- a/arch/arm64/mm/kasan_init.c~kasan-arm64-fix-pcpu_page_first_chunk-crash-with-kasan_vmalloc
+++ a/arch/arm64/mm/kasan_init.c
@@ -287,6 +287,22 @@ static void __init kasan_init_depth(void
 	init_task.kasan_depth = 0;
 }
 
+#ifdef CONFIG_KASAN_VMALLOC
+void __init kasan_populate_early_vm_area_shadow(void *start, unsigned long size)
+{
+	unsigned long shadow_start, shadow_end;
+
+	if (!is_vmalloc_or_module_addr(start))
+		return;
+
+	shadow_start = (unsigned long)kasan_mem_to_shadow(start);
+	shadow_start = ALIGN_DOWN(shadow_start, PAGE_SIZE);
+	shadow_end = (unsigned long)kasan_mem_to_shadow(start + size);
+	shadow_end = ALIGN(shadow_end, PAGE_SIZE);
+	kasan_map_populate(shadow_start, shadow_end, NUMA_NO_NODE);
+}
+#endif
+
 void __init kasan_init(void)
 {
 	kasan_init_shadow();
--- a/include/linux/kasan.h~kasan-arm64-fix-pcpu_page_first_chunk-crash-with-kasan_vmalloc
+++ a/include/linux/kasan.h
@@ -436,6 +436,8 @@ void kasan_release_vmalloc(unsigned long
 			   unsigned long free_region_start,
 			   unsigned long free_region_end);
 
+void kasan_populate_early_vm_area_shadow(void *start, unsigned long size);
+
 #else /* CONFIG_KASAN_VMALLOC */
 
 static inline int kasan_populate_vmalloc(unsigned long start,
@@ -453,6 +455,10 @@ static inline void kasan_release_vmalloc
 					 unsigned long free_region_start,
 					 unsigned long free_region_end) {}
 
+static inline void kasan_populate_early_vm_area_shadow(void *start,
+						       unsigned long size)
+{ }
+
 #endif /* CONFIG_KASAN_VMALLOC */
 
 #if (defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)) && \
--- a/mm/kasan/shadow.c~kasan-arm64-fix-pcpu_page_first_chunk-crash-with-kasan_vmalloc
+++ a/mm/kasan/shadow.c
@@ -254,6 +254,11 @@ core_initcall(kasan_memhotplug_init);
 
 #ifdef CONFIG_KASAN_VMALLOC
 
+void __init __weak kasan_populate_early_vm_area_shadow(void *start,
+						       unsigned long size)
+{
+}
+
 static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
 				      void *unused)
 {
--- a/mm/vmalloc.c~kasan-arm64-fix-pcpu_page_first_chunk-crash-with-kasan_vmalloc
+++ a/mm/vmalloc.c
@@ -2291,6 +2291,7 @@ void __init vm_area_register_early(struc
 	vm->addr = (void *)addr;
 	vm->next = *p;
 	*p = vm;
+	kasan_populate_early_vm_area_shadow(vm->addr, vm->size);
 }
 
 static void vmap_init_free_space(void)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags
  2021-11-05 20:34 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2021-11-05 20:39 ` [patch 098/262] kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-08  9:25   ` Michal Hocko
  2021-11-05 20:39 ` [patch 100/262] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation Andrew Morton
                   ` (162 subsequent siblings)
  261 siblings, 1 reply; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, david, hch, idryomov, jlayton, linux-mm, mhocko,
	mm-commits, neilb, torvalds, urezki

From: Michal Hocko <mhocko@suse.com>
Subject: mm/vmalloc: be more explicit about supported gfp flags

The core of the vmalloc allocator, __vmalloc_area_node, doesn't say
anything about its gfp mask argument.  Not all gfp flags are supported,
though.  Be more explicit about the constraints.
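
An illustrative usage sketch only (the size is arbitrary), following the
constraints spelled out in the comment added below:

	#include <linux/gfp.h>
	#include <linux/vmalloc.h>

	static void *alloc_buffer(void)
	{
		/* Supported: GFP_KERNEL (here with __GFP_NOWARN), and also
		 * GFP_NOFS / GFP_NOIO.  Not supported: GFP_NOWAIT (lacks
		 * __GFP_DIRECT_RECLAIM), __GFP_NORETRY, __GFP_RETRY_MAYFAIL
		 * and zone modifiers.
		 */
		return __vmalloc(1UL << 20, GFP_KERNEL | __GFP_NOWARN);
	}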

Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Neil Brown <neilb@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Ilya Dryomov <idryomov@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/vmalloc.c~mm-vmalloc-be-more-explicit-about-supported-gfp-flags
+++ a/mm/vmalloc.c
@@ -2983,8 +2983,16 @@ fail:
  * @caller:		  caller's return address
  *
  * Allocate enough pages to cover @size from the page level
- * allocator with @gfp_mask flags.  Map them into contiguous
- * kernel virtual space, using a pagetable protection of @prot.
+ * allocator with @gfp_mask flags. Please note that the full set of gfp
+ * flags are not supported. GFP_KERNEL would be a preferred allocation mode
+ * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
+ * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
+ * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
+ * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
+ * __GFP_NOWARN can be used to suppress error messages about failures.
+ *
+ * Map them into contiguous kernel virtual space, using a pagetable
+ * protection of @prot.
  *
  * Return: the address of the area or %NULL on failure
  */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 100/262] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation
  2021-11-05 20:34 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2021-11-05 20:39 ` [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 101/262] lib/test_vmalloc.c: use swap() to make code cleaner Andrew Morton
                   ` (161 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, chenwandun, edumazet, guohanjun, linux-mm, mm-commits,
	npiggin, shakeelb, torvalds, urezki, wangkefeng.wang

From: Chen Wandun <chenwandun@huawei.com>
Subject: mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation

"mm/vmalloc: fix numa spreading for large hash tables" will cause
significant performance regressions in some situations as Andrew
mentioned in [1].  The main situation is vmalloc, vmalloc will allocate
pages with NUMA_NO_NODE by default, that will result in alloc page one
by one;

In order to solve this, __alloc_pages_bulk and mempolicy should be
considered at the same time.

1) If node is specified in memory allocation request, it will alloc all
   pages by __alloc_pages_bulk.

2) If interleaving allocate memory, it will cauculate how many pages
   should be allocated in each node, and use __alloc_pages_bulk to alloc
   pages in each node.

[1]: https://lore.kernel.org/lkml/CALvZod4G3SzP3kWxQYn0fj+VgG-G3yWXz=gz17+3N57ru1iajw@mail.gmail.com/t/#m750c8e3231206134293b089feaa090590afa0f60
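
For item 2) above, a sketch of the interleave split (an illustrative
helper, not code from the patch):

	/* How many pages the i-th interleave node receives, mirroring the
	 * nr_pages_per_node/delta split in alloc_pages_bulk_array_interleave();
	 * e.g. 10 pages over 4 nodes gives 3, 3, 2, 2.
	 */
	static unsigned long pages_for_node(unsigned long nr_pages, int nodes, int i)
	{
		unsigned long per_node = nr_pages / nodes;
		int delta = nr_pages - nodes * per_node;

		return per_node + (i < delta ? 1 : 0);
	}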

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: make two functions static]
[akpm@linux-foundation.org: fix CONFIG_NUMA=n build]
Link: https://lkml.kernel.org/r/20211021080744.874701-3-chenwandun@huawei.com
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    4 ++
 mm/mempolicy.c      |   82 ++++++++++++++++++++++++++++++++++++++++++
 mm/vmalloc.c        |   20 ++++++++--
 3 files changed, 102 insertions(+), 4 deletions(-)

--- a/include/linux/gfp.h~mm-vmalloc-introduce-alloc_pages_bulk_array_mempolicy-to-accelerate-memory-allocation
+++ a/include/linux/gfp.h
@@ -535,6 +535,10 @@ unsigned long __alloc_pages_bulk(gfp_t g
 				struct list_head *page_list,
 				struct page **page_array);
 
+unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+				unsigned long nr_pages,
+				struct page **page_array);
+
 /* Bulk allocate order-0 pages */
 static inline unsigned long
 alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
--- a/mm/mempolicy.c~mm-vmalloc-introduce-alloc_pages_bulk_array_mempolicy-to-accelerate-memory-allocation
+++ a/mm/mempolicy.c
@@ -2196,6 +2196,88 @@ struct page *alloc_pages(gfp_t gfp, unsi
 }
 EXPORT_SYMBOL(alloc_pages);
 
+static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	int nodes;
+	unsigned long nr_pages_per_node;
+	int delta;
+	int i;
+	unsigned long nr_allocated;
+	unsigned long total_allocated = 0;
+
+	nodes = nodes_weight(pol->nodes);
+	nr_pages_per_node = nr_pages / nodes;
+	delta = nr_pages - nodes * nr_pages_per_node;
+
+	for (i = 0; i < nodes; i++) {
+		if (delta) {
+			nr_allocated = __alloc_pages_bulk(gfp,
+					interleave_nodes(pol), NULL,
+					nr_pages_per_node + 1, NULL,
+					page_array);
+			delta--;
+		} else {
+			nr_allocated = __alloc_pages_bulk(gfp,
+					interleave_nodes(pol), NULL,
+					nr_pages_per_node, NULL, page_array);
+		}
+
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+	}
+
+	return total_allocated;
+}
+
+static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	gfp_t preferred_gfp;
+	unsigned long nr_allocated = 0;
+
+	preferred_gfp = gfp | __GFP_NOWARN;
+	preferred_gfp &= ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
+
+	nr_allocated  = __alloc_pages_bulk(preferred_gfp, nid, &pol->nodes,
+					   nr_pages, NULL, page_array);
+
+	if (nr_allocated < nr_pages)
+		nr_allocated += __alloc_pages_bulk(gfp, numa_node_id(), NULL,
+				nr_pages - nr_allocated, NULL,
+				page_array + nr_allocated);
+	return nr_allocated;
+}
+
+/* alloc pages bulk and mempolicy should be considered at the
+ * same time in some situation such as vmalloc.
+ *
+ * It can accelerate memory allocation especially interleaving
+ * allocate memory.
+ */
+unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
+		unsigned long nr_pages, struct page **page_array)
+{
+	struct mempolicy *pol = &default_policy;
+
+	if (!in_interrupt() && !(gfp & __GFP_THISNODE))
+		pol = get_task_policy(current);
+
+	if (pol->mode == MPOL_INTERLEAVE)
+		return alloc_pages_bulk_array_interleave(gfp, pol,
+							 nr_pages, page_array);
+
+	if (pol->mode == MPOL_PREFERRED_MANY)
+		return alloc_pages_bulk_array_preferred_many(gfp,
+				numa_node_id(), pol, nr_pages, page_array);
+
+	return __alloc_pages_bulk(gfp, policy_node(gfp, pol, numa_node_id()),
+				  policy_nodemask(gfp, pol), nr_pages, NULL,
+				  page_array);
+}
+
 int vma_dup_policy(struct vm_area_struct *src, struct vm_area_struct *dst)
 {
 	struct mempolicy *pol = mpol_dup(vma_policy(src));
--- a/mm/vmalloc.c~mm-vmalloc-introduce-alloc_pages_bulk_array_mempolicy-to-accelerate-memory-allocation
+++ a/mm/vmalloc.c
@@ -2843,7 +2843,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * to fails, fallback to a single page allocator that is
 	 * more permissive.
 	 */
-	if (!order && nid != NUMA_NO_NODE) {
+	if (!order) {
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2855,8 +2855,20 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 */
 			nr_pages_request = min(100U, nr_pages - nr_allocated);
 
-			nr = alloc_pages_bulk_array_node(gfp, nid,
-				nr_pages_request, pages + nr_allocated);
+			/* memory allocation should consider mempolicy, we can't
+			 * wrongly use nearest node when nid == NUMA_NO_NODE,
+			 * otherwise memory may be allocated in only one node,
+			 * but mempolcy want to alloc memory by interleaving.
+			 */
+			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
+				nr = alloc_pages_bulk_array_mempolicy(gfp,
+							nr_pages_request,
+							pages + nr_allocated);
+
+			else
+				nr = alloc_pages_bulk_array_node(gfp, nid,
+							nr_pages_request,
+							pages + nr_allocated);
 
 			nr_allocated += nr;
 			cond_resched();
@@ -2868,7 +2880,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			if (nr != nr_pages_request)
 				break;
 		}
-	} else if (order)
+	} else
 		/*
 		 * Compound pages required for remap_vmalloc_page if
 		 * high-order pages.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 101/262] lib/test_vmalloc.c: use swap() to make code cleaner
  2021-11-05 20:34 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2021-11-05 20:39 ` [patch 100/262] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:39 ` [patch 102/262] mm/large system hash: avoid possible NULL deref in alloc_large_system_hash Andrew Morton
                   ` (160 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, deng.changcheng, linux-mm, mm-commits, torvalds, urezki, zealci

From: Changcheng Deng <deng.changcheng@zte.com.cn>
Subject: lib/test_vmalloc.c: use swap() to make code cleaner

Use swap() in order to make code cleaner. Issue found by coccinelle.

Link: https://lkml.kernel.org/r/20211028111443.15744-1-deng.changcheng@zte.com.cn
Signed-off-by: Changcheng Deng <deng.changcheng@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_vmalloc.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/lib/test_vmalloc.c~lib-test_vmallocc-use-swap-to-make-code-cleaner
+++ a/lib/test_vmalloc.c
@@ -393,7 +393,7 @@ static struct test_driver {
 static void shuffle_array(int *arr, int n)
 {
 	unsigned int rnd;
-	int i, j, x;
+	int i, j;
 
 	for (i = n - 1; i > 0; i--)  {
 		get_random_bytes(&rnd, sizeof(rnd));
@@ -402,9 +402,7 @@ static void shuffle_array(int *arr, int
 		j = rnd % i;
 
 		/* Swap indexes. */
-		x = arr[i];
-		arr[i] = arr[j];
-		arr[j] = x;
+		swap(arr[i], arr[j]);
 	}
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 102/262] mm/large system hash: avoid possible NULL deref in alloc_large_system_hash
  2021-11-05 20:34 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2021-11-05 20:39 ` [patch 101/262] lib/test_vmalloc.c: use swap() to make code cleaner Andrew Morton
@ 2021-11-05 20:39 ` Andrew Morton
  2021-11-05 20:40 ` [patch 103/262] mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order() Andrew Morton
                   ` (159 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:39 UTC (permalink / raw)
  To: akpm, edumazet, linux-mm, mm-commits, npiggin, torvalds

From: Eric Dumazet <edumazet@google.com>
Subject: mm/large system hash: avoid possible NULL deref in alloc_large_system_hash

If __vmalloc() returns NULL, is_vm_area_hugepages(NULL) will fault when
CONFIG_HAVE_ARCH_HUGE_VMALLOC=y.

Link: https://lkml.kernel.org/r/20210915212530.2321545-1-eric.dumazet@gmail.com
Fixes: 121e6f3258fe ("mm/vmalloc: hugepage vmalloc mappings")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-large-system-hash-avoid-possible-null-deref-in-alloc_large_system_hash
+++ a/mm/page_alloc.c
@@ -8762,7 +8762,8 @@ void *__init alloc_large_system_hash(con
 		} else if (get_order(size) >= MAX_ORDER || hashdist) {
 			table = __vmalloc(size, gfp_flags);
 			virt = true;
-			huge = is_vm_area_hugepages(table);
+			if (table)
+				huge = is_vm_area_hugepages(table);
 		} else {
 			/*
 			 * If bucketsize is not a power-of-two, we may free
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 103/262] mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2021-11-05 20:39 ` [patch 102/262] mm/large system hash: avoid possible NULL deref in alloc_large_system_hash Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 104/262] mm/page_alloc.c: simplify the code by using macro K() Andrew Morton
                   ` (158 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mgorman, mm-commits, peterz,
	sfr, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()

Patch series "Cleanups and fixup for page_alloc", v2.

This series contains cleanups to remove a meaningless VM_BUG_ON(), use
helpers to simplify the code and remove an obsolete comment.  It also
avoids allocating highmem pages via alloc_pages_exact[_nid].  More details
can be found in the respective changelogs.


This patch (of 5):

It's meaningless to VM_BUG_ON() order != pageblock_order just after
setting order to pageblock_order.  Remove it.

Link: https://lkml.kernel.org/r/20210902121242.41607-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210902121242.41607-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-remove-meaningless-vm_bug_on-in-pindex_to_order
+++ a/mm/page_alloc.c
@@ -677,10 +677,8 @@ static inline int pindex_to_order(unsign
 	int order = pindex / MIGRATE_PCPTYPES;
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order > PAGE_ALLOC_COSTLY_ORDER) {
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		order = pageblock_order;
-		VM_BUG_ON(order != pageblock_order);
-	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
 #endif
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 104/262] mm/page_alloc.c: simplify the code by using macro K()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2021-11-05 20:40 ` [patch 103/262] mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 105/262] mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk() Andrew Morton
                   ` (157 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mgorman, mm-commits, peterz,
	sfr, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_alloc.c: simplify the code by using macro K()

Use the helper macro K() to convert page counts to the corresponding size
in kilobytes.  Minor readability improvement.
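
For reference, a sketch of the helper macro as conventionally defined in
mm code (shown here only for context):

	#define K(x) ((x) << (PAGE_SHIFT - 10))		/* pages -> KiB */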

Link: https://lkml.kernel.org/r/20210902121242.41607-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-simplify-the-code-by-using-macro-k
+++ a/mm/page_alloc.c
@@ -8130,8 +8130,7 @@ unsigned long free_reserved_area(void *s
 	}
 
 	if (pages && s)
-		pr_info("Freeing %s memory: %ldK\n",
-			s, pages << (PAGE_SHIFT - 10));
+		pr_info("Freeing %s memory: %ldK\n", s, K(pages));
 
 	return pages;
 }
@@ -8176,14 +8175,13 @@ void __init mem_init_print_info(void)
 		", %luK highmem"
 #endif
 		")\n",
-		nr_free_pages() << (PAGE_SHIFT - 10),
-		physpages << (PAGE_SHIFT - 10),
+		K(nr_free_pages()), K(physpages),
 		codesize >> 10, datasize >> 10, rosize >> 10,
 		(init_data_size + init_code_size) >> 10, bss_size >> 10,
-		(physpages - totalram_pages() - totalcma_pages) << (PAGE_SHIFT - 10),
-		totalcma_pages << (PAGE_SHIFT - 10)
+		K(physpages - totalram_pages() - totalcma_pages),
+		K(totalcma_pages)
 #ifdef	CONFIG_HIGHMEM
-		, totalhigh_pages() << (PAGE_SHIFT - 10)
+		, K(totalhigh_pages())
 #endif
 		);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 105/262] mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2021-11-05 20:40 ` [patch 104/262] mm/page_alloc.c: simplify the code by using macro K() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 106/262] mm/page_alloc.c: use helper function zone_spans_pfn() Andrew Morton
                   ` (156 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mgorman, mm-commits, peterz,
	sfr, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()

The last two paragraphs, about "all pages pinned" and pages_scanned, are
obsolete.  And there are PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP orders
on the pcp lists, so the same-order assumption no longer holds.

Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-fix-obsolete-comment-in-free_pcppages_bulk
+++ a/mm/page_alloc.c
@@ -1428,14 +1428,8 @@ static inline void prefetch_buddy(struct
 
 /*
  * Frees a number of pages from the PCP lists
- * Assumes all pages on list are in same zone, and of same order.
+ * Assumes all pages on list are in same zone.
  * count is the number of pages to free.
- *
- * If the zone was previously in an "all pages pinned" state then look to
- * see if this freeing clears that state.
- *
- * And clear the zone's pages_scanned counter, to hold off the "all pages are
- * pinned" detection logic.
  */
 static void free_pcppages_bulk(struct zone *zone, int count,
 					struct per_cpu_pages *pcp)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 106/262] mm/page_alloc.c: use helper function zone_spans_pfn()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2021-11-05 20:40 ` [patch 105/262] mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 107/262] mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid] Andrew Morton
                   ` (155 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mgorman, mm-commits, peterz,
	sfr, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_alloc.c: use helper function zone_spans_pfn()

Use the helper function zone_spans_pfn() to check whether a pfn is within
a zone, which simplifies the code slightly.
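
For context, the helper is equivalent to the open-coded check it replaces
in the hunk below (a sketch of its definition):

	static inline bool zone_spans_pfn(const struct zone *zone, unsigned long pfn)
	{
		return zone->zone_start_pfn <= pfn && pfn < zone_end_pfn(zone);
	}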

Link: https://lkml.kernel.org/r/20210902121242.41607-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_allocc-use-helper-function-zone_spans_pfn
+++ a/mm/page_alloc.c
@@ -1583,7 +1583,7 @@ static void __meminit init_reserved_page
 	for (zid = 0; zid < MAX_NR_ZONES; zid++) {
 		struct zone *zone = &pgdat->node_zones[zid];
 
-		if (pfn >= zone->zone_start_pfn && pfn < zone_end_pfn(zone))
+		if (zone_spans_pfn(zone, pfn))
 			break;
 	}
 	__init_single_page(pfn_to_page(pfn), pfn, zid, nid);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 107/262] mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid]
  2021-11-05 20:34 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2021-11-05 20:40 ` [patch 106/262] mm/page_alloc.c: use helper function zone_spans_pfn() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 108/262] mm/page_alloc: print node fallback order Andrew Morton
                   ` (154 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mgorman, mm-commits, peterz,
	sfr, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid]

Don't use alloc_pages_exact[_nid]() with __GFP_HIGHMEM, because
page_address() cannot represent highmem pages without kmap().  Newly
allocated pages would leak, as page_address() returns NULL for highmem
pages here.  It only works today because the callers do not currently pass
__GFP_HIGHMEM.
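
A hypothetical illustration of the leak this guards against in the _nid
variant, which allocates the pages first and only then takes their
page_address() (no in-tree caller actually does this):

	#include <linux/gfp.h>

	/* On a HIGHMEM configuration the pages get allocated, but
	 * page_address() returns NULL because they have no permanent
	 * kernel mapping, so the caller receives NULL and can never free
	 * the pages.  After this patch the flag is warned about and
	 * masked off instead.
	 */
	static void *bad_exact_alloc(int nid, size_t size)
	{
		return alloc_pages_exact_nid(nid, size,
					     GFP_KERNEL | __GFP_HIGHMEM);
	}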

Link: https://lkml.kernel.org/r/20210902121242.41607-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-avoid-allocating-highmem-pages-via-alloc_pages_exact
+++ a/mm/page_alloc.c
@@ -5610,8 +5610,8 @@ void *alloc_pages_exact(size_t size, gfp
 	unsigned int order = get_order(size);
 	unsigned long addr;
 
-	if (WARN_ON_ONCE(gfp_mask & __GFP_COMP))
-		gfp_mask &= ~__GFP_COMP;
+	if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
+		gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
 
 	addr = __get_free_pages(gfp_mask, order);
 	return make_alloc_exact(addr, order, size);
@@ -5635,8 +5635,8 @@ void * __meminit alloc_pages_exact_nid(i
 	unsigned int order = get_order(size);
 	struct page *p;
 
-	if (WARN_ON_ONCE(gfp_mask & __GFP_COMP))
-		gfp_mask &= ~__GFP_COMP;
+	if (WARN_ON_ONCE(gfp_mask & (__GFP_COMP | __GFP_HIGHMEM)))
+		gfp_mask &= ~(__GFP_COMP | __GFP_HIGHMEM);
 
 	p = alloc_pages_node(nid, gfp_mask, order);
 	if (!p)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 108/262] mm/page_alloc: print node fallback order
  2021-11-05 20:34 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2021-11-05 20:40 ` [patch 107/262] mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid] Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 109/262] mm/page_alloc: use accumulated load when building node fallback list Andrew Morton
                   ` (153 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bharata, kamezawa.hiroyu,
	krupa.ramakrishnan, lee.schermerhorn, linux-mm, mgorman,
	mm-commits, Sadagopan.Srinivasan, torvalds

From: Bharata B Rao <bharata@amd.com>
Subject: mm/page_alloc: print node fallback order

Patch series "Fix NUMA nodes fallback list ordering".

For a NUMA system that has multiple nodes at the same distance from other
nodes, the fallback list generation prefers the same node order for them
instead of round-robin, thereby penalizing one node over the others.  This
series fixes it.

More description of the problem and the fix is present in the patch
description.


This patch (of 2):

Print an informational message about the allocation fallback order for
each NUMA node during boot.

No functional changes here.  This makes it easier to illustrate the
problem in the node fallback list generation, which the next patch fixes.
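
With this patch each node gets one line in the boot log, e.g.
(illustrative values, following the pr_info()/pr_cont() format added
below):

	Fallback order for Node 0: 0 1 2 3
	Fallback order for Node 1: 1 0 3 2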

Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/mm/page_alloc.c~mm-page_alloc-print-node-fallback-order
+++ a/mm/page_alloc.c
@@ -6262,6 +6262,10 @@ static void build_zonelists(pg_data_t *p
 
 	build_zonelists_in_node_order(pgdat, node_order, nr_nodes);
 	build_thisnode_zonelists(pgdat);
+	pr_info("Fallback order for Node %d: ", local_node);
+	for (node = 0; node < nr_nodes; node++)
+		pr_cont("%d ", node_order[node]);
+	pr_cont("\n");
 }
 
 #ifdef CONFIG_HAVE_MEMORYLESS_NODES
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 109/262] mm/page_alloc: use accumulated load when building node fallback list
  2021-11-05 20:34 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2021-11-05 20:40 ` [patch 108/262] mm/page_alloc: print node fallback order Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 110/262] mm: move node_reclaim_distance to fix NUMA without SMP Andrew Morton
                   ` (152 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bharata, kamezawa.hiroyu,
	krupa.ramakrishnan, lee.schermerhorn, linux-mm, mgorman,
	mm-commits, Sadagopan.Srinivasan, torvalds

From: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Subject: mm/page_alloc: use accumulated load when building node fallback list

In build_zonelists(), when the fallback list is built for the nodes, the
node load gets reinitialized during each iteration.  This results in nodes
with the same distance occupying the same slot in different node fallback
lists rather than appearing in the intended round-robin manner.  As a
consequence, one node gets picked for allocation more often than other
nodes with the same distance.

As an example, consider a 4 node system with the following distance
matrix.

Node 0  1  2  3
----------------
0    10 12 32 32
1    12 10 32 32
2    32 32 10 12
3    32 32 12 10

For this case, the node fallback list gets built like this:

Node	Fallback list
---------------------
0	0 1 2 3
1	1 0 3 2
2	2 3 0 1
3	3 2 0 1 <-- Unexpected fallback order

In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
same order which results in more allocations getting satisfied from node 0
compared to node 1.

The effect of this on remote memory bandwidth as seen by stream benchmark
is shown below:

Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)

----------------------------------------
		BANDWIDTH (MB/s)
    TEST	Case 1		Case 2
----------------------------------------
    COPY	57479.6		110791.8
   SCALE	55372.9		105685.9
     ADD	50460.6		96734.2
  TRIADD	50397.6		97119.1
----------------------------------------

The bandwidth drop in Case 1 occurs because most of the allocations get
satisfied by node 0 as it appears first in the fallback order for both
nodes 2 and 3.

This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration.  With this, nodes
with the same distance correctly get assigned in round-robin order.  In
fact this was how it worked originally, until commit f0c0b2b808f2 ("change
zonelist order: zonelist order selection logic") dropped the load
accumulation and resorted to initializing the load during each iteration.
While zonelist ordering was removed by commit c9bff3eebc09 ("mm,
page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
accumulation in build_zonelists() remained.  So essentially this patch
reverts to the accumulated node load logic.
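
In short, the relevant loop in build_zonelists() changes as follows (a
simplified sketch of the loop, details elided):

	while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
		if (node_distance(local_node, node) !=
		    node_distance(local_node, prev_node))
			node_load[node] += load;	/* was: node_load[node] = load; */

		node_order[nr_nodes++] = node;
		prev_node = node;
		load--;
	}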

After this fix, the fallback order gets built like this:

Node Fallback list
------------------
0    0 1 2 3
1    1 0 3 2
2    2 3 0 1
3    3 2 1 0 <-- Note the change here

The bandwidth in Case 1 improves and matches Case 2 as shown below.

----------------------------------------
		BANDWIDTH (MB/s)
    TEST	Case 1		Case 2
----------------------------------------
    COPY	110438.9	110107.2
   SCALE	105930.5	105817.5
     ADD	97005.1		96159.8
  TRIADD	97441.5		96757.1
----------------------------------------

The correctness of the fallback list generation has been verified for the
above node configuration where the node 3 starts as memory-less node and
comes up online only during memory hotplug.

[bharata@amd.com: Added changelog, review, test validation]
Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
Fixes: f0c0b2b808f2 ("change zonelist order: zonelist order selection logic")
Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
Signed-off-by: Bharata B Rao <bharata@amd.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-page_alloc-use-accumulated-load-when-building-node-fallback-list
+++ a/mm/page_alloc.c
@@ -6253,7 +6253,7 @@ static void build_zonelists(pg_data_t *p
 		 */
 		if (node_distance(local_node, node) !=
 		    node_distance(local_node, prev_node))
-			node_load[node] = load;
+			node_load[node] += load;
 
 		node_order[nr_nodes++] = node;
 		prev_node = node;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 110/262] mm: move node_reclaim_distance to fix NUMA without SMP
  2021-11-05 20:34 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2021-11-05 20:40 ` [patch 109/262] mm/page_alloc: use accumulated load when building node fallback list Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 111/262] mm: move fold_vm_numa_events() " Andrew Morton
                   ` (151 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, dalias, geert+renesas, gonsolo, juri.lelli, linux-mm, matt,
	mgorman, mingo, mm-commits, peterz, torvalds, vbabka,
	vincent.guittot, ysato

From: Geert Uytterhoeven <geert+renesas@glider.be>
Subject: mm: move node_reclaim_distance to fix NUMA without SMP

Patch series "Fix NUMA without SMP".

SuperH is the only architecture which still supports NUMA without SMP, for
good reasons (various memories scattered around the address space, each
with varying latencies).  This series fixes two build errors due to
variables and functions used by the NUMA code being provided by SMP-only
source files or sections.


This patch (of 2):

If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):

    sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
    page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'

Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.

Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be
Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be
Fixes: a55c7454a8c887b2 ("sched/topology: Improve load balancing on AMD EPYC systems")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Suggested-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Cc: Rich Felker <dalias@libc.org>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/sched/topology.c |    1 -
 mm/page_alloc.c         |    2 ++
 2 files changed, 2 insertions(+), 1 deletion(-)

--- a/kernel/sched/topology.c~mm-move-node_reclaim_distance-to-fix-numa-without-smp
+++ a/kernel/sched/topology.c
@@ -1481,7 +1481,6 @@ static int			sched_domains_curr_level;
 int				sched_max_numa_distance;
 static int			*sched_domains_numa_distance;
 static struct cpumask		***sched_domains_numa_masks;
-int __read_mostly		node_reclaim_distance = RECLAIM_DISTANCE;
 
 static unsigned long __read_mostly *sched_numa_onlined_nodes;
 #endif
--- a/mm/page_alloc.c~mm-move-node_reclaim_distance-to-fix-numa-without-smp
+++ a/mm/page_alloc.c
@@ -3960,6 +3960,8 @@ bool zone_watermark_ok_safe(struct zone
 }
 
 #ifdef CONFIG_NUMA
+int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
+
 static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 {
 	return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 111/262] mm: move fold_vm_numa_events() to fix NUMA without SMP
  2021-11-05 20:34 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2021-11-05 20:40 ` [patch 110/262] mm: move node_reclaim_distance to fix NUMA without SMP Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 112/262] mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page() Andrew Morton
                   ` (150 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, dalias, geert+renesas, gonsolo, juri.lelli, linux-mm, matt,
	mgorman, mingo, mm-commits, peterz, torvalds, vbabka,
	vincent.guittot, ysato

From: Geert Uytterhoeven <geert+renesas@glider.be>
Subject: mm: move fold_vm_numa_events() to fix NUMA without SMP

If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):

    sh4-linux-gnu-ld: mm/vmstat.o: in function `vmstat_start':
    vmstat.c:(.text+0x97c): undefined reference to `fold_vm_numa_events'
    sh4-linux-gnu-ld: drivers/base/node.o: in function `node_read_vmstat':
    node.c:(.text+0x140): undefined reference to `fold_vm_numa_events'
    sh4-linux-gnu-ld: drivers/base/node.o: in function `node_read_numastat':
    node.c:(.text+0x1d0): undefined reference to `fold_vm_numa_events'

Fix this by moving fold_vm_numa_events() outside the SMP-only section.

Link: https://lkml.kernel.org/r/9d16ccdd9ef32803d7100c84f737de6a749314fb.1631781495.git.geert+renesas@glider.be
Fixes: f19298b9516c1a03 ("mm/vmstat: convert NUMA statistics to basic NUMA counters")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Gon Solo <gonsolo@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmstat.c |   56 +++++++++++++++++++++++++-------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

--- a/mm/vmstat.c~mm-move-fold_vm_numa_events-to-fix-numa-without-smp
+++ a/mm/vmstat.c
@@ -165,6 +165,34 @@ atomic_long_t vm_numa_event[NR_VM_NUMA_E
 EXPORT_SYMBOL(vm_zone_stat);
 EXPORT_SYMBOL(vm_node_stat);
 
+#ifdef CONFIG_NUMA
+static void fold_vm_zone_numa_events(struct zone *zone)
+{
+	unsigned long zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, };
+	int cpu;
+	enum numa_stat_item item;
+
+	for_each_online_cpu(cpu) {
+		struct per_cpu_zonestat *pzstats;
+
+		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
+		for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
+			zone_numa_events[item] += xchg(&pzstats->vm_numa_event[item], 0);
+	}
+
+	for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
+		zone_numa_event_add(zone_numa_events[item], zone, item);
+}
+
+void fold_vm_numa_events(void)
+{
+	struct zone *zone;
+
+	for_each_populated_zone(zone)
+		fold_vm_zone_numa_events(zone);
+}
+#endif
+
 #ifdef CONFIG_SMP
 
 int calculate_pressure_threshold(struct zone *zone)
@@ -771,34 +799,6 @@ static int fold_diff(int *zone_diff, int
 	return changes;
 }
 
-#ifdef CONFIG_NUMA
-static void fold_vm_zone_numa_events(struct zone *zone)
-{
-	unsigned long zone_numa_events[NR_VM_NUMA_EVENT_ITEMS] = { 0, };
-	int cpu;
-	enum numa_stat_item item;
-
-	for_each_online_cpu(cpu) {
-		struct per_cpu_zonestat *pzstats;
-
-		pzstats = per_cpu_ptr(zone->per_cpu_zonestats, cpu);
-		for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
-			zone_numa_events[item] += xchg(&pzstats->vm_numa_event[item], 0);
-	}
-
-	for (item = 0; item < NR_VM_NUMA_EVENT_ITEMS; item++)
-		zone_numa_event_add(zone_numa_events[item], zone, item);
-}
-
-void fold_vm_numa_events(void)
-{
-	struct zone *zone;
-
-	for_each_populated_zone(zone)
-		fold_vm_zone_numa_events(zone);
-}
-#endif
-
 /*
  * Update the zone counters for the current cpu.
  *
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 112/262] mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2021-11-05 20:40 ` [patch 111/262] mm: move fold_vm_numa_events() " Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 113/262] mm/page_alloc: detect allocation forbidden by cpuset and bail out early Andrew Morton
                   ` (149 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, edumazet, hughd, linux-mm, mm-commits, torvalds

From: Eric Dumazet <edumazet@google.com>
Subject: mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()

Grabbing the zone lock in is_free_buddy_page() gives a false sense of
safety, and has potential performance implications when the zone is
experiencing lock contention.

In any case, if a caller needs a stable result, it should grab the zone
lock before calling this function.
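
A caller that does need a stable answer would take the lock itself, along
the lines of this hypothetical sketch:

	struct zone *zone = page_zone(page);
	unsigned long flags;
	bool free;

	spin_lock_irqsave(&zone->lock, flags);
	free = is_free_buddy_page(page);
	spin_unlock_irqrestore(&zone->lock, flags);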

Link: https://lkml.kernel.org/r/20210922152833.4023972-1-eric.dumazet@gmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

--- a/mm/page_alloc.c~mm-do-not-acquire-zone-lock-in-is_free_buddy_page
+++ a/mm/page_alloc.c
@@ -9356,21 +9356,21 @@ void __offline_isolated_pages(unsigned l
 }
 #endif
 
+/*
+ * This function returns a stable result only if called under zone lock.
+ */
 bool is_free_buddy_page(struct page *page)
 {
-	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
 	unsigned int order;
 
-	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
 		struct page *page_head = page - (pfn & ((1 << order) - 1));
 
-		if (PageBuddy(page_head) && buddy_order(page_head) >= order)
+		if (PageBuddy(page_head) &&
+		    buddy_order_unsafe(page_head) >= order)
 			break;
 	}
-	spin_unlock_irqrestore(&zone->lock, flags);
 
 	return order < MAX_ORDER;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 113/262] mm/page_alloc: detect allocation forbidden by cpuset and bail out early
  2021-11-05 20:34 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2021-11-05 20:40 ` [patch 112/262] mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 114/262] mm/page_alloc.c: show watermark_boost of zone in zoneinfo Andrew Morton
                   ` (148 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, feng.tang, hannes, linux-mm, lizefan.x, mgorman, mhocko,
	mm-commits, rientjes, tj, torvalds, vbabka

From: Feng Tang <feng.tang@intel.com>
Subject: mm/page_alloc: detect allocation forbidden by cpuset and bail out early

There was a report that starting an Ubuntu container in docker, while
using cpuset to bind it to movable nodes (nodes that only have a movable
zone, like a node reserved for hotplug or a Persistent Memory node in
normal usage), fails due to memory allocation failure; the OOM killer then
gets involved and many other innocent processes get killed.  It can be
reproduced with the following command (node 4 is a movable node):

  $ docker run -it --rm --cpuset-mems 4 ubuntu:latest \
        bash -c "grep Mems_allowed /proc/self/status"

  runc:[2:INIT] invoked oom-killer: gfp_mask=0x500cc2(GFP_HIGHUSER|__GFP_ACCOUNT), order=0, oom_score_adj=0
  CPU: 8 PID: 8291 Comm: runc:[2:INIT] Tainted: G        W I E     5.8.2-0.g71b519a-default #1 openSUSE Tumbleweed (unreleased)
  Hardware name: Dell Inc. PowerEdge R640/0PHYDR, BIOS 2.6.4 04/09/2020
  Call Trace:
   dump_stack+0x6b/0x88
   dump_header+0x4a/0x1e2
   oom_kill_process.cold+0xb/0x10
   out_of_memory.part.0+0xaf/0x230
   out_of_memory+0x3d/0x80
   __alloc_pages_slowpath.constprop.0+0x954/0xa20
   __alloc_pages_nodemask+0x2d3/0x300
   pipe_write+0x322/0x590
   new_sync_write+0x196/0x1b0
   vfs_write+0x1c3/0x1f0
   ksys_write+0xa7/0xe0
   do_syscall_64+0x52/0xd0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

  Mem-Info:
  active_anon:392832 inactive_anon:182 isolated_anon:0
   active_file:68130 inactive_file:151527 isolated_file:0
   unevictable:2701 dirty:0 writeback:7
   slab_reclaimable:51418 slab_unreclaimable:116300
   mapped:45825 shmem:735 pagetables:2540 bounce:0
   free:159849484 free_pcp:73 free_cma:0
  Node 4 active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:0kB dirty:0kB writeback:0kB shmem:0kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB all_unreclaimable? no
  Node 4 Movable free:130021408kB min:9140kB low:139160kB high:269180kB reserved_highatomic:0KB active_anon:1448kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:130023424kB managed:130023424kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:292kB local_pcp:84kB free_cma:0kB
  lowmem_reserve[]: 0 0 0 0 0
  Node 4 Movable: 1*4kB (M) 0*8kB 0*16kB 1*32kB (M) 0*64kB 0*128kB 1*256kB (M) 1*512kB (M) 1*1024kB (M) 0*2048kB 31743*4096kB (M) = 130021156kB

  oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
  Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
  oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
  oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
  oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The reason is that, in this case, the target cpuset nodes only have a
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal), e.g. for
GFP_HIGHUSER requests.  The cpuset limit forbids the allocation, and
out-of-memory killing is then invoked even though the normal nodes and the
movable nodes both have plenty of free memory.

The OOM killer cannot help resolve the situation, as there is no usable
memory for the request within the cpuset scope.  The only reasonable
measure to take is to fail the allocation right away and have the caller
deal with it.

So add a check for cases like this in the allocation slowpath and bail
out early, returning NULL for the allocation.

As page allocation is one of the hottest paths in the kernel and this
check would hurt all users with a sane cpuset configuration, add a static
branch and detect the abnormal configuration in the cpuset memory binding
setup, so that the cost of the extra check in page allocation is not paid
by everyone.

[thanks to Michal Hocko and David Rientjes for suggesting not handling
 it inside OOM code, adding the cpuset check, and refining comments]

Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/cpuset.h |   17 +++++++++++++++++
 include/linux/mmzone.h |   22 ++++++++++++++++++++++
 kernel/cgroup/cpuset.c |   23 +++++++++++++++++++++++
 mm/page_alloc.c        |   13 +++++++++++++
 4 files changed, 75 insertions(+)

--- a/include/linux/cpuset.h~mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early
+++ a/include/linux/cpuset.h
@@ -34,6 +34,8 @@
  */
 extern struct static_key_false cpusets_pre_enable_key;
 extern struct static_key_false cpusets_enabled_key;
+extern struct static_key_false cpusets_insane_config_key;
+
 static inline bool cpusets_enabled(void)
 {
 	return static_branch_unlikely(&cpusets_enabled_key);
@@ -51,6 +53,19 @@ static inline void cpuset_dec(void)
 	static_branch_dec_cpuslocked(&cpusets_pre_enable_key);
 }
 
+/*
+ * This will get enabled whenever a cpuset configuration is considered
+ * unsupportable in general. E.g. movable only node which cannot satisfy
+ * any non movable allocations (see update_nodemask). Page allocator
+ * needs to make additional checks for those configurations and this
+ * check is meant to guard those checks without any overhead for sane
+ * configurations.
+ */
+static inline bool cpusets_insane_config(void)
+{
+	return static_branch_unlikely(&cpusets_insane_config_key);
+}
+
 extern int cpuset_init(void);
 extern void cpuset_init_smp(void);
 extern void cpuset_force_rebuild(void);
@@ -167,6 +182,8 @@ static inline void set_mems_allowed(node
 
 static inline bool cpusets_enabled(void) { return false; }
 
+static inline bool cpusets_insane_config(void) { return false; }
+
 static inline int cpuset_init(void) { return 0; }
 static inline void cpuset_init_smp(void) {}
 
--- a/include/linux/mmzone.h~mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early
+++ a/include/linux/mmzone.h
@@ -1220,6 +1220,28 @@ static inline struct zoneref *first_zone
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
 	for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+/* Whether the 'nodes' are all movable nodes */
+static inline bool movable_only_nodes(nodemask_t *nodes)
+{
+	struct zonelist *zonelist;
+	struct zoneref *z;
+	int nid;
+
+	if (nodes_empty(*nodes))
+		return false;
+
+	/*
+	 * We can chose arbitrary node from the nodemask to get a
+	 * zonelist as they are interlinked. We just need to find
+	 * at least one zone that can satisfy kernel allocations.
+	 */
+	nid = first_node(*nodes);
+	zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];
+	z = first_zones_zonelist(zonelist, ZONE_NORMAL,	nodes);
+	return (!z->zone) ? true : false;
+}
+
+
 #ifdef CONFIG_SPARSEMEM
 #include <asm/sparsemem.h>
 #endif
--- a/kernel/cgroup/cpuset.c~mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early
+++ a/kernel/cgroup/cpuset.c
@@ -69,6 +69,13 @@
 DEFINE_STATIC_KEY_FALSE(cpusets_pre_enable_key);
 DEFINE_STATIC_KEY_FALSE(cpusets_enabled_key);
 
+/*
+ * There could be abnormal cpuset configurations for cpu or memory
+ * node binding, add this key to provide a quick low-cost judgement
+ * of the situation.
+ */
+DEFINE_STATIC_KEY_FALSE(cpusets_insane_config_key);
+
 /* See "Frequency meter" comments, below. */
 
 struct fmeter {
@@ -372,6 +379,17 @@ static DECLARE_WORK(cpuset_hotplug_work,
 
 static DECLARE_WAIT_QUEUE_HEAD(cpuset_attach_wq);
 
+static inline void check_insane_mems_config(nodemask_t *nodes)
+{
+	if (!cpusets_insane_config() &&
+		movable_only_nodes(nodes)) {
+		static_branch_enable(&cpusets_insane_config_key);
+		pr_info("Unsupported (movable nodes only) cpuset configuration detected (nmask=%*pbl)!\n"
+			"Cpuset allocations might fail even with a lot of memory available.\n",
+			nodemask_pr_args(nodes));
+	}
+}
+
 /*
  * Cgroup v2 behavior is used on the "cpus" and "mems" control files when
  * on default hierarchy or when the cpuset_v2_mode flag is set by mounting
@@ -1870,6 +1888,8 @@ static int update_nodemask(struct cpuset
 	if (retval < 0)
 		goto done;
 
+	check_insane_mems_config(&trialcs->mems_allowed);
+
 	spin_lock_irq(&callback_lock);
 	cs->mems_allowed = trialcs->mems_allowed;
 	spin_unlock_irq(&callback_lock);
@@ -3173,6 +3193,9 @@ update_tasks:
 	cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
 	mems_updated = !nodes_equal(new_mems, cs->effective_mems);
 
+	if (mems_updated)
+		check_insane_mems_config(&new_mems);
+
 	if (is_in_v2_mode())
 		hotplug_update_tasks(cs, &new_cpus, &new_mems,
 				     cpus_updated, mems_updated);
--- a/mm/page_alloc.c~mm-page_alloc-detect-allocation-forbidden-by-cpuset-and-bail-out-early
+++ a/mm/page_alloc.c
@@ -4910,6 +4910,19 @@ retry_cpuset:
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_insane_config() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 114/262] mm/page_alloc.c: show watermark_boost of zone in zoneinfo
  2021-11-05 20:34 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2021-11-05 20:40 ` [patch 113/262] mm/page_alloc: detect allocation forbidden by cpuset and bail out early Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 115/262] mm: create a new system state and fix core_kernel_text() Andrew Morton
                   ` (147 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, liangcaifan19, linux-mm, mm-commits, torvalds, zhang.lyra

From: Liangcai Fan <liangcaifan19@gmail.com>
Subject: mm/page_alloc.c: show watermark_boost of zone in zoneinfo

min/low/high_wmark_pages(z) is defined as
(z->_watermark[WMARK_MIN/LOW/HIGH] + z->watermark_boost).
If kswapd is frequently woken up due to an increase of
min/low/high_wmark_pages, printing watermark_boost makes it quick to
determine whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
min/low/high_wmark_pages to increase.
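
With this patch, each zone's entry in /proc/zoneinfo gains a "boost" line
(and show_free_areas() prints a matching "boost:" field), e.g. with
illustrative values:

	Node 0, zone   Normal
	  pages free     159849484
	        boost    0
	        min      9140
	        low      139160
	        high     269180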

Link: https://lkml.kernel.org/r/1632472566-12246-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Cc: Chunyan Zhang <zhang.lyra@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    2 ++
 mm/vmstat.c     |    2 ++
 2 files changed, 4 insertions(+)

--- a/mm/page_alloc.c~mm-show-watermark_boost-of-zone-in-zoneinfo
+++ a/mm/page_alloc.c
@@ -5993,6 +5993,7 @@ void show_free_areas(unsigned int filter
 		printk(KERN_CONT
 			"%s"
 			" free:%lukB"
+			" boost:%lukB"
 			" min:%lukB"
 			" low:%lukB"
 			" high:%lukB"
@@ -6013,6 +6014,7 @@ void show_free_areas(unsigned int filter
 			"\n",
 			zone->name,
 			K(zone_page_state(zone, NR_FREE_PAGES)),
+			K(zone->watermark_boost),
 			K(min_wmark_pages(zone)),
 			K(low_wmark_pages(zone)),
 			K(high_wmark_pages(zone)),
--- a/mm/vmstat.c~mm-show-watermark_boost-of-zone-in-zoneinfo
+++ a/mm/vmstat.c
@@ -1656,6 +1656,7 @@ static void zoneinfo_show_print(struct s
 	}
 	seq_printf(m,
 		   "\n  pages free     %lu"
+		   "\n        boost    %lu"
 		   "\n        min      %lu"
 		   "\n        low      %lu"
 		   "\n        high     %lu"
@@ -1664,6 +1665,7 @@ static void zoneinfo_show_print(struct s
 		   "\n        managed  %lu"
 		   "\n        cma      %lu",
 		   zone_page_state(zone, NR_FREE_PAGES),
+		   zone->watermark_boost,
 		   min_wmark_pages(zone),
 		   low_wmark_pages(zone),
 		   high_wmark_pages(zone),
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 115/262] mm: create a new system state and fix core_kernel_text()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2021-11-05 20:40 ` [patch 114/262] mm/page_alloc.c: show watermark_boost of zone in zoneinfo Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 116/262] mm: make generic arch_is_kernel_initmem_freed() do what it says Andrew Morton
                   ` (146 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, gerald.schaefer, hca, linux-mm,
	mm-commits, paulus, torvalds, wangkefeng.wang

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm: create a new system state and fix core_kernel_text()

core_kernel_text() considers init memory valid until system_state is at
least SYSTEM_RUNNING.

But init memory is freed a few lines before setting SYSTEM_RUNNING, so we
have a small window of time during which core_kernel_text() is wrong.

Create an intermediate system state called SYSTEM_FREEING_INITMEM that is
set before init memory starts being freed, and use it in
core_kernel_text() so that init memory is reported invalid earlier.
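
The resulting ordering in kernel_init() is roughly as follows (simplified
sketch; see the init/main.c hunk below):

	async_synchronize_full();
	system_state = SYSTEM_FREEING_INITMEM;	/* init memory no longer reported valid */
	kprobe_free_init_mem();
	ftrace_free_init_mem();
	kgdb_free_init_mem();
	free_initmem();
	...
	system_state = SYSTEM_RUNNING;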

Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kernel.h |    1 +
 init/main.c            |    2 ++
 kernel/extable.c       |    2 +-
 3 files changed, 4 insertions(+), 1 deletion(-)

--- a/include/linux/kernel.h~mm-create-a-new-system-state-and-fix-core_kernel_text
+++ a/include/linux/kernel.h
@@ -248,6 +248,7 @@ extern bool early_boot_irqs_disabled;
 extern enum system_states {
 	SYSTEM_BOOTING,
 	SYSTEM_SCHEDULING,
+	SYSTEM_FREEING_INITMEM,
 	SYSTEM_RUNNING,
 	SYSTEM_HALT,
 	SYSTEM_POWER_OFF,
--- a/init/main.c~mm-create-a-new-system-state-and-fix-core_kernel_text
+++ a/init/main.c
@@ -1506,6 +1506,8 @@ static int __ref kernel_init(void *unuse
 	kernel_init_freeable();
 	/* need to finish all async __init code before freeing the memory */
 	async_synchronize_full();
+
+	system_state = SYSTEM_FREEING_INITMEM;
 	kprobe_free_init_mem();
 	ftrace_free_init_mem();
 	kgdb_free_init_mem();
--- a/kernel/extable.c~mm-create-a-new-system-state-and-fix-core_kernel_text
+++ a/kernel/extable.c
@@ -76,7 +76,7 @@ int notrace core_kernel_text(unsigned lo
 	    addr < (unsigned long)_etext)
 		return 1;
 
-	if (system_state < SYSTEM_RUNNING &&
+	if (system_state < SYSTEM_FREEING_INITMEM &&
 	    init_kernel_text(addr))
 		return 1;
 	return 0;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 116/262] mm: make generic arch_is_kernel_initmem_freed() do what it says
  2021-11-05 20:34 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2021-11-05 20:40 ` [patch 115/262] mm: create a new system state and fix core_kernel_text() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 117/262] powerpc: use generic version of arch_is_kernel_initmem_freed() Andrew Morton
                   ` (145 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, gerald.schaefer, hca, linux-mm,
	mm-commits, paulus, torvalds, wangkefeng.wang

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: mm: make generic arch_is_kernel_initmem_freed() do what it says

Commit 7a5da02de8d6 ("locking/lockdep: check for freed initmem in
static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
report whether an object is part of already freed init memory.

For the time being, the generic version of arch_is_kernel_initmem_freed()
always reports 'false', although free_initmem() is generically called on
all architectures.

Therefore, change the generic version of arch_is_kernel_initmem_freed() to
check whether free_initmem() has been called.  If so, then check if a
given address falls into init memory.

To ease the use of system_state, move the function out of line into its
only caller, which is lockdep.c.

Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/sections.h |   14 --------------
 kernel/locking/lockdep.c       |   15 +++++++++++++++
 2 files changed, 15 insertions(+), 14 deletions(-)

--- a/include/asm-generic/sections.h~mm-make-generic-arch_is_kernel_initmem_freed-do-what-it-says
+++ a/include/asm-generic/sections.h
@@ -80,20 +80,6 @@ static inline int arch_is_kernel_data(un
 }
 #endif
 
-/*
- * Check if an address is part of freed initmem. This is needed on architectures
- * with virt == phys kernel mapping, for code that wants to check if an address
- * is part of a static object within [_stext, _end]. After initmem is freed,
- * memory can be allocated from it, and such allocations would then have
- * addresses within the range [_stext, _end].
- */
-#ifndef arch_is_kernel_initmem_freed
-static inline int arch_is_kernel_initmem_freed(unsigned long addr)
-{
-	return 0;
-}
-#endif
-
 /**
  * memory_contains - checks if an object is contained within a memory region
  * @begin: virtual address of the beginning of the memory region
--- a/kernel/locking/lockdep.c~mm-make-generic-arch_is_kernel_initmem_freed-do-what-it-says
+++ a/kernel/locking/lockdep.c
@@ -788,6 +788,21 @@ static int very_verbose(struct lock_clas
  * Is this the address of a static object:
  */
 #ifdef __KERNEL__
+/*
+ * Check if an address is part of freed initmem. After initmem is freed,
+ * memory can be allocated from it, and such allocations would then have
+ * addresses within the range [_stext, _end].
+ */
+#ifndef arch_is_kernel_initmem_freed
+static int arch_is_kernel_initmem_freed(unsigned long addr)
+{
+	if (system_state < SYSTEM_FREEING_INITMEM)
+		return 0;
+
+	return init_section_contains((void *)addr, 1);
+}
+#endif
+
 static int static_obj(const void *obj)
 {
 	unsigned long start = (unsigned long) &_stext,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 117/262] powerpc: use generic version of arch_is_kernel_initmem_freed()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2021-11-05 20:40 ` [patch 116/262] mm: make generic arch_is_kernel_initmem_freed() do what it says Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 118/262] s390: " Andrew Morton
                   ` (144 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, gerald.schaefer, hca, linux-mm,
	mm-commits, paulus, torvalds, wangkefeng.wang

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: powerpc: use generic version of arch_is_kernel_initmem_freed()

The generic version of arch_is_kernel_initmem_freed() now does the same as
the powerpc version.

Remove the powerpc version.

Link: https://lkml.kernel.org/r/c53764eb45d41491e2b21da2e7812239897dbebb.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/include/asm/sections.h |   13 -------------
 1 file changed, 13 deletions(-)

--- a/arch/powerpc/include/asm/sections.h~powerpc-use-generic-version-of-arch_is_kernel_initmem_freed
+++ a/arch/powerpc/include/asm/sections.h
@@ -6,21 +6,8 @@
 #include <linux/elf.h>
 #include <linux/uaccess.h>
 
-#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
-
 #include <asm-generic/sections.h>
 
-extern bool init_mem_is_free;
-
-static inline int arch_is_kernel_initmem_freed(unsigned long addr)
-{
-	if (!init_mem_is_free)
-		return 0;
-
-	return addr >= (unsigned long)__init_begin &&
-		addr < (unsigned long)__init_end;
-}
-
 extern char __head_end[];
 
 #ifdef __powerpc64__
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 118/262] s390: use generic version of arch_is_kernel_initmem_freed()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2021-11-05 20:40 ` [patch 117/262] powerpc: use generic version of arch_is_kernel_initmem_freed() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 119/262] mm: page_alloc: use migrate_disable() in drain_local_pages_wq() Andrew Morton
                   ` (143 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, benh, christophe.leroy, gerald.schaefer, hca, linux-mm,
	mm-commits, paulus, torvalds, wangkefeng.wang

From: Christophe Leroy <christophe.leroy@csgroup.eu>
Subject: s390: use generic version of arch_is_kernel_initmem_freed()

The generic version of arch_is_kernel_initmem_freed() now does the same
as the s390 version.

Remove the s390 version.

Link: https://lkml.kernel.org/r/b6feb5dfe611a322de482762fc2df3a9eece70c7.1633001016.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/include/asm/sections.h |   12 ------------
 arch/s390/mm/init.c              |    3 ---
 2 files changed, 15 deletions(-)

--- a/arch/s390/include/asm/sections.h~s390-use-generic-version-of-arch_is_kernel_initmem_freed
+++ a/arch/s390/include/asm/sections.h
@@ -2,20 +2,8 @@
 #ifndef _S390_SECTIONS_H
 #define _S390_SECTIONS_H
 
-#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
-
 #include <asm-generic/sections.h>
 
-extern bool initmem_freed;
-
-static inline int arch_is_kernel_initmem_freed(unsigned long addr)
-{
-	if (!initmem_freed)
-		return 0;
-	return addr >= (unsigned long)__init_begin &&
-	       addr < (unsigned long)__init_end;
-}
-
 /*
  * .boot.data section contains variables "shared" between the decompressor and
  * the decompressed kernel. The decompressor will store values in them, and
--- a/arch/s390/mm/init.c~s390-use-generic-version-of-arch_is_kernel_initmem_freed
+++ a/arch/s390/mm/init.c
@@ -58,8 +58,6 @@ unsigned long empty_zero_page, zero_page
 EXPORT_SYMBOL(empty_zero_page);
 EXPORT_SYMBOL(zero_page_mask);
 
-bool initmem_freed;
-
 static void __init setup_zero_pages(void)
 {
 	unsigned int order;
@@ -214,7 +212,6 @@ void __init mem_init(void)
 
 void free_initmem(void)
 {
-	initmem_freed = true;
 	__set_memory((unsigned long)_sinittext,
 		     (unsigned long)(_einittext - _sinittext) >> PAGE_SHIFT,
 		     SET_MEMORY_RW | SET_MEMORY_NX);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 119/262] mm: page_alloc: use migrate_disable() in drain_local_pages_wq()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (117 preceding siblings ...)
  2021-11-05 20:40 ` [patch 118/262] s390: " Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 120/262] mm/page_alloc: use clamp() to simplify code Andrew Morton
                   ` (142 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, bigeasy, linux-mm, mm-commits, peterz, tglx, torvalds

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: mm: page_alloc: use migrate_disable() in drain_local_pages_wq()

drain_local_pages_wq() disables preemption to avoid CPU migration during
CPU hotplug and can't use cpus_read_lock().

Using migrate_disable() works here, too.  The scheduler won't take the CPU
offline until the task has left the migrate-disable section.  The problem with
disabled preemption here is that drain_local_pages() acquires locks which
are turned into sleeping locks on PREEMPT_RT and can't be acquired with
disabled preemption.

Use migrate_disable() in drain_local_pages_wq().

Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-use-migrate_disable-in-drain_local_pages_wq
+++ a/mm/page_alloc.c
@@ -3141,9 +3141,9 @@ static void drain_local_pages_wq(struct
 	 * cpu which is alright but we also have to make sure to not move to
 	 * a different one.
 	 */
-	preempt_disable();
+	migrate_disable();
 	drain_local_pages(drain->zone);
-	preempt_enable();
+	migrate_enable();
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 120/262] mm/page_alloc: use clamp() to simplify code
  2021-11-05 20:34 incoming Andrew Morton
                   ` (118 preceding siblings ...)
  2021-11-05 20:40 ` [patch 119/262] mm: page_alloc: use migrate_disable() in drain_local_pages_wq() Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:40 ` [patch 121/262] mm: fix data race in PagePoisoned() Andrew Morton
                   ` (141 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, bobo.shaobowang, david, huawei.libin, linux-mm, mm-commits,
	torvalds, weiyongjun1

From: Wang ShaoBo <bobo.shaobowang@huawei.com>
Subject: mm/page_alloc: use clamp() to simplify code

This patch uses clamp() to simplify code in init_per_zone_wmark_min().
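
For reference, clamp(val, lo, hi) limits val to the inclusive range
[lo, hi], so the open-coded comparisons collapse to a single statement:

	/* equivalent to max(128, min(new_min_free_kbytes, 262144)) */
	min_free_kbytes = clamp(new_min_free_kbytes, 128, 262144);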

Link: https://lkml.kernel.org/r/20211021034830.1049150-1-bobo.shaobowang@huawei.com
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Wei Yongjun <weiyongjun1@huawei.com>
Cc: Li Bin <huawei.libin@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-use-clamp-to-simplify-code
+++ a/mm/page_alloc.c
@@ -8477,16 +8477,12 @@ int __meminit init_per_zone_wmark_min(vo
 	lowmem_kbytes = nr_free_buffer_pages() * (PAGE_SIZE >> 10);
 	new_min_free_kbytes = int_sqrt(lowmem_kbytes * 16);
 
-	if (new_min_free_kbytes > user_min_free_kbytes) {
-		min_free_kbytes = new_min_free_kbytes;
-		if (min_free_kbytes < 128)
-			min_free_kbytes = 128;
-		if (min_free_kbytes > 262144)
-			min_free_kbytes = 262144;
-	} else {
+	if (new_min_free_kbytes > user_min_free_kbytes)
+		min_free_kbytes = clamp(new_min_free_kbytes, 128, 262144);
+	else
 		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
 				new_min_free_kbytes, user_min_free_kbytes);
-	}
+
 	setup_per_zone_wmarks();
 	refresh_zone_stat_thresholds();
 	setup_per_zone_lowmem_reserve();
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 121/262] mm: fix data race in PagePoisoned()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (119 preceding siblings ...)
  2021-11-05 20:40 ` [patch 120/262] mm/page_alloc: use clamp() to simplify code Andrew Morton
@ 2021-11-05 20:40 ` Andrew Morton
  2021-11-05 20:41 ` [patch 122/262] mm/memory_failure: constify static mm_walk_ops Andrew Morton
                   ` (140 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:40 UTC (permalink / raw)
  To: akpm, elver, kirill.shutemov, linux-mm, mm-commits, n-horiguchi,
	oliver.sang, torvalds, will

From: Marco Elver <elver@google.com>
Subject: mm: fix data race in PagePoisoned()

PagePoisoned() accesses page->flags which can be updated concurrently:

  | BUG: KCSAN: data-race in next_uptodate_page / unlock_page
  |
  | write (marked) to 0xffffea00050f37c0 of 8 bytes by task 1872 on cpu 1:
  |  instrument_atomic_write           include/linux/instrumented.h:87 [inline]
  |  clear_bit_unlock_is_negative_byte include/asm-generic/bitops/instrumented-lock.h:74 [inline]
  |  unlock_page+0x102/0x1b0           mm/filemap.c:1465
  |  filemap_map_pages+0x6c6/0x890     mm/filemap.c:3057
  |  ...
  | read to 0xffffea00050f37c0 of 8 bytes by task 1873 on cpu 0:
  |  PagePoisoned                   include/linux/page-flags.h:204 [inline]
  |  PageReadahead                  include/linux/page-flags.h:382 [inline]
  |  next_uptodate_page+0x456/0x830 mm/filemap.c:2975
  |  ...
  | CPU: 0 PID: 1873 Comm: systemd-udevd Not tainted 5.11.0-rc4-00001-gf9ce0be71d1f #1

To avoid the compiler tearing or otherwise optimizing the access, use
READ_ONCE() to access flags.

Link: https://lore.kernel.org/all/20210826144157.GA26950@xsang-OptiPlex-9020/
Link: https://lkml.kernel.org/r/20210913113542.2658064-1-elver@google.com
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Will Deacon <will@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/page-flags.h~mm-fix-data-race-in-pagepoisoned
+++ a/include/linux/page-flags.h
@@ -215,7 +215,7 @@ static __always_inline int PageCompound(
 #define	PAGE_POISON_PATTERN	-1l
 static inline int PagePoisoned(const struct page *page)
 {
-	return page->flags == PAGE_POISON_PATTERN;
+	return READ_ONCE(page->flags) == PAGE_POISON_PATTERN;
 }
 
 #ifdef CONFIG_DEBUG_VM
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 122/262] mm/memory_failure: constify static mm_walk_ops
  2021-11-05 20:34 incoming Andrew Morton
                   ` (120 preceding siblings ...)
  2021-11-05 20:40 ` [patch 121/262] mm: fix data race in PagePoisoned() Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 123/262] mm: filemap: coding style cleanup for filemap_map_pmd() Andrew Morton
                   ` (139 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, anshuman.khandual, linux-mm, mm-commits, naoya.horiguchi,
	rikard.falkeborn, torvalds

From: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Subject: mm/memory_failure: constify static mm_walk_ops

The only usage of hwp_walk_ops is to pass its address to walk_page_range()
which takes a pointer to const mm_walk_ops as argument.  Make it const to
allow the compiler to put it in read-only memory.

Link: https://lkml.kernel.org/r/20211014075042.17174-3-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory-failure.c~mm-memory_failure-constify-static-mm_walk_ops
+++ a/mm/memory-failure.c
@@ -674,7 +674,7 @@ static int hwpoison_hugetlb_range(pte_t
 #define hwpoison_hugetlb_range	NULL
 #endif
 
-static struct mm_walk_ops hwp_walk_ops = {
+static const struct mm_walk_ops hwp_walk_ops = {
 	.pmd_entry = hwpoison_pte_range,
 	.hugetlb_entry = hwpoison_hugetlb_range,
 };
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 123/262] mm: filemap: coding style cleanup for filemap_map_pmd()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (121 preceding siblings ...)
  2021-11-05 20:41 ` [patch 122/262] mm/memory_failure: constify static mm_walk_ops Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 124/262] mm: hwpoison: refactor refcount check handling Andrew Morton
                   ` (138 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, hughd, kirill.shutemov, linux-mm, mm-commits,
	naoya.horiguchi, osalvador, peterx, shy828301, torvalds, willy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: filemap: coding style cleanup for filemap_map_pmd()

Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.

When discussing the patch that splits page cache THP in order to offline
the poisoned page, Naoya mentioned there is a bigger problem [1] that
prevents this from working, since the page cache page will be truncated if
uncorrectable errors happen.  Looking into this more deeply, it turns out
this approach (truncating the poisoned page) may incur silent data loss
for all non-readonly filesystems if the page is dirty.  It may be worse
for in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are
actually gone.

To solve this problem we could keep the poisoned dirty page in the page
cache and then notify users on any later access, e.g. page fault,
read/write, etc.  Clean pages can be truncated as before, since they can
be reread from disk later on.

The consequence is that filesystems may find a poisoned page and
manipulate it as a healthy page, since filesystems don't actually check
whether the page is poisoned in any of the relevant paths except page
fault.  In general, we need to make filesystems aware of poisoned pages
before we can keep poisoned pages in the page cache in order to solve the
data loss problem.

To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag prevents it
  from being written back.
- The page should not be dropped (it shows up as a clean page) by drop
  caches or other callers: the refcount pin from hwpoison prevents it from
  being invalidated (by cache drop, inode cache shrinking, etc), but it
  does not prevent invalidation in the DIO path.
- The page should be able to be truncated/hole punched/unlinked: this works
  as is.
- Notify users when the page is accessed, e.g. read/write, page fault and
  other paths (compression, encryption, etc).

The scope of the last one is huge, since almost all filesystems need to do
it once a page is returned from a page cache lookup.  There are a couple
of options to do it:

1. Check the hwpoison flag on every path; the most straightforward way.
2. Return NULL for a poisoned page from the page cache lookup; most
   callsites already check whether NULL is returned, so this should be the
   least work.  But the error handling in filesystems would just return
   -ENOMEM, and that error code would obviously confuse users.
3. To improve on #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
   but this would involve a significant amount of code change as well,
   since all the paths need to check whether the pointer is an ERR_PTR,
   just like option #1.

I prototyped both #1 and #3, but it seems #3 requires more changes than
#1.  For #3 an ERR_PTR would be returned, so all the callers need to check
the return value, otherwise an invalid pointer may be dereferenced.  But
not all callers really care about the content of the page, for example
partial truncate, which just sets the truncated range in one page to 0, so
such paths would need additional modification if an ERR_PTR is returned.
And if the callers have their own way to handle the problematic pages, we
would need to add a new FGP flag to tell the FGP functions to return the
pointer to the page anyway.
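
As a rough illustration, an option #1 style check at a page cache lookup
site could look like this (hypothetical sketch, not taken from this
series):

	page = find_lock_page(mapping, index);
	if (page && PageHWPoison(page)) {
		unlock_page(page);
		put_page(page);
		return -EIO;	/* the data is gone; report it instead of pretending */
	}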

This may happen very rarely, but once it happens the consequence (data
corruption) can be very bad and very hard to debug.  It seems this problem
was briefly discussed before, but apparently no action was taken at that
time.  [2]

As the above investigation shows, it would take a huge amount of work to
solve the potential data loss for all filesystems.  But it is much easier
for in-memory filesystems, and such filesystems actually suffer more than
others since even the data blocks are gone due to truncation.  So this
patchset starts with shmem/tmpfs by taking option #1.

TODO:
* The unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
  get_hwpoison_page() call get_any_page()"), and this patch series makes
  the refcount check for unpoisoning shmem pages fail.
* Expand to other filesystems.  But I haven't heard feedback from filesystem
  developers yet.

Patch breakdown:
Patch #1: cleanup, which patch #2 depends on
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation.
Patch #5: keep the poisoned page in page cache and handle such case for all
          the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
          adds page cache THP split support.


This patch (of 4):

A minor cleanup of the indentation.

Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com
Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/mm/filemap.c~mm-filemap-coding-style-cleanup-for-filemap_map_pmd
+++ a/mm/filemap.c
@@ -3203,12 +3203,12 @@ static bool filemap_map_pmd(struct vm_fa
 	}
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
 	if (pmd_none(*vmf->pmd))
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 124/262] mm: hwpoison: refactor refcount check handling
  2021-11-05 20:34 incoming Andrew Morton
                   ` (122 preceding siblings ...)
  2021-11-05 20:41 ` [patch 123/262] mm: filemap: coding style cleanup for filemap_map_pmd() Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 125/262] mm: shmem: don't truncate page if memory failure happens Andrew Morton
                   ` (137 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, hughd, kirill.shutemov, linux-mm, mm-commits,
	naoya.horiguchi, osalvador, peterx, shy828301, torvalds, willy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: hwpoison: refactor refcount check handling

Memory failure reports failure if the page still has an extra pinned
refcount, other than the one taken by hwpoison itself, after the handler
is done.  Actually the check is not necessary for all handlers, so move
the check into the specific handlers.  This will make the following
patch, which keeps shmem pages in the page cache, easier.

Some cases may expect an extra pin, for example when the page is dirty
and in the swapcache.
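
A minimal sketch of the check each handler now performs (mirroring
has_extra_refcount() in the hunk below); the only per-handler variation
is whether one extra pin is expected:

    static bool still_referenced(struct page *p, bool extra_pins)
    {
            int count = page_count(p) - 1;  /* ignore the hwpoison reference */

            if (extra_pins)                 /* e.g. dirty swapcache kept as MF_DELAYED */
                    count -= 1;

            return count > 0;               /* someone else still pins the page */
    }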

Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   93 ++++++++++++++++++++++++++++--------------
 1 file changed, 64 insertions(+), 29 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-refactor-refcount-check-handling
+++ a/mm/memory-failure.c
@@ -807,12 +807,44 @@ static int truncate_error_page(struct pa
 	return ret;
 }
 
+struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	enum mf_action_page_type type;
+
+	/* Callback ->action() has to unlock the relevant page inside it. */
+	int (*action)(struct page_state *ps, struct page *p);
+};
+
+/*
+ * Return true if page is still referenced by others, otherwise return
+ * false.
+ *
+ * The extra_pins is true when one extra refcount is expected.
+ */
+static bool has_extra_refcount(struct page_state *ps, struct page *p,
+			       bool extra_pins)
+{
+	int count = page_count(p) - 1;
+
+	if (extra_pins)
+		count -= 1;
+
+	if (count > 0) {
+		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
+		       page_to_pfn(p), action_page_types[ps->type], count);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
  */
-static int me_kernel(struct page *p, unsigned long pfn)
+static int me_kernel(struct page_state *ps, struct page *p)
 {
 	unlock_page(p);
 	return MF_IGNORED;
@@ -821,9 +853,9 @@ static int me_kernel(struct page *p, uns
 /*
  * Page in unknown state. Do nothing.
  */
-static int me_unknown(struct page *p, unsigned long pfn)
+static int me_unknown(struct page_state *ps, struct page *p)
 {
-	pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+	pr_err("Memory failure: %#lx: Unknown page state\n", page_to_pfn(p));
 	unlock_page(p);
 	return MF_FAILED;
 }
@@ -831,7 +863,7 @@ static int me_unknown(struct page *p, un
 /*
  * Clean (or cleaned) page cache page.
  */
-static int me_pagecache_clean(struct page *p, unsigned long pfn)
+static int me_pagecache_clean(struct page_state *ps, struct page *p)
 {
 	int ret;
 	struct address_space *mapping;
@@ -868,9 +900,13 @@ static int me_pagecache_clean(struct pag
 	 *
 	 * Open: to take i_rwsem or not for this? Right now we don't.
 	 */
-	ret = truncate_error_page(p, pfn, mapping);
+	ret = truncate_error_page(p, page_to_pfn(p), mapping);
 out:
 	unlock_page(p);
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
@@ -879,7 +915,7 @@ out:
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
  */
-static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+static int me_pagecache_dirty(struct page_state *ps, struct page *p)
 {
 	struct address_space *mapping = page_mapping(p);
 
@@ -923,7 +959,7 @@ static int me_pagecache_dirty(struct pag
 		mapping_set_error(mapping, -EIO);
 	}
 
-	return me_pagecache_clean(p, pfn);
+	return me_pagecache_clean(ps, p);
 }
 
 /*
@@ -945,9 +981,10 @@ static int me_pagecache_dirty(struct pag
  * Clean swap cache pages can be directly isolated. A later page fault will
  * bring in the known good data from disk.
  */
-static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+static int me_swapcache_dirty(struct page_state *ps, struct page *p)
 {
 	int ret;
+	bool extra_pins = false;
 
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
@@ -955,10 +992,17 @@ static int me_swapcache_dirty(struct pag
 
 	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
 	unlock_page(p);
+
+	if (ret == MF_DELAYED)
+		extra_pins = true;
+
+	if (has_extra_refcount(ps, p, extra_pins))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
-static int me_swapcache_clean(struct page *p, unsigned long pfn)
+static int me_swapcache_clean(struct page_state *ps, struct page *p)
 {
 	int ret;
 
@@ -966,6 +1010,10 @@ static int me_swapcache_clean(struct pag
 
 	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
 	unlock_page(p);
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
@@ -975,7 +1023,7 @@ static int me_swapcache_clean(struct pag
  * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
  *   To narrow down kill region to one page, we need to break up pmd.
  */
-static int me_huge_page(struct page *p, unsigned long pfn)
+static int me_huge_page(struct page_state *ps, struct page *p)
 {
 	int res;
 	struct page *hpage = compound_head(p);
@@ -986,7 +1034,7 @@ static int me_huge_page(struct page *p,
 
 	mapping = page_mapping(hpage);
 	if (mapping) {
-		res = truncate_error_page(hpage, pfn, mapping);
+		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
 		unlock_page(hpage);
 	} else {
 		res = MF_FAILED;
@@ -1004,6 +1052,9 @@ static int me_huge_page(struct page *p,
 		}
 	}
 
+	if (has_extra_refcount(ps, p, false))
+		res = MF_FAILED;
+
 	return res;
 }
 
@@ -1029,14 +1080,7 @@ static int me_huge_page(struct page *p,
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
-static struct page_state {
-	unsigned long mask;
-	unsigned long res;
-	enum mf_action_page_type type;
-
-	/* Callback ->action() has to unlock the relevant page inside it. */
-	int (*action)(struct page *p, unsigned long pfn);
-} error_states[] = {
+static struct page_state error_states[] = {
 	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
@@ -1096,19 +1140,10 @@ static int page_action(struct page_state
 			unsigned long pfn)
 {
 	int result;
-	int count;
 
 	/* page p should be unlocked after returning from ps->action().  */
-	result = ps->action(p, pfn);
+	result = ps->action(ps, p);
 
-	count = page_count(p) - 1;
-	if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
-		count--;
-	if (count > 0) {
-		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
-		       pfn, action_page_types[ps->type], count);
-		result = MF_FAILED;
-	}
 	action_result(pfn, ps->type, result);
 
 	/* Could do more checks here if page looks ok */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 125/262] mm: shmem: don't truncate page if memory failure happens
  2021-11-05 20:34 incoming Andrew Morton
                   ` (123 preceding siblings ...)
  2021-11-05 20:41 ` [patch 124/262] mm: hwpoison: refactor refcount check handling Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 126/262] mm: hwpoison: handle non-anonymous THP correctly Andrew Morton
                   ` (136 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, arnd, hughd, kirill.shutemov, linux-mm, mm-commits,
	naoya.horiguchi, osalvador, peterx, shy828301, torvalds, willy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: shmem: don't truncate page if memory failure happens

The current behavior of memory failure is to truncate the page cache
regardless of whether the page is dirty or clean.  If the page is dirty,
a later access will get the obsolete data from disk without any
notification to the user, which may cause silent data loss.  It is even
worse for shmem: since shmem is an in-memory filesystem, truncating the
page cache means discarding the data blocks, and a later read would
return all zeroes.

The right approach is to keep the corrupted page in the page cache; any
later access then returns an error from syscalls or SIGBUS on page
fault, until the file is truncated, hole punched or removed.  Regular
storage-backed filesystems would be more complicated, so this patch
focuses on shmem.  This also unblocks support for soft offlining shmem
THP.
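
As a hedged illustration of the user-visible change on tmpfs (assuming a
page of the file has already been marked hwpoisoned, e.g. by an injection
test; the file descriptor setup is omitted):

    char buf[16];

    /* Before this patch the poisoned page was truncated, so a later
     * read silently returned zeroes.  With this patch the page stays
     * in the page cache and the read fails instead.
     */
    if (pread(fd, buf, sizeof(buf), 0) < 0)
            perror("pread");        /* expected: EIO */

    /* A page fault on a mapping of the same range raises SIGBUS. */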

[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()]
  Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org
Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |   14 +++++++++++---
 mm/shmem.c          |   38 +++++++++++++++++++++++++++++++++++---
 mm/userfaultfd.c    |    5 +++++
 3 files changed, 51 insertions(+), 6 deletions(-)

--- a/mm/memory-failure.c~mm-shmem-dont-truncate-page-if-memory-failure-happens
+++ a/mm/memory-failure.c
@@ -58,6 +58,7 @@
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -867,6 +868,7 @@ static int me_pagecache_clean(struct pag
 {
 	int ret;
 	struct address_space *mapping;
+	bool extra_pins;
 
 	delete_from_lru_cache(p);
 
@@ -896,17 +898,23 @@ static int me_pagecache_clean(struct pag
 	}
 
 	/*
+	 * The shmem page is kept in page cache instead of truncating
+	 * so is expected to have an extra refcount after error-handling.
+	 */
+	extra_pins = shmem_mapping(mapping);
+
+	/*
 	 * Truncation is a bit tricky. Enable it per file system for now.
 	 *
 	 * Open: to take i_rwsem or not for this? Right now we don't.
 	 */
 	ret = truncate_error_page(p, page_to_pfn(p), mapping);
+	if (has_extra_refcount(ps, p, extra_pins))
+		ret = MF_FAILED;
+
 out:
 	unlock_page(p);
 
-	if (has_extra_refcount(ps, p, false))
-		ret = MF_FAILED;
-
 	return ret;
 }
 
--- a/mm/shmem.c~mm-shmem-dont-truncate-page-if-memory-failure-happens
+++ a/mm/shmem.c
@@ -2454,6 +2454,7 @@ shmem_write_begin(struct file *file, str
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	int ret = 0;
 
 	/* i_rwsem is held by caller */
 	if (unlikely(info->seals & (F_SEAL_GROW |
@@ -2464,7 +2465,15 @@ shmem_write_begin(struct file *file, str
 			return -EPERM;
 	}
 
-	return shmem_getpage(inode, index, pagep, SGP_WRITE);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+	if (*pagep && PageHWPoison(*pagep)) {
+		unlock_page(*pagep);
+		put_page(*pagep);
+		ret = -EIO;
+	}
+
+	return ret;
 }
 
 static int
@@ -2551,6 +2560,12 @@ static ssize_t shmem_file_read_iter(stru
 			if (sgp == SGP_CACHE)
 				set_page_dirty(page);
 			unlock_page(page);
+
+			if (PageHWPoison(page)) {
+				put_page(page);
+				error = -EIO;
+				break;
+			}
 		}
 
 		/*
@@ -3112,7 +3127,8 @@ static const char *shmem_get_link(struct
 		page = find_get_page(inode->i_mapping, 0);
 		if (!page)
 			return ERR_PTR(-ECHILD);
-		if (!PageUptodate(page)) {
+		if (PageHWPoison(page) ||
+		    !PageUptodate(page)) {
 			put_page(page);
 			return ERR_PTR(-ECHILD);
 		}
@@ -3120,6 +3136,11 @@ static const char *shmem_get_link(struct
 		error = shmem_getpage(inode, 0, &page, SGP_READ);
 		if (error)
 			return ERR_PTR(error);
+		if (page && PageHWPoison(page)) {
+			unlock_page(page);
+			put_page(page);
+			return ERR_PTR(-ECHILD);
+		}
 		unlock_page(page);
 	}
 	set_delayed_call(done, shmem_put_link, page);
@@ -3770,6 +3791,13 @@ static void shmem_destroy_inodecache(voi
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+/* Keep the page in page cache instead of truncating it */
+static int shmem_error_remove_page(struct address_space *mapping,
+				   struct page *page)
+{
+	return 0;
+}
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3780,7 +3808,7 @@ const struct address_space_operations sh
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
-	.error_remove_page = generic_error_remove_page,
+	.error_remove_page = shmem_error_remove_page,
 };
 EXPORT_SYMBOL(shmem_aops);
 
@@ -4191,6 +4219,10 @@ struct page *shmem_read_mapping_page_gfp
 		page = ERR_PTR(error);
 	else
 		unlock_page(page);
+
+	if (PageHWPoison(page))
+		page = ERR_PTR(-EIO);
+
 	return page;
 #else
 	/*
--- a/mm/userfaultfd.c~mm-shmem-dont-truncate-page-if-memory-failure-happens
+++ a/mm/userfaultfd.c
@@ -232,6 +232,11 @@ static int mcontinue_atomic_pte(struct m
 		goto out;
 	}
 
+	if (PageHWPoison(page)) {
+		ret = -EIO;
+		goto out_release;
+	}
+
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       page, false, wp_copy);
 	if (ret)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 126/262] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-05 20:34 incoming Andrew Morton
                   ` (124 preceding siblings ...)
  2021-11-05 20:41 ` [patch 125/262] mm: shmem: don't truncate page if memory failure happens Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 127/262] mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h Andrew Morton
                   ` (135 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, hughd, kirill.shutemov, linux-mm, mm-commits,
	naoya.horiguchi, osalvador, peterx, shy828301, torvalds, willy

From: Yang Shi <shy828301@gmail.com>
Subject: mm: hwpoison: handle non-anonymous THP correctly

Currently hwpoison doesn't handle non-anonymous THPs, but THP support
for tmpfs and the read-only file cache has been in place since v4.8.
They can be offlined by splitting the THP, just like anonymous THPs.

Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/mm/memory-failure.c~mm-hwpoison-handle-non-anonymous-thp-correctly
+++ a/mm/memory-failure.c
@@ -1444,14 +1444,11 @@ static int identify_page_state(unsigned
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-		else
-			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);
 		return -EBUSY;
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 127/262] mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h
  2021-11-05 20:34 incoming Andrew Morton
                   ` (125 preceding siblings ...)
  2021-11-05 20:41 ` [patch 126/262] mm: hwpoison: handle non-anonymous THP correctly Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 128/262] hugetlb: add demote hugetlb page sysfs interfaces Andrew Morton
                   ` (134 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, david, jhubbard, linux-mm, mike.kravetz, mm-commits,
	peterx, songmuchun, torvalds

From: Peter Xu <peterx@redhat.com>
Subject: mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h

Remove __unmap_hugepage_range() from the header file, because it is only
used in hugetlb.c.

Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |   10 ----------
 mm/hugetlb.c            |    6 +++---
 2 files changed, 3 insertions(+), 13 deletions(-)

--- a/include/linux/hugetlb.h~mm-hugetlb-drop-__unmap_hugepage_range-definition-from-hugetlbh
+++ a/include/linux/hugetlb.h
@@ -143,9 +143,6 @@ void __unmap_hugepage_range_final(struct
 			  struct vm_area_struct *vma,
 			  unsigned long start, unsigned long end,
 			  struct page *ref_page);
-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-				unsigned long start, unsigned long end,
-				struct page *ref_page);
 void hugetlb_report_meminfo(struct seq_file *);
 int hugetlb_report_node_meminfo(char *buf, int len, int nid);
 void hugetlb_show_meminfo(void);
@@ -382,13 +379,6 @@ static inline void __unmap_hugepage_rang
 			struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, struct page *ref_page)
 {
-	BUG();
-}
-
-static inline void __unmap_hugepage_range(struct mmu_gather *tlb,
-			struct vm_area_struct *vma, unsigned long start,
-			unsigned long end, struct page *ref_page)
-{
 	BUG();
 }
 
--- a/mm/hugetlb.c~mm-hugetlb-drop-__unmap_hugepage_range-definition-from-hugetlbh
+++ a/mm/hugetlb.c
@@ -4426,9 +4426,9 @@ again:
 	return ret;
 }
 
-void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
-			    unsigned long start, unsigned long end,
-			    struct page *ref_page)
+static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
+				   unsigned long start, unsigned long end,
+				   struct page *ref_page)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	unsigned long address;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 128/262] hugetlb: add demote hugetlb page sysfs interfaces
  2021-11-05 20:34 incoming Andrew Morton
                   ` (126 preceding siblings ...)
  2021-11-05 20:41 ` [patch 127/262] mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 129/262] mm/cma: add cma_pages_valid to determine if pages are in CMA Andrew Morton
                   ` (133 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, aneesh.kumar, david, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, nghialm78, osalvador, rientjes,
	songmuchun, torvalds, ziy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: add demote hugetlb page sysfs interfaces

Patch series "hugetlb: add demote/split page functionality", v4.

The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common.  One of the reasons is better TLB support for
gigantic page sizes on x86 hardware.  In addition, hugetlb pages are being
used to back VMs in hosting environments.

When using hugetlb pages to back VMs, it is often desirable to preallocate
hugetlb pools.  This avoids the delay and uncertainty of allocating
hugetlb pages at VM startup.  In addition, preallocating huge pages
minimizes the issue of memory fragmentation that increases the longer the
system is up and running.

In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes.  Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain.  In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.

Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger pages to the buddy allocator and then allocating
the smaller pages.  For example, to convert fifty 1GB pages into 2MB
pages on x86:
gb_pages=`cat .../hugepages-1048576kB/nr_hugepages`
m2_pages=`cat .../hugepages-2048kB/nr_hugepages`
echo $(($gb_pages - 50)) > .../hugepages-1048576kB/nr_hugepages
echo $(($m2_pages + 25600)) > .../hugepages-2048kB/nr_hugepages

On an idle system this operation is fairly reliable: the number of 2MB
pages increases as expected and the operation takes only a second or two.

However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
   the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
   will match the size of the larger page which was freed.  This is
   because the area freed by the larger page could quickly be
   fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:

Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real    0m42.092s
user    0m0.008s
sys     0m41.467s

To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page into
pages of a smaller size.  This avoids freeing pages to buddy and then
trying to allocate from buddy.

Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size   Target page size for demotion, a smaller huge page size.
		The file can be written to choose a smaller huge page size if
		multiple are available.
- demote        Writable number of hugetlb pages to be demoted

To demote fifty 1GB huge pages, one would:
cat .../hugepages-1048576kB/free_hugepages   /* optional, verify free pages */
cat .../hugepages-1048576kB/demote_size      /* optional, verify target size */
echo 50 > .../hugepages-1048576kB/demote

Only hugetlb pages which are free at the time of the request can be
demoted.  Demotion does not add to the complexity of surplus pages and
honors reserved huge pages.  Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted.  It is possible fewer will actually be demoted.  The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.

Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages are
used to back VMs on x86.  Both issues of long allocation times and not
necessarily getting the expected number of smaller huge pages after a free
and allocate cycle have been experienced.  The occurrence of these issues
is dependent on other activity within the host and can not be predicted.


This patch (of 5):

Two new sysfs files are added to demote hugetlb pages.  These files are
both per-hugetlb page size and per node.  The files are:

  demote_size - The size in kB that pages are demoted to. (read-write)
  demote - The number of huge pages to demote. (write-only)

By default, demote_size is the next smaller huge page size.  Any valid
huge page size smaller than the current huge page size may be written to
this file.  When huge pages are demoted, they are demoted to this size.

Writing a value to demote will result in an attempt to demote that number
of hugetlb pages to an appropriate number of demote_size pages.

NOTE: Demote interfaces are only provided for a huge page size if there
is a smaller huge page size to demote to.  For example, on x86 1GB huge
pages will have demote interfaces, while 2MB huge pages will not.

This patch does not provide full demote functionality.  It only provides
the sysfs interfaces.

It also provides documentation for the new interfaces.

[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex]
  Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: David Rientjes <rientjes@google.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/hugetlbpage.rst |   30 +++
 include/linux/hugetlb.h                      |    1 
 mm/hugetlb.c                                 |  155 ++++++++++++++++-
 3 files changed, 183 insertions(+), 3 deletions(-)

--- a/Documentation/admin-guide/mm/hugetlbpage.rst~hugetlb-add-demote-hugetlb-page-sysfs-interfaces
+++ a/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -234,8 +234,12 @@ will exist, of the form::
 
 	hugepages-${size}kB
 
-Inside each of these directories, the same set of files will exist::
+Inside each of these directories, the set of files contained in ``/proc``
+will exist.  In addition, two additional interfaces for demoting huge
+pages may exist::
 
+        demote
+        demote_size
 	nr_hugepages
 	nr_hugepages_mempolicy
 	nr_overcommit_hugepages
@@ -243,7 +247,29 @@ Inside each of these directories, the sa
 	resv_hugepages
 	surplus_hugepages
 
-which function as described above for the default huge page-sized case.
+The demote interfaces provide the ability to split a huge page into
+smaller huge pages.  For example, the x86 architecture supports both
+1GB and 2MB huge pages sizes.  A 1GB huge page can be split into 512
+2MB huge pages.  Demote interfaces are not available for the smallest
+huge page size.  The demote interfaces are:
+
+demote_size
+        is the size of demoted pages.  When a page is demoted a corresponding
+        number of huge pages of demote_size will be created.  By default,
+        demote_size is set to the next smaller huge page size.  If there are
+        multiple smaller huge page sizes, demote_size can be set to any of
+        these smaller sizes.  Only huge page sizes less than the current huge
+        pages size are allowed.
+
+demote
+        is used to demote a number of huge pages.  A user with root privileges
+        can write to this file.  It may not be possible to demote the
+        requested number of huge pages.  To determine how many pages were
+        actually demoted, compare the value of nr_hugepages before and after
+        writing to the demote interface.  demote is a write only interface.
+
+The interfaces which are the same as in ``/proc`` (all except demote and
+demote_size) function as described above for the default huge page-sized case.
 
 .. _mem_policy_and_hp_alloc:
 
--- a/include/linux/hugetlb.h~hugetlb-add-demote-hugetlb-page-sysfs-interfaces
+++ a/include/linux/hugetlb.h
@@ -586,6 +586,7 @@ struct hstate {
 	int next_nid_to_alloc;
 	int next_nid_to_free;
 	unsigned int order;
+	unsigned int demote_order;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
--- a/mm/hugetlb.c~hugetlb-add-demote-hugetlb-page-sysfs-interfaces
+++ a/mm/hugetlb.c
@@ -2986,7 +2986,7 @@ free:
 
 static void __init hugetlb_init_hstates(void)
 {
-	struct hstate *h;
+	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
 		if (minimum_order > huge_page_order(h))
@@ -2995,6 +2995,22 @@ static void __init hugetlb_init_hstates(
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
+
+		/*
+		 * Set demote order for each hstate.  Note that
+		 * h->demote_order is initially 0.
+		 * - We can not demote gigantic pages if runtime freeing
+		 *   is not supported, so skip this.
+		 */
+		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
+			continue;
+		for_each_hstate(h2) {
+			if (h2 == h)
+				continue;
+			if (h2->order < h->order &&
+			    h2->order > h->demote_order)
+				h->demote_order = h2->order;
+		}
 	}
 	VM_BUG_ON(minimum_order == UINT_MAX);
 }
@@ -3235,9 +3251,31 @@ out:
 	return 0;
 }
 
+static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+	__must_hold(&hugetlb_lock)
+{
+	int rc = 0;
+
+	lockdep_assert_held(&hugetlb_lock);
+
+	/* We should never get here if no demote order */
+	if (!h->demote_order) {
+		pr_warn("HugeTLB: NULL demote order passed to demote_pool_huge_page.\n");
+		return -EINVAL;		/* internal error */
+	}
+
+	/*
+	 * TODO - demote fucntionality will be added in subsequent patch
+	 */
+	return rc;
+}
+
 #define HSTATE_ATTR_RO(_name) \
 	static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 
+#define HSTATE_ATTR_WO(_name) \
+	static struct kobj_attribute _name##_attr = __ATTR_WO(_name)
+
 #define HSTATE_ATTR(_name) \
 	static struct kobj_attribute _name##_attr = \
 		__ATTR(_name, 0644, _name##_show, _name##_store)
@@ -3433,6 +3471,105 @@ static ssize_t surplus_hugepages_show(st
 }
 HSTATE_ATTR_RO(surplus_hugepages);
 
+static ssize_t demote_store(struct kobject *kobj,
+	       struct kobj_attribute *attr, const char *buf, size_t len)
+{
+	unsigned long nr_demote;
+	unsigned long nr_available;
+	nodemask_t nodes_allowed, *n_mask;
+	struct hstate *h;
+	int err = 0;
+	int nid;
+
+	err = kstrtoul(buf, 10, &nr_demote);
+	if (err)
+		return err;
+	h = kobj_to_hstate(kobj, &nid);
+
+	if (nid != NUMA_NO_NODE) {
+		init_nodemask_of_node(&nodes_allowed, nid);
+		n_mask = &nodes_allowed;
+	} else {
+		n_mask = &node_states[N_MEMORY];
+	}
+
+	/* Synchronize with other sysfs operations modifying huge pages */
+	mutex_lock(&h->resize_lock);
+	spin_lock_irq(&hugetlb_lock);
+
+	while (nr_demote) {
+		/*
+		 * Check for available pages to demote each time thorough the
+		 * loop as demote_pool_huge_page will drop hugetlb_lock.
+		 *
+		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
+		 * but will when full demote functionality is added in a later
+		 * patch.
+		 */
+		if (nid != NUMA_NO_NODE)
+			nr_available = h->free_huge_pages_node[nid];
+		else
+			nr_available = h->free_huge_pages;
+		nr_available -= h->resv_huge_pages;
+		if (!nr_available)
+			break;
+
+		err = demote_pool_huge_page(h, n_mask);
+		if (err)
+			break;
+
+		nr_demote--;
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+	mutex_unlock(&h->resize_lock);
+
+	if (err)
+		return err;
+	return len;
+}
+HSTATE_ATTR_WO(demote);
+
+static ssize_t demote_size_show(struct kobject *kobj,
+					struct kobj_attribute *attr, char *buf)
+{
+	int nid;
+	struct hstate *h = kobj_to_hstate(kobj, &nid);
+	unsigned long demote_size = (PAGE_SIZE << h->demote_order) / SZ_1K;
+
+	return sysfs_emit(buf, "%lukB\n", demote_size);
+}
+
+static ssize_t demote_size_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *buf, size_t count)
+{
+	struct hstate *h, *demote_hstate;
+	unsigned long demote_size;
+	unsigned int demote_order;
+	int nid;
+
+	demote_size = (unsigned long)memparse(buf, NULL);
+
+	demote_hstate = size_to_hstate(demote_size);
+	if (!demote_hstate)
+		return -EINVAL;
+	demote_order = demote_hstate->order;
+
+	/* demote order must be smaller than hstate order */
+	h = kobj_to_hstate(kobj, &nid);
+	if (demote_order >= h->order)
+		return -EINVAL;
+
+	/* resize_lock synchronizes access to demote size and writes */
+	mutex_lock(&h->resize_lock);
+	h->demote_order = demote_order;
+	mutex_unlock(&h->resize_lock);
+
+	return count;
+}
+HSTATE_ATTR(demote_size);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -3449,6 +3586,16 @@ static const struct attribute_group hsta
 	.attrs = hstate_attrs,
 };
 
+static struct attribute *hstate_demote_attrs[] = {
+	&demote_size_attr.attr,
+	&demote_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hstate_demote_attr_group = {
+	.attrs = hstate_demote_attrs,
+};
+
 static int hugetlb_sysfs_add_hstate(struct hstate *h, struct kobject *parent,
 				    struct kobject **hstate_kobjs,
 				    const struct attribute_group *hstate_attr_group)
@@ -3466,6 +3613,12 @@ static int hugetlb_sysfs_add_hstate(stru
 		hstate_kobjs[hi] = NULL;
 	}
 
+	if (h->demote_order) {
+		if (sysfs_create_group(hstate_kobjs[hi],
+					&hstate_demote_attr_group))
+			pr_warn("HugeTLB unable to create demote interfaces for %s\n", h->name);
+	}
+
 	return retval;
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 129/262] mm/cma: add cma_pages_valid to determine if pages are in CMA
  2021-11-05 20:34 incoming Andrew Morton
                   ` (127 preceding siblings ...)
  2021-11-05 20:41 ` [patch 128/262] hugetlb: add demote hugetlb page sysfs interfaces Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 130/262] hugetlb: be sure to free demoted CMA pages to CMA Andrew Morton
                   ` (132 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, aneesh.kumar, david, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, nghialm78, osalvador, rientjes,
	songmuchun, torvalds, ziy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: mm/cma: add cma_pages_valid to determine if pages are in CMA

Add a new interface, cma_pages_valid(), which indicates whether the
specified pages are part of a CMA region.  This interface will be used in
a subsequent patch by the hugetlb code.

In order to keep the same amount of DEBUG information, a pr_debug() call
is added to cma_pages_valid().  In the case where a page passed to
cma_release() is not in a CMA region, the debug message will now be
printed from cma_pages_valid() instead of cma_release().
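
A rough sketch of how a caller can use the new helper when deciding where
a range of pages should be freed (my_cma, page and order are placeholders
here, not names from this series):

    if (cma_pages_valid(my_cma, page, 1 << order))
            cma_release(my_cma, page, 1 << order);            /* back to CMA   */
    else
            free_contig_range(page_to_pfn(page), 1 << order); /* back to buddy */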

Link: https://lkml.kernel.org/r/20211007181918.136982-3-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/cma.h |    1 +
 mm/cma.c            |   24 ++++++++++++++++++++----
 2 files changed, 21 insertions(+), 4 deletions(-)

--- a/include/linux/cma.h~mm-cma-add-cma_pages_valid-to-determine-if-pages-are-in-cma
+++ a/include/linux/cma.h
@@ -46,6 +46,7 @@ extern int cma_init_reserved_mem(phys_ad
 					struct cma **res_cma);
 extern struct page *cma_alloc(struct cma *cma, unsigned long count, unsigned int align,
 			      bool no_warn);
+extern bool cma_pages_valid(struct cma *cma, const struct page *pages, unsigned long count);
 extern bool cma_release(struct cma *cma, const struct page *pages, unsigned long count);
 
 extern int cma_for_each_area(int (*it)(struct cma *cma, void *data), void *data);
--- a/mm/cma.c~mm-cma-add-cma_pages_valid-to-determine-if-pages-are-in-cma
+++ a/mm/cma.c
@@ -524,6 +524,25 @@ out:
 	return page;
 }
 
+bool cma_pages_valid(struct cma *cma, const struct page *pages,
+		     unsigned long count)
+{
+	unsigned long pfn;
+
+	if (!cma || !pages)
+		return false;
+
+	pfn = page_to_pfn(pages);
+
+	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count) {
+		pr_debug("%s(page %p, count %lu)\n", __func__,
+						(void *)pages, count);
+		return false;
+	}
+
+	return true;
+}
+
 /**
  * cma_release() - release allocated pages
  * @cma:   Contiguous memory region for which the allocation is performed.
@@ -539,16 +558,13 @@ bool cma_release(struct cma *cma, const
 {
 	unsigned long pfn;
 
-	if (!cma || !pages)
+	if (!cma_pages_valid(cma, pages, count))
 		return false;
 
 	pr_debug("%s(page %p, count %lu)\n", __func__, (void *)pages, count);
 
 	pfn = page_to_pfn(pages);
 
-	if (pfn < cma->base_pfn || pfn >= cma->base_pfn + cma->count)
-		return false;
-
 	VM_BUG_ON(pfn + count > cma->base_pfn + cma->count);
 
 	free_contig_range(pfn, count);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 130/262] hugetlb: be sure to free demoted CMA pages to CMA
  2021-11-05 20:34 incoming Andrew Morton
                   ` (128 preceding siblings ...)
  2021-11-05 20:41 ` [patch 129/262] mm/cma: add cma_pages_valid to determine if pages are in CMA Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 131/262] hugetlb: add demote bool to gigantic page routines Andrew Morton
                   ` (131 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, aneesh.kumar, david, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, nghialm78, osalvador, rientjes,
	songmuchun, torvalds, ziy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: be sure to free demoted CMA pages to CMA

When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size.  For example, on x86 a 1G page can be
demoted to 512 2M pages.  However, gigantic pages can potentially be
allocated from CMA.  If a gigantic page which was allocated from CMA is
demoted, the corresponding demoted pages need to be returned to CMA.

Use the new interface cma_pages_valid() to determine if a non-gigantic
hugetlb page should be freed to CMA.  Also, clear the mapping field of
these pages, as expected by cma_release().

This also requires a change to CMA region creation for gigantic pages. 
CMA uses a per-region bit map to track allocations.  When setting up the
region, you specify how many pages each bit represents.  Currently, only
gigantic pages are allocated/freed from CMA so the region is set up such
that one bit represents a gigantic page size allocation.

With demote, a gigantic page (allocation) could be split into smaller
size pages, and these smaller size pages will be freed to CMA.  So, since
the per-region bit map needs to represent the smallest allocation/free
size, it now needs to be set up for the smallest huge page size which can
be freed to CMA.

Unfortunately, we set up the CMA region for huge pages before we set up
the huge page sizes (hstates).  So, technically we do not know the
smallest huge page size, as this can change via command line options and
architecture-specific code.  Therefore, at region setup time we use
HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be
given back to CMA.  It is possible that this value is sub-optimal for
some architectures/config options.  If needed, this can be addressed in
follow-on work.
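
As a rough arithmetic sketch of the granularity choice, with assumed x86
numbers (4KB base pages; these constants are illustrative, not taken from
this patch):

    #define EXAMPLE_GIGANTIC_ORDER  18      /* 1GB gigantic page          */
    #define EXAMPLE_DEMOTE_ORDER     9      /* 2MB == HUGETLB_PAGE_ORDER  */

    /* With order_per_bit == HUGETLB_PAGE_ORDER each CMA bitmap bit covers
     * 2MB, so a demoted CMA-backed 1GB page produces this many chunks
     * that can each be released to CMA independently:
     */
    unsigned long chunks = 1UL << (EXAMPLE_GIGANTIC_ORDER - EXAMPLE_DEMOTE_ORDER); /* 512 */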

Link: https://lkml.kernel.org/r/20211007181918.136982-4-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   41 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 39 insertions(+), 2 deletions(-)

--- a/mm/hugetlb.c~hugetlb-be-sure-to-free-demoted-cma-pages-to-cma
+++ a/mm/hugetlb.c
@@ -50,6 +50,16 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
 static struct cma *hugetlb_cma[MAX_NUMNODES];
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
+				1 << order);
+}
+#else
+static bool hugetlb_cma_page(struct page *page, unsigned int order)
+{
+	return false;
+}
 #endif
 static unsigned long hugetlb_cma_size __initdata;
 
@@ -1272,6 +1282,7 @@ static void destroy_compound_gigantic_pa
 	atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1476,7 +1487,13 @@ static void __update_and_free_page(struc
 				1 << PG_active | 1 << PG_private |
 				1 << PG_writeback);
 	}
-	if (hstate_is_gigantic(h)) {
+
+	/*
+	 * Non-gigantic pages demoted from CMA allocated gigantic pages
+	 * need to be given back to CMA in free_gigantic_page.
+	 */
+	if (hstate_is_gigantic(h) ||
+	    hugetlb_cma_page(page, huge_page_order(h))) {
 		destroy_compound_gigantic_page(page, huge_page_order(h));
 		free_gigantic_page(page, huge_page_order(h));
 	} else {
@@ -3001,9 +3018,13 @@ static void __init hugetlb_init_hstates(
 		 * h->demote_order is initially 0.
 		 * - We can not demote gigantic pages if runtime freeing
 		 *   is not supported, so skip this.
+		 * - If CMA allocation is possible, we can not demote
+		 *   HUGETLB_PAGE_ORDER or smaller size pages.
 		 */
 		if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 			continue;
+		if (hugetlb_cma_size && h->order <= HUGETLB_PAGE_ORDER)
+			continue;
 		for_each_hstate(h2) {
 			if (h2 == h)
 				continue;
@@ -3555,6 +3576,8 @@ static ssize_t demote_size_store(struct
 	if (!demote_hstate)
 		return -EINVAL;
 	demote_order = demote_hstate->order;
+	if (demote_order < HUGETLB_PAGE_ORDER)
+		return -EINVAL;
 
 	/* demote order must be smaller than hstate order */
 	h = kobj_to_hstate(kobj, &nid);
@@ -6543,6 +6566,7 @@ void __init hugetlb_cma_reserve(int orde
 	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
 		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
 			(PAGE_SIZE << order) / SZ_1M);
+		hugetlb_cma_size = 0;
 		return;
 	}
 
@@ -6563,7 +6587,13 @@ void __init hugetlb_cma_reserve(int orde
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
-		res = cma_declare_contiguous_nid(0, size, 0, PAGE_SIZE << order,
+		/*
+		 * Note that 'order per bit' is based on smallest size that
+		 * may be returned to CMA allocator in the case of
+		 * huge page demotion.
+		 */
+		res = cma_declare_contiguous_nid(0, size, 0,
+						PAGE_SIZE << HUGETLB_PAGE_ORDER,
 						 0, false, name,
 						 &hugetlb_cma[nid], nid);
 		if (res) {
@@ -6579,6 +6609,13 @@ void __init hugetlb_cma_reserve(int orde
 		if (reserved >= hugetlb_cma_size)
 			break;
 	}
+
+	if (!reserved)
+		/*
+		 * hugetlb_cma_size is used to determine if allocations from
+		 * cma are possible.  Set to zero if no cma regions are set up.
+		 */
+		hugetlb_cma_size = 0;
 }
 
 void __init hugetlb_cma_check(void)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 131/262] hugetlb: add demote bool to gigantic page routines
  2021-11-05 20:34 incoming Andrew Morton
                   ` (129 preceding siblings ...)
  2021-11-05 20:41 ` [patch 130/262] hugetlb: be sure to free demoted CMA pages to CMA Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 132/262] hugetlb: add hugetlb demote page support Andrew Morton
                   ` (130 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, aneesh.kumar, david, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, nghialm78, osalvador, rientjes,
	songmuchun, torvalds, ziy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: add demote bool to gigantic page routines

The routines remove_hugetlb_page and destroy_compound_gigantic_page
will remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator.  In the process of doing this,
they make all base pages reference counted.

The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages.  It assumes that all these base pages are reference
counted.

During demotion, a gigantic page will be split into huge pages of a
smaller size.  This logically involves use of the routines
remove_hugetlb_page and destroy_compound_gigantic_page, followed by
prep_compound*_page for each smaller huge page.

When pages are reference counted (ref count >= 0), additional
speculative ref counts could be taken as described in previous commits
[1] and [2].  This could result in errors while demoting a huge page. 
Quite a bit of code would need to be created to handle all possible
issues.

Instead of dealing with the possibility of speculative ref counts,
avoid it entirely by keeping ref counts at zero during the demote
process.  Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page.  If the
boolean is set, the remove and destroy routines will not reference
count pages, and the prep routine will not expect reference counted
pages.

'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.

[1] https://lore.kernel.org/linux-mm/20210622021423.154662-3-mike.kravetz@oracle.com/
[2] https://lore.kernel.org/linux-mm/20210809184832.18342-3-mike.kravetz@oracle.com/

Link: https://lkml.kernel.org/r/20211007181918.136982-5-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   54 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 43 insertions(+), 11 deletions(-)

--- a/mm/hugetlb.c~hugetlb-add-demote-bool-to-gigantic-page-routines
+++ a/mm/hugetlb.c
@@ -1271,8 +1271,8 @@ static int hstate_next_node_to_free(stru
 		nr_nodes--)
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static void __destroy_compound_gigantic_page(struct page *page,
+					unsigned int order, bool demote)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1284,7 +1284,8 @@ static void destroy_compound_gigantic_pa
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
 		p->mapping = NULL;
 		clear_compound_head(p);
-		set_page_refcounted(p);
+		if (!demote)
+			set_page_refcounted(p);
 	}
 
 	set_compound_order(page, 0);
@@ -1292,6 +1293,12 @@ static void destroy_compound_gigantic_pa
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, false);
+}
+
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
@@ -1364,12 +1371,15 @@ static inline void destroy_compound_giga
 
 /*
  * Remove hugetlb page from lists, and update dtor so that page appears
- * as just a compound page.  A reference is held on the page.
+ * as just a compound page.
+ *
+ * A reference is held on the page, except in the case of demote.
  *
  * Must be called with hugetlb lock held.
  */
-static void remove_hugetlb_page(struct hstate *h, struct page *page,
-							bool adjust_surplus)
+static void __remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus,
+							bool demote)
 {
 	int nid = page_to_nid(page);
 
@@ -1407,8 +1417,12 @@ static void remove_hugetlb_page(struct h
 	 *
 	 * This handles the case where more than one ref is held when and
 	 * after update_and_free_page is called.
+	 *
+	 * In the case of demote we do not ref count the page as it will soon
+	 * be turned into a page of smaller size.
 	 */
-	set_page_refcounted(page);
+	if (!demote)
+		set_page_refcounted(page);
 	if (hstate_is_gigantic(h))
 		set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	else
@@ -1418,6 +1432,12 @@ static void remove_hugetlb_page(struct h
 	h->nr_huge_pages_node[nid]--;
 }
 
+static void remove_hugetlb_page(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, false);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1681,7 +1701,8 @@ static void prep_new_huge_page(struct hs
 	spin_unlock_irq(&hugetlb_lock);
 }
 
-static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+static bool __prep_compound_gigantic_page(struct page *page, unsigned int order,
+								bool demote)
 {
 	int i, j;
 	int nr_pages = 1 << order;
@@ -1719,10 +1740,16 @@ static bool prep_compound_gigantic_page(
 		 * the set of pages can not be converted to a gigantic page.
 		 * The caller who allocated the pages should then discard the
 		 * pages using the appropriate free interface.
+		 *
+		 * In the case of demote, the ref count will be zero.
 		 */
-		if (!page_ref_freeze(p, 1)) {
-			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
-			goto out_error;
+		if (!demote) {
+			if (!page_ref_freeze(p, 1)) {
+				pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+				goto out_error;
+			}
+		} else {
+			VM_BUG_ON_PAGE(page_count(p), p);
 		}
 		set_page_count(p, 0);
 		set_compound_head(p, page);
@@ -1747,6 +1774,11 @@ out_error:
 	return false;
 }
 
+static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, false);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 132/262] hugetlb: add hugetlb demote page support
  2021-11-05 20:34 incoming Andrew Morton
                   ` (130 preceding siblings ...)
  2021-11-05 20:41 ` [patch 131/262] hugetlb: add demote bool to gigantic page routines Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 133/262] mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged Andrew Morton
                   ` (129 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, aneesh.kumar, david, linux-mm, mhocko, mike.kravetz,
	mm-commits, naoya.horiguchi, nghialm78, osalvador, rientjes,
	songmuchun, torvalds, ziy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: add hugetlb demote page support

Demote page functionality will split a huge page into a number of huge
pages of a smaller size.  For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages.  Demotion is done 'in place' by simply
splitting the huge page.

Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_hugetlb_page and prep_compound_gigantic_page for use by
demote code.

[mike.kravetz@oracle.com: v4]
  Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@oracle.com
Link: https://lkml.kernel.org/r/20211007181918.136982-6-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: Nghia Le <nghialm78@gmail.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |  100 +++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 92 insertions(+), 8 deletions(-)

--- a/mm/hugetlb.c~hugetlb-add-hugetlb-demote-page-support
+++ a/mm/hugetlb.c
@@ -1270,7 +1270,7 @@ static int hstate_next_node_to_free(stru
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+/* used to demote non-gigantic_huge pages as well */
 static void __destroy_compound_gigantic_page(struct page *page,
 					unsigned int order, bool demote)
 {
@@ -1293,6 +1293,13 @@ static void __destroy_compound_gigantic_
 	__ClearPageHead(page);
 }
 
+static void destroy_compound_hugetlb_page_for_demote(struct page *page,
+					unsigned int order)
+{
+	__destroy_compound_gigantic_page(page, order, true);
+}
+
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void destroy_compound_gigantic_page(struct page *page,
 					unsigned int order)
 {
@@ -1438,6 +1445,12 @@ static void remove_hugetlb_page(struct h
 	__remove_hugetlb_page(h, page, adjust_surplus, false);
 }
 
+static void remove_hugetlb_page_for_demote(struct hstate *h, struct page *page,
+							bool adjust_surplus)
+{
+	__remove_hugetlb_page(h, page, adjust_surplus, true);
+}
+
 static void add_hugetlb_page(struct hstate *h, struct page *page,
 			     bool adjust_surplus)
 {
@@ -1779,6 +1792,12 @@ static bool prep_compound_gigantic_page(
 	return __prep_compound_gigantic_page(page, order, false);
 }
 
+static bool prep_compound_gigantic_page_for_demote(struct page *page,
+							unsigned int order)
+{
+	return __prep_compound_gigantic_page(page, order, true);
+}
+
 /*
  * PageHuge() only returns true for hugetlbfs pages, but not for normal or
  * transparent huge pages.  See the PageTransHuge() documentation for more
@@ -3304,9 +3323,72 @@ out:
 	return 0;
 }
 
+static int demote_free_huge_page(struct hstate *h, struct page *page)
+{
+	int i, nid = page_to_nid(page);
+	struct hstate *target_hstate;
+	int rc = 0;
+
+	target_hstate = size_to_hstate(PAGE_SIZE << h->demote_order);
+
+	remove_hugetlb_page_for_demote(h, page, false);
+	spin_unlock_irq(&hugetlb_lock);
+
+	rc = alloc_huge_page_vmemmap(h, page);
+	if (rc) {
+		/* Allocation of vmemmmap failed, we can not demote page */
+		spin_lock_irq(&hugetlb_lock);
+		set_page_refcounted(page);
+		add_hugetlb_page(h, page, false);
+		return rc;
+	}
+
+	/*
+	 * Use destroy_compound_hugetlb_page_for_demote for all huge page
+	 * sizes as it will not ref count pages.
+	 */
+	destroy_compound_hugetlb_page_for_demote(page, huge_page_order(h));
+
+	/*
+	 * Taking target hstate mutex synchronizes with set_max_huge_pages.
+	 * Without the mutex, pages added to target hstate could be marked
+	 * as surplus.
+	 *
+	 * Note that we already hold h->resize_lock.  To prevent deadlock,
+	 * use the convention of always taking larger size hstate mutex first.
+	 */
+	mutex_lock(&target_hstate->resize_lock);
+	for (i = 0; i < pages_per_huge_page(h);
+				i += pages_per_huge_page(target_hstate)) {
+		if (hstate_is_gigantic(target_hstate))
+			prep_compound_gigantic_page_for_demote(page + i,
+							target_hstate->order);
+		else
+			prep_compound_page(page + i, target_hstate->order);
+		set_page_private(page + i, 0);
+		set_page_refcounted(page + i);
+		prep_new_huge_page(target_hstate, page + i, nid);
+		put_page(page + i);
+	}
+	mutex_unlock(&target_hstate->resize_lock);
+
+	spin_lock_irq(&hugetlb_lock);
+
+	/*
+	 * Not absolutely necessary, but for consistency update max_huge_pages
+	 * based on pool changes for the demoted page.
+	 */
+	h->max_huge_pages--;
+	target_hstate->max_huge_pages += pages_per_huge_page(h);
+
+	return rc;
+}
+
 static int demote_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
 	__must_hold(&hugetlb_lock)
 {
+	int nr_nodes, node;
+	struct page *page;
 	int rc = 0;
 
 	lockdep_assert_held(&hugetlb_lock);
@@ -3317,9 +3399,15 @@ static int demote_pool_huge_page(struct
 		return -EINVAL;		/* internal error */
 	}
 
-	/*
-	 * TODO - demote fucntionality will be added in subsequent patch
-	 */
+	for_each_node_mask_to_free(h, nr_nodes, node, nodes_allowed) {
+		if (!list_empty(&h->hugepage_freelists[node])) {
+			page = list_entry(h->hugepage_freelists[node].next,
+					struct page, lru);
+			rc = demote_free_huge_page(h, page);
+			break;
+		}
+	}
+
 	return rc;
 }
 
@@ -3554,10 +3642,6 @@ static ssize_t demote_store(struct kobje
 		/*
 		 * Check for available pages to demote each time thorough the
 		 * loop as demote_pool_huge_page will drop hugetlb_lock.
-		 *
-		 * NOTE: demote_pool_huge_page does not yet drop hugetlb_lock
-		 * but will when full demote functionality is added in a later
-		 * patch.
 		 */
 		if (nid != NUMA_NO_NODE)
 			nr_available = h->free_huge_pages_node[nid];
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 133/262] mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged
  2021-11-05 20:34 incoming Andrew Morton
                   ` (131 preceding siblings ...)
  2021-11-05 20:41 ` [patch 132/262] hugetlb: add hugetlb demote page support Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 134/262] mm, hugepages: add mremap() support for hugepage backed vma Andrew Morton
                   ` (128 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, liangcaifan19, linux-mm, mike.kravetz, mm-commits,
	torvalds, zhang.lyra

From: Liangcai Fan <liangcaifan19@gmail.com>
Subject: mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged

When transparent huge pages are initialized, min_free_kbytes is
calculated according to what khugepaged expects.  So when transparent
huge pages are disabled, min_free_kbytes should be recalculated instead
of keeping the higher value set by khugepaged.
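
For illustration, the effect can be observed from user space by writing
"never" to the THP sysfs knob and then reading min_free_kbytes.  A
minimal sketch follows (needs root; the printed value is machine
dependent, and with this patch it should drop back to the non-THP value
unless the admin set min_free_kbytes explicitly):

#include <stdio.h>

/* Disable THP, then show min_free_kbytes.  With this patch applied the
 * value is recalculated once khugepaged stops.
 */
int main(void)
{
	FILE *thp = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "w");
	FILE *mfk;
	char buf[64];

	if (!thp) {
		perror("transparent_hugepage/enabled");
		return 1;
	}
	fputs("never\n", thp);
	fclose(thp);

	mfk = fopen("/proc/sys/vm/min_free_kbytes", "r");
	if (!mfk || !fgets(buf, sizeof(buf), mfk)) {
		perror("min_free_kbytes");
		return 1;
	}
	printf("min_free_kbytes after disabling THP: %s", buf);
	fclose(mfk);
	return 0;
}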

Link: https://lkml.kernel.org/r/1633937809-16558-1-git-send-email-liangcaifan19@gmail.com
Signed-off-by: Liangcai Fan <liangcaifan19@gmail.com>
Signed-off-by: Chunyan Zhang <zhang.lyra@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    1 +
 mm/khugepaged.c    |   10 ++++++++--
 mm/page_alloc.c    |    7 ++++++-
 3 files changed, 15 insertions(+), 3 deletions(-)

--- a/include/linux/mm.h~mm-khugepaged-recalculate-min_free_kbytes-after-stopping-khugepaged
+++ a/include/linux/mm.h
@@ -2453,6 +2453,7 @@ extern void memmap_init_range(unsigned l
 		unsigned long, unsigned long, enum meminit_context,
 		struct vmem_altmap *, int migratetype);
 extern void setup_per_zone_wmarks(void);
+extern void calculate_min_free_kbytes(void);
 extern int __meminit init_per_zone_wmark_min(void);
 extern void mem_init(void);
 extern void __init mmap_init(void);
--- a/mm/khugepaged.c~mm-khugepaged-recalculate-min_free_kbytes-after-stopping-khugepaged
+++ a/mm/khugepaged.c
@@ -2299,6 +2299,11 @@ static void set_recommended_min_free_kby
 	int nr_zones = 0;
 	unsigned long recommended_min;
 
+	if (!khugepaged_enabled()) {
+		calculate_min_free_kbytes();
+		goto update_wmarks;
+	}
+
 	for_each_populated_zone(zone) {
 		/*
 		 * We don't need to worry about fragmentation of
@@ -2334,6 +2339,8 @@ static void set_recommended_min_free_kby
 
 		min_free_kbytes = recommended_min;
 	}
+
+update_wmarks:
 	setup_per_zone_wmarks();
 }
 
@@ -2355,12 +2362,11 @@ int start_stop_khugepaged(void)
 
 		if (!list_empty(&khugepaged_scan.mm_head))
 			wake_up_interruptible(&khugepaged_wait);
-
-		set_recommended_min_free_kbytes();
 	} else if (khugepaged_thread) {
 		kthread_stop(khugepaged_thread);
 		khugepaged_thread = NULL;
 	}
+	set_recommended_min_free_kbytes();
 fail:
 	mutex_unlock(&khugepaged_mutex);
 	return err;
--- a/mm/page_alloc.c~mm-khugepaged-recalculate-min_free_kbytes-after-stopping-khugepaged
+++ a/mm/page_alloc.c
@@ -8469,7 +8469,7 @@ void setup_per_zone_wmarks(void)
  * 8192MB:	11584k
  * 16384MB:	16384k
  */
-int __meminit init_per_zone_wmark_min(void)
+void calculate_min_free_kbytes(void)
 {
 	unsigned long lowmem_kbytes;
 	int new_min_free_kbytes;
@@ -8483,6 +8483,11 @@ int __meminit init_per_zone_wmark_min(vo
 		pr_warn("min_free_kbytes is not updated to %d because user defined value %d is preferred\n",
 				new_min_free_kbytes, user_min_free_kbytes);
 
+}
+
+int __meminit init_per_zone_wmark_min(void)
+{
+	calculate_min_free_kbytes();
 	setup_per_zone_wmarks();
 	refresh_zone_stat_thresholds();
 	setup_per_zone_lowmem_reserve();
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 134/262] mm, hugepages: add mremap() support for hugepage backed vma
  2021-11-05 20:34 incoming Andrew Morton
                   ` (132 preceding siblings ...)
  2021-11-05 20:41 ` [patch 133/262] mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 135/262] mm, hugepages: add hugetlb vma mremap() test Andrew Morton
                   ` (127 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, almasrymina, ckennelly, kenchen, kirill, linux-mm, mhocko,
	mike.kravetz, mm-commits, torvalds, vbabka

From: Mina Almasry <almasrymina@google.com>
Subject: mm, hugepages: add mremap() support for hugepage backed vma

Support mremap() for hugepage-backed vma segments by simply
repositioning the page table entries: on mremap() they are moved to the
new virtual address.

Hugetlb mremap() support is of course generic; my motivating use case is a
library (hugepage_text), which reloads the ELF text of executables in
hugepages.  This significantly increases the execution performance of said
executables.

The mremap operation on hugepages is restricted to at most the size of
the original mapping, as the underlying hugetlb reservation is not yet
capable of handling remapping to a larger size.

During the mremap() operation we detect pmd_share'd mappings and
unshare them; on later access and fault the sharing is established
again.
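
For illustration, a minimal user-space sketch of an mremap() on a
hugetlb-backed mapping is below.  The hugetlbfs mount point /mnt/huge,
the file name and the 2MB huge page size are assumptions for the
example; the full selftest added later in this series exercises the
same path more thoroughly.

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPAGE_SIZE	(2UL << 20)	/* assumption: 2MB huge pages */
#define LEN		(16 * HPAGE_SIZE)

int main(void)
{
	/* assumption: hugetlbfs mounted at /mnt/huge */
	int fd = open("/mnt/huge/mremap-test", O_CREAT | O_RDWR, 0600);
	char *old, *new;
	void *raw, *dst;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	old = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		   MAP_SHARED | MAP_POPULATE, fd, 0);
	if (old == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	old[0] = 'x';

	/* reserve a huge-page-aligned destination, then move the mapping */
	raw = mmap(NULL, LEN + HPAGE_SIZE, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED) {
		perror("mmap dst");
		return 1;
	}
	dst = (void *)(((unsigned long)raw + HPAGE_SIZE - 1) &
		       ~(HPAGE_SIZE - 1));

	new = mremap(old, LEN, LEN, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (new == MAP_FAILED) {
		perror("mremap");
		return 1;
	}
	printf("moved to %p, first byte '%c'\n", (void *)new, new[0]);

	munmap(new, LEN);
	close(fd);
	return 0;
}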

Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/hugetlb.h |   19 ++++++
 mm/hugetlb.c            |  111 +++++++++++++++++++++++++++++++++++---
 mm/mremap.c             |   36 +++++++++++-
 3 files changed, 157 insertions(+), 9 deletions(-)

--- a/include/linux/hugetlb.h~mm-hugepages-add-mremap-support-for-hugepage-backed-vma
+++ a/include/linux/hugetlb.h
@@ -124,6 +124,7 @@ struct hugepage_subpool *hugepage_new_su
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 void reset_vma_resv_huge_pages(struct vm_area_struct *vma);
+void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int hugetlb_sysctl_handler(struct ctl_table *, int, void *, size_t *, loff_t *);
 int hugetlb_overcommit_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
@@ -132,6 +133,10 @@ int hugetlb_treat_movable_handler(struct
 int hugetlb_mempolicy_sysctl_handler(struct ctl_table *, int, void *, size_t *,
 		loff_t *);
 
+int move_hugetlb_page_tables(struct vm_area_struct *vma,
+			     struct vm_area_struct *new_vma,
+			     unsigned long old_addr, unsigned long new_addr,
+			     unsigned long len);
 int copy_hugetlb_page_range(struct mm_struct *, struct mm_struct *, struct vm_area_struct *);
 long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
 			 struct page **, struct vm_area_struct **,
@@ -215,6 +220,10 @@ static inline void reset_vma_resv_huge_p
 {
 }
 
+static inline void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
+{
+}
+
 static inline unsigned long hugetlb_total_pages(void)
 {
 	return 0;
@@ -260,6 +269,16 @@ static inline int copy_hugetlb_page_rang
 {
 	BUG();
 	return 0;
+}
+
+static inline int move_hugetlb_page_tables(struct vm_area_struct *vma,
+					   struct vm_area_struct *new_vma,
+					   unsigned long old_addr,
+					   unsigned long new_addr,
+					   unsigned long len)
+{
+	BUG();
+	return 0;
 }
 
 static inline void hugetlb_report_meminfo(struct seq_file *m)
--- a/mm/hugetlb.c~mm-hugepages-add-mremap-support-for-hugepage-backed-vma
+++ a/mm/hugetlb.c
@@ -1014,6 +1014,35 @@ void reset_vma_resv_huge_pages(struct vm
 		vma->vm_private_data = (void *)0;
 }
 
+/*
+ * Reset and decrement one ref on hugepage private reservation.
+ * Called with mm->mmap_sem writer semaphore held.
+ * This function should be only used by move_vma() and operate on
+ * same sized vma. It should never come here with last ref on the
+ * reservation.
+ */
+void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
+{
+	/*
+	 * Clear the old hugetlb private page reservation.
+	 * It has already been transferred to new_vma.
+	 *
+	 * During a mremap() operation of a hugetlb vma we call move_vma()
+	 * which copies vma into new_vma and unmaps vma. After the copy
+	 * operation both new_vma and vma share a reference to the resv_map
+	 * struct, and at that point vma is about to be unmapped. We don't
+	 * want to return the reservation to the pool at unmap of vma because
+	 * the reservation still lives on in new_vma, so simply decrement the
+	 * ref here and remove the resv_map reference from this vma.
+	 */
+	struct resv_map *reservations = vma_resv_map(vma);
+
+	if (reservations && is_vma_resv_set(vma, HPAGE_RESV_OWNER))
+		kref_put(&reservations->refs, resv_map_release);
+
+	reset_vma_resv_huge_pages(vma);
+}
+
 /* Returns true if the VMA has associated reserve pages */
 static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 {
@@ -4718,6 +4747,82 @@ again:
 	return ret;
 }
 
+static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
+			  unsigned long new_addr, pte_t *src_pte)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct mm_struct *mm = vma->vm_mm;
+	pte_t *dst_pte, pte;
+	spinlock_t *src_ptl, *dst_ptl;
+
+	dst_pte = huge_pte_offset(mm, new_addr, huge_page_size(h));
+	dst_ptl = huge_pte_lock(h, mm, dst_pte);
+	src_ptl = huge_pte_lockptr(h, mm, src_pte);
+
+	/*
+	 * We don't have to worry about the ordering of src and dst ptlocks
+	 * because exclusive mmap_sem (or the i_mmap_lock) prevents deadlock.
+	 */
+	if (src_ptl != dst_ptl)
+		spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
+
+	pte = huge_ptep_get_and_clear(mm, old_addr, src_pte);
+	set_huge_pte_at(mm, new_addr, dst_pte, pte);
+
+	if (src_ptl != dst_ptl)
+		spin_unlock(src_ptl);
+	spin_unlock(dst_ptl);
+}
+
+int move_hugetlb_page_tables(struct vm_area_struct *vma,
+			     struct vm_area_struct *new_vma,
+			     unsigned long old_addr, unsigned long new_addr,
+			     unsigned long len)
+{
+	struct hstate *h = hstate_vma(vma);
+	struct address_space *mapping = vma->vm_file->f_mapping;
+	unsigned long sz = huge_page_size(h);
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long old_end = old_addr + len;
+	unsigned long old_addr_copy;
+	pte_t *src_pte, *dst_pte;
+	struct mmu_notifier_range range;
+
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, old_addr,
+				old_end);
+	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
+	mmu_notifier_invalidate_range_start(&range);
+	/* Prevent race with file truncation */
+	i_mmap_lock_write(mapping);
+	for (; old_addr < old_end; old_addr += sz, new_addr += sz) {
+		src_pte = huge_pte_offset(mm, old_addr, sz);
+		if (!src_pte)
+			continue;
+		if (huge_pte_none(huge_ptep_get(src_pte)))
+			continue;
+
+		/* old_addr arg to huge_pmd_unshare() is a pointer and so the
+		 * arg may be modified. Pass a copy instead to preserve the
+		 * value in old_addr.
+		 */
+		old_addr_copy = old_addr;
+
+		if (huge_pmd_unshare(mm, vma, &old_addr_copy, src_pte))
+			continue;
+
+		dst_pte = huge_pte_alloc(mm, new_vma, new_addr, sz);
+		if (!dst_pte)
+			break;
+
+		move_huge_pte(vma, old_addr, new_addr, src_pte);
+	}
+	i_mmap_unlock_write(mapping);
+	flush_tlb_range(vma, old_end - len, old_end);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return len + old_addr - old_end;
+}
+
 static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				   unsigned long start, unsigned long end,
 				   struct page *ref_page)
@@ -6257,12 +6362,6 @@ void adjust_range_if_pmd_sharing_possibl
  * sharing is possible.  For hugetlbfs, this prevents removal of any page
  * table entries associated with the address space.  This is important as we
  * are setting up sharing based on existing page table entries (mappings).
- *
- * NOTE: This routine is only called from huge_pte_alloc.  Some callers of
- * huge_pte_alloc know that sharing is not possible and do not take
- * i_mmap_rwsem as a performance optimization.  This is handled by the
- * if !vma_shareable check at the beginning of the routine. i_mmap_rwsem is
- * only required for subsequent processing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, struct vm_area_struct *vma,
 		      unsigned long addr, pud_t *pud)
--- a/mm/mremap.c~mm-hugepages-add-mremap-support-for-hugepage-backed-vma
+++ a/mm/mremap.c
@@ -489,6 +489,10 @@ unsigned long move_page_tables(struct vm
 	old_end = old_addr + len;
 	flush_cache_range(vma, old_addr, old_end);
 
+	if (is_vm_hugetlb_page(vma))
+		return move_hugetlb_page_tables(vma, new_vma, old_addr,
+						new_addr, len);
+
 	mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma, vma->vm_mm,
 				old_addr, old_end);
 	mmu_notifier_invalidate_range_start(&range);
@@ -646,6 +650,10 @@ static unsigned long move_vma(struct vm_
 		mremap_userfaultfd_prep(new_vma, uf);
 	}
 
+	if (is_vm_hugetlb_page(vma)) {
+		clear_vma_resv_huge_pages(vma);
+	}
+
 	/* Conceal VM_ACCOUNT so old reservation is not undone */
 	if (vm_flags & VM_ACCOUNT && !(flags & MREMAP_DONTUNMAP)) {
 		vma->vm_flags &= ~VM_ACCOUNT;
@@ -739,9 +747,6 @@ static struct vm_area_struct *vma_to_res
 			(vma->vm_flags & (VM_DONTEXPAND | VM_PFNMAP)))
 		return ERR_PTR(-EINVAL);
 
-	if (is_vm_hugetlb_page(vma))
-		return ERR_PTR(-EINVAL);
-
 	/* We can't remap across vm area boundaries */
 	if (old_len > vma->vm_end - addr)
 		return ERR_PTR(-EFAULT);
@@ -937,6 +942,31 @@ SYSCALL_DEFINE5(mremap, unsigned long, a
 
 	if (mmap_write_lock_killable(current->mm))
 		return -EINTR;
+	vma = find_vma(mm, addr);
+	if (!vma || vma->vm_start > addr) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	if (is_vm_hugetlb_page(vma)) {
+		struct hstate *h __maybe_unused = hstate_vma(vma);
+
+		old_len = ALIGN(old_len, huge_page_size(h));
+		new_len = ALIGN(new_len, huge_page_size(h));
+
+		/* addrs must be huge page aligned */
+		if (addr & ~huge_page_mask(h))
+			goto out;
+		if (new_addr & ~huge_page_mask(h))
+			goto out;
+
+		/*
+		 * Don't allow remap expansion, because the underlying hugetlb
+		 * reservation is not yet capable to handle split reservation.
+		 */
+		if (new_len > old_len)
+			goto out;
+	}
 
 	if (flags & (MREMAP_FIXED | MREMAP_DONTUNMAP)) {
 		ret = mremap_to(addr, old_len, new_addr, new_len,
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 135/262] mm, hugepages: add hugetlb vma mremap() test
  2021-11-05 20:34 incoming Andrew Morton
                   ` (133 preceding siblings ...)
  2021-11-05 20:41 ` [patch 134/262] mm, hugepages: add mremap() support for hugepage backed vma Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 136/262] hugetlb: support node specified when using cma for gigantic hugepages Andrew Morton
                   ` (126 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, almasrymina, ckennelly, kenchen, kirill, linux-mm, mhocko,
	mike.kravetz, mm-commits, torvalds, vbabka, wanjiabing

From: Mina Almasry <almasrymina@google.com>
Subject: mm, hugepages: add hugetlb vma mremap() test

[almasrymina@google.com: v8]
  Link: https://lkml.kernel.org/r/20211014200542.4126947-2-almasrymina@google.com
[wanjiabing@vivo.com: remove duplicated include in hugepage-mremap]
  Link: https://lkml.kernel.org/r/20211021122944.8857-1-wanjiabing@vivo.com
Link: https://lkml.kernel.org/r/20211013195825.3058275-2-almasrymina@google.com
Signed-off-by: Mina Almasry <almasrymina@google.com>
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ken Chen <kenchen@google.com>
Cc: Chris Kennelly <ckennelly@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill Shutemov <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/.gitignore        |    1 
 tools/testing/selftests/vm/Makefile          |    1 
 tools/testing/selftests/vm/hugepage-mremap.c |  160 +++++++++++++++++
 tools/testing/selftests/vm/run_vmtests.sh    |   11 +
 4 files changed, 173 insertions(+)

--- a/tools/testing/selftests/vm/.gitignore~mm-hugepages-add-hugetlb-vma-mremap-test
+++ a/tools/testing/selftests/vm/.gitignore
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0-only
 hugepage-mmap
+hugepage-mremap
 hugepage-shm
 khugepaged
 map_hugetlb
--- /dev/null
+++ a/tools/testing/selftests/vm/hugepage-mremap.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * hugepage-mremap:
+ *
+ * Example of remapping huge page memory in a user application using the
+ * mremap system call.  Code assumes a hugetlbfs filesystem is mounted
+ * at '/huge'.  The code will use 1GB worth of huge pages.
+ */
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <stdio.h>
+#include <unistd.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <fcntl.h> /* Definition of O_* constants */
+#include <sys/syscall.h> /* Definition of SYS_* constants */
+#include <unistd.h>
+#include <linux/userfaultfd.h>
+#include <sys/ioctl.h>
+
+#define LENGTH (1UL * 1024 * 1024 * 1024)
+
+#define PROTECTION (PROT_READ | PROT_WRITE | PROT_EXEC)
+#define FLAGS (MAP_SHARED | MAP_ANONYMOUS)
+
+static void check_bytes(char *addr)
+{
+	printf("First hex is %x\n", *((unsigned int *)addr));
+}
+
+static void write_bytes(char *addr)
+{
+	unsigned long i;
+
+	for (i = 0; i < LENGTH; i++)
+		*(addr + i) = (char)i;
+}
+
+static int read_bytes(char *addr)
+{
+	unsigned long i;
+
+	check_bytes(addr);
+	for (i = 0; i < LENGTH; i++)
+		if (*(addr + i) != (char)i) {
+			printf("Mismatch at %lu\n", i);
+			return 1;
+		}
+	return 0;
+}
+
+static void register_region_with_uffd(char *addr, size_t len)
+{
+	long uffd; /* userfaultfd file descriptor */
+	struct uffdio_api uffdio_api;
+	struct uffdio_register uffdio_register;
+
+	/* Create and enable userfaultfd object. */
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
+	if (uffd == -1) {
+		perror("userfaultfd");
+		exit(1);
+	}
+
+	uffdio_api.api = UFFD_API;
+	uffdio_api.features = 0;
+	if (ioctl(uffd, UFFDIO_API, &uffdio_api) == -1) {
+		perror("ioctl-UFFDIO_API");
+		exit(1);
+	}
+
+	/* Create a private anonymous mapping. The memory will be
+	 * demand-zero paged--that is, not yet allocated. When we
+	 * actually touch the memory, it will be allocated via
+	 * the userfaultfd.
+	 */
+
+	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
+		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	if (addr == MAP_FAILED) {
+		perror("mmap");
+		exit(1);
+	}
+
+	printf("Address returned by mmap() = %p\n", addr);
+
+	/* Register the memory range of the mapping we just created for
+	 * handling by the userfaultfd object. In mode, we request to track
+	 * missing pages (i.e., pages that have not yet been faulted in).
+	 */
+
+	uffdio_register.range.start = (unsigned long)addr;
+	uffdio_register.range.len = len;
+	uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
+	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register) == -1) {
+		perror("ioctl-UFFDIO_REGISTER");
+		exit(1);
+	}
+}
+
+int main(void)
+{
+	int ret = 0;
+
+	int fd = open("/huge/test", O_CREAT | O_RDWR, 0755);
+
+	if (fd < 0) {
+		perror("Open failed");
+		exit(1);
+	}
+
+	/* mmap to a PUD aligned address to hopefully trigger pmd sharing. */
+	unsigned long suggested_addr = 0x7eaa40000000;
+	void *haddr = mmap((void *)suggested_addr, LENGTH, PROTECTION,
+			   MAP_HUGETLB | MAP_SHARED | MAP_POPULATE, fd, 0);
+	printf("Map haddr: Returned address is %p\n", haddr);
+	if (haddr == MAP_FAILED) {
+		perror("mmap1");
+		exit(1);
+	}
+
+	/* mmap again to a dummy address to hopefully trigger pmd sharing. */
+	suggested_addr = 0x7daa40000000;
+	void *daddr = mmap((void *)suggested_addr, LENGTH, PROTECTION,
+			   MAP_HUGETLB | MAP_SHARED | MAP_POPULATE, fd, 0);
+	printf("Map daddr: Returned address is %p\n", daddr);
+	if (daddr == MAP_FAILED) {
+		perror("mmap3");
+		exit(1);
+	}
+
+	suggested_addr = 0x7faa40000000;
+	void *vaddr =
+		mmap((void *)suggested_addr, LENGTH, PROTECTION, FLAGS, -1, 0);
+	printf("Map vaddr: Returned address is %p\n", vaddr);
+	if (vaddr == MAP_FAILED) {
+		perror("mmap2");
+		exit(1);
+	}
+
+	register_region_with_uffd(haddr, LENGTH);
+
+	void *addr = mremap(haddr, LENGTH, LENGTH,
+			    MREMAP_MAYMOVE | MREMAP_FIXED, vaddr);
+	if (addr == MAP_FAILED) {
+		perror("mremap");
+		exit(1);
+	}
+
+	printf("Mremap: Returned address is %p\n", addr);
+	check_bytes(addr);
+	write_bytes(addr);
+	ret = read_bytes(addr);
+
+	munmap(addr, LENGTH);
+
+	return ret;
+}
--- a/tools/testing/selftests/vm/Makefile~mm-hugepages-add-hugetlb-vma-mremap-test
+++ a/tools/testing/selftests/vm/Makefile
@@ -29,6 +29,7 @@ TEST_GEN_FILES = compaction_test
 TEST_GEN_FILES += gup_test
 TEST_GEN_FILES += hmm-tests
 TEST_GEN_FILES += hugepage-mmap
+TEST_GEN_FILES += hugepage-mremap
 TEST_GEN_FILES += hugepage-shm
 TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += madv_populate
--- a/tools/testing/selftests/vm/run_vmtests.sh~mm-hugepages-add-hugetlb-vma-mremap-test
+++ a/tools/testing/selftests/vm/run_vmtests.sh
@@ -108,6 +108,17 @@ else
 	echo "[PASS]"
 fi
 
+echo "-----------------------"
+echo "running hugepage-mremap"
+echo "-----------------------"
+./hugepage-mremap
+if [ $? -ne 0 ]; then
+	echo "[FAIL]"
+	exitcode=1
+else
+	echo "[PASS]"
+fi
+
 echo "NOTE: The above hugetlb tests provide minimal coverage.  Use"
 echo "      https://github.com/libhugetlbfs/libhugetlbfs.git for"
 echo "      hugetlb regression testing."
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 136/262] hugetlb: support node specified when using cma for gigantic hugepages
  2021-11-05 20:34 incoming Andrew Morton
                   ` (134 preceding siblings ...)
  2021-11-05 20:41 ` [patch 135/262] mm, hugepages: add hugetlb vma mremap() test Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 137/262] mm: remove duplicate include in hugepage-mremap.c Andrew Morton
                   ` (125 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, baolin.wang, corbet, guro, linux-mm, mhocko, mike.kravetz,
	mm-commits, torvalds

From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: hugetlb: support node specified when using cma for gigantic hugepages

Currently the size of the CMA area for runtime allocation of gigantic
hugepages is balanced across all online nodes, but we also want to be
able to specify the CMA size per node, or for only one node in some
cases, similar to patch [1].

For example, on some multi-node systems each node's memory can differ,
so allocating the same CMA size on every node is not suitable for the
low-memory nodes.  Meanwhile, some workloads, like the DPDK case
mentioned by Zhenguo in patch [1], only need hugepages on one node.

On the other hand, we have some machines with multiple types of memory,
like DRAM and PMEM (persistent memory).  On such a system, we may want
to place all the hugepages only on the DRAM node, or to specify the
proportion between the DRAM and PMEM nodes, to tune the performance of
the workloads.

Thus this patch adds a node format for the 'hugetlb_cma' parameter to
support specifying the CMA size per node.  An example is as follows:

hugetlb_cma=0:5G,2:5G

which means reserving a 5G CMA area on node 0 and on node 2
respectively.  Users should then use the node-specific sysfs file to
allocate the gigantic hugepages if a CMA size was specified for that
node.

[1]
https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
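
For illustration, with the hugetlb_cma=0:5G,2:5G example above, a
couple of 1GB gigantic pages could then be requested from node 0
through the per-node sysfs file.  A minimal sketch (the node number,
the 1048576kB page size and the count are assumptions for the example;
run as root):

#include <stdio.h>

/* Request two 1GB gigantic pages from the CMA area reserved on node 0. */
int main(void)
{
	const char *path = "/sys/devices/system/node/node0/hugepages/"
			   "hugepages-1048576kB/nr_hugepages";
	FILE *f = fopen(path, "w");

	if (!f) {
		perror("nr_hugepages");
		return 1;
	}
	fprintf(f, "2\n");
	if (fclose(f)) {
		perror("write nr_hugepages");
		return 1;
	}
	return 0;
}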

Link: https://lkml.kernel.org/r/bb790775ca60bb8f4b26956bb3f6988f74e075c7.1634261144.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |    6 
 mm/hugetlb.c                                    |   86 ++++++++++++--
 2 files changed, 81 insertions(+), 11 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt~hugetlb-support-node-specified-when-using-cma-for-gigantic-hugepages
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1587,8 +1587,10 @@
 			registers.  Default set by CONFIG_HPET_MMAP_DEFAULT.
 
 	hugetlb_cma=	[HW,CMA] The size of a CMA area used for allocation
-			of gigantic hugepages.
-			Format: nn[KMGTPE]
+			of gigantic hugepages. Or using node format, the size
+			of a CMA area per node can be specified.
+			Format: nn[KMGTPE] or (node format)
+				<node>:nn[KMGTPE][,<node>:nn[KMGTPE]]
 
 			Reserve a CMA area of given size and allocate gigantic
 			hugepages using the CMA allocator. If enabled, the
--- a/mm/hugetlb.c~hugetlb-support-node-specified-when-using-cma-for-gigantic-hugepages
+++ a/mm/hugetlb.c
@@ -50,6 +50,7 @@ struct hstate hstates[HUGE_MAX_HSTATE];
 
 #ifdef CONFIG_CMA
 static struct cma *hugetlb_cma[MAX_NUMNODES];
+static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
 static bool hugetlb_cma_page(struct page *page, unsigned int order)
 {
 	return cma_pages_valid(hugetlb_cma[page_to_nid(page)], page,
@@ -6762,7 +6763,38 @@ static bool cma_reserve_called __initdat
 
 static int __init cmdline_parse_hugetlb_cma(char *p)
 {
-	hugetlb_cma_size = memparse(p, &p);
+	int nid, count = 0;
+	unsigned long tmp;
+	char *s = p;
+
+	while (*s) {
+		if (sscanf(s, "%lu%n", &tmp, &count) != 1)
+			break;
+
+		if (s[count] == ':') {
+			nid = tmp;
+			if (nid < 0 || nid >= MAX_NUMNODES)
+				break;
+
+			s += count + 1;
+			tmp = memparse(s, &s);
+			hugetlb_cma_size_in_node[nid] = tmp;
+			hugetlb_cma_size += tmp;
+
+			/*
+			 * Skip the separator if have one, otherwise
+			 * break the parsing.
+			 */
+			if (*s == ',')
+				s++;
+			else
+				break;
+		} else {
+			hugetlb_cma_size = memparse(p, &p);
+			break;
+		}
+	}
+
 	return 0;
 }
 
@@ -6771,6 +6803,7 @@ early_param("hugetlb_cma", cmdline_parse
 void __init hugetlb_cma_reserve(int order)
 {
 	unsigned long size, reserved, per_node;
+	bool node_specific_cma_alloc = false;
 	int nid;
 
 	cma_reserve_called = true;
@@ -6778,6 +6811,31 @@ void __init hugetlb_cma_reserve(int orde
 	if (!hugetlb_cma_size)
 		return;
 
+	for (nid = 0; nid < MAX_NUMNODES; nid++) {
+		if (hugetlb_cma_size_in_node[nid] == 0)
+			continue;
+
+		if (!node_state(nid, N_ONLINE)) {
+			pr_warn("hugetlb_cma: invalid node %d specified\n", nid);
+			hugetlb_cma_size -= hugetlb_cma_size_in_node[nid];
+			hugetlb_cma_size_in_node[nid] = 0;
+			continue;
+		}
+
+		if (hugetlb_cma_size_in_node[nid] < (PAGE_SIZE << order)) {
+			pr_warn("hugetlb_cma: cma area of node %d should be at least %lu MiB\n",
+				nid, (PAGE_SIZE << order) / SZ_1M);
+			hugetlb_cma_size -= hugetlb_cma_size_in_node[nid];
+			hugetlb_cma_size_in_node[nid] = 0;
+		} else {
+			node_specific_cma_alloc = true;
+		}
+	}
+
+	/* Validate the CMA size again in case some invalid nodes specified. */
+	if (!hugetlb_cma_size)
+		return;
+
 	if (hugetlb_cma_size < (PAGE_SIZE << order)) {
 		pr_warn("hugetlb_cma: cma area should be at least %lu MiB\n",
 			(PAGE_SIZE << order) / SZ_1M);
@@ -6785,20 +6843,30 @@ void __init hugetlb_cma_reserve(int orde
 		return;
 	}
 
-	/*
-	 * If 3 GB area is requested on a machine with 4 numa nodes,
-	 * let's allocate 1 GB on first three nodes and ignore the last one.
-	 */
-	per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
-	pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
-		hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
+	if (!node_specific_cma_alloc) {
+		/*
+		 * If 3 GB area is requested on a machine with 4 numa nodes,
+		 * let's allocate 1 GB on first three nodes and ignore the last one.
+		 */
+		per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+		pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
+			hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
+	}
 
 	reserved = 0;
 	for_each_node_state(nid, N_ONLINE) {
 		int res;
 		char name[CMA_MAX_NAME];
 
-		size = min(per_node, hugetlb_cma_size - reserved);
+		if (node_specific_cma_alloc) {
+			if (hugetlb_cma_size_in_node[nid] == 0)
+				continue;
+
+			size = hugetlb_cma_size_in_node[nid];
+		} else {
+			size = min(per_node, hugetlb_cma_size - reserved);
+		}
+
 		size = round_up(size, PAGE_SIZE << order);
 
 		snprintf(name, sizeof(name), "hugetlb%d", nid);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 137/262] mm: remove duplicate include in hugepage-mremap.c
  2021-11-05 20:34 incoming Andrew Morton
                   ` (135 preceding siblings ...)
  2021-11-05 20:41 ` [patch 136/262] hugetlb: support node specified when using cma for gigantic hugepages Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 138/262] hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro Andrew Morton
                   ` (124 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, ran.jianping, shuah, torvalds, zealci

From: Ran Jianping <ran.jianping@zte.com.cn>
Subject: mm: remove duplicate include in hugepage-mremap.c

Remove the duplicate include of 'unistd.h' from
'tools/testing/selftests/vm/hugepage-mremap.c'; it is also included on
line 23.

Link: https://lkml.kernel.org/r/20211018102336.869726-1-ran.jianping@zte.com.cn
Signed-off-by: Ran Jianping <ran.jianping@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/hugepage-mremap.c |    1 -
 1 file changed, 1 deletion(-)

--- a/tools/testing/selftests/vm/hugepage-mremap.c~mm-remove-duplicate-include-in-hugepage-mremapc
+++ a/tools/testing/selftests/vm/hugepage-mremap.c
@@ -15,7 +15,6 @@
 #include <errno.h>
 #include <fcntl.h> /* Definition of O_* constants */
 #include <sys/syscall.h> /* Definition of SYS_* constants */
-#include <unistd.h>
 #include <linux/userfaultfd.h>
 #include <sys/ioctl.h>
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 138/262] hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro
  2021-11-05 20:34 incoming Andrew Morton
                   ` (136 preceding siblings ...)
  2021-11-05 20:41 ` [patch 137/262] mm: remove duplicate include in hugepage-mremap.c Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 139/262] hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments Andrew Morton
                   ` (123 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, baolin.wang, linux-mm, mhocko, mike.kravetz, mm-commits, torvalds

From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro

Patch series "Some cleanups and improvements for hugetlb".

This patchset does some cleanups and improvements for hugetlb and
hugetlb_cgroup.


This patch (of 4):

Since commit 726b7bbe ("hugetlb_cgroup: fix illegal access to memory"),
the hugetlb_cgroup_from_counter() macro is not used any more, so remove it.

Link: https://lkml.kernel.org/r/cover.1634797639.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/f03b29b801fa9942466ab15334ec09988e124ae6.1634797639.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb_cgroup.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/mm/hugetlb_cgroup.c~hugetlb_cgroup-remove-unused-hugetlb_cgroup_from_counter-macro
+++ a/mm/hugetlb_cgroup.c
@@ -27,9 +27,6 @@
 #define MEMFILE_IDX(val)	(((val) >> 16) & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
-#define hugetlb_cgroup_from_counter(counter, idx)                   \
-	container_of(counter, struct hugetlb_cgroup, hugepage[idx])
-
 static struct hugetlb_cgroup *root_h_cgroup __read_mostly;
 
 static inline struct page_counter *
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 139/262] hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments
  2021-11-05 20:34 incoming Andrew Morton
                   ` (137 preceding siblings ...)
  2021-11-05 20:41 ` [patch 138/262] hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:41 ` [patch 140/262] hugetlb: remove redundant validation in has_same_uncharge_info() Andrew Morton
                   ` (122 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, baolin.wang, linux-mm, mhocko, mike.kravetz, mm-commits, torvalds

From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments

After commit 8382d914ebf7 ("mm, hugetlb: improve page-fault scalability"),
the hugetlb_instantiation_mutex lock was replaced by
hugetlb_fault_mutex_table to serialize faults on the same logical page.
Thus update the obsolete hugetlb_instantiation_mutex-related comments.

Link: https://lkml.kernel.org/r/4b3febeae37455ff7b74aa0aad16cc6909cf0926.1634797639.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb-replace-the-obsolete-hugetlb_instantiation_mutex-in-the-comments
+++ a/mm/hugetlb.c
@@ -5014,7 +5014,7 @@ static void unmap_ref_private(struct mm_
 
 /*
  * Hugetlb_cow() should be called with page lock of the original hugepage held.
- * Called with hugetlb_instantiation_mutex held and pte_page locked so we
+ * Called with hugetlb_fault_mutex_table held and pte_page locked so we
  * cannot race with other handlers or page migration.
  * Keep the pte_same checks anyway to make transition from the mutex easier.
  */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 140/262] hugetlb: remove redundant validation in has_same_uncharge_info()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (138 preceding siblings ...)
  2021-11-05 20:41 ` [patch 139/262] hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments Andrew Morton
@ 2021-11-05 20:41 ` Andrew Morton
  2021-11-05 20:42 ` [patch 141/262] hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range() Andrew Morton
                   ` (121 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:41 UTC (permalink / raw)
  To: akpm, baolin.wang, linux-mm, mhocko, mike.kravetz, mm-commits, torvalds

From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: hugetlb: remove redundant validation in has_same_uncharge_info()

The callers of has_same_uncharge_info() have already accessed the
original file_region and the new file_region, and they cannot be NULL
at this point.  So we can remove the file_region validation in
has_same_uncharge_info() to simplify the code.

Link: https://lkml.kernel.org/r/97fc68d3f8d34f63c204645e10d7a718997e50b7.1634797639.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/hugetlb.c~hugetlb-remove-redundant-validation-in-has_same_uncharge_info
+++ a/mm/hugetlb.c
@@ -332,8 +332,7 @@ static bool has_same_uncharge_info(struc
 				   struct file_region *org)
 {
 #ifdef CONFIG_CGROUP_HUGETLB
-	return rg && org &&
-	       rg->reservation_counter == org->reservation_counter &&
+	return rg->reservation_counter == org->reservation_counter &&
 	       rg->css == org->css;
 
 #else
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 141/262] hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (139 preceding siblings ...)
  2021-11-05 20:41 ` [patch 140/262] hugetlb: remove redundant validation in has_same_uncharge_info() Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 142/262] hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page Andrew Morton
                   ` (120 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, baolin.wang, linux-mm, mhocko, mike.kravetz, mm-commits, torvalds

From: Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range()

When calling hugetlb_resv_map_add(), we have already guaranteed that the
parameter 'to' is always larger than 'from', so hugetlb_resv_map_add()
never returns a negative value.  Thus remove the redundant VM_BUG_ON().

Link: https://lkml.kernel.org/r/2b565552f3d06753da1e8dda439c0d96d6d9a5a3.1634797639.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb-remove-redundant-vm_bug_on-in-add_reservation_in_range
+++ a/mm/hugetlb.c
@@ -445,7 +445,6 @@ static long add_reservation_in_range(str
 		add += hugetlb_resv_map_add(resv, rg, last_accounted_offset,
 					    t, h, h_cg, regions_needed);
 
-	VM_BUG_ON(add < 0);
 	return add;
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 142/262] hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page
  2021-11-05 20:34 incoming Andrew Morton
                   ` (140 preceding siblings ...)
  2021-11-05 20:42 ` [patch 141/262] hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range() Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 143/262] userfaultfd/selftests: don't rely on GNU extensions for random numbers Andrew Morton
                   ` (119 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, linux-mm, mike.kravetz, mm-commits, osalvador,
	pasha.tatashin, songmuchun, torvalds, willy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page

In commit 7118fc2906e29 ("hugetlb: address ref count racing in
prep_compound_gigantic_page"), page_ref_freeze is used to atomically zero
the ref count of tail pages iff they are 1.  The unconditional call to
set_page_count(0) was left in the code.  This call is after
page_ref_freeze so it is really a noop.

Remove redundant and unnecessary set_page_count call.

Link: https://lkml.kernel.org/r/20211026220635.35187-1-mike.kravetz@oracle.com
Fixes: 7118fc2906e29 ("hugetlb: address ref count racing in prep_compound_gigantic_page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Suggested-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/hugetlb.c~hugetlb-remove-unnecessary-set_page_count-in-prep_compound_gigantic_page
+++ a/mm/hugetlb.c
@@ -1792,7 +1792,6 @@ static bool __prep_compound_gigantic_pag
 		} else {
 			VM_BUG_ON_PAGE(page_count(p), p);
 		}
-		set_page_count(p, 0);
 		set_compound_head(p, page);
 	}
 	atomic_set(compound_mapcount_ptr(page), -1);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 143/262] userfaultfd/selftests: don't rely on GNU extensions for random numbers
  2021-11-05 20:34 incoming Andrew Morton
                   ` (141 preceding siblings ...)
  2021-11-05 20:42 ` [patch 142/262] hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 144/262] userfaultfd/selftests: fix feature support detection Andrew Morton
                   ` (118 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, axelrasmussen, linux-mm, mm-commits, peterx, shuah, torvalds

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: don't rely on GNU extensions for random numbers

Patch series "Small userfaultfd selftest fixups", v2.


This patch (of 3):

Two arguments for doing this:

First, and maybe most importantly, the resulting code is significantly
shorter / simpler.

Then, we avoid using GNU libc extensions.  Why does this matter?  It makes
testing userfaultfd with the selftest easier e.g.  on distros which use
something other than glibc (e.g., Alpine, which uses musl); basically, it
makes the test more portable.
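
For illustration, the getrandom()-based replacement boils down to
something like this minimal sketch (nr_pages is just a stand-in value
here):

#include <stdio.h>
#include <sys/random.h>

/* pick a random page number without glibc's random_r()/initstate_r() */
int main(void)
{
	unsigned long page_nr, nr_pages = 1024;

	if (getrandom(&page_nr, sizeof(page_nr), 0) != sizeof(page_nr)) {
		perror("getrandom");
		return 1;
	}
	printf("page_nr = %lu\n", page_nr % nr_pages);
	return 0;
}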

Link: https://lkml.kernel.org/r/20210930212309.4001967-2-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   26 +++------------------
 1 file changed, 4 insertions(+), 22 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-dont-rely-on-gnu-extensions-for-random-numbers
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -57,6 +57,7 @@
 #include <assert.h>
 #include <inttypes.h>
 #include <stdint.h>
+#include <sys/random.h>
 
 #include "../kselftest.h"
 
@@ -518,22 +519,10 @@ static void continue_range(int ufd, __u6
 static void *locking_thread(void *arg)
 {
 	unsigned long cpu = (unsigned long) arg;
-	struct random_data rand;
 	unsigned long page_nr = *(&(page_nr)); /* uninitialized warning */
-	int32_t rand_nr;
 	unsigned long long count;
-	char randstate[64];
-	unsigned int seed;
 
-	if (bounces & BOUNCE_RANDOM) {
-		seed = (unsigned int) time(NULL) - bounces;
-		if (!(bounces & BOUNCE_RACINGFAULTS))
-			seed += cpu;
-		bzero(&rand, sizeof(rand));
-		bzero(&randstate, sizeof(randstate));
-		if (initstate_r(seed, randstate, sizeof(randstate), &rand))
-			err("initstate_r failed");
-	} else {
+	if (!(bounces & BOUNCE_RANDOM)) {
 		page_nr = -bounces;
 		if (!(bounces & BOUNCE_RACINGFAULTS))
 			page_nr += cpu * nr_pages_per_cpu;
@@ -541,15 +530,8 @@ static void *locking_thread(void *arg)
 
 	while (!finished) {
 		if (bounces & BOUNCE_RANDOM) {
-			if (random_r(&rand, &rand_nr))
-				err("random_r failed");
-			page_nr = rand_nr;
-			if (sizeof(page_nr) > sizeof(rand_nr)) {
-				if (random_r(&rand, &rand_nr))
-					err("random_r failed");
-				page_nr |= (((unsigned long) rand_nr) << 16) <<
-					   16;
-			}
+			if (getrandom(&page_nr, sizeof(page_nr), 0) != sizeof(page_nr))
+				err("getrandom failed");
 		} else
 			page_nr += 1;
 		page_nr %= nr_pages;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 144/262] userfaultfd/selftests: fix feature support detection
  2021-11-05 20:34 incoming Andrew Morton
                   ` (142 preceding siblings ...)
  2021-11-05 20:42 ` [patch 143/262] userfaultfd/selftests: don't rely on GNU extensions for random numbers Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 145/262] userfaultfd/selftests: fix calculation of expected ioctls Andrew Morton
                   ` (117 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, axelrasmussen, linux-mm, mm-commits, peterx, shuah, torvalds

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: fix feature support detection

Before any tests are run, in set_test_type, we decide what feature(s) we
are going to be testing, based upon our command line arguments.  However,
the supported features are not just a function of the memory type being
used, so this is broken.

For instance, consider writeprotect support.  It is "normally" supported
for anonymous memory, but furthermore it requires that the kernel has
CONFIG_HAVE_ARCH_USERFAULTFD_WP.  So, it is *not* supported at all on
aarch64, for example.

So, this commit fixes this by querying the kernel for the set of features
it supports in set_test_type, by opening a userfaultfd and issuing a
UFFDIO_API ioctl.  Based upon the reported features, we toggle what tests
are enabled.
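
A minimal sketch of that query outside of the selftest (standard
userfaultfd API; error handling kept short):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

int main(void)
{
	struct uffdio_api api = { .api = UFFD_API, .features = 0 };
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api)) {
		perror("userfaultfd/UFFDIO_API");
		return 1;
	}

	/* api.features now holds the kernel-supported feature bits */
	printf("writeprotect: %s\n",
	       (api.features & UFFD_FEATURE_PAGEFAULT_FLAG_WP) ? "yes" : "no");
	printf("minor (hugetlbfs): %s\n",
	       (api.features & UFFD_FEATURE_MINOR_HUGETLBFS) ? "yes" : "no");

	close(uffd);
	return 0;
}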

Link: https://lkml.kernel.org/r/20210930212309.4001967-3-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   54 ++++++++++++---------
 1 file changed, 31 insertions(+), 23 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-fix-feature-support-detection
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -346,6 +346,16 @@ static struct uffd_test_ops hugetlb_uffd
 
 static struct uffd_test_ops *uffd_test_ops;
 
+static inline uint64_t uffd_minor_feature(void)
+{
+	if (test_type == TEST_HUGETLB && map_shared)
+		return UFFD_FEATURE_MINOR_HUGETLBFS;
+	else if (test_type == TEST_SHMEM)
+		return UFFD_FEATURE_MINOR_SHMEM;
+	else
+		return 0;
+}
+
 static void userfaultfd_open(uint64_t *features)
 {
 	struct uffdio_api uffdio_api;
@@ -406,7 +416,7 @@ static void uffd_test_ctx_clear(void)
 	munmap_area((void **)&area_dst_alias);
 }
 
-static void uffd_test_ctx_init_ext(uint64_t *features)
+static void uffd_test_ctx_init(uint64_t features)
 {
 	unsigned long nr, cpu;
 
@@ -415,7 +425,7 @@ static void uffd_test_ctx_init_ext(uint6
 	uffd_test_ops->allocate_area((void **)&area_src);
 	uffd_test_ops->allocate_area((void **)&area_dst);
 
-	userfaultfd_open(features);
+	userfaultfd_open(&features);
 
 	count_verify = malloc(nr_pages * sizeof(unsigned long long));
 	if (!count_verify)
@@ -463,11 +473,6 @@ static void uffd_test_ctx_init_ext(uint6
 			err("pipe");
 }
 
-static inline void uffd_test_ctx_init(uint64_t features)
-{
-	uffd_test_ctx_init_ext(&features);
-}
-
 static int my_bcmp(char *str1, char *str2, size_t n)
 {
 	unsigned long i;
@@ -1208,7 +1213,6 @@ static int userfaultfd_minor_test(void)
 	void *expected_page;
 	char c;
 	struct uffd_stats stats = { 0 };
-	uint64_t req_features, features_out;
 
 	if (!test_uffdio_minor)
 		return 0;
@@ -1216,21 +1220,7 @@ static int userfaultfd_minor_test(void)
 	printf("testing minor faults: ");
 	fflush(stdout);
 
-	if (test_type == TEST_HUGETLB)
-		req_features = UFFD_FEATURE_MINOR_HUGETLBFS;
-	else if (test_type == TEST_SHMEM)
-		req_features = UFFD_FEATURE_MINOR_SHMEM;
-	else
-		return 1;
-
-	features_out = req_features;
-	uffd_test_ctx_init_ext(&features_out);
-	/* If kernel reports required features aren't supported, skip test. */
-	if ((features_out & req_features) != req_features) {
-		printf("skipping test due to lack of feature support\n");
-		fflush(stdout);
-		return 0;
-	}
+	uffd_test_ctx_init(uffd_minor_feature());
 
 	uffdio_register.range.start = (unsigned long)area_dst_alias;
 	uffdio_register.range.len = nr_pages * page_size;
@@ -1591,6 +1581,8 @@ unsigned long default_huge_page_size(voi
 
 static void set_test_type(const char *type)
 {
+	uint64_t features = UFFD_API_FEATURES;
+
 	if (!strcmp(type, "anon")) {
 		test_type = TEST_ANON;
 		uffd_test_ops = &anon_uffd_test_ops;
@@ -1624,6 +1616,22 @@ static void set_test_type(const char *ty
 	if ((unsigned long) area_count(NULL, 0) + sizeof(unsigned long long) * 2
 	    > page_size)
 		err("Impossible to run this test");
+
+	/*
+	 * Whether we can test certain features depends not just on test type,
+	 * but also on whether or not this particular kernel supports the
+	 * feature.
+	 */
+
+	userfaultfd_open(&features);
+
+	test_uffdio_wp = test_uffdio_wp &&
+		(features & UFFD_FEATURE_PAGEFAULT_FLAG_WP);
+	test_uffdio_minor = test_uffdio_minor &&
+		(features & uffd_minor_feature());
+
+	close(uffd);
+	uffd = -1;
 }
 
 static void sigalrm(int sig)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 145/262] userfaultfd/selftests: fix calculation of expected ioctls
  2021-11-05 20:34 incoming Andrew Morton
                   ` (143 preceding siblings ...)
  2021-11-05 20:42 ` [patch 144/262] userfaultfd/selftests: fix feature support detection Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 146/262] mm/page_isolation: fix potential missing call to unset_migratetype_isolate() Andrew Morton
                   ` (116 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, axelrasmussen, linux-mm, mm-commits, peterx, shuah, torvalds

From: Axel Rasmussen <axelrasmussen@google.com>
Subject: userfaultfd/selftests: fix calculation of expected ioctls

Today, we assert that the ioctls the kernel reports as supported for a
registration match a precomputed list.  We decide which ioctls are
supported by examining the memory type.  Then, in several locations we
"fix up" this list by adding or removing things this initial decision got
wrong.

What ioctls the kernel reports is actually a function of several things:
- The memory type
- Kernel feature support (e.g., no writeprotect on aarch64)
- The registration type (e.g., CONTINUE only supported for MINOR mode)

So, we can't fully compute this at the start, in set_test_type.  It varies
per test, depending on what registration mode(s) those tests use.

Instead, introduce a new function which computes the correct list.  This
centralizes the add/remove of ioctls depending on these function inputs in
one place, so we don't have to repeat ourselves in various tests.

Not only is the resulting code a bit shorter, but it fixes a real bug in
the existing code: previously, we would incorrectly require the
writeprotect ioctl to be present on aarch64, where it isn't actually
supported.

Link: https://lkml.kernel.org/r/20210930212309.4001967-4-axelrasmussen@google.com
Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/userfaultfd.c |   77 ++++++++++-----------
 1 file changed, 38 insertions(+), 39 deletions(-)

--- a/tools/testing/selftests/vm/userfaultfd.c~userfaultfd-selftests-fix-calculation-of-expected-ioctls
+++ a/tools/testing/selftests/vm/userfaultfd.c
@@ -308,37 +308,24 @@ static void shmem_alias_mapping(__u64 *s
 }
 
 struct uffd_test_ops {
-	unsigned long expected_ioctls;
 	void (*allocate_area)(void **alloc_area);
 	void (*release_pages)(char *rel_area);
 	void (*alias_mapping)(__u64 *start, size_t len, unsigned long offset);
 };
 
-#define SHMEM_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
-					 (1 << _UFFDIO_COPY) | \
-					 (1 << _UFFDIO_ZEROPAGE))
-
-#define ANON_EXPECTED_IOCTLS		((1 << _UFFDIO_WAKE) | \
-					 (1 << _UFFDIO_COPY) | \
-					 (1 << _UFFDIO_ZEROPAGE) | \
-					 (1 << _UFFDIO_WRITEPROTECT))
-
 static struct uffd_test_ops anon_uffd_test_ops = {
-	.expected_ioctls = ANON_EXPECTED_IOCTLS,
 	.allocate_area	= anon_allocate_area,
 	.release_pages	= anon_release_pages,
 	.alias_mapping = noop_alias_mapping,
 };
 
 static struct uffd_test_ops shmem_uffd_test_ops = {
-	.expected_ioctls = SHMEM_EXPECTED_IOCTLS,
 	.allocate_area	= shmem_allocate_area,
 	.release_pages	= shmem_release_pages,
 	.alias_mapping = shmem_alias_mapping,
 };
 
 static struct uffd_test_ops hugetlb_uffd_test_ops = {
-	.expected_ioctls = UFFD_API_RANGE_IOCTLS_BASIC & ~(1 << _UFFDIO_CONTINUE),
 	.allocate_area	= hugetlb_allocate_area,
 	.release_pages	= hugetlb_release_pages,
 	.alias_mapping = hugetlb_alias_mapping,
@@ -356,6 +343,33 @@ static inline uint64_t uffd_minor_featur
 		return 0;
 }
 
+static uint64_t get_expected_ioctls(uint64_t mode)
+{
+	uint64_t ioctls = UFFD_API_RANGE_IOCTLS;
+
+	if (test_type == TEST_HUGETLB)
+		ioctls &= ~(1 << _UFFDIO_ZEROPAGE);
+
+	if (!((mode & UFFDIO_REGISTER_MODE_WP) && test_uffdio_wp))
+		ioctls &= ~(1 << _UFFDIO_WRITEPROTECT);
+
+	if (!((mode & UFFDIO_REGISTER_MODE_MINOR) && test_uffdio_minor))
+		ioctls &= ~(1 << _UFFDIO_CONTINUE);
+
+	return ioctls;
+}
+
+static void assert_expected_ioctls_present(uint64_t mode, uint64_t ioctls)
+{
+	uint64_t expected = get_expected_ioctls(mode);
+	uint64_t actual = ioctls & expected;
+
+	if (actual != expected) {
+		err("missing ioctl(s): expected %"PRIx64" actual: %"PRIx64,
+		    expected, actual);
+	}
+}
+
 static void userfaultfd_open(uint64_t *features)
 {
 	struct uffdio_api uffdio_api;
@@ -1017,11 +1031,9 @@ static int __uffdio_zeropage(int ufd, un
 {
 	struct uffdio_zeropage uffdio_zeropage;
 	int ret;
-	unsigned long has_zeropage;
+	bool has_zeropage = get_expected_ioctls(0) & (1 << _UFFDIO_ZEROPAGE);
 	__s64 res;
 
-	has_zeropage = uffd_test_ops->expected_ioctls & (1 << _UFFDIO_ZEROPAGE);
-
 	if (offset >= nr_pages * page_size)
 		err("unexpected offset %lu", offset);
 	uffdio_zeropage.range.start = (unsigned long) area_dst + offset;
@@ -1061,7 +1073,6 @@ static int uffdio_zeropage(int ufd, unsi
 static int userfaultfd_zeropage_test(void)
 {
 	struct uffdio_register uffdio_register;
-	unsigned long expected_ioctls;
 
 	printf("testing UFFDIO_ZEROPAGE: ");
 	fflush(stdout);
@@ -1076,9 +1087,8 @@ static int userfaultfd_zeropage_test(voi
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
 
-	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
-		err("unexpected missing ioctl for anon memory");
+	assert_expected_ioctls_present(
+		uffdio_register.mode, uffdio_register.ioctls);
 
 	if (uffdio_zeropage(uffd, 0))
 		if (my_bcmp(area_dst, zeropage, page_size))
@@ -1091,7 +1101,6 @@ static int userfaultfd_zeropage_test(voi
 static int userfaultfd_events_test(void)
 {
 	struct uffdio_register uffdio_register;
-	unsigned long expected_ioctls;
 	pthread_t uffd_mon;
 	int err, features;
 	pid_t pid;
@@ -1115,9 +1124,8 @@ static int userfaultfd_events_test(void)
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
 
-	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
-		err("unexpected missing ioctl for anon memory");
+	assert_expected_ioctls_present(
+		uffdio_register.mode, uffdio_register.ioctls);
 
 	if (pthread_create(&uffd_mon, &attr, uffd_poll_thread, &stats))
 		err("uffd_poll_thread create");
@@ -1145,7 +1153,6 @@ static int userfaultfd_events_test(void)
 static int userfaultfd_sig_test(void)
 {
 	struct uffdio_register uffdio_register;
-	unsigned long expected_ioctls;
 	unsigned long userfaults;
 	pthread_t uffd_mon;
 	int err, features;
@@ -1169,9 +1176,8 @@ static int userfaultfd_sig_test(void)
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
 
-	expected_ioctls = uffd_test_ops->expected_ioctls;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
-		err("unexpected missing ioctl for anon memory");
+	assert_expected_ioctls_present(
+		uffdio_register.mode, uffdio_register.ioctls);
 
 	if (faulting_process(1))
 		err("faulting process failed");
@@ -1206,7 +1212,6 @@ static int userfaultfd_sig_test(void)
 static int userfaultfd_minor_test(void)
 {
 	struct uffdio_register uffdio_register;
-	unsigned long expected_ioctls;
 	unsigned long p;
 	pthread_t uffd_mon;
 	uint8_t expected_byte;
@@ -1228,10 +1233,8 @@ static int userfaultfd_minor_test(void)
 	if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 		err("register failure");
 
-	expected_ioctls = uffd_test_ops->expected_ioctls;
-	expected_ioctls |= 1 << _UFFDIO_CONTINUE;
-	if ((uffdio_register.ioctls & expected_ioctls) != expected_ioctls)
-		err("unexpected missing ioctl(s)");
+	assert_expected_ioctls_present(
+		uffdio_register.mode, uffdio_register.ioctls);
 
 	/*
 	 * After registering with UFFD, populate the non-UFFD-registered side of
@@ -1428,8 +1431,6 @@ static int userfaultfd_stress(void)
 	pthread_attr_setstacksize(&attr, 16*1024*1024);
 
 	while (bounces--) {
-		unsigned long expected_ioctls;
-
 		printf("bounces: %d, mode:", bounces);
 		if (bounces & BOUNCE_RANDOM)
 			printf(" rnd");
@@ -1457,10 +1458,8 @@ static int userfaultfd_stress(void)
 			uffdio_register.mode |= UFFDIO_REGISTER_MODE_WP;
 		if (ioctl(uffd, UFFDIO_REGISTER, &uffdio_register))
 			err("register failure");
-		expected_ioctls = uffd_test_ops->expected_ioctls;
-		if ((uffdio_register.ioctls & expected_ioctls) !=
-		    expected_ioctls)
-			err("unexpected missing ioctl for anon memory");
+		assert_expected_ioctls_present(
+			uffdio_register.mode, uffdio_register.ioctls);
 
 		if (area_dst_alias) {
 			uffdio_register.range.start = (unsigned long)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 146/262] mm/page_isolation: fix potential missing call to unset_migratetype_isolate()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (144 preceding siblings ...)
  2021-11-05 20:42 ` [patch 145/262] userfaultfd/selftests: fix calculation of expected ioctls Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 147/262] mm/page_isolation: guard against possible putback unisolated page Andrew Morton
                   ` (115 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, david, linmiaohe, linux-mm, mhocko, mm-commits, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_isolation: fix potential missing call to unset_migratetype_isolate()

In the start_isolate_page_range() undo path, pfn_to_online_page() just
checks the first pfn in a pageblock while __first_valid_page() will
traverse the pageblock until the first online pfn is found.  So we may miss
the call to unset_migratetype_isolate() in the undo path and pages will
remain isolated unexpectedly.  Fix this by calling undo_isolate_page_range(),
which also helps to simplify the code further.  Note we shouldn't ever trigger it
because MAX_ORDER-1 aligned pfn ranges shouldn't contain memory holes now.
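
For reference, a simplified sketch of __first_valid_page() (paraphrased
from mm/page_isolation.c, not part of this patch) shows why the setup loop
can find an online page in a pageblock whose first pfn is offline, which is
exactly the case the old undo loop missed:

static struct page *first_valid_page_sketch(unsigned long pfn,
					    unsigned long nr_pages)
{
	unsigned long i;

	/* Return the first online page in the pageblock, if any. */
	for (i = 0; i < nr_pages; i++) {
		struct page *page = pfn_to_online_page(pfn + i);

		if (page)
			return page;
	}
	return NULL;
}

Reusing undo_isolate_page_range() means the undo path walks pageblocks with
the same helper as the setup path, so the two can no longer disagree about
which pageblocks were isolated.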

Link: https://lkml.kernel.org/r/20210914114348.15569-1-linmiaohe@huawei.com
Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_isolation.c |   20 +++-----------------
 1 file changed, 3 insertions(+), 17 deletions(-)

--- a/mm/page_isolation.c~mm-page_isolation-fix-potential-missing-call-to-unset_migratetype_isolate
+++ a/mm/page_isolation.c
@@ -183,7 +183,6 @@ int start_isolate_page_range(unsigned lo
 			     unsigned migratetype, int flags)
 {
 	unsigned long pfn;
-	unsigned long undo_pfn;
 	struct page *page;
 
 	BUG_ON(!IS_ALIGNED(start_pfn, pageblock_nr_pages));
@@ -193,25 +192,12 @@ int start_isolate_page_range(unsigned lo
 	     pfn < end_pfn;
 	     pfn += pageblock_nr_pages) {
 		page = __first_valid_page(pfn, pageblock_nr_pages);
-		if (page) {
-			if (set_migratetype_isolate(page, migratetype, flags)) {
-				undo_pfn = pfn;
-				goto undo;
-			}
+		if (page && set_migratetype_isolate(page, migratetype, flags)) {
+			undo_isolate_page_range(start_pfn, pfn, migratetype);
+			return -EBUSY;
 		}
 	}
 	return 0;
-undo:
-	for (pfn = start_pfn;
-	     pfn < undo_pfn;
-	     pfn += pageblock_nr_pages) {
-		struct page *page = pfn_to_online_page(pfn);
-		if (!page)
-			continue;
-		unset_migratetype_isolate(page, migratetype);
-	}
-
-	return -EBUSY;
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 147/262] mm/page_isolation: guard against possible putback unisolated page
  2021-11-05 20:34 incoming Andrew Morton
                   ` (145 preceding siblings ...)
  2021-11-05 20:42 ` [patch 146/262] mm/page_isolation: fix potential missing call to unset_migratetype_isolate() Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 148/262] mm/vmscan.c: fix -Wunused-but-set-variable warning Andrew Morton
                   ` (114 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, david, iamjoonsoo.kim, jhubbard, linmiaohe, linux-mm,
	mm-commits, torvalds, vbabka

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_isolation: guard against possible putback unisolated page

Isolating a free page in an isolated pageblock is expected to always work
as watermarks don't apply here.  But if __isolate_free_page() fails, due
to condition changes, the page will be left on the free list.  And the
page will be put back to the free list again via __putback_isolated_page().
This may trigger VM_BUG_ON_PAGE() on the page->flags check in
__free_one_page() if PageReported is set.  Or we will corrupt the free
list because list_add() will be called for a page that is already on
another list.  Add a VM_WARN_ON() to complain if this unexpected failure
ever happens.

Link: https://lkml.kernel.org/r/20210914114508.23725-1-linmiaohe@huawei.com
Fixes: 3c605096d315 ("mm/page_alloc: restrict max order of merging on isolated pageblock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_isolation.c |    9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

--- a/mm/page_isolation.c~mm-page_isolation-guard-against-possible-putback-unisolated-page
+++ a/mm/page_isolation.c
@@ -94,8 +94,13 @@ static void unset_migratetype_isolate(st
 			buddy = page + (buddy_pfn - pfn);
 
 			if (!is_migrate_isolate_page(buddy)) {
-				__isolate_free_page(page, order);
-				isolated_page = true;
+				isolated_page = !!__isolate_free_page(page, order);
+				/*
+				 * Isolating a free page in an isolated pageblock
+				 * is expected to always work as watermarks don't
+				 * apply here.
+				 */
+				VM_WARN_ON(!isolated_page);
 			}
 		}
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 148/262] mm/vmscan.c: fix -Wunused-but-set-variable warning
  2021-11-05 20:34 incoming Andrew Morton
                   ` (146 preceding siblings ...)
  2021-11-05 20:42 ` [patch 147/262] mm/page_isolation: guard against possible putback unisolated page Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested Andrew Morton
                   ` (113 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, shy828301, songkai01, torvalds

From: Kai Song <songkai01@inspur.com>
Subject: mm/vmscan.c: fix -Wunused-but-set-variable warning

Fix the following warning seen when building the kernel with W=1:
mm/vmscan.c:1362:6: warning: variable 'err' set but not used [-Wunused-but-set-variable]

Link: https://lkml.kernel.org/r/20210924181218.21165-1-songkai01@inspur.com
Signed-off-by: Kai Song <songkai01@inspur.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/vmscan.c~mm-vmscanc-fix-wunused-but-set-variable-warning
+++ a/mm/vmscan.c
@@ -1337,7 +1337,6 @@ static unsigned int demote_page_list(str
 {
 	int target_nid = next_demotion_node(pgdat->node_id);
 	unsigned int nr_succeeded;
-	int err;
 
 	if (list_empty(demote_pages))
 		return 0;
@@ -1346,7 +1345,7 @@ static unsigned int demote_page_list(str
 		return 0;
 
 	/* Demotion ignores all cpuset and mempolicy settings */
-	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
+	migrate_pages(demote_pages, alloc_demote_page, NULL,
 			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
 			    &nr_succeeded);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-05 20:34 incoming Andrew Morton
                   ` (147 preceding siblings ...)
  2021-11-05 20:42 ` [patch 148/262] mm/vmscan.c: fix -Wunused-but-set-variable warning Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 21:02   ` Matthew Wilcox
  2021-11-05 20:42 ` [patch 150/262] mm/vmscan: throttle reclaim and compaction when too many pages are isolated Andrew Morton
                   ` (112 subsequent siblings)
  261 siblings, 1 reply; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: throttle reclaim until some writeback completes if congested

Patch series "Remove dependency on congestion_wait in mm/", v5.

This series removes all calls to congestion_wait in mm/ and deletes
wait_iff_congested.  It's not a clever implementation but congestion_wait
has been broken for a long time
(https://lore.kernel.org/linux-mm/45d8b7a6-8548-65f5-cccf-9f451d4ae3d4@kernel.dk/).
Even if congestion throttling worked, it was never a great idea.  While
excessive dirty/writeback pages at the tail of the LRU is one reason why
reclaim may be slow, there is also the problem of too many pages being
isolated and reclaim failing for other reasons (elevated references, too
many pages isolated, excessive LRU contention etc.).

This series replaces the "congestion" throttling with 3 different types
(summarised in the sketch after this list).

o If there are too many dirty/writeback pages, sleep until a timeout or
  enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages are
  either reclaimed or put back on the LRU
o If no progress is being made, direct reclaim tasks sleep until another
  task makes progress with acceptable efficiency.
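
For orientation, the three types map onto the vmscan_throttle_state reasons
that appear in the ftrace output later in this changelog.  A sketch of the
reason names as they are built up over the series (only the three discussed
here are shown):

enum vmscan_throttle_state {
	VMSCAN_THROTTLE_WRITEBACK,	/* too many dirty/writeback pages */
	VMSCAN_THROTTLE_ISOLATED,	/* too many pages isolated */
	VMSCAN_THROTTLE_NOPROGRESS,	/* reclaim is making no progress */
	NR_VMSCAN_THROTTLE,
};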

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work.  A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem.  Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may time out.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4 to
check the impact as the number of direct reclaimers increase.  It has four
types of worker.

o One "anon latency" worker creates small mappings with mmap() and times
  how long it takes to fault the mapping reading it 4K at a time
o X file writers which is fio randomly writing X files where the total
  size of the files add up to the allowed dirty_ratio.  fio is allowed to
  run for a warmup period to allow some file-backed pages to accumulate. 
  The duration of the warmup is based on the best-case linear write speed
  of the storage.
o Y file readers which is fio randomly reading small files
o Z anon memory hogs which continually map (100-dirty_ratio)% of memory
o Total estimated WSS = (100+dirty_ratio) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon memory
where some anonymous memory must be swapped and there is a high likelihood
of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages.  The results below are based on having zero
readers.

The short summary of the results is that the series works and stalls until
some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results.  The other two machines had much better latencies for example.

First, the results of the "anon latency" worker

stutterp
                              5.15.0-rc1             5.15.0-rc1
                                 vanilla mm-reclaimcongest-v5r4
Amean     mmap-4      31.4003 (   0.00%)   2661.0198 (-8374.52%)
Amean     mmap-7      38.1641 (   0.00%)    149.2891 (-291.18%)
Amean     mmap-12     60.0981 (   0.00%)    187.8105 (-212.51%)
Amean     mmap-21    161.2699 (   0.00%)    213.9107 ( -32.64%)
Amean     mmap-30    174.5589 (   0.00%)    377.7548 (-116.41%)
Amean     mmap-48   8106.8160 (   0.00%)   1070.5616 (  86.79%)
Stddev    mmap-4      41.3455 (   0.00%)  27573.9676 (-66591.66%)
Stddev    mmap-7      53.5556 (   0.00%)   4608.5860 (-8505.23%)
Stddev    mmap-12    171.3897 (   0.00%)   5559.4542 (-3143.75%)
Stddev    mmap-21   1506.6752 (   0.00%)   5746.2507 (-281.39%)
Stddev    mmap-30    557.5806 (   0.00%)   7678.1624 (-1277.05%)
Stddev    mmap-48  61681.5718 (   0.00%)  14507.2830 (  76.48%)
Max-90    mmap-4      31.4243 (   0.00%)     83.1457 (-164.59%)
Max-90    mmap-7      41.0410 (   0.00%)     41.0720 (  -0.08%)
Max-90    mmap-12     66.5255 (   0.00%)     53.9073 (  18.97%)
Max-90    mmap-21    146.7479 (   0.00%)    105.9540 (  27.80%)
Max-90    mmap-30    193.9513 (   0.00%)     64.3067 (  66.84%)
Max-90    mmap-48    277.9137 (   0.00%)    591.0594 (-112.68%)
Max       mmap-4    1913.8009 (   0.00%) 299623.9695 (-15555.96%)
Max       mmap-7    2423.9665 (   0.00%) 204453.1708 (-8334.65%)
Max       mmap-12   6845.6573 (   0.00%) 221090.3366 (-3129.64%)
Max       mmap-21  56278.6508 (   0.00%) 213877.3496 (-280.03%)
Max       mmap-30  19716.2990 (   0.00%) 216287.6229 (-997.00%)
Max       mmap-48 477923.9400 (   0.00%) 245414.8238 (  48.65%)

For most thread counts, the time to mmap() is unfortunately increased.  In
earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU.  There is no prioritisation of reclaim
tasks making progress based on each task's rate of page allocation versus
progress of reclaim.  The variance is also impacted for high worker counts
but in all cases, the differences in latency are not statistically
significant due to very large maximum outliers.  Max-90 shows that 90% of
the stalls are comparable but the Max results show the massive outliers
which are increased due to stalling.

It is expected that this will be very machine dependent.  Due to the test
design, reclaim is difficult so allocations stall and there are variances
depending on whether THPs can be allocated or not.  The amount of memory
will affect exactly how bad the corner cases are and how often they
trigger.  The warmup period calculation is not ideal as it's based on
linear writes whereas fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable.  For example,
these are the latencies on a single-socket machine that had more memory

Amean     mmap-4      42.2287 (   0.00%)     49.6838 * -17.65%*
Amean     mmap-7     216.4326 (   0.00%)     47.4451 *  78.08%*
Amean     mmap-12   2412.0588 (   0.00%)     51.7497 (  97.85%)
Amean     mmap-21   5546.2548 (   0.00%)     51.8862 (  99.06%)
Amean     mmap-30   1085.3121 (   0.00%)     72.1004 (  93.36%)

The overall system CPU usage and elapsed time are as follows

                  5.15.0-rc3  5.15.0-rc3
                     vanilla mm-reclaimcongest-v5r4
Duration User        6989.03      983.42
Duration System      7308.12      799.68
Duration Elapsed     2277.67     2092.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling.

The high-level /proc/vmstat counters show

                                     5.15.0-rc1     5.15.0-rc1
                                        vanilla mm-reclaimcongest-v5r2
Ops Direct pages scanned          1056608451.00   503594991.00
Ops Kswapd pages scanned           109795048.00   147289810.00
Ops Kswapd pages reclaimed          63269243.00    31036005.00
Ops Direct pages reclaimed          10803973.00     6328887.00
Ops Kswapd efficiency %                   57.62          21.07
Ops Kswapd velocity                    48204.98       57572.86
Ops Direct efficiency %                    1.02           1.26
Ops Direct velocity                   463898.83      196845.97

Kswapd scanned fewer pages but the detailed pattern is different.  The
vanilla kernel scans slowly over time whereas the patched kernel exhibits
burst patterns of scan activity.  Direct reclaim scanning is reduced by 52% due
to stalling.

The pattern for stealing pages is also slightly different.  Both kernels
exhibit spikes but the vanilla kernel shows pages being reclaimed over a
period of time whereas the patches tend to reclaim in spikes.  The
difference is that vanilla is not throttling and instead scans constantly,
finding some pages over time, whereas the patched kernel throttles and
reclaims in spikes.

Ops Percentage direct scans               90.59          77.37

For direct reclaim, 90.59% of scanned pages came from direct reclaim in
vanilla, whereas with the patches it drops to 77.37% due to throttling.

Ops Page writes by reclaim           2613590.00     1687131.00

Page writes from reclaim context are reduced.

Ops Page writes anon                 2932752.00     1917048.00

And there is less swapping.

Ops Page reclaim immediate         996248528.00   107664764.00

The number of pages encountered at the tail of the LRU tagged for immediate
reclaim but still dirty/writeback is reduced by 89%.

Ops Slabs scanned                     164284.00      153608.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
      1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
      2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
      8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
     29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
  82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The vast majority of wait_iff_congested calls do not stall at all.
What is likely happening is that cond_resched() reschedules the task for
a short period when the BDI is not registering congestion (which it never
will in this test setup).

      1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
      2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
      4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
    380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
    778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait, if called, always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

      1 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
     94 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
    112 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period. 
Roughly 16% of stalls reached the timeout before expiry.  For direct
reclaim, the number of times stalled for each reason were

   6624 reason=VMSCAN_THROTTLE_ISOLATED
  93246 reason=VMSCAN_THROTTLE_NOPROGRESS
  96934 reason=VMSCAN_THROTTLE_WRITEBACK

The most common reason to stall was excessive pages tagged for immediate
reclaim at the tail of the LRU, followed by a failure to make forward
progress.  A relatively small number were due to too many pages isolated
from the LRU by parallel threads.

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was
 
      9 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
     12 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
     83 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
   6520 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all. A small number reached the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls was all over the
map

      1 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
      3 usec_timeout=500000 usect_delayed=272000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
      4 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
      5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
      6 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
      7 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
      8 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=376000 reason=VMSCAN_THROTTLE_NOPROGRESS
      9 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=172000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
     10 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
     11 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
     12 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
     13 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=308000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
     14 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
     17 usec_timeout=500000 usect_delayed=280000 reason=VMSCAN_THROTTLE_NOPROGRESS
     18 usec_timeout=500000 usect_delayed=404000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
     20 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
     21 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
     23 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
     25 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
     26 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
     27 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
     28 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
     29 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
     30 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
     31 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
     32 usec_timeout=500000 usect_delayed=156000 reason=VMSCAN_THROTTLE_NOPROGRESS
     33 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
     35 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
     36 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
     37 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
     38 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
     40 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
     43 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
     55 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
     56 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
     58 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     59 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
     61 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
     71 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
     79 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
     82 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
     85 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
     88 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
     90 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
     94 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
    118 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
    119 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
    126 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
    146 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
    148 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
    159 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
    178 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
    183 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
    237 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
    266 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
    313 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
    347 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
    470 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
    559 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
    964 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2001 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
   2447 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
   7888 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
  22727 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
  51305 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at all.
The remainder slept a little allowing other reclaim tasks to make
progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what task
should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

      1 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
      2 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
      3 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
      5 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
      6 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
      7 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
     11 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
     12 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
     16 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
     24 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
     28 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
     30 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
     32 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
     42 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
     77 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
     99 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
    137 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
    190 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
    339 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
    518 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
    852 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
   3359 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
   7147 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
  83962 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although a sizable
number did not stall at all.  This is very different to kswapd where only
a tiny percentage of stalls due to writeback reached the timeout.

Bottom line, the throttling appears to work and the wakeup events may
limit worst case stalls.  There might be some grounds for adjusting
timeouts but it's likely futile as the worst-case scenarios depend on the
workload, memory size and the speed of the storage.  A better approach to
improve the series further would be to prioritise tasks based on their
rate of allocation with the caveat that it may be very expensive to track.


This patch (of 5):

Page reclaim throttles on wait_iff_congested under the following
conditions:

o kswapd is encountering pages under writeback and marked for immediate
  reclaim implying that pages are cycling through the LRU faster than
  pages can be cleaned.

o Direct reclaim will stall if all dirty pages are backed by congested
  inodes.

wait_iff_congested is almost completely broken with few exceptions.  This
patch adds a new node-based waitqueue and tracks the number of throttled
tasks and pages written back since throttling started.  If enough pages
belonging to the node are written back then the throttled tasks will wake
early.  If not, the throttled tasks sleep until the timeout expires.
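
Condensed from the vmscan.c hunks below, the two halves of the mechanism
look roughly as follows (a sketch only: the "_sketch" names are
illustrative, and the real functions additionally skip IO workers and
kthreads other than kswapd and emit the new mm_vmscan_throttled
tracepoint):

/* Throttle side: replaces congestion_wait()/wait_iff_congested() callers. */
static void reclaim_throttle_sketch(pg_data_t *pgdat, long timeout)
{
	wait_queue_head_t *wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK];
	DEFINE_WAIT(wait);

	/* Record how many throttled-writeback pages had completed at the start. */
	if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1)
		WRITE_ONCE(pgdat->nr_reclaim_start,
			   node_page_state(pgdat, NR_THROTTLED_WRITTEN));

	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
	schedule_timeout(timeout);
	finish_wait(wqh, &wait);
	atomic_dec(&pgdat->nr_writeback_throttled);
}

/* Wake side: reached from end_page_writeback() via acct_reclaim_writeback(). */
static void acct_reclaim_writeback_sketch(pg_data_t *pgdat, struct page *page,
					  int nr_throttled)
{
	unsigned long nr_written;

	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
		     READ_ONCE(pgdat->nr_reclaim_start);

	/* Enough pages cleaned since throttling started: wake early. */
	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
		wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}

Either way the timeout bounds the sleep, so forward progress is guaranteed
even if the wakeup never fires.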

[neilb@suse.de: Uninterruptible sleep and simpler wakeups]
[hdanton@sina.com: Avoid race when reclaim starts]
[vbabka@suse.cz: vmstat irq-safe api, clarifications]
Link: https://lkml.kernel.org/r/20211022144651.19914-1-mgorman@techsingularity.net
Link: https://lkml.kernel.org/r/20211022144651.19914-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: NeilBrown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/backing-dev.h      |    1 
 include/linux/mmzone.h           |   13 ++++
 include/trace/events/vmscan.h    |   34 ++++++++++++
 include/trace/events/writeback.h |    7 --
 mm/backing-dev.c                 |   48 ----------------
 mm/filemap.c                     |    1 
 mm/internal.h                    |   11 +++
 mm/page_alloc.c                  |    5 +
 mm/vmscan.c                      |   82 ++++++++++++++++++++++++-----
 mm/vmstat.c                      |    1 
 10 files changed, 135 insertions(+), 68 deletions(-)

--- a/include/linux/backing-dev.h~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bd
 }
 
 long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);
 
 static inline bool mapping_can_writeback(struct address_space *mapping)
 {
--- a/include/linux/mmzone.h~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
 	NR_VMSCAN_IMMEDIATE,	/* Prioritise for reclaim when writeback ends */
 	NR_DIRTIED,		/* page dirtyings since bootup */
 	NR_WRITTEN,		/* page writings since bootup */
+	NR_THROTTLED_WRITTEN,	/* NR_WRITTEN while reclaim throttled */
 	NR_KERNEL_MISC_RECLAIMABLE,	/* reclaimable non-slab kernel pages */
 	NR_FOLL_PIN_ACQUIRED,	/* via: pin_user_page(), gup flag: FOLL_PIN */
 	NR_FOLL_PIN_RELEASED,	/* pages returned via unpin_user_page() */
@@ -272,6 +273,11 @@ enum lru_list {
 	NR_LRU_LISTS
 };
 
+enum vmscan_throttle_state {
+	VMSCAN_THROTTLE_WRITEBACK,
+	NR_VMSCAN_THROTTLE,
+};
+
 #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
 
 #define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +847,13 @@ typedef struct pglist_data {
 	int node_id;
 	wait_queue_head_t kswapd_wait;
 	wait_queue_head_t pfmemalloc_wait;
+
+	/* workqueues for throttling reclaim for different reasons. */
+	wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
+
+	atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
+	unsigned long nr_reclaim_start;	/* nr pages written while throttled
+					 * when throttling started. */
 	struct task_struct *kswapd;	/* Protected by
 					   mem_hotplug_begin/end() */
 	int kswapd_order;
--- a/include/trace/events/vmscan.h~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags)						\
+	(flags) ? __print_flags(flags, "|",					\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		) : "VMSCAN_THROTTLE_NONE"
+
+
 #define trace_reclaim_flags(file) ( \
 	(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_en
 	TP_ARGS(nr_reclaimed)
 );
 
+TRACE_EVENT(mm_vmscan_throttled,
+
+	TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+	TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+	TP_STRUCT__entry(
+		__field(int, nid)
+		__field(int, usec_timeout)
+		__field(int, usec_delayed)
+		__field(int, reason)
+	),
+
+	TP_fast_assign(
+		__entry->nid = nid;
+		__entry->usec_timeout = usec_timeout;
+		__entry->usec_delayed = usec_delayed;
+		__entry->reason = 1U << reason;
+	),
+
+	TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+		__entry->nid,
+		__entry->usec_timeout,
+		__entry->usec_delayed,
+		show_throttle_flags(__entry->reason))
+);
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
--- a/include/trace/events/writeback.h~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_te
 	TP_ARGS(usec_timeout, usec_delayed)
 );
 
-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
-	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
-	TP_ARGS(usec_timeout, usec_delayed)
-);
-
 DECLARE_EVENT_CLASS(writeback_single_inode_template,
 
 	TP_PROTO(struct inode *inode,
--- a/mm/backing-dev.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/backing-dev.c
@@ -1038,51 +1038,3 @@ long congestion_wait(int sync, long time
 	return ret;
 }
 EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
-	long ret;
-	unsigned long start = jiffies;
-	DEFINE_WAIT(wait);
-	wait_queue_head_t *wqh = &congestion_wqh[sync];
-
-	/*
-	 * If there is no congestion, yield if necessary instead
-	 * of sleeping on the congestion queue
-	 */
-	if (atomic_read(&nr_wb_congested[sync]) == 0) {
-		cond_resched();
-
-		/* In case we scheduled, work out time remaining */
-		ret = timeout - (jiffies - start);
-		if (ret < 0)
-			ret = 0;
-
-		goto out;
-	}
-
-	/* Sleep until uncongested or a write happens */
-	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
-	ret = io_schedule_timeout(timeout);
-	finish_wait(wqh, &wait);
-
-out:
-	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
-					jiffies_to_usecs(jiffies - start));
-
-	return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
--- a/mm/filemap.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/filemap.c
@@ -1612,6 +1612,7 @@ void end_page_writeback(struct page *pag
 
 	smp_mb__after_atomic();
 	wake_up_page(page, PG_writeback);
+	acct_reclaim_writeback(page);
 	put_page(page);
 }
 EXPORT_SYMBOL(end_page_writeback);
--- a/mm/internal.h~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/internal.h
@@ -34,6 +34,17 @@
 
 void page_writeback_init(void);
 
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+						int nr_throttled);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+	pg_data_t *pgdat = page_pgdat(page);
+	int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
+
+	if (nr_throttled)
+		__acct_reclaim_writeback(pgdat, page, nr_throttled);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
--- a/mm/page_alloc.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/page_alloc.c
@@ -7408,6 +7408,8 @@ static void pgdat_init_kcompactd(struct
 
 static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
 {
+	int i;
+
 	pgdat_resize_init(pgdat);
 
 	pgdat_init_split_queue(pgdat);
@@ -7416,6 +7418,9 @@ static void __meminit pgdat_init_interna
 	init_waitqueue_head(&pgdat->kswapd_wait);
 	init_waitqueue_head(&pgdat->pfmemalloc_wait);
 
+	for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+		init_waitqueue_head(&pgdat->reclaim_wait[i]);
+
 	pgdat_page_ext_init(pgdat);
 	lruvec_init(&pgdat->__lruvec);
 }
--- a/mm/vmscan.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/vmscan.c
@@ -1006,6 +1006,64 @@ static void handle_write_error(struct ad
 	unlock_page(page);
 }
 
+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+							long timeout)
+{
+	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
+	long ret;
+	DEFINE_WAIT(wait);
+
+	/*
+	 * Do not throttle IO workers, kthreads other than kswapd or
+	 * workqueues. They may be required for reclaim to make
+	 * forward progress (e.g. journalling workqueues or kthreads).
+	 */
+	if (!current_is_kswapd() &&
+	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
+		return;
+
+	if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+		WRITE_ONCE(pgdat->nr_reclaim_start,
+			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+	}
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	ret = schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+	atomic_dec(&pgdat->nr_writeback_throttled);
+
+	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+				jiffies_to_usecs(timeout - ret),
+				reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+							int nr_throttled)
+{
+	unsigned long nr_written;
+
+	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+	/*
+	 * This is an inaccurate read as the per-cpu deltas may not
+	 * be synchronised. However, given that the system is
+	 * writeback throttled, it is not worth taking the penalty
+	 * of getting an accurate count. At worst, the throttle
+	 * timeout guarantees forward progress.
+	 */
+	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+		READ_ONCE(pgdat->nr_reclaim_start);
+
+	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+		wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
+}
+
 /* possible outcome of pageout() */
 typedef enum {
 	/* failed to write page out, page is locked */
@@ -1411,9 +1469,8 @@ retry:
 
 		/*
 		 * The number of dirty pages determines if a node is marked
-		 * reclaim_congested which affects wait_iff_congested. kswapd
-		 * will stall and start writing pages if the tail of the LRU
-		 * is all dirty unqueued pages.
+		 * reclaim_congested. kswapd will stall and start writing
+		 * pages if the tail of the LRU is all dirty unqueued pages.
 		 */
 		page_check_dirty_writeback(page, &dirty, &writeback);
 		if (dirty || writeback)
@@ -3179,19 +3236,19 @@ again:
 		 * If kswapd scans pages marked for immediate
 		 * reclaim and under writeback (nr_immediate), it
 		 * implies that pages are cycling through the LRU
-		 * faster than they are written so also forcibly stall.
+		 * faster than they are written so forcibly stall
+		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 	}
 
 	/*
-	 * Tag a node/memcg as congested if all the dirty pages
-	 * scanned were backed by a congested BDI and
-	 * wait_iff_congested will stall.
+	 * Tag a node/memcg as congested if all the dirty pages were marked
+	 * for writeback and immediate reclaim (counted in nr.congested).
 	 *
 	 * Legacy memcg will stall in page writeback so avoid forcibly
-	 * stalling in wait_iff_congested().
+	 * stalling in reclaim_throttle().
 	 */
 	if ((current_is_kswapd() ||
 	     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3199,15 +3256,15 @@ again:
 		set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
 
 	/*
-	 * Stall direct reclaim for IO completions if underlying BDIs
-	 * and node is congested. Allow kswapd to continue until it
+	 * Stall direct reclaim for IO completions if the lruvec is
+	 * node is congested. Allow kswapd to continue until it
 	 * starts encountering unqueued dirty pages or cycling through
 	 * the LRU too quickly.
 	 */
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -4285,6 +4342,7 @@ static int kswapd(void *p)
 
 	WRITE_ONCE(pgdat->kswapd_order, 0);
 	WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+	atomic_set(&pgdat->nr_writeback_throttled, 0);
 	for ( ; ; ) {
 		bool ret;
 
--- a/mm/vmstat.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
+++ a/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
 	"nr_vmscan_immediate_reclaim",
 	"nr_dirtied",
 	"nr_written",
+	"nr_throttled_written",
 	"nr_kernel_misc_reclaimable",
 	"nr_foll_pin_acquired",
 	"nr_foll_pin_released",
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 150/262] mm/vmscan: throttle reclaim and compaction when too many pages are isolated
  2021-11-05 20:34 incoming Andrew Morton
                   ` (148 preceding siblings ...)
  2021-11-05 20:42 ` [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 151/262] mm/vmscan: throttle reclaim when no progress is being made Andrew Morton
                   ` (111 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: throttle reclaim and compaction when too many pages are isolated

Page reclaim throttles on congestion if too many parallel reclaim
instances have isolated too many pages.  This makes no sense; excessive
parallelisation has nothing to do with writeback or congestion.

This patch creates an additional waitqueue to sleep on when too many pages
are isolated.  The throttled tasks are woken when the number of isolated
pages is reduced or a timeout occurs.  There may be some false positive
wakeups for GFP_NOIO/GFP_NOFS callers but the tasks will throttle again if
necessary.
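
Condensed from the hunks below, the shape of the change is a wake helper
checked from both reclaim's and compaction's too_many_isolated(), paired
with sleeping on the new VMSCAN_THROTTLE_ISOLATED reason instead of
congestion_wait()/msleep() (sketch only):

/* Wake side: called whenever too_many_isolated() finds the limit is not hit. */
static inline void wake_throttle_isolated(pg_data_t *pgdat)
{
	wait_queue_head_t *wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];

	if (waitqueue_active(wqh))
		wake_up(wqh);
}

/*
 * Sleep side: e.g. shrink_inactive_list() and isolate_migratepages_block()
 * now call
 *
 *	reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
 *
 * instead of msleep(100) or congestion_wait(BLK_RW_ASYNC, HZ/10).
 */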

[shy828301@gmail.com: Wake up from compaction context]
[vbabka@suse.cz: Account number of throttled tasks only for writeback]
Link: https://lkml.kernel.org/r/20211022144651.19914-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h        |    1 +
 include/trace/events/vmscan.h |    4 +++-
 mm/compaction.c               |   10 ++++++++--
 mm/internal.h                 |   11 +++++++++++
 mm/vmscan.c                   |   22 ++++++++++++++++------
 5 files changed, 39 insertions(+), 9 deletions(-)

--- a/include/linux/mmzone.h~mm-vmscan-throttle-reclaim-and-compaction-when-too-may-pages-are-isolated
+++ a/include/linux/mmzone.h
@@ -275,6 +275,7 @@ enum lru_list {
 
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
+	VMSCAN_THROTTLE_ISOLATED,
 	NR_VMSCAN_THROTTLE,
 };
 
--- a/include/trace/events/vmscan.h~mm-vmscan-throttle-reclaim-and-compaction-when-too-may-pages-are-isolated
+++ a/include/trace/events/vmscan.h
@@ -28,10 +28,12 @@
 		) : "RECLAIM_WB_NONE"
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
+#define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
-		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"}	\
+		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
--- a/mm/compaction.c~mm-vmscan-throttle-reclaim-and-compaction-when-too-may-pages-are-isolated
+++ a/mm/compaction.c
@@ -761,6 +761,8 @@ isolate_freepages_range(struct compact_c
 /* Similar to reclaim, but different enough that they don't share logic */
 static bool too_many_isolated(pg_data_t *pgdat)
 {
+	bool too_many;
+
 	unsigned long active, inactive, isolated;
 
 	inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
@@ -770,7 +772,11 @@ static bool too_many_isolated(pg_data_t
 	isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
 			node_page_state(pgdat, NR_ISOLATED_ANON);
 
-	return isolated > (inactive + active) / 2;
+	too_many = isolated > (inactive + active) / 2;
+	if (!too_many)
+		wake_throttle_isolated(pgdat);
+
+	return too_many;
 }
 
 /**
@@ -822,7 +828,7 @@ isolate_migratepages_block(struct compac
 		if (cc->mode == MIGRATE_ASYNC)
 			return -EAGAIN;
 
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
 
 		if (fatal_signal_pending(current))
 			return -EINTR;
--- a/mm/internal.h~mm-vmscan-throttle-reclaim-and-compaction-when-too-may-pages-are-isolated
+++ a/mm/internal.h
@@ -45,6 +45,15 @@ static inline void acct_reclaim_writebac
 		__acct_reclaim_writeback(pgdat, page, nr_throttled);
 }
 
+static inline void wake_throttle_isolated(pg_data_t *pgdat)
+{
+	wait_queue_head_t *wqh;
+
+	wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 vm_fault_t do_swap_page(struct vm_fault *vmf);
 
 void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
@@ -121,6 +130,8 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+								long timeout);
 
 /*
  * in mm/rmap.c:
--- a/mm/vmscan.c~mm-vmscan-throttle-reclaim-and-compaction-when-too-may-pages-are-isolated
+++ a/mm/vmscan.c
@@ -1006,12 +1006,12 @@ static void handle_write_error(struct ad
 	unlock_page(page);
 }
 
-static void
-reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
 							long timeout)
 {
 	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
 	long ret;
+	bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
 	DEFINE_WAIT(wait);
 
 	/*
@@ -1023,7 +1023,8 @@ reclaim_throttle(pg_data_t *pgdat, enum
 	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
 		return;
 
-	if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+	if (acct_writeback &&
+	    atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
 		WRITE_ONCE(pgdat->nr_reclaim_start,
 			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
 	}
@@ -1031,7 +1032,9 @@ reclaim_throttle(pg_data_t *pgdat, enum
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
-	atomic_dec(&pgdat->nr_writeback_throttled);
+
+	if (acct_writeback)
+		atomic_dec(&pgdat->nr_writeback_throttled);
 
 	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
 				jiffies_to_usecs(timeout - ret),
@@ -2175,6 +2178,7 @@ static int too_many_isolated(struct pgli
 		struct scan_control *sc)
 {
 	unsigned long inactive, isolated;
+	bool too_many;
 
 	if (current_is_kswapd())
 		return 0;
@@ -2198,7 +2202,13 @@ static int too_many_isolated(struct pgli
 	if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
 		inactive >>= 3;
 
-	return isolated > inactive;
+	too_many = isolated > inactive;
+
+	/* Wake up tasks throttled due to too_many_isolated. */
+	if (!too_many)
+		wake_throttle_isolated(pgdat);
+
+	return too_many;
 }
 
 /*
@@ -2307,8 +2317,8 @@ shrink_inactive_list(unsigned long nr_to
 			return 0;
 
 		/* wait a bit for the reclaimer. */
-		msleep(100);
 		stalled = true;
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 151/262] mm/vmscan: throttle reclaim when no progress is being made
  2021-11-05 20:34 incoming Andrew Morton
                   ` (149 preceding siblings ...)
  2021-11-05 20:42 ` [patch 150/262] mm/vmscan: throttle reclaim and compaction when too many pages are isolated Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 152/262] mm/writeback: throttle based on page writeback instead of congestion Andrew Morton
                   ` (110 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: throttle reclaim when no progress is being made

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense: the lack of progress might be due to writeback or
a host of other factors.

For !memcg reclaim, it's messy.  Direct reclaim is throttled primarily in
the page allocator if it is failing to make progress.  Kswapd throttles if
too many pages are under writeback and marked for immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.
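
In rough terms, the decision added below (condensed here as a sketch; the
mm/vmscan.c hunk is authoritative) is called from shrink_zones() for each
node that was shrunk:

  static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
  {
          /* Progress was made: wake anything waiting on NOPROGRESS. */
          if (sc->nr_reclaimed) {
                  wake_up(&pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS]);
                  return;
          }

          /* kswapd throttles on VMSCAN_THROTTLE_WRITEBACK instead. */
          if (current_is_kswapd())
                  return;

          /* Only throttle once reclaim priority has been raised a few times. */
          if (sc->priority < DEF_PRIORITY - 2)
                  reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
  }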

[vbabka@suse.cz: Remove redundant code]
Link: https://lkml.kernel.org/r/20211022144651.19914-4-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h        |    1 +
 include/trace/events/vmscan.h |    4 +++-
 mm/memcontrol.c               |   10 +---------
 mm/vmscan.c                   |   28 ++++++++++++++++++++++++++++
 4 files changed, 33 insertions(+), 10 deletions(-)

--- a/include/linux/mmzone.h~mm-vmscan-throttle-reclaim-when-no-progress-is-being-made
+++ a/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
 enum vmscan_throttle_state {
 	VMSCAN_THROTTLE_WRITEBACK,
 	VMSCAN_THROTTLE_ISOLATED,
+	VMSCAN_THROTTLE_NOPROGRESS,
 	NR_VMSCAN_THROTTLE,
 };
 
--- a/include/trace/events/vmscan.h~mm-vmscan-throttle-reclaim-when-no-progress-is-being-made
+++ a/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@
 
 #define _VMSCAN_THROTTLE_WRITEBACK	(1 << VMSCAN_THROTTLE_WRITEBACK)
 #define _VMSCAN_THROTTLE_ISOLATED	(1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS	(1 << VMSCAN_THROTTLE_NOPROGRESS)
 
 #define show_throttle_flags(flags)						\
 	(flags) ? __print_flags(flags, "|",					\
 		{_VMSCAN_THROTTLE_WRITEBACK,	"VMSCAN_THROTTLE_WRITEBACK"},	\
-		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"}	\
+		{_VMSCAN_THROTTLE_ISOLATED,	"VMSCAN_THROTTLE_ISOLATED"},	\
+		{_VMSCAN_THROTTLE_NOPROGRESS,	"VMSCAN_THROTTLE_NOPROGRESS"}	\
 		) : "VMSCAN_THROTTLE_NONE"
 
 
--- a/mm/memcontrol.c~mm-vmscan-throttle-reclaim-when-no-progress-is-being-made
+++ a/mm/memcontrol.c
@@ -3487,19 +3487,11 @@ static int mem_cgroup_force_empty(struct
 
 	/* try to free all pages in this cgroup */
 	while (nr_retries && page_counter_read(&memcg->memory)) {
-		int progress;
-
 		if (signal_pending(current))
 			return -EINTR;
 
-		progress = try_to_free_mem_cgroup_pages(memcg, 1,
-							GFP_KERNEL, true);
-		if (!progress) {
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
 			nr_retries--;
-			/* maybe some writeback is necessary */
-			congestion_wait(BLK_RW_ASYNC, HZ/10);
-		}
-
 	}
 
 	return 0;
--- a/mm/vmscan.c~mm-vmscan-throttle-reclaim-when-no-progress-is-being-made
+++ a/mm/vmscan.c
@@ -3322,6 +3322,33 @@ static inline bool compaction_ready(stru
 	return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
 }
 
+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+	/* If reclaim is making progress, wake any throttled tasks. */
+	if (sc->nr_reclaimed) {
+		wait_queue_head_t *wqh;
+
+		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+		if (waitqueue_active(wqh))
+			wake_up(wqh);
+
+		return;
+	}
+
+	/*
+	 * Do not throttle kswapd on NOPROGRESS as it will throttle on
+	 * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+	 * writeback and marked for immediate reclaim at the tail of
+	 * the LRU.
+	 */
+	if (current_is_kswapd())
+		return;
+
+	/* Throttle if making no progress at high prioities. */
+	if (sc->priority < DEF_PRIORITY - 2)
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
 /*
  * This is the direct reclaim path, for page-allocating processes.  We only
  * try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3406,6 +3433,7 @@ static void shrink_zones(struct zonelist
 			continue;
 		last_pgdat = zone->zone_pgdat;
 		shrink_node(zone->zone_pgdat, sc);
+		consider_reclaim_throttle(zone->zone_pgdat, sc);
 	}
 
 	/*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 152/262] mm/writeback: throttle based on page writeback instead of congestion
  2021-11-05 20:34 incoming Andrew Morton
                   ` (150 preceding siblings ...)
  2021-11-05 20:42 ` [patch 151/262] mm/vmscan: throttle reclaim when no progress is being made Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 153/262] mm/page_alloc: remove the throttling logic from the page allocator Andrew Morton
                   ` (109 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/writeback: throttle based on page writeback instead of congestion

do_writepages() throttles on congestion if writepages() fails due to a
lack of memory, but congestion_wait() is partially broken as the congestion
state is not updated for all BDIs.

This patch instead stalls waiting for a number of pages located on the
local node to complete writeback.  The main weakness is that there is no
guaranteed correlation between the location of the inode's pages and the
local node, but this is still better than congestion_wait().

Link: https://lkml.kernel.org/r/20211022144651.19914-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page-writeback.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/mm/page-writeback.c~mm-writeback-throttle-based-on-page-writeback-instead-of-congestion
+++ a/mm/page-writeback.c
@@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *
 			ret = generic_writepages(mapping, wbc);
 		if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
 			break;
-		cond_resched();
-		congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+		/*
+		 * Lacking an allocation context or the locality or writeback
+		 * state of any of the inode's pages, throttle based on
+		 * writeback activity on the local node. It's as good a
+		 * guess as any.
+		 */
+		reclaim_throttle(NODE_DATA(numa_node_id()),
+			VMSCAN_THROTTLE_WRITEBACK, HZ/50);
 	}
 	/*
 	 * Usually few pages are written by now from those we've just submitted
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 153/262] mm/page_alloc: remove the throttling logic from the page allocator
  2021-11-05 20:34 incoming Andrew Morton
                   ` (151 preceding siblings ...)
  2021-11-05 20:42 ` [patch 152/262] mm/writeback: throttle based on page writeback instead of congestion Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 154/262] mm/vmscan: centralise timeout values for reclaim_throttle Andrew Morton
                   ` (108 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/page_alloc: remove the throttling logic from the page allocator

The page allocator stalls based on the number of pages that are waiting
for writeback to start, but this should now be redundant.
shrink_inactive_list() will wake the flusher threads if the pages at the
LRU tail are unqueued dirty pages, so the flusher should be active.  If
reclaim fails to make progress because pages under writeback are not
completing quickly, it should stall on VMSCAN_THROTTLE_WRITEBACK instead.

Link: https://lkml.kernel.org/r/20211022144651.19914-6-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   21 +--------------------
 1 file changed, 1 insertion(+), 20 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-remove-the-throttling-logic-from-the-page-allocator
+++ a/mm/page_alloc.c
@@ -4791,30 +4791,11 @@ should_reclaim_retry(gfp_t gfp_mask, uns
 		trace_reclaim_retry_zone(z, order, reclaimable,
 				available, min_wmark, *no_progress_loops, wmark);
 		if (wmark) {
-			/*
-			 * If we didn't make any progress and have a lot of
-			 * dirty + writeback pages then we should wait for
-			 * an IO to complete to slow down the reclaim and
-			 * prevent from pre mature OOM
-			 */
-			if (!did_some_progress) {
-				unsigned long write_pending;
-
-				write_pending = zone_page_state_snapshot(zone,
-							NR_ZONE_WRITE_PENDING);
-
-				if (2 * write_pending > reclaimable) {
-					congestion_wait(BLK_RW_ASYNC, HZ/10);
-					return true;
-				}
-			}
-
 			ret = true;
-			goto out;
+			break;
 		}
 	}
 
-out:
 	/*
 	 * Memory allocation/reclaim might be called from a WQ context and the
 	 * current implementation of the WQ concurrency control doesn't
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 154/262] mm/vmscan: centralise timeout values for reclaim_throttle
  2021-11-05 20:34 incoming Andrew Morton
                   ` (152 preceding siblings ...)
  2021-11-05 20:42 ` [patch 153/262] mm/page_alloc: remove the throttling logic from the page allocator Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 155/262] mm/vmscan: increase the timeout if page reclaim is not making progress Andrew Morton
                   ` (107 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: centralise timeout values for reclaim_throttle

Neil Brown raised concerns about callers of reclaim_throttle specifying a
timeout value.  The original timeout values passed to congestion_wait()
were probably pulled out of thin air or copy&pasted from somewhere else.
This patch centralises the timeout values and selects a timeout based on
the reason for reclaim throttling.  These figures are also pulled out of
the same thin air, but better values may be derived.

Running a workload that is throttling for inappropriate periods and
tracing mm_vmscan_throttled can be used to pick a more appropriate value.
Excessive throttling would suggest a lower timeout, whereas excessive CPU
usage in reclaim context would suggest a larger timeout.  Ideally a large
value would always be used and the wakeups would occur before the timeout,
but that requires careful testing.
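
As a rough illustration of the scales involved (assuming HZ=1000, so one
jiffy is one millisecond): the values chosen below are HZ/10 = 100ms for
the WRITEBACK and NOPROGRESS reasons, HZ/50 = 20ms for the short-lived
ISOLATED case and a full HZ = 1 second for the should-never-happen
default.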

Link: https://lkml.kernel.org/r/20211022144651.19914-7-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c     |    2 -
 mm/internal.h       |    3 --
 mm/page-writeback.c |    2 -
 mm/vmscan.c         |   50 +++++++++++++++++++++++++++++++-----------
 4 files changed, 40 insertions(+), 17 deletions(-)

--- a/mm/compaction.c~mm-vmscan-centralise-timeout-values-for-reclaim_throttle
+++ a/mm/compaction.c
@@ -828,7 +828,7 @@ isolate_migratepages_block(struct compac
 		if (cc->mode == MIGRATE_ASYNC)
 			return -EAGAIN;
 
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
 
 		if (fatal_signal_pending(current))
 			return -EINTR;
--- a/mm/internal.h~mm-vmscan-centralise-timeout-values-for-reclaim_throttle
+++ a/mm/internal.h
@@ -130,8 +130,7 @@ extern unsigned long highest_memmap_pfn;
  */
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
-extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
-								long timeout);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);
 
 /*
  * in mm/rmap.c:
--- a/mm/page-writeback.c~mm-vmscan-centralise-timeout-values-for-reclaim_throttle
+++ a/mm/page-writeback.c
@@ -2374,7 +2374,7 @@ int do_writepages(struct address_space *
 		 * guess as any.
 		 */
 		reclaim_throttle(NODE_DATA(numa_node_id()),
-			VMSCAN_THROTTLE_WRITEBACK, HZ/50);
+			VMSCAN_THROTTLE_WRITEBACK);
 	}
 	/*
 	 * Usually few pages are written by now from those we've just submitted
--- a/mm/vmscan.c~mm-vmscan-centralise-timeout-values-for-reclaim_throttle
+++ a/mm/vmscan.c
@@ -1006,12 +1006,10 @@ static void handle_write_error(struct ad
 	unlock_page(page);
 }
 
-void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
-							long timeout)
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
 {
 	wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
-	long ret;
-	bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
+	long timeout, ret;
 	DEFINE_WAIT(wait);
 
 	/*
@@ -1023,17 +1021,43 @@ void reclaim_throttle(pg_data_t *pgdat,
 	    current->flags & (PF_IO_WORKER|PF_KTHREAD))
 		return;
 
-	if (acct_writeback &&
-	    atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
-		WRITE_ONCE(pgdat->nr_reclaim_start,
-			node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+	/*
+	 * These figures are pulled out of thin air.
+	 * VMSCAN_THROTTLE_ISOLATED is a transient condition based on too many
+	 * parallel reclaimers which is a short-lived event so the timeout is
+	 * short. Failing to make progress or waiting on writeback are
+	 * potentially long-lived events so use a longer timeout. This is shaky
+	 * logic as a failure to make progress could be due to anything from
+	 * writeback to a slow device to excessive references pages at the tail
+	 * of the inactive LRU.
+	 */
+	switch(reason) {
+	case VMSCAN_THROTTLE_WRITEBACK:
+		timeout = HZ/10;
+
+		if (atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
+			WRITE_ONCE(pgdat->nr_reclaim_start,
+				node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+		}
+
+		break;
+	case VMSCAN_THROTTLE_NOPROGRESS:
+		timeout = HZ/10;
+		break;
+	case VMSCAN_THROTTLE_ISOLATED:
+		timeout = HZ/50;
+		break;
+	default:
+		WARN_ON_ONCE(1);
+		timeout = HZ;
+		break;
 	}
 
 	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
 	ret = schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
-	if (acct_writeback)
+	if (reason == VMSCAN_THROTTLE_WRITEBACK)
 		atomic_dec(&pgdat->nr_writeback_throttled);
 
 	trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
@@ -2318,7 +2342,7 @@ shrink_inactive_list(unsigned long nr_to
 
 		/* wait a bit for the reclaimer. */
 		stalled = true;
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
@@ -3250,7 +3274,7 @@ again:
 		 * until some pages complete writeback.
 		 */
 		if (sc->nr.immediate)
-			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+			reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 	}
 
 	/*
@@ -3274,7 +3298,7 @@ again:
 	if (!current_is_kswapd() && current_may_throttle() &&
 	    !sc->hibernation_mode &&
 	    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
 
 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
 				    sc))
@@ -3346,7 +3370,7 @@ static void consider_reclaim_throttle(pg
 
 	/* Throttle if making no progress at high prioities. */
 	if (sc->priority < DEF_PRIORITY - 2)
-		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+		reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 155/262] mm/vmscan: increase the timeout if page reclaim is not making progress
  2021-11-05 20:34 incoming Andrew Morton
                   ` (153 preceding siblings ...)
  2021-11-05 20:42 ` [patch 154/262] mm/vmscan: centralise timeout values for reclaim_throttle Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 156/262] mm/vmscan: delay waking of tasks throttled on NOPROGRESS Andrew Morton
                   ` (106 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: increase the timeout if page reclaim is not making progress

Tracing of the stutterp workload showed the following delays

      1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
      1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
      2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
     16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
     53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
    116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
   5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
  71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS

All of the throttling events hit the full timeout and then there were
wakeup delays, meaning that the wakeups were premature as no other
reclaimer such as kswapd had made progress.  This patch increases the
maximum timeout.
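
For scale (assuming HZ=1000): usect_delayed=104000 is roughly 104ms,
i.e.  the full HZ/10 (100ms) NOPROGRESS timeout plus a little wakeup
latency, so almost every throttled task slept out the entire period.  The
change below raises that timeout to HZ/2, allowing a stall of up to 500ms
before the timer wakes the task.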

Link: https://lkml.kernel.org/r/20211022144651.19914-8-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-vmscan-increase-the-timeout-if-page-reclaim-is-not-making-progress
+++ a/mm/vmscan.c
@@ -1042,7 +1042,7 @@ void reclaim_throttle(pg_data_t *pgdat,
 
 		break;
 	case VMSCAN_THROTTLE_NOPROGRESS:
-		timeout = HZ/10;
+		timeout = HZ/2;
 		break;
 	case VMSCAN_THROTTLE_ISOLATED:
 		timeout = HZ/50;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 156/262] mm/vmscan: delay waking of tasks throttled on NOPROGRESS
  2021-11-05 20:34 incoming Andrew Morton
                   ` (154 preceding siblings ...)
  2021-11-05 20:42 ` [patch 155/262] mm/vmscan: increase the timeout if page reclaim is not making progress Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 157/262] mm/vmpressure: fix data-race with memcg->socket_pressure Andrew Morton
                   ` (105 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: adilger.kernel, akpm, corbet, david, djwong, hannes, linux-mm,
	mgorman, mhocko, mm-commits, neilb, riel, torvalds, tytso,
	vbabka, willy

From: Mel Gorman <mgorman@techsingularity.net>
Subject: mm/vmscan: delay waking of tasks throttled on NOPROGRESS

Tracing indicates that tasks throttled on NOPROGRESS are woken prematurely,
resulting in occasional massive spikes in direct reclaim activity.  This
patch only wakes tasks throttled on NOPROGRESS if reclaim efficiency is at
least 12%.
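
As a worked example of the check added below: reclaim efficiency is
approximated as sc->nr_reclaimed > (sc->nr_scanned >> 3), i.e.  more than
1/8 (12.5%) of the scanned pages were reclaimed.  A pass that scanned
1024 pages must therefore have reclaimed at least 129 of them before the
NOPROGRESS waiters are woken.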

Link: https://lkml.kernel.org/r/20211022144651.19914-9-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: "Darrick J . Wong" <djwong@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

--- a/mm/vmscan.c~mm-vmscan-delay-waking-of-tasks-throttled-on-noprogress
+++ a/mm/vmscan.c
@@ -3348,8 +3348,11 @@ static inline bool compaction_ready(stru
 
 static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
 {
-	/* If reclaim is making progress, wake any throttled tasks. */
-	if (sc->nr_reclaimed) {
+	/*
+	 * If reclaim is making progress greater than 12% efficiency then
+	 * wake all the NOPROGRESS throttled tasks.
+	 */
+	if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) {
 		wait_queue_head_t *wqh;
 
 		wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 157/262] mm/vmpressure: fix data-race with memcg->socket_pressure
  2021-11-05 20:34 incoming Andrew Morton
                   ` (155 preceding siblings ...)
  2021-11-05 20:42 ` [patch 156/262] mm/vmscan: delay waking of tasks throttled on NOPROGRESS Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 158/262] tools/vm/page_owner_sort.c: count and sort by mem Andrew Morton
                   ` (104 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, alexs, guro, hannes, linux-mm, mhocko, mm-commits,
	richard.weiyang, shakeelb, songmuchun, songyuanzheng, torvalds,
	willy

From: Yuanzheng Song <songyuanzheng@huawei.com>
Subject: mm/vmpressure: fix data-race with memcg->socket_pressure

BUG: KCSAN: data-race in __sk_mem_reduce_allocated / vmpressure

write to 0xffff8881286f4938 of 8 bytes by task 24550 on cpu 3:
 vmpressure+0x218/0x230 mm/vmpressure.c:307
 shrink_node_memcgs+0x2b9/0x410 mm/vmscan.c:2658
 shrink_node+0x9d2/0x11d0 mm/vmscan.c:2769
 shrink_zones+0x29f/0x470 mm/vmscan.c:2972
 do_try_to_free_pages+0x193/0x6e0 mm/vmscan.c:3027
 try_to_free_mem_cgroup_pages+0x1c0/0x3f0 mm/vmscan.c:3345
 reclaim_high mm/memcontrol.c:2440 [inline]
 mem_cgroup_handle_over_high+0x18b/0x4d0 mm/memcontrol.c:2624
 tracehook_notify_resume include/linux/tracehook.h:197 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
 exit_to_user_mode_prepare+0x110/0x170 kernel/entry/common.c:191
 syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
 ret_from_fork+0x15/0x30 arch/x86/entry/entry_64.S:289

read to 0xffff8881286f4938 of 8 bytes by interrupt on cpu 1:
 mem_cgroup_under_socket_pressure include/linux/memcontrol.h:1483 [inline]
 sk_under_memory_pressure include/net/sock.h:1314 [inline]
 __sk_mem_reduce_allocated+0x1d2/0x270 net/core/sock.c:2696
 __sk_mem_reclaim+0x44/0x50 net/core/sock.c:2711
 sk_mem_reclaim include/net/sock.h:1490 [inline]
 ......
 net_rx_action+0x17a/0x480 net/core/dev.c:6864
 __do_softirq+0x12c/0x2af kernel/softirq.c:298
 run_ksoftirqd+0x13/0x20 kernel/softirq.c:653
 smpboot_thread_fn+0x33f/0x510 kernel/smpboot.c:165
 kthread+0x1fc/0x220 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296

When memcg->socket_pressure is read in mem_cgroup_under_socket_pressure()
while it is being written in vmpressure() at the same time, a data race
occurs.

Fix it by using READ_ONCE() and WRITE_ONCE() to read and write
memcg->socket_pressure.
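
Condensed from the hunks below, the two sides of the fix simply annotate
the lockless accesses so the compiler can neither tear nor re-load them:

  /* writer side, in vmpressure() */
  WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);

  /* reader side, in mem_cgroup_under_socket_pressure() */
  if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
          return true;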

Link: https://lkml.kernel.org/r/20211025082843.671690-1-songyuanzheng@huawei.com
Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alex Shi <alexs@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |    2 +-
 mm/vmpressure.c            |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/memcontrol.h~mm-vmpressure-fix-data-race-with-memcg-socket_pressure
+++ a/include/linux/memcontrol.h
@@ -1606,7 +1606,7 @@ static inline bool mem_cgroup_under_sock
 	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg->tcpmem_pressure)
 		return true;
 	do {
-		if (time_before(jiffies, memcg->socket_pressure))
+		if (time_before(jiffies, READ_ONCE(memcg->socket_pressure)))
 			return true;
 	} while ((memcg = parent_mem_cgroup(memcg)));
 	return false;
--- a/mm/vmpressure.c~mm-vmpressure-fix-data-race-with-memcg-socket_pressure
+++ a/mm/vmpressure.c
@@ -308,7 +308,7 @@ void vmpressure(gfp_t gfp, struct mem_cg
 			 * asserted for a second in which subsequent
 			 * pressure events can occur.
 			 */
-			memcg->socket_pressure = jiffies + HZ;
+			WRITE_ONCE(memcg->socket_pressure, jiffies + HZ);
 		}
 	}
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 158/262] tools/vm/page_owner_sort.c: count and sort by mem
  2021-11-05 20:34 incoming Andrew Morton
                   ` (156 preceding siblings ...)
  2021-11-05 20:42 ` [patch 157/262] mm/vmpressure: fix data-race with memcg->socket_pressure Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:42 ` [patch 159/262] tools/vm/page-types.c: make walk_file() aware of address range option Andrew Morton
                   ` (103 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, nixiaoming, tangbin, torvalds,
	weizhenliang, zhangshengju

From: Zhenliang Wei <weizhenliang@huawei.com>
Subject: tools/vm/page_owner_sort.c: count and sort by mem

When viewing page owner information, we may be more concerned about the
total memory than about the number of times a stack appears.  Therefore,
the following adjustments are made:

1. Added the statistics on the total number of pages.

2. Added the optional parameter "-m" to configure the program to sort by
   memory (total pages).

The general output of page_owner is as follows:

	Page allocated via order XXX, ...
	PFN XXX ...
	 // Detailed stack

	Page allocated via order XXX, ...
	PFN XXX ...
	 // Detailed stack

The original page_owner_sort ignores PFN rows, puts the remaining rows
in buf, counts how many times each buf occurs, and finally sorts them
according to that count. General output:

	XXX times:
	Page allocated via order XXX, ...
	 // Detailed stack

Now, we use a regexp to extract the page order value from each buf,
and count the total pages for that buf. General output:

	XXX times, XXX pages:
	Page allocated via order XXX, ...
	 // Detailed stack

By default, the output is still sorted by the number of times each buf occurs;
if you want to sort by the number of pages instead, use the new -m parameter.
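
As a worked example of the page accounting: a block whose header reads
"Page allocated via order 3" contributes 1 << 3 = 8 pages per occurrence,
so a stack seen 100 times at order 3 is reported as
"100 times, 800 pages:".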

Link: https://lkml.kernel.org/r/1631678242-41033-1-git-send-email-weizhenliang@huawei.com
Signed-off-by: Zhenliang Wei <weizhenliang@huawei.com>
Cc: Tang Bin <tangbin@cmss.chinamobile.com>
Cc: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
Cc: Zhenliang Wei <weizhenliang@huawei.com>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/page_owner.rst |   23 +++++++
 tools/vm/page_owner_sort.c      |   94 +++++++++++++++++++++++++++---
 2 files changed, 107 insertions(+), 10 deletions(-)

--- a/Documentation/vm/page_owner.rst~tools-vm-page_owner_sortc-count-and-sort-by-mem
+++ a/Documentation/vm/page_owner.rst
@@ -85,5 +85,26 @@ Usage
 	cat /sys/kernel/debug/page_owner > page_owner_full.txt
 	./page_owner_sort page_owner_full.txt sorted_page_owner.txt
 
+   The general output of ``page_owner_full.txt`` is as follows:
+
+	Page allocated via order XXX, ...
+	PFN XXX ...
+	 // Detailed stack
+
+	Page allocated via order XXX, ...
+	PFN XXX ...
+	 // Detailed stack
+
+   The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows
+   in buf, uses regexp to extract the page order value, counts the times
+   and pages of buf, and finally sorts them according to the times.
+
    See the result about who allocated each page
-   in the ``sorted_page_owner.txt``.
+   in the ``sorted_page_owner.txt``. General output:
+
+	XXX times, XXX pages:
+	Page allocated via order XXX, ...
+	 // Detailed stack
+
+   By default, ``page_owner_sort`` is sorted according to the times of buf.
+   If you want to sort by the pages nums of buf, use the ``-m`` parameter.
--- a/tools/vm/page_owner_sort.c~tools-vm-page_owner_sortc-count-and-sort-by-mem
+++ a/tools/vm/page_owner_sort.c
@@ -5,6 +5,8 @@
  * Example use:
  * cat /sys/kernel/debug/page_owner > page_owner_full.txt
  * ./page_owner_sort page_owner_full.txt sorted_page_owner.txt
+ * Or sort by total memory:
+ * ./page_owner_sort -m page_owner_full.txt sorted_page_owner.txt
  *
  * See Documentation/vm/page_owner.rst
 */
@@ -16,14 +18,18 @@
 #include <fcntl.h>
 #include <unistd.h>
 #include <string.h>
+#include <regex.h>
+#include <errno.h>
 
 struct block_list {
 	char *txt;
 	int len;
 	int num;
+	int page_num;
 };
 
-
+static int sort_by_memory;
+static regex_t order_pattern;
 static struct block_list *list;
 static int list_size;
 static int max_size;
@@ -59,12 +65,50 @@ static int compare_num(const void *p1, c
 	return l2->num - l1->num;
 }
 
+static int compare_page_num(const void *p1, const void *p2)
+{
+	const struct block_list *l1 = p1, *l2 = p2;
+
+	return l2->page_num - l1->page_num;
+}
+
+static int get_page_num(char *buf)
+{
+	int err, val_len, order_val;
+	char order_str[4] = {0};
+	char *endptr;
+	regmatch_t pmatch[2];
+
+	err = regexec(&order_pattern, buf, 2, pmatch, REG_NOTBOL);
+	if (err != 0 || pmatch[1].rm_so == -1) {
+		printf("no order pattern in %s\n", buf);
+		return 0;
+	}
+	val_len = pmatch[1].rm_eo - pmatch[1].rm_so;
+	if (val_len > 2) /* max_order should not exceed 2 digits */
+		goto wrong_order;
+
+	memcpy(order_str, buf + pmatch[1].rm_so, val_len);
+
+	errno = 0;
+	order_val = strtol(order_str, &endptr, 10);
+	if (errno != 0 || endptr == order_str || *endptr != '\0')
+		goto wrong_order;
+
+	return 1 << order_val;
+
+wrong_order:
+	printf("wrong order in follow buf:\n%s\n", buf);
+	return 0;
+}
+
 static void add_list(char *buf, int len)
 {
 	if (list_size != 0 &&
 	    len == list[list_size-1].len &&
 	    memcmp(buf, list[list_size-1].txt, len) == 0) {
 		list[list_size-1].num++;
+		list[list_size-1].page_num += get_page_num(buf);
 		return;
 	}
 	if (list_size == max_size) {
@@ -74,6 +118,7 @@ static void add_list(char *buf, int len)
 	list[list_size].txt = malloc(len+1);
 	list[list_size].len = len;
 	list[list_size].num = 1;
+	list[list_size].page_num = get_page_num(buf);
 	memcpy(list[list_size].txt, buf, len);
 	list[list_size].txt[len] = 0;
 	list_size++;
@@ -85,6 +130,13 @@ static void add_list(char *buf, int len)
 
 #define BUF_SIZE	(128 * 1024)
 
+static void usage(void)
+{
+	printf("Usage: ./page_owner_sort [-m] <input> <output>\n"
+		"-m	Sort by total memory. If this option is unset, sort by times\n"
+	);
+}
+
 int main(int argc, char **argv)
 {
 	FILE *fin, *fout;
@@ -92,21 +144,39 @@ int main(int argc, char **argv)
 	int ret, i, count;
 	struct block_list *list2;
 	struct stat st;
+	int err;
+	int opt;
 
-	if (argc < 3) {
-		printf("Usage: ./program <input> <output>\n");
-		perror("open: ");
+	while ((opt = getopt(argc, argv, "m")) != -1)
+		switch (opt) {
+		case 'm':
+			sort_by_memory = 1;
+			break;
+		default:
+			usage();
+			exit(1);
+		}
+
+	if (optind >= (argc - 1)) {
+		usage();
 		exit(1);
 	}
 
-	fin = fopen(argv[1], "r");
-	fout = fopen(argv[2], "w");
+	fin = fopen(argv[optind], "r");
+	fout = fopen(argv[optind + 1], "w");
 	if (!fin || !fout) {
-		printf("Usage: ./program <input> <output>\n");
+		usage();
 		perror("open: ");
 		exit(1);
 	}
 
+	err = regcomp(&order_pattern, "order\\s*([0-9]*),", REG_EXTENDED|REG_NEWLINE);
+	if (err != 0 || order_pattern.re_nsub != 1) {
+		printf("%s: Invalid pattern 'order\\s*([0-9]*),' code %d\n",
+			argv[0], err);
+		exit(1);
+	}
+
 	fstat(fileno(fin), &st);
 	max_size = st.st_size / 100; /* hack ... */
 
@@ -145,13 +215,19 @@ int main(int argc, char **argv)
 			list2[count++] = list[i];
 		} else {
 			list2[count-1].num += list[i].num;
+			list2[count-1].page_num += list[i].page_num;
 		}
 	}
 
-	qsort(list2, count, sizeof(list[0]), compare_num);
+	if (sort_by_memory)
+		qsort(list2, count, sizeof(list[0]), compare_page_num);
+	else
+		qsort(list2, count, sizeof(list[0]), compare_num);
 
 	for (i = 0; i < count; i++)
-		fprintf(fout, "%d times:\n%s\n", list2[i].num, list2[i].txt);
+		fprintf(fout, "%d times, %d pages:\n%s\n",
+				list2[i].num, list2[i].page_num, list2[i].txt);
 
+	regfree(&order_pattern);
 	return 0;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 159/262] tools/vm/page-types.c: make walk_file() aware of address range option
  2021-11-05 20:34 incoming Andrew Morton
                   ` (157 preceding siblings ...)
  2021-11-05 20:42 ` [patch 158/262] tools/vm/page_owner_sort.c: count and sort by mem Andrew Morton
@ 2021-11-05 20:42 ` Andrew Morton
  2021-11-05 20:43 ` [patch 160/262] tools/vm/page-types.c: move show_file() to summary output Andrew Morton
                   ` (102 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:42 UTC (permalink / raw)
  To: akpm, changbin.du, chansen3, koct9i, linux-mm, mm-commits,
	naoya.horiguchi, torvalds, wangbin224

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: tools/vm/page-types.c: make walk_file() aware of address range option

Patch series "tools/vm/page-types.c: a few improvements".

This patchset adds some improvements on tools/vm/page-types.c.  Patch 1/3
makes -a option (specify address range) work with -f (file cache mode). 
Patch 2/3 and 3/3 are to fix minor formatting issues of this tool.  These
would make life a little easier for the users of this tool.

Please see individual patches for more details about specific issues.


This patch (of 3):

The -a|--addr option is used to limit the range of addresses to be scanned
for page status.  It currently works for the physical address space (the
default mode) and for the virtual address space (with the -p option), but
not for the file address space (with the -f option).  So make walk_file()
aware of the -a option.

Link: https://lkml.kernel.org/r/20211004061325.1525902-1-naoya.horiguchi@linux.dev
Link: https://lkml.kernel.org/r/20211004061325.1525902-2-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Bin Wang <wangbin224@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/vm/page-types.c |   24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

--- a/tools/vm/page-types.c~tools-vm-page-typesc-make-walk_file-aware-of-address-range-option
+++ a/tools/vm/page-types.c
@@ -967,22 +967,19 @@ static struct sigaction sigbus_action =
 	.sa_flags = SA_SIGINFO,
 };
 
-static void walk_file(const char *name, const struct stat *st)
+static void walk_file_range(const char *name, int fd,
+			    unsigned long off, unsigned long end)
 {
 	uint8_t vec[PAGEMAP_BATCH];
 	uint64_t buf[PAGEMAP_BATCH], flags;
 	uint64_t cgroup = 0;
 	uint64_t mapcnt = 0;
 	unsigned long nr_pages, pfn, i;
-	off_t off, end = st->st_size;
-	int fd;
 	ssize_t len;
 	void *ptr;
 	int first = 1;
 
-	fd = checked_open(name, O_RDONLY|O_NOATIME|O_NOFOLLOW);
-
-	for (off = 0; off < end; off += len) {
+	for (; off < end; off += len) {
 		nr_pages = (end - off + page_size - 1) / page_size;
 		if (nr_pages > PAGEMAP_BATCH)
 			nr_pages = PAGEMAP_BATCH;
@@ -1043,6 +1040,21 @@ got_sigbus:
 				 flags, cgroup, mapcnt, buf[i]);
 		}
 	}
+}
+
+static void walk_file(const char *name, const struct stat *st)
+{
+	int i;
+	int fd;
+
+	fd = checked_open(name, O_RDONLY|O_NOATIME|O_NOFOLLOW);
+
+	if (!nr_addr_ranges)
+		add_addr_range(0, st->st_size / page_size);
+
+	for (i = 0; i < nr_addr_ranges; i++)
+		walk_file_range(name, fd, opt_offset[i] * page_size,
+				(opt_offset[i] + opt_size[i]) * page_size);
 
 	close(fd);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 160/262] tools/vm/page-types.c: move show_file() to summary output
  2021-11-05 20:34 incoming Andrew Morton
                   ` (158 preceding siblings ...)
  2021-11-05 20:42 ` [patch 159/262] tools/vm/page-types.c: make walk_file() aware of address range option Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 161/262] tools/vm/page-types.c: print file offset in hexadecimal Andrew Morton
                   ` (101 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, changbin.du, chansen3, koct9i, linux-mm, mm-commits,
	naoya.horiguchi, torvalds, wangbin224

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: tools/vm/page-types.c: move show_file() to summary output

Currently the file info from show_file() is printed out within the page
list like below, but this is a little inconvenient when the page list is
consumed by other scripts (it may need additional filtering).

    $ ./page-types -f page-types.c -l
    foffset offset  len     flags
    page-types.c Inode: 15108680 Size: 30953 (8 pages)
    Modify: Sat Oct  2 23:11:20 2021 (2399 seconds ago)
    Access: Sat Oct  2 23:11:28 2021 (2391 seconds ago)
    0       d9f59e  1       ___U_lA____________________________________
    1       1031eb5 1       __RU_l_____________________________________
    2       13bf717 1       __RU_l_____________________________________
    3       13ac333 1       ___U_lA____________________________________
    4       d9f59f  1       __RU_l_____________________________________
    5       183fd49 1       ___U_lA____________________________________
    6       13cbf69 1       ___U_lA____________________________________
    7       d9ef05  1       ___U_lA____________________________________

                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
    0x000000000000002c               3        0  __RU_l_____________________________________        referenced,uptodate,lru
    0x0000000000000068               5        0  ___U_lA____________________________________        uptodate,lru,active
                 total               8        0

With this patch file info is printed out in summary part like below:

    $ ./page-types -f page-types.c -l
    foffset offset  len     flags
    0       d9f59e  1       ___U_lA_____________________________________
    1       1031eb5 1       __RU_l______________________________________
    2       13bf717 1       __RU_l______________________________________
    3       13ac333 1       ___U_lA_____________________________________
    4       d9f59f  1       __RU_l______________________________________
    5       183fd49 1       ___U_lA_____________________________________
    6       13cbf69 1       ___U_lA_____________________________________

    page-types.c Inode: 15108680 Size: 30953 (8 pages)
    Modify: Sat Oct  2 23:11:20 2021 (2435 seconds ago)
    Access: Sat Oct  2 23:11:28 2021 (2427 seconds ago)

                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
    0x000000000000002c               3        0  __RU_l______________________________________       referenced,uptodate,lru
    0x0000000000000068               4        0  ___U_lA_____________________________________       uptodate,lru,active
                 total               7        0

Link: https://lkml.kernel.org/r/20211004061325.1525902-3-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Bin Wang <wangbin224@huawei.com>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/vm/page-types.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/tools/vm/page-types.c~tools-vm-page-typesc-move-show_file-to-summary-output
+++ a/tools/vm/page-types.c
@@ -1034,7 +1034,6 @@ got_sigbus:
 			if (first && opt_list) {
 				first = 0;
 				flush_page_range();
-				show_file(name, st);
 			}
 			add_page(off / page_size + i, pfn,
 				 flags, cgroup, mapcnt, buf[i]);
@@ -1074,10 +1073,10 @@ int walk_tree(const char *name, const st
 	return 0;
 }
 
+struct stat st;
+
 static void walk_page_cache(void)
 {
-	struct stat st;
-
 	kpageflags_fd = checked_open(opt_kpageflags, O_RDONLY);
 	pagemap_fd = checked_open("/proc/self/pagemap", O_RDONLY);
 	sigaction(SIGBUS, &sigbus_action, NULL);
@@ -1374,6 +1373,11 @@ int main(int argc, char *argv[])
 	if (opt_list)
 		printf("\n\n");
 
+	if (opt_file) {
+		show_file(opt_file, &st);
+		printf("\n");
+	}
+
 	show_summary();
 
 	if (opt_list_mapcnt)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 161/262] tools/vm/page-types.c: print file offset in hexadecimal
  2021-11-05 20:34 incoming Andrew Morton
                   ` (159 preceding siblings ...)
  2021-11-05 20:43 ` [patch 160/262] tools/vm/page-types.c: move show_file() to summary output Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 162/262] arch_numa: simplify numa_distance allocation Andrew Morton
                   ` (100 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, changbin.du, chansen3, koct9i, linux-mm, mm-commits,
	naoya.horiguchi, torvalds, wangbin224

From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: tools/vm/page-types.c: print file offset in hexadecimal

In page list mode (with the -l and -L options), the virtual address and
physical address are printed in hexadecimal, but the file offset is not,
which is confusing, so let's align it.
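
For example, a page at file offset 255 is now listed as ff in the
foffset column, matching the already-hexadecimal physical offset column,
instead of the decimal 255 printed before.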

Link: https://lkml.kernel.org/r/20211004061325.1525902-4-naoya.horiguchi@linux.dev
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Bin Wang <wangbin224@huawei.com>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/vm/page-types.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/vm/page-types.c~tools-vm-page-typesc-print-file-offset-in-hexadecimal
+++ a/tools/vm/page-types.c
@@ -390,7 +390,7 @@ static void show_page_range(unsigned lon
 		if (opt_pid)
 			printf("%lx\t", voff);
 		if (opt_file)
-			printf("%lu\t", voff);
+			printf("%lx\t", voff);
 		if (opt_list_cgroup)
 			printf("@%llu\t", (unsigned long long)cgroup0);
 		if (opt_list_mapcnt)
@@ -418,7 +418,7 @@ static void show_page(unsigned long voff
 	if (opt_pid)
 		printf("%lx\t", voffset);
 	if (opt_file)
-		printf("%lu\t", voffset);
+		printf("%lx\t", voffset);
 	if (opt_list_cgroup)
 		printf("@%llu\t", (unsigned long long)cgroup);
 	if (opt_list_mapcnt)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 162/262] arch_numa: simplify numa_distance allocation
  2021-11-05 20:34 incoming Andrew Morton
                   ` (160 preceding siblings ...)
  2021-11-05 20:43 ` [patch 161/262] tools/vm/page-types.c: print file offset in hexadecimal Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 163/262] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer Andrew Morton
                   ` (99 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: arch_numa: simplify numa_distance allocation

Patch series "memblock: cleanup memblock_free interface", v2.

This is the fix for memblock freeing APIs mismatch [1].

The first patch is a cleanup of the numa_distance allocation in arch_numa
that I spotted during the conversion.  The second patch is a fix for Xen
memory freeing on some of the error paths.

[1] https://lore.kernel.org/all/CAHk-=wj9k4LZTz+svCxLYs5Y1=+yKrbAUArH1+ghyG3OLd8VVg@mail.gmail.com


This patch (of 6):

Memory allocation of numa_distance uses memblock_phys_alloc_range()
without actual range limits, converts the returned physical address to
virtual and then only uses the virtual address for further initialization.

Simplify this by replacing memblock_phys_alloc_range() with
memblock_alloc().

Link: https://lkml.kernel.org/r/20210930185031.18648-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210930185031.18648-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/arch_numa.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/drivers/base/arch_numa.c~arch_numa-simplify-numa_distance-allocation
+++ a/drivers/base/arch_numa.c
@@ -337,15 +337,13 @@ void __init numa_free_distance(void)
 static int __init numa_alloc_distance(void)
 {
 	size_t size;
-	u64 phys;
 	int i, j;
 
 	size = nr_node_ids * nr_node_ids * sizeof(numa_distance[0]);
-	phys = memblock_phys_alloc_range(size, PAGE_SIZE, 0, PFN_PHYS(max_pfn));
-	if (WARN_ON(!phys))
+	numa_distance = memblock_alloc(size, PAGE_SIZE);
+	if (WARN_ON(!numa_distance))
 		return -ENOMEM;
 
-	numa_distance = __va(phys);
 	numa_distance_cnt = nr_node_ids;
 
 	/* fill with the default distances */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 163/262] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer
  2021-11-05 20:34 incoming Andrew Morton
                   ` (161 preceding siblings ...)
  2021-11-05 20:43 ` [patch 162/262] arch_numa: simplify numa_distance allocation Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 164/262] memblock: drop memblock_free_early_nid() and memblock_free_early() Andrew Morton
                   ` (98 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer

free_p2m_page() wrongly passes a virtual pointer to memblock_free() that
treats it as a physical address.

Call memblock_free_ptr() instead, which takes a virtual address, to free
the memory.
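
The difference, in a minimal sketch (both forms appear in this series):

  /* memblock_free() expects a physical address, hence the __pa() */
  memblock_free(__pa(ptr), size);

  /* memblock_free_ptr() takes the virtual pointer directly */
  memblock_free_ptr(ptr, size);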

Link: https://lkml.kernel.org/r/20210930185031.18648-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/xen/p2m.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/xen/p2m.c~xen-x86-free_p2m_page-use-memblock_free_ptr-to-free-a-virtual-pointer
+++ a/arch/x86/xen/p2m.c
@@ -197,7 +197,7 @@ static void * __ref alloc_p2m_page(void)
 static void __ref free_p2m_page(void *p)
 {
 	if (unlikely(!slab_is_available())) {
-		memblock_free((unsigned long)p, PAGE_SIZE);
+		memblock_free_ptr(p, PAGE_SIZE);
 		return;
 	}
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 164/262] memblock: drop memblock_free_early_nid() and memblock_free_early()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (162 preceding siblings ...)
  2021-11-05 20:43 ` [patch 163/262] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 165/262] memblock: stop aliasing __memblock_free_late with memblock_free_late Andrew Morton
                   ` (97 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: drop memblock_free_early_nid() and memblock_free_early()

memblock_free_early_nid() is unused and memblock_free_early() is an alias
for memblock_free().

Replace calls to memblock_free_early() with calls to memblock_free() and
remove memblock_free_early() and memblock_free_early_nid().

Link: https://lkml.kernel.org/r/20210930185031.18648-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/mips/mm/init.c                  |    2 +-
 arch/powerpc/platforms/pseries/svm.c |    3 +--
 arch/s390/kernel/smp.c               |    2 +-
 drivers/base/arch_numa.c             |    2 +-
 drivers/s390/char/sclp_early.c       |    2 +-
 include/linux/memblock.h             |   12 ------------
 kernel/dma/swiotlb.c                 |    2 +-
 lib/cpumask.c                        |    2 +-
 mm/percpu.c                          |    8 ++++----
 mm/sparse.c                          |    2 +-
 10 files changed, 12 insertions(+), 25 deletions(-)

--- a/arch/mips/mm/init.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/arch/mips/mm/init.c
@@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_free_early(__pa(ptr), size);
+	memblock_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/arch/powerpc/platforms/pseries/svm.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/arch/powerpc/platforms/pseries/svm.c
@@ -56,8 +56,7 @@ void __init svm_swiotlb_init(void)
 		return;
 
 
-	memblock_free_early(__pa(vstart),
-			    PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+	memblock_free(__pa(vstart), PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
 	panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
--- a/arch/s390/kernel/smp.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/arch/s390/kernel/smp.c
@@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
 
 	/* Add CPUs present at boot */
 	__smp_rescan_cpus(info, true);
-	memblock_free_early((unsigned long)info, sizeof(*info));
+	memblock_free((unsigned long)info, sizeof(*info));
 }
 
 /*
--- a/drivers/base/arch_numa.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/drivers/base/arch_numa.c
@@ -166,7 +166,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_free_early(__pa(ptr), size);
+	memblock_free(__pa(ptr), size);
 }
 
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
--- a/drivers/s390/char/sclp_early.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/drivers/s390/char/sclp_early.c
@@ -139,7 +139,7 @@ int __init sclp_early_get_core_info(stru
 	}
 	sclp_fill_core_info(info, sccb);
 out:
-	memblock_free_early((unsigned long)sccb, length);
+	memblock_free((unsigned long)sccb, length);
 	return rc;
 }
 
--- a/include/linux/memblock.h~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/include/linux/memblock.h
@@ -441,18 +441,6 @@ static inline void *memblock_alloc_node(
 				      MEMBLOCK_ALLOC_ACCESSIBLE, nid);
 }
 
-static inline void memblock_free_early(phys_addr_t base,
-					      phys_addr_t size)
-{
-	memblock_free(base, size);
-}
-
-static inline void memblock_free_early_nid(phys_addr_t base,
-						  phys_addr_t size, int nid)
-{
-	memblock_free(base, size);
-}
-
 static inline void memblock_free_late(phys_addr_t base, phys_addr_t size)
 {
 	__memblock_free_late(base, size);
--- a/kernel/dma/swiotlb.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/kernel/dma/swiotlb.c
@@ -247,7 +247,7 @@ swiotlb_init(int verbose)
 	return;
 
 fail_free_mem:
-	memblock_free_early(__pa(tlb), bytes);
+	memblock_free(__pa(tlb), bytes);
 fail:
 	pr_warn("Cannot allocate buffer");
 }
--- a/lib/cpumask.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/lib/cpumask.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(free_cpumask_var);
  */
 void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
-	memblock_free_early(__pa(mask), cpumask_size());
+	memblock_free(__pa(mask), cpumask_size());
 }
 #endif
 
--- a/mm/percpu.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/mm/percpu.c
@@ -2472,7 +2472,7 @@ struct pcpu_alloc_info * __init pcpu_all
  */
 void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
 {
-	memblock_free_early(__pa(ai), ai->__ai_size);
+	memblock_free(__pa(ai), ai->__ai_size);
 }
 
 /**
@@ -3134,7 +3134,7 @@ out_free_areas:
 out_free:
 	pcpu_free_alloc_info(ai);
 	if (areas)
-		memblock_free_early(__pa(areas), areas_size);
+		memblock_free(__pa(areas), areas_size);
 	return rc;
 }
 #endif /* BUILD_EMBED_FIRST_CHUNK */
@@ -3256,7 +3256,7 @@ enomem:
 		free_fn(page_address(pages[j]), PAGE_SIZE);
 	rc = -ENOMEM;
 out_free_ar:
-	memblock_free_early(__pa(pages), pages_size);
+	memblock_free(__pa(pages), pages_size);
 	pcpu_free_alloc_info(ai);
 	return rc;
 }
@@ -3286,7 +3286,7 @@ static void * __init pcpu_dfl_fc_alloc(u
 
 static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
 {
-	memblock_free_early(__pa(ptr), size);
+	memblock_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/mm/sparse.c~memblock-drop-memblock_free_early_nid-and-memblock_free_early
+++ a/mm/sparse.c
@@ -451,7 +451,7 @@ static void *sparsemap_buf_end __meminit
 static inline void __meminit sparse_buffer_free(unsigned long size)
 {
 	WARN_ON(!sparsemap_buf || size == 0);
-	memblock_free_early(__pa(sparsemap_buf), size);
+	memblock_free(__pa(sparsemap_buf), size);
 }
 
 static void __init sparse_buffer_init(unsigned long size, int nid)
_

* [patch 165/262] memblock: stop aliasing __memblock_free_late with memblock_free_late
  2021-11-05 20:34 incoming Andrew Morton
                   ` (163 preceding siblings ...)
  2021-11-05 20:43 ` [patch 164/262] memblock: drop memblock_free_early_nid() and memblock_free_early() Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 166/262] memblock: rename memblock_free to memblock_phys_free Andrew Morton
                   ` (96 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: stop aliasing __memblock_free_late with memblock_free_late

memblock_free_late() is a NOP wrapper for __memblock_free_late(); there is
no point in keeping this indirection.

Drop the wrapper and rename __memblock_free_late() to
memblock_free_late().
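
As a rough sketch (a hypothetical caller, example_release_late() is not
part of this patch), a late boot-time free now names the function
directly instead of going through the dropped wrapper:

	#include <linux/memblock.h>

	/* Release a memblock-reserved range once the buddy allocator is up. */
	static void __init example_release_late(phys_addr_t base, phys_addr_t size)
	{
		memblock_free_late(base, size);	/* was __memblock_free_late() */
	}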

Link: https://lkml.kernel.org/r/20210930185031.18648-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |    7 +------
 mm/memblock.c            |    8 ++++----
 2 files changed, 5 insertions(+), 10 deletions(-)

--- a/include/linux/memblock.h~memblock-stop-aliasing-__memblock_free_late-with-memblock_free_late
+++ a/include/linux/memblock.h
@@ -133,7 +133,7 @@ void __next_mem_range_rev(u64 *idx, int
 			  struct memblock_type *type_b, phys_addr_t *out_start,
 			  phys_addr_t *out_end, int *out_nid);
 
-void __memblock_free_late(phys_addr_t base, phys_addr_t size);
+void memblock_free_late(phys_addr_t base, phys_addr_t size);
 
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 static inline void __next_physmem_range(u64 *idx, struct memblock_type *type,
@@ -441,11 +441,6 @@ static inline void *memblock_alloc_node(
 				      MEMBLOCK_ALLOC_ACCESSIBLE, nid);
 }
 
-static inline void memblock_free_late(phys_addr_t base, phys_addr_t size)
-{
-	__memblock_free_late(base, size);
-}
-
 /*
  * Set the allocation direction to bottom-up or top-down.
  */
--- a/mm/memblock.c~memblock-stop-aliasing-__memblock_free_late-with-memblock_free_late
+++ a/mm/memblock.c
@@ -366,14 +366,14 @@ void __init memblock_discard(void)
 		addr = __pa(memblock.reserved.regions);
 		size = PAGE_ALIGN(sizeof(struct memblock_region) *
 				  memblock.reserved.max);
-		__memblock_free_late(addr, size);
+		memblock_free_late(addr, size);
 	}
 
 	if (memblock.memory.regions != memblock_memory_init_regions) {
 		addr = __pa(memblock.memory.regions);
 		size = PAGE_ALIGN(sizeof(struct memblock_region) *
 				  memblock.memory.max);
-		__memblock_free_late(addr, size);
+		memblock_free_late(addr, size);
 	}
 
 	memblock_memory = NULL;
@@ -1589,7 +1589,7 @@ void * __init memblock_alloc_try_nid(
 }
 
 /**
- * __memblock_free_late - free pages directly to buddy allocator
+ * memblock_free_late - free pages directly to buddy allocator
  * @base: phys starting address of the  boot memory block
  * @size: size of the boot memory block in bytes
  *
@@ -1597,7 +1597,7 @@ void * __init memblock_alloc_try_nid(
  * down, but we are still initializing the system.  Pages are released directly
  * to the buddy allocator.
  */
-void __init __memblock_free_late(phys_addr_t base, phys_addr_t size)
+void __init memblock_free_late(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t cursor, end;
 
_

* [patch 166/262] memblock: rename memblock_free to memblock_phys_free
  2021-11-05 20:34 incoming Andrew Morton
                   ` (164 preceding siblings ...)
  2021-11-05 20:43 ` [patch 165/262] memblock: stop aliasing __memblock_free_late with memblock_free_late Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 167/262] memblock: use memblock_free for freeing virtual pointers Andrew Morton
                   ` (95 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: rename memblock_free to memblock_phys_free

Since memblock_free() operates on a physical range, make its name reflect
that by renaming it to memblock_phys_free(), so that it becomes a logical
counterpart to memblock_phys_alloc().

The callers are updated with the below semantic patch:

@@
expression addr;
expression size;
@@
- memblock_free(addr, size);
+ memblock_phys_free(addr, size);
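
As a rough illustration (hypothetical code, example_phys_roundtrip() is
not from this series), the rename lets the physical-address API pair up
naturally with its allocator:

	#include <linux/memblock.h>
	#include <linux/sizes.h>

	static void __init example_phys_roundtrip(void)
	{
		/* memblock_phys_alloc() hands back a physical address ... */
		phys_addr_t pa = memblock_phys_alloc(SZ_4K, SZ_4K);

		/* ... which is freed through the matching memblock_phys_free(). */
		if (pa)
			memblock_phys_free(pa, SZ_4K);
	}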

Link: https://lkml.kernel.org/r/20210930185031.18648-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/core_irongate.c         |    3 ++-
 arch/arc/mm/init.c                        |    2 +-
 arch/arm/mach-hisi/platmcpm.c             |    2 +-
 arch/arm/mm/init.c                        |    2 +-
 arch/arm64/mm/mmu.c                       |    4 ++--
 arch/mips/mm/init.c                       |    2 +-
 arch/mips/sgi-ip30/ip30-setup.c           |    6 +++---
 arch/powerpc/kernel/dt_cpu_ftrs.c         |    4 ++--
 arch/powerpc/kernel/paca.c                |    8 ++++----
 arch/powerpc/kernel/setup-common.c        |    2 +-
 arch/powerpc/kernel/setup_64.c            |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |    2 +-
 arch/powerpc/platforms/pseries/svm.c      |    3 ++-
 arch/riscv/kernel/setup.c                 |    5 +++--
 arch/s390/kernel/setup.c                  |    8 ++++----
 arch/s390/kernel/smp.c                    |    4 ++--
 arch/s390/kernel/uv.c                     |    2 +-
 arch/s390/mm/kasan_init.c                 |    2 +-
 arch/sh/boards/mach-ap325rxa/setup.c      |    2 +-
 arch/sh/boards/mach-ecovec24/setup.c      |    4 ++--
 arch/sh/boards/mach-kfr2r09/setup.c       |    2 +-
 arch/sh/boards/mach-migor/setup.c         |    2 +-
 arch/sh/boards/mach-se/7724/setup.c       |    4 ++--
 arch/sparc/kernel/smp_64.c                |    2 +-
 arch/um/kernel/mem.c                      |    2 +-
 arch/x86/kernel/setup.c                   |    4 ++--
 arch/x86/mm/init.c                        |    2 +-
 arch/x86/xen/mmu_pv.c                     |    6 +++---
 arch/x86/xen/setup.c                      |    6 +++---
 drivers/base/arch_numa.c                  |    2 +-
 drivers/firmware/efi/memmap.c             |    2 +-
 drivers/of/kexec.c                        |    3 +--
 drivers/of/of_reserved_mem.c              |    5 +++--
 drivers/s390/char/sclp_early.c            |    2 +-
 drivers/usb/early/xhci-dbc.c              |   10 +++++-----
 drivers/xen/swiotlb-xen.c                 |    2 +-
 include/linux/memblock.h                  |    2 +-
 init/initramfs.c                          |    2 +-
 kernel/dma/swiotlb.c                      |    2 +-
 lib/cpumask.c                             |    2 +-
 mm/cma.c                                  |    2 +-
 mm/memblock.c                             |    8 ++++----
 mm/memory_hotplug.c                       |    2 +-
 mm/percpu.c                               |    8 ++++----
 mm/sparse.c                               |    2 +-
 45 files changed, 79 insertions(+), 76 deletions(-)

--- a/arch/alpha/kernel/core_irongate.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/alpha/kernel/core_irongate.c
@@ -233,7 +233,8 @@ albacore_init_arch(void)
 			unsigned long size;
 
 			size = initrd_end - initrd_start;
-			memblock_free(__pa(initrd_start), PAGE_ALIGN(size));
+			memblock_phys_free(__pa(initrd_start),
+					   PAGE_ALIGN(size));
 			if (!move_initrd(pci_mem))
 				printk("irongate_init_arch: initrd too big "
 				       "(%ldK)\ndisabling initrd\n",
--- a/arch/arc/mm/init.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/arc/mm/init.c
@@ -173,7 +173,7 @@ static void __init highmem_init(void)
 #ifdef CONFIG_HIGHMEM
 	unsigned long tmp;
 
-	memblock_free(high_mem_start, high_mem_sz);
+	memblock_phys_free(high_mem_start, high_mem_sz);
 	for (tmp = min_high_pfn; tmp < max_high_pfn; tmp++)
 		free_highmem_page(pfn_to_page(tmp));
 #endif
--- a/arch/arm64/mm/mmu.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/arm64/mm/mmu.c
@@ -738,8 +738,8 @@ void __init paging_init(void)
 	cpu_replace_ttbr1(lm_alias(swapper_pg_dir));
 	init_mm.pgd = swapper_pg_dir;
 
-	memblock_free(__pa_symbol(init_pg_dir),
-		      __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
+	memblock_phys_free(__pa_symbol(init_pg_dir),
+			   __pa_symbol(init_pg_end) - __pa_symbol(init_pg_dir));
 
 	memblock_allow_resize();
 }
--- a/arch/arm/mach-hisi/platmcpm.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/arm/mach-hisi/platmcpm.c
@@ -339,7 +339,7 @@ err_fabric:
 err_sysctrl:
 	iounmap(relocation);
 err_reloc:
-	memblock_free(hip04_boot_method[0], hip04_boot_method[1]);
+	memblock_phys_free(hip04_boot_method[0], hip04_boot_method[1]);
 err:
 	return ret;
 }
--- a/arch/arm/mm/init.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/arm/mm/init.c
@@ -158,7 +158,7 @@ phys_addr_t __init arm_memblock_steal(ph
 		panic("Failed to steal %pa bytes at %pS\n",
 		      &size, (void *)_RET_IP_);
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 
 	return phys;
--- a/arch/mips/mm/init.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/mips/mm/init.c
@@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_free(__pa(ptr), size);
+	memblock_phys_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/arch/mips/sgi-ip30/ip30-setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/mips/sgi-ip30/ip30-setup.c
@@ -69,10 +69,10 @@ static void __init ip30_mem_init(void)
 		total_mem += size;
 
 		if (addr >= IP30_REAL_MEMORY_START)
-			memblock_free(addr, size);
+			memblock_phys_free(addr, size);
 		else if ((addr + size) > IP30_REAL_MEMORY_START)
-			memblock_free(IP30_REAL_MEMORY_START,
-				     size - IP30_MAX_PROM_MEMORY);
+			memblock_phys_free(IP30_REAL_MEMORY_START,
+					   size - IP30_MAX_PROM_MEMORY);
 	}
 	pr_info("Detected %luMB of physical memory.\n", MEM_SHIFT(total_mem));
 }
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -1095,8 +1095,8 @@ static int __init dt_cpu_ftrs_scan_callb
 
 	cpufeatures_setup_finished();
 
-	memblock_free(__pa(dt_cpu_features),
-			sizeof(struct dt_cpu_feature)*nr_dt_cpu_features);
+	memblock_phys_free(__pa(dt_cpu_features),
+			   sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
 
 	return 0;
 }
--- a/arch/powerpc/kernel/paca.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/kernel/paca.c
@@ -322,8 +322,8 @@ void __init free_unused_pacas(void)
 
 	new_ptrs_size = sizeof(struct paca_struct *) * nr_cpu_ids;
 	if (new_ptrs_size < paca_ptrs_size)
-		memblock_free(__pa(paca_ptrs) + new_ptrs_size,
-					paca_ptrs_size - new_ptrs_size);
+		memblock_phys_free(__pa(paca_ptrs) + new_ptrs_size,
+				   paca_ptrs_size - new_ptrs_size);
 
 	paca_nr_cpu_ids = nr_cpu_ids;
 	paca_ptrs_size = new_ptrs_size;
@@ -331,8 +331,8 @@ void __init free_unused_pacas(void)
 #ifdef CONFIG_PPC_BOOK3S_64
 	if (early_radix_enabled()) {
 		/* Ugly fixup, see new_slb_shadow() */
-		memblock_free(__pa(paca_ptrs[boot_cpuid]->slb_shadow_ptr),
-				sizeof(struct slb_shadow));
+		memblock_phys_free(__pa(paca_ptrs[boot_cpuid]->slb_shadow_ptr),
+				   sizeof(struct slb_shadow));
 		paca_ptrs[boot_cpuid]->slb_shadow_ptr = NULL;
 	}
 #endif
--- a/arch/powerpc/kernel/setup_64.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/kernel/setup_64.c
@@ -812,7 +812,7 @@ static void * __init pcpu_alloc_bootmem(
 
 static void __init pcpu_free_bootmem(void *ptr, size_t size)
 {
-	memblock_free(__pa(ptr), size);
+	memblock_phys_free(__pa(ptr), size);
 }
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
--- a/arch/powerpc/kernel/setup-common.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/kernel/setup-common.c
@@ -825,7 +825,7 @@ static void __init smp_setup_pacas(void)
 		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
 	}
 
-	memblock_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
+	memblock_phys_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
 	cpu_to_phys_id = NULL;
 }
 #endif
--- a/arch/powerpc/platforms/powernv/pci-ioda.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2981,7 +2981,7 @@ static void __init pnv_pci_init_ioda_phb
 	if (!phb->hose) {
 		pr_err("  Can't allocate PCI controller for %pOF\n",
 		       np);
-		memblock_free(__pa(phb), sizeof(struct pnv_phb));
+		memblock_phys_free(__pa(phb), sizeof(struct pnv_phb));
 		return;
 	}
 
--- a/arch/powerpc/platforms/pseries/svm.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/powerpc/platforms/pseries/svm.c
@@ -56,7 +56,8 @@ void __init svm_swiotlb_init(void)
 		return;
 
 
-	memblock_free(__pa(vstart), PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+	memblock_phys_free(__pa(vstart),
+			   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
 	panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
--- a/arch/riscv/kernel/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/riscv/kernel/setup.c
@@ -230,13 +230,14 @@ static void __init init_resources(void)
 
 	/* Clean-up any unused pre-allocated resources */
 	if (res_idx >= 0)
-		memblock_free(__pa(mem_res), (res_idx + 1) * sizeof(*mem_res));
+		memblock_phys_free(__pa(mem_res),
+				   (res_idx + 1) * sizeof(*mem_res));
 	return;
 
  error:
 	/* Better an empty resource tree than an inconsistent one */
 	release_child_resources(&iomem_resource);
-	memblock_free(__pa(mem_res), mem_res_sz);
+	memblock_phys_free(__pa(mem_res), mem_res_sz);
 }
 
 
--- a/arch/s390/kernel/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/s390/kernel/setup.c
@@ -693,7 +693,7 @@ static void __init reserve_crashkernel(v
 	}
 
 	if (register_memory_notifier(&kdump_mem_nb)) {
-		memblock_free(crash_base, crash_size);
+		memblock_phys_free(crash_base, crash_size);
 		return;
 	}
 
@@ -748,7 +748,7 @@ static void __init free_mem_detect_info(
 
 	get_mem_detect_reserved(&start, &size);
 	if (size)
-		memblock_free(start, size);
+		memblock_phys_free(start, size);
 }
 
 static const char * __init get_mem_info_source(void)
@@ -793,7 +793,7 @@ static void __init check_initrd(void)
 	if (initrd_data.start && initrd_data.size &&
 	    !memblock_is_region_memory(initrd_data.start, initrd_data.size)) {
 		pr_err("The initial RAM disk does not fit into the memory\n");
-		memblock_free(initrd_data.start, initrd_data.size);
+		memblock_phys_free(initrd_data.start, initrd_data.size);
 		initrd_start = initrd_end = 0;
 	}
 #endif
@@ -890,7 +890,7 @@ static void __init setup_randomness(void
 
 	if (stsi(vmms, 3, 2, 2) == 0 && vmms->count)
 		add_device_randomness(&vmms->vm, sizeof(vmms->vm[0]) * vmms->count);
-	memblock_free((unsigned long) vmms, PAGE_SIZE);
+	memblock_phys_free((unsigned long)vmms, PAGE_SIZE);
 }
 
 /*
--- a/arch/s390/kernel/smp.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/s390/kernel/smp.c
@@ -723,7 +723,7 @@ void __init smp_save_dump_cpus(void)
 			/* Get the CPU registers */
 			smp_save_cpu_regs(sa, addr, is_boot_cpu, page);
 	}
-	memblock_free(page, PAGE_SIZE);
+	memblock_phys_free(page, PAGE_SIZE);
 	diag_amode31_ops.diag308_reset();
 	pcpu_set_smt(0);
 }
@@ -880,7 +880,7 @@ void __init smp_detect_cpus(void)
 
 	/* Add CPUs present at boot */
 	__smp_rescan_cpus(info, true);
-	memblock_free((unsigned long)info, sizeof(*info));
+	memblock_phys_free((unsigned long)info, sizeof(*info));
 }
 
 /*
--- a/arch/s390/kernel/uv.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/s390/kernel/uv.c
@@ -64,7 +64,7 @@ void __init setup_uv(void)
 	}
 
 	if (uv_init(uv_stor_base, uv_info.uv_base_stor_len)) {
-		memblock_free(uv_stor_base, uv_info.uv_base_stor_len);
+		memblock_phys_free(uv_stor_base, uv_info.uv_base_stor_len);
 		goto fail;
 	}
 
--- a/arch/s390/mm/kasan_init.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/s390/mm/kasan_init.c
@@ -399,5 +399,5 @@ void __init kasan_copy_shadow_mapping(vo
 
 void __init kasan_free_early_identity(void)
 {
-	memblock_free(pgalloc_pos, pgalloc_freeable - pgalloc_pos);
+	memblock_phys_free(pgalloc_pos, pgalloc_freeable - pgalloc_pos);
 }
--- a/arch/sh/boards/mach-ap325rxa/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sh/boards/mach-ap325rxa/setup.c
@@ -560,7 +560,7 @@ static void __init ap325rxa_mv_mem_reser
 	if (!phys)
 		panic("Failed to allocate CEU memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 
 	ceu_dma_membase = phys;
--- a/arch/sh/boards/mach-ecovec24/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sh/boards/mach-ecovec24/setup.c
@@ -1502,7 +1502,7 @@ static void __init ecovec_mv_mem_reserve
 	if (!phys)
 		panic("Failed to allocate CEU0 memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 	ceu0_dma_membase = phys;
 
@@ -1510,7 +1510,7 @@ static void __init ecovec_mv_mem_reserve
 	if (!phys)
 		panic("Failed to allocate CEU1 memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 	ceu1_dma_membase = phys;
 }
--- a/arch/sh/boards/mach-kfr2r09/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sh/boards/mach-kfr2r09/setup.c
@@ -633,7 +633,7 @@ static void __init kfr2r09_mv_mem_reserv
 	if (!phys)
 		panic("Failed to allocate CEU memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 
 	ceu_dma_membase = phys;
--- a/arch/sh/boards/mach-migor/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sh/boards/mach-migor/setup.c
@@ -633,7 +633,7 @@ static void __init migor_mv_mem_reserve(
 	if (!phys)
 		panic("Failed to allocate CEU memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 
 	ceu_dma_membase = phys;
--- a/arch/sh/boards/mach-se/7724/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sh/boards/mach-se/7724/setup.c
@@ -966,7 +966,7 @@ static void __init ms7724se_mv_mem_reser
 	if (!phys)
 		panic("Failed to allocate CEU0 memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 	ceu0_dma_membase = phys;
 
@@ -974,7 +974,7 @@ static void __init ms7724se_mv_mem_reser
 	if (!phys)
 		panic("Failed to allocate CEU1 memory\n");
 
-	memblock_free(phys, size);
+	memblock_phys_free(phys, size);
 	memblock_remove(phys, size);
 	ceu1_dma_membase = phys;
 }
--- a/arch/sparc/kernel/smp_64.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/sparc/kernel/smp_64.c
@@ -1567,7 +1567,7 @@ static void * __init pcpu_alloc_bootmem(
 
 static void __init pcpu_free_bootmem(void *ptr, size_t size)
 {
-	memblock_free(__pa(ptr), size);
+	memblock_phys_free(__pa(ptr), size);
 }
 
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
--- a/arch/um/kernel/mem.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/um/kernel/mem.c
@@ -47,7 +47,7 @@ void __init mem_init(void)
 	 */
 	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
 	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
-	memblock_free(__pa(brk_end), uml_reserved - brk_end);
+	memblock_phys_free(__pa(brk_end), uml_reserved - brk_end);
 	uml_reserved = brk_end;
 
 	/* this will put all low memory onto the freelists */
--- a/arch/x86/kernel/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/x86/kernel/setup.c
@@ -322,7 +322,7 @@ static void __init reserve_initrd(void)
 
 	relocate_initrd();
 
-	memblock_free(ramdisk_image, ramdisk_end - ramdisk_image);
+	memblock_phys_free(ramdisk_image, ramdisk_end - ramdisk_image);
 }
 
 #else
@@ -521,7 +521,7 @@ static void __init reserve_crashkernel(v
 	}
 
 	if (crash_base >= (1ULL << 32) && reserve_crashkernel_low()) {
-		memblock_free(crash_base, crash_size);
+		memblock_phys_free(crash_base, crash_size);
 		return;
 	}
 
--- a/arch/x86/mm/init.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/x86/mm/init.c
@@ -618,7 +618,7 @@ static void __init memory_map_top_down(u
 	 */
 	addr = memblock_phys_alloc_range(PMD_SIZE, PMD_SIZE, map_start,
 					 map_end);
-	memblock_free(addr, PMD_SIZE);
+	memblock_phys_free(addr, PMD_SIZE);
 	real_end = addr + PMD_SIZE;
 
 	/* step_size need to be small so pgt_buf from BRK could cover it */
--- a/arch/x86/xen/mmu_pv.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/x86/xen/mmu_pv.c
@@ -1025,7 +1025,7 @@ static void __init xen_free_ro_pages(uns
 	for (; vaddr < vaddr_end; vaddr += PAGE_SIZE)
 		make_lowmem_page_readwrite(vaddr);
 
-	memblock_free(paddr, size);
+	memblock_phys_free(paddr, size);
 }
 
 static void __init xen_cleanmfnmap_free_pgtbl(void *pgtbl, bool unpin)
@@ -1151,7 +1151,7 @@ static void __init xen_pagetable_p2m_fre
 		xen_cleanhighmap(addr, addr + size);
 		size = PAGE_ALIGN(xen_start_info->nr_pages *
 				  sizeof(unsigned long));
-		memblock_free(__pa(addr), size);
+		memblock_phys_free(__pa(addr), size);
 	} else {
 		xen_cleanmfnmap(addr);
 	}
@@ -1955,7 +1955,7 @@ void __init xen_relocate_p2m(void)
 		pfn_end = p2m_pfn_end;
 	}
 
-	memblock_free(PFN_PHYS(pfn), PAGE_SIZE * (pfn_end - pfn));
+	memblock_phys_free(PFN_PHYS(pfn), PAGE_SIZE * (pfn_end - pfn));
 	while (pfn < pfn_end) {
 		if (pfn == p2m_pfn) {
 			pfn = p2m_pfn_end;
--- a/arch/x86/xen/setup.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/arch/x86/xen/setup.c
@@ -153,7 +153,7 @@ static void __init xen_del_extra_mem(uns
 			break;
 		}
 	}
-	memblock_free(PFN_PHYS(start_pfn), PFN_PHYS(n_pfns));
+	memblock_phys_free(PFN_PHYS(start_pfn), PFN_PHYS(n_pfns));
 }
 
 /*
@@ -719,7 +719,7 @@ static void __init xen_reserve_xen_mfnli
 		return;
 
 	xen_relocate_p2m();
-	memblock_free(start, size);
+	memblock_phys_free(start, size);
 }
 
 /**
@@ -885,7 +885,7 @@ char * __init xen_memory_setup(void)
 		xen_phys_memcpy(new_area, start, size);
 		pr_info("initrd moved from [mem %#010llx-%#010llx] to [mem %#010llx-%#010llx]\n",
 			start, start + size, new_area, new_area + size);
-		memblock_free(start, size);
+		memblock_phys_free(start, size);
 		boot_params.hdr.ramdisk_image = new_area;
 		boot_params.ext_ramdisk_image = new_area >> 32;
 	}
--- a/drivers/base/arch_numa.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/base/arch_numa.c
@@ -166,7 +166,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_free(__pa(ptr), size);
+	memblock_phys_free(__pa(ptr), size);
 }
 
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
--- a/drivers/firmware/efi/memmap.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/firmware/efi/memmap.c
@@ -35,7 +35,7 @@ void __init __efi_memmap_free(u64 phys,
 		if (slab_is_available())
 			memblock_free_late(phys, size);
 		else
-			memblock_free(phys, size);
+			memblock_phys_free(phys, size);
 	} else if (flags & EFI_MEMMAP_SLAB) {
 		struct page *p = pfn_to_page(PHYS_PFN(phys));
 		unsigned int order = get_order(size);
--- a/drivers/of/kexec.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/of/kexec.c
@@ -171,8 +171,7 @@ int ima_free_kexec_buffer(void)
 	if (ret)
 		return ret;
 
-	return memblock_free(addr, size);
-
+	return memblock_phys_free(addr, size);
 }
 
 /**
--- a/drivers/of/of_reserved_mem.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/of/of_reserved_mem.c
@@ -46,7 +46,7 @@ static int __init early_init_dt_alloc_re
 	if (nomap) {
 		err = memblock_mark_nomap(base, size);
 		if (err)
-			memblock_free(base, size);
+			memblock_phys_free(base, size);
 		kmemleak_ignore_phys(base);
 	}
 
@@ -284,7 +284,8 @@ void __init fdt_init_reserved_mem(void)
 				if (nomap)
 					memblock_clear_nomap(rmem->base, rmem->size);
 				else
-					memblock_free(rmem->base, rmem->size);
+					memblock_phys_free(rmem->base,
+							   rmem->size);
 			}
 		}
 	}
--- a/drivers/s390/char/sclp_early.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/s390/char/sclp_early.c
@@ -139,7 +139,7 @@ int __init sclp_early_get_core_info(stru
 	}
 	sclp_fill_core_info(info, sccb);
 out:
-	memblock_free((unsigned long)sccb, length);
+	memblock_phys_free((unsigned long)sccb, length);
 	return rc;
 }
 
--- a/drivers/usb/early/xhci-dbc.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/usb/early/xhci-dbc.c
@@ -185,7 +185,7 @@ static void __init xdbc_free_ring(struct
 	if (!seg)
 		return;
 
-	memblock_free(seg->dma, PAGE_SIZE);
+	memblock_phys_free(seg->dma, PAGE_SIZE);
 	ring->segment = NULL;
 }
 
@@ -665,10 +665,10 @@ int __init early_xdbc_setup_hardware(voi
 		xdbc_free_ring(&xdbc.in_ring);
 
 		if (xdbc.table_dma)
-			memblock_free(xdbc.table_dma, PAGE_SIZE);
+			memblock_phys_free(xdbc.table_dma, PAGE_SIZE);
 
 		if (xdbc.out_dma)
-			memblock_free(xdbc.out_dma, PAGE_SIZE);
+			memblock_phys_free(xdbc.out_dma, PAGE_SIZE);
 
 		xdbc.table_base = NULL;
 		xdbc.out_buf = NULL;
@@ -987,8 +987,8 @@ free_and_quit:
 	xdbc_free_ring(&xdbc.evt_ring);
 	xdbc_free_ring(&xdbc.out_ring);
 	xdbc_free_ring(&xdbc.in_ring);
-	memblock_free(xdbc.table_dma, PAGE_SIZE);
-	memblock_free(xdbc.out_dma, PAGE_SIZE);
+	memblock_phys_free(xdbc.table_dma, PAGE_SIZE);
+	memblock_phys_free(xdbc.out_dma, PAGE_SIZE);
 	writel(0, &xdbc.xdbc_reg->control);
 	early_iounmap(xdbc.xhci_base, xdbc.xhci_length);
 
--- a/drivers/xen/swiotlb-xen.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/drivers/xen/swiotlb-xen.c
@@ -241,7 +241,7 @@ retry:
 	 */
 	rc = xen_swiotlb_fixup(start, nslabs);
 	if (rc) {
-		memblock_free(__pa(start), PAGE_ALIGN(bytes));
+		memblock_phys_free(__pa(start), PAGE_ALIGN(bytes));
 		if (nslabs > 1024 && repeat--) {
 			/* Min is 2MB */
 			nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
--- a/include/linux/memblock.h~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/include/linux/memblock.h
@@ -103,7 +103,7 @@ void memblock_allow_resize(void);
 int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
 int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
-int memblock_free(phys_addr_t base, phys_addr_t size);
+int memblock_phys_free(phys_addr_t base, phys_addr_t size);
 int memblock_reserve(phys_addr_t base, phys_addr_t size);
 #ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
 int memblock_physmem_add(phys_addr_t base, phys_addr_t size);
--- a/init/initramfs.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/init/initramfs.c
@@ -607,7 +607,7 @@ void __weak __init free_initrd_mem(unsig
 	unsigned long aligned_start = ALIGN_DOWN(start, PAGE_SIZE);
 	unsigned long aligned_end = ALIGN(end, PAGE_SIZE);
 
-	memblock_free(__pa(aligned_start), aligned_end - aligned_start);
+	memblock_phys_free(__pa(aligned_start), aligned_end - aligned_start);
 #endif
 
 	free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
--- a/kernel/dma/swiotlb.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/kernel/dma/swiotlb.c
@@ -247,7 +247,7 @@ swiotlb_init(int verbose)
 	return;
 
 fail_free_mem:
-	memblock_free(__pa(tlb), bytes);
+	memblock_phys_free(__pa(tlb), bytes);
 fail:
 	pr_warn("Cannot allocate buffer");
 }
--- a/lib/cpumask.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/lib/cpumask.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(free_cpumask_var);
  */
 void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
-	memblock_free(__pa(mask), cpumask_size());
+	memblock_phys_free(__pa(mask), cpumask_size());
 }
 #endif
 
--- a/mm/cma.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/mm/cma.c
@@ -378,7 +378,7 @@ int __init cma_declare_contiguous_nid(ph
 	return 0;
 
 free_mem:
-	memblock_free(base, size);
+	memblock_phys_free(base, size);
 err:
 	pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
 	return ret;
--- a/mm/memblock.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/mm/memblock.c
@@ -806,18 +806,18 @@ int __init_memblock memblock_remove(phys
 void __init_memblock memblock_free_ptr(void *ptr, size_t size)
 {
 	if (ptr)
-		memblock_free(__pa(ptr), size);
+		memblock_phys_free(__pa(ptr), size);
 }
 
 /**
- * memblock_free - free boot memory block
+ * memblock_phys_free - free boot memory block
  * @base: phys starting address of the  boot memory block
  * @size: size of the boot memory block in bytes
  *
  * Free boot memory block previously allocated by memblock_alloc_xx() API.
  * The freeing memory will not be released to the buddy allocator.
  */
-int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
+int __init_memblock memblock_phys_free(phys_addr_t base, phys_addr_t size)
 {
 	phys_addr_t end = base + size - 1;
 
@@ -1937,7 +1937,7 @@ static void __init free_memmap(unsigned
 	 * memmap array.
 	 */
 	if (pg < pgend)
-		memblock_free(pg, pgend - pg);
+		memblock_phys_free(pg, pgend - pg);
 }
 
 /*
--- a/mm/memory_hotplug.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/mm/memory_hotplug.c
@@ -2204,7 +2204,7 @@ static int __ref try_remove_memory(u64 s
 	arch_remove_memory(start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		memblock_free(start, size);
+		memblock_phys_free(start, size);
 		memblock_remove(start, size);
 	}
 
--- a/mm/percpu.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/mm/percpu.c
@@ -2472,7 +2472,7 @@ struct pcpu_alloc_info * __init pcpu_all
  */
 void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
 {
-	memblock_free(__pa(ai), ai->__ai_size);
+	memblock_phys_free(__pa(ai), ai->__ai_size);
 }
 
 /**
@@ -3134,7 +3134,7 @@ out_free_areas:
 out_free:
 	pcpu_free_alloc_info(ai);
 	if (areas)
-		memblock_free(__pa(areas), areas_size);
+		memblock_phys_free(__pa(areas), areas_size);
 	return rc;
 }
 #endif /* BUILD_EMBED_FIRST_CHUNK */
@@ -3256,7 +3256,7 @@ enomem:
 		free_fn(page_address(pages[j]), PAGE_SIZE);
 	rc = -ENOMEM;
 out_free_ar:
-	memblock_free(__pa(pages), pages_size);
+	memblock_phys_free(__pa(pages), pages_size);
 	pcpu_free_alloc_info(ai);
 	return rc;
 }
@@ -3286,7 +3286,7 @@ static void * __init pcpu_dfl_fc_alloc(u
 
 static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
 {
-	memblock_free(__pa(ptr), size);
+	memblock_phys_free(__pa(ptr), size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/mm/sparse.c~memblock-rename-memblock_free-to-memblock_phys_free
+++ a/mm/sparse.c
@@ -451,7 +451,7 @@ static void *sparsemap_buf_end __meminit
 static inline void __meminit sparse_buffer_free(unsigned long size)
 {
 	WARN_ON(!sparsemap_buf || size == 0);
-	memblock_free(__pa(sparsemap_buf), size);
+	memblock_phys_free(__pa(sparsemap_buf), size);
 }
 
 static void __init sparse_buffer_init(unsigned long size, int nid)
_

* [patch 167/262] memblock: use memblock_free for freeing virtual pointers
  2021-11-05 20:34 incoming Andrew Morton
                   ` (165 preceding siblings ...)
  2021-11-05 20:43 ` [patch 166/262] memblock: rename memblock_free to memblock_phys_free Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 168/262] mm: mark the OOM reaper thread as freezable Andrew Morton
                   ` (94 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, christophe.leroy, jgross, linux-mm, mm-commits, rppt, sfr,
	Shahab.Vahedi, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: use memblock_free for freeing virtual pointers

Rename memblock_free_ptr() to memblock_free() and use memblock_free() when
freeing a virtual pointer, so that memblock_free() becomes the counterpart
of memblock_alloc().

The callers are updated with the semantic patch below, plus the manual
addition of (void *) casts for pointers that are represented by unsigned
long variables.

@@
identifier vaddr;
expression size;
@@
(
- memblock_phys_free(__pa(vaddr), size);
+ memblock_free(vaddr, size);
|
- memblock_free_ptr(vaddr, size);
+ memblock_free(vaddr, size);
)
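
A corresponding sketch for the virtual side (again hypothetical,
example_virt_roundtrip() only illustrates the new pairing):

	#include <linux/memblock.h>
	#include <linux/sizes.h>

	static void __init example_virt_roundtrip(void)
	{
		/* memblock_alloc() returns a virtual pointer ... */
		void *buf = memblock_alloc(SZ_4K, SZ_4K);

		/* ... which memblock_free() now takes directly, no __pa() needed. */
		if (buf)
			memblock_free(buf, SZ_4K);
	}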

[sfr@canb.auug.org.au: fixup]
  Link: https://lkml.kernel.org/r/20211018192940.3d1d532f@canb.auug.org.au
Link: https://lkml.kernel.org/r/20210930185031.18648-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Juergen Gross <jgross@suse.com>
Cc: Shahab Vahedi <Shahab.Vahedi@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/core_irongate.c         |    3 +--
 arch/mips/mm/init.c                       |    2 +-
 arch/powerpc/kernel/dt_cpu_ftrs.c         |    4 ++--
 arch/powerpc/kernel/setup-common.c        |    2 +-
 arch/powerpc/kernel/setup_64.c            |    2 +-
 arch/powerpc/platforms/powernv/pci-ioda.c |    2 +-
 arch/powerpc/platforms/pseries/svm.c      |    3 +--
 arch/riscv/kernel/setup.c                 |    5 ++---
 arch/sparc/kernel/smp_64.c                |    2 +-
 arch/um/kernel/mem.c                      |    2 +-
 arch/x86/kernel/setup_percpu.c            |    2 +-
 arch/x86/mm/kasan_init_64.c               |    4 ++--
 arch/x86/mm/numa.c                        |    2 +-
 arch/x86/mm/numa_emulation.c              |    2 +-
 arch/x86/xen/mmu_pv.c                     |    2 +-
 arch/x86/xen/p2m.c                        |    2 +-
 drivers/base/arch_numa.c                  |    4 ++--
 drivers/macintosh/smu.c                   |    2 +-
 drivers/xen/swiotlb-xen.c                 |    2 +-
 include/linux/memblock.h                  |    2 +-
 init/initramfs.c                          |    2 +-
 init/main.c                               |    4 ++--
 kernel/dma/swiotlb.c                      |    2 +-
 kernel/printk/printk.c                    |    4 ++--
 lib/bootconfig.c                          |    2 +-
 lib/cpumask.c                             |    2 +-
 mm/memblock.c                             |    6 +++---
 mm/percpu.c                               |    8 ++++----
 mm/sparse.c                               |    2 +-
 29 files changed, 40 insertions(+), 43 deletions(-)

--- a/arch/alpha/kernel/core_irongate.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/alpha/kernel/core_irongate.c
@@ -233,8 +233,7 @@ albacore_init_arch(void)
 			unsigned long size;
 
 			size = initrd_end - initrd_start;
-			memblock_phys_free(__pa(initrd_start),
-					   PAGE_ALIGN(size));
+			memblock_free((void *)initrd_start, PAGE_ALIGN(size));
 			if (!move_initrd(pci_mem))
 				printk("irongate_init_arch: initrd too big "
 				       "(%ldK)\ndisabling initrd\n",
--- a/arch/mips/mm/init.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/mips/mm/init.c
@@ -529,7 +529,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_phys_free(__pa(ptr), size);
+	memblock_free(ptr, size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -1095,8 +1095,8 @@ static int __init dt_cpu_ftrs_scan_callb
 
 	cpufeatures_setup_finished();
 
-	memblock_phys_free(__pa(dt_cpu_features),
-			   sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
+	memblock_free(dt_cpu_features,
+		      sizeof(struct dt_cpu_feature) * nr_dt_cpu_features);
 
 	return 0;
 }
--- a/arch/powerpc/kernel/setup_64.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/powerpc/kernel/setup_64.c
@@ -812,7 +812,7 @@ static void * __init pcpu_alloc_bootmem(
 
 static void __init pcpu_free_bootmem(void *ptr, size_t size)
 {
-	memblock_phys_free(__pa(ptr), size);
+	memblock_free(ptr, size);
 }
 
 static int pcpu_cpu_distance(unsigned int from, unsigned int to)
--- a/arch/powerpc/kernel/setup-common.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/powerpc/kernel/setup-common.c
@@ -825,7 +825,7 @@ static void __init smp_setup_pacas(void)
 		set_hard_smp_processor_id(cpu, cpu_to_phys_id[cpu]);
 	}
 
-	memblock_phys_free(__pa(cpu_to_phys_id), nr_cpu_ids * sizeof(u32));
+	memblock_free(cpu_to_phys_id, nr_cpu_ids * sizeof(u32));
 	cpu_to_phys_id = NULL;
 }
 #endif
--- a/arch/powerpc/platforms/powernv/pci-ioda.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -2981,7 +2981,7 @@ static void __init pnv_pci_init_ioda_phb
 	if (!phb->hose) {
 		pr_err("  Can't allocate PCI controller for %pOF\n",
 		       np);
-		memblock_phys_free(__pa(phb), sizeof(struct pnv_phb));
+		memblock_free(phb, sizeof(struct pnv_phb));
 		return;
 	}
 
--- a/arch/powerpc/platforms/pseries/svm.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/powerpc/platforms/pseries/svm.c
@@ -56,8 +56,7 @@ void __init svm_swiotlb_init(void)
 		return;
 
 
-	memblock_phys_free(__pa(vstart),
-			   PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
+	memblock_free(vstart, PAGE_ALIGN(io_tlb_nslabs << IO_TLB_SHIFT));
 	panic("SVM: Cannot allocate SWIOTLB buffer");
 }
 
--- a/arch/riscv/kernel/setup.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/riscv/kernel/setup.c
@@ -230,14 +230,13 @@ static void __init init_resources(void)
 
 	/* Clean-up any unused pre-allocated resources */
 	if (res_idx >= 0)
-		memblock_phys_free(__pa(mem_res),
-				   (res_idx + 1) * sizeof(*mem_res));
+		memblock_free(mem_res, (res_idx + 1) * sizeof(*mem_res));
 	return;
 
  error:
 	/* Better an empty resource tree than an inconsistent one */
 	release_child_resources(&iomem_resource);
-	memblock_phys_free(__pa(mem_res), mem_res_sz);
+	memblock_free(mem_res, mem_res_sz);
 }
 
 
--- a/arch/sparc/kernel/smp_64.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/sparc/kernel/smp_64.c
@@ -1567,7 +1567,7 @@ static void * __init pcpu_alloc_bootmem(
 
 static void __init pcpu_free_bootmem(void *ptr, size_t size)
 {
-	memblock_phys_free(__pa(ptr), size);
+	memblock_free(ptr, size);
 }
 
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
--- a/arch/um/kernel/mem.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/um/kernel/mem.c
@@ -47,7 +47,7 @@ void __init mem_init(void)
 	 */
 	brk_end = (unsigned long) UML_ROUND_UP(sbrk(0));
 	map_memory(brk_end, __pa(brk_end), uml_reserved - brk_end, 1, 1, 0);
-	memblock_phys_free(__pa(brk_end), uml_reserved - brk_end);
+	memblock_free((void *)brk_end, uml_reserved - brk_end);
 	uml_reserved = brk_end;
 
 	/* this will put all low memory onto the freelists */
--- a/arch/x86/kernel/setup_percpu.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/kernel/setup_percpu.c
@@ -135,7 +135,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_free_ptr(ptr, size);
+	memblock_free(ptr, size);
 }
 
 static int __init pcpu_cpu_distance(unsigned int from, unsigned int to)
--- a/arch/x86/mm/kasan_init_64.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/mm/kasan_init_64.c
@@ -49,7 +49,7 @@ static void __init kasan_populate_pmd(pm
 			p = early_alloc(PMD_SIZE, nid, false);
 			if (p && pmd_set_huge(pmd, __pa(p), PAGE_KERNEL))
 				return;
-			memblock_free_ptr(p, PMD_SIZE);
+			memblock_free(p, PMD_SIZE);
 		}
 
 		p = early_alloc(PAGE_SIZE, nid, true);
@@ -85,7 +85,7 @@ static void __init kasan_populate_pud(pu
 			p = early_alloc(PUD_SIZE, nid, false);
 			if (p && pud_set_huge(pud, __pa(p), PAGE_KERNEL))
 				return;
-			memblock_free_ptr(p, PUD_SIZE);
+			memblock_free(p, PUD_SIZE);
 		}
 
 		p = early_alloc(PAGE_SIZE, nid, true);
--- a/arch/x86/mm/numa.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/mm/numa.c
@@ -355,7 +355,7 @@ void __init numa_reset_distance(void)
 
 	/* numa_distance could be 1LU marking allocation failure, test cnt */
 	if (numa_distance_cnt)
-		memblock_free_ptr(numa_distance, size);
+		memblock_free(numa_distance, size);
 	numa_distance_cnt = 0;
 	numa_distance = NULL;	/* enable table creation */
 }
--- a/arch/x86/mm/numa_emulation.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/mm/numa_emulation.c
@@ -517,7 +517,7 @@ void __init numa_emulation(struct numa_m
 	}
 
 	/* free the copied physical distance table */
-	memblock_free_ptr(phys_dist, phys_size);
+	memblock_free(phys_dist, phys_size);
 	return;
 
 no_emu:
--- a/arch/x86/xen/mmu_pv.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/xen/mmu_pv.c
@@ -1151,7 +1151,7 @@ static void __init xen_pagetable_p2m_fre
 		xen_cleanhighmap(addr, addr + size);
 		size = PAGE_ALIGN(xen_start_info->nr_pages *
 				  sizeof(unsigned long));
-		memblock_phys_free(__pa(addr), size);
+		memblock_free((void *)addr, size);
 	} else {
 		xen_cleanmfnmap(addr);
 	}
--- a/arch/x86/xen/p2m.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/arch/x86/xen/p2m.c
@@ -197,7 +197,7 @@ static void * __ref alloc_p2m_page(void)
 static void __ref free_p2m_page(void *p)
 {
 	if (unlikely(!slab_is_available())) {
-		memblock_free_ptr(p, PAGE_SIZE);
+		memblock_free(p, PAGE_SIZE);
 		return;
 	}
 
--- a/drivers/base/arch_numa.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/drivers/base/arch_numa.c
@@ -166,7 +166,7 @@ static void * __init pcpu_fc_alloc(unsig
 
 static void __init pcpu_fc_free(void *ptr, size_t size)
 {
-	memblock_phys_free(__pa(ptr), size);
+	memblock_free(ptr, size);
 }
 
 #ifdef CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK
@@ -326,7 +326,7 @@ void __init numa_free_distance(void)
 	size = numa_distance_cnt * numa_distance_cnt *
 		sizeof(numa_distance[0]);
 
-	memblock_free_ptr(numa_distance, size);
+	memblock_free(numa_distance, size);
 	numa_distance_cnt = 0;
 	numa_distance = NULL;
 }
--- a/drivers/macintosh/smu.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/drivers/macintosh/smu.c
@@ -570,7 +570,7 @@ fail_msg_node:
 fail_db_node:
 	of_node_put(smu->db_node);
 fail_bootmem:
-	memblock_free_ptr(smu, sizeof(struct smu_device));
+	memblock_free(smu, sizeof(struct smu_device));
 	smu = NULL;
 fail_np:
 	of_node_put(np);
--- a/drivers/xen/swiotlb-xen.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/drivers/xen/swiotlb-xen.c
@@ -241,7 +241,7 @@ retry:
 	 */
 	rc = xen_swiotlb_fixup(start, nslabs);
 	if (rc) {
-		memblock_phys_free(__pa(start), PAGE_ALIGN(bytes));
+		memblock_free(start, PAGE_ALIGN(bytes));
 		if (nslabs > 1024 && repeat--) {
 			/* Min is 2MB */
 			nslabs = max(1024UL, ALIGN(nslabs >> 1, IO_TLB_SEGSIZE));
--- a/include/linux/memblock.h~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/include/linux/memblock.h
@@ -118,7 +118,7 @@ int memblock_mark_nomap(phys_addr_t base
 int memblock_clear_nomap(phys_addr_t base, phys_addr_t size);
 
 void memblock_free_all(void);
-void memblock_free_ptr(void *ptr, size_t size);
+void memblock_free(void *ptr, size_t size);
 void reset_node_managed_pages(pg_data_t *pgdat);
 void reset_all_zones_managed_pages(void);
 
--- a/init/initramfs.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/init/initramfs.c
@@ -607,7 +607,7 @@ void __weak __init free_initrd_mem(unsig
 	unsigned long aligned_start = ALIGN_DOWN(start, PAGE_SIZE);
 	unsigned long aligned_end = ALIGN(end, PAGE_SIZE);
 
-	memblock_phys_free(__pa(aligned_start), aligned_end - aligned_start);
+	memblock_free((void *)aligned_start, aligned_end - aligned_start);
 #endif
 
 	free_reserved_area((void *)start, (void *)end, POISON_FREE_INITMEM,
--- a/init/main.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/init/main.c
@@ -382,7 +382,7 @@ static char * __init xbc_make_cmdline(co
 	ret = xbc_snprint_cmdline(new_cmdline, len + 1, root);
 	if (ret < 0 || ret > len) {
 		pr_err("Failed to print extra kernel cmdline.\n");
-		memblock_free_ptr(new_cmdline, len + 1);
+		memblock_free(new_cmdline, len + 1);
 		return NULL;
 	}
 
@@ -925,7 +925,7 @@ static void __init print_unknown_bootopt
 		end += sprintf(end, " %s", *p);
 
 	pr_notice("Unknown command line parameters:%s\n", unknown_options);
-	memblock_free_ptr(unknown_options, len);
+	memblock_free(unknown_options, len);
 }
 
 asmlinkage __visible void __init __no_sanitize_address start_kernel(void)
--- a/kernel/dma/swiotlb.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/kernel/dma/swiotlb.c
@@ -247,7 +247,7 @@ swiotlb_init(int verbose)
 	return;
 
 fail_free_mem:
-	memblock_phys_free(__pa(tlb), bytes);
+	memblock_free(tlb, bytes);
 fail:
 	pr_warn("Cannot allocate buffer");
 }
--- a/kernel/printk/printk.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/kernel/printk/printk.c
@@ -1166,9 +1166,9 @@ void __init setup_log_buf(int early)
 	return;
 
 err_free_descs:
-	memblock_free_ptr(new_descs, new_descs_size);
+	memblock_free(new_descs, new_descs_size);
 err_free_log_buf:
-	memblock_free_ptr(new_log_buf, new_log_buf_len);
+	memblock_free(new_log_buf, new_log_buf_len);
 }
 
 static bool __read_mostly ignore_loglevel;
--- a/lib/bootconfig.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/lib/bootconfig.c
@@ -792,7 +792,7 @@ void __init xbc_destroy_all(void)
 	xbc_data = NULL;
 	xbc_data_size = 0;
 	xbc_node_num = 0;
-	memblock_free_ptr(xbc_nodes, sizeof(struct xbc_node) * XBC_NODE_MAX);
+	memblock_free(xbc_nodes, sizeof(struct xbc_node) * XBC_NODE_MAX);
 	xbc_nodes = NULL;
 	brace_index = 0;
 }
--- a/lib/cpumask.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/lib/cpumask.c
@@ -188,7 +188,7 @@ EXPORT_SYMBOL(free_cpumask_var);
  */
 void __init free_bootmem_cpumask_var(cpumask_var_t mask)
 {
-	memblock_phys_free(__pa(mask), cpumask_size());
+	memblock_free(mask, cpumask_size());
 }
 #endif
 
--- a/mm/memblock.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/mm/memblock.c
@@ -472,7 +472,7 @@ static int __init_memblock memblock_doub
 		kfree(old_array);
 	else if (old_array != memblock_memory_init_regions &&
 		 old_array != memblock_reserved_init_regions)
-		memblock_free_ptr(old_array, old_alloc_size);
+		memblock_free(old_array, old_alloc_size);
 
 	/*
 	 * Reserve the new array if that comes from the memblock.  Otherwise, we
@@ -796,14 +796,14 @@ int __init_memblock memblock_remove(phys
 }
 
 /**
- * memblock_free_ptr - free boot memory allocation
+ * memblock_free - free boot memory allocation
  * @ptr: starting address of the  boot memory allocation
  * @size: size of the boot memory block in bytes
  *
  * Free boot memory block previously allocated by memblock_alloc_xx() API.
  * The freeing memory will not be released to the buddy allocator.
  */
-void __init_memblock memblock_free_ptr(void *ptr, size_t size)
+void __init_memblock memblock_free(void *ptr, size_t size)
 {
 	if (ptr)
 		memblock_phys_free(__pa(ptr), size);
--- a/mm/percpu.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/mm/percpu.c
@@ -2472,7 +2472,7 @@ struct pcpu_alloc_info * __init pcpu_all
  */
 void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai)
 {
-	memblock_phys_free(__pa(ai), ai->__ai_size);
+	memblock_free(ai, ai->__ai_size);
 }
 
 /**
@@ -3134,7 +3134,7 @@ out_free_areas:
 out_free:
 	pcpu_free_alloc_info(ai);
 	if (areas)
-		memblock_phys_free(__pa(areas), areas_size);
+		memblock_free(areas, areas_size);
 	return rc;
 }
 #endif /* BUILD_EMBED_FIRST_CHUNK */
@@ -3256,7 +3256,7 @@ enomem:
 		free_fn(page_address(pages[j]), PAGE_SIZE);
 	rc = -ENOMEM;
 out_free_ar:
-	memblock_phys_free(__pa(pages), pages_size);
+	memblock_free(pages, pages_size);
 	pcpu_free_alloc_info(ai);
 	return rc;
 }
@@ -3286,7 +3286,7 @@ static void * __init pcpu_dfl_fc_alloc(u
 
 static void __init pcpu_dfl_fc_free(void *ptr, size_t size)
 {
-	memblock_phys_free(__pa(ptr), size);
+	memblock_free(ptr, size);
 }
 
 void __init setup_per_cpu_areas(void)
--- a/mm/sparse.c~memblock-use-memblock_free-for-freeing-virtual-pointers
+++ a/mm/sparse.c
@@ -451,7 +451,7 @@ static void *sparsemap_buf_end __meminit
 static inline void __meminit sparse_buffer_free(unsigned long size)
 {
 	WARN_ON(!sparsemap_buf || size == 0);
-	memblock_phys_free(__pa(sparsemap_buf), size);
+	memblock_free(sparsemap_buf, size);
 }
 
 static void __init sparse_buffer_init(unsigned long size, int nid)
_

* [patch 168/262] mm: mark the OOM reaper thread as freezable
  2021-11-05 20:34 incoming Andrew Morton
                   ` (166 preceding siblings ...)
  2021-11-05 20:43 ` [patch 167/262] memblock: use memblock_free for freeing virtual pointers Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 169/262] hugetlbfs: extend the definition of hugepages parameter to support node allocation Andrew Morton
                   ` (93 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, linux-mm, mgorman, mhocko, mm-commits, rientjes, sultan, torvalds

From: Sultan Alsawaf <sultan@kerneltoast.com>
Subject: mm: mark the OOM reaper thread as freezable

The OOM reaper alters the user address space, which might theoretically
alter the snapshot if reaping is allowed to happen after the freezer has
reached its quiescent state.  To this end, the reaper kthread uses
wait_event_freezable() while waiting for any work so that it cannot run
while the system freezes.

However, the current implementation doesn't respect the freezer because
all kernel threads are created with the PF_NOFREEZE flag, so they are
automatically excluded from freezing operations.  This means that the
OOM reaper can race with system snapshotting if it has work to do while
the system is being frozen.

Fix this by adding a set_freezable() call which will clear the
PF_NOFREEZE flag and thus make the OOM reaper visible to the freezer.

Please note that the OOM reaper altering the snapshot this way is
mostly a theoretical concern and has not been observed in practice.
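
For context, a minimal sketch of the freezable-kthread pattern the fix
relies on (the example_* names and the wait queue are made up for
illustration; the real oom_reaper code differs):

	#include <linux/freezer.h>
	#include <linux/kthread.h>
	#include <linux/wait.h>

	static DECLARE_WAIT_QUEUE_HEAD(example_wait);

	static bool example_has_work(void)
	{
		return false;	/* placeholder condition, purely illustrative */
	}

	static int example_reaper(void *unused)
	{
		set_freezable();	/* drop PF_NOFREEZE so the freezer sees this kthread */

		while (!kthread_should_stop()) {
			/* Sleeps until there is work and parks safely across a freeze. */
			wait_event_freezable(example_wait, example_has_work());
			/* ... do the actual reaping here ... */
		}
		return 0;
	}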

Link: https://lkml.kernel.org/r/20210921165758.6154-1-sultan@kerneltoast.com
Link: https://lkml.kernel.org/r/20210918233920.9174-1-sultan@kerneltoast.com
Fixes: aac453635549 ("mm, oom: introduce oom reaper")
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/oom_kill.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/mm/oom_kill.c~mm-mark-the-oom-reaper-thread-as-freezable
+++ a/mm/oom_kill.c
@@ -641,6 +641,8 @@ done:
 
 static int oom_reaper(void *unused)
 {
+	set_freezable();
+
 	while (true) {
 		struct task_struct *tsk = NULL;
 
_

* [patch 169/262] hugetlbfs: extend the definition of hugepages parameter to support node allocation
  2021-11-05 20:34 incoming Andrew Morton
                   ` (167 preceding siblings ...)
  2021-11-05 20:43 ` [patch 168/262] mm: mark the OOM reaper thread as freezable Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 170/262] mm/migrate: de-duplicate migrate_reason strings Andrew Morton
                   ` (92 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, benh, corbet, dan.carpenter, linux-mm, mike.kravetz,
	mm-commits, mpe, nathan, paulus, rppt, torvalds, willy,
	yaozhenguo1

From: Zhenguo Yao <yaozhenguo1@gmail.com>
Subject: hugetlbfs: extend the definition of hugepages parameter to support node allocation

We can specify the number of hugepages to allocate at boot, but at present
they are balanced across all nodes.  In some scenarios we only need
hugepages in one node.  For example, DPDK needs hugepages that are in the
same node as the NIC.

If DPDK needs four 1G hugepages in node1 and the system has 16 NUMA nodes,
we must reserve 64 hugepages on the kernel cmdline, of which only four are
used; the others have to be freed after boot.  If system memory is low
(for example, 64G), this becomes an impossible task.

So extend the hugepages parameter to support specifying the number of
hugepages on a specific node.  For example, add the following parameter:

hugepagesz=1G hugepages=0:1,1:3

It will allocate 1 hugepage in node0 and 3 hugepages in node1.
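
One way to sanity-check the result after boot (standard hugetlb sysfs
layout, shown here purely as an example for the 1G case) is to read the
per-node counter, which should report 3 for node1 if the allocation
succeeded:

	cat /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages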

Link: https://lkml.kernel.org/r/20211005054729.86457-1-yaozhenguo1@gmail.com
Signed-off-by: Zhenguo Yao <yaozhenguo1@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zhenguo Yao <yaozhenguo1@gmail.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/kernel-parameters.txt |    8 
 Documentation/admin-guide/mm/hugetlbpage.rst    |   12 +
 arch/powerpc/mm/hugetlbpage.c                   |    9 
 include/linux/hugetlb.h                         |    6 
 mm/hugetlb.c                                    |  153 +++++++++++---
 5 files changed, 155 insertions(+), 33 deletions(-)

--- a/arch/powerpc/mm/hugetlbpage.c~hugetlbfs-extend-the-definition-of-hugepages-parameter-to-support-node-allocation
+++ a/arch/powerpc/mm/hugetlbpage.c
@@ -229,17 +229,22 @@ static int __init pseries_alloc_bootmem_
 	m->hstate = hstate;
 	return 1;
 }
+
+bool __init hugetlb_node_alloc_supported(void)
+{
+	return false;
+}
 #endif
 
 
-int __init alloc_bootmem_huge_page(struct hstate *h)
+int __init alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
 
 #ifdef CONFIG_PPC_BOOK3S_64
 	if (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled())
 		return pseries_alloc_bootmem_huge_page(h);
 #endif
-	return __alloc_bootmem_huge_page(h);
+	return __alloc_bootmem_huge_page(h, nid);
 }
 
 #ifndef CONFIG_PPC_BOOK3S_64
--- a/Documentation/admin-guide/kernel-parameters.txt~hugetlbfs-extend-the-definition-of-hugepages-parameter-to-support-node-allocation
+++ a/Documentation/admin-guide/kernel-parameters.txt
@@ -1601,9 +1601,11 @@
 			the number of pages of hugepagesz to be allocated.
 			If this is the first HugeTLB parameter on the command
 			line, it specifies the number of pages to allocate for
-			the default huge page size.  See also
-			Documentation/admin-guide/mm/hugetlbpage.rst.
-			Format: <integer>
+			the default huge page size. If using node format, the
+			number of pages to allocate per-node can be specified.
+			See also Documentation/admin-guide/mm/hugetlbpage.rst.
+			Format: <integer> or (node format)
+				<node>:<integer>[,<node>:<integer>]
 
 	hugepagesz=
 			[HW] The size of the HugeTLB pages.  This is used in
--- a/Documentation/admin-guide/mm/hugetlbpage.rst~hugetlbfs-extend-the-definition-of-hugepages-parameter-to-support-node-allocation
+++ a/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -128,7 +128,9 @@ hugepages
 	implicitly specifies the number of huge pages of default size to
 	allocate.  If the number of huge pages of default size is implicitly
 	specified, it can not be overwritten by a hugepagesz,hugepages
-	parameter pair for the default size.
+	parameter pair for the default size.  This parameter also has a
+	node format.  The node format specifies the number of huge pages
+	to allocate on specific nodes.
 
 	For example, on an architecture with 2M default huge page size::
 
@@ -138,6 +140,14 @@ hugepages
 	indicating that the hugepages=512 parameter is ignored.  If a hugepages
 	parameter is preceded by an invalid hugepagesz parameter, it will
 	be ignored.
+
+	Node format example::
+
+		hugepagesz=2M hugepages=0:1,1:2
+
+	It will allocate 1 2M hugepage on node0 and 2 2M hugepages on node1.
+	If the node number is invalid,  the parameter will be ignored.
+
 default_hugepagesz
 	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
--- a/include/linux/hugetlb.h~hugetlbfs-extend-the-definition-of-hugepages-parameter-to-support-node-allocation
+++ a/include/linux/hugetlb.h
@@ -615,6 +615,7 @@ struct hstate {
 	unsigned long nr_overcommit_huge_pages;
 	struct list_head hugepage_activelist;
 	struct list_head hugepage_freelists[MAX_NUMNODES];
+	unsigned int max_huge_pages_node[MAX_NUMNODES];
 	unsigned int nr_huge_pages_node[MAX_NUMNODES];
 	unsigned int free_huge_pages_node[MAX_NUMNODES];
 	unsigned int surplus_huge_pages_node[MAX_NUMNODES];
@@ -647,8 +648,9 @@ void restore_reserve_on_error(struct hst
 				unsigned long address, struct page *page);
 
 /* arch callback */
-int __init __alloc_bootmem_huge_page(struct hstate *h);
-int __init alloc_bootmem_huge_page(struct hstate *h);
+int __init __alloc_bootmem_huge_page(struct hstate *h, int nid);
+int __init alloc_bootmem_huge_page(struct hstate *h, int nid);
+bool __init hugetlb_node_alloc_supported(void);
 
 void __init hugetlb_add_hstate(unsigned order);
 bool __init arch_hugetlb_valid_size(unsigned long size);
--- a/mm/hugetlb.c~hugetlbfs-extend-the-definition-of-hugepages-parameter-to-support-node-allocation
+++ a/mm/hugetlb.c
@@ -77,6 +77,7 @@ static struct hstate * __initdata parsed
 static unsigned long __initdata default_hstate_max_huge_pages;
 static bool __initdata parsed_valid_hugepagesz = true;
 static bool __initdata parsed_default_hugepagesz;
+static unsigned int default_hugepages_in_node[MAX_NUMNODES] __initdata;
 
 /*
  * Protects updates to hugepage_freelists, hugepage_activelist, nr_huge_pages,
@@ -2963,33 +2964,39 @@ out_subpool_put:
 	return ERR_PTR(-ENOSPC);
 }
 
-int alloc_bootmem_huge_page(struct hstate *h)
+int alloc_bootmem_huge_page(struct hstate *h, int nid)
 	__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
-int __alloc_bootmem_huge_page(struct hstate *h)
+int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 {
-	struct huge_bootmem_page *m;
+	struct huge_bootmem_page *m = NULL; /* initialize for clang */
 	int nr_nodes, node;
 
+	if (nid >= nr_online_nodes)
+		return 0;
+	/* do node specific alloc */
+	if (nid != NUMA_NO_NODE) {
+		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
+				0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
+		if (!m)
+			return 0;
+		goto found;
+	}
+	/* allocate from next node when distributing huge pages */
 	for_each_node_mask_to_alloc(h, nr_nodes, node, &node_states[N_MEMORY]) {
-		void *addr;
-
-		addr = memblock_alloc_try_nid_raw(
+		m = memblock_alloc_try_nid_raw(
 				huge_page_size(h), huge_page_size(h),
 				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
-		if (addr) {
-			/*
-			 * Use the beginning of the huge page to store the
-			 * huge_bootmem_page struct (until gather_bootmem
-			 * puts them into the mem_map).
-			 */
-			m = addr;
-			goto found;
-		}
+		/*
+		 * Use the beginning of the huge page to store the
+		 * huge_bootmem_page struct (until gather_bootmem
+		 * puts them into the mem_map).
+		 */
+		if (!m)
+			return 0;
+		goto found;
 	}
-	return 0;
 
 found:
-	BUG_ON(!IS_ALIGNED(virt_to_phys(m), huge_page_size(h)));
 	/* Put them into a private list first because mem_map is not up yet */
 	INIT_LIST_HEAD(&m->list);
 	list_add(&m->list, &huge_boot_pages);
@@ -3029,12 +3036,61 @@ static void __init gather_bootmem_preall
 		cond_resched();
 	}
 }
+static void __init hugetlb_hstate_alloc_pages_onenode(struct hstate *h, int nid)
+{
+	unsigned long i;
+	char buf[32];
+
+	for (i = 0; i < h->max_huge_pages_node[nid]; ++i) {
+		if (hstate_is_gigantic(h)) {
+			if (!alloc_bootmem_huge_page(h, nid))
+				break;
+		} else {
+			struct page *page;
+			gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
+
+			page = alloc_fresh_huge_page(h, gfp_mask, nid,
+					&node_states[N_MEMORY], NULL);
+			if (!page)
+				break;
+			put_page(page); /* free it into the hugepage allocator */
+		}
+		cond_resched();
+	}
+	if (i == h->max_huge_pages_node[nid])
+		return;
+
+	string_get_size(huge_page_size(h), 1, STRING_UNITS_2, buf, 32);
+	pr_warn("HugeTLB: allocating %u of page size %s failed node%d.  Only allocated %lu hugepages.\n",
+		h->max_huge_pages_node[nid], buf, nid, i);
+	h->max_huge_pages -= (h->max_huge_pages_node[nid] - i);
+	h->max_huge_pages_node[nid] = i;
+}
 
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
 	nodemask_t *node_alloc_noretry;
+	bool node_specific_alloc = false;
+
+	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
+	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
+		pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
+		return;
+	}
+
+	/* do node specific alloc */
+	for (i = 0; i < nr_online_nodes; i++) {
+		if (h->max_huge_pages_node[i] > 0) {
+			hugetlb_hstate_alloc_pages_onenode(h, i);
+			node_specific_alloc = true;
+		}
+	}
 
+	if (node_specific_alloc)
+		return;
+
+	/* below will do all node balanced alloc */
 	if (!hstate_is_gigantic(h)) {
 		/*
 		 * Bit mask controlling how hard we retry per-node allocations.
@@ -3055,11 +3111,7 @@ static void __init hugetlb_hstate_alloc_
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
-			if (hugetlb_cma_size) {
-				pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
-				goto free;
-			}
-			if (!alloc_bootmem_huge_page(h))
+			if (!alloc_bootmem_huge_page(h, NUMA_NO_NODE))
 				break;
 		} else if (!alloc_pool_huge_page(h,
 					 &node_states[N_MEMORY],
@@ -3075,7 +3127,6 @@ static void __init hugetlb_hstate_alloc_
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
-free:
 	kfree(node_alloc_noretry);
 }
 
@@ -3990,6 +4041,10 @@ static int __init hugetlb_init(void)
 			}
 			default_hstate.max_huge_pages =
 				default_hstate_max_huge_pages;
+
+			for (i = 0; i < nr_online_nodes; i++)
+				default_hstate.max_huge_pages_node[i] =
+					default_hugepages_in_node[i];
 		}
 	}
 
@@ -4050,6 +4105,10 @@ void __init hugetlb_add_hstate(unsigned
 	parsed_hstate = h;
 }
 
+bool __init __weak hugetlb_node_alloc_supported(void)
+{
+	return true;
+}
 /*
  * hugepages command line processing
  * hugepages normally follows a valid hugepagsz or default_hugepagsz
@@ -4061,6 +4120,10 @@ static int __init hugepages_setup(char *
 {
 	unsigned long *mhp;
 	static unsigned long *last_mhp;
+	int node = NUMA_NO_NODE;
+	int count;
+	unsigned long tmp;
+	char *p = s;
 
 	if (!parsed_valid_hugepagesz) {
 		pr_warn("HugeTLB: hugepages=%s does not follow a valid hugepagesz, ignoring\n", s);
@@ -4084,8 +4147,40 @@ static int __init hugepages_setup(char *
 		return 0;
 	}
 
-	if (sscanf(s, "%lu", mhp) <= 0)
-		*mhp = 0;
+	while (*p) {
+		count = 0;
+		if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+			goto invalid;
+		/* Parameter is node format */
+		if (p[count] == ':') {
+			if (!hugetlb_node_alloc_supported()) {
+				pr_warn("HugeTLB: architecture can't support node specific alloc, ignoring!\n");
+				return 0;
+			}
+			node = tmp;
+			p += count + 1;
+			if (node < 0 || node >= nr_online_nodes)
+				goto invalid;
+			/* Parse hugepages */
+			if (sscanf(p, "%lu%n", &tmp, &count) != 1)
+				goto invalid;
+			if (!hugetlb_max_hstate)
+				default_hugepages_in_node[node] = tmp;
+			else
+				parsed_hstate->max_huge_pages_node[node] = tmp;
+			*mhp += tmp;
+			/* Go to parse next node*/
+			if (p[count] == ',')
+				p += count + 1;
+			else
+				break;
+		} else {
+			if (p != s)
+				goto invalid;
+			*mhp = tmp;
+			break;
+		}
+	}
 
 	/*
 	 * Global state is always initialized later in hugetlb_init.
@@ -4098,6 +4193,10 @@ static int __init hugepages_setup(char *
 	last_mhp = mhp;
 
 	return 1;
+
+invalid:
+	pr_warn("HugeTLB: Invalid hugepages parameter %s\n", p);
+	return 0;
 }
 __setup("hugepages=", hugepages_setup);
 
@@ -4159,6 +4258,7 @@ __setup("hugepagesz=", hugepagesz_setup)
 static int __init default_hugepagesz_setup(char *s)
 {
 	unsigned long size;
+	int i;
 
 	parsed_valid_hugepagesz = false;
 	if (parsed_default_hugepagesz) {
@@ -4187,6 +4287,9 @@ static int __init default_hugepagesz_set
 	 */
 	if (default_hstate_max_huge_pages) {
 		default_hstate.max_huge_pages = default_hstate_max_huge_pages;
+		for (i = 0; i < nr_online_nodes; i++)
+			default_hstate.max_huge_pages_node[i] =
+				default_hugepages_in_node[i];
 		if (hstate_is_gigantic(&default_hstate))
 			hugetlb_hstate_alloc_pages(&default_hstate);
 		default_hstate_max_huge_pages = 0;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 170/262] mm/migrate: de-duplicate migrate_reason strings
  2021-11-05 20:34 incoming Andrew Morton
                   ` (168 preceding siblings ...)
  2021-11-05 20:43 ` [patch 169/262] hugetlbfs: extend the definition of hugepages parameter to support node allocation Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 171/262] mm: migrate: make demotion knob depend on migration Andrew Morton
                   ` (91 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, jhubbard, linux-mm, mm-commits, o451686892, torvalds, ying.huang

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/migrate: de-duplicate migrate_reason strings

In order to remove the need to manually keep three different files in
sync, provide a common definition of the mapping between enum
migrate_reason and the associated string for each enum item.

1. Use the tracing system's mapping of enums to strings, by redefining
   and reusing the MIGRATE_REASON and supporting macros, and using that to
   populate the string array in mm/debug.c (a minimal sketch of this macro
   pattern follows after this list).

2. Move enum migrate_reason to migrate_mode.h.  This is not strictly
   necessary for this patch, but migrate mode and migrate reason go
   together, so this will slightly clarify things.
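
As background, the EM()/EMe() trick referenced in item 1 is the classic
X-macro pattern: a single entry list is expanded once to produce the enum
and once to produce the matching string array, so the two can never drift
apart.  A self-contained sketch of the idea (REASON_LIST, reason_names and
the trimmed entries are illustrative only, not the real trace header
definitions):

#include <stdio.h>

/* Single source of truth: one entry per reason (trimmed for the sketch). */
#define REASON_LIST			\
	EM(MR_COMPACTION, "compaction")	\
	EMe(MR_DEMOTION, "demotion")

/* Expansion 1: enum values. */
#define EM(a, b)	a,
#define EMe(a, b)	a
enum reason_sketch { REASON_LIST, NR_REASONS };
#undef EM
#undef EMe

/* Expansion 2: matching strings, impossible to get out of sync. */
#define EM(a, b)	b,
#define EMe(a, b)	b
static const char *reason_names[NR_REASONS] = { REASON_LIST };
#undef EM
#undef EMe

int main(void)
{
	printf("%d reasons, first is \"%s\"\n", NR_REASONS,
	       reason_names[MR_COMPACTION]);
	return 0;
}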

Link: https://lkml.kernel.org/r/20210922041755.141817-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Weizhao Ouyang <o451686892@gmail.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/migrate.h      |   19 +------------------
 include/linux/migrate_mode.h |   13 +++++++++++++
 mm/debug.c                   |   20 +++++++++++---------
 3 files changed, 25 insertions(+), 27 deletions(-)

--- a/include/linux/migrate.h~mm-migrate-de-duplicate-migrate_reason-strings
+++ a/include/linux/migrate.h
@@ -19,24 +19,7 @@ struct migration_target_control;
  */
 #define MIGRATEPAGE_SUCCESS		0
 
-/*
- * Keep sync with:
- * - macro MIGRATE_REASON in include/trace/events/migrate.h
- * - migrate_reason_names[MR_TYPES] in mm/debug.c
- */
-enum migrate_reason {
-	MR_COMPACTION,
-	MR_MEMORY_FAILURE,
-	MR_MEMORY_HOTPLUG,
-	MR_SYSCALL,		/* also applies to cpusets */
-	MR_MEMPOLICY_MBIND,
-	MR_NUMA_MISPLACED,
-	MR_CONTIG_RANGE,
-	MR_LONGTERM_PIN,
-	MR_DEMOTION,
-	MR_TYPES
-};
-
+/* Defined in mm/debug.c: */
 extern const char *migrate_reason_names[MR_TYPES];
 
 #ifdef CONFIG_MIGRATION
--- a/include/linux/migrate_mode.h~mm-migrate-de-duplicate-migrate_reason-strings
+++ a/include/linux/migrate_mode.h
@@ -19,4 +19,17 @@ enum migrate_mode {
 	MIGRATE_SYNC_NO_COPY,
 };
 
+enum migrate_reason {
+	MR_COMPACTION,
+	MR_MEMORY_FAILURE,
+	MR_MEMORY_HOTPLUG,
+	MR_SYSCALL,		/* also applies to cpusets */
+	MR_MEMPOLICY_MBIND,
+	MR_NUMA_MISPLACED,
+	MR_CONTIG_RANGE,
+	MR_LONGTERM_PIN,
+	MR_DEMOTION,
+	MR_TYPES
+};
+
 #endif		/* MIGRATE_MODE_H_INCLUDED */
--- a/mm/debug.c~mm-migrate-de-duplicate-migrate_reason-strings
+++ a/mm/debug.c
@@ -16,17 +16,19 @@
 #include <linux/ctype.h>
 
 #include "internal.h"
+#include <trace/events/migrate.h>
+
+/*
+ * Define EM() and EMe() so that MIGRATE_REASON from trace/events/migrate.h can
+ * be used to populate migrate_reason_names[].
+ */
+#undef EM
+#undef EMe
+#define EM(a, b)	b,
+#define EMe(a, b)	b
 
 const char *migrate_reason_names[MR_TYPES] = {
-	"compaction",
-	"memory_failure",
-	"memory_hotplug",
-	"syscall_or_cpuset",
-	"mempolicy_mbind",
-	"numa_misplaced",
-	"contig_range",
-	"longterm_pin",
-	"demotion",
+	MIGRATE_REASON
 };
 
 const struct trace_print_flags pageflag_names[] = {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 171/262] mm: migrate: make demotion knob depend on migration
  2021-11-05 20:34 incoming Andrew Morton
                   ` (169 preceding siblings ...)
  2021-11-05 20:43 ` [patch 170/262] mm/migrate: de-duplicate migrate_reason strings Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 172/262] selftests/vm/transhuge-stress: fix ram size thinko Andrew Morton
                   ` (90 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, dave.hansen, linux-mm, mm-commits, shy828301, torvalds, ying.huang

From: Yang Shi <shy828301@gmail.com>
Subject: mm: migrate: make demotion knob depend on migration

Memory demotion needs to call migrate_pages() to do the job, and it is
controlled by a knob; however, the knob doesn't depend on
CONFIG_MIGRATION.  The knob could be turned on even though MIGRATION is
disabled.  This will not cause any crash since migrate_pages() would just
return -ENOSYS, but it is definitely not optimal to go through the
demotion path and then retry regular swap every time.

And it doesn't make much sense to have the knob visible to users when
!MIGRATION.  Move the related code from mempolicy.[h|c] to migrate.[h|c].
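
For completeness, a hedged userspace sketch of toggling the knob this
patch moves around.  The path assumes the sysfs layout created by
numa_init_sysfs() in the diff below (a "numa" kobject under /sys/kernel/mm
with a "demotion_enabled" attribute); error handling is minimal:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

	if (!f) {
		/* knob not present, e.g. kernel built without CONFIG_MIGRATION */
		perror("fopen");
		return 1;
	}
	fputs("true\n", f);	/* the store also accepts "false", "1" and "0" */
	fclose(f);
	return 0;
}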

Link: https://lkml.kernel.org/r/20211015005559.246709-1-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mempolicy.h |    4 --
 include/linux/migrate.h   |    4 ++
 mm/mempolicy.c            |   61 ------------------------------------
 mm/migrate.c              |   61 ++++++++++++++++++++++++++++++++++++
 4 files changed, 65 insertions(+), 65 deletions(-)

--- a/include/linux/mempolicy.h~mm-migrate-make-demotion-knob-depend-on-migration
+++ a/include/linux/mempolicy.h
@@ -183,8 +183,6 @@ extern bool vma_migratable(struct vm_are
 extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
 extern void mpol_put_task_policy(struct task_struct *);
 
-extern bool numa_demotion_enabled;
-
 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 {
 	return  (pol->mode == MPOL_PREFERRED_MANY);
@@ -300,8 +298,6 @@ static inline nodemask_t *policy_nodemas
 	return NULL;
 }
 
-#define numa_demotion_enabled	false
-
 static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 {
 	return  false;
--- a/include/linux/migrate.h~mm-migrate-make-demotion-knob-depend-on-migration
+++ a/include/linux/migrate.h
@@ -40,6 +40,8 @@ extern int migrate_huge_page_move_mappin
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+
+extern bool numa_demotion_enabled;
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -65,6 +67,8 @@ static inline int migrate_huge_page_move
 {
 	return -ENOSYS;
 }
+
+#define numa_demotion_enabled	false
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
--- a/mm/mempolicy.c~mm-migrate-make-demotion-knob-depend-on-migration
+++ a/mm/mempolicy.c
@@ -3057,64 +3057,3 @@ void mpol_to_str(char *buffer, int maxle
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
-
-bool numa_demotion_enabled = false;
-
-#ifdef CONFIG_SYSFS
-static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
-					  struct kobj_attribute *attr, char *buf)
-{
-	return sysfs_emit(buf, "%s\n",
-			  numa_demotion_enabled? "true" : "false");
-}
-
-static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
-					   struct kobj_attribute *attr,
-					   const char *buf, size_t count)
-{
-	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
-		numa_demotion_enabled = true;
-	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
-		numa_demotion_enabled = false;
-	else
-		return -EINVAL;
-
-	return count;
-}
-
-static struct kobj_attribute numa_demotion_enabled_attr =
-	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
-	       numa_demotion_enabled_store);
-
-static struct attribute *numa_attrs[] = {
-	&numa_demotion_enabled_attr.attr,
-	NULL,
-};
-
-static const struct attribute_group numa_attr_group = {
-	.attrs = numa_attrs,
-};
-
-static int __init numa_init_sysfs(void)
-{
-	int err;
-	struct kobject *numa_kobj;
-
-	numa_kobj = kobject_create_and_add("numa", mm_kobj);
-	if (!numa_kobj) {
-		pr_err("failed to create numa kobject\n");
-		return -ENOMEM;
-	}
-	err = sysfs_create_group(numa_kobj, &numa_attr_group);
-	if (err) {
-		pr_err("failed to register numa group\n");
-		goto delete_obj;
-	}
-	return 0;
-
-delete_obj:
-	kobject_put(numa_kobj);
-	return err;
-}
-subsys_initcall(numa_init_sysfs);
-#endif
--- a/mm/migrate.c~mm-migrate-make-demotion-knob-depend-on-migration
+++ a/mm/migrate.c
@@ -3306,3 +3306,64 @@ static int __init migrate_on_reclaim_ini
 }
 late_initcall(migrate_on_reclaim_init);
 #endif /* CONFIG_HOTPLUG_CPU */
+
+bool numa_demotion_enabled = false;
+
+#ifdef CONFIG_SYSFS
+static ssize_t numa_demotion_enabled_show(struct kobject *kobj,
+					  struct kobj_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%s\n",
+			  numa_demotion_enabled ? "true" : "false");
+}
+
+static ssize_t numa_demotion_enabled_store(struct kobject *kobj,
+					   struct kobj_attribute *attr,
+					   const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		numa_demotion_enabled = true;
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		numa_demotion_enabled = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+
+static struct kobj_attribute numa_demotion_enabled_attr =
+	__ATTR(demotion_enabled, 0644, numa_demotion_enabled_show,
+	       numa_demotion_enabled_store);
+
+static struct attribute *numa_attrs[] = {
+	&numa_demotion_enabled_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group numa_attr_group = {
+	.attrs = numa_attrs,
+};
+
+static int __init numa_init_sysfs(void)
+{
+	int err;
+	struct kobject *numa_kobj;
+
+	numa_kobj = kobject_create_and_add("numa", mm_kobj);
+	if (!numa_kobj) {
+		pr_err("failed to create numa kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(numa_kobj, &numa_attr_group);
+	if (err) {
+		pr_err("failed to register numa group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(numa_kobj);
+	return err;
+}
+subsys_initcall(numa_init_sysfs);
+#endif
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 172/262] selftests/vm/transhuge-stress: fix ram size thinko
  2021-11-05 20:34 incoming Andrew Morton
                   ` (170 preceding siblings ...)
  2021-11-05 20:43 ` [patch 171/262] mm: migrate: make demotion knob depend on migration Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 173/262] mm, thp: lock filemap when truncating page cache Andrew Morton
                   ` (89 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, davis.george, erosca, koct9i, linux-mm, mm-commits, skhan,
	torvalds

From: "George G. Davis" <davis.george@siemens.com>
Subject: selftests/vm/transhuge-stress: fix ram size thinko

When executing transhuge-stress with an argument to specify the virtual
memory size for testing, the ram size is reported as 0, e.g.

transhuge-stress 384
thp-mmap: allocate 192 transhuge pages, using 384 MiB virtual memory and 0 MiB of ram
thp-mmap: 0.184 s/loop, 0.957 ms/page,   2090.265 MiB/s  192 succeed,    0 failed

This appears to be due to a thinko in commit 0085d61fe05e
("selftests/vm/transhuge-stress: stress test for memory compaction"),
where, at a guess, the intent was to base "xyz MiB of ram" on `ram` size. 
Here are results after using `ram` size:

thp-mmap: allocate 192 transhuge pages, using 384 MiB virtual memory and 14 MiB of ram

Link: https://lkml.kernel.org/r/20210825135843.29052-1-george_davis@mentor.com
Fixes: 0085d61fe05e ("selftests/vm/transhuge-stress: stress test for memory compaction")
Signed-off-by: George G. Davis <davis.george@siemens.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Eugeniu Rosca <erosca@de.adit-jv.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/transhuge-stress.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/transhuge-stress.c~selftests-vm-transhuge-stress-fix-ram-size-thinko
+++ a/tools/testing/selftests/vm/transhuge-stress.c
@@ -79,7 +79,7 @@ int main(int argc, char **argv)
 
 	warnx("allocate %zd transhuge pages, using %zd MiB virtual memory"
 	      " and %zd MiB of ram", len >> HPAGE_SHIFT, len >> 20,
-	      len >> (20 + HPAGE_SHIFT - PAGE_SHIFT - 1));
+	      ram >> (20 + HPAGE_SHIFT - PAGE_SHIFT - 1));
 
 	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
 	if (pagemap_fd < 0)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 173/262] mm, thp: lock filemap when truncating page cache
  2021-11-05 20:34 incoming Andrew Morton
                   ` (171 preceding siblings ...)
  2021-11-05 20:43 ` [patch 172/262] selftests/vm/transhuge-stress: fix ram size thinko Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 174/262] mm, thp: fix incorrect unmap behavior for private pages Andrew Morton
                   ` (88 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, cfijalkovich, hughd, linux-mm, mike.kravetz, mm-commits,
	rongwei.wang, shy828301, song, stable, torvalds,
	william.kucharski, willy, xuyu

From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Subject: mm, thp: lock filemap when truncating page cache

Patch series "fix two bugs for file THP".


This patch (of 2):

Transparent huge pages have supported read-only non-shmem files.  The
file-backed THP is collapsed by khugepaged and truncated when the file is
written to (for shared libraries).

However, there is a race when multiple writers truncate the same page
cache concurrently.

In that case, subpage(s) of file THP can be revealed by find_get_entry in
truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows.

page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
page dumped because: VM_BUG_ON_PAGE(PageTail(page))
------------[ cut here ]------------
kernel BUG at mm/truncate.c:213!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
Hardware name: ECS, BIOS 0.0.0 02/06/2015
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : truncate_inode_page+0x64/0x70
lr : truncate_inode_page+0x64/0x70
sp : ffff80001b60b900
x29: ffff80001b60b900 x28: 00000000000007ff
x27: ffff80001b60b9a0 x26: 0000000000000000
x25: 000000000000000f x24: ffff80001b60b9a0
x23: ffff80001b60ba18 x22: ffff0001e0999ea8
x21: ffff0000c21db300 x20: ffffffffffffffff
x19: fffffe001310ffc0 x18: 0000000000000020
x17: 0000000000000000 x16: 0000000000000000
x15: ffff0000c21db960 x14: 3030306666666620
x13: 6666666666666666 x12: 3130303030303030
x11: ffff8000117b69b8 x10: 00000000ffff8000
x9 : ffff80001012690c x8 : 0000000000000000
x7 : ffff8000114f69b8 x6 : 0000000000017ffd
x5 : ffff0007fffbcbc8 x4 : ffff80001b60b5c0
x3 : 0000000000000001 x2 : 0000000000000000
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
 truncate_inode_page+0x64/0x70
 truncate_inode_pages_range+0x550/0x7e4
 truncate_pagecache+0x58/0x80
 do_dentry_open+0x1e4/0x3c0
 vfs_open+0x38/0x44
 do_open+0x1f0/0x310
 path_openat+0x114/0x1dc
 do_filp_open+0x84/0x134
 do_sys_openat2+0xbc/0x164
 __arm64_sys_openat+0x74/0xc0
 el0_svc_common.constprop.0+0x88/0x220
 do_el0_svc+0x30/0xa0
 el0_svc+0x20/0x30
 el0_sync_handler+0x1a4/0x1b0
 el0_sync+0x180/0x1c0
Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
---[ end trace f70cdb42cb7c2d42 ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception

This patch mainly takes the filemap invalidate lock around
truncate_pagecache(), avoiding concurrent truncation of the same page
cache.

Link: https://lkml.kernel.org/r/20211025092134.18562-1-rongwei.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20211025092134.18562-2-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Song Liu <song@kernel.org>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/open.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/fs/open.c~mm-thp-lock-filemap-when-truncating-page-cache
+++ a/fs/open.c
@@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f
 		 * of THPs into the page cache will fail.
 		 */
 		smp_mb();
-		if (filemap_nr_thps(inode->i_mapping))
+		if (filemap_nr_thps(inode->i_mapping)) {
+			filemap_invalidate_lock(inode->i_mapping);
 			truncate_pagecache(inode, 0);
+			filemap_invalidate_unlock(inode->i_mapping);
+		}
 	}
 
 	return 0;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 174/262] mm, thp: fix incorrect unmap behavior for private pages
  2021-11-05 20:34 incoming Andrew Morton
                   ` (172 preceding siblings ...)
  2021-11-05 20:43 ` [patch 173/262] mm, thp: lock filemap when truncating page cache Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 175/262] mm/readahead.c: fix incorrect comments for get_init_ra_size Andrew Morton
                   ` (87 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, cfijalkovich, hughd, linux-mm, mike.kravetz, mm-commits,
	rongwei.wang, shy828301, song, stable, torvalds,
	william.kucharski, willy, xuyu

From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Subject: mm, thp: fix incorrect unmap behavior for private pages

When truncating the page cache on a file THP, the private pages of a
process should not be unmapped from the mapping.  This incorrect behavior
on dynamic shared libraries causes the related processes to core dump.

A simple test for a DSO (prerequisite: the DSO is already mapped as file THP):

#include <stdio.h>	/* perror() */
#include <fcntl.h>	/* open(), O_WRONLY */
#include <unistd.h>	/* close() */

int main(int argc, char *argv[])
{
	int fd;

	/* Opening the DSO for write triggers the page cache truncation. */
	fd = open(argv[1], O_WRONLY);
	if (fd < 0) {
		perror("open");
	}

	close(fd);
	return 0;
}

The test only opens a target DSO and does nothing else.  But this
operation will lead one or more processes that map the DSO to core dump.
This patch fixes that bug.

Link: https://lkml.kernel.org/r/20211025092134.18562-3-rongwei.wang@linux.alibaba.com
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Tested-by: Xu Yu <xuyu@linux.alibaba.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Song Liu <song@kernel.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Collin Fijalkovich <cfijalkovich@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/open.c |   11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

--- a/fs/open.c~mm-thp-fix-incorrect-unmap-behavior-for-private-pages
+++ a/fs/open.c
@@ -857,8 +857,17 @@ static int do_dentry_open(struct file *f
 		 */
 		smp_mb();
 		if (filemap_nr_thps(inode->i_mapping)) {
+			struct address_space *mapping = inode->i_mapping;
+
 			filemap_invalidate_lock(inode->i_mapping);
-			truncate_pagecache(inode, 0);
+			/*
+			 * unmap_mapping_range just need to be called once
+			 * here, because the private pages is not need to be
+			 * unmapped mapping (e.g. data segment of dynamic
+			 * shared libraries here).
+			 */
+			unmap_mapping_range(mapping, 0, 0, 0);
+			truncate_inode_pages(mapping, 0);
 			filemap_invalidate_unlock(inode->i_mapping);
 		}
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 175/262] mm/readahead.c: fix incorrect comments for get_init_ra_size
  2021-11-05 20:34 incoming Andrew Morton
                   ` (173 preceding siblings ...)
  2021-11-05 20:43 ` [patch 174/262] mm, thp: fix incorrect unmap behavior for private pages Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 176/262] mm: nommu: kill arch_get_unmapped_area() Andrew Morton
                   ` (86 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, linf, linux-mm, mm-commits, torvalds

From: Lin Feng <linf@wangsu.com>
Subject: mm/readahead.c: fix incorrect comments for get_init_ra_size

In fact, the values returned by get_init_ra_size are not as intuitive as
the old comment suggests.  This patch makes the comment reflect what the
code actually does.

Link: https://lkml.kernel.org/r/20211019104812.135602-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/readahead.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/readahead.c~mm-readaheadc-fix-incorrect-comments-for-get_init_ra_size
+++ a/mm/readahead.c
@@ -309,7 +309,7 @@ void force_page_cache_ra(struct readahea
  * Set the initial window size, round to next power of 2 and square
  * for small size, x 4 for medium, and x 2 for large
  * for 128k (32 page) max ra
- * 1-8 page = 32k initial, > 8 page = 128k initial
+ * 1-2 page = 16k, 3-4 page 32k, 5-8 page = 64k, > 8 page = 128k initial
  */
 static unsigned long get_init_ra_size(unsigned long size, unsigned long max)
 {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 176/262] mm: nommu: kill arch_get_unmapped_area()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (174 preceding siblings ...)
  2021-11-05 20:43 ` [patch 175/262] mm/readahead.c: fix incorrect comments for get_init_ra_size Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 177/262] selftest/vm: fix ksm selftest to run with different NUMA topologies Andrew Morton
                   ` (85 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, wangkefeng.wang

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: mm: nommu: kill arch_get_unmapped_area()

On nommu, arch_get_unmapped_area() is never called, so just kill it.

Link: https://lkml.kernel.org/r/20210910061906.36299-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/nommu.c |    6 ------
 1 file changed, 6 deletions(-)

--- a/mm/nommu.c~mm-nommu-kill-arch_get_unmapped_area
+++ a/mm/nommu.c
@@ -1639,12 +1639,6 @@ int remap_vmalloc_range(struct vm_area_s
 }
 EXPORT_SYMBOL(remap_vmalloc_range);
 
-unsigned long arch_get_unmapped_area(struct file *file, unsigned long addr,
-	unsigned long len, unsigned long pgoff, unsigned long flags)
-{
-	return -ENOMEM;
-}
-
 vm_fault_t filemap_fault(struct vm_fault *vmf)
 {
 	BUG();
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 177/262] selftest/vm: fix ksm selftest to run with different NUMA topologies
  2021-11-05 20:34 incoming Andrew Morton
                   ` (175 preceding siblings ...)
  2021-11-05 20:43 ` [patch 176/262] mm: nommu: kill arch_get_unmapped_area() Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 178/262] selftests: vm: add KSM huge pages merging time test Andrew Morton
                   ` (84 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, aneesh.kumar, hughd, linux-mm, mm-commits, pasha.tatashin,
	shuah, torvalds, tyhicks, zhansayabagdaulet

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Subject: selftest/vm: fix ksm selftest to run with different NUMA topologies

Platforms can have non-contiguous NUMA nodes like below

 #numactl  -H
available: 2 nodes (0,8)
.....
node distances:
node   0   8
  0:  10  40
  8:  40  10

 #numactl  -H
available: 1 nodes (1)
....
node distances:
node   1
  1:  10

Hence update the test to not assume the presence of nodes 0 and 1, and
use numa_num_configured_nodes() instead of numa_max_node() to decide
whether to skip the test.

Link: https://lkml.kernel.org/r/20210914141414.350759-1-aneesh.kumar@linux.ibm.com
Fixes: 82e717ad3501 ("selftests: vm: add KSM merging across nodes test")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Cc: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Tyler Hicks <tyhicks@linux.microsoft.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/ksm_tests.c |   29 ++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

--- a/tools/testing/selftests/vm/ksm_tests.c~selftest-vm-fix-ksm-selftest-to-run-with-different-numa-topologies
+++ a/tools/testing/selftests/vm/ksm_tests.c
@@ -354,12 +354,34 @@ err_out:
 	return KSFT_FAIL;
 }
 
+static int get_next_mem_node(int node)
+{
+
+	long node_size;
+	int mem_node = 0;
+	int i, max_node = numa_max_node();
+
+	for (i = node + 1; i <= max_node + node; i++) {
+		mem_node = i % (max_node + 1);
+		node_size = numa_node_size(mem_node, NULL);
+		if (node_size > 0)
+			break;
+	}
+	return mem_node;
+}
+
+static int get_first_mem_node(void)
+{
+	return get_next_mem_node(numa_max_node());
+}
+
 static int check_ksm_numa_merge(int mapping, int prot, int timeout, bool merge_across_nodes,
 				size_t page_size)
 {
 	void *numa1_map_ptr, *numa2_map_ptr;
 	struct timespec start_time;
 	int page_count = 2;
+	int first_node;
 
 	if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) {
 		perror("clock_gettime");
@@ -370,7 +392,7 @@ static int check_ksm_numa_merge(int mapp
 		perror("NUMA support not enabled");
 		return KSFT_SKIP;
 	}
-	if (numa_max_node() < 1) {
+	if (numa_num_configured_nodes() <= 1) {
 		printf("At least 2 NUMA nodes must be available\n");
 		return KSFT_SKIP;
 	}
@@ -378,8 +400,9 @@ static int check_ksm_numa_merge(int mapp
 		return KSFT_FAIL;
 
 	/* allocate 2 pages in 2 different NUMA nodes and fill them with the same data */
-	numa1_map_ptr = numa_alloc_onnode(page_size, 0);
-	numa2_map_ptr = numa_alloc_onnode(page_size, 1);
+	first_node = get_first_mem_node();
+	numa1_map_ptr = numa_alloc_onnode(page_size, first_node);
+	numa2_map_ptr = numa_alloc_onnode(page_size, get_next_mem_node(first_node));
 	if (!numa1_map_ptr || !numa2_map_ptr) {
 		perror("numa_alloc_onnode");
 		return KSFT_FAIL;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 178/262] selftests: vm: add KSM huge pages merging time test
  2021-11-05 20:34 incoming Andrew Morton
                   ` (176 preceding siblings ...)
  2021-11-05 20:43 ` [patch 177/262] selftest/vm: fix ksm selftest to run with different NUMA topologies Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:43 ` [patch 179/262] mm/vmstat: annotate data race for zone->free_area[order].nr_free Andrew Morton
                   ` (83 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, pedrodemargomes, torvalds, zhansayabagdaulet

From: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Subject: selftests: vm: add KSM huge pages merging time test

Add a test case measuring KSM merging time when the memory area is backed
mostly by huge pages.

Link: https://lkml.kernel.org/r/20211013044045.360251-1-pedrodemargomes@gmail.com
Signed-off-by: Pedro Demarchi Gomes <pedrodemargomes@gmail.com>
Cc: Zhansaya Bagdauletkyzy <zhansayabagdaulet@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/ksm_tests.c |  125 ++++++++++++++++++++++-
 1 file changed, 124 insertions(+), 1 deletion(-)

--- a/tools/testing/selftests/vm/ksm_tests.c~selftests-vm-add-ksm-huge-pages-merging-time-test
+++ a/tools/testing/selftests/vm/ksm_tests.c
@@ -5,6 +5,10 @@
 #include <time.h>
 #include <string.h>
 #include <numa.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <stdint.h>
+#include <err.h>
 
 #include "../kselftest.h"
 #include "../../../../include/vdso/time64.h"
@@ -18,6 +22,15 @@
 #define KSM_MERGE_ACROSS_NODES_DEFAULT true
 #define MB (1ul << 20)
 
+#define PAGE_SHIFT 12
+#define HPAGE_SHIFT 21
+
+#define PAGE_SIZE (1 << PAGE_SHIFT)
+#define HPAGE_SIZE (1 << HPAGE_SHIFT)
+
+#define PAGEMAP_PRESENT(ent)	(((ent) & (1ull << 63)) != 0)
+#define PAGEMAP_PFN(ent)	((ent) & ((1ull << 55) - 1))
+
 struct ksm_sysfs {
 	unsigned long max_page_sharing;
 	unsigned long merge_across_nodes;
@@ -34,6 +47,7 @@ enum ksm_test_name {
 	CHECK_KSM_ZERO_PAGE_MERGE,
 	CHECK_KSM_NUMA_MERGE,
 	KSM_MERGE_TIME,
+	KSM_MERGE_TIME_HUGE_PAGES,
 	KSM_COW_TIME
 };
 
@@ -100,6 +114,9 @@ static void print_help(void)
 	       " -P evaluate merging time and speed.\n"
 	       "    For this test, the size of duplicated memory area (in MiB)\n"
 	       "    must be provided using -s option\n"
+				 " -H evaluate merging time and speed of area allocated mostly with huge pages\n"
+	       "    For this test, the size of duplicated memory area (in MiB)\n"
+	       "    must be provided using -s option\n"
 	       " -C evaluate the time required to break COW of merged pages.\n\n");
 
 	printf(" -a: specify the access protections of pages.\n"
@@ -439,6 +456,101 @@ err_out:
 	return KSFT_FAIL;
 }
 
+int64_t allocate_transhuge(void *ptr, int pagemap_fd)
+{
+	uint64_t ent[2];
+
+	/* drop pmd */
+	if (mmap(ptr, HPAGE_SIZE, PROT_READ | PROT_WRITE,
+				MAP_FIXED | MAP_ANONYMOUS |
+				MAP_NORESERVE | MAP_PRIVATE, -1, 0) != ptr)
+		errx(2, "mmap transhuge");
+
+	if (madvise(ptr, HPAGE_SIZE, MADV_HUGEPAGE))
+		err(2, "MADV_HUGEPAGE");
+
+	/* allocate transparent huge page */
+	*(volatile void **)ptr = ptr;
+
+	if (pread(pagemap_fd, ent, sizeof(ent),
+			(uintptr_t)ptr >> (PAGE_SHIFT - 3)) != sizeof(ent))
+		err(2, "read pagemap");
+
+	if (PAGEMAP_PRESENT(ent[0]) && PAGEMAP_PRESENT(ent[1]) &&
+	    PAGEMAP_PFN(ent[0]) + 1 == PAGEMAP_PFN(ent[1]) &&
+	    !(PAGEMAP_PFN(ent[0]) & ((1 << (HPAGE_SHIFT - PAGE_SHIFT)) - 1)))
+		return PAGEMAP_PFN(ent[0]);
+
+	return -1;
+}
+
+static int ksm_merge_hugepages_time(int mapping, int prot, int timeout, size_t map_size)
+{
+	void *map_ptr, *map_ptr_orig;
+	struct timespec start_time, end_time;
+	unsigned long scan_time_ns;
+	int pagemap_fd, n_normal_pages, n_huge_pages;
+
+	map_size *= MB;
+	size_t len = map_size;
+
+	len -= len % HPAGE_SIZE;
+	map_ptr_orig = mmap(NULL, len + HPAGE_SIZE, PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE, -1, 0);
+	map_ptr = map_ptr_orig + HPAGE_SIZE - (uintptr_t)map_ptr_orig % HPAGE_SIZE;
+
+	if (map_ptr_orig == MAP_FAILED)
+		err(2, "initial mmap");
+
+	if (madvise(map_ptr, len + HPAGE_SIZE, MADV_HUGEPAGE))
+		err(2, "MADV_HUGEPAGE");
+
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd < 0)
+		err(2, "open pagemap");
+
+	n_normal_pages = 0;
+	n_huge_pages = 0;
+	for (void *p = map_ptr; p < map_ptr + len; p += HPAGE_SIZE) {
+		if (allocate_transhuge(p, pagemap_fd) < 0)
+			n_normal_pages++;
+		else
+			n_huge_pages++;
+	}
+	printf("Number of normal pages:    %d\n", n_normal_pages);
+	printf("Number of huge pages:    %d\n", n_huge_pages);
+
+	memset(map_ptr, '*', len);
+
+	if (clock_gettime(CLOCK_MONOTONIC_RAW, &start_time)) {
+		perror("clock_gettime");
+		goto err_out;
+	}
+	if (ksm_merge_pages(map_ptr, map_size, start_time, timeout))
+		goto err_out;
+	if (clock_gettime(CLOCK_MONOTONIC_RAW, &end_time)) {
+		perror("clock_gettime");
+		goto err_out;
+	}
+
+	scan_time_ns = (end_time.tv_sec - start_time.tv_sec) * NSEC_PER_SEC +
+		       (end_time.tv_nsec - start_time.tv_nsec);
+
+	printf("Total size:    %lu MiB\n", map_size / MB);
+	printf("Total time:    %ld.%09ld s\n", scan_time_ns / NSEC_PER_SEC,
+	       scan_time_ns % NSEC_PER_SEC);
+	printf("Average speed:  %.3f MiB/s\n", (map_size / MB) /
+					       ((double)scan_time_ns / NSEC_PER_SEC));
+
+	munmap(map_ptr_orig, len + HPAGE_SIZE);
+	return KSFT_PASS;
+
+err_out:
+	printf("Not OK\n");
+	munmap(map_ptr_orig, len + HPAGE_SIZE);
+	return KSFT_FAIL;
+}
+
 static int ksm_merge_time(int mapping, int prot, int timeout, size_t map_size)
 {
 	void *map_ptr;
@@ -564,7 +676,7 @@ int main(int argc, char *argv[])
 	bool merge_across_nodes = KSM_MERGE_ACROSS_NODES_DEFAULT;
 	long size_MB = 0;
 
-	while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPC")) != -1) {
+	while ((opt = getopt(argc, argv, "ha:p:l:z:m:s:MUZNPCH")) != -1) {
 		switch (opt) {
 		case 'a':
 			prot = str_to_prot(optarg);
@@ -618,6 +730,9 @@ int main(int argc, char *argv[])
 		case 'P':
 			test_name = KSM_MERGE_TIME;
 			break;
+		case 'H':
+			test_name = KSM_MERGE_TIME_HUGE_PAGES;
+			break;
 		case 'C':
 			test_name = KSM_COW_TIME;
 			break;
@@ -670,6 +785,14 @@ int main(int argc, char *argv[])
 		ret = ksm_merge_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
 				     size_MB);
 		break;
+	case KSM_MERGE_TIME_HUGE_PAGES:
+		if (size_MB == 0) {
+			printf("Option '-s' is required.\n");
+			return KSFT_FAIL;
+		}
+		ret = ksm_merge_hugepages_time(MAP_PRIVATE | MAP_ANONYMOUS, prot,
+				ksm_scan_limit_sec, size_MB);
+		break;
 	case KSM_COW_TIME:
 		ret = ksm_cow_time(MAP_PRIVATE | MAP_ANONYMOUS, prot, ksm_scan_limit_sec,
 				   page_size);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 179/262] mm/vmstat: annotate data race for zone->free_area[order].nr_free
  2021-11-05 20:34 incoming Andrew Morton
                   ` (177 preceding siblings ...)
  2021-11-05 20:43 ` [patch 178/262] selftests: vm: add KSM huge pages merging time test Andrew Morton
@ 2021-11-05 20:43 ` Andrew Morton
  2021-11-05 20:44 ` [patch 180/262] mm: vmstat.c: make extfrag_index show more pretty Andrew Morton
                   ` (82 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:43 UTC (permalink / raw)
  To: akpm, linux-mm, liushixin2, mm-commits, paulmck, torvalds

From: Liu Shixin <liushixin2@huawei.com>
Subject: mm/vmstat: annotate data race for zone->free_area[order].nr_free

KCSAN reports a data-race on v5.10 which also exists on mainline:

==================================================================
BUG: KCSAN: data-race in extfrag_for_order+0x33/0x2d0

race at unknown origin, with read to 0xffff9ee9bfffab48 of 8 bytes by task 34 on cpu 1:
 extfrag_for_order+0x33/0x2d0
 kcompactd+0x5f0/0xce0
 kthread+0x1f9/0x220
 ret_from_fork+0x22/0x30

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 34 Comm: kcompactd0 Not tainted 5.10.0+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
==================================================================

Access to zone->free_area[order].nr_free in extfrag_for_order()/frag_show_print()
is lockless. That's intentional and the stats are a rough estimate anyway.
Annotate them with data_race().

[liushixin2@huawei.com: add comments]
  Link: https://lkml.kernel.org/r/20210918084655.2696522-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20210908015606.3999871-1-liushixin2@huawei.com
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmstat.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

--- a/mm/vmstat.c~mm-vmstat-annotate-data-race-for-zone-free_areanr_free
+++ a/mm/vmstat.c
@@ -1070,8 +1070,13 @@ static void fill_contig_page_info(struct
 	for (order = 0; order < MAX_ORDER; order++) {
 		unsigned long blocks;
 
-		/* Count number of free blocks */
-		blocks = zone->free_area[order].nr_free;
+		/*
+		 * Count number of free blocks.
+		 *
+		 * Access to nr_free is lockless as nr_free is used only for
+		 * diagnostic purposes. Use data_race to avoid KCSAN warning.
+		 */
+		blocks = data_race(zone->free_area[order].nr_free);
 		info->free_blocks_total += blocks;
 
 		/* Count free base pages */
@@ -1446,7 +1451,11 @@ static void frag_show_print(struct seq_f
 
 	seq_printf(m, "Node %d, zone %8s ", pgdat->node_id, zone->name);
 	for (order = 0; order < MAX_ORDER; ++order)
-		seq_printf(m, "%6lu ", zone->free_area[order].nr_free);
+		/*
+		 * Access to nr_free is lockless as nr_free is used only for
+		 * printing purposes. Use data_race to avoid KCSAN warning.
+		 */
+		seq_printf(m, "%6lu ", data_race(zone->free_area[order].nr_free));
 	seq_putc(m, '\n');
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 180/262] mm: vmstat.c: make extfrag_index show more pretty
  2021-11-05 20:34 incoming Andrew Morton
                   ` (178 preceding siblings ...)
  2021-11-05 20:43 ` [patch 179/262] mm/vmstat: annotate data race for zone->free_area[order].nr_free Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 181/262] selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers Andrew Morton
                   ` (81 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, linf, linux-mm, mm-commits, torvalds

From: Lin Feng <linf@wangsu.com>
Subject: mm: vmstat.c: make extfrag_index show more pretty

fragmentation_index may return -1000, and the corresponding value
formatted by seq_printf carries a minus sign, while the positive formatted
values carry no sign, so the output becomes misaligned.

before:
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000 0.931 0.966 0.983 0.992 0.996 0.998 0.999

after this patch:
Node 0, zone      DMA -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone    DMA32 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
Node 0, zone   Normal -1.000 -1.000 -1.000 -1.000  0.931  0.966  0.983  0.992  0.996  0.998  0.999
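
A tiny standalone demonstration of the format-string change (illustrative
only, not kernel code): with "%d.%03d" the minus sign of the -1.000
entries widens the field, while "%2d.%03d" pads the positive entries so
the columns line up:

#include <stdio.h>

int main(void)
{
	int samples[] = { -1000, 931, 999 };	/* index values scaled by 1000 */
	int i;

	for (i = 0; i < 3; i++)			/* old format */
		printf("%d.%03d ", samples[i] / 1000, samples[i] % 1000);
	printf("  <- columns drift\n");

	for (i = 0; i < 3; i++)			/* new format */
		printf("%2d.%03d ", samples[i] / 1000, samples[i] % 1000);
	printf(" <- aligned\n");

	return 0;
}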

Link: https://lkml.kernel.org/r/20211019103241.134797-1-linf@wangsu.com
Signed-off-by: Lin Feng <linf@wangsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmstat.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmstat.c~mm-vmstatc-make-extfrag_index-show-more-pretty
+++ a/mm/vmstat.c
@@ -2191,7 +2191,7 @@ static void extfrag_show_print(struct se
 	for (order = 0; order < MAX_ORDER; ++order) {
 		fill_contig_page_info(zone, order, &info);
 		index = __fragmentation_index(order, &info);
-		seq_printf(m, "%d.%03d ", index / 1000, index % 1000);
+		seq_printf(m, "%2d.%03d ", index / 1000, index % 1000);
 	}
 
 	seq_putc(m, '\n');
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 181/262] selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers
  2021-11-05 20:34 incoming Andrew Morton
                   ` (179 preceding siblings ...)
  2021-11-05 20:44 ` [patch 180/262] mm: vmstat.c: make extfrag_index show more pretty Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 182/262] mm/memory_hotplug: add static qualifier for online_policy_to_str() Andrew Morton
                   ` (80 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, david, linux-mm, mm-commits, skhan, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers

The madv_populate selftest currently builds with a warning when the
locally installed headers (via the distribution) don't include
MADV_POPULATE_READ and MADV_POPULATE_WRITE.  The warning is correct,
because the test cannot locate the necessary header.

The reason is that the in-tree installed headers (usr/include) have a
"linux" instead of a "sys" subdirectory.

Including "linux/mman.h" instead of "sys/mman.h" doesn't work (e.g.,
mmap() and madvise() are not defined that way).  The only thing that seems
to work is including "linux/mman.h" in addition to "sys/mman.h".

We can get rid of our availability check and simplify.

Link: https://lkml.kernel.org/r/20211015165758.41374-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reported-by: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/madv_populate.c |   15 +--------------
 1 file changed, 1 insertion(+), 14 deletions(-)

--- a/tools/testing/selftests/vm/madv_populate.c~selftests-vm-make-madv_populate_readwrite-use-in-tree-headers
+++ a/tools/testing/selftests/vm/madv_populate.c
@@ -14,12 +14,11 @@
 #include <unistd.h>
 #include <errno.h>
 #include <fcntl.h>
+#include <linux/mman.h>
 #include <sys/mman.h>
 
 #include "../kselftest.h"
 
-#if defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE)
-
 /*
  * For now, we're using 2 MiB of private anonymous memory for all tests.
  */
@@ -328,15 +327,3 @@ int main(int argc, char **argv)
 				   err, ksft_test_num());
 	return ksft_exit_pass();
 }
-
-#else /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */
-
-#warning "missing MADV_POPULATE_READ or MADV_POPULATE_WRITE definition"
-
-int main(int argc, char **argv)
-{
-	ksft_print_header();
-	ksft_exit_skip("MADV_POPULATE_READ or MADV_POPULATE_WRITE not defined\n");
-}
-
-#endif /* defined(MADV_POPULATE_READ) && defined(MADV_POPULATE_WRITE) */
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 182/262] mm/memory_hotplug: add static qualifier for online_policy_to_str()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (180 preceding siblings ...)
  2021-11-05 20:44 ` [patch 181/262] selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 183/262] memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node" Andrew Morton
                   ` (79 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, david, linux-mm, mm-commits, songmuchun, tangyizhou, torvalds

From: Tang Yizhou <tangyizhou@huawei.com>
Subject: mm/memory_hotplug: add static qualifier for online_policy_to_str()

online_policy_to_str is only used in memory_hotplug.c and should be
defined as static.

Link: https://lkml.kernel.org/r/20210913024534.26161-1-tangyizhou@huawei.com
Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-add-static-qualifier-for-online_policy_to_str
+++ a/mm/memory_hotplug.c
@@ -57,7 +57,7 @@ enum {
 	ONLINE_POLICY_AUTO_MOVABLE,
 };
 
-const char *online_policy_to_str[] = {
+static const char * const online_policy_to_str[] = {
 	[ONLINE_POLICY_CONTIG_ZONES] = "contig-zones",
 	[ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable",
 };
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 183/262] memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node"
  2021-11-05 20:34 incoming Andrew Morton
                   ` (181 preceding siblings ...)
  2021-11-05 20:44 ` [patch 182/262] mm/memory_hotplug: add static qualifier for online_policy_to_str() Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 184/262] memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path Andrew Morton
                   ` (78 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, corbet, david, linux-mm, mhocko, mm-commits, osalvador,
	rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node"

Patch series "memory-hotplug.rst: document the "auto-movable" online policy".

Now that the memory-hotplug.rst overhaul is upstream, add proper
documentation for the "auto-movable" online policy, documenting all new
toggles and options.  Along with that, two fixes for the original
overhaul.


This patch (of 3):

We really want to refer to the "movable_node" kernel command line
parameter here.

Link: https://lkml.kernel.org/r/20210930144117.23641-2-david@redhat.com
Fixes: ac3332c44767 ("memory-hotplug.rst: complete admin-guide overhaul")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/memory-hotplug.rst |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-fix-two-instances-of-movablecore-that-should-be-movable_node
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -166,7 +166,7 @@ Or alternatively::
 	% echo 1 > /sys/devices/system/memory/memoryXXX/online
 
 The kernel will select the target zone automatically, usually defaulting to
-``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
+``ZONE_NORMAL`` unless ``movable_node`` has been specified on the kernel
 command line or if the memory block would intersect the ZONE_MOVABLE already.
 
 One can explicitly request to associate an offline memory block with
@@ -393,7 +393,7 @@ command line parameters are relevant:
 ======================== =======================================================
 ``memhp_default_state``	 configure auto-onlining by essentially setting
                          ``/sys/devices/system/memory/auto_online_blocks``.
-``movablecore``		 configure automatic zone selection of the kernel. When
+``movable_node``	 configure automatic zone selection in the kernel. When
 			 set, the kernel will default to ZONE_MOVABLE, unless
 			 other zones can be kept contiguous.
 ======================== =======================================================
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 184/262] memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path
  2021-11-05 20:34 incoming Andrew Morton
                   ` (182 preceding siblings ...)
  2021-11-05 20:44 ` [patch 183/262] memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node" Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 185/262] memory-hotplug.rst: document the "auto-movable" online policy Andrew Morton
                   ` (77 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, corbet, david, linux-mm, mhocko, mm-commits, osalvador,
	rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path

We accidentally added a superfluous "s".
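
For illustration only (not part of this patch), the corrected path can be
listed directly on a running kernel; which parameters show up depends on
the kernel configuration:

	# assumes CONFIG_MEMORY_HOTPLUG=y; the parameter list varies by config
	% ls /sys/module/memory_hotplug/parameters/
	% cat /sys/module/memory_hotplug/parameters/memmap_on_memory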

Link: https://lkml.kernel.org/r/20210930144117.23641-3-david@redhat.com
Fixes: ac3332c44767 ("memory-hotplug.rst: complete admin-guide overhaul")
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/memory-hotplug.rst |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-fix-wrong-sys-module-memory_hotplug-parameters-path
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -410,7 +410,7 @@ them with ``memory_hotplug.`` such as::
 
 and they can be observed (and some even modified at runtime) via::
 
-	/sys/modules/memory_hotplug/parameters/
+	/sys/module/memory_hotplug/parameters/
 
 The following module parameters are currently defined:
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 185/262] memory-hotplug.rst: document the "auto-movable" online policy
  2021-11-05 20:34 incoming Andrew Morton
                   ` (183 preceding siblings ...)
  2021-11-05 20:44 ` [patch 184/262] memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 186/262] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG Andrew Morton
                   ` (76 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, corbet, david, linux-mm, mhocko, mm-commits, osalvador,
	rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: memory-hotplug.rst: document the "auto-movable" online policy

In commit e83a437faa62 ("mm/memory_hotplug: introduce "auto-movable"
online policy") we introduced a new memory online policy to automatically
select a zone for memory blocks to be onlined.  We added a way to set the
active online policy and tunables for the auto-movable online policy.  In
follow-up commits we tweaked the "auto-movable" policy to also consider
memory device details when selecting zones for memory blocks to be
onlined.

Let's document the new toggles and how the two available online policies
work.
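
For illustration only (not part of this patch; paths and defaults are taken
from the documentation added below), the policy can be selected on the
kernel command line, e.g. "memory_hotplug.online_policy=auto-movable", or
inspected and changed at runtime via sysfs:

	# output shown assumes the documented defaults ("contig-zones", "301")
	% cat /sys/module/memory_hotplug/parameters/online_policy
	contig-zones
	% echo auto-movable > /sys/module/memory_hotplug/parameters/online_policy
	% cat /sys/module/memory_hotplug/parameters/auto_movable_ratio
	301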

[david@redhat.com: updates]
  Link: https://lkml.kernel.org/r/20211011082058.6076-4-david@redhat.com
Link: https://lkml.kernel.org/r/20210930144117.23641-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/memory-hotplug.rst |  141 ++++++++++++--
 1 file changed, 121 insertions(+), 20 deletions(-)

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-document-the-auto-movable-online-policy
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -165,9 +165,8 @@ Or alternatively::
 
 	% echo 1 > /sys/devices/system/memory/memoryXXX/online
 
-The kernel will select the target zone automatically, usually defaulting to
-``ZONE_NORMAL`` unless ``movable_node`` has been specified on the kernel
-command line or if the memory block would intersect the ZONE_MOVABLE already.
+The kernel will select the target zone automatically, depending on the
+configured ``online_policy``.
 
 One can explicitly request to associate an offline memory block with
 ZONE_MOVABLE by::
@@ -198,6 +197,9 @@ Auto-onlining can be enabled by writing
 
 	% echo online > /sys/devices/system/memory/auto_online_blocks
 
+Similarly to manual onlining, with ``online`` the kernel will select the
+target zone automatically, depending on the configured ``online_policy``.
+
 Modifying the auto-online behavior will only affect all subsequently added
 memory blocks only.
 
@@ -393,11 +395,16 @@ command line parameters are relevant:
 ======================== =======================================================
 ``memhp_default_state``	 configure auto-onlining by essentially setting
                          ``/sys/devices/system/memory/auto_online_blocks``.
-``movable_node``	 configure automatic zone selection in the kernel. When
-			 set, the kernel will default to ZONE_MOVABLE, unless
-			 other zones can be kept contiguous.
+``movable_node``	 configure automatic zone selection in the kernel when
+			 using the ``contig-zones`` online policy. When
+			 set, the kernel will default to ZONE_MOVABLE when
+			 onlining a memory block, unless other zones can be kept
+			 contiguous.
 ======================== =======================================================
 
+See Documentation/admin-guide/kernel-parameters.txt for a more generic
+description of these command line parameters.
+
 Module Parameters
 ------------------
 
@@ -414,20 +421,114 @@ and they can be observed (and some even
 
 The following module parameters are currently defined:
 
-======================== =======================================================
-``memmap_on_memory``	 read-write: Allocate memory for the memmap from the
-			 added memory block itself. Even if enabled, actual
-			 support depends on various other system properties and
-			 should only be regarded as a hint whether the behavior
-			 would be desired.
-
-			 While allocating the memmap from the memory block
-			 itself makes memory hotplug less likely to fail and
-			 keeps the memmap on the same NUMA node in any case, it
-			 can fragment physical memory in a way that huge pages
-			 in bigger granularity cannot be formed on hotplugged
-			 memory.
-======================== =======================================================
+================================ ===============================================
+``memmap_on_memory``		 read-write: Allocate memory for the memmap from
+				 the added memory block itself. Even if enabled,
+				 actual support depends on various other system
+				 properties and should only be regarded as a
+				 hint whether the behavior would be desired.
+
+				 While allocating the memmap from the memory
+				 block itself makes memory hotplug less likely
+				 to fail and keeps the memmap on the same NUMA
+				 node in any case, it can fragment physical
+				 memory in a way that huge pages in bigger
+				 granularity cannot be formed on hotplugged
+				 memory.
+``online_policy``		 read-write: Set the basic policy used for
+				 automatic zone selection when onlining memory
+				 blocks without specifying a target zone.
+				 ``contig-zones`` has been the kernel default
+				 before this parameter was added. After an
+				 online policy was configured and memory was
+				 online, the policy should not be changed
+				 anymore.
+
+				 When set to ``contig-zones``, the kernel will
+				 try keeping zones contiguous. If a memory block
+				 intersects multiple zones or no zone, the
+				 behavior depends on the ``movable_node`` kernel
+				 command line parameter: default to ZONE_MOVABLE
+				 if set, default to the applicable kernel zone
+				 (usually ZONE_NORMAL) if not set.
+
+				 When set to ``auto-movable``, the kernel will
+				 try onlining memory blocks to ZONE_MOVABLE if
+				 possible according to the configuration and
+				 memory device details. With this policy, one
+				 can avoid zone imbalances when eventually
+				 hotplugging a lot of memory later and still
+				 wanting to be able to hotunplug as much as
+				 possible reliably, very desirable in
+				 virtualized environments. This policy ignores
+				 the ``movable_node`` kernel command line
+				 parameter and isn't really applicable in
+				 environments that require it (e.g., bare metal
+				 with hotunpluggable nodes) where hotplugged
+				 memory might be exposed via the
+				 firmware-provided memory map early during boot
+				 to the system instead of getting detected,
+				 added and onlined  later during boot (such as
+				 done by virtio-mem or by some hypervisors
+				 implementing emulated DIMMs). As one example, a
+				 hotplugged DIMM will be onlined either
+				 completely to ZONE_MOVABLE or completely to
+				 ZONE_NORMAL, not a mixture.
+				 As another example, as many memory blocks
+				 belonging to a virtio-mem device will be
+				 onlined to ZONE_MOVABLE as possible,
+				 special-casing units of memory blocks that can
+				 only get hotunplugged together. *This policy
+				 does not protect from setups that are
+				 problematic with ZONE_MOVABLE and does not
+				 change the zone of memory blocks dynamically
+				 after they were onlined.*
+``auto_movable_ratio``		 read-write: Set the maximum MOVABLE:KERNEL
+				 memory ratio in % for the ``auto-movable``
+				 online policy. Whether the ratio applies only
+				 for the system across all NUMA nodes or also
+				 per NUMA nodes depends on the
+				 ``auto_movable_numa_aware`` configuration.
+
+				 All accounting is based on present memory pages
+				 in the zones combined with accounting per
+				 memory device. Memory dedicated to the CMA
+				 allocator is accounted as MOVABLE, although
+				 residing on one of the kernel zones. The
+				 possible ratio depends on the actual workload.
+				 The kernel default is "301" %, for example,
+				 allowing for hotplugging 24 GiB to a 8 GiB VM
+				 and automatically onlining all hotplugged
+				 memory to ZONE_MOVABLE in many setups. The
+				 additional 1% deals with some pages being not
+				 present, for example, because of some firmware
+				 allocations.
+
+				 Note that ZONE_NORMAL memory provided by one
+				 memory device does not allow for more
+				 ZONE_MOVABLE memory for a different memory
+				 device. As one example, onlining memory of a
+				 hotplugged DIMM to ZONE_NORMAL will not allow
+				 for another hotplugged DIMM to get onlined to
+				 ZONE_MOVABLE automatically. In contrast, memory
+				 hotplugged by a virtio-mem device that got
+				 onlined to ZONE_NORMAL will allow for more
+				 ZONE_MOVABLE memory within *the same*
+				 virtio-mem device.
+``auto_movable_numa_aware``	 read-write: Configure whether the
+				 ``auto_movable_ratio`` in the ``auto-movable``
+				 online policy also applies per NUMA
+				 node in addition to the whole system across all
+				 NUMA nodes. The kernel default is "Y".
+
+				 Disabling NUMA awareness can be helpful when
+				 dealing with NUMA nodes that should be
+				 completely hotunpluggable, onlining the memory
+				 completely to ZONE_MOVABLE automatically if
+				 possible.
+
+				 Parameter availability depends on CONFIG_NUMA.
+================================ ===============================================
 
 ZONE_MOVABLE
 ============
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 186/262] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG
  2021-11-05 20:34 incoming Andrew Morton
                   ` (184 preceding siblings ...)
  2021-11-05 20:44 ` [patch 185/262] memory-hotplug.rst: document the "auto-movable" online policy Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 187/262] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE Andrew Morton
                   ` (75 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, shuah, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG

Patch series "mm/memory_hotplug: Kconfig and 32 bit cleanups".

Some cleanups around CONFIG_MEMORY_HOTPLUG, including removing 32 bit
leftovers of memory hotplug support.


This patch (of 6):

SPARSEMEM is the only possible memory model for x86-64; FLATMEM is not
possible:
	config ARCH_FLATMEM_ENABLE
		def_bool y
		depends on X86_32 && !NUMA

And X86_64_ACPI_NUMA (obviously) only supports x86-64:
	config X86_64_ACPI_NUMA
		def_bool y
		depends on X86_64 && NUMA && ACPI && PCI

Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it no longer
makes sense.

Link: https://lkml.kernel.org/r/20210929143600.49379-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alex Shi <alexs@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/Kconfig~mm-memory_hotplug-remove-config_x86_64_acpi_numa-dependency-from-config_memory_hotplug
+++ a/mm/Kconfig
@@ -123,7 +123,7 @@ config ARCH_ENABLE_MEMORY_HOTPLUG
 config MEMORY_HOTPLUG
 	bool "Allow for memory hot-add"
 	select MEMORY_ISOLATION
-	depends on SPARSEMEM || X86_64_ACPI_NUMA
+	depends on SPARSEMEM
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 	depends on 64BIT || BROKEN
 	select NUMA_KEEP_MEMINFO if NUMA
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 187/262] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
  2021-11-05 20:34 incoming Andrew Morton
                   ` (185 preceding siblings ...)
  2021-11-05 20:44 ` [patch 186/262] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 188/262] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit Andrew Morton
                   ` (74 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, skhan, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE

CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for
CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use
CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE.

Link: https://lkml.kernel.org/r/20210929143600.49379-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Shuah Khan <skhan@linuxfoundation.org>	[kselftest]
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/include/asm/machdep.h            |    2 -
 arch/powerpc/kernel/setup_64.c                |    2 -
 arch/powerpc/platforms/powernv/setup.c        |    4 +-
 arch/powerpc/platforms/pseries/setup.c        |    2 -
 drivers/base/Makefile                         |    2 -
 drivers/base/node.c                           |    9 ++----
 drivers/virtio/Kconfig                        |    2 -
 include/linux/memory.h                        |   24 ++++++----------
 include/linux/node.h                          |    4 +-
 lib/Kconfig.debug                             |    2 -
 mm/Kconfig                                    |    4 --
 mm/memory_hotplug.c                           |    2 -
 tools/testing/selftests/memory-hotplug/config |    1 
 13 files changed, 24 insertions(+), 36 deletions(-)

--- a/arch/powerpc/include/asm/machdep.h~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/arch/powerpc/include/asm/machdep.h
@@ -32,7 +32,7 @@ struct machdep_calls {
 	void		(*iommu_save)(void);
 	void		(*iommu_restore)(void);
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	unsigned long	(*memory_block_size)(void);
 #endif
 #endif /* CONFIG_PPC64 */
--- a/arch/powerpc/kernel/setup_64.c~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/arch/powerpc/kernel/setup_64.c
@@ -912,7 +912,7 @@ void __init setup_per_cpu_areas(void)
 }
 #endif
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 unsigned long memory_block_size_bytes(void)
 {
 	if (ppc_md.memory_block_size)
--- a/arch/powerpc/platforms/powernv/setup.c~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/arch/powerpc/platforms/powernv/setup.c
@@ -440,7 +440,7 @@ static void pnv_kexec_cpu_down(int crash
 }
 #endif /* CONFIG_KEXEC_CORE */
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long pnv_memory_block_size(void)
 {
 	/*
@@ -553,7 +553,7 @@ define_machine(powernv) {
 #ifdef CONFIG_KEXEC_CORE
 	.kexec_cpu_down		= pnv_kexec_cpu_down,
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	.memory_block_size	= pnv_memory_block_size,
 #endif
 };
--- a/arch/powerpc/platforms/pseries/setup.c~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/arch/powerpc/platforms/pseries/setup.c
@@ -1089,7 +1089,7 @@ define_machine(pseries) {
 	.machine_kexec          = pSeries_machine_kexec,
 	.kexec_cpu_down         = pseries_kexec_cpu_down,
 #endif
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 	.memory_block_size	= pseries_memory_block_size,
 #endif
 };
--- a/drivers/base/Makefile~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/drivers/base/Makefile
@@ -13,7 +13,7 @@ obj-y			+= power/
 obj-$(CONFIG_ISA_BUS_API)	+= isa.o
 obj-y				+= firmware_loader/
 obj-$(CONFIG_NUMA)	+= node.o
-obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o
+obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o
 ifeq ($(CONFIG_SYSFS),y)
 obj-$(CONFIG_MODULES)	+= module.o
 endif
--- a/drivers/base/node.c~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/drivers/base/node.c
@@ -629,7 +629,7 @@ static void node_device_release(struct d
 {
 	struct node *node = to_node(dev);
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS)
 	/*
 	 * We schedule the work only when a memory section is
 	 * onlined/offlined on this node. When we come here,
@@ -782,7 +782,7 @@ int unregister_cpu_under_node(unsigned i
 	return 0;
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifdef CONFIG_MEMORY_HOTPLUG
 static int __ref get_nid_for_pfn(unsigned long pfn)
 {
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
@@ -958,10 +958,9 @@ static int node_memory_callback(struct n
 	return NOTIFY_OK;
 }
 #endif	/* CONFIG_HUGETLBFS */
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
+#endif /* CONFIG_MEMORY_HOTPLUG */
 
-#if !defined(CONFIG_MEMORY_HOTPLUG_SPARSE) || \
-    !defined(CONFIG_HUGETLBFS)
+#if !defined(CONFIG_MEMORY_HOTPLUG) || !defined(CONFIG_HUGETLBFS)
 static inline int node_memory_callback(struct notifier_block *self,
 				unsigned long action, void *arg)
 {
--- a/drivers/virtio/Kconfig~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/drivers/virtio/Kconfig
@@ -98,7 +98,7 @@ config VIRTIO_MEM
 	default m
 	depends on X86_64
 	depends on VIRTIO
-	depends on MEMORY_HOTPLUG_SPARSE
+	depends on MEMORY_HOTPLUG
 	depends on MEMORY_HOTREMOVE
 	depends on CONTIG_ALLOC
 	help
--- a/include/linux/memory.h~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/include/linux/memory.h
@@ -110,7 +110,7 @@ struct mem_section;
 #define SLAB_CALLBACK_PRI       1
 #define IPC_CALLBACK_PRI        10
 
-#ifndef CONFIG_MEMORY_HOTPLUG_SPARSE
+#ifndef CONFIG_MEMORY_HOTPLUG
 static inline void memory_dev_init(void)
 {
 	return;
@@ -126,7 +126,14 @@ static inline int memory_notify(unsigned
 {
 	return 0;
 }
-#else
+static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
+{
+	return 0;
+}
+/* These aren't inline functions due to a GCC bug. */
+#define register_hotmemory_notifier(nb)    ({ (void)(nb); 0; })
+#define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
+#else /* CONFIG_MEMORY_HOTPLUG */
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size,
@@ -148,9 +155,6 @@ struct memory_group *memory_group_find_b
 typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
 int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
 			       struct memory_group *excluded, void *arg);
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
-
-#ifdef CONFIG_MEMORY_HOTPLUG
 #define hotplug_memory_notifier(fn, pri) ({		\
 	static __meminitdata struct notifier_block fn##_mem_nb =\
 		{ .notifier_call = fn, .priority = pri };\
@@ -158,15 +162,7 @@ int walk_dynamic_memory_groups(int nid,
 })
 #define register_hotmemory_notifier(nb)		register_memory_notifier(nb)
 #define unregister_hotmemory_notifier(nb) 	unregister_memory_notifier(nb)
-#else
-static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
-{
-	return 0;
-}
-/* These aren't inline functions due to a GCC bug. */
-#define register_hotmemory_notifier(nb)    ({ (void)(nb); 0; })
-#define unregister_hotmemory_notifier(nb)  ({ (void)(nb); })
-#endif
+#endif	/* CONFIG_MEMORY_HOTPLUG */
 
 /*
  * Kernel text modification mutex, used for code patching. Users of this lock
--- a/include/linux/node.h~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/include/linux/node.h
@@ -85,7 +85,7 @@ struct node {
 	struct device	dev;
 	struct list_head access_list;
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS)
 	struct work_struct	node_work;
 #endif
 #ifdef CONFIG_HMEM_REPORTING
@@ -98,7 +98,7 @@ struct memory_block;
 extern struct node *node_devices[];
 typedef  void (*node_registration_func_t)(struct node *);
 
-#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA)
+#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_NUMA)
 void link_mem_sections(int nid, unsigned long start_pfn,
 		       unsigned long end_pfn,
 		       enum meminit_context context);
--- a/lib/Kconfig.debug~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/lib/Kconfig.debug
@@ -877,7 +877,7 @@ config DEBUG_MEMORY_INIT
 
 config MEMORY_NOTIFIER_ERROR_INJECT
 	tristate "Memory hotplug notifier error injection module"
-	depends on MEMORY_HOTPLUG_SPARSE && NOTIFIER_ERROR_INJECTION
+	depends on MEMORY_HOTPLUG && NOTIFIER_ERROR_INJECTION
 	help
 	  This option provides the ability to inject artificial errors to
 	  memory hotplug notifier chain callbacks.  It is controlled through
--- a/mm/Kconfig~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/mm/Kconfig
@@ -128,10 +128,6 @@ config MEMORY_HOTPLUG
 	depends on 64BIT || BROKEN
 	select NUMA_KEEP_MEMINFO if NUMA
 
-config MEMORY_HOTPLUG_SPARSE
-	def_bool y
-	depends on SPARSEMEM && MEMORY_HOTPLUG
-
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
 	bool "Online the newly added memory blocks by default"
 	depends on MEMORY_HOTPLUG
--- a/mm/memory_hotplug.c~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/mm/memory_hotplug.c
@@ -220,7 +220,6 @@ static void release_memory_resource(stru
 	kfree(res);
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static int check_pfn_span(unsigned long pfn, unsigned long nr_pages,
 		const char *reason)
 {
@@ -1163,7 +1162,6 @@ failed_addition:
 	mem_hotplug_done();
 	return ret;
 }
-#endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 static void reset_node_present_pages(pg_data_t *pgdat)
 {
--- a/tools/testing/selftests/memory-hotplug/config~mm-memory_hotplug-remove-config_memory_hotplug_sparse
+++ a/tools/testing/selftests/memory-hotplug/config
@@ -1,5 +1,4 @@
 CONFIG_MEMORY_HOTPLUG=y
-CONFIG_MEMORY_HOTPLUG_SPARSE=y
 CONFIG_NOTIFIER_ERROR_INJECTION=y
 CONFIG_MEMORY_NOTIFIER_ERROR_INJECT=m
 CONFIG_MEMORY_HOTREMOVE=y
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 188/262] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit
  2021-11-05 20:34 incoming Andrew Morton
                   ` (186 preceding siblings ...)
  2021-11-05 20:44 ` [patch 187/262] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 189/262] mm/memory_hotplug: remove HIGHMEM leftovers Andrew Morton
                   ` (73 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, shuah, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit

32 bit support is broken in various ways: for example, we can online
memory that should actually go to ZONE_HIGHMEM into ZONE_MOVABLE, or in
some cases even into one of the other kernel zones.

We marked it BROKEN in commit b59d02ed0869 ("mm/memory_hotplug: disable
the functionality for 32b") almost one year ago.  According to that commit,
it might have been broken since at least 2017.  Further, there is hardly a
sane use case nowadays.

Let's just depend completely on 64bit, dropping the "BROKEN" dependency to
make clear that we are not going to support it again.  Next, we'll remove
some HIGHMEM leftovers from memory hotplug code to clean up.

Link: https://lkml.kernel.org/r/20210929143600.49379-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/Kconfig~mm-memory_hotplug-restrict-config_memory_hotplug-to-64-bit
+++ a/mm/Kconfig
@@ -125,7 +125,7 @@ config MEMORY_HOTPLUG
 	select MEMORY_ISOLATION
 	depends on SPARSEMEM
 	depends on ARCH_ENABLE_MEMORY_HOTPLUG
-	depends on 64BIT || BROKEN
+	depends on 64BIT
 	select NUMA_KEEP_MEMINFO if NUMA
 
 config MEMORY_HOTPLUG_DEFAULT_ONLINE
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 189/262] mm/memory_hotplug: remove HIGHMEM leftovers
  2021-11-05 20:34 incoming Andrew Morton
                   ` (187 preceding siblings ...)
  2021-11-05 20:44 ` [patch 188/262] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 190/262] mm/memory_hotplug: remove stale function declarations Andrew Morton
                   ` (72 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, shuah, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove HIGHMEM leftovers

We don't support CONFIG_MEMORY_HOTPLUG on 32 bit and consequently not with
HIGHMEM.  Let's remove any leftover code -- including the unused
"status_change_nid_high" field that is part of the memory notifier.

Link: https://lkml.kernel.org/r/20210929143600.49379-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/core-api/memory-hotplug.rst                    |    3 
 Documentation/translations/zh_CN/core-api/memory-hotplug.rst |    4 -
 include/linux/memory.h                                       |    1 
 mm/memory_hotplug.c                                          |   36 ----------
 4 files changed, 2 insertions(+), 42 deletions(-)

--- a/Documentation/core-api/memory-hotplug.rst~mm-memory_hotplug-remove-highmem-leftovers
+++ a/Documentation/core-api/memory-hotplug.rst
@@ -57,7 +57,6 @@ The third argument (arg) passes a pointe
 		unsigned long start_pfn;
 		unsigned long nr_pages;
 		int status_change_nid_normal;
-		int status_change_nid_high;
 		int status_change_nid;
 	}
 
@@ -65,8 +64,6 @@ The third argument (arg) passes a pointe
 - nr_pages is # of pages of online/offline memory.
 - status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask
   is (will be) set/clear, if this is -1, then nodemask status is not changed.
-- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask
-  is (will be) set/clear, if this is -1, then nodemask status is not changed.
 - status_change_nid is set node id when N_MEMORY of nodemask is (will be)
   set/clear. It means a new(memoryless) node gets new memory by online and a
   node loses all memory. If this is -1, then nodemask status is not changed.
--- a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst~mm-memory_hotplug-remove-highmem-leftovers
+++ a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst
@@ -63,7 +63,6 @@ memory_notify结构体的指针::
 		unsigned long start_pfn;
 		unsigned long nr_pages;
 		int status_change_nid_normal;
-		int status_change_nid_high;
 		int status_change_nid;
 	}
 
@@ -74,9 +73,6 @@ memory_notify结构体的指针::
 - status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节
   点id,如果是-1,则nodemask状态不改变。
 
-- status_change_nid_high是当nodemask的N_HIGH_MEMORY被设置/清除时设置的节点
-  id,如果这个值为-1,那么nodemask状态不会改变。
-
 - status_change_nid是当nodemask的N_MEMORY被(将)设置/清除时设置的节点id。这
   意味着一个新的(没上线的)节点通过联机获得新的内存,而一个节点失去了所有的内
   存。如果这个值为-1,那么nodemask的状态就不会改变。
--- a/include/linux/memory.h~mm-memory_hotplug-remove-highmem-leftovers
+++ a/include/linux/memory.h
@@ -96,7 +96,6 @@ struct memory_notify {
 	unsigned long start_pfn;
 	unsigned long nr_pages;
 	int status_change_nid_normal;
-	int status_change_nid_high;
 	int status_change_nid;
 };
 
--- a/mm/memory_hotplug.c~mm-memory_hotplug-remove-highmem-leftovers
+++ a/mm/memory_hotplug.c
@@ -21,7 +21,6 @@
 #include <linux/memory.h>
 #include <linux/memremap.h>
 #include <linux/memory_hotplug.h>
-#include <linux/highmem.h>
 #include <linux/vmalloc.h>
 #include <linux/ioport.h>
 #include <linux/delay.h>
@@ -585,10 +584,6 @@ void generic_online_page(struct page *pa
 	debug_pagealloc_map_pages(page, 1 << order);
 	__free_pages_core(page, order);
 	totalram_pages_add(1UL << order);
-#ifdef CONFIG_HIGHMEM
-	if (PageHighMem(page))
-		totalhigh_pages_add(1UL << order);
-#endif
 }
 EXPORT_SYMBOL_GPL(generic_online_page);
 
@@ -625,16 +620,11 @@ static void node_states_check_changes_on
 
 	arg->status_change_nid = NUMA_NO_NODE;
 	arg->status_change_nid_normal = NUMA_NO_NODE;
-	arg->status_change_nid_high = NUMA_NO_NODE;
 
 	if (!node_state(nid, N_MEMORY))
 		arg->status_change_nid = nid;
 	if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY))
 		arg->status_change_nid_normal = nid;
-#ifdef CONFIG_HIGHMEM
-	if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY))
-		arg->status_change_nid_high = nid;
-#endif
 }
 
 static void node_states_set_node(int node, struct memory_notify *arg)
@@ -642,9 +632,6 @@ static void node_states_set_node(int nod
 	if (arg->status_change_nid_normal >= 0)
 		node_set_state(node, N_NORMAL_MEMORY);
 
-	if (arg->status_change_nid_high >= 0)
-		node_set_state(node, N_HIGH_MEMORY);
-
 	if (arg->status_change_nid >= 0)
 		node_set_state(node, N_MEMORY);
 }
@@ -1801,7 +1788,6 @@ static void node_states_check_changes_of
 
 	arg->status_change_nid = NUMA_NO_NODE;
 	arg->status_change_nid_normal = NUMA_NO_NODE;
-	arg->status_change_nid_high = NUMA_NO_NODE;
 
 	/*
 	 * Check whether node_states[N_NORMAL_MEMORY] will be changed.
@@ -1816,24 +1802,9 @@ static void node_states_check_changes_of
 	if (zone_idx(zone) <= ZONE_NORMAL && nr_pages >= present_pages)
 		arg->status_change_nid_normal = zone_to_nid(zone);
 
-#ifdef CONFIG_HIGHMEM
 	/*
-	 * node_states[N_HIGH_MEMORY] contains nodes which
-	 * have normal memory or high memory.
-	 * Here we add the present_pages belonging to ZONE_HIGHMEM.
-	 * If the zone is within the range of [0..ZONE_HIGHMEM), and
-	 * we determine that the zones in that range become empty,
-	 * we need to clear the node for N_HIGH_MEMORY.
-	 */
-	present_pages += pgdat->node_zones[ZONE_HIGHMEM].present_pages;
-	if (zone_idx(zone) <= ZONE_HIGHMEM && nr_pages >= present_pages)
-		arg->status_change_nid_high = zone_to_nid(zone);
-#endif
-
-	/*
-	 * We have accounted the pages from [0..ZONE_NORMAL), and
-	 * in case of CONFIG_HIGHMEM the pages from ZONE_HIGHMEM
-	 * as well.
+	 * We have accounted the pages from [0..ZONE_NORMAL); ZONE_HIGHMEM
+	 * does not apply as we don't support 32bit.
 	 * Here we count the possible pages from ZONE_MOVABLE.
 	 * If after having accounted all the pages, we see that the nr_pages
 	 * to be offlined is over or equal to the accounted pages,
@@ -1851,9 +1822,6 @@ static void node_states_clear_node(int n
 	if (arg->status_change_nid_normal >= 0)
 		node_clear_state(node, N_NORMAL_MEMORY);
 
-	if (arg->status_change_nid_high >= 0)
-		node_clear_state(node, N_HIGH_MEMORY);
-
 	if (arg->status_change_nid >= 0)
 		node_clear_state(node, N_MEMORY);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 190/262] mm/memory_hotplug: remove stale function declarations
  2021-11-05 20:34 incoming Andrew Morton
                   ` (188 preceding siblings ...)
  2021-11-05 20:44 ` [patch 189/262] mm/memory_hotplug: remove HIGHMEM leftovers Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 191/262] x86: remove memory hotplug support on X86_32 Andrew Morton
                   ` (71 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, shuah, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove stale function declarations

These functions no longer exist.

Link: https://lkml.kernel.org/r/20210929143600.49379-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memory_hotplug.h |    3 ---
 1 file changed, 3 deletions(-)

--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-remove-stale-function-declarations
+++ a/include/linux/memory_hotplug.h
@@ -98,9 +98,6 @@ static inline void zone_seqlock_init(str
 {
 	seqlock_init(&zone->span_seqlock);
 }
-extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
-extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
-extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
 extern void adjust_present_page_count(struct page *page,
 				      struct memory_group *group,
 				      long nr_pages);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 191/262] x86: remove memory hotplug support on X86_32
  2021-11-05 20:34 incoming Andrew Morton
                   ` (189 preceding siblings ...)
  2021-11-05 20:44 ` [patch 190/262] mm/memory_hotplug: remove stale function declarations Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 192/262] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() Andrew Morton
                   ` (70 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, alexs, benh, bp, corbet, dave.hansen, david, gregkh, hpa,
	jasowang, linux-mm, luto, mhocko, mingo, mm-commits, mpe, mst,
	osalvador, paulus, peterz, rafael, rppt, shuah, tglx, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: x86: remove memory hotplug support on X86_32

CONFIG_MEMORY_HOTPLUG was marked BROKEN for over one year and we just
restricted it to 64 bit.  Let's remove the unused x86 32bit implementation
and simplify the Kconfig.

Link: https://lkml.kernel.org/r/20210929143600.49379-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alex Shi <alexs@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/Kconfig      |    6 +++---
 arch/x86/mm/init_32.c |   31 -------------------------------
 2 files changed, 3 insertions(+), 34 deletions(-)

--- a/arch/x86/Kconfig~x86-remove-memory-hotplug-support-on-x86_32
+++ a/arch/x86/Kconfig
@@ -62,7 +62,7 @@ config X86
 	select ARCH_32BIT_OFF_T			if X86_32
 	select ARCH_CLOCKSOURCE_INIT
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
-	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM)
+	select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64
 	select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 || X86_PAE)
 	select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE
@@ -1614,7 +1614,7 @@ config ARCH_SELECT_MEMORY_MODEL
 
 config ARCH_MEMORY_PROBE
 	bool "Enable sysfs memory/probe interface"
-	depends on X86_64 && MEMORY_HOTPLUG
+	depends on MEMORY_HOTPLUG
 	help
 	  This option enables a sysfs memory/probe interface for testing.
 	  See Documentation/admin-guide/mm/memory-hotplug.rst for more information.
@@ -2394,7 +2394,7 @@ endmenu
 
 config ARCH_HAS_ADD_PAGES
 	def_bool y
-	depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG
+	depends on ARCH_ENABLE_MEMORY_HOTPLUG
 
 config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 	def_bool y
--- a/arch/x86/mm/init_32.c~x86-remove-memory-hotplug-support-on-x86_32
+++ a/arch/x86/mm/init_32.c
@@ -779,37 +779,6 @@ void __init mem_init(void)
 	test_wp_bit();
 }
 
-#ifdef CONFIG_MEMORY_HOTPLUG
-int arch_add_memory(int nid, u64 start, u64 size,
-		    struct mhp_params *params)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-	int ret;
-
-	/*
-	 * The page tables were already mapped at boot so if the caller
-	 * requests a different mapping type then we must change all the
-	 * pages with __set_memory_prot().
-	 */
-	if (params->pgprot.pgprot != PAGE_KERNEL.pgprot) {
-		ret = __set_memory_prot(start, nr_pages, params->pgprot);
-		if (ret)
-			return ret;
-	}
-
-	return __add_pages(nid, start_pfn, nr_pages, params);
-}
-
-void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
-{
-	unsigned long start_pfn = start >> PAGE_SHIFT;
-	unsigned long nr_pages = size >> PAGE_SHIFT;
-
-	__remove_pages(start_pfn, nr_pages, altmap);
-}
-#endif
-
 int kernel_set_to_readonly __read_mostly;
 
 static void mark_nxdata_nx(void)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 192/262] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (190 preceding siblings ...)
  2021-11-05 20:44 ` [patch 191/262] x86: remove memory hotplug support on X86_32 Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 193/262] memblock: improve MEMBLOCK_HOTPLUG documentation Andrew Morton
                   ` (69 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, aneesh.kumar, arnd, borntraeger, chenhuacai, david,
	ebiederm, geert, gor, hca, Jianyong.Wu, jiaxun.yang, linux-mm,
	mhocko, mm-commits, osalvador, rppt, shahab, torvalds, tsbogend,
	vgupta

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource()

Patch series "mm/memory_hotplug: full support for add_memory_driver_managed() with CONFIG_ARCH_KEEP_MEMBLOCK", v2.

Architectures that require CONFIG_ARCH_KEEP_MEMBLOCK=y, such as arm64,
don't cleanly support add_memory_driver_managed() yet.  Most prominently,
kexec_file can still end up placing kexec images on such driver-managed
memory, resulting in undesired behavior, for example, having kexec images
located on memory not part of the firmware-provided memory map.

Teaching kexec to not place images on driver-managed memory is especially
relevant for virtio-mem.  Details can be found in commit 7b7b27214bba
("mm/memory_hotplug: introduce add_memory_driver_managed()").

Extend memblock with a new flag and set it from memory hotplug code when
applicable.  This is required to fully support virtio-mem on arm64, also
making kexec_file behave like on x86-64.


This patch (of 2):

If memblock_add_node() fails, we're most probably running out of memory.
While this is unlikely to happen, it can happen, and having memory added
without a memblock can be problematic for architectures that use memblock
to detect valid memory.  Let's fail in a nice way instead of silently
ignoring the error.

Link: https://lkml.kernel.org/r/20211004093605.5830-1-david@redhat.com
Link: https://lkml.kernel.org/r/20211004093605.5830-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-handle-memblock_add_node-failures-in-add_memory_resource
+++ a/mm/memory_hotplug.c
@@ -1369,8 +1369,11 @@ int __ref add_memory_resource(int nid, s
 
 	mem_hotplug_begin();
 
-	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
-		memblock_add_node(start, size, nid);
+	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
+		ret = memblock_add_node(start, size, nid);
+		if (ret)
+			goto error_mem_hotplug_end;
+	}
 
 	ret = __try_online_node(nid, false);
 	if (ret < 0)
@@ -1443,6 +1446,7 @@ error:
 		rollback_node_hotadd(nid);
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK))
 		memblock_remove(start, size);
+error_mem_hotplug_end:
 	mem_hotplug_done();
 	return ret;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 193/262] memblock: improve MEMBLOCK_HOTPLUG documentation
  2021-11-05 20:34 incoming Andrew Morton
                   ` (191 preceding siblings ...)
  2021-11-05 20:44 ` [patch 192/262] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 194/262] memblock: allow to specify flags with memblock_add_node() Andrew Morton
                   ` (68 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, aneesh.kumar, arnd, borntraeger, chenhuacai, david,
	ebiederm, geert, gor, hca, Jianyong.Wu, jiaxun.yang, linux-mm,
	mhocko, mm-commits, osalvador, rppt, shahab, torvalds, tsbogend,
	vgupta

From: David Hildenbrand <david@redhat.com>
Subject: memblock: improve MEMBLOCK_HOTPLUG documentation

The description of MEMBLOCK_HOTPLUG is currently short and consequently
misleading: we're actually dealing with a memory region that might get
hotunplugged later (i.e., the platform+firmware supports it), yet it is
indicated in the firmware-provided memory map as system RAM that will just
get used by the system for any purpose unless special care is taken.  The
firmware marked this memory region as hot(un)plugged (e.g., hotplugged
before reboot), implying that it might get hotunplugged again later.

Whether we consider this information depends on the "movable_node" kernel
command line parameter: only with "movable_node" set will we try to keep
this memory hotunpluggable, for example, by not serving early allocations
from this memory region and by letting the buddy allocator manage it via
ZONE_MOVABLE.

Let's make this clearer by extending the documentation.

Note: kexec *has to* indicate this memory to the second kernel.  With
"movable_node" set, we don't want to place kexec-images on this memory. 
Without "movable_node" set, we don't care and can place kexec-images on
this memory.  In both cases, after successful memory hotunplug, kexec has
to be re-armed to update the memory map for the second kernel and to place
the kexec-images somewhere else.
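
As a purely illustrative aside (not part of this patch), whether the
running kernel was booted with "movable_node" can be checked via the
command line exposed in procfs:

	% grep -o movable_node /proc/cmdline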

Link: https://lkml.kernel.org/r/20211004093605.5830-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

--- a/include/linux/memblock.h~memblock-improve-memblock_hotplug-documentation
+++ a/include/linux/memblock.h
@@ -28,7 +28,11 @@ extern unsigned long long max_possible_p
 /**
  * enum memblock_flags - definition of memory region attributes
  * @MEMBLOCK_NONE: no special request
- * @MEMBLOCK_HOTPLUG: hotpluggable region
+ * @MEMBLOCK_HOTPLUG: memory region indicated in the firmware-provided memory
+ * map during early boot as hot(un)pluggable system RAM (e.g., memory range
+ * that might get hotunplugged later). With "movable_node" set on the kernel
+ * commandline, try keeping this memory region hotunpluggable. Does not apply
+ * to memblocks added ("hotplugged") after early boot.
  * @MEMBLOCK_MIRROR: mirrored region
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 194/262] memblock: allow to specify flags with memblock_add_node()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (192 preceding siblings ...)
  2021-11-05 20:44 ` [patch 193/262] memblock: improve MEMBLOCK_HOTPLUG documentation Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 195/262] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
                   ` (67 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, aneesh.kumar, arnd, borntraeger, chenhuacai, david,
	ebiederm, geert, gor, hca, Jianyong.Wu, jiaxun.yang, linux-mm,
	mhocko, mm-commits, osalvador, rppt, shahab, torvalds, tsbogend,
	vgupta

From: David Hildenbrand <david@redhat.com>
Subject: memblock: allow to specify flags with memblock_add_node()

We want to specify flags when hotplugging memory.  Let's prepare to pass
flags to memblock_add_node() by adjusting all existing users.

Note that when hotplugging memory the system is already up and running and
we might have concurrent memblock users: for example, while we're
hotplugging memory, kexec_file code might search for suitable memory
regions to place kexec images.  It's important to add the memory directly
to memblock via a single call with the right flags, instead of adding the
memory first and applying the flags later: otherwise, concurrent memblock
users might temporarily stumble over memblocks with wrong flags, which
will be important in a follow-up patch that introduces a new flag to
properly handle add_memory_driver_managed().

Link: https://lkml.kernel.org/r/20211004093605.5830-4-david@redhat.com
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Shahab Vahedi <shahab@synopsys.com>	[arch/arc]
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arc/mm/init.c               |    4 ++--
 arch/ia64/mm/contig.c            |    2 +-
 arch/ia64/mm/init.c              |    2 +-
 arch/m68k/mm/mcfmmu.c            |    3 ++-
 arch/m68k/mm/motorola.c          |    6 ++++--
 arch/mips/loongson64/init.c      |    4 +++-
 arch/mips/sgi-ip27/ip27-memory.c |    3 ++-
 arch/s390/kernel/setup.c         |    3 ++-
 include/linux/memblock.h         |    3 ++-
 include/linux/mm.h               |    2 +-
 mm/memblock.c                    |    9 +++++----
 mm/memory_hotplug.c              |    2 +-
 12 files changed, 26 insertions(+), 17 deletions(-)

--- a/arch/arc/mm/init.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/arc/mm/init.c
@@ -59,13 +59,13 @@ void __init early_init_dt_add_memory_arc
 
 		low_mem_sz = size;
 		in_use = 1;
-		memblock_add_node(base, size, 0);
+		memblock_add_node(base, size, 0, MEMBLOCK_NONE);
 	} else {
 #ifdef CONFIG_HIGHMEM
 		high_mem_start = base;
 		high_mem_sz = size;
 		in_use = 1;
-		memblock_add_node(base, size, 1);
+		memblock_add_node(base, size, 1, MEMBLOCK_NONE);
 		memblock_reserve(base, size);
 #endif
 	}
--- a/arch/ia64/mm/contig.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/ia64/mm/contig.c
@@ -153,7 +153,7 @@ find_memory (void)
 	efi_memmap_walk(find_max_min_low_pfn, NULL);
 	max_pfn = max_low_pfn;
 
-	memblock_add_node(0, PFN_PHYS(max_low_pfn), 0);
+	memblock_add_node(0, PFN_PHYS(max_low_pfn), 0, MEMBLOCK_NONE);
 
 	find_initrd();
 
--- a/arch/ia64/mm/init.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/ia64/mm/init.c
@@ -378,7 +378,7 @@ int __init register_active_ranges(u64 st
 #endif
 
 	if (start < end)
-		memblock_add_node(__pa(start), end - start, nid);
+		memblock_add_node(__pa(start), end - start, nid, MEMBLOCK_NONE);
 	return 0;
 }
 
--- a/arch/m68k/mm/mcfmmu.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/m68k/mm/mcfmmu.c
@@ -174,7 +174,8 @@ void __init cf_bootmem_alloc(void)
 	m68k_memory[0].addr = _rambase;
 	m68k_memory[0].size = _ramend - _rambase;
 
-	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+			  MEMBLOCK_NONE);
 
 	/* compute total pages in system */
 	num_pages = PFN_DOWN(_ramend - _rambase);
--- a/arch/m68k/mm/motorola.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/m68k/mm/motorola.c
@@ -410,7 +410,8 @@ void __init paging_init(void)
 
 	min_addr = m68k_memory[0].addr;
 	max_addr = min_addr + m68k_memory[0].size;
-	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0);
+	memblock_add_node(m68k_memory[0].addr, m68k_memory[0].size, 0,
+			  MEMBLOCK_NONE);
 	for (i = 1; i < m68k_num_memory;) {
 		if (m68k_memory[i].addr < min_addr) {
 			printk("Ignoring memory chunk at 0x%lx:0x%lx before the first chunk\n",
@@ -421,7 +422,8 @@ void __init paging_init(void)
 				(m68k_num_memory - i) * sizeof(struct m68k_mem_info));
 			continue;
 		}
-		memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i);
+		memblock_add_node(m68k_memory[i].addr, m68k_memory[i].size, i,
+				  MEMBLOCK_NONE);
 		addr = m68k_memory[i].addr + m68k_memory[i].size;
 		if (addr > max_addr)
 			max_addr = addr;
--- a/arch/mips/loongson64/init.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/mips/loongson64/init.c
@@ -77,7 +77,9 @@ void __init szmem(unsigned int node)
 				(u32)node_id, mem_type, mem_start, mem_size);
 			pr_info("       start_pfn:0x%llx, end_pfn:0x%llx, num_physpages:0x%lx\n",
 				start_pfn, end_pfn, num_physpages);
-			memblock_add_node(PFN_PHYS(start_pfn), PFN_PHYS(node_psize), node);
+			memblock_add_node(PFN_PHYS(start_pfn),
+					  PFN_PHYS(node_psize), node,
+					  MEMBLOCK_NONE);
 			break;
 		case SYSTEM_RAM_RESERVED:
 			pr_info("Node%d: mem_type:%d, mem_start:0x%llx, mem_size:0x%llx MB\n",
--- a/arch/mips/sgi-ip27/ip27-memory.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/mips/sgi-ip27/ip27-memory.c
@@ -341,7 +341,8 @@ static void __init szmem(void)
 				continue;
 			}
 			memblock_add_node(PFN_PHYS(slot_getbasepfn(node, slot)),
-					  PFN_PHYS(slot_psize), node);
+					  PFN_PHYS(slot_psize), node,
+					  MEMBLOCK_NONE);
 		}
 	}
 }
--- a/arch/s390/kernel/setup.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/arch/s390/kernel/setup.c
@@ -593,7 +593,8 @@ static void __init setup_resources(void)
 	 * part of the System RAM resource.
 	 */
 	if (crashk_res.end) {
-		memblock_add_node(crashk_res.start, resource_size(&crashk_res), 0);
+		memblock_add_node(crashk_res.start, resource_size(&crashk_res),
+				  0, MEMBLOCK_NONE);
 		memblock_reserve(crashk_res.start, resource_size(&crashk_res));
 		insert_resource(&iomem_resource, &crashk_res);
 	}
--- a/include/linux/memblock.h~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/include/linux/memblock.h
@@ -104,7 +104,8 @@ static inline void memblock_discard(void
 #endif
 
 void memblock_allow_resize(void);
-int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_add_node(phys_addr_t base, phys_addr_t size, int nid,
+		      enum memblock_flags flags);
 int memblock_add(phys_addr_t base, phys_addr_t size);
 int memblock_remove(phys_addr_t base, phys_addr_t size);
 int memblock_phys_free(phys_addr_t base, phys_addr_t size);
--- a/include/linux/mm.h~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/include/linux/mm.h
@@ -2425,7 +2425,7 @@ static inline unsigned long get_num_phys
  * unsigned long max_zone_pfns[MAX_NR_ZONES] = {max_dma, max_normal_pfn,
  * 							 max_highmem_pfn};
  * for_each_valid_physical_page_range()
- * 	memblock_add_node(base, size, nid)
+ *	memblock_add_node(base, size, nid, MEMBLOCK_NONE)
  * free_area_init(max_zone_pfns);
  */
 void free_area_init(unsigned long *max_zone_pfn);
--- a/mm/memblock.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/mm/memblock.c
@@ -655,6 +655,7 @@ repeat:
  * @base: base address of the new region
  * @size: size of the new region
  * @nid: nid of the new region
+ * @flags: flags of the new region
  *
  * Add new memblock region [@base, @base + @size) to the "memory"
  * type. See memblock_add_range() description for mode details
@@ -663,14 +664,14 @@ repeat:
  * 0 on success, -errno on failure.
  */
 int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
-				       int nid)
+				      int nid, enum memblock_flags flags)
 {
 	phys_addr_t end = base + size - 1;
 
-	memblock_dbg("%s: [%pa-%pa] nid=%d %pS\n", __func__,
-		     &base, &end, nid, (void *)_RET_IP_);
+	memblock_dbg("%s: [%pa-%pa] nid=%d flags=%x %pS\n", __func__,
+		     &base, &end, nid, flags, (void *)_RET_IP_);
 
-	return memblock_add_range(&memblock.memory, base, size, nid, 0);
+	return memblock_add_range(&memblock.memory, base, size, nid, flags);
 }
 
 /**
--- a/mm/memory_hotplug.c~memblock-allow-to-specify-flags-with-memblock_add_node
+++ a/mm/memory_hotplug.c
@@ -1370,7 +1370,7 @@ int __ref add_memory_resource(int nid, s
 	mem_hotplug_begin();
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		ret = memblock_add_node(start, size, nid);
+		ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
 		if (ret)
 			goto error_mem_hotplug_end;
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 195/262] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED
  2021-11-05 20:34 incoming Andrew Morton
                   ` (193 preceding siblings ...)
  2021-11-05 20:44 ` [patch 194/262] memblock: allow to specify flags with memblock_add_node() Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:44 ` [patch 196/262] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
                   ` (66 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, aneesh.kumar, arnd, borntraeger, chenhuacai, david,
	ebiederm, geert, gor, hca, Jianyong.Wu, jiaxun.yang, linux-mm,
	mhocko, mm-commits, osalvador, rppt, shahab, torvalds, tsbogend,
	vgupta

From: David Hildenbrand <david@redhat.com>
Subject: memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED

Let's add a flag that corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED,
indicating that we're dealing with a memory region that is never indicated
in the firmware-provided memory map, but always detected and added by a
driver.

Similar to MEMBLOCK_HOTPLUG, most infrastructure has to treat such memory
regions like ordinary MEMBLOCK_NONE memory regions -- for example, when
selecting memory regions to add to the vmcore for dumping in the
crashkernel via for_each_mem_range().

However, especially kexec_file is not supposed to select such memblocks
via for_each_free_mem_range() / for_each_free_mem_range_reverse() to place
kexec images, similar to how we handle IORESOURCE_SYSRAM_DRIVER_MANAGED
without CONFIG_ARCH_KEEP_MEMBLOCK.

We'll make sure that memory hotplug code sets the flag where applicable
(IORESOURCE_SYSRAM_DRIVER_MANAGED) next.  This prepares architectures that
need CONFIG_ARCH_KEEP_MEMBLOCK, such as arm64, for virtio-mem support.

Note that kexec *must not* indicate this memory to the second kernel and
*must not* place kexec-images on this memory.  Let's add a comment to
kexec_walk_memblock() documenting that MEMBLOCK_DRIVER_MANAGED is now
handled just like IORESOURCE_SYSRAM_DRIVER_MANAGED is handled in
locate_mem_hole_callback() for kexec_walk_resources().

Also note that MEMBLOCK_HOTPLUG cannot be reused due to different
semantics:
	MEMBLOCK_HOTPLUG: memory is indicated as "System RAM" in the
	firmware-provided memory map and added to the system early during
	boot; kexec *has to* indicate this memory to the second kernel and
	can place kexec-images on this memory. After memory hotunplug,
	kexec has to be re-armed. We mostly ignore this flag when
	"movable_node" is not set on the kernel command line, because
	then we're told to not care about hotunpluggability of such
	memory regions.

	MEMBLOCK_DRIVER_MANAGED: memory is not indicated as "System RAM" in
	the firmware-provided memory map; this memory is always detected
	and added to the system by a driver; memory might not actually be
	physically hotunpluggable. kexec *must not* indicate this memory to
	the second kernel and *must not* place kexec-images on this memory.
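
A rough sketch of the intended behaviour (illustrative only, with
placeholder variables): a region added with the new flag is still visited
by for_each_mem_range(), but the "free" iterators used by kexec_file skip
it unless the flag is passed explicitly:

        /* hotplug path: driver-detected memory, not in the firmware map */
        memblock_add_node(base, size, nid, MEMBLOCK_DRIVER_MANAGED);

        /* kexec_file: passing MEMBLOCK_NONE skips driver-managed regions */
        for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
                                &mstart, &mend, NULL) {
                /* driver-managed ranges never show up here */
        }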

Link: https://lkml.kernel.org/r/20211004093605.5830-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |   16 ++++++++++++++--
 kernel/kexec_file.c      |    5 +++++
 mm/memblock.c            |    4 ++++
 3 files changed, 23 insertions(+), 2 deletions(-)

--- a/include/linux/memblock.h~memblock-add-memblock_driver_managed-to-mimic-ioresource_sysram_driver_managed
+++ a/include/linux/memblock.h
@@ -37,12 +37,17 @@ extern unsigned long long max_possible_p
  * @MEMBLOCK_NOMAP: don't add to kernel direct mapping and treat as
  * reserved in the memory map; refer to memblock_mark_nomap() description
  * for further details
+ * @MEMBLOCK_DRIVER_MANAGED: memory region that is always detected and added
+ * via a driver, and never indicated in the firmware-provided memory map as
+ * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the
+ * kernel resource tree.
  */
 enum memblock_flags {
 	MEMBLOCK_NONE		= 0x0,	/* No special request */
 	MEMBLOCK_HOTPLUG	= 0x1,	/* hotpluggable region */
 	MEMBLOCK_MIRROR		= 0x2,	/* mirrored region */
 	MEMBLOCK_NOMAP		= 0x4,	/* don't add to kernel direct mapping */
+	MEMBLOCK_DRIVER_MANAGED = 0x8,	/* always detected via a driver */
 };
 
 /**
@@ -213,7 +218,8 @@ static inline void __next_physmem_range(
  */
 #define for_each_mem_range(i, p_start, p_end) \
 	__for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,	\
-			     MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+			     MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED, \
+			     p_start, p_end, NULL)
 
 /**
  * for_each_mem_range_rev - reverse iterate through memblock areas from
@@ -224,7 +230,8 @@ static inline void __next_physmem_range(
  */
 #define for_each_mem_range_rev(i, p_start, p_end)			\
 	__for_each_mem_range_rev(i, &memblock.memory, NULL, NUMA_NO_NODE, \
-				 MEMBLOCK_HOTPLUG, p_start, p_end, NULL)
+				 MEMBLOCK_HOTPLUG | MEMBLOCK_DRIVER_MANAGED,\
+				 p_start, p_end, NULL)
 
 /**
  * for_each_reserved_mem_range - iterate over all reserved memblock areas
@@ -254,6 +261,11 @@ static inline bool memblock_is_nomap(str
 	return m->flags & MEMBLOCK_NOMAP;
 }
 
+static inline bool memblock_is_driver_managed(struct memblock_region *m)
+{
+	return m->flags & MEMBLOCK_DRIVER_MANAGED;
+}
+
 int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
 			    unsigned long  *end_pfn);
 void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
--- a/kernel/kexec_file.c~memblock-add-memblock_driver_managed-to-mimic-ioresource_sysram_driver_managed
+++ a/kernel/kexec_file.c
@@ -556,6 +556,11 @@ static int kexec_walk_memblock(struct ke
 	if (kbuf->image->type == KEXEC_TYPE_CRASH)
 		return func(&crashk_res, kbuf);
 
+	/*
+	 * Using MEMBLOCK_NONE will properly skip MEMBLOCK_DRIVER_MANAGED. See
+	 * IORESOURCE_SYSRAM_DRIVER_MANAGED handling in
+	 * locate_mem_hole_callback().
+	 */
 	if (kbuf->top_down) {
 		for_each_free_mem_range_reverse(i, NUMA_NO_NODE, MEMBLOCK_NONE,
 						&mstart, &mend, NULL) {
--- a/mm/memblock.c~memblock-add-memblock_driver_managed-to-mimic-ioresource_sysram_driver_managed
+++ a/mm/memblock.c
@@ -982,6 +982,10 @@ static bool should_skip_region(struct me
 	if (!(flags & MEMBLOCK_NOMAP) && memblock_is_nomap(m))
 		return true;
 
+	/* skip driver-managed memory unless we were asked for it explicitly */
+	if (!(flags & MEMBLOCK_DRIVER_MANAGED) && memblock_is_driver_managed(m))
+		return true;
+
 	return false;
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 196/262] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED
  2021-11-05 20:34 incoming Andrew Morton
                   ` (194 preceding siblings ...)
  2021-11-05 20:44 ` [patch 195/262] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
@ 2021-11-05 20:44 ` Andrew Morton
  2021-11-05 20:45 ` [patch 197/262] mm/rmap.c: avoid double faults migrating device private pages Andrew Morton
                   ` (65 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:44 UTC (permalink / raw)
  To: akpm, aneesh.kumar, arnd, borntraeger, chenhuacai, david,
	ebiederm, geert, gor, hca, Jianyong.Wu, jiaxun.yang, linux-mm,
	mhocko, mm-commits, osalvador, rppt, shahab, torvalds, tsbogend,
	vgupta

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED

Let's communicate driver-managed regions to memblock, to properly teach
kexec_file with CONFIG_ARCH_KEEP_MEMBLOCK to not place images on these
memory regions.
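
For context, a hedged sketch of how a driver is expected to reach this
path: memory added via add_memory_driver_managed() carries
IORESOURCE_SYSRAM_DRIVER_MANAGED on its resource, which the hunk below now
maps to MEMBLOCK_DRIVER_MANAGED (the resource name and nid here are
placeholders):

        /* e.g. virtio-mem style hotplug of driver-detected memory */
        rc = add_memory_driver_managed(nid, start, size,
                                       "System RAM (example_driver)",
                                       MHP_NONE);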

Link: https://lkml.kernel.org/r/20211004093605.5830-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Jianyong Wu <Jianyong.Wu@arm.com>
Cc: Jiaxun Yang <jiaxun.yang@flygoat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Shahab Vahedi <shahab@synopsys.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-indicate-memblock_driver_managed-with-ioresource_sysram_driver_managed
+++ a/mm/memory_hotplug.c
@@ -1342,6 +1342,7 @@ bool mhp_supports_memmap_on_memory(unsig
 int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
+	enum memblock_flags memblock_flags = MEMBLOCK_NONE;
 	struct vmem_altmap mhp_altmap = {};
 	struct memory_group *group = NULL;
 	u64 start, size;
@@ -1370,7 +1371,9 @@ int __ref add_memory_resource(int nid, s
 	mem_hotplug_begin();
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
-		ret = memblock_add_node(start, size, nid, MEMBLOCK_NONE);
+		if (res->flags & IORESOURCE_SYSRAM_DRIVER_MANAGED)
+			memblock_flags = MEMBLOCK_DRIVER_MANAGED;
+		ret = memblock_add_node(start, size, nid, memblock_flags);
 		if (ret)
 			goto error_mem_hotplug_end;
 	}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 197/262] mm/rmap.c: avoid double faults migrating device private pages
  2021-11-05 20:34 incoming Andrew Morton
                   ` (195 preceding siblings ...)
  2021-11-05 20:44 ` [patch 196/262] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 198/262] mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() Andrew Morton
                   ` (64 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, apopple, jglisse, jhubbard, linux-mm, mm-commits,
	rcampbell, torvalds

From: Alistair Popple <apopple@nvidia.com>
Subject: mm/rmap.c: avoid double faults migrating device private pages

During migration special page table entries are installed for each page
being migrated.  These entries store the pfn and associated permissions of
ptes mapping the page being migarted.

Device-private pages use special swap pte entries to distinguish read-only
vs.  writeable pages which the migration code checks when creating
migration entries.  Normally this follows a fast path in
migrate_vma_collect_pmd() which correctly copies the permissions of
device-private pages over to migration entries when migrating pages back
to the CPU.

However, the slow path falls back to using try_to_migrate(), which
unconditionally creates read-only migration entries for device-private
pages.  This leads to unnecessary double faults on the CPU, as the new
pages are always mapped read-only even when they could be mapped
writeable.  Fix this by correctly copying the device-private permissions in
try_to_migrate_one().

Link: https://lkml.kernel.org/r/20211018045247.3128058-1-apopple@nvidia.com
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reported-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/rmap.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/mm/rmap.c~mm-rmapc-avoid-double-faults-migrating-device-private-pages
+++ a/mm/rmap.c
@@ -1807,6 +1807,7 @@ static bool try_to_migrate_one(struct pa
 		update_hiwater_rss(mm);
 
 		if (is_zone_device_page(page)) {
+			unsigned long pfn = page_to_pfn(page);
 			swp_entry_t entry;
 			pte_t swp_pte;
 
@@ -1815,8 +1816,11 @@ static bool try_to_migrate_one(struct pa
 			 * pte. do_swap_page() will wait until the migration
 			 * pte is removed and then restart fault handling.
 			 */
-			entry = make_readable_migration_entry(
-							page_to_pfn(page));
+			entry = pte_to_swp_entry(pteval);
+			if (is_writable_device_private_entry(entry))
+				entry = make_writable_migration_entry(pfn);
+			else
+				entry = make_readable_migration_entry(pfn);
 			swp_pte = swp_entry_to_pte(entry);
 
 			/*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 198/262] mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (196 preceding siblings ...)
  2021-11-05 20:45 ` [patch 197/262] mm/rmap.c: avoid double faults migrating device private pages Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 199/262] mm/highmem: remove deprecated kmap_atomic Andrew Morton
                   ` (63 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, henryburns, linmiaohe, linux-mm, minchan, mm-commits,
	senozhatsky, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()

There is one possible race window between zs_pool_dec_isolated() and
zs_unregister_migration() because wait_for_isolated_drain() checks the
isolated count without holding class->lock and there is no memory ordering
inside zs_pool_dec_isolated().  Thus the following race window is possible:

zs_pool_dec_isolated		zs_unregister_migration
  check pool->destroying != 0
				  pool->destroying = true;
				  smp_mb();
				  wait_for_isolated_drain()
				    wait for pool->isolated_pages == 0
  atomic_long_dec(&pool->isolated_pages);
  atomic_long_read(&pool->isolated_pages) == 0

Since pool->destroying is observed as false before the atomic_long_dec() of
pool->isolated_pages, the wakeup on pool->migration_wait is missed.

Fix this by ensuring that the check of pool->destroying happens after the
atomic_long_dec(&pool->isolated_pages).

Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com
Fixes: 701d678599d0 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Henry Burns <henryburns@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zsmalloc.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/mm/zsmalloc.c~mm-zsmallocc-close-race-window-between-zs_pool_dec_isolated-and-zs_unregister_migration
+++ a/mm/zsmalloc.c
@@ -1830,10 +1830,11 @@ static inline void zs_pool_dec_isolated(
 	VM_BUG_ON(atomic_long_read(&pool->isolated_pages) <= 0);
 	atomic_long_dec(&pool->isolated_pages);
 	/*
-	 * There's no possibility of racing, since wait_for_isolated_drain()
-	 * checks the isolated count under &class->lock after enqueuing
-	 * on migration_wait.
+	 * Checking pool->destroying must happen after atomic_long_dec()
+	 * for pool->isolated_pages above. Paired with the smp_mb() in
+	 * zs_unregister_migration().
 	 */
+	smp_mb__after_atomic();
 	if (atomic_long_read(&pool->isolated_pages) == 0 && pool->destroying)
 		wake_up_all(&pool->migration_wait);
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 199/262] mm/highmem: remove deprecated kmap_atomic
  2021-11-05 20:34 incoming Andrew Morton
                   ` (197 preceding siblings ...)
  2021-11-05 20:45 ` [patch 198/262] mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 200/262] zram_drv: allow reclaim on bio_alloc Andrew Morton
                   ` (62 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, ira.weiny, linux-mm, mm-commits, peterz, prathu.baronia,
	rdunlap, tglx, torvalds, willy

From: Ira Weiny <ira.weiny@intel.com>
Subject: mm/highmem: remove deprecated kmap_atomic

kmap_atomic() is being deprecated in favor of kmap_local_page().

Replace the uses of kmap_atomic() within the highmem code.

When profiling clear_huge_page() using ftrace, an improvement of 62% was
observed on the setup below.

Setup:-
The data below was collected on Qualcomm's SM7250 SoC (kernel v4.19.113)
with THP enabled, with only CPU-0 (Cortex-A55) and CPU-7 (Cortex-A76)
switched on and set to max frequency, and DDR set to the perf governor.

FTRACE Data:-

Base data:-
Number of iterations: 48
Mean of allocation time: 349.5 us
std deviation: 74.5 us

v4 data:-
Number of iterations: 48
Mean of allocation time: 131 us
std deviation: 32.7 us

The following simple userspace experiment, which allocates 100MB (BUF_SZ)
of pages and writes to them, gave us a good insight: we observed an
improvement of 42% in allocation and write timings.
-------------------------------------------------------------
Test code snippet
-------------------------------------------------------------
      clock_start();
      buf = malloc(BUF_SZ); /* Allocate 100 MB of memory */

        for(i=0; i < BUF_SZ_PAGES; i++)
        {
                *((int *)(buf + (i*PAGE_SIZE))) = 1;
        }
      clock_end();
-------------------------------------------------------------

Malloc test timings for 100MB anon allocation:-

Base data:-
Number of iterations: 100
Mean of allocation time: 31831 us
std deviation: 4286 us

v4 data:-
Number of iterations: 100
Mean of allocation time: 18193 us
std deviation: 4915 us
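
For reference, a self-contained version of the test snippet above might
look like the following (a minimal sketch assuming 4K pages; BUF_SZ and the
clock_start()/clock_end() helpers in the original are placeholders,
replaced here with clock_gettime()):

        #define _POSIX_C_SOURCE 199309L
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define PAGE_SIZE       4096UL
        #define BUF_SZ          (100UL * 1024 * 1024)   /* 100 MB */
        #define BUF_SZ_PAGES    (BUF_SZ / PAGE_SIZE)

        int main(void)
        {
                struct timespec t0, t1;
                unsigned long i;
                char *buf;

                clock_gettime(CLOCK_MONOTONIC, &t0);
                buf = malloc(BUF_SZ);                   /* allocate 100 MB */
                if (!buf)
                        return 1;
                for (i = 0; i < BUF_SZ_PAGES; i++)      /* touch every page */
                        *((int *)(buf + i * PAGE_SIZE)) = 1;
                clock_gettime(CLOCK_MONOTONIC, &t1);

                printf("alloc+write: %ld us\n",
                       (t1.tv_sec - t0.tv_sec) * 1000000L +
                       (t1.tv_nsec - t0.tv_nsec) / 1000L);
                free(buf);
                return 0;
        }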

[willy@infradead.org: fix zero_user_segments()]
  Link: https://lkml.kernel.org/r/YYVhHCJcm2DM2G9u@casper.infradead.org
Link: https://lkml.kernel.org/r/20210204073255.20769-2-prathu.baronia@oneplus.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Prathu Baronia <prathu.baronia@oneplus.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/highmem.h |   28 ++++++++++++++--------------
 mm/highmem.c            |    6 +++---
 2 files changed, 17 insertions(+), 17 deletions(-)

--- a/include/linux/highmem.h~mm-highmem-remove-deprecated-kmap_atomic
+++ a/include/linux/highmem.h
@@ -143,9 +143,9 @@ static inline void invalidate_kernel_vma
 #ifndef clear_user_highpage
 static inline void clear_user_highpage(struct page *page, unsigned long vaddr)
 {
-	void *addr = kmap_atomic(page);
+	void *addr = kmap_local_page(page);
 	clear_user_page(addr, vaddr, page);
-	kunmap_atomic(addr);
+	kunmap_local(addr);
 }
 #endif
 
@@ -177,9 +177,9 @@ alloc_zeroed_user_highpage_movable(struc
 
 static inline void clear_highpage(struct page *page)
 {
-	void *kaddr = kmap_atomic(page);
+	void *kaddr = kmap_local_page(page);
 	clear_page(kaddr);
-	kunmap_atomic(kaddr);
+	kunmap_local(kaddr);
 }
 
 #ifndef __HAVE_ARCH_TAG_CLEAR_HIGHPAGE
@@ -202,7 +202,7 @@ static inline void zero_user_segments(st
 		unsigned start1, unsigned end1,
 		unsigned start2, unsigned end2)
 {
-	void *kaddr = kmap_atomic(page);
+	void *kaddr = kmap_local_page(page);
 	unsigned int i;
 
 	BUG_ON(end1 > page_size(page) || end2 > page_size(page));
@@ -213,7 +213,7 @@ static inline void zero_user_segments(st
 	if (end2 > start2)
 		memset(kaddr + start2, 0, end2 - start2);
 
-	kunmap_atomic(kaddr);
+	kunmap_local(kaddr);
 	for (i = 0; i < compound_nr(page); i++)
 		flush_dcache_page(page + i);
 }
@@ -238,11 +238,11 @@ static inline void copy_user_highpage(st
 {
 	char *vfrom, *vto;
 
-	vfrom = kmap_atomic(from);
-	vto = kmap_atomic(to);
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
 	copy_user_page(vto, vfrom, vaddr, to);
-	kunmap_atomic(vto);
-	kunmap_atomic(vfrom);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
 }
 
 #endif
@@ -253,11 +253,11 @@ static inline void copy_highpage(struct
 {
 	char *vfrom, *vto;
 
-	vfrom = kmap_atomic(from);
-	vto = kmap_atomic(to);
+	vfrom = kmap_local_page(from);
+	vto = kmap_local_page(to);
 	copy_page(vto, vfrom);
-	kunmap_atomic(vto);
-	kunmap_atomic(vfrom);
+	kunmap_local(vto);
+	kunmap_local(vfrom);
 }
 
 #endif
--- a/mm/highmem.c~mm-highmem-remove-deprecated-kmap_atomic
+++ a/mm/highmem.c
@@ -383,7 +383,7 @@ void zero_user_segments(struct page *pag
 			unsigned this_end = min_t(unsigned, end1, PAGE_SIZE);
 
 			if (end1 > start1) {
-				kaddr = kmap_atomic(page + i);
+				kaddr = kmap_local_page(page + i);
 				memset(kaddr + start1, 0, this_end - start1);
 			}
 			end1 -= this_end;
@@ -398,7 +398,7 @@ void zero_user_segments(struct page *pag
 
 			if (end2 > start2) {
 				if (!kaddr)
-					kaddr = kmap_atomic(page + i);
+					kaddr = kmap_local_page(page + i);
 				memset(kaddr + start2, 0, this_end - start2);
 			}
 			end2 -= this_end;
@@ -406,7 +406,7 @@ void zero_user_segments(struct page *pag
 		}
 
 		if (kaddr) {
-			kunmap_atomic(kaddr);
+			kunmap_local(kaddr);
 			flush_dcache_page(page + i);
 		}
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 200/262] zram_drv: allow reclaim on bio_alloc
  2021-11-05 20:34 incoming Andrew Morton
                   ` (198 preceding siblings ...)
  2021-11-05 20:45 ` [patch 199/262] mm/highmem: remove deprecated kmap_atomic Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 201/262] zram: off by one in read_block_state() Andrew Morton
                   ` (61 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, jaewon31.kim, linux-mm, minchan, mm-commits, torvalds, ytk.lee

From: Jaewon Kim <jaewon31.kim@samsung.com>
Subject: zram_drv: allow reclaim on bio_alloc

read_from_bdev_async() is not called in atomic context, so GFP_NOIO can be
used rather than GFP_ATOMIC.  If there are reclaimable pages, GFP_NOIO
allows them to be reclaimed, avoiding allocation failure and the resulting
page fault failure.

Link: https://lkml.kernel.org/r/20210908005241.28062-1-jaewon31.kim@samsung.com
Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/block/zram/zram_drv.c~zram_drv-allow-reclaim-on-bio_alloc
+++ a/drivers/block/zram/zram_drv.c
@@ -587,7 +587,7 @@ static int read_from_bdev_async(struct z
 {
 	struct bio *bio;
 
-	bio = bio_alloc(GFP_ATOMIC, 1);
+	bio = bio_alloc(GFP_NOIO, 1);
 	if (!bio)
 		return -ENOMEM;
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 201/262] zram: off by one in read_block_state()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (199 preceding siblings ...)
  2021-11-05 20:45 ` [patch 200/262] zram_drv: allow reclaim on bio_alloc Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 202/262] zram: introduce an aged idle interface Andrew Morton
                   ` (60 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dan.carpenter, linux-mm, minchan, mm-commits, senozhatsky,
	torvalds

From: Dan Carpenter <dan.carpenter@oracle.com>
Subject: zram: off by one in read_block_state()

snprintf() returns the number of bytes it would have printed if there were
enough space, but that count does not include the NUL terminator.  So if
"count == copied", the output has already been truncated by one character.

This bug likely isn't super harmful in real life.
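
A small userspace illustration of the snprintf() semantics the fix relies
on (purely illustrative, not kernel code):

        #include <stdio.h>

        int main(void)
        {
                char buf[8];
                /*
                 * "12345678" needs 8 characters plus the NUL terminator.
                 * snprintf() returns 8 == sizeof(buf), so the output was
                 * already truncated: buf now holds "1234567".
                 */
                int copied = snprintf(buf, sizeof(buf), "%s", "12345678");

                if (copied >= (int)sizeof(buf))         /* count <= copied */
                        printf("truncated, buf=\"%s\"\n", buf);
                return 0;
        }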

Link: https://lkml.kernel.org/r/20210916130404.GA25094@kili
Fixes: c0265342bff4 ("zram: introduce zram memory tracking")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/block/zram/zram_drv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/block/zram/zram_drv.c~zram-off-by-one-in-read_block_state
+++ a/drivers/block/zram/zram_drv.c
@@ -910,7 +910,7 @@ static ssize_t read_block_state(struct f
 			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.',
 			zram_test_flag(zram, index, ZRAM_IDLE) ? 'i' : '.');
 
-		if (count < copied) {
+		if (count <= copied) {
 			zram_slot_unlock(zram, index);
 			break;
 		}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 202/262] zram: introduce an aged idle interface
  2021-11-05 20:34 incoming Andrew Morton
                   ` (200 preceding siblings ...)
  2021-11-05 20:45 ` [patch 201/262] zram: off by one in read_block_state() Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 203/262] mm: remove HARDENED_USERCOPY_FALLBACK Andrew Morton
                   ` (59 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, bgeffon, corbet, jsbarnes, linux-mm, minchan, mm-commits,
	ngupta, senozhatsky, suleiman, torvalds

From: Brian Geffon <bgeffon@google.com>
Subject: zram: introduce an aged idle interface

This change introduces an aged idle interface to the existing idle sysfs
file for zram.

When CONFIG_ZRAM_MEMORY_TRACKING is enabled, the idle file now also accepts
an integer argument.  This integer is the age (in seconds) of pages to
mark as idle.  The idle file still supports 'all' as it always has.  This
new approach allows much more control over which pages get marked as
idle.

[bgeffon@google.com: use IS_ENABLED and cleanup comment]
  Link: https://lkml.kernel.org/r/20210924161128.1508015-1-bgeffon@google.com
[bgeffon@google.com: Sergey's cleanup suggestions]
  Link: https://lkml.kernel.org/r/20210929143056.13067-1-bgeffon@google.com
Link: https://lkml.kernel.org/r/20210923130115.1344361-1-bgeffon@google.com
Signed-off-by: Brian Geffon <bgeffon@google.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Suleiman Souhlal <suleiman@google.com>
Cc: Jesse Barnes <jsbarnes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/blockdev/zram.rst |    8 ++
 drivers/block/zram/zram_drv.c               |   62 +++++++++++++-----
 2 files changed, 54 insertions(+), 16 deletions(-)

--- a/Documentation/admin-guide/blockdev/zram.rst~zram-introduce-an-aged-idle-interface
+++ a/Documentation/admin-guide/blockdev/zram.rst
@@ -328,6 +328,14 @@ as idle::
 From now on, any pages on zram are idle pages. The idle mark
 will be removed until someone requests access of the block.
 IOW, unless there is access request, those pages are still idle pages.
+Additionally, when CONFIG_ZRAM_MEMORY_TRACKING is enabled pages can be
+marked as idle based on how long (in seconds) it's been since they were
+last accessed::
+
+        echo 86400 > /sys/block/zramX/idle
+
+In this example all pages which haven't been accessed in more than 86400
+seconds (one day) will be marked idle.
 
 Admin can request writeback of those idle pages at right timing via::
 
--- a/drivers/block/zram/zram_drv.c~zram-introduce-an-aged-idle-interface
+++ a/drivers/block/zram/zram_drv.c
@@ -291,22 +291,16 @@ static ssize_t mem_used_max_store(struct
 	return len;
 }
 
-static ssize_t idle_store(struct device *dev,
-		struct device_attribute *attr, const char *buf, size_t len)
+/*
+ * Mark all pages which are older than or equal to cutoff as IDLE.
+ * Callers should hold the zram init lock in read mode
+ */
+static void mark_idle(struct zram *zram, ktime_t cutoff)
 {
-	struct zram *zram = dev_to_zram(dev);
+	int is_idle = 1;
 	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
 	int index;
 
-	if (!sysfs_streq(buf, "all"))
-		return -EINVAL;
-
-	down_read(&zram->init_lock);
-	if (!init_done(zram)) {
-		up_read(&zram->init_lock);
-		return -EINVAL;
-	}
-
 	for (index = 0; index < nr_pages; index++) {
 		/*
 		 * Do not mark ZRAM_UNDER_WB slot as ZRAM_IDLE to close race.
@@ -314,14 +308,50 @@ static ssize_t idle_store(struct device
 		 */
 		zram_slot_lock(zram, index);
 		if (zram_allocated(zram, index) &&
-				!zram_test_flag(zram, index, ZRAM_UNDER_WB))
-			zram_set_flag(zram, index, ZRAM_IDLE);
+				!zram_test_flag(zram, index, ZRAM_UNDER_WB)) {
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+			is_idle = !cutoff || ktime_after(cutoff, zram->table[index].ac_time);
+#endif
+			if (is_idle)
+				zram_set_flag(zram, index, ZRAM_IDLE);
+		}
 		zram_slot_unlock(zram, index);
 	}
+}
 
-	up_read(&zram->init_lock);
+static ssize_t idle_store(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t len)
+{
+	struct zram *zram = dev_to_zram(dev);
+	ktime_t cutoff_time = 0;
+	ssize_t rv = -EINVAL;
 
-	return len;
+	if (!sysfs_streq(buf, "all")) {
+		/*
+		 * If it did not parse as 'all' try to treat it as an integer when
+		 * we have memory tracking enabled.
+		 */
+		u64 age_sec;
+
+		if (IS_ENABLED(CONFIG_ZRAM_MEMORY_TRACKING) && !kstrtoull(buf, 0, &age_sec))
+			cutoff_time = ktime_sub(ktime_get_boottime(),
+					ns_to_ktime(age_sec * NSEC_PER_SEC));
+		else
+			goto out;
+	}
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram))
+		goto out_unlock;
+
+	/* A cutoff_time of 0 marks everything as idle, this is the "all" behavior */
+	mark_idle(zram, cutoff_time);
+	rv = len;
+
+out_unlock:
+	up_read(&zram->init_lock);
+out:
+	return rv;
 }
 
 #ifdef CONFIG_ZRAM_WRITEBACK
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 203/262] mm: remove HARDENED_USERCOPY_FALLBACK
  2021-11-05 20:34 incoming Andrew Morton
                   ` (201 preceding siblings ...)
  2021-11-05 20:45 ` [patch 202/262] zram: introduce an aged idle interface Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 204/262] include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h Andrew Morton
                   ` (58 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, jmorris, joel, keescook, linux-mm,
	mm-commits, penberg, rientjes, serge, steve, torvalds, vbabka

From: Stephen Kitt <steve@sk2.org>
Subject: mm: remove HARDENED_USERCOPY_FALLBACK

This has served its purpose and is no longer used.  All usercopy
violations appear to have been handled by now; any remaining instances (or
new bugs) will cause copies to be rejected.

This isn't a direct revert of commit 2d891fbc3bb6 ("usercopy: Allow strict
enforcement of whitelists"); since usercopy_fallback is effectively 0, the
fallback handling is removed too.

This also removes the usercopy_fallback module parameter on slab_common.

Link: https://github.com/KSPP/linux/issues/153
Link: https://lkml.kernel.org/r/20210921061149.1091163-1-steve@sk2.org
Signed-off-by: Stephen Kitt <steve@sk2.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Acked-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Stanley <joel@jms.id.au>	[defconfig change]
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E . Hallyn" <serge@hallyn.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/configs/skiroot_defconfig |    1 -
 include/linux/slab.h                   |    2 --
 mm/slab.c                              |   13 -------------
 mm/slab_common.c                       |    8 --------
 mm/slub.c                              |   14 --------------
 security/Kconfig                       |   14 --------------
 6 files changed, 52 deletions(-)

--- a/arch/powerpc/configs/skiroot_defconfig~mm-remove-hardened_usercopy_fallback
+++ a/arch/powerpc/configs/skiroot_defconfig
@@ -275,7 +275,6 @@ CONFIG_NLS_UTF8=y
 CONFIG_ENCRYPTED_KEYS=y
 CONFIG_SECURITY=y
 CONFIG_HARDENED_USERCOPY=y
-# CONFIG_HARDENED_USERCOPY_FALLBACK is not set
 CONFIG_HARDENED_USERCOPY_PAGESPAN=y
 CONFIG_FORTIFY_SOURCE=y
 CONFIG_SECURITY_LOCKDOWN_LSM=y
--- a/include/linux/slab.h~mm-remove-hardened_usercopy_fallback
+++ a/include/linux/slab.h
@@ -142,8 +142,6 @@ struct mem_cgroup;
 void __init kmem_cache_init(void);
 bool slab_is_available(void);
 
-extern bool usercopy_fallback;
-
 struct kmem_cache *kmem_cache_create(const char *name, unsigned int size,
 			unsigned int align, slab_flags_t flags,
 			void (*ctor)(void *));
--- a/mm/slab.c~mm-remove-hardened_usercopy_fallback
+++ a/mm/slab.c
@@ -4204,19 +4204,6 @@ void __check_heap_object(const void *ptr
 	    n <= cachep->useroffset - offset + cachep->usersize)
 		return;
 
-	/*
-	 * If the copy is still within the allocated object, produce
-	 * a warning instead of rejecting the copy. This is intended
-	 * to be a temporary method to find any missing usercopy
-	 * whitelists.
-	 */
-	if (usercopy_fallback &&
-	    offset <= cachep->object_size &&
-	    n <= cachep->object_size - offset) {
-		usercopy_warn("SLAB object", cachep->name, to_user, offset, n);
-		return;
-	}
-
 	usercopy_abort("SLAB object", cachep->name, to_user, offset, n);
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
--- a/mm/slab_common.c~mm-remove-hardened_usercopy_fallback
+++ a/mm/slab_common.c
@@ -37,14 +37,6 @@ LIST_HEAD(slab_caches);
 DEFINE_MUTEX(slab_mutex);
 struct kmem_cache *kmem_cache;
 
-#ifdef CONFIG_HARDENED_USERCOPY
-bool usercopy_fallback __ro_after_init =
-		IS_ENABLED(CONFIG_HARDENED_USERCOPY_FALLBACK);
-module_param(usercopy_fallback, bool, 0400);
-MODULE_PARM_DESC(usercopy_fallback,
-		"WARN instead of reject usercopy whitelist violations");
-#endif
-
 static LIST_HEAD(slab_caches_to_rcu_destroy);
 static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work);
 static DECLARE_WORK(slab_caches_to_rcu_destroy_work,
--- a/mm/slub.c~mm-remove-hardened_usercopy_fallback
+++ a/mm/slub.c
@@ -4489,7 +4489,6 @@ void __check_heap_object(const void *ptr
 {
 	struct kmem_cache *s;
 	unsigned int offset;
-	size_t object_size;
 	bool is_kfence = is_kfence_address(ptr);
 
 	ptr = kasan_reset_tag(ptr);
@@ -4522,19 +4521,6 @@ void __check_heap_object(const void *ptr
 	    n <= s->useroffset - offset + s->usersize)
 		return;
 
-	/*
-	 * If the copy is still within the allocated object, produce
-	 * a warning instead of rejecting the copy. This is intended
-	 * to be a temporary method to find any missing usercopy
-	 * whitelists.
-	 */
-	object_size = slab_ksize(s);
-	if (usercopy_fallback &&
-	    offset <= object_size && n <= object_size - offset) {
-		usercopy_warn("SLUB object", s->name, to_user, offset, n);
-		return;
-	}
-
 	usercopy_abort("SLUB object", s->name, to_user, offset, n);
 }
 #endif /* CONFIG_HARDENED_USERCOPY */
--- a/security/Kconfig~mm-remove-hardened_usercopy_fallback
+++ a/security/Kconfig
@@ -163,20 +163,6 @@ config HARDENED_USERCOPY
 	  or are part of the kernel text. This kills entire classes
 	  of heap overflow exploits and similar kernel memory exposures.
 
-config HARDENED_USERCOPY_FALLBACK
-	bool "Allow usercopy whitelist violations to fallback to object size"
-	depends on HARDENED_USERCOPY
-	default y
-	help
-	  This is a temporary option that allows missing usercopy whitelists
-	  to be discovered via a WARN() to the kernel log, instead of
-	  rejecting the copy, falling back to non-whitelisted hardened
-	  usercopy that checks the slab allocation size instead of the
-	  whitelist size. This option will be removed once it seems like
-	  all missing usercopy whitelists have been identified and fixed.
-	  Booting with "slab_common.usercopy_fallback=Y/N" can change
-	  this setting.
-
 config HARDENED_USERCOPY_PAGESPAN
 	bool "Refuse to copy allocations that span multiple pages"
 	depends on HARDENED_USERCOPY
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 204/262] include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h
  2021-11-05 20:34 incoming Andrew Morton
                   ` (202 preceding siblings ...)
  2021-11-05 20:45 ` [patch 203/262] mm: remove HARDENED_USERCOPY_FALLBACK Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 205/262] stacktrace: move filter_irq_stacks() to kernel/stacktrace.c Andrew Morton
                   ` (57 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, davem, horms, kuba, linux-mm, liumh1, marcelo.leitner,
	mm-commits, pshelar, torvalds, ulf.hansson, vyasevich, willy

From: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Subject: include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h

nr_free_buffer_pages could be exposed through mm.h instead of swap.h.  The
advantage of this change is that it reduces obsolete includes.  For
example, net/ipv4/tcp.c wouldn't need swap.h any more since it already
includes mm.h.  Similarly, after checking all the other files, it turns
out that tcp.c, udp.c, meter.c, ...  follow the same rule, so these files
can have swap.h removed too.

Moreover, after preprocessing all the files that use nr_free_buffer_pages,
it turns out that those files have already included mm.h.  Thus, we can
move nr_free_buffer_pages from swap.h to mm.h safely.  This change will not
affect the compilation of other files.

Link: https://lkml.kernel.org/r/20210912133640.1624-1-liumh1@shanghaitech.edu.cn
Signed-off-by: Mianhan Liu <liumh1@shanghaitech.edu.cn>
Cc: Jakub Kicinski <kuba@kernel.org>
CC: Ulf Hansson <ulf.hansson@linaro.org>
Cc: "David S . Miller" <davem@davemloft.net>
Cc: Simon Horman <horms@verge.net.au>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/mmc/core/mmc_test.c    |    1 -
 include/linux/mm.h             |    2 ++
 include/linux/swap.h           |    1 -
 net/ipv4/tcp.c                 |    1 -
 net/ipv4/udp.c                 |    1 -
 net/netfilter/ipvs/ip_vs_ctl.c |    1 -
 net/openvswitch/meter.c        |    1 -
 net/sctp/protocol.c            |    1 -
 8 files changed, 2 insertions(+), 7 deletions(-)

--- a/drivers/mmc/core/mmc_test.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/drivers/mmc/core/mmc_test.c
@@ -10,7 +10,6 @@
 #include <linux/slab.h>
 
 #include <linux/scatterlist.h>
-#include <linux/swap.h>		/* For nr_free_buffer_pages() */
 #include <linux/list.h>
 
 #include <linux/debugfs.h>
--- a/include/linux/mm.h~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/include/linux/mm.h
@@ -875,6 +875,8 @@ void put_pages_list(struct list_head *pa
 void split_page(struct page *page, unsigned int order);
 void copy_huge_page(struct page *dst, struct page *src);
 
+unsigned long nr_free_buffer_pages(void);
+
 /*
  * Compound pages have a destructor function.  Provide a
  * prototype for that function and accessor functions.
--- a/include/linux/swap.h~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/include/linux/swap.h
@@ -335,7 +335,6 @@ void workingset_update_node(struct xa_no
 
 /* linux/mm/page_alloc.c */
 extern unsigned long totalreserve_pages;
-extern unsigned long nr_free_buffer_pages(void);
 
 /* Definition of global_zone_page_state not available yet */
 #define nr_free_pages() global_zone_page_state(NR_FREE_PAGES)
--- a/net/ipv4/tcp.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/net/ipv4/tcp.c
@@ -260,7 +260,6 @@
 #include <linux/random.h>
 #include <linux/memblock.h>
 #include <linux/highmem.h>
-#include <linux/swap.h>
 #include <linux/cache.h>
 #include <linux/err.h>
 #include <linux/time.h>
--- a/net/ipv4/udp.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/net/ipv4/udp.c
@@ -78,7 +78,6 @@
 #include <asm/ioctls.h>
 #include <linux/memblock.h>
 #include <linux/highmem.h>
-#include <linux/swap.h>
 #include <linux/types.h>
 #include <linux/fcntl.h>
 #include <linux/module.h>
--- a/net/netfilter/ipvs/ip_vs_ctl.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/net/netfilter/ipvs/ip_vs_ctl.c
@@ -24,7 +24,6 @@
 #include <linux/sysctl.h>
 #include <linux/proc_fs.h>
 #include <linux/workqueue.h>
-#include <linux/swap.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
 
--- a/net/openvswitch/meter.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/net/openvswitch/meter.c
@@ -12,7 +12,6 @@
 #include <linux/openvswitch.h>
 #include <linux/netlink.h>
 #include <linux/rculist.h>
-#include <linux/swap.h>
 
 #include <net/netlink.h>
 #include <net/genetlink.h>
--- a/net/sctp/protocol.c~include-linux-mmh-move-nr_free_buffer_pages-from-swaph-to-mmh
+++ a/net/sctp/protocol.c
@@ -33,7 +33,6 @@
 #include <linux/seq_file.h>
 #include <linux/memblock.h>
 #include <linux/highmem.h>
-#include <linux/swap.h>
 #include <linux/slab.h>
 #include <net/net_namespace.h>
 #include <net/protocol.h>
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 205/262] stacktrace: move filter_irq_stacks() to kernel/stacktrace.c
  2021-11-05 20:34 incoming Andrew Morton
                   ` (203 preceding siblings ...)
  2021-11-05 20:45 ` [patch 204/262] include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 206/262] kfence: count unexpectedly skipped allocations Andrew Morton
                   ` (56 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: stacktrace: move filter_irq_stacks() to kernel/stacktrace.c

filter_irq_stacks() has little to do with the stackdepot implementation,
except that it is usually used by users (such as KASAN) of stackdepot to
reduce the stack trace.

However, filter_irq_stacks() itself is not useful without a stack trace as
obtained by stack_trace_save() and friends.

Therefore, move filter_irq_stacks() to kernel/stacktrace.c, so that new
users of filter_irq_stacks() do not have to start depending on STACKDEPOT
only for filter_irq_stacks().
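
A minimal sketch of the typical caller pattern that motivates the move (the
buffer size here is an arbitrary choice):

        unsigned long entries[64];
        unsigned int nr_entries;

        nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 0);
        /* Truncate at the first IRQ frame before hashing/storing the trace. */
        nr_entries = filter_irq_stacks(entries, nr_entries);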

Link: https://lkml.kernel.org/r/20210923104803.2620285-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/stackdepot.h |    2 --
 include/linux/stacktrace.h |    1 +
 kernel/stacktrace.c        |   30 ++++++++++++++++++++++++++++++
 lib/stackdepot.c           |   24 ------------------------
 4 files changed, 31 insertions(+), 26 deletions(-)

--- a/include/linux/stackdepot.h~stacktrace-move-filter_irq_stacks-to-kernel-stacktracec
+++ a/include/linux/stackdepot.h
@@ -25,8 +25,6 @@ depot_stack_handle_t stack_depot_save(un
 unsigned int stack_depot_fetch(depot_stack_handle_t handle,
 			       unsigned long **entries);
 
-unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries);
-
 #ifdef CONFIG_STACKDEPOT
 int stack_depot_init(void);
 #else
--- a/include/linux/stacktrace.h~stacktrace-move-filter_irq_stacks-to-kernel-stacktracec
+++ a/include/linux/stacktrace.h
@@ -21,6 +21,7 @@ unsigned int stack_trace_save_tsk(struct
 unsigned int stack_trace_save_regs(struct pt_regs *regs, unsigned long *store,
 				   unsigned int size, unsigned int skipnr);
 unsigned int stack_trace_save_user(unsigned long *store, unsigned int size);
+unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries);
 
 /* Internal interfaces. Do not use in generic code */
 #ifdef CONFIG_ARCH_STACKWALK
--- a/kernel/stacktrace.c~stacktrace-move-filter_irq_stacks-to-kernel-stacktracec
+++ a/kernel/stacktrace.c
@@ -13,6 +13,7 @@
 #include <linux/export.h>
 #include <linux/kallsyms.h>
 #include <linux/stacktrace.h>
+#include <linux/interrupt.h>
 
 /**
  * stack_trace_print - Print the entries in the stack trace
@@ -373,3 +374,32 @@ unsigned int stack_trace_save_user(unsig
 #endif /* CONFIG_USER_STACKTRACE_SUPPORT */
 
 #endif /* !CONFIG_ARCH_STACKWALK */
+
+static inline bool in_irqentry_text(unsigned long ptr)
+{
+	return (ptr >= (unsigned long)&__irqentry_text_start &&
+		ptr < (unsigned long)&__irqentry_text_end) ||
+		(ptr >= (unsigned long)&__softirqentry_text_start &&
+		 ptr < (unsigned long)&__softirqentry_text_end);
+}
+
+/**
+ * filter_irq_stacks - Find first IRQ stack entry in trace
+ * @entries:	Pointer to stack trace array
+ * @nr_entries:	Number of entries in the storage array
+ *
+ * Return: Number of trace entries until IRQ stack starts.
+ */
+unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries)
+{
+	unsigned int i;
+
+	for (i = 0; i < nr_entries; i++) {
+		if (in_irqentry_text(entries[i])) {
+			/* Include the irqentry function into the stack. */
+			return i + 1;
+		}
+	}
+	return nr_entries;
+}
+EXPORT_SYMBOL_GPL(filter_irq_stacks);
--- a/lib/stackdepot.c~stacktrace-move-filter_irq_stacks-to-kernel-stacktracec
+++ a/lib/stackdepot.c
@@ -20,7 +20,6 @@
  */
 
 #include <linux/gfp.h>
-#include <linux/interrupt.h>
 #include <linux/jhash.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
@@ -371,26 +370,3 @@ depot_stack_handle_t stack_depot_save(un
 	return __stack_depot_save(entries, nr_entries, alloc_flags, true);
 }
 EXPORT_SYMBOL_GPL(stack_depot_save);
-
-static inline int in_irqentry_text(unsigned long ptr)
-{
-	return (ptr >= (unsigned long)&__irqentry_text_start &&
-		ptr < (unsigned long)&__irqentry_text_end) ||
-		(ptr >= (unsigned long)&__softirqentry_text_start &&
-		 ptr < (unsigned long)&__softirqentry_text_end);
-}
-
-unsigned int filter_irq_stacks(unsigned long *entries,
-					     unsigned int nr_entries)
-{
-	unsigned int i;
-
-	for (i = 0; i < nr_entries; i++) {
-		if (in_irqentry_text(entries[i])) {
-			/* Include the irqentry function into the stack. */
-			return i + 1;
-		}
-	}
-	return nr_entries;
-}
-EXPORT_SYMBOL_GPL(filter_irq_stacks);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 206/262] kfence: count unexpectedly skipped allocations
  2021-11-05 20:34 incoming Andrew Morton
                   ` (204 preceding siblings ...)
  2021-11-05 20:45 ` [patch 205/262] stacktrace: move filter_irq_stacks() to kernel/stacktrace.c Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 207/262] kfence: move saving stack trace of allocations into __kfence_alloc() Andrew Morton
                   ` (55 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: count unexpectedly skipped allocations

Maintain a counter to count allocations that are skipped due to being
incompatible (oversized, incompatible gfp flags) or due to a lack of
capacity.

This is to compute the fraction of allocations that could not be serviced
by KFENCE, which we expect to be rare.

Link: https://lkml.kernel.org/r/20210923104803.2620285-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/core.c |   16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

--- a/mm/kfence/core.c~kfence-count-unexpectedly-skipped-allocations
+++ a/mm/kfence/core.c
@@ -112,6 +112,8 @@ enum kfence_counter_id {
 	KFENCE_COUNTER_FREES,
 	KFENCE_COUNTER_ZOMBIES,
 	KFENCE_COUNTER_BUGS,
+	KFENCE_COUNTER_SKIP_INCOMPAT,
+	KFENCE_COUNTER_SKIP_CAPACITY,
 	KFENCE_COUNTER_COUNT,
 };
 static atomic_long_t counters[KFENCE_COUNTER_COUNT];
@@ -121,6 +123,8 @@ static const char *const counter_names[]
 	[KFENCE_COUNTER_FREES]		= "total frees",
 	[KFENCE_COUNTER_ZOMBIES]	= "zombie allocations",
 	[KFENCE_COUNTER_BUGS]		= "total bugs",
+	[KFENCE_COUNTER_SKIP_INCOMPAT]	= "skipped allocations (incompatible)",
+	[KFENCE_COUNTER_SKIP_CAPACITY]	= "skipped allocations (capacity)",
 };
 static_assert(ARRAY_SIZE(counter_names) == KFENCE_COUNTER_COUNT);
 
@@ -271,8 +275,10 @@ static void *kfence_guarded_alloc(struct
 		list_del_init(&meta->list);
 	}
 	raw_spin_unlock_irqrestore(&kfence_freelist_lock, flags);
-	if (!meta)
+	if (!meta) {
+		atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_CAPACITY]);
 		return NULL;
+	}
 
 	if (unlikely(!raw_spin_trylock_irqsave(&meta->lock, flags))) {
 		/*
@@ -740,8 +746,10 @@ void *__kfence_alloc(struct kmem_cache *
 	 * Perform size check before switching kfence_allocation_gate, so that
 	 * we don't disable KFENCE without making an allocation.
 	 */
-	if (size > PAGE_SIZE)
+	if (size > PAGE_SIZE) {
+		atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
 		return NULL;
+	}
 
 	/*
 	 * Skip allocations from non-default zones, including DMA. We cannot
@@ -749,8 +757,10 @@ void *__kfence_alloc(struct kmem_cache *
 	 * properties (e.g. reside in DMAable memory).
 	 */
 	if ((flags & GFP_ZONEMASK) ||
-	    (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32)))
+	    (s->flags & (SLAB_CACHE_DMA | SLAB_CACHE_DMA32))) {
+		atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_INCOMPAT]);
 		return NULL;
+	}
 
 	/*
 	 * allocation_gate only needs to become non-zero, so it doesn't make
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 207/262] kfence: move saving stack trace of allocations into __kfence_alloc()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (205 preceding siblings ...)
  2021-11-05 20:45 ` [patch 206/262] kfence: count unexpectedly skipped allocations Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 208/262] kfence: limit currently covered allocations when pool nearly full Andrew Morton
                   ` (54 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: move saving stack trace of allocations into __kfence_alloc()

Move the saving of the stack trace of allocations into __kfence_alloc(),
so that the stack entries array can be used outside of
kfence_guarded_alloc() and we avoid potentially unwinding the stack
multiple times.
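
As a rough user-space analogy of the resulting pattern (glibc's backtrace()
stands in for stack_trace_save(); all names below are illustrative, not the
kernel API), the caller unwinds once and hands the saved entries down, while
the callee keeps a fallback unwind for callers that pass none:

	#include <execinfo.h>
	#include <stdio.h>
	#include <string.h>

	#define STACK_DEPTH 16

	struct track {
		void *entries[STACK_DEPTH];
		int num;
	};

	/* Callee: reuse a caller-provided trace if given, else unwind here. */
	static void record_alloc(struct track *t, void *const *entries, int num)
	{
		if (entries)
			memcpy(t->entries, entries, num * sizeof(*entries));
		else	/* fallback path, analogous to the free case */
			num = backtrace(t->entries, STACK_DEPTH);
		t->num = num;
	}

	int main(void)
	{
		void *entries[STACK_DEPTH];
		int num = backtrace(entries, STACK_DEPTH);	/* unwind once */
		struct track t;

		/* The same entries can also be hashed, filtered, reported, ... */
		record_alloc(&t, entries, num);
		printf("recorded %d frames\n", t.num);
		return 0;
	}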

Link: https://lkml.kernel.org/r/20210923104803.2620285-3-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/core.c |   35 ++++++++++++++++++++++++-----------
 1 file changed, 24 insertions(+), 11 deletions(-)

--- a/mm/kfence/core.c~kfence-move-saving-stack-trace-of-allocations-into-__kfence_alloc
+++ a/mm/kfence/core.c
@@ -187,19 +187,26 @@ static inline unsigned long metadata_to_
  * Update the object's metadata state, including updating the alloc/free stacks
  * depending on the state transition.
  */
-static noinline void metadata_update_state(struct kfence_metadata *meta,
-					   enum kfence_object_state next)
+static noinline void
+metadata_update_state(struct kfence_metadata *meta, enum kfence_object_state next,
+		      unsigned long *stack_entries, size_t num_stack_entries)
 {
 	struct kfence_track *track =
 		next == KFENCE_OBJECT_FREED ? &meta->free_track : &meta->alloc_track;
 
 	lockdep_assert_held(&meta->lock);
 
-	/*
-	 * Skip over 1 (this) functions; noinline ensures we do not accidentally
-	 * skip over the caller by never inlining.
-	 */
-	track->num_stack_entries = stack_trace_save(track->stack_entries, KFENCE_STACK_DEPTH, 1);
+	if (stack_entries) {
+		memcpy(track->stack_entries, stack_entries,
+		       num_stack_entries * sizeof(stack_entries[0]));
+	} else {
+		/*
+		 * Skip over 1 (this) functions; noinline ensures we do not
+		 * accidentally skip over the caller by never inlining.
+		 */
+		num_stack_entries = stack_trace_save(track->stack_entries, KFENCE_STACK_DEPTH, 1);
+	}
+	track->num_stack_entries = num_stack_entries;
 	track->pid = task_pid_nr(current);
 	track->cpu = raw_smp_processor_id();
 	track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
@@ -261,7 +268,8 @@ static __always_inline void for_each_can
 	}
 }
 
-static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp)
+static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp,
+				  unsigned long *stack_entries, size_t num_stack_entries)
 {
 	struct kfence_metadata *meta = NULL;
 	unsigned long flags;
@@ -320,7 +328,7 @@ static void *kfence_guarded_alloc(struct
 	addr = (void *)meta->addr;
 
 	/* Update remaining metadata. */
-	metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED);
+	metadata_update_state(meta, KFENCE_OBJECT_ALLOCATED, stack_entries, num_stack_entries);
 	/* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
 	WRITE_ONCE(meta->cache, cache);
 	meta->size = size;
@@ -400,7 +408,7 @@ static void kfence_guarded_free(void *ad
 		memzero_explicit(addr, meta->size);
 
 	/* Mark the object as freed. */
-	metadata_update_state(meta, KFENCE_OBJECT_FREED);
+	metadata_update_state(meta, KFENCE_OBJECT_FREED, NULL, 0);
 
 	raw_spin_unlock_irqrestore(&meta->lock, flags);
 
@@ -742,6 +750,9 @@ void kfence_shutdown_cache(struct kmem_c
 
 void *__kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
 {
+	unsigned long stack_entries[KFENCE_STACK_DEPTH];
+	size_t num_stack_entries;
+
 	/*
 	 * Perform size check before switching kfence_allocation_gate, so that
 	 * we don't disable KFENCE without making an allocation.
@@ -786,7 +797,9 @@ void *__kfence_alloc(struct kmem_cache *
 	if (!READ_ONCE(kfence_enabled))
 		return NULL;
 
-	return kfence_guarded_alloc(s, size, flags);
+	num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 0);
+
+	return kfence_guarded_alloc(s, size, flags, stack_entries, num_stack_entries);
 }
 
 size_t kfence_ksize(const void *addr)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 208/262] kfence: limit currently covered allocations when pool nearly full
  2021-11-05 20:34 incoming Andrew Morton
                   ` (206 preceding siblings ...)
  2021-11-05 20:45 ` [patch 207/262] kfence: move saving stack trace of allocations into __kfence_alloc() Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 209/262] kfence: add note to documentation about skipping covered allocations Andrew Morton
                   ` (53 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: limit currently covered allocations when pool nearly full

One of KFENCE's main design principles is that with increasing uptime,
allocation coverage increases sufficiently to detect previously undetected
bugs.

We have observed that frequent long-lived allocations of the same source
(e.g.  pagecache) tend to permanently fill up the KFENCE pool with
increasing system uptime, thus breaking the above requirement.  The
workaround thus far had been increasing the sample interval and/or
increasing the KFENCE pool size, but neither is a reliable solution.

To ensure diverse coverage of allocations, limit currently covered
allocations of the same source once pool utilization reaches 75%
(configurable via `kfence.skip_covered_thresh`) or above.  The effect is
retaining reasonable allocation coverage when the pool is close to full.

A side-effect is that this also limits frequent long-lived allocations of
the same source filling up the pool permanently.

Uniqueness of an allocation for coverage purposes is based on its
(partial) allocation stack trace (the source).  A Counting Bloom filter is
used to check if an allocation is covered; if the allocation is currently
covered, the allocation is skipped by KFENCE.
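
The Counting Bloom filter itself is simple; the following stand-alone
user-space sketch (illustrative table size and a hash_32()-style
multiplicative hash, not the kernel code) shows the add/contains operations
performed on allocation and free:

	#include <stdint.h>
	#include <stdio.h>

	#define COVERED_ORDER	8			/* 256 counters */
	#define COVERED_SIZE	(1u << COVERED_ORDER)
	#define COVERED_MASK	(COVERED_SIZE - 1)
	#define COVERED_HNUM	2			/* hash functions per key */

	static int covered[COVERED_SIZE];

	/* Derive the next table index from the previous hash value. */
	static uint32_t next_hash(uint32_t h)
	{
		return (h * 0x61C88647u) >> (32 - COVERED_ORDER);
	}

	static void covered_add(uint32_t hash, int val)
	{
		for (int i = 0; i < COVERED_HNUM; i++) {
			covered[hash & COVERED_MASK] += val;
			hash = next_hash(hash);
		}
	}

	static int covered_contains(uint32_t hash)
	{
		for (int i = 0; i < COVERED_HNUM; i++) {
			if (!covered[hash & COVERED_MASK])
				return 0;
			hash = next_hash(hash);
		}
		return 1;	/* possibly a false positive */
	}

	int main(void)
	{
		covered_add(0xdeadbeef, 1);			/* "allocate" */
		printf("%d\n", covered_contains(0xdeadbeef));	/* 1 */
		covered_add(0xdeadbeef, -1);			/* "free" */
		printf("%d\n", covered_contains(0xdeadbeef));	/* 0 */
		return 0;
	}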

Testing was done using:

	(a) a synthetic workload that performs frequent long-lived
	    allocations (default config values; sample_interval=1;
	    num_objects=63), and

	(b) normal desktop workloads on an otherwise idle machine where
	    the problem was first reported after a few days of uptime
	    (default config values).

In both test cases the sampled allocation rate no longer drops to zero at
any point.  In the case of (b) we observe (after 2 days uptime) 15% unique
allocations in the pool, 77% pool utilization, with 20% "skipped
allocations (covered)".

[elver@google.com: simplify and just use hash_32(), use more random stack_hash_seed]
  Link: https://lkml.kernel.org/r/YU3MRGaCaJiYht5g@elver.google.com
[elver@google.com: fix 32 bit]
Link: https://lkml.kernel.org/r/20210923104803.2620285-4-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/core.c   |  109 ++++++++++++++++++++++++++++++++++++++++++-
 mm/kfence/kfence.h |    2 
 2 files changed, 109 insertions(+), 2 deletions(-)

--- a/mm/kfence/core.c~kfence-limit-currently-covered-allocations-when-pool-nearly-full
+++ a/mm/kfence/core.c
@@ -10,12 +10,15 @@
 #include <linux/atomic.h>
 #include <linux/bug.h>
 #include <linux/debugfs.h>
+#include <linux/hash.h>
 #include <linux/irq_work.h>
+#include <linux/jhash.h>
 #include <linux/kcsan-checks.h>
 #include <linux/kfence.h>
 #include <linux/kmemleak.h>
 #include <linux/list.h>
 #include <linux/lockdep.h>
+#include <linux/log2.h>
 #include <linux/memblock.h>
 #include <linux/moduleparam.h>
 #include <linux/random.h>
@@ -82,6 +85,10 @@ static const struct kernel_param_ops sam
 };
 module_param_cb(sample_interval, &sample_interval_param_ops, &kfence_sample_interval, 0600);
 
+/* Pool usage% threshold when currently covered allocations are skipped. */
+static unsigned long kfence_skip_covered_thresh __read_mostly = 75;
+module_param_named(skip_covered_thresh, kfence_skip_covered_thresh, ulong, 0644);
+
 /* The pool of pages used for guard pages and objects. */
 char *__kfence_pool __ro_after_init;
 EXPORT_SYMBOL(__kfence_pool); /* Export for test modules. */
@@ -105,6 +112,32 @@ DEFINE_STATIC_KEY_FALSE(kfence_allocatio
 /* Gates the allocation, ensuring only one succeeds in a given period. */
 atomic_t kfence_allocation_gate = ATOMIC_INIT(1);
 
+/*
+ * A Counting Bloom filter of allocation coverage: limits currently covered
+ * allocations of the same source filling up the pool.
+ *
+ * Assuming a range of 15%-85% unique allocations in the pool at any point in
+ * time, the below parameters provide a probability of 0.02-0.33 for false
+ * positive hits respectively:
+ *
+ *	P(alloc_traces) = (1 - e^(-HNUM * (alloc_traces / SIZE)))^HNUM
+ */
+#define ALLOC_COVERED_HNUM	2
+#define ALLOC_COVERED_ORDER	(const_ilog2(CONFIG_KFENCE_NUM_OBJECTS) + 2)
+#define ALLOC_COVERED_SIZE	(1 << ALLOC_COVERED_ORDER)
+#define ALLOC_COVERED_HNEXT(h)	hash_32(h, ALLOC_COVERED_ORDER)
+#define ALLOC_COVERED_MASK	(ALLOC_COVERED_SIZE - 1)
+static atomic_t alloc_covered[ALLOC_COVERED_SIZE];
+
+/* Stack depth used to determine uniqueness of an allocation. */
+#define UNIQUE_ALLOC_STACK_DEPTH ((size_t)8)
+
+/*
+ * Randomness for stack hashes, making the same collisions across reboots and
+ * different machines less likely.
+ */
+static u32 stack_hash_seed __ro_after_init;
+
 /* Statistics counters for debugfs. */
 enum kfence_counter_id {
 	KFENCE_COUNTER_ALLOCATED,
@@ -114,6 +147,7 @@ enum kfence_counter_id {
 	KFENCE_COUNTER_BUGS,
 	KFENCE_COUNTER_SKIP_INCOMPAT,
 	KFENCE_COUNTER_SKIP_CAPACITY,
+	KFENCE_COUNTER_SKIP_COVERED,
 	KFENCE_COUNTER_COUNT,
 };
 static atomic_long_t counters[KFENCE_COUNTER_COUNT];
@@ -125,11 +159,57 @@ static const char *const counter_names[]
 	[KFENCE_COUNTER_BUGS]		= "total bugs",
 	[KFENCE_COUNTER_SKIP_INCOMPAT]	= "skipped allocations (incompatible)",
 	[KFENCE_COUNTER_SKIP_CAPACITY]	= "skipped allocations (capacity)",
+	[KFENCE_COUNTER_SKIP_COVERED]	= "skipped allocations (covered)",
 };
 static_assert(ARRAY_SIZE(counter_names) == KFENCE_COUNTER_COUNT);
 
 /* === Internals ============================================================ */
 
+static inline bool should_skip_covered(void)
+{
+	unsigned long thresh = (CONFIG_KFENCE_NUM_OBJECTS * kfence_skip_covered_thresh) / 100;
+
+	return atomic_long_read(&counters[KFENCE_COUNTER_ALLOCATED]) > thresh;
+}
+
+static u32 get_alloc_stack_hash(unsigned long *stack_entries, size_t num_entries)
+{
+	num_entries = min(num_entries, UNIQUE_ALLOC_STACK_DEPTH);
+	num_entries = filter_irq_stacks(stack_entries, num_entries);
+	return jhash(stack_entries, num_entries * sizeof(stack_entries[0]), stack_hash_seed);
+}
+
+/*
+ * Adds (or subtracts) count @val for allocation stack trace hash
+ * @alloc_stack_hash from Counting Bloom filter.
+ */
+static void alloc_covered_add(u32 alloc_stack_hash, int val)
+{
+	int i;
+
+	for (i = 0; i < ALLOC_COVERED_HNUM; i++) {
+		atomic_add(val, &alloc_covered[alloc_stack_hash & ALLOC_COVERED_MASK]);
+		alloc_stack_hash = ALLOC_COVERED_HNEXT(alloc_stack_hash);
+	}
+}
+
+/*
+ * Returns true if the allocation stack trace hash @alloc_stack_hash is
+ * currently contained (non-zero count) in Counting Bloom filter.
+ */
+static bool alloc_covered_contains(u32 alloc_stack_hash)
+{
+	int i;
+
+	for (i = 0; i < ALLOC_COVERED_HNUM; i++) {
+		if (!atomic_read(&alloc_covered[alloc_stack_hash & ALLOC_COVERED_MASK]))
+			return false;
+		alloc_stack_hash = ALLOC_COVERED_HNEXT(alloc_stack_hash);
+	}
+
+	return true;
+}
+
 static bool kfence_protect(unsigned long addr)
 {
 	return !KFENCE_WARN_ON(!kfence_protect_page(ALIGN_DOWN(addr, PAGE_SIZE), true));
@@ -269,7 +349,8 @@ static __always_inline void for_each_can
 }
 
 static void *kfence_guarded_alloc(struct kmem_cache *cache, size_t size, gfp_t gfp,
-				  unsigned long *stack_entries, size_t num_stack_entries)
+				  unsigned long *stack_entries, size_t num_stack_entries,
+				  u32 alloc_stack_hash)
 {
 	struct kfence_metadata *meta = NULL;
 	unsigned long flags;
@@ -332,6 +413,8 @@ static void *kfence_guarded_alloc(struct
 	/* Pairs with READ_ONCE() in kfence_shutdown_cache(). */
 	WRITE_ONCE(meta->cache, cache);
 	meta->size = size;
+	meta->alloc_stack_hash = alloc_stack_hash;
+
 	for_each_canary(meta, set_canary_byte);
 
 	/* Set required struct page fields. */
@@ -344,6 +427,8 @@ static void *kfence_guarded_alloc(struct
 
 	raw_spin_unlock_irqrestore(&meta->lock, flags);
 
+	alloc_covered_add(alloc_stack_hash, 1);
+
 	/* Memory initialization. */
 
 	/*
@@ -412,6 +497,8 @@ static void kfence_guarded_free(void *ad
 
 	raw_spin_unlock_irqrestore(&meta->lock, flags);
 
+	alloc_covered_add(meta->alloc_stack_hash, -1);
+
 	/* Protect to detect use-after-frees. */
 	kfence_protect((unsigned long)addr);
 
@@ -677,6 +764,7 @@ void __init kfence_init(void)
 	if (!kfence_sample_interval)
 		return;
 
+	stack_hash_seed = (u32)random_get_entropy();
 	if (!kfence_init_pool()) {
 		pr_err("%s failed\n", __func__);
 		return;
@@ -752,6 +840,7 @@ void *__kfence_alloc(struct kmem_cache *
 {
 	unsigned long stack_entries[KFENCE_STACK_DEPTH];
 	size_t num_stack_entries;
+	u32 alloc_stack_hash;
 
 	/*
 	 * Perform size check before switching kfence_allocation_gate, so that
@@ -799,7 +888,23 @@ void *__kfence_alloc(struct kmem_cache *
 
 	num_stack_entries = stack_trace_save(stack_entries, KFENCE_STACK_DEPTH, 0);
 
-	return kfence_guarded_alloc(s, size, flags, stack_entries, num_stack_entries);
+	/*
+	 * Do expensive check for coverage of allocation in slow-path after
+	 * allocation_gate has already become non-zero, even though it might
+	 * mean not making any allocation within a given sample interval.
+	 *
+	 * This ensures reasonable allocation coverage when the pool is almost
+	 * full, including avoiding long-lived allocations of the same source
+	 * filling up the pool (e.g. pagecache allocations).
+	 */
+	alloc_stack_hash = get_alloc_stack_hash(stack_entries, num_stack_entries);
+	if (should_skip_covered() && alloc_covered_contains(alloc_stack_hash)) {
+		atomic_long_inc(&counters[KFENCE_COUNTER_SKIP_COVERED]);
+		return NULL;
+	}
+
+	return kfence_guarded_alloc(s, size, flags, stack_entries, num_stack_entries,
+				    alloc_stack_hash);
 }
 
 size_t kfence_ksize(const void *addr)
--- a/mm/kfence/kfence.h~kfence-limit-currently-covered-allocations-when-pool-nearly-full
+++ a/mm/kfence/kfence.h
@@ -87,6 +87,8 @@ struct kfence_metadata {
 	/* Allocation and free stack information. */
 	struct kfence_track alloc_track;
 	struct kfence_track free_track;
+	/* For updating alloc_covered on frees. */
+	u32 alloc_stack_hash;
 };
 
 extern struct kfence_metadata kfence_metadata[CONFIG_KFENCE_NUM_OBJECTS];
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 209/262] kfence: add note to documentation about skipping covered allocations
  2021-11-05 20:34 incoming Andrew Morton
                   ` (207 preceding siblings ...)
  2021-11-05 20:45 ` [patch 208/262] kfence: limit currently covered allocations when pool nearly full Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 210/262] kfence: test: use kunit_skip() to skip tests Andrew Morton
                   ` (52 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: add note to documentation about skipping covered allocations

Add a note briefly mentioning the new policy about "skipping currently
covered allocations if the pool is close to full".  Since this has a notable
impact on KFENCE's bug-detection ability on systems with large uptimes, it
is worth pointing out the feature.

Link: https://lkml.kernel.org/r/20210923104803.2620285-5-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kfence.rst |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/Documentation/dev-tools/kfence.rst~kfence-add-note-to-documentation-about-skipping-covered-allocations
+++ a/Documentation/dev-tools/kfence.rst
@@ -269,6 +269,17 @@ tail of KFENCE's freelist, so that the l
 first, and the chances of detecting use-after-frees of recently freed objects
 is increased.
 
+If pool utilization reaches 75% (default) or above, to reduce the risk of the
+pool eventually being fully occupied by allocated objects yet ensure diverse
+coverage of allocations, KFENCE limits currently covered allocations of the
+same source from further filling up the pool. The "source" of an allocation is
+based on its partial allocation stack trace. A side-effect is that this also
+limits frequent long-lived allocations (e.g. pagecache) of the same source
+filling up the pool permanently, which is the most common risk for the pool
+becoming full and the sampled allocation rate dropping to zero. The threshold
+at which to start limiting currently covered allocations can be configured via
+the boot parameter ``kfence.skip_covered_thresh`` (pool usage%).
+
 Interface
 ---------
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 210/262] kfence: test: use kunit_skip() to skip tests
  2021-11-05 20:34 incoming Andrew Morton
                   ` (208 preceding siblings ...)
  2021-11-05 20:45 ` [patch 209/262] kfence: add note to documentation about skipping covered allocations Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 211/262] kfence: shorten critical sections of alloc/free Andrew Morton
                   ` (51 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, davidgow, dvyukov, elver, glider, linux-mm, mm-commits,
	nogikh, tarasmadan, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: test: use kunit_skip() to skip tests

Use the new kunit_skip() to skip tests if requirements are not met.  This
makes it easier to see in KUnit's summary whether tests were skipped.

Link: https://lkml.kernel.org/r/20210922182541.1372400-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Aleksandr Nogikh <nogikh@google.com>
Cc: Taras Madan <tarasmadan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/kfence_test.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

--- a/mm/kfence/kfence_test.c~kfence-test-use-kunit_skip-to-skip-tests
+++ a/mm/kfence/kfence_test.c
@@ -32,6 +32,11 @@
 #define arch_kfence_test_address(addr) (addr)
 #endif
 
+#define KFENCE_TEST_REQUIRES(test, cond) do {			\
+	if (!(cond))						\
+		kunit_skip((test), "Test requires: " #cond);	\
+} while (0)
+
 /* Report as observed from console. */
 static struct {
 	spinlock_t lock;
@@ -555,8 +560,7 @@ static void test_init_on_free(struct kun
 	};
 	int i;
 
-	if (!IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON))
-		return;
+	KFENCE_TEST_REQUIRES(test, IS_ENABLED(CONFIG_INIT_ON_FREE_DEFAULT_ON));
 	/* Assume it hasn't been disabled on command line. */
 
 	setup_test_cache(test, size, 0, NULL);
@@ -603,10 +607,8 @@ static void test_gfpzero(struct kunit *t
 	char *buf1, *buf2;
 	int i;
 
-	if (CONFIG_KFENCE_SAMPLE_INTERVAL > 100) {
-		kunit_warn(test, "skipping ... would take too long\n");
-		return;
-	}
+	/* Skip if we think it'd take too long. */
+	KFENCE_TEST_REQUIRES(test, CONFIG_KFENCE_SAMPLE_INTERVAL <= 100);
 
 	setup_test_cache(test, size, 0, NULL);
 	buf1 = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 211/262] kfence: shorten critical sections of alloc/free
  2021-11-05 20:34 incoming Andrew Morton
                   ` (209 preceding siblings ...)
  2021-11-05 20:45 ` [patch 210/262] kfence: test: use kunit_skip() to skip tests Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 212/262] kfence: always use static branches to guard kfence_alloc() Andrew Morton
                   ` (50 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: shorten critical sections of alloc/free

Initializing memory and setting/checking the canary bytes is relatively
expensive, and doing so in the meta->lock critical sections unnecessarily
extends the time spent with preemption and interrupts disabled.

Reads of meta->addr and meta->size in kfence_guarded_alloc() and
kfence_guarded_free() don't require holding meta->lock as long as the
object has been removed from the freelist: only kfence_guarded_alloc()
sets meta->addr and meta->size, and only after removing the object from
the freelist, which in turn requires a preceding kfence_guarded_free()
returning it to the list or the initial state.

Therefore, move reads of meta->addr and meta->size, including the
expensive memory initialization that uses them, out of the meta->lock
critical sections.
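
The pattern is the usual "snapshot under the lock, do the expensive work
after unlocking"; a small pthread-based sketch of the free path (an analogy
with made-up names, not the kernel code):

	#include <pthread.h>
	#include <stdlib.h>
	#include <string.h>

	struct object {
		pthread_mutex_t lock;
		void *addr;		/* stable while the object is allocated */
		size_t size;
		int want_init;
	};

	static void object_free(struct object *o)
	{
		void *addr;
		size_t size;
		int init;

		pthread_mutex_lock(&o->lock);
		/* Only state transitions and field snapshots under the lock. */
		addr = o->addr;
		size = o->size;
		init = o->want_init;
		/* ...mark the object freed, update metadata... */
		pthread_mutex_unlock(&o->lock);

		/* Expensive memory touching runs with the lock dropped. */
		if (init)
			memset(addr, 0, size);
	}

	int main(void)
	{
		struct object o = {
			.lock = PTHREAD_MUTEX_INITIALIZER,
			.addr = malloc(64), .size = 64, .want_init = 1,
		};

		object_free(&o);
		free(o.addr);
		return 0;
	}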

Link: https://lkml.kernel.org/r/20210930153706.2105471-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/core.c |   38 +++++++++++++++++++++-----------------
 1 file changed, 21 insertions(+), 17 deletions(-)

--- a/mm/kfence/core.c~kfence-shorten-critical-sections-of-alloc-free
+++ a/mm/kfence/core.c
@@ -309,12 +309,19 @@ static inline bool set_canary_byte(u8 *a
 /* Check canary byte at @addr. */
 static inline bool check_canary_byte(u8 *addr)
 {
+	struct kfence_metadata *meta;
+	unsigned long flags;
+
 	if (likely(*addr == KFENCE_CANARY_PATTERN(addr)))
 		return true;
 
 	atomic_long_inc(&counters[KFENCE_COUNTER_BUGS]);
-	kfence_report_error((unsigned long)addr, false, NULL, addr_to_metadata((unsigned long)addr),
-			    KFENCE_ERROR_CORRUPTION);
+
+	meta = addr_to_metadata((unsigned long)addr);
+	raw_spin_lock_irqsave(&meta->lock, flags);
+	kfence_report_error((unsigned long)addr, false, NULL, meta, KFENCE_ERROR_CORRUPTION);
+	raw_spin_unlock_irqrestore(&meta->lock, flags);
+
 	return false;
 }
 
@@ -324,8 +331,6 @@ static __always_inline void for_each_can
 	const unsigned long pageaddr = ALIGN_DOWN(meta->addr, PAGE_SIZE);
 	unsigned long addr;
 
-	lockdep_assert_held(&meta->lock);
-
 	/*
 	 * We'll iterate over each canary byte per-side until fn() returns
 	 * false. However, we'll still iterate over the canary bytes to the
@@ -414,8 +419,9 @@ static void *kfence_guarded_alloc(struct
 	WRITE_ONCE(meta->cache, cache);
 	meta->size = size;
 	meta->alloc_stack_hash = alloc_stack_hash;
+	raw_spin_unlock_irqrestore(&meta->lock, flags);
 
-	for_each_canary(meta, set_canary_byte);
+	alloc_covered_add(alloc_stack_hash, 1);
 
 	/* Set required struct page fields. */
 	page = virt_to_page(meta->addr);
@@ -425,11 +431,8 @@ static void *kfence_guarded_alloc(struct
 	if (IS_ENABLED(CONFIG_SLAB))
 		page->s_mem = addr;
 
-	raw_spin_unlock_irqrestore(&meta->lock, flags);
-
-	alloc_covered_add(alloc_stack_hash, 1);
-
 	/* Memory initialization. */
+	for_each_canary(meta, set_canary_byte);
 
 	/*
 	 * We check slab_want_init_on_alloc() ourselves, rather than letting
@@ -454,6 +457,7 @@ static void kfence_guarded_free(void *ad
 {
 	struct kcsan_scoped_access assert_page_exclusive;
 	unsigned long flags;
+	bool init;
 
 	raw_spin_lock_irqsave(&meta->lock, flags);
 
@@ -481,6 +485,13 @@ static void kfence_guarded_free(void *ad
 		meta->unprotected_page = 0;
 	}
 
+	/* Mark the object as freed. */
+	metadata_update_state(meta, KFENCE_OBJECT_FREED, NULL, 0);
+	init = slab_want_init_on_free(meta->cache);
+	raw_spin_unlock_irqrestore(&meta->lock, flags);
+
+	alloc_covered_add(meta->alloc_stack_hash, -1);
+
 	/* Check canary bytes for memory corruption. */
 	for_each_canary(meta, check_canary_byte);
 
@@ -489,16 +500,9 @@ static void kfence_guarded_free(void *ad
 	 * data is still there, and after a use-after-free is detected, we
 	 * unprotect the page, so the data is still accessible.
 	 */
-	if (!zombie && unlikely(slab_want_init_on_free(meta->cache)))
+	if (!zombie && unlikely(init))
 		memzero_explicit(addr, meta->size);
 
-	/* Mark the object as freed. */
-	metadata_update_state(meta, KFENCE_OBJECT_FREED, NULL, 0);
-
-	raw_spin_unlock_irqrestore(&meta->lock, flags);
-
-	alloc_covered_add(meta->alloc_stack_hash, -1);
-
 	/* Protect to detect use-after-frees. */
 	kfence_protect((unsigned long)addr);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 212/262] kfence: always use static branches to guard kfence_alloc()
  2021-11-05 20:34 incoming Andrew Morton
                   ` (210 preceding siblings ...)
  2021-11-05 20:45 ` [patch 211/262] kfence: shorten critical sections of alloc/free Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 213/262] kfence: default to dynamic branch instead of static keys mode Andrew Morton
                   ` (49 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: always use static branches to guard kfence_alloc()

Regardless of KFENCE mode (CONFIG_KFENCE_STATIC_KEYS: either using static
keys to gate allocations, or using a simple dynamic branch), always use a
static branch to avoid the dynamic branch in kfence_alloc() if KFENCE was
disabled at boot.

For CONFIG_KFENCE_STATIC_KEYS=n, this now avoids the dynamic branch if
KFENCE was disabled at boot.

To simplify, this also unifies the location where kfence_allocation_gate is
read and checked, placing the check inline in kfence_alloc().
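
Conceptually the fast path becomes "master enable, then cheap gate read,
then slow path".  A user-space sketch with C11 atomics (a plain flag stands
in for the static branch; all names are illustrative):

	#include <stdatomic.h>
	#include <stdio.h>
	#include <stdlib.h>

	static _Bool guard_enabled = 1;		/* stands in for the static branch */
	static atomic_int allocation_gate;	/* reset to 0 once per sample interval */

	static void *guarded_alloc_slow(size_t size)
	{
		/* Only the first caller after a gate reset wins. */
		if (atomic_fetch_add(&allocation_gate, 1) > 0)
			return NULL;
		return calloc(1, size);	/* placeholder for the guarded allocation */
	}

	static inline void *maybe_guarded_alloc(size_t size)
	{
		if (!guard_enabled)			/* static branch in the kernel */
			return NULL;
		if (atomic_load(&allocation_gate))	/* cheap read, usually non-zero */
			return NULL;
		return guarded_alloc_slow(size);
	}

	int main(void)
	{
		printf("%p %p\n", maybe_guarded_alloc(32), maybe_guarded_alloc(32));
		return 0;
	}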

Link: https://lkml.kernel.org/r/20211019102524.2807208-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/kfence.h |   21 +++++++++++----------
 mm/kfence/core.c       |   16 +++++++---------
 2 files changed, 18 insertions(+), 19 deletions(-)

--- a/include/linux/kfence.h~kfence-always-use-static-branches-to-guard-kfence_alloc
+++ a/include/linux/kfence.h
@@ -14,6 +14,9 @@
 
 #ifdef CONFIG_KFENCE
 
+#include <linux/atomic.h>
+#include <linux/static_key.h>
+
 /*
  * We allocate an even number of pages, as it simplifies calculations to map
  * address to metadata indices; effectively, the very first page serves as an
@@ -22,13 +25,8 @@
 #define KFENCE_POOL_SIZE ((CONFIG_KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)
 extern char *__kfence_pool;
 
-#ifdef CONFIG_KFENCE_STATIC_KEYS
-#include <linux/static_key.h>
 DECLARE_STATIC_KEY_FALSE(kfence_allocation_key);
-#else
-#include <linux/atomic.h>
 extern atomic_t kfence_allocation_gate;
-#endif
 
 /**
  * is_kfence_address() - check if an address belongs to KFENCE pool
@@ -116,13 +114,16 @@ void *__kfence_alloc(struct kmem_cache *
  */
 static __always_inline void *kfence_alloc(struct kmem_cache *s, size_t size, gfp_t flags)
 {
-#ifdef CONFIG_KFENCE_STATIC_KEYS
-	if (static_branch_unlikely(&kfence_allocation_key))
+#if defined(CONFIG_KFENCE_STATIC_KEYS) || CONFIG_KFENCE_SAMPLE_INTERVAL == 0
+	if (!static_branch_unlikely(&kfence_allocation_key))
+		return NULL;
 #else
-	if (unlikely(!atomic_read(&kfence_allocation_gate)))
+	if (!static_branch_likely(&kfence_allocation_key))
+		return NULL;
 #endif
-		return __kfence_alloc(s, size, flags);
-	return NULL;
+	if (likely(atomic_read(&kfence_allocation_gate)))
+		return NULL;
+	return __kfence_alloc(s, size, flags);
 }
 
 /**
--- a/mm/kfence/core.c~kfence-always-use-static-branches-to-guard-kfence_alloc
+++ a/mm/kfence/core.c
@@ -104,10 +104,11 @@ struct kfence_metadata kfence_metadata[C
 static struct list_head kfence_freelist = LIST_HEAD_INIT(kfence_freelist);
 static DEFINE_RAW_SPINLOCK(kfence_freelist_lock); /* Lock protecting freelist. */
 
-#ifdef CONFIG_KFENCE_STATIC_KEYS
-/* The static key to set up a KFENCE allocation. */
+/*
+ * The static key to set up a KFENCE allocation; or if static keys are not used
+ * to gate allocations, to avoid a load and compare if KFENCE is disabled.
+ */
 DEFINE_STATIC_KEY_FALSE(kfence_allocation_key);
-#endif
 
 /* Gates the allocation, ensuring only one succeeds in a given period. */
 atomic_t kfence_allocation_gate = ATOMIC_INIT(1);
@@ -774,6 +775,8 @@ void __init kfence_init(void)
 		return;
 	}
 
+	if (!IS_ENABLED(CONFIG_KFENCE_STATIC_KEYS))
+		static_branch_enable(&kfence_allocation_key);
 	WRITE_ONCE(kfence_enabled, true);
 	queue_delayed_work(system_unbound_wq, &kfence_timer, 0);
 	pr_info("initialized - using %lu bytes for %d objects at 0x%p-0x%p\n", KFENCE_POOL_SIZE,
@@ -866,12 +869,7 @@ void *__kfence_alloc(struct kmem_cache *
 		return NULL;
 	}
 
-	/*
-	 * allocation_gate only needs to become non-zero, so it doesn't make
-	 * sense to continue writing to it and pay the associated contention
-	 * cost, in case we have a large number of concurrent allocations.
-	 */
-	if (atomic_read(&kfence_allocation_gate) || atomic_inc_return(&kfence_allocation_gate) > 1)
+	if (atomic_inc_return(&kfence_allocation_gate) > 1)
 		return NULL;
 #ifdef CONFIG_KFENCE_STATIC_KEYS
 	/*
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 213/262] kfence: default to dynamic branch instead of static keys mode
  2021-11-05 20:34 incoming Andrew Morton
                   ` (211 preceding siblings ...)
  2021-11-05 20:45 ` [patch 212/262] kfence: always use static branches to guard kfence_alloc() Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 214/262] mm/damon: grammar s/works/work/ Andrew Morton
                   ` (48 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, jannh, linux-mm, mm-commits, torvalds

From: Marco Elver <elver@google.com>
Subject: kfence: default to dynamic branch instead of static keys mode

We have observed that on very large machines with newer CPUs, the static
key/branch switching delay is on the order of milliseconds.  This is due
to the required broadcast IPIs, which simply does not scale well to
hundreds of CPUs (cores).  If done too frequently, this can adversely
affect tail latencies of various workloads.

One workaround is to increase the sample interval to several seconds,
while decreasing sampled allocation coverage, but the problem still exists
and could still increase tail latencies.

As already noted in the Kconfig help text, there are trade-offs: at lower
sample intervals the dynamic branch results in better performance;
however, at very large sample intervals, the static keys mode can result
in better performance -- careful benchmarking is recommended.

Our initial benchmarking showed that with large enough sample intervals
and workloads stressing the allocator, the static keys mode was slightly
better.  Evaluating and observing the possible system-wide side-effects of
the static-key-switching induced broadcast IPIs, however, was a blind spot
(in particular on large machines with 100s of cores).

Therefore, a major downside of the static keys mode is, unfortunately,
that it is hard to predict performance on new system architectures and
topologies, and it is equally hard to draw conclusions about the
performance of new workloads from a limited set of benchmarks.

Most distributions will simply select the defaults, while targeting a
large variety of different workloads and system architectures.  As such,
the better default is CONFIG_KFENCE_STATIC_KEYS=n, and re-enabling it is
only recommended after careful evaluation.

For reference, on x86-64 the condition in kfence_alloc() generates exactly
2 instructions in the kmem_cache_alloc() fast-path:

 | ...
 | cmpl   $0x0,0x1a8021c(%rip)  # ffffffff82d560d0 <kfence_allocation_gate>
 | je     ffffffff812d6003      <kmem_cache_alloc+0x243>
 | ...

which, given kfence_allocation_gate is infrequently modified, should be
well predicted by most CPUs.

Link: https://lkml.kernel.org/r/20211019102524.2807208-2-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Jann Horn <jannh@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kfence.rst |   12 ++++++++----
 lib/Kconfig.kfence                 |   26 +++++++++++++++-----------
 2 files changed, 23 insertions(+), 15 deletions(-)

--- a/Documentation/dev-tools/kfence.rst~kfence-default-to-dynamic-branch-instead-of-static-keys-mode
+++ a/Documentation/dev-tools/kfence.rst
@@ -231,10 +231,14 @@ Guarded allocations are set up based on
 of the sample interval, the next allocation through the main allocator (SLAB or
 SLUB) returns a guarded allocation from the KFENCE object pool (allocation
 sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
-the next allocation is set up after the expiration of the interval. To "gate" a
-KFENCE allocation through the main allocator's fast-path without overhead,
-KFENCE relies on static branches via the static keys infrastructure. The static
-branch is toggled to redirect the allocation to KFENCE.
+the next allocation is set up after the expiration of the interval.
+
+When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
+through the main allocator's fast-path by relying on static branches via the
+static keys infrastructure. The static branch is toggled to redirect the
+allocation to KFENCE. Depending on sample interval, target workloads, and
+system architecture, this may perform better than the simple dynamic branch.
+Careful benchmarking is recommended.
 
 KFENCE objects each reside on a dedicated page, at either the left or right
 page boundaries selected at random. The pages to the left and right of the
--- a/lib/Kconfig.kfence~kfence-default-to-dynamic-branch-instead-of-static-keys-mode
+++ a/lib/Kconfig.kfence
@@ -25,17 +25,6 @@ menuconfig KFENCE
 
 if KFENCE
 
-config KFENCE_STATIC_KEYS
-	bool "Use static keys to set up allocations"
-	default y
-	depends on JUMP_LABEL # To ensure performance, require jump labels
-	help
-	  Use static keys (static branches) to set up KFENCE allocations. Using
-	  static keys is normally recommended, because it avoids a dynamic
-	  branch in the allocator's fast path. However, with very low sample
-	  intervals, or on systems that do not support jump labels, a dynamic
-	  branch may still be an acceptable performance trade-off.
-
 config KFENCE_SAMPLE_INTERVAL
 	int "Default sample interval in milliseconds"
 	default 100
@@ -56,6 +45,21 @@ config KFENCE_NUM_OBJECTS
 	  pages are required; with one containing the object and two adjacent
 	  ones used as guard pages.
 
+config KFENCE_STATIC_KEYS
+	bool "Use static keys to set up allocations" if EXPERT
+	depends on JUMP_LABEL
+	help
+	  Use static keys (static branches) to set up KFENCE allocations. This
+	  option is only recommended when using very large sample intervals, or
+	  performance has carefully been evaluated with this option.
+
+	  Using static keys comes with trade-offs that need to be carefully
+	  evaluated given target workloads and system architectures. Notably,
+	  enabling and disabling static keys invoke IPI broadcasts, the latency
+	  and impact of which is much harder to predict than a dynamic branch.
+
+	  Say N if you are unsure.
+
 config KFENCE_STRESS_TEST_FAULTS
 	int "Stress testing of fault handling and error reporting" if EXPERT
 	default 0
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 214/262] mm/damon: grammar s/works/work/
  2021-11-05 20:34 incoming Andrew Morton
                   ` (212 preceding siblings ...)
  2021-11-05 20:45 ` [patch 213/262] kfence: default to dynamic branch instead of static keys mode Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 215/262] Documentation/vm: move user guides to admin-guide/mm/ Andrew Morton
                   ` (47 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, geert, linux-mm, mm-commits, sjpark, torvalds

From: Geert Uytterhoeven <geert@linux-m68k.org>
Subject: mm/damon: grammar s/works/work/

Correct a singular versus plural grammar mistake in the help text for the
DAMON_VADDR config symbol.

Link: https://lkml.kernel.org/r/20210914073451.3883834-1-geert@linux-m68k.org
Fixes: 3f49584b262cf8f4 ("mm/damon: implement primitives for the virtual memory address spaces")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/damon/Kconfig~mm-damon-grammar-s-works-work
+++ a/mm/damon/Kconfig
@@ -30,7 +30,7 @@ config DAMON_VADDR
 	select PAGE_IDLE_FLAG
 	help
 	  This builds the default data access monitoring primitives for DAMON
-	  that works for virtual address spaces.
+	  that work for virtual address spaces.
 
 config DAMON_VADDR_KUNIT_TEST
 	bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 215/262] Documentation/vm: move user guides to admin-guide/mm/
  2021-11-05 20:34 incoming Andrew Morton
                   ` (213 preceding siblings ...)
  2021-11-05 20:45 ` [patch 214/262] mm/damon: grammar s/works/work/ Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:45 ` [patch 216/262] MAINTAINERS: update SeongJae's email address Andrew Morton
                   ` (46 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sjpark, torvalds

From: SeongJae Park <sjpark@amazon.de>
Subject: Documentation/vm: move user guides to admin-guide/mm/

Most memory management user guide documents are in 'admin-guide/mm/', but
two of those are in 'vm/'.  This commit moves the two docs into
'admin-guide/mm' so that they are easier to find.

Link: https://lkml.kernel.org/r/20210917123958.3819-2-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/index.rst        |  2 ++
 .../{vm => admin-guide/mm}/swap_numa.rst      |  0
 .../{vm => admin-guide/mm}/zswap.rst          |  0
 Documentation/vm/index.rst                    | 26 ++++---------------
 4 files changed, 7 insertions(+), 21 deletions(-)
 rename Documentation/{vm => admin-guide/mm}/swap_numa.rst (100%)
 rename Documentation/{vm => admin-guide/mm}/zswap.rst (100%)

diff --git a/Documentation/admin-guide/mm/index.rst b/Documentation/admin-guide/mm/index.rst
index cbd19d5e625f..c21b5823f126 100644
--- a/Documentation/admin-guide/mm/index.rst
+++ b/Documentation/admin-guide/mm/index.rst
@@ -37,5 +37,7 @@ the Linux memory management.
    numaperf
    pagemap
    soft-dirty
+   swap_numa
    transhuge
    userfaultfd
+   zswap
diff --git a/Documentation/vm/swap_numa.rst b/Documentation/admin-guide/mm/swap_numa.rst
similarity index 100%
rename from Documentation/vm/swap_numa.rst
rename to Documentation/admin-guide/mm/swap_numa.rst
diff --git a/Documentation/vm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
similarity index 100%
rename from Documentation/vm/zswap.rst
rename to Documentation/admin-guide/mm/zswap.rst
diff --git a/Documentation/vm/index.rst b/Documentation/vm/index.rst
index b51f0d8992f8..6f5ffef4b716 100644
--- a/Documentation/vm/index.rst
+++ b/Documentation/vm/index.rst
@@ -3,27 +3,11 @@ Linux Memory Management Documentation
 =====================================
 
 This is a collection of documents about the Linux memory management (mm)
-subsystem.  If you are looking for advice on simply allocating memory,
-see the :ref:`memory_allocation`.
-
-User guides for MM features
-===========================
-
-The following documents provide guides for controlling and tuning
-various features of the Linux memory management
-
-.. toctree::
-   :maxdepth: 1
-
-   swap_numa
-   zswap
-
-Kernel developers MM documentation
-==================================
-
-The below documents describe MM internals with different level of
-details ranging from notes and mailing list responses to elaborate
-descriptions of data structures and algorithms.
+subsystem internals with different levels of detail ranging from notes and
+mailing list responses to elaborate descriptions of data structures and
+algorithms.  If you are looking for advice on simply allocating memory, see the
+:ref:`memory_allocation`.  For controlling and tuning guides, see the
+:doc:`admin guide <../admin-guide/mm/index>`.
 
 .. toctree::
    :maxdepth: 1
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 278+ messages in thread

* [patch 216/262] MAINTAINERS: update SeongJae's email address
  2021-11-05 20:34 incoming Andrew Morton
                   ` (214 preceding siblings ...)
  2021-11-05 20:45 ` [patch 215/262] Documentation/vm: move user guides to admin-guide/mm/ Andrew Morton
@ 2021-11-05 20:45 ` Andrew Morton
  2021-11-05 20:46 ` [patch 217/262] docs/vm/damon: remove broken reference Andrew Morton
                   ` (45 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:45 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, sj, sjpark, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: MAINTAINERS: update SeongJae's email address

This commit updates SeongJae's email address in the MAINTAINERS file to his
preferred one.

Link: https://lkml.kernel.org/r/20210917123958.3819-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/MAINTAINERS~maintainers-update-seongjaes-email-address
+++ a/MAINTAINERS
@@ -5161,7 +5161,7 @@ F:	net/ax25/ax25_timer.c
 F:	net/ax25/sysctl_net_ax25.c
 
 DATA ACCESS MONITOR
-M:	SeongJae Park <sjpark@amazon.de>
+M:	SeongJae Park <sj@kernel.org>
 L:	linux-mm@kvack.org
 S:	Maintained
 F:	Documentation/admin-guide/mm/damon/
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 217/262] docs/vm/damon: remove broken reference
  2021-11-05 20:34 incoming Andrew Morton
                   ` (215 preceding siblings ...)
  2021-11-05 20:45 ` [patch 216/262] MAINTAINERS: update SeongJae's email address Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 218/262] include/linux/damon.h: fix kernel-doc comments for 'damon_callback' Andrew Morton
                   ` (44 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, sjpark, torvalds

From: SeongJae Park <sjpark@amazon.de>
Subject: docs/vm/damon: remove broken reference

Building the DAMON documents warns about a reference to a nonexistent
document, as below:

    $ time make htmldocs
    [...]
    Documentation/vm/damon/index.rst:24: WARNING: toctree contains reference to nonexisting document 'vm/damon/plans'

This commit fixes the warning by removing the wrong reference.

Link: https://lkml.kernel.org/r/20210917123958.3819-4-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/damon/index.rst |    1 -
 1 file changed, 1 deletion(-)

--- a/Documentation/vm/damon/index.rst~docs-vm-damon-remove-broken-reference
+++ a/Documentation/vm/damon/index.rst
@@ -27,4 +27,3 @@ workloads and systems.
    faq
    design
    api
-   plans
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 218/262] include/linux/damon.h: fix kernel-doc comments for 'damon_callback'
  2021-11-05 20:34 incoming Andrew Morton
                   ` (216 preceding siblings ...)
  2021-11-05 20:46 ` [patch 217/262] docs/vm/damon: remove broken reference Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 219/262] mm/damon/core: print kdamond start log in debug mode only Andrew Morton
                   ` (43 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, sjpark, torvalds

From: SeongJae Park <sjpark@amazon.de>
Subject: include/linux/damon.h: fix kernel-doc comments for 'damon_callback'

A few kernel-doc comments in 'damon.h' are broken.  This commit fixes
them.

Link: https://lkml.kernel.org/r/20210917123958.3819-5-sj@kernel.org
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/damon.h~include-linux-damonh-fix-kernel-doc-comments-for-damon_callback
+++ a/include/linux/damon.h
@@ -62,7 +62,7 @@ struct damon_target {
 struct damon_ctx;
 
 /**
- * struct damon_primitive	Monitoring primitives for given use cases.
+ * struct damon_primitive - Monitoring primitives for given use cases.
  *
  * @init:			Initialize primitive-internal data structures.
  * @update:			Update primitive-internal data structures.
@@ -108,8 +108,8 @@ struct damon_primitive {
 	void (*cleanup)(struct damon_ctx *context);
 };
 
-/*
- * struct damon_callback	Monitoring events notification callbacks.
+/**
+ * struct damon_callback - Monitoring events notification callbacks.
  *
  * @before_start:	Called before starting the monitoring.
  * @after_sampling:	Called after each sampling.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 219/262] mm/damon/core: print kdamond start log in debug mode only
  2021-11-05 20:34 incoming Andrew Morton
                   ` (217 preceding siblings ...)
  2021-11-05 20:46 ` [patch 218/262] include/linux/damon.h: fix kernel-doc comments for 'damon_callback' Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 220/262] mm/damon: remove unnecessary do_exit() from kdamond Andrew Morton
                   ` (42 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, sj, sjpark, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/core: print kdamond start log in debug mode only

Logging of kdamond startup uses 'pr_info()' unnecessarily.  This
commit makes it use 'pr_debug()' instead.

Link: https://lkml.kernel.org/r/20210917123958.3819-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/damon/core.c~mm-damon-core-print-kdamond-start-log-in-debug-mode-only
+++ a/mm/damon/core.c
@@ -653,7 +653,7 @@ static int kdamond_fn(void *data)
 	unsigned long sz_limit = 0;
 
 	mutex_lock(&ctx->kdamond_lock);
-	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+	pr_debug("kdamond (%d) starts\n", ctx->kdamond->pid);
 	mutex_unlock(&ctx->kdamond_lock);
 
 	if (ctx->primitive.init)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 220/262] mm/damon: remove unnecessary do_exit() from kdamond
  2021-11-05 20:34 incoming Andrew Morton
                   ` (218 preceding siblings ...)
  2021-11-05 20:46 ` [patch 219/262] mm/damon/core: print kdamond start log in debug mode only Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 221/262] mm/damon: needn't hold kdamond_lock to print pid of kdamond Andrew Morton
                   ` (41 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, changbin.du, linux-mm, mm-commits, sjpark, torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: mm/damon: remove unnecessary do_exit() from kdamond

Just return from the kthread function.

Link: https://lkml.kernel.org/r/20210927232421.17694-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/damon/core.c~mm-damon-remove-unnecessary-do_exit-from-kdamond
+++ a/mm/damon/core.c
@@ -714,7 +714,7 @@ static int kdamond_fn(void *data)
 	nr_running_ctxs--;
 	mutex_unlock(&damon_lock);
 
-	do_exit(0);
+	return 0;
 }
 
 #include "core-test.h"
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 221/262] mm/damon: needn't hold kdamond_lock to print pid of kdamond
  2021-11-05 20:34 incoming Andrew Morton
                   ` (219 preceding siblings ...)
  2021-11-05 20:46 ` [patch 220/262] mm/damon: remove unnecessary do_exit() from kdamond Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 222/262] mm/damon/core: nullify pointer ctx->kdamond with a NULL Andrew Morton
                   ` (40 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, changbin.du, linux-mm, mm-commits, sj, torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: mm/damon: needn't hold kdamond_lock to print pid of kdamond

Just get the pid via 'current->pid'.  Meanwhile, to be symmetrical, make
the 'starts' and 'finishes' logs both use the debug level.

Link: https://lkml.kernel.org/r/20210927232432.17750-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/core.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/damon/core.c~mm-damon-neednt-hold-kdamond_lock-to-print-pid-of-kdamond
+++ a/mm/damon/core.c
@@ -652,9 +652,7 @@ static int kdamond_fn(void *data)
 	unsigned int max_nr_accesses = 0;
 	unsigned long sz_limit = 0;
 
-	mutex_lock(&ctx->kdamond_lock);
-	pr_debug("kdamond (%d) starts\n", ctx->kdamond->pid);
-	mutex_unlock(&ctx->kdamond_lock);
+	pr_debug("kdamond (%d) starts\n", current->pid);
 
 	if (ctx->primitive.init)
 		ctx->primitive.init(ctx);
@@ -705,7 +703,7 @@ static int kdamond_fn(void *data)
 	if (ctx->primitive.cleanup)
 		ctx->primitive.cleanup(ctx);
 
-	pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+	pr_debug("kdamond (%d) finishes\n", current->pid);
 	mutex_lock(&ctx->kdamond_lock);
 	ctx->kdamond = NULL;
 	mutex_unlock(&ctx->kdamond_lock);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 222/262] mm/damon/core: nullify pointer ctx->kdamond with a NULL
  2021-11-05 20:34 incoming Andrew Morton
                   ` (220 preceding siblings ...)
  2021-11-05 20:46 ` [patch 221/262] mm/damon: needn't hold kdamond_lock to print pid of kdamond Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 223/262] mm/damon/core: account age of target regions Andrew Morton
                   ` (39 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, colin.king, linux-mm, mm-commits, sj, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: mm/damon/core: nullify pointer ctx->kdamond with a NULL

Currently a plain integer is being used to nullify the pointer
ctx->kdamond.  Use NULL instead.  This cleans up the following sparse
warning:

mm/damon/core.c:317:40: warning: Using plain integer as NULL pointer

Link: https://lkml.kernel.org/r/20210925215908.181226-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/damon/core.c~mm-damon-core-nullify-pointer-ctx-kdamond-with-a-null
+++ a/mm/damon/core.c
@@ -314,7 +314,7 @@ static int __damon_start(struct damon_ct
 				nr_running_ctxs);
 		if (IS_ERR(ctx->kdamond)) {
 			err = PTR_ERR(ctx->kdamond);
-			ctx->kdamond = 0;
+			ctx->kdamond = NULL;
 		}
 	}
 	mutex_unlock(&ctx->kdamond_lock);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 223/262] mm/damon/core: account age of target regions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (221 preceding siblings ...)
  2021-11-05 20:46 ` [patch 222/262] mm/damon/core: nullify pointer ctx->kdamond with a NULL Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 224/262] mm/damon/core: implement DAMON-based Operation Schemes (DAMOS) Andrew Morton
                   ` (38 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/core: account age of target regions

Patch series "Implement Data Access Monitoring-based Memory Operation Schemes".

Introduction
============

DAMON[1] can be used as a primitive for data access aware memory
management optimizations.  For that, users who want such optimizations
should run DAMON, read the monitoring results, analyze them, plan a new
memory management scheme, and apply the new scheme by themselves.  Such
efforts will be inevitable for some complicated optimizations.

However, in many other cases, the users would simply want the system to
apply a memory management action to a memory region of a specific size
having a specific access frequency for a specific time.  For example,
"page out a memory region larger than 100 MiB keeping only rare accesses
more than 2 minutes", or "Do not use THP for a memory region larger than 2
MiB rarely accessed for more than 1 seconds".

To make this work easier and non-redundant, this patchset implements a new
feature of DAMON, which is called Data Access Monitoring-based Operation
Schemes (DAMOS).  Using the feature, users can describe such schemes
in a simple way and ask DAMON to execute them on its own.

[1] https://damonitor.github.io
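
Concretely, a scheme boils down to an access-pattern predicate plus an
action.  A rough sketch of that shape (the field and enum names below are
purely illustrative, not the interface added later in this series):

	/* Illustrative only: roughly what an operation scheme expresses. */
	enum action { ACT_PAGEOUT, ACT_HUGEPAGE, ACT_NOHUGEPAGE };

	struct scheme {
		unsigned long min_size, max_size;	/* region size, bytes */
		unsigned int min_accesses, max_accesses;/* access frequency range */
		unsigned int min_age, max_age;		/* age, in aggregation steps */
		enum action action;			/* applied when all match */
	};

	/* "Page out regions >= 100 MiB that stayed rarely accessed for long." */
	static const struct scheme example = {
		.min_size = 100ul << 20, .max_size = ~0ul,
		.min_accesses = 0, .max_accesses = 1,
		.min_age = 120, .max_age = ~0u,
		.action = ACT_PAGEOUT,
	};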

Evaluations
===========

DAMOS is accurate and useful for memory management optimizations.  An
experimental DAMON-based operation scheme for THP, 'ethp', removes 76.15%
of THP memory overheads while preserving 51.25% of THP speedup.  Another
experimental DAMON-based 'proactive reclamation' implementation, 'prcl',
reduces 93.38% of resident sets and 23.63% of system memory footprint
while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).

NOTE that the experimental THP optimization and proactive reclamation are
not for production use but only proofs of concept.

Please refer to the showcase web site's evaluation document[1] for
detailed evaluation setup and results.

[1] https://damonitor.github.io/doc/html/v34/vm/damon/eval.html

Long-term Support Trees
-----------------------

For people who want to test DAMON on LTS kernels, there are two additional
trees, each based on one of the two latest LTS kernels and containing
backports of 'damon/master'.

- For v5.4.y: https://git.kernel.org/sj/h/damon/for-v5.4.y
- For v5.10.y: https://git.kernel.org/sj/h/damon/for-v5.10.y

Sequence Of Patches
===================

The 1st patch accounts the age of each region.  The 2nd patch implements
the core of the DAMON-based operation schemes feature.  The 3rd patch makes
the default monitoring primitives for virtual address spaces support the
schemes.  From this point, kernel space users can use DAMOS.  The 4th patch
exports the feature to user space via the debugfs interface.  The 5th patch
implements a scheme statistics feature for easier tuning of the schemes and
runtime access pattern analysis, and the 6th patch adds selftests for these
changes.  Finally, the 7th patch documents this new feature.


This patch (of 7):

DAMON can be used for data access pattern aware memory management
optimizations.  For that, users should run DAMON, read the monitoring
results, analyze them, plan a new memory management scheme, and apply the
new scheme by themselves.  This would not be too hard, but still requires
some level of effort.  For complicated cases, this effort is inevitable.

That said, in many cases, users would simply want to apply an action to a
memory region of a specific size having a specific access frequency for a
specific time.  For example, "page out a memory region larger than 100 MiB
that has had a low access frequency for more than 10 minutes", or "use THP
for a memory region larger than 2 MiB having a high access frequency for
more than 2 seconds".

For such optimizations, users would first need to account the age of each
region themselves.  To reduce that effort, this commit implements simple
age accounting for each region in DAMON.  For each aggregation step, DAMON
compares the access frequency with that of the last aggregation and resets
the age of the region if the change is significant; otherwise, the age is
incremented.  Also, when two regions are merged, the region size-weighted
average of their ages is set as the age of the merged region.
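
For example (the numbers are only illustrative), merging a 12 KiB region of
age 4 with an adjacent 4 KiB region of age 8 results in a 16 KiB region of
age (4 * 12 + 8 * 4) / 16 = 5.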

Link: https://lkml.kernel.org/r/20211001125604.29660-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211001125604.29660-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   10 ++++++++++
 mm/damon/core.c       |   13 +++++++++++++
 2 files changed, 23 insertions(+)

--- a/include/linux/damon.h~mm-damon-core-account-age-of-target-regions
+++ a/include/linux/damon.h
@@ -31,12 +31,22 @@ struct damon_addr_range {
  * @sampling_addr:	Address of the sample for the next access check.
  * @nr_accesses:	Access frequency of this region.
  * @list:		List head for siblings.
+ * @age:		Age of this region.
+ *
+ * @age is initially zero, increased for each aggregation interval, and reset
+ * to zero again if the access frequency is significantly changed.  If two
+ * regions are merged into a new region, both @nr_accesses and @age of the new
+ * region are set as region size-weighted average of those of the two regions.
  */
 struct damon_region {
 	struct damon_addr_range ar;
 	unsigned long sampling_addr;
 	unsigned int nr_accesses;
 	struct list_head list;
+
+	unsigned int age;
+/* private: Internal value for age calculation. */
+	unsigned int last_nr_accesses;
 };
 
 /**
--- a/mm/damon/core.c~mm-damon-core-account-age-of-target-regions
+++ a/mm/damon/core.c
@@ -45,6 +45,9 @@ struct damon_region *damon_new_region(un
 	region->nr_accesses = 0;
 	INIT_LIST_HEAD(&region->list);
 
+	region->age = 0;
+	region->last_nr_accesses = 0;
+
 	return region;
 }
 
@@ -444,6 +447,7 @@ static void kdamond_reset_aggregated(str
 
 		damon_for_each_region(r, t) {
 			trace_damon_aggregated(t, r, damon_nr_regions(t));
+			r->last_nr_accesses = r->nr_accesses;
 			r->nr_accesses = 0;
 		}
 	}
@@ -461,6 +465,7 @@ static void damon_merge_two_regions(stru
 
 	l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) /
 			(sz_l + sz_r);
+	l->age = (l->age * sz_l + r->age * sz_r) / (sz_l + sz_r);
 	l->ar.end = r->ar.end;
 	damon_destroy_region(r, t);
 }
@@ -480,6 +485,11 @@ static void damon_merge_regions_of(struc
 	struct damon_region *r, *prev = NULL, *next;
 
 	damon_for_each_region_safe(r, next, t) {
+		if (diff_of(r->nr_accesses, r->last_nr_accesses) > thres)
+			r->age = 0;
+		else
+			r->age++;
+
 		if (prev && prev->ar.end == r->ar.start &&
 		    diff_of(prev->nr_accesses, r->nr_accesses) <= thres &&
 		    sz_damon_region(prev) + sz_damon_region(r) <= sz_limit)
@@ -527,6 +537,9 @@ static void damon_split_region_at(struct
 
 	r->ar.end = new->ar.start;
 
+	new->age = r->age;
+	new->last_nr_accesses = r->last_nr_accesses;
+
 	damon_insert_region(new, r, damon_next_region(r), t);
 }
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 224/262] mm/damon/core: implement DAMON-based Operation Schemes (DAMOS)
  2021-11-05 20:34 incoming Andrew Morton
                   ` (222 preceding siblings ...)
  2021-11-05 20:46 ` [patch 223/262] mm/damon/core: account age of target regions Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 225/262] mm/damon/vaddr: support DAMON-based Operation Schemes Andrew Morton
                   ` (37 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/core: implement DAMON-based Operation Schemes (DAMOS)

In many cases, users might use DAMON for simple data access aware memory
management optimizations such as applying an operation scheme to a memory
region of a specific size having a specific access frequency for a
specific time.  For example, "page out a memory region larger than 100 MiB
that has had a low access frequency for more than 10 minutes", or "use THP
for a memory region larger than 2 MiB having a high access frequency for
more than 2 seconds".

The simplest form of a solution would be doing offline data access pattern
profiling using DAMON and modifying the application source code or system
configuration based on the profiling results.  Alternatively, one could
develop a daemon constructed of two modules (one for access monitoring and
the other for applying memory management actions via mlock(), madvise(),
sysctl, etc).

To avoid users spending their time implementing such simple data access
monitoring-based operation schemes, this commit makes DAMON handle such
schemes directly.  With this commit, users can simply specify their desired
schemes to DAMON.  Then, DAMON will automatically apply the schemes to the
user-specified target processes.

Each of the schemes is composed of conditions for filtering the target
memory regions and the desired memory management action for those regions.
Specifically, the format is::

    <min/max size> <min/max access frequency> <min/max age> <action>

The filtering conditions are the size of the memory region, the number of
accesses to the region as monitored by DAMON, and the age of the region.
The age of a region is incremented periodically but reset when its
addresses or access frequency change significantly, or when the action of
a scheme is applied.  For the action, the current implementation supports
a few madvise()-like hints: ``WILLNEED``, ``COLD``, ``PAGEOUT``,
``HUGEPAGE``, and ``NOHUGEPAGE``.

Because DAMON supports various address spaces and applying an action to a
monitoring target region depends on the type of the target address space,
the action application code should be implemented by each set of
primitives and registered with the framework.  Note that this commit
implements only the framework part.  The following commit will implement
action application for the virtual address space primitives.
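
For kernel space users, this boils down to building 'struct damos' objects
and handing them to a monitoring context.  A minimal, hedged sketch (the
thresholds and the pre-existing 'ctx' variable are illustrative assumptions,
and error handling is omitted)::

    /* "Page out regions of >=100 MiB that stayed idle for >=10
     * aggregation intervals" -- the values are assumptions, not a recipe. */
    struct damos *scheme, *schemes[1];

    scheme = damon_new_scheme(100ul * 1024 * 1024, ULONG_MAX, /* size range */
                              0, 0,                 /* nr_accesses range */
                              10, UINT_MAX,         /* age range */
                              DAMOS_PAGEOUT);
    if (scheme) {
            schemes[0] = scheme;
            damon_set_schemes(ctx, schemes, 1);
    }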

Link: https://lkml.kernel.org/r/20211001125604.29660-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   66 ++++++++++++++++++++++++
 mm/damon/core.c       |  109 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 175 insertions(+)

--- a/include/linux/damon.h~mm-damon-core-implement-damon-based-operation-schemes-damos
+++ a/include/linux/damon.h
@@ -69,6 +69,48 @@ struct damon_target {
 	struct list_head list;
 };
 
+/**
+ * enum damos_action - Represents an action of a Data Access Monitoring-based
+ * Operation Scheme.
+ *
+ * @DAMOS_WILLNEED:	Call ``madvise()`` for the region with MADV_WILLNEED.
+ * @DAMOS_COLD:		Call ``madvise()`` for the region with MADV_COLD.
+ * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
+ * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
+ * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
+ */
+enum damos_action {
+	DAMOS_WILLNEED,
+	DAMOS_COLD,
+	DAMOS_PAGEOUT,
+	DAMOS_HUGEPAGE,
+	DAMOS_NOHUGEPAGE,
+};
+
+/**
+ * struct damos - Represents a Data Access Monitoring-based Operation Scheme.
+ * @min_sz_region:	Minimum size of target regions.
+ * @max_sz_region:	Maximum size of target regions.
+ * @min_nr_accesses:	Minimum ``->nr_accesses`` of target regions.
+ * @max_nr_accesses:	Maximum ``->nr_accesses`` of target regions.
+ * @min_age_region:	Minimum age of target regions.
+ * @max_age_region:	Maximum age of target regions.
+ * @action:		&damos_action to be applied to the target regions.
+ * @list:		List head for siblings.
+ *
+ * Note that both the minimums and the maximums are inclusive.
+ */
+struct damos {
+	unsigned long min_sz_region;
+	unsigned long max_sz_region;
+	unsigned int min_nr_accesses;
+	unsigned int max_nr_accesses;
+	unsigned int min_age_region;
+	unsigned int max_age_region;
+	enum damos_action action;
+	struct list_head list;
+};
+
 struct damon_ctx;
 
 /**
@@ -79,6 +121,7 @@ struct damon_ctx;
  * @prepare_access_checks:	Prepare next access check of target regions.
  * @check_accesses:		Check the accesses to target regions.
  * @reset_aggregated:		Reset aggregated accesses monitoring results.
+ * @apply_scheme:		Apply a DAMON-based operation scheme.
  * @target_valid:		Determine if the target is valid.
  * @cleanup:			Clean up the context.
  *
@@ -104,6 +147,9 @@ struct damon_ctx;
  * of its update.  The value will be used for regions adjustment threshold.
  * @reset_aggregated should reset the access monitoring results that aggregated
  * by @check_accesses.
+ * @apply_scheme is called from @kdamond when a region for user provided
+ * DAMON-based operation scheme is found.  It should apply the scheme's action
+ * to the region.  This is not used for &DAMON_ARBITRARY_TARGET case.
  * @target_valid should check whether the target is still valid for the
  * monitoring.
  * @cleanup is called from @kdamond just before its termination.
@@ -114,6 +160,8 @@ struct damon_primitive {
 	void (*prepare_access_checks)(struct damon_ctx *context);
 	unsigned int (*check_accesses)(struct damon_ctx *context);
 	void (*reset_aggregated)(struct damon_ctx *context);
+	int (*apply_scheme)(struct damon_ctx *context, struct damon_target *t,
+			struct damon_region *r, struct damos *scheme);
 	bool (*target_valid)(void *target);
 	void (*cleanup)(struct damon_ctx *context);
 };
@@ -192,6 +240,7 @@ struct damon_callback {
  * @min_nr_regions:	The minimum number of adaptive monitoring regions.
  * @max_nr_regions:	The maximum number of adaptive monitoring regions.
  * @adaptive_targets:	Head of monitoring targets (&damon_target) list.
+ * @schemes:		Head of schemes (&damos) list.
  */
 struct damon_ctx {
 	unsigned long sample_interval;
@@ -213,6 +262,7 @@ struct damon_ctx {
 	unsigned long min_nr_regions;
 	unsigned long max_nr_regions;
 	struct list_head adaptive_targets;
+	struct list_head schemes;
 };
 
 #define damon_next_region(r) \
@@ -233,6 +283,12 @@ struct damon_ctx {
 #define damon_for_each_target_safe(t, next, ctx)	\
 	list_for_each_entry_safe(t, next, &(ctx)->adaptive_targets, list)
 
+#define damon_for_each_scheme(s, ctx) \
+	list_for_each_entry(s, &(ctx)->schemes, list)
+
+#define damon_for_each_scheme_safe(s, next, ctx) \
+	list_for_each_entry_safe(s, next, &(ctx)->schemes, list)
+
 #ifdef CONFIG_DAMON
 
 struct damon_region *damon_new_region(unsigned long start, unsigned long end);
@@ -242,6 +298,14 @@ inline void damon_insert_region(struct d
 void damon_add_region(struct damon_region *r, struct damon_target *t);
 void damon_destroy_region(struct damon_region *r, struct damon_target *t);
 
+struct damos *damon_new_scheme(
+		unsigned long min_sz_region, unsigned long max_sz_region,
+		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
+		unsigned int min_age_region, unsigned int max_age_region,
+		enum damos_action action);
+void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
+void damon_destroy_scheme(struct damos *s);
+
 struct damon_target *damon_new_target(unsigned long id);
 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t);
 void damon_free_target(struct damon_target *t);
@@ -255,6 +319,8 @@ int damon_set_targets(struct damon_ctx *
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
 		unsigned long aggr_int, unsigned long primitive_upd_int,
 		unsigned long min_nr_reg, unsigned long max_nr_reg);
+int damon_set_schemes(struct damon_ctx *ctx,
+			struct damos **schemes, ssize_t nr_schemes);
 int damon_nr_running_ctxs(void);
 
 int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
--- a/mm/damon/core.c~mm-damon-core-implement-damon-based-operation-schemes-damos
+++ a/mm/damon/core.c
@@ -85,6 +85,50 @@ void damon_destroy_region(struct damon_r
 	damon_free_region(r);
 }
 
+struct damos *damon_new_scheme(
+		unsigned long min_sz_region, unsigned long max_sz_region,
+		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
+		unsigned int min_age_region, unsigned int max_age_region,
+		enum damos_action action)
+{
+	struct damos *scheme;
+
+	scheme = kmalloc(sizeof(*scheme), GFP_KERNEL);
+	if (!scheme)
+		return NULL;
+	scheme->min_sz_region = min_sz_region;
+	scheme->max_sz_region = max_sz_region;
+	scheme->min_nr_accesses = min_nr_accesses;
+	scheme->max_nr_accesses = max_nr_accesses;
+	scheme->min_age_region = min_age_region;
+	scheme->max_age_region = max_age_region;
+	scheme->action = action;
+	INIT_LIST_HEAD(&scheme->list);
+
+	return scheme;
+}
+
+void damon_add_scheme(struct damon_ctx *ctx, struct damos *s)
+{
+	list_add_tail(&s->list, &ctx->schemes);
+}
+
+static void damon_del_scheme(struct damos *s)
+{
+	list_del(&s->list);
+}
+
+static void damon_free_scheme(struct damos *s)
+{
+	kfree(s);
+}
+
+void damon_destroy_scheme(struct damos *s)
+{
+	damon_del_scheme(s);
+	damon_free_scheme(s);
+}
+
 /*
  * Construct a damon_target struct
  *
@@ -156,6 +200,7 @@ struct damon_ctx *damon_new_ctx(void)
 	ctx->max_nr_regions = 1000;
 
 	INIT_LIST_HEAD(&ctx->adaptive_targets);
+	INIT_LIST_HEAD(&ctx->schemes);
 
 	return ctx;
 }
@@ -175,7 +220,13 @@ static void damon_destroy_targets(struct
 
 void damon_destroy_ctx(struct damon_ctx *ctx)
 {
+	struct damos *s, *next_s;
+
 	damon_destroy_targets(ctx);
+
+	damon_for_each_scheme_safe(s, next_s, ctx)
+		damon_destroy_scheme(s);
+
 	kfree(ctx);
 }
 
@@ -251,6 +302,30 @@ int damon_set_attrs(struct damon_ctx *ct
 }
 
 /**
+ * damon_set_schemes() - Set data access monitoring based operation schemes.
+ * @ctx:	monitoring context
+ * @schemes:	array of the schemes
+ * @nr_schemes:	number of entries in @schemes
+ *
+ * This function should not be called while the kdamond of the context is
+ * running.
+ *
+ * Return: 0 if success, or negative error code otherwise.
+ */
+int damon_set_schemes(struct damon_ctx *ctx, struct damos **schemes,
+			ssize_t nr_schemes)
+{
+	struct damos *s, *next;
+	ssize_t i;
+
+	damon_for_each_scheme_safe(s, next, ctx)
+		damon_destroy_scheme(s);
+	for (i = 0; i < nr_schemes; i++)
+		damon_add_scheme(ctx, schemes[i]);
+	return 0;
+}
+
+/**
  * damon_nr_running_ctxs() - Return number of currently running contexts.
  */
 int damon_nr_running_ctxs(void)
@@ -453,6 +528,39 @@ static void kdamond_reset_aggregated(str
 	}
 }
 
+static void damon_do_apply_schemes(struct damon_ctx *c,
+				   struct damon_target *t,
+				   struct damon_region *r)
+{
+	struct damos *s;
+	unsigned long sz;
+
+	damon_for_each_scheme(s, c) {
+		sz = r->ar.end - r->ar.start;
+		if (sz < s->min_sz_region || s->max_sz_region < sz)
+			continue;
+		if (r->nr_accesses < s->min_nr_accesses ||
+				s->max_nr_accesses < r->nr_accesses)
+			continue;
+		if (r->age < s->min_age_region || s->max_age_region < r->age)
+			continue;
+		if (c->primitive.apply_scheme)
+			c->primitive.apply_scheme(c, t, r, s);
+		r->age = 0;
+	}
+}
+
+static void kdamond_apply_schemes(struct damon_ctx *c)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+
+	damon_for_each_target(t, c) {
+		damon_for_each_region(r, t)
+			damon_do_apply_schemes(c, t, r);
+	}
+}
+
 #define sz_damon_region(r) (r->ar.end - r->ar.start)
 
 /*
@@ -693,6 +801,7 @@ static int kdamond_fn(void *data)
 			if (ctx->callback.after_aggregation &&
 					ctx->callback.after_aggregation(ctx))
 				set_kdamond_stop(ctx);
+			kdamond_apply_schemes(ctx);
 			kdamond_reset_aggregated(ctx);
 			kdamond_split_regions(ctx);
 			if (ctx->primitive.reset_aggregated)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 225/262] mm/damon/vaddr: support DAMON-based Operation Schemes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (223 preceding siblings ...)
  2021-11-05 20:46 ` [patch 224/262] mm/damon/core: implement DAMON-based Operation Schemes (DAMOS) Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 226/262] mm/damon/dbgfs: " Andrew Morton
                   ` (36 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/vaddr: support DAMON-based Operation Schemes

This commit makes DAMON's default primitives for virtual address spaces
support DAMON-based Operation Schemes (DAMOS) by implementing an action
application function and registering it with the monitoring context.  The
implementation simply maps the related DAMOS actions to 'madvise()'.  That
is, 'madvise(MADV_WILLNEED)' is called for the 'WILLNEED' DAMOS action, and
similarly for the other actions ('COLD', 'PAGEOUT', 'HUGEPAGE',
'NOHUGEPAGE').

So, kernel space DAMON users can now use the DAMON-based optimizations with
only a small amount of code.
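
In other words, an in-kernel user only has to register the virtual address
space primitives and start the monitor; the schemes set on the context (as
in the previous patch) are then applied by the kdamond.  A rough sketch,
with target and scheme setup elided as assumptions::

    struct damon_ctx *ctx = damon_new_ctx();

    damon_va_set_primitives(ctx);  /* now also registers ->apply_scheme */
    /* set targets and schemes here, e.g. via damon_set_targets() and
     * damon_set_schemes() as in the previous patch, then: */
    damon_start(&ctx, 1);          /* kdamond runs madvise()-backed actions */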

Link: https://lkml.kernel.org/r/20211001125604.29660-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    2 +
 mm/damon/vaddr.c      |   56 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 58 insertions(+)

--- a/include/linux/damon.h~mm-damon-vaddr-support-damon-based-operation-schemes
+++ a/include/linux/damon.h
@@ -337,6 +337,8 @@ void damon_va_prepare_access_checks(stru
 unsigned int damon_va_check_accesses(struct damon_ctx *ctx);
 bool damon_va_target_valid(void *t);
 void damon_va_cleanup(struct damon_ctx *ctx);
+int damon_va_apply_scheme(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme);
 void damon_va_set_primitives(struct damon_ctx *ctx);
 
 #endif	/* CONFIG_DAMON_VADDR */
--- a/mm/damon/vaddr.c~mm-damon-vaddr-support-damon-based-operation-schemes
+++ a/mm/damon/vaddr.c
@@ -7,6 +7,7 @@
 
 #define pr_fmt(fmt) "damon-va: " fmt
 
+#include <asm-generic/mman-common.h>
 #include <linux/damon.h>
 #include <linux/hugetlb.h>
 #include <linux/mm.h>
@@ -658,6 +659,60 @@ bool damon_va_target_valid(void *target)
 	return false;
 }
 
+#ifndef CONFIG_ADVISE_SYSCALLS
+static int damos_madvise(struct damon_target *target, struct damon_region *r,
+			int behavior)
+{
+	return -EINVAL;
+}
+#else
+static int damos_madvise(struct damon_target *target, struct damon_region *r,
+			int behavior)
+{
+	struct mm_struct *mm;
+	int ret = -ENOMEM;
+
+	mm = damon_get_mm(target);
+	if (!mm)
+		goto out;
+
+	ret = do_madvise(mm, PAGE_ALIGN(r->ar.start),
+			PAGE_ALIGN(r->ar.end - r->ar.start), behavior);
+	mmput(mm);
+out:
+	return ret;
+}
+#endif	/* CONFIG_ADVISE_SYSCALLS */
+
+int damon_va_apply_scheme(struct damon_ctx *ctx, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme)
+{
+	int madv_action;
+
+	switch (scheme->action) {
+	case DAMOS_WILLNEED:
+		madv_action = MADV_WILLNEED;
+		break;
+	case DAMOS_COLD:
+		madv_action = MADV_COLD;
+		break;
+	case DAMOS_PAGEOUT:
+		madv_action = MADV_PAGEOUT;
+		break;
+	case DAMOS_HUGEPAGE:
+		madv_action = MADV_HUGEPAGE;
+		break;
+	case DAMOS_NOHUGEPAGE:
+		madv_action = MADV_NOHUGEPAGE;
+		break;
+	default:
+		pr_warn("Wrong action %d\n", scheme->action);
+		return -EINVAL;
+	}
+
+	return damos_madvise(t, r, madv_action);
+}
+
 void damon_va_set_primitives(struct damon_ctx *ctx)
 {
 	ctx->primitive.init = damon_va_init;
@@ -667,6 +722,7 @@ void damon_va_set_primitives(struct damo
 	ctx->primitive.reset_aggregated = NULL;
 	ctx->primitive.target_valid = damon_va_target_valid;
 	ctx->primitive.cleanup = NULL;
+	ctx->primitive.apply_scheme = damon_va_apply_scheme;
 }
 
 #include "vaddr-test.h"
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 226/262] mm/damon/dbgfs: support DAMON-based Operation Schemes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (224 preceding siblings ...)
  2021-11-05 20:46 ` [patch 225/262] mm/damon/vaddr: support DAMON-based Operation Schemes Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 227/262] mm/damon/schemes: implement statistics feature Andrew Morton
                   ` (35 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: support DAMON-based Operation Schemes

This commit makes 'damon-dbgfs' support the data access monitoring
oriented memory management schemes.  Users can read and update the schemes
using the ``<debugfs>/damon/schemes`` file.  The format is::

    <min/max size> <min/max access frequency> <min/max age> <action>
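
For instance (this mirrors the example added to the admin guide later in
this series), the below line, written to the file, asks DAMON to page out
4 KiB to 8 KiB regions that showed 0 to 5 accesses per aggregate interval
for 10 to 20 aggregate intervals::

    4096 8192    0 5    10 20    2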

Link: https://lkml.kernel.org/r/20211001125604.29660-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |  165 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 162 insertions(+), 3 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-damon-based-operation-schemes
+++ a/mm/damon/dbgfs.c
@@ -98,6 +98,159 @@ out:
 	return ret;
 }
 
+static ssize_t sprint_schemes(struct damon_ctx *c, char *buf, ssize_t len)
+{
+	struct damos *s;
+	int written = 0;
+	int rc;
+
+	damon_for_each_scheme(s, c) {
+		rc = scnprintf(&buf[written], len - written,
+				"%lu %lu %u %u %u %u %d\n",
+				s->min_sz_region, s->max_sz_region,
+				s->min_nr_accesses, s->max_nr_accesses,
+				s->min_age_region, s->max_age_region,
+				s->action);
+		if (!rc)
+			return -ENOMEM;
+
+		written += rc;
+	}
+	return written;
+}
+
+static ssize_t dbgfs_schemes_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf;
+	ssize_t len;
+
+	kbuf = kmalloc(count, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	mutex_lock(&ctx->kdamond_lock);
+	len = sprint_schemes(ctx, kbuf, count);
+	mutex_unlock(&ctx->kdamond_lock);
+	if (len < 0)
+		goto out;
+	len = simple_read_from_buffer(buf, count, ppos, kbuf, len);
+
+out:
+	kfree(kbuf);
+	return len;
+}
+
+static void free_schemes_arr(struct damos **schemes, ssize_t nr_schemes)
+{
+	ssize_t i;
+
+	for (i = 0; i < nr_schemes; i++)
+		kfree(schemes[i]);
+	kfree(schemes);
+}
+
+static bool damos_action_valid(int action)
+{
+	switch (action) {
+	case DAMOS_WILLNEED:
+	case DAMOS_COLD:
+	case DAMOS_PAGEOUT:
+	case DAMOS_HUGEPAGE:
+	case DAMOS_NOHUGEPAGE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/*
+ * Converts a string into an array of struct damos pointers
+ *
+ * Returns an array of struct damos pointers converted from the string if the
+ * conversion succeeds, or NULL otherwise.
+ */
+static struct damos **str_to_schemes(const char *str, ssize_t len,
+				ssize_t *nr_schemes)
+{
+	struct damos *scheme, **schemes;
+	const int max_nr_schemes = 256;
+	int pos = 0, parsed, ret;
+	unsigned long min_sz, max_sz;
+	unsigned int min_nr_a, max_nr_a, min_age, max_age;
+	unsigned int action;
+
+	schemes = kmalloc_array(max_nr_schemes, sizeof(scheme),
+			GFP_KERNEL);
+	if (!schemes)
+		return NULL;
+
+	*nr_schemes = 0;
+	while (pos < len && *nr_schemes < max_nr_schemes) {
+		ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n",
+				&min_sz, &max_sz, &min_nr_a, &max_nr_a,
+				&min_age, &max_age, &action, &parsed);
+		if (ret != 7)
+			break;
+		if (!damos_action_valid(action)) {
+			pr_err("wrong action %d\n", action);
+			goto fail;
+		}
+
+		pos += parsed;
+		scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a,
+				min_age, max_age, action);
+		if (!scheme)
+			goto fail;
+
+		schemes[*nr_schemes] = scheme;
+		*nr_schemes += 1;
+	}
+	return schemes;
+fail:
+	free_schemes_arr(schemes, *nr_schemes);
+	return NULL;
+}
+
+static ssize_t dbgfs_schemes_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf;
+	struct damos **schemes;
+	ssize_t nr_schemes = 0, ret = count;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	schemes = str_to_schemes(kbuf, ret, &nr_schemes);
+	if (!schemes) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ret = -EBUSY;
+		goto unlock_out;
+	}
+
+	err = damon_set_schemes(ctx, schemes, nr_schemes);
+	if (err)
+		ret = err;
+	else
+		nr_schemes = 0;
+unlock_out:
+	mutex_unlock(&ctx->kdamond_lock);
+	free_schemes_arr(schemes, nr_schemes);
+out:
+	kfree(kbuf);
+	return ret;
+}
+
 static inline bool targetid_is_pid(const struct damon_ctx *ctx)
 {
 	return ctx->primitive.target_valid == damon_va_target_valid;
@@ -279,6 +432,12 @@ static const struct file_operations attr
 	.write = dbgfs_attrs_write,
 };
 
+static const struct file_operations schemes_fops = {
+	.open = damon_dbgfs_open,
+	.read = dbgfs_schemes_read,
+	.write = dbgfs_schemes_write,
+};
+
 static const struct file_operations target_ids_fops = {
 	.open = damon_dbgfs_open,
 	.read = dbgfs_target_ids_read,
@@ -292,10 +451,10 @@ static const struct file_operations kdam
 
 static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx)
 {
-	const char * const file_names[] = {"attrs", "target_ids",
+	const char * const file_names[] = {"attrs", "schemes", "target_ids",
 		"kdamond_pid"};
-	const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops,
-		&kdamond_pid_fops};
+	const struct file_operations *fops[] = {&attrs_fops, &schemes_fops,
+		&target_ids_fops, &kdamond_pid_fops};
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(file_names); i++)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 227/262] mm/damon/schemes: implement statistics feature
  2021-11-05 20:34 incoming Andrew Morton
                   ` (225 preceding siblings ...)
  2021-11-05 20:46 ` [patch 226/262] mm/damon/dbgfs: " Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 228/262] selftests/damon: add 'schemes' debugfs tests Andrew Morton
                   ` (34 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: implement statistics feature

To tune the DAMON-based operation schemes, knowing how many and how large
regions are affected by each of the schemes will be helpful.  Those stats
can be used not only for the tuning, but also for monitoring the working
set size and the number of regions, if the scheme does not change the
program behavior too much.

For that reason, this commit implements statistics for the schemes.  The
total number and total size of the regions that each scheme has been
applied to are exported to users via '->stat_count' and '->stat_sz' of
'struct damos'.  Admins can also check the numbers by reading the 'schemes'
debugfs file; the last two integers of each line now represent the stats.
To allow collecting the stats without changing the program behavior, this
commit also adds a new scheme action, 'DAMOS_STAT'.  Note that 'DAMOS_STAT'
not only performs no memory operation action, but also does not reset the
age of regions.
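
As a hedged illustration (not part of the patch; the catch-all ranges are
assumptions), a stat-only scheme matching every region accumulates, per
aggregation interval, how many regions existed and how large they were in
total::

    struct damos *s = damon_new_scheme(0, ULONG_MAX, 0, UINT_MAX,
                                       0, UINT_MAX, DAMOS_STAT);
    /* ... after the kdamond has run for a while ... */
    pr_info("stat scheme hit %lu regions, %lu bytes\n",
            s->stat_count, s->stat_sz);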

Link: https://lkml.kernel.org/r/20211001125604.29660-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   10 +++++++++-
 mm/damon/core.c       |    7 ++++++-
 mm/damon/dbgfs.c      |    5 +++--
 mm/damon/vaddr.c      |    2 ++
 4 files changed, 20 insertions(+), 4 deletions(-)

--- a/include/linux/damon.h~mm-damon-schemes-implement-statistics-feature
+++ a/include/linux/damon.h
@@ -78,6 +78,7 @@ struct damon_target {
  * @DAMOS_PAGEOUT:	Call ``madvise()`` for the region with MADV_PAGEOUT.
  * @DAMOS_HUGEPAGE:	Call ``madvise()`` for the region with MADV_HUGEPAGE.
  * @DAMOS_NOHUGEPAGE:	Call ``madvise()`` for the region with MADV_NOHUGEPAGE.
+ * @DAMOS_STAT:		Do nothing but count the stat.
  */
 enum damos_action {
 	DAMOS_WILLNEED,
@@ -85,6 +86,7 @@ enum damos_action {
 	DAMOS_PAGEOUT,
 	DAMOS_HUGEPAGE,
 	DAMOS_NOHUGEPAGE,
+	DAMOS_STAT,		/* Do nothing but only record the stat */
 };
 
 /**
@@ -96,9 +98,13 @@ enum damos_action {
  * @min_age_region:	Minimum age of target regions.
  * @max_age_region:	Maximum age of target regions.
  * @action:		&damos_action to be applied to the target regions.
+ * @stat_count:		Total number of regions that this scheme is applied.
+ * @stat_sz:		Total size of regions that this scheme is applied.
  * @list:		List head for siblings.
  *
- * Note that both the minimums and the maximums are inclusive.
+ * For each aggregation interval, DAMON applies @action to monitoring target
+ * regions fit in the condition and updates the statistics.  Note that both
+ * the minimums and the maximums are inclusive.
  */
 struct damos {
 	unsigned long min_sz_region;
@@ -108,6 +114,8 @@ struct damos {
 	unsigned int min_age_region;
 	unsigned int max_age_region;
 	enum damos_action action;
+	unsigned long stat_count;
+	unsigned long stat_sz;
 	struct list_head list;
 };
 
--- a/mm/damon/core.c~mm-damon-schemes-implement-statistics-feature
+++ a/mm/damon/core.c
@@ -103,6 +103,8 @@ struct damos *damon_new_scheme(
 	scheme->min_age_region = min_age_region;
 	scheme->max_age_region = max_age_region;
 	scheme->action = action;
+	scheme->stat_count = 0;
+	scheme->stat_sz = 0;
 	INIT_LIST_HEAD(&scheme->list);
 
 	return scheme;
@@ -544,9 +546,12 @@ static void damon_do_apply_schemes(struc
 			continue;
 		if (r->age < s->min_age_region || s->max_age_region < r->age)
 			continue;
+		s->stat_count++;
+		s->stat_sz += sz;
 		if (c->primitive.apply_scheme)
 			c->primitive.apply_scheme(c, t, r, s);
-		r->age = 0;
+		if (s->action != DAMOS_STAT)
+			r->age = 0;
 	}
 }
 
--- a/mm/damon/dbgfs.c~mm-damon-schemes-implement-statistics-feature
+++ a/mm/damon/dbgfs.c
@@ -106,11 +106,11 @@ static ssize_t sprint_schemes(struct dam
 
 	damon_for_each_scheme(s, c) {
 		rc = scnprintf(&buf[written], len - written,
-				"%lu %lu %u %u %u %u %d\n",
+				"%lu %lu %u %u %u %u %d %lu %lu\n",
 				s->min_sz_region, s->max_sz_region,
 				s->min_nr_accesses, s->max_nr_accesses,
 				s->min_age_region, s->max_age_region,
-				s->action);
+				s->action, s->stat_count, s->stat_sz);
 		if (!rc)
 			return -ENOMEM;
 
@@ -159,6 +159,7 @@ static bool damos_action_valid(int actio
 	case DAMOS_PAGEOUT:
 	case DAMOS_HUGEPAGE:
 	case DAMOS_NOHUGEPAGE:
+	case DAMOS_STAT:
 		return true;
 	default:
 		return false;
--- a/mm/damon/vaddr.c~mm-damon-schemes-implement-statistics-feature
+++ a/mm/damon/vaddr.c
@@ -705,6 +705,8 @@ int damon_va_apply_scheme(struct damon_c
 	case DAMOS_NOHUGEPAGE:
 		madv_action = MADV_NOHUGEPAGE;
 		break;
+	case DAMOS_STAT:
+		return 0;
 	default:
 		pr_warn("Wrong action %d\n", scheme->action);
 		return -EINVAL;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 228/262] selftests/damon: add 'schemes' debugfs tests
  2021-11-05 20:34 incoming Andrew Morton
                   ` (226 preceding siblings ...)
  2021-11-05 20:46 ` [patch 227/262] mm/damon/schemes: implement statistics feature Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 229/262] Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes Andrew Morton
                   ` (33 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: selftests/damon: add 'schemes' debugfs tests

This commit adds a simple selftest for the 'schemes' debugfs file of DAMON.

Link: https://lkml.kernel.org/r/20211001125604.29660-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/damon/debugfs_attrs.sh |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- a/tools/testing/selftests/damon/debugfs_attrs.sh~selftests-damon-add-schemes-debugfs-tests
+++ a/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -57,6 +57,19 @@ test_write_fail "$file" "1 2 3 5 4" "$or
 test_content "$file" "$orig_content" "1 2 3 4 5" "successfully written"
 echo "$orig_content" > "$file"
 
+# Test schemes file
+# =================
+
+file="$DBGFS/schemes"
+orig_content=$(cat "$file")
+
+test_write_succ "$file" "1 2 3 4 5 6 4" \
+	"$orig_content" "valid input"
+test_write_fail "$file" "1 2
+3 4 5 6 3" "$orig_content" "multi lines"
+test_write_succ "$file" "" "$orig_content" "disabling"
+echo "$orig_content" > "$file"
+
 # Test target_ids file
 # ====================
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 229/262] Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (227 preceding siblings ...)
  2021-11-05 20:46 ` [patch 228/262] selftests/damon: add 'schemes' debugfs tests Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 230/262] mm/damon/dbgfs: allow users to set initial monitoring target regions Andrew Morton
                   ` (32 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes

This commit adds a description of DAMON-based operation schemes to the
DAMON documents.

Link: https://lkml.kernel.org/r/20211001125604.29660-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rienjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/start.rst |   11 +++
 Documentation/admin-guide/mm/damon/usage.rst |   51 ++++++++++++++++-
 2 files changed, 60 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/mm/damon/start.rst~docs-admin-guide-mm-damon-document-damon-based-operation-schemes
+++ a/Documentation/admin-guide/mm/damon/start.rst
@@ -108,6 +108,17 @@ the results as separate image files. ::
 You can view the visualizations of this example workload at [1]_.
 Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
 
+
+Data Access Pattern Aware Memory Management
+===========================================
+
+The below three commands make every memory region of size >=4K that has not
+been accessed for >=60 seconds in your workload be swapped out. ::
+
+    $ echo "#min-size max-size min-acc max-acc min-age max-age action" > scheme
+    $ echo "4K        max      0       0       60s     max     pageout" >> scheme
+    $ damo schemes -c scheme <pid of your workload>
+
 .. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
 .. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
 .. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
--- a/Documentation/admin-guide/mm/damon/usage.rst~docs-admin-guide-mm-damon-document-damon-based-operation-schemes
+++ a/Documentation/admin-guide/mm/damon/usage.rst
@@ -34,8 +34,8 @@ the reason, this document describes only
 debugfs Interface
 =================
 
-DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
-its debugfs directory, ``<debugfs>/damon/``.
+DAMON exports four files, ``attrs``, ``target_ids``, ``schemes`` and
+``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
 
 
 Attributes
@@ -74,6 +74,53 @@ check it again::
 Note that setting the target ids doesn't start the monitoring.
 
 
+Schemes
+-------
+
+For usual DAMON-based data access aware memory management optimizations, users
+would simply want the system to apply a memory management action to a memory
+region of a specific size having a specific access frequency for a specific
+time.  DAMON receives such formalized operation schemes from the user and
+applies those to the target processes.  It also counts the total number and
+size of regions that each scheme is applied to.  These statistics can be
+used for online analysis or tuning of the schemes.
+
+Users can get and set the schemes by reading from and writing to ``schemes``
+debugfs file.  Reading the file also shows the statistics of each scheme.  To
+the file, each scheme should be represented on its own line in the below form:
+
+    min-size max-size min-acc max-acc min-age max-age action
+
+Note that the ranges are closed intervals.  Bytes for the size of regions
+(``min-size`` and ``max-size``), number of monitored accesses per aggregate
+interval for access frequency (``min-acc`` and ``max-acc``), number of
+aggregate intervals for the age of regions (``min-age`` and ``max-age``), and a
+predefined integer for memory management actions should be used.  The supported
+numbers and their meanings are as below.
+
+ - 0: Call ``madvise()`` for the region with ``MADV_WILLNEED``
+ - 1: Call ``madvise()`` for the region with ``MADV_COLD``
+ - 2: Call ``madvise()`` for the region with ``MADV_PAGEOUT``
+ - 3: Call ``madvise()`` for the region with ``MADV_HUGEPAGE``
+ - 4: Call ``madvise()`` for the region with ``MADV_NOHUGEPAGE``
+ - 5: Do nothing but count the statistics
+
+You can disable schemes by simply writing an empty string to the file.  For
+example, below commands applies a scheme saying "If a memory region of size in
+[4KiB, 8KiB] is showing accesses per aggregate interval in [0, 5] for aggregate
+interval in [10, 20], page out the region", check the entered scheme again, and
+finally remove the scheme. ::
+
+    # cd <debugfs>/damon
+    # echo "4096 8192    0 5    10 20    2" > schemes
+    # cat schemes
+    4096 8192 0 5 10 20 2 0 0
+    # echo > schemes
+
+The last two integers in the 4th line of the above example are the total number
+and the total size of the regions that the scheme is applied to.
+
+
 Turning On/Off
 --------------
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 230/262] mm/damon/dbgfs: allow users to set initial monitoring target regions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (228 preceding siblings ...)
  2021-11-05 20:46 ` [patch 229/262] Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 231/262] mm/damon/dbgfs-test: add a unit test case for 'init_regions' Andrew Morton
                   ` (31 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: allow users to set initial monitoring target regions

Patch series "DAMON: Support Physical Memory Address Space Monitoring:.

DAMON currently supports only virtual address space monitoring.  It can
be easily extended for various use cases and address spaces by configuring
its monitoring primitives layer to use appropriate primitive
implementations, though.  This patchset implements monitoring primitives
for physical address space monitoring using that structure.

The first 3 patches allow user space users to manually set the monitoring
regions.  The 1st patch implements the feature in 'damon-dbgfs'.  Then,
patches adding a unit test (the 2nd patch) and updating the documentation
(the 3rd patch) follow.

The following 4 patches implement the physical address space monitoring
primitives.  The 4th patch makes some functions of the virtual address
space primitives reusable.  The 5th patch implements the physical address
space monitoring primitives.  The 6th patch links the primitives to
'damon-dbgfs'.  Finally, the 7th patch documents these new features.


This patch (of 7):

Some 'damon-dbgfs' users would want to monitor only a part of the entire
virtual memory address space.  Users of the in-kernel programming interface
could use the '->before_start()' callback or set the regions inside the
context struct as they want, but 'damon-dbgfs' users cannot.

For that reason, this commit introduces a new debugfs file called
'init_regions'.  'damon-dbgfs' users can specify the initial monitoring
target address regions they want by writing special input to the file.
The input should describe each region on its own line, in the below form:

    <pid> <start address> <end address>

Note that the regions will be updated to cover the entire mapped memory
regions after a 'regions update interval' has passed.  If you do not want
the regions to be updated after the initial setting, you can set the
interval to a very long time, say, a few decades.
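
For example (the values mirror the unit test added later in this series),
the below input restricts the initial monitoring targets of the target with
id 2 to three small address ranges::

    2 10 20
    2 20 30
    2 35 45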

Link: https://lkml.kernel.org/r/20211012205711.29216-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211012205711.29216-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |  156 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 154 insertions(+), 2 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-allow-users-to-set-initial-monitoring-target-regions
+++ a/mm/damon/dbgfs.c
@@ -394,6 +394,152 @@ out:
 	return ret;
 }
 
+static ssize_t sprint_init_regions(struct damon_ctx *c, char *buf, ssize_t len)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	int written = 0;
+	int rc;
+
+	damon_for_each_target(t, c) {
+		damon_for_each_region(r, t) {
+			rc = scnprintf(&buf[written], len - written,
+					"%lu %lu %lu\n",
+					t->id, r->ar.start, r->ar.end);
+			if (!rc)
+				return -ENOMEM;
+			written += rc;
+		}
+	}
+	return written;
+}
+
+static ssize_t dbgfs_init_regions_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf;
+	ssize_t len;
+
+	kbuf = kmalloc(count, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		mutex_unlock(&ctx->kdamond_lock);
+		len = -EBUSY;
+		goto out;
+	}
+
+	len = sprint_init_regions(ctx, kbuf, count);
+	mutex_unlock(&ctx->kdamond_lock);
+	if (len < 0)
+		goto out;
+	len = simple_read_from_buffer(buf, count, ppos, kbuf, len);
+
+out:
+	kfree(kbuf);
+	return len;
+}
+
+static int add_init_region(struct damon_ctx *c,
+			 unsigned long target_id, struct damon_addr_range *ar)
+{
+	struct damon_target *t;
+	struct damon_region *r, *prev;
+	unsigned long id;
+	int rc = -EINVAL;
+
+	if (ar->start >= ar->end)
+		return -EINVAL;
+
+	damon_for_each_target(t, c) {
+		id = t->id;
+		if (targetid_is_pid(c))
+			id = (unsigned long)pid_vnr((struct pid *)id);
+		if (id == target_id) {
+			r = damon_new_region(ar->start, ar->end);
+			if (!r)
+				return -ENOMEM;
+			damon_add_region(r, t);
+			if (damon_nr_regions(t) > 1) {
+				prev = damon_prev_region(r);
+				if (prev->ar.end > r->ar.start) {
+					damon_destroy_region(r, t);
+					return -EINVAL;
+				}
+			}
+			rc = 0;
+		}
+	}
+	return rc;
+}
+
+static int set_init_regions(struct damon_ctx *c, const char *str, ssize_t len)
+{
+	struct damon_target *t;
+	struct damon_region *r, *next;
+	int pos = 0, parsed, ret;
+	unsigned long target_id;
+	struct damon_addr_range ar;
+	int err;
+
+	damon_for_each_target(t, c) {
+		damon_for_each_region_safe(r, next, t)
+			damon_destroy_region(r, t);
+	}
+
+	while (pos < len) {
+		ret = sscanf(&str[pos], "%lu %lu %lu%n",
+				&target_id, &ar.start, &ar.end, &parsed);
+		if (ret != 3)
+			break;
+		err = add_init_region(c, target_id, &ar);
+		if (err)
+			goto fail;
+		pos += parsed;
+	}
+
+	return 0;
+
+fail:
+	damon_for_each_target(t, c) {
+		damon_for_each_region_safe(r, next, t)
+			damon_destroy_region(r, t);
+	}
+	return err;
+}
+
+static ssize_t dbgfs_init_regions_write(struct file *file,
+					  const char __user *buf, size_t count,
+					  loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf;
+	ssize_t ret = count;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ret = -EBUSY;
+		goto unlock_out;
+	}
+
+	err = set_init_regions(ctx, kbuf, ret);
+	if (err)
+		ret = err;
+
+unlock_out:
+	mutex_unlock(&ctx->kdamond_lock);
+	kfree(kbuf);
+	return ret;
+}
+
 static ssize_t dbgfs_kdamond_pid_read(struct file *file,
 		char __user *buf, size_t count, loff_t *ppos)
 {
@@ -445,6 +591,12 @@ static const struct file_operations targ
 	.write = dbgfs_target_ids_write,
 };
 
+static const struct file_operations init_regions_fops = {
+	.open = damon_dbgfs_open,
+	.read = dbgfs_init_regions_read,
+	.write = dbgfs_init_regions_write,
+};
+
 static const struct file_operations kdamond_pid_fops = {
 	.open = damon_dbgfs_open,
 	.read = dbgfs_kdamond_pid_read,
@@ -453,9 +605,9 @@ static const struct file_operations kdam
 static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx)
 {
 	const char * const file_names[] = {"attrs", "schemes", "target_ids",
-		"kdamond_pid"};
+		"init_regions", "kdamond_pid"};
 	const struct file_operations *fops[] = {&attrs_fops, &schemes_fops,
-		&target_ids_fops, &kdamond_pid_fops};
+		&target_ids_fops, &init_regions_fops, &kdamond_pid_fops};
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(file_names); i++)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 231/262] mm/damon/dbgfs-test: add a unit test case for 'init_regions'
  2021-11-05 20:34 incoming Andrew Morton
                   ` (229 preceding siblings ...)
  2021-11-05 20:46 ` [patch 230/262] mm/damon/dbgfs: allow users to set initial monitoring target regions Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 232/262] Docs/admin-guide/mm/damon: document 'init_regions' feature Andrew Morton
                   ` (30 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs-test: add a unit test case for 'init_regions'

This commit adds another test case for the new feature, 'init_regions'.

Link: https://lkml.kernel.org/r/20211012205711.29216-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs-test.h |   54 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

--- a/mm/damon/dbgfs-test.h~mm-damon-dbgfs-test-add-a-unit-test-case-for-init_regions
+++ a/mm/damon/dbgfs-test.h
@@ -109,9 +109,63 @@ static void damon_dbgfs_test_set_targets
 	dbgfs_destroy_ctx(ctx);
 }
 
+static void damon_dbgfs_test_set_init_regions(struct kunit *test)
+{
+	struct damon_ctx *ctx = damon_new_ctx();
+	unsigned long ids[] = {1, 2, 3};
+	/* Each line represents one region in ``<target id> <start> <end>`` */
+	char * const valid_inputs[] = {"2 10 20\n 2   20 30\n2 35 45",
+		"2 10 20\n",
+		"2 10 20\n1 39 59\n1 70 134\n  2  20 25\n",
+		""};
+	/* Reading the file again will show sorted, clean output */
+	char * const valid_expects[] = {"2 10 20\n2 20 30\n2 35 45\n",
+		"2 10 20\n",
+		"1 39 59\n1 70 134\n2 10 20\n2 20 25\n",
+		""};
+	char * const invalid_inputs[] = {"4 10 20\n",	/* target not exists */
+		"2 10 20\n 2 14 26\n",		/* regions overlap */
+		"1 10 20\n2 30 40\n 1 5 8"};	/* not sorted by address */
+	char *input, *expect;
+	int i, rc;
+	char buf[256];
+
+	damon_set_targets(ctx, ids, 3);
+
+	/* Put valid inputs and check the results */
+	for (i = 0; i < ARRAY_SIZE(valid_inputs); i++) {
+		input = valid_inputs[i];
+		expect = valid_expects[i];
+
+		rc = set_init_regions(ctx, input, strnlen(input, 256));
+		KUNIT_EXPECT_EQ(test, rc, 0);
+
+		memset(buf, 0, 256);
+		sprint_init_regions(ctx, buf, 256);
+
+		KUNIT_EXPECT_STREQ(test, (char *)buf, expect);
+	}
+	/* Put invalid inputs and check the return error code */
+	for (i = 0; i < ARRAY_SIZE(invalid_inputs); i++) {
+		input = invalid_inputs[i];
+		pr_info("input: %s\n", input);
+		rc = set_init_regions(ctx, input, strnlen(input, 256));
+		KUNIT_EXPECT_EQ(test, rc, -EINVAL);
+
+		memset(buf, 0, 256);
+		sprint_init_regions(ctx, buf, 256);
+
+		KUNIT_EXPECT_STREQ(test, (char *)buf, "");
+	}
+
+	damon_set_targets(ctx, NULL, 0);
+	damon_destroy_ctx(ctx);
+}
+
 static struct kunit_case damon_test_cases[] = {
 	KUNIT_CASE(damon_dbgfs_test_str_to_target_ids),
 	KUNIT_CASE(damon_dbgfs_test_set_targets),
+	KUNIT_CASE(damon_dbgfs_test_set_init_regions),
 	{},
 };
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 232/262] Docs/admin-guide/mm/damon: document 'init_regions' feature
  2021-11-05 20:34 incoming Andrew Morton
                   ` (230 preceding siblings ...)
  2021-11-05 20:46 ` [patch 231/262] mm/damon/dbgfs-test: add a unit test case for 'init_regions' Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 233/262] mm/damon/vaddr: separate commonly usable functions Andrew Morton
                   ` (29 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/damon: document 'init_regions' feature

This commit adds a description of the 'init_regions' feature to the DAMON
usage document.

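For illustration only (not part of this patch), a user-space wrapper could
pre-validate a line of the documented ``<target id> <start address> <end
address>`` form before writing it to the ``init_regions`` file; the helper
name below is hypothetical and this is just a sketch, not the kernel's
parser:

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical helper: sanity-check one init_regions line. */
    static bool example_region_line_ok(const char *line)
    {
        unsigned long target_id, start, end;

        if (sscanf(line, "%lu %lu %lu", &target_id, &start, &end) != 3)
            return false;
        /* a region must have a positive size */
        return start < end;
    }

The kernel additionally rejects lines whose target id is not in
'target_ids', overlapping regions, and regions not sorted by address, as
the test case of the previous patch exercises.
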
Link: https://lkml.kernel.org/r/20211012205711.29216-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/usage.rst |   41 ++++++++++++++++-
 1 file changed, 39 insertions(+), 2 deletions(-)

--- a/Documentation/admin-guide/mm/damon/usage.rst~docs-admin-guide-mm-damon-document-init_regions-feature
+++ a/Documentation/admin-guide/mm/damon/usage.rst
@@ -34,8 +34,9 @@ the reason, this document describes only
 debugfs Interface
 =================
 
-DAMON exports four files, ``attrs``, ``target_ids``, ``schemes`` and
-``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.
+DAMON exports five files, ``attrs``, ``target_ids``, ``init_regions``,
+``schemes`` and ``monitor_on`` under its debugfs directory,
+``<debugfs>/damon/``.
 
 
 Attributes
@@ -74,6 +75,42 @@ check it again::
 Note that setting the target ids doesn't start the monitoring.
 
 
+Initial Monitoring Target Regions
+---------------------------------
+
+In case of the debugfs based monitoring, DAMON automatically sets and updates
+the monitoring target regions so that entire memory mappings of target
+processes can be covered.  However, users may want to limit the monitoring
+region to specific address ranges, such as the heap, the stack, or a specific
+file-mapped area.  Or, some users may know the initial access pattern of their
+workloads and therefore want to set optimal initial regions for the 'adaptive
+regions adjustment'.
+
+In such cases, users can explicitly set the initial monitoring target regions
+as they want, by writing proper values to the ``init_regions`` file.  Each line
+of the input should represent one region in the below form::
+
+    <target id> <start address> <end address>
+
+The ``target id`` should already be in the ``target_ids`` file, and the regions
+should be passed in address order.  For example, the below commands will set a
+couple of address ranges, ``1-100`` and ``100-200``, as the initial monitoring
+target regions of process 42, and another couple of address ranges, ``20-40``
+and ``50-100``, as those of process 4242::
+
+    # cd <debugfs>/damon
+    # echo "42   1       100
+            42   100     200
+            4242 20      40
+            4242 50      100" > init_regions
+
+Note that this sets the initial monitoring target regions only.  In the case
+of virtual memory monitoring, DAMON will automatically update the boundaries
+of the regions after one ``regions update interval``.  Therefore, users should
+set the ``regions update interval`` large enough in this case, if they don't
+want the update.
+
+
 Schemes
 -------
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 233/262] mm/damon/vaddr: separate commonly usable functions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (231 preceding siblings ...)
  2021-11-05 20:46 ` [patch 232/262] Docs/admin-guide/mm/damon: document 'init_regions' feature Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:46 ` [patch 234/262] mm/damon: implement primitives for physical address space monitoring Andrew Morton
                   ` (28 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/vaddr: separate commonly usable functions

This commit moves the functions of the default virtual address space
monitoring primitives that are commonly usable from other address spaces,
such as the physical address space, into a common source file and header.
Those will be reused by the physical address space monitoring primitives,
which will be implemented by the following commit.

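As a purely illustrative sketch (not part of this patch), any new
address-space implementation can now include the shared header and reuse
the helpers; for example, damon_get_page() returns the page with a
reference taken, or NULL.  The function below is made up for illustration:

    #include <linux/mm.h>

    #include "prmtv-common.h"

    /* Illustration only: probe whether a PFN maps an online LRU page. */
    static bool example_pfn_on_lru(unsigned long pfn)
    {
        struct page *page = damon_get_page(pfn);

        if (!page)
            return false;
        /* damon_get_page() took a reference; drop it again */
        put_page(page);
        return true;
    }
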
[sj@kernel.org: include 'highmem.h' to fix a build failure]
  Link: https://lkml.kernel.org/r/20211014110848.5204-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211012205711.29216-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/Makefile       |    2 
 mm/damon/prmtv-common.c |   87 +++++++++++++++++++++++++++++++++++++
 mm/damon/prmtv-common.h |   17 +++++++
 mm/damon/vaddr.c        |   88 +-------------------------------------
 4 files changed, 108 insertions(+), 86 deletions(-)

--- a/mm/damon/Makefile~mm-damon-vaddr-separate-commonly-usable-functions
+++ a/mm/damon/Makefile
@@ -1,5 +1,5 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-$(CONFIG_DAMON)		:= core.o
-obj-$(CONFIG_DAMON_VADDR)	+= vaddr.o
+obj-$(CONFIG_DAMON_VADDR)	+= prmtv-common.o vaddr.o
 obj-$(CONFIG_DAMON_DBGFS)	+= dbgfs.o
--- /dev/null
+++ a/mm/damon/prmtv-common.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Common Primitives for Data Access Monitoring
+ *
+ * Author: SeongJae Park <sj@kernel.org>
+ */
+
+#include <linux/mmu_notifier.h>
+#include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+
+#include "prmtv-common.h"
+
+/*
+ * Get an online page for a pfn if it's in the LRU list.  Otherwise, returns
+ * NULL.
+ *
+ * The body of this function is stolen from the 'page_idle_get_page()'.  We
+ * steal rather than reuse it because the code is quite simple.
+ */
+struct page *damon_get_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_online_page(pfn);
+
+	if (!page || !PageLRU(page) || !get_page_unless_zero(page))
+		return NULL;
+
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		page = NULL;
+	}
+	return page;
+}
+
+void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr)
+{
+	bool referenced = false;
+	struct page *page = damon_get_page(pte_pfn(*pte));
+
+	if (!page)
+		return;
+
+	if (pte_young(*pte)) {
+		referenced = true;
+		*pte = pte_mkold(*pte);
+	}
+
+#ifdef CONFIG_MMU_NOTIFIER
+	if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE))
+		referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+	if (referenced)
+		set_page_young(page);
+
+	set_page_idle(page);
+	put_page(page);
+}
+
+void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bool referenced = false;
+	struct page *page = damon_get_page(pmd_pfn(*pmd));
+
+	if (!page)
+		return;
+
+	if (pmd_young(*pmd)) {
+		referenced = true;
+		*pmd = pmd_mkold(*pmd);
+	}
+
+#ifdef CONFIG_MMU_NOTIFIER
+	if (mmu_notifier_clear_young(mm, addr,
+				addr + ((1UL) << HPAGE_PMD_SHIFT)))
+		referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+	if (referenced)
+		set_page_young(page);
+
+	set_page_idle(page);
+	put_page(page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
--- /dev/null
+++ a/mm/damon/prmtv-common.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Common Primitives for Data Access Monitoring
+ *
+ * Author: SeongJae Park <sj@kernel.org>
+ */
+
+#include <linux/damon.h>
+#include <linux/random.h>
+
+/* Get a random number in [l, r) */
+#define damon_rand(l, r) (l + prandom_u32_max(r - l))
+
+struct page *damon_get_page(unsigned long pfn);
+
+void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr);
+void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr);
--- a/mm/damon/vaddr.c~mm-damon-vaddr-separate-commonly-usable-functions
+++ a/mm/damon/vaddr.c
@@ -8,25 +8,19 @@
 #define pr_fmt(fmt) "damon-va: " fmt
 
 #include <asm-generic/mman-common.h>
-#include <linux/damon.h>
+#include <linux/highmem.h>
 #include <linux/hugetlb.h>
-#include <linux/mm.h>
 #include <linux/mmu_notifier.h>
-#include <linux/highmem.h>
 #include <linux/page_idle.h>
 #include <linux/pagewalk.h>
-#include <linux/random.h>
-#include <linux/sched/mm.h>
-#include <linux/slab.h>
+
+#include "prmtv-common.h"
 
 #ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
 #undef DAMON_MIN_REGION
 #define DAMON_MIN_REGION 1
 #endif
 
-/* Get a random number in [l, r) */
-#define damon_rand(l, r) (l + prandom_u32_max(r - l))
-
 /*
  * 't->id' should be the pointer to the relevant 'struct pid' having reference
  * count.  Caller must put the returned task, unless it is NULL.
@@ -373,82 +367,6 @@ void damon_va_update(struct damon_ctx *c
 	}
 }
 
-/*
- * Get an online page for a pfn if it's in the LRU list.  Otherwise, returns
- * NULL.
- *
- * The body of this function is stolen from the 'page_idle_get_page()'.  We
- * steal rather than reuse it because the code is quite simple.
- */
-static struct page *damon_get_page(unsigned long pfn)
-{
-	struct page *page = pfn_to_online_page(pfn);
-
-	if (!page || !PageLRU(page) || !get_page_unless_zero(page))
-		return NULL;
-
-	if (unlikely(!PageLRU(page))) {
-		put_page(page);
-		page = NULL;
-	}
-	return page;
-}
-
-static void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm,
-			     unsigned long addr)
-{
-	bool referenced = false;
-	struct page *page = damon_get_page(pte_pfn(*pte));
-
-	if (!page)
-		return;
-
-	if (pte_young(*pte)) {
-		referenced = true;
-		*pte = pte_mkold(*pte);
-	}
-
-#ifdef CONFIG_MMU_NOTIFIER
-	if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE))
-		referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
-
-	if (referenced)
-		set_page_young(page);
-
-	set_page_idle(page);
-	put_page(page);
-}
-
-static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm,
-			     unsigned long addr)
-{
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	bool referenced = false;
-	struct page *page = damon_get_page(pmd_pfn(*pmd));
-
-	if (!page)
-		return;
-
-	if (pmd_young(*pmd)) {
-		referenced = true;
-		*pmd = pmd_mkold(*pmd);
-	}
-
-#ifdef CONFIG_MMU_NOTIFIER
-	if (mmu_notifier_clear_young(mm, addr,
-				addr + ((1UL) << HPAGE_PMD_SHIFT)))
-		referenced = true;
-#endif /* CONFIG_MMU_NOTIFIER */
-
-	if (referenced)
-		set_page_young(page);
-
-	set_page_idle(page);
-	put_page(page);
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-}
-
 static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
 		unsigned long next, struct mm_walk *walk)
 {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 234/262] mm/damon: implement primitives for physical address space monitoring
  2021-11-05 20:34 incoming Andrew Morton
                   ` (232 preceding siblings ...)
  2021-11-05 20:46 ` [patch 233/262] mm/damon/vaddr: separate commonly usable functions Andrew Morton
@ 2021-11-05 20:46 ` Andrew Morton
  2021-11-05 20:47 ` [patch 235/262] mm/damon/dbgfs: support physical memory monitoring Andrew Morton
                   ` (27 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:46 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon: implement primitives for physical address space monitoring

This commit implements the monitoring primitives for the physical memory
address space.  Internally, it uses the PTE Accessed bit, similarly to the
virtual address space monitoring primitives.  It supports only user memory
pages, as idle page tracking does.  If the monitoring target physical
memory address range contains non-user memory pages, the access check
simply treats those pages as not accessed.

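As a rough kernel-space sketch (not part of this patch, assuming the DAMON
core API introduced earlier in this series, with error handling
abbreviated), a client could plug these primitives into a monitoring
context as below.  Because the 'init' primitive is NULL for the physical
address space, the initial regions are set manually:

    #include <linux/damon.h>
    #include <linux/errno.h>

    /* Illustration only: monitor one physical address range with DAMON. */
    static struct damon_ctx *example_ctx;

    static int example_start_paddr_monitoring(unsigned long start,
                                               unsigned long end)
    {
        unsigned long target_id = 42;   /* ids are meaningless for paddr */
        struct damon_target *t;
        int err;

        example_ctx = damon_new_ctx();
        if (!example_ctx)
            return -ENOMEM;

        damon_pa_set_primitives(example_ctx);
        err = damon_set_targets(example_ctx, &target_id, 1);
        if (err)
            return err;
        /* 5ms sampling, 100ms aggregation, 1s regions update, 10-1000 regions */
        err = damon_set_attrs(example_ctx, 5000, 100000, 1000000, 10, 1000);
        if (err)
            return err;
        /* no 'init' primitive for paddr, so set the regions by hand */
        damon_for_each_target(t, example_ctx)
            damon_add_region(damon_new_region(start, end), t);

        return damon_start(&example_ctx, 1);
    }
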
Link: https://lkml.kernel.org/r/20211012205711.29216-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   10 +
 mm/damon/Kconfig      |    8 +
 mm/damon/Makefile     |    1 
 mm/damon/paddr.c      |  224 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 243 insertions(+)

--- a/include/linux/damon.h~mm-damon-implement-primitives-for-physical-address-space-monitoring
+++ a/include/linux/damon.h
@@ -351,4 +351,14 @@ void damon_va_set_primitives(struct damo
 
 #endif	/* CONFIG_DAMON_VADDR */
 
+#ifdef CONFIG_DAMON_PADDR
+
+/* Monitoring primitives for the physical memory address space */
+void damon_pa_prepare_access_checks(struct damon_ctx *ctx);
+unsigned int damon_pa_check_accesses(struct damon_ctx *ctx);
+bool damon_pa_target_valid(void *t);
+void damon_pa_set_primitives(struct damon_ctx *ctx);
+
+#endif	/* CONFIG_DAMON_PADDR */
+
 #endif	/* _DAMON_H */
--- a/mm/damon/Kconfig~mm-damon-implement-primitives-for-physical-address-space-monitoring
+++ a/mm/damon/Kconfig
@@ -32,6 +32,14 @@ config DAMON_VADDR
 	  This builds the default data access monitoring primitives for DAMON
 	  that work for virtual address spaces.
 
+config DAMON_PADDR
+	bool "Data access monitoring primitives for the physical address space"
+	depends on DAMON && MMU
+	select PAGE_IDLE_FLAG
+	help
+	  This builds the default data access monitoring primitives for DAMON
+	  that works for the physical address space.
+
 config DAMON_VADDR_KUNIT_TEST
 	bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS
 	depends on DAMON_VADDR && KUNIT=y
--- a/mm/damon/Makefile~mm-damon-implement-primitives-for-physical-address-space-monitoring
+++ a/mm/damon/Makefile
@@ -2,4 +2,5 @@
 
 obj-$(CONFIG_DAMON)		:= core.o
 obj-$(CONFIG_DAMON_VADDR)	+= prmtv-common.o vaddr.o
+obj-$(CONFIG_DAMON_PADDR)	+= prmtv-common.o paddr.o
 obj-$(CONFIG_DAMON_DBGFS)	+= dbgfs.o
--- /dev/null
+++ a/mm/damon/paddr.c
@@ -0,0 +1,224 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAMON Primitives for The Physical Address Space
+ *
+ * Author: SeongJae Park <sj@kernel.org>
+ */
+
+#define pr_fmt(fmt) "damon-pa: " fmt
+
+#include <linux/mmu_notifier.h>
+#include <linux/page_idle.h>
+#include <linux/pagemap.h>
+#include <linux/rmap.h>
+
+#include "prmtv-common.h"
+
+static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma,
+		unsigned long addr, void *arg)
+{
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = addr,
+	};
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		addr = pvmw.address;
+		if (pvmw.pte)
+			damon_ptep_mkold(pvmw.pte, vma->vm_mm, addr);
+		else
+			damon_pmdp_mkold(pvmw.pmd, vma->vm_mm, addr);
+	}
+	return true;
+}
+
+static void damon_pa_mkold(unsigned long paddr)
+{
+	struct page *page = damon_get_page(PHYS_PFN(paddr));
+	struct rmap_walk_control rwc = {
+		.rmap_one = __damon_pa_mkold,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+	bool need_lock;
+
+	if (!page)
+		return;
+
+	if (!page_mapped(page) || !page_rmapping(page)) {
+		set_page_idle(page);
+		goto out;
+	}
+
+	need_lock = !PageAnon(page) || PageKsm(page);
+	if (need_lock && !trylock_page(page))
+		goto out;
+
+	rmap_walk(page, &rwc);
+
+	if (need_lock)
+		unlock_page(page);
+
+out:
+	put_page(page);
+}
+
+static void __damon_pa_prepare_access_check(struct damon_ctx *ctx,
+					    struct damon_region *r)
+{
+	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+
+	damon_pa_mkold(r->sampling_addr);
+}
+
+void damon_pa_prepare_access_checks(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+
+	damon_for_each_target(t, ctx) {
+		damon_for_each_region(r, t)
+			__damon_pa_prepare_access_check(ctx, r);
+	}
+}
+
+struct damon_pa_access_chk_result {
+	unsigned long page_sz;
+	bool accessed;
+};
+
+static bool __damon_pa_young(struct page *page, struct vm_area_struct *vma,
+		unsigned long addr, void *arg)
+{
+	struct damon_pa_access_chk_result *result = arg;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = addr,
+	};
+
+	result->accessed = false;
+	result->page_sz = PAGE_SIZE;
+	while (page_vma_mapped_walk(&pvmw)) {
+		addr = pvmw.address;
+		if (pvmw.pte) {
+			result->accessed = pte_young(*pvmw.pte) ||
+				!page_is_idle(page) ||
+				mmu_notifier_test_young(vma->vm_mm, addr);
+		} else {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			result->accessed = pmd_young(*pvmw.pmd) ||
+				!page_is_idle(page) ||
+				mmu_notifier_test_young(vma->vm_mm, addr);
+			result->page_sz = ((1UL) << HPAGE_PMD_SHIFT);
+#else
+			WARN_ON_ONCE(1);
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+		}
+		if (result->accessed) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+	}
+
+	/* If accessed, stop walking */
+	return !result->accessed;
+}
+
+static bool damon_pa_young(unsigned long paddr, unsigned long *page_sz)
+{
+	struct page *page = damon_get_page(PHYS_PFN(paddr));
+	struct damon_pa_access_chk_result result = {
+		.page_sz = PAGE_SIZE,
+		.accessed = false,
+	};
+	struct rmap_walk_control rwc = {
+		.arg = &result,
+		.rmap_one = __damon_pa_young,
+		.anon_lock = page_lock_anon_vma_read,
+	};
+	bool need_lock;
+
+	if (!page)
+		return false;
+
+	if (!page_mapped(page) || !page_rmapping(page)) {
+		if (page_is_idle(page))
+			result.accessed = false;
+		else
+			result.accessed = true;
+		put_page(page);
+		goto out;
+	}
+
+	need_lock = !PageAnon(page) || PageKsm(page);
+	if (need_lock && !trylock_page(page)) {
+		put_page(page);
+		return false;
+	}
+
+	rmap_walk(page, &rwc);
+
+	if (need_lock)
+		unlock_page(page);
+	put_page(page);
+
+out:
+	*page_sz = result.page_sz;
+	return result.accessed;
+}
+
+static void __damon_pa_check_access(struct damon_ctx *ctx,
+				    struct damon_region *r)
+{
+	static unsigned long last_addr;
+	static unsigned long last_page_sz = PAGE_SIZE;
+	static bool last_accessed;
+
+	/* If the region is in the last checked page, reuse the result */
+	if (ALIGN_DOWN(last_addr, last_page_sz) ==
+				ALIGN_DOWN(r->sampling_addr, last_page_sz)) {
+		if (last_accessed)
+			r->nr_accesses++;
+		return;
+	}
+
+	last_accessed = damon_pa_young(r->sampling_addr, &last_page_sz);
+	if (last_accessed)
+		r->nr_accesses++;
+
+	last_addr = r->sampling_addr;
+}
+
+unsigned int damon_pa_check_accesses(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned int max_nr_accesses = 0;
+
+	damon_for_each_target(t, ctx) {
+		damon_for_each_region(r, t) {
+			__damon_pa_check_access(ctx, r);
+			max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
+		}
+	}
+
+	return max_nr_accesses;
+}
+
+bool damon_pa_target_valid(void *t)
+{
+	return true;
+}
+
+void damon_pa_set_primitives(struct damon_ctx *ctx)
+{
+	ctx->primitive.init = NULL;
+	ctx->primitive.update = NULL;
+	ctx->primitive.prepare_access_checks = damon_pa_prepare_access_checks;
+	ctx->primitive.check_accesses = damon_pa_check_accesses;
+	ctx->primitive.reset_aggregated = NULL;
+	ctx->primitive.target_valid = damon_pa_target_valid;
+	ctx->primitive.cleanup = NULL;
+	ctx->primitive.apply_scheme = NULL;
+}
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 235/262] mm/damon/dbgfs: support physical memory monitoring
  2021-11-05 20:34 incoming Andrew Morton
                   ` (233 preceding siblings ...)
  2021-11-05 20:46 ` [patch 234/262] mm/damon: implement primitives for physical address space monitoring Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 236/262] Docs/DAMON: document physical memory monitoring support Andrew Morton
                   ` (26 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: support physical memory monitoring

This commit makes 'damon-dbgfs' support physical memory monitoring, in
addition to the virtual memory monitoring.

Users can do the physical memory monitoring by writing a special keyword,
'paddr' to the 'target_ids' debugfs file.  Then, DAMON will check the
special keyword and configure the monitoring context to run with the
primitives for the physical address space.

Unlike with the virtual memory monitoring, the monitoring target regions
will not be set automatically.  Therefore, users should also set the
monitoring target address regions using the 'init_regions' debugfs file.

Also, note that the physical memory monitoring will not be automatically
terminated.  The user should explicitly turn off the monitoring by writing
'off' to the 'monitor_on' debugfs file.

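For illustration only (not part of this patch), the sequence described
above could be driven from user space roughly as below, assuming debugfs
is mounted at /sys/kernel/debug and the DAMON debugfs interface is
enabled; the physical address range is just an example:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Illustration only: write a short string to a debugfs file. */
    static int write_str(const char *path, const char *s)
    {
        int fd = open(path, O_WRONLY);
        ssize_t len = (ssize_t)strlen(s);
        int ret = 0;

        if (fd < 0)
            return -1;
        if (write(fd, s, len) != len)
            ret = -1;
        close(fd);
        return ret;
    }

    int main(void)
    {
        const char *dir = "/sys/kernel/debug/damon";
        char path[128];

        /* 1. select the physical address space primitives */
        snprintf(path, sizeof(path), "%s/target_ids", dir);
        if (write_str(path, "paddr\n"))
            return 1;
        /* 2. set the monitoring target region(s) manually (example range) */
        snprintf(path, sizeof(path), "%s/init_regions", dir);
        if (write_str(path, "42 4294967296 8589934592\n"))
            return 1;
        /* 3. start monitoring; write "off" later to stop it explicitly */
        snprintf(path, sizeof(path), "%s/monitor_on", dir);
        if (write_str(path, "on"))
            return 1;
        return 0;
    }
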
Link: https://lkml.kernel.org/r/20211012205711.29216-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/Kconfig |    2 +-
 mm/damon/dbgfs.c |   21 ++++++++++++++++++---
 2 files changed, 19 insertions(+), 4 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-physical-memory-monitoring
+++ a/mm/damon/dbgfs.c
@@ -339,6 +339,7 @@ static ssize_t dbgfs_target_ids_write(st
 		const char __user *buf, size_t count, loff_t *ppos)
 {
 	struct damon_ctx *ctx = file->private_data;
+	bool id_is_pid = true;
 	char *kbuf, *nrs;
 	unsigned long *targets;
 	ssize_t nr_targets;
@@ -351,6 +352,11 @@ static ssize_t dbgfs_target_ids_write(st
 		return PTR_ERR(kbuf);
 
 	nrs = kbuf;
+	if (!strncmp(kbuf, "paddr\n", count)) {
+		id_is_pid = false;
+		/* target id is meaningless here, but we set it just for fun */
+		scnprintf(kbuf, count, "42    ");
+	}
 
 	targets = str_to_target_ids(nrs, ret, &nr_targets);
 	if (!targets) {
@@ -358,7 +364,7 @@ static ssize_t dbgfs_target_ids_write(st
 		goto out;
 	}
 
-	if (targetid_is_pid(ctx)) {
+	if (id_is_pid) {
 		for (i = 0; i < nr_targets; i++) {
 			targets[i] = (unsigned long)find_get_pid(
 					(int)targets[i]);
@@ -372,15 +378,24 @@ static ssize_t dbgfs_target_ids_write(st
 
 	mutex_lock(&ctx->kdamond_lock);
 	if (ctx->kdamond) {
-		if (targetid_is_pid(ctx))
+		if (id_is_pid)
 			dbgfs_put_pids(targets, nr_targets);
 		ret = -EBUSY;
 		goto unlock_out;
 	}
 
+	/* remove targets with previously-set primitive */
+	damon_set_targets(ctx, NULL, 0);
+
+	/* Configure the context for the address space type */
+	if (id_is_pid)
+		damon_va_set_primitives(ctx);
+	else
+		damon_pa_set_primitives(ctx);
+
 	err = damon_set_targets(ctx, targets, nr_targets);
 	if (err) {
-		if (targetid_is_pid(ctx))
+		if (id_is_pid)
 			dbgfs_put_pids(targets, nr_targets);
 		ret = err;
 	}
--- a/mm/damon/Kconfig~mm-damon-dbgfs-support-physical-memory-monitoring
+++ a/mm/damon/Kconfig
@@ -54,7 +54,7 @@ config DAMON_VADDR_KUNIT_TEST
 
 config DAMON_DBGFS
 	bool "DAMON debugfs interface"
-	depends on DAMON_VADDR && DEBUG_FS
+	depends on DAMON_VADDR && DAMON_PADDR && DEBUG_FS
 	help
 	  This builds the debugfs interface for DAMON.  The user space admins
 	  can use the interface for arbitrary data access monitoring.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 236/262] Docs/DAMON: document physical memory monitoring support
  2021-11-05 20:34 incoming Andrew Morton
                   ` (234 preceding siblings ...)
  2021-11-05 20:47 ` [patch 235/262] mm/damon/dbgfs: support physical memory monitoring Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 237/262] mm/damon/vaddr: constify static mm_walk_ops Andrew Morton
                   ` (25 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, brendanhiggins, corbet, david, dwmw, elver,
	foersleo, gthelen, Jonathan.Cameron, linux-mm, markubo,
	mm-commits, rientjes, shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/DAMON: document physical memory monitoring support

This commit updates the DAMON documents for the physical memory address
space monitoring support.

Link: https://lkml.kernel.org/r/20211012205711.29216-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/usage.rst |   25 +++++++++++---
 Documentation/vm/damon/design.rst            |   29 ++++++++++-------
 Documentation/vm/damon/faq.rst               |    5 +-
 3 files changed, 40 insertions(+), 19 deletions(-)

--- a/Documentation/admin-guide/mm/damon/usage.rst~docs-damon-document-physical-memory-monitoring-support
+++ a/Documentation/admin-guide/mm/damon/usage.rst
@@ -10,15 +10,16 @@ DAMON provides below three interfaces fo
   This is for privileged people such as system administrators who want a
   just-working human-friendly interface.  Using this, users can use the DAMON’s
   major features in a human-friendly way.  It may not be highly tuned for
-  special cases, though.  It supports only virtual address spaces monitoring.
+  special cases, though.  It supports both virtual and physical address spaces
+  monitoring.
 - *debugfs interface.*
   This is for privileged user space programmers who want more optimized use of
   DAMON.  Using this, users can use DAMON’s major features by reading
   from and writing to special debugfs files.  Therefore, you can write and use
   your personalized DAMON debugfs wrapper programs that reads/writes the
   debugfs files instead of you.  The DAMON user space tool is also a reference
-  implementation of such programs.  It supports only virtual address spaces
-  monitoring.
+  implementation of such programs.  It supports both virtual and physical
+  address spaces monitoring.
 - *Kernel Space Programming Interface.*
   This is for kernel space programmers.  Using this, users can utilize every
   feature of DAMON most flexibly and efficiently by writing kernel space
@@ -72,20 +73,34 @@ check it again::
     # cat target_ids
     42 4242
 
+Users can also monitor the physical memory address space of the system by
+writing a special keyword, "``paddr\n``" to the file.  Because physical address
+space monitoring doesn't support multiple targets, reading the file will show a
+fake value, ``42``, as below::
+
+    # cd <debugfs>/damon
+    # echo paddr > target_ids
+    # cat target_ids
+    42
+
 Note that setting the target ids doesn't start the monitoring.
 
 
 Initial Monitoring Target Regions
 ---------------------------------
 
-In case of the debugfs based monitoring, DAMON automatically sets and updates
-the monitoring target regions so that entire memory mappings of target
+In case of the virtual address space monitoring, DAMON automatically sets and
+updates the monitoring target regions so that entire memory mappings of target
 processes can be covered.  However, users may want to limit the monitoring
 region to specific address ranges, such as the heap, the stack, or a specific
 file-mapped area.  Or, some users may know the initial access pattern of their
 workloads and therefore want to set optimal initial regions for the 'adaptive
 regions adjustment'.
 
+In contrast, DAMON does not automatically set and update the monitoring target
+regions in the case of physical memory monitoring.  Therefore, users should set
+the monitoring target regions by themselves.
+
 In such cases, users can explicitly set the initial monitoring target regions
 as they want, by writing proper values to the ``init_regions`` file.  Each line
 of the input should represent one region in the below form::
--- a/Documentation/vm/damon/design.rst~docs-damon-document-physical-memory-monitoring-support
+++ a/Documentation/vm/damon/design.rst
@@ -35,13 +35,17 @@ two parts:
 1. Identification of the monitoring target address range for the address space.
 2. Access check of specific address range in the target space.
 
-DAMON currently provides the implementation of the primitives for only the
-virtual address spaces. Below two subsections describe how it works.
+DAMON currently provides the implementations of the primitives for the physical
+and virtual address spaces.  The two subsections below describe how those work.
 
 
 VMA-based Target Address Range Construction
 -------------------------------------------
 
+This is only for the virtual address space primitives implementation.  The
+implementation for the physical address space simply asks users to manually set
+the monitoring target address ranges.
+
 Only small parts in the super-huge virtual address space of the processes are
 mapped to the physical memory and accessed.  Thus, tracking the unmapped
 address regions is just wasteful.  However, because DAMON can deal with some
@@ -71,15 +75,18 @@ to make a reasonable trade-off.  Below s
 PTE Accessed-bit Based Access Check
 -----------------------------------
 
-The implementation for the virtual address space uses PTE Accessed-bit for
-basic access checks.  It finds the relevant PTE Accessed bit from the address
-by walking the page table for the target task of the address.  In this way, the
-implementation finds and clears the bit for next sampling target address and
-checks whether the bit set again after one sampling period.  This could disturb
-other kernel subsystems using the Accessed bits, namely Idle page tracking and
-the reclaim logic.  To avoid such disturbances, DAMON makes it mutually
-exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page
-flags to solve the conflict with the reclaim logic, as Idle page tracking does.
+Both of the implementations for physical and virtual address spaces use the
+PTE Accessed bit for basic access checks.  The only difference is the way of
+finding the relevant PTE Accessed bit(s) from the address.  While the
+implementation for the virtual address walks the page table for the target task
+of the address, the implementation for the physical address walks every page
+table having a mapping to the address.  In this way, the implementations find
+and clear the bit(s) for the next sampling target address and check whether the
+bit(s) are set again after one sampling period.  This could disturb other
+kernel subsystems using the Accessed bits, namely Idle page tracking and the
+reclaim logic.  To avoid such disturbances, DAMON makes it mutually exclusive
+with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page flags to
+solve the conflict with the reclaim logic, as Idle page tracking does.
 
 
 Address Space Independent Core Mechanisms
--- a/Documentation/vm/damon/faq.rst~docs-damon-document-physical-memory-monitoring-support
+++ a/Documentation/vm/damon/faq.rst
@@ -36,10 +36,9 @@ constructions and actual access checks c
 DAMON core by the users.  In this way, DAMON users can monitor any address
 space with any access check technique.
 
-Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
+Nonetheless, DAMON provides vma/rmap tracking and PTE Accessed bit check based
 implementations of the address space dependent functions for the virtual memory
-by default, for a reference and convenient use.  In near future, we will
-provide those for physical memory address space.
+and the physical memory by default, for a reference and convenient use.
 
 
 Can I simply monitor page granularity?
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 237/262] mm/damon/vaddr: constify static mm_walk_ops
  2021-11-05 20:34 incoming Andrew Morton
                   ` (235 preceding siblings ...)
  2021-11-05 20:47 ` [patch 236/262] Docs/DAMON: document physical memory monitoring support Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 238/262] mm/damon/dbgfs: remove unnecessary variables Andrew Morton
                   ` (24 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, anshuman.khandual, linux-mm, mm-commits, rikard.falkeborn,
	sj, torvalds

From: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Subject: mm/damon/vaddr: constify static mm_walk_ops

The only usage of these structs is to pass their addresses to
walk_page_range(), which takes a pointer to const mm_walk_ops as argument.
Make them const to allow the compiler to put them in read-only memory.

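For illustration only (not part of this patch), the pattern looks like the
following; walk_page_range() takes a pointer to const mm_walk_ops, so the
ops table itself can be const:

    #include <linux/mm.h>
    #include <linux/pagewalk.h>

    /* Illustration only: a trivial pmd_entry callback. */
    static int example_pmd_entry(pmd_t *pmd, unsigned long addr,
                                 unsigned long next, struct mm_walk *walk)
    {
        return 0;
    }

    /* const, so the compiler may place the ops table in read-only memory */
    static const struct mm_walk_ops example_ops = {
        .pmd_entry = example_pmd_entry,
    };

    static void example_walk(struct mm_struct *mm, unsigned long start,
                             unsigned long end)
    {
        mmap_read_lock(mm);
        walk_page_range(mm, start, end, &example_ops, NULL);
        mmap_read_unlock(mm);
    }
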
Link: https://lkml.kernel.org/r/20211014075042.17174-2-rikard.falkeborn@gmail.com
Signed-off-by: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/vaddr.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/damon/vaddr.c~mm-damon-vaddr-constify-static-mm_walk_ops
+++ a/mm/damon/vaddr.c
@@ -394,7 +394,7 @@ out:
 	return 0;
 }
 
-static struct mm_walk_ops damon_mkold_ops = {
+static const struct mm_walk_ops damon_mkold_ops = {
 	.pmd_entry = damon_mkold_pmd_entry,
 };
 
@@ -490,7 +490,7 @@ out:
 	return 0;
 }
 
-static struct mm_walk_ops damon_young_ops = {
+static const struct mm_walk_ops damon_young_ops = {
 	.pmd_entry = damon_young_pmd_entry,
 };
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 238/262] mm/damon/dbgfs: remove unnecessary variables
  2021-11-05 20:34 incoming Andrew Morton
                   ` (236 preceding siblings ...)
  2021-11-05 20:47 ` [patch 237/262] mm/damon/vaddr: constify static mm_walk_ops Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 239/262] mm/damon/paddr: support the pageout scheme Andrew Morton
                   ` (23 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rongwei.wang, sj, torvalds

From: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Subject: mm/damon/dbgfs: remove unnecessary variables

In some functions, it is unnecessary to declare both 'err' and 'ret'
variables.  This patch simplifies such declarations by reusing a single
variable.

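A generic before/after illustration of the cleanup (not taken from
dbgfs.c; example_apply() is a hypothetical stub standing in for the real
work and returns 0 or a negative error code):

    #include <linux/errno.h>
    #include <linux/types.h>

    /* Hypothetical helper standing in for the real parsing/setting work. */
    static int example_apply(const char *input)
    {
        return (input && *input) ? 0 : -EINVAL;
    }

    /* Before: a separate 'err' only exists to be copied into 'ret'. */
    static ssize_t example_write_old(const char *input, size_t count)
    {
        ssize_t ret = count;
        int err;

        err = example_apply(input);
        if (err)
            ret = err;
        return ret;
    }

    /* After: 'ret' carries either the error or the byte count. */
    static ssize_t example_write_new(const char *input, size_t count)
    {
        ssize_t ret;

        ret = example_apply(input);
        if (!ret)
            ret = count;
        return ret;
    }
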
Link: https://lkml.kernel.org/r/20211014073014.35754-1-sj@kernel.org
Signed-off-by: Rongwei Wang <rongwei.wang@linux.alibaba.com>
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   66 +++++++++++++++++++++------------------------
 1 file changed, 31 insertions(+), 35 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-remove-unnecessary-variables
+++ a/mm/damon/dbgfs.c
@@ -69,8 +69,7 @@ static ssize_t dbgfs_attrs_write(struct
 	struct damon_ctx *ctx = file->private_data;
 	unsigned long s, a, r, minr, maxr;
 	char *kbuf;
-	ssize_t ret = count;
-	int err;
+	ssize_t ret;
 
 	kbuf = user_input_str(buf, count, ppos);
 	if (IS_ERR(kbuf))
@@ -88,9 +87,9 @@ static ssize_t dbgfs_attrs_write(struct
 		goto unlock_out;
 	}
 
-	err = damon_set_attrs(ctx, s, a, r, minr, maxr);
-	if (err)
-		ret = err;
+	ret = damon_set_attrs(ctx, s, a, r, minr, maxr);
+	if (!ret)
+		ret = count;
 unlock_out:
 	mutex_unlock(&ctx->kdamond_lock);
 out:
@@ -220,14 +219,13 @@ static ssize_t dbgfs_schemes_write(struc
 	struct damon_ctx *ctx = file->private_data;
 	char *kbuf;
 	struct damos **schemes;
-	ssize_t nr_schemes = 0, ret = count;
-	int err;
+	ssize_t nr_schemes = 0, ret;
 
 	kbuf = user_input_str(buf, count, ppos);
 	if (IS_ERR(kbuf))
 		return PTR_ERR(kbuf);
 
-	schemes = str_to_schemes(kbuf, ret, &nr_schemes);
+	schemes = str_to_schemes(kbuf, count, &nr_schemes);
 	if (!schemes) {
 		ret = -EINVAL;
 		goto out;
@@ -239,11 +237,12 @@ static ssize_t dbgfs_schemes_write(struc
 		goto unlock_out;
 	}
 
-	err = damon_set_schemes(ctx, schemes, nr_schemes);
-	if (err)
-		ret = err;
-	else
+	ret = damon_set_schemes(ctx, schemes, nr_schemes);
+	if (!ret) {
+		ret = count;
 		nr_schemes = 0;
+	}
+
 unlock_out:
 	mutex_unlock(&ctx->kdamond_lock);
 	free_schemes_arr(schemes, nr_schemes);
@@ -343,9 +342,8 @@ static ssize_t dbgfs_target_ids_write(st
 	char *kbuf, *nrs;
 	unsigned long *targets;
 	ssize_t nr_targets;
-	ssize_t ret = count;
+	ssize_t ret;
 	int i;
-	int err;
 
 	kbuf = user_input_str(buf, count, ppos);
 	if (IS_ERR(kbuf))
@@ -358,7 +356,7 @@ static ssize_t dbgfs_target_ids_write(st
 		scnprintf(kbuf, count, "42    ");
 	}
 
-	targets = str_to_target_ids(nrs, ret, &nr_targets);
+	targets = str_to_target_ids(nrs, count, &nr_targets);
 	if (!targets) {
 		ret = -ENOMEM;
 		goto out;
@@ -393,11 +391,12 @@ static ssize_t dbgfs_target_ids_write(st
 	else
 		damon_pa_set_primitives(ctx);
 
-	err = damon_set_targets(ctx, targets, nr_targets);
-	if (err) {
+	ret = damon_set_targets(ctx, targets, nr_targets);
+	if (ret) {
 		if (id_is_pid)
 			dbgfs_put_pids(targets, nr_targets);
-		ret = err;
+	} else {
+		ret = count;
 	}
 
 unlock_out:
@@ -715,8 +714,7 @@ static ssize_t dbgfs_mk_context_write(st
 {
 	char *kbuf;
 	char *ctx_name;
-	ssize_t ret = count;
-	int err;
+	ssize_t ret;
 
 	kbuf = user_input_str(buf, count, ppos);
 	if (IS_ERR(kbuf))
@@ -734,9 +732,9 @@ static ssize_t dbgfs_mk_context_write(st
 	}
 
 	mutex_lock(&damon_dbgfs_lock);
-	err = dbgfs_mk_context(ctx_name);
-	if (err)
-		ret = err;
+	ret = dbgfs_mk_context(ctx_name);
+	if (!ret)
+		ret = count;
 	mutex_unlock(&damon_dbgfs_lock);
 
 out:
@@ -805,8 +803,7 @@ static ssize_t dbgfs_rm_context_write(st
 		const char __user *buf, size_t count, loff_t *ppos)
 {
 	char *kbuf;
-	ssize_t ret = count;
-	int err;
+	ssize_t ret;
 	char *ctx_name;
 
 	kbuf = user_input_str(buf, count, ppos);
@@ -825,9 +822,9 @@ static ssize_t dbgfs_rm_context_write(st
 	}
 
 	mutex_lock(&damon_dbgfs_lock);
-	err = dbgfs_rm_context(ctx_name);
-	if (err)
-		ret = err;
+	ret = dbgfs_rm_context(ctx_name);
+	if (!ret)
+		ret = count;
 	mutex_unlock(&damon_dbgfs_lock);
 
 out:
@@ -851,9 +848,8 @@ static ssize_t dbgfs_monitor_on_read(str
 static ssize_t dbgfs_monitor_on_write(struct file *file,
 		const char __user *buf, size_t count, loff_t *ppos)
 {
-	ssize_t ret = count;
+	ssize_t ret;
 	char *kbuf;
-	int err;
 
 	kbuf = user_input_str(buf, count, ppos);
 	if (IS_ERR(kbuf))
@@ -866,14 +862,14 @@ static ssize_t dbgfs_monitor_on_write(st
 	}
 
 	if (!strncmp(kbuf, "on", count))
-		err = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs);
+		ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs);
 	else if (!strncmp(kbuf, "off", count))
-		err = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs);
+		ret = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs);
 	else
-		err = -EINVAL;
+		ret = -EINVAL;
 
-	if (err)
-		ret = err;
+	if (!ret)
+		ret = count;
 	kfree(kbuf);
 	return ret;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 239/262] mm/damon/paddr: support the pageout scheme
  2021-11-05 20:34 incoming Andrew Morton
                   ` (237 preceding siblings ...)
  2021-11-05 20:47 ` [patch 238/262] mm/damon/dbgfs: remove unnecessary variables Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 240/262] mm/damon/schemes: implement size quota for schemes application speed control Andrew Morton
                   ` (22 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/paddr: support the pageout scheme

Introduction
============

This patchset 1) makes the engine for general data access pattern-oriented
memory management (DAMOS) more useful for production environments, and
2) implements a static kernel module for lightweight proactive reclamation
using the engine.

Proactive Reclamation
---------------------

On general memory over-committed systems, proactively reclaiming cold
pages helps save memory and reduce the latency spikes incurred by direct
reclaim or by the CPU consumption of kswapd, while causing only minimal
performance degradation[2].

A Free Pages Reporting[8] based memory over-commit virtualization system
would be one more specific use case.  In such a system, the guest VMs
report their free memory to the host, and the host reallocates the
reported memory to other guests.  As a result, the system's memory
utilization can be maximized.  However, the guests might not be so
memory-frugal, because some kernel subsystems and user-space applications
are designed to use as much memory as available.  Then, guests would
report only a small amount of free memory to the host, resulting in poor
memory utilization.  Running proactive reclamation in such guests could
help mitigate this problem.

Google has also implemented this idea and is using it in their data
centers.  They further proposed upstreaming it in LSFMM'19, and "the
general consensus was that, while this sort of proactive reclaim would be
useful for a number of users, the cost of this particular solution was too
high to consider merging it upstream"[3].  The cost mainly comes from the
coldness tracking.  Roughly speaking, the implementation periodically
scans the 'Accessed' bit of each page.  For this reason, the overhead
increases linearly as the size of the memory and the scanning frequency
grow.  As a result, Google is known to dedicate one CPU to the work.
That's a reasonable option for someone like Google, but it wouldn't be for
some others.

DAMON and DAMOS: An engine for data access pattern-oriented memory management
-----------------------------------------------------------------------------

DAMON[4] is a framework for general data access monitoring.  Its adaptive
monitoring overhead control feature minimizes its monitoring overhead.  It
also lets the upper bound of the overhead be configured by clients,
regardless of the size of the monitoring target memory.  While monitoring
70 GiB of memory of a production system every 5 milliseconds, it consumes
less than 1% of a single CPU's time.  For this, it may sacrifice some of
the quality of the monitoring results.  Nevertheless, the lower bound of
the quality is configurable, and it uses a best-effort algorithm for
better quality.  Our test results[5] show the quality is practical enough.
From the production system monitoring, we were able to find a 4 KiB region
in the 70 GiB memory that shows the highest access frequency.

We normally don't monitor the data access pattern just for fun but to
improve something like memory management.  Proactive reclamation is one
such usage.  For such general cases, DAMON provides a feature called
DAMON-based Operation Schemes (DAMOS)[6].  It makes DAMON an engine for
general data access pattern-oriented memory management.  Using this,
clients can ask DAMON to find memory regions of a specific data access
pattern and apply some memory management action (e.g., page out, move to
the head of the LRU list, use huge pages, ...).  We call such a request a
'scheme'.

Proactive Reclamation on top of DAMON/DAMOS
-------------------------------------------

Therefore, by using DAMON for the cold page detection, the proactive
reclamation's monitoring overhead issue can be solved.  Actually, we
previously implemented a version of proactive reclamation using DAMOS and
achieved noticeable improvements with our evaluation setup[5].
Nevertheless, it was more of a proof-of-concept than something for
production use.  It supports only the virtual address spaces of processes,
and requires additional tuning efforts for the given workloads and
hardware.  For the tuning, we introduced a simple auto-tuning user space
tool[7].  Google is also known to use a similar ML-based approach for
their fleets[2].  But making it just work with intuitive knobs in the
kernel would be helpful for general users.

To this end, this patchset improves DAMOS to be ready for such production
usages, and implements another version of the proactive reclamation,
namely DAMON_RECLAIM, on top of it.

DAMOS Improvements: Aggressiveness Control, Prioritization, and Watermarks
--------------------------------------------------------------------------

First of all, the current version of DAMOS supports only virtual address
spaces.  This patchset makes it support the physical address space for the
page out action.

The next major problem of the current version of DAMOS is the lack of
aggressiveness control, which can result in arbitrary overhead.  For
example, if huge memory regions having the data access pattern of interest
are found, applying the requested action to all of the regions could incur
significant overhead.  It can be controlled by tuning the target data
access pattern with manual or automated approaches[2,7].  But some people
would prefer the kernel to just work with only intuitive tuning or default
values.

For such cases, this patchset implements a safeguard, namely a time/size
quota.  Using this, clients can specify up to how much time can be used
for applying the action, and/or up to how much memory the action can be
applied to, within a user-specified time duration.  A follow-up question
is: to which memory regions should the action be applied within the
limits?  We implement a simple region prioritization mechanism for each
action and make DAMOS apply the action to high-priority regions first.
It also allows clients to tune the prioritization mechanism to use
different weights for the size, access frequency, and age of memory
regions.  This means we could use not only LRU but also LFU or some fancy
algorithms like CAR[9] with lightweight overhead.

Though DAMON is lightweight, someone might want to remove even the cold
page monitoring overhead when it is unnecessary.  Currently, it must be
manually turned on and off by clients, but some clients would simply want
to turn it on and off based on some metrics like the free memory ratio or
memory fragmentation.  For such cases, this patchset implements a
watermarks-based automatic activation feature.  It allows the clients to
configure the metric of their interest, and three watermarks of the
metric.  If the metric is higher than the high watermark or lower than the
low watermark, the scheme is deactivated.  If the metric is lower than the
mid watermark but higher than the low watermark, the scheme is activated.

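A conceptual sketch of that activation rule (this is not the in-kernel
code; the names are made up, and the band between the mid and high
watermarks simply keeps the previous state, as a form of hysteresis):

    #include <linux/types.h>

    /* Illustration only: decide whether a scheme should be active. */
    static bool example_wmarks_active(bool currently_active,
                                      unsigned long metric,
                                      unsigned long high, unsigned long mid,
                                      unsigned long low)
    {
        if (metric > high || metric < low)
            return false;           /* deactivate */
        if (metric < mid)
            return true;            /* activate */
        return currently_active;    /* mid..high: keep the current state */
    }
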
DAMON-based Reclaim
-------------------

Using the improved version of DAMOS, this patchset implements a static
kernel module called 'damon_reclaim'.  It finds memory regions that
haven't been accessed for a specific time duration and pages them out.
Consuming too much CPU for the page-out operations, or doing pageout too
frequently, can be critical for systems that configure their swap devices
with software-defined in-memory block devices like zram/zswap or with
write-limited devices like SSDs, respectively.  To avoid these problems,
the time/size quotas can be configured.  Under the quotas, it pages out
the memory regions that have not been accessed for the longest time first.
Also, to remove the monitoring overhead in peaceful situations, and to
fall back to the LRU-list based page-granularity reclamation when it
doesn't make progress, the three-watermarks based activation mechanism is
used, with the free memory ratio as the watermark metric.

For convenient configurations, it provides several module parameters. 
Using these, sysadmins can enable/disable it, and tune its parameters
including the coldness identification time threshold, the time/size quotas
and the three watermarks.

Evaluation
==========

In short, DAMON_RECLAIM with a 50ms/s time quota and region prioritization
on a v5.15-rc5 Linux kernel with a ZRAM swap device achieves 38.58% memory
saving with only 1.94% runtime overhead.  For this, DAMON_RECLAIM consumes
only 4.97% of a single CPU's time.

Setup
-----

We evaluate DAMON_RECLAIM to show the effect of each of the DAMOS
improvements.  For this, we measure DAMON_RECLAIM's CPU consumption, the
entire system memory footprint, the total number of major page faults, and
the runtime of 24 realistic workloads in the PARSEC3 and SPLASH-2X
benchmark suites on my QEMU/KVM based virtual machine.  The virtual
machine runs on an i3.metal AWS instance, has 130 GiB of memory, and runs
a Linux kernel built on the latest -mm tree[1] plus this patchset.  It
also utilizes a 4 GiB ZRAM swap device.  We repeat the measurement 5 times
and use the averages.

[1] https://github.com/hnaz/linux-mm/tree/v5.15-rc5-mmots-2021-10-13-19-55

Detailed Results
----------------

The results are summarized in the below table.

With a coldness identification threshold of 5 seconds, DAMON_RECLAIM
without the time quota-based speed limit achieves 47.21% memory saving,
but incurs a 4.59% runtime slowdown to the workloads on average.  For
this, DAMON_RECLAIM consumes about 11.28% of a single CPU's time.

Applying time quotas of 200ms/s, 50ms/s, and 10ms/s without the region
prioritization changes the slowdown to 4.89%, 2.65%, and 1.5%,
respectively.  The 200ms/s (20%) time quota makes no real change compared
to the quota-unapplied version, because the quota-unapplied version
consumes only 11.28% CPU time.  DAMON_RECLAIM's CPU utilization is also
similarly reduced: 11.24%, 5.51%, and 2.01% of a single CPU's time.  That
is, the overhead is proportional to the speed limit.  Nevertheless, it
also reduces the memory saving because it becomes less aggressive.  In
detail, the three variants show 48.76%, 37.83%, and 7.85% memory saving,
respectively.

Applying the region prioritization (page out the regions that have not
been accessed for the longest time first, within the time quota) further
reduces the performance degradation.  The runtime slowdown and the
increase in the total number of major page faults went from
4.89%/218,690% to 4.39%/166,136% (200ms/s), from 2.65%/111,886% to
1.94%/59,053% (50ms/s), and from 1.5%/34,973.40% to 2.08%/8,781.75%
(10ms/s).  The runtime under the 10ms/s time quota has increased with
prioritization, but apparently that's within the margin of error.

    time quota   prioritization  memory_saving  cpu_util  slowdown  pgmajfault_overhead
    N            N               47.21%         11.28%    4.59%     194,802%
    200ms/s      N               48.76%         11.24%    4.89%     218,690%
    50ms/s       N               37.83%         5.51%     2.65%     111,886%
    10ms/s       N               7.85%          2.01%     1.5%      34,793.40%
    200ms/s      Y               50.08%         10.38%    4.39%     166,136%
    50ms/s       Y               38.58%         4.97%     1.94%     59,053%
    10ms/s       Y               3.63%          1.73%     2.08%     8,781.75%

Baseline and Complete Git Trees
===============================

The patches are based on the latest -mm tree
(v5.15-rc5-mmots-2021-10-13-19-55).  You can also clone the complete git tree
from:

    $ git clone git://github.com/sjp38/linux -b damon_reclaim/patches/v1

The tree is also available on the web:
https://git.kernel.org/pub/scm/linux/kernel/git/sj/linux.git/tag/?h=damon_reclaim/patches/v1

Sequence Of Patches
===================

The first patch makes DAMOS support the physical address space for the page out
action.  The following five patches (patches 2-6) implement the time/size
quotas.  The next four patches (patches 7-10) implement the memory region
prioritization within the limits.  Then, the three following patches (patches
11-13) implement the watermarks-based scheme activation.  Finally, the last two
patches (patches 14-15) implement and document the DAMON-based reclamation
using the advanced DAMOS.

[1] https://www.kernel.org/doc/html/v5.15-rc1/vm/damon/index.html
[2] https://research.google/pubs/pub48551/
[3] https://lwn.net/Articles/787611/
[4] https://damonitor.github.io
[5] https://damonitor.github.io/doc/html/latest/vm/damon/eval.html
[6] https://lore.kernel.org/linux-mm/20211001125604.29660-1-sj@kernel.org/
[7] https://github.com/awslabs/damoos
[8] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
[9] https://www.usenix.org/conference/fast-04/car-clock-adaptive-replacement


This patch (of 15):

This commit makes the DAMON primitives for the physical address space
support the pageout action of DAMON-based Operation Schemes.  With this
commit, users can hence easily implement system-level, data access-aware
reclamation using DAMOS.

[sj@kernel.org: fix missing-prototype build warning]
  Link: https://lkml.kernel.org/r/20211025064220.13904-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20211019150731.16699-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Greg Thelen <gthelen@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    2 ++
 mm/damon/paddr.c      |   37 ++++++++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

--- a/include/linux/damon.h~mm-damon-paddr-support-the-pageout-scheme
+++ a/include/linux/damon.h
@@ -357,6 +357,8 @@ void damon_va_set_primitives(struct damo
 void damon_pa_prepare_access_checks(struct damon_ctx *ctx);
 unsigned int damon_pa_check_accesses(struct damon_ctx *ctx);
 bool damon_pa_target_valid(void *t);
+int damon_pa_apply_scheme(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme);
 void damon_pa_set_primitives(struct damon_ctx *ctx);
 
 #endif	/* CONFIG_DAMON_PADDR */
--- a/mm/damon/paddr.c~mm-damon-paddr-support-the-pageout-scheme
+++ a/mm/damon/paddr.c
@@ -11,7 +11,9 @@
 #include <linux/page_idle.h>
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
+#include <linux/swap.h>
 
+#include "../internal.h"
 #include "prmtv-common.h"
 
 static bool __damon_pa_mkold(struct page *page, struct vm_area_struct *vma,
@@ -211,6 +213,39 @@ bool damon_pa_target_valid(void *t)
 	return true;
 }
 
+int damon_pa_apply_scheme(struct damon_ctx *ctx, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme)
+{
+	unsigned long addr;
+	LIST_HEAD(page_list);
+
+	if (scheme->action != DAMOS_PAGEOUT)
+		return -EINVAL;
+
+	for (addr = r->ar.start; addr < r->ar.end; addr += PAGE_SIZE) {
+		struct page *page = damon_get_page(PHYS_PFN(addr));
+
+		if (!page)
+			continue;
+
+		ClearPageReferenced(page);
+		test_and_clear_page_young(page);
+		if (isolate_lru_page(page)) {
+			put_page(page);
+			continue;
+		}
+		if (PageUnevictable(page)) {
+			putback_lru_page(page);
+		} else {
+			list_add(&page->lru, &page_list);
+			put_page(page);
+		}
+	}
+	reclaim_pages(&page_list);
+	cond_resched();
+	return 0;
+}
+
 void damon_pa_set_primitives(struct damon_ctx *ctx)
 {
 	ctx->primitive.init = NULL;
@@ -220,5 +255,5 @@ void damon_pa_set_primitives(struct damo
 	ctx->primitive.reset_aggregated = NULL;
 	ctx->primitive.target_valid = damon_pa_target_valid;
 	ctx->primitive.cleanup = NULL;
-	ctx->primitive.apply_scheme = NULL;
+	ctx->primitive.apply_scheme = damon_pa_apply_scheme;
 }
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 240/262] mm/damon/schemes: implement size quota for schemes application speed control
  2021-11-05 20:34 incoming Andrew Morton
                   ` (238 preceding siblings ...)
  2021-11-05 20:47 ` [patch 239/262] mm/damon/paddr: support the pageout scheme Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 241/262] mm/damon/schemes: skip already charged targets and regions Andrew Morton
                   ` (21 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: implement size quota for schemes application speed control

There could be arbitrarily large memory regions fulfilling the target data
access pattern of a DAMON-based operation scheme.  In that case, applying the
scheme's action could incur too high an overhead.  To provide an intuitive way
of avoiding that, this commit implements a feature called size quota.  If the
quota is set, DAMON tries to apply the action only up to the given amount of
memory within a given time window.
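
As a rough illustration of the charging arithmetic this adds, here is a
minimal userspace sketch (the struct and function names are made up for the
example, and the DAMON_MIN_REGION alignment and region splitting are
omitted):

    #include <stdio.h>

    /* hypothetical state, loosely mirroring struct damos_quota below */
    struct quota {
        unsigned long sz;              /* size quota in bytes; 0 means none */
        unsigned long reset_interval;  /* charge window length in ms */
        unsigned long charged_sz;      /* bytes charged in the current window */
        unsigned long charged_from;    /* start of the current window, in ms */
    };

    /* charge @region_sz bytes at time @now_ms; return the bytes allowed */
    static unsigned long charge(struct quota *q, unsigned long region_sz,
                                unsigned long now_ms)
    {
        if (q->sz && now_ms >= q->charged_from + q->reset_interval) {
            q->charged_from = now_ms;             /* new charge window */
            q->charged_sz = 0;
        }
        if (q->sz && q->charged_sz >= q->sz)
            return 0;                             /* quota exhausted */
        if (q->sz && q->charged_sz + region_sz > q->sz)
            region_sz = q->sz - q->charged_sz;    /* apply to a part only */
        q->charged_sz += region_sz;
        return region_sz;
    }

    int main(void)
    {
        struct quota q = { .sz = 4096, .reset_interval = 1000 };

        printf("%lu\n", charge(&q, 3000, 0));     /* 3000 */
        printf("%lu\n", charge(&q, 3000, 10));    /* 1096: quota hit */
        printf("%lu\n", charge(&q, 3000, 1500));  /* 3000: new window */
        return 0;
    }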

Link: https://lkml.kernel.org/r/20211019150731.16699-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   36 +++++++++++++++++++++---
 mm/damon/core.c       |   60 ++++++++++++++++++++++++++++++++++------
 mm/damon/dbgfs.c      |    4 ++
 3 files changed, 87 insertions(+), 13 deletions(-)

--- a/include/linux/damon.h~mm-damon-schemes-implement-size-quota-for-schemes-application-speed-control
+++ a/include/linux/damon.h
@@ -90,6 +90,26 @@ enum damos_action {
 };
 
 /**
+ * struct damos_quota - Controls the aggressiveness of the given scheme.
+ * @sz:			Maximum bytes of memory that the action can be applied.
+ * @reset_interval:	Charge reset interval in milliseconds.
+ *
+ * To avoid consuming too much CPU time or IO resources for applying the
+ * &struct damos->action to large memory, DAMON allows users to set a size
+ * quota.  The quota can be set by writing non-zero values to &sz.  If the size
+ * quota is set, DAMON tries to apply the action only up to &sz bytes within
+ * &reset_interval.
+ */
+struct damos_quota {
+	unsigned long sz;
+	unsigned long reset_interval;
+
+/* private: For charging the quota */
+	unsigned long charged_sz;
+	unsigned long charged_from;
+};
+
+/**
  * struct damos - Represents a Data Access Monitoring-based Operation Scheme.
  * @min_sz_region:	Minimum size of target regions.
  * @max_sz_region:	Maximum size of target regions.
@@ -98,13 +118,20 @@ enum damos_action {
  * @min_age_region:	Minimum age of target regions.
  * @max_age_region:	Maximum age of target regions.
  * @action:		&damo_action to be applied to the target regions.
+ * @quota:		Control the aggressiveness of this scheme.
  * @stat_count:		Total number of regions that this scheme is applied.
  * @stat_sz:		Total size of regions that this scheme is applied.
  * @list:		List head for siblings.
  *
- * For each aggregation interval, DAMON applies @action to monitoring target
- * regions fit in the condition and updates the statistics.  Note that both
- * the minimums and the maximums are inclusive.
+ * For each aggregation interval, DAMON finds regions which fit in the
+ * condition (&min_sz_region, &max_sz_region, &min_nr_accesses,
+ * &max_nr_accesses, &min_age_region, &max_age_region) and applies &action to
+ * those.  To avoid consuming too much CPU time or IO resources for the
+ * &action, &quota is used.
+ *
+ * After applying the &action to each region, &stat_count and &stat_sz is
+ * updated to reflect the number of regions and total size of regions that the
+ * &action is applied.
  */
 struct damos {
 	unsigned long min_sz_region;
@@ -114,6 +141,7 @@ struct damos {
 	unsigned int min_age_region;
 	unsigned int max_age_region;
 	enum damos_action action;
+	struct damos_quota quota;
 	unsigned long stat_count;
 	unsigned long stat_sz;
 	struct list_head list;
@@ -310,7 +338,7 @@ struct damos *damon_new_scheme(
 		unsigned long min_sz_region, unsigned long max_sz_region,
 		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
 		unsigned int min_age_region, unsigned int max_age_region,
-		enum damos_action action);
+		enum damos_action action, struct damos_quota *quota);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
 
--- a/mm/damon/core.c~mm-damon-schemes-implement-size-quota-for-schemes-application-speed-control
+++ a/mm/damon/core.c
@@ -89,7 +89,7 @@ struct damos *damon_new_scheme(
 		unsigned long min_sz_region, unsigned long max_sz_region,
 		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
 		unsigned int min_age_region, unsigned int max_age_region,
-		enum damos_action action)
+		enum damos_action action, struct damos_quota *quota)
 {
 	struct damos *scheme;
 
@@ -107,6 +107,11 @@ struct damos *damon_new_scheme(
 	scheme->stat_sz = 0;
 	INIT_LIST_HEAD(&scheme->list);
 
+	scheme->quota.sz = quota->sz;
+	scheme->quota.reset_interval = quota->reset_interval;
+	scheme->quota.charged_sz = 0;
+	scheme->quota.charged_from = 0;
+
 	return scheme;
 }
 
@@ -530,15 +535,25 @@ static void kdamond_reset_aggregated(str
 	}
 }
 
+static void damon_split_region_at(struct damon_ctx *ctx,
+		struct damon_target *t, struct damon_region *r,
+		unsigned long sz_r);
+
 static void damon_do_apply_schemes(struct damon_ctx *c,
 				   struct damon_target *t,
 				   struct damon_region *r)
 {
 	struct damos *s;
-	unsigned long sz;
 
 	damon_for_each_scheme(s, c) {
-		sz = r->ar.end - r->ar.start;
+		struct damos_quota *quota = &s->quota;
+		unsigned long sz = r->ar.end - r->ar.start;
+
+		/* Check the quota */
+		if (quota->sz && quota->charged_sz >= quota->sz)
+			continue;
+
+		/* Check the target regions condition */
 		if (sz < s->min_sz_region || s->max_sz_region < sz)
 			continue;
 		if (r->nr_accesses < s->min_nr_accesses ||
@@ -546,22 +561,51 @@ static void damon_do_apply_schemes(struc
 			continue;
 		if (r->age < s->min_age_region || s->max_age_region < r->age)
 			continue;
-		s->stat_count++;
-		s->stat_sz += sz;
-		if (c->primitive.apply_scheme)
+
+		/* Apply the scheme */
+		if (c->primitive.apply_scheme) {
+			if (quota->sz && quota->charged_sz + sz > quota->sz) {
+				sz = ALIGN_DOWN(quota->sz - quota->charged_sz,
+						DAMON_MIN_REGION);
+				if (!sz)
+					goto update_stat;
+				damon_split_region_at(c, t, r, sz);
+			}
 			c->primitive.apply_scheme(c, t, r, s);
+			quota->charged_sz += sz;
+		}
 		if (s->action != DAMOS_STAT)
 			r->age = 0;
+
+update_stat:
+		s->stat_count++;
+		s->stat_sz += sz;
 	}
 }
 
 static void kdamond_apply_schemes(struct damon_ctx *c)
 {
 	struct damon_target *t;
-	struct damon_region *r;
+	struct damon_region *r, *next_r;
+	struct damos *s;
+
+	damon_for_each_scheme(s, c) {
+		struct damos_quota *quota = &s->quota;
+
+		if (!quota->sz)
+			continue;
+
+		/* New charge window starts */
+		if (time_after_eq(jiffies, quota->charged_from +
+					msecs_to_jiffies(
+						quota->reset_interval))) {
+			quota->charged_from = jiffies;
+			quota->charged_sz = 0;
+		}
+	}
 
 	damon_for_each_target(t, c) {
-		damon_for_each_region(r, t)
+		damon_for_each_region_safe(r, next_r, t)
 			damon_do_apply_schemes(c, t, r);
 	}
 }
--- a/mm/damon/dbgfs.c~mm-damon-schemes-implement-size-quota-for-schemes-application-speed-control
+++ a/mm/damon/dbgfs.c
@@ -188,6 +188,8 @@ static struct damos **str_to_schemes(con
 
 	*nr_schemes = 0;
 	while (pos < len && *nr_schemes < max_nr_schemes) {
+		struct damos_quota quota = {};
+
 		ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n",
 				&min_sz, &max_sz, &min_nr_a, &max_nr_a,
 				&min_age, &max_age, &action, &parsed);
@@ -200,7 +202,7 @@ static struct damos **str_to_schemes(con
 
 		pos += parsed;
 		scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a,
-				min_age, max_age, action);
+				min_age, max_age, action, &quota);
 		if (!scheme)
 			goto fail;
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 241/262] mm/damon/schemes: skip already charged targets and regions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (239 preceding siblings ...)
  2021-11-05 20:47 ` [patch 240/262] mm/damon/schemes: implement size quota for schemes application speed control Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 242/262] mm/damon/schemes: implement time quota Andrew Morton
                   ` (20 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: skip already charged targets and regions

If DAMOS has stopped applying the action in the middle of a group of memory
regions due to its size quota, it restarts the work from the beginning of the
address space in the next charge window.  If there is a huge memory region at
the beginning of the address space which always fulfills the scheme's target
data access pattern, the action would be applied to only that region.

This commit mitigates the case by skipping, at the beginning of the next
charge window, the memory regions that were already charged in the current
charge window.
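
In other words, the quota now remembers the target and the address at which
the charging stopped, and the next window resumes from there.  A minimal
userspace sketch of the skipping rule follows (the region list and names are
assumptions for illustration; the splitting of a region that straddles the
resume address is omitted):

    #include <stdio.h>

    struct region { unsigned long start, end; };

    /*
     * Return 1 if @r should be skipped because it lies entirely before the
     * resume address remembered from the previous charge window.
     */
    static int skip_already_charged(const struct region *r,
                                    unsigned long charge_addr_from)
    {
        return charge_addr_from && r->end <= charge_addr_from;
    }

    int main(void)
    {
        struct region regions[] = { {0, 4096}, {4096, 16384}, {16384, 32768} };
        unsigned long resume_from = 16384;    /* the quota ran out here */
        int i;

        for (i = 0; i < 3; i++)
            printf("region [%lu, %lu): %s\n", regions[i].start, regions[i].end,
                   skip_already_charged(&regions[i], resume_from) ?
                   "skip" : "consider");
        return 0;
    }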

Link: https://lkml.kernel.org/r/20211019150731.16699-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    5 +++++
 mm/damon/core.c       |   37 +++++++++++++++++++++++++++++++++++++
 2 files changed, 42 insertions(+)

--- a/include/linux/damon.h~mm-damon-schemes-skip-already-charged-targets-and-regions
+++ a/include/linux/damon.h
@@ -107,6 +107,8 @@ struct damos_quota {
 /* private: For charging the quota */
 	unsigned long charged_sz;
 	unsigned long charged_from;
+	struct damon_target *charge_target_from;
+	unsigned long charge_addr_from;
 };
 
 /**
@@ -307,6 +309,9 @@ struct damon_ctx {
 #define damon_prev_region(r) \
 	(container_of(r->list.prev, struct damon_region, list))
 
+#define damon_last_region(t) \
+	(list_last_entry(&t->regions_list, struct damon_region, list))
+
 #define damon_for_each_region(r, t) \
 	list_for_each_entry(r, &t->regions_list, list)
 
--- a/mm/damon/core.c~mm-damon-schemes-skip-already-charged-targets-and-regions
+++ a/mm/damon/core.c
@@ -111,6 +111,8 @@ struct damos *damon_new_scheme(
 	scheme->quota.reset_interval = quota->reset_interval;
 	scheme->quota.charged_sz = 0;
 	scheme->quota.charged_from = 0;
+	scheme->quota.charge_target_from = NULL;
+	scheme->quota.charge_addr_from = 0;
 
 	return scheme;
 }
@@ -553,6 +555,37 @@ static void damon_do_apply_schemes(struc
 		if (quota->sz && quota->charged_sz >= quota->sz)
 			continue;
 
+		/* Skip previously charged regions */
+		if (quota->charge_target_from) {
+			if (t != quota->charge_target_from)
+				continue;
+			if (r == damon_last_region(t)) {
+				quota->charge_target_from = NULL;
+				quota->charge_addr_from = 0;
+				continue;
+			}
+			if (quota->charge_addr_from &&
+					r->ar.end <= quota->charge_addr_from)
+				continue;
+
+			if (quota->charge_addr_from && r->ar.start <
+					quota->charge_addr_from) {
+				sz = ALIGN_DOWN(quota->charge_addr_from -
+						r->ar.start, DAMON_MIN_REGION);
+				if (!sz) {
+					if (r->ar.end - r->ar.start <=
+							DAMON_MIN_REGION)
+						continue;
+					sz = DAMON_MIN_REGION;
+				}
+				damon_split_region_at(c, t, r, sz);
+				r = damon_next_region(r);
+				sz = r->ar.end - r->ar.start;
+			}
+			quota->charge_target_from = NULL;
+			quota->charge_addr_from = 0;
+		}
+
 		/* Check the target regions condition */
 		if (sz < s->min_sz_region || s->max_sz_region < sz)
 			continue;
@@ -573,6 +606,10 @@ static void damon_do_apply_schemes(struc
 			}
 			c->primitive.apply_scheme(c, t, r, s);
 			quota->charged_sz += sz;
+			if (quota->sz && quota->charged_sz >= quota->sz) {
+				quota->charge_target_from = t;
+				quota->charge_addr_from = r->ar.end + 1;
+			}
 		}
 		if (s->action != DAMOS_STAT)
 			r->age = 0;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 242/262] mm/damon/schemes: implement time quota
  2021-11-05 20:34 incoming Andrew Morton
                   ` (240 preceding siblings ...)
  2021-11-05 20:47 ` [patch 241/262] mm/damon/schemes: skip already charged targets and regions Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 243/262] mm/damon/dbgfs: support quotas of schemes Andrew Morton
                   ` (19 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: implement time quota

The size quota feature of DAMOS is useful for IO resource-critical systems,
but not so intuitive for CPU time-critical systems.  Systems using a zram or
zswap-like swap device would be examples.

To provide another intuitive way for such systems, this commit implements a
time-based quota for DAMON-based Operation Schemes.  If the quota is set,
DAMOS tries to use only up to the user-defined quota of CPU time within a
given time window.
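
Internally, the time quota is converted into an effective size quota using the
measured throughput of the action, as the patch's damos_set_effective_quota()
does.  A minimal userspace sketch of that conversion (assuming a 64-bit
unsigned long; the initial guess mirrors the patch's PAGE_SIZE * 1024
default):

    #include <stdio.h>

    /* turn a time quota (ms) into a size quota (bytes) via throughput */
    static unsigned long effective_sz_quota(unsigned long ms_quota,
                                            unsigned long sz_quota,
                                            unsigned long total_charged_sz,
                                            unsigned long total_charged_ns)
    {
        unsigned long throughput, esz;

        if (!ms_quota)
            return sz_quota;

        if (total_charged_ns)    /* bytes per millisecond, from history */
            throughput = total_charged_sz * 1000000 / total_charged_ns;
        else                     /* no history yet: arbitrary initial guess */
            throughput = 4096 * 1024;
        esz = throughput * ms_quota;

        if (sz_quota && sz_quota < esz)
            esz = sz_quota;      /* the smaller of the two quotas wins */
        return esz;
    }

    int main(void)
    {
        /* 8 MiB charged in 4 ms so far, a 10 ms time quota, no size quota */
        printf("%lu\n", effective_sz_quota(10, 0, 8UL << 20, 4000000));
        return 0;
    }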

Link: https://lkml.kernel.org/r/20211019150731.16699-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   25 +++++++++++++++++-----
 mm/damon/core.c       |   45 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 60 insertions(+), 10 deletions(-)

--- a/include/linux/damon.h~mm-damon-schemes-implement-time-quota
+++ a/include/linux/damon.h
@@ -91,20 +91,35 @@ enum damos_action {
 
 /**
  * struct damos_quota - Controls the aggressiveness of the given scheme.
+ * @ms:			Maximum milliseconds that the scheme can use.
  * @sz:			Maximum bytes of memory that the action can be applied.
  * @reset_interval:	Charge reset interval in milliseconds.
  *
  * To avoid consuming too much CPU time or IO resources for applying the
- * &struct damos->action to large memory, DAMON allows users to set a size
- * quota.  The quota can be set by writing non-zero values to &sz.  If the size
- * quota is set, DAMON tries to apply the action only up to &sz bytes within
- * &reset_interval.
+ * &struct damos->action to large memory, DAMON allows users to set time and/or
+ * size quotas.  The quotas can be set by writing non-zero values to &ms and
+ * &sz, respectively.  If the time quota is set, DAMON tries to use only up to
+ * &ms milliseconds within &reset_interval for applying the action.  If the
+ * size quota is set, DAMON tries to apply the action only up to &sz bytes
+ * within &reset_interval.
+ *
+ * Internally, the time quota is transformed to a size quota using estimated
+ * throughput of the scheme's action.  DAMON then compares it against &sz and
+ * uses smaller one as the effective quota.
  */
 struct damos_quota {
+	unsigned long ms;
 	unsigned long sz;
 	unsigned long reset_interval;
 
-/* private: For charging the quota */
+/* private: */
+	/* For throughput estimation */
+	unsigned long total_charged_sz;
+	unsigned long total_charged_ns;
+
+	unsigned long esz;	/* Effective size quota in bytes */
+
+	/* For charging the quota */
 	unsigned long charged_sz;
 	unsigned long charged_from;
 	struct damon_target *charge_target_from;
--- a/mm/damon/core.c~mm-damon-schemes-implement-time-quota
+++ a/mm/damon/core.c
@@ -107,8 +107,12 @@ struct damos *damon_new_scheme(
 	scheme->stat_sz = 0;
 	INIT_LIST_HEAD(&scheme->list);
 
+	scheme->quota.ms = quota->ms;
 	scheme->quota.sz = quota->sz;
 	scheme->quota.reset_interval = quota->reset_interval;
+	scheme->quota.total_charged_sz = 0;
+	scheme->quota.total_charged_ns = 0;
+	scheme->quota.esz = 0;
 	scheme->quota.charged_sz = 0;
 	scheme->quota.charged_from = 0;
 	scheme->quota.charge_target_from = NULL;
@@ -550,9 +554,10 @@ static void damon_do_apply_schemes(struc
 	damon_for_each_scheme(s, c) {
 		struct damos_quota *quota = &s->quota;
 		unsigned long sz = r->ar.end - r->ar.start;
+		struct timespec64 begin, end;
 
 		/* Check the quota */
-		if (quota->sz && quota->charged_sz >= quota->sz)
+		if (quota->esz && quota->charged_sz >= quota->esz)
 			continue;
 
 		/* Skip previously charged regions */
@@ -597,16 +602,21 @@ static void damon_do_apply_schemes(struc
 
 		/* Apply the scheme */
 		if (c->primitive.apply_scheme) {
-			if (quota->sz && quota->charged_sz + sz > quota->sz) {
-				sz = ALIGN_DOWN(quota->sz - quota->charged_sz,
+			if (quota->esz &&
+					quota->charged_sz + sz > quota->esz) {
+				sz = ALIGN_DOWN(quota->esz - quota->charged_sz,
 						DAMON_MIN_REGION);
 				if (!sz)
 					goto update_stat;
 				damon_split_region_at(c, t, r, sz);
 			}
+			ktime_get_coarse_ts64(&begin);
 			c->primitive.apply_scheme(c, t, r, s);
+			ktime_get_coarse_ts64(&end);
+			quota->total_charged_ns += timespec64_to_ns(&end) -
+				timespec64_to_ns(&begin);
 			quota->charged_sz += sz;
-			if (quota->sz && quota->charged_sz >= quota->sz) {
+			if (quota->esz && quota->charged_sz >= quota->esz) {
 				quota->charge_target_from = t;
 				quota->charge_addr_from = r->ar.end + 1;
 			}
@@ -620,6 +630,29 @@ update_stat:
 	}
 }
 
+/* Shouldn't be called if quota->ms and quota->sz are zero */
+static void damos_set_effective_quota(struct damos_quota *quota)
+{
+	unsigned long throughput;
+	unsigned long esz;
+
+	if (!quota->ms) {
+		quota->esz = quota->sz;
+		return;
+	}
+
+	if (quota->total_charged_ns)
+		throughput = quota->total_charged_sz * 1000000 /
+			quota->total_charged_ns;
+	else
+		throughput = PAGE_SIZE * 1024;
+	esz = throughput * quota->ms;
+
+	if (quota->sz && quota->sz < esz)
+		esz = quota->sz;
+	quota->esz = esz;
+}
+
 static void kdamond_apply_schemes(struct damon_ctx *c)
 {
 	struct damon_target *t;
@@ -629,15 +662,17 @@ static void kdamond_apply_schemes(struct
 	damon_for_each_scheme(s, c) {
 		struct damos_quota *quota = &s->quota;
 
-		if (!quota->sz)
+		if (!quota->ms && !quota->sz)
 			continue;
 
 		/* New charge window starts */
 		if (time_after_eq(jiffies, quota->charged_from +
 					msecs_to_jiffies(
 						quota->reset_interval))) {
+			quota->total_charged_sz += quota->charged_sz;
 			quota->charged_from = jiffies;
 			quota->charged_sz = 0;
+			damos_set_effective_quota(quota);
 		}
 	}
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 243/262] mm/damon/dbgfs: support quotas of schemes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (241 preceding siblings ...)
  2021-11-05 20:47 ` [patch 242/262] mm/damon/schemes: implement time quota Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 244/262] mm/damon/selftests: support schemes quotas Andrew Morton
                   ` (18 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: support quotas of schemes

This commit makes the debugfs interface of DAMON support the scheme quotas by
changing the format of the input for the 'schemes' file.
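
For example, a scheme line in the new format could be written as sketched
below, which sets up a pageout scheme with a 1 MiB size quota per 1000 ms
window.  The debugfs path and the numeric action value are assumptions made
for the illustration; the field order follows the sscanf() format in this
patch.

    #include <stdio.h>

    int main(void)
    {
        /*
         * Field order, as expected by the sscanf() in this patch:
         * min_sz max_sz min_nr_accesses max_nr_accesses min_age max_age
         * action quota_ms quota_sz quota_reset_interval_ms
         * (action 2 is assumed to be DAMOS_PAGEOUT here)
         */
        const char *scheme = "4096 1073741824 0 5 10 20 2 0 1048576 1000\n";
        FILE *f = fopen("/sys/kernel/debug/damon/schemes", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs(scheme, f);
        return fclose(f) ? 1 : 0;
    }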

Link: https://lkml.kernel.org/r/20211019150731.16699-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-quotas-of-schemes
+++ a/mm/damon/dbgfs.c
@@ -105,11 +105,14 @@ static ssize_t sprint_schemes(struct dam
 
 	damon_for_each_scheme(s, c) {
 		rc = scnprintf(&buf[written], len - written,
-				"%lu %lu %u %u %u %u %d %lu %lu\n",
+				"%lu %lu %u %u %u %u %d %lu %lu %lu %lu %lu\n",
 				s->min_sz_region, s->max_sz_region,
 				s->min_nr_accesses, s->max_nr_accesses,
 				s->min_age_region, s->max_age_region,
-				s->action, s->stat_count, s->stat_sz);
+				s->action,
+				s->quota.ms, s->quota.sz,
+				s->quota.reset_interval,
+				s->stat_count, s->stat_sz);
 		if (!rc)
 			return -ENOMEM;
 
@@ -190,10 +193,11 @@ static struct damos **str_to_schemes(con
 	while (pos < len && *nr_schemes < max_nr_schemes) {
 		struct damos_quota quota = {};
 
-		ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u%n",
+		ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu%n",
 				&min_sz, &max_sz, &min_nr_a, &max_nr_a,
-				&min_age, &max_age, &action, &parsed);
-		if (ret != 7)
+				&min_age, &max_age, &action, &quota.ms,
+				&quota.sz, &quota.reset_interval, &parsed);
+		if (ret != 10)
 			break;
 		if (!damos_action_valid(action)) {
 			pr_err("wrong action %d\n", action);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 244/262] mm/damon/selftests: support schemes quotas
  2021-11-05 20:34 incoming Andrew Morton
                   ` (242 preceding siblings ...)
  2021-11-05 20:47 ` [patch 243/262] mm/damon/dbgfs: support quotas of schemes Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 245/262] mm/damon/schemes: prioritize regions within the quotas Andrew Morton
                   ` (17 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/selftests: support schemes quotas

This commit updates the DAMON selftests to support the updated 'schemes'
debugfs file format for the quotas.

Link: https://lkml.kernel.org/r/20211019150731.16699-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/damon/debugfs_attrs.sh |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/testing/selftests/damon/debugfs_attrs.sh~mm-damon-selftests-support-schemes-quotas
+++ a/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -63,10 +63,10 @@ echo "$orig_content" > "$file"
 file="$DBGFS/schemes"
 orig_content=$(cat "$file")
 
-test_write_succ "$file" "1 2 3 4 5 6 4" \
+test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0" \
 	"$orig_content" "valid input"
 test_write_fail "$file" "1 2
-3 4 5 6 3" "$orig_content" "multi lines"
+3 4 5 6 3 0 0 0" "$orig_content" "multi lines"
 test_write_succ "$file" "" "$orig_content" "disabling"
 echo "$orig_content" > "$file"
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 245/262] mm/damon/schemes: prioritize regions within the quotas
  2021-11-05 20:34 incoming Andrew Morton
                   ` (243 preceding siblings ...)
  2021-11-05 20:47 ` [patch 244/262] mm/damon/selftests: support schemes quotas Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 246/262] mm/damon/vaddr,paddr: support pageout prioritization Andrew Morton
                   ` (16 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: prioritize regions within the quotas

This commit makes DAMON apply schemes to regions having higher priority
first, if it cannot apply the schemes to all regions due to the quotas.

The prioritization function should be implemented in the monitoring
primitives.  Those would commonly calculate the priority of a region using
its attributes, namely 'size', 'nr_accesses', and 'age'.  For example, some
primitives would calculate the priority of each region using a weighted sum
of its 'nr_accesses' and 'age'.

The optimal weights would depend on the given environment, so this commit
makes them customizable.  Nevertheless, the score calculation functions are
only encouraged to respect the weights, not mandated to.
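
To stay within the quota, the core builds a histogram of how many bytes of
valid regions fall at each priority score and derives the minimum score that
still fits under the effective quota.  A minimal userspace sketch of that cut
(the numbers are arbitrary examples):

    #include <stdio.h>
    #include <string.h>

    #define MAX_SCORE 99

    /*
     * Given per-score byte totals, find the lowest score that still has to
     * be included to fill the quota; regions scoring below it get skipped.
     */
    static unsigned int min_score_for_quota(const unsigned long *histogram,
                                            unsigned long quota_bytes)
    {
        unsigned long cumulated = 0;
        unsigned int score;

        for (score = MAX_SCORE; ; score--) {
            cumulated += histogram[score];
            if (cumulated >= quota_bytes || !score)
                break;
        }
        return score;
    }

    int main(void)
    {
        unsigned long histogram[MAX_SCORE + 1];

        memset(histogram, 0, sizeof(histogram));
        histogram[90] = 2UL << 20;    /* 2 MiB of regions scoring 90 */
        histogram[50] = 8UL << 20;    /* 8 MiB of regions scoring 50 */
        histogram[10] = 32UL << 20;   /* 32 MiB of regions scoring 10 */

        /* with a 4 MiB quota, only regions scoring >= 50 make the cut */
        printf("min score: %u\n", min_score_for_quota(histogram, 4UL << 20));
        return 0;
    }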

Link: https://lkml.kernel.org/r/20211019150731.16699-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   26 ++++++++++++++++
 mm/damon/core.c       |   62 +++++++++++++++++++++++++++++++++++-----
 2 files changed, 81 insertions(+), 7 deletions(-)

--- a/include/linux/damon.h~mm-damon-schemes-prioritize-regions-within-the-quotas
+++ a/include/linux/damon.h
@@ -14,6 +14,8 @@
 
 /* Minimal region size.  Every damon_region is aligned by this. */
 #define DAMON_MIN_REGION	PAGE_SIZE
+/* Max priority score for DAMON-based operation schemes */
+#define DAMOS_MAX_SCORE		(99)
 
 /**
  * struct damon_addr_range - Represents an address region of [@start, @end).
@@ -95,6 +97,10 @@ enum damos_action {
  * @sz:			Maximum bytes of memory that the action can be applied.
  * @reset_interval:	Charge reset interval in milliseconds.
  *
+ * @weight_sz:		Weight of the region's size for prioritization.
+ * @weight_nr_accesses:	Weight of the region's nr_accesses for prioritization.
+ * @weight_age:		Weight of the region's age for prioritization.
+ *
  * To avoid consuming too much CPU time or IO resources for applying the
  * &struct damos->action to large memory, DAMON allows users to set time and/or
  * size quotas.  The quotas can be set by writing non-zero values to &ms and
@@ -106,12 +112,22 @@ enum damos_action {
  * Internally, the time quota is transformed to a size quota using estimated
  * throughput of the scheme's action.  DAMON then compares it against &sz and
  * uses smaller one as the effective quota.
+ *
+ * For selecting regions within the quota, DAMON prioritizes current scheme's
+ * target memory regions using the &struct damon_primitive->get_scheme_score.
+ * You could customize the prioritization logic by setting &weight_sz,
+ * &weight_nr_accesses, and &weight_age, because monitoring primitives are
+ * encouraged to respect those.
  */
 struct damos_quota {
 	unsigned long ms;
 	unsigned long sz;
 	unsigned long reset_interval;
 
+	unsigned int weight_sz;
+	unsigned int weight_nr_accesses;
+	unsigned int weight_age;
+
 /* private: */
 	/* For throughput estimation */
 	unsigned long total_charged_sz;
@@ -124,6 +140,10 @@ struct damos_quota {
 	unsigned long charged_from;
 	struct damon_target *charge_target_from;
 	unsigned long charge_addr_from;
+
+	/* For prioritization */
+	unsigned long histogram[DAMOS_MAX_SCORE + 1];
+	unsigned int min_score;
 };
 
 /**
@@ -174,6 +194,7 @@ struct damon_ctx;
  * @prepare_access_checks:	Prepare next access check of target regions.
  * @check_accesses:		Check the accesses to target regions.
  * @reset_aggregated:		Reset aggregated accesses monitoring results.
+ * @get_scheme_score:		Get the score of a region for a scheme.
  * @apply_scheme:		Apply a DAMON-based operation scheme.
  * @target_valid:		Determine if the target is valid.
  * @cleanup:			Clean up the context.
@@ -200,6 +221,8 @@ struct damon_ctx;
  * of its update.  The value will be used for regions adjustment threshold.
  * @reset_aggregated should reset the access monitoring results that aggregated
  * by @check_accesses.
+ * @get_scheme_score should return the priority score of a region for a scheme
+ * as an integer in [0, &DAMOS_MAX_SCORE].
  * @apply_scheme is called from @kdamond when a region for user provided
  * DAMON-based operation scheme is found.  It should apply the scheme's action
  * to the region.  This is not used for &DAMON_ARBITRARY_TARGET case.
@@ -213,6 +236,9 @@ struct damon_primitive {
 	void (*prepare_access_checks)(struct damon_ctx *context);
 	unsigned int (*check_accesses)(struct damon_ctx *context);
 	void (*reset_aggregated)(struct damon_ctx *context);
+	int (*get_scheme_score)(struct damon_ctx *context,
+			struct damon_target *t, struct damon_region *r,
+			struct damos *scheme);
 	int (*apply_scheme)(struct damon_ctx *context, struct damon_target *t,
 			struct damon_region *r, struct damos *scheme);
 	bool (*target_valid)(void *target);
--- a/mm/damon/core.c~mm-damon-schemes-prioritize-regions-within-the-quotas
+++ a/mm/damon/core.c
@@ -12,6 +12,7 @@
 #include <linux/kthread.h>
 #include <linux/random.h>
 #include <linux/slab.h>
+#include <linux/string.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/damon.h>
@@ -110,6 +111,9 @@ struct damos *damon_new_scheme(
 	scheme->quota.ms = quota->ms;
 	scheme->quota.sz = quota->sz;
 	scheme->quota.reset_interval = quota->reset_interval;
+	scheme->quota.weight_sz = quota->weight_sz;
+	scheme->quota.weight_nr_accesses = quota->weight_nr_accesses;
+	scheme->quota.weight_age = quota->weight_age;
 	scheme->quota.total_charged_sz = 0;
 	scheme->quota.total_charged_ns = 0;
 	scheme->quota.esz = 0;
@@ -545,6 +549,28 @@ static void damon_split_region_at(struct
 		struct damon_target *t, struct damon_region *r,
 		unsigned long sz_r);
 
+static bool __damos_valid_target(struct damon_region *r, struct damos *s)
+{
+	unsigned long sz;
+
+	sz = r->ar.end - r->ar.start;
+	return s->min_sz_region <= sz && sz <= s->max_sz_region &&
+		s->min_nr_accesses <= r->nr_accesses &&
+		r->nr_accesses <= s->max_nr_accesses &&
+		s->min_age_region <= r->age && r->age <= s->max_age_region;
+}
+
+static bool damos_valid_target(struct damon_ctx *c, struct damon_target *t,
+		struct damon_region *r, struct damos *s)
+{
+	bool ret = __damos_valid_target(r, s);
+
+	if (!ret || !s->quota.esz || !c->primitive.get_scheme_score)
+		return ret;
+
+	return c->primitive.get_scheme_score(c, t, r, s) >= s->quota.min_score;
+}
+
 static void damon_do_apply_schemes(struct damon_ctx *c,
 				   struct damon_target *t,
 				   struct damon_region *r)
@@ -591,13 +617,7 @@ static void damon_do_apply_schemes(struc
 			quota->charge_addr_from = 0;
 		}
 
-		/* Check the target regions condition */
-		if (sz < s->min_sz_region || s->max_sz_region < sz)
-			continue;
-		if (r->nr_accesses < s->min_nr_accesses ||
-				s->max_nr_accesses < r->nr_accesses)
-			continue;
-		if (r->age < s->min_age_region || s->max_age_region < r->age)
+		if (!damos_valid_target(c, t, r, s))
 			continue;
 
 		/* Apply the scheme */
@@ -661,6 +681,8 @@ static void kdamond_apply_schemes(struct
 
 	damon_for_each_scheme(s, c) {
 		struct damos_quota *quota = &s->quota;
+		unsigned long cumulated_sz;
+		unsigned int score, max_score = 0;
 
 		if (!quota->ms && !quota->sz)
 			continue;
@@ -674,6 +696,32 @@ static void kdamond_apply_schemes(struct
 			quota->charged_sz = 0;
 			damos_set_effective_quota(quota);
 		}
+
+		if (!c->primitive.get_scheme_score)
+			continue;
+
+		/* Fill up the score histogram */
+		memset(quota->histogram, 0, sizeof(quota->histogram));
+		damon_for_each_target(t, c) {
+			damon_for_each_region(r, t) {
+				if (!__damos_valid_target(r, s))
+					continue;
+				score = c->primitive.get_scheme_score(
+						c, t, r, s);
+				quota->histogram[score] +=
+					r->ar.end - r->ar.start;
+				if (score > max_score)
+					max_score = score;
+			}
+		}
+
+		/* Set the min score limit */
+		for (cumulated_sz = 0, score = max_score; ; score--) {
+			cumulated_sz += quota->histogram[score];
+			if (cumulated_sz >= quota->esz || !score)
+				break;
+		}
+		quota->min_score = score;
 	}
 
 	damon_for_each_target(t, c) {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 246/262] mm/damon/vaddr,paddr: support pageout prioritization
  2021-11-05 20:34 incoming Andrew Morton
                   ` (244 preceding siblings ...)
  2021-11-05 20:47 ` [patch 245/262] mm/damon/schemes: prioritize regions within the quotas Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 247/262] mm/damon/dbgfs: support prioritization weights Andrew Morton
                   ` (15 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/vaddr,paddr: support pageout prioritization

This commit makes the default monitoring primitives for the virtual address
spaces and the physical address space support memory region prioritization
for the 'PAGEOUT' DAMOS action.  It calculates the hotness of each region as
a weighted sum of the region's 'nr_accesses' and 'age', and takes the
priority score as the reverse of the hotness, so that colder regions can be
paged out first.
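
A minimal userspace sketch of that score, assuming the frequency and age
subscores have already been scaled into [0, 100] (the patch derives those
from 'nr_accesses', 'age', and the monitoring intervals, which is omitted
here):

    #include <stdio.h>

    #define MAX_SUBSCORE 100
    #define MAX_SCORE    99

    /*
     * Coldness score: hotness is a weighted sum of an access-frequency
     * subscore and an age subscore (both in [0, 100]); the returned
     * priority is its reverse, so colder regions get higher scores.
     */
    static int pageout_score(int freq_subscore, int age_subscore,
                             int freq_weight, int age_weight)
    {
        int hotness = freq_weight * freq_subscore + age_weight * age_subscore;

        if (freq_weight + age_weight)
            hotness /= freq_weight + age_weight;
        hotness = hotness * MAX_SCORE / MAX_SUBSCORE;
        return MAX_SCORE - hotness;
    }

    int main(void)
    {
        /* rarely accessed and old: close to the maximum (coldest) score */
        printf("%d\n", pageout_score(0, 10, 1, 1));    /* 95 */
        /* frequently accessed, frequency weighted 3x: low pageout priority */
        printf("%d\n", pageout_score(90, 20, 3, 1));   /* 28 */
        return 0;
    }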

Link: https://lkml.kernel.org/r/20211019150731.16699-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h   |    4 +++
 mm/damon/paddr.c        |   14 +++++++++++
 mm/damon/prmtv-common.c |   46 ++++++++++++++++++++++++++++++++++++++
 mm/damon/prmtv-common.h |    3 ++
 mm/damon/vaddr.c        |   15 ++++++++++++
 5 files changed, 82 insertions(+)

--- a/include/linux/damon.h~mm-damon-vaddrpaddr-support-pageout-prioritization
+++ a/include/linux/damon.h
@@ -421,6 +421,8 @@ bool damon_va_target_valid(void *t);
 void damon_va_cleanup(struct damon_ctx *ctx);
 int damon_va_apply_scheme(struct damon_ctx *context, struct damon_target *t,
 		struct damon_region *r, struct damos *scheme);
+int damon_va_scheme_score(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme);
 void damon_va_set_primitives(struct damon_ctx *ctx);
 
 #endif	/* CONFIG_DAMON_VADDR */
@@ -433,6 +435,8 @@ unsigned int damon_pa_check_accesses(str
 bool damon_pa_target_valid(void *t);
 int damon_pa_apply_scheme(struct damon_ctx *context, struct damon_target *t,
 		struct damon_region *r, struct damos *scheme);
+int damon_pa_scheme_score(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme);
 void damon_pa_set_primitives(struct damon_ctx *ctx);
 
 #endif	/* CONFIG_DAMON_PADDR */
--- a/mm/damon/paddr.c~mm-damon-vaddrpaddr-support-pageout-prioritization
+++ a/mm/damon/paddr.c
@@ -246,6 +246,19 @@ int damon_pa_apply_scheme(struct damon_c
 	return 0;
 }
 
+int damon_pa_scheme_score(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme)
+{
+	switch (scheme->action) {
+	case DAMOS_PAGEOUT:
+		return damon_pageout_score(context, r, scheme);
+	default:
+		break;
+	}
+
+	return DAMOS_MAX_SCORE;
+}
+
 void damon_pa_set_primitives(struct damon_ctx *ctx)
 {
 	ctx->primitive.init = NULL;
@@ -256,4 +269,5 @@ void damon_pa_set_primitives(struct damo
 	ctx->primitive.target_valid = damon_pa_target_valid;
 	ctx->primitive.cleanup = NULL;
 	ctx->primitive.apply_scheme = damon_pa_apply_scheme;
+	ctx->primitive.get_scheme_score = damon_pa_scheme_score;
 }
--- a/mm/damon/prmtv-common.c~mm-damon-vaddrpaddr-support-pageout-prioritization
+++ a/mm/damon/prmtv-common.c
@@ -85,3 +85,49 @@ void damon_pmdp_mkold(pmd_t *pmd, struct
 	put_page(page);
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 }
+
+#define DAMON_MAX_SUBSCORE	(100)
+#define DAMON_MAX_AGE_IN_LOG	(32)
+
+int damon_pageout_score(struct damon_ctx *c, struct damon_region *r,
+			struct damos *s)
+{
+	unsigned int max_nr_accesses;
+	int freq_subscore;
+	unsigned int age_in_sec;
+	int age_in_log, age_subscore;
+	unsigned int freq_weight = s->quota.weight_nr_accesses;
+	unsigned int age_weight = s->quota.weight_age;
+	int hotness;
+
+	max_nr_accesses = c->aggr_interval / c->sample_interval;
+	freq_subscore = r->nr_accesses * DAMON_MAX_SUBSCORE / max_nr_accesses;
+
+	age_in_sec = (unsigned long)r->age * c->aggr_interval / 1000000;
+	for (age_in_log = 0; age_in_log < DAMON_MAX_AGE_IN_LOG && age_in_sec;
+			age_in_log++, age_in_sec >>= 1)
+		;
+
+	/* If frequency is 0, higher age means it's colder */
+	if (freq_subscore == 0)
+		age_in_log *= -1;
+
+	/*
+	 * Now age_in_log is in [-DAMON_MAX_AGE_IN_LOG, DAMON_MAX_AGE_IN_LOG].
+	 * Scale it to be in [0, 100] and set it as age subscore.
+	 */
+	age_in_log += DAMON_MAX_AGE_IN_LOG;
+	age_subscore = age_in_log * DAMON_MAX_SUBSCORE /
+		DAMON_MAX_AGE_IN_LOG / 2;
+
+	hotness = (freq_weight * freq_subscore + age_weight * age_subscore);
+	if (freq_weight + age_weight)
+		hotness /= freq_weight + age_weight;
+	/*
+	 * Transform it to fit in [0, DAMOS_MAX_SCORE]
+	 */
+	hotness = hotness * DAMOS_MAX_SCORE / DAMON_MAX_SUBSCORE;
+
+	/* Return coldness of the region */
+	return DAMOS_MAX_SCORE - hotness;
+}
--- a/mm/damon/prmtv-common.h~mm-damon-vaddrpaddr-support-pageout-prioritization
+++ a/mm/damon/prmtv-common.h
@@ -15,3 +15,6 @@ struct page *damon_get_page(unsigned lon
 
 void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm, unsigned long addr);
 void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm, unsigned long addr);
+
+int damon_pageout_score(struct damon_ctx *c, struct damon_region *r,
+			struct damos *s);
--- a/mm/damon/vaddr.c~mm-damon-vaddrpaddr-support-pageout-prioritization
+++ a/mm/damon/vaddr.c
@@ -633,6 +633,20 @@ int damon_va_apply_scheme(struct damon_c
 	return damos_madvise(t, r, madv_action);
 }
 
+int damon_va_scheme_score(struct damon_ctx *context, struct damon_target *t,
+		struct damon_region *r, struct damos *scheme)
+{
+
+	switch (scheme->action) {
+	case DAMOS_PAGEOUT:
+		return damon_pageout_score(context, r, scheme);
+	default:
+		break;
+	}
+
+	return DAMOS_MAX_SCORE;
+}
+
 void damon_va_set_primitives(struct damon_ctx *ctx)
 {
 	ctx->primitive.init = damon_va_init;
@@ -643,6 +657,7 @@ void damon_va_set_primitives(struct damo
 	ctx->primitive.target_valid = damon_va_target_valid;
 	ctx->primitive.cleanup = NULL;
 	ctx->primitive.apply_scheme = damon_va_apply_scheme;
+	ctx->primitive.get_scheme_score = damon_va_scheme_score;
 }
 
 #include "vaddr-test.h"
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 247/262] mm/damon/dbgfs: support prioritization weights
  2021-11-05 20:34 incoming Andrew Morton
                   ` (245 preceding siblings ...)
  2021-11-05 20:47 ` [patch 246/262] mm/damon/vaddr,paddr: support pageout prioritization Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 248/262] tools/selftests/damon: update for regions prioritization of schemes Andrew Morton
                   ` (14 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: support prioritization weights

This commit allows DAMON debugfs interface users to set the prioritization
weights by putting three more numbers into the 'schemes' file.

Link: https://lkml.kernel.org/r/20211019150731.16699-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-prioritization-weights
+++ a/mm/damon/dbgfs.c
@@ -105,13 +105,16 @@ static ssize_t sprint_schemes(struct dam
 
 	damon_for_each_scheme(s, c) {
 		rc = scnprintf(&buf[written], len - written,
-				"%lu %lu %u %u %u %u %d %lu %lu %lu %lu %lu\n",
+				"%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %lu %lu\n",
 				s->min_sz_region, s->max_sz_region,
 				s->min_nr_accesses, s->max_nr_accesses,
 				s->min_age_region, s->max_age_region,
 				s->action,
 				s->quota.ms, s->quota.sz,
 				s->quota.reset_interval,
+				s->quota.weight_sz,
+				s->quota.weight_nr_accesses,
+				s->quota.weight_age,
 				s->stat_count, s->stat_sz);
 		if (!rc)
 			return -ENOMEM;
@@ -193,11 +196,14 @@ static struct damos **str_to_schemes(con
 	while (pos < len && *nr_schemes < max_nr_schemes) {
 		struct damos_quota quota = {};
 
-		ret = sscanf(&str[pos], "%lu %lu %u %u %u %u %u %lu %lu %lu%n",
+		ret = sscanf(&str[pos],
+				"%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n",
 				&min_sz, &max_sz, &min_nr_a, &max_nr_a,
 				&min_age, &max_age, &action, &quota.ms,
-				&quota.sz, &quota.reset_interval, &parsed);
-		if (ret != 10)
+				&quota.sz, &quota.reset_interval,
+				&quota.weight_sz, &quota.weight_nr_accesses,
+				&quota.weight_age, &parsed);
+		if (ret != 13)
 			break;
 		if (!damos_action_valid(action)) {
 			pr_err("wrong action %d\n", action);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 248/262] tools/selftests/damon: update for regions prioritization of schemes
  2021-11-05 20:34 incoming Andrew Morton
                   ` (246 preceding siblings ...)
  2021-11-05 20:47 ` [patch 247/262] mm/damon/dbgfs: support prioritization weights Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 249/262] mm/damon/schemes: activate schemes based on a watermarks mechanism Andrew Morton
                   ` (13 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: tools/selftests/damon: update for regions prioritization of schemes

This commit updates the DAMON selftests for the 'schemes' debugfs file, as
the file format has been updated.

Link: https://lkml.kernel.org/r/20211019150731.16699-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/damon/debugfs_attrs.sh |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/testing/selftests/damon/debugfs_attrs.sh~tools-selftests-damon-update-for-regions-prioritization-of-schemes
+++ a/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -63,10 +63,10 @@ echo "$orig_content" > "$file"
 file="$DBGFS/schemes"
 orig_content=$(cat "$file")
 
-test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0" \
+test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3" \
 	"$orig_content" "valid input"
 test_write_fail "$file" "1 2
-3 4 5 6 3 0 0 0" "$orig_content" "multi lines"
+3 4 5 6 3 0 0 0 1 2 3" "$orig_content" "multi lines"
 test_write_succ "$file" "" "$orig_content" "disabling"
 echo "$orig_content" > "$file"
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 249/262] mm/damon/schemes: activate schemes based on a watermarks mechanism
  2021-11-05 20:34 incoming Andrew Morton
                   ` (247 preceding siblings ...)
  2021-11-05 20:47 ` [patch 248/262] tools/selftests/damon: update for regions prioritization of schemes Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 250/262] mm/damon/dbgfs: support watermarks Andrew Morton
                   ` (12 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/schemes: activate schemes based on a watermarks mechanism

DAMON-based operation schemes need to be manually turned on and off.  In some
use cases, however, the condition for turning a scheme on and off would
depend on the system's situation.  For example, schemes for proactive page
reclamation would need to be turned on when some memory pressure is detected,
and turned off when the system has enough free memory.

For easier control of scheme activation based on the system situation, this
commit introduces a watermarks-based mechanism.  The client can describe the
watermark metric (e.g., amount of free memory in the system), the watermark
check interval, and three watermarks, namely high, mid, and low.  If the
scheme is deactivated, it only gets the metric and compares it to the three
watermarks for every check interval.  If the metric is higher than the high
watermark, the scheme is deactivated.  If the metric is between the mid
watermark and the low watermark, the scheme is activated.  If the metric is
lower than the low watermark, the scheme is deactivated again.  This is to
allow users to fall back to traditional page-granularity mechanisms.
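
A minimal userspace sketch of the activation decision, assuming the
free-memory-rate metric in [0, 1000] (the check-interval sleeping and the
kernel's pr_debug() messages are left out; the function name is made up):

    #include <stdio.h>
    #include <stdbool.h>

    /*
     * Decide whether a scheme should be active for the given metric value,
     * carrying the previous state (the mid..high band keeps the old state).
     */
    static bool wmark_active(unsigned long metric, unsigned long high,
                             unsigned long mid, unsigned long low,
                             bool was_active)
    {
        if (metric > high || metric < low)
            return false;         /* plenty of free memory, or too little */
        if (metric >= mid)
            return was_active;    /* between mid and high: keep the state */
        return true;              /* between low and mid: activate */
    }

    int main(void)
    {
        /* free memory rate in [0, 1000]; e.g. high=500, mid=400, low=200 */
        printf("%d\n", wmark_active(600, 500, 400, 200, false));  /* 0 */
        printf("%d\n", wmark_active(300, 500, 400, 200, false));  /* 1 */
        printf("%d\n", wmark_active(450, 500, 400, 200, true));   /* 1 */
        printf("%d\n", wmark_active(100, 500, 400, 200, true));   /* 0 */
        return 0;
    }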

Link: https://lkml.kernel.org/r/20211019150731.16699-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   52 +++++++++++++++++++++
 mm/damon/core.c       |   97 +++++++++++++++++++++++++++++++++++++++-
 mm/damon/dbgfs.c      |    5 +-
 3 files changed, 151 insertions(+), 3 deletions(-)

--- a/include/linux/damon.h~mm-damon-schemes-activate-schemes-based-on-a-watermarks-mechanism
+++ a/include/linux/damon.h
@@ -147,6 +147,45 @@ struct damos_quota {
 };
 
 /**
+ * enum damos_wmark_metric - Represents the watermark metric.
+ *
+ * @DAMOS_WMARK_NONE:		Ignore the watermarks of the given scheme.
+ * @DAMOS_WMARK_FREE_MEM_RATE:	Free memory rate of the system in [0,1000].
+ */
+enum damos_wmark_metric {
+	DAMOS_WMARK_NONE,
+	DAMOS_WMARK_FREE_MEM_RATE,
+};
+
+/**
+ * struct damos_watermarks - Controls when a given scheme should be activated.
+ * @metric:	Metric for the watermarks.
+ * @interval:	Watermarks check time interval in microseconds.
+ * @high:	High watermark.
+ * @mid:	Middle watermark.
+ * @low:	Low watermark.
+ *
+ * If &metric is &DAMOS_WMARK_NONE, the scheme is always active.  Being active
+ * means DAMON does monitoring and applying the action of the scheme to
+ * appropriate memory regions.  Else, DAMON checks &metric of the system for at
+ * least every &interval microseconds and works as below.
+ *
+ * If &metric is higher than &high, the scheme is inactivated.  If &metric is
+ * between &mid and &low, the scheme is activated.  If &metric is lower than
+ * &low, the scheme is inactivated.
+ */
+struct damos_watermarks {
+	enum damos_wmark_metric metric;
+	unsigned long interval;
+	unsigned long high;
+	unsigned long mid;
+	unsigned long low;
+
+/* private: */
+	bool activated;
+};
+
+/**
  * struct damos - Represents a Data Access Monitoring-based Operation Scheme.
  * @min_sz_region:	Minimum size of target regions.
  * @max_sz_region:	Maximum size of target regions.
@@ -156,6 +195,7 @@ struct damos_quota {
  * @max_age_region:	Maximum age of target regions.
  * @action:		&damo_action to be applied to the target regions.
  * @quota:		Control the aggressiveness of this scheme.
+ * @wmarks:		Watermarks for automated (in)activation of this scheme.
  * @stat_count:		Total number of regions that this scheme is applied.
  * @stat_sz:		Total size of regions that this scheme is applied.
  * @list:		List head for siblings.
@@ -166,6 +206,14 @@ struct damos_quota {
  * those.  To avoid consuming too much CPU time or IO resources for the
  * &action, &quota is used.
  *
+ * To do the work only when needed, schemes can be activated for specific
+ * system situations using &wmarks.  If all schemes that registered to the
+ * monitoring context are inactive, DAMON stops monitoring either, and just
+ * repeatedly checks the watermarks.
+ *
+ * If all schemes that registered to a &struct damon_ctx are inactive, DAMON
+ * stops monitoring and just repeatedly checks the watermarks.
+ *
  * After applying the &action to each region, &stat_count and &stat_sz is
  * updated to reflect the number of regions and total size of regions that the
  * &action is applied.
@@ -179,6 +227,7 @@ struct damos {
 	unsigned int max_age_region;
 	enum damos_action action;
 	struct damos_quota quota;
+	struct damos_watermarks wmarks;
 	unsigned long stat_count;
 	unsigned long stat_sz;
 	struct list_head list;
@@ -384,7 +433,8 @@ struct damos *damon_new_scheme(
 		unsigned long min_sz_region, unsigned long max_sz_region,
 		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
 		unsigned int min_age_region, unsigned int max_age_region,
-		enum damos_action action, struct damos_quota *quota);
+		enum damos_action action, struct damos_quota *quota,
+		struct damos_watermarks *wmarks);
 void damon_add_scheme(struct damon_ctx *ctx, struct damos *s);
 void damon_destroy_scheme(struct damos *s);
 
--- a/mm/damon/core.c~mm-damon-schemes-activate-schemes-based-on-a-watermarks-mechanism
+++ a/mm/damon/core.c
@@ -10,6 +10,7 @@
 #include <linux/damon.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
+#include <linux/mm.h>
 #include <linux/random.h>
 #include <linux/slab.h>
 #include <linux/string.h>
@@ -90,7 +91,8 @@ struct damos *damon_new_scheme(
 		unsigned long min_sz_region, unsigned long max_sz_region,
 		unsigned int min_nr_accesses, unsigned int max_nr_accesses,
 		unsigned int min_age_region, unsigned int max_age_region,
-		enum damos_action action, struct damos_quota *quota)
+		enum damos_action action, struct damos_quota *quota,
+		struct damos_watermarks *wmarks)
 {
 	struct damos *scheme;
 
@@ -122,6 +124,13 @@ struct damos *damon_new_scheme(
 	scheme->quota.charge_target_from = NULL;
 	scheme->quota.charge_addr_from = 0;
 
+	scheme->wmarks.metric = wmarks->metric;
+	scheme->wmarks.interval = wmarks->interval;
+	scheme->wmarks.high = wmarks->high;
+	scheme->wmarks.mid = wmarks->mid;
+	scheme->wmarks.low = wmarks->low;
+	scheme->wmarks.activated = true;
+
 	return scheme;
 }
 
@@ -582,6 +591,9 @@ static void damon_do_apply_schemes(struc
 		unsigned long sz = r->ar.end - r->ar.start;
 		struct timespec64 begin, end;
 
+		if (!s->wmarks.activated)
+			continue;
+
 		/* Check the quota */
 		if (quota->esz && quota->charged_sz >= quota->esz)
 			continue;
@@ -684,6 +696,9 @@ static void kdamond_apply_schemes(struct
 		unsigned long cumulated_sz;
 		unsigned int score, max_score = 0;
 
+		if (!s->wmarks.activated)
+			continue;
+
 		if (!quota->ms && !quota->sz)
 			continue;
 
@@ -924,6 +939,83 @@ static bool kdamond_need_stop(struct dam
 	return true;
 }
 
+static unsigned long damos_wmark_metric_value(enum damos_wmark_metric metric)
+{
+	struct sysinfo i;
+
+	switch (metric) {
+	case DAMOS_WMARK_FREE_MEM_RATE:
+		si_meminfo(&i);
+		return i.freeram * 1000 / i.totalram;
+	default:
+		break;
+	}
+	return -EINVAL;
+}
+
+/*
+ * Returns zero if the scheme is active.  Else, returns time to wait for next
+ * watermark check in micro-seconds.
+ */
+static unsigned long damos_wmark_wait_us(struct damos *scheme)
+{
+	unsigned long metric;
+
+	if (scheme->wmarks.metric == DAMOS_WMARK_NONE)
+		return 0;
+
+	metric = damos_wmark_metric_value(scheme->wmarks.metric);
+	/* higher than high watermark or lower than low watermark */
+	if (metric > scheme->wmarks.high || scheme->wmarks.low > metric) {
+		if (scheme->wmarks.activated)
+			pr_debug("inactivate a scheme (%d) for %s wmark\n",
+					scheme->action,
+					metric > scheme->wmarks.high ?
+					"high" : "low");
+		scheme->wmarks.activated = false;
+		return scheme->wmarks.interval;
+	}
+
+	/* inactive and higher than middle watermark */
+	if ((scheme->wmarks.high >= metric && metric >= scheme->wmarks.mid) &&
+			!scheme->wmarks.activated)
+		return scheme->wmarks.interval;
+
+	if (!scheme->wmarks.activated)
+		pr_debug("activate a scheme (%d)\n", scheme->action);
+	scheme->wmarks.activated = true;
+	return 0;
+}
+
+static void kdamond_usleep(unsigned long usecs)
+{
+	if (usecs > 100 * 1000)
+		schedule_timeout_interruptible(usecs_to_jiffies(usecs));
+	else
+		usleep_range(usecs, usecs + 1);
+}
+
+/* Returns negative error code if it's not activated but should return */
+static int kdamond_wait_activation(struct damon_ctx *ctx)
+{
+	struct damos *s;
+	unsigned long wait_time;
+	unsigned long min_wait_time = 0;
+
+	while (!kdamond_need_stop(ctx)) {
+		damon_for_each_scheme(s, ctx) {
+			wait_time = damos_wmark_wait_us(s);
+			if (!min_wait_time || wait_time < min_wait_time)
+				min_wait_time = wait_time;
+		}
+		if (!min_wait_time)
+			return 0;
+
+		kdamond_usleep(min_wait_time);
+	}
+	return -EBUSY;
+}
+
 static void set_kdamond_stop(struct damon_ctx *ctx)
 {
 	mutex_lock(&ctx->kdamond_lock);
@@ -952,6 +1044,9 @@ static int kdamond_fn(void *data)
 	sz_limit = damon_region_sz_limit(ctx);
 
 	while (!kdamond_need_stop(ctx)) {
+		if (kdamond_wait_activation(ctx))
+			continue;
+
 		if (ctx->primitive.prepare_access_checks)
 			ctx->primitive.prepare_access_checks(ctx);
 		if (ctx->callback.after_sampling &&
--- a/mm/damon/dbgfs.c~mm-damon-schemes-activate-schemes-based-on-a-watermarks-mechanism
+++ a/mm/damon/dbgfs.c
@@ -195,6 +195,9 @@ static struct damos **str_to_schemes(con
 	*nr_schemes = 0;
 	while (pos < len && *nr_schemes < max_nr_schemes) {
 		struct damos_quota quota = {};
+		struct damos_watermarks wmarks = {
+			.metric = DAMOS_WMARK_NONE,
+		};
 
 		ret = sscanf(&str[pos],
 				"%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n",
@@ -212,7 +215,7 @@ static struct damos **str_to_schemes(con
 
 		pos += parsed;
 		scheme = damon_new_scheme(min_sz, max_sz, min_nr_a, max_nr_a,
-				min_age, max_age, action, &quota);
+				min_age, max_age, action, &quota, &wmarks);
 		if (!scheme)
 			goto fail;
 
_
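
For reference, below is a minimal, illustrative sketch (not part of the patch)
of how a caller could construct a scheme with watermarks through the extended
damon_new_scheme() interface.  The metric, interval, and threshold values are
made-up examples, and the pattern simply mirrors how the DAMON_RECLAIM module
later in this series uses the API.

/* Illustrative sketch only; the numeric values are examples. */
#include <linux/damon.h>
#include <linux/limits.h>
#include <linux/mm.h>

static struct damos *example_watermarked_scheme(void)
{
	struct damos_watermarks wmarks = {
		.metric = DAMOS_WMARK_FREE_MEM_RATE,
		.interval = 5000000,	/* check the metric every 5 seconds */
		.high = 500,		/* inactive above 50% free memory */
		.mid = 400,		/* (re)activate once below 40% */
		.low = 200,		/* inactive again below 20% */
	};
	struct damos_quota quota = {};	/* no time/size quota */

	/* Page out regions of any size that showed no access, at any age. */
	return damon_new_scheme(PAGE_SIZE, ULONG_MAX, 0, 0, 0, UINT_MAX,
				DAMOS_PAGEOUT, &quota, &wmarks);
}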

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 250/262] mm/damon/dbgfs: support watermarks
  2021-11-05 20:34 incoming Andrew Morton
                   ` (248 preceding siblings ...)
  2021-11-05 20:47 ` [patch 249/262] mm/damon/schemes: activate schemes based on a watermarks mechanism Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 251/262] selftests/damon: " Andrew Morton
                   ` (11 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon/dbgfs: support watermarks

This commit updates the DAMON debugfs interface to support the
watermarks-based activation of schemes.  For this, the 'schemes' file now
receives five more values per scheme: the watermark metric, the check
interval, and the high, mid, and low thresholds.
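
For illustration only (not part of the patch), the extended line format can be
sanity-checked from user space as below.  The example line is the one used by
the selftest update in the next patch, the field order follows the sscanf()
format string of this patch, and the program itself is just a hypothetical
helper.

#include <stdio.h>

int main(void)
{
	/* Example scheme line, taken from the selftest update. */
	const char *line = "1 2 3 4 5 6 4 0 0 0 1 2 3 1 100 3 2 1";
	unsigned long min_sz, max_sz, q_ms, q_sz, q_reset;
	unsigned long w_interval, w_high, w_mid, w_low;
	unsigned int min_acc, max_acc, min_age, max_age, action;
	unsigned int w_sz, w_acc, w_age, w_metric;
	int ret;

	/* Same field order as str_to_schemes(), without the trailing %n. */
	ret = sscanf(line,
		"%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u %u %lu %lu %lu %lu",
		&min_sz, &max_sz, &min_acc, &max_acc, &min_age, &max_age,
		&action, &q_ms, &q_sz, &q_reset, &w_sz, &w_acc, &w_age,
		&w_metric, &w_interval, &w_high, &w_mid, &w_low);

	printf("parsed %d fields; wmarks: metric=%u interval=%lu high=%lu mid=%lu low=%lu\n",
	       ret, w_metric, w_interval, w_high, w_mid, w_low);
	return ret == 18 ? 0 : 1;
}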

Link: https://lkml.kernel.org/r/20211019150731.16699-13-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-watermarks
+++ a/mm/damon/dbgfs.c
@@ -105,7 +105,7 @@ static ssize_t sprint_schemes(struct dam
 
 	damon_for_each_scheme(s, c) {
 		rc = scnprintf(&buf[written], len - written,
-				"%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %lu %lu\n",
+				"%lu %lu %u %u %u %u %d %lu %lu %lu %u %u %u %d %lu %lu %lu %lu %lu %lu\n",
 				s->min_sz_region, s->max_sz_region,
 				s->min_nr_accesses, s->max_nr_accesses,
 				s->min_age_region, s->max_age_region,
@@ -115,6 +115,8 @@ static ssize_t sprint_schemes(struct dam
 				s->quota.weight_sz,
 				s->quota.weight_nr_accesses,
 				s->quota.weight_age,
+				s->wmarks.metric, s->wmarks.interval,
+				s->wmarks.high, s->wmarks.mid, s->wmarks.low,
 				s->stat_count, s->stat_sz);
 		if (!rc)
 			return -ENOMEM;
@@ -195,18 +197,18 @@ static struct damos **str_to_schemes(con
 	*nr_schemes = 0;
 	while (pos < len && *nr_schemes < max_nr_schemes) {
 		struct damos_quota quota = {};
-		struct damos_watermarks wmarks = {
-			.metric = DAMOS_WMARK_NONE,
-		};
+		struct damos_watermarks wmarks;
 
 		ret = sscanf(&str[pos],
-				"%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u%n",
+				"%lu %lu %u %u %u %u %u %lu %lu %lu %u %u %u %u %lu %lu %lu %lu%n",
 				&min_sz, &max_sz, &min_nr_a, &max_nr_a,
 				&min_age, &max_age, &action, &quota.ms,
 				&quota.sz, &quota.reset_interval,
 				&quota.weight_sz, &quota.weight_nr_accesses,
-				&quota.weight_age, &parsed);
-		if (ret != 13)
+				&quota.weight_age, &wmarks.metric,
+				&wmarks.interval, &wmarks.high, &wmarks.mid,
+				&wmarks.low, &parsed);
+		if (ret != 18)
 			break;
 		if (!damos_action_valid(action)) {
 			pr_err("wrong action %d\n", action);
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 251/262] selftests/damon: support watermarks
  2021-11-05 20:34 incoming Andrew Morton
                   ` (249 preceding siblings ...)
  2021-11-05 20:47 ` [patch 250/262] mm/damon/dbgfs: support watermarks Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:47 ` [patch 252/262] mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) Andrew Morton
                   ` (10 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: selftests/damon: support watermarks

This commit updates the DAMON selftests for the 'schemes' debugfs file to
reflect the changes in the format.

Link: https://lkml.kernel.org/r/20211019150731.16699-14-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/damon/debugfs_attrs.sh |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/testing/selftests/damon/debugfs_attrs.sh~selftests-damon-support-watermarks
+++ a/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -63,10 +63,10 @@ echo "$orig_content" > "$file"
 file="$DBGFS/schemes"
 orig_content=$(cat "$file")
 
-test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3" \
+test_write_succ "$file" "1 2 3 4 5 6 4 0 0 0 1 2 3 1 100 3 2 1" \
 	"$orig_content" "valid input"
 test_write_fail "$file" "1 2
-3 4 5 6 3 0 0 0 1 2 3" "$orig_content" "multi lines"
+3 4 5 6 3 0 0 0 1 2 3 1 100 3 2 1" "$orig_content" "multi lines"
 test_write_succ "$file" "" "$orig_content" "disabling"
 echo "$orig_content" > "$file"
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 252/262] mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)
  2021-11-05 20:34 incoming Andrew Morton
                   ` (250 preceding siblings ...)
  2021-11-05 20:47 ` [patch 251/262] selftests/damon: " Andrew Morton
@ 2021-11-05 20:47 ` Andrew Morton
  2021-11-05 20:48 ` [patch 253/262] Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM Andrew Morton
                   ` (9 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:47 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds, yangyingliang

From: SeongJae Park <sj@kernel.org>
Subject: mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM)

This commit implements a new kernel subsystem that finds cold memory
regions using DAMON and reclaims them immediately.  It is intended to be
used as a proactive and lightweight reclamation logic for light memory
pressure.  For heavy memory pressure, it could be inactivated so that the
system falls back to the traditional page-scanning based reclamation.

It is implemented on top of the DAMON framework to use the DAMON-based
Operation Schemes (DAMOS) feature.  It utilizes all the DAMOS features,
including the speed limit, prioritization, and watermarks.

It can be enabled and tuned at boot time via the kernel boot parameters,
and at runtime via its module parameters
('/sys/module/damon_reclaim/parameters/') interface.

[yangyingliang@huawei.com: fix error return code in damon_reclaim_turn()]
  Link: https://lkml.kernel.org/r/20211025124500.2758060-1-yangyingliang@huawei.com
Link: https://lkml.kernel.org/r/20211019150731.16699-15-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/Kconfig   |   12 +
 mm/damon/Makefile  |    1 
 mm/damon/reclaim.c |  356 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 369 insertions(+)

--- a/mm/damon/Kconfig~mm-damon-introduce-damon-based-reclamation-damon_reclaim
+++ a/mm/damon/Kconfig
@@ -73,4 +73,16 @@ config DAMON_DBGFS_KUNIT_TEST
 
 	  If unsure, say N.
 
+config DAMON_RECLAIM
+	bool "Build DAMON-based reclaim (DAMON_RECLAIM)"
+	depends on DAMON_PADDR
+	help
+	  This builds the DAMON-based reclamation subsystem.  It finds pages
+	  that are not accessed for a long time (cold) using DAMON and
+	  reclaims them.
+
+	  This is suggested to be used as a proactive and lightweight
+	  reclamation under light memory pressure, while the traditional page
+	  scanning-based reclamation is used for heavy pressure.
+
 endmenu
--- a/mm/damon/Makefile~mm-damon-introduce-damon-based-reclamation-damon_reclaim
+++ a/mm/damon/Makefile
@@ -4,3 +4,4 @@ obj-$(CONFIG_DAMON)		:= core.o
 obj-$(CONFIG_DAMON_VADDR)	+= prmtv-common.o vaddr.o
 obj-$(CONFIG_DAMON_PADDR)	+= prmtv-common.o paddr.o
 obj-$(CONFIG_DAMON_DBGFS)	+= dbgfs.o
+obj-$(CONFIG_DAMON_RECLAIM)	+= reclaim.o
--- /dev/null
+++ a/mm/damon/reclaim.c
@@ -0,0 +1,356 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAMON-based page reclamation
+ *
+ * Author: SeongJae Park <sj@kernel.org>
+ */
+
+#define pr_fmt(fmt) "damon-reclaim: " fmt
+
+#include <linux/damon.h>
+#include <linux/ioport.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/workqueue.h>
+
+#ifdef MODULE_PARAM_PREFIX
+#undef MODULE_PARAM_PREFIX
+#endif
+#define MODULE_PARAM_PREFIX "damon_reclaim."
+
+/*
+ * Enable or disable DAMON_RECLAIM.
+ *
+ * You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``.
+ * Setting it as ``N`` disables DAMON_RECLAIM.  Note that DAMON_RECLAIM could
+ * do no real monitoring and reclamation due to the watermarks-based activation
+ * condition.  Refer to below descriptions for the watermarks parameter for
+ * this.
+ */
+static bool enabled __read_mostly;
+module_param(enabled, bool, 0600);
+
+/*
+ * Time threshold for cold memory regions identification in microseconds.
+ *
+ * If a memory region is not accessed for this or longer time, DAMON_RECLAIM
+ * identifies the region as cold, and reclaims.  120 seconds by default.
+ */
+static unsigned long min_age __read_mostly = 120000000;
+module_param(min_age, ulong, 0600);
+
+/*
+ * Limit of time for trying the reclamation in milliseconds.
+ *
+ * DAMON_RECLAIM tries to use only up to this time within a time window
+ * (quota_reset_interval_ms) for trying reclamation of cold pages.  This can be
+ * used for limiting CPU consumption of DAMON_RECLAIM.  If the value is zero,
+ * the limit is disabled.
+ *
+ * 10 ms by default.
+ */
+static unsigned long quota_ms __read_mostly = 10;
+module_param(quota_ms, ulong, 0600);
+
+/*
+ * Limit of size of memory for the reclamation in bytes.
+ *
+ * DAMON_RECLAIM charges amount of memory which it tried to reclaim within a
+ * time window (quota_reset_interval_ms) and makes no more than this limit is
+ * tried.  This can be used for limiting consumption of CPU and IO.  If this
+ * value is zero, the limit is disabled.
+ *
+ * 128 MiB by default.
+ */
+static unsigned long quota_sz __read_mostly = 128 * 1024 * 1024;
+module_param(quota_sz, ulong, 0600);
+
+/*
+ * The time/size quota charge reset interval in milliseconds.
+ *
+ * The charge reset interval for the quota of time (quota_ms) and size
+ * (quota_sz).  That is, DAMON_RECLAIM does not try reclamation for more than
+ * quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
+ * milliseconds.
+ *
+ * 1 second by default.
+ */
+static unsigned long quota_reset_interval_ms __read_mostly = 1000;
+module_param(quota_reset_interval_ms, ulong, 0600);
+
+/*
+ * The watermarks check time interval in microseconds.
+ *
+ * Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
+ * enabled but inactive due to its watermarks rule.  5 seconds by default.
+ */
+static unsigned long wmarks_interval __read_mostly = 5000000;
+module_param(wmarks_interval, ulong, 0600);
+
+/*
+ * Free memory rate (per thousand) for the high watermark.
+ *
+ * If free memory of the system in bytes per thousand bytes is higher than
+ * this, DAMON_RECLAIM becomes inactive, so it does nothing but periodically
+ * checks the watermarks.  500 (50%) by default.
+ */
+static unsigned long wmarks_high __read_mostly = 500;
+module_param(wmarks_high, ulong, 0600);
+
+/*
+ * Free memory rate (per thousand) for the middle watermark.
+ *
+ * If free memory of the system in bytes per thousand bytes is between this and
+ * the low watermark, DAMON_RECLAIM becomes active, so starts the monitoring
+ * and the reclaiming.  400 (40%) by default.
+ */
+static unsigned long wmarks_mid __read_mostly = 400;
+module_param(wmarks_mid, ulong, 0600);
+
+/*
+ * Free memory rate (per thousand) for the low watermark.
+ *
+ * If free memory of the system in bytes per thousand bytes is lower than this,
+ * DAMON_RECLAIM becomes inactive, so it does nothing but periodically checks
+ * the watermarks.  In the case, the system falls back to the LRU-based page
+ * granularity reclamation logic.  200 (20%) by default.
+ */
+static unsigned long wmarks_low __read_mostly = 200;
+module_param(wmarks_low, ulong, 0600);
+
+/*
+ * Sampling interval for the monitoring in microseconds.
+ *
+ * The sampling interval of DAMON for the cold memory monitoring.  Please refer
+ * to the DAMON documentation for more detail.  5 ms by default.
+ */
+static unsigned long sample_interval __read_mostly = 5000;
+module_param(sample_interval, ulong, 0600);
+
+/*
+ * Aggregation interval for the monitoring in microseconds.
+ *
+ * The aggregation interval of DAMON for the cold memory monitoring.  Please
+ * refer to the DAMON documentation for more detail.  100 ms by default.
+ */
+static unsigned long aggr_interval __read_mostly = 100000;
+module_param(aggr_interval, ulong, 0600);
+
+/*
+ * Minimum number of monitoring regions.
+ *
+ * The minimal number of monitoring regions of DAMON for the cold memory
+ * monitoring.  This can be used to set lower-bound of the monitoring quality.
+ * But, setting this too high could result in increased monitoring overhead.
+ * Please refer to the DAMON documentation for more detail.  10 by default.
+ */
+static unsigned long min_nr_regions __read_mostly = 10;
+module_param(min_nr_regions, ulong, 0600);
+
+/*
+ * Maximum number of monitoring regions.
+ *
+ * The maximum number of monitoring regions of DAMON for the cold memory
+ * monitoring.  This can be used to set upper-bound of the monitoring overhead.
+ * However, setting this too low could result in bad monitoring quality.
+ * Please refer to the DAMON documentation for more detail.  1000 by default.
+ */
+static unsigned long max_nr_regions __read_mostly = 1000;
+module_param(max_nr_regions, ulong, 0600);
+
+/*
+ * Start of the target memory region in physical address.
+ *
+ * The start physical address of memory region that DAMON_RECLAIM will do work
+ * against.  By default, biggest System RAM is used as the region.
+ */
+static unsigned long monitor_region_start __read_mostly;
+module_param(monitor_region_start, ulong, 0600);
+
+/*
+ * End of the target memory region in physical address.
+ *
+ * The end physical address of memory region that DAMON_RECLAIM will do work
+ * against.  By default, biggest System RAM is used as the region.
+ */
+static unsigned long monitor_region_end __read_mostly;
+module_param(monitor_region_end, ulong, 0600);
+
+/*
+ * PID of the DAMON thread
+ *
+ * If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread.
+ * Else, -1.
+ */
+static int kdamond_pid __read_mostly = -1;
+module_param(kdamond_pid, int, 0400);
+
+static struct damon_ctx *ctx;
+static struct damon_target *target;
+
+struct damon_reclaim_ram_walk_arg {
+	unsigned long start;
+	unsigned long end;
+};
+
+static int walk_system_ram(struct resource *res, void *arg)
+{
+	struct damon_reclaim_ram_walk_arg *a = arg;
+
+	if (a->end - a->start < res->end - res->start) {
+		a->start = res->start;
+		a->end = res->end;
+	}
+	return 0;
+}
+
+/*
+ * Find biggest 'System RAM' resource and store its start and end address in
+ * @start and @end, respectively.  If no System RAM is found, returns false.
+ */
+static bool get_monitoring_region(unsigned long *start, unsigned long *end)
+{
+	struct damon_reclaim_ram_walk_arg arg = {};
+
+	walk_system_ram_res(0, ULONG_MAX, &arg, walk_system_ram);
+	if (arg.end <= arg.start)
+		return false;
+
+	*start = arg.start;
+	*end = arg.end;
+	return true;
+}
+
+static struct damos *damon_reclaim_new_scheme(void)
+{
+	struct damos_watermarks wmarks = {
+		.metric = DAMOS_WMARK_FREE_MEM_RATE,
+		.interval = wmarks_interval,
+		.high = wmarks_high,
+		.mid = wmarks_mid,
+		.low = wmarks_low,
+	};
+	struct damos_quota quota = {
+		/*
+		 * Do not try reclamation for more than quota_ms milliseconds
+		 * or quota_sz bytes within quota_reset_interval_ms.
+		 */
+		.ms = quota_ms,
+		.sz = quota_sz,
+		.reset_interval = quota_reset_interval_ms,
+		/* Within the quota, page out older regions first. */
+		.weight_sz = 0,
+		.weight_nr_accesses = 0,
+		.weight_age = 1
+	};
+	struct damos *scheme = damon_new_scheme(
+			/* Find regions having PAGE_SIZE or larger size */
+			PAGE_SIZE, ULONG_MAX,
+			/* and not accessed at all */
+			0, 0,
+			/* for min_age or more micro-seconds, and */
+			min_age / aggr_interval, UINT_MAX,
+			/* page out those, as soon as found */
+			DAMOS_PAGEOUT,
+			/* under the quota. */
+			&quota,
+			/* (De)activate this according to the watermarks. */
+			&wmarks);
+
+	return scheme;
+}
+
+static int damon_reclaim_turn(bool on)
+{
+	struct damon_region *region;
+	struct damos *scheme;
+	int err;
+
+	if (!on) {
+		err = damon_stop(&ctx, 1);
+		if (!err)
+			kdamond_pid = -1;
+		return err;
+	}
+
+	err = damon_set_attrs(ctx, sample_interval, aggr_interval, 0,
+			min_nr_regions, max_nr_regions);
+	if (err)
+		return err;
+
+	if (monitor_region_start > monitor_region_end)
+		return -EINVAL;
+	if (!monitor_region_start && !monitor_region_end &&
+			!get_monitoring_region(&monitor_region_start,
+				&monitor_region_end))
+		return -EINVAL;
+	/* DAMON will free this on its own when finish monitoring */
+	region = damon_new_region(monitor_region_start, monitor_region_end);
+	if (!region)
+		return -ENOMEM;
+	damon_add_region(region, target);
+
+	/* Will be freed by 'damon_set_schemes()' below */
+	scheme = damon_reclaim_new_scheme();
+	if (!scheme) {
+		err = -ENOMEM;
+		goto free_region_out;
+	}
+	err = damon_set_schemes(ctx, &scheme, 1);
+	if (err)
+		goto free_scheme_out;
+
+	err = damon_start(&ctx, 1);
+	if (!err) {
+		kdamond_pid = ctx->kdamond->pid;
+		return 0;
+	}
+
+free_scheme_out:
+	damon_destroy_scheme(scheme);
+free_region_out:
+	damon_destroy_region(region, target);
+	return err;
+}
+
+#define ENABLE_CHECK_INTERVAL_MS	1000
+static struct delayed_work damon_reclaim_timer;
+static void damon_reclaim_timer_fn(struct work_struct *work)
+{
+	static bool last_enabled;
+	bool now_enabled;
+
+	now_enabled = enabled;
+	if (last_enabled != now_enabled) {
+		if (!damon_reclaim_turn(now_enabled))
+			last_enabled = now_enabled;
+		else
+			enabled = last_enabled;
+	}
+
+	schedule_delayed_work(&damon_reclaim_timer,
+			msecs_to_jiffies(ENABLE_CHECK_INTERVAL_MS));
+}
+static DECLARE_DELAYED_WORK(damon_reclaim_timer, damon_reclaim_timer_fn);
+
+static int __init damon_reclaim_init(void)
+{
+	ctx = damon_new_ctx();
+	if (!ctx)
+		return -ENOMEM;
+
+	damon_pa_set_primitives(ctx);
+
+	/* 4242 means nothing but fun */
+	target = damon_new_target(4242);
+	if (!target) {
+		damon_destroy_ctx(ctx);
+		return -ENOMEM;
+	}
+	damon_add_target(ctx, target);
+
+	schedule_delayed_work(&damon_reclaim_timer, 0);
+	return 0;
+}
+
+module_init(damon_reclaim_init);
_
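
As a rough, illustrative aid (not part of the patch): with
DAMOS_WMARK_FREE_MEM_RATE, the metric evaluated by damos_wmark_metric_value()
is freeram * 1000 / totalram, so the default watermarks of 500/400/200 can be
read as free-memory fractions.  The user-space sketch below, which assumes a
hypothetical 16 GiB machine, just prints how that metric maps to the scheme's
state transitions.

#include <stdio.h>

int main(void)
{
	const unsigned long total_mib = 16 * 1024;	/* assumed 16 GiB machine */
	const unsigned long high = 500, mid = 400, low = 200;	/* module defaults */
	unsigned long free_mib;

	for (free_mib = total_mib; free_mib > 0; free_mib -= total_mib / 8) {
		/* Same formula as damos_wmark_metric_value(), in MiB units. */
		unsigned long metric = free_mib * 1000 / total_mib;
		const char *state;

		if (metric > high || metric < low)
			state = "inactive (too much or too little free memory)";
		else if (metric >= mid)
			state = "unchanged (hysteresis band)";
		else
			state = "active";
		printf("free %5lu MiB -> metric %4lu -> %s\n",
		       free_mib, metric, state);
	}
	return 0;
}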

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 253/262] Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM
  2021-11-05 20:34 incoming Andrew Morton
                   ` (251 preceding siblings ...)
  2021-11-05 20:47 ` [patch 252/262] mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 254/262] mm/damon: remove unnecessary variable initialization Andrew Morton
                   ` (8 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, amit, benh, corbet, david, dwmw, elver, foersleo, gthelen,
	Jonathan.Cameron, linux-mm, markubo, mm-commits, rientjes,
	shakeelb, shuah, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM

This commit adds an admin-guide document for DAMON-based Reclamation.

Link: https://lkml.kernel.org/r/20211019150731.16699-16-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/index.rst   |    1 
 Documentation/admin-guide/mm/damon/reclaim.rst |  235 +++++++++++++++
 2 files changed, 236 insertions(+)

--- a/Documentation/admin-guide/mm/damon/index.rst~documentation-admin-guide-mm-damon-add-a-document-for-damon_reclaim
+++ a/Documentation/admin-guide/mm/damon/index.rst
@@ -13,3 +13,4 @@ optimize those.
 
    start
    usage
+   reclaim
--- /dev/null
+++ a/Documentation/admin-guide/mm/damon/reclaim.rst
@@ -0,0 +1,235 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=======================
+DAMON-based Reclamation
+=======================
+
+DAMON-based Reclamation (DAMON_RECLAIM) is a static kernel module that is aimed
+to be used for proactive and lightweight reclamation under light memory
+pressure.  It doesn't aim to replace the LRU-list based page-granularity
+reclamation, but to be selectively used for different levels of memory pressure
+and requirements.
+
+Where Proactive Reclamation is Required?
+========================================
+
+On general memory over-committed systems, proactively reclaiming cold pages
+helps to save memory and reduce latency spikes incurred by the direct reclaim
+of the process or the CPU consumption of kswapd, while incurring only minimal
+performance degradation [1]_ [2]_ .
+
+Free Pages Reporting [3]_ based memory over-commit virtualization systems are
+a good example of such cases.  In those systems, the guest VMs report their
+free memory to the host, and the host reallocates the reported memory to other
+guests.  As a result, the memory of the systems is fully utilized.  However,
+the guests could be not so memory-frugal, mainly because some kernel subsystems
+and user-space applications are designed to use as much memory as available.
+Then, guests could report only a small amount of memory as free to the host,
+resulting in a memory utilization drop of the systems.  Running the proactive
+reclamation in guests could mitigate this problem.
+
+How It Works?
+=============
+
+DAMON_RECLAIM finds memory regions that were not accessed for a specific time
+duration and pages them out.  To avoid consuming too much CPU for the paging
+out operation, a speed limit can be configured.  Under the speed limit, it
+pages out the memory regions that were not accessed for a longer time first.
+System administrators can also configure under what situation this scheme
+should be automatically activated and deactivated, with three memory pressure
+watermarks.
+
+Interface: Module Parameters
+============================
+
+To use this feature, you should first ensure your system is running on a kernel
+that is built with ``CONFIG_DAMON_RECLAIM=y``.
+
+To let sysadmins enable or disable it and tune for the given system,
+DAMON_RECLAIM utilizes module parameters.  That is, you can put
+``damon_reclaim.<parameter>=<value>`` on the kernel boot command line or write
+proper values to ``/sys/module/damon_reclaim/parameters/<parameter>`` files.
+
+Note that the parameter values except ``enabled`` are applied only when
+DAMON_RECLAIM starts.  Therefore, if you want to apply new parameter values at
+runtime and DAMON_RECLAIM is already enabled, you should disable and re-enable
+it via the ``enabled`` parameter file.  Writing of the new values to the proper
+parameter files should be done before the re-enablement.
+
+Below is the description of each parameter.
+
+enabled
+-------
+
+Enable or disable DAMON_RECLAIM.
+
+You can enable DAMON_RECLAIM by setting the value of this parameter as ``Y``.
+Setting it as ``N`` disables DAMON_RECLAIM.  Note that DAMON_RECLAIM could do
+no real monitoring and reclamation due to the watermarks-based activation
+condition.  Refer to below descriptions for the watermarks parameter for this.
+
+min_age
+-------
+
+Time threshold for cold memory regions identification in microseconds.
+
+If a memory region is not accessed for this or longer time, DAMON_RECLAIM
+identifies the region as cold, and reclaims it.
+
+120 seconds by default.
+
+quota_ms
+--------
+
+Limit of time for the reclamation in milliseconds.
+
+DAMON_RECLAIM tries to use only up to this time within a time window
+(quota_reset_interval_ms) for trying reclamation of cold pages.  This can be
+used for limiting CPU consumption of DAMON_RECLAIM.  If the value is zero, the
+limit is disabled.
+
+10 ms by default.
+
+quota_sz
+--------
+
+Limit of size of memory for the reclamation in bytes.
+
+DAMON_RECLAIM charges amount of memory which it tried to reclaim within a time
+window (quota_reset_interval_ms) and makes no more than this limit is tried.
+This can be used for limiting consumption of CPU and IO.  If this value is
+zero, the limit is disabled.
+
+128 MiB by default.
+
+quota_reset_interval_ms
+-----------------------
+
+The time/size quota charge reset interval in milliseconds.
+
+The charge reset interval for the quota of time (quota_ms) and size
+(quota_sz).  That is, DAMON_RECLAIM does not try reclamation for more than
+quota_ms milliseconds or quota_sz bytes within quota_reset_interval_ms
+milliseconds.
+
+1 second by default.
+
+wmarks_interval
+---------------
+
+Minimal time to wait before checking the watermarks, when DAMON_RECLAIM is
+enabled but inactive due to its watermarks rule.
+
+wmarks_high
+-----------
+
+Free memory rate (per thousand) for the high watermark.
+
+If free memory of the system in bytes per thousand bytes is higher than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but only periodically checks
+the watermarks.
+
+wmarks_mid
+----------
+
+Free memory rate (per thousand) for the middle watermark.
+
+If free memory of the system in bytes per thousand bytes is between this and
+the low watermark, DAMON_RECLAIM becomes active, so starts the monitoring and
+the reclaiming.
+
+wmarks_low
+----------
+
+Free memory rate (per thousand) for the low watermark.
+
+If free memory of the system in bytes per thousand bytes is lower than this,
+DAMON_RECLAIM becomes inactive, so it does nothing but periodically checks the
+watermarks.  In the case, the system falls back to the LRU-list based page
+granularity reclamation logic.
+
+sample_interval
+---------------
+
+Sampling interval for the monitoring in microseconds.
+
+The sampling interval of DAMON for the cold memory monitoring.  Please refer to
+the DAMON documentation (:doc:`usage`) for more detail.
+
+aggr_interval
+-------------
+
+Aggregation interval for the monitoring in microseconds.
+
+The aggregation interval of DAMON for the cold memory monitoring.  Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+min_nr_regions
+--------------
+
+Minimum number of monitoring regions.
+
+The minimal number of monitoring regions of DAMON for the cold memory
+monitoring.  This can be used to set lower-bound of the monitoring quality.
+But, setting this too high could result in increased monitoring overhead.
+Please refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+max_nr_regions
+--------------
+
+Maximum number of monitoring regions.
+
+The maximum number of monitoring regions of DAMON for the cold memory
+monitoring.  This can be used to set upper-bound of the monitoring overhead.
+However, setting this too low could result in bad monitoring quality.  Please
+refer to the DAMON documentation (:doc:`usage`) for more detail.
+
+monitor_region_start
+--------------------
+
+Start of target memory region in physical address.
+
+The start physical address of memory region that DAMON_RECLAIM will do work
+against.  That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaims.  By default, biggest System RAM is used as the region.
+
+monitor_region_end
+------------------
+
+End of target memory region in physical address.
+
+The end physical address of memory region that DAMON_RECLAIM will do work
+against.  That is, DAMON_RECLAIM will find cold memory regions in this region
+and reclaims.  By default, biggest System RAM is used as the region.
+
+kdamond_pid
+-----------
+
+PID of the DAMON thread.
+
+If DAMON_RECLAIM is enabled, this becomes the PID of the worker thread.  Else,
+-1.
+
+Example
+=======
+
+The example runtime commands below make DAMON_RECLAIM find memory regions that
+are not accessed for 30 seconds or more and page them out.  The reclamation is
+limited to at most 1 GiB per second to avoid DAMON_RECLAIM consuming too much
+CPU time for the paging out operation.  It also asks DAMON_RECLAIM to do
+nothing if the system's free memory rate is higher than 50%, but to start the
+real work if it becomes lower than 40%.  If DAMON_RECLAIM doesn't make progress
+and therefore the free memory rate becomes lower than 20%, it asks
+DAMON_RECLAIM to do nothing again, so that we can fall back to the LRU-list
+based page granularity reclamation. ::
+
+    # cd /sys/module/damon_reclaim/parameters
+    # echo 30000000 > min_age
+    # echo $((1 * 1024 * 1024 * 1024)) > quota_sz
+    # echo 1000 > quota_reset_interval_ms
+    # echo 500 > wmarks_high
+    # echo 400 > wmarks_mid
+    # echo 200 > wmarks_low
+    # echo Y > enabled
+
+.. [1] https://research.google/pubs/pub48551/
+.. [2] https://lwn.net/Articles/787611/
+.. [3] https://www.kernel.org/doc/html/latest/vm/free_page_reporting.html
_
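
As an illustrative aside (not part of the patch), the example parameter values
above translate into the scheme built by damon_reclaim_new_scheme() in
mm/damon/reclaim.c roughly as computed by the user-space sketch below.  The
100 ms aggregation interval is the module's default; the other numbers are the
values echoed in the example.

#include <stdio.h>

int main(void)
{
	const unsigned long min_age_us = 30000000;	/* echoed min_age */
	const unsigned long aggr_interval_us = 100000;	/* module default */
	const unsigned long quota_sz = 1UL << 30;	/* echoed quota_sz: 1 GiB */
	const unsigned long reset_interval_ms = 1000;	/* echoed reset interval */

	/* reclaim.c passes min_age / aggr_interval as the scheme's min age. */
	printf("min_age_region: %lu aggregation intervals\n",
	       min_age_us / aggr_interval_us);
	/* The size quota per reset interval bounds the reclamation speed. */
	printf("max reclaim speed: %lu MiB/s\n",
	       (quota_sz >> 20) * 1000 / reset_interval_ms);
	return 0;
}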

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 254/262] mm/damon: remove unnecessary variable initialization
  2021-11-05 20:34 incoming Andrew Morton
                   ` (252 preceding siblings ...)
  2021-11-05 20:48 ` [patch 253/262] Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 255/262] mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on Andrew Morton
                   ` (7 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sj, torvalds, xhao

From: Xin Hao <xhao@linux.alibaba.com>
Subject: mm/damon: remove unnecessary variable initialization

Patch series "mm/damon: Fix some small bugs", v4.


This patch (of 2):

In 'damon_va_apply_three_regions', there is no need to initialize the
variable 'i' to 0.

Link: https://lkml.kernel.org/r/b7df8d3dad0943a37e01f60c441b1968b2b20354.1634720326.git.xhao@linux.alibaba.com
Link: https://lkml.kernel.org/r/cover.1634720326.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/vaddr.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/damon/vaddr.c~mm-damon-remove-unnecessary-variable-initialization
+++ a/mm/damon/vaddr.c
@@ -306,7 +306,7 @@ static void damon_va_apply_three_regions
 		struct damon_addr_range bregions[3])
 {
 	struct damon_region *r, *next;
-	unsigned int i = 0;
+	unsigned int i;
 
 	/* Remove regions which are not in the three big regions now */
 	damon_for_each_region_safe(r, next, t) {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 255/262] mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on
  2021-11-05 20:34 incoming Andrew Morton
                   ` (253 preceding siblings ...)
  2021-11-05 20:48 ` [patch 254/262] mm/damon: remove unnecessary variable initialization Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 256/262] Docs/admin-guide/mm/damon/start: fix wrong example commands Andrew Morton
                   ` (6 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sj, torvalds, xhao

From: Xin Hao <xhao@linux.alibaba.com>
Subject: mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on

When the ctx->adaptive_targets list is empty, I did some tests on the
monitor_on interface like this.

    # cat /sys/kernel/debug/damon/target_ids
    #
    # echo on > /sys/kernel/debug/damon/monitor_on
    # damon: kdamond (5390) starts

Though the ctx->adaptive_targets list is empty, kthread_run() is still
called and the kdamond.x thread is still created, which is meaningless.

So this commit adds a check in 'dbgfs_monitor_on_write': if the
ctx->adaptive_targets list is empty, return -EINVAL.

Link: https://lkml.kernel.org/r/0a60a6e8ec9d71989e0848a4dc3311996ca3b5d4.1634720326.git.xhao@linux.alibaba.com
Signed-off-by: Xin Hao <xhao@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    1 +
 mm/damon/core.c       |    5 +++++
 mm/damon/dbgfs.c      |   15 ++++++++++++---
 3 files changed, 18 insertions(+), 3 deletions(-)

--- a/include/linux/damon.h~mm-damon-dbgfs-add-adaptive_targets-list-check-before-enable-monitor_on
+++ a/include/linux/damon.h
@@ -440,6 +440,7 @@ void damon_destroy_scheme(struct damos *
 
 struct damon_target *damon_new_target(unsigned long id);
 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t);
+bool damon_targets_empty(struct damon_ctx *ctx);
 void damon_free_target(struct damon_target *t);
 void damon_destroy_target(struct damon_target *t);
 unsigned int damon_nr_regions(struct damon_target *t);
--- a/mm/damon/core.c~mm-damon-dbgfs-add-adaptive_targets-list-check-before-enable-monitor_on
+++ a/mm/damon/core.c
@@ -180,6 +180,11 @@ void damon_add_target(struct damon_ctx *
 	list_add_tail(&t->list, &ctx->adaptive_targets);
 }
 
+bool damon_targets_empty(struct damon_ctx *ctx)
+{
+	return list_empty(&ctx->adaptive_targets);
+}
+
 static void damon_del_target(struct damon_target *t)
 {
 	list_del(&t->list);
--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-add-adaptive_targets-list-check-before-enable-monitor_on
+++ a/mm/damon/dbgfs.c
@@ -878,12 +878,21 @@ static ssize_t dbgfs_monitor_on_write(st
 		return -EINVAL;
 	}
 
-	if (!strncmp(kbuf, "on", count))
+	if (!strncmp(kbuf, "on", count)) {
+		int i;
+
+		for (i = 0; i < dbgfs_nr_ctxs; i++) {
+			if (damon_targets_empty(dbgfs_ctxs[i])) {
+				kfree(kbuf);
+				return -EINVAL;
+			}
+		}
 		ret = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs);
-	else if (!strncmp(kbuf, "off", count))
+	} else if (!strncmp(kbuf, "off", count)) {
 		ret = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs);
-	else
+	} else {
 		ret = -EINVAL;
+	}
 
 	if (!ret)
 		ret = count;
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 256/262] Docs/admin-guide/mm/damon/start: fix wrong example commands
  2021-11-05 20:34 incoming Andrew Morton
                   ` (254 preceding siblings ...)
  2021-11-05 20:48 ` [patch 255/262] mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 257/262] Docs/admin-guide/mm/damon/start: fix a wrong link Andrew Morton
                   ` (5 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, peterx, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/damon/start: fix wrong example commands

Patch series "Fix trivial nits in Documentation/admin-guide/mm".

This patchset fixes trivial nits in admin guide documents for DAMON and
pagemap.


This patch (of 4):

Some of the example commands in the DAMON 'Getting Started' guide are
outdated, missing sudo, or just wrong.  This commit fixes those.

Link: https://lkml.kernel.org/r/20211022090311.3856-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/start.rst |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/Documentation/admin-guide/mm/damon/start.rst~docs-admin-guide-mm-damon-start-fix-wrong-example-commands
+++ a/Documentation/admin-guide/mm/damon/start.rst
@@ -19,7 +19,7 @@ your workload. ::
     # mount -t debugfs none /sys/kernel/debug/
     # git clone https://github.com/awslabs/damo
     # ./damo/damo record $(pidof <your workload>)
-    # ./damo/damo report heat --plot_ascii
+    # ./damo/damo report heats --heatmap stdout
 
 The final command draws the access heatmap of ``<your workload>``.  The heatmap
 shows which memory region (x-axis) is accessed when (y-axis) and how frequently
@@ -94,9 +94,9 @@ Visualizing Recorded Patterns
 The following three commands visualize the recorded access patterns and save
 the results as separate image files. ::
 
-    $ damo report heats --heatmap access_pattern_heatmap.png
-    $ damo report wss --range 0 101 1 --plot wss_dist.png
-    $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
+    $ sudo damo report heats --heatmap access_pattern_heatmap.png
+    $ sudo damo report wss --range 0 101 1 --plot wss_dist.png
+    $ sudo damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
 
 - ``access_pattern_heatmap.png`` will visualize the data access pattern in a
   heatmap, showing which memory region (y-axis) got accessed when (x-axis)
@@ -115,9 +115,9 @@ Data Access Pattern Aware Memory Managem
 Below three commands make every memory region of size >=4K that doesn't
 accessed for >=60 seconds in your workload to be swapped out. ::
 
-    $ echo "#min-size max-size min-acc max-acc min-age max-age action" > scheme
-    $ echo "4K        max      0       0       60s     max     pageout" >> scheme
-    $ damo schemes -c my_thp_scheme <pid of your workload>
+    $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
+    $ echo "4K        max      0       0       60s     max     pageout" >> test_scheme
+    $ damo schemes -c test_scheme <pid of your workload>
 
 .. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
 .. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 257/262] Docs/admin-guide/mm/damon/start: fix a wrong link
  2021-11-05 20:34 incoming Andrew Morton
                   ` (255 preceding siblings ...)
  2021-11-05 20:48 ` [patch 256/262] Docs/admin-guide/mm/damon/start: fix wrong example commands Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 258/262] Docs/admin-guide/mm/damon/start: simplify the content Andrew Morton
                   ` (4 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, peterx, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/damon/start: fix a wrong link

The 'Getting Started' document of DAMON provides a link to DAMON's user
interface document while talking about the detailed usage of its user space
tool.  This commit fixes the link so that it points to the tool's usage
document.

Link: https://lkml.kernel.org/r/20211022090311.3856-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/start.rst |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/Documentation/admin-guide/mm/damon/start.rst~docs-admin-guide-mm-damon-start-fix-a-wrong-link
+++ a/Documentation/admin-guide/mm/damon/start.rst
@@ -6,7 +6,9 @@ Getting Started
 
 This document briefly describes how you can use DAMON by demonstrating its
 default user space tool.  Please note that this document describes only a part
-of its features for brevity.  Please refer to :doc:`usage` for more details.
+of its features for brevity.  Please refer to the usage `doc
+<https://github.com/awslabs/damo/blob/next/USAGE.md>`_ of the tool for more
+details.
 
 
 TL; DR
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 258/262] Docs/admin-guide/mm/damon/start: simplify the content
  2021-11-05 20:34 incoming Andrew Morton
                   ` (256 preceding siblings ...)
  2021-11-05 20:48 ` [patch 257/262] Docs/admin-guide/mm/damon/start: fix a wrong link Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 259/262] Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Andrew Morton
                   ` (3 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, peterx, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/damon/start: simplify the content

Information in the 'TL; DR' section of 'Getting Started' is duplicated in
other parts of the doc.  It also asks readers to visit the access pattern
visualizations gallery web site to see the results of example visualization
commands, while the users of the commands could simply use the terminal
output.

To make the doc simple, this commit removes the duplicated 'TL; DR'
section and replaces the visualization example commands with versions
using terminal outputs.

Link: https://lkml.kernel.org/r/20211022090311.3856-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/start.rst |  111 +++++++++--------
 1 file changed, 59 insertions(+), 52 deletions(-)

--- a/Documentation/admin-guide/mm/damon/start.rst~docs-admin-guide-mm-damon-start-simplify-the-content
+++ a/Documentation/admin-guide/mm/damon/start.rst
@@ -11,38 +11,6 @@ of its features for brevity.  Please ref
 details.
 
 
-TL; DR
-======
-
-Follow the commands below to monitor and visualize the memory access pattern of
-your workload. ::
-
-    # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
-    # mount -t debugfs none /sys/kernel/debug/
-    # git clone https://github.com/awslabs/damo
-    # ./damo/damo record $(pidof <your workload>)
-    # ./damo/damo report heats --heatmap stdout
-
-The final command draws the access heatmap of ``<your workload>``.  The heatmap
-shows which memory region (x-axis) is accessed when (y-axis) and how frequently
-(number; the higher the more accesses have been observed). ::
-
-    111111111111111111111111111111111111111111111111111111110000
-    111121111111111111111111111111211111111111111111111111110000
-    000000000000000000000000000000000000000000000000001555552000
-    000000000000000000000000000000000000000000000222223555552000
-    000000000000000000000000000000000000000011111677775000000000
-    000000000000000000000000000000000000000488888000000000000000
-    000000000000000000000000000000000177888400000000000000000000
-    000000000000000000000000000046666522222100000000000000000000
-    000000000000000000000014444344444300000000000000000000000000
-    000000000000000002222245555510000000000000000000000000000000
-    # access_frequency:  0  1  2  3  4  5  6  7  8  9
-    # x-axis: space (140286319947776-140286426374096: 101.496 MiB)
-    # y-axis: time (605442256436361-605479951866441: 37.695430s)
-    # resolution: 60x10 (1.692 MiB and 3.770s for each character)
-
-
 Prerequisites
 =============
 
@@ -93,22 +61,66 @@ pattern in the ``damon.data`` file.
 Visualizing Recorded Patterns
 =============================
 
-The following three commands visualize the recorded access patterns and save
-the results as separate image files. ::
-
-    $ sudo damo report heats --heatmap access_pattern_heatmap.png
-    $ sudo damo report wss --range 0 101 1 --plot wss_dist.png
-    $ sudo damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
-
-- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
-  heatmap, showing which memory region (y-axis) got accessed when (x-axis)
-  and how frequently (color).
-- ``wss_dist.png`` will show the distribution of the working set size.
-- ``wss_chron_change.png`` will show how the working set size has
-  chronologically changed.
+You can visualize the pattern in a heatmap, showing which memory region
+(x-axis) got accessed when (y-axis) and how frequently (number).::
 
-You can view the visualizations of this example workload at [1]_.
-Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
+    $ sudo damo report heats --heatmap stdout
+    22222222222222222222222222222222222222211111111111111111111111111111111111111100
+    44444444444444444444444444444444444444434444444444444444444444444444444444443200
+    44444444444444444444444444444444444444433444444444444444444444444444444444444200
+    33333333333333333333333333333333333333344555555555555555555555555555555555555200
+    33333333333333333333333333333333333344444444444444444444444444444444444444444200
+    22222222222222222222222222222222222223355555555555555555555555555555555555555200
+    00000000000000000000000000000000000000288888888888888888888888888888888888888400
+    00000000000000000000000000000000000000288888888888888888888888888888888888888400
+    33333333333333333333333333333333333333355555555555555555555555555555555555555200
+    88888888888888888888888888888888888888600000000000000000000000000000000000000000
+    88888888888888888888888888888888888888600000000000000000000000000000000000000000
+    33333333333333333333333333333333333333444444444444444444444444444444444444443200
+    00000000000000000000000000000000000000288888888888888888888888888888888888888400
+    [...]
+    # access_frequency:  0  1  2  3  4  5  6  7  8  9
+    # x-axis: space (139728247021568-139728453431248: 196.848 MiB)
+    # y-axis: time (15256597248362-15326899978162: 1 m 10.303 s)
+    # resolution: 80x40 (2.461 MiB and 1.758 s for each character)
+
+You can also visualize the distribution of the working set size, sorted by the
+size.::
+
+    $ sudo damo report wss --range 0 101 10
+    # <percentile> <wss>
+    # target_id     18446632103789443072
+    # avr:  107.708 MiB
+      0             0 B |                                                           |
+     10      95.328 MiB |****************************                               |
+     20      95.332 MiB |****************************                               |
+     30      95.340 MiB |****************************                               |
+     40      95.387 MiB |****************************                               |
+     50      95.387 MiB |****************************                               |
+     60      95.398 MiB |****************************                               |
+     70      95.398 MiB |****************************                               |
+     80      95.504 MiB |****************************                               |
+     90     190.703 MiB |*********************************************************  |
+    100     196.875 MiB |***********************************************************|
+
+Using ``--sortby`` option with the above command, you can show how the working
+set size has chronologically changed.::
+
+    $ sudo damo report wss --range 0 101 10 --sortby time
+    # <percentile> <wss>
+    # target_id     18446632103789443072
+    # avr:  107.708 MiB
+      0       3.051 MiB |                                                           |
+     10     190.703 MiB |***********************************************************|
+     20      95.336 MiB |*****************************                              |
+     30      95.328 MiB |*****************************                              |
+     40      95.387 MiB |*****************************                              |
+     50      95.332 MiB |*****************************                              |
+     60      95.320 MiB |*****************************                              |
+     70      95.398 MiB |*****************************                              |
+     80      95.398 MiB |*****************************                              |
+     90      95.340 MiB |*****************************                              |
+    100      95.398 MiB |*****************************                              |
 
 
 Data Access Pattern Aware Memory Management
@@ -120,8 +132,3 @@ accessed for >=60 seconds in your worklo
     $ echo "#min-size max-size min-acc max-acc min-age max-age action" > test_scheme
     $ echo "4K        max      0       0       60s     max     pageout" >> test_scheme
     $ damo schemes -c test_scheme <pid of your workload>
-
-.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
-.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
-.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
-.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 259/262] Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions
  2021-11-05 20:34 incoming Andrew Morton
                   ` (257 preceding siblings ...)
  2021-11-05 20:48 ` [patch 258/262] Docs/admin-guide/mm/damon/start: simplify the content Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 260/262] mm/damon: simplify stop mechanism Andrew Morton
                   ` (2 subsequent siblings)
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, corbet, linux-mm, mm-commits, peterx, sj, torvalds

From: SeongJae Park <sj@kernel.org>
Subject: Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions

Some descriptions of page flags in 'pagemap.rst' are written under the
assumption of a non-rst renderer, which respects every new line, as below:

    7 - SLAB
       page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
       When compound page is used, SLUB/SLQB will only set this flag on the head

Because rst ignores the new line between the first sentence and the second
sentence, the resulting html looks a little bit weird, as below.

    7 - SLAB
    page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator When
                                                                       ^
    compound page is used, SLUB/SLQB will only set this flag on the head
    page; SLOB will not flag it at all.

This commit makes it more natural and consistent with other parts in the
rendered version.

Link: https://lkml.kernel.org/r/20211022090311.3856-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/pagemap.rst |   53 ++++++++++-----------
 1 file changed, 27 insertions(+), 26 deletions(-)

--- a/Documentation/admin-guide/mm/pagemap.rst~docs-admin-guide-mm-pagemap-wordsmith-page-flags-descriptions
+++ a/Documentation/admin-guide/mm/pagemap.rst
@@ -90,13 +90,14 @@ Short descriptions to the page flags
 ====================================
 
 0 - LOCKED
-   page is being locked for exclusive access, e.g. by undergoing read/write IO
+   The page is being locked for exclusive access, e.g. by undergoing read/write
+   IO.
 7 - SLAB
-   page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator
+   The page is managed by the SLAB/SLOB/SLUB/SLQB kernel memory allocator.
    When compound page is used, SLUB/SLQB will only set this flag on the head
    page; SLOB will not flag it at all.
 10 - BUDDY
-    a free memory block managed by the buddy system allocator
+    A free memory block managed by the buddy system allocator.
     The buddy system organizes free memory in blocks of various orders.
     An order N block has 2^N physically contiguous pages, with the BUDDY flag
     set for and _only_ for the first page.
@@ -112,65 +113,65 @@ Short descriptions to the page flags
 16 - COMPOUND_TAIL
     A compound page tail (see description above).
 17 - HUGE
-    this is an integral part of a HugeTLB page
+    This is an integral part of a HugeTLB page.
 19 - HWPOISON
-    hardware detected memory corruption on this page: don't touch the data!
+    Hardware detected memory corruption on this page: don't touch the data!
 20 - NOPAGE
-    no page frame exists at the requested address
+    No page frame exists at the requested address.
 21 - KSM
-    identical memory pages dynamically shared between one or more processes
+    Identical memory pages dynamically shared between one or more processes.
 22 - THP
-    contiguous pages which construct transparent hugepages
+    Contiguous pages which construct transparent hugepages.
 23 - OFFLINE
-    page is logically offline
+    The page is logically offline.
 24 - ZERO_PAGE
-    zero page for pfn_zero or huge_zero page
+    Zero page for pfn_zero or huge_zero page.
 25 - IDLE
-    page has not been accessed since it was marked idle (see
+    The page has not been accessed since it was marked idle (see
     :ref:`Documentation/admin-guide/mm/idle_page_tracking.rst <idle_page_tracking>`).
     Note that this flag may be stale in case the page was accessed via
     a PTE. To make sure the flag is up-to-date one has to read
     ``/sys/kernel/mm/page_idle/bitmap`` first.
 26 - PGTABLE
-    page is in use as a page table
+    The page is in use as a page table.
 
 IO related page flags
 ---------------------
 
 1 - ERROR
-   IO error occurred
+   IO error occurred.
 3 - UPTODATE
-   page has up-to-date data
+   The page has up-to-date data.
    ie. for file backed page: (in-memory data revision >= on-disk one)
 4 - DIRTY
-   page has been written to, hence contains new data
+   The page has been written to, hence contains new data.
    i.e. for file backed page: (in-memory data revision >  on-disk one)
 8 - WRITEBACK
-   page is being synced to disk
+   The page is being synced to disk.
 
 LRU related page flags
 ----------------------
 
 5 - LRU
-   page is in one of the LRU lists
+   The page is in one of the LRU lists.
 6 - ACTIVE
-   page is in the active LRU list
+   The page is in the active LRU list.
 18 - UNEVICTABLE
-   page is in the unevictable (non-)LRU list It is somehow pinned and
+   The page is in the unevictable (non-)LRU list It is somehow pinned and
    not a candidate for LRU page reclaims, e.g. ramfs pages,
-   shmctl(SHM_LOCK) and mlock() memory segments
+   shmctl(SHM_LOCK) and mlock() memory segments.
 2 - REFERENCED
-   page has been referenced since last LRU list enqueue/requeue
+   The page has been referenced since last LRU list enqueue/requeue.
 9 - RECLAIM
-   page will be reclaimed soon after its pageout IO completed
+   The page will be reclaimed soon after its pageout IO completed.
 11 - MMAP
-   a memory mapped page
+   A memory mapped page.
 12 - ANON
-   a memory mapped page that is not part of a file
+   A memory mapped page that is not part of a file.
 13 - SWAPCACHE
-   page is mapped to swap space, i.e. has an associated swap entry
+   The page is mapped to swap space, i.e. has an associated swap entry.
 14 - SWAPBACKED
-   page is backed by swap/RAM
+   The page is backed by swap/RAM.
 
 The page-types tool in the tools/vm directory can be used to query the
 above flags.
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 260/262] mm/damon: simplify stop mechanism
  2021-11-05 20:34 incoming Andrew Morton
                   ` (258 preceding siblings ...)
  2021-11-05 20:48 ` [patch 259/262] Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 261/262] mm/damon: fix a few spelling mistakes in comments and a pr_debug message Andrew Morton
  2021-11-05 20:48 ` [patch 262/262] mm/damon: remove return value from before_terminate callback Andrew Morton
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, changbin.du, linux-mm, mm-commits, sj, torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: mm/damon: simplify stop mechanism

A kernel thread can be stopped gracefully with kthread_stop(), so we don't
need the extra 'kdamond_stop' flag.  And to make sure the task struct is not
freed while we are still accessing it, take a reference to it before
termination.
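
For reference, here is a minimal sketch of the kthread_stop() pattern this
relies on (simplified names, not the literal DAMON code): the stopping side
pins the task_struct before calling kthread_stop() so it cannot be freed
underneath it, and the kthread polls kthread_should_stop() in its main loop.

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/sched/task.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
	/* kthread_stop() makes kthread_should_stop() return true. */
	while (!kthread_should_stop())
		usleep_range(1000, 2000);	/* placeholder for real work */
	return 0;	/* returned to the caller of kthread_stop() */
}

static int worker_start(void)
{
	worker = kthread_run(worker_fn, NULL, "worker");
	if (IS_ERR(worker)) {
		int err = PTR_ERR(worker);

		worker = NULL;
		return err;
	}
	return 0;
}

static void worker_stop(void)
{
	struct task_struct *tsk = worker;

	if (!tsk)
		return;
	get_task_struct(tsk);	/* keep the task_struct valid across the stop */
	kthread_stop(tsk);	/* wakes the thread, waits for worker_fn() to return */
	put_task_struct(tsk);
	worker = NULL;
}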

Link: https://lkml.kernel.org/r/20211027130517.4404-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    1 
 mm/damon/core.c       |   51 +++++++++++-----------------------------
 2 files changed, 15 insertions(+), 37 deletions(-)

--- a/include/linux/damon.h~mm-damon-simplify-stop-mechanism
+++ a/include/linux/damon.h
@@ -381,7 +381,6 @@ struct damon_ctx {
 
 /* public: */
 	struct task_struct *kdamond;
-	bool kdamond_stop;
 	struct mutex kdamond_lock;
 
 	struct damon_primitive primitive;
--- a/mm/damon/core.c~mm-damon-simplify-stop-mechanism
+++ a/mm/damon/core.c
@@ -390,17 +390,6 @@ static unsigned long damon_region_sz_lim
 	return sz;
 }
 
-static bool damon_kdamond_running(struct damon_ctx *ctx)
-{
-	bool running;
-
-	mutex_lock(&ctx->kdamond_lock);
-	running = ctx->kdamond != NULL;
-	mutex_unlock(&ctx->kdamond_lock);
-
-	return running;
-}
-
 static int kdamond_fn(void *data);
 
 /*
@@ -418,7 +407,6 @@ static int __damon_start(struct damon_ct
 	mutex_lock(&ctx->kdamond_lock);
 	if (!ctx->kdamond) {
 		err = 0;
-		ctx->kdamond_stop = false;
 		ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
 				nr_running_ctxs);
 		if (IS_ERR(ctx->kdamond)) {
@@ -474,13 +462,15 @@ int damon_start(struct damon_ctx **ctxs,
  */
 static int __damon_stop(struct damon_ctx *ctx)
 {
+	struct task_struct *tsk;
+
 	mutex_lock(&ctx->kdamond_lock);
-	if (ctx->kdamond) {
-		ctx->kdamond_stop = true;
+	tsk = ctx->kdamond;
+	if (tsk) {
+		get_task_struct(tsk);
 		mutex_unlock(&ctx->kdamond_lock);
-		while (damon_kdamond_running(ctx))
-			usleep_range(ctx->sample_interval,
-					ctx->sample_interval * 2);
+		kthread_stop(tsk);
+		put_task_struct(tsk);
 		return 0;
 	}
 	mutex_unlock(&ctx->kdamond_lock);
@@ -925,12 +915,8 @@ static bool kdamond_need_update_primitiv
 static bool kdamond_need_stop(struct damon_ctx *ctx)
 {
 	struct damon_target *t;
-	bool stop;
 
-	mutex_lock(&ctx->kdamond_lock);
-	stop = ctx->kdamond_stop;
-	mutex_unlock(&ctx->kdamond_lock);
-	if (stop)
+	if (kthread_should_stop())
 		return true;
 
 	if (!ctx->primitive.target_valid)
@@ -1021,13 +1007,6 @@ static int kdamond_wait_activation(struc
 	return -EBUSY;
 }
 
-static void set_kdamond_stop(struct damon_ctx *ctx)
-{
-	mutex_lock(&ctx->kdamond_lock);
-	ctx->kdamond_stop = true;
-	mutex_unlock(&ctx->kdamond_lock);
-}
-
 /*
  * The monitoring daemon that runs as a kernel thread
  */
@@ -1038,17 +1017,18 @@ static int kdamond_fn(void *data)
 	struct damon_region *r, *next;
 	unsigned int max_nr_accesses = 0;
 	unsigned long sz_limit = 0;
+	bool done = false;
 
 	pr_debug("kdamond (%d) starts\n", current->pid);
 
 	if (ctx->primitive.init)
 		ctx->primitive.init(ctx);
 	if (ctx->callback.before_start && ctx->callback.before_start(ctx))
-		set_kdamond_stop(ctx);
+		done = true;
 
 	sz_limit = damon_region_sz_limit(ctx);
 
-	while (!kdamond_need_stop(ctx)) {
+	while (!kdamond_need_stop(ctx) && !done) {
 		if (kdamond_wait_activation(ctx))
 			continue;
 
@@ -1056,7 +1036,7 @@ static int kdamond_fn(void *data)
 			ctx->primitive.prepare_access_checks(ctx);
 		if (ctx->callback.after_sampling &&
 				ctx->callback.after_sampling(ctx))
-			set_kdamond_stop(ctx);
+			done = true;
 
 		usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
 
@@ -1069,7 +1049,7 @@ static int kdamond_fn(void *data)
 					sz_limit);
 			if (ctx->callback.after_aggregation &&
 					ctx->callback.after_aggregation(ctx))
-				set_kdamond_stop(ctx);
+				done = true;
 			kdamond_apply_schemes(ctx);
 			kdamond_reset_aggregated(ctx);
 			kdamond_split_regions(ctx);
@@ -1088,9 +1068,8 @@ static int kdamond_fn(void *data)
 			damon_destroy_region(r, t);
 	}
 
-	if (ctx->callback.before_terminate &&
-			ctx->callback.before_terminate(ctx))
-		set_kdamond_stop(ctx);
+	if (ctx->callback.before_terminate)
+		ctx->callback.before_terminate(ctx);
 	if (ctx->primitive.cleanup)
 		ctx->primitive.cleanup(ctx);
 
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 261/262] mm/damon: fix a few spelling mistakes in comments and a pr_debug message
  2021-11-05 20:34 incoming Andrew Morton
                   ` (259 preceding siblings ...)
  2021-11-05 20:48 ` [patch 260/262] mm/damon: simplify stop mechanism Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  2021-11-05 20:48 ` [patch 262/262] mm/damon: remove return value from before_terminate callback Andrew Morton
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, colin.i.king, colin.i.king, linux-mm, mm-commits, sj, torvalds

From: Colin Ian King <colin.i.king@googlemail.com>
Subject: mm/damon: fix a few spelling mistakes in comments and a pr_debug message

There are a few spelling mistakes in the code. Fix these.

Link: https://lkml.kernel.org/r/20211028184157.614544-1-colin.i.king@gmail.com
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/core.c       |    2 +-
 mm/damon/dbgfs-test.h |    2 +-
 mm/damon/vaddr-test.h |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

--- a/mm/damon/core.c~mm-damon-fix-a-few-spelling-mistakes-in-comments-and-a-pr_debug-message
+++ a/mm/damon/core.c
@@ -959,7 +959,7 @@ static unsigned long damos_wmark_wait_us
 	/* higher than high watermark or lower than low watermark */
 	if (metric > scheme->wmarks.high || scheme->wmarks.low > metric) {
 		if (scheme->wmarks.activated)
-			pr_debug("inactivate a scheme (%d) for %s wmark\n",
+			pr_debug("deactivate a scheme (%d) for %s wmark\n",
 					scheme->action,
 					metric > scheme->wmarks.high ?
 					"high" : "low");
--- a/mm/damon/dbgfs-test.h~mm-damon-fix-a-few-spelling-mistakes-in-comments-and-a-pr_debug-message
+++ a/mm/damon/dbgfs-test.h
@@ -145,7 +145,7 @@ static void damon_dbgfs_test_set_init_re
 
 		KUNIT_EXPECT_STREQ(test, (char *)buf, expect);
 	}
-	/* Put invlid inputs and check the return error code */
+	/* Put invalid inputs and check the return error code */
 	for (i = 0; i < ARRAY_SIZE(invalid_inputs); i++) {
 		input = invalid_inputs[i];
 		pr_info("input: %s\n", input);
--- a/mm/damon/vaddr-test.h~mm-damon-fix-a-few-spelling-mistakes-in-comments-and-a-pr_debug-message
+++ a/mm/damon/vaddr-test.h
@@ -233,7 +233,7 @@ static void damon_test_apply_three_regio
  * and 70-100) has totally freed and mapped to different area (30-32 and
  * 65-68).  The target regions which were in the old second and third big
  * regions should now be removed and new target regions covering the new second
- * and third big regions should be crated.
+ * and third big regions should be created.
  */
 static void damon_test_apply_three_regions4(struct kunit *test)
 {
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* [patch 262/262] mm/damon: remove return value from before_terminate callback
  2021-11-05 20:34 incoming Andrew Morton
                   ` (260 preceding siblings ...)
  2021-11-05 20:48 ` [patch 261/262] mm/damon: fix a few spelling mistakes in comments and a pr_debug message Andrew Morton
@ 2021-11-05 20:48 ` Andrew Morton
  261 siblings, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-05 20:48 UTC (permalink / raw)
  To: akpm, changbin.du, linux-mm, mm-commits, sj, torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: mm/damon: remove return value from before_terminate callback

Since the return value of the 'before_terminate' callback is never used,
change it to return void.

Link: https://lkml.kernel.org/r/20211029005023.8895-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    2 +-
 mm/damon/dbgfs.c      |    5 ++---
 2 files changed, 3 insertions(+), 4 deletions(-)

--- a/include/linux/damon.h~mm-damon-remove-return-value-from-before_terminate-callback
+++ a/include/linux/damon.h
@@ -322,7 +322,7 @@ struct damon_callback {
 	int (*before_start)(struct damon_ctx *context);
 	int (*after_sampling)(struct damon_ctx *context);
 	int (*after_aggregation)(struct damon_ctx *context);
-	int (*before_terminate)(struct damon_ctx *context);
+	void (*before_terminate)(struct damon_ctx *context);
 };
 
 /**
--- a/mm/damon/dbgfs.c~mm-damon-remove-return-value-from-before_terminate-callback
+++ a/mm/damon/dbgfs.c
@@ -645,18 +645,17 @@ static void dbgfs_fill_ctx_dir(struct de
 		debugfs_create_file(file_names[i], 0600, dir, ctx, fops[i]);
 }
 
-static int dbgfs_before_terminate(struct damon_ctx *ctx)
+static void dbgfs_before_terminate(struct damon_ctx *ctx)
 {
 	struct damon_target *t, *next;
 
 	if (!targetid_is_pid(ctx))
-		return 0;
+		return;
 
 	damon_for_each_target_safe(t, next, ctx) {
 		put_pid((struct pid *)t->id);
 		damon_destroy_target(t);
 	}
-	return 0;
 }
 
 static struct damon_ctx *dbgfs_new_ctx(void)
_

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables
  2021-11-05 20:38 ` [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables Andrew Morton
@ 2021-11-05 20:57   ` Nadav Amit
  2021-11-06 18:54     ` Linus Torvalds
  0 siblings, 1 reply; 278+ messages in thread
From: Nadav Amit @ 2021-11-05 20:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Andrew Cooper, Dave Hansen, Linux-MM,
	Andy Lutomirski, mm-commits, Nick Piggin, Peter Zijlstra,
	Thomas Gleixner, Linus Torvalds, Will Deacon, Yu Zhao,
	Hugh Dickins


> On Nov 5, 2021, at 1:38 PM, Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> From: Nadav Amit <namit@vmware.com>
> Subject: mm/memory.c: use correct VMA flags when freeing page-tables
> 
> Consistent use of the mmu_gather interface requires a call to
> tlb_start_vma() and tlb_end_vma() for each VMA.  free_pgtables() does not
> follow this pattern.
> 
> Certain architectures need tlb_start_vma() to be called in order for
> tlb_update_vma_flags() to update the VMA flags (tlb->vma_exec and
> tlb->vma_huge), which are later used for the proper TLB flush to be
> issued.  Since tlb_start_vma() is not called, this can lead to the wrong
> VMA flags being used when the flush is performed.
> 
> Specifically, the munmap syscall would call unmap_region(), which unmaps
> the VMAs and then frees the page-tables.  A flush is needed after the
> page-tables are removed to prevent page-walk caches from holding stale
> entries, but this flush would use the VMA flags of the last VMA that was
> flushed.  This does not appear to be right.
> 
> Use tlb_start_vma() and tlb_end_vma() to prevent this from happening. 
> This might lead to unnecessary calls to flush_cache_range() on certain
> arch's.  If needed, a new flag can be added to mmu_gather to indicate that
> the flush is not needed.

Hugh correctly indicated that I made a silly bug, and this patch is not
helping.

Nothing would explode, but I assumed the patch would be dropped for me
to submit v2.

I’ll send a fix to this fix instead unless it is dropped.


^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-05 20:42 ` [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested Andrew Morton
@ 2021-11-05 21:02   ` Matthew Wilcox
  2021-11-06 20:49     ` Linus Torvalds
  0 siblings, 1 reply; 278+ messages in thread
From: Matthew Wilcox @ 2021-11-05 21:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: adilger.kernel, corbet, david, djwong, hannes, linux-mm, mgorman,
	mhocko, mm-commits, neilb, riel, torvalds, tytso, vbabka

On Fri, Nov 05, 2021 at 01:42:25PM -0700, Andrew Morton wrote:
> --- a/mm/filemap.c~mm-vmscan-throttle-reclaim-until-some-writeback-completes-if-congested
> +++ a/mm/filemap.c
> @@ -1612,6 +1612,7 @@ void end_page_writeback(struct page *pag
>  
>  	smp_mb__after_atomic();
>  	wake_up_page(page, PG_writeback);
> +	acct_reclaim_writeback(page);
>  	put_page(page);
>  }
>  EXPORT_SYMBOL(end_page_writeback);

hmm?  I think you based this on some older version of Linus' tree that didn't
have folios.  This fixup patch was against an older fixup patch that you did, 
but maybe it's enough for Linus to apply ...

diff --git a/mm/filemap.c b/mm/filemap.c
index 6844c9816a86..daa0e23a6ee6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1607,7 +1607,7 @@ void folio_end_writeback(struct folio *folio)
 
 	smp_mb__after_atomic();
 	folio_wake(folio, PG_writeback);
-	acct_reclaim_writeback(folio_page(folio, 0));
+	acct_reclaim_writeback(folio);
 	folio_put(folio);
 }
 EXPORT_SYMBOL(folio_end_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index 632c55c5a075..3b79a5c9427a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,15 +41,15 @@ static inline void *folio_raw_mapping(struct folio *folio)
 	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
-void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
 						int nr_throttled);
-static inline void acct_reclaim_writeback(struct page *page)
+static inline void acct_reclaim_writeback(struct folio *folio)
 {
-	pg_data_t *pgdat = page_pgdat(page);
+	pg_data_t *pgdat = folio_pgdat(folio);
 	int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
 
 	if (nr_throttled)
-		__acct_reclaim_writeback(pgdat, page, nr_throttled);
+		__acct_reclaim_writeback(pgdat, folio, nr_throttled);
 }
 
 static inline void wake_throttle_isolated(pg_data_t *pgdat)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59c07ee4220d..fb9584641ac7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1085,12 +1085,12 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
  * pages to clean. If enough pages have been cleaned since throttling
  * started then wakeup the throttled tasks.
  */
-void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
 							int nr_throttled)
 {
 	unsigned long nr_written;
 
-	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+	node_stat_add_folio(folio, NR_THROTTLED_WRITTEN);
 
 	/*
 	 * This is an inaccurate read as the per-cpu deltas may not


^ permalink raw reply related	[flat|nested] 278+ messages in thread

* Re: [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-11-05 20:38 ` [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable Andrew Morton
@ 2021-11-06  4:29   ` Andy Lutomirski
  2021-11-06 19:10     ` Linus Torvalds
  0 siblings, 1 reply; 278+ messages in thread
From: Andy Lutomirski @ 2021-11-06  4:29 UTC (permalink / raw)
  To: Andrew Morton, anton, Benjamin Herrenschmidt, linux-mm,
	mm-commits, Nicholas Piggin, paulus, Randy Dunlap,
	Linus Torvalds, Peter Zijlstra (Intel)

On Fri, Nov 5, 2021, at 1:38 PM, Andrew Morton wrote:
> From: Nicholas Piggin <npiggin@gmail.com>
> Subject: lazy tlb: allow lazy tlb mm refcounting to be configurable
>
> Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm
> when it is context switched.  This can be disabled by architectures that
> don't require this refcounting if they clean up lazy tlb mms when the last
> refcount is dropped.  Currently this is always enabled, which is what
> existing code does, so the patch is effectively a no-op.
>
> Rename rq->prev_mm to rq->prev_lazy_mm, because that's what it is.

Still nacked by me.  Since I seem to have been doing a poor job of explaining my issues with this patch, I'll explain with code:

commit 54b675d9b28d9a56289d06a813250472bc621f40
Author: Andy Lutomirski <luto@kernel.org>
Date:   Fri Nov 5 21:20:47 2021 -0700

    [HACK] demonstrate lazy tlb issues

diff --git a/arch/Kconfig b/arch/Kconfig
index cca27f1b5d0e..19f273642d8f 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -442,6 +442,7 @@ config ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 config MMU_LAZY_TLB_REFCOUNT
 	def_bool y
 	depends on !MMU_LAZY_TLB_SHOOTDOWN
+	depends on !X86
 
 # This option allows MMU_LAZY_TLB_REFCOUNT=n. It ensures no CPUs are using an
 # mm as a lazy tlb beyond its last reference count, by shooting down these
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 25dd795497e8..c5a0c1e92524 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4902,6 +4902,13 @@ context_switch(struct rq *rq, struct task_struct *prev,
 	 */
 	arch_start_context_switch(prev);
 
+	/*
+	 * Sanity check: if something went wrong and the previous mm was
+	 * freed while we were still using it, KASAN might not notice
+	 * without help.
+	 */
+	kasan_check_byte(prev->active_mm);
+
 	/*
 	 * kernel -> kernel   lazy + transfer active
 	 *   user -> kernel   lazy + mmgrab_lazy_tlb() active

Build this with KASAN for x86 and try to boot it.  It splats left and right.

The issue is that the !MMU_LAZY_TLB_REFCOUNT mode, while safe under certain select circumstances (maybe -- I'm still not quite convinced), cheats and ignores the fact that the scheduler itself maintains a pointer to the old mm.

On x86, on bare metal, we *already* don't access lazy mms after the process is gone because the pagetable freeing process shoots down the lazy mm, so we are compliant with all the supposed preconditions of this new mode.  But the scheduler itself still has this nonsense active_mm pointer, and, if anyone ever tries to do anything with it (e.g. the above hack to force kasan to validate it), it all blows up.

On top of this, the whole refcount-me-maybe mode seems incredibly fragile, and I don't think the kernel really benefits from having a set of refcount helpers that may or may not keep the supposedly refcounted object alive depending on config.  And the mere fact that my patch appears to work as long as kasan isn't in play should be a pretty good indicator that this whole thing is not terribly robust.

So I think there are a few credible choices:

1. Find an alternative solution that gets the performance we want without dangling references.

2. Make the MMU_LAZY_TLB_REFCOUNT mode genuinely safe.  This means literally ifdeffing out active_mm so it can't dangle.  Doing that cleanly will be a lot of nasty arch work.

I again apologize that my series is taking so long, although I think it's finally getting into decent shape.  I still need to deal with the scs mess (that's new), finish tidying up kthread, and make sure hotplug is good.  But all this is because this is really hairy code and I'm trying to do it right.

If anyone wants to help, help is welcome.  Otherwise, I really do intend to get it all the way done soon.

--Andy

^ permalink raw reply related	[flat|nested] 278+ messages in thread

* Re: [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables
  2021-11-05 20:57   ` Nadav Amit
@ 2021-11-06 18:54     ` Linus Torvalds
  0 siblings, 0 replies; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 18:54 UTC (permalink / raw)
  To: Nadav Amit
  Cc: Andrew Morton, Andrea Arcangeli, Andrew Cooper, Dave Hansen,
	Linux-MM, Andy Lutomirski, mm-commits, Nick Piggin,
	Peter Zijlstra, Thomas Gleixner, Will Deacon, Yu Zhao,
	Hugh Dickins

On Fri, Nov 5, 2021 at 1:57 PM Nadav Amit <namit@vmware.com> wrote:
>
> Hugh correctly indicated that I made a silly bug, and this patch is not
> helping.
>
> Nothing would explode, but I assumed the patch would be dropped for me
> to submit v2.
>
> I’ll send a fix to this fix instead unless it is dropped.

I've dropped it.

                Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable
  2021-11-06  4:29   ` Andy Lutomirski
@ 2021-11-06 19:10     ` Linus Torvalds
  0 siblings, 0 replies; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 19:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Anton Blanchard, Benjamin Herrenschmidt, Linux-MM,
	mm-commits, Nicholas Piggin, Paul Mackerras, Randy Dunlap,
	Peter Zijlstra (Intel)

Dropped once again until people can agree on this all..

               Linus

On Fri, Nov 5, 2021 at 9:29 PM Andy Lutomirski <luto@kernel.org> wrote:
>
> So I think there are a few credible choices:
>
> 1. Find an alternative solution that gets the performance we want without dangling references.
>
> 2. Make the MMU_LAZY_TLB_REFCOUNT mode genuinely safe.  This means literally ifdeffing out active_mm so it can't dangle.  Doing that cleanly will be a lot of nasty arch work.
>
> I again apologize that my series is taking so long, although I think it's finally getting into decent shape.  I still need to deal with the scs mess (that's new), finish tidying up kthread, and make sure hotplug is good.  But all this is because this is really hairy code and I'm trying to do it right.
>
> If anyone wants to help, help is welcome.  Otherwise, I really do intend to get it all the way done soon.
>
> --Andy

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-05 21:02   ` Matthew Wilcox
@ 2021-11-06 20:49     ` Linus Torvalds
  2021-11-06 21:12       ` Linus Torvalds
  0 siblings, 1 reply; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 20:49 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Andreas Dilger, Jonathan Corbet, Dave Chinner,
	Darrick J. Wong, Johannes Weiner, Linux-MM, Mel Gorman,
	Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o, Vlastimil Babka

On Fri, Nov 5, 2021 at 2:05 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> hmm?  I think you based this on some older version of Linus' tree that didn't
> have folios.

Andrew these days actually maintains a base commit model exactly so
that he doesn't end up rebasing during development.

So the whole series is based on plain 5.15, and I'll take care of the
conflict resolution.

This workflow can result in more conflicts for me than what Andrew
used to do ("send against current linus tip"), but it means that when
conflicts happen, they get all the merge resolution help that git
gives you, and hopefully what gets tested (over the months that it can
be in -mm) is closer to what gets sent to me.

              Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 20:49     ` Linus Torvalds
@ 2021-11-06 21:12       ` Linus Torvalds
  2021-11-06 21:13         ` Vlastimil Babka
  2021-11-06 22:45         ` Matthew Wilcox
  0 siblings, 2 replies; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 21:12 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Andreas Dilger, Jonathan Corbet, Dave Chinner,
	Darrick J. Wong, Johannes Weiner, Linux-MM, Mel Gorman,
	Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o, Vlastimil Babka

On Sat, Nov 6, 2021 at 1:49 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This workflow can result in more conflicts for me than what Andrew
> used to do ("send against current linus tip"), but it means that when
> conflicts happen, they get all the merge resolution help that git
> gives you, and hopefully what gets tested (over the months that it can
> be in -mm) is closer to what gets sent to me.

.. and resolving the conflicts (none of which looked bad), I think
that part of the resolution ends up doing very similar things to your
fixup patch.

So it looks all good.

Famous last words.

                Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 21:12       ` Linus Torvalds
@ 2021-11-06 21:13         ` Vlastimil Babka
  2021-11-06 21:20           ` Andrew Morton
  2021-11-06 21:20           ` Linus Torvalds
  2021-11-06 22:45         ` Matthew Wilcox
  1 sibling, 2 replies; 278+ messages in thread
From: Vlastimil Babka @ 2021-11-06 21:13 UTC (permalink / raw)
  To: Linus Torvalds, Matthew Wilcox
  Cc: Andrew Morton, Andreas Dilger, Jonathan Corbet, Dave Chinner,
	Darrick J. Wong, Johannes Weiner, Linux-MM, Mel Gorman,
	Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o

On 11/6/21 22:12, Linus Torvalds wrote:
> On Sat, Nov 6, 2021 at 1:49 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> This workflow can result in more conflicts for me than what Andrew
>> used to do ("send against current linus tip"), but it means that when
>> conflicts happen, they get all the merge resolution help that git
>> gives you, and hopefully what gets tested (over the months that it can
>> be in -mm) is closer to what gets sent to me.
> 
> .. and resolving the conflicts (none of which looked bad), I think
> that part of the resolution ends up doing very similar things to your
> fixup patch.

If this needed resolution, didn't the resolution exist in -next already?

> So it looks all good.
> 
> Famous last words.
> 
>                 Linus
> 


^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 21:13         ` Vlastimil Babka
@ 2021-11-06 21:20           ` Andrew Morton
  2021-11-06 21:20           ` Linus Torvalds
  1 sibling, 0 replies; 278+ messages in thread
From: Andrew Morton @ 2021-11-06 21:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Linus Torvalds, Matthew Wilcox, Andreas Dilger, Jonathan Corbet,
	Dave Chinner, Darrick J. Wong, Johannes Weiner, Linux-MM,
	Mel Gorman, Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o

On Sat, 6 Nov 2021 22:13:34 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> On 11/6/21 22:12, Linus Torvalds wrote:
> > On Sat, Nov 6, 2021 at 1:49 PM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >> This workflow can result in more conflicts for me than what Andrew
> >> used to do ("send against current linus tip"), but it means that when
> >> conflicts happen, they get all the merge resolution help that git
> >> gives you, and hopefully what gets tested (over the months that it can
> >> be in -mm) is closer to what gets sent to me.
> > 
> > .. and resolving the conflicts (none of which looked bad), I think
> > that part of the resolution ends up doing very similar things to your
> > fixup patch.
> 
> If this needed resolution, didn't the resolution exist in -next already?

Yes, but I had it queued after linux-next.patch so it got lost in the
unholy mess that linux-next becomes during the merge window.

I'm still figuring this out.  In retrospect I should have moved this
patch "mm/vmscan: throttle reclaim until some writeback completes if
congested" to the post-linux-next section weeks ago, then waited for
the prerequisites to be merged into mainline.  That way the unaltered,
tested patch would have smoothly slotted in late in the merge window.

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 21:13         ` Vlastimil Babka
  2021-11-06 21:20           ` Andrew Morton
@ 2021-11-06 21:20           ` Linus Torvalds
  1 sibling, 0 replies; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 21:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Matthew Wilcox, Andrew Morton, Andreas Dilger, Jonathan Corbet,
	Dave Chinner, Darrick J. Wong, Johannes Weiner, Linux-MM,
	Mel Gorman, Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o

On Sat, Nov 6, 2021 at 2:13 PM Vlastimil Babka <vbabka@suse.cz> wrote:
>
> If this needed resolution, didn't the resolution exist in -next already?

Oh, I'm sure it was there in -next.

But I just always do my own merge resolution anyway because I want to
see what's going on.

I don't look at other people's resolutions, and I much prefer to
actually look at the history itself in order to actually understand
what the history and cause for the conflicts is (and what the proper
resolution was).

Of course, in many cases it's so trivial that there's not a lot to
"understand", and most merge conflicts by far are not the kind that
need a lot of thought.

But just to clarify: I do actually like seeing people send their
resolutions to me (possibly as an addendum to the pull request email,
or possibly as a separate "resolved" branch).

I don't use those to guide my resolution, but if there are any subtle
issues at all, I will then compare the end results to verify that they
agreed. Often any differences tend to be just whitespace or similar,
but it can be interesting to see when there are meaningful semantic
differences.

          Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 21:12       ` Linus Torvalds
  2021-11-06 21:13         ` Vlastimil Babka
@ 2021-11-06 22:45         ` Matthew Wilcox
  2021-11-06 23:26           ` Linus Torvalds
  1 sibling, 1 reply; 278+ messages in thread
From: Matthew Wilcox @ 2021-11-06 22:45 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Andreas Dilger, Jonathan Corbet, Dave Chinner,
	Darrick J. Wong, Johannes Weiner, Linux-MM, Mel Gorman,
	Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o, Vlastimil Babka

On Sat, Nov 06, 2021 at 02:12:02PM -0700, Linus Torvalds wrote:
> On Sat, Nov 6, 2021 at 1:49 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > This workflow can result in more conflicts for me than what Andrew
> > used to do ("send against current linus tip"), but it means that when
> > conflicts happen, they get all the merge resolution help that git
> > gives you, and hopefully what gets tested (over the months that it can
> > be in -mm) is closer to what gets sent to me.
> 
> .. and resolving the conflicts (none of which looked bad), I think
> that part of the resolution ends up doing very similar things to your
> fixup patch.

Reviewed what you did in the merge commit, looks good to me.  And I've
learned I need to run git log --cc instead of -p in order to see all
changes to a file.

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested
  2021-11-06 22:45         ` Matthew Wilcox
@ 2021-11-06 23:26           ` Linus Torvalds
  0 siblings, 0 replies; 278+ messages in thread
From: Linus Torvalds @ 2021-11-06 23:26 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Andreas Dilger, Jonathan Corbet, Dave Chinner,
	Darrick J. Wong, Johannes Weiner, Linux-MM, Mel Gorman,
	Michal Hocko, mm-commits, Neil Brown, Rik van Riel,
	Theodore Ts'o, Vlastimil Babka

On Sat, Nov 6, 2021 at 3:46 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Reviewed what you did in the merge commit, looks good to me.  And I've
> learned I need to run git log --cc instead of -p in order to see all
> changes to a file.

Heh.

If this is your first time using "--cc" (although it's the default for
"git show", so you may have used it without being aware of it), it's
very useful and powerful, but it's worth keeping in mind that it's
also a lot more limited than the merge-time "git diff" output.

At merge time, git has computed the shared state parenthood, and "git
diff" knows about not only the current state, but also the state of
both parents and the base state of the file (in a three-way merge kind
of sense, although with recursive merges the "base state" may be much
more complex than just a shared parent state).

But "git log --cc" (and related "show commit" kind of things, like
"git show" and friends) only sees the final result and the parent
information. The full common parent and base state isn't there
after-the-fact.

That means that "git log --cc" doesn't have quite as much information
to go by, and the "--cc" output can sometimes be a bit misleading.

In particular, if there was a conflict, and the resolution ended up
basically being "take one side where the conflict was", then "git log
--cc" will not show the conflict resolution as a conflict at all - it
will just think "ok, development was done on that branch, the other
side was irrelevant".

So "--cc" is very useful, and often shows that interesting sub-part of
the merge where there were conflicts. But it's definitely somewhat
limited, and can end up looking like there was no conflict at all even
when there was something.

           Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags
  2021-11-05 20:39 ` [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags Andrew Morton
@ 2021-11-08  9:25   ` Michal Hocko
  2021-11-08 17:15     ` Linus Torvalds
  0 siblings, 1 reply; 278+ messages in thread
From: Michal Hocko @ 2021-11-08  9:25 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, david, hch, idryomov, jlayton, linux-mm, mm-commits, neilb,
	torvalds, urezki

On Fri 05-11-21 13:39:50, Andrew Morton wrote:
> From: Michal Hocko <mhocko@suse.com>
> Subject: mm/vmalloc: be more explicit about supported gfp flags
> 
> The core of the vmalloc allocator __vmalloc_area_node doesn't say anything
> about gfp mask argument.  Not all gfp flags are supported though.  Be more
> explicit about constraints.
> 
> Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Cc: Neil Brown <neilb@suse.de>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: Uladzislau Rezki <urezki@gmail.com>
> Cc: Ilya Dryomov <idryomov@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

As already pointed out
http://lkml.kernel.org/r/YXE+hcodJ7zxeYA7@dhcp22.suse.cz this patch
cannot be applied without other patches from the same series.

> ---
> 
>  mm/vmalloc.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> --- a/mm/vmalloc.c~mm-vmalloc-be-more-explicit-about-supported-gfp-flags
> +++ a/mm/vmalloc.c
> @@ -2983,8 +2983,16 @@ fail:
>   * @caller:		  caller's return address
>   *
>   * Allocate enough pages to cover @size from the page level
> - * allocator with @gfp_mask flags.  Map them into contiguous
> - * kernel virtual space, using a pagetable protection of @prot.
> + * allocator with @gfp_mask flags. Please note that the full set of gfp
> + * flags are not supported. GFP_KERNEL would be a preferred allocation mode
> + * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
> + * supported. From the reclaim modifiers__GFP_DIRECT_RECLAIM is required (aka
> + * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
> + * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
> + * __GFP_NOWARN can be used to suppress error messages about failures.
> + *
> + * Map them into contiguous kernel virtual space, using a pagetable
> + * protection of @prot.
>   *
>   * Return: the address of the area or %NULL on failure
>   */
> _

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags
  2021-11-08  9:25   ` Michal Hocko
@ 2021-11-08 17:15     ` Linus Torvalds
  2021-11-08 17:30       ` Michal Hocko
  0 siblings, 1 reply; 278+ messages in thread
From: Linus Torvalds @ 2021-11-08 17:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Linux Kernel Mailing List, Andrew Morton, Dave Chinner,
	Christoph Hellwig, Ilya Dryomov, Jeff Layton, Linux-MM,
	mm-commits, Neil Brown, Uladzislau Rezki

On Mon, Nov 8, 2021 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote:
>
> As already pointed out
> http://lkml.kernel.org/r/YXE+hcodJ7zxeYA7@dhcp22.suse.cz this patch
> cannot be applied without other patches from the same series.

Hmm. I've taken it already.

Not a huge deal, since it's a comment change - and the code will
presumably eventually match the updated comment.

I guess it's a new thing that instead of stale comments, we have
future-proof ones ;)

              Linus

^ permalink raw reply	[flat|nested] 278+ messages in thread

* Re: [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags
  2021-11-08 17:15     ` Linus Torvalds
@ 2021-11-08 17:30       ` Michal Hocko
  0 siblings, 0 replies; 278+ messages in thread
From: Michal Hocko @ 2021-11-08 17:30 UTC (permalink / raw)
  To: linux-kernel
  Cc: Andrew Morton, Dave Chinner, Christoph Hellwig, Ilya Dryomov,
	Jeff Layton, Linux-MM, mm-commits, Neil Brown, Uladzislau Rezki

On Mon 08-11-21 09:15:04, Linus Torvalds wrote:
> On Mon, Nov 8, 2021 at 1:25 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > As already pointed out
> > http://lkml.kernel.org/r/YXE+hcodJ7zxeYA7@dhcp22.suse.cz this patch
> > cannot be applied without other patches from the same series.
> 
> Hmm. I've taken it already.
> 
> Not a huge deal, since it's a comment change - and the code will
> presumably eventually match the updated comment.

I plan to send the rest after the merge window.
 
> I guess it's a new thing that instead of stale comments, we have
> future-proof ones ;)

I just hope nobody gets confused about which flags are not supported yet,
e.g. GFP_NOFAIL or GFP_NO{FS,IO}. In both cases direct use could lead to
bugs.
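
For context, until those flags are fully supported the scoped gfp API is the
usual way to get NOFS semantics around such an allocation.  A minimal sketch,
with a hypothetical helper name:

#include <linux/sched/mm.h>
#include <linux/vmalloc.h>

static void *example_alloc_nofs(unsigned long size)
{
	unsigned int flags;
	void *p;

	/* All allocations in this scope implicitly drop __GFP_FS. */
	flags = memalloc_nofs_save();
	p = vmalloc(size);
	memalloc_nofs_restore(flags);

	return p;
}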
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 278+ messages in thread

end of thread, other threads:[~2021-11-08 17:30 UTC | newest]

Thread overview: 278+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-05 20:34 incoming Andrew Morton
2021-11-05 20:34 ` [patch 001/262] scripts/spelling.txt: add more spellings to spelling.txt Andrew Morton
2021-11-05 20:34 ` [patch 002/262] scripts/spelling.txt: fix "mistake" version of "synchronization" Andrew Morton
2021-11-05 20:34 ` [patch 003/262] scripts/decodecode: fix faulting instruction no print when opps.file is DOS format Andrew Morton
2021-11-05 20:34 ` [patch 004/262] ocfs2: fix handle refcount leak in two exception handling paths Andrew Morton
2021-11-05 20:34 ` [patch 005/262] ocfs2: cleanup journal init and shutdown Andrew Morton
2021-11-05 20:34 ` [patch 006/262] ocfs2/dlm: remove redundant assignment of variable ret Andrew Morton
2021-11-05 20:34 ` [patch 007/262] ocfs2: fix data corruption on truncate Andrew Morton
2021-11-05 20:34 ` [patch 008/262] ocfs2: do not zero pages beyond i_size Andrew Morton
2021-11-05 20:35 ` [patch 009/262] fs/posix_acl.c: avoid -Wempty-body warning Andrew Morton
2021-11-05 20:35 ` [patch 010/262] d_path: fix Kernel doc validator complaining Andrew Morton
2021-11-05 20:35 ` [patch 011/262] mm: move kvmalloc-related functions to slab.h Andrew Morton
2021-11-05 20:35 ` [patch 012/262] mm/slab.c: remove useless lines in enable_cpucache() Andrew Morton
2021-11-05 20:35 ` [patch 013/262] slub: add back check for free nonslab objects Andrew Morton
2021-11-05 20:35 ` [patch 014/262] mm, slub: change percpu partial accounting from objects to pages Andrew Morton
2021-11-05 20:35 ` [patch 015/262] mm/slub: increase default cpu partial list sizes Andrew Morton
2021-11-05 20:35 ` [patch 016/262] mm, slub: use prefetchw instead of prefetch Andrew Morton
2021-11-05 20:35 ` [patch 017/262] mm: disable NUMA_BALANCING_DEFAULT_ENABLED and TRANSPARENT_HUGEPAGE on PREEMPT_RT Andrew Morton
2021-11-05 20:35 ` [patch 018/262] mm: don't include <linux/dax.h> in <linux/mempolicy.h> Andrew Morton
2021-11-05 20:35 ` [patch 019/262] lib/stackdepot: include gfp.h Andrew Morton
2021-11-05 20:35 ` [patch 020/262] lib/stackdepot: remove unused function argument Andrew Morton
2021-11-05 20:35 ` [patch 021/262] lib/stackdepot: introduce __stack_depot_save() Andrew Morton
2021-11-05 20:35 ` [patch 022/262] kasan: common: provide can_alloc in kasan_save_stack() Andrew Morton
2021-11-05 20:35 ` [patch 023/262] kasan: generic: introduce kasan_record_aux_stack_noalloc() Andrew Morton
2021-11-05 20:35 ` [patch 024/262] workqueue, kasan: avoid alloc_pages() when recording stack Andrew Morton
2021-11-05 20:35 ` [patch 025/262] kasan: fix tag for large allocations when using CONFIG_SLAB Andrew Morton
2021-11-05 20:35 ` [patch 026/262] kasan: test: add memcpy test that avoids out-of-bounds write Andrew Morton
2021-11-05 20:35 ` [patch 027/262] mm/smaps: fix shmem pte hole swap calculation Andrew Morton
2021-11-05 20:36 ` [patch 028/262] mm/smaps: use vma->vm_pgoff directly when counting partial swap Andrew Morton
2021-11-05 20:36 ` [patch 029/262] mm/smaps: simplify shmem handling of pte holes Andrew Morton
2021-11-05 20:36 ` [patch 030/262] mm: debug_vm_pgtable: don't use __P000 directly Andrew Morton
2021-11-05 20:36 ` [patch 031/262] kasan: test: bypass __alloc_size checks Andrew Morton
2021-11-05 20:36 ` [patch 032/262] rapidio: avoid bogus __alloc_size warning Andrew Morton
2021-11-05 20:36 ` [patch 033/262] Compiler Attributes: add __alloc_size() for better bounds checking Andrew Morton
2021-11-05 20:36 ` [patch 034/262] slab: clean up function prototypes Andrew Morton
2021-11-05 20:36 ` [patch 035/262] slab: add __alloc_size attributes for better bounds checking Andrew Morton
2021-11-05 20:36 ` [patch 036/262] mm/kvmalloc: " Andrew Morton
2021-11-05 20:36 ` [patch 037/262] mm/vmalloc: " Andrew Morton
2021-11-05 20:36 ` [patch 038/262] mm/page_alloc: " Andrew Morton
2021-11-05 20:36 ` [patch 039/262] percpu: " Andrew Morton
2021-11-05 20:36 ` [patch 040/262] mm/page_ext.c: fix a comment Andrew Morton
2021-11-05 20:36 ` [patch 041/262] mm: stop filemap_read() from grabbing a superfluous page Andrew Morton
2021-11-05 20:36 ` [patch 042/262] mm: export bdi_unregister Andrew Morton
2021-11-05 20:36 ` [patch 043/262] mtd: call bdi_unregister explicitly Andrew Morton
2021-11-05 20:36 ` [patch 044/262] fs: explicitly unregister per-superblock BDIs Andrew Morton
2021-11-05 20:37 ` [patch 045/262] mm: don't automatically unregister bdis Andrew Morton
2021-11-05 20:37 ` [patch 046/262] mm: simplify bdi refcounting Andrew Morton
2021-11-05 20:37 ` [patch 047/262] mm: don't read i_size of inode unless we need it Andrew Morton
2021-11-05 20:37 ` [patch 048/262] mm/filemap.c: remove bogus VM_BUG_ON Andrew Morton
2021-11-05 20:37 ` [patch 049/262] mm: move more expensive part of XA setup out of mapping check Andrew Morton
2021-11-05 20:37 ` [patch 050/262] mm/gup: further simplify __gup_device_huge() Andrew Morton
2021-11-05 20:37 ` [patch 051/262] mm/swapfile: remove needless request_queue NULL pointer check Andrew Morton
2021-11-05 20:37 ` [patch 052/262] mm/swapfile: fix an integer overflow in swap_show() Andrew Morton
2021-11-05 20:37 ` [patch 053/262] mm: optimise put_pages_list() Andrew Morton
2021-11-05 20:37 ` [patch 054/262] mm/memcg: drop swp_entry_t* in mc_handle_file_pte() Andrew Morton
2021-11-05 20:37 ` [patch 055/262] memcg: flush stats only if updated Andrew Morton
2021-11-05 20:37 ` [patch 056/262] memcg: unify memcg stat flushing Andrew Morton
2021-11-05 20:37 ` [patch 057/262] mm/memcg: remove obsolete memcg_free_kmem() Andrew Morton
2021-11-05 20:37 ` [patch 058/262] mm/list_lru.c: prefer struct_size over open coded arithmetic Andrew Morton
2021-11-05 20:37 ` [patch 059/262] memcg, kmem: further deprecate kmem.limit_in_bytes Andrew Morton
2021-11-05 20:37 ` [patch 060/262] mm: list_lru: remove holding lru lock Andrew Morton
2021-11-05 20:37 ` [patch 061/262] mm: list_lru: fix the return value of list_lru_count_one() Andrew Morton
2021-11-05 20:37 ` [patch 062/262] mm: memcontrol: remove kmemcg_id reparenting Andrew Morton
2021-11-05 20:37 ` [patch 063/262] mm: memcontrol: remove the kmem states Andrew Morton
2021-11-05 20:37 ` [patch 064/262] mm: list_lru: only add memcg-aware lrus to the global lru list Andrew Morton
2021-11-05 20:38 ` [patch 065/262] mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks Andrew Morton
2021-11-05 20:38 ` [patch 066/262] mm, oom: do not trigger out_of_memory from the #PF Andrew Morton
2021-11-05 20:38 ` [patch 067/262] memcg: prohibit unconditional exceeding the limit of dying tasks Andrew Morton
2021-11-05 20:38 ` [patch 068/262] mm/mmap.c: fix a data race of mm->total_vm Andrew Morton
2021-11-05 20:38 ` [patch 069/262] mm: use __pfn_to_section() instead of open coding it Andrew Morton
2021-11-05 20:38 ` [patch 070/262] mm/memory.c: avoid unnecessary kernel/user pointer conversion Andrew Morton
2021-11-05 20:38 ` [patch 071/262] mm/memory.c: use correct VMA flags when freeing page-tables Andrew Morton
2021-11-05 20:57   ` Nadav Amit
2021-11-06 18:54     ` Linus Torvalds
2021-11-05 20:38 ` [patch 072/262] mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte Andrew Morton
2021-11-05 20:38 ` [patch 073/262] mm: clear vmf->pte after pte_unmap_same() returns Andrew Morton
2021-11-05 20:38 ` [patch 074/262] mm: drop first_index/last_index in zap_details Andrew Morton
2021-11-05 20:38 ` [patch 075/262] mm: add zap_skip_check_mapping() helper Andrew Morton
2021-11-05 20:38 ` [patch 076/262] mm: introduce pmd_install() helper Andrew Morton
2021-11-05 20:38 ` [patch 077/262] mm: remove redundant smp_wmb() Andrew Morton
2021-11-05 20:38 ` [patch 078/262] Documentation: update pagemap with shmem exceptions Andrew Morton
2021-11-05 20:38 ` [patch 079/262] lazy tlb: introduce lazy mm refcount helper functions Andrew Morton
2021-11-05 20:38 ` [patch 080/262] lazy tlb: allow lazy tlb mm refcounting to be configurable Andrew Morton
2021-11-06  4:29   ` Andy Lutomirski
2021-11-06 19:10     ` Linus Torvalds
2021-11-05 20:38 ` [patch 081/262] lazy tlb: shoot lazies, a non-refcounting lazy tlb option Andrew Morton
2021-11-05 20:38 ` [patch 082/262] powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN Andrew Morton
2021-11-05 20:39 ` [patch 083/262] memory: remove unused CONFIG_MEM_BLOCK_SIZE Andrew Morton
2021-11-05 20:39 ` [patch 084/262] mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey() Andrew Morton
2021-11-05 20:39 ` [patch 085/262] mm/mremap: don't account pages in vma_to_resize() Andrew Morton
2021-11-05 20:39 ` [patch 086/262] include/linux/io-mapping.h: remove fallback for writecombine Andrew Morton
2021-11-05 20:39 ` [patch 087/262] mm: mmap_lock: remove redundant newline in TP_printk Andrew Morton
2021-11-05 20:39 ` [patch 088/262] mm: mmap_lock: use DECLARE_EVENT_CLASS and DEFINE_EVENT_FN Andrew Morton
2021-11-05 20:39 ` [patch 089/262] mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node() Andrew Morton
2021-11-05 20:39 ` [patch 090/262] mm/vmalloc: don't allow VM_NO_GUARD on vmap() Andrew Morton
2021-11-05 20:39 ` [patch 091/262] mm/vmalloc: make show_numa_info() aware of hugepage mappings Andrew Morton
2021-11-05 20:39 ` [patch 092/262] mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo Andrew Morton
2021-11-05 20:39 ` [patch 093/262] mm/vmalloc: do not adjust the search size for alignment overhead Andrew Morton
2021-11-05 20:39 ` [patch 094/262] mm/vmalloc: check various alignments when debugging Andrew Morton
2021-11-05 20:39 ` [patch 095/262] vmalloc: back off when the current task is OOM-killed Andrew Morton
2021-11-05 20:39 ` [patch 096/262] vmalloc: choose a better start address in vm_area_register_early() Andrew Morton
2021-11-05 20:39 ` [patch 097/262] arm64: support page mapping percpu first chunk allocator Andrew Morton
2021-11-05 20:39 ` [patch 098/262] kasan: arm64: fix pcpu_page_first_chunk crash with KASAN_VMALLOC Andrew Morton
2021-11-05 20:39 ` [patch 099/262] mm/vmalloc: be more explicit about supported gfp flags Andrew Morton
2021-11-08  9:25   ` Michal Hocko
2021-11-08 17:15     ` Linus Torvalds
2021-11-08 17:30       ` Michal Hocko
2021-11-05 20:39 ` [patch 100/262] mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation Andrew Morton
2021-11-05 20:39 ` [patch 101/262] lib/test_vmalloc.c: use swap() to make code cleaner Andrew Morton
2021-11-05 20:39 ` [patch 102/262] mm/large system hash: avoid possible NULL deref in alloc_large_system_hash Andrew Morton
2021-11-05 20:40 ` [patch 103/262] mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order() Andrew Morton
2021-11-05 20:40 ` [patch 104/262] mm/page_alloc.c: simplify the code by using macro K() Andrew Morton
2021-11-05 20:40 ` [patch 105/262] mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk() Andrew Morton
2021-11-05 20:40 ` [patch 106/262] mm/page_alloc.c: use helper function zone_spans_pfn() Andrew Morton
2021-11-05 20:40 ` [patch 107/262] mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid] Andrew Morton
2021-11-05 20:40 ` [patch 108/262] mm/page_alloc: print node fallback order Andrew Morton
2021-11-05 20:40 ` [patch 109/262] mm/page_alloc: use accumulated load when building node fallback list Andrew Morton
2021-11-05 20:40 ` [patch 110/262] mm: move node_reclaim_distance to fix NUMA without SMP Andrew Morton
2021-11-05 20:40 ` [patch 111/262] mm: move fold_vm_numa_events() " Andrew Morton
2021-11-05 20:40 ` [patch 112/262] mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page() Andrew Morton
2021-11-05 20:40 ` [patch 113/262] mm/page_alloc: detect allocation forbidden by cpuset and bail out early Andrew Morton
2021-11-05 20:40 ` [patch 114/262] mm/page_alloc.c: show watermark_boost of zone in zoneinfo Andrew Morton
2021-11-05 20:40 ` [patch 115/262] mm: create a new system state and fix core_kernel_text() Andrew Morton
2021-11-05 20:40 ` [patch 116/262] mm: make generic arch_is_kernel_initmem_freed() do what it says Andrew Morton
2021-11-05 20:40 ` [patch 117/262] powerpc: use generic version of arch_is_kernel_initmem_freed() Andrew Morton
2021-11-05 20:40 ` [patch 118/262] s390: " Andrew Morton
2021-11-05 20:40 ` [patch 119/262] mm: page_alloc: use migrate_disable() in drain_local_pages_wq() Andrew Morton
2021-11-05 20:40 ` [patch 120/262] mm/page_alloc: use clamp() to simplify code Andrew Morton
2021-11-05 20:40 ` [patch 121/262] mm: fix data race in PagePoisoned() Andrew Morton
2021-11-05 20:41 ` [patch 122/262] mm/memory_failure: constify static mm_walk_ops Andrew Morton
2021-11-05 20:41 ` [patch 123/262] mm: filemap: coding style cleanup for filemap_map_pmd() Andrew Morton
2021-11-05 20:41 ` [patch 124/262] mm: hwpoison: refactor refcount check handling Andrew Morton
2021-11-05 20:41 ` [patch 125/262] mm: shmem: don't truncate page if memory failure happens Andrew Morton
2021-11-05 20:41 ` [patch 126/262] mm: hwpoison: handle non-anonymous THP correctly Andrew Morton
2021-11-05 20:41 ` [patch 127/262] mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h Andrew Morton
2021-11-05 20:41 ` [patch 128/262] hugetlb: add demote hugetlb page sysfs interfaces Andrew Morton
2021-11-05 20:41 ` [patch 129/262] mm/cma: add cma_pages_valid to determine if pages are in CMA Andrew Morton
2021-11-05 20:41 ` [patch 130/262] hugetlb: be sure to free demoted CMA pages to CMA Andrew Morton
2021-11-05 20:41 ` [patch 131/262] hugetlb: add demote bool to gigantic page routines Andrew Morton
2021-11-05 20:41 ` [patch 132/262] hugetlb: add hugetlb demote page support Andrew Morton
2021-11-05 20:41 ` [patch 133/262] mm: khugepaged: recalculate min_free_kbytes after stopping khugepaged Andrew Morton
2021-11-05 20:41 ` [patch 134/262] mm, hugepages: add mremap() support for hugepage backed vma Andrew Morton
2021-11-05 20:41 ` [patch 135/262] mm, hugepages: add hugetlb vma mremap() test Andrew Morton
2021-11-05 20:41 ` [patch 136/262] hugetlb: support node specified when using cma for gigantic hugepages Andrew Morton
2021-11-05 20:41 ` [patch 137/262] mm: remove duplicate include in hugepage-mremap.c Andrew Morton
2021-11-05 20:41 ` [patch 138/262] hugetlb_cgroup: remove unused hugetlb_cgroup_from_counter macro Andrew Morton
2021-11-05 20:41 ` [patch 139/262] hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments Andrew Morton
2021-11-05 20:41 ` [patch 140/262] hugetlb: remove redundant validation in has_same_uncharge_info() Andrew Morton
2021-11-05 20:42 ` [patch 141/262] hugetlb: remove redundant VM_BUG_ON() in add_reservation_in_range() Andrew Morton
2021-11-05 20:42 ` [patch 142/262] hugetlb: remove unnecessary set_page_count in prep_compound_gigantic_page Andrew Morton
2021-11-05 20:42 ` [patch 143/262] userfaultfd/selftests: don't rely on GNU extensions for random numbers Andrew Morton
2021-11-05 20:42 ` [patch 144/262] userfaultfd/selftests: fix feature support detection Andrew Morton
2021-11-05 20:42 ` [patch 145/262] userfaultfd/selftests: fix calculation of expected ioctls Andrew Morton
2021-11-05 20:42 ` [patch 146/262] mm/page_isolation: fix potential missing call to unset_migratetype_isolate() Andrew Morton
2021-11-05 20:42 ` [patch 147/262] mm/page_isolation: guard against possible putback unisolated page Andrew Morton
2021-11-05 20:42 ` [patch 148/262] mm/vmscan.c: fix -Wunused-but-set-variable warning Andrew Morton
2021-11-05 20:42 ` [patch 149/262] mm/vmscan: throttle reclaim until some writeback completes if congested Andrew Morton
2021-11-05 21:02   ` Matthew Wilcox
2021-11-06 20:49     ` Linus Torvalds
2021-11-06 21:12       ` Linus Torvalds
2021-11-06 21:13         ` Vlastimil Babka
2021-11-06 21:20           ` Andrew Morton
2021-11-06 21:20           ` Linus Torvalds
2021-11-06 22:45         ` Matthew Wilcox
2021-11-06 23:26           ` Linus Torvalds
2021-11-05 20:42 ` [patch 150/262] mm/vmscan: throttle reclaim and compaction when too many pages are isolated Andrew Morton
2021-11-05 20:42 ` [patch 151/262] mm/vmscan: throttle reclaim when no progress is being made Andrew Morton
2021-11-05 20:42 ` [patch 152/262] mm/writeback: throttle based on page writeback instead of congestion Andrew Morton
2021-11-05 20:42 ` [patch 153/262] mm/page_alloc: remove the throttling logic from the page allocator Andrew Morton
2021-11-05 20:42 ` [patch 154/262] mm/vmscan: centralise timeout values for reclaim_throttle Andrew Morton
2021-11-05 20:42 ` [patch 155/262] mm/vmscan: increase the timeout if page reclaim is not making progress Andrew Morton
2021-11-05 20:42 ` [patch 156/262] mm/vmscan: delay waking of tasks throttled on NOPROGRESS Andrew Morton
2021-11-05 20:42 ` [patch 157/262] mm/vmpressure: fix data-race with memcg->socket_pressure Andrew Morton
2021-11-05 20:42 ` [patch 158/262] tools/vm/page_owner_sort.c: count and sort by mem Andrew Morton
2021-11-05 20:42 ` [patch 159/262] tools/vm/page-types.c: make walk_file() aware of address range option Andrew Morton
2021-11-05 20:43 ` [patch 160/262] tools/vm/page-types.c: move show_file() to summary output Andrew Morton
2021-11-05 20:43 ` [patch 161/262] tools/vm/page-types.c: print file offset in hexadecimal Andrew Morton
2021-11-05 20:43 ` [patch 162/262] arch_numa: simplify numa_distance allocation Andrew Morton
2021-11-05 20:43 ` [patch 163/262] xen/x86: free_p2m_page: use memblock_free_ptr() to free a virtual pointer Andrew Morton
2021-11-05 20:43 ` [patch 164/262] memblock: drop memblock_free_early_nid() and memblock_free_early() Andrew Morton
2021-11-05 20:43 ` [patch 165/262] memblock: stop aliasing __memblock_free_late with memblock_free_late Andrew Morton
2021-11-05 20:43 ` [patch 166/262] memblock: rename memblock_free to memblock_phys_free Andrew Morton
2021-11-05 20:43 ` [patch 167/262] memblock: use memblock_free for freeing virtual pointers Andrew Morton
2021-11-05 20:43 ` [patch 168/262] mm: mark the OOM reaper thread as freezable Andrew Morton
2021-11-05 20:43 ` [patch 169/262] hugetlbfs: extend the definition of hugepages parameter to support node allocation Andrew Morton
2021-11-05 20:43 ` [patch 170/262] mm/migrate: de-duplicate migrate_reason strings Andrew Morton
2021-11-05 20:43 ` [patch 171/262] mm: migrate: make demotion knob depend on migration Andrew Morton
2021-11-05 20:43 ` [patch 172/262] selftests/vm/transhuge-stress: fix ram size thinko Andrew Morton
2021-11-05 20:43 ` [patch 173/262] mm, thp: lock filemap when truncating page cache Andrew Morton
2021-11-05 20:43 ` [patch 174/262] mm, thp: fix incorrect unmap behavior for private pages Andrew Morton
2021-11-05 20:43 ` [patch 175/262] mm/readahead.c: fix incorrect comments for get_init_ra_size Andrew Morton
2021-11-05 20:43 ` [patch 176/262] mm: nommu: kill arch_get_unmapped_area() Andrew Morton
2021-11-05 20:43 ` [patch 177/262] selftest/vm: fix ksm selftest to run with different NUMA topologies Andrew Morton
2021-11-05 20:43 ` [patch 178/262] selftests: vm: add KSM huge pages merging time test Andrew Morton
2021-11-05 20:43 ` [patch 179/262] mm/vmstat: annotate data race for zone->free_area[order].nr_free Andrew Morton
2021-11-05 20:44 ` [patch 180/262] mm: vmstat.c: make extfrag_index show more pretty Andrew Morton
2021-11-05 20:44 ` [patch 181/262] selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers Andrew Morton
2021-11-05 20:44 ` [patch 182/262] mm/memory_hotplug: add static qualifier for online_policy_to_str() Andrew Morton
2021-11-05 20:44 ` [patch 183/262] memory-hotplug.rst: fix two instances of "movablecore" that should be "movable_node" Andrew Morton
2021-11-05 20:44 ` [patch 184/262] memory-hotplug.rst: fix wrong /sys/module/memory_hotplug/parameters/ path Andrew Morton
2021-11-05 20:44 ` [patch 185/262] memory-hotplug.rst: document the "auto-movable" online policy Andrew Morton
2021-11-05 20:44 ` [patch 186/262] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG Andrew Morton
2021-11-05 20:44 ` [patch 187/262] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE Andrew Morton
2021-11-05 20:44 ` [patch 188/262] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit Andrew Morton
2021-11-05 20:44 ` [patch 189/262] mm/memory_hotplug: remove HIGHMEM leftovers Andrew Morton
2021-11-05 20:44 ` [patch 190/262] mm/memory_hotplug: remove stale function declarations Andrew Morton
2021-11-05 20:44 ` [patch 191/262] x86: remove memory hotplug support on X86_32 Andrew Morton
2021-11-05 20:44 ` [patch 192/262] mm/memory_hotplug: handle memblock_add_node() failures in add_memory_resource() Andrew Morton
2021-11-05 20:44 ` [patch 193/262] memblock: improve MEMBLOCK_HOTPLUG documentation Andrew Morton
2021-11-05 20:44 ` [patch 194/262] memblock: allow to specify flags with memblock_add_node() Andrew Morton
2021-11-05 20:44 ` [patch 195/262] memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
2021-11-05 20:44 ` [patch 196/262] mm/memory_hotplug: indicate MEMBLOCK_DRIVER_MANAGED with IORESOURCE_SYSRAM_DRIVER_MANAGED Andrew Morton
2021-11-05 20:45 ` [patch 197/262] mm/rmap.c: avoid double faults migrating device private pages Andrew Morton
2021-11-05 20:45 ` [patch 198/262] mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration() Andrew Morton
2021-11-05 20:45 ` [patch 199/262] mm/highmem: remove deprecated kmap_atomic Andrew Morton
2021-11-05 20:45 ` [patch 200/262] zram_drv: allow reclaim on bio_alloc Andrew Morton
2021-11-05 20:45 ` [patch 201/262] zram: off by one in read_block_state() Andrew Morton
2021-11-05 20:45 ` [patch 202/262] zram: introduce an aged idle interface Andrew Morton
2021-11-05 20:45 ` [patch 203/262] mm: remove HARDENED_USERCOPY_FALLBACK Andrew Morton
2021-11-05 20:45 ` [patch 204/262] include/linux/mm.h: move nr_free_buffer_pages from swap.h to mm.h Andrew Morton
2021-11-05 20:45 ` [patch 205/262] stacktrace: move filter_irq_stacks() to kernel/stacktrace.c Andrew Morton
2021-11-05 20:45 ` [patch 206/262] kfence: count unexpectedly skipped allocations Andrew Morton
2021-11-05 20:45 ` [patch 207/262] kfence: move saving stack trace of allocations into __kfence_alloc() Andrew Morton
2021-11-05 20:45 ` [patch 208/262] kfence: limit currently covered allocations when pool nearly full Andrew Morton
2021-11-05 20:45 ` [patch 209/262] kfence: add note to documentation about skipping covered allocations Andrew Morton
2021-11-05 20:45 ` [patch 210/262] kfence: test: use kunit_skip() to skip tests Andrew Morton
2021-11-05 20:45 ` [patch 211/262] kfence: shorten critical sections of alloc/free Andrew Morton
2021-11-05 20:45 ` [patch 212/262] kfence: always use static branches to guard kfence_alloc() Andrew Morton
2021-11-05 20:45 ` [patch 213/262] kfence: default to dynamic branch instead of static keys mode Andrew Morton
2021-11-05 20:45 ` [patch 214/262] mm/damon: grammar s/works/work/ Andrew Morton
2021-11-05 20:45 ` [patch 215/262] Documentation/vm: move user guides to admin-guide/mm/ Andrew Morton
2021-11-05 20:45 ` [patch 216/262] MAINTAINERS: update SeongJae's email address Andrew Morton
2021-11-05 20:46 ` [patch 217/262] docs/vm/damon: remove broken reference Andrew Morton
2021-11-05 20:46 ` [patch 218/262] include/linux/damon.h: fix kernel-doc comments for 'damon_callback' Andrew Morton
2021-11-05 20:46 ` [patch 219/262] mm/damon/core: print kdamond start log in debug mode only Andrew Morton
2021-11-05 20:46 ` [patch 220/262] mm/damon: remove unnecessary do_exit() from kdamond Andrew Morton
2021-11-05 20:46 ` [patch 221/262] mm/damon: needn't hold kdamond_lock to print pid of kdamond Andrew Morton
2021-11-05 20:46 ` [patch 222/262] mm/damon/core: nullify pointer ctx->kdamond with a NULL Andrew Morton
2021-11-05 20:46 ` [patch 223/262] mm/damon/core: account age of target regions Andrew Morton
2021-11-05 20:46 ` [patch 224/262] mm/damon/core: implement DAMON-based Operation Schemes (DAMOS) Andrew Morton
2021-11-05 20:46 ` [patch 225/262] mm/damon/vaddr: support DAMON-based Operation Schemes Andrew Morton
2021-11-05 20:46 ` [patch 226/262] mm/damon/dbgfs: " Andrew Morton
2021-11-05 20:46 ` [patch 227/262] mm/damon/schemes: implement statistics feature Andrew Morton
2021-11-05 20:46 ` [patch 228/262] selftests/damon: add 'schemes' debugfs tests Andrew Morton
2021-11-05 20:46 ` [patch 229/262] Docs/admin-guide/mm/damon: document DAMON-based Operation Schemes Andrew Morton
2021-11-05 20:46 ` [patch 230/262] mm/damon/dbgfs: allow users to set initial monitoring target regions Andrew Morton
2021-11-05 20:46 ` [patch 231/262] mm/damon/dbgfs-test: add a unit test case for 'init_regions' Andrew Morton
2021-11-05 20:46 ` [patch 232/262] Docs/admin-guide/mm/damon: document 'init_regions' feature Andrew Morton
2021-11-05 20:46 ` [patch 233/262] mm/damon/vaddr: separate commonly usable functions Andrew Morton
2021-11-05 20:46 ` [patch 234/262] mm/damon: implement primitives for physical address space monitoring Andrew Morton
2021-11-05 20:47 ` [patch 235/262] mm/damon/dbgfs: support physical memory monitoring Andrew Morton
2021-11-05 20:47 ` [patch 236/262] Docs/DAMON: document physical memory monitoring support Andrew Morton
2021-11-05 20:47 ` [patch 237/262] mm/damon/vaddr: constify static mm_walk_ops Andrew Morton
2021-11-05 20:47 ` [patch 238/262] mm/damon/dbgfs: remove unnecessary variables Andrew Morton
2021-11-05 20:47 ` [patch 239/262] mm/damon/paddr: support the pageout scheme Andrew Morton
2021-11-05 20:47 ` [patch 240/262] mm/damon/schemes: implement size quota for schemes application speed control Andrew Morton
2021-11-05 20:47 ` [patch 241/262] mm/damon/schemes: skip already charged targets and regions Andrew Morton
2021-11-05 20:47 ` [patch 242/262] mm/damon/schemes: implement time quota Andrew Morton
2021-11-05 20:47 ` [patch 243/262] mm/damon/dbgfs: support quotas of schemes Andrew Morton
2021-11-05 20:47 ` [patch 244/262] mm/damon/selftests: support schemes quotas Andrew Morton
2021-11-05 20:47 ` [patch 245/262] mm/damon/schemes: prioritize regions within the quotas Andrew Morton
2021-11-05 20:47 ` [patch 246/262] mm/damon/vaddr,paddr: support pageout prioritization Andrew Morton
2021-11-05 20:47 ` [patch 247/262] mm/damon/dbgfs: support prioritization weights Andrew Morton
2021-11-05 20:47 ` [patch 248/262] tools/selftests/damon: update for regions prioritization of schemes Andrew Morton
2021-11-05 20:47 ` [patch 249/262] mm/damon/schemes: activate schemes based on a watermarks mechanism Andrew Morton
2021-11-05 20:47 ` [patch 250/262] mm/damon/dbgfs: support watermarks Andrew Morton
2021-11-05 20:47 ` [patch 251/262] selftests/damon: " Andrew Morton
2021-11-05 20:47 ` [patch 252/262] mm/damon: introduce DAMON-based Reclamation (DAMON_RECLAIM) Andrew Morton
2021-11-05 20:48 ` [patch 253/262] Documentation/admin-guide/mm/damon: add a document for DAMON_RECLAIM Andrew Morton
2021-11-05 20:48 ` [patch 254/262] mm/damon: remove unnecessary variable initialization Andrew Morton
2021-11-05 20:48 ` [patch 255/262] mm/damon/dbgfs: add adaptive_targets list check before enable monitor_on Andrew Morton
2021-11-05 20:48 ` [patch 256/262] Docs/admin-guide/mm/damon/start: fix wrong example commands Andrew Morton
2021-11-05 20:48 ` [patch 257/262] Docs/admin-guide/mm/damon/start: fix a wrong link Andrew Morton
2021-11-05 20:48 ` [patch 258/262] Docs/admin-guide/mm/damon/start: simplify the content Andrew Morton
2021-11-05 20:48 ` [patch 259/262] Docs/admin-guide/mm/pagemap: wordsmith page flags descriptions Andrew Morton
2021-11-05 20:48 ` [patch 260/262] mm/damon: simplify stop mechanism Andrew Morton
2021-11-05 20:48 ` [patch 261/262] mm/damon: fix a few spelling mistakes in comments and a pr_debug message Andrew Morton
2021-11-05 20:48 ` [patch 262/262] mm/damon: remove return value from before_terminate callback Andrew Morton
