* incoming
From: Andrew Morton @ 2021-09-08  2:52 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: linux-mm, mm-commits

147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

Subsystems affected by this patch series:

  mm/slub
  mm/memory-hotplug
  mm/rmap
  mm/ioremap
  mm/highmem
  mm/cleanups
  mm/secretmem
  mm/kfence
  mm/damon
  alpha
  percpu
  procfs
  misc
  core-kernel
  MAINTAINERS
  lib
  bitops
  checkpatch
  epoll
  init
  nilfs2
  coredump
  fork
  pids
  criu
  kconfig
  selftests
  ipc
  mm/vmscan
  scripts

Subsystem: mm/slub

    Vlastimil Babka <vbabka@suse.cz>:
    Patch series "SLUB: reduce irq disabled scope and make it RT compatible", v6:
      mm, slub: don't call flush_all() from slab_debug_trace_open()
      mm, slub: allocate private object map for debugfs listings
      mm, slub: allocate private object map for validate_slab_cache()
      mm, slub: don't disable irq for debug_check_no_locks_freed()
      mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
      mm, slub: extract get_partial() from new_slab_objects()
      mm, slub: dissolve new_slab_objects() into ___slab_alloc()
      mm, slub: return slab page from get_partial() and set c->page afterwards
      mm, slub: restructure new page checks in ___slab_alloc()
      mm, slub: simplify kmem_cache_cpu and tid setup
      mm, slub: move disabling/enabling irqs to ___slab_alloc()
      mm, slub: do initial checks in ___slab_alloc() with irqs enabled
      mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
      mm, slub: restore irqs around calling new_slab()
      mm, slub: validate slab from partial list or page allocator before making it cpu slab
      mm, slub: check new pages with restored irqs
      mm, slub: stop disabling irqs around get_partial()
      mm, slub: move reset of c->page and freelist out of deactivate_slab()
      mm, slub: make locking in deactivate_slab() irq-safe
      mm, slub: call deactivate_slab() without disabling irqs
      mm, slub: move irq control into unfreeze_partials()
      mm, slub: discard slabs in unfreeze_partials() without irqs disabled
      mm, slub: detach whole partial list at once in unfreeze_partials()
      mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
      mm, slub: only disable irq with spin_lock in __unfreeze_partials()
      mm, slub: don't disable irqs in slub_cpu_dead()
      mm, slab: split out the cpu offline variant of flush_slab()

    Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
      mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
      mm: slub: make object_map_lock a raw_spinlock_t

    Vlastimil Babka <vbabka@suse.cz>:
      mm, slub: make slab_lock() disable irqs with PREEMPT_RT
      mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
      mm, slub: use migrate_disable() on PREEMPT_RT
      mm, slub: convert kmem_cpu_slab protection to local_lock

Subsystem: mm/memory-hotplug

    David Hildenbrand <david@redhat.com>:
    Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3:
      memory-hotplug.rst: remove locking details from admin-guide
      memory-hotplug.rst: complete admin-guide overhaul

    Mike Rapoport <rppt@linux.ibm.com>:
    Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE":
      mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
      mm: memory_hotplug: cleanup after removal of pfn_valid_within()

    David Hildenbrand <david@redhat.com>:
    Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory":
      mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
      mm/memory_hotplug: remove nid parameter from arch_remove_memory()
      mm/memory_hotplug: remove nid parameter from remove_memory() and friends
      ACPI: memhotplug: memory resources cannot be enabled yet
    Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3:
      mm: track present early pages per zone
      mm/memory_hotplug: introduce "auto-movable" online policy
      drivers/base/memory: introduce "memory groups" to logically group memory blocks
      mm/memory_hotplug: track present pages in memory groups
      ACPI: memhotplug: use a single static memory group for a single memory device
      dax/kmem: use a single static memory group for a single probed unit
      virtio-mem: use a single dynamic memory group for a single virtio-mem device
      mm/memory_hotplug: memory group aware "auto-movable" online policy
      mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy

    Miaohe Lin <linmiaohe@huawei.com>:
    Patch series "Cleanup and fixups for memory hotplug":
      mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code

Subsystem: mm/rmap

    Muchun Song <songmuchun@bytedance.com>:
      mm: remove redundant compound_head() calling

Subsystem: mm/ioremap

    Christoph Hellwig <hch@lst.de>:
      riscv: only select GENERIC_IOREMAP if MMU support is enabled
    Patch series "small ioremap cleanups":
      mm: move ioremap_page_range to vmalloc.c
      mm: don't allow executable ioremap mappings

    Weizhao Ouyang <o451686892@gmail.com>:
      mm/early_ioremap.c: remove redundant early_ioremap_shutdown()

Subsystem: mm/highmem

    Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
      highmem: don't disable preemption on RT in kmap_atomic()

Subsystem: mm/cleanups

    Changbin Du <changbin.du@gmail.com>:
      mm: in_irq() cleanup

    Muchun Song <songmuchun@bytedance.com>:
      mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1)

Subsystem: mm/secretmem

    Jordy Zomer <jordy@jordyzomer.github.io>:
      mm/secretmem: use refcount_t instead of atomic_t

Subsystem: mm/kfence

    Marco Elver <elver@google.com>:
      kfence: show cpu and timestamp in alloc/free info
      kfence: test: fail fast if disabled at boot

Subsystem: mm/damon

    SeongJae Park <sjpark@amazon.de>:
    Patch series "Introduce Data Access MONitor (DAMON)", v34:
      mm: introduce Data Access MONitor (DAMON)
      mm/damon/core: implement region-based sampling
      mm/damon: adaptively adjust regions
      mm/idle_page_tracking: make PG_idle reusable
      mm/damon: implement primitives for the virtual memory address spaces
      mm/damon: add a tracepoint
      mm/damon: implement a debugfs-based user space interface
      mm/damon/dbgfs: export kdamond pid to the user space
      mm/damon/dbgfs: support multiple contexts
      Documentation: add documents for DAMON
      mm/damon: add kunit tests
      mm/damon: add user space selftests
      MAINTAINERS: update for DAMON

Subsystem: alpha

    Randy Dunlap <rdunlap@infradead.org>:
      alpha: agp: make empty macros use do-while-0 style
      alpha: pci-sysfs: fix all kernel-doc warnings

Subsystem: percpu

    Greg Kroah-Hartman <gregkh@linuxfoundation.org>:
      percpu: remove export of pcpu_base_addr

Subsystem: procfs

    Feng Zhou <zhoufeng.zf@bytedance.com>:
      fs/proc/kcore.c: add mmap interface

    Christoph Hellwig <hch@lst.de>:
      proc: stop using seq_get_buf in proc_task_name

    Ohhoon Kwon <ohoono.kwon@samsung.com>:
      connector: send event on write to /proc/[pid]/comm

Subsystem: misc

    Colin Ian King <colin.king@canonical.com>:
      arch: Kconfig: fix spelling mistake "seperate" -> "separate"

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      include/linux/once.h: fix trivia typo Not -> Note

    Daniel Lezcano <daniel.lezcano@linaro.org>:
    Patch series "Add Hz macros", v3:
      units: change from 'L' to 'UL'
      units: add the HZ macros
      thermal/drivers/devfreq_cooling: use HZ macros
      devfreq: use HZ macros
      iio/drivers/as73211: use HZ macros
      hwmon/drivers/mr75203: use HZ macros
      iio/drivers/hid-sensor: use HZ macros
      i2c/drivers/ov02q10: use HZ macros
      mtd/drivers/nand: use HZ macros
      phy/drivers/stm32: use HZ macros

Subsystem: core-kernel

    Yang Yang <yang.yang29@zte.com.cn>:
      kernel/acct.c: use dedicated helper to access rlimit values

    Pavel Skripkin <paskripkin@gmail.com>:
      profiling: fix shift-out-of-bounds bugs

Subsystem: MAINTAINERS

    Nathan Chancellor <nathan@kernel.org>:
      MAINTAINERS: update ClangBuiltLinux mailing list
      Documentation/llvm: update mailing list
      Documentation/llvm: update IRC location

Subsystem: lib

    Geert Uytterhoeven <geert@linux-m68k.org>:
    Patch series "math: RATIONAL and RATIONAL_KUNIT_TEST improvements":
      math: make RATIONAL tristate
      math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it

    Matteo Croce <mcroce@microsoft.com>:
    Patch series "lib/string: optimized mem* functions", v2:
      lib/string: optimized memcpy
      lib/string: optimized memmove
      lib/string: optimized memset

    Daniel Latypov <dlatypov@google.com>:
      lib/test: convert test_sort.c to use KUnit

    Randy Dunlap <rdunlap@infradead.org>:
      lib/dump_stack: correct kernel-doc notation
      lib/iov_iter.c: fix kernel-doc warnings

Subsystem: bitops

    Yury Norov <yury.norov@gmail.com>:
    Patch series "Resend bitmap patches":
      bitops: protect find_first_{,zero}_bit properly
      bitops: move find_bit_*_le functions from le.h to find.h
      include: move find.h from asm_generic to linux
      arch: remove GENERIC_FIND_FIRST_BIT entirely
      lib: add find_first_and_bit()
      cpumask: use find_first_and_bit()
      all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
      tools: sync tools/bitmap with mother linux
      cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
      include/linux: move for_each_bit() macros from bitops.h to find.h
      find: micro-optimize for_each_{set,clear}_bit()
      bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      tools: rename bitmap_alloc() to bitmap_zalloc()

    Yury Norov <yury.norov@gmail.com>:
      mm/percpu: micro-optimize pcpu_is_populated()
      bitmap: unify find_bit operations
      lib: bitmap: add performance test for bitmap_print_to_pagebuf
      vsprintf: rework bitmap_list_string

Subsystem: checkpatch

    Joe Perches <joe@perches.com>:
      checkpatch: support wide strings

    Mimi Zohar <zohar@linux.ibm.com>:
      checkpatch: make email address check case insensitive

    Joe Perches <joe@perches.com>:
      checkpatch: improve GIT_COMMIT_ID test

Subsystem: epoll

    Nicholas Piggin <npiggin@gmail.com>:
      fs/epoll: use a per-cpu counter for user's watches count

Subsystem: init

    Rasmus Villemoes <linux@rasmusvillemoes.dk>:
      init: move usermodehelper_enable() to populate_rootfs()

    Kefeng Wang <wangkefeng.wang@huawei.com>:
      trap: cleanup trap_init()

Subsystem: nilfs2

    Nanyong Sun <sunnanyong@huawei.com>:
    Patch series "nilfs2: fix incorrect usage of kobject":
      nilfs2: fix memory leak in nilfs_sysfs_create_device_group
      nilfs2: fix NULL pointer in nilfs_##name##_attr_release
      nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
      nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
      nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
      nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group

    Zhen Lei <thunder.leizhen@huawei.com>:
      nilfs2: use refcount_dec_and_lock() to fix potential UAF

Subsystem: coredump

    David Oberhollenzer <david.oberhollenzer@sigma-star.at>:
      fs/coredump.c: log if a core dump is aborted due to changed file permissions

    QiuXi <qiuxi1@huawei.com>:
      coredump: fix memleak in dump_vma_snapshot()

Subsystem: fork

    Christoph Hellwig <hch@lst.de>:
      kernel/fork.c: unexport get_{mm,task}_exe_file

Subsystem: pids

    Takahiro Itazuri <itazur@amazon.com>:
      pid: cleanup the stale comment mentioning pidmap_init().

Subsystem: criu

    Cyrill Gorcunov <gorcunov@gmail.com>:
      prctl: allow to setup brk for et_dyn executables

Subsystem: kconfig

    Zenghui Yu <yuzenghui@huawei.com>:
      configs: remove the obsolete CONFIG_INPUT_POLLDEV

    Lukas Bulwahn <lukas.bulwahn@gmail.com>:
      Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH

Subsystem: selftests

    Greg Thelen <gthelen@google.com>:
      selftests/memfd: remove unused variable

Subsystem: ipc

    Rafael Aquini <aquini@redhat.com>:
      ipc: replace costly bailout check in sysvipc_find_ipc()

Subsystem: mm/vmscan

    Randy Dunlap <rdunlap@infradead.org>:
      mm/workingset: correct kernel-doc notations

Subsystem: scripts

    Randy Dunlap <rdunlap@infradead.org>:
      scripts: check_extable: fix typo in user error message

 a/Documentation/admin-guide/mm/damon/index.rst            |   15 
 a/Documentation/admin-guide/mm/damon/start.rst            |  114 +
 a/Documentation/admin-guide/mm/damon/usage.rst            |  112 +
 a/Documentation/admin-guide/mm/index.rst                  |    1 
 a/Documentation/admin-guide/mm/memory-hotplug.rst         |  842 ++++++-----
 a/Documentation/dev-tools/kfence.rst                      |   98 -
 a/Documentation/kbuild/llvm.rst                           |    5 
 a/Documentation/vm/damon/api.rst                          |   20 
 a/Documentation/vm/damon/design.rst                       |  166 ++
 a/Documentation/vm/damon/faq.rst                          |   51 
 a/Documentation/vm/damon/index.rst                        |   30 
 a/Documentation/vm/index.rst                              |    1 
 a/MAINTAINERS                                             |   17 
 a/arch/Kconfig                                            |    2 
 a/arch/alpha/include/asm/agp.h                            |    4 
 a/arch/alpha/include/asm/bitops.h                         |    2 
 a/arch/alpha/kernel/pci-sysfs.c                           |   12 
 a/arch/arc/Kconfig                                        |    1 
 a/arch/arc/include/asm/bitops.h                           |    1 
 a/arch/arc/kernel/traps.c                                 |    5 
 a/arch/arm/configs/dove_defconfig                         |    1 
 a/arch/arm/configs/pxa_defconfig                          |    1 
 a/arch/arm/include/asm/bitops.h                           |    1 
 a/arch/arm/kernel/traps.c                                 |    5 
 a/arch/arm64/Kconfig                                      |    1 
 a/arch/arm64/include/asm/bitops.h                         |    1 
 a/arch/arm64/mm/mmu.c                                     |    3 
 a/arch/csky/include/asm/bitops.h                          |    1 
 a/arch/h8300/include/asm/bitops.h                         |    1 
 a/arch/h8300/kernel/traps.c                               |    4 
 a/arch/hexagon/include/asm/bitops.h                       |    1 
 a/arch/hexagon/kernel/traps.c                             |    4 
 a/arch/ia64/include/asm/bitops.h                          |    2 
 a/arch/ia64/mm/init.c                                     |    3 
 a/arch/m68k/include/asm/bitops.h                          |    2 
 a/arch/mips/Kconfig                                       |    1 
 a/arch/mips/configs/lemote2f_defconfig                    |    1 
 a/arch/mips/configs/pic32mzda_defconfig                   |    1 
 a/arch/mips/configs/rt305x_defconfig                      |    1 
 a/arch/mips/configs/xway_defconfig                        |    1 
 a/arch/mips/include/asm/bitops.h                          |    1 
 a/arch/nds32/kernel/traps.c                               |    5 
 a/arch/nios2/kernel/traps.c                               |    5 
 a/arch/openrisc/include/asm/bitops.h                      |    1 
 a/arch/openrisc/kernel/traps.c                            |    5 
 a/arch/parisc/configs/generic-32bit_defconfig             |    1 
 a/arch/parisc/include/asm/bitops.h                        |    2 
 a/arch/parisc/kernel/traps.c                              |    4 
 a/arch/powerpc/include/asm/bitops.h                       |    2 
 a/arch/powerpc/include/asm/cputhreads.h                   |    2 
 a/arch/powerpc/kernel/traps.c                             |    5 
 a/arch/powerpc/mm/mem.c                                   |    3 
 a/arch/powerpc/platforms/pasemi/dma_lib.c                 |    4 
 a/arch/powerpc/platforms/pseries/hotplug-memory.c         |    9 
 a/arch/riscv/Kconfig                                      |    2 
 a/arch/riscv/include/asm/bitops.h                         |    1 
 a/arch/riscv/kernel/traps.c                               |    5 
 a/arch/s390/Kconfig                                       |    1 
 a/arch/s390/include/asm/bitops.h                          |    1 
 a/arch/s390/kvm/kvm-s390.c                                |    2 
 a/arch/s390/mm/init.c                                     |    3 
 a/arch/sh/include/asm/bitops.h                            |    1 
 a/arch/sh/mm/init.c                                       |    3 
 a/arch/sparc/include/asm/bitops_32.h                      |    1 
 a/arch/sparc/include/asm/bitops_64.h                      |    2 
 a/arch/um/kernel/trap.c                                   |    4 
 a/arch/x86/Kconfig                                        |    1 
 a/arch/x86/configs/i386_defconfig                         |    1 
 a/arch/x86/configs/x86_64_defconfig                       |    1 
 a/arch/x86/include/asm/bitops.h                           |    2 
 a/arch/x86/kernel/apic/vector.c                           |    4 
 a/arch/x86/mm/init_32.c                                   |    3 
 a/arch/x86/mm/init_64.c                                   |    3 
 a/arch/x86/um/Kconfig                                     |    1 
 a/arch/xtensa/include/asm/bitops.h                        |    1 
 a/block/blk-mq.c                                          |    2 
 a/drivers/acpi/acpi_memhotplug.c                          |   46 
 a/drivers/base/memory.c                                   |  231 ++-
 a/drivers/base/node.c                                     |    2 
 a/drivers/block/rnbd/rnbd-clt.c                           |    2 
 a/drivers/dax/kmem.c                                      |   43 
 a/drivers/devfreq/devfreq.c                               |    2 
 a/drivers/dma/ti/edma.c                                   |    2 
 a/drivers/gpu/drm/etnaviv/etnaviv_gpu.c                   |    4 
 a/drivers/hwmon/ltc2992.c                                 |    3 
 a/drivers/hwmon/mr75203.c                                 |    2 
 a/drivers/iio/adc/ad7124.c                                |    2 
 a/drivers/iio/common/hid-sensors/hid-sensor-attributes.c  |    3 
 a/drivers/iio/light/as73211.c                             |    3 
 a/drivers/infiniband/hw/irdma/hw.c                        |   16 
 a/drivers/media/cec/core/cec-core.c                       |    2 
 a/drivers/media/i2c/ov02a10.c                             |    2 
 a/drivers/media/mc/mc-devnode.c                           |    2 
 a/drivers/mmc/host/renesas_sdhi_core.c                    |    2 
 a/drivers/mtd/nand/raw/intel-nand-controller.c            |    2 
 a/drivers/net/virtio_net.c                                |    2 
 a/drivers/pci/controller/dwc/pci-dra7xx.c                 |    2 
 a/drivers/phy/st/phy-stm32-usbphyc.c                      |    2 
 a/drivers/scsi/lpfc/lpfc_sli.c                            |   10 
 a/drivers/soc/fsl/qbman/bman_portal.c                     |    2 
 a/drivers/soc/fsl/qbman/qman_portal.c                     |    2 
 a/drivers/soc/ti/k3-ringacc.c                             |    4 
 a/drivers/thermal/devfreq_cooling.c                       |    2 
 a/drivers/tty/n_tty.c                                     |    2 
 a/drivers/virt/acrn/ioreq.c                               |    3 
 a/drivers/virtio/virtio_mem.c                             |   26 
 a/fs/coredump.c                                           |   15 
 a/fs/eventpoll.c                                          |   18 
 a/fs/f2fs/segment.c                                       |    8 
 a/fs/nilfs2/sysfs.c                                       |   26 
 a/fs/nilfs2/the_nilfs.c                                   |    9 
 a/fs/ocfs2/cluster/heartbeat.c                            |    2 
 a/fs/ocfs2/dlm/dlmdomain.c                                |    4 
 a/fs/ocfs2/dlm/dlmmaster.c                                |   18 
 a/fs/ocfs2/dlm/dlmrecovery.c                              |    2 
 a/fs/ocfs2/dlm/dlmthread.c                                |    2 
 a/fs/proc/array.c                                         |   18 
 a/fs/proc/base.c                                          |    5 
 a/fs/proc/kcore.c                                         |   73 
 a/include/asm-generic/bitops.h                            |    1 
 a/include/asm-generic/bitops/find.h                       |  198 --
 a/include/asm-generic/bitops/le.h                         |   64 
 a/include/asm-generic/early_ioremap.h                     |    6 
 a/include/linux/bitmap.h                                  |   34 
 a/include/linux/bitops.h                                  |   34 
 a/include/linux/cpumask.h                                 |   46 
 a/include/linux/damon.h                                   |  290 +++
 a/include/linux/find.h                                    |  134 +
 a/include/linux/highmem-internal.h                        |   27 
 a/include/linux/memory.h                                  |   55 
 a/include/linux/memory_hotplug.h                          |   40 
 a/include/linux/mmzone.h                                  |   19 
 a/include/linux/once.h                                    |    2 
 a/include/linux/page-flags.h                              |   17 
 a/include/linux/page_ext.h                                |    2 
 a/include/linux/page_idle.h                               |    6 
 a/include/linux/pagemap.h                                 |    7 
 a/include/linux/sched/user.h                              |    3 
 a/include/linux/slub_def.h                                |    6 
 a/include/linux/threads.h                                 |    2 
 a/include/linux/units.h                                   |   10 
 a/include/linux/vmalloc.h                                 |    3 
 a/include/trace/events/damon.h                            |   43 
 a/include/trace/events/mmflags.h                          |    2 
 a/include/trace/events/page_ref.h                         |    4 
 a/init/initramfs.c                                        |    2 
 a/init/main.c                                             |    3 
 a/init/noinitramfs.c                                      |    2 
 a/ipc/util.c                                              |   16 
 a/kernel/acct.c                                           |    2 
 a/kernel/fork.c                                           |    2 
 a/kernel/profile.c                                        |   21 
 a/kernel/sys.c                                            |    7 
 a/kernel/time/clocksource.c                               |    4 
 a/kernel/user.c                                           |   25 
 a/lib/Kconfig                                             |    3 
 a/lib/Kconfig.debug                                       |    9 
 a/lib/dump_stack.c                                        |    3 
 a/lib/find_bit.c                                          |   21 
 a/lib/find_bit_benchmark.c                                |   21 
 a/lib/genalloc.c                                          |    2 
 a/lib/iov_iter.c                                          |    8 
 a/lib/math/Kconfig                                        |    2 
 a/lib/math/rational.c                                     |    3 
 a/lib/string.c                                            |  130 +
 a/lib/test_bitmap.c                                       |   37 
 a/lib/test_printf.c                                       |    2 
 a/lib/test_sort.c                                         |   40 
 a/lib/vsprintf.c                                          |   26 
 a/mm/Kconfig                                              |   15 
 a/mm/Makefile                                             |    4 
 a/mm/compaction.c                                         |   20 
 a/mm/damon/Kconfig                                        |   68 
 a/mm/damon/Makefile                                       |    5 
 a/mm/damon/core-test.h                                    |  253 +++
 a/mm/damon/core.c                                         |  748 ++++++++++
 a/mm/damon/dbgfs-test.h                                   |  126 +
 a/mm/damon/dbgfs.c                                        |  631 ++++++++
 a/mm/damon/vaddr-test.h                                   |  329 ++++
 a/mm/damon/vaddr.c                                        |  672 +++++++++
 a/mm/early_ioremap.c                                      |    5 
 a/mm/highmem.c                                            |    2 
 a/mm/ioremap.c                                            |   25 
 a/mm/kfence/core.c                                        |    3 
 a/mm/kfence/kfence.h                                      |    2 
 a/mm/kfence/kfence_test.c                                 |    3 
 a/mm/kfence/report.c                                      |   19 
 a/mm/kmemleak.c                                           |    2 
 a/mm/memory_hotplug.c                                     |  396 ++++-
 a/mm/memremap.c                                           |    5 
 a/mm/page_alloc.c                                         |   27 
 a/mm/page_ext.c                                           |   12 
 a/mm/page_idle.c                                          |   10 
 a/mm/page_isolation.c                                     |    7 
 a/mm/page_owner.c                                         |   14 
 a/mm/percpu.c                                             |   36 
 a/mm/rmap.c                                               |    6 
 a/mm/secretmem.c                                          |    9 
 a/mm/slab_common.c                                        |    2 
 a/mm/slub.c                                               | 1023 +++++++++-----
 a/mm/vmalloc.c                                            |   24 
 a/mm/workingset.c                                         |    2 
 a/net/ncsi/ncsi-manage.c                                  |    4 
 a/scripts/check_extable.sh                                |    2 
 a/scripts/checkpatch.pl                                   |   93 -
 a/tools/include/linux/bitmap.h                            |    4 
 a/tools/perf/bench/find-bit-bench.c                       |    2 
 a/tools/perf/builtin-c2c.c                                |    6 
 a/tools/perf/builtin-record.c                             |    2 
 a/tools/perf/tests/bitmap.c                               |    2 
 a/tools/perf/tests/mem2node.c                             |    2 
 a/tools/perf/util/affinity.c                              |    4 
 a/tools/perf/util/header.c                                |    4 
 a/tools/perf/util/metricgroup.c                           |    2 
 a/tools/perf/util/mmap.c                                  |    4 
 a/tools/testing/selftests/damon/Makefile                  |    7 
 a/tools/testing/selftests/damon/_chk_dependency.sh        |   28 
 a/tools/testing/selftests/damon/debugfs_attrs.sh          |   75 +
 a/tools/testing/selftests/kvm/dirty_log_perf_test.c       |    2 
 a/tools/testing/selftests/kvm/dirty_log_test.c            |    4 
 a/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c |    2 
 a/tools/testing/selftests/memfd/memfd_test.c              |    2 
 b/MAINTAINERS                                             |    2 
 b/tools/include/asm-generic/bitops.h                      |    1 
 b/tools/include/linux/bitmap.h                            |    7 
 b/tools/include/linux/find.h                              |   81 +
 b/tools/lib/find_bit.c                                    |   20 
 227 files changed, 6695 insertions(+), 1875 deletions(-)



* [patch 001/147] mm, slub: don't call flush_all() from slab_debug_trace_open()
From: Andrew Morton @ 2021-09-08  2:52 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: don't call flush_all() from slab_debug_trace_open()

Patch series "SLUB: reduce irq disabled scope and make it RT compatible", v6.

This series was initially inspired by Mel's pcplist local_lock rewrite,
and also by an interest in better understanding SLUB's locking and the new
locking primitives, their RT variants, and their implications.  It makes
SLUB compatible with PREEMPT_RT and generally more preemption-friendly,
apparently without significant regressions, as the fast paths are not
affected.

The main changes to SLUB by this series:

* irq disabling is now done only for the minimum amount of time needed to
  protect the kmem_cache_cpu fields that strictly require it, and as part of
  spin lock, local lock and bit lock operations to make them irq-safe

* SLUB is fully PREEMPT_RT compatible

The series should now be sufficiently tested in both RT and !RT configs,
mainly thanks to Mike.

The RFC/v1 version also got basic performance screening by Mel that didn't
show major regressions.  Mike's hackbench testing of v2 on !RT reported
negligible differences [6]:

virgin(ish) tip
5.13.0.g60ab3ed-tip
          7,320.67 msec task-clock                #    7.792 CPUs utilized            ( +-  0.31% )
           221,215      context-switches          #    0.030 M/sec                    ( +-  3.97% )
            16,234      cpu-migrations            #    0.002 M/sec                    ( +-  4.07% )
            13,233      page-faults               #    0.002 M/sec                    ( +-  0.91% )
    27,592,205,252      cycles                    #    3.769 GHz                      ( +-  0.32% )
     8,309,495,040      instructions              #    0.30  insn per cycle           ( +-  0.37% )
     1,555,210,607      branches                  #  212.441 M/sec                    ( +-  0.42% )
         5,484,209      branch-misses             #    0.35% of all branches          ( +-  2.13% )

           0.93949 +- 0.00423 seconds time elapsed  ( +-  0.45% )
           0.94608 +- 0.00384 seconds time elapsed  ( +-  0.41% ) (repeat)
           0.94422 +- 0.00410 seconds time elapsed  ( +-  0.43% )

5.13.0.g60ab3ed-tip +slub-local-lock-v2r3
          7,343.57 msec task-clock                #    7.776 CPUs utilized            ( +-  0.44% )
           223,044      context-switches          #    0.030 M/sec                    ( +-  3.02% )
            16,057      cpu-migrations            #    0.002 M/sec                    ( +-  4.03% )
            13,164      page-faults               #    0.002 M/sec                    ( +-  0.97% )
    27,684,906,017      cycles                    #    3.770 GHz                      ( +-  0.45% )
     8,323,273,871      instructions              #    0.30  insn per cycle           ( +-  0.28% )
     1,556,106,680      branches                  #  211.901 M/sec                    ( +-  0.31% )
         5,463,468      branch-misses             #    0.35% of all branches          ( +-  1.33% )

           0.94440 +- 0.00352 seconds time elapsed  ( +-  0.37% )
           0.94830 +- 0.00228 seconds time elapsed  ( +-  0.24% ) (repeat)
           0.93813 +- 0.00440 seconds time elapsed  ( +-  0.47% ) (repeat)

RT configs showed some throughput regressions, but that's an expected
tradeoff for the preemption improvements through the RT mutex.  It didn't
prevent v2 from being incorporated into the 5.13 RT tree [7], leading to
testing exposure and bugfixes.

Before the series, SLUB is lockless in both the allocation and free fast
paths, but elsewhere it disables irqs for considerable periods of time -
especially in the allocation slowpath and in bulk allocation, where irqs
are re-enabled only when a new page from the page allocator is needed and
the context allows blocking.  The irq-disabled sections can then include
deactivate_slab(), which walks a full freelist and frees the slab back to
the page allocator, or unfreeze_partials(), which goes through a list of
percpu partial slabs.  The RT tree currently has some patches mitigating
these, but we can do much better in mainline too.

Patches 1-6 are straightforward improvements or cleanups that could exist
outside of this series too, but are prerequisites.

Patches 7-9 are also preparatory code changes without functional changes,
but not so useful without the rest of the series.

Patch 10 simplifies the fast paths on systems with preemption, based on
the (hopefully correct) observation that the current loops to verify tid
are unnecessary.
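
For illustration, a minimal sketch of that simplification (not the exact
patch; context abbreviated):

	/* Before: loop until tid and c are read on the same cpu. */
	do {
		tid = this_cpu_read(s->cpu_slab->tid);
		c = raw_cpu_ptr(s->cpu_slab);
	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
		 unlikely(tid != READ_ONCE(c->tid)));

	/*
	 * After: a single pair of reads is enough.  If the task migrates
	 * in between, the saved tid is stale and the final
	 * this_cpu_cmpxchg_double() simply fails, retrying the fast path.
	 */
	c = raw_cpu_ptr(s->cpu_slab);
	tid = READ_ONCE(c->tid);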

Patches 11-20 focus on reducing irq disabled scope in the allocation
slowpath.

Patch 11 moves the disabling of irqs into ___slab_alloc() from its
callers, which are the allocation slowpath and bulk allocation.  Instead,
these callers only disable preemption to stabilize the cpu.  The following
patches then gradually reduce the scope of disabled irqs in
___slab_alloc() and the functions called from there.  As of patch 14, the
re-enabling of irqs based on gfp flags before calling the page allocator
is removed from allocate_slab().  As of patch 17, it's possible to reach
the page allocator (in case the existing slabs are depleted) without
disabling and re-enabling irqs even a single time.

Patches 21-26 reduce the scope of disabled irqs in functions related to
unfreezing percpu partial slab.

Patch 27 is preparatory.  Patch 28 is adopted from the RT tree and
converts the flushing of percpu slabs on all cpus from using IPIs to a
workqueue, so that the processing doesn't happen with irqs disabled in the
IPI handler.  The flushing is not performance critical, so this should be
acceptable.
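
A rough sketch of that conversion (hypothetical helper names; the actual
patch also handles cpu hotplug and skips cpus without a cpu slab):

	struct slub_flush_work {
		struct work_struct work;
		struct kmem_cache *s;
	};
	static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

	static void flush_cpu_slab(struct work_struct *w)
	{
		struct slub_flush_work *sfw;

		sfw = container_of(w, struct slub_flush_work, work);
		/* deactivate this cpu's slab of sfw->s; runs in task
		 * context with irqs enabled, unlike an IPI handler */
	}

	static void flush_all(struct kmem_cache *s)
	{
		unsigned int cpu;

		for_each_online_cpu(cpu) {
			struct slub_flush_work *sfw = &per_cpu(slub_flush, cpu);

			sfw->s = s;
			INIT_WORK(&sfw->work, flush_cpu_slab);
			schedule_work_on(cpu, &sfw->work);
		}
		for_each_online_cpu(cpu)
			flush_work(&per_cpu(slub_flush, cpu).work);
	}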

Patch 29 also comes from the RT tree and makes object_map_lock RT compatible.

Patch 30 makes slab_lock() irq-safe on RT, where we cannot rely on irqs
being disabled by the list_lock spin lock usage.

Patch 31 changes the kmem_cache_cpu->partial handling in put_cpu_partial()
from a cmpxchg loop to a short irq-disabled section, which is what all
other code modifying the field uses.  This addresses a theoretical race
scenario pointed out by Jann, and makes the critical section safe with
respect to the RT local_lock semantics after the conversion in patch 33.
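
Sketched roughly (assumed field names, not the exact patch), the lockless
update loop becomes a plain read-modify-write under disabled irqs:

	/* Before: lockless cmpxchg retry loop. */
	do {
		oldpage = this_cpu_read(s->cpu_slab->partial);
		page->next = oldpage;
	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
							!= oldpage);

	/* After: a short irq-disabled section, matching all other code
	 * that modifies kmem_cache_cpu->partial. */
	local_irq_save(flags);
	oldpage = this_cpu_read(s->cpu_slab->partial);
	page->next = oldpage;		/* link into percpu partial list */
	this_cpu_write(s->cpu_slab->partial, page);
	local_irq_restore(flags);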

Patch 32 changes preempt disable to migrate disable, so that the nested
list_lock spinlock is safe to take on RT.  Because migrate_disable() is a
function call even on !RT, a small set of private wrappers is introduced
to keep using the cheaper preempt_disable() on !PREEMPT_RT configurations.
As of this patch, SLUB should already be compatible with RT's lock
semantics.
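
The wrappers could look roughly like this (a sketch; names assumed, not
taken verbatim from the patch):

	#ifdef CONFIG_PREEMPT_RT
	#define slub_get_cpu_ptr(var)		\
	({					\
		migrate_disable();		\
		this_cpu_ptr(var);		\
	})
	#define slub_put_cpu_ptr(var)		\
	do {					\
		(void)(var);			\
		migrate_enable();		\
	} while (0)
	#else
	/* on !RT, keep the cheaper preempt_disable()/enable() */
	#define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
	#define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
	#endif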

Finally, patch 33 replaces the irq-disabled sections that protect
kmem_cache_cpu fields in the slow paths with a local lock.  However, on
PREEMPT_RT this means the lockless fast paths can now preempt slow paths
which don't expect that, so the local lock has to be taken also in the
fast paths and they are no longer lockless.  RT folks seem to not mind
this tradeoff.  The patch also updates the locking documentation in the
file's comment.
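
In outline (a sketch assuming the generic local_lock API; the real struct
layout differs), kmem_cache_cpu gains a local_lock_t and the slow paths
take it instead of a bare local_irq_save():

	struct kmem_cache_cpu {
		void **freelist;	/* next available object */
		unsigned long tid;	/* globally unique transaction id */
		struct page *page;	/* slab we are allocating from */
		local_lock_t lock;	/* protects the fields above */
	};

	/* On !RT this disables irqs as before; on PREEMPT_RT it takes a
	 * per-cpu spinlock, keeping the section preemptible. */
	local_lock_irqsave(&s->cpu_slab->lock, flags);
	freelist = c->freelist;
	c->freelist = NULL;
	local_unlock_irqrestore(&s->cpu_slab->lock, flags);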

[1] https://lore.kernel.org/lkml/20210524233946.20352-1-vbabka@suse.cz/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0001-mm-sl-au-b-Change-list_lock-to-raw_spinlock_t.patch?h=linux-5.12.y-rt-patches
[3] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0004-mm-slub-Move-discard_slab-invocations-out-of-IRQ-off.patch?h=linux-5.12.y-rt-patches
[4] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tree/patches/0005-mm-slub-Move-flush_cpu_slab-invocations-__free_slab-.patch?h=linux-5.12.y-rt-patches
[5] https://lore.kernel.org/lkml/20210609113903.1421-1-vbabka@suse.cz/
[6] https://lore.kernel.org/lkml/891dc24e38106f8542f4c72831d52dc1a1863ae8.camel@gmx.de
[7] https://lore.kernel.org/linux-rt-users/87tul5p2fa.ffs@nanos.tec.linutronix.de/
[8] https://lore.kernel.org/lkml/20210729132132.19691-1-vbabka@suse.cz/
[9] https://lore.kernel.org/lkml/20210804120522.GD6464@techsingularity.net/
[10] https://lore.kernel.org/lkml/20210805152000.12817-1-vbabka@suse.cz/
[11] https://lore.kernel.org/all/20210823145826.3857-1-vbabka@suse.cz/
[12] https://lore.kernel.org/all/20210823145826.3857-7-vbabka@suse.cz/
[13] https://lore.kernel.org/all/20210823145826.3857-32-vbabka@suse.cz/
[14] https://lore.kernel.org/linux-mm/1ae902f7-c500-f9e8-1b4f-077beade0f42@suse.cz/
[15] https://lore.kernel.org/linux-mm/CAHk-=wjRfFtnQ5p42s_5Uv8i0U5YKSBpTH++_ZMKZyyvYicYmQ@mail.gmail.com/
[16] https://lore.kernel.org/all/871r6j526m.ffs@tglx/


This patch (of 33):

slab_debug_trace_open() can only be called on caches with the
SLAB_STORE_USER flag, and as with all slub debugging flags, such caches
avoid cpu or percpu partial slabs altogether, so there's nothing to flush.

Link: https://lkml.kernel.org/r/20210904105003.11688-1-vbabka@suse.cz
Link: https://lkml.kernel.org/r/20210904105003.11688-2-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    3 ---
 1 file changed, 3 deletions(-)

--- a/mm/slub.c~mm-slub-dont-call-flush_all-from-slab_debug_trace_open
+++ a/mm/slub.c
@@ -5825,9 +5825,6 @@ static int slab_debug_trace_open(struct
 	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL))
 		return -ENOMEM;
 
-	/* Push back cpu slabs */
-	flush_all(s);
-
 	for_each_kmem_cache_node(s, node, n) {
 		unsigned long flags;
 		struct page *page;
_


* [patch 002/147] mm, slub: allocate private object map for debugfs listings
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: allocate private object map for debugfs listings

SLUB has a static, spinlock-protected bitmap for marking which objects are
on a freelist when it wants to list them, for situations where dynamically
allocating such a map can lead to recursion or locking issues, and an
on-stack bitmap would be too large.

The handlers of the debugfs files alloc_traces and free_traces also
currently use this shared bitmap, but their syscall context makes it
straightforward to allocate a private map before entering locked sections,
so switch these processing paths to use a private bitmap.

Link: https://lkml.kernel.org/r/20210904105003.11688-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   44 +++++++++++++++++++++++++++++---------------
 1 file changed, 29 insertions(+), 15 deletions(-)

--- a/mm/slub.c~mm-slub-allocate-private-object-map-for-debugfs-listings
+++ a/mm/slub.c
@@ -454,6 +454,18 @@ static inline bool cmpxchg_double_slab(s
 static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
 static DEFINE_SPINLOCK(object_map_lock);
 
+static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
+		       struct page *page)
+{
+	void *addr = page_address(page);
+	void *p;
+
+	bitmap_zero(obj_map, page->objects);
+
+	for (p = page->freelist; p; p = get_freepointer(s, p))
+		set_bit(__obj_to_index(s, addr, p), obj_map);
+}
+
 #if IS_ENABLED(CONFIG_KUNIT)
 static bool slab_add_kunit_errors(void)
 {
@@ -483,17 +495,11 @@ static inline bool slab_add_kunit_errors
 static unsigned long *get_map(struct kmem_cache *s, struct page *page)
 	__acquires(&object_map_lock)
 {
-	void *p;
-	void *addr = page_address(page);
-
 	VM_BUG_ON(!irqs_disabled());
 
 	spin_lock(&object_map_lock);
 
-	bitmap_zero(object_map, page->objects);
-
-	for (p = page->freelist; p; p = get_freepointer(s, p))
-		set_bit(__obj_to_index(s, addr, p), object_map);
+	__fill_map(object_map, s, page);
 
 	return object_map;
 }
@@ -4879,17 +4885,17 @@ static int add_location(struct loc_track
 }
 
 static void process_slab(struct loc_track *t, struct kmem_cache *s,
-		struct page *page, enum track_item alloc)
+		struct page *page, enum track_item alloc,
+		unsigned long *obj_map)
 {
 	void *addr = page_address(page);
 	void *p;
-	unsigned long *map;
 
-	map = get_map(s, page);
+	__fill_map(obj_map, s, page);
+
 	for_each_object(p, s, addr, page->objects)
-		if (!test_bit(__obj_to_index(s, addr, p), map))
+		if (!test_bit(__obj_to_index(s, addr, p), obj_map))
 			add_location(t, s, get_track(s, p, alloc));
-	put_map(map);
 }
 #endif  /* CONFIG_DEBUG_FS   */
 #endif	/* CONFIG_SLUB_DEBUG */
@@ -5816,14 +5822,21 @@ static int slab_debug_trace_open(struct
 	struct loc_track *t = __seq_open_private(filep, &slab_debugfs_sops,
 						sizeof(struct loc_track));
 	struct kmem_cache *s = file_inode(filep)->i_private;
+	unsigned long *obj_map;
+
+	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+	if (!obj_map)
+		return -ENOMEM;
 
 	if (strcmp(filep->f_path.dentry->d_name.name, "alloc_traces") == 0)
 		alloc = TRACK_ALLOC;
 	else
 		alloc = TRACK_FREE;
 
-	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL))
+	if (!alloc_loc_track(t, PAGE_SIZE / sizeof(struct location), GFP_KERNEL)) {
+		bitmap_free(obj_map);
 		return -ENOMEM;
+	}
 
 	for_each_kmem_cache_node(s, node, n) {
 		unsigned long flags;
@@ -5834,12 +5847,13 @@ static int slab_debug_trace_open(struct
 
 		spin_lock_irqsave(&n->list_lock, flags);
 		list_for_each_entry(page, &n->partial, slab_list)
-			process_slab(t, s, page, alloc);
+			process_slab(t, s, page, alloc, obj_map);
 		list_for_each_entry(page, &n->full, slab_list)
-			process_slab(t, s, page, alloc);
+			process_slab(t, s, page, alloc, obj_map);
 		spin_unlock_irqrestore(&n->list_lock, flags);
 	}
 
+	bitmap_free(obj_map);
 	return 0;
 }
 
_


* [patch 003/147] mm, slub: allocate private object map for validate_slab_cache()
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: allocate private object map for validate_slab_cache()

validate_slab_cache() is called either to handle a sysfs write, or from a
self-test context.  In both situations it's straightforward to preallocate
a private object bitmap instead of grabbing the shared static one meant
for critical sections, so let's do that.

Link: https://lkml.kernel.org/r/20210904105003.11688-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

--- a/mm/slub.c~mm-slub-allocate-private-object-map-for-validate_slab_cache
+++ a/mm/slub.c
@@ -4679,11 +4679,11 @@ static int count_total(struct page *page
 #endif
 
 #ifdef CONFIG_SLUB_DEBUG
-static void validate_slab(struct kmem_cache *s, struct page *page)
+static void validate_slab(struct kmem_cache *s, struct page *page,
+			  unsigned long *obj_map)
 {
 	void *p;
 	void *addr = page_address(page);
-	unsigned long *map;
 
 	slab_lock(page);
 
@@ -4691,21 +4691,20 @@ static void validate_slab(struct kmem_ca
 		goto unlock;
 
 	/* Now we know that a valid freelist exists */
-	map = get_map(s, page);
+	__fill_map(obj_map, s, page);
 	for_each_object(p, s, addr, page->objects) {
-		u8 val = test_bit(__obj_to_index(s, addr, p), map) ?
+		u8 val = test_bit(__obj_to_index(s, addr, p), obj_map) ?
 			 SLUB_RED_INACTIVE : SLUB_RED_ACTIVE;
 
 		if (!check_object(s, page, p, val))
 			break;
 	}
-	put_map(map);
 unlock:
 	slab_unlock(page);
 }
 
 static int validate_slab_node(struct kmem_cache *s,
-		struct kmem_cache_node *n)
+		struct kmem_cache_node *n, unsigned long *obj_map)
 {
 	unsigned long count = 0;
 	struct page *page;
@@ -4714,7 +4713,7 @@ static int validate_slab_node(struct kme
 	spin_lock_irqsave(&n->list_lock, flags);
 
 	list_for_each_entry(page, &n->partial, slab_list) {
-		validate_slab(s, page);
+		validate_slab(s, page, obj_map);
 		count++;
 	}
 	if (count != n->nr_partial) {
@@ -4727,7 +4726,7 @@ static int validate_slab_node(struct kme
 		goto out;
 
 	list_for_each_entry(page, &n->full, slab_list) {
-		validate_slab(s, page);
+		validate_slab(s, page, obj_map);
 		count++;
 	}
 	if (count != atomic_long_read(&n->nr_slabs)) {
@@ -4746,10 +4745,17 @@ long validate_slab_cache(struct kmem_cac
 	int node;
 	unsigned long count = 0;
 	struct kmem_cache_node *n;
+	unsigned long *obj_map;
+
+	obj_map = bitmap_alloc(oo_objects(s->oo), GFP_KERNEL);
+	if (!obj_map)
+		return -ENOMEM;
 
 	flush_all(s);
 	for_each_kmem_cache_node(s, node, n)
-		count += validate_slab_node(s, n);
+		count += validate_slab_node(s, n, obj_map);
+
+	bitmap_free(obj_map);
 
 	return count;
 }
_


* [patch 004/147] mm, slub: don't disable irq for debug_check_no_locks_freed()
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: don't disable irq for debug_check_no_locks_freed()

In slab_free_hook() we disable irqs around the
debug_check_no_locks_freed() call, which is unnecessary, as irqs are
already disabled inside that call.  This seems to be a leftover from the
past, when there were more calls inside the irq-disabled section.  Remove
the irq disable/enable operations.

Mel noted:
> Looks like it was needed for kmemcheck which went away back in 4.15

Link: https://lkml.kernel.org/r/20210904105003.11688-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

--- a/mm/slub.c~mm-slub-dont-disable-irq-for-debug_check_no_locks_freed
+++ a/mm/slub.c
@@ -1591,20 +1591,8 @@ static __always_inline bool slab_free_ho
 {
 	kmemleak_free_recursive(x, s->flags);
 
-	/*
-	 * Trouble is that we may no longer disable interrupts in the fast path
-	 * So in order to make the debug calls that expect irqs to be
-	 * disabled we need to disable interrupts temporarily.
-	 */
-#ifdef CONFIG_LOCKDEP
-	{
-		unsigned long flags;
+	debug_check_no_locks_freed(x, s->object_size);
 
-		local_irq_save(flags);
-		debug_check_no_locks_freed(x, s->object_size);
-		local_irq_restore(flags);
-	}
-#endif
 	if (!(s->flags & SLAB_DEBUG_OBJECTS))
 		debug_check_no_obj_freed(x, s->object_size);
 
_


* [patch 005/147] mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()

Commit d6e0b7fa1186 ("slub: make dead caches discard free slabs
immediately") introduced cpu partial flushing for kmemcg caches, based on
setting the target cpu_partial to 0 and adding a flushing check in
put_cpu_partial().  This code that sets cpu_partial to 0 was later moved
by c9fc586403e7 ("slab: introduce __kmemcg_cache_deactivate()") and
ultimately removed by 9855609bde03 ("mm: memcg/slab: use a single set of
kmem_caches for all accounted allocations").  However, the check and flush
in put_cpu_partial() were never removed, although they are effectively
dead code, so this patch removes them.

Note that d6e0b7fa1186 also added preempt_disable()/enable() to
unfreeze_partials(), which could thus also be considered unnecessary, but
further patches will rely on it, so keep it.

Link: https://lkml.kernel.org/r/20210904105003.11688-6-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    7 -------
 1 file changed, 7 deletions(-)

--- a/mm/slub.c~mm-slub-remove-redundant-unfreeze_partials-from-put_cpu_partial
+++ a/mm/slub.c
@@ -2466,13 +2466,6 @@ static void put_cpu_partial(struct kmem_
 
 	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
 								!= oldpage);
-	if (unlikely(!slub_cpu_partial(s))) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
-		local_irq_restore(flags);
-	}
 	preempt_enable();
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
_


* [patch 006/147] mm, slub: extract get_partial() from new_slab_objects()
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: extract get_partial() from new_slab_objects()

The later patches will need more fine-grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so this
is a first preparatory step with no functional change.

This adds a goto label that appears unnecessary at this point, but will be
useful for later changes.

Link: https://lkml.kernel.org/r/20210904105003.11688-7-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/mm/slub.c~mm-slub-extract-get_partial-from-new_slab_objects
+++ a/mm/slub.c
@@ -2613,17 +2613,12 @@ slab_out_of_memory(struct kmem_cache *s,
 static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
 			int node, struct kmem_cache_cpu **pc)
 {
-	void *freelist;
+	void *freelist = NULL;
 	struct kmem_cache_cpu *c = *pc;
 	struct page *page;
 
 	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
 
-	freelist = get_partial(s, flags, node, c);
-
-	if (freelist)
-		return freelist;
-
 	page = new_slab(s, flags, node);
 	if (page) {
 		c = raw_cpu_ptr(s->cpu_slab);
@@ -2787,6 +2782,10 @@ new_slab:
 		goto redo;
 	}
 
+	freelist = get_partial(s, gfpflags, node, c);
+	if (freelist)
+		goto check_new_page;
+
 	freelist = new_slab_objects(s, gfpflags, node, &c);
 
 	if (unlikely(!freelist)) {
@@ -2794,6 +2793,7 @@ new_slab:
 		return NULL;
 	}
 
+check_new_page:
 	page = c->page;
 	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
 		goto load_freelist;
_


* [patch 007/147] mm, slub: dissolve new_slab_objects() into ___slab_alloc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (5 preceding siblings ...)
  2021-09-08  2:53 ` [patch 006/147] mm, slub: extract get_partial() from new_slab_objects() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 008/147] mm, slub: return slab page from get_partial() and set c->page afterwards Andrew Morton
                   ` (140 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: dissolve new_slab_objects() into ___slab_alloc()

The later patches will need more fine-grained control over individual
actions in ___slab_alloc(), the only caller of new_slab_objects(), so
dissolve it there.  This is a preparatory step with no functional change.

The only minor change is moving WARN_ON_ONCE() for using a constructor
together with __GFP_ZERO to new_slab(), which makes it somewhat less
frequent, but still able to catch a development change introducing a
systematic misuse.

Link: https://lkml.kernel.org/r/20210904105003.11688-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Christoph Lameter <cl@linux.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   50 ++++++++++++++++++--------------------------------
 1 file changed, 18 insertions(+), 32 deletions(-)

--- a/mm/slub.c~mm-slub-dissolve-new_slab_objects-into-___slab_alloc
+++ a/mm/slub.c
@@ -1885,6 +1885,8 @@ static struct page *new_slab(struct kmem
 	if (unlikely(flags & GFP_SLAB_BUG_MASK))
 		flags = kmalloc_fix_flags(flags);
 
+	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
+
 	return allocate_slab(s,
 		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
 }
@@ -2610,36 +2612,6 @@ slab_out_of_memory(struct kmem_cache *s,
 #endif
 }
 
-static inline void *new_slab_objects(struct kmem_cache *s, gfp_t flags,
-			int node, struct kmem_cache_cpu **pc)
-{
-	void *freelist = NULL;
-	struct kmem_cache_cpu *c = *pc;
-	struct page *page;
-
-	WARN_ON_ONCE(s->ctor && (flags & __GFP_ZERO));
-
-	page = new_slab(s, flags, node);
-	if (page) {
-		c = raw_cpu_ptr(s->cpu_slab);
-		if (c->page)
-			flush_slab(s, c);
-
-		/*
-		 * No other reference to the page yet so we can
-		 * muck around with it freely without cmpxchg
-		 */
-		freelist = page->freelist;
-		page->freelist = NULL;
-
-		stat(s, ALLOC_SLAB);
-		c->page = page;
-		*pc = c;
-	}
-
-	return freelist;
-}
-
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags)
 {
 	if (unlikely(PageSlabPfmemalloc(page)))
@@ -2786,13 +2758,27 @@ new_slab:
 	if (freelist)
 		goto check_new_page;
 
-	freelist = new_slab_objects(s, gfpflags, node, &c);
+	page = new_slab(s, gfpflags, node);
 
-	if (unlikely(!freelist)) {
+	if (unlikely(!page)) {
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}
 
+	c = raw_cpu_ptr(s->cpu_slab);
+	if (c->page)
+		flush_slab(s, c);
+
+	/*
+	 * No other reference to the page yet so we can
+	 * muck around with it freely without cmpxchg
+	 */
+	freelist = page->freelist;
+	page->freelist = NULL;
+
+	stat(s, ALLOC_SLAB);
+	c->page = page;
+
 check_new_page:
 	page = c->page;
 	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 008/147] mm, slub: return slab page from get_partial() and set c->page afterwards
  2021-09-08  2:52 incoming Andrew Morton
                   ` (6 preceding siblings ...)
  2021-09-08  2:53 ` [patch 007/147] mm, slub: dissolve new_slab_objects() into ___slab_alloc() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 009/147] mm, slub: restructure new page checks in ___slab_alloc() Andrew Morton
                   ` (139 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: return slab page from get_partial() and set c->page afterwards

The function get_partial() finds a suitable page on a partial list,
acquires and returns its freelist and assigns the page pointer to
kmem_cache_cpu.  In a later patch we will need more control over the
kmem_cache_cpu.page assignment, so instead of passing a kmem_cache_cpu
pointer, pass a pointer to a pointer to a page that get_partial() can fill
and the caller can assign the kmem_cache_cpu.page pointer.  No functional
change as all of this still happens with disabled IRQs.

Link: https://lkml.kernel.org/r/20210904105003.11688-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- a/mm/slub.c~mm-slub-return-slab-page-from-get_partial-and-set-c-page-afterwards
+++ a/mm/slub.c
@@ -2017,7 +2017,7 @@ static inline bool pfmemalloc_match(stru
  * Try to allocate a partial slab from a specific node.
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
-				struct kmem_cache_cpu *c, gfp_t flags)
+			      struct page **ret_page, gfp_t flags)
 {
 	struct page *page, *page2;
 	void *object = NULL;
@@ -2046,7 +2046,7 @@ static void *get_partial_node(struct kme
 
 		available += objects;
 		if (!object) {
-			c->page = page;
+			*ret_page = page;
 			stat(s, ALLOC_FROM_PARTIAL);
 			object = t;
 		} else {
@@ -2066,7 +2066,7 @@ static void *get_partial_node(struct kme
  * Get a page from somewhere. Search in increasing NUMA distances.
  */
 static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
-		struct kmem_cache_cpu *c)
+			     struct page **ret_page)
 {
 #ifdef CONFIG_NUMA
 	struct zonelist *zonelist;
@@ -2108,7 +2108,7 @@ static void *get_any_partial(struct kmem
 
 			if (n && cpuset_zone_allowed(zone, flags) &&
 					n->nr_partial > s->min_partial) {
-				object = get_partial_node(s, n, c, flags);
+				object = get_partial_node(s, n, ret_page, flags);
 				if (object) {
 					/*
 					 * Don't check read_mems_allowed_retry()
@@ -2130,7 +2130,7 @@ static void *get_any_partial(struct kmem
  * Get a partial page, lock it and return it.
  */
 static void *get_partial(struct kmem_cache *s, gfp_t flags, int node,
-		struct kmem_cache_cpu *c)
+			 struct page **ret_page)
 {
 	void *object;
 	int searchnode = node;
@@ -2138,11 +2138,11 @@ static void *get_partial(struct kmem_cac
 	if (node == NUMA_NO_NODE)
 		searchnode = numa_mem_id();
 
-	object = get_partial_node(s, get_node(s, searchnode), c, flags);
+	object = get_partial_node(s, get_node(s, searchnode), ret_page, flags);
 	if (object || node != NUMA_NO_NODE)
 		return object;
 
-	return get_any_partial(s, flags, c);
+	return get_any_partial(s, flags, ret_page);
 }
 
 #ifdef CONFIG_PREEMPTION
@@ -2754,9 +2754,11 @@ new_slab:
 		goto redo;
 	}
 
-	freelist = get_partial(s, gfpflags, node, c);
-	if (freelist)
+	freelist = get_partial(s, gfpflags, node, &page);
+	if (freelist) {
+		c->page = page;
 		goto check_new_page;
+	}
 
 	page = new_slab(s, gfpflags, node);
 
@@ -2780,7 +2782,6 @@ new_slab:
 	c->page = page;
 
 check_new_page:
-	page = c->page;
 	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
 		goto load_freelist;
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 009/147] mm, slub: restructure new page checks in ___slab_alloc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (7 preceding siblings ...)
  2021-09-08  2:53 ` [patch 008/147] mm, slub: return slab page from get_partial() and set c->page afterwards Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 010/147] mm, slub: simplify kmem_cache_cpu and tid setup Andrew Morton
                   ` (138 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: restructure new page checks in ___slab_alloc()

When we allocate slab object from a newly acquired page (from node's
partial list or page allocator), we usually also retain the page as a new
percpu slab.  There are two exceptions - when pfmemalloc status of the
page doesn't match our gfp flags, or when the cache has debugging enabled.

The current code for these decisions is not easy to follow, so restructure
it and add comments.  The new structure will also help with the following
changes.  No functional change.

Link: https://lkml.kernel.org/r/20210904105003.11688-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

--- a/mm/slub.c~mm-slub-restructure-new-page-checks-in-___slab_alloc
+++ a/mm/slub.c
@@ -2782,13 +2782,29 @@ new_slab:
 	c->page = page;
 
 check_new_page:
-	if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
-		goto load_freelist;
 
-	/* Only entered in the debug case */
-	if (kmem_cache_debug(s) &&
-			!alloc_debug_processing(s, page, freelist, addr))
-		goto new_slab;	/* Slab failed checks. Next slab needed */
+	if (kmem_cache_debug(s)) {
+		if (!alloc_debug_processing(s, page, freelist, addr))
+			/* Slab failed checks. Next slab needed */
+			goto new_slab;
+		else
+			/*
+			 * For debug case, we don't load freelist so that all
+			 * allocations go through alloc_debug_processing()
+			 */
+			goto return_single;
+	}
+
+	if (unlikely(!pfmemalloc_match(page, gfpflags)))
+		/*
+		 * For !pfmemalloc_match() case we don't load freelist so that
+		 * we don't make further mismatched allocations easier.
+		 */
+		goto return_single;
+
+	goto load_freelist;
+
+return_single:
 
 	deactivate_slab(s, page, get_freepointer(s, freelist), c);
 	return freelist;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 010/147] mm, slub: simplify kmem_cache_cpu and tid setup
  2021-09-08  2:52 incoming Andrew Morton
                   ` (8 preceding siblings ...)
  2021-09-08  2:53 ` [patch 009/147] mm, slub: restructure new page checks in ___slab_alloc() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 011/147] mm, slub: move disabling/enabling irqs to ___slab_alloc() Andrew Morton
                   ` (137 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: simplify kmem_cache_cpu and tid setup

In slab_alloc_node() and do_slab_free() fastpaths we need to guarantee
that our kmem_cache_cpu pointer is from the same cpu as the tid value. 
Currently that's done by reading the tid first using this_cpu_read(), then
reading the kmem_cache_cpu pointer, and verifying that the tid read again
through that pointer (with a plain READ_ONCE()) matches.

This can be simplified to just fetching kmem_cache_cpu pointer and then
reading tid using the pointer.  That guarantees they are from the same
cpu.  We don't need to read the tid using this_cpu_read() because the
value will be validated by this_cpu_cmpxchg_double(), making sure we are
on the correct cpu and the freelist didn't change by anyone preempting us
since reading the tid.
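
As a condensed sketch of the resulting fastpath (simplified from
slab_alloc_node() in mm/slub.c; error paths and barriers elided, not a
literal quote):

	c = raw_cpu_ptr(s->cpu_slab);	/* may be another cpu's area after migration */
	tid = READ_ONCE(c->tid);	/* but read through c, so from the same cpu */
	object = c->freelist;
	/* next_object = get_freepointer_safe(s, object); */

	/* fails if we migrated cpus or the freelist/tid changed meanwhile: */
	if (unlikely(!this_cpu_cmpxchg_double(s->cpu_slab->freelist,
					      s->cpu_slab->tid,
					      object, tid,
					      next_object, next_tid(tid))))
		goto redo;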

Link: https://lkml.kernel.org/r/20210904105003.11688-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

--- a/mm/slub.c~mm-slub-simplify-kmem_cache_cpu-and-tid-setup
+++ a/mm/slub.c
@@ -2882,15 +2882,14 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * We should guarantee that tid and kmem_cache are retrieved on
-	 * the same cpu. It could be different if CONFIG_PREEMPTION so we need
-	 * to check if it is matched or not.
+	 * We must guarantee that tid and kmem_cache_cpu are retrieved on the
+	 * same cpu. We read first the kmem_cache_cpu pointer and use it to read
+	 * the tid. If we are preempted and switched to another cpu between the
+	 * two reads, it's OK as the two are still associated with the same cpu
+	 * and cmpxchg later will validate the cpu.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
-		 unlikely(tid != READ_ONCE(c->tid)));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/*
 	 * Irqless object alloc/free algorithm used here depends on sequence
@@ -3164,11 +3163,8 @@ redo:
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succeed.
 	 */
-	do {
-		tid = this_cpu_read(s->cpu_slab->tid);
-		c = raw_cpu_ptr(s->cpu_slab);
-	} while (IS_ENABLED(CONFIG_PREEMPTION) &&
-		 unlikely(tid != READ_ONCE(c->tid)));
+	c = raw_cpu_ptr(s->cpu_slab);
+	tid = READ_ONCE(c->tid);
 
 	/* Same with comment on barrier() in slab_alloc_node() */
 	barrier();
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 011/147] mm, slub: move disabling/enabling irqs to ___slab_alloc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2021-09-08  2:53 ` [patch 010/147] mm, slub: simplify kmem_cache_cpu and tid setup Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 012/147] mm, slub: do initial checks in ___slab_alloc() with irqs enabled Andrew Morton
                   ` (136 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, iamjoonsoo.kim, jannh, linux-mm,
	mgorman, mm-commits, penberg, quic_qiancai, rientjes, tglx,
	torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: move disabling/enabling irqs to ___slab_alloc()

Currently __slab_alloc() disables irqs around the whole ___slab_alloc(). 
This includes cases where this is not needed, such as when the allocation
ends up in the page allocator and has to awkwardly re-enable irqs based
on gfp flags.  Also the whole kmem_cache_alloc_bulk() is executed with
irqs disabled even when it hits the __slab_alloc() slow path, and long
periods with disabled interrupts are undesirable.

As a first step towards reducing irq disabled periods, move irq handling
into ___slab_alloc().  Callers will instead prevent the s->cpu_slab percpu
pointer from becoming invalid via get_cpu_ptr(), thus preempt_disable(). 
This does not protect against modification by an irq handler, which is
still done by disabled irq for most of ___slab_alloc().  As a small
immediate benefit, slab_out_of_memory() from ___slab_alloc() is now called
with irqs enabled.
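
For reference, get_cpu_ptr()/put_cpu_ptr() are thin wrappers (roughly as
defined in include/linux/percpu-defs.h):

	#define get_cpu_ptr(var)		\
	({					\
		preempt_disable();		\
		this_cpu_ptr(var);		\
	})

	#define put_cpu_ptr(var)		\
	({					\
		(void)(var);			\
		preempt_enable();		\
	})

so the caller pins the percpu area against migration without touching the
irq state.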

kmem_cache_alloc_bulk() disables irqs for its fastpath and then re-enables
them before calling ___slab_alloc(), which then disables them at its
discretion.  The whole kmem_cache_alloc_bulk() operation also disables
preemption.

When ___slab_alloc() calls new_slab() to allocate a new page, re-enable
preemption, because new_slab() will re-enable interrupts in contexts that
allow blocking (this will be improved by later patches).

The patch itself will thus increase overhead a bit due to disabled
preemption (on configs where it matters) and increased disabling/enabling
irqs in kmem_cache_alloc_bulk(), but that will be gradually improved in
the following patches.

Note in __slab_alloc() we need to change the #ifdef CONFIG_PREEMPT guard
to CONFIG_PREEMPT_COUNT to make sure preempt disable/enable is properly
paired in all configurations.  On configs without involuntary preemption
and debugging the re-read of kmem_cache_cpu pointer is still compiled out
as it was before.

[ Mike Galbraith <efault@gmx.de>: Fix kmem_cache_alloc_bulk() error path ]
Link: https://lkml.kernel.org/r/20210904105003.11688-12-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

--- a/mm/slub.c~mm-slub-move-disabling-enabling-irqs-to-___slab_alloc
+++ a/mm/slub.c
@@ -2670,7 +2670,7 @@ static inline void *get_freelist(struct
  * we need to allocate a new slab. This is the slowest path since it involves
  * a call to the page allocator and the setup of a new slab.
  *
- * Version of __slab_alloc to use when we know that interrupts are
+ * Version of __slab_alloc to use when we know that preemption is
  * already disabled (which is the case for bulk allocation).
  */
 static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
@@ -2678,9 +2678,11 @@ static void *___slab_alloc(struct kmem_c
 {
 	void *freelist;
 	struct page *page;
+	unsigned long flags;
 
 	stat(s, ALLOC_SLOWPATH);
 
+	local_irq_save(flags);
 	page = c->page;
 	if (!page) {
 		/*
@@ -2743,6 +2745,7 @@ load_freelist:
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
+	local_irq_restore(flags);
 	return freelist;
 
 new_slab:
@@ -2760,14 +2763,16 @@ new_slab:
 		goto check_new_page;
 	}
 
+	put_cpu_ptr(s->cpu_slab);
 	page = new_slab(s, gfpflags, node);
+	c = get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!page)) {
+		local_irq_restore(flags);
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}
 
-	c = raw_cpu_ptr(s->cpu_slab);
 	if (c->page)
 		flush_slab(s, c);
 
@@ -2807,31 +2812,33 @@ check_new_page:
 return_single:
 
 	deactivate_slab(s, page, get_freepointer(s, freelist), c);
+	local_irq_restore(flags);
 	return freelist;
 }
 
 /*
- * Another one that disabled interrupt and compensates for possible
- * cpu changes by refetching the per cpu area pointer.
+ * A wrapper for ___slab_alloc() for contexts where preemption is not yet
+ * disabled. Compensates for possible cpu changes by refetching the per cpu area
+ * pointer.
  */
 static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
 			  unsigned long addr, struct kmem_cache_cpu *c)
 {
 	void *p;
-	unsigned long flags;
 
-	local_irq_save(flags);
-#ifdef CONFIG_PREEMPTION
+#ifdef CONFIG_PREEMPT_COUNT
 	/*
 	 * We may have been preempted and rescheduled on a different
-	 * cpu before disabling interrupts. Need to reload cpu area
+	 * cpu before disabling preemption. Need to reload cpu area
 	 * pointer.
 	 */
-	c = this_cpu_ptr(s->cpu_slab);
+	c = get_cpu_ptr(s->cpu_slab);
 #endif
 
 	p = ___slab_alloc(s, gfpflags, node, addr, c);
-	local_irq_restore(flags);
+#ifdef CONFIG_PREEMPT_COUNT
+	put_cpu_ptr(s->cpu_slab);
+#endif
 	return p;
 }
 
@@ -3359,8 +3366,8 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	 * IRQs, which protects against PREEMPT and interrupts
 	 * handlers invoking normal fastpath.
 	 */
+	c = get_cpu_ptr(s->cpu_slab);
 	local_irq_disable();
-	c = this_cpu_ptr(s->cpu_slab);
 
 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3381,6 +3388,8 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 			 */
 			c->tid = next_tid(c->tid);
 
+			local_irq_enable();
+
 			/*
 			 * Invoking slow path likely have side-effect
 			 * of re-populating per CPU c->freelist
@@ -3393,6 +3402,8 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);
 
+			local_irq_disable();
+
 			continue; /* goto for-loop */
 		}
 		c->freelist = get_freepointer(s, object);
@@ -3401,6 +3412,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	}
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
+	put_cpu_ptr(s->cpu_slab);
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -3410,7 +3422,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 				slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	local_irq_enable();
+	put_cpu_ptr(s->cpu_slab);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 012/147] mm, slub: do initial checks in ___slab_alloc() with irqs enabled
  2021-09-08  2:52 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2021-09-08  2:53 ` [patch 011/147] mm, slub: move disabling/enabling irqs to ___slab_alloc() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 013/147] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() Andrew Morton
                   ` (135 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: do initial checks in ___slab_alloc() with irqs enabled

As another step of shortening irq disabled sections in ___slab_alloc(),
delay disabling irqs until we pass the initial checks of whether there is a
cached percpu slab and whether it's suitable for our allocation.

Now we have to recheck c->page after actually disabling irqs, as an
allocation in an irq handler might have replaced it.

Because we call pfmemalloc_match() as one of the checks, we might hit
VM_BUG_ON_PAGE(!PageSlab(page)) in PageSlabPfmemalloc() if we get
interrupted and the page is freed.  Thus introduce a
pfmemalloc_match_unsafe() variant that lacks the PageSlab check.

Link: https://lkml.kernel.org/r/20210904105003.11688-13-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h |    9 +++++
 mm/slub.c                  |   54 +++++++++++++++++++++++++++++------
 2 files changed, 54 insertions(+), 9 deletions(-)

--- a/include/linux/page-flags.h~mm-slub-do-initial-checks-in-___slab_alloc-with-irqs-enabled
+++ a/include/linux/page-flags.h
@@ -815,6 +815,15 @@ static inline int PageSlabPfmemalloc(str
 	return PageActive(page);
 }
 
+/*
+ * A version of PageSlabPfmemalloc() for opportunistic checks where the page
+ * might have been freed under us and not be a PageSlab anymore.
+ */
+static inline int __PageSlabPfmemalloc(struct page *page)
+{
+	return PageActive(page);
+}
+
 static inline void SetPageSlabPfmemalloc(struct page *page)
 {
 	VM_BUG_ON_PAGE(!PageSlab(page), page);
--- a/mm/slub.c~mm-slub-do-initial-checks-in-___slab_alloc-with-irqs-enabled
+++ a/mm/slub.c
@@ -2621,6 +2621,19 @@ static inline bool pfmemalloc_match(stru
 }
 
 /*
+ * A variant of pfmemalloc_match() that tests page flags without asserting
+ * PageSlab. Intended for opportunistic checks before taking a lock and
+ * rechecking that nobody else freed the page under us.
+ */
+static inline bool pfmemalloc_match_unsafe(struct page *page, gfp_t gfpflags)
+{
+	if (unlikely(__PageSlabPfmemalloc(page)))
+		return gfp_pfmemalloc_allowed(gfpflags);
+
+	return true;
+}
+
+/*
  * Check the page->freelist of a page and either transfer the freelist to the
  * per cpu freelist or deactivate the page.
  *
@@ -2682,8 +2695,9 @@ static void *___slab_alloc(struct kmem_c
 
 	stat(s, ALLOC_SLOWPATH);
 
-	local_irq_save(flags);
-	page = c->page;
+reread_page:
+
+	page = READ_ONCE(c->page);
 	if (!page) {
 		/*
 		 * if the node is not online or has no normal memory, just
@@ -2692,6 +2706,11 @@ static void *___slab_alloc(struct kmem_c
 		if (unlikely(node != NUMA_NO_NODE &&
 			     !node_isset(node, slab_nodes)))
 			node = NUMA_NO_NODE;
+		local_irq_save(flags);
+		if (unlikely(c->page)) {
+			local_irq_restore(flags);
+			goto reread_page;
+		}
 		goto new_slab;
 	}
 redo:
@@ -2706,8 +2725,7 @@ redo:
 			goto redo;
 		} else {
 			stat(s, ALLOC_NODE_MISMATCH);
-			deactivate_slab(s, page, c->freelist, c);
-			goto new_slab;
+			goto deactivate_slab;
 		}
 	}
 
@@ -2716,12 +2734,15 @@ redo:
 	 * PFMEMALLOC but right now, we are losing the pfmemalloc
 	 * information when the page leaves the per-cpu allocator
 	 */
-	if (unlikely(!pfmemalloc_match(page, gfpflags))) {
-		deactivate_slab(s, page, c->freelist, c);
-		goto new_slab;
-	}
+	if (unlikely(!pfmemalloc_match_unsafe(page, gfpflags)))
+		goto deactivate_slab;
 
-	/* must check again c->freelist in case of cpu migration or IRQ */
+	/* must check again c->page in case IRQ handler changed it */
+	local_irq_save(flags);
+	if (unlikely(page != c->page)) {
+		local_irq_restore(flags);
+		goto reread_page;
+	}
 	freelist = c->freelist;
 	if (freelist)
 		goto load_freelist;
@@ -2737,6 +2758,9 @@ redo:
 	stat(s, ALLOC_REFILL);
 
 load_freelist:
+
+	lockdep_assert_irqs_disabled();
+
 	/*
 	 * freelist is pointing to the list of objects to be used.
 	 * page is pointing to the page from which the objects are obtained.
@@ -2748,11 +2772,23 @@ load_freelist:
 	local_irq_restore(flags);
 	return freelist;
 
+deactivate_slab:
+
+	local_irq_save(flags);
+	if (page != c->page) {
+		local_irq_restore(flags);
+		goto reread_page;
+	}
+	deactivate_slab(s, page, c->freelist, c);
+
 new_slab:
 
+	lockdep_assert_irqs_disabled();
+
 	if (slub_percpu_partial(c)) {
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
+		local_irq_restore(flags);
 		stat(s, CPU_PARTIAL_ALLOC);
 		goto redo;
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 013/147] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2021-09-08  2:53 ` [patch 012/147] mm, slub: do initial checks in ___slab_alloc() with irqs enabled Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 014/147] mm, slub: restore irqs around calling new_slab() Andrew Morton
                   ` (134 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()

Continue reducing the irq disabled scope.  Check for per-cpu partial slabs
first with irqs enabled, then recheck with irqs disabled before grabbing
the slab page.  Mostly preparatory for the following patches.
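
Condensed from the hunks below, the new shape is roughly (simplified
sketch, later-restructured paths elided):

	new_slab:
		if (slub_percpu_partial(c)) {	/* peeked with irqs enabled */
			local_irq_save(flags);
			if (unlikely(c->page)) {
				local_irq_restore(flags);
				goto reread_page;
			}
			if (unlikely(!slub_percpu_partial(c)))
				goto new_objects; /* stolen by an IRQ handler */

			page = c->page = slub_percpu_partial(c);
			slub_set_percpu_partial(c, page);
			local_irq_restore(flags);
			stat(s, CPU_PARTIAL_ALLOC);
			goto redo;
		}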

Link: https://lkml.kernel.org/r/20210904105003.11688-14-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   34 +++++++++++++++++++++++++---------
 1 file changed, 25 insertions(+), 9 deletions(-)

--- a/mm/slub.c~mm-slub-move-disabling-irqs-closer-to-get_partial-in-___slab_alloc
+++ a/mm/slub.c
@@ -2706,11 +2706,6 @@ reread_page:
 		if (unlikely(node != NUMA_NO_NODE &&
 			     !node_isset(node, slab_nodes)))
 			node = NUMA_NO_NODE;
-		local_irq_save(flags);
-		if (unlikely(c->page)) {
-			local_irq_restore(flags);
-			goto reread_page;
-		}
 		goto new_slab;
 	}
 redo:
@@ -2751,6 +2746,7 @@ redo:
 
 	if (!freelist) {
 		c->page = NULL;
+		local_irq_restore(flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2780,12 +2776,19 @@ deactivate_slab:
 		goto reread_page;
 	}
 	deactivate_slab(s, page, c->freelist, c);
+	local_irq_restore(flags);
 
 new_slab:
 
-	lockdep_assert_irqs_disabled();
-
 	if (slub_percpu_partial(c)) {
+		local_irq_save(flags);
+		if (unlikely(c->page)) {
+			local_irq_restore(flags);
+			goto reread_page;
+		}
+		if (unlikely(!slub_percpu_partial(c)))
+			goto new_objects; /* stolen by an IRQ handler */
+
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
 		local_irq_restore(flags);
@@ -2793,6 +2796,16 @@ new_slab:
 		goto redo;
 	}
 
+	local_irq_save(flags);
+	if (unlikely(c->page)) {
+		local_irq_restore(flags);
+		goto reread_page;
+	}
+
+new_objects:
+
+	lockdep_assert_irqs_disabled();
+
 	freelist = get_partial(s, gfpflags, node, &page);
 	if (freelist) {
 		c->page = page;
@@ -2825,15 +2838,18 @@ new_slab:
 check_new_page:
 
 	if (kmem_cache_debug(s)) {
-		if (!alloc_debug_processing(s, page, freelist, addr))
+		if (!alloc_debug_processing(s, page, freelist, addr)) {
 			/* Slab failed checks. Next slab needed */
+			c->page = NULL;
+			local_irq_restore(flags);
 			goto new_slab;
-		else
+		} else {
 			/*
 			 * For debug case, we don't load freelist so that all
 			 * allocations go through alloc_debug_processing()
 			 */
 			goto return_single;
+		}
 	}
 
 	if (unlikely(!pfmemalloc_match(page, gfpflags)))
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 014/147] mm, slub: restore irqs around calling new_slab()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2021-09-08  2:53 ` [patch 013/147] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 015/147] mm, slub: validate slab from partial list or page allocator before making it cpu slab Andrew Morton
                   ` (133 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: restore irqs around calling new_slab()

allocate_slab() currently re-enables irqs before calling to the page
allocator.  It depends on gfpflags_allow_blocking() to determine if it's
safe to do so.  Now we can instead simply restore irqs before calling it
through new_slab().  The other caller early_kmem_cache_node_alloc() is
unaffected by this.

Link: https://lkml.kernel.org/r/20210904105003.11688-15-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

--- a/mm/slub.c~mm-slub-restore-irqs-around-calling-new_slab
+++ a/mm/slub.c
@@ -1809,9 +1809,6 @@ static struct page *allocate_slab(struct
 
 	flags &= gfp_allowed_mask;
 
-	if (gfpflags_allow_blocking(flags))
-		local_irq_enable();
-
 	flags |= s->allocflags;
 
 	/*
@@ -1870,8 +1867,6 @@ static struct page *allocate_slab(struct
 	page->frozen = 1;
 
 out:
-	if (gfpflags_allow_blocking(flags))
-		local_irq_disable();
 	if (!page)
 		return NULL;
 
@@ -2812,16 +2807,17 @@ new_objects:
 		goto check_new_page;
 	}
 
+	local_irq_restore(flags);
 	put_cpu_ptr(s->cpu_slab);
 	page = new_slab(s, gfpflags, node);
 	c = get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!page)) {
-		local_irq_restore(flags);
 		slab_out_of_memory(s, gfpflags, node);
 		return NULL;
 	}
 
+	local_irq_save(flags);
 	if (c->page)
 		flush_slab(s, c);
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 015/147] mm, slub: validate slab from partial list or page allocator before making it cpu slab
  2021-09-08  2:52 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2021-09-08  2:53 ` [patch 014/147] mm, slub: restore irqs around calling new_slab() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 016/147] mm, slub: check new pages with restored irqs Andrew Morton
                   ` (132 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: validate slab from partial list or page allocator before making it cpu slab

When we obtain a new slab page from node partial list or page allocator,
we assign it to kmem_cache_cpu, perform some checks, and if they fail, we
undo the assignment.

In order to allow doing the checks without irq disabled, restructure the
code so that the checks are done first, and kmem_cache_cpu.page assignment
only after they pass.
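
Condensed, the reordered ___slab_alloc() flow looks roughly like this
(simplified sketch, irq handling elided; the exact code is in the hunks
below):

	freelist = get_partial(s, gfpflags, node, &page);
	/* ... or page = new_slab(s, gfpflags, node); ... */

	check_new_page:
		if (kmem_cache_debug(s) &&
		    !alloc_debug_processing(s, page, freelist, addr))
			goto new_slab;	/* page was never published as c->page */

		if (unlikely(!pfmemalloc_match(page, gfpflags)))
			goto return_single;

		if (unlikely(c->page))
			flush_slab(s, c);
		c->page = page;		/* publish only after the checks pass */
		goto load_freelist;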

Link: https://lkml.kernel.org/r/20210904105003.11688-16-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

--- a/mm/slub.c~mm-slub-validate-slab-from-partial-list-or-page-allocator-before-making-it-cpu-slab
+++ a/mm/slub.c
@@ -2802,10 +2802,8 @@ new_objects:
 	lockdep_assert_irqs_disabled();
 
 	freelist = get_partial(s, gfpflags, node, &page);
-	if (freelist) {
-		c->page = page;
+	if (freelist)
 		goto check_new_page;
-	}
 
 	local_irq_restore(flags);
 	put_cpu_ptr(s->cpu_slab);
@@ -2818,9 +2816,6 @@ new_objects:
 	}
 
 	local_irq_save(flags);
-	if (c->page)
-		flush_slab(s, c);
-
 	/*
 	 * No other reference to the page yet so we can
 	 * muck around with it freely without cmpxchg
@@ -2829,14 +2824,12 @@ new_objects:
 	page->freelist = NULL;
 
 	stat(s, ALLOC_SLAB);
-	c->page = page;
 
 check_new_page:
 
 	if (kmem_cache_debug(s)) {
 		if (!alloc_debug_processing(s, page, freelist, addr)) {
 			/* Slab failed checks. Next slab needed */
-			c->page = NULL;
 			local_irq_restore(flags);
 			goto new_slab;
 		} else {
@@ -2855,10 +2848,18 @@ check_new_page:
 		 */
 		goto return_single;
 
+	if (unlikely(c->page))
+		flush_slab(s, c);
+	c->page = page;
+
 	goto load_freelist;
 
 return_single:
 
+	if (unlikely(c->page))
+		flush_slab(s, c);
+	c->page = page;
+
 	deactivate_slab(s, page, get_freepointer(s, freelist), c);
 	local_irq_restore(flags);
 	return freelist;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 016/147] mm, slub: check new pages with restored irqs
  2021-09-08  2:52 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2021-09-08  2:53 ` [patch 015/147] mm, slub: validate slab from partial list or page allocator before making it cpu slab Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 017/147] mm, slub: stop disabling irqs around get_partial() Andrew Morton
                   ` (131 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: check new pages with restored irqs

Building on top of the previous patch, re-enable irqs before checking new
pages.  alloc_debug_processing() is now called with irqs enabled, so we
need to remove the VM_BUG_ON(!irqs_disabled()) in check_slab(); there
doesn't seem to be a need for it anyway.

Link: https://lkml.kernel.org/r/20210904105003.11688-17-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/slub.c~mm-slub-check-new-pages-with-restored-irqs
+++ a/mm/slub.c
@@ -1009,8 +1009,6 @@ static int check_slab(struct kmem_cache
 {
 	int maxobj;
 
-	VM_BUG_ON(!irqs_disabled());
-
 	if (!PageSlab(page)) {
 		slab_err(s, page, "Not a valid slab page");
 		return 0;
@@ -2802,10 +2800,10 @@ new_objects:
 	lockdep_assert_irqs_disabled();
 
 	freelist = get_partial(s, gfpflags, node, &page);
+	local_irq_restore(flags);
 	if (freelist)
 		goto check_new_page;
 
-	local_irq_restore(flags);
 	put_cpu_ptr(s->cpu_slab);
 	page = new_slab(s, gfpflags, node);
 	c = get_cpu_ptr(s->cpu_slab);
@@ -2815,7 +2813,6 @@ new_objects:
 		return NULL;
 	}
 
-	local_irq_save(flags);
 	/*
 	 * No other reference to the page yet so we can
 	 * muck around with it freely without cmpxchg
@@ -2830,7 +2827,6 @@ check_new_page:
 	if (kmem_cache_debug(s)) {
 		if (!alloc_debug_processing(s, page, freelist, addr)) {
 			/* Slab failed checks. Next slab needed */
-			local_irq_restore(flags);
 			goto new_slab;
 		} else {
 			/*
@@ -2848,6 +2844,7 @@ check_new_page:
 		 */
 		goto return_single;
 
+	local_irq_save(flags);
 	if (unlikely(c->page))
 		flush_slab(s, c);
 	c->page = page;
@@ -2856,6 +2853,7 @@ check_new_page:
 
 return_single:
 
+	local_irq_save(flags);
 	if (unlikely(c->page))
 		flush_slab(s, c);
 	c->page = page;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 017/147] mm, slub: stop disabling irqs around get_partial()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2021-09-08  2:53 ` [patch 016/147] mm, slub: check new pages with restored irqs Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 018/147] mm, slub: move reset of c->page and freelist out of deactivate_slab() Andrew Morton
                   ` (130 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: stop disabling irqs around get_partial()

The function get_partial() does not need to have irqs disabled as a whole.
It's sufficient to convert spin_lock operations to their irq
saving/restoring versions.

As a result, it's now possible to reach the page allocator from the slab
allocator without disabling and re-enabling interrupts on the way.

Link: https://lkml.kernel.org/r/20210904105003.11688-18-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   22 ++++++++--------------
 1 file changed, 8 insertions(+), 14 deletions(-)

--- a/mm/slub.c~mm-slub-stop-disabling-irqs-around-get_partial
+++ a/mm/slub.c
@@ -2010,11 +2010,12 @@ static inline bool pfmemalloc_match(stru
  * Try to allocate a partial slab from a specific node.
  */
 static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
-			      struct page **ret_page, gfp_t flags)
+			      struct page **ret_page, gfp_t gfpflags)
 {
 	struct page *page, *page2;
 	void *object = NULL;
 	unsigned int available = 0;
+	unsigned long flags;
 	int objects;
 
 	/*
@@ -2026,11 +2027,11 @@ static void *get_partial_node(struct kme
 	if (!n || !n->nr_partial)
 		return NULL;
 
-	spin_lock(&n->list_lock);
+	spin_lock_irqsave(&n->list_lock, flags);
 	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
 		void *t;
 
-		if (!pfmemalloc_match(page, flags))
+		if (!pfmemalloc_match(page, gfpflags))
 			continue;
 
 		t = acquire_slab(s, n, page, object == NULL, &objects);
@@ -2051,7 +2052,7 @@ static void *get_partial_node(struct kme
 			break;
 
 	}
-	spin_unlock(&n->list_lock);
+	spin_unlock_irqrestore(&n->list_lock, flags);
 	return object;
 }
 
@@ -2779,8 +2780,10 @@ new_slab:
 			local_irq_restore(flags);
 			goto reread_page;
 		}
-		if (unlikely(!slub_percpu_partial(c)))
+		if (unlikely(!slub_percpu_partial(c))) {
+			local_irq_restore(flags);
 			goto new_objects; /* stolen by an IRQ handler */
+		}
 
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
@@ -2789,18 +2792,9 @@ new_slab:
 		goto redo;
 	}
 
-	local_irq_save(flags);
-	if (unlikely(c->page)) {
-		local_irq_restore(flags);
-		goto reread_page;
-	}
-
 new_objects:
 
-	lockdep_assert_irqs_disabled();
-
 	freelist = get_partial(s, gfpflags, node, &page);
-	local_irq_restore(flags);
 	if (freelist)
 		goto check_new_page;
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 018/147] mm, slub: move reset of c->page and freelist out of deactivate_slab()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2021-09-08  2:53 ` [patch 017/147] mm, slub: stop disabling irqs around get_partial() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:53 ` [patch 019/147] mm, slub: make locking in deactivate_slab() irq-safe Andrew Morton
                   ` (129 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: move reset of c->page and freelist out of deactivate_slab()

deactivate_slab() removes the cpu slab by merging the cpu freelist with
slab's freelist and putting the slab on the proper node's list.  It also
sets the respective kmem_cache_cpu pointers to NULL.

By extracting the kmem_cache_cpu operations from the function, we can
make it independent of disabled irqs.

Also if we return a single free pointer from ___slab_alloc, we no longer
have to assign kmem_cache_cpu.page before deactivation or care if somebody
preempted us and assigned a different page to our kmem_cache_cpu in the
process.

Link: https://lkml.kernel.org/r/20210904105003.11688-19-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

--- a/mm/slub.c~mm-slub-move-reset-of-c-page-and-freelist-out-of-deactivate_slab
+++ a/mm/slub.c
@@ -2209,10 +2209,13 @@ static void init_kmem_cache_cpus(struct
 }
 
 /*
- * Remove the cpu slab
+ * Finishes removing the cpu slab. Merges cpu's freelist with page's freelist,
+ * unfreezes the slabs and puts it on the proper list.
+ * Assumes the slab has been already safely taken away from kmem_cache_cpu
+ * by the caller.
  */
 static void deactivate_slab(struct kmem_cache *s, struct page *page,
-				void *freelist, struct kmem_cache_cpu *c)
+			    void *freelist)
 {
 	enum slab_modes { M_NONE, M_PARTIAL, M_FULL, M_FREE };
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
@@ -2341,9 +2344,6 @@ redo:
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
-
-	c->page = NULL;
-	c->freelist = NULL;
 }
 
 /*
@@ -2468,10 +2468,16 @@ static void put_cpu_partial(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	stat(s, CPUSLAB_FLUSH);
-	deactivate_slab(s, c->page, c->freelist, c);
+	void *freelist = c->freelist;
+	struct page *page = c->page;
 
+	c->page = NULL;
+	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
+
+	deactivate_slab(s, page, freelist);
+
+	stat(s, CPUSLAB_FLUSH);
 }
 
 /*
@@ -2769,7 +2775,10 @@ deactivate_slab:
 		local_irq_restore(flags);
 		goto reread_page;
 	}
-	deactivate_slab(s, page, c->freelist, c);
+	freelist = c->freelist;
+	c->page = NULL;
+	c->freelist = NULL;
+	deactivate_slab(s, page, freelist);
 	local_irq_restore(flags);
 
 new_slab:
@@ -2848,11 +2857,7 @@ check_new_page:
 return_single:
 
 	local_irq_save(flags);
-	if (unlikely(c->page))
-		flush_slab(s, c);
-	c->page = page;
-
-	deactivate_slab(s, page, get_freepointer(s, freelist), c);
+	deactivate_slab(s, page, get_freepointer(s, freelist));
 	local_irq_restore(flags);
 	return freelist;
 }
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 019/147] mm, slub: make locking in deactivate_slab() irq-safe
  2021-09-08  2:52 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2021-09-08  2:53 ` [patch 018/147] mm, slub: move reset of c->page and freelist out of deactivate_slab() Andrew Morton
@ 2021-09-08  2:53 ` Andrew Morton
  2021-09-08  2:54 ` [patch 020/147] mm, slub: call deactivate_slab() without disabling irqs Andrew Morton
                   ` (128 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:53 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: make locking in deactivate_slab() irq-safe

deactivate_slab() now no longer touches the kmem_cache_cpu structure, so
it will be possible to call it with irqs enabled.  Just convert the
spin_lock calls to their irq saving/restoring variants to make it
irq-safe.

Note we now have to use cmpxchg_double_slab() for irq-safe slab_lock(),
because in some situations we don't take the list_lock, which would
disable irqs.
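
The distinction matters only for the fallback path taken when
cmpxchg_double is not available; roughly, the non-__ variant wraps the
slab_lock() section in local_irq_save()/restore(), which the __ variant
(asserting irqs are already off) omits:

	local_irq_save(flags);
	slab_lock(page);
	if (page->freelist == freelist_old &&
	    page->counters == counters_old) {
		page->freelist = freelist_new;
		page->counters = counters_new;
		slab_unlock(page);
		local_irq_restore(flags);
		return true;
	}
	slab_unlock(page);
	local_irq_restore(flags);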

Link: https://lkml.kernel.org/r/20210904105003.11688-20-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/slub.c~mm-slub-make-locking-in-deactivate_slab-irq-safe
+++ a/mm/slub.c
@@ -2223,6 +2223,7 @@ static void deactivate_slab(struct kmem_
 	enum slab_modes l = M_NONE, m = M_NONE;
 	void *nextfree, *freelist_iter, *freelist_tail;
 	int tail = DEACTIVATE_TO_HEAD;
+	unsigned long flags = 0;
 	struct page new;
 	struct page old;
 
@@ -2298,7 +2299,7 @@ redo:
 			 * that acquire_slab() will see a slab page that
 			 * is frozen
 			 */
-			spin_lock(&n->list_lock);
+			spin_lock_irqsave(&n->list_lock, flags);
 		}
 	} else {
 		m = M_FULL;
@@ -2309,7 +2310,7 @@ redo:
 			 * slabs from diagnostic functions will not see
 			 * any frozen slabs.
 			 */
-			spin_lock(&n->list_lock);
+			spin_lock_irqsave(&n->list_lock, flags);
 		}
 	}
 
@@ -2326,14 +2327,14 @@ redo:
 	}
 
 	l = m;
-	if (!__cmpxchg_double_slab(s, page,
+	if (!cmpxchg_double_slab(s, page,
 				old.freelist, old.counters,
 				new.freelist, new.counters,
 				"unfreezing slab"))
 		goto redo;
 
 	if (lock)
-		spin_unlock(&n->list_lock);
+		spin_unlock_irqrestore(&n->list_lock, flags);
 
 	if (m == M_PARTIAL)
 		stat(s, tail);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 020/147] mm, slub: call deactivate_slab() without disabling irqs
  2021-09-08  2:52 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2021-09-08  2:53 ` [patch 019/147] mm, slub: make locking in deactivate_slab() irq-safe Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 021/147] mm, slub: move irq control into unfreeze_partials() Andrew Morton
                   ` (127 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: call deactivate_slab() without disabling irqs

The function is now safe to be called with irqs enabled, so move the calls
outside of irq disabled sections.

When called from ___slab_alloc() -> flush_slab() we have irqs disabled, so
to reenable them before deactivate_slab() we need to open-code
flush_slab() in ___slab_alloc() and reenable irqs after modifying the
kmem_cache_cpu fields.  But that means an IRQ handler meanwhile might have
assigned a new page to kmem_cache_cpu.page so we have to retry the whole
check.

The remaining callers of flush_slab() are the IPI handler which has
disabled irqs anyway, and slub_cpu_dead() which will be dealt with in the
following patch.

Link: https://lkml.kernel.org/r/20210904105003.11688-21-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

--- a/mm/slub.c~mm-slub-call-deactivate_slab-without-disabling-irqs
+++ a/mm/slub.c
@@ -2779,8 +2779,8 @@ deactivate_slab:
 	freelist = c->freelist;
 	c->page = NULL;
 	c->freelist = NULL;
-	deactivate_slab(s, page, freelist);
 	local_irq_restore(flags);
+	deactivate_slab(s, page, freelist);
 
 new_slab:
 
@@ -2848,18 +2848,32 @@ check_new_page:
 		 */
 		goto return_single;
 
+retry_load_page:
+
 	local_irq_save(flags);
-	if (unlikely(c->page))
-		flush_slab(s, c);
+	if (unlikely(c->page)) {
+		void *flush_freelist = c->freelist;
+		struct page *flush_page = c->page;
+
+		c->page = NULL;
+		c->freelist = NULL;
+		c->tid = next_tid(c->tid);
+
+		local_irq_restore(flags);
+
+		deactivate_slab(s, flush_page, flush_freelist);
+
+		stat(s, CPUSLAB_FLUSH);
+
+		goto retry_load_page;
+	}
 	c->page = page;
 
 	goto load_freelist;
 
 return_single:
 
-	local_irq_save(flags);
 	deactivate_slab(s, page, get_freepointer(s, freelist));
-	local_irq_restore(flags);
 	return freelist;
 }
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 021/147] mm, slub: move irq control into unfreeze_partials()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2021-09-08  2:54 ` [patch 020/147] mm, slub: call deactivate_slab() without disabling irqs Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 022/147] mm, slub: discard slabs in unfreeze_partials() without irqs disabled Andrew Morton
                   ` (126 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: move irq control into unfreeze_partials()

unfreeze_partials() can be optimized so that it doesn't need irqs disabled
for the whole time.  As the first step, move irq control into the function
and remove it from the put_cpu_partial() caller.

Link: https://lkml.kernel.org/r/20210904105003.11688-22-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- a/mm/slub.c~mm-slub-move-irq-control-into-unfreeze_partials
+++ a/mm/slub.c
@@ -2350,9 +2350,8 @@ redo:
 /*
  * Unfreeze all the cpu partial slabs.
  *
- * This function must be called with interrupts disabled
- * for the cpu using c (or some other guarantee must be there
- * to guarantee no concurrent accesses).
+ * This function must be called with preemption or migration
+ * disabled with c local to the cpu.
  */
 static void unfreeze_partials(struct kmem_cache *s,
 		struct kmem_cache_cpu *c)
@@ -2360,6 +2359,9 @@ static void unfreeze_partials(struct kme
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
 	struct page *page, *discard_page = NULL;
+	unsigned long flags;
+
+	local_irq_save(flags);
 
 	while ((page = slub_percpu_partial(c))) {
 		struct page new;
@@ -2412,6 +2414,8 @@ static void unfreeze_partials(struct kme
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
+
+	local_irq_restore(flags);
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
@@ -2439,14 +2443,11 @@ static void put_cpu_partial(struct kmem_
 			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
 			if (drain && pobjects > slub_cpu_partial(s)) {
-				unsigned long flags;
 				/*
 				 * partial array is full. Move the existing
 				 * set to the per node partial list.
 				 */
-				local_irq_save(flags);
 				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
-				local_irq_restore(flags);
 				oldpage = NULL;
 				pobjects = 0;
 				pages = 0;
_

* [patch 022/147] mm, slub: discard slabs in unfreeze_partials() without irqs disabled
  2021-09-08  2:52 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2021-09-08  2:54 ` [patch 021/147] mm, slub: move irq control into unfreeze_partials() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 023/147] mm, slub: detach whole partial list at once in unfreeze_partials() Andrew Morton
                   ` (125 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: discard slabs in unfreeze_partials() without irqs disabled

No need for disabled irqs when discarding slabs, so restore them before
discarding.

Link: https://lkml.kernel.org/r/20210904105003.11688-23-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-discard-slabs-in-unfreeze_partials-without-irqs-disabled
+++ a/mm/slub.c
@@ -2406,6 +2406,8 @@ static void unfreeze_partials(struct kme
 	if (n)
 		spin_unlock(&n->list_lock);
 
+	local_irq_restore(flags);
+
 	while (discard_page) {
 		page = discard_page;
 		discard_page = discard_page->next;
@@ -2415,7 +2417,6 @@ static void unfreeze_partials(struct kme
 		stat(s, FREE_SLAB);
 	}
 
-	local_irq_restore(flags);
 #endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
_

* [patch 023/147] mm, slub: detach whole partial list at once in unfreeze_partials()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2021-09-08  2:54 ` [patch 022/147] mm, slub: discard slabs in unfreeze_partials() without irqs disabled Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 024/147] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing Andrew Morton
                   ` (124 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: detach whole partial list at once in unfreeze_partials()

Instead of iterating through the live percpu partial list, detach it from
the kmem_cache_cpu at once.  This is simpler and will allow further
optimization.

Link: https://lkml.kernel.org/r/20210904105003.11688-24-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

--- a/mm/slub.c~mm-slub-detach-whole-partial-list-at-once-in-unfreeze_partials
+++ a/mm/slub.c
@@ -2358,16 +2358,20 @@ static void unfreeze_partials(struct kme
 {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
-	struct page *page, *discard_page = NULL;
+	struct page *page, *partial_page, *discard_page = NULL;
 	unsigned long flags;
 
 	local_irq_save(flags);
 
-	while ((page = slub_percpu_partial(c))) {
+	partial_page = slub_percpu_partial(c);
+	c->partial = NULL;
+
+	while (partial_page) {
 		struct page new;
 		struct page old;
 
-		slub_set_percpu_partial(c, page);
+		page = partial_page;
+		partial_page = page->next;
 
 		n2 = get_node(s, page_to_nid(page));
 		if (n != n2) {
_

* [patch 024/147] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
  2021-09-08  2:52 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2021-09-08  2:54 ` [patch 023/147] mm, slub: detach whole partial list at once in unfreeze_partials() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 025/147] mm, slub: only disable irq with spin_lock in __unfreeze_partials() Andrew Morton
                   ` (123 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing

Unfreezing the partial list can be split into two phases - detaching the
list from struct kmem_cache_cpu, and processing the list.  The whole
operation does not need to be protected by disabled irqs.  Restructure the
code to separate the detaching (with disabled irqs) and unfreezing (with
irq disabling to be reduced in the next patch).

Also, unfreeze_partials() can be called from another cpu on behalf of a
cpu that is being offlined, where disabling irqs on the local cpu makes no
sense, so restructure the code as follows (a distilled sketch follows the
list):

- __unfreeze_partials() is the bulk of unfreeze_partials() that processes the
  detached percpu partial list
- unfreeze_partials() detaches list from current cpu with irqs disabled and
  calls __unfreeze_partials()
- unfreeze_partials_cpu() is to be called for the offlined cpu so it needs no
  irq disabling, and is called from __flush_cpu_slab()
- flush_cpu_slab() is for the local cpu, thus it needs to call
  unfreeze_partials().  So it can't simply call
  __flush_cpu_slab(smp_processor_id()) anymore and we have to open-code the
  proper calls.
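
A distilled sketch of the resulting split (the full change is in the hunk
below):

	/* processes a detached list; makes no irq assumptions of its own */
	static void __unfreeze_partials(struct kmem_cache *s,
					struct page *partial_page);

	/* local cpu: detach with irqs disabled, process with irqs enabled */
	static void unfreeze_partials(struct kmem_cache *s)
	{
		struct page *partial_page;
		unsigned long flags;

		local_irq_save(flags);
		partial_page = this_cpu_read(s->cpu_slab->partial);
		this_cpu_write(s->cpu_slab->partial, NULL);
		local_irq_restore(flags);

		if (partial_page)
			__unfreeze_partials(s, partial_page);
	}

	/* offlined cpu: nobody else can touch c, no irq disabling needed */
	static void unfreeze_partials_cpu(struct kmem_cache *s,
					  struct kmem_cache_cpu *c)
	{
		struct page *partial_page = slub_percpu_partial(c);

		c->partial = NULL;
		if (partial_page)
			__unfreeze_partials(s, partial_page);
	}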

Link: https://lkml.kernel.org/r/20210904105003.11688-25-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   73 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 22 deletions(-)

--- a/mm/slub.c~mm-slub-separate-detaching-of-partial-list-in-unfreeze_partials-from-unfreezing
+++ a/mm/slub.c
@@ -2347,25 +2347,15 @@ redo:
 	}
 }
 
-/*
- * Unfreeze all the cpu partial slabs.
- *
- * This function must be called with preemption or migration
- * disabled with c local to the cpu.
- */
-static void unfreeze_partials(struct kmem_cache *s,
-		struct kmem_cache_cpu *c)
-{
 #ifdef CONFIG_SLUB_CPU_PARTIAL
+static void __unfreeze_partials(struct kmem_cache *s, struct page *partial_page)
+{
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
-	struct page *page, *partial_page, *discard_page = NULL;
+	struct page *page, *discard_page = NULL;
 	unsigned long flags;
 
 	local_irq_save(flags);
 
-	partial_page = slub_percpu_partial(c);
-	c->partial = NULL;
-
 	while (partial_page) {
 		struct page new;
 		struct page old;
@@ -2420,10 +2410,45 @@ static void unfreeze_partials(struct kme
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
+}
 
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+/*
+ * Unfreeze all the cpu partial slabs.
+ */
+static void unfreeze_partials(struct kmem_cache *s)
+{
+	struct page *partial_page;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	partial_page = this_cpu_read(s->cpu_slab->partial);
+	this_cpu_write(s->cpu_slab->partial, NULL);
+	local_irq_restore(flags);
+
+	if (partial_page)
+		__unfreeze_partials(s, partial_page);
+}
+
+static void unfreeze_partials_cpu(struct kmem_cache *s,
+				  struct kmem_cache_cpu *c)
+{
+	struct page *partial_page;
+
+	partial_page = slub_percpu_partial(c);
+	c->partial = NULL;
+
+	if (partial_page)
+		__unfreeze_partials(s, partial_page);
 }
 
+#else	/* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+				  struct kmem_cache_cpu *c) { }
+
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+
 /*
  * Put a page that was just frozen (in __slab_free|get_partial_node) into a
  * partial page slot if available.
@@ -2452,7 +2477,7 @@ static void put_cpu_partial(struct kmem_
 				 * partial array is full. Move the existing
 				 * set to the per node partial list.
 				 */
-				unfreeze_partials(s, this_cpu_ptr(s->cpu_slab));
+				unfreeze_partials(s);
 				oldpage = NULL;
 				pobjects = 0;
 				pages = 0;
@@ -2487,11 +2512,6 @@ static inline void flush_slab(struct kme
 	stat(s, CPUSLAB_FLUSH);
 }
 
-/*
- * Flush cpu slab.
- *
- * Called from IPI handler with interrupts disabled.
- */
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
@@ -2499,14 +2519,23 @@ static inline void __flush_cpu_slab(stru
 	if (c->page)
 		flush_slab(s, c);
 
-	unfreeze_partials(s, c);
+	unfreeze_partials_cpu(s, c);
 }
 
+/*
+ * Flush cpu slab.
+ *
+ * Called from IPI handler with interrupts disabled.
+ */
 static void flush_cpu_slab(void *d)
 {
 	struct kmem_cache *s = d;
+	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+
+	if (c->page)
+		flush_slab(s, c);
 
-	__flush_cpu_slab(s, smp_processor_id());
+	unfreeze_partials(s);
 }
 
 static bool has_cpu_slab(int cpu, void *info)
_

* [patch 025/147] mm, slub: only disable irq with spin_lock in __unfreeze_partials()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2021-09-08  2:54 ` [patch 024/147] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 026/147] mm, slub: don't disable irqs in slub_cpu_dead() Andrew Morton
                   ` (122 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: only disable irq with spin_lock in __unfreeze_partials()

__unfreeze_partials() no longer needs to have irqs disabled, except for
making the spin_lock operations irq-safe, so convert the spin_lock
operations and remove the separate irq handling.

Link: https://lkml.kernel.org/r/20210904105003.11688-26-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

--- a/mm/slub.c~mm-slub-only-disable-irq-with-spin_lock-in-__unfreeze_partials
+++ a/mm/slub.c
@@ -2352,9 +2352,7 @@ static void __unfreeze_partials(struct k
 {
 	struct kmem_cache_node *n = NULL, *n2 = NULL;
 	struct page *page, *discard_page = NULL;
-	unsigned long flags;
-
-	local_irq_save(flags);
+	unsigned long flags = 0;
 
 	while (partial_page) {
 		struct page new;
@@ -2366,10 +2364,10 @@ static void __unfreeze_partials(struct k
 		n2 = get_node(s, page_to_nid(page));
 		if (n != n2) {
 			if (n)
-				spin_unlock(&n->list_lock);
+				spin_unlock_irqrestore(&n->list_lock, flags);
 
 			n = n2;
-			spin_lock(&n->list_lock);
+			spin_lock_irqsave(&n->list_lock, flags);
 		}
 
 		do {
@@ -2398,9 +2396,7 @@ static void __unfreeze_partials(struct k
 	}
 
 	if (n)
-		spin_unlock(&n->list_lock);
-
-	local_irq_restore(flags);
+		spin_unlock_irqrestore(&n->list_lock, flags);
 
 	while (discard_page) {
 		page = discard_page;
_

* [patch 026/147] mm, slub: don't disable irqs in slub_cpu_dead()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2021-09-08  2:54 ` [patch 025/147] mm, slub: only disable irq with spin_lock in __unfreeze_partials() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 027/147] mm, slab: split out the cpu offline variant of flush_slab() Andrew Morton
                   ` (121 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: don't disable irqs in slub_cpu_dead()

slub_cpu_dead() cleans up for an offlined cpu from another cpu and calls
only functions that are now irq safe, so we don't need to disable irqs
anymore.

Link: https://lkml.kernel.org/r/20210904105003.11688-27-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/mm/slub.c~mm-slub-dont-disable-irqs-in-slub_cpu_dead
+++ a/mm/slub.c
@@ -2554,14 +2554,10 @@ static void flush_all(struct kmem_cache
 static int slub_cpu_dead(unsigned int cpu)
 {
 	struct kmem_cache *s;
-	unsigned long flags;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list) {
-		local_irq_save(flags);
+	list_for_each_entry(s, &slab_caches, list)
 		__flush_cpu_slab(s, cpu);
-		local_irq_restore(flags);
-	}
 	mutex_unlock(&slab_mutex);
 	return 0;
 }
_

* [patch 027/147] mm, slab: split out the cpu offline variant of flush_slab()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2021-09-08  2:54 ` [patch 026/147] mm, slub: don't disable irqs in slub_cpu_dead() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 028/147] mm: slub: move flush_cpu_slab() and __free_slab() invocations out of IRQ context Andrew Morton
                   ` (120 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slab: split out the cpu offline variant of flush_slab()

flush_slab() is called either as part of an IPI handler on a given live
cpu, or as a cleanup on behalf of another cpu that went offline.  The
first case needs to protect updating the kmem_cache_cpu fields with
disabled irqs.  Currently the whole call happens with irqs disabled by the
IPI handler, but the following patch will change from IPI to workqueue,
and flush_slab() will have to disable irqs (to be replaced with a local
lock later) in the critical part.

To prepare for this change, replace the call to flush_slab() for the dead
cpu handling with an open-coded variant that will neither disable irqs nor
take a local lock.

Link: https://lkml.kernel.org/r/20210904105003.11688-28-vbabka@suse.cz
Suggested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

--- a/mm/slub.c~mm-slab-split-out-the-cpu-offline-variant-of-flush_slab
+++ a/mm/slub.c
@@ -2511,9 +2511,17 @@ static inline void flush_slab(struct kme
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
 {
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
+	void *freelist = c->freelist;
+	struct page *page = c->page;
 
-	if (c->page)
-		flush_slab(s, c);
+	c->page = NULL;
+	c->freelist = NULL;
+	c->tid = next_tid(c->tid);
+
+	if (page) {
+		deactivate_slab(s, page, freelist);
+		stat(s, CPUSLAB_FLUSH);
+	}
 
 	unfreeze_partials_cpu(s, c);
 }
_

* [patch 028/147] mm: slub: move flush_cpu_slab() and __free_slab() invocations out of IRQ context
  2021-09-08  2:52 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2021-09-08  2:54 ` [patch 027/147] mm, slab: split out the cpu offline variant of flush_slab() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 029/147] mm: slub: make object_map_lock a raw_spinlock_t Andrew Morton
                   ` (119 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: mm: slub: move flush_cpu_slab() and __free_slab() invocations out of IRQ context

flush_all() flushes a specific SLAB cache on each CPU (where the cache is
present).  The deactivate_slab()/__free_slab() invocation happens within
the IPI handler and is problematic for PREEMPT_RT.

The flush operation is not a frequent operation or a hot path.  The
per-CPU flush operation can be moved to within a workqueue.

Because a workqueue handler, unlike an IPI handler, does not disable irqs,
flush_slab() now has to disable them for working with the kmem_cache_cpu
fields.  deactivate_slab() is safe to call with irqs enabled.
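
The flush then follows the common queue-work-on-each-cpu-and-wait pattern;
a distilled sketch of the new flush_all_cpus_locked() (the per-cpu "skip"
handling for CPUs with nothing to flush is omitted here):

	static DEFINE_MUTEX(flush_lock);
	static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);

	static void flush_all_cpus_locked(struct kmem_cache *s)
	{
		struct slub_flush_work *sfw;
		unsigned int cpu;

		lockdep_assert_cpus_held();
		mutex_lock(&flush_lock);

		for_each_online_cpu(cpu) {
			sfw = &per_cpu(slub_flush, cpu);
			INIT_WORK(&sfw->work, flush_cpu_slab);
			sfw->s = s;
			schedule_work_on(cpu, &sfw->work);	/* queue */
		}

		for_each_online_cpu(cpu)
			flush_work(&per_cpu(slub_flush, cpu).work);	/* wait */

		mutex_unlock(&flush_lock);
	}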

[vbabka@suse.cz: adapt to new SLUB changes]
Link: https://lkml.kernel.org/r/20210904105003.11688-29-vbabka@suse.cz
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab_common.c |    2 
 mm/slub.c        |   94 +++++++++++++++++++++++++++++++++++++--------
 2 files changed, 80 insertions(+), 16 deletions(-)

--- a/mm/slab_common.c~mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context
+++ a/mm/slab_common.c
@@ -502,6 +502,7 @@ void kmem_cache_destroy(struct kmem_cach
 	if (unlikely(!s))
 		return;
 
+	cpus_read_lock();
 	mutex_lock(&slab_mutex);
 
 	s->refcount--;
@@ -516,6 +517,7 @@ void kmem_cache_destroy(struct kmem_cach
 	}
 out_unlock:
 	mutex_unlock(&slab_mutex);
+	cpus_read_unlock();
 }
 EXPORT_SYMBOL(kmem_cache_destroy);
 
--- a/mm/slub.c~mm-slub-move-flush_cpu_slab-invocations-__free_slab-invocations-out-of-irq-context
+++ a/mm/slub.c
@@ -2496,16 +2496,25 @@ static void put_cpu_partial(struct kmem_
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
-	void *freelist = c->freelist;
-	struct page *page = c->page;
+	unsigned long flags;
+	struct page *page;
+	void *freelist;
+
+	local_irq_save(flags);
+
+	page = c->page;
+	freelist = c->freelist;
 
 	c->page = NULL;
 	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
 
-	deactivate_slab(s, page, freelist);
+	local_irq_restore(flags);
 
-	stat(s, CPUSLAB_FLUSH);
+	if (page) {
+		deactivate_slab(s, page, freelist);
+		stat(s, CPUSLAB_FLUSH);
+	}
 }
 
 static inline void __flush_cpu_slab(struct kmem_cache *s, int cpu)
@@ -2526,15 +2535,27 @@ static inline void __flush_cpu_slab(stru
 	unfreeze_partials_cpu(s, c);
 }
 
+struct slub_flush_work {
+	struct work_struct work;
+	struct kmem_cache *s;
+	bool skip;
+};
+
 /*
  * Flush cpu slab.
  *
- * Called from IPI handler with interrupts disabled.
+ * Called from CPU work handler with migration disabled.
  */
-static void flush_cpu_slab(void *d)
+static void flush_cpu_slab(struct work_struct *w)
 {
-	struct kmem_cache *s = d;
-	struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);
+	struct kmem_cache *s;
+	struct kmem_cache_cpu *c;
+	struct slub_flush_work *sfw;
+
+	sfw = container_of(w, struct slub_flush_work, work);
+
+	s = sfw->s;
+	c = this_cpu_ptr(s->cpu_slab);
 
 	if (c->page)
 		flush_slab(s, c);
@@ -2542,17 +2563,51 @@ static void flush_cpu_slab(void *d)
 	unfreeze_partials(s);
 }
 
-static bool has_cpu_slab(int cpu, void *info)
+static bool has_cpu_slab(int cpu, struct kmem_cache *s)
 {
-	struct kmem_cache *s = info;
 	struct kmem_cache_cpu *c = per_cpu_ptr(s->cpu_slab, cpu);
 
 	return c->page || slub_percpu_partial(c);
 }
 
+static DEFINE_MUTEX(flush_lock);
+static DEFINE_PER_CPU(struct slub_flush_work, slub_flush);
+
+static void flush_all_cpus_locked(struct kmem_cache *s)
+{
+	struct slub_flush_work *sfw;
+	unsigned int cpu;
+
+	lockdep_assert_cpus_held();
+	mutex_lock(&flush_lock);
+
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		if (!has_cpu_slab(cpu, s)) {
+			sfw->skip = true;
+			continue;
+		}
+		INIT_WORK(&sfw->work, flush_cpu_slab);
+		sfw->skip = false;
+		sfw->s = s;
+		schedule_work_on(cpu, &sfw->work);
+	}
+
+	for_each_online_cpu(cpu) {
+		sfw = &per_cpu(slub_flush, cpu);
+		if (sfw->skip)
+			continue;
+		flush_work(&sfw->work);
+	}
+
+	mutex_unlock(&flush_lock);
+}
+
 static void flush_all(struct kmem_cache *s)
 {
-	on_each_cpu_cond(has_cpu_slab, flush_cpu_slab, s, 1);
+	cpus_read_lock();
+	flush_all_cpus_locked(s);
+	cpus_read_unlock();
 }
 
 /*
@@ -4097,7 +4152,7 @@ int __kmem_cache_shutdown(struct kmem_ca
 	int node;
 	struct kmem_cache_node *n;
 
-	flush_all(s);
+	flush_all_cpus_locked(s);
 	/* Attempt to free all objects */
 	for_each_kmem_cache_node(s, node, n) {
 		free_partial(s, n);
@@ -4373,7 +4428,7 @@ EXPORT_SYMBOL(kfree);
  * being allocated from last increasing the chance that the last objects
  * are freed in them.
  */
-int __kmem_cache_shrink(struct kmem_cache *s)
+static int __kmem_cache_do_shrink(struct kmem_cache *s)
 {
 	int node;
 	int i;
@@ -4385,7 +4440,6 @@ int __kmem_cache_shrink(struct kmem_cach
 	unsigned long flags;
 	int ret = 0;
 
-	flush_all(s);
 	for_each_kmem_cache_node(s, node, n) {
 		INIT_LIST_HEAD(&discard);
 		for (i = 0; i < SHRINK_PROMOTE_MAX; i++)
@@ -4435,13 +4489,21 @@ int __kmem_cache_shrink(struct kmem_cach
 	return ret;
 }
 
+int __kmem_cache_shrink(struct kmem_cache *s)
+{
+	flush_all(s);
+	return __kmem_cache_do_shrink(s);
+}
+
 static int slab_mem_going_offline_callback(void *arg)
 {
 	struct kmem_cache *s;
 
 	mutex_lock(&slab_mutex);
-	list_for_each_entry(s, &slab_caches, list)
-		__kmem_cache_shrink(s);
+	list_for_each_entry(s, &slab_caches, list) {
+		flush_all_cpus_locked(s);
+		__kmem_cache_do_shrink(s);
+	}
 	mutex_unlock(&slab_mutex);
 
 	return 0;
_

* [patch 029/147] mm: slub: make object_map_lock a raw_spinlock_t
  2021-09-08  2:52 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2021-09-08  2:54 ` [patch 028/147] mm: slub: move flush_cpu_slab() and __free_slab() invocations out of IRQ context Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 030/147] mm, slub: make slab_lock() disable irqs with PREEMPT_RT Andrew Morton
                   ` (118 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: mm: slub: make object_map_lock a raw_spinlock_t

The variable object_map is protected by object_map_lock.  The lock is
always acquired in debug code and within an already atomic context, where
a sleeping lock cannot be used (and on PREEMPT_RT a plain spinlock_t
becomes a sleeping lock).

Make object_map_lock a raw_spinlock_t.
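
For background (not part of the patch): a raw_spinlock_t keeps the non-RT
spinning semantics on PREEMPT_RT and may therefore be taken in atomic
context, e.g.:

	static DEFINE_RAW_SPINLOCK(object_map_lock);	/* spins on RT too */

	/* fine even when irqs/preemption are already disabled: */
	raw_spin_lock(&object_map_lock);
	__fill_map(object_map, s, page);
	raw_spin_unlock(&object_map_lock);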

Link: https://lkml.kernel.org/r/20210904105003.11688-30-vbabka@suse.cz
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/slub.c~mm-slub-make-object_map_lock-a-raw_spinlock_t
+++ a/mm/slub.c
@@ -452,7 +452,7 @@ static inline bool cmpxchg_double_slab(s
 
 #ifdef CONFIG_SLUB_DEBUG
 static unsigned long object_map[BITS_TO_LONGS(MAX_OBJS_PER_PAGE)];
-static DEFINE_SPINLOCK(object_map_lock);
+static DEFINE_RAW_SPINLOCK(object_map_lock);
 
 static void __fill_map(unsigned long *obj_map, struct kmem_cache *s,
 		       struct page *page)
@@ -497,7 +497,7 @@ static unsigned long *get_map(struct kme
 {
 	VM_BUG_ON(!irqs_disabled());
 
-	spin_lock(&object_map_lock);
+	raw_spin_lock(&object_map_lock);
 
 	__fill_map(object_map, s, page);
 
@@ -507,7 +507,7 @@ static unsigned long *get_map(struct kme
 static void put_map(unsigned long *map) __releases(&object_map_lock)
 {
 	VM_BUG_ON(map != object_map);
-	spin_unlock(&object_map_lock);
+	raw_spin_unlock(&object_map_lock);
 }
 
 static inline unsigned int size_from_object(struct kmem_cache *s)
_

* [patch 030/147] mm, slub: make slab_lock() disable irqs with PREEMPT_RT
  2021-09-08  2:52 incoming Andrew Morton
                   ` (28 preceding siblings ...)
  2021-09-08  2:54 ` [patch 029/147] mm: slub: make object_map_lock a raw_spinlock_t Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg Andrew Morton
                   ` (117 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: make slab_lock() disable irqs with PREEMPT_RT

We need to disable irqs around slab_lock() (a bit spinlock) to make it
irq-safe.  Most calls to slab_lock() are nested under spin_lock_irqsave()
which doesn't disable irqs on PREEMPT_RT, so add explicit irq disabling on
PREEMPT_RT.  The exception is cmpxchg_double_slab() which already disables
irqs, so use a __slab_[un]lock() variant without irq disable there.

slab_[un]lock() thus needs a flags pointer parameter, which is unused on
!RT.  free_debug_processing() now has two flags variables, which looks
odd, but only one is actually used - the one used in spin_lock_irqsave()
on !RT and the one used in slab_lock() on RT.

As a result, __cmpxchg_double_slab() and cmpxchg_double_slab() become
effectively identical on RT, as both will disable irqs, which is necessary
on RT as most callers of this function also rely on irqsaving lock
operations.  Thus, assert that irqs are already disabled in
__cmpxchg_double_slab() only on !RT and also change the VM_BUG_ON
assertion to the more standard lockdep_assert one.
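
Usage thus becomes (a sketch distilled from the hunks below):

	unsigned long flags = 0;	/* init to 0 to avoid spurious warnings */

	slab_lock(page, &flags);	/* also local_irq_save() on PREEMPT_RT */
	/* ... examine or update page->freelist / page->counters ... */
	slab_unlock(page, &flags);	/* also local_irq_restore() on PREEMPT_RT */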

Link: https://lkml.kernel.org/r/20210904105003.11688-31-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   58 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 41 insertions(+), 17 deletions(-)

--- a/mm/slub.c~mm-slub-make-slab_lock-disable-irqs-with-preempt_rt
+++ a/mm/slub.c
@@ -359,25 +359,44 @@ static inline unsigned int oo_objects(st
 /*
  * Per slab locking using the pagelock
  */
-static __always_inline void slab_lock(struct page *page)
+static __always_inline void __slab_lock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	bit_spin_lock(PG_locked, &page->flags);
 }
 
-static __always_inline void slab_unlock(struct page *page)
+static __always_inline void __slab_unlock(struct page *page)
 {
 	VM_BUG_ON_PAGE(PageTail(page), page);
 	__bit_spin_unlock(PG_locked, &page->flags);
 }
 
-/* Interrupts must be disabled (for the fallback code to work right) */
+static __always_inline void slab_lock(struct page *page, unsigned long *flags)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_save(*flags);
+	__slab_lock(page);
+}
+
+static __always_inline void slab_unlock(struct page *page, unsigned long *flags)
+{
+	__slab_unlock(page);
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		local_irq_restore(*flags);
+}
+
+/*
+ * Interrupts must be disabled (for the fallback code to work right), typically
+ * by an _irqsave() lock variant. Except on PREEMPT_RT where locks are different
+ * so we disable interrupts as part of slab_[un]lock().
+ */
 static inline bool __cmpxchg_double_slab(struct kmem_cache *s, struct page *page,
 		void *freelist_old, unsigned long counters_old,
 		void *freelist_new, unsigned long counters_new,
 		const char *n)
 {
-	VM_BUG_ON(!irqs_disabled());
+	if (!IS_ENABLED(CONFIG_PREEMPT_RT))
+		lockdep_assert_irqs_disabled();
 #if defined(CONFIG_HAVE_CMPXCHG_DOUBLE) && \
     defined(CONFIG_HAVE_ALIGNED_STRUCT_PAGE)
 	if (s->flags & __CMPXCHG_DOUBLE) {
@@ -388,15 +407,18 @@ static inline bool __cmpxchg_double_slab
 	} else
 #endif
 	{
-		slab_lock(page);
+		/* init to 0 to prevent spurious warnings */
+		unsigned long flags = 0;
+
+		slab_lock(page, &flags);
 		if (page->freelist == freelist_old &&
 					page->counters == counters_old) {
 			page->freelist = freelist_new;
 			page->counters = counters_new;
-			slab_unlock(page);
+			slab_unlock(page, &flags);
 			return true;
 		}
-		slab_unlock(page);
+		slab_unlock(page, &flags);
 	}
 
 	cpu_relax();
@@ -427,16 +449,16 @@ static inline bool cmpxchg_double_slab(s
 		unsigned long flags;
 
 		local_irq_save(flags);
-		slab_lock(page);
+		__slab_lock(page);
 		if (page->freelist == freelist_old &&
 					page->counters == counters_old) {
 			page->freelist = freelist_new;
 			page->counters = counters_new;
-			slab_unlock(page);
+			__slab_unlock(page);
 			local_irq_restore(flags);
 			return true;
 		}
-		slab_unlock(page);
+		__slab_unlock(page);
 		local_irq_restore(flags);
 	}
 
@@ -1269,11 +1291,11 @@ static noinline int free_debug_processin
 	struct kmem_cache_node *n = get_node(s, page_to_nid(page));
 	void *object = head;
 	int cnt = 0;
-	unsigned long flags;
+	unsigned long flags, flags2;
 	int ret = 0;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	slab_lock(page);
+	slab_lock(page, &flags2);
 
 	if (s->flags & SLAB_CONSISTENCY_CHECKS) {
 		if (!check_slab(s, page))
@@ -1306,7 +1328,7 @@ out:
 		slab_err(s, page, "Bulk freelist count(%d) invalid(%d)\n",
 			 bulk_cnt, cnt);
 
-	slab_unlock(page);
+	slab_unlock(page, &flags2);
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	if (!ret)
 		slab_fix(s, "Object at 0x%p not freed", object);
@@ -4087,11 +4109,12 @@ static void list_slab_objects(struct kme
 {
 #ifdef CONFIG_SLUB_DEBUG
 	void *addr = page_address(page);
+	unsigned long flags;
 	unsigned long *map;
 	void *p;
 
 	slab_err(s, page, text, s->name);
-	slab_lock(page);
+	slab_lock(page, &flags);
 
 	map = get_map(s, page);
 	for_each_object(p, s, addr, page->objects) {
@@ -4102,7 +4125,7 @@ static void list_slab_objects(struct kme
 		}
 	}
 	put_map(map);
-	slab_unlock(page);
+	slab_unlock(page, &flags);
 #endif
 }
 
@@ -4834,8 +4857,9 @@ static void validate_slab(struct kmem_ca
 {
 	void *p;
 	void *addr = page_address(page);
+	unsigned long flags;
 
-	slab_lock(page);
+	slab_lock(page, &flags);
 
 	if (!check_slab(s, page) || !on_freelist(s, page, NULL))
 		goto unlock;
@@ -4850,7 +4874,7 @@ static void validate_slab(struct kmem_ca
 			break;
 	}
 unlock:
-	slab_unlock(page);
+	slab_unlock(page, &flags);
 }
 
 static int validate_slab_node(struct kmem_cache *s,
_

* [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08  2:52 incoming Andrew Morton
                   ` (29 preceding siblings ...)
  2021-09-08  2:54 ` [patch 030/147] mm, slub: make slab_lock() disable irqs with PREEMPT_RT Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08 13:05   ` Jesper Dangaard Brouer
  2021-09-08  2:54 ` [patch 032/147] mm, slub: use migrate_disable() on PREEMPT_RT Andrew Morton
                   ` (116 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg

Jann Horn reported [1] the following theoretically possible race:

  task A: put_cpu_partial() calls preempt_disable()
  task A: oldpage = this_cpu_read(s->cpu_slab->partial)
  interrupt: kfree() reaches unfreeze_partials() and discards the page
  task B (on another CPU): reallocates page as page cache
  task A: reads page->pages and page->pobjects, which are actually
  halves of the pointer page->lru.prev
  task B (on another CPU): frees page
  interrupt: allocates page as SLUB page and places it on the percpu partial list
  task A: this_cpu_cmpxchg() succeeds

  which would cause page->pages and page->pobjects to end up containing
  halves of pointers that would then influence when put_cpu_partial()
  happens and show up in root-only sysfs files. Maybe that's acceptable,
  I don't know. But there should probably at least be a comment for now
  to point out that we're reading union fields of a page that might be
  in a completely different state.

Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
latter disables irqs, otherwise a __slab_free() in an irq handler could
call put_cpu_partial() in the middle of ___slab_alloc() manipulating
->partial and corrupt it.  This becomes an issue on RT after a local_lock
is introduced in later patch.  The fix means taking the local_lock also in
put_cpu_partial() on RT.

After debugging this issue, Mike Galbraith suggested [2] that to avoid
different locking schemes on RT and !RT, we can just protect
put_cpu_partial() with disabled irqs (to be converted to
local_lock_irqsave() later) everywhere.  This should be acceptable as it's
not a fast path, and moving the actual partial unfreezing outside of the
irq disabled section makes it short, and with the retry loop gone the code
can be also simplified.  In addition, the race reported by Jann should no
longer be possible.

[1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/
[2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/
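
Distilled from the hunk below (pages/pobjects accounting omitted), the
cmpxchg retry loop becomes a plain irq-disabled section, with the
expensive unfreezing deferred until after it:

	local_irq_save(flags);
	oldpage = this_cpu_read(s->cpu_slab->partial);
	if (oldpage && drain && oldpage->pobjects > slub_cpu_partial(s)) {
		/* array full: only detach here, unfreeze later */
		page_to_unfreeze = oldpage;
		oldpage = NULL;
	}
	page->next = oldpage;
	this_cpu_write(s->cpu_slab->partial, page);
	local_irq_restore(flags);

	if (page_to_unfreeze)
		__unfreeze_partials(s, page_to_unfreeze);	/* irqs on */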

Link: https://lkml.kernel.org/r/20210904105003.11688-32-vbabka@suse.cz
Reported-by: Jann Horn <jannh@google.com>
Suggested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   83 ++++++++++++++++++++++++++++------------------------
 1 file changed, 45 insertions(+), 38 deletions(-)

--- a/mm/slub.c~mm-slub-protect-put_cpu_partial-with-disabled-irqs-instead-of-cmpxchg
+++ a/mm/slub.c
@@ -2025,7 +2025,12 @@ static inline void *acquire_slab(struct
 	return freelist;
 }
 
+#ifdef CONFIG_SLUB_CPU_PARTIAL
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
+#else
+static inline void put_cpu_partial(struct kmem_cache *s, struct page *page,
+				   int drain) { }
+#endif
 static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
 
 /*
@@ -2459,14 +2464,6 @@ static void unfreeze_partials_cpu(struct
 		__unfreeze_partials(s, partial_page);
 }
 
-#else	/* CONFIG_SLUB_CPU_PARTIAL */
-
-static inline void unfreeze_partials(struct kmem_cache *s) { }
-static inline void unfreeze_partials_cpu(struct kmem_cache *s,
-				  struct kmem_cache_cpu *c) { }
-
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
-
 /*
  * Put a page that was just frozen (in __slab_free|get_partial_node) into a
  * partial page slot if available.
@@ -2476,46 +2473,56 @@ static inline void unfreeze_partials_cpu
  */
 static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 {
-#ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *oldpage;
-	int pages;
-	int pobjects;
+	struct page *page_to_unfreeze = NULL;
+	unsigned long flags;
+	int pages = 0;
+	int pobjects = 0;
 
-	preempt_disable();
-	do {
-		pages = 0;
-		pobjects = 0;
-		oldpage = this_cpu_read(s->cpu_slab->partial);
+	local_irq_save(flags);
+
+	oldpage = this_cpu_read(s->cpu_slab->partial);
 
-		if (oldpage) {
+	if (oldpage) {
+		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
+			/*
+			 * Partial array is full. Move the existing set to the
+			 * per node partial list. Postpone the actual unfreezing
+			 * outside of the critical section.
+			 */
+			page_to_unfreeze = oldpage;
+			oldpage = NULL;
+		} else {
 			pobjects = oldpage->pobjects;
 			pages = oldpage->pages;
-			if (drain && pobjects > slub_cpu_partial(s)) {
-				/*
-				 * partial array is full. Move the existing
-				 * set to the per node partial list.
-				 */
-				unfreeze_partials(s);
-				oldpage = NULL;
-				pobjects = 0;
-				pages = 0;
-				stat(s, CPU_PARTIAL_DRAIN);
-			}
 		}
+	}
 
-		pages++;
-		pobjects += page->objects - page->inuse;
+	pages++;
+	pobjects += page->objects - page->inuse;
 
-		page->pages = pages;
-		page->pobjects = pobjects;
-		page->next = oldpage;
-
-	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
-								!= oldpage);
-	preempt_enable();
-#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+	page->pages = pages;
+	page->pobjects = pobjects;
+	page->next = oldpage;
+
+	this_cpu_write(s->cpu_slab->partial, page);
+
+	local_irq_restore(flags);
+
+	if (page_to_unfreeze) {
+		__unfreeze_partials(s, page_to_unfreeze);
+		stat(s, CPU_PARTIAL_DRAIN);
+	}
 }
 
+#else	/* CONFIG_SLUB_CPU_PARTIAL */
+
+static inline void unfreeze_partials(struct kmem_cache *s) { }
+static inline void unfreeze_partials_cpu(struct kmem_cache *s,
+				  struct kmem_cache_cpu *c) { }
+
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
+
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
 {
 	unsigned long flags;
_

* [patch 032/147] mm, slub: use migrate_disable() on PREEMPT_RT
  2021-09-08  2:52 incoming Andrew Morton
                   ` (30 preceding siblings ...)
  2021-09-08  2:54 ` [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 033/147] mm, slub: convert kmem_cpu_slab protection to local_lock Andrew Morton
                   ` (115 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: use migrate_disable() on PREEMPT_RT

We currently use preempt_disable() (directly or via get_cpu_ptr()) to
stabilize the pointer to kmem_cache_cpu.  On PREEMPT_RT this would be
incompatible with the list_lock spinlock.  We can use migrate_disable()
instead, but that increases overhead on !PREEMPT_RT as it's an
unconditional function call.

In order to get the best available mechanism on both PREEMPT_RT and
!PREEMPT_RT, introduce private slub_get_cpu_ptr() and slub_put_cpu_ptr()
wrappers and use them.
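
A sketch of the intended usage: the wrappers only need to keep the task on
one cpu long enough for the kmem_cache_cpu pointer to stay valid:

	struct kmem_cache_cpu *c;

	c = slub_get_cpu_ptr(s->cpu_slab);	/* preempt_disable() on !RT,
						   migrate_disable() on RT */
	/* ... c refers to this cpu's slab state throughout ... */
	slub_put_cpu_ptr(s->cpu_slab);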

Link: https://lkml.kernel.org/r/20210904105003.11688-33-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

--- a/mm/slub.c~mm-slub-use-migrate_disable-on-preempt_rt
+++ a/mm/slub.c
@@ -118,6 +118,26 @@
  * 			the fast path and disables lockless freelists.
  */
 
+/*
+ * We could simply use migrate_disable()/enable() but as long as it's a
+ * function call even on !PREEMPT_RT, use inline preempt_disable() there.
+ */
+#ifndef CONFIG_PREEMPT_RT
+#define slub_get_cpu_ptr(var)	get_cpu_ptr(var)
+#define slub_put_cpu_ptr(var)	put_cpu_ptr(var)
+#else
+#define slub_get_cpu_ptr(var)		\
+({					\
+	migrate_disable();		\
+	this_cpu_ptr(var);		\
+})
+#define slub_put_cpu_ptr(var)		\
+do {					\
+	(void)(var);			\
+	migrate_enable();		\
+} while (0)
+#endif
+
 #ifdef CONFIG_SLUB_DEBUG
 #ifdef CONFIG_SLUB_DEBUG_ON
 DEFINE_STATIC_KEY_TRUE(slub_debug_enabled);
@@ -2852,7 +2872,7 @@ redo:
 	if (unlikely(!pfmemalloc_match_unsafe(page, gfpflags)))
 		goto deactivate_slab;
 
-	/* must check again c->page in case IRQ handler changed it */
+	/* must check again c->page in case we got preempted and it changed */
 	local_irq_save(flags);
 	if (unlikely(page != c->page)) {
 		local_irq_restore(flags);
@@ -2911,7 +2931,8 @@ new_slab:
 		}
 		if (unlikely(!slub_percpu_partial(c))) {
 			local_irq_restore(flags);
-			goto new_objects; /* stolen by an IRQ handler */
+			/* we were preempted and partial list got empty */
+			goto new_objects;
 		}
 
 		page = c->page = slub_percpu_partial(c);
@@ -2927,9 +2948,9 @@ new_objects:
 	if (freelist)
 		goto check_new_page;
 
-	put_cpu_ptr(s->cpu_slab);
+	slub_put_cpu_ptr(s->cpu_slab);
 	page = new_slab(s, gfpflags, node);
-	c = get_cpu_ptr(s->cpu_slab);
+	c = slub_get_cpu_ptr(s->cpu_slab);
 
 	if (unlikely(!page)) {
 		slab_out_of_memory(s, gfpflags, node);
@@ -3012,12 +3033,12 @@ static void *__slab_alloc(struct kmem_ca
 	 * cpu before disabling preemption. Need to reload cpu area
 	 * pointer.
 	 */
-	c = get_cpu_ptr(s->cpu_slab);
+	c = slub_get_cpu_ptr(s->cpu_slab);
 #endif
 
 	p = ___slab_alloc(s, gfpflags, node, addr, c);
 #ifdef CONFIG_PREEMPT_COUNT
-	put_cpu_ptr(s->cpu_slab);
+	slub_put_cpu_ptr(s->cpu_slab);
 #endif
 	return p;
 }
@@ -3546,7 +3567,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	 * IRQs, which protects against PREEMPT and interrupts
 	 * handlers invoking normal fastpath.
 	 */
-	c = get_cpu_ptr(s->cpu_slab);
+	c = slub_get_cpu_ptr(s->cpu_slab);
 	local_irq_disable();
 
 	for (i = 0; i < size; i++) {
@@ -3592,7 +3613,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	}
 	c->tid = next_tid(c->tid);
 	local_irq_enable();
-	put_cpu_ptr(s->cpu_slab);
+	slub_put_cpu_ptr(s->cpu_slab);
 
 	/*
 	 * memcg and kmem_cache debug support and memory initialization.
@@ -3602,7 +3623,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 				slab_want_init_on_alloc(flags, s));
 	return i;
 error:
-	put_cpu_ptr(s->cpu_slab);
+	slub_put_cpu_ptr(s->cpu_slab);
 	slab_post_alloc_hook(s, objcg, flags, i, p, false);
 	__kmem_cache_free_bulk(s, i, p);
 	return 0;
_

* [patch 033/147] mm, slub: convert kmem_cpu_slab protection to local_lock
  2021-09-08  2:52 incoming Andrew Morton
                   ` (31 preceding siblings ...)
  2021-09-08  2:54 ` [patch 032/147] mm, slub: use migrate_disable() on PREEMPT_RT Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 034/147] memory-hotplug.rst: remove locking details from admin-guide Andrew Morton
                   ` (114 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, bigeasy, brouer, cl, iamjoonsoo.kim, jannh, linux-mm,
	mgorman, mm-commits, penberg, quic_qiancai, rientjes, tglx,
	torvalds, vbabka

From: Vlastimil Babka <vbabka@suse.cz>
Subject: mm, slub: convert kmem_cpu_slab protection to local_lock

Embed local_lock into struct kmem_cpu_slab and use the irq-safe versions
of local_lock instead of plain local_irq_save/restore.  On !PREEMPT_RT
that's equivalent, with better lockdep visibility.  On PREEMPT_RT that
means better preemption.

However, the cost on PREEMPT_RT is the loss of lockless fast paths, which
only work with the cpu freelist.  Those are designed to detect and recover
from being preempted by other conflicting operations (whether fast or slow
path), but the slow path operations assume they cannot be preempted by a
fast path operation, which is guaranteed naturally with disabled irqs. 
With local locks on PREEMPT_RT, the fast paths now also need to take the
local lock to avoid races.

In the allocation fastpath slab_alloc_node() we can just defer to the
slowpath __slab_alloc() which also works with cpu freelist, but under the
local lock.  In the free fastpath do_slab_free() we have to add a new
local lock protected version of freeing to the cpu freelist, as the
existing slowpath only works with the page freelist.

Also update the comment about locking scheme in SLUB to reflect changes
done by this series.
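
The conversion itself is mostly mechanical; a sketch of the resulting
pattern around kmem_cache_cpu fields:

	unsigned long flags;

	local_lock_irqsave(&s->cpu_slab->lock, flags);
	/* ... c->page, c->freelist, c->tid, c->partial are stable here ... */
	local_unlock_irqrestore(&s->cpu_slab->lock, flags);

On !PREEMPT_RT this compiles to the old local_irq_save/restore() with
added lockdep coverage; on PREEMPT_RT it takes a per-CPU spinlock instead.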

[ Mike Galbraith <efault@gmx.de>: use local_lock() without irq in PREEMPT_RT
  scope; debugging of RT crashes resulting in put_cpu_partial() locking changes ]
Link: https://lkml.kernel.org/r/20210904105003.11688-34-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jann Horn <jannh@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Qian Cai <quic_qiancai@quicinc.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slub_def.h |    6 +
 mm/slub.c                |  146 ++++++++++++++++++++++++++++---------
 2 files changed, 117 insertions(+), 35 deletions(-)

--- a/include/linux/slub_def.h~mm-slub-convert-kmem_cpu_slab-protection-to-local_lock
+++ a/include/linux/slub_def.h
@@ -10,6 +10,7 @@
 #include <linux/kfence.h>
 #include <linux/kobject.h>
 #include <linux/reciprocal_div.h>
+#include <linux/local_lock.h>
 
 enum stat_item {
 	ALLOC_FASTPATH,		/* Allocation from cpu slab */
@@ -40,6 +41,10 @@ enum stat_item {
 	CPU_PARTIAL_DRAIN,	/* Drain cpu partial to node partial */
 	NR_SLUB_STAT_ITEMS };
 
+/*
+ * When changing the layout, make sure freelist and tid are still compatible
+ * with this_cpu_cmpxchg_double() alignment requirements.
+ */
 struct kmem_cache_cpu {
 	void **freelist;	/* Pointer to next available object */
 	unsigned long tid;	/* Globally unique transaction id */
@@ -47,6 +52,7 @@ struct kmem_cache_cpu {
 #ifdef CONFIG_SLUB_CPU_PARTIAL
 	struct page *partial;	/* Partially allocated frozen slabs */
 #endif
+	local_lock_t lock;	/* Protects the fields above */
 #ifdef CONFIG_SLUB_STATS
 	unsigned stat[NR_SLUB_STAT_ITEMS];
 #endif
--- a/mm/slub.c~mm-slub-convert-kmem_cpu_slab-protection-to-local_lock
+++ a/mm/slub.c
@@ -46,13 +46,21 @@
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
- *   2. node->list_lock
- *   3. slab_lock(page) (Only on some arches and for debugging)
+ *   2. node->list_lock (Spinlock)
+ *   3. kmem_cache->cpu_slab->lock (Local lock)
+ *   4. slab_lock(page) (Only on some arches or for debugging)
+ *   5. object_map_lock (Only for debugging)
  *
  *   slab_mutex
  *
  *   The role of the slab_mutex is to protect the list of all the slabs
  *   and to synchronize major metadata changes to slab cache structures.
+ *   Also synchronizes memory hotplug callbacks.
+ *
+ *   slab_lock
+ *
+ *   The slab_lock is a wrapper around the page lock, thus it is a bit
+ *   spinlock.
  *
  *   The slab_lock is only used for debugging and on arches that do not
  *   have the ability to do a cmpxchg_double. It only protects:
@@ -61,6 +69,8 @@
  *	C. page->objects	-> Number of objects in page
  *	D. page->frozen		-> frozen state
  *
+ *   Frozen slabs
+ *
  *   If a slab is frozen then it is exempt from list management. It is not
  *   on any list except per cpu partial list. The processor that froze the
  *   slab is the one who can perform list operations on the page. Other
@@ -68,6 +78,8 @@
  *   froze the slab is the only one that can retrieve the objects from the
  *   page's freelist.
  *
+ *   list_lock
+ *
  *   The list_lock protects the partial and full list on each node and
  *   the partial slab counter. If taken then no new slabs may be added or
  *   removed from the lists nor make the number of partial slabs be modified.
@@ -79,10 +91,36 @@
  *   slabs, operations can continue without any centralized lock. F.e.
  *   allocating a long series of objects that fill up slabs does not require
  *   the list lock.
- *   Interrupts are disabled during allocation and deallocation in order to
- *   make the slab allocator safe to use in the context of an irq. In addition
- *   interrupts are disabled to ensure that the processor does not change
- *   while handling per_cpu slabs, due to kernel preemption.
+ *
+ *   cpu_slab->lock local lock
+ *
+ *   This locks protect slowpath manipulation of all kmem_cache_cpu fields
+ *   except the stat counters. This is a percpu structure manipulated only by
+ *   the local cpu, so the lock protects against being preempted or interrupted
+ *   by an irq. Fast path operations rely on lockless operations instead.
+ *   On PREEMPT_RT, the local lock does not actually disable irqs (and thus
+ *   prevent the lockless operations), so fastpath operations also need to take
+ *   the lock and are no longer lockless.
+ *
+ *   lockless fastpaths
+ *
+ *   The fast path allocation (slab_alloc_node()) and freeing (do_slab_free())
+ *   are fully lockless when satisfied from the percpu slab (and when
+ *   cmpxchg_double is possible to use, otherwise slab_lock is taken).
+ *   They also don't disable preemption or migration or irqs. They rely on
+ *   the transaction id (tid) field to detect being preempted or moved to
+ *   another cpu.
+ *
+ *   irq, preemption, migration considerations
+ *
+ *   Interrupts are disabled as part of list_lock or local_lock operations, or
+ *   around the slab_lock operation, in order to make the slab allocator safe
+ *   to use in the context of an irq.
+ *
+ *   In addition, preemption (or migration on PREEMPT_RT) is disabled in the
+ *   allocation slowpath, bulk allocation, and put_cpu_partial(), so that the
+ *   local cpu doesn't change in the process and e.g. the kmem_cache_cpu pointer
+ *   doesn't have to be revalidated in each section protected by the local lock.
  *
  * SLUB assigns one slab for allocation to each processor.
  * Allocations only occur from these slabs called cpu slabs.
@@ -2250,9 +2288,13 @@ static inline void note_cmpxchg_failure(
 static void init_kmem_cache_cpus(struct kmem_cache *s)
 {
 	int cpu;
+	struct kmem_cache_cpu *c;
 
-	for_each_possible_cpu(cpu)
-		per_cpu_ptr(s->cpu_slab, cpu)->tid = init_tid(cpu);
+	for_each_possible_cpu(cpu) {
+		c = per_cpu_ptr(s->cpu_slab, cpu);
+		local_lock_init(&c->lock);
+		c->tid = init_tid(cpu);
+	}
 }
 
 /*
@@ -2463,10 +2505,10 @@ static void unfreeze_partials(struct kme
 	struct page *partial_page;
 	unsigned long flags;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	partial_page = this_cpu_read(s->cpu_slab->partial);
 	this_cpu_write(s->cpu_slab->partial, NULL);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (partial_page)
 		__unfreeze_partials(s, partial_page);
@@ -2499,7 +2541,7 @@ static void put_cpu_partial(struct kmem_
 	int pages = 0;
 	int pobjects = 0;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	oldpage = this_cpu_read(s->cpu_slab->partial);
 
@@ -2527,7 +2569,7 @@ static void put_cpu_partial(struct kmem_
 
 	this_cpu_write(s->cpu_slab->partial, page);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (page_to_unfreeze) {
 		__unfreeze_partials(s, page_to_unfreeze);
@@ -2549,7 +2591,7 @@ static inline void flush_slab(struct kme
 	struct page *page;
 	void *freelist;
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 
 	page = c->page;
 	freelist = c->freelist;
@@ -2558,7 +2600,7 @@ static inline void flush_slab(struct kme
 	c->freelist = NULL;
 	c->tid = next_tid(c->tid);
 
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 	if (page) {
 		deactivate_slab(s, page, freelist);
@@ -2780,8 +2822,6 @@ static inline bool pfmemalloc_match_unsa
  * The page is still frozen if the return value is not NULL.
  *
  * If this function returns NULL then the page has been unfrozen.
- *
- * This function must be called with interrupt disabled.
  */
 static inline void *get_freelist(struct kmem_cache *s, struct page *page)
 {
@@ -2789,6 +2829,8 @@ static inline void *get_freelist(struct
 	unsigned long counters;
 	void *freelist;
 
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
+
 	do {
 		freelist = page->freelist;
 		counters = page->counters;
@@ -2873,9 +2915,9 @@ redo:
 		goto deactivate_slab;
 
 	/* must check again c->page in case we got preempted and it changed */
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(page != c->page)) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
@@ -2886,7 +2928,7 @@ redo:
 
 	if (!freelist) {
 		c->page = NULL;
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, DEACTIVATE_BYPASS);
 		goto new_slab;
 	}
@@ -2895,7 +2937,7 @@ redo:
 
 load_freelist:
 
-	lockdep_assert_irqs_disabled();
+	lockdep_assert_held(this_cpu_ptr(&s->cpu_slab->lock));
 
 	/*
 	 * freelist is pointing to the list of objects to be used.
@@ -2905,39 +2947,39 @@ load_freelist:
 	VM_BUG_ON(!c->page->frozen);
 	c->freelist = get_freepointer(s, freelist);
 	c->tid = next_tid(c->tid);
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	return freelist;
 
 deactivate_slab:
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (page != c->page) {
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		goto reread_page;
 	}
 	freelist = c->freelist;
 	c->page = NULL;
 	c->freelist = NULL;
-	local_irq_restore(flags);
+	local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 	deactivate_slab(s, page, freelist);
 
 new_slab:
 
 	if (slub_percpu_partial(c)) {
-		local_irq_save(flags);
+		local_lock_irqsave(&s->cpu_slab->lock, flags);
 		if (unlikely(c->page)) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			goto reread_page;
 		}
 		if (unlikely(!slub_percpu_partial(c))) {
-			local_irq_restore(flags);
+			local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 			/* we were preempted and partial list got empty */
 			goto new_objects;
 		}
 
 		page = c->page = slub_percpu_partial(c);
 		slub_set_percpu_partial(c, page);
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 		stat(s, CPU_PARTIAL_ALLOC);
 		goto redo;
 	}
@@ -2990,7 +3032,7 @@ check_new_page:
 
 retry_load_page:
 
-	local_irq_save(flags);
+	local_lock_irqsave(&s->cpu_slab->lock, flags);
 	if (unlikely(c->page)) {
 		void *flush_freelist = c->freelist;
 		struct page *flush_page = c->page;
@@ -2999,7 +3041,7 @@ retry_load_page:
 		c->freelist = NULL;
 		c->tid = next_tid(c->tid);
 
-		local_irq_restore(flags);
+		local_unlock_irqrestore(&s->cpu_slab->lock, flags);
 
 		deactivate_slab(s, flush_page, flush_freelist);
 
@@ -3118,7 +3160,15 @@ redo:
 
 	object = c->freelist;
 	page = c->page;
-	if (unlikely(!object || !page || !node_match(page, node))) {
+	/*
+	 * We cannot use the lockless fastpath on PREEMPT_RT because if a
+	 * slowpath has taken the local_lock_irqsave(), it is not protected
+	 * against a fast path operation in an irq handler. So we need to take
+	 * the slow path which uses local_lock. It is still relatively fast if
+	 * there is a suitable cpu freelist.
+	 */
+	if (IS_ENABLED(CONFIG_PREEMPT_RT) ||
+	    unlikely(!object || !page || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
@@ -3378,6 +3428,7 @@ redo:
 	barrier();
 
 	if (likely(page == c->page)) {
+#ifndef CONFIG_PREEMPT_RT
 		void **freelist = READ_ONCE(c->freelist);
 
 		set_freepointer(s, tail_obj, freelist);
@@ -3390,6 +3441,31 @@ redo:
 			note_cmpxchg_failure("slab_free", s, tid);
 			goto redo;
 		}
+#else /* CONFIG_PREEMPT_RT */
+		/*
+		 * We cannot use the lockless fastpath on PREEMPT_RT because if
+		 * a slowpath has taken the local_lock_irqsave(), it is not
+		 * protected against a fast path operation in an irq handler. So
+		 * we need to take the local_lock. We shouldn't simply defer to
+		 * __slab_free() as that wouldn't use the cpu freelist at all.
+		 */
+		void **freelist;
+
+		local_lock(&s->cpu_slab->lock);
+		c = this_cpu_ptr(s->cpu_slab);
+		if (unlikely(page != c->page)) {
+			local_unlock(&s->cpu_slab->lock);
+			goto redo;
+		}
+		tid = c->tid;
+		freelist = c->freelist;
+
+		set_freepointer(s, tail_obj, freelist);
+		c->freelist = head;
+		c->tid = next_tid(tid);
+
+		local_unlock(&s->cpu_slab->lock);
+#endif
 		stat(s, FREE_FASTPATH);
 	} else
 		__slab_free(s, page, head, tail_obj, cnt, addr);
@@ -3568,7 +3644,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 	 * handlers invoking normal fastpath.
 	 */
 	c = slub_get_cpu_ptr(s->cpu_slab);
-	local_irq_disable();
+	local_lock_irq(&s->cpu_slab->lock);
 
 	for (i = 0; i < size; i++) {
 		void *object = kfence_alloc(s, s->object_size, flags);
@@ -3589,7 +3665,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 			 */
 			c->tid = next_tid(c->tid);
 
-			local_irq_enable();
+			local_unlock_irq(&s->cpu_slab->lock);
 
 			/*
 			 * Invoking slow path likely have side-effect
@@ -3603,7 +3679,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 			c = this_cpu_ptr(s->cpu_slab);
 			maybe_wipe_obj_freeptr(s, p[i]);
 
-			local_irq_disable();
+			local_lock_irq(&s->cpu_slab->lock);
 
 			continue; /* goto for-loop */
 		}
@@ -3612,7 +3688,7 @@ int kmem_cache_alloc_bulk(struct kmem_ca
 		maybe_wipe_obj_freeptr(s, p[i]);
 	}
 	c->tid = next_tid(c->tid);
-	local_irq_enable();
+	local_unlock_irq(&s->cpu_slab->lock);
 	slub_put_cpu_ptr(s->cpu_slab);
 
 	/*
_
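
The conversion above follows the generic local_lock pattern for protecting
per-cpu data.  A minimal sketch of that pattern, with an illustrative percpu
structure (cache_stats and its fields are made up for the example, not taken
from slub.c):

	#include <linux/local_lock.h>
	#include <linux/percpu.h>

	struct cache_stats {
		local_lock_t lock;	/* protects the fields below */
		unsigned long inuse;
	};

	static DEFINE_PER_CPU(struct cache_stats, cache_stats) = {
		.lock = INIT_LOCAL_LOCK(lock),
	};

	static void cache_stats_inc(void)
	{
		unsigned long flags;

		/*
		 * Disables irqs on !PREEMPT_RT; becomes a per-cpu spinlock
		 * (disabling migration) on PREEMPT_RT.
		 */
		local_lock_irqsave(&cache_stats.lock, flags);
		this_cpu_ptr(&cache_stats)->inuse++;
		local_unlock_irqrestore(&cache_stats.lock, flags);
	}

As in the patch, helpers that expect the lock to be held can then assert it
with lockdep_assert_held(this_cpu_ptr(&cache_stats.lock)).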

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 034/147] memory-hotplug.rst: remove locking details from admin-guide
  2021-09-08  2:52 incoming Andrew Morton
                   ` (32 preceding siblings ...)
  2021-09-08  2:54 ` [patch 033/147] mm, slub: convert kmem_cpu_slab protection to local_lock Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 035/147] memory-hotplug.rst: complete admin-guide overhaul Andrew Morton
                   ` (113 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, anshuman.khandual, corbet, dave.hansen, david, linux-mm,
	mhocko, mike.kravetz, mm-commits, osalvador, pasha.tatashin,
	rppt, sfr, songmuchun, torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: memory-hotplug.rst: remove locking details from admin-guide

Patch series "memory-hotplug.rst: complete admin-guide overhaul", v3.


This patch (of 2):

We have the same content at Documentation/core-api/memory-hotplug.rst and
it doesn't fit into the admin-guide.  The documentation was accidentally
duplicated when merging.

Link: https://lkml.kernel.org/r/20210707073205.3835-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210707073205.3835-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/memory-hotplug.rst |   39 --------------
 1 file changed, 39 deletions(-)

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-remove-locking-details-from-admin-guide
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -415,45 +415,6 @@ Need more implementation yet....
  - Guard from remove if not yet.
 
 
-Locking Internals
-=================
-
-When adding/removing memory that uses memory block devices (i.e. ordinary RAM),
-the device_hotplug_lock should be held to:
-
-- synchronize against online/offline requests (e.g. via sysfs). This way, memory
-  block devices can only be accessed (.online/.state attributes) by user
-  space once memory has been fully added. And when removing memory, we
-  know nobody is in critical sections.
-- synchronize against CPU hotplug and similar (e.g. relevant for ACPI and PPC)
-
-Especially, there is a possible lock inversion that is avoided using
-device_hotplug_lock when adding memory and user space tries to online that
-memory faster than expected:
-
-- device_online() will first take the device_lock(), followed by
-  mem_hotplug_lock
-- add_memory_resource() will first take the mem_hotplug_lock, followed by
-  the device_lock() (while creating the devices, during bus_add_device()).
-
-As the device is visible to user space before taking the device_lock(), this
-can result in a lock inversion.
-
-onlining/offlining of memory should be done via device_online()/
-device_offline() - to make sure it is properly synchronized to actions
-via sysfs. Holding device_hotplug_lock is advised (to e.g. protect online_type)
-
-When adding/removing/onlining/offlining memory or adding/removing
-heterogeneous/device memory, we should always hold the mem_hotplug_lock in
-write mode to serialise memory hotplug (e.g. access to global/zone
-variables).
-
-In addition, mem_hotplug_lock (in contrast to device_hotplug_lock) in read
-mode allows for a quite efficient get_online_mems/put_online_mems
-implementation, so code accessing memory can protect from that memory
-vanishing.
-
-
 Future Work
 ===========
 
_
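
The ordering that this section documented is still what in-kernel callers
follow; as a condensed, illustrative sketch of the add_memory() path (error
handling omitted):

	lock_device_hotplug();		/* 1st: device_hotplug_lock      */
	mem_hotplug_begin();		/* 2nd: mem_hotplug_lock (write) */
	/* ... create memory block devices, update zones ... */
	mem_hotplug_done();
	unlock_device_hotplug();

Readers that only need the memory to not vanish take mem_hotplug_lock in read
mode via get_online_mems()/put_online_mems().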

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 035/147] memory-hotplug.rst: complete admin-guide overhaul
  2021-09-08  2:52 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2021-09-08  2:54 ` [patch 034/147] memory-hotplug.rst: remove locking details from admin-guide Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 036/147] mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE Andrew Morton
                   ` (112 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, anshuman.khandual, corbet, dave.hansen, david, linux-mm,
	mhocko, mike.kravetz, mm-commits, osalvador, pasha.tatashin,
	rppt, sfr, songmuchun, torvalds, willy

From: David Hildenbrand <david@redhat.com>
Subject: memory-hotplug.rst: complete admin-guide overhaul

The memory hot(un)plug documentation is outdated and incomplete.  Most of
the content dates back to 2007, so it's time for a major overhaul.

Let's rewrite, reorganize and update most parts of the documentation.  In
addition to memory hot(un)plug, also add some details regarding
ZONE_MOVABLE, with memory hotunplug being one of its main consumers.

Drop the file history, that information can more reliably be had from the
git log.

The style of the document is also properly fixed that e.g., "restview"
renders it cleanly now.

In the future, we might add some more details about virt users like
virtio-mem, the XEN balloon, the Hyper-V balloon and ppc64 dlpar.

Link: https://lkml.kernel.org/r/20210707073205.3835-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>

Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/memory-hotplug.rst |  803 ++++++++------
 1 file changed, 476 insertions(+), 327 deletions(-)

--- a/Documentation/admin-guide/mm/memory-hotplug.rst~memory-hotplugrst-complete-admin-guide-overhaul
+++ a/Documentation/admin-guide/mm/memory-hotplug.rst
@@ -1,427 +1,576 @@
 .. _admin_guide_memory_hotplug:
 
-==============
-Memory Hotplug
-==============
-
-:Created:							Jul 28 2007
-:Updated: Add some details about locking internals:		Aug 20 2018
-
-This document is about memory hotplug including how-to-use and current status.
-Because Memory Hotplug is still under development, contents of this text will
-be changed often.
+==================
+Memory Hot(Un)Plug
+==================
+
+This document describes generic Linux support for memory hot(un)plug with
+a focus on System RAM, including ZONE_MOVABLE support.
 
 .. contents:: :local:
 
-.. note::
+Introduction
+============
 
-    (1) x86_64's has special implementation for memory hotplug.
-        This text does not describe it.
-    (2) This text assumes that sysfs is mounted at ``/sys``.
+Memory hot(un)plug allows for increasing and decreasing the size of physical
+memory available to a machine at runtime. In the simplest case, it consists of
+physically plugging or unplugging a DIMM at runtime, coordinated with the
+operating system.
+
+Memory hot(un)plug is used for various purposes:
+
+- The physical memory available to a machine can be adjusted at runtime, up- or
+  downgrading the memory capacity. This dynamic memory resizing, sometimes
+  referred to as "capacity on demand", is frequently used with virtual machines
+  and logical partitions.
+
+- Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One
+  example is replacing failing memory modules.
+
+- Reducing energy consumption either by physically unplugging memory modules or
+  by logically unplugging (parts of) memory modules from Linux.
+
+Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also
+used to expose persistent memory, other performance-differentiated memory and
+reserved memory regions as ordinary system RAM to Linux.
+
+Linux only supports memory hot(un)plug on selected 64 bit architectures, such as
+x86_64, arm64, ppc64, s390x and ia64.
+
+Memory Hot(Un)Plug Granularity
+------------------------------
+
+Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the
+physical memory address space into chunks of the same size: memory sections. The
+size of a memory section is architecture dependent. For example, x86_64 uses
+128 MiB and ppc64 uses 16 MiB.
 
+Memory sections are combined into chunks referred to as "memory blocks". The
+size of a memory block is architecture dependent and corresponds to the smallest
+granularity that can be hot(un)plugged. The default size of a memory block is
+the same as memory section size, unless an architecture specifies otherwise.
 
-Introduction
-============
+All memory blocks have the same size.
 
-Purpose of memory hotplug
--------------------------
+Phases of Memory Hotplug
+------------------------
 
-Memory Hotplug allows users to increase/decrease the amount of memory.
-Generally, there are two purposes.
+Memory hotplug consists of two phases:
 
-(A) For changing the amount of memory.
-    This is to allow a feature like capacity on demand.
-(B) For installing/removing DIMMs or NUMA-nodes physically.
-    This is to exchange DIMMs/NUMA-nodes, reduce power consumption, etc.
+(1) Adding the memory to Linux
+(2) Onlining memory blocks
 
-(A) is required by highly virtualized environments and (B) is required by
-hardware which supports memory power management.
+In the first phase, metadata, such as the memory map ("memmap") and page tables
+for the direct mapping, is allocated and initialized, and memory blocks are
+created; the latter also creates sysfs files for managing newly created memory
+blocks.
 
-Linux memory hotplug is designed for both purpose.
+In the second phase, added memory is exposed to the page allocator. After this
+phase, the memory is visible in memory statistics, such as free and total
+memory, of the system.
 
-Phases of memory hotplug
-------------------------
+Phases of Memory Hotunplug
+--------------------------
 
-There are 2 phases in Memory Hotplug:
+Memory hotunplug consists of two phases:
 
-  1) Physical Memory Hotplug phase
-  2) Logical Memory Hotplug phase.
+(1) Offlining memory blocks
+(2) Removing the memory from Linux
 
-The First phase is to communicate hardware/firmware and make/erase
-environment for hotplugged memory. Basically, this phase is necessary
-for the purpose (B), but this is good phase for communication between
-highly virtualized environments too.
-
-When memory is hotplugged, the kernel recognizes new memory, makes new memory
-management tables, and makes sysfs files for new memory's operation.
-
-If firmware supports notification of connection of new memory to OS,
-this phase is triggered automatically. ACPI can notify this event. If not,
-"probe" operation by system administration is used instead.
-(see :ref:`memory_hotplug_physical_mem`).
-
-Logical Memory Hotplug phase is to change memory state into
-available/unavailable for users. Amount of memory from user's view is
-changed by this phase. The kernel makes all memory in it as free pages
-when a memory range is available.
-
-In this document, this phase is described as online/offline.
-
-Logical Memory Hotplug phase is triggered by write of sysfs file by system
-administrator. For the hot-add case, it must be executed after Physical Hotplug
-phase by hand.
-(However, if you writes udev's hotplug scripts for memory hotplug, these
-phases can be execute in seamless way.)
-
-Unit of Memory online/offline operation
----------------------------------------
-
-Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
-into chunks of the same size. These chunks are called "sections". The size of
-a memory section is architecture dependent. For example, power uses 16MiB, ia64
-uses 1GiB.
+In the first phase, memory is "hidden" from the page allocator again, for
+example, by migrating busy memory to other memory locations and removing all
+relevant free pages from the page allocator. After this phase, the memory is no
+longer visible in memory statistics of the system.
 
-Memory sections are combined into chunks referred to as "memory blocks". The
-size of a memory block is architecture dependent and represents the logical
-unit upon which memory online/offline operations are to be performed. The
-default size of a memory block is the same as memory section size unless an
-architecture specifies otherwise. (see :ref:`memory_hotplug_sysfs_files`.)
+In the second phase, the memory blocks are removed and metadata is freed.
 
-To determine the size (in bytes) of a memory block please read this file::
+Memory Hotplug Notifications
+============================
 
-  /sys/devices/system/memory/block_size_bytes
+There are various ways in which Linux is notified about memory hotplug events such
+that it can start adding hotplugged memory. This description is limited to
+systems that support ACPI; mechanisms specific to other firmware interfaces or
+virtual machines are not described.
 
-Kernel Configuration
-====================
+ACPI Notifications
+------------------
 
-To use memory hotplug feature, kernel must be compiled with following
-config options.
+Platforms that support ACPI, such as x86_64, can support memory hotplug
+notifications via ACPI.
 
-- For all memory hotplug:
-    - Memory model -> Sparse Memory  (``CONFIG_SPARSEMEM``)
-    - Allow for memory hot-add       (``CONFIG_MEMORY_HOTPLUG``)
+In general, a firmware supporting memory hotplug defines a memory class object
+HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI
+driver will hotplug the memory to Linux.
 
-- To enable memory removal, the following are also necessary:
-    - Allow for memory hot remove    (``CONFIG_MEMORY_HOTREMOVE``)
-    - Page Migration                 (``CONFIG_MIGRATION``)
+If the firmware supports hotplug of NUMA nodes, it defines an object _HID
+"ACPI0004", "PNP0A05", or "PNP0A06". When notified about an hotplug event, all
+assigned memory devices are added to Linux by the ACPI driver.
 
-- For ACPI memory hotplug, the following are also necessary:
-    - Memory hotplug (under ACPI Support menu) (``CONFIG_ACPI_HOTPLUG_MEMORY``)
-    - This option can be kernel module.
+Similarly, Linux can be notified about requests to hotunplug a memory device or
+a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory
+blocks, and, if successful, hotunplug the memory from Linux.
 
-- As a related configuration, if your box has a feature of NUMA-node hotplug
-  via ACPI, then this option is necessary too.
+Manual Probing
+--------------
 
-    - ACPI0004,PNP0A05 and PNP0A06 Container Driver (under ACPI Support menu)
-      (``CONFIG_ACPI_CONTAINER``).
+On some architectures, the firmware may not be able to notify the operating
+system about a memory hotplug event. Instead, the memory has to be manually
+probed from user space.
 
-     This option can be kernel module too.
+The probe interface is located at::
 
+	/sys/devices/system/memory/probe
 
-.. _memory_hotplug_sysfs_files:
+Only complete memory blocks can be probed. Individual memory blocks are probed
+by providing the physical start address of the memory block::
 
-sysfs files for memory hotplug
-==============================
+	% echo addr > /sys/devices/system/memory/probe
 
-All memory blocks have their device information in sysfs.  Each memory block
-is described under ``/sys/devices/system/memory`` as::
+This results in a memory block being created for the range
+[addr, addr + memory_block_size).
 
-	/sys/devices/system/memory/memoryXXX
+.. note::
 
-where XXX is the memory block id.
+  Using the probe interface is discouraged as it is easy to crash the kernel,
+  because Linux cannot validate user input; this interface might be removed in
+  the future.
+
+Onlining and Offlining Memory Blocks
+====================================
+
+After a memory block has been created, Linux has to be instructed to actually
+make use of that memory: the memory block has to be "online".
+
+Before a memory block can be removed, Linux has to stop using any memory part of
+the memory block: the memory block has to be "offlined".
+
+The Linux kernel can be configured to automatically online added memory blocks.
+Memory blocks can only be removed once offlining succeeded, and drivers may
+trigger offlining of memory blocks when attempting hotunplug of memory.
 
-For the memory block covered by the sysfs directory.  It is expected that all
-memory sections in this range are present and no memory holes exist in the
-range. Currently there is no way to determine if there is a memory hole, but
-the existence of one should not affect the hotplug capabilities of the memory
-block.
+Onlining Memory Blocks Manually
+-------------------------------
 
-For example, assume 1GiB memory block size. A device for a memory starting at
-0x100000000 is ``/sys/device/system/memory/memory4``::
+If auto-onlining of memory blocks isn't enabled, user-space has to manually
+trigger onlining of memory blocks. Often, udev rules are used to automate this
+task in user space.
 
-	(0x100000000 / 1Gib = 4)
+Onlining of a memory block can be triggered via::
 
-This device covers address range [0x100000000 ... 0x140000000)
+	% echo online > /sys/devices/system/memory/memoryXXX/state
 
-Under each memory block, you can see 5 files:
+Or alternatively::
 
-- ``/sys/devices/system/memory/memoryXXX/phys_index``
-- ``/sys/devices/system/memory/memoryXXX/phys_device``
-- ``/sys/devices/system/memory/memoryXXX/state``
-- ``/sys/devices/system/memory/memoryXXX/removable``
-- ``/sys/devices/system/memory/memoryXXX/valid_zones``
+	% echo 1 > /sys/devices/system/memory/memoryXXX/online
 
-=================== ============================================================
-``phys_index``      read-only and contains memory block id, same as XXX.
-``state``           read-write
+The kernel will select the target zone automatically, usually defaulting to
+``ZONE_NORMAL`` unless ``movablecore=1`` has been specified on the kernel
+command line or if the memory block would already intersect ZONE_MOVABLE.
 
-                    - at read:  contains online/offline state of memory.
-                    - at write: user can specify "online_kernel",
+One can explicitly request to associate an offline memory block with
+ZONE_MOVABLE by::
 
-                    "online_movable", "online", "offline" command
-                    which will be performed on all sections in the block.
-``phys_device``	    read-only: legacy interface only ever used on s390x to
-		    expose the covered storage increment.
-``removable``	    read-only: legacy interface that indicated whether a memory
-		    block was likely to be offlineable or not.  Newer kernel
-		    versions return "1" if and only if the kernel supports
-		    memory offlining.
-``valid_zones``     read-only: designed to show by which zone memory provided by
-		    a memory block is managed, and to show by which zone memory
-		    provided by an offline memory block could be managed when
-		    onlining.
-
-		    The first column shows it`s default zone.
-
-		    "memory6/valid_zones: Normal Movable" shows this memoryblock
-		    can be onlined to ZONE_NORMAL by default and to ZONE_MOVABLE
-		    by online_movable.
-
-		    "memory7/valid_zones: Movable Normal" shows this memoryblock
-		    can be onlined to ZONE_MOVABLE by default and to ZONE_NORMAL
-		    by online_kernel.
-=================== ============================================================
+	% echo online_movable > /sys/devices/system/memory/memoryXXX/state
 
-.. note::
+Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by::
 
-  These directories/files appear after physical memory hotplug phase.
+	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
 
-If CONFIG_NUMA is enabled the memoryXXX/ directories can also be accessed
-via symbolic links located in the ``/sys/devices/system/node/node*`` directories.
+In any case, if onlining succeeds, the state of the memory block is changed to
+be "online". If it fails, the state of the memory block will remain unchanged
+and the above commands will fail.
+
+Onlining Memory Blocks Automatically
+------------------------------------
+
+The kernel can be configured to try auto-onlining of newly added memory blocks.
+If this feature is disabled, the memory blocks will stay offline until
+explicitly onlined from user space.
 
-For example::
+The configured auto-online behavior can be observed via::
 
-	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+	% cat /sys/devices/system/memory/auto_online_blocks
 
-A backlink will also be created::
+Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or
+``online_movable`` to that file, like::
 
-	/sys/devices/system/memory/memory9/node0 -> ../../node/node0
+	% echo online > /sys/devices/system/memory/auto_online_blocks
 
-.. _memory_hotplug_physical_mem:
+Modifying the auto-online behavior will only affect subsequently added
+memory blocks.
 
-Physical memory hot-add phase
-=============================
+.. note::
 
-Hardware(Firmware) Support
---------------------------
+  In corner cases, auto-onlining can fail. The kernel won't retry. Note that
+  auto-onlining is not expected to fail in default configurations.
 
-On x86_64/ia64 platform, memory hotplug by ACPI is supported.
+.. note::
 
-In general, the firmware (ACPI) which supports memory hotplug defines
-memory class object of _HID "PNP0C80". When a notify is asserted to PNP0C80,
-Linux's ACPI handler does hot-add memory to the system and calls a hotplug udev
-script. This will be done automatically.
-
-But scripts for memory hotplug are not contained in generic udev package(now).
-You may have to write it by yourself or online/offline memory by hand.
-Please see :ref:`memory_hotplug_how_to_online_memory` and
-:ref:`memory_hotplug_how_to_offline_memory`.
-
-If firmware supports NUMA-node hotplug, and defines an object _HID "ACPI0004",
-"PNP0A05", or "PNP0A06", notification is asserted to it, and ACPI handler
-calls hotplug code for all of objects which are defined in it.
-If memory device is found, memory hotplug code will be called.
-
-Notify memory hot-add event by hand
------------------------------------
-
-On some architectures, the firmware may not notify the kernel of a memory
-hotplug event.  Therefore, the memory "probe" interface is supported to
-explicitly notify the kernel.  This interface depends on
-CONFIG_ARCH_MEMORY_PROBE and can be configured on powerpc, sh, and x86
-if hotplug is supported, although for x86 this should be handled by ACPI
-notification.
+  DLPAR on ppc64 ignores the ``offline`` setting and will still online added
+  memory blocks; if onlining fails, memory blocks are removed again.
 
-Probe interface is located at::
+Offlining Memory Blocks
+-----------------------
 
-	/sys/devices/system/memory/probe
+In the current implementation, Linux's memory offlining will try migrating all
+movable pages off the affected memory block. As most kernel allocations, such as
+page tables, are unmovable, page migration can fail and, therefore, inhibit
+memory offlining from succeeding.
 
-You can tell the physical address of new memory to the kernel by::
+Having the memory provided by a memory block managed by ZONE_MOVABLE significantly
+increases memory offlining reliability; still, memory offlining can fail in
+some corner cases.
 
-	% echo start_address_of_new_memory > /sys/devices/system/memory/probe
+Further, memory offlining might retry for a long time (or even forever), until
+aborted by the user.
 
-Then, [start_address_of_new_memory, start_address_of_new_memory +
-memory_block_size] memory range is hot-added. In this case, hotplug script is
-not called (in current implementation). You'll have to online memory by
-yourself.  Please see :ref:`memory_hotplug_how_to_online_memory`.
+Offlining of a memory block can be triggered via::
 
-Logical Memory hot-add phase
-============================
+	% echo offline > /sys/devices/system/memory/memoryXXX/state
 
-State of memory
----------------
+Or alternatively::
 
-To see (online/offline) state of a memory block, read 'state' file::
+	% echo 0 > /sys/devices/system/memory/memoryXXX/online
+
+If offlining succeeds, the state of the memory block is changed to be "offline".
+If it fails, the state of the memory block will remain unchanged and the above
+commands will fail, for example, via::
+
+	bash: echo: write error: Device or resource busy
+
+or via::
+
+	bash: echo: write error: Invalid argument
+
+Observing the State of Memory Blocks
+------------------------------------
+
+The state (online/offline/going-offline) of a memory block can be observed
+either via::
 
 	% cat /sys/device/system/memory/memoryXXX/state
 
+Or alternatively (1/0) via::
 
-- If the memory block is online, you'll read "online".
-- If the memory block is offline, you'll read "offline".
+	% cat /sys/devices/system/memory/memoryXXX/online
 
+For an online memory block, the managing zone can be observed via::
 
-.. _memory_hotplug_how_to_online_memory:
+	% cat /sys/devices/system/memory/memoryXXX/valid_zones
 
-How to online memory
---------------------
+Configuring Memory Hot(Un)Plug
+==============================
 
-When the memory is hot-added, the kernel decides whether or not to "online"
-it according to the policy which can be read from "auto_online_blocks" file::
+There are various ways in which system administrators can configure memory
+hot(un)plug and interact with memory blocks, especially to online them.
 
-	% cat /sys/devices/system/memory/auto_online_blocks
+Memory Hot(Un)Plug Configuration via Sysfs
+------------------------------------------
 
-The default depends on the CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel config
-option. If it is disabled the default is "offline" which means the newly added
-memory is not in a ready-to-use state and you have to "online" the newly added
-memory blocks manually. Automatic onlining can be requested by writing "online"
-to "auto_online_blocks" file::
+Some memory hot(un)plug properties can be configured or inspected via sysfs in::
 
-	% echo online > /sys/devices/system/memory/auto_online_blocks
+	/sys/devices/system/memory/
 
-This sets a global policy and impacts all memory blocks that will subsequently
-be hotplugged. Currently offline blocks keep their state. It is possible, under
-certain circumstances, that some memory blocks will be added but will fail to
-online. User space tools can check their "state" files
-(``/sys/devices/system/memory/memoryXXX/state``) and try to online them manually.
-
-If the automatic onlining wasn't requested, failed, or some memory block was
-offlined it is possible to change the individual block's state by writing to the
-"state" file::
+The following files are currently defined:
 
-	% echo online > /sys/devices/system/memory/memoryXXX/state
+====================== =========================================================
+``auto_online_blocks`` read-write: set or get the default state of new memory
+		       blocks; configure auto-onlining.
 
-This onlining will not change the ZONE type of the target memory block,
-If the memory block doesn't belong to any zone an appropriate kernel zone
-(usually ZONE_NORMAL) will be used unless movable_node kernel command line
-option is specified when ZONE_MOVABLE will be used.
+		       The default value depends on the
+		       CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration
+		       option.
 
-You can explicitly request to associate it with ZONE_MOVABLE by::
+		       See the ``state`` property of memory blocks for details.
+``block_size_bytes``   read-only: the size in bytes of a memory block.
+``probe``	       write-only: add (probe) selected memory blocks manually
+		       from user space by supplying the physical start address.
 
-	% echo online_movable > /sys/devices/system/memory/memoryXXX/state
+		       Availability depends on the CONFIG_ARCH_MEMORY_PROBE
+		       kernel configuration option.
+``uevent``	       read-write: generic udev file for device subsystems.
+====================== =========================================================
 
-.. note:: current limit: this memory block must be adjacent to ZONE_MOVABLE
+.. note::
 
-Or you can explicitly request a kernel zone (usually ZONE_NORMAL) by::
+  When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two
+  additional files ``hard_offline_page`` and ``soft_offline_page`` are available
+  to trigger hwpoisoning of pages, for example, for testing purposes. Note that
+  this functionality is not really related to memory hot(un)plug or actual
+  offlining of memory blocks.
+
+Memory Block Configuration via Sysfs
+------------------------------------
+
+Each memory block is represented as a memory block device that can be
+onlined or offlined. All memory blocks have their device information located in
+sysfs. Each present memory block is listed under
+``/sys/devices/system/memory`` as::
 
-	% echo online_kernel > /sys/devices/system/memory/memoryXXX/state
+	/sys/devices/system/memory/memoryXXX
 
-.. note:: current limit: this memory block must be adjacent to ZONE_NORMAL
+where XXX is the memory block id; the number of digits is variable.
 
-An explicit zone onlining can fail (e.g. when the range is already within
-and existing and incompatible zone already).
+A present memory block indicates that some memory in the range is present;
+however, a memory block might span memory holes. A memory block spanning memory
+holes cannot be offlined.
 
-After this, memory block XXX's state will be 'online' and the amount of
-available memory will be increased.
+For example, assume 1 GiB memory block size. A device for a memory starting at
+0x100000000 is ``/sys/devices/system/memory/memory4``::
 
-This may be changed in future.
+	(0x100000000 / 1 GiB = 4)
 
-Logical memory remove
-=====================
+This device covers address range [0x100000000 ... 0x140000000)
 
-Memory offline and ZONE_MOVABLE
--------------------------------
+The following files are currently defined:
 
-Memory offlining is more complicated than memory online. Because memory offline
-has to make the whole memory block be unused, memory offline can fail if
-the memory block includes memory which cannot be freed.
-
-In general, memory offline can use 2 techniques.
-
-(1) reclaim and free all memory in the memory block.
-(2) migrate all pages in the memory block.
-
-In the current implementation, Linux's memory offline uses method (2), freeing
-all  pages in the memory block by page migration. But not all pages are
-migratable. Under current Linux, migratable pages are anonymous pages and
-page caches. For offlining a memory block by migration, the kernel has to
-guarantee that the memory block contains only migratable pages.
-
-Now, a boot option for making a memory block which consists of migratable pages
-is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
-create ZONE_MOVABLE...a zone which is just used for movable pages.
-(See also Documentation/admin-guide/kernel-parameters.rst)
-
-Assume the system has "TOTAL" amount of memory at boot time, this boot option
-creates ZONE_MOVABLE as following.
-
-1) When kernelcore=YYYY boot option is used,
-   Size of memory not for movable pages (not for offline) is YYYY.
-   Size of memory for movable pages (for offline) is TOTAL-YYYY.
-
-2) When movablecore=ZZZZ boot option is used,
-   Size of memory not for movable pages (not for offline) is TOTAL - ZZZZ.
-   Size of memory for movable pages (for offline) is ZZZZ.
+=================== ============================================================
+``online``	    read-write: simplified interface to trigger onlining /
+		    offlining and to observe the state of a memory block.
+		    When onlining, the zone is selected automatically.
+``phys_device``	    read-only: legacy interface only ever used on s390x to
+		    expose the covered storage increment.
+``phys_index``	    read-only: the memory block id (XXX).
+``removable``	    read-only: legacy interface that indicated whether a memory
+		    block was likely to be offlineable or not. Nowadays, the
+		    kernel return ``1`` if and only if it supports memory
+		    offlining.
+``state``	    read-write: advanced interface to trigger onlining /
+		    offlining and to observe the state of a memory block.
+
+		    When writing, ``online``, ``offline``, ``online_kernel`` and
+		    ``online_movable`` are supported.
+
+		    ``online_movable`` specifies onlining to ZONE_MOVABLE.
+		    ``online_kernel`` specifies onlining to the default kernel
+		    zone for the memory block, such as ZONE_NORMAL.
+                    ``online`` lets the kernel select the zone automatically.
+
+		    When reading, ``online``, ``offline`` and ``going-offline``
+		    may be returned.
+``uevent``	    read-write: generic uevent file for devices.
+``valid_zones``     read-only: when a block is online, shows the zone it
+		    belongs to; when a block is offline, shows what zone will
+		    manage it once the block is onlined.
+
+		    For online memory blocks, ``DMA``, ``DMA32``, ``Normal``,
+		    ``Movable`` and ``none`` may be returned. ``none`` indicates
+		    that memory provided by a memory block is managed by
+		    multiple zones or spans multiple nodes; such memory blocks
+		    cannot be offlined. ``Movable`` indicates ZONE_MOVABLE.
+		    Other values indicate a kernel zone.
+
+		    For offline memory blocks, the first column shows the
+		    zone the kernel would select when onlining the memory block
+		    right now without further specifying a zone.
+
+		    Availability depends on the CONFIG_MEMORY_HOTREMOVE
+		    kernel configuration option.
+=================== ============================================================
 
 .. note::
 
-   Unfortunately, there is no information to show which memory block belongs
-   to ZONE_MOVABLE. This is TBD.
+  If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/
+  directories can also be accessed via symbolic links located in the
+  ``/sys/devices/system/node/node*`` directories.
+
+  For example::
+
+	/sys/devices/system/node/node0/memory9 -> ../../memory/memory9
+
+  A backlink will also be created::
+
+	/sys/devices/system/memory/memory9/node0 -> ../../node/node0
+
+Command Line Parameters
+-----------------------
+
+Some command line parameters affect memory hot(un)plug handling. The following
+command line parameters are relevant:
+
+======================== =======================================================
+``memhp_default_state``	 configure auto-onlining by essentially setting
+                         ``/sys/devices/system/memory/auto_online_blocks``.
+``movablecore``		 configure automatic zone selection of the kernel. When
+			 set, the kernel will default to ZONE_MOVABLE, unless
+			 other zones can be kept contiguous.
+======================== =======================================================
+
+Module Parameters
+------------------
+
+Instead of additional command line parameters or sysfs files, the
+``memory_hotplug`` subsystem now provides a dedicated namespace for module
+parameters. Module parameters can be set via the command line by prefixing
+them with ``memory_hotplug.`` such as::
+
+	memory_hotplug.memmap_on_memory=1
+
+and they can be observed (and some even modified at runtime) via::
+
+	/sys/module/memory_hotplug/parameters/
+
+The following module parameters are currently defined:
+
+======================== =======================================================
+``memmap_on_memory``	 read-write: Allocate memory for the memmap from the
+			 added memory block itself. Even if enabled, actual
+			 support depends on various other system properties and
+			 should only be regarded as a hint whether the behavior
+			 would be desired.
+
+			 While allocating the memmap from the memory block
+			 itself makes memory hotplug less likely to fail and
+			 keeps the memmap on the same NUMA node in any case, it
+			 can fragment physical memory in a way that huge pages
+			 in bigger granularity cannot be formed on hotplugged
+			 memory.
+======================== =======================================================
+
+ZONE_MOVABLE
+============
+
+ZONE_MOVABLE is an important mechanism for more reliable memory offlining.
+Further, having system RAM managed by ZONE_MOVABLE instead of one of the
+kernel zones can increase the number of possible transparent huge pages and
+dynamically allocated huge pages.
+
+Most kernel allocations are unmovable. Important examples include the memory
+map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations
+can only be served from the kernel zones.
+
+Most user space pages, such as anonymous memory, and page cache pages are
+movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones.
+
+Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable
+allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is
+absolutely no guarantee whether a memory block can be offlined successfully.
+
+Zone Imbalances
+---------------
+
+Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance,
+which can harm the system or degrade performance. As one example, the kernel
+might crash because it runs out of free memory for unmovable allocations,
+although there is still plenty of free memory left in ZONE_MOVABLE.
 
-   Memory offlining can fail when dissolving a free huge page on ZONE_MOVABLE
-   and the feature of freeing unused vmemmap pages associated with each hugetlb
-   page is enabled.
-
-   This can happen when we have plenty of ZONE_MOVABLE memory, but not enough
-   kernel memory to allocate vmemmmap pages.  We may even be able to migrate
-   huge page contents, but will not be able to dissolve the source huge page.
-   This will prevent an offline operation and is unfortunate as memory offlining
-   is expected to succeed on movable zones.  Users that depend on memory hotplug
-   to succeed for movable zones should carefully consider whether the memory
-   savings gained from this feature are worth the risk of possibly not being
-   able to offline memory in certain situations.
+Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1
+are definitely impossible due to the overhead for the memory map.
+
+Actual safe zone ratios depend on the workload. Extreme cases, like excessive
+long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all.
 
 .. note::
-   Techniques that rely on long-term pinnings of memory (especially, RDMA and
-   vfio) are fundamentally problematic with ZONE_MOVABLE and, therefore, memory
-   hot remove. Pinned pages cannot reside on ZONE_MOVABLE, to guarantee that
-   memory can still get hot removed - be aware that pinning can fail even if
-   there is plenty of free memory in ZONE_MOVABLE. In addition, using
-   ZONE_MOVABLE might make page pinning more expensive, because pages have to be
-   migrated off that zone first.
 
-.. _memory_hotplug_how_to_offline_memory:
+  CMA memory part of a kernel zone essentially behaves like memory in
+  ZONE_MOVABLE and similar considerations apply, especially when combining
+  CMA with ZONE_MOVABLE.
 
-How to offline memory
----------------------
+ZONE_MOVABLE Sizing Considerations
+----------------------------------
 
-You can offline a memory block by using the same sysfs interface that was used
-in memory onlining::
+We usually expect that a large portion of available system RAM will actually
+be consumed by user space, either directly or indirectly via the page cache. In
+the normal case, ZONE_MOVABLE can be used when allocating such pages just fine.
 
-	% echo offline > /sys/devices/system/memory/memoryXXX/state
+With that in mind, it makes sense that we can have a big portion of system RAM
+managed by ZONE_MOVABLE. However, there are some things to consider when using
+ZONE_MOVABLE, especially when fine-tuning zone ratios:
+
+- Having a lot of offline memory blocks. Even offline memory blocks consume
+  memory for metadata and page tables in the direct map; having a lot of offline
+  memory blocks is not a typical case, though.
+
+- Memory ballooning without balloon compaction is incompatible with
+  ZONE_MOVABLE. Only some implementations, such as virtio-balloon and
+  pseries CMM, fully support balloon compaction.
+
+  Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be
+  disabled. In that case, balloon inflation will only perform unmovable
+  allocations and silently create a zone imbalance, usually triggered by
+  inflation requests from the hypervisor.
+
+- Gigantic pages are unmovable, resulting in user space consuming a
+  lot of unmovable memory.
+
+- Huge pages are unmovable when an architecture does not support huge
+  page migration, resulting in a similar issue as with gigantic pages.
+
+- Page tables are unmovable. Excessive swapping, mapping extremely large
+  files or ZONE_DEVICE memory can be problematic, although only really relevant
+  in corner cases. When we manage a lot of user space memory that has been
+  swapped out or is served from a file/persistent memory/... we still need a lot
+  of page tables to manage that memory once user space accessed that memory.
+
+- In certain DAX configurations the memory map for the device memory will be
+  allocated from the kernel zones.
+
+- KASAN can have a significant memory overhead, for example, consuming 1/8th of
+  the total system memory size as (unmovable) tracking metadata.
+
+- Long-term pinning of pages. Techniques that rely on long-term pinnings
+  (especially, RDMA and vfio/mdev) are fundamentally problematic with
+  ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside
+  on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they
+  have to be migrated off that zone while pinning. Pinning a page can fail
+  even if there is plenty of free memory in ZONE_MOVABLE.
+
+  In addition, using ZONE_MOVABLE might make page pinning more expensive,
+  because of the page migration overhead.
+
+By default, all the memory configured at boot time is managed by the kernel
+zones and ZONE_MOVABLE is not used.
+
+To enable ZONE_MOVABLE to include the memory present at boot and to control the
+ratio between movable and kernel zones there are two command line options:
+``kernelcore=`` and ``movablecore=``. See
+Documentation/admin-guide/kernel-parameters.rst for their description.
+
+Memory Offlining and ZONE_MOVABLE
+---------------------------------
+
+Even with ZONE_MOVABLE, there are some corner cases where offlining a memory
+block might fail:
+
+- Memory blocks with memory holes; this applies to memory blocks present during
+  boot and can apply to memory blocks hotplugged via the XEN balloon and the
+  Hyper-V balloon.
+
+- Mixed NUMA nodes and mixed zones within a single memory block prevent memory
+  offlining; this applies to memory blocks present during boot only.
+
+- Special memory blocks prevented by the system from getting offlined. Examples
+  include any memory available during boot on arm64 or memory blocks spanning
+  the crashkernel area on s390x; this usually applies to memory blocks present
+  during boot only.
+
+- Memory blocks overlapping with CMA areas cannot be offlined; this applies to
+  memory blocks present during boot only.
+
+- Concurrent activity that operates on the same physical memory area, such as
+  allocating gigantic pages, can result in temporary offlining failures.
+
+- Out of memory when dissolving huge pages, especially when freeing unused
+  vmemmap pages associated with each hugetlb page is enabled.
+
+  Offlining code may be able to migrate huge page contents, but may not be able
+  to dissolve the source huge page because it fails allocating (unmovable) pages
+  for the vmemmap, because the system might not have free memory in the kernel
+  zones left.
+
+  Users that depend on memory offlining to succeed for movable zones should
+  carefully consider whether the memory savings gained from this feature are
+  worth the risk of possibly not being able to offline memory in certain
+  situations.
+
+Further, when running into out of memory situations while migrating pages, or
+when still encountering permanently unmovable pages within ZONE_MOVABLE
+(-> BUG), memory offlining will keep retrying until it eventually succeeds.
+
+When offlining is triggered from user space, the offlining context can be
+terminated by sending a fatal signal. A timeout based offlining can easily be
+implemented via::
 
-If offline succeeds, the state of the memory block is changed to be "offline".
-If it fails, some error core (like -EBUSY) will be returned by the kernel.
-Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
-it.  If it doesn't contain 'unmovable' memory, you'll get success.
-
-A memory block under ZONE_MOVABLE is considered to be able to be offlined
-easily.  But under some busy state, it may return -EBUSY. Even if a memory
-block cannot be offlined due to -EBUSY, you can retry offlining it and may be
-able to offline it (or not). (For example, a page is referred to by some kernel
-internal call and released soon.)
-
-Consideration:
-  Memory hotplug's design direction is to make the possibility of memory
-  offlining higher and to guarantee unplugging memory under any situation. But
-  it needs more work. Returning -EBUSY under some situation may be good because
-  the user can decide to retry more or not by himself. Currently, memory
-  offlining code does some amount of retry with 120 seconds timeout.
-
-Physical memory remove
-======================
-
-Need more implementation yet....
- - Notification completion of remove works by OS to firmware.
- - Guard from remove if not yet.
-
-
-Future Work
-===========
-
-  - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
-    sysctl or new control file.
-  - showing memory block and physical device relationship.
-  - test and make it better memory offlining.
-  - support HugeTLB page migration and offlining.
-  - memmap removing at memory offline.
-  - physical remove memory.
+	% timeout $TIMEOUT offline_block | failure_handling
_
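
The sysfs interface described above is easy to drive from scripts or from C; a
minimal user-space sketch (online_memory_block() is a made-up helper, not part
of the patch) that onlines one memory block by id:

	#include <stdio.h>

	/* Write "online" to /sys/devices/system/memory/memory<id>/state. */
	static int online_memory_block(unsigned int id)
	{
		char path[64];
		FILE *f;
		int ok;

		snprintf(path, sizeof(path),
			 "/sys/devices/system/memory/memory%u/state", id);
		f = fopen(path, "w");
		if (!f)
			return -1;	/* no such block, or no permission */
		ok = fputs("online", f) >= 0;
		if (fclose(f) != 0 || !ok)
			return -1;	/* e.g. onlining failed with -EBUSY */
		return 0;
	}

Offlining works the same way by writing "offline", and may legitimately fail
for the reasons listed above.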

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 036/147] mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE
  2021-09-08  2:52 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2021-09-08  2:54 ` [patch 035/147] memory-hotplug.rst: complete admin-guide overhaul Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 037/147] mm: memory_hotplug: cleanup after removal of pfn_valid_within() Andrew Morton
                   ` (111 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, david, gregkh, linux-mm, mm-commits, rafael, rppt, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE

Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

After recent updates to freeing unused parts of the memory map, no
architecture can have holes in the memory map within a pageblock.  This
makes the pfn_valid_within() check and the CONFIG_HOLES_IN_ZONE configuration
option redundant.

The first patch removes them both in a mechanical way and the second patch
simplifies memory_hotplug::test_pages_in_a_zone() that had
pfn_valid_within() surrounded by more logic than a simple if.


This patch (of 2):

After recent changes in freeing of the unused parts of the memory map and
rework of pfn_valid() in arm and arm64, there are no architectures that can
have holes in the memory map within a pageblock, and so nothing can enable
CONFIG_HOLES_IN_ZONE, which guards the non-trivial implementation of
pfn_valid_within().

With that, pfn_valid_within() is always hardwired to 1 and can be
completely removed.

Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/node.c    |    2 --
 include/linux/mmzone.h |   12 ------------
 mm/Kconfig             |    3 ---
 mm/compaction.c        |   20 +++++++-------------
 mm/memory_hotplug.c    |    4 ----
 mm/page_alloc.c        |   24 ++----------------------
 mm/page_isolation.c    |    7 +------
 mm/page_owner.c        |   14 +-------------
 8 files changed, 11 insertions(+), 75 deletions(-)

--- a/drivers/base/node.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/drivers/base/node.c
@@ -768,8 +768,6 @@ int unregister_cpu_under_node(unsigned i
 #ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
 static int __ref get_nid_for_pfn(unsigned long pfn)
 {
-	if (!pfn_valid_within(pfn))
-		return -1;
 #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
 	if (system_state < SYSTEM_RUNNING)
 		return early_pfn_to_nid(pfn);
--- a/include/linux/mmzone.h~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/include/linux/mmzone.h
@@ -1525,18 +1525,6 @@ void sparse_init(void);
 #define subsection_map_init(_pfn, _nr_pages) do {} while (0)
 #endif /* CONFIG_SPARSEMEM */
 
-/*
- * If it is possible to have holes within a MAX_ORDER_NR_PAGES, then we
- * need to check pfn validity within that MAX_ORDER_NR_PAGES block.
- * pfn_valid_within() should be used in this case; we optimise this away
- * when we have no holes within a MAX_ORDER_NR_PAGES block.
- */
-#ifdef CONFIG_HOLES_IN_ZONE
-#define pfn_valid_within(pfn) pfn_valid(pfn)
-#else
-#define pfn_valid_within(pfn) (1)
-#endif
-
 #endif /* !__GENERATING_BOUNDS.H */
 #endif /* !__ASSEMBLY__ */
 #endif /* _LINUX_MMZONE_H */
--- a/mm/compaction.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/compaction.c
@@ -306,16 +306,14 @@ __reset_isolation_pfn(struct zone *zone,
 	 * is necessary for the block to be a migration source/target.
 	 */
 	do {
-		if (pfn_valid_within(pfn)) {
-			if (check_source && PageLRU(page)) {
-				clear_pageblock_skip(page);
-				return true;
-			}
+		if (check_source && PageLRU(page)) {
+			clear_pageblock_skip(page);
+			return true;
+		}
 
-			if (check_target && PageBuddy(page)) {
-				clear_pageblock_skip(page);
-				return true;
-			}
+		if (check_target && PageBuddy(page)) {
+			clear_pageblock_skip(page);
+			return true;
 		}
 
 		page += (1 << PAGE_ALLOC_COSTLY_ORDER);
@@ -585,8 +583,6 @@ static unsigned long isolate_freepages_b
 			break;
 
 		nr_scanned++;
-		if (!pfn_valid_within(blockpfn))
-			goto isolate_fail;
 
 		/*
 		 * For compound pages such as THP and hugetlbfs, we can save
@@ -885,8 +881,6 @@ isolate_migratepages_block(struct compac
 			cond_resched();
 		}
 
-		if (!pfn_valid_within(low_pfn))
-			goto isolate_fail;
 		nr_scanned++;
 
 		page = pfn_to_page(low_pfn);
--- a/mm/Kconfig~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/Kconfig
@@ -96,9 +96,6 @@ config HAVE_FAST_GUP
 	depends on MMU
 	bool
 
-config HOLES_IN_ZONE
-	bool
-
 # Don't discard allocated memory used to track "memory" and "reserved" memblocks
 # after early boot, so it can still be used to test for validity of memory.
 # Also, memblocks are updated with memory hot(un)plug.
--- a/mm/memory_hotplug.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/memory_hotplug.c
@@ -1308,10 +1308,6 @@ struct zone *test_pages_in_a_zone(unsign
 		for (; pfn < sec_end_pfn && pfn < end_pfn;
 		     pfn += MAX_ORDER_NR_PAGES) {
 			i = 0;
-			/* This is just a CONFIG_HOLES_IN_ZONE check.*/
-			while ((i < MAX_ORDER_NR_PAGES) &&
-				!pfn_valid_within(pfn + i))
-				i++;
 			if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
 				continue;
 			/* Check if we got outside of the zone */
--- a/mm/page_alloc.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/page_alloc.c
@@ -594,8 +594,6 @@ static int page_outside_zone_boundaries(
 
 static int page_is_consistent(struct zone *zone, struct page *page)
 {
-	if (!pfn_valid_within(page_to_pfn(page)))
-		return 0;
 	if (zone != page_zone(page))
 		return 0;
 
@@ -1025,16 +1023,12 @@ buddy_merge_likely(unsigned long pfn, un
 	if (order >= MAX_ORDER - 2)
 		return false;
 
-	if (!pfn_valid_within(buddy_pfn))
-		return false;
-
 	combined_pfn = buddy_pfn & pfn;
 	higher_page = page + (combined_pfn - pfn);
 	buddy_pfn = __find_buddy_pfn(combined_pfn, order + 1);
 	higher_buddy = higher_page + (buddy_pfn - combined_pfn);
 
-	return pfn_valid_within(buddy_pfn) &&
-	       page_is_buddy(higher_page, higher_buddy, order + 1);
+	return page_is_buddy(higher_page, higher_buddy, order + 1);
 }
 
 /*
@@ -1095,8 +1089,6 @@ continue_merging:
 		buddy_pfn = __find_buddy_pfn(pfn, order);
 		buddy = page + (buddy_pfn - pfn);
 
-		if (!pfn_valid_within(buddy_pfn))
-			goto done_merging;
 		if (!page_is_buddy(page, buddy, order))
 			goto done_merging;
 		/*
@@ -1754,9 +1746,7 @@ void __init memblock_free_pages(struct p
 /*
  * Check that the whole (or subset of) a pageblock given by the interval of
  * [start_pfn, end_pfn) is valid and within the same zone, before scanning it
- * with the migration of free compaction scanner. The scanners then need to
- * use only pfn_valid_within() check for arches that allow holes within
- * pageblocks.
+ * with the migration of free compaction scanner.
  *
  * Return struct page pointer of start_pfn, or NULL if checks were not passed.
  *
@@ -1872,8 +1862,6 @@ static inline void __init pgdat_init_rep
  */
 static inline bool __init deferred_pfn_valid(unsigned long pfn)
 {
-	if (!pfn_valid_within(pfn))
-		return false;
 	if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid(pfn))
 		return false;
 	return true;
@@ -2520,11 +2508,6 @@ static int move_freepages(struct zone *z
 	int pages_moved = 0;
 
 	for (pfn = start_pfn; pfn <= end_pfn;) {
-		if (!pfn_valid_within(pfn)) {
-			pfn++;
-			continue;
-		}
-
 		page = pfn_to_page(pfn);
 		if (!PageBuddy(page)) {
 			/*
@@ -8828,9 +8811,6 @@ struct page *has_unmovable_pages(struct
 	}
 
 	for (; iter < pageblock_nr_pages - offset; iter++) {
-		if (!pfn_valid_within(pfn + iter))
-			continue;
-
 		page = pfn_to_page(pfn + iter);
 
 		/*
--- a/mm/page_isolation.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/page_isolation.c
@@ -93,8 +93,7 @@ static void unset_migratetype_isolate(st
 			buddy_pfn = __find_buddy_pfn(pfn, order);
 			buddy = page + (buddy_pfn - pfn);
 
-			if (pfn_valid_within(buddy_pfn) &&
-			    !is_migrate_isolate_page(buddy)) {
+			if (!is_migrate_isolate_page(buddy)) {
 				__isolate_free_page(page, order);
 				isolated_page = true;
 			}
@@ -250,10 +249,6 @@ __test_page_isolated_in_pageblock(unsign
 	struct page *page;
 
 	while (pfn < end_pfn) {
-		if (!pfn_valid_within(pfn)) {
-			pfn++;
-			continue;
-		}
 		page = pfn_to_page(pfn);
 		if (PageBuddy(page))
 			/*
--- a/mm/page_owner.c~mm-remove-pfn_valid_within-and-config_holes_in_zone
+++ a/mm/page_owner.c
@@ -276,9 +276,6 @@ void pagetypeinfo_showmixedcount_print(s
 		pageblock_mt = get_pageblock_migratetype(page);
 
 		for (; pfn < block_end_pfn; pfn++) {
-			if (!pfn_valid_within(pfn))
-				continue;
-
 			/* The pageblock is online, no need to recheck. */
 			page = pfn_to_page(pfn);
 
@@ -479,10 +476,6 @@ read_page_owner(struct file *file, char
 			continue;
 		}
 
-		/* Check for holes within a MAX_ORDER area */
-		if (!pfn_valid_within(pfn))
-			continue;
-
 		page = pfn_to_page(pfn);
 		if (PageBuddy(page)) {
 			unsigned long freepage_order = buddy_order_unsafe(page);
@@ -560,14 +553,9 @@ static void init_pages_in_zone(pg_data_t
 		block_end_pfn = min(block_end_pfn, end_pfn);
 
 		for (; pfn < block_end_pfn; pfn++) {
-			struct page *page;
+			struct page *page = pfn_to_page(pfn);
 			struct page_ext *page_ext;
 
-			if (!pfn_valid_within(pfn))
-				continue;
-
-			page = pfn_to_page(pfn);
-
 			if (page_zone(page) != zone)
 				continue;
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 037/147] mm: memory_hotplug: cleanup after removal of pfn_valid_within()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2021-09-08  2:54 ` [patch 036/147] mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:54 ` [patch 038/147] mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() Andrew Morton
                   ` (110 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, david, gregkh, linux-mm, mm-commits, rafael, rppt, torvalds

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: mm: memory_hotplug: cleanup after removal of pfn_valid_within()

When test_pages_in_a_zone() used pfn_valid_within(), it had some logic
surrounding the pfn_valid_within() checks.

Since pfn_valid_within() is gone, this logic can be removed.

Link: https://lkml.kernel.org/r/20210713080035.7464-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-cleanup-after-removal-of-pfn_valid_within
+++ a/mm/memory_hotplug.c
@@ -1298,7 +1298,7 @@ struct zone *test_pages_in_a_zone(unsign
 	unsigned long pfn, sec_end_pfn;
 	struct zone *zone = NULL;
 	struct page *page;
-	int i;
+
 	for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
 	     pfn < end_pfn;
 	     pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
@@ -1307,13 +1307,10 @@ struct zone *test_pages_in_a_zone(unsign
 			continue;
 		for (; pfn < sec_end_pfn && pfn < end_pfn;
 		     pfn += MAX_ORDER_NR_PAGES) {
-			i = 0;
-			if (i == MAX_ORDER_NR_PAGES || pfn + i >= end_pfn)
-				continue;
 			/* Check if we got outside of the zone */
-			if (zone && !zone_spans_pfn(zone, pfn + i))
+			if (zone && !zone_spans_pfn(zone, pfn))
 				return NULL;
-			page = pfn_to_page(pfn + i);
+			page = pfn_to_page(pfn);
 			if (zone && page_zone(page) != zone)
 				return NULL;
 			zone = page_zone(page);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 038/147] mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2021-09-08  2:54 ` [patch 037/147] mm: memory_hotplug: cleanup after removal of pfn_valid_within() Andrew Morton
@ 2021-09-08  2:54 ` Andrew Morton
  2021-09-08  2:55 ` [patch 039/147] mm/memory_hotplug: remove nid parameter from arch_remove_memory() Andrew Morton
                   ` (109 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:54 UTC (permalink / raw)
  To: akpm, aneesh.kumar, anshuman.khandual, anton, ardb, bauerman,
	benh, bhe, borntraeger, bp, catalin.marinas, cheloha,
	christophe.leroy, dalias, dan.j.williams, dave.hansen,
	dave.jiang, david, gor, hca, hpa, jasowang, joe, justin.he,
	ldufour, lenb, linux-mm, luto, mhocko, michel, mingo, mm-commits,
	mpe, mst, nathanl, npiggin, osalvador, pankaj.gupta.linux,
	pankaj.gupta, pasha.tatashin, paulus, peterz, pmorel,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, slyfox, songmuchun,
	stable, tglx, torvalds, vbabka, vishal.l.verma, vkuznets,
	wangkefeng.wang, will, ysato

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.

These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.

[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com 



This patch (of 4):

Checkpatch complained about a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int"; checkpatch is correct.

As we will search for a fitting zone using a wrong, truncated pfn, we might
end up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy the properties of
these zones.

Use "unsigned long" instead, just as we do in other places when handling
PFNs.  This can bite us once we have physical addresses in the range of
multiple TB.
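
To make the truncation concrete: with 4 KiB pages, a 32-bit PFN covers only
2^32 * 4 KiB = 16 TiB of physical address space. A minimal user-space sketch
(assuming a 64-bit machine; this is not kernel code):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long start = 0x110000000000ULL;	/* 17 TiB */
		unsigned long pfn = start >> 12;	/* correct PFN */
		unsigned int bad_pfn = start >> 12;	/* silently truncated */

		/* prints "0x110000000 vs 0x10000000" */
		printf("%#lx vs %#x\n", pfn, bad_pfn);
		return 0;
	}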

Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memory_hotplug.h |    4 ++--
 mm/memory_hotplug.c            |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-use-unsigned-long-for-pfn-in-zone_for_pfn_range
+++ a/include/linux/memory_hotplug.h
@@ -339,8 +339,8 @@ extern void sparse_remove_section(struct
 		unsigned long map_offset, struct vmem_altmap *altmap);
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
-extern struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
-		unsigned long nr_pages);
+extern struct zone *zone_for_pfn_range(int online_type, int nid,
+		unsigned long start_pfn, unsigned long nr_pages);
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
--- a/mm/memory_hotplug.c~mm-memory_hotplug-use-unsigned-long-for-pfn-in-zone_for_pfn_range
+++ a/mm/memory_hotplug.c
@@ -708,8 +708,8 @@ static inline struct zone *default_zone_
 	return movable_node_enabled ? movable_zone : kernel_zone;
 }
 
-struct zone *zone_for_pfn_range(int online_type, int nid, unsigned start_pfn,
-		unsigned long nr_pages)
+struct zone *zone_for_pfn_range(int online_type, int nid,
+		unsigned long start_pfn, unsigned long nr_pages)
 {
 	if (online_type == MMOP_ONLINE_KERNEL)
 		return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 039/147] mm/memory_hotplug: remove nid parameter from arch_remove_memory()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2021-09-08  2:54 ` [patch 038/147] mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 040/147] mm/memory_hotplug: remove nid parameter from remove_memory() and friends Andrew Morton
                   ` (108 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, aneesh.kumar, anshuman.khandual, anton, ardb, bauerman,
	benh, bhe, borntraeger, bp, catalin.marinas, cheloha,
	christophe.leroy, dalias, dan.j.williams, dave.hansen,
	dave.jiang, david, gor, hca, hpa, jasowang, joe, justin.he,
	ldufour, lenb, linux-mm, luto, mhocko, michel, mingo, mm-commits,
	mpe, mst, nathanl, npiggin, osalvador, pankaj.gupta.linux,
	pankaj.gupta, pasha.tatashin, paulus, peterz, pmorel,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, slyfox, tglx,
	torvalds, vbabka, vishal.l.verma, vkuznets, wangkefeng.wang,
	will, ysato

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove nid parameter from arch_remove_memory()

The parameter is unused, let's remove it.

Link: https://lkml.kernel.org/r/20210712124052.26491-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc]
Acked-by: Heiko Carstens <hca@linux.ibm.com>	[s390]
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: Jia He <justin.he@arm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/mmu.c            |    3 +--
 arch/ia64/mm/init.c            |    3 +--
 arch/powerpc/mm/mem.c          |    3 +--
 arch/s390/mm/init.c            |    3 +--
 arch/sh/mm/init.c              |    3 +--
 arch/x86/mm/init_32.c          |    3 +--
 arch/x86/mm/init_64.c          |    3 +--
 include/linux/memory_hotplug.h |    3 +--
 mm/memory_hotplug.c            |    4 ++--
 mm/memremap.c                  |    5 +----
 10 files changed, 11 insertions(+), 22 deletions(-)

--- a/arch/arm64/mm/mmu.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/arm64/mm/mmu.c
@@ -1502,8 +1502,7 @@ int arch_add_memory(int nid, u64 start,
 	return ret;
 }
 
-void arch_remove_memory(int nid, u64 start, u64 size,
-			struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/ia64/mm/init.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/ia64/mm/init.c
@@ -484,8 +484,7 @@ int arch_add_memory(int nid, u64 start,
 	return ret;
 }
 
-void arch_remove_memory(int nid, u64 start, u64 size,
-			struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/powerpc/mm/mem.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/powerpc/mm/mem.c
@@ -119,8 +119,7 @@ int __ref arch_add_memory(int nid, u64 s
 	return rc;
 }
 
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
-			      struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/s390/mm/init.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/s390/mm/init.c
@@ -306,8 +306,7 @@ int arch_add_memory(int nid, u64 start,
 	return rc;
 }
 
-void arch_remove_memory(int nid, u64 start, u64 size,
-			struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/sh/mm/init.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/sh/mm/init.c
@@ -414,8 +414,7 @@ int arch_add_memory(int nid, u64 start,
 	return ret;
 }
 
-void arch_remove_memory(int nid, u64 start, u64 size,
-			struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = PFN_DOWN(start);
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/x86/mm/init_32.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/x86/mm/init_32.c
@@ -801,8 +801,7 @@ int arch_add_memory(int nid, u64 start,
 	return __add_pages(nid, start_pfn, nr_pages, params);
 }
 
-void arch_remove_memory(int nid, u64 start, u64 size,
-			struct vmem_altmap *altmap)
+void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/arch/x86/mm/init_64.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/arch/x86/mm/init_64.c
@@ -1255,8 +1255,7 @@ kernel_physical_mapping_remove(unsigned
 	remove_pagetable(start, end, true, NULL);
 }
 
-void __ref arch_remove_memory(int nid, u64 start, u64 size,
-			      struct vmem_altmap *altmap)
+void __ref arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap)
 {
 	unsigned long start_pfn = start >> PAGE_SHIFT;
 	unsigned long nr_pages = size >> PAGE_SHIFT;
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/include/linux/memory_hotplug.h
@@ -130,8 +130,7 @@ static inline bool movable_node_is_enabl
 	return movable_node_enabled;
 }
 
-extern void arch_remove_memory(int nid, u64 start, u64 size,
-			       struct vmem_altmap *altmap);
+extern void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap);
 extern void __remove_pages(unsigned long start_pfn, unsigned long nr_pages,
 			   struct vmem_altmap *altmap);
 
--- a/mm/memory_hotplug.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/mm/memory_hotplug.c
@@ -1106,7 +1106,7 @@ int __ref add_memory_resource(int nid, s
 	/* create memory block devices after memory was added */
 	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
 	if (ret) {
-		arch_remove_memory(nid, start, size, NULL);
+		arch_remove_memory(start, size, NULL);
 		goto error;
 	}
 
@@ -1886,7 +1886,7 @@ static int __ref try_remove_memory(int n
 
 	mem_hotplug_begin();
 
-	arch_remove_memory(nid, start, size, altmap);
+	arch_remove_memory(start, size, altmap);
 
 	if (IS_ENABLED(CONFIG_ARCH_KEEP_MEMBLOCK)) {
 		memblock_free(start, size);
--- a/mm/memremap.c~mm-memory_hotplug-remove-nid-parameter-from-arch_remove_memory
+++ a/mm/memremap.c
@@ -140,14 +140,11 @@ static void pageunmap_range(struct dev_p
 {
 	struct range *range = &pgmap->ranges[range_id];
 	struct page *first_page;
-	int nid;
 
 	/* make sure to access a memmap that was actually initialized */
 	first_page = pfn_to_page(pfn_first(pgmap, range_id));
 
 	/* pages are dead and unused, undo the arch mapping */
-	nid = page_to_nid(first_page);
-
 	mem_hotplug_begin();
 	remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start),
 				   PHYS_PFN(range_len(range)));
@@ -155,7 +152,7 @@ static void pageunmap_range(struct dev_p
 		__remove_pages(PHYS_PFN(range->start),
 			       PHYS_PFN(range_len(range)), NULL);
 	} else {
-		arch_remove_memory(nid, range->start, range_len(range),
+		arch_remove_memory(range->start, range_len(range),
 				pgmap_altmap(pgmap));
 		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 040/147] mm/memory_hotplug: remove nid parameter from remove_memory() and friends
  2021-09-08  2:52 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2021-09-08  2:55 ` [patch 039/147] mm/memory_hotplug: remove nid parameter from arch_remove_memory() Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 041/147] ACPI: memhotplug: memory resources cannot be enabled yet Andrew Morton
                   ` (107 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, aneesh.kumar, anshuman.khandual, anton, ardb, bauerman,
	benh, bhe, borntraeger, bp, catalin.marinas, cheloha,
	christophe.leroy, dalias, dan.j.williams, dave.hansen,
	dave.jiang, david, gor, hca, hpa, jasowang, joe, justin.he,
	ldufour, lenb, linux-mm, luto, mhocko, michel, mingo, mm-commits,
	mpe, mst, nathanl, npiggin, osalvador, pankaj.gupta.linux,
	pankaj.gupta, pasha.tatashin, paulus, peterz, pmorel,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, slyfox, tglx,
	torvalds, vbabka, vishal.l.verma, vkuznets, wangkefeng.wang,
	will, ysato

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: remove nid parameter from remove_memory() and friends

There is only a single user remaining.  We can simply look up the nid, which
is only used for node offlining purposes, when walking our memory blocks.  We
don't expect to remove multi-nid ranges; and if we ever did, we most probably
don't care about removing multi-nid ranges that actually result in empty
nodes.

If ever required, we can detect the "multi-nid" scenario and simply try
offlining all online nodes.

Link: https://lkml.kernel.org/r/20210712124052.26491-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta@ionos.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/platforms/pseries/hotplug-memory.c |    9 ++--
 drivers/acpi/acpi_memhotplug.c                  |    7 ---
 drivers/dax/kmem.c                              |    3 -
 drivers/virtio/virtio_mem.c                     |    4 +-
 include/linux/memory_hotplug.h                  |   10 ++---
 mm/memory_hotplug.c                             |   28 ++++++++------
 6 files changed, 30 insertions(+), 31 deletions(-)

--- a/arch/powerpc/platforms/pseries/hotplug-memory.c~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -284,7 +284,7 @@ static int pseries_remove_memblock(unsig
 {
 	unsigned long block_sz, start_pfn;
 	int sections_per_block;
-	int i, nid;
+	int i;
 
 	start_pfn = base >> PAGE_SHIFT;
 
@@ -295,10 +295,9 @@ static int pseries_remove_memblock(unsig
 
 	block_sz = pseries_memory_block_size();
 	sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
-	nid = memory_add_physaddr_to_nid(base);
 
 	for (i = 0; i < sections_per_block; i++) {
-		__remove_memory(nid, base, MIN_MEMORY_BLOCK_SIZE);
+		__remove_memory(base, MIN_MEMORY_BLOCK_SIZE);
 		base += MIN_MEMORY_BLOCK_SIZE;
 	}
 
@@ -385,7 +384,7 @@ static int dlpar_remove_lmb(struct drmem
 
 	block_sz = pseries_memory_block_size();
 
-	__remove_memory(mem_block->nid, lmb->base_addr, block_sz);
+	__remove_memory(lmb->base_addr, block_sz);
 	put_device(&mem_block->dev);
 
 	/* Update memory regions for memory remove */
@@ -658,7 +657,7 @@ static int dlpar_add_lmb(struct drmem_lm
 
 	rc = dlpar_online_lmb(lmb);
 	if (rc) {
-		__remove_memory(nid, lmb->base_addr, block_sz);
+		__remove_memory(lmb->base_addr, block_sz);
 		invalidate_lmb_associativity_index(lmb);
 	} else {
 		lmb->flags |= DRCONF_MEM_ASSIGNED;
--- a/drivers/acpi/acpi_memhotplug.c~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/drivers/acpi/acpi_memhotplug.c
@@ -239,19 +239,14 @@ static int acpi_memory_enable_device(str
 
 static void acpi_memory_remove_memory(struct acpi_memory_device *mem_device)
 {
-	acpi_handle handle = mem_device->device->handle;
 	struct acpi_memory_info *info, *n;
-	int nid = acpi_get_node(handle);
 
 	list_for_each_entry_safe(info, n, &mem_device->res_list, list) {
 		if (!info->enabled)
 			continue;
 
-		if (nid == NUMA_NO_NODE)
-			nid = memory_add_physaddr_to_nid(info->start_addr);
-
 		acpi_unbind_memory_blocks(info);
-		__remove_memory(nid, info->start_addr, info->length);
+		__remove_memory(info->start_addr, info->length);
 		list_del(&info->list);
 		kfree(info);
 	}
--- a/drivers/dax/kmem.c~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/drivers/dax/kmem.c
@@ -156,8 +156,7 @@ static void dev_dax_kmem_remove(struct d
 		if (rc)
 			continue;
 
-		rc = remove_memory(dev_dax->target_node, range.start,
-				range_len(&range));
+		rc = remove_memory(range.start, range_len(&range));
 		if (rc == 0) {
 			release_resource(data->res[i]);
 			kfree(data->res[i]);
--- a/drivers/virtio/virtio_mem.c~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/drivers/virtio/virtio_mem.c
@@ -677,7 +677,7 @@ static int virtio_mem_remove_memory(stru
 
 	dev_dbg(&vm->vdev->dev, "removing memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
-	rc = remove_memory(vm->nid, addr, size);
+	rc = remove_memory(addr, size);
 	if (!rc) {
 		atomic64_sub(size, &vm->offline_size);
 		/*
@@ -720,7 +720,7 @@ static int virtio_mem_offline_and_remove
 		"offlining and removing memory: 0x%llx - 0x%llx\n", addr,
 		addr + size - 1);
 
-	rc = offline_and_remove_memory(vm->nid, addr, size);
+	rc = offline_and_remove_memory(addr, size);
 	if (!rc) {
 		atomic64_sub(size, &vm->offline_size);
 		/*
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/include/linux/memory_hotplug.h
@@ -292,9 +292,9 @@ static inline void pgdat_resize_init(str
 
 extern void try_offline_node(int nid);
 extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
-extern int remove_memory(int nid, u64 start, u64 size);
-extern void __remove_memory(int nid, u64 start, u64 size);
-extern int offline_and_remove_memory(int nid, u64 start, u64 size);
+extern int remove_memory(u64 start, u64 size);
+extern void __remove_memory(u64 start, u64 size);
+extern int offline_and_remove_memory(u64 start, u64 size);
 
 #else
 static inline void try_offline_node(int nid) {}
@@ -304,12 +304,12 @@ static inline int offline_pages(unsigned
 	return -EINVAL;
 }
 
-static inline int remove_memory(int nid, u64 start, u64 size)
+static inline int remove_memory(u64 start, u64 size)
 {
 	return -EBUSY;
 }
 
-static inline void __remove_memory(int nid, u64 start, u64 size) {}
+static inline void __remove_memory(u64 start, u64 size) {}
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 extern void set_zone_contiguous(struct zone *zone);
--- a/mm/memory_hotplug.c~mm-memory_hotplug-remove-nid-parameter-from-remove_memory-and-friends
+++ a/mm/memory_hotplug.c
@@ -1739,7 +1739,9 @@ failed_removal:
 static int check_memblock_offlined_cb(struct memory_block *mem, void *arg)
 {
 	int ret = !is_memblock_offlined(mem);
+	int *nid = arg;
 
+	*nid = mem->nid;
 	if (unlikely(ret)) {
 		phys_addr_t beginpa, endpa;
 
@@ -1832,12 +1834,12 @@ void try_offline_node(int nid)
 }
 EXPORT_SYMBOL(try_offline_node);
 
-static int __ref try_remove_memory(int nid, u64 start, u64 size)
+static int __ref try_remove_memory(u64 start, u64 size)
 {
-	int rc = 0;
 	struct vmem_altmap mhp_altmap = {};
 	struct vmem_altmap *altmap = NULL;
 	unsigned long nr_vmemmap_pages;
+	int rc = 0, nid = NUMA_NO_NODE;
 
 	BUG_ON(check_hotplug_memory_range(start, size));
 
@@ -1845,8 +1847,12 @@ static int __ref try_remove_memory(int n
 	 * All memory blocks must be offlined before removing memory.  Check
 	 * whether all memory blocks in question are offline and return error
 	 * if this is not the case.
+	 *
+	 * While at it, determine the nid. Note that if we'd have mixed nodes,
+	 * we'd only try to offline the last determined one -- which is good
+	 * enough for the cases we care about.
 	 */
-	rc = walk_memory_blocks(start, size, NULL, check_memblock_offlined_cb);
+	rc = walk_memory_blocks(start, size, &nid, check_memblock_offlined_cb);
 	if (rc)
 		return rc;
 
@@ -1895,7 +1901,8 @@ static int __ref try_remove_memory(int n
 
 	release_mem_region_adjustable(start, size);
 
-	try_offline_node(nid);
+	if (nid != NUMA_NO_NODE)
+		try_offline_node(nid);
 
 	mem_hotplug_done();
 	return 0;
@@ -1903,7 +1910,6 @@ static int __ref try_remove_memory(int n
 
 /**
  * __remove_memory - Remove memory if every memory block is offline
- * @nid: the node ID
  * @start: physical address of the region to remove
  * @size: size of the region to remove
  *
@@ -1911,14 +1917,14 @@ static int __ref try_remove_memory(int n
  * and online/offline operations before this call, as required by
  * try_offline_node().
  */
-void __remove_memory(int nid, u64 start, u64 size)
+void __remove_memory(u64 start, u64 size)
 {
 
 	/*
 	 * trigger BUG() if some memory is not offlined prior to calling this
 	 * function
 	 */
-	if (try_remove_memory(nid, start, size))
+	if (try_remove_memory(start, size))
 		BUG();
 }
 
@@ -1926,12 +1932,12 @@ void __remove_memory(int nid, u64 start,
  * Remove memory if every memory block is offline, otherwise return -EBUSY is
  * some memory is not offline
  */
-int remove_memory(int nid, u64 start, u64 size)
+int remove_memory(u64 start, u64 size)
 {
 	int rc;
 
 	lock_device_hotplug();
-	rc  = try_remove_memory(nid, start, size);
+	rc = try_remove_memory(start, size);
 	unlock_device_hotplug();
 
 	return rc;
@@ -1991,7 +1997,7 @@ static int try_reonline_memory_block(str
  * unplugged all memory (so it's no longer in use) and want to offline + remove
  * that memory.
  */
-int offline_and_remove_memory(int nid, u64 start, u64 size)
+int offline_and_remove_memory(u64 start, u64 size)
 {
 	const unsigned long mb_count = size / memory_block_size_bytes();
 	uint8_t *online_types, *tmp;
@@ -2027,7 +2033,7 @@ int offline_and_remove_memory(int nid, u
 	 * This cannot fail as it cannot get onlined in the meantime.
 	 */
 	if (!rc) {
-		rc = try_remove_memory(nid, start, size);
+		rc = try_remove_memory(start, size);
 		if (rc)
 			pr_err("%s: Failed to remove memory: %d", __func__, rc);
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 041/147] ACPI: memhotplug: memory resources cannot be enabled yet
  2021-09-08  2:52 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2021-09-08  2:55 ` [patch 040/147] mm/memory_hotplug: remove nid parameter from remove_memory() and friends Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 042/147] mm: track present early pages per zone Andrew Morton
                   ` (106 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, aneesh.kumar, anshuman.khandual, anton, ardb, bauerman,
	benh, bhe, borntraeger, bp, catalin.marinas, cheloha,
	christophe.leroy, dalias, dan.j.williams, dave.hansen,
	dave.jiang, david, gor, hca, hpa, jasowang, joe, justin.he,
	ldufour, lenb, linux-mm, luto, mhocko, michel, mingo, mm-commits,
	mpe, mst, nathanl, npiggin, osalvador, pankaj.gupta.linux,
	pankaj.gupta, pasha.tatashin, paulus, peterz, pmorel,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, slyfox, tglx,
	torvalds, vbabka, vishal.l.verma, vkuznets, wangkefeng.wang,
	will, ysato

From: David Hildenbrand <david@redhat.com>
Subject: ACPI: memhotplug: memory resources cannot be enabled yet

We allocate + initialize everything from scratch.  In case enabling the
device fails, we free all memory resources.

Link: https://lkml.kernel.org/r/20210712124052.26491-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Anton Blanchard <anton@ozlabs.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joe Perches <joe@perches.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michel Lespinasse <michel@lespinasse.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pierre Morel <pmorel@linux.ibm.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Rich Felker <dalias@libc.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Sergei Trofimovich <slyfox@gentoo.org>
Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/acpi/acpi_memhotplug.c |    4 ----
 1 file changed, 4 deletions(-)

--- a/drivers/acpi/acpi_memhotplug.c~acpi-memhotplug-memory-resources-cannot-be-enabled-yet
+++ a/drivers/acpi/acpi_memhotplug.c
@@ -182,10 +182,6 @@ static int acpi_memory_enable_device(str
 	 * (i.e. memory-hot-remove function)
 	 */
 	list_for_each_entry(info, &mem_device->res_list, list) {
-		if (info->enabled) { /* just sanity check...*/
-			num_enabled++;
-			continue;
-		}
 		/*
 		 * If the memory block size is zero, please ignore it.
 		 * Don't try to do the following memory hotplug flowchart.
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 042/147] mm: track present early pages per zone
  2021-09-08  2:52 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2021-09-08  2:55 ` [patch 041/147] ACPI: memhotplug: memory resources cannot be enabled yet Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 043/147] mm/memory_hotplug: introduce "auto-movable" online policy Andrew Morton
                   ` (105 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: mm: track present early pages per zone

Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

I. Goal

The goal of this series is improving in-kernel auto-online support.  It
tackles the fundamental problems that:

 1) We can create zone imbalances when onlining all memory blindly to
    ZONE_MOVABLE, in the worst case crashing the system. We have to know
    upfront how much memory we are going to hotplug such that we can
    safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
    via "online_movable". This is far from practical and only applicable in
    limited setups -- like inside VMs under the RHV/oVirt hypervisor which
    will never hotplug more than 3 times the boot memory (and the
    limitation is only in place due to the Linux limitation).

 2) We see more setups that implement dynamic VM resizing, hot(un)plugging
    memory to resize VM memory. In these setups, we might hotplug a lot of
    memory, but it might happen in various small steps in both directions
    (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
    primary driver of this upstream right now, performing such dynamic
    resizing NUMA-aware via multiple virtio-mem devices.

    Onlining all hotplugged memory to ZONE_NORMAL means we basically have
    no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
    easily run into zone imbalances when growing a VM. We want a mixture,
    and we want as much memory as reasonable/configured in ZONE_MOVABLE.
    Details regarding zone imbalances can be found at [1].

 3) Memory devices consist of 1..X memory block devices, however, the
    kernel doesn't really track the relationship. Consequently, also user
    space has no idea. We want to make per-device decisions.

    As one example, for memory hotunplug it doesn't make sense to use a
    mixture of zones within a single DIMM: we want all MOVABLE if
    possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
    block the whole DIMM from getting hotunplugged.

    As another example, virtio-mem operates on individual units that span
    1..X memory blocks. Similar to a DIMM, we want a unit to either be all
    MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
    all units of a virtio-mem device logically belong together and are
    managed (added/removed) by a single driver. We want as much memory of
    a virtio-mem device to be MOVABLE as possible.

 4) We want memory onlining to be done right from the kernel while adding
    memory, not triggered by user space via udev rules; for example, this
    is required for fast memory hotplug for drivers that add individual
    memory blocks, like virtio-mem. We want a way to configure a policy in
    the kernel and avoid implementing advanced policies in user space.

The auto-onlining support we have in the kernel is not sufficient.  All we
have is a) online everything MOVABLE (online_movable) b) online everything
!MOVABLE (online_kernel) c) keep zones contiguous (online).  This series
allows configuring c) to mean instead "online movable if possible
according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
-- a new onlining policy.
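
As a rough sketch of the new policy's core check (hypothetical helper name,
heavily simplified; the actual implementation in this series additionally
considers memory groups, NUMA nodes and CMA memory):

	/*
	 * Allow onlining nr_pages to ZONE_MOVABLE as long as MOVABLE memory
	 * stays within ratio percent of KERNEL memory (e.g., ratio == 301).
	 */
	static bool ratio_permits_movable(unsigned long kernel_pages,
					  unsigned long movable_pages,
					  unsigned long nr_pages,
					  unsigned long ratio)
	{
		return (movable_pages + nr_pages) * 100 <= kernel_pages * ratio;
	}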


II. Approach

This series does 3 things:

 1) Introduces the "auto-movable" online policy that initially operates on
    individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
    to make a decision whether a memory block will be onlined to
    ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
    memory does not allow for more MOVABLE memory (details in the
    patches). CMA memory is treated like MOVABLE memory.

 2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
    groups and uses group information to make decisions in the
    "auto-movable" online policy across memory blocks of a single memory
    device (modeled as memory group). More details can be found in patch
    #3 or in the DIMM example below.

 3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
    allowing ZONE_NORMAL memory within a dynamic memory group to allow for
    more ZONE_MOVABLE memory within the same memory group. The target use
    case is dynamic VM resizing using virtio-mem. See the virtio-mem
    example below.

I remember that the basic idea of using a ratio to implement a policy in
the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
lost the pointer to that discussion).

For me, the main use case is using it along with virtio-mem (and DIMMs /
ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
amount of memory we can hotunplug reliably again if we might eventually
hotplug a lot of memory to a VM.


III. Target Usage

The target usage will be:

 1) Linux boots with "mhp_default_online_type=offline"

 2) User space (e.g., systemd unit) configures memory onlining (according
    to a config file and system properties), for example:
    * Setting memory_hotplug.online_policy=auto-movable
    * Setting memory_hotplug.auto_movable_ratio=301
    * Setting memory_hotplug.auto_movable_numa_aware=true

 3) User space enables auto onlining via "echo online >
    /sys/devices/system/memory/auto_online_blocks"

 4) User space triggers manual onlining of all already-offline memory
    blocks (go over offline memory blocks and set them to "online")


IV. Example

For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
301% results in the following layout:
	Memory block 0-15:    DMA32   (early)
	Memory block 32-47:   Normal  (early)
	Memory block 48-79:   Movable (DIMM 0)
	Memory block 80-111:  Movable (DIMM 1)
	Memory block 112-143: Movable (DIMM 2)
	Memory block 144-175: Normal  (DIMM 3)
	Memory block 176-207: Normal  (DIMM 4)
	... all Normal
	(-> hotplugged Normal memory does not allow for more Movable memory)
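
(Assuming 128 MiB memory blocks, the arithmetic is: 4 GiB of early KERNEL
memory at a 301% ratio permits at most roughly 12 GiB of MOVABLE memory,
which is exactly DIMMs 0-2; onlining DIMM 3 as Movable would push MOVABLE to
16 GiB and exceed the allowance, so it and all later DIMMs go to Normal.)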

For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
will result in the following layout:
	Memory block 0-15:    DMA32   (early)
	Memory block 32-47:   Normal  (early)
	Memory block 48-143:  Movable (virtio-mem, first 12 GiB)
	Memory block 144:     Normal  (virtio-mem, next 128 MiB)
	Memory block 145-147: Movable (virtio-mem, next 384 MiB)
	Memory block 148:     Normal  (virtio-mem, next 128 MiB)
	Memory block 149-151: Movable (virtio-mem, next 384 MiB)
	... Normal/Movable mixture as above
	(-> hotplugged Normal memory allows for more Movable memory within
	    the same device)
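
(Same arithmetic: every 128 MiB block onlined to Normal within the dynamic
memory group raises the MOVABLE allowance by roughly 3.01 * 128 MiB, about
385 MiB, i.e. three further 128 MiB blocks, hence the repeating pattern of
one Normal block followed by three Movable blocks.)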

Which gives us maximum flexibility when dynamically growing/shrinking a
VM in smaller steps.


V. Doc Update

I'll update the memory-hotplug.rst documentation once the overhaul [1] is
upstream. Until then, details can be found in patch #2.


VI. Future Work

 1) Use memory groups for ppc64 dlpar
 2) Being able to specify a portion of (early) kernel memory that will be
    excluded from the ratio, like "128 MiB globally/per node".

    This might be helpful when starting VMs with extremely small memory
    footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
    the first hotplugged units to be onlined to ZONE_MOVABLE. One
    alternative would be a trigger to not consider ZONE_DMA memory
    in the ratio. We'll have to see if this is really required.
 3) Indicate to user space that MOVABLE might be a bad idea -- especially
    relevant when memory ballooning without support for balloon compaction
    is active.


This patch (of 9):

For implementing a new memory onlining policy, which determines when to
online memory blocks to ZONE_MOVABLE semi-automatically, we need the
number of present early (boot) pages -- present pages excluding hotplugged
pages.  Let's track these pages per zone.

Pass a page instead of the zone to adjust_present_page_count(), similarly to
adjust_managed_page_count(), and derive the zone from the page.

It's worth noting that a memory block to be offlined/onlined is either
completely "early" or "not early".  add_memory() and friends can only add
complete memory blocks and we only online/offline complete (individual)
memory blocks.

Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |   14 +++++++-------
 include/linux/memory_hotplug.h |    2 +-
 include/linux/mmzone.h         |    7 +++++++
 mm/memory_hotplug.c            |   14 +++++++++++---
 mm/page_alloc.c                |    3 +++
 5 files changed, 29 insertions(+), 11 deletions(-)

--- a/drivers/base/memory.c~mm-track-present-early-pages-per-zone
+++ a/drivers/base/memory.c
@@ -205,7 +205,8 @@ static int memory_block_online(struct me
 	 * now already properly populated.
 	 */
 	if (nr_vmemmap_pages)
-		adjust_present_page_count(zone, nr_vmemmap_pages);
+		adjust_present_page_count(pfn_to_page(start_pfn),
+					  nr_vmemmap_pages);
 
 	return ret;
 }
@@ -215,24 +216,23 @@ static int memory_block_offline(struct m
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
 	unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
-	struct zone *zone;
 	int ret;
 
 	/*
 	 * Unaccount before offlining, such that unpopulated zone and kthreads
 	 * can properly be torn down in offline_pages().
 	 */
-	if (nr_vmemmap_pages) {
-		zone = page_zone(pfn_to_page(start_pfn));
-		adjust_present_page_count(zone, -nr_vmemmap_pages);
-	}
+	if (nr_vmemmap_pages)
+		adjust_present_page_count(pfn_to_page(start_pfn),
+					  -nr_vmemmap_pages);
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
 			    nr_pages - nr_vmemmap_pages);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
 		if (nr_vmemmap_pages)
-			adjust_present_page_count(zone, nr_vmemmap_pages);
+			adjust_present_page_count(pfn_to_page(start_pfn),
+						  nr_vmemmap_pages);
 		return ret;
 	}
 
--- a/include/linux/memory_hotplug.h~mm-track-present-early-pages-per-zone
+++ a/include/linux/memory_hotplug.h
@@ -95,7 +95,7 @@ static inline void zone_seqlock_init(str
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
-extern void adjust_present_page_count(struct zone *zone, long nr_pages);
+extern void adjust_present_page_count(struct page *page, long nr_pages);
 /* VM interface that may be used by firmware interface */
 extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 				     struct zone *zone);
--- a/include/linux/mmzone.h~mm-track-present-early-pages-per-zone
+++ a/include/linux/mmzone.h
@@ -540,6 +540,10 @@ struct zone {
 	 * is calculated as:
 	 *	present_pages = spanned_pages - absent_pages(pages in holes);
 	 *
+	 * present_early_pages is present pages existing within the zone
+	 * located on memory available since early boot, excluding hotplugged
+	 * memory.
+	 *
 	 * managed_pages is present pages managed by the buddy system, which
 	 * is calculated as (reserved_pages includes pages allocated by the
 	 * bootmem allocator):
@@ -572,6 +576,9 @@ struct zone {
 	atomic_long_t		managed_pages;
 	unsigned long		spanned_pages;
 	unsigned long		present_pages;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	unsigned long		present_early_pages;
+#endif
 #ifdef CONFIG_CMA
 	unsigned long		cma_pages;
 #endif
--- a/mm/memory_hotplug.c~mm-track-present-early-pages-per-zone
+++ a/mm/memory_hotplug.c
@@ -724,8 +724,16 @@ struct zone *zone_for_pfn_range(int onli
  * This function should only be called by memory_block_{online,offline},
  * and {online,offline}_pages.
  */
-void adjust_present_page_count(struct zone *zone, long nr_pages)
+void adjust_present_page_count(struct page *page, long nr_pages)
 {
+	struct zone *zone = page_zone(page);
+
+	/*
+	 * We only support onlining/offlining/adding/removing of complete
+	 * memory blocks; therefore, all pages are either early or hotplugged.
+	 */
+	if (early_section(__pfn_to_section(page_to_pfn(page))))
+		zone->present_early_pages += nr_pages;
 	zone->present_pages += nr_pages;
 	zone->zone_pgdat->node_present_pages += nr_pages;
 }
@@ -826,7 +834,7 @@ int __ref online_pages(unsigned long pfn
 	}
 
 	online_pages_range(pfn, nr_pages);
-	adjust_present_page_count(zone, nr_pages);
+	adjust_present_page_count(pfn_to_page(pfn), nr_pages);
 
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
@@ -1697,7 +1705,7 @@ int __ref offline_pages(unsigned long st
 
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-	adjust_present_page_count(zone, -nr_pages);
+	adjust_present_page_count(pfn_to_page(start_pfn), -nr_pages);
 
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
--- a/mm/page_alloc.c~mm-track-present-early-pages-per-zone
+++ a/mm/page_alloc.c
@@ -7254,6 +7254,9 @@ static void __init calculate_node_totalp
 			zone->zone_start_pfn = 0;
 		zone->spanned_pages = size;
 		zone->present_pages = real_size;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+		zone->present_early_pages = real_size;
+#endif
 
 		totalpages += size;
 		realtotalpages += real_size;
_

* [patch 043/147] mm/memory_hotplug: introduce "auto-movable" online policy
  2021-09-08  2:52 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2021-09-08  2:55 ` [patch 042/147] mm: track present early pages per zone Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 044/147] drivers/base/memory: introduce "memory groups" to logically group memory blocks Andrew Morton
                   ` (104 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: introduce "auto-movable" online policy

When onlining without specifying a zone (using "online" instead of
"online_kernel" or "online_movable"), we currently select a zone such that
existing zones are kept contiguous.  This online policy made sense in the
past, when contiguous zones were required.

We'd like to implement smarter policies, however:

* User space has little insight.  As one example, it has no idea which
  memory blocks logically belong together (e.g., to a DIMM or to a
  virtio-mem device).

* Drivers that add memory in separate memory blocks, especially
  virtio-mem, want memory to get onlined right from the kernel when
  adding.

So we really want onlining to differing zones to be managed in the kernel,
configured by user space.

We see more and more cases where we might hotplug a lot of memory in the
future (e.g., eventually grow a 2 GiB VM to 64 GiB); however:

* Resizing happens dynamically, in smaller steps in both directions
  (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...)

* We still want as much flexibility as possible, especially,
  hotunplugging as much memory as possible later.

We can really only use "online_movable" if we know upfront the amount of
memory we are going to hotplug, and we know that it won't result in a zone
imbalance.  So in our example, a 2 GiB VM that could grow to 64 GiB
currently cannot use "online_movable"; instead, "online_kernel" would have
to be used, resulting in worse (no) memory hotunplug reliability.

Let's add a new "auto-movable" online policy that considers the current
zone ratios (global, per-node) to determine whether a memory block can be
onlined to ZONE_MOVABLE:

	MOVABLE : KERNEL

However, internally we'll only consider the following ratio for now:

	MOVABLE : KERNEL_EARLY

For now, hotplugged KERNEL memory does not allow for more MOVABLE memory,
because there is no coordination across memory devices.
In follow-up patches, we will allow for more KERNEL memory within a memory
device to allow for more MOVABLE memory within the same memory device --
which only makes sense for special memory device types.

We base our calculation on "present pages"; see the code comments for
details.  Hotplugged memory will get onlined to ZONE_MOVABLE if the
configured ratio allows for it.  Depending on the setup, this can result
in fragmented zones, which can make compaction slower and dynamic
allocation of gigantic pages when not using CMA less reliable (...  which
is already pretty unreliable).

The old policy will be the default and called "contig-zones".  In
follow-up patches, our new policy will use additional information, such as
memory groups, to make even smarter decisions across memory blocks.

Configuration:

* memory_hotplug.online_policy is used to switch between both policies
  and defaults to "contig-zones".

* memory_hotplug.auto_movable_ratio defines the maximum ratio in percent
  and defaults to "301" -- allowing e.g., most 8 GiB machines to grow to
  32 GiB and have all hotplugged memory in ZONE_MOVABLE (see the worked
  example below).  The additional percent accounts for a handful of lost
  present pages (e.g., firmware allocations).  User space is expected to
  adjust this ratio when enabling the new "auto-movable" policy, though.

* memory_hotplug.auto_movable_numa_aware considers numa node stats in
  addition to global stats, and defaults to "true".
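
As an illustrative calculation (the numbers are mine, not from the patch):
with the default ratio of 301, a machine that booted with 8 GiB of KERNEL
memory may online up to

	8 GiB * 301 / 100 ~= 24 GiB

of hotplugged memory to ZONE_MOVABLE, i.e., grow to roughly 32 GiB total.
All three knobs are module parameters, so on a running system they should
show up under /sys/module/memory_hotplug/parameters/, where e.g. writing
"auto-movable" to the "online_policy" file switches the policy.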

Note: just like the old policy, the new policy won't take things like
unmovable huge pages or memory ballooning that doesn't support balloon
compaction into account.  User space has to configure onlining
accordingly.

Link: https://lkml.kernel.org/r/20210806124715.17090-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |  191 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 191 insertions(+)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-introduce-auto-movable-online-policy
+++ a/mm/memory_hotplug.c
@@ -52,6 +52,73 @@ module_param(memmap_on_memory, bool, 044
 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
 #endif
 
+enum {
+	ONLINE_POLICY_CONTIG_ZONES = 0,
+	ONLINE_POLICY_AUTO_MOVABLE,
+};
+
+const char *online_policy_to_str[] = {
+	[ONLINE_POLICY_CONTIG_ZONES] = "contig-zones",
+	[ONLINE_POLICY_AUTO_MOVABLE] = "auto-movable",
+};
+
+static int set_online_policy(const char *val, const struct kernel_param *kp)
+{
+	int ret = sysfs_match_string(online_policy_to_str, val);
+
+	if (ret < 0)
+		return ret;
+	*((int *)kp->arg) = ret;
+	return 0;
+}
+
+static int get_online_policy(char *buffer, const struct kernel_param *kp)
+{
+	return sprintf(buffer, "%s\n", online_policy_to_str[*((int *)kp->arg)]);
+}
+
+/*
+ * memory_hotplug.online_policy: configure online behavior when onlining without
+ * specifying a zone (MMOP_ONLINE)
+ *
+ * "contig-zones": keep zone contiguous
+ * "auto-movable": online memory to ZONE_MOVABLE if the configuration
+ *                 (auto_movable_ratio, auto_movable_numa_aware) allows for it
+ */
+static int online_policy __read_mostly = ONLINE_POLICY_CONTIG_ZONES;
+static const struct kernel_param_ops online_policy_ops = {
+	.set = set_online_policy,
+	.get = get_online_policy,
+};
+module_param_cb(online_policy, &online_policy_ops, &online_policy, 0644);
+MODULE_PARM_DESC(online_policy,
+		"Set the online policy (\"contig-zones\", \"auto-movable\") "
+		"Default: \"contig-zones\"");
+
+/*
+ * memory_hotplug.auto_movable_ratio: specify maximum MOVABLE:KERNEL ratio
+ *
+ * The ratio represents an upper limit and the kernel might decide not to
+ * online some memory to ZONE_MOVABLE -- e.g., because hotplugged KERNEL memory
+ * doesn't allow for more MOVABLE memory.
+ */
+static unsigned int auto_movable_ratio __read_mostly = 301;
+module_param(auto_movable_ratio, uint, 0644);
+MODULE_PARM_DESC(auto_movable_ratio,
+		"Set the maximum ratio of MOVABLE:KERNEL memory in the system "
+		"in percent for \"auto-movable\" online policy. Default: 301");
+
+/*
+ * memory_hotplug.auto_movable_numa_aware: consider numa node stats
+ */
+#ifdef CONFIG_NUMA
+static bool auto_movable_numa_aware __read_mostly = true;
+module_param(auto_movable_numa_aware, bool, 0644);
+MODULE_PARM_DESC(auto_movable_numa_aware,
+		"Consider numa node stats in addition to global stats in "
+		"\"auto-movable\" online policy. Default: true");
+#endif /* CONFIG_NUMA */
+
 /*
  * online_page_callback contains pointer to current page onlining function.
  * Initially it is generic_online_page(). If it is required it could be
@@ -663,6 +730,61 @@ void __ref move_pfn_range_to_zone(struct
 	set_zone_contiguous(zone);
 }
 
+struct auto_movable_stats {
+	unsigned long kernel_early_pages;
+	unsigned long movable_pages;
+};
+
+static void auto_movable_stats_account_zone(struct auto_movable_stats *stats,
+					    struct zone *zone)
+{
+	if (zone_idx(zone) == ZONE_MOVABLE) {
+		stats->movable_pages += zone->present_pages;
+	} else {
+		stats->kernel_early_pages += zone->present_early_pages;
+#ifdef CONFIG_CMA
+		/*
+		 * CMA pages (never on hotplugged memory) behave like
+		 * ZONE_MOVABLE.
+		 */
+		stats->movable_pages += zone->cma_pages;
+		stats->kernel_early_pages -= zone->cma_pages;
+#endif /* CONFIG_CMA */
+	}
+}
+
+static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
+{
+	struct auto_movable_stats stats = {};
+	unsigned long kernel_early_pages, movable_pages;
+	pg_data_t *pgdat = NODE_DATA(nid);
+	struct zone *zone;
+	int i;
+
+	/* Walk all relevant zones and collect MOVABLE vs. KERNEL stats. */
+	if (nid == NUMA_NO_NODE) {
+		/* TODO: cache values */
+		for_each_populated_zone(zone)
+			auto_movable_stats_account_zone(&stats, zone);
+	} else {
+		for (i = 0; i < MAX_NR_ZONES; i++) {
+			zone = pgdat->node_zones + i;
+			if (populated_zone(zone))
+				auto_movable_stats_account_zone(&stats, zone);
+		}
+	}
+
+	kernel_early_pages = stats.kernel_early_pages;
+	movable_pages = stats.movable_pages;
+
+	/*
+	 * Test if we could online the given number of pages to ZONE_MOVABLE
+	 * and still stay in the configured ratio.
+	 */
+	movable_pages += nr_pages;
+	return movable_pages <= (auto_movable_ratio * kernel_early_pages) / 100;
+}
+
 /*
  * Returns a default kernel memory zone for the given pfn range.
  * If no kernel zone covers this pfn range it will automatically go
@@ -684,6 +806,72 @@ static struct zone *default_kernel_zone_
 	return &pgdat->node_zones[ZONE_NORMAL];
 }
 
+/*
+ * Determine to which zone to online memory dynamically based on user
+ * configuration and system stats. We care about the following ratio:
+ *
+ *   MOVABLE : KERNEL
+ *
+ * Whereby MOVABLE is memory in ZONE_MOVABLE and KERNEL is memory in
+ * one of the kernel zones. CMA pages inside one of the kernel zones really
+ * behave like ZONE_MOVABLE, so we treat them accordingly.
+ *
+ * We don't allow for hotplugged memory in a KERNEL zone to increase the
+ * amount of MOVABLE memory we can have, so we end up with:
+ *
+ *   MOVABLE : KERNEL_EARLY
+ *
+ * Whereby KERNEL_EARLY is memory in one of the kernel zones, available since
+ * boot. We base our calculation on KERNEL_EARLY internally, because:
+ *
+ * a) Hotplugged memory in one of the kernel zones can sometimes still get
+ *    hotunplugged, especially when hot(un)plugging individual memory blocks.
+ *    There is no coordination across memory devices, therefore "automatic"
+ *    hotunplugging, as implemented in hypervisors, could result in zone
+ *    imbalances.
+ * b) Early/boot memory in one of the kernel zones can usually not get
+ *    hotunplugged again (e.g., no firmware interface to unplug, fragmented
+ *    with unmovable allocations). While there are corner cases where it might
+ *    still work, it is barely relevant in practice.
+ *
+ * We rely on "present pages" instead of "managed pages", as the latter is
+ * highly unreliable and dynamic in virtualized environments, and does not
+ * consider boot time allocations. For example, memory ballooning adjusts the
+ * managed pages when inflating/deflating the balloon, and balloon compaction
+ * can even migrate inflated pages between zones.
+ *
+ * Using "present pages" is better but some things to keep in mind are:
+ *
+ * a) Some memblock allocations, such as for the crashkernel area, are
+ *    effectively unused by the kernel, yet they account to "present pages".
+ *    Fortunately, these allocations are comparatively small in relevant setups
+ *    (e.g., fraction of system memory).
+ * b) Some hotplugged memory blocks in virtualized environments, especially
+ *    hotplugged by virtio-mem, look like they are completely present, however,
+ *    only parts of the memory block are actually currently usable.
+ *    "present pages" is an upper limit that can get reached at runtime. As
+ *    we base our calculations on KERNEL_EARLY, this is not an issue.
+ */
+static struct zone *auto_movable_zone_for_pfn(int nid, unsigned long pfn,
+					      unsigned long nr_pages)
+{
+	if (!auto_movable_ratio)
+		goto kernel_zone;
+
+	if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
+		goto kernel_zone;
+
+#ifdef CONFIG_NUMA
+	if (auto_movable_numa_aware &&
+	    !auto_movable_can_online_movable(nid, nr_pages))
+		goto kernel_zone;
+#endif /* CONFIG_NUMA */
+
+	return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
+kernel_zone:
+	return default_kernel_zone_for_pfn(nid, pfn, nr_pages);
+}
+
 static inline struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
 		unsigned long nr_pages)
 {
@@ -717,6 +905,9 @@ struct zone *zone_for_pfn_range(int onli
 	if (online_type == MMOP_ONLINE_MOVABLE)
 		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
 
+	if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
+		return auto_movable_zone_for_pfn(nid, start_pfn, nr_pages);
+
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
 
_

* [patch 044/147] drivers/base/memory: introduce "memory groups" to logically group memory blocks
  2021-09-08  2:52 incoming Andrew Morton
                   ` (42 preceding siblings ...)
  2021-09-08  2:55 ` [patch 043/147] mm/memory_hotplug: introduce "auto-movable" online policy Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 045/147] mm/memory_hotplug: track present pages in memory groups Andrew Morton
                   ` (103 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: drivers/base/memory: introduce "memory groups" to logically group memory blocks

In our "auto-movable" memory onlining policy, we want to make decisions
across memory blocks of a single memory device.  Examples of memory
devices include ACPI memory devices (in the simplest case a single DIMM)
and virtio-mem.  For now, we don't have a connection between a single
memory block device and the real memory device.  Each memory device
consists of 1..X memory block devices.

Let's logically group memory blocks belonging to the same memory device in
"memory groups".  Memory groups can span multiple physical ranges and a
memory group itself does not contain any information regarding physical
ranges, only properties (e.g., "max_pages") necessary for improved memory
onlining.

Introduce two memory group types:

1) Static memory group: E.g., a single ACPI memory device, consisting
   of 1..X memory resources.  A memory group consists of 1..Y memory
   blocks.  The whole group is added/removed in one go.  If any part
   cannot get offlined, the whole group cannot be removed.

2) Dynamic memory group: E.g., a single virtio-mem device.  Memory is
   dynamically added/removed in a fixed granularity, called a "unit",
   consisting of 1..X memory blocks.  A unit is added/removed in one go. 
   If any part of a unit cannot get offlined, the whole unit cannot be
   removed.

In case of 1) we usually want either all memory managed by ZONE_MOVABLE or
none.  In case of 2) we usually want to have as many units as possible
managed by ZONE_MOVABLE.  We want a single unit to be of the same type.

For now, memory groups are an internal concept that is not exposed to user
space; we might want to change that in the future, though.

add_memory() users can specify a mgid instead of a nid when passing the
MHP_NID_IS_MGID flag.
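
A minimal usage sketch from a hypothetical driver's point of view -- the
device size, node, and variable names are made up for illustration, and
error handling is trimmed:

	/* one static group for a single 4 GiB memory device on node 0 */
	int mgid = memory_group_register_static(0, PHYS_PFN(SZ_4G));

	if (mgid < 0)
		return mgid;

	/*
	 * With MHP_NID_IS_MGID, the nid argument carries the mgid instead;
	 * "start" is the device's physical base address.
	 */
	ret = __add_memory(mgid, start, SZ_4G, MHP_NID_IS_MGID);

	/* on teardown; fails with -EBUSY while blocks still belong to it */
	memory_group_unregister(mgid);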

Link: https://lkml.kernel.org/r/20210806124715.17090-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |  159 ++++++++++++++++++++++++++++++-
 include/linux/memory.h         |   46 ++++++++
 include/linux/memory_hotplug.h |    5 
 mm/memory_hotplug.c            |   11 +-
 4 files changed, 215 insertions(+), 6 deletions(-)

--- a/drivers/base/memory.c~drivers-base-memory-introduce-memory-groups-to-logically-group-memory-blocks
+++ a/drivers/base/memory.c
@@ -82,6 +82,11 @@ static struct bus_type memory_subsys = {
  */
 static DEFINE_XARRAY(memory_blocks);
 
+/*
+ * Memory groups, indexed by memory group id (mgid).
+ */
+static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC);
+
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
 int register_memory_notifier(struct notifier_block *nb)
@@ -634,7 +639,8 @@ int register_memory(struct memory_block
 }
 
 static int init_memory_block(unsigned long block_id, unsigned long state,
-			     unsigned long nr_vmemmap_pages)
+			     unsigned long nr_vmemmap_pages,
+			     struct memory_group *group)
 {
 	struct memory_block *mem;
 	int ret = 0;
@@ -652,6 +658,12 @@ static int init_memory_block(unsigned lo
 	mem->state = state;
 	mem->nid = NUMA_NO_NODE;
 	mem->nr_vmemmap_pages = nr_vmemmap_pages;
+	INIT_LIST_HEAD(&mem->group_next);
+
+	if (group) {
+		mem->group = group;
+		list_add(&mem->group_next, &group->memory_blocks);
+	}
 
 	ret = register_memory(mem);
 
@@ -671,7 +683,7 @@ static int add_memory_block(unsigned lon
 	if (section_count == 0)
 		return 0;
 	return init_memory_block(memory_block_id(base_section_nr),
-				 MEM_ONLINE, 0);
+				 MEM_ONLINE, 0, NULL);
 }
 
 static void unregister_memory(struct memory_block *memory)
@@ -681,6 +693,11 @@ static void unregister_memory(struct mem
 
 	WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
 
+	if (memory->group) {
+		list_del(&memory->group_next);
+		memory->group = NULL;
+	}
+
 	/* drop the ref. we got via find_memory_block() */
 	put_device(&memory->dev);
 	device_unregister(&memory->dev);
@@ -694,7 +711,8 @@ static void unregister_memory(struct mem
  * Called under device_hotplug_lock.
  */
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages)
+				unsigned long vmemmap_pages,
+				struct memory_group *group)
 {
 	const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
 	unsigned long end_block_id = pfn_to_block_id(PFN_DOWN(start + size));
@@ -707,7 +725,8 @@ int create_memory_block_devices(unsigned
 		return -EINVAL;
 
 	for (block_id = start_block_id; block_id != end_block_id; block_id++) {
-		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages);
+		ret = init_memory_block(block_id, MEM_OFFLINE, vmemmap_pages,
+					group);
 		if (ret)
 			break;
 	}
@@ -891,3 +910,135 @@ int for_each_memory_block(void *arg, wal
 	return bus_for_each_dev(&memory_subsys, NULL, &cb_data,
 				for_each_memory_block_cb);
 }
+
+/*
+ * This is an internal helper to unify allocation and initialization of
+ * memory groups. Note that the passed memory group will be copied to a
+ * dynamically allocated memory group. After this call, the passed
+ * memory group should no longer be used.
+ */
+static int memory_group_register(struct memory_group group)
+{
+	struct memory_group *new_group;
+	uint32_t mgid;
+	int ret;
+
+	if (!node_possible(group.nid))
+		return -EINVAL;
+
+	new_group = kzalloc(sizeof(group), GFP_KERNEL);
+	if (!new_group)
+		return -ENOMEM;
+	*new_group = group;
+	INIT_LIST_HEAD(&new_group->memory_blocks);
+
+	ret = xa_alloc(&memory_groups, &mgid, new_group, xa_limit_31b,
+		       GFP_KERNEL);
+	if (ret) {
+		kfree(new_group);
+		return ret;
+	}
+	return mgid;
+}
+
+/**
+ * memory_group_register_static() - Register a static memory group.
+ * @nid: The node id.
+ * @max_pages: The maximum number of pages we'll have in this static memory
+ *	       group.
+ *
+ * Register a new static memory group and return the memory group id.
+ * All memory in the group belongs to a single unit, such as a DIMM. All
+ * memory belonging to a static memory group is added in one go to be removed
+ * in one go -- it's static.
+ *
+ * Returns an error if out of memory, if the node id is invalid, if no new
+ * memory groups can be registered, or if max_pages is invalid (0). Otherwise,
+ * returns the new memory group id.
+ */
+int memory_group_register_static(int nid, unsigned long max_pages)
+{
+	struct memory_group group = {
+		.nid = nid,
+		.s = {
+			.max_pages = max_pages,
+		},
+	};
+
+	if (!max_pages)
+		return -EINVAL;
+	return memory_group_register(group);
+}
+EXPORT_SYMBOL_GPL(memory_group_register_static);
+
+/**
+ * memory_group_register_dynamic() - Register a dynamic memory group.
+ * @nid: The node id.
+ * @unit_pages: Unit in pages in which memory is added/removed in this dynamic
+ *		memory group.
+ *
+ * Register a new dynamic memory group and return the memory group id.
+ * Memory within a dynamic memory group is added/removed dynamically
+ * in unit_pages.
+ *
+ * Returns an error if out of memory, if the node id is invalid, if no new
+ * memory groups can be registered, or if unit_pages is invalid (0, not a
+ * power of two, smaller than a single memory block). Otherwise, returns the
+ * new memory group id.
+ */
+int memory_group_register_dynamic(int nid, unsigned long unit_pages)
+{
+	struct memory_group group = {
+		.nid = nid,
+		.is_dynamic = true,
+		.d = {
+			.unit_pages = unit_pages,
+		},
+	};
+
+	if (!unit_pages || !is_power_of_2(unit_pages) ||
+	    unit_pages < PHYS_PFN(memory_block_size_bytes()))
+		return -EINVAL;
+	return memory_group_register(group);
+}
+EXPORT_SYMBOL_GPL(memory_group_register_dynamic);
+
+/**
+ * memory_group_unregister() - Unregister a memory group.
+ * @mgid: the memory group id
+ *
+ * Unregister a memory group. If any memory block still belongs to this
+ * memory group, unregistering will fail.
+ *
+ * Returns -EINVAL if the memory group id is invalid, returns -EBUSY if some
+ * memory blocks still belong to this memory group and returns 0 if
+ * unregistering succeeded.
+ */
+int memory_group_unregister(int mgid)
+{
+	struct memory_group *group;
+
+	if (mgid < 0)
+		return -EINVAL;
+
+	group = xa_load(&memory_groups, mgid);
+	if (!group)
+		return -EINVAL;
+	if (!list_empty(&group->memory_blocks))
+		return -EBUSY;
+	xa_erase(&memory_groups, mgid);
+	kfree(group);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(memory_group_unregister);
+
+/*
+ * This is an internal helper only to be used in core memory hotplug code to
+ * lookup a memory group. We don't care about locking, as we don't expect a
+ * memory group to get unregistered while adding memory to it -- because
+ * the group and the memory are managed by the same driver.
+ */
+struct memory_group *memory_group_find_by_id(int mgid)
+{
+	return xa_load(&memory_groups, mgid);
+}
--- a/include/linux/memory.h~drivers-base-memory-introduce-memory-groups-to-logically-group-memory-blocks
+++ a/include/linux/memory.h
@@ -23,6 +23,42 @@
 
 #define MIN_MEMORY_BLOCK_SIZE     (1UL << SECTION_SIZE_BITS)
 
+/**
+ * struct memory_group - a logical group of memory blocks
+ * @nid: The node id for all memory blocks inside the memory group.
+ * @memory_blocks: List of all memory blocks belonging to this memory group.
+ * @is_dynamic: The memory group type: static vs. dynamic
+ * @s.max_pages: Valid with &memory_group.is_dynamic == false. The maximum
+ *		 number of pages we'll have in this static memory group.
+ * @d.unit_pages: Valid with &memory_group.is_dynamic == true. Unit in pages
+ *		  in which memory is added/removed in this dynamic memory group.
+ *		  This granularity defines the alignment of a unit in physical
+ *		  address space; it has to be at least as big as a single
+ *		  memory block.
+ *
+ * A memory group logically groups memory blocks; each memory block
+ * belongs to at most one memory group. A memory group corresponds to
+ * a memory device, such as a DIMM or a NUMA node, which spans multiple
+ * memory blocks and might even span multiple non-contiguous physical memory
+ * ranges.
+ *
+ * Modification of members after registration is serialized by memory
+ * hot(un)plug code.
+ */
+struct memory_group {
+	int nid;
+	struct list_head memory_blocks;
+	bool is_dynamic;
+	union {
+		struct {
+			unsigned long max_pages;
+		} s;
+		struct {
+			unsigned long unit_pages;
+		} d;
+	};
+};
+
 struct memory_block {
 	unsigned long start_section_nr;
 	unsigned long state;		/* serialized by the dev->lock */
@@ -34,6 +70,8 @@ struct memory_block {
 	 * lay at the beginning of the memory block.
 	 */
 	unsigned long nr_vmemmap_pages;
+	struct memory_group *group;	/* group (if any) for this block */
+	struct list_head group_next;	/* next block inside memory group */
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
@@ -86,7 +124,8 @@ static inline int memory_notify(unsigned
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
 int create_memory_block_devices(unsigned long start, unsigned long size,
-				unsigned long vmemmap_pages);
+				unsigned long vmemmap_pages,
+				struct memory_group *group);
 void remove_memory_block_devices(unsigned long start, unsigned long size);
 extern void memory_dev_init(void);
 extern int memory_notify(unsigned long val, void *v);
@@ -96,6 +135,11 @@ extern int walk_memory_blocks(unsigned l
 			      void *arg, walk_memory_blocks_func_t func);
 extern int for_each_memory_block(void *arg, walk_memory_blocks_func_t func);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
+
+extern int memory_group_register_static(int nid, unsigned long max_pages);
+extern int memory_group_register_dynamic(int nid, unsigned long unit_pages);
+extern int memory_group_unregister(int mgid);
+struct memory_group *memory_group_find_by_id(int mgid);
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
--- a/include/linux/memory_hotplug.h~drivers-base-memory-introduce-memory-groups-to-logically-group-memory-blocks
+++ a/include/linux/memory_hotplug.h
@@ -50,6 +50,11 @@ typedef int __bitwise mhp_t;
  * Only selected architectures support it with SPARSE_VMEMMAP.
  */
 #define MHP_MEMMAP_ON_MEMORY   ((__force mhp_t)BIT(1))
+/*
+ * The nid field specifies a memory group id (mgid) instead. The memory group
+ * implies the node id (nid).
+ */
+#define MHP_NID_IS_MGID		((__force mhp_t)BIT(2))
 
 /*
  * Extended parameters for memory hotplug:
--- a/mm/memory_hotplug.c~drivers-base-memory-introduce-memory-groups-to-logically-group-memory-blocks
+++ a/mm/memory_hotplug.c
@@ -1258,6 +1258,7 @@ int __ref add_memory_resource(int nid, s
 {
 	struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
 	struct vmem_altmap mhp_altmap = {};
+	struct memory_group *group = NULL;
 	u64 start, size;
 	bool new_node = false;
 	int ret;
@@ -1269,6 +1270,13 @@ int __ref add_memory_resource(int nid, s
 	if (ret)
 		return ret;
 
+	if (mhp_flags & MHP_NID_IS_MGID) {
+		group = memory_group_find_by_id(nid);
+		if (!group)
+			return -EINVAL;
+		nid = group->nid;
+	}
+
 	if (!node_possible(nid)) {
 		WARN(1, "node %d was absent from the node_possible_map\n", nid);
 		return -EINVAL;
@@ -1303,7 +1311,8 @@ int __ref add_memory_resource(int nid, s
 		goto error;
 
 	/* create memory block devices after memory was added */
-	ret = create_memory_block_devices(start, size, mhp_altmap.alloc);
+	ret = create_memory_block_devices(start, size, mhp_altmap.alloc,
+					  group);
 	if (ret) {
 		arch_remove_memory(start, size, NULL);
 		goto error;
_

* [patch 045/147] mm/memory_hotplug: track present pages in memory groups
  2021-09-08  2:52 incoming Andrew Morton
                   ` (43 preceding siblings ...)
  2021-09-08  2:55 ` [patch 044/147] drivers/base/memory: introduce "memory groups" to logically group memory blocks Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 046/147] ACPI: memhotplug: use a single static memory group for a single memory device Andrew Morton
                   ` (102 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: track present pages in memory groups

Let's track all present pages in each memory group.  Especially, track
memory present in ZONE_MOVABLE and memory present in one of the kernel
zones (which really only is ZONE_NORMAL right now, as memory groups only
apply to hotplugged memory) separately within a memory group, to prepare
for making smart auto-onlining decisions for individual memory blocks
within a memory group based on group statistics.
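
For intuition, later policy decisions can then be made purely from group
statistics; a hypothetical helper (not part of this patch) could look like:

	/* does ZONE_MOVABLE already dominate this memory group? */
	static bool group_mostly_movable(struct memory_group *group)
	{
		return group->present_movable_pages >
		       group->present_kernel_pages;
	}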

Link: https://lkml.kernel.org/r/20210806124715.17090-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |   10 +++++-----
 include/linux/memory.h         |    6 ++++++
 include/linux/memory_hotplug.h |   13 +++++++++----
 mm/memory_hotplug.c            |   19 ++++++++++++++-----
 4 files changed, 34 insertions(+), 14 deletions(-)

--- a/drivers/base/memory.c~mm-memory_hotplug-track-present-pages-in-memory-groups
+++ a/drivers/base/memory.c
@@ -198,7 +198,7 @@ static int memory_block_online(struct me
 	}
 
 	ret = online_pages(start_pfn + nr_vmemmap_pages,
-			   nr_pages - nr_vmemmap_pages, zone);
+			   nr_pages - nr_vmemmap_pages, zone, mem->group);
 	if (ret) {
 		if (nr_vmemmap_pages)
 			mhp_deinit_memmap_on_memory(start_pfn, nr_vmemmap_pages);
@@ -210,7 +210,7 @@ static int memory_block_online(struct me
 	 * now already properly populated.
 	 */
 	if (nr_vmemmap_pages)
-		adjust_present_page_count(pfn_to_page(start_pfn),
+		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  nr_vmemmap_pages);
 
 	return ret;
@@ -228,16 +228,16 @@ static int memory_block_offline(struct m
 	 * can properly be torn down in offline_pages().
 	 */
 	if (nr_vmemmap_pages)
-		adjust_present_page_count(pfn_to_page(start_pfn),
+		adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
 					  -nr_vmemmap_pages);
 
 	ret = offline_pages(start_pfn + nr_vmemmap_pages,
-			    nr_pages - nr_vmemmap_pages);
+			    nr_pages - nr_vmemmap_pages, mem->group);
 	if (ret) {
 		/* offline_pages() failed. Account back. */
 		if (nr_vmemmap_pages)
 			adjust_present_page_count(pfn_to_page(start_pfn),
-						  nr_vmemmap_pages);
+						  mem->group, nr_vmemmap_pages);
 		return ret;
 	}
 
--- a/include/linux/memory.h~mm-memory_hotplug-track-present-pages-in-memory-groups
+++ a/include/linux/memory.h
@@ -27,6 +27,10 @@
  * struct memory_group - a logical group of memory blocks
  * @nid: The node id for all memory blocks inside the memory group.
  * @memory_blocks: List of all memory blocks belonging to this memory group.
+ * @present_kernel_pages: Present (online) memory outside ZONE_MOVABLE of this
+ *			  memory group.
+ * @present_movable_pages: Present (online) memory in ZONE_MOVABLE of this
+ *			   memory group.
  * @is_dynamic: The memory group type: static vs. dynamic
  * @s.max_pages: Valid with &memory_group.is_dynamic == false. The maximum
  *		 number of pages we'll have in this static memory group.
@@ -48,6 +52,8 @@
 struct memory_group {
 	int nid;
 	struct list_head memory_blocks;
+	unsigned long present_kernel_pages;
+	unsigned long present_movable_pages;
 	bool is_dynamic;
 	union {
 		struct {
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-track-present-pages-in-memory-groups
+++ a/include/linux/memory_hotplug.h
@@ -12,6 +12,7 @@ struct zone;
 struct pglist_data;
 struct mem_section;
 struct memory_block;
+struct memory_group;
 struct resource;
 struct vmem_altmap;
 
@@ -100,13 +101,15 @@ static inline void zone_seqlock_init(str
 extern int zone_grow_free_lists(struct zone *zone, unsigned long new_nr_pages);
 extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages);
 extern int add_one_highpage(struct page *page, int pfn, int bad_ppro);
-extern void adjust_present_page_count(struct page *page, long nr_pages);
+extern void adjust_present_page_count(struct page *page,
+				      struct memory_group *group,
+				      long nr_pages);
 /* VM interface that may be used by firmware interface */
 extern int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
 				     struct zone *zone);
 extern void mhp_deinit_memmap_on_memory(unsigned long pfn, unsigned long nr_pages);
 extern int online_pages(unsigned long pfn, unsigned long nr_pages,
-			struct zone *zone);
+			struct zone *zone, struct memory_group *group);
 extern struct zone *test_pages_in_a_zone(unsigned long start_pfn,
 					 unsigned long end_pfn);
 extern void __offline_isolated_pages(unsigned long start_pfn,
@@ -296,7 +299,8 @@ static inline void pgdat_resize_init(str
 #ifdef CONFIG_MEMORY_HOTREMOVE
 
 extern void try_offline_node(int nid);
-extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages);
+extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			 struct memory_group *group);
 extern int remove_memory(u64 start, u64 size);
 extern void __remove_memory(u64 start, u64 size);
 extern int offline_and_remove_memory(u64 start, u64 size);
@@ -304,7 +308,8 @@ extern int offline_and_remove_memory(u64
 #else
 static inline void try_offline_node(int nid) {}
 
-static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+static inline int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+				struct memory_group *group)
 {
 	return -EINVAL;
 }
--- a/mm/memory_hotplug.c~mm-memory_hotplug-track-present-pages-in-memory-groups
+++ a/mm/memory_hotplug.c
@@ -915,9 +915,11 @@ struct zone *zone_for_pfn_range(int onli
  * This function should only be called by memory_block_{online,offline},
  * and {online,offline}_pages.
  */
-void adjust_present_page_count(struct page *page, long nr_pages)
+void adjust_present_page_count(struct page *page, struct memory_group *group,
+			       long nr_pages)
 {
 	struct zone *zone = page_zone(page);
+	const bool movable = zone_idx(zone) == ZONE_MOVABLE;
 
 	/*
 	 * We only support onlining/offlining/adding/removing of complete
@@ -927,6 +929,11 @@ void adjust_present_page_count(struct pa
 		zone->present_early_pages += nr_pages;
 	zone->present_pages += nr_pages;
 	zone->zone_pgdat->node_present_pages += nr_pages;
+
+	if (group && movable)
+		group->present_movable_pages += nr_pages;
+	else if (group && !movable)
+		group->present_kernel_pages += nr_pages;
 }
 
 int mhp_init_memmap_on_memory(unsigned long pfn, unsigned long nr_pages,
@@ -972,7 +979,8 @@ void mhp_deinit_memmap_on_memory(unsigne
 	kasan_remove_zero_shadow(__va(PFN_PHYS(pfn)), PFN_PHYS(nr_pages));
 }
 
-int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *zone)
+int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
+		       struct zone *zone, struct memory_group *group)
 {
 	unsigned long flags;
 	int need_zonelists_rebuild = 0;
@@ -1025,7 +1033,7 @@ int __ref online_pages(unsigned long pfn
 	}
 
 	online_pages_range(pfn, nr_pages);
-	adjust_present_page_count(pfn_to_page(pfn), nr_pages);
+	adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
 
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
@@ -1769,7 +1777,8 @@ static int count_system_ram_pages_cb(uns
 	return 0;
 }
 
-int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
+int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages,
+			struct memory_group *group)
 {
 	const unsigned long end_pfn = start_pfn + nr_pages;
 	unsigned long pfn, system_ram_pages = 0;
@@ -1905,7 +1914,7 @@ int __ref offline_pages(unsigned long st
 
 	/* removal success */
 	adjust_managed_page_count(pfn_to_page(start_pfn), -nr_pages);
-	adjust_present_page_count(pfn_to_page(start_pfn), -nr_pages);
+	adjust_present_page_count(pfn_to_page(start_pfn), group, -nr_pages);
 
 	/* reinitialise watermarks and update pcp limits */
 	init_per_zone_wmark_min();
_

* [patch 046/147] ACPI: memhotplug: use a single static memory group for a single memory device
  2021-09-08  2:52 incoming Andrew Morton
                   ` (44 preceding siblings ...)
  2021-09-08  2:55 ` [patch 045/147] mm/memory_hotplug: track present pages in memory groups Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 047/147] dax/kmem: use a single static memory group for a single probed unit Andrew Morton
                   ` (101 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: ACPI: memhotplug: use a single static memory group for a single memory device

Let's group all memory we add for a single memory device - we want a
single node for that (which also seems to be the sane thing to do).

We won't care for now about memory that was already added to the system
(e.g., via e820) -- usually *all* memory of a memory device was already
added and we'll fail acpi_memory_enable_device().

Link: https://lkml.kernel.org/r/20210806124715.17090-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/acpi/acpi_memhotplug.c |   35 ++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

--- a/drivers/acpi/acpi_memhotplug.c~acpi-memhotplug-use-a-single-static-memory-group-for-a-single-memory-device
+++ a/drivers/acpi/acpi_memhotplug.c
@@ -54,6 +54,7 @@ struct acpi_memory_info {
 struct acpi_memory_device {
 	struct acpi_device *device;
 	struct list_head res_list;
+	int mgid;
 };
 
 static acpi_status
@@ -169,12 +170,33 @@ static void acpi_unbind_memory_blocks(st
 static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 {
 	acpi_handle handle = mem_device->device->handle;
+	mhp_t mhp_flags = MHP_NID_IS_MGID;
 	int result, num_enabled = 0;
 	struct acpi_memory_info *info;
-	mhp_t mhp_flags = MHP_NONE;
-	int node;
+	u64 total_length = 0;
+	int node, mgid;
 
 	node = acpi_get_node(handle);
+
+	list_for_each_entry(info, &mem_device->res_list, list) {
+		if (!info->length)
+			continue;
+		/* We want a single node for the whole memory group */
+		if (node < 0)
+			node = memory_add_physaddr_to_nid(info->start_addr);
+		total_length += info->length;
+	}
+
+	if (!total_length) {
+		dev_err(&mem_device->device->dev, "device is empty\n");
+		return -EINVAL;
+	}
+
+	mgid = memory_group_register_static(node, PFN_UP(total_length));
+	if (mgid < 0)
+		return mgid;
+	mem_device->mgid = mgid;
+
 	/*
 	 * Tell the VM there is more memory here...
 	 * Note: Assume that this function returns zero on success
@@ -188,12 +210,10 @@ static int acpi_memory_enable_device(str
 		 */
 		if (!info->length)
 			continue;
-		if (node < 0)
-			node = memory_add_physaddr_to_nid(info->start_addr);
 
 		if (mhp_supports_memmap_on_memory(info->length))
 			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
-		result = __add_memory(node, info->start_addr, info->length,
+		result = __add_memory(mgid, info->start_addr, info->length,
 				      mhp_flags);
 
 		/*
@@ -253,6 +273,10 @@ static void acpi_memory_device_free(stru
 	if (!mem_device)
 		return;
 
+	/* In case we succeeded adding *some* memory, unregistering fails. */
+	if (mem_device->mgid >= 0)
+		memory_group_unregister(mem_device->mgid);
+
 	acpi_memory_free_device_resources(mem_device);
 	mem_device->device->driver_data = NULL;
 	kfree(mem_device);
@@ -273,6 +297,7 @@ static int acpi_memory_device_add(struct
 
 	INIT_LIST_HEAD(&mem_device->res_list);
 	mem_device->device = device;
+	mem_device->mgid = -1;
 	sprintf(acpi_device_name(device), "%s", ACPI_MEMORY_DEVICE_NAME);
 	sprintf(acpi_device_class(device), "%s", ACPI_MEMORY_DEVICE_CLASS);
 	device->driver_data = mem_device;
_

* [patch 047/147] dax/kmem: use a single static memory group for a single probed unit
  2021-09-08  2:52 incoming Andrew Morton
                   ` (45 preceding siblings ...)
  2021-09-08  2:55 ` [patch 046/147] ACPI: memhotplug: use a single static memory group for a single memory device Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 048/147] virtio-mem: use a single dynamic memory group for a single virtio-mem device Andrew Morton
                   ` (100 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: dax/kmem: use a single static memory group for a single probed unit

Although dax/kmem users often disable auto-onlining and instead online
memory manually (usually to ZONE_MOVABLE), there is still value in having
auto-onlining be aware of the relationship of memory blocks.

Let's treat one probed unit as a single static memory device, similar to a
single ACPI memory device.

Link: https://lkml.kernel.org/r/20210806124715.17090-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/kmem.c |   40 ++++++++++++++++++++++++++++++++--------
 1 file changed, 32 insertions(+), 8 deletions(-)

--- a/drivers/dax/kmem.c~dax-kmem-use-a-single-static-memory-group-for-a-single-probed-unit
+++ a/drivers/dax/kmem.c
@@ -37,15 +37,16 @@ static int dax_kmem_range(struct dev_dax
 
 struct dax_kmem_data {
 	const char *res_name;
+	int mgid;
 	struct resource *res[];
 };
 
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
+	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
-	int rc = -ENOMEM;
-	int i, mapped = 0;
+	int i, rc, mapped = 0;
 	int numa_node;
 
 	/*
@@ -61,24 +62,44 @@ static int dev_dax_kmem_probe(struct dev
 		return -EINVAL;
 	}
 
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc) {
+			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
+					i, range.start, range.end);
+			continue;
+		}
+		total_len += range_len(&range);
+	}
+
+	if (!total_len) {
+		dev_warn(dev, "rejecting DAX region without any memory after alignment\n");
+		return -EINVAL;
+	}
+
 	data = kzalloc(struct_size(data, res, dev_dax->nr_range), GFP_KERNEL);
 	if (!data)
 		return -ENOMEM;
 
+	rc = -ENOMEM;
 	data->res_name = kstrdup(dev_name(dev), GFP_KERNEL);
 	if (!data->res_name)
 		goto err_res_name;
 
+	rc = memory_group_register_static(numa_node, total_len);
+	if (rc < 0)
+		goto err_reg_mgid;
+	data->mgid = rc;
+
 	for (i = 0; i < dev_dax->nr_range; i++) {
 		struct resource *res;
 		struct range range;
 
 		rc = dax_kmem_range(dev_dax, i, &range);
-		if (rc) {
-			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
-					i, range.start, range.end);
+		if (rc)
 			continue;
-		}
 
 		/* Region is permanently reserved if hotremove fails. */
 		res = request_mem_region(range.start, range_len(&range), data->res_name);
@@ -108,8 +129,8 @@ static int dev_dax_kmem_probe(struct dev
 		 * Ensure that future kexec'd kernels will not treat
 		 * this as RAM automatically.
 		 */
-		rc = add_memory_driver_managed(numa_node, range.start,
-				range_len(&range), kmem_name, MHP_NONE);
+		rc = add_memory_driver_managed(data->mgid, range.start,
+				range_len(&range), kmem_name, MHP_NID_IS_MGID);
 
 		if (rc) {
 			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
@@ -129,6 +150,8 @@ static int dev_dax_kmem_probe(struct dev
 	return 0;
 
 err_request_mem:
+	memory_group_unregister(data->mgid);
+err_reg_mgid:
 	kfree(data->res_name);
 err_res_name:
 	kfree(data);
@@ -171,6 +194,7 @@ static void dev_dax_kmem_remove(struct d
 	}
 
 	if (success >= dev_dax->nr_range) {
+		memory_group_unregister(data->mgid);
 		kfree(data->res_name);
 		kfree(data);
 		dev_set_drvdata(dev, NULL);
_

* [patch 048/147] virtio-mem: use a single dynamic memory group for a single virtio-mem device
  2021-09-08  2:52 incoming Andrew Morton
                   ` (46 preceding siblings ...)
  2021-09-08  2:55 ` [patch 047/147] dax/kmem: use a single static memory group for a single probed unit Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 049/147] mm/memory_hotplug: memory group aware "auto-movable" online policy Andrew Morton
                   ` (99 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: virtio-mem: use a single dynamic memory group for a single virtio-mem device

Let's use a single dynamic memory group.

Link: https://lkml.kernel.org/r/20210806124715.17090-8-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/virtio/virtio_mem.c |   22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)

--- a/drivers/virtio/virtio_mem.c~virtio-mem-use-a-single-dynamic-memory-group-for-a-single-virtio-mem-device
+++ a/drivers/virtio/virtio_mem.c
@@ -143,6 +143,8 @@ struct virtio_mem {
 	 * add_memory_driver_managed().
 	 */
 	const char *resource_name;
+	/* Memory group identification. */
+	int mgid;
 
 	/*
 	 * We don't want to add too much memory if it's not getting onlined,
@@ -626,8 +628,8 @@ static int virtio_mem_add_memory(struct
 		addr + size - 1);
 	/* Memory might get onlined immediately. */
 	atomic64_add(size, &vm->offline_size);
-	rc = add_memory_driver_managed(vm->nid, addr, size, vm->resource_name,
-				       MHP_MERGE_RESOURCE);
+	rc = add_memory_driver_managed(vm->mgid, addr, size, vm->resource_name,
+				       MHP_MERGE_RESOURCE | MHP_NID_IS_MGID);
 	if (rc) {
 		atomic64_sub(size, &vm->offline_size);
 		dev_warn(&vm->vdev->dev, "adding memory failed: %d\n", rc);
@@ -2569,6 +2571,7 @@ static bool virtio_mem_has_memory_added(
 static int virtio_mem_probe(struct virtio_device *vdev)
 {
 	struct virtio_mem *vm;
+	uint64_t unit_pages;
 	int rc;
 
 	BUILD_BUG_ON(sizeof(struct virtio_mem_req) != 24);
@@ -2603,6 +2606,16 @@ static int virtio_mem_probe(struct virti
 	if (rc)
 		goto out_del_vq;
 
+	/* use a single dynamic memory group to cover the whole memory device */
+	if (vm->in_sbm)
+		unit_pages = PHYS_PFN(memory_block_size_bytes());
+	else
+		unit_pages = PHYS_PFN(vm->bbm.bb_size);
+	rc = memory_group_register_dynamic(vm->nid, unit_pages);
+	if (rc < 0)
+		goto out_del_resource;
+	vm->mgid = rc;
+
 	/*
 	 * If we still have memory plugged, we have to unplug all memory first.
 	 * Registering our parent resource makes sure that this memory isn't
@@ -2617,7 +2630,7 @@ static int virtio_mem_probe(struct virti
 	vm->memory_notifier.notifier_call = virtio_mem_memory_notifier_cb;
 	rc = register_memory_notifier(&vm->memory_notifier);
 	if (rc)
-		goto out_del_resource;
+		goto out_unreg_group;
 	rc = register_virtio_mem_device(vm);
 	if (rc)
 		goto out_unreg_mem;
@@ -2631,6 +2644,8 @@ static int virtio_mem_probe(struct virti
 	return 0;
 out_unreg_mem:
 	unregister_memory_notifier(&vm->memory_notifier);
+out_unreg_group:
+	memory_group_unregister(vm->mgid);
 out_del_resource:
 	virtio_mem_delete_resource(vm);
 out_del_vq:
@@ -2695,6 +2710,7 @@ static void virtio_mem_remove(struct vir
 	} else {
 		virtio_mem_delete_resource(vm);
 		kfree_const(vm->resource_name);
+		memory_group_unregister(vm->mgid);
 	}
 
 	/* remove all tracking data - no locking needed */
_

* [patch 049/147] mm/memory_hotplug: memory group aware "auto-movable" online policy
  2021-09-08  2:52 incoming Andrew Morton
                   ` (47 preceding siblings ...)
  2021-09-08  2:55 ` [patch 048/147] virtio-mem: use a single dynamic memory group for a single virtio-mem device Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 050/147] mm/memory_hotplug: improved dynamic " Andrew Morton
                   ` (98 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: memory group aware "auto-movable" online policy

Use memory groups to improve our "auto-movable" onlining policy:

1. For static memory groups (e.g., a DIMM), online a memory block MOVABLE
   only if all other memory blocks in the group are either MOVABLE or could
   be onlined MOVABLE. A DIMM will either be MOVABLE or not, not a mixture.

2. For dynamic memory groups (e.g., a virtio-mem device), online a
   memory block MOVABLE only if all other memory blocks inside the
   current unit are either MOVABLE or could be onlined MOVABLE. For a
   virtio-mem device with a device block size of 512 MiB, all 128 MiB
   memory blocks within a 512 MiB unit will either be MOVABLE or not, not
   a mixture (see the sketch below).
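
A condensed sketch of the group-level decision (illustrative only; the
could_online_movable() helper stands in for the configured-ratio check
that auto_movable_can_online_movable() performs):

	/* Sketch for a static memory group (e.g., a DIMM): */
	static bool static_group_allows_movable(struct memory_group *group)
	{
		/* If anything is already !MOVABLE, online the rest !MOVABLE. */
		if (group->present_kernel_pages)
			return false;
		/* MOVABLE only if all remaining blocks could go MOVABLE too. */
		return could_online_movable(group->s.max_pages -
					    group->present_movable_pages);
	}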

We have to pass the memory group to zone_for_pfn_range() so that the
group can be taken into account.

Note: for now, there seems to be no compelling reason to make this
behavior configurable.

Link: https://lkml.kernel.org/r/20210806124715.17090-9-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c          |   18 ++++++-----
 include/linux/memory_hotplug.h |    3 +
 mm/memory_hotplug.c            |   48 +++++++++++++++++++++++++++++--
 3 files changed, 57 insertions(+), 12 deletions(-)

--- a/drivers/base/memory.c~mm-memory_hotplug-memory-group-aware-auto-movable-online-policy
+++ a/drivers/base/memory.c
@@ -182,7 +182,8 @@ static int memory_block_online(struct me
 	struct zone *zone;
 	int ret;
 
-	zone = zone_for_pfn_range(mem->online_type, mem->nid, start_pfn, nr_pages);
+	zone = zone_for_pfn_range(mem->online_type, mem->nid, mem->group,
+				  start_pfn, nr_pages);
 
 	/*
 	 * Although vmemmap pages have a different lifecycle than the pages
@@ -379,12 +380,13 @@ static ssize_t phys_device_show(struct d
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static int print_allowed_zone(char *buf, int len, int nid,
+			      struct memory_group *group,
 			      unsigned long start_pfn, unsigned long nr_pages,
 			      int online_type, struct zone *default_zone)
 {
 	struct zone *zone;
 
-	zone = zone_for_pfn_range(online_type, nid, start_pfn, nr_pages);
+	zone = zone_for_pfn_range(online_type, nid, group, start_pfn, nr_pages);
 	if (zone == default_zone)
 		return 0;
 
@@ -397,9 +399,10 @@ static ssize_t valid_zones_show(struct d
 	struct memory_block *mem = to_memory_block(dev);
 	unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
+	struct memory_group *group = mem->group;
 	struct zone *default_zone;
+	int nid = mem->nid;
 	int len = 0;
-	int nid;
 
 	/*
 	 * Check the existing zone. Make sure that we do that only on the
@@ -418,14 +421,13 @@ static ssize_t valid_zones_show(struct d
 		goto out;
 	}
 
-	nid = mem->nid;
-	default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, start_pfn,
-					  nr_pages);
+	default_zone = zone_for_pfn_range(MMOP_ONLINE, nid, group,
+					  start_pfn, nr_pages);
 
 	len += sysfs_emit_at(buf, len, "%s", default_zone->name);
-	len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages,
+	len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
 				  MMOP_ONLINE_KERNEL, default_zone);
-	len += print_allowed_zone(buf, len, nid, start_pfn, nr_pages,
+	len += print_allowed_zone(buf, len, nid, group, start_pfn, nr_pages,
 				  MMOP_ONLINE_MOVABLE, default_zone);
 out:
 	len += sysfs_emit_at(buf, len, "\n");
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-memory-group-aware-auto-movable-online-policy
+++ a/include/linux/memory_hotplug.h
@@ -349,7 +349,8 @@ extern void sparse_remove_section(struct
 extern struct page *sparse_decode_mem_map(unsigned long coded_mem_map,
 					  unsigned long pnum);
 extern struct zone *zone_for_pfn_range(int online_type, int nid,
-		unsigned long start_pfn, unsigned long nr_pages);
+		struct memory_group *group, unsigned long start_pfn,
+		unsigned long nr_pages);
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
--- a/mm/memory_hotplug.c~mm-memory_hotplug-memory-group-aware-auto-movable-online-policy
+++ a/mm/memory_hotplug.c
@@ -852,12 +852,53 @@ static struct zone *default_kernel_zone_
  *    "present pages" is an upper limit that can get reached at runtime. As
  *    we base our calculations on KERNEL_EARLY, this is not an issue.
  */
-static struct zone *auto_movable_zone_for_pfn(int nid, unsigned long pfn,
+static struct zone *auto_movable_zone_for_pfn(int nid,
+					      struct memory_group *group,
+					      unsigned long pfn,
 					      unsigned long nr_pages)
 {
+	unsigned long online_pages = 0, max_pages, end_pfn;
+	struct page *page;
+
 	if (!auto_movable_ratio)
 		goto kernel_zone;
 
+	if (group && !group->is_dynamic) {
+		max_pages = group->s.max_pages;
+		online_pages = group->present_movable_pages;
+
+		/* If anything is !MOVABLE online the rest !MOVABLE. */
+		if (group->present_kernel_pages)
+			goto kernel_zone;
+	} else if (!group || group->d.unit_pages == nr_pages) {
+		max_pages = nr_pages;
+	} else {
+		max_pages = group->d.unit_pages;
+		/*
+		 * Take a look at all online sections in the current unit.
+		 * We can safely assume that all pages within a section belong
+		 * to the same zone, because dynamic memory groups only deal
+		 * with hotplugged memory.
+		 */
+		pfn = ALIGN_DOWN(pfn, group->d.unit_pages);
+		end_pfn = pfn + group->d.unit_pages;
+		for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
+			page = pfn_to_online_page(pfn);
+			if (!page)
+				continue;
+			/* If anything is !MOVABLE online the rest !MOVABLE. */
+			if (page_zonenum(page) != ZONE_MOVABLE)
+				goto kernel_zone;
+			online_pages += PAGES_PER_SECTION;
+		}
+	}
+
+	/*
+	 * Online MOVABLE if we could *currently* online all remaining parts
+	 * MOVABLE. We expect to (add+) online them immediately next, so if
+	 * nobody interferes, all will be MOVABLE if possible.
+	 */
+	nr_pages = max_pages - online_pages;
 	if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
 		goto kernel_zone;
 
@@ -897,7 +938,8 @@ static inline struct zone *default_zone_
 }
 
 struct zone *zone_for_pfn_range(int online_type, int nid,
-		unsigned long start_pfn, unsigned long nr_pages)
+		struct memory_group *group, unsigned long start_pfn,
+		unsigned long nr_pages)
 {
 	if (online_type == MMOP_ONLINE_KERNEL)
 		return default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
@@ -906,7 +948,7 @@ struct zone *zone_for_pfn_range(int onli
 		return &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
 
 	if (online_policy == ONLINE_POLICY_AUTO_MOVABLE)
-		return auto_movable_zone_for_pfn(nid, start_pfn, nr_pages);
+		return auto_movable_zone_for_pfn(nid, group, start_pfn, nr_pages);
 
 	return default_zone_for_pfn(nid, start_pfn, nr_pages);
 }
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 050/147] mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy
  2021-09-08  2:52 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2021-09-08  2:55 ` [patch 049/147] mm/memory_hotplug: memory group aware "auto-movable" online policy Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 051/147] mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code Andrew Morton
                   ` (97 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, anshuman.khandual, dan.j.williams, dave.hansen, david,
	gregkh, jasowang, lenb, linux-mm, mhocko, mkedzier, mm-commits,
	mst, osalvador, pankaj.gupta.linux, pasha.tatashin,
	rafael.j.wysocki, richard.weiyang, rjw, rppt, teawater, torvalds,
	vbabka, vkuznets

From: David Hildenbrand <david@redhat.com>
Subject: mm/memory_hotplug: improved dynamic memory group aware "auto-movable" online policy

Currently, the "auto-movable" online policy does not allow for hotplugged
KERNEL (ZONE_NORMAL) memory to increase the amount of MOVABLE memory we
can have, primarily, because there is no coordiantion across memory
devices and we don't want to create zone-imbalances accidentially when
unplugging memory.

However, within a single memory device it's different.  Let's allow for
KERNEL memory within a dynamic memory group to allow for more MOVABLE
within the same memory group.  The only thing we have to take care of is
that the managing driver avoids zone imbalances by unplugging MOVABLE
memory first, otherwise there can be corner cases where unplug of memory
could result in (accidental) zone imbalances.

virtio-mem is the only user of dynamic memory groups and recently added
support for prioritizing unplug of ZONE_MOVABLE over ZONE_NORMAL, so we
don't need a new toggle to enable it for dynamic memory groups.

We limit this handling to dynamic memory groups, because:

* We want to keep the runtime overhead for collecting stats when
  onlining a single memory block small.  We tend to have only a handful of
  dynamic memory groups, but we can have quite some static memory groups
  (e.g., 256 DIMMs).

* It doesn't make too much sense for static memory groups, as we try
  onlining all applicable memory blocks either completely to ZONE_MOVABLE
  or not.  In ordinary operation, we won't have a mixture of zones within
  a static memory group.

When adding memory to a dynamic memory group, we'll first online memory to
ZONE_MOVABLE as long as early KERNEL memory allows for it.  Then, we'll
online the next unit(s) to ZONE_NORMAL, until we can online the next
unit(s) to ZONE_MOVABLE.

For a simple virtio-mem device with a MOVABLE:KERNEL ratio of 3:1, it will
result in a layout like:

  [M][M][M][M][M][M][M][M][N][M][M][M][N][M][M][M]...
  ^ movable memory due to early kernel memory
			   ^ allows for more movable memory ...
			      ^-----^ ... here
				       ^ allows for more movable memory ...
				          ^-----^ ... here

While the created layout is sub-optimal when it comes to contiguous zones,
it gives us the maximum flexibility when dynamically growing/shrinking a
device; we can grow small VMs really big in small steps, and still shrink
reliably to e.g., 1/4 of the maximum VM size in this example, removing
full memory blocks along with metadata more reliably.
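
To make the accounting concrete (a worked example, assuming
auto_movable_ratio = 300 for the 3:1 ratio above): a dynamic group that
already has 768 MiB onlined MOVABLE and 128 MiB onlined KERNEL still
requires

	768 MiB * 100 / 300 - 128 MiB = 128 MiB

of early KERNEL memory to satisfy its ratio; only early KERNEL memory left
over after summing all groups' requirements allows onlining further blocks
MOVABLE.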

Mark dynamic memory groups in the xarray such that we can efficiently
iterate over them when collecting stats.  In usual setups, we have one
virtio-mem device per NUMA node, and usually only a small number of NUMA
nodes.

Note: for now, there seems to be no compelling reason to make this
behavior configurable.

Link: https://lkml.kernel.org/r/20210806124715.17090-10-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hui Zhu <teawater@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Marek Kedzierski <mkedzier@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/memory.c  |   30 +++++++++++++++++++
 include/linux/memory.h |    3 +
 mm/memory_hotplug.c    |   60 ++++++++++++++++++++++++++++++++++++---
 3 files changed, 89 insertions(+), 4 deletions(-)

--- a/drivers/base/memory.c~mm-memory_hotplug-improved-dynamic-memory-group-aware-auto-movable-online-policy
+++ a/drivers/base/memory.c
@@ -86,6 +86,7 @@ static DEFINE_XARRAY(memory_blocks);
  * Memory groups, indexed by memory group id (mgid).
  */
 static DEFINE_XARRAY_FLAGS(memory_groups, XA_FLAGS_ALLOC);
+#define MEMORY_GROUP_MARK_DYNAMIC	XA_MARK_1
 
 static BLOCKING_NOTIFIER_HEAD(memory_chain);
 
@@ -939,6 +940,8 @@ static int memory_group_register(struct
 	if (ret) {
 		kfree(new_group);
 		return ret;
+	} else if (group.is_dynamic) {
+		xa_set_mark(&memory_groups, mgid, MEMORY_GROUP_MARK_DYNAMIC);
 	}
 	return mgid;
 }
@@ -1044,3 +1047,30 @@ struct memory_group *memory_group_find_b
 {
 	return xa_load(&memory_groups, mgid);
 }
+
+/*
+ * This is an internal helper only to be used in core memory hotplug code to
+ * walk all dynamic memory groups excluding a given memory group, either
+ * belonging to a specific node, or belonging to any node.
+ */
+int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
+			       struct memory_group *excluded, void *arg)
+{
+	struct memory_group *group;
+	unsigned long index;
+	int ret = 0;
+
+	xa_for_each_marked(&memory_groups, index, group,
+			   MEMORY_GROUP_MARK_DYNAMIC) {
+		if (group == excluded)
+			continue;
+#ifdef CONFIG_NUMA
+		if (nid != NUMA_NO_NODE && group->nid != nid)
+			continue;
+#endif /* CONFIG_NUMA */
+		ret = func(group, arg);
+		if (ret)
+			break;
+	}
+	return ret;
+}
--- a/include/linux/memory.h~mm-memory_hotplug-improved-dynamic-memory-group-aware-auto-movable-online-policy
+++ a/include/linux/memory.h
@@ -146,6 +146,9 @@ extern int memory_group_register_static(
 extern int memory_group_register_dynamic(int nid, unsigned long unit_pages);
 extern int memory_group_unregister(int mgid);
 struct memory_group *memory_group_find_by_id(int mgid);
+typedef int (*walk_memory_groups_func_t)(struct memory_group *, void *);
+int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
+			       struct memory_group *excluded, void *arg);
 #endif /* CONFIG_MEMORY_HOTPLUG_SPARSE */
 
 #ifdef CONFIG_MEMORY_HOTPLUG
--- a/mm/memory_hotplug.c~mm-memory_hotplug-improved-dynamic-memory-group-aware-auto-movable-online-policy
+++ a/mm/memory_hotplug.c
@@ -752,11 +752,44 @@ static void auto_movable_stats_account_z
 #endif /* CONFIG_CMA */
 	}
 }
+struct auto_movable_group_stats {
+	unsigned long movable_pages;
+	unsigned long req_kernel_early_pages;
+};
 
-static bool auto_movable_can_online_movable(int nid, unsigned long nr_pages)
+static int auto_movable_stats_account_group(struct memory_group *group,
+					   void *arg)
+{
+	const int ratio = READ_ONCE(auto_movable_ratio);
+	struct auto_movable_group_stats *stats = arg;
+	long pages;
+
+	/*
+	 * We don't support modifying the config while the auto-movable online
+	 * policy is already enabled. Just avoid the division by zero below.
+	 */
+	if (!ratio)
+		return 0;
+
+	/*
+	 * Calculate how many early kernel pages this group requires to
+	 * satisfy the configured zone ratio.
+	 */
+	pages = group->present_movable_pages * 100 / ratio;
+	pages -= group->present_kernel_pages;
+
+	if (pages > 0)
+		stats->req_kernel_early_pages += pages;
+	stats->movable_pages += group->present_movable_pages;
+	return 0;
+}
+
+static bool auto_movable_can_online_movable(int nid, struct memory_group *group,
+					    unsigned long nr_pages)
 {
-	struct auto_movable_stats stats = {};
 	unsigned long kernel_early_pages, movable_pages;
+	struct auto_movable_group_stats group_stats = {};
+	struct auto_movable_stats stats = {};
 	pg_data_t *pgdat = NODE_DATA(nid);
 	struct zone *zone;
 	int i;
@@ -778,6 +811,21 @@ static bool auto_movable_can_online_mova
 	movable_pages = stats.movable_pages;
 
 	/*
+	 * Kernel memory inside dynamic memory group allows for more MOVABLE
+	 * memory within the same group. Remove the effect of all but the
+	 * current group from the stats.
+	 */
+	walk_dynamic_memory_groups(nid, auto_movable_stats_account_group,
+				   group, &group_stats);
+	if (kernel_early_pages <= group_stats.req_kernel_early_pages)
+		return false;
+	kernel_early_pages -= group_stats.req_kernel_early_pages;
+	movable_pages -= group_stats.movable_pages;
+
+	if (group && group->is_dynamic)
+		kernel_early_pages += group->present_kernel_pages;
+
+	/*
 	 * Test if we could online the given number of pages to ZONE_MOVABLE
 	 * and still stay in the configured ratio.
 	 */
@@ -834,6 +882,10 @@ static struct zone *default_kernel_zone_
  *    with unmovable allocations). While there are corner cases where it might
  *    still work, it is barely relevant in practice.
  *
+ * Exceptions are dynamic memory groups, which allow for more MOVABLE
+ * memory within the same memory group -- because in that case, there is
+ * coordination within the single memory device managed by a single driver.
+ *
  * We rely on "present pages" instead of "managed pages", as the latter is
  * highly unreliable and dynamic in virtualized environments, and does not
  * consider boot time allocations. For example, memory ballooning adjusts the
@@ -899,12 +951,12 @@ static struct zone *auto_movable_zone_fo
 	 * nobody interferes, all will be MOVABLE if possible.
 	 */
 	nr_pages = max_pages - online_pages;
-	if (!auto_movable_can_online_movable(NUMA_NO_NODE, nr_pages))
+	if (!auto_movable_can_online_movable(NUMA_NO_NODE, group, nr_pages))
 		goto kernel_zone;
 
 #ifdef CONFIG_NUMA
 	if (auto_movable_numa_aware &&
-	    !auto_movable_can_online_movable(nid, nr_pages))
+	    !auto_movable_can_online_movable(nid, group, nr_pages))
 		goto kernel_zone;
 #endif /* CONFIG_NUMA */
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 051/147] mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code
  2021-09-08  2:52 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2021-09-08  2:55 ` [patch 050/147] mm/memory_hotplug: improved dynamic " Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 052/147] mm: remove redundant compound_head() calling Andrew Morton
                   ` (96 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, cgoldswo, david, linmiaohe, linux-mm, mhocko, minchan,
	mm-commits, naoya.horiguchi, osalvador, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code

Patch series "Cleanup and fixups for memory hotplug".

This series contains cleanups that use helper functions to simplify the
code.  We also fix some potential bugs.  More details can be found in the
respective changelogs.


This patch (of 3):

Use helper zone_is_zone_device() to simplify the code and remove some
explicit CONFIG_ZONE_DEVICE codes.

Link: https://lkml.kernel.org/r/20210821094246.10149-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20210821094246.10149-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Chris Goldsworthy <cgoldswo@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory_hotplug.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/memory_hotplug.c~mm-memory_hotplug-use-helper-zone_is_zone_device-to-simplify-the-code
+++ a/mm/memory_hotplug.c
@@ -477,15 +477,13 @@ void __ref remove_pfn_range_from_zone(st
 				 sizeof(struct page) * cur_nr_pages);
 	}
 
-#ifdef CONFIG_ZONE_DEVICE
 	/*
 	 * Zone shrinking code cannot properly deal with ZONE_DEVICE. So
 	 * we will not try to shrink the zones - which is okay as
 	 * set_zone_contiguous() cannot deal with ZONE_DEVICE either way.
 	 */
-	if (zone_idx(zone) == ZONE_DEVICE)
+	if (zone_is_zone_device(zone))
 		return;
-#endif
 
 	clear_zone_contiguous(zone);
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 052/147] mm: remove redundant compound_head() calling
  2021-09-08  2:52 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2021-09-08  2:55 ` [patch 051/147] mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:55 ` [patch 053/147] riscv: only select GENERIC_IOREMAP if MMU support is enabled Andrew Morton
                   ` (95 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, dhowells, hannes, kirill.shutemov, linux-mm, mm-commits,
	songmuchun, torvalds, william.kucharski, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: remove redundant compound_head() calling

There is a READ_ONCE() in the compound_head() macro, which prevents the
compiler from optimizing the code when compound_head() is called more than
once in a function.  Remove the redundant calls to compound_head() from
page_to_index() and page_add_file_rmap() for better code generation.
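
For illustration (a sketch, not part of the patch): READ_ONCE() implies a
volatile access, so the compiler must emit a separate load for every
invocation and may not merge them:

	/* Two loads the compiler cannot combine into one: */
	unsigned long a = READ_ONCE(page->compound_head);
	unsigned long b = READ_ONCE(page->compound_head);

Caching the result of compound_head() in a local variable, as the hunks
below do, avoids the duplicated loads.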

Link: https://lkml.kernel.org/r/20210811101431.83940-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pagemap.h |    7 +++----
 mm/rmap.c               |    6 ++++--
 2 files changed, 7 insertions(+), 6 deletions(-)

--- a/include/linux/pagemap.h~mm-remove-redundant-compound_head-calling
+++ a/include/linux/pagemap.h
@@ -521,18 +521,17 @@ static inline struct page *read_mapping_
  */
 static inline pgoff_t page_to_index(struct page *page)
 {
-	pgoff_t pgoff;
+	struct page *head;
 
 	if (likely(!PageTransTail(page)))
 		return page->index;
 
+	head = compound_head(page);
 	/*
 	 *  We don't initialize ->index for tail pages: calculate based on
 	 *  head page
 	 */
-	pgoff = compound_head(page)->index;
-	pgoff += page - compound_head(page);
-	return pgoff;
+	return head->index + page - head;
 }
 
 extern pgoff_t hugetlb_basepage_index(struct page *page);
--- a/mm/rmap.c~mm-remove-redundant-compound_head-calling
+++ a/mm/rmap.c
@@ -1230,11 +1230,13 @@ void page_add_file_rmap(struct page *pag
 						nr_pages);
 	} else {
 		if (PageTransCompound(page) && page_mapping(page)) {
+			struct page *head = compound_head(page);
+
 			VM_WARN_ON_ONCE(!PageLocked(page));
 
-			SetPageDoubleMap(compound_head(page));
+			SetPageDoubleMap(head);
 			if (PageMlocked(page))
-				clear_page_mlock(compound_head(page));
+				clear_page_mlock(head);
 		}
 		if (!atomic_inc_and_test(&page->_mapcount))
 			goto out;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 053/147] riscv: only select GENERIC_IOREMAP if MMU support is enabled
  2021-09-08  2:52 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2021-09-08  2:55 ` [patch 052/147] mm: remove redundant compound_head() calling Andrew Morton
@ 2021-09-08  2:55 ` Andrew Morton
  2021-09-08  2:56 ` [patch 054/147] mm: move ioremap_page_range to vmalloc.c Andrew Morton
                   ` (94 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:55 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, npiggin, peterz, torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: riscv: only select GENERIC_IOREMAP if MMU support is enabled

nommu ioremap is an inline stub in asm-generic/io.h.

Link: https://lkml.kernel.org/r/20210825072036.GA29161@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/riscv/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/riscv/Kconfig~riscv-only-select-generic_ioremap-if-mmu-support-is-enabled
+++ a/arch/riscv/Kconfig
@@ -48,7 +48,7 @@ config RISCV
 	select GENERIC_CLOCKEVENTS_BROADCAST if SMP
 	select GENERIC_EARLY_IOREMAP
 	select GENERIC_GETTIMEOFDAY if HAVE_GENERIC_VDSO
-	select GENERIC_IOREMAP
+	select GENERIC_IOREMAP if MMU
 	select GENERIC_IRQ_MULTI_HANDLER
 	select GENERIC_IRQ_SHOW
 	select GENERIC_LIB_DEVMEM_IS_ALLOWED
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 054/147] mm: move ioremap_page_range to vmalloc.c
  2021-09-08  2:52 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2021-09-08  2:55 ` [patch 053/147] riscv: only select GENERIC_IOREMAP if MMU support is enabled Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 055/147] mm: don't allow executable ioremap mappings Andrew Morton
                   ` (93 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, npiggin, peterz, torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: mm: move ioremap_page_range to vmalloc.c

Patch series "small ioremap cleanups".

The first patch moves a little code around the vmalloc/ioremap boundary
following a bigger move by Nick earlier.  The second enforces
non-executable mappings on ioremap just like we do for vmap.  No driver
currently uses executable mappings anyway, which is as it should be.


This patch (of 2):

This keeps ioremap_page_range() together with the implementation, and
allows removing the vmap_range() wrapper.

Link: https://lkml.kernel.org/r/20210824091259.1324527-1-hch@lst.de
Link: https://lkml.kernel.org/r/20210824091259.1324527-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/vmalloc.h |    3 ---
 mm/Makefile             |    3 ++-
 mm/ioremap.c            |   25 -------------------------
 mm/vmalloc.c            |   22 +++++++++++++++++-----
 4 files changed, 19 insertions(+), 34 deletions(-)

--- a/include/linux/vmalloc.h~mm-move-ioremap_page_range-to-vmallocc
+++ a/include/linux/vmalloc.h
@@ -225,9 +225,6 @@ static inline bool is_vm_area_hugepages(
 }
 
 #ifdef CONFIG_MMU
-int vmap_range(unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift);
 void vunmap_range(unsigned long addr, unsigned long end);
 static inline void set_vm_flush_reset_perms(void *addr)
 {
--- a/mm/ioremap.c~mm-move-ioremap_page_range-to-vmallocc
+++ a/mm/ioremap.c
@@ -8,33 +8,9 @@
  */
 #include <linux/vmalloc.h>
 #include <linux/mm.h>
-#include <linux/sched.h>
 #include <linux/io.h>
 #include <linux/export.h>
-#include <asm/cacheflush.h>
 
-#include "pgalloc-track.h"
-
-#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
-static unsigned int __ro_after_init iomap_max_page_shift = BITS_PER_LONG - 1;
-
-static int __init set_nohugeiomap(char *str)
-{
-	iomap_max_page_shift = PAGE_SHIFT;
-	return 0;
-}
-early_param("nohugeiomap", set_nohugeiomap);
-#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
-static const unsigned int iomap_max_page_shift = PAGE_SHIFT;
-#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
-
-int ioremap_page_range(unsigned long addr,
-		       unsigned long end, phys_addr_t phys_addr, pgprot_t prot)
-{
-	return vmap_range(addr, end, phys_addr, prot, iomap_max_page_shift);
-}
-
-#ifdef CONFIG_GENERIC_IOREMAP
 void __iomem *ioremap_prot(phys_addr_t addr, size_t size, unsigned long prot)
 {
 	unsigned long offset, vaddr;
@@ -71,4 +47,3 @@ void iounmap(volatile void __iomem *addr
 	vunmap((void *)((unsigned long)addr & PAGE_MASK));
 }
 EXPORT_SYMBOL(iounmap);
-#endif /* CONFIG_GENERIC_IOREMAP */
--- a/mm/Makefile~mm-move-ioremap_page_range-to-vmallocc
+++ a/mm/Makefile
@@ -38,7 +38,7 @@ mmu-y			:= nommu.o
 mmu-$(CONFIG_MMU)	:= highmem.o memory.o mincore.o \
 			   mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
 			   msync.o page_vma_mapped.o pagewalk.o \
-			   pgtable-generic.o rmap.o vmalloc.o ioremap.o
+			   pgtable-generic.o rmap.o vmalloc.o
 
 
 ifdef CONFIG_CROSS_MEMORY_ATTACH
@@ -128,3 +128,4 @@ obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
 obj-$(CONFIG_IO_MAPPING) += io-mapping.o
 obj-$(CONFIG_HAVE_BOOTMEM_INFO_NODE) += bootmem_info.o
+obj-$(CONFIG_GENERIC_IOREMAP) += ioremap.o
--- a/mm/vmalloc.c~mm-move-ioremap_page_range-to-vmallocc
+++ a/mm/vmalloc.c
@@ -44,6 +44,19 @@
 #include "internal.h"
 #include "pgalloc-track.h"
 
+#ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
+static unsigned int __ro_after_init ioremap_max_page_shift = BITS_PER_LONG - 1;
+
+static int __init set_nohugeiomap(char *str)
+{
+	ioremap_max_page_shift = PAGE_SHIFT;
+	return 0;
+}
+early_param("nohugeiomap", set_nohugeiomap);
+#else /* CONFIG_HAVE_ARCH_HUGE_VMAP */
+static const unsigned int ioremap_max_page_shift = PAGE_SHIFT;
+#endif	/* CONFIG_HAVE_ARCH_HUGE_VMAP */
+
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
 static bool __ro_after_init vmap_allow_huge = true;
 
@@ -298,15 +311,14 @@ static int vmap_range_noflush(unsigned l
 	return err;
 }
 
-int vmap_range(unsigned long addr, unsigned long end,
-			phys_addr_t phys_addr, pgprot_t prot,
-			unsigned int max_page_shift)
+int ioremap_page_range(unsigned long addr, unsigned long end,
+		phys_addr_t phys_addr, pgprot_t prot)
 {
 	int err;
 
-	err = vmap_range_noflush(addr, end, phys_addr, prot, max_page_shift);
+	err = vmap_range_noflush(addr, end, phys_addr, prot,
+				 ioremap_max_page_shift);
 	flush_cache_vmap(addr, end);
-
 	return err;
 }
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 055/147] mm: don't allow executable ioremap mappings
  2021-09-08  2:52 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2021-09-08  2:56 ` [patch 054/147] mm: move ioremap_page_range to vmalloc.c Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 056/147] mm/early_ioremap.c: remove redundant early_ioremap_shutdown() Andrew Morton
                   ` (92 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, npiggin, peterz, torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: mm: don't allow executable ioremap mappings

There is no need to execute from iomem (and most platforms it is
impossible anyway), so add the pgprot_nx() call similar to vmap.

Link: https://lkml.kernel.org/r/20210824091259.1324527-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmalloc.c~mm-dont-allow-executable-ioremap-mappings
+++ a/mm/vmalloc.c
@@ -316,7 +316,7 @@ int ioremap_page_range(unsigned long add
 {
 	int err;
 
-	err = vmap_range_noflush(addr, end, phys_addr, prot,
+	err = vmap_range_noflush(addr, end, phys_addr, pgprot_nx(prot),
 				 ioremap_max_page_shift);
 	flush_cache_vmap(addr, end);
 	return err;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 056/147] mm/early_ioremap.c: remove redundant early_ioremap_shutdown()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2021-09-08  2:56 ` [patch 055/147] mm: don't allow executable ioremap mappings Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 057/147] highmem: don't disable preemption on RT in kmap_atomic() Andrew Morton
                   ` (91 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, arnd, david, linux-mm, mm-commits, o451686892, torvalds

From: Weizhao Ouyang <o451686892@gmail.com>
Subject: mm/early_ioremap.c: remove redundant early_ioremap_shutdown()

early_ioremap_reset() called a weak function, early_ioremap_shutdown(), so
that architectures could provide a specific cleanup.  No architecture uses
it now, so remove this redundant function.

Link: https://lkml.kernel.org/r/20210901082917.399953-1-o451686892@gmail.com
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/early_ioremap.h |    6 ------
 mm/early_ioremap.c                  |    5 -----
 2 files changed, 11 deletions(-)

--- a/include/asm-generic/early_ioremap.h~mm-early_ioremapc-remove-redundant-early_ioremap_shutdown
+++ a/include/asm-generic/early_ioremap.h
@@ -19,12 +19,6 @@ extern void *early_memremap_prot(resourc
 extern void early_iounmap(void __iomem *addr, unsigned long size);
 extern void early_memunmap(void *addr, unsigned long size);
 
-/*
- * Weak function called by early_ioremap_reset(). It does nothing, but
- * architectures may provide their own version to do any needed cleanups.
- */
-extern void early_ioremap_shutdown(void);
-
 #if defined(CONFIG_GENERIC_EARLY_IOREMAP) && defined(CONFIG_MMU)
 /* Arch-specific initialization */
 extern void early_ioremap_init(void);
--- a/mm/early_ioremap.c~mm-early_ioremapc-remove-redundant-early_ioremap_shutdown
+++ a/mm/early_ioremap.c
@@ -38,13 +38,8 @@ pgprot_t __init __weak early_memremap_pg
 	return prot;
 }
 
-void __init __weak early_ioremap_shutdown(void)
-{
-}
-
 void __init early_ioremap_reset(void)
 {
-	early_ioremap_shutdown();
 	after_paging_init = 1;
 }
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 057/147] highmem: don't disable preemption on RT in kmap_atomic()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2021-09-08  2:56 ` [patch 056/147] mm/early_ioremap.c: remove redundant early_ioremap_shutdown() Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 058/147] mm: in_irq() cleanup Andrew Morton
                   ` (90 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, bigeasy, linux-mm, mm-commits, peterz, tglx, torvalds, vbabka

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Subject: highmem: don't disable preemption on RT in kmap_atomic()

kmap_atomic() disables preemption and pagefaults for historical reasons. 
The conversion to kmap_local(), which only disables migration, cannot be
done wholesale because quite a few call sites need to be updated to
accommodate the changed semantics.

On PREEMPT_RT enabled kernels the kmap_atomic() semantics are problematic
due to the implicit disabling of preemption which makes it impossible to
acquire 'sleeping' spinlocks within the kmap atomic sections.

PREEMPT_RT has replaced the preempt_disable() with a migrate_disable() for
more than a decade.  It could be argued that this is a justification to do
this unconditionally, but PREEMPT_RT covers only a limited number of
architectures and it disables some functionality which limits the coverage
further.

Limit the replacement to PREEMPT_RT for now.
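
An illustrative snippet of the underlying problem (some_lock is made up
for the example):

	void *addr = kmap_atomic(page);	/* implies preempt_disable() on !RT */
	spin_lock(&some_lock);		/* spinlock_t is a sleeping lock on
					 * PREEMPT_RT -- sleeping with
					 * preemption disabled is invalid */
	spin_unlock(&some_lock);
	kunmap_atomic(addr);

With migrate_disable() instead, the task stays pinned to its CPU but may
still block on the lock.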

Link: https://lkml.kernel.org/r/20210810091116.pocdmaatdcogvdso@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/highmem-internal.h |   27 ++++++++++++++++++++++-----
 1 file changed, 22 insertions(+), 5 deletions(-)

--- a/include/linux/highmem-internal.h~highmem-dont-disable-preemption-on-rt-in-kmap_atomic
+++ a/include/linux/highmem-internal.h
@@ -90,7 +90,11 @@ static inline void __kunmap_local(void *
 
 static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 {
-	preempt_disable();
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_disable();
+	else
+		preempt_disable();
+
 	pagefault_disable();
 	return __kmap_local_page_prot(page, prot);
 }
@@ -102,7 +106,11 @@ static inline void *kmap_atomic(struct p
 
 static inline void *kmap_atomic_pfn(unsigned long pfn)
 {
-	preempt_disable();
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_disable();
+	else
+		preempt_disable();
+
 	pagefault_disable();
 	return __kmap_local_pfn_prot(pfn, kmap_prot);
 }
@@ -111,7 +119,10 @@ static inline void __kunmap_atomic(void
 {
 	kunmap_local_indexed(addr);
 	pagefault_enable();
-	preempt_enable();
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_enable();
+	else
+		preempt_enable();
 }
 
 unsigned int __nr_free_highpages(void);
@@ -179,7 +190,10 @@ static inline void __kunmap_local(void *
 
 static inline void *kmap_atomic(struct page *page)
 {
-	preempt_disable();
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_disable();
+	else
+		preempt_disable();
 	pagefault_disable();
 	return page_address(page);
 }
@@ -200,7 +214,10 @@ static inline void __kunmap_atomic(void
 	kunmap_flush_on_unmap(addr);
 #endif
 	pagefault_enable();
-	preempt_enable();
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		migrate_enable();
+	else
+		preempt_enable();
 }
 
 static inline unsigned int nr_free_highpages(void) { return 0; }
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 058/147] mm: in_irq() cleanup
  2021-09-08  2:52 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2021-09-08  2:56 ` [patch 057/147] highmem: don't disable preemption on RT in kmap_atomic() Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 059/147] mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1) Andrew Morton
                   ` (89 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, catalin.marinas, changbin.du, linux-mm, mm-commits, torvalds

From: Changbin Du <changbin.du@gmail.com>
Subject: mm: in_irq() cleanup

Replace the obsolete and ambiguous macro in_irq() with the new macro
in_hardirq().

Link: https://lkml.kernel.org/r/20210813145245.86070-1-changbin.du@gmail.com
Signed-off-by: Changbin Du <changbin.du@gmail.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[kmemleak]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/highmem.c  |    2 +-
 mm/kmemleak.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/mm/highmem.c~mm-in_irq-cleanup
+++ a/mm/highmem.c
@@ -436,7 +436,7 @@ EXPORT_SYMBOL(zero_user_segments);
 
 static inline int kmap_local_idx_push(void)
 {
-	WARN_ON_ONCE(in_irq() && !irqs_disabled());
+	WARN_ON_ONCE(in_hardirq() && !irqs_disabled());
 	current->kmap_ctrl.idx += KM_INCR;
 	BUG_ON(current->kmap_ctrl.idx >= KM_MAX_IDX);
 	return current->kmap_ctrl.idx - 1;
--- a/mm/kmemleak.c~mm-in_irq-cleanup
+++ a/mm/kmemleak.c
@@ -598,7 +598,7 @@ static struct kmemleak_object *create_ob
 	object->checksum = 0;
 
 	/* task information */
-	if (in_irq()) {
+	if (in_hardirq()) {
 		object->pid = 0;
 		strncpy(object->comm, "hardirq", sizeof(object->comm));
 	} else if (in_serving_softirq()) {
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 059/147] mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1)
  2021-09-08  2:52 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2021-09-08  2:56 ` [patch 058/147] mm: in_irq() cleanup Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 060/147] mm/secretmem: use refcount_t instead of atomic_t Andrew Morton
                   ` (88 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mhocko, mm-commits, shakeelb,
	songmuchun, torvalds, vdavydov.dev, willy

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1)

Instead of hard-coding ((1UL << NR_PAGEFLAGS) - 1) everywhere, introduce
PAGEFLAGS_MASK to make the code that masks out the page flags clearer.

Link: https://lkml.kernel.org/r/20210819150712.59948-1-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h      |    4 +++-
 include/trace/events/page_ref.h |    4 ++--
 lib/test_printf.c               |    2 +-
 lib/vsprintf.c                  |    2 +-
 4 files changed, 7 insertions(+), 5 deletions(-)

--- a/include/linux/page-flags.h~mm-introduce-pageflags_mask-to-replace-1ul-nr_pageflags-1
+++ a/include/linux/page-flags.h
@@ -178,6 +178,8 @@ enum pageflags {
 	PG_reported = PG_uptodate,
 };
 
+#define PAGEFLAGS_MASK		((1UL << NR_PAGEFLAGS) - 1)
+
 #ifndef __GENERATING_BOUNDS_H
 
 static inline unsigned long _compound_head(const struct page *page)
@@ -868,7 +870,7 @@ static inline void ClearPageSlabPfmemall
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP	\
-	(((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+	(PAGEFLAGS_MASK & ~__PG_HWPOISON)
 
 #define PAGE_FLAGS_PRIVATE				\
 	(1UL << PG_private | 1UL << PG_private_2)
--- a/include/trace/events/page_ref.h~mm-introduce-pageflags_mask-to-replace-1ul-nr_pageflags-1
+++ a/include/trace/events/page_ref.h
@@ -38,7 +38,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_templat
 
 	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d val=%d",
 		__entry->pfn,
-		show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+		show_page_flags(__entry->flags & PAGEFLAGS_MASK),
 		__entry->count,
 		__entry->mapcount, __entry->mapping, __entry->mt,
 		__entry->val)
@@ -88,7 +88,7 @@ DECLARE_EVENT_CLASS(page_ref_mod_and_tes
 
 	TP_printk("pfn=0x%lx flags=%s count=%d mapcount=%d mapping=%p mt=%d val=%d ret=%d",
 		__entry->pfn,
-		show_page_flags(__entry->flags & ((1UL << NR_PAGEFLAGS) - 1)),
+		show_page_flags(__entry->flags & PAGEFLAGS_MASK),
 		__entry->count,
 		__entry->mapcount, __entry->mapping, __entry->mt,
 		__entry->val, __entry->ret)
--- a/lib/test_printf.c~mm-introduce-pageflags_mask-to-replace-1ul-nr_pageflags-1
+++ a/lib/test_printf.c
@@ -614,7 +614,7 @@ page_flags_test(int section, int node, i
 	bool append = false;
 	int i;
 
-	flags &= BIT(NR_PAGEFLAGS) - 1;
+	flags &= PAGEFLAGS_MASK;
 	if (flags) {
 		page_flags |= flags;
 		snprintf(cmp_buf + size, BUF_SIZE - size, "%s", name);
--- a/lib/vsprintf.c~mm-introduce-pageflags_mask-to-replace-1ul-nr_pageflags-1
+++ a/lib/vsprintf.c
@@ -2019,7 +2019,7 @@ static const struct page_flags_fields pf
 static
 char *format_page_flags(char *buf, char *end, unsigned long flags)
 {
-	unsigned long main_flags = flags & (BIT(NR_PAGEFLAGS) - 1);
+	unsigned long main_flags = flags & PAGEFLAGS_MASK;
 	bool append = false;
 	int i;
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 060/147] mm/secretmem: use refcount_t instead of atomic_t
  2021-09-08  2:52 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2021-09-08  2:56 ` [patch 059/147] mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1) Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 061/147] kfence: show cpu and timestamp in alloc/free info Andrew Morton
                   ` (87 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, James.Bottomley, jordy, jordy, keescook, linux-mm,
	mm-commits, rppt, torvalds

From: Jordy Zomer <jordy@jordyzomer.github.io>
Subject: mm/secretmem: use refcount_t instead of atomic_t

When a secret memory region is active, memfd_secret disables hibernation. 
One of the goals is to keep the secret data from being written to
persistent storage.

It accomplishes this by maintaining a reference count to
`secretmem_users`.  Once this reference is held, your system cannot be
hibernated due to the check in `hibernation_available()`.  However,
because `secretmem_users` is of type `atomic_t`, reference counter
overflows are possible.

As you can see, there's an `atomic_inc` for each `memfd` that is opened in
the `memfd_secret` syscall.  If a local attacker succeeds in opening 2^32
memfds, the counter will wrap around to 0.  This implies that you may
hibernate again, even though there are still regions of this secret
memory, thereby bypassing the security check.

In an attempt to fix this, I have used `refcount_t` instead of `atomic_t`,
which prevents reference counter overflows.
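
For illustration (not part of the patch): unlike atomic_t, refcount_t
saturates instead of wrapping, so the counter sticks at its saturation
value and WARNs rather than ever reaching 0 again:

	refcount_inc(&secretmem_users);	/* on overflow: saturate and WARN,
					 * never wrap back around to 0 */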

Link: https://lkml.kernel.org/r/20210820043339.2151352-1-jordy@pwning.systems
Signed-off-by: Jordy Zomer <jordy@pwning.systems>
Cc: Kees Cook <keescook@chromium.org>
Cc: Jordy Zomer <jordy@jordyzomer.github.io>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/secretmem.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/secretmem.c~mm-secretmem-use-refcount_t-instead-of-atomic_t
+++ a/mm/secretmem.c
@@ -18,6 +18,7 @@
 #include <linux/secretmem.h>
 #include <linux/set_memory.h>
 #include <linux/sched/signal.h>
+#include <linux/refcount.h>
 
 #include <uapi/linux/magic.h>
 
@@ -40,11 +41,11 @@ module_param_named(enable, secretmem_ena
 MODULE_PARM_DESC(secretmem_enable,
 		 "Enable secretmem and memfd_secret(2) system call");
 
-static atomic_t secretmem_users;
+static refcount_t secretmem_users;
 
 bool secretmem_active(void)
 {
-	return !!atomic_read(&secretmem_users);
+	return !!refcount_read(&secretmem_users);
 }
 
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
@@ -103,7 +104,7 @@ static const struct vm_operations_struct
 
 static int secretmem_release(struct inode *inode, struct file *file)
 {
-	atomic_dec(&secretmem_users);
+	refcount_dec(&secretmem_users);
 	return 0;
 }
 
@@ -217,7 +218,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned i
 	file->f_flags |= O_LARGEFILE;
 
 	fd_install(fd, file);
-	atomic_inc(&secretmem_users);
+	refcount_inc(&secretmem_users);
 	return fd;
 
 err_put_fd:
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 061/147] kfence: show cpu and timestamp in alloc/free info
  2021-09-08  2:52 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2021-09-08  2:56 ` [patch 060/147] mm/secretmem: use refcount_t instead of atomic_t Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 062/147] kfence: test: fail fast if disabled at boot Andrew Morton
                   ` (86 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, elver, glider, joern, linux-mm, mm-commits, torvalds, yzhong

From: Marco Elver <elver@google.com>
Subject: kfence: show cpu and timestamp in alloc/free info

Record cpu and timestamp on allocations and frees, and show them in
reports.  Upon an error, this can help correlate earlier messages in the
kernel log via allocation and free timestamps.
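
The report derives the seconds and fractional part with do_div() (see the
mm/kfence/report.c hunk below); as a reminder, do_div(n, base) divides the
u64 n in place and returns the remainder:

	u64 ts_sec = ts_nsec;
	unsigned long rem_nsec = do_div(ts_sec, NSEC_PER_SEC);
	/* ts_sec now holds whole seconds, rem_nsec the remaining ns */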

Link: https://lkml.kernel.org/r/20210714175312.2947941-1-elver@google.com
Suggested-by: Joern Engel <joern@purestorage.com>
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Alexander Potapenko <glider@google.com>
Acked-by: Joern Engel <joern@purestorage.com>
Cc: Yuanyuan Zhong <yzhong@purestorage.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kfence.rst |   98 ++++++++++++++-------------
 mm/kfence/core.c                   |    3 
 mm/kfence/kfence.h                 |    2 
 mm/kfence/report.c                 |   19 +++--
 4 files changed, 71 insertions(+), 51 deletions(-)

--- a/Documentation/dev-tools/kfence.rst~kfence-show-cpu-and-timestamp-in-alloc-free-info
+++ a/Documentation/dev-tools/kfence.rst
@@ -65,25 +65,27 @@ Error reports
 A typical out-of-bounds access looks like this::
 
     ==================================================================
-    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa3/0x22b
+    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234
 
-    Out-of-bounds read at 0xffffffffb672efff (1B left of kfence-#17):
-     test_out_of_bounds_read+0xa3/0x22b
-     kunit_try_run_case+0x51/0x85
+    Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
+     test_out_of_bounds_read+0xa6/0x234
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    kfence-#17 [0xffffffffb672f000-0xffffffffb672f01f, size=32, cache=kmalloc-32] allocated by task 507:
-     test_alloc+0xf3/0x25b
-     test_out_of_bounds_read+0x98/0x22b
-     kunit_try_run_case+0x51/0x85
+    kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32
+
+    allocated by task 484 on cpu 0 at 32.919330s:
+     test_alloc+0xfe/0x738
+     test_out_of_bounds_read+0x9b/0x234
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    CPU: 4 PID: 107 Comm: kunit_try_catch Not tainted 5.8.0-rc6+ #7
-    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+    CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
+    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
     ==================================================================
 
 The header of the report provides a short summary of the function involved in
@@ -96,30 +98,32 @@ Use-after-free accesses are reported as:
     ==================================================================
     BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143
 
-    Use-after-free read at 0xffffffffb673dfe0 (in kfence-#24):
+    Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79):
      test_use_after_free_read+0xb3/0x143
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    kfence-#24 [0xffffffffb673dfe0-0xffffffffb673dfff, size=32, cache=kmalloc-32] allocated by task 507:
-     test_alloc+0xf3/0x25b
+    kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32
+
+    allocated by task 488 on cpu 2 at 33.871326s:
+     test_alloc+0xfe/0x738
      test_use_after_free_read+0x76/0x143
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    freed by task 507:
+    freed by task 488 on cpu 2 at 33.871358s:
      test_use_after_free_read+0xa8/0x143
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    CPU: 4 PID: 109 Comm: kunit_try_catch Tainted: G        W         5.8.0-rc6+ #7
-    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+    CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G    B             5.13.0-rc3+ #7
+    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
     ==================================================================
 
 KFENCE also reports on invalid frees, such as double-frees::
@@ -127,30 +131,32 @@ KFENCE also reports on invalid frees, su
     ==================================================================
     BUG: KFENCE: invalid free in test_double_free+0xdc/0x171
 
-    Invalid free of 0xffffffffb6741000:
+    Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81):
      test_double_free+0xdc/0x171
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    kfence-#26 [0xffffffffb6741000-0xffffffffb674101f, size=32, cache=kmalloc-32] allocated by task 507:
-     test_alloc+0xf3/0x25b
+    kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32
+
+    allocated by task 490 on cpu 1 at 34.175321s:
+     test_alloc+0xfe/0x738
      test_double_free+0x76/0x171
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    freed by task 507:
+    freed by task 490 on cpu 1 at 34.175348s:
      test_double_free+0xa8/0x171
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    CPU: 4 PID: 111 Comm: kunit_try_catch Tainted: G        W         5.8.0-rc6+ #7
-    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+    CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G    B             5.13.0-rc3+ #7
+    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
     ==================================================================
 
 KFENCE also uses pattern-based redzones on the other side of an object's guard
@@ -160,23 +166,25 @@ These are reported on frees::
     ==================================================================
     BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184
 
-    Corrupted memory at 0xffffffffb6797ff9 [ 0xac . . . . . . ] (in kfence-#69):
+    Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156):
      test_kmalloc_aligned_oob_write+0xef/0x184
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    kfence-#69 [0xffffffffb6797fb0-0xffffffffb6797ff8, size=73, cache=kmalloc-96] allocated by task 507:
-     test_alloc+0xf3/0x25b
+    kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96
+
+    allocated by task 502 on cpu 7 at 42.159302s:
+     test_alloc+0xfe/0x738
      test_kmalloc_aligned_oob_write+0x57/0x184
-     kunit_try_run_case+0x51/0x85
+     kunit_try_run_case+0x61/0xa0
      kunit_generic_run_threadfn_adapter+0x16/0x30
-     kthread+0x137/0x160
+     kthread+0x176/0x1b0
      ret_from_fork+0x22/0x30
 
-    CPU: 4 PID: 120 Comm: kunit_try_catch Tainted: G        W         5.8.0-rc6+ #7
-    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
+    CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G    B             5.13.0-rc3+ #7
+    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
     ==================================================================
 
 For such errors, the address where the corruption occurred as well as the
--- a/mm/kfence/core.c~kfence-show-cpu-and-timestamp-in-alloc-free-info
+++ a/mm/kfence/core.c
@@ -20,6 +20,7 @@
 #include <linux/moduleparam.h>
 #include <linux/random.h>
 #include <linux/rcupdate.h>
+#include <linux/sched/clock.h>
 #include <linux/sched/sysctl.h>
 #include <linux/seq_file.h>
 #include <linux/slab.h>
@@ -196,6 +197,8 @@ static noinline void metadata_update_sta
 	 */
 	track->num_stack_entries = stack_trace_save(track->stack_entries, KFENCE_STACK_DEPTH, 1);
 	track->pid = task_pid_nr(current);
+	track->cpu = raw_smp_processor_id();
+	track->ts_nsec = local_clock(); /* Same source as printk timestamps. */
 
 	/*
 	 * Pairs with READ_ONCE() in
--- a/mm/kfence/kfence.h~kfence-show-cpu-and-timestamp-in-alloc-free-info
+++ a/mm/kfence/kfence.h
@@ -36,6 +36,8 @@ enum kfence_object_state {
 /* Alloc/free tracking information. */
 struct kfence_track {
 	pid_t pid;
+	int cpu;
+	u64 ts_nsec;
 	int num_stack_entries;
 	unsigned long stack_entries[KFENCE_STACK_DEPTH];
 };
--- a/mm/kfence/report.c~kfence-show-cpu-and-timestamp-in-alloc-free-info
+++ a/mm/kfence/report.c
@@ -9,6 +9,7 @@
 
 #include <linux/kernel.h>
 #include <linux/lockdep.h>
+#include <linux/math.h>
 #include <linux/printk.h>
 #include <linux/sched/debug.h>
 #include <linux/seq_file.h>
@@ -100,6 +101,13 @@ static void kfence_print_stack(struct se
 			       bool show_alloc)
 {
 	const struct kfence_track *track = show_alloc ? &meta->alloc_track : &meta->free_track;
+	u64 ts_sec = track->ts_nsec;
+	unsigned long rem_nsec = do_div(ts_sec, NSEC_PER_SEC);
+
+	/* Timestamp matches printk timestamp format. */
+	seq_con_printf(seq, "%s by task %d on cpu %d at %lu.%06lus:\n",
+		       show_alloc ? "allocated" : "freed", track->pid,
+		       track->cpu, (unsigned long)ts_sec, rem_nsec / 1000);
 
 	if (track->num_stack_entries) {
 		/* Skip allocation/free internals stack. */
@@ -126,15 +134,14 @@ void kfence_print_object(struct seq_file
 		return;
 	}
 
-	seq_con_printf(seq,
-		       "kfence-#%td [0x%p-0x%p"
-		       ", size=%d, cache=%s] allocated by task %d:\n",
-		       meta - kfence_metadata, (void *)start, (void *)(start + size - 1), size,
-		       (cache && cache->name) ? cache->name : "<destroyed>", meta->alloc_track.pid);
+	seq_con_printf(seq, "kfence-#%td: 0x%p-0x%p, size=%d, cache=%s\n\n",
+		       meta - kfence_metadata, (void *)start, (void *)(start + size - 1),
+		       size, (cache && cache->name) ? cache->name : "<destroyed>");
+
 	kfence_print_stack(seq, meta, true);
 
 	if (meta->state == KFENCE_OBJECT_FREED) {
-		seq_con_printf(seq, "\nfreed by task %d:\n", meta->free_track.pid);
+		seq_con_printf(seq, "\n");
 		kfence_print_stack(seq, meta, false);
 	}
 }
_
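
As a rough userspace illustration of the timestamp split done in
kfence_print_stack() above (a sketch only: plain 64-bit division stands in
for do_div(), which the kernel uses because native 64-bit division is not
available on all 32-bit targets):

    #include <inttypes.h>
    #include <stdio.h>

    #define NSEC_PER_SEC 1000000000ULL

    int main(void)
    {
            /* e.g. a local_clock() value of ~34.175s */
            uint64_t ts_nsec = 34175348000ULL;
            uint64_t ts_sec = ts_nsec / NSEC_PER_SEC;
            uint64_t rem_nsec = ts_nsec % NSEC_PER_SEC;

            /* printk timestamp format: seconds.microseconds */
            printf("%" PRIu64 ".%06" PRIu64 "s\n",
                   ts_sec, rem_nsec / 1000);
            return 0;
    }

This prints "34.175348s", matching the "at 34.175348s" stamps in the
example reports above.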

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 062/147] kfence: test: fail fast if disabled at boot
  2021-09-08  2:52 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2021-09-08  2:56 ` [patch 061/147] kfence: show cpu and timestamp in alloc/free info Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 063/147] mm: introduce Data Access MONitor (DAMON) Andrew Morton
                   ` (85 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, dvyukov, elver, glider, linux-mm, mm-commits, torvalds,
	wangkefeng.wang

From: Marco Elver <elver@google.com>
Subject: kfence: test: fail fast if disabled at boot

Fail the kfence_test fast if KFENCE was disabled at boot, instead of each
test case spending several seconds trying to allocate from KFENCE and
failing.  KUnit will fail all test cases if kunit_suite::init returns an
error.

Even if KFENCE was disabled, we still want the test to fail, so that CI
systems that parse KUnit output will alert on KFENCE being disabled
(accidentally or otherwise).
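
For reference, a minimal sketch of the pattern (every name other than the
kunit_suite fields is hypothetical):

    static int my_suite_init(struct kunit *test)
    {
            if (!my_feature_enabled())      /* hypothetical gate */
                    return -EINVAL;         /* fails every case in the suite */
            return 0;
    }

    static struct kunit_suite my_suite = {
            .name = "my-feature",
            .init = my_suite_init,
            .test_cases = my_test_cases,
    };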

Link: https://lkml.kernel.org/r/20210825105533.1247922-1-elver@google.com
Signed-off-by: Marco Elver <elver@google.com>
Reported-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Tested-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kfence/kfence_test.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/kfence/kfence_test.c~kfence-test-fail-fast-if-disabled-at-boot
+++ a/mm/kfence/kfence_test.c
@@ -789,6 +789,9 @@ static int test_init(struct kunit *test)
 	unsigned long flags;
 	int i;
 
+	if (!__kfence_pool)
+		return -EINVAL;
+
 	spin_lock_irqsave(&observed.lock, flags);
 	for (i = 0; i < ARRAY_SIZE(observed.lines); i++)
 		observed.lines[i][0] = '\0';
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 063/147] mm: introduce Data Access MONitor (DAMON)
  2021-09-08  2:52 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2021-09-08  2:56 ` [patch 062/147] kfence: test: fail fast if disabled at boot Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 064/147] mm/damon/core: implement region-based sampling Andrew Morton
                   ` (84 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm: introduce Data Access MONitor (DAMON)

Patch series "Introduce Data Access MONitor (DAMON)", v34.

Introduction
============

DAMON is a data access monitoring framework for the Linux kernel.  The
core mechanisms of DAMON called 'region based sampling' and 'adaptive
regions adjustment' (refer to 'mechanisms.rst' in the 11th patch of this
patchset for the detail) make it

- accurate (The monitored information is useful for DRAM level memory
  management.  It might not be appropriate for cache-level accuracy,
  though.),

- light-weight (The monitoring overhead is low enough to be applied
  online while making no impact on the performance of the target
  workloads.), and

- scalable (the upper-bound of the instrumentation overhead is
  controllable regardless of the size of target workloads.).

Using this framework, therefore, several memory management mechanisms such
as reclamation and THP can be optimized to be aware of real data access
patterns.  Experimental access-pattern-aware memory management
optimizations that previously incurred high instrumentation overhead can
be given another try.

Though DAMON is for kernel subsystems, it can be easily exposed to the
user space by writing a DAMON-wrapper kernel subsystem.  Then, user space
users who have some special workloads will be able to write personalized
tools or applications for deeper understanding and specialized
optimizations of their systems.

DAMON is also merged in two public Amazon Linux kernel trees that are
based on v5.4.y[1] and v5.10.y[2].

[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon

The userspace tool[1] is available, released under GPLv2, and actively
being maintained.  I am also planning to implement another basic user
interface in perf[2].  Also, the basic test suite for DAMON is available
under GPLv2[3].

[1] https://github.com/awslabs/damo
[2] https://lore.kernel.org/linux-mm/20210107120729.22328-1-sjpark@amazon.com/
[3] https://github.com/awslabs/damon-tests

Long-term Plan
--------------

DAMON is a part of a project called Data Access-aware Operating System
(DAOS).  As the name implies, I want to improve the performance and
efficiency of systems using fine-grained data access patterns.  The
optimizations are for both kernel and user spaces.  I will therefore
modify or create kernel subsystems, export some of those to user space and
implement user space library / tools.  Below shows the layers and
components for the project.

    ---------------------------------------------------------------------------
    Primitives:     PTE Accessed bit, PG_idle, rmap, (Intel CMT), ...
    Framework:      DAMON
    Features:       DAMOS, virtual addr, physical addr, ...
    Applications:   DAMON-debugfs, (DARC), ...
    ^^^^^^^^^^^^^^^^^^^^^^^    KERNEL SPACE    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Raw Interface:  debugfs, (sysfs), (damonfs), tracepoints, (sys_damon), ...

    vvvvvvvvvvvvvvvvvvvvvvv    USER SPACE      vvvvvvvvvvvvvvvvvvvvvvvvvvvvvvvv
    Library:        (libdamon), ...
    Tools:          DAMO, (perf), ...
    ---------------------------------------------------------------------------

The components in parentheses or marked as '...' are not implemented yet
but are planned for the future.  IOW, those are the TODO tasks of the DAOS
project.  For more detail, please refer to the plans:
https://lore.kernel.org/linux-mm/20201202082731.24828-1-sjpark@amazon.com/

Evaluations
===========

We evaluated DAMON's overhead, monitoring quality and usefulness using 24
realistic workloads on my QEMU/KVM based virtual machine running a kernel
to which the v24 DAMON patchset is applied.

DAMON is lightweight.  It increases system memory usage by 0.39% and slows
target workloads down by 1.16%.

DAMON is accurate and useful for memory management optimizations.  An
experimental DAMON-based operation scheme for THP, namely 'ethp', removes
76.15% of THP memory overheads while preserving 51.25% of THP speedup. 
Another experimental DAMON-based 'proactive reclamation' implementation,
'prcl', reduces 93.38% of resident sets and 23.63% of system memory
footprint while incurring only 1.22% runtime overhead in the best case
(parsec3/freqmine).

NOTE that the experimental THP optimization and proactive reclamation are
not for production use but only proofs of concept.

Please refer to the official document[1] or "Documentation/admin-guide/mm:
Add a document for DAMON" patch in this patchset for detailed evaluation
setup and results.

[1] https://damonitor.github.io/doc/html/latest-damon/admin-guide/mm/damon/eval.html

Real-world User Story
=====================

In summary, DAMON has been used on production systems and has proved its
usefulness.

DAMON as a profiler
-------------------

We analyzed the characteristics of large scale production systems of our
customers using DAMON.  The systems utilize 70GB DRAM and 36 CPUs.  From
this, we were able to find the interesting things below.

There were obviously different access patterns under the idle workload and
the active workload.  Under the idle workload, the system accessed large
memory regions with low frequency, while the active workload accessed
small memory regions with high frequency.

DAMON found a 7GB memory region showing an obviously high access frequency
under the active workload.  We believe this is the performance-effective
working set and needs to be protected.

There was a 4KB memory region showing the highest access frequency under
not only the active but also the idle workload.  We think this must be
something like a hot code section that should never be paged out.

For this analysis, DAMON used only 0.3-1% of a single CPU's time.  Because
we used recording-based analysis, it consumed about 3-12 MB of disk space
per 20 minutes.  This is only a small amount of disk space, but we can
further reduce the disk usage by using non-recording-based DAMON features.
I'd like to argue that only DAMON can do such detailed analysis (finding
the hottest 4KB region in 70GB of memory) with such light overhead.

DAMON as a system optimization tool
-----------------------------------

We also found below potential performance problems on the systems and made
DAMON-based solutions.

The system doesn't want to make the workload suffer from page reclamation,
and thus it utilizes enough DRAM but no swap device.  However, we found
the system is actively reclaiming file-backed pages, because the system
has intensive file IO.  The file IO turned out not to be performance
critical for the workload, but the customer wanted to ensure that
performance critical file-backed pages, like code sections, are not
mistakenly evicted.

Using direct IO or `mlock()` would be a straightforward solution, but
modifying the user space code is not easy for the customer.
Alternatively, we could use a DAMON-based operation scheme[1].  By using
it, we can ask DAMON to track the access frequency of each region and make
a 'process_madvise(MADV_WILLNEED)[2]' call for regions having a specific
size and access frequency for a time interval.

We also found the system was having a high number of TLB misses.  We tried
the 'always' THP enabled policy, and it greatly reduced TLB misses, but
page reclamation also became more frequent due to the memory bloat caused
by THP internal fragmentation.  We could try another DAMON-based operation
scheme that applies 'MADV_HUGEPAGE' to memory regions having >=2MB size
and high access frequency, while applying 'MADV_NOHUGEPAGE' to regions
having <2MB size and low access frequency.

We do not own the systems, so we only reported the analysis results and
possible optimization solutions to the customers.  The customers were
satisfied with the analysis results and promised to try the optimization
guides.

[1] https://lore.kernel.org/linux-mm/20201006123931.5847-1-sjpark@amazon.com/
[2] https://lore.kernel.org/linux-api/20200622192900.22757-4-minchan@kernel.org/

Comparison with Idle Page Tracking
==================================

Idle Page Tracking allows users to set and read idleness of pages using a
bitmap file which represents each page with each bit of the file.  One
recommended usage of it is working set size detection.  Users can do that
by the steps below (a rough userspace sketch follows the list):

    1. find the PFN of each page of the workloads of interest,
    2. set all the pages as idle by doing writes to the bitmap file,
    3. wait until the workload accesses its working set, and
    4. read the idleness of the pages again and count the pages that became
       non-idle.
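
A rough userspace sketch of steps 2-4 (assumptions: root privileges,
CONFIG_IDLE_PAGE_TRACKING, and a PFN range already found via
/proc/<pid>/pagemap; the range below is hypothetical and error handling is
trimmed):

    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdio.h>
    #include <unistd.h>

    #define BITMAP "/sys/kernel/mm/page_idle/bitmap"

    /* Each 8-byte word of the bitmap file covers 64 PFNs. */
    static void mark_idle(int fd, uint64_t start_pfn, uint64_t nr)
    {
            uint64_t word = ~0ULL, off;

            for (off = start_pfn / 64; off <= (start_pfn + nr - 1) / 64; off++)
                    pwrite(fd, &word, sizeof(word), off * 8);
    }

    static uint64_t count_accessed(int fd, uint64_t start_pfn, uint64_t nr)
    {
            uint64_t accessed = 0, word, pfn;

            for (pfn = start_pfn; pfn < start_pfn + nr; pfn++) {
                    pread(fd, &word, sizeof(word), (pfn / 64) * 8);
                    /* idle bit cleared => accessed; note that bits for
                     * pages that are not user pages on the LRU read as 0 */
                    if (!(word & (1ULL << (pfn % 64))))
                            accessed++;
            }
            return accessed;
    }

    int main(void)
    {
            uint64_t start_pfn = 0x80000, nr = 1024;  /* hypothetical range */
            int fd = open(BITMAP, O_RDWR);

            mark_idle(fd, start_pfn, nr);           /* step 2 */
            sleep(10);                              /* step 3 */
            printf("%" PRIu64 " of %" PRIu64 " pages accessed\n",
                   count_accessed(fd, start_pfn, nr), nr);     /* step 4 */
            close(fd);
            return 0;
    }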

NOTE: While Idle Page Tracking is for user space users, DAMON is primarily
designed for kernel subsystems, though it can easily be exposed to user
space.  Hence, this section only assumes such user space use of DAMON.

For what use cases Idle Page Tracking would be better?
------------------------------------------------------

1. Flexible usecases other than hotness monitoring.

Because Idle Page Tracking allows users to control the primitive (Page
idleness) by themselves, Idle Page Tracking users can do anything they
want.  Meanwhile, DAMON is primarily designed to monitor the hotness of
each memory region.  For this, DAMON asks users to provide a sampling
interval and an aggregation interval.  For that reason, there could be
some use cases for which using Idle Page Tracking is simpler.

2. Physical memory monitoring.

Idle Page Tracking receives a PFN range as input, so it natively supports
physical memory monitoring.

DAMON is designed to be extensible for multiple address spaces and use
cases by implementing and using primitives for the given use case.
Therefore, in theory, DAMON has no limitation in the type of target
address space as long as primitives for the given address space exist.
However, the default primitives introduced by this patchset support only
virtual address spaces.

Therefore, for physical memory monitoring, you should implement your own
primitives and use them, or simply use Idle Page Tracking.

Nonetheless, an RFC patchset[1] for the physical memory address space
primitives is already available.  It also supports user memory, the same
as Idle Page Tracking.

[1] https://lore.kernel.org/linux-mm/20200831104730.28970-1-sjpark@amazon.com/

For what use cases DAMON is better?
-----------------------------------

1. Hotness Monitoring.

Idle Page Tracking lets users know only whether a page frame was accessed
or not.  For hotness checks, the user should write more code and use more
memory.  DAMON does that by itself.

2. Low Monitoring Overhead

DAMON receives a user's monitoring request in one step and then provides
the results.  So, roughly speaking, DAMON requires only O(1) user/kernel
context switches.

In the case of Idle Page Tracking, however, because the interface receives
contiguous page frames, the number of user/kernel context switches
increases as the monitoring target becomes complex and huge.  As a result,
the context switch overhead could be non-negligible.

Moreover, DAMON was born to handle the monitoring overhead.  Because the
core mechanism is purely logical, Idle Page Tracking users might be able
to implement the mechanism on their own, but it would be time consuming,
and the user/kernel context switching would still be more frequent than
that of DAMON.  Also, the kernel subsystems cannot use the logic in this
case.

3. Page granularity working set size detection.

Until v22 of this patchset, this was categorized as something Idle Page
Tracking could do better, because DAMON basically maintains additional
metadata for each of the monitoring target regions.  So, in the page
granularity working set size detection use case, DAMON would incur (number
of monitoring target pages * size of metadata) memory overhead.  The size
of a single metadata item is about 54 bytes, so assuming 4KB pages, memory
amounting to about 1.3% of the monitoring target pages would additionally
be used.

All essential metadata for Idle Page Tracking are embedded in 'struct
page' and page table entries.  Therefore, in this use case, only one
counter variable for working set size accounting is required if Idle Page
Tracking is used.

There are more details to consider, but roughly speaking, this is true in
most cases.

However, the situation changed as of v23.  Now DAMON supports arbitrary
types of monitoring targets, which don't use the metadata.  Using that,
DAMON can do working set size detection with no additional space overhead
and fewer user/kernel context switches.  A first draft of the
implementation of monitoring primitives for this usage is available in a
DAMON development tree[1].  An RFC patchset for it based on this patchset
will also be available soon.

Since v24, the arbitrary type support has been dropped from this patchset
because this patchset doesn't introduce a real use of the type.  You can
still get it from the DAMON development tree[2], though.

[1] https://github.com/sjp38/linux/tree/damon/pgidle_hack
[2] https://github.com/sjp38/linux/tree/damon/master

4. More future usecases

While Idle Page Tracking has tight coupling with base primitives (PG_Idle
and page table Accessed bits), DAMON is designed to be extensible for many
use cases and address spaces.  If you need some special address type or
want to use special h/w access check primitives, you can write your own
primitives for that and configure DAMON to use those.  Therefore, if your
use case could be changed a lot in future, using DAMON could be better.

Can I use both Idle Page Tracking and DAMON?
--------------------------------------------

Yes, though using them concurrently for overlapping memory regions could
result in interference with each other.  Nevertheless, such a use case
would be rare or make no sense at all.  Even in that case, the noise would
not be really significant.  So, you can choose whatever you want depending
on the characteristics of your use cases.

More Information
================

We prepared a showcase web site[1] where you can get more information.
There are

- the official documentation[2],
- the heatmap format dynamic access patterns of various realistic workloads
  for heap area[3], mmap()-ed area[4], and stack area[5],
- the dynamic working set size distribution[6] and chronological working set
  size changes[7], and
- the latest performance test results[8].

[1] https://damonitor.github.io/_index
[2] https://damonitor.github.io/doc/html/latest-damon
[3] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.0.png.html
[4] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
[5] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.2.png.html
[6] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
[7] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
[8] https://damonitor.github.io/test/result/perf/latest/html/index.html

Baseline and Complete Git Trees
===============================

The patches are based on the latest -mm tree, specifically
v5.14-rc1-mmots-2021-07-15-18-47 of https://github.com/hnaz/linux-mm.  You can
also clone the complete git tree:

    $ git clone git://github.com/sjp38/linux -b damon/patches/v34

The web is also available:
https://github.com/sjp38/linux/releases/tag/damon/patches/v34

Development Trees
-----------------

There are a couple of trees for entire DAMON patchset series and features
for future release.

- For latest release: https://github.com/sjp38/linux/tree/damon/master
- For next release: https://github.com/sjp38/linux/tree/damon/next

Long-term Support Trees
-----------------------

For people who want to test DAMON but are using LTS kernels, there is
another couple of trees, based on the two latest LTS kernels respectively
and containing the 'damon/master' backports.

- For v5.4.y: https://github.com/sjp38/linux/tree/damon/for-v5.4.y
- For v5.10.y: https://github.com/sjp38/linux/tree/damon/for-v5.10.y

Amazon Linux Kernel Trees
-------------------------

DAMON is also merged in two public Amazon Linux kernel trees that are
based on v5.4.y[1] and v5.10.y[2].

[1] https://github.com/amazonlinux/linux/tree/amazon-5.4.y/master/mm/damon
[2] https://github.com/amazonlinux/linux/tree/amazon-5.10.y/master/mm/damon

Git Tree for Diff of Patches
============================

For easy review of diff between different versions of each patch, I
prepared a git tree containing all versions of the DAMON patchset series:
https://github.com/sjp38/damon-patches

You can clone it and use 'diff' for easy review of changes between
different versions of the patchset.  For example:

    $ git clone https://github.com/sjp38/damon-patches && cd damon-patches
    $ diff -u damon/v33 damon/v34

Sequence Of Patches
===================

The first three patches implement the core logic of DAMON.  The 1st patch
introduces basic sampling based hotness monitoring for arbitrary types of
targets.  The following two patches implement the core mechanisms for
control of overhead and accuracy, namely regions based sampling (patch 2)
and adaptive regions adjustment (patch 3).

Now the essential parts of DAMON are complete, but it cannot work unless
someone provides monitoring primitives for a specific use case.  The
following two patches make it just work for virtual address space
monitoring.  The 4th patch makes 'PG_idle' usable by DAMON, and the 5th
patch implements the virtual memory address space specific monitoring
primitives using page table Accessed bits and the 'PG_idle' page flag.

Now DAMON just works for virtual address space monitoring via the kernel
space API.  To let user space users use DAMON, the following four patches
add interfaces for them.  The 6th patch adds a tracepoint for monitoring
results.  The 7th patch implements a DAMON application kernel module,
namely damon-dbgfs, that simply wraps DAMON and exposes the DAMON
interface to user space via the debugfs interface.  The 8th patch further
exports the pid of the monitoring thread (kdamond) to user space for
easier cpu usage accounting, and the 9th patch makes the debugfs interface
support multiple contexts.

Three patches for maintainability follow.  The 10th patch adds
documentation for both the user space and the kernel space.  The 11th
patch provides unit tests (based on kunit), while the 12th patch adds
user space tests (based on kselftest).

Finally, the last patch (13th) updates the MAINTAINERS file.



This patch (of 13):

DAMON is a data access monitoring framework for the Linux kernel.  The
core mechanisms of DAMON make it

 - accurate (the monitoring output is useful enough for DRAM level
   performance-centric memory management; It might be inappropriate for
   CPU cache levels, though),
 - light-weight (the monitoring overhead is normally low enough to be
   applied online), and
 - scalable (the upper-bound of the overhead is in constant range
   regardless of the size of target workloads).

Using this framework, hence, we can easily write efficient kernel space
data access monitoring applications.  For example, the kernel's memory
management mechanisms can make advanced decisions using this.
Experimental data access aware optimizations that previously incurred high
access monitoring overhead could be implemented again on top of this.

Due to its simple and flexible interface, providing a user space interface
would also be easy.  Then, user space users who have some special
workloads can write personalized applications for better understanding and
optimization of their workloads and systems.

===

Nevertheless, this commit defines and implements only the basic access
check part, without the overhead-accuracy handling core logic.  The basic
access check works as below.

The output of DAMON says which memory regions are accessed how frequently
for a given duration.  The resolution of the access frequency is
controlled by setting the ``sampling interval`` and the ``aggregation
interval``.  In detail, DAMON checks access to each page per ``sampling
interval`` and aggregates the results.  In other words, it counts the
number of accesses to each region.  After each ``aggregation interval``
passes, DAMON calls callback functions that users previously registered,
so that users can read the aggregated results, and then clears the
results.  This can be described by the simple pseudo-code below::

    init()
    while monitoring_on:
        for page in monitoring_target:
            if accessed(page):
                nr_accesses[page] += 1
        if time() % aggregation_interval == 0:
            for callback in user_registered_callbacks:
                callback(monitoring_target, nr_accesses)
            for page in monitoring_target:
                nr_accesses[page] = 0
        if time() % update_interval == 0:
            update()
        sleep(sampling interval)

The target regions are constructed at the beginning of the monitoring and
updated after each ``regions_update_interval``, because the target regions
could change dynamically (e.g., due to mmap() or memory hotplug).  The
monitoring overhead of this mechanism will arbitrarily increase as the
size of the target workload grows.

The basic monitoring primitives for actual access check and dynamic target
regions construction aren't in the core part of DAMON.  Instead, it allows
users to implement their own primitives that are optimized for their use
case and configure DAMON to use those.  In other words, users cannot use
the current version of DAMON without some additional work.

The following commits will implement the core mechanisms for the
overhead-accuracy control and the default primitive implementations.
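
As a rough sketch of the intended kernel-side usage (the my_* functions
are hypothetical; only the damon_* API below is introduced by this patch):

    #include <linux/damon.h>
    #include <linux/errno.h>

    static void my_prepare_access_checks(struct damon_ctx *ctx) { /* ... */ }
    static void my_check_accesses(struct damon_ctx *ctx) { /* ... */ }

    static bool my_target_valid(void *target)
    {
            return true;    /* hypothetical: the target never goes away */
    }

    static int my_start_monitoring(void)
    {
            struct damon_ctx *ctx = damon_new_ctx();

            if (!ctx)
                    return -ENOMEM;

            ctx->primitive.prepare_access_checks = my_prepare_access_checks;
            ctx->primitive.check_accesses = my_check_accesses;
            ctx->primitive.target_valid = my_target_valid;

            /* sample each 5ms, aggregate each 100ms, update each 60s */
            damon_set_attrs(ctx, 5000, 100000, 60 * 1000 * 1000);

            return damon_start(&ctx, 1);
    }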

Link: https://lkml.kernel.org/r/20210716081449.22187-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-2-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Marco Elver <elver@google.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Joe Perches <joe@perches.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |  167 ++++++++++++++++++++
 mm/Kconfig            |    2 
 mm/Makefile           |    1 
 mm/damon/Kconfig      |   15 +
 mm/damon/Makefile     |    3 
 mm/damon/core.c       |  320 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 508 insertions(+)

--- /dev/null
+++ a/include/linux/damon.h
@@ -0,0 +1,167 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON api
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifndef _DAMON_H_
+#define _DAMON_H_
+
+#include <linux/mutex.h>
+#include <linux/time64.h>
+#include <linux/types.h>
+
+struct damon_ctx;
+
+/**
+ * struct damon_primitive - Monitoring primitives for given use cases.
+ *
+ * @init:			Initialize primitive-internal data structures.
+ * @update:			Update primitive-internal data structures.
+ * @prepare_access_checks:	Prepare next access check of target regions.
+ * @check_accesses:		Check the accesses to target regions.
+ * @reset_aggregated:		Reset aggregated accesses monitoring results.
+ * @target_valid:		Determine if the target is valid.
+ * @cleanup:			Clean up the context.
+ *
+ * DAMON can be extended for various address spaces and usages.  For this,
+ * users should register the low level primitives for their target address
+ * space and usecase via the &damon_ctx.primitive.  Then, the monitoring thread
+ * (&damon_ctx.kdamond) calls @init and @prepare_access_checks before starting
+ * the monitoring, @update after each &damon_ctx.primitive_update_interval, and
+ * @check_accesses, @target_valid and @prepare_access_checks after each
+ * &damon_ctx.sample_interval.  Finally, @reset_aggregated is called after each
+ * &damon_ctx.aggr_interval.
+ *
+ * @init should initialize primitive-internal data structures.  For example,
+ * this could be used to construct proper monitoring target regions and link
+ * those to @damon_ctx.target.
+ * @update should update the primitive-internal data structures.  For example,
+ * this could be used to update monitoring target regions for current status.
+ * @prepare_access_checks should manipulate the monitoring regions to be
+ * prepared for the next access check.
+ * @check_accesses should check the accesses to each region made after the
+ * last preparation and update the number of observed accesses of each region.
+ * @reset_aggregated should reset the access monitoring results aggregated
+ * by @check_accesses.
+ * @target_valid should check whether the target is still valid for the
+ * monitoring.
+ * @cleanup is called from @kdamond just before its termination.
+ */
+struct damon_primitive {
+	void (*init)(struct damon_ctx *context);
+	void (*update)(struct damon_ctx *context);
+	void (*prepare_access_checks)(struct damon_ctx *context);
+	void (*check_accesses)(struct damon_ctx *context);
+	void (*reset_aggregated)(struct damon_ctx *context);
+	bool (*target_valid)(void *target);
+	void (*cleanup)(struct damon_ctx *context);
+};
+
+/*
+ * struct damon_callback	Monitoring events notification callbacks.
+ *
+ * @before_start:	Called before starting the monitoring.
+ * @after_sampling:	Called after each sampling.
+ * @after_aggregation:	Called after each aggregation.
+ * @before_terminate:	Called before terminating the monitoring.
+ * @private:		User private data.
+ *
+ * The monitoring thread (&damon_ctx.kdamond) calls @before_start and
+ * @before_terminate just before starting and finishing the monitoring,
+ * respectively.  Therefore, those are good places for installing and cleaning
+ * @private.
+ *
+ * The monitoring thread calls @after_sampling and @after_aggregation for each
+ * of the sampling intervals and aggregation intervals, respectively.
+ * Therefore, users can safely access the monitoring results without additional
+ * protection.  For that reason, users are recommended to use these callbacks
+ * for accessing the results.
+ *
+ * If any callback returns non-zero, monitoring stops.
+ */
+struct damon_callback {
+	void *private;
+
+	int (*before_start)(struct damon_ctx *context);
+	int (*after_sampling)(struct damon_ctx *context);
+	int (*after_aggregation)(struct damon_ctx *context);
+	int (*before_terminate)(struct damon_ctx *context);
+};
+
+/**
+ * struct damon_ctx - Represents a context for each monitoring.  This is the
+ * main interface that allows users to set the attributes and get the results
+ * of the monitoring.
+ *
+ * @sample_interval:		The time between access samplings.
+ * @aggr_interval:		The time between monitor results aggregations.
+ * @primitive_update_interval:	The time between monitoring primitive updates.
+ *
+ * For each @sample_interval, DAMON checks whether each region is accessed or
+ * not.  It aggregates and keeps the access information (number of accesses to
+ * each region) for @aggr_interval time.  DAMON also checks whether the target
+ * memory regions need update (e.g., by ``mmap()`` calls from the application,
+ * in case of virtual memory monitoring) and applies the changes for each
+ * @primitive_update_interval.  All time intervals are in micro-seconds.
+ * Please refer to &struct damon_primitive and &struct damon_callback for more
+ * detail.
+ *
+ * @kdamond:		Kernel thread who does the monitoring.
+ * @kdamond_stop:	Notifies whether kdamond should stop.
+ * @kdamond_lock:	Mutex for the synchronizations with @kdamond.
+ *
+ * For each monitoring context, one kernel thread for the monitoring is
+ * created.  The pointer to the thread is stored in @kdamond.
+ *
+ * Once started, the monitoring thread runs until explicitly required to be
+ * terminated or every monitoring target is invalid.  The validity of the
+ * targets is checked via the &damon_primitive.target_valid of @primitive.  The
+ * termination can also be explicitly requested by writing non-zero to
+ * @kdamond_stop.  The thread sets @kdamond to NULL when it terminates.
+ * Therefore, users can know whether the monitoring is ongoing or terminated by
+ * reading @kdamond.  Reads and writes to @kdamond and @kdamond_stop from
+ * outside of the monitoring thread must be protected by @kdamond_lock.
+ *
+ * Note that the monitoring thread protects only @kdamond and @kdamond_stop via
+ * @kdamond_lock.  Accesses to other fields must be protected by themselves.
+ *
+ * @primitive:	Set of monitoring primitives for given use cases.
+ * @callback:	Set of callbacks for monitoring events notifications.
+ *
+ * @target:	Pointer to the user-defined monitoring target.
+ */
+struct damon_ctx {
+	unsigned long sample_interval;
+	unsigned long aggr_interval;
+	unsigned long primitive_update_interval;
+
+/* private: internal use only */
+	struct timespec64 last_aggregation;
+	struct timespec64 last_primitive_update;
+
+/* public: */
+	struct task_struct *kdamond;
+	bool kdamond_stop;
+	struct mutex kdamond_lock;
+
+	struct damon_primitive primitive;
+	struct damon_callback callback;
+
+	void *target;
+};
+
+#ifdef CONFIG_DAMON
+
+struct damon_ctx *damon_new_ctx(void);
+void damon_destroy_ctx(struct damon_ctx *ctx);
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		unsigned long aggr_int, unsigned long primitive_upd_int);
+
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
+
+#endif	/* CONFIG_DAMON */
+
+#endif	/* _DAMON_H_ */
--- /dev/null
+++ a/mm/damon/core.c
@@ -0,0 +1,320 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Data Access Monitor
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#define pr_fmt(fmt) "damon: " fmt
+
+#include <linux/damon.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/slab.h>
+
+static DEFINE_MUTEX(damon_lock);
+static int nr_running_ctxs;
+
+struct damon_ctx *damon_new_ctx(void)
+{
+	struct damon_ctx *ctx;
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		return NULL;
+
+	ctx->sample_interval = 5 * 1000;
+	ctx->aggr_interval = 100 * 1000;
+	ctx->primitive_update_interval = 60 * 1000 * 1000;
+
+	ktime_get_coarse_ts64(&ctx->last_aggregation);
+	ctx->last_primitive_update = ctx->last_aggregation;
+
+	mutex_init(&ctx->kdamond_lock);
+
+	ctx->target = NULL;
+
+	return ctx;
+}
+
+void damon_destroy_ctx(struct damon_ctx *ctx)
+{
+	if (ctx->primitive.cleanup)
+		ctx->primitive.cleanup(ctx);
+	kfree(ctx);
+}
+
+/**
+ * damon_set_attrs() - Set attributes for the monitoring.
+ * @ctx:		monitoring context
+ * @sample_int:		time interval between samplings
+ * @aggr_int:		time interval between aggregations
+ * @primitive_upd_int:	time interval between monitoring primitive updates
+ *
+ * This function should not be called while the kdamond is running.
+ * Every time interval is in micro-seconds.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
+		    unsigned long aggr_int, unsigned long primitive_upd_int)
+{
+	ctx->sample_interval = sample_int;
+	ctx->aggr_interval = aggr_int;
+	ctx->primitive_update_interval = primitive_upd_int;
+
+	return 0;
+}
+
+static bool damon_kdamond_running(struct damon_ctx *ctx)
+{
+	bool running;
+
+	mutex_lock(&ctx->kdamond_lock);
+	running = ctx->kdamond != NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return running;
+}
+
+static int kdamond_fn(void *data);
+
+/*
+ * __damon_start() - Starts monitoring with given context.
+ * @ctx:	monitoring context
+ *
+ * This function should be called while damon_lock is held.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_start(struct damon_ctx *ctx)
+{
+	int err = -EBUSY;
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (!ctx->kdamond) {
+		err = 0;
+		ctx->kdamond_stop = false;
+		ctx->kdamond = kthread_run(kdamond_fn, ctx, "kdamond.%d",
+				nr_running_ctxs);
+		if (IS_ERR(ctx->kdamond)) {
+			err = PTR_ERR(ctx->kdamond);
+			ctx->kdamond = NULL;
+		}
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return err;
+}
+
+/**
+ * damon_start() - Starts monitoring for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to start monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * This function starts a group of monitoring threads for a group of monitoring
+ * contexts.  One thread per context is created and runs in parallel.  The
+ * caller should handle synchronization between the threads by itself.  If a
+ * group of threads created by another 'damon_start()' call is currently
+ * running, this function does nothing but returns -EBUSY.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_start(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i;
+	int err = 0;
+
+	mutex_lock(&damon_lock);
+	if (nr_running_ctxs) {
+		mutex_unlock(&damon_lock);
+		return -EBUSY;
+	}
+
+	for (i = 0; i < nr_ctxs; i++) {
+		err = __damon_start(ctxs[i]);
+		if (err)
+			break;
+		nr_running_ctxs++;
+	}
+	mutex_unlock(&damon_lock);
+
+	return err;
+}
+
+/*
+ * __damon_stop() - Stops monitoring of given context.
+ * @ctx:	monitoring context
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+static int __damon_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ctx->kdamond_stop = true;
+		mutex_unlock(&ctx->kdamond_lock);
+		while (damon_kdamond_running(ctx))
+			usleep_range(ctx->sample_interval,
+					ctx->sample_interval * 2);
+		return 0;
+	}
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return -EPERM;
+}
+
+/**
+ * damon_stop() - Stops monitoring for a given group of contexts.
+ * @ctxs:	an array of the pointers for contexts to stop monitoring
+ * @nr_ctxs:	size of @ctxs
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_stop(struct damon_ctx **ctxs, int nr_ctxs)
+{
+	int i, err = 0;
+
+	for (i = 0; i < nr_ctxs; i++) {
+		/* nr_running_ctxs is decremented in kdamond_fn */
+		err = __damon_stop(ctxs[i]);
+		if (err)
+			return err;
+	}
+
+	return err;
+}
+
+/*
+ * damon_check_reset_time_interval() - Check if a time interval is elapsed.
+ * @baseline:	the time to check whether the interval has elapsed since
+ * @interval:	the time interval (microseconds)
+ *
+ * See whether the given time interval has passed since the given baseline
+ * time.  If so, it also updates the baseline to current time for next check.
+ *
+ * Return:	true if the time interval has passed, or false otherwise.
+ */
+static bool damon_check_reset_time_interval(struct timespec64 *baseline,
+		unsigned long interval)
+{
+	struct timespec64 now;
+
+	ktime_get_coarse_ts64(&now);
+	if ((timespec64_to_ns(&now) - timespec64_to_ns(baseline)) <
+			interval * 1000)
+		return false;
+	*baseline = now;
+	return true;
+}
+
+/*
+ * Check whether it is time to flush the aggregated information
+ */
+static bool kdamond_aggregate_interval_passed(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_aggregation,
+			ctx->aggr_interval);
+}
+
+/*
+ * Check whether it is time to check and apply the target monitoring regions
+ *
+ * Returns true if it is.
+ */
+static bool kdamond_need_update_primitive(struct damon_ctx *ctx)
+{
+	return damon_check_reset_time_interval(&ctx->last_primitive_update,
+			ctx->primitive_update_interval);
+}
+
+/*
+ * Check whether current monitoring should be stopped
+ *
+ * The monitoring is stopped when either the user requested to stop, or all
+ * monitoring targets are invalid.
+ *
+ * Returns true if the current monitoring needs to be stopped.
+ */
+static bool kdamond_need_stop(struct damon_ctx *ctx)
+{
+	bool stop;
+
+	mutex_lock(&ctx->kdamond_lock);
+	stop = ctx->kdamond_stop;
+	mutex_unlock(&ctx->kdamond_lock);
+	if (stop)
+		return true;
+
+	if (!ctx->primitive.target_valid)
+		return false;
+
+	return !ctx->primitive.target_valid(ctx->target);
+}
+
+static void set_kdamond_stop(struct damon_ctx *ctx)
+{
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond_stop = true;
+	mutex_unlock(&ctx->kdamond_lock);
+}
+
+/*
+ * The monitoring daemon that runs as a kernel thread
+ */
+static int kdamond_fn(void *data)
+{
+	struct damon_ctx *ctx = (struct damon_ctx *)data;
+
+	mutex_lock(&ctx->kdamond_lock);
+	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
+	mutex_unlock(&ctx->kdamond_lock);
+
+	if (ctx->primitive.init)
+		ctx->primitive.init(ctx);
+	if (ctx->callback.before_start && ctx->callback.before_start(ctx))
+		set_kdamond_stop(ctx);
+
+	while (!kdamond_need_stop(ctx)) {
+		if (ctx->primitive.prepare_access_checks)
+			ctx->primitive.prepare_access_checks(ctx);
+		if (ctx->callback.after_sampling &&
+				ctx->callback.after_sampling(ctx))
+			set_kdamond_stop(ctx);
+
+		usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
+
+		if (ctx->primitive.check_accesses)
+			ctx->primitive.check_accesses(ctx);
+
+		if (kdamond_aggregate_interval_passed(ctx)) {
+			if (ctx->callback.after_aggregation &&
+					ctx->callback.after_aggregation(ctx))
+				set_kdamond_stop(ctx);
+			if (ctx->primitive.reset_aggregated)
+				ctx->primitive.reset_aggregated(ctx);
+		}
+
+		if (kdamond_need_update_primitive(ctx)) {
+			if (ctx->primitive.update)
+				ctx->primitive.update(ctx);
+		}
+	}
+
+	if (ctx->callback.before_terminate &&
+			ctx->callback.before_terminate(ctx))
+		set_kdamond_stop(ctx);
+	if (ctx->primitive.cleanup)
+		ctx->primitive.cleanup(ctx);
+
+	pr_debug("kdamond (%d) finishes\n", ctx->kdamond->pid);
+	mutex_lock(&ctx->kdamond_lock);
+	ctx->kdamond = NULL;
+	mutex_unlock(&ctx->kdamond_lock);
+
+	mutex_lock(&damon_lock);
+	nr_running_ctxs--;
+	mutex_unlock(&damon_lock);
+
+	do_exit(0);
+}
--- /dev/null
+++ a/mm/damon/Kconfig
@@ -0,0 +1,15 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Data Access Monitoring"
+
+config DAMON
+	bool "DAMON: Data Access Monitoring Framework"
+	help
+	  This builds a framework that allows kernel subsystems to monitor
+	  access frequency of each memory region. The information can be useful
+	  for performance-centric DRAM level memory management.
+
+	  See https://damonitor.github.io/doc/html/latest-damon/index.html for
+	  more information.
+
+endmenu
--- /dev/null
+++ a/mm/damon/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0
+
+obj-$(CONFIG_DAMON)		:= core.o
--- a/mm/Kconfig~mm-introduce-data-access-monitor-damon
+++ a/mm/Kconfig
@@ -886,4 +886,6 @@ config IO_MAPPING
 config SECRETMEM
 	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
 
+source "mm/damon/Kconfig"
+
 endmenu
--- a/mm/Makefile~mm-introduce-data-access-monitor-damon
+++ a/mm/Makefile
@@ -118,6 +118,7 @@ obj-$(CONFIG_CMA_SYSFS) += cma_sysfs.o
 obj-$(CONFIG_USERFAULTFD) += userfaultfd.o
 obj-$(CONFIG_IDLE_PAGE_TRACKING) += page_idle.o
 obj-$(CONFIG_DEBUG_PAGE_REF) += debug_page_ref.o
+obj-$(CONFIG_DAMON) += damon/
 obj-$(CONFIG_HARDENED_USERCOPY) += usercopy.o
 obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 064/147] mm/damon/core: implement region-based sampling
  2021-09-08  2:52 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2021-09-08  2:56 ` [patch 063/147] mm: introduce Data Access MONitor (DAMON) Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 065/147] mm/damon: adaptively adjust regions Andrew Morton
                   ` (83 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon/core: implement region-based sampling

To avoid the unbounded increase of the overhead, DAMON groups adjacent
pages that are assumed to have the same access frequencies into a
region.  As long as the assumption (pages in a region have the same
access frequencies) is kept, only one page in the region is required to
be checked.  Thus, for each ``sampling interval``,

 1. the 'prepare_access_checks' primitive picks one page in each region,
 2. waits for one ``sampling interval``,
 3. checks whether the page is accessed meanwhile, and
 4. increases the access count of the region if so.

Therefore, the monitoring overhead is controllable by adjusting the
number of regions.  DAMON allows both the underlying primitives and user
callbacks to adjust regions for the trade-off.  In other words, this
commit makes DAMON to use not only time-based sampling but also
space-based sampling.

This scheme, however, cannot preserve the quality of the output if the
assumption is not guaranteed.  The next commit will address this problem.
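
For illustration, a rough sketch of how primitives could implement that
contract with the region API this patch introduces (my_pick_sample_page()
and my_page_accessed() are hypothetical helpers):

    static void my_prepare_access_checks(struct damon_ctx *ctx)
    {
            struct damon_target *t;
            struct damon_region *r;

            damon_for_each_target(t, ctx)
                    damon_for_each_region(r, t) {
                            /* 1. pick one page of the region and arm the
                             * access check (e.g. clear its Accessed bit) */
                            r->sampling_addr =
                                    my_pick_sample_page(r->ar.start, r->ar.end);
                    }
    }

    static void my_check_accesses(struct damon_ctx *ctx)
    {
            struct damon_target *t;
            struct damon_region *r;

            damon_for_each_target(t, ctx)
                    damon_for_each_region(r, t) {
                            /* 3./4. if the sampled page was accessed
                             * meanwhile, count one access for the region */
                            if (my_page_accessed(r->sampling_addr))
                                    r->nr_accesses++;
                    }
    }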

Link: https://lkml.kernel.org/r/20210716081449.22187-3-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   77 ++++++++++++++++++++-
 mm/damon/core.c       |  143 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 213 insertions(+), 7 deletions(-)

--- a/include/linux/damon.h~mm-damon-core-implement-region-based-sampling
+++ a/include/linux/damon.h
@@ -12,6 +12,48 @@
 #include <linux/time64.h>
 #include <linux/types.h>
 
+/**
+ * struct damon_addr_range - Represents an address region of [@start, @end).
+ * @start:	Start address of the region (inclusive).
+ * @end:	End address of the region (exclusive).
+ */
+struct damon_addr_range {
+	unsigned long start;
+	unsigned long end;
+};
+
+/**
+ * struct damon_region - Represents a monitoring target region.
+ * @ar:			The address range of the region.
+ * @sampling_addr:	Address of the sample for the next access check.
+ * @nr_accesses:	Access frequency of this region.
+ * @list:		List head for siblings.
+ */
+struct damon_region {
+	struct damon_addr_range ar;
+	unsigned long sampling_addr;
+	unsigned int nr_accesses;
+	struct list_head list;
+};
+
+/**
+ * struct damon_target - Represents a monitoring target.
+ * @id:			Unique identifier for this target.
+ * @regions_list:	Head of the monitoring target regions of this target.
+ * @list:		List head for siblings.
+ *
+ * Each monitoring context could have multiple targets.  For example, a context
+ * for virtual memory address spaces could have multiple target processes.  The
+ * @id of each target should be unique among the targets of the context.  For
+ * example, in the virtual address monitoring context, it could be a pidfd or
+ * an address of an mm_struct.
+ */
+struct damon_target {
+	unsigned long id;
+	struct list_head regions_list;
+	struct list_head list;
+};
+
 struct damon_ctx;
 
 /**
@@ -36,7 +78,7 @@ struct damon_ctx;
  *
  * @init should initialize primitive-internal data structures.  For example,
  * this could be used to construct proper monitoring target regions and link
- * those to @damon_ctx.target.
+ * those to @damon_ctx.region_targets.
  * @update should update the primitive-internal data structures.  For example,
  * this could be used to update monitoring target regions for current status.
  * @prepare_access_checks should manipulate the monitoring regions to be
@@ -130,7 +172,7 @@ struct damon_callback {
  * @primitive:	Set of monitoring primitives for given use cases.
  * @callback:	Set of callbacks for monitoring events notifications.
  *
- * @target:	Pointer to the user-defined monitoring target.
+ * @region_targets:	Head of monitoring targets (&damon_target) list.
  */
 struct damon_ctx {
 	unsigned long sample_interval;
@@ -149,11 +191,40 @@ struct damon_ctx {
 	struct damon_primitive primitive;
 	struct damon_callback callback;
 
-	void *target;
+	struct list_head region_targets;
 };
 
+#define damon_next_region(r) \
+	(container_of(r->list.next, struct damon_region, list))
+
+#define damon_prev_region(r) \
+	(container_of(r->list.prev, struct damon_region, list))
+
+#define damon_for_each_region(r, t) \
+	list_for_each_entry(r, &t->regions_list, list)
+
+#define damon_for_each_region_safe(r, next, t) \
+	list_for_each_entry_safe(r, next, &t->regions_list, list)
+
+#define damon_for_each_target(t, ctx) \
+	list_for_each_entry(t, &(ctx)->region_targets, list)
+
+#define damon_for_each_target_safe(t, next, ctx)	\
+	list_for_each_entry_safe(t, next, &(ctx)->region_targets, list)
+
 #ifdef CONFIG_DAMON
 
+struct damon_region *damon_new_region(unsigned long start, unsigned long end);
+inline void damon_insert_region(struct damon_region *r,
+		struct damon_region *prev, struct damon_region *next);
+void damon_add_region(struct damon_region *r, struct damon_target *t);
+void damon_destroy_region(struct damon_region *r);
+
+struct damon_target *damon_new_target(unsigned long id);
+void damon_add_target(struct damon_ctx *ctx, struct damon_target *t);
+void damon_free_target(struct damon_target *t);
+void damon_destroy_target(struct damon_target *t);
+
 struct damon_ctx *damon_new_ctx(void);
 void damon_destroy_ctx(struct damon_ctx *ctx);
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
--- a/mm/damon/core.c~mm-damon-core-implement-region-based-sampling
+++ a/mm/damon/core.c
@@ -15,6 +15,101 @@
 static DEFINE_MUTEX(damon_lock);
 static int nr_running_ctxs;
 
+/*
+ * Construct a damon_region struct
+ *
+ * Returns the pointer to the new struct on success, or NULL otherwise
+ */
+struct damon_region *damon_new_region(unsigned long start, unsigned long end)
+{
+	struct damon_region *region;
+
+	region = kmalloc(sizeof(*region), GFP_KERNEL);
+	if (!region)
+		return NULL;
+
+	region->ar.start = start;
+	region->ar.end = end;
+	region->nr_accesses = 0;
+	INIT_LIST_HEAD(&region->list);
+
+	return region;
+}
+
+/*
+ * Add a region between two other regions
+ */
+inline void damon_insert_region(struct damon_region *r,
+		struct damon_region *prev, struct damon_region *next)
+{
+	__list_add(&r->list, &prev->list, &next->list);
+}
+
+void damon_add_region(struct damon_region *r, struct damon_target *t)
+{
+	list_add_tail(&r->list, &t->regions_list);
+}
+
+static void damon_del_region(struct damon_region *r)
+{
+	list_del(&r->list);
+}
+
+static void damon_free_region(struct damon_region *r)
+{
+	kfree(r);
+}
+
+void damon_destroy_region(struct damon_region *r)
+{
+	damon_del_region(r);
+	damon_free_region(r);
+}
+
+/*
+ * Construct a damon_target struct
+ *
+ * Returns the pointer to the new struct on success, or NULL otherwise
+ */
+struct damon_target *damon_new_target(unsigned long id)
+{
+	struct damon_target *t;
+
+	t = kmalloc(sizeof(*t), GFP_KERNEL);
+	if (!t)
+		return NULL;
+
+	t->id = id;
+	INIT_LIST_HEAD(&t->regions_list);
+
+	return t;
+}
+
+void damon_add_target(struct damon_ctx *ctx, struct damon_target *t)
+{
+	list_add_tail(&t->list, &ctx->region_targets);
+}
+
+static void damon_del_target(struct damon_target *t)
+{
+	list_del(&t->list);
+}
+
+void damon_free_target(struct damon_target *t)
+{
+	struct damon_region *r, *next;
+
+	damon_for_each_region_safe(r, next, t)
+		damon_free_region(r);
+	kfree(t);
+}
+
+void damon_destroy_target(struct damon_target *t)
+{
+	damon_del_target(t);
+	damon_free_target(t);
+}
+
 struct damon_ctx *damon_new_ctx(void)
 {
 	struct damon_ctx *ctx;
@@ -32,15 +127,27 @@ struct damon_ctx *damon_new_ctx(void)
 
 	mutex_init(&ctx->kdamond_lock);
 
-	ctx->target = NULL;
+	INIT_LIST_HEAD(&ctx->region_targets);
 
 	return ctx;
 }
 
-void damon_destroy_ctx(struct damon_ctx *ctx)
+static void damon_destroy_targets(struct damon_ctx *ctx)
 {
-	if (ctx->primitive.cleanup)
+	struct damon_target *t, *next_t;
+
+	if (ctx->primitive.cleanup) {
 		ctx->primitive.cleanup(ctx);
+		return;
+	}
+
+	damon_for_each_target_safe(t, next_t, ctx)
+		damon_destroy_target(t);
+}
+
+void damon_destroy_ctx(struct damon_ctx *ctx)
+{
+	damon_destroy_targets(ctx);
 	kfree(ctx);
 }
 
@@ -218,6 +325,21 @@ static bool kdamond_aggregate_interval_p
 }
 
 /*
+ * Reset the aggregated monitoring results ('nr_accesses' of each region).
+ */
+static void kdamond_reset_aggregated(struct damon_ctx *c)
+{
+	struct damon_target *t;
+
+	damon_for_each_target(t, c) {
+		struct damon_region *r;
+
+		damon_for_each_region(r, t)
+			r->nr_accesses = 0;
+	}
+}
+
+/*
  * Check whether it is time to check and apply the target monitoring regions
  *
  * Returns true if it is.
@@ -238,6 +360,7 @@ static bool kdamond_need_update_primitiv
  */
 static bool kdamond_need_stop(struct damon_ctx *ctx)
 {
+	struct damon_target *t;
 	bool stop;
 
 	mutex_lock(&ctx->kdamond_lock);
@@ -249,7 +372,12 @@ static bool kdamond_need_stop(struct dam
 	if (!ctx->primitive.target_valid)
 		return false;
 
-	return !ctx->primitive.target_valid(ctx->target);
+	damon_for_each_target(t, ctx) {
+		if (ctx->primitive.target_valid(t))
+			return false;
+	}
+
+	return true;
 }
 
 static void set_kdamond_stop(struct damon_ctx *ctx)
@@ -265,6 +393,8 @@ static void set_kdamond_stop(struct damo
 static int kdamond_fn(void *data)
 {
 	struct damon_ctx *ctx = (struct damon_ctx *)data;
+	struct damon_target *t;
+	struct damon_region *r, *next;
 
 	mutex_lock(&ctx->kdamond_lock);
 	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
@@ -291,6 +421,7 @@ static int kdamond_fn(void *data)
 			if (ctx->callback.after_aggregation &&
 					ctx->callback.after_aggregation(ctx))
 				set_kdamond_stop(ctx);
+			kdamond_reset_aggregated(ctx);
 			if (ctx->primitive.reset_aggregated)
 				ctx->primitive.reset_aggregated(ctx);
 		}
@@ -300,6 +431,10 @@ static int kdamond_fn(void *data)
 				ctx->primitive.update(ctx);
 		}
 	}
+	damon_for_each_target(t, ctx) {
+		damon_for_each_region_safe(r, next, t)
+			damon_destroy_region(r);
+	}
 
 	if (ctx->callback.before_terminate &&
 			ctx->callback.before_terminate(ctx))
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 065/147] mm/damon: adaptively adjust regions
  2021-09-08  2:52 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2021-09-08  2:56 ` [patch 064/147] mm/damon/core: implement region-based sampling Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 066/147] mm/idle_page_tracking: make PG_idle reusable Andrew Morton
                   ` (82 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: adaptively adjust regions

Even if the initial monitoring target regions are well constructed to
fulfill the assumption (pages in the same region have similar access
frequencies), the data access pattern can change dynamically.  This will
result in low monitoring quality.  To keep the assumption holding as much
as possible, DAMON adaptively merges and splits each region based on its
access frequency.

For each ``aggregation interval``, it compares the access frequencies of
adjacent regions and merges them if the frequency difference is small.
Then, after it reports and clears the aggregated access frequency of each
region, it splits each region into two or three regions if the total
number of regions will not exceed the user-specified maximum number of
regions after the split.
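
For example (an illustrative calculation; the merge uses the size-weighted
average implemented in 'damon_merge_two_regions()' below), merging an 8 KiB
region having 'nr_accesses' of 10 with an adjacent 4 KiB region having
'nr_accesses' of 8 yields a single 12 KiB region with 'nr_accesses' of
(10 * 8192 + 8 * 4096) / 12288 = 9.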

In this way, DAMON provides best-effort monitoring quality with minimal
overhead, while staying within the upper-bound overhead that users set.

Link: https://lkml.kernel.org/r/20210716081449.22187-4-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   30 +++--
 mm/damon/core.c       |  224 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 237 insertions(+), 17 deletions(-)

--- a/include/linux/damon.h~mm-damon-adaptively-adjust-regions
+++ a/include/linux/damon.h
@@ -12,6 +12,9 @@
 #include <linux/time64.h>
 #include <linux/types.h>
 
+/* Minimal region size.  Every damon_region is aligned by this. */
+#define DAMON_MIN_REGION	PAGE_SIZE
+
 /**
  * struct damon_addr_range - Represents an address region of [@start, @end).
  * @start:	Start address of the region (inclusive).
@@ -39,6 +42,7 @@ struct damon_region {
 /**
  * struct damon_target - Represents a monitoring target.
  * @id:			Unique identifier for this target.
+ * @nr_regions:		Number of monitoring target regions of this target.
  * @regions_list:	Head of the monitoring target regions of this target.
  * @list:		List head for siblings.
  *
@@ -50,6 +54,7 @@ struct damon_region {
  */
 struct damon_target {
 	unsigned long id;
+	unsigned int nr_regions;
 	struct list_head regions_list;
 	struct list_head list;
 };
@@ -85,6 +90,8 @@ struct damon_ctx;
  * prepared for the next access check.
  * @check_accesses should check the accesses to each region that made after the
  * last preparation and update the number of observed accesses of each region.
+ * It should also return the max number of observed accesses that were made
+ * as a result of its update, to be used as the regions adjustment threshold.
  * @reset_aggregated should reset the access monitoring results that aggregated
  * by @check_accesses.
  * @target_valid should check whether the target is still valid for the
@@ -95,7 +102,7 @@ struct damon_primitive {
 	void (*init)(struct damon_ctx *context);
 	void (*update)(struct damon_ctx *context);
 	void (*prepare_access_checks)(struct damon_ctx *context);
-	void (*check_accesses)(struct damon_ctx *context);
+	unsigned int (*check_accesses)(struct damon_ctx *context);
 	void (*reset_aggregated)(struct damon_ctx *context);
 	bool (*target_valid)(void *target);
 	void (*cleanup)(struct damon_ctx *context);
@@ -172,7 +179,9 @@ struct damon_callback {
  * @primitive:	Set of monitoring primitives for given use cases.
  * @callback:	Set of callbacks for monitoring events notifications.
  *
- * @region_targets:	Head of monitoring targets (&damon_target) list.
+ * @min_nr_regions:	The minimum number of adaptive monitoring regions.
+ * @max_nr_regions:	The maximum number of adaptive monitoring regions.
+ * @adaptive_targets:	Head of monitoring targets (&damon_target) list.
  */
 struct damon_ctx {
 	unsigned long sample_interval;
@@ -191,7 +200,9 @@ struct damon_ctx {
 	struct damon_primitive primitive;
 	struct damon_callback callback;
 
-	struct list_head region_targets;
+	unsigned long min_nr_regions;
+	unsigned long max_nr_regions;
+	struct list_head adaptive_targets;
 };
 
 #define damon_next_region(r) \
@@ -207,28 +218,31 @@ struct damon_ctx {
 	list_for_each_entry_safe(r, next, &t->regions_list, list)
 
 #define damon_for_each_target(t, ctx) \
-	list_for_each_entry(t, &(ctx)->region_targets, list)
+	list_for_each_entry(t, &(ctx)->adaptive_targets, list)
 
 #define damon_for_each_target_safe(t, next, ctx)	\
-	list_for_each_entry_safe(t, next, &(ctx)->region_targets, list)
+	list_for_each_entry_safe(t, next, &(ctx)->adaptive_targets, list)
 
 #ifdef CONFIG_DAMON
 
 struct damon_region *damon_new_region(unsigned long start, unsigned long end);
 inline void damon_insert_region(struct damon_region *r,
-		struct damon_region *prev, struct damon_region *next);
+		struct damon_region *prev, struct damon_region *next,
+		struct damon_target *t);
 void damon_add_region(struct damon_region *r, struct damon_target *t);
-void damon_destroy_region(struct damon_region *r);
+void damon_destroy_region(struct damon_region *r, struct damon_target *t);
 
 struct damon_target *damon_new_target(unsigned long id);
 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t);
 void damon_free_target(struct damon_target *t);
 void damon_destroy_target(struct damon_target *t);
+unsigned int damon_nr_regions(struct damon_target *t);
 
 struct damon_ctx *damon_new_ctx(void);
 void damon_destroy_ctx(struct damon_ctx *ctx);
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
-		unsigned long aggr_int, unsigned long primitive_upd_int);
+		unsigned long aggr_int, unsigned long primitive_upd_int,
+		unsigned long min_nr_reg, unsigned long max_nr_reg);
 
 int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
 int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
--- a/mm/damon/core.c~mm-damon-adaptively-adjust-regions
+++ a/mm/damon/core.c
@@ -10,8 +10,12 @@
 #include <linux/damon.h>
 #include <linux/delay.h>
 #include <linux/kthread.h>
+#include <linux/random.h>
 #include <linux/slab.h>
 
+/* Get a random number in [l, r) */
+#define damon_rand(l, r) ((l) + prandom_u32_max((r) - (l)))
+
 static DEFINE_MUTEX(damon_lock);
 static int nr_running_ctxs;
 
@@ -40,19 +44,23 @@ struct damon_region *damon_new_region(un
  * Add a region between two other regions
  */
 inline void damon_insert_region(struct damon_region *r,
-		struct damon_region *prev, struct damon_region *next)
+		struct damon_region *prev, struct damon_region *next,
+		struct damon_target *t)
 {
 	__list_add(&r->list, &prev->list, &next->list);
+	t->nr_regions++;
 }
 
 void damon_add_region(struct damon_region *r, struct damon_target *t)
 {
 	list_add_tail(&r->list, &t->regions_list);
+	t->nr_regions++;
 }
 
-static void damon_del_region(struct damon_region *r)
+static void damon_del_region(struct damon_region *r, struct damon_target *t)
 {
 	list_del(&r->list);
+	t->nr_regions--;
 }
 
 static void damon_free_region(struct damon_region *r)
@@ -60,9 +68,9 @@ static void damon_free_region(struct dam
 	kfree(r);
 }
 
-void damon_destroy_region(struct damon_region *r)
+void damon_destroy_region(struct damon_region *r, struct damon_target *t)
 {
-	damon_del_region(r);
+	damon_del_region(r, t);
 	damon_free_region(r);
 }
 
@@ -80,6 +88,7 @@ struct damon_target *damon_new_target(un
 		return NULL;
 
 	t->id = id;
+	t->nr_regions = 0;
 	INIT_LIST_HEAD(&t->regions_list);
 
 	return t;
@@ -87,7 +96,7 @@ struct damon_target *damon_new_target(un
 
 void damon_add_target(struct damon_ctx *ctx, struct damon_target *t)
 {
-	list_add_tail(&t->list, &ctx->region_targets);
+	list_add_tail(&t->list, &ctx->adaptive_targets);
 }
 
 static void damon_del_target(struct damon_target *t)
@@ -110,6 +119,11 @@ void damon_destroy_target(struct damon_t
 	damon_free_target(t);
 }
 
+unsigned int damon_nr_regions(struct damon_target *t)
+{
+	return t->nr_regions;
+}
+
 struct damon_ctx *damon_new_ctx(void)
 {
 	struct damon_ctx *ctx;
@@ -127,7 +141,10 @@ struct damon_ctx *damon_new_ctx(void)
 
 	mutex_init(&ctx->kdamond_lock);
 
-	INIT_LIST_HEAD(&ctx->region_targets);
+	ctx->min_nr_regions = 10;
+	ctx->max_nr_regions = 1000;
+
+	INIT_LIST_HEAD(&ctx->adaptive_targets);
 
 	return ctx;
 }
@@ -157,6 +174,8 @@ void damon_destroy_ctx(struct damon_ctx
  * @sample_int:		time interval between samplings
  * @aggr_int:		time interval between aggregations
  * @primitive_upd_int:	time interval between monitoring primitive updates
+ * @min_nr_reg:		minimal number of regions
+ * @max_nr_reg:		maximum number of regions
  *
  * This function should not be called while the kdamond is running.
  * Every time interval is in micro-seconds.
@@ -164,15 +183,49 @@ void damon_destroy_ctx(struct damon_ctx
  * Return: 0 on success, negative error code otherwise.
  */
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
-		    unsigned long aggr_int, unsigned long primitive_upd_int)
+		    unsigned long aggr_int, unsigned long primitive_upd_int,
+		    unsigned long min_nr_reg, unsigned long max_nr_reg)
 {
+	if (min_nr_reg < 3) {
+		pr_err("min_nr_regions (%lu) must be at least 3\n",
+				min_nr_reg);
+		return -EINVAL;
+	}
+	if (min_nr_reg > max_nr_reg) {
+		pr_err("invalid nr_regions.  min (%lu) > max (%lu)\n",
+				min_nr_reg, max_nr_reg);
+		return -EINVAL;
+	}
+
 	ctx->sample_interval = sample_int;
 	ctx->aggr_interval = aggr_int;
 	ctx->primitive_update_interval = primitive_upd_int;
+	ctx->min_nr_regions = min_nr_reg;
+	ctx->max_nr_regions = max_nr_reg;
 
 	return 0;
 }
 
+/* Returns the size upper limit for each monitoring region */
+static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long sz = 0;
+
+	damon_for_each_target(t, ctx) {
+		damon_for_each_region(r, t)
+			sz += r->ar.end - r->ar.start;
+	}
+
+	if (ctx->min_nr_regions)
+		sz /= ctx->min_nr_regions;
+	if (sz < DAMON_MIN_REGION)
+		sz = DAMON_MIN_REGION;
+
+	return sz;
+}
+
 static bool damon_kdamond_running(struct damon_ctx *ctx)
 {
 	bool running;
@@ -339,6 +392,150 @@ static void kdamond_reset_aggregated(str
 	}
 }
 
+#define sz_damon_region(r) ((r)->ar.end - (r)->ar.start)
+
+/*
+ * Merge two adjacent regions into one region
+ */
+static void damon_merge_two_regions(struct damon_target *t,
+		struct damon_region *l, struct damon_region *r)
+{
+	unsigned long sz_l = sz_damon_region(l), sz_r = sz_damon_region(r);
+
+	l->nr_accesses = (l->nr_accesses * sz_l + r->nr_accesses * sz_r) /
+			(sz_l + sz_r);
+	l->ar.end = r->ar.end;
+	damon_destroy_region(r, t);
+}
+
+#define diff_of(a, b) ((a) > (b) ? (a) - (b) : (b) - (a))
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * t		target affected by this merge operation
+ * thres	'->nr_accesses' diff threshold for the merge
+ * sz_limit	size upper limit of each region
+ */
+static void damon_merge_regions_of(struct damon_target *t, unsigned int thres,
+				   unsigned long sz_limit)
+{
+	struct damon_region *r, *prev = NULL, *next;
+
+	damon_for_each_region_safe(r, next, t) {
+		if (prev && prev->ar.end == r->ar.start &&
+		    diff_of(prev->nr_accesses, r->nr_accesses) <= thres &&
+		    sz_damon_region(prev) + sz_damon_region(r) <= sz_limit)
+			damon_merge_two_regions(t, prev, r);
+		else
+			prev = r;
+	}
+}
+
+/*
+ * Merge adjacent regions having similar access frequencies
+ *
+ * threshold	'->nr_accesses' diff threshold for the merge
+ * sz_limit	size upper limit of each region
+ *
+ * This function merges monitoring target regions which are adjacent and whose
+ * access frequencies are similar.  This is for minimizing the monitoring
+ * overhead under the dynamically changeable access pattern.  If a merge was
+ * unnecessarily made, later 'kdamond_split_regions()' will revert it.
+ */
+static void kdamond_merge_regions(struct damon_ctx *c, unsigned int threshold,
+				  unsigned long sz_limit)
+{
+	struct damon_target *t;
+
+	damon_for_each_target(t, c)
+		damon_merge_regions_of(t, threshold, sz_limit);
+}
+
+/*
+ * Split a region in two
+ *
+ * r		the region to be split
+ * sz_r		size of the first sub-region that will be made
+ */
+static void damon_split_region_at(struct damon_ctx *ctx,
+		struct damon_target *t, struct damon_region *r,
+		unsigned long sz_r)
+{
+	struct damon_region *new;
+
+	new = damon_new_region(r->ar.start + sz_r, r->ar.end);
+	if (!new)
+		return;
+
+	r->ar.end = new->ar.start;
+
+	damon_insert_region(new, r, damon_next_region(r), t);
+}
+
+/* Split every region in the given target into 'nr_subs' regions */
+static void damon_split_regions_of(struct damon_ctx *ctx,
+				     struct damon_target *t, int nr_subs)
+{
+	struct damon_region *r, *next;
+	unsigned long sz_region, sz_sub = 0;
+	int i;
+
+	damon_for_each_region_safe(r, next, t) {
+		sz_region = r->ar.end - r->ar.start;
+
+		for (i = 0; i < nr_subs - 1 &&
+				sz_region > 2 * DAMON_MIN_REGION; i++) {
+			/*
+			 * Randomly select size of the left sub-region to be
+			 * at least 10% and at most 90% of the original region
+			 */
+			sz_sub = ALIGN_DOWN(damon_rand(1, 10) *
+					sz_region / 10, DAMON_MIN_REGION);
+			/* Do not allow blank region */
+			if (sz_sub == 0 || sz_sub >= sz_region)
+				continue;
+
+			damon_split_region_at(ctx, t, r, sz_sub);
+			sz_region = sz_sub;
+		}
+	}
+}
+
+/*
+ * Split every target region into randomly-sized small regions
+ *
+ * This function splits every target region into randomly-sized small regions
+ * if the current total number of regions is equal to or smaller than half of
+ * the user-specified maximum number of regions.  This is for maximizing the
+ * monitoring accuracy under the dynamically changeable access patterns.  If a
+ * split was unnecessarily made, later 'kdamond_merge_regions()' will revert
+ * it.
+ */
+static void kdamond_split_regions(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	unsigned int nr_regions = 0;
+	static unsigned int last_nr_regions;
+	int nr_subregions = 2;
+
+	damon_for_each_target(t, ctx)
+		nr_regions += damon_nr_regions(t);
+
+	if (nr_regions > ctx->max_nr_regions / 2)
+		return;
+
+	/* Maybe the middle of the region has different access frequency */
+	if (last_nr_regions == nr_regions &&
+			nr_regions < ctx->max_nr_regions / 3)
+		nr_subregions = 3;
+
+	damon_for_each_target(t, ctx)
+		damon_split_regions_of(ctx, t, nr_subregions);
+
+	last_nr_regions = nr_regions;
+}
+
 /*
  * Check whether it is time to check and apply the target monitoring regions
  *
@@ -395,6 +592,8 @@ static int kdamond_fn(void *data)
 	struct damon_ctx *ctx = (struct damon_ctx *)data;
 	struct damon_target *t;
 	struct damon_region *r, *next;
+	unsigned int max_nr_accesses = 0;
+	unsigned long sz_limit = 0;
 
 	mutex_lock(&ctx->kdamond_lock);
 	pr_info("kdamond (%d) starts\n", ctx->kdamond->pid);
@@ -405,6 +604,8 @@ static int kdamond_fn(void *data)
 	if (ctx->callback.before_start && ctx->callback.before_start(ctx))
 		set_kdamond_stop(ctx);
 
+	sz_limit = damon_region_sz_limit(ctx);
+
 	while (!kdamond_need_stop(ctx)) {
 		if (ctx->primitive.prepare_access_checks)
 			ctx->primitive.prepare_access_checks(ctx);
@@ -415,13 +616,17 @@ static int kdamond_fn(void *data)
 		usleep_range(ctx->sample_interval, ctx->sample_interval + 1);
 
 		if (ctx->primitive.check_accesses)
-			ctx->primitive.check_accesses(ctx);
+			max_nr_accesses = ctx->primitive.check_accesses(ctx);
 
 		if (kdamond_aggregate_interval_passed(ctx)) {
+			kdamond_merge_regions(ctx,
+					max_nr_accesses / 10,
+					sz_limit);
 			if (ctx->callback.after_aggregation &&
 					ctx->callback.after_aggregation(ctx))
 				set_kdamond_stop(ctx);
 			kdamond_reset_aggregated(ctx);
+			kdamond_split_regions(ctx);
 			if (ctx->primitive.reset_aggregated)
 				ctx->primitive.reset_aggregated(ctx);
 		}
@@ -429,11 +634,12 @@ static int kdamond_fn(void *data)
 		if (kdamond_need_update_primitive(ctx)) {
 			if (ctx->primitive.update)
 				ctx->primitive.update(ctx);
+			sz_limit = damon_region_sz_limit(ctx);
 		}
 	}
 	damon_for_each_target(t, ctx) {
 		damon_for_each_region_safe(r, next, t)
-			damon_destroy_region(r);
+			damon_destroy_region(r, t);
 	}
 
 	if (ctx->callback.before_terminate &&
_


* [patch 066/147] mm/idle_page_tracking: make PG_idle reusable
  2021-09-08  2:52 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2021-09-08  2:56 ` [patch 065/147] mm/damon: adaptively adjust regions Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 067/147] mm/damon: implement primitives for the virtual memory address spaces Andrew Morton
                   ` (81 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/idle_page_tracking: make PG_idle reusable

PG_idle and PG_young allow the two PTE Accessed bit users, Idle Page
Tracking and the reclaim logic, to work concurrently without interfering
with each other.  That is, when one of them needs to clear the Accessed
bit, it sets PG_young to record the previous state of the bit.  And when
it needs to read the bit, if the bit is cleared, it additionally reads
PG_young to know whether the other user has cleared the bit in the
meantime.
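
In code, the protocol looks roughly like below.  This is only a minimal
sketch using the existing <linux/page_idle.h> helpers; the reference
implementation in the next commit ('damon_ptep_mkold()' and
'damon_va_young()') follows the same pattern::

    if (pte_young(*pte)) {
            *pte = pte_mkold(*pte);  /* clear the PTE Accessed bit */
            set_page_young(page);    /* record the old state for the other user */
    }
    set_page_idle(page);

    /* ... one sampling/scan period later ... */
    accessed = pte_young(*pte) || !page_is_idle(page);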

For yet another user of the PTE Accessed bit, we could add another page
flag, or extend the mechanism to use the flags.  For the DAMON use case,
however, we don't need to do that just yet.  IDLE_PAGE_TRACKING and DAMON
are mutually exclusive, so there's only ever going to be one user of the
current set of flags.

In this commit, we split out the CONFIG options to allow for the use of
PG_young and PG_idle outside of idle page tracking.

In the next commit, DAMON's reference implementation of the virtual memory
address space monitoring primitives will use it.

[sjpark@amazon.de: set PAGE_EXTENSION for non-64BIT]
  Link: https://lkml.kernel.org/r/20210806095153.6444-1-sj38.park@gmail.com
[akpm@linux-foundation.org: tweak Kconfig text]
[sjpark@amazon.de: hide PAGE_IDLE_FLAG from users]
  Link: https://lkml.kernel.org/r/20210813081238.34705-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-5-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h     |    4 ++--
 include/linux/page_ext.h       |    2 +-
 include/linux/page_idle.h      |    6 +++---
 include/trace/events/mmflags.h |    2 +-
 mm/Kconfig                     |   10 +++++++++-
 mm/page_ext.c                  |   12 +++++++++++-
 mm/page_idle.c                 |   10 ----------
 7 files changed, 27 insertions(+), 19 deletions(-)

--- a/include/linux/page_ext.h~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/include/linux/page_ext.h
@@ -19,7 +19,7 @@ struct page_ext_operations {
 enum page_ext_flags {
 	PAGE_EXT_OWNER,
 	PAGE_EXT_OWNER_ALLOCATED,
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
 	PAGE_EXT_YOUNG,
 	PAGE_EXT_IDLE,
 #endif
--- a/include/linux/page-flags.h~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/include/linux/page-flags.h
@@ -131,7 +131,7 @@ enum pageflags {
 #ifdef CONFIG_MEMORY_FAILURE
 	PG_hwpoison,		/* hardware poisoned page. Don't touch */
 #endif
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
 	PG_young,
 	PG_idle,
 #endif
@@ -441,7 +441,7 @@ PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
 #endif
 
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
 TESTPAGEFLAG(Young, young, PF_ANY)
 SETPAGEFLAG(Young, young, PF_ANY)
 TESTCLEARFLAG(Young, young, PF_ANY)
--- a/include/linux/page_idle.h~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/include/linux/page_idle.h
@@ -6,7 +6,7 @@
 #include <linux/page-flags.h>
 #include <linux/page_ext.h>
 
-#ifdef CONFIG_IDLE_PAGE_TRACKING
+#ifdef CONFIG_PAGE_IDLE_FLAG
 
 #ifdef CONFIG_64BIT
 static inline bool page_is_young(struct page *page)
@@ -106,7 +106,7 @@ static inline void clear_page_idle(struc
 }
 #endif /* CONFIG_64BIT */
 
-#else /* !CONFIG_IDLE_PAGE_TRACKING */
+#else /* !CONFIG_PAGE_IDLE_FLAG */
 
 static inline bool page_is_young(struct page *page)
 {
@@ -135,6 +135,6 @@ static inline void clear_page_idle(struc
 {
 }
 
-#endif /* CONFIG_IDLE_PAGE_TRACKING */
+#endif /* CONFIG_PAGE_IDLE_FLAG */
 
 #endif /* _LINUX_MM_PAGE_IDLE_H */
--- a/include/trace/events/mmflags.h~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/include/trace/events/mmflags.h
@@ -75,7 +75,7 @@
 #define IF_HAVE_PG_HWPOISON(flag,string)
 #endif
 
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && defined(CONFIG_64BIT)
 #define IF_HAVE_PG_IDLE(flag,string) ,{1UL << flag, string}
 #else
 #define IF_HAVE_PG_IDLE(flag,string)
--- a/mm/Kconfig~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/mm/Kconfig
@@ -739,10 +739,18 @@ config DEFERRED_STRUCT_PAGE_INIT
 	  lifetime of the system until these kthreads finish the
 	  initialisation.
 
+config PAGE_IDLE_FLAG
+	bool
+	select PAGE_EXTENSION if !64BIT
+	help
+	  This adds PG_idle and PG_young flags to 'struct page'.  PTE Accessed
+	  bit writers can set the state of the bit in the flags so that PTE
+	  Accessed bit readers may avoid disturbance.
+
 config IDLE_PAGE_TRACKING
 	bool "Enable idle page tracking"
 	depends on SYSFS && MMU
-	select PAGE_EXTENSION if !64BIT
+	select PAGE_IDLE_FLAG
 	help
 	  This feature allows to estimate the amount of user pages that have
 	  not been touched during a given period of time. This information can
--- a/mm/page_ext.c~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/mm/page_ext.c
@@ -58,11 +58,21 @@
  * can utilize this callback to initialize the state of it correctly.
  */
 
+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
+static bool need_page_idle(void)
+{
+	return true;
+}
+struct page_ext_operations page_idle_ops = {
+	.need = need_page_idle,
+};
+#endif
+
 static struct page_ext_operations *page_ext_ops[] = {
 #ifdef CONFIG_PAGE_OWNER
 	&page_owner_ops,
 #endif
-#if defined(CONFIG_IDLE_PAGE_TRACKING) && !defined(CONFIG_64BIT)
+#if defined(CONFIG_PAGE_IDLE_FLAG) && !defined(CONFIG_64BIT)
 	&page_idle_ops,
 #endif
 };
--- a/mm/page_idle.c~mm-idle_page_tracking-make-pg_idle-reusable
+++ a/mm/page_idle.c
@@ -207,16 +207,6 @@ static const struct attribute_group page
 	.name = "page_idle",
 };
 
-#ifndef CONFIG_64BIT
-static bool need_page_idle(void)
-{
-	return true;
-}
-struct page_ext_operations page_idle_ops = {
-	.need = need_page_idle,
-};
-#endif
-
 static int __init page_idle_init(void)
 {
 	int err;
_


* [patch 067/147] mm/damon: implement primitives for the virtual memory address spaces
  2021-09-08  2:52 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2021-09-08  2:56 ` [patch 066/147] mm/idle_page_tracking: make PG_idle reusable Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 068/147] mm/damon: add a tracepoint Andrew Morton
                   ` (80 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: implement primitives for the virtual memory address spaces

This commit introduces a reference implementation of the address space
specific low level primitives for the virtual address space, so that users
of DAMON can easily monitor the data accesses on virtual address spaces of
specific processes by simply configuring the implementation to be used by
DAMON.

The low level primitives for the fundamental access monitoring are defined
in two parts:

1. Identification of the monitoring target address range for the address
   space.
2. Access check of specific address range in the target space.

The reference implementation for the virtual address space works as
described below.

PTE Accessed-bit Based Access Check
-----------------------------------

The implementation uses the PTE Accessed bit for basic access checks.  That
is, it clears the bit for the next sampling target page and checks whether
it is set again after one sampling period.  Since this could disturb the
reclaim logic, DAMON uses the ``PG_idle`` and ``PG_young`` page flags to
resolve the conflict, as Idle Page Tracking does.
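
Putting it together, one sampling step for a region 'r' is roughly as
below.  This is a simplified sketch of what 'damon_va_prepare_access_check()'
and 'damon_va_check_access()' of this patch do::

    r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
    damon_va_mkold(mm, r->sampling_addr);  /* clear Accessed bit, set PG_idle */

    /* ... one sampling interval passes ... */

    if (damon_va_young(mm, r->sampling_addr, &page_sz))
            r->nr_accesses++;  /* aggregated until the aggregation interval */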

VMA-based Target Address Range Construction
-------------------------------------------

Only small parts of the super-huge virtual address space of a process are
mapped to physical memory and accessed, so tracking the unmapped address
regions is just wasteful.  However, because DAMON can deal with some level
of noise using the adaptive regions adjustment mechanism, tracking every
mapping is not strictly required, and doing so could even incur a high
overhead in some cases.  That said, excessively huge unmapped areas inside
the monitoring target should be excluded so that the adaptive mechanism
does not spend time on them.

For this reason, this implementation converts the complex mappings into
three distinct regions that cover every mapped area of the address space,
such that the two gaps between the three regions are the two biggest
unmapped areas in the given address space.  In most cases, the two biggest
unmapped areas are the gap between the heap and the uppermost mmap()-ed
region, and the gap between the lowermost mmap()-ed region and the stack.
Because these gaps are exceptionally huge in usual address spaces,
excluding them is sufficient to make a reasonable trade-off.  The below
shows this in detail::

    <heap>
    <BIG UNMAPPED REGION 1>
    <uppermost mmap()-ed region>
    (small mmap()-ed regions and munmap()-ed regions)
    <lowermost mmap()-ed region>
    <BIG UNMAPPED REGION 2>
    <stack>

[akpm@linux-foundation.org: mm/damon/vaddr.c needs highmem.h for kunmap_atomic()]
[sjpark@amazon.de: remove unnecessary PAGE_EXTENSION setup]
  Link: https://lkml.kernel.org/r/20210806095153.6444-2-sj38.park@gmail.com
[sjpark@amazon.de: safely walk page table]
  Link: https://lkml.kernel.org/r/20210831161800.29419-1-sj38.park@gmail.com
Link: https://lkml.kernel.org/r/20210716081449.22187-6-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |   13 
 mm/damon/Kconfig      |    8 
 mm/damon/Makefile     |    1 
 mm/damon/vaddr.c      |  665 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 687 insertions(+)

--- a/include/linux/damon.h~mm-damon-implement-primitives-for-the-virtual-memory-address-spaces
+++ a/include/linux/damon.h
@@ -249,4 +249,17 @@ int damon_stop(struct damon_ctx **ctxs,
 
 #endif	/* CONFIG_DAMON */
 
+#ifdef CONFIG_DAMON_VADDR
+
+/* Monitoring primitives for virtual memory address spaces */
+void damon_va_init(struct damon_ctx *ctx);
+void damon_va_update(struct damon_ctx *ctx);
+void damon_va_prepare_access_checks(struct damon_ctx *ctx);
+unsigned int damon_va_check_accesses(struct damon_ctx *ctx);
+bool damon_va_target_valid(void *t);
+void damon_va_cleanup(struct damon_ctx *ctx);
+void damon_va_set_primitives(struct damon_ctx *ctx);
+
+#endif	/* CONFIG_DAMON_VADDR */
+
 #endif	/* _DAMON_H */
--- a/mm/damon/Kconfig~mm-damon-implement-primitives-for-the-virtual-memory-address-spaces
+++ a/mm/damon/Kconfig
@@ -12,4 +12,12 @@ config DAMON
 	  See https://damonitor.github.io/doc/html/latest-damon/index.html for
 	  more information.
 
+config DAMON_VADDR
+	bool "Data access monitoring primitives for virtual address spaces"
+	depends on DAMON && MMU
+	select PAGE_IDLE_FLAG
+	help
+	  This builds the default data access monitoring primitives for DAMON
+	  that works for virtual address spaces.
+
 endmenu
--- a/mm/damon/Makefile~mm-damon-implement-primitives-for-the-virtual-memory-address-spaces
+++ a/mm/damon/Makefile
@@ -1,3 +1,4 @@
 # SPDX-License-Identifier: GPL-2.0
 
 obj-$(CONFIG_DAMON)		:= core.o
+obj-$(CONFIG_DAMON_VADDR)	+= vaddr.o
--- /dev/null
+++ a/mm/damon/vaddr.c
@@ -0,0 +1,665 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAMON Primitives for Virtual Address Spaces
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#define pr_fmt(fmt) "damon-va: " fmt
+
+#include <linux/damon.h>
+#include <linux/hugetlb.h>
+#include <linux/mm.h>
+#include <linux/mmu_notifier.h>
+#include <linux/highmem.h>
+#include <linux/page_idle.h>
+#include <linux/pagewalk.h>
+#include <linux/random.h>
+#include <linux/sched/mm.h>
+#include <linux/slab.h>
+
+/* Get a random number in [l, r) */
+#define damon_rand(l, r) ((l) + prandom_u32_max((r) - (l)))
+
+/*
+ * 't->id' should be the pointer to the relevant 'struct pid' having reference
+ * count.  Caller must put the returned task, unless it is NULL.
+ */
+#define damon_get_task_struct(t) \
+	(get_pid_task((struct pid *)t->id, PIDTYPE_PID))
+
+/*
+ * Get the mm_struct of the given target
+ *
+ * Caller _must_ put the mm_struct after use, unless it is NULL.
+ *
+ * Returns the mm_struct of the target on success, NULL on failure
+ */
+static struct mm_struct *damon_get_mm(struct damon_target *t)
+{
+	struct task_struct *task;
+	struct mm_struct *mm;
+
+	task = damon_get_task_struct(t);
+	if (!task)
+		return NULL;
+
+	mm = get_task_mm(task);
+	put_task_struct(task);
+	return mm;
+}
+
+/*
+ * Functions for the initial monitoring target regions construction
+ */
+
+/*
+ * Size-evenly split a region into 'nr_pieces' small regions
+ *
+ * Returns 0 on success, or negative error code otherwise.
+ */
+static int damon_va_evenly_split_region(struct damon_target *t,
+		struct damon_region *r, unsigned int nr_pieces)
+{
+	unsigned long sz_orig, sz_piece, orig_end;
+	struct damon_region *n = NULL, *next;
+	unsigned long start;
+
+	if (!r || !nr_pieces)
+		return -EINVAL;
+
+	orig_end = r->ar.end;
+	sz_orig = r->ar.end - r->ar.start;
+	sz_piece = ALIGN_DOWN(sz_orig / nr_pieces, DAMON_MIN_REGION);
+
+	if (!sz_piece)
+		return -EINVAL;
+
+	r->ar.end = r->ar.start + sz_piece;
+	next = damon_next_region(r);
+	for (start = r->ar.end; start + sz_piece <= orig_end;
+			start += sz_piece) {
+		n = damon_new_region(start, start + sz_piece);
+		if (!n)
+			return -ENOMEM;
+		damon_insert_region(n, r, next, t);
+		r = n;
+	}
+	/* complement last region for possible rounding error */
+	if (n)
+		n->ar.end = orig_end;
+
+	return 0;
+}
+
+static unsigned long sz_range(struct damon_addr_range *r)
+{
+	return r->end - r->start;
+}
+
+static void swap_ranges(struct damon_addr_range *r1,
+			struct damon_addr_range *r2)
+{
+	struct damon_addr_range tmp;
+
+	tmp = *r1;
+	*r1 = *r2;
+	*r2 = tmp;
+}
+
+/*
+ * Find three regions separated by two biggest unmapped regions
+ *
+ * vma		the head vma of the target address space
+ * regions	an array of three address ranges that results will be saved
+ *
+ * This function receives an address space and finds three regions in it which
+ * are separated by the two biggest unmapped regions in the space.  Please
+ * refer to the comments of '__damon_va_init_regions()' below for why this is
+ * necessary.
+ *
+ * Returns 0 if success, or negative error code otherwise.
+ */
+static int __damon_va_three_regions(struct vm_area_struct *vma,
+				       struct damon_addr_range regions[3])
+{
+	struct damon_addr_range gap = {0}, first_gap = {0}, second_gap = {0};
+	struct vm_area_struct *last_vma = NULL;
+	unsigned long start = 0;
+	struct rb_root rbroot;
+
+	/* Find two biggest gaps so that first_gap > second_gap > others */
+	for (; vma; vma = vma->vm_next) {
+		if (!last_vma) {
+			start = vma->vm_start;
+			goto next;
+		}
+
+		if (vma->rb_subtree_gap <= sz_range(&second_gap)) {
+			rbroot.rb_node = &vma->vm_rb;
+			vma = rb_entry(rb_last(&rbroot),
+					struct vm_area_struct, vm_rb);
+			goto next;
+		}
+
+		gap.start = last_vma->vm_end;
+		gap.end = vma->vm_start;
+		if (sz_range(&gap) > sz_range(&second_gap)) {
+			swap_ranges(&gap, &second_gap);
+			if (sz_range(&second_gap) > sz_range(&first_gap))
+				swap_ranges(&second_gap, &first_gap);
+		}
+next:
+		last_vma = vma;
+	}
+
+	if (!sz_range(&second_gap) || !sz_range(&first_gap))
+		return -EINVAL;
+
+	/* Sort the two biggest gaps by address */
+	if (first_gap.start > second_gap.start)
+		swap_ranges(&first_gap, &second_gap);
+
+	/* Store the result */
+	regions[0].start = ALIGN(start, DAMON_MIN_REGION);
+	regions[0].end = ALIGN(first_gap.start, DAMON_MIN_REGION);
+	regions[1].start = ALIGN(first_gap.end, DAMON_MIN_REGION);
+	regions[1].end = ALIGN(second_gap.start, DAMON_MIN_REGION);
+	regions[2].start = ALIGN(second_gap.end, DAMON_MIN_REGION);
+	regions[2].end = ALIGN(last_vma->vm_end, DAMON_MIN_REGION);
+
+	return 0;
+}
+
+/*
+ * Get the three regions in the given target (task)
+ *
+ * Returns 0 on success, negative error code otherwise.
+ */
+static int damon_va_three_regions(struct damon_target *t,
+				struct damon_addr_range regions[3])
+{
+	struct mm_struct *mm;
+	int rc;
+
+	mm = damon_get_mm(t);
+	if (!mm)
+		return -EINVAL;
+
+	mmap_read_lock(mm);
+	rc = __damon_va_three_regions(mm->mmap, regions);
+	mmap_read_unlock(mm);
+
+	mmput(mm);
+	return rc;
+}
+
+/*
+ * Initialize the monitoring target regions for the given target (task)
+ *
+ * t	the given target
+ *
+ * Because only small portions of the entire address space are actually
+ * mapped to physical memory and accessed, monitoring the unmapped
+ * regions is wasteful.  That said, because we can deal with small noises,
+ * tracking every mapping is not strictly required but could even incur a high
+ * overhead if the mapping frequently changes or the number of mappings is
+ * high.  The adaptive regions adjustment mechanism will further help to deal
+ * with the noise by simply identifying the unmapped areas as a region that
+ * has no access.  Moreover, applying the real mappings that would have many
+ * unmapped areas inside will make the adaptive mechanism quite complex.  That
+ * said, excessively huge unmapped areas inside the monitoring target should
+ * be removed so the adaptive mechanism does not waste time on them.
+ *
+ * For the reason, we convert the complex mappings to three distinct regions
+ * that cover every mapped area of the address space.  Also the two gaps
+ * between the three regions are the two biggest unmapped areas in the given
+ * address space.  In detail, this function first identifies the start and the
+ * end of the mappings and the two biggest unmapped areas of the address space.
+ * Then, it constructs the three regions as below:
+ *
+ *     [mappings[0]->start, big_two_unmapped_areas[0]->start)
+ *     [big_two_unmapped_areas[0]->end, big_two_unmapped_areas[1]->start)
+ *     [big_two_unmapped_areas[1]->end, mappings[nr_mappings - 1]->end)
+ *
+ * As the usual memory map of processes looks as below, the gap between the
+ * heap and the uppermost mmap()-ed region, and the gap between the lowermost
+ * mmap()-ed region and the stack, will be the two biggest unmapped regions.
+ * Because these gaps are exceptionally huge in usual address spaces, excluding
+ * these two biggest unmapped regions will be sufficient to make a trade-off.
+ *
+ *   <heap>
+ *   <BIG UNMAPPED REGION 1>
+ *   <uppermost mmap()-ed region>
+ *   (other mmap()-ed regions and small unmapped regions)
+ *   <lowermost mmap()-ed region>
+ *   <BIG UNMAPPED REGION 2>
+ *   <stack>
+ */
+static void __damon_va_init_regions(struct damon_ctx *ctx,
+				     struct damon_target *t)
+{
+	struct damon_region *r;
+	struct damon_addr_range regions[3];
+	unsigned long sz = 0, nr_pieces;
+	int i;
+
+	if (damon_va_three_regions(t, regions)) {
+		pr_err("Failed to get three regions of target %lu\n", t->id);
+		return;
+	}
+
+	for (i = 0; i < 3; i++)
+		sz += regions[i].end - regions[i].start;
+	if (ctx->min_nr_regions)
+		sz /= ctx->min_nr_regions;
+	if (sz < DAMON_MIN_REGION)
+		sz = DAMON_MIN_REGION;
+
+	/* Set the initial three regions of the target */
+	for (i = 0; i < 3; i++) {
+		r = damon_new_region(regions[i].start, regions[i].end);
+		if (!r) {
+			pr_err("%d'th init region creation failed\n", i);
+			return;
+		}
+		damon_add_region(r, t);
+
+		nr_pieces = (regions[i].end - regions[i].start) / sz;
+		damon_va_evenly_split_region(t, r, nr_pieces);
+	}
+}
+
+/* Initialize '->regions_list' of every target (task) */
+void damon_va_init(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+
+	damon_for_each_target(t, ctx) {
+		/* the user may set the target regions as they want */
+		if (!damon_nr_regions(t))
+			__damon_va_init_regions(ctx, t);
+	}
+}
+
+/*
+ * Functions for the dynamic monitoring target regions update
+ */
+
+/*
+ * Check whether a region is intersecting an address range
+ *
+ * Returns true if it is.
+ */
+static bool damon_intersect(struct damon_region *r, struct damon_addr_range *re)
+{
+	return !(r->ar.end <= re->start || re->end <= r->ar.start);
+}
+
+/*
+ * Update damon regions for the three big regions of the given target
+ *
+ * t		the given target
+ * bregions	the three big regions of the target
+ */
+static void damon_va_apply_three_regions(struct damon_target *t,
+		struct damon_addr_range bregions[3])
+{
+	struct damon_region *r, *next;
+	unsigned int i = 0;
+
+	/* Remove regions which are not in the three big regions now */
+	damon_for_each_region_safe(r, next, t) {
+		for (i = 0; i < 3; i++) {
+			if (damon_intersect(r, &bregions[i]))
+				break;
+		}
+		if (i == 3)
+			damon_destroy_region(r, t);
+	}
+
+	/* Adjust intersecting regions to fit with the three big regions */
+	for (i = 0; i < 3; i++) {
+		struct damon_region *first = NULL, *last;
+		struct damon_region *newr;
+		struct damon_addr_range *br;
+
+		br = &bregions[i];
+		/* Get the first and last regions which intersects with br */
+		damon_for_each_region(r, t) {
+			if (damon_intersect(r, br)) {
+				if (!first)
+					first = r;
+				last = r;
+			}
+			if (r->ar.start >= br->end)
+				break;
+		}
+		if (!first) {
+			/* no damon_region intersects with this big region */
+			newr = damon_new_region(
+					ALIGN_DOWN(br->start,
+						DAMON_MIN_REGION),
+					ALIGN(br->end, DAMON_MIN_REGION));
+			if (!newr)
+				continue;
+			damon_insert_region(newr, damon_prev_region(r), r, t);
+		} else {
+			first->ar.start = ALIGN_DOWN(br->start,
+					DAMON_MIN_REGION);
+			last->ar.end = ALIGN(br->end, DAMON_MIN_REGION);
+		}
+	}
+}
+
+/*
+ * Update regions for current memory mappings
+ */
+void damon_va_update(struct damon_ctx *ctx)
+{
+	struct damon_addr_range three_regions[3];
+	struct damon_target *t;
+
+	damon_for_each_target(t, ctx) {
+		if (damon_va_three_regions(t, three_regions))
+			continue;
+		damon_va_apply_three_regions(t, three_regions);
+	}
+}
+
+/*
+ * Get an online page for a pfn if it's in the LRU list.  Otherwise, returns
+ * NULL.
+ *
+ * The body of this function is stolen from 'page_idle_get_page()'.  We
+ * steal rather than reuse it because the code is quite simple.
+ */
+static struct page *damon_get_page(unsigned long pfn)
+{
+	struct page *page = pfn_to_online_page(pfn);
+
+	if (!page || !PageLRU(page) || !get_page_unless_zero(page))
+		return NULL;
+
+	if (unlikely(!PageLRU(page))) {
+		put_page(page);
+		page = NULL;
+	}
+	return page;
+}
+
+static void damon_ptep_mkold(pte_t *pte, struct mm_struct *mm,
+			     unsigned long addr)
+{
+	bool referenced = false;
+	struct page *page = damon_get_page(pte_pfn(*pte));
+
+	if (!page)
+		return;
+
+	if (pte_young(*pte)) {
+		referenced = true;
+		*pte = pte_mkold(*pte);
+	}
+
+#ifdef CONFIG_MMU_NOTIFIER
+	if (mmu_notifier_clear_young(mm, addr, addr + PAGE_SIZE))
+		referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+	if (referenced)
+		set_page_young(page);
+
+	set_page_idle(page);
+	put_page(page);
+}
+
+static void damon_pmdp_mkold(pmd_t *pmd, struct mm_struct *mm,
+			     unsigned long addr)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	bool referenced = false;
+	struct page *page = damon_get_page(pmd_pfn(*pmd));
+
+	if (!page)
+		return;
+
+	if (pmd_young(*pmd)) {
+		referenced = true;
+		*pmd = pmd_mkold(*pmd);
+	}
+
+#ifdef CONFIG_MMU_NOTIFIER
+	if (mmu_notifier_clear_young(mm, addr,
+				addr + ((1UL) << HPAGE_PMD_SHIFT)))
+		referenced = true;
+#endif /* CONFIG_MMU_NOTIFIER */
+
+	if (referenced)
+		set_page_young(page);
+
+	set_page_idle(page);
+	put_page(page);
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+}
+
+static int damon_mkold_pmd_entry(pmd_t *pmd, unsigned long addr,
+		unsigned long next, struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	if (pmd_huge(*pmd)) {
+		ptl = pmd_lock(walk->mm, pmd);
+		if (pmd_huge(*pmd)) {
+			damon_pmdp_mkold(pmd, walk->mm, addr);
+			spin_unlock(ptl);
+			return 0;
+		}
+		spin_unlock(ptl);
+	}
+
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		return 0;
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte_present(*pte))
+		goto out;
+	damon_ptep_mkold(pte, walk->mm, addr);
+out:
+	pte_unmap_unlock(pte, ptl);
+	return 0;
+}
+
+static struct mm_walk_ops damon_mkold_ops = {
+	.pmd_entry = damon_mkold_pmd_entry,
+};
+
+static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
+{
+	mmap_read_lock(mm);
+	walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
+	mmap_read_unlock(mm);
+}
+
+/*
+ * Functions for the access checking of the regions
+ */
+
+static void damon_va_prepare_access_check(struct damon_ctx *ctx,
+			struct mm_struct *mm, struct damon_region *r)
+{
+	r->sampling_addr = damon_rand(r->ar.start, r->ar.end);
+
+	damon_va_mkold(mm, r->sampling_addr);
+}
+
+void damon_va_prepare_access_checks(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct mm_struct *mm;
+	struct damon_region *r;
+
+	damon_for_each_target(t, ctx) {
+		mm = damon_get_mm(t);
+		if (!mm)
+			continue;
+		damon_for_each_region(r, t)
+			damon_va_prepare_access_check(ctx, mm, r);
+		mmput(mm);
+	}
+}
+
+struct damon_young_walk_private {
+	unsigned long *page_sz;
+	bool young;
+};
+
+static int damon_young_pmd_entry(pmd_t *pmd, unsigned long addr,
+		unsigned long next, struct mm_walk *walk)
+{
+	pte_t *pte;
+	spinlock_t *ptl;
+	struct page *page;
+	struct damon_young_walk_private *priv = walk->private;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (pmd_huge(*pmd)) {
+		ptl = pmd_lock(walk->mm, pmd);
+		if (!pmd_huge(*pmd)) {
+			spin_unlock(ptl);
+			goto regular_page;
+		}
+		page = damon_get_page(pmd_pfn(*pmd));
+		if (!page)
+			goto huge_out;
+		if (pmd_young(*pmd) || !page_is_idle(page) ||
+					mmu_notifier_test_young(walk->mm,
+						addr)) {
+			*priv->page_sz = ((1UL) << HPAGE_PMD_SHIFT);
+			priv->young = true;
+		}
+		put_page(page);
+huge_out:
+		spin_unlock(ptl);
+		return 0;
+	}
+
+regular_page:
+#endif	/* CONFIG_TRANSPARENT_HUGEPAGE */
+
+	if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
+		return -EINVAL;
+	pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+	if (!pte_present(*pte))
+		goto out;
+	page = damon_get_page(pte_pfn(*pte));
+	if (!page)
+		goto out;
+	if (pte_young(*pte) || !page_is_idle(page) ||
+			mmu_notifier_test_young(walk->mm, addr)) {
+		*priv->page_sz = PAGE_SIZE;
+		priv->young = true;
+	}
+	put_page(page);
+out:
+	pte_unmap_unlock(pte, ptl);
+	return 0;
+}
+
+static struct mm_walk_ops damon_young_ops = {
+	.pmd_entry = damon_young_pmd_entry,
+};
+
+static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
+		unsigned long *page_sz)
+{
+	struct damon_young_walk_private arg = {
+		.page_sz = page_sz,
+		.young = false,
+	};
+
+	mmap_read_lock(mm);
+	walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
+	mmap_read_unlock(mm);
+	return arg.young;
+}
+
+/*
+ * Check whether the region was accessed after the last preparation
+ *
+ * mm	'mm_struct' for the given virtual address space
+ * r	the region to be checked
+ */
+static void damon_va_check_access(struct damon_ctx *ctx,
+			       struct mm_struct *mm, struct damon_region *r)
+{
+	static struct mm_struct *last_mm;
+	static unsigned long last_addr;
+	static unsigned long last_page_sz = PAGE_SIZE;
+	static bool last_accessed;
+
+	/* If the region is in the last checked page, reuse the result */
+	if (mm == last_mm && (ALIGN_DOWN(last_addr, last_page_sz) ==
+				ALIGN_DOWN(r->sampling_addr, last_page_sz))) {
+		if (last_accessed)
+			r->nr_accesses++;
+		return;
+	}
+
+	last_accessed = damon_va_young(mm, r->sampling_addr, &last_page_sz);
+	if (last_accessed)
+		r->nr_accesses++;
+
+	last_mm = mm;
+	last_addr = r->sampling_addr;
+}
+
+unsigned int damon_va_check_accesses(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	struct mm_struct *mm;
+	struct damon_region *r;
+	unsigned int max_nr_accesses = 0;
+
+	damon_for_each_target(t, ctx) {
+		mm = damon_get_mm(t);
+		if (!mm)
+			continue;
+		damon_for_each_region(r, t) {
+			damon_va_check_access(ctx, mm, r);
+			max_nr_accesses = max(r->nr_accesses, max_nr_accesses);
+		}
+		mmput(mm);
+	}
+
+	return max_nr_accesses;
+}
+
+/*
+ * Functions for the target validity check and cleanup
+ */
+
+bool damon_va_target_valid(void *target)
+{
+	struct damon_target *t = target;
+	struct task_struct *task;
+
+	task = damon_get_task_struct(t);
+	if (task) {
+		put_task_struct(task);
+		return true;
+	}
+
+	return false;
+}
+
+void damon_va_set_primitives(struct damon_ctx *ctx)
+{
+	ctx->primitive.init = damon_va_init;
+	ctx->primitive.update = damon_va_update;
+	ctx->primitive.prepare_access_checks = damon_va_prepare_access_checks;
+	ctx->primitive.check_accesses = damon_va_check_accesses;
+	ctx->primitive.reset_aggregated = NULL;
+	ctx->primitive.target_valid = damon_va_target_valid;
+	ctx->primitive.cleanup = NULL;
+}
_


* [patch 068/147] mm/damon: add a tracepoint
  2021-09-08  2:52 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2021-09-08  2:56 ` [patch 067/147] mm/damon: implement primitives for the virtual memory address spaces Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 069/147] mm/damon: implement a debugfs-based user space interface Andrew Morton
                   ` (79 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: add a tracepoint

This commit adds a tracepoint for DAMON.  It traces the monitoring results
of each region for each aggregation interval.  Using this, DAMON can be
easily integrated with tracepoint-supporting tools such as perf.
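
For example, the aggregated results could be recorded and inspected with
commands like below (illustrative; assumes a kernel running DAMON)::

    # perf record -e damon:damon_aggregated -a sleep 10
    # perf script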

Link: https://lkml.kernel.org/r/20210716081449.22187-7-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/trace/events/damon.h |   43 +++++++++++++++++++++++++++++++++
 mm/damon/core.c              |    7 ++++-
 2 files changed, 49 insertions(+), 1 deletion(-)

--- /dev/null
+++ a/include/trace/events/damon.h
@@ -0,0 +1,43 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM damon
+
+#if !defined(_TRACE_DAMON_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_DAMON_H
+
+#include <linux/damon.h>
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+
+TRACE_EVENT(damon_aggregated,
+
+	TP_PROTO(struct damon_target *t, struct damon_region *r,
+		unsigned int nr_regions),
+
+	TP_ARGS(t, r, nr_regions),
+
+	TP_STRUCT__entry(
+		__field(unsigned long, target_id)
+		__field(unsigned int, nr_regions)
+		__field(unsigned long, start)
+		__field(unsigned long, end)
+		__field(unsigned int, nr_accesses)
+	),
+
+	TP_fast_assign(
+		__entry->target_id = t->id;
+		__entry->nr_regions = nr_regions;
+		__entry->start = r->ar.start;
+		__entry->end = r->ar.end;
+		__entry->nr_accesses = r->nr_accesses;
+	),
+
+	TP_printk("target_id=%lu nr_regions=%u %lu-%lu: %u",
+			__entry->target_id, __entry->nr_regions,
+			__entry->start, __entry->end, __entry->nr_accesses)
+);
+
+#endif /* _TRACE_DAMON_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
--- a/mm/damon/core.c~mm-damon-add-a-tracepoint
+++ a/mm/damon/core.c
@@ -13,6 +13,9 @@
 #include <linux/random.h>
 #include <linux/slab.h>
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/damon.h>
+
 /* Get a random number in [l, r) */
 #define damon_rand(l, r) ((l) + prandom_u32_max((r) - (l)))
 
@@ -387,8 +390,10 @@ static void kdamond_reset_aggregated(str
 	damon_for_each_target(t, c) {
 		struct damon_region *r;
 
-		damon_for_each_region(r, t)
+		damon_for_each_region(r, t) {
+			trace_damon_aggregated(t, r, damon_nr_regions(t));
 			r->nr_accesses = 0;
+		}
 	}
 }
 
_


* [patch 069/147] mm/damon: implement a debugfs-based user space interface
  2021-09-08  2:52 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2021-09-08  2:56 ` [patch 068/147] mm/damon: add a tracepoint Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:56 ` [patch 070/147] mm/damon/dbgfs: export kdamond pid to the user space Andrew Morton
                   ` (78 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: implement a debugfs-based user space interface

DAMON is designed to be used by kernel space code such as the memory
management subsystems, and therefore it provides only a kernel space API.
That said, letting user space control DAMON could provide some benefits.
For example, it will allow user space to analyze their specific workloads
and make their own special optimizations.

For such cases, this commit implements a simple DAMON application kernel
module, namely 'damon-dbgfs', which merely wraps the DAMON API and exports
it to user space via debugfs.

'damon-dbgfs' exports three files, ``attrs``, ``target_ids``, and
``monitor_on`` under its debugfs directory, ``<debugfs>/damon/``.

Attributes
----------

Users can read and write the ``sampling interval``, ``aggregation
interval``, ``regions update interval``, and min/max number of monitoring
target regions by reading from and writing to the ``attrs`` file.  For
example, the commands below set those values to 5 ms, 100 ms, 1,000 ms, 10,
and 1,000, and then check them again::

    # cd <debugfs>/damon
    # echo 5000 100000 1000000 10 1000 > attrs
    # cat attrs
    5000 100000 1000000 10 1000

Target IDs
----------

Some types of address spaces support multiple monitoring targets.  For
example, virtual memory address space monitoring can have multiple
processes as the monitoring targets.  Users can set the targets by writing
the relevant id values of the targets to the ``target_ids`` file, and get
the ids of the current targets by reading from it.  In case of virtual
address space monitoring, the values should be the pids of the monitoring
target processes.  For example, the commands below set the processes with
pids 42 and 4242 as the monitoring targets and check it again::

    # cd <debugfs>/damon
    # echo 42 4242 > target_ids
    # cat target_ids
    42 4242

Note that setting the target ids doesn't start the monitoring.

Turning On/Off
--------------

Writing the files as described above has no effect unless you explicitly
start the monitoring.  You can start, stop, and check the current status of
the monitoring by writing to and reading from the ``monitor_on`` file.
Writing ``on`` to the file starts the monitoring of the targets with the
attributes.  Writing ``off`` to the file stops the monitoring.  DAMON also
stops if every target is invalidated (in case of virtual memory monitoring,
target processes are invalidated when they terminate).  The example
commands below turn DAMON on and off, and check its status::

    # cd <debugfs>/damon
    # echo on > monitor_on
    # echo off > monitor_on
    # cat monitor_on
    off

Please note that you cannot write to the above-mentioned debugfs files
while the monitoring is turned on.  If you write to the files while DAMON
is running, an error code such as ``-EBUSY`` will be returned.
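
For illustration, a minimal user space sketch in C that drives these files
is shown below.  It is not part of this patch; it assumes debugfs is
mounted at /sys/kernel/debug, a kernel built with CONFIG_DAMON_DBGFS=y, and
uses pid 42 as a placeholder target::

    /*
     * Hypothetical example (not from this patch): set DAMON attributes
     * and a target, then start monitoring via the damon-dbgfs files.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_file(const char *path, const char *buf)
    {
            int fd = open(path, O_WRONLY);

            if (fd < 0)
                    return -1;
            if (write(fd, buf, strlen(buf)) < 0) {
                    close(fd);      /* e.g. EBUSY while monitoring is on */
                    return -1;
            }
            return close(fd);
    }

    int main(void)
    {
            /* sampling/aggregation/update intervals (us), min/max regions */
            if (write_file("/sys/kernel/debug/damon/attrs",
                           "5000 100000 1000000 10 1000\n"))
                    perror("attrs");
            if (write_file("/sys/kernel/debug/damon/target_ids", "42\n"))
                    perror("target_ids");
            if (write_file("/sys/kernel/debug/damon/monitor_on", "on\n"))
                    perror("monitor_on");
            return 0;
    }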

[akpm@linux-foundation.org: remove unneeded "alloc failed" printks]
[akpm@linux-foundation.org: replace macro with static inline]
Link: https://lkml.kernel.org/r/20210716081449.22187-8-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Leonard Foerster <foersleo@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/damon.h |    3 
 mm/damon/Kconfig      |    9 
 mm/damon/Makefile     |    1 
 mm/damon/core.c       |   47 ++++
 mm/damon/dbgfs.c      |  397 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 457 insertions(+)

--- a/include/linux/damon.h~mm-damon-implement-a-debugfs-based-user-space-interface
+++ a/include/linux/damon.h
@@ -240,9 +240,12 @@ unsigned int damon_nr_regions(struct dam
 
 struct damon_ctx *damon_new_ctx(void);
 void damon_destroy_ctx(struct damon_ctx *ctx);
+int damon_set_targets(struct damon_ctx *ctx,
+		unsigned long *ids, ssize_t nr_ids);
 int damon_set_attrs(struct damon_ctx *ctx, unsigned long sample_int,
 		unsigned long aggr_int, unsigned long primitive_upd_int,
 		unsigned long min_nr_reg, unsigned long max_nr_reg);
+int damon_nr_running_ctxs(void);
 
 int damon_start(struct damon_ctx **ctxs, int nr_ctxs);
 int damon_stop(struct damon_ctx **ctxs, int nr_ctxs);
--- a/mm/damon/core.c~mm-damon-implement-a-debugfs-based-user-space-interface
+++ a/mm/damon/core.c
@@ -172,6 +172,39 @@ void damon_destroy_ctx(struct damon_ctx
 }
 
 /**
+ * damon_set_targets() - Set monitoring targets.
+ * @ctx:	monitoring context
+ * @ids:	array of target ids
+ * @nr_ids:	number of entries in @ids
+ *
+ * This function should not be called while the kdamond is running.
+ *
+ * Return: 0 on success, negative error code otherwise.
+ */
+int damon_set_targets(struct damon_ctx *ctx,
+		      unsigned long *ids, ssize_t nr_ids)
+{
+	ssize_t i;
+	struct damon_target *t, *next;
+
+	damon_destroy_targets(ctx);
+
+	for (i = 0; i < nr_ids; i++) {
+		t = damon_new_target(ids[i]);
+		if (!t) {
+			pr_err("Failed to alloc damon_target\n");
+			/* The caller should do cleanup of the ids itself */
+			damon_for_each_target_safe(t, next, ctx)
+				damon_destroy_target(t);
+			return -ENOMEM;
+		}
+		damon_add_target(ctx, t);
+	}
+
+	return 0;
+}
+
+/**
  * damon_set_attrs() - Set attributes for the monitoring.
  * @ctx:		monitoring context
  * @sample_int:		time interval between samplings
@@ -209,6 +242,20 @@ int damon_set_attrs(struct damon_ctx *ct
 	return 0;
 }
 
+/**
+ * damon_nr_running_ctxs() - Return number of currently running contexts.
+ */
+int damon_nr_running_ctxs(void)
+{
+	int nr_ctxs;
+
+	mutex_lock(&damon_lock);
+	nr_ctxs = nr_running_ctxs;
+	mutex_unlock(&damon_lock);
+
+	return nr_ctxs;
+}
+
 /* Returns the size upper limit for each monitoring region */
 static unsigned long damon_region_sz_limit(struct damon_ctx *ctx)
 {
--- /dev/null
+++ a/mm/damon/dbgfs.c
@@ -0,0 +1,397 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * DAMON Debugfs Interface
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#define pr_fmt(fmt) "damon-dbgfs: " fmt
+
+#include <linux/damon.h>
+#include <linux/debugfs.h>
+#include <linux/file.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/page_idle.h>
+#include <linux/slab.h>
+
+static struct damon_ctx **dbgfs_ctxs;
+static int dbgfs_nr_ctxs;
+static struct dentry **dbgfs_dirs;
+
+/*
+ * Returns a kmalloc-ed copy of the user input on success, an error pointer
+ * otherwise.
+ */
+static char *user_input_str(const char __user *buf, size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	ssize_t ret;
+
+	/* We do not accept continuous write */
+	if (*ppos)
+		return ERR_PTR(-EINVAL);
+
+	kbuf = kmalloc(count + 1, GFP_KERNEL);
+	if (!kbuf)
+		return ERR_PTR(-ENOMEM);
+
+	ret = simple_write_to_buffer(kbuf, count + 1, ppos, buf, count);
+	if (ret != count) {
+		kfree(kbuf);
+		return ERR_PTR(-EIO);
+	}
+	kbuf[ret] = '\0';
+
+	return kbuf;
+}
+
+static ssize_t dbgfs_attrs_read(struct file *file,
+		char __user *buf, size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char kbuf[128];
+	int ret;
+
+	mutex_lock(&ctx->kdamond_lock);
+	ret = scnprintf(kbuf, ARRAY_SIZE(kbuf), "%lu %lu %lu %lu %lu\n",
+			ctx->sample_interval, ctx->aggr_interval,
+			ctx->primitive_update_interval, ctx->min_nr_regions,
+			ctx->max_nr_regions);
+	mutex_unlock(&ctx->kdamond_lock);
+
+	return simple_read_from_buffer(buf, count, ppos, kbuf, ret);
+}
+
+static ssize_t dbgfs_attrs_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	unsigned long s, a, r, minr, maxr;
+	char *kbuf;
+	ssize_t ret = count;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	if (sscanf(kbuf, "%lu %lu %lu %lu %lu",
+				&s, &a, &r, &minr, &maxr) != 5) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		ret = -EBUSY;
+		goto unlock_out;
+	}
+
+	err = damon_set_attrs(ctx, s, a, r, minr, maxr);
+	if (err)
+		ret = err;
+unlock_out:
+	mutex_unlock(&ctx->kdamond_lock);
+out:
+	kfree(kbuf);
+	return ret;
+}
+
+static inline bool targetid_is_pid(const struct damon_ctx *ctx)
+{
+	return ctx->primitive.target_valid == damon_va_target_valid;
+}
+
+static ssize_t sprint_target_ids(struct damon_ctx *ctx, char *buf, ssize_t len)
+{
+	struct damon_target *t;
+	unsigned long id;
+	int written = 0;
+	int rc;
+
+	damon_for_each_target(t, ctx) {
+		id = t->id;
+		if (targetid_is_pid(ctx))
+			/* Show pid numbers to debugfs users */
+			id = (unsigned long)pid_vnr((struct pid *)id);
+
+		rc = scnprintf(&buf[written], len - written, "%lu ", id);
+		if (!rc)
+			return -ENOMEM;
+		written += rc;
+	}
+	if (written)
+		written -= 1;
+	written += scnprintf(&buf[written], len - written, "\n");
+	return written;
+}
+
+static ssize_t dbgfs_target_ids_read(struct file *file,
+		char __user *buf, size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	ssize_t len;
+	char ids_buf[320];
+
+	mutex_lock(&ctx->kdamond_lock);
+	len = sprint_target_ids(ctx, ids_buf, 320);
+	mutex_unlock(&ctx->kdamond_lock);
+	if (len < 0)
+		return len;
+
+	return simple_read_from_buffer(buf, count, ppos, ids_buf, len);
+}
+
+/*
+ * Converts a string into an array of unsigned long integers
+ *
+ * Returns an array of unsigned long integers if the conversion succeeds, or
+ * NULL otherwise.
+ */
+static unsigned long *str_to_target_ids(const char *str, ssize_t len,
+					ssize_t *nr_ids)
+{
+	unsigned long *ids;
+	const int max_nr_ids = 32;
+	unsigned long id;
+	int pos = 0, parsed, ret;
+
+	*nr_ids = 0;
+	ids = kmalloc_array(max_nr_ids, sizeof(id), GFP_KERNEL);
+	if (!ids)
+		return NULL;
+	while (*nr_ids < max_nr_ids && pos < len) {
+		ret = sscanf(&str[pos], "%lu%n", &id, &parsed);
+		pos += parsed;
+		if (ret != 1)
+			break;
+		ids[*nr_ids] = id;
+		*nr_ids += 1;
+	}
+
+	return ids;
+}
+
+static void dbgfs_put_pids(unsigned long *ids, int nr_ids)
+{
+	int i;
+
+	for (i = 0; i < nr_ids; i++)
+		put_pid((struct pid *)ids[i]);
+}
+
+static ssize_t dbgfs_target_ids_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf, *nrs;
+	unsigned long *targets;
+	ssize_t nr_targets;
+	ssize_t ret = count;
+	int i;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	nrs = kbuf;
+
+	targets = str_to_target_ids(nrs, ret, &nr_targets);
+	if (!targets) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	if (targetid_is_pid(ctx)) {
+		for (i = 0; i < nr_targets; i++) {
+			targets[i] = (unsigned long)find_get_pid(
+					(int)targets[i]);
+			if (!targets[i]) {
+				dbgfs_put_pids(targets, i);
+				ret = -EINVAL;
+				goto free_targets_out;
+			}
+		}
+	}
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond) {
+		if (targetid_is_pid(ctx))
+			dbgfs_put_pids(targets, nr_targets);
+		ret = -EBUSY;
+		goto unlock_out;
+	}
+
+	err = damon_set_targets(ctx, targets, nr_targets);
+	if (err) {
+		if (targetid_is_pid(ctx))
+			dbgfs_put_pids(targets, nr_targets);
+		ret = err;
+	}
+
+unlock_out:
+	mutex_unlock(&ctx->kdamond_lock);
+free_targets_out:
+	kfree(targets);
+out:
+	kfree(kbuf);
+	return ret;
+}
+
+static int damon_dbgfs_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+
+	return nonseekable_open(inode, file);
+}
+
+static const struct file_operations attrs_fops = {
+	.open = damon_dbgfs_open,
+	.read = dbgfs_attrs_read,
+	.write = dbgfs_attrs_write,
+};
+
+static const struct file_operations target_ids_fops = {
+	.open = damon_dbgfs_open,
+	.read = dbgfs_target_ids_read,
+	.write = dbgfs_target_ids_write,
+};
+
+static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx)
+{
+	const char * const file_names[] = {"attrs", "target_ids"};
+	const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops};
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(file_names); i++)
+		debugfs_create_file(file_names[i], 0600, dir, ctx, fops[i]);
+}
+
+static int dbgfs_before_terminate(struct damon_ctx *ctx)
+{
+	struct damon_target *t, *next;
+
+	if (!targetid_is_pid(ctx))
+		return 0;
+
+	damon_for_each_target_safe(t, next, ctx) {
+		put_pid((struct pid *)t->id);
+		damon_destroy_target(t);
+	}
+	return 0;
+}
+
+static struct damon_ctx *dbgfs_new_ctx(void)
+{
+	struct damon_ctx *ctx;
+
+	ctx = damon_new_ctx();
+	if (!ctx)
+		return NULL;
+
+	damon_va_set_primitives(ctx);
+	ctx->callback.before_terminate = dbgfs_before_terminate;
+	return ctx;
+}
+
+static ssize_t dbgfs_monitor_on_read(struct file *file,
+		char __user *buf, size_t count, loff_t *ppos)
+{
+	char monitor_on_buf[5];
+	bool monitor_on = damon_nr_running_ctxs() != 0;
+	int len;
+
+	len = scnprintf(monitor_on_buf, 5, monitor_on ? "on\n" : "off\n");
+
+	return simple_read_from_buffer(buf, count, ppos, monitor_on_buf, len);
+}
+
+static ssize_t dbgfs_monitor_on_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	ssize_t ret = count;
+	char *kbuf;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+
+	/* Remove white space */
+	if (sscanf(kbuf, "%s", kbuf) != 1) {
+		kfree(kbuf);
+		return -EINVAL;
+	}
+
+	if (!strncmp(kbuf, "on", count))
+		err = damon_start(dbgfs_ctxs, dbgfs_nr_ctxs);
+	else if (!strncmp(kbuf, "off", count))
+		err = damon_stop(dbgfs_ctxs, dbgfs_nr_ctxs);
+	else
+		err = -EINVAL;
+
+	if (err)
+		ret = err;
+	kfree(kbuf);
+	return ret;
+}
+
+static const struct file_operations monitor_on_fops = {
+	.read = dbgfs_monitor_on_read,
+	.write = dbgfs_monitor_on_write,
+};
+
+static int __init __damon_dbgfs_init(void)
+{
+	struct dentry *dbgfs_root;
+	const char * const file_names[] = {"monitor_on"};
+	const struct file_operations *fops[] = {&monitor_on_fops};
+	int i;
+
+	dbgfs_root = debugfs_create_dir("damon", NULL);
+
+	for (i = 0; i < ARRAY_SIZE(file_names); i++)
+		debugfs_create_file(file_names[i], 0600, dbgfs_root, NULL,
+				fops[i]);
+	dbgfs_fill_ctx_dir(dbgfs_root, dbgfs_ctxs[0]);
+
+	dbgfs_dirs = kmalloc_array(1, sizeof(dbgfs_root), GFP_KERNEL);
+	if (!dbgfs_dirs) {
+		debugfs_remove(dbgfs_root);
+		return -ENOMEM;
+	}
+	dbgfs_dirs[0] = dbgfs_root;
+
+	return 0;
+}
+
+/*
+ * Functions for the initialization
+ */
+
+static int __init damon_dbgfs_init(void)
+{
+	int rc;
+
+	dbgfs_ctxs = kmalloc(sizeof(*dbgfs_ctxs), GFP_KERNEL);
+	if (!dbgfs_ctxs)
+		return -ENOMEM;
+	dbgfs_ctxs[0] = dbgfs_new_ctx();
+	if (!dbgfs_ctxs[0]) {
+		kfree(dbgfs_ctxs);
+		return -ENOMEM;
+	}
+	dbgfs_nr_ctxs = 1;
+
+	rc = __damon_dbgfs_init();
+	if (rc) {
+		kfree(dbgfs_ctxs[0]);
+		kfree(dbgfs_ctxs);
+		pr_err("%s: dbgfs init failed\n", __func__);
+	}
+
+	return rc;
+}
+
+module_init(damon_dbgfs_init);
--- a/mm/damon/Kconfig~mm-damon-implement-a-debugfs-based-user-space-interface
+++ a/mm/damon/Kconfig
@@ -20,4 +20,13 @@ config DAMON_VADDR
 	  This builds the default data access monitoring primitives for DAMON
 	  that works for virtual address spaces.
 
+config DAMON_DBGFS
+	bool "DAMON debugfs interface"
+	depends on DAMON_VADDR && DEBUG_FS
+	help
+	  This builds the debugfs interface for DAMON.  The user space admins
+	  can use the interface for arbitrary data access monitoring.
+
+	  If unsure, say N.
+
 endmenu
--- a/mm/damon/Makefile~mm-damon-implement-a-debugfs-based-user-space-interface
+++ a/mm/damon/Makefile
@@ -2,3 +2,4 @@
 
 obj-$(CONFIG_DAMON)		:= core.o
 obj-$(CONFIG_DAMON_VADDR)	+= vaddr.o
+obj-$(CONFIG_DAMON_DBGFS)	+= dbgfs.o
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 070/147] mm/damon/dbgfs: export kdamond pid to the user space
  2021-09-08  2:52 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2021-09-08  2:56 ` [patch 069/147] mm/damon: implement a debugfs-based user space interface Andrew Morton
@ 2021-09-08  2:56 ` Andrew Morton
  2021-09-08  2:57 ` [patch 071/147] mm/damon/dbgfs: support multiple contexts Andrew Morton
                   ` (77 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:56 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon/dbgfs: export kdamond pid to the user space

For CPU usage accounting, knowing the pid of the monitoring thread could be
helpful.  For example, users could use the cpuacct cgroup with the pid.

This commit therefore exports the pid of the currently running monitoring
thread to user space via the 'kdamond_pid' file in the debugfs directory.
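
For illustration, the file could be read from C as in the sketch below (a
hypothetical example, not part of this patch; it assumes debugfs is mounted
at /sys/kernel/debug)::

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            char buf[32];
            ssize_t len;
            int fd = open("/sys/kernel/debug/damon/kdamond_pid", O_RDONLY);

            if (fd < 0) {
                    perror("open");
                    return 1;
            }
            len = read(fd, buf, sizeof(buf) - 1);
            close(fd);
            if (len <= 0)
                    return 1;
            buf[len] = '\0';
            /* prints the kdamond's pid, or "none" if it is not running */
            printf("kdamond pid: %s", buf);
            return 0;
    }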

Link: https://lkml.kernel.org/r/20210716081449.22187-9-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |   37 +++++++++++++++++++++++++++++++++++--
 1 file changed, 35 insertions(+), 2 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-export-kdamond-pid-to-the-user-space
+++ a/mm/damon/dbgfs.c
@@ -239,6 +239,32 @@ out:
 	return ret;
 }
 
+static ssize_t dbgfs_kdamond_pid_read(struct file *file,
+		char __user *buf, size_t count, loff_t *ppos)
+{
+	struct damon_ctx *ctx = file->private_data;
+	char *kbuf;
+	ssize_t len;
+
+	kbuf = kmalloc(count, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	mutex_lock(&ctx->kdamond_lock);
+	if (ctx->kdamond)
+		len = scnprintf(kbuf, count, "%d\n", ctx->kdamond->pid);
+	else
+		len = scnprintf(kbuf, count, "none\n");
+	mutex_unlock(&ctx->kdamond_lock);
+	if (!len)
+		goto out;
+	len = simple_read_from_buffer(buf, count, ppos, kbuf, len);
+
+out:
+	kfree(kbuf);
+	return len;
+}
+
 static int damon_dbgfs_open(struct inode *inode, struct file *file)
 {
 	file->private_data = inode->i_private;
@@ -258,10 +284,17 @@ static const struct file_operations targ
 	.write = dbgfs_target_ids_write,
 };
 
+static const struct file_operations kdamond_pid_fops = {
+	.open = damon_dbgfs_open,
+	.read = dbgfs_kdamond_pid_read,
+};
+
 static void dbgfs_fill_ctx_dir(struct dentry *dir, struct damon_ctx *ctx)
 {
-	const char * const file_names[] = {"attrs", "target_ids"};
-	const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops};
+	const char * const file_names[] = {"attrs", "target_ids",
+		"kdamond_pid"};
+	const struct file_operations *fops[] = {&attrs_fops, &target_ids_fops,
+		&kdamond_pid_fops};
 	int i;
 
 	for (i = 0; i < ARRAY_SIZE(file_names); i++)
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 071/147] mm/damon/dbgfs: support multiple contexts
  2021-09-08  2:52 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2021-09-08  2:56 ` [patch 070/147] mm/damon/dbgfs: export kdamond pid to the user space Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 072/147] Documentation: add documents for DAMON Andrew Morton
                   ` (76 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon/dbgfs: support multiple contexts

In some use cases, users would want to run multiple monitoring contexts.
For example, because DAMON creates one monitoring thread per context, a
user who wants high precision monitoring and can afford to dedicate
multiple CPUs to the job can split the monitoring target regions into
multiple small regions and create one context for each region.  Or,
someone might want to simultaneously monitor different address spaces,
e.g., both the virtual address space and the physical address space.

DAMON's API allows such usage, but 'damon-dbgfs' does not.  Therefore,
only kernel space DAMON users can do multiple-context monitoring.

This commit allows user space DAMON users to use multiple monitoring
contexts by introducing two new 'damon-dbgfs' debugfs files, 'mk_contexts'
and 'rm_contexts'.  Users can create a new monitoring context by writing
the desired name of the new context to 'mk_contexts'.  Then, a new
directory with that name, containing the files for setting up the context
('attrs', 'target_ids' and 'kdamond_pid'), will be created under the
debugfs directory.  Writing the name of a context to 'rm_contexts' will
remove the related context and directory.
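
For illustration, the sketch below creates and then removes a context named
'ctx1' from C (a hypothetical example, not part of this patch; it assumes
debugfs is mounted at /sys/kernel/debug and no kdamond is running)::

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static int write_once(const char *path, const char *s)
    {
            int fd = open(path, O_WRONLY);
            int ret;

            if (fd < 0)
                    return -1;
            ret = write(fd, s, strlen(s)) < 0 ? -1 : 0;
            close(fd);
            return ret;
    }

    int main(void)
    {
            /* fails with EBUSY while any monitoring context is running */
            if (write_once("/sys/kernel/debug/damon/mk_contexts", "ctx1"))
                    perror("mk_contexts");
            /* <debugfs>/damon/ctx1/{attrs,target_ids,...} now exist */
            if (write_once("/sys/kernel/debug/damon/rm_contexts", "ctx1"))
                    perror("rm_contexts");
            return 0;
    }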

Link: https://lkml.kernel.org/r/20210716081449.22187-10-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/dbgfs.c |  195 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 193 insertions(+), 2 deletions(-)

--- a/mm/damon/dbgfs.c~mm-damon-dbgfs-support-multiple-contexts
+++ a/mm/damon/dbgfs.c
@@ -18,6 +18,7 @@
 static struct damon_ctx **dbgfs_ctxs;
 static int dbgfs_nr_ctxs;
 static struct dentry **dbgfs_dirs;
+static DEFINE_MUTEX(damon_dbgfs_lock);
 
 /*
  * Returns non-empty string on success, negative error code otherwise.
@@ -328,6 +329,186 @@ static struct damon_ctx *dbgfs_new_ctx(v
 	return ctx;
 }
 
+static void dbgfs_destroy_ctx(struct damon_ctx *ctx)
+{
+	damon_destroy_ctx(ctx);
+}
+
+/*
+ * Make a context of @name and create a debugfs directory for it.
+ *
+ * This function should be called while holding damon_dbgfs_lock.
+ *
+ * Returns 0 on success, negative error code otherwise.
+ */
+static int dbgfs_mk_context(char *name)
+{
+	struct dentry *root, **new_dirs, *new_dir;
+	struct damon_ctx **new_ctxs, *new_ctx;
+
+	if (damon_nr_running_ctxs())
+		return -EBUSY;
+
+	new_ctxs = krealloc(dbgfs_ctxs, sizeof(*dbgfs_ctxs) *
+			(dbgfs_nr_ctxs + 1), GFP_KERNEL);
+	if (!new_ctxs)
+		return -ENOMEM;
+	dbgfs_ctxs = new_ctxs;
+
+	new_dirs = krealloc(dbgfs_dirs, sizeof(*dbgfs_dirs) *
+			(dbgfs_nr_ctxs + 1), GFP_KERNEL);
+	if (!new_dirs)
+		return -ENOMEM;
+	dbgfs_dirs = new_dirs;
+
+	root = dbgfs_dirs[0];
+	if (!root)
+		return -ENOENT;
+
+	new_dir = debugfs_create_dir(name, root);
+	dbgfs_dirs[dbgfs_nr_ctxs] = new_dir;
+
+	new_ctx = dbgfs_new_ctx();
+	if (!new_ctx) {
+		debugfs_remove(new_dir);
+		dbgfs_dirs[dbgfs_nr_ctxs] = NULL;
+		return -ENOMEM;
+	}
+
+	dbgfs_ctxs[dbgfs_nr_ctxs] = new_ctx;
+	dbgfs_fill_ctx_dir(dbgfs_dirs[dbgfs_nr_ctxs],
+			dbgfs_ctxs[dbgfs_nr_ctxs]);
+	dbgfs_nr_ctxs++;
+
+	return 0;
+}
+
+static ssize_t dbgfs_mk_context_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	char *ctx_name;
+	ssize_t ret = count;
+	int err;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+	ctx_name = kmalloc(count + 1, GFP_KERNEL);
+	if (!ctx_name) {
+		kfree(kbuf);
+		return -ENOMEM;
+	}
+
+	/* Trim white space */
+	if (sscanf(kbuf, "%s", ctx_name) != 1) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mutex_lock(&damon_dbgfs_lock);
+	err = dbgfs_mk_context(ctx_name);
+	if (err)
+		ret = err;
+	mutex_unlock(&damon_dbgfs_lock);
+
+out:
+	kfree(kbuf);
+	kfree(ctx_name);
+	return ret;
+}
+
+/*
+ * Remove a context of @name and its debugfs directory.
+ *
+ * This function should be called while holding damon_dbgfs_lock.
+ *
+ * Return 0 on success, negative error code otherwise.
+ */
+static int dbgfs_rm_context(char *name)
+{
+	struct dentry *root, *dir, **new_dirs;
+	struct damon_ctx **new_ctxs;
+	int i, j;
+
+	if (damon_nr_running_ctxs())
+		return -EBUSY;
+
+	root = dbgfs_dirs[0];
+	if (!root)
+		return -ENOENT;
+
+	dir = debugfs_lookup(name, root);
+	if (!dir)
+		return -ENOENT;
+
+	new_dirs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_dirs),
+			GFP_KERNEL);
+	if (!new_dirs)
+		return -ENOMEM;
+
+	new_ctxs = kmalloc_array(dbgfs_nr_ctxs - 1, sizeof(*dbgfs_ctxs),
+			GFP_KERNEL);
+	if (!new_ctxs) {
+		kfree(new_dirs);
+		return -ENOMEM;
+	}
+
+	for (i = 0, j = 0; i < dbgfs_nr_ctxs; i++) {
+		if (dbgfs_dirs[i] == dir) {
+			debugfs_remove(dbgfs_dirs[i]);
+			dbgfs_destroy_ctx(dbgfs_ctxs[i]);
+			continue;
+		}
+		new_dirs[j] = dbgfs_dirs[i];
+		new_ctxs[j++] = dbgfs_ctxs[i];
+	}
+
+	kfree(dbgfs_dirs);
+	kfree(dbgfs_ctxs);
+
+	dbgfs_dirs = new_dirs;
+	dbgfs_ctxs = new_ctxs;
+	dbgfs_nr_ctxs--;
+
+	return 0;
+}
+
+static ssize_t dbgfs_rm_context_write(struct file *file,
+		const char __user *buf, size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	ssize_t ret = count;
+	int err;
+	char *ctx_name;
+
+	kbuf = user_input_str(buf, count, ppos);
+	if (IS_ERR(kbuf))
+		return PTR_ERR(kbuf);
+	ctx_name = kmalloc(count + 1, GFP_KERNEL);
+	if (!ctx_name) {
+		kfree(kbuf);
+		return -ENOMEM;
+	}
+
+	/* Trim white space */
+	if (sscanf(kbuf, "%s", ctx_name) != 1) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	mutex_lock(&damon_dbgfs_lock);
+	err = dbgfs_rm_context(ctx_name);
+	if (err)
+		ret = err;
+	mutex_unlock(&damon_dbgfs_lock);
+
+out:
+	kfree(kbuf);
+	kfree(ctx_name);
+	return ret;
+}
+
 static ssize_t dbgfs_monitor_on_read(struct file *file,
 		char __user *buf, size_t count, loff_t *ppos)
 {
@@ -370,6 +551,14 @@ static ssize_t dbgfs_monitor_on_write(st
 	return ret;
 }
 
+static const struct file_operations mk_contexts_fops = {
+	.write = dbgfs_mk_context_write,
+};
+
+static const struct file_operations rm_contexts_fops = {
+	.write = dbgfs_rm_context_write,
+};
+
 static const struct file_operations monitor_on_fops = {
 	.read = dbgfs_monitor_on_read,
 	.write = dbgfs_monitor_on_write,
@@ -378,8 +567,10 @@ static const struct file_operations moni
 static int __init __damon_dbgfs_init(void)
 {
 	struct dentry *dbgfs_root;
-	const char * const file_names[] = {"monitor_on"};
-	const struct file_operations *fops[] = {&monitor_on_fops};
+	const char * const file_names[] = {"mk_contexts", "rm_contexts",
+		"monitor_on"};
+	const struct file_operations *fops[] = {&mk_contexts_fops,
+		&rm_contexts_fops, &monitor_on_fops};
 	int i;
 
 	dbgfs_root = debugfs_create_dir("damon", NULL);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 072/147] Documentation: add documents for DAMON
  2021-09-08  2:52 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2021-09-08  2:57 ` [patch 071/147] mm/damon/dbgfs: support multiple contexts Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 073/147] mm/damon: add kunit tests Andrew Morton
                   ` (75 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: Documentation: add documents for DAMON

This commit adds documents for DAMON under
`Documentation/admin-guide/mm/damon/` and `Documentation/vm/damon/`.

Link: https://lkml.kernel.org/r/20210716081449.22187-11-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Fernand Sieber <sieberf@amazon.com>
Reviewed-by: Markus Boehme <markubo@amazon.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/damon/index.rst |   15 +
 Documentation/admin-guide/mm/damon/start.rst |  114 +++++++++++
 Documentation/admin-guide/mm/damon/usage.rst |  112 +++++++++++
 Documentation/admin-guide/mm/index.rst       |    1 
 Documentation/vm/damon/api.rst               |   20 ++
 Documentation/vm/damon/design.rst            |  166 +++++++++++++++++
 Documentation/vm/damon/faq.rst               |   51 +++++
 Documentation/vm/damon/index.rst             |   30 +++
 Documentation/vm/index.rst                   |    1 
 9 files changed, 510 insertions(+)

--- /dev/null
+++ a/Documentation/admin-guide/mm/damon/index.rst
@@ -0,0 +1,15 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+========================
+Monitoring Data Accesses
+========================
+
+:doc:`DAMON </vm/damon/index>` allows light-weight data access monitoring.
+Using DAMON, users can analyze the memory access patterns of their systems and
+optimize them.
+
+.. toctree::
+   :maxdepth: 2
+
+   start
+   usage
--- /dev/null
+++ a/Documentation/admin-guide/mm/damon/start.rst
@@ -0,0 +1,114 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Getting Started
+===============
+
+This document briefly describes how you can use DAMON by demonstrating its
+default user space tool.  Please note that this document describes only a part
+of its features for brevity.  Please refer to :doc:`usage` for more details.
+
+
+TL; DR
+======
+
+Follow the commands below to monitor and visualize the memory access pattern of
+your workload. ::
+
+    # # build the kernel with CONFIG_DAMON_*=y, install it, and reboot
+    # mount -t debugfs none /sys/kernel/debug/
+    # git clone https://github.com/awslabs/damo
+    # ./damo/damo record $(pidof <your workload>)
+    # ./damo/damo report heat --plot_ascii
+
+The final command draws the access heatmap of ``<your workload>``.  The heatmap
+shows which memory region (x-axis) is accessed when (y-axis) and how frequently
+(number; the higher the number, the more accesses have been observed). ::
+
+    111111111111111111111111111111111111111111111111111111110000
+    111121111111111111111111111111211111111111111111111111110000
+    000000000000000000000000000000000000000000000000001555552000
+    000000000000000000000000000000000000000000000222223555552000
+    000000000000000000000000000000000000000011111677775000000000
+    000000000000000000000000000000000000000488888000000000000000
+    000000000000000000000000000000000177888400000000000000000000
+    000000000000000000000000000046666522222100000000000000000000
+    000000000000000000000014444344444300000000000000000000000000
+    000000000000000002222245555510000000000000000000000000000000
+    # access_frequency:  0  1  2  3  4  5  6  7  8  9
+    # x-axis: space (140286319947776-140286426374096: 101.496 MiB)
+    # y-axis: time (605442256436361-605479951866441: 37.695430s)
+    # resolution: 60x10 (1.692 MiB and 3.770s for each character)
+
+
+Prerequisites
+=============
+
+Kernel
+------
+
+You should first ensure your system is running on a kernel built with
+``CONFIG_DAMON_*=y``.
+
+
+User Space Tool
+---------------
+
+For the demonstration, we will use the default user space tool for DAMON,
+called DAMON Operator (DAMO).  It is available at
+https://github.com/awslabs/damo.  The examples below assume that ``damo`` is on
+your ``$PATH``.  It's not mandatory, though.
+
+Because DAMO uses the debugfs interface of DAMON (refer to :doc:`usage` for
+details), you should ensure debugfs is mounted.  Mount it manually as
+below::
+
+    # mount -t debugfs none /sys/kernel/debug/
+
+or append the following line to your ``/etc/fstab`` file so that your system
+can automatically mount debugfs upon booting::
+
+    debugfs /sys/kernel/debug debugfs defaults 0 0
+
+
+Recording Data Access Patterns
+==============================
+
+The commands below record the memory access patterns of a program and save the
+monitoring results to a file. ::
+
+    $ git clone https://github.com/sjp38/masim
+    $ cd masim; make; ./masim ./configs/zigzag.cfg &
+    $ sudo damo record -o damon.data $(pidof masim)
+
+The first two lines of the commands download an artificial memory access
+generator program and run it in the background.  The generator will repeatedly
+access two 100 MiB sized memory regions one by one.  You can substitute this
+with your real workload.  The last line asks ``damo`` to record the access
+pattern in the ``damon.data`` file.
+
+
+Visualizing Recorded Patterns
+=============================
+
+The following three commands visualize the recorded access patterns and save
+the results as separate image files. ::
+
+    $ damo report heats --heatmap access_pattern_heatmap.png
+    $ damo report wss --range 0 101 1 --plot wss_dist.png
+    $ damo report wss --range 0 101 1 --sortby time --plot wss_chron_change.png
+
+- ``access_pattern_heatmap.png`` will visualize the data access pattern in a
+  heatmap, showing which memory region (y-axis) got accessed when (x-axis)
+  and how frequently (color).
+- ``wss_dist.png`` will show the distribution of the working set size.
+- ``wss_chron_change.png`` will show how the working set size has
+  chronologically changed.
+
+You can view the visualizations of this example workload at [1]_.
+Visualizations of other realistic workloads are available at [2]_ [3]_ [4]_.
+
+.. [1] https://damonitor.github.io/doc/html/v17/admin-guide/mm/damon/start.html#visualizing-recorded-patterns
+.. [2] https://damonitor.github.io/test/result/visual/latest/rec.heatmap.1.png.html
+.. [3] https://damonitor.github.io/test/result/visual/latest/rec.wss_sz.png.html
+.. [4] https://damonitor.github.io/test/result/visual/latest/rec.wss_time.png.html
--- /dev/null
+++ a/Documentation/admin-guide/mm/damon/usage.rst
@@ -0,0 +1,112 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+Detailed Usages
+===============
+
+DAMON provides the three interfaces below for different users.
+
+- *DAMON user space tool.*
+  This is for privileged people such as system administrators who want a
+  just-working human-friendly interface.  Using this, users can use DAMON's
+  major features in a human-friendly way.  It may not be highly tuned for
+  special cases, though.  It supports only virtual address space monitoring.
+- *debugfs interface.*
+  This is for privileged user space programmers who want more optimized use of
+  DAMON.  Using this, users can use DAMON's major features by reading
+  from and writing to special debugfs files.  Therefore, you can write and use
+  your personalized DAMON debugfs wrapper programs that read/write the
+  debugfs files on your behalf.  The DAMON user space tool is also a reference
+  implementation of such programs.  It supports only virtual address space
+  monitoring.
+- *Kernel Space Programming Interface.*
+  This is for kernel space programmers.  Using this, users can utilize every
+  feature of DAMON most flexibly and efficiently by writing kernel space
+  DAMON application programs.  You can even extend DAMON for various
+  address spaces.
+
+Nevertheless, you could write your own user space tool using the debugfs
+interface.  A reference implementation is available at
+https://github.com/awslabs/damo.  If you are a kernel programmer, you could
+refer to :doc:`/vm/damon/api` for the kernel space programming interface.  For
+this reason, this document describes only the debugfs interface.
+
+debugfs Interface
+=================
+
+DAMON exports three files, ``attrs``, ``target_ids``, and ``monitor_on`` under
+its debugfs directory, ``<debugfs>/damon/``.
+
+
+Attributes
+----------
+
+Users can get and set the ``sampling interval``, ``aggregation interval``,
+``regions update interval``, and min/max number of monitoring target regions by
+reading from and writing to the ``attrs`` file.  To know about the monitoring
+attributes in detail, please refer to :doc:`/vm/damon/design`.  For
+example, the commands below set those values to 5 ms, 100 ms, 1,000 ms, 10 and
+1,000, and then check them again::
+
+    # cd <debugfs>/damon
+    # echo 5000 100000 1000000 10 1000 > attrs
+    # cat attrs
+    5000 100000 1000000 10 1000
+
+
+Target IDs
+----------
+
+Some types of address spaces support multiple monitoring targets.  For example,
+virtual memory address space monitoring can have multiple processes as the
+monitoring targets.  Users can set the targets by writing the relevant id
+values of the targets to the ``target_ids`` file, and get the ids of the
+current targets by reading from it.  In case of virtual address space
+monitoring, the values should be the pids of the monitoring target processes.
+For example, the commands below set the processes with pids 42 and 4242 as the
+monitoring targets and check it again::
+
+    # cd <debugfs>/damon
+    # echo 42 4242 > target_ids
+    # cat target_ids
+    42 4242
+
+Note that setting the target ids doesn't start the monitoring.
+
+
+Turning On/Off
+--------------
+
+Writing the files as described above has no effect unless you explicitly
+start the monitoring.  You can start, stop, and check the current status of the
+monitoring by writing to and reading from the ``monitor_on`` file.  Writing
+``on`` to the file starts the monitoring of the targets with the attributes.
+Writing ``off`` to the file stops the monitoring.  DAMON also stops if every
+target process is terminated.  The example commands below turn DAMON on and
+off, and check its status::
+
+    # cd <debugfs>/damon
+    # echo on > monitor_on
+    # echo off > monitor_on
+    # cat monitor_on
+    off
+
+Please note that you cannot write to the above-mentioned debugfs files while
+the monitoring is turned on.  If you write to the files while DAMON is running,
+an error code such as ``-EBUSY`` will be returned.
+
+
+Tracepoint for Monitoring Results
+=================================
+
+DAMON provides the monitoring results via a tracepoint,
+``damon:damon_aggregated``.  While the monitoring is turned on, you could
+record the tracepoint events and show results using tracepoint supporting tools
+like ``perf``.  For example::
+
+    # echo on > monitor_on
+    # perf record -e damon:damon_aggregated &
+    # sleep 5
+    # kill -9 $(pidof perf)
+    # echo off > monitor_on
+    # perf script
--- a/Documentation/admin-guide/mm/index.rst~documentation-add-documents-for-damon
+++ a/Documentation/admin-guide/mm/index.rst
@@ -27,6 +27,7 @@ the Linux memory management.
 
    concepts
    cma_debugfs
+   damon/index
    hugetlbpage
    idle_page_tracking
    ksm
--- /dev/null
+++ a/Documentation/vm/damon/api.rst
@@ -0,0 +1,20 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=============
+API Reference
+=============
+
+Kernel space programs can use every feature of DAMON using the APIs below.  All
+you need to do is include ``damon.h``, which is located in ``include/linux/``
+of the source tree.
+
+Structures
+==========
+
+.. kernel-doc:: include/linux/damon.h
+
+
+Functions
+=========
+
+.. kernel-doc:: mm/damon/core.c
--- /dev/null
+++ a/Documentation/vm/damon/design.rst
@@ -0,0 +1,166 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+======
+Design
+======
+
+Configurable Layers
+===================
+
+DAMON provides data access monitoring functionality while making the accuracy
+and the overhead controllable.  The fundamental access monitoring requires
+primitives that depend on and are optimized for the target address space.  On
+the other hand, the accuracy and overhead tradeoff mechanism, which is the core
+of DAMON, is in the pure logic space.  DAMON separates the two parts in
+different layers and defines its interface to allow various low level
+primitive implementations to be configured with the core logic.
+
+Due to this separated design and the configurable interface, users can extend
+DAMON for any address space by configuring the core logic with appropriate low
+level primitive implementations.  If an appropriate one is not provided, users
+can implement the primitives on their own.
+
+For example, physical memory, virtual memory, swap space, those for specific
+processes, NUMA nodes, files, and backing memory devices would be supportable.
+Also, if some architectures or devices support special optimized access check
+primitives, those will be easily configurable.
+
+
+Reference Implementations of Address Space Specific Primitives
+==============================================================
+
+The low level primitives for the fundamental access monitoring are defined in
+two parts:
+
+1. Identification of the monitoring target address range for the address space.
+2. Access check of specific address range in the target space.
+
+DAMON currently provides the implementation of the primitives for only the
+virtual address spaces.  The two subsections below describe how it works.
+
+
+VMA-based Target Address Range Construction
+-------------------------------------------
+
+Only small parts in the super-huge virtual address space of the processes are
+mapped to the physical memory and accessed.  Thus, tracking the unmapped
+address regions is just wasteful.  However, because DAMON can deal with some
+level of noise using the adaptive regions adjustment mechanism, tracking every
+mapping is not strictly required and could even incur a high overhead in some
+cases.  That said, too-huge unmapped areas inside the monitoring target should
+be removed so as not to waste the time of the adaptive mechanism.
+
+For this reason, this implementation converts the complex mappings to three
+distinct regions that cover every mapped area of the address space.  The two
+gaps between the three regions are the two biggest unmapped areas in the given
+address space.  The two biggest unmapped areas would be the gap between the
+heap and the uppermost mmap()-ed region, and the gap between the lowermost
+mmap()-ed region and the stack in most of the cases.  Because these gaps are
+exceptionally huge in usual address spaces, excluding these will be sufficient
+to make a reasonable trade-off.  Below shows this in detail::
+
+    <heap>
+    <BIG UNMAPPED REGION 1>
+    <uppermost mmap()-ed region>
+    (small mmap()-ed regions and munmap()-ed regions)
+    <lowermost mmap()-ed region>
+    <BIG UNMAPPED REGION 2>
+    <stack>
+
+
+PTE Accessed-bit Based Access Check
+-----------------------------------
+
+The implementation for the virtual address space uses the PTE Accessed bit for
+basic access checks.  It finds the relevant PTE Accessed bit from the address
+by walking the page table for the target task of the address.  In this way, the
+implementation finds and clears the bit for the next sampling target address and
+checks whether the bit is set again after one sampling period.  This could disturb
+other kernel subsystems using the Accessed bits, namely Idle page tracking and
+the reclaim logic.  To avoid such disturbances, DAMON makes it mutually
+exclusive with Idle page tracking and uses ``PG_idle`` and ``PG_young`` page
+flags to solve the conflict with the reclaim logic, as Idle page tracking does.
+
+
+Address Space Independent Core Mechanisms
+=========================================
+
+The four sections below describe each of the DAMON core mechanisms and the five
+monitoring attributes, ``sampling interval``, ``aggregation interval``,
+``regions update interval``, ``minimum number of regions``, and ``maximum
+number of regions``.
+
+
+Access Frequency Monitoring
+---------------------------
+
+The output of DAMON shows which pages are accessed how frequently for a given
+duration.  The resolution of the access frequency is controlled by setting
+``sampling interval`` and ``aggregation interval``.  In detail, DAMON checks
+access to each page per ``sampling interval`` and aggregates the results.  In
+other words, it counts the number of accesses to each page.  After each
+``aggregation interval`` passes, DAMON calls callback functions that were
+previously registered by users so that users can read the aggregated results,
+and then clears the results.  This can be described in the simple pseudo-code below::
+
+    while monitoring_on:
+        for page in monitoring_target:
+            if accessed(page):
+                nr_accesses[page] += 1
+        if time() % aggregation_interval == 0:
+            for callback in user_registered_callbacks:
+                callback(monitoring_target, nr_accesses)
+            for page in monitoring_target:
+                nr_accesses[page] = 0
+        sleep(sampling interval)
+
+The monitoring overhead of this mechanism grows without bound as the
+size of the target workload grows.
+
+
+Region Based Sampling
+---------------------
+
+To avoid the unbounded increase of the overhead, DAMON groups adjacent pages
+that are assumed to have the same access frequencies into a region.  As long as
+the assumption (pages in a region have the same access frequencies) holds, only
+one page in the region needs to be checked.  Thus, for each ``sampling
+interval``, DAMON randomly picks one page in each region, waits for one
+``sampling interval``, checks whether the page has been accessed meanwhile, and
+increases the access frequency of the region if so.  Therefore, the monitoring
+overhead is controllable by setting the number of regions.  DAMON allows users
+to set the minimum and the maximum number of regions for the trade-off.
+
+This scheme, however, cannot preserve the quality of the output if the
+assumption is not guaranteed.
+
+
+Adaptive Regions Adjustment
+---------------------------
+
+Even if the initial monitoring target regions are somehow well constructed to
+fulfill the assumption (pages in the same region have similar access
+frequencies), the data access pattern can dynamically change.  This will result
+in low monitoring quality.  To keep the assumption holding as much as possible,
+DAMON adaptively merges and splits each region based on its access frequency.
+
+For each ``aggregation interval``, it compares the access frequencies of
+adjacent regions and merges those if the frequency difference is small.  Then,
+after it reports and clears the aggregated access frequency of each region, it
+splits each region into two or three regions if the total number of regions
+will not exceed the user-specified maximum number of regions after the split.
+
+In this way, DAMON provides its best-effort quality and minimal overhead while
+keeping the bounds users set for their trade-off.
+
+
+Dynamic Target Space Updates Handling
+-------------------------------------
+
+The monitoring target address range can change dynamically.  For example,
+virtual memory can be dynamically mapped and unmapped.  Physical memory can
+be hot-plugged.
+
+As the changes could be quite frequent in some cases, DAMON checks the dynamic
+memory mapping changes and applies them to the abstracted target area only
+once per user-specified time interval (``regions update interval``).
--- /dev/null
+++ a/Documentation/vm/damon/faq.rst
@@ -0,0 +1,51 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+Frequently Asked Questions
+==========================
+
+Why a new subsystem, instead of extending perf or other user space tools?
+=========================================================================
+
+First, because it needs to be as lightweight as possible so that it can be
+used online, any unnecessary overhead such as the kernel - user space context
+switching cost should be avoided.  Second, DAMON aims to be used by other
+programs including the kernel.  Therefore, having a dependency on specific
+tools like perf is not desirable.  These are the two biggest reasons why DAMON
+is implemented in kernel space.
+
+
+Can 'idle pages tracking' or 'perf mem' substitute DAMON?
+=========================================================
+
+Idle page tracking is a low level primitive for access checks of the physical
+address space.  'perf mem' is similar, though it can use sampling to minimize
+the overhead.  On the other hand, DAMON is a higher-level framework for the
+monitoring of various address spaces.  It is focused on memory management
+optimization and provides sophisticated accuracy/overhead handling mechanisms.
+Therefore, 'idle pages tracking' and 'perf mem' could provide a subset of
+DAMON's output, but cannot substitute DAMON.
+
+
+Does DAMON support virtual memory only?
+=======================================
+
+No.  The core of DAMON is address space independent.  The address space
+specific low level primitive parts, including monitoring target region
+construction and actual access checks, can be implemented and configured on the
+DAMON core by the users.  In this way, DAMON users can monitor any address
+space with any access check technique.
+
+Nonetheless, DAMON provides vma tracking and PTE Accessed bit check based
+implementations of the address space dependent functions for virtual memory
+by default, for reference and convenient use.  In the near future, we will
+provide those for the physical memory address space.
+
+
+Can I simply monitor page granularity?
+======================================
+
+Yes.  You can do so by setting the ``min_nr_regions`` attribute higher than the
+working set size divided by the page size.  Because the monitoring target
+region size is forced to be ``>=page size``, the region split will have no
+effect.
--- /dev/null
+++ a/Documentation/vm/damon/index.rst
@@ -0,0 +1,30 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+DAMON: Data Access MONitor
+==========================
+
+DAMON is a data access monitoring framework subsystem for the Linux kernel.
+The core mechanisms of DAMON (refer to :doc:`design` for the details) make it
+
+ - *accurate* (the monitoring output is useful enough for DRAM level memory
+   management; it might not be appropriate for CPU cache levels, though),
+ - *light-weight* (the monitoring overhead is low enough to be applied online),
+   and
+ - *scalable* (the upper-bound of the overhead is in a constant range
+   regardless of the size of target workloads).
+
+Using this framework, therefore, the kernel's memory management mechanisms can
+make advanced decisions.  Experimental memory management optimization works
+that incur high data access monitoring overhead could be implemented again.
+In user space, meanwhile, users who have some special workloads can write
+personalized applications for better understanding and optimization of their
+workloads and systems.
+
+.. toctree::
+   :maxdepth: 2
+
+   faq
+   design
+   api
+   plans
--- a/Documentation/vm/index.rst~documentation-add-documents-for-damon
+++ a/Documentation/vm/index.rst
@@ -32,6 +32,7 @@ descriptions of data structures and algo
    arch_pgtable_helpers
    balance
    cleancache
+   damon/index
    free_page_reporting
    frontswap
    highmem
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 073/147] mm/damon: add kunit tests
  2021-09-08  2:52 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2021-09-08  2:57 ` [patch 072/147] Documentation: add documents for DAMON Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 074/147] mm/damon: add user space selftests Andrew Morton
                   ` (74 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: add kunit tests

This commit adds KUnit based unit tests for the core and the virtual
address space monitoring primitives of DAMON.
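
For readers unfamiliar with KUnit, the sketch below shows the general shape
of such a test (a minimal, hypothetical example; the real tests live in the
*-test.h files added by this patch)::

    #include <kunit/test.h>

    static void damon_example_test(struct kunit *test)
    {
            /* each test case makes assertions against the code under test */
            KUNIT_EXPECT_EQ(test, 2 + 2, 4);
    }

    static struct kunit_case damon_example_cases[] = {
            KUNIT_CASE(damon_example_test),
            {}
    };

    static struct kunit_suite damon_example_suite = {
            .name = "damon-example",
            .test_cases = damon_example_cases,
    };
    kunit_test_suites(&damon_example_suite);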

Link: https://lkml.kernel.org/r/20210716081449.22187-12-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Markus Boehme <markubo@amazon.de>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/damon/Kconfig      |   36 ++++
 mm/damon/core-test.h  |  253 ++++++++++++++++++++++++++++++
 mm/damon/core.c       |    7 
 mm/damon/dbgfs-test.h |  126 +++++++++++++++
 mm/damon/dbgfs.c      |    2 
 mm/damon/vaddr-test.h |  329 ++++++++++++++++++++++++++++++++++++++++
 mm/damon/vaddr.c      |    7 
 7 files changed, 760 insertions(+)

--- a/mm/damon/core.c~mm-damon-add-kunit-tests
+++ a/mm/damon/core.c
@@ -16,6 +16,11 @@
 #define CREATE_TRACE_POINTS
 #include <trace/events/damon.h>
 
+#ifdef CONFIG_DAMON_KUNIT_TEST
+#undef DAMON_MIN_REGION
+#define DAMON_MIN_REGION 1
+#endif
+
 /* Get a random number in [l, r) */
 #define damon_rand(l, r) (l + prandom_u32_max(r - l))
 
@@ -711,3 +716,5 @@ static int kdamond_fn(void *data)
 
 	do_exit(0);
 }
+
+#include "core-test.h"
--- /dev/null
+++ a/mm/damon/core-test.h
@@ -0,0 +1,253 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Access Monitor Unit Tests
+ *
+ * Copyright 2019 Amazon.com, Inc. or its affiliates.  All rights reserved.
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifdef CONFIG_DAMON_KUNIT_TEST
+
+#ifndef _DAMON_CORE_TEST_H
+#define _DAMON_CORE_TEST_H
+
+#include <kunit/test.h>
+
+static void damon_test_regions(struct kunit *test)
+{
+	struct damon_region *r;
+	struct damon_target *t;
+
+	r = damon_new_region(1, 2);
+	KUNIT_EXPECT_EQ(test, 1ul, r->ar.start);
+	KUNIT_EXPECT_EQ(test, 2ul, r->ar.end);
+	KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses);
+
+	t = damon_new_target(42);
+	KUNIT_EXPECT_EQ(test, 0u, damon_nr_regions(t));
+
+	damon_add_region(r, t);
+	KUNIT_EXPECT_EQ(test, 1u, damon_nr_regions(t));
+
+	damon_del_region(r, t);
+	KUNIT_EXPECT_EQ(test, 0u, damon_nr_regions(t));
+
+	damon_free_target(t);
+}
+
+static unsigned int nr_damon_targets(struct damon_ctx *ctx)
+{
+	struct damon_target *t;
+	unsigned int nr_targets = 0;
+
+	damon_for_each_target(t, ctx)
+		nr_targets++;
+
+	return nr_targets;
+}
+
+static void damon_test_target(struct kunit *test)
+{
+	struct damon_ctx *c = damon_new_ctx();
+	struct damon_target *t;
+
+	t = damon_new_target(42);
+	KUNIT_EXPECT_EQ(test, 42ul, t->id);
+	KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c));
+
+	damon_add_target(c, t);
+	KUNIT_EXPECT_EQ(test, 1u, nr_damon_targets(c));
+
+	damon_destroy_target(t);
+	KUNIT_EXPECT_EQ(test, 0u, nr_damon_targets(c));
+
+	damon_destroy_ctx(c);
+}
+
+/*
+ * Test kdamond_reset_aggregated()
+ *
+ * DAMON checks access to each region and aggregates this information as the
+ * access frequency of each region.  In detail, it increases '->nr_accesses' of
+ * regions in which an access has been confirmed.  'kdamond_reset_aggregated()'
+ * flushes the aggregated information ('->nr_accesses' of each region) to the
+ * result buffer.  As a result of the flushing, the '->nr_accesses' of the
+ * regions are initialized to zero.
+ */
+static void damon_test_aggregate(struct kunit *test)
+{
+	struct damon_ctx *ctx = damon_new_ctx();
+	unsigned long target_ids[] = {1, 2, 3};
+	unsigned long saddr[][3] = {{10, 20, 30}, {5, 42, 49}, {13, 33, 55} };
+	unsigned long eaddr[][3] = {{15, 27, 40}, {31, 45, 55}, {23, 44, 66} };
+	unsigned long accesses[][3] = {{42, 95, 84}, {10, 20, 30}, {0, 1, 2} };
+	struct damon_target *t;
+	struct damon_region *r;
+	int it, ir;
+
+	damon_set_targets(ctx, target_ids, 3);
+
+	it = 0;
+	damon_for_each_target(t, ctx) {
+		for (ir = 0; ir < 3; ir++) {
+			r = damon_new_region(saddr[it][ir], eaddr[it][ir]);
+			r->nr_accesses = accesses[it][ir];
+			damon_add_region(r, t);
+		}
+		it++;
+	}
+	kdamond_reset_aggregated(ctx);
+	it = 0;
+	damon_for_each_target(t, ctx) {
+		ir = 0;
+		/* '->nr_accesses' should be zeroed */
+		damon_for_each_region(r, t) {
+			KUNIT_EXPECT_EQ(test, 0u, r->nr_accesses);
+			ir++;
+		}
+		/* regions should be preserved */
+		KUNIT_EXPECT_EQ(test, 3, ir);
+		it++;
+	}
+	/* targets also should be preserved */
+	KUNIT_EXPECT_EQ(test, 3, it);
+
+	damon_destroy_ctx(ctx);
+}
+
+static void damon_test_split_at(struct kunit *test)
+{
+	struct damon_ctx *c = damon_new_ctx();
+	struct damon_target *t;
+	struct damon_region *r;
+
+	t = damon_new_target(42);
+	r = damon_new_region(0, 100);
+	damon_add_region(r, t);
+	damon_split_region_at(c, t, r, 25);
+	KUNIT_EXPECT_EQ(test, r->ar.start, 0ul);
+	KUNIT_EXPECT_EQ(test, r->ar.end, 25ul);
+
+	r = damon_next_region(r);
+	KUNIT_EXPECT_EQ(test, r->ar.start, 25ul);
+	KUNIT_EXPECT_EQ(test, r->ar.end, 100ul);
+
+	damon_free_target(t);
+	damon_destroy_ctx(c);
+}
+
+static void damon_test_merge_two(struct kunit *test)
+{
+	struct damon_target *t;
+	struct damon_region *r, *r2, *r3;
+	int i;
+
+	t = damon_new_target(42);
+	r = damon_new_region(0, 100);
+	r->nr_accesses = 10;
+	damon_add_region(r, t);
+	r2 = damon_new_region(100, 300);
+	r2->nr_accesses = 20;
+	damon_add_region(r2, t);
+
+	damon_merge_two_regions(t, r, r2);
+	KUNIT_EXPECT_EQ(test, r->ar.start, 0ul);
+	KUNIT_EXPECT_EQ(test, r->ar.end, 300ul);
+	KUNIT_EXPECT_EQ(test, r->nr_accesses, 16u);
+
+	i = 0;
+	damon_for_each_region(r3, t) {
+		KUNIT_EXPECT_PTR_EQ(test, r, r3);
+		i++;
+	}
+	KUNIT_EXPECT_EQ(test, i, 1);
+
+	damon_free_target(t);
+}
+
+static struct damon_region *__nth_region_of(struct damon_target *t, int idx)
+{
+	struct damon_region *r;
+	unsigned int i = 0;
+
+	damon_for_each_region(r, t) {
+		if (i++ == idx)
+			return r;
+	}
+
+	return NULL;
+}
+
+static void damon_test_merge_regions_of(struct kunit *test)
+{
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long sa[] = {0, 100, 114, 122, 130, 156, 170, 184};
+	unsigned long ea[] = {100, 112, 122, 130, 156, 170, 184, 230};
+	unsigned int nrs[] = {0, 0, 10, 10, 20, 30, 1, 2};
+
+	unsigned long saddrs[] = {0, 114, 130, 156, 170};
+	unsigned long eaddrs[] = {112, 130, 156, 170, 230};
+	int i;
+
+	t = damon_new_target(42);
+	for (i = 0; i < ARRAY_SIZE(sa); i++) {
+		r = damon_new_region(sa[i], ea[i]);
+		r->nr_accesses = nrs[i];
+		damon_add_region(r, t);
+	}
+
+	damon_merge_regions_of(t, 9, 9999);
+	/* 0-112, 114-130, 130-156, 156-170 */
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 5u);
+	for (i = 0; i < 5; i++) {
+		r = __nth_region_of(t, i);
+		KUNIT_EXPECT_EQ(test, r->ar.start, saddrs[i]);
+		KUNIT_EXPECT_EQ(test, r->ar.end, eaddrs[i]);
+	}
+	damon_free_target(t);
+}
+
+static void damon_test_split_regions_of(struct kunit *test)
+{
+	struct damon_ctx *c = damon_new_ctx();
+	struct damon_target *t;
+	struct damon_region *r;
+
+	t = damon_new_target(42);
+	r = damon_new_region(0, 22);
+	damon_add_region(r, t);
+	damon_split_regions_of(c, t, 2);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 2u);
+	damon_free_target(t);
+
+	t = damon_new_target(42);
+	r = damon_new_region(0, 220);
+	damon_add_region(r, t);
+	damon_split_regions_of(c, t, 4);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 4u);
+	damon_free_target(t);
+	damon_destroy_ctx(c);
+}
+
+static struct kunit_case damon_test_cases[] = {
+	KUNIT_CASE(damon_test_target),
+	KUNIT_CASE(damon_test_regions),
+	KUNIT_CASE(damon_test_aggregate),
+	KUNIT_CASE(damon_test_split_at),
+	KUNIT_CASE(damon_test_merge_two),
+	KUNIT_CASE(damon_test_merge_regions_of),
+	KUNIT_CASE(damon_test_split_regions_of),
+	{},
+};
+
+static struct kunit_suite damon_test_suite = {
+	.name = "damon",
+	.test_cases = damon_test_cases,
+};
+kunit_test_suite(damon_test_suite);
+
+#endif /* _DAMON_CORE_TEST_H */
+
+#endif	/* CONFIG_DAMON_KUNIT_TEST */
--- a/mm/damon/dbgfs.c~mm-damon-add-kunit-tests
+++ a/mm/damon/dbgfs.c
@@ -619,3 +619,5 @@ static int __init damon_dbgfs_init(void)
 }
 
 module_init(damon_dbgfs_init);
+
+#include "dbgfs-test.h"
--- /dev/null
+++ a/mm/damon/dbgfs-test.h
@@ -0,0 +1,126 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * DAMON Debugfs Interface Unit Tests
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifdef CONFIG_DAMON_DBGFS_KUNIT_TEST
+
+#ifndef _DAMON_DBGFS_TEST_H
+#define _DAMON_DBGFS_TEST_H
+
+#include <kunit/test.h>
+
+static void damon_dbgfs_test_str_to_target_ids(struct kunit *test)
+{
+	char *question;
+	unsigned long *answers;
+	unsigned long expected[] = {12, 35, 46};
+	ssize_t nr_integers = 0, i;
+
+	question = "123";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
+	KUNIT_EXPECT_EQ(test, 123ul, answers[0]);
+	kfree(answers);
+
+	question = "123abc";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)1, nr_integers);
+	KUNIT_EXPECT_EQ(test, 123ul, answers[0]);
+	kfree(answers);
+
+	question = "a123";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+	kfree(answers);
+
+	question = "12 35";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
+	for (i = 0; i < nr_integers; i++)
+		KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+	kfree(answers);
+
+	question = "12 35 46";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)3, nr_integers);
+	for (i = 0; i < nr_integers; i++)
+		KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+	kfree(answers);
+
+	question = "12 35 abc 46";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)2, nr_integers);
+	for (i = 0; i < 2; i++)
+		KUNIT_EXPECT_EQ(test, expected[i], answers[i]);
+	kfree(answers);
+
+	question = "";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+	kfree(answers);
+
+	question = "\n";
+	answers = str_to_target_ids(question, strnlen(question, 128),
+			&nr_integers);
+	KUNIT_EXPECT_EQ(test, (ssize_t)0, nr_integers);
+	kfree(answers);
+}
+
+static void damon_dbgfs_test_set_targets(struct kunit *test)
+{
+	struct damon_ctx *ctx = dbgfs_new_ctx();
+	unsigned long ids[] = {1, 2, 3};
+	char buf[64];
+
+	/* Make DAMON consider target ids as plain numbers */
+	ctx->primitive.target_valid = NULL;
+	ctx->primitive.cleanup = NULL;
+
+	damon_set_targets(ctx, ids, 3);
+	sprint_target_ids(ctx, buf, 64);
+	KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2 3\n");
+
+	damon_set_targets(ctx, NULL, 0);
+	sprint_target_ids(ctx, buf, 64);
+	KUNIT_EXPECT_STREQ(test, (char *)buf, "\n");
+
+	damon_set_targets(ctx, (unsigned long []){1, 2}, 2);
+	sprint_target_ids(ctx, buf, 64);
+	KUNIT_EXPECT_STREQ(test, (char *)buf, "1 2\n");
+
+	damon_set_targets(ctx, (unsigned long []){2}, 1);
+	sprint_target_ids(ctx, buf, 64);
+	KUNIT_EXPECT_STREQ(test, (char *)buf, "2\n");
+
+	damon_set_targets(ctx, NULL, 0);
+	sprint_target_ids(ctx, buf, 64);
+	KUNIT_EXPECT_STREQ(test, (char *)buf, "\n");
+
+	dbgfs_destroy_ctx(ctx);
+}
+
+static struct kunit_case damon_test_cases[] = {
+	KUNIT_CASE(damon_dbgfs_test_str_to_target_ids),
+	KUNIT_CASE(damon_dbgfs_test_set_targets),
+	{},
+};
+
+static struct kunit_suite damon_test_suite = {
+	.name = "damon-dbgfs",
+	.test_cases = damon_test_cases,
+};
+kunit_test_suite(damon_test_suite);
+
+#endif /* _DAMON_DBGFS_TEST_H */
+
+#endif	/* CONFIG_DAMON_DBGFS_KUNIT_TEST */
--- a/mm/damon/Kconfig~mm-damon-add-kunit-tests
+++ a/mm/damon/Kconfig
@@ -12,6 +12,18 @@ config DAMON
 	  See https://damonitor.github.io/doc/html/latest-damon/index.html for
 	  more information.
 
+config DAMON_KUNIT_TEST
+	bool "Test for damon" if !KUNIT_ALL_TESTS
+	depends on DAMON && KUNIT=y
+	default KUNIT_ALL_TESTS
+	help
+	  This builds the DAMON Kunit test suite.
+
+	  For more information on KUnit and unit tests in general, please refer
+	  to the KUnit documentation.
+
+	  If unsure, say N.
+
 config DAMON_VADDR
 	bool "Data access monitoring primitives for virtual address spaces"
 	depends on DAMON && MMU
@@ -20,6 +32,18 @@ config DAMON_VADDR
 	  This builds the default data access monitoring primitives for DAMON
 	  that works for virtual address spaces.
 
+config DAMON_VADDR_KUNIT_TEST
+	bool "Test for DAMON primitives" if !KUNIT_ALL_TESTS
+	depends on DAMON_VADDR && KUNIT=y
+	default KUNIT_ALL_TESTS
+	help
+	  This builds the DAMON virtual addresses primitives Kunit test suite.
+
+	  For more information on KUnit and unit tests in general, please refer
+	  to the KUnit documentation.
+
+	  If unsure, say N.
+
 config DAMON_DBGFS
 	bool "DAMON debugfs interface"
 	depends on DAMON_VADDR && DEBUG_FS
@@ -29,4 +53,16 @@ config DAMON_DBGFS
 
 	  If unsure, say N.
 
+config DAMON_DBGFS_KUNIT_TEST
+	bool "Test for damon debugfs interface" if !KUNIT_ALL_TESTS
+	depends on DAMON_DBGFS && KUNIT=y
+	default KUNIT_ALL_TESTS
+	help
+	  This builds the DAMON debugfs interface Kunit test suite.
+
+	  For more information on KUnit and unit tests in general, please refer
+	  to the KUnit documentation.
+
+	  If unsure, say N.
+
 endmenu
--- a/mm/damon/vaddr.c~mm-damon-add-kunit-tests
+++ a/mm/damon/vaddr.c
@@ -18,6 +18,11 @@
 #include <linux/sched/mm.h>
 #include <linux/slab.h>
 
+#ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
+#undef DAMON_MIN_REGION
+#define DAMON_MIN_REGION 1
+#endif
+
 /* Get a random number in [l, r) */
 #define damon_rand(l, r) (l + prandom_u32_max(r - l))
 
@@ -663,3 +668,5 @@ void damon_va_set_primitives(struct damo
 	ctx->primitive.target_valid = damon_va_target_valid;
 	ctx->primitive.cleanup = NULL;
 }
+
+#include "vaddr-test.h"
--- /dev/null
+++ a/mm/damon/vaddr-test.h
@@ -0,0 +1,329 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Data Access Monitor Unit Tests
+ *
+ * Copyright 2019 Amazon.com, Inc. or its affiliates.  All rights reserved.
+ *
+ * Author: SeongJae Park <sjpark@amazon.de>
+ */
+
+#ifdef CONFIG_DAMON_VADDR_KUNIT_TEST
+
+#ifndef _DAMON_VADDR_TEST_H
+#define _DAMON_VADDR_TEST_H
+
+#include <kunit/test.h>
+
+static void __link_vmas(struct vm_area_struct *vmas, ssize_t nr_vmas)
+{
+	int i, j;
+	unsigned long largest_gap, gap;
+
+	if (!nr_vmas)
+		return;
+
+	for (i = 0; i < nr_vmas - 1; i++) {
+		vmas[i].vm_next = &vmas[i + 1];
+
+		vmas[i].vm_rb.rb_left = NULL;
+		vmas[i].vm_rb.rb_right = &vmas[i + 1].vm_rb;
+
+		largest_gap = 0;
+		for (j = i; j < nr_vmas; j++) {
+			if (j == 0)
+				continue;
+			gap = vmas[j].vm_start - vmas[j - 1].vm_end;
+			if (gap > largest_gap)
+				largest_gap = gap;
+		}
+		vmas[i].rb_subtree_gap = largest_gap;
+	}
+	vmas[i].vm_next = NULL;
+	vmas[i].vm_rb.rb_right = NULL;
+	vmas[i].rb_subtree_gap = 0;
+}
+
+/*
+ * Test __damon_va_three_regions() function
+ *
+ * In the case of virtual memory address space monitoring, DAMON converts the
+ * complex and dynamic memory mappings of each target task into three
+ * discontiguous regions which cover every mapped area.  However, the three
+ * regions should not include the two biggest unmapped areas in the original
+ * mapping, because those are normally the areas between 1) the heap and the
+ * mmap()-ed regions, and 2) the mmap()-ed regions and the stack.  Because
+ * these two unmapped areas are huge but obviously never accessed, covering
+ * them would be a waste.
+ *
+ * '__damon_va_three_regions()' receives the address space of a process.  It
+ * first identifies the start of the mappings, the end of the mappings, and
+ * the two biggest unmapped areas.  Based on that information, it then
+ * constructs and returns the three regions.  For more detail, refer to the
+ * comment on 'damon_init_regions_of()' in the 'mm/damon.c' file.
+ *
+ * For example, suppose virtual address ranges of 10-20, 20-25, 200-210,
+ * 210-220, 300-305, and 307-330 (other comments represent these mappings in
+ * a shorter form: 10-20-25, 200-210-220, 300-305, 307-330) of a process are
+ * mapped.  To cover every mapping, the three regions should start with 10
+ * and end with 330.  The process also has three unmapped areas: 25-200,
+ * 220-300, and 305-307.  Among those, 25-200 and 220-300 are the two biggest
+ * unmapped areas, so the mappings should be converted to the three regions
+ * 10-25, 200-220, and 300-330.
+ */
+static void damon_test_three_regions_in_vmas(struct kunit *test)
+{
+	struct damon_addr_range regions[3] = {0,};
+	/* 10-20-25, 200-210-220, 300-305, 307-330 */
+	struct vm_area_struct vmas[] = {
+		(struct vm_area_struct) {.vm_start = 10, .vm_end = 20},
+		(struct vm_area_struct) {.vm_start = 20, .vm_end = 25},
+		(struct vm_area_struct) {.vm_start = 200, .vm_end = 210},
+		(struct vm_area_struct) {.vm_start = 210, .vm_end = 220},
+		(struct vm_area_struct) {.vm_start = 300, .vm_end = 305},
+		(struct vm_area_struct) {.vm_start = 307, .vm_end = 330},
+	};
+
+	__link_vmas(vmas, 6);
+
+	__damon_va_three_regions(&vmas[0], regions);
+
+	KUNIT_EXPECT_EQ(test, 10ul, regions[0].start);
+	KUNIT_EXPECT_EQ(test, 25ul, regions[0].end);
+	KUNIT_EXPECT_EQ(test, 200ul, regions[1].start);
+	KUNIT_EXPECT_EQ(test, 220ul, regions[1].end);
+	KUNIT_EXPECT_EQ(test, 300ul, regions[2].start);
+	KUNIT_EXPECT_EQ(test, 330ul, regions[2].end);
+}
+
+static struct damon_region *__nth_region_of(struct damon_target *t, int idx)
+{
+	struct damon_region *r;
+	unsigned int i = 0;
+
+	damon_for_each_region(r, t) {
+		if (i++ == idx)
+			return r;
+	}
+
+	return NULL;
+}
+
+/*
+ * Test 'damon_va_apply_three_regions()'
+ *
+ * test			kunit object
+ * regions		an array containing start/end addresses of current
+ *			monitoring target regions
+ * nr_regions		the number of the addresses in 'regions'
+ * three_regions	The three regions that need to be applied now
+ * expected		start/end addresses of monitoring target regions that
+ *			'three_regions' are applied
+ * nr_expected		the number of addresses in 'expected'
+ *
+ * The memory mapping of the target processes changes dynamically.  To follow
+ * the changes, DAMON periodically reads the mappings, simplifies them into
+ * the three regions, and updates the monitoring target regions to fit in the
+ * three regions.  Updating the current target regions is the role of
+ * 'damon_va_apply_three_regions()'.
+ *
+ * This test passes the given target regions and the new three regions that
+ * need to be applied to the function, and checks whether it updates the
+ * regions as expected.
+ */
+static void damon_do_test_apply_three_regions(struct kunit *test,
+				unsigned long *regions, int nr_regions,
+				struct damon_addr_range *three_regions,
+				unsigned long *expected, int nr_expected)
+{
+	struct damon_ctx *ctx = damon_new_ctx();
+	struct damon_target *t;
+	struct damon_region *r;
+	int i;
+
+	t = damon_new_target(42);
+	for (i = 0; i < nr_regions / 2; i++) {
+		r = damon_new_region(regions[i * 2], regions[i * 2 + 1]);
+		damon_add_region(r, t);
+	}
+	damon_add_target(ctx, t);
+
+	damon_va_apply_three_regions(t, three_regions);
+
+	for (i = 0; i < nr_expected / 2; i++) {
+		r = __nth_region_of(t, i);
+		KUNIT_EXPECT_EQ(test, r->ar.start, expected[i * 2]);
+		KUNIT_EXPECT_EQ(test, r->ar.end, expected[i * 2 + 1]);
+	}
+
+	damon_destroy_ctx(ctx);
+}
+
+/*
+ * This function tests the most common case, where the three big regions are
+ * only slightly changed.  Target regions should adjust their boundaries
+ * (10-20-30, 50-55, 70-80, 90-100) to fit the new big regions, or remove
+ * target regions (55-57, 57-59) that are now outside the three regions.
+ */
+static void damon_test_apply_three_regions1(struct kunit *test)
+{
+	/* 10-20-30, 50-55-57-59, 70-80-90-100 */
+	unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+				70, 80, 80, 90, 90, 100};
+	/* 5-27, 45-55, 73-104 */
+	struct damon_addr_range new_three_regions[3] = {
+		(struct damon_addr_range){.start = 5, .end = 27},
+		(struct damon_addr_range){.start = 45, .end = 55},
+		(struct damon_addr_range){.start = 73, .end = 104} };
+	/* 5-20-27, 45-55, 73-80-90-104 */
+	unsigned long expected[] = {5, 20, 20, 27, 45, 55,
+				73, 80, 80, 90, 90, 104};
+
+	damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+			new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test a slightly bigger change.  Similar to above, but the second big region
+ * now requires two target regions (50-55, 57-59) to be removed.
+ */
+static void damon_test_apply_three_regions2(struct kunit *test)
+{
+	/* 10-20-30, 50-55-57-59, 70-80-90-100 */
+	unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+				70, 80, 80, 90, 90, 100};
+	/* 5-27, 56-57, 65-104 */
+	struct damon_addr_range new_three_regions[3] = {
+		(struct damon_addr_range){.start = 5, .end = 27},
+		(struct damon_addr_range){.start = 56, .end = 57},
+		(struct damon_addr_range){.start = 65, .end = 104} };
+	/* 5-20-27, 56-57, 65-80-90-104 */
+	unsigned long expected[] = {5, 20, 20, 27, 56, 57,
+				65, 80, 80, 90, 90, 104};
+
+	damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+			new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test a big change.  The second big region has been totally freed and mapped
+ * to a different area (50-59 -> 61-63).  The target regions which were in the
+ * old second big region (50-55-57-59) should be removed, and a new target
+ * region covering the new second big region (61-63) should be created.
+ */
+static void damon_test_apply_three_regions3(struct kunit *test)
+{
+	/* 10-20-30, 50-55-57-59, 70-80-90-100 */
+	unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+				70, 80, 80, 90, 90, 100};
+	/* 5-27, 61-63, 65-104 */
+	struct damon_addr_range new_three_regions[3] = {
+		(struct damon_addr_range){.start = 5, .end = 27},
+		(struct damon_addr_range){.start = 61, .end = 63},
+		(struct damon_addr_range){.start = 65, .end = 104} };
+	/* 5-20-27, 61-63, 65-80-90-104 */
+	unsigned long expected[] = {5, 20, 20, 27, 61, 63,
+				65, 80, 80, 90, 90, 104};
+
+	damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+			new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+/*
+ * Test another big change.  Both the second and third big regions (50-59
+ * and 70-100) have been totally freed and mapped to different areas (30-32
+ * and 65-68).  The target regions which were in the old second and third big
+ * regions should now be removed, and new target regions covering the new
+ * second and third big regions should be created.
+ */
+static void damon_test_apply_three_regions4(struct kunit *test)
+{
+	/* 10-20-30, 50-55-57-59, 70-80-90-100 */
+	unsigned long regions[] = {10, 20, 20, 30, 50, 55, 55, 57, 57, 59,
+				70, 80, 80, 90, 90, 100};
+	/* 5-7, 30-32, 65-68 */
+	struct damon_addr_range new_three_regions[3] = {
+		(struct damon_addr_range){.start = 5, .end = 7},
+		(struct damon_addr_range){.start = 30, .end = 32},
+		(struct damon_addr_range){.start = 65, .end = 68} };
+	/* expect 5-7, 30-32, 65-68 */
+	unsigned long expected[] = {5, 7, 30, 32, 65, 68};
+
+	damon_do_test_apply_three_regions(test, regions, ARRAY_SIZE(regions),
+			new_three_regions, expected, ARRAY_SIZE(expected));
+}
+
+static void damon_test_split_evenly(struct kunit *test)
+{
+	struct damon_ctx *c = damon_new_ctx();
+	struct damon_target *t;
+	struct damon_region *r;
+	unsigned long i;
+
+	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(NULL, NULL, 5),
+			-EINVAL);
+
+	t = damon_new_target(42);
+	r = damon_new_region(0, 100);
+	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 0), -EINVAL);
+
+	damon_add_region(r, t);
+	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 10), 0);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 10u);
+
+	i = 0;
+	damon_for_each_region(r, t) {
+		KUNIT_EXPECT_EQ(test, r->ar.start, i++ * 10);
+		KUNIT_EXPECT_EQ(test, r->ar.end, i * 10);
+	}
+	damon_free_target(t);
+
+	t = damon_new_target(42);
+	r = damon_new_region(5, 59);
+	damon_add_region(r, t);
+	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 5), 0);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 5u);
+
+	i = 0;
+	damon_for_each_region(r, t) {
+		if (i == 4)
+			break;
+		KUNIT_EXPECT_EQ(test, r->ar.start, 5 + 10 * i++);
+		KUNIT_EXPECT_EQ(test, r->ar.end, 5 + 10 * i);
+	}
+	KUNIT_EXPECT_EQ(test, r->ar.start, 5 + 10 * i);
+	KUNIT_EXPECT_EQ(test, r->ar.end, 59ul);
+	damon_free_target(t);
+
+	t = damon_new_target(42);
+	r = damon_new_region(5, 6);
+	damon_add_region(r, t);
+	KUNIT_EXPECT_EQ(test, damon_va_evenly_split_region(t, r, 2), -EINVAL);
+	KUNIT_EXPECT_EQ(test, damon_nr_regions(t), 1u);
+
+	damon_for_each_region(r, t) {
+		KUNIT_EXPECT_EQ(test, r->ar.start, 5ul);
+		KUNIT_EXPECT_EQ(test, r->ar.end, 6ul);
+	}
+	damon_free_target(t);
+	damon_destroy_ctx(c);
+}
+
+static struct kunit_case damon_test_cases[] = {
+	KUNIT_CASE(damon_test_three_regions_in_vmas),
+	KUNIT_CASE(damon_test_apply_three_regions1),
+	KUNIT_CASE(damon_test_apply_three_regions2),
+	KUNIT_CASE(damon_test_apply_three_regions3),
+	KUNIT_CASE(damon_test_apply_three_regions4),
+	KUNIT_CASE(damon_test_split_evenly),
+	{},
+};
+
+static struct kunit_suite damon_test_suite = {
+	.name = "damon-primitives",
+	.test_cases = damon_test_cases,
+};
+kunit_test_suite(damon_test_suite);
+
+#endif /* _DAMON_VADDR_TEST_H */
+
+#endif	/* CONFIG_DAMON_VADDR_KUNIT_TEST */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 074/147] mm/damon: add user space selftests
  2021-09-08  2:52 incoming Andrew Morton
                   ` (72 preceding siblings ...)
  2021-09-08  2:57 ` [patch 073/147] mm/damon: add kunit tests Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 075/147] MAINTAINERS: update for DAMON Andrew Morton
                   ` (73 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: mm/damon: add user space selftests

This commit adds simple user space tests for DAMON.  The tests use the
kselftest framework.

Link: https://lkml.kernel.org/r/20210716081449.22187-13-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Markus Boehme <markubo@amazon.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/damon/Makefile           |    7 +
 tools/testing/selftests/damon/_chk_dependency.sh |   28 ++++
 tools/testing/selftests/damon/debugfs_attrs.sh   |   75 +++++++++++++
 3 files changed, 110 insertions(+)

--- /dev/null
+++ a/tools/testing/selftests/damon/_chk_dependency.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# Kselftest framework requirement - SKIP code is 4.
+ksft_skip=4
+
+DBGFS=/sys/kernel/debug/damon
+
+if [ $EUID -ne 0 ];
+then
+	echo "Run as root"
+	exit $ksft_skip
+fi
+
+if [ ! -d "$DBGFS" ]
+then
+	echo "$DBGFS not found"
+	exit $ksft_skip
+fi
+
+for f in attrs target_ids monitor_on
+do
+	if [ ! -f "$DBGFS/$f" ]
+	then
+		echo "$f not found"
+		exit 1
+	fi
+done
--- /dev/null
+++ a/tools/testing/selftests/damon/debugfs_attrs.sh
@@ -0,0 +1,75 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+test_write_result() {
+	file=$1
+	content=$2
+	orig_content=$3
+	expect_reason=$4
+	expected=$5
+
+	echo "$content" > "$file"
+	if [ $? -ne "$expected" ]
+	then
+		echo "writing $content to $file doesn't return $expected"
+		echo "expected because: $expect_reason"
+		echo "$orig_content" > "$file"
+		exit 1
+	fi
+}
+
+test_write_succ() {
+	test_write_result "$1" "$2" "$3" "$4" 0
+}
+
+test_write_fail() {
+	test_write_result "$1" "$2" "$3" "$4" 1
+}
+
+test_content() {
+	file=$1
+	orig_content=$2
+	expected=$3
+	expect_reason=$4
+
+	content=$(cat "$file")
+	if [ "$content" != "$expected" ]
+	then
+		echo "reading $file expected $expected but $content"
+		echo "expected because: $expect_reason"
+		echo "$orig_content" > "$file"
+		exit 1
+	fi
+}
+
+source ./_chk_dependency.sh
+
+# Test attrs file
+# ===============
+
+file="$DBGFS/attrs"
+orig_content=$(cat "$file")
+
+test_write_succ "$file" "1 2 3 4 5" "$orig_content" "valid input"
+test_write_fail "$file" "1 2 3 4" "$orig_content" "not enough fields"
+test_write_fail "$file" "1 2 3 5 4" "$orig_content" \
+	"min_nr_regions > max_nr_regions"
+test_content "$file" "$orig_content" "1 2 3 4 5" "successfully written"
+echo "$orig_content" > "$file"
+
+# Test target_ids file
+# ====================
+
+file="$DBGFS/target_ids"
+orig_content=$(cat "$file")
+
+test_write_succ "$file" "1 2 3 4" "$orig_content" "valid input"
+test_write_succ "$file" "1 2 abc 4" "$orig_content" "still valid input"
+test_content "$file" "$orig_content" "1 2" "non-integer was there"
+test_write_succ "$file" "abc 2 3" "$orig_content" "the file allows wrong input"
+test_content "$file" "$orig_content" "" "wrong input written"
+test_write_succ "$file" "" "$orig_content" "empty input"
+test_content "$file" "$orig_content" "" "empty input written"
+echo "$orig_content" > "$file"
+
+echo "PASS"
--- /dev/null
+++ a/tools/testing/selftests/damon/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+# Makefile for damon selftests
+
+TEST_FILES = _chk_dependency.sh
+TEST_PROGS = debugfs_attrs.sh
+
+include ../lib.mk
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 075/147] MAINTAINERS: update for DAMON
  2021-09-08  2:52 incoming Andrew Morton
                   ` (73 preceding siblings ...)
  2021-09-08  2:57 ` [patch 074/147] mm/damon: add user space selftests Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 076/147] alpha: agp: make empty macros use do-while-0 style Andrew Morton
                   ` (72 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, alexander.shishkin, amit, benh, brendanhiggins, corbet,
	david, dwmw, elver, fan.du, foersleo, greg, gthelen, joe,
	Jonathan.Cameron, linux-mm, markubo, mgorman, mheyne, minchan,
	mingo, mm-commits, namhyung, peterz, riel, rientjes, rostedt,
	shakeelb, shuah, sieberf, sjpark, torvalds, vbabka, vdavydov.dev

From: SeongJae Park <sjpark@amazon.de>
Subject: MAINTAINERS: update for DAMON

This commit updates the MAINTAINERS file for DAMON related files.

Link: https://lkml.kernel.org/r/20210716081449.22187-14-sj38.park@gmail.com
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Reviewed-by: Markus Boehme <markubo@amazon.de>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Amit Shah <amit@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: David Woodhouse <dwmw@amazon.com>
Cc: Fan Du <fan.du@intel.com>
Cc: Fernand Sieber <sieberf@amazon.com>
Cc: Greg Kroah-Hartman <greg@kroah.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joe Perches <joe@perches.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Leonard Foerster <foersleo@amazon.de>
Cc: Marco Elver <elver@google.com>
Cc: Maximilian Heyne <mheyne@amazon.de>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/MAINTAINERS~maintainers-update-for-damon
+++ a/MAINTAINERS
@@ -5125,6 +5125,17 @@ F:	net/ax25/ax25_out.c
 F:	net/ax25/ax25_timer.c
 F:	net/ax25/sysctl_net_ax25.c
 
+DATA ACCESS MONITOR
+M:	SeongJae Park <sjpark@amazon.de>
+L:	linux-mm@kvack.org
+S:	Maintained
+F:	Documentation/admin-guide/mm/damon/
+F:	Documentation/vm/damon/
+F:	include/linux/damon.h
+F:	include/trace/events/damon.h
+F:	mm/damon/
+F:	tools/testing/selftests/damon/
+
 DAVICOM FAST ETHERNET (DMFE) NETWORK DRIVER
 L:	netdev@vger.kernel.org
 S:	Orphan
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 076/147] alpha: agp: make empty macros use do-while-0 style
  2021-09-08  2:52 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2021-09-08  2:57 ` [patch 075/147] MAINTAINERS: update for DAMON Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 077/147] alpha: pci-sysfs: fix all kernel-doc warnings Andrew Morton
                   ` (71 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: airlied, akpm, ink, linux-mm, mattst88, mm-commits, rdunlap, rth,
	torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: alpha: agp: make empty macros use do-while-0 style

Copy these macros from ia64/include/asm/agp.h to avoid the "empty-body"
in 'if' statement warning.

drivers/char/agp/generic.c: In function 'agp_generic_destroy_page':
../drivers/char/agp/generic.c:1265:42: warning: suggest braces around empty body in an 'if' statement [-Wempty-body]
 1265 |                 unmap_page_from_agp(page);

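The do-while-0 form makes the macro expand to a real (empty) statement,
so call sites like the one warned about above compile cleanly.  A minimal
sketch of the pattern (illustrative only; 'dirty' is a made-up condition,
not taken from the patch):

	#define unmap_page_from_agp(page)	do { } while (0)

	if (dirty)
		unmap_page_from_agp(page);	/* a real statement now */
	else
		flush_agp_cache();

With a plain empty expansion, the 'if' body above would be an empty
statement, which is exactly what -Wempty-body complains about.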
Link: https://lkml.kernel.org/r/20210809030822.20658-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Cc: David Airlie <airlied@linux.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/include/asm/agp.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/alpha/include/asm/agp.h~alpha-agp-make-empty-macros-use-do-while-0-style
+++ a/arch/alpha/include/asm/agp.h
@@ -6,8 +6,8 @@
 
 /* dummy for now */
 
-#define map_page_into_agp(page) 
-#define unmap_page_from_agp(page) 
+#define map_page_into_agp(page)		do { } while (0)
+#define unmap_page_from_agp(page)	do { } while (0)
 #define flush_agp_cache() mb()
 
 /* GATT allocation. Returns/accepts GATT kernel virtual address. */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 077/147] alpha: pci-sysfs: fix all kernel-doc warnings
  2021-09-08  2:52 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2021-09-08  2:57 ` [patch 076/147] alpha: agp: make empty macros use do-while-0 style Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 078/147] percpu: remove export of pcpu_base_addr Andrew Morton
                   ` (70 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, ink, linux-mm, mattst88, mm-commits, rdunlap, rth, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: alpha: pci-sysfs: fix all kernel-doc warnings

Fix all kernel-doc warnings in arch/alpha/kernel/pci-sysfs.c:

../arch/alpha/kernel/pci-sysfs.c:67: warning: No description found for return value of 'pci_mmap_resource'
../arch/alpha/kernel/pci-sysfs.c:115: warning: Function parameter or member 'pdev' not described in 'pci_remove_resource_files'
../arch/alpha/kernel/pci-sysfs.c:115: warning: Excess function parameter 'dev' description in 'pci_remove_resource_files'
../arch/alpha/kernel/pci-sysfs.c:230: warning: Function parameter or member 'pdev' not described in 'pci_create_resource_files'
../arch/alpha/kernel/pci-sysfs.c:230: warning: Excess function parameter 'dev' description in 'pci_create_resource_files'
../arch/alpha/kernel/pci-sysfs.c:232: warning: No description found for return value of 'pci_create_resource_files'
../arch/alpha/kernel/pci-sysfs.c:305: warning: Function parameter or member 'bus' not described in 'pci_adjust_legacy_attr'
../arch/alpha/kernel/pci-sysfs.c:305: warning: Excess function parameter 'b' description in 'pci_adjust_legacy_attr'

Link: https://lkml.kernel.org/r/20210808185249.31442-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: Matt Turner <mattst88@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/alpha/kernel/pci-sysfs.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- a/arch/alpha/kernel/pci-sysfs.c~alpha-pci-sysfs-fix-all-kernel-doc-warnings
+++ a/arch/alpha/kernel/pci-sysfs.c
@@ -60,6 +60,8 @@ static int __pci_mmap_fits(struct pci_de
  * @sparse: address space type
  *
  * Use the bus mapping routines to map a PCI resource into userspace.
+ *
+ * Return: %0 on success, negative error code otherwise
  */
 static int pci_mmap_resource(struct kobject *kobj,
 			     struct bin_attribute *attr,
@@ -106,7 +108,7 @@ static int pci_mmap_resource_dense(struc
 
 /**
  * pci_remove_resource_files - cleanup resource files
- * @dev: dev to cleanup
+ * @pdev: pci_dev to cleanup
  *
  * If we created resource files for @dev, remove them from sysfs and
  * free their resources.
@@ -221,10 +223,12 @@ static int pci_create_attr(struct pci_de
 }
 
 /**
- * pci_create_resource_files - create resource files in sysfs for @dev
- * @dev: dev in question
+ * pci_create_resource_files - create resource files in sysfs for @pdev
+ * @pdev: pci_dev in question
  *
  * Walk the resources in @dev creating files for each resource available.
+ *
+ * Return: %0 on success, or negative error code
  */
 int pci_create_resource_files(struct pci_dev *pdev)
 {
@@ -296,7 +300,7 @@ int pci_mmap_legacy_page_range(struct pc
 
 /**
  * pci_adjust_legacy_attr - adjustment of legacy file attributes
- * @b: bus to create files under
+ * @bus: bus to create files under
  * @mmap_type: I/O port or memory
  *
  * Adjust file name and size for sparse mappings.
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 078/147] percpu: remove export of pcpu_base_addr
  2021-09-08  2:52 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2021-09-08  2:57 ` [patch 077/147] alpha: pci-sysfs: fix all kernel-doc warnings Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 079/147] fs/proc/kcore.c: add mmap interface Andrew Morton
                   ` (69 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, cl, dennis, gregkh, hch, linux-mm, mm-commits, tj, torvalds

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Subject: percpu: remove export of pcpu_base_addr

This is not needed by any modules, so remove the export.

Link: https://lkml.kernel.org/r/20210722185814.504541-1-gregkh@linuxfoundation.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/percpu.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/percpu.c~percpu-remove-export-of-pcpu_base_addr
+++ a/mm/percpu.c
@@ -146,7 +146,6 @@ static unsigned int pcpu_high_unit_cpu _
 
 /* the address of the first chunk which starts with the kernel static area */
 void *pcpu_base_addr __ro_after_init;
-EXPORT_SYMBOL_GPL(pcpu_base_addr);
 
 static const int *pcpu_unit_map __ro_after_init;		/* cpu -> unit */
 const unsigned long *pcpu_unit_offsets __ro_after_init;	/* cpu -> unit offset */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-08  2:52 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2021-09-08  2:57 ` [patch 078/147] percpu: remove export of pcpu_base_addr Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08 18:13     ` Linus Torvalds
  2021-09-10 10:08   ` David Hildenbrand
  2021-09-08  2:57 ` [patch 080/147] proc: stop using seq_get_buf in proc_task_name Andrew Morton
                   ` (68 subsequent siblings)
  147 siblings, 2 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: adobriyan, akpm, chenying.kernel, linux-mm, mm-commits, rppt,
	songmuchun, torvalds, zhouchengming, zhoufeng.zf

From: Feng Zhou <zhoufeng.zf@bytedance.com>
Subject: fs/proc/kcore.c: add mmap interface

When we monitor the kernel and use DRGN (https://github.com/osandov/drgn)
to access kernel data structures, we found that it makes a lot of system
calls.  DRGN works by reading /proc/kcore.  After looking at the kcore
code, we found that kcore does not implement mmap, resulting in frequent
context switches triggered by read().  Therefore, we add an mmap interface
to optimize performance.  Since the vmalloc and module areas change with
allocation and release, consistency cannot be guaranteed there, so the
mmap interface only maps the KCORE_TEXT and KCORE_RAM areas.

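To illustrate the intended use, here is a minimal user-space sketch of
mapping a range of /proc/kcore instead of issuing one pread64() per
access (the offset and length are hypothetical; a real tool such as DRGN
would derive them from the ELF program headers of /proc/kcore, and the
mapping must be read-only and non-executable):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("/proc/kcore", O_RDONLY);
		/* hypothetical page-aligned file offset and length */
		off_t offset = 0x2000;
		size_t len = sizeof(buf);
		void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

		if (p != MAP_FAILED) {
			memcpy(buf, p, len);	/* plain loads, no syscall per read */
			munmap(p, len);
		}
		close(fd);
		return 0;
	}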
The test results:
1. the default version of kcore
real 11.00
user 8.53
sys 3.59

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
99.64  128.578319          12  11168701           pread64
...
------ ----------- ----------- --------- --------- ----------------
100.00  129.042853              11193748       966 total

2. added kcore for the mmap interface
real 6.44
user 7.32
sys 0.24

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
32.94    0.130120          24      5317       315 futex
11.66    0.046077          21      2231         1 lstat
 9.23    0.036449         177       206           mmap
...
------ ----------- ----------- --------- --------- ----------------
100.00    0.395077                 25435       971 total

The test results show that the number of system calls and time
consumption are significantly reduced.

Link: https://lkml.kernel.org/r/20210704062208.7898-1-zhoufeng.zf@bytedance.com
Co-developed-by: Ying Chen <chenying.kernel@bytedance.com>
Signed-off-by: Ying Chen <chenying.kernel@bytedance.com>
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/kcore.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 73 insertions(+)

--- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
+++ a/fs/proc/kcore.c
@@ -614,11 +614,84 @@ static int release_kcore(struct inode *i
 	return 0;
 }
 
+static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+
+static const struct vm_operations_struct kcore_mmap_ops = {
+	.fault = mmap_kcore_fault,
+};
+
+static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
+{
+	size_t size = vma->vm_end - vma->vm_start;
+	u64 start, end, pfn;
+	int nphdr;
+	size_t data_offset;
+	size_t phdrs_len, notes_len;
+	struct kcore_list *m = NULL;
+	int ret = 0;
+
+	down_read(&kclist_lock);
+
+	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
+
+	data_offset &= PAGE_MASK;
+	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+	if (start < data_offset) {
+		ret = -EINVAL;
+		goto out;
+	}
+	start = kc_offset_to_vaddr(start - data_offset);
+	end   = start + size;
+
+	list_for_each_entry(m, &kclist_head, list) {
+		if (start >= m->addr && end <= m->addr + m->size)
+			break;
+	}
+
+	if (&m->list == &kclist_head) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;
+
+	if (kern_addr_valid(start)) {
+		if (m->type == KCORE_RAM)
+			pfn = __pa(start) >> PAGE_SHIFT;
+		else if (m->type == KCORE_TEXT)
+			pfn = __pa_symbol(start) >> PAGE_SHIFT;
+		else {
+			ret = -EFAULT;
+			goto out;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, pfn, size,
+				vma->vm_page_prot);
+	} else {
+		ret = -EFAULT;
+	}
+
+out:
+	up_read(&kclist_lock);
+	return ret;
+}
+
 static const struct proc_ops kcore_proc_ops = {
 	.proc_read	= read_kcore,
 	.proc_open	= open_kcore,
 	.proc_release	= release_kcore,
 	.proc_lseek	= default_llseek,
+	.proc_mmap	= mmap_kcore,
 };
 
 /* just remember that we have to update kcore */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 080/147] proc: stop using seq_get_buf in proc_task_name
  2021-09-08  2:52 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2021-09-08  2:57 ` [patch 079/147] fs/proc/kcore.c: add mmap interface Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 081/147] connector: send event on write to /proc/[pid]/comm Andrew Morton
                   ` (67 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: adobriyan, akpm, christian.brauner, hch, linux-mm, mm-commits, torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: proc: stop using seq_get_buf in proc_task_name

Use seq_escape_str and seq_printf instead of poking holes into the
seq_file abstraction.

Link: https://lkml.kernel.org/r/20210810151945.1795567-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/array.c |   18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

--- a/fs/proc/array.c~proc-stop-using-seq_get_buf-in-proc_task_name
+++ a/fs/proc/array.c
@@ -98,27 +98,17 @@
 
 void proc_task_name(struct seq_file *m, struct task_struct *p, bool escape)
 {
-	char *buf;
-	size_t size;
 	char tcomm[64];
-	int ret;
 
 	if (p->flags & PF_WQ_WORKER)
 		wq_worker_comm(tcomm, sizeof(tcomm), p);
 	else
 		__get_task_comm(tcomm, sizeof(tcomm), p);
 
-	size = seq_get_buf(m, &buf);
-	if (escape) {
-		ret = string_escape_str(tcomm, buf, size,
-					ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
-		if (ret >= size)
-			ret = -1;
-	} else {
-		ret = strscpy(buf, tcomm, size);
-	}
-
-	seq_commit(m, ret);
+	if (escape)
+		seq_escape_str(m, tcomm, ESCAPE_SPACE | ESCAPE_SPECIAL, "\n\\");
+	else
+		seq_printf(m, "%.64s", tcomm);
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 081/147] connector: send event on write to /proc/[pid]/comm
  2021-09-08  2:52 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2021-09-08  2:57 ` [patch 080/147] proc: stop using seq_get_buf in proc_task_name Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 082/147] arch: Kconfig: fix spelling mistake "seperate" -> "separate" Andrew Morton
                   ` (66 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: adobriyan, akpm, christian.brauner, davem, ebiederm, linux-mm,
	mingo, mm-commits, ohoono.kwon, torvalds

From: Ohhoon Kwon <ohoono.kwon@samsung.com>
Subject: connector: send event on write to /proc/[pid]/comm

While comm change events via prctl have been reported to the proc
connector since commit f786ecba4158 ("connector: add comm change event
report to proc connector"), connector listeners were missing comm changes
made by explicit writes to /proc/[pid]/comm.

Let explicit writes to /proc/[pid]/comm be reported to the proc connector
as well.

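For reference, such an explicit rename can be done from user space as in
the sketch below; with this patch applied, the write also generates a
PROC_EVENT_COMM notification for connector listeners, just as
prctl(PR_SET_NAME) already does (error handling elided):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/proc/self/comm", O_WRONLY);

		write(fd, "mythread", 8);	/* now reported to proc connector */
		close(fd);
		return 0;
	}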
Link: https://lkml.kernel.org/r/20210701133458epcms1p68e9eb9bd0eee8903ba26679a37d9d960@epcms1p6
Signed-off-by: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/base.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/fs/proc/base.c~connector-send-event-on-write-to-proc-comm
+++ a/fs/proc/base.c
@@ -95,6 +95,7 @@
 #include <linux/posix-timers.h>
 #include <linux/time_namespace.h>
 #include <linux/resctrl.h>
+#include <linux/cn_proc.h>
 #include <trace/events/oom.h>
 #include "internal.h"
 #include "fd.h"
@@ -1674,8 +1675,10 @@ static ssize_t comm_write(struct file *f
 	if (!p)
 		return -ESRCH;
 
-	if (same_thread_group(current, p))
+	if (same_thread_group(current, p)) {
 		set_task_comm(p, buffer);
+		proc_comm_connector(p);
+	}
 	else
 		count = -EINVAL;
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 082/147] arch: Kconfig: fix spelling mistake "seperate" -> "separate"
  2021-09-08  2:52 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2021-09-08  2:57 ` [patch 081/147] connector: send event on write to /proc/[pid]/comm Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 083/147] include/linux/once.h: fix trivia typo Not -> Note Andrew Morton
                   ` (65 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, colin.king, linux-mm, mm-commits, torvalds

From: Colin Ian King <colin.king@canonical.com>
Subject: arch: Kconfig: fix spelling mistake "seperate" -> "separate"

There is a spelling mistake in the Kconfig text.  Fix it.

Link: https://lkml.kernel.org/r/20210704095207.37342-1-colin.king@canonical.com
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/Kconfig~arch-kconfig-fix-spelling-mistake-seperate-separate
+++ a/arch/Kconfig
@@ -886,7 +886,7 @@ config HAVE_SOFTIRQ_ON_OWN_STACK
 	bool
 	help
 	  Architecture provides a function to run __do_softirq() on a
-	  seperate stack.
+	  separate stack.
 
 config PGTABLE_LEVELS
 	int
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 083/147] include/linux/once.h: fix trivia typo Not -> Note
  2021-09-08  2:52 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2021-09-08  2:57 ` [patch 082/147] arch: Kconfig: fix spelling mistake "seperate" -> "separate" Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 084/147] units: change from 'L' to 'UL' Andrew Morton
                   ` (64 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, linux-mm, mm-commits, torvalds

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: include/linux/once.h: fix trivia typo Not -> Note

Fix the trivial typo Not -> Note in the comment to DO_ONCE().

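For context, DO_ONCE() runs its function argument only on the first
invocation of a given call site.  A sketch mirroring the kernel's
net_get_random_once() usage (the 'key' buffer is made up for
illustration):

	static u8 key[16];

	/* Each textual DO_ONCE() site keeps its own 'done' state, so the
	 * two sites below each run get_random_bytes() once -- the
	 * non-equivalence the fixed comment goes on to describe. */
	DO_ONCE(get_random_bytes, key, sizeof(key));
	DO_ONCE(get_random_bytes, key, sizeof(key));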
Link: https://lkml.kernel.org/r/20210722184349.76290-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/once.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/once.h~once-fix-trivia-typo-not-note
+++ a/include/linux/once.h
@@ -16,7 +16,7 @@ void __do_once_done(bool *done, struct s
  * out the condition into a nop. DO_ONCE() guarantees type safety of
  * arguments!
  *
- * Not that the following is not equivalent ...
+ * Note that the following is not equivalent ...
  *
  *   DO_ONCE(func, arg);
  *   DO_ONCE(func, arg);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 084/147] units: change from 'L' to 'UL'
  2021-09-08  2:52 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2021-09-08  2:57 ` [patch 083/147] include/linux/once.h: fix trivia typo Not -> Note Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 085/147] units: add the HZ macros Andrew Morton
                   ` (63 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: units: change from 'L' to 'UL'

Patch series "Add Hz macros", v3.

There are multiple definitions of HZ_PER_MHZ and HZ_PER_KHZ in different
drivers.  Instead of duplicating these definitions again and again, add
them to the units.h header so they can be reused in all the places where
the redefinition occurs.

At the same time, change the type of the watt macros, as watts cannot be
negative.


This patch (of 10):

The macros can safely be made unsigned instead of signed, as the
variables they are assigned to are themselves unsigned.

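A sketch of the kind of caller this matches (the 'microwatts' value is
made up; div_u64() takes an unsigned divisor, so the UL constants fit
naturally):

	u64 microwatts = 5000000;
	u64 watts = div_u64(microwatts, MICROWATT_PER_WATT);	/* 5 */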
Link: https://lkml.kernel.org/r/20210816114732.1834145-1-daniel.lezcano@linaro.org
Link: https://lkml.kernel.org/r/20210816114732.1834145-2-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Christian Eggers <ceggers@arri.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/units.h |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/units.h~units-change-from-l-to-ul
+++ a/include/linux/units.h
@@ -4,9 +4,9 @@
 
 #include <linux/math.h>
 
-#define MILLIWATT_PER_WATT	1000L
-#define MICROWATT_PER_MILLIWATT	1000L
-#define MICROWATT_PER_WATT	1000000L
+#define MILLIWATT_PER_WATT	1000UL
+#define MICROWATT_PER_MILLIWATT	1000UL
+#define MICROWATT_PER_WATT	1000000UL
 
 #define ABSOLUTE_ZERO_MILLICELSIUS -273150
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 085/147] units: add the HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2021-09-08  2:57 ` [patch 084/147] units: change from 'L' to 'UL' Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 086/147] thermal/drivers/devfreq_cooling: use " Andrew Morton
                   ` (62 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: units: add the HZ macros

The macros for frequency unit conversion are duplicated in different
places.

Provide these macros in the 'units' header, so they can be reused.

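A driver that previously carried its own definition can then simply do
the following (a minimal sketch; 'freq_khz' is a made-up variable name):

	#include <linux/units.h>

	unsigned long freq_khz = 1350;
	unsigned long freq_hz = freq_khz * HZ_PER_KHZ;	/* 1350000 Hz */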
Link: https://lkml.kernel.org/r/20210816114732.1834145-3-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/units.h |    4 ++++
 1 file changed, 4 insertions(+)

--- a/include/linux/units.h~units-add-the-hz-macros
+++ a/include/linux/units.h
@@ -4,6 +4,10 @@
 
 #include <linux/math.h>
 
+#define HZ_PER_KHZ		1000UL
+#define KHZ_PER_MHZ		1000UL
+#define HZ_PER_MHZ		1000000UL
+
 #define MILLIWATT_PER_WATT	1000UL
 #define MICROWATT_PER_MILLIWATT	1000UL
 #define MICROWATT_PER_WATT	1000000UL
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 086/147] thermal/drivers/devfreq_cooling: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2021-09-08  2:57 ` [patch 085/147] units: add the HZ macros Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 087/147] devfreq: " Andrew Morton
                   ` (61 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: thermal/drivers/devfreq_cooling: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

The new macro uses an unsigned long type, which is already the type used
in the current code via the 'freq' variable.

Link: https://lkml.kernel.org/r/20210816114732.1834145-4-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Christian Eggers <ceggers@arri.de>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/thermal/devfreq_cooling.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/thermal/devfreq_cooling.c~thermal-drivers-devfreq_cooling-use-hz-macros
+++ a/drivers/thermal/devfreq_cooling.c
@@ -18,10 +18,10 @@
 #include <linux/pm_opp.h>
 #include <linux/pm_qos.h>
 #include <linux/thermal.h>
+#include <linux/units.h>
 
 #include <trace/events/thermal.h>
 
-#define HZ_PER_KHZ		1000
 #define SCALE_ERROR_MITIGATION	100
 
 /**
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 087/147] devfreq: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2021-09-08  2:57 ` [patch 086/147] thermal/drivers/devfreq_cooling: use " Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:57 ` [patch 088/147] iio/drivers/as73211: " Andrew Morton
                   ` (60 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: devfreq: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

The new macro has an unsigned long type.

All the code deals with unsigned long values, and the code using the
macro already performs an explicit cast to unsigned long.

Link: https://lkml.kernel.org/r/20210816114732.1834145-5-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/devfreq/devfreq.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/devfreq/devfreq.c~devfreq-use-hz-macros
+++ a/drivers/devfreq/devfreq.c
@@ -27,6 +27,7 @@
 #include <linux/hrtimer.h>
 #include <linux/of.h>
 #include <linux/pm_qos.h>
+#include <linux/units.h>
 #include "governor.h"
 
 #define CREATE_TRACE_POINTS
@@ -34,7 +35,6 @@
 
 #define IS_SUPPORTED_FLAG(f, name) ((f & DEVFREQ_GOV_FLAG_##name) ? true : false)
 #define IS_SUPPORTED_ATTR(f, name) ((f & DEVFREQ_GOV_ATTR_##name) ? true : false)
-#define HZ_PER_KHZ	1000
 
 static struct class *devfreq_class;
 static struct dentry *devfreq_debugfs;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 088/147] iio/drivers/as73211: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2021-09-08  2:57 ` [patch 087/147] devfreq: " Andrew Morton
@ 2021-09-08  2:57 ` Andrew Morton
  2021-09-08  2:58 ` [patch 089/147] hwmon/drivers/mr75203: " Andrew Morton
                   ` (59 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:57 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: iio/drivers/as73211: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

Link: https://lkml.kernel.org/r/20210816114732.1834145-6-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/iio/light/as73211.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/iio/light/as73211.c~iio-drivers-as73211-use-hz-macros
+++ a/drivers/iio/light/as73211.c
@@ -24,8 +24,7 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/pm.h>
-
-#define HZ_PER_KHZ 1000
+#include <linux/units.h>
 
 #define AS73211_DRV_NAME "as73211"
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 089/147] hwmon/drivers/mr75203: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2021-09-08  2:57 ` [patch 088/147] iio/drivers/as73211: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 090/147] iio/drivers/hid-sensor: " Andrew Morton
                   ` (58 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: hwmon/drivers/mr75203: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

The new macro is an unsigned long.  The code dealing with it already
treats the value as an unsigned long as well.

Link: https://lkml.kernel.org/r/20210816114732.1834145-7-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Christian Eggers <ceggers@arri.de>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/hwmon/mr75203.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/hwmon/mr75203.c~hwmon-drivers-mr75203-use-hz-macros
+++ a/drivers/hwmon/mr75203.c
@@ -17,6 +17,7 @@
 #include <linux/property.h>
 #include <linux/regmap.h>
 #include <linux/reset.h>
+#include <linux/units.h>
 
 /* PVT Common register */
 #define PVT_IP_CONFIG	0x04
@@ -37,7 +38,6 @@
 #define CLK_SYNTH_EN		BIT(24)
 #define CLK_SYS_CYCLES_MAX	514
 #define CLK_SYS_CYCLES_MIN	2
-#define HZ_PER_MHZ		1000000L
 
 #define SDIF_DISABLE	0x04
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 090/147] iio/drivers/hid-sensor: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2021-09-08  2:58 ` [patch 089/147] hwmon/drivers/mr75203: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 091/147] i2c/drivers/ov02q10: " Andrew Morton
                   ` (57 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: iio/drivers/hid-sensor: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

Link: https://lkml.kernel.org/r/20210816114732.1834145-8-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Christian Eggers <ceggers@arri.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/iio/common/hid-sensors/hid-sensor-attributes.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/drivers/iio/common/hid-sensors/hid-sensor-attributes.c~iio-drivers-hid-sensor-use-hz-macros
+++ a/drivers/iio/common/hid-sensors/hid-sensor-attributes.c
@@ -6,12 +6,11 @@
 #include <linux/module.h>
 #include <linux/kernel.h>
 #include <linux/time.h>
+#include <linux/units.h>
 
 #include <linux/hid-sensor-hub.h>
 #include <linux/iio/iio.h>
 
-#define HZ_PER_MHZ	1000000L
-
 static struct {
 	u32 usage_id;
 	int unit; /* 0 for default others from HID sensor spec */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 091/147] i2c/drivers/ov02q10: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2021-09-08  2:58 ` [patch 090/147] iio/drivers/hid-sensor: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 092/147] mtd/drivers/nand: " Andrew Morton
                   ` (56 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: i2c/drivers/ov02q10: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

Link: https://lkml.kernel.org/r/20210816114732.1834145-9-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Christian Eggers <ceggers@arri.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/media/i2c/ov02a10.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/media/i2c/ov02a10.c~i2c-drivers-ov02q10-use-hz-macros
+++ a/drivers/media/i2c/ov02a10.c
@@ -9,6 +9,7 @@
 #include <linux/module.h>
 #include <linux/pm_runtime.h>
 #include <linux/regulator/consumer.h>
+#include <linux/units.h>
 #include <media/media-entity.h>
 #include <media/v4l2-async.h>
 #include <media/v4l2-ctrls.h>
@@ -64,7 +65,6 @@
 /* Test pattern control */
 #define OV02A10_REG_TEST_PATTERN			0xb6
 
-#define HZ_PER_MHZ					1000000L
 #define OV02A10_LINK_FREQ_390MHZ			(390 * HZ_PER_MHZ)
 #define OV02A10_ECLK_FREQ				(24 * HZ_PER_MHZ)
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 092/147] mtd/drivers/nand: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2021-09-08  2:58 ` [patch 091/147] i2c/drivers/ov02q10: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  6:39   ` Miquel Raynal
  2021-09-08  2:58 ` [patch 093/147] phy/drivers/stm32: " Andrew Morton
                   ` (55 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: mtd/drivers/nand: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

Link: https://lkml.kernel.org/r/20210816114732.1834145-10-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Christian Eggers <ceggers@arri.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/mtd/nand/raw/intel-nand-controller.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/mtd/nand/raw/intel-nand-controller.c~mtd-drivers-nand-use-hz-macros
+++ a/drivers/mtd/nand/raw/intel-nand-controller.c
@@ -20,6 +20,7 @@
 #include <linux/sched.h>
 #include <linux/slab.h>
 #include <linux/types.h>
+#include <linux/units.h>
 #include <asm/unaligned.h>
 
 #define EBU_CLC			0x000
@@ -102,7 +103,6 @@
 
 #define MAX_CS	2
 
-#define HZ_PER_MHZ	1000000L
 #define USEC_PER_SEC	1000000L
 
 struct ebu_nand_cs {
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 093/147] phy/drivers/stm32: use HZ macros
  2021-09-08  2:52 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2021-09-08  2:58 ` [patch 092/147] mtd/drivers/nand: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 094/147] kernel/acct.c: use dedicated helper to access rlimit values Andrew Morton
                   ` (54 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano,
	jic23, Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, miquel.raynal, mm-commits,
	myungjoo.ham, pmeerw, rafael, rui.zhang, torvalds

From: Daniel Lezcano <daniel.lezcano@linaro.org>
Subject: phy/drivers/stm32: use HZ macros

HZ unit conversion macros are available in units.h, use them and remove
the duplicate definition.

Link: https://lkml.kernel.org/r/20210816114732.1834145-11-daniel.lezcano@linaro.org
Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chanwoo Choi <cw00.choi@samsung.com>
Cc: Christian Eggers <ceggers@arri.de>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kyungmin Park <kyungmin.park@samsung.com>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Lukasz Luba <lukasz.luba@arm.com>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: Miquel Raynal <miquel.raynal@bootlin.com>
Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
Cc: Peter Meerwald <pmeerw@pmeerw.net>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/phy/st/phy-stm32-usbphyc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/phy/st/phy-stm32-usbphyc.c~phy-drivers-stm32-use-hz-macros
+++ a/drivers/phy/st/phy-stm32-usbphyc.c
@@ -15,6 +15,7 @@
 #include <linux/of_platform.h>
 #include <linux/phy/phy.h>
 #include <linux/reset.h>
+#include <linux/units.h>
 
 #define STM32_USBPHYC_PLL	0x0
 #define STM32_USBPHYC_MISC	0x8
@@ -47,7 +48,6 @@
 #define PLL_FVCO_MHZ		2880
 #define PLL_INFF_MIN_RATE_HZ	19200000
 #define PLL_INFF_MAX_RATE_HZ	38400000
-#define HZ_PER_MHZ		1000000L
 
 struct pll_params {
 	u8 ndiv;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 094/147] kernel/acct.c: use dedicated helper to access rlimit values
  2021-09-08  2:52 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2021-09-08  2:58 ` [patch 093/147] phy/drivers/stm32: " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 095/147] profiling: fix shift-out-of-bounds bugs Andrew Morton
                   ` (53 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, sh_def, torvalds,
	yang.yang29, zealci

From: Yang Yang <yang.yang29@zte.com.cn>
Subject: kernel/acct.c: use dedicated helper to access rlimit values

Use the rlimit() helper instead of manually writing out the whole chain
from the task to the rlimit value.  See patch "posix-cpu-timers: Use
dedicated helper to access rlimit values".
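
For reference, the rlimit() helper from <linux/sched/signal.h> is roughly
equivalent to the open-coded chain it replaces (a sketch, not a verbatim
copy of the kernel source):

	static inline unsigned long rlimit(unsigned int limit)
	{
		/* reads current->signal->rlim[limit].rlim_cur */
		return task_rlimit(current, limit);
	}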

Link: https://lkml.kernel.org/r/20210728030822.524789-1-yang.yang29@zte.com.cn
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: sh_def@163.com <sh_def@163.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/acct.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/acct.c~acct-use-dedicated-helper-to-access-rlimit-values
+++ a/kernel/acct.c
@@ -478,7 +478,7 @@ static void do_acct_process(struct bsd_a
 	/*
 	 * Accounting records are not subject to resource limits.
 	 */
-	flim = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
+	flim = rlimit(RLIMIT_FSIZE);
 	current->signal->rlim[RLIMIT_FSIZE].rlim_cur = RLIM_INFINITY;
 	/* Perform file operations on behalf of whoever enabled accounting */
 	orig_cred = override_creds(file->f_cred);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 095/147] profiling: fix shift-out-of-bounds bugs
  2021-09-08  2:52 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2021-09-08  2:58 ` [patch 094/147] kernel/acct.c: use dedicated helper to access rlimit values Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 096/147] MAINTAINERS: update ClangBuiltLinux mailing list Andrew Morton
                   ` (52 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, paskripkin, penguin-kernel, rostedt,
	tglx, torvalds

From: Pavel Skripkin <paskripkin@gmail.com>
Subject: profiling: fix shift-out-of-bounds bugs

Syzbot reported a shift-out-of-bounds bug in profile_init().  The problem
was an incorrect prof_shift.  Since the prof_shift value comes from
userspace, we need to clamp it to the [0, BITS_PER_LONG - 1] range.

A second possible shift-out-of-bounds was found by Tetsuo: the
sample_step local variable in read_profile() had "unsigned int" type,
but prof_shift allows a shift of up to BITS_PER_LONG bits.  So, to
prevent a possible shift-out-of-bounds, the sample_step type was changed
to "unsigned long".

Also, "unsigned short int" is sufficient for storing a [0, BITS_PER_LONG]
value, which is why there is no need for an "unsigned long" prof_shift.

Link: https://lkml.kernel.org/r/20210813140022.5011-1-paskripkin@gmail.com
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+e68c89a9510c159d9684@syzkaller.appspotmail.com
Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Pavel Skripkin <paskripkin@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/profile.c |   21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

--- a/kernel/profile.c~profiling-fix-shift-out-of-bounds-bugs
+++ a/kernel/profile.c
@@ -41,7 +41,8 @@ struct profile_hit {
 #define NR_PROFILE_GRP		(NR_PROFILE_HIT/PROFILE_GRPSZ)
 
 static atomic_t *prof_buffer;
-static unsigned long prof_len, prof_shift;
+static unsigned long prof_len;
+static unsigned short int prof_shift;
 
 int prof_on __read_mostly;
 EXPORT_SYMBOL_GPL(prof_on);
@@ -67,8 +68,8 @@ int profile_setup(char *str)
 		if (str[strlen(sleepstr)] == ',')
 			str += strlen(sleepstr) + 1;
 		if (get_option(&str, &par))
-			prof_shift = par;
-		pr_info("kernel sleep profiling enabled (shift: %ld)\n",
+			prof_shift = clamp(par, 0, BITS_PER_LONG - 1);
+		pr_info("kernel sleep profiling enabled (shift: %u)\n",
 			prof_shift);
 #else
 		pr_warn("kernel sleep profiling requires CONFIG_SCHEDSTATS\n");
@@ -78,21 +79,21 @@ int profile_setup(char *str)
 		if (str[strlen(schedstr)] == ',')
 			str += strlen(schedstr) + 1;
 		if (get_option(&str, &par))
-			prof_shift = par;
-		pr_info("kernel schedule profiling enabled (shift: %ld)\n",
+			prof_shift = clamp(par, 0, BITS_PER_LONG - 1);
+		pr_info("kernel schedule profiling enabled (shift: %u)\n",
 			prof_shift);
 	} else if (!strncmp(str, kvmstr, strlen(kvmstr))) {
 		prof_on = KVM_PROFILING;
 		if (str[strlen(kvmstr)] == ',')
 			str += strlen(kvmstr) + 1;
 		if (get_option(&str, &par))
-			prof_shift = par;
-		pr_info("kernel KVM profiling enabled (shift: %ld)\n",
+			prof_shift = clamp(par, 0, BITS_PER_LONG - 1);
+		pr_info("kernel KVM profiling enabled (shift: %u)\n",
 			prof_shift);
 	} else if (get_option(&str, &par)) {
-		prof_shift = par;
+		prof_shift = clamp(par, 0, BITS_PER_LONG - 1);
 		prof_on = CPU_PROFILING;
-		pr_info("kernel profiling enabled (shift: %ld)\n",
+		pr_info("kernel profiling enabled (shift: %u)\n",
 			prof_shift);
 	}
 	return 1;
@@ -468,7 +469,7 @@ read_profile(struct file *file, char __u
 	unsigned long p = *ppos;
 	ssize_t read;
 	char *pnt;
-	unsigned int sample_step = 1 << prof_shift;
+	unsigned long sample_step = 1UL << prof_shift;
 
 	profile_flip_buffers();
 	if (p >= (prof_len+1)*sizeof(unsigned int))
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 096/147] MAINTAINERS: update ClangBuiltLinux mailing list
  2021-09-08  2:52 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2021-09-08  2:58 ` [patch 095/147] profiling: fix shift-out-of-bounds bugs Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 097/147] Documentation/llvm: update " Andrew Morton
                   ` (51 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, keescook, linux-mm, masahiroy, mm-commits, nathan,
	ndesaulniers, samitolvanen, torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: MAINTAINERS: update ClangBuiltLinux mailing list

We are now at llvm@lists.linux.dev.

Link: https://lkml.kernel.org/r/20210825211823.6406-1-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/MAINTAINERS~maintainers-update-clangbuiltlinux-mailing-list
+++ a/MAINTAINERS
@@ -4504,7 +4504,7 @@ F:	.clang-format
 CLANG/LLVM BUILD SUPPORT
 M:	Nathan Chancellor <nathan@kernel.org>
 M:	Nick Desaulniers <ndesaulniers@google.com>
-L:	clang-built-linux@googlegroups.com
+L:	llvm@lists.linux.dev
 S:	Supported
 W:	https://clangbuiltlinux.github.io/
 B:	https://github.com/ClangBuiltLinux/linux/issues
@@ -4519,7 +4519,7 @@ M:	Sami Tolvanen <samitolvanen@google.co
 M:	Kees Cook <keescook@chromium.org>
 R:	Nathan Chancellor <nathan@kernel.org>
 R:	Nick Desaulniers <ndesaulniers@google.com>
-L:	clang-built-linux@googlegroups.com
+L:	llvm@lists.linux.dev
 S:	Supported
 B:	https://github.com/ClangBuiltLinux/linux/issues
 T:	git git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git for-next/clang/features
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 097/147] Documentation/llvm: update mailing list
  2021-09-08  2:52 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2021-09-08  2:58 ` [patch 096/147] MAINTAINERS: update ClangBuiltLinux mailing list Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 098/147] Documentation/llvm: update IRC location Andrew Morton
                   ` (50 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, keescook, linux-mm, masahiroy, mm-commits, nathan,
	ndesaulniers, samitolvanen, torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: Documentation/llvm: update mailing list

We are now at llvm@lists.linux.dev.

Link: https://lkml.kernel.org/r/20210825211823.6406-2-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/kbuild/llvm.rst |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/Documentation/kbuild/llvm.rst~documentation-llvm-update-mailing-list
+++ a/Documentation/kbuild/llvm.rst
@@ -111,7 +111,8 @@ Getting Help
 ------------
 
 - `Website <https://clangbuiltlinux.github.io/>`_
-- `Mailing List <https://groups.google.com/forum/#!forum/clang-built-linux>`_: <clang-built-linux@googlegroups.com>
+- `Mailing List <https://lore.kernel.org/llvm/>`_: <llvm@lists.linux.dev>
+- `Old Mailing List Archives <https://groups.google.com/g/clang-built-linux>`_
 - `Issue Tracker <https://github.com/ClangBuiltLinux/linux/issues>`_
 - IRC: #clangbuiltlinux on chat.freenode.net
 - `Telegram <https://t.me/ClangBuiltLinux>`_: @ClangBuiltLinux
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 098/147] Documentation/llvm: update IRC location
  2021-09-08  2:52 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2021-09-08  2:58 ` [patch 097/147] Documentation/llvm: update " Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 099/147] math: make RATIONAL tristate Andrew Morton
                   ` (49 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, keescook, linux-mm, masahiroy, mm-commits, nathan,
	ndesaulniers, samitolvanen, torvalds

From: Nathan Chancellor <nathan@kernel.org>
Subject: Documentation/llvm: update IRC location

This should have been done with commit 91ed3ed0f798 ("MAINTAINERS: update
ClangBuiltLinux IRC chat") but I did not realize it was in two separate
spots.

Link: https://lkml.kernel.org/r/20210825211823.6406-3-nathan@kernel.org
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Sami Tolvanen <samitolvanen@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/kbuild/llvm.rst |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/Documentation/kbuild/llvm.rst~documentation-llvm-update-irc-location
+++ a/Documentation/kbuild/llvm.rst
@@ -114,7 +114,7 @@ Getting Help
 - `Mailing List <https://lore.kernel.org/llvm/>`_: <llvm@lists.linux.dev>
 - `Old Mailing List Archives <https://groups.google.com/g/clang-built-linux>`_
 - `Issue Tracker <https://github.com/ClangBuiltLinux/linux/issues>`_
-- IRC: #clangbuiltlinux on chat.freenode.net
+- IRC: #clangbuiltlinux on irc.libera.chat
 - `Telegram <https://t.me/ClangBuiltLinux>`_: @ClangBuiltLinux
 - `Wiki <https://github.com/ClangBuiltLinux/linux/wiki>`_
 - `Beginner Bugs <https://github.com/ClangBuiltLinux/linux/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22>`_
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 099/147] math: make RATIONAL tristate
  2021-09-08  2:52 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2021-09-08  2:58 ` [patch 098/147] Documentation/llvm: update IRC location Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 100/147] math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it Andrew Morton
                   ` (48 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, brendanhiggins, colin.king, geert,
	linux-mm, mm-commits, torvalds, tpiepho

From: Geert Uytterhoeven <geert@linux-m68k.org>
Subject: math: make RATIONAL tristate

Patch series "math: RATIONAL and RATIONAL_KUNIT_TEST improvements".

This series makes the RATIONAL symbol tristate, so it is not forced
builtin if all users are modular, and makes the RATIONAL_KUNIT_TEST depend
on RATIONAL, to avoid enabling RATIONAL if there are no real users.


This patch (of 2):

All but one of the symbols that select RATIONAL are tristate, but
RATIONAL itself is bool.  Change it to tristate, so the rational
fractions support code can be modular if no builtin code relies on it.
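
A hypothetical sketch of the effect: with RATIONAL tristate, a modular
user no longer forces the code builtin.

	config FOO_DRIVER
		tristate "Hypothetical driver needing rational.c"
		select RATIONAL	# built as a module when FOO_DRIVER=m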

Link: https://lkml.kernel.org/r/20210706100945.3803694-1-geert@linux-m68k.org
Link: https://lkml.kernel.org/r/20210706100945.3803694-2-geert@linux-m68k.org
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Trent Piepho <tpiepho@gmail.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/math/Kconfig    |    2 +-
 lib/math/rational.c |    3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

--- a/lib/math/Kconfig~math-make-rational-tristate
+++ a/lib/math/Kconfig
@@ -14,4 +14,4 @@ config PRIME_NUMBERS
 	  If unsure, say N.
 
 config RATIONAL
-	bool
+	tristate
--- a/lib/math/rational.c~math-make-rational-tristate
+++ a/lib/math/rational.c
@@ -13,6 +13,7 @@
 #include <linux/export.h>
 #include <linux/minmax.h>
 #include <linux/limits.h>
+#include <linux/module.h>
 
 /*
  * calculate best rational approximation for a given fraction
@@ -106,3 +107,5 @@ void rational_best_approximation(
 }
 
 EXPORT_SYMBOL(rational_best_approximation);
+
+MODULE_LICENSE("GPL v2");
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 100/147] math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it
  2021-09-08  2:52 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2021-09-08  2:58 ` [patch 099/147] math: make RATIONAL tristate Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 101/147] lib/string: optimized memcpy Andrew Morton
                   ` (47 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, brendanhiggins, colin.king, geert,
	linux-mm, mm-commits, torvalds, tpiepho

From: Geert Uytterhoeven <geert@linux-m68k.org>
Subject: math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it

RATIONAL_KUNIT_TEST selects RATIONAL, thus enabling an optional feature
the user may not want to have enabled.  Fix this by making the test depend
on RATIONAL instead.

Link: https://lkml.kernel.org/r/20210706100945.3803694-3-geert@linux-m68k.org
Fixes: b6c75c4afceb8bc0 ("lib/math/rational: add Kunit test cases")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Trent Piepho <tpiepho@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/lib/Kconfig.debug~math-rational_kunit_test-should-depend-on-rational-instead-of-selecting-it
+++ a/lib/Kconfig.debug
@@ -2460,8 +2460,7 @@ config SLUB_KUNIT_TEST
 
 config RATIONAL_KUNIT_TEST
 	tristate "KUnit test for rational.c" if !KUNIT_ALL_TESTS
-	depends on KUNIT
-	select RATIONAL
+	depends on KUNIT && RATIONAL
 	default KUNIT_ALL_TESTS
 	help
 	  This builds the rational math unit test.
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 101/147] lib/string: optimized memcpy
  2021-09-08  2:52 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2021-09-08  2:58 ` [patch 100/147] math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08 18:26     ` Linus Torvalds
  2021-09-08  2:58 ` [patch 102/147] lib/string: optimized memmove Andrew Morton
                   ` (46 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, David.Laight, drew, guoren, hch, kernel, linux-mm, mcroce,
	mick, mm-commits, ndesaulniers, palmer, torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: lib/string: optimized memcpy

Patch series "lib/string: optimized mem* functions", v2.

Rewrite the generic mem{cpy,move,set} so that memory is accessed with the
widest size possible, but without doing unaligned accesses.

This was originally posted as C string functions for RISC-V[1], but as
there was no specific RISC-V code, it was proposed for the generic
lib/string.c implementation.

Tested on RISC-V and on x86_64 by undefining __HAVE_ARCH_MEM{CPY,SET,MOVE}
and HAVE_EFFICIENT_UNALIGNED_ACCESS.

This is the performance of memcpy() and memset() on a RISC-V machine
with a 32 MB buffer:

memcpy:
original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

memset:
original aligned:	140 Mb/s
original unaligned:	140 Mb/s
new aligned:		241 Mb/s
new unaligned:		241 Mb/s

The size increase is negligible:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 0/0 grow/shrink: 4/1 up/down: 427/-6 (421)
Function                                     old     new   delta
memcpy                                        29     351    +322
memset                                        29     117     +88
strlcat                                       68      78     +10
strlcpy                                       50      57      +7
memmove                                       56      50      -6
Total: Before=8556964, After=8557385, chg +0.00%

These functions will be used for RISC-V initially.

[1] https://lore.kernel.org/linux-riscv/20210617152754.17960-1-mcroce@linux.microsoft.com/

The only architecture which will use all three functions will be riscv,
while memmove() will be used by arc, h8300, hexagon, ia64, openrisc and
parisc.

Keep in mind that memmove() isn't anything special: it just calls
memcpy() when possible (e.g. when the buffers don't overlap), and falls
back to the byte-by-byte copy otherwise.

In the future we can write two functions, one which copies forward and
another which copies backward, and call the right one depending on the
buffers' positions.  Then we could alias memcpy() and memmove(), as
proposed by Linus: https://bugzilla.redhat.com/show_bug.cgi?id=638477#c132



This patch (of 3):

Rewrite the generic memcpy() to copy a word at a time, without
generating unaligned accesses.

The procedure is made of three steps: first, copy data one byte at a
time until the destination buffer is aligned to a long boundary.  Then
copy the data one long at a time, shifting the current and the next long
to compose a long at every cycle.  Finally, copy the remainder one byte
at a time.
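
To illustrate the word-merging step (a worked sketch with illustrative
names, little-endian 64-bit, source misaligned by distance = 3):

	/* 'last' holds source bytes 0..7, 'next' holds bytes 8..15;
	 * the destination word needs bytes 3..10 */
	unsigned long out = (last >> (3 * 8)) | (next << ((8 - 3) * 8));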

This is the improvement on RISC-V:

original aligned:	 75 Mb/s
original unaligned:	 75 Mb/s
new aligned:		114 Mb/s
new unaligned:		107 Mb/s

and this is the binary size increase according to bloat-o-meter:

Function     old     new   delta
memcpy        36     324    +288

Link: https://lkml.kernel.org/r/20210702123153.14093-1-mcroce@linux.microsoft.com
Link: https://lkml.kernel.org/r/20210702123153.14093-2-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Nick Kossifidis <mick@ics.forth.gr>
Cc: Guo Ren <guoren@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Laight <David.Laight@aculab.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Drew Fustini <drew@beagleboard.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string.c |   80 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 77 insertions(+), 3 deletions(-)

--- a/lib/string.c~lib-string-optimized-memcpy
+++ a/lib/string.c
@@ -33,6 +33,23 @@
 #include <asm/word-at-a-time.h>
 #include <asm/page.h>
 
+#define BYTES_LONG	sizeof(long)
+#define WORD_MASK	(BYTES_LONG - 1)
+#define MIN_THRESHOLD	(BYTES_LONG * 2)
+
+/* convenience union to avoid cast between different pointer types */
+union types {
+	u8 *as_u8;
+	unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
+union const_types {
+	const u8 *as_u8;
+	const unsigned long *as_ulong;
+	uintptr_t as_uptr;
+};
+
 #ifndef __HAVE_ARCH_STRNCASECMP
 /**
  * strncasecmp - Case insensitive, length-limited string comparison
@@ -869,6 +886,13 @@ EXPORT_SYMBOL(memset64);
 #endif
 
 #ifndef __HAVE_ARCH_MEMCPY
+
+#ifdef __BIG_ENDIAN
+#define MERGE_UL(h, l, d) ((h) << ((d) * 8) | (l) >> ((BYTES_LONG - (d)) * 8))
+#else
+#define MERGE_UL(h, l, d) ((h) >> ((d) * 8) | (l) << ((BYTES_LONG - (d)) * 8))
+#endif
+
 /**
  * memcpy - Copy one area of memory to another
  * @dest: Where to copy to
@@ -880,14 +904,64 @@ EXPORT_SYMBOL(memset64);
  */
 void *memcpy(void *dest, const void *src, size_t count)
 {
-	char *tmp = dest;
-	const char *s = src;
+	union const_types s = { .as_u8 = src };
+	union types d = { .as_u8 = dest };
+	int distance = 0;
+
+	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+		if (count < MIN_THRESHOLD)
+			goto copy_remainder;
+
+		/* Copy a byte at time until destination is aligned. */
+		for (; d.as_uptr & WORD_MASK; count--)
+			*d.as_u8++ = *s.as_u8++;
+
+		distance = s.as_uptr & WORD_MASK;
+	}
+
+	if (distance) {
+		unsigned long last, next;
 
+		/*
+		 * s is distance bytes ahead of d, and d just reached
+		 * the alignment boundary. Move s backward to word align it
+		 * and shift data to compensate for distance, in order to do
+		 * word-by-word copy.
+		 */
+		s.as_u8 -= distance;
+
+		next = s.as_ulong[0];
+		for (; count >= BYTES_LONG; count -= BYTES_LONG) {
+			last = next;
+			next = s.as_ulong[1];
+
+			d.as_ulong[0] = MERGE_UL(last, next, distance);
+
+			d.as_ulong++;
+			s.as_ulong++;
+		}
+
+		/* Restore s with the original offset. */
+		s.as_u8 += distance;
+	} else {
+		/*
+		 * If the source and dest lower bits are the same, do a simple
+		 * 32/64 bit wide copy.
+		 */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*d.as_ulong++ = *s.as_ulong++;
+	}
+
+copy_remainder:
 	while (count--)
-		*tmp++ = *s++;
+		*d.as_u8++ = *s.as_u8++;
+
 	return dest;
 }
 EXPORT_SYMBOL(memcpy);
+
+#undef MERGE_UL
+
 #endif
 
 #ifndef __HAVE_ARCH_MEMMOVE
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 102/147] lib/string: optimized memmove
  2021-09-08  2:52 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2021-09-08  2:58 ` [patch 101/147] lib/string: optimized memcpy Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08 18:29     ` Linus Torvalds
  2021-09-08  2:58 ` [patch 103/147] lib/string: optimized memset Andrew Morton
                   ` (45 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, David.Laight, drew, guoren, hch, kernel, linux-mm, mcroce,
	mick, mm-commits, ndesaulniers, palmer, torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: lib/string: optimized memmove

When the destination buffer is before the source one, or when the
buffers don't overlap, it's safe to use memcpy() instead, which is
optimized to use the biggest data size possible.

This "optimization" only covers a common case.  In the future, proper
code which does the same thing as memcpy() but backwards can be written.
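
A minimal sketch of the two cases (hypothetical buffer):

	char buf[8] = "abcdefg";

	/* dest below src: a forward copy reads each byte before
	 * overwriting it, so delegating to memcpy() is safe */
	memmove(buf, buf + 2, 6);

	/* dest above src and overlapping: a forward copy would clobber
	 * source bytes before reading them, so the backward loop is kept */
	memmove(buf + 2, buf, 6);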

Link: https://lkml.kernel.org/r/20210702123153.14093-3-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Laight <David.Laight@aculab.com>
Cc: Drew Fustini <drew@beagleboard.org>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Guo Ren <guoren@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nick Kossifidis <mick@ics.forth.gr>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string.c |   18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

--- a/lib/string.c~lib-string-optimized-memmove
+++ a/lib/string.c
@@ -975,19 +975,13 @@ EXPORT_SYMBOL(memcpy);
  */
 void *memmove(void *dest, const void *src, size_t count)
 {
-	char *tmp;
-	const char *s;
+	if (dest < src || src + count <= dest)
+		return memcpy(dest, src, count);
+
+	if (dest > src) {
+		const char *s = src + count;
+		char *tmp = dest + count;
 
-	if (dest <= src) {
-		tmp = dest;
-		s = src;
-		while (count--)
-			*tmp++ = *s++;
-	} else {
-		tmp = dest;
-		tmp += count;
-		s = src;
-		s += count;
 		while (count--)
 			*--tmp = *--s;
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 103/147] lib/string: optimized memset
  2021-09-08  2:52 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2021-09-08  2:58 ` [patch 102/147] lib/string: optimized memmove Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08 18:34     ` Linus Torvalds
  2021-09-08  2:58 ` [patch 104/147] lib/test: convert test_sort.c to use KUnit Andrew Morton
                   ` (44 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, David.Laight, drew, guoren, hch, kernel, linux-mm, mcroce,
	mick, mm-commits, ndesaulniers, palmer, torvalds

From: Matteo Croce <mcroce@microsoft.com>
Subject: lib/string: optimized memset

The generic memset is defined as a byte-at-a-time write.  This is always
safe, but it's slower than a 4-byte or even an 8-byte write.

Write a generic memset which fills the data one byte at a time until the
destination is aligned, then fills using the largest size allowed, and
finally fills the remaining data one byte at a time.
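
The word-filling step relies on replicating the fill byte across a long.
A worked sketch for c = 0xab on a 64-bit machine, mirroring the
non-CONFIG_ARCH_HAS_FAST_MULTIPLIER path below:

	unsigned long cu = 0xab;
	cu |= cu << 8;			/* 0x000000000000abab */
	cu |= cu << 16;			/* 0x00000000abababab */
	cu |= (cu << 16) << 16;		/* 0xabababababababab */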

On a RISC-V machine the speed goes from 140 Mb/s to 241 Mb/s, and this
is the binary size increase according to bloat-o-meter:

Function     old     new   delta
memset        32     148    +116

Link: https://lkml.kernel.org/r/20210702123153.14093-4-mcroce@linux.microsoft.com
Signed-off-by: Matteo Croce <mcroce@microsoft.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Laight <David.Laight@aculab.com>
Cc: Drew Fustini <drew@beagleboard.org>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Guo Ren <guoren@kernel.org>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nick Kossifidis <mick@ics.forth.gr>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/string.c |   32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

--- a/lib/string.c~lib-string-optimized-memset
+++ a/lib/string.c
@@ -810,10 +810,38 @@ EXPORT_SYMBOL(__sysfs_match_string);
  */
 void *memset(void *s, int c, size_t count)
 {
-	char *xs = s;
+	union types dest = { .as_u8 = s };
 
+	if (count >= MIN_THRESHOLD) {
+		unsigned long cu = (unsigned long)c;
+
+		/* Compose an ulong with 'c' repeated 4/8 times */
+#ifdef CONFIG_ARCH_HAS_FAST_MULTIPLIER
+		cu *= 0x0101010101010101UL;
+#else
+		cu |= cu << 8;
+		cu |= cu << 16;
+		/* Suppress warning on 32 bit machines */
+		cu |= (cu << 16) << 16;
+#endif
+		if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+			/*
+			 * Fill the buffer one byte at time until
+			 * the destination is word aligned.
+			 */
+			for (; count && dest.as_uptr & WORD_MASK; count--)
+				*dest.as_u8++ = c;
+		}
+
+		/* Copy using the largest size allowed */
+		for (; count >= BYTES_LONG; count -= BYTES_LONG)
+			*dest.as_ulong++ = cu;
+	}
+
+	/* copy the remainder */
 	while (count--)
-		*xs++ = c;
+		*dest.as_u8++ = c;
+
 	return s;
 }
 EXPORT_SYMBOL(memset);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 104/147] lib/test: convert test_sort.c to use KUnit
  2021-09-08  2:52 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2021-09-08  2:58 ` [patch 103/147] lib/string: optimized memset Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 105/147] lib/dump_stack: correct kernel-doc notation Andrew Morton
                   ` (43 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, brendanhiggins, davidgow, dlatypov, linux-mm, mm-commits,
	pravin.shedge4linux, torvalds

From: Daniel Latypov <dlatypov@google.com>
Subject: lib/test: convert test_sort.c to use KUnit

This follows up commit ebd09577be6c ("lib/test: convert
lib/test_list_sort.c to use KUnit").

Converting this test to KUnit makes the test a bit shorter, standardizes
how it reports pass/fail, and adds an easier way to run the test [1].

Like ebd09577be6c, this leaves the file and Kconfig option name the same,
but slightly changes their dependencies (needs CONFIG_KUNIT).

[1] Can be run via
$ ./tools/testing/kunit/kunit.py run --kunitconfig /dev/stdin <<EOF
CONFIG_KUNIT=y
CONFIG_TEST_SORT=y
EOF

[11:30:27] Starting KUnit Kernel ...
[11:30:30] ============================================================
[11:30:30] ======== [PASSED] lib_sort ========
[11:30:30] [PASSED] test_sort
[11:30:30] ============================================================
[11:30:30] Testing complete. 1 tests run. 0 failed. 0 crashed. 0 skipped.
[11:30:30] Elapsed time: 37.032s total, 0.001s configuring, 34.090s building, 0.000s running

Note: this is the time it took after a `make mrproper`.

With an incremental rebuild, this looks more like:
[11:38:58] Elapsed time: 6.444s total, 0.001s configuring, 3.416s building, 0.000s running

Since the test has no dependencies, it can also be run (with some other
tests) with just:
$ ./tools/testing/kunit/kunit.py run

Link: https://lkml.kernel.org/r/20210715232441.1380885-1-dlatypov@google.com
Signed-off-by: Daniel Latypov <dlatypov@google.com>
Cc: Pravin Shedge <pravin.shedge4linux@gmail.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug |    5 +++--
 lib/test_sort.c   |   40 +++++++++++++++++++---------------------
 2 files changed, 22 insertions(+), 23 deletions(-)

--- a/lib/Kconfig.debug~lib-test-convert-test_sortc-to-use-kunit
+++ a/lib/Kconfig.debug
@@ -2078,8 +2078,9 @@ config TEST_MIN_HEAP
 	  If unsure, say N.
 
 config TEST_SORT
-	tristate "Array-based sort test"
-	depends on DEBUG_KERNEL || m
+	tristate "Array-based sort test" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	default KUNIT_ALL_TESTS
 	help
 	  This option enables the self-test function of 'sort()' at boot,
 	  or at module load time.
--- a/lib/test_sort.c~lib-test-convert-test_sortc-to-use-kunit
+++ a/lib/test_sort.c
@@ -1,4 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0-only
+
+#include <kunit/test.h>
+
 #include <linux/sort.h>
 #include <linux/slab.h>
 #include <linux/module.h>
@@ -7,18 +10,17 @@
 
 #define TEST_LEN 1000
 
-static int __init cmpint(const void *a, const void *b)
+static int cmpint(const void *a, const void *b)
 {
 	return *(int *)a - *(int *)b;
 }
 
-static int __init test_sort_init(void)
+static void test_sort(struct kunit *test)
 {
-	int *a, i, r = 1, err = -ENOMEM;
+	int *a, i, r = 1;
 
-	a = kmalloc_array(TEST_LEN, sizeof(*a), GFP_KERNEL);
-	if (!a)
-		return err;
+	a = kunit_kmalloc_array(test, TEST_LEN, sizeof(*a), GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, a);
 
 	for (i = 0; i < TEST_LEN; i++) {
 		r = (r * 725861) % 6599;
@@ -27,24 +29,20 @@ static int __init test_sort_init(void)
 
 	sort(a, TEST_LEN, sizeof(*a), cmpint, NULL);
 
-	err = -EINVAL;
 	for (i = 0; i < TEST_LEN-1; i++)
-		if (a[i] > a[i+1]) {
-			pr_err("test has failed\n");
-			goto exit;
-		}
-	err = 0;
-	pr_info("test passed\n");
-exit:
-	kfree(a);
-	return err;
+		KUNIT_ASSERT_LE(test, a[i], a[i + 1]);
 }
 
-static void __exit test_sort_exit(void)
-{
-}
+static struct kunit_case sort_test_cases[] = {
+	KUNIT_CASE(test_sort),
+	{}
+};
+
+static struct kunit_suite sort_test_suite = {
+	.name = "lib_sort",
+	.test_cases = sort_test_cases,
+};
 
-module_init(test_sort_init);
-module_exit(test_sort_exit);
+kunit_test_suites(&sort_test_suite);
 
 MODULE_LICENSE("GPL");
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 105/147] lib/dump_stack: correct kernel-doc notation
  2021-09-08  2:52 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2021-09-08  2:58 ` [patch 104/147] lib/test: convert test_sort.c to use KUnit Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 106/147] lib/iov_iter.c: fix kernel-doc warnings Andrew Morton
                   ` (42 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: lib/dump_stack: correct kernel-doc notation

Fix kernel-doc warnings in dump_stack.c:

lib/dump_stack.c:97: warning: Function parameter or member 'log_lvl' not described in 'dump_stack_lvl'
lib/dump_stack.c:97: warning: expecting prototype for dump_stack(). Prototype was for dump_stack_lvl() instead

Link: https://lkml.kernel.org/r/20210809051643.17567-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/dump_stack.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/lib/dump_stack.c~lib-dump_stack-correct-kernel-doc-notation
+++ a/lib/dump_stack.c
@@ -89,7 +89,8 @@ static void __dump_stack(const char *log
 }
 
 /**
- * dump_stack - dump the current task information and its stack trace
+ * dump_stack_lvl - dump the current task information and its stack trace
+ * @log_lvl: log level
  *
  * Architectures can override this implementation by implementing its own.
  */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 106/147] lib/iov_iter.c: fix kernel-doc warnings
  2021-09-08  2:52 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2021-09-08  2:58 ` [patch 105/147] lib/dump_stack: correct kernel-doc notation Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:58 ` [patch 107/147] bitops: protect find_first_{,zero}_bit properly Andrew Morton
                   ` (41 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, torvalds, viro

From: Randy Dunlap <rdunlap@infradead.org>
Subject: lib/iov_iter.c: fix kernel-doc warnings

Fix all kernel-doc warnings in lib/iov_iter.c:

lib/iov_iter.c:695: warning: Function parameter or member 'i' not described in '_copy_mc_to_iter'
lib/iov_iter.c:695: warning: Excess function parameter 'iter' description in '_copy_mc_to_iter'
lib/iov_iter.c:695: warning: No description found for return value of '_copy_mc_to_iter'
lib/iov_iter.c:758: warning: Function parameter or member 'i' not described in '_copy_from_iter_flushcache'
lib/iov_iter.c:758: warning: Excess function parameter 'iter' description in '_copy_from_iter_flushcache'
lib/iov_iter.c:758: warning: No description found for return value of '_copy_from_iter_flushcache'

Link: https://lkml.kernel.org/r/20210809051053.6531-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/iov_iter.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- a/lib/iov_iter.c~lib-iov_iterc-fix-kernel-doc-warnings
+++ a/lib/iov_iter.c
@@ -672,7 +672,7 @@ static size_t copy_mc_pipe_to_iter(const
  * _copy_mc_to_iter - copy to iter with source memory error exception handling
  * @addr: source kernel address
  * @bytes: total transfer length
- * @iter: destination iterator
+ * @i: destination iterator
  *
  * The pmem driver deploys this for the dax operation
  * (dax_copy_to_iter()) for dax reads (bypass page-cache and the
@@ -690,6 +690,8 @@ static size_t copy_mc_pipe_to_iter(const
  * * ITER_KVEC, ITER_PIPE, and ITER_BVEC can return short copies.
  *   Compare to copy_to_iter() where only ITER_IOVEC attempts might return
  *   a short copy.
+ *
+ * Return: number of bytes copied (may be %0)
  */
 size_t _copy_mc_to_iter(const void *addr, size_t bytes, struct iov_iter *i)
 {
@@ -744,7 +746,7 @@ EXPORT_SYMBOL(_copy_from_iter_nocache);
  * _copy_from_iter_flushcache - write destination through cpu cache
  * @addr: destination kernel address
  * @bytes: total transfer length
- * @iter: source iterator
+ * @i: source iterator
  *
  * The pmem driver arranges for filesystem-dax to use this facility via
  * dax_copy_from_iter() for ensuring that writes to persistent memory
@@ -753,6 +755,8 @@ EXPORT_SYMBOL(_copy_from_iter_nocache);
  * all iterator types. The _copy_from_iter_nocache() only attempts to
  * bypass the cache for the ITER_IOVEC case, and on some archs may use
  * instructions that strand dirty-data in the cache.
+ *
+ * Return: number of bytes copied (may be %0)
  */
 size_t _copy_from_iter_flushcache(void *addr, size_t bytes, struct iov_iter *i)
 {
_


* [patch 107/147] bitops: protect find_first_{,zero}_bit properly
  2021-09-08  2:52 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2021-09-08  2:58 ` [patch 106/147] lib/iov_iter.c: fix kernel-doc warnings Andrew Morton
@ 2021-09-08  2:58 ` Andrew Morton
  2021-09-08  2:59 ` [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h Andrew Morton
                   ` (40 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:58 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, lkp, mm-commits, torvalds, ulf.hansson, will,
	wsa+renesas, yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: bitops: protect find_first_{,zero}_bit properly

Patch series "Resend bitmap patches".


This patch (of 17):

find_first_bit() and find_first_zero_bit() are not protected with #ifndef
guards like the other functions in find.h.  This causes build errors on some
platforms if CONFIG_GENERIC_FIND_FIRST_BIT is enabled.
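
The fix wraps each definition in the same #ifndef guard already used by the
other helpers, so an architecture that supplies its own find_first_bit() no
longer collides with the generic inline.  The resulting pattern (body as in
find.h):

#ifndef find_first_bit
static inline
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
	if (small_const_nbits(size)) {
		unsigned long val = *addr & GENMASK(size - 1, 0);

		return val ? __ffs(val) : size;
	}

	return _find_first_bit(addr, size);
}
#endif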

Link: https://lkml.kernel.org/r/20210814211713.180533-1-yury.norov@gmail.com
Link: https://lkml.kernel.org/r/20210814211713.180533-2-yury.norov@gmail.com
Fixes: 2cc7b6a44ac2 ("lib: add fast path for find_first_*_bit() and find_last_bit()")
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/bitops/find.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/include/asm-generic/bitops/find.h~bitops-protect-find_first_zero_bit-properly
+++ a/include/asm-generic/bitops/find.h
@@ -97,6 +97,7 @@ unsigned long find_next_zero_bit(const u
 
 #ifdef CONFIG_GENERIC_FIND_FIRST_BIT
 
+#ifndef find_first_bit
 /**
  * find_first_bit - find the first set bit in a memory region
  * @addr: The address to start the search at
@@ -116,7 +117,9 @@ unsigned long find_first_bit(const unsig
 
 	return _find_first_bit(addr, size);
 }
+#endif
 
+#ifndef find_first_zero_bit
 /**
  * find_first_zero_bit - find the first cleared bit in a memory region
  * @addr: The address to start the search at
@@ -136,6 +139,8 @@ unsigned long find_first_zero_bit(const
 
 	return _find_first_zero_bit(addr, size);
 }
+#endif
+
 #else /* CONFIG_GENERIC_FIND_FIRST_BIT */
 
 #ifndef find_first_bit
_


* [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08  2:52 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2021-09-08  2:58 ` [patch 107/147] bitops: protect find_first_{,zero}_bit properly Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08 18:37     ` Linus Torvalds
  2021-09-08  2:59 ` [patch 109/147] include: move find.h from asm_generic to linux Andrew Morton
                   ` (39 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: bitops: move find_bit_*_le functions from le.h to find.h

It's convenient to have all find_bit declarations in one place.

Link: https://lkml.kernel.org/r/20210814211713.180533-3-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/asm-generic/bitops/find.h |  193 ----------------------------
 include/asm-generic/bitops/le.h   |   64 ---------
 2 files changed, 257 deletions(-)

--- a/include/asm-generic/bitops/find.h
+++ /dev/null
@@ -1,193 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_GENERIC_BITOPS_FIND_H_
-#define _ASM_GENERIC_BITOPS_FIND_H_
-
-extern unsigned long _find_next_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long nbits,
-		unsigned long start, unsigned long invert, unsigned long le);
-extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
-extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
-extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
-
-#ifndef find_next_bit
-/**
- * find_next_bit - find the next set bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The bitmap size in bits
- *
- * Returns the bit number for the next set bit
- * If no bits are set, returns @size.
- */
-static inline
-unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
-			    unsigned long offset)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val;
-
-		if (unlikely(offset >= size))
-			return size;
-
-		val = *addr & GENMASK(size - 1, offset);
-		return val ? __ffs(val) : size;
-	}
-
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
-}
-#endif
-
-#ifndef find_next_and_bit
-/**
- * find_next_and_bit - find the next set bit in both memory regions
- * @addr1: The first address to base the search on
- * @addr2: The second address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The bitmap size in bits
- *
- * Returns the bit number for the next set bit
- * If no bits are set, returns @size.
- */
-static inline
-unsigned long find_next_and_bit(const unsigned long *addr1,
-		const unsigned long *addr2, unsigned long size,
-		unsigned long offset)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val;
-
-		if (unlikely(offset >= size))
-			return size;
-
-		val = *addr1 & *addr2 & GENMASK(size - 1, offset);
-		return val ? __ffs(val) : size;
-	}
-
-	return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
-}
-#endif
-
-#ifndef find_next_zero_bit
-/**
- * find_next_zero_bit - find the next cleared bit in a memory region
- * @addr: The address to base the search on
- * @offset: The bitnumber to start searching at
- * @size: The bitmap size in bits
- *
- * Returns the bit number of the next zero bit
- * If no bits are zero, returns @size.
- */
-static inline
-unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
-				 unsigned long offset)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val;
-
-		if (unlikely(offset >= size))
-			return size;
-
-		val = *addr | ~GENMASK(size - 1, offset);
-		return val == ~0UL ? size : ffz(val);
-	}
-
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
-}
-#endif
-
-#ifdef CONFIG_GENERIC_FIND_FIRST_BIT
-
-#ifndef find_first_bit
-/**
- * find_first_bit - find the first set bit in a memory region
- * @addr: The address to start the search at
- * @size: The maximum number of bits to search
- *
- * Returns the bit number of the first set bit.
- * If no bits are set, returns @size.
- */
-static inline
-unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
-
-		return val ? __ffs(val) : size;
-	}
-
-	return _find_first_bit(addr, size);
-}
-#endif
-
-#ifndef find_first_zero_bit
-/**
- * find_first_zero_bit - find the first cleared bit in a memory region
- * @addr: The address to start the search at
- * @size: The maximum number of bits to search
- *
- * Returns the bit number of the first cleared bit.
- * If no bits are zero, returns @size.
- */
-static inline
-unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *addr | ~GENMASK(size - 1, 0);
-
-		return val == ~0UL ? size : ffz(val);
-	}
-
-	return _find_first_zero_bit(addr, size);
-}
-#endif
-
-#else /* CONFIG_GENERIC_FIND_FIRST_BIT */
-
-#ifndef find_first_bit
-#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
-#endif
-#ifndef find_first_zero_bit
-#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
-#endif
-
-#endif /* CONFIG_GENERIC_FIND_FIRST_BIT */
-
-#ifndef find_last_bit
-/**
- * find_last_bit - find the last set bit in a memory region
- * @addr: The address to start the search at
- * @size: The number of bits to search
- *
- * Returns the bit number of the last set bit, or size.
- */
-static inline
-unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *addr & GENMASK(size - 1, 0);
-
-		return val ? __fls(val) : size;
-	}
-
-	return _find_last_bit(addr, size);
-}
-#endif
-
-/**
- * find_next_clump8 - find next 8-bit clump with set bits in a memory region
- * @clump: location to store copy of found clump
- * @addr: address to base the search on
- * @size: bitmap size in number of bits
- * @offset: bit offset at which to start searching
- *
- * Returns the bit offset for the next set clump; the found clump value is
- * copied to the location pointed by @clump. If no bits are set, returns @size.
- */
-extern unsigned long find_next_clump8(unsigned long *clump,
-				      const unsigned long *addr,
-				      unsigned long size, unsigned long offset);
-
-#define find_first_clump8(clump, bits, size) \
-	find_next_clump8((clump), (bits), (size), 0)
-
-#endif /*_ASM_GENERIC_BITOPS_FIND_H_ */
--- a/include/asm-generic/bitops/le.h~bitops-move-find_bit__le-functions-from-leh-to-findh
+++ a/include/asm-generic/bitops/le.h
@@ -2,83 +2,19 @@
 #ifndef _ASM_GENERIC_BITOPS_LE_H_
 #define _ASM_GENERIC_BITOPS_LE_H_
 
-#include <asm-generic/bitops/find.h>
 #include <asm/types.h>
 #include <asm/byteorder.h>
-#include <linux/swab.h>
 
 #if defined(__LITTLE_ENDIAN)
 
 #define BITOP_LE_SWIZZLE	0
 
-static inline unsigned long find_next_zero_bit_le(const void *addr,
-		unsigned long size, unsigned long offset)
-{
-	return find_next_zero_bit(addr, size, offset);
-}
-
-static inline unsigned long find_next_bit_le(const void *addr,
-		unsigned long size, unsigned long offset)
-{
-	return find_next_bit(addr, size, offset);
-}
-
-static inline unsigned long find_first_zero_bit_le(const void *addr,
-		unsigned long size)
-{
-	return find_first_zero_bit(addr, size);
-}
-
 #elif defined(__BIG_ENDIAN)
 
 #define BITOP_LE_SWIZZLE	((BITS_PER_LONG-1) & ~0x7)
 
-#ifndef find_next_zero_bit_le
-static inline
-unsigned long find_next_zero_bit_le(const void *addr, unsigned
-		long size, unsigned long offset)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
-
-		if (unlikely(offset >= size))
-			return size;
-
-		val = swab(val) | ~GENMASK(size - 1, offset);
-		return val == ~0UL ? size : ffz(val);
-	}
-
-	return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
-}
 #endif
 
-#ifndef find_next_bit_le
-static inline
-unsigned long find_next_bit_le(const void *addr, unsigned
-		long size, unsigned long offset)
-{
-	if (small_const_nbits(size)) {
-		unsigned long val = *(const unsigned long *)addr;
-
-		if (unlikely(offset >= size))
-			return size;
-
-		val = swab(val) & GENMASK(size - 1, offset);
-		return val ? __ffs(val) : size;
-	}
-
-	return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
-}
-#endif
-
-#ifndef find_first_zero_bit_le
-#define find_first_zero_bit_le(addr, size) \
-	find_next_zero_bit_le((addr), (size), 0)
-#endif
-
-#else
-#error "Please fix <asm/byteorder.h>"
-#endif
 
 static inline int test_bit_le(int nr, const void *addr)
 {
_


* [patch 109/147] include: move find.h from asm_generic to linux
  2021-09-08  2:52 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2021-09-08  2:59 ` [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 110/147] arch: remove GENERIC_FIND_FIRST_BIT entirely Andrew Morton
                   ` (38 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: include: move find.h from asm_generic to linux

The find_bit API and the bitmap API are closely related, but their inclusion
paths differ - include/asm-generic and include/linux, respectively.  In the
past this caused a lot of trouble due to circular dependencies and/or
undefined symbols.  Fix this by moving find.h under include/linux.

Link: https://lkml.kernel.org/r/20210814211713.180533-4-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS                        |    2 
 arch/alpha/include/asm/bitops.h    |    2 
 arch/arc/include/asm/bitops.h      |    1 
 arch/arm/include/asm/bitops.h      |    1 
 arch/arm64/include/asm/bitops.h    |    1 
 arch/csky/include/asm/bitops.h     |    1 
 arch/h8300/include/asm/bitops.h    |    1 
 arch/hexagon/include/asm/bitops.h  |    1 
 arch/ia64/include/asm/bitops.h     |    2 
 arch/m68k/include/asm/bitops.h     |    2 
 arch/mips/include/asm/bitops.h     |    1 
 arch/openrisc/include/asm/bitops.h |    1 
 arch/parisc/include/asm/bitops.h   |    2 
 arch/powerpc/include/asm/bitops.h  |    2 
 arch/riscv/include/asm/bitops.h    |    1 
 arch/s390/include/asm/bitops.h     |    1 
 arch/sh/include/asm/bitops.h       |    1 
 arch/sparc/include/asm/bitops_32.h |    1 
 arch/sparc/include/asm/bitops_64.h |    2 
 arch/x86/include/asm/bitops.h      |    2 
 arch/xtensa/include/asm/bitops.h   |    1 
 include/asm-generic/bitops.h       |    1 
 include/linux/bitmap.h             |    1 
 include/{asm-generic/bitops => linux}/find.h | 12 +++++++++---
 25 files changed, 270 insertions(+), 291 deletions(-)

--- a/arch/alpha/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/alpha/include/asm/bitops.h
@@ -430,8 +430,6 @@ static inline unsigned int __arch_hweigh
 
 #endif /* __KERNEL__ */
 
-#include <asm-generic/bitops/find.h>
-
 #ifdef __KERNEL__
 
 /*
--- a/arch/arc/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/arc/include/asm/bitops.h
@@ -369,7 +369,6 @@ static inline __attribute__ ((const)) un
 #include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/lock.h>
 
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
 
--- a/arch/arm64/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/arm64/include/asm/bitops.h
@@ -18,7 +18,6 @@
 
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 
 #include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
--- a/arch/arm/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/arm/include/asm/bitops.h
@@ -264,7 +264,6 @@ static inline int find_next_bit_le(const
 
 #endif
 
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/le.h>
 
 /*
--- a/arch/csky/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/csky/include/asm/bitops.h
@@ -59,7 +59,6 @@ static __always_inline unsigned long __f
 
 #include <asm-generic/bitops/ffz.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 
 #ifndef _LINUX_BITOPS_H
 #error only <linux/bitops.h> can be included directly
--- a/arch/h8300/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/h8300/include/asm/bitops.h
@@ -168,7 +168,6 @@ static inline unsigned long __ffs(unsign
 	return result;
 }
 
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
--- a/arch/hexagon/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/hexagon/include/asm/bitops.h
@@ -271,7 +271,6 @@ static inline unsigned long __fls(unsign
 }
 
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/find.h>
 
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/sched.h>
--- a/arch/ia64/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/ia64/include/asm/bitops.h
@@ -441,8 +441,6 @@ static __inline__ unsigned long __arch_h
 
 #endif /* __KERNEL__ */
 
-#include <asm-generic/bitops/find.h>
-
 #ifdef __KERNEL__
 
 #include <asm-generic/bitops/le.h>
--- a/arch/m68k/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/m68k/include/asm/bitops.h
@@ -529,6 +529,4 @@ static inline int __fls(int x)
 #include <asm-generic/bitops/le.h>
 #endif /* __KERNEL__ */
 
-#include <asm-generic/bitops/find.h>
-
 #endif /* _M68K_BITOPS_H */
--- a/arch/mips/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/mips/include/asm/bitops.h
@@ -446,7 +446,6 @@ static inline int ffs(int word)
 }
 
 #include <asm-generic/bitops/ffz.h>
-#include <asm-generic/bitops/find.h>
 
 #ifdef __KERNEL__
 
--- a/arch/openrisc/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/openrisc/include/asm/bitops.h
@@ -30,7 +30,6 @@
 #include <asm/bitops/fls.h>
 #include <asm/bitops/__fls.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 
 #ifndef _LINUX_BITOPS_H
 #error only <linux/bitops.h> can be included directly
--- a/arch/parisc/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/parisc/include/asm/bitops.h
@@ -208,8 +208,6 @@ static __inline__ int fls(unsigned int x
 
 #endif /* __KERNEL__ */
 
-#include <asm-generic/bitops/find.h>
-
 #ifdef __KERNEL__
 
 #include <asm-generic/bitops/le.h>
--- a/arch/powerpc/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/powerpc/include/asm/bitops.h
@@ -255,8 +255,6 @@ unsigned long __arch_hweight64(__u64 w);
 #include <asm-generic/bitops/hweight.h>
 #endif
 
-#include <asm-generic/bitops/find.h>
-
 /* wrappers that deal with KASAN instrumentation */
 #include <asm-generic/bitops/instrumented-atomic.h>
 #include <asm-generic/bitops/instrumented-lock.h>
--- a/arch/riscv/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/riscv/include/asm/bitops.h
@@ -20,7 +20,6 @@
 #include <asm-generic/bitops/fls.h>
 #include <asm-generic/bitops/__fls.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/ffs.h>
 
--- a/arch/s390/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/s390/include/asm/bitops.h
@@ -387,7 +387,6 @@ static inline int fls(unsigned int word)
 #endif /* CONFIG_HAVE_MARCH_Z9_109_FEATURES */
 
 #include <asm-generic/bitops/ffz.h>
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/sched.h>
 #include <asm-generic/bitops/le.h>
--- a/arch/sh/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/sh/include/asm/bitops.h
@@ -68,6 +68,5 @@ static inline unsigned long __ffs(unsign
 #include <asm-generic/bitops/fls64.h>
 
 #include <asm-generic/bitops/le.h>
-#include <asm-generic/bitops/find.h>
 
 #endif /* __ASM_SH_BITOPS_H */
--- a/arch/sparc/include/asm/bitops_32.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/sparc/include/asm/bitops_32.h
@@ -100,7 +100,6 @@ static inline void change_bit(unsigned l
 #include <asm-generic/bitops/fls64.h>
 #include <asm-generic/bitops/hweight.h>
 #include <asm-generic/bitops/lock.h>
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/le.h>
 #include <asm-generic/bitops/ext2-atomic.h>
 
--- a/arch/sparc/include/asm/bitops_64.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/sparc/include/asm/bitops_64.h
@@ -52,8 +52,6 @@ unsigned int __arch_hweight8(unsigned in
 #include <asm-generic/bitops/lock.h>
 #endif /* __KERNEL__ */
 
-#include <asm-generic/bitops/find.h>
-
 #ifdef __KERNEL__
 
 #include <asm-generic/bitops/le.h>
--- a/arch/x86/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/x86/include/asm/bitops.h
@@ -380,8 +380,6 @@ static __always_inline int fls64(__u64 x
 #include <asm-generic/bitops/fls64.h>
 #endif
 
-#include <asm-generic/bitops/find.h>
-
 #include <asm-generic/bitops/sched.h>
 
 #include <asm/arch_hweight.h>
--- a/arch/xtensa/include/asm/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/arch/xtensa/include/asm/bitops.h
@@ -205,7 +205,6 @@ BIT_OPS(change, "xor", )
 #undef BIT_OP
 #undef TEST_AND_BIT_OP
 
-#include <asm-generic/bitops/find.h>
 #include <asm-generic/bitops/le.h>
 
 #include <asm-generic/bitops/ext2-atomic-setbit.h>
--- a/include/asm-generic/bitops.h~include-move-findh-from-asm_generic-to-linux
+++ a/include/asm-generic/bitops.h
@@ -20,7 +20,6 @@
 #include <asm-generic/bitops/fls.h>
 #include <asm-generic/bitops/__fls.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 
 #ifndef _LINUX_BITOPS_H
 #error only <linux/bitops.h> can be included directly
--- a/include/linux/bitmap.h~include-move-findh-from-asm_generic-to-linux
+++ a/include/linux/bitmap.h
@@ -6,6 +6,7 @@
 
 #include <linux/align.h>
 #include <linux/bitops.h>
+#include <linux/find.h>
 #include <linux/limits.h>
 #include <linux/string.h>
 #include <linux/types.h>
--- a/MAINTAINERS~include-move-findh-from-asm_generic-to-linux
+++ a/MAINTAINERS
@@ -3262,8 +3262,8 @@ M:	Yury Norov <yury.norov@gmail.com>
 R:	Andy Shevchenko <andriy.shevchenko@linux.intel.com>
 R:	Rasmus Villemoes <linux@rasmusvillemoes.dk>
 S:	Maintained
-F:	include/asm-generic/bitops/find.h
 F:	include/linux/bitmap.h
+F:	include/linux/find.h
 F:	lib/bitmap.c
 F:	lib/find_bit.c
 F:	lib/find_bit_benchmark.c
_


* [patch 110/147] arch: remove GENERIC_FIND_FIRST_BIT entirely
  2021-09-08  2:52 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2021-09-08  2:59 ` [patch 109/147] include: move find.h from asm_generic to linux Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 111/147] lib: add find_first_and_bit() Andrew Morton
                   ` (37 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: arch: remove GENERIC_FIND_FIRST_BIT entirely

In the 5.12 cycle we enabled the GENERIC_FIND_FIRST_BIT config option for
ARM64 and MIPS.  It increased performance and shrank .text size, and so far
I have received no negative feedback on the change.

https://lore.kernel.org/linux-arch/20210225135700.1381396-1-yury.norov@gmail.com/

Now seems like a good time to switch all architectures to use
find_{first,last}_bit() unconditionally, and to remove the corresponding
config option.

The patch doesn't introduce functional changes for arc, arm, arm64, mips,
m68k, s390 and x86; for other architectures I expect improvement in both
performance and .text size.

Link: https://lkml.kernel.org/r/20210814211713.180533-5-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Alexander Lobakin <alobakin@pm.me> (mips)
Reviewed-by: Alexander Lobakin <alobakin@pm.me> (mips)
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arc/Kconfig     |    1 -
 arch/arm64/Kconfig   |    1 -
 arch/mips/Kconfig    |    1 -
 arch/s390/Kconfig    |    1 -
 arch/x86/Kconfig     |    1 -
 arch/x86/um/Kconfig  |    1 -
 include/linux/find.h |   13 -------------
 lib/Kconfig          |    3 ---
 8 files changed, 22 deletions(-)

--- a/arch/arc/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/arc/Kconfig
@@ -20,7 +20,6 @@ config ARC
 	select COMMON_CLK
 	select DMA_DIRECT_REMAP
 	select GENERIC_ATOMIC64 if !ISA_ARCV2 || !(ARC_HAS_LL64 && ARC_HAS_LLSC)
-	select GENERIC_FIND_FIRST_BIT
 	# for now, we don't need GENERIC_IRQ_PROBE, CONFIG_GENERIC_IRQ_CHIP
 	select GENERIC_IRQ_SHOW
 	select GENERIC_PCI_IOMAP
--- a/arch/arm64/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/arm64/Kconfig
@@ -119,7 +119,6 @@ config ARM64
 	select GENERIC_CPU_AUTOPROBE
 	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_EARLY_IOREMAP
-	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_IDLE_POLL_SETUP
 	select GENERIC_IRQ_IPI
 	select GENERIC_IRQ_PROBE
--- a/arch/mips/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/mips/Kconfig
@@ -30,7 +30,6 @@ config MIPS
 	select GENERIC_ATOMIC64 if !64BIT
 	select GENERIC_CMOS_UPDATE
 	select GENERIC_CPU_AUTOPROBE
-	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_GETTIMEOFDAY
 	select GENERIC_IOMAP
 	select GENERIC_IRQ_PROBE
--- a/arch/s390/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/s390/Kconfig
@@ -126,7 +126,6 @@ config S390
 	select GENERIC_CPU_AUTOPROBE
 	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_ENTRY
-	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_GETTIMEOFDAY
 	select GENERIC_PTDUMP
 	select GENERIC_SMP_IDLE_THREAD
--- a/arch/x86/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/x86/Kconfig
@@ -133,7 +133,6 @@ config X86
 	select GENERIC_CPU_VULNERABILITIES
 	select GENERIC_EARLY_IOREMAP
 	select GENERIC_ENTRY
-	select GENERIC_FIND_FIRST_BIT
 	select GENERIC_IOMAP
 	select GENERIC_IRQ_EFFECTIVE_AFF_MASK	if SMP
 	select GENERIC_IRQ_MATRIX_ALLOCATOR	if X86_LOCAL_APIC
--- a/arch/x86/um/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/arch/x86/um/Kconfig
@@ -8,7 +8,6 @@ endmenu
 
 config UML_X86
 	def_bool y
-	select GENERIC_FIND_FIRST_BIT
 
 config 64BIT
 	bool "64-bit kernel" if "$(SUBARCH)" = "x86"
--- a/include/linux/find.h~arch-remove-generic_find_first_bit-entirely
+++ a/include/linux/find.h
@@ -101,8 +101,6 @@ unsigned long find_next_zero_bit(const u
 }
 #endif
 
-#ifdef CONFIG_GENERIC_FIND_FIRST_BIT
-
 #ifndef find_first_bit
 /**
  * find_first_bit - find the first set bit in a memory region
@@ -147,17 +145,6 @@ unsigned long find_first_zero_bit(const
 }
 #endif
 
-#else /* CONFIG_GENERIC_FIND_FIRST_BIT */
-
-#ifndef find_first_bit
-#define find_first_bit(addr, size) find_next_bit((addr), (size), 0)
-#endif
-#ifndef find_first_zero_bit
-#define find_first_zero_bit(addr, size) find_next_zero_bit((addr), (size), 0)
-#endif
-
-#endif /* CONFIG_GENERIC_FIND_FIRST_BIT */
-
 #ifndef find_last_bit
 /**
  * find_last_bit - find the last set bit in a memory region
--- a/lib/Kconfig~arch-remove-generic_find_first_bit-entirely
+++ a/lib/Kconfig
@@ -59,9 +59,6 @@ config GENERIC_STRNLEN_USER
 config GENERIC_NET_UTILS
 	bool
 
-config GENERIC_FIND_FIRST_BIT
-	bool
-
 source "lib/math/Kconfig"
 
 config NO_GENERIC_PCI_IOPORT_MAP
_


* [patch 111/147] lib: add find_first_and_bit()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2021-09-08  2:59 ` [patch 110/147] arch: remove GENERIC_FIND_FIRST_BIT entirely Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 112/147] cpumask: use find_first_and_bit() Andrew Morton
                   ` (36 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: add find_first_and_bit()

Currently find_first_and_bit() is an alias for find_next_and_bit().  However,
it is widely used in cpumask, so it is worth optimizing.  This patch adds a
dedicated implementation of find_first_and_bit().

On x86_64 find_bit_benchmark says:

Before (#define find_first_and_bit(...) find_next_and_bit(..., 0)):
Start testing find_bit() with random-filled bitmap
[  140.291468] find_first_and_bit:           46890919 ns,  32671 iterations
Start testing find_bit() with sparse bitmap
[  140.295028] find_first_and_bit:               7103 ns,      1 iterations

After:
Start testing find_bit() with random-filled bitmap
[  162.574907] find_first_and_bit:           25045813 ns,  32846 iterations
Start testing find_bit() with sparse bitmap
[  162.578458] find_first_and_bit:               4900 ns,      1 iterations

(Thanks to Alexey Klimov for thorough testing.)
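
For context, a minimal usage sketch (names invented for the example):

static void example(void)
{
	DECLARE_BITMAP(online, 64);
	DECLARE_BITMAP(allowed, 64);
	unsigned long bit;

	bitmap_zero(online, 64);
	bitmap_zero(allowed, 64);
	__set_bit(3, online);
	__set_bit(3, allowed);

	/* first bit set in both bitmaps, or 64 if there is none */
	bit = find_first_and_bit(online, allowed, 64);
	if (bit < 64)
		pr_info("first common bit: %lu\n", bit);	/* prints 3 */
}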

Link: https://lkml.kernel.org/r/20210814211713.180533-6-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Tested-by: Alexey Klimov <aklimov@redhat.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/find.h     |   27 +++++++++++++++++++++++++++
 lib/find_bit.c           |   21 +++++++++++++++++++++
 lib/find_bit_benchmark.c |   21 +++++++++++++++++++++
 3 files changed, 69 insertions(+)

--- a/include/linux/find.h~lib-add-find_first_and_bit
+++ a/include/linux/find.h
@@ -12,6 +12,8 @@ extern unsigned long _find_next_bit(cons
 		const unsigned long *addr2, unsigned long nbits,
 		unsigned long start, unsigned long invert, unsigned long le);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
+extern unsigned long _find_first_and_bit(const unsigned long *addr1,
+					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
@@ -123,6 +125,31 @@ unsigned long find_first_bit(const unsig
 }
 #endif
 
+#ifndef find_first_and_bit
+/**
+ * find_first_and_bit - find the first set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_first_and_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
+
+		return val ? __ffs(val) : size;
+	}
+
+	return _find_first_and_bit(addr1, addr2, size);
+}
+#endif
+
 #ifndef find_first_zero_bit
 /**
  * find_first_zero_bit - find the first cleared bit in a memory region
--- a/lib/find_bit_benchmark.c~lib-add-find_first_and_bit
+++ a/lib/find_bit_benchmark.c
@@ -49,6 +49,25 @@ static int __init test_find_first_bit(vo
 	return 0;
 }
 
+static int __init test_find_first_and_bit(void *bitmap, const void *bitmap2, unsigned long len)
+{
+	static DECLARE_BITMAP(cp, BITMAP_LEN) __initdata;
+	unsigned long i, cnt;
+	ktime_t time;
+
+	bitmap_copy(cp, bitmap, BITMAP_LEN);
+
+	time = ktime_get();
+	for (cnt = i = 0; i < len; cnt++) {
+		i = find_first_and_bit(cp, bitmap2, len);
+		__clear_bit(i, cp);
+	}
+	time = ktime_get() - time;
+	pr_err("find_first_and_bit: %18llu ns, %6ld iterations\n", time, cnt);
+
+	return 0;
+}
+
 static int __init test_find_next_bit(const void *bitmap, unsigned long len)
 {
 	unsigned long i, cnt;
@@ -129,6 +148,7 @@ static int __init find_bit_test(void)
 	 * traverse only part of bitmap to avoid soft lockup.
 	 */
 	test_find_first_bit(bitmap, BITMAP_LEN / 10);
+	test_find_first_and_bit(bitmap, bitmap2, BITMAP_LEN / 2);
 	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
 	pr_err("\nStart testing find_bit() with sparse bitmap\n");
@@ -145,6 +165,7 @@ static int __init find_bit_test(void)
 	test_find_next_zero_bit(bitmap, BITMAP_LEN);
 	test_find_last_bit(bitmap, BITMAP_LEN);
 	test_find_first_bit(bitmap, BITMAP_LEN);
+	test_find_first_and_bit(bitmap, bitmap2, BITMAP_LEN);
 	test_find_next_and_bit(bitmap, bitmap2, BITMAP_LEN);
 
 	/*
--- a/lib/find_bit.c~lib-add-find_first_and_bit
+++ a/lib/find_bit.c
@@ -89,6 +89,27 @@ unsigned long _find_first_bit(const unsi
 EXPORT_SYMBOL(_find_first_bit);
 #endif
 
+#ifndef find_first_and_bit
+/*
+ * Find the first set bit in two memory regions.
+ */
+unsigned long _find_first_and_bit(const unsigned long *addr1,
+				  const unsigned long *addr2,
+				  unsigned long size)
+{
+	unsigned long idx, val;
+
+	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
+		val = addr1[idx] & addr2[idx];
+		if (val)
+			return min(idx * BITS_PER_LONG + __ffs(val), size);
+	}
+
+	return size;
+}
+EXPORT_SYMBOL(_find_first_and_bit);
+#endif
+
 #ifndef find_first_zero_bit
 /*
  * Find the first cleared bit in a memory region.
_


* [patch 112/147] cpumask: use find_first_and_bit()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2021-09-08  2:59 ` [patch 111/147] lib: add find_first_and_bit() Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 113/147] all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate Andrew Morton
                   ` (35 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: cpumask: use find_first_and_bit()

Now that we have an efficient implementation of find_first_and_bit(),
switch cpumask to use it where appropriate.
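
Typical use after this change, on an illustrative caller:

static void example(void)
{
	/* first CPU in both masks, or >= nr_cpu_ids if there is none */
	unsigned int cpu = cpumask_first_and(cpu_online_mask, cpu_active_mask);

	if (cpu < nr_cpu_ids)
		pr_debug("first online+active cpu: %u\n", cpu);
}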

Link: https://lkml.kernel.org/r/20210814211713.180533-7-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/cpumask.h |   30 ++++++++++++++++++++----------
 1 file changed, 20 insertions(+), 10 deletions(-)

--- a/include/linux/cpumask.h~cpumask-use-find_first_and_bit
+++ a/include/linux/cpumask.h
@@ -123,6 +123,12 @@ static inline unsigned int cpumask_first
 	return 0;
 }
 
+static inline unsigned int cpumask_first_and(const struct cpumask *srcp1,
+					     const struct cpumask *srcp2)
+{
+	return 0;
+}
+
 static inline unsigned int cpumask_last(const struct cpumask *srcp)
 {
 	return 0;
@@ -167,7 +173,7 @@ static inline unsigned int cpumask_local
 
 static inline int cpumask_any_and_distribute(const struct cpumask *src1p,
 					     const struct cpumask *src2p) {
-	return cpumask_next_and(-1, src1p, src2p);
+	return cpumask_first_and(src1p, src2p);
 }
 
 static inline int cpumask_any_distribute(const struct cpumask *srcp)
@@ -196,6 +202,19 @@ static inline unsigned int cpumask_first
 }
 
 /**
+ * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
+ * @src1p: the first input
+ * @src2p: the second input
+ *
+ * Returns >= nr_cpu_ids if no cpus set in both.  See also cpumask_next_and().
+ */
+static inline
+unsigned int cpumask_first_and(const struct cpumask *srcp1, const struct cpumask *srcp2)
+{
+	return find_first_and_bit(cpumask_bits(srcp1), cpumask_bits(srcp2), nr_cpumask_bits);
+}
+
+/**
  * cpumask_last - get the last CPU in a cpumask
  * @srcp:	- the cpumask pointer
  *
@@ -586,15 +605,6 @@ static inline void cpumask_copy(struct c
 #define cpumask_any(srcp) cpumask_first(srcp)
 
 /**
- * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
- * @src1p: the first input
- * @src2p: the second input
- *
- * Returns >= nr_cpu_ids if no cpus set in both.  See also cpumask_next_and().
- */
-#define cpumask_first_and(src1p, src2p) cpumask_next_and(-1, (src1p), (src2p))
-
-/**
  * cpumask_any_and - pick a "random" cpu from *mask1 & *mask2
  * @mask1: the first input cpumask
  * @mask2: the second input cpumask
_


* [patch 113/147] all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate
  2021-09-08  2:52 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2021-09-08  2:59 ` [patch 112/147] cpumask: use find_first_and_bit() Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 114/147] tools: sync tools/bitmap with mother linux Andrew Morton
                   ` (34 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate

find_first{,_zero}_bit() is a more efficient analogue of the 'next' version
when start == 0.  This patch replaces 'next' with 'first' where the
conversion is trivial.
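
The conversion is mechanical wherever the search starts at bit 0:

-	bit = find_next_bit(bitmap, nbits, 0);
+	bit = find_first_bit(bitmap, nbits);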

Link: https://lkml.kernel.org/r/20210814211713.180533-8-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/platforms/pasemi/dma_lib.c |    4 ++--
 arch/s390/kvm/kvm-s390.c                |    2 +-
 drivers/block/rnbd/rnbd-clt.c           |    2 +-
 drivers/dma/ti/edma.c                   |    2 +-
 drivers/iio/adc/ad7124.c                |    2 +-
 drivers/infiniband/hw/irdma/hw.c        |   16 ++++++++--------
 drivers/media/cec/core/cec-core.c       |    2 +-
 drivers/media/mc/mc-devnode.c           |    2 +-
 drivers/pci/controller/dwc/pci-dra7xx.c |    2 +-
 drivers/scsi/lpfc/lpfc_sli.c            |   10 +++++-----
 drivers/soc/ti/k3-ringacc.c             |    4 ++--
 drivers/tty/n_tty.c                     |    2 +-
 drivers/virt/acrn/ioreq.c               |    3 +--
 fs/f2fs/segment.c                       |    8 ++++----
 fs/ocfs2/cluster/heartbeat.c            |    2 +-
 fs/ocfs2/dlm/dlmdomain.c                |    4 ++--
 fs/ocfs2/dlm/dlmmaster.c                |   18 +++++++++---------
 fs/ocfs2/dlm/dlmrecovery.c              |    2 +-
 fs/ocfs2/dlm/dlmthread.c                |    2 +-
 lib/genalloc.c                          |    2 +-
 net/ncsi/ncsi-manage.c                  |    4 ++--
 21 files changed, 47 insertions(+), 48 deletions(-)

--- a/arch/powerpc/platforms/pasemi/dma_lib.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/arch/powerpc/platforms/pasemi/dma_lib.c
@@ -375,7 +375,7 @@ int pasemi_dma_alloc_flag(void)
 	int bit;
 
 retry:
-	bit = find_next_bit(flags_free, MAX_FLAGS, 0);
+	bit = find_first_bit(flags_free, MAX_FLAGS);
 	if (bit >= MAX_FLAGS)
 		return -ENOSPC;
 	if (!test_and_clear_bit(bit, flags_free))
@@ -440,7 +440,7 @@ int pasemi_dma_alloc_fun(void)
 	int bit;
 
 retry:
-	bit = find_next_bit(fun_free, MAX_FLAGS, 0);
+	bit = find_first_bit(fun_free, MAX_FLAGS);
 	if (bit >= MAX_FLAGS)
 		return -ENOSPC;
 	if (!test_and_clear_bit(bit, fun_free))
--- a/arch/s390/kvm/kvm-s390.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/arch/s390/kvm/kvm-s390.c
@@ -2023,7 +2023,7 @@ static unsigned long kvm_s390_next_dirty
 	while ((slotidx > 0) && (ofs >= ms->npages)) {
 		slotidx--;
 		ms = slots->memslots + slotidx;
-		ofs = find_next_bit(kvm_second_dirty_bitmap(ms), ms->npages, 0);
+		ofs = find_first_bit(kvm_second_dirty_bitmap(ms), ms->npages);
 	}
 	return ms->base_gfn + ofs;
 }
--- a/drivers/block/rnbd/rnbd-clt.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/block/rnbd/rnbd-clt.c
@@ -196,7 +196,7 @@ rnbd_get_cpu_qlist(struct rnbd_clt_sessi
 		return per_cpu_ptr(sess->cpu_queues, bit);
 	} else if (cpu != 0) {
 		/* Search from 0 to cpu */
-		bit = find_next_bit(sess->cpu_queues_bm, cpu, 0);
+		bit = find_first_bit(sess->cpu_queues_bm, cpu);
 		if (bit < cpu)
 			return per_cpu_ptr(sess->cpu_queues, bit);
 	}
--- a/drivers/dma/ti/edma.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/dma/ti/edma.c
@@ -1681,7 +1681,7 @@ static irqreturn_t dma_ccerr_handler(int
 
 			dev_dbg(ecc->dev, "EMR%d 0x%08x\n", j, val);
 			emr = val;
-			for (i = find_next_bit(&emr, 32, 0); i < 32;
+			for (i = find_first_bit(&emr, 32); i < 32;
 			     i = find_next_bit(&emr, 32, i + 1)) {
 				int k = (j << 5) + i;
 
--- a/drivers/iio/adc/ad7124.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/iio/adc/ad7124.c
@@ -347,7 +347,7 @@ static int ad7124_find_free_config_slot(
 {
 	unsigned int free_cfg_slot;
 
-	free_cfg_slot = find_next_zero_bit(&st->cfg_slots_status, AD7124_MAX_CONFIGS, 0);
+	free_cfg_slot = find_first_zero_bit(&st->cfg_slots_status, AD7124_MAX_CONFIGS);
 	if (free_cfg_slot == AD7124_MAX_CONFIGS)
 		return -1;
 
--- a/drivers/infiniband/hw/irdma/hw.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/infiniband/hw/irdma/hw.c
@@ -1696,14 +1696,14 @@ clean_msixtbl:
  */
 static void irdma_get_used_rsrc(struct irdma_device *iwdev)
 {
-	iwdev->rf->used_pds = find_next_zero_bit(iwdev->rf->allocated_pds,
-						 iwdev->rf->max_pd, 0);
-	iwdev->rf->used_qps = find_next_zero_bit(iwdev->rf->allocated_qps,
-						 iwdev->rf->max_qp, 0);
-	iwdev->rf->used_cqs = find_next_zero_bit(iwdev->rf->allocated_cqs,
-						 iwdev->rf->max_cq, 0);
-	iwdev->rf->used_mrs = find_next_zero_bit(iwdev->rf->allocated_mrs,
-						 iwdev->rf->max_mr, 0);
+	iwdev->rf->used_pds = find_first_zero_bit(iwdev->rf->allocated_pds,
+						 iwdev->rf->max_pd);
+	iwdev->rf->used_qps = find_first_zero_bit(iwdev->rf->allocated_qps,
+						 iwdev->rf->max_qp);
+	iwdev->rf->used_cqs = find_first_zero_bit(iwdev->rf->allocated_cqs,
+						 iwdev->rf->max_cq);
+	iwdev->rf->used_mrs = find_first_zero_bit(iwdev->rf->allocated_mrs,
+						 iwdev->rf->max_mr);
 }
 
 void irdma_ctrl_deinit_hw(struct irdma_pci_f *rf)
--- a/drivers/media/cec/core/cec-core.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/media/cec/core/cec-core.c
@@ -106,7 +106,7 @@ static int __must_check cec_devnode_regi
 
 	/* Part 1: Find a free minor number */
 	mutex_lock(&cec_devnode_lock);
-	minor = find_next_zero_bit(cec_devnode_nums, CEC_NUM_DEVICES, 0);
+	minor = find_first_zero_bit(cec_devnode_nums, CEC_NUM_DEVICES);
 	if (minor == CEC_NUM_DEVICES) {
 		mutex_unlock(&cec_devnode_lock);
 		pr_err("could not get a free minor\n");
--- a/drivers/media/mc/mc-devnode.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/media/mc/mc-devnode.c
@@ -217,7 +217,7 @@ int __must_check media_devnode_register(
 
 	/* Part 1: Find a free minor number */
 	mutex_lock(&media_devnode_lock);
-	minor = find_next_zero_bit(media_devnode_nums, MEDIA_NUM_DEVICES, 0);
+	minor = find_first_zero_bit(media_devnode_nums, MEDIA_NUM_DEVICES);
 	if (minor == MEDIA_NUM_DEVICES) {
 		mutex_unlock(&media_devnode_lock);
 		pr_err("could not get a free minor\n");
--- a/drivers/pci/controller/dwc/pci-dra7xx.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/pci/controller/dwc/pci-dra7xx.c
@@ -211,7 +211,7 @@ static int dra7xx_pcie_handle_msi(struct
 	if (!val)
 		return 0;
 
-	pos = find_next_bit(&val, MAX_MSI_IRQS_PER_CTRL, 0);
+	pos = find_first_bit(&val, MAX_MSI_IRQS_PER_CTRL);
 	while (pos != MAX_MSI_IRQS_PER_CTRL) {
 		irq = irq_find_mapping(pp->irq_domain,
 				       (index * MAX_MSI_IRQS_PER_CTRL) + pos);
--- a/drivers/scsi/lpfc/lpfc_sli.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/scsi/lpfc/lpfc_sli.c
@@ -17250,8 +17250,8 @@ lpfc_sli4_alloc_xri(struct lpfc_hba *phb
 	 * the driver starts at 0 each time.
 	 */
 	spin_lock_irq(&phba->hbalock);
-	xri = find_next_zero_bit(phba->sli4_hba.xri_bmask,
-				 phba->sli4_hba.max_cfg_param.max_xri, 0);
+	xri = find_first_zero_bit(phba->sli4_hba.xri_bmask,
+				 phba->sli4_hba.max_cfg_param.max_xri);
 	if (xri >= phba->sli4_hba.max_cfg_param.max_xri) {
 		spin_unlock_irq(&phba->hbalock);
 		return NO_XRI;
@@ -18928,7 +18928,7 @@ lpfc_sli4_alloc_rpi(struct lpfc_hba *phb
 	max_rpi = phba->sli4_hba.max_cfg_param.max_rpi;
 	rpi_limit = phba->sli4_hba.next_rpi;
 
-	rpi = find_next_zero_bit(phba->sli4_hba.rpi_bmask, rpi_limit, 0);
+	rpi = find_first_zero_bit(phba->sli4_hba.rpi_bmask, rpi_limit);
 	if (rpi >= rpi_limit)
 		rpi = LPFC_RPI_ALLOC_ERROR;
 	else {
@@ -19571,8 +19571,8 @@ next_priority:
 		 * have been tested so that we can detect when we should
 		 * change the priority level.
 		 */
-		next_fcf_index = find_next_bit(phba->fcf.fcf_rr_bmask,
-					       LPFC_SLI4_FCF_TBL_INDX_MAX, 0);
+		next_fcf_index = find_first_bit(phba->fcf.fcf_rr_bmask,
+					       LPFC_SLI4_FCF_TBL_INDX_MAX);
 	}
 
 
--- a/drivers/soc/ti/k3-ringacc.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/soc/ti/k3-ringacc.c
@@ -358,8 +358,8 @@ struct k3_ring *k3_ringacc_request_ring(
 		goto out;
 
 	if (flags & K3_RINGACC_RING_USE_PROXY) {
-		proxy_id = find_next_zero_bit(ringacc->proxy_inuse,
-					      ringacc->num_proxies, 0);
+		proxy_id = find_first_zero_bit(ringacc->proxy_inuse,
+					      ringacc->num_proxies);
 		if (proxy_id == ringacc->num_proxies)
 			goto error;
 	}
--- a/drivers/tty/n_tty.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/tty/n_tty.c
@@ -1975,7 +1975,7 @@ static bool canon_copy_from_read_buf(str
 	more = n - (size - tail);
 	if (eol == N_TTY_BUF_SIZE && more) {
 		/* scan wrapped without finding set bit */
-		eol = find_next_bit(ldata->read_flags, more, 0);
+		eol = find_first_bit(ldata->read_flags, more);
 		found = eol != more;
 	} else
 		found = eol != size;
--- a/drivers/virt/acrn/ioreq.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/drivers/virt/acrn/ioreq.c
@@ -246,8 +246,7 @@ void acrn_ioreq_request_clear(struct acr
 	spin_lock_bh(&vm->ioreq_clients_lock);
 	client = vm->default_client;
 	if (client) {
-		vcpu = find_next_bit(client->ioreqs_map,
-				     ACRN_IO_REQUEST_MAX, 0);
+		vcpu = find_first_bit(client->ioreqs_map, ACRN_IO_REQUEST_MAX);
 		while (vcpu < ACRN_IO_REQUEST_MAX) {
 			acrn_ioreq_complete_request(client, vcpu, NULL);
 			vcpu = find_next_bit(client->ioreqs_map,
--- a/fs/f2fs/segment.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/f2fs/segment.c
@@ -2495,8 +2495,8 @@ find_other_zone:
 	secno = find_next_zero_bit(free_i->free_secmap, MAIN_SECS(sbi), hint);
 	if (secno >= MAIN_SECS(sbi)) {
 		if (dir == ALLOC_RIGHT) {
-			secno = find_next_zero_bit(free_i->free_secmap,
-							MAIN_SECS(sbi), 0);
+			secno = find_first_zero_bit(free_i->free_secmap,
+							MAIN_SECS(sbi));
 			f2fs_bug_on(sbi, secno >= MAIN_SECS(sbi));
 		} else {
 			go_left = 1;
@@ -2511,8 +2511,8 @@ find_other_zone:
 			left_start--;
 			continue;
 		}
-		left_start = find_next_zero_bit(free_i->free_secmap,
-							MAIN_SECS(sbi), 0);
+		left_start = find_first_zero_bit(free_i->free_secmap,
+							MAIN_SECS(sbi));
 		f2fs_bug_on(sbi, left_start >= MAIN_SECS(sbi));
 		break;
 	}
--- a/fs/ocfs2/cluster/heartbeat.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/ocfs2/cluster/heartbeat.c
@@ -379,7 +379,7 @@ static void o2hb_nego_timeout(struct wor
 
 	o2hb_fill_node_map(live_node_bitmap, sizeof(live_node_bitmap));
 	/* lowest node as master node to make negotiate decision. */
-	master_node = find_next_bit(live_node_bitmap, O2NM_MAX_NODES, 0);
+	master_node = find_first_bit(live_node_bitmap, O2NM_MAX_NODES);
 
 	if (master_node == o2nm_this_node()) {
 		if (!test_bit(master_node, reg->hr_nego_node_bitmap)) {
--- a/fs/ocfs2/dlm/dlmdomain.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/ocfs2/dlm/dlmdomain.c
@@ -1045,7 +1045,7 @@ static int dlm_send_regions(struct dlm_c
 	int status, ret = 0, i;
 	char *p;
 
-	if (find_next_bit(node_map, O2NM_MAX_NODES, 0) >= O2NM_MAX_NODES)
+	if (find_first_bit(node_map, O2NM_MAX_NODES) >= O2NM_MAX_NODES)
 		goto bail;
 
 	qr = kzalloc(sizeof(struct dlm_query_region), GFP_KERNEL);
@@ -1217,7 +1217,7 @@ static int dlm_send_nodeinfo(struct dlm_
 	struct o2nm_node *node;
 	int ret = 0, status, count, i;
 
-	if (find_next_bit(node_map, O2NM_MAX_NODES, 0) >= O2NM_MAX_NODES)
+	if (find_first_bit(node_map, O2NM_MAX_NODES) >= O2NM_MAX_NODES)
 		goto bail;
 
 	qn = kzalloc(sizeof(struct dlm_query_nodeinfo), GFP_KERNEL);
--- a/fs/ocfs2/dlm/dlmmaster.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/ocfs2/dlm/dlmmaster.c
@@ -861,7 +861,7 @@ lookup:
 		 * to see if there are any nodes that still need to be
 		 * considered.  these will not appear in the mle nodemap
 		 * but they might own this lockres.  wait on them. */
-		bit = find_next_bit(dlm->recovery_map, O2NM_MAX_NODES, 0);
+		bit = find_first_bit(dlm->recovery_map, O2NM_MAX_NODES);
 		if (bit < O2NM_MAX_NODES) {
 			mlog(0, "%s: res %.*s, At least one node (%d) "
 			     "to recover before lock mastery can begin\n",
@@ -912,7 +912,7 @@ redo_request:
 		dlm_wait_for_recovery(dlm);
 
 		spin_lock(&dlm->spinlock);
-		bit = find_next_bit(dlm->recovery_map, O2NM_MAX_NODES, 0);
+		bit = find_first_bit(dlm->recovery_map, O2NM_MAX_NODES);
 		if (bit < O2NM_MAX_NODES) {
 			mlog(0, "%s: res %.*s, At least one node (%d) "
 			     "to recover before lock mastery can begin\n",
@@ -1079,7 +1079,7 @@ recheck:
 		sleep = 1;
 		/* have all nodes responded? */
 		if (voting_done && !*blocked) {
-			bit = find_next_bit(mle->maybe_map, O2NM_MAX_NODES, 0);
+			bit = find_first_bit(mle->maybe_map, O2NM_MAX_NODES);
 			if (dlm->node_num <= bit) {
 				/* my node number is lowest.
 			 	 * now tell other nodes that I am
@@ -1234,8 +1234,8 @@ static int dlm_restart_lock_mastery(stru
 		} else {
 			mlog(ML_ERROR, "node down! %d\n", node);
 			if (blocked) {
-				int lowest = find_next_bit(mle->maybe_map,
-						       O2NM_MAX_NODES, 0);
+				int lowest = find_first_bit(mle->maybe_map,
+						       O2NM_MAX_NODES);
 
 				/* act like it was never there */
 				clear_bit(node, mle->maybe_map);
@@ -1795,7 +1795,7 @@ int dlm_assert_master_handler(struct o2n
 		     "MLE for it! (%.*s)\n", assert->node_idx,
 		     namelen, name);
 	} else {
-		int bit = find_next_bit (mle->maybe_map, O2NM_MAX_NODES, 0);
+		int bit = find_first_bit(mle->maybe_map, O2NM_MAX_NODES);
 		if (bit >= O2NM_MAX_NODES) {
 			/* not necessarily an error, though less likely.
 			 * could be master just re-asserting. */
@@ -2521,7 +2521,7 @@ static int dlm_is_lockres_migratable(str
 	}
 
 	if (!nonlocal) {
-		node_ref = find_next_bit(res->refmap, O2NM_MAX_NODES, 0);
+		node_ref = find_first_bit(res->refmap, O2NM_MAX_NODES);
 		if (node_ref >= O2NM_MAX_NODES)
 			return 0;
 	}
@@ -3303,7 +3303,7 @@ static void dlm_clean_block_mle(struct d
 	BUG_ON(mle->type != DLM_MLE_BLOCK);
 
 	spin_lock(&mle->spinlock);
-	bit = find_next_bit(mle->maybe_map, O2NM_MAX_NODES, 0);
+	bit = find_first_bit(mle->maybe_map, O2NM_MAX_NODES);
 	if (bit != dead_node) {
 		mlog(0, "mle found, but dead node %u would not have been "
 		     "master\n", dead_node);
@@ -3542,7 +3542,7 @@ void dlm_force_free_mles(struct dlm_ctxt
 	spin_lock(&dlm->master_lock);
 
 	BUG_ON(dlm->dlm_state != DLM_CTXT_LEAVING);
-	BUG_ON((find_next_bit(dlm->domain_map, O2NM_MAX_NODES, 0) < O2NM_MAX_NODES));
+	BUG_ON((find_first_bit(dlm->domain_map, O2NM_MAX_NODES) < O2NM_MAX_NODES));
 
 	for (i = 0; i < DLM_HASH_BUCKETS; i++) {
 		bucket = dlm_master_hash(dlm, i);
--- a/fs/ocfs2/dlm/dlmrecovery.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/ocfs2/dlm/dlmrecovery.c
@@ -451,7 +451,7 @@ static int dlm_do_recovery(struct dlm_ct
 	if (dlm->reco.dead_node == O2NM_INVALID_NODE_NUM) {
 		int bit;
 
-		bit = find_next_bit (dlm->recovery_map, O2NM_MAX_NODES, 0);
+		bit = find_first_bit(dlm->recovery_map, O2NM_MAX_NODES);
 		if (bit >= O2NM_MAX_NODES || bit < 0)
 			dlm_set_reco_dead_node(dlm, O2NM_INVALID_NODE_NUM);
 		else
--- a/fs/ocfs2/dlm/dlmthread.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/fs/ocfs2/dlm/dlmthread.c
@@ -92,7 +92,7 @@ int __dlm_lockres_unused(struct dlm_lock
 		return 0;
 
 	/* Another node has this resource with this node as the master */
-	bit = find_next_bit(res->refmap, O2NM_MAX_NODES, 0);
+	bit = find_first_bit(res->refmap, O2NM_MAX_NODES);
 	if (bit < O2NM_MAX_NODES)
 		return 0;
 
--- a/lib/genalloc.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/lib/genalloc.c
@@ -251,7 +251,7 @@ void gen_pool_destroy(struct gen_pool *p
 		list_del(&chunk->next_chunk);
 
 		end_bit = chunk_size(chunk) >> order;
-		bit = find_next_bit(chunk->bits, end_bit, 0);
+		bit = find_first_bit(chunk->bits, end_bit);
 		BUG_ON(bit < end_bit);
 
 		vfree(chunk);
--- a/net/ncsi/ncsi-manage.c~all-replace-find_next_zero_bit-with-find_first_zero_bit-where-appropriate
+++ a/net/ncsi/ncsi-manage.c
@@ -608,7 +608,7 @@ static int clear_one_vid(struct ncsi_dev
 	bitmap = &ncf->bitmap;
 
 	spin_lock_irqsave(&nc->lock, flags);
-	index = find_next_bit(bitmap, ncf->n_vids, 0);
+	index = find_first_bit(bitmap, ncf->n_vids);
 	if (index >= ncf->n_vids) {
 		spin_unlock_irqrestore(&nc->lock, flags);
 		return -1;
@@ -667,7 +667,7 @@ static int set_one_vid(struct ncsi_dev_p
 		return -1;
 	}
 
-	index = find_next_zero_bit(bitmap, ncf->n_vids, 0);
+	index = find_first_zero_bit(bitmap, ncf->n_vids);
 	if (index < 0 || index >= ncf->n_vids) {
 		netdev_err(ndp->ndev.dev,
 			   "Channel %u already has all VLAN filters set\n",
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 114/147] tools: sync tools/bitmap with mother linux
  2021-09-08  2:52 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2021-09-08  2:59 ` [patch 113/147] all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 115/147] cpumask: replace cpumask_next_* with cpumask_first_* where appropriate Andrew Morton
                   ` (33 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: tools: sync tools/bitmap with mother linux

Remove tools/include/asm-generic/bitops/find.h and copy
include/linux/bitmap.h to tools.  The find_*_le() functions are not
copied because they are not needed in tools.
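
As an illustration (hypothetical tools-side user, not part of this
patch), find.h must now be reached through bitmap.h rather than
included directly:

    /* tools program sketch: <linux/find.h> #errors out if included
     * directly, so pull it in via <linux/bitmap.h> */
    #include <linux/bitmap.h>

    static unsigned long first_set(const unsigned long *map,
                                   unsigned long nbits)
    {
        /* returns nbits if no bit is set */
        return find_first_bit(map, nbits);
    }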

Link: https://lkml.kernel.org/r/20210814211713.180533-9-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 MAINTAINERS                                   |  2 +-
 tools/include/asm-generic/bitops.h            |  1 -
 tools/include/linux/bitmap.h                  |  7 +-
 .../{asm-generic/bitops => linux}/find.h      | 81 +++++++++++++++++--
 tools/lib/find_bit.c                          | 20 +++++
 5 files changed, 100 insertions(+), 11 deletions(-)
 rename tools/include/{asm-generic/bitops => linux}/find.h (63%)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9b62293f7b72..b033083dbb42 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3277,8 +3277,8 @@ F:	lib/bitmap.c
 F:	lib/find_bit.c
 F:	lib/find_bit_benchmark.c
 F:	lib/test_bitmap.c
-F:	tools/include/asm-generic/bitops/find.h
 F:	tools/include/linux/bitmap.h
+F:	tools/include/linux/find.h
 F:	tools/lib/bitmap.c
 F:	tools/lib/find_bit.c
 
diff --git a/tools/include/asm-generic/bitops.h b/tools/include/asm-generic/bitops.h
index 5d2ab38965cc..9ab313e93555 100644
--- a/tools/include/asm-generic/bitops.h
+++ b/tools/include/asm-generic/bitops.h
@@ -18,7 +18,6 @@
 #include <asm-generic/bitops/fls.h>
 #include <asm-generic/bitops/__fls.h>
 #include <asm-generic/bitops/fls64.h>
-#include <asm-generic/bitops/find.h>
 
 #ifndef _TOOLS_LINUX_BITOPS_H_
 #error only <linux/bitops.h> can be included directly
diff --git a/tools/include/linux/bitmap.h b/tools/include/linux/bitmap.h
index 9d959bc24859..13d90b574970 100644
--- a/tools/include/linux/bitmap.h
+++ b/tools/include/linux/bitmap.h
@@ -1,9 +1,10 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _PERF_BITOPS_H
-#define _PERF_BITOPS_H
+#ifndef _TOOLS_LINUX_BITMAP_H
+#define _TOOLS_LINUX_BITMAP_H
 
 #include <string.h>
 #include <linux/bitops.h>
+#include <linux/find.h>
 #include <stdlib.h>
 #include <linux/kernel.h>
 
@@ -181,4 +182,4 @@ static inline int bitmap_intersects(const unsigned long *src1,
 		return __bitmap_intersects(src1, src2, nbits);
 }
 
-#endif /* _PERF_BITOPS_H */
+#endif /* _TOOLS_LINUX_BITMAP_H */
diff --git a/tools/include/asm-generic/bitops/find.h b/tools/include/linux/find.h
similarity index 63%
rename from tools/include/asm-generic/bitops/find.h
rename to tools/include/linux/find.h
index 6481fd11012a..47e2bd6c5174 100644
--- a/tools/include/asm-generic/bitops/find.h
+++ b/tools/include/linux/find.h
@@ -1,11 +1,19 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _TOOLS_LINUX_ASM_GENERIC_BITOPS_FIND_H_
-#define _TOOLS_LINUX_ASM_GENERIC_BITOPS_FIND_H_
+#ifndef _TOOLS_LINUX_FIND_H_
+#define _TOOLS_LINUX_FIND_H_
+
+#ifndef _TOOLS_LINUX_BITMAP_H
+#error tools: only <linux/bitmap.h> can be included directly
+#endif
+
+#include <linux/bitops.h>
 
 extern unsigned long _find_next_bit(const unsigned long *addr1,
 		const unsigned long *addr2, unsigned long nbits,
 		unsigned long start, unsigned long invert, unsigned long le);
 extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
+extern unsigned long _find_first_and_bit(const unsigned long *addr1,
+					 const unsigned long *addr2, unsigned long size);
 extern unsigned long _find_first_zero_bit(const unsigned long *addr, unsigned long size);
 extern unsigned long _find_last_bit(const unsigned long *addr, unsigned long size);
 
@@ -96,7 +104,6 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
 #endif
 
 #ifndef find_first_bit
-
 /**
  * find_first_bit - find the first set bit in a memory region
  * @addr: The address to start the search at
@@ -116,11 +123,34 @@ unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
 
 	return _find_first_bit(addr, size);
 }
+#endif
+
+#ifndef find_first_and_bit
+/**
+ * find_first_and_bit - find the first set bit in both memory regions
+ * @addr1: The first address to base the search on
+ * @addr2: The second address to base the search on
+ * @size: The bitmap size in bits
+ *
+ * Returns the bit number for the next set bit
+ * If no bits are set, returns @size.
+ */
+static inline
+unsigned long find_first_and_bit(const unsigned long *addr1,
+				 const unsigned long *addr2,
+				 unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr1 & *addr2 & GENMASK(size - 1, 0);
 
-#endif /* find_first_bit */
+		return val ? __ffs(val) : size;
+	}
 
-#ifndef find_first_zero_bit
+	return _find_first_and_bit(addr1, addr2, size);
+}
+#endif
 
+#ifndef find_first_zero_bit
 /**
  * find_first_zero_bit - find the first cleared bit in a memory region
  * @addr: The address to start the search at
@@ -142,4 +172,43 @@ unsigned long find_first_zero_bit(const unsigned long *addr, unsigned long size)
 }
 #endif
 
-#endif /*_TOOLS_LINUX_ASM_GENERIC_BITOPS_FIND_H_ */
+#ifndef find_last_bit
+/**
+ * find_last_bit - find the last set bit in a memory region
+ * @addr: The address to start the search at
+ * @size: The number of bits to search
+ *
+ * Returns the bit number of the last set bit, or size.
+ */
+static inline
+unsigned long find_last_bit(const unsigned long *addr, unsigned long size)
+{
+	if (small_const_nbits(size)) {
+		unsigned long val = *addr & GENMASK(size - 1, 0);
+
+		return val ? __fls(val) : size;
+	}
+
+	return _find_last_bit(addr, size);
+}
+#endif
+
+/**
+ * find_next_clump8 - find next 8-bit clump with set bits in a memory region
+ * @clump: location to store copy of found clump
+ * @addr: address to base the search on
+ * @size: bitmap size in number of bits
+ * @offset: bit offset at which to start searching
+ *
+ * Returns the bit offset for the next set clump; the found clump value is
+ * copied to the location pointed by @clump. If no bits are set, returns @size.
+ */
+extern unsigned long find_next_clump8(unsigned long *clump,
+				      const unsigned long *addr,
+				      unsigned long size, unsigned long offset);
+
+#define find_first_clump8(clump, bits, size) \
+	find_next_clump8((clump), (bits), (size), 0)
+
+
+#endif /*__LINUX_FIND_H_ */
diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c
index 109aa7ffcf97..ba4b8d94e004 100644
--- a/tools/lib/find_bit.c
+++ b/tools/lib/find_bit.c
@@ -96,6 +96,26 @@ unsigned long _find_first_bit(const unsigned long *addr, unsigned long size)
 }
 #endif
 
+#ifndef find_first_and_bit
+/*
+ * Find the first set bit in two memory regions.
+ */
+unsigned long _find_first_and_bit(const unsigned long *addr1,
+				  const unsigned long *addr2,
+				  unsigned long size)
+{
+	unsigned long idx, val;
+
+	for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
+		val = addr1[idx] & addr2[idx];
+		if (val)
+			return min(idx * BITS_PER_LONG + __ffs(val), size);
+	}
+
+	return size;
+}
+#endif
+
 #ifndef find_first_zero_bit
 /*
  * Find the first cleared bit in a memory region.
-- 
_

^ permalink raw reply related	[flat|nested] 199+ messages in thread

* [patch 115/147] cpumask: replace cpumask_next_* with cpumask_first_* where appropriate
  2021-09-08  2:52 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2021-09-08  2:59 ` [patch 114/147] tools: sync tools/bitmap with mother linux Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 116/147] include/linux: move for_each_bit() macros from bitops.h to find.h Andrew Morton
                   ` (32 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: cpumask: replace cpumask_next_* with cpumask_first_* where appropriate

cpumask_first() is a more efficient analogue of the 'next' version when
n == -1 (which means the search starts at bit 0).  This patch replaces
'next' with 'first' where the conversion is trivial.

There's no cpumask_first_zero() function, so create it.
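
A sketch of the conversion (the mask name is taken from the patch, the
error path is hypothetical):

    /* first clear bit, or >= nr_cpu_ids if every cpu is set */
    unsigned int cpu = cpumask_first_zero(&portal_cpus);

    if (cpu >= nr_cpu_ids)
        return -ENODEV; /* hypothetical: no free portal cpu */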

Link: https://lkml.kernel.org/r/20210814211713.180533-10-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/include/asm/cputhreads.h |    2 +-
 block/blk-mq.c                        |    2 +-
 drivers/net/virtio_net.c              |    2 +-
 drivers/soc/fsl/qbman/bman_portal.c   |    2 +-
 drivers/soc/fsl/qbman/qman_portal.c   |    2 +-
 include/linux/cpumask.h               |   16 ++++++++++++++++
 kernel/time/clocksource.c             |    4 ++--
 7 files changed, 23 insertions(+), 7 deletions(-)

--- a/arch/powerpc/include/asm/cputhreads.h~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/arch/powerpc/include/asm/cputhreads.h
@@ -52,7 +52,7 @@ static inline cpumask_t cpu_thread_mask_
 	for (i = 0; i < NR_CPUS; i += threads_per_core) {
 		cpumask_shift_left(&tmp, &threads_core_mask, i);
 		if (cpumask_intersects(threads, &tmp)) {
-			cpu = cpumask_next_and(-1, &tmp, cpu_online_mask);
+			cpu = cpumask_first_and(&tmp, cpu_online_mask);
 			if (cpu < nr_cpu_ids)
 				cpumask_set_cpu(cpu, &res);
 		}
--- a/block/blk-mq.c~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/block/blk-mq.c
@@ -2524,7 +2524,7 @@ static bool blk_mq_hctx_has_requests(str
 static inline bool blk_mq_last_cpu_in_hctx(unsigned int cpu,
 		struct blk_mq_hw_ctx *hctx)
 {
-	if (cpumask_next_and(-1, hctx->cpumask, cpu_online_mask) != cpu)
+	if (cpumask_first_and(hctx->cpumask, cpu_online_mask) != cpu)
 		return false;
 	if (cpumask_next_and(cpu, hctx->cpumask, cpu_online_mask) < nr_cpu_ids)
 		return false;
--- a/drivers/net/virtio_net.c~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/drivers/net/virtio_net.c
@@ -2091,7 +2091,7 @@ static void virtnet_set_affinity(struct
 	stragglers = num_cpu >= vi->curr_queue_pairs ?
 			num_cpu % vi->curr_queue_pairs :
 			0;
-	cpu = cpumask_next(-1, cpu_online_mask);
+	cpu = cpumask_first(cpu_online_mask);
 
 	for (i = 0; i < vi->curr_queue_pairs; i++) {
 		group_size = stride + (i < stragglers ? 1 : 0);
--- a/drivers/soc/fsl/qbman/bman_portal.c~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/drivers/soc/fsl/qbman/bman_portal.c
@@ -155,7 +155,7 @@ static int bman_portal_probe(struct plat
 	}
 
 	spin_lock(&bman_lock);
-	cpu = cpumask_next_zero(-1, &portal_cpus);
+	cpu = cpumask_first_zero(&portal_cpus);
 	if (cpu >= nr_cpu_ids) {
 		__bman_portals_probed = 1;
 		/* unassigned portal, skip init */
--- a/drivers/soc/fsl/qbman/qman_portal.c~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/drivers/soc/fsl/qbman/qman_portal.c
@@ -248,7 +248,7 @@ static int qman_portal_probe(struct plat
 	pcfg->pools = qm_get_pools_sdqcr();
 
 	spin_lock(&qman_lock);
-	cpu = cpumask_next_zero(-1, &portal_cpus);
+	cpu = cpumask_first_zero(&portal_cpus);
 	if (cpu >= nr_cpu_ids) {
 		__qman_portals_probed = 1;
 		/* unassigned portal, skip init */
--- a/include/linux/cpumask.h~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/include/linux/cpumask.h
@@ -123,6 +123,11 @@ static inline unsigned int cpumask_first
 	return 0;
 }
 
+static inline unsigned int cpumask_first_zero(const struct cpumask *srcp)
+{
+	return 0;
+}
+
 static inline unsigned int cpumask_first_and(const struct cpumask *srcp1,
 					     const struct cpumask *srcp2)
 {
@@ -202,6 +207,17 @@ static inline unsigned int cpumask_first
 }
 
 /**
+ * cpumask_first_zero - get the first unset cpu in a cpumask
+ * @srcp: the cpumask pointer
+ *
+ * Returns >= nr_cpu_ids if all cpus are set.
+ */
+static inline unsigned int cpumask_first_zero(const struct cpumask *srcp)
+{
+	return find_first_zero_bit(cpumask_bits(srcp), nr_cpumask_bits);
+}
+
+/**
  * cpumask_first_and - return the first cpu from *srcp1 & *srcp2
  * @src1p: the first input
  * @src2p: the second input
--- a/kernel/time/clocksource.c~cpumask-replace-cpumask_next_-with-cpumask_first_-where-appropriate
+++ a/kernel/time/clocksource.c
@@ -257,7 +257,7 @@ static void clocksource_verify_choose_cp
 		return;
 
 	/* Make sure to select at least one CPU other than the current CPU. */
-	cpu = cpumask_next(-1, cpu_online_mask);
+	cpu = cpumask_first(cpu_online_mask);
 	if (cpu == smp_processor_id())
 		cpu = cpumask_next(cpu, cpu_online_mask);
 	if (WARN_ON_ONCE(cpu >= nr_cpu_ids))
@@ -279,7 +279,7 @@ static void clocksource_verify_choose_cp
 		cpu = prandom_u32() % nr_cpu_ids;
 		cpu = cpumask_next(cpu - 1, cpu_online_mask);
 		if (cpu >= nr_cpu_ids)
-			cpu = cpumask_next(-1, cpu_online_mask);
+			cpu = cpumask_first(cpu_online_mask);
 		if (!WARN_ON_ONCE(cpu >= nr_cpu_ids))
 			cpumask_set_cpu(cpu, &cpus_chosen);
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 116/147] include/linux: move for_each_bit() macros from bitops.h to find.h
  2021-09-08  2:52 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2021-09-08  2:59 ` [patch 115/147] cpumask: replace cpumask_next_* with cpumask_first_* where appropriate Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 117/147] find: micro-optimize for_each_{set,clear}_bit() Andrew Morton
                   ` (31 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: include/linux: move for_each_bit() macros from bitops.h to find.h

for_each_bit() macros depend on find_bit() machinery, and so the proper
place for them is the find.h header.
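
For reference, a minimal illustrative user of the moved macros (not
part of this patch):

    DECLARE_BITMAP(map, 64);
    unsigned int bit;

    bitmap_zero(map, 64);
    set_bit(3, map);
    set_bit(17, map);

    /* visits bit 3, then bit 17 */
    for_each_set_bit(bit, map, 64)
        pr_info("bit %u is set\n", bit);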

Link: https://lkml.kernel.org/r/20210814211713.180533-11-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/bitops.h |   34 ----------------------------------
 include/linux/find.h   |   34 ++++++++++++++++++++++++++++++++++
 2 files changed, 34 insertions(+), 34 deletions(-)

--- a/include/linux/bitops.h~include-linux-move-for_each_bit-macros-from-bitopsh-to-findh
+++ a/include/linux/bitops.h
@@ -31,40 +31,6 @@ extern unsigned long __sw_hweight64(__u6
  */
 #include <asm/bitops.h>
 
-#define for_each_set_bit(bit, addr, size) \
-	for ((bit) = find_first_bit((addr), (size));		\
-	     (bit) < (size);					\
-	     (bit) = find_next_bit((addr), (size), (bit) + 1))
-
-/* same as for_each_set_bit() but use bit as value to start with */
-#define for_each_set_bit_from(bit, addr, size) \
-	for ((bit) = find_next_bit((addr), (size), (bit));	\
-	     (bit) < (size);					\
-	     (bit) = find_next_bit((addr), (size), (bit) + 1))
-
-#define for_each_clear_bit(bit, addr, size) \
-	for ((bit) = find_first_zero_bit((addr), (size));	\
-	     (bit) < (size);					\
-	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
-
-/* same as for_each_clear_bit() but use bit as value to start with */
-#define for_each_clear_bit_from(bit, addr, size) \
-	for ((bit) = find_next_zero_bit((addr), (size), (bit));	\
-	     (bit) < (size);					\
-	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
-
-/**
- * for_each_set_clump8 - iterate over bitmap for each 8-bit clump with set bits
- * @start: bit offset to start search and to store the current iteration offset
- * @clump: location to store copy of current 8-bit clump
- * @bits: bitmap address to base the search on
- * @size: bitmap size in number of bits
- */
-#define for_each_set_clump8(start, clump, bits, size) \
-	for ((start) = find_first_clump8(&(clump), (bits), (size)); \
-	     (start) < (size); \
-	     (start) = find_next_clump8(&(clump), (bits), (size), (start) + 8))
-
 static inline int get_bitmask_order(unsigned int count)
 {
 	int order;
--- a/include/linux/find.h~include-linux-move-for_each_bit-macros-from-bitopsh-to-findh
+++ a/include/linux/find.h
@@ -279,4 +279,38 @@ unsigned long find_next_bit_le(const voi
 #error "Please fix <asm/byteorder.h>"
 #endif
 
+#define for_each_set_bit(bit, addr, size) \
+	for ((bit) = find_first_bit((addr), (size));		\
+	     (bit) < (size);					\
+	     (bit) = find_next_bit((addr), (size), (bit) + 1))
+
+/* same as for_each_set_bit() but use bit as value to start with */
+#define for_each_set_bit_from(bit, addr, size) \
+	for ((bit) = find_next_bit((addr), (size), (bit));	\
+	     (bit) < (size);					\
+	     (bit) = find_next_bit((addr), (size), (bit) + 1))
+
+#define for_each_clear_bit(bit, addr, size) \
+	for ((bit) = find_first_zero_bit((addr), (size));	\
+	     (bit) < (size);					\
+	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
+
+/* same as for_each_clear_bit() but use bit as value to start with */
+#define for_each_clear_bit_from(bit, addr, size) \
+	for ((bit) = find_next_zero_bit((addr), (size), (bit));	\
+	     (bit) < (size);					\
+	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
+
+/**
+ * for_each_set_clump8 - iterate over bitmap for each 8-bit clump with set bits
+ * @start: bit offset to start search and to store the current iteration offset
+ * @clump: location to store copy of current 8-bit clump
+ * @bits: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_clump8(start, clump, bits, size) \
+	for ((start) = find_first_clump8(&(clump), (bits), (size)); \
+	     (start) < (size); \
+	     (start) = find_next_clump8(&(clump), (bits), (size), (start) + 8))
+
 #endif /*__LINUX_FIND_H_ */
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 117/147] find: micro-optimize for_each_{set,clear}_bit()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2021-09-08  2:59 ` [patch 116/147] include/linux: move for_each_bit() macros from bitops.h to find.h Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 118/147] bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate Andrew Morton
                   ` (30 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: find: micro-optimize for_each_{set,clear}_bit()

The macros iterate through all set/clear bits in a bitmap.  They find
the first bit using find_first_bit() and the remaining bits using
find_next_bit().

Since find_next_bit() is called shortly after find_first_bit(), we can
save a few I-cache lines by not using find_first_bit() at all.
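
The change relies on the identity sketched below (illustrative, not
part of the patch):

    DECLARE_BITMAP(map, 128);

    bitmap_fill(map, 128);
    /* find_first_bit(map, size) == find_next_bit(map, size, 0)
     * for any map, so the loop header can use find_next_bit() only */
    WARN_ON(find_first_bit(map, 128) != find_next_bit(map, 128, 0));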

Link: https://lkml.kernel.org/r/20210814211713.180533-12-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/find.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/include/linux/find.h~find-micro-optimize-for_each_setclear_bit
+++ a/include/linux/find.h
@@ -280,7 +280,7 @@ unsigned long find_next_bit_le(const voi
 #endif
 
 #define for_each_set_bit(bit, addr, size) \
-	for ((bit) = find_first_bit((addr), (size));		\
+	for ((bit) = find_next_bit((addr), (size), 0);		\
 	     (bit) < (size);					\
 	     (bit) = find_next_bit((addr), (size), (bit) + 1))
 
@@ -291,7 +291,7 @@ unsigned long find_next_bit_le(const voi
 	     (bit) = find_next_bit((addr), (size), (bit) + 1))
 
 #define for_each_clear_bit(bit, addr, size) \
-	for ((bit) = find_first_zero_bit((addr), (size));	\
+	for ((bit) = find_next_zero_bit((addr), (size), 0);	\
 	     (bit) < (size);					\
 	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 118/147] bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate
  2021-09-08  2:52 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2021-09-08  2:59 ` [patch 117/147] find: micro-optimize for_each_{set,clear}_bit() Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 119/147] tools: rename bitmap_alloc() to bitmap_zalloc() Andrew Morton
                   ` (29 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate

A couple of kernel functions call for_each_*_bit_from() with start bit
equal to 0.  Replace them with for_each_*_bit().

No functional changes, but this might improve readability.
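
A sketch of the conversion; map, nbits and handle() are hypothetical:

    unsigned int bit = 0;

    /* before: bit must be pre-initialized to 0 */
    for_each_set_bit_from(bit, map, nbits)
        handle(bit);

    /* after: no pre-initialization needed */
    for_each_set_bit(bit, map, nbits)
        handle(bit);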

Link: https://lkml.kernel.org/r/20210814211713.180533-13-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/apic/vector.c         |    4 ++--
 drivers/gpu/drm/etnaviv/etnaviv_gpu.c |    4 ++--
 drivers/hwmon/ltc2992.c               |    3 +--
 3 files changed, 5 insertions(+), 6 deletions(-)

--- a/arch/x86/kernel/apic/vector.c~replace-for_each__bit_from-with-for_each__bit-where-appropriate
+++ a/arch/x86/kernel/apic/vector.c
@@ -760,9 +760,9 @@ void __init lapic_update_legacy_vectors(
 
 void __init lapic_assign_system_vectors(void)
 {
-	unsigned int i, vector = 0;
+	unsigned int i, vector;
 
-	for_each_set_bit_from(vector, system_vectors, NR_VECTORS)
+	for_each_set_bit(vector, system_vectors, NR_VECTORS)
 		irq_matrix_assign_system(vector_matrix, vector, false);
 
 	if (nr_legacy_irqs() > 1)
--- a/drivers/gpu/drm/etnaviv/etnaviv_gpu.c~replace-for_each__bit_from-with-for_each__bit-where-appropriate
+++ a/drivers/gpu/drm/etnaviv/etnaviv_gpu.c
@@ -1032,7 +1032,7 @@ pm_put:
 
 void etnaviv_gpu_recover_hang(struct etnaviv_gpu *gpu)
 {
-	unsigned int i = 0;
+	unsigned int i;
 
 	dev_err(gpu->dev, "recover hung GPU!\n");
 
@@ -1045,7 +1045,7 @@ void etnaviv_gpu_recover_hang(struct etn
 
 	/* complete all events, the GPU won't do it after the reset */
 	spin_lock(&gpu->event_spinlock);
-	for_each_set_bit_from(i, gpu->event_bitmap, ETNA_NR_EVENTS)
+	for_each_set_bit(i, gpu->event_bitmap, ETNA_NR_EVENTS)
 		complete(&gpu->event_free);
 	bitmap_zero(gpu->event_bitmap, ETNA_NR_EVENTS);
 	spin_unlock(&gpu->event_spinlock);
--- a/drivers/hwmon/ltc2992.c~replace-for_each__bit_from-with-for_each__bit-where-appropriate
+++ a/drivers/hwmon/ltc2992.c
@@ -248,8 +248,7 @@ static int ltc2992_gpio_get_multiple(str
 
 	gpio_status = reg;
 
-	gpio_nr = 0;
-	for_each_set_bit_from(gpio_nr, mask, LTC2992_GPIO_NR) {
+	for_each_set_bit(gpio_nr, mask, LTC2992_GPIO_NR) {
 		if (test_bit(LTC2992_GPIO_BIT(gpio_nr), &gpio_status))
 			set_bit(gpio_nr, bits);
 	}
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 119/147] tools: rename bitmap_alloc() to bitmap_zalloc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (117 preceding siblings ...)
  2021-09-08  2:59 ` [patch 118/147] bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 120/147] mm/percpu: micro-optimize pcpu_is_populated() Andrew Morton
                   ` (28 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: tools: rename bitmap_alloc() to bitmap_zalloc()

Rename bitmap_alloc() to bitmap_zalloc() in tools to follow the bitmap API
in the kernel.

No functional changes intended.
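
A usage sketch (error handling hypothetical); the tools helper is
backed by calloc(), so the returned bitmap is already zeroed:

    unsigned long *bm = bitmap_zalloc(nbits);

    if (!bm)
        return -ENOMEM;
    /* ... use bm ... */
    free(bm);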

Link: https://lkml.kernel.org/r/20210814211713.180533-14-yury.norov@gmail.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Suggested-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/include/linux/bitmap.h                            |    4 ++--
 tools/perf/bench/find-bit-bench.c                       |    2 +-
 tools/perf/builtin-c2c.c                                |    6 +++---
 tools/perf/builtin-record.c                             |    2 +-
 tools/perf/tests/bitmap.c                               |    2 +-
 tools/perf/tests/mem2node.c                             |    2 +-
 tools/perf/util/affinity.c                              |    4 ++--
 tools/perf/util/header.c                                |    4 ++--
 tools/perf/util/metricgroup.c                           |    2 +-
 tools/perf/util/mmap.c                                  |    4 ++--
 tools/testing/selftests/kvm/dirty_log_perf_test.c       |    2 +-
 tools/testing/selftests/kvm/dirty_log_test.c            |    4 ++--
 tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c |    2 +-
 13 files changed, 20 insertions(+), 20 deletions(-)

--- a/tools/include/linux/bitmap.h~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/include/linux/bitmap.h
@@ -112,10 +112,10 @@ static inline int test_and_clear_bit(int
 }
 
 /**
- * bitmap_alloc - Allocate bitmap
+ * bitmap_zalloc - Allocate bitmap
  * @nbits: Number of bits
  */
-static inline unsigned long *bitmap_alloc(int nbits)
+static inline unsigned long *bitmap_zalloc(int nbits)
 {
 	return calloc(1, BITS_TO_LONGS(nbits) * sizeof(unsigned long));
 }
--- a/tools/perf/bench/find-bit-bench.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/bench/find-bit-bench.c
@@ -54,7 +54,7 @@ static bool asm_test_bit(long nr, const
 
 static int do_for_each_set_bit(unsigned int num_bits)
 {
-	unsigned long *to_test = bitmap_alloc(num_bits);
+	unsigned long *to_test = bitmap_zalloc(num_bits);
 	struct timeval start, end, diff;
 	u64 runtime_us;
 	struct stats fb_time_stats, tb_time_stats;
--- a/tools/perf/builtin-c2c.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/builtin-c2c.c
@@ -139,11 +139,11 @@ static void *c2c_he_zalloc(size_t size)
 	if (!c2c_he)
 		return NULL;
 
-	c2c_he->cpuset = bitmap_alloc(c2c.cpus_cnt);
+	c2c_he->cpuset = bitmap_zalloc(c2c.cpus_cnt);
 	if (!c2c_he->cpuset)
 		return NULL;
 
-	c2c_he->nodeset = bitmap_alloc(c2c.nodes_cnt);
+	c2c_he->nodeset = bitmap_zalloc(c2c.nodes_cnt);
 	if (!c2c_he->nodeset)
 		return NULL;
 
@@ -2047,7 +2047,7 @@ static int setup_nodes(struct perf_sessi
 		struct perf_cpu_map *map = n[node].map;
 		unsigned long *set;
 
-		set = bitmap_alloc(c2c.cpus_cnt);
+		set = bitmap_zalloc(c2c.cpus_cnt);
 		if (!set)
 			return -ENOMEM;
 
--- a/tools/perf/builtin-record.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/builtin-record.c
@@ -2786,7 +2786,7 @@ int cmd_record(int argc, const char **ar
 
 	if (rec->opts.affinity != PERF_AFFINITY_SYS) {
 		rec->affinity_mask.nbits = cpu__max_cpu();
-		rec->affinity_mask.bits = bitmap_alloc(rec->affinity_mask.nbits);
+		rec->affinity_mask.bits = bitmap_zalloc(rec->affinity_mask.nbits);
 		if (!rec->affinity_mask.bits) {
 			pr_err("Failed to allocate thread mask for %zd cpus\n", rec->affinity_mask.nbits);
 			err = -ENOMEM;
--- a/tools/perf/tests/bitmap.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/tests/bitmap.c
@@ -14,7 +14,7 @@ static unsigned long *get_bitmap(const c
 	unsigned long *bm = NULL;
 	int i;
 
-	bm = bitmap_alloc(nbits);
+	bm = bitmap_zalloc(nbits);
 
 	if (map && bm) {
 		for (i = 0; i < map->nr; i++)
--- a/tools/perf/tests/mem2node.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/tests/mem2node.c
@@ -27,7 +27,7 @@ static unsigned long *get_bitmap(const c
 	unsigned long *bm = NULL;
 	int i;
 
-	bm = bitmap_alloc(nbits);
+	bm = bitmap_zalloc(nbits);
 
 	if (map && bm) {
 		for (i = 0; i < map->nr; i++) {
--- a/tools/perf/util/affinity.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/util/affinity.c
@@ -25,11 +25,11 @@ int affinity__setup(struct affinity *a)
 {
 	int cpu_set_size = get_cpu_set_size();
 
-	a->orig_cpus = bitmap_alloc(cpu_set_size * 8);
+	a->orig_cpus = bitmap_zalloc(cpu_set_size * 8);
 	if (!a->orig_cpus)
 		return -1;
 	sched_getaffinity(0, cpu_set_size, (cpu_set_t *)a->orig_cpus);
-	a->sched_cpus = bitmap_alloc(cpu_set_size * 8);
+	a->sched_cpus = bitmap_zalloc(cpu_set_size * 8);
 	if (!a->sched_cpus) {
 		zfree(&a->orig_cpus);
 		return -1;
--- a/tools/perf/util/header.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/util/header.c
@@ -278,7 +278,7 @@ static int do_read_bitmap(struct feat_fd
 	if (ret)
 		return ret;
 
-	set = bitmap_alloc(size);
+	set = bitmap_zalloc(size);
 	if (!set)
 		return -ENOMEM;
 
@@ -1294,7 +1294,7 @@ static int memory_node__read(struct memo
 
 	size++;
 
-	n->set = bitmap_alloc(size);
+	n->set = bitmap_zalloc(size);
 	if (!n->set) {
 		closedir(dir);
 		return -ENOMEM;
--- a/tools/perf/util/metricgroup.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/util/metricgroup.c
@@ -313,7 +313,7 @@ static int metricgroup__setup_events(str
 	struct evsel *evsel, *tmp;
 	unsigned long *evlist_used;
 
-	evlist_used = bitmap_alloc(perf_evlist->core.nr_entries);
+	evlist_used = bitmap_zalloc(perf_evlist->core.nr_entries);
 	if (!evlist_used)
 		return -ENOMEM;
 
--- a/tools/perf/util/mmap.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/perf/util/mmap.c
@@ -106,7 +106,7 @@ static int perf_mmap__aio_bind(struct mm
 		data = map->aio.data[idx];
 		mmap_len = mmap__mmap_len(map);
 		node_index = cpu__get_node(cpu);
-		node_mask = bitmap_alloc(node_index + 1);
+		node_mask = bitmap_zalloc(node_index + 1);
 		if (!node_mask) {
 			pr_err("Failed to allocate node mask for mbind: error %m\n");
 			return -1;
@@ -258,7 +258,7 @@ static void build_node_mask(int node, st
 static int perf_mmap__setup_affinity_mask(struct mmap *map, struct mmap_params *mp)
 {
 	map->affinity_mask.nbits = cpu__max_cpu();
-	map->affinity_mask.bits = bitmap_alloc(map->affinity_mask.nbits);
+	map->affinity_mask.bits = bitmap_zalloc(map->affinity_mask.nbits);
 	if (!map->affinity_mask.bits)
 		return -1;
 
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -121,7 +121,7 @@ static void run_test(enum vm_guest_mode
 	guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
 	guest_num_pages = vm_adjust_num_guest_pages(mode, guest_num_pages);
 	host_num_pages = vm_num_host_pages(mode, guest_num_pages);
-	bmap = bitmap_alloc(host_num_pages);
+	bmap = bitmap_zalloc(host_num_pages);
 
 	if (dirty_log_manual_caps) {
 		cap.cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2;
--- a/tools/testing/selftests/kvm/dirty_log_test.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/testing/selftests/kvm/dirty_log_test.c
@@ -749,8 +749,8 @@ static void run_test(enum vm_guest_mode
 
 	pr_info("guest physical test memory offset: 0x%lx\n", guest_test_phys_mem);
 
-	bmap = bitmap_alloc(host_num_pages);
-	host_bmap_track = bitmap_alloc(host_num_pages);
+	bmap = bitmap_zalloc(host_num_pages);
+	host_bmap_track = bitmap_zalloc(host_num_pages);
 
 	/* Add an extra memory slot for testing dirty logging */
 	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
--- a/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c~tools-rename-bitmap_alloc-to-bitmap_zalloc
+++ a/tools/testing/selftests/kvm/x86_64/vmx_dirty_log_test.c
@@ -111,7 +111,7 @@ int main(int argc, char *argv[])
 	nested_map(vmx, vm, NESTED_TEST_MEM1, GUEST_TEST_MEM, 4096);
 	nested_map(vmx, vm, NESTED_TEST_MEM2, GUEST_TEST_MEM, 4096);
 
-	bmap = bitmap_alloc(TEST_MEM_PAGES);
+	bmap = bitmap_zalloc(TEST_MEM_PAGES);
 	host_test_mem = addr_gpa2hva(vm, GUEST_TEST_MEM);
 
 	while (!done) {
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 120/147] mm/percpu: micro-optimize pcpu_is_populated()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (118 preceding siblings ...)
  2021-09-08  2:59 ` [patch 119/147] tools: rename bitmap_alloc() to bitmap_zalloc() Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 121/147] bitmap: unify find_bit operations Andrew Morton
                   ` (27 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: mm/percpu: micro-optimize pcpu_is_populated()

bitmap_next_clear_region() calls find_next_zero_bit() and find_next_bit()
sequentially to find a range of clear bits.  In the case of
pcpu_is_populated() there is a chance to return earlier if the bitmap has
all bits set.
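
The early-return pattern as an illustrative stand-alone helper (not
from the patch):

    /* true if all bits in [start, end) are set; bails out at the
     * first hole instead of computing a full clear region */
    static bool range_fully_set(const unsigned long *map,
                                unsigned int start, unsigned int end)
    {
        return find_next_zero_bit(map, end, start) >= end;
    }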

Link: https://lkml.kernel.org/r/20210814211713.180533-15-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/percpu.c |   15 ++++++++-------
 1 file changed, 8 insertions(+), 7 deletions(-)

--- a/mm/percpu.c~mm-percpu-micro-optimize-pcpu_is_populated
+++ a/mm/percpu.c
@@ -1070,17 +1070,18 @@ static void pcpu_block_update_hint_free(
 static bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits,
 			      int *next_off)
 {
-	unsigned int page_start, page_end, rs, re;
+	unsigned int start, end;
 
-	page_start = PFN_DOWN(bit_off * PCPU_MIN_ALLOC_SIZE);
-	page_end = PFN_UP((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
+	start = PFN_DOWN(bit_off * PCPU_MIN_ALLOC_SIZE);
+	end = PFN_UP((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
 
-	rs = page_start;
-	bitmap_next_clear_region(chunk->populated, &rs, &re, page_end);
-	if (rs >= page_end)
+	start = find_next_zero_bit(chunk->populated, end, start);
+	if (start >= end)
 		return true;
 
-	*next_off = re * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
+	end = find_next_bit(chunk->populated, end, start + 1);
+
+	*next_off = end * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE;
 	return false;
 }
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 121/147] bitmap: unify find_bit operations
  2021-09-08  2:52 incoming Andrew Morton
                   ` (119 preceding siblings ...)
  2021-09-08  2:59 ` [patch 120/147] mm/percpu: micro-optimize pcpu_is_populated() Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 122/147] lib: bitmap: add performance test for bitmap_print_to_pagebuf Andrew Morton
                   ` (26 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: bitmap: unify find_bit operations

bitmap_for_each_{set,clear}_region() are similar to the for_each_bit()
macros in include/linux/find.h, but their interface and implementation
are different.

This patch adds the for_each_bitrange() macros and drops the unused
bitmap_*_region() API for the sake of unification.
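
A usage sketch (map and nbits hypothetical): with bits 0-3 and 8-9 set,
the loop visits the half-open ranges [0, 4) and [8, 10):

    unsigned int rs, re;

    for_each_set_bitrange(rs, re, map, nbits)
        pr_info("set range [%u, %u)\n", rs, re);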

Link: https://lkml.kernel.org/r/20210814211713.180533-16-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>	[MMC]
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/mmc/host/renesas_sdhi_core.c |    2 
 include/linux/bitmap.h               |   33 --------------
 include/linux/find.h                 |   56 +++++++++++++++++++++++++
 mm/percpu.c                          |   20 +++-----
 4 files changed, 65 insertions(+), 46 deletions(-)

--- a/drivers/mmc/host/renesas_sdhi_core.c~bitmap-unify-find_bit-operations
+++ a/drivers/mmc/host/renesas_sdhi_core.c
@@ -647,7 +647,7 @@ static int renesas_sdhi_select_tuning(st
 	 * is at least SH_MOBILE_SDHI_MIN_TAP_ROW probes long then use the
 	 * center index as the tap, otherwise bail out.
 	 */
-	bitmap_for_each_set_region(bitmap, rs, re, 0, taps_size) {
+	for_each_set_bitrange(rs, re, bitmap, taps_size) {
 		if (re - rs > tap_cnt) {
 			tap_end = re;
 			tap_start = rs;
--- a/include/linux/bitmap.h~bitmap-unify-find_bit-operations
+++ a/include/linux/bitmap.h
@@ -55,12 +55,6 @@ struct device;
  *  bitmap_clear(dst, pos, nbits)               Clear specified bit area
  *  bitmap_find_next_zero_area(buf, len, pos, n, mask)  Find bit free area
  *  bitmap_find_next_zero_area_off(buf, len, pos, n, mask, mask_off)  as above
- *  bitmap_next_clear_region(map, &start, &end, nbits)  Find next clear region
- *  bitmap_next_set_region(map, &start, &end, nbits)  Find next set region
- *  bitmap_for_each_clear_region(map, rs, re, start, end)
- *  						Iterate over all clear regions
- *  bitmap_for_each_set_region(map, rs, re, start, end)
- *  						Iterate over all set regions
  *  bitmap_shift_right(dst, src, n, nbits)      *dst = *src >> n
  *  bitmap_shift_left(dst, src, n, nbits)       *dst = *src << n
  *  bitmap_cut(dst, src, first, n, nbits)       Cut n bits from first, copy rest
@@ -459,14 +453,6 @@ static inline void bitmap_replace(unsign
 		__bitmap_replace(dst, old, new, mask, nbits);
 }
 
-static inline void bitmap_next_clear_region(unsigned long *bitmap,
-					    unsigned int *rs, unsigned int *re,
-					    unsigned int end)
-{
-	*rs = find_next_zero_bit(bitmap, end, *rs);
-	*re = find_next_bit(bitmap, end, *rs + 1);
-}
-
 static inline void bitmap_next_set_region(unsigned long *bitmap,
 					  unsigned int *rs, unsigned int *re,
 					  unsigned int end)
@@ -475,25 +461,6 @@ static inline void bitmap_next_set_regio
 	*re = find_next_zero_bit(bitmap, end, *rs + 1);
 }
 
-/*
- * Bitmap region iterators.  Iterates over the bitmap between [@start, @end).
- * @rs and @re should be integer variables and will be set to start and end
- * index of the current clear or set region.
- */
-#define bitmap_for_each_clear_region(bitmap, rs, re, start, end)	     \
-	for ((rs) = (start),						     \
-	     bitmap_next_clear_region((bitmap), &(rs), &(re), (end));	     \
-	     (rs) < (re);						     \
-	     (rs) = (re) + 1,						     \
-	     bitmap_next_clear_region((bitmap), &(rs), &(re), (end)))
-
-#define bitmap_for_each_set_region(bitmap, rs, re, start, end)		     \
-	for ((rs) = (start),						     \
-	     bitmap_next_set_region((bitmap), &(rs), &(re), (end));	     \
-	     (rs) < (re);						     \
-	     (rs) = (re) + 1,						     \
-	     bitmap_next_set_region((bitmap), &(rs), &(re), (end)))
-
 /**
  * BITMAP_FROM_U64() - Represent u64 value in the format suitable for bitmap.
  * @n: u64 value
--- a/include/linux/find.h~bitmap-unify-find_bit-operations
+++ a/include/linux/find.h
@@ -302,6 +302,62 @@ unsigned long find_next_bit_le(const voi
 	     (bit) = find_next_zero_bit((addr), (size), (bit) + 1))
 
 /**
+ * for_each_set_bitrange - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit)
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange(b, e, addr, size)			\
+	for ((b) = find_next_bit((addr), (size), 0),		\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1))
+
+/**
+ * for_each_set_bitrange_from - iterate over all set bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_set_bitrange_from(b, e, addr, size)		\
+	for ((b) = find_next_bit((addr), (size), (b)),		\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_zero_bit((addr), (size), (b) + 1))
+
+/**
+ * for_each_clear_bitrange - iterate over all unset bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first unset bit)
+ * @e: bit offset of end of current bitrange (first set bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_clear_bitrange(b, e, addr, size)		\
+	for ((b) = find_next_zero_bit((addr), (size), 0),	\
+	     (e) = find_next_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_zero_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_bit((addr), (size), (b) + 1))
+
+/**
+ * for_each_clear_bitrange_from - iterate over all unset bit ranges [b; e)
+ * @b: bit offset of start of current bitrange (first set bit); must be initialized
+ * @e: bit offset of end of current bitrange (first unset bit)
+ * @addr: bitmap address to base the search on
+ * @size: bitmap size in number of bits
+ */
+#define for_each_clear_bitrange_from(b, e, addr, size)		\
+	for ((b) = find_next_zero_bit((addr), (size), (b)),	\
+	     (e) = find_next_bit((addr), (size), (b) + 1);	\
+	     (b) < (size);					\
+	     (b) = find_next_zero_bit((addr), (size), (e) + 1),	\
+	     (e) = find_next_bit((addr), (size), (b) + 1))
+
+/**
  * for_each_set_clump8 - iterate over bitmap for each 8-bit clump with set bits
  * @start: bit offset to start search and to store the current iteration offset
  * @clump: location to store copy of current 8-bit clump
--- a/mm/percpu.c~bitmap-unify-find_bit-operations
+++ a/mm/percpu.c
@@ -779,7 +779,7 @@ static void pcpu_block_refresh_hint(stru
 {
 	struct pcpu_block_md *block = chunk->md_blocks + index;
 	unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
-	unsigned int rs, re, start;	/* region start, region end */
+	unsigned int start, end;	/* region start, region end */
 
 	/* promote scan_hint to contig_hint */
 	if (block->scan_hint) {
@@ -795,9 +795,8 @@ static void pcpu_block_refresh_hint(stru
 	block->right_free = 0;
 
 	/* iterate over free areas and update the contig hints */
-	bitmap_for_each_clear_region(alloc_map, rs, re, start,
-				     PCPU_BITMAP_BLOCK_BITS)
-		pcpu_block_update(block, rs, re);
+	for_each_clear_bitrange_from(start, end, alloc_map, PCPU_BITMAP_BLOCK_BITS)
+		pcpu_block_update(block, start, end);
 }
 
 /**
@@ -1852,13 +1851,12 @@ area_found:
 
 	/* populate if not all pages are already there */
 	if (!is_atomic) {
-		unsigned int page_start, page_end, rs, re;
+		unsigned int page_end, rs, re;
 
-		page_start = PFN_DOWN(off);
+		rs = PFN_DOWN(off);
 		page_end = PFN_UP(off + size);
 
-		bitmap_for_each_clear_region(chunk->populated, rs, re,
-					     page_start, page_end) {
+		for_each_clear_bitrange_from(rs, re, chunk->populated, page_end) {
 			WARN_ON(chunk->immutable);
 
 			ret = pcpu_populate_chunk(chunk, rs, re, pcpu_gfp);
@@ -2014,8 +2012,7 @@ static void pcpu_balance_free(bool empty
 	list_for_each_entry_safe(chunk, next, &to_free, list) {
 		unsigned int rs, re;
 
-		bitmap_for_each_set_region(chunk->populated, rs, re, 0,
-					   chunk->nr_pages) {
+		for_each_set_bitrange(rs, re, chunk->populated, chunk->nr_pages) {
 			pcpu_depopulate_chunk(chunk, rs, re);
 			spin_lock_irq(&pcpu_lock);
 			pcpu_chunk_depopulated(chunk, rs, re);
@@ -2085,8 +2082,7 @@ retry_pop:
 			continue;
 
 		/* @chunk can't go away while pcpu_alloc_mutex is held */
-		bitmap_for_each_clear_region(chunk->populated, rs, re, 0,
-					     chunk->nr_pages) {
+		for_each_clear_bitrange(rs, re, chunk->populated, chunk->nr_pages) {
 			int nr = min_t(int, re - rs, nr_to_pop);
 
 			spin_unlock_irq(&pcpu_lock);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 122/147] lib: bitmap: add performance test for bitmap_print_to_pagebuf
  2021-09-08  2:52 incoming Andrew Morton
                   ` (120 preceding siblings ...)
  2021-09-08  2:59 ` [patch 121/147] bitmap: unify find_bit operations Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 123/147] vsprintf: rework bitmap_list_string Andrew Morton
                   ` (25 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: lib: bitmap: add performance test for bitmap_print_to_pagebuf

Functional tests for bitmap_print_to_pagebuf() are provided in
lib/test_printf.c.  This patch adds a performance test for the case of a
fully set bitmap.

Link: https://lkml.kernel.org/r/20210814211713.180533-17-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_bitmap.c |   37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

--- a/lib/test_bitmap.c~lib-bitmap-add-performance-test-for-bitmap_print_to_pagebuf
+++ a/lib/test_bitmap.c
@@ -430,6 +430,42 @@ static void __init test_bitmap_parselist
 	}
 }
 
+static void __init test_bitmap_printlist(void)
+{
+	unsigned long *bmap = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
+	char expected[256];
+	int ret, slen;
+	ktime_t time;
+
+	if (!buf || !bmap)
+		goto out;
+
+	memset(bmap, -1, PAGE_SIZE);
+	slen = snprintf(expected, 256, "0-%ld", PAGE_SIZE * 8 - 1);
+	if (slen < 0)
+		goto out;
+
+	time = ktime_get();
+	ret = bitmap_print_to_pagebuf(true, buf, bmap, PAGE_SIZE * 8);
+	time = ktime_get() - time;
+
+	if (ret != slen + 1) {
+		pr_err("bitmap_print_to_pagebuf: result is %d, expected %d\n", ret, slen);
+		goto out;
+	}
+
+	if (strncmp(buf, expected, slen)) {
+		pr_err("bitmap_print_to_pagebuf: result is %s, expected %s\n", buf, expected);
+		goto out;
+	}
+
+	pr_err("bitmap_print_to_pagebuf: input is '%s', Time: %llu\n", buf, time);
+out:
+	kfree(buf);
+	kfree(bmap);
+}
+
 static const unsigned long parse_test[] __initconst = {
 	BITMAP_FROM_U64(0),
 	BITMAP_FROM_U64(1),
@@ -669,6 +705,7 @@ static void __init selftest(void)
 	test_bitmap_arr32();
 	test_bitmap_parse();
 	test_bitmap_parselist();
+	test_bitmap_printlist();
 	test_mem_optimisations();
 	test_for_each_set_clump8();
 	test_bitmap_cut();
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 123/147] vsprintf: rework bitmap_list_string
  2021-09-08  2:52 incoming Andrew Morton
                   ` (121 preceding siblings ...)
  2021-09-08  2:59 ` [patch 122/147] lib: bitmap: add performance test for bitmap_print_to_pagebuf Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 124/147] checkpatch: support wide strings Andrew Morton
                   ` (24 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: aklimov, akpm, alobakin, andriy.shevchenko, dennis, jolsa,
	linux-mm, mm-commits, torvalds, ulf.hansson, will, wsa+renesas,
	yury.norov

From: Yury Norov <yury.norov@gmail.com>
Subject: vsprintf: rework bitmap_list_string

bitmap_list_string() is very inefficient when printing bitmaps with long
ranges of set bits because it calls find_next_bit() for each bit in the
bitmap.  We can do better by detecting ranges of set bits.

In my environment, the before/after timing is 943008/31008 ns.
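
For context, a minimal caller sketch (values hypothetical); the list
format exercised here is what the %*pbl specifier produces:

    DECLARE_BITMAP(map, 16);

    bitmap_zero(map, 16);
    bitmap_set(map, 0, 4);          /* bits 0-3 */
    bitmap_set(map, 8, 1);          /* bit 8 */
    pr_info("%*pbl\n", 16, map);    /* prints "0-3,8" */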

[yury.norov@gmail.com: don't increment buf in bitmap_list_string]
  Link: https://lkml.kernel.org/r/20210817193735.269942-1-yury.norov@gmail.com
Link: https://lkml.kernel.org/r/20210814211713.180533-18-yury.norov@gmail.com
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Tested-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Alexander Lobakin <alobakin@pm.me>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/vsprintf.c |   24 +++++++-----------------
 1 file changed, 7 insertions(+), 17 deletions(-)

--- a/lib/vsprintf.c~vsprintf-rework-bitmap_list_string
+++ a/lib/vsprintf.c
@@ -1241,20 +1241,13 @@ char *bitmap_list_string(char *buf, char
 			 struct printf_spec spec, const char *fmt)
 {
 	int nr_bits = max_t(int, spec.field_width, 0);
-	/* current bit is 'cur', most recently seen range is [rbot, rtop] */
-	int cur, rbot, rtop;
 	bool first = true;
+	int rbot, rtop;
 
 	if (check_pointer(&buf, end, bitmap, spec))
 		return buf;
 
-	rbot = cur = find_first_bit(bitmap, nr_bits);
-	while (cur < nr_bits) {
-		rtop = cur;
-		cur = find_next_bit(bitmap, nr_bits, cur + 1);
-		if (cur < nr_bits && cur <= rtop + 1)
-			continue;
-
+	for_each_set_bitrange(rbot, rtop, bitmap, nr_bits) {
 		if (!first) {
 			if (buf < end)
 				*buf = ',';
@@ -1263,15 +1256,12 @@ char *bitmap_list_string(char *buf, char
 		first = false;
 
 		buf = number(buf, end, rbot, default_dec_spec);
-		if (rbot < rtop) {
-			if (buf < end)
-				*buf = '-';
-			buf++;
-
-			buf = number(buf, end, rtop, default_dec_spec);
-		}
+		if (rtop == rbot + 1)
+			continue;
 
-		rbot = cur;
+		if (buf < end)
+			*buf = '-';
+		buf = number(buf + 1, end, rtop - 1, default_dec_spec);
 	}
 	return buf;
 }
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 124/147] checkpatch: support wide strings
  2021-09-08  2:52 incoming Andrew Morton
                   ` (122 preceding siblings ...)
  2021-09-08  2:59 ` [patch 123/147] vsprintf: rework bitmap_list_string Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 125/147] checkpatch: make email address check case insensitive Andrew Morton
                   ` (23 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: akpm, dwaipayanray1, joe, linux-mm, lukas.bulwahn, mm-commits,
	sjg, torvalds

From: Joe Perches <joe@perches.com>
Subject: checkpatch: support wide strings

Allow prefixing typical strings with L for wide strings and u for
Unicode strings.
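
For instance (illustrative C, not from the patch), these literals are
now recognized by the $String pattern:

    const wchar_t  *w = L"wide string";       /* L prefix */
    const char16_t *u = u"unicode string";    /* C11 u prefix */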

Link: https://lkml.kernel.org/r/20210801170733.1.I3f9784fd3c1007d08ec2e70b151d137687575495@changeid
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Simon Glass <sjg@chromium.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-support-wide-strings
+++ a/scripts/checkpatch.pl
@@ -501,7 +501,7 @@ our $Binary	= qr{(?i)0b[01]+$Int_type?};
 our $Hex	= qr{(?i)0x[0-9a-f]+$Int_type?};
 our $Int	= qr{[0-9]+$Int_type?};
 our $Octal	= qr{0[0-7]+$Int_type?};
-our $String	= qr{"[X\t]*"};
+our $String	= qr{(?:\b[Lu])?"[X\t]*"};
 our $Float_hex	= qr{(?i)0x[0-9a-f]+p-?[0-9]+[fl]?};
 our $Float_dec	= qr{(?i)(?:[0-9]+\.[0-9]*|[0-9]*\.[0-9]+)(?:e-?[0-9]+)?[fl]?};
 our $Float_int	= qr{(?i)[0-9]+e-?[0-9]+[fl]?};
@@ -6132,7 +6132,8 @@ sub process {
 		}
 
 # concatenated string without spaces between elements
-		if ($line =~ /$String[A-Za-z0-9_]/ || $line =~ /[A-Za-z0-9_]$String/) {
+		if ($line =~ /$String[A-Z_]/ ||
+		    ($line =~ /([A-Za-z0-9_]+)$String/ && $1 !~ /^[Lu]$/)) {
 			if (CHK("CONCATENATED_STRING",
 				"Concatenated strings should use spaces between elements\n" . $herecurr) &&
 			    $fix) {
@@ -6145,7 +6146,7 @@ sub process {
 		}
 
 # uncoalesced string fragments
-		if ($line =~ /$String\s*"/) {
+		if ($line =~ /$String\s*[Lu]?"/) {
 			if (WARN("STRING_FRAGMENTS",
 				 "Consecutive strings are generally better as a single string\n" . $herecurr) &&
 			    $fix) {
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 125/147] checkpatch: make email address check case insensitive
  2021-09-08  2:52 incoming Andrew Morton
                   ` (123 preceding siblings ...)
  2021-09-08  2:59 ` [patch 124/147] checkpatch: support wide strings Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  2:59 ` [patch 126/147] checkpatch: improve GIT_COMMIT_ID test Andrew Morton
                   ` (22 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: akpm, joe, linux-mm, mm-commits, torvalds, zohar

From: Mimi Zohar <zohar@linux.ibm.com>
Subject: checkpatch: make email address check case insensitive

Instead of checkpatch requiring the patch author to exactly match the
signed-off-by tag, commit 48ca2d8ac8a1 ("checkpatch: add new warnings to
author signoff checks.") safely relaxed this requirement.

Although the local-part of an email address (local-part@domain) may be
case sensitive, exploiting the case sensitivity of mailbox local-parts
impedes interoperability and is discouraged.  Mailbox domains follow
normal DNS rules and are hence not case sensitive.  (Refer to
https://datatracker.ietf.org/doc/html/rfc5321#section-2.4.)

Further relax the patch author and signed-off-by tag comparison by making
the email address check case insensitive.
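
In C terms, the strictest of the relaxed comparisons (name and address
both matching) amounts to the following sketch; same_signoff() is a
hypothetical helper, not checkpatch code:

	#include <string.h>
	#include <strings.h>	/* strcasecmp (POSIX) */

	/* Addresses compare case-insensitively, names still exactly. */
	static int same_signoff(const char *author_name, const char *author_addr,
				const char *sob_name, const char *sob_addr)
	{
		return strcasecmp(author_addr, sob_addr) == 0 &&
		       strcmp(author_name, sob_name) == 0;
	}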

Link: https://lkml.kernel.org/r/20210816112725.173206-1-zohar@linux.ibm.com
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-make-email-address-check-case-insensitive
+++ a/scripts/checkpatch.pl
@@ -2909,10 +2909,10 @@ sub process {
 					my ($email_name, $email_comment, $email_address, $comment1) = parse_email($ctx);
 					my ($author_name, $author_comment, $author_address, $comment2) = parse_email($author);
 
-					if ($email_address eq $author_address && $email_name eq $author_name) {
+					if (lc $email_address eq lc $author_address && $email_name eq $author_name) {
 						$author_sob = $ctx;
 						$authorsignoff = 2;
-					} elsif ($email_address eq $author_address) {
+					} elsif (lc $email_address eq lc $author_address) {
 						$author_sob = $ctx;
 						$authorsignoff = 3;
 					} elsif ($email_name eq $author_name) {
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 126/147] checkpatch: improve GIT_COMMIT_ID test
  2021-09-08  2:52 incoming Andrew Morton
                   ` (124 preceding siblings ...)
  2021-09-08  2:59 ` [patch 125/147] checkpatch: make email address check case insensitive Andrew Morton
@ 2021-09-08  2:59 ` Andrew Morton
  2021-09-08  3:00 ` [patch 127/147] fs/epoll: use a per-cpu counter for user's watches count Andrew Morton
                   ` (21 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  2:59 UTC (permalink / raw)
  To: akpm, dwaipayanray1, efremov, joe, linux-mm, lukas.bulwahn,
	mm-commits, torvalds

From: Joe Perches <joe@perches.com>
Subject: checkpatch: improve GIT_COMMIT_ID test

The preferred git commit id reference has the form

	commit <SHA-1> ("Title line")

where SHA-1 is the commit hex hash with a minimum length of 12 and ("Title
line") is the complete title line of the commit with a (" prefix and ")
suffix.

The current tests fail when the "Title line" has one or more embedded
double quotes.

Improve the test that finds the commit SHA-1 hex hash and the ("Title
line") that follows it by using $balanced_parens across a maximum of 3
consecutive lines.
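
With $balanced_parens, a reference whose title itself contains double
quotes now passes, e.g. (hypothetical hash and title):

	commit 0123456789ab ("foo: don't "quote" the subject")

and a reference wrapped across lines, such as

	commit 0123456789ab
	("foo: some overly long commit subject line")

is reassembled from up to 3 consecutive lines before being checked.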

[akpm@linux-foundation.org: add missing &&]
Link: https://lkml.kernel.org/r/976c6cdd680db4b55ae31b5fc2d1779da5c0dc66.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Denis Efremov <efremov@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/checkpatch.pl |   82 ++++++++++++++++++++++++----------------
 1 file changed, 51 insertions(+), 31 deletions(-)

--- a/scripts/checkpatch.pl~checkpatch-improve-git_commit_id-test
+++ a/scripts/checkpatch.pl
@@ -1181,7 +1181,8 @@ sub git_commit_info {
 #		    git log --format='%H %s' -1 $line |
 #		    echo "commit $(cut -c 1-12,41-)"
 #		done
-	} elsif ($lines[0] =~ /^fatal: ambiguous argument '$commit': unknown revision or path not in the working tree\./) {
+	} elsif ($lines[0] =~ /^fatal: ambiguous argument '$commit': unknown revision or path not in the working tree\./ ||
+		 $lines[0] =~ /^fatal: bad object $commit/) {
 		$id = undef;
 	} else {
 		$id = substr($lines[0], 0, 12);
@@ -2587,6 +2588,8 @@ sub process {
 	my $reported_maintainer_file = 0;
 	my $non_utf8_charset = 0;
 
+	my $last_git_commit_id_linenr = -1;
+
 	my $last_blank_line = 0;
 	my $last_coalesced_string_linenr = -1;
 
@@ -3170,10 +3173,20 @@ sub process {
 		}
 
 # Check for git id commit length and improperly formed commit descriptions
-		if ($in_commit_log && !$commit_log_possible_stack_dump &&
+# A correctly formed commit description is:
+#    commit <SHA-1 hash length 12+ chars> ("Complete commit subject")
+# with the commit subject '("' prefix and '")' suffix
+# This is a fairly complicated block as it tests for what appears to be
+# a bare SHA-1 hash with a minimum length of 5.  It also avoids several types of
+# possible SHA-1 matches.
+# A commit match can span multiple lines so this block attempts to find a
+# complete typical commit on a maximum of 3 lines
+		if ($perl_version_ok &&
+		    $in_commit_log && !$commit_log_possible_stack_dump &&
 		    $line !~ /^\s*(?:Link|Patchwork|http|https|BugLink|base-commit):/i &&
 		    $line !~ /^This reverts commit [0-9a-f]{7,40}/ &&
-		    ($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i ||
+		    (($line =~ /\bcommit\s+[0-9a-f]{5,}\b/i ||
+		      ($line =~ /\bcommit\s*$/i && defined($rawlines[$linenr]) && $rawlines[$linenr] =~ /^\s*[0-9a-f]{5,}\b/i)) ||
 		     ($line =~ /(?:\s|^)[0-9a-f]{12,40}(?:[\s"'\(\[]|$)/i &&
 		      $line !~ /[\<\[][0-9a-f]{12,40}[\>\]]/i &&
 		      $line !~ /\bfixes:\s*[0-9a-f]{12,40}/i))) {
@@ -3183,49 +3196,56 @@ sub process {
 			my $long = 0;
 			my $case = 1;
 			my $space = 1;
-			my $hasdesc = 0;
-			my $hasparens = 0;
 			my $id = '0123456789ab';
 			my $orig_desc = "commit description";
 			my $description = "";
+			my $herectx = $herecurr;
+			my $has_parens = 0;
+			my $has_quotes = 0;
+
+			my $input = $line;
+			if ($line =~ /(?:\bcommit\s+[0-9a-f]{5,}|\bcommit\s*$)/i) {
+				for (my $n = 0; $n < 2; $n++) {
+					if ($input =~ /\bcommit\s+[0-9a-f]{5,}\s*($balanced_parens)/i) {
+						$orig_desc = $1;
+						$has_parens = 1;
+						# Always strip leading/trailing parens then double quotes if existing
+						$orig_desc = substr($orig_desc, 1, -1);
+						if ($orig_desc =~ /^".*"$/) {
+							$orig_desc = substr($orig_desc, 1, -1);
+							$has_quotes = 1;
+						}
+						last;
+					}
+					last if ($#lines < $linenr + $n);
+					$input .= " " . trim($rawlines[$linenr + $n]);
+					$herectx .= "$rawlines[$linenr + $n]\n";
+				}
+				$herectx = $herecurr if (!$has_parens);
+			}
 
-			if ($line =~ /\b(c)ommit\s+([0-9a-f]{5,})\b/i) {
+			if ($input =~ /\b(c)ommit\s+([0-9a-f]{5,})\b/i) {
 				$init_char = $1;
 				$orig_commit = lc($2);
-			} elsif ($line =~ /\b([0-9a-f]{12,40})\b/i) {
+				$short = 0 if ($input =~ /\bcommit\s+[0-9a-f]{12,40}/i);
+				$long = 1 if ($input =~ /\bcommit\s+[0-9a-f]{41,}/i);
+				$space = 0 if ($input =~ /\bcommit [0-9a-f]/i);
+				$case = 0 if ($input =~ /\b[Cc]ommit\s+[0-9a-f]{5,40}[^A-F]/);
+			} elsif ($input =~ /\b([0-9a-f]{12,40})\b/i) {
 				$orig_commit = lc($1);
 			}
 
-			$short = 0 if ($line =~ /\bcommit\s+[0-9a-f]{12,40}/i);
-			$long = 1 if ($line =~ /\bcommit\s+[0-9a-f]{41,}/i);
-			$space = 0 if ($line =~ /\bcommit [0-9a-f]/i);
-			$case = 0 if ($line =~ /\b[Cc]ommit\s+[0-9a-f]{5,40}[^A-F]/);
-			if ($line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("([^"]+)"\)/i) {
-				$orig_desc = $1;
-				$hasparens = 1;
-			} elsif ($line =~ /\bcommit\s+[0-9a-f]{5,}\s*$/i &&
-				 defined $rawlines[$linenr] &&
-				 $rawlines[$linenr] =~ /^\s*\("([^"]+)"\)/) {
-				$orig_desc = $1;
-				$hasparens = 1;
-			} elsif ($line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("[^"]+$/i &&
-				 defined $rawlines[$linenr] &&
-				 $rawlines[$linenr] =~ /^\s*[^"]+"\)/) {
-				$line =~ /\bcommit\s+[0-9a-f]{5,}\s+\("([^"]+)$/i;
-				$orig_desc = $1;
-				$rawlines[$linenr] =~ /^\s*([^"]+)"\)/;
-				$orig_desc .= " " . $1;
-				$hasparens = 1;
-			}
-
 			($id, $description) = git_commit_info($orig_commit,
 							      $id, $orig_desc);
 
 			if (defined($id) &&
-			   ($short || $long || $space || $case || ($orig_desc ne $description) || !$hasparens)) {
+			    ($short || $long || $space || $case || ($orig_desc ne $description) || !$has_quotes) &&
+			    $last_git_commit_id_linenr != $linenr - 1) {
 				ERROR("GIT_COMMIT_ID",
-				      "Please use git commit description style 'commit <12+ chars of sha1> (\"<title line>\")' - ie: '${init_char}ommit $id (\"$description\")'\n" . $herecurr);
+				      "Please use git commit description style 'commit <12+ chars of sha1> (\"<title line>\")' - ie: '${init_char}ommit $id (\"$description\")'\n" . $herectx);
 			}
+			#don't report the next line if this line ends in commit and the sha1 hash is the next line
+			$last_git_commit_id_linenr = $linenr if ($line =~ /\bcommit\s*$/i);
 		}
 
 # Check for added, moved or deleted files
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 127/147] fs/epoll: use a per-cpu counter for user's watches count
  2021-09-08  2:52 incoming Andrew Morton
                   ` (125 preceding siblings ...)
  2021-09-08  2:59 ` [patch 126/147] checkpatch: improve GIT_COMMIT_ID test Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 128/147] init: move usermodehelper_enable() to populate_rootfs() Andrew Morton
                   ` (20 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, anton, linux-mm, mm-commits, npiggin, torvalds, viro

From: Nicholas Piggin <npiggin@gmail.com>
Subject: fs/epoll: use a per-cpu counter for user's watches count

This counter tracks the number of watches a user has, to compare against
the 'max_user_watches' limit. This causes a scalability bottleneck on
SPECjbb2015 on large systems as there is only one user. Changing to a
per-cpu counter increases throughput of the benchmark by about 30% on a
16-socket, > 1000 thread system.
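
The percpu_counter API used here follows the usual lifecycle; a minimal
sketch of the pattern this patch adopts (not the full ep_insert() logic;
watches_demo() is a hypothetical function):

	#include <linux/percpu_counter.h>

	static int watches_demo(long max_user_watches)
	{
		struct percpu_counter watches;

		/* init can fail (it allocates per-cpu data), so check it */
		if (percpu_counter_init(&watches, 0, GFP_KERNEL))
			return -ENOMEM;

		percpu_counter_inc(&watches);	/* hot path: per-cpu, cheap */

		/* limit check without a full sum on every call */
		if (percpu_counter_compare(&watches, max_user_watches) >= 0)
			pr_info("over the limit\n");

		percpu_counter_dec(&watches);
		percpu_counter_destroy(&watches); /* frees the per-cpu data */
		return 0;
	}

Note that ep_insert() now increments the counter up front and decrements
it again on every failure path, which keeps the count accurate without
holding a lock across the allocations.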

[rdunlap@infradead.org: fix build errors in kernel/user.c when CONFIG_EPOLL=n]
[npiggin@gmail.com: move ifdefs into wrapper functions, slightly improve panic message]
  Link: https://lkml.kernel.org/r/1628051945.fens3r99ox.astroid@bobo.none
[akpm@linux-foundation.org: tweak user_epoll_alloc(), per Guenter]
  Link: https://lkml.kernel.org/r/20210804191421.GA1900577@roeck-us.net
Link: https://lkml.kernel.org/r/20210802032013.2751916-1-npiggin@gmail.com
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reported-by: Anton Blanchard <anton@ozlabs.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/eventpoll.c             |   18 ++++++++++--------
 include/linux/sched/user.h |    3 ++-
 kernel/user.c              |   25 +++++++++++++++++++++++++
 3 files changed, 37 insertions(+), 9 deletions(-)

--- a/fs/eventpoll.c~fs-epoll-use-a-per-cpu-counter-for-users-watches-count
+++ a/fs/eventpoll.c
@@ -723,7 +723,7 @@ static int ep_remove(struct eventpoll *e
 	 */
 	call_rcu(&epi->rcu, epi_rcu_free);
 
-	atomic_long_dec(&ep->user->epoll_watches);
+	percpu_counter_dec(&ep->user->epoll_watches);
 
 	return 0;
 }
@@ -1439,7 +1439,6 @@ static int ep_insert(struct eventpoll *e
 {
 	int error, pwake = 0;
 	__poll_t revents;
-	long user_watches;
 	struct epitem *epi;
 	struct ep_pqueue epq;
 	struct eventpoll *tep = NULL;
@@ -1449,11 +1448,15 @@ static int ep_insert(struct eventpoll *e
 
 	lockdep_assert_irqs_enabled();
 
-	user_watches = atomic_long_read(&ep->user->epoll_watches);
-	if (unlikely(user_watches >= max_user_watches))
+	if (unlikely(percpu_counter_compare(&ep->user->epoll_watches,
+					    max_user_watches) >= 0))
 		return -ENOSPC;
-	if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL)))
+	percpu_counter_inc(&ep->user->epoll_watches);
+
+	if (!(epi = kmem_cache_zalloc(epi_cache, GFP_KERNEL))) {
+		percpu_counter_dec(&ep->user->epoll_watches);
 		return -ENOMEM;
+	}
 
 	/* Item initialization follow here ... */
 	INIT_LIST_HEAD(&epi->rdllink);
@@ -1466,17 +1469,16 @@ static int ep_insert(struct eventpoll *e
 		mutex_lock_nested(&tep->mtx, 1);
 	/* Add the current item to the list of active epoll hook for this file */
 	if (unlikely(attach_epitem(tfile, epi) < 0)) {
-		kmem_cache_free(epi_cache, epi);
 		if (tep)
 			mutex_unlock(&tep->mtx);
+		kmem_cache_free(epi_cache, epi);
+		percpu_counter_dec(&ep->user->epoll_watches);
 		return -ENOMEM;
 	}
 
 	if (full_check && !tep)
 		list_file(tfile);
 
-	atomic_long_inc(&ep->user->epoll_watches);
-
 	/*
 	 * Add the current item to the RB tree. All RB tree operations are
 	 * protected by "mtx", and ep_insert() is called with "mtx" held.
--- a/include/linux/sched/user.h~fs-epoll-use-a-per-cpu-counter-for-users-watches-count
+++ a/include/linux/sched/user.h
@@ -4,6 +4,7 @@
 
 #include <linux/uidgid.h>
 #include <linux/atomic.h>
+#include <linux/percpu_counter.h>
 #include <linux/refcount.h>
 #include <linux/ratelimit.h>
 
@@ -13,7 +14,7 @@
 struct user_struct {
 	refcount_t __count;	/* reference count */
 #ifdef CONFIG_EPOLL
-	atomic_long_t epoll_watches; /* The number of file descriptors currently watched */
+	struct percpu_counter epoll_watches; /* The number of file descriptors currently watched */
 #endif
 	unsigned long unix_inflight;	/* How many files in flight in unix sockets */
 	atomic_long_t pipe_bufs;  /* how many pages are allocated in pipe buffers */
--- a/kernel/user.c~fs-epoll-use-a-per-cpu-counter-for-users-watches-count
+++ a/kernel/user.c
@@ -129,6 +129,22 @@ static struct user_struct *uid_hash_find
 	return NULL;
 }
 
+static int user_epoll_alloc(struct user_struct *up)
+{
+#ifdef CONFIG_EPOLL
+	return percpu_counter_init(&up->epoll_watches, 0, GFP_KERNEL);
+#else
+	return 0;
+#endif
+}
+
+static void user_epoll_free(struct user_struct *up)
+{
+#ifdef CONFIG_EPOLL
+	percpu_counter_destroy(&up->epoll_watches);
+#endif
+}
+
 /* IRQs are disabled and uidhash_lock is held upon function entry.
  * IRQ state (as stored in flags) is restored and uidhash_lock released
  * upon function exit.
@@ -138,6 +154,7 @@ static void free_user(struct user_struct
 {
 	uid_hash_remove(up);
 	spin_unlock_irqrestore(&uidhash_lock, flags);
+	user_epoll_free(up);
 	kmem_cache_free(uid_cachep, up);
 }
 
@@ -185,6 +202,10 @@ struct user_struct *alloc_uid(kuid_t uid
 
 		new->uid = uid;
 		refcount_set(&new->__count, 1);
+		if (user_epoll_alloc(new)) {
+			kmem_cache_free(uid_cachep, new);
+			return NULL;
+		}
 		ratelimit_state_init(&new->ratelimit, HZ, 100);
 		ratelimit_set_flags(&new->ratelimit, RATELIMIT_MSG_ON_RELEASE);
 
@@ -195,6 +216,7 @@ struct user_struct *alloc_uid(kuid_t uid
 		spin_lock_irq(&uidhash_lock);
 		up = uid_hash_find(uid, hashent);
 		if (up) {
+			user_epoll_free(new);
 			kmem_cache_free(uid_cachep, new);
 		} else {
 			uid_hash_insert(new, hashent);
@@ -216,6 +238,9 @@ static int __init uid_cache_init(void)
 	for(n = 0; n < UIDHASH_SZ; ++n)
 		INIT_HLIST_HEAD(uidhash_table + n);
 
+	if (user_epoll_alloc(&root_user))
+		panic("root_user epoll percpu counter alloc failed");
+
 	/* Insert the root user immediately (init already runs as root) */
 	spin_lock_irq(&uidhash_lock);
 	uid_hash_insert(&root_user, uidhashentry(GLOBAL_ROOT_UID));
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 128/147] init: move usermodehelper_enable() to populate_rootfs()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (126 preceding siblings ...)
  2021-09-08  3:00 ` [patch 127/147] fs/epoll: use a per-cpu counter for user's watches count Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08 15:44   ` Luis Chamberlain
  2021-09-08  3:00 ` [patch 130/147] nilfs2: fix memory leak in nilfs_sysfs_create_device_group Andrew Morton
                   ` (19 subsequent siblings)
  147 siblings, 1 reply; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, bgoncalv, egorenar, hkallweit1, linux-mm, linux, mcgrof,
	mm-commits, torvalds

From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Subject: init: move usermodehelper_enable() to populate_rootfs()

Currently, usermodehelper is enabled right before PID1 starts going
through the initcalls. However, any call of a usermodehelper from a
pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as
there are no filesystem contents yet.

Up until commit e7cb072eb988 ("init/initramfs.c: do unpacking
asynchronously"), such calls, whether via some request_module(), a
legacy uevent "/sbin/hotplug" notification or something else, would
just fail silently with (presumably) -ENOENT from
kernel_execve(). However, that commit introduced the
wait_for_initramfs() synchronization hook which must be called from
the usermodehelper exec path right before the kernel_execve, in order
that request_module() et al done from *after* rootfs_initcall()
time (i.e. device_ and late_ initcalls) would continue to find a
populated initramfs as they used to.

Any call of wait_for_initramfs() done before the unpacking has been
scheduled (i.e. before rootfs_initcall time) must just return
immediately [and let the caller find an empty file system] in order
not to deadlock the machine. I mistakenly thought, and my limited
testing confirmed, that there were no such calls, so I added a
pr_warn_once() in wait_for_initramfs(). It turns out that one can
indeed hit request_module() as well as kobject_uevent_env() during
those early init calls, leading to a user-visible warning in the
kernel log emitted consistently for certain configurations.

We could just remove the pr_warn_once(), but I think it's better to
postpone enabling the usermodehelper framework until there is at least
some chance of finding the executable. That is also a little more
efficient in that a lot of work done in umh.c will be elided. However,
it does change the error seen by those early callers from -ENOENT to
-EBUSY, so there is a risk of a regression if any caller cares about
the exact error value.
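
For illustration, an early caller now sees the new error; a hypothetical
test initcall (call_usermodehelper() and UMH_WAIT_EXEC are the real
kernel API):

	#include <linux/umh.h>
	#include <linux/init.h>

	static int __init early_umh_test(void)
	{
		char *argv[] = { "/sbin/hotplug", NULL };
		char *envp[] = { "HOME=/", NULL };
		int ret = call_usermodehelper(argv[0], argv, envp,
					      UMH_WAIT_EXEC);

		/* before this patch: presumably -ENOENT (rootfs empty)
		 * after this patch:  -EBUSY (usermodehelper disabled) */
		pr_info("early helper returned %d\n", ret);
		return 0;
	}
	subsys_initcall(early_umh_test);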

Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk
Fixes: e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously")
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 init/initramfs.c   |    2 ++
 init/main.c        |    1 -
 init/noinitramfs.c |    2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

--- a/init/initramfs.c~init-move-usermodehelper_enable-to-populate_rootfs
+++ a/init/initramfs.c
@@ -15,6 +15,7 @@
 #include <linux/mm.h>
 #include <linux/namei.h>
 #include <linux/init_syscalls.h>
+#include <linux/umh.h>
 
 static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
 		loff_t *pos)
@@ -727,6 +728,7 @@ static int __init populate_rootfs(void)
 {
 	initramfs_cookie = async_schedule_domain(do_populate_rootfs, NULL,
 						 &initramfs_domain);
+	usermodehelper_enable();
 	if (!initramfs_async)
 		wait_for_initramfs();
 	return 0;
--- a/init/main.c~init-move-usermodehelper_enable-to-populate_rootfs
+++ a/init/main.c
@@ -1392,7 +1392,6 @@ static void __init do_basic_setup(void)
 	driver_init();
 	init_irq_proc();
 	do_ctors();
-	usermodehelper_enable();
 	do_initcalls();
 }
 
--- a/init/noinitramfs.c~init-move-usermodehelper_enable-to-populate_rootfs
+++ a/init/noinitramfs.c
@@ -10,6 +10,7 @@
 #include <linux/kdev_t.h>
 #include <linux/syscalls.h>
 #include <linux/init_syscalls.h>
+#include <linux/umh.h>
 
 /*
  * Create a simple rootfs that is similar to the default initramfs
@@ -18,6 +19,7 @@ static int __init default_rootfs(void)
 {
 	int err;
 
+	usermodehelper_enable();
 	err = init_mkdir("/dev", 0755);
 	if (err < 0)
 		goto out;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 130/147] nilfs2: fix memory leak in nilfs_sysfs_create_device_group
  2021-09-08  2:52 incoming Andrew Morton
                   ` (127 preceding siblings ...)
  2021-09-08  3:00 ` [patch 128/147] init: move usermodehelper_enable() to populate_rootfs() Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 131/147] nilfs2: fix NULL pointer in nilfs_##name##_attr_release Andrew Morton
                   ` (18 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, hulkci, konishi.ryusuke, linux-mm, mm-commits, sunnanyong,
	torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_create_device_group

Patch series "nilfs2: fix incorrect usage of kobject".

This patchset from Nanyong Sun fixes memory leak issues and a NULL pointer
dereference issue caused by incorrect usage of kobject in the nilfs2 sysfs
implementation.


This patch (of 6):

Reported by syzkaller:
BUG: memory leak
unreferenced object 0xffff888100ca8988 (size 8):
comm "syz-executor.1", pid 1930, jiffies 4294745569 (age 18.052s)
hex dump (first 8 bytes):
6c 6f 6f 70 31 00 ff ff loop1...
backtrace:
[<000000009d9e0ac4>] slab_alloc_node mm/slub.c:2972 [inline]
[<000000009d9e0ac4>] slab_alloc mm/slub.c:2980 [inline]
[<000000009d9e0ac4>] __kmalloc_track_caller+0x164/0x330 mm/slub.c:4644
[<00000000b1825477>] kstrdup+0x36/0x70 mm/util.c:60
[<00000000fa081499>] kstrdup_const+0x35/0x60 mm/util.c:83
[<0000000024d13570>] kvasprintf_const+0xf1/0x180 lib/kasprintf.c:48
[<0000000024b69715>] kobject_set_name_vargs+0x56/0x150 lib/kobject.c:289
[<000000003fedac3d>] kobject_add_varg lib/kobject.c:384 [inline]
[<000000003fedac3d>] kobject_init_and_add+0xc9/0x150 lib/kobject.c:473
[<000000002795bd99>] nilfs_sysfs_create_device_group+0x150/0x7d0 fs/nilfs2/sysfs.c:986
[<00000000567fa12d>] init_nilfs+0xa21/0xea0 fs/nilfs2/the_nilfs.c:637
[<00000000082e7458>] nilfs_fill_super fs/nilfs2/super.c:1046 [inline]
[<00000000082e7458>] nilfs_mount+0x7b4/0xe80 fs/nilfs2/super.c:1316
[<00000000adc3fd88>] legacy_get_tree+0x105/0x210 fs/fs_context.c:592
[<00000000a98c45b8>] vfs_get_tree+0x8e/0x2d0 fs/super.c:1498
[<00000000e96282d3>] do_new_mount fs/namespace.c:2905 [inline]
[<00000000e96282d3>] path_mount+0xf9b/0x1990 fs/namespace.c:3235
[<000000003d2eb1b0>] do_mount+0xea/0x100 fs/namespace.c:3248
[<00000000e1ce771a>] __do_sys_mount fs/namespace.c:3456 [inline]
[<00000000e1ce771a>] __se_sys_mount fs/namespace.c:3433 [inline]
[<00000000e1ce771a>] __x64_sys_mount+0x14b/0x1f0 fs/namespace.c:3433
[<000000007c7f81e8>] do_syscall_x64 arch/x86/entry/common.c:50 [inline]
[<000000007c7f81e8>] do_syscall_64+0x3b/0x90 arch/x86/entry/common.c:80
[<00000000fd23ff06>] entry_SYSCALL_64_after_hwframe+0x44/0xae

If kobject_init_and_add() returns an error, the kobject needs to be
cleaned up, because kobject_init_and_add() may have allocated memory that
is not freed on its error path.

The cleanup_dev_kobject label should likewise use kobject_put() to free
the memory associated with the kobject.  As the section "Kobject removal"
of "Documentation/core-api/kobject.rst" says, kobject_del() just makes the
kobject "invisible"; it does not clean it up.  Since no further cleanup
happens after cleanup_dev_kobject, kobject_put() is needed here.
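
The resulting pattern, in isolation (a generic sketch, not the nilfs2
code itself):

	err = kobject_init_and_add(kobj, ktype, parent, "%s", name);
	if (err) {
		/*
		 * init_and_add may fail after the kobject already owns
		 * allocated state (e.g. its name), so drop the reference:
		 * kobject_put() invokes ->release() and frees everything,
		 * whereas kobject_del() alone would only unlink it.
		 */
		kobject_put(kobj);
		return err;
	}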

Link: https://lkml.kernel.org/r/1625651306-10829-1-git-send-email-konishi.ryusuke@gmail.com
Link: https://lkml.kernel.org/r/1625651306-10829-2-git-send-email-konishi.ryusuke@gmail.com
Reported-by: Hulk Robot <hulkci@huawei.com>
Link: https://lkml.kernel.org/r/20210629022556.3985106-2-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_create_device_group
+++ a/fs/nilfs2/sysfs.c
@@ -986,7 +986,7 @@ int nilfs_sysfs_create_device_group(stru
 	err = kobject_init_and_add(&nilfs->ns_dev_kobj, &nilfs_dev_ktype, NULL,
 				    "%s", sb->s_id);
 	if (err)
-		goto free_dev_subgroups;
+		goto cleanup_dev_kobject;
 
 	err = nilfs_sysfs_create_mounted_snapshots_group(nilfs);
 	if (err)
@@ -1023,9 +1023,7 @@ delete_mounted_snapshots_group:
 	nilfs_sysfs_delete_mounted_snapshots_group(nilfs);
 
 cleanup_dev_kobject:
-	kobject_del(&nilfs->ns_dev_kobj);
-
-free_dev_subgroups:
+	kobject_put(&nilfs->ns_dev_kobj);
 	kfree(nilfs->ns_dev_subgroups);
 
 failed_create_device_group:
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 131/147] nilfs2: fix NULL pointer in nilfs_##name##_attr_release
  2021-09-08  2:52 incoming Andrew Morton
                   ` (128 preceding siblings ...)
  2021-09-08  3:00 ` [patch 130/147] nilfs2: fix memory leak in nilfs_sysfs_create_device_group Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 132/147] nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group Andrew Morton
                   ` (17 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, sunnanyong, torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix NULL pointer in nilfs_##name##_attr_release

In nilfs_##name##_attr_release(), kobj->parent must not be dereferenced
because it is a NULL pointer.  The release() method of a kobject is always
called from kobject_put(kobj), and in the implementation of kobject_put()
the kobj->parent is set to NULL before the release() method is called.
So just use kobj itself to get the subgroups, which is more efficient and
fixes the NULL pointer dereference.

Link: https://lkml.kernel.org/r/20210629022556.3985106-3-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-3-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-null-pointer-in-nilfs_name_attr_release
+++ a/fs/nilfs2/sysfs.c
@@ -51,11 +51,9 @@ static const struct sysfs_ops nilfs_##na
 #define NILFS_DEV_INT_GROUP_TYPE(name, parent_name) \
 static void nilfs_##name##_attr_release(struct kobject *kobj) \
 { \
-	struct nilfs_sysfs_##parent_name##_subgroups *subgroups; \
-	struct the_nilfs *nilfs = container_of(kobj->parent, \
-						struct the_nilfs, \
-						ns_##parent_name##_kobj); \
-	subgroups = nilfs->ns_##parent_name##_subgroups; \
+	struct nilfs_sysfs_##parent_name##_subgroups *subgroups = container_of(kobj, \
+						struct nilfs_sysfs_##parent_name##_subgroups, \
+						sg_##name##_kobj); \
 	complete(&subgroups->sg_##name##_kobj_unregister); \
 } \
 static struct kobj_type nilfs_##name##_ktype = { \
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 132/147] nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
  2021-09-08  2:52 incoming Andrew Morton
                   ` (129 preceding siblings ...)
  2021-09-08  3:00 ` [patch 131/147] nilfs2: fix NULL pointer in nilfs_##name##_attr_release Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 133/147] nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group Andrew Morton
                   ` (16 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, sunnanyong, torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group

If kobject_init_and_add() returns an error, kobject_put() is needed here
to avoid a memory leak, because kobject_init_and_add() may fail without
freeing the memory it allocated for the kobject.

Link: https://lkml.kernel.org/r/20210629022556.3985106-4-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-4-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_create_name_group
+++ a/fs/nilfs2/sysfs.c
@@ -79,8 +79,8 @@ static int nilfs_sysfs_create_##name##_g
 	err = kobject_init_and_add(kobj, &nilfs_##name##_ktype, parent, \
 				    #name); \
 	if (err) \
-		return err; \
-	return 0; \
+		kobject_put(kobj); \
+	return err; \
 } \
 static void nilfs_sysfs_delete_##name##_group(struct the_nilfs *nilfs) \
 { \
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 133/147] nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
  2021-09-08  2:52 incoming Andrew Morton
                   ` (130 preceding siblings ...)
  2021-09-08  3:00 ` [patch 132/147] nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 134/147] nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group Andrew Morton
                   ` (15 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, sunnanyong, torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group

kobject_put() should be used to clean up the memory associated with the
kobject instead of kobject_del().  See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-5-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-5-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_delete_name_group
+++ a/fs/nilfs2/sysfs.c
@@ -84,7 +84,7 @@ static int nilfs_sysfs_create_##name##_g
 } \
 static void nilfs_sysfs_delete_##name##_group(struct the_nilfs *nilfs) \
 { \
-	kobject_del(&nilfs->ns_##parent_name##_subgroups->sg_##name##_kobj); \
+	kobject_put(&nilfs->ns_##parent_name##_subgroups->sg_##name##_kobj); \
 }
 
 /************************************************************************
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 134/147] nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
  2021-09-08  2:52 incoming Andrew Morton
                   ` (131 preceding siblings ...)
  2021-09-08  3:00 ` [patch 133/147] nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 135/147] nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group Andrew Morton
                   ` (14 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, sunnanyong, torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group

If kobject_init_and_add() returns an error, kobject_put() is needed here
to avoid a memory leak, because kobject_init_and_add() may fail without
freeing the memory it allocated for the kobject.

Link: https://lkml.kernel.org/r/20210629022556.3985106-6-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-6-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_create_snapshot_group
+++ a/fs/nilfs2/sysfs.c
@@ -195,9 +195,9 @@ int nilfs_sysfs_create_snapshot_group(st
 	}
 
 	if (err)
-		return err;
+		kobject_put(&root->snapshot_kobj);
 
-	return 0;
+	return err;
 }
 
 void nilfs_sysfs_delete_snapshot_group(struct nilfs_root *root)
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 135/147] nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
  2021-09-08  2:52 incoming Andrew Morton
                   ` (132 preceding siblings ...)
  2021-09-08  3:00 ` [patch 134/147] nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF Andrew Morton
                   ` (13 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, sunnanyong, torvalds

From: Nanyong Sun <sunnanyong@huawei.com>
Subject: nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group

kobject_put() should be used to clean up the memory associated with the
kobject instead of kobject_del().  See the section "Kobject removal" of
"Documentation/core-api/kobject.rst".

Link: https://lkml.kernel.org/r/20210629022556.3985106-7-sunnanyong@huawei.com
Link: https://lkml.kernel.org/r/1625651306-10829-7-git-send-email-konishi.ryusuke@gmail.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/sysfs.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/nilfs2/sysfs.c~nilfs2-fix-memory-leak-in-nilfs_sysfs_delete_snapshot_group
+++ a/fs/nilfs2/sysfs.c
@@ -202,7 +202,7 @@ int nilfs_sysfs_create_snapshot_group(st
 
 void nilfs_sysfs_delete_snapshot_group(struct nilfs_root *root)
 {
-	kobject_del(&root->snapshot_kobj);
+	kobject_put(&root->snapshot_kobj);
 }
 
 /************************************************************************
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF
  2021-09-08  2:52 incoming Andrew Morton
                   ` (133 preceding siblings ...)
  2021-09-08  3:00 ` [patch 135/147] nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-24 10:35   ` Pavel Machek
  2021-09-24 12:12   ` Matthew Wilcox
  2021-09-08  3:00 ` [patch 137/147] fs/coredump.c: log if a core dump is aborted due to changed file permissions Andrew Morton
                   ` (12 subsequent siblings)
  147 siblings, 2 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, konishi.ryusuke, linux-mm, mm-commits, thunder.leizhen, torvalds

From: Zhen Lei <thunder.leizhen@huawei.com>
Subject: nilfs2: use refcount_dec_and_lock() to fix potential UAF

When the refcount is decreased to 0, the resource reclamation branch is
entered.  Before CPU0 reaches the race point (1), CPU1 may obtain the
spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root(). 
Although CPU1 will call refcount_inc() to increase the refcount, it is
obviously too late.  CPU0 will release 'root' directly; CPU1 then accesses
'root' and triggers a UAF.

Use refcount_dec_and_lock() to ensure that decrementing the refcount to 0
and deleting the link are both performed under the lock; this eliminates
the risk.

     CPU0                      CPU1
nilfs_put_root():
			    <-------- (1)
spin_lock(&nilfs->ns_cptree_lock);
rb_erase(&root->rb_node, &nilfs->ns_cptree);
spin_unlock(&nilfs->ns_cptree_lock);

kfree(root);
			    <-------- use-after-free

========================================================================
refcount_t: underflow; use-after-free.
WARNING: CPU: 2 PID: 9476 at lib/refcount.c:28 \
refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
Modules linked in:
CPU: 2 PID: 9476 Comm: syz-executor.0 Not tainted 5.10.45-rc1+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
RIP: 0010:refcount_warn_saturate+0x1cf/0x210 lib/refcount.c:28
... ...
Call Trace:
 __refcount_sub_and_test include/linux/refcount.h:283 [inline]
 __refcount_dec_and_test include/linux/refcount.h:315 [inline]
 refcount_dec_and_test include/linux/refcount.h:333 [inline]
 nilfs_put_root+0xc1/0xd0 fs/nilfs2/the_nilfs.c:795
 nilfs_segctor_destroy fs/nilfs2/segment.c:2749 [inline]
 nilfs_detach_log_writer+0x3fa/0x570 fs/nilfs2/segment.c:2812
 nilfs_put_super+0x2f/0xf0 fs/nilfs2/super.c:467
 generic_shutdown_super+0xcd/0x1f0 fs/super.c:464
 kill_block_super+0x4a/0x90 fs/super.c:1446
 deactivate_locked_super+0x6a/0xb0 fs/super.c:335
 deactivate_super+0x85/0x90 fs/super.c:366
 cleanup_mnt+0x277/0x2e0 fs/namespace.c:1118
 __cleanup_mnt+0x15/0x20 fs/namespace.c:1125
 task_work_run+0x8e/0x110 kernel/task_work.c:151
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:164 [inline]
 exit_to_user_mode_prepare+0x13c/0x170 kernel/entry/common.c:191
 syscall_exit_to_user_mode+0x16/0x30 kernel/entry/common.c:266
 do_syscall_64+0x45/0x80 arch/x86/entry/common.c:56
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

There is no reproduction program, and the above is only theoretical
analysis.
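
The closed race, reduced to its essentials (a generic sketch of the
refcount_dec_and_lock() pattern, not the nilfs2 code):

	/* teardown side */
	if (refcount_dec_and_lock(&obj->count, &tree_lock)) {
		/* lock held: the drop-to-zero and the unlink are now
		 * one atomic step with respect to lookups */
		rb_erase(&obj->rb_node, &tree);
		spin_unlock(&tree_lock);
		kfree(obj);
	}

	/* The lookup side takes the same lock before refcount_inc(),
	 * so it either finds the node before it is erased (and the
	 * count can no longer reach zero) or not at all. */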

Link: https://lkml.kernel.org/r/1629859428-5906-1-git-send-email-konishi.ryusuke@gmail.com
Fixes: ba65ae4729bf ("nilfs2: add checkpoint tree to nilfs object")
Link: https://lkml.kernel.org/r/20210723012317.4146-1-thunder.leizhen@huawei.com
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/nilfs2/the_nilfs.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

--- a/fs/nilfs2/the_nilfs.c~nilfs2-use-refcount_dec_and_lock-to-fix-potential-uaf
+++ a/fs/nilfs2/the_nilfs.c
@@ -792,14 +792,13 @@ nilfs_find_or_create_root(struct the_nil
 
 void nilfs_put_root(struct nilfs_root *root)
 {
-	if (refcount_dec_and_test(&root->count)) {
-		struct the_nilfs *nilfs = root->nilfs;
+	struct the_nilfs *nilfs = root->nilfs;
 
-		nilfs_sysfs_delete_snapshot_group(root);
-
-		spin_lock(&nilfs->ns_cptree_lock);
+	if (refcount_dec_and_lock(&root->count, &nilfs->ns_cptree_lock)) {
 		rb_erase(&root->rb_node, &nilfs->ns_cptree);
 		spin_unlock(&nilfs->ns_cptree_lock);
+
+		nilfs_sysfs_delete_snapshot_group(root);
 		iput(root->ifile);
 
 		kfree(root);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 137/147] fs/coredump.c: log if a core dump is aborted due to changed file permissions
  2021-09-08  2:52 incoming Andrew Morton
                   ` (134 preceding siblings ...)
  2021-09-08  3:00 ` [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 138/147] coredump: fix memleak in dump_vma_snapshot() Andrew Morton
                   ` (11 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, david.oberhollenzer, linux-mm, mm-commits, torvalds, viro

From: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Subject: fs/coredump.c: log if a core dump is aborted due to changed file permissions

For obvious security reasons, a core dump is aborted if the filesystem
cannot preserve ownership or permissions of the dump file.

This affects filesystems such as vfat, but also something like a 9pfs
share in a Qemu test setup running as a regular user, depending on the
security model used.  In those cases, the result is an empty core file and
a confused user.

To hopefully save other people a lot of time figuring out the cause, this
patch adds a simple log message for those specific cases.
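
For reference, the mode test tolerates exactly one extra bit (an
annotated restatement of the hunk below, not new logic):

	/* 0677 masks all permission bits except owner-execute, so the
	 * dump file must be exactly rw------- (an owner-x bit alone is
	 * tolerated); anything else aborts the dump, now with a log. */
	if ((inode->i_mode & 0677) != 0600)
		goto close_fail;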

[akpm@linux-foundation.org: s/|%s/%s/ in printk text]
Link: https://lkml.kernel.org/r/20210701233151.102720-1-david.oberhollenzer@sigma-star.at
Signed-off-by: David Oberhollenzer <david.oberhollenzer@sigma-star.at>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/coredump.c |   11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

--- a/fs/coredump.c~log-if-a-core-dump-is-aborted-due-to-changed-file-permissions
+++ a/fs/coredump.c
@@ -782,10 +782,17 @@ void do_coredump(const kernel_siginfo_t
 		 * filesystem.
 		 */
 		mnt_userns = file_mnt_user_ns(cprm.file);
-		if (!uid_eq(i_uid_into_mnt(mnt_userns, inode), current_fsuid()))
+		if (!uid_eq(i_uid_into_mnt(mnt_userns, inode),
+			    current_fsuid())) {
+			pr_info_ratelimited("Core dump to %s aborted: cannot preserve file owner\n",
+					    cn.corename);
 			goto close_fail;
-		if ((inode->i_mode & 0677) != 0600)
+		}
+		if ((inode->i_mode & 0677) != 0600) {
+			pr_info_ratelimited("Core dump to %s aborted: cannot preserve file permissions\n",
+					    cn.corename);
 			goto close_fail;
+		}
 		if (!(cprm.file->f_mode & FMODE_CAN_WRITE))
 			goto close_fail;
 		if (do_truncate(mnt_userns, cprm.file->f_path.dentry,
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 138/147] coredump: fix memleak in dump_vma_snapshot()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (135 preceding siblings ...)
  2021-09-08  3:00 ` [patch 137/147] fs/coredump.c: log if a core dump is aborted due to changed file permissions Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 139/147] kernel/fork.c: unexport get_{mm,task}_exe_file Andrew Morton
                   ` (10 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, gregkh, jannh, linux-mm, mm-commits, qiuxi1, torvalds, viro

From: QiuXi <qiuxi1@huawei.com>
Subject: coredump: fix memleak in dump_vma_snapshot()

dump_vma_snapshot() allocates memory for *vma_meta; when
dump_vma_snapshot() returns -EFAULT, that memory is leaked, so free it on
the error path.

Link: https://lkml.kernel.org/r/20210810020441.62806-1-qiuxi1@huawei.com
Fixes: a07279c9a8cd7 ("binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot")
Signed-off-by: QiuXi <qiuxi1@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jann Horn <jannh@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/coredump.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/fs/coredump.c~coredump-fix-memleak-in-dump_vma_snapshot
+++ a/fs/coredump.c
@@ -1134,8 +1134,10 @@ int dump_vma_snapshot(struct coredump_pa
 
 	mmap_write_unlock(mm);
 
-	if (WARN_ON(i != *vma_count))
+	if (WARN_ON(i != *vma_count)) {
+		kvfree(*vma_meta);
 		return -EFAULT;
+	}
 
 	*vma_data_size_ptr = vma_data_size;
 	return 0;
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 139/147] kernel/fork.c: unexport get_{mm,task}_exe_file
  2021-09-08  2:52 incoming Andrew Morton
                   ` (136 preceding siblings ...)
  2021-09-08  3:00 ` [patch 138/147] coredump: fix memleak in dump_vma_snapshot() Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 140/147] pid: cleanup the stale comment mentioning pidmap_init() Andrew Morton
                   ` (9 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, hch, linux-mm, mm-commits, torvalds

From: Christoph Hellwig <hch@lst.de>
Subject: kernel/fork.c: unexport get_{mm,task}_exe_file

Only used by core code and by tomoyo, which cannot be a module either.

Link: https://lkml.kernel.org/r/20210820095430.445242-1-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/fork.c |    2 --
 1 file changed, 2 deletions(-)

--- a/kernel/fork.c~kernel-unexport-get_mmtask_exe_file
+++ a/kernel/fork.c
@@ -1187,7 +1187,6 @@ struct file *get_mm_exe_file(struct mm_s
 	rcu_read_unlock();
 	return exe_file;
 }
-EXPORT_SYMBOL(get_mm_exe_file);
 
 /**
  * get_task_exe_file - acquire a reference to the task's executable file
@@ -1210,7 +1209,6 @@ struct file *get_task_exe_file(struct ta
 	task_unlock(task);
 	return exe_file;
 }
-EXPORT_SYMBOL(get_task_exe_file);
 
 /**
  * get_task_mm - acquire a reference to the task's mm
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 140/147] pid: cleanup the stale comment mentioning pidmap_init().
  2021-09-08  2:52 incoming Andrew Morton
                   ` (137 preceding siblings ...)
  2021-09-08  3:00 ` [patch 139/147] kernel/fork.c: unexport get_{mm,task}_exe_file Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 141/147] prctl: allow to setup brk for et_dyn executables Andrew Morton
                   ` (8 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, itazur, kuniyu, linux-mm, mm-commits, torvalds

From: Takahiro Itazuri <itazur@amazon.com>
Subject: pid: cleanup the stale comment mentioning pidmap_init().

pidmap_init() has already been replaced with pid_idr_init() in the commit
95846ecf9dac ("pid: replace pid bitmap implementation with IDR API"). 
Cleanup the stale comment which still mentions it.

Link: https://lkml.kernel.org/r/20210714120713.19825-1-itazur@amazon.com
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Cc: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/threads.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/threads.h~pid-cleanup-the-stale-comment-mentioning-pidmap_init
+++ a/include/linux/threads.h
@@ -38,7 +38,7 @@
  * Define a minimum number of pids per cpu.  Heuristically based
  * on original pid max of 32k for 32 cpus.  Also, increase the
  * minimum settable value for pid_max on the running system based
- * on similar defaults.  See kernel/pid.c:pidmap_init() for details.
+ * on similar defaults.  See kernel/pid.c:pid_idr_init() for details.
  */
 #define PIDS_PER_CPU_DEFAULT	1024
 #define PIDS_PER_CPU_MIN	8
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 141/147] prctl: allow to setup brk for et_dyn executables
  2021-09-08  2:52 incoming Andrew Morton
                   ` (138 preceding siblings ...)
  2021-09-08  3:00 ` [patch 140/147] pid: cleanup the stale comment mentioning pidmap_init() Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 142/147] configs: remove the obsolete CONFIG_INPUT_POLLDEV Andrew Morton
                   ` (7 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: 0x7f454c46, akpm, alexander.mikhalitsyn, avagin, ebiederm,
	gorcunov, keno, ktkhai, linux-mm, mm-commits, ptikhomirov,
	torvalds

From: Cyrill Gorcunov <gorcunov@gmail.com>
Subject: prctl: allow to setup brk for et_dyn executables

Keno Fischer reported that when a binary is loaded via ld-linux-x86-64,
prctl(PR_SET_MM_MAP) doesn't allow setting up the brk value because it
lies before mm:end_data.

For example a test program shows

 | # ~/t
 |
 | start_code      401000
 | end_code        401a15
 | start_stack     7ffce4577dd0
 | start_data	   403e10
 | end_data        40408c
 | start_brk	   b5b000
 | sbrk(0)         b5b000

and when executed via ld-linux

 | # /lib64/ld-linux-x86-64.so.2 ~/t
 |
 | start_code      7fc25b0a4000
 | end_code        7fc25b0c4524
 | start_stack     7fffcc6b2400
 | start_data	   7fc25b0ce4c0
 | end_data        7fc25b0cff98
 | start_brk	   55555710c000
 | sbrk(0)         55555710c000

This of course prevents criu from restoring such programs.  Looking into
how the kernel operates with brk/start_brk inside the brk() syscall, I
don't see any problem if we allow setting up brk/start_brk without
checking for end_data.  Even if someone passes some weird address here on
purpose, the worst possible result is an unexpected unmapping of an
existing vma (the caller's own vma, since prctl works on the caller's
memory), but the test for RLIMIT_DATA is still valid, so a user won't be
able to gain more memory by expanding VMAs via the new values shipped with
the prctl call.
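
The restore-side call that this change legalizes looks roughly like the
following (a hypothetical sketch; struct prctl_mm_map and PR_SET_MM_MAP
are the real uapi, but every field value would come from a checkpoint
image in practice):

	#include <stdio.h>
	#include <string.h>
	#include <sys/prctl.h>
	#include <linux/prctl.h>

	int main(void)
	{
		struct prctl_mm_map map;

		memset(&map, 0, sizeof(map));
		/* fill the code, data, stack, arg and env fields from
		 * the image; with this patch, start_brk and brk may
		 * legally lie below end_data, as they do for binaries
		 * started via /lib64/ld-linux-x86-64.so.2 */
		map.exe_fd = (unsigned int)-1;	/* keep the current exe */

		if (prctl(PR_SET_MM, PR_SET_MM_MAP, &map, sizeof(map), 0))
			perror("PR_SET_MM_MAP");
		return 0;
	}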

Link: https://lkml.kernel.org/r/20210121221207.GB2174@grain
Fixes: bbdc6076d2e5 ("binfmt_elf: move brk out of mmap when doing direct loader exec")
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reported-by: Keno Fischer <keno@juliacomputing.com>
Acked-by: Andrey Vagin <avagin@gmail.com>
Tested-by: Andrey Vagin <avagin@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/sys.c |    7 -------
 1 file changed, 7 deletions(-)

--- a/kernel/sys.c~prctl-allow-to-setup-brk-for-et_dyn-executables
+++ a/kernel/sys.c
@@ -1960,13 +1960,6 @@ static int validate_prctl_map_addr(struc
 	error = -EINVAL;
 
 	/*
-	 * @brk should be after @end_data in traditional maps.
-	 */
-	if (prctl_map->start_brk <= prctl_map->end_data ||
-	    prctl_map->brk <= prctl_map->end_data)
-		goto out;
-
-	/*
 	 * Neither we should allow to override limits if they set.
 	 */
 	if (check_data_rlimit(rlimit(RLIMIT_DATA), prctl_map->brk,
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 142/147] configs: remove the obsolete CONFIG_INPUT_POLLDEV
  2021-09-08  2:52 incoming Andrew Morton
                   ` (139 preceding siblings ...)
  2021-09-08  3:00 ` [patch 141/147] prctl: allow to setup brk for et_dyn executables Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 143/147] Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH Andrew Morton
                   ` (6 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, dmitry.torokhov, linux-mm, mm-commits, torvalds, yuzenghui

From: Zenghui Yu <yuzenghui@huawei.com>
Subject: configs: remove the obsolete CONFIG_INPUT_POLLDEV

This CONFIG option was removed in commit 278b13ce3a89 ("Input: remove
input_polled_dev implementation"), so there's no point in keeping it in
defconfigs any longer.

Get rid of the leftovers for all arches.

Link: https://lkml.kernel.org/r/20210726074741.1062-1-yuzenghui@huawei.com
Signed-off-by: Zenghui Yu <yuzenghui@huawei.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/configs/dove_defconfig             |    1 -
 arch/arm/configs/pxa_defconfig              |    1 -
 arch/mips/configs/lemote2f_defconfig        |    1 -
 arch/mips/configs/pic32mzda_defconfig       |    1 -
 arch/mips/configs/rt305x_defconfig          |    1 -
 arch/mips/configs/xway_defconfig            |    1 -
 arch/parisc/configs/generic-32bit_defconfig |    1 -
 arch/x86/configs/i386_defconfig             |    1 -
 arch/x86/configs/x86_64_defconfig           |    1 -
 9 files changed, 9 deletions(-)

--- a/arch/arm/configs/dove_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/arm/configs/dove_defconfig
@@ -56,7 +56,6 @@ CONFIG_ATA=y
 CONFIG_SATA_MV=y
 CONFIG_NETDEVICES=y
 CONFIG_MV643XX_ETH=y
-CONFIG_INPUT_POLLDEV=y
 # CONFIG_INPUT_MOUSEDEV is not set
 CONFIG_INPUT_EVDEV=y
 # CONFIG_KEYBOARD_ATKBD is not set
--- a/arch/arm/configs/pxa_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/arm/configs/pxa_defconfig
@@ -284,7 +284,6 @@ CONFIG_RT2800USB=m
 CONFIG_MWIFIEX=m
 CONFIG_MWIFIEX_SDIO=m
 CONFIG_INPUT_FF_MEMLESS=m
-CONFIG_INPUT_POLLDEV=y
 CONFIG_INPUT_MATRIXKMAP=y
 CONFIG_INPUT_MOUSEDEV=m
 CONFIG_INPUT_MOUSEDEV_SCREEN_X=640
--- a/arch/mips/configs/lemote2f_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/mips/configs/lemote2f_defconfig
@@ -116,7 +116,6 @@ CONFIG_8139TOO=y
 CONFIG_R8169=y
 CONFIG_USB_USBNET=m
 CONFIG_USB_NET_CDC_EEM=m
-CONFIG_INPUT_POLLDEV=m
 CONFIG_INPUT_EVDEV=y
 # CONFIG_MOUSE_PS2_ALPS is not set
 # CONFIG_MOUSE_PS2_LOGIPS2PP is not set
--- a/arch/mips/configs/pic32mzda_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/mips/configs/pic32mzda_defconfig
@@ -34,7 +34,6 @@ CONFIG_SCSI_CONSTANTS=y
 CONFIG_SCSI_SCAN_ASYNC=y
 # CONFIG_SCSI_LOWLEVEL is not set
 CONFIG_INPUT_LEDS=m
-CONFIG_INPUT_POLLDEV=y
 CONFIG_INPUT_MOUSEDEV=m
 CONFIG_INPUT_EVDEV=y
 CONFIG_INPUT_EVBUG=m
--- a/arch/mips/configs/rt305x_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/mips/configs/rt305x_defconfig
@@ -90,7 +90,6 @@ CONFIG_PPPOE=m
 CONFIG_PPP_ASYNC=m
 CONFIG_ISDN=y
 CONFIG_INPUT=m
-CONFIG_INPUT_POLLDEV=m
 # CONFIG_KEYBOARD_ATKBD is not set
 # CONFIG_INPUT_MOUSE is not set
 CONFIG_INPUT_MISC=y
--- a/arch/mips/configs/xway_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/mips/configs/xway_defconfig
@@ -96,7 +96,6 @@ CONFIG_PPPOE=m
 CONFIG_PPP_ASYNC=m
 CONFIG_ISDN=y
 CONFIG_INPUT=m
-CONFIG_INPUT_POLLDEV=m
 # CONFIG_KEYBOARD_ATKBD is not set
 # CONFIG_INPUT_MOUSE is not set
 CONFIG_INPUT_MISC=y
--- a/arch/parisc/configs/generic-32bit_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/parisc/configs/generic-32bit_defconfig
@@ -111,7 +111,6 @@ CONFIG_PPP_BSDCOMP=m
 CONFIG_PPP_DEFLATE=m
 CONFIG_PPPOE=m
 # CONFIG_WLAN is not set
-CONFIG_INPUT_POLLDEV=y
 CONFIG_KEYBOARD_HIL_OLD=m
 CONFIG_KEYBOARD_HIL=m
 CONFIG_MOUSE_SERIAL=y
--- a/arch/x86/configs/i386_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/x86/configs/i386_defconfig
@@ -156,7 +156,6 @@ CONFIG_FORCEDETH=y
 CONFIG_8139TOO=y
 # CONFIG_8139TOO_PIO is not set
 CONFIG_R8169=y
-CONFIG_INPUT_POLLDEV=y
 CONFIG_INPUT_EVDEV=y
 CONFIG_INPUT_JOYSTICK=y
 CONFIG_INPUT_TABLET=y
--- a/arch/x86/configs/x86_64_defconfig~configs-remove-the-obsolete-config_input_polldev
+++ a/arch/x86/configs/x86_64_defconfig
@@ -148,7 +148,6 @@ CONFIG_SKY2=y
 CONFIG_FORCEDETH=y
 CONFIG_8139TOO=y
 CONFIG_R8169=y
-CONFIG_INPUT_POLLDEV=y
 CONFIG_INPUT_EVDEV=y
 CONFIG_INPUT_JOYSTICK=y
 CONFIG_INPUT_TABLET=y
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 143/147] Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
  2021-09-08  2:52 incoming Andrew Morton
                   ` (140 preceding siblings ...)
  2021-09-08  3:00 ` [patch 142/147] configs: remove the obsolete CONFIG_INPUT_POLLDEV Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 144/147] selftests/memfd: remove unused variable Andrew Morton
                   ` (5 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, babu.moger, dzickus, linux-mm, lukas.bulwahn, masahiroy,
	mm-commits, npiggin, rdunlap, torvalds

From: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH

Commit 05a4a9527931 ("kernel/watchdog: split up config options") adds a
new config HARDLOCKUP_DETECTOR, which selects the non-existing config
HARDLOCKUP_DETECTOR_ARCH.

Hence, ./scripts/checkkconfigsymbols.py warns:

HARDLOCKUP_DETECTOR_ARCH Referencing files: lib/Kconfig.debug

Simply drop selecting the non-existing HARDLOCKUP_DETECTOR_ARCH.

Link: https://lkml.kernel.org/r/20210806115618.22088-1-lukas.bulwahn@gmail.com
Fixes: 05a4a9527931 ("kernel/watchdog: split up config options")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Babu Moger <babu.moger@oracle.com>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug |    1 -
 1 file changed, 1 deletion(-)

--- a/lib/Kconfig.debug~kconfigdebug-drop-selecting-non-existing-hardlockup_detector_arch
+++ a/lib/Kconfig.debug
@@ -1062,7 +1062,6 @@ config HARDLOCKUP_DETECTOR
 	depends on HAVE_HARDLOCKUP_DETECTOR_PERF || HAVE_HARDLOCKUP_DETECTOR_ARCH
 	select LOCKUP_DETECTOR
 	select HARDLOCKUP_DETECTOR_PERF if HAVE_HARDLOCKUP_DETECTOR_PERF
-	select HARDLOCKUP_DETECTOR_ARCH if HAVE_HARDLOCKUP_DETECTOR_ARCH
 	help
 	  Say Y here to enable the kernel to act as a watchdog to detect
 	  hard lockups.
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 144/147] selftests/memfd: remove unused variable
  2021-09-08  2:52 incoming Andrew Morton
                   ` (141 preceding siblings ...)
  2021-09-08  3:00 ` [patch 143/147] Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 145/147] ipc: replace costly bailout check in sysvipc_find_ipc() Andrew Morton
                   ` (4 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, gthelen, joel, linux-mm, mm-commits, mpe, shuah, torvalds

From: Greg Thelen <gthelen@google.com>
Subject: selftests/memfd: remove unused variable

Commit 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE
seal") added an unused variable to mfd_assert_reopen_fd().

Delete the unused variable.

Link: https://lkml.kernel.org/r/20210702045509.1517643-1-gthelen@google.com
Fixes: 544029862cbb ("selftests/memfd: add tests for F_SEAL_FUTURE_WRITE seal")
Signed-off-by: Greg Thelen <gthelen@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/memfd/memfd_test.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/tools/testing/selftests/memfd/memfd_test.c~selftests-memfd-remove-unused-variable
+++ a/tools/testing/selftests/memfd/memfd_test.c
@@ -56,7 +56,7 @@ static int mfd_assert_new(const char *na
 
 static int mfd_assert_reopen_fd(int fd_in)
 {
-	int r, fd;
+	int fd;
 	char path[100];
 
 	sprintf(path, "/proc/self/fd/%d", fd_in);
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 145/147] ipc: replace costly bailout check in sysvipc_find_ipc()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (142 preceding siblings ...)
  2021-09-08  3:00 ` [patch 144/147] selftests/memfd: remove unused variable Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 146/147] mm/workingset: correct kernel-doc notations Andrew Morton
                   ` (3 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, aquini, dbueso, linux-mm, llong, manfred, mm-commits, torvalds

From: Rafael Aquini <aquini@redhat.com>
Subject: ipc: replace costly bailout check in sysvipc_find_ipc()

sysvipc_find_ipc() was left with a costly way to check if the offset
position fed to it is bigger than the total number of IPC IDs in use.  So
much so that the time it takes to iterate over /proc/sysvipc/* files grows
roughly quadratically for a custom benchmark that creates "N" SYSV shm segments
and then times the read of /proc/sysvipc/shm (milliseconds):

    12 msecs to read   1024 segs from /proc/sysvipc/shm
    18 msecs to read   2048 segs from /proc/sysvipc/shm
    65 msecs to read   4096 segs from /proc/sysvipc/shm
   325 msecs to read   8192 segs from /proc/sysvipc/shm
  1303 msecs to read  16384 segs from /proc/sysvipc/shm
  5182 msecs to read  32768 segs from /proc/sysvipc/shm

The root problem lies with the loop that computes the total number of ids
in use in order to check whether the "pos" fed to sysvipc_find_ipc() grew
bigger than "ids->in_use".  That is quite an inefficient way to get to the
maximum index in the id lookup table, especially when that value is already
provided by struct ipc_ids.max_idx.

This patch follows up on the optimization introduced via commit
15df03c879836 ("sysvipc: make get_maxid O(1) again") and gets rid of the
aforementioned costly loop, replacing it with a simpler check based on the
value returned by ipc_get_maxidx().  This allows for a smooth linear
increase in read time for the same custom benchmark:

     2 msecs to read   1024 segs from /proc/sysvipc/shm
     2 msecs to read   2048 segs from /proc/sysvipc/shm
     4 msecs to read   4096 segs from /proc/sysvipc/shm
     9 msecs to read   8192 segs from /proc/sysvipc/shm
    19 msecs to read  16384 segs from /proc/sysvipc/shm
    39 msecs to read  32768 segs from /proc/sysvipc/shm
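
For reference, a minimal sketch of this kind of benchmark (illustrative
only, and not the exact tool used to produce the numbers above):

    /* Create N SysV shm segments, then time one sequential read of
     * /proc/sysvipc/shm.  Note: the segments are left behind; clean
     * them up with ipcrm afterwards. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>
    #include <time.h>

    int main(int argc, char **argv)
    {
            int i, n = argc > 1 ? atoi(argv[1]) : 1024;
            struct timespec t0, t1;
            char buf[4096];
            FILE *f;

            for (i = 0; i < n; i++)
                    if (shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600) < 0)
                            perror("shmget");

            clock_gettime(CLOCK_MONOTONIC, &t0);
            f = fopen("/proc/sysvipc/shm", "r");
            while (f && fread(buf, 1, sizeof(buf), f) > 0)
                    ;
            clock_gettime(CLOCK_MONOTONIC, &t1);
            if (f)
                    fclose(f);

            printf("%ld msecs to read %d segs from /proc/sysvipc/shm\n",
                   (long)((t1.tv_sec - t0.tv_sec) * 1000 +
                          (t1.tv_nsec - t0.tv_nsec) / 1000000), n);
            return 0;
    }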

Link: https://lkml.kernel.org/r/20210809203554.1562989-1-aquini@redhat.com
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Acked-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Manfred Spraul <manfred@colorfullife.com>
Cc: Waiman Long <llong@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 ipc/util.c |   16 ++++------------
 1 file changed, 4 insertions(+), 12 deletions(-)

--- a/ipc/util.c~ipc-replace-costly-bailout-check-in-sysvipc_find_ipc
+++ a/ipc/util.c
@@ -788,21 +788,13 @@ struct pid_namespace *ipc_seq_pid_ns(str
 static struct kern_ipc_perm *sysvipc_find_ipc(struct ipc_ids *ids, loff_t pos,
 					      loff_t *new_pos)
 {
-	struct kern_ipc_perm *ipc;
-	int total, id;
+	struct kern_ipc_perm *ipc = NULL;
+	int max_idx = ipc_get_maxidx(ids);
 
-	total = 0;
-	for (id = 0; id < pos && total < ids->in_use; id++) {
-		ipc = idr_find(&ids->ipcs_idr, id);
-		if (ipc != NULL)
-			total++;
-	}
-
-	ipc = NULL;
-	if (total >= ids->in_use)
+	if (max_idx == -1 || pos > max_idx)
 		goto out;
 
-	for (; pos < ipc_mni; pos++) {
+	for (; pos <= max_idx; pos++) {
 		ipc = idr_find(&ids->ipcs_idr, pos);
 		if (ipc != NULL) {
 			rcu_read_lock();
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 146/147] mm/workingset: correct kernel-doc notations
  2021-09-08  2:52 incoming Andrew Morton
                   ` (143 preceding siblings ...)
  2021-09-08  3:00 ` [patch 145/147] ipc: replace costly bailout check in sysvipc_find_ipc() Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:00 ` [patch 147/147] scripts: check_extable: fix typo in user error message Andrew Morton
                   ` (2 subsequent siblings)
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, torvalds, willy

From: Randy Dunlap <rdunlap@infradead.org>
Subject: mm/workingset: correct kernel-doc notations

Use the documented kernel-doc format to prevent kernel-doc warnings.

mm/workingset.c:256: warning: No description found for return value of 'workingset_eviction'
mm/workingset.c:285: warning: Function parameter or member 'folio' not described in 'workingset_refault'
mm/workingset.c:285: warning: Excess function parameter 'page' description in 'workingset_refault'
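
For reference, the documented kernel-doc shape for a function with a
return value looks roughly like this (a generic sketch, not the exact
comment from mm/workingset.c):

    /**
     * my_func() - Short one-line description
     * @arg: description of the parameter
     *
     * Optional longer description.
     *
     * Return: description of the return value.
     */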

Link: https://lkml.kernel.org/r/20210808203153.10678-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/workingset.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/workingset.c~mm-workingset-correct-kernel-doc-notations
+++ a/mm/workingset.c
@@ -249,7 +249,7 @@ void workingset_age_nonresident(struct l
  * @target_memcg: the cgroup that is causing the reclaim
  * @page: the page being evicted
  *
- * Returns a shadow entry to be stored in @page->mapping->i_pages in place
+ * Return: a shadow entry to be stored in @page->mapping->i_pages in place
  * of the evicted @page so that a later refault can be detected.
  */
 void *workingset_eviction(struct page *page, struct mem_cgroup *target_memcg)
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 147/147] scripts: check_extable: fix typo in user error message
  2021-09-08  2:52 incoming Andrew Morton
                   ` (144 preceding siblings ...)
  2021-09-08  3:00 ` [patch 146/147] mm/workingset: correct kernel-doc notations Andrew Morton
@ 2021-09-08  3:00 ` Andrew Morton
  2021-09-08  3:16 ` [patch 129/147] trap: cleanup trap_init() Andrew Morton
  2021-09-08  8:57 ` incoming Vlastimil Babka
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:00 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, quentin.casasnovas, rdunlap, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: scripts: check_extable: fix typo in user error message

Fix typo ("and" should be "an") in an error message.

Link: https://lkml.kernel.org/r/20210727002943.29774-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/check_extable.sh |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/scripts/check_extable.sh~scripts-check_extable-fix-typo-in-user-error-message
+++ a/scripts/check_extable.sh
@@ -4,7 +4,7 @@
 
 obj=$1
 
-file ${obj} | grep -q ELF || (echo "${obj} is not and ELF file." 1>&2 ; exit 0)
+file ${obj} | grep -q ELF || (echo "${obj} is not an ELF file." 1>&2 ; exit 0)
 
 # Bail out early if there isn't an __ex_table section in this object file.
 objdump -hj __ex_table ${obj} 2> /dev/null > /dev/null
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* [patch 129/147] trap: cleanup trap_init()
  2021-09-08  2:52 incoming Andrew Morton
                   ` (145 preceding siblings ...)
  2021-09-08  3:00 ` [patch 147/147] scripts: check_extable: fix typo in user error message Andrew Morton
@ 2021-09-08  3:16 ` Andrew Morton
  2021-09-08  8:57 ` incoming Vlastimil Babka
  147 siblings, 0 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08  3:16 UTC (permalink / raw)
  To: akpm, anton.ivanov, benh, deller, James.Bottomley, jdike, jonas,
	ley.foon.tan, linux-mm, mm-commits, mpe, palmerdabbelt, paulus,
	richard, rmk+kernel, shorne, stefan.kristiansson, torvalds,
	wangkefeng.wang, ysato

From: Kefeng Wang <wangkefeng.wang@huawei.com>
Subject: trap: cleanup trap_init()

There are some empty trap_init() definitions across different
architectures.  Introduce a new weak trap_init() function so they can all
be cleaned up.
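
As a sketch of the mechanism (illustrative, not part of the patch): with
a __weak stub in init/main.c, any architecture that actually needs the
hook keeps a strong definition, which the linker prefers over the weak
one:

    /* init/main.c: weak fallback, used when no arch provides its own */
    void __init __weak trap_init(void) { }

    /* arch/foo/kernel/traps.c: hypothetical arch override */
    void __init trap_init(void)
    {
            /* install exception vectors, etc. */
    }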

Link: https://lkml.kernel.org/r/20210812123602.76356-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>	[arm32]
Acked-by: Vineet Gupta						[arc]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>			[powerpc]
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ley Foon Tan <ley.foon.tan@intel.com>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi>
Cc: Stafford Horne <shorne@gmail.com>
Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Helge Deller <deller@gmx.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <palmerdabbelt@google.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Anton Ivanov <anton.ivanov@cambridgegreys.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arc/kernel/traps.c      |    5 -----
 arch/arm/kernel/traps.c      |    5 -----
 arch/h8300/kernel/traps.c    |    4 ----
 arch/hexagon/kernel/traps.c  |    4 ----
 arch/nds32/kernel/traps.c    |    5 -----
 arch/nios2/kernel/traps.c    |    5 -----
 arch/openrisc/kernel/traps.c |    5 -----
 arch/parisc/kernel/traps.c   |    4 ----
 arch/powerpc/kernel/traps.c  |    5 -----
 arch/riscv/kernel/traps.c    |    5 -----
 arch/um/kernel/trap.c        |    4 ----
 init/main.c                  |    2 ++
 12 files changed, 2 insertions(+), 51 deletions(-)

--- a/arch/arc/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/arc/kernel/traps.c
@@ -20,11 +20,6 @@
 #include <asm/unaligned.h>
 #include <asm/kprobes.h>
 
-void __init trap_init(void)
-{
-	return;
-}
-
 void die(const char *str, struct pt_regs *regs, unsigned long address)
 {
 	show_kernel_fault_diag(str, regs, address);
--- a/arch/arm/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/arm/kernel/traps.c
@@ -781,11 +781,6 @@ void abort(void)
 	panic("Oops failed to kill thread");
 }
 
-void __init trap_init(void)
-{
-	return;
-}
-
 #ifdef CONFIG_KUSER_HELPERS
 static void __init kuser_init(void *vectors)
 {
--- a/arch/h8300/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/h8300/kernel/traps.c
@@ -39,10 +39,6 @@ void __init base_trap_init(void)
 {
 }
 
-void __init trap_init(void)
-{
-}
-
 asmlinkage void set_esp0(unsigned long ssp)
 {
 	current->thread.esp0 = ssp;
--- a/arch/hexagon/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/hexagon/kernel/traps.c
@@ -28,10 +28,6 @@
 #define TRAP_SYSCALL	1
 #define TRAP_DEBUG	0xdb
 
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_GENERIC_BUG
 /* Maybe should resemble arch/sh/kernel/traps.c ?? */
 int is_valid_bugaddr(unsigned long addr)
--- a/arch/nds32/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/nds32/kernel/traps.c
@@ -183,11 +183,6 @@ void __pgd_error(const char *file, int l
 }
 
 extern char *exception_vector, *exception_vector_end;
-void __init trap_init(void)
-{
-	return;
-}
-
 void __init early_trap_init(void)
 {
 	unsigned long ivb = 0;
--- a/arch/nios2/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/nios2/kernel/traps.c
@@ -105,11 +105,6 @@ void show_stack(struct task_struct *task
 	printk("%s\n", loglvl);
 }
 
-void __init trap_init(void)
-{
-	/* Nothing to do here */
-}
-
 /* Breakpoint handler */
 asmlinkage void breakpoint_c(struct pt_regs *fp)
 {
--- a/arch/openrisc/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/openrisc/kernel/traps.c
@@ -231,11 +231,6 @@ void unhandled_exception(struct pt_regs
 	die("Oops", regs, 9);
 }
 
-void __init trap_init(void)
-{
-	/* Nothing needs to be done */
-}
-
 asmlinkage void do_trap(struct pt_regs *regs, unsigned long address)
 {
 	force_sig_fault(SIGTRAP, TRAP_BRKPT, (void __user *)regs->pc);
--- a/arch/parisc/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/parisc/kernel/traps.c
@@ -859,7 +859,3 @@ void  __init early_trap_init(void)
 
 	initialize_ivt(&fault_vector_20);
 }
-
-void __init trap_init(void)
-{
-}
--- a/arch/powerpc/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/powerpc/kernel/traps.c
@@ -2215,11 +2215,6 @@ DEFINE_INTERRUPT_HANDLER(kernel_bad_stac
 	die("Bad kernel stack pointer", regs, SIGABRT);
 }
 
-void __init trap_init(void)
-{
-}
-
-
 #ifdef CONFIG_PPC_EMULATED_STATS
 
 #define WARN_EMULATED_SETUP(type)	.type = { .name = #type }
--- a/arch/riscv/kernel/traps.c~trap-cleanup-trap_init
+++ a/arch/riscv/kernel/traps.c
@@ -199,11 +199,6 @@ int is_valid_bugaddr(unsigned long pc)
 }
 #endif /* CONFIG_GENERIC_BUG */
 
-/* stvec & scratch is already set from head.S */
-void __init trap_init(void)
-{
-}
-
 #ifdef CONFIG_VMAP_STACK
 static DEFINE_PER_CPU(unsigned long [OVERFLOW_STACK_SIZE/sizeof(long)],
 		overflow_stack)__aligned(16);
--- a/arch/um/kernel/trap.c~trap-cleanup-trap_init
+++ a/arch/um/kernel/trap.c
@@ -311,7 +311,3 @@ void winch(int sig, struct siginfo *unus
 {
 	do_IRQ(WINCH_IRQ, regs);
 }
-
-void trap_init(void)
-{
-}
--- a/init/main.c~trap-cleanup-trap_init
+++ a/init/main.c
@@ -777,6 +777,8 @@ void __init __weak poking_init(void) { }
 
 void __init __weak pgtable_cache_init(void) { }
 
+void __init __weak trap_init(void) { }
+
 bool initcall_debug;
 core_param(initcall_debug, initcall_debug, bool, 0644);
 
_

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 092/147] mtd/drivers/nand: use HZ macros
  2021-09-08  2:58 ` [patch 092/147] mtd/drivers/nand: " Andrew Morton
@ 2021-09-08  6:39   ` Miquel Raynal
  0 siblings, 0 replies; 199+ messages in thread
From: Miquel Raynal @ 2021-09-08  6:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: andriy.shevchenko, ceggers, cw00.choi, daniel.lezcano, jic23,
	Jonathan.Cameron, kyungmin.park, lars, linux-mm, linux,
	lukasz.luba, mcoquelin.stm32, mm-commits, myungjoo.ham, pmeerw,
	rafael, rui.zhang, torvalds

Hi Andrew,

akpm@linux-foundation.org wrote on Tue, 07 Sep 2021 19:58:11 -0700:

> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Subject: mtd/drivers/nand: use HZ macros
> 
> HZ unit conversion macros are available in units.h, use them and remove
> the duplicate definition.
> 
> Link: https://lkml.kernel.org/r/20210816114732.1834145-10-daniel.lezcano@linaro.org
> Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Chanwoo Choi <cw00.choi@samsung.com>
> Cc: Christian Eggers <ceggers@arri.de>
> Cc: Guenter Roeck <linux@roeck-us.net>
> Cc: Jonathan Cameron <jic23@kernel.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Kyungmin Park <kyungmin.park@samsung.com>
> Cc: Lars-Peter Clausen <lars@metafoo.de>
> Cc: Lukasz Luba <lukasz.luba@arm.com>
> Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
> Cc: MyungJoo Ham <myungjoo.ham@samsung.com>
> Cc: Peter Meerwald <pmeerw@pmeerw.net>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Zhang Rui <rui.zhang@intel.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

A better subject prefix would have been "mtd: rawnand: intel:". Maybe you can
fix it when applying.

Acked-by: Miquel Raynal <miquel.raynal@bootlin.com>

> ---
> 
>  drivers/mtd/nand/raw/intel-nand-controller.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/drivers/mtd/nand/raw/intel-nand-controller.c~mtd-drivers-nand-use-hz-macros
> +++ a/drivers/mtd/nand/raw/intel-nand-controller.c
> @@ -20,6 +20,7 @@
>  #include <linux/sched.h>
>  #include <linux/slab.h>
>  #include <linux/types.h>
> +#include <linux/units.h>
>  #include <asm/unaligned.h>
>  
>  #define EBU_CLC			0x000
> @@ -102,7 +103,6 @@
>  
>  #define MAX_CS	2
>  
> -#define HZ_PER_MHZ	1000000L
>  #define USEC_PER_SEC	1000000L
>  
>  struct ebu_nand_cs {
> _


Thanks,
Miquèl

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: incoming
  2021-09-08  2:52 incoming Andrew Morton
                   ` (146 preceding siblings ...)
  2021-09-08  3:16 ` [patch 129/147] trap: cleanup trap_init() Andrew Morton
@ 2021-09-08  8:57 ` Vlastimil Babka
  147 siblings, 0 replies; 199+ messages in thread
From: Vlastimil Babka @ 2021-09-08  8:57 UTC (permalink / raw)
  To: Andrew Morton, Linus Torvalds
  Cc: linux-mm, mm-commits, Mike Galbraith, Mel Gorman

On 9/8/21 04:52, Andrew Morton wrote:
> Subsystem: mm/slub
> 
>     Vlastimil Babka <vbabka@suse.cz>:
>     Patch series "SLUB: reduce irq disabled scope and make it RT compatible", v6:
>       mm, slub: don't call flush_all() from slab_debug_trace_open()
>       mm, slub: allocate private object map for debugfs listings
>       mm, slub: allocate private object map for validate_slab_cache()
>       mm, slub: don't disable irq for debug_check_no_locks_freed()
>       mm, slub: remove redundant unfreeze_partials() from put_cpu_partial()
>       mm, slub: extract get_partial() from new_slab_objects()
>       mm, slub: dissolve new_slab_objects() into ___slab_alloc()
>       mm, slub: return slab page from get_partial() and set c->page afterwards
>       mm, slub: restructure new page checks in ___slab_alloc()
>       mm, slub: simplify kmem_cache_cpu and tid setup
>       mm, slub: move disabling/enabling irqs to ___slab_alloc()
>       mm, slub: do initial checks in ___slab_alloc() with irqs enabled
>       mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc()
>       mm, slub: restore irqs around calling new_slab()
>       mm, slub: validate slab from partial list or page allocator before making it cpu slab
>       mm, slub: check new pages with restored irqs
>       mm, slub: stop disabling irqs around get_partial()
>       mm, slub: move reset of c->page and freelist out of deactivate_slab()
>       mm, slub: make locking in deactivate_slab() irq-safe
>       mm, slub: call deactivate_slab() without disabling irqs
>       mm, slub: move irq control into unfreeze_partials()
>       mm, slub: discard slabs in unfreeze_partials() without irqs disabled
>       mm, slub: detach whole partial list at once in unfreeze_partials()
>       mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing
>       mm, slub: only disable irq with spin_lock in __unfreeze_partials()
>       mm, slub: don't disable irqs in slub_cpu_dead()
>       mm, slab: split out the cpu offline variant of flush_slab()
> 
>     Sebastian Andrzej Siewior <bigeasy@linutronix.de>:
>       mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context
>       mm: slub: make object_map_lock a raw_spinlock_t
> 
>     Vlastimil Babka <vbabka@suse.cz>:
>       mm, slub: make slab_lock() disable irqs with PREEMPT_RT
>       mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>       mm, slub: use migrate_disable() on PREEMPT_RT
>       mm, slub: convert kmem_cpu_slab protection to local_lock

For my own peace of mind, I've checked that this part (patches 1 to 33)
is identical to the v6 posting [1] and the git version [2] that Mel and
Mike tested (replies to [1]).

[1] https://lore.kernel.org/all/20210904105003.11688-1-vbabka@suse.cz/
[2] git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/linux.git
tags/mm-slub-5.15-rc1

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08  2:54 ` [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg Andrew Morton
@ 2021-09-08 13:05   ` Jesper Dangaard Brouer
  2021-09-08 13:58     ` Vlastimil Babka
  0 siblings, 1 reply; 199+ messages in thread
From: Jesper Dangaard Brouer @ 2021-09-08 13:05 UTC (permalink / raw)
  To: Andrew Morton, bigeasy, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds, vbabka
  Cc: brouer



On 08/09/2021 04.54, Andrew Morton wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
> 
> Jann Horn reported [1] the following theoretically possible race:
> 
>    task A: put_cpu_partial() calls preempt_disable()
>    task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>    interrupt: kfree() reaches unfreeze_partials() and discards the page
>    task B (on another CPU): reallocates page as page cache
>    task A: reads page->pages and page->pobjects, which are actually
>    halves of the pointer page->lru.prev
>    task B (on another CPU): frees page
>    interrupt: allocates page as SLUB page and places it on the percpu partial list
>    task A: this_cpu_cmpxchg() succeeds
> 
>    which would cause page->pages and page->pobjects to end up containing
>    halves of pointers that would then influence when put_cpu_partial()
>    happens and show up in root-only sysfs files. Maybe that's acceptable,
>    I don't know. But there should probably at least be a comment for now
>    to point out that we're reading union fields of a page that might be
>    in a completely different state.
> 
> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
> latter disables irqs, otherwise a __slab_free() in an irq handler could
> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
> ->partial and corrupt it.  This becomes an issue on RT after a local_lock
> is introduced in later patch.  The fix means taking the local_lock also in
> put_cpu_partial() on RT.
> 
> After debugging this issue, Mike Galbraith suggested [2] that to avoid
> different locking schemes on RT and !RT, we can just protect
> put_cpu_partial() with disabled irqs (to be converted to
> local_lock_irqsave() later) everywhere.  This should be acceptable as it's
> not a fast path, and moving the actual partial unfreezing outside of the
> irq disabled section makes it short, and with the retry loop gone the code
> can be also simplified.  In addition, the race reported by Jann should no
> longer be possible.

Based on my microbench[0] measurements, changing preempt_disable to
local_irq_save will cost us 11 cycles (TSC).  I'm not against the
change; I just want people to keep this in mind.

On my E5-1650 v4 @ 3.60GHz:
  - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
  - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns

Notice the non-save/restore variant is superfast:
  - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns


[0] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c
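
(For context: the measurement pattern in that benchmark is essentially a
tight loop around the primitive, bracketed by TSC reads.  A simplified
kernel-context sketch for x86, with an assumed iteration count, not the
exact time_bench code:)

    unsigned long flags;
    u64 start, stop;
    int i, loops = 100000000;       /* assumed iteration count */

    start = rdtsc_ordered();
    for (i = 0; i < loops; i++) {
            local_irq_save(flags);
            barrier();      /* keep the body from being optimized away */
            local_irq_restore(flags);
    }
    stop = rdtsc_ordered();
    pr_info("cost: %llu cycles(tsc) per op\n", (stop - start) / loops);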

> [1] https://lore.kernel.org/lkml/CAG48ez1mvUuXwg0YPH5ANzhQLpbphqk-ZS+jbRz+H66fvm4FcA@mail.gmail.com/
> [2] https://lore.kernel.org/linux-rt-users/e3470ab357b48bccfbd1f5133b982178a7d2befb.camel@gmx.de/
> 
> Link: https://lkml.kernel.org/r/20210904105003.11688-32-vbabka@suse.cz
> Reported-by: Jann Horn <jannh@google.com>
> Suggested-by: Mike Galbraith <efault@gmx.de>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Cc: Christoph Lameter <cl@linux.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Pekka Enberg <penberg@kernel.org>
> Cc: Qian Cai <quic_qiancai@quicinc.com>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>   mm/slub.c |   83 ++++++++++++++++++++++++++++------------------------
>   1 file changed, 45 insertions(+), 38 deletions(-)
> 
> --- a/mm/slub.c~mm-slub-protect-put_cpu_partial-with-disabled-irqs-instead-of-cmpxchg
> +++ a/mm/slub.c
> @@ -2025,7 +2025,12 @@ static inline void *acquire_slab(struct
>   	return freelist;
>   }
>   
> +#ifdef CONFIG_SLUB_CPU_PARTIAL
>   static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain);
> +#else
> +static inline void put_cpu_partial(struct kmem_cache *s, struct page *page,
> +				   int drain) { }
> +#endif
>   static inline bool pfmemalloc_match(struct page *page, gfp_t gfpflags);
>   
>   /*
> @@ -2459,14 +2464,6 @@ static void unfreeze_partials_cpu(struct
>   		__unfreeze_partials(s, partial_page);
>   }
>   
> -#else	/* CONFIG_SLUB_CPU_PARTIAL */
> -
> -static inline void unfreeze_partials(struct kmem_cache *s) { }
> -static inline void unfreeze_partials_cpu(struct kmem_cache *s,
> -				  struct kmem_cache_cpu *c) { }
> -
> -#endif	/* CONFIG_SLUB_CPU_PARTIAL */
> -
>   /*
>    * Put a page that was just frozen (in __slab_free|get_partial_node) into a
>    * partial page slot if available.
> @@ -2476,46 +2473,56 @@ static inline void unfreeze_partials_cpu
>    */
>   static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
>   {
> -#ifdef CONFIG_SLUB_CPU_PARTIAL
>   	struct page *oldpage;
> -	int pages;
> -	int pobjects;
> +	struct page *page_to_unfreeze = NULL;
> +	unsigned long flags;
> +	int pages = 0;
> +	int pobjects = 0;
>   
> -	preempt_disable();
> -	do {
> -		pages = 0;
> -		pobjects = 0;
> -		oldpage = this_cpu_read(s->cpu_slab->partial);
> +	local_irq_save(flags);
> +
> +	oldpage = this_cpu_read(s->cpu_slab->partial);
>   
> -		if (oldpage) {
> +	if (oldpage) {
> +		if (drain && oldpage->pobjects > slub_cpu_partial(s)) {
> +			/*
> +			 * Partial array is full. Move the existing set to the
> +			 * per node partial list. Postpone the actual unfreezing
> +			 * outside of the critical section.
> +			 */
> +			page_to_unfreeze = oldpage;
> +			oldpage = NULL;
> +		} else {
>   			pobjects = oldpage->pobjects;
>   			pages = oldpage->pages;
> -			if (drain && pobjects > slub_cpu_partial(s)) {
> -				/*
> -				 * partial array is full. Move the existing
> -				 * set to the per node partial list.
> -				 */
> -				unfreeze_partials(s);
> -				oldpage = NULL;
> -				pobjects = 0;
> -				pages = 0;
> -				stat(s, CPU_PARTIAL_DRAIN);
> -			}
>   		}
> +	}
>   
> -		pages++;
> -		pobjects += page->objects - page->inuse;
> +	pages++;
> +	pobjects += page->objects - page->inuse;
>   
> -		page->pages = pages;
> -		page->pobjects = pobjects;
> -		page->next = oldpage;
> -
> -	} while (this_cpu_cmpxchg(s->cpu_slab->partial, oldpage, page)
> -								!= oldpage);
> -	preempt_enable();
> -#endif	/* CONFIG_SLUB_CPU_PARTIAL */
> +	page->pages = pages;
> +	page->pobjects = pobjects;
> +	page->next = oldpage;
> +
> +	this_cpu_write(s->cpu_slab->partial, page);
> +
> +	local_irq_restore(flags);
> +
> +	if (page_to_unfreeze) {
> +		__unfreeze_partials(s, page_to_unfreeze);
> +		stat(s, CPU_PARTIAL_DRAIN);
> +	}
>   }
>   
> +#else	/* CONFIG_SLUB_CPU_PARTIAL */
> +
> +static inline void unfreeze_partials(struct kmem_cache *s) { }
> +static inline void unfreeze_partials_cpu(struct kmem_cache *s,
> +				  struct kmem_cache_cpu *c) { }
> +
> +#endif	/* CONFIG_SLUB_CPU_PARTIAL */
> +
>   static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
>   {
>   	unsigned long flags;
> _
> 

$ uname -a
Linux broadwell 5.14.0-net-next+ #612 SMP PREEMPT Wed Sep 8 10:10:04 
CEST 2021 x86_64 x86_64 x86_64 GNU/Linux


My config:

$ zcat /proc/config.gz | grep PREE
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_DEBUG_PREEMPT is not set
# CONFIG_PREEMPT_TRACER is not set
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 13:05   ` Jesper Dangaard Brouer
@ 2021-09-08 13:58     ` Vlastimil Babka
  2021-09-08 14:55       ` David Hildenbrand
  2021-09-08 16:11       ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 199+ messages in thread
From: Vlastimil Babka @ 2021-09-08 13:58 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Andrew Morton, bigeasy, cl, efault,
	iamjoonsoo.kim, jannh, linux-mm, mgorman, mm-commits, penberg,
	quic_qiancai, rientjes, tglx, torvalds
  Cc: brouer

On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/09/2021 04.54, Andrew Morton wrote:
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>
>> Jann Horn reported [1] the following theoretically possible race:
>>
>>    task A: put_cpu_partial() calls preempt_disable()
>>    task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>    interrupt: kfree() reaches unfreeze_partials() and discards the page
>>    task B (on another CPU): reallocates page as page cache
>>    task A: reads page->pages and page->pobjects, which are actually
>>    halves of the pointer page->lru.prev
>>    task B (on another CPU): frees page
>>    interrupt: allocates page as SLUB page and places it on the percpu partial list
>>    task A: this_cpu_cmpxchg() succeeds
>>
>>    which would cause page->pages and page->pobjects to end up containing
>>    halves of pointers that would then influence when put_cpu_partial()
>>    happens and show up in root-only sysfs files. Maybe that's acceptable,
>>    I don't know. But there should probably at least be a comment for now
>>    to point out that we're reading union fields of a page that might be
>>    in a completely different state.
>>
>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
>> latter disables irqs, otherwise a __slab_free() in an irq handler could
>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>> ->partial and corrupt it.  This becomes an issue on RT after a local_lock
>> is introduced in later patch.  The fix means taking the local_lock also in
>> put_cpu_partial() on RT.
>>
>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>> different locking schemes on RT and !RT, we can just protect
>> put_cpu_partial() with disabled irqs (to be converted to
>> local_lock_irqsave() later) everywhere.  This should be acceptable as it's
>> not a fast path, and moving the actual partial unfreezing outside of the
>> irq disabled section makes it short, and with the retry loop gone the code
>> can be also simplified.  In addition, the race reported by Jann should no
>> longer be possible.
> 
> Based on my microbench[0] measurement changing preempt_disable to 
> local_irq_save will cost us 11 cycles (TSC).  I'm not against the 
> change, I just want people to keep this in mind.

OK, but this is not a fast path for every allocation/free, so it gets
amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
cmpxchg to be expensive too?

> On my E5-1650 v4 @ 3.60GHz:
>   - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>   - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
> 
> Notice the non-save/restore variant is superfast:
>   - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns

It actually surprises me that it's that cheap; I would have expected
changing the irq state to be the costly part, not the saving/restoring.
Incidentally, would you know what's the cost of save+restore when the
irqs are already disabled, so it's effectively a no-op?

Thanks,
Vlastimil

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 13:58     ` Vlastimil Babka
@ 2021-09-08 14:55       ` David Hildenbrand
  2021-09-08 14:59         ` David Hildenbrand
  2021-09-08 16:11       ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 199+ messages in thread
From: David Hildenbrand @ 2021-09-08 14:55 UTC (permalink / raw)
  To: Vlastimil Babka, Jesper Dangaard Brouer, Andrew Morton, bigeasy,
	cl, efault, iamjoonsoo.kim, jannh, linux-mm, mgorman, mm-commits,
	penberg, quic_qiancai, rientjes, tglx, torvalds
  Cc: brouer

On 08.09.21 15:58, Vlastimil Babka wrote:
> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/09/2021 04.54, Andrew Morton wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>
>>> Jann Horn reported [1] the following theoretically possible race:
>>>
>>>     task A: put_cpu_partial() calls preempt_disable()
>>>     task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>     interrupt: kfree() reaches unfreeze_partials() and discards the page
>>>     task B (on another CPU): reallocates page as page cache
>>>     task A: reads page->pages and page->pobjects, which are actually
>>>     halves of the pointer page->lru.prev
>>>     task B (on another CPU): frees page
>>>     interrupt: allocates page as SLUB page and places it on the percpu partial list
>>>     task A: this_cpu_cmpxchg() succeeds
>>>
>>>     which would cause page->pages and page->pobjects to end up containing
>>>     halves of pointers that would then influence when put_cpu_partial()
>>>     happens and show up in root-only sysfs files. Maybe that's acceptable,
>>>     I don't know. But there should probably at least be a comment for now
>>>     to point out that we're reading union fields of a page that might be
>>>     in a completely different state.
>>>
>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
>>> latter disables irqs, otherwise a __slab_free() in an irq handler could
>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>> ->partial and corrupt it.  This becomes an issue on RT after a local_lock
>>> is introduced in later patch.  The fix means taking the local_lock also in
>>> put_cpu_partial() on RT.
>>>
>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>> different locking schemes on RT and !RT, we can just protect
>>> put_cpu_partial() with disabled irqs (to be converted to
>>> local_lock_irqsave() later) everywhere.  This should be acceptable as it's
>>> not a fast path, and moving the actual partial unfreezing outside of the
>>> irq disabled section makes it short, and with the retry loop gone the code
>>> can be also simplified.  In addition, the race reported by Jann should no
>>> longer be possible.
>>
>> Based on my microbench[0] measurement changing preempt_disable to
>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>> change, I just want people to keep this in mind.
> 
> OK, but this is not a fast path for every allocation/free, so it gets
> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
> cmpxchg to be expensive too?
> 
>> On my E5-1650 v4 @ 3.60GHz:
>>    - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>    - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>
>> Notice the non-save/restore variant is superfast:
>>    - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
> 
> It actually surprises me that it's that cheap, and would have expected
> changing the irq state would be the costly part, not the saving/restoring.
> Incidentally, would you know what's the cost of save+restore when the
> irqs are already disabled, so it's effectively a no-op?

It surprises me as well. That would imply that protecting short RCU 
sections using

local_irq_disable
local_irq_enable

instead of via

preempt_disable
preempt_enable

would actually be very beneficial.
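
(To make the comparison concrete, a sketch of the two ways to protect a
short read-side section; "struct foo" and "gp" are made up for
illustration:)

    struct foo *p;

    /* variant 1: preemption disabled across the short section */
    preempt_disable();
    p = rcu_dereference(gp);
    /* ... short read-side use of p ... */
    preempt_enable();

    /* variant 2: irqs disabled, which also implies no preemption */
    local_irq_disable();
    p = rcu_dereference(gp);
    /* ... short read-side use of p ... */
    local_irq_enable();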

Are the numbers trustworthy? :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 14:55       ` David Hildenbrand
@ 2021-09-08 14:59         ` David Hildenbrand
  2021-09-08 17:14           ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 199+ messages in thread
From: David Hildenbrand @ 2021-09-08 14:59 UTC (permalink / raw)
  To: Vlastimil Babka, Jesper Dangaard Brouer, Andrew Morton, bigeasy,
	cl, efault, iamjoonsoo.kim, jannh, linux-mm, mgorman, mm-commits,
	penberg, quic_qiancai, rientjes, tglx, torvalds
  Cc: brouer

On 08.09.21 16:55, David Hildenbrand wrote:
> On 08.09.21 15:58, Vlastimil Babka wrote:
>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>>
>>>> Jann Horn reported [1] the following theoretically possible race:
>>>>
>>>>      task A: put_cpu_partial() calls preempt_disable()
>>>>      task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>>      interrupt: kfree() reaches unfreeze_partials() and discards the page
>>>>      task B (on another CPU): reallocates page as page cache
>>>>      task A: reads page->pages and page->pobjects, which are actually
>>>>      halves of the pointer page->lru.prev
>>>>      task B (on another CPU): frees page
>>>>      interrupt: allocates page as SLUB page and places it on the percpu partial list
>>>>      task A: this_cpu_cmpxchg() succeeds
>>>>
>>>>      which would cause page->pages and page->pobjects to end up containing
>>>>      halves of pointers that would then influence when put_cpu_partial()
>>>>      happens and show up in root-only sysfs files. Maybe that's acceptable,
>>>>      I don't know. But there should probably at least be a comment for now
>>>>      to point out that we're reading union fields of a page that might be
>>>>      in a completely different state.
>>>>
>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
>>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
>>>> latter disables irqs, otherwise a __slab_free() in an irq handler could
>>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>>> ->partial and corrupt it.  This becomes an issue on RT after a local_lock
>>>> is introduced in later patch.  The fix means taking the local_lock also in
>>>> put_cpu_partial() on RT.
>>>>
>>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>>> different locking schemes on RT and !RT, we can just protect
>>>> put_cpu_partial() with disabled irqs (to be converted to
>>>> local_lock_irqsave() later) everywhere.  This should be acceptable as it's
>>>> not a fast path, and moving the actual partial unfreezing outside of the
>>>> irq disabled section makes it short, and with the retry loop gone the code
>>>> can be also simplified.  In addition, the race reported by Jann should no
>>>> longer be possible.
>>>
>>> Based on my microbench[0] measurement changing preempt_disable to
>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>> change, I just want people to keep this in mind.
>>
>> OK, but this is not a fast path for every allocation/free, so it gets
>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>> cmpxchg to be expensive too?
>>
>>> On my E5-1650 v4 @ 3.60GHz:
>>>     - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>     - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>
>>> Notice the non-save/restore variant is superfast:
>>>     - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>
>> It actually surprises me that it's that cheap, and would have expected
>> changing the irq state would be the costly part, not the saving/restoring.
>> Incidentally, would you know what's the cost of save+restore when the
>> irqs are already disabled, so it's effectively a no-op?
> 
> It surprises me as well. That would imply that protecting short RCU
> sections using
> 
> local_irq_disable
> local_irq_enable
> 
> instead of via
> 
> preempt_disable
> preempt_enable
> 
> would actually be very beneficial.
> 
> Are the numbers trustworthy? :)
> 

.. and especially, did the benchmark consider side effects of
enabling/disabling interrupts (pipeline flushes etc.)?

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 128/147] init: move usermodehelper_enable() to populate_rootfs()
  2021-09-08  3:00 ` [patch 128/147] init: move usermodehelper_enable() to populate_rootfs() Andrew Morton
@ 2021-09-08 15:44   ` Luis Chamberlain
  2021-09-10  8:12     ` Rasmus Villemoes
  0 siblings, 1 reply; 199+ messages in thread
From: Luis Chamberlain @ 2021-09-08 15:44 UTC (permalink / raw)
  To: Andrew Morton, Rasmus Villemoes, Jessica Yu, Borislav Petkov,
	H. Peter Anvin
  Cc: bgoncalv, egorenar, hkallweit1, linux-mm, linux, mm-commits, torvalds

On Tue, Sep 07, 2021 at 08:00:03PM -0700, Andrew Morton wrote:
> From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Subject: init: move usermodehelper_enable() to populate_rootfs()
> 
> Currently, usermodehelper is enabled right before PID1 starts going
> through the initcalls. However, any call of a usermodehelper from a
> pure_, core_, postcore_, arch_, subsys_ or fs_ initcall is futile, as
> there is no filesystem contents yet.
> 
> Up until commit e7cb072eb988 ("init/initramfs.c: do unpacking
> asynchronously"), such calls, whether via some request_module(), a
> legacy uevent "/sbin/hotplug" notification or something else, would
> just fail silently with (presumably) -ENOENT from
> kernel_execve(). However, that commit introduced the
> wait_for_initramfs() synchronization hook which must be called from
> the usermodehelper exec path right before the kernel_execve, in order
> that request_module() et al done from *after* rootfs_initcall()
> time (i.e. device_ and late_ initcalls) would continue to find a
> populated initramfs as they used to.
> 
> Any call of wait_for_initramfs() done before the unpacking has been
> scheduled (i.e. before rootfs_initcall time) must just return
> immediately [and let the caller find an empty file system] in order
> not to deadlock the machine. I mistakenly thought, and my limited
> testing confirmed, that there were no such calls, so I added a
> pr_warn_once() in wait_for_initramfs(). It turns out that one can
> indeed hit request_module() as well as kobject_uevent_env() during
> those early init calls, leading to a user-visible warning in the
> kernel log emitted consistently for certain configurations.

Further proof that the semantics of init are still loose.  Formalizing
init dependencies is something we should strive for, eventually with a
DAG.  The linker-tables work I did years ago strove to get us there, as
it allows a simple explicit DAG to be expressed through the linker.
Unfortunately that patch set fell through because folks were
more interested in questioning the alternative side benefits of
linker-tables, but the use-case for helping with init is still valid.

If we *do* want to resurrect this folks should let me know.

Since the kobject_uevent_env() interest here is for /sbin/hotplug and
that crap is deprecated, in practice the relevant calls we'd care about
are the request_module() calls.

> We could just remove the pr_warn_once(), but I think it's better to
> postpone enabling the usermodehelper framework until there is at least
> some chance of finding the executable. That is also a little more
> efficient in that a lot of work done in umh.c will be elided.

I *don't* think we were aware that such request_module() calls were
happening before the fs was even ready and failing silently with
-ENOENT. As such, although moving the usermodehelper_enable()
to right after scheduling the rootfs population is the right thing,
we do lose the opportunity to learn who those odd callers were.
We could decide not to care... but this is also a missed opportunity
to find them. How important that is, is not clear to me, as
this was silently failing before...

If we wanted to keep a print for the above purpose though, we'd likely
want the full stack trace to see who the hell made the call.
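
A rough sketch of what that could look like in the early-return path of
wait_for_initramfs() (dump_stack() added for illustration; not a patch):

    void wait_for_initramfs(void)
    {
            if (!initramfs_cookie) {
                    pr_warn_once("wait_for_initramfs() called before rootfs_initcall\n");
                    dump_stack();   /* show who made the early call */
                    return;
            }
            async_synchronize_full_domain(&initramfs_domain);
    }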

> However,
> it does change the error seen by those early callers from -ENOENT to
> -EBUSY, so there is a risk of a regression if any caller care about
> the exact error value.

I'd see this as a welcome evolution, as it tells us more: we're saying
"it's coming, try again" or whatever.

A debug option to allow us to get a full warning trace in the -EBUSY
case on early init would be nice to have.

Otherwise:

Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>

  Luis

> Link: https://lkml.kernel.org/r/20210728134638.329060-1-linux@rasmusvillemoes.dk
> Fixes: e7cb072eb988 ("init/initramfs.c: do unpacking asynchronously")
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> Reported-by: Alexander Egorenkov <egorenar@linux.ibm.com>
> Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
> Reported-by: Heiner Kallweit <hkallweit1@gmail.com>
> Cc: Luis Chamberlain <mcgrof@kernel.org>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>  init/initramfs.c   |    2 ++
>  init/main.c        |    1 -
>  init/noinitramfs.c |    2 ++
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/init/initramfs.c~init-move-usermodehelper_enable-to-populate_rootfs
> +++ a/init/initramfs.c
> @@ -15,6 +15,7 @@
>  #include <linux/mm.h>
>  #include <linux/namei.h>
>  #include <linux/init_syscalls.h>
> +#include <linux/umh.h>
>  
>  static ssize_t __init xwrite(struct file *file, const char *p, size_t count,
>  		loff_t *pos)
> @@ -727,6 +728,7 @@ static int __init populate_rootfs(void)
>  {
>  	initramfs_cookie = async_schedule_domain(do_populate_rootfs, NULL,
>  						 &initramfs_domain);
> +	usermodehelper_enable();
>  	if (!initramfs_async)
>  		wait_for_initramfs();
>  	return 0;
> --- a/init/main.c~init-move-usermodehelper_enable-to-populate_rootfs
> +++ a/init/main.c
> @@ -1392,7 +1392,6 @@ static void __init do_basic_setup(void)
>  	driver_init();
>  	init_irq_proc();
>  	do_ctors();
> -	usermodehelper_enable();
>  	do_initcalls();
>  }
>  
> --- a/init/noinitramfs.c~init-move-usermodehelper_enable-to-populate_rootfs
> +++ a/init/noinitramfs.c
> @@ -10,6 +10,7 @@
>  #include <linux/kdev_t.h>
>  #include <linux/syscalls.h>
>  #include <linux/init_syscalls.h>
> +#include <linux/umh.h>
>  
>  /*
>   * Create a simple rootfs that is similar to the default initramfs
> @@ -18,6 +19,7 @@ static int __init default_rootfs(void)
>  {
>  	int err;
>  
> +	usermodehelper_enable();
>  	err = init_mkdir("/dev", 0755);
>  	if (err < 0)
>  		goto out;
> _

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 13:58     ` Vlastimil Babka
  2021-09-08 14:55       ` David Hildenbrand
@ 2021-09-08 16:11       ` Jesper Dangaard Brouer
  2021-09-08 16:31           ` Linus Torvalds
  1 sibling, 1 reply; 199+ messages in thread
From: Jesper Dangaard Brouer @ 2021-09-08 16:11 UTC (permalink / raw)
  To: Vlastimil Babka, Jesper Dangaard Brouer, Andrew Morton, bigeasy,
	cl, efault, iamjoonsoo.kim, jannh, linux-mm, mgorman, mm-commits,
	penberg, quic_qiancai, rientjes, tglx, torvalds
  Cc: brouer



On 08/09/2021 15.58, Vlastimil Babka wrote:
> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>
>>
>> On 08/09/2021 04.54, Andrew Morton wrote:
>>> From: Vlastimil Babka <vbabka@suse.cz>
>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
>>>
>>> Jann Horn reported [1] the following theoretically possible race:
>>>
>>>     task A: put_cpu_partial() calls preempt_disable()
>>>     task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>     interrupt: kfree() reaches unfreeze_partials() and discards the page
>>>     task B (on another CPU): reallocates page as page cache
>>>     task A: reads page->pages and page->pobjects, which are actually
>>>     halves of the pointer page->lru.prev
>>>     task B (on another CPU): frees page
>>>     interrupt: allocates page as SLUB page and places it on the percpu partial list
>>>     task A: this_cpu_cmpxchg() succeeds
>>>
>>>     which would cause page->pages and page->pobjects to end up containing
>>>     halves of pointers that would then influence when put_cpu_partial()
>>>     happens and show up in root-only sysfs files. Maybe that's acceptable,
>>>     I don't know. But there should probably at least be a comment for now
>>>     to point out that we're reading union fields of a page that might be
>>>     in a completely different state.
>>>
>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() is only
>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() if the
>>> latter disables irqs, otherwise a __slab_free() in an irq handler could
>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>> ->partial and corrupt it.  This becomes an issue on RT after a local_lock
>>> is introduced in later patch.  The fix means taking the local_lock also in
>>> put_cpu_partial() on RT.
>>>
>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>> different locking schemes on RT and !RT, we can just protect
>>> put_cpu_partial() with disabled irqs (to be converted to
>>> local_lock_irqsave() later) everywhere.  This should be acceptable as it's
>>> not a fast path, and moving the actual partial unfreezing outside of the
>>> irq disabled section makes it short, and with the retry loop gone the code
>>> can be also simplified.  In addition, the race reported by Jann should no
>>> longer be possible.
>>
>> Based on my microbench[0] measurement changing preempt_disable to
>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>> change, I just want people to keep this in mind.
> 
> OK, but this is not a fast path for every allocation/free, so it gets
> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
> cmpxchg to be expensive too?

Added tests for this:
  - this_cpu_cmpxchg cost: 5 cycles(tsc) 1.581 ns
  - cmpxchg          cost: 18 cycles(tsc) 5.006 ns

>> On my E5-1650 v4 @ 3.60GHz:
>>    - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>    - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>
>> Notice the non-save/restore variant is superfast:
>>    - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
> 
> It actually surprises me that it's that cheap, and would have expected
> changing the irq state would be the costly part, not the saving/restoring.
> Incidentally, would you know what's the cost of save+restore when the
> irqs are already disabled, so it's effectively a no-op?

The non-save variant simply translates to CLI and STI, which seem to
be very fast.

The cost of save+restore when the irqs are already disabled is the same
(I did a quick test).
I cannot remember who told me, but (apparently) the expensive part is
reading the CPU FLAGS register.

I did a quick test with:

	/** Loop to measure **/
	for (i = 0; i < rec->loops; i++) {
		local_irq_save(flags);
		loops_cnt++;
		barrier();
		//local_irq_restore(flags);
		local_irq_enable();
	}

Doing a save + enable costs 21 cycles(tsc), 6.015 ns
(the cost with save + restore was 22 cycles).

This confirms that reading the CPU FLAGS seems to be the expensive part.

--Jesper


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 16:11       ` Jesper Dangaard Brouer
@ 2021-09-08 16:31           ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 16:31 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Vlastimil Babka, Andrew Morton, Sebastian Andrzej Siewior,
	Christoph Lameter, Mike Galbraith, Joonsoo Kim, Jann Horn,
	Linux-MM, Mel Gorman, mm-commits, Pekka Enberg, quic_qiancai,
	David Rientjes, Thomas Gleixner, Jesper Dangaard Brouer

On Wed, Sep 8, 2021 at 9:11 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
> The non-save variant simply translated onto CLI and STI, which seems to
> be very fast.

It will depend on the microarchitecture.

Happily:

> The cost of save+restore when the irqs are already disabled is the same
> (did a quick test).

The really expensive part used to be P4. 'popf' was hundreds of cycles
if any of the non-arithmetic bits changed, iirc.

P4 used to be a big headache just because of things like that -
straightforward code ran very well, but anything a bit more special
took forever because it flushed the pipeline.

So some of our optimizations may be historic because of things like
that. We don't really need to worry about the P4 glass jaws any more,
but it *used* to be much quicker to do 'preempt_disable()' that just
does an add to a memory location than it was to disable interrupts.

> Cannot remember who told me, but (apparently) the expensive part is
> reading the CPU FLAGS.

Again, it ends up being very dependent on the uarch.

Reading and writing the flags register is somewhat expensive because
it's not really "one" register in hardware any more (even if that was
obviously the historical implementation).

These days, the arithmetic flags are generally multiple renamed
registers, and then the other flags are a separate system register
(possibly multiple bits spread out).

The cost of doing those flag reads and writes are hard to really
specify, because in an OoO architecture a lot of it ends up being "how
much of that can be done in parallel, and what's the pipeline
serialization cost". Doing a loop with rdtsc is not necessarily AT ALL
indicative of the cost when there is other real code around it.

The cost _could_ be much smaller, in case there is little
serialization with normal other code. Or, it could be much bigger than
what a rdtsc shows, because if it's a hard pipeline flush, then a
tight loop with those things won't have any real work to flush, while
in "real code" there may be hundreds of instructions in flight and
doing the flush is very expensive.

The good news is that afaik, all the modern x86 CPU microarchitectures
do reasonably well. And while a "pushf/cli/popf" sequence is probably
more cycles than an add/subtract one in a benchmark, if the preempt
counter is not otherwise needed, and is cold in the cache, then the
pushf/cli/popf may be *much* cheaper than a cache miss.

So the only way to really tell would be to run real benchmarks of real
loads on multiple different microarchitectures.

I'm pretty sure the actual result is: "you can't measure the 10-cycle
difference on any modern core because it can actually go either way".

But "I'm pretty sure" and "reality" are not the same thing.

These days, pipeline flushes and cache misses (and then as a very
particularly bad case - cache line pingpong issues) are almost the
only thing that matters.

And the most common reason by far for the pipeline flushes are branch
mispredicts, but see above: the system bits in the flags register
_have_ been a cause of them in the past, so it's not entirely
impossible.

               Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 14:59         ` David Hildenbrand
@ 2021-09-08 17:14           ` Jesper Dangaard Brouer
  2021-09-08 17:24             ` David Hildenbrand
  0 siblings, 1 reply; 199+ messages in thread
From: Jesper Dangaard Brouer @ 2021-09-08 17:14 UTC (permalink / raw)
  To: David Hildenbrand, Vlastimil Babka, Jesper Dangaard Brouer,
	Andrew Morton, bigeasy, cl, efault, iamjoonsoo.kim, jannh,
	linux-mm, mgorman, mm-commits, penberg, quic_qiancai, rientjes,
	tglx, torvalds
  Cc: brouer



On 08/09/2021 16.59, David Hildenbrand wrote:
> On 08.09.21 16:55, David Hildenbrand wrote:
>> On 08.09.21 15:58, Vlastimil Babka wrote:
>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>>
>>>>
>>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs 
>>>>> instead of cmpxchg
>>>>>
>>>>> Jann Horn reported [1] the following theoretically possible race:
>>>>>
>>>>>      task A: put_cpu_partial() calls preempt_disable()
>>>>>      task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>>>      interrupt: kfree() reaches unfreeze_partials() and discards 
>>>>> the page
>>>>>      task B (on another CPU): reallocates page as page cache
>>>>>      task A: reads page->pages and page->pobjects, which are actually
>>>>>      halves of the pointer page->lru.prev
>>>>>      task B (on another CPU): frees page
>>>>>      interrupt: allocates page as SLUB page and places it on the 
>>>>> percpu partial list
>>>>>      task A: this_cpu_cmpxchg() succeeds
>>>>>
>>>>>      which would cause page->pages and page->pobjects to end up 
>>>>> containing
>>>>>      halves of pointers that would then influence when 
>>>>> put_cpu_partial()
>>>>>      happens and show up in root-only sysfs files. Maybe that's 
>>>>> acceptable,
>>>>>      I don't know. But there should probably at least be a comment 
>>>>> for now
>>>>>      to point out that we're reading union fields of a page that 
>>>>> might be
>>>>>      in a completely different state.
>>>>>
>>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial() 
>>>>> is only
>>>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc() 
>>>>> if the
>>>>> latter disables irqs, otherwise a __slab_free() in an irq handler 
>>>>> could
>>>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>>>> ->partial and corrupt it.  This becomes an issue on RT after a 
>>>>> local_lock
>>>>> is introduced in later patch.  The fix means taking the local_lock 
>>>>> also in
>>>>> put_cpu_partial() on RT.
>>>>>
>>>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>>>> different locking schemes on RT and !RT, we can just protect
>>>>> put_cpu_partial() with disabled irqs (to be converted to
>>>>> local_lock_irqsave() later) everywhere.  This should be acceptable 
>>>>> as it's
>>>>> not a fast path, and moving the actual partial unfreezing outside 
>>>>> of the
>>>>> irq disabled section makes it short, and with the retry loop gone 
>>>>> the code
>>>>> can be also simplified.  In addition, the race reported by Jann 
>>>>> should no
>>>>> longer be possible.
>>>>
>>>> Based on my microbench[0] measurement changing preempt_disable to
>>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>>> change, I just want people to keep this in mind.
>>>
>>> OK, but this is not a fast path for every allocation/free, so it gets
>>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>>> cmpxchg to be expensive too?
>>>
>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>     - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>>     - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>>
>>>> Notice the non-save/restore variant is superfast:
>>>>     - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>>
>>> It actually surprises me that it's that cheap, and would have expected
>>> changing the irq state would be the costly part, not the 
>>> saving/restoring.
>>> Incidentally, would you know what's the cost of save+restore when the
>>> irqs are already disabled, so it's effectively a no-op?
>>
>> It surprises me as well. That would imply that protecting short RCU
>> sections using
>>
>> local_irq_disable
>> local_irq_enable
>>
>> instead of via
>>
>> preempt_disable
>> preempt_enable
>>
>> would actually be very beneficial.

Please don't draw this as a general conclusion.
As Linus describes in detail, the IRQ disable/enable will be very
micro-arch specific.

The preempt_disable/enable will likely be more stable/consistent across
micro-archs.
Keep an eye out for kernel config options when judging
preempt_disable/enable performance [1]

[1] 
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367
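
Roughly what this means (a sketch of include/linux/preempt.h; details
vary by config and kernel version):

	/* Without CONFIG_PREEMPT_COUNT, the pair is just compiler
	 * barriers and costs nothing at runtime:
	 *
	 *	#define preempt_disable()	barrier()
	 *	#define preempt_enable()	barrier()
	 *
	 * With CONFIG_PREEMPT, it is an inc/dec of the preempt count
	 * plus a resched check on enable:
	 */
	preempt_count_inc();			/* disable */
	barrier();
	/* ... critical section ... */
	barrier();				/* enable */
	if (unlikely(preempt_count_dec_and_test()))
		__preempt_schedule();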


>>
>> Are the numbers trustworthy? :)
>>
> 
> .. and especially did the benchmark consider side effects of 
> enabling/disabling interrupts (pipeline flushes etc ..)?
> 

Of course not, this is a microbenchmark... they are by definition not
trustworthy :-P

-Jesper


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg
  2021-09-08 17:14           ` Jesper Dangaard Brouer
@ 2021-09-08 17:24             ` David Hildenbrand
  0 siblings, 0 replies; 199+ messages in thread
From: David Hildenbrand @ 2021-09-08 17:24 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Vlastimil Babka, Andrew Morton, bigeasy,
	cl, efault, iamjoonsoo.kim, jannh, linux-mm, mgorman, mm-commits,
	penberg, quic_qiancai, rientjes, tglx, torvalds
  Cc: brouer

On 08.09.21 19:14, Jesper Dangaard Brouer wrote:
> 
> 
> On 08/09/2021 16.59, David Hildenbrand wrote:
>> On 08.09.21 16:55, David Hildenbrand wrote:
>>> On 08.09.21 15:58, Vlastimil Babka wrote:
>>>> On 9/8/21 15:05, Jesper Dangaard Brouer wrote:
>>>>>
>>>>>
>>>>> On 08/09/2021 04.54, Andrew Morton wrote:
>>>>>> From: Vlastimil Babka <vbabka@suse.cz>
>>>>>> Subject: mm, slub: protect put_cpu_partial() with disabled irqs
>>>>>> instead of cmpxchg
>>>>>>
>>>>>> Jann Horn reported [1] the following theoretically possible race:
>>>>>>
>>>>>>       task A: put_cpu_partial() calls preempt_disable()
>>>>>>       task A: oldpage = this_cpu_read(s->cpu_slab->partial)
>>>>>>       interrupt: kfree() reaches unfreeze_partials() and discards
>>>>>> the page
>>>>>>       task B (on another CPU): reallocates page as page cache
>>>>>>       task A: reads page->pages and page->pobjects, which are actually
>>>>>>       halves of the pointer page->lru.prev
>>>>>>       task B (on another CPU): frees page
>>>>>>       interrupt: allocates page as SLUB page and places it on the
>>>>>> percpu partial list
>>>>>>       task A: this_cpu_cmpxchg() succeeds
>>>>>>
>>>>>>       which would cause page->pages and page->pobjects to end up
>>>>>> containing
>>>>>>       halves of pointers that would then influence when
>>>>>> put_cpu_partial()
>>>>>>       happens and show up in root-only sysfs files. Maybe that's
>>>>>> acceptable,
>>>>>>       I don't know. But there should probably at least be a comment
>>>>>> for now
>>>>>>       to point out that we're reading union fields of a page that
>>>>>> might be
>>>>>>       in a completely different state.
>>>>>>
>>>>>> Additionally, the this_cpu_cmpxchg() approach in put_cpu_partial()
>>>>>> is only
>>>>>> safe against s->cpu_slab->partial manipulation in ___slab_alloc()
>>>>>> if the
>>>>>> latter disables irqs, otherwise a __slab_free() in an irq handler
>>>>>> could
>>>>>> call put_cpu_partial() in the middle of ___slab_alloc() manipulating
>>>>>> ->partial and corrupt it.  This becomes an issue on RT after a
>>>>>> local_lock
>>>>>> is introduced in later patch.  The fix means taking the local_lock
>>>>>> also in
>>>>>> put_cpu_partial() on RT.
>>>>>>
>>>>>> After debugging this issue, Mike Galbraith suggested [2] that to avoid
>>>>>> different locking schemes on RT and !RT, we can just protect
>>>>>> put_cpu_partial() with disabled irqs (to be converted to
>>>>>> local_lock_irqsave() later) everywhere.  This should be acceptable
>>>>>> as it's
>>>>>> not a fast path, and moving the actual partial unfreezing outside
>>>>>> of the
>>>>>> irq disabled section makes it short, and with the retry loop gone
>>>>>> the code
>>>>>> can be also simplified.  In addition, the race reported by Jann
>>>>>> should no
>>>>>> longer be possible.
>>>>>
>>>>> Based on my microbench[0] measurement changing preempt_disable to
>>>>> local_irq_save will cost us 11 cycles (TSC).  I'm not against the
>>>>> change, I just want people to keep this in mind.
>>>>
>>>> OK, but this is not a fast path for every allocation/free, so it gets
>>>> amortized. Also it eliminates a this_cpu_cmpxchg loop, and I'd expect
>>>> cmpxchg to be expensive too?
>>>>
>>>>> On my E5-1650 v4 @ 3.60GHz:
>>>>>      - preempt_disable(+enable)  cost: 11 cycles(tsc) 3.161 ns
>>>>>      - local_irq_save (+restore) cost: 22 cycles(tsc) 6.331 ns
>>>>>
>>>>> Notice the non-save/restore variant is superfast:
>>>>>      - local_irq_disable(+enable) cost: 6 cycles(tsc) 1.844 ns
>>>>
>>>> It actually surprises me that it's that cheap, and would have expected
>>>> changing the irq state would be the costly part, not the
>>>> saving/restoring.
>>>> Incidentally, would you know what's the cost of save+restore when the
>>>> irqs are already disabled, so it's effectively a no-op?
>>>
>>> It surprises me as well. That would imply that protecting short RCU
>>> sections using
>>>
>>> local_irq_disable
>>> local_irq_enable
>>>
>>> instead of via
>>>
>>> preempt_disable
>>> preempt_enable
>>>
>>> would actually be very beneficial.
> 
> Please don't draw this as a general conclusion.
> As Linus describes in detail, the IRQ disable/enable will be very
> micro-arch specific.

Sure: but especially for modern micro-archs, this might be very relevant.

I actually stumbled over this exact question a month ago; that's why
your comment caught my attention. I looked for CLI/STI cycle numbers and
didn't really find a trusted source. I only found [1], which made
it look like incrementing/decrementing some counter would actually be
much faster most of the time.

[1] https://www.agner.org/optimize/instruction_tables.pdf

> 
> The preempt_disable/enable will likely be more stable/consistent across
> micro-archs.
> Keep an eye out for kernel config options when judging
> preempt_disable/enable performance [1]
> 
> [1]
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c#L363-L367
> 
> 
>>>
>>> Are the numbers trustworthy? :)
>>>
>>
>> .. and especially did the benchmark consider side effects of
>> enabling/disabling interrupts (pipeline flushes etc ..)?
>>
> 
> Of course not, this is a microbenchmark... they are by definition not
> trustworthy :-P

:)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-08  2:57 ` [patch 079/147] fs/proc/kcore.c: add mmap interface Andrew Morton
@ 2021-09-08 18:13     ` Linus Torvalds
  2021-09-10 10:08   ` David Hildenbrand
  1 sibling, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 18:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexey Dobriyan, chenying.kernel, Linux-MM, mm-commits,
	Mike Rapoport, Muchun Song, zhouchengming, zhoufeng.zf

On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> After looking at the kcore code, we found that kcore does not implement
> mmap, resulting in frequent context switches triggered by read().
> Therefore, we want to add an mmap interface to optimize performance.  Since
> the vmalloc and module areas change with allocation and release,
> consistency cannot be guaranteed, so the mmap interface only maps KCORE_TEXT
> and KCORE_RAM.

Honestly, I still hate this patch.

The last time people wanted to speed up /dev/kcore accesses, it was
all for black-hat reasons and speeding up kernel attacks.

And this code just makes me nervous even aside from that, because I do
not understand what the heck it's doing.


> +       if (kern_addr_valid(start)) {
> +               if (m->type == KCORE_RAM)
> +                       pfn = __pa(start) >> PAGE_SHIFT;
> +               else if (m->type == KCORE_TEXT)
> +                       pfn = __pa_symbol(start) >> PAGE_SHIFT;

Why is "__pa(start)" right in one situation, and "__pa_symbol(start)"
in another.

So this just makes me go "this is all confusing, dangerous, and the
use-case is dubious".

Mapping kernel memory is dangerous. The use-cases for it are dubious.
The patch isn't obvious.

All of that screams "I'll skip this".

           Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 101/147] lib/string: optimized memcpy
  2021-09-08  2:58 ` [patch 101/147] lib/string: optimized memcpy Andrew Morton
@ 2021-09-08 18:26     ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 18:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Laight, drew, Guo Ren, Christoph Hellwig, kernel, Linux-MM,
	mcroce, mick, mm-commits, Nick Desaulniers, Palmer Dabbelt

I'm going to skip this one too.

On Tue, Sep 7, 2021 at 7:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Matteo Croce <mcroce@microsoft.com>
> Subject: lib/string: optimized memcpy
>
> Patch series "lib/string: optimized mem* functions", v2.

Honestly, if we change the fallback memcpy(), I think the change
should be to remove it.

This is a core architecture thing, and every architecture does their
own. And pretty much every architecture has their own optimizations
for memcpy.

Yes, the byte-at-a-time default implementation is bad. But it's
_intentionally_ bad. It's only meant for initial bringup. No
architecture should actually end up using this in the long run, and if
you see it in profiles it should make you go "Ahh" instead.
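
For reference, the fallback in question is essentially just this
(lib/string.c, under #ifndef __HAVE_ARCH_MEMCPY; slightly condensed):

	void *memcpy(void *dest, const void *src, size_t count)
	{
		char *tmp = dest;
		const char *s = src;

		while (count--)
			*tmp++ = *s++;
		return dest;
	}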

             Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 102/147] lib/string: optimized memmove
  2021-09-08  2:58 ` [patch 102/147] lib/string: optimized memmove Andrew Morton
@ 2021-09-08 18:29     ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 18:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Laight, drew, Guo Ren, Christoph Hellwig, kernel, Linux-MM,
	mcroce, mick, mm-commits, Nick Desaulniers, Palmer Dabbelt

On Tue, Sep 7, 2021 at 7:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> When the destination buffer is before the source one, or when the buffers
> don't overlap, it's safe to use memcpy() instead, which is optimized to
> use the biggest data size possible.

This one is actively buggy.

It depends on the possibly incorrect assumption that memcpy() always
copies upwards.

That is admittedly commonly true, but it's not something we can depend
on. Not even when the memcpy() implementation in the very same file
ends up doing so - because architectures can and should replace that
function with their own ones, and we have that __HAVE_ARCH_MEMCPY for
exactly that case.
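
To illustrate the trap (a made-up example, not code from the patch):
with overlapping buffers and dest above src, a forward byte-at-a-time
copy reads bytes it has already overwritten:

	int i;
	char buf[8] = "abcdef";

	/* memmove(buf + 1, buf, 6) must produce "aabcdef", but a
	 * forward copy clobbers its own source and yields "aaaaaaa":
	 */
	for (i = 0; i < 6; i++)
		buf[1 + i] = buf[i];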

Like 101/147, all reasonable architectures end up having their own
implementation anyway, but the immediate reason I'm dropping this
patch is that it's literally incorrect.

             Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 103/147] lib/string: optimized memset
  2021-09-08  2:58 ` [patch 103/147] lib/string: optimized memset Andrew Morton
@ 2021-09-08 18:34     ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 18:34 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Laight, drew, Guo Ren, Christoph Hellwig, kernel, Linux-MM,
	mcroce, mick, mm-commits, Nick Desaulniers, Palmer Dabbelt

I'm dropping this one just to be consistent, although for memset()
it's possibly a bit more reasonable to fall back on some default.

But probably not. memcpy and memset really are *so* special that these
generic versions should be considered to be "stupid placeholders for
bringup, and nothing more".

On Tue, Sep 7, 2021 at 7:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On a RISC-V machine the speed goes from 140 Mb/s to 241 Mb/s, and this is
> the binary size increase according to bloat-o-meter:

I also react to the benchmark numbers: RISC-V already has

  #define __HAVE_ARCH_MEMSET
  #define __HAVE_ARCH_MEMCPY
  #define __HAVE_ARCH_MEMMOVE

in its <asm/string.h> file, so these are just odd.

Did you benchmark these generic functions on their own, rather than
the ones that actually get *used*?
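
The override pattern being referred to is roughly this (a sketch of the
lib/string.c arrangement):

	#ifndef __HAVE_ARCH_MEMSET
	void *memset(void *s, int c, size_t count)
	{
		char *xs = s;

		while (count--)
			*xs++ = c;
		return s;
	}
	#endif

An architecture that defines __HAVE_ARCH_MEMSET in its <asm/string.h>
supplies its own memset, and the generic one is never built for it.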

           Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08  2:59 ` [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h Andrew Morton
@ 2021-09-08 18:37     ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 18:37 UTC (permalink / raw)
  To: Andrew Morton
  Cc: aklimov, Alexander Lobakin, Andy Shevchenko, Dennis Zhou,
	Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson, Will Deacon,
	Wolfram Sang, Yury Norov

On Tue, Sep 7, 2021 at 7:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> From: Yury Norov <yury.norov@gmail.com>
> Subject: bitops: move find_bit_*_le functions from le.h to find.h
>
> It's convenient to have all find_bit declarations in one place.

What what what?

The subject line says "move".

The body of the commit message doesn't imply anything else.

But the patch doesn't "move" anything at all:

>  include/asm-generic/bitops/find.h |  193 ----------------------------
>  include/asm-generic/bitops/le.h   |   64 ---------
>  2 files changed, 257 deletions(-)

What's going on?

Dropped just because I refuse to have anything to do with patches that
lie about what they are actually doing.

                Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 18:37     ` Linus Torvalds
  (?)
@ 2021-09-08 19:38     ` Yury Norov
  2021-09-08 19:46         ` Linus Torvalds
  2021-09-08 19:49       ` Andrew Morton
  -1 siblings, 2 replies; 199+ messages in thread
From: Yury Norov @ 2021-09-08 19:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, Sep 08, 2021 at 11:37:54AM -0700, Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 7:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > From: Yury Norov <yury.norov@gmail.com>
> > Subject: bitops: move find_bit_*_le functions from le.h to find.h
> >
> > It's convenient to have all find_bit declarations in one place.
> 
> What what what?
> 
> The subject line says "move".
> 
> The body of the commit message doesn't imply anything else.
> 
> But the patch doesn't "move" anything at all:
> 
> >  include/asm-generic/bitops/find.h |  193 ----------------------------
> >  include/asm-generic/bitops/le.h   |   64 ---------
> >  2 files changed, 257 deletions(-)
> 
> What's going on?
> 
> Dropped just because I refuse to have anything to do with patches that
> lie about what they are actually doing.

This is how the patch looks in my tree:
https://github.com/norov/linux/commit/4a92b733138e3fd71cd8a021ad53cbce68d61cfc

And in my submission:
http://lkml.iu.edu/hypermail/linux/kernel/2108.1/07330.html

So it actually does what it says. The following patch in this series
also differs from what I have in my tree. Something weird happened...

Andrew, Linus, are you OK if I resend the patchset?

Yury

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 19:38     ` Yury Norov
@ 2021-09-08 19:46         ` Linus Torvalds
  2021-09-08 19:49       ` Andrew Morton
  1 sibling, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 19:46 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, Sep 8, 2021 at 12:38 PM Yury Norov <yury.norov@gmail.com> wrote:
>
> Andrew, Linus, are you OK if I resend the patchset?

I've removed the whole series from my queue, so yes, resending is the
right thing.

That said, by now it's starting to be the latter half of the second
week of the merge window, and by the time it has gone through the
queues, I'm not going to guarantee that I'll be in a mood to merge new
stuff any more.

This has not been a particularly huge merge window in number of
commits, but there's actually been an unusually large number of these
kinds of odd things where I go "that's just not right".

So I've been a bit testy with people (sorry about that), and I'm
getting to the point where I just am not feeling very generous to
stuff that wasn't all prim and proper and ready by the merge window.

              Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 19:38     ` Yury Norov
  2021-09-08 19:46         ` Linus Torvalds
@ 2021-09-08 19:49       ` Andrew Morton
  2021-09-08 19:56           ` Linus Torvalds
  2021-09-08 20:16         ` Yury Norov
  1 sibling, 2 replies; 199+ messages in thread
From: Andrew Morton @ 2021-09-08 19:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: Linus Torvalds, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, 8 Sep 2021 12:38:27 -0700 Yury Norov <yury.norov@gmail.com> wrote:

> On Wed, Sep 08, 2021 at 11:37:54AM -0700, Linus Torvalds wrote:
> > On Tue, Sep 7, 2021 at 7:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > >
> > > From: Yury Norov <yury.norov@gmail.com>
> > > Subject: bitops: move find_bit_*_le functions from le.h to find.h
> > >
> > > It's convenient to have all find_bit declarations in one place.
> > 
> > What what what?
> > 
> > The subject line says "move".
> > 
> > The body of the commit message doesn't imply anything else.
> > 
> > But the patch doesn't "move" anything at all:
> > 
> > >  include/asm-generic/bitops/find.h |  193 ----------------------------
> > >  include/asm-generic/bitops/le.h   |   64 ---------
> > >  2 files changed, 257 deletions(-)
> > 
> > What's going on?
> > 
> > Dropped just because I refuse to have anything to do with patches that
> > lie about what they are actually doing.
> 
> This is how the patch looks in my tree:
> https://github.com/norov/linux/commit/4a92b733138e3fd71cd8a021ad53cbce68d61cfc
> 
> And in my submission:
> http://lkml.iu.edu/hypermail/linux/kernel/2108.1/07330.html
> 
> So it actually does what it says. The following patch in this series
> also differs from what I have in my tree. Something weird happened...

There were some competing changes in next for a while so I guess there
was damage.

> Andrew, Linus, are you OK if I resend the patchset?

Linus suggests other changes.  I suggest you get this tree into -next
as discussed and put the changes through another kernel cycle.


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 19:49       ` Andrew Morton
@ 2021-09-08 19:56           ` Linus Torvalds
  2021-09-08 20:16         ` Yury Norov
  1 sibling, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 19:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yury Norov, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, Sep 8, 2021 at 12:49 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> Linus suggests other changes.

I _think_ my other objections were to patches in different groups
(same 147-email series, but different sub-series).

But I didn't check how you had split things up, so they may end up
being related.

I dropped all of the bit finding ones.

I'm in the middle of writing the merge message, I'll do my test
builds, and then push out, so you can see exactly what I merged (and
by implication, what I did not).

              Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 19:56           ` Linus Torvalds
@ 2021-09-08 20:08             ` Linus Torvalds
  -1 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-08 20:08 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yury Norov, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, Sep 8, 2021 at 12:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'm in the middle of writing the merge message, I'll do my test
> builds, and then push out, so you can see exactly what I merged (and
> by implication, what I did not).

Ok, so no pr-tracker-bot for the manual patch merges, so here's the
equivalent manual "it's merged and pushed out now" notification..

           Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h
  2021-09-08 19:49       ` Andrew Morton
  2021-09-08 19:56           ` Linus Torvalds
@ 2021-09-08 20:16         ` Yury Norov
  1 sibling, 0 replies; 199+ messages in thread
From: Yury Norov @ 2021-09-08 20:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Linus Torvalds, aklimov, Alexander Lobakin, Andy Shevchenko,
	Dennis Zhou, Jiri Olsa, Linux-MM, mm-commits, Ulf Hansson,
	Will Deacon, Wolfram Sang

On Wed, Sep 08, 2021 at 12:49:37PM -0700, Andrew Morton wrote:
> On Wed, 8 Sep 2021 12:38:27 -0700 Yury Norov <yury.norov@gmail.com> wrote:
> 
> > On Wed, Sep 08, 2021 at 11:37:54AM -0700, Linus Torvalds wrote:
> > > On Tue, Sep 7, 2021 at 7:59 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >
> > > > From: Yury Norov <yury.norov@gmail.com>
> > > > Subject: bitops: move find_bit_*_le functions from le.h to find.h
> > > >
> > > > It's convenient to have all find_bit declarations in one place.
> > > 
> > > What what what?
> > > 
> > > The subject line says "move".
> > > 
> > > The body of the commit message doesn't imply anything else.
> > > 
> > > But the patch doesn't "move" anything at all:
> > > 
> > > >  include/asm-generic/bitops/find.h |  193 ----------------------------
> > > >  include/asm-generic/bitops/le.h   |   64 ---------
> > > >  2 files changed, 257 deletions(-)
> > > 
> > > What's going on?
> > > 
> > > Dropped just because I refuse to have anything to do with patches that
> > > lie about what they are actually doing.
> > 
> > This is how the patch looks in my tree:
> > https://github.com/norov/linux/commit/4a92b733138e3fd71cd8a021ad53cbce68d61cfc
> > 
> > And in my submission:
> > http://lkml.iu.edu/hypermail/linux/kernel/2108.1/07330.html
> > 
> > So it actually does what it says. The following patch in this series
> > also differs from what I have in my tree. Something weird happened...
> 
> There were some competing changes in next for a while so I guess there
> was damage.
> 
> > Andrew, Linus, are you OK if I resend the patchset?
> 
> Linus suggests other changes.  I suggest you get this tree into -next
> as discussed and put the changes through another kernel cycle.

OK, I'll resend next week.

^ permalink raw reply	[flat|nested] 199+ messages in thread

* RE: [patch 102/147] lib/string: optimized memmove
  2021-09-08 18:29     ` Linus Torvalds
  (?)
@ 2021-09-09  8:28     ` David Laight
  -1 siblings, 0 replies; 199+ messages in thread
From: David Laight @ 2021-09-09  8:28 UTC (permalink / raw)
  To: 'Linus Torvalds', Andrew Morton
  Cc: drew, Guo Ren, Christoph Hellwig, kernel, Linux-MM, mcroce, mick,
	mm-commits, Nick Desaulniers, Palmer Dabbelt

From: Linus Torvalds
> Sent: 08 September 2021 19:30
> 
> On Tue, Sep 7, 2021 at 7:58 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > When the destination buffer is before the source one, or when the buffers
> > don't overlap, it's safe to use memcpy() instead, which is optimized to
> > use the biggest data size possible.
> 
> This one is actively buggy.
> 
> It depends on the possibly incorrect assumption that memcpy() always
> copies upwards.

Even if the memcpy() 'mostly' copies upwards, it may copy the last
8 bytes first and then copy the rest of the buffer in 8-byte chunks.

OTOH the change to libc that made it copy backwards is just stupid.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [External] Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-08 18:13     ` Linus Torvalds
  (?)
@ 2021-09-09  9:56     ` Feng Zhou
  2021-09-09 17:32         ` Linus Torvalds
  -1 siblings, 1 reply; 199+ messages in thread
From: Feng Zhou @ 2021-09-09  9:56 UTC (permalink / raw)
  To: Linus Torvalds, Andrew Morton
  Cc: Alexey Dobriyan, chenying.kernel, Linux-MM, mm-commits,
	Mike Rapoport, Muchun Song, zhouchengming, zhengqi.arch


On 2021/9/9 2:13 AM, Linus Torvalds wrote:
> On Tue, Sep 7, 2021 at 7:57 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>> After looking at the kcore code, we found that kcore does not implement
>> mmap, resulting in frequent context switches triggered by read().
>> Therefore, we want to add an mmap interface to optimize performance.  Since
>> the vmalloc and module areas change with allocation and release,
>> consistency cannot be guaranteed, so the mmap interface only maps KCORE_TEXT
>> and KCORE_RAM.
> Honestly, I still hate this patch.
>
> The last time people wanted to speed up /dev/kcore accesses, it was
> all for black-hat reasons and speeding up kernel attacks.
>
> And this code just makes me nervous even aside from that, because I do
> not understand what the heck it's doing.
>
>
>> +       if (kern_addr_valid(start)) {
>> +               if (m->type == KCORE_RAM)
>> +                       pfn = __pa(start) >> PAGE_SHIFT;
>> +               else if (m->type == KCORE_TEXT)
>> +                       pfn = __pa_symbol(start) >> PAGE_SHIFT;
> Why is "__pa(start)" right in one situation, and "__pa_symbol(start)"
> in another.

Hi, Linus

The use here follows "read_kcore" in fs/proc/kcore.c:

	list_for_each_entry(m, &kclist_head, list) {
		...
		if (m->type == KCORE_RAM || m->type == KCORE_REMAP)
			phdr->p_paddr = __pa(m->addr);
		else if (m->type == KCORE_TEXT)
			phdr->p_paddr = __pa_symbol(m->addr);
		...
	}

This keeps the usage consistent.
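
(For background, assuming the usual x86-64 layout:)

	/* __pa(x):        translates direct-map addresses, which is
	 *                 what KCORE_RAM entries are
	 * __pa_symbol(x): translates kernel-image addresses, which is
	 *                 where KCORE_TEXT lives
	 */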


>
> So this just makes me go "this is all confusing, dangerous, and the
> use-case is dubious".
>
> Mapping kernel memory is dangerous. The use-cases for it are dubious.
> The patch isn't obvious.

The kcore mmap maps kernel memory with read permission only.

+	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
+		ret = -EPERM;
+		goto out;
+	}
+
+	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+	vma->vm_flags |= VM_MIXEDMAP;
+	vma->vm_ops = &kcore_mmap_ops;

Compared to the read interface, kcore mmap adds no risk; it just
reduces context switching.

>
> All of that screams "I'll skip this".
>
>             Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 103/147] lib/string: optimized memset
  2021-09-08 18:34     ` Linus Torvalds
  (?)
@ 2021-09-09 10:27     ` Matteo Croce
  -1 siblings, 0 replies; 199+ messages in thread
From: Matteo Croce @ 2021-09-09 10:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: linux-kernel, Andrew Morton, David Laight, drew, Guo Ren,
	Christoph Hellwig, kernel, Linux-MM, mcroce, mick, mm-commits,
	Nick Desaulniers, Palmer Dabbelt

On Wed, 8 Sep 2021 11:34:27 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> I'm dropping this one just to be consistent, although for memset()
> it's possibly a bit more reasonable to fall back on some default.
> 
> But probably not. memcpy and memset really are *so* special that these
> generic versions should be considered to be "stupid placeholders for
> bringup, and nothing more".
> 
> On Tue, Sep 7, 2021 at 7:58 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On a RISC-V machine the speed goes from 140 Mb/s to 241 Mb/s, and
> > this is the binary size increase according to bloat-o-meter:
> 
> I also react to the benchmark numbers: RISC-V already has
> 
>   #define __HAVE_ARCH_MEMSET
>   #define __HAVE_ARCH_MEMCPY
>   #define __HAVE_ARCH_MEMMOVE
> 
> in its <asm/string.h> file, so these are just odd.
> 
> Did you benchmark these generic functions on their own, rather than
> the ones that actually get *used*?
> 
>            Linus

I benchmarked against the generic routines. The RISC-V specific ones are
even slower than the generic ones, because they generate a lot of
unaligned accesses.

That was the whole point of the series initially. These C routines
should have replaced the risc-v specific assembly ones, but then it was
proposed to use them as the generic ones:

https://lore.kernel.org/linux-riscv/YNChl0tkofSGzvIX@infradead.org/

-- 
per aspera ad upstream

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [External] Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-09  9:56     ` [External] " Feng Zhou
@ 2021-09-09 17:32         ` Linus Torvalds
  0 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-09 17:32 UTC (permalink / raw)
  To: Feng Zhou
  Cc: Andrew Morton, Alexey Dobriyan, chenying.kernel, Linux-MM,
	mm-commits, Mike Rapoport, Muchun Song, zhouchengming,
	zhengqi.arch

On Thu, Sep 9, 2021 at 2:57 AM Feng Zhou <zhoufeng.zf@bytedance.com> wrote:
>
> Compared to the read interface, kcore mmap has no increased risk, just
> reduce context switching.

Yes, but the main worry is "do we really need to make this faster and easier"?

Because one of the possible main users is literally the black hat "I
got root, now I want to do a rootkit".

And mmap is very very different from read().

Why? Because using mmap() you can now track changes in realtime (ie
you poll waiting for some memory location to change, possibly even
with hardware assist - like watchpoints or ring3 "monitor/mwait").

So mmap() of the kernel memory literally acts as a prime tool for
looking at and exploiting races.
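
Concretely, the difference from userspace is something like this (an
illustrative sketch; fd, off and len assumed already set up):

	unsigned long val, old = 0;
	volatile unsigned long *p;

	/* read(): one syscall, one context switch, per sample */
	pread(fd, &val, sizeof(val), off);

	/* mmap(): set up once, then watch the kernel location change
	 * in realtime with no kernel entry per sample */
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	while (*p == old)
		;	/* spin until the memory changes */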

Which is why I'm _very_ leery of these kinds of interfaces.

Do they have possible good uses? Yes. But the bad uses seem to
actually dominate. The good users don't seem _that_ critical, while
the bad users would seem to absolutely love this interface.

See my argument?

This is basically a very dangerous interface. The fact that it is
read-only doesn't change that at all.

               Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [External] Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-09 17:32         ` Linus Torvalds
@ 2021-09-09 17:34           ` Linus Torvalds
  -1 siblings, 0 replies; 199+ messages in thread
From: Linus Torvalds @ 2021-09-09 17:34 UTC (permalink / raw)
  To: Feng Zhou
  Cc: Andrew Morton, Alexey Dobriyan, chenying.kernel, Linux-MM,
	mm-commits, Mike Rapoport, Muchun Song, zhouchengming,
	zhengqi.arch

On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> This is basically a very dangerous interface. The fact that it is
> read-only doesn't change that at all.

Just to clarify - we've done dangerous interfaces in the past. We
probably have tons of them still that I haven't even thought about,
and I think I can hear somebody sniggering from miles away about
various bugs that cause much more obvious security holes that I didn't
think of.

But the fact that we might have other holes that can be misused,
doesn't mean we have to add new ones that seem to be almost designed
for misuse.

          Linus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [External] Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-09 17:34           ` Linus Torvalds
  (?)
@ 2021-09-10  3:18           ` Feng Zhou
  -1 siblings, 0 replies; 199+ messages in thread
From: Feng Zhou @ 2021-09-10  3:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andrew Morton, Alexey Dobriyan, chenying.kernel, Linux-MM,
	mm-commits, Mike Rapoport, Muchun Song, zhouchengming,
	zhengqi.arch


On 2021/9/10 1:34 AM, Linus Torvalds wrote:
> On Thu, Sep 9, 2021 at 10:32 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> This is basically a very dangerous interface. The fact that it is
>> read-only doesn't change that at all.
> Just to clarify - we've done dangerous interfaces in the past. We
> probably have tons of them still that I haven't even thought about,
> and I think I can hear somebody sniggering from miles away about
> various bugs that cause much more obvious security holes that I didn't
> think of.
>
> But the fact that we might have other holes that can be misused,
> doesn't mean we have to add new ones that seem to be almost designed
> for misuse.
>
>            Linus

Ok, got it.


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 128/147] init: move usermodehelper_enable() to populate_rootfs()
  2021-09-08 15:44   ` Luis Chamberlain
@ 2021-09-10  8:12     ` Rasmus Villemoes
  2021-09-10 17:47       ` H. Peter Anvin
  2021-09-10 17:51       ` Luis Chamberlain
  0 siblings, 2 replies; 199+ messages in thread
From: Rasmus Villemoes @ 2021-09-10  8:12 UTC (permalink / raw)
  To: Luis Chamberlain, Andrew Morton, Jessica Yu, Borislav Petkov,
	H. Peter Anvin
  Cc: bgoncalv, egorenar, hkallweit1, linux-mm, mm-commits, torvalds

On 08/09/2021 17.44, Luis Chamberlain wrote:
> On Tue, Sep 07, 2021 at 08:00:03PM -0700, Andrew Morton wrote:
>> From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
>> Any call of wait_for_initramfs() done before the unpacking has been
>> scheduled (i.e. before rootfs_initcall time) must just return
>> immediately [and let the caller find an empty file system] in order
>> not to deadlock the machine. I mistakenly thought, and my limited
>> testing confirmed, that there were no such calls, so I added a
>> pr_warn_once() in wait_for_initramfs(). It turns out that one can
>> indeed hit request_module() as well as kobject_uevent_env() during
>> those early init calls, leading to a user-visible warning in the
>> kernel log emitted consistently for certain configurations.
> 
> Further proof that the semantics for init are still loose. Formalizing
> dependencies on init is something we should strive for. Eventually with a
> DAG.  The linker-tables work I had done years ago strove to get us
> there, which allows us to get a simple explicit DAG through the linker.
> Unfortunately that patch set fell through because folks were
> more interested in questioning the alternative side benefits of
> linker-tables, but the use-case for helping with init is still valid.
> 
> If we *do* want to resurrect this folks should let me know.

Heh, a while back I actually had some completely unrelated thing where
I wanted to make use of the linker tables infrastructure - I remembered
reading about it on LWN, and was quite surprised when I learnt that that
work had never made it in. I don't quite remember the use case (I think
it was for some test module infrastructure). But if you do have time to
resurrect those patches, I'd certainly be interested.
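
(As context for the "explicit DAG through the linker" idea above: the
kernel's existing initcall levels already use linker sections for coarse
ordering. A rough sketch with made-up macro names, not the actual
include/linux/init.h definitions:)

    /*
     * Simplified sketch, not the real kernel macros: each initcall
     * level gets its own linker section, so ordering across levels is
     * explicit while ordering within a level is just link order --
     * exactly the looseness being discussed.
     */
    typedef int (*initcall_t)(void);

    #define my_initcall(fn, lvl)                                    \
            static initcall_t __initcall_##fn                       \
            __attribute__((used, section(".myinitcall." #lvl))) = fn

    static int setup_foo(void) { return 0; }
    my_initcall(setup_foo, 4);  /* roughly what subsys_initcall() does */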

> Since the kobject_uevent_env() interest here is for /sbin/hotplug and
> that crap is deprecated, in practice the relevant calls we'd care about
> are the request_module() calls.

Yes - the first report I got about that pr_warn_once was indeed fixed by
the reporter simply disabling CONFIG_UEVENT_HELPER
(https://lore.kernel.org/lkml/9849be80-cfe5-b33e-8224-590a4c451415@gmail.com/).

>> We could just remove the pr_warn_once(), but I think it's better to
>> postpone enabling the usermodehelper framework until there is at least
>> some chance of finding the executable. That is also a little more
>> efficient in that a lot of work done in umh.c will be elided.
> 
> I *don't* think we were aware that such request_module() calls were
> happening before the fs was even ready and failing silently with
> -ENOENT. 

Probably not, no, otherwise somebody would have noticed.

> As such, although moving the usermodehelper_enable()
> to right after scheduling the populating of the rootfs is the right thing,
> we do lose the opportunity to learn who those odd callers
> were. We could not care... but this is also a missed opportunity
> to find them. How important that is, is not clear to me, as
> this was silently failing before...
> 
> If we wanted to keep a print for the above purpose though, we'd likely
> want the full stack trace to see who the hell made the call.

Well, yes, I have myself fallen into that trap not just once, but at
least twice. The first time when I discovered this behaviour on one of
the ppc targets I did this work for in the first place (before I came up
with the CONFIG_MODPROBE_PATH patch). The second when I asked a reporter
to replace the pr_warn_once by WARN_ONCE:

https://lore.kernel.org/lkml/4434f245-db3b-c02a-36c4-0111a0dfb78d@rasmusvillemoes.dk/


The problem is that request_module() just fires off some worker thread
and then the original calling thread sits back and waits for that worker
to return a result.
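
(A minimal sketch of that fire-and-wait shape, with hypothetical names;
not the actual kmod.c/umh.c code:)

    #include <linux/completion.h>
    #include <linux/errno.h>
    #include <linux/workqueue.h>

    struct helper_req {
            struct work_struct work;
            struct completion done;
            int ret;
    };

    static void helper_worker(struct work_struct *w)
    {
            struct helper_req *r = container_of(w, struct helper_req, work);

            r->ret = -ENOENT;       /* e.g. no modprobe binary found yet */
            complete(&r->done);     /* wake the original caller */
    }

    static int request_helper(void)
    {
            struct helper_req r;

            INIT_WORK_ONSTACK(&r.work, helper_worker);
            init_completion(&r.done);
            schedule_work(&r.work);
            /* The calling thread parks here; a backtrace of it tells
             * you nothing about who originally needed the helper. */
            wait_for_completion(&r.done);
            destroy_work_on_stack(&r.work);
            return r.ret;
    }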


>> However,
>> it does change the error seen by those early callers from -ENOENT to
>> -EBUSY, so there is a risk of a regression if any caller cares about
>> the exact error value.
> 
> I'd see this as a welcome evolution as it tells us more: we're saying
> "it's coming, try again" or whatever.

Indeed, and I don't think it's the end of the world if somebody notices
some change due to that, because we'd learn more about where those early
request_module() calls come from.
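
(For reference, a hedged sketch of roughly what the patch under
discussion does; simplified, not the literal diff:)

    /*
     * usermodehelper_enable() moves out of the early boot path
     * (do_basic_setup()) and into populate_rootfs(), right after the
     * initramfs unpacking has been scheduled, so a helper is only
     * attempted once there is at least a chance of finding the
     * executable. Callers racing with this now see -EBUSY instead of
     * silently getting -ENOENT.
     */
    static int __init populate_rootfs(void)
    {
            initramfs_cookie = async_schedule_domain(do_populate_rootfs,
                                                     NULL, &initramfs_domain);
            usermodehelper_enable();        /* was in do_basic_setup() */
            return 0;
    }
    rootfs_initcall(populate_rootfs);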

> A debug option to allow us to get a full warning trace in the -EBUSY
> case on early init would be nice to have.

As noted above, that's difficult. We'd need a way to know which other
task is waiting for us, then print the trace of that guy.

I don't think anybody is gonna hear this tree falling, so let's not try
to solve a problem before we know there is one.

Rasmus

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-08  2:57 ` [patch 079/147] fs/proc/kcore.c: add mmap interface Andrew Morton
  2021-09-08 18:13     ` Linus Torvalds
@ 2021-09-10 10:08   ` David Hildenbrand
  2021-09-10 12:00     ` Mike Rapoport
  1 sibling, 1 reply; 199+ messages in thread
From: David Hildenbrand @ 2021-09-10 10:08 UTC (permalink / raw)
  To: Andrew Morton, adobriyan, chenying.kernel, linux-mm, mm-commits,
	rppt, songmuchun, torvalds, zhouchengming, zhoufeng.zf

On 08.09.21 04:57, Andrew Morton wrote:
> From: Feng Zhou <zhoufeng.zf@bytedance.com>
> Subject: fs/proc/kcore.c: add mmap interface
> 
> When we monitor the kernel and use DRGN
> (https://github.com/osandov/drgn) to access kernel data structures, we
> found that it makes a very large number of system calls.  DRGN is
> implemented by reading /proc/kcore.  After looking at the kcore code, we
> found that kcore does not implement mmap, so every access goes through
> read() and triggers frequent context switches.  Therefore, we want to add
> an mmap interface to optimize performance.  Since the vmalloc and module
> areas change with allocation and release, consistency cannot be
> guaranteed, so the mmap interface only maps KCORE_TEXT and KCORE_RAM.
> 
> The test results:
> 1. the default version of kcore
> real 11.00
> user 8.53
> sys 3.59
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 99.64  128.578319          12  11168701           pread64
> ...
> ------ ----------- ----------- --------- --------- ----------------
> 100.00  129.042853              11193748       966 total
> 
> 2. kcore with the mmap interface added
> real 6.44
> user 7.32
> sys 0.24
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 32.94    0.130120          24      5317       315 futex
> 11.66    0.046077          21      2231         1 lstat
>   9.23    0.036449         177       206           mmap
> ...
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.395077                 25435       971 total
> 
> The test results show that the number of system calls and time
> consumption are significantly reduced.
> 
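
(To make the access pattern concrete, a minimal userspace sketch;
illustrative only and not part of the patch. A real tool such as DRGN
would derive segment offsets from the ELF program headers of /proc/kcore,
and the mmap offset must be page-aligned:)

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int read_kcore_mapped(off_t seg_off, size_t seg_len,
                                 size_t obj_off, void *buf, size_t n)
    {
            void *base;
            int fd = open("/proc/kcore", O_RDONLY);

            if (fd < 0)
                    return -1;
            /* One mmap() per segment (seg_off must be page-aligned)... */
            base = mmap(NULL, seg_len, PROT_READ, MAP_SHARED, fd, seg_off);
            close(fd);
            if (base == MAP_FAILED)
                    return -1;
            /* ...then plain loads replace one pread64() per structure. */
            memcpy(buf, (char *)base + obj_off, n);
            munmap(base, seg_len);
            return 0;
    }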
> Link: https://lkml.kernel.org/r/20210704062208.7898-1-zhoufeng.zf@bytedance.com
> Co-developed-by: Ying Chen <chenying.kernel@bytedance.com>
> Signed-off-by: Ying Chen <chenying.kernel@bytedance.com>
> Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
> Cc: Alexey Dobriyan <adobriyan@gmail.com>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Muchun Song <songmuchun@bytedance.com>
> Cc: Chengming Zhou <zhouchengming@bytedance.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> 
>   fs/proc/kcore.c |   73 ++++++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 73 insertions(+)
> 
> --- a/fs/proc/kcore.c~fs-proc-kcorec-add-mmap-interface
> +++ a/fs/proc/kcore.c
> @@ -614,11 +614,84 @@ static int release_kcore(struct inode *i
>   	return 0;
>   }
>   
> +static vm_fault_t mmap_kcore_fault(struct vm_fault *vmf)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +
> +static const struct vm_operations_struct kcore_mmap_ops = {
> +	.fault = mmap_kcore_fault,
> +};
> +
> +static int mmap_kcore(struct file *file, struct vm_area_struct *vma)
> +{
> +	size_t size = vma->vm_end - vma->vm_start;
> +	u64 start, end, pfn;
> +	int nphdr;
> +	size_t data_offset;
> +	size_t phdrs_len, notes_len;
> +	struct kcore_list *m = NULL;
> +	int ret = 0;
> +
> +	down_read(&kclist_lock);
> +
> +	get_kcore_size(&nphdr, &phdrs_len, &notes_len, &data_offset);
> +
> +	data_offset &= PAGE_MASK;
> +	start = (u64)vma->vm_pgoff << PAGE_SHIFT;
> +	if (start < data_offset) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	start = kc_offset_to_vaddr(start - data_offset);
> +	end   = start + size;
> +
> +	list_for_each_entry(m, &kclist_head, list) {
> +		if (start >= m->addr && end <= m->addr + m->size)
> +			break;
> +	}
> +
> +	if (&m->list == &kclist_head) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
> +		ret = -EPERM;
> +		goto out;
> +	}
> +
> +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> +	vma->vm_flags |= VM_MIXEDMAP;
> +	vma->vm_ops = &kcore_mmap_ops;
> +

This breaks all my efforts to sanitize /proc/kcore access for virtio-mem.

Is there still a way to nack this?

Sorry I didn't spot this any sooner.


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-10 10:08   ` David Hildenbrand
@ 2021-09-10 12:00     ` Mike Rapoport
  2021-09-10 12:02       ` David Hildenbrand
  0 siblings, 1 reply; 199+ messages in thread
From: Mike Rapoport @ 2021-09-10 12:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, adobriyan, chenying.kernel, linux-mm, mm-commits,
	songmuchun, torvalds, zhouchengming, zhoufeng.zf

On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote:
> On 08.09.21 04:57, Andrew Morton wrote:
> > +
> > +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
> > +		ret = -EPERM;
> > +		goto out;
> > +	}
> > +
> > +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> > +	vma->vm_flags |= VM_MIXEDMAP;
> > +	vma->vm_ops = &kcore_mmap_ops;
> > +
> 
> This breaks all my efforts to sanitize /proc/kcore access for virtio-mem.
> 
> Is there still a way to nack this?

Already done:

https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com

and down the same thread.
 
> Sorry I didn't spot this any sooner.
> 
> -- 
> Thanks,
> 
> David / dhildenb
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 079/147] fs/proc/kcore.c: add mmap interface
  2021-09-10 12:00     ` Mike Rapoport
@ 2021-09-10 12:02       ` David Hildenbrand
  0 siblings, 0 replies; 199+ messages in thread
From: David Hildenbrand @ 2021-09-10 12:02 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, adobriyan, chenying.kernel, linux-mm, mm-commits,
	songmuchun, torvalds, zhouchengming, zhoufeng.zf

On 10.09.21 14:00, Mike Rapoport wrote:
> On Fri, Sep 10, 2021 at 12:08:17PM +0200, David Hildenbrand wrote:
>> On 08.09.21 04:57, Andrew Morton wrote:
>>> +
>>> +	if (vma->vm_flags & (VM_WRITE | VM_EXEC)) {
>>> +		ret = -EPERM;
>>> +		goto out;
>>> +	}
>>> +
>>> +	vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
>>> +	vma->vm_flags |= VM_MIXEDMAP;
>>> +	vma->vm_ops = &kcore_mmap_ops;
>>> +
>>
>> This breaks all my efforts to sanitize /proc/kcore access for virtio-mem.
>>
>> Is there still a way to nack this?
> 
> Already done:
> 
> https://lore.kernel.org/mm-commits/CAHk-=wgQ+8kmczLLKCY7yDsGHQBRcZESKd1dNaKbrjUgbWeb3A@mail.gmail.com
> 
> and down the same thread.
>   

Yeah, spotted Linus' reply just after I sent my reply.

... afterwards I thought about the implications for secretmem and 
ordinary memory hotunplug and was happy that we dodged this bullet. :)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 128/147] init: move usermodehelper_enable() to populate_rootfs()
  2021-09-10  8:12     ` Rasmus Villemoes
@ 2021-09-10 17:47       ` H. Peter Anvin
  2021-09-10 17:51       ` Luis Chamberlain
  1 sibling, 0 replies; 199+ messages in thread
From: H. Peter Anvin @ 2021-09-10 17:47 UTC (permalink / raw)
  To: Rasmus Villemoes, Luis Chamberlain, Andrew Morton, Jessica Yu,
	Borislav Petkov
  Cc: bgoncalv, egorenar, hkallweit1, linux-mm, mm-commits, torvalds

I feel there is a general problem with the way infrastructure improvements are dealt with in the kernel: basically, it feels like the submitter of new infrastructure is expected to convert existing users, or is told "what we have now works." This is a classic way of building tech debt.

Maybe this would be a topic worth discussing?

On September 10, 2021 1:12:01 AM PDT, Rasmus Villemoes <linux@rasmusvillemoes.dk> wrote:
>On 08/09/2021 17.44, Luis Chamberlain wrote:
>> On Tue, Sep 07, 2021 at 08:00:03PM -0700, Andrew Morton wrote:
>>> From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
>>> Any call of wait_for_initramfs() done before the unpacking has been
>>> scheduled (i.e. before rootfs_initcall time) must just return
>>> immediately [and let the caller find an empty file system] in order
>>> not to deadlock the machine. I mistakenly thought, and my limited
>>> testing confirmed, that there were no such calls, so I added a
>>> pr_warn_once() in wait_for_initramfs(). It turns out that one can
>>> indeed hit request_module() as well as kobject_uevent_env() during
>>> those early init calls, leading to a user-visible warning in the
>>> kernel log emitted consistently for certain configurations.
>> 
>> Further proof that the semantics for init are still loose. Formalizing
>> dependencies on init is something we should strive for, eventually with a
>> DAG.  The linker-tables work I had done years ago strove to get us
>> there; it allows us to get a simple explicit DAG through the linker.
>> Unfortunately that patch set fell through because folks were
>> more interested in questioning the alternative side benefits of
>> linker-tables, but the use-case for helping with init is still valid.
>> 
>> If we *do* want to resurrect this folks should let me know.
>
>Heh, a while back I actually had some completely unrelated thing where
>I'd want to make use of the linker tables infrastructure - I remembered
>reading about it on LWN, and was quite surprised when I learnt that that
>work had never made it in. I don't quite remember the use case (I think
>it was for some test module infrastructure). But if you do have time to
>resurrect those patches, I'd certainly be interested.
>
>> Since the kobject_uevent_env() interest here is for /sbin/hotplug and
>> that crap is deprecated, in practice the relevant calls we'd care about
>> are the request_module() calls.
>
>Yes - the first report I got about that pr_warn_once was indeed fixed by
>the reporter simply disabling CONFIG_UEVENT_HELPER
>(https://lore.kernel.org/lkml/9849be80-cfe5-b33e-8224-590a4c451415@gmail.com/).
>
>>> We could just remove the pr_warn_once(), but I think it's better to
>>> postpone enabling the usermodehelper framework until there is at least
>>> some chance of finding the executable. That is also a little more
>>> efficient in that a lot of work done in umh.c will be elided.
>> 
>> I *don't* think we were aware that such request_module() calls were
>> happening before the fs was even ready and failing silently with
>> -ENOENT. 
>
>Probably not, no, otherwise somebody would have noticed.
>
>> As such, although moving the usermodehelper_enable()
>> to right after scheduling the populating of the rootfs is the right thing,
>> we do lose the opportunity to learn who those odd callers
>> were. We could not care... but this is also a missed opportunity
>> to find them. How important that is, is not clear to me, as
>> this was silently failing before...
>> 
>> If we wanted to keep a print for the above purpose though, we'd likely
>> want the full stack trace to see who the hell made the call.
>
>Well, yes, I have myself fallen into that trap not just once, but at
>least twice. The first time when I discovered this behaviour on one of
>the ppc targets I did this work for in the first place (before I came up
>with the CONFIG_MODPROBE_PATH patch). The second when I asked a reporter
>to replace the pr_warn_once by WARN_ONCE:
>
>https://lore.kernel.org/lkml/4434f245-db3b-c02a-36c4-0111a0dfb78d@rasmusvillemoes.dk/
>
>
>The problem is that request_module() just fires off some worker thread
>and then the original calling thread sits back and waits for that worker
>to return a result.
>
>
>>> However,
>>> it does change the error seen by those early callers from -ENOENT to
>>> -EBUSY, so there is a risk of a regression if any caller cares about
>>> the exact error value.
>> 
>> I'd see this as a welcome evolution as it tells us more: we're saying
>> "it's coming, try again" or whatever.
>
>Indeed, and I don't think it's the end of the world if somebody notices
>some change due to that, because we'd learn more about where those early
>request_module() calls come from.
>
>> A debug option to allow us to get a full warning trace in the -EBUSY
>> case on early init would be nice to have.
>
>As noted above, that's difficult. We'd need a way to know which other
>task is waiting for us, then print the trace of that guy.
>
>I don't think anybody is gonna hear this tree falling, so let's not try
>to solve a problem before we know there is one.
>
>Rasmus

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 128/147] init: move usermodehelper_enable() to populate_rootfs()
  2021-09-10  8:12     ` Rasmus Villemoes
  2021-09-10 17:47       ` H. Peter Anvin
@ 2021-09-10 17:51       ` Luis Chamberlain
  1 sibling, 0 replies; 199+ messages in thread
From: Luis Chamberlain @ 2021-09-10 17:51 UTC (permalink / raw)
  To: Rasmus Villemoes
  Cc: Andrew Morton, Jessica Yu, Borislav Petkov, H. Peter Anvin,
	bgoncalv, egorenar, hkallweit1, linux-mm, mm-commits, torvalds

On Fri, Sep 10, 2021 at 10:12:01AM +0200, Rasmus Villemoes wrote:
> On 08/09/2021 17.44, Luis Chamberlain wrote:
> > On Tue, Sep 07, 2021 at 08:00:03PM -0700, Andrew Morton wrote:
> >> From: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> >> Any call of wait_for_initramfs() done before the unpacking has been
> >> scheduled (i.e. before rootfs_initcall time) must just return
> >> immediately [and let the caller find an empty file system] in order
> >> not to deadlock the machine. I mistakenly thought, and my limited
> >> testing confirmed, that there were no such calls, so I added a
> >> pr_warn_once() in wait_for_initramfs(). It turns out that one can
> >> indeed hit request_module() as well as kobject_uevent_env() during
> >> those early init calls, leading to a user-visible warning in the
> >> kernel log emitted consistently for certain configurations.
> > 
> > Further proof that the semantics for init are still loose. Formalizing
> > dependencies on init is something we should strive for, eventually with a
> > DAG.  The linker-tables work I had done years ago strove to get us
> > there; it allows us to get a simple explicit DAG through the linker.
> > Unfortunately that patch set fell through because folks were
> > more interested in questioning the alternative side benefits of
> > linker-tables, but the use-case for helping with init is still valid.
> > 
> > If we *do* want to resurrect this folks should let me know.
> 
> Heh, a while back I actually had some completely unrelated thing where
> I'd want to make use of the linker tables infrastructure - I remembered
> reading about it on LWN, and was quite surprised when I learnt that that
> work had never made it in. I don't quite remember the use case (I think
> it was for some test module infrastructure). But if you do have time to
> resurrect those patches, I'd certainly be interested.

OK I might.

> > Since the kobject_uevent_env() interest here is for /sbin/hotplug and
> > that crap is deprecated, in practice the relevant calls we'd care about
> > are the request_module() calls.
> 
> Yes - the first report I got about that pr_warn_once was indeed fixed by
> the reporter simply disabling CONFIG_UEVENT_HELPER
> (https://lore.kernel.org/lkml/9849be80-cfe5-b33e-8224-590a4c451415@gmail.com/).

Ah I see.

> >> We could just remove the pr_warn_once(), but I think it's better to
> >> postpone enabling the usermodehelper framework until there is at least
> >> some chance of finding the executable. That is also a little more
> >> efficient in that a lot of work done in umh.c will be elided.
> > 
> > I *don't* think we were aware that such request_module() calls were
> > happening before the fs was even ready and failing silently with
> > -ENOENT. 
> 
> Probably not, no, otherwise somebody would have noticed.

OK, your commit log was not clear on this; it seemed to suggest this as
a possibility or that such a case existed. That also means the impact of
your change is smaller.

> >> However,
> >> it does change the error seen by those early callers from -ENOENT to
> >> -EBUSY, so there is a risk of a regression if any caller cares about
> >> the exact error value.
> > 
> > I'd see this as a welcome evolution as it tells us more: we're saying
> > "it's coming, try again" or whatever.
> 
> Indeed, and I don't think it's the end of the world if somebody notices
> some change due to that, because we'd learn more about where those early
> request_module() calls come from.

But since it seems none have been reported yet, the situation is even
better.

> > A debug option to allow us to get a full warning trace in the -EBUSY
> > case on early init would be nice to have.
> 
> As noted above, that's difficult. We'd need a way to know which other
> task is waiting for us, then print the trace of that guy.
> 
> I don't think anybody is gonna hear this tree falling, so let's not try
> to solve a problem before we know there is one.

That's fair. But let's also recall that neither of us expected the above
situation either. I agree, though, that the possible collateral damage at
this point seems to be small, if any.

 Luis

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF
  2021-09-08  3:00 ` [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF Andrew Morton
@ 2021-09-24 10:35   ` Pavel Machek
  2021-09-24 11:09       ` Ryusuke Konishi
  2021-09-24 12:12   ` Matthew Wilcox
  1 sibling, 1 reply; 199+ messages in thread
From: Pavel Machek @ 2021-09-24 10:35 UTC (permalink / raw)
  To: linux-kernel, stable
  Cc: akpm, konishi.ryusuke, linux-mm, mm-commits, thunder.leizhen, torvalds

Hi!

> From: Zhen Lei <thunder.leizhen@huawei.com>
> Subject: nilfs2: use refcount_dec_and_lock() to fix potential UAF
> 
> When the refcount is decreased to 0, the resource reclamation branch is
> entered.  Before CPU0 reaches the race point (1), CPU1 may obtain the
> spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root(). 
> Although CPU1 will call refcount_inc() to increase the refcount, it is
> obviously too late.  CPU0 will release 'root' directly, CPU1 then accesses
> 'root' and triggers UAF.
> 
> Using refcount_dec_and_lock() to ensure that both the decrement of the
> refcount to 0 and the link deletion are lock-protected eliminates this risk.
> 
>      CPU0                      CPU1
> nilfs_put_root():
> 			    <-------- (1)
> spin_lock(&nilfs->ns_cptree_lock);
> rb_erase(&root->rb_node, &nilfs->ns_cptree);
> spin_unlock(&nilfs->ns_cptree_lock);
> 
> kfree(root);
> 			    <-------- use-after-free

> There is no reproduction program, and the above is only theoretical
> analysis.

Ok, so we have a theoretical bug, and a fix already on its way to
stable. But ... is it correct?

> +++ a/fs/nilfs2/the_nilfs.c
> @@ -792,14 +792,13 @@ nilfs_find_or_create_root(struct the_nil
>  
>  void nilfs_put_root(struct nilfs_root *root)
>  {
> -	if (refcount_dec_and_test(&root->count)) {
> -		struct the_nilfs *nilfs = root->nilfs;
> +	struct the_nilfs *nilfs = root->nilfs;
>  
> -		nilfs_sysfs_delete_snapshot_group(root);
> -
> -		spin_lock(&nilfs->ns_cptree_lock);
> +	if (refcount_dec_and_lock(&root->count, &nilfs->ns_cptree_lock)) {
>  		rb_erase(&root->rb_node, &nilfs->ns_cptree);
>  		spin_unlock(&nilfs->ns_cptree_lock);
> +
> +		nilfs_sysfs_delete_snapshot_group(root);
>  		iput(root->ifile);
>  
>  		kfree(root);

spin_lock() is deleted, but spin_unlock() is not affected. This means
unbalanced locking, right?

Best regards,
								Pavel
--
DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF
  2021-09-24 10:35   ` Pavel Machek
@ 2021-09-24 11:09       ` Ryusuke Konishi
  0 siblings, 0 replies; 199+ messages in thread
From: Ryusuke Konishi @ 2021-09-24 11:09 UTC (permalink / raw)
  To: Pavel Machek
  Cc: LKML, stable, Andrew Morton, linux-mm, mm-commits, Zhen Lei,
	Linus Torvalds

Hi,

On Fri, Sep 24, 2021 at 7:35 PM Pavel Machek <pavel@denx.de> wrote:
>
> Hi!
>
> > From: Zhen Lei <thunder.leizhen@huawei.com>
> > Subject: nilfs2: use refcount_dec_and_lock() to fix potential UAF
> >
> > When the refcount is decreased to 0, the resource reclamation branch is
> > entered.  Before CPU0 reaches the race point (1), CPU1 may obtain the
> > spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root().
> > Although CPU1 will call refcount_inc() to increase the refcount, it is
> > obviously too late.  CPU0 will release 'root' directly, CPU1 then accesses
> > 'root' and triggers UAF.
> >
> > Using refcount_dec_and_lock() to ensure that both the decrement of the
> > refcount to 0 and the link deletion are lock-protected eliminates this risk.
> >
> >      CPU0                      CPU1
> > nilfs_put_root():
> >                           <-------- (1)
> > spin_lock(&nilfs->ns_cptree_lock);
> > rb_erase(&root->rb_node, &nilfs->ns_cptree);
> > spin_unlock(&nilfs->ns_cptree_lock);
> >
> > kfree(root);
> >                           <-------- use-after-free
>
> > There is no reproduction program, and the above is only theoretical
> > analysis.
>
> Ok, so we have a theoretical bug, and a fix already on its way to
> stable. But ... is it correct?
>
> > +++ a/fs/nilfs2/the_nilfs.c
> > @@ -792,14 +792,13 @@ nilfs_find_or_create_root(struct the_nil
> >
> >  void nilfs_put_root(struct nilfs_root *root)
> >  {
> > -     if (refcount_dec_and_test(&root->count)) {
> > -             struct the_nilfs *nilfs = root->nilfs;
> > +     struct the_nilfs *nilfs = root->nilfs;
> >
> > -             nilfs_sysfs_delete_snapshot_group(root);
> > -
> > -             spin_lock(&nilfs->ns_cptree_lock);
> > +     if (refcount_dec_and_lock(&root->count, &nilfs->ns_cptree_lock)) {
> >               rb_erase(&root->rb_node, &nilfs->ns_cptree);
> >               spin_unlock(&nilfs->ns_cptree_lock);
> > +
> > +             nilfs_sysfs_delete_snapshot_group(root);
> >               iput(root->ifile);
> >
> >               kfree(root);
>
> spin_lock() is deleted, but spin_unlock() is not affected. This means
> unbalanced locking, right?

It's okay.  The spin_lock() is integrated into refcount_dec_and_lock(),
which replaced the original refcount_dec_and_test().
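A minimal sketch of the pattern, with hypothetical names (not the
actual nilfs2 code):

    #include <linux/rbtree.h>
    #include <linux/refcount.h>
    #include <linux/slab.h>
    #include <linux/spinlock.h>

    struct obj {                    /* hypothetical object */
            struct rb_node rb_node;
            refcount_t count;
    };

    static struct rb_root tree_root = RB_ROOT;
    static DEFINE_SPINLOCK(tree_lock);

    static void put_obj(struct obj *o)
    {
            /* The lock is taken only when the count drops to zero, so
             * the "became zero" decision and the rbtree unlink form
             * one critical section with respect to lookups done under
             * tree_lock. */
            if (refcount_dec_and_lock(&o->count, &tree_lock)) {
                    rb_erase(&o->rb_node, &tree_root);
                    spin_unlock(&tree_lock);
                    kfree(o);
            }
            /* Non-final puts never touch the lock at all. */
    }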

Thanks,
Ryusuke Konishi

>
> Best regards,
>                                                                 Pavel
> --
> DENX Software Engineering GmbH,      Managing Director: Wolfgang Denk
> HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
>

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF
  2021-09-08  3:00 ` [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF Andrew Morton
  2021-09-24 10:35   ` Pavel Machek
@ 2021-09-24 12:12   ` Matthew Wilcox
  2021-09-24 15:09       ` Ryusuke Konishi
  1 sibling, 1 reply; 199+ messages in thread
From: Matthew Wilcox @ 2021-09-24 12:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: konishi.ryusuke, linux-mm, mm-commits, thunder.leizhen, torvalds

On Tue, Sep 07, 2021 at 08:00:26PM -0700, Andrew Morton wrote:
> From: Zhen Lei <thunder.leizhen@huawei.com>
> Subject: nilfs2: use refcount_dec_and_lock() to fix potential UAF
> 
> When the refcount is decreased to 0, the resource reclamation branch is
> entered.  Before CPU0 reaches the race point (1), CPU1 may obtain the
> spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root(). 
> Although CPU1 will call refcount_inc() to increase the refcount, it is
> obviously too late.  CPU0 will release 'root' directly, CPU1 then accesses
> 'root' and triggers UAF.
> 
> Using refcount_dec_and_lock() to ensure that both the decrement of the
> refcount to 0 and the link deletion are lock-protected eliminates this risk.
> 
>      CPU0                      CPU1
> nilfs_put_root():
> 			    <-------- (1)
> spin_lock(&nilfs->ns_cptree_lock);
> rb_erase(&root->rb_node, &nilfs->ns_cptree);
> spin_unlock(&nilfs->ns_cptree_lock);
> 
> kfree(root);
> 			    <-------- use-after-free

I don't know where this happened, but the leading whitespace has been
eaten at some point, making this description of the race completely
unreadable as everything appears to be done by CPU 0.

^ permalink raw reply	[flat|nested] 199+ messages in thread

* Re: [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF
  2021-09-24 12:12   ` Matthew Wilcox
@ 2021-09-24 15:09       ` Ryusuke Konishi
  0 siblings, 0 replies; 199+ messages in thread
From: Ryusuke Konishi @ 2021-09-24 15:09 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, linux-mm, mm-commits, Zhen Lei, Linus Torvalds

Hi,

On Fri, Sep 24, 2021 at 9:13 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Sep 07, 2021 at 08:00:26PM -0700, Andrew Morton wrote:
> > From: Zhen Lei <thunder.leizhen@huawei.com>
> > Subject: nilfs2: use refcount_dec_and_lock() to fix potential UAF
> >
> > When the refcount is decreased to 0, the resource reclamation branch is
> > entered.  Before CPU0 reaches the race point (1), CPU1 may obtain the
> > spinlock and traverse the rbtree to find 'root', see nilfs_lookup_root().
> > Although CPU1 will call refcount_inc() to increase the refcount, it is
> > obviously too late.  CPU0 will release 'root' directly, CPU1 then accesses
> > 'root' and triggers UAF.
> >
> > Using refcount_dec_and_lock() to ensure that both the decrement of the
> > refcount to 0 and the link deletion are lock-protected eliminates this risk.
> >
> >      CPU0                      CPU1
> > nilfs_put_root():
> >                           <-------- (1)
> > spin_lock(&nilfs->ns_cptree_lock);
> > rb_erase(&root->rb_node, &nilfs->ns_cptree);
> > spin_unlock(&nilfs->ns_cptree_lock);
> >
> > kfree(root);
> >                           <-------- use-after-free
>
> I don't know where this happened, but the leading whitespace has been
> eaten at some point, making this description of the race completely
> unreadable as everything appears to be done by CPU 0.

The diagram is the same as the one in the author's original patch, and
I approved it without any discomfort because these five operations
(nilfs_put_root() ~ spin_lock(); rb_erase(); spin_unlock() ~ kfree()) are
all done by CPU0.

But, yeah, an example function call could have been written on the
CPU1 side as well, for instance a nilfs_lookup_root() call, to clarify
the race the message explains, along the lines of the sketch below.
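
(An illustrative reconstruction with the CPU1 side filled in; the CPU1
column is an assumption based on nilfs_lookup_root(), not text from the
original patch:)

         CPU0                                CPU1
    nilfs_put_root():
      refcount_dec_and_test()
        /* count reaches 0 */          nilfs_lookup_root():
                                         spin_lock(&nilfs->ns_cptree_lock);
                                         /* finds 'root' in the rbtree */
                                         refcount_inc(&root->count); /* too late */
                                         spin_unlock(&nilfs->ns_cptree_lock);
      spin_lock(&nilfs->ns_cptree_lock);
      rb_erase(&root->rb_node, &nilfs->ns_cptree);
      spin_unlock(&nilfs->ns_cptree_lock);
      kfree(root);
                                         /* CPU1 dereferences 'root': UAF */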

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 199+ messages in thread

end of thread, other threads:[~2021-09-24 15:10 UTC | newest]

Thread overview: 199+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-08  2:52 incoming Andrew Morton
2021-09-08  2:52 ` [patch 001/147] mm, slub: don't call flush_all() from slab_debug_trace_open() Andrew Morton
2021-09-08  2:53 ` [patch 002/147] mm, slub: allocate private object map for debugfs listings Andrew Morton
2021-09-08  2:53 ` [patch 003/147] mm, slub: allocate private object map for validate_slab_cache() Andrew Morton
2021-09-08  2:53 ` [patch 004/147] mm, slub: don't disable irq for debug_check_no_locks_freed() Andrew Morton
2021-09-08  2:53 ` [patch 005/147] mm, slub: remove redundant unfreeze_partials() from put_cpu_partial() Andrew Morton
2021-09-08  2:53 ` [patch 006/147] mm, slub: extract get_partial() from new_slab_objects() Andrew Morton
2021-09-08  2:53 ` [patch 007/147] mm, slub: dissolve new_slab_objects() into ___slab_alloc() Andrew Morton
2021-09-08  2:53 ` [patch 008/147] mm, slub: return slab page from get_partial() and set c->page afterwards Andrew Morton
2021-09-08  2:53 ` [patch 009/147] mm, slub: restructure new page checks in ___slab_alloc() Andrew Morton
2021-09-08  2:53 ` [patch 010/147] mm, slub: simplify kmem_cache_cpu and tid setup Andrew Morton
2021-09-08  2:53 ` [patch 011/147] mm, slub: move disabling/enabling irqs to ___slab_alloc() Andrew Morton
2021-09-08  2:53 ` [patch 012/147] mm, slub: do initial checks in ___slab_alloc() with irqs enabled Andrew Morton
2021-09-08  2:53 ` [patch 013/147] mm, slub: move disabling irqs closer to get_partial() in ___slab_alloc() Andrew Morton
2021-09-08  2:53 ` [patch 014/147] mm, slub: restore irqs around calling new_slab() Andrew Morton
2021-09-08  2:53 ` [patch 015/147] mm, slub: validate slab from partial list or page allocator before making it cpu slab Andrew Morton
2021-09-08  2:53 ` [patch 016/147] mm, slub: check new pages with restored irqs Andrew Morton
2021-09-08  2:53 ` [patch 017/147] mm, slub: stop disabling irqs around get_partial() Andrew Morton
2021-09-08  2:53 ` [patch 018/147] mm, slub: move reset of c->page and freelist out of deactivate_slab() Andrew Morton
2021-09-08  2:53 ` [patch 019/147] mm, slub: make locking in deactivate_slab() irq-safe Andrew Morton
2021-09-08  2:54 ` [patch 020/147] mm, slub: call deactivate_slab() without disabling irqs Andrew Morton
2021-09-08  2:54 ` [patch 021/147] mm, slub: move irq control into unfreeze_partials() Andrew Morton
2021-09-08  2:54 ` [patch 022/147] mm, slub: discard slabs in unfreeze_partials() without irqs disabled Andrew Morton
2021-09-08  2:54 ` [patch 023/147] mm, slub: detach whole partial list at once in unfreeze_partials() Andrew Morton
2021-09-08  2:54 ` [patch 024/147] mm, slub: separate detaching of partial list in unfreeze_partials() from unfreezing Andrew Morton
2021-09-08  2:54 ` [patch 025/147] mm, slub: only disable irq with spin_lock in __unfreeze_partials() Andrew Morton
2021-09-08  2:54 ` [patch 026/147] mm, slub: don't disable irqs in slub_cpu_dead() Andrew Morton
2021-09-08  2:54 ` [patch 027/147] mm, slab: split out the cpu offline variant of flush_slab() Andrew Morton
2021-09-08  2:54 ` [patch 028/147] mm: slub: move flush_cpu_slab() invocations __free_slab() invocations out of IRQ context Andrew Morton
2021-09-08  2:54 ` [patch 029/147] mm: slub: make object_map_lock a raw_spinlock_t Andrew Morton
2021-09-08  2:54 ` [patch 030/147] mm, slub: make slab_lock() disable irqs with PREEMPT_RT Andrew Morton
2021-09-08  2:54 ` [patch 031/147] mm, slub: protect put_cpu_partial() with disabled irqs instead of cmpxchg Andrew Morton
2021-09-08 13:05   ` Jesper Dangaard Brouer
2021-09-08 13:58     ` Vlastimil Babka
2021-09-08 14:55       ` David Hildenbrand
2021-09-08 14:59         ` David Hildenbrand
2021-09-08 17:14           ` Jesper Dangaard Brouer
2021-09-08 17:24             ` David Hildenbrand
2021-09-08 16:11       ` Jesper Dangaard Brouer
2021-09-08 16:31         ` Linus Torvalds
2021-09-08 16:31           ` Linus Torvalds
2021-09-08  2:54 ` [patch 032/147] mm, slub: use migrate_disable() on PREEMPT_RT Andrew Morton
2021-09-08  2:54 ` [patch 033/147] mm, slub: convert kmem_cpu_slab protection to local_lock Andrew Morton
2021-09-08  2:54 ` [patch 034/147] memory-hotplug.rst: remove locking details from admin-guide Andrew Morton
2021-09-08  2:54 ` [patch 035/147] memory-hotplug.rst: complete admin-guide overhaul Andrew Morton
2021-09-08  2:54 ` [patch 036/147] mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE Andrew Morton
2021-09-08  2:54 ` [patch 037/147] mm: memory_hotplug: cleanup after removal of pfn_valid_within() Andrew Morton
2021-09-08  2:54 ` [patch 038/147] mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range() Andrew Morton
2021-09-08  2:55 ` [patch 039/147] mm/memory_hotplug: remove nid parameter from arch_remove_memory() Andrew Morton
2021-09-08  2:55 ` [patch 040/147] mm/memory_hotplug: remove nid parameter from remove_memory() and friends Andrew Morton
2021-09-08  2:55 ` [patch 041/147] ACPI: memhotplug: memory resources cannot be enabled yet Andrew Morton
2021-09-08  2:55 ` [patch 042/147] mm: track present early pages per zone Andrew Morton
2021-09-08  2:55 ` [patch 043/147] mm/memory_hotplug: introduce "auto-movable" online policy Andrew Morton
2021-09-08  2:55 ` [patch 044/147] drivers/base/memory: introduce "memory groups" to logically group memory blocks Andrew Morton
2021-09-08  2:55 ` [patch 045/147] mm/memory_hotplug: track present pages in memory groups Andrew Morton
2021-09-08  2:55 ` [patch 046/147] ACPI: memhotplug: use a single static memory group for a single memory device Andrew Morton
2021-09-08  2:55 ` [patch 047/147] dax/kmem: use a single static memory group for a single probed unit Andrew Morton
2021-09-08  2:55 ` [patch 048/147] virtio-mem: use a single dynamic memory group for a single virtio-mem device Andrew Morton
2021-09-08  2:55 ` [patch 049/147] mm/memory_hotplug: memory group aware "auto-movable" online policy Andrew Morton
2021-09-08  2:55 ` [patch 050/147] mm/memory_hotplug: improved dynamic " Andrew Morton
2021-09-08  2:55 ` [patch 051/147] mm/memory_hotplug: use helper zone_is_zone_device() to simplify the code Andrew Morton
2021-09-08  2:55 ` [patch 052/147] mm: remove redundant compound_head() calling Andrew Morton
2021-09-08  2:55 ` [patch 053/147] riscv: only select GENERIC_IOREMAP if MMU support is enabled Andrew Morton
2021-09-08  2:56 ` [patch 054/147] mm: move ioremap_page_range to vmalloc.c Andrew Morton
2021-09-08  2:56 ` [patch 055/147] mm: don't allow executable ioremap mappings Andrew Morton
2021-09-08  2:56 ` [patch 056/147] mm/early_ioremap.c: remove redundant early_ioremap_shutdown() Andrew Morton
2021-09-08  2:56 ` [patch 057/147] highmem: don't disable preemption on RT in kmap_atomic() Andrew Morton
2021-09-08  2:56 ` [patch 058/147] mm: in_irq() cleanup Andrew Morton
2021-09-08  2:56 ` [patch 059/147] mm: introduce PAGEFLAGS_MASK to replace ((1UL << NR_PAGEFLAGS) - 1) Andrew Morton
2021-09-08  2:56 ` [patch 060/147] mm/secretmem: use refcount_t instead of atomic_t Andrew Morton
2021-09-08  2:56 ` [patch 061/147] kfence: show cpu and timestamp in alloc/free info Andrew Morton
2021-09-08  2:56 ` [patch 062/147] kfence: test: fail fast if disabled at boot Andrew Morton
2021-09-08  2:56 ` [patch 063/147] mm: introduce Data Access MONitor (DAMON) Andrew Morton
2021-09-08  2:56 ` [patch 064/147] mm/damon/core: implement region-based sampling Andrew Morton
2021-09-08  2:56 ` [patch 065/147] mm/damon: adaptively adjust regions Andrew Morton
2021-09-08  2:56 ` [patch 066/147] mm/idle_page_tracking: make PG_idle reusable Andrew Morton
2021-09-08  2:56 ` [patch 067/147] mm/damon: implement primitives for the virtual memory address spaces Andrew Morton
2021-09-08  2:56 ` [patch 068/147] mm/damon: add a tracepoint Andrew Morton
2021-09-08  2:56 ` [patch 069/147] mm/damon: implement a debugfs-based user space interface Andrew Morton
2021-09-08  2:56 ` [patch 070/147] mm/damon/dbgfs: export kdamond pid to the user space Andrew Morton
2021-09-08  2:57 ` [patch 071/147] mm/damon/dbgfs: support multiple contexts Andrew Morton
2021-09-08  2:57 ` [patch 072/147] Documentation: add documents for DAMON Andrew Morton
2021-09-08  2:57 ` [patch 073/147] mm/damon: add kunit tests Andrew Morton
2021-09-08  2:57 ` [patch 074/147] mm/damon: add user space selftests Andrew Morton
2021-09-08  2:57 ` [patch 075/147] MAINTAINERS: update for DAMON Andrew Morton
2021-09-08  2:57 ` [patch 076/147] alpha: agp: make empty macros use do-while-0 style Andrew Morton
2021-09-08  2:57 ` [patch 077/147] alpha: pci-sysfs: fix all kernel-doc warnings Andrew Morton
2021-09-08  2:57 ` [patch 078/147] percpu: remove export of pcpu_base_addr Andrew Morton
2021-09-08  2:57 ` [patch 079/147] fs/proc/kcore.c: add mmap interface Andrew Morton
2021-09-08 18:13   ` Linus Torvalds
2021-09-08 18:13     ` Linus Torvalds
2021-09-09  9:56     ` [External] " Feng Zhou
2021-09-09 17:32       ` Linus Torvalds
2021-09-09 17:32         ` Linus Torvalds
2021-09-09 17:34         ` Linus Torvalds
2021-09-09 17:34           ` Linus Torvalds
2021-09-10  3:18           ` Feng Zhou
2021-09-10 10:08   ` David Hildenbrand
2021-09-10 12:00     ` Mike Rapoport
2021-09-10 12:02       ` David Hildenbrand
2021-09-08  2:57 ` [patch 080/147] proc: stop using seq_get_buf in proc_task_name Andrew Morton
2021-09-08  2:57 ` [patch 081/147] connector: send event on write to /proc/[pid]/comm Andrew Morton
2021-09-08  2:57 ` [patch 082/147] arch: Kconfig: fix spelling mistake "seperate" -> "separate" Andrew Morton
2021-09-08  2:57 ` [patch 083/147] include/linux/once.h: fix trivia typo Not -> Note Andrew Morton
2021-09-08  2:57 ` [patch 084/147] units: change from 'L' to 'UL' Andrew Morton
2021-09-08  2:57 ` [patch 085/147] units: add the HZ macros Andrew Morton
2021-09-08  2:57 ` [patch 086/147] thermal/drivers/devfreq_cooling: use " Andrew Morton
2021-09-08  2:57 ` [patch 087/147] devfreq: " Andrew Morton
2021-09-08  2:57 ` [patch 088/147] iio/drivers/as73211: " Andrew Morton
2021-09-08  2:58 ` [patch 089/147] hwmon/drivers/mr75203: " Andrew Morton
2021-09-08  2:58 ` [patch 090/147] iio/drivers/hid-sensor: " Andrew Morton
2021-09-08  2:58 ` [patch 091/147] i2c/drivers/ov02q10: " Andrew Morton
2021-09-08  2:58 ` [patch 092/147] mtd/drivers/nand: " Andrew Morton
2021-09-08  6:39   ` Miquel Raynal
2021-09-08  2:58 ` [patch 093/147] phy/drivers/stm32: " Andrew Morton
2021-09-08  2:58 ` [patch 094/147] kernel/acct.c: use dedicated helper to access rlimit values Andrew Morton
2021-09-08  2:58 ` [patch 095/147] profiling: fix shift-out-of-bounds bugs Andrew Morton
2021-09-08  2:58 ` [patch 096/147] MAINTAINERS: update ClangBuiltLinux mailing list Andrew Morton
2021-09-08  2:58 ` [patch 097/147] Documentation/llvm: update " Andrew Morton
2021-09-08  2:58 ` [patch 098/147] Documentation/llvm: update IRC location Andrew Morton
2021-09-08  2:58 ` [patch 099/147] math: make RATIONAL tristate Andrew Morton
2021-09-08  2:58 ` [patch 100/147] math: RATIONAL_KUNIT_TEST should depend on RATIONAL instead of selecting it Andrew Morton
2021-09-08  2:58 ` [patch 101/147] lib/string: optimized memcpy Andrew Morton
2021-09-08 18:26   ` Linus Torvalds
2021-09-08 18:26     ` Linus Torvalds
2021-09-08  2:58 ` [patch 102/147] lib/string: optimized memmove Andrew Morton
2021-09-08 18:29   ` Linus Torvalds
2021-09-08 18:29     ` Linus Torvalds
2021-09-09  8:28     ` David Laight
2021-09-08  2:58 ` [patch 103/147] lib/string: optimized memset Andrew Morton
2021-09-08 18:34   ` Linus Torvalds
2021-09-08 18:34     ` Linus Torvalds
2021-09-09 10:27     ` Matteo Croce
2021-09-08  2:58 ` [patch 104/147] lib/test: convert test_sort.c to use KUnit Andrew Morton
2021-09-08  2:58 ` [patch 105/147] lib/dump_stack: correct kernel-doc notation Andrew Morton
2021-09-08  2:58 ` [patch 106/147] lib/iov_iter.c: fix kernel-doc warnings Andrew Morton
2021-09-08  2:58 ` [patch 107/147] bitops: protect find_first_{,zero}_bit properly Andrew Morton
2021-09-08  2:59 ` [patch 108/147] bitops: move find_bit_*_le functions from le.h to find.h Andrew Morton
2021-09-08 18:37   ` Linus Torvalds
2021-09-08 18:37     ` Linus Torvalds
2021-09-08 19:38     ` Yury Norov
2021-09-08 19:46       ` Linus Torvalds
2021-09-08 19:46         ` Linus Torvalds
2021-09-08 19:49       ` Andrew Morton
2021-09-08 19:56         ` Linus Torvalds
2021-09-08 19:56           ` Linus Torvalds
2021-09-08 20:08           ` Linus Torvalds
2021-09-08 20:08             ` Linus Torvalds
2021-09-08 20:16         ` Yury Norov
2021-09-08  2:59 ` [patch 109/147] include: move find.h from asm_generic to linux Andrew Morton
2021-09-08  2:59 ` [patch 110/147] arch: remove GENERIC_FIND_FIRST_BIT entirely Andrew Morton
2021-09-08  2:59 ` [patch 111/147] lib: add find_first_and_bit() Andrew Morton
2021-09-08  2:59 ` [patch 112/147] cpumask: use find_first_and_bit() Andrew Morton
2021-09-08  2:59 ` [patch 113/147] all: replace find_next{,_zero}_bit with find_first{,_zero}_bit where appropriate Andrew Morton
2021-09-08  2:59 ` [patch 114/147] tools: sync tools/bitmap with mother linux Andrew Morton
2021-09-08  2:59 ` [patch 115/147] cpumask: replace cpumask_next_* with cpumask_first_* where appropriate Andrew Morton
2021-09-08  2:59 ` [patch 116/147] include/linux: move for_each_bit() macros from bitops.h to find.h Andrew Morton
2021-09-08  2:59 ` [patch 117/147] find: micro-optimize for_each_{set,clear}_bit() Andrew Morton
2021-09-08  2:59 ` [patch 118/147] bitops: replace for_each_*_bit_from() with for_each_*_bit() where appropriate Andrew Morton
2021-09-08  2:59 ` [patch 119/147] tools: rename bitmap_alloc() to bitmap_zalloc() Andrew Morton
2021-09-08  2:59 ` [patch 120/147] mm/percpu: micro-optimize pcpu_is_populated() Andrew Morton
2021-09-08  2:59 ` [patch 121/147] bitmap: unify find_bit operations Andrew Morton
2021-09-08  2:59 ` [patch 122/147] lib: bitmap: add performance test for bitmap_print_to_pagebuf Andrew Morton
2021-09-08  2:59 ` [patch 123/147] vsprintf: rework bitmap_list_string Andrew Morton
2021-09-08  2:59 ` [patch 124/147] checkpatch: support wide strings Andrew Morton
2021-09-08  2:59 ` [patch 125/147] checkpatch: make email address check case insensitive Andrew Morton
2021-09-08  2:59 ` [patch 126/147] checkpatch: improve GIT_COMMIT_ID test Andrew Morton
2021-09-08  3:00 ` [patch 127/147] fs/epoll: use a per-cpu counter for user's watches count Andrew Morton
2021-09-08  3:00 ` [patch 128/147] init: move usermodehelper_enable() to populate_rootfs() Andrew Morton
2021-09-08 15:44   ` Luis Chamberlain
2021-09-10  8:12     ` Rasmus Villemoes
2021-09-10 17:47       ` H. Peter Anvin
2021-09-10 17:51       ` Luis Chamberlain
2021-09-08  3:00 ` [patch 130/147] nilfs2: fix memory leak in nilfs_sysfs_create_device_group Andrew Morton
2021-09-08  3:00 ` [patch 131/147] nilfs2: fix NULL pointer in nilfs_##name##_attr_release Andrew Morton
2021-09-08  3:00 ` [patch 132/147] nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group Andrew Morton
2021-09-08  3:00 ` [patch 133/147] nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group Andrew Morton
2021-09-08  3:00 ` [patch 134/147] nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group Andrew Morton
2021-09-08  3:00 ` [patch 135/147] nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group Andrew Morton
2021-09-08  3:00 ` [patch 136/147] nilfs2: use refcount_dec_and_lock() to fix potential UAF Andrew Morton
2021-09-24 10:35   ` Pavel Machek
2021-09-24 11:09     ` Ryusuke Konishi
2021-09-24 12:12   ` Matthew Wilcox
2021-09-24 15:09     ` Ryusuke Konishi
2021-09-08  3:00 ` [patch 137/147] fs/coredump.c: log if a core dump is aborted due to changed file permissions Andrew Morton
2021-09-08  3:00 ` [patch 138/147] coredump: fix memleak in dump_vma_snapshot() Andrew Morton
2021-09-08  3:00 ` [patch 139/147] kernel/fork.c: unexport get_{mm,task}_exe_file Andrew Morton
2021-09-08  3:00 ` [patch 140/147] pid: cleanup the stale comment mentioning pidmap_init() Andrew Morton
2021-09-08  3:00 ` [patch 141/147] prctl: allow to setup brk for et_dyn executables Andrew Morton
2021-09-08  3:00 ` [patch 142/147] configs: remove the obsolete CONFIG_INPUT_POLLDEV Andrew Morton
2021-09-08  3:00 ` [patch 143/147] Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH Andrew Morton
2021-09-08  3:00 ` [patch 144/147] selftests/memfd: remove unused variable Andrew Morton
2021-09-08  3:00 ` [patch 145/147] ipc: replace costly bailout check in sysvipc_find_ipc() Andrew Morton
2021-09-08  3:00 ` [patch 146/147] mm/workingset: correct kernel-doc notations Andrew Morton
2021-09-08  3:00 ` [patch 147/147] scripts: check_extable: fix typo in user error message Andrew Morton
2021-09-08  3:16 ` [patch 129/147] trap: cleanup trap_init() Andrew Morton
2021-09-08  8:57 ` incoming Vlastimil Babka
