mm-commits Archive on lore.kernel.org
* incoming
From: Andrew Morton @ 2020-10-13 23:46 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: mm-commits, linux-mm

181 patches, based on 029f56db6ac248769f2c260bfaf3c3c0e23e904c.

Subsystems affected by this patch series:

  kbuild
  scripts
  ntfs
  ocfs2
  vfs
  mm/slab
  mm/slub
  mm/kmemleak
  mm/dax
  mm/debug
  mm/pagecache
  mm/fadvise
  mm/gup
  mm/swap
  mm/memremap
  mm/memcg
  mm/selftests
  mm/pagemap
  mm/mincore
  mm/hmm
  mm/dma
  mm/memory-failure
  mm/vmalloc
  mm/documentation
  mm/kasan
  mm/pagealloc
  mm/hugetlb
  mm/vmscan
  mm/z3fold
  mm/zbud
  mm/compaction
  mm/mempolicy
  mm/mempool
  mm/memblock
  mm/oom-kill
  mm/migration

Subsystem: kbuild

    Nick Desaulniers <ndesaulniers@google.com>:
    Patch series "set clang minimum version to 10.0.1", v3:
      compiler-clang: add build check for clang 10.0.1
      Revert "kbuild: disable clang's default use of -fmerge-all-constants"
      Revert "arm64: bti: Require clang >= 10.0.1 for in-kernel BTI support"
      Revert "arm64: vdso: Fix compilation with clang older than 8"
      Partially revert "ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer"

    Marco Elver <elver@google.com>:
      kasan: remove mentions of unsupported Clang versions

    Nick Desaulniers <ndesaulniers@google.com>:
      compiler-gcc: improve version error
      compiler.h: avoid escaped section names
      export.h: fix section name for CONFIG_TRIM_UNUSED_KSYMS for Clang

    Lukas Bulwahn <lukas.bulwahn@gmail.com>:
      kbuild: doc: describe proper script invocation

Subsystem: scripts

    Wang Qing <wangqing@vivo.com>:
      scripts/spelling.txt: increase error-prone spell checking

    Naoki Hayama <naoki.hayama@lineo.co.jp>:
      scripts/spelling.txt: add "arbitrary" typo

    Borislav Petkov <bp@suse.de>:
      scripts/decodecode: add the capability to supply the program counter

Subsystem: ntfs

    Rustam Kovhaev <rkovhaev@gmail.com>:
      ntfs: add check for mft record size in superblock

Subsystem: ocfs2

    Randy Dunlap <rdunlap@infradead.org>:
      ocfs2: delete repeated words in comments

    Gang He <ghe@suse.com>:
      ocfs2: fix potential soft lockup during fstrim

Subsystem: vfs

    Randy Dunlap <rdunlap@infradead.org>:
      fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr

    Luo Jiaxing <luojiaxing@huawei.com>:
      fs_parse: mark fs_param_bad_value() as static

Subsystem: mm/slab

    Mateusz Nosek <mateusznosek0@gmail.com>:
      mm/slab.c: clean code by removing redundant if condition

    tangjianqiang <wyqt1985@gmail.com>:
      include/linux/slab.h: fix a typo error in comment

Subsystem: mm/slub

    Abel Wu <wuyun.wu@huawei.com>:
      mm/slub.c: branch optimization in free slowpath
      mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc
      mm/slub: make add_full() condition more explicit

Subsystem: mm/kmemleak

    Davidlohr Bueso <dave@stgolabs.net>:
      mm/kmemleak: rely on rcu for task stack scanning

    Hui Su <sh_def@163.com>:
      mm,kmemleak-test.c: move kmemleak-test.c to samples dir

Subsystem: mm/dax

    Dan Williams <dan.j.williams@intel.com>:
    Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5:
      x86/numa: cleanup configuration dependent command-line options
      x86/numa: add 'nohmat' option
      efi/fake_mem: arrange for a resource entry per efi_fake_mem instance
      ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device
      resource: report parent to walk_iomem_res_desc() callback
      mm/memory_hotplug: introduce default phys_to_target_node() implementation
      ACPI: HMAT: attach a device for each soft-reserved range
      device-dax: drop the dax_region.pfn_flags attribute
      device-dax: move instance creation parameters to 'struct dev_dax_data'
      device-dax: make pgmap optional for instance creation
      device-dax/kmem: introduce dax_kmem_range()
      device-dax/kmem: move resource name tracking to drvdata
      device-dax/kmem: replace release_resource() with release_mem_region()
      device-dax: add an allocation interface for device-dax instances
      device-dax: introduce 'struct dev_dax' typed-driver operations
      device-dax: introduce 'seed' devices
      drivers/base: make device_find_child_by_name() compatible with sysfs inputs
      device-dax: add resize support
      mm/memremap_pages: convert to 'struct range'
      mm/memremap_pages: support multiple ranges per invocation
      device-dax: add dis-contiguous resource support
      device-dax: introduce 'mapping' devices

    Joao Martins <joao.m.martins@oracle.com>:
      device-dax: make align a per-device property

    Dan Williams <dan.j.williams@intel.com>:
      device-dax: add an 'align' attribute

    Joao Martins <joao.m.martins@oracle.com>:
      dax/hmem: introduce dax_hmem.region_idle parameter
      device-dax: add a range mapping allocation attribute

Subsystem: mm/debug

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm/debug.c: do not dereference i_ino blindly

    John Hubbard <jhubbard@nvidia.com>:
      mm, dump_page: rename head_mapcount() --> head_compound_mapcount()

Subsystem: mm/pagecache

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
    Patch series "Return head pages from find_*_entry", v2:
      mm: factor find_get_incore_page out of mincore_page
      mm: use find_get_incore_page in memcontrol
      mm: optimise madvise WILLNEED
      proc: optimise smaps for shmem entries
      i915: use find_lock_page instead of find_lock_entry
      mm: convert find_get_entry to return the head page
      mm/shmem: return head page from find_lock_entry
      mm: add find_lock_head
      mm/filemap: fix filemap_map_pages for THP

Subsystem: mm/fadvise

    Yafang Shao <laoar.shao@gmail.com>:
      mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED

Subsystem: mm/gup

    Barry Song <song.bao.hua@hisilicon.com>:
      mm/gup_benchmark: update the documentation in Kconfig
      mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag
      mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM

    John Hubbard <jhubbard@nvidia.com>:
      mm/gup: protect unpin_user_pages() against npages==-ERRNO

Subsystem: mm/swap

    Gao Xiang <hsiangkao@redhat.com>:
      swap: rename SWP_FS to SWAP_FS_OPS to avoid ambiguity

    Yu Zhao <yuzhao@google.com>:
      mm: remove activate_page() from unuse_pte()
      mm: remove superfluous __ClearPageActive()

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/swap.c: fix confusing comment in release_pages()
      mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache()
      mm/page_io.c: remove useless out label in __swap_writepage()
      mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable()
      mm/swapfile.c: remove unnecessary goto out in _swap_info_get()
      mm/swapfile.c: fix potential memory leak in sys_swapon

Subsystem: mm/memremap

    Ira Weiny <ira.weiny@intel.com>:
      mm/memremap.c: convert devmap static branch to {inc,dec}

Subsystem: mm/memcg

    "Gustavo A. R. Silva" <gustavoars@kernel.org>:
      mm: memcontrol: use flex_array_size() helper in memcpy()
      mm: memcontrol: use the preferred form for passing the size of a structure type

    Roman Gushchin <guro@fb.com>:
      mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj()

    Miaohe Lin <linmiaohe@huawei.com>:
      mm: memcontrol: correct the comment of mem_cgroup_iter()

    Waiman Long <longman@redhat.com>:
    Patch series "mm/memcg: Miscellaneous cleanups and streamlining", v2:
      mm/memcg: clean up obsolete enum charge_type
      mm/memcg: simplify mem_cgroup_get_max()
      mm/memcg: unify swap and memsw page counters

    Muchun Song <songmuchun@bytedance.com>:
      mm: memcontrol: add the missing numa_stat interface for cgroup v2

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge()
      mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom()

    Bharata B Rao <bharata@linux.ibm.com>:
      mm: memcg/slab: uncharge during kmem_cache_free_bulk()

    Ralph Campbell <rcampbell@nvidia.com>:
      mm/memcg: fix device private memcg accounting

Subsystem: mm/selftests

    John Hubbard <jhubbard@nvidia.com>:
    Patch series "selftests/vm: fix some minor aggravating factors in the Makefile":
      selftests/vm: fix false build success on the second and later attempts
      selftests/vm: fix incorrect gcc invocation in some cases

Subsystem: mm/pagemap

    Matthew Wilcox <willy@infradead.org>:
      mm: account PMD tables like PTE tables

    Yanfei Xu <yanfei.xu@windriver.com>:
      mm/memory.c: fix typo in __do_fault() comment
      mm/memory.c: replace vmf->vma with variable vma

    Wei Yang <richard.weiyang@linux.alibaba.com>:
      mm/mmap: rename __vma_unlink_common() to __vma_unlink()
      mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase()

    Chinwen Chang <chinwen.chang@mediatek.com>:
    Patch series "Try to release mmap_lock temporarily in smaps_rollup", v4:
      mmap locking API: add mmap_lock_is_contended()
      mm: smaps*: extend smap_gather_stats to support specified beginning
      mm: proc: smaps_rollup: do not stall write attempts on mmap_lock

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
    Patch series "Fix PageDoubleMap":
      mm: move PageDoubleMap bit
      mm: simplify PageDoubleMap with PF_SECOND policy

    Wei Yang <richard.weiyang@linux.alibaba.com>:
      mm/mmap: leave adjust_next as virtual address instead of page frame number

    Randy Dunlap <rdunlap@infradead.org>:
      mm/memory.c: fix spello of "function"

    Wei Yang <richard.weiyang@linux.alibaba.com>:
      mm/mmap: not necessary to check mapping separately
      mm/mmap: check on file instead of the rb_root_cached of its address_space

    Miaohe Lin <linmiaohe@huawei.com>:
      mm: use helper function mapping_allow_writable()
      mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct()

    Liao Pingfang <liao.pingfang@zte.com.cn>:
      mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct()

    Peter Xu <peterx@redhat.com>:
      mm: remove src/dst mm parameter in copy_page_range()

Subsystem: mm/mincore

    yuleixzhang <yulei.kernel@gmail.com>:
      include/linux/huge_mm.h: remove mincore_huge_pmd declaration

Subsystem: mm/hmm

    Ralph Campbell <rcampbell@nvidia.com>:
      tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro
      lib/test_hmm.c: remove unused dmirror_zero_page

Subsystem: mm/dma

    Andy Shevchenko <andriy.shevchenko@linux.intel.com>:
      mm/dmapool.c: replace open-coded list_for_each_entry_safe()
      mm/dmapool.c: replace hard coded function name with __func__

Subsystem: mm/memory-failure

    Xianting Tian <tian.xianting@h3c.com>:
      mm/memory-failure: do pgoff calculation before for_each_process()

    Alex Shi <alex.shi@linux.alibaba.com>:
      mm/memory-failure.c: remove unused macro `writeback'

Subsystem: mm/vmalloc

    Hui Su <sh_def@163.com>:
      mm/vmalloc.c: update the comment in __vmalloc_area_node()
      mm/vmalloc.c: fix the comment of find_vm_area

Subsystem: mm/documentation

    Alexander Gordeev <agordeev@linux.ibm.com>:
      docs/vm: fix 'mm_count' vs 'mm_users' counter confusion

Subsystem: mm/kasan

    Patricia Alfonso <trishalfonso@google.com>:
    Patch series "KASAN-KUnit Integration", v14:
      kasan/kunit: add KUnit Struct to Current Task
      KUnit: KASAN Integration
      KASAN: port KASAN Tests to KUnit
      KASAN: Testing Documentation

    David Gow <davidgow@google.com>:
      mm: kasan: do not panic if both panic_on_warn and kasan_multishot set

Subsystem: mm/pagealloc

    David Hildenbrand <david@redhat.com>:
    Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5:
      mm/page_alloc: tweak comments in has_unmovable_pages()
      mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate()
      mm/page_isolation: drop WARN_ON_ONCE() in set_migratetype_isolate()
      mm/page_isolation: cleanup set_migratetype_isolate()
      virtio-mem: don't special-case ZONE_MOVABLE
      mm: document semantics of ZONE_MOVABLE

    Li Xinhai <lixinhai.lxh@gmail.com>:
      mm, isolation: avoid checking unmovable pages across pageblock boundary

    Mateusz Nosek <mateusznosek0@gmail.com>:
      mm/page_alloc.c: clean code by removing unnecessary initialization
      mm/page_alloc.c: micro-optimization remove unnecessary branch
      mm/page_alloc.c: fix early params garbage value accesses
      mm/page_alloc.c: clean code by merging two functions

    Yanfei Xu <yanfei.xu@windriver.com>:
      mm/page_alloc.c: __perform_reclaim should return 'unsigned long'

    Mateusz Nosek <mateusznosek0@gmail.com>:
      mmzone: clean code by removing unused macro parameter

    Ralph Campbell <rcampbell@nvidia.com>:
      mm: move call to compound_head() in release_pages()

    "Matthew Wilcox (Oracle)" <willy@infradead.org>:
      mm/page_alloc.c: fix freeing non-compound pages

    Michal Hocko <mhocko@suse.com>:
      include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts

Subsystem: mm/hugetlb

    Baoquan He <bhe@redhat.com>:
    Patch series "mm/hugetlb: Small cleanup and improvement", v2:
      mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool
      mm/hugetlb.c: remove the unnecessary non_swap_entry()
      doc/vm: fix typo in the hugetlb admin documentation

    Wei Yang <richard.weiyang@linux.alibaba.com>:
    Patch series "mm/hugetlb: code refine and simplification", v4:
      mm/hugetlb: not necessary to coalesce regions recursively
      mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache()
      mm/hugetlb: use list_splice to merge two list at once
      mm/hugetlb: count file_region to be added when regions_needed != NULL
      mm/hugetlb: a page from buddy is not on any list
      mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page
      mm/hugetlb: take the free hpage during the iteration directly

    Mike Kravetz <mike.kravetz@oracle.com>:
      hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share

Subsystem: mm/vmscan

    Chunxin Zang <zangchunxin@bytedance.com>:
      mm/vmscan: fix infinite loop in drop_slab_node

    Hui Su <sh_def@163.com>:
      mm/vmscan: fix comments for isolate_lru_page()

Subsystem: mm/z3fold

    Hui Su <sh_def@163.com>:
      mm/z3fold.c: use xx_zalloc instead xx_alloc and memset

Subsystem: mm/zbud

    Xiang Chen <chenxiang66@hisilicon.com>:
      mm/zbud: remove redundant initialization

Subsystem: mm/compaction

    Mateusz Nosek <mateusznosek0@gmail.com>:
      mm/compaction.c: micro-optimization remove unnecessary branch
      include/linux/compaction.h: clean code by removing unused enum value

    John Hubbard <jhubbard@nvidia.com>:
      selftests/vm: 8x compaction_test speedup

Subsystem: mm/mempolicy

    Wei Yang <richard.weiyang@linux.alibaba.com>:
      mm/mempolicy: remove or narrow the lock on current
      mm: remove unused alloc_page_vma_node()

Subsystem: mm/mempool

    Miaohe Lin <linmiaohe@huawei.com>:
      mm/mempool: add 'else' to split mutually exclusive case

Subsystem: mm/memblock

    Mike Rapoport <rppt@linux.ibm.com>:
    Patch series "memblock: seasonal cleaning^w cleanup", v3:
      KVM: PPC: Book3S HV: simplify kvm_cma_reserve()
      dma-contiguous: simplify cma_early_percent_memory()
      arm, xtensa: simplify initialization of high memory pages
      arm64: numa: simplify dummy_numa_init()
      h8300, nds32, openrisc: simplify detection of memory extents
      riscv: drop unneeded node initialization
      mircoblaze: drop unneeded NUMA and sparsemem initializations
      memblock: make for_each_memblock_type() iterator private
      memblock: make memblock_debug and related functionality private
      memblock: reduce number of parameters in for_each_mem_range()
      arch, mm: replace for_each_memblock() with for_each_mem_pfn_range()
      arch, drivers: replace for_each_membock() with for_each_mem_range()
      x86/setup: simplify initrd relocation and reservation
      x86/setup: simplify reserve_crashkernel()
      memblock: remove unused memblock_mem_size()
      memblock: implement for_each_reserved_mem_region() using __next_mem_region()
      memblock: use separate iterators for memory and reserved regions

Subsystem: mm/oom-kill

    Suren Baghdasaryan <surenb@google.com>:
      mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary

Subsystem: mm/migration

    Ralph Campbell <rcampbell@nvidia.com>:
      mm/migrate: remove cpages-- in migrate_vma_finalize()
      mm/migrate: remove obsolete comment about device public

 .clang-format                                |    7 
 Documentation/admin-guide/cgroup-v2.rst      |   69 +
 Documentation/admin-guide/mm/hugetlbpage.rst |    2 
 Documentation/dev-tools/kasan.rst            |   74 +
 Documentation/dev-tools/kmemleak.rst         |    2 
 Documentation/kbuild/makefiles.rst           |   20 
 Documentation/vm/active_mm.rst               |    2 
 Documentation/x86/x86_64/boot-options.rst    |    4 
 MAINTAINERS                                  |    2 
 Makefile                                     |    9 
 arch/arm/Kconfig                             |    2 
 arch/arm/include/asm/tlb.h                   |    1 
 arch/arm/kernel/setup.c                      |   18 
 arch/arm/mm/init.c                           |   59 -
 arch/arm/mm/mmu.c                            |   39 
 arch/arm/mm/pmsa-v7.c                        |   23 
 arch/arm/mm/pmsa-v8.c                        |   17 
 arch/arm/xen/mm.c                            |    7 
 arch/arm64/Kconfig                           |    2 
 arch/arm64/kernel/machine_kexec_file.c       |    6 
 arch/arm64/kernel/setup.c                    |    4 
 arch/arm64/kernel/vdso/Makefile              |    7 
 arch/arm64/mm/init.c                         |   11 
 arch/arm64/mm/kasan_init.c                   |   10 
 arch/arm64/mm/mmu.c                          |   11 
 arch/arm64/mm/numa.c                         |   15 
 arch/c6x/kernel/setup.c                      |    9 
 arch/h8300/kernel/setup.c                    |    8 
 arch/microblaze/mm/init.c                    |   23 
 arch/mips/cavium-octeon/dma-octeon.c         |   14 
 arch/mips/kernel/setup.c                     |   31 
 arch/mips/netlogic/xlp/setup.c               |    2 
 arch/nds32/kernel/setup.c                    |    8 
 arch/openrisc/kernel/setup.c                 |    9 
 arch/openrisc/mm/init.c                      |    8 
 arch/powerpc/kernel/fadump.c                 |   61 -
 arch/powerpc/kexec/file_load_64.c            |   16 
 arch/powerpc/kvm/book3s_hv_builtin.c         |   12 
 arch/powerpc/kvm/book3s_hv_uvmem.c           |   14 
 arch/powerpc/mm/book3s64/hash_utils.c        |   16 
 arch/powerpc/mm/book3s64/radix_pgtable.c     |   10 
 arch/powerpc/mm/kasan/kasan_init_32.c        |    8 
 arch/powerpc/mm/mem.c                        |   31 
 arch/powerpc/mm/numa.c                       |    7 
 arch/powerpc/mm/pgtable_32.c                 |    8 
 arch/riscv/mm/init.c                         |   36 
 arch/riscv/mm/kasan_init.c                   |   10 
 arch/s390/kernel/setup.c                     |   27 
 arch/s390/mm/page-states.c                   |    6 
 arch/s390/mm/vmem.c                          |    7 
 arch/sh/mm/init.c                            |    9 
 arch/sparc/mm/init_64.c                      |   12 
 arch/x86/include/asm/numa.h                  |    8 
 arch/x86/kernel/e820.c                       |   16 
 arch/x86/kernel/setup.c                      |   56 -
 arch/x86/mm/numa.c                           |   13 
 arch/x86/mm/numa_emulation.c                 |    3 
 arch/x86/xen/enlighten_pv.c                  |    2 
 arch/xtensa/mm/init.c                        |   55 -
 drivers/acpi/numa/hmat.c                     |   76 -
 drivers/acpi/numa/srat.c                     |    9 
 drivers/base/core.c                          |    2 
 drivers/bus/mvebu-mbus.c                     |   12 
 drivers/dax/Kconfig                          |    6 
 drivers/dax/Makefile                         |    3 
 drivers/dax/bus.c                            | 1237 +++++++++++++++++++++++----
 drivers/dax/bus.h                            |   34 
 drivers/dax/dax-private.h                    |   74 +
 drivers/dax/device.c                         |  164 +--
 drivers/dax/hmem.c                           |   56 -
 drivers/dax/hmem/Makefile                    |    8 
 drivers/dax/hmem/device.c                    |  100 ++
 drivers/dax/hmem/hmem.c                      |   93 +-
 drivers/dax/kmem.c                           |  236 ++---
 drivers/dax/pmem/compat.c                    |    2 
 drivers/dax/pmem/core.c                      |   36 
 drivers/firmware/efi/x86_fake_mem.c          |   12 
 drivers/gpu/drm/i915/gem/i915_gem_shmem.c    |    4 
 drivers/gpu/drm/nouveau/nouveau_dmem.c       |   15 
 drivers/irqchip/irq-gic-v3-its.c             |    2 
 drivers/nvdimm/badrange.c                    |   26 
 drivers/nvdimm/claim.c                       |   13 
 drivers/nvdimm/nd.h                          |    3 
 drivers/nvdimm/pfn_devs.c                    |   13 
 drivers/nvdimm/pmem.c                        |   27 
 drivers/nvdimm/region.c                      |   21 
 drivers/pci/p2pdma.c                         |   12 
 drivers/virtio/virtio_mem.c                  |   47 -
 drivers/xen/unpopulated-alloc.c              |   45 
 fs/fs_parser.c                               |    2 
 fs/ntfs/inode.c                              |    6 
 fs/ocfs2/alloc.c                             |    6 
 fs/ocfs2/localalloc.c                        |    2 
 fs/proc/base.c                               |    3 
 fs/proc/task_mmu.c                           |  104 +-
 fs/xattr.c                                   |   22 
 include/acpi/acpi_numa.h                     |   14 
 include/kunit/test.h                         |    5 
 include/linux/acpi.h                         |    2 
 include/linux/compaction.h                   |    3 
 include/linux/compiler-clang.h               |    8 
 include/linux/compiler-gcc.h                 |    2 
 include/linux/compiler.h                     |    2 
 include/linux/dax.h                          |    8 
 include/linux/export.h                       |    2 
 include/linux/fs.h                           |    4 
 include/linux/gfp.h                          |    6 
 include/linux/huge_mm.h                      |    3 
 include/linux/kasan.h                        |    6 
 include/linux/memblock.h                     |   90 +
 include/linux/memcontrol.h                   |   13 
 include/linux/memory_hotplug.h               |   23 
 include/linux/memremap.h                     |   15 
 include/linux/mm.h                           |   36 
 include/linux/mmap_lock.h                    |    5 
 include/linux/mmzone.h                       |   37 
 include/linux/numa.h                         |   11 
 include/linux/oom.h                          |    1 
 include/linux/page-flags.h                   |   42 
 include/linux/pagemap.h                      |   43 
 include/linux/range.h                        |    6 
 include/linux/sched.h                        |    4 
 include/linux/sched/coredump.h               |    1 
 include/linux/slab.h                         |    2 
 include/linux/swap.h                         |   10 
 include/linux/swap_slots.h                   |    2 
 kernel/dma/contiguous.c                      |   11 
 kernel/fork.c                                |   25 
 kernel/resource.c                            |   11 
 lib/Kconfig.debug                            |    9 
 lib/Kconfig.kasan                            |   31 
 lib/Makefile                                 |    5 
 lib/kunit/test.c                             |   13 
 lib/test_free_pages.c                        |   42 
 lib/test_hmm.c                               |   65 -
 lib/test_kasan.c                             |  732 ++++++---------
 lib/test_kasan_module.c                      |  111 ++
 mm/Kconfig                                   |    4 
 mm/Makefile                                  |    1 
 mm/compaction.c                              |    5 
 mm/debug.c                                   |   18 
 mm/dmapool.c                                 |   46 -
 mm/fadvise.c                                 |    9 
 mm/filemap.c                                 |   78 -
 mm/gup.c                                     |   44 
 mm/gup_benchmark.c                           |   23 
 mm/huge_memory.c                             |    4 
 mm/hugetlb.c                                 |  100 +-
 mm/internal.h                                |    3 
 mm/kasan/report.c                            |   34 
 mm/kmemleak-test.c                           |   99 --
 mm/kmemleak.c                                |    8 
 mm/madvise.c                                 |   21 
 mm/memblock.c                                |  102 --
 mm/memcontrol.c                              |  262 +++--
 mm/memory-failure.c                          |    5 
 mm/memory.c                                  |  147 +--
 mm/memory_hotplug.c                          |   10 
 mm/mempolicy.c                               |    8 
 mm/mempool.c                                 |   18 
 mm/memremap.c                                |  344 ++++---
 mm/migrate.c                                 |    3 
 mm/mincore.c                                 |   28 
 mm/mmap.c                                    |   45 
 mm/oom_kill.c                                |    2 
 mm/page_alloc.c                              |   82 -
 mm/page_counter.c                            |    2 
 mm/page_io.c                                 |   14 
 mm/page_isolation.c                          |   41 
 mm/shmem.c                                   |   19 
 mm/slab.c                                    |    4 
 mm/slab.h                                    |   50 -
 mm/slub.c                                    |   33 
 mm/sparse.c                                  |   10 
 mm/swap.c                                    |   14 
 mm/swap_slots.c                              |    3 
 mm/swap_state.c                              |   38 
 mm/swapfile.c                                |   12 
 mm/truncate.c                                |   58 -
 mm/vmalloc.c                                 |    6 
 mm/vmscan.c                                  |    5 
 mm/z3fold.c                                  |    3 
 mm/zbud.c                                    |    1 
 samples/Makefile                             |    1 
 samples/kmemleak/Makefile                    |    3 
 samples/kmemleak/kmemleak-test.c             |   99 ++
 scripts/decodecode                           |   29 
 scripts/spelling.txt                         |    4 
 tools/testing/nvdimm/dax-dev.c               |   28 
 tools/testing/nvdimm/test/iomap.c            |    2 
 tools/testing/selftests/vm/Makefile          |   17 
 tools/testing/selftests/vm/compaction_test.c |   11 
 tools/testing/selftests/vm/gup_benchmark.c   |   14 
 tools/testing/selftests/vm/hmm-tests.c       |    4 
 194 files changed, 4273 insertions(+), 2777 deletions(-)



* [patch 001/181] compiler-clang: add build check for clang 10.0.1
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: compiler-clang: add build check for clang 10.0.1

Patch series "set clang minimum version to 10.0.1", v3.

Adds a compile-time #error to compiler-clang.h setting the effective
minimum supported version to clang 10.0.1.  A separate patch confirming
the version has already been picked up into the Documentation/ tree.

Next comes a series of reverts; the one for 32-bit ARM is a partial
revert.

Then come Marco's suggested fixes to the KASAN docs.

Finally, the GCC version error is improved as well, as per Kees.


This patch (of 7):

During Plumbers 2020, we voted to support just the latest release of
Clang for now.  Add a compile-time check for this.

We plan to remove the workarounds for older versions now; without this
check, builds with those versions would break in subtle and
not-so-subtle ways.
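
As a worked example of the check being added: for clang 10.0.1,
__clang_major__ is 10, __clang_minor__ is 0 and __clang_patchlevel__ is
1, so

	CLANG_VERSION = 10 * 10000 + 0 * 100 + 1 = 100001

and the #error stays quiet; clang 9.0.1 encodes to 90001, which is below
the 100001 floor, so the build stops at the new #error.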

Link: https://lkml.kernel.org/r/20200902225911.209899-1-ndesaulniers@google.com
Link: https://lkml.kernel.org/r/20200902225911.209899-2-ndesaulniers@google.com
Link: https://github.com/ClangBuiltLinux/linux/issues/9
Link: https://github.com/ClangBuiltLinux/linux/issues/941
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com>
Acked-by: Marco Elver <elver@google.com>
Acked-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compiler-clang.h |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/include/linux/compiler-clang.h~compiler-clang-add-build-check-for-clang-1001
+++ a/include/linux/compiler-clang.h
@@ -3,6 +3,14 @@
 #error "Please don't include <linux/compiler-clang.h> directly, include <linux/compiler.h> instead."
 #endif
 
+#define CLANG_VERSION (__clang_major__ * 10000	\
+		     + __clang_minor__ * 100	\
+		     + __clang_patchlevel__)
+
+#if CLANG_VERSION < 100001
+# error Sorry, your version of Clang is too old - please use 10.0.1 or newer.
+#endif
+
 /* Compiler specific definitions for Clang compiler */
 
 /* same as gcc, this was present in clang-2.6 so we can assume it works
_


* [patch 002/181] Revert "kbuild: disable clang's default use of -fmerge-all-constants"
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: Revert "kbuild: disable clang's default use of -fmerge-all-constants"

This reverts commit 87e0d4f0f37fb0c8c4aeeac46fff5e957738df79.

-fno-merge-all-constants has been the default since clang-6; the minimum
supported version of clang in the kernel is clang-10 (10.0.1).
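
For background, a minimal sketch (not part of the patch) of why
-fmerge-all-constants is non-conforming for C: distinct objects must
have distinct addresses, but the flag lets the compiler fold
same-valued constants, including initialized const arrays, into a
single object:

	#include <stdio.h>

	static const int a[] = { 1, 2, 3 };
	static const int b[] = { 1, 2, 3 };

	int main(void)
	{
		/* A conforming build must print "distinct"; with
		 * -fmerge-all-constants, a and b may share storage
		 * and this prints "merged".
		 */
		puts(a == b ? "merged" : "distinct");
		return 0;
	}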

Link: https://lkml.kernel.org/r/20200902225911.209899-3-ndesaulniers@google.com
Link: https://reviews.llvm.org/rL329300
Link: https://github.com/ClangBuiltLinux/linux/issues/9
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Fangrui Song <maskray@google.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Makefile |    9 ---------
 1 file changed, 9 deletions(-)

--- a/Makefile~revert-kbuild-disable-clangs-default-use-of-fmerge-all-constants
+++ a/Makefile
@@ -921,15 +921,6 @@ KBUILD_CFLAGS += $(call cc-disable-warni
 # disable invalid "can't wrap" optimizations for signed / pointers
 KBUILD_CFLAGS	+= $(call cc-option,-fno-strict-overflow)
 
-# clang sets -fmerge-all-constants by default as optimization, but this
-# is non-conforming behavior for C and in fact breaks the kernel, so we
-# need to disable it here generally.
-KBUILD_CFLAGS	+= $(call cc-option,-fno-merge-all-constants)
-
-# for gcc -fno-merge-all-constants disables everything, but it is fine
-# to have actual conforming behavior enabled.
-KBUILD_CFLAGS	+= $(call cc-option,-fmerge-constants)
-
 # Make sure -fstack-check isn't enabled (like gentoo apparently did)
 KBUILD_CFLAGS  += $(call cc-option,-fno-stack-check,)
 
_


* [patch 003/181] Revert "arm64: bti: Require clang >= 10.0.1 for in-kernel BTI support"
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: Revert "arm64: bti: Require clang >= 10.0.1 for in-kernel BTI support"

This reverts commit b9249cba25a5dce5de87e5404503a5e11832c2dd.

The minimum supported version of clang is now 10.0.1.

Link: https://lkml.kernel.org/r/20200902225911.209899-4-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/Kconfig |    2 --
 1 file changed, 2 deletions(-)

--- a/arch/arm64/Kconfig~revert-arm64-bti-require-clang-=-1001-for-in-kernel-bti-support
+++ a/arch/arm64/Kconfig
@@ -1612,8 +1612,6 @@ config ARM64_BTI_KERNEL
 	depends on CC_HAS_BRANCH_PROT_PAC_RET_BTI
 	# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94697
 	depends on !CC_IS_GCC || GCC_VERSION >= 100100
-	# https://reviews.llvm.org/rGb8ae3fdfa579dbf366b1bb1cbfdbf8c51db7fa55
-	depends on !CC_IS_CLANG || CLANG_VERSION >= 100001
 	depends on !(CC_IS_CLANG && GCOV_KERNEL)
 	depends on (!FUNCTION_GRAPH_TRACER || DYNAMIC_FTRACE_WITH_REGS)
 	help
_


* [patch 004/181] Revert "arm64: vdso: Fix compilation with clang older than 8"
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: Revert "arm64: vdso: Fix compilation with clang older than 8"

This reverts commit 3acf4be235280f14d838581a750532219d67facc.

The minimum supported version of clang is clang 10.0.1.

Link: https://lkml.kernel.org/r/20200902225911.209899-5-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/kernel/vdso/Makefile |    7 -------
 1 file changed, 7 deletions(-)

--- a/arch/arm64/kernel/vdso/Makefile~revert-arm64-vdso-fix-compilation-with-clang-older-than-8
+++ a/arch/arm64/kernel/vdso/Makefile
@@ -43,13 +43,6 @@ ifneq ($(c-gettimeofday-y),)
   CFLAGS_vgettimeofday.o += -include $(c-gettimeofday-y)
 endif
 
-# Clang versions less than 8 do not support -mcmodel=tiny
-ifeq ($(CONFIG_CC_IS_CLANG), y)
-  ifeq ($(shell test $(CONFIG_CLANG_VERSION) -lt 80000; echo $$?),0)
-    CFLAGS_REMOVE_vgettimeofday.o += -mcmodel=tiny
-  endif
-endif
-
 # Disable gcov profiling for VDSO code
 GCOV_PROFILE := n
 
_


* [patch 005/181] Partially revert "ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer"
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: Partially revert "ARM: 8905/1: Emit __gnu_mcount_nc when using Clang 10.0.0 or newer"

This partially reverts commit b0fe66cf095016e0b238374c10ae366e1f087d11.

The minimum supported version of clang is now clang 10.0.1. We still
want to pass -meabi=gnu.

Link: https://lkml.kernel.org/r/20200902225911.209899-6-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Suggested-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/arm/Kconfig~partially-revert-arm-8905-1-emit-__gnu_mcount_nc-when-using-clang-1000-or-newer
+++ a/arch/arm/Kconfig
@@ -84,7 +84,7 @@ config ARM
 	select HAVE_FAST_GUP if ARM_LPAE
 	select HAVE_FTRACE_MCOUNT_RECORD if !XIP_KERNEL
 	select HAVE_FUNCTION_GRAPH_TRACER if !THUMB2_KERNEL && !CC_IS_CLANG
-	select HAVE_FUNCTION_TRACER if !XIP_KERNEL && (CC_IS_GCC || CLANG_VERSION >= 100000)
+	select HAVE_FUNCTION_TRACER if !XIP_KERNEL
 	select HAVE_GCC_PLUGINS
 	select HAVE_HW_BREAKPOINT if PERF_EVENTS && (CPU_V6 || CPU_V6K || CPU_V7)
 	select HAVE_IDE if PCI || ISA || PCMCIA
_


* [patch 006/181] kasan: remove mentions of unsupported Clang versions
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Marco Elver <elver@google.com>
Subject: kasan: remove mentions of unsupported Clang versions

Since the kernel now requires at least Clang 10.0.1, remove any mention of
old Clang versions and simplify the documentation.

Link: https://lkml.kernel.org/r/20200902225911.209899-7-ndesaulniers@google.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kasan.rst |    4 ++--
 lib/Kconfig.kasan                 |    9 ++++-----
 2 files changed, 6 insertions(+), 7 deletions(-)

--- a/Documentation/dev-tools/kasan.rst~kasan-remove-mentions-of-unsupported-clang-versions
+++ a/Documentation/dev-tools/kasan.rst
@@ -13,10 +13,10 @@ KASAN uses compile-time instrumentation
 memory access, and therefore requires a compiler version that supports that.
 
 Generic KASAN is supported in both GCC and Clang. With GCC it requires version
-8.3.0 or later. With Clang it requires version 7.0.0 or later, but detection of
+8.3.0 or later. Any supported Clang version is compatible, but detection of
 out-of-bounds accesses for global variables is only supported since Clang 11.
 
-Tag-based KASAN is only supported in Clang and requires version 7.0.0 or later.
+Tag-based KASAN is only supported in Clang.
 
 Currently generic KASAN is supported for the x86_64, arm64, xtensa, s390 and
 riscv architectures, and tag-based KASAN is supported only for arm64.
--- a/lib/Kconfig.kasan~kasan-remove-mentions-of-unsupported-clang-versions
+++ a/lib/Kconfig.kasan
@@ -54,9 +54,9 @@ config KASAN_GENERIC
 	  Enables generic KASAN mode.
 
 	  This mode is supported in both GCC and Clang. With GCC it requires
-	  version 8.3.0 or later. With Clang it requires version 7.0.0 or
-	  later, but detection of out-of-bounds accesses for global variables
-	  is supported only since Clang 11.
+	  version 8.3.0 or later. Any supported Clang version is compatible,
+	  but detection of out-of-bounds accesses for global variables is
+	  supported only since Clang 11.
 
 	  This mode consumes about 1/8th of available memory at kernel start
 	  and introduces an overhead of ~x1.5 for the rest of the allocations.
@@ -78,8 +78,7 @@ config KASAN_SW_TAGS
 	  Enables software tag-based KASAN mode.
 
 	  This mode requires Top Byte Ignore support by the CPU and therefore
-	  is only supported for arm64. This mode requires Clang version 7.0.0
-	  or later.
+	  is only supported for arm64. This mode requires Clang.
 
 	  This mode consumes about 1/16th of available memory at kernel start
 	  and introduces an overhead of ~20% for the rest of the allocations.
_


* [patch 007/181] compiler-gcc: improve version error
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, andreyknvl, ast, daniel, elver, keescook, linux-mm,
	masahiroy, maskray, miguel.ojeda.sandonis, mm-commits,
	natechancellor, ndesaulniers, sedat.dilek, torvalds,
	vincenzo.frascino, will

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: compiler-gcc: improve version error

As Kees suggests, naming the compiler and the required version in the
error message provides developers with two useful pieces of information:
- The kernel build was attempting to use GCC.
  (Maybe they accidentally poked the wrong configs in a CI.)
- They need 4.9 or better.
  ("Upgrade to what version?" doesn't need to be dug out of documentation,
   headers, etc.)
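
For reference, GCC_VERSION (recalled from compiler-gcc.h, not changed by
this patch) uses the same major * 10000 + minor * 100 + patchlevel
encoding as the CLANG_VERSION macro added earlier in this series, so the
4.9 floor below is spelled 40900:

	#define GCC_VERSION (__GNUC__ * 10000		\
			   + __GNUC_MINOR__ * 100	\
			   + __GNUC_PATCHLEVEL__)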

Link: https://lkml.kernel.org/r/20200902225911.209899-8-ndesaulniers@google.com
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Sedat Dilek <sedat.dilek@gmail.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Fangrui Song <maskray@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compiler-gcc.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/compiler-gcc.h~compiler-gcc-improve-version-error
+++ a/include/linux/compiler-gcc.h
@@ -12,7 +12,7 @@
 
 /* https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145 */
 #if GCC_VERSION < 40900
-# error Sorry, your compiler is too old - please upgrade it.
+# error Sorry, your version of GCC is too old - please use 4.9 or newer.
 #endif
 
 /* Optimization barrier */
_


* [patch 008/181] compiler.h: avoid escaped section names
From: Andrew Morton @ 2020-10-13 23:47 UTC (permalink / raw)
  To: akpm, linux-mm, luc.vanoostenryck, miguel.ojeda.sandonis,
	mm-commits, natechancellor, ndesaulniers, nivedita, torvalds

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: compiler.h: avoid escaped section names

The preprocessor's stringification operator, `#`, escapes the quote
characters in its argument.  For example, `# "foo"` becomes `"\"foo\""`.
GCC and Clang differ in how they treat section names that contain \".

The portable solution is to not use a string literal with the preprocessor
stringification operator.

In this case, since __section unconditionally uses the stringification
operator, we actually want the more verbose
__attribute__((__section__())).
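
A standalone sketch of the escaping (not from the patch):

	#include <stdio.h>

	#define STR(x) #x

	int main(void)
	{
		puts(STR(foo));   /* prints: foo   */
		puts(STR("foo")); /* prints: "foo" */
		return 0;
	}

When the section name is built from string literals, as in
__section("___kentry" "+" #sym), the stringified result contains escaped
quotes, and the two compilers disagree on the resulting section name.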

Link: https://bugs.llvm.org/show_bug.cgi?id=42950
Link: https://lkml.kernel.org/r/20200929194318.548707-1-ndesaulniers@google.com
Fixes: e04462fb82f8 ("Compiler Attributes: remove uses of __attribute__ from compiler.h")
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compiler.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/compiler.h~compilerh-avoid-escaped-section-names
+++ a/include/linux/compiler.h
@@ -155,7 +155,7 @@ void ftrace_likely_update(struct ftrace_
 	extern typeof(sym) sym;					\
 	static const unsigned long __kentry_##sym		\
 	__used							\
-	__section("___kentry" "+" #sym )			\
+	__attribute__((__section__("___kentry+" #sym)))		\
 	= (unsigned long)&sym;
 #endif
 
_


* [patch 009/181] export.h: fix section name for CONFIG_TRIM_UNUSED_KSYMS for Clang
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, gregkh, jeyu, keescook, linux-mm, lkp, maennich,
	mm-commits, natechancellor, ndesaulniers, torvalds, will,
	yamada.masahiro

From: Nick Desaulniers <ndesaulniers@google.com>
Subject: export.h: fix section name for CONFIG_TRIM_UNUSED_KSYMS for Clang

When enabling CONFIG_TRIM_UNUSED_KSYMS, the linker will warn about the
orphan sections:

(".discard.ksym") is being placed in '".discard.ksym"'

repeatedly when linking vmlinux.  This is because the preprocessor's
stringification operator, `#`, escapes the quote characters in its
argument.  GCC and Clang differ in how they treat section names that
contain \".

The portable solution is to not use a string literal with the preprocessor
stringification operator.
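
With __section defined at the time as (recalled from
include/linux/compiler_attributes.h, not part of this patch)

	#define __section(S) __attribute__((__section__(#S)))

a quoted argument is stringified a second time, while a bare one is not:

	__section(".discard.ksym")
		/* -> __attribute__((__section__("\".discard.ksym\""))) */
	__section(.discard.ksym)
		/* -> __attribute__((__section__(".discard.ksym")))     */

which is why the escaped name shows up in the orphan-section warning
above, and why the fix below drops the quotes.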

Link: https://bugs.llvm.org/show_bug.cgi?id=42950
Link: https://github.com/ClangBuiltLinux/linux/issues/1166
Link: https://lkml.kernel.org/r/20200929190701.398762-1-ndesaulniers@google.com
Fixes: bbda5ec671d3 ("kbuild: simplify dependency generation for CONFIG_TRIM_UNUSED_KSYMS")
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Matthias Maennich <maennich@google.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/export.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/export.h~exporth-fix-section-name-for-config_trim_unused_ksyms-for-clang
+++ a/include/linux/export.h
@@ -130,7 +130,7 @@ struct kernel_symbol {
  * discarded in the final link stage.
  */
 #define __ksym_marker(sym)	\
-	static int __ksym_marker_##sym[0] __section(".discard.ksym") __used
+	static int __ksym_marker_##sym[0] __section(.discard.ksym) __used
 
 #define __EXPORT_SYMBOL(sym, sec, ns)					\
 	__ksym_marker(sym);						\
_


* [patch 010/181] kbuild: doc: describe proper script invocation
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, corbet, keescook, linux-mm, lukas.bulwahn, masahiroy,
	michal.lkml, mm-commits, torvalds, ujjwalkumar0501

From: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Subject: kbuild: doc: describe proper script invocation

During an investigation into fixing up the execute bits of scripts in the
kernel repository, Andrew Morton and Kees Cook pointed out that the
execute bit should not matter and that the build must not rely on it.
Kees could not point to any documentation of this convention, though.

Masahiro Yamada explained the convention of setting execute bits to make
manual script invocation easier.

Provide some basic documentation on how the build shall invoke scripts,
such that the execute bits do not matter, and acknowledge that execute
bits are useful nonetheless.

This serves as a reference for further clean-up patches in the future.
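
A hypothetical before/after sketch of the convention (cmd_foo and
scripts/foo.sh are made-up names):

	# fragile: relies on the execute bit and the script's shebang
	cmd_foo = $(srctree)/scripts/foo.sh
	# proper: the rule names the interpreter explicitly
	cmd_foo = $(CONFIG_SHELL) $(srctree)/scripts/foo.sh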

Link: https://lore.kernel.org/lkml/20200830174409.c24c3f67addcce0cea9a9d4c@linux-foundation.org/
Link: https://lore.kernel.org/lkml/202008271102.FEB906C88@keescook/
Link: https://lore.kernel.org/linux-kbuild/CAK7LNAQdrvMkDA6ApDJCGr+5db8SiPo=G+p8EiOvnnGvEN80gA@mail.gmail.com/
Link: https://lkml.kernel.org/r/20201001075723.24246-1-lukas.bulwahn@gmail.com
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Ujjwal Kumar <ujjwalkumar0501@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/kbuild/makefiles.rst |   20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

--- a/Documentation/kbuild/makefiles.rst~kbuild-doc-describe-proper-script-invocation
+++ a/Documentation/kbuild/makefiles.rst
@@ -21,6 +21,7 @@ This document describes the Linux kernel
 	   --- 3.10 Special Rules
 	   --- 3.11 $(CC) support functions
 	   --- 3.12 $(LD) support functions
+	   --- 3.13 Script Invocation
 
 	=== 4 Host Program support
 	   --- 4.1 Simple Host Program
@@ -605,6 +606,25 @@ more details, with real examples.
 		#Makefile
 		LDFLAGS_vmlinux += $(call ld-option, -X)
 
+3.13 Script invocation
+----------------------
+
+	Make rules may invoke scripts to build the kernel. The rules shall
+	always provide the appropriate interpreter to execute the script. They
+	shall not rely on the execute bits being set, and shall not invoke the
+	script directly. For the convenience of manual script invocation, such
+	as invoking ./scripts/checkpatch.pl, it is recommended to set execute
+	bits on the scripts nonetheless.
+
+	Kbuild provides variables $(CONFIG_SHELL), $(AWK), $(PERL),
+	$(PYTHON) and $(PYTHON3) to refer to interpreters for the respective
+	scripts.
+
+	Example::
+
+		#Makefile
+		cmd_depmod = $(CONFIG_SHELL) $(srctree)/scripts/depmod.sh $(DEPMOD) \
+			     $(KERNELRELEASE)
 
 4 Host Program support
 ======================
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 011/181] scripts/spelling.txt: increase error-prone spell checking
  2020-10-13 23:46 incoming Andrew Morton
                   ` (9 preceding siblings ...)
  2020-10-13 23:48 ` [patch 010/181] kbuild: doc: describe proper script invocation Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 012/181] scripts/spelling.txt: add "arbitrary" typo Andrew Morton
                   ` (173 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, colin.king, j.neuschaefer, joe, linux-mm, luca, mm-commits,
	sjpark, torvalds, wangqing, xndchn

From: Wang Qing <wangqing@vivo.com>
Subject: scripts/spelling.txt: increase error-prone spell checking

Add checks for the misspellings "direcly", "ununsed" and "manger".

Link: https://lkml.kernel.org/r/1601085397-27586-1-git-send-email-wangqing@vivo.com
Signed-off-by: Wang Qing <wangqing@vivo.com>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Wang Qing <wangqing@vivo.com>
Cc: Xiong <xndchn@gmail.com>
Cc: SeongJae Park <sjpark@amazon.de>
Cc: Jonathan Neuschfer <j.neuschaefer@gmx.net>
Cc: Luca Ceresoli <luca@lucaceresoli.net>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |    3 +++
 1 file changed, 3 insertions(+)

--- a/scripts/spelling.txt~increase-error-prone-spell-checking
+++ a/scripts/spelling.txt
@@ -482,6 +482,7 @@ disgest||digest
 dispalying||displaying
 diplay||display
 directon||direction
+direcly||directly
 direectly||directly
 diregard||disregard
 disassocation||disassociation
@@ -871,6 +872,7 @@ malplace||misplace
 managable||manageable
 managment||management
 mangement||management
+manger||manager
 manoeuvering||maneuvering
 manufaucturing||manufacturing
 mappping||mapping
@@ -1478,6 +1480,7 @@ unsolicitied||unsolicited
 unsuccessfull||unsuccessful
 unsuported||unsupported
 untill||until
+ununsed||unused
 unuseful||useless
 unvalid||invalid
 upate||update
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 012/181] scripts/spelling.txt: add "arbitrary" typo
  2020-10-13 23:46 incoming Andrew Morton
                   ` (10 preceding siblings ...)
  2020-10-13 23:48 ` [patch 011/181] scripts/spelling.txt: increase error-prone spell checking Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 013/181] scripts/decodecode: add the capability to supply the program counter Andrew Morton
                   ` (172 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, apw, colin.king, joe, linux-mm, mm-commits, naoki.hayama, torvalds

From: Naoki Hayama <naoki.hayama@lineo.co.jp>
Subject: scripts/spelling.txt: add "arbitrary" typo

Add "abitrary||arbitrary".

Link: https://lkml.kernel.org/r/6bf6520d-787d-5749-09b5-ff92185f501f@lineo.co.jp
Signed-off-by: Naoki Hayama <naoki.hayama@lineo.co.jp>
Cc: Colin Ian King <colin.king@canonical.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/spelling.txt |    1 +
 1 file changed, 1 insertion(+)

--- a/scripts/spelling.txt~scripts-spellingtxt-add-arbitrary-typo
+++ a/scripts/spelling.txt
@@ -9,6 +9,7 @@
 #
 abandonning||abandoning
 abigious||ambiguous
+abitrary||arbitrary
 abitrate||arbitrate
 abnornally||abnormally
 abnrormal||abnormal
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 013/181] scripts/decodecode: add the capability to supply the program counter
  2020-10-13 23:46 incoming Andrew Morton
                   ` (11 preceding siblings ...)
  2020-10-13 23:48 ` [patch 012/181] scripts/spelling.txt: add "arbitrary" typo Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 014/181] ntfs: add check for mft record size in superblock Andrew Morton
                   ` (171 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, bp, linux-mm, maz, mm-commits, rabin, torvalds, will

From: Borislav Petkov <bp@suse.de>
Subject: scripts/decodecode: add the capability to supply the program counter

Allow supplying the program counter, so that comparing with objdump
output from vmlinux can ease pinpointing where the trapping instruction
actually is.  An example is better than a thousand words:

  $ PC=0xffffffff8329a927 ./scripts/decodecode < ~/tmp/syz/gfs2.splat
  [ 477.379104][T23917] Code: 48 83 ec 28 48 89 3c 24 48 89 54 24 08 e8 c1 b4 4a fe 48 8d bb 00 01 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 97 05 00 00 48 8b 9b 00 01 00 00 48 85 db 0f 84
  All code
  ========
  ffffffff8329a8fd:       48 83 ec 28             sub    $0x28,%rsp
  ffffffff8329a901:       48 89 3c 24             mov    %rdi,(%rsp)
  ffffffff8329a905:       48 89 54 24 08          mov    %rdx,0x8(%rsp)
  ffffffff8329a90a:       e8 c1 b4 4a fe          callq  0xffffffff81745dd0
  ffffffff8329a90f:       48 8d bb 00 01 00 00    lea    0x100(%rbx),%rdi
  ffffffff8329a916:       48 b8 00 00 00 00 00    movabs $0xdffffc0000000000,%rax
  ffffffff8329a91d:       fc ff df
  ffffffff8329a920:       48 89 fa                mov    %rdi,%rdx
  ffffffff8329a923:       48 c1 ea 03             shr    $0x3,%rdx
  ffffffff8329a927:*      80 3c 02 00             cmpb   $0x0,(%rdx,%rax,1)               <-- trapping instruction
  ffffffff8329a92b:       0f 85 97 05 00 00       jne    0xffffffff8329aec8
  ffffffff8329a931:       48 8b 9b 00 01 00 00    mov    0x100(%rbx),%rbx
  ffffffff8329a938:       48 85 db                test   %rbx,%rbx
  ffffffff8329a93b:       0f                      .byte 0xf
  ffffffff8329a93c:       84                      .byte 0x84

Link: https://lkml.kernel.org/r/20200930111416.GF6810@zn.tnic
Link: https://lkml.kernel.org/r/20200929113238.GC21110@zn.tnic
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Marc Zyngier <maz@misterjones.org>
Cc: Will Deacon <will@kernel.org>
Cc: Rabin Vincent <rabin@rab.in>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 scripts/decodecode |   29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

--- a/scripts/decodecode~scripts-decodecode-add-the-capability-to-supply-the-program-counter
+++ a/scripts/decodecode
@@ -6,6 +6,7 @@
 # options: set env. variable AFLAGS=options to pass options to "as";
 # e.g., to decode an i386 oops on an x86_64 system, use:
 # AFLAGS=--32 decodecode < 386.oops
+# PC=hex - the PC (program counter) the oops points to
 
 cleanup() {
 	rm -f $T $T.s $T.o $T.oo $T.aa $T.dis
@@ -67,15 +68,19 @@ if [ -z "$ARCH" ]; then
     esac
 fi
 
+# Params: (tmp_file, pc_sub)
 disas() {
-	${CROSS_COMPILE}as $AFLAGS -o $1.o $1.s > /dev/null 2>&1
+	t=$1
+	pc_sub=$2
+
+	${CROSS_COMPILE}as $AFLAGS -o $t.o $t.s > /dev/null 2>&1
 
 	if [ "$ARCH" = "arm" ]; then
 		if [ $width -eq 2 ]; then
 			OBJDUMPFLAGS="-M force-thumb"
 		fi
 
-		${CROSS_COMPILE}strip $1.o
+		${CROSS_COMPILE}strip $t.o
 	fi
 
 	if [ "$ARCH" = "arm64" ]; then
@@ -83,11 +88,18 @@ disas() {
 			type=inst
 		fi
 
-		${CROSS_COMPILE}strip $1.o
+		${CROSS_COMPILE}strip $t.o
 	fi
 
-	${CROSS_COMPILE}objdump $OBJDUMPFLAGS -S $1.o | \
-		grep -v "/tmp\|Disassembly\|\.text\|^$" > $1.dis 2>&1
+	if [ $pc_sub -ne 0 ]; then
+		if [ $PC ]; then
+			adj_vma=$(( $PC - $pc_sub ))
+			OBJDUMPFLAGS="$OBJDUMPFLAGS --adjust-vma=$adj_vma"
+		fi
+	fi
+
+	${CROSS_COMPILE}objdump $OBJDUMPFLAGS -S $t.o | \
+		grep -v "/tmp\|Disassembly\|\.text\|^$" > $t.dis 2>&1
 }
 
 marker=`expr index "$code" "\<"`
@@ -95,14 +107,17 @@ if [ $marker -eq 0 ]; then
 	marker=`expr index "$code" "\("`
 fi
 
+
 touch $T.oo
 if [ $marker -ne 0 ]; then
+	# 2 opcode bytes and a single space
+	pc_sub=$(( $marker / 3 ))
 	echo All code >> $T.oo
 	echo ======== >> $T.oo
 	beforemark=`echo "$code"`
 	echo -n "	.$type 0x" > $T.s
 	echo $beforemark | sed -e 's/ /,0x/g; s/[<>()]//g' >> $T.s
-	disas $T
+	disas $T $pc_sub
 	cat $T.dis >> $T.oo
 	rm -f $T.o $T.s $T.dis
 
@@ -114,7 +129,7 @@ echo ===================================
 code=`echo $code | sed -e 's/ [<(]/ /;s/[>)] / /;s/ /,0x/g; s/[>)]$//'`
 echo -n "	.$type 0x" > $T.s
 echo $code >> $T.s
-disas $T
+disas $T 0
 cat $T.dis >> $T.aa
 
 # (lines of whole $T.oo) - (lines of $T.aa, i.e. "Code starting") + 3,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 014/181] ntfs: add check for mft record size in superblock
  2020-10-13 23:46 incoming Andrew Morton
                   ` (12 preceding siblings ...)
  2020-10-13 23:48 ` [patch 013/181] scripts/decodecode: add the capability to supply the program counter Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 015/181] ocfs2: delete repeated words in comments Andrew Morton
                   ` (170 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, anton, linux-mm, mm-commits, rkovhaev, torvalds

From: Rustam Kovhaev <rkovhaev@gmail.com>
Subject: ntfs: add check for mft record size in superblock

The number of bytes allocated for an mft record should be equal to the
mft record size stored in the ntfs superblock.  As reported by syzbot,
userspace might otherwise trigger an out-of-bounds read by dereferencing
ctx->attr in ntfs_attr_find().
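
For illustration only, a minimal user-space sketch of the rule being
applied (the names are invented, not ntfs code): a size field read from
on-disk metadata must match what was actually allocated before any
record contents are dereferenced.

	#include <errno.h>
	#include <stdint.h>

	/* Reject records whose on-disk size disagrees with the
	 * allocation; a corrupt or malicious image could otherwise
	 * cause reads past the end of the buffer.
	 */
	static int check_record_size(uint32_t bytes_allocated,
				     uint32_t expected_size)
	{
		if (bytes_allocated != expected_size)
			return -EINVAL;
		return 0;
	}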

Link: https://syzkaller.appspot.com/bug?extid=aed06913f36eff9b544e
Link: https://lkml.kernel.org/r/20200824022804.226242-1-rkovhaev@gmail.com
Reported-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com
Tested-by: syzbot+aed06913f36eff9b544e@syzkaller.appspotmail.com
Signed-off-by: Rustam Kovhaev <rkovhaev@gmail.com>
Acked-by: Anton Altaparmakov <anton@tuxera.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ntfs/inode.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- a/fs/ntfs/inode.c~ntfs-add-check-for-mft-record-size-in-superblock
+++ a/fs/ntfs/inode.c
@@ -1810,6 +1810,12 @@ int ntfs_read_inode_mount(struct inode *
 		brelse(bh);
 	}
 
+	if (le32_to_cpu(m->bytes_allocated) != vol->mft_record_size) {
+		ntfs_error(sb, "Incorrect mft record size %u in superblock, should be %u.",
+				le32_to_cpu(m->bytes_allocated), vol->mft_record_size);
+		goto err_out;
+	}
+
 	/* Apply the mst fixups. */
 	if (post_read_mst_fixup((NTFS_RECORD*)m, vol->mft_record_size)) {
 		/* FIXME: Try to use the $MFTMirr now. */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 015/181] ocfs2: delete repeated words in comments
  2020-10-13 23:46 incoming Andrew Morton
                   ` (13 preceding siblings ...)
  2020-10-13 23:48 ` [patch 014/181] ntfs: add check for mft record size in superblock Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 016/181] ocfs2: fix potential soft lockup during fstrim Andrew Morton
                   ` (169 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, jlbec, joseph.qi, linux-mm, mark, mm-commits, rdunlap, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: ocfs2: delete repeated words in comments

Drop duplicated words {the, and} in comments.

Link: https://lkml.kernel.org/r/20200811021845.25134-1-rdunlap@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c      |    2 +-
 fs/ocfs2/localalloc.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/fs/ocfs2/alloc.c~fs-ocfs2-delete-repeated-words-in-comments
+++ a/fs/ocfs2/alloc.c
@@ -6013,7 +6013,7 @@ int __ocfs2_flush_truncate_log(struct oc
 		goto out;
 	}
 
-	/* Appending truncate log(TA) and and flushing truncate log(TF) are
+	/* Appending truncate log(TA) and flushing truncate log(TF) are
 	 * two separated transactions. They can be both committed but not
 	 * checkpointed. If crash occurs then, both two transaction will be
 	 * replayed with several already released to global bitmap clusters.
--- a/fs/ocfs2/localalloc.c~fs-ocfs2-delete-repeated-words-in-comments
+++ a/fs/ocfs2/localalloc.c
@@ -677,7 +677,7 @@ int ocfs2_reserve_local_alloc_bits(struc
 		/*
 		 * Under certain conditions, the window slide code
 		 * might have reduced the number of bits available or
-		 * disabled the the local alloc entirely. Re-check
+		 * disabled the local alloc entirely. Re-check
 		 * here and return -ENOSPC if necessary.
 		 */
 		status = -ENOSPC;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 016/181] ocfs2: fix potential soft lockup during fstrim
  2020-10-13 23:46 incoming Andrew Morton
                   ` (14 preceding siblings ...)
  2020-10-13 23:48 ` [patch 015/181] ocfs2: delete repeated words in comments Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 017/181] fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr Andrew Morton
                   ` (168 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, gechangwei, ghe, jlbec, joseph.qi, junxiao.bi, linux-mm,
	mark, mm-commits, piaojun, torvalds

From: Gang He <ghe@suse.com>
Subject: ocfs2: fix potential soft lockup during fstrim

When we discard unused blocks on a mounted ocfs2 filesystem, fstrim
handles each block group while repeatedly locking and unlocking the
global bitmap meta-file.  We should let the fstrim thread take a break
(if needed) between unlock and lock.  This avoids the potential soft
lockup problem and also gives upper applications more IO opportunities,
so those applications are not blocked for too long when writing files.
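
As a hedged sketch of the pattern (the lock/trim helpers below are
invented; only cond_resched() is the real kernel API), the point is
simply to drop the lock and offer the CPU between iterations of a
long-running loop:

	/* Not the actual ocfs2 code; shown for illustration. */
	while (group <= last_group) {
		lock_global_bitmap(sb);		/* hypothetical helper */
		trim_one_group(sb, group++);	/* hypothetical helper */
		unlock_global_bitmap(sb);	/* hypothetical helper */
		cond_resched();	/* yield if a reschedule is pending */
	}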

Link: https://lkml.kernel.org/r/20200927015815.14904-1-ghe@suse.com
Signed-off-by: Gang He <ghe@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/ocfs2/alloc.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/fs/ocfs2/alloc.c~ocfs2-fix-potential-soft-lockup-during-fstrim
+++ a/fs/ocfs2/alloc.c
@@ -7654,8 +7654,10 @@ out_mutex:
 	 * main_bm related locks for avoiding the current IO starve, then go to
 	 * trim the next group
 	 */
-	if (ret >= 0 && group <= last_group)
+	if (ret >= 0 && group <= last_group) {
+		cond_resched();
 		goto next_group;
+	}
 out:
 	range->len = trimmed * sb->s_blocksize;
 	return ret;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 017/181] fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr
  2020-10-13 23:46 incoming Andrew Morton
                   ` (15 preceding siblings ...)
  2020-10-13 23:48 ` [patch 016/181] ocfs2: fix potential soft lockup during fstrim Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 018/181] fs_parse: mark fs_param_bad_value() as static Andrew Morton
                   ` (167 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, chuck.lever, fllinden, linux-mm, mm-commits, rdunlap,
	torvalds, viro

From: Randy Dunlap <rdunlap@infradead.org>
Subject: fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr

Fix kernel-doc warnings in fs/xattr.c:

../fs/xattr.c:251: warning: Function parameter or member 'dentry' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'name' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'value' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'size' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'flags' not described in '__vfs_setxattr_locked'
../fs/xattr.c:251: warning: Function parameter or member 'delegated_inode' not described in '__vfs_setxattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'dentry' not described in '__vfs_removexattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'name' not described in '__vfs_removexattr_locked'
../fs/xattr.c:458: warning: Function parameter or member 'delegated_inode' not described in '__vfs_removexattr_locked'
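
For reference, a minimal kernel-doc comment in the canonical format the
patch converts to (the function name is illustrative, not from
fs/xattr.c): the summary line uses "name - description" and each
parameter uses "@param:" with a colon rather than a dash.

	/**
	 * example_setxattr - set an example extended attribute
	 * @name: xattr name to set
	 * @value: value to store under @name
	 *
	 * Return: 0 on success or a negative errno on failure.
	 */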

Link: http://lkml.kernel.org/r/7a3dd5a2-5787-adf3-d525-c203f9910ec4@infradead.org
Fixes: 08b5d5014a27 ("xattr: break delegations in {set,remove}xattr")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Frank van der Linden <fllinden@amazon.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/xattr.c |   22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

--- a/fs/xattr.c~fs-xattrc-fix-kernel-doc-warnings-for-setxattr-removexattr
+++ a/fs/xattr.c
@@ -232,15 +232,15 @@ int __vfs_setxattr_noperm(struct dentry
 }
 
 /**
- * __vfs_setxattr_locked: set an extended attribute while holding the inode
+ * __vfs_setxattr_locked - set an extended attribute while holding the inode
  * lock
  *
- *  @dentry - object to perform setxattr on
- *  @name - xattr name to set
- *  @value - value to set @name to
- *  @size - size of @value
- *  @flags - flags to pass into filesystem operations
- *  @delegated_inode - on return, will contain an inode pointer that
+ *  @dentry: object to perform setxattr on
+ *  @name: xattr name to set
+ *  @value: value to set @name to
+ *  @size: size of @value
+ *  @flags: flags to pass into filesystem operations
+ *  @delegated_inode: on return, will contain an inode pointer that
  *  a delegation was broken on, NULL if none.
  */
 int
@@ -443,12 +443,12 @@ __vfs_removexattr(struct dentry *dentry,
 EXPORT_SYMBOL(__vfs_removexattr);
 
 /**
- * __vfs_removexattr_locked: set an extended attribute while holding the inode
+ * __vfs_removexattr_locked - set an extended attribute while holding the inode
  * lock
  *
- *  @dentry - object to perform setxattr on
- *  @name - name of xattr to remove
- *  @delegated_inode - on return, will contain an inode pointer that
+ *  @dentry: object to perform setxattr on
+ *  @name: name of xattr to remove
+ *  @delegated_inode: on return, will contain an inode pointer that
  *  a delegation was broken on, NULL if none.
  */
 int
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 018/181] fs_parse: mark fs_param_bad_value() as static
  2020-10-13 23:46 incoming Andrew Morton
                   ` (16 preceding siblings ...)
  2020-10-13 23:48 ` [patch 017/181] fs/xattr.c: fix kernel-doc warnings for setxattr & removexattr Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 019/181] mm/slab.c: clean code by removing redundant if condition Andrew Morton
                   ` (166 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, linux-mm, luojiaxing, mm-commits, torvalds

From: Luo Jiaxing <luojiaxing@huawei.com>
Subject: fs_parse: mark fs_param_bad_value() as static

We found the following warning when building the kernel with W=1:

fs/fs_parser.c:192:5: warning: no previous prototype for `fs_param_bad_value' [-Wmissing-prototypes]
int fs_param_bad_value(struct p_log *log, struct fs_parameter *param)
     ^
CC      drivers/usb/gadget/udc/snps_udc_core.o

No header file declares a prototype for this function, so mark it as
static.

Link: https://lkml.kernel.org/r/1601293463-25763-1-git-send-email-luojiaxing@huawei.com
Signed-off-by: Luo Jiaxing <luojiaxing@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/fs_parser.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/fs/fs_parser.c~fs_parse-mark-fs_param_bad_value-as-static
+++ a/fs/fs_parser.c
@@ -189,7 +189,7 @@ out:
 }
 EXPORT_SYMBOL(fs_lookup_param);
 
-int fs_param_bad_value(struct p_log *log, struct fs_parameter *param)
+static int fs_param_bad_value(struct p_log *log, struct fs_parameter *param)
 {
 	return inval_plog(log, "Bad value for '%s'", param->key);
 }
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 019/181] mm/slab.c: clean code by removing redundant if condition
  2020-10-13 23:46 incoming Andrew Morton
                   ` (17 preceding siblings ...)
  2020-10-13 23:48 ` [patch 018/181] fs_parse: mark fs_param_bad_value() as static Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 020/181] include/linux/slab.h: fix a typo error in comment Andrew Morton
                   ` (165 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, linux-mm, mateusznosek0, mm-commits,
	penberg, rientjes, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/slab.c: clean code by removing redundant if condition

The removed code was unnecessary and changed nothing in the flow: if
'kmem_cache_alloc_node' returns NULL, then 'freelist' is NULL, so
returning 'freelist' from the function in question is the same as
returning NULL.

Link: https://lkml.kernel.org/r/20200915230329.13002-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.c |    2 --
 1 file changed, 2 deletions(-)

--- a/mm/slab.c~mm-slabc-clean-code-by-removing-redundant-if-condition
+++ a/mm/slab.c
@@ -2305,8 +2305,6 @@ static void *alloc_slabmgmt(struct kmem_
 		/* Slab management obj is off-slab. */
 		freelist = kmem_cache_alloc_node(cachep->freelist_cache,
 					      local_flags, nodeid);
-		if (!freelist)
-			return NULL;
 	} else {
 		/* We will use last bytes at the slab for freelist */
 		freelist = addr + (PAGE_SIZE << cachep->gfporder) -
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 020/181] include/linux/slab.h: fix a typo error in comment
  2020-10-13 23:46 incoming Andrew Morton
                   ` (18 preceding siblings ...)
  2020-10-13 23:48 ` [patch 019/181] mm/slab.c: clean code by removing redundant if condition Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 021/181] mm/slub.c: branch optimization in free slowpath Andrew Morton
                   ` (164 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, jrdr.linux, linux-mm, mm-commits, tangjianqiang, torvalds,
	wyqt1985

From: tangjianqiang <wyqt1985@gmail.com>
Subject: include/linux/slab.h: fix a typo error in comment

Fix a typo in slab.h:
"allocagtor" -> "allocator"

Link: https://lkml.kernel.org/r/1600230053-24303-1-git-send-email-tangjianqiang@xiaomi.com
Signed-off-by: tangjianqiang <tangjianqiang@xiaomi.com>
Acked-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/slab.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/slab.h~include-linux-slabh-fix-a-typo-error-in-comment
+++ a/include/linux/slab.h
@@ -279,7 +279,7 @@ static inline void __check_heap_object(c
 #define KMALLOC_MAX_SIZE	(1UL << KMALLOC_SHIFT_MAX)
 /* Maximum size for which we actually use a slab cache */
 #define KMALLOC_MAX_CACHE_SIZE	(1UL << KMALLOC_SHIFT_HIGH)
-/* Maximum order allocatable via the slab allocagtor */
+/* Maximum order allocatable via the slab allocator */
 #define KMALLOC_MAX_ORDER	(KMALLOC_SHIFT_MAX - PAGE_SHIFT)
 
 /*
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 021/181] mm/slub.c: branch optimization in free slowpath
  2020-10-13 23:46 incoming Andrew Morton
                   ` (19 preceding siblings ...)
  2020-10-13 23:48 ` [patch 020/181] include/linux/slab.h: fix a typo error in comment Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 022/181] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc Andrew Morton
                   ` (163 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, cl, hewenliang4, hushiyuan, iamjoonsoo.kim, linux-mm,
	mm-commits, penberg, rientjes, torvalds, wuyun.wu

From: Abel Wu <wuyun.wu@huawei.com>
Subject: mm/slub.c: branch optimization in free slowpath

The two conditions are mutually exclusive and the gcc compiler will
optimize this into an if-else-like pattern.  Given that the majority of
free_slowpath is free_frozen, let's provide a hint to the compiler.

Tests (perf bench sched messaging -g 20 -l 400000, executed 10x
after reboot) were done, with the summarized results:

	un-patched	patched
max.	192.316		189.851
min.	187.267		186.252
avg.	189.154		188.086
stdev.	1.37		0.99
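
For illustration only, a self-contained sketch of branch hints (the
function and its condition are made up; in the kernel, likely() and
unlikely() expand to __builtin_expect() much like this):

	#define likely(x)	__builtin_expect(!!(x), 1)
	#define unlikely(x)	__builtin_expect(!!(x), 0)

	/* The compiler lays out the likely() arm as the
	 * fall-through (hot) path and moves the other arm
	 * out of line.
	 */
	static int process(int common_case)
	{
		if (likely(common_case))
			return 0;	/* hot path */
		return -1;		/* cold path */
	}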

Link: http://lkml.kernel.org/r/20200813101812.1617-1-wuyun.wu@huawei.com
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Acked-by: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |   23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

--- a/mm/slub.c~mm-slub-branch-optimization-in-free-slowpath
+++ a/mm/slub.c
@@ -3019,20 +3019,21 @@ static void __slab_free(struct kmem_cach
 
 	if (likely(!n)) {
 
-		/*
-		 * If we just froze the page then put it onto the
-		 * per cpu partial list.
-		 */
-		if (new.frozen && !was_frozen) {
+		if (likely(was_frozen)) {
+			/*
+			 * The list lock was not taken therefore no list
+			 * activity can be necessary.
+			 */
+			stat(s, FREE_FROZEN);
+		} else if (new.frozen) {
+			/*
+			 * If we just froze the page then put it onto the
+			 * per cpu partial list.
+			 */
 			put_cpu_partial(s, page, 1);
 			stat(s, CPU_PARTIAL_FREE);
 		}
-		/*
-		 * The list lock was not taken therefore no list
-		 * activity can be necessary.
-		 */
-		if (was_frozen)
-			stat(s, FREE_FROZEN);
+
 		return;
 	}
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 022/181] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc
  2020-10-13 23:46 incoming Andrew Morton
                   ` (20 preceding siblings ...)
  2020-10-13 23:48 ` [patch 021/181] mm/slub.c: branch optimization in free slowpath Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 023/181] mm/slub: make add_full() condition more explicit Andrew Morton
                   ` (162 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, cl, hewenliang4, hushiyuan, iamjoonsoo.kim, linux-mm,
	mm-commits, penberg, rientjes, torvalds, wuyun.wu

From: Abel Wu <wuyun.wu@huawei.com>
Subject: mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc

The ALLOC_SLOWPATH statistic is currently missing for bulk allocations.
Fix it by counting the statistic in the allocation slow path itself.

Link: http://lkml.kernel.org/r/20200811022427.1363-1-wuyun.wu@huawei.com
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Reviewed-by: Pekka Enberg <penberg@kernel.org>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Hewenliang <hewenliang4@huawei.com>
Cc: Hu Shiyuan <hushiyuan@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-fix-missing-alloc_slowpath-stat-when-bulk-alloc
+++ a/mm/slub.c
@@ -2661,6 +2661,8 @@ static void *___slab_alloc(struct kmem_c
 	void *freelist;
 	struct page *page;
 
+	stat(s, ALLOC_SLOWPATH);
+
 	page = c->page;
 	if (!page) {
 		/*
@@ -2850,7 +2852,6 @@ redo:
 	page = c->page;
 	if (unlikely(!object || !node_match(page, node))) {
 		object = __slab_alloc(s, gfpflags, node, addr, c);
-		stat(s, ALLOC_SLOWPATH);
 	} else {
 		void *next_object = get_freepointer_safe(s, object);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 023/181] mm/slub: make add_full() condition more explicit
  2020-10-13 23:46 incoming Andrew Morton
                   ` (21 preceding siblings ...)
  2020-10-13 23:48 ` [patch 022/181] mm/slub: fix missing ALLOC_SLOWPATH stat when bulk alloc Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 024/181] mm/kmemleak: rely on rcu for task stack scanning Andrew Morton
                   ` (161 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, cl, iamjoonsoo.kim, linux-mm, liu.xiang6, mm-commits,
	penberg, rientjes, torvalds, wuyun.wu

From: Abel Wu <wuyun.wu@huawei.com>
Subject: mm/slub: make add_full() condition more explicit

Commit a4d3f8916c65 ("slub: remove useless kmem_cache_debug() before
remove_full()") is incomplete, as it didn't handle the add_full() part.

This patch checks for SLAB_STORE_USER instead of kmem_cache_debug(), since
that should be the only context in which we need the list_lock for
add_full().

Link: https://lkml.kernel.org/r/20200811020240.1231-1-wuyun.wu@huawei.com
Signed-off-by: Abel Wu <wuyun.wu@huawei.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Liu Xiang <liu.xiang6@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slub.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/slub.c~mm-slub-make-add_full-condition-more-explicit
+++ a/mm/slub.c
@@ -2245,7 +2245,8 @@ redo:
 		}
 	} else {
 		m = M_FULL;
-		if (kmem_cache_debug(s) && !lock) {
+#ifdef CONFIG_SLUB_DEBUG
+		if ((s->flags & SLAB_STORE_USER) && !lock) {
 			lock = 1;
 			/*
 			 * This also ensures that the scanning of full
@@ -2254,6 +2255,7 @@ redo:
 			 */
 			spin_lock(&n->list_lock);
 		}
+#endif
 	}
 
 	if (l != m) {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 024/181] mm/kmemleak: rely on rcu for task stack scanning
  2020-10-13 23:46 incoming Andrew Morton
                   ` (22 preceding siblings ...)
  2020-10-13 23:48 ` [patch 023/181] mm/slub: make add_full() condition more explicit Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 025/181] mm,kmemleak-test.c: move kmemleak-test.c to samples dir Andrew Morton
                   ` (160 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, cai, catalin.marinas, dave, dbueso, linux-mm, mm-commits,
	oleg, torvalds

From: Davidlohr Bueso <dave@stgolabs.net>
Subject: mm/kmemleak: rely on rcu for task stack scanning

kmemleak_scan() currently relies on the big tasklist_lock hammer to
stabilize iterating through the tasklist.  Instead, this patch proposes
simply using rcu along with the rcu-safe for_each_process_thread flavor
(without changing scan semantics), which doesn't make use of
next_thread/p->thread_group and thus cannot race with exit.  Furthermore,
any races with fork() and not seeing the new child should be benign as
it's not running yet and can also be detected by the next scan.

Avoiding the tasklist_lock could prove beneficial for performance,
considering the scan operation is done periodically.  I have seen
improvements of around 30% when doing similar replacements on very
pathological microbenchmarks (i.e. stressing get/setpriority(2)).

However, my main motivation is that it's one less user of the global
lock, something that Linus has long wanted to see gone eventually (if
ever), even if the traditional fairness issues have been dealt with now
by qrwlocks.  Of course this is a very long way ahead.  This patch also
kills another user of the deprecated tsk->thread_group.

Link: https://lkml.kernel.org/r/20200820203902.11308-1-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kmemleak.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/kmemleak.c~mm-kmemleak-rely-on-rcu-for-task-stack-scanning
+++ a/mm/kmemleak.c
@@ -1471,15 +1471,15 @@ static void kmemleak_scan(void)
 	if (kmemleak_stack_scan) {
 		struct task_struct *p, *g;
 
-		read_lock(&tasklist_lock);
-		do_each_thread(g, p) {
+		rcu_read_lock();
+		for_each_process_thread(g, p) {
 			void *stack = try_get_task_stack(p);
 			if (stack) {
 				scan_block(stack, stack + THREAD_SIZE, NULL);
 				put_task_stack(p);
 			}
-		} while_each_thread(g, p);
-		read_unlock(&tasklist_lock);
+		}
+		rcu_read_unlock();
 	}
 
 	/*
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 025/181] mm,kmemleak-test.c: move kmemleak-test.c to samples dir
  2020-10-13 23:46 incoming Andrew Morton
                   ` (23 preceding siblings ...)
  2020-10-13 23:48 ` [patch 024/181] mm/kmemleak: rely on rcu for task stack scanning Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:48 ` [patch 026/181] x86/numa: cleanup configuration dependent command-line options Andrew Morton
                   ` (159 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: akpm, catalin.marinas, corbet, davem, dhowells, divya.indi,
	jpoimboe, linux-mm, mchehab+huawei, miguel.ojeda.sandonis,
	mm-commits, robh, rostedt, sam, sh_def, tomas.winkler, torvalds,
	yamada.masahiro

From: Hui Su <sh_def@163.com>
Subject: mm,kmemleak-test.c: move kmemleak-test.c to samples dir

kmemleak-test.c is just a kmemleak test module and cannot be used as a
built-in part of the kernel.  Thus, it should not live in the mm
directory; move kmemleak-test.c to samples/kmemleak/kmemleak-test.c.
Also fix the spelling of "built-in" along the way.

Link: https://lkml.kernel.org/r/20200925183729.GA172837@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Rob Herring <robh@kernel.org>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Divya Indi <divya.indi@oracle.com>
Cc: Tomas Winkler <tomas.winkler@intel.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kmemleak.rst |    2 
 MAINTAINERS                          |    2 
 mm/Makefile                          |    1 
 mm/kmemleak-test.c                   |   99 -------------------------
 samples/Makefile                     |    1 
 samples/kmemleak/Makefile            |    3 
 samples/kmemleak/kmemleak-test.c     |   99 +++++++++++++++++++++++++
 7 files changed, 105 insertions(+), 102 deletions(-)

--- a/Documentation/dev-tools/kmemleak.rst~mmkmemleak-testc-move-kmemleak-testc-to-samples-dir
+++ a/Documentation/dev-tools/kmemleak.rst
@@ -229,7 +229,7 @@ Testing with kmemleak-test
 
 To check if you have all set up to use kmemleak, you can use the kmemleak-test
 module, a module that deliberately leaks memory. Set CONFIG_DEBUG_KMEMLEAK_TEST
-as module (it can't be used as bult-in) and boot the kernel with kmemleak
+as module (it can't be used as built-in) and boot the kernel with kmemleak
 enabled. Load the module and perform a scan with::
 
         # modprobe kmemleak-test
--- a/MAINTAINERS~mmkmemleak-testc-move-kmemleak-testc-to-samples-dir
+++ a/MAINTAINERS
@@ -9727,8 +9727,8 @@ M:	Catalin Marinas <catalin.marinas@arm.
 S:	Maintained
 F:	Documentation/dev-tools/kmemleak.rst
 F:	include/linux/kmemleak.h
-F:	mm/kmemleak-test.c
 F:	mm/kmemleak.c
+F:	samples/kmemleak/kmemleak-test.c
 
 KMOD KERNEL MODULE LOADER - USERMODE HELPER
 M:	Luis Chamberlain <mcgrof@kernel.org>
--- a/mm/kmemleak-test.c
+++ /dev/null
@@ -1,99 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0-only
-/*
- * mm/kmemleak-test.c
- *
- * Copyright (C) 2008 ARM Limited
- * Written by Catalin Marinas <catalin.marinas@arm.com>
- */
-
-#define pr_fmt(fmt) "kmemleak: " fmt
-
-#include <linux/init.h>
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/slab.h>
-#include <linux/vmalloc.h>
-#include <linux/list.h>
-#include <linux/percpu.h>
-#include <linux/fdtable.h>
-
-#include <linux/kmemleak.h>
-
-struct test_node {
-	long header[25];
-	struct list_head list;
-	long footer[25];
-};
-
-static LIST_HEAD(test_list);
-static DEFINE_PER_CPU(void *, kmemleak_test_pointer);
-
-/*
- * Some very simple testing. This function needs to be extended for
- * proper testing.
- */
-static int __init kmemleak_test_init(void)
-{
-	struct test_node *elem;
-	int i;
-
-	pr_info("Kmemleak testing\n");
-
-	/* make some orphan objects */
-	pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
-	pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
-	pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
-	pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
-	pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
-	pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
-	pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
-	pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
-#ifndef CONFIG_MODULES
-	pr_info("kmem_cache_alloc(files_cachep) = %p\n",
-		kmem_cache_alloc(files_cachep, GFP_KERNEL));
-	pr_info("kmem_cache_alloc(files_cachep) = %p\n",
-		kmem_cache_alloc(files_cachep, GFP_KERNEL));
-#endif
-	pr_info("vmalloc(64) = %p\n", vmalloc(64));
-	pr_info("vmalloc(64) = %p\n", vmalloc(64));
-	pr_info("vmalloc(64) = %p\n", vmalloc(64));
-	pr_info("vmalloc(64) = %p\n", vmalloc(64));
-	pr_info("vmalloc(64) = %p\n", vmalloc(64));
-
-	/*
-	 * Add elements to a list. They should only appear as orphan
-	 * after the module is removed.
-	 */
-	for (i = 0; i < 10; i++) {
-		elem = kzalloc(sizeof(*elem), GFP_KERNEL);
-		pr_info("kzalloc(sizeof(*elem)) = %p\n", elem);
-		if (!elem)
-			return -ENOMEM;
-		INIT_LIST_HEAD(&elem->list);
-		list_add_tail(&elem->list, &test_list);
-	}
-
-	for_each_possible_cpu(i) {
-		per_cpu(kmemleak_test_pointer, i) = kmalloc(129, GFP_KERNEL);
-		pr_info("kmalloc(129) = %p\n",
-			per_cpu(kmemleak_test_pointer, i));
-	}
-
-	return 0;
-}
-module_init(kmemleak_test_init);
-
-static void __exit kmemleak_test_exit(void)
-{
-	struct test_node *elem, *tmp;
-
-	/*
-	 * Remove the list elements without actually freeing the
-	 * memory.
-	 */
-	list_for_each_entry_safe(elem, tmp, &test_list, list)
-		list_del(&elem->list);
-}
-module_exit(kmemleak_test_exit);
-
-MODULE_LICENSE("GPL");
--- a/mm/Makefile~mmkmemleak-testc-move-kmemleak-testc-to-samples-dir
+++ a/mm/Makefile
@@ -94,7 +94,6 @@ obj-$(CONFIG_GUP_BENCHMARK) += gup_bench
 obj-$(CONFIG_MEMORY_FAILURE) += memory-failure.o
 obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
 obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
-obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
 obj-$(CONFIG_DEBUG_VM_PGTABLE) += debug_vm_pgtable.o
 obj-$(CONFIG_PAGE_OWNER) += page_owner.o
--- /dev/null
+++ a/samples/kmemleak/kmemleak-test.c
@@ -0,0 +1,99 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * samples/kmemleak/kmemleak-test.c
+ *
+ * Copyright (C) 2008 ARM Limited
+ * Written by Catalin Marinas <catalin.marinas@arm.com>
+ */
+
+#define pr_fmt(fmt) "kmemleak: " fmt
+
+#include <linux/init.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <linux/list.h>
+#include <linux/percpu.h>
+#include <linux/fdtable.h>
+
+#include <linux/kmemleak.h>
+
+struct test_node {
+	long header[25];
+	struct list_head list;
+	long footer[25];
+};
+
+static LIST_HEAD(test_list);
+static DEFINE_PER_CPU(void *, kmemleak_test_pointer);
+
+/*
+ * Some very simple testing. This function needs to be extended for
+ * proper testing.
+ */
+static int __init kmemleak_test_init(void)
+{
+	struct test_node *elem;
+	int i;
+
+	pr_info("Kmemleak testing\n");
+
+	/* make some orphan objects */
+	pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
+	pr_info("kmalloc(32) = %p\n", kmalloc(32, GFP_KERNEL));
+	pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
+	pr_info("kmalloc(1024) = %p\n", kmalloc(1024, GFP_KERNEL));
+	pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
+	pr_info("kmalloc(2048) = %p\n", kmalloc(2048, GFP_KERNEL));
+	pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
+	pr_info("kmalloc(4096) = %p\n", kmalloc(4096, GFP_KERNEL));
+#ifndef CONFIG_MODULES
+	pr_info("kmem_cache_alloc(files_cachep) = %p\n",
+		kmem_cache_alloc(files_cachep, GFP_KERNEL));
+	pr_info("kmem_cache_alloc(files_cachep) = %p\n",
+		kmem_cache_alloc(files_cachep, GFP_KERNEL));
+#endif
+	pr_info("vmalloc(64) = %p\n", vmalloc(64));
+	pr_info("vmalloc(64) = %p\n", vmalloc(64));
+	pr_info("vmalloc(64) = %p\n", vmalloc(64));
+	pr_info("vmalloc(64) = %p\n", vmalloc(64));
+	pr_info("vmalloc(64) = %p\n", vmalloc(64));
+
+	/*
+	 * Add elements to a list. They should only appear as orphan
+	 * after the module is removed.
+	 */
+	for (i = 0; i < 10; i++) {
+		elem = kzalloc(sizeof(*elem), GFP_KERNEL);
+		pr_info("kzalloc(sizeof(*elem)) = %p\n", elem);
+		if (!elem)
+			return -ENOMEM;
+		INIT_LIST_HEAD(&elem->list);
+		list_add_tail(&elem->list, &test_list);
+	}
+
+	for_each_possible_cpu(i) {
+		per_cpu(kmemleak_test_pointer, i) = kmalloc(129, GFP_KERNEL);
+		pr_info("kmalloc(129) = %p\n",
+			per_cpu(kmemleak_test_pointer, i));
+	}
+
+	return 0;
+}
+module_init(kmemleak_test_init);
+
+static void __exit kmemleak_test_exit(void)
+{
+	struct test_node *elem, *tmp;
+
+	/*
+	 * Remove the list elements without actually freeing the
+	 * memory.
+	 */
+	list_for_each_entry_safe(elem, tmp, &test_list, list)
+		list_del(&elem->list);
+}
+module_exit(kmemleak_test_exit);
+
+MODULE_LICENSE("GPL");
--- /dev/null
+++ a/samples/kmemleak/Makefile
@@ -0,0 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
--- a/samples/Makefile~mmkmemleak-testc-move-kmemleak-testc-to-samples-dir
+++ a/samples/Makefile
@@ -28,3 +28,4 @@ subdir-$(CONFIG_SAMPLE_VFS)		+= vfs
 obj-$(CONFIG_SAMPLE_INTEL_MEI)		+= mei/
 subdir-$(CONFIG_SAMPLE_WATCHDOG)	+= watchdog
 subdir-$(CONFIG_SAMPLE_WATCH_QUEUE)	+= watch_queue
+obj-$(CONFIG_DEBUG_KMEMLEAK_TEST)	+= kmemleak/
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 026/181] x86/numa: cleanup configuration dependent command-line options
  2020-10-13 23:46 incoming Andrew Morton
                   ` (24 preceding siblings ...)
  2020-10-13 23:48 ` [patch 025/181] mm,kmemleak-test.c: move kmemleak-test.c to samples dir Andrew Morton
@ 2020-10-13 23:48 ` Andrew Morton
  2020-10-13 23:49 ` [patch 027/181] x86/numa: add 'nohmat' option Andrew Morton
                   ` (158 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:48 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: x86/numa: cleanup configuration dependent command-line options

Patch series "device-dax: Support sub-dividing soft-reserved ranges", v5.

The device-dax facility allows an address range to be directly mapped
through a chardev, or optionally hotplugged to the core kernel page
allocator as System-RAM.  It is the mechanism for converting persistent
memory (pmem) for use as another volatile memory pool, i.e. the current
Memory Tiering hot topic on linux-mm.

In the case of pmem the nvdimm-namespace-label mechanism can sub-divide
it, but that labeling mechanism is not available / applicable to
soft-reserved ("EFI specific purpose") memory [3].  This series provides a
sysfs-mechanism for the daxctl utility to enable provisioning of
volatile-soft-reserved memory ranges.

The motivations for this facility are:

1/ Allow performance differentiated memory ranges to be split between
   kernel-managed and directly-accessed use cases.

2/ Allow physical memory to be provisioned along performance relevant
   address boundaries. For example, divide a memory-side cache [4] along
   cache-color boundaries.

3/ Parcel out soft-reserved memory to VMs using device-dax as a security
   / permissions boundary [5]. Specifically I have seen people (ab)using
   memmap=nn!ss (mark System-RAM as Persistent Memory) just to get the
   device-dax interface on custom address ranges. A follow-on for the VM
   use case is to teach device-dax to dynamically allocate 'struct page' at
   runtime to reduce the duplication of 'struct page' space in both the
   guest and the host kernel for the same physical pages.

[2]: http://lore.kernel.org/r/20200713160837.13774-11-joao.m.martins@oracle.com
[3]: http://lore.kernel.org/r/157309097008.1579826.12818463304589384434.stgit@dwillia2-desk3.amr.corp.intel.com
[4]: http://lore.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
[5]: http://lore.kernel.org/r/20200110190313.17144-1-joao.m.martins@oracle.com


This patch (of 23):

In preparation for adding a new numa= option, clean up the existing
ones to avoid ifdefs in numa_setup(), and provide feedback when the
numa=fake= option is invalid due to the kernel configuration.  The same
does not need to be done for numa=noacpi, since the capability is
already hard disabled at compile-time.
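
A hedged sketch of the early_param() shape this cleanup moves toward
(the names are invented; the real handlers are in the diff below): each
option prefix dispatches unconditionally, and a static-inline stub
returning -EINVAL under the disabled config replaces the ifdef.

	static int __init example_setup(char *opt)
	{
		if (!opt)
			return -EINVAL;
		/* stub returns -EINVAL when the config is off */
		if (!strncmp(opt, "fake=", 5))
			return example_emu_cmdline(opt + 5);
		return 0;
	}
	early_param("example", example_setup);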

Link: https://lkml.kernel.org/r/160106109960.30709.7379926726669669398.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/159643094279.4062302.17779410714418721328.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/159643094925.4062302.14979872973043772305.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/include/asm/numa.h  |    8 +++++++-
 arch/x86/mm/numa.c           |    8 ++------
 arch/x86/mm/numa_emulation.c |    3 ++-
 arch/x86/xen/enlighten_pv.c  |    2 +-
 drivers/acpi/numa/srat.c     |    9 +++++++--
 include/acpi/acpi_numa.h     |    6 +++++-
 6 files changed, 24 insertions(+), 12 deletions(-)

--- a/arch/x86/include/asm/numa.h~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/arch/x86/include/asm/numa.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_NUMA_H
 
 #include <linux/nodemask.h>
+#include <linux/errno.h>
 
 #include <asm/topology.h>
 #include <asm/apicdef.h>
@@ -77,7 +78,12 @@ void debug_cpumask_set_cpu(int cpu, int
 #ifdef CONFIG_NUMA_EMU
 #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
-void numa_emu_cmdline(char *);
+int numa_emu_cmdline(char *str);
+#else /* CONFIG_NUMA_EMU */
+static inline int numa_emu_cmdline(char *str)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_NUMA_EMU */
 
 #endif	/* _ASM_X86_NUMA_H */
--- a/arch/x86/mm/numa.c~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/arch/x86/mm/numa.c
@@ -37,14 +37,10 @@ static __init int numa_setup(char *opt)
 		return -EINVAL;
 	if (!strncmp(opt, "off", 3))
 		numa_off = 1;
-#ifdef CONFIG_NUMA_EMU
 	if (!strncmp(opt, "fake=", 5))
-		numa_emu_cmdline(opt + 5);
-#endif
-#ifdef CONFIG_ACPI_NUMA
+		return numa_emu_cmdline(opt + 5);
 	if (!strncmp(opt, "noacpi", 6))
-		acpi_numa = -1;
-#endif
+		disable_srat();
 	return 0;
 }
 early_param("numa", numa_setup);
--- a/arch/x86/mm/numa_emulation.c~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/arch/x86/mm/numa_emulation.c
@@ -13,9 +13,10 @@
 static int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
 
-void __init numa_emu_cmdline(char *str)
+int __init numa_emu_cmdline(char *str)
 {
 	emu_cmdline = str;
+	return 0;
 }
 
 static int __init emu_find_memblk_by_nid(int nid, const struct numa_meminfo *mi)
--- a/arch/x86/xen/enlighten_pv.c~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/arch/x86/xen/enlighten_pv.c
@@ -1300,7 +1300,7 @@ asmlinkage __visible void __init xen_sta
 	 * any NUMA information the kernel tries to get from ACPI will
 	 * be meaningless.  Prevent it from trying.
 	 */
-	acpi_numa = -1;
+	disable_srat();
 #endif
 	WARN_ON(xen_cpuhp_setup(xen_cpu_up_prepare_pv, xen_cpu_dead_pv));
 
--- a/drivers/acpi/numa/srat.c~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/drivers/acpi/numa/srat.c
@@ -27,7 +27,12 @@ static int node_to_pxm_map[MAX_NUMNODES]
 			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
 
 unsigned char acpi_srat_revision __initdata;
-int acpi_numa __initdata;
+static int acpi_numa __initdata;
+
+void __init disable_srat(void)
+{
+	acpi_numa = -1;
+}
 
 int pxm_to_node(int pxm)
 {
@@ -163,7 +168,7 @@ static int __init slit_valid(struct acpi
 void __init bad_srat(void)
 {
 	pr_err("SRAT: SRAT not used.\n");
-	acpi_numa = -1;
+	disable_srat();
 }
 
 int __init srat_disabled(void)
--- a/include/acpi/acpi_numa.h~x86-numa-cleanup-configuration-dependent-command-line-options
+++ a/include/acpi/acpi_numa.h
@@ -17,10 +17,14 @@ extern int pxm_to_node(int);
 extern int node_to_pxm(int);
 extern int acpi_map_pxm_to_node(int);
 extern unsigned char acpi_srat_revision;
-extern int acpi_numa __initdata;
+extern void disable_srat(void);
 
 extern void bad_srat(void);
 extern int srat_disabled(void);
 
+#else				/* CONFIG_ACPI_NUMA */
+static inline void disable_srat(void)
+{
+}
 #endif				/* CONFIG_ACPI_NUMA */
 #endif				/* __ACP_NUMA_H */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 027/181] x86/numa: add 'nohmat' option
  2020-10-13 23:46 incoming Andrew Morton
                   ` (25 preceding siblings ...)
  2020-10-13 23:48 ` [patch 026/181] x86/numa: cleanup configuration dependent command-line options Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 028/181] efi/fake_mem: arrange for a resource entry per efi_fake_mem instance Andrew Morton
                   ` (157 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: x86/numa: add 'nohmat' option

Disable parsing of the HMAT for debugging, to work around broken
platform instances, or for cases where it is otherwise not wanted.

[rdunlap@infradead.org: fix build when CONFIG_ACPI is not set]
  Link: https://lkml.kernel.org/r/70e5ee34-9809-a997-7b49-499e4be61307@infradead.org
Link: https://lkml.kernel.org/r/159643095540.4062302.732962081968036212.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/x86/x86_64/boot-options.rst |    4 ++++
 arch/x86/mm/numa.c                        |    2 ++
 drivers/acpi/numa/hmat.c                  |    8 +++++++-
 include/acpi/acpi_numa.h                  |    8 ++++++++
 include/linux/acpi.h                      |    2 ++
 5 files changed, 23 insertions(+), 1 deletion(-)

--- a/arch/x86/mm/numa.c~x86-numa-add-nohmat-option
+++ a/arch/x86/mm/numa.c
@@ -41,6 +41,8 @@ static __init int numa_setup(char *opt)
 		return numa_emu_cmdline(opt + 5);
 	if (!strncmp(opt, "noacpi", 6))
 		disable_srat();
+	if (!strncmp(opt, "nohmat", 6))
+		disable_hmat();
 	return 0;
 }
 early_param("numa", numa_setup);
--- a/Documentation/x86/x86_64/boot-options.rst~x86-numa-add-nohmat-option
+++ a/Documentation/x86/x86_64/boot-options.rst
@@ -173,6 +173,10 @@ NUMA
   numa=noacpi
     Don't parse the SRAT table for NUMA setup
 
+  numa=nohmat
+    Don't parse the HMAT table for NUMA setup, or soft-reserved memory
+    partitioning.
+
   numa=fake=<size>[MG]
     If given as a memory unit, fills all system RAM with nodes of
     size interleaved over physical nodes.
--- a/drivers/acpi/numa/hmat.c~x86-numa-add-nohmat-option
+++ a/drivers/acpi/numa/hmat.c
@@ -26,6 +26,12 @@
 #include <linux/sysfs.h>
 
 static u8 hmat_revision;
+static int hmat_disable __initdata;
+
+void __init disable_hmat(void)
+{
+	hmat_disable = 1;
+}
 
 static LIST_HEAD(targets);
 static LIST_HEAD(initiators);
@@ -814,7 +820,7 @@ static __init int hmat_init(void)
 	enum acpi_hmat_type i;
 	acpi_status status;
 
-	if (srat_disabled())
+	if (srat_disabled() || hmat_disable)
 		return 0;
 
 	status = acpi_get_table(ACPI_SIG_SRAT, 0, &tbl);
--- a/include/acpi/acpi_numa.h~x86-numa-add-nohmat-option
+++ a/include/acpi/acpi_numa.h
@@ -27,4 +27,12 @@ static inline void disable_srat(void)
 {
 }
 #endif				/* CONFIG_ACPI_NUMA */
+
+#ifdef CONFIG_ACPI_HMAT
+extern void disable_hmat(void);
+#else				/* CONFIG_ACPI_HMAT */
+static inline void disable_hmat(void)
+{
+}
+#endif				/* CONFIG_ACPI_HMAT */
 #endif				/* __ACP_NUMA_H */
--- a/include/linux/acpi.h~x86-numa-add-nohmat-option
+++ a/include/linux/acpi.h
@@ -709,6 +709,8 @@ static inline u64 acpi_arch_get_root_poi
 #define ACPI_HANDLE_FWNODE(fwnode)	(NULL)
 #define ACPI_DEVICE_CLASS(_cls, _msk)	.cls = (0), .cls_msk = (0),
 
+#include <acpi/acpi_numa.h>
+
 struct fwnode_handle;
 
 static inline bool acpi_dev_found(const char *hid)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 028/181] efi/fake_mem: arrange for a resource entry per efi_fake_mem instance
  2020-10-13 23:46 incoming Andrew Morton
                   ` (26 preceding siblings ...)
  2020-10-13 23:49 ` [patch 027/181] x86/numa: add 'nohmat' option Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 029/181] ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device Andrew Morton
                   ` (156 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: efi/fake_mem: arrange for a resource entry per efi_fake_mem instance

In preparation for attaching a platform device per iomem resource, teach
the efi_fake_mem code to create an e820 entry per instance.  As is
already done for E820_TYPE_PRAM, bypass resource merging when the e820
map is sanitized.
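
For example, two adjacent efi_fake_mem ranges that both become
E820_TYPE_SOFT_RESERVED would previously have been coalesced into a
single e820 entry; with the e820_nomerge() check below, each survives as
its own entry and therefore yields its own resource.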

Link: https://lkml.kernel.org/r/159643096068.4062302.11590041070221681669.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/e820.c              |   16 +++++++++++++++-
 drivers/firmware/efi/x86_fake_mem.c |   12 +++++++++---
 2 files changed, 24 insertions(+), 4 deletions(-)

--- a/arch/x86/kernel/e820.c~efi-fake_mem-arrange-for-a-resource-entry-per-efi_fake_mem-instance
+++ a/arch/x86/kernel/e820.c
@@ -305,6 +305,20 @@ static int __init cpcompare(const void *
 	return (ap->addr != ap->entry->addr) - (bp->addr != bp->entry->addr);
 }
 
+static bool e820_nomerge(enum e820_type type)
+{
+	/*
+	 * These types may indicate distinct platform ranges aligned to
+	 * numa node, protection domain, performance domain, or other
+	 * boundaries. Do not merge them.
+	 */
+	if (type == E820_TYPE_PRAM)
+		return true;
+	if (type == E820_TYPE_SOFT_RESERVED)
+		return true;
+	return false;
+}
+
 int __init e820__update_table(struct e820_table *table)
 {
 	struct e820_entry *entries = table->entries;
@@ -380,7 +394,7 @@ int __init e820__update_table(struct e82
 		}
 
 		/* Continue building up new map based on this information: */
-		if (current_type != last_type || current_type == E820_TYPE_PRAM) {
+		if (current_type != last_type || e820_nomerge(current_type)) {
 			if (last_type != 0)	 {
 				new_entries[new_nr_entries].size = change_point[chg_idx]->addr - last_addr;
 				/* Move forward only if the new size was non-zero: */
--- a/drivers/firmware/efi/x86_fake_mem.c~efi-fake_mem-arrange-for-a-resource-entry-per-efi_fake_mem-instance
+++ a/drivers/firmware/efi/x86_fake_mem.c
@@ -38,7 +38,7 @@ void __init efi_fake_memmap_early(void)
 		m_start = mem->range.start;
 		m_end = mem->range.end;
 		for_each_efi_memory_desc(md) {
-			u64 start, end;
+			u64 start, end, size;
 
 			if (md->type != EFI_CONVENTIONAL_MEMORY)
 				continue;
@@ -58,11 +58,17 @@ void __init efi_fake_memmap_early(void)
 			 */
 			start = max(start, m_start);
 			end = min(end, m_end);
+			size = end - start + 1;
 
 			if (end <= start)
 				continue;
-			e820__range_update(start, end - start + 1, E820_TYPE_RAM,
-					E820_TYPE_SOFT_RESERVED);
+
+			/*
+			 * Ensure each efi_fake_mem instance results in
+			 * a unique e820 resource
+			 */
+			e820__range_remove(start, size, E820_TYPE_RAM, 1);
+			e820__range_add(start, size, E820_TYPE_SOFT_RESERVED);
 			e820__update_table(e820_table);
 		}
 	}
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 029/181] ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device
  2020-10-13 23:46 incoming Andrew Morton
                   ` (27 preceding siblings ...)
  2020-10-13 23:49 ` [patch 028/181] efi/fake_mem: arrange for a resource entry per efi_fake_mem instance Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 030/181] resource: report parent to walk_iomem_res_desc() callback Andrew Morton
                   ` (155 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device

In preparation for exposing "Soft Reserved" memory ranges without an HMAT,
move the hmem device registration to its own compilation unit and make the
implementation generic.

The generic implementation drops the use of acpi_map_pxm_to_online_node(),
which was translating ACPI proximity domain values, and instead relies on
numa_map_to_online_node() to determine the numa node for the device.
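
The net effect is that the ACPI proximity-domain translation stays with
the caller, and the generic entry point (declared in the linux/dax.h
hunk below) reduces to:

	void hmem_register_device(int target_nid, struct resource *r);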

[joao.m.martins@oracle.com: CONFIG_DEV_DAX_HMEM_DEVICES should depend on CONFIG_DAX=y]
  Link: https://lkml.kernel.org/r/8f34727f-ec2d-9395-cb18-969ec8a5d0d4@oracle.com
Link: https://lkml.kernel.org/r/159643096584.4062302.5035370788475153738.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lore.kernel.org/r/158318761484.2216124.2049322072599482736.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/acpi/numa/hmat.c  |   68 +++---------------------------------
 drivers/dax/Kconfig       |    4 ++
 drivers/dax/Makefile      |    3 -
 drivers/dax/hmem.c        |   56 -----------------------------
 drivers/dax/hmem/Makefile |    5 ++
 drivers/dax/hmem/device.c |   65 ++++++++++++++++++++++++++++++++++
 drivers/dax/hmem/hmem.c   |   56 +++++++++++++++++++++++++++++
 include/linux/dax.h       |    8 ++++
 8 files changed, 145 insertions(+), 120 deletions(-)

--- a/drivers/acpi/numa/hmat.c~acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device
+++ a/drivers/acpi/numa/hmat.c
@@ -24,6 +24,7 @@
 #include <linux/mutex.h>
 #include <linux/node.h>
 #include <linux/sysfs.h>
+#include <linux/dax.h>
 
 static u8 hmat_revision;
 static int hmat_disable __initdata;
@@ -640,66 +641,6 @@ static void hmat_register_target_perf(st
 	node_set_perf_attrs(mem_nid, &target->hmem_attrs, 0);
 }
 
-static void hmat_register_target_device(struct memory_target *target,
-		struct resource *r)
-{
-	/* define a clean / non-busy resource for the platform device */
-	struct resource res = {
-		.start = r->start,
-		.end = r->end,
-		.flags = IORESOURCE_MEM,
-	};
-	struct platform_device *pdev;
-	struct memregion_info info;
-	int rc, id;
-
-	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
-			IORES_DESC_SOFT_RESERVED);
-	if (rc != REGION_INTERSECTS)
-		return;
-
-	id = memregion_alloc(GFP_KERNEL);
-	if (id < 0) {
-		pr_err("memregion allocation failure for %pr\n", &res);
-		return;
-	}
-
-	pdev = platform_device_alloc("hmem", id);
-	if (!pdev) {
-		pr_err("hmem device allocation failure for %pr\n", &res);
-		goto out_pdev;
-	}
-
-	pdev->dev.numa_node = acpi_map_pxm_to_online_node(target->memory_pxm);
-	info = (struct memregion_info) {
-		.target_node = acpi_map_pxm_to_node(target->memory_pxm),
-	};
-	rc = platform_device_add_data(pdev, &info, sizeof(info));
-	if (rc < 0) {
-		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
-		goto out_pdev;
-	}
-
-	rc = platform_device_add_resources(pdev, &res, 1);
-	if (rc < 0) {
-		pr_err("hmem resource allocation failure for %pr\n", &res);
-		goto out_resource;
-	}
-
-	rc = platform_device_add(pdev);
-	if (rc < 0) {
-		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
-		goto out_resource;
-	}
-
-	return;
-
-out_resource:
-	put_device(&pdev->dev);
-out_pdev:
-	memregion_free(id);
-}
-
 static void hmat_register_target_devices(struct memory_target *target)
 {
 	struct resource *res;
@@ -711,8 +652,11 @@ static void hmat_register_target_devices
 	if (!IS_ENABLED(CONFIG_DEV_DAX_HMEM))
 		return;
 
-	for (res = target->memregions.child; res; res = res->sibling)
-		hmat_register_target_device(target, res);
+	for (res = target->memregions.child; res; res = res->sibling) {
+		int target_nid = acpi_map_pxm_to_node(target->memory_pxm);
+
+		hmem_register_device(target_nid, res);
+	}
 }
 
 static void hmat_register_target(struct memory_target *target)
--- a/drivers/dax/hmem.c
+++ /dev/null
@@ -1,56 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-#include <linux/platform_device.h>
-#include <linux/memregion.h>
-#include <linux/module.h>
-#include <linux/pfn_t.h>
-#include "bus.h"
-
-static int dax_hmem_probe(struct platform_device *pdev)
-{
-	struct device *dev = &pdev->dev;
-	struct dev_pagemap pgmap = { };
-	struct dax_region *dax_region;
-	struct memregion_info *mri;
-	struct dev_dax *dev_dax;
-	struct resource *res;
-
-	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
-	if (!res)
-		return -ENOMEM;
-
-	mri = dev->platform_data;
-	memcpy(&pgmap.res, res, sizeof(*res));
-
-	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
-			PMD_SIZE, PFN_DEV|PFN_MAP);
-	if (!dax_region)
-		return -ENOMEM;
-
-	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
-	if (IS_ERR(dev_dax))
-		return PTR_ERR(dev_dax);
-
-	/* child dev_dax instances now own the lifetime of the dax_region */
-	dax_region_put(dax_region);
-	return 0;
-}
-
-static int dax_hmem_remove(struct platform_device *pdev)
-{
-	/* devm handles teardown */
-	return 0;
-}
-
-static struct platform_driver dax_hmem_driver = {
-	.probe = dax_hmem_probe,
-	.remove = dax_hmem_remove,
-	.driver = {
-		.name = "hmem",
-	},
-};
-
-module_platform_driver(dax_hmem_driver);
-
-MODULE_ALIAS("platform:hmem*");
-MODULE_LICENSE("GPL v2");
-MODULE_AUTHOR("Intel Corporation");
--- /dev/null
+++ a/drivers/dax/hmem/device.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/dax.h>
+#include <linux/mm.h>
+
+void hmem_register_device(int target_nid, struct resource *r)
+{
+	/* define a clean / non-busy resource for the platform device */
+	struct resource res = {
+		.start = r->start,
+		.end = r->end,
+		.flags = IORESOURCE_MEM,
+	};
+	struct platform_device *pdev;
+	struct memregion_info info;
+	int rc, id;
+
+	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
+			IORES_DESC_SOFT_RESERVED);
+	if (rc != REGION_INTERSECTS)
+		return;
+
+	id = memregion_alloc(GFP_KERNEL);
+	if (id < 0) {
+		pr_err("memregion allocation failure for %pr\n", &res);
+		return;
+	}
+
+	pdev = platform_device_alloc("hmem", id);
+	if (!pdev) {
+		pr_err("hmem device allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	pdev->dev.numa_node = numa_map_to_online_node(target_nid);
+	info = (struct memregion_info) {
+		.target_node = target_nid,
+	};
+	rc = platform_device_add_data(pdev, &info, sizeof(info));
+	if (rc < 0) {
+		pr_err("hmem memregion_info allocation failure for %pr\n", &res);
+		goto out_pdev;
+	}
+
+	rc = platform_device_add_resources(pdev, &res, 1);
+	if (rc < 0) {
+		pr_err("hmem resource allocation failure for %pr\n", &res);
+		goto out_resource;
+	}
+
+	rc = platform_device_add(pdev);
+	if (rc < 0) {
+		dev_err(&pdev->dev, "device add failed for %pr\n", &res);
+		goto out_resource;
+	}
+
+	return;
+
+out_resource:
+	put_device(&pdev->dev);
+out_pdev:
+	memregion_free(id);
+}
--- /dev/null
+++ a/drivers/dax/hmem/hmem.c
@@ -0,0 +1,56 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/platform_device.h>
+#include <linux/memregion.h>
+#include <linux/module.h>
+#include <linux/pfn_t.h>
+#include "../bus.h"
+
+static int dax_hmem_probe(struct platform_device *pdev)
+{
+	struct device *dev = &pdev->dev;
+	struct dev_pagemap pgmap = { };
+	struct dax_region *dax_region;
+	struct memregion_info *mri;
+	struct dev_dax *dev_dax;
+	struct resource *res;
+
+	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+	if (!res)
+		return -ENOMEM;
+
+	mri = dev->platform_data;
+	memcpy(&pgmap.res, res, sizeof(*res));
+
+	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+			PMD_SIZE, PFN_DEV|PFN_MAP);
+	if (!dax_region)
+		return -ENOMEM;
+
+	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+	if (IS_ERR(dev_dax))
+		return PTR_ERR(dev_dax);
+
+	/* child dev_dax instances now own the lifetime of the dax_region */
+	dax_region_put(dax_region);
+	return 0;
+}
+
+static int dax_hmem_remove(struct platform_device *pdev)
+{
+	/* devm handles teardown */
+	return 0;
+}
+
+static struct platform_driver dax_hmem_driver = {
+	.probe = dax_hmem_probe,
+	.remove = dax_hmem_remove,
+	.driver = {
+		.name = "hmem",
+	},
+};
+
+module_platform_driver(dax_hmem_driver);
+
+MODULE_ALIAS("platform:hmem*");
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR("Intel Corporation");
--- /dev/null
+++ a/drivers/dax/hmem/Makefile
@@ -0,0 +1,5 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+
+dax_hmem-y := hmem.o
--- a/drivers/dax/Kconfig~acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device
+++ a/drivers/dax/Kconfig
@@ -48,6 +48,10 @@ config DEV_DAX_HMEM
 
 	  Say M if unsure.
 
+config DEV_DAX_HMEM_DEVICES
+	depends on DEV_DAX_HMEM && DAX=y
+	def_bool y
+
 config DEV_DAX_KMEM
 	tristate "KMEM DAX: volatile-use of persistent memory"
 	default DEV_DAX
--- a/drivers/dax/Makefile~acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device
+++ a/drivers/dax/Makefile
@@ -2,11 +2,10 @@
 obj-$(CONFIG_DAX) += dax.o
 obj-$(CONFIG_DEV_DAX) += device_dax.o
 obj-$(CONFIG_DEV_DAX_KMEM) += kmem.o
-obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
 
 dax-y := super.o
 dax-y += bus.o
 device_dax-y := device.o
-dax_hmem-y := hmem.o
 
 obj-y += pmem/
+obj-y += hmem/
--- a/include/linux/dax.h~acpi-hmat-refactor-hmat_register_target_device-to-hmem_register_device
+++ a/include/linux/dax.h
@@ -238,4 +238,12 @@ static inline bool dax_mapping(struct ad
 	return mapping->host && IS_DAX(mapping->host);
 }
 
+#ifdef CONFIG_DEV_DAX_HMEM_DEVICES
+void hmem_register_device(int target_nid, struct resource *r);
+#else
+static inline void hmem_register_device(int target_nid, struct resource *r)
+{
+}
+#endif
+
 #endif
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 030/181] resource: report parent to walk_iomem_res_desc() callback
  2020-10-13 23:46 incoming Andrew Morton
                   ` (28 preceding siblings ...)
  2020-10-13 23:49 ` [patch 029/181] ACPI: HMAT: refactor hmat_register_target_device to hmem_register_device Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 031/181] mm/memory_hotplug: introduce default phys_to_target_node() implementation Andrew Morton
                   ` (154 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: resource: report parent to walk_iomem_res_desc() callback

In support of detecting whether a resource might have been claimed,
report the parent to the walk_iomem_res_desc() callback.  For example,
the ACPI HMAT parser publishes "hmem" platform devices per target range.
However, if the HMAT is disabled or missing, a fallback driver can attach
devices to the raw memory ranges instead, provided it sees unclaimed /
orphan "Soft Reserved" resources in the resource tree.

Without this, find_next_iomem_res() returns a resource whose res->parent
field contains garbage data from the stack allocation in
__walk_iomem_res_desc().

There are currently no users that expect ->child and ->sibling to be
valid, and the resource_lock would be needed to traverse them.  Use a
compound literal to implicitly zero-initialize the fields that are not
being returned, in addition to setting ->parent.
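
A standalone C sketch of the compound-literal idiom relied on here
(using a simplified stand-in for struct resource, not kernel code):

	#include <stdio.h>

	struct demo_res {
		long start, end;
		struct demo_res *parent, *sibling, *child;
	};

	int main(void)
	{
		struct demo_res res = { .sibling = &res };	/* stale data */

		res = (struct demo_res) {
			.start = 1,
			.end = 2,
			.parent = NULL,
		};
		/* .sibling and .child were implicitly zeroed */
		printf("%p\n", (void *)res.sibling);
		return 0;
	}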

Link: https://lkml.kernel.org/r/159643097166.4062302.11875688887228572793.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/resource.c |   11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

--- a/kernel/resource.c~resource-report-parent-to-walk_iomem_res_desc-callback
+++ a/kernel/resource.c
@@ -382,10 +382,13 @@ static int find_next_iomem_res(resource_
 
 	if (p) {
 		/* copy data */
-		res->start = max(start, p->start);
-		res->end = min(end, p->end);
-		res->flags = p->flags;
-		res->desc = p->desc;
+		*res = (struct resource) {
+			.start = max(start, p->start),
+			.end = min(end, p->end),
+			.flags = p->flags,
+			.desc = p->desc,
+			.parent = p->parent,
+		};
 	}
 
 	read_unlock(&resource_lock);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 031/181] mm/memory_hotplug: introduce default phys_to_target_node() implementation
  2020-10-13 23:46 incoming Andrew Morton
                   ` (29 preceding siblings ...)
  2020-10-13 23:49 ` [patch 030/181] resource: report parent to walk_iomem_res_desc() callback Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 032/181] ACPI: HMAT: attach a device for each soft-reserved range Andrew Morton
                   ` (153 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/memory_hotplug: introduce default phys_to_target_node() implementation

In preparation for setting a fallback value for dev_dax->target_node,
introduce generic fallback helpers for phys_to_target_node().

A generic implementation based on node-data or memblock was proposed, but
as noted by Mike:

    "Here again, I would prefer to add a weak default for
     phys_to_target_node() because the "generic" implementation is not really
     generic.

     The fallback to reserved ranges is x86 specific because on x86 most of
     the reserved areas are not in memblock.memory. AFAIK, no other
     architecture does this."

The info message in the generic memory_add_physaddr_to_nid()
implementation is fixed up to properly reflect that
memory_add_physaddr_to_nid() communicates "online" node info and
phys_to_target_node() indicates "target / to-be-onlined" node info.
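
A minimal standalone sketch of the weak-symbol pattern used here (not
the kernel's actual definitions): the generic file provides a weak
default, and an architecture overrides it at link time simply by
defining a strong symbol of the same name.

	/* generic.c */
	__attribute__((weak)) int phys_to_target_node(unsigned long long start)
	{
		return 0;	/* fallback: assume node 0 */
	}

	/* arch.c -- a strong definition silently wins at link time */
	int phys_to_target_node(unsigned long long start)
	{
		return 1;	/* e.g. derived from arch-specific meminfo */
	}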

[akpm@linux-foundation.org: fix CONFIG_MEMORY_HOTPLUG=n build]
  Link: https://lkml.kernel.org/r/202008252130.7YrHIyMI%25lkp@intel.com
Link: https://lkml.kernel.org/r/159643097768.4062302.3135192588966888630.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Jia He <justin.he@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/mm/numa.c             |    1 -
 include/linux/memory_hotplug.h |   23 ++++++++++++++---------
 include/linux/numa.h           |   11 -----------
 mm/memory_hotplug.c            |   10 +++++++++-
 4 files changed, 23 insertions(+), 22 deletions(-)

--- a/arch/x86/mm/numa.c~mm-memory_hotplug-introduce-default-phys_to_target_node-implementation
+++ a/arch/x86/mm/numa.c
@@ -917,7 +917,6 @@ int phys_to_target_node(phys_addr_t star
 
 	return meminfo_to_nid(&numa_reserved_meminfo, start);
 }
-EXPORT_SYMBOL_GPL(phys_to_target_node);
 
 int memory_add_physaddr_to_nid(u64 start)
 {
--- a/include/linux/memory_hotplug.h~mm-memory_hotplug-introduce-default-phys_to_target_node-implementation
+++ a/include/linux/memory_hotplug.h
@@ -149,15 +149,6 @@ int add_pages(int nid, unsigned long sta
 	      struct mhp_params *params);
 #endif /* ARCH_HAS_ADD_PAGES */
 
-#ifdef CONFIG_NUMA
-extern int memory_add_physaddr_to_nid(u64 start);
-#else
-static inline int memory_add_physaddr_to_nid(u64 start)
-{
-	return 0;
-}
-#endif
-
 #ifdef CONFIG_HAVE_ARCH_NODEDATA_EXTENSION
 /*
  * For supporting node-hotadd, we have to allocate a new pgdat.
@@ -284,6 +275,20 @@ static inline bool movable_node_is_enabl
 }
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
+#ifdef CONFIG_NUMA
+extern int memory_add_physaddr_to_nid(u64 start);
+extern int phys_to_target_node(u64 start);
+#else
+static inline int memory_add_physaddr_to_nid(u64 start)
+{
+	return 0;
+}
+static inline int phys_to_target_node(u64 start)
+{
+	return 0;
+}
+#endif
+
 #if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
 /*
  * pgdat resizing functions
--- a/include/linux/numa.h~mm-memory_hotplug-introduce-default-phys_to_target_node-implementation
+++ a/include/linux/numa.h
@@ -23,22 +23,11 @@
 #ifdef CONFIG_NUMA
 /* Generic implementation available */
 int numa_map_to_online_node(int node);
-
-/*
- * Optional architecture specific implementation, users need a "depends
- * on $ARCH"
- */
-int phys_to_target_node(phys_addr_t addr);
 #else
 static inline int numa_map_to_online_node(int node)
 {
 	return NUMA_NO_NODE;
 }
-
-static inline int phys_to_target_node(phys_addr_t addr)
-{
-	return NUMA_NO_NODE;
-}
 #endif
 
 #endif /* _LINUX_NUMA_H */
--- a/mm/memory_hotplug.c~mm-memory_hotplug-introduce-default-phys_to_target_node-implementation
+++ a/mm/memory_hotplug.c
@@ -353,11 +353,19 @@ int __ref __add_pages(int nid, unsigned
 #ifdef CONFIG_NUMA
 int __weak memory_add_physaddr_to_nid(u64 start)
 {
-	pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+	pr_info_once("Unknown online node for memory at 0x%llx, assuming node 0\n",
 			start);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+int __weak phys_to_target_node(u64 start)
+{
+	pr_info_once("Unknown target node for memory at 0x%llx, assuming node 0\n",
+			start);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(phys_to_target_node);
 #endif
 
 /* find the smallest valid pfn in the range [start_pfn, end_pfn) */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 032/181] ACPI: HMAT: attach a device for each soft-reserved range
  2020-10-13 23:46 incoming Andrew Morton
                   ` (30 preceding siblings ...)
  2020-10-13 23:49 ` [patch 031/181] mm/memory_hotplug: introduce default phys_to_target_node() implementation Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 033/181] device-dax: drop the dax_region.pfn_flags attribute Andrew Morton
                   ` (152 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: ACPI: HMAT: attach a device for each soft-reserved range

The hmem enabling in commit cf8741ac57ed ("ACPI: NUMA: HMAT: Register
"soft reserved" memory as an "hmem" device") only registered ranges to the
hmem driver for each soft-reservation that also appeared in the HMAT. 
While this is meant to encourage platform firmware to "do the right thing"
and publish an HMAT, the corollary is that platforms that fail to publish
an accurate HMAT will strand memory from Linux usage.  Additionally,
enabling the "efi_fake_mem" kernel command line option will strand memory
by default when no HMAT is present.

Arrange for "soft reserved" memory that goes unclaimed by HMAT entries to
be published as raw resource ranges for the hmem driver to consume.

Include a module parameter to disable hmem device creation, whether
triggered by this fallback or by the hmat enabling.  The module parameter
requires the hmem device enabling to have a unique name in the module
namespace: "device_hmem".
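
Given the module_param_named() in the device.c hunk below and the
"device_hmem" object name in the Makefile, the expected spelling is
"device_hmem.disable=1" on the kernel command line to suppress hmem
device creation.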

The driver depends on the architecture providing phys_to_target_node(),
which currently only x86 (via numa_meminfo()) and arm64 (via a generic
memblock implementation) do.

[joao.m.martins@oracle.com: require NUMA_KEEP_MEMINFO for phys_to_target_node()]
  Link: https://lkml.kernel.org/r/aaae71a7-4846-f5cc-5acf-cf05fdb1f2dc@oracle.com
Link: https://lkml.kernel.org/r/159643098298.4062302.17587338161136144730.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Reviewed-by: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jia He <justin.he@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/Kconfig       |    2 ++
 drivers/dax/hmem/Makefile |    3 ++-
 drivers/dax/hmem/device.c |   35 +++++++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+), 1 deletion(-)

--- a/drivers/dax/hmem/device.c~acpi-hmat-attach-a-device-for-each-soft-reserved-range
+++ a/drivers/dax/hmem/device.c
@@ -5,6 +5,9 @@
 #include <linux/dax.h>
 #include <linux/mm.h>
 
+static bool nohmem;
+module_param_named(disable, nohmem, bool, 0444);
+
 void hmem_register_device(int target_nid, struct resource *r)
 {
 	/* define a clean / non-busy resource for the platform device */
@@ -17,6 +20,9 @@ void hmem_register_device(int target_nid
 	struct memregion_info info;
 	int rc, id;
 
+	if (nohmem)
+		return;
+
 	rc = region_intersects(res.start, resource_size(&res), IORESOURCE_MEM,
 			IORES_DESC_SOFT_RESERVED);
 	if (rc != REGION_INTERSECTS)
@@ -63,3 +69,32 @@ out_resource:
 out_pdev:
 	memregion_free(id);
 }
+
+static __init int hmem_register_one(struct resource *res, void *data)
+{
+	/*
+	 * If the resource is not a top-level resource it was already
+	 * assigned to a device by the HMAT parsing.
+	 */
+	if (res->parent != &iomem_resource) {
+		pr_info("HMEM: skip %pr, already claimed\n", res);
+		return 0;
+	}
+
+	hmem_register_device(phys_to_target_node(res->start), res);
+
+	return 0;
+}
+
+static __init int hmem_init(void)
+{
+	walk_iomem_res_desc(IORES_DESC_SOFT_RESERVED,
+			IORESOURCE_MEM, 0, -1, NULL, hmem_register_one);
+	return 0;
+}
+
+/*
+ * As this is a fallback for address ranges unclaimed by the ACPI HMAT
+ * parsing it must be at an initcall level greater than hmat_init().
+ */
+late_initcall(hmem_init);
--- a/drivers/dax/hmem/Makefile~acpi-hmat-attach-a-device-for-each-soft-reserved-range
+++ a/drivers/dax/hmem/Makefile
@@ -1,5 +1,6 @@
 # SPDX-License-Identifier: GPL-2.0
 obj-$(CONFIG_DEV_DAX_HMEM) += dax_hmem.o
-obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device.o
+obj-$(CONFIG_DEV_DAX_HMEM_DEVICES) += device_hmem.o
 
+device_hmem-y := device.o
 dax_hmem-y := hmem.o
--- a/drivers/dax/Kconfig~acpi-hmat-attach-a-device-for-each-soft-reserved-range
+++ a/drivers/dax/Kconfig
@@ -35,6 +35,7 @@ config DEV_DAX_PMEM
 config DEV_DAX_HMEM
 	tristate "HMEM DAX: direct access to 'specific purpose' memory"
 	depends on EFI_SOFT_RESERVE
+	select NUMA_KEEP_MEMINFO if (NUMA && X86)
 	default DEV_DAX
 	help
 	  EFI 2.8 platforms, and others, may advertise 'specific purpose'
@@ -49,6 +50,7 @@ config DEV_DAX_HMEM
 	  Say M if unsure.
 
 config DEV_DAX_HMEM_DEVICES
+	depends on NUMA_KEEP_MEMINFO # for phys_to_target_node()
 	depends on DEV_DAX_HMEM && DAX=y
 	def_bool y
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 033/181] device-dax: drop the dax_region.pfn_flags attribute
  2020-10-13 23:46 incoming Andrew Morton
                   ` (31 preceding siblings ...)
  2020-10-13 23:49 ` [patch 032/181] ACPI: HMAT: attach a device for each soft-reserved range Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 034/181] device-dax: move instance creation parameters to 'struct dev_dax_data' Andrew Morton
                   ` (151 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: drop the dax_region.pfn_flags attribute

All callers specify the same flags to alloc_dax_region(), so there is no
need to allow for anything other than PFN_DEV|PFN_MAP, or to carry a
->pfn_flags attribute around on the region.  Device-dax instances are
always page-backed.
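
For reference, PFN_DEV|PFN_MAP is the pfn_t flag combination for which
pfn_t_devmap() returns true, i.e. a pfn with driver-established struct
pages, which is why the fault handlers below can hard-code it.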

Link: https://lkml.kernel.org/r/159643098829.4062302.13611520567669439046.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |    4 +---
 drivers/dax/bus.h         |    3 +--
 drivers/dax/dax-private.h |    2 --
 drivers/dax/device.c      |   26 +++-----------------------
 drivers/dax/hmem/hmem.c   |    2 +-
 drivers/dax/pmem/core.c   |    3 +--
 6 files changed, 7 insertions(+), 33 deletions(-)

--- a/drivers/dax/bus.c~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/bus.c
@@ -226,8 +226,7 @@ static void dax_region_unregister(void *
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align,
-		unsigned long long pfn_flags)
+		struct resource *res, int target_node, unsigned int align)
 {
 	struct dax_region *dax_region;
 
@@ -251,7 +250,6 @@ struct dax_region *alloc_dax_region(stru
 
 	dev_set_drvdata(parent, dax_region);
 	memcpy(&dax_region->res, res, sizeof(*res));
-	dax_region->pfn_flags = pfn_flags;
 	kref_init(&dax_region->kref);
 	dax_region->id = region_id;
 	dax_region->align = align;
--- a/drivers/dax/bus.h~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/bus.h
@@ -10,8 +10,7 @@ struct dax_device;
 struct dax_region;
 void dax_region_put(struct dax_region *dax_region);
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align,
-		unsigned long long flags);
+		struct resource *res, int target_node, unsigned int align);
 
 enum dev_dax_subsys {
 	DEV_DAX_BUS,
--- a/drivers/dax/dax-private.h~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/dax-private.h
@@ -23,7 +23,6 @@ void dax_bus_exit(void);
  * @dev: parent device backing this region
  * @align: allocation and mapping alignment for child dax devices
  * @res: physical address range of the region
- * @pfn_flags: identify whether the pfns are paged back or not
  */
 struct dax_region {
 	int id;
@@ -32,7 +31,6 @@ struct dax_region {
 	struct device *dev;
 	unsigned int align;
 	struct resource res;
-	unsigned long long pfn_flags;
 };
 
 /**
--- a/drivers/dax/device.c~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/device.c
@@ -41,14 +41,6 @@ static int check_vma(struct dev_dax *dev
 		return -EINVAL;
 	}
 
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) == PFN_DEV
-			&& (vma->vm_flags & VM_DONTCOPY) == 0) {
-		dev_info_ratelimited(dev,
-				"%s: %s: fail, dax range requires MADV_DONTFORK\n",
-				current->comm, func);
-		return -EINVAL;
-	}
-
 	if (!vma_is_dax(vma)) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, vma is not DAX capable\n",
@@ -102,7 +94,7 @@ static vm_fault_t __dev_dax_pte_fault(st
 		return VM_FAULT_SIGBUS;
 	}
 
-	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
 	return vmf_insert_mixed(vmf->vma, vmf->address, *pfn);
 }
@@ -127,12 +119,6 @@ static vm_fault_t __dev_dax_pmd_fault(st
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pmd mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -150,7 +136,7 @@ static vm_fault_t __dev_dax_pmd_fault(st
 		return VM_FAULT_SIGBUS;
 	}
 
-	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
 	return vmf_insert_pfn_pmd(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
@@ -177,12 +163,6 @@ static vm_fault_t __dev_dax_pud_fault(st
 		return VM_FAULT_SIGBUS;
 	}
 
-	/* dax pud mappings require pfn_t_devmap() */
-	if ((dax_region->pfn_flags & (PFN_DEV|PFN_MAP)) != (PFN_DEV|PFN_MAP)) {
-		dev_dbg(dev, "region lacks devmap flags\n");
-		return VM_FAULT_SIGBUS;
-	}
-
 	if (fault_size < dax_region->align)
 		return VM_FAULT_SIGBUS;
 	else if (fault_size > dax_region->align)
@@ -200,7 +180,7 @@ static vm_fault_t __dev_dax_pud_fault(st
 		return VM_FAULT_SIGBUS;
 	}
 
-	*pfn = phys_to_pfn_t(phys, dax_region->pfn_flags);
+	*pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
 
 	return vmf_insert_pfn_pud(vmf, *pfn, vmf->flags & FAULT_FLAG_WRITE);
 }
--- a/drivers/dax/hmem/hmem.c~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/hmem/hmem.c
@@ -22,7 +22,7 @@ static int dax_hmem_probe(struct platfor
 	memcpy(&pgmap.res, res, sizeof(*res));
 
 	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
-			PMD_SIZE, PFN_DEV|PFN_MAP);
+			PMD_SIZE);
 	if (!dax_region)
 		return -ENOMEM;
 
--- a/drivers/dax/pmem/core.c~device-dax-drop-the-dax_regionpfn_flags-attribute
+++ a/drivers/dax/pmem/core.c
@@ -53,8 +53,7 @@ struct dev_dax *__dax_pmem_probe(struct
 	memcpy(&res, &pgmap.res, sizeof(res));
 	res.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &res,
-			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			PFN_DEV|PFN_MAP);
+			nd_region->target_node, le32_to_cpu(pfn_sb->align));
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 034/181] device-dax: move instance creation parameters to 'struct dev_dax_data'
  2020-10-13 23:46 incoming Andrew Morton
                   ` (32 preceding siblings ...)
  2020-10-13 23:49 ` [patch 033/181] device-dax: drop the dax_region.pfn_flags attribute Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 035/181] device-dax: make pgmap optional for instance creation Andrew Morton
                   ` (150 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: move instance creation parameters to 'struct dev_dax_data'

In preparation for adding more parameters to instance creation, move
existing parameters to a new struct.
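
Note the design choice visible in the bus.h hunk: DEV_DAX_BUS is pinned
to 0 so that a zero-initialized 'struct dev_dax_data', as built up with
compound literals by the callers, selects the bus subsystem by default.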

Link: https://lkml.kernel.org/r/159643099411.4062302.1337305960720423895.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c       |   14 +++++++-------
 drivers/dax/bus.h       |   16 ++++++++--------
 drivers/dax/hmem/hmem.c |    8 +++++++-
 drivers/dax/pmem/core.c |    9 ++++++++-
 4 files changed, 30 insertions(+), 17 deletions(-)

--- a/drivers/dax/bus.c~device-dax-move-instance-creation-parameters-to-struct-dev_dax_data
+++ a/drivers/dax/bus.c
@@ -395,9 +395,9 @@ static void unregister_dev_dax(void *dev
 	put_device(dev);
 }
 
-struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
-		struct dev_pagemap *pgmap, enum dev_dax_subsys subsys)
+struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
+	struct dax_region *dax_region = data->dax_region;
 	struct device *parent = dax_region->dev;
 	struct dax_device *dax_dev;
 	struct dev_dax *dev_dax;
@@ -405,14 +405,14 @@ struct dev_dax *__devm_create_dev_dax(st
 	struct device *dev;
 	int rc = -ENOMEM;
 
-	if (id < 0)
+	if (data->id < 0)
 		return ERR_PTR(-EINVAL);
 
 	dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
-	memcpy(&dev_dax->pgmap, pgmap, sizeof(*pgmap));
+	memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));
 
 	/*
 	 * No 'host' or dax_operations since there is no access to this
@@ -438,13 +438,13 @@ struct dev_dax *__devm_create_dev_dax(st
 
 	inode = dax_inode(dax_dev);
 	dev->devt = inode->i_rdev;
-	if (subsys == DEV_DAX_BUS)
+	if (data->subsys == DEV_DAX_BUS)
 		dev->bus = &dax_bus_type;
 	else
 		dev->class = dax_class;
 	dev->parent = parent;
 	dev->type = &dev_dax_type;
-	dev_set_name(dev, "dax%d.%d", dax_region->id, id);
+	dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
 
 	rc = device_add(dev);
 	if (rc) {
@@ -464,7 +464,7 @@ struct dev_dax *__devm_create_dev_dax(st
 
 	return ERR_PTR(rc);
 }
-EXPORT_SYMBOL_GPL(__devm_create_dev_dax);
+EXPORT_SYMBOL_GPL(devm_create_dev_dax);
 
 static int match_always_count;
 
--- a/drivers/dax/bus.h~device-dax-move-instance-creation-parameters-to-struct-dev_dax_data
+++ a/drivers/dax/bus.h
@@ -13,18 +13,18 @@ struct dax_region *alloc_dax_region(stru
 		struct resource *res, int target_node, unsigned int align);
 
 enum dev_dax_subsys {
-	DEV_DAX_BUS,
+	DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */
 	DEV_DAX_CLASS,
 };
 
-struct dev_dax *__devm_create_dev_dax(struct dax_region *dax_region, int id,
-		struct dev_pagemap *pgmap, enum dev_dax_subsys subsys);
+struct dev_dax_data {
+	struct dax_region *dax_region;
+	struct dev_pagemap *pgmap;
+	enum dev_dax_subsys subsys;
+	int id;
+};
 
-static inline struct dev_dax *devm_create_dev_dax(struct dax_region *dax_region,
-		int id, struct dev_pagemap *pgmap)
-{
-	return __devm_create_dev_dax(dax_region, id, pgmap, DEV_DAX_BUS);
-}
+struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data);
 
 /* to be deleted when DEV_DAX_CLASS is removed */
 struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys);
--- a/drivers/dax/hmem/hmem.c~device-dax-move-instance-creation-parameters-to-struct-dev_dax_data
+++ a/drivers/dax/hmem/hmem.c
@@ -11,6 +11,7 @@ static int dax_hmem_probe(struct platfor
 	struct dev_pagemap pgmap = { };
 	struct dax_region *dax_region;
 	struct memregion_info *mri;
+	struct dev_dax_data data;
 	struct dev_dax *dev_dax;
 	struct resource *res;
 
@@ -26,7 +27,12 @@ static int dax_hmem_probe(struct platfor
 	if (!dax_region)
 		return -ENOMEM;
 
-	dev_dax = devm_create_dev_dax(dax_region, 0, &pgmap);
+	data = (struct dev_dax_data) {
+		.dax_region = dax_region,
+		.id = 0,
+		.pgmap = &pgmap,
+	};
+	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
 		return PTR_ERR(dev_dax);
 
--- a/drivers/dax/pmem/core.c~device-dax-move-instance-creation-parameters-to-struct-dev_dax_data
+++ a/drivers/dax/pmem/core.c
@@ -14,6 +14,7 @@ struct dev_dax *__dax_pmem_probe(struct
 	resource_size_t offset;
 	struct nd_pfn_sb *pfn_sb;
 	struct dev_dax *dev_dax;
+	struct dev_dax_data data;
 	struct nd_namespace_io *nsio;
 	struct dax_region *dax_region;
 	struct dev_pagemap pgmap = { };
@@ -57,7 +58,13 @@ struct dev_dax *__dax_pmem_probe(struct
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
-	dev_dax = __devm_create_dev_dax(dax_region, id, &pgmap, subsys);
+	data = (struct dev_dax_data) {
+		.dax_region = dax_region,
+		.id = id,
+		.pgmap = &pgmap,
+		.subsys = subsys,
+	};
+	dev_dax = devm_create_dev_dax(&data);
 
 	/* child dev_dax instances now own the lifetime of the dax_region */
 	dax_region_put(dax_region);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 035/181] device-dax: make pgmap optional for instance creation
  2020-10-13 23:46 incoming Andrew Morton
                   ` (33 preceding siblings ...)
  2020-10-13 23:49 ` [patch 034/181] device-dax: move instance creation parameters to 'struct dev_dax_data' Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 036/181] device-dax/kmem: introduce dax_kmem_range() Andrew Morton
                   ` (149 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: make pgmap optional for instance creation

The passed-in dev_pagemap is only required in the pmem case, as the
libnvdimm core may have reserved a vmem_altmap for devm_memremap_pages() to
place the memmap in pmem directly.  In the hmem case there is no agent
reserving an altmap, so it can all be handled by a core-internal default.

Pass the resource range via a new @range property of 'struct
dev_dax_data'.
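
As a minimal sketch (mirroring the hmem conversion below), a caller with
no altmap to convey now omits .pgmap entirely and describes its span via
.range; dev_dax_probe() then allocates a default pgmap internally:

	struct dev_dax_data data = {
		.dax_region = dax_region,
		.id = 0,
		.range = {
			.start = res->start,
			.end = res->end,
		},
		/* no .pgmap: the core supplies a default at probe time */
	};
	dev_dax = devm_create_dev_dax(&data);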

Link: https://lkml.kernel.org/r/159643099958.4062302.10379230791041872886.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106110513.30709.4303239334850606031.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c              |   29 +++++++++++++++--------------
 drivers/dax/bus.h              |    2 ++
 drivers/dax/dax-private.h      |    9 ++++++++-
 drivers/dax/device.c           |   28 +++++++++++++++++++---------
 drivers/dax/hmem/hmem.c        |    8 ++++----
 drivers/dax/kmem.c             |   12 ++++++------
 drivers/dax/pmem/core.c        |    4 ++++
 tools/testing/nvdimm/dax-dev.c |    8 ++++----
 8 files changed, 62 insertions(+), 38 deletions(-)

--- a/drivers/dax/bus.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/bus.c
@@ -271,7 +271,7 @@ static ssize_t size_show(struct device *
 		struct device_attribute *attr, char *buf)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
-	unsigned long long size = resource_size(&dev_dax->region->res);
+	unsigned long long size = range_len(&dev_dax->range);
 
 	return sprintf(buf, "%llu\n", size);
 }
@@ -293,19 +293,12 @@ static ssize_t target_node_show(struct d
 }
 static DEVICE_ATTR_RO(target_node);
 
-static unsigned long long dev_dax_resource(struct dev_dax *dev_dax)
-{
-	struct dax_region *dax_region = dev_dax->region;
-
-	return dax_region->res.start;
-}
-
 static ssize_t resource_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 
-	return sprintf(buf, "%#llx\n", dev_dax_resource(dev_dax));
+	return sprintf(buf, "%#llx\n", dev_dax->range.start);
 }
 static DEVICE_ATTR(resource, 0400, resource_show, NULL);
 
@@ -376,6 +369,7 @@ static void dev_dax_release(struct devic
 
 	dax_region_put(dax_region);
 	put_dax(dax_dev);
+	kfree(dev_dax->pgmap);
 	kfree(dev_dax);
 }
 
@@ -412,7 +406,12 @@ struct dev_dax *devm_create_dev_dax(stru
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
-	memcpy(&dev_dax->pgmap, data->pgmap, sizeof(struct dev_pagemap));
+	if (data->pgmap) {
+		dev_dax->pgmap = kmemdup(data->pgmap,
+				sizeof(struct dev_pagemap), GFP_KERNEL);
+		if (!dev_dax->pgmap)
+			goto err_pgmap;
+	}
 
 	/*
 	 * No 'host' or dax_operations since there is no access to this
@@ -421,18 +420,19 @@ struct dev_dax *devm_create_dev_dax(stru
 	dax_dev = alloc_dax(dev_dax, NULL, NULL, DAXDEV_F_SYNC);
 	if (IS_ERR(dax_dev)) {
 		rc = PTR_ERR(dax_dev);
-		goto err;
+		goto err_alloc_dax;
 	}
 
 	/* a device_dax instance is dead while the driver is not attached */
 	kill_dax(dax_dev);
 
-	/* from here on we're committed to teardown via dax_dev_release() */
+	/* from here on we're committed to teardown via dev_dax_release() */
 	dev = &dev_dax->dev;
 	device_initialize(dev);
 
 	dev_dax->dax_dev = dax_dev;
 	dev_dax->region = dax_region;
+	dev_dax->range = data->range;
 	dev_dax->target_node = dax_region->target_node;
 	kref_get(&dax_region->kref);
 
@@ -458,8 +458,9 @@ struct dev_dax *devm_create_dev_dax(stru
 		return ERR_PTR(rc);
 
 	return dev_dax;
-
- err:
+err_alloc_dax:
+	kfree(dev_dax->pgmap);
+err_pgmap:
 	kfree(dev_dax);
 
 	return ERR_PTR(rc);
--- a/drivers/dax/bus.h~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/bus.h
@@ -3,6 +3,7 @@
 #ifndef __DAX_BUS_H__
 #define __DAX_BUS_H__
 #include <linux/device.h>
+#include <linux/range.h>
 
 struct dev_dax;
 struct resource;
@@ -21,6 +22,7 @@ struct dev_dax_data {
 	struct dax_region *dax_region;
 	struct dev_pagemap *pgmap;
 	enum dev_dax_subsys subsys;
+	struct range range;
 	int id;
 };
 
--- a/drivers/dax/dax-private.h~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/dax-private.h
@@ -41,6 +41,7 @@ struct dax_region {
  * @target_node: effective numa node if dev_dax memory range is onlined
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
+ * @range: resource range for the instance
  * @dax_mem_res: physical address range of hotadded DAX memory
  * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed()
  */
@@ -49,10 +50,16 @@ struct dev_dax {
 	struct dax_device *dax_dev;
 	int target_node;
 	struct device dev;
-	struct dev_pagemap pgmap;
+	struct dev_pagemap *pgmap;
+	struct range range;
 	struct resource *dax_kmem_res;
 };
 
+static inline u64 range_len(struct range *range)
+{
+	return range->end - range->start + 1;
+}
+
 static inline struct dev_dax *to_dev_dax(struct device *dev)
 {
 	return container_of(dev, struct dev_dax, dev);
--- a/drivers/dax/device.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/device.c
@@ -55,12 +55,12 @@ static int check_vma(struct dev_dax *dev
 __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
 {
-	struct resource *res = &dev_dax->region->res;
+	struct range *range = &dev_dax->range;
 	phys_addr_t phys;
 
-	phys = pgoff * PAGE_SIZE + res->start;
-	if (phys >= res->start && phys <= res->end) {
-		if (phys + size - 1 <= res->end)
+	phys = pgoff * PAGE_SIZE + range->start;
+	if (phys >= range->start && phys <= range->end) {
+		if (phys + size - 1 <= range->end)
 			return phys;
 	}
 
@@ -396,21 +396,31 @@ int dev_dax_probe(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct dax_device *dax_dev = dev_dax->dax_dev;
-	struct resource *res = &dev_dax->region->res;
+	struct range *range = &dev_dax->range;
+	struct dev_pagemap *pgmap;
 	struct inode *inode;
 	struct cdev *cdev;
 	void *addr;
 	int rc;
 
 	/* 1:1 map region resource range to device-dax instance range */
-	if (!devm_request_mem_region(dev, res->start, resource_size(res),
+	if (!devm_request_mem_region(dev, range->start, range_len(range),
 				dev_name(dev))) {
-		dev_warn(dev, "could not reserve region %pR\n", res);
+		dev_warn(dev, "could not reserve range: %#llx - %#llx\n",
+				range->start, range->end);
 		return -EBUSY;
 	}
 
-	dev_dax->pgmap.type = MEMORY_DEVICE_GENERIC;
-	addr = devm_memremap_pages(dev, &dev_dax->pgmap);
+	pgmap = dev_dax->pgmap;
+	if (!pgmap) {
+		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
+		if (!pgmap)
+			return -ENOMEM;
+		pgmap->res.start = range->start;
+		pgmap->res.end = range->end;
+	}
+	pgmap->type = MEMORY_DEVICE_GENERIC;
+	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
 		return PTR_ERR(addr);
 
--- a/drivers/dax/hmem/hmem.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/hmem/hmem.c
@@ -8,7 +8,6 @@
 static int dax_hmem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
-	struct dev_pagemap pgmap = { };
 	struct dax_region *dax_region;
 	struct memregion_info *mri;
 	struct dev_dax_data data;
@@ -20,8 +19,6 @@ static int dax_hmem_probe(struct platfor
 		return -ENOMEM;
 
 	mri = dev->platform_data;
-	memcpy(&pgmap.res, res, sizeof(*res));
-
 	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
 			PMD_SIZE);
 	if (!dax_region)
@@ -30,7 +27,10 @@ static int dax_hmem_probe(struct platfor
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = 0,
-		.pgmap = &pgmap,
+		.range = {
+			.start = res->start,
+			.end = res->end,
+		},
 	};
 	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
--- a/drivers/dax/kmem.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/kmem.c
@@ -22,7 +22,7 @@ static bool any_hotremove_failed;
 int dev_dax_kmem_probe(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
-	struct resource *res = &dev_dax->region->res;
+	struct range *range = &dev_dax->range;
 	resource_size_t kmem_start;
 	resource_size_t kmem_size;
 	resource_size_t kmem_end;
@@ -39,17 +39,17 @@ int dev_dax_kmem_probe(struct device *de
 	 */
 	numa_node = dev_dax->target_node;
 	if (numa_node < 0) {
-		dev_warn(dev, "rejecting DAX region %pR with invalid node: %d\n",
-			 res, numa_node);
+		dev_warn(dev, "rejecting DAX region with invalid node: %d\n",
+				numa_node);
 		return -EINVAL;
 	}
 
 	/* Hotplug starting at the beginning of the next block: */
-	kmem_start = ALIGN(res->start, memory_block_size_bytes());
+	kmem_start = ALIGN(range->start, memory_block_size_bytes());
 
-	kmem_size = resource_size(res);
+	kmem_size = range_len(range);
 	/* Adjust the size down to compensate for moving up kmem_start: */
-	kmem_size -= kmem_start - res->start;
+	kmem_size -= kmem_start - range->start;
 	/* Align the size down to cover only complete blocks: */
 	kmem_size &= ~(memory_block_size_bytes() - 1);
 	kmem_end = kmem_start + kmem_size;
--- a/drivers/dax/pmem/core.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/drivers/dax/pmem/core.c
@@ -63,6 +63,10 @@ struct dev_dax *__dax_pmem_probe(struct
 		.id = id,
 		.pgmap = &pgmap,
 		.subsys = subsys,
+		.range = {
+			.start = res.start,
+			.end = res.end,
+		},
 	};
 	dev_dax = devm_create_dev_dax(&data);
 
--- a/tools/testing/nvdimm/dax-dev.c~device-dax-make-pgmap-optional-for-instance-creation
+++ a/tools/testing/nvdimm/dax-dev.c
@@ -9,12 +9,12 @@
 phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
 {
-	struct resource *res = &dev_dax->region->res;
+	struct range *range = &dev_dax->range;
 	phys_addr_t addr;
 
-	addr = pgoff * PAGE_SIZE + res->start;
-	if (addr >= res->start && addr <= res->end) {
-		if (addr + size - 1 <= res->end) {
+	addr = pgoff * PAGE_SIZE + range->start;
+	if (addr >= range->start && addr <= range->end) {
+		if (addr + size - 1 <= range->end) {
 			if (get_nfit_res(addr)) {
 				struct page *page;
 
_

* [patch 036/181] device-dax/kmem: introduce dax_kmem_range()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (34 preceding siblings ...)
  2020-10-13 23:49 ` [patch 035/181] device-dax: make pgmap optional for instance creation Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 037/181] device-dax/kmem: move resource name tracking to drvdata Andrew Morton
                   ` (148 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax/kmem: introduce dax_kmem_range()

Towards removing the mode-specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and in preparation for multi-range support,
teach the driver to calculate the hotplug range from the device range.
The hotplug range is simply the device range aligned to whole memory
blocks.
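
A worked example of that calculation, assuming a hypothetical 128MiB
(0x8000000) memory block size and a device range chosen for illustration:

	/*
	 * dev_dax->range = [0x154000000, 0x2cbffffff]
	 *
	 * range.start = ALIGN(0x154000000, 0x8000000)           = 0x158000000
	 * range.end   = ALIGN_DOWN(0x2cc000000, 0x8000000) - 1  = 0x2c7ffffff
	 *
	 * i.e. the start is rounded up, and the end rounded down, to the
	 * nearest memory block boundary.
	 */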

Link: https://lkml.kernel.org/r/160106111109.30709.3173462396758431559.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/kmem.c |   40 +++++++++++++++++-----------------------
 1 file changed, 17 insertions(+), 23 deletions(-)

--- a/drivers/dax/kmem.c~device-dax-kmem-introduce-dax_kmem_range
+++ a/drivers/dax/kmem.c
@@ -19,13 +19,20 @@ static const char *kmem_name;
 /* Set if any memory will remain added when the driver will be unloaded. */
 static bool any_hotremove_failed;
 
+static struct range dax_kmem_range(struct dev_dax *dev_dax)
+{
+	struct range range;
+
+	/* memory-block align the hotplug range */
+	range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes());
+	range.end = ALIGN_DOWN(dev_dax->range.end + 1, memory_block_size_bytes()) - 1;
+	return range;
+}
+
 int dev_dax_kmem_probe(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
-	struct range *range = &dev_dax->range;
-	resource_size_t kmem_start;
-	resource_size_t kmem_size;
-	resource_size_t kmem_end;
+	struct range range = dax_kmem_range(dev_dax);
 	struct resource *new_res;
 	const char *new_res_name;
 	int numa_node;
@@ -44,25 +51,14 @@ int dev_dax_kmem_probe(struct device *de
 		return -EINVAL;
 	}
 
-	/* Hotplug starting at the beginning of the next block: */
-	kmem_start = ALIGN(range->start, memory_block_size_bytes());
-
-	kmem_size = range_len(range);
-	/* Adjust the size down to compensate for moving up kmem_start: */
-	kmem_size -= kmem_start - range->start;
-	/* Align the size down to cover only complete blocks: */
-	kmem_size &= ~(memory_block_size_bytes() - 1);
-	kmem_end = kmem_start + kmem_size;
-
 	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
 	if (!new_res_name)
 		return -ENOMEM;
 
 	/* Region is permanently reserved if hotremove fails. */
-	new_res = request_mem_region(kmem_start, kmem_size, new_res_name);
+	new_res = request_mem_region(range.start, range_len(&range), new_res_name);
 	if (!new_res) {
-		dev_warn(dev, "could not reserve region [%pa-%pa]\n",
-			 &kmem_start, &kmem_end);
+		dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end);
 		kfree(new_res_name);
 		return -EBUSY;
 	}
@@ -96,9 +92,8 @@ int dev_dax_kmem_probe(struct device *de
 static int dev_dax_kmem_remove(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct range range = dax_kmem_range(dev_dax);
 	struct resource *res = dev_dax->dax_kmem_res;
-	resource_size_t kmem_start = res->start;
-	resource_size_t kmem_size = resource_size(res);
 	const char *res_name = res->name;
 	int rc;
 
@@ -108,12 +103,11 @@ static int dev_dax_kmem_remove(struct de
 	 * there is no way to hotremove this memory until reboot because device
 	 * unbind will succeed even if we return failure.
 	 */
-	rc = remove_memory(dev_dax->target_node, kmem_start, kmem_size);
+	rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
 	if (rc) {
 		any_hotremove_failed = true;
-		dev_err(dev,
-			"DAX region %pR cannot be hotremoved until the next reboot\n",
-			res);
+		dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
+				range.start, range.end);
 		return rc;
 	}
 
_

* [patch 037/181] device-dax/kmem: move resource name tracking to drvdata
  2020-10-13 23:46 incoming Andrew Morton
                   ` (35 preceding siblings ...)
  2020-10-13 23:49 ` [patch 036/181] device-dax/kmem: introduce dax_kmem_range() Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:49 ` [patch 038/181] device-dax/kmem: replace release_resource() with release_mem_region() Andrew Morton
                   ` (147 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax/kmem: move resource name tracking to drvdata

Towards removing the mode-specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and in preparation for multi-range support, move
resource name tracking to driver data.  The resource name needs a lifetime
separate from the device bind lifetime to cover the case where the driver
is unbound but the kmem range could not be unplugged from the page
allocator.
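
A minimal sketch of the resulting pairing: the name is allocated at probe
time, parked in drvdata, and freed at remove time only after the range has
been successfully unplugged:

	/* probe */
	res_name = kstrdup(dev_name(dev), GFP_KERNEL);
	...
	dev_set_drvdata(dev, res_name);

	/* remove */
	const char *res_name = dev_get_drvdata(dev);
	...
	kfree(res_name);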

Link: https://lkml.kernel.org/r/160106111639.30709.17624822766862009183.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/kmem.c |   16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

--- a/drivers/dax/kmem.c~device-dax-kmem-move-resource-name-tracking-to-drvdata
+++ a/drivers/dax/kmem.c
@@ -34,7 +34,7 @@ int dev_dax_kmem_probe(struct device *de
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct range range = dax_kmem_range(dev_dax);
 	struct resource *new_res;
-	const char *new_res_name;
+	char *res_name;
 	int numa_node;
 	int rc;
 
@@ -51,15 +51,15 @@ int dev_dax_kmem_probe(struct device *de
 		return -EINVAL;
 	}
 
-	new_res_name = kstrdup(dev_name(dev), GFP_KERNEL);
-	if (!new_res_name)
+	res_name = kstrdup(dev_name(dev), GFP_KERNEL);
+	if (!res_name)
 		return -ENOMEM;
 
 	/* Region is permanently reserved if hotremove fails. */
-	new_res = request_mem_region(range.start, range_len(&range), new_res_name);
+	new_res = request_mem_region(range.start, range_len(&range), res_name);
 	if (!new_res) {
 		dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end);
-		kfree(new_res_name);
+		kfree(res_name);
 		return -EBUSY;
 	}
 
@@ -80,9 +80,11 @@ int dev_dax_kmem_probe(struct device *de
 	if (rc) {
 		release_resource(new_res);
 		kfree(new_res);
-		kfree(new_res_name);
+		kfree(res_name);
 		return rc;
 	}
+
+	dev_set_drvdata(dev, res_name);
 	dev_dax->dax_kmem_res = new_res;
 
 	return 0;
@@ -94,7 +96,7 @@ static int dev_dax_kmem_remove(struct de
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct range range = dax_kmem_range(dev_dax);
 	struct resource *res = dev_dax->dax_kmem_res;
-	const char *res_name = res->name;
+	const char *res_name = dev_get_drvdata(dev);
 	int rc;
 
 	/*
_

* [patch 038/181] device-dax/kmem: replace release_resource() with release_mem_region()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (36 preceding siblings ...)
  2020-10-13 23:49 ` [patch 037/181] device-dax/kmem: move resource name tracking to drvdata Andrew Morton
@ 2020-10-13 23:49 ` Andrew Morton
  2020-10-13 23:50 ` [patch 039/181] device-dax: add an allocation interface for device-dax instances Andrew Morton
                   ` (146 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:49 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax/kmem: replace release_resource() with release_mem_region()

Towards removing the mode-specific @dax_kmem_res attribute from the
generic 'struct dev_dax', and in preparation for multi-range support,
change the kmem driver to use the idiomatic release_mem_region() to pair
with the initial request_mem_region().  This also eliminates the need to
open-code the release of the resource allocated by request_mem_region().

As there are no more dax_kmem_res users, delete this struct member.
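
In sketch form, the open-coded release:

	release_resource(res);
	kfree(res);

collapses into the helper that naturally pairs with the allocation:

	res = request_mem_region(range.start, range_len(&range), res_name);
	...
	release_mem_region(range.start, range_len(&range));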

Link: https://lkml.kernel.org/r/160106112239.30709.15909567572288425294.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/dax-private.h |    3 ---
 drivers/dax/kmem.c        |   20 +++++++-------------
 2 files changed, 7 insertions(+), 16 deletions(-)

--- a/drivers/dax/dax-private.h~device-dax-kmem-replace-release_resource-with-release_mem_region
+++ a/drivers/dax/dax-private.h
@@ -42,8 +42,6 @@ struct dax_region {
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @range: resource range for the instance
- * @dax_mem_res: physical address range of hotadded DAX memory
- * @dax_mem_name: name for hotadded DAX memory via add_memory_driver_managed()
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -52,7 +50,6 @@ struct dev_dax {
 	struct device dev;
 	struct dev_pagemap *pgmap;
 	struct range range;
-	struct resource *dax_kmem_res;
 };
 
 static inline u64 range_len(struct range *range)
--- a/drivers/dax/kmem.c~device-dax-kmem-replace-release_resource-with-release_mem_region
+++ a/drivers/dax/kmem.c
@@ -33,7 +33,7 @@ int dev_dax_kmem_probe(struct device *de
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct range range = dax_kmem_range(dev_dax);
-	struct resource *new_res;
+	struct resource *res;
 	char *res_name;
 	int numa_node;
 	int rc;
@@ -56,8 +56,8 @@ int dev_dax_kmem_probe(struct device *de
 		return -ENOMEM;
 
 	/* Region is permanently reserved if hotremove fails. */
-	new_res = request_mem_region(range.start, range_len(&range), res_name);
-	if (!new_res) {
+	res = request_mem_region(range.start, range_len(&range), res_name);
+	if (!res) {
 		dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end);
 		kfree(res_name);
 		return -EBUSY;
@@ -69,23 +69,20 @@ int dev_dax_kmem_probe(struct device *de
 	 * inherit flags from the parent since it may set new flags
 	 * unknown to us that will break add_memory() below.
 	 */
-	new_res->flags = IORESOURCE_SYSTEM_RAM;
+	res->flags = IORESOURCE_SYSTEM_RAM;
 
 	/*
 	 * Ensure that future kexec'd kernels will not treat this as RAM
 	 * automatically.
 	 */
-	rc = add_memory_driver_managed(numa_node, new_res->start,
-				       resource_size(new_res), kmem_name);
+	rc = add_memory_driver_managed(numa_node, range.start, range_len(&range), kmem_name);
 	if (rc) {
-		release_resource(new_res);
-		kfree(new_res);
+		release_mem_region(range.start, range_len(&range));
 		kfree(res_name);
 		return rc;
 	}
 
 	dev_set_drvdata(dev, res_name);
-	dev_dax->dax_kmem_res = new_res;
 
 	return 0;
 }
@@ -95,7 +92,6 @@ static int dev_dax_kmem_remove(struct de
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct range range = dax_kmem_range(dev_dax);
-	struct resource *res = dev_dax->dax_kmem_res;
 	const char *res_name = dev_get_drvdata(dev);
 	int rc;
 
@@ -114,10 +110,8 @@ static int dev_dax_kmem_remove(struct de
 	}
 
 	/* Release and free dax resources */
-	release_resource(res);
-	kfree(res);
+	release_mem_region(range.start, range_len(&range));
 	kfree(res_name);
-	dev_dax->dax_kmem_res = NULL;
 
 	return 0;
 }
_

* [patch 039/181] device-dax: add an allocation interface for device-dax instances
  2020-10-13 23:46 incoming Andrew Morton
                   ` (37 preceding siblings ...)
  2020-10-13 23:49 ` [patch 038/181] device-dax/kmem: replace release_resource() with release_mem_region() Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 040/181] device-dax: introduce 'struct dev_dax' typed-driver operations Andrew Morton
                   ` (145 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: add an allocation interface for device-dax instances

In preparation for a facility that enables dax regions to be sub-divided,
introduce infrastructure to track and allocate region capacity.

The new dax_region/available_size attribute is only enabled for volatile
hmem devices, not for pmem devices whose boundaries are defined by nvdimm
namespaces.  This is per Jeff's feedback from the last time dynamic
device-dax capacity allocation support was discussed.
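
The accounting itself is a simple walk of the region's resource tree (see
dax_region_avail_size() below): the available capacity is the region size
minus every child resource already claimed by an instance:

	size = resource_size(&dax_region->res);
	for_each_dax_region_resource(dax_region, res)
		size -= resource_size(res);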

Link: https://lore.kernel.org/linux-nvdimm/x49shpp3zn8.fsf@segfault.boston.devel.redhat.com
Link: https://lkml.kernel.org/r/159643101035.4062302.6785857915652647857.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106112801.30709.14601438735305335071.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |  120 +++++++++++++++++++++++++++++++++---
 drivers/dax/bus.h         |    7 +-
 drivers/dax/dax-private.h |    2 
 drivers/dax/hmem/hmem.c   |    7 --
 drivers/dax/pmem/core.c   |    8 --
 5 files changed, 121 insertions(+), 23 deletions(-)

--- a/drivers/dax/bus.c~device-dax-add-an-allocation-interface-for-device-dax-instances
+++ a/drivers/dax/bus.c
@@ -130,6 +130,11 @@ ATTRIBUTE_GROUPS(dax_drv);
 
 static int dax_bus_match(struct device *dev, struct device_driver *drv);
 
+static bool is_static(struct dax_region *dax_region)
+{
+	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
+}
+
 static struct bus_type dax_bus_type = {
 	.name = "dax",
 	.uevent = dax_bus_uevent,
@@ -185,7 +190,48 @@ static ssize_t align_show(struct device
 }
 static DEVICE_ATTR_RO(align);
 
+#define for_each_dax_region_resource(dax_region, res) \
+	for (res = (dax_region)->res.child; res; res = res->sibling)
+
+static unsigned long long dax_region_avail_size(struct dax_region *dax_region)
+{
+	resource_size_t size = resource_size(&dax_region->res);
+	struct resource *res;
+
+	device_lock_assert(dax_region->dev);
+
+	for_each_dax_region_resource(dax_region, res)
+		size -= resource_size(res);
+	return size;
+}
+
+static ssize_t available_size_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	unsigned long long size;
+
+	device_lock(dev);
+	size = dax_region_avail_size(dax_region);
+	device_unlock(dev);
+
+	return sprintf(buf, "%llu\n", size);
+}
+static DEVICE_ATTR_RO(available_size);
+
+static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
+		int n)
+{
+	struct device *dev = container_of(kobj, struct device, kobj);
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+
+	if (is_static(dax_region) && a == &dev_attr_available_size.attr)
+		return 0;
+	return a->mode;
+}
+
 static struct attribute *dax_region_attributes[] = {
+	&dev_attr_available_size.attr,
 	&dev_attr_region_size.attr,
 	&dev_attr_align.attr,
 	&dev_attr_id.attr,
@@ -195,6 +241,7 @@ static struct attribute *dax_region_attr
 static const struct attribute_group dax_region_attribute_group = {
 	.name = "dax_region",
 	.attrs = dax_region_attributes,
+	.is_visible = dax_region_visible,
 };
 
 static const struct attribute_group *dax_region_attribute_groups[] = {
@@ -226,7 +273,8 @@ static void dax_region_unregister(void *
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align)
+		struct resource *res, int target_node, unsigned int align,
+		unsigned long flags)
 {
 	struct dax_region *dax_region;
 
@@ -249,12 +297,17 @@ struct dax_region *alloc_dax_region(stru
 		return NULL;
 
 	dev_set_drvdata(parent, dax_region);
-	memcpy(&dax_region->res, res, sizeof(*res));
 	kref_init(&dax_region->kref);
 	dax_region->id = region_id;
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->target_node = target_node;
+	dax_region->res = (struct resource) {
+		.start = res->start,
+		.end = res->end,
+		.flags = IORESOURCE_MEM | flags,
+	};
+
 	if (sysfs_create_groups(&parent->kobj, dax_region_attribute_groups)) {
 		kfree(dax_region);
 		return NULL;
@@ -267,6 +320,32 @@ struct dax_region *alloc_dax_region(stru
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
 
+static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct resource *res = &dax_region->res;
+	struct device *dev = &dev_dax->dev;
+	struct resource *alloc;
+
+	device_lock_assert(dax_region->dev);
+
+	/* TODO: handle multiple allocations per region */
+	if (res->child)
+		return -ENOMEM;
+
+	alloc = __request_region(res, res->start, size, dev_name(dev), 0);
+
+	if (!alloc)
+		return -ENOMEM;
+
+	dev_dax->range = (struct range) {
+		.start = alloc->start,
+		.end = alloc->end,
+	};
+
+	return 0;
+}
+
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -361,6 +440,15 @@ void kill_dev_dax(struct dev_dax *dev_da
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);
 
+static void free_dev_dax_range(struct dev_dax *dev_dax)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct range *range = &dev_dax->range;
+
+	device_lock_assert(dax_region->dev);
+	__release_region(&dax_region->res, range->start, range_len(range));
+}
+
 static void dev_dax_release(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
@@ -385,6 +473,7 @@ static void unregister_dev_dax(void *dev
 	dev_dbg(dev, "%s\n", __func__);
 
 	kill_dev_dax(dev_dax);
+	free_dev_dax_range(dev_dax);
 	device_del(dev);
 	put_device(dev);
 }
@@ -397,7 +486,7 @@ struct dev_dax *devm_create_dev_dax(stru
 	struct dev_dax *dev_dax;
 	struct inode *inode;
 	struct device *dev;
-	int rc = -ENOMEM;
+	int rc;
 
 	if (data->id < 0)
 		return ERR_PTR(-EINVAL);
@@ -406,11 +495,25 @@ struct dev_dax *devm_create_dev_dax(stru
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
+	dev_dax->region = dax_region;
+	dev = &dev_dax->dev;
+	device_initialize(dev);
+	dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
+
+	rc = alloc_dev_dax_range(dev_dax, data->size);
+	if (rc)
+		goto err_range;
+
 	if (data->pgmap) {
+		dev_WARN_ONCE(parent, !is_static(dax_region),
+			"custom dev_pagemap requires a static dax_region\n");
+
 		dev_dax->pgmap = kmemdup(data->pgmap,
 				sizeof(struct dev_pagemap), GFP_KERNEL);
-		if (!dev_dax->pgmap)
+		if (!dev_dax->pgmap) {
+			rc = -ENOMEM;
 			goto err_pgmap;
+		}
 	}
 
 	/*
@@ -427,12 +530,7 @@ struct dev_dax *devm_create_dev_dax(stru
 	kill_dax(dax_dev);
 
 	/* from here on we're committed to teardown via dev_dax_release() */
-	dev = &dev_dax->dev;
-	device_initialize(dev);
-
 	dev_dax->dax_dev = dax_dev;
-	dev_dax->region = dax_region;
-	dev_dax->range = data->range;
 	dev_dax->target_node = dax_region->target_node;
 	kref_get(&dax_region->kref);
 
@@ -444,7 +542,6 @@ struct dev_dax *devm_create_dev_dax(stru
 		dev->class = dax_class;
 	dev->parent = parent;
 	dev->type = &dev_dax_type;
-	dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
 
 	rc = device_add(dev);
 	if (rc) {
@@ -458,9 +555,12 @@ struct dev_dax *devm_create_dev_dax(stru
 		return ERR_PTR(rc);
 
 	return dev_dax;
+
 err_alloc_dax:
 	kfree(dev_dax->pgmap);
 err_pgmap:
+	free_dev_dax_range(dev_dax);
+err_range:
 	kfree(dev_dax);
 
 	return ERR_PTR(rc);
--- a/drivers/dax/bus.h~device-dax-add-an-allocation-interface-for-device-dax-instances
+++ a/drivers/dax/bus.h
@@ -10,8 +10,11 @@ struct resource;
 struct dax_device;
 struct dax_region;
 void dax_region_put(struct dax_region *dax_region);
+
+#define IORESOURCE_DAX_STATIC (1UL << 0)
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align);
+		struct resource *res, int target_node, unsigned int align,
+		unsigned long flags);
 
 enum dev_dax_subsys {
 	DEV_DAX_BUS = 0, /* zeroed dev_dax_data picks this by default */
@@ -22,7 +25,7 @@ struct dev_dax_data {
 	struct dax_region *dax_region;
 	struct dev_pagemap *pgmap;
 	enum dev_dax_subsys subsys;
-	struct range range;
+	resource_size_t size;
 	int id;
 };
 
--- a/drivers/dax/dax-private.h~device-dax-add-an-allocation-interface-for-device-dax-instances
+++ a/drivers/dax/dax-private.h
@@ -22,7 +22,7 @@ void dax_bus_exit(void);
  * @kref: to pin while other agents have a need to do lookups
  * @dev: parent device backing this region
  * @align: allocation and mapping alignment for child dax devices
- * @res: physical address range of the region
+ * @res: resource tree to track instance allocations
  */
 struct dax_region {
 	int id;
--- a/drivers/dax/hmem/hmem.c~device-dax-add-an-allocation-interface-for-device-dax-instances
+++ a/drivers/dax/hmem/hmem.c
@@ -20,17 +20,14 @@ static int dax_hmem_probe(struct platfor
 
 	mri = dev->platform_data;
 	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
-			PMD_SIZE);
+			PMD_SIZE, 0);
 	if (!dax_region)
 		return -ENOMEM;
 
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = 0,
-		.range = {
-			.start = res->start,
-			.end = res->end,
-		},
+		.size = resource_size(res),
 	};
 	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
--- a/drivers/dax/pmem/core.c~device-dax-add-an-allocation-interface-for-device-dax-instances
+++ a/drivers/dax/pmem/core.c
@@ -54,7 +54,8 @@ struct dev_dax *__dax_pmem_probe(struct
 	memcpy(&res, &pgmap.res, sizeof(res));
 	res.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &res,
-			nd_region->target_node, le32_to_cpu(pfn_sb->align));
+			nd_region->target_node, le32_to_cpu(pfn_sb->align),
+			IORESOURCE_DAX_STATIC);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
@@ -63,10 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct
 		.id = id,
 		.pgmap = &pgmap,
 		.subsys = subsys,
-		.range = {
-			.start = res.start,
-			.end = res.end,
-		},
+		.size = resource_size(&res),
 	};
 	dev_dax = devm_create_dev_dax(&data);
 
_

* [patch 040/181] device-dax: introduce 'struct dev_dax' typed-driver operations
  2020-10-13 23:46 incoming Andrew Morton
                   ` (38 preceding siblings ...)
  2020-10-13 23:50 ` [patch 039/181] device-dax: add an allocation interface for device-dax instances Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 041/181] device-dax: introduce 'seed' devices Andrew Morton
                   ` (144 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: introduce 'struct dev_dax' typed-driver operations

In preparation for introducing seed devices, the dax-bus core needs to be
able to intercept ->probe() and ->remove() operations.  Towards that end,
arrange for the bus and drivers to switch from raw 'struct device' driver
operations to 'struct dev_dax' typed operations.
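
In sketch form, a driver moves from populating the embedded 'struct
device_driver' callbacks:

	static struct dax_device_driver device_dax_driver = {
		.drv = {
			.probe = dev_dax_probe,	/* takes struct device * */
		},
	};

to typed hooks that the dax-bus core dispatches to:

	static struct dax_device_driver device_dax_driver = {
		.probe = dev_dax_probe,		/* takes struct dev_dax * */
	};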

Link: https://lkml.kernel.org/r/160106113357.30709.4541750544799737855.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Hulk Robot <hulkci@huawei.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |   18 ++++++++++++++++++
 drivers/dax/bus.h         |    4 +++-
 drivers/dax/device.c      |   12 +++++-------
 drivers/dax/kmem.c        |   18 ++++++++----------
 drivers/dax/pmem/compat.c |    2 +-
 5 files changed, 35 insertions(+), 19 deletions(-)

--- a/drivers/dax/bus.c~device-dax-introduce-struct-dev_dax-typed-driver-operations
+++ a/drivers/dax/bus.c
@@ -135,10 +135,28 @@ static bool is_static(struct dax_region
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+static int dax_bus_probe(struct device *dev)
+{
+	struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	return dax_drv->probe(dev_dax);
+}
+
+static int dax_bus_remove(struct device *dev)
+{
+	struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	return dax_drv->remove(dev_dax);
+}
+
 static struct bus_type dax_bus_type = {
 	.name = "dax",
 	.uevent = dax_bus_uevent,
 	.match = dax_bus_match,
+	.probe = dax_bus_probe,
+	.remove = dax_bus_remove,
 	.drv_groups = dax_drv_groups,
 };
 
--- a/drivers/dax/bus.h~device-dax-introduce-struct-dev_dax-typed-driver-operations
+++ a/drivers/dax/bus.h
@@ -38,6 +38,8 @@ struct dax_device_driver {
 	struct device_driver drv;
 	struct list_head ids;
 	int match_always;
+	int (*probe)(struct dev_dax *dev);
+	int (*remove)(struct dev_dax *dev);
 };
 
 int __dax_driver_register(struct dax_device_driver *dax_drv,
@@ -48,7 +50,7 @@ void dax_driver_unregister(struct dax_de
 void kill_dev_dax(struct dev_dax *dev_dax);
 
 #if IS_ENABLED(CONFIG_DEV_DAX_PMEM_COMPAT)
-int dev_dax_probe(struct device *dev);
+int dev_dax_probe(struct dev_dax *dev_dax);
 #endif
 
 /*
--- a/drivers/dax/device.c~device-dax-introduce-struct-dev_dax-typed-driver-operations
+++ a/drivers/dax/device.c
@@ -392,11 +392,11 @@ static void dev_dax_kill(void *dev_dax)
 	kill_dev_dax(dev_dax);
 }
 
-int dev_dax_probe(struct device *dev)
+int dev_dax_probe(struct dev_dax *dev_dax)
 {
-	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 	struct range *range = &dev_dax->range;
+	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
 	struct inode *inode;
 	struct cdev *cdev;
@@ -446,17 +446,15 @@ int dev_dax_probe(struct device *dev)
 }
 EXPORT_SYMBOL_GPL(dev_dax_probe);
 
-static int dev_dax_remove(struct device *dev)
+static int dev_dax_remove(struct dev_dax *dev_dax)
 {
 	/* all probe actions are unwound by devm */
 	return 0;
 }
 
 static struct dax_device_driver device_dax_driver = {
-	.drv = {
-		.probe = dev_dax_probe,
-		.remove = dev_dax_remove,
-	},
+	.probe = dev_dax_probe,
+	.remove = dev_dax_remove,
 	.match_always = 1,
 };
 
--- a/drivers/dax/kmem.c~device-dax-introduce-struct-dev_dax-typed-driver-operations
+++ a/drivers/dax/kmem.c
@@ -29,10 +29,10 @@ static struct range dax_kmem_range(struc
 	return range;
 }
 
-int dev_dax_kmem_probe(struct device *dev)
+static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
-	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct range range = dax_kmem_range(dev_dax);
+	struct device *dev = &dev_dax->dev;
 	struct resource *res;
 	char *res_name;
 	int numa_node;
@@ -88,12 +88,12 @@ int dev_dax_kmem_probe(struct device *de
 }
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
-static int dev_dax_kmem_remove(struct device *dev)
+static int dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-	struct dev_dax *dev_dax = to_dev_dax(dev);
+	int rc;
+	struct device *dev = &dev_dax->dev;
 	struct range range = dax_kmem_range(dev_dax);
 	const char *res_name = dev_get_drvdata(dev);
-	int rc;
 
 	/*
 	 * We have one shot for removing memory, if some memory blocks were not
@@ -116,7 +116,7 @@ static int dev_dax_kmem_remove(struct de
 	return 0;
 }
 #else
-static int dev_dax_kmem_remove(struct device *dev)
+static int dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
 	/*
 	 * Without hotremove purposely leak the request_mem_region() for the
@@ -131,10 +131,8 @@ static int dev_dax_kmem_remove(struct de
 #endif /* CONFIG_MEMORY_HOTREMOVE */
 
 static struct dax_device_driver device_dax_kmem_driver = {
-	.drv = {
-		.probe = dev_dax_kmem_probe,
-		.remove = dev_dax_kmem_remove,
-	},
+	.probe = dev_dax_kmem_probe,
+	.remove = dev_dax_kmem_remove,
 };
 
 static int __init dax_kmem_init(void)
--- a/drivers/dax/pmem/compat.c~device-dax-introduce-struct-dev_dax-typed-driver-operations
+++ a/drivers/dax/pmem/compat.c
@@ -22,7 +22,7 @@ static int dax_pmem_compat_probe(struct
 		return -ENOMEM;
 
 	device_lock(&dev_dax->dev);
-	rc = dev_dax_probe(&dev_dax->dev);
+	rc = dev_dax_probe(dev_dax);
 	device_unlock(&dev_dax->dev);
 
 	devres_close_group(&dev_dax->dev, dev_dax);
_

* [patch 041/181] device-dax: introduce 'seed' devices
  2020-10-13 23:46 incoming Andrew Morton
                   ` (39 preceding siblings ...)
  2020-10-13 23:50 ` [patch 040/181] device-dax: introduce 'struct dev_dax' typed-driver operations Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 042/181] drivers/base: make device_find_child_by_name() compatible with sysfs inputs Andrew Morton
                   ` (143 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: introduce 'seed' devices

Add a seed device concept so that a dynamic dax region can be split
amongst multiple sub-instances.  A seed device, similar to libnvdimm seed
devices, is a device that starts with zero capacity allocated and is
unbound to a driver.  In contrast to libnvdimm seed devices, explicit
'create' and 'delete' interfaces are added to the region to trigger seeds
to be created and unused devices to be reclaimed.  This explicit
create/delete replaces the implicit create as a side effect of probe, and
the implicit delete on writing 0 to the size, that libnvdimm implements.

Delete can be performed on any 0-sized and idle device.  This avoids the
gymnastics of needing to move device_unregister() to its own async
context.  Specifically, it avoids the deadlock of deleting a device via
one of its own attributes.  It is also less surprising to userspace, which
never sees an extra device it did not request.

For now, just add the device creation, teardown, and ->probe() prevention.
A later patch will arrange for the 'dax/size' attribute to be writable to
allocate capacity from the region.
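
In sketch form (per create_store() and dax_bus_probe() below), writing '1'
to the region's 'create' attribute instantiates a zero-sized, unbound
seed:

	struct dev_dax_data data = {
		.dax_region = dax_region,
		.size = 0,
		.id = -1,	/* no statically assigned id */
	};
	dev_dax = devm_create_dev_dax(&data);

and the bus refuses to bind a driver until the instance has been sized:

	if (range_len(range) == 0 || dev_dax->id < 0)
		return -ENXIO;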

Link: https://lkml.kernel.org/r/159643101583.4062302.12255093902950754962.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106113873.30709.15168756050631539431.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |  301 +++++++++++++++++++++++++++++++-----
 drivers/dax/dax-private.h |    9 +
 drivers/dax/hmem/hmem.c   |    2 
 3 files changed, 272 insertions(+), 40 deletions(-)

--- a/drivers/dax/bus.c~device-dax-introduce-seed-devices
+++ a/drivers/dax/bus.c
@@ -139,8 +139,26 @@ static int dax_bus_probe(struct device *
 {
 	struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
 	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	struct range *range = &dev_dax->range;
+	int rc;
+
+	if (range_len(range) == 0 || dev_dax->id < 0)
+		return -ENXIO;
+
+	rc = dax_drv->probe(dev_dax);
 
-	return dax_drv->probe(dev_dax);
+	if (rc || is_static(dax_region))
+		return rc;
+
+	/*
+	 * Track new seed creation only after successful probe of the
+	 * previous seed.
+	 */
+	if (dax_region->seed == dev)
+		dax_region->seed = NULL;
+
+	return 0;
 }
 
 static int dax_bus_remove(struct device *dev)
@@ -237,14 +255,216 @@ static ssize_t available_size_show(struc
 }
 static DEVICE_ATTR_RO(available_size);
 
+static ssize_t seed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct device *seed;
+	ssize_t rc;
+
+	if (is_static(dax_region))
+		return -EINVAL;
+
+	device_lock(dev);
+	seed = dax_region->seed;
+	rc = sprintf(buf, "%s\n", seed ? dev_name(seed) : "");
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RO(seed);
+
+static ssize_t create_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct device *youngest;
+	ssize_t rc;
+
+	if (is_static(dax_region))
+		return -EINVAL;
+
+	device_lock(dev);
+	youngest = dax_region->youngest;
+	rc = sprintf(buf, "%s\n", youngest ? dev_name(youngest) : "");
+	device_unlock(dev);
+
+	return rc;
+}
+
+static ssize_t create_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	unsigned long long avail;
+	ssize_t rc;
+	int val;
+
+	if (is_static(dax_region))
+		return -EINVAL;
+
+	rc = kstrtoint(buf, 0, &val);
+	if (rc)
+		return rc;
+	if (val != 1)
+		return -EINVAL;
+
+	device_lock(dev);
+	avail = dax_region_avail_size(dax_region);
+	if (avail == 0)
+		rc = -ENOSPC;
+	else {
+		struct dev_dax_data data = {
+			.dax_region = dax_region,
+			.size = 0,
+			.id = -1,
+		};
+		struct dev_dax *dev_dax = devm_create_dev_dax(&data);
+
+		if (IS_ERR(dev_dax))
+			rc = PTR_ERR(dev_dax);
+		else {
+			/*
+			 * In support of crafting multiple new devices
+			 * simultaneously multiple seeds can be created,
+			 * but only the first one that has not been
+			 * successfully bound is tracked as the region
+			 * seed.
+			 */
+			if (!dax_region->seed)
+				dax_region->seed = &dev_dax->dev;
+			dax_region->youngest = &dev_dax->dev;
+			rc = len;
+		}
+	}
+	device_unlock(dev);
+
+	return rc;
+}
+static DEVICE_ATTR_RW(create);
+
+void kill_dev_dax(struct dev_dax *dev_dax)
+{
+	struct dax_device *dax_dev = dev_dax->dax_dev;
+	struct inode *inode = dax_inode(dax_dev);
+
+	kill_dax(dax_dev);
+	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
+}
+EXPORT_SYMBOL_GPL(kill_dev_dax);
+
+static void free_dev_dax_range(struct dev_dax *dev_dax)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct range *range = &dev_dax->range;
+
+	device_lock_assert(dax_region->dev);
+	if (range_len(range))
+		__release_region(&dax_region->res, range->start,
+				range_len(range));
+}
+
+static void unregister_dev_dax(void *dev)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	kill_dev_dax(dev_dax);
+	free_dev_dax_range(dev_dax);
+	device_del(dev);
+	put_device(dev);
+}
+
+/* a return value >= 0 indicates this invocation invalidated the id */
+static int __free_dev_dax_id(struct dev_dax *dev_dax)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct device *dev = &dev_dax->dev;
+	int rc = dev_dax->id;
+
+	device_lock_assert(dev);
+
+	if (is_static(dax_region) || dev_dax->id < 0)
+		return -1;
+	ida_free(&dax_region->ida, dev_dax->id);
+	dev_dax->id = -1;
+	return rc;
+}
+
+static int free_dev_dax_id(struct dev_dax *dev_dax)
+{
+	struct device *dev = &dev_dax->dev;
+	int rc;
+
+	device_lock(dev);
+	rc = __free_dev_dax_id(dev_dax);
+	device_unlock(dev);
+	return rc;
+}
+
+static ssize_t delete_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	struct dax_region *dax_region = dev_get_drvdata(dev);
+	struct dev_dax *dev_dax;
+	struct device *victim;
+	bool do_del = false;
+	int rc;
+
+	if (is_static(dax_region))
+		return -EINVAL;
+
+	victim = device_find_child_by_name(dax_region->dev, buf);
+	if (!victim)
+		return -ENXIO;
+
+	device_lock(dev);
+	device_lock(victim);
+	dev_dax = to_dev_dax(victim);
+	if (victim->driver || range_len(&dev_dax->range))
+		rc = -EBUSY;
+	else {
+		/*
+		 * Invalidate the device so it does not become active
+		 * again, but always preserve device-id-0 so that
+		 * /sys/bus/dax/ is guaranteed to be populated while any
+		 * dax_region is registered.
+		 */
+		if (dev_dax->id > 0) {
+			do_del = __free_dev_dax_id(dev_dax) >= 0;
+			rc = len;
+			if (dax_region->seed == victim)
+				dax_region->seed = NULL;
+			if (dax_region->youngest == victim)
+				dax_region->youngest = NULL;
+		} else
+			rc = -EBUSY;
+	}
+	device_unlock(victim);
+
+	/* won the race to invalidate the device, clean it up */
+	if (do_del)
+		devm_release_action(dev, unregister_dev_dax, victim);
+	device_unlock(dev);
+	put_device(victim);
+
+	return rc;
+}
+static DEVICE_ATTR_WO(delete);
+
 static umode_t dax_region_visible(struct kobject *kobj, struct attribute *a,
 		int n)
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct dax_region *dax_region = dev_get_drvdata(dev);
 
-	if (is_static(dax_region) && a == &dev_attr_available_size.attr)
-		return 0;
+	if (is_static(dax_region))
+		if (a == &dev_attr_available_size.attr
+				|| a == &dev_attr_create.attr
+				|| a == &dev_attr_seed.attr
+				|| a == &dev_attr_delete.attr)
+			return 0;
 	return a->mode;
 }
 
@@ -252,6 +472,9 @@ static struct attribute *dax_region_attr
 	&dev_attr_available_size.attr,
 	&dev_attr_region_size.attr,
 	&dev_attr_align.attr,
+	&dev_attr_create.attr,
+	&dev_attr_seed.attr,
+	&dev_attr_delete.attr,
 	&dev_attr_id.attr,
 	NULL,
 };
@@ -320,6 +543,7 @@ struct dax_region *alloc_dax_region(stru
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->target_node = target_node;
+	ida_init(&dax_region->ida);
 	dax_region->res = (struct resource) {
 		.start = res->start,
 		.end = res->end,
@@ -347,6 +571,15 @@ static int alloc_dev_dax_range(struct de
 
 	device_lock_assert(dax_region->dev);
 
+	/* handle the seed alloc special case */
+	if (!size) {
+		dev_dax->range = (struct range) {
+			.start = res->start,
+			.end = res->start - 1,
+		};
+		return 0;
+	}
+
 	/* TODO: handle multiple allocations per region */
 	if (res->child)
 		return -ENOMEM;
@@ -448,33 +681,15 @@ static const struct attribute_group *dax
 	NULL,
 };
 
-void kill_dev_dax(struct dev_dax *dev_dax)
-{
-	struct dax_device *dax_dev = dev_dax->dax_dev;
-	struct inode *inode = dax_inode(dax_dev);
-
-	kill_dax(dax_dev);
-	unmap_mapping_range(inode->i_mapping, 0, 0, 1);
-}
-EXPORT_SYMBOL_GPL(kill_dev_dax);
-
-static void free_dev_dax_range(struct dev_dax *dev_dax)
-{
-	struct dax_region *dax_region = dev_dax->region;
-	struct range *range = &dev_dax->range;
-
-	device_lock_assert(dax_region->dev);
-	__release_region(&dax_region->res, range->start, range_len(range));
-}
-
 static void dev_dax_release(struct device *dev)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct dax_region *dax_region = dev_dax->region;
 	struct dax_device *dax_dev = dev_dax->dax_dev;
 
-	dax_region_put(dax_region);
 	put_dax(dax_dev);
+	free_dev_dax_id(dev_dax);
+	dax_region_put(dax_region);
 	kfree(dev_dax->pgmap);
 	kfree(dev_dax);
 }
@@ -484,18 +699,6 @@ static const struct device_type dev_dax_
 	.groups = dax_attribute_groups,
 };
 
-static void unregister_dev_dax(void *dev)
-{
-	struct dev_dax *dev_dax = to_dev_dax(dev);
-
-	dev_dbg(dev, "%s\n", __func__);
-
-	kill_dev_dax(dev_dax);
-	free_dev_dax_range(dev_dax);
-	device_del(dev);
-	put_device(dev);
-}
-
 struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 {
 	struct dax_region *dax_region = data->dax_region;
@@ -506,17 +709,35 @@ struct dev_dax *devm_create_dev_dax(stru
 	struct device *dev;
 	int rc;
 
-	if (data->id < 0)
-		return ERR_PTR(-EINVAL);
-
 	dev_dax = kzalloc(sizeof(*dev_dax), GFP_KERNEL);
 	if (!dev_dax)
 		return ERR_PTR(-ENOMEM);
 
+	if (is_static(dax_region)) {
+		if (dev_WARN_ONCE(parent, data->id < 0,
+				"dynamic id specified to static region\n")) {
+			rc = -EINVAL;
+			goto err_id;
+		}
+
+		dev_dax->id = data->id;
+	} else {
+		if (dev_WARN_ONCE(parent, data->id >= 0,
+				"static id specified to dynamic region\n")) {
+			rc = -EINVAL;
+			goto err_id;
+		}
+
+		rc = ida_alloc(&dax_region->ida, GFP_KERNEL);
+		if (rc < 0)
+			goto err_id;
+		dev_dax->id = rc;
+	}
+
 	dev_dax->region = dax_region;
 	dev = &dev_dax->dev;
 	device_initialize(dev);
-	dev_set_name(dev, "dax%d.%d", dax_region->id, data->id);
+	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
 	rc = alloc_dev_dax_range(dev_dax, data->size);
 	if (rc)
@@ -579,6 +800,8 @@ err_alloc_dax:
 err_pgmap:
 	free_dev_dax_range(dev_dax);
 err_range:
+	free_dev_dax_id(dev_dax);
+err_id:
 	kfree(dev_dax);
 
 	return ERR_PTR(rc);
--- a/drivers/dax/dax-private.h~device-dax-introduce-seed-devices
+++ a/drivers/dax/dax-private.h
@@ -7,6 +7,7 @@
 
 #include <linux/device.h>
 #include <linux/cdev.h>
+#include <linux/idr.h>
 
 /* private routines between core files */
 struct dax_device;
@@ -22,7 +23,10 @@ void dax_bus_exit(void);
  * @kref: to pin while other agents have a need to do lookups
  * @dev: parent device backing this region
  * @align: allocation and mapping alignment for child dax devices
+ * @ida: instance id allocator
  * @res: resource tree to track instance allocations
+ * @seed: allow userspace to find the first unbound seed device
+ * @youngest: allow userspace to find the most recently created device
  */
 struct dax_region {
 	int id;
@@ -30,7 +34,10 @@ struct dax_region {
 	struct kref kref;
 	struct device *dev;
 	unsigned int align;
+	struct ida ida;
 	struct resource res;
+	struct device *seed;
+	struct device *youngest;
 };
 
 /**
@@ -39,6 +46,7 @@ struct dax_region {
  * @region - parent region
  * @dax_dev - core dax functionality
  * @target_node: effective numa node if dev_dax memory range is onlined
+ * @id: ida allocated id
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @range: resource range for the instance
@@ -47,6 +55,7 @@ struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
 	int target_node;
+	int id;
 	struct device dev;
 	struct dev_pagemap *pgmap;
 	struct range range;
--- a/drivers/dax/hmem/hmem.c~device-dax-introduce-seed-devices
+++ a/drivers/dax/hmem/hmem.c
@@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platfor
 
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
-		.id = 0,
+		.id = -1,
 		.size = resource_size(res),
 	};
 	dev_dax = devm_create_dev_dax(&data);
_


* [patch 042/181] drivers/base: make device_find_child_by_name() compatible with sysfs inputs
  2020-10-13 23:46 incoming Andrew Morton
                   ` (40 preceding siblings ...)
  2020-10-13 23:50 ` [patch 041/181] device-dax: introduce 'seed' devices Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 043/181] device-dax: add resize support Andrew Morton
                   ` (142 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: drivers/base: make device_find_child_by_name() compatible with sysfs inputs

Use sysfs_streq() in device_find_child_by_name() so that it can match a
sysfs input string that may contain a trailing newline.

The other "device by name" interfaces,
{bus,driver,class}_find_device_by_name(), already account for sysfs
strings.
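
As a concrete illustration tied to the previous patch (paths and device
names hypothetical): the dax_region 'delete' attribute hands its input
buffer to device_find_child_by_name(), and a bare echo appends a
newline, so before this change only the -n form matched:

    cd /sys/bus/dax/devices
    echo -n dax0.1 > $(readlink -f dax0.0)/../dax_region/delete  # matched
    echo dax0.1 > $(readlink -f dax0.0)/../dax_region/delete     # now also matches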

Link: https://lkml.kernel.org/r/159643102106.4062302.12229802117645312104.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106114576.30709.2960091665444712180.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/base/core.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/base/core.c~drivers-base-make-device_find_child_by_name-compatible-with-sysfs-inputs
+++ a/drivers/base/core.c
@@ -3324,7 +3324,7 @@ struct device *device_find_child_by_name
 
 	klist_iter_init(&parent->p->klist_children, &i);
 	while ((child = next_device(&i)))
-		if (!strcmp(dev_name(child), name) && get_device(child))
+		if (sysfs_streq(dev_name(child), name) && get_device(child))
 			break;
 	klist_iter_exit(&i);
 	return child;
_


* [patch 043/181] device-dax: add resize support
  2020-10-13 23:46 incoming Andrew Morton
                   ` (41 preceding siblings ...)
  2020-10-13 23:50 ` [patch 042/181] drivers/base: make device_find_child_by_name() compatible with sysfs inputs Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 044/181] mm/memremap_pages: convert to 'struct range' Andrew Morton
                   ` (141 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: add resize support

Make the device-dax 'size' attribute writable to allow capacity to be
split between multiple instances in a region.  The intended consumers of
this capability are users that want to split a scarce memory resource
between device-dax and System-RAM access, or users that want to have
multiple security domains for a large region.

By default the hmem instance provider allocates an entire region to the
first instance.  The process of creating a new instance (assuming a
region-id of 0) is to find the region and trigger the 'create'
attribute, which yields an empty instance to configure.  For example:

    cd /sys/bus/dax/devices
    echo dax0.0 > dax0.0/driver/unbind
    echo $new_size > dax0.0/size
    echo 1 > $(readlink -f dax0.0)../dax_region/create
    seed=$(cat $(readlink -f dax0.0)../dax_region/seed)
    echo $new_size > $seed/size
    echo dax0.0 > ../drivers/{device_dax,kmem}/bind
    echo dax0.1 > ../drivers/{device_dax,kmem}/bind

Instances can be destroyed by:

    echo $device > $(readlink -f $device)../dax_region/delete
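
Writes to 'size' (the $new_size values above) must be aligned or they
fail with -EINVAL; a minimal sketch of picking a valid size, assuming
the region's 'align' attribute reports the allocation alignment (the
effective minimum may be larger, per memremap_compat_align()):

    cd /sys/bus/dax/devices
    align=$(cat $(readlink -f dax0.0)/../dax_region/align)
    echo $((16 * align)) > dax0.0/size   # any multiple of the alignment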

Link: https://lkml.kernel.org/r/159643102625.4062302.7431838945566033852.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106115239.30709.9850106928133493138.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c |  161 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 152 insertions(+), 9 deletions(-)

--- a/drivers/dax/bus.c~device-dax-add-resize-support
+++ a/drivers/dax/bus.c
@@ -6,6 +6,7 @@
 #include <linux/list.h>
 #include <linux/slab.h>
 #include <linux/dax.h>
+#include <linux/io.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -562,7 +563,8 @@ struct dax_region *alloc_dax_region(stru
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
 
-static int alloc_dev_dax_range(struct dev_dax *dev_dax, resource_size_t size)
+static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
+		resource_size_t size)
 {
 	struct dax_region *dax_region = dev_dax->region;
 	struct resource *res = &dax_region->res;
@@ -580,12 +582,7 @@ static int alloc_dev_dax_range(struct de
 		return 0;
 	}
 
-	/* TODO: handle multiple allocations per region */
-	if (res->child)
-		return -ENOMEM;
-
-	alloc = __request_region(res, res->start, size, dev_name(dev), 0);
-
+	alloc = __request_region(res, start, size, dev_name(dev), 0);
 	if (!alloc)
 		return -ENOMEM;
 
@@ -597,6 +594,29 @@ static int alloc_dev_dax_range(struct de
 	return 0;
 }
 
+static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct range *range = &dev_dax->range;
+	int rc = 0;
+
+	device_lock_assert(dax_region->dev);
+
+	if (size)
+		rc = adjust_resource(res, range->start, size);
+	else
+		__release_region(&dax_region->res, range->start, range_len(range));
+	if (rc)
+		return rc;
+
+	dev_dax->range = (struct range) {
+		.start = range->start,
+		.end = range->start + size - 1,
+	};
+
+	return 0;
+}
+
 static ssize_t size_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -605,7 +625,127 @@ static ssize_t size_show(struct device *
 
 	return sprintf(buf, "%llu\n", size);
 }
-static DEVICE_ATTR_RO(size);
+
+static bool alloc_is_aligned(struct dax_region *dax_region,
+		resource_size_t size)
+{
+	/*
+	 * The minimum mapping granularity for a device instance is a
+	 * single subsection, unless the arch says otherwise.
+	 */
+	return IS_ALIGNED(size, max_t(unsigned long, dax_region->align,
+				memremap_compat_align()));
+}
+
+static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct range *range = &dev_dax->range;
+	struct resource *res, *adjust = NULL;
+	struct device *dev = &dev_dax->dev;
+
+	for_each_dax_region_resource(dax_region, res)
+		if (strcmp(res->name, dev_name(dev)) == 0
+				&& res->start == range->start) {
+			adjust = res;
+			break;
+		}
+
+	if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n"))
+		return -ENXIO;
+	return adjust_dev_dax_range(dev_dax, adjust, size);
+}
+
+static ssize_t dev_dax_resize(struct dax_region *dax_region,
+		struct dev_dax *dev_dax, resource_size_t size)
+{
+	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
+	resource_size_t dev_size = range_len(&dev_dax->range);
+	struct resource *region_res = &dax_region->res;
+	struct device *dev = &dev_dax->dev;
+	const char *name = dev_name(dev);
+	struct resource *res, *first;
+
+	if (dev->driver)
+		return -EBUSY;
+	if (size == dev_size)
+		return 0;
+	if (size > dev_size && size - dev_size > avail)
+		return -ENOSPC;
+	if (size < dev_size)
+		return dev_dax_shrink(dev_dax, size);
+
+	to_alloc = size - dev_size;
+	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc),
+			"resize of %pa misaligned\n", &to_alloc))
+		return -ENXIO;
+
+	/*
+	 * Expand the device into the unused portion of the region. This
+	 * may involve adjusting the end of an existing resource, or
+	 * allocating a new resource.
+	 */
+	first = region_res->child;
+	if (!first)
+		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+	for (res = first; to_alloc && res; res = res->sibling) {
+		struct resource *next = res->sibling;
+		resource_size_t free;
+
+		/* space at the beginning of the region */
+		free = 0;
+		if (res == first && res->start > dax_region->res.start)
+			free = res->start - dax_region->res.start;
+		if (free >= to_alloc && dev_size == 0)
+			return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+
+		free = 0;
+		/* space between allocations */
+		if (next && next->start > res->end + 1)
+			free = next->start - res->end + 1;
+
+		/* space at the end of the region */
+		if (free < to_alloc && !next && res->end < region_res->end)
+			free = region_res->end - res->end;
+
+		if (free >= to_alloc && strcmp(name, res->name) == 0)
+			return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc);
+		else if (free >= to_alloc && dev_size == 0)
+			return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc);
+	}
+	return -ENOSPC;
+}
+
+static ssize_t size_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	ssize_t rc;
+	unsigned long long val;
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+
+	rc = kstrtoull(buf, 0, &val);
+	if (rc)
+		return rc;
+
+	if (!alloc_is_aligned(dax_region, val)) {
+		dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val);
+		return -EINVAL;
+	}
+
+	device_lock(dax_region->dev);
+	if (!dax_region->dev->driver) {
+		device_unlock(dax_region->dev);
+		return -ENXIO;
+	}
+	device_lock(dev);
+	rc = dev_dax_resize(dax_region, dev_dax, val);
+	device_unlock(dev);
+	device_unlock(dax_region->dev);
+
+	return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(size);
 
 static int dev_dax_target_node(struct dev_dax *dev_dax)
 {
@@ -654,11 +794,14 @@ static umode_t dev_dax_visible(struct ko
 {
 	struct device *dev = container_of(kobj, struct device, kobj);
 	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
 
 	if (a == &dev_attr_target_node.attr && dev_dax_target_node(dev_dax) < 0)
 		return 0;
 	if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
 		return 0;
+	if (a == &dev_attr_size.attr && is_static(dax_region))
+		return 0444;
 	return a->mode;
 }
 
@@ -739,7 +882,7 @@ struct dev_dax *devm_create_dev_dax(stru
 	device_initialize(dev);
 	dev_set_name(dev, "dax%d.%d", dax_region->id, dev_dax->id);
 
-	rc = alloc_dev_dax_range(dev_dax, data->size);
+	rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, data->size);
 	if (rc)
 		goto err_range;
 
_


* [patch 044/181] mm/memremap_pages: convert to 'struct range'
  2020-10-13 23:46 incoming Andrew Morton
                   ` (42 preceding siblings ...)
  2020-10-13 23:50 ` [patch 043/181] device-dax: add resize support Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 045/181] mm/memremap_pages: support multiple ranges per invocation Andrew Morton
                   ` (140 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.carpenter, dan.j.williams, daniel, dave.hansen, dave.jiang,
	david, gregkh, hpa, hulkci, ira.weiny, jgg, jglisse, jgross,
	jmoyer, joao.m.martins, Jonathan.Cameron, justin.he, linux-mm,
	lkp, luto, mingo, mm-commits, mpe, pasha.tatashin, paulus,
	peterz, rafael.j.wysocki, rdunlap, richard.weiyang, rppt,
	sstabellini, tglx, thomas.lendacky, torvalds, vgoyal,
	vishal.l.verma, will, yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/memremap_pages: convert to 'struct range'

The 'struct resource' in 'struct dev_pagemap' is only used for holding
resource span information.  The other fields, 'name', 'flags', 'desc',
'parent', 'sibling', and 'child' are all unused wasted space.

This is in preparation for introducing a multi-range extension of
devm_memremap_pages().

The bulk of this change is unwinding all the places internal to libnvdimm
that used 'struct resource' unnecessarily, and replacing instances of
'struct dev_pagemap'.res with 'struct dev_pagemap'.range.

P2PDMA had a minor usage of the resource flags field, but only to
report failures with "%pR".  That is replaced with an open-coded print
of the range.
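
One detail to keep in mind while reading the conversion: like 'struct
resource', 'struct range' has an inclusive end, so a span of length len
starting at start ends at start + len - 1, and range_len() computes
end - start + 1.  A quick sanity check with shell arithmetic (values
illustrative):

    # a 16GiB span starting at 4GiB
    printf '%#x\n' $(( 0x100000000 + (16 << 30) - 1 ))    # end: 0x4ffffffff
    printf '%#x\n' $(( 0x4ffffffff - 0x100000000 + 1 ))   # len: 0x400000000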

[dan.carpenter@oracle.com: mm/hmm/test: use after free in dmirror_allocate_chunk()]
  Link: https://lkml.kernel.org/r/20200926121402.GA7467@kadam
Link: https://lkml.kernel.org/r/159643103173.4062302.768998885691711532.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106115761.30709.13539840236873663620.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>	[xen]
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/kvm/book3s_hv_uvmem.c     |   13 ++-
 drivers/dax/bus.c                      |   10 +-
 drivers/dax/bus.h                      |    2 
 drivers/dax/dax-private.h              |    5 -
 drivers/dax/device.c                   |    3 
 drivers/dax/hmem/hmem.c                |    5 +
 drivers/dax/pmem/core.c                |   12 +--
 drivers/gpu/drm/nouveau/nouveau_dmem.c |   14 ++--
 drivers/nvdimm/badrange.c              |   26 +++----
 drivers/nvdimm/claim.c                 |   13 ++-
 drivers/nvdimm/nd.h                    |    3 
 drivers/nvdimm/pfn_devs.c              |   12 +--
 drivers/nvdimm/pmem.c                  |   26 ++++---
 drivers/nvdimm/region.c                |   21 +++---
 drivers/pci/p2pdma.c                   |   11 +--
 drivers/xen/unpopulated-alloc.c        |   44 ++++++++-----
 include/linux/memremap.h               |    5 -
 include/linux/range.h                  |    6 +
 lib/test_hmm.c                         |   50 +++++++-------
 mm/memremap.c                          |   77 +++++++++++------------
 tools/testing/nvdimm/test/iomap.c      |    2 
 21 files changed, 195 insertions(+), 165 deletions(-)

--- a/arch/powerpc/kvm/book3s_hv_uvmem.c~mm-memremap_pages-convert-to-struct-range
+++ a/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -687,9 +687,9 @@ static struct page *kvmppc_uvmem_get_pag
 	struct kvmppc_uvmem_page_pvt *pvt;
 	unsigned long pfn_last, pfn_first;
 
-	pfn_first = kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT;
+	pfn_first = kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT;
 	pfn_last = pfn_first +
-		   (resource_size(&kvmppc_uvmem_pgmap.res) >> PAGE_SHIFT);
+		   (range_len(&kvmppc_uvmem_pgmap.range) >> PAGE_SHIFT);
 
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
 	bit = find_first_zero_bit(kvmppc_uvmem_bitmap,
@@ -1007,7 +1007,7 @@ static vm_fault_t kvmppc_uvmem_migrate_t
 static void kvmppc_uvmem_page_free(struct page *page)
 {
 	unsigned long pfn = page_to_pfn(page) -
-			(kvmppc_uvmem_pgmap.res.start >> PAGE_SHIFT);
+			(kvmppc_uvmem_pgmap.range.start >> PAGE_SHIFT);
 	struct kvmppc_uvmem_page_pvt *pvt;
 
 	spin_lock(&kvmppc_uvmem_bitmap_lock);
@@ -1170,7 +1170,8 @@ int kvmppc_uvmem_init(void)
 	}
 
 	kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
-	kvmppc_uvmem_pgmap.res = *res;
+	kvmppc_uvmem_pgmap.range.start = res->start;
+	kvmppc_uvmem_pgmap.range.end = res->end;
 	kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
 	/* just one global instance: */
 	kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
@@ -1205,7 +1206,7 @@ void kvmppc_uvmem_free(void)
 		return;
 
 	memunmap_pages(&kvmppc_uvmem_pgmap);
-	release_mem_region(kvmppc_uvmem_pgmap.res.start,
-			   resource_size(&kvmppc_uvmem_pgmap.res));
+	release_mem_region(kvmppc_uvmem_pgmap.range.start,
+			   range_len(&kvmppc_uvmem_pgmap.range));
 	kfree(kvmppc_uvmem_bitmap);
 }
--- a/drivers/dax/bus.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/bus.c
@@ -515,7 +515,7 @@ static void dax_region_unregister(void *
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align,
+		struct range *range, int target_node, unsigned int align,
 		unsigned long flags)
 {
 	struct dax_region *dax_region;
@@ -530,8 +530,8 @@ struct dax_region *alloc_dax_region(stru
 		return NULL;
 	}
 
-	if (!IS_ALIGNED(res->start, align)
-			|| !IS_ALIGNED(resource_size(res), align))
+	if (!IS_ALIGNED(range->start, align)
+			|| !IS_ALIGNED(range_len(range), align))
 		return NULL;
 
 	dax_region = kzalloc(sizeof(*dax_region), GFP_KERNEL);
@@ -546,8 +546,8 @@ struct dax_region *alloc_dax_region(stru
 	dax_region->target_node = target_node;
 	ida_init(&dax_region->ida);
 	dax_region->res = (struct resource) {
-		.start = res->start,
-		.end = res->end,
+		.start = range->start,
+		.end = range->end,
 		.flags = IORESOURCE_MEM | flags,
 	};
 
--- a/drivers/dax/bus.h~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/bus.h
@@ -13,7 +13,7 @@ void dax_region_put(struct dax_region *d
 
 #define IORESOURCE_DAX_STATIC (1UL << 0)
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct resource *res, int target_node, unsigned int align,
+		struct range *range, int target_node, unsigned int align,
 		unsigned long flags);
 
 enum dev_dax_subsys {
--- a/drivers/dax/dax-private.h~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/dax-private.h
@@ -61,11 +61,6 @@ struct dev_dax {
 	struct range range;
 };
 
-static inline u64 range_len(struct range *range)
-{
-	return range->end - range->start + 1;
-}
-
 static inline struct dev_dax *to_dev_dax(struct device *dev)
 {
 	return container_of(dev, struct dev_dax, dev);
--- a/drivers/dax/device.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/device.c
@@ -416,8 +416,7 @@ int dev_dax_probe(struct dev_dax *dev_da
 		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
 		if (!pgmap)
 			return -ENOMEM;
-		pgmap->res.start = range->start;
-		pgmap->res.end = range->end;
+		pgmap->range = *range;
 	}
 	pgmap->type = MEMORY_DEVICE_GENERIC;
 	addr = devm_memremap_pages(dev, pgmap);
--- a/drivers/dax/hmem/hmem.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/hmem/hmem.c
@@ -13,13 +13,16 @@ static int dax_hmem_probe(struct platfor
 	struct dev_dax_data data;
 	struct dev_dax *dev_dax;
 	struct resource *res;
+	struct range range;
 
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
 	if (!res)
 		return -ENOMEM;
 
 	mri = dev->platform_data;
-	dax_region = alloc_dax_region(dev, pdev->id, res, mri->target_node,
+	range.start = res->start;
+	range.end = res->end;
+	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
 			PMD_SIZE, 0);
 	if (!dax_region)
 		return -ENOMEM;
--- a/drivers/dax/pmem/core.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/dax/pmem/core.c
@@ -9,7 +9,7 @@
 
 struct dev_dax *__dax_pmem_probe(struct device *dev, enum dev_dax_subsys subsys)
 {
-	struct resource res;
+	struct range range;
 	int rc, id, region_id;
 	resource_size_t offset;
 	struct nd_pfn_sb *pfn_sb;
@@ -50,10 +50,10 @@ struct dev_dax *__dax_pmem_probe(struct
 	if (rc != 2)
 		return ERR_PTR(-EINVAL);
 
-	/* adjust the dax_region resource to the start of data */
-	memcpy(&res, &pgmap.res, sizeof(res));
-	res.start += offset;
-	dax_region = alloc_dax_region(dev, region_id, &res,
+	/* adjust the dax_region range to the start of data */
+	range = pgmap.range;
+	range.start += offset,
+	dax_region = alloc_dax_region(dev, region_id, &range,
 			nd_region->target_node, le32_to_cpu(pfn_sb->align),
 			IORESOURCE_DAX_STATIC);
 	if (!dax_region)
@@ -64,7 +64,7 @@ struct dev_dax *__dax_pmem_probe(struct
 		.id = id,
 		.pgmap = &pgmap,
 		.subsys = subsys,
-		.size = resource_size(&res),
+		.size = range_len(&range),
 	};
 	dev_dax = devm_create_dev_dax(&data);
 
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -101,7 +101,7 @@ unsigned long nouveau_dmem_page_addr(str
 {
 	struct nouveau_dmem_chunk *chunk = nouveau_page_to_chunk(page);
 	unsigned long off = (page_to_pfn(page) << PAGE_SHIFT) -
-				chunk->pagemap.res.start;
+				chunk->pagemap.range.start;
 
 	return chunk->bo->offset + off;
 }
@@ -249,7 +249,8 @@ nouveau_dmem_chunk_alloc(struct nouveau_
 
 	chunk->drm = drm;
 	chunk->pagemap.type = MEMORY_DEVICE_PRIVATE;
-	chunk->pagemap.res = *res;
+	chunk->pagemap.range.start = res->start;
+	chunk->pagemap.range.end = res->end;
 	chunk->pagemap.ops = &nouveau_dmem_pagemap_ops;
 	chunk->pagemap.owner = drm->dev;
 
@@ -273,7 +274,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_
 	list_add(&chunk->list, &drm->dmem->chunks);
 	mutex_unlock(&drm->dmem->mutex);
 
-	pfn_first = chunk->pagemap.res.start >> PAGE_SHIFT;
+	pfn_first = chunk->pagemap.range.start >> PAGE_SHIFT;
 	page = pfn_to_page(pfn_first);
 	spin_lock(&drm->dmem->lock);
 	for (i = 0; i < DMEM_CHUNK_NPAGES - 1; ++i, ++page) {
@@ -294,8 +295,7 @@ out_bo_unpin:
 out_bo_free:
 	nouveau_bo_ref(NULL, &chunk->bo);
 out_release:
-	release_mem_region(chunk->pagemap.res.start,
-			   resource_size(&chunk->pagemap.res));
+	release_mem_region(chunk->pagemap.range.start, range_len(&chunk->pagemap.range));
 out_free:
 	kfree(chunk);
 out:
@@ -382,8 +382,8 @@ nouveau_dmem_fini(struct nouveau_drm *dr
 		nouveau_bo_ref(NULL, &chunk->bo);
 		list_del(&chunk->list);
 		memunmap_pages(&chunk->pagemap);
-		release_mem_region(chunk->pagemap.res.start,
-				   resource_size(&chunk->pagemap.res));
+		release_mem_region(chunk->pagemap.range.start,
+				   range_len(&chunk->pagemap.range));
 		kfree(chunk);
 	}
 
--- a/drivers/nvdimm/badrange.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/badrange.c
@@ -211,7 +211,7 @@ static void __add_badblock_range(struct
 }
 
 static void badblocks_populate(struct badrange *badrange,
-		struct badblocks *bb, const struct resource *res)
+		struct badblocks *bb, const struct range *range)
 {
 	struct badrange_entry *bre;
 
@@ -222,34 +222,34 @@ static void badblocks_populate(struct ba
 		u64 bre_end = bre->start + bre->length - 1;
 
 		/* Discard intervals with no intersection */
-		if (bre_end < res->start)
+		if (bre_end < range->start)
 			continue;
-		if (bre->start >  res->end)
+		if (bre->start > range->end)
 			continue;
 		/* Deal with any overlap after start of the namespace */
-		if (bre->start >= res->start) {
+		if (bre->start >= range->start) {
 			u64 start = bre->start;
 			u64 len;
 
-			if (bre_end <= res->end)
+			if (bre_end <= range->end)
 				len = bre->length;
 			else
-				len = res->start + resource_size(res)
+				len = range->start + range_len(range)
 					- bre->start;
-			__add_badblock_range(bb, start - res->start, len);
+			__add_badblock_range(bb, start - range->start, len);
 			continue;
 		}
 		/*
 		 * Deal with overlap for badrange starting before
 		 * the namespace.
 		 */
-		if (bre->start < res->start) {
+		if (bre->start < range->start) {
 			u64 len;
 
-			if (bre_end < res->end)
-				len = bre->start + bre->length - res->start;
+			if (bre_end < range->end)
+				len = bre->start + bre->length - range->start;
 			else
-				len = resource_size(res);
+				len = range_len(range);
 			__add_badblock_range(bb, 0, len);
 		}
 	}
@@ -267,7 +267,7 @@ static void badblocks_populate(struct ba
  * and add badblocks entries for all matching sub-ranges
  */
 void nvdimm_badblocks_populate(struct nd_region *nd_region,
-		struct badblocks *bb, const struct resource *res)
+		struct badblocks *bb, const struct range *range)
 {
 	struct nvdimm_bus *nvdimm_bus;
 
@@ -279,7 +279,7 @@ void nvdimm_badblocks_populate(struct nd
 	nvdimm_bus = walk_to_nvdimm_bus(&nd_region->dev);
 
 	nvdimm_bus_lock(&nvdimm_bus->dev);
-	badblocks_populate(&nvdimm_bus->badrange, bb, res);
+	badblocks_populate(&nvdimm_bus->badrange, bb, range);
 	nvdimm_bus_unlock(&nvdimm_bus->dev);
 }
 EXPORT_SYMBOL_GPL(nvdimm_badblocks_populate);
--- a/drivers/nvdimm/claim.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/claim.c
@@ -303,13 +303,16 @@ static int nsio_rw_bytes(struct nd_names
 int devm_nsio_enable(struct device *dev, struct nd_namespace_io *nsio,
 		resource_size_t size)
 {
-	struct resource *res = &nsio->res;
 	struct nd_namespace_common *ndns = &nsio->common;
+	struct range range = {
+		.start = nsio->res.start,
+		.end = nsio->res.end,
+	};
 
 	nsio->size = size;
-	if (!devm_request_mem_region(dev, res->start, size,
+	if (!devm_request_mem_region(dev, range.start, size,
 				dev_name(&ndns->dev))) {
-		dev_warn(dev, "could not reserve region %pR\n", res);
+		dev_warn(dev, "could not reserve region %pR\n", &nsio->res);
 		return -EBUSY;
 	}
 
@@ -317,9 +320,9 @@ int devm_nsio_enable(struct device *dev,
 	if (devm_init_badblocks(dev, &nsio->bb))
 		return -ENOMEM;
 	nvdimm_badblocks_populate(to_nd_region(ndns->dev.parent), &nsio->bb,
-			&nsio->res);
+			&range);
 
-	nsio->addr = devm_memremap(dev, res->start, size, ARCH_MEMREMAP_PMEM);
+	nsio->addr = devm_memremap(dev, range.start, size, ARCH_MEMREMAP_PMEM);
 
 	return PTR_ERR_OR_ZERO(nsio->addr);
 }
--- a/drivers/nvdimm/nd.h~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/nd.h
@@ -377,8 +377,9 @@ int nvdimm_namespace_detach_btt(struct n
 const char *nvdimm_namespace_disk_name(struct nd_namespace_common *ndns,
 		char *name);
 unsigned int pmem_sector_size(struct nd_namespace_common *ndns);
+struct range;
 void nvdimm_badblocks_populate(struct nd_region *nd_region,
-		struct badblocks *bb, const struct resource *res);
+		struct badblocks *bb, const struct range *range);
 int devm_namespace_enable(struct device *dev, struct nd_namespace_common *ndns,
 		resource_size_t size);
 void devm_namespace_disable(struct device *dev,
--- a/drivers/nvdimm/pfn_devs.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/pfn_devs.c
@@ -672,7 +672,7 @@ static unsigned long init_altmap_reserve
 
 static int __nvdimm_setup_pfn(struct nd_pfn *nd_pfn, struct dev_pagemap *pgmap)
 {
-	struct resource *res = &pgmap->res;
+	struct range *range = &pgmap->range;
 	struct vmem_altmap *altmap = &pgmap->altmap;
 	struct nd_pfn_sb *pfn_sb = nd_pfn->pfn_sb;
 	u64 offset = le64_to_cpu(pfn_sb->dataoff);
@@ -689,16 +689,16 @@ static int __nvdimm_setup_pfn(struct nd_
 		.end_pfn = PHYS_PFN(end),
 	};
 
-	memcpy(res, &nsio->res, sizeof(*res));
-	res->start += start_pad;
-	res->end -= end_trunc;
-
+	*range = (struct range) {
+		.start = nsio->res.start + start_pad,
+		.end = nsio->res.end - end_trunc,
+	};
 	if (nd_pfn->mode == PFN_MODE_RAM) {
 		if (offset < reserve)
 			return -EINVAL;
 		nd_pfn->npfns = le64_to_cpu(pfn_sb->npfns);
 	} else if (nd_pfn->mode == PFN_MODE_PMEM) {
-		nd_pfn->npfns = PHYS_PFN((resource_size(res) - offset));
+		nd_pfn->npfns = PHYS_PFN((range_len(range) - offset));
 		if (le64_to_cpu(nd_pfn->pfn_sb->npfns) > nd_pfn->npfns)
 			dev_info(&nd_pfn->dev,
 					"number of pfns truncated from %lld to %ld\n",
--- a/drivers/nvdimm/pmem.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/pmem.c
@@ -375,7 +375,7 @@ static int pmem_attach_disk(struct devic
 	struct nd_region *nd_region = to_nd_region(dev->parent);
 	int nid = dev_to_node(dev), fua;
 	struct resource *res = &nsio->res;
-	struct resource bb_res;
+	struct range bb_range;
 	struct nd_pfn *nd_pfn = NULL;
 	struct dax_device *dax_dev;
 	struct nd_pfn_sb *pfn_sb;
@@ -434,24 +434,26 @@ static int pmem_attach_disk(struct devic
 		pfn_sb = nd_pfn->pfn_sb;
 		pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
 		pmem->pfn_pad = resource_size(res) -
-			resource_size(&pmem->pgmap.res);
+			range_len(&pmem->pgmap.range);
 		pmem->pfn_flags |= PFN_MAP;
-		memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
-		bb_res.start += pmem->data_offset;
+		bb_range = pmem->pgmap.range;
+		bb_range.start += pmem->data_offset;
 	} else if (pmem_should_map_pages(dev)) {
-		memcpy(&pmem->pgmap.res, &nsio->res, sizeof(pmem->pgmap.res));
+		pmem->pgmap.range.start = res->start;
+		pmem->pgmap.range.end = res->end;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
 		pmem->pfn_flags |= PFN_MAP;
-		memcpy(&bb_res, &pmem->pgmap.res, sizeof(bb_res));
+		bb_range = pmem->pgmap.range;
 	} else {
 		if (devm_add_action_or_reset(dev, pmem_release_queue,
 					&pmem->pgmap))
 			return -ENOMEM;
 		addr = devm_memremap(dev, pmem->phys_addr,
 				pmem->size, ARCH_MEMREMAP_PMEM);
-		memcpy(&bb_res, &nsio->res, sizeof(bb_res));
+		bb_range.start =  res->start;
+		bb_range.end = res->end;
 	}
 
 	if (IS_ERR(addr))
@@ -480,7 +482,7 @@ static int pmem_attach_disk(struct devic
 			/ 512);
 	if (devm_init_badblocks(dev, &pmem->bb))
 		return -ENOMEM;
-	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_res);
+	nvdimm_badblocks_populate(nd_region, &pmem->bb, &bb_range);
 	disk->bb = &pmem->bb;
 
 	if (is_nvdimm_sync(nd_region))
@@ -591,8 +593,8 @@ static void nd_pmem_notify(struct device
 	resource_size_t offset = 0, end_trunc = 0;
 	struct nd_namespace_common *ndns;
 	struct nd_namespace_io *nsio;
-	struct resource res;
 	struct badblocks *bb;
+	struct range range;
 	struct kernfs_node *bb_state;
 
 	if (event != NVDIMM_REVALIDATE_POISON)
@@ -628,9 +630,9 @@ static void nd_pmem_notify(struct device
 		nsio = to_nd_namespace_io(&ndns->dev);
 	}
 
-	res.start = nsio->res.start + offset;
-	res.end = nsio->res.end - end_trunc;
-	nvdimm_badblocks_populate(nd_region, bb, &res);
+	range.start = nsio->res.start + offset;
+	range.end = nsio->res.end - end_trunc;
+	nvdimm_badblocks_populate(nd_region, bb, &range);
 	if (bb_state)
 		sysfs_notify_dirent(bb_state);
 }
--- a/drivers/nvdimm/region.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/nvdimm/region.c
@@ -35,7 +35,10 @@ static int nd_region_probe(struct device
 		return rc;
 
 	if (is_memory(&nd_region->dev)) {
-		struct resource ndr_res;
+		struct range range = {
+			.start = nd_region->ndr_start,
+			.end = nd_region->ndr_start + nd_region->ndr_size - 1,
+		};
 
 		if (devm_init_badblocks(dev, &nd_region->bb))
 			return -ENODEV;
@@ -44,9 +47,7 @@ static int nd_region_probe(struct device
 		if (!nd_region->bb_state)
 			dev_warn(&nd_region->dev,
 					"'badblocks' notification disabled\n");
-		ndr_res.start = nd_region->ndr_start;
-		ndr_res.end = nd_region->ndr_start + nd_region->ndr_size - 1;
-		nvdimm_badblocks_populate(nd_region, &nd_region->bb, &ndr_res);
+		nvdimm_badblocks_populate(nd_region, &nd_region->bb, &range);
 	}
 
 	rc = nd_region_register_namespaces(nd_region, &err);
@@ -121,14 +122,16 @@ static void nd_region_notify(struct devi
 {
 	if (event == NVDIMM_REVALIDATE_POISON) {
 		struct nd_region *nd_region = to_nd_region(dev);
-		struct resource res;
 
 		if (is_memory(&nd_region->dev)) {
-			res.start = nd_region->ndr_start;
-			res.end = nd_region->ndr_start +
-				nd_region->ndr_size - 1;
+			struct range range = {
+				.start = nd_region->ndr_start,
+				.end = nd_region->ndr_start +
+					nd_region->ndr_size - 1,
+			};
+
 			nvdimm_badblocks_populate(nd_region,
-					&nd_region->bb, &res);
+					&nd_region->bb, &range);
 			if (nd_region->bb_state)
 				sysfs_notify_dirent(nd_region->bb_state);
 		}
--- a/drivers/pci/p2pdma.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/pci/p2pdma.c
@@ -185,9 +185,8 @@ int pci_p2pdma_add_resource(struct pci_d
 		return -ENOMEM;
 
 	pgmap = &p2p_pgmap->pgmap;
-	pgmap->res.start = pci_resource_start(pdev, bar) + offset;
-	pgmap->res.end = pgmap->res.start + size - 1;
-	pgmap->res.flags = pci_resource_flags(pdev, bar);
+	pgmap->range.start = pci_resource_start(pdev, bar) + offset;
+	pgmap->range.end = pgmap->range.start + size - 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 
 	p2p_pgmap->provider = pdev;
@@ -202,13 +201,13 @@ int pci_p2pdma_add_resource(struct pci_d
 
 	error = gen_pool_add_owner(pdev->p2pdma->pool, (unsigned long)addr,
 			pci_bus_address(pdev, bar) + offset,
-			resource_size(&pgmap->res), dev_to_node(&pdev->dev),
+			range_len(&pgmap->range), dev_to_node(&pdev->dev),
 			pgmap->ref);
 	if (error)
 		goto pages_free;
 
-	pci_info(pdev, "added peer-to-peer DMA memory %pR\n",
-		 &pgmap->res);
+	pci_info(pdev, "added peer-to-peer DMA memory %#llx-%#llx\n",
+		 pgmap->range.start, pgmap->range.end);
 
 	return 0;
 
--- a/drivers/xen/unpopulated-alloc.c~mm-memremap_pages-convert-to-struct-range
+++ a/drivers/xen/unpopulated-alloc.c
@@ -18,27 +18,37 @@ static unsigned int list_count;
 static int fill_list(unsigned int nr_pages)
 {
 	struct dev_pagemap *pgmap;
+	struct resource *res;
 	void *vaddr;
 	unsigned int i, alloc_pages = round_up(nr_pages, PAGES_PER_SECTION);
-	int ret;
+	int ret = -ENOMEM;
+
+	res = kzalloc(sizeof(*res), GFP_KERNEL);
+	if (!res)
+		return -ENOMEM;
 
 	pgmap = kzalloc(sizeof(*pgmap), GFP_KERNEL);
 	if (!pgmap)
-		return -ENOMEM;
+		goto err_pgmap;
 
 	pgmap->type = MEMORY_DEVICE_GENERIC;
-	pgmap->res.name = "Xen scratch";
-	pgmap->res.flags = IORESOURCE_MEM | IORESOURCE_BUSY;
+	res->name = "Xen scratch";
+	res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
 
-	ret = allocate_resource(&iomem_resource, &pgmap->res,
+	ret = allocate_resource(&iomem_resource, res,
 				alloc_pages * PAGE_SIZE, 0, -1,
 				PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
 	if (ret < 0) {
 		pr_err("Cannot allocate new IOMEM resource\n");
-		kfree(pgmap);
-		return ret;
+		goto err_resource;
 	}
 
+	pgmap->range = (struct range) {
+		.start = res->start,
+		.end = res->end,
+	};
+	pgmap->owner = res;
+
 #ifdef CONFIG_XEN_HAVE_PVMMU
         /*
          * memremap will build page tables for the new memory so
@@ -50,14 +60,13 @@ static int fill_list(unsigned int nr_pag
          * conflict with any devices.
          */
 	if (!xen_feature(XENFEAT_auto_translated_physmap)) {
-		xen_pfn_t pfn = PFN_DOWN(pgmap->res.start);
+		xen_pfn_t pfn = PFN_DOWN(res->start);
 
 		for (i = 0; i < alloc_pages; i++) {
 			if (!set_phys_to_machine(pfn + i, INVALID_P2M_ENTRY)) {
 				pr_warn("set_phys_to_machine() failed, no memory added\n");
-				release_resource(&pgmap->res);
-				kfree(pgmap);
-				return -ENOMEM;
+				ret = -ENOMEM;
+				goto err_memremap;
 			}
                 }
 	}
@@ -66,9 +75,8 @@ static int fill_list(unsigned int nr_pag
 	vaddr = memremap_pages(pgmap, NUMA_NO_NODE);
 	if (IS_ERR(vaddr)) {
 		pr_err("Cannot remap memory range\n");
-		release_resource(&pgmap->res);
-		kfree(pgmap);
-		return PTR_ERR(vaddr);
+		ret = PTR_ERR(vaddr);
+		goto err_memremap;
 	}
 
 	for (i = 0; i < alloc_pages; i++) {
@@ -80,6 +88,14 @@ static int fill_list(unsigned int nr_pag
 	}
 
 	return 0;
+
+err_memremap:
+	release_resource(res);
+err_resource:
+	kfree(pgmap);
+err_pgmap:
+	kfree(res);
+	return ret;
 }
 
 /**
--- a/include/linux/memremap.h~mm-memremap_pages-convert-to-struct-range
+++ a/include/linux/memremap.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_MEMREMAP_H_
 #define _LINUX_MEMREMAP_H_
+#include <linux/range.h>
 #include <linux/ioport.h>
 #include <linux/percpu-refcount.h>
 
@@ -93,7 +94,7 @@ struct dev_pagemap_ops {
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
- * @res: physical address range covered by @ref
+ * @range: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @internal_ref: internal reference if @ref is not provided by the caller
  * @done: completion for @internal_ref
@@ -106,7 +107,7 @@ struct dev_pagemap_ops {
  */
 struct dev_pagemap {
 	struct vmem_altmap altmap;
-	struct resource res;
+	struct range range;
 	struct percpu_ref *ref;
 	struct percpu_ref internal_ref;
 	struct completion done;
--- a/include/linux/range.h~mm-memremap_pages-convert-to-struct-range
+++ a/include/linux/range.h
@@ -1,12 +1,18 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #ifndef _LINUX_RANGE_H
 #define _LINUX_RANGE_H
+#include <linux/types.h>
 
 struct range {
 	u64   start;
 	u64   end;
 };
 
+static inline u64 range_len(const struct range *range)
+{
+	return range->end - range->start + 1;
+}
+
 int add_range(struct range *range, int az, int nr_range,
 		u64 start, u64 end);
 
--- a/lib/test_hmm.c~mm-memremap_pages-convert-to-struct-range
+++ a/lib/test_hmm.c
@@ -460,6 +460,21 @@ static bool dmirror_allocate_chunk(struc
 	unsigned long pfn_last;
 	void *ptr;
 
+	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
+	if (!devmem)
+		return -ENOMEM;
+
+	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
+				      "hmm_dmirror");
+	if (IS_ERR(res))
+		goto err_devmem;
+
+	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
+	devmem->pagemap.range.start = res->start;
+	devmem->pagemap.range.end = res->end;
+	devmem->pagemap.ops = &dmirror_devmem_ops;
+	devmem->pagemap.owner = mdevice;
+
 	mutex_lock(&mdevice->devmem_lock);
 
 	if (mdevice->devmem_count == mdevice->devmem_capacity) {
@@ -472,33 +487,18 @@ static bool dmirror_allocate_chunk(struc
 				sizeof(new_chunks[0]) * new_capacity,
 				GFP_KERNEL);
 		if (!new_chunks)
-			goto err;
+			goto err_release;
 		mdevice->devmem_capacity = new_capacity;
 		mdevice->devmem_chunks = new_chunks;
 	}
 
-	res = request_free_mem_region(&iomem_resource, DEVMEM_CHUNK_SIZE,
-					"hmm_dmirror");
-	if (IS_ERR(res))
-		goto err;
-
-	devmem = kzalloc(sizeof(*devmem), GFP_KERNEL);
-	if (!devmem)
-		goto err_release;
-
-	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
-	devmem->pagemap.res = *res;
-	devmem->pagemap.ops = &dmirror_devmem_ops;
-	devmem->pagemap.owner = mdevice;
-
 	ptr = memremap_pages(&devmem->pagemap, numa_node_id());
 	if (IS_ERR(ptr))
-		goto err_free;
+		goto err_release;
 
 	devmem->mdevice = mdevice;
-	pfn_first = devmem->pagemap.res.start >> PAGE_SHIFT;
-	pfn_last = pfn_first +
-		(resource_size(&devmem->pagemap.res) >> PAGE_SHIFT);
+	pfn_first = devmem->pagemap.range.start >> PAGE_SHIFT;
+	pfn_last = pfn_first + (range_len(&devmem->pagemap.range) >> PAGE_SHIFT);
 	mdevice->devmem_chunks[mdevice->devmem_count++] = devmem;
 
 	mutex_unlock(&mdevice->devmem_lock);
@@ -525,12 +525,12 @@ static bool dmirror_allocate_chunk(struc
 
 	return true;
 
-err_free:
-	kfree(devmem);
 err_release:
-	release_mem_region(res->start, resource_size(res));
-err:
 	mutex_unlock(&mdevice->devmem_lock);
+	release_mem_region(devmem->pagemap.range.start, range_len(&devmem->pagemap.range));
+err_devmem:
+	kfree(devmem);
+
 	return false;
 }
 
@@ -1100,8 +1100,8 @@ static void dmirror_device_remove(struct
 				mdevice->devmem_chunks[i];
 
 			memunmap_pages(&devmem->pagemap);
-			release_mem_region(devmem->pagemap.res.start,
-					   resource_size(&devmem->pagemap.res));
+			release_mem_region(devmem->pagemap.range.start,
+					   range_len(&devmem->pagemap.range));
 			kfree(devmem);
 		}
 		kfree(mdevice->devmem_chunks);
--- a/mm/memremap.c~mm-memremap_pages-convert-to-struct-range
+++ a/mm/memremap.c
@@ -70,24 +70,24 @@ static void devmap_managed_enable_put(vo
 }
 #endif /* CONFIG_DEV_PAGEMAP_OPS */
 
-static void pgmap_array_delete(struct resource *res)
+static void pgmap_array_delete(struct range *range)
 {
-	xa_store_range(&pgmap_array, PHYS_PFN(res->start), PHYS_PFN(res->end),
+	xa_store_range(&pgmap_array, PHYS_PFN(range->start), PHYS_PFN(range->end),
 			NULL, GFP_KERNEL);
 	synchronize_rcu();
 }
 
 static unsigned long pfn_first(struct dev_pagemap *pgmap)
 {
-	return PHYS_PFN(pgmap->res.start) +
+	return PHYS_PFN(pgmap->range.start) +
 		vmem_altmap_offset(pgmap_altmap(pgmap));
 }
 
 static unsigned long pfn_end(struct dev_pagemap *pgmap)
 {
-	const struct resource *res = &pgmap->res;
+	const struct range *range = &pgmap->range;
 
-	return (res->start + resource_size(res)) >> PAGE_SHIFT;
+	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
 
 static unsigned long pfn_next(unsigned long pfn)
@@ -126,7 +126,7 @@ static void dev_pagemap_cleanup(struct d
 
 void memunmap_pages(struct dev_pagemap *pgmap)
 {
-	struct resource *res = &pgmap->res;
+	struct range *range = &pgmap->range;
 	struct page *first_page;
 	unsigned long pfn;
 	int nid;
@@ -143,20 +143,20 @@ void memunmap_pages(struct dev_pagemap *
 	nid = page_to_nid(first_page);
 
 	mem_hotplug_begin();
-	remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(res->start),
-				   PHYS_PFN(resource_size(res)));
+	remove_pfn_range_from_zone(page_zone(first_page), PHYS_PFN(range->start),
+				   PHYS_PFN(range_len(range)));
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		__remove_pages(PHYS_PFN(res->start),
-			       PHYS_PFN(resource_size(res)), NULL);
+		__remove_pages(PHYS_PFN(range->start),
+			       PHYS_PFN(range_len(range)), NULL);
 	} else {
-		arch_remove_memory(nid, res->start, resource_size(res),
+		arch_remove_memory(nid, range->start, range_len(range),
 				pgmap_altmap(pgmap));
-		kasan_remove_zero_shadow(__va(res->start), resource_size(res));
+		kasan_remove_zero_shadow(__va(range->start), range_len(range));
 	}
 	mem_hotplug_done();
 
-	untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
-	pgmap_array_delete(res);
+	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
+	pgmap_array_delete(range);
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
 }
@@ -182,7 +182,7 @@ static void dev_pagemap_percpu_release(s
  */
 void *memremap_pages(struct dev_pagemap *pgmap, int nid)
 {
-	struct resource *res = &pgmap->res;
+	struct range *range = &pgmap->range;
 	struct dev_pagemap *conflict_pgmap;
 	struct mhp_params params = {
 		/*
@@ -251,7 +251,7 @@ void *memremap_pages(struct dev_pagemap
 			return ERR_PTR(error);
 	}
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->start), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -259,7 +259,7 @@ void *memremap_pages(struct dev_pagemap
 		goto err_array;
 	}
 
-	conflict_pgmap = get_dev_pagemap(PHYS_PFN(res->end), NULL);
+	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
@@ -267,26 +267,27 @@ void *memremap_pages(struct dev_pagemap
 		goto err_array;
 	}
 
-	is_ram = region_intersects(res->start, resource_size(res),
+	is_ram = region_intersects(range->start, range_len(range),
 		IORESOURCE_SYSTEM_RAM, IORES_DESC_NONE);
 
 	if (is_ram != REGION_DISJOINT) {
-		WARN_ONCE(1, "%s attempted on %s region %pr\n", __func__,
-				is_ram == REGION_MIXED ? "mixed" : "ram", res);
+		WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n",
+				is_ram == REGION_MIXED ? "mixed" : "ram",
+				range->start, range->end);
 		error = -ENXIO;
 		goto err_array;
 	}
 
-	error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(res->start),
-				PHYS_PFN(res->end), pgmap, GFP_KERNEL));
+	error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start),
+				PHYS_PFN(range->end), pgmap, GFP_KERNEL));
 	if (error)
 		goto err_array;
 
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(res->start),
-				0, resource_size(res));
+	error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(range->start), 0,
+			range_len(range));
 	if (error)
 		goto err_pfn_remap;
 
@@ -304,16 +305,16 @@ void *memremap_pages(struct dev_pagemap
 	 * arch_add_memory().
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
-		error = add_pages(nid, PHYS_PFN(res->start),
-				PHYS_PFN(resource_size(res)), &params);
+		error = add_pages(nid, PHYS_PFN(range->start),
+				PHYS_PFN(range_len(range)), &params);
 	} else {
-		error = kasan_add_zero_shadow(__va(res->start), resource_size(res));
+		error = kasan_add_zero_shadow(__va(range->start), range_len(range));
 		if (error) {
 			mem_hotplug_done();
 			goto err_kasan;
 		}
 
-		error = arch_add_memory(nid, res->start, resource_size(res),
+		error = arch_add_memory(nid, range->start, range_len(range),
 					&params);
 	}
 
@@ -321,8 +322,8 @@ void *memremap_pages(struct dev_pagemap
 		struct zone *zone;
 
 		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
-		move_pfn_range_to_zone(zone, PHYS_PFN(res->start),
-				PHYS_PFN(resource_size(res)), params.altmap);
+		move_pfn_range_to_zone(zone, PHYS_PFN(range->start),
+				PHYS_PFN(range_len(range)), params.altmap);
 	}
 
 	mem_hotplug_done();
@@ -334,17 +335,17 @@ void *memremap_pages(struct dev_pagemap
 	 * to allow us to do the work while not holding the hotplug lock.
 	 */
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
-				PHYS_PFN(res->start),
-				PHYS_PFN(resource_size(res)), pgmap);
+				PHYS_PFN(range->start),
+				PHYS_PFN(range_len(range)), pgmap);
 	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
-	return __va(res->start);
+	return __va(range->start);
 
  err_add_memory:
-	kasan_remove_zero_shadow(__va(res->start), resource_size(res));
+	kasan_remove_zero_shadow(__va(range->start), range_len(range));
  err_kasan:
-	untrack_pfn(NULL, PHYS_PFN(res->start), resource_size(res));
+	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
  err_pfn_remap:
-	pgmap_array_delete(res);
+	pgmap_array_delete(range);
  err_array:
 	dev_pagemap_kill(pgmap);
 	dev_pagemap_cleanup(pgmap);
@@ -369,7 +370,7 @@ EXPORT_SYMBOL_GPL(memremap_pages);
  *    'live' on entry and will be killed and reaped at
  *    devm_memremap_pages_release() time, or if this routine fails.
  *
- * 4/ res is expected to be a host memory range that could feasibly be
+ * 4/ range is expected to be a host memory range that could feasibly be
  *    treated as a "System RAM" range, i.e. not a device mmio range, but
  *    this is not enforced.
  */
@@ -426,7 +427,7 @@ struct dev_pagemap *get_dev_pagemap(unsi
 	 * In the cached case we're already holding a live reference.
 	 */
 	if (pgmap) {
-		if (phys >= pgmap->res.start && phys <= pgmap->res.end)
+		if (phys >= pgmap->range.start && phys <= pgmap->range.end)
 			return pgmap;
 		put_dev_pagemap(pgmap);
 	}
--- a/tools/testing/nvdimm/test/iomap.c~mm-memremap_pages-convert-to-struct-range
+++ a/tools/testing/nvdimm/test/iomap.c
@@ -126,7 +126,7 @@ static void dev_pagemap_percpu_release(s
 void *__wrap_devm_memremap_pages(struct device *dev, struct dev_pagemap *pgmap)
 {
 	int error;
-	resource_size_t offset = pgmap->res.start;
+	resource_size_t offset = pgmap->range.start;
 	struct nfit_test_resource *nfit_res = get_nfit_res(offset);
 
 	if (!nfit_res)
_
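
(A note on the arithmetic in the conversion above: 'struct range' is
end-inclusive, so range_len() returns end - start + 1.  For example,
start = 0x1000 and end = 0x1fff describe 0x1000 bytes, matching what
resource_size() computed for the equivalent 'struct resource'.)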

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 045/181] mm/memremap_pages: support multiple ranges per invocation
  2020-10-13 23:46 incoming Andrew Morton
                   ` (43 preceding siblings ...)
  2020-10-13 23:50 ` [patch 044/181] mm/memremap_pages: convert to 'struct range' Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 046/181] device-dax: add dis-contiguous resource support Andrew Morton
                   ` (139 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: mm/memremap_pages: support multiple ranges per invocation

In support of device-dax growing the ability to front physically
dis-contiguous ranges of memory, update devm_memremap_pages() to track
multiple ranges with a single reference counter and devm instance.

Convert all [devm_]memremap_pages() users to specify the number of ranges
they are mapping in their 'struct dev_pagemap' instance.
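
As a rough illustration of the single-range pattern (distilled from the
hunks below; 'res' stands for whatever resource the driver already owns,
and the pgmap type varies per driver):

	/* 'range' aliases 'ranges[0]' in the new union */
	pgmap->type = MEMORY_DEVICE_PRIVATE;
	pgmap->range.start = res->start;
	pgmap->range.end = res->end;
	pgmap->nr_range = 1;	/* now required before memremap_pages() */
	addr = memremap_pages(pgmap, numa_node_id());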

Link: https://lkml.kernel.org/r/159643103789.4062302.18426128170217903785.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106116293.30709.13350662794915396198.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: "Jérôme Glisse" <jglisse@redhat.co
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/kvm/book3s_hv_uvmem.c     |    1 
 drivers/dax/device.c                   |    1 
 drivers/gpu/drm/nouveau/nouveau_dmem.c |    1 
 drivers/nvdimm/pfn_devs.c              |    1 
 drivers/nvdimm/pmem.c                  |    1 
 drivers/pci/p2pdma.c                   |    1 
 drivers/xen/unpopulated-alloc.c        |    1 
 include/linux/memremap.h               |   10 
 lib/test_hmm.c                         |    1 
 mm/memremap.c                          |  258 +++++++++++++----------
 10 files changed, 166 insertions(+), 110 deletions(-)

--- a/arch/powerpc/kvm/book3s_hv_uvmem.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -1172,6 +1172,7 @@ int kvmppc_uvmem_init(void)
 	kvmppc_uvmem_pgmap.type = MEMORY_DEVICE_PRIVATE;
 	kvmppc_uvmem_pgmap.range.start = res->start;
 	kvmppc_uvmem_pgmap.range.end = res->end;
+	kvmppc_uvmem_pgmap.nr_range = 1;
 	kvmppc_uvmem_pgmap.ops = &kvmppc_uvmem_ops;
 	/* just one global instance: */
 	kvmppc_uvmem_pgmap.owner = &kvmppc_uvmem_pgmap;
--- a/drivers/dax/device.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/dax/device.c
@@ -417,6 +417,7 @@ int dev_dax_probe(struct dev_dax *dev_da
 		if (!pgmap)
 			return -ENOMEM;
 		pgmap->range = *range;
+		pgmap->nr_range = 1;
 	}
 	pgmap->type = MEMORY_DEVICE_GENERIC;
 	addr = devm_memremap_pages(dev, pgmap);
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -251,6 +251,7 @@ nouveau_dmem_chunk_alloc(struct nouveau_
 	chunk->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	chunk->pagemap.range.start = res->start;
 	chunk->pagemap.range.end = res->end;
+	chunk->pagemap.nr_range = 1;
 	chunk->pagemap.ops = &nouveau_dmem_pagemap_ops;
 	chunk->pagemap.owner = drm->dev;
 
--- a/drivers/nvdimm/pfn_devs.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/nvdimm/pfn_devs.c
@@ -693,6 +693,7 @@ static int __nvdimm_setup_pfn(struct nd_
 		.start = nsio->res.start + start_pad,
 		.end = nsio->res.end - end_trunc,
 	};
+	pgmap->nr_range = 1;
 	if (nd_pfn->mode == PFN_MODE_RAM) {
 		if (offset < reserve)
 			return -EINVAL;
--- a/drivers/nvdimm/pmem.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/nvdimm/pmem.c
@@ -441,6 +441,7 @@ static int pmem_attach_disk(struct devic
 	} else if (pmem_should_map_pages(dev)) {
 		pmem->pgmap.range.start = res->start;
 		pmem->pgmap.range.end = res->end;
+		pmem->pgmap.nr_range = 1;
 		pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
 		pmem->pgmap.ops = &fsdax_pagemap_ops;
 		addr = devm_memremap_pages(dev, &pmem->pgmap);
--- a/drivers/pci/p2pdma.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/pci/p2pdma.c
@@ -187,6 +187,7 @@ int pci_p2pdma_add_resource(struct pci_d
 	pgmap = &p2p_pgmap->pgmap;
 	pgmap->range.start = pci_resource_start(pdev, bar) + offset;
 	pgmap->range.end = pgmap->range.start + size - 1;
+	pgmap->nr_range = 1;
 	pgmap->type = MEMORY_DEVICE_PCI_P2PDMA;
 
 	p2p_pgmap->provider = pdev;
--- a/drivers/xen/unpopulated-alloc.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/drivers/xen/unpopulated-alloc.c
@@ -47,6 +47,7 @@ static int fill_list(unsigned int nr_pag
 		.start = res->start,
 		.end = res->end,
 	};
+	pgmap->nr_range = 1;
 	pgmap->owner = res;
 
 #ifdef CONFIG_XEN_HAVE_PVMMU
--- a/include/linux/memremap.h~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/include/linux/memremap.h
@@ -94,7 +94,6 @@ struct dev_pagemap_ops {
 /**
  * struct dev_pagemap - metadata for ZONE_DEVICE mappings
  * @altmap: pre-allocated/reserved memory for vmemmap allocations
- * @range: physical address range covered by @ref
  * @ref: reference count that pins the devm_memremap_pages() mapping
  * @internal_ref: internal reference if @ref is not provided by the caller
  * @done: completion for @internal_ref
@@ -104,10 +103,12 @@ struct dev_pagemap_ops {
  * @owner: an opaque pointer identifying the entity that manages this
  *	instance.  Used by various helpers to make sure that no
  *	foreign ZONE_DEVICE memory is accessed.
+ * @nr_range: number of ranges to be mapped
+ * @range: range to be mapped when nr_range == 1
+ * @ranges: array of ranges to be mapped when nr_range > 1
  */
 struct dev_pagemap {
 	struct vmem_altmap altmap;
-	struct range range;
 	struct percpu_ref *ref;
 	struct percpu_ref internal_ref;
 	struct completion done;
@@ -115,6 +116,11 @@ struct dev_pagemap {
 	unsigned int flags;
 	const struct dev_pagemap_ops *ops;
 	void *owner;
+	int nr_range;
+	union {
+		struct range range;
+		struct range ranges[0];
+	};
 };
 
 static inline struct vmem_altmap *pgmap_altmap(struct dev_pagemap *pgmap)
--- a/lib/test_hmm.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/lib/test_hmm.c
@@ -472,6 +472,7 @@ static bool dmirror_allocate_chunk(struc
 	devmem->pagemap.type = MEMORY_DEVICE_PRIVATE;
 	devmem->pagemap.range.start = res->start;
 	devmem->pagemap.range.end = res->end;
+	devmem->pagemap.nr_range = 1;
 	devmem->pagemap.ops = &dmirror_devmem_ops;
 	devmem->pagemap.owner = mdevice;
 
--- a/mm/memremap.c~mm-memremap_pages-support-multiple-ranges-per-invocation
+++ a/mm/memremap.c
@@ -77,15 +77,19 @@ static void pgmap_array_delete(struct ra
 	synchronize_rcu();
 }
 
-static unsigned long pfn_first(struct dev_pagemap *pgmap)
+static unsigned long pfn_first(struct dev_pagemap *pgmap, int range_id)
 {
-	return PHYS_PFN(pgmap->range.start) +
-		vmem_altmap_offset(pgmap_altmap(pgmap));
+	struct range *range = &pgmap->ranges[range_id];
+	unsigned long pfn = PHYS_PFN(range->start);
+
+	if (range_id)
+		return pfn;
+	return pfn + vmem_altmap_offset(pgmap_altmap(pgmap));
 }
 
-static unsigned long pfn_end(struct dev_pagemap *pgmap)
+static unsigned long pfn_end(struct dev_pagemap *pgmap, int range_id)
 {
-	const struct range *range = &pgmap->range;
+	const struct range *range = &pgmap->ranges[range_id];
 
 	return (range->start + range_len(range)) >> PAGE_SHIFT;
 }
@@ -97,8 +101,8 @@ static unsigned long pfn_next(unsigned l
 	return pfn + 1;
 }
 
-#define for_each_device_pfn(pfn, map) \
-	for (pfn = pfn_first(map); pfn < pfn_end(map); pfn = pfn_next(pfn))
+#define for_each_device_pfn(pfn, map, i) \
+	for (pfn = pfn_first(map, i); pfn < pfn_end(map, i); pfn = pfn_next(pfn))
 
 static void dev_pagemap_kill(struct dev_pagemap *pgmap)
 {
@@ -124,20 +128,14 @@ static void dev_pagemap_cleanup(struct d
 		pgmap->ref = NULL;
 }
 
-void memunmap_pages(struct dev_pagemap *pgmap)
+static void pageunmap_range(struct dev_pagemap *pgmap, int range_id)
 {
-	struct range *range = &pgmap->range;
+	struct range *range = &pgmap->ranges[range_id];
 	struct page *first_page;
-	unsigned long pfn;
 	int nid;
 
-	dev_pagemap_kill(pgmap);
-	for_each_device_pfn(pfn, pgmap)
-		put_page(pfn_to_page(pfn));
-	dev_pagemap_cleanup(pgmap);
-
 	/* make sure to access a memmap that was actually initialized */
-	first_page = pfn_to_page(pfn_first(pgmap));
+	first_page = pfn_to_page(pfn_first(pgmap, range_id));
 
 	/* pages are dead and unused, undo the arch mapping */
 	nid = page_to_nid(first_page);
@@ -157,6 +155,22 @@ void memunmap_pages(struct dev_pagemap *
 
 	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
 	pgmap_array_delete(range);
+}
+
+void memunmap_pages(struct dev_pagemap *pgmap)
+{
+	unsigned long pfn;
+	int i;
+
+	dev_pagemap_kill(pgmap);
+	for (i = 0; i < pgmap->nr_range; i++)
+		for_each_device_pfn(pfn, pgmap, i)
+			put_page(pfn_to_page(pfn));
+	dev_pagemap_cleanup(pgmap);
+
+	for (i = 0; i < pgmap->nr_range; i++)
+		pageunmap_range(pgmap, i);
+
 	WARN_ONCE(pgmap->altmap.alloc, "failed to free all reserved pages\n");
 	devmap_managed_enable_put();
 }
@@ -175,96 +189,29 @@ static void dev_pagemap_percpu_release(s
 	complete(&pgmap->done);
 }
 
-/*
- * Not device managed version of dev_memremap_pages, undone by
- * memunmap_pages().  Please use dev_memremap_pages if you have a struct
- * device available.
- */
-void *memremap_pages(struct dev_pagemap *pgmap, int nid)
+static int pagemap_range(struct dev_pagemap *pgmap, struct mhp_params *params,
+		int range_id, int nid)
 {
-	struct range *range = &pgmap->range;
+	struct range *range = &pgmap->ranges[range_id];
 	struct dev_pagemap *conflict_pgmap;
-	struct mhp_params params = {
-		/*
-		 * We do not want any optional features only our own memmap
-		 */
-		.altmap = pgmap_altmap(pgmap),
-		.pgprot = PAGE_KERNEL,
-	};
 	int error, is_ram;
-	bool need_devmap_managed = true;
-
-	switch (pgmap->type) {
-	case MEMORY_DEVICE_PRIVATE:
-		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
-			WARN(1, "Device private memory not supported\n");
-			return ERR_PTR(-EINVAL);
-		}
-		if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
-			WARN(1, "Missing migrate_to_ram method\n");
-			return ERR_PTR(-EINVAL);
-		}
-		if (!pgmap->owner) {
-			WARN(1, "Missing owner\n");
-			return ERR_PTR(-EINVAL);
-		}
-		break;
-	case MEMORY_DEVICE_FS_DAX:
-		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
-		    IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
-			WARN(1, "File system DAX not supported\n");
-			return ERR_PTR(-EINVAL);
-		}
-		break;
-	case MEMORY_DEVICE_GENERIC:
-		need_devmap_managed = false;
-		break;
-	case MEMORY_DEVICE_PCI_P2PDMA:
-		params.pgprot = pgprot_noncached(params.pgprot);
-		need_devmap_managed = false;
-		break;
-	default:
-		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
-		break;
-	}
-
-	if (!pgmap->ref) {
-		if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
-			return ERR_PTR(-EINVAL);
 
-		init_completion(&pgmap->done);
-		error = percpu_ref_init(&pgmap->internal_ref,
-				dev_pagemap_percpu_release, 0, GFP_KERNEL);
-		if (error)
-			return ERR_PTR(error);
-		pgmap->ref = &pgmap->internal_ref;
-	} else {
-		if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
-			WARN(1, "Missing reference count teardown definition\n");
-			return ERR_PTR(-EINVAL);
-		}
-	}
-
-	if (need_devmap_managed) {
-		error = devmap_managed_enable_get(pgmap);
-		if (error)
-			return ERR_PTR(error);
-	}
+	if (WARN_ONCE(pgmap_altmap(pgmap) && range_id > 0,
+				"altmap not supported for multiple ranges\n"))
+		return -EINVAL;
 
 	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->start), NULL);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
-		error = -ENOMEM;
-		goto err_array;
+		return -ENOMEM;
 	}
 
 	conflict_pgmap = get_dev_pagemap(PHYS_PFN(range->end), NULL);
 	if (conflict_pgmap) {
 		WARN(1, "Conflicting mapping in same section\n");
 		put_dev_pagemap(conflict_pgmap);
-		error = -ENOMEM;
-		goto err_array;
+		return -ENOMEM;
 	}
 
 	is_ram = region_intersects(range->start, range_len(range),
@@ -274,19 +221,18 @@ void *memremap_pages(struct dev_pagemap
 		WARN_ONCE(1, "attempted on %s region %#llx-%#llx\n",
 				is_ram == REGION_MIXED ? "mixed" : "ram",
 				range->start, range->end);
-		error = -ENXIO;
-		goto err_array;
+		return -ENXIO;
 	}
 
 	error = xa_err(xa_store_range(&pgmap_array, PHYS_PFN(range->start),
 				PHYS_PFN(range->end), pgmap, GFP_KERNEL));
 	if (error)
-		goto err_array;
+		return error;
 
 	if (nid < 0)
 		nid = numa_mem_id();
 
-	error = track_pfn_remap(NULL, &params.pgprot, PHYS_PFN(range->start), 0,
+	error = track_pfn_remap(NULL, &params->pgprot, PHYS_PFN(range->start), 0,
 			range_len(range));
 	if (error)
 		goto err_pfn_remap;
@@ -306,7 +252,7 @@ void *memremap_pages(struct dev_pagemap
 	 */
 	if (pgmap->type == MEMORY_DEVICE_PRIVATE) {
 		error = add_pages(nid, PHYS_PFN(range->start),
-				PHYS_PFN(range_len(range)), &params);
+				PHYS_PFN(range_len(range)), params);
 	} else {
 		error = kasan_add_zero_shadow(__va(range->start), range_len(range));
 		if (error) {
@@ -315,7 +261,7 @@ void *memremap_pages(struct dev_pagemap
 		}
 
 		error = arch_add_memory(nid, range->start, range_len(range),
-					&params);
+					params);
 	}
 
 	if (!error) {
@@ -323,7 +269,7 @@ void *memremap_pages(struct dev_pagemap
 
 		zone = &NODE_DATA(nid)->node_zones[ZONE_DEVICE];
 		move_pfn_range_to_zone(zone, PHYS_PFN(range->start),
-				PHYS_PFN(range_len(range)), params.altmap);
+				PHYS_PFN(range_len(range)), params->altmap);
 	}
 
 	mem_hotplug_done();
@@ -337,20 +283,116 @@ void *memremap_pages(struct dev_pagemap
 	memmap_init_zone_device(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
 				PHYS_PFN(range->start),
 				PHYS_PFN(range_len(range)), pgmap);
-	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap) - pfn_first(pgmap));
-	return __va(range->start);
+	percpu_ref_get_many(pgmap->ref, pfn_end(pgmap, range_id)
+			- pfn_first(pgmap, range_id));
+	return 0;
 
- err_add_memory:
+err_add_memory:
 	kasan_remove_zero_shadow(__va(range->start), range_len(range));
- err_kasan:
+err_kasan:
 	untrack_pfn(NULL, PHYS_PFN(range->start), range_len(range));
- err_pfn_remap:
+err_pfn_remap:
 	pgmap_array_delete(range);
- err_array:
-	dev_pagemap_kill(pgmap);
-	dev_pagemap_cleanup(pgmap);
-	devmap_managed_enable_put();
-	return ERR_PTR(error);
+	return error;
+}
+
+
+/*
+ * Not device managed version of dev_memremap_pages, undone by
+ * memunmap_pages().  Please use dev_memremap_pages if you have a struct
+ * device available.
+ */
+void *memremap_pages(struct dev_pagemap *pgmap, int nid)
+{
+	struct mhp_params params = {
+		.altmap = pgmap_altmap(pgmap),
+		.pgprot = PAGE_KERNEL,
+	};
+	const int nr_range = pgmap->nr_range;
+	bool need_devmap_managed = true;
+	int error, i;
+
+	if (WARN_ONCE(!nr_range, "nr_range must be specified\n"))
+		return ERR_PTR(-EINVAL);
+
+	switch (pgmap->type) {
+	case MEMORY_DEVICE_PRIVATE:
+		if (!IS_ENABLED(CONFIG_DEVICE_PRIVATE)) {
+			WARN(1, "Device private memory not supported\n");
+			return ERR_PTR(-EINVAL);
+		}
+		if (!pgmap->ops || !pgmap->ops->migrate_to_ram) {
+			WARN(1, "Missing migrate_to_ram method\n");
+			return ERR_PTR(-EINVAL);
+		}
+		if (!pgmap->owner) {
+			WARN(1, "Missing owner\n");
+			return ERR_PTR(-EINVAL);
+		}
+		break;
+	case MEMORY_DEVICE_FS_DAX:
+		if (!IS_ENABLED(CONFIG_ZONE_DEVICE) ||
+		    IS_ENABLED(CONFIG_FS_DAX_LIMITED)) {
+			WARN(1, "File system DAX not supported\n");
+			return ERR_PTR(-EINVAL);
+		}
+		break;
+	case MEMORY_DEVICE_GENERIC:
+		need_devmap_managed = false;
+		break;
+	case MEMORY_DEVICE_PCI_P2PDMA:
+		params.pgprot = pgprot_noncached(params.pgprot);
+		need_devmap_managed = false;
+		break;
+	default:
+		WARN(1, "Invalid pgmap type %d\n", pgmap->type);
+		break;
+	}
+
+	if (!pgmap->ref) {
+		if (pgmap->ops && (pgmap->ops->kill || pgmap->ops->cleanup))
+			return ERR_PTR(-EINVAL);
+
+		init_completion(&pgmap->done);
+		error = percpu_ref_init(&pgmap->internal_ref,
+				dev_pagemap_percpu_release, 0, GFP_KERNEL);
+		if (error)
+			return ERR_PTR(error);
+		pgmap->ref = &pgmap->internal_ref;
+	} else {
+		if (!pgmap->ops || !pgmap->ops->kill || !pgmap->ops->cleanup) {
+			WARN(1, "Missing reference count teardown definition\n");
+			return ERR_PTR(-EINVAL);
+		}
+	}
+
+	if (need_devmap_managed) {
+		error = devmap_managed_enable_get(pgmap);
+		if (error)
+			return ERR_PTR(error);
+	}
+
+	/*
+	 * Clear the pgmap nr_range as it will be incremented for each
+	 * successfully processed range. This communicates how many
+	 * regions to unwind in the abort case.
+	 */
+	pgmap->nr_range = 0;
+	error = 0;
+	for (i = 0; i < nr_range; i++) {
+		error = pagemap_range(pgmap, &params, i, nid);
+		if (error)
+			break;
+		pgmap->nr_range++;
+	}
+
+	if (i < nr_range) {
+		memunmap_pages(pgmap);
+		pgmap->nr_range = nr_range;
+		return ERR_PTR(error);
+	}
+
+	return __va(pgmap->ranges[0].start);
 }
 EXPORT_SYMBOL_GPL(memremap_pages);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 046/181] device-dax: add dis-contiguous resource support
  2020-10-13 23:46 incoming Andrew Morton
                   ` (44 preceding siblings ...)
  2020-10-13 23:50 ` [patch 045/181] mm/memremap_pages: support multiple ranges per invocation Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 047/181] device-dax: introduce 'mapping' devices Andrew Morton
                   ` (138 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: add dis-contiguous resource support

Break the requirement that device-dax instances are physically contiguous.
With this constraint removed, fragmented available capacity can be fully
allocated.

This capability is useful to mitigate the "noisy neighbor" problem with
memory-side-cache management for virtual machines, or any other scenario
where a platform address boundary also designates a performance boundary.
For example, a direct-mapped memory-side cache might rotate cache colors
at 1GB boundaries.  With dis-contiguous allocations, a device-dax instance
can be configured to contain only one cache color.
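
To make the coloring concrete: with a hypothetical 4GB direct-mapped
memory-side cache, the 1GB chunks at platform offsets 0GB, 4GB, 8GB, and
so on all map to the same cache color, so a device-dax instance assembled
from exactly those dis-contiguous chunks occupies a single color.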

It also satisfies Joao's use case (see link) for partitioning memory for
exclusive guest access, and it allows for a potential future mode where the
host kernel need not allocate 'struct page' capacity up-front.
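
As a condensed sketch of the consumer pattern this enables (mirroring the
dev_dax_probe() changes below; error handling and the static-pgmap case
are elided, and 'nr' stands for the number of ranges the driver collected):

	/*
	 * The union in 'struct dev_pagemap' overlays ranges[0] on top of
	 * 'range', hence only "nr - 1" trailing entries are allocated.
	 */
	pgmap = devm_kzalloc(dev, sizeof(*pgmap)
			+ sizeof(struct range) * (nr - 1), GFP_KERNEL);
	if (!pgmap)
		return -ENOMEM;
	pgmap->nr_range = nr;
	for (i = 0; i < nr; i++)
		pgmap->ranges[i] = dev_dax->ranges[i].range;
	pgmap->type = MEMORY_DEVICE_GENERIC;
	addr = devm_memremap_pages(dev, pgmap);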

Link: https://lore.kernel.org/lkml/20200110190313.17144-1-joao.m.martins@oracle.com/
Link: https://lkml.kernel.org/r/159643104304.4062302.16561669534797528660.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106116875.30709.11456649969327399771.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reported-by: Joao Martins <joao.m.martins@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c              |  231 +++++++++++++++++++++++--------
 drivers/dax/dax-private.h      |    9 -
 drivers/dax/device.c           |   53 ++++---
 drivers/dax/kmem.c             |  130 +++++++++++------
 tools/testing/nvdimm/dax-dev.c |   20 +-
 5 files changed, 321 insertions(+), 122 deletions(-)

--- a/drivers/dax/bus.c~device-dax-add-dis-contiguous-resource-support
+++ a/drivers/dax/bus.c
@@ -136,15 +136,27 @@ static bool is_static(struct dax_region
 	return (dax_region->res.flags & IORESOURCE_DAX_STATIC) != 0;
 }
 
+static u64 dev_dax_size(struct dev_dax *dev_dax)
+{
+	u64 size = 0;
+	int i;
+
+	device_lock_assert(&dev_dax->dev);
+
+	for (i = 0; i < dev_dax->nr_range; i++)
+		size += range_len(&dev_dax->ranges[i].range);
+
+	return size;
+}
+
 static int dax_bus_probe(struct device *dev)
 {
 	struct dax_device_driver *dax_drv = to_dax_drv(dev->driver);
 	struct dev_dax *dev_dax = to_dev_dax(dev);
 	struct dax_region *dax_region = dev_dax->region;
-	struct range *range = &dev_dax->range;
 	int rc;
 
-	if (range_len(range) == 0 || dev_dax->id < 0)
+	if (dev_dax_size(dev_dax) == 0 || dev_dax->id < 0)
 		return -ENXIO;
 
 	rc = dax_drv->probe(dev_dax);
@@ -354,15 +366,19 @@ void kill_dev_dax(struct dev_dax *dev_da
 }
 EXPORT_SYMBOL_GPL(kill_dev_dax);
 
-static void free_dev_dax_range(struct dev_dax *dev_dax)
+static void free_dev_dax_ranges(struct dev_dax *dev_dax)
 {
 	struct dax_region *dax_region = dev_dax->region;
-	struct range *range = &dev_dax->range;
+	int i;
 
 	device_lock_assert(dax_region->dev);
-	if (range_len(range))
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range *range = &dev_dax->ranges[i].range;
+
 		__release_region(&dax_region->res, range->start,
 				range_len(range));
+	}
+	dev_dax->nr_range = 0;
 }
 
 static void unregister_dev_dax(void *dev)
@@ -372,7 +388,7 @@ static void unregister_dev_dax(void *dev
 	dev_dbg(dev, "%s\n", __func__);
 
 	kill_dev_dax(dev_dax);
-	free_dev_dax_range(dev_dax);
+	free_dev_dax_ranges(dev_dax);
 	device_del(dev);
 	put_device(dev);
 }
@@ -423,7 +439,7 @@ static ssize_t delete_store(struct devic
 	device_lock(dev);
 	device_lock(victim);
 	dev_dax = to_dev_dax(victim);
-	if (victim->driver || range_len(&dev_dax->range))
+	if (victim->driver || dev_dax_size(dev_dax))
 		rc = -EBUSY;
 	else {
 		/*
@@ -569,51 +585,86 @@ static int alloc_dev_dax_range(struct de
 	struct dax_region *dax_region = dev_dax->region;
 	struct resource *res = &dax_region->res;
 	struct device *dev = &dev_dax->dev;
+	struct dev_dax_range *ranges;
+	unsigned long pgoff = 0;
 	struct resource *alloc;
+	int i;
 
 	device_lock_assert(dax_region->dev);
 
 	/* handle the seed alloc special case */
 	if (!size) {
-		dev_dax->range = (struct range) {
-			.start = res->start,
-			.end = res->start - 1,
-		};
+		if (dev_WARN_ONCE(dev, dev_dax->nr_range,
+					"0-size allocation must be first\n"))
+			return -EBUSY;
+		/* nr_range == 0 is elsewhere special cased as 0-size device */
 		return 0;
 	}
 
+	ranges = krealloc(dev_dax->ranges, sizeof(*ranges)
+			* (dev_dax->nr_range + 1), GFP_KERNEL);
+	if (!ranges)
+		return -ENOMEM;
+
 	alloc = __request_region(res, start, size, dev_name(dev), 0);
-	if (!alloc)
+	if (!alloc) {
+		/*
+		 * If this was an empty set of ranges nothing else
+		 * will release @ranges, so do it now.
+		 */
+		if (!dev_dax->nr_range) {
+			kfree(ranges);
+			ranges = NULL;
+		}
+		dev_dax->ranges = ranges;
 		return -ENOMEM;
+	}
 
-	dev_dax->range = (struct range) {
-		.start = alloc->start,
-		.end = alloc->end,
+	for (i = 0; i < dev_dax->nr_range; i++)
+		pgoff += PHYS_PFN(range_len(&ranges[i].range));
+	dev_dax->ranges = ranges;
+	ranges[dev_dax->nr_range++] = (struct dev_dax_range) {
+		.pgoff = pgoff,
+		.range = {
+			.start = alloc->start,
+			.end = alloc->end,
+		},
 	};
 
+	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
+			&alloc->start, &alloc->end);
+
 	return 0;
 }
 
 static int adjust_dev_dax_range(struct dev_dax *dev_dax, struct resource *res, resource_size_t size)
 {
+	int last_range = dev_dax->nr_range - 1;
+	struct dev_dax_range *dax_range = &dev_dax->ranges[last_range];
 	struct dax_region *dax_region = dev_dax->region;
-	struct range *range = &dev_dax->range;
-	int rc = 0;
+	bool is_shrink = resource_size(res) > size;
+	struct range *range = &dax_range->range;
+	struct device *dev = &dev_dax->dev;
+	int rc;
 
 	device_lock_assert(dax_region->dev);
 
-	if (size)
-		rc = adjust_resource(res, range->start, size);
-	else
-		__release_region(&dax_region->res, range->start, range_len(range));
+	if (dev_WARN_ONCE(dev, !size, "deletion is handled by dev_dax_shrink\n"))
+		return -EINVAL;
+
+	rc = adjust_resource(res, range->start, size);
 	if (rc)
 		return rc;
 
-	dev_dax->range = (struct range) {
+	*range = (struct range) {
 		.start = range->start,
 		.end = range->start + size - 1,
 	};
 
+	dev_dbg(dev, "%s range[%d]: %#llx:%#llx\n", is_shrink ? "shrink" : "extend",
+			last_range, (unsigned long long) range->start,
+			(unsigned long long) range->end);
+
 	return 0;
 }
 
@@ -621,7 +672,11 @@ static ssize_t size_show(struct device *
 		struct device_attribute *attr, char *buf)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
-	unsigned long long size = range_len(&dev_dax->range);
+	unsigned long long size;
+
+	device_lock(dev);
+	size = dev_dax_size(dev_dax);
+	device_unlock(dev);
 
 	return sprintf(buf, "%llu\n", size);
 }
@@ -639,32 +694,82 @@ static bool alloc_is_aligned(struct dax_
 
 static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
 {
+	resource_size_t to_shrink = dev_dax_size(dev_dax) - size;
 	struct dax_region *dax_region = dev_dax->region;
-	struct range *range = &dev_dax->range;
-	struct resource *res, *adjust = NULL;
 	struct device *dev = &dev_dax->dev;
+	int i;
 
-	for_each_dax_region_resource(dax_region, res)
-		if (strcmp(res->name, dev_name(dev)) == 0
-				&& res->start == range->start) {
-			adjust = res;
-			break;
+	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
+		struct range *range = &dev_dax->ranges[i].range;
+		struct resource *adjust = NULL, *res;
+		resource_size_t shrink;
+
+		shrink = min_t(u64, to_shrink, range_len(range));
+		if (shrink >= range_len(range)) {
+			__release_region(&dax_region->res, range->start,
+					range_len(range));
+			dev_dax->nr_range--;
+			dev_dbg(dev, "delete range[%d]: %#llx:%#llx\n", i,
+					(unsigned long long) range->start,
+					(unsigned long long) range->end);
+			to_shrink -= shrink;
+			if (!to_shrink)
+				break;
+			continue;
 		}
 
-	if (dev_WARN_ONCE(dev, !adjust, "failed to find matching resource\n"))
-		return -ENXIO;
-	return adjust_dev_dax_range(dev_dax, adjust, size);
+		for_each_dax_region_resource(dax_region, res)
+			if (strcmp(res->name, dev_name(dev)) == 0
+					&& res->start == range->start) {
+				adjust = res;
+				break;
+			}
+
+		if (dev_WARN_ONCE(dev, !adjust || i != dev_dax->nr_range - 1,
+					"failed to find matching resource\n"))
+			return -ENXIO;
+		return adjust_dev_dax_range(dev_dax, adjust, range_len(range)
+				- shrink);
+	}
+	return 0;
+}
+
+/*
+ * Only allow adjustments that preserve the relative pgoff of existing
+ * allocations. I.e. the dev_dax->ranges array is ordered by increasing pgoff.
+ */
+static bool adjust_ok(struct dev_dax *dev_dax, struct resource *res)
+{
+	struct dev_dax_range *last;
+	int i;
+
+	if (dev_dax->nr_range == 0)
+		return false;
+	if (strcmp(res->name, dev_name(&dev_dax->dev)) != 0)
+		return false;
+	last = &dev_dax->ranges[dev_dax->nr_range - 1];
+	if (last->range.start != res->start || last->range.end != res->end)
+		return false;
+	for (i = 0; i < dev_dax->nr_range - 1; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+
+		if (dax_range->pgoff > last->pgoff)
+			return false;
+	}
+
+	return true;
 }
 
 static ssize_t dev_dax_resize(struct dax_region *dax_region,
 		struct dev_dax *dev_dax, resource_size_t size)
 {
 	resource_size_t avail = dax_region_avail_size(dax_region), to_alloc;
-	resource_size_t dev_size = range_len(&dev_dax->range);
+	resource_size_t dev_size = dev_dax_size(dev_dax);
 	struct resource *region_res = &dax_region->res;
 	struct device *dev = &dev_dax->dev;
-	const char *name = dev_name(dev);
 	struct resource *res, *first;
+	resource_size_t alloc = 0;
+	int rc;
 
 	if (dev->driver)
 		return -EBUSY;
@@ -685,35 +790,47 @@ static ssize_t dev_dax_resize(struct dax
 	 * may involve adjusting the end of an existing resource, or
 	 * allocating a new resource.
 	 */
+retry:
 	first = region_res->child;
 	if (!first)
 		return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
-	for (res = first; to_alloc && res; res = res->sibling) {
+
+	rc = -ENOSPC;
+	for (res = first; res; res = res->sibling) {
 		struct resource *next = res->sibling;
-		resource_size_t free;
 
 		/* space at the beginning of the region */
-		free = 0;
-		if (res == first && res->start > dax_region->res.start)
-			free = res->start - dax_region->res.start;
-		if (free >= to_alloc && dev_size == 0)
-			return alloc_dev_dax_range(dev_dax, dax_region->res.start, to_alloc);
+		if (res == first && res->start > dax_region->res.start) {
+			alloc = min(res->start - dax_region->res.start, to_alloc);
+			rc = alloc_dev_dax_range(dev_dax, dax_region->res.start, alloc);
+			break;
+		}
 
-		free = 0;
+		alloc = 0;
 		/* space between allocations */
 		if (next && next->start > res->end + 1)
-			free = next->start - res->end + 1;
+			alloc = min(next->start - (res->end + 1), to_alloc);
 
 		/* space at the end of the region */
-		if (free < to_alloc && !next && res->end < region_res->end)
-			free = region_res->end - res->end;
+		if (!alloc && !next && res->end < region_res->end)
+			alloc = min(region_res->end - res->end, to_alloc);
 
-		if (free >= to_alloc && strcmp(name, res->name) == 0)
-			return adjust_dev_dax_range(dev_dax, res, resource_size(res) + to_alloc);
-		else if (free >= to_alloc && dev_size == 0)
-			return alloc_dev_dax_range(dev_dax, res->end + 1, to_alloc);
+		if (!alloc)
+			continue;
+
+		if (adjust_ok(dev_dax, res)) {
+			rc = adjust_dev_dax_range(dev_dax, res, resource_size(res) + alloc);
+			break;
+		}
+		rc = alloc_dev_dax_range(dev_dax, res->end + 1, alloc);
+		break;
 	}
-	return -ENOSPC;
+	if (rc)
+		return rc;
+	to_alloc -= alloc;
+	if (to_alloc)
+		goto retry;
+	return 0;
 }
 
 static ssize_t size_store(struct device *dev, struct device_attribute *attr,
@@ -767,8 +884,15 @@ static ssize_t resource_show(struct devi
 		struct device_attribute *attr, char *buf)
 {
 	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	unsigned long long start;
+
+	if (dev_dax->nr_range < 1)
+		start = dax_region->res.start;
+	else
+		start = dev_dax->ranges[0].range.start;
 
-	return sprintf(buf, "%#llx\n", dev_dax->range.start);
+	return sprintf(buf, "%#llx\n", start);
 }
 static DEVICE_ATTR(resource, 0400, resource_show, NULL);
 
@@ -833,6 +957,7 @@ static void dev_dax_release(struct devic
 	put_dax(dax_dev);
 	free_dev_dax_id(dev_dax);
 	dax_region_put(dax_region);
+	kfree(dev_dax->ranges);
 	kfree(dev_dax->pgmap);
 	kfree(dev_dax);
 }
@@ -941,7 +1066,7 @@ struct dev_dax *devm_create_dev_dax(stru
 err_alloc_dax:
 	kfree(dev_dax->pgmap);
 err_pgmap:
-	free_dev_dax_range(dev_dax);
+	free_dev_dax_ranges(dev_dax);
 err_range:
 	free_dev_dax_id(dev_dax);
 err_id:
--- a/drivers/dax/dax-private.h~device-dax-add-dis-contiguous-resource-support
+++ a/drivers/dax/dax-private.h
@@ -49,7 +49,8 @@ struct dax_region {
  * @id: ida allocated id
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
- * @range: resource range for the instance
+ * @nr_range: size of @ranges
+ * @ranges: resource-span + pgoff tuples for the instance
  */
 struct dev_dax {
 	struct dax_region *region;
@@ -58,7 +59,11 @@ struct dev_dax {
 	int id;
 	struct device dev;
 	struct dev_pagemap *pgmap;
-	struct range range;
+	int nr_range;
+	struct dev_dax_range {
+		unsigned long pgoff;
+		struct range range;
+	} *ranges;
 };
 
 static inline struct dev_dax *to_dev_dax(struct device *dev)
--- a/drivers/dax/device.c~device-dax-add-dis-contiguous-resource-support
+++ a/drivers/dax/device.c
@@ -55,15 +55,22 @@ static int check_vma(struct dev_dax *dev
 __weak phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
 {
-	struct range *range = &dev_dax->range;
-	phys_addr_t phys;
+	int i;
 
-	phys = pgoff * PAGE_SIZE + range->start;
-	if (phys >= range->start && phys <= range->end) {
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t phys;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		phys = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
 		if (phys + size - 1 <= range->end)
 			return phys;
+		break;
 	}
-
 	return -1;
 }
 
@@ -395,30 +402,40 @@ static void dev_dax_kill(void *dev_dax)
 int dev_dax_probe(struct dev_dax *dev_dax)
 {
 	struct dax_device *dax_dev = dev_dax->dax_dev;
-	struct range *range = &dev_dax->range;
 	struct device *dev = &dev_dax->dev;
 	struct dev_pagemap *pgmap;
 	struct inode *inode;
 	struct cdev *cdev;
 	void *addr;
-	int rc;
-
-	/* 1:1 map region resource range to device-dax instance range */
-	if (!devm_request_mem_region(dev, range->start, range_len(range),
-				dev_name(dev))) {
-		dev_warn(dev, "could not reserve range: %#llx - %#llx\n",
-				range->start, range->end);
-		return -EBUSY;
-	}
+	int rc, i;
 
 	pgmap = dev_dax->pgmap;
+	if (dev_WARN_ONCE(dev, pgmap && dev_dax->nr_range > 1,
+			"static pgmap / multi-range device conflict\n"))
+		return -EINVAL;
+
 	if (!pgmap) {
-		pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
+		pgmap = devm_kzalloc(dev, sizeof(*pgmap) + sizeof(struct range)
+				* (dev_dax->nr_range - 1), GFP_KERNEL);
 		if (!pgmap)
 			return -ENOMEM;
-		pgmap->range = *range;
-		pgmap->nr_range = 1;
+		pgmap->nr_range = dev_dax->nr_range;
+	}
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range *range = &dev_dax->ranges[i].range;
+
+		if (!devm_request_mem_region(dev, range->start,
+					range_len(range), dev_name(dev))) {
+			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve range\n",
+					i, range->start, range->end);
+			return -EBUSY;
+		}
+		/* don't update the range for static pgmap */
+		if (!dev_dax->pgmap)
+			pgmap->ranges[i] = *range;
 	}
+
 	pgmap->type = MEMORY_DEVICE_GENERIC;
 	addr = devm_memremap_pages(dev, pgmap);
 	if (IS_ERR(addr))
--- a/drivers/dax/kmem.c~device-dax-add-dis-contiguous-resource-support
+++ a/drivers/dax/kmem.c
@@ -19,24 +19,28 @@ static const char *kmem_name;
 /* Set if any memory will remain added when the driver will be unloaded. */
 static bool any_hotremove_failed;
 
-static struct range dax_kmem_range(struct dev_dax *dev_dax)
+static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
 {
-	struct range range;
+	struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+	struct range *range = &dax_range->range;
 
 	/* memory-block align the hotplug range */
-	range.start = ALIGN(dev_dax->range.start, memory_block_size_bytes());
-	range.end = ALIGN_DOWN(dev_dax->range.end + 1, memory_block_size_bytes()) - 1;
-	return range;
+	r->start = ALIGN(range->start, memory_block_size_bytes());
+	r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
+	if (r->start >= r->end) {
+		r->start = range->start;
+		r->end = range->end;
+		return -ENOSPC;
+	}
+	return 0;
 }
 
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
-	struct range range = dax_kmem_range(dev_dax);
 	struct device *dev = &dev_dax->dev;
-	struct resource *res;
+	int i, mapped = 0;
 	char *res_name;
 	int numa_node;
-	int rc;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -55,31 +59,58 @@ static int dev_dax_kmem_probe(struct dev
 	if (!res_name)
 		return -ENOMEM;
 
-	/* Region is permanently reserved if hotremove fails. */
-	res = request_mem_region(range.start, range_len(&range), res_name);
-	if (!res) {
-		dev_warn(dev, "could not reserve region [%#llx-%#llx]\n", range.start, range.end);
-		kfree(res_name);
-		return -EBUSY;
-	}
-
-	/*
-	 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
-	 * so that add_memory() can add a child resource.  Do not
-	 * inherit flags from the parent since it may set new flags
-	 * unknown to us that will break add_memory() below.
-	 */
-	res->flags = IORESOURCE_SYSTEM_RAM;
-
-	/*
-	 * Ensure that future kexec'd kernels will not treat this as RAM
-	 * automatically.
-	 */
-	rc = add_memory_driver_managed(numa_node, range.start, range_len(&range), kmem_name);
-	if (rc) {
-		release_mem_region(range.start, range_len(&range));
-		kfree(res_name);
-		return rc;
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct resource *res;
+		struct range range;
+		int rc;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc) {
+			dev_info(dev, "mapping%d: %#llx-%#llx too small after alignment\n",
+					i, range.start, range.end);
+			continue;
+		}
+
+		/* Region is permanently reserved if hotremove fails. */
+		res = request_mem_region(range.start, range_len(&range), res_name);
+		if (!res) {
+			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
+					i, range.start, range.end);
+			/*
+			 * Once some memory has been onlined we can't
+			 * assume that it can be un-onlined safely.
+			 */
+			if (mapped)
+				continue;
+			kfree(res_name);
+			return -EBUSY;
+		}
+
+		/*
+		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
+		 * so that add_memory() can add a child resource.  Do not
+		 * inherit flags from the parent since it may set new flags
+		 * unknown to us that will break add_memory() below.
+		 */
+		res->flags = IORESOURCE_SYSTEM_RAM;
+
+		/*
+		 * Ensure that future kexec'd kernels will not treat
+		 * this as RAM automatically.
+		 */
+		rc = add_memory_driver_managed(numa_node, range.start,
+				range_len(&range), kmem_name);
+
+		if (rc) {
+			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
+					i, range.start, range.end);
+			release_mem_region(range.start, range_len(&range));
+			if (mapped)
+				continue;
+			kfree(res_name);
+			return rc;
+		}
+		mapped++;
 	}
 
 	dev_set_drvdata(dev, res_name);
@@ -90,9 +121,8 @@ static int dev_dax_kmem_probe(struct dev
 #ifdef CONFIG_MEMORY_HOTREMOVE
 static int dev_dax_kmem_remove(struct dev_dax *dev_dax)
 {
-	int rc;
+	int i, success = 0;
 	struct device *dev = &dev_dax->dev;
-	struct range range = dax_kmem_range(dev_dax);
 	const char *res_name = dev_get_drvdata(dev);
 
 	/*
@@ -101,17 +131,31 @@ static int dev_dax_kmem_remove(struct de
 	 * there is no way to hotremove this memory until reboot because device
 	 * unbind will succeed even if we return failure.
 	 */
-	rc = remove_memory(dev_dax->target_node, range.start, range_len(&range));
-	if (rc) {
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct range range;
+		int rc;
+
+		rc = dax_kmem_range(dev_dax, i, &range);
+		if (rc)
+			continue;
+
+		rc = remove_memory(dev_dax->target_node, range.start,
+				range_len(&range));
+		if (rc == 0) {
+			release_mem_region(range.start, range_len(&range));
+			success++;
+			continue;
+		}
 		any_hotremove_failed = true;
-		dev_err(dev, "%#llx-%#llx cannot be hotremoved until the next reboot\n",
-				range.start, range.end);
-		return rc;
+		dev_err(dev,
+			"mapping%d: %#llx-%#llx cannot be hotremoved until the next reboot\n",
+				i, range.start, range.end);
 	}
 
-	/* Release and free dax resources */
-	release_mem_region(range.start, range_len(&range));
-	kfree(res_name);
+	if (success >= dev_dax->nr_range) {
+		kfree(res_name);
+		dev_set_drvdata(dev, NULL);
+	}
 
 	return 0;
 }
--- a/tools/testing/nvdimm/dax-dev.c~device-dax-add-dis-contiguous-resource-support
+++ a/tools/testing/nvdimm/dax-dev.c
@@ -9,11 +9,18 @@
 phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff,
 		unsigned long size)
 {
-	struct range *range = &dev_dax->range;
-	phys_addr_t addr;
+	int i;
 
-	addr = pgoff * PAGE_SIZE + range->start;
-	if (addr >= range->start && addr <= range->end) {
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		struct dev_dax_range *dax_range = &dev_dax->ranges[i];
+		struct range *range = &dax_range->range;
+		unsigned long long pgoff_end;
+		phys_addr_t addr;
+
+		pgoff_end = dax_range->pgoff + PHYS_PFN(range_len(range)) - 1;
+		if (pgoff < dax_range->pgoff || pgoff > pgoff_end)
+			continue;
+		addr = PFN_PHYS(pgoff - dax_range->pgoff) + range->start;
 		if (addr + size - 1 <= range->end) {
 			if (get_nfit_res(addr)) {
 				struct page *page;
@@ -23,9 +30,10 @@ phys_addr_t dax_pgoff_to_phys(struct dev
 
 				page = vmalloc_to_page((void *)addr);
 				return PFN_PHYS(page_to_pfn(page));
-			} else
-				return addr;
+			}
+			return addr;
 		}
+		break;
 	}
 	return -1;
 }
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 047/181] device-dax: introduce 'mapping' devices
  2020-10-13 23:46 incoming Andrew Morton
                   ` (45 preceding siblings ...)
  2020-10-13 23:50 ` [patch 046/181] device-dax: add dis-contiguous resource support Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 048/181] device-dax: make align a per-device property Andrew Morton
                   ` (137 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: introduce 'mapping' devices

In support of interrogating the physical address layout of a device with
dis-contiguous ranges, introduce a sysfs directory with 'start', 'end',
and 'page_offset' attributes.  The alternative is trying to parse
/proc/iomem, and that file will not reflect the extent layout until the
device is enabled.
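
With these attributes in place the extent layout is visible directly in
sysfs.  For a hypothetical two-range instance (the 'dax0.0' name and bus
path are illustrative; the attribute names are the ones added below):

	/sys/bus/dax/devices/dax0.0/mapping0/start
	/sys/bus/dax/devices/dax0.0/mapping0/end
	/sys/bus/dax/devices/dax0.0/mapping0/page_offset
	/sys/bus/dax/devices/dax0.0/mapping1/start
	/sys/bus/dax/devices/dax0.0/mapping1/end
	/sys/bus/dax/devices/dax0.0/mapping1/page_offset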

Link: https://lkml.kernel.org/r/159643104819.4062302.13691281391423291589.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lkml.kernel.org/r/160106117446.30709.2751020815463722537.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |  191 +++++++++++++++++++++++++++++++++++-
 drivers/dax/dax-private.h |   14 ++
 2 files changed, 203 insertions(+), 2 deletions(-)

--- a/drivers/dax/bus.c~device-dax-introduce-mapping-devices
+++ a/drivers/dax/bus.c
@@ -579,6 +579,167 @@ struct dax_region *alloc_dax_region(stru
 }
 EXPORT_SYMBOL_GPL(alloc_dax_region);
 
+static void dax_mapping_release(struct device *dev)
+{
+	struct dax_mapping *mapping = to_dax_mapping(dev);
+	struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+
+	ida_free(&dev_dax->ida, mapping->id);
+	kfree(mapping);
+}
+
+static void unregister_dax_mapping(void *data)
+{
+	struct device *dev = data;
+	struct dax_mapping *mapping = to_dax_mapping(dev);
+	struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+	struct dax_region *dax_region = dev_dax->region;
+
+	dev_dbg(dev, "%s\n", __func__);
+
+	device_lock_assert(dax_region->dev);
+
+	dev_dax->ranges[mapping->range_id].mapping = NULL;
+	mapping->range_id = -1;
+
+	device_del(dev);
+	put_device(dev);
+}
+
+static struct dev_dax_range *get_dax_range(struct device *dev)
+{
+	struct dax_mapping *mapping = to_dax_mapping(dev);
+	struct dev_dax *dev_dax = to_dev_dax(dev->parent);
+	struct dax_region *dax_region = dev_dax->region;
+
+	device_lock(dax_region->dev);
+	if (mapping->range_id < 0) {
+		device_unlock(dax_region->dev);
+		return NULL;
+	}
+
+	return &dev_dax->ranges[mapping->range_id];
+}
+
+static void put_dax_range(struct dev_dax_range *dax_range)
+{
+	struct dax_mapping *mapping = dax_range->mapping;
+	struct dev_dax *dev_dax = to_dev_dax(mapping->dev.parent);
+	struct dax_region *dax_region = dev_dax->region;
+
+	device_unlock(dax_region->dev);
+}
+
+static ssize_t start_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax_range *dax_range;
+	ssize_t rc;
+
+	dax_range = get_dax_range(dev);
+	if (!dax_range)
+		return -ENXIO;
+	rc = sprintf(buf, "%#llx\n", dax_range->range.start);
+	put_dax_range(dax_range);
+
+	return rc;
+}
+static DEVICE_ATTR(start, 0400, start_show, NULL);
+
+static ssize_t end_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax_range *dax_range;
+	ssize_t rc;
+
+	dax_range = get_dax_range(dev);
+	if (!dax_range)
+		return -ENXIO;
+	rc = sprintf(buf, "%#llx\n", dax_range->range.end);
+	put_dax_range(dax_range);
+
+	return rc;
+}
+static DEVICE_ATTR(end, 0400, end_show, NULL);
+
+static ssize_t pgoff_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax_range *dax_range;
+	ssize_t rc;
+
+	dax_range = get_dax_range(dev);
+	if (!dax_range)
+		return -ENXIO;
+	rc = sprintf(buf, "%#lx\n", dax_range->pgoff);
+	put_dax_range(dax_range);
+
+	return rc;
+}
+static DEVICE_ATTR(page_offset, 0400, pgoff_show, NULL);
+
+static struct attribute *dax_mapping_attributes[] = {
+	&dev_attr_start.attr,
+	&dev_attr_end.attr,
+	&dev_attr_page_offset.attr,
+	NULL,
+};
+
+static const struct attribute_group dax_mapping_attribute_group = {
+	.attrs = dax_mapping_attributes,
+};
+
+static const struct attribute_group *dax_mapping_attribute_groups[] = {
+	&dax_mapping_attribute_group,
+	NULL,
+};
+
+static struct device_type dax_mapping_type = {
+	.release = dax_mapping_release,
+	.groups = dax_mapping_attribute_groups,
+};
+
+static int devm_register_dax_mapping(struct dev_dax *dev_dax, int range_id)
+{
+	struct dax_region *dax_region = dev_dax->region;
+	struct dax_mapping *mapping;
+	struct device *dev;
+	int rc;
+
+	device_lock_assert(dax_region->dev);
+
+	if (dev_WARN_ONCE(&dev_dax->dev, !dax_region->dev->driver,
+				"region disabled\n"))
+		return -ENXIO;
+
+	mapping = kzalloc(sizeof(*mapping), GFP_KERNEL);
+	if (!mapping)
+		return -ENOMEM;
+	mapping->range_id = range_id;
+	mapping->id = ida_alloc(&dev_dax->ida, GFP_KERNEL);
+	if (mapping->id < 0) {
+		kfree(mapping);
+		return -ENOMEM;
+	}
+	dev_dax->ranges[range_id].mapping = mapping;
+	dev = &mapping->dev;
+	device_initialize(dev);
+	dev->parent = &dev_dax->dev;
+	dev->type = &dax_mapping_type;
+	dev_set_name(dev, "mapping%d", mapping->id);
+	rc = device_add(dev);
+	if (rc) {
+		put_device(dev);
+		return rc;
+	}
+
+	rc = devm_add_action_or_reset(dax_region->dev, unregister_dax_mapping,
+			dev);
+	if (rc)
+		return rc;
+	return 0;
+}
+
 static int alloc_dev_dax_range(struct dev_dax *dev_dax, u64 start,
 		resource_size_t size)
 {
@@ -588,7 +749,7 @@ static int alloc_dev_dax_range(struct de
 	struct dev_dax_range *ranges;
 	unsigned long pgoff = 0;
 	struct resource *alloc;
-	int i;
+	int i, rc;
 
 	device_lock_assert(dax_region->dev);
 
@@ -633,6 +794,22 @@ static int alloc_dev_dax_range(struct de
 
 	dev_dbg(dev, "alloc range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
 			&alloc->start, &alloc->end);
+	/*
+	 * A dev_dax instance must be registered before mapping device
+	 * children can be added. Defer to devm_create_dev_dax() to add
+	 * the initial mapping device.
+	 */
+	if (!device_is_registered(&dev_dax->dev))
+		return 0;
+
+	rc = devm_register_dax_mapping(dev_dax, dev_dax->nr_range - 1);
+	if (rc) {
+		dev_dbg(dev, "delete range[%d]: %pa:%pa\n", dev_dax->nr_range - 1,
+				&alloc->start, &alloc->end);
+		dev_dax->nr_range--;
+		__release_region(res, alloc->start, resource_size(alloc));
+		return rc;
+	}
 
 	return 0;
 }
@@ -701,11 +878,14 @@ static int dev_dax_shrink(struct dev_dax
 
 	for (i = dev_dax->nr_range - 1; i >= 0; i--) {
 		struct range *range = &dev_dax->ranges[i].range;
+		struct dax_mapping *mapping = dev_dax->ranges[i].mapping;
 		struct resource *adjust = NULL, *res;
 		resource_size_t shrink;
 
 		shrink = min_t(u64, to_shrink, range_len(range));
 		if (shrink >= range_len(range)) {
+			devm_release_action(dax_region->dev,
+					unregister_dax_mapping, &mapping->dev);
 			__release_region(&dax_region->res, range->start,
 					range_len(range));
 			dev_dax->nr_range--;
@@ -1036,9 +1216,9 @@ struct dev_dax *devm_create_dev_dax(stru
 	/* a device_dax instance is dead while the driver is not attached */
 	kill_dax(dax_dev);
 
-	/* from here on we're committed to teardown via dev_dax_release() */
 	dev_dax->dax_dev = dax_dev;
 	dev_dax->target_node = dax_region->target_node;
+	ida_init(&dev_dax->ida);
 	kref_get(&dax_region->kref);
 
 	inode = dax_inode(dax_dev);
@@ -1061,6 +1241,13 @@ struct dev_dax *devm_create_dev_dax(stru
 	if (rc)
 		return ERR_PTR(rc);
 
+	/* register mapping device for the initial allocation range */
+	if (dev_dax->nr_range && range_len(&dev_dax->ranges[0].range)) {
+		rc = devm_register_dax_mapping(dev_dax, 0);
+		if (rc)
+			return ERR_PTR(rc);
+	}
+
 	return dev_dax;
 
 err_alloc_dax:
--- a/drivers/dax/dax-private.h~device-dax-introduce-mapping-devices
+++ a/drivers/dax/dax-private.h
@@ -40,6 +40,12 @@ struct dax_region {
 	struct device *youngest;
 };
 
+struct dax_mapping {
+	struct device dev;
+	int range_id;
+	int id;
+};
+
 /**
  * struct dev_dax - instance data for a subdivision of a dax region, and
  * data while the device is activated in the driver.
@@ -47,6 +53,7 @@ struct dax_region {
  * @dax_dev - core dax functionality
  * @target_node: effective numa node if dev_dax memory range is onlined
  * @id: ida allocated id
+ * @ida: mapping id allocator
  * @dev - device core
  * @pgmap - pgmap for memmap setup / lifetime (driver owned)
  * @nr_range: size of @ranges
@@ -57,12 +64,14 @@ struct dev_dax {
 	struct dax_device *dax_dev;
 	int target_node;
 	int id;
+	struct ida ida;
 	struct device dev;
 	struct dev_pagemap *pgmap;
 	int nr_range;
 	struct dev_dax_range {
 		unsigned long pgoff;
 		struct range range;
+		struct dax_mapping *mapping;
 	} *ranges;
 };
 
@@ -70,4 +79,9 @@ static inline struct dev_dax *to_dev_dax
 {
 	return container_of(dev, struct dev_dax, dev);
 }
+
+static inline struct dax_mapping *to_dax_mapping(struct device *dev)
+{
+	return container_of(dev, struct dax_mapping, dev);
+}
 #endif
_
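
For reference, a minimal userspace sketch of inspecting the new mappingN
children; the sysfs device path below is an assumption for illustration,
while the attribute names (start, end, page_offset) come from the patch
above:

#include <stdio.h>

/* Hypothetical helper: read one sysfs attribute into a buffer. */
static int read_attr(const char *base, const char *attr, char *out, int len)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "%s/%s", base, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(out, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return 0;
}

int main(void)
{
	/* Assumed device path; a real tool would enumerate mapping%d. */
	const char *base = "/sys/bus/dax/devices/dax0.0/mapping0";
	char start[32], end[32], pgoff[32];

	if (read_attr(base, "start", start, sizeof(start)) ||
	    read_attr(base, "end", end, sizeof(end)) ||
	    read_attr(base, "page_offset", pgoff, sizeof(pgoff)))
		return 1;
	/* sysfs values carry their own trailing newlines */
	printf("range %s-%s @ pgoff %s", start, end, pgoff);
	return 0;
}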

* [patch 048/181] device-dax: make align a per-device property
  2020-10-13 23:46 incoming Andrew Morton
                   ` (46 preceding siblings ...)
  2020-10-13 23:50 ` [patch 047/181] device-dax: introduce 'mapping' devices Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:50 ` [patch 049/181] device-dax: add an 'align' attribute Andrew Morton
                   ` (136 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Joao Martins <joao.m.martins@oracle.com>
Subject: device-dax: make align a per-device property

Introduce @align to struct dev_dax.

When creating a new device, we still initialize it to the default
dax_region @align.  Child devices belonging to a region may wish to keep a
different alignment property instead of a global region-defined one.

Link: https://lkml.kernel.org/r/159643105377.4062302.4159447829955683131.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lore.kernel.org/r/20200716172913.19658-2-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/160106117957.30709.1142303024324655705.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |    1 
 drivers/dax/dax-private.h |    3 ++
 drivers/dax/device.c      |   41 +++++++++++++-----------------------
 3 files changed, 19 insertions(+), 26 deletions(-)

--- a/drivers/dax/bus.c~device-dax-make-align-a-per-device-property
+++ a/drivers/dax/bus.c
@@ -1218,6 +1218,7 @@ struct dev_dax *devm_create_dev_dax(stru
 
 	dev_dax->dax_dev = dax_dev;
 	dev_dax->target_node = dax_region->target_node;
+	dev_dax->align = dax_region->align;
 	ida_init(&dev_dax->ida);
 	kref_get(&dax_region->kref);
 
--- a/drivers/dax/dax-private.h~device-dax-make-align-a-per-device-property
+++ a/drivers/dax/dax-private.h
@@ -62,6 +62,7 @@ struct dax_mapping {
 struct dev_dax {
 	struct dax_region *region;
 	struct dax_device *dax_dev;
+	unsigned int align;
 	int target_node;
 	int id;
 	struct ida ida;
@@ -84,4 +85,6 @@ static inline struct dax_mapping *to_dax
 {
 	return container_of(dev, struct dax_mapping, dev);
 }
+
+phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
 #endif
--- a/drivers/dax/device.c~device-dax-make-align-a-per-device-property
+++ a/drivers/dax/device.c
@@ -17,7 +17,6 @@
 static int check_vma(struct dev_dax *dev_dax, struct vm_area_struct *vma,
 		const char *func)
 {
-	struct dax_region *dax_region = dev_dax->region;
 	struct device *dev = &dev_dax->dev;
 	unsigned long mask;
 
@@ -32,7 +31,7 @@ static int check_vma(struct dev_dax *dev
 		return -EINVAL;
 	}
 
-	mask = dax_region->align - 1;
+	mask = dev_dax->align - 1;
 	if (vma->vm_start & mask || vma->vm_end & mask) {
 		dev_info_ratelimited(dev,
 				"%s: %s: fail, unaligned vma (%#lx - %#lx, %#lx)\n",
@@ -78,21 +77,19 @@ static vm_fault_t __dev_dax_pte_fault(st
 				struct vm_fault *vmf, pfn_t *pfn)
 {
 	struct device *dev = &dev_dax->dev;
-	struct dax_region *dax_region;
 	phys_addr_t phys;
 	unsigned int fault_size = PAGE_SIZE;
 
 	if (check_vma(dev_dax, vmf->vma, __func__))
 		return VM_FAULT_SIGBUS;
 
-	dax_region = dev_dax->region;
-	if (dax_region->align > PAGE_SIZE) {
+	if (dev_dax->align > PAGE_SIZE) {
 		dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-			dax_region->align, fault_size);
+			dev_dax->align, fault_size);
 		return VM_FAULT_SIGBUS;
 	}
 
-	if (fault_size != dax_region->align)
+	if (fault_size != dev_dax->align)
 		return VM_FAULT_SIGBUS;
 
 	phys = dax_pgoff_to_phys(dev_dax, vmf->pgoff, PAGE_SIZE);
@@ -111,7 +108,6 @@ static vm_fault_t __dev_dax_pmd_fault(st
 {
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	struct device *dev = &dev_dax->dev;
-	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
 	unsigned int fault_size = PMD_SIZE;
@@ -119,16 +115,15 @@ static vm_fault_t __dev_dax_pmd_fault(st
 	if (check_vma(dev_dax, vmf->vma, __func__))
 		return VM_FAULT_SIGBUS;
 
-	dax_region = dev_dax->region;
-	if (dax_region->align > PMD_SIZE) {
+	if (dev_dax->align > PMD_SIZE) {
 		dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-			dax_region->align, fault_size);
+			dev_dax->align, fault_size);
 		return VM_FAULT_SIGBUS;
 	}
 
-	if (fault_size < dax_region->align)
+	if (fault_size < dev_dax->align)
 		return VM_FAULT_SIGBUS;
-	else if (fault_size > dax_region->align)
+	else if (fault_size > dev_dax->align)
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
@@ -154,7 +149,6 @@ static vm_fault_t __dev_dax_pud_fault(st
 {
 	unsigned long pud_addr = vmf->address & PUD_MASK;
 	struct device *dev = &dev_dax->dev;
-	struct dax_region *dax_region;
 	phys_addr_t phys;
 	pgoff_t pgoff;
 	unsigned int fault_size = PUD_SIZE;
@@ -163,16 +157,15 @@ static vm_fault_t __dev_dax_pud_fault(st
 	if (check_vma(dev_dax, vmf->vma, __func__))
 		return VM_FAULT_SIGBUS;
 
-	dax_region = dev_dax->region;
-	if (dax_region->align > PUD_SIZE) {
+	if (dev_dax->align > PUD_SIZE) {
 		dev_dbg(dev, "alignment (%#x) > fault size (%#x)\n",
-			dax_region->align, fault_size);
+			dev_dax->align, fault_size);
 		return VM_FAULT_SIGBUS;
 	}
 
-	if (fault_size < dax_region->align)
+	if (fault_size < dev_dax->align)
 		return VM_FAULT_SIGBUS;
-	else if (fault_size > dax_region->align)
+	else if (fault_size > dev_dax->align)
 		return VM_FAULT_FALLBACK;
 
 	/* if we are outside of the VMA */
@@ -267,9 +260,8 @@ static int dev_dax_split(struct vm_area_
 {
 	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
-	struct dax_region *dax_region = dev_dax->region;
 
-	if (!IS_ALIGNED(addr, dax_region->align))
+	if (!IS_ALIGNED(addr, dev_dax->align))
 		return -EINVAL;
 	return 0;
 }
@@ -278,9 +270,8 @@ static unsigned long dev_dax_pagesize(st
 {
 	struct file *filp = vma->vm_file;
 	struct dev_dax *dev_dax = filp->private_data;
-	struct dax_region *dax_region = dev_dax->region;
 
-	return dax_region->align;
+	return dev_dax->align;
 }
 
 static const struct vm_operations_struct dax_vm_ops = {
@@ -319,13 +310,11 @@ static unsigned long dax_get_unmapped_ar
 {
 	unsigned long off, off_end, off_align, len_align, addr_align, align;
 	struct dev_dax *dev_dax = filp ? filp->private_data : NULL;
-	struct dax_region *dax_region;
 
 	if (!dev_dax || addr)
 		goto out;
 
-	dax_region = dev_dax->region;
-	align = dax_region->align;
+	align = dev_dax->align;
 	off = pgoff << PAGE_SHIFT;
 	off_end = off + len;
 	off_align = round_up(off, align);
_

* [patch 049/181] device-dax: add an 'align' attribute
  2020-10-13 23:46 incoming Andrew Morton
                   ` (47 preceding siblings ...)
  2020-10-13 23:50 ` [patch 048/181] device-dax: make align a per-device property Andrew Morton
@ 2020-10-13 23:50 ` Andrew Morton
  2020-10-13 23:51 ` [patch 050/181] dax/hmem: introduce dax_hmem.region_idle parameter Andrew Morton
                   ` (135 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:50 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Dan Williams <dan.j.williams@intel.com>
Subject: device-dax: add an 'align' attribute

Introduce a device align attribute.  While doing so, rename the region
align attribute so that its internal name is explicitly region-scoped, but
keep it exposed as @align to retain the API for tools like daxctl.

Changes to align may not always be valid: say certain mappings were
created with 2M alignment and then we switch to 1G.  So we validate all
ranges against the new value being attempted, post resizing.
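
As a usage illustration only (the device path is assumed), changing the
alignment of an unbound device might look like the sketch below; per the
validation above, the write fails if any existing range would become
misaligned:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path; the device must not have a driver attached. */
	const char *path = "/sys/bus/dax/devices/dax0.0/align";
	const char *val = "2097152";	/* 2M; only page/PMD/PUD sizes pass */
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, val, strlen(val)) < 0)
		perror("align write");	/* e.g. EINVAL or EBUSY */
	close(fd);
	return 0;
}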

Link: https://lkml.kernel.org/r/159643105944.4062302.3131761052969132784.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lore.kernel.org/r/20200716172913.19658-3-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/160106118486.30709.13012322227204800596.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c         |   93 ++++++++++++++++++++++++++++++++----
 drivers/dax/dax-private.h |   18 ++++++
 2 files changed, 101 insertions(+), 10 deletions(-)

--- a/drivers/dax/bus.c~device-dax-add-an-align-attribute
+++ a/drivers/dax/bus.c
@@ -230,14 +230,15 @@ static ssize_t region_size_show(struct d
 static struct device_attribute dev_attr_region_size = __ATTR(size, 0444,
 		region_size_show, NULL);
 
-static ssize_t align_show(struct device *dev,
+static ssize_t region_align_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
 	struct dax_region *dax_region = dev_get_drvdata(dev);
 
 	return sprintf(buf, "%u\n", dax_region->align);
 }
-static DEVICE_ATTR_RO(align);
+static struct device_attribute dev_attr_region_align =
+		__ATTR(align, 0400, region_align_show, NULL);
 
 #define for_each_dax_region_resource(dax_region, res) \
 	for (res = (dax_region)->res.child; res; res = res->sibling)
@@ -488,7 +489,7 @@ static umode_t dax_region_visible(struct
 static struct attribute *dax_region_attributes[] = {
 	&dev_attr_available_size.attr,
 	&dev_attr_region_size.attr,
-	&dev_attr_align.attr,
+	&dev_attr_region_align.attr,
 	&dev_attr_create.attr,
 	&dev_attr_seed.attr,
 	&dev_attr_delete.attr,
@@ -858,15 +859,13 @@ static ssize_t size_show(struct device *
 	return sprintf(buf, "%llu\n", size);
 }
 
-static bool alloc_is_aligned(struct dax_region *dax_region,
-		resource_size_t size)
+static bool alloc_is_aligned(struct dev_dax *dev_dax, resource_size_t size)
 {
 	/*
 	 * The minimum mapping granularity for a device instance is a
 	 * single subsection, unless the arch says otherwise.
 	 */
-	return IS_ALIGNED(size, max_t(unsigned long, dax_region->align,
-				memremap_compat_align()));
+	return IS_ALIGNED(size, max_t(unsigned long, dev_dax->align, memremap_compat_align()));
 }
 
 static int dev_dax_shrink(struct dev_dax *dev_dax, resource_size_t size)
@@ -961,7 +960,7 @@ static ssize_t dev_dax_resize(struct dax
 		return dev_dax_shrink(dev_dax, size);
 
 	to_alloc = size - dev_size;
-	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dax_region, to_alloc),
+	if (dev_WARN_ONCE(dev, !alloc_is_aligned(dev_dax, to_alloc),
 			"resize of %pa misaligned\n", &to_alloc))
 		return -ENXIO;
 
@@ -1025,7 +1024,7 @@ static ssize_t size_store(struct device
 	if (rc)
 		return rc;
 
-	if (!alloc_is_aligned(dax_region, val)) {
+	if (!alloc_is_aligned(dev_dax, val)) {
 		dev_dbg(dev, "%s: size: %lld misaligned\n", __func__, val);
 		return -EINVAL;
 	}
@@ -1044,6 +1043,78 @@ static ssize_t size_store(struct device
 }
 static DEVICE_ATTR_RW(size);
 
+static ssize_t align_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+
+	return sprintf(buf, "%d\n", dev_dax->align);
+}
+
+static ssize_t dev_dax_validate_align(struct dev_dax *dev_dax)
+{
+	resource_size_t dev_size = dev_dax_size(dev_dax);
+	struct device *dev = &dev_dax->dev;
+	int i;
+
+	if (dev_size > 0 && !alloc_is_aligned(dev_dax, dev_size)) {
+		dev_dbg(dev, "%s: align %u invalid for size %pa\n",
+			__func__, dev_dax->align, &dev_size);
+		return -EINVAL;
+	}
+
+	for (i = 0; i < dev_dax->nr_range; i++) {
+		size_t len = range_len(&dev_dax->ranges[i].range);
+
+		if (!alloc_is_aligned(dev_dax, len)) {
+			dev_dbg(dev, "%s: align %u invalid for range %d\n",
+				__func__, dev_dax->align, i);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static ssize_t align_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	unsigned long val, align_save;
+	ssize_t rc;
+
+	rc = kstrtoul(buf, 0, &val);
+	if (rc)
+		return -ENXIO;
+
+	if (!dax_align_valid(val))
+		return -EINVAL;
+
+	device_lock(dax_region->dev);
+	if (!dax_region->dev->driver) {
+		device_unlock(dax_region->dev);
+		return -ENXIO;
+	}
+
+	device_lock(dev);
+	if (dev->driver) {
+		rc = -EBUSY;
+		goto out_unlock;
+	}
+
+	align_save = dev_dax->align;
+	dev_dax->align = val;
+	rc = dev_dax_validate_align(dev_dax);
+	if (rc)
+		dev_dax->align = align_save;
+out_unlock:
+	device_unlock(dev);
+	device_unlock(dax_region->dev);
+	return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_RW(align);
+
 static int dev_dax_target_node(struct dev_dax *dev_dax)
 {
 	struct dax_region *dax_region = dev_dax->region;
@@ -1104,7 +1175,8 @@ static umode_t dev_dax_visible(struct ko
 		return 0;
 	if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
 		return 0;
-	if (a == &dev_attr_size.attr && is_static(dax_region))
+	if ((a == &dev_attr_align.attr ||
+	     a == &dev_attr_size.attr) && is_static(dax_region))
 		return 0444;
 	return a->mode;
 }
@@ -1113,6 +1185,7 @@ static struct attribute *dev_dax_attribu
 	&dev_attr_modalias.attr,
 	&dev_attr_size.attr,
 	&dev_attr_target_node.attr,
+	&dev_attr_align.attr,
 	&dev_attr_resource.attr,
 	&dev_attr_numa_node.attr,
 	NULL,
--- a/drivers/dax/dax-private.h~device-dax-add-an-align-attribute
+++ a/drivers/dax/dax-private.h
@@ -87,4 +87,22 @@ static inline struct dax_mapping *to_dax
 }
 
 phys_addr_t dax_pgoff_to_phys(struct dev_dax *dev_dax, pgoff_t pgoff, unsigned long size);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline bool dax_align_valid(unsigned long align)
+{
+	if (align == PUD_SIZE && IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+		return true;
+	if (align == PMD_SIZE && has_transparent_hugepage())
+		return true;
+	if (align == PAGE_SIZE)
+		return true;
+	return false;
+}
+#else
+static inline bool dax_align_valid(unsigned long align)
+{
+	return align == PAGE_SIZE;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
_

* [patch 050/181] dax/hmem: introduce dax_hmem.region_idle parameter
  2020-10-13 23:46 incoming Andrew Morton
                   ` (48 preceding siblings ...)
  2020-10-13 23:50 ` [patch 049/181] device-dax: add an 'align' attribute Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 051/181] device-dax: add a range mapping allocation attribute Andrew Morton
                   ` (134 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Joao Martins <joao.m.martins@oracle.com>
Subject: dax/hmem: introduce dax_hmem.region_idle parameter

Introduce a new module parameter for dax_hmem which initializes all region
devices as free, rather than allocating a pagemap for the region by
default.

All hmem devices created with dax_hmem.region_idle=1 will have their full
size available for creating dynamic dax devices.
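
As a small illustrative check, the setting can be confirmed at runtime;
the parameter path follows the standard /sys/module layout, and the rest
of the sketch is an assumption for illustration:

#include <stdio.h>

int main(void)
{
	/* Bool module parameters read back as 'Y' or 'N'. */
	FILE *f = fopen("/sys/module/dax_hmem/parameters/region_idle", "r");
	int v;

	if (!f)
		return 1;	/* module not loaded */
	v = fgetc(f);
	fclose(f);
	printf("region_idle: %c\n", v);	/* 'Y' => regions start fully free */
	return v == 'Y' ? 0 : 2;
}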

Link: https://lkml.kernel.org/r/159643106460.4062302.5868522341307530091.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lore.kernel.org/r/20200716172913.19658-4-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/160106119033.30709.11249962152222193448.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/hmem/hmem.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/drivers/dax/hmem/hmem.c~dax-hmem-introduce-dax_hmemregion_idle-parameter
+++ a/drivers/dax/hmem/hmem.c
@@ -5,6 +5,9 @@
 #include <linux/pfn_t.h>
 #include "../bus.h"
 
+static bool region_idle;
+module_param_named(region_idle, region_idle, bool, 0644);
+
 static int dax_hmem_probe(struct platform_device *pdev)
 {
 	struct device *dev = &pdev->dev;
@@ -30,7 +33,7 @@ static int dax_hmem_probe(struct platfor
 	data = (struct dev_dax_data) {
 		.dax_region = dax_region,
 		.id = -1,
-		.size = resource_size(res),
+		.size = region_idle ? 0 : resource_size(res),
 	};
 	dev_dax = devm_create_dev_dax(&data);
 	if (IS_ERR(dev_dax))
_

* [patch 051/181] device-dax: add a range mapping allocation attribute
  2020-10-13 23:46 incoming Andrew Morton
                   ` (49 preceding siblings ...)
  2020-10-13 23:51 ` [patch 050/181] dax/hmem: introduce dax_hmem.region_idle parameter Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 052/181] mm/debug.c: do not dereference i_ino blindly Andrew Morton
                   ` (133 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: airlied, akpm, ard.biesheuvel, ardb, benh, bhelgaas,
	boris.ostrovsky, bp, Brice.Goglin, bskeggs, catalin.marinas,
	dan.j.williams, daniel, dave.hansen, dave.jiang, david, gregkh,
	hpa, hulkci, ira.weiny, jgg, jglisse, jgross, jmoyer,
	joao.m.martins, Jonathan.Cameron, justin.he, linux-mm, lkp, luto,
	mingo, mm-commits, mpe, pasha.tatashin, paulus, peterz,
	rafael.j.wysocki, rdunlap, richard.weiyang, rppt, sstabellini,
	tglx, thomas.lendacky, torvalds, vgoyal, vishal.l.verma, will,
	yanaijie

From: Joao Martins <joao.m.martins@oracle.com>
Subject: device-dax: add a range mapping allocation attribute

Add a sysfs attribute which denotes a range from the dax region to be
allocated.  It's a write-only @mapping sysfs attribute in the format of
'<start>-<end>' to allocate a range.  @start and @end use hexadecimal
values, and the @pgoff is implicitly ordered with respect to previous
writes to @mapping, e.g. after a write of a range of length 1G the pgoff
covers 0..1G(-4K), and a second write will use @pgoff for 1G+4K..<size>
(see the usage sketch after the list below).

This range mapping interface is useful for:

 1) Applications which want to implement their own allocation logic, and
    thus pick the desired ranges from dax_region.

 2) For use cases like VMM fast restart[0] where after kexec we want to
    keep the same gpa<->phys mappings (as originally created before kexec).

[0] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
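
A minimal userspace sketch of the interface described above; the device
path and the address range are placeholders, not values from this patch:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	/* Assumed path; @mapping only exists on dynamic (non-static) regions. */
	const char *path = "/sys/bus/dax/devices/dax0.0/mapping";
	const char *range = "0x100000000-0x13fffffff";	/* example 1G range */
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, range, strlen(range)) < 0)
		perror("mapping write");	/* ENXIO if region is disabled */
	close(fd);
	return 0;
}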

Link: https://lkml.kernel.org/r/159643106970.4062302.10402616567780784722.stgit@dwillia2-desk3.amr.corp.intel.com
Link: https://lore.kernel.org/r/20200716172913.19658-5-joao.m.martins@oracle.com
Link: https://lkml.kernel.org/r/160106119570.30709.4548889722645210610.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Ben Skeggs <bskeggs@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: David Airlie <airlied@linux.ie>
Cc: David Hildenbrand <david@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Hulk Robot <hulkci@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Yan <yanaijie@huawei.com>
Cc: Jeff Moyer <jmoyer@redhat.com>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Jia He <justin.he@arm.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: kernel test robot <lkp@intel.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/dax/bus.c |   64 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 64 insertions(+)

--- a/drivers/dax/bus.c~device-dax-add-a-range-mapping-allocation-attribute
+++ a/drivers/dax/bus.c
@@ -1043,6 +1043,67 @@ static ssize_t size_store(struct device
 }
 static DEVICE_ATTR_RW(size);
 
+static ssize_t range_parse(const char *opt, size_t len, struct range *range)
+{
+	unsigned long long addr = 0;
+	char *start, *end, *str;
+	ssize_t rc = -EINVAL;
+
+	str = kstrdup(opt, GFP_KERNEL);
+	if (!str)
+		return rc;
+
+	end = str;
+	start = strsep(&end, "-");
+	if (!start || !end)
+		goto err;
+
+	rc = kstrtoull(start, 16, &addr);
+	if (rc)
+		goto err;
+	range->start = addr;
+
+	rc = kstrtoull(end, 16, &addr);
+	if (rc)
+		goto err;
+	range->end = addr;
+
+err:
+	kfree(str);
+	return rc;
+}
+
+static ssize_t mapping_store(struct device *dev, struct device_attribute *attr,
+		const char *buf, size_t len)
+{
+	struct dev_dax *dev_dax = to_dev_dax(dev);
+	struct dax_region *dax_region = dev_dax->region;
+	size_t to_alloc;
+	struct range r;
+	ssize_t rc;
+
+	rc = range_parse(buf, len, &r);
+	if (rc)
+		return rc;
+
+	rc = -ENXIO;
+	device_lock(dax_region->dev);
+	if (!dax_region->dev->driver) {
+		device_unlock(dax_region->dev);
+		return rc;
+	}
+	device_lock(dev);
+
+	to_alloc = range_len(&r);
+	if (alloc_is_aligned(dev_dax, to_alloc))
+		rc = alloc_dev_dax_range(dev_dax, r.start, to_alloc);
+	device_unlock(dev);
+	device_unlock(dax_region->dev);
+
+	return rc == 0 ? len : rc;
+}
+static DEVICE_ATTR_WO(mapping);
+
 static ssize_t align_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -1175,6 +1236,8 @@ static umode_t dev_dax_visible(struct ko
 		return 0;
 	if (a == &dev_attr_numa_node.attr && !IS_ENABLED(CONFIG_NUMA))
 		return 0;
+	if (a == &dev_attr_mapping.attr && is_static(dax_region))
+		return 0;
 	if ((a == &dev_attr_align.attr ||
 	     a == &dev_attr_size.attr) && is_static(dax_region))
 		return 0444;
@@ -1184,6 +1247,7 @@ static umode_t dev_dax_visible(struct ko
 static struct attribute *dev_dax_attributes[] = {
 	&dev_attr_modalias.attr,
 	&dev_attr_size.attr,
+	&dev_attr_mapping.attr,
 	&dev_attr_target_node.attr,
 	&dev_attr_align.attr,
 	&dev_attr_resource.attr,
_

* [patch 052/181] mm/debug.c: do not dereference i_ino blindly
  2020-10-13 23:46 incoming Andrew Morton
                   ` (50 preceding siblings ...)
  2020-10-13 23:51 ` [patch 051/181] device-dax: add a range mapping allocation attribute Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 053/181] mm, dump_page: rename head_mapcount() --> head_compound_mapcount() Andrew Morton
                   ` (132 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, jhubbard, linux-mm, mm-commits, rppt, torvalds, vbabka, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/debug.c: do not dereference i_ino blindly

__dump_page() checks i_dentry is fetchable and i_ino is earlier in the
struct than i_dentry, so it ought to work fine, but it's possible that
struct randomisation has reordered i_ino after i_dentry and the pointer is
just wild enough that i_dentry is fetchable and i_ino isn't.

Also print the inode number if the dentry is invalid.
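
For readers unfamiliar with the idiom, a generic kernel-context sketch of
the pattern the fix applies (illustrative, not code from this patch): copy
every field you intend to print through get_kernel_nofault(), so a wild
pointer faults the copy instead of oopsing the dump path.

#include <linux/fs.h>
#include <linux/printk.h>
#include <linux/uaccess.h>

static void dump_host_inode(struct inode *host)
{
	struct hlist_node *dentry_first;
	unsigned long ino;

	/* get_kernel_nofault() returns nonzero if the load would fault */
	if (get_kernel_nofault(dentry_first, &host->i_dentry.first) ||
	    get_kernel_nofault(ino, &host->i_ino)) {
		pr_warn("invalid host inode %px\n", host);
		return;
	}
	pr_warn("ino:%lx\n", ino);
}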

Link: https://lkml.kernel.org/r/20200819185710.28180-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/debug.c |   12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

--- a/mm/debug.c~mm-debug-do-not-dereference-i_ino-blindly
+++ a/mm/debug.c
@@ -120,6 +120,7 @@ void __dump_page(struct page *page, cons
 		struct hlist_node *dentry_first;
 		struct dentry *dentry_ptr;
 		struct dentry dentry;
+		unsigned long ino;
 
 		/*
 		 * mapping can be invalid pointer and we don't want to crash
@@ -136,21 +137,22 @@ void __dump_page(struct page *page, cons
 			goto out_mapping;
 		}
 
-		if (get_kernel_nofault(dentry_first, &host->i_dentry.first)) {
+		if (get_kernel_nofault(dentry_first, &host->i_dentry.first) ||
+		    get_kernel_nofault(ino, &host->i_ino)) {
 			pr_warn("aops:%ps with invalid host inode %px\n",
 					a_ops, host);
 			goto out_mapping;
 		}
 
 		if (!dentry_first) {
-			pr_warn("aops:%ps ino:%lx\n", a_ops, host->i_ino);
+			pr_warn("aops:%ps ino:%lx\n", a_ops, ino);
 			goto out_mapping;
 		}
 
 		dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
 		if (get_kernel_nofault(dentry, dentry_ptr)) {
-			pr_warn("aops:%ps with invalid dentry %px\n", a_ops,
-					dentry_ptr);
+			pr_warn("aops:%ps ino:%lx with invalid dentry %px\n",
+					a_ops, ino, dentry_ptr);
 		} else {
 			/*
 			 * if dentry is corrupted, the %pd handler may still
@@ -158,7 +160,7 @@ void __dump_page(struct page *page, cons
 			 * corrupted struct page
 			 */
 			pr_warn("aops:%ps ino:%lx dentry name:\"%pd\"\n",
-					a_ops, host->i_ino, &dentry);
+					a_ops, ino, &dentry);
 		}
 	}
 out_mapping:
_

* [patch 053/181] mm, dump_page: rename head_mapcount() --> head_compound_mapcount()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (51 preceding siblings ...)
  2020-10-13 23:51 ` [patch 052/181] mm/debug.c: do not dereference i_ino blindly Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 054/181] mm: factor find_get_incore_page out of mincore_page Andrew Morton
                   ` (131 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, cai, jhubbard, kirill.shutemov, linux-mm, mm-commits, rppt,
	torvalds, vbabka, william.kucharski, willy

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm, dump_page: rename head_mapcount() --> head_compound_mapcount()

Rename head_mapcount() --> head_compound_mapcount() and head_pincount()
--> head_compound_pincount().  These names are more accurate (or less
misleading) than the original ones.

Link: https://lkml.kernel.org/r/20200807183358.105097-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    8 ++++----
 mm/debug.c         |    6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

--- a/include/linux/mm.h~mm-dump_page-rename-head_mapcount-head_compound_mapcount
+++ a/include/linux/mm.h
@@ -791,7 +791,7 @@ static inline void *kvcalloc(size_t n, s
 extern void kvfree(const void *addr);
 extern void kvfree_sensitive(const void *addr, size_t len);
 
-static inline int head_mapcount(struct page *head)
+static inline int head_compound_mapcount(struct page *head)
 {
 	return atomic_read(compound_mapcount_ptr(head)) + 1;
 }
@@ -805,7 +805,7 @@ static inline int compound_mapcount(stru
 {
 	VM_BUG_ON_PAGE(!PageCompound(page), page);
 	page = compound_head(page);
-	return head_mapcount(page);
+	return head_compound_mapcount(page);
 }
 
 /*
@@ -918,7 +918,7 @@ static inline bool hpage_pincount_availa
 	return PageCompound(page) && compound_order(page) > 1;
 }
 
-static inline int head_pincount(struct page *head)
+static inline int head_compound_pincount(struct page *head)
 {
 	return atomic_read(compound_pincount_ptr(head));
 }
@@ -927,7 +927,7 @@ static inline int compound_pincount(stru
 {
 	VM_BUG_ON_PAGE(!hpage_pincount_available(page), page);
 	page = compound_head(page);
-	return head_pincount(page);
+	return head_compound_pincount(page);
 }
 
 static inline void set_compound_order(struct page *page, unsigned int order)
--- a/mm/debug.c~mm-dump_page-rename-head_mapcount-head_compound_mapcount
+++ a/mm/debug.c
@@ -102,12 +102,12 @@ void __dump_page(struct page *page, cons
 		if (hpage_pincount_available(page)) {
 			pr_warn("head:%p order:%u compound_mapcount:%d compound_pincount:%d\n",
 					head, compound_order(head),
-					head_mapcount(head),
-					head_pincount(head));
+					head_compound_mapcount(head),
+					head_compound_pincount(head));
 		} else {
 			pr_warn("head:%p order:%u compound_mapcount:%d\n",
 					head, compound_order(head),
-					head_mapcount(head));
+					head_compound_mapcount(head));
 		}
 	}
 	if (PageKsm(page))
_

* [patch 054/181] mm: factor find_get_incore_page out of mincore_page
  2020-10-13 23:46 incoming Andrew Morton
                   ` (52 preceding siblings ...)
  2020-10-13 23:51 ` [patch 053/181] mm, dump_page: rename head_mapcount() --> head_compound_mapcount() Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 055/181] mm: use find_get_incore_page in memcontrol Andrew Morton
                   ` (130 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: factor find_get_incore_page out of mincore_page

Patch series "Return head pages from find_*_entry", v2.

This patch series started out as part of the THP patch set, but it has
some nice effects along the way and it seems worth splitting it out and
submitting separately.

Currently find_get_entry() and find_lock_entry() return the page
corresponding to the requested index, but the first thing most callers do
is find the head page, which we just threw away.  As part of auditing all
the callers, I found some misuses of the APIs and some plain
inefficiencies that I've fixed.

The diffstat is unflattering, but I added more kernel-doc and a new wrapper.


This patch (of 8):

Provide this functionality from the swap cache.  It's useful for
more than just mincore().
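
A sketch of the resulting calling convention (kernel context), mirroring
the mincore_page() caller in the diff below:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/swap.h>

/* Ask whether an index is resident in the page cache or, for shmem
 * mappings, in the swap cache. */
static bool index_is_incore(struct address_space *mapping, pgoff_t index)
{
	struct page *page = find_get_incore_page(mapping, index);
	bool present = false;

	if (page) {
		present = PageUptodate(page);
		put_page(page);
	}
	return present;
}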

Link: https://lkml.kernel.org/r/20200910183318.20139-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20200910183318.20139-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    7 +++++++
 mm/mincore.c         |   28 ++--------------------------
 mm/swap_state.c      |   32 ++++++++++++++++++++++++++++++++
 3 files changed, 41 insertions(+), 26 deletions(-)

--- a/include/linux/swap.h~mm-factor-find_get_incore_page-out-of-mincore_page
+++ a/include/linux/swap.h
@@ -427,6 +427,7 @@ extern void free_pages_and_swap_cache(st
 extern struct page *lookup_swap_cache(swp_entry_t entry,
 				      struct vm_area_struct *vma,
 				      unsigned long addr);
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool do_poll);
@@ -570,6 +571,12 @@ static inline struct page *lookup_swap_c
 	return NULL;
 }
 
+static inline
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
+{
+	return find_get_page(mapping, index);
+}
+
 static inline int add_to_swap(struct page *page)
 {
 	return 0;
--- a/mm/mincore.c~mm-factor-find_get_incore_page-out-of-mincore_page
+++ a/mm/mincore.c
@@ -48,7 +48,7 @@ static int mincore_hugetlb(pte_t *pte, u
  * and is up to date; i.e. that no page-in operation would be required
  * at this time if an application were to map and access this page.
  */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
+static unsigned char mincore_page(struct address_space *mapping, pgoff_t index)
 {
 	unsigned char present = 0;
 	struct page *page;
@@ -59,31 +59,7 @@ static unsigned char mincore_page(struct
 	 * any other file mapping (ie. marked !present and faulted in with
 	 * tmpfs's .fault). So swapped out tmpfs mappings are tested here.
 	 */
-#ifdef CONFIG_SWAP
-	if (shmem_mapping(mapping)) {
-		page = find_get_entry(mapping, pgoff);
-		/*
-		 * shmem/tmpfs may return swap: account for swapcache
-		 * page too.
-		 */
-		if (xa_is_value(page)) {
-			swp_entry_t swp = radix_to_swp_entry(page);
-			struct swap_info_struct *si;
-
-			/* Prevent swap device to being swapoff under us */
-			si = get_swap_device(swp);
-			if (si) {
-				page = find_get_page(swap_address_space(swp),
-						     swp_offset(swp));
-				put_swap_device(si);
-			} else
-				page = NULL;
-		}
-	} else
-		page = find_get_page(mapping, pgoff);
-#else
-	page = find_get_page(mapping, pgoff);
-#endif
+	page = find_get_incore_page(mapping, index);
 	if (page) {
 		present = PageUptodate(page);
 		put_page(page);
--- a/mm/swap_state.c~mm-factor-find_get_incore_page-out-of-mincore_page
+++ a/mm/swap_state.c
@@ -21,6 +21,7 @@
 #include <linux/vmalloc.h>
 #include <linux/swap_slots.h>
 #include <linux/huge_mm.h>
+#include <linux/shmem_fs.h>
 #include "internal.h"
 
 /*
@@ -414,6 +415,37 @@ struct page *lookup_swap_cache(swp_entry
 	return page;
 }
 
+/**
+ * find_get_incore_page - Find and get a page from the page or swap caches.
+ * @mapping: The address_space to search.
+ * @index: The page cache index.
+ *
+ * This differs from find_get_page() in that it will also look for the
+ * page in the swap cache.
+ *
+ * Return: The found page or %NULL.
+ */
+struct page *find_get_incore_page(struct address_space *mapping, pgoff_t index)
+{
+	swp_entry_t swp;
+	struct swap_info_struct *si;
+	struct page *page = find_get_entry(mapping, index);
+
+	if (!xa_is_value(page))
+		return page;
+	if (!shmem_mapping(mapping))
+		return NULL;
+
+	swp = radix_to_swp_entry(page);
+	/* Prevent swapoff from happening to us */
+	si = get_swap_device(swp);
+	if (!si)
+		return NULL;
+	page = find_get_page(swap_address_space(swp), swp_offset(swp));
+	put_swap_device(si);
+	return page;
+}
+
 struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			struct vm_area_struct *vma, unsigned long addr,
 			bool *new_page_allocated)
_

* [patch 055/181] mm: use find_get_incore_page in memcontrol
  2020-10-13 23:46 incoming Andrew Morton
                   ` (53 preceding siblings ...)
  2020-10-13 23:51 ` [patch 054/181] mm: factor find_get_incore_page out of mincore_page Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 056/181] mm: optimise madvise WILLNEED Andrew Morton
                   ` (129 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: use find_get_incore_page in memcontrol

The current code does not protect against swapoff of the underlying
swap device, so this is a bug fix as well as a worthwhile reduction in
code complexity.

Link: https://lkml.kernel.org/r/20200910183318.20139-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   24 ++----------------------
 1 file changed, 2 insertions(+), 22 deletions(-)

--- a/mm/memcontrol.c~mm-use-find_get_incore_page-in-memcontrol
+++ a/mm/memcontrol.c
@@ -5539,35 +5539,15 @@ static struct page *mc_handle_swap_pte(s
 static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 			unsigned long addr, pte_t ptent, swp_entry_t *entry)
 {
-	struct page *page = NULL;
-	struct address_space *mapping;
-	pgoff_t pgoff;
-
 	if (!vma->vm_file) /* anonymous vma */
 		return NULL;
 	if (!(mc.flags & MOVE_FILE))
 		return NULL;
 
-	mapping = vma->vm_file->f_mapping;
-	pgoff = linear_page_index(vma, addr);
-
 	/* page is moved even if it's not RSS of this task(page-faulted). */
-#ifdef CONFIG_SWAP
 	/* shmem/tmpfs may report page out on swap: account for that too. */
-	if (shmem_mapping(mapping)) {
-		page = find_get_entry(mapping, pgoff);
-		if (xa_is_value(page)) {
-			swp_entry_t swp = radix_to_swp_entry(page);
-			*entry = swp;
-			page = find_get_page(swap_address_space(swp),
-					     swp_offset(swp));
-		}
-	} else
-		page = find_get_page(mapping, pgoff);
-#else
-	page = find_get_page(mapping, pgoff);
-#endif
-	return page;
+	return find_get_incore_page(vma->vm_file->f_mapping,
+			linear_page_index(vma, addr));
 }
 
 /**
_

* [patch 056/181] mm: optimise madvise WILLNEED
  2020-10-13 23:46 incoming Andrew Morton
                   ` (54 preceding siblings ...)
  2020-10-13 23:51 ` [patch 055/181] mm: use find_get_incore_page in memcontrol Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 057/181] proc: optimise smaps for shmem entries Andrew Morton
                   ` (128 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, cai, chris, hannes, hughd, jani.nikula,
	linux-mm, matthew.auld, mm-commits, torvalds, william.kucharski,
	willy, ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: optimise madvise WILLNEED

Instead of calling find_get_entry() for every page index, use an XArray
iterator to skip over NULL entries, and avoid calling get_page(),
because we only want the swap entries.
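
For readers new to the XArray API, a generic kernel-context sketch of the
iterate/pause/resume pattern the patch relies on (illustrative only):

#include <linux/xarray.h>

/* Walk entries under RCU; drop the lock around sleeping work and let
 * xas_pause() arrange a safe restart on the next iteration. */
static void walk_value_entries(struct xarray *xa, unsigned long last)
{
	XA_STATE(xas, xa, 0);
	void *entry;

	rcu_read_lock();
	xas_for_each(&xas, entry, last) {
		if (!xa_is_value(entry))
			continue;
		xas_pause(&xas);
		rcu_read_unlock();

		/* ... sleeping work on the value entry ... */

		rcu_read_lock();
	}
	rcu_read_unlock();
}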

[willy@infradead.org: fix LTP soft lockups]
  Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org
Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Cc: Qian Cai <cai@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/madvise.c |   21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

--- a/mm/madvise.c~mm-optimise-madvise-willneed
+++ a/mm/madvise.c
@@ -224,25 +224,28 @@ static void force_shm_swapin_readahead(s
 		unsigned long start, unsigned long end,
 		struct address_space *mapping)
 {
-	pgoff_t index;
+	XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
+	pgoff_t end_index = end / PAGE_SIZE;
 	struct page *page;
-	swp_entry_t swap;
 
-	for (; start < end; start += PAGE_SIZE) {
-		index = ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+	rcu_read_lock();
+	xas_for_each(&xas, page, end_index) {
+		swp_entry_t swap;
 
-		page = find_get_entry(mapping, index);
-		if (!xa_is_value(page)) {
-			if (page)
-				put_page(page);
+		if (!xa_is_value(page))
 			continue;
-		}
+		xas_pause(&xas);
+		rcu_read_unlock();
+
 		swap = radix_to_swp_entry(page);
 		page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
 							NULL, 0, false);
 		if (page)
 			put_page(page);
+
+		rcu_read_lock();
 	}
+	rcu_read_unlock();
 
 	lru_add_drain();	/* Push any new pages onto the LRU now */
 }
_

* [patch 057/181] proc: optimise smaps for shmem entries
  2020-10-13 23:46 incoming Andrew Morton
                   ` (55 preceding siblings ...)
  2020-10-13 23:51 ` [patch 056/181] mm: optimise madvise WILLNEED Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 058/181] i915: use find_lock_page instead of find_lock_entry Andrew Morton
                   ` (127 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: proc: optimise smaps for shmem entries

Avoid bumping the refcount on pages when we're only interested in the
swap entries.

Link: https://lkml.kernel.org/r/20200910183318.20139-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/fs/proc/task_mmu.c~proc-optimise-smaps-for-shmem-entries
+++ a/fs/proc/task_mmu.c
@@ -520,16 +520,10 @@ static void smaps_pte_entry(pte_t *pte,
 			page = device_private_entry_to_page(swpent);
 	} else if (unlikely(IS_ENABLED(CONFIG_SHMEM) && mss->check_shmem_swap
 							&& pte_none(*pte))) {
-		page = find_get_entry(vma->vm_file->f_mapping,
+		page = xa_load(&vma->vm_file->f_mapping->i_pages,
 						linear_page_index(vma, addr));
-		if (!page)
-			return;
-
 		if (xa_is_value(page))
 			mss->swap += PAGE_SIZE;
-		else
-			put_page(page);
-
 		return;
 	}
 
_

* [patch 058/181] i915: use find_lock_page instead of find_lock_entry
  2020-10-13 23:46 incoming Andrew Morton
                   ` (56 preceding siblings ...)
  2020-10-13 23:51 ` [patch 057/181] proc: optimise smaps for shmem entries Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 059/181] mm: convert find_get_entry to return the head page Andrew Morton
                   ` (126 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: i915: use find_lock_page instead of find_lock_entry

i915 does not want to see value entries.  Switch it to use
find_lock_page() instead, and remove the export of find_lock_entry().
Move find_lock_entry() and find_get_entry() to mm/internal.h to discourage
any future use.
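
As a rough sketch of the call-site pattern (mirroring the hunk below),
find_lock_page() already filters out value entries, so a caller that only
cares about real pages needs no xa_is_value() check:

        struct page *page;

        page = find_lock_page(mapping, index); /* NULL if nothing cached */
        if (page) {
                /* ... operate on the locked page ... */
                unlock_page(page);
                put_page(page);
        }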

Link: https://lkml.kernel.org/r/20200910183318.20139-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/gpu/drm/i915/gem/i915_gem_shmem.c |    4 ++--
 include/linux/pagemap.h                   |    2 --
 mm/filemap.c                              |    1 -
 mm/internal.h                             |    3 +++
 4 files changed, 5 insertions(+), 5 deletions(-)

--- a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c~i915-use-find_lock_page-instead-of-find_lock_entry
+++ a/drivers/gpu/drm/i915/gem/i915_gem_shmem.c
@@ -258,8 +258,8 @@ shmem_writeback(struct drm_i915_gem_obje
 	for (i = 0; i < obj->base.size >> PAGE_SHIFT; i++) {
 		struct page *page;
 
-		page = find_lock_entry(mapping, i);
-		if (!page || xa_is_value(page))
+		page = find_lock_page(mapping, i);
+		if (!page)
 			continue;
 
 		if (!page_mapped(page) && clear_page_dirty_for_io(page)) {
--- a/include/linux/pagemap.h~i915-use-find_lock_page-instead-of-find_lock_entry
+++ a/include/linux/pagemap.h
@@ -385,8 +385,6 @@ static inline struct page *find_subpage(
 	return head + (index & (thp_nr_pages(head) - 1));
 }
 
-struct page *find_get_entry(struct address_space *mapping, pgoff_t offset);
-struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset);
 unsigned find_get_entries(struct address_space *mapping, pgoff_t start,
 			  unsigned int nr_entries, struct page **entries,
 			  pgoff_t *indices);
--- a/mm/filemap.c~i915-use-find_lock_page-instead-of-find_lock_entry
+++ a/mm/filemap.c
@@ -1726,7 +1726,6 @@ repeat:
 	}
 	return page;
 }
-EXPORT_SYMBOL(find_lock_entry);
 
 /**
  * pagecache_get_page - Find and get a reference to a page.
--- a/mm/internal.h~i915-use-find_lock_page-instead-of-find_lock_entry
+++ a/mm/internal.h
@@ -65,6 +65,9 @@ static inline void ra_submit(struct file
 			ra->start, ra->size, ra->async_size);
 }
 
+struct page *find_get_entry(struct address_space *mapping, pgoff_t index);
+struct page *find_lock_entry(struct address_space *mapping, pgoff_t index);
+
 /**
  * page_evictable - test whether a page is evictable
  * @page: the page to test
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 059/181] mm: convert find_get_entry to return the head page
  2020-10-13 23:46 incoming Andrew Morton
                   ` (57 preceding siblings ...)
  2020-10-13 23:51 ` [patch 058/181] i915: use find_lock_page instead of find_lock_entry Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 060/181] mm/shmem: return head page from find_lock_entry Andrew Morton
                   ` (125 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: convert find_get_entry to return the head page

There are only four remaining callers of find_get_entry().
get_shadow_from_swap_cache() only wants to see shadow entries and doesn't
care about which page is returned.  Push the find_subpage() call into
find_lock_entry(), find_get_incore_page() and pagecache_get_page().
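
A sketch of the new calling convention (handle_value_entry() is a
hypothetical placeholder, not a real function): find_get_entry() now
hands back the head page or a value entry, and callers wanting the
precise subpage convert it themselves:

        page = find_get_entry(mapping, index);
        if (!page)
                return NULL;
        if (xa_is_value(page))          /* shadow or swap entry */
                return handle_value_entry(page);
        /* A head page was returned; convert to the subpage for @index. */
        return find_subpage(page, index);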

[willy@infradead.org: fix oops]
  Link: https://lkml.kernel.org/r/20200914112738.GM6583@casper.infradead.org
Link: https://lkml.kernel.org/r/20200910183318.20139-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c    |   13 +++++++------
 mm/swap_state.c |    4 +++-
 2 files changed, 10 insertions(+), 7 deletions(-)

--- a/mm/filemap.c~mm-convert-find_get_entry-to-return-the-head-page
+++ a/mm/filemap.c
@@ -1645,19 +1645,19 @@ EXPORT_SYMBOL(page_cache_prev_miss);
 /**
  * find_get_entry - find and get a page cache entry
  * @mapping: the address_space to search
- * @offset: the page cache index
+ * @index: The page cache index.
  *
  * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned with an increased refcount.
+ * page cache page, the head page is returned with an increased refcount.
  *
  * If the slot holds a shadow entry of a previously evicted page, or a
  * swap entry from shmem/tmpfs, it is returned.
  *
- * Return: the found page or shadow entry, %NULL if nothing is found.
+ * Return: The head page or shadow entry, %NULL if nothing is found.
  */
-struct page *find_get_entry(struct address_space *mapping, pgoff_t offset)
+struct page *find_get_entry(struct address_space *mapping, pgoff_t index)
 {
-	XA_STATE(xas, &mapping->i_pages, offset);
+	XA_STATE(xas, &mapping->i_pages, index);
 	struct page *page;
 
 	rcu_read_lock();
@@ -1685,7 +1685,6 @@ repeat:
 		put_page(page);
 		goto repeat;
 	}
-	page = find_subpage(page, offset);
 out:
 	rcu_read_unlock();
 
@@ -1722,6 +1721,7 @@ repeat:
 			put_page(page);
 			goto repeat;
 		}
+		page = find_subpage(page, offset);
 		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
 	}
 	return page;
@@ -1768,6 +1768,7 @@ repeat:
 		page = NULL;
 	if (!page)
 		goto no_page;
+	page = find_subpage(page, index);
 
 	if (fgp_flags & FGP_LOCK) {
 		if (fgp_flags & FGP_NOWAIT) {
--- a/mm/swap_state.c~mm-convert-find_get_entry-to-return-the-head-page
+++ a/mm/swap_state.c
@@ -431,8 +431,10 @@ struct page *find_get_incore_page(struct
 	struct swap_info_struct *si;
 	struct page *page = find_get_entry(mapping, index);
 
-	if (!xa_is_value(page))
+	if (!page)
 		return page;
+	if (!xa_is_value(page))
+		return find_subpage(page, index);
 	if (!shmem_mapping(mapping))
 		return NULL;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 060/181] mm/shmem: return head page from find_lock_entry
  2020-10-13 23:46 incoming Andrew Morton
                   ` (58 preceding siblings ...)
  2020-10-13 23:51 ` [patch 059/181] mm: convert find_get_entry to return the head page Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 061/181] mm: add find_lock_head Andrew Morton
                   ` (124 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/shmem: return head page from find_lock_entry

Convert shmem_getpage_gfp() (the only remaining caller of
find_lock_entry()) to cope with a head page being returned instead of
the subpage for the index.
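
For illustration, the arithmetic that shmem_getpage_gfp() now open-codes
can be sketched as a hypothetical helper (not part of the patch):

        /* Given a head page covering @index, return the subpage the
         * caller actually asked for. */
        static struct page *shmem_subpage(struct page *head, pgoff_t index)
        {
                pgoff_t hindex = head->index;   /* index of the head page */

                return head + (index - hindex);
        }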

[willy@infradead.org: fix BUG()s]
  Link: https://lore.kernel.org/linux-mm/20200912032042.GA6583@casper.infradead.org/
Link: https://lkml.kernel.org/r/20200910183318.20139-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pagemap.h |    9 +++++++++
 mm/filemap.c            |   25 +++++++++++--------------
 mm/shmem.c              |   19 ++++++++++---------
 3 files changed, 30 insertions(+), 23 deletions(-)

--- a/include/linux/pagemap.h~mm-shmem-return-head-page-from-find_lock_entry
+++ a/include/linux/pagemap.h
@@ -372,6 +372,15 @@ static inline struct page *grab_cache_pa
 			mapping_gfp_mask(mapping));
 }
 
+/* Does this page contain this index? */
+static inline bool thp_contains(struct page *head, pgoff_t index)
+{
+	/* HugeTLBfs indexes the page cache in units of hpage_size */
+	if (PageHuge(head))
+		return head->index == index;
+	return page_index(head) == (index & ~(thp_nr_pages(head) - 1UL));
+}
+
 /*
  * Given the page we found in the page cache, return the page corresponding
  * to this index in the file
--- a/mm/filemap.c~mm-shmem-return-head-page-from-find_lock_entry
+++ a/mm/filemap.c
@@ -1692,37 +1692,34 @@ out:
 }
 
 /**
- * find_lock_entry - locate, pin and lock a page cache entry
- * @mapping: the address_space to search
- * @offset: the page cache index
+ * find_lock_entry - Locate and lock a page cache entry.
+ * @mapping: The address_space to search.
+ * @index: The page cache index.
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
- * page cache page, it is returned locked and with an increased
- * refcount.
+ * Looks up the page at @mapping & @index.  If there is a page in the
+ * cache, the head page is returned locked and with an increased refcount.
  *
  * If the slot holds a shadow entry of a previously evicted page, or a
  * swap entry from shmem/tmpfs, it is returned.
  *
- * find_lock_entry() may sleep.
- *
- * Return: the found page or shadow entry, %NULL if nothing is found.
+ * Context: May sleep.
+ * Return: The head page or shadow entry, %NULL if nothing is found.
  */
-struct page *find_lock_entry(struct address_space *mapping, pgoff_t offset)
+struct page *find_lock_entry(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page;
 
 repeat:
-	page = find_get_entry(mapping, offset);
+	page = find_get_entry(mapping, index);
 	if (page && !xa_is_value(page)) {
 		lock_page(page);
 		/* Has the page been truncated? */
-		if (unlikely(page_mapping(page) != mapping)) {
+		if (unlikely(page->mapping != mapping)) {
 			unlock_page(page);
 			put_page(page);
 			goto repeat;
 		}
-		page = find_subpage(page, offset);
-		VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page);
+		VM_BUG_ON_PAGE(!thp_contains(page, index), page);
 	}
 	return page;
 }
--- a/mm/shmem.c~mm-shmem-return-head-page-from-find_lock_entry
+++ a/mm/shmem.c
@@ -1830,6 +1830,8 @@ repeat:
 		return error;
 	}
 
+	if (page)
+		hindex = page->index;
 	if (page && sgp == SGP_WRITE)
 		mark_page_accessed(page);
 
@@ -1840,11 +1842,10 @@ repeat:
 		unlock_page(page);
 		put_page(page);
 		page = NULL;
+		hindex = index;
 	}
-	if (page || sgp == SGP_READ) {
-		*pagep = page;
-		return 0;
-	}
+	if (page || sgp == SGP_READ)
+		goto out;
 
 	/*
 	 * Fast cache lookup did not find it:
@@ -1969,14 +1970,13 @@ clear:
 	 * it now, lest undo on failure cancel our earlier guarantee.
 	 */
 	if (sgp != SGP_WRITE && !PageUptodate(page)) {
-		struct page *head = compound_head(page);
 		int i;
 
-		for (i = 0; i < compound_nr(head); i++) {
-			clear_highpage(head + i);
-			flush_dcache_page(head + i);
+		for (i = 0; i < compound_nr(page); i++) {
+			clear_highpage(page + i);
+			flush_dcache_page(page + i);
 		}
-		SetPageUptodate(head);
+		SetPageUptodate(page);
 	}
 
 	/* Perhaps the file has been truncated since we checked */
@@ -1992,6 +1992,7 @@ clear:
 		error = -EINVAL;
 		goto unlock;
 	}
+out:
 	*pagep = page + index - hindex;
 	return 0;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 061/181] mm: add find_lock_head
  2020-10-13 23:46 incoming Andrew Morton
                   ` (59 preceding siblings ...)
  2020-10-13 23:51 ` [patch 060/181] mm/shmem: return head page from find_lock_entry Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 062/181] mm/filemap: fix filemap_map_pages for THP Andrew Morton
                   ` (123 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: adobriyan, akpm, chris, hannes, hughd, jani.nikula, linux-mm,
	matthew.auld, mm-commits, torvalds, william.kucharski, willy,
	ying.huang

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: add find_lock_head

Add a new FGP_HEAD flag which avoids calling find_subpage() and add a
convenience wrapper for it.
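
A minimal usage sketch (assuming the caller wants the locked head page
for a whole-THP operation):

        struct page *head = find_lock_head(mapping, index);

        if (head) {
                VM_BUG_ON_PAGE(PageTail(head), head);
                /* ... operate on the compound page ... */
                unlock_page(head);
                put_page(head);
        }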

Link: https://lkml.kernel.org/r/20200910183318.20139-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/pagemap.h |   32 ++++++++++++++++++++++++++------
 mm/filemap.c            |    9 ++++++---
 2 files changed, 32 insertions(+), 9 deletions(-)

--- a/include/linux/pagemap.h~mm-add-find_lock_head
+++ a/include/linux/pagemap.h
@@ -279,6 +279,7 @@ pgoff_t page_cache_prev_miss(struct addr
 #define FGP_NOFS		0x00000010
 #define FGP_NOWAIT		0x00000020
 #define FGP_FOR_MMAP		0x00000040
+#define FGP_HEAD		0x00000080
 
 struct page *pagecache_get_page(struct address_space *mapping, pgoff_t offset,
 		int fgp_flags, gfp_t cache_gfp_mask);
@@ -310,18 +311,37 @@ static inline struct page *find_get_page
  * @mapping: the address_space to search
  * @offset: the page index
  *
- * Looks up the page cache slot at @mapping & @offset.  If there is a
+ * Looks up the page cache entry at @mapping & @offset.  If there is a
  * page cache page, it is returned locked and with an increased
  * refcount.
  *
- * Otherwise, %NULL is returned.
- *
- * find_lock_page() may sleep.
+ * Context: May sleep.
+ * Return: A struct page or %NULL if there is no page in the cache for this
+ * index.
  */
 static inline struct page *find_lock_page(struct address_space *mapping,
-					pgoff_t offset)
+					pgoff_t index)
+{
+	return pagecache_get_page(mapping, index, FGP_LOCK, 0);
+}
+
+/**
+ * find_lock_head - Locate, pin and lock a pagecache page.
+ * @mapping: The address_space to search.
+ * @offset: The page index.
+ *
+ * Looks up the page cache entry at @mapping & @offset.  If there is a
+ * page cache page, its head page is returned locked and with an increased
+ * refcount.
+ *
+ * Context: May sleep.
+ * Return: A struct page which is !PageTail, or %NULL if there is no page
+ * in the cache for this index.
+ */
+static inline struct page *find_lock_head(struct address_space *mapping,
+					pgoff_t index)
 {
-	return pagecache_get_page(mapping, offset, FGP_LOCK, 0);
+	return pagecache_get_page(mapping, index, FGP_LOCK | FGP_HEAD, 0);
 }
 
 /**
--- a/mm/filemap.c~mm-add-find_lock_head
+++ a/mm/filemap.c
@@ -1737,6 +1737,8 @@ repeat:
  *
  * * %FGP_ACCESSED - The page will be marked accessed.
  * * %FGP_LOCK - The page is returned locked.
+ * * %FGP_HEAD - If the page is present and a THP, return the head page
+ *   rather than the exact page specified by the index.
  * * %FGP_CREAT - If no page is present then a new page is allocated using
  *   @gfp_mask and added to the page cache and the VM's LRU list.
  *   The page is returned locked and with an increased refcount.
@@ -1765,7 +1767,6 @@ repeat:
 		page = NULL;
 	if (!page)
 		goto no_page;
-	page = find_subpage(page, index);
 
 	if (fgp_flags & FGP_LOCK) {
 		if (fgp_flags & FGP_NOWAIT) {
@@ -1778,12 +1779,12 @@ repeat:
 		}
 
 		/* Has the page been truncated? */
-		if (unlikely(compound_head(page)->mapping != mapping)) {
+		if (unlikely(page->mapping != mapping)) {
 			unlock_page(page);
 			put_page(page);
 			goto repeat;
 		}
-		VM_BUG_ON_PAGE(page->index != index, page);
+		VM_BUG_ON_PAGE(!thp_contains(page, index), page);
 	}
 
 	if (fgp_flags & FGP_ACCESSED)
@@ -1793,6 +1794,8 @@ repeat:
 		if (page_is_idle(page))
 			clear_page_idle(page);
 	}
+	if (!(fgp_flags & FGP_HEAD))
+		page = find_subpage(page, index);
 
 no_page:
 	if (!page && (fgp_flags & FGP_CREAT)) {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 062/181] mm/filemap: fix filemap_map_pages for THP
  2020-10-13 23:46 incoming Andrew Morton
                   ` (60 preceding siblings ...)
  2020-10-13 23:51 ` [patch 061/181] mm: add find_lock_head Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 063/181] mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED Andrew Morton
                   ` (122 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, kirill.shutemov, linux-mm, mm-commits, torvalds,
	william.kucharski, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/filemap: fix filemap_map_pages for THP

We dereference page->mapping and page->index directly after calling
find_subpage() and these fields are not valid for tail pages.  While
commit 4101196b19d7 ("mm: page cache: store only head pages in i_pages")
introduced the call to find_subpage(), the problem existed prior to this;
I'm going to suggest all the way back to when THPs first existed.

The user-visible effects of this are almost negligible.  To hit it, you
have to mmap a tmpfs file at an unaligned address, and even then the only
effect is a disabled optimisation, causing page faults to happen more
frequently than they otherwise would.

Fix this by keeping both head and page pointers and checking the
appropriate one.  We could use page_mapping() and page_to_index(), but
that's higher overhead.
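
In sketch form, the fix is to test each field against the pointer for
which it is valid: ->mapping and ->index are meaningful only on the head
page, while flags such as PG_readahead live on the subpage:

        head = xas_reload(&xas);                /* head page from the cache */
        page = find_subpage(head, xas.xa_index);
        if (head->mapping != mapping)           /* head-only field */
                goto unlock;
        if (PageReadahead(page))                /* subpage flag */
                goto skip;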

Link: https://lkml.kernel.org/r/20200911012532.24761-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/filemap.c |   30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

--- a/mm/filemap.c~mm-filemap-fix-filemap_map_pages-for-thp
+++ a/mm/filemap.c
@@ -2793,42 +2793,42 @@ void filemap_map_pages(struct vm_fault *
 	pgoff_t last_pgoff = start_pgoff;
 	unsigned long max_idx;
 	XA_STATE(xas, &mapping->i_pages, start_pgoff);
-	struct page *page;
+	struct page *head, *page;
 	unsigned int mmap_miss = READ_ONCE(file->f_ra.mmap_miss);
 
 	rcu_read_lock();
-	xas_for_each(&xas, page, end_pgoff) {
-		if (xas_retry(&xas, page))
+	xas_for_each(&xas, head, end_pgoff) {
+		if (xas_retry(&xas, head))
 			continue;
-		if (xa_is_value(page))
+		if (xa_is_value(head))
 			goto next;
 
 		/*
 		 * Check for a locked page first, as a speculative
 		 * reference may adversely influence page migration.
 		 */
-		if (PageLocked(page))
+		if (PageLocked(head))
 			goto next;
-		if (!page_cache_get_speculative(page))
+		if (!page_cache_get_speculative(head))
 			goto next;
 
 		/* Has the page moved or been split? */
-		if (unlikely(page != xas_reload(&xas)))
+		if (unlikely(head != xas_reload(&xas)))
 			goto skip;
-		page = find_subpage(page, xas.xa_index);
+		page = find_subpage(head, xas.xa_index);
 
-		if (!PageUptodate(page) ||
+		if (!PageUptodate(head) ||
 				PageReadahead(page) ||
 				PageHWPoison(page))
 			goto skip;
-		if (!trylock_page(page))
+		if (!trylock_page(head))
 			goto skip;
 
-		if (page->mapping != mapping || !PageUptodate(page))
+		if (head->mapping != mapping || !PageUptodate(head))
 			goto unlock;
 
 		max_idx = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
-		if (page->index >= max_idx)
+		if (xas.xa_index >= max_idx)
 			goto unlock;
 
 		if (mmap_miss > 0)
@@ -2840,12 +2840,12 @@ void filemap_map_pages(struct vm_fault *
 		last_pgoff = xas.xa_index;
 		if (alloc_set_pte(vmf, page))
 			goto unlock;
-		unlock_page(page);
+		unlock_page(head);
 		goto next;
 unlock:
-		unlock_page(page);
+		unlock_page(head);
 skip:
-		put_page(page);
+		put_page(head);
 next:
 		/* Huge page is mapped? No need to proceed. */
 		if (pmd_trans_huge(*vmf->pmd))
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 063/181] mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED
  2020-10-13 23:46 incoming Andrew Morton
                   ` (61 preceding siblings ...)
  2020-10-13 23:51 ` [patch 062/181] mm/filemap: fix filemap_map_pages for THP Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 064/181] mm/gup_benchmark: update the documentation in Kconfig Andrew Morton
                   ` (121 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, hannes, laoar.shao, linux-mm, mgorman, mm-commits, torvalds

From: Yafang Shao <laoar.shao@gmail.com>
Subject: mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED

Our users reported random latency spikes while their RT process was
running.  We eventually found that the spikes were caused by
FADV_DONTNEED, which may call lru_add_drain_all() to drain the LRU cache
on remote CPUs and then wait for the per-cpu work to complete.  The wait
time is uncertain and can reach tens of milliseconds.

That behavior is unreasonable: the process is bound to a specific CPU and
the file is accessed only by that process, IOW, there should be no
pagecache pages on a per-cpu pagevec of a remote CPU.  The unreasonable
behavior is partially caused by a wrong comparison between the number of
pages actually invalidated and the number of target pages.  For example,

        if (count < (end_index - start_index + 1))

The count above is how many pages were invalidated on the local CPU, and
(end_index - start_index + 1) is how many pages should be invalidated.
Using (end_index - start_index + 1) is incorrect, because start and end
are file offsets that may not all be backed by pages, and there may be
holes between them.  So it is better to check whether there are still
pages on a per-cpu pagevec after draining the local CPU, and then decide
whether or not to call lru_add_drain_all().
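
In outline, the reworked flow (a sketch of the logic in the patch below)
drives the expensive global drain off an output count of pages that were
seen but are likely still sitting on a pagevec:

        unsigned long nr_pagevec = 0;

        lru_add_drain();        /* drain this CPU only */
        invalidate_mapping_pagevec(mapping, start_index, end_index,
                                   &nr_pagevec);
        /* Pay for the global drain and retry only if some pages were
         * skipped because they are likely on a remote CPU's pagevec. */
        if (nr_pagevec) {
                lru_add_drain_all();
                invalidate_mapping_pages(mapping, start_index, end_index);
        }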

After applying this as a hotfix to our production environment, most of
the lru_add_drain_all() calls were avoided.

Link: https://lkml.kernel.org/r/20200923133318.14373-1-laoar.shao@gmail.com
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Mel Gorman <mgorman@suse.de>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/fs.h |    4 ++
 mm/fadvise.c       |    9 +++---
 mm/truncate.c      |   58 +++++++++++++++++++++++++++++--------------
 3 files changed, 49 insertions(+), 22 deletions(-)

--- a/include/linux/fs.h~mm-fadvise-improve-the-expensive-remote-lru-cache-draining-after-fadv_dontneed
+++ a/include/linux/fs.h
@@ -2581,6 +2581,10 @@ extern bool is_bad_inode(struct inode *)
 unsigned long invalidate_mapping_pages(struct address_space *mapping,
 					pgoff_t start, pgoff_t end);
 
+void invalidate_mapping_pagevec(struct address_space *mapping,
+				pgoff_t start, pgoff_t end,
+				unsigned long *nr_pagevec);
+
 static inline void invalidate_remote_inode(struct inode *inode)
 {
 	if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
--- a/mm/fadvise.c~mm-fadvise-improve-the-expensive-remote-lru-cache-draining-after-fadv_dontneed
+++ a/mm/fadvise.c
@@ -141,7 +141,7 @@ int generic_fadvise(struct file *file, l
 		}
 
 		if (end_index >= start_index) {
-			unsigned long count;
+			unsigned long nr_pagevec = 0;
 
 			/*
 			 * It's common to FADV_DONTNEED right after
@@ -154,8 +154,9 @@ int generic_fadvise(struct file *file, l
 			 */
 			lru_add_drain();
 
-			count = invalidate_mapping_pages(mapping,
-						start_index, end_index);
+			invalidate_mapping_pagevec(mapping,
+						start_index, end_index,
+						&nr_pagevec);
 
 			/*
 			 * If fewer pages were invalidated than expected then
@@ -163,7 +164,7 @@ int generic_fadvise(struct file *file, l
 			 * a per-cpu pagevec for a remote CPU. Drain all
 			 * pagevecs and try again.
 			 */
-			if (count < (end_index - start_index + 1)) {
+			if (nr_pagevec) {
 				lru_add_drain_all();
 				invalidate_mapping_pages(mapping, start_index,
 						end_index);
--- a/mm/truncate.c~mm-fadvise-improve-the-expensive-remote-lru-cache-draining-after-fadv_dontneed
+++ a/mm/truncate.c
@@ -528,23 +528,8 @@ void truncate_inode_pages_final(struct a
 }
 EXPORT_SYMBOL(truncate_inode_pages_final);
 
-/**
- * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
- * @mapping: the address_space which holds the pages to invalidate
- * @start: the offset 'from' which to invalidate
- * @end: the offset 'to' which to invalidate (inclusive)
- *
- * This function only removes the unlocked pages, if you want to
- * remove all the pages of one inode, you must call truncate_inode_pages.
- *
- * invalidate_mapping_pages() will not block on IO activity. It will not
- * invalidate pages which are dirty, locked, under writeback or mapped into
- * pagetables.
- *
- * Return: the number of the pages that were invalidated
- */
-unsigned long invalidate_mapping_pages(struct address_space *mapping,
-		pgoff_t start, pgoff_t end)
+unsigned long __invalidate_mapping_pages(struct address_space *mapping,
+		pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)
 {
 	pgoff_t indices[PAGEVEC_SIZE];
 	struct pagevec pvec;
@@ -610,8 +595,13 @@ unsigned long invalidate_mapping_pages(s
 			 * Invalidation is a hint that the page is no longer
 			 * of interest and try to speed up its reclaim.
 			 */
-			if (!ret)
+			if (!ret) {
 				deactivate_file_page(page);
+				/* It is likely on the pagevec of a remote CPU */
+				if (nr_pagevec)
+					(*nr_pagevec)++;
+			}
+
 			if (PageTransHuge(page))
 				put_page(page);
 			count += ret;
@@ -623,8 +613,40 @@ unsigned long invalidate_mapping_pages(s
 	}
 	return count;
 }
+
+/**
+ * invalidate_mapping_pages - Invalidate all the unlocked pages of one inode
+ * @mapping: the address_space which holds the pages to invalidate
+ * @start: the offset 'from' which to invalidate
+ * @end: the offset 'to' which to invalidate (inclusive)
+ *
+ * This function only removes the unlocked pages, if you want to
+ * remove all the pages of one inode, you must call truncate_inode_pages.
+ *
+ * invalidate_mapping_pages() will not block on IO activity. It will not
+ * invalidate pages which are dirty, locked, under writeback or mapped into
+ * pagetables.
+ *
+ * Return: the number of the pages that were invalidated
+ */
+unsigned long invalidate_mapping_pages(struct address_space *mapping,
+		pgoff_t start, pgoff_t end)
+{
+	return __invalidate_mapping_pages(mapping, start, end, NULL);
+}
 EXPORT_SYMBOL(invalidate_mapping_pages);
 
+/**
+ * This helper is similar to the above one, except that it accounts for pages
+ * that are likely on a pagevec and counts them in @nr_pagevec, which will be
+ * used by the caller.
+ */
+void invalidate_mapping_pagevec(struct address_space *mapping,
+		pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)
+{
+	__invalidate_mapping_pages(mapping, start, end, nr_pagevec);
+}
+
 /*
  * This is like invalidate_complete_page(), except it ignores the page's
  * refcount.  We do this because invalidate_inode_pages2() needs stronger
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 064/181] mm/gup_benchmark: update the documentation in Kconfig
  2020-10-13 23:46 incoming Andrew Morton
                   ` (62 preceding siblings ...)
  2020-10-13 23:51 ` [patch 063/181] mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 065/181] mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag Andrew Morton
                   ` (120 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, ira.weiny, jhubbard, keith.busch, kirill.shutemov,
	linux-mm, mm-commits, song.bao.hua, torvalds

From: Barry Song <song.bao.hua@hisilicon.com>
Subject: mm/gup_benchmark: update the documentation in Kconfig

In the beginning, mm/gup_benchmark.c supported only get_user_pages_fast(),
but it now supports benchmarking several get_user_pages()-related calls:

* get_user_pages_fast()
* get_user_pages()
* pin_user_pages_fast()
* pin_user_pages()

The documentation is confusing and needs updating.

Link: https://lkml.kernel.org/r/20200821032546.19992-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/Kconfig |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/Kconfig~mm-gup_benchmark-update-the-documentation-in-kconfig
+++ a/mm/Kconfig
@@ -831,10 +831,10 @@ config PERCPU_STATS
 	  be used to help understand percpu memory usage.
 
 config GUP_BENCHMARK
-	bool "Enable infrastructure for get_user_pages_fast() benchmarking"
+	bool "Enable infrastructure for get_user_pages() and related calls benchmarking"
 	help
 	  Provides /sys/kernel/debug/gup_benchmark that helps with testing
-	  performance of get_user_pages_fast().
+	  performance of get_user_pages() and related calls.
 
 	  See tools/testing/selftests/vm/gup_benchmark.c
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 065/181] mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag
  2020-10-13 23:46 incoming Andrew Morton
                   ` (63 preceding siblings ...)
  2020-10-13 23:51 ` [patch 064/181] mm/gup_benchmark: update the documentation in Kconfig Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:51 ` [patch 066/181] mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM Andrew Morton
                   ` (119 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, corbet, dan.j.williams, david, hch, jack, jgg, jglisse,
	jhubbard, linux-mm, mhocko, mike.kravetz, mm-commits, shuah,
	song.bao.hua, torvalds, vbabka, viro, willy

From: Barry Song <song.bao.hua@hisilicon.com>
Subject: mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag

According to Documentation/core-api/pin_user_pages.rst, FOLL_PIN is a
prerequisite to FOLL_LONGTERM.  Another way of saying that is,
FOLL_LONGTERM is a specific case, more restrictive case of FOLL_PIN.

Almost all kernel modules use pin_user_pages() with FOLL_LONGTERM;
mm/gup_benchmark.c seems to be the only exception in which FOLL_PIN is
not a prerequisite to FOLL_LONGTERM.
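
The preferred pattern, in sketch form: long-term pins go through the
pin_user_pages*() API and are released with unpin_user_pages():

        nr = pin_user_pages(addr, nr_pages, gup_flags | FOLL_LONGTERM,
                            pages, NULL);
        if (nr > 0) {
                /* ... pages may now be used for long-lived DMA ... */
                unpin_user_pages(pages, nr);
        }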

Link: http://lkml.kernel.org/r/20200815122056.29508-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup_benchmark.c                         |   23 +++++++++----------
 tools/testing/selftests/vm/gup_benchmark.c |   14 +++++------
 2 files changed, 19 insertions(+), 18 deletions(-)

--- a/mm/gup_benchmark.c~mm-gup_benchmark-use-pin_user_pages-for-foll_longterm-flag
+++ a/mm/gup_benchmark.c
@@ -6,10 +6,10 @@
 #include <linux/debugfs.h>
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
-#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
-#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 2, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
 
 struct gup_benchmark {
 	__u64 get_delta_usec;
@@ -28,7 +28,6 @@ static void put_back_pages(unsigned int
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
-	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 		for (i = 0; i < nr_pages; i++)
 			put_page(pages[i]);
@@ -36,6 +35,7 @@ static void put_back_pages(unsigned int
 
 	case PIN_FAST_BENCHMARK:
 	case PIN_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
 		unpin_user_pages(pages, nr_pages);
 		break;
 	}
@@ -50,6 +50,7 @@ static void verify_dma_pinned(unsigned i
 	switch (cmd) {
 	case PIN_FAST_BENCHMARK:
 	case PIN_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
 		for (i = 0; i < nr_pages; i++) {
 			page = pages[i];
 			if (WARN(!page_maybe_dma_pinned(page),
@@ -101,11 +102,6 @@ static int __gup_benchmark_ioctl(unsigne
 			nr = get_user_pages_fast(addr, nr, gup->flags,
 						 pages + i);
 			break;
-		case GUP_LONGTERM_BENCHMARK:
-			nr = get_user_pages(addr, nr,
-					    gup->flags | FOLL_LONGTERM,
-					    pages + i, NULL);
-			break;
 		case GUP_BENCHMARK:
 			nr = get_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
@@ -118,6 +114,11 @@ static int __gup_benchmark_ioctl(unsigne
 			nr = pin_user_pages(addr, nr, gup->flags, pages + i,
 					    NULL);
 			break;
+		case PIN_LONGTERM_BENCHMARK:
+			nr = pin_user_pages(addr, nr,
+					    gup->flags | FOLL_LONGTERM,
+					    pages + i, NULL);
+			break;
 		default:
 			kvfree(pages);
 			ret = -EINVAL;
@@ -162,10 +163,10 @@ static long gup_benchmark_ioctl(struct f
 
 	switch (cmd) {
 	case GUP_FAST_BENCHMARK:
-	case GUP_LONGTERM_BENCHMARK:
 	case GUP_BENCHMARK:
 	case PIN_FAST_BENCHMARK:
 	case PIN_BENCHMARK:
+	case PIN_LONGTERM_BENCHMARK:
 		break;
 	default:
 		return -EINVAL;
--- a/tools/testing/selftests/vm/gup_benchmark.c~mm-gup_benchmark-use-pin_user_pages-for-foll_longterm-flag
+++ a/tools/testing/selftests/vm/gup_benchmark.c
@@ -15,12 +15,12 @@
 #define PAGE_SIZE sysconf(_SC_PAGESIZE)
 
 #define GUP_FAST_BENCHMARK	_IOWR('g', 1, struct gup_benchmark)
-#define GUP_LONGTERM_BENCHMARK	_IOWR('g', 2, struct gup_benchmark)
-#define GUP_BENCHMARK		_IOWR('g', 3, struct gup_benchmark)
+#define GUP_BENCHMARK		_IOWR('g', 2, struct gup_benchmark)
 
 /* Similar to above, but use FOLL_PIN instead of FOLL_GET. */
-#define PIN_FAST_BENCHMARK	_IOWR('g', 4, struct gup_benchmark)
-#define PIN_BENCHMARK		_IOWR('g', 5, struct gup_benchmark)
+#define PIN_FAST_BENCHMARK	_IOWR('g', 3, struct gup_benchmark)
+#define PIN_BENCHMARK		_IOWR('g', 4, struct gup_benchmark)
+#define PIN_LONGTERM_BENCHMARK	_IOWR('g', 5, struct gup_benchmark)
 
 /* Just the flags we need, copied from mm.h: */
 #define FOLL_WRITE	0x01	/* check pte is writable */
@@ -52,6 +52,9 @@ int main(int argc, char **argv)
 		case 'b':
 			cmd = PIN_BENCHMARK;
 			break;
+		case 'L':
+			cmd = PIN_LONGTERM_BENCHMARK;
+			break;
 		case 'm':
 			size = atoi(optarg) * MB;
 			break;
@@ -67,9 +70,6 @@ int main(int argc, char **argv)
 		case 'T':
 			thp = 0;
 			break;
-		case 'L':
-			cmd = GUP_LONGTERM_BENCHMARK;
-			break;
 		case 'U':
 			cmd = GUP_BENCHMARK;
 			break;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 066/181] mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM
  2020-10-13 23:46 incoming Andrew Morton
                   ` (64 preceding siblings ...)
  2020-10-13 23:51 ` [patch 065/181] mm/gup_benchmark: use pin_user_pages for FOLL_LONGTERM flag Andrew Morton
@ 2020-10-13 23:51 ` Andrew Morton
  2020-10-13 23:52 ` [patch 067/181] mm/gup: protect unpin_user_pages() against npages==-ERRNO Andrew Morton
                   ` (118 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:51 UTC (permalink / raw)
  To: akpm, corbet, dan.j.williams, david, hch, ira.weiny, jack, jgg,
	jglisse, jhubbard, linux-mm, mhocko, mike.kravetz, mm-commits,
	naresh.kamboju, shuah, song.bao.hua, torvalds, vbabka, viro,
	willy

From: Barry Song <song.bao.hua@hisilicon.com>
Subject: mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM

gup prohibits users from calling get_user_pages() with FOLL_PIN, yet it
allows them to call get_user_pages() with FOLL_LONGTERM alone.  That is
inconsistent.

Since FOLL_LONGTERM is a stricter case of FOLL_PIN, we should prohibit
users from calling get_user_pages() with FOLL_LONGTERM while not with
FOLL_PIN.

mm/gup_benchmark.c used to be the only user that did this improperly,
but it has been fixed by moving it to pin_user_pages().
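
With this change, a call such as the following (a sketch) trips the
WARN_ON_ONCE() in is_valid_gup_flags() and fails with -EINVAL:

        /* Rejected: FOLL_LONGTERM without the pin_user_pages*() API. */
        nr = get_user_pages(addr, nr_pages, FOLL_LONGTERM, pages, NULL);

        /* Correct: request the long-term pin through pin_user_pages(). */
        nr = pin_user_pages(addr, nr_pages, FOLL_LONGTERM, pages, NULL);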

[akpm@linux-foundation.org: fix CONFIG_MMU=n build]
  Link: https://lkml.kernel.org/r/CA+G9fYuNS3k0DVT62twfV746pfNhCSrk5sVMcOcQ1PGGnEseyw@mail.gmail.com
Link: http://lkml.kernel.org/r/20200819110100.23504-1-song.bao.hua@hisilicon.com
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |   37 ++++++++++++++++++++++---------------
 1 file changed, 22 insertions(+), 15 deletions(-)

--- a/mm/gup.c~mm-gup-dont-permit-users-to-call-get_user_pages-with-foll_longterm
+++ a/mm/gup.c
@@ -1747,6 +1747,25 @@ static __always_inline long __gup_longte
 }
 #endif /* CONFIG_FS_DAX || CONFIG_CMA */
 
+static bool is_valid_gup_flags(unsigned int gup_flags)
+{
+	/*
+	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
+	 * never directly by the caller, so enforce that with an assertion:
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+		return false;
+	/*
+	 * FOLL_PIN is a prerequisite to FOLL_LONGTERM. Another way of saying
+	 * that is, FOLL_LONGTERM is a specific case, more restrictive case of
+	 * FOLL_PIN.
+	 */
+	if (WARN_ON_ONCE(gup_flags & FOLL_LONGTERM))
+		return false;
+
+	return true;
+}
+
 #ifdef CONFIG_MMU
 static long __get_user_pages_remote(struct mm_struct *mm,
 				    unsigned long start, unsigned long nr_pages,
@@ -1842,11 +1861,7 @@ long get_user_pages_remote(struct mm_str
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas, int *locked)
 {
-	/*
-	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
-	 * never directly by the caller, so enforce that with an assertion:
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+	if (!is_valid_gup_flags(gup_flags))
 		return -EINVAL;
 
 	return __get_user_pages_remote(mm, start, nr_pages, gup_flags,
@@ -1892,11 +1907,7 @@ long get_user_pages(unsigned long start,
 		unsigned int gup_flags, struct page **pages,
 		struct vm_area_struct **vmas)
 {
-	/*
-	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
-	 * never directly by the caller, so enforce that with an assertion:
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+	if (!is_valid_gup_flags(gup_flags))
 		return -EINVAL;
 
 	return __gup_longterm_locked(current->mm, start, nr_pages,
@@ -2786,11 +2797,7 @@ EXPORT_SYMBOL_GPL(get_user_pages_fast_on
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages)
 {
-	/*
-	 * FOLL_PIN must only be set internally by the pin_user_pages*() APIs,
-	 * never directly by the caller, so enforce that:
-	 */
-	if (WARN_ON_ONCE(gup_flags & FOLL_PIN))
+	if (!is_valid_gup_flags(gup_flags))
 		return -EINVAL;
 
 	/*
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 067/181] mm/gup: protect unpin_user_pages() against npages==-ERRNO
  2020-10-13 23:46 incoming Andrew Morton
                   ` (65 preceding siblings ...)
  2020-10-13 23:51 ` [patch 066/181] mm/gup: don't permit users to call get_user_pages with FOLL_LONGTERM Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 068/181] swap: rename SWP_FS to SWP_FS_OPS to avoid ambiguity Andrew Morton
                   ` (117 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, dan.carpenter, ira.weiny, jhubbard, jrdr.linux, linux-mm,
	mm-commits, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: mm/gup: protect unpin_user_pages() against npages==-ERRNO

As suggested by Dan Carpenter, fortify unpin_user_pages() a bit against a
typical caller mistake: check whether the npages arg is actually a -ERRNO
value, which would blow up the unpinning loop, and WARN and return if so.

If this new WARN_ON() fires, then the system *might* be leaking pages (by
leaving them pinned), but probably not.  More likely, gup/pup returned a
hard -ERRNO error to the caller, who erroneously passed it here.

Link: https://lkml.kernel.org/r/20200917065706.409079-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/gup.c |    7 +++++++
 1 file changed, 7 insertions(+)

--- a/mm/gup.c~mm-gup-protect-unpin_user_pages-against-npages==-errno
+++ a/mm/gup.c
@@ -329,6 +329,13 @@ void unpin_user_pages(struct page **page
 	unsigned long index;
 
 	/*
+	 * If this WARN_ON() fires, then the system *might* be leaking pages (by
+	 * leaving them pinned), but probably not. More likely, gup/pup returned
+	 * a hard -ERRNO error to the caller, who erroneously passed it here.
+	 */
+	if (WARN_ON(IS_ERR_VALUE(npages)))
+		return;
+	/*
 	 * TODO: this can be optimized for huge pages: if a series of pages is
 	 * physically contiguous and part of the same compound page, then a
 	 * single operation to the head page should suffice.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 068/181] swap: rename SWP_FS to SWP_FS_OPS to avoid ambiguity
  2020-10-13 23:46 incoming Andrew Morton
                   ` (66 preceding siblings ...)
  2020-10-13 23:52 ` [patch 067/181] mm/gup: protect unpin_user_pages() against npages==-ERRNO Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 069/181] mm: remove activate_page() from unuse_pte() Andrew Morton
                   ` (116 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, hsiangkao, linux-mm, mm-commits, torvalds, willy

From: Gao Xiang <hsiangkao@redhat.com>
Subject: swap: rename SWP_FS to SWP_FS_OPS to avoid ambiguity

SWP_FS is used to make swap_{read,write}page() go through the filesystem;
for now it is only used for swap files over NFS.  Otherwise IO is
submitted directly to the block device according to the swapfile extents
reported by the filesystem in advance.

As Matthew pointed out [1], the SWP_FS naming is somewhat confusing, so
rename it to SWP_FS_OPS.

[1] https://lore.kernel.org/r/20200820113448.GM17456@casper.infradead.org

Link: https://lkml.kernel.org/r/20200822113019.11319-1-hsiangkao@redhat.com
Signed-off-by: Gao Xiang <hsiangkao@redhat.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    2 +-
 mm/page_io.c         |    6 +++---
 mm/swap_state.c      |    2 +-
 mm/swapfile.c        |    2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

--- a/include/linux/swap.h~swap-rename-swp_fs-to-swap_fs_ops-to-avoid-ambiguity
+++ a/include/linux/swap.h
@@ -170,7 +170,7 @@ enum {
 	SWP_CONTINUED	= (1 << 5),	/* swap_map has count continuation */
 	SWP_BLKDEV	= (1 << 6),	/* its a block device */
 	SWP_ACTIVATED	= (1 << 7),	/* set after swap_activate success */
-	SWP_FS		= (1 << 8),	/* swap file goes through fs */
+	SWP_FS_OPS	= (1 << 8),	/* swapfile operations go through fs */
 	SWP_AREA_DISCARD = (1 << 9),	/* single-time swap area discards */
 	SWP_PAGE_DISCARD = (1 << 10),	/* freed swap page-cluster discards */
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
--- a/mm/page_io.c~swap-rename-swp_fs-to-swap_fs_ops-to-avoid-ambiguity
+++ a/mm/page_io.c
@@ -312,7 +312,7 @@ int __swap_writepage(struct page *page,
 	struct swap_info_struct *sis = page_swap_info(page);
 
 	VM_BUG_ON_PAGE(!PageSwapCache(page), page);
-	if (data_race(sis->flags & SWP_FS)) {
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		struct kiocb kiocb;
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
@@ -403,7 +403,7 @@ int swap_readpage(struct page *page, boo
 		goto out;
 	}
 
-	if (data_race(sis->flags & SWP_FS)) {
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		struct file *swap_file = sis->swap_file;
 		struct address_space *mapping = swap_file->f_mapping;
 
@@ -467,7 +467,7 @@ int swap_set_page_dirty(struct page *pag
 {
 	struct swap_info_struct *sis = page_swap_info(page);
 
-	if (data_race(sis->flags & SWP_FS)) {
+	if (data_race(sis->flags & SWP_FS_OPS)) {
 		struct address_space *mapping = sis->swap_file->f_mapping;
 
 		VM_BUG_ON_PAGE(!PageSwapCache(page), page);
--- a/mm/swapfile.c~swap-rename-swp_fs-to-swap_fs_ops-to-avoid-ambiguity
+++ a/mm/swapfile.c
@@ -2437,7 +2437,7 @@ static int setup_swap_extents(struct swa
 		if (ret >= 0)
 			sis->flags |= SWP_ACTIVATED;
 		if (!ret) {
-			sis->flags |= SWP_FS;
+			sis->flags |= SWP_FS_OPS;
 			ret = add_swap_extent(sis, 0, sis->max, 0);
 			*span = sis->pages;
 		}
--- a/mm/swap_state.c~swap-rename-swp_fs-to-swap_fs_ops-to-avoid-ambiguity
+++ a/mm/swap_state.c
@@ -665,7 +665,7 @@ struct page *swap_cluster_readahead(swp_
 		goto skip;
 
 	/* Test swap type to make sure the dereference is safe */
-	if (likely(si->flags & (SWP_BLKDEV | SWP_FS))) {
+	if (likely(si->flags & (SWP_BLKDEV | SWP_FS_OPS))) {
 		struct inode *inode = si->swap_file->f_mapping->host;
 		if (inode_read_congested(inode))
 			goto skip;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 069/181] mm: remove activate_page() from unuse_pte()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (67 preceding siblings ...)
  2020-10-13 23:52 ` [patch 068/181] swap: rename SWP_FS to SWAP_FS_OPS to avoid ambiguity Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 070/181] mm: remove superfluous __ClearPageActive() Andrew Morton
                   ` (115 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, alexander.h.duyck, cai, david, hughd, iamjoonsoo.kim,
	linux-mm, mgorman, mhocko, mm-commits, npiggin, shy828301,
	torvalds, ying.huang, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: mm: remove activate_page() from unuse_pte()

Anon pages are no longer initially added to the active lruvec after
commit b518154e59aa ("mm/vmscan: protect the workingset on anonymous
LRU").  Remove activate_page() from unuse_pte(), which seems to have been
missed by that commit, and make the function static while we are at it.

Before the commit, we called lru_cache_add_active_or_unevictable() to add
new ksm pages to active lruvec.  Therefore, activate_page() wasn't
necessary for them in the first place.

Link: http://lkml.kernel.org/r/20200818184704.3625199-1-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap.h |    1 -
 mm/swap.c            |    4 ++--
 mm/swapfile.c        |    5 -----
 3 files changed, 2 insertions(+), 8 deletions(-)

--- a/include/linux/swap.h~mm-remove-activate_page-from-unuse_pte
+++ a/include/linux/swap.h
@@ -340,7 +340,6 @@ extern void lru_note_cost_page(struct pa
 extern void lru_cache_add(struct page *);
 extern void lru_add_page_tail(struct page *page, struct page *page_tail,
 			 struct lruvec *lruvec, struct list_head *head);
-extern void activate_page(struct page *);
 extern void mark_page_accessed(struct page *);
 extern void lru_add_drain(void);
 extern void lru_add_drain_cpu(int cpu);
--- a/mm/swap.c~mm-remove-activate_page-from-unuse_pte
+++ a/mm/swap.c
@@ -348,7 +348,7 @@ static bool need_activate_page_drain(int
 	return pagevec_count(&per_cpu(lru_pvecs.activate_page, cpu)) != 0;
 }
 
-void activate_page(struct page *page)
+static void activate_page(struct page *page)
 {
 	page = compound_head(page);
 	if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
@@ -368,7 +368,7 @@ static inline void activate_page_drain(i
 {
 }
 
-void activate_page(struct page *page)
+static void activate_page(struct page *page)
 {
 	pg_data_t *pgdat = page_pgdat(page);
 
--- a/mm/swapfile.c~mm-remove-activate_page-from-unuse_pte
+++ a/mm/swapfile.c
@@ -1929,11 +1929,6 @@ static int unuse_pte(struct vm_area_stru
 		lru_cache_add_inactive_or_unevictable(page, vma);
 	}
 	swap_free(entry);
-	/*
-	 * Move the page to the active list so it is not
-	 * immediately swapped out again after swapon.
-	 */
-	activate_page(page);
 out:
 	pte_unmap_unlock(pte, ptl);
 	if (page != swapcache) {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 070/181] mm: remove superfluous __ClearPageActive()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (68 preceding siblings ...)
  2020-10-13 23:52 ` [patch 069/181] mm: remove activate_page() from unuse_pte() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 071/181] mm/swap.c: fix confusing comment in release_pages() Andrew Morton
                   ` (114 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, alexander.h.duyck, cai, david, hughd, iamjoonsoo.kim,
	linux-mm, mgorman, mhocko, mm-commits, npiggin, shy828301,
	torvalds, ying.huang, yuzhao

From: Yu Zhao <yuzhao@google.com>
Subject: mm: remove superfluous __ClearPageActive()

To activate a page, mark_page_accessed() always holds a reference on it. 
It either gets a new reference when adding a page to
lru_pvecs.activate_page or reuses an existing one it previously got when
it added a page to lru_pvecs.lru_add.  So it doesn't call SetPageActive()
on a page that doesn't have any reference left.  Therefore, the race is
impossible these days (I didn't bother to dig into its history).

For other paths, namely reclaim and migration, a reference count is always
held while calling SetPageActive() on a page.

SetPageSlabPfmemalloc() also uses SetPageActive(), but it's irrelevant to
LRU pages.

Link: http://lkml.kernel.org/r/20200818184704.3625199-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memremap.c |    2 --
 mm/swap.c     |    2 --
 2 files changed, 4 deletions(-)

--- a/mm/memremap.c~mm-remove-superfluous-__clearpageactive
+++ a/mm/memremap.c
@@ -494,8 +494,6 @@ void free_devmap_managed_page(struct pag
 		return;
 	}
 
-	/* Clear Active bit in case of parallel mark_page_accessed */
-	__ClearPageActive(page);
 	__ClearPageWaiters(page);
 
 	mem_cgroup_uncharge(page);
--- a/mm/swap.c~mm-remove-superfluous-__clearpageactive
+++ a/mm/swap.c
@@ -943,8 +943,6 @@ void release_pages(struct page **pages,
 			del_page_from_lru_list(page, lruvec, page_off_lru(page));
 		}
 
-		/* Clear Active bit in case of parallel mark_page_accessed */
-		__ClearPageActive(page);
 		__ClearPageWaiters(page);
 
 		list_add(&page->lru, &pages_to_free);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 071/181] mm/swap.c: fix confusing comment in release_pages()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (69 preceding siblings ...)
  2020-10-13 23:52 ` [patch 070/181] mm: remove superfluous __ClearPageActive() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 072/181] mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache() Andrew Morton
                   ` (113 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, jhubbard, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/swap.c: fix confusing comment in release_pages()

Since commit 07d802699528 ("mm: devmap: refactor 1-based refcounting for
ZONE_DEVICE pages"), the function put_devmap_managed_page() has been
renamed to page_is_devmap_managed(); update the comment to match.

Link: https://lkml.kernel.org/r/20200905084453.19353-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/swap.c~mm-swap-fix-confusing-comment-in-release_pages
+++ a/mm/swap.c
@@ -902,7 +902,7 @@ void release_pages(struct page **pages,
 			}
 			/*
 			 * ZONE_DEVICE pages that return 'false' from
-			 * put_devmap_managed_page() do not require special
+			 * page_is_devmap_managed() do not require special
 			 * processing, and instead, expect a call to
 			 * put_page_testzero().
 			 */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 072/181] mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (70 preceding siblings ...)
  2020-10-13 23:52 ` [patch 071/181] mm/swap.c: fix confusing comment in release_pages() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 073/181] mm/page_io.c: remove useless out label in __swap_writepage() Andrew Morton
                   ` (112 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache()

enable_swap_slots_cache() always returns zero, and its return value is
simply ignored by the caller.  So make enable_swap_slots_cache() void.

Link: https://lkml.kernel.org/r/20200924113554.50614-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/swap_slots.h |    2 +-
 mm/swap_slots.c            |    3 +--
 2 files changed, 2 insertions(+), 3 deletions(-)

--- a/include/linux/swap_slots.h~mm-swap_slotsc-remove-always-zero-and-unused-return-value-of-enable_swap_slots_cache
+++ a/include/linux/swap_slots.h
@@ -23,7 +23,7 @@ struct swap_slots_cache {
 
 void disable_swap_slots_cache_lock(void);
 void reenable_swap_slots_cache_unlock(void);
-int enable_swap_slots_cache(void);
+void enable_swap_slots_cache(void);
 int free_swap_slot(swp_entry_t entry);
 
 extern bool swap_slot_cache_enabled;
--- a/mm/swap_slots.c~mm-swap_slotsc-remove-always-zero-and-unused-return-value-of-enable_swap_slots_cache
+++ a/mm/swap_slots.c
@@ -237,7 +237,7 @@ static int free_slot_cache(unsigned int
 	return 0;
 }
 
-int enable_swap_slots_cache(void)
+void enable_swap_slots_cache(void)
 {
 	mutex_lock(&swap_slots_cache_enable_mutex);
 	if (!swap_slot_cache_initialized) {
@@ -255,7 +255,6 @@ int enable_swap_slots_cache(void)
 	__reenable_swap_slots_cache();
 out_unlock:
 	mutex_unlock(&swap_slots_cache_enable_mutex);
-	return 0;
 }
 
 /* called with swap slot cache's alloc lock held */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 073/181] mm/page_io.c: remove useless out label in __swap_writepage()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (71 preceding siblings ...)
  2020-10-13 23:52 ` [patch 072/181] mm/swap_slots.c: remove always zero and unused return value of enable_swap_slots_cache() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 074/181] mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable() Andrew Morton
                   ` (111 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_io.c: remove useless out label in __swap_writepage()

The out label is used in only one place, and the code there just returns
ret without any resource cleanup or lock release.  So remove the jump
label and return directly instead.

Link: https://lkml.kernel.org/r/20200927124032.22521-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_io.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/page_io.c~mm-remove-useless-out-label-in-__swap_writepage
+++ a/mm/page_io.c
@@ -359,13 +359,11 @@ int __swap_writepage(struct page *page,
 		return 0;
 	}
 
-	ret = 0;
 	bio = get_swap_bio(GFP_NOIO, page, end_write_func);
 	if (bio == NULL) {
 		set_page_dirty(page);
 		unlock_page(page);
-		ret = -ENOMEM;
-		goto out;
+		return -ENOMEM;
 	}
 	bio->bi_opf = REQ_OP_WRITE | REQ_SWAP | wbc_to_write_flags(wbc);
 	bio_associate_blkg_from_page(bio, page);
@@ -373,8 +371,8 @@ int __swap_writepage(struct page *page,
 	set_page_writeback(page);
 	unlock_page(page);
 	submit_bio(bio);
-out:
-	return ret;
+
+	return 0;
 }
 
 int swap_readpage(struct page *page, bool synchronous)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 074/181] mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (72 preceding siblings ...)
  2020-10-13 23:52 ` [patch 073/181] mm/page_io.c: remove useless out label in __swap_writepage() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 075/181] mm/swapfile.c: remove unnecessary goto out in _swap_info_get() Andrew Morton
                   ` (110 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, shakeelb, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable()

Since commit 9c4e6b1a7027 ("mm, mlock, vmscan: no more skipping
pagevecs"), unevictable pages no longer go directly back onto the zone's
unevictable list.  Update the comment accordingly.

Link: https://lkml.kernel.org/r/20200927122209.59328-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/swap.c~mm-fix-incomplete-comment-in-lru_cache_add_inactive_or_unevictable
+++ a/mm/swap.c
@@ -481,9 +481,7 @@ EXPORT_SYMBOL(lru_cache_add);
  * @vma:   vma in which page is mapped for determining reclaimability
  *
  * Place @page on the inactive or unevictable LRU list, depending on its
- * evictability.  Note that if the page is not evictable, it goes
- * directly back onto it's zone's unevictable list, it does NOT use a
- * per cpu pagevec.
+ * evictability.
  */
 void lru_cache_add_inactive_or_unevictable(struct page *page,
 					 struct vm_area_struct *vma)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 075/181] mm/swapfile.c: remove unnecessary goto out in _swap_info_get()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (73 preceding siblings ...)
  2020-10-13 23:52 ` [patch 074/181] mm/swap.c: fix incomplete comment in lru_cache_add_inactive_or_unevictable() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 076/181] mm/swapfile.c: fix potential memory leak in sys_swapon Andrew Morton
                   ` (109 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/swapfile.c: remove unnecessary goto out in _swap_info_get()

It's unnecessary to goto the out label when the out label is immediately
below.

Link: https://lkml.kernel.org/r/20200930102549.1885-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/swapfile.c~mm-remove-unnecessary-goto-out-in-_swap_info_get
+++ a/mm/swapfile.c
@@ -1184,7 +1184,6 @@ static struct swap_info_struct *_swap_in
 
 bad_free:
 	pr_err("swap_info_get: %s%08lx\n", Unused_offset, entry.val);
-	goto out;
 out:
 	return NULL;
 }
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 076/181] mm/swapfile.c: fix potential memory leak in sys_swapon
  2020-10-13 23:46 incoming Andrew Morton
                   ` (74 preceding siblings ...)
  2020-10-13 23:52 ` [patch 075/181] mm/swapfile.c: remove unnecessary goto out in _swap_info_get() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 077/181] mm/memremap.c: convert devmap static branch to {inc,dec} Andrew Morton
                   ` (108 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, darrick.wong, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/swapfile.c: fix potential memory leak in sys_swapon

If draining the inode fails, we forget to free the swap address space
allocated by init_swap_address_space() earlier in sys_swapon().
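
For illustration, a minimal userspace sketch (not the kernel code;
step_a()/step_b() are hypothetical stand-ins for
init_swap_address_space() and inode_drain_writes()) of the goto-based
unwind ladder this fix extends: each later setup step must jump to a
label that undoes every earlier step.

#include <stdio.h>

static int step_a(void) { return 0; }		/* succeeds */
static void undo_step_a(void) { puts("undo step_a"); }
static int step_b(void) { return -1; }		/* simulate the failure */

static int hypothetical_setup(void)
{
	int err;

	err = step_a();
	if (err)
		return err;

	err = step_b();
	if (err)
		goto undo_a;	/* without this, step_a()'s work leaks */

	return 0;

undo_a:
	undo_step_a();
	return err;
}

int main(void)
{
	printf("setup: %d\n", hypothetical_setup());
	return 0;
}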

Link: https://lkml.kernel.org/r/20200930101803.53884-1-linmiaohe@huawei.com
Fixes: dc617f29dbe5 ("vfs: don't allow writes to swap files")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swapfile.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/mm/swapfile.c~mm-fix-potential-memory-leak-in-sys_swapon
+++ a/mm/swapfile.c
@@ -3342,7 +3342,7 @@ SYSCALL_DEFINE2(swapon, const char __use
 	error = inode_drain_writes(inode);
 	if (error) {
 		inode->i_flags &= ~S_SWAPFILE;
-		goto bad_swap_unlock_inode;
+		goto free_swap_address_space;
 	}
 
 	mutex_lock(&swapon_mutex);
@@ -3367,6 +3367,8 @@ SYSCALL_DEFINE2(swapon, const char __use
 
 	error = 0;
 	goto out;
+free_swap_address_space:
+	exit_swap_address_space(p->type);
 bad_swap_unlock_inode:
 	inode_unlock(inode);
 bad_swap:
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 077/181] mm/memremap.c: convert devmap static branch to {inc,dec}
  2020-10-13 23:46 incoming Andrew Morton
                   ` (75 preceding siblings ...)
  2020-10-13 23:52 ` [patch 076/181] mm/swapfile.c: fix potential memory leak in sys_swapon Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 078/181] mm: memcontrol: use flex_array_size() helper in memcpy() Andrew Morton
                   ` (107 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, dan.j.williams, ira.weiny, linux-mm, mm-commits, torvalds,
	vishal.l.verma, william.kucharski

From: Ira Weiny <ira.weiny@intel.com>
Subject: mm/memremap.c: convert devmap static branch to {inc,dec}

While reviewing Protection Key Supervisor support, it was pointed out
that using a counter to track whether a static branch is enabled is an
anti-pattern which is better solved using the provided
static_branch_{inc,dec} functions.[1]

Fix up devmap_managed_key to work the same way.  This should also be
safer, because the counter-based scheme has a very small (very unlikely)
race when multiple callers try to enable the branch at the same time.

[1] https://lore.kernel.org/lkml/20200714194031.GI5523@worktop.programming.kicks-ass.net/
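
For context, a kernel-style sketch contrasting the two approaches
(assumes a kernel build context; example_key and these helpers are
illustrative names, not the patched code):

#include <linux/atomic.h>
#include <linux/jump_label.h>

DEFINE_STATIC_KEY_FALSE(example_key);

/* Anti-pattern: a hand-rolled counter decides when to toggle the key.
 * Two concurrent first-time callers can race around the "== 1" test. */
static atomic_t example_enable;

static void example_get_counted(void)
{
	if (atomic_inc_return(&example_enable) == 1)
		static_branch_enable(&example_key);
}

/* Preferred: the static key is itself reference counted. */
static void example_get(void)
{
	static_branch_inc(&example_key);	/* enables on 0 -> 1 */
}

static void example_put(void)
{
	static_branch_dec(&example_key);	/* disables on 1 -> 0 */
}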

Link: https://lkml.kernel.org/r/20200810235319.2796597-1-ira.weiny@intel.com
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memremap.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/mm/memremap.c~memremap-convert-devmap-static-branch-to-incdec
+++ a/mm/memremap.c
@@ -40,12 +40,10 @@ EXPORT_SYMBOL_GPL(memremap_compat_align)
 #ifdef CONFIG_DEV_PAGEMAP_OPS
 DEFINE_STATIC_KEY_FALSE(devmap_managed_key);
 EXPORT_SYMBOL(devmap_managed_key);
-static atomic_t devmap_managed_enable;
 
 static void devmap_managed_enable_put(void)
 {
-	if (atomic_dec_and_test(&devmap_managed_enable))
-		static_branch_disable(&devmap_managed_key);
+	static_branch_dec(&devmap_managed_key);
 }
 
 static int devmap_managed_enable_get(struct dev_pagemap *pgmap)
@@ -56,8 +54,7 @@ static int devmap_managed_enable_get(str
 		return -EINVAL;
 	}
 
-	if (atomic_inc_return(&devmap_managed_enable) == 1)
-		static_branch_enable(&devmap_managed_key);
+	static_branch_inc(&devmap_managed_key);
 	return 0;
 }
 #else
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 078/181] mm: memcontrol: use flex_array_size() helper in memcpy()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (76 preceding siblings ...)
  2020-10-13 23:52 ` [patch 077/181] mm/memremap.c: convert devmap static branch to {inc,dec} Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 079/181] mm: memcontrol: use the preferred form for passing the size of a structure type Andrew Morton
                   ` (106 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, gustavoars, hannes, linux-mm, mhocko, mm-commits, torvalds,
	vdavydov.dev

From: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Subject: mm: memcontrol: use flex_array_size() helper in memcpy()

Make use of the flex_array_size() helper to calculate the size of a
flexible array member within an enclosing structure.

This helper offers defense-in-depth against potential integer overflows,
while at the same time making it explicitly clear that we are dealing
with a flexible array member.

Also, remove unnecessary braces.
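
As a hedged illustration (hypothetical struct; flex_array_size() lives
in <linux/overflow.h>), the helper computes count * sizeof(element) but
saturates to SIZE_MAX instead of wrapping on overflow:

#include <linux/overflow.h>
#include <linux/string.h>

/* Hypothetical structure with a flexible array member. */
struct thresholds {
	unsigned int size;
	unsigned long entries[];
};

static void copy_entries(struct thresholds *dst,
			 const struct thresholds *src, unsigned int n)
{
	/* Same as n * sizeof(dst->entries[0]) in the normal case, but a
	 * multiplication overflow saturates rather than producing a
	 * short, exploitable size. */
	memcpy(dst->entries, src->entries,
	       flex_array_size(dst, entries, n));
}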

Link: https://lkml.kernel.org/r/ddd60dae2d9aea1ccdd2be66634815c93696125e.1596214831.git.gustavoars@kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-use-flex_array_size-helper-in-memcpy
+++ a/mm/memcontrol.c
@@ -4255,10 +4255,9 @@ static int __mem_cgroup_usage_register_e
 	new->size = size;
 
 	/* Copy thresholds (if any) to new array */
-	if (thresholds->primary) {
-		memcpy(new->entries, thresholds->primary->entries, (size - 1) *
-				sizeof(struct mem_cgroup_threshold));
-	}
+	if (thresholds->primary)
+		memcpy(new->entries, thresholds->primary->entries,
+		       flex_array_size(new, entries, size - 1));
 
 	/* Add new threshold */
 	new->entries[size - 1].eventfd = eventfd;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 079/181] mm: memcontrol: use the preferred form for passing the size of a structure type
  2020-10-13 23:46 incoming Andrew Morton
                   ` (77 preceding siblings ...)
  2020-10-13 23:52 ` [patch 078/181] mm: memcontrol: use flex_array_size() helper in memcpy() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 080/181] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Andrew Morton
                   ` (105 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, gustavoars, hannes, linux-mm, mhocko, mm-commits, torvalds,
	vdavydov.dev

From: "Gustavo A. R. Silva" <gustavoars@kernel.org>
Subject: mm: memcontrol: use the preferred form for passing the size of a structure type

Use the preferred form for passing the size of a structure type.  The
alternative form where the structure type is spelled out hurts readability
and introduces an opportunity for a bug when the object type is changed
but the corresponding object identifier to which the sizeof operator is
applied is not.
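
A small kernel-style sketch of the two forms (struct item and
item_cmp() are hypothetical; sort() is from <linux/sort.h>):

#include <linux/sort.h>

struct item {
	unsigned long threshold;
};

static int item_cmp(const void *a, const void *b)
{
	const struct item *x = a, *y = b;

	return x->threshold < y->threshold ? -1 :
	       x->threshold > y->threshold;
}

static void sort_items(struct item *items, size_t n)
{
	/* sizeof(*items) stays correct even if 'items' later changes to
	 * a different element type; sizeof(struct item) would silently
	 * become wrong. */
	sort(items, n, sizeof(*items), item_cmp, NULL);
}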

Link: https://lkml.kernel.org/r/773e013ff2f07fe2a0b47153f14dea054c0c04f1.1596214831.git.gustavoars@kernel.org
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcontrol-use-the-preferred-form-for-passing-the-size-of-a-structure-type
+++ a/mm/memcontrol.c
@@ -4264,7 +4264,7 @@ static int __mem_cgroup_usage_register_e
 	new->entries[size - 1].threshold = threshold;
 
 	/* Sort thresholds. Registering of new threshold isn't time-critical */
-	sort(new->entries, size, sizeof(struct mem_cgroup_threshold),
+	sort(new->entries, size, sizeof(*new->entries),
 			compare_thresholds, NULL);
 
 	/* Find current threshold */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 080/181] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (78 preceding siblings ...)
  2020-10-13 23:52 ` [patch 079/181] mm: memcontrol: use the preferred form for passing the size of a structure type Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 081/181] mm: memcontrol: correct the comment of mem_cgroup_iter() Andrew Morton
                   ` (104 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, guro, hannes, linux-mm, mm-commits, shakeelb, torvalds, vbabka

From: Roman Gushchin <guro@fb.com>
Subject: mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj()

mem_cgroup_from_obj() checks the lowest bit of the page->mem_cgroup
pointer to determine if the page has an attached obj_cgroup vector instead
of a regular memcg pointer.  If it's not set, it simply returns the
page->mem_cgroup value as a struct mem_cgroup pointer.

Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
for all allocations") changed the moment when this bit is set: previously
it was set when the slab page was allocated; now it can be set well after
that, when the first accounted object is allocated on the page.

It opened a race: if page->mem_cgroup is set concurrently after the first
page_has_obj_cgroups(page) check, a pointer to the obj_cgroups array can
be returned as a memory cgroup pointer.

A simple check for page->mem_cgroup pointer for NULL before the
page_has_obj_cgroups() check fixes the race.  Indeed, if the pointer is
not NULL, it's either a simple mem_cgroup pointer or a pointer to
obj_cgroup vector.  The pointer can be asynchronously changed from NULL to
(obj_cgroup_vec | 0x1UL), but can't be changed from a valid memcg pointer
to objcg vector or back.

If the object passed to mem_cgroup_from_obj() is a slab object and
page->mem_cgroup is NULL, it means that the object is not accounted, so
the function must return NULL.

I discovered the race by looking at the code; so far I haven't seen it
in the wild.
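
A minimal sketch of the tagging scheme and the fixed check order
(illustrative names, not the exact kernel helpers).  Because the raw
value can only change from NULL to a tagged vector pointer, and never
between a plain memcg pointer and a tagged value, testing for NULL
first closes the race:

struct mem_cgroup;

#define OBJCGS_TAG	0x1UL	/* lowest bit marks an obj_cgroup vector */

static struct mem_cgroup *memcg_from_raw(unsigned long raw)
{
	if (!raw)			/* object is not accounted */
		return NULL;
	if (raw & OBJCGS_TAG)		/* obj_cgroup vector: per-object
					 * lookup is needed; this sketch
					 * stops here */
		return NULL;
	return (struct mem_cgroup *)raw;	/* plain memcg pointer */
}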

Link: https://lkml.kernel.org/r/20200910022435.2773735-1-guro@fb.com
Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/mm/memcontrol.c~mm-memcg-slab-fix-racy-access-to-page-mem_cgroup-in-mem_cgroup_from_obj
+++ a/mm/memcontrol.c
@@ -2888,6 +2888,17 @@ struct mem_cgroup *mem_cgroup_from_obj(v
 	page = virt_to_head_page(p);
 
 	/*
+	 * If page->mem_cgroup is set, it's either a simple mem_cgroup pointer
+	 * or a pointer to obj_cgroup vector. In the latter case the lowest
+	 * bit of the pointer is set.
+	 * The page->mem_cgroup pointer can be asynchronously changed
+	 * from NULL to (obj_cgroup_vec | 0x1UL), but can't be changed
+	 * from a valid memcg pointer to objcg vector or back.
+	 */
+	if (!page->mem_cgroup)
+		return NULL;
+
+	/*
 	 * Slab objects are accounted individually, not per-page.
 	 * Memcg membership data for each individual object is saved in
 	 * the page->obj_cgroups.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 081/181] mm: memcontrol: correct the comment of mem_cgroup_iter()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (79 preceding siblings ...)
  2020-10-13 23:52 ` [patch 080/181] mm: memcg/slab: fix racy access to page->mem_cgroup in mem_cgroup_from_obj() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 082/181] mm/memcg: clean up obsolete enum charge_type Andrew Morton
                   ` (103 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, hannes, linmiaohe, linux-mm, mhocko, mm-commits, torvalds,
	vdavydov.dev

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm: memcontrol: correct the comment of mem_cgroup_iter()

Since commit bbec2e15170a ("mm: rename page_counter's count/limit into
usage/max"), the @reclaim argument no longer has a priority field.
Update the comment of mem_cgroup_iter() accordingly.

Link: https://lkml.kernel.org/r/20200913094129.44558-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-correct-the-comment-of-mem_cgroup_iter
+++ a/mm/memcontrol.c
@@ -1102,9 +1102,9 @@ static __always_inline struct mem_cgroup
  * invocations for reference counting, or use mem_cgroup_iter_break()
  * to cancel a hierarchy walk before the round-trip is complete.
  *
- * Reclaimers can specify a node and a priority level in @reclaim to
- * divide up the memcgs in the hierarchy among all concurrent
- * reclaimers operating on the same node and priority.
+ * Reclaimers can specify a node in @reclaim to divide up the memcgs
+ * in the hierarchy among all concurrent reclaimers operating on the
+ * same node.
  */
 struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup *prev,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 082/181] mm/memcg: clean up obsolete enum charge_type
  2020-10-13 23:46 incoming Andrew Morton
                   ` (80 preceding siblings ...)
  2020-10-13 23:52 ` [patch 081/181] mm: memcontrol: correct the comment of mem_cgroup_iter() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 083/181] mm/memcg: simplify mem_cgroup_get_max() Andrew Morton
                   ` (102 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, laoar.shao, linux-mm, longman, mhocko,
	mm-commits, shakeelb, tj, torvalds, vdavydov.dev

From: Waiman Long <longman@redhat.com>
Subject: mm/memcg: clean up obsolete enum charge_type

Patch series "mm/memcg: Miscellaneous cleanups and streamlining", v2.


This patch (of 3):

Since commit 0a31bc97c80c ("mm: memcontrol: rewrite uncharge API") and
commit 00501b531c47 ("mm: memcontrol: rewrite charge API") in v3.17, the
enum charge_type has not been used anywhere.  However, the enum itself
was not removed at that time.  Remove the obsolete enum charge_type now.

Link: https://lkml.kernel.org/r/20200914024452.19167-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20200914024452.19167-2-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Chris Down <chris@chrisdown.name>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    8 --------
 1 file changed, 8 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-clean-up-obsolete-enum-charge_type
+++ a/mm/memcontrol.c
@@ -197,14 +197,6 @@ static struct move_charge_struct {
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
 #define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
 
-enum charge_type {
-	MEM_CGROUP_CHARGE_TYPE_CACHE = 0,
-	MEM_CGROUP_CHARGE_TYPE_ANON,
-	MEM_CGROUP_CHARGE_TYPE_SWAPOUT,	/* for accounting swapcache */
-	MEM_CGROUP_CHARGE_TYPE_DROP,	/* a page was unused swap cache */
-	NR_CHARGE_TYPE,
-};
-
 /* for encoding cft->private value on file */
 enum res_type {
 	_MEM,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 083/181] mm/memcg: simplify mem_cgroup_get_max()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (81 preceding siblings ...)
  2020-10-13 23:52 ` [patch 082/181] mm/memcg: clean up obsolete enum charge_type Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 084/181] mm/memcg: unify swap and memsw page counters Andrew Morton
                   ` (101 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, laoar.shao, linux-mm, longman, mhocko,
	mm-commits, shakeelb, tj, torvalds, vdavydov.dev

From: Waiman Long <longman@redhat.com>
Subject: mm/memcg: simplify mem_cgroup_get_max()

mem_cgroup_get_max() used to get the memory+swap max from both the v1
memsw and the v2 memory+swap page counters and return the maximum of the
two values.  This is redundant; it is more efficient to get either the
v1 or the v2 value, depending on which controller is currently in use.
(In v1, memsw.max limits memory+swap combined, so the swap headroom is
memsw.max - memory.max; in v2, swap.max limits swap alone.)

[longman@redhat.com: v4]
  Link: https://lkml.kernel.org/r/20200914150928.7841-1-longman@redhat.com
Link: https://lkml.kernel.org/r/20200914024452.19167-3-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |   20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

--- a/mm/memcontrol.c~mm-memcg-simplify-mem_cgroup_get_max
+++ a/mm/memcontrol.c
@@ -1633,17 +1633,19 @@ void mem_cgroup_print_oom_meminfo(struct
  */
 unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg)
 {
-	unsigned long max;
+	unsigned long max = READ_ONCE(memcg->memory.max);
 
-	max = READ_ONCE(memcg->memory.max);
-	if (mem_cgroup_swappiness(memcg)) {
-		unsigned long memsw_max;
-		unsigned long swap_max;
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys)) {
+		if (mem_cgroup_swappiness(memcg))
+			max += min(READ_ONCE(memcg->swap.max),
+				   (unsigned long)total_swap_pages);
+	} else { /* v1 */
+		if (mem_cgroup_swappiness(memcg)) {
+			/* Calculate swap excess capacity from memsw limit */
+			unsigned long swap = READ_ONCE(memcg->memsw.max) - max;
 
-		memsw_max = memcg->memsw.max;
-		swap_max = READ_ONCE(memcg->swap.max);
-		swap_max = min(swap_max, (unsigned long)total_swap_pages);
-		max = min(max + swap_max, memsw_max);
+			max += min(swap, (unsigned long)total_swap_pages);
+		}
 	}
 	return max;
 }
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 084/181] mm/memcg: unify swap and memsw page counters
  2020-10-13 23:46 incoming Andrew Morton
                   ` (82 preceding siblings ...)
  2020-10-13 23:52 ` [patch 083/181] mm/memcg: simplify mem_cgroup_get_max() Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:52 ` [patch 085/181] mm: memcontrol: add the missing numa_stat interface for cgroup v2 Andrew Morton
                   ` (100 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, chris, guro, hannes, laoar.shao, linux-mm, longman, mhocko,
	mm-commits, shakeelb, tj, torvalds, vdavydov.dev

From: Waiman Long <longman@redhat.com>
Subject: mm/memcg: unify swap and memsw page counters

The swap page counter is v2 only, while memsw is v1 only.  As the v1 and
v2 controllers cannot be active at the same time, there is no point in
keeping both the swap and memsw page counters in mem_cgroup.  The
previous patch made sure that the memsw page counter is updated and
accessed only in v1 code paths, so it is now safe to alias the v1 memsw
page counter to the v2 swap page counter.  This saves 14 longs in the
size of mem_cgroup, a saving of 112 bytes on 64-bit architectures.

While at it, also document which page counters are used in v1 and/or v2.

Link: https://lkml.kernel.org/r/20200914024452.19167-4-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Chris Down <chris@chrisdown.name>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memcontrol.h |   13 ++++++++-----
 mm/memcontrol.c            |    3 ---
 2 files changed, 8 insertions(+), 8 deletions(-)

--- a/include/linux/memcontrol.h~mm-memcg-unify-swap-and-memsw-page-counters
+++ a/include/linux/memcontrol.h
@@ -215,13 +215,16 @@ struct mem_cgroup {
 	struct mem_cgroup_id id;
 
 	/* Accounted resources */
-	struct page_counter memory;
-	struct page_counter swap;
+	struct page_counter memory;		/* Both v1 & v2 */
+
+	union {
+		struct page_counter swap;	/* v2 only */
+		struct page_counter memsw;	/* v1 only */
+	};
 
 	/* Legacy consumer-oriented counters */
-	struct page_counter memsw;
-	struct page_counter kmem;
-	struct page_counter tcpmem;
+	struct page_counter kmem;		/* v1 only */
+	struct page_counter tcpmem;		/* v1 only */
 
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
--- a/mm/memcontrol.c~mm-memcg-unify-swap-and-memsw-page-counters
+++ a/mm/memcontrol.c
@@ -5295,13 +5295,11 @@ mem_cgroup_css_alloc(struct cgroup_subsy
 		memcg->use_hierarchy = true;
 		page_counter_init(&memcg->memory, &parent->memory);
 		page_counter_init(&memcg->swap, &parent->swap);
-		page_counter_init(&memcg->memsw, &parent->memsw);
 		page_counter_init(&memcg->kmem, &parent->kmem);
 		page_counter_init(&memcg->tcpmem, &parent->tcpmem);
 	} else {
 		page_counter_init(&memcg->memory, NULL);
 		page_counter_init(&memcg->swap, NULL);
-		page_counter_init(&memcg->memsw, NULL);
 		page_counter_init(&memcg->kmem, NULL);
 		page_counter_init(&memcg->tcpmem, NULL);
 		/*
@@ -5430,7 +5428,6 @@ static void mem_cgroup_css_reset(struct
 
 	page_counter_set_max(&memcg->memory, PAGE_COUNTER_MAX);
 	page_counter_set_max(&memcg->swap, PAGE_COUNTER_MAX);
-	page_counter_set_max(&memcg->memsw, PAGE_COUNTER_MAX);
 	page_counter_set_max(&memcg->kmem, PAGE_COUNTER_MAX);
 	page_counter_set_max(&memcg->tcpmem, PAGE_COUNTER_MAX);
 	page_counter_set_min(&memcg->memory, 0);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 085/181] mm: memcontrol: add the missing numa_stat interface for cgroup v2
  2020-10-13 23:46 incoming Andrew Morton
                   ` (83 preceding siblings ...)
  2020-10-13 23:52 ` [patch 084/181] mm/memcg: unify swap and memsw page counters Andrew Morton
@ 2020-10-13 23:52 ` Andrew Morton
  2020-10-13 23:53 ` [patch 086/181] mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge() Andrew Morton
                   ` (99 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:52 UTC (permalink / raw)
  To: akpm, corbet, guro, hannes, linux-mm, lizefan, mhocko,
	mm-commits, rdunlap, shakeelb, songmuchun, torvalds,
	vdavydov.dev

From: Muchun Song <songmuchun@bytedance.com>
Subject: mm: memcontrol: add the missing numa_stat interface for cgroup v2

In cgroup v1, we have a numa_stat interface.  This is useful for
providing visibility into the NUMA locality information within a memcg,
since pages are allowed to be allocated from any physical node.  One of
the use cases is evaluating application performance by combining this
information with the application's CPU allocation.  But cgroup v2 does
not have such an interface, so this patch adds the missing information.

Link: https://lkml.kernel.org/r/20200916100030.71698-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Zefan Li <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/cgroup-v2.rst |   69 ++++++--
 mm/memcontrol.c                         |  172 ++++++++++++++--------
 2 files changed, 160 insertions(+), 81 deletions(-)

--- a/Documentation/admin-guide/cgroup-v2.rst~mm-memcontrol-add-the-missing-numa_stat-interface-for-cgroup-v2
+++ a/Documentation/admin-guide/cgroup-v2.rst
@@ -1259,6 +1259,10 @@ PAGE_SIZE multiple when read back.
 	can show up in the middle. Don't rely on items remaining in a
 	fixed position; use the keys to look up specific values!
 
+	If an entry has no per-node counter (or is not shown in
+	memory.numa_stat), we use 'npn' (non-per-node) as the tag
+	to indicate that it will not show up in memory.numa_stat.
+
 	  anon
 		Amount of memory used in anonymous mappings such as
 		brk(), sbrk(), and mmap(MAP_ANONYMOUS)
@@ -1270,15 +1274,11 @@ PAGE_SIZE multiple when read back.
 	  kernel_stack
 		Amount of memory allocated to kernel stacks.
 
-	  slab
-		Amount of memory used for storing in-kernel data
-		structures.
-
-	  percpu
+	  percpu(npn)
 		Amount of memory used for storing per-cpu kernel
 		data structures.
 
-	  sock
+	  sock(npn)
 		Amount of memory used in network transmission buffers
 
 	  shmem
@@ -1318,11 +1318,9 @@ PAGE_SIZE multiple when read back.
 		Part of "slab" that cannot be reclaimed on memory
 		pressure.
 
-	  pgfault
-		Total number of page faults incurred
-
-	  pgmajfault
-		Number of major page faults incurred
+	  slab(npn)
+		Amount of memory used for storing in-kernel data
+		structures.
 
 	  workingset_refault_anon
 		Number of refaults of previously evicted anonymous pages.
@@ -1348,37 +1346,68 @@ PAGE_SIZE multiple when read back.
 	  workingset_nodereclaim
 		Number of times a shadow node has been reclaimed
 
-	  pgrefill
+	  pgfault(npn)
+		Total number of page faults incurred
+
+	  pgmajfault(npn)
+		Number of major page faults incurred
+
+	  pgrefill(npn)
 		Amount of scanned pages (in an active LRU list)
 
-	  pgscan
+	  pgscan(npn)
 		Amount of scanned pages (in an inactive LRU list)
 
-	  pgsteal
+	  pgsteal(npn)
 		Amount of reclaimed pages
 
-	  pgactivate
+	  pgactivate(npn)
 		Amount of pages moved to the active LRU list
 
-	  pgdeactivate
+	  pgdeactivate(npn)
 		Amount of pages moved to the inactive LRU list
 
-	  pglazyfree
+	  pglazyfree(npn)
 		Amount of pages postponed to be freed under memory pressure
 
-	  pglazyfreed
+	  pglazyfreed(npn)
 		Amount of reclaimed lazyfree pages
 
-	  thp_fault_alloc
+	  thp_fault_alloc(npn)
 		Number of transparent hugepages which were allocated to satisfy
 		a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
                 is not set.
 
-	  thp_collapse_alloc
+	  thp_collapse_alloc(npn)
 		Number of transparent hugepages which were allocated to allow
 		collapsing an existing range of pages. This counter is not
 		present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
 
+  memory.numa_stat
+	A read-only nested-keyed file which exists on non-root cgroups.
+
+	This breaks down the cgroup's memory footprint into different
+	types of memory, type-specific details, and other information
+	per node on the state of the memory management system.
+
+	This is useful for providing visibility into the NUMA locality
+	information within a memcg since the pages are allowed to be
+	allocated from any physical node. One of the use cases is evaluating
+	application performance by combining this information with the
+	application's CPU allocation.
+
+	All memory amounts are in bytes.
+
+	The output format of memory.numa_stat is::
+
+	  type N0=<bytes in node 0> N1=<bytes in node 1> ...
+
+	The entries are ordered to be human readable, and new entries
+	can show up in the middle. Don't rely on items remaining in a
+	fixed position; use the keys to look up specific values!
+
+	For the meaning of each entry, refer to memory.stat above.
+
   memory.swap.current
 	A read-only single value file which exists on non-root
 	cgroups.
--- a/mm/memcontrol.c~mm-memcontrol-add-the-missing-numa_stat-interface-for-cgroup-v2
+++ a/mm/memcontrol.c
@@ -1448,6 +1448,70 @@ static bool mem_cgroup_wait_acct_move(st
 	return false;
 }
 
+struct memory_stat {
+	const char *name;
+	unsigned int ratio;
+	unsigned int idx;
+};
+
+static struct memory_stat memory_stats[] = {
+	{ "anon", PAGE_SIZE, NR_ANON_MAPPED },
+	{ "file", PAGE_SIZE, NR_FILE_PAGES },
+	{ "kernel_stack", 1024, NR_KERNEL_STACK_KB },
+	{ "percpu", 1, MEMCG_PERCPU_B },
+	{ "sock", PAGE_SIZE, MEMCG_SOCK },
+	{ "shmem", PAGE_SIZE, NR_SHMEM },
+	{ "file_mapped", PAGE_SIZE, NR_FILE_MAPPED },
+	{ "file_dirty", PAGE_SIZE, NR_FILE_DIRTY },
+	{ "file_writeback", PAGE_SIZE, NR_WRITEBACK },
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	/*
+	 * The ratio will be initialized in memory_stats_init(), because
+	 * on some architectures the HPAGE_PMD_SIZE macro is not a
+	 * constant (e.g. powerpc).
+	 */
+	{ "anon_thp", 0, NR_ANON_THPS },
+#endif
+	{ "inactive_anon", PAGE_SIZE, NR_INACTIVE_ANON },
+	{ "active_anon", PAGE_SIZE, NR_ACTIVE_ANON },
+	{ "inactive_file", PAGE_SIZE, NR_INACTIVE_FILE },
+	{ "active_file", PAGE_SIZE, NR_ACTIVE_FILE },
+	{ "unevictable", PAGE_SIZE, NR_UNEVICTABLE },
+
+	/*
+	 * Note: The slab_reclaimable and slab_unreclaimable must be
+	 * together and slab_reclaimable must be in front.
+	 */
+	{ "slab_reclaimable", 1, NR_SLAB_RECLAIMABLE_B },
+	{ "slab_unreclaimable", 1, NR_SLAB_UNRECLAIMABLE_B },
+
+	/* The memory events */
+	{ "workingset_refault_anon", 1, WORKINGSET_REFAULT_ANON },
+	{ "workingset_refault_file", 1, WORKINGSET_REFAULT_FILE },
+	{ "workingset_activate_anon", 1, WORKINGSET_ACTIVATE_ANON },
+	{ "workingset_activate_file", 1, WORKINGSET_ACTIVATE_FILE },
+	{ "workingset_restore_anon", 1, WORKINGSET_RESTORE_ANON },
+	{ "workingset_restore_file", 1, WORKINGSET_RESTORE_FILE },
+	{ "workingset_nodereclaim", 1, WORKINGSET_NODERECLAIM },
+};
+
+static int __init memory_stats_init(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+		if (memory_stats[i].idx == NR_ANON_THPS)
+			memory_stats[i].ratio = HPAGE_PMD_SIZE;
+#endif
+		VM_BUG_ON(!memory_stats[i].ratio);
+		VM_BUG_ON(memory_stats[i].idx >= MEMCG_NR_STAT);
+	}
+
+	return 0;
+}
+pure_initcall(memory_stats_init);
+
 static char *memory_stat_format(struct mem_cgroup *memcg)
 {
 	struct seq_buf s;
@@ -1468,52 +1532,19 @@ static char *memory_stat_format(struct m
 	 * Current memory state:
 	 */
 
-	seq_buf_printf(&s, "anon %llu\n",
-		       (u64)memcg_page_state(memcg, NR_ANON_MAPPED) *
-		       PAGE_SIZE);
-	seq_buf_printf(&s, "file %llu\n",
-		       (u64)memcg_page_state(memcg, NR_FILE_PAGES) *
-		       PAGE_SIZE);
-	seq_buf_printf(&s, "kernel_stack %llu\n",
-		       (u64)memcg_page_state(memcg, NR_KERNEL_STACK_KB) *
-		       1024);
-	seq_buf_printf(&s, "slab %llu\n",
-		       (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
-			     memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
-	seq_buf_printf(&s, "percpu %llu\n",
-		       (u64)memcg_page_state(memcg, MEMCG_PERCPU_B));
-	seq_buf_printf(&s, "sock %llu\n",
-		       (u64)memcg_page_state(memcg, MEMCG_SOCK) *
-		       PAGE_SIZE);
-
-	seq_buf_printf(&s, "shmem %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SHMEM) *
-		       PAGE_SIZE);
-	seq_buf_printf(&s, "file_mapped %llu\n",
-		       (u64)memcg_page_state(memcg, NR_FILE_MAPPED) *
-		       PAGE_SIZE);
-	seq_buf_printf(&s, "file_dirty %llu\n",
-		       (u64)memcg_page_state(memcg, NR_FILE_DIRTY) *
-		       PAGE_SIZE);
-	seq_buf_printf(&s, "file_writeback %llu\n",
-		       (u64)memcg_page_state(memcg, NR_WRITEBACK) *
-		       PAGE_SIZE);
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	seq_buf_printf(&s, "anon_thp %llu\n",
-		       (u64)memcg_page_state(memcg, NR_ANON_THPS) *
-		       HPAGE_PMD_SIZE);
-#endif
+	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
+		u64 size;
 
-	for (i = 0; i < NR_LRU_LISTS; i++)
-		seq_buf_printf(&s, "%s %llu\n", lru_list_name(i),
-			       (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
-			       PAGE_SIZE);
-
-	seq_buf_printf(&s, "slab_reclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
-	seq_buf_printf(&s, "slab_unreclaimable %llu\n",
-		       (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
+		size = memcg_page_state(memcg, memory_stats[i].idx);
+		size *= memory_stats[i].ratio;
+		seq_buf_printf(&s, "%s %llu\n", memory_stats[i].name, size);
+
+		if (unlikely(memory_stats[i].idx == NR_SLAB_UNRECLAIMABLE_B)) {
+			size = memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
+			       memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B);
+			seq_buf_printf(&s, "slab %llu\n", size);
+		}
+	}
 
 	/* Accumulated memory events */
 
@@ -1521,22 +1552,6 @@ static char *memory_stat_format(struct m
 		       memcg_events(memcg, PGFAULT));
 	seq_buf_printf(&s, "%s %lu\n", vm_event_name(PGMAJFAULT),
 		       memcg_events(memcg, PGMAJFAULT));
-
-	seq_buf_printf(&s, "workingset_refault_anon %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_REFAULT_ANON));
-	seq_buf_printf(&s, "workingset_refault_file %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_REFAULT_FILE));
-	seq_buf_printf(&s, "workingset_activate_anon %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_ANON));
-	seq_buf_printf(&s, "workingset_activate_file %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_ACTIVATE_FILE));
-	seq_buf_printf(&s, "workingset_restore_anon %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_RESTORE_ANON));
-	seq_buf_printf(&s, "workingset_restore_file %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_RESTORE_FILE));
-	seq_buf_printf(&s, "workingset_nodereclaim %lu\n",
-		       memcg_page_state(memcg, WORKINGSET_NODERECLAIM));
-
 	seq_buf_printf(&s, "%s %lu\n",  vm_event_name(PGREFILL),
 		       memcg_events(memcg, PGREFILL));
 	seq_buf_printf(&s, "pgscan %lu\n",
@@ -6374,6 +6389,35 @@ static int memory_stat_show(struct seq_f
 	return 0;
 }
 
+#ifdef CONFIG_NUMA
+static int memory_numa_stat_show(struct seq_file *m, void *v)
+{
+	int i;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
+		int nid;
+
+		if (memory_stats[i].idx >= NR_VM_NODE_STAT_ITEMS)
+			continue;
+
+		seq_printf(m, "%s", memory_stats[i].name);
+		for_each_node_state(nid, N_MEMORY) {
+			u64 size;
+			struct lruvec *lruvec;
+
+			lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+			size = lruvec_page_state(lruvec, memory_stats[i].idx);
+			size *= memory_stats[i].ratio;
+			seq_printf(m, " N%d=%llu", nid, size);
+		}
+		seq_putc(m, '\n');
+	}
+
+	return 0;
+}
+#endif
+
 static int memory_oom_group_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
@@ -6451,6 +6495,12 @@ static struct cftype memory_files[] = {
 		.name = "stat",
 		.seq_show = memory_stat_show,
 	},
+#ifdef CONFIG_NUMA
+	{
+		.name = "numa_stat",
+		.seq_show = memory_numa_stat_show,
+	},
+#endif
 	{
 		.name = "oom.group",
 		.flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 086/181] mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (84 preceding siblings ...)
  2020-10-13 23:52 ` [patch 085/181] mm: memcontrol: add the missing numa_stat interface for cgroup v2 Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 087/181] mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom() Andrew Morton
                   ` (98 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, guro, hannes, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge()

Since commit bbec2e15170a ("mm: rename page_counter's count/limit into
usage/max"), page_counter_limit() has been renamed to
page_counter_set_max().  So replace page_counter_limit with
page_counter_set_max in the comment.

Link: https://lkml.kernel.org/r/20200917113629.14382-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_counter.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/page_counter.c~mm-page_counter-correct-the-obsolete-func-name-in-the-comment-of-page_counter_try_charge
+++ a/mm/page_counter.c
@@ -109,7 +109,7 @@ bool page_counter_try_charge(struct page
 		 *
 		 * The atomic_long_add_return() implies a full memory
 		 * barrier between incrementing the count and reading
-		 * the limit.  When racing with page_counter_limit(),
+		 * the limit.  When racing with page_counter_set_max(),
 		 * we either see the new limit or the setter sees the
 		 * counter has changed and retries.
 		 */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 087/181] mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (85 preceding siblings ...)
  2020-10-13 23:53 ` [patch 086/181] mm/page_counter: correct the obsolete func name in the comment of page_counter_try_charge() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 088/181] mm: memcg/slab: uncharge during kmem_cache_free_bulk() Andrew Morton
                   ` (97 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, hannes, linmiaohe, linux-mm, mhocko, mm-commits, torvalds,
	vdavydov.dev

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom()

Commit 79dfdaccd1d5 ("memcg: make oom_lock 0 and 1 based rather than
counter") added mem_cgroup_unmark_under_oom() and moved the comment of
mem_cgroup_oom_unlock() there.  But the comment makes no sense in its
new location, because mem_cgroup_oom_lock() does not operate on the
under_oom field.  So reword the comment to describe what the code
actually guards against.  [Thanks Michal Hocko for rewording this
comment.]

Link: https://lkml.kernel.org/r/20200930095336.21323-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/memcontrol.c~mm-memcontrol-reword-obsolete-comment-of-mem_cgroup_unmark_under_oom
+++ a/mm/memcontrol.c
@@ -1826,8 +1826,8 @@ static void mem_cgroup_unmark_under_oom(
 	struct mem_cgroup *iter;
 
 	/*
-	 * When a new child is created while the hierarchy is under oom,
-	 * mem_cgroup_oom_lock() may not be called. Watch for underflow.
+	 * Be careful about under_oom underflows because a child memcg
+	 * could have been added after mem_cgroup_mark_under_oom.
 	 */
 	spin_lock(&memcg_oom_lock);
 	for_each_mem_cgroup_tree(iter, memcg)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 088/181] mm: memcg/slab: uncharge during kmem_cache_free_bulk()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (86 preceding siblings ...)
  2020-10-13 23:53 ` [patch 087/181] mm: memcontrol: reword obsolete comment of mem_cgroup_unmark_under_oom() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 089/181] mm/memcg: fix device private memcg accounting Andrew Morton
                   ` (96 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, bharata, cl, guro, hannes, iamjoonsoo.kim, linux-mm,
	mhocko, mm-commits, rientjes, shakeelb, stable, tj, torvalds,
	vbabka

From: Bharata B Rao <bharata@linux.ibm.com>
Subject: mm: memcg/slab: uncharge during kmem_cache_free_bulk()

Object cgroup charging is done for all the objects during allocation, but
during freeing, uncharging ends up happening for only one object in the
case of bulk allocation/freeing.

Fix this by having a separate call to uncharge all the objects from
kmem_cache_free_bulk() and by modifying memcg_slab_free_hook() to take
care of bulk uncharging.

Link: https://lkml.kernel.org/r/20201009060423.390479-1-bharata@linux.ibm.com
Fixes: 964d4bd370d5 ("mm: memcg/slab: save obj_cgroup for non-root slab objects")
Signed-off-by: Bharata B Rao <bharata@linux.ibm.com>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/slab.c |    2 +-
 mm/slab.h |   50 +++++++++++++++++++++++++++++++-------------------
 mm/slub.c |    3 ++-
 3 files changed, 34 insertions(+), 21 deletions(-)

--- a/mm/slab.c~mm-memcg-slab-uncharge-during-kmem_cache_free_bulk
+++ a/mm/slab.c
@@ -3438,7 +3438,7 @@ void ___cache_free(struct kmem_cache *ca
 		memset(objp, 0, cachep->object_size);
 	kmemleak_free_recursive(objp, cachep->flags);
 	objp = cache_free_debugcheck(cachep, objp, caller);
-	memcg_slab_free_hook(cachep, virt_to_head_page(objp), objp);
+	memcg_slab_free_hook(cachep, &objp, 1);
 
 	/*
 	 * Skip calling cache_free_alien() when the platform is not numa.
--- a/mm/slab.h~mm-memcg-slab-uncharge-during-kmem_cache_free_bulk
+++ a/mm/slab.h
@@ -345,30 +345,42 @@ static inline void memcg_slab_post_alloc
 	obj_cgroup_put(objcg);
 }
 
-static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
-					void *p)
+static inline void memcg_slab_free_hook(struct kmem_cache *s_orig,
+					void **p, int objects)
 {
+	struct kmem_cache *s;
 	struct obj_cgroup *objcg;
+	struct page *page;
 	unsigned int off;
+	int i;
 
 	if (!memcg_kmem_enabled())
 		return;
 
-	if (!page_has_obj_cgroups(page))
-		return;
-
-	off = obj_to_index(s, page, p);
-	objcg = page_obj_cgroups(page)[off];
-	page_obj_cgroups(page)[off] = NULL;
-
-	if (!objcg)
-		return;
-
-	obj_cgroup_uncharge(objcg, obj_full_size(s));
-	mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
-			-obj_full_size(s));
-
-	obj_cgroup_put(objcg);
+	for (i = 0; i < objects; i++) {
+		if (unlikely(!p[i]))
+			continue;
+
+		page = virt_to_head_page(p[i]);
+		if (!page_has_obj_cgroups(page))
+			continue;
+
+		if (!s_orig)
+			s = page->slab_cache;
+		else
+			s = s_orig;
+
+		off = obj_to_index(s, page, p[i]);
+		objcg = page_obj_cgroups(page)[off];
+		if (!objcg)
+			continue;
+
+		page_obj_cgroups(page)[off] = NULL;
+		obj_cgroup_uncharge(objcg, obj_full_size(s));
+		mod_objcg_state(objcg, page_pgdat(page), cache_vmstat_idx(s),
+				-obj_full_size(s));
+		obj_cgroup_put(objcg);
+	}
 }
 
 #else /* CONFIG_MEMCG_KMEM */
@@ -406,8 +418,8 @@ static inline void memcg_slab_post_alloc
 {
 }
 
-static inline void memcg_slab_free_hook(struct kmem_cache *s, struct page *page,
-					void *p)
+static inline void memcg_slab_free_hook(struct kmem_cache *s,
+					void **p, int objects)
 {
 }
 #endif /* CONFIG_MEMCG_KMEM */
--- a/mm/slub.c~mm-memcg-slab-uncharge-during-kmem_cache_free_bulk
+++ a/mm/slub.c
@@ -3095,7 +3095,7 @@ static __always_inline void do_slab_free
 	struct kmem_cache_cpu *c;
 	unsigned long tid;
 
-	memcg_slab_free_hook(s, page, head);
+	memcg_slab_free_hook(s, &head, 1);
 redo:
 	/*
 	 * Determine the currently cpus per cpu slab.
@@ -3257,6 +3257,7 @@ void kmem_cache_free_bulk(struct kmem_ca
 	if (WARN_ON(!size))
 		return;
 
+	memcg_slab_free_hook(s, p, size);
 	do {
 		struct detached_freelist df;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 089/181] mm/memcg: fix device private memcg accounting
  2020-10-13 23:46 incoming Andrew Morton
                   ` (87 preceding siblings ...)
  2020-10-13 23:53 ` [patch 088/181] mm: memcg/slab: uncharge during kmem_cache_free_bulk() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 090/181] selftests/vm: fix false build success on the second and later attempts Andrew Morton
                   ` (95 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, bsingharora, hannes, ira.weiny, jglisse, linux-mm, mhocko,
	mm-commits, rcampbell, torvalds, vdavydov.dev

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm/memcg: fix device private memcg accounting

The code in mc_handle_swap_pte() checks for non_swap_entry() and returns
NULL before checking is_device_private_entry().  Since device private
entries are themselves non-swap entries, device private pages are never
handled.  Fix this by checking for non_swap_entry() after handling
device private swap PTEs.

I assume the memory cgroup accounting would be off somehow when moving
a process to another memory cgroup.  Currently, the device private page
is charged like a normal anonymous page when allocated and is uncharged
when the page is freed, so I think that path is OK.

Link: https://lkml.kernel.org/r/20201009215952.2726-1-rcampbell@nvidia.com
Fixes: c733a82874a7 ("mm/memcontrol: support MEMORY_DEVICE_PRIVATE")
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memcontrol.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/mm/memcontrol.c~mm-memcg-fix-device-private-memcg-accounting
+++ a/mm/memcontrol.c
@@ -5516,7 +5516,7 @@ static struct page *mc_handle_swap_pte(s
 	struct page *page = NULL;
 	swp_entry_t ent = pte_to_swp_entry(ptent);
 
-	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
+	if (!(mc.flags & MOVE_ANON))
 		return NULL;
 
 	/*
@@ -5535,6 +5535,9 @@ static struct page *mc_handle_swap_pte(s
 		return page;
 	}
 
+	if (non_swap_entry(ent))
+		return NULL;
+
 	/*
 	 * Because lookup_swap_cache() updates some statistics counter,
 	 * we call find_get_page() with swapper_space directly.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 090/181] selftests/vm: fix false build success on the second and later attempts
  2020-10-13 23:46 incoming Andrew Morton
                   ` (88 preceding siblings ...)
  2020-10-13 23:53 ` [patch 089/181] mm/memcg: fix device private memcg accounting Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 091/181] selftests/vm: fix incorrect gcc invocation in some cases Andrew Morton
                   ` (94 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, jgg, jhubbard, linux-mm, mm-commits, shuah, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: selftests/vm: fix false build success on the second and later attempts

Patch series "selftests/vm: fix some minor aggravating factors in the Makefile".

This fixes a couple of minor aggravating factors that I ran across while
trying to do some changes in selftests/vm.  These are simple things, but
like most things with GNU Make, it's rarely obvious what's wrong until you
understand *the entire Makefile and all of its includes*.

So while there is, of course, joy in learning those details, I thought I'd
fix these little things, so as to allow others to skip out on the Joy if
they so choose.  :)

First of all, if you have an item (let's choose userfaultfd as an
example) that fails to build, you might do this:

$ make -j32

    # ...you observe a failed item in the threaded output

# OK, let's get a closer look

$ make
    # ...but now the build quietly "succeeds".

That's what Patch 0001 fixes.

Second, if you instead attempt this approach for your closer look (a casual
mistake, as it's not supported):

$ make userfaultfd

    # ...userfaultfd fails to link, due to incomplete LDLIBS

That's what Patch 0002 fixes.


This patch (of 2):

If one or more of these selftests fail to build, then after the first
failure, subsequent invocations of "make" will make it appear that there
are no build failures, after all.

That's because the failed build products remain, with up-to-date
timestamps, thus tricking Make (and you!) into believing that there's
nothing else to build.

Fix this by telling Make to delete targets that didn't completely
succeed.

Link: https://lkml.kernel.org/r/20200915012901.1655280-1-jhubbard@nvidia.com
Link: https://lkml.kernel.org/r/20200915012901.1655280-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/Makefile |    5 +++++
 1 file changed, 5 insertions(+)

--- a/tools/testing/selftests/vm/Makefile~selftests-vm-fix-false-build-success-on-the-second-and-later-attempts
+++ a/tools/testing/selftests/vm/Makefile
@@ -3,6 +3,11 @@
 uname_M := $(shell uname -m 2>/dev/null || echo not)
 MACHINE ?= $(shell echo $(uname_M) | sed -e 's/aarch64.*/arm64/')
 
+# Without this, failed build products remain, with up-to-date timestamps,
+# thus tricking Make (and you!) into believing that All Is Well, in subsequent
+# make invocations:
+.DELETE_ON_ERROR:
+
 CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
 TEST_GEN_FILES = compaction_test
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 091/181] selftests/vm: fix incorrect gcc invocation in some cases
  2020-10-13 23:46 incoming Andrew Morton
                   ` (89 preceding siblings ...)
  2020-10-13 23:53 ` [patch 090/181] selftests/vm: fix false build success on the second and later attempts Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 092/181] mm: account PMD tables like PTE tables Andrew Morton
                   ` (93 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, jgg, jhubbard, linux-mm, mm-commits, shuah, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: selftests/vm: fix incorrect gcc invocation in some cases

Avoid accidental wrong builds, due to built-in rules working just a little
bit too well--but not quite as well as required for our situation here.

In other words, "make userfaultfd" (for example) is supposed to fail to
build at all, because this Makefile only supports either "make" (all), or
"make /full/path".  However, the built-in rules, if not suppressed, will
pick up CFLAGS and the initial LDLIBS (but not the target-specific LDLIBS,
because those are only set for the full path target!).  This causes it to
get pretty far into building things despite using incorrect values such as
an *occasionally* incomplete LDLIBS value.

Link: https://lkml.kernel.org/r/20200915012901.1655280-3-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/Makefile |   12 ++++++++++++
 1 file changed, 12 insertions(+)

--- a/tools/testing/selftests/vm/Makefile~selftests-vm-fix-incorrect-gcc-invocation-in-some-cases
+++ a/tools/testing/selftests/vm/Makefile
@@ -8,6 +8,18 @@ MACHINE ?= $(shell echo $(uname_M) | sed
 # make invocations:
 .DELETE_ON_ERROR:
 
+# Avoid accidental wrong builds, due to built-in rules working just a little
+# bit too well--but not quite as well as required for our situation here.
+#
+# In other words, "make userfaultfd" is supposed to fail to build at all,
+# because this Makefile only supports either "make" (all), or "make /full/path".
+# However,  the built-in rules, if not suppressed, will pick up CFLAGS and the
+# initial LDLIBS (but not the target-specific LDLIBS, because those are only
+# set for the full path target!). This causes it to get pretty far into building
+# things despite using incorrect values such as an *occasionally* incomplete
+# LDLIBS.
+MAKEFLAGS += --no-builtin-rules
+
 CFLAGS = -Wall -I ../../../../usr/include $(EXTRA_CFLAGS)
 LDLIBS = -lrt
 TEST_GEN_FILES = compaction_test
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 092/181] mm: account PMD tables like PTE tables
  2020-10-13 23:46 incoming Andrew Morton
                   ` (90 preceding siblings ...)
  2020-10-13 23:53 ` [patch 091/181] selftests/vm: fix incorrect gcc invocation in some cases Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 093/181] mm/memory.c: fix typo in __do_fault() comment Andrew Morton
                   ` (92 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: abdhalee, akpm, anders.roxell, arnd, christophe.leroy, jcmvbkbc,
	joro, linux-mm, luto, mm-commits, naresh.kamboju, peterz, rppt,
	sathnaga, shorne, torvalds, willy

From: Matthew Wilcox <willy@infradead.org>
Subject: mm: account PMD tables like PTE tables

We account the PTE level of the page tables to the process in order to
make smarter OOM decisions and help diagnose why memory is fragmented. 
For these same reasons, we should account pages allocated for PMDs.  With
larger process address spaces and ASLR, the number of PMDs in use is
higher than it used to be so the inaccuracy is starting to matter.
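
For context, pgtable_pmd_page_ctor()/pgtable_pmd_page_dtor() are called
from the arch page-table allocators.  A minimal sketch of such a caller,
loosely modeled on the asm-generic pmd_alloc_one() (illustrative, not the
exact code of any one architecture):

static pmd_t *pmd_alloc_one_sketch(struct mm_struct *mm, unsigned long addr)
{
	struct page *page = alloc_pages(GFP_PGTABLE_USER, 0);

	if (!page)
		return NULL;
	if (!pgtable_pmd_page_ctor(page)) {
		/* the ctor now also sets PageTable and bumps NR_PAGETABLE */
		__free_pages(page, 0);
		return NULL;
	}
	return (pmd_t *)page_address(page);
}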

[rppt@linux.ibm.com: arm: __pmd_free_tlb(): call page table destructor]
  Link: https://lkml.kernel.org/r/20200825111303.GB69694@linux.ibm.com
Link: http://lkml.kernel.org/r/20200627184642.GF25039@casper.infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Cc: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/include/asm/tlb.h |    1 +
 include/linux/mm.h         |   24 ++++++++++++++++++++----
 2 files changed, 21 insertions(+), 4 deletions(-)

--- a/arch/arm/include/asm/tlb.h~mm-account-pmd-tables-like-pte-tables
+++ a/arch/arm/include/asm/tlb.h
@@ -59,6 +59,7 @@ __pmd_free_tlb(struct mmu_gather *tlb, p
 #ifdef CONFIG_ARM_LPAE
 	struct page *page = virt_to_page(pmdp);
 
+	pgtable_pmd_page_dtor(page);
 	tlb_remove_table(tlb, page);
 #endif
 }
--- a/include/linux/mm.h~mm-account-pmd-tables-like-pte-tables
+++ a/include/linux/mm.h
@@ -2254,7 +2254,7 @@ static inline spinlock_t *pmd_lockptr(st
 	return ptlock_ptr(pmd_to_page(pmd));
 }
 
-static inline bool pgtable_pmd_page_ctor(struct page *page)
+static inline bool pmd_ptlock_init(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	page->pmd_huge_pte = NULL;
@@ -2262,7 +2262,7 @@ static inline bool pgtable_pmd_page_ctor
 	return ptlock_init(page);
 }
 
-static inline void pgtable_pmd_page_dtor(struct page *page)
+static inline void pmd_ptlock_free(struct page *page)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
@@ -2279,8 +2279,8 @@ static inline spinlock_t *pmd_lockptr(st
 	return &mm->page_table_lock;
 }
 
-static inline bool pgtable_pmd_page_ctor(struct page *page) { return true; }
-static inline void pgtable_pmd_page_dtor(struct page *page) {}
+static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline void pmd_ptlock_free(struct page *page) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
 
@@ -2293,6 +2293,22 @@ static inline spinlock_t *pmd_lock(struc
 	return ptl;
 }
 
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+	if (!pmd_ptlock_init(page))
+		return false;
+	__SetPageTable(page);
+	inc_zone_page_state(page, NR_PAGETABLE);
+	return true;
+}
+
+static inline void pgtable_pmd_page_dtor(struct page *page)
+{
+	pmd_ptlock_free(page);
+	__ClearPageTable(page);
+	dec_zone_page_state(page, NR_PAGETABLE);
+}
+
 /*
  * No scalability reason to split PUD locks yet, but follow the same pattern
  * as the PMD locks to make it easier if we decide to.  The VM should not be
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 093/181] mm/memory.c: fix typo in __do_fault() comment
  2020-10-13 23:46 incoming Andrew Morton
                   ` (91 preceding siblings ...)
  2020-10-13 23:53 ` [patch 092/181] mm: account PMD tables like PTE tables Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 094/181] mm/memory.c: replace vmf->vma with variable vma Andrew Morton
                   ` (91 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, david, linux-mm, mm-commits, torvalds, yanfei.xu

From: Yanfei Xu <yanfei.xu@windriver.com>
Subject: mm/memory.c: fix typo in __do_fault() comment

It's "pte_alloc_one", not "pte_alloc_pne". Let's fix that.

Link: http://lkml.kernel.org/r/20200818104339.5310-1-yanfei.xu@windriver.com
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory.c~mm-memory-fix-typo-in-__do_fault-comment
+++ a/mm/memory.c
@@ -3589,7 +3589,7 @@ static vm_fault_t __do_fault(struct vm_f
 	 *				unlock_page(A)
 	 * lock_page(B)
 	 *				lock_page(B)
-	 * pte_alloc_pne
+	 * pte_alloc_one
 	 *   shrink_page_list
 	 *     wait_on_page_writeback(A)
 	 *				SetPageWriteback(B)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 094/181] mm/memory.c: replace vmf->vma with variable vma
  2020-10-13 23:46 incoming Andrew Morton
                   ` (92 preceding siblings ...)
  2020-10-13 23:53 ` [patch 093/181] mm/memory.c: fix typo in __do_fault() comment Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 095/181] mm/mmap: rename __vma_unlink_common() to __vma_unlink() Andrew Morton
                   ` (90 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, willy, yanfei.xu

From: Yanfei Xu <yanfei.xu@windriver.com>
Subject: mm/memory.c: replace vmf->vma with variable vma

The function has already declared a local variable vma which is assigned
the value of vmf->vma.  Thus, use the variable vma directly here.

Link: http://lkml.kernel.org/r/20200818084607.37616-1-yanfei.xu@windriver.com
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory.c~mm-memoryc-replace-vmf-vma-with-variable-vma
+++ a/mm/memory.c
@@ -3597,7 +3597,7 @@ static vm_fault_t __do_fault(struct vm_f
 	 *				# flush A, B to clear the writeback
 	 */
 	if (pmd_none(*vmf->pmd) && !vmf->prealloc_pte) {
-		vmf->prealloc_pte = pte_alloc_one(vmf->vma->vm_mm);
+		vmf->prealloc_pte = pte_alloc_one(vma->vm_mm);
 		if (!vmf->prealloc_pte)
 			return VM_FAULT_OOM;
 		smp_wmb(); /* See comment in __pte_alloc() */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 095/181] mm/mmap: rename __vma_unlink_common() to __vma_unlink()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (93 preceding siblings ...)
  2020-10-13 23:53 ` [patch 094/181] mm/memory.c: replace vmf->vma with variable vma Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 096/181] mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase() Andrew Morton
                   ` (89 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mmap: rename __vma_unlink_common() to __vma_unlink()

__vma_unlink_common() and __vma_unlink() would be counterparts, but there
is in fact no function named __vma_unlink().  Let's rename
__vma_unlink_common() to __vma_unlink() to make the code more
self-explanatory and easier for the audience to understand.

Otherwise, the current name suggests that there are several variants of
vma_unlink() and that __vma_unlink_common() is shared by them.

Link: https://lkml.kernel.org/r/20200809232057.23477-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/mmap.c~mm-mmap-rename-__vma_unlink_common-to-__vma_unlink
+++ a/mm/mmap.c
@@ -677,7 +677,7 @@ static void __insert_vm_struct(struct mm
 	mm->map_count++;
 }
 
-static __always_inline void __vma_unlink_common(struct mm_struct *mm,
+static __always_inline void __vma_unlink(struct mm_struct *mm,
 						struct vm_area_struct *vma,
 						struct vm_area_struct *ignore)
 {
@@ -859,7 +859,7 @@ again:
 		 * us to remove next before dropping the locks.
 		 */
 		if (remove_next != 3)
-			__vma_unlink_common(mm, next, next);
+			__vma_unlink(mm, next, next);
 		else
 			/*
 			 * vma is not before next if they've been
@@ -870,7 +870,7 @@ again:
 			 * "next" (which is stored in post-swap()
 			 * "vma").
 			 */
-			__vma_unlink_common(mm, next, vma);
+			__vma_unlink(mm, next, vma);
 		if (file)
 			__remove_shared_vm_struct(next, file, mapping);
 	} else if (insert) {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 096/181] mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (94 preceding siblings ...)
  2020-10-13 23:53 ` [patch 095/181] mm/mmap: rename __vma_unlink_common() to __vma_unlink() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 097/181] mmap locking API: add mmap_lock_is_contended() Andrew Morton
                   ` (88 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase()

These two functions share the same logic, except that each ignores a
different vma.

Let's reuse the code.

Link: https://lkml.kernel.org/r/20200809232057.23477-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |   16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

--- a/mm/mmap.c~mm-mmap-leverage-vma_rb_erase_ignore-to-implement-vma_rb_erase
+++ a/mm/mmap.c
@@ -474,8 +474,12 @@ static __always_inline void vma_rb_erase
 {
 	/*
 	 * All rb_subtree_gap values must be consistent prior to erase,
-	 * with the possible exception of the "next" vma being erased if
-	 * next->vm_start was reduced.
+	 * with the possible exception of
+	 *
+	 * a. the "next" vma being erased if next->vm_start was reduced in
+	 *    __vma_adjust() -> __vma_unlink()
+	 * b. the vma being erased in detach_vmas_to_be_unmapped() ->
+	 *    vma_rb_erase()
 	 */
 	validate_mm_rb(root, ignore);
 
@@ -485,13 +489,7 @@ static __always_inline void vma_rb_erase
 static __always_inline void vma_rb_erase(struct vm_area_struct *vma,
 					 struct rb_root *root)
 {
-	/*
-	 * All rb_subtree_gap values must be consistent prior to erase,
-	 * with the possible exception of the vma being erased.
-	 */
-	validate_mm_rb(root, vma);
-
-	__vma_rb_erase(vma, root);
+	vma_rb_erase_ignore(vma, root, vma);
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 097/181] mmap locking API: add mmap_lock_is_contended()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (95 preceding siblings ...)
  2020-10-13 23:53 ` [patch 096/181] mm/mmap: leverage vma_rb_erase_ignore() to implement vma_rb_erase() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 098/181] mm: smaps*: extend smap_gather_stats to support specified beginning Andrew Morton
                   ` (87 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: adobriyan, akpm, chinwen.chang, daniel.kiss, daniel.m.jordan,
	dbueso, jgg, jimmyassarsson, ldufour, linux-mm, matthias.bgg,
	mm-commits, songliubraving, steven.price, torvalds, vbabka,
	walken, willy, ying.huang

From: Chinwen Chang <chinwen.chang@mediatek.com>
Subject: mmap locking API: add mmap_lock_is_contended()

Patch series "Try to release mmap_lock temporarily in smaps_rollup", v4.

Recently, we have observed some janky issues caused by unpleasantly long
contention on mmap_lock which is held by smaps_rollup when probing large
processes.  To address the problem, we let smaps_rollup detect if anyone
wants to acquire mmap_lock for write attempts.  If yes, just release the
lock temporarily to ease the contention.

smaps_rollup is a procfs interface which allows users to summarize the
process's memory usage without the overhead of seq_* calls.  Android uses
it to sample the memory usage of various processes to balance its memory
pool sizes.  If no one wants to take the lock for write requests,
smaps_rollup with this patch will behave like the original one.

Although there are ongoing mmap_lock optimizations like range-based
locks, the lock applied to smaps_rollup would still be the coarse one,
which cannot avoid the aforementioned issues.  So detecting write
attempts on mmap_lock and temporarily releasing the lock in smaps_rollup
is still necessary.


This patch (of 3):

Add a new API to query if someone wants to acquire mmap_lock for write
attempts.

Using this instead of rwsem_is_contended makes it more tolerant of future
changes to the lock type.
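
A minimal usage sketch, mirroring how the later patches in this series
use the API (the revalidation step depends on the caller):

	if (mmap_lock_is_contended(mm)) {
		mmap_read_unlock(mm);	/* let the writer in */
		if (mmap_read_lock_killable(mm))
			return -EINTR;
		/* revalidate any state derived under the old read lock */
	}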

Link: http://lkml.kernel.org/r/1597715898-3854-1-git-send-email-chinwen.chang@mediatek.com
Link: http://lkml.kernel.org/r/1597715898-3854-2-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Acked-by: Michel Lespinasse <walken@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jimmy Assarsson <jimmyassarsson@gmail.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmap_lock.h |    5 +++++
 1 file changed, 5 insertions(+)

--- a/include/linux/mmap_lock.h~mmap-locking-api-add-mmap_lock_is_contended
+++ a/include/linux/mmap_lock.h
@@ -87,4 +87,9 @@ static inline void mmap_assert_write_loc
 	VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm);
 }
 
+static inline int mmap_lock_is_contended(struct mm_struct *mm)
+{
+	return rwsem_is_contended(&mm->mmap_lock);
+}
+
 #endif /* _LINUX_MMAP_LOCK_H */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 098/181] mm: smaps*: extend smap_gather_stats to support specified beginning
  2020-10-13 23:46 incoming Andrew Morton
                   ` (96 preceding siblings ...)
  2020-10-13 23:53 ` [patch 097/181] mmap locking API: add mmap_lock_is_contended() Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 099/181] mm: proc: smaps_rollup: do not stall write attempts on mmap_lock Andrew Morton
                   ` (86 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: adobriyan, akpm, chinwen.chang, daniel.kiss, daniel.m.jordan,
	dbueso, jgg, jimmyassarsson, ldufour, linux-mm, matthias.bgg,
	mm-commits, songliubraving, steven.price, torvalds, vbabka,
	walken, willy, ying.huang

From: Chinwen Chang <chinwen.chang@mediatek.com>
Subject: mm: smaps*: extend smap_gather_stats to support specified beginning

Extend smap_gather_stats to support a specified beginning address at
which it should start gathering.  To achieve this, add a new parameter
@start, assigned by the caller, and refactor the function for simplicity.

If @start is 0, the whole range of @vma is used for gathering.
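
The two calling modes then look like this (the second is how a later
patch in this series resumes after dropping mmap_lock; last_vma_end is
that caller's bookmark):

	/* gather the whole vma, as before: */
	smap_gather_stats(vma, &mss, 0);

	/* resume gathering from a prior stopping point inside vma: */
	smap_gather_stats(vma, &mss, last_vma_end);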

Link: http://lkml.kernel.org/r/1597715898-3854-3-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Reviewed-by: Steven Price <steven.price@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jimmy Assarsson <jimmyassarsson@gmail.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c |   30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

--- a/fs/proc/task_mmu.c~mm-smaps-extend-smap_gather_stats-to-support-specified-beginning
+++ a/fs/proc/task_mmu.c
@@ -721,9 +721,21 @@ static const struct mm_walk_ops smaps_sh
 	.pte_hole		= smaps_pte_hole,
 };
 
+/*
+ * Gather mem stats from @vma with the indicated beginning
+ * address @start, and keep them in @mss.
+ *
+ * Use vm_start of @vma as the beginning address if @start is 0.
+ */
 static void smap_gather_stats(struct vm_area_struct *vma,
-			     struct mem_size_stats *mss)
+		struct mem_size_stats *mss, unsigned long start)
 {
+	const struct mm_walk_ops *ops = &smaps_walk_ops;
+
+	/* Invalid start */
+	if (start >= vma->vm_end)
+		return;
+
 #ifdef CONFIG_SHMEM
 	/* In case of smaps_rollup, reset the value from previous vma */
 	mss->check_shmem_swap = false;
@@ -740,18 +752,20 @@ static void smap_gather_stats(struct vm_
 		 */
 		unsigned long shmem_swapped = shmem_swap_usage(vma);
 
-		if (!shmem_swapped || (vma->vm_flags & VM_SHARED) ||
-					!(vma->vm_flags & VM_WRITE)) {
+		if (!start && (!shmem_swapped || (vma->vm_flags & VM_SHARED) ||
+					!(vma->vm_flags & VM_WRITE))) {
 			mss->swap += shmem_swapped;
 		} else {
 			mss->check_shmem_swap = true;
-			walk_page_vma(vma, &smaps_shmem_walk_ops, mss);
-			return;
+			ops = &smaps_shmem_walk_ops;
 		}
 	}
 #endif
 	/* mmap_lock is held in m_start */
-	walk_page_vma(vma, &smaps_walk_ops, mss);
+	if (!start)
+		walk_page_vma(vma, ops, mss);
+	else
+		walk_page_range(vma->vm_mm, start, vma->vm_end, ops, mss);
 }
 
 #define SEQ_PUT_DEC(str, val) \
@@ -803,7 +817,7 @@ static int show_smap(struct seq_file *m,
 
 	memset(&mss, 0, sizeof(mss));
 
-	smap_gather_stats(vma, &mss);
+	smap_gather_stats(vma, &mss, 0);
 
 	show_map_vma(m, vma);
 
@@ -852,7 +866,7 @@ static int show_smaps_rollup(struct seq_
 	hold_task_mempolicy(priv);
 
 	for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
-		smap_gather_stats(vma, &mss);
+		smap_gather_stats(vma, &mss, 0);
 		last_vma_end = vma->vm_end;
 	}
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 099/181] mm: proc: smaps_rollup: do not stall write attempts on mmap_lock
  2020-10-13 23:46 incoming Andrew Morton
                   ` (97 preceding siblings ...)
  2020-10-13 23:53 ` [patch 098/181] mm: smaps*: extend smap_gather_stats to support specified beginning Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 100/181] mm: move PageDoubleMap bit Andrew Morton
                   ` (85 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: adobriyan, akpm, chinwen.chang, daniel.kiss, daniel.m.jordan,
	dbueso, jgg, jimmyassarsson, ldufour, linux-mm, matthias.bgg,
	mm-commits, songliubraving, steven.price, torvalds, vbabka,
	walken, willy, ying.huang

From: Chinwen Chang <chinwen.chang@mediatek.com>
Subject: mm: proc: smaps_rollup: do not stall write attempts on mmap_lock

smaps_rollup will try to grab mmap_lock and go through the whole vma
list until it finishes iterating.  When encountering large processes, the
mmap_lock will be held for a long time, which may block other write
requests like mmap and munmap from progressing smoothly.

There are upcoming mmap_lock optimizations like range-based locks, but
the lock applied to smaps_rollup would be the coarse type, which cannot
avoid the unpleasant contention.

To solve the aforementioned issue, we add a check which detects whether
anyone wants to grab mmap_lock for write; if so, the lock is released
temporarily.

Link: http://lkml.kernel.org/r/1597715898-3854-4-git-send-email-chinwen.chang@mediatek.com
Signed-off-by: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Steven Price <steven.price@arm.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Chinwen Chang <chinwen.chang@mediatek.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Song Liu <songliubraving@fb.com>
Cc: Jimmy Assarsson <jimmyassarsson@gmail.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Daniel Kiss <daniel.kiss@arm.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 fs/proc/task_mmu.c |   66 ++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 65 insertions(+), 1 deletion(-)

--- a/fs/proc/task_mmu.c~mm-proc-smaps_rollup-do-not-stall-write-attempts-on-mmap_lock
+++ a/fs/proc/task_mmu.c
@@ -865,9 +865,73 @@ static int show_smaps_rollup(struct seq_
 
 	hold_task_mempolicy(priv);
 
-	for (vma = priv->mm->mmap; vma; vma = vma->vm_next) {
+	for (vma = priv->mm->mmap; vma;) {
 		smap_gather_stats(vma, &mss, 0);
 		last_vma_end = vma->vm_end;
+
+		/*
+		 * Release mmap_lock temporarily if someone wants to
+		 * access it for write request.
+		 */
+		if (mmap_lock_is_contended(mm)) {
+			mmap_read_unlock(mm);
+			ret = mmap_read_lock_killable(mm);
+			if (ret) {
+				release_task_mempolicy(priv);
+				goto out_put_mm;
+			}
+
+			/*
+			 * After dropping the lock, there are four cases to
+			 * consider. See the following example for explanation.
+			 *
+			 *   +------+------+-----------+
+			 *   | VMA1 | VMA2 | VMA3      |
+			 *   +------+------+-----------+
+			 *   |      |      |           |
+			 *  4k     8k     16k         400k
+			 *
+			 * Suppose we drop the lock after reading VMA2 due to
+			 * contention, then we get:
+			 *
+			 *	last_vma_end = 16k
+			 *
+			 * 1) VMA2 is freed, but VMA3 exists:
+			 *
+			 *    find_vma(mm, 16k - 1) will return VMA3.
+			 *    In this case, just continue from VMA3.
+			 *
+			 * 2) VMA2 still exists:
+			 *
+			 *    find_vma(mm, 16k - 1) will return VMA2.
+			 *    Iterate the loop like the original one.
+			 *
+			 * 3) No more VMAs can be found:
+			 *
+			 *    find_vma(mm, 16k - 1) will return NULL.
+			 *    No more things to do, just break.
+			 *
+			 * 4) (last_vma_end - 1) is the middle of a vma (VMA'):
+			 *
+			 *    find_vma(mm, 16k - 1) will return VMA' whose range
+			 *    contains last_vma_end.
+			 *    Iterate VMA' from last_vma_end.
+			 */
+			vma = find_vma(mm, last_vma_end - 1);
+			/* Case 3 above */
+			if (!vma)
+				break;
+
+			/* Case 1 above */
+			if (vma->vm_start >= last_vma_end)
+				continue;
+
+			/* Case 4 above */
+			if (vma->vm_end > last_vma_end)
+				smap_gather_stats(vma, &mss, last_vma_end);
+		}
+		/* Case 2 above */
+		vma = vma->vm_next;
 	}
 
 	show_vma_header_prefix(m, priv->mm->mmap->vm_start,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 100/181] mm: move PageDoubleMap bit
  2020-10-13 23:46 incoming Andrew Morton
                   ` (98 preceding siblings ...)
  2020-10-13 23:53 ` [patch 099/181] mm: proc: smaps_rollup: do not stall write attempts on mmap_lock Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 101/181] mm: simplify PageDoubleMap with PF_SECOND policy Andrew Morton
                   ` (84 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, willy, ziy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: move PageDoubleMap bit

Patch series "Fix PageDoubleMap".

This is a purely theoretical problem for now as none of the filesystems
which use PG_private_2 (ie PG_fscache) are being converted at this time,
but it's confusing to leave it like this.


This patch (of 2):

PG_private_2 is defined as being PF_ANY (applicable to tail pages as well
as regular & head pages).  That means that the first tail page of a
double-map page will appear to have Private2 set.  Use the Workingset bit
instead, which is defined as PF_HEAD, so any attempt to access the
Workingset bit on a tail page will redirect to the head page's Workingset
bit.
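
The redirect comes from the PF_HEAD policy wrapper, which resolves any
page to its compound head before the flag bit is touched (quoted from
include/linux/page-flags.h; the enforce argument is unused for this
policy):

#define PF_HEAD(page, enforce)	PF_POISONED_CHECK(compound_head(page))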

Link: https://lkml.kernel.org/r/20200629151933.15671-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20200629151933.15671-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/include/linux/page-flags.h~mm-move-pagedoublemap-bit
+++ a/include/linux/page-flags.h
@@ -167,7 +167,7 @@ enum pageflags {
 	PG_slob_free = PG_private,
 
 	/* Compound pages. Stored in first tail page's flags */
-	PG_double_map = PG_private_2,
+	PG_double_map = PG_workingset,
 
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 101/181] mm: simplify PageDoubleMap with PF_SECOND policy
  2020-10-13 23:46 incoming Andrew Morton
                   ` (99 preceding siblings ...)
  2020-10-13 23:53 ` [patch 100/181] mm: move PageDoubleMap bit Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:53 ` [patch 102/181] mm/mmap: leave adjust_next as virtual address instead of page frame number Andrew Morton
                   ` (83 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, willy, ziy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm: simplify PageDoubleMap with PF_SECOND policy

Introduce the new page policy of PF_SECOND which lets us use the normal
pageflags generation machinery to create the various DoubleMap
manipulation functions.
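
Hand-expanding one accessor gives a feel for what the generation
machinery emits; a sketch (attribute details elided), using the
PF_SECOND wrapper added below:

static __always_inline int PageDoubleMap(struct page *page)
{
	/* PF_SECOND asserts PageHead() and resolves to &page[1] */
	return test_bit(PG_double_map, &PF_SECOND(page, 0)->flags);
}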

Link: https://lkml.kernel.org/r/20200629151933.15671-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/page-flags.h |   40 ++++++++---------------------------
 1 file changed, 10 insertions(+), 30 deletions(-)

--- a/include/linux/page-flags.h~mm-simplify-pagedoublemap-with-pf_second-policy
+++ a/include/linux/page-flags.h
@@ -235,6 +235,9 @@ static inline void page_init_poison(stru
  *
  * PF_NO_COMPOUND:
  *     the page flag is not relevant for compound pages.
+ *
+ * PF_SECOND:
+ *     the page flag is stored in the first tail page.
  */
 #define PF_POISONED_CHECK(page) ({					\
 		VM_BUG_ON_PGFLAGS(PagePoisoned(page), page);		\
@@ -250,6 +253,9 @@ static inline void page_init_poison(stru
 #define PF_NO_COMPOUND(page, enforce) ({				\
 		VM_BUG_ON_PGFLAGS(enforce && PageCompound(page), page);	\
 		PF_POISONED_CHECK(page); })
+#define PF_SECOND(page, enforce) ({					\
+		VM_BUG_ON_PGFLAGS(!PageHead(page), page);		\
+		PF_POISONED_CHECK(&page[1]); })
 
 /*
  * Macros to create function definitions for page flags
@@ -688,42 +694,15 @@ static inline int PageTransTail(struct p
  *
  * See also __split_huge_pmd_locked() and page_remove_anon_compound_rmap().
  */
-static inline int PageDoubleMap(struct page *page)
-{
-	return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void SetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline void ClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	clear_bit(PG_double_map, &page[1].flags);
-}
-static inline int TestSetPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_set_bit(PG_double_map, &page[1].flags);
-}
-
-static inline int TestClearPageDoubleMap(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_and_clear_bit(PG_double_map, &page[1].flags);
-}
-
+PAGEFLAG(DoubleMap, double_map, PF_SECOND)
+	TESTSCFLAG(DoubleMap, double_map, PF_SECOND)
 #else
 TESTPAGEFLAG_FALSE(TransHuge)
 TESTPAGEFLAG_FALSE(TransCompound)
 TESTPAGEFLAG_FALSE(TransCompoundMap)
 TESTPAGEFLAG_FALSE(TransTail)
 PAGEFLAG_FALSE(DoubleMap)
-	TESTSETFLAG_FALSE(DoubleMap)
-	TESTCLEARFLAG_FALSE(DoubleMap)
+	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
 /*
@@ -888,6 +867,7 @@ static inline int page_has_private(struc
 #undef PF_ONLY_HEAD
 #undef PF_NO_TAIL
 #undef PF_NO_COMPOUND
+#undef PF_SECOND
 #endif /* !__GENERATING_BOUNDS_H */
 
 #endif	/* PAGE_FLAGS_H */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 102/181] mm/mmap: leave adjust_next as virtual address instead of page frame number
  2020-10-13 23:46 incoming Andrew Morton
                   ` (100 preceding siblings ...)
  2020-10-13 23:53 ` [patch 101/181] mm: simplify PageDoubleMap with PF_SECOND policy Andrew Morton
@ 2020-10-13 23:53 ` Andrew Morton
  2020-10-13 23:54 ` [patch 103/181] mm/memory.c: fix spello of "function" Andrew Morton
                   ` (82 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:53 UTC (permalink / raw)
  To: akpm, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mmap: leave adjust_next as virtual address instead of page frame number

Instead of converting adjust_next between bytes and number of pages,
let's just store the virtual address in adjust_next.
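
A worked example of the new convention, assuming 4K pages: shifting a
boundary up by three pages now means adjust_next = 0x3000 (bytes) rather
than 3 (pages), so the update in the mm/mmap.c hunk below becomes:

	next->vm_start += adjust_next;			/* bytes apply directly */
	next->vm_pgoff += adjust_next >> PAGE_SHIFT;	/* page units recovered */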

Also, this patch fixes one typo in the comment of vma_adjust_trans_huge().

[vbabka@suse.cz: changelog tweak]
Link: http://lkml.kernel.org/r/20200828081031.11306-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/huge_memory.c |    4 ++--
 mm/mmap.c        |    8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

--- a/mm/huge_memory.c~mm-mmap-leave-adjust_next-as-virtual-address-instead-of-page-frame-number
+++ a/mm/huge_memory.c
@@ -2306,13 +2306,13 @@ void vma_adjust_trans_huge(struct vm_are
 
 	/*
 	 * If we're also updating the vma->vm_next->vm_start, if the new
-	 * vm_next->vm_start isn't page aligned and it could previously
+	 * vm_next->vm_start isn't hpage aligned and it could previously
 	 * contain an hugepage: check if we need to split an huge pmd.
 	 */
 	if (adjust_next > 0) {
 		struct vm_area_struct *next = vma->vm_next;
 		unsigned long nstart = next->vm_start;
-		nstart += adjust_next << PAGE_SHIFT;
+		nstart += adjust_next;
 		if (nstart & ~HPAGE_PMD_MASK &&
 		    (nstart & HPAGE_PMD_MASK) >= next->vm_start &&
 		    (nstart & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= next->vm_end)
--- a/mm/mmap.c~mm-mmap-leave-adjust_next-as-virtual-address-instead-of-page-frame-number
+++ a/mm/mmap.c
@@ -758,7 +758,7 @@ int __vma_adjust(struct vm_area_struct *
 			 * vma expands, overlapping part of the next:
 			 * mprotect case 5 shifting the boundary up.
 			 */
-			adjust_next = (end - next->vm_start) >> PAGE_SHIFT;
+			adjust_next = (end - next->vm_start);
 			exporter = next;
 			importer = vma;
 			VM_WARN_ON(expand != importer);
@@ -768,7 +768,7 @@ int __vma_adjust(struct vm_area_struct *
 			 * split_vma inserting another: so it must be
 			 * mprotect case 4 shifting the boundary down.
 			 */
-			adjust_next = -((vma->vm_end - end) >> PAGE_SHIFT);
+			adjust_next = -(vma->vm_end - end);
 			exporter = vma;
 			importer = next;
 			VM_WARN_ON(expand != importer);
@@ -840,8 +840,8 @@ again:
 	}
 	vma->vm_pgoff = pgoff;
 	if (adjust_next) {
-		next->vm_start += adjust_next << PAGE_SHIFT;
-		next->vm_pgoff += adjust_next;
+		next->vm_start += adjust_next;
+		next->vm_pgoff += adjust_next >> PAGE_SHIFT;
 	}
 
 	if (root) {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 103/181] mm/memory.c: fix spello of "function"
  2020-10-13 23:46 incoming Andrew Morton
                   ` (101 preceding siblings ...)
  2020-10-13 23:53 ` [patch 102/181] mm/mmap: leave adjust_next as virtual address instead of page frame number Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 104/181] mm/mmap: not necessary to check mapping separately Andrew Morton
                   ` (81 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rdunlap, torvalds

From: Randy Dunlap <rdunlap@infradead.org>
Subject: mm/memory.c: fix spello of "function"

Fix typo/spello of "function".

Link: https://lkml.kernel.org/r/e7bf180e-c558-b1d5-9a15-6d9708823c9c@infradead.org
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/memory.c~mm-memoryc-fix-spello-of-function
+++ a/mm/memory.c
@@ -3764,7 +3764,7 @@ static vm_fault_t do_set_pmd(struct vm_f
 
 /**
  * alloc_set_pte - setup new PTE entry for given page and add reverse page
- * mapping. If needed, the fucntion allocates page table or use pre-allocated.
+ * mapping. If needed, the function allocates page table or use pre-allocated.
  *
  * @vmf: fault environment
  * @page: page to map
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 104/181] mm/mmap: not necessary to check mapping separately
  2020-10-13 23:46 incoming Andrew Morton
                   ` (102 preceding siblings ...)
  2020-10-13 23:54 ` [patch 103/181] mm/memory.c: fix spello of "function" Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 105/181] mm/mmap: check on file instead of the rb_root_cached of its address_space Andrew Morton
                   ` (80 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mmap: not necessary to check mapping separately

*root*, of type struct rb_root_cached, is an element of *mapping*, of
type struct address_space.  This implies that when we have a valid *root*
it must be part of a valid *mapping*.

So we can merge these two checks together to make the code easier to read
and to save some cpu cycles.
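
The containment the changelog relies on, condensed from __vma_adjust()
in mm/mmap.c (a paraphrase, not the verbatim code):

	struct address_space *mapping = file->f_mapping;
	struct rb_root_cached *root = &mapping->i_mmap;	/* root lives inside mapping */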

Link: https://lkml.kernel.org/r/20200913133631.37781-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/mmap.c~mm-mmap-not-necessary-to-check-mapping-separately
+++ a/mm/mmap.c
@@ -895,10 +895,9 @@ again:
 			anon_vma_interval_tree_post_update_vma(next);
 		anon_vma_unlock_write(anon_vma);
 	}
-	if (mapping)
-		i_mmap_unlock_write(mapping);
 
 	if (root) {
+		i_mmap_unlock_write(mapping);
 		uprobe_mmap(vma);
 
 		if (adjust_next)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 105/181] mm/mmap: check on file instead of the rb_root_cached of its address_space
  2020-10-13 23:46 incoming Andrew Morton
                   ` (103 preceding siblings ...)
  2020-10-13 23:54 ` [patch 104/181] mm/mmap: not necessary to check mapping separately Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 106/181] mm: use helper function mapping_allow_writable() Andrew Morton
                   ` (79 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mmap: check on file instead of the rb_root_cached of its address_space

In __vma_adjust(), we check *root* to decide whether to adjust the
address_space.  It seems more meaningful to check *file* itself: we are
adjusting this data because the vma is file-backed.

Since we already assume the address_space is valid for a file-backed vma,
let's just replace *root* with *file* here.

Link: https://lkml.kernel.org/r/20200913133631.37781-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |    6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/mmap.c~mm-mmap-check-on-file-instead-of-the-rb_root_cached-of-its-address_space
+++ a/mm/mmap.c
@@ -823,7 +823,7 @@ again:
 			anon_vma_interval_tree_pre_update_vma(next);
 	}
 
-	if (root) {
+	if (file) {
 		flush_dcache_mmap_lock(mapping);
 		vma_interval_tree_remove(vma, root);
 		if (adjust_next)
@@ -844,7 +844,7 @@ again:
 		next->vm_pgoff += adjust_next >> PAGE_SHIFT;
 	}
 
-	if (root) {
+	if (file) {
 		if (adjust_next)
 			vma_interval_tree_insert(next, root);
 		vma_interval_tree_insert(vma, root);
@@ -896,7 +896,7 @@ again:
 		anon_vma_unlock_write(anon_vma);
 	}
 
-	if (root) {
+	if (file) {
 		i_mmap_unlock_write(mapping);
 		uprobe_mmap(vma);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 106/181] mm: use helper function mapping_allow_writable()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (104 preceding siblings ...)
  2020-10-13 23:54 ` [patch 105/181] mm/mmap: check on file instead of the rb_root_cached of its address_space Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 107/181] mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct() Andrew Morton
                   ` (78 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, areber, christian.brauner, christian, cyphar, ebiederm,
	linmiaohe, linux-mm, mingo, mm-commits, peterz, shakeelb, surenb,
	tglx, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm: use helper function mapping_allow_writable()

Commit 4bb5f5d9395b ("mm: allow drivers to prevent new writable mappings")
changed i_mmap_writable from unsigned int to atomic_t and added the helper
function mapping_allow_writable() to atomic_inc i_mmap_writable.  But it
forgot to use this helper function in dup_mmap() and __vma_link_file().
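
For reference, the helper in question is a thin wrapper (as defined in
include/linux/fs.h):

static inline void mapping_allow_writable(struct address_space *mapping)
{
	atomic_inc(&mapping->i_mmap_writable);
}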

Link: https://lkml.kernel.org/r/20200917112736.7789-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Christian Kellner <christian@kellner.me>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Adrian Reber <areber@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/fork.c |    2 +-
 mm/mmap.c     |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

--- a/kernel/fork.c~mm-use-helper-function-mapping_allow_writable
+++ a/kernel/fork.c
@@ -559,7 +559,7 @@ static __latent_entropy int dup_mmap(str
 				atomic_dec(&inode->i_writecount);
 			i_mmap_lock_write(mapping);
 			if (tmp->vm_flags & VM_SHARED)
-				atomic_inc(&mapping->i_mmap_writable);
+				mapping_allow_writable(mapping);
 			flush_dcache_mmap_lock(mapping);
 			/* insert tmp into the share list, just after mpnt */
 			vma_interval_tree_insert_after(tmp, mpnt,
--- a/mm/mmap.c~mm-use-helper-function-mapping_allow_writable
+++ a/mm/mmap.c
@@ -621,7 +621,7 @@ static void __vma_link_file(struct vm_ar
 		if (vma->vm_flags & VM_DENYWRITE)
 			atomic_dec(&file_inode(file)->i_writecount);
 		if (vma->vm_flags & VM_SHARED)
-			atomic_inc(&mapping->i_mmap_writable);
+			mapping_allow_writable(mapping);
 
 		flush_dcache_mmap_lock(mapping);
 		vma_interval_tree_insert(vma, &mapping->i_mmap);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 107/181] mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (105 preceding siblings ...)
  2020-10-13 23:54 ` [patch 106/181] mm: use helper function mapping_allow_writable() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 108/181] mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct() Andrew Morton
                   ` (77 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct()

Commit 1da177e4c3f4 ("Linux-2.6.12-rc2") introduced the helper
allow_write_access() alongside the open-coded atomic_inc of the
i_writecount field in __remove_shared_vm_struct(), but forgot to use the
helper function there.
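
For reference, the helper (also in include/linux/fs.h) wraps the same
atomic_inc, plus a NULL check that is moot here because
__remove_shared_vm_struct() is only called with a valid file:

static inline void allow_write_access(struct file *file)
{
	if (file)
		atomic_inc(&file_inode(file)->i_writecount);
}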

Link: https://lkml.kernel.org/r/20200921115814.39680-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/mmap.c~mm-mmap-use-helper-function-allow_write_access-in-__remove_shared_vm_struct
+++ a/mm/mmap.c
@@ -143,7 +143,7 @@ static void __remove_shared_vm_struct(st
 		struct file *file, struct address_space *mapping)
 {
 	if (vma->vm_flags & VM_DENYWRITE)
-		atomic_inc(&file_inode(file)->i_writecount);
+		allow_write_access(file);
 	if (vma->vm_flags & VM_SHARED)
 		mapping_unmap_writable(mapping);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 108/181] mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (106 preceding siblings ...)
  2020-10-13 23:54 ` [patch 107/181] mm/mmap.c: use helper function allow_write_access() in __remove_shared_vm_struct() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 109/181] mm: remove src/dst mm parameter in copy_page_range() Andrew Morton
                   ` (76 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, liao.pingfang, linux-mm, mm-commits, torvalds, wang.yi59

From: Liao Pingfang <liao.pingfang@zte.com.cn>
Subject: mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct()

Replace do_brk with do_brk_flags in the comment of insert_vm_struct(),
since do_brk was removed in the following commit.

Link: https://lkml.kernel.org/r/1600650778-43230-1-git-send-email-wang.yi59@zte.com.cn
Fixes: bb177a732c4369 ("mm: do not bug_on on incorrect length in __mm_populate()")
Signed-off-by: Liao Pingfang <liao.pingfang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mmap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/mmap.c~mm-mmapc-replace-do_brk-with-do_brk_flags-in-comment-of-insert_vm_struct
+++ a/mm/mmap.c
@@ -3233,7 +3233,7 @@ int insert_vm_struct(struct mm_struct *m
 	 * By setting it to reflect the virtual start address of the
 	 * vma, merges and splits can happen in a seamless way, just
 	 * using the existing file pgoff checks and manipulations.
-	 * Similarly in do_mmap and in do_brk.
+	 * Similarly in do_mmap and in do_brk_flags.
 	 */
 	if (vma_is_anonymous(vma)) {
 		BUG_ON(vma->anon_vma);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 109/181] mm: remove src/dst mm parameter in copy_page_range()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (107 preceding siblings ...)
  2020-10-13 23:54 ` [patch 108/181] mm/mmap.c: replace do_brk with do_brk_flags in comment of insert_vm_struct() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 110/181] include/linux/huge_mm.h: remove mincore_huge_pmd declaration Andrew Morton
                   ` (75 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, jgg, kirill.shutemov, kirill, linux-mm, mm-commits, peterx,
	torvalds

From: Peter Xu <peterx@redhat.com>
Subject: mm: remove src/dst mm parameter in copy_page_range()

Neither of the mm pointers is needed after commit 7a4830c380f3
("mm/fork: Pass new vma pointer into copy_page_range()").

Jason Gunthorpe also reported that the parameter ordering of
copy_page_range() is odd.  Since we are reworking it anyway, reorder the
parameters to be logical: (1) always put the dst_* fields before the
src_* fields, and (2) keep parameters of the same type together.
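
The net effect on the interface, taken from the hunks below:

	/* before */
	int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
			    struct vm_area_struct *vma, struct vm_area_struct *new);

	/* after */
	int copy_page_range(struct vm_area_struct *dst_vma,
			    struct vm_area_struct *src_vma);
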

[peterx@redhat.com: further reorder some parameters and line format, per Jason]
  Link: https://lkml.kernel.org/r/20201002192647.7161-1-peterx@redhat.com
[peterx@redhat.com: fix warnings]
  Link: https://lkml.kernel.org/r/20201006200138.GA6026@xz-x1
Link: https://lkml.kernel.org/r/20200930204950.6668-1-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reported-by: Kirill A. Shutemov <kirill@shutemov.name>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mm.h |    4 -
 kernel/fork.c      |    2 
 mm/memory.c        |  141 ++++++++++++++++++++++---------------------
 3 files changed, 76 insertions(+), 71 deletions(-)

--- a/include/linux/mm.h~mm-remove-src-dst-mm-parameter-in-copy_page_range
+++ a/include/linux/mm.h
@@ -1653,8 +1653,8 @@ struct mmu_notifier_range;
 
 void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
 		unsigned long end, unsigned long floor, unsigned long ceiling);
-int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
-		    struct vm_area_struct *vma, struct vm_area_struct *new);
+int
+copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma);
 int follow_pte_pmd(struct mm_struct *mm, unsigned long address,
 		   struct mmu_notifier_range *range,
 		   pte_t **ptepp, pmd_t **pmdpp, spinlock_t **ptlp);
--- a/kernel/fork.c~mm-remove-src-dst-mm-parameter-in-copy_page_range
+++ a/kernel/fork.c
@@ -590,7 +590,7 @@ static __latent_entropy int dup_mmap(str
 
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
-			retval = copy_page_range(mm, oldmm, mpnt, tmp);
+			retval = copy_page_range(tmp, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
--- a/mm/memory.c~mm-remove-src-dst-mm-parameter-in-copy_page_range
+++ a/mm/memory.c
@@ -794,15 +794,14 @@ copy_nonpresent_pte(struct mm_struct *ds
  * lock.
  */
 static inline int
-copy_present_page(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte,
-		struct vm_area_struct *vma, struct vm_area_struct *new,
-		unsigned long addr, int *rss, struct page **prealloc,
-		pte_t pte, struct page *page)
+copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+		  pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
+		  struct page **prealloc, pte_t pte, struct page *page)
 {
+	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct page *new_page;
 
-	if (!is_cow_mapping(vma->vm_flags))
+	if (!is_cow_mapping(src_vma->vm_flags))
 		return 1;
 
 	/*
@@ -832,16 +831,16 @@ copy_present_page(struct mm_struct *dst_
 	 * over and copy the page & arm it.
 	 */
 	*prealloc = NULL;
-	copy_user_highpage(new_page, page, addr, vma);
+	copy_user_highpage(new_page, page, addr, src_vma);
 	__SetPageUptodate(new_page);
-	page_add_new_anon_rmap(new_page, new, addr, false);
-	lru_cache_add_inactive_or_unevictable(new_page, new);
+	page_add_new_anon_rmap(new_page, dst_vma, addr, false);
+	lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
 	rss[mm_counter(new_page)]++;
 
 	/* All done, just insert the new page copy in the child */
-	pte = mk_pte(new_page, new->vm_page_prot);
-	pte = maybe_mkwrite(pte_mkdirty(pte), new);
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	pte = mk_pte(new_page, dst_vma->vm_page_prot);
+	pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
+	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
 	return 0;
 }
 
@@ -850,24 +849,21 @@ copy_present_page(struct mm_struct *dst_
  * is required to copy this pte.
  */
 static inline int
-copy_present_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pte_t *dst_pte, pte_t *src_pte, struct vm_area_struct *vma,
-		struct vm_area_struct *new,
-		unsigned long addr, int *rss, struct page **prealloc)
+copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
+		 struct page **prealloc)
 {
-	unsigned long vm_flags = vma->vm_flags;
+	struct mm_struct *src_mm = src_vma->vm_mm;
+	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = *src_pte;
 	struct page *page;
 
-	page = vm_normal_page(vma, addr, pte);
+	page = vm_normal_page(src_vma, addr, pte);
 	if (page) {
 		int retval;
 
-		retval = copy_present_page(dst_mm, src_mm,
-			dst_pte, src_pte,
-			vma, new,
-			addr, rss, prealloc,
-			pte, page);
+		retval = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
+					   addr, rss, prealloc, pte, page);
 		if (retval <= 0)
 			return retval;
 
@@ -901,7 +897,7 @@ copy_present_pte(struct mm_struct *dst_m
 	if (!(vm_flags & VM_UFFD_WP))
 		pte = pte_clear_uffd_wp(pte);
 
-	set_pte_at(dst_mm, addr, dst_pte, pte);
+	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
 	return 0;
 }
 
@@ -924,11 +920,13 @@ page_copy_prealloc(struct mm_struct *src
 	return new_page;
 }
 
-static int copy_pte_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		   pmd_t *dst_pmd, pmd_t *src_pmd, struct vm_area_struct *vma,
-		   struct vm_area_struct *new,
-		   unsigned long addr, unsigned long end)
+static int
+copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+	       pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
+	       unsigned long end)
 {
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
 	pte_t *orig_src_pte, *orig_dst_pte;
 	pte_t *src_pte, *dst_pte;
 	spinlock_t *src_ptl, *dst_ptl;
@@ -971,15 +969,15 @@ again:
 		if (unlikely(!pte_present(*src_pte))) {
 			entry.val = copy_nonpresent_pte(dst_mm, src_mm,
 							dst_pte, src_pte,
-							vma, addr, rss);
+							src_vma, addr, rss);
 			if (entry.val)
 				break;
 			progress += 8;
 			continue;
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
-		ret = copy_present_pte(dst_mm, src_mm, dst_pte, src_pte,
-				       vma, new, addr, rss, &prealloc);
+		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
+				       addr, rss, &prealloc);
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
@@ -1014,7 +1012,7 @@ again:
 		entry.val = 0;
 	} else if (ret) {
 		WARN_ON_ONCE(ret != -EAGAIN);
-		prealloc = page_copy_prealloc(src_mm, vma, addr);
+		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
 		if (!prealloc)
 			return -ENOMEM;
 		/* We've captured and resolved the error. Reset, try again. */
@@ -1028,11 +1026,13 @@ out:
 	return ret;
 }
 
-static inline int copy_pmd_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pud_t *dst_pud, pud_t *src_pud, struct vm_area_struct *vma,
-		struct vm_area_struct *new,
-		unsigned long addr, unsigned long end)
+static inline int
+copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+	       pud_t *dst_pud, pud_t *src_pud, unsigned long addr,
+	       unsigned long end)
 {
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
 	pmd_t *src_pmd, *dst_pmd;
 	unsigned long next;
 
@@ -1045,9 +1045,9 @@ static inline int copy_pmd_range(struct
 		if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
 			|| pmd_devmap(*src_pmd)) {
 			int err;
-			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, vma);
+			VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
 			err = copy_huge_pmd(dst_mm, src_mm,
-					    dst_pmd, src_pmd, addr, vma);
+					    dst_pmd, src_pmd, addr, src_vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
@@ -1056,18 +1056,20 @@ static inline int copy_pmd_range(struct
 		}
 		if (pmd_none_or_clear_bad(src_pmd))
 			continue;
-		if (copy_pte_range(dst_mm, src_mm, dst_pmd, src_pmd,
-				   vma, new, addr, next))
+		if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
+				   addr, next))
 			return -ENOMEM;
 	} while (dst_pmd++, src_pmd++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int copy_pud_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		p4d_t *dst_p4d, p4d_t *src_p4d, struct vm_area_struct *vma,
-		struct vm_area_struct *new,
-		unsigned long addr, unsigned long end)
+static inline int
+copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+	       p4d_t *dst_p4d, p4d_t *src_p4d, unsigned long addr,
+	       unsigned long end)
 {
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
 	pud_t *src_pud, *dst_pud;
 	unsigned long next;
 
@@ -1080,9 +1082,9 @@ static inline int copy_pud_range(struct
 		if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
 			int err;
 
-			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, vma);
+			VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
 			err = copy_huge_pud(dst_mm, src_mm,
-					    dst_pud, src_pud, addr, vma);
+					    dst_pud, src_pud, addr, src_vma);
 			if (err == -ENOMEM)
 				return -ENOMEM;
 			if (!err)
@@ -1091,18 +1093,19 @@ static inline int copy_pud_range(struct
 		}
 		if (pud_none_or_clear_bad(src_pud))
 			continue;
-		if (copy_pmd_range(dst_mm, src_mm, dst_pud, src_pud,
-				   vma, new, addr, next))
+		if (copy_pmd_range(dst_vma, src_vma, dst_pud, src_pud,
+				   addr, next))
 			return -ENOMEM;
 	} while (dst_pud++, src_pud++, addr = next, addr != end);
 	return 0;
 }
 
-static inline int copy_p4d_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		pgd_t *dst_pgd, pgd_t *src_pgd, struct vm_area_struct *vma,
-		struct vm_area_struct *new,
-		unsigned long addr, unsigned long end)
+static inline int
+copy_p4d_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+	       pgd_t *dst_pgd, pgd_t *src_pgd, unsigned long addr,
+	       unsigned long end)
 {
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
 	p4d_t *src_p4d, *dst_p4d;
 	unsigned long next;
 
@@ -1114,20 +1117,22 @@ static inline int copy_p4d_range(struct
 		next = p4d_addr_end(addr, end);
 		if (p4d_none_or_clear_bad(src_p4d))
 			continue;
-		if (copy_pud_range(dst_mm, src_mm, dst_p4d, src_p4d,
-				   vma, new, addr, next))
+		if (copy_pud_range(dst_vma, src_vma, dst_p4d, src_p4d,
+				   addr, next))
 			return -ENOMEM;
 	} while (dst_p4d++, src_p4d++, addr = next, addr != end);
 	return 0;
 }
 
-int copy_page_range(struct mm_struct *dst_mm, struct mm_struct *src_mm,
-		    struct vm_area_struct *vma, struct vm_area_struct *new)
+int
+copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 {
 	pgd_t *src_pgd, *dst_pgd;
 	unsigned long next;
-	unsigned long addr = vma->vm_start;
-	unsigned long end = vma->vm_end;
+	unsigned long addr = src_vma->vm_start;
+	unsigned long end = src_vma->vm_end;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	struct mm_struct *src_mm = src_vma->vm_mm;
 	struct mmu_notifier_range range;
 	bool is_cow;
 	int ret;
@@ -1138,19 +1143,19 @@ int copy_page_range(struct mm_struct *ds
 	 * readonly mappings. The tradeoff is that copy_page_range is more
 	 * efficient than faulting.
 	 */
-	if (!(vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
-			!vma->anon_vma)
+	if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
+	    !src_vma->anon_vma)
 		return 0;
 
-	if (is_vm_hugetlb_page(vma))
-		return copy_hugetlb_page_range(dst_mm, src_mm, vma);
+	if (is_vm_hugetlb_page(src_vma))
+		return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);
 
-	if (unlikely(vma->vm_flags & VM_PFNMAP)) {
+	if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
 		/*
 		 * We do not free on error cases below as remove_vma
 		 * gets called on error from higher level routine
 		 */
-		ret = track_pfn_copy(vma);
+		ret = track_pfn_copy(src_vma);
 		if (ret)
 			return ret;
 	}
@@ -1161,11 +1166,11 @@ int copy_page_range(struct mm_struct *ds
 	 * parent mm. And a permission downgrade will only happen if
 	 * is_cow_mapping() returns true.
 	 */
-	is_cow = is_cow_mapping(vma->vm_flags);
+	is_cow = is_cow_mapping(src_vma->vm_flags);
 
 	if (is_cow) {
 		mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
-					0, vma, src_mm, addr, end);
+					0, src_vma, src_mm, addr, end);
 		mmu_notifier_invalidate_range_start(&range);
 	}
 
@@ -1176,8 +1181,8 @@ int copy_page_range(struct mm_struct *ds
 		next = pgd_addr_end(addr, end);
 		if (pgd_none_or_clear_bad(src_pgd))
 			continue;
-		if (unlikely(copy_p4d_range(dst_mm, src_mm, dst_pgd, src_pgd,
-					    vma, new, addr, next))) {
+		if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
+					    addr, next))) {
 			ret = -ENOMEM;
 			break;
 		}
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 110/181] include/linux/huge_mm.h: remove mincore_huge_pmd declaration
  2020-10-13 23:46 incoming Andrew Morton
                   ` (108 preceding siblings ...)
  2020-10-13 23:54 ` [patch 109/181] mm: remove src/dst mm parameter in copy_page_range() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 111/181] tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro Andrew Morton
                   ` (74 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, yulei.kernel, yuleixzhang, ziy

From: yuleixzhang <yulei.kernel@gmail.com>
Subject: include/linux/huge_mm.h: remove mincore_huge_pmd declaration

As mincore_huge_pmd() was dropped, remove the declaration from the header
file.

Link: https://lkml.kernel.org/r/20200922083423.15074-1-yuleixzhang@tencent.com
Signed-off-by: Yulei Zhang <yuleixzhang@tencent.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/huge_mm.h |    3 ---
 1 file changed, 3 deletions(-)

--- a/include/linux/huge_mm.h~mm-cleanup-mincore_huge_pmd
+++ a/include/linux/huge_mm.h
@@ -38,9 +38,6 @@ extern int zap_huge_pmd(struct mmu_gathe
 extern int zap_huge_pud(struct mmu_gather *tlb,
 			struct vm_area_struct *vma,
 			pud_t *pud, unsigned long addr);
-extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			unsigned long addr, unsigned long end,
-			unsigned char *vec);
 extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 			 unsigned long new_addr,
 			 pmd_t *old_pmd, pmd_t *new_pmd);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 111/181] tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro
  2020-10-13 23:46 incoming Andrew Morton
                   ` (109 preceding siblings ...)
  2020-10-13 23:54 ` [patch 110/181] include/linux/huge_mm.h: remove mincore_huge_pmd declaration Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 112/181] lib/test_hmm.c: remove unused dmirror_zero_page Andrew Morton
                   ` (73 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, jgg, jglisse, jhubbard, linux-mm, mm-commits, rcampbell,
	shuah, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro

Some tests cannot run if resources such as huge pages are not available.
Mark these tests as skipped instead of letting them silently pass.
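
A minimal sketch of the pattern (have_hugepages() is a hypothetical
stand-in for the real resource probe; SKIP()'s first argument is the
statement executed after the skip is recorded):

	if (!have_hugepages())
		SKIP(return, "huge pages not available");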

Link: http://lkml.kernel.org/r/20200827190400.12608-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/hmm-tests.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/tools/testing/selftests/vm/hmm-tests.c~mm-test-use-the-new-skip-macro
+++ a/tools/testing/selftests/vm/hmm-tests.c
@@ -680,7 +680,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
 
 	n = gethugepagesizes(pagesizes, 4);
 	if (n <= 0)
-		return;
+		SKIP(return, "Huge page size could not be determined");
 	for (idx = 0; --n > 0; ) {
 		if (pagesizes[n] < pagesizes[idx])
 			idx = n;
@@ -694,7 +694,7 @@ TEST_F(hmm, anon_write_hugetlbfs)
 	buffer->ptr = get_hugepage_region(size, GHR_STRICT);
 	if (buffer->ptr == NULL) {
 		free(buffer);
-		return;
+		SKIP(return, "Huge page could not be allocated");
 	}
 
 	buffer->fd = -1;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 112/181] lib/test_hmm.c: remove unused dmirror_zero_page
  2020-10-13 23:46 incoming Andrew Morton
                   ` (110 preceding siblings ...)
  2020-10-13 23:54 ` [patch 111/181] tools/testing/selftests/vm/hmm-tests.c: use the new SKIP() macro Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 113/181] mm/dmapool.c: replace open-coded list_for_each_entry_safe() Andrew Morton
                   ` (72 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, jglisse, linux-mm, mm-commits, rcampbell, torvalds

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: lib/test_hmm.c: remove unused dmirror_zero_page

The variable dmirror_zero_page is unused in the HMM self-test driver.  It
was probably intended to demonstrate how a driver could use
migrate_vma_setup() to share a single read-only device-private zero page,
similar to what the CPU does.  However, it isn't needed for the self
tests, so remove it.

Link: https://lkml.kernel.org/r/20200914213801.16520-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/test_hmm.c |   14 --------------
 1 file changed, 14 deletions(-)

--- a/lib/test_hmm.c~hmm-test-remove-unused-dmirror_zero_page
+++ a/lib/test_hmm.c
@@ -36,7 +36,6 @@
 static const struct dev_pagemap_ops dmirror_devmem_ops;
 static const struct mmu_interval_notifier_ops dmirror_min_ops;
 static dev_t dmirror_dev;
-static struct page *dmirror_zero_page;
 
 struct dmirror_device;
 
@@ -1127,17 +1126,6 @@ static int __init hmm_dmirror_init(void)
 			goto err_chrdev;
 	}
 
-	/*
-	 * Allocate a zero page to simulate a reserved page of device private
-	 * memory which is always zero. The zero_pfn page isn't used just to
-	 * make the code here simpler (i.e., we need a struct page for it).
-	 */
-	dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-	if (!dmirror_zero_page) {
-		ret = -ENOMEM;
-		goto err_chrdev;
-	}
-
 	pr_info("HMM test module loaded. This is only for testing HMM.\n");
 	return 0;
 
@@ -1153,8 +1141,6 @@ static void __exit hmm_dmirror_exit(void
 {
 	int id;
 
-	if (dmirror_zero_page)
-		__free_page(dmirror_zero_page);
 	for (id = 0; id < DMIRROR_NDEVICES; id++)
 		dmirror_device_remove(dmirror_devices + id);
 	unregister_chrdev_region(dmirror_dev, DMIRROR_NDEVICES);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 113/181] mm/dmapool.c: replace open-coded list_for_each_entry_safe()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (111 preceding siblings ...)
  2020-10-13 23:54 ` [patch 112/181] lib/test_hmm.c: remove unused dmirror_zero_page Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 114/181] mm/dmapool.c: replace hard coded function name with __func__ Andrew Morton
                   ` (71 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, linux-mm, mm-commits, torvalds, willy

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: mm/dmapool.c: replace open-coded list_for_each_entry_safe()

There is one place in the code where an open-coded version of
list_for_each_entry_safe() is used.  Replace it with the standard macro.
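
For illustration, the _safe variant caches the next element so the
current entry may be unlinked and freed inside the loop body; a minimal
sketch modelled on the pool teardown in the hunk below:

	struct dma_page *page, *tmp;

	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
		list_del(&page->page_list);
		kfree(page);
	}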

Link: http://lkml.kernel.org/r/20200814135055.24898-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/dmapool.c |    6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

--- a/mm/dmapool.c~mm-dmapoolc-replace-open-coded-list_for_each_entry_safe
+++ a/mm/dmapool.c
@@ -266,6 +266,7 @@ static void pool_free_page(struct dma_po
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
+	struct dma_page *page, *tmp;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -281,10 +282,7 @@ void dma_pool_destroy(struct dma_pool *p
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	while (!list_empty(&pool->page_list)) {
-		struct dma_page *page;
-		page = list_entry(pool->page_list.next,
-				  struct dma_page, page_list);
+	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
 		if (is_page_busy(page)) {
 			if (pool->dev)
 				dev_err(pool->dev,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 114/181] mm/dmapool.c: replace hard coded function name with __func__
  2020-10-13 23:46 incoming Andrew Morton
                   ` (112 preceding siblings ...)
  2020-10-13 23:54 ` [patch 113/181] mm/dmapool.c: replace open-coded list_for_each_entry_safe() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 115/181] mm/memory-failure: do pgoff calculation before for_each_process() Andrew Morton
                   ` (70 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, andriy.shevchenko, linux-mm, mm-commits, torvalds, willy

From: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Subject: mm/dmapool.c: replace hard coded function name with __func__

There is no need to hard-code the function name when __func__ can be used.

While here, use the proper printk specifier (%pad) for special types such
as dma_addr_t.
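
For instance, a message that used to embed the function name and cast the
dma_addr_t by hand becomes (sketch mirroring the hunks below):

	dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
		__func__, pool->name, vaddr, &dma);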

Link: http://lkml.kernel.org/r/20200814135055.24898-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/dmapool.c |   40 ++++++++++++++++++----------------------
 1 file changed, 18 insertions(+), 22 deletions(-)

--- a/mm/dmapool.c~mm-dmapoolc-replace-hard-coded-function-name-with-__func__
+++ a/mm/dmapool.c
@@ -285,11 +285,10 @@ void dma_pool_destroy(struct dma_pool *p
 	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
 		if (is_page_busy(page)) {
 			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_destroy %s, %p busy\n",
+				dev_err(pool->dev, "%s %s, %p busy\n", __func__,
 					pool->name, page->vaddr);
 			else
-				pr_err("dma_pool_destroy %s, %p busy\n",
+				pr_err("%s %s, %p busy\n", __func__,
 				       pool->name, page->vaddr);
 			/* leak the still-in-use consistent memory */
 			list_del(&page->page_list);
@@ -353,12 +352,11 @@ void *dma_pool_alloc(struct dma_pool *po
 			if (data[i] == POOL_POISON_FREED)
 				continue;
 			if (pool->dev)
-				dev_err(pool->dev,
-					"dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+				dev_err(pool->dev, "%s %s, %p (corrupted)\n",
+					__func__, pool->name, retval);
 			else
-				pr_err("dma_pool_alloc %s, %p (corrupted)\n",
-					pool->name, retval);
+				pr_err("%s %s, %p (corrupted)\n",
+					__func__, pool->name, retval);
 
 			/*
 			 * Dump the first 4 bytes even if they are not
@@ -414,12 +412,11 @@ void dma_pool_free(struct dma_pool *pool
 	if (!page) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		if (pool->dev)
-			dev_err(pool->dev,
-				"dma_pool_free %s, %p/%lx (bad dma)\n",
-				pool->name, vaddr, (unsigned long)dma);
+			dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n",
+				__func__, pool->name, vaddr, &dma);
 		else
-			pr_err("dma_pool_free %s, %p/%lx (bad dma)\n",
-			       pool->name, vaddr, (unsigned long)dma);
+			pr_err("%s %s, %p/%pad (bad dma)\n",
+			       __func__, pool->name, vaddr, &dma);
 		return;
 	}
 
@@ -430,12 +427,11 @@ void dma_pool_free(struct dma_pool *pool
 	if ((dma - page->dma) != offset) {
 		spin_unlock_irqrestore(&pool->lock, flags);
 		if (pool->dev)
-			dev_err(pool->dev,
-				"dma_pool_free %s, %p (bad vaddr)/%pad\n",
-				pool->name, vaddr, &dma);
+			dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n",
+				__func__, pool->name, vaddr, &dma);
 		else
-			pr_err("dma_pool_free %s, %p (bad vaddr)/%pad\n",
-			       pool->name, vaddr, &dma);
+			pr_err("%s %s, %p (bad vaddr)/%pad\n",
+			       __func__, pool->name, vaddr, &dma);
 		return;
 	}
 	{
@@ -447,11 +443,11 @@ void dma_pool_free(struct dma_pool *pool
 			}
 			spin_unlock_irqrestore(&pool->lock, flags);
 			if (pool->dev)
-				dev_err(pool->dev, "dma_pool_free %s, dma %pad already free\n",
-					pool->name, &dma);
+				dev_err(pool->dev, "%s %s, dma %pad already free\n",
+					__func__, pool->name, &dma);
 			else
-				pr_err("dma_pool_free %s, dma %pad already free\n",
-				       pool->name, &dma);
+				pr_err("%s %s, dma %pad already free\n",
+				       __func__, pool->name, &dma);
 			return;
 		}
 	}
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 115/181] mm/memory-failure: do pgoff calculation before for_each_process()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (113 preceding siblings ...)
  2020-10-13 23:54 ` [patch 114/181] mm/dmapool.c: replace hard coded function name with __func__ Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 116/181] mm/memory-failure.c: remove unused macro `writeback' Andrew Morton
                   ` (69 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, naoya.horiguchi, tian.xianting, torvalds

From: Xianting Tian <tian.xianting@h3c.com>
Subject: mm/memory-failure: do pgoff calculation before for_each_process()

There is no need to recalculate pgoff on every iteration of
for_each_process(); move the calculation ahead of the loop and save some
CPU cycles.
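
Schematically, this is plain loop-invariant hoisting:

	/* before: recomputed on every iteration */
	for_each_process(tsk) {
		pgoff_t pgoff = page_to_pgoff(page);
		...
	}

	/* after: `page' does not change inside the loop */
	pgoff = page_to_pgoff(page);
	for_each_process(tsk) {
		...
	}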

Link: http://lkml.kernel.org/r/20200818082647.34322-1-tian.xianting@h3c.com
Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/memory-failure.c~mm-memory-failure-do-pgoff-calculation-before-for_each_process
+++ a/mm/memory-failure.c
@@ -484,11 +484,12 @@ static void collect_procs_file(struct pa
 	struct vm_area_struct *vma;
 	struct task_struct *tsk;
 	struct address_space *mapping = page->mapping;
+	pgoff_t pgoff;
 
 	i_mmap_lock_read(mapping);
 	read_lock(&tasklist_lock);
+	pgoff = page_to_pgoff(page);
 	for_each_process(tsk) {
-		pgoff_t pgoff = page_to_pgoff(page);
 		struct task_struct *t = task_early_kill(tsk, force_early);
 
 		if (!t)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 116/181] mm/memory-failure.c: remove unused macro `writeback'
  2020-10-13 23:46 incoming Andrew Morton
                   ` (114 preceding siblings ...)
  2020-10-13 23:54 ` [patch 115/181] mm/memory-failure: do pgoff calculation before for_each_process() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 117/181] mm/vmalloc.c: update the comment in __vmalloc_area_node() Andrew Morton
                   ` (68 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, alex.shi, linux-mm, mm-commits, naoya.horiguchi, torvalds

From: Alex Shi <alex.shi@linux.alibaba.com>
Subject: mm/memory-failure.c: remove unused macro `writeback'

Unlike the others, the macro `writeback' is never used, so remove it to
tame the gcc warning:

mm/memory-failure.c:827: warning: macro "writeback" is not used
[-Wunused-macros]

Link: https://lkml.kernel.org/r/1599715096-20369-1-git-send-email-alex.shi@linux.alibaba.com
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/memory-failure.c |    2 --
 1 file changed, 2 deletions(-)

--- a/mm/memory-failure.c~mm-remove-unused-marco-writeback
+++ a/mm/memory-failure.c
@@ -825,7 +825,6 @@ static int me_huge_page(struct page *p,
 #define sc		((1UL << PG_swapcache) | (1UL << PG_swapbacked))
 #define unevict		(1UL << PG_unevictable)
 #define mlock		(1UL << PG_mlocked)
-#define writeback	(1UL << PG_writeback)
 #define lru		(1UL << PG_lru)
 #define head		(1UL << PG_head)
 #define slab		(1UL << PG_slab)
@@ -874,7 +873,6 @@ static struct page_state {
 #undef sc
 #undef unevict
 #undef mlock
-#undef writeback
 #undef lru
 #undef head
 #undef slab
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 117/181] mm/vmalloc.c: update the comment in __vmalloc_area_node()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (115 preceding siblings ...)
  2020-10-13 23:54 ` [patch 116/181] mm/memory-failure.c: remove unused macro `writeback' Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 118/181] mm/vmalloc.c: fix the comment of find_vm_area Andrew Morton
                   ` (67 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, rpenyaev, sh_def, torvalds

From: Hui Su <sh_def@163.com>
Subject: mm/vmalloc.c: update the comment in __vmalloc_area_node()

Since commit c67dc624757 ("mm/vmalloc: do not call kmemleak_free() on not
yet accounted memory"), __vunmap() has been replaced by __vfree() on this
path, so update the now-misleading comment.

Link: https://lkml.kernel.org/r/20200927155409.GA3315@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmalloc.c~mm-vmallocc-update-the-comment-in-__vmalloc_area_node
+++ a/mm/vmalloc.c
@@ -2447,7 +2447,7 @@ static void *__vmalloc_area_node(struct
 			page = alloc_pages_node(node, alloc_mask|highmem_mask, 0);
 
 		if (unlikely(!page)) {
-			/* Successfully allocated i pages, free them in __vunmap() */
+			/* Successfully allocated i pages, free them in __vfree() */
 			area->nr_pages = i;
 			atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
 			goto fail;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 118/181] mm/vmalloc.c: fix the comment of find_vm_area
  2020-10-13 23:46 incoming Andrew Morton
                   ` (116 preceding siblings ...)
  2020-10-13 23:54 ` [patch 117/181] mm/vmalloc.c: update the comment in __vmalloc_area_node() Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 119/181] docs/vm: fix 'mm_count' vs 'mm_users' counter confusion Andrew Morton
                   ` (66 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sh_def, torvalds

From: Hui Su <sh_def@163.com>
Subject: mm/vmalloc.c: fix the comment of find_vm_area

Fix the return-value comments of find_vm_area() and remove_vm_area().

Link: https://lkml.kernel.org/r/20200927153034.GA199877@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmalloc.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/vmalloc.c~mm-vmallocc-fix-the-comment-of-find_vm_area
+++ a/mm/vmalloc.c
@@ -2133,7 +2133,7 @@ struct vm_struct *get_vm_area_caller(uns
  * It is up to the caller to do all required locking to keep the returned
  * pointer valid.
  *
- * Return: pointer to the found area or %NULL on faulure
+ * Return: the area descriptor on success or %NULL on failure.
  */
 struct vm_struct *find_vm_area(const void *addr)
 {
@@ -2154,7 +2154,7 @@ struct vm_struct *find_vm_area(const voi
  * This function returns the found VM area, but using it is NOT safe
  * on SMP machines, except for its size or flags.
  *
- * Return: pointer to the found area or %NULL on faulure
+ * Return: the area descriptor on success or %NULL on failure.
  */
 struct vm_struct *remove_vm_area(const void *addr)
 {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 119/181] docs/vm: fix 'mm_count' vs 'mm_users' counter confusion
  2020-10-13 23:46 incoming Andrew Morton
                   ` (117 preceding siblings ...)
  2020-10-13 23:54 ` [patch 118/181] mm/vmalloc.c: fix the comment of find_vm_area Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:54 ` [patch 120/181] kasan/kunit: add KUnit Struct to Current Task Andrew Morton
                   ` (65 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: agordeev, akpm, corbet, linux-mm, mm-commits, torvalds

From: Alexander Gordeev <agordeev@linux.ibm.com>
Subject: docs/vm: fix 'mm_count' vs 'mm_users' counter confusion

In the description of the anonymous address space's lifespan, the
'mm_users' reference counter is confused with 'mm_count': a "zombie" mm is
released when "mm_count" drops to zero, not "mm_users".
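
For reference (not part of this patch), the two counters are manipulated
through separate helpers in <linux/sched/mm.h>:

	mmget(mm);	/* pin the address space itself: mm->mm_users++ */
	mmput(mm);	/* may tear down the mappings when mm_users hits zero */

	mmgrab(mm);	/* pin only the struct mm_struct: mm->mm_count++ */
	mmdrop(mm);	/* frees the struct once mm_count drops to zero */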

Link: https://lkml.kernel.org/r/1597040695-32633-1-git-send-email-agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/vm/active_mm.rst |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/Documentation/vm/active_mm.rst~docs-vm-fix-mm_count-vs-mm_users-counter-confusion
+++ a/Documentation/vm/active_mm.rst
@@ -64,7 +64,7 @@ Active MM
  actually get cases where you have a address space that is _only_ used by
  lazy users. That is often a short-lived state, because once that thread
  gets scheduled away in favour of a real thread, the "zombie" mm gets
- released because "mm_users" becomes zero.
+ released because "mm_count" becomes zero.
 
  Also, a new rule is that _nobody_ ever has "init_mm" as a real MM any
  more. "init_mm" should be considered just a "lazy context when no other
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 120/181] kasan/kunit: add KUnit Struct to Current Task
  2020-10-13 23:46 incoming Andrew Morton
                   ` (118 preceding siblings ...)
  2020-10-13 23:54 ` [patch 119/181] docs/vm: fix 'mm_count' vs 'mm_users' counter confusion Andrew Morton
@ 2020-10-13 23:54 ` Andrew Morton
  2020-10-13 23:55 ` [patch 121/181] KUnit: KASAN Integration Andrew Morton
                   ` (64 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:54 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, andreyknvl, aryabinin, brendanhiggins,
	davidgow, dvyukov, juri.lelli, linux-mm, mingo, mm-commits,
	shuah, torvalds, trishalfonso, vincent.guittot

From: Patricia Alfonso <trishalfonso@google.com>
Subject: kasan/kunit: add KUnit Struct to Current Task

Patch series "KASAN-KUnit Integration", v14.

This patchset contains everything needed to integrate KASAN and KUnit.

KUnit will be able to:
(1) Fail tests when an unexpected KASAN error occurs
(2) Pass tests when an expected KASAN error occurs

Convert the KASAN tests to KUnit, with the exception of copy_user_test,
which KUnit is unable to exercise.

Add documentation on how to run the KASAN tests with KUnit and what to
expect when running these tests.


This patch (of 5):

In order to integrate debugging tools like KASAN into the KUnit
framework, add a `struct kunit' pointer to the current task so the
currently-running KUnit test can be tracked.
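
Consumers can then reach the running test from anywhere; a sketch based
on the KASAN hook added later in this series:

	if (current->kunit_test)
		kasan_update_kunit_status(current->kunit_test);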

Link: https://lkml.kernel.org/r/20200915035828.570483-1-davidgow@google.com
Link: https://lkml.kernel.org/r/20200915035828.570483-2-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-1-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-2-davidgow@google.com
Signed-off-by: Patricia Alfonso <trishalfonso@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/sched.h |    4 ++++
 1 file changed, 4 insertions(+)

--- a/include/linux/sched.h~add-kunit-struct-to-current-task
+++ a/include/linux/sched.h
@@ -1208,6 +1208,10 @@ struct task_struct {
 #endif
 #endif
 
+#if IS_ENABLED(CONFIG_KUNIT)
+	struct kunit			*kunit_test;
+#endif
+
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 	/* Index of current stored address in ret_stack: */
 	int				curr_ret_stack;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 121/181] KUnit: KASAN Integration
  2020-10-13 23:46 incoming Andrew Morton
                   ` (119 preceding siblings ...)
  2020-10-13 23:54 ` [patch 120/181] kasan/kunit: add KUnit Struct to Current Task Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 122/181] KASAN: port KASAN Tests to KUnit Andrew Morton
                   ` (63 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, andreyknvl, aryabinin, brendanhiggins,
	davidgow, dvyukov, juri.lelli, linux-mm, mingo, mm-commits,
	shuah, torvalds, trishalfonso, vincent.guittot

From: Patricia Alfonso <trishalfonso@google.com>
Subject: KUnit: KASAN Integration

Integrate KASAN into the KUnit testing framework.

        - Fail tests when KASAN reports an error that is not expected
        - Use KUNIT_EXPECT_KASAN_FAIL to expect a KASAN error in KASAN
          tests
        - Expected KASAN reports pass tests and are still printed when run
          without kunit_tool (kunit_tool still bypasses the report due to
          the test passing)
        - A KUnit struct in the current task is used to keep track of the
          current test from KASAN code

Make use of "[PATCH v3 kunit-next 1/2] kunit: generalize kunit_resource
API beyond allocated resources" and "[PATCH v3 kunit-next 2/2] kunit: add
support for named resources" from Alan Maguire [1]

        - A named resource is added to a test when a KASAN report is
          expected
        - This resource holds a kunit_kasan_expectation struct with
          booleans recording whether a KASAN report is expected and
          whether one was found

[1] (https://lore.kernel.org/linux-kselftest/1583251361-12748-1-git-send-email-alan.maguire@oracle.com/T/#t)
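
A usage sketch (this is the shape the converted tests in the next patch
take): the wrapped expression must produce a KASAN report for the test to
pass.

	static void kmalloc_oob_right(struct kunit *test)
	{
		char *ptr = kmalloc(123, GFP_KERNEL);

		KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
		KUNIT_EXPECT_KASAN_FAIL(test, ptr[123] = 'x');
		kfree(ptr);
	}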

Link: https://lkml.kernel.org/r/20200915035828.570483-3-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-3-davidgow@google.com
Signed-off-by: Patricia Alfonso <trishalfonso@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/kunit/test.h  |    5 ++++
 include/linux/kasan.h |    6 +++++
 lib/kunit/test.c      |   13 ++++++-----
 lib/test_kasan.c      |   47 ++++++++++++++++++++++++++++++++++++++--
 mm/kasan/report.c     |   32 +++++++++++++++++++++++++++
 5 files changed, 96 insertions(+), 7 deletions(-)

--- a/include/kunit/test.h~kunit-kasan-integration
+++ a/include/kunit/test.h
@@ -224,6 +224,11 @@ struct kunit {
 	struct list_head resources; /* Protected by lock. */
 };
 
+static inline void kunit_set_failure(struct kunit *test)
+{
+	WRITE_ONCE(test->success, false);
+}
+
 void kunit_init_test(struct kunit *test, const char *name, char *log);
 
 int kunit_run_tests(struct kunit_suite *suite);
--- a/include/linux/kasan.h~kunit-kasan-integration
+++ a/include/linux/kasan.h
@@ -14,6 +14,12 @@ struct task_struct;
 #include <linux/pgtable.h>
 #include <asm/kasan.h>
 
+/* kasan_data struct is used in KUnit tests for KASAN expected failures */
+struct kunit_kasan_expectation {
+	bool report_expected;
+	bool report_found;
+};
+
 extern unsigned char kasan_early_shadow_page[PAGE_SIZE];
 extern pte_t kasan_early_shadow_pte[PTRS_PER_PTE];
 extern pmd_t kasan_early_shadow_pmd[PTRS_PER_PMD];
--- a/lib/kunit/test.c~kunit-kasan-integration
+++ a/lib/kunit/test.c
@@ -10,16 +10,12 @@
 #include <linux/kernel.h>
 #include <linux/kref.h>
 #include <linux/sched/debug.h>
+#include <linux/sched.h>
 
 #include "debugfs.h"
 #include "string-stream.h"
 #include "try-catch-impl.h"
 
-static void kunit_set_failure(struct kunit *test)
-{
-	WRITE_ONCE(test->success, false);
-}
-
 static void kunit_print_tap_version(void)
 {
 	static bool kunit_has_printed_tap_version;
@@ -288,6 +284,10 @@ static void kunit_try_run_case(void *dat
 	struct kunit_suite *suite = ctx->suite;
 	struct kunit_case *test_case = ctx->test_case;
 
+#if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT))
+	current->kunit_test = test;
+#endif /* IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT) */
+
 	/*
 	 * kunit_run_case_internal may encounter a fatal error; if it does,
 	 * abort will be called, this thread will exit, and finally the parent
@@ -602,6 +602,9 @@ void kunit_cleanup(struct kunit *test)
 		spin_unlock(&test->lock);
 		kunit_remove_resource(test, res);
 	}
+#if (IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT))
+	current->kunit_test = NULL;
+#endif /* IS_ENABLED(CONFIG_KASAN) && IS_ENABLED(CONFIG_KUNIT)*/
 }
 EXPORT_SYMBOL_GPL(kunit_cleanup);
 
--- a/lib/test_kasan.c~kunit-kasan-integration
+++ a/lib/test_kasan.c
@@ -23,6 +23,8 @@
 
 #include <asm/page.h>
 
+#include <kunit/test.h>
+
 #include "../mm/kasan/kasan.h"
 
 #define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_SHADOW_SCALE_SIZE)
@@ -32,14 +34,55 @@
  * are not eliminated as dead code.
  */
 
-int kasan_int_result;
 void *kasan_ptr_result;
+int kasan_int_result;
+
+static struct kunit_resource resource;
+static struct kunit_kasan_expectation fail_data;
+static bool multishot;
+
+static int kasan_test_init(struct kunit *test)
+{
+	/*
+	 * Temporarily enable multi-shot mode and set panic_on_warn=0.
+	 * Otherwise, we'd only get a report for the first case.
+	 */
+	multishot = kasan_save_enable_multi_shot();
+
+	return 0;
+}
+
+static void kasan_test_exit(struct kunit *test)
+{
+	kasan_restore_multi_shot(multishot);
+}
+
+/**
+ * KUNIT_EXPECT_KASAN_FAIL() - Causes a test failure when the expression does
+ * not cause a KASAN error. This uses a KUnit resource named "kasan_data". Do
+ * not use this name for a KUnit resource outside here.
+ *
+ */
+#define KUNIT_EXPECT_KASAN_FAIL(test, condition) do { \
+	fail_data.report_expected = true; \
+	fail_data.report_found = false; \
+	kunit_add_named_resource(test, \
+				NULL, \
+				NULL, \
+				&resource, \
+				"kasan_data", &fail_data); \
+	condition; \
+	KUNIT_EXPECT_EQ(test, \
+			fail_data.report_expected, \
+			fail_data.report_found); \
+} while (0)
+
+
 
 /*
  * Note: test functions are marked noinline so that their names appear in
  * reports.
  */
-
 static noinline void __init kmalloc_oob_right(void)
 {
 	char *ptr;
--- a/mm/kasan/report.c~kunit-kasan-integration
+++ a/mm/kasan/report.c
@@ -33,6 +33,8 @@
 
 #include <asm/sections.h>
 
+#include <kunit/test.h>
+
 #include "kasan.h"
 #include "../slab.h"
 
@@ -464,12 +466,37 @@ static bool report_enabled(void)
 	return !test_and_set_bit(KASAN_BIT_REPORTED, &kasan_flags);
 }
 
+#if IS_ENABLED(CONFIG_KUNIT)
+static void kasan_update_kunit_status(struct kunit *cur_test)
+{
+	struct kunit_resource *resource;
+	struct kunit_kasan_expectation *kasan_data;
+
+	resource = kunit_find_named_resource(cur_test, "kasan_data");
+
+	if (!resource) {
+		kunit_set_failure(cur_test);
+		return;
+	}
+
+	kasan_data = (struct kunit_kasan_expectation *)resource->data;
+	kasan_data->report_found = true;
+	kunit_put_resource(resource);
+}
+#endif /* IS_ENABLED(CONFIG_KUNIT) */
+
 void kasan_report_invalid_free(void *object, unsigned long ip)
 {
 	unsigned long flags;
 	u8 tag = get_tag(object);
 
 	object = reset_tag(object);
+
+#if IS_ENABLED(CONFIG_KUNIT)
+	if (current->kunit_test)
+		kasan_update_kunit_status(current->kunit_test);
+#endif /* IS_ENABLED(CONFIG_KUNIT) */
+
 	start_report(&flags);
 	pr_err("BUG: KASAN: double-free or invalid-free in %pS\n", (void *)ip);
 	print_tags(tag, object);
@@ -488,6 +515,11 @@ static void __kasan_report(unsigned long
 	void *untagged_addr;
 	unsigned long flags;
 
+#if IS_ENABLED(CONFIG_KUNIT)
+	if (current->kunit_test)
+		kasan_update_kunit_status(current->kunit_test);
+#endif /* IS_ENABLED(CONFIG_KUNIT) */
+
 	disable_trace_on_warning();
 
 	tagged_addr = (void *)addr;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 122/181] KASAN: port KASAN Tests to KUnit
  2020-10-13 23:46 incoming Andrew Morton
                   ` (120 preceding siblings ...)
  2020-10-13 23:55 ` [patch 121/181] KUnit: KASAN Integration Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 123/181] KASAN: Testing Documentation Andrew Morton
                   ` (62 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, andreyknvl, aryabinin, brendanhiggins,
	davidgow, dvyukov, juri.lelli, linux-mm, mingo, mm-commits,
	shuah, torvalds, trishalfonso, vincent.guittot

From: Patricia Alfonso <trishalfonso@google.com>
Subject: KASAN: port KASAN Tests to KUnit

Transfer all of the existing KASAN tests to KUnit so they can be run more
easily.  With kunit_tool, developers can run these tests alongside their
other KUnit tests and see "pass" or "fail" together with the relevant
KASAN report, instead of having to parse every KASAN report by hand to
verify KASAN's functionality.  All KASAN reports are still printed to
dmesg.

Stack tests do not work properly unless KASAN_STACK is enabled, so those
tests are guarded by "if IS_ENABLED(CONFIG_KASAN_STACK)" checks and only
run when stack instrumentation is enabled.  If KASAN_STACK is not
enabled, KUnit prints a statement to let the user know the test was not
run for that reason.
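
The guard is simply (as in the hunks below):

	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
		return;
	}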

copy_user_test and kasan_rcu_uaf cannot be run in KUnit, so they live in
a separate test file that can still be built and run as a module, as
before.

[trishalfonso@google.com: v14]
  Link: https://lkml.kernel.org/r/20200915035828.570483-4-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-4-davidgow@google.com
Signed-off-by: Patricia Alfonso <trishalfonso@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.kasan       |   22 -
 lib/Makefile            |    4 
 lib/test_kasan.c        |  685 ++++++++++++++------------------------
 lib/test_kasan_module.c |  111 ++++++
 4 files changed, 385 insertions(+), 437 deletions(-)

--- a/lib/Kconfig.kasan~kasan-port-kasan-tests-to-kunit
+++ a/lib/Kconfig.kasan
@@ -166,12 +166,24 @@ config KASAN_VMALLOC
 	  for KASAN to detect more sorts of errors (and to support vmapped
 	  stacks), but at the cost of higher memory usage.
 
-config TEST_KASAN
-	tristate "Module for testing KASAN for bug detection"
-	depends on m
+config KASAN_KUNIT_TEST
+	tristate "KUnit-compatible tests of KASAN bug detection capabilities" if !KUNIT_ALL_TESTS
+	depends on KASAN && KUNIT
+	default KUNIT_ALL_TESTS
 	help
-	  This is a test module doing various nasty things like
-	  out of bounds accesses, use after free. It is useful for testing
+	  This is a KUnit test suite doing various nasty things like
+	  out of bounds and use after free accesses. It is useful for testing
 	  kernel debugging features like KASAN.
 
+	  For more information on KUnit and unit tests in general, please refer
+	  to the KUnit documentation in Documentation/dev-tools/kunit
+
+config TEST_KASAN_MODULE
+	tristate "KUnit-incompatible tests of KASAN bug detection capabilities"
+	depends on m && KASAN
+	help
+	  This is a part of the KASAN test suite that is incompatible with
+	  KUnit. Currently includes tests that do bad copy_from/to_user
+	  accesses.
+
 endif # KASAN
--- a/lib/Makefile~kasan-port-kasan-tests-to-kunit
+++ a/lib/Makefile
@@ -65,9 +65,11 @@ CFLAGS_test_bitops.o += -Werror
 obj-$(CONFIG_TEST_SYSCTL) += test_sysctl.o
 obj-$(CONFIG_TEST_HASH) += test_hash.o test_siphash.o
 obj-$(CONFIG_TEST_IDA) += test_ida.o
-obj-$(CONFIG_TEST_KASAN) += test_kasan.o
+obj-$(CONFIG_KASAN_KUNIT_TEST) += test_kasan.o
 CFLAGS_test_kasan.o += -fno-builtin
 CFLAGS_test_kasan.o += $(call cc-disable-warning, vla)
+obj-$(CONFIG_TEST_KASAN_MODULE) += test_kasan_module.o
+CFLAGS_test_kasan_module.o += -fno-builtin
 obj-$(CONFIG_TEST_UBSAN) += test_ubsan.o
 CFLAGS_test_ubsan.o += $(call cc-disable-warning, vla)
 UBSAN_SANITIZE_test_ubsan.o := y
--- a/lib/test_kasan.c~kasan-port-kasan-tests-to-kunit
+++ a/lib/test_kasan.c
@@ -5,8 +5,6 @@
  * Author: Andrey Ryabinin <a.ryabinin@samsung.com>
  */
 
-#define pr_fmt(fmt) "kasan test: %s " fmt, __func__
-
 #include <linux/bitops.h>
 #include <linux/delay.h>
 #include <linux/kasan.h>
@@ -77,416 +75,327 @@ static void kasan_test_exit(struct kunit
 			fail_data.report_found); \
 } while (0)
 
-
-
-/*
- * Note: test functions are marked noinline so that their names appear in
- * reports.
- */
-static noinline void __init kmalloc_oob_right(void)
+static void kmalloc_oob_right(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 123;
 
-	pr_info("out-of-bounds to right\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	ptr[size + OOB_TAG_OFF] = 'x';
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 'x');
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_oob_left(void)
+static void kmalloc_oob_left(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 15;
 
-	pr_info("out-of-bounds to left\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
-	*ptr = *(ptr - 1);
+	KUNIT_EXPECT_KASAN_FAIL(test, *ptr = *(ptr - 1));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_node_oob_right(void)
+static void kmalloc_node_oob_right(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 4096;
 
-	pr_info("kmalloc_node(): out-of-bounds to right\n");
 	ptr = kmalloc_node(size, GFP_KERNEL, 0);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
-	ptr[size] = 0;
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0);
 	kfree(ptr);
 }
 
-#ifdef CONFIG_SLUB
-static noinline void __init kmalloc_pagealloc_oob_right(void)
+static void kmalloc_pagealloc_oob_right(struct kunit *test)
 {
 	char *ptr;
 	size_t size = KMALLOC_MAX_CACHE_SIZE + 10;
 
+	if (!IS_ENABLED(CONFIG_SLUB)) {
+		kunit_info(test, "CONFIG_SLUB is not enabled.");
+		return;
+	}
+
 	/* Allocate a chunk that does not fit into a SLUB cache to trigger
 	 * the page allocator fallback.
 	 */
-	pr_info("kmalloc pagealloc allocation: out-of-bounds to right\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	ptr[size + OOB_TAG_OFF] = 0;
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[size + OOB_TAG_OFF] = 0);
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_pagealloc_uaf(void)
+static void kmalloc_pagealloc_uaf(struct kunit *test)
 {
 	char *ptr;
 	size_t size = KMALLOC_MAX_CACHE_SIZE + 10;
 
-	pr_info("kmalloc pagealloc allocation: use-after-free\n");
-	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
+	if (!IS_ENABLED(CONFIG_SLUB)) {
+		kunit_info(test, "CONFIG_SLUB is not enabled.");
 		return;
 	}
 
+	ptr = kmalloc(size, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+
 	kfree(ptr);
-	ptr[0] = 0;
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[0] = 0);
 }
 
-static noinline void __init kmalloc_pagealloc_invalid_free(void)
+static void kmalloc_pagealloc_invalid_free(struct kunit *test)
 {
 	char *ptr;
 	size_t size = KMALLOC_MAX_CACHE_SIZE + 10;
 
-	pr_info("kmalloc pagealloc allocation: invalid-free\n");
-	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
+	if (!IS_ENABLED(CONFIG_SLUB)) {
+		kunit_info(test, "CONFIG_SLUB is not enabled.");
 		return;
 	}
 
-	kfree(ptr + 1);
+	ptr = kmalloc(size, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+
+	KUNIT_EXPECT_KASAN_FAIL(test, kfree(ptr + 1));
 }
-#endif
 
-static noinline void __init kmalloc_large_oob_right(void)
+static void kmalloc_large_oob_right(struct kunit *test)
 {
 	char *ptr;
 	size_t size = KMALLOC_MAX_CACHE_SIZE - 256;
 	/* Allocate a chunk that is large enough, but still fits into a slab
 	 * and does not trigger the page allocator fallback in SLUB.
 	 */
-	pr_info("kmalloc large allocation: out-of-bounds to right\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
-	ptr[size] = 0;
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[size] = 0);
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_oob_krealloc_more(void)
+static void kmalloc_oob_krealloc_more(struct kunit *test)
 {
 	char *ptr1, *ptr2;
 	size_t size1 = 17;
 	size_t size2 = 19;
 
-	pr_info("out-of-bounds after krealloc more\n");
 	ptr1 = kmalloc(size1, GFP_KERNEL);
-	ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
-	if (!ptr1 || !ptr2) {
-		pr_err("Allocation failed\n");
-		kfree(ptr1);
-		kfree(ptr2);
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
 
-	ptr2[size2 + OOB_TAG_OFF] = 'x';
+	ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x');
 	kfree(ptr2);
 }
 
-static noinline void __init kmalloc_oob_krealloc_less(void)
+static void kmalloc_oob_krealloc_less(struct kunit *test)
 {
 	char *ptr1, *ptr2;
 	size_t size1 = 17;
 	size_t size2 = 15;
 
-	pr_info("out-of-bounds after krealloc less\n");
 	ptr1 = kmalloc(size1, GFP_KERNEL);
-	ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
-	if (!ptr1 || !ptr2) {
-		pr_err("Allocation failed\n");
-		kfree(ptr1);
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
 
-	ptr2[size2 + OOB_TAG_OFF] = 'x';
+	ptr2 = krealloc(ptr1, size2, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr2[size2 + OOB_TAG_OFF] = 'x');
 	kfree(ptr2);
 }
 
-static noinline void __init kmalloc_oob_16(void)
+static void kmalloc_oob_16(struct kunit *test)
 {
 	struct {
 		u64 words[2];
 	} *ptr1, *ptr2;
 
-	pr_info("kmalloc out-of-bounds for 16-bytes access\n");
 	ptr1 = kmalloc(sizeof(*ptr1) - 3, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
+
 	ptr2 = kmalloc(sizeof(*ptr2), GFP_KERNEL);
-	if (!ptr1 || !ptr2) {
-		pr_err("Allocation failed\n");
-		kfree(ptr1);
-		kfree(ptr2);
-		return;
-	}
-	*ptr1 = *ptr2;
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
+
+	KUNIT_EXPECT_KASAN_FAIL(test, *ptr1 = *ptr2);
 	kfree(ptr1);
 	kfree(ptr2);
 }
 
-static noinline void __init kmalloc_oob_memset_2(void)
+static void kmalloc_oob_memset_2(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 8;
 
-	pr_info("out-of-bounds in memset2\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	memset(ptr + 7 + OOB_TAG_OFF, 0, 2);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 7 + OOB_TAG_OFF, 0, 2));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_oob_memset_4(void)
+static void kmalloc_oob_memset_4(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 8;
 
-	pr_info("out-of-bounds in memset4\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	memset(ptr + 5 + OOB_TAG_OFF, 0, 4);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 5 + OOB_TAG_OFF, 0, 4));
 	kfree(ptr);
 }
 
 
-static noinline void __init kmalloc_oob_memset_8(void)
+static void kmalloc_oob_memset_8(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 8;
 
-	pr_info("out-of-bounds in memset8\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	memset(ptr + 1 + OOB_TAG_OFF, 0, 8);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 8));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_oob_memset_16(void)
+static void kmalloc_oob_memset_16(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 16;
 
-	pr_info("out-of-bounds in memset16\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	memset(ptr + 1 + OOB_TAG_OFF, 0, 16);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr + 1 + OOB_TAG_OFF, 0, 16));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_oob_in_memset(void)
+static void kmalloc_oob_in_memset(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 666;
 
-	pr_info("out-of-bounds in memset\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	memset(ptr, 0, size + 5 + OOB_TAG_OFF);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr, 0, size + 5 + OOB_TAG_OFF));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_memmove_invalid_size(void)
+static void kmalloc_memmove_invalid_size(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 64;
 	volatile size_t invalid_size = -2;
 
-	pr_info("invalid size in memmove\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	memset((char *)ptr, 0, 64);
-	memmove((char *)ptr, (char *)ptr + 4, invalid_size);
+
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		memmove((char *)ptr, (char *)ptr + 4, invalid_size));
 	kfree(ptr);
 }
 
-static noinline void __init kmalloc_uaf(void)
+static void kmalloc_uaf(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 10;
 
-	pr_info("use-after-free\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	kfree(ptr);
-	*(ptr + 8) = 'x';
+	KUNIT_EXPECT_KASAN_FAIL(test, *(ptr + 8) = 'x');
 }
 
-static noinline void __init kmalloc_uaf_memset(void)
+static void kmalloc_uaf_memset(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 33;
 
-	pr_info("use-after-free in memset\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	kfree(ptr);
-	memset(ptr, 0, size);
+	KUNIT_EXPECT_KASAN_FAIL(test, memset(ptr, 0, size));
 }
 
-static noinline void __init kmalloc_uaf2(void)
+static void kmalloc_uaf2(struct kunit *test)
 {
 	char *ptr1, *ptr2;
 	size_t size = 43;
 
-	pr_info("use-after-free after another kmalloc\n");
 	ptr1 = kmalloc(size, GFP_KERNEL);
-	if (!ptr1) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr1);
 
 	kfree(ptr1);
+
 	ptr2 = kmalloc(size, GFP_KERNEL);
-	if (!ptr2) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr2);
+
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr1[40] = 'x');
+	KUNIT_EXPECT_PTR_NE(test, ptr1, ptr2);
 
-	ptr1[40] = 'x';
-	if (ptr1 == ptr2)
-		pr_err("Could not detect use-after-free: ptr1 == ptr2\n");
 	kfree(ptr2);
 }
 
-static noinline void __init kfree_via_page(void)
+static void kfree_via_page(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 8;
 	struct page *page;
 	unsigned long offset;
 
-	pr_info("invalid-free false positive (via page)\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	page = virt_to_page(ptr);
 	offset = offset_in_page(ptr);
 	kfree(page_address(page) + offset);
 }
 
-static noinline void __init kfree_via_phys(void)
+static void kfree_via_phys(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 8;
 	phys_addr_t phys;
 
-	pr_info("invalid-free false positive (via phys)\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	phys = virt_to_phys(ptr);
 	kfree(phys_to_virt(phys));
 }
 
-static noinline void __init kmem_cache_oob(void)
+static void kmem_cache_oob(struct kunit *test)
 {
 	char *p;
 	size_t size = 200;
 	struct kmem_cache *cache = kmem_cache_create("test_cache",
 						size, 0,
 						0, NULL);
-	if (!cache) {
-		pr_err("Cache allocation failed\n");
-		return;
-	}
-	pr_info("out-of-bounds in kmem_cache_alloc\n");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);
 	p = kmem_cache_alloc(cache, GFP_KERNEL);
 	if (!p) {
-		pr_err("Allocation failed\n");
+		kunit_err(test, "Allocation failed: %s\n", __func__);
 		kmem_cache_destroy(cache);
 		return;
 	}
 
-	*p = p[size + OOB_TAG_OFF];
-
+	KUNIT_EXPECT_KASAN_FAIL(test, *p = p[size + OOB_TAG_OFF]);
 	kmem_cache_free(cache, p);
 	kmem_cache_destroy(cache);
 }
 
-static noinline void __init memcg_accounted_kmem_cache(void)
+static void memcg_accounted_kmem_cache(struct kunit *test)
 {
 	int i;
 	char *p;
@@ -494,12 +403,8 @@ static noinline void __init memcg_accoun
 	struct kmem_cache *cache;
 
 	cache = kmem_cache_create("test_cache", size, 0, SLAB_ACCOUNT, NULL);
-	if (!cache) {
-		pr_err("Cache allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);
 
-	pr_info("allocate memcg accounted object\n");
 	/*
 	 * Several allocations with a delay to allow for lazy per memcg kmem
 	 * cache creation.
@@ -519,134 +424,93 @@ free_cache:
 
 static char global_array[10];
 
-static noinline void __init kasan_global_oob(void)
+static void kasan_global_oob(struct kunit *test)
 {
 	volatile int i = 3;
 	char *p = &global_array[ARRAY_SIZE(global_array) + i];
 
-	pr_info("out-of-bounds global variable\n");
-	*(volatile char *)p;
-}
-
-static noinline void __init kasan_stack_oob(void)
-{
-	char stack_array[10];
-	volatile int i = OOB_TAG_OFF;
-	char *p = &stack_array[ARRAY_SIZE(stack_array) + i];
-
-	pr_info("out-of-bounds on stack\n");
-	*(volatile char *)p;
+	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-static noinline void __init ksize_unpoisons_memory(void)
+static void ksize_unpoisons_memory(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 123, real_size;
 
-	pr_info("ksize() unpoisons the whole allocated chunk\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 	real_size = ksize(ptr);
 	/* This access doesn't trigger an error. */
 	ptr[size] = 'x';
 	/* This one does. */
-	ptr[real_size] = 'y';
+	KUNIT_EXPECT_KASAN_FAIL(test, ptr[real_size] = 'y');
 	kfree(ptr);
 }
 
-static noinline void __init copy_user_test(void)
+static void kasan_stack_oob(struct kunit *test)
 {
-	char *kmem;
-	char __user *usermem;
-	size_t size = 10;
-	int unused;
-
-	kmem = kmalloc(size, GFP_KERNEL);
-	if (!kmem)
-		return;
+	char stack_array[10];
+	volatile int i = OOB_TAG_OFF;
+	char *p = &stack_array[ARRAY_SIZE(stack_array) + i];
 
-	usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE,
-			    PROT_READ | PROT_WRITE | PROT_EXEC,
-			    MAP_ANONYMOUS | MAP_PRIVATE, 0);
-	if (IS_ERR(usermem)) {
-		pr_err("Failed to allocate user memory\n");
-		kfree(kmem);
+	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
+		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
 		return;
 	}
 
-	pr_info("out-of-bounds in copy_from_user()\n");
-	unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in copy_to_user()\n");
-	unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in __copy_from_user()\n");
-	unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in __copy_to_user()\n");
-	unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in __copy_from_user_inatomic()\n");
-	unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in __copy_to_user_inatomic()\n");
-	unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF);
-
-	pr_info("out-of-bounds in strncpy_from_user()\n");
-	unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
-
-	vm_munmap((unsigned long)usermem, PAGE_SIZE);
-	kfree(kmem);
+	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-static noinline void __init kasan_alloca_oob_left(void)
+static void kasan_alloca_oob_left(struct kunit *test)
 {
 	volatile int i = 10;
 	char alloca_array[i];
 	char *p = alloca_array - 1;
 
-	pr_info("out-of-bounds to left on alloca\n");
-	*(volatile char *)p;
+	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
+		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
+		return;
+	}
+
+	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-static noinline void __init kasan_alloca_oob_right(void)
+static void kasan_alloca_oob_right(struct kunit *test)
 {
 	volatile int i = 10;
 	char alloca_array[i];
 	char *p = alloca_array + i;
 
-	pr_info("out-of-bounds to right on alloca\n");
-	*(volatile char *)p;
+	if (!IS_ENABLED(CONFIG_KASAN_STACK)) {
+		kunit_info(test, "CONFIG_KASAN_STACK is not enabled");
+		return;
+	}
+
+	KUNIT_EXPECT_KASAN_FAIL(test, *(volatile char *)p);
 }
 
-static noinline void __init kmem_cache_double_free(void)
+static void kmem_cache_double_free(struct kunit *test)
 {
 	char *p;
 	size_t size = 200;
 	struct kmem_cache *cache;
 
 	cache = kmem_cache_create("test_cache", size, 0, 0, NULL);
-	if (!cache) {
-		pr_err("Cache allocation failed\n");
-		return;
-	}
-	pr_info("double-free on heap object\n");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);
+
 	p = kmem_cache_alloc(cache, GFP_KERNEL);
 	if (!p) {
-		pr_err("Allocation failed\n");
+		kunit_err(test, "Allocation failed: %s\n", __func__);
 		kmem_cache_destroy(cache);
 		return;
 	}
 
 	kmem_cache_free(cache, p);
-	kmem_cache_free(cache, p);
+	KUNIT_EXPECT_KASAN_FAIL(test, kmem_cache_free(cache, p));
 	kmem_cache_destroy(cache);
 }
 
-static noinline void __init kmem_cache_invalid_free(void)
+static void kmem_cache_invalid_free(struct kunit *test)
 {
 	char *p;
 	size_t size = 200;
@@ -654,20 +518,17 @@ static noinline void __init kmem_cache_i
 
 	cache = kmem_cache_create("test_cache", size, 0, SLAB_TYPESAFE_BY_RCU,
 				  NULL);
-	if (!cache) {
-		pr_err("Cache allocation failed\n");
-		return;
-	}
-	pr_info("invalid-free of heap object\n");
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, cache);
+
 	p = kmem_cache_alloc(cache, GFP_KERNEL);
 	if (!p) {
-		pr_err("Allocation failed\n");
+		kunit_err(test, "Allocation failed: %s\n", __func__);
 		kmem_cache_destroy(cache);
 		return;
 	}
 
 	/* Trigger invalid free, the object doesn't get freed */
-	kmem_cache_free(cache, p + 1);
+	KUNIT_EXPECT_KASAN_FAIL(test, kmem_cache_free(cache, p + 1));
 
 	/*
 	 * Properly free the object to prevent the "Objects remaining in
@@ -678,45 +539,63 @@ static noinline void __init kmem_cache_i
 	kmem_cache_destroy(cache);
 }
 
-static noinline void __init kasan_memchr(void)
+static void kasan_memchr(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 24;
 
-	pr_info("out-of-bounds in memchr\n");
-	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
-	if (!ptr)
+	/* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */
+	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+		kunit_info(test,
+			"str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT");
 		return;
+	}
+
+	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
+
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		kasan_ptr_result = memchr(ptr, '1', size + 1));
 
-	kasan_ptr_result = memchr(ptr, '1', size + 1);
 	kfree(ptr);
 }
 
-static noinline void __init kasan_memcmp(void)
+static void kasan_memcmp(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 24;
 	int arr[9];
 
-	pr_info("out-of-bounds in memcmp\n");
-	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
-	if (!ptr)
+	/* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */
+	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+		kunit_info(test,
+			"str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT");
 		return;
+	}
 
+	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 	memset(arr, 0, sizeof(arr));
-	kasan_int_result = memcmp(ptr, arr, size + 1);
+
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		kasan_int_result = memcmp(ptr, arr, size + 1));
 	kfree(ptr);
 }
 
-static noinline void __init kasan_strings(void)
+static void kasan_strings(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 24;
 
-	pr_info("use-after-free in strchr\n");
-	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
-	if (!ptr)
+	/* See https://bugzilla.kernel.org/show_bug.cgi?id=206337 */
+	if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
+		kunit_info(test,
+			"str* functions are not instrumented with CONFIG_AMD_MEM_ENCRYPT");
 		return;
+	}
+
+	ptr = kmalloc(size, GFP_KERNEL | __GFP_ZERO);
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	kfree(ptr);
 
@@ -727,220 +606,164 @@ static noinline void __init kasan_string
 	 * will likely point to zeroed byte.
 	 */
 	ptr += 16;
-	kasan_ptr_result = strchr(ptr, '1');
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_ptr_result = strchr(ptr, '1'));
 
-	pr_info("use-after-free in strrchr\n");
-	kasan_ptr_result = strrchr(ptr, '1');
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_ptr_result = strrchr(ptr, '1'));
 
-	pr_info("use-after-free in strcmp\n");
-	kasan_int_result = strcmp(ptr, "2");
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strcmp(ptr, "2"));
 
-	pr_info("use-after-free in strncmp\n");
-	kasan_int_result = strncmp(ptr, "2", 1);
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strncmp(ptr, "2", 1));
 
-	pr_info("use-after-free in strlen\n");
-	kasan_int_result = strlen(ptr);
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strlen(ptr));
 
-	pr_info("use-after-free in strnlen\n");
-	kasan_int_result = strnlen(ptr, 1);
+	KUNIT_EXPECT_KASAN_FAIL(test, kasan_int_result = strnlen(ptr, 1));
 }
 
-static noinline void __init kasan_bitops(void)
+static void kasan_bitops(struct kunit *test)
 {
 	/*
 	 * Allocate 1 more byte, which causes kzalloc to round up to 16-bytes;
 	 * this way we do not actually corrupt other memory.
 	 */
 	long *bits = kzalloc(sizeof(*bits) + 1, GFP_KERNEL);
-	if (!bits)
-		return;
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, bits);
 
 	/*
 	 * Below calls try to access bit within allocated memory; however, the
 	 * below accesses are still out-of-bounds, since bitops are defined to
 	 * operate on the whole long the bit is in.
 	 */
-	pr_info("out-of-bounds in set_bit\n");
-	set_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, set_bit(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in __set_bit\n");
-	__set_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, __set_bit(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in clear_bit\n");
-	clear_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in __clear_bit\n");
-	__clear_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in clear_bit_unlock\n");
-	clear_bit_unlock(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, clear_bit_unlock(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in __clear_bit_unlock\n");
-	__clear_bit_unlock(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, __clear_bit_unlock(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in change_bit\n");
-	change_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, change_bit(BITS_PER_LONG, bits));
 
-	pr_info("out-of-bounds in __change_bit\n");
-	__change_bit(BITS_PER_LONG, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test, __change_bit(BITS_PER_LONG, bits));
 
 	/*
 	 * Below calls try to access bit beyond allocated memory.
 	 */
-	pr_info("out-of-bounds in test_and_set_bit\n");
-	test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in __test_and_set_bit\n");
-	__test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		__test_and_set_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in test_and_set_bit_lock\n");
-	test_and_set_bit_lock(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		test_and_set_bit_lock(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in test_and_clear_bit\n");
-	test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in __test_and_clear_bit\n");
-	__test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		__test_and_clear_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in test_and_change_bit\n");
-	test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in __test_and_change_bit\n");
-	__test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		__test_and_change_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
-	pr_info("out-of-bounds in test_bit\n");
-	kasan_int_result = test_bit(BITS_PER_LONG + BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		kasan_int_result =
+			test_bit(BITS_PER_LONG + BITS_PER_BYTE, bits));
 
 #if defined(clear_bit_unlock_is_negative_byte)
-	pr_info("out-of-bounds in clear_bit_unlock_is_negative_byte\n");
-	kasan_int_result = clear_bit_unlock_is_negative_byte(BITS_PER_LONG +
-		BITS_PER_BYTE, bits);
+	KUNIT_EXPECT_KASAN_FAIL(test,
+		kasan_int_result = clear_bit_unlock_is_negative_byte(
+			BITS_PER_LONG + BITS_PER_BYTE, bits));
 #endif
 	kfree(bits);
 }
 
-static noinline void __init kmalloc_double_kzfree(void)
+static void kmalloc_double_kzfree(struct kunit *test)
 {
 	char *ptr;
 	size_t size = 16;
 
-	pr_info("double-free (kfree_sensitive)\n");
 	ptr = kmalloc(size, GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, ptr);
 
 	kfree_sensitive(ptr);
-	kfree_sensitive(ptr);
+	KUNIT_EXPECT_KASAN_FAIL(test, kfree_sensitive(ptr));
 }
 
-#ifdef CONFIG_KASAN_VMALLOC
-static noinline void __init vmalloc_oob(void)
+static void vmalloc_oob(struct kunit *test)
 {
 	void *area;
 
-	pr_info("vmalloc out-of-bounds\n");
+	if (!IS_ENABLED(CONFIG_KASAN_VMALLOC)) {
+		kunit_info(test, "CONFIG_KASAN_VMALLOC is not enabled.");
+		return;
+	}
 
 	/*
 	 * We have to be careful not to hit the guard page.
 	 * The MMU will catch that and crash us.
 	 */
 	area = vmalloc(3000);
-	if (!area) {
-		pr_err("Allocation failed\n");
-		return;
-	}
+	KUNIT_ASSERT_NOT_ERR_OR_NULL(test, area);
 
-	((volatile char *)area)[3100];
+	KUNIT_EXPECT_KASAN_FAIL(test, ((volatile char *)area)[3100]);
 	vfree(area);
 }
-#else
-static void __init vmalloc_oob(void) {}
-#endif
-
-static struct kasan_rcu_info {
-	int i;
-	struct rcu_head rcu;
-} *global_rcu_ptr;
-
-static noinline void __init kasan_rcu_reclaim(struct rcu_head *rp)
-{
-	struct kasan_rcu_info *fp = container_of(rp,
-						struct kasan_rcu_info, rcu);
-
-	kfree(fp);
-	fp->i = 1;
-}
-
-static noinline void __init kasan_rcu_uaf(void)
-{
-	struct kasan_rcu_info *ptr;
 
-	pr_info("use-after-free in kasan_rcu_reclaim\n");
-	ptr = kmalloc(sizeof(struct kasan_rcu_info), GFP_KERNEL);
-	if (!ptr) {
-		pr_err("Allocation failed\n");
-		return;
-	}
-
-	global_rcu_ptr = rcu_dereference_protected(ptr, NULL);
-	call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim);
-}
+static struct kunit_case kasan_kunit_test_cases[] = {
+	KUNIT_CASE(kmalloc_oob_right),
+	KUNIT_CASE(kmalloc_oob_left),
+	KUNIT_CASE(kmalloc_node_oob_right),
+	KUNIT_CASE(kmalloc_pagealloc_oob_right),
+	KUNIT_CASE(kmalloc_pagealloc_uaf),
+	KUNIT_CASE(kmalloc_pagealloc_invalid_free),
+	KUNIT_CASE(kmalloc_large_oob_right),
+	KUNIT_CASE(kmalloc_oob_krealloc_more),
+	KUNIT_CASE(kmalloc_oob_krealloc_less),
+	KUNIT_CASE(kmalloc_oob_16),
+	KUNIT_CASE(kmalloc_oob_in_memset),
+	KUNIT_CASE(kmalloc_oob_memset_2),
+	KUNIT_CASE(kmalloc_oob_memset_4),
+	KUNIT_CASE(kmalloc_oob_memset_8),
+	KUNIT_CASE(kmalloc_oob_memset_16),
+	KUNIT_CASE(kmalloc_memmove_invalid_size),
+	KUNIT_CASE(kmalloc_uaf),
+	KUNIT_CASE(kmalloc_uaf_memset),
+	KUNIT_CASE(kmalloc_uaf2),
+	KUNIT_CASE(kfree_via_page),
+	KUNIT_CASE(kfree_via_phys),
+	KUNIT_CASE(kmem_cache_oob),
+	KUNIT_CASE(memcg_accounted_kmem_cache),
+	KUNIT_CASE(kasan_global_oob),
+	KUNIT_CASE(kasan_stack_oob),
+	KUNIT_CASE(kasan_alloca_oob_left),
+	KUNIT_CASE(kasan_alloca_oob_right),
+	KUNIT_CASE(ksize_unpoisons_memory),
+	KUNIT_CASE(kmem_cache_double_free),
+	KUNIT_CASE(kmem_cache_invalid_free),
+	KUNIT_CASE(kasan_memchr),
+	KUNIT_CASE(kasan_memcmp),
+	KUNIT_CASE(kasan_strings),
+	KUNIT_CASE(kasan_bitops),
+	KUNIT_CASE(kmalloc_double_kzfree),
+	KUNIT_CASE(vmalloc_oob),
+	{}
+};
+
+static struct kunit_suite kasan_kunit_test_suite = {
+	.name = "kasan",
+	.init = kasan_test_init,
+	.test_cases = kasan_kunit_test_cases,
+	.exit = kasan_test_exit,
+};
 
-static int __init kmalloc_tests_init(void)
-{
-	/*
-	 * Temporarily enable multi-shot mode. Otherwise, we'd only get a
-	 * report for the first case.
-	 */
-	bool multishot = kasan_save_enable_multi_shot();
-
-	kmalloc_oob_right();
-	kmalloc_oob_left();
-	kmalloc_node_oob_right();
-#ifdef CONFIG_SLUB
-	kmalloc_pagealloc_oob_right();
-	kmalloc_pagealloc_uaf();
-	kmalloc_pagealloc_invalid_free();
-#endif
-	kmalloc_large_oob_right();
-	kmalloc_oob_krealloc_more();
-	kmalloc_oob_krealloc_less();
-	kmalloc_oob_16();
-	kmalloc_oob_in_memset();
-	kmalloc_oob_memset_2();
-	kmalloc_oob_memset_4();
-	kmalloc_oob_memset_8();
-	kmalloc_oob_memset_16();
-	kmalloc_memmove_invalid_size();
-	kmalloc_uaf();
-	kmalloc_uaf_memset();
-	kmalloc_uaf2();
-	kfree_via_page();
-	kfree_via_phys();
-	kmem_cache_oob();
-	memcg_accounted_kmem_cache();
-	kasan_stack_oob();
-	kasan_global_oob();
-	kasan_alloca_oob_left();
-	kasan_alloca_oob_right();
-	ksize_unpoisons_memory();
-	copy_user_test();
-	kmem_cache_double_free();
-	kmem_cache_invalid_free();
-	kasan_memchr();
-	kasan_memcmp();
-	kasan_strings();
-	kasan_bitops();
-	kmalloc_double_kzfree();
-	vmalloc_oob();
-	kasan_rcu_uaf();
-
-	kasan_restore_multi_shot(multishot);
-
-	return -EAGAIN;
-}
+kunit_test_suite(kasan_kunit_test_suite);
 
-module_init(kmalloc_tests_init);
 MODULE_LICENSE("GPL");
--- /dev/null
+++ a/lib/test_kasan_module.c
@@ -0,0 +1,111 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ *
+ * Copyright (c) 2014 Samsung Electronics Co., Ltd.
+ * Author: Andrey Ryabinin <a.ryabinin@samsung.com>
+ */
+
+#define pr_fmt(fmt) "kasan test: %s " fmt, __func__
+
+#include <linux/mman.h>
+#include <linux/module.h>
+#include <linux/printk.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include "../mm/kasan/kasan.h"
+
+#define OOB_TAG_OFF (IS_ENABLED(CONFIG_KASAN_GENERIC) ? 0 : KASAN_SHADOW_SCALE_SIZE)
+
+static noinline void __init copy_user_test(void)
+{
+	char *kmem;
+	char __user *usermem;
+	size_t size = 10;
+	int unused;
+
+	kmem = kmalloc(size, GFP_KERNEL);
+	if (!kmem)
+		return;
+
+	usermem = (char __user *)vm_mmap(NULL, 0, PAGE_SIZE,
+			    PROT_READ | PROT_WRITE | PROT_EXEC,
+			    MAP_ANONYMOUS | MAP_PRIVATE, 0);
+	if (IS_ERR(usermem)) {
+		pr_err("Failed to allocate user memory\n");
+		kfree(kmem);
+		return;
+	}
+
+	pr_info("out-of-bounds in copy_from_user()\n");
+	unused = copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in copy_to_user()\n");
+	unused = copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in __copy_from_user()\n");
+	unused = __copy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in __copy_to_user()\n");
+	unused = __copy_to_user(usermem, kmem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in __copy_from_user_inatomic()\n");
+	unused = __copy_from_user_inatomic(kmem, usermem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in __copy_to_user_inatomic()\n");
+	unused = __copy_to_user_inatomic(usermem, kmem, size + 1 + OOB_TAG_OFF);
+
+	pr_info("out-of-bounds in strncpy_from_user()\n");
+	unused = strncpy_from_user(kmem, usermem, size + 1 + OOB_TAG_OFF);
+
+	vm_munmap((unsigned long)usermem, PAGE_SIZE);
+	kfree(kmem);
+}
+
+static struct kasan_rcu_info {
+	int i;
+	struct rcu_head rcu;
+} *global_rcu_ptr;
+
+static noinline void __init kasan_rcu_reclaim(struct rcu_head *rp)
+{
+	struct kasan_rcu_info *fp = container_of(rp,
+						struct kasan_rcu_info, rcu);
+
+	kfree(fp);
+	fp->i = 1;
+}
+
+static noinline void __init kasan_rcu_uaf(void)
+{
+	struct kasan_rcu_info *ptr;
+
+	pr_info("use-after-free in kasan_rcu_reclaim\n");
+	ptr = kmalloc(sizeof(struct kasan_rcu_info), GFP_KERNEL);
+	if (!ptr) {
+		pr_err("Allocation failed\n");
+		return;
+	}
+
+	global_rcu_ptr = rcu_dereference_protected(ptr, NULL);
+	call_rcu(&global_rcu_ptr->rcu, kasan_rcu_reclaim);
+}
+
+
+static int __init test_kasan_module_init(void)
+{
+	/*
+	 * Temporarily enable multi-shot mode. Otherwise, we'd only get a
+	 * report for the first case.
+	 */
+	bool multishot = kasan_save_enable_multi_shot();
+
+	copy_user_test();
+	kasan_rcu_uaf();
+
+	kasan_restore_multi_shot(multishot);
+	return -EAGAIN;
+}
+
+module_init(test_kasan_module_init);
+MODULE_LICENSE("GPL");
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 123/181] KASAN: Testing Documentation
  2020-10-13 23:46 incoming Andrew Morton
                   ` (121 preceding siblings ...)
  2020-10-13 23:55 ` [patch 122/181] KASAN: port KASAN Tests to KUnit Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 124/181] mm: kasan: do not panic if both panic_on_warn and kasan_multishot set Andrew Morton
                   ` (61 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, andreyknvl, aryabinin, brendanhiggins,
	davidgow, dvyukov, juri.lelli, linux-mm, mingo, mm-commits,
	shuah, torvalds, trishalfonso, vincent.guittot

From: Patricia Alfonso <trishalfonso@google.com>
Subject: KASAN: Testing Documentation

Include documentation on how to test KASAN using CONFIG_KASAN_KUNIT_TEST
and CONFIG_TEST_KASAN_MODULE.
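
As a usage sketch (not part of this patch; it relies on the option and
module names documented below), the KUnit tests can be exercised on a
kernel built with CONFIG_KUNIT=y, CONFIG_KASAN=y and
CONFIG_KASAN_KUNIT_TEST=m simply by loading the module:

	modprobe test_kasan

after which the per-test "ok"/"not ok" TAP lines appear in the kernel log.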

Link: https://lkml.kernel.org/r/20200915035828.570483-5-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-5-davidgow@google.com
Signed-off-by: Patricia Alfonso <trishalfonso@google.com>
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/dev-tools/kasan.rst |   70 ++++++++++++++++++++++++++++
 1 file changed, 70 insertions(+)

--- a/Documentation/dev-tools/kasan.rst~kasan-testing-documentation
+++ a/Documentation/dev-tools/kasan.rst
@@ -281,3 +281,73 @@ unmapped. This will require changes in a
 
 This allows ``VMAP_STACK`` support on x86, and can simplify support of
 architectures that do not have a fixed module region.
+
+CONFIG_KASAN_KUNIT_TEST & CONFIG_TEST_KASAN_MODULE
+--------------------------------------------------
+
+``CONFIG_KASAN_KUNIT_TEST`` utilizes the KUnit Test Framework for testing.
+This means each test focuses on a small unit of functionality and
+there are a few ways these tests can be run.
+
+Each test will print the KASAN report if an error is detected and then
+print the number of the test and the status of the test:
+
+pass::
+
+        ok 28 - kmalloc_double_kzfree
+or, if kmalloc failed::
+
+        # kmalloc_large_oob_right: ASSERTION FAILED at lib/test_kasan.c:163
+        Expected ptr is not null, but is
+        not ok 4 - kmalloc_large_oob_right
+or, if a KASAN report was expected, but not found::
+
+        # kmalloc_double_kzfree: EXPECTATION FAILED at lib/test_kasan.c:629
+        Expected kasan_data->report_expected == kasan_data->report_found, but
+        kasan_data->report_expected == 1
+        kasan_data->report_found == 0
+        not ok 28 - kmalloc_double_kzfree
+
+All test statuses are tracked as they run and an overall status will
+be printed at the end::
+
+        ok 1 - kasan
+
+or::
+
+        not ok 1 - kasan
+
+(1) Loadable Module
+~~~~~~~~~~~~~~~~~~~~
+
+With ``CONFIG_KUNIT`` enabled, ``CONFIG_KASAN_KUNIT_TEST`` can be built as
+a loadable module and run on any architecture that supports KASAN
+using something like insmod or modprobe. The module is called ``test_kasan``.
+
+(2) Built-In
+~~~~~~~~~~~~~
+
+With ``CONFIG_KUNIT`` built-in, ``CONFIG_KASAN_KUNIT_TEST`` can be built-in
+on any architecture that supports KASAN. These and any other KUnit
+tests enabled will run and print the results at boot as a late-init
+call.
+
+(3) Using kunit_tool
+~~~~~~~~~~~~~~~~~~~~~
+
+With ``CONFIG_KUNIT`` and ``CONFIG_KASAN_KUNIT_TEST`` built-in, we can also
+use kunit_tool to see the results of these along with other KUnit
+tests in a more readable way. This will not print the KASAN reports
+of tests that passed. See the `KUnit documentation <https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html>`_ for more up-to-date
+information on kunit_tool.
+
+.. _KUnit: https://www.kernel.org/doc/html/latest/dev-tools/kunit/index.html
+
+``CONFIG_TEST_KASAN_MODULE`` is a set of KASAN tests that could not be
+converted to KUnit. These tests can be run only as a module with
+``CONFIG_TEST_KASAN_MODULE`` built as a loadable module and
+``CONFIG_KASAN`` built-in. The type of error expected and the
+function being run are printed before the expression expected to give
+an error. The error is then printed, if found, and the test
+should be interpreted as passing only if the error matches the one
+expected by the test.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 124/181] mm: kasan: do not panic if both panic_on_warn and kasan_multishot set
  2020-10-13 23:46 incoming Andrew Morton
                   ` (122 preceding siblings ...)
  2020-10-13 23:55 ` [patch 123/181] KASAN: Testing Documentation Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 125/181] mm/page_alloc: tweak comments in has_unmovable_pages() Andrew Morton
                   ` (60 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, andreyknvl, aryabinin, brendanhiggins,
	davidgow, dvyukov, juri.lelli, linux-mm, mingo, mm-commits,
	shuah, torvalds, trishalfonso, vincent.guittot

From: David Gow <davidgow@google.com>
Subject: mm: kasan: do not panic if both panic_on_warn and kasan_multishot set

KASAN errors will currently trigger a panic when panic_on_warn is set. 
This renders kasan_multishot useless, as further KASAN errors won't be
reported if the kernel has already panicked.  By making kasan_multishot
disable this behaviour for KASAN errors, we can still have the benefits of
panic_on_warn for non-KASAN warnings, yet be able to use kasan_multishot.

This is particularly important when running KASAN tests, which need to
trigger multiple KASAN errors: previously these would panic the system if
panic_on_warn was set; now they can run (and will panic the system should
non-KASAN warnings show up).
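
As an illustration (a sketch assuming the existing "panic_on_warn" and
"kasan_multi_shot" kernel command-line parameters, not something added
by this patch), booting with:

	panic_on_warn=1 kasan_multi_shot

now lets multiple KASAN reports be printed, while any non-KASAN WARN()
still panics the machine as before.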

Link: https://lkml.kernel.org/r/20200915035828.570483-6-davidgow@google.com
Link: https://lkml.kernel.org/r/20200910070331.3358048-6-davidgow@google.com
Signed-off-by: David Gow <davidgow@google.com>
Reviewed-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Brendan Higgins <brendanhiggins@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Patricia Alfonso <trishalfonso@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/kasan/report.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/kasan/report.c~mm-kasan-do-not-panic-if-both-panic_on_warn-and-kasan_multishot-set
+++ a/mm/kasan/report.c
@@ -95,7 +95,7 @@ static void end_report(unsigned long *fl
 	pr_err("==================================================================\n");
 	add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
 	spin_unlock_irqrestore(&report_lock, *flags);
-	if (panic_on_warn) {
+	if (panic_on_warn && !test_bit(KASAN_BIT_MULTI_SHOT, &kasan_flags)) {
 		/*
 		 * This thread may hit another WARN() in the panic path.
 		 * Resetting this prevents additional WARN() from panicking the
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 125/181] mm/page_alloc: tweak comments in has_unmovable_pages()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (123 preceding siblings ...)
  2020-10-13 23:55 ` [patch 124/181] mm: kasan: do not panic if both panic_on_warn and kasan_multishot set Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 126/181] mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate() Andrew Morton
                   ` (59 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_alloc: tweak comments in has_unmovable_pages()

Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5.

When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather
unclear, which is why we special-cased ZONE_MOVABLE such that partially
plugged blocks would never end up in ZONE_MOVABLE.

Now that the semantics are much clearer (and are documented in patch #6),
let's support partially plugged memory blocks in ZONE_MOVABLE, allowing
partially plugged memory blocks to be onlined to ZONE_MOVABLE and also
allowing unplugging from such memory blocks.  This avoids surprises when
onlining a memory block suddenly fails, just because it is not completely
populated by virtio-mem (yet).

This is especially helpful for testing, but also paves the way for
virtio-mem optimizations, allowing more memory to get reliably unplugged.

Cleanup has_unmovable_pages() and set_migratetype_isolate(), providing
better documentation of how ZONE_MOVABLE interacts with different kinds of
unmovable pages (memory offlining vs.  alloc_contig_range()).


This patch (of 6):

Let's move the split comment regarding bootmem allocations and memory
holes, especially in the context of ZONE_MOVABLE, to the PageReserved()
check.

Link: http://lkml.kernel.org/r/20200816125333.7434-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200816125333.7434-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   22 ++++++----------------
 1 file changed, 6 insertions(+), 16 deletions(-)

--- a/mm/page_alloc.c~mm-page_alloc-tweak-comments-in-has_unmovable_pages
+++ a/mm/page_alloc.c
@@ -8235,14 +8235,6 @@ struct page *has_unmovable_pages(struct
 	unsigned long iter = 0;
 	unsigned long pfn = page_to_pfn(page);
 
-	/*
-	 * TODO we could make this much more efficient by not checking every
-	 * page in the range if we know all of them are in MOVABLE_ZONE and
-	 * that the movable zone guarantees that pages are migratable but
-	 * the later is not the case right now unfortunatelly. E.g. movablecore
-	 * can still lead to having bootmem allocations in zone_movable.
-	 */
-
 	if (is_migrate_cma_page(page)) {
 		/*
 		 * CMA allocations (alloc_contig_range) really need to mark
@@ -8261,6 +8253,12 @@ struct page *has_unmovable_pages(struct
 
 		page = pfn_to_page(pfn + iter);
 
+		/*
+		 * Both, bootmem allocations and memory holes are marked
+		 * PG_reserved and are unmovable. We can even have unmovable
+		 * allocations inside ZONE_MOVABLE, for example when
+		 * specifying "movablecore".
+		 */
 		if (PageReserved(page))
 			return page;
 
@@ -8334,14 +8332,6 @@ struct page *has_unmovable_pages(struct
 		 * it.  But now, memory offline itself doesn't call
 		 * shrink_node_slabs() and it still to be fixed.
 		 */
-		/*
-		 * If the page is not RAM, page_count()should be 0.
-		 * we don't need more check. This is an _used_ not-movable page.
-		 *
-		 * The problematic thing here is PG_reserved pages. PG_reserved
-		 * is set to both of a memory hole page and a _used_ kernel
-		 * page at boot.
-		 */
 		return page;
 	}
 	return NULL;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 126/181] mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (124 preceding siblings ...)
  2020-10-13 23:55 ` [patch 125/181] mm/page_alloc: tweak comments in has_unmovable_pages() Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 127/181] mm/page_isolation: drop WARN_ON_ONCE() " Andrew Morton
                   ` (58 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate()

Right now, if we have two isolations racing on a pageblock that's in the
MOVABLE zone, we would trigger the WARN_ON_ONCE().  Let's just return
directly, simplifying error handling.

The change was introduced in commit 3d680bdf60a5 ("mm/page_isolation: fix
potential warning from user").  As far as I can see, we currently don't
have alloc_contig_range() users that use ZONE_MOVABLE (anymore), so
it's currently more a cleanup and a preparation for the future than a fix.
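
A rough sketch of the race being handled (illustrative only, not code
from this patch):

	CPU0                              CPU1
	set_migratetype_isolate()
	  marks pageblock MIGRATE_ISOLATE
	                                  set_migratetype_isolate()
	                                    is_migrate_isolate_page() is
	                                    true -> return -EBUSY directly,
	                                    instead of tripping the
	                                    WARN_ON_ONCE() on ZONE_MOVABLE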

Link: http://lkml.kernel.org/r/20200816125333.7434-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_isolation.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

--- a/mm/page_isolation.c~mm-page_isolation-exit-early-when-pageblock-is-isolated-in-set_migratetype_isolate
+++ a/mm/page_isolation.c
@@ -29,10 +29,12 @@ static int set_migratetype_isolate(struc
 	/*
 	 * We assume the caller intended to SET migrate type to isolate.
 	 * If it is already set, then someone else must have raced and
-	 * set it before us.  Return -EBUSY
+	 * set it before us.
 	 */
-	if (is_migrate_isolate_page(page))
-		goto out;
+	if (is_migrate_isolate_page(page)) {
+		spin_unlock_irqrestore(&zone->lock, flags);
+		return -EBUSY;
+	}
 
 	/*
 	 * FIXME: Now, memory hotplug doesn't call shrink_slab() by itself.
@@ -52,7 +54,6 @@ static int set_migratetype_isolate(struc
 		ret = 0;
 	}
 
-out:
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (!ret) {
 		drain_all_pages(zone);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 127/181] mm/page_isolation: drop WARN_ON_ONCE() in set_migratetype_isolate()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (125 preceding siblings ...)
  2020-10-13 23:55 ` [patch 126/181] mm/page_isolation: exit early when pageblock is isolated in set_migratetype_isolate() Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 128/181] mm/page_isolation: cleanup set_migratetype_isolate() Andrew Morton
                   ` (57 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_isolation: drop WARN_ON_ONCE() in set_migratetype_isolate()

Inside has_unmovable_pages(), we have a comment describing how unmovable
data could end up in ZONE_MOVABLE - via "movablecore".  Also, besides
checking if the first page in the pageblock is reserved, we don't perform
any further checks in case of ZONE_MOVABLE.

In case of memory offlining, we set REPORT_FAILURE, properly dump_page()
the page and handle the error gracefully.  alloc_contig_pages() users
currently never allocate from ZONE_MOVABLE.  E.g., hugetlb uses
alloc_contig_pages() for the allocation of gigantic pages only, which will
never end up in the MOVABLE zone (see htlb_alloc_mask()).

Link: http://lkml.kernel.org/r/20200816125333.7434-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_isolation.c |   15 ++++++---------
 1 file changed, 6 insertions(+), 9 deletions(-)

--- a/mm/page_isolation.c~mm-page_isolation-drop-warn_on_once-in-set_migratetype_isolate
+++ a/mm/page_isolation.c
@@ -57,15 +57,12 @@ static int set_migratetype_isolate(struc
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (!ret) {
 		drain_all_pages(zone);
-	} else {
-		WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
-
-		if ((isol_flags & REPORT_FAILURE) && unmovable)
-			/*
-			 * printk() with zone->lock held will likely trigger a
-			 * lockdep splat, so defer it here.
-			 */
-			dump_page(unmovable, "unmovable page");
+	} else if ((isol_flags & REPORT_FAILURE) && unmovable) {
+		/*
+		 * printk() with zone->lock held will likely trigger a
+		 * lockdep splat, so defer it here.
+		 */
+		dump_page(unmovable, "unmovable page");
 	}
 
 	return ret;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 128/181] mm/page_isolation: cleanup set_migratetype_isolate()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (126 preceding siblings ...)
  2020-10-13 23:55 ` [patch 127/181] mm/page_isolation: drop WARN_ON_ONCE() " Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 129/181] virtio-mem: don't special-case ZONE_MOVABLE Andrew Morton
                   ` (56 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm/page_isolation: cleanup set_migratetype_isolate()

Let's clean it up a bit, simplifying the exit paths.

Link: http://lkml.kernel.org/r/20200816125333.7434-5-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_isolation.c |   17 +++++++----------
 1 file changed, 7 insertions(+), 10 deletions(-)

--- a/mm/page_isolation.c~mm-page_isolation-cleanup-set_migratetype_isolate
+++ a/mm/page_isolation.c
@@ -17,12 +17,9 @@
 
 static int set_migratetype_isolate(struct page *page, int migratetype, int isol_flags)
 {
-	struct page *unmovable = NULL;
-	struct zone *zone;
+	struct zone *zone = page_zone(page);
+	struct page *unmovable;
 	unsigned long flags;
-	int ret = -EBUSY;
-
-	zone = page_zone(page);
 
 	spin_lock_irqsave(&zone->lock, flags);
 
@@ -51,13 +48,13 @@ static int set_migratetype_isolate(struc
 									NULL);
 
 		__mod_zone_freepage_state(zone, -nr_pages, mt);
-		ret = 0;
+		spin_unlock_irqrestore(&zone->lock, flags);
+		drain_all_pages(zone);
+		return 0;
 	}
 
 	spin_unlock_irqrestore(&zone->lock, flags);
-	if (!ret) {
-		drain_all_pages(zone);
-	} else if ((isol_flags & REPORT_FAILURE) && unmovable) {
+	if (isol_flags & REPORT_FAILURE) {
 		/*
 		 * printk() with zone->lock held will likely trigger a
 		 * lockdep splat, so defer it here.
@@ -65,7 +62,7 @@ static int set_migratetype_isolate(struc
 		dump_page(unmovable, "unmovable page");
 	}
 
-	return ret;
+	return -EBUSY;
 }
 
 static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 129/181] virtio-mem: don't special-case ZONE_MOVABLE
  2020-10-13 23:46 incoming Andrew Morton
                   ` (127 preceding siblings ...)
  2020-10-13 23:55 ` [patch 128/181] mm/page_isolation: cleanup set_migratetype_isolate() Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 130/181] mm: document semantics of ZONE_MOVABLE Andrew Morton
                   ` (55 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: virtio-mem: don't special-case ZONE_MOVABLE

When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather
unclear, which is why we special-cased ZONE_MOVABLE such that partially
plugged blocks would never end up in ZONE_MOVABLE.

Now that the semantics are much clearer (and will be documented in a
follow-up patch including the new virtio-mem behavior), let's allow
onlining partially plugged memory blocks to ZONE_MOVABLE and also consider
memory blocks that were onlined to ZONE_MOVABLE when unplugging memory. 
While unplugged memory pages are, in general, unmovable, they can be
skipped when offlining memory.

virtio-mem only unplugs fairly big chunks (in the megabyte range) and
tries to shrink the memory region rather than choosing memory at random.
In theory, if all other pages in the movable zone were movable,
virtio-mem would only shrink that zone and not create any kind of
fragmentation.

In the future, we might want to remember the zone again and use the
information when (un)plugging memory.  For now, let's keep it simple.

Note: Support for defragmentation is planned, to deal with fragmentation
after unplug due to memory chunks within memory blocks that could not get
unplugged before (e.g., somebody pinning pages within ZONE_MOVABLE for a
longer time).

Link: http://lkml.kernel.org/r/20200816125333.7434-6-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 drivers/virtio/virtio_mem.c |   47 +++++-----------------------------
 1 file changed, 8 insertions(+), 39 deletions(-)

--- a/drivers/virtio/virtio_mem.c~virtio-mem-dont-special-case-zone_movable
+++ a/drivers/virtio/virtio_mem.c
@@ -36,18 +36,10 @@ enum virtio_mem_mb_state {
 	VIRTIO_MEM_MB_STATE_OFFLINE,
 	/* Partially plugged, fully added to Linux, offline. */
 	VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL,
-	/* Fully plugged, fully added to Linux, online (!ZONE_MOVABLE). */
+	/* Fully plugged, fully added to Linux, online. */
 	VIRTIO_MEM_MB_STATE_ONLINE,
-	/* Partially plugged, fully added to Linux, online (!ZONE_MOVABLE). */
+	/* Partially plugged, fully added to Linux, online. */
 	VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL,
-	/*
-	 * Fully plugged, fully added to Linux, online (ZONE_MOVABLE).
-	 * We are not allowed to allocate (unplug) parts of this block that
-	 * are not movable (similar to gigantic pages). We will never allow
-	 * to online OFFLINE_PARTIAL to ZONE_MOVABLE (as they would contain
-	 * unmovable parts).
-	 */
-	VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE,
 	VIRTIO_MEM_MB_STATE_COUNT
 };
 
@@ -526,21 +518,10 @@ static bool virtio_mem_owned_mb(struct v
 }
 
 static int virtio_mem_notify_going_online(struct virtio_mem *vm,
-					  unsigned long mb_id,
-					  enum zone_type zone)
+					  unsigned long mb_id)
 {
 	switch (virtio_mem_mb_get_state(vm, mb_id)) {
 	case VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL:
-		/*
-		 * We won't allow to online a partially plugged memory block
-		 * to the MOVABLE zone - it would contain unmovable parts.
-		 */
-		if (zone == ZONE_MOVABLE) {
-			dev_warn_ratelimited(&vm->vdev->dev,
-					     "memory block has holes, MOVABLE not supported\n");
-			return NOTIFY_BAD;
-		}
-		return NOTIFY_OK;
 	case VIRTIO_MEM_MB_STATE_OFFLINE:
 		return NOTIFY_OK;
 	default:
@@ -560,7 +541,6 @@ static void virtio_mem_notify_offline(st
 					VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL);
 		break;
 	case VIRTIO_MEM_MB_STATE_ONLINE:
-	case VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE:
 		virtio_mem_mb_set_state(vm, mb_id,
 					VIRTIO_MEM_MB_STATE_OFFLINE);
 		break;
@@ -579,24 +559,17 @@ static void virtio_mem_notify_offline(st
 	virtio_mem_retry(vm);
 }
 
-static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id,
-				     enum zone_type zone)
+static void virtio_mem_notify_online(struct virtio_mem *vm, unsigned long mb_id)
 {
 	unsigned long nb_offline;
 
 	switch (virtio_mem_mb_get_state(vm, mb_id)) {
 	case VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL:
-		BUG_ON(zone == ZONE_MOVABLE);
 		virtio_mem_mb_set_state(vm, mb_id,
 					VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL);
 		break;
 	case VIRTIO_MEM_MB_STATE_OFFLINE:
-		if (zone == ZONE_MOVABLE)
-			virtio_mem_mb_set_state(vm, mb_id,
-					    VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE);
-		else
-			virtio_mem_mb_set_state(vm, mb_id,
-						VIRTIO_MEM_MB_STATE_ONLINE);
+		virtio_mem_mb_set_state(vm, mb_id, VIRTIO_MEM_MB_STATE_ONLINE);
 		break;
 	default:
 		BUG();
@@ -675,7 +648,6 @@ static int virtio_mem_memory_notifier_cb
 	const unsigned long start = PFN_PHYS(mhp->start_pfn);
 	const unsigned long size = PFN_PHYS(mhp->nr_pages);
 	const unsigned long mb_id = virtio_mem_phys_to_mb_id(start);
-	enum zone_type zone;
 	int rc = NOTIFY_OK;
 
 	if (!virtio_mem_overlaps_range(vm, start, size))
@@ -717,8 +689,7 @@ static int virtio_mem_memory_notifier_cb
 			break;
 		}
 		vm->hotplug_active = true;
-		zone = page_zonenum(pfn_to_page(mhp->start_pfn));
-		rc = virtio_mem_notify_going_online(vm, mb_id, zone);
+		rc = virtio_mem_notify_going_online(vm, mb_id);
 		break;
 	case MEM_OFFLINE:
 		virtio_mem_notify_offline(vm, mb_id);
@@ -726,8 +697,7 @@ static int virtio_mem_memory_notifier_cb
 		mutex_unlock(&vm->hotplug_mutex);
 		break;
 	case MEM_ONLINE:
-		zone = page_zonenum(pfn_to_page(mhp->start_pfn));
-		virtio_mem_notify_online(vm, mb_id, zone);
+		virtio_mem_notify_online(vm, mb_id);
 		vm->hotplug_active = false;
 		mutex_unlock(&vm->hotplug_mutex);
 		break;
@@ -1906,8 +1876,7 @@ static void virtio_mem_remove(struct vir
 	if (vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE] ||
 	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_OFFLINE_PARTIAL] ||
 	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE] ||
-	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL] ||
-	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_MOVABLE]) {
+	    vm->nb_mb_state[VIRTIO_MEM_MB_STATE_ONLINE_PARTIAL]) {
 		dev_warn(&vdev->dev, "device still has system memory added\n");
 	} else {
 		virtio_mem_delete_resource(vm);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 130/181] mm: document semantics of ZONE_MOVABLE
  2020-10-13 23:46 incoming Andrew Morton
                   ` (128 preceding siblings ...)
  2020-10-13 23:55 ` [patch 129/181] virtio-mem: don't special-case ZONE_MOVABLE Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 131/181] mm, isolation: avoid checking unmovable pages across pageblock boundary Andrew Morton
                   ` (54 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, bhe, cai, david, jasowang, linux-mm, mhocko, mike.kravetz,
	mm-commits, mst, pankaj.gupta.linux, rppt, torvalds

From: David Hildenbrand <david@redhat.com>
Subject: mm: document semantics of ZONE_MOVABLE

Let's document what ZONE_MOVABLE means, how it's used, and which special
cases we have regarding unmovable pages (memory offlining vs.  migration /
allocations).

Link: http://lkml.kernel.org/r/20200816125333.7434-7-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |   35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

--- a/include/linux/mmzone.h~mm-document-semantics-of-zone_movable
+++ a/include/linux/mmzone.h
@@ -396,6 +396,41 @@ enum zone_type {
 	 */
 	ZONE_HIGHMEM,
 #endif
+	/*
+	 * ZONE_MOVABLE is similar to ZONE_NORMAL, except that it contains
+	 * movable pages with few exceptional cases described below. Main use
+	 * cases for ZONE_MOVABLE are to make memory offlining/unplug more
+	 * likely to succeed, and to locally limit unmovable allocations - e.g.,
+	 * to increase the number of THP/huge pages. Notable special cases are:
+	 *
+	 * 1. Pinned pages: (long-term) pinning of movable pages might
+	 *    essentially turn such pages unmovable. Memory offlining might
+	 *    retry a long time.
+	 * 2. memblock allocations: kernelcore/movablecore setups might create
+	 *    situations where ZONE_MOVABLE contains unmovable allocations
+	 *    after boot. Memory offlining and allocations fail early.
+	 * 3. Memory holes: kernelcore/movablecore setups might create very rare
+	 *    situations where ZONE_MOVABLE contains memory holes after boot,
+	 *    for example, if we have sections that are only partially
+	 *    populated. Memory offlining and allocations fail early.
+	 * 4. PG_hwpoison pages: while poisoned pages can be skipped during
+	 *    memory offlining, such pages cannot be allocated.
+	 * 5. Unmovable PG_offline pages: in paravirtualized environments,
+	 *    hotplugged memory blocks might only partially be managed by the
+	 *    buddy (e.g., via XEN-balloon, Hyper-V balloon, virtio-mem). The
+	 *    parts not managed by the buddy are unmovable PG_offline pages. In
+	 *    some cases (virtio-mem), such pages can be skipped during
+	 *    memory offlining, however, cannot be moved/allocated. These
+	 *    techniques might use alloc_contig_range() to hide previously
+	 *    exposed pages from the buddy again (e.g., to implement some sort
+	 *    of memory unplug in virtio-mem).
+	 *
+	 * In general, no unmovable allocations that degrade memory offlining
+	 * should end up in ZONE_MOVABLE. Allocators (like alloc_contig_range())
+	 * have to expect that migrating pages in ZONE_MOVABLE can fail (even
+	 * if has_unmovable_pages() states that there are no unmovable pages,
+	 * there can be false negatives).
+	 */
 	ZONE_MOVABLE,
 #ifdef CONFIG_ZONE_DEVICE
 	ZONE_DEVICE,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 131/181] mm, isolation: avoid checking unmovable pages across pageblock boundary
  2020-10-13 23:46 incoming Andrew Morton
                   ` (129 preceding siblings ...)
  2020-10-13 23:55 ` [patch 130/181] mm: document semantics of ZONE_MOVABLE Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 132/181] mm/page_alloc.c: clean code by removing unnecessary initialization Andrew Morton
                   ` (53 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, david, linux-mm, lixinhai.lxh, mhocko, mm-commits,
	osalvador, torvalds

From: Li Xinhai <lixinhai.lxh@gmail.com>
Subject: mm, isolation: avoid checking unmovable pages across pageblock boundary

In has_unmovable_pages(), the page parameter would not always be the first
page within a pageblock (see how the page pointer is passed in from
start_isolate_page_range() after calling __first_valid_page()), which would
cause the check for unmovable pages to span two pageblocks.

After this patch, the check is confined to a single pageblock whether or
not the page is the first one in it, obeying the semantics of this function.

This issue is found by code inspection.

Michal said "this might lead to false negatives when an unrelated block
would cause an isolation failure".
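
A worked example of the new loop bound (a sketch assuming a typical
pageblock_nr_pages of 512; the numbers are purely illustrative):

	unsigned long offset = pfn % pageblock_nr_pages; /* 1012 % 512 = 500 */

	/*
	 * Old bound: iter in [0, 512) scans pfn 1012..1523, crossing into
	 * the next pageblock at pfn 1024.
	 * New bound: iter in [0, 512 - 500) scans pfn 1012..1023 only,
	 * stopping exactly at the pageblock boundary.
	 */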

Link: https://lkml.kernel.org/r/20200824065811.383266-1-lixinhai.lxh@gmail.com
Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/mm/page_alloc.c~mm-isolation-avoid-checking-unmovable-pages-across-pageblock-boundary
+++ a/mm/page_alloc.c
@@ -8234,6 +8234,7 @@ struct page *has_unmovable_pages(struct
 {
 	unsigned long iter = 0;
 	unsigned long pfn = page_to_pfn(page);
+	unsigned long offset = pfn % pageblock_nr_pages;
 
 	if (is_migrate_cma_page(page)) {
 		/*
@@ -8247,7 +8248,7 @@ struct page *has_unmovable_pages(struct
 		return page;
 	}
 
-	for (; iter < pageblock_nr_pages; iter++) {
+	for (; iter < pageblock_nr_pages - offset; iter++) {
 		if (!pfn_valid_within(pfn + iter))
 			continue;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 132/181] mm/page_alloc.c: clean code by removing unnecessary initialization
  2020-10-13 23:46 incoming Andrew Morton
                   ` (130 preceding siblings ...)
  2020-10-13 23:55 ` [patch 131/181] mm, isolation: avoid checking unmovable pages across pageblock boundary Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 133/181] mm/page_alloc.c: micro-optimization remove unnecessary branch Andrew Morton
                   ` (52 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mm-commits, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/page_alloc.c: clean code by removing unnecessary initialization

Previously the variable 'tmp' was initialized, but was never read before
being reassigned, so the initialization can be removed.
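
The pattern being removed, in miniature (an illustrative sketch, not
the full function):

	const struct cpumask *tmp = cpumask_of_node(0);	/* dead store: never read */

	/* ... the first actual use unconditionally reassigns it: */
	tmp = cpumask_of_node(n);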

[akpm@linux-foundation.org: remove `tmp' altogether]
Link: https://lkml.kernel.org/r/20200904132422.17387-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-clean-code-by-removing-unnecessary-initialization
+++ a/mm/page_alloc.c
@@ -5651,7 +5651,6 @@ static int find_next_best_node(int node,
 	int n, val;
 	int min_val = INT_MAX;
 	int best_node = NUMA_NO_NODE;
-	const struct cpumask *tmp = cpumask_of_node(0);
 
 	/* Use the local node if we haven't already */
 	if (!node_isset(node, *used_node_mask)) {
@@ -5672,8 +5671,7 @@ static int find_next_best_node(int node,
 		val += (n < node);
 
 		/* Give preference to headless and unused nodes */
-		tmp = cpumask_of_node(n);
-		if (!cpumask_empty(tmp))
+		if (!cpumask_empty(cpumask_of_node(n)))
 			val += PENALTY_FOR_NODE_WITH_CPUS;
 
 		/* Slight preference for less loaded node */
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 133/181] mm/page_alloc.c: micro-optimization remove unnecessary branch
  2020-10-13 23:46 incoming Andrew Morton
                   ` (131 preceding siblings ...)
  2020-10-13 23:55 ` [patch 132/181] mm/page_alloc.c: clean code by removing unnecessary initialization Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 134/181] mm/page_alloc.c: fix early params garbage value accesses Andrew Morton
                   ` (51 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mm-commits, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/page_alloc.c: micro-optimization remove unnecessary branch

Previously the flags check was split into two separate checks with two
separate branches.  If either of the two flags is present, the same effect
on control flow occurs, so the checks can be merged and one branch
avoided, as sketched below.
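
In miniature, the transformation is (a sketch, not the full kernel
context):

	if (gfp_mask & __GFP_RETRY_MAYFAIL)	/* branch 1 */
		goto out;
	/* ... */
	if (gfp_mask & __GFP_THISNODE)		/* branch 2 */
		goto out;

becomes the single test

	if (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_THISNODE))
		goto out;

which is equivalent because (mask & X) || (mask & Y) is true exactly
when mask & (X | Y) is non-zero.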

Link: https://lkml.kernel.org/r/20200911092310.31136-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-micro-optimization-remove-unnecessary-branch
+++ a/mm/page_alloc.c
@@ -3986,8 +3986,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	 * success so it is time to admit defeat. We will skip the OOM killer
 	 * because it is very likely that the caller has a more reasonable
 	 * fallback than shooting a random task.
+	 *
+	 * The OOM killer may not free memory on a specific node.
 	 */
-	if (gfp_mask & __GFP_RETRY_MAYFAIL)
+	if (gfp_mask & (__GFP_RETRY_MAYFAIL | __GFP_THISNODE))
 		goto out;
 	/* The OOM killer does not needlessly kill tasks for lowmem */
 	if (ac->highest_zoneidx < ZONE_NORMAL)
@@ -4004,10 +4006,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
 	 * failures more gracefully we should just bail out here.
 	 */
 
-	/* The OOM killer may not free memory on a specific node */
-	if (gfp_mask & __GFP_THISNODE)
-		goto out;
-
 	/* Exhausted what can be done so it's blame time */
 	if (out_of_memory(&oc) || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) {
 		*did_some_progress = 1;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 134/181] mm/page_alloc.c: fix early params garbage value accesses
  2020-10-13 23:46 incoming Andrew Morton
                   ` (132 preceding siblings ...)
  2020-10-13 23:55 ` [patch 133/181] mm/page_alloc.c: micro-optimization remove unnecessary branch Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 135/181] mm/page_alloc.c: clean code by merging two functions Andrew Morton
                   ` (50 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mm-commits, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/page_alloc.c: fix early params garbage value accesses

Previously, in '__init early_init_on_alloc' and '__init early_init_on_free',
the return value from 'kstrtobool' was not handled properly, which could
cause a garbage value to be read from the variable 'bool_result'.  This
patch fixes the error handling.
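
For illustration, a minimal sketch of the corrected pattern
('early_example_param' is a hypothetical handler, not part of this
patch):

  static int __init early_example_param(char *buf)
  {
          bool enable;
          int ret;

          /* kstrtobool() rejects a NULL buffer itself with -EINVAL */
          ret = kstrtobool(buf, &enable);
          if (ret)
                  return ret;     /* propagate the parse error */

          /* only from here on does 'enable' hold a parsed value */
          return 0;
  }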

Link: https://lkml.kernel.org/r/20200916214125.28271-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-fix-early-params-garbage-value-accesses
+++ a/mm/page_alloc.c
@@ -156,16 +156,16 @@ static int __init early_init_on_alloc(ch
 	int ret;
 	bool bool_result;
 
-	if (!buf)
-		return -EINVAL;
 	ret = kstrtobool(buf, &bool_result);
+	if (ret)
+		return ret;
 	if (bool_result && page_poisoning_enabled())
 		pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc\n");
 	if (bool_result)
 		static_branch_enable(&init_on_alloc);
 	else
 		static_branch_disable(&init_on_alloc);
-	return ret;
+	return 0;
 }
 early_param("init_on_alloc", early_init_on_alloc);
 
@@ -174,16 +174,16 @@ static int __init early_init_on_free(cha
 	int ret;
 	bool bool_result;
 
-	if (!buf)
-		return -EINVAL;
 	ret = kstrtobool(buf, &bool_result);
+	if (ret)
+		return ret;
 	if (bool_result && page_poisoning_enabled())
 		pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_free\n");
 	if (bool_result)
 		static_branch_enable(&init_on_free);
 	else
 		static_branch_disable(&init_on_free);
-	return ret;
+	return 0;
 }
 early_param("init_on_free", early_init_on_free);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 135/181] mm/page_alloc.c: clean code by merging two functions
  2020-10-13 23:46 incoming Andrew Morton
                   ` (133 preceding siblings ...)
  2020-10-13 23:55 ` [patch 134/181] mm/page_alloc.c: fix early params garbage value accesses Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 136/181] mm/page_alloc.c: __perform_reclaim should return 'unsigned long' Andrew Morton
                   ` (49 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mgorman, mm-commits, rppt, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/page_alloc.c: clean code by merging two functions

finalise_ac() is just an 'epilogue' for prepare_alloc_pages(), so there is
no need to keep both; the contents of finalise_ac() can be merged into
prepare_alloc_pages().  This makes __alloc_pages_nodemask() more readable.

Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |   10 ++--------
 1 file changed, 2 insertions(+), 8 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-clean-code-by-merging-two-functions
+++ a/mm/page_alloc.c
@@ -4838,12 +4838,6 @@ static inline bool prepare_alloc_pages(g
 
 	*alloc_flags = current_alloc_flags(gfp_mask, *alloc_flags);
 
-	return true;
-}
-
-/* Determine whether to spread dirty pages and what the first usable zone */
-static inline void finalise_ac(gfp_t gfp_mask, struct alloc_context *ac)
-{
 	/* Dirty zone balancing only done in the fast path */
 	ac->spread_dirty_pages = (gfp_mask & __GFP_WRITE);
 
@@ -4854,6 +4848,8 @@ static inline void finalise_ac(gfp_t gfp
 	 */
 	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->highest_zoneidx, ac->nodemask);
+
+	return true;
 }
 
 /*
@@ -4882,8 +4878,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, u
 	if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
 		return NULL;
 
-	finalise_ac(gfp_mask, &ac);
-
 	/*
 	 * Forbid the first pass from falling back to types that fragment
 	 * memory until all local zones are considered.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 136/181] mm/page_alloc.c: __perform_reclaim should return 'unsigned long'
  2020-10-13 23:46 incoming Andrew Morton
                   ` (134 preceding siblings ...)
  2020-10-13 23:55 ` [patch 135/181] mm/page_alloc.c: clean code by merging two functions Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:55 ` [patch 137/181] mmzone: clean code by removing unused macro parameter Andrew Morton
                   ` (48 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, torvalds, yanfei.xu

From: Yanfei Xu <yanfei.xu@windriver.com>
Subject: mm/page_alloc.c: __perform_reclaim should return 'unsigned long'

__perform_reclaim()'s single caller expects it to return 'unsigned long',
hence change its return value and a local variable to 'unsigned long'.

Link: https://lkml.kernel.org/r/20200916022138.16740-1-yanfei.xu@windriver.com
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/page_alloc.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/mm/page_alloc.c~mm-page_allocc-__perform_reclaim-should-return-unsigned-long
+++ a/mm/page_alloc.c
@@ -4253,13 +4253,12 @@ EXPORT_SYMBOL_GPL(fs_reclaim_release);
 #endif
 
 /* Perform direct synchronous page reclaim */
-static int
+static unsigned long
 __perform_reclaim(gfp_t gfp_mask, unsigned int order,
 					const struct alloc_context *ac)
 {
-	int progress;
 	unsigned int noreclaim_flag;
-	unsigned long pflags;
+	unsigned long pflags, progress;
 
 	cond_resched();
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 137/181] mmzone: clean code by removing unused macro parameter
  2020-10-13 23:46 incoming Andrew Morton
                   ` (135 preceding siblings ...)
  2020-10-13 23:55 ` [patch 136/181] mm/page_alloc.c: __perform_reclaim should return 'unsigned long' Andrew Morton
@ 2020-10-13 23:55 ` Andrew Morton
  2020-10-13 23:56 ` [patch 138/181] mm: move call to compound_head() in release_pages() Andrew Morton
                   ` (47 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:55 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mm-commits, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mmzone: clean code by removing unused macro parameter

The 'for_next_zone_zonelist_nodemask' macro parameter 'zlist' was unused,
so this patch removes it.

Link: https://lkml.kernel.org/r/20200917211906.30059-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/mmzone.h |    2 +-
 mm/page_alloc.c        |    4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

--- a/include/linux/mmzone.h~mmzone-clean-code-by-removing-unused-macro-parameter
+++ a/include/linux/mmzone.h
@@ -1116,7 +1116,7 @@ static inline struct zoneref *first_zone
 		z = next_zones_zonelist(++z, highidx, nodemask),	\
 			zone = zonelist_zone(z))
 
-#define for_next_zone_zonelist_nodemask(zone, z, zlist, highidx, nodemask) \
+#define for_next_zone_zonelist_nodemask(zone, z, highidx, nodemask) \
 	for (zone = z->zone;	\
 		zone;							\
 		z = next_zones_zonelist(++z, highidx, nodemask),	\
--- a/mm/page_alloc.c~mmzone-clean-code-by-removing-unused-macro-parameter
+++ a/mm/page_alloc.c
@@ -3741,8 +3741,8 @@ retry:
 	 */
 	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
 	z = ac->preferred_zoneref;
-	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist,
-					ac->highest_zoneidx, ac->nodemask) {
+	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
+					ac->nodemask) {
 		struct page *page;
 		unsigned long mark;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 138/181] mm: move call to compound_head() in release_pages()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (136 preceding siblings ...)
  2020-10-13 23:55 ` [patch 137/181] mmzone: clean code by removing unused macro parameter Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 139/181] mm/page_alloc.c: fix freeing non-compound pages Andrew Morton
                   ` (46 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, dan.j.williams, hch, linux-mm, mm-commits, rcampbell,
	torvalds, willy, yuzhao

From: Ralph Campbell <rcampbell@nvidia.com>
Subject: mm: move call to compound_head() in release_pages()

The function is_huge_zero_page() doesn't call compound_head() to make sure
the page pointer is a head page.  The call to is_huge_zero_page() in
release_pages() is made before compound_head() is called, so if
release_pages() were called with a tail page of the huge_zero_page, the
test would fail and put_page_testzero() would be called, releasing the
page.  This is unlikely to happen in normal use or we would be seeing all
sorts of process data corruption when accessing a THP zero page.

All other callers of is_huge_zero_page() appear to pass only head pages,
so the right fix is to move the call to compound_head() in release_pages()
to a point before is_huge_zero_page() is called.
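
For reference, a simplified sketch of why the check fails on tail pages
(modeled on the current helper; details elided):

  /* is_huge_zero_page() is (roughly) a direct pointer comparison ... */
  static inline bool is_huge_zero_page(struct page *page)
  {
          return READ_ONCE(huge_zero_page) == page;
  }
  /* ... so a tail page of the huge zero page never compares equal. */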

Link: https://lkml.kernel.org/r/20200917173938.16420-1-rcampbell@nvidia.com
Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/swap.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/swap.c~mm-move-call-to-compound_head-in-release_pages
+++ a/mm/swap.c
@@ -889,6 +889,7 @@ void release_pages(struct page **pages,
 			locked_pgdat = NULL;
 		}
 
+		page = compound_head(page);
 		if (is_huge_zero_page(page))
 			continue;
 
@@ -910,7 +911,6 @@ void release_pages(struct page **pages,
 			}
 		}
 
-		page = compound_head(page);
 		if (!put_page_testzero(page))
 			continue;
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 139/181] mm/page_alloc.c: fix freeing non-compound pages
  2020-10-13 23:46 incoming Andrew Morton
                   ` (137 preceding siblings ...)
  2020-10-13 23:56 ` [patch 138/181] mm: move call to compound_head() in release_pages() Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 140/181] include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts Andrew Morton
                   ` (45 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, hughd, linux-mm, mm-commits, npiggin, peterz, rppt,
	torvalds, willy

From: "Matthew Wilcox (Oracle)" <willy@infradead.org>
Subject: mm/page_alloc.c: fix freeing non-compound pages

Here is a very rare race which leaks memory:

Page P0 is allocated to the page cache.  Page P1 is free.

Thread A                Thread B                Thread C
find_get_entry():
xas_load() returns P0
						Removes P0 from page cache
						P0 finds its buddy P1
			alloc_pages(GFP_KERNEL, 1) returns P0
			P0 has refcount 1
page_cache_get_speculative(P0)
P0 has refcount 2
			__free_pages(P0)
			P0 has refcount 1
put_page(P0)
P1 is not freed

Fix this by freeing all the pages in __free_pages() that won't be freed
by the call to put_page().  It's usually not a good idea to split a page,
but this is a very unlikely scenario.
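
As an illustration of the splitting loop added below, consider a
hypothetical order-3 (8-page) block where the outstanding speculative
reference keeps put_page() from freeing anything but page 0:

  /*
   * __free_pages(page, 3) with the refcount still elevated:
   *      order-- -> 2: free_the_page(page + 4, 2)   frees pages 4-7
   *      order-- -> 1: free_the_page(page + 2, 1)   frees pages 2-3
   *      order-- -> 0: free_the_page(page + 1, 0)   frees page  1
   * Page 0 itself is freed later, when the speculative reference is
   * dropped via put_page().
   */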

Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org
Fixes: e286781d5f2e ("mm: speculative page references")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Nick Piggin <npiggin@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 lib/Kconfig.debug     |    9 ++++++++
 lib/Makefile          |    1 
 lib/test_free_pages.c |   42 ++++++++++++++++++++++++++++++++++++++++
 mm/page_alloc.c       |    3 ++
 4 files changed, 55 insertions(+)

--- a/lib/Kconfig.debug~page_alloc-fix-freeing-non-compound-pages
+++ a/lib/Kconfig.debug
@@ -2367,6 +2367,15 @@ config TEST_HMM
 
 	  If unsure, say N.
 
+config TEST_FREE_PAGES
+	tristate "Test freeing pages"
+	help
+	  Test that a memory leak does not occur due to a race between
+	  freeing a block of pages and a speculative page reference.
+	  Loading this module is safe if your kernel has the bug fixed.
+	  If the bug is not fixed, it will leak gigabytes of memory and
+	  probably OOM your system.
+
 config TEST_FPU
 	tristate "Test floating point operations in kernel space"
 	depends on X86 && !KCOV_INSTRUMENT_ALL
--- a/lib/Makefile~page_alloc-fix-freeing-non-compound-pages
+++ a/lib/Makefile
@@ -101,6 +101,7 @@ obj-$(CONFIG_TEST_BLACKHOLE_DEV) += test
 obj-$(CONFIG_TEST_MEMINIT) += test_meminit.o
 obj-$(CONFIG_TEST_LOCKUP) += test_lockup.o
 obj-$(CONFIG_TEST_HMM) += test_hmm.o
+obj-$(CONFIG_TEST_FREE_PAGES) += test_free_pages.o
 
 #
 # CFLAGS for compiling floating point code inside the kernel. x86/Makefile turns
--- /dev/null
+++ a/lib/test_free_pages.c
@@ -0,0 +1,42 @@
+// SPDX-License-Identifier: GPL-2.0+
+/*
+ * test_free_pages.c: Check that free_pages() doesn't leak memory
+ * Copyright (c) 2020 Oracle
+ * Author: Matthew Wilcox <willy@infradead.org>
+ */
+
+#include <linux/gfp.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+
+static void test_free_pages(gfp_t gfp)
+{
+	unsigned int i;
+
+	for (i = 0; i < 1000 * 1000; i++) {
+		unsigned long addr = __get_free_pages(gfp, 3);
+		struct page *page = virt_to_page(addr);
+
+		/* Simulate page cache getting a speculative reference */
+		get_page(page);
+		free_pages(addr, 3);
+		put_page(page);
+	}
+}
+
+static int m_in(void)
+{
+	test_free_pages(GFP_KERNEL);
+	test_free_pages(GFP_KERNEL | __GFP_COMP);
+
+	return 0;
+}
+
+static void m_ex(void)
+{
+}
+
+module_init(m_in);
+module_exit(m_ex);
+MODULE_AUTHOR("Matthew Wilcox <willy@infradead.org>");
+MODULE_LICENSE("GPL");
--- a/mm/page_alloc.c~page_alloc-fix-freeing-non-compound-pages
+++ a/mm/page_alloc.c
@@ -4952,6 +4952,9 @@ void __free_pages(struct page *page, uns
 {
 	if (put_page_testzero(page))
 		free_the_page(page, order);
+	else if (!PageHead(page))
+		while (order-- > 0)
+			free_the_page(page + (1 << order), order);
 }
 EXPORT_SYMBOL(__free_pages);
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 140/181] include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts
  2020-10-13 23:46 incoming Andrew Morton
                   ` (138 preceding siblings ...)
  2020-10-13 23:56 ` [patch 139/181] mm/page_alloc.c: fix freeing non-compound pages Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 141/181] mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool Andrew Morton
                   ` (44 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, david, linux-mm, mhocko, mm-commits, paulmck, tglx,
	torvalds, urezki

From: Michal Hocko <mhocko@suse.com>
Subject: include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts

There is a general understanding that GFP_ATOMIC/GFP_NOWAIT are to be used
from atomic contexts, e.g. from within a spin lock or from the IRQ
context.  This is correct, but there are some atomic contexts where the
above doesn't hold.  One of them is the NMI context.  The page allocator
has never supported it, and the general fear of this context has kept
anybody from even trying to use the allocator there.  Good, but let's be
more specific about that.

Another such context, and one where people seem to be more daring, is
raw_spin_lock.  Mostly because it simply resembles a regular spin lock,
which is supported by the allocator, and there is no implementation
difference on !RT kernels in the first place.  Be explicit that such a
context is not supported by the allocator.  The underlying reason is that
zone->lock would have to become a raw_spin_lock as well, and that has
turned out to be a problem for RT
(http://lkml.kernel.org/r/87mu305c1w.fsf@nanos.tec.linutronix.de).
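
A minimal sketch of the pattern that is now documented as unsupported
('my_raw_lock' is hypothetical):

  struct page *page;

  raw_spin_lock(&my_raw_lock);
  /*
   * Not supported: the allocator may need to take zone->lock, a
   * regular spinlock, which must not nest inside a raw_spin_lock on
   * RT kernels.  The same applies to GFP_NOWAIT.
   */
  page = alloc_page(GFP_ATOMIC);
  raw_spin_unlock(&my_raw_lock);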

Link: https://lkml.kernel.org/r/20200929123010.5137-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--- a/include/linux/gfp.h~mm-clarify-usage-of-gfp_atomic-in-preemptible-contexts
+++ a/include/linux/gfp.h
@@ -238,7 +238,9 @@ struct vm_area_struct;
  * %__GFP_FOO flags as necessary.
  *
  * %GFP_ATOMIC users can not sleep and need the allocation to succeed. A lower
- * watermark is applied to allow access to "atomic reserves"
+ * watermark is applied to allow access to "atomic reserves".
+ * The current implementation doesn't support NMI and few other strict
+ * non-preemptive contexts (e.g. raw_spin_lock). The same applies to %GFP_NOWAIT.
  *
  * %GFP_KERNEL is typical for kernel-internal allocations. The caller requires
  * %ZONE_NORMAL or a lower zone for direct access but can direct reclaim.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 141/181] mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool
  2020-10-13 23:46 incoming Andrew Morton
                   ` (139 preceding siblings ...)
  2020-10-13 23:56 ` [patch 140/181] include/linux/gfp.h: clarify usage of GFP_ATOMIC in !preemptible contexts Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 142/181] mm/hugetlb.c: remove the unnecessary non_swap_entry() Andrew Morton
                   ` (43 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bhe, david, linux-mm, mike.kravetz,
	mm-commits, torvalds

From: Baoquan He <bhe@redhat.com>
Subject: mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool

Patch series "mm/hugetlb: Small cleanup and improvement", v2.


This patch (of 3):

Make is_hugetlb_entry_hwpoisoned() return bool, just as its neighbour
is_hugetlb_entry_migration() already does.

Link: https://lkml.kernel.org/r/20200723032248.24772-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20200723032248.24772-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlbc-make-is_hugetlb_entry_hwpoisoned-return-bool
+++ a/mm/hugetlb.c
@@ -3805,17 +3805,17 @@ bool is_hugetlb_entry_migration(pte_t pt
 		return false;
 }
 
-static int is_hugetlb_entry_hwpoisoned(pte_t pte)
+static bool is_hugetlb_entry_hwpoisoned(pte_t pte)
 {
 	swp_entry_t swp;
 
 	if (huge_pte_none(pte) || pte_present(pte))
-		return 0;
+		return false;
 	swp = pte_to_swp_entry(pte);
 	if (non_swap_entry(swp) && is_hwpoison_entry(swp))
-		return 1;
+		return true;
 	else
-		return 0;
+		return false;
 }
 
 int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 142/181] mm/hugetlb.c: remove the unnecessary non_swap_entry()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (140 preceding siblings ...)
  2020-10-13 23:56 ` [patch 141/181] mm/hugetlb.c: make is_hugetlb_entry_hwpoisoned return bool Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 143/181] doc/vm: fix typo in the hugetlb admin documentation Andrew Morton
                   ` (42 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bhe, david, linux-mm, mike.kravetz,
	mm-commits, torvalds

From: Baoquan He <bhe@redhat.com>
Subject: mm/hugetlb.c: remove the unnecessary non_swap_entry()

If a swap entry tests positive for either is_[migration|hwpoison]_entry(),
then its swap_type() is among SWP_MIGRATION_READ, SWP_MIGRATION_WRITE and
SWP_HWPOISON.  All of these types are >= MAX_SWAPFILES, which is exactly
what non_swap_entry() asserts.

So the non_swap_entry() check in is_hugetlb_entry_migration() and
is_hugetlb_entry_hwpoisoned() is redundant.

Let's remove it to optimize the code.
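
For reference, simplified sketches of the helpers involved (modeled on
include/linux/swapops.h, CONFIG_MIGRATION case; details elided):

  static inline int non_swap_entry(swp_entry_t entry)
  {
          return swp_type(entry) >= MAX_SWAPFILES;
  }

  static inline int is_migration_entry(swp_entry_t entry)
  {
          /* both SWP_MIGRATION_* types are >= MAX_SWAPFILES ... */
          return unlikely(swp_type(entry) == SWP_MIGRATION_READ ||
                          swp_type(entry) == SWP_MIGRATION_WRITE);
  }
  /* ... so a positive result already implies non_swap_entry(). */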

Link: https://lkml.kernel.org/r/20200723032248.24772-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlbc-remove-the-unnecessary-non_swap_entry
+++ a/mm/hugetlb.c
@@ -3799,7 +3799,7 @@ bool is_hugetlb_entry_migration(pte_t pt
 	if (huge_pte_none(pte) || pte_present(pte))
 		return false;
 	swp = pte_to_swp_entry(pte);
-	if (non_swap_entry(swp) && is_migration_entry(swp))
+	if (is_migration_entry(swp))
 		return true;
 	else
 		return false;
@@ -3812,7 +3812,7 @@ static bool is_hugetlb_entry_hwpoisoned(
 	if (huge_pte_none(pte) || pte_present(pte))
 		return false;
 	swp = pte_to_swp_entry(pte);
-	if (non_swap_entry(swp) && is_hwpoison_entry(swp))
+	if (is_hwpoison_entry(swp))
 		return true;
 	else
 		return false;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 143/181] doc/vm: fix typo in the hugetlb admin documentation
  2020-10-13 23:46 incoming Andrew Morton
                   ` (141 preceding siblings ...)
  2020-10-13 23:56 ` [patch 142/181] mm/hugetlb.c: remove the unnecessary non_swap_entry() Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 144/181] mm/hugetlb: not necessary to coalesce regions recursively Andrew Morton
                   ` (41 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, anshuman.khandual, bhe, david, linux-mm, mike.kravetz,
	mm-commits, torvalds

From: Baoquan He <bhe@redhat.com>
Subject: doc/vm: fix typo in the hugetlb admin documentation

Change 'pecify' to 'Specify'.

Link: https://lkml.kernel.org/r/20200723032248.24772-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 Documentation/admin-guide/mm/hugetlbpage.rst |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/Documentation/admin-guide/mm/hugetlbpage.rst~doc-vm-fix-typo-in-the-hugetlb-admin-documentation
+++ a/Documentation/admin-guide/mm/hugetlbpage.rst
@@ -131,7 +131,7 @@ hugepages
 	parameter is preceded by an invalid hugepagesz parameter, it will
 	be ignored.
 default_hugepagesz
-	pecify the default huge page size.  This parameter can
+	Specify the default huge page size.  This parameter can
 	only be specified once on the command line.  default_hugepagesz can
 	optionally be followed by the hugepages parameter to preallocate a
 	specific number of huge pages of default size.  The number of default
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 144/181] mm/hugetlb: not necessary to coalesce regions recursively
  2020-10-13 23:46 incoming Andrew Morton
                   ` (142 preceding siblings ...)
  2020-10-13 23:56 ` [patch 143/181] doc/vm: fix typo in the hugetlb admin documentation Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 145/181] mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache() Andrew Morton
                   ` (40 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: not necessary to coalesce regions recursively

Patch series "mm/hugetlb: code refine and simplification", v4.

Following are some cleanups for hugetlb.  Simple testing with
tools/testing/selftests/vm/map_hugetlb passes.


This patch (of 7):

Per my understanding, the regions list is kept ordered and regions are
always coalesced properly.  So, after inserting a new region, keeping this
property only requires coalescing with its immediate neighbours; there is
no need to recurse.

Let's simplify this.
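
A rough sketch of the resulting neighbour-only merge (cgroup
uncharge-info checks elided; see the diff below for the exact
conditions):

  /* merge with the previous region if it abuts, then continue from it */
  prg = list_prev_entry(rg, link);
  if (prg->to == rg->from) {
          prg->to = rg->to;
          list_del(&rg->link);
          kfree(rg);
          rg = prg;
  }

  /* one more check against the next region; no recursion is needed */
  nrg = list_next_entry(rg, link);
  if (nrg->from == rg->to) {
          nrg->from = rg->from;
          list_del(&rg->link);
          kfree(rg);
  }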

Link: https://lkml.kernel.org/r/20200901014636.29737-1-richard.weiyang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20200831022351.20916-1-richard.weiyang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20200831022351.20916-2-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-not-necessary-to-coalesce-regions-recursively
+++ a/mm/hugetlb.c
@@ -309,8 +309,7 @@ static void coalesce_file_region(struct
 		list_del(&rg->link);
 		kfree(rg);
 
-		coalesce_file_region(resv, prg);
-		return;
+		rg = prg;
 	}
 
 	nrg = list_next_entry(rg, link);
@@ -320,9 +319,6 @@ static void coalesce_file_region(struct
 
 		list_del(&rg->link);
 		kfree(rg);
-
-		coalesce_file_region(resv, nrg);
-		return;
 	}
 }
 
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 145/181] mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (143 preceding siblings ...)
  2020-10-13 23:56 ` [patch 144/181] mm/hugetlb: not necessary to coalesce regions recursively Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 146/181] mm/hugetlb: use list_splice to merge two list at once Andrew Morton
                   ` (39 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache()

We are sure to get a valid file_region, otherwise the
VM_BUG_ON(resv->region_cache_count <= 0) at the very beginning would be
triggered.

Let's remove the redundant one.

Link: https://lkml.kernel.org/r/20200831022351.20916-3-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/hugetlb.c~mm-hugetlb-remove-vm_bug_onnrg-in-get_file_region_entry_from_cache
+++ a/mm/hugetlb.c
@@ -240,7 +240,6 @@ get_file_region_entry_from_cache(struct
 
 	resv->region_cache_count--;
 	nrg = list_first_entry(&resv->region_cache, struct file_region, link);
-	VM_BUG_ON(!nrg);
 	list_del(&nrg->link);
 
 	nrg->from = from;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 146/181] mm/hugetlb: use list_splice to merge two list at once
  2020-10-13 23:46 incoming Andrew Morton
                   ` (144 preceding siblings ...)
  2020-10-13 23:56 ` [patch 145/181] mm/hugetlb: remove VM_BUG_ON(!nrg) in get_file_region_entry_from_cache() Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 147/181] mm/hugetlb: count file_region to be added when regions_needed != NULL Andrew Morton
                   ` (38 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: use list_splice to merge two list at once

Instead of adding the allocated file_regions to region_cache one by one,
we can use list_splice() to merge the two lists at once.

Also, since we know the number of entries in the list, we can increase the
count directly.
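
A minimal sketch of the pattern ('to_allocate' is the number of entries
built up on the local list, as in the function being changed):

  LIST_HEAD(allocated_regions);

  /* ... allocate 'to_allocate' file_regions onto allocated_regions ... */

  spin_lock(&resv->lock);
  /* splice the whole local list in O(1) instead of moving entries */
  list_splice(&allocated_regions, &resv->region_cache);
  resv->region_cache_count += to_allocate;
  spin_unlock(&resv->lock);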

Link: https://lkml.kernel.org/r/20200831022351.20916-4-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-use-list_splice-to-merge-two-list-at-once
+++ a/mm/hugetlb.c
@@ -443,11 +443,8 @@ static int allocate_file_region_entries(
 
 		spin_lock(&resv->lock);
 
-		list_for_each_entry_safe(rg, trg, &allocated_regions, link) {
-			list_del(&rg->link);
-			list_add(&rg->link, &resv->region_cache);
-			resv->region_cache_count++;
-		}
+		list_splice(&allocated_regions, &resv->region_cache);
+		resv->region_cache_count += to_allocate;
 	}
 
 	return 0;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 147/181] mm/hugetlb: count file_region to be added when regions_needed != NULL
  2020-10-13 23:46 incoming Andrew Morton
                   ` (145 preceding siblings ...)
  2020-10-13 23:56 ` [patch 146/181] mm/hugetlb: use list_splice to merge two list at once Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 148/181] mm/hugetlb: a page from buddy is not on any list Andrew Morton
                   ` (37 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: count file_region to be added when regions_needed != NULL

There are only two cases for function add_reservation_in_range():

    * count the file_regions needed and return the number in regions_needed
    * do the real list operation without counting

This means it is not necessary to have two parameters to distinguish these
two cases.

Just use regions_needed to separate them.
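
After this change the two calling conventions look like this (sketch
based on the diff below):

  /* counting pass: regions_needed != NULL, the list is not modified */
  add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed);

  /* commit pass: regions_needed == NULL, file_regions are really added */
  add = add_reservation_in_range(resv, f, t, h_cg, h, NULL);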

Link: https://lkml.kernel.org/r/20200831022351.20916-5-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   33 +++++++++++++++++----------------
 1 file changed, 17 insertions(+), 16 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-count-file_region-to-be-added-when-regions_needed-=-null
+++ a/mm/hugetlb.c
@@ -321,16 +321,17 @@ static void coalesce_file_region(struct
 	}
 }
 
-/* Must be called with resv->lock held. Calling this with count_only == true
- * will count the number of pages to be added but will not modify the linked
- * list. If regions_needed != NULL and count_only == true, then regions_needed
- * will indicate the number of file_regions needed in the cache to carry out to
- * add the regions for this range.
+/*
+ * Must be called with resv->lock held.
+ *
+ * Calling this with regions_needed != NULL will count the number of pages
+ * to be added but will not modify the linked list. And regions_needed will
+ * indicate the number of file_regions needed in the cache to carry out to add
+ * the regions for this range.
  */
 static long add_reservation_in_range(struct resv_map *resv, long f, long t,
 				     struct hugetlb_cgroup *h_cg,
-				     struct hstate *h, long *regions_needed,
-				     bool count_only)
+				     struct hstate *h, long *regions_needed)
 {
 	long add = 0;
 	struct list_head *head = &resv->regions;
@@ -366,14 +367,14 @@ static long add_reservation_in_range(str
 		 */
 		if (rg->from > last_accounted_offset) {
 			add += rg->from - last_accounted_offset;
-			if (!count_only) {
+			if (!regions_needed) {
 				nrg = get_file_region_entry_from_cache(
 					resv, last_accounted_offset, rg->from);
 				record_hugetlb_cgroup_uncharge_info(h_cg, h,
 								    resv, nrg);
 				list_add(&nrg->link, rg->link.prev);
 				coalesce_file_region(resv, nrg);
-			} else if (regions_needed)
+			} else
 				*regions_needed += 1;
 		}
 
@@ -385,13 +386,13 @@ static long add_reservation_in_range(str
 	 */
 	if (last_accounted_offset < t) {
 		add += t - last_accounted_offset;
-		if (!count_only) {
+		if (!regions_needed) {
 			nrg = get_file_region_entry_from_cache(
 				resv, last_accounted_offset, t);
 			record_hugetlb_cgroup_uncharge_info(h_cg, h, resv, nrg);
 			list_add(&nrg->link, rg->link.prev);
 			coalesce_file_region(resv, nrg);
-		} else if (regions_needed)
+		} else
 			*regions_needed += 1;
 	}
 
@@ -484,8 +485,8 @@ static long region_add(struct resv_map *
 retry:
 
 	/* Count how many regions are actually needed to execute this add. */
-	add_reservation_in_range(resv, f, t, NULL, NULL, &actual_regions_needed,
-				 true);
+	add_reservation_in_range(resv, f, t, NULL, NULL,
+				 &actual_regions_needed);
 
 	/*
 	 * Check for sufficient descriptors in the cache to accommodate
@@ -513,7 +514,7 @@ retry:
 		goto retry;
 	}
 
-	add = add_reservation_in_range(resv, f, t, h_cg, h, NULL, false);
+	add = add_reservation_in_range(resv, f, t, h_cg, h, NULL);
 
 	resv->adds_in_progress -= in_regions_needed;
 
@@ -549,9 +550,9 @@ static long region_chg(struct resv_map *
 
 	spin_lock(&resv->lock);
 
-	/* Count how many hugepages in this range are NOT respresented. */
+	/* Count how many hugepages in this range are NOT represented. */
 	chg = add_reservation_in_range(resv, f, t, NULL, NULL,
-				       out_regions_needed, true);
+				       out_regions_needed);
 
 	if (*out_regions_needed == 0)
 		*out_regions_needed = 1;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 148/181] mm/hugetlb: a page from buddy is not on any list
  2020-10-13 23:46 incoming Andrew Morton
                   ` (146 preceding siblings ...)
  2020-10-13 23:56 ` [patch 147/181] mm/hugetlb: count file_region to be added when regions_needed != NULL Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 149/181] mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page Andrew Morton
                   ` (36 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: a page from buddy is not on any list

A page just allocated from the buddy allocator is not on any list, so
list_add() is enough; there is nothing for list_move() to unlink first.

Link: https://lkml.kernel.org/r/20200831022351.20916-6-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/hugetlb.c~mm-hugetlb-a-page-from-buddy-is-not-on-any-list
+++ a/mm/hugetlb.c
@@ -2416,7 +2416,7 @@ struct page *alloc_huge_page(struct vm_a
 			h->resv_huge_pages--;
 		}
 		spin_lock(&hugetlb_lock);
-		list_move(&page->lru, &h->hugepage_activelist);
+		list_add(&page->lru, &h->hugepage_activelist);
 		/* Fall through */
 	}
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, page);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 149/181] mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page
  2020-10-13 23:46 incoming Andrew Morton
                   ` (147 preceding siblings ...)
  2020-10-13 23:56 ` [patch 148/181] mm/hugetlb: a page from buddy is not on any list Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 150/181] mm/hugetlb: take the free hpage during the iteration directly Andrew Morton
                   ` (35 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page

set_hugetlb_cgroup[_rsvd]() just manipulates page-local data, which does
not need to be protected by hugetlb_lock.

Let's move these calls out of the locked section.

Link: https://lkml.kernel.org/r/20200831022351.20916-7-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/hugetlb.c~mm-hugetlb-narrow-the-hugetlb_lock-protection-area-during-preparing-huge-page
+++ a/mm/hugetlb.c
@@ -1504,9 +1504,9 @@ static void prep_new_huge_page(struct hs
 {
 	INIT_LIST_HEAD(&page->lru);
 	set_compound_page_dtor(page, HUGETLB_PAGE_DTOR);
-	spin_lock(&hugetlb_lock);
 	set_hugetlb_cgroup(page, NULL);
 	set_hugetlb_cgroup_rsvd(page, NULL);
+	spin_lock(&hugetlb_lock);
 	h->nr_huge_pages++;
 	h->nr_huge_pages_node[nid]++;
 	spin_unlock(&hugetlb_lock);
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 150/181] mm/hugetlb: take the free hpage during the iteration directly
  2020-10-13 23:46 incoming Andrew Morton
                   ` (148 preceding siblings ...)
  2020-10-13 23:56 ` [patch 149/181] mm/hugetlb: narrow the hugetlb_lock protection area during preparing huge page Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 151/181] hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share Andrew Morton
                   ` (34 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, bhe, linux-mm, mike.kravetz, mm-commits, richard.weiyang,
	torvalds, vbabka

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/hugetlb: take the free hpage during the iteration directly

Function dequeue_huge_page_node_exact() iterates the free list and returns
the first valid free hpage.

Instead of breaking out of the loop and re-checking the loop variable, we
can return from within the loop directly.  This removes a redundant check.

[mike.kravetz@oracle.com: points out a logic error]
[richard.weiyang@linux.alibaba.com: v4]
  Link: https://lkml.kernel.org/r/20200901014636.29737-8-richard.weiyang@linux.alibaba.com
Link: https://lkml.kernel.org/r/20200831022351.20916-8-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

--- a/mm/hugetlb.c~mm-hugetlb-take-the-free-hpage-during-the-iteration-directly
+++ a/mm/hugetlb.c
@@ -1040,21 +1040,17 @@ static struct page *dequeue_huge_page_no
 		if (nocma && is_migrate_cma_page(page))
 			continue;
 
-		if (!PageHWPoison(page))
-			break;
+		if (PageHWPoison(page))
+			continue;
+
+		list_move(&page->lru, &h->hugepage_activelist);
+		set_page_refcounted(page);
+		h->free_huge_pages--;
+		h->free_huge_pages_node[nid]--;
+		return page;
 	}
 
-	/*
-	 * if 'non-isolated free hugepage' not found on the list,
-	 * the allocation fails.
-	 */
-	if (&h->hugepage_freelists[nid] == &page->lru)
-		return NULL;
-	list_move(&page->lru, &h->hugepage_activelist);
-	set_page_refcounted(page);
-	h->free_huge_pages--;
-	h->free_huge_pages_node[nid]--;
-	return page;
+	return NULL;
 }
 
 static struct page *dequeue_huge_page_nodemask(struct hstate *h, gfp_t gfp_mask, int nid,
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 151/181] hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share
  2020-10-13 23:46 incoming Andrew Morton
                   ` (149 preceding siblings ...)
  2020-10-13 23:56 ` [patch 150/181] mm/hugetlb: take the free hpage during the iteration directly Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 152/181] mm/vmscan: fix infinite loop in drop_slab_node Andrew Morton
                   ` (33 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, dave, kirill.shutemov, linux-mm, mhocko, mike.kravetz,
	mm-commits, torvalds, willy

From: Mike Kravetz <mike.kravetz@oracle.com>
Subject: hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share

As a debugging aid, huge_pmd_share should make sure i_mmap_rwsem is held
if necessary.  To clarify the 'if necessary', expand the comment block at
the beginning of huge_pmd_share.

No functional change.  The added i_mmap_assert_locked() call is only
enabled if CONFIG_LOCKDEP.

Ideally, this should have been included with commit 34ae204f1851
("hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem").

Link: https://lkml.kernel.org/r/20200911201248.88537-1-mike.kravetz@oracle.com
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/hugetlb.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

--- a/mm/hugetlb.c~hugetlb-add-lockdep-check-for-i_mmap_rwsem-held-in-huge_pmd_share
+++ a/mm/hugetlb.c
@@ -5337,10 +5337,16 @@ void adjust_range_if_pmd_sharing_possibl
  * !shared pmd case because we can allocate the pmd later as well, it makes the
  * code much cleaner.
  *
- * This routine must be called with i_mmap_rwsem held in at least read mode.
- * For hugetlbfs, this prevents removal of any page table entries associated
- * with the address space.  This is important as we are setting up sharing
- * based on existing page table entries (mappings).
+ * This routine must be called with i_mmap_rwsem held in at least read mode if
+ * sharing is possible.  For hugetlbfs, this prevents removal of any page
+ * table entries associated with the address space.  This is important as we
+ * are setting up sharing based on existing page table entries (mappings).
+ *
+ * NOTE: This routine is only called from huge_pte_alloc.  Some callers of
+ * huge_pte_alloc know that sharing is not possible and do not take
+ * i_mmap_rwsem as a performance optimization.  This is handled by the
+ * if !vma_shareable check at the beginning of the routine. i_mmap_rwsem is
+ * only required for subsequent processing.
  */
 pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 {
@@ -5357,6 +5363,7 @@ pte_t *huge_pmd_share(struct mm_struct *
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
+	i_mmap_assert_locked(mapping);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 152/181] mm/vmscan: fix infinite loop in drop_slab_node
  2020-10-13 23:46 incoming Andrew Morton
                   ` (150 preceding siblings ...)
  2020-10-13 23:56 ` [patch 151/181] hugetlb: add lockdep check for i_mmap_rwsem held in huge_pmd_share Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 153/181] mm/vmscan: fix comments for isolate_lru_page() Andrew Morton
                   ` (32 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, chris, linux-mm, mhocko, mm-commits, songmuchun, torvalds,
	vbabka, willy, zangchunxin

From: Chunxin Zang <zangchunxin@bytedance.com>
Subject: mm/vmscan: fix infinite loop in drop_slab_node

We have observed that drop_caches can take a considerable amount of time
(<put data here>), especially when many memcgs are involved, because each
of them adds extra overhead.

It is quite unfortunate that the operation currently cannot be interrupted
by a signal.  Add a check for fatal signals into the main loop so that
userspace can control an early bailout.

There are two reasons the loop can run for so long:

1. There are too many memcgs; even if only one object is freed in each
   memcg, the total number of freed objects exceeds 10.

2. A single traversal of all memcgs takes a lot of time, so by the time it
   finishes, the memcgs visited first have freed many objects again, and
   the freed count exceeds 10 on the next pass as well.

We can get the following info through 'ps':

  root:~# ps -aux | grep drop
  root  357956 ... R    Aug25 21119854:55 echo 3 > /proc/sys/vm/drop_caches
  root 1771385 ... R    Aug16 21146421:17 echo 3 > /proc/sys/vm/drop_caches
  root 1986319 ... R    18:56 117:27 echo 3 > /proc/sys/vm/drop_caches
  root 2002148 ... R    Aug24 5720:39 echo 3 > /proc/sys/vm/drop_caches
  root 2564666 ... R    18:59 113:58 echo 3 > /proc/sys/vm/drop_caches
  root 2639347 ... R    Sep03 2383:39 echo 3 > /proc/sys/vm/drop_caches
  root 3904747 ... R    03:35 993:31 echo 3 > /proc/sys/vm/drop_caches
  root 4016780 ... R    Aug21 7882:18 echo 3 > /proc/sys/vm/drop_caches

Use bpftrace to follow the 'freed' value in drop_slab_node():

  root:~# bpftrace -e 'kprobe:drop_slab_node+70 {@ret=hist(reg("bp")); }'
  Attaching 1 probe...
  ^B^C

  @ret:
  [64, 128)        1 |                                                    |
  [128, 256)      28 |                                                    |
  [256, 512)     107 |@                                                   |
  [512, 1K)      298 |@@@                                                 |
  [1K, 2K)       613 |@@@@@@@                                             |
  [2K, 4K)      4435 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [4K, 8K)       442 |@@@@@                                               |
  [8K, 16K)      299 |@@@                                                 |
  [16K, 32K)     100 |@                                                   |
  [32K, 64K)     139 |@                                                   |
  [64K, 128K)     56 |                                                    |
  [128K, 256K)    26 |                                                    |
  [256K, 512K)     2 |                                                    |

In the while loop, we can check whether a fatal signal is pending and, if
so, break out of the loop.

Link: https://lkml.kernel.org/r/20200909152047.27905-1-zangchunxin@bytedance.com
Signed-off-by: Chunxin Zang <zangchunxin@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Chris Down <chris@chrisdown.name>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    3 +++
 1 file changed, 3 insertions(+)

--- a/mm/vmscan.c~mm-vmscan-fix-infinite-loop-in-drop_slab_node
+++ a/mm/vmscan.c
@@ -699,6 +699,9 @@ void drop_slab_node(int nid)
 	do {
 		struct mem_cgroup *memcg = NULL;
 
+		if (fatal_signal_pending(current))
+			return;
+
 		freed = 0;
 		memcg = mem_cgroup_iter(NULL, NULL, NULL);
 		do {
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 153/181] mm/vmscan: fix comments for isolate_lru_page()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (151 preceding siblings ...)
  2020-10-13 23:56 ` [patch 152/181] mm/vmscan: fix infinite loop in drop_slab_node Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 154/181] mm/z3fold.c: use xx_zalloc instead xx_alloc and memset Andrew Morton
                   ` (31 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sh_def, torvalds

From: Hui Su <sh_def@163.com>
Subject: mm/vmscan: fix comments for isolate_lru_page()

fix comments for isolate_lru_page():
s/fundamentnal/fundamental

Link: https://lkml.kernel.org/r/20200927173923.GA8058@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/vmscan.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/mm/vmscan.c~mm-vmscan-fix-comments-for-isolate_lru_page
+++ a/mm/vmscan.c
@@ -1754,7 +1754,7 @@ static unsigned long isolate_lru_pages(u
  * Restrictions:
  *
  * (1) Must be called with an elevated refcount on the page. This is a
- *     fundamentnal difference from isolate_lru_pages (which is called
+ *     fundamental difference from isolate_lru_pages (which is called
  *     without a stable reference).
  * (2) the lru_lock must not be held.
  * (3) interrupts must be enabled.
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 154/181] mm/z3fold.c: use xx_zalloc instead xx_alloc and memset
  2020-10-13 23:46 incoming Andrew Morton
                   ` (152 preceding siblings ...)
  2020-10-13 23:56 ` [patch 153/181] mm/vmscan: fix comments for isolate_lru_page() Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 155/181] mm/zbud: remove redundant initialization Andrew Morton
                   ` (30 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, sh_def, torvalds

From: Hui Su <sh_def@163.com>
Subject: mm/z3fold.c: use xx_zalloc instead xx_alloc and memset

alloc_slots() allocates memory for slots using kmem_cache_alloc(), then
memsets it.  We can just use kmem_cache_zalloc() instead.
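
The generic pattern, for illustration ('cachep' and 'gfp' are
placeholders):

  /* before: allocate, then clear by hand */
  obj = kmem_cache_alloc(cachep, gfp);
  if (obj)
          memset(obj, 0, sizeof(*obj));

  /* after: a single call that returns zeroed memory */
  obj = kmem_cache_zalloc(cachep, gfp);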

Link: https://lkml.kernel.org/r/20200926100834.GA184671@rlk
Signed-off-by: Hui Su <sh_def@163.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/z3fold.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

--- a/mm/z3fold.c~mmz3fold-use-xx_zalloc-instead-xx_alloc-and-memset
+++ a/mm/z3fold.c
@@ -212,13 +212,12 @@ static inline struct z3fold_buddy_slots
 {
 	struct z3fold_buddy_slots *slots;
 
-	slots = kmem_cache_alloc(pool->c_handle,
+	slots = kmem_cache_zalloc(pool->c_handle,
 				 (gfp & ~(__GFP_HIGHMEM | __GFP_MOVABLE)));
 
 	if (slots) {
 		/* It will be freed separately in free_handle(). */
 		kmemleak_not_leak(slots);
-		memset(slots->slot, 0, sizeof(slots->slot));
 		slots->pool = (unsigned long)pool;
 		rwlock_init(&slots->lock);
 	}
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 155/181] mm/zbud: remove redundant initialization
  2020-10-13 23:46 incoming Andrew Morton
                   ` (153 preceding siblings ...)
  2020-10-13 23:56 ` [patch 154/181] mm/z3fold.c: use xx_zalloc instead xx_alloc and memset Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:56 ` [patch 156/181] mm/compaction.c: micro-optimization remove unnecessary branch Andrew Morton
                   ` (29 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, chenxiang66, david, ddstreet, linux-mm, mm-commits,
	sjenning, torvalds

From: Xiang Chen <chenxiang66@hisilicon.com>
Subject: mm/zbud: remove redundant initialization

zhdr is already initialized at the beginning of the function, so remove
the redundant initialization here.

Link: https://lkml.kernel.org/r/1600419885-191907-1-git-send-email-chenxiang66@hisilicon.com
Signed-off-by: Xiang Chen <chenxiang66@hisilicon.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/zbud.c |    1 -
 1 file changed, 1 deletion(-)

--- a/mm/zbud.c~mm-zbud-remove-redundant-initialization
+++ a/mm/zbud.c
@@ -367,7 +367,6 @@ int zbud_alloc(struct zbud_pool *pool, s
 	spin_lock(&pool->lock);
 
 	/* First, try to find an unbuddied zbud page. */
-	zhdr = NULL;
 	for_each_unbuddied_list(i, chunks) {
 		if (!list_empty(&pool->unbuddied[i])) {
 			zhdr = list_first_entry(&pool->unbuddied[i],
_

^ permalink raw reply	[flat|nested] 330+ messages in thread

* [patch 156/181] mm/compaction.c: micro-optimization remove unnecessary branch
  2020-10-13 23:46 incoming Andrew Morton
                   ` (154 preceding siblings ...)
  2020-10-13 23:56 ` [patch 155/181] mm/zbud: remove redundant initialization Andrew Morton
@ 2020-10-13 23:56 ` Andrew Morton
  2020-10-13 23:57 ` [patch 157/181] include/linux/compaction.h: clean code by removing unused enum value Andrew Morton
                   ` (28 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:56 UTC (permalink / raw)
  To: akpm, iamjoonsoo.kim, linux-mm, mateusznosek0, mgorman,
	mm-commits, rientjes, torvalds, vbabka

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: mm/compaction.c: micro-optimization remove unnecessary branch

The same code works for both 'zone->compact_considered > defer_limit' and
'zone->compact_considered >= defer_limit'.  The latter needs one branch
fewer, which is better for performance.
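
A hypothetical walk-through with defer_limit == 4:

  /*
   * compact_considered 1, 2, 3: the condition is false, so the
   *      function falls through and returns true (still deferred);
   * compact_considered reaches 4: clamp it to defer_limit and return
   *      false (allow compaction) -- one compare instead of two.
   */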

Link: https://lkml.kernel.org/r/20200913190448.28649-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/compaction.c |    5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

--- a/mm/compaction.c~mm-compactionc-micro-optimization-remove-unnecessary-branch
+++ a/mm/compaction.c
@@ -180,11 +180,10 @@ bool compaction_deferred(struct zone *zo
 		return false;
 
 	/* Avoid possible overflow */
-	if (++zone->compact_considered > defer_limit)
+	if (++zone->compact_considered >= defer_limit) {
 		zone->compact_considered = defer_limit;
-
-	if (zone->compact_considered >= defer_limit)
 		return false;
+	}
 
 	trace_mm_compaction_deferred(zone, order);
 
_

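For illustration, the transformation in isolation, as a sketch with
hypothetical names (returns 0 once the counter has reached the limit, 1
otherwise).  Both versions clamp the counter at the limit; the second folds
the clamp and the limit test into a single comparison:

    #include <assert.h>

    /* Before: the incremented counter is compared twice. */
    static int deferred_two_branches(unsigned int *considered, unsigned int limit)
    {
            if (++*considered > limit)
                    *considered = limit;    /* avoid possible overflow */

            if (*considered >= limit)
                    return 0;
            return 1;
    }

    /* After: one comparison both clamps and decides. */
    static int deferred_one_branch(unsigned int *considered, unsigned int limit)
    {
            if (++*considered >= limit) {
                    *considered = limit;
                    return 0;
            }
            return 1;
    }

    int main(void)
    {
            unsigned int a = 0, b = 0;

            for (int i = 0; i < 5; i++)
                    assert(deferred_two_branches(&a, 3) ==
                           deferred_one_branch(&b, 3));
            return 0;
    }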

* [patch 157/181] include/linux/compaction.h: clean code by removing unused enum value
  2020-10-13 23:46 incoming Andrew Morton
                   ` (155 preceding siblings ...)
  2020-10-13 23:56 ` [patch 156/181] mm/compaction.c: micro-optimization remove unnecessary branch Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 158/181] selftests/vm: 8x compaction_test speedup Andrew Morton
                   ` (27 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, linux-mm, mateusznosek0, mm-commits, torvalds

From: Mateusz Nosek <mateusznosek0@gmail.com>
Subject: include/linux/compaction.h: clean code by removing unused enum value

The enum value 'COMPACT_INACTIVE' is never used so can be removed.

Link: https://lkml.kernel.org/r/20200917110750.12015-1-mateusznosek0@gmail.com
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/compaction.h |    3 ---
 1 file changed, 3 deletions(-)

--- a/include/linux/compaction.h~include-linux-compactionh-clean-code-by-removing-unused-enum-value
+++ a/include/linux/compaction.h
@@ -29,9 +29,6 @@ enum compact_result {
 	/* compaction didn't start as it was deferred due to past failures */
 	COMPACT_DEFERRED,
 
-	/* compaction not active last round */
-	COMPACT_INACTIVE = COMPACT_DEFERRED,
-
 	/* For more detailed tracepoint output - internal to compaction */
 	COMPACT_NO_SUITABLE_PAGE,
 	/* compaction should continue to another pageblock */
_


* [patch 158/181] selftests/vm: 8x compaction_test speedup
  2020-10-13 23:46 incoming Andrew Morton
                   ` (156 preceding siblings ...)
  2020-10-13 23:57 ` [patch 157/181] include/linux/compaction.h: clean code by removing unused enum value Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 159/181] mm/mempolicy: remove or narrow the lock on current Andrew Morton
                   ` (26 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, jhubbard, linux-mm, mgorman, mm-commits, shuah, sjayaram, torvalds

From: John Hubbard <jhubbard@nvidia.com>
Subject: selftests/vm: 8x compaction_test speedup

This patch reduces the running time of compaction_test from about 27
seconds to 3.3 seconds, which is about an 8x speedup.

These numbers are for an Intel x86_64 system with 32 GB of DRAM.

The compaction_test.c program was spending most of its time doing mmap(),
1 MB at a time, on about 25 GB of memory.

Instead, do the mmaps 100 MB at a time.  (Going past 100 MB doesn't make
things go much faster, because other parts of the program are using the
remaining time.)

Link: https://lkml.kernel.org/r/20201002080621.551044-2-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Sri Jayaramappa <sjayaram@akamai.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 tools/testing/selftests/vm/compaction_test.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/tools/testing/selftests/vm/compaction_test.c~selftests-vm-8x-compaction_test-speedup
+++ a/tools/testing/selftests/vm/compaction_test.c
@@ -18,7 +18,8 @@
 
 #include "../kselftest.h"
 
-#define MAP_SIZE 1048576
+#define MAP_SIZE_MB	100
+#define MAP_SIZE	(MAP_SIZE_MB * 1024 * 1024)
 
 struct map_list {
 	void *map;
@@ -165,7 +166,7 @@ int main(int argc, char **argv)
 	void *map = NULL;
 	unsigned long mem_free = 0;
 	unsigned long hugepage_size = 0;
-	unsigned long mem_fragmentable = 0;
+	long mem_fragmentable_MB = 0;
 
 	if (prereq() != 0) {
 		printf("Either the sysctl compact_unevictable_allowed is not\n"
@@ -190,9 +191,9 @@ int main(int argc, char **argv)
 		return -1;
 	}
 
-	mem_fragmentable = mem_free * 0.8 / 1024;
+	mem_fragmentable_MB = mem_free * 0.8 / 1024;
 
-	while (mem_fragmentable > 0) {
+	while (mem_fragmentable_MB > 0) {
 		map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
 			   MAP_ANONYMOUS | MAP_PRIVATE | MAP_LOCKED, -1, 0);
 		if (map == MAP_FAILED)
@@ -213,7 +214,7 @@ int main(int argc, char **argv)
 		for (i = 0; i < MAP_SIZE; i += page_size)
 			*(unsigned long *)(map + i) = (unsigned long)map + i;
 
-		mem_fragmentable--;
+		mem_fragmentable_MB -= MAP_SIZE_MB;
 	}
 
 	for (entry = list; entry != NULL; entry = entry->next) {
_

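For illustration, a stripped-down sketch of the chunking change
(hypothetical helper; MAP_LOCKED and the page-touching loop are omitted).
Mapping in larger chunks means proportionally fewer mmap() syscalls for the
same total:

    #include <sys/mman.h>

    #define MB              (1024UL * 1024UL)
    #define CHUNK_MB        100L

    /* Map total_mb megabytes, CHUNK_MB at a time; returns 0 on success. */
    static int map_in_chunks(long total_mb)
    {
            while (total_mb > 0) {
                    void *p = mmap(NULL, CHUNK_MB * MB, PROT_READ | PROT_WRITE,
                                   MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
                    if (p == MAP_FAILED)
                            return -1;
                    total_mb -= CHUNK_MB;   /* was: 1 MB per iteration */
            }
            return 0;
    }

    int main(void)
    {
            return map_in_chunks(500);      /* 5 mmap() calls instead of 500 */
    }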

* [patch 159/181] mm/mempolicy: remove or narrow the lock on current
  2020-10-13 23:46 incoming Andrew Morton
                   ` (157 preceding siblings ...)
  2020-10-13 23:57 ` [patch 158/181] selftests/vm: 8x compaction_test speedup Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 160/181] mm: remove unused alloc_page_vma_node() Andrew Morton
                   ` (25 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm/mempolicy: remove or narrow the lock on current

It is not necessary to hold current's task lock while setting the nodemask
of a new policy; the new policy is not visible to anyone else yet.

Link: https://lkml.kernel.org/r/20200921040416.86185-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempolicy.c |    5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

--- a/mm/mempolicy.c~mm-mempolicy-remove-or-narrow-the-lock-on-current
+++ a/mm/mempolicy.c
@@ -875,13 +875,12 @@ static long do_set_mempolicy(unsigned sh
 		goto out;
 	}
 
-	task_lock(current);
 	ret = mpol_set_nodemask(new, nodes, scratch);
 	if (ret) {
-		task_unlock(current);
 		mpol_put(new);
 		goto out;
 	}
+	task_lock(current);
 	old = current->mempolicy;
 	current->mempolicy = new;
 	if (new && new->mode == MPOL_INTERLEAVE)
@@ -1324,9 +1323,7 @@ static long do_mbind(unsigned long start
 		NODEMASK_SCRATCH(scratch);
 		if (scratch) {
 			mmap_write_lock(mm);
-			task_lock(current);
 			err = mpol_set_nodemask(new, nmask, scratch);
-			task_unlock(current);
 			if (err)
 				mmap_write_unlock(mm);
 		} else
_

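For illustration, the lock-narrowing pattern in a userspace sketch
(hypothetical names; a pthread mutex stands in for the task lock).  Work on
the still-private object needs no lock; only the swap that publishes it
does:

    #include <pthread.h>
    #include <stdlib.h>

    struct policy { int nodes; };

    static pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
    static struct policy *current_policy;

    static int set_policy(int nodes)
    {
            struct policy *new = malloc(sizeof(*new));
            struct policy *old;

            if (!new)
                    return -1;
            new->nodes = nodes;             /* private: no lock needed */

            pthread_mutex_lock(&task_lock); /* lock only the publication */
            old = current_policy;
            current_policy = new;
            pthread_mutex_unlock(&task_lock);

            free(old);                      /* dispose of the old one unlocked */
            return 0;
    }

    int main(void)
    {
            set_policy(1);
            set_policy(2);
            free(current_policy);
            return 0;
    }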

* [patch 160/181] mm: remove unused alloc_page_vma_node()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (158 preceding siblings ...)
  2020-10-13 23:57 ` [patch 159/181] mm/mempolicy: remove or narrow the lock on current Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 161/181] mm/mempool: add 'else' to split mutually exclusive case Andrew Morton
                   ` (24 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, linux-mm, mm-commits, richard.weiyang, torvalds

From: Wei Yang <richard.weiyang@linux.alibaba.com>
Subject: mm: remove unused alloc_page_vma_node()

No one uses this macro anymore.

Also fix code style of policy_node().

Link: https://lkml.kernel.org/r/20200921021401.84508-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/gfp.h |    2 --
 mm/mempolicy.c      |    3 +--
 2 files changed, 1 insertion(+), 4 deletions(-)

--- a/include/linux/gfp.h~mm-remove-unused-alloc_page_vma_node
+++ a/include/linux/gfp.h
@@ -562,8 +562,6 @@ extern struct page *alloc_pages_vma(gfp_
 #define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
 #define alloc_page_vma(gfp_mask, vma, addr)			\
 	alloc_pages_vma(gfp_mask, 0, vma, addr, numa_node_id(), false)
-#define alloc_page_vma_node(gfp_mask, vma, addr, node)		\
-	alloc_pages_vma(gfp_mask, 0, vma, addr, node, false)
 
 extern unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
 extern unsigned long get_zeroed_page(gfp_t gfp_mask);
--- a/mm/mempolicy.c~mm-remove-unused-alloc_page_vma_node
+++ a/mm/mempolicy.c
@@ -1882,8 +1882,7 @@ nodemask_t *policy_nodemask(gfp_t gfp, s
 }
 
 /* Return the node id preferred by the given mempolicy, or the given id */
-static int policy_node(gfp_t gfp, struct mempolicy *policy,
-								int nd)
+static int policy_node(gfp_t gfp, struct mempolicy *policy, int nd)
 {
 	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
 		nd = policy->v.preferred_node;
_


* [patch 161/181] mm/mempool: add 'else' to split mutually exclusive case
  2020-10-13 23:46 incoming Andrew Morton
                   ` (159 preceding siblings ...)
  2020-10-13 23:57 ` [patch 160/181] mm: remove unused alloc_page_vma_node() Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 162/181] KVM: PPC: Book3S HV: simplify kvm_cma_reserve() Andrew Morton
                   ` (23 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, linmiaohe, linux-mm, mm-commits, torvalds

From: Miaohe Lin <linmiaohe@huawei.com>
Subject: mm/mempool: add 'else' to split mutually exclusive case

Add 'else' to split the mutually exclusive cases and avoid unnecessary
checks.  It doesn't seem to change code generation (the compiler is smart),
but it helps readability.

[akpm@linux-foundation.org: fix comment location]
Link: https://lkml.kernel.org/r/20200924111641.28922-1-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/mempool.c |   18 ++++++++----------
 1 file changed, 8 insertions(+), 10 deletions(-)

--- a/mm/mempool.c~mm-mempool-add-else-to-split-mutually-exclusive-case
+++ a/mm/mempool.c
@@ -58,11 +58,10 @@ static void __check_element(mempool_t *p
 static void check_element(mempool_t *pool, void *element)
 {
 	/* Mempools backed by slab allocator */
-	if (pool->free == mempool_free_slab || pool->free == mempool_kfree)
+	if (pool->free == mempool_free_slab || pool->free == mempool_kfree) {
 		__check_element(pool, element, ksize(element));
-
-	/* Mempools backed by page allocator */
-	if (pool->free == mempool_free_pages) {
+	} else if (pool->free == mempool_free_pages) {
+		/* Mempools backed by page allocator */
 		int order = (int)(long)pool->pool_data;
 		void *addr = kmap_atomic((struct page *)element);
 
@@ -82,11 +81,10 @@ static void __poison_element(void *eleme
 static void poison_element(mempool_t *pool, void *element)
 {
 	/* Mempools backed by slab allocator */
-	if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
+	if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc) {
 		__poison_element(element, ksize(element));
-
-	/* Mempools backed by page allocator */
-	if (pool->alloc == mempool_alloc_pages) {
+	} else if (pool->alloc == mempool_alloc_pages) {
+		/* Mempools backed by page allocator */
 		int order = (int)(long)pool->pool_data;
 		void *addr = kmap_atomic((struct page *)element);
 
@@ -107,7 +105,7 @@ static __always_inline void kasan_poison
 {
 	if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
 		kasan_poison_kfree(element, _RET_IP_);
-	if (pool->alloc == mempool_alloc_pages)
+	else if (pool->alloc == mempool_alloc_pages)
 		kasan_free_pages(element, (unsigned long)pool->pool_data);
 }
 
@@ -115,7 +113,7 @@ static void kasan_unpoison_element(mempo
 {
 	if (pool->alloc == mempool_alloc_slab || pool->alloc == mempool_kmalloc)
 		kasan_unpoison_slab(element);
-	if (pool->alloc == mempool_alloc_pages)
+	else if (pool->alloc == mempool_alloc_pages)
 		kasan_alloc_pages(element, (unsigned long)pool->pool_data);
 }
 
_


* [patch 162/181] KVM: PPC: Book3S HV: simplify kvm_cma_reserve()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (160 preceding siblings ...)
  2020-10-13 23:57 ` [patch 161/181] mm/mempool: add 'else' to split mutually exclusive case Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 163/181] dma-contiguous: simplify cma_early_percent_memory() Andrew Morton
                   ` (22 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: KVM: PPC: Book3S HV: simplify kvm_cma_reserve()

Patch series "memblock: seasonal cleaning^w cleanup", v3.

These patches simplify several uses of memblock iterators and hide some of
the memblock implementation details from the rest of the system.


This patch (of 17):

The memory size calculation in kvm_cma_reserve() traverses memblock.memory
rather than simply calling memblock_phys_mem_size().  The comment in that
function suggests that at some point there should have been a call to
memblock_analyze() before memblock_phys_mem_size() could be used.  As of
now, there is no memblock_analyze() at all and memblock_phys_mem_size()
can be used as soon as cold-plug memory is registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Link: https://lkml.kernel.org/r/20200818151634.14343-1-rppt@kernel.org
Link: https://lkml.kernel.org/r/20200818151634.14343-2-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/powerpc/kvm/book3s_hv_builtin.c |   12 ++----------
 1 file changed, 2 insertions(+), 10 deletions(-)

--- a/arch/powerpc/kvm/book3s_hv_builtin.c~kvm-ppc-book3s-hv-simplify-kvm_cma_reserve
+++ a/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -95,23 +95,15 @@ EXPORT_SYMBOL_GPL(kvm_free_hpt_cma);
 void __init kvm_cma_reserve(void)
 {
 	unsigned long align_size;
-	struct memblock_region *reg;
-	phys_addr_t selected_size = 0;
+	phys_addr_t selected_size;
 
 	/*
 	 * We need CMA reservation only when we are in HV mode
 	 */
 	if (!cpu_has_feature(CPU_FTR_HVMODE))
 		return;
-	/*
-	 * We cannot use memblock_phys_mem_size() here, because
-	 * memblock_analyze() has not been called yet.
-	 */
-	for_each_memblock(memory, reg)
-		selected_size += memblock_region_memory_end_pfn(reg) -
-				 memblock_region_memory_base_pfn(reg);
 
-	selected_size = (selected_size * kvm_cma_resv_ratio / 100) << PAGE_SHIFT;
+	selected_size = PAGE_ALIGN(memblock_phys_mem_size() * kvm_cma_resv_ratio / 100);
 	if (selected_size) {
 		pr_info("%s: reserving %ld MiB for global area\n", __func__,
 			 (unsigned long)selected_size / SZ_1M);
_

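For illustration, the simplified size calculation with assumed numbers
(8 GiB of memory and a 5% reservation ratio are made up for the example;
PAGE_ALIGN rounds up to a page multiple as in the kernel):

    #include <stdio.h>

    #define PAGE_SIZE       4096ULL
    #define PAGE_ALIGN(x)   (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

    int main(void)
    {
            unsigned long long phys_mem = 8ULL << 30; /* memblock_phys_mem_size() */
            unsigned long long ratio = 5;             /* kvm_cma_resv_ratio */
            unsigned long long selected = PAGE_ALIGN(phys_mem * ratio / 100);

            printf("reserving %llu MiB for global area\n", selected >> 20);
            return 0;
    }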

* [patch 163/181] dma-contiguous: simplify cma_early_percent_memory()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (161 preceding siblings ...)
  2020-10-13 23:57 ` [patch 162/181] KVM: PPC: Book3S HV: simplify kvm_cma_reserve() Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 164/181] arm, xtensa: simplify initialization of high memory pages Andrew Morton
                   ` (21 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: dma-contiguous: simplify cma_early_percent_memory()

The memory size calculation in cma_early_percent_memory() traverses
memblock.memory rather than simply calling memblock_phys_mem_size().  The
comment in that function suggests that at some point there should have
been a call to memblock_analyze() before memblock_phys_mem_size() could be
used.  As of now, there is no memblock_analyze() at all and
memblock_phys_mem_size() can be used as soon as cold-plug memory is
registered with memblock.

Replace loop over memblock.memory with a call to memblock_phys_mem_size().

Link: https://lkml.kernel.org/r/20200818151634.14343-3-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 kernel/dma/contiguous.c |   11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

--- a/kernel/dma/contiguous.c~dma-contiguous-simplify-cma_early_percent_memory
+++ a/kernel/dma/contiguous.c
@@ -73,16 +73,7 @@ early_param("cma", early_cma);
 
 static phys_addr_t __init __maybe_unused cma_early_percent_memory(void)
 {
-	struct memblock_region *reg;
-	unsigned long total_pages = 0;
-
-	/*
-	 * We cannot use memblock_phys_mem_size() here, because
-	 * memblock_analyze() has not been called yet.
-	 */
-	for_each_memblock(memory, reg)
-		total_pages += memblock_region_memory_end_pfn(reg) -
-			       memblock_region_memory_base_pfn(reg);
+	unsigned long total_pages = PHYS_PFN(memblock_phys_mem_size());
 
 	return (total_pages * CONFIG_CMA_SIZE_PERCENTAGE / 100) << PAGE_SHIFT;
 }
_

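For illustration, the byte-to-page conversion the new one-liner performs,
with assumed numbers (8 GiB of memory and CONFIG_CMA_SIZE_PERCENTAGE=10 are
made up for the example):

    #include <stdio.h>

    #define PAGE_SHIFT      12
    #define PHYS_PFN(x)     ((unsigned long)((x) >> PAGE_SHIFT))

    int main(void)
    {
            unsigned long long phys = 8ULL << 30;        /* assumed 8 GiB */
            unsigned long total_pages = PHYS_PFN(phys);  /* bytes -> pages */
            unsigned long long bytes =
                    ((unsigned long long)total_pages * 10 / 100) << PAGE_SHIFT;

            printf("%lu pages total, CMA gets %llu MiB\n",
                   total_pages, bytes >> 20);
            return 0;
    }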

* [patch 164/181] arm, xtensa: simplify initialization of high memory pages
  2020-10-13 23:46 incoming Andrew Morton
                   ` (162 preceding siblings ...)
  2020-10-13 23:57 ` [patch 163/181] dma-contiguous: simplify cma_early_percent_memory() Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 165/181] arm64: numa: simplify dummy_numa_init() Andrew Morton
                   ` (20 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: arm, xtensa: simplify initialization of high memory pages

free_highpages() in both arm and xtensa essentially open-codes the
for_each_free_mem_range() loop to detect high memory pages that were not
reserved and should be initialized and passed to the buddy allocator.

Replace the open-coded loops with for_each_free_mem_range() from the
memblock API to simplify the code.

Link: https://lkml.kernel.org/r/20200818151634.14343-4-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Max Filippov <jcmvbkbc@gmail.com>	[xtensa]
Tested-by: Max Filippov <jcmvbkbc@gmail.com>	[xtensa]
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm/mm/init.c    |   48 +++++-----------------------------
 arch/xtensa/mm/init.c |   55 +++++++---------------------------------
 2 files changed, 18 insertions(+), 85 deletions(-)

--- a/arch/arm/mm/init.c~arm-xtensa-simplify-initialization-of-high-memory-pages
+++ a/arch/arm/mm/init.c
@@ -347,61 +347,29 @@ static void __init free_unused_memmap(vo
 #endif
 }
 
-#ifdef CONFIG_HIGHMEM
-static inline void free_area_high(unsigned long pfn, unsigned long end)
-{
-	for (; pfn < end; pfn++)
-		free_highmem_page(pfn_to_page(pfn));
-}
-#endif
-
 static void __init free_highpages(void)
 {
 #ifdef CONFIG_HIGHMEM
 	unsigned long max_low = max_low_pfn;
-	struct memblock_region *mem, *res;
+	phys_addr_t range_start, range_end;
+	u64 i;
 
 	/* set highmem page free */
-	for_each_memblock(memory, mem) {
-		unsigned long start = memblock_region_memory_base_pfn(mem);
-		unsigned long end = memblock_region_memory_end_pfn(mem);
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+				&range_start, &range_end, NULL) {
+		unsigned long start = PHYS_PFN(range_start);
+		unsigned long end = PHYS_PFN(range_end);
 
 		/* Ignore complete lowmem entries */
 		if (end <= max_low)
 			continue;
 
-		if (memblock_is_nomap(mem))
-			continue;
-
 		/* Truncate partial highmem entries */
 		if (start < max_low)
 			start = max_low;
 
-		/* Find and exclude any reserved regions */
-		for_each_memblock(reserved, res) {
-			unsigned long res_start, res_end;
-
-			res_start = memblock_region_reserved_base_pfn(res);
-			res_end = memblock_region_reserved_end_pfn(res);
-
-			if (res_end < start)
-				continue;
-			if (res_start < start)
-				res_start = start;
-			if (res_start > end)
-				res_start = end;
-			if (res_end > end)
-				res_end = end;
-			if (res_start != start)
-				free_area_high(start, res_start);
-			start = res_end;
-			if (start == end)
-				break;
-		}
-
-		/* And now free anything which remains */
-		if (start < end)
-			free_area_high(start, end);
+		for (; start < end; start++)
+			free_highmem_page(pfn_to_page(start));
 	}
 #endif
 }
--- a/arch/xtensa/mm/init.c~arm-xtensa-simplify-initialization-of-high-memory-pages
+++ a/arch/xtensa/mm/init.c
@@ -79,67 +79,32 @@ void __init zones_init(void)
 	free_area_init(max_zone_pfn);
 }
 
-#ifdef CONFIG_HIGHMEM
-static void __init free_area_high(unsigned long pfn, unsigned long end)
-{
-	for (; pfn < end; pfn++)
-		free_highmem_page(pfn_to_page(pfn));
-}
-
 static void __init free_highpages(void)
 {
+#ifdef CONFIG_HIGHMEM
 	unsigned long max_low = max_low_pfn;
-	struct memblock_region *mem, *res;
+	phys_addr_t range_start, range_end;
+	u64 i;
 
-	reset_all_zones_managed_pages();
 	/* set highmem page free */
-	for_each_memblock(memory, mem) {
-		unsigned long start = memblock_region_memory_base_pfn(mem);
-		unsigned long end = memblock_region_memory_end_pfn(mem);
+	for_each_free_mem_range(i, NUMA_NO_NODE, MEMBLOCK_NONE,
+				&range_start, &range_end, NULL) {
+		unsigned long start = PHYS_PFN(range_start);
+		unsigned long end = PHYS_PFN(range_end);
 
 		/* Ignore complete lowmem entries */
 		if (end <= max_low)
 			continue;
 
-		if (memblock_is_nomap(mem))
-			continue;
-
 		/* Truncate partial highmem entries */
 		if (start < max_low)
 			start = max_low;
 
-		/* Find and exclude any reserved regions */
-		for_each_memblock(reserved, res) {
-			unsigned long res_start, res_end;
-
-			res_start = memblock_region_reserved_base_pfn(res);
-			res_end = memblock_region_reserved_end_pfn(res);
-
-			if (res_end < start)
-				continue;
-			if (res_start < start)
-				res_start = start;
-			if (res_start > end)
-				res_start = end;
-			if (res_end > end)
-				res_end = end;
-			if (res_start != start)
-				free_area_high(start, res_start);
-			start = res_end;
-			if (start == end)
-				break;
-		}
-
-		/* And now free anything which remains */
-		if (start < end)
-			free_area_high(start, end);
+		for (; start < end; start++)
+			free_highmem_page(pfn_to_page(start));
 	}
-}
-#else
-static void __init free_highpages(void)
-{
-}
 #endif
+}
 
 /*
  * Initialize memory pages.
_

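For illustration, the shape of the new loop with a hypothetical range table
in place of the memblock iterator: ranges wholly below max_low are skipped,
partial ones are truncated, and only the remainder is counted (freed, in
the kernel):

    #include <stdio.h>

    struct range { unsigned long start, end; };  /* PFNs, end exclusive */

    static unsigned long count_highmem(const struct range *r, int n,
                                       unsigned long max_low)
    {
            unsigned long pages = 0;

            for (int i = 0; i < n; i++) {
                    unsigned long start = r[i].start, end = r[i].end;

                    if (end <= max_low)     /* ignore complete lowmem entries */
                            continue;
                    if (start < max_low)    /* truncate partial highmem entries */
                            start = max_low;
                    pages += end - start;   /* free_highmem_page() per PFN here */
            }
            return pages;
    }

    int main(void)
    {
            struct range r[] = { { 0, 100 }, { 50, 200 }, { 300, 400 } };

            printf("%lu high pages\n", count_highmem(r, 3, 150));
            return 0;
    }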

* [patch 165/181] arm64: numa: simplify dummy_numa_init()
  2020-10-13 23:46 incoming Andrew Morton
                   ` (163 preceding siblings ...)
  2020-10-13 23:57 ` [patch 164/181] arm, xtensa: simplify initialization of high memory pages Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 166/181] h8300, nds32, openrisc: simplify detection of memory extents Andrew Morton
                   ` (19 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: arm64: numa: simplify dummy_numa_init()

dummy_numa_init() loops over memblock.memory and passes nid=0 to
numa_add_memblk(), which essentially wraps memblock_set_node().  However,
memblock_set_node() can cope with the entire memory span itself, so the loop
over memblock.memory regions is redundant.

Using a single call to memblock_set_node() rather than a loop also fixes
an issue with buggy ACPI firmware in which the SRAT table covers some
but not all of the memory in the EFI memory map.

Jonathan Cameron says:

  This issue can be easily triggered by having an SRAT table which fails
  to cover all elements of the EFI memory map.

  This firmware error is detected and a warning printed. e.g.
  "NUMA: Warning: invalid memblk node 64 [mem 0x240000000-0x27fffffff]"
  At that point we fall back to dummy_numa_init().

  However, the failed ACPI init has left us with our memblocks all broken
  up as we split them when trying to assign them to NUMA nodes.

  We then iterate over the memblocks and add them to node 0.

  numa_add_memblk() calls memblock_set_node() which merges regions that
  were previously split up during the earlier attempt to add them to
  different nodes during parsing of SRAT.

  This means elements are moved in the memblock array and we can end up
  in a different memblock after the call to numa_add_memblk().
  Result is:

  Unable to handle kernel paging request at virtual address 0000000000003a40
  Mem abort info:
    ESR = 0x96000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
  Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
  [0000000000003a40] user address but active_mm is swapper
  Internal error: Oops: 96000004 [#1] PREEMPT SMP

  ...

  Call trace:
    sparse_init_nid+0x5c/0x2b0
    sparse_init+0x138/0x170
    bootmem_init+0x80/0xe0
    setup_arch+0x2a0/0x5fc
    start_kernel+0x8c/0x648

Replace the loop with a single memblock_set_node() call covering the
entire memory span.

Link: https://lkml.kernel.org/r/20200818151634.14343-5-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/arm64/mm/numa.c |   13 +++++--------
 1 file changed, 5 insertions(+), 8 deletions(-)

--- a/arch/arm64/mm/numa.c~arm64-numa-simplify-dummy_numa_init
+++ a/arch/arm64/mm/numa.c
@@ -427,19 +427,16 @@ out_free_distance:
  */
 static int __init dummy_numa_init(void)
 {
+	phys_addr_t start = memblock_start_of_DRAM();
+	phys_addr_t end = memblock_end_of_DRAM();
 	int ret;
-	struct memblock_region *mblk;
 
 	if (numa_off)
 		pr_info("NUMA disabled\n"); /* Forced off on command line. */
-	pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n",
-		memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);
-
-	for_each_memblock(memory, mblk) {
-		ret = numa_add_memblk(0, mblk->base, mblk->base + mblk->size);
-		if (!ret)
-			continue;
+	pr_info("Faking a node at [mem %#018Lx-%#018Lx]\n", start, end - 1);
 
+	ret = numa_add_memblk(0, start, end);
+	if (ret) {
 		pr_err("NUMA init failed\n");
 		return ret;
 	}
_

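For illustration, a toy reproduction of the bug class (hypothetical array
and merge helper): merging entries while iterating with a bound captured
earlier leaves the index pointing past the live entries, just as the old
loop could end up in a different memblock after numa_add_memblk():

    #include <stdio.h>
    #include <string.h>

    struct region { unsigned long base, size; };

    /* Coalesce r[idx] into r[idx - 1], shifting later entries left --
     * roughly what memblock_set_node() does when regions merge. */
    static int merge_into_prev(struct region *r, int n, int idx)
    {
            r[idx - 1].size += r[idx].size;
            memmove(&r[idx], &r[idx + 1], (n - idx - 1) * sizeof(*r));
            return n - 1;
    }

    int main(void)
    {
            struct region r[3] = { { 0, 1 }, { 1, 1 }, { 2, 1 } };
            int n = 3;

            for (int i = 0; i < 3; i++) {   /* bound taken before merging */
                    if (i == 1)
                            n = merge_into_prev(r, n, i);
                    if (i >= n)
                            printf("index %d is past the %d live entries\n",
                                   i, n);
            }
            return 0;
    }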

* [patch 166/181] h8300, nds32, openrisc: simplify detection of memory extents
  2020-10-13 23:46 incoming Andrew Morton
                   ` (164 preceding siblings ...)
  2020-10-13 23:57 ` [patch 165/181] arm64: numa: simplify dummy_numa_init() Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 167/181] riscv: drop unneeded node initialization Andrew Morton
                   ` (18 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: h8300, nds32, openrisc: simplify detection of memory extents

Instead of traversing memblock.memory regions to find memory_start and
memory_end, simply query memblock_{start,end}_of_DRAM().

Link: https://lkml.kernel.org/r/20200818151634.14343-6-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Stafford Horne <shorne@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/h8300/kernel/setup.c    |    8 +++-----
 arch/nds32/kernel/setup.c    |    8 ++------
 arch/openrisc/kernel/setup.c |    9 ++-------
 3 files changed, 7 insertions(+), 18 deletions(-)

--- a/arch/h8300/kernel/setup.c~h8300-nds32-openrisc-simplify-detection-of-memory-extents
+++ a/arch/h8300/kernel/setup.c
@@ -74,17 +74,15 @@ static void __init bootmem_init(void)
 	memory_end = memory_start = 0;
 
 	/* Find main memory where is the kernel */
-	for_each_memblock(memory, region) {
-		memory_start = region->base;
-		memory_end = region->base + region->size;
-	}
+	memory_start = memblock_start_of_DRAM();
+	memory_end = memblock_end_of_DRAM();
 
 	if (!memory_end)
 		panic("No memory!");
 
 	/* setup bootmem globals (we use no_bootmem, but mm still depends on this) */
 	min_low_pfn = PFN_UP(memory_start);
-	max_low_pfn = PFN_DOWN(memblock_end_of_DRAM());
+	max_low_pfn = PFN_DOWN(memory_end);
 	max_pfn = max_low_pfn;
 
 	memblock_reserve(__pa(_stext), _end - _stext);
--- a/arch/nds32/kernel/setup.c~h8300-nds32-openrisc-simplify-detection-of-memory-extents
+++ a/arch/nds32/kernel/setup.c
@@ -249,12 +249,8 @@ static void __init setup_memory(void)
 	memory_end = memory_start = 0;
 
 	/* Find main memory where is the kernel */
-	for_each_memblock(memory, region) {
-		memory_start = region->base;
-		memory_end = region->base + region->size;
-		pr_info("%s: Memory: 0x%x-0x%x\n", __func__,
-			memory_start, memory_end);
-	}
+	memory_start = memblock_start_of_DRAM();
+	memory_end = memblock_end_of_DRAM();
 
 	if (!memory_end) {
 		panic("No memory!");
--- a/arch/openrisc/kernel/setup.c~h8300-nds32-openrisc-simplify-detection-of-memory-extents
+++ a/arch/openrisc/kernel/setup.c
@@ -48,17 +48,12 @@ static void __init setup_memory(void)
 	unsigned long ram_start_pfn;
 	unsigned long ram_end_pfn;
 	phys_addr_t memory_start, memory_end;
-	struct memblock_region *region;
 
 	memory_end = memory_start = 0;
 
 	/* Find main memory where is the kernel, we assume its the only one */
-	for_each_memblock(memory, region) {
-		memory_start = region->base;
-		memory_end = region->base + region->size;
-		printk(KERN_INFO "%s: Memory: 0x%x-0x%x\n", __func__,
-		       memory_start, memory_end);
-	}
+	memory_start = memblock_start_of_DRAM();
+	memory_end = memblock_end_of_DRAM();
 
 	if (!memory_end) {
 		panic("No memory!");
_

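For illustration, the difference between the removed loop and the direct
query, using a hypothetical two-region table: the loop overwrites the
bounds on every iteration and so ends at the base of the *last* region,
while the query spans all of DRAM from the first region's base:

    #include <stdio.h>

    struct region { unsigned long base, size; };

    int main(void)
    {
            struct region regs[] = { { 0x1000, 0x1000 }, { 0x8000, 0x2000 } };
            int n = 2;
            unsigned long memory_start = 0, memory_end = 0;

            /* Removed loop: clobbers the bounds on each iteration. */
            for (int i = 0; i < n; i++) {
                    memory_start = regs[i].base;
                    memory_end = regs[i].base + regs[i].size;
            }
            printf("loop:  %#lx-%#lx\n", memory_start, memory_end);

            /* Query: first base, last end -- the whole DRAM span. */
            memory_start = regs[0].base;    /* memblock_start_of_DRAM() */
            memory_end = regs[n - 1].base + regs[n - 1].size;
                                            /* memblock_end_of_DRAM() */
            printf("query: %#lx-%#lx\n", memory_start, memory_end);
            return 0;
    }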

* [patch 167/181] riscv: drop unneeded node initialization
  2020-10-13 23:46 incoming Andrew Morton
                   ` (165 preceding siblings ...)
  2020-10-13 23:57 ` [patch 166/181] h8300, nds32, openrisc: simplify detection of memory extents Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 168/181] microblaze: drop unneeded NUMA and sparsemem initializations Andrew Morton
                   ` (17 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: riscv: drop unneeded node initialization

RISC-V does not (yet) support NUMA, and for UMA architectures node 0 is
used implicitly during early memory initialization.

There is no need to call memblock_set_node(); remove this call and the
surrounding code.

Link: https://lkml.kernel.org/r/20200818151634.14343-7-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/riscv/mm/init.c |    9 ---------
 1 file changed, 9 deletions(-)

--- a/arch/riscv/mm/init.c~riscv-drop-unneeded-node-initialization
+++ a/arch/riscv/mm/init.c
@@ -191,15 +191,6 @@ void __init setup_bootmem(void)
 	early_init_fdt_scan_reserved_mem();
 	memblock_allow_resize();
 	memblock_dump_all();
-
-	for_each_memblock(memory, reg) {
-		unsigned long start_pfn = memblock_region_memory_base_pfn(reg);
-		unsigned long end_pfn = memblock_region_memory_end_pfn(reg);
-
-		memblock_set_node(PFN_PHYS(start_pfn),
-				  PFN_PHYS(end_pfn - start_pfn),
-				  &memblock.memory, 0);
-	}
 }
 
 #ifdef CONFIG_MMU
_


* [patch 168/181] microblaze: drop unneeded NUMA and sparsemem initializations
  2020-10-13 23:46 incoming Andrew Morton
                   ` (166 preceding siblings ...)
  2020-10-13 23:57 ` [patch 167/181] riscv: drop unneeded node initialization Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 169/181] memblock: make for_each_memblock_type() iterator private Andrew Morton
                   ` (16 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: microblaze: drop unneeded NUMA and sparsemem initializations

microblaze supports neither NUMA nor SPARSEMEM, so there is no point in
calling memblock_set_node() and
sparse_memory_present_with_active_regions() during microblaze memory
initialization.

Remove these calls and the surrounding code.

Link: https://lkml.kernel.org/r/20200818151634.14343-8-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/microblaze/mm/init.c |   14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

--- a/arch/microblaze/mm/init.c~mircoblaze-drop-unneeded-numa-and-sparsemem-initializations
+++ a/arch/microblaze/mm/init.c
@@ -108,9 +108,8 @@ static void __init paging_init(void)
 
 void __init setup_memory(void)
 {
-	struct memblock_region *reg;
-
 #ifndef CONFIG_MMU
+	struct memblock_region *reg;
 	u32 kernel_align_start, kernel_align_size;
 
 	/* Find main memory where is the kernel */
@@ -164,17 +163,6 @@ void __init setup_memory(void)
 	pr_info("%s: max_low_pfn: %#lx\n", __func__, max_low_pfn);
 	pr_info("%s: max_pfn: %#lx\n", __func__, max_pfn);
 
-	/* Add active regions with valid PFNs */
-	for_each_memblock(memory, reg) {
-		unsigned long start_pfn, end_pfn;
-
-		start_pfn = memblock_region_memory_base_pfn(reg);
-		end_pfn = memblock_region_memory_end_pfn(reg);
-		memblock_set_node(start_pfn << PAGE_SHIFT,
-				  (end_pfn - start_pfn) << PAGE_SHIFT,
-				  &memblock.memory, 0);
-	}
-
 	paging_init();
 }
 
_


* [patch 169/181] memblock: make for_each_memblock_type() iterator private
  2020-10-13 23:46 incoming Andrew Morton
                   ` (167 preceding siblings ...)
  2020-10-13 23:57 ` [patch 168/181] microblaze: drop unneeded NUMA and sparsemem initializations Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 170/181] memblock: make memblock_debug and related functionality private Andrew Morton
                   ` (15 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: make for_each_memblock_type() iterator private

for_each_memblock_type() is not used outside mm/memblock.c, so move it
there from include/linux/memblock.h.

Link: https://lkml.kernel.org/r/20200818151634.14343-9-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 include/linux/memblock.h |    5 -----
 mm/memblock.c            |    5 +++++
 2 files changed, 5 insertions(+), 5 deletions(-)

--- a/include/linux/memblock.h~memblock-make-for_each_memblock_type-iterator-private
+++ a/include/linux/memblock.h
@@ -552,11 +552,6 @@ static inline unsigned long memblock_reg
 	     region < (memblock.memblock_type.regions + memblock.memblock_type.cnt);	\
 	     region++)
 
-#define for_each_memblock_type(i, memblock_type, rgn)			\
-	for (i = 0, rgn = &memblock_type->regions[0];			\
-	     i < memblock_type->cnt;					\
-	     i++, rgn = &memblock_type->regions[i])
-
 extern void *alloc_large_system_hash(const char *tablename,
 				     unsigned long bucketsize,
 				     unsigned long numentries,
--- a/mm/memblock.c~memblock-make-for_each_memblock_type-iterator-private
+++ a/mm/memblock.c
@@ -132,6 +132,11 @@ struct memblock_type physmem = {
 };
 #endif
 
+#define for_each_memblock_type(i, memblock_type, rgn)			\
+	for (i = 0, rgn = &memblock_type->regions[0];			\
+	     i < memblock_type->cnt;					\
+	     i++, rgn = &memblock_type->regions[i])
+
 int memblock_debug __initdata_memblock;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
_

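For illustration, how such an iterator macro is used at its call sites, as
a self-contained sketch with hypothetical types mirroring the memblock
ones:

    #include <stdio.h>

    struct rgn { unsigned long base, size; };

    struct rgn_type {
            int cnt;
            struct rgn regions[4];
    };

    #define for_each_rgn_type(i, type, rgn)                 \
            for (i = 0, rgn = &(type)->regions[0];          \
                 i < (type)->cnt;                           \
                 i++, rgn = &(type)->regions[i])

    int main(void)
    {
            struct rgn_type memory = {
                    .cnt = 2,
                    .regions = { { 0x1000, 0x1000 }, { 0x8000, 0x2000 } },
            };
            struct rgn *r;
            int i;

            for_each_rgn_type(i, &memory, r)
                    printf("region %d: %#lx + %#lx\n", i, r->base, r->size);
            return 0;
    }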

* [patch 170/181] memblock: make memblock_debug and related functionality private
  2020-10-13 23:46 incoming Andrew Morton
                   ` (168 preceding siblings ...)
  2020-10-13 23:57 ` [patch 169/181] memblock: make for_each_memblock_type() iterator private Andrew Morton
@ 2020-10-13 23:57 ` Andrew Morton
  2020-10-13 23:57 ` [patch 171/181] memblock: reduce number of parameters in for_each_mem_range() Andrew Morton
                   ` (14 subsequent siblings)
  184 siblings, 0 replies; 330+ messages in thread
From: Andrew Morton @ 2020-10-13 23:57 UTC (permalink / raw)
  To: akpm, benh, bhe, bp, catalin.marinas, dave.hansen, dja, hbathini,
	hch, jcmvbkbc, Jonathan.Cameron, kernel, linux-mm, linux, luto,
	m.szyprowski, miguel.ojeda.sandonis, mingo, mingo, mm-commits,
	monstr, mpe, palmer, paul.walmsley, paulus, peterz, rppt, shorne,
	tglx, torvalds, tsbogend, will, ysato

From: Mike Rapoport <rppt@linux.ibm.com>
Subject: memblock: make memblock_debug and related functionality private

The only user of memblock_dbg() outside memblock was the s390 setup code,
and it is converted to use pr_debug() instead.  This allows us to stop
exposing memblock_debug and memblock_dbg() to the rest of the kernel.

[akpm@linux-foundation.org: make memblock_dbg() safer and neater]
Link: https://lkml.kernel.org/r/20200818151634.14343-10-rppt@kernel.org
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Axtens <dja@axtens.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Emil Renner Berthing <kernel@esmil.dk>
Cc: Hari Bathini <hbathini@linux.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Stafford Horne <shorne@gmail.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/s390/kernel/setup.c |    4 ++--
 include/linux/memblock.h |   12 +-----------
 mm/memblock.c            |   16 ++++++++++++++--
 3 files changed, 17 insertions(+), 15 deletions(-)

--- a/arch/s390/kernel/setup.c~memblock-make-memblock_debug-and-related-functionality-private
+++ a/arch/s390/kernel/setup.c
@@ -776,8 +776,8 @@ static void __init memblock_add_mem_dete
 	unsigned long start, end;
 	int i;
 
-	memblock_dbg("physmem info source: %s (%hhd)\n",
-		     get_mem_info_source(), mem_detect.info_source);
+	pr_debug("physmem info source: %s (%hhd)\n",
+		 get_mem_info_source(), mem_detect.info_source);
 	/* keep memblock lists close to the kernel */
 	memblock_set_bottom_up(true);
 	for_each_mem_detect_block(i, &start, &end) {
--- a/include/linux/memblock.h~memblock-make-memblock_debug-and-related-functionality-private
+++ a/include/linux/memblock.h
@@ -86,7 +86,6 @@ struct memblock {
 };
 
 extern struct memblock memblock;
-extern int memblock_debug;
 
 #ifndef CONFIG_ARCH_KEEP_MEMBLOCK
 #define __init_memblock __meminit
@@ -98,9 +97,6 @@ void m